fetch_openml with mnist_784 uses excessive memory #19774


Closed
louisabraham opened this issue Mar 27, 2021 · 16 comments · Fixed by #21938

Comments

@louisabraham

louisabraham commented Mar 27, 2021

from sklearn.datasets import fetch_openml
fetch_openml(name="mnist_784")

Uses 3GB of RAM during execution and then 1.5 GB. Additional runs make the memory usage go up by 500 MB each time.

The whole dataset has 70k samples of dimension 784; it should take about 500 MB in memory. I don't understand why the function uses so much memory.
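As a sanity check on that estimate, 70,000 rows of 784 float64 features come out just under 500 MB:

```python
# back-of-the-envelope memory estimate for MNIST stored as float64
n_samples, n_features = 70_000, 784
n_bytes = n_samples * n_features * 8  # 8 bytes per float64 value
print(f"{n_bytes / 1e6:.0f} MB")  # → 439 MB
```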

This has caused numerous people to hit memory errors in the past.

@rth
Member

rth commented Mar 27, 2021

It's inherently due to the fact that OpenML uses ARFF (a plain text data format) for storing datasets, and that we use a pure Python parser. The situation should get much better once OpenML switches to parquet openml/openml-python#1032 #19669 (comment)

@ogrisel
Member

ogrisel commented Mar 30, 2021

It's inherently due to the fact that OpenML uses ARFF (a plain text data format) for storing datasets, and that we use a pure Python parser. The situation should get much better once OpenML switches to parquet openml/openml-python#1032 #19669 (comment)

I agree, but parquet is going to introduce a new dependency. Maybe we could have a chunked implementation of the Python ARFF parser that progressively stores the results in a pre-allocated numpy array instead of building a huge pure-Python data structure and converting it to a numpy array at the end?
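A rough sketch of that chunked idea (hypothetical helper, not scikit-learn code; assumes purely numeric, comma-separated data rows and a known sample count):

```python
import numpy as np


def parse_numeric_rows_chunked(lines, n_samples, n_features, chunk_size=1000):
    # pre-allocate the final array instead of accumulating one huge
    # pure-Python structure and converting it to numpy at the end
    out = np.empty((n_samples, n_features), dtype=np.float64)
    buffer, start = [], 0
    for line in lines:
        buffer.append([float(value) for value in line.split(",")])
        if len(buffer) == chunk_size:
            out[start:start + len(buffer)] = buffer
            start += len(buffer)
            buffer.clear()
    if buffer:  # flush the last partial chunk
        out[start:start + len(buffer)] = buffer
    return out
```

Peak Python-object overhead is then bounded by one chunk of rows rather than the whole dataset.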

@ogrisel
Member

ogrisel commented Mar 30, 2021

Uses 3GB of RAM during execution and then 1.5 GB. Additional runs make the memory usage go up by 500 MB each time.

This sounds like a memory leak...

@ogrisel
Member

ogrisel commented Mar 30, 2021

I agree, but parquet is going to introduce a new dependency. Maybe we could have a chunked implementation of the Python ARFF parser that progressively stores the results in a pre-allocated numpy array instead of building a huge pure-Python data structure and converting it to a numpy array at the end?

On the other hand, pure python ARFF parsing is so slow that it's really a pain to use on medium sized datasets such as MNIST... so maybe the dependency on parquet is worth it in retrospect.

@ogrisel
Member

ogrisel commented Mar 30, 2021

I don't think there is a real leak:

from sklearn.datasets import fetch_openml
import psutil
import gc


for i in range(5):
    gc.collect()
    mem = psutil.Process().memory_info().rss
    print(f"Iteration {i:02d}: {mem / 1e6:.1f} MB")
    fetch_openml(name="mnist_784")
% python ~/tmp/repro_fetch_openml_leak.py
Iteration 00: 77.2 MB
Iteration 01: 883.2 MB
Iteration 02: 1092.4 MB
Iteration 03: 1223.3 MB
Iteration 04: 1244.5 MB

It's probably the python allocator that is a bit eager to keep recently allocated memory but tends to reuse it after a while.

@louisabraham
Author

louisabraham commented Mar 30, 2021

Thanks for trying to reproduce

I tried your code as well, with more iterations. Here are the results on my M1 chip under Rosetta.

Iteration 00: 389.3 MB
Iteration 01: 4405.1 MB
Iteration 02: 5113.9 MB
Iteration 03: 5271.9 MB
Iteration 04: 5117.9 MB
Iteration 05: 5524.7 MB
Iteration 06: 5267.1 MB
Iteration 07: 5845.4 MB
Iteration 08: 5366.4 MB
Iteration 09: 5836.7 MB

During execution, real memory peaks at 14 GB. I agree there is no memory leak, but now I'm concerned about the memory usage of Python executed through Rosetta.

@rth
Member

rth commented Mar 30, 2021

Caching the results once they are parsed could also be a partial solution #14855

You can do that on the user side with joblib.Memory
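For example, with `joblib.Memory` the parsed result is memoized on disk, so only the first call pays the download-and-parse cost (illustrated with a stand-in loader rather than a real network call; in practice you would wrap `fetch_openml` itself):

```python
from joblib import Memory

# on-disk cache directory (hypothetical path)
memory = Memory("./openml_cache", verbose=0)

n_calls = []


def expensive_load(name):
    # stand-in for fetch_openml(name=name); the real call downloads
    # and parses the ARFF file, which is the expensive part
    n_calls.append(name)
    return {"name": name, "n_samples": 70_000}


cached_load = memory.cache(expensive_load)
first = cached_load("mnist_784")
second = cached_load("mnist_784")  # served from the on-disk cache
```

With the real function it is simply `cached_fetch = memory.cache(fetch_openml)`.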

@lesteve
Member

lesteve commented Dec 8, 2021

The high memory-usage only happens for as_frame=True and not as_frame=False, so there is likely some inefficiency in the code creating the dataframe. In the snippet below as_frame=True uses ~3.5GB and as_frame=False uses ~800MB.

In [1]: from sklearn.datasets import fetch_openml
   ...: 
   ...: %load_ext memory_profiler
   ...: print('as_frame=True')
   ...: %memit fetch_openml(name="mnist_784", as_frame=True)
   ...: 
   ...: print('as_frame=False')
   ...: %memit fetch_openml(name="mnist_784", as_frame=False)
   ...: 
as_frame=True
peak memory: 3572.36 MiB, increment: 3477.56 MiB
as_frame=False
peak memory: 1396.97 MiB, increment: 783.14 MiB

@glemaitre
Member

glemaitre commented Dec 9, 2021

I rewrote the ARFF reader using pandas; it is much faster and avoids some memory copies. I get a benchmark similar to as_frame=False while still getting a dataframe:

In [8]:  %memit arff_reader_via_pandas("mnist_784.arff")
peak memory: 1212.86 MiB, increment: 483.52 MiB

The rough implementation is below:

# %%
def _strip_quotes(string):
    # remove a single level of surrounding single or double quotes
    for quotes_char in ["'", '"']:
        if string.startswith(quotes_char) and string.endswith(quotes_char):
            string = string[1:-1]
    return string


# %%
def _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes):
    import pandas as pd

    dtypes = {}
    for feature_name in df.columns:
        pd_dtype = df[feature_name].dtype
        arff_dtype = arff_dtypes[feature_name]
        if arff_dtype.lower() in ("numeric", "real", "integer"):
            # pandas will properly parse numerical values
            dtypes[feature_name] = pd_dtype
        elif arff_dtype.startswith("{") and arff_dtype.endswith("}"):
            # nominal attribute, e.g. {a,b,c}: map to a categorical dtype
            categories = arff_dtype[1:-1].split(",")
            categories = [_strip_quotes(category) for category in categories]
            if pd_dtype.kind == "i":
                categories = [int(category) for category in categories]
            elif pd_dtype.kind == "f":
                categories = [float(category) for category in categories]
            dtypes[feature_name] = pd.CategoricalDtype(categories)
        else:
            dtypes[feature_name] = pd_dtype
    return dtypes


# %%
def _parse_arff_header(filename):
    # collect column names and ARFF dtypes from the @attribute lines and
    # record the line index where the @data section starts
    line_tag_data = 0
    columns = []
    arff_dtypes = {}
    with open(filename, "r") as f:
        for idx_line, line in enumerate(f):
            if line.lower().startswith("@attribute"):
                _, feature_name, feature_type = line.split()
                feature_name = _strip_quotes(feature_name)
                columns.append(feature_name)
                arff_dtypes[feature_name] = feature_type
            if line.lower().startswith("@data"):
                line_tag_data = idx_line
                break
    return columns, arff_dtypes, line_tag_data


# %%
def arff_reader_via_pandas(filename):
    import pandas as pd

    columns, arff_dtypes, line_tag_data = _parse_arff_header(filename)
    df = pd.read_csv(
        filename,
        skiprows=line_tag_data + 1,
        header=None,
        na_values=["?"],
    )
    df.columns = columns
    dtypes = _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes)
    df = df.astype(dtypes)
    return df
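The core trick in isolation, on a toy in-memory ARFF payload (toy data, not from the thread): skip past the @data marker and let pandas' CSV reader infer the dtypes, then cast nominal attributes to categoricals:

```python
import io

import pandas as pd

arff_text = """@relation toy
@attribute age numeric
@attribute sex {male,female}
@data
22,male
38,female
26,male
"""

lines = arff_text.splitlines()
# locate the @data marker, as the header-parsing loop above does
n_header = next(
    i for i, line in enumerate(lines) if line.lower().startswith("@data")
) + 1

df = pd.read_csv(
    io.StringIO("\n".join(lines[n_header:])),
    header=None,
    names=["age", "sex"],
    na_values=["?"],
)
# nominal ARFF attributes map naturally to pandas categoricals
df["sex"] = df["sex"].astype(pd.CategoricalDtype(["male", "female"]))
```

Here `age` comes out as int64 because pandas sees only integers and no missing values.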

@glemaitre
Member

glemaitre commented Dec 9, 2021

This implementation is supposed to convert the features to categorical, object, and the right numerical data types:

df = arff_reader_via_pandas("titanic.arff")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   int64   
 1   survived   1309 non-null   category
 2   name       1309 non-null   object  
 3   sex        1309 non-null   category
 4   age        1046 non-null   float64 
 5   sibsp      1309 non-null   int64   
 6   parch      1309 non-null   int64   
 7   ticket     1309 non-null   object  
 8   fare       1308 non-null   float64 
 9   cabin      295 non-null    object  
 10  embarked   1307 non-null   category
 11  boat       486 non-null    object  
 12  body       121 non-null    float64 
 13  home.dest  745 non-null    object  
dtypes: category(3), float64(3), int64(3), object(5)
memory usage: 116.8+ KB

Currently, we would get:

df = fetch_openml("titanic", version=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   float64 
 1   survived   1309 non-null   category
 2   name       1309 non-null   object  
 3   sex        1309 non-null   category
 4   age        1046 non-null   float64 
 5   sibsp      1309 non-null   float64 
 6   parch      1309 non-null   float64 
 7   ticket     1309 non-null   object  
 8   fare       1308 non-null   float64 
 9   cabin      295 non-null    object  
 10  embarked   1307 non-null   category
 11  boat       486 non-null    object  
 12  body       121 non-null    float64 
 13  home.dest  745 non-null    object  
dtypes: category(3), float64(6), object(5)
memory usage: 116.8+ KB

@glemaitre
Member

glemaitre commented Dec 9, 2021

Uhm, I still had a bug with the category column casting, though. I just forgot to reassign my data frame after casting the dtypes.

@ogrisel
Member

ogrisel commented Dec 9, 2021

So the difference is that it can detect integers instead of reading all numerical values as floats?

@glemaitre
Member

So the difference is that it can detect integers instead of reading all numerical values as floats?

exactly
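The contrast in a nutshell (toy illustration, not the actual parsers):

```python
import io

import pandas as pd

# pandas infers an integer dtype when a numeric column has no missing values
df = pd.read_csv(io.StringIO("pclass\n1\n3\n2\n"))
print(df["pclass"].dtype)  # int64

# the current pure-Python path reads every NUMERIC attribute as Python float,
# which is why pclass comes back as float64 from fetch_openml today
values = [float(v) for v in ["1", "3", "2"]]  # [1.0, 3.0, 2.0]
```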

@ogrisel
Member

ogrisel commented Dec 9, 2021

I think this is really cool. +1 for a pandas-based ARFF parser while keeping the one we already have vendored ("liac-arff") as a fallback option in case pandas is not installed.

Not sure how to deal with the discrepancy in numerical type inference, since the ARFF headers are not explicit enough. Maybe it's fine.

I am not 100% sure we should rely on pandas to parse ARFF files when as_frame=False and the data is dense. Using the pandas parser (if pandas is installed) is likely to be way more efficient than the "liac-arff" parser, so we should at least add a new option to make it possible to select the pandas parser in this case.

@adrinjalali
Member

So we're gonna have a PR with your implementation in it? 😁 looks pretty good to me.

@ogrisel
Member

ogrisel commented Dec 10, 2021

@glemaitre just opened a draft PR under #21938.
