fetch_openml with mnist_784 uses excessive memory #19774


Closed
louisabraham opened this issue Mar 27, 2021 · 16 comments · Fixed by #21938

Comments

@louisabraham

louisabraham commented Mar 27, 2021

from sklearn.datasets import fetch_openml
fetch_openml(name="mnist_784")

Uses 3GB of RAM during execution and then 1.5 GB. Additional runs make the memory usage go up by 500 MB each time.

The whole dataset has 70k samples of dimension 784; it should take about 500 MB in memory. I don't understand why the function uses so much memory.
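As a sanity check on that estimate, 70,000 rows of 784 float64 features come out just under 500 MB:

```python
# back-of-the-envelope memory estimate for MNIST stored as float64
n_samples, n_features = 70_000, 784
n_bytes = n_samples * n_features * 8  # 8 bytes per float64 value
print(f"{n_bytes / 1e6:.0f} MB")  # → 439 MB
```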

This has caused numerous people to hit memory errors in the past.

@rth
Member

rth commented Mar 27, 2021

It's inherently due to the fact that OpenML uses ARFF (a plain text data format) for storing datasets, and that we use a pure Python parser. The situation should get much better once OpenML switches to parquet openml/openml-python#1032 #19669 (comment)

@ogrisel
Member

ogrisel commented Mar 30, 2021

It's inherently due to the fact that OpenML uses ARFF (a plain text data format) for storing datasets, and that we use a pure Python parser. The situation should get much better once OpenML switches to parquet openml/openml-python#1032 #19669 (comment)

I agree, but parquet is going to introduce a new dependency. Maybe we could have a chunked implementation of the Python ARFF parser that progressively stores the results in a pre-allocated numpy array instead of building a huge pure-Python data structure and converting it to a numpy array at the end?
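A rough sketch of that chunked idea (hypothetical helper, not scikit-learn code; assumes purely numeric, comma-separated data rows and a known sample count):

```python
import numpy as np


def parse_numeric_rows_chunked(lines, n_samples, n_features, chunk_size=1000):
    # pre-allocate the final array instead of accumulating one huge
    # pure-Python structure and converting it to numpy at the end
    out = np.empty((n_samples, n_features), dtype=np.float64)
    buffer, start = [], 0
    for line in lines:
        buffer.append([float(value) for value in line.split(",")])
        if len(buffer) == chunk_size:
            out[start:start + len(buffer)] = buffer
            start += len(buffer)
            buffer.clear()
    if buffer:  # flush the last partial chunk
        out[start:start + len(buffer)] = buffer
    return out
```

Peak Python-object overhead is then bounded by one chunk of rows rather than the whole dataset.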

@ogrisel
Member

ogrisel commented Mar 30, 2021

Uses 3GB of RAM during execution and then 1.5 GB. Additional runs make the memory usage go up by 500 MB each time.

This sounds like a memory leak...

@ogrisel
Member

ogrisel commented Mar 30, 2021

I agree, but parquet is going to introduce a new dependency. Maybe we could have a chunked implementation of the Python ARFF parser that progressively stores the results in a pre-allocated numpy array instead of building a huge pure-Python data structure and converting it to a numpy array at the end?

On the other hand, pure python ARFF parsing is so slow that it's really a pain to use on medium sized datasets such as MNIST... so maybe the dependency on parquet is worth it in retrospect.

@ogrisel
Member

ogrisel commented Mar 30, 2021

I don't think there is a real leak:

from sklearn.datasets import fetch_openml
import psutil
import gc


for i in range(5):
    gc.collect()
    mem = psutil.Process().memory_info().rss
    print(f"Iteration {i:02d}: {mem / 1e6:.1f} MB")
    fetch_openml(name="mnist_784")
% python ~/tmp/repro_fetch_openml_leak.py
Iteration 00: 77.2 MB
Iteration 01: 883.2 MB
Iteration 02: 1092.4 MB
Iteration 03: 1223.3 MB
Iteration 04: 1244.5 MB

It's probably the python allocator that is a bit eager to keep recently allocated memory but tends to reuse it after a while.

@louisabraham
Author

louisabraham commented Mar 30, 2021

Thanks for trying to reproduce

I tried your code as well, with more iterations. Here are the results on my M1 chip under Rosetta.

Iteration 00: 389.3 MB
Iteration 01: 4405.1 MB
Iteration 02: 5113.9 MB
Iteration 03: 5271.9 MB
Iteration 04: 5117.9 MB
Iteration 05: 5524.7 MB
Iteration 06: 5267.1 MB
Iteration 07: 5845.4 MB
Iteration 08: 5366.4 MB
Iteration 09: 5836.7 MB

During execution, real memory peaks at 14 GB. I agree there is no memory leak, but now I'm concerned about the memory usage of Python executed through Rosetta.

@rth
Member

rth commented Mar 30, 2021

Caching the results once they are parsed could also be a partial solution #14855

You can do that on the user side with joblib.Memory
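For example, with `joblib.Memory` the parsed result is memoized on disk, so only the first call pays the download-and-parse cost (illustrated with a stand-in loader rather than a real network call; in practice you would wrap `fetch_openml` itself):

```python
from joblib import Memory

# on-disk cache directory (hypothetical path)
memory = Memory("./openml_cache", verbose=0)

n_calls = []


def expensive_load(name):
    # stand-in for fetch_openml(name=name); the real call downloads
    # and parses the ARFF file, which is the expensive part
    n_calls.append(name)
    return {"name": name, "n_samples": 70_000}


cached_load = memory.cache(expensive_load)
first = cached_load("mnist_784")
second = cached_load("mnist_784")  # served from the on-disk cache
```

With the real function it is simply `cached_fetch = memory.cache(fetch_openml)`.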

@lesteve
Member

lesteve commented Dec 8, 2021

The high memory-usage only happens for as_frame=True and not as_frame=False, so there is likely some inefficiency in the code creating the dataframe. In the snippet below as_frame=True uses ~3.5GB and as_frame=False uses ~800MB.

In [1]: from sklearn.datasets import fetch_openml
   ...: 
   ...: %load_ext memory_profiler
   ...: print('as_frame=True')
   ...: %memit fetch_openml(name="mnist_784", as_frame=True)
   ...: 
   ...: print('as_frame=False')
   ...: %memit fetch_openml(name="mnist_784", as_frame=False)
   ...: 
as_frame=True
peak memory: 3572.36 MiB, increment: 3477.56 MiB
as_frame=False
peak memory: 1396.97 MiB, increment: 783.14 MiB

@glemaitre
Member

glemaitre commented Dec 9, 2021

I rewrote the ARFF reader using pandas; it is much faster and avoids some memory copies. I get a benchmark similar to as_frame=False while still getting a dataframe:

In [8]:  %memit arff_reader_via_pandas("mnist_784.arff")
peak memory: 1212.86 MiB, increment: 483.52 MiB

The rough implementation is below:

# %%
def _strip_quotes(string):
    # remove a single level of surrounding single or double quotes
    for quotes_char in ["'", '"']:
        if string.startswith(quotes_char) and string.endswith(quotes_char):
            string = string[1:-1]
    return string


# %%
def _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes):
    import pandas as pd

    dtypes = {}
    for feature_name in df.columns:
        pd_dtype = df[feature_name].dtype
        arff_dtype = arff_dtypes[feature_name]
        if arff_dtype.lower() in ("numeric", "real", "integer"):
            # pandas will properly parse numerical values
            dtypes[feature_name] = pd_dtype
        elif arff_dtype.startswith("{") and arff_dtype.endswith("}"):
            # nominal attribute, e.g. {a,b,c}: map to a categorical dtype
            categories = arff_dtype[1:-1].split(",")
            categories = [_strip_quotes(category) for category in categories]
            if pd_dtype.kind == "i":
                categories = [int(category) for category in categories]
            elif pd_dtype.kind == "f":
                categories = [float(category) for category in categories]
            dtypes[feature_name] = pd.CategoricalDtype(categories)
        else:
            dtypes[feature_name] = pd_dtype
    return dtypes


# %%
def _parse_arff_header(filename):
    # collect column names and ARFF dtypes from the @attribute lines and
    # record the line index where the @data section starts
    line_tag_data = 0
    columns = []
    arff_dtypes = {}
    with open(filename, "r") as f:
        for idx_line, line in enumerate(f):
            if line.lower().startswith("@attribute"):
                _, feature_name, feature_type = line.split()
                feature_name = _strip_quotes(feature_name)
                columns.append(feature_name)
                arff_dtypes[feature_name] = feature_type
            if line.lower().startswith("@data"):
                line_tag_data = idx_line
                break
    return columns, arff_dtypes, line_tag_data


# %%
def arff_reader_via_pandas(filename):
    import pandas as pd

    columns, arff_dtypes, line_tag_data = _parse_arff_header(filename)
    df = pd.read_csv(
        filename,
        skiprows=line_tag_data + 1,
        header=None,
        na_values=["?"],
    )
    df.columns = columns
    dtypes = _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes)
    df = df.astype(dtypes)
    return df
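The core trick in isolation, on a toy in-memory ARFF payload (toy data, not from the thread): skip past the @data marker and let pandas' CSV reader infer the dtypes, then cast nominal attributes to categoricals:

```python
import io

import pandas as pd

arff_text = """@relation toy
@attribute age numeric
@attribute sex {male,female}
@data
22,male
38,female
26,male
"""

lines = arff_text.splitlines()
# locate the @data marker, as the header-parsing loop above does
n_header = next(
    i for i, line in enumerate(lines) if line.lower().startswith("@data")
) + 1

df = pd.read_csv(
    io.StringIO("\n".join(lines[n_header:])),
    header=None,
    names=["age", "sex"],
    na_values=["?"],
)
# nominal ARFF attributes map naturally to pandas categoricals
df["sex"] = df["sex"].astype(pd.CategoricalDtype(["male", "female"]))
```

Here `age` comes out as int64 because pandas sees only integers and no missing values.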

@glemaitre
Member

glemaitre commented Dec 9, 2021

This implementation is supposed to convert the features to categorical, object, and the right numerical data types:

df = arff_reader_via_pandas("titanic.arff")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   int64   
 1   survived   1309 non-null   category
 2   name       1309 non-null   object  
 3   sex        1309 non-null   category
 4   age        1046 non-null   float64 
 5   sibsp      1309 non-null   int64   
 6   parch      1309 non-null   int64   
 7   ticket     1309 non-null   object  
 8   fare       1308 non-null   float64 
 9   cabin      295 non-null    object  
 10  embarked   1307 non-null   category
 11  boat       486 non-null    object  
 12  body       121 non-null    float64 
 13  home.dest  745 non-null    object  
dtypes: category(3), float64(3), int64(3), object(5)
memory usage: 116.8+ KB

Currently, we would get:

df = fetch_openml("titanic", version=1)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype   
---  ------     --------------  -----   
 0   pclass     1309 non-null   float64 
 1   survived   1309 non-null   category
 2   name       1309 non-null   object  
 3   sex        1309 non-null   category
 4   age        1046 non-null   float64 
 5   sibsp      1309 non-null   float64 
 6   parch      1309 non-null   float64 
 7   ticket     1309 non-null   object  
 8   fare       1308 non-null   float64 
 9   cabin      295 non-null    object  
 10  embarked   1307 non-null   category
 11  boat       486 non-null    object  
 12  body       121 non-null    float64 
 13  home.dest  745 non-null    object  
dtypes: category(3), float64(6), object(5)
memory usage: 116.8+ KB

@glemaitre
Member

glemaitre commented Dec 9, 2021

Uhm, I still had a bug with the category column casting, though. I just forgot to reassign my data frame after casting the dtypes.

@ogrisel
Member

ogrisel commented Dec 9, 2021

So the difference is that it can detect integers instead of reading all numerical values as floats?

@glemaitre
Member

So the difference is that it can detect integers instead of reading all numerical values as floats?

exactly
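The contrast in a nutshell (toy illustration, not the actual parsers):

```python
import io

import pandas as pd

# pandas infers an integer dtype when a numeric column has no missing values
df = pd.read_csv(io.StringIO("pclass\n1\n3\n2\n"))
print(df["pclass"].dtype)  # int64

# the current pure-Python path reads every NUMERIC attribute as Python float,
# which is why pclass comes back as float64 from fetch_openml today
values = [float(v) for v in ["1", "3", "2"]]  # [1.0, 3.0, 2.0]
```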

@ogrisel
Member

ogrisel commented Dec 9, 2021

I think this is really cool. +1 for a pandas-based ARFF parser while keeping the one we already have vendored ("liac-arff") as a fallback option in case pandas is not installed.

Not sure how to deal with the discrepancy in numerical type inference, since the ARFF headers are not explicit enough. Maybe it's fine.

I am not 100% sure we should rely on pandas to parse ARFF files when as_frame=False and the data is dense. Using the pandas parser (if pandas is installed) is likely to be way more efficient than the "liac-arff" parser, so we should at least add a new option to make it possible to select the pandas parser in this case.

@adrinjalali
Member

So we're gonna have a PR with your implementation in it? 😁 looks pretty good to me.

@ogrisel
Member

ogrisel commented Dec 10, 2021

@glemaitre just opened a draft PR under #21938.
