fetch_openml with mnist_784 uses excessive memory #19774
Uses 3 GB of RAM during execution and then 1.5 GB afterwards. Additional runs make the memory usage go up by 500 MB each time.
The whole dataset has 70k samples of dimension 784, so as float64 it should take about 70,000 × 784 × 8 bytes ≈ 440 MB in memory. I don't understand why the function uses so much memory.
This has caused numerous people to have memory errors in the past:
Comments
It's inherently due to the fact that OpenML uses ARFF (a plain-text data format) for storing datasets, and that we use a pure Python parser. The situation should get much better once OpenML switches to parquet: openml/openml-python#1032, #19669 (comment) |
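For context, a parquet-based fetch would reduce to a single pandas call. This is a hypothetical sketch (the file name is made up, and it requires an extra dependency such as pyarrow or fastparquet):

import pandas as pd

# read_parquet uses a compiled reader, so it avoids building a large
# intermediate pure-Python data structure the way the ARFF parser does
df = pd.read_parquet("mnist_784.parquet")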
I agree, but parquet is going to introduce a new dependency. Maybe we could have a chunked implementation of the Python ARFF parser that progressively stores the results in a pre-allocated numpy array, instead of building a huge pure-Python data structure and converting it to a numpy array at the end? |
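A minimal sketch of that chunked idea, assuming a purely numeric @data section and a known shape (the helper name and parameters here are hypothetical, not scikit-learn API):

import numpy as np

def parse_numeric_arff_rows(lines, n_samples, n_features, chunk_size=10_000):
    # pre-allocate the final array instead of accumulating Python objects
    out = np.empty((n_samples, n_features), dtype=np.float64)
    buffer, start = [], 0
    for line in lines:
        buffer.append(line.rstrip("\n").split(","))
        if len(buffer) == chunk_size:
            # numpy converts the numeric strings directly into the target slice,
            # so peak memory is one chunk of Python objects plus the output array
            out[start:start + len(buffer)] = np.asarray(buffer, dtype=np.float64)
            start += len(buffer)
            buffer = []
    if buffer:
        out[start:start + len(buffer)] = np.asarray(buffer, dtype=np.float64)
    return out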
This sounds like a memory leak... |
On the other hand, pure Python ARFF parsing is so slow that it's really a pain to use on medium-sized datasets such as MNIST... so maybe the dependency on parquet is worth it in retrospect. |
I don't think there is a real leak:

from sklearn.datasets import fetch_openml
import psutil
import gc

for i in range(5):
    gc.collect()
    mem = psutil.Process().memory_info().rss
    print(f"Iteration {i:02d}: {mem / 1e6:.1f} MB")
    fetch_openml(name="mnist_784")
It's probably the Python allocator that is a bit eager to keep recently allocated memory but tends to reuse it after a while. |
Thanks for trying to reproduce. I tried your code as well, with more iterations. Here are the results on my M1 chip with Rosetta.
During execution, real memory peaks at 14 GB. I agree there is no memory leak, but now I'm concerned about the memory usage of Python executed through Rosetta. |
Caching the results once they are parsed could also be a partial solution (#14855). | You can do that on the user side with joblib.Memory:
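A minimal sketch of that user-side caching (the cache directory is an arbitrary choice):

from joblib import Memory
from sklearn.datasets import fetch_openml

memory = Memory(location="./openml_cache", verbose=0)
fetch_openml_cached = memory.cache(fetch_openml)

# the first call pays the full ARFF parsing cost; subsequent calls
# load the cached result from disk instead of re-parsing
bunch = fetch_openml_cached(name="mnist_784")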
The high memory usage only happens for |
I rewrote the ARFF reader using pandas; it is much faster and avoids some memory copies. I have a similar benchmark.
The rough implementation is below:

# %%
def _strip_quotes(string):
    for quotes_char in ["'", '"']:
        if string.startswith(quotes_char) and string.endswith(quotes_char):
            string = string[1:-1]
    return string

# %%
def _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes):
    import pandas as pd

    dtypes = {}
    for feature_name in df.columns:
        pd_dtype = df[feature_name].dtype
        arff_dtype = arff_dtypes[feature_name]
        if arff_dtype.lower() in ("numeric", "real", "integer"):
            # pandas will properly parse numerical values
            dtypes[feature_name] = pd_dtype
        elif arff_dtype.startswith("{") and arff_dtype.endswith("}"):
            # nominal attribute, e.g. {a,b,c}: map it to a categorical dtype
            categories = arff_dtype[1:-1].split(",")
            categories = [_strip_quotes(category) for category in categories]
            if pd_dtype.kind == "i":
                categories = [int(category) for category in categories]
            elif pd_dtype.kind == "f":
                categories = [float(category) for category in categories]
            dtypes[feature_name] = pd.CategoricalDtype(categories)
        else:
            dtypes[feature_name] = pd_dtype
    return dtypes

# %%
def arff_reader_via_pandas(filename):
    import pandas as pd

    # scan the header to collect column names and ARFF dtypes, and find the
    # line where the @data section starts; assumes declarations of the form
    # "@attribute name type" with no spaces inside the type
    line_tag_data = 0
    columns = []
    arff_dtypes = {}
    with open(filename, "r") as f:
        for idx_line, line in enumerate(f):
            if line.lower().startswith("@attribute"):
                _, feature_name, feature_type = line.split()
                feature_name = _strip_quotes(feature_name)
                columns.append(feature_name)
                arff_dtypes[feature_name] = feature_type
            if line.lower().startswith("@data"):
                line_tag_data = idx_line
                break

    # let pandas parse the data section as CSV, skipping past the header;
    # "?" is the ARFF marker for missing values
    df = pd.read_csv(
        filename,
        skiprows=line_tag_data + 1,
        header=None,
        na_values=["?"],
    )
    df.columns = columns
    dtypes = _map_arff_dtypes_to_numpy_dtypes(df, arff_dtypes)
    df = df.astype(dtypes)
    return df |
This implementation is supposed to convert the features to categorical, object, and the right numerical data types:

df = arff_reader_via_pandas("titanic.arff")
df.info()
Currently, we would get: |
So the difference is that it can detect integers instead of reading all numerical values as floats? |
exactly |
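To illustrate the point on a toy column (an editorial sketch, not from the thread):

import io
import pandas as pd

# pandas infers an integer dtype when all parsed values are integral...
col = pd.read_csv(io.StringIO("1\n2\n3\n"), header=None)[0]
print(col.dtype)  # int64

# ...whereas, as discussed above, the vendored pure-Python ARFF parser reads
# all NUMERIC/INTEGER attributes as floats, so the same column comes back
# as float64 there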
I think this is really cool. +1 for a pandas-based ARFF parser while keeping the one we already have vendored. Not sure how to deal with the discrepancy in the numerical type inference, since the ARFF headers are not explicit enough. Maybe it's fine. I am not 100% sure we should rely on pandas to parse ARFF files when |
So we're gonna have a PR with your implementation in it? 😁 Looks pretty good to me. |
@glemaitre just opened a draft PR under #21938. |