
[MRG] Fix missing 'const' in a few memoryview declaration in trees. #13626


Merged 6 commits on Apr 16, 2019

Member

@jeremiedbb jeremiedbb commented Apr 12, 2019

Memoryviews were introduced in the tree code in #12886, but a few declarations are missing the const keyword.
Without const, a memoryview cannot be built on a read-only buffer.
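
To illustrate why the missing const matters, here is a minimal Python sketch (not the Cython code touched by this PR). A Cython memoryview declared without const requests a writable buffer, so a read-only array is rejected. The two helper functions below are hypothetical stand-ins for that behaviour, and the error message is only meant to resemble the one seen in practice.

```python
import numpy as np


def accepts_only_writable(arr):
    """Hypothetical stand-in for a non-const memoryview argument."""
    view = memoryview(arr)
    if view.readonly:
        # Assumption: mimics the error raised when a writable buffer is requested.
        raise ValueError("buffer source array is read-only")
    return view.nbytes


def accepts_readonly(arr):
    """Hypothetical stand-in for a const memoryview argument: it only reads."""
    return memoryview(arr).nbytes


X = np.arange(6, dtype=np.float32)
X.setflags(write=False)  # the same flag a read-only memmap carries

readonly_ok = accepts_readonly(X)  # works on read-only data

try:
    accepts_only_writable(X)
    writable_failed = False
except ValueError:
    writable_failed = True
```

Here `accepts_readonly` succeeds while `accepts_only_writable` raises, which is the situation the const keyword is meant to avoid for code that only reads its input.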

A typical use case is doing cross validation on a RandomForest:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = np.random.random_sample((10000, 1000))
y = np.random.randint(2, size=10000)
rf = RandomForestClassifier(n_jobs=-1)

cross_val_score(rf, X, y)

Here X is larger than 1 MB, which means it gets mem-mapped by joblib in cross_val_score. This code breaks on master.
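
A rough sketch of what that means for the data a worker receives, using NumPy only. It does not call joblib: the 1 MiB threshold is an assumption based on joblib's default `max_nbytes='1M'`, and `np.save` followed by `np.load(..., mmap_mode="r")` is only an approximation of the dump-and-reload step.

```python
import os
import tempfile

import numpy as np

# Size of X from the example above, computed without allocating it.
n_bytes = 10000 * 1000 * np.dtype(np.float64).itemsize
threshold = 1024 ** 2  # assumption: joblib's default max_nbytes='1M'
will_be_memmapped = n_bytes > threshold

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "X.npy")
    np.save(path, np.random.random_sample((100, 10)))

    # Reload the array as a read-only memory map, as a worker would see it.
    X_worker = np.load(path, mmap_mode="r")
    is_memmap = isinstance(X_worker, np.memmap)
    is_writeable = X_worker.flags.writeable

    del X_worker  # release the mapping before the directory is removed
```

The array seen by the worker is an `np.memmap` that is not writeable, so any code that asks for a writable buffer on it will fail.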

What's happening is that for cross_val_score, the joblib backend is the sequential backend (as expected), but for the random forest it's the loky backend, which ignores prefer='threads'. So even though this PR fixes the bug in sklearn, there also seems to be a bug in joblib. @ogrisel

@adrinjalali
Member

LGTM, except I guess adding your example as a test wouldn't hurt.

@jeremiedbb
Member Author

I added a test. It does not involve cross_val_score, only a mem-mapped X.

@adrinjalali
Member

Fails on Windows, interesting!

# check that random forest supports read-only buffer (#13626)
X_orig = np.random.RandomState(0).random_sample((10, 2)).astype(np.float32)

with NamedTemporaryFile() as tmp:
Member

Hmm

with NamedTemporaryFile(prefix="sklearn-test", suffix=".gz") as tmp:
    tmp.close()  # necessary under windows
    with open(datafile, "rb") as f:
        with gzip.open(tmp.name, "wb") as fh_out:
            shutil.copyfileobj(f, fh_out)

X_mmap = np.memmap(tmp.name, dtype='float32', mode='r', shape=(10, 2))
y = np.zeros(10)

RandomForestClassifier(n_estimators=2).fit(X_mmap, y)
Member

I think the test could be under sklearn/tree/tests/test_tree.py, and test a DecisionTreeRegressor instead. It kinda feels like that's a more natural place for the test since it's actually testing the splitter.

Member Author

I moved the test and cleaned it up, since we actually have a helper to test on memmap arrays.

@jnothman jnothman added this to the 0.21 milestone Apr 15, 2019
Member

@adrinjalali adrinjalali left a comment

LGTM

@jeremiedbb
Member Author

I wondered why the common test check_classifiers_train(readonly_memmap=True) passes on master. It tests on float64 data, but the tree requires float32, so the data is copied and is no longer a memmap...
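
The copy described above can be reproduced with NumPy alone. This is a rough sketch: `np.asarray(..., dtype=np.float32)` stands in for the input validation done before fitting (an assumption, the real validation does more), and `summarize` is a hypothetical helper.

```python
import os
import tempfile

import numpy as np


def summarize(arr):
    """Convert to float32 and report whether the result is a fresh, writable copy."""
    converted = np.asarray(arr, dtype=np.float32)  # assumed stand-in for validation
    return {
        "copied": not np.shares_memory(converted, arr),
        "writeable": bool(converted.flags.writeable),
    }


results = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    for dtype in (np.float64, np.float32):
        name = np.dtype(dtype).name
        path = os.path.join(tmp_dir, "X_" + name + ".dat")
        np.random.RandomState(0).random_sample((10, 2)).astype(dtype).tofile(path)

        # Read-only memory map, as in the test discussed in this PR.
        X_mmap = np.memmap(path, dtype=dtype, mode="r", shape=(10, 2))
        results[name] = summarize(X_mmap)
        del X_mmap  # release the mapping before the directory is removed
```

With float64 input the conversion produces a writable in-memory copy, so the read-only code path is never exercised. With float32 input no copy is made and the data stays read-only, which is the case that triggers the bug.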

@jnothman
Member

jnothman commented Apr 15, 2019 via email

@jnothman
Member

jnothman commented Apr 15, 2019 via email

@jeremiedbb
Member Author

Should we run the common test with both dtypes?
Or perhaps there should be an estimator tag specifying what format (dtype, order) is non-copying for some estimator

I'd prefer the second option, since the common tests are already quite long, but I think it's out of scope for this PR.

@thomasjpfan thomasjpfan merged commit 5bc3edc into scikit-learn:master Apr 16, 2019
@thomasjpfan
Member

Thank you! @jeremiedbb

jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Apr 25, 2019
xhluca pushed a commit to xhluca/scikit-learn that referenced this pull request Apr 28, 2019
koenvandevelde pushed a commit to koenvandevelde/scikit-learn that referenced this pull request Jul 12, 2019
4 participants