Skip to content

[MRG+1] Read-only memmap input data in common tests #10663

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 23, 2018

Conversation

lesteve
Copy link
Member

@lesteve lesteve commented Feb 20, 2018

This is an attempt to revive #4807 (a rebase was deemed too painful). Closes #4807, closes #5507, closes #5481.

Note that at the moment I am not able to reproduce the problems that were listed in #5481. Note that no errors can be seen on the last Travis run for the Python 3 build from #4807. On the other hand I have a segmentation fault with LinearSVR, which I have not looked at yet (opened #10667 to track it)edit: looks like the segmentation fault is specific to my setup and I found a work-around.

Still to do:

  • [MRG+1] Read-only input data in common tests #4807 adds check_regressors_pickle. Maybe unrelated to the main goal of the PR not sure, I have not double-checked. Edit: I am pretty sure check_regressors_pickle is not used anywhere and I guess it was not meant to be included in the PR.
  • at the moment I create a lot of temporary files without cleaning up. Not sure whether that can be tidied up. Using pytest tmpdir is not really an option because it will need to be passed down quite a few functions (and thus complicate the signature of quite a few _yield_*_checks and check_*) before it can reach the function that actually uses tmpdir. I had a solution using TempMemmap but that would add an additional level of indentation generating a painful diff to review without any good reason. Edit: I think it's fine like this but better suggestions more than welcome!
  • I have reused the main fix of [MRG+1] Read-only input data in common tests #4807 in order to avoid copying the memmap into memory without thoroughly checking it. I am wondering whether using np.asanyarray which keeps the memmap class is slightly better. Edit: there is a test that checks that np.matrix are turned into arrays that breaks if using np.asanyarray so I stuck with asarray.

@lesteve lesteve force-pushed the read-only-memmap branch 2 times, most recently from e50c034 to 4477794 Compare February 23, 2018 16:30
@lesteve lesteve changed the title [WIP] Read-only memmap input data in common tests [MRG] Read-only memmap input data in common tests Feb 24, 2018
@rth
Copy link
Member

rth commented Feb 26, 2018

at the moment I create a lot of temporary files without cleaning up
Not sure whether that can be tidied up. Using pytest tmpdir is not really an option because it will need to be passed down quite a few functions

Also that would have added pytest as a runtime dependency...

In https://github.com/scikit-learn/scikit-learn/pull/4807/files it looks like there is some mechanism of shared temp folder via the global _TEMP_READONLY_MEMMAP_MEMORY variable. Couldn't the same be used here? For instance, initialize this variable in the beginning of check_estimator store all the temporary files inside (possibly in separate temp sub folders) then remove it (e.g. in a try: finally: clause) after all checks are executed?

@lesteve
Copy link
Member Author

lesteve commented Feb 27, 2018

In https://github.com/scikit-learn/scikit-learn/pull/4807/files it looks like there is some mechanism of shared temp folder via the global _TEMP_READONLY_MEMMAP_MEMORY variable. Couldn't the same be used here?

I felt like it was complicating the code quite a bit, without a substantial benefit. The cleanest way IMO would be to create the mmap files in the pytest tmpdir and not bother deleting them (pytest deletes its tmpdir after a few runs as can been seen from this). I may try to have a closer look at how this can be done.

@arthurmensch you are more than welcome to look at it in case you have some spare bandwidth.

@rth
Copy link
Member

rth commented Feb 27, 2018

the cleanest way IMO would be to create the mmap files in the pytest tmpdir

Yes, but wouldn't this require importing pytest when running check_estimator, that's currently not a runtime dependency as far as I know?

I felt like it was complicating the code quite a bit, without a substantial benefit.

If I understand correctly this created temp files with random names in some temporary folder?. I'm not sure if those will get cleaned up by default on system restart or not. If it's not the case, it's not so great to have some temp files that get accumulated particularly for code that is suggested to run to other projects in the scikit-learn ecosystem...

@lesteve
Copy link
Member Author

lesteve commented Feb 27, 2018

Those are very good points. I just pushed a commit so that create_memmap_backed_data uses atexit to make sure the mmap files get deleted at the interpreter exit.

@lesteve
Copy link
Member Author

lesteve commented Mar 5, 2018

@rth any more comments on this one?

@rth
Copy link
Member

rth commented Mar 8, 2018

@lesteve I'll try to review this soon.

@lesteve
Copy link
Member Author

lesteve commented Mar 8, 2018

@lesteve I'll try to review this soon.

Nice, thanks !

IMO the main thing to review is the logic in sklearn.utils.validation.check_array that avoid making a copy of a memmap. I kept the logic exactly the same as in @arthurmensch's PR.

The rest is mostly awkward (but hopefully close to minimal) contortions to include the read-only memmap tests inside the test_common.py and estimator_checks.py existing code.

@lesteve
Copy link
Member Author

lesteve commented Mar 26, 2018

ping @rth if you have some spare bandwitdth ;-)

@lesteve
Copy link
Member Author

lesteve commented Apr 5, 2018

ping @arthurmensch maybe ?

@rth
Copy link
Member

rth commented Apr 5, 2018

I'll try to review this next week @lesteve ...

@lesteve
Copy link
Member Author

lesteve commented Apr 5, 2018

I'll try to review this next week @lesteve ...

That would be very nice! If there is anything I can do to make this easier to review, let me know!

@lesteve
Copy link
Member Author

lesteve commented Apr 5, 2018

I'll try to review this next week @lesteve ...

That would be very nice! If there is anything I can do to make this easier to review, let me know! I think #10663 (comment) is still valid.

@jnothman
Copy link
Member

jnothman commented Apr 9, 2018

Pity common tests aren't pytest-ready to use the fixture paradigm more.

Am I getting confused, or should we expect this to fail on old cython

Do we have a test that check_array keeps memmapped data memmapped?

Otherwise looking great

@lesteve
Copy link
Member Author

lesteve commented Apr 9, 2018

Am I getting confused, or should we expect this to fail on old cython

For some time, in a lot of people mind (including mine), #4807 was identified with the read-only ValueError that had to do with the cython limitation that typed memoryview needed to be writeable (until cython 0.28 where const typed memoryview have been introduced).

As I mentioned in my top post, I was not able to reproduce the failures listed in #5481 that #4807 were supposed to highlight. For the last commit of #4807, Travis is green for Python 3 (see this). I find it rather odd that such failures would happen on Python 2 and not Python 3, so the only reasonable option I can think of is that the failures were due to another aspect of #4807.

Do we have a test that check_array keeps memmapped data memmapped?

Yes, see test_check_array_memmap here.

Otherwise looking great

Thanks for looking into this, let me know if there is any way I can make this PR easier to review!

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not followed in details all the implications and the perceived intent in the original PR #4807 but generally, I find this PR fairly comprehensible.

A comment / question is below, apart from that it this LGTM.

temp_folder = tempfile.mkdtemp(prefix='sklearn_testing_')
atexit.register(functools.partial(_delete_folder, temp_folder, warn=True))
filename = op.join(temp_folder, 'data.pkl')
joblib.dump(data, filename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stupid question, but it's not possible to create a memap without physically writing to disk? Creating temp files in multiple check_estimators_* functions (for each estimator in scikit-learn) doesn't seem ideal - that certainly wouldn't help test run time. Joblib seems to explicitly forbid this and this also doesn't work with np.load(..., mmap_mode='r') but I wonder if there is a workaround. We don't need an actual mmap, just something like behaves like it.

Alternatively is it possible to cache the create_memmap_backed_data function? the data should be repeated multiple times for different estimators.

Mostly a side comment, as this was already the case before this PR..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it's not possible to create a memap without physically writing to disk?

Looks like there is joblib/joblib#378 , but it's out of the scope for this PR ..

@@ -556,6 +558,10 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None,
msg = ("Data with input dtype %s was converted to %s%s."
% (dtype_orig, array.dtype, context))
warnings.warn(msg, DataConversionWarning)

if copy and np.may_share_memory(array, array_orig):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat !

@rth rth changed the title [MRG] Read-only memmap input data in common tests [MRG+1] Read-only memmap input data in common tests Apr 21, 2018
@glemaitre
Copy link
Member

LGTM. @lesteve could you only make an entry inside the whats new please.
I'll merge right after.

@lesteve
Copy link
Member Author

lesteve commented Apr 23, 2018

I added a whats_new entry, let me know in case you have suggestions for improvements.

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine, but you also need to update Changes to Estimator Checks, below.

@lesteve
Copy link
Member Author

lesteve commented Apr 23, 2018

Very good point, I just did that.

lesteve added 2 commits April 23, 2018 11:35
Looks like the segmentation fault was specific to my setup.
@sklearn-lgtm
Copy link

This pull request introduces 3 alerts when merging 2521286 into 27a804d - view on lgtm.com

new alerts:

  • 3 for Mismatch in multiple assignment

Comment posted by lgtm.com

@lesteve
Copy link
Member Author

lesteve commented Apr 23, 2018

Just for completeness, I double-checked and sklearn-lgtm alerts are false positives.

@glemaitre glemaitre merged commit 9ea723f into scikit-learn:master Apr 23, 2018
@glemaitre
Copy link
Member

Thanks!!!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Estimators should not try to modify X and y inplace in order to handle readonly memory maps
5 participants