-
-
Notifications
You must be signed in to change notification settings - Fork 25.8k
[MRG+1] Read-only memmap input data in common tests #10663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
e50c034
to
4477794
Compare
Also that would have added pytest as a runtime dependency... In https://github.com/scikit-learn/scikit-learn/pull/4807/files it looks like there is some mechanism of shared temp folder via the global |
I felt like it was complicating the code quite a bit, without a substantial benefit. The cleanest way IMO would be to create the mmap files in the pytest tmpdir and not bother deleting them (pytest deletes its tmpdir after a few runs as can been seen from this). I may try to have a closer look at how this can be done. @arthurmensch you are more than welcome to look at it in case you have some spare bandwidth. |
Yes, but wouldn't this require importing pytest when running
If I understand correctly this created temp files with random names in some temporary folder?. I'm not sure if those will get cleaned up by default on system restart or not. If it's not the case, it's not so great to have some temp files that get accumulated particularly for code that is suggested to run to other projects in the scikit-learn ecosystem... |
Those are very good points. I just pushed a commit so that |
baa3300
to
15eee5e
Compare
@rth any more comments on this one? |
@lesteve I'll try to review this soon. |
Nice, thanks ! IMO the main thing to review is the logic in The rest is mostly awkward (but hopefully close to minimal) contortions to include the read-only memmap tests inside the test_common.py and estimator_checks.py existing code. |
ping @rth if you have some spare bandwitdth ;-) |
ping @arthurmensch maybe ? |
I'll try to review this next week @lesteve ... |
That would be very nice! If there is anything I can do to make this easier to review, let me know! |
That would be very nice! If there is anything I can do to make this easier to review, let me know! I think #10663 (comment) is still valid. |
Pity common tests aren't pytest-ready to use the fixture paradigm more. Am I getting confused, or should we expect this to fail on old cython Do we have a test that check_array keeps memmapped data memmapped? Otherwise looking great |
For some time, in a lot of people mind (including mine), #4807 was identified with the read-only ValueError that had to do with the cython limitation that typed memoryview needed to be writeable (until cython 0.28 where const typed memoryview have been introduced). As I mentioned in my top post, I was not able to reproduce the failures listed in #5481 that #4807 were supposed to highlight. For the last commit of #4807, Travis is green for Python 3 (see this). I find it rather odd that such failures would happen on Python 2 and not Python 3, so the only reasonable option I can think of is that the failures were due to another aspect of #4807. Do we have a test that check_array keeps memmapped data memmapped? Yes, see test_check_array_memmap here.
Thanks for looking into this, let me know if there is any way I can make this PR easier to review! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not followed in details all the implications and the perceived intent in the original PR #4807 but generally, I find this PR fairly comprehensible.
A comment / question is below, apart from that it this LGTM.
temp_folder = tempfile.mkdtemp(prefix='sklearn_testing_') | ||
atexit.register(functools.partial(_delete_folder, temp_folder, warn=True)) | ||
filename = op.join(temp_folder, 'data.pkl') | ||
joblib.dump(data, filename) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stupid question, but it's not possible to create a memap without physically writing to disk? Creating temp files in multiple check_estimators_*
functions (for each estimator in scikit-learn) doesn't seem ideal - that certainly wouldn't help test run time. Joblib seems to explicitly forbid this and this also doesn't work with np.load(..., mmap_mode='r')
but I wonder if there is a workaround. We don't need an actual mmap, just something like behaves like it.
Alternatively is it possible to cache the create_memmap_backed_data
function? the data should be repeated multiple times for different estimators.
Mostly a side comment, as this was already the case before this PR..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it's not possible to create a memap without physically writing to disk?
Looks like there is joblib/joblib#378 , but it's out of the scope for this PR ..
@@ -556,6 +558,10 @@ def check_array(array, accept_sparse=False, dtype="numeric", order=None, | |||
msg = ("Data with input dtype %s was converted to %s%s." | |||
% (dtype_orig, array.dtype, context)) | |||
warnings.warn(msg, DataConversionWarning) | |||
|
|||
if copy and np.may_share_memory(array, array_orig): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat !
LGTM. @lesteve could you only make an entry inside the whats new please. |
I added a whats_new entry, let me know in case you have suggestions for improvements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine, but you also need to update Changes to Estimator Checks, below.
Very good point, I just did that. |
Looks like the segmentation fault was specific to my setup.
Added tests for TempMemmap and create_memmap_backed_data
7db5518
to
2521286
Compare
This pull request introduces 3 alerts when merging 2521286 into 27a804d - view on lgtm.com new alerts:
Comment posted by lgtm.com |
Just for completeness, I double-checked and sklearn-lgtm alerts are false positives. |
Thanks!!! |
This is an attempt to revive #4807 (a rebase was deemed too painful). Closes #4807, closes #5507, closes #5481.
Note that at the moment I am not able to reproduce the problems that were listed in #5481. Note that no errors can be seen on the last Travis run for the Python 3 build from #4807.
On the other hand I have a segmentation fault with LinearSVR, which I have not looked at yet (opened #10667 to track it)edit: looks like the segmentation fault is specific to my setup and I found a work-around.Still to do:
check_regressors_pickle
is not used anywhere and I guess it was not meant to be included in the PR._yield_*_checks
andcheck_*
) before it can reach the function that actually uses tmpdir. I had a solution using TempMemmap but that would add an additional level of indentation generating a painful diff to review without any good reason. Edit: I think it's fine like this but better suggestions more than welcome!np.asanyarray
which keeps the memmap class is slightly better. Edit: there is a test that checks that np.matrix are turned into arrays that breaks if usingnp.asanyarray
so I stuck withasarray
.