GridSearchCV.fit throws ValueError when passed a large dataframe that contains an Object column #9483
Comments
This appears to be a bug in joblib or pandas or their integration. Joblib assumes that it is safe to make a read-only memory map of any numpy array, and that it should do so when the array exceeds a specified size. Pandas makes use of Cython's typed memoryviews, which do not (yet) support read-only memory, including memmaps. In scikit-learn we are sadly forced to avoid memoryviews in many cases for this reason.

I've not looked at your notebook in detail (thanks for your thoroughness), but I'm not convinced that this is going to be limited to object arrays. I don't think we can easily solve this on our side if my diagnosis is correct.
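A rough illustration of the memmapping behaviour described above (the array size shown is an assumption chosen to exceed joblib's default auto-memmapping threshold; only standard joblib calls are used):

```python
import numpy as np
from joblib import Parallel, delayed


def describe(arr):
    # Report what the worker process actually receives.
    return type(arr).__name__, arr.flags.writeable


# Large enough to exceed joblib's default auto-memmapping threshold.
big = np.ones(int(3e6))

# With n_jobs > 1, the big array is dumped to disk and handed to the
# workers as a read-only numpy.memmap.
print(Parallel(n_jobs=2)(delayed(describe)(big) for _ in range(2)))
# Expected output along the lines of: [('memmap', False), ('memmap', False)]
```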
Thanks for the reply. The weird thing about the issue is that it doesn't happen when there are only numeric columns in the dataframe. Other people experienced this bug in prior versions with numeric-only dataframes, but those issues seem to have been fixed [1][2] (or at least as far as I can tell from the issue status). That's what led me to think that the issue might be specific to Object columns.

The workaround, for now, is to separate the encoding step from the Pipeline and encode things before GridSearchCV.fit is called (see the sketch just below). I don't think this is ideal, because it breaks the really nice ability of Pipeline to encapsulate everything related to fitting a model.

I forgot to link some issues that this seems to be related to: [1] #4772. There seemed to be a few more when I went down the rabbit hole one day, but I've lost those links.
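A minimal sketch of that workaround, using pd.get_dummies as a stand-in for the encoding step (the actual DataFrame_Encoder is not shown in this thread, and the data here is made up):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
df = pd.DataFrame({'f0': rng.randn(200000),
                   'category': rng.choice(['a', 'b'], size=200000)})
y = rng.randint(0, 2, size=200000)

# Encode outside the pipeline so that GridSearchCV only ever sees a numeric array.
X_encoded = pd.get_dummies(df).values

grid = GridSearchCV(LogisticRegression(), {'C': [0.1, 1.0]}, n_jobs=2)
grid.fit(X_encoded, y)
```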
I have edited your snippet to add the missing imports (and also added part of the traceback in a "details" section) and I can reproduce the problem. I think @jnothman's assessment of the problem is pretty accurate. This needs more investigation to figure out whether there is a work-around or a fix.
@stoddardg so I looked a little bit more at it, and a workaround I found for your particular case is to build the DataFrame like this:

```python
my_dict = {name: arr for name, arr in zip(numeric_features, x.T)}
my_dict['category'] = 'a'
df = pd.DataFrame(my_dict)
```

Don't ask me exactly why yet, because I have not fully understood the problem ... I'll try to clarify a bit what I have found.
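For context, a self-contained version of that snippet (numeric_features and x are hypothetical stand-ins for variables defined in the original notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the names used in the snippet above.
numeric_features = ['f0', 'f1', 'f2']
x = np.random.randn(200000, len(numeric_features))

# Build the DataFrame from a dict of 1-D arrays rather than from a single 2-D block.
my_dict = {name: arr for name, arr in zip(numeric_features, x.T)}
my_dict['category'] = 'a'
df = pd.DataFrame(my_dict)
```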
A minimal snippet showing that the problem is related to pandas:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'first': np.ones(100, dtype='float64')})
indices = np.array([1, 3, 6])
indices.flags.writeable = False
df.iloc[indices]
```

This seems like a variation of pandas-dev/pandas#10043 (fixed by pandas-dev/pandas#10070). The difference here is that the read-only array is the one used for indexing.
I opened an issue on pandas-dev/pandas#17192.
Description
I get the error

ValueError: buffer source array is read-only

in the example below whenever I pass a dataframe with around 200K rows and at least one column of dtype Object into `GridSearchCV` with `n_jobs > 1`. The error seems to be caused by passing a dataframe that has Object columns into `GridSearchCV.fit`. My custom class, `DataFrame_Encoder`, properly encodes the Object columns (by dummy encoding them) when the pipeline executes, but this error occurs before it executes. Things work fine if I use a smaller dataset, drop the Object column from the dataframe, or set `n_jobs=1`.

My minimal example to reproduce the bug is a bit lengthy, so I've also included a notebook with the code and some theories as to what is happening: https://github.com/stoddardg/sklearn_bug_example/blob/master/Bug%20Exploration.ipynb
Steps/Code to Reproduce
Example:
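The original snippet is not reproduced here; a minimal sketch of the setup described above (the body of `DataFrame_Encoder` and the data are assumptions) might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class DataFrame_Encoder(BaseEstimator, TransformerMixin):
    """Assumed implementation: dummy-encode the Object columns of a DataFrame."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.get_dummies(X).values


# ~200K rows, several numeric columns and one Object column.
n_rows = 200000
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(n_rows, 5), columns=['f%d' % i for i in range(5)])
df['category'] = rng.choice(['a', 'b', 'c'], size=n_rows)
y = rng.randint(0, 2, size=n_rows)

pipeline = Pipeline([
    ('encode', DataFrame_Encoder()),
    ('clf', LogisticRegression()),
])

# With n_jobs > 1, joblib memory-maps the large inputs for the workers,
# and the read-only buffer trips the ValueError inside pandas/Cython.
grid = GridSearchCV(pipeline, {'clf__C': [0.1, 1.0]}, n_jobs=2)
grid.fit(df, y)
```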
Expected Results
No error is thrown.
Actual Results
I get an incredibly long error message (viewable in the notebook), but the punchline is:
ValueError: buffer source array is read-only
Versions
Darwin-15.6.0-x86_64-i386-64bit
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:04:09)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
NumPy 1.13.1
SciPy 0.19.1
Scikit-Learn 0.18.2