[MRG] Adds fetch_openml pandas dataframe support #13902
Conversation
I agree that including the target in the frame corresponds to common pandas usage, but it doesn't correspond to the scikit-learn convention. I'd rather leave the target out unless you have a more compelling reason otherwise.
sklearn/datasets/openml.py
Outdated
check_pandas_support('fetch_openml with return_frame=True')
import pandas as pd

df = pd.DataFrame(arrf_data['data'], columns=list(features_dict.keys()),
arff_data['data'] is a generator, right? Does that work with all supported versions of pandas? Can you make sure that it does not just do list(data) and explode the memory usage?
Otherwise we can pass it to pd.DataFrame.from_records(), which supports generators since 0.12.

Actually @jorisvandenbossche mentioned that it should not be implemented in from_records either.
pd.DataFrame does call list(data), which we need to avoid.

Otherwise we can pass it to pd.DataFrame.from_records()

Just tested from_records out and it seems to work. :)
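A minimal sketch of the behavior being discussed: pd.DataFrame.from_records accepts a row generator directly (the generator and column names here are illustrative, not from the PR).

```python
import pandas as pd

def rows():
    # Hypothetical generator yielding ARFF-style row tuples.
    for i in range(3):
        yield (i, float(i) * 0.5)

# from_records accepts an iterable/generator of row tuples.
df = pd.DataFrame.from_records(rows(), columns=["a", "b"])
print(df.shape)  # (3, 2)
```

Note that, as mentioned above, this does not stream the data: pandas still materializes the generator internally before building the frame.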
I agree that splitting the dataframe into target and data better matches the scikit-learn convention. If we break up data and target, we would need to guide users through recombining the dataset to do any exploratory data analysis that needs the target in the dataframe. This runs a little counter to what is expected when loading a dataframe from a CSV, where the target is included and users are used to breaking up the dataframe themselves. In the end, I am okay with going either way on this matter.
return_frame='include-target' in a future version? Or is that hella ugly?
sklearn/datasets/openml.py
Outdated
-------
df : pd.DataFrame
"""
check_pandas_support('fetch_openml with return_frame=True')
Do you think it would be nice to directly have something like: pd = check_pandas_support(...)?
Yup, that would be nice to have. I'll include it and see what others think.
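A sketch of the helper being proposed: it returns the imported pandas module so callers do not need a separate import. The error message and exact behavior of the real sklearn.utils.check_pandas_support may differ.

```python
def check_pandas_support(caller_name):
    """Import and return pandas, raising an informative error if missing.

    Sketch of the utility discussed above; the actual scikit-learn
    implementation may differ in wording and details.
    """
    try:
        import pandas
        return pandas
    except ImportError as e:
        raise ImportError("{} requires pandas.".format(caller_name)) from e

pd = check_pandas_support("fetch_openml with return_frame=True")
df = pd.DataFrame({"a": [1, 2]})
```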
@thomasjpfan thanks for taking this up! Added a few comments
sklearn/utils/__init__.py
Outdated
except ImportError as e:
    raise ImportError(
        "{} requires pandas. You can install pandas with "
        "`pip install pandas`".format(caller_name)
I personally wouldn't suggest pip installing pandas (e.g. for all conda users, this is the wrong suggestion).
sklearn/datasets/openml.py
Outdated
df : pd.DataFrame
"""
pd = check_pandas_support('fetch_openml with return_frame=True')
df = pd.DataFrame.from_records(arrf_data['data'],
For some reason GitHub prevents me from replying to the previous comments about this. But I would personally just use pd.DataFrame(...). from_records is not more efficient (it expands the generator first as well), and is less well maintained than the main constructor.
What might work is to load the data in chunks to limit memory consumption
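A sketch of the chunked loading idea: consume the row generator in fixed-size slices, build a small frame per slice, and concatenate at the end. The helper name and chunk size are illustrative (the PR's helper is named _chunk_generator / _chunk_iterable; details here are assumed).

```python
from itertools import islice

import pandas as pd

def _chunk_generator(gen, chunksize):
    # Yield successive lists of at most `chunksize` rows from `gen`,
    # so only one chunk is materialized in memory at a time.
    while True:
        chunk = list(islice(gen, chunksize))
        if not chunk:
            return
        yield chunk

rows = ((i, i * 2) for i in range(10))  # stand-in for arff['data']
dfs = [pd.DataFrame(chunk, columns=["a", "b"])
       for chunk in _chunk_generator(rows, chunksize=4)]
df = pd.concat(dfs, ignore_index=True)
print(len(df))  # 10
```

The peak extra memory is then bounded by one chunk plus the concatenated result, rather than the full materialized list.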
sklearn/datasets/openml.py
Outdated
dtype = pd.CategoricalDtype(attributes[column])
dtypes[column] = dtype

return df.astype(dtypes)
In my original PR, I had a special case for all-numeric columns (that prevents yet another copy in that specific case). Just mentioning it, but I think it might be worth it (I didn't check again with a timing, though).
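A sketch of the dtype-assignment step with the all-numeric fast path suggested above: only call astype (which copies) when there are categorical columns to convert. The helper name and data are hypothetical.

```python
import pandas as pd

def apply_dtypes(df, categories_by_column):
    # Build a per-column CategoricalDtype mapping from the ARFF attributes.
    dtypes = {col: pd.CategoricalDtype(cats)
              for col, cats in categories_by_column.items()}
    if not dtypes:
        # All-numeric case: skip astype to avoid an extra copy.
        return df
    return df.astype(dtypes)

df = pd.DataFrame({"color": ["red", "blue"], "x": [1.0, 2.0]})
out = apply_dtypes(df, {"color": ["blue", "red"]})
print(out["color"].dtype)  # category
```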
This PR has been updated.
sklearn/datasets/openml.py
Outdated
arrf_data_gen = _chunk_iterable(arrf['data'], nrows)
dfs = [pd.DataFrame(list(data), columns=arrf_columns, dtype=object)
       for data in arrf_data_gen]
df = pd.concat(dfs, copy=False)
You can drop the copy=False here :) (that's never possible along this axis).
sklearn/datasets/openml.py
Outdated
If True, returns a Bunch where the data attribute is a pandas
DataFrame.

nrows : int, default=5000
Maybe rather chunksize, as you called it above? For an nrows argument, I would expect that it specifies to only return that many (first) rows.
sklearn/datasets/openml.py
Outdated
arrf_columns = list(attributes)

arrf_data_gen = _chunk_iterable(arrf['data'], nrows)
dfs = [pd.DataFrame(list(data), columns=arrf_columns, dtype=object)
Is this object dtype needed?
I'm still not convinced by the sense of returning the target in the frame. I see it as incompatible with the rest of sklearn.datasets (apart from making it easy to leak y to an estimator).
_monkey_patch_webbased_functions(monkeypatch, data_id, True)

bunch = fetch_openml(data_id=data_id, return_frame=True, cache=False)
I think it would be best to compare the bunches returned with return_frame=False vs True. This is especially needed in case we were to add something to the bunch later...
sklearn/datasets/openml.py
Outdated
chunksize : int, default=5000
    Number of rows to read at a time when constructing a dataframe.
    Only used when ``return_frame`` is True.
we could be using this more generally instead of the messy np.fromiter code.
What do you find messy about the np.fromiter code?
sklearn/datasets/openml.py
Outdated
DataFrame.

chunksize : int, default=5000
    Number of rows to read at a time when constructing a dataframe.
Is it possible to estimate against working_memory instead?
It is possible. We would need to estimate the amount of memory used per row.
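A sketch of the per-row memory estimate being discussed: build a one-row frame, measure its deep memory usage, and derive how many rows fit in a working-memory budget. The helper name and the 64 MiB default are hypothetical, not scikit-learn's actual working_memory machinery.

```python
import pandas as pd

def rows_per_chunk(first_row, columns, working_memory_mib=64):
    # Measure the footprint of a single row by materializing it in a
    # one-row DataFrame; deep=True accounts for Python-object columns.
    one = pd.DataFrame([first_row], columns=columns)
    row_bytes = int(one.memory_usage(deep=True).sum())
    # Derive the chunk size from the memory budget, at least one row.
    return max(1, (working_memory_mib * 2**20) // row_bytes)
```

This underestimates the true peak slightly: the temporary list of row tuples carries additional per-object overhead, as noted in the review below.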
This PR was updated with an additional bunch entry. I think the common use case for using a dataframe is to do EDA (using the target) before splitting the dataframe up into data and target.
So you want to keep target in the DF for two reasons: one is to support EDA
(although even there the user should probably not look at the target - and
arguably the features - on their test set), the other because we don't
properly support categoricals as y elsewhere in scikit-learn. Should we be
supporting the latter?
I am really quite uncomfortable about, on the one hand, saying fetch_openml
is not very useful for datasets with heterogeneous types until it supports
frame output, and on the other hand that a frame is fundamentally
semantically different to how we represent other datasets and hence
requires different instructions to work with as train/test data for
scikit-learn estimators. We are forcing users to learn a new way of
interacting with datasets (whether changing to fetch_openml from load_iris,
or vice versa) despite them being functionally required to use return_frame
in many cases. Does that make sense?
A version that returns an integrated frame belongs in the openml python library... And it can also live here, but imo it must be separated from a function which has an API like all the other dataset loaders.
I should also point out that return_frame='auto' could be a nice feature,
for datasets with any non-numeric features. The current approach precludes
that possibility.
Some plots benefit from having the target in the dataframe. For example, plotting the iris dataset with seaborn uses the target.
This makes sense. Keeping the datasets API consistent is important. Okay, I will update this PR to split the features and the target.
This sounds reasonable. It does not feel like too much magic.
This PR was updated.
I've only checked the diff since last review, but I think this is almost there.
first_row = next(arrf['data'])
first_df = pd.DataFrame([first_row], columns=arrf_columns)

row_bytes = first_df.memory_usage(deep=True).sum()
Maybe just comment to that effect. In practice the temporary list would have a further intp overhead per feature per sample if I'm not mistaken.
sklearn/datasets/openml.py
Outdated
@@ -582,8 +569,11 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
    below for more information about the `data` and `target` objects.

as_frame : boolean, default=False
    If True, returns a Bunch where the data attribute is a pandas
    DataFrame.
    If True, where the data is a pandas DataFrame including columns with
"where the" -> "the returned"
Not addressed?
sklearn/datasets/openml.py
Outdated
If True, where the data is a pandas DataFrame including columns with
appropriate dtypes (numeric, string or categorical). The target is
a pandas DataFrame or Series depending on the number of target_columns.
If ``return_X_y`` is True, then ``(data, target)`` will be pandas
I am not sure this is necessary if the previous statement does not mention a Bunch. You could just say "this applies regardless of return_X_y"
appropriate dtypes (numeric, string or categorical). The target is
a pandas DataFrame or Series depending on the number of target_columns.
If ``return_X_y`` is True, then ``(data, target)`` will be pandas
DataFrames or Series as describe above.
*described
I still wouldn't mind a test that essentially checks the diff between the as_frame=True and the as_frame=False bunch. But otherwise LGTM. Please add a what's new entry. And perhaps open a follow-up issue towards as_frame='auto'.
PR updated.
I think that this is almost done. Can you add a what's new entry?
sklearn/datasets/openml.py
Outdated
raise ValueError('Unsupported feature: {}'.format(feature))


def _chunk_generator(gen, chunksize):
I would move this function into utils, next to the other generator. I don't think we can factorize it with the current one.
LGTM. I would like to hear from @ogrisel and @jorisvandenbossche regarding if they agree with the API.
I think this is in line with what we discussed earlier, but just to confirm before clicking the green button ;)
Co-Authored-By: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@@ -71,9 +71,6 @@
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X = data.drop('survived', axis=1)
I would like to document this idiom somewhere. It's sooo common and I'm not sure if it's anywhere else in the docs.
Maybe give it as an alternative here?
Added a comment using the frame attribute.
thanks!
I would like to document this idiom somewhere.
Is y = X.pop('survived') a better idiom? Or not, because it modifies X in-place.
I personally prefer frame.drop(columns='survived') instead of specifying the axis=1.
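The idioms under discussion, side by side, as a sketch with a toy frame (column names illustrative, matching the titanic 'survived' example above):

```python
import pandas as pd

frame = pd.DataFrame({"survived": [0, 1], "age": [22.0, 38.0]})

# Non-destructive idiom: frame is left untouched.
X = frame.drop(columns="survived")
y = frame["survived"]

# In-place idiom: pop removes the column from the frame itself
# and returns it as a Series.
frame2 = frame.copy()
y2 = frame2.pop("survived")

print(list(X.columns), list(frame2.columns))  # ['age'] ['age']
```

Both end with the same X/y split; the difference is only whether the original frame still contains the target afterwards.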
@@ -489,26 +557,38 @@ def fetch_openml(name=None, version='active', data_id=None, data_home=None,
    If True, returns ``(data, target)`` instead of a Bunch object. See
    below for more information about the `data` and `target` objects.

as_frame : boolean, default=False
    If True, where the data is a pandas DataFrame including columns with
    appropriate dtypes (numeric, string or categorical). The target is
It doesn't mention the frame attribute?
details : dict
    More metadata from OpenML
frame : pandas DataFrame
"Only present when as_frame=True"?
@thomasjpfan Thanks. Maybe we can have some example that could benefit from it now.
I'm really excited that this was added! Thanks to everyone who worked on it!
"""
pd = check_pandas_support('fetch_openml with as_frame=True')

attributes = OrderedDict(arrf['attributes'])
typos, should be arff
@thomasjpfan thanks for taking this to the finish line!
Reference Issues/PRs
Fixes #11819
Fixes #11818
Supersedes #11875
Supersedes #13177
What does this implement/fix? Explain your changes.
This PR adds return_frame to fetch_openml. There were some design decisions made to get this to work (when return_frame is True):

- The data attribute will contain the whole data frame. Usually a pandas dataframe is used to explore the data, using groupbys, etc. Having the target in the dataframe helps with this aspect.
- The target attribute will be set to None. (The target is in the pandas dataframe.)
- A target_names attribute was added. Together with feature_names, this allows the pandas dataframe to be split into features and targets later.

There will be an unrelated error from test_scale_and_stability.
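A sketch of the splitting workflow the target_names / feature_names attributes enable, using hypothetical iris-like column names rather than a real OpenML fetch:

```python
import pandas as pd

# Hypothetical stand-ins for bunch.frame, bunch.feature_names and
# bunch.target_names as described in the PR summary.
frame = pd.DataFrame({"sepal_length": [5.1, 4.9],
                      "species": ["setosa", "setosa"]})
feature_names = ["sepal_length"]
target_names = ["species"]

# Split the combined frame into features and target after EDA.
X = frame[feature_names]
y = frame[target_names[0]] if len(target_names) == 1 else frame[target_names]
```

With a single target column, y comes back as a Series; with several, as a DataFrame, matching the docstring wording reviewed above.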