ENH improve ARFF parser using pandas #21938
Conversation
sklearn/datasets/_arff_parser.py
Outdated
```python
def _split_sparse_columns(
    arff_data: ArffSparseDataType, include_columns: List
) -> ArffSparseDataType:
    """Obtains several columns from sparse ARFF representation. Additionally,
    the column indices are re-labelled, given the columns that are not
    included. (e.g., when including [1, 2, 3], the columns will be relabelled
    to [0, 1, 2]).

    Parameters
    ----------
    arff_data : tuple
        A tuple of three lists of equal size; first list indicating the value,
        second the x coordinate and the third the y coordinate.

    include_columns : list
        A list of columns to include.

    Returns
    -------
    arff_data_new : tuple
        Subset of arff data with only the include columns indicated by the
        include_columns argument.
    """
    arff_data_new: ArffSparseDataType = (list(), list(), list())
    reindexed_columns = {
        column_idx: array_idx for array_idx, column_idx in enumerate(include_columns)
    }
    for val, row_idx, col_idx in zip(arff_data[0], arff_data[1], arff_data[2]):
        if col_idx in include_columns:
            arff_data_new[0].append(val)
            arff_data_new[1].append(row_idx)
            arff_data_new[2].append(reindexed_columns[col_idx])
    return arff_data_new


def _sparse_data_to_array(
    arff_data: ArffSparseDataType, include_columns: List
) -> np.ndarray:
    # turns the sparse data back into an array (can't use toarray() function,
    # as this only works on numeric data)
    num_obs = max(arff_data[1]) + 1
    y_shape = (num_obs, len(include_columns))
    reindexed_columns = {
        column_idx: array_idx for array_idx, column_idx in enumerate(include_columns)
    }
    # TODO: improve for efficiency
    y = np.empty(y_shape, dtype=np.float64)
    for val, row_idx, col_idx in zip(arff_data[0], arff_data[1], arff_data[2]):
        if col_idx in include_columns:
            y[row_idx, reindexed_columns[col_idx]] = val
    return y
```
Those lines are moved from the previous implementation.
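For illustration, here is how `_split_sparse_columns` relabels the kept columns on a tiny hypothetical triplet (the data below is made up, not from the PR):

```python
# Hypothetical sparse ARFF triplet: (values, row indices, column indices).
arff_data = ([1.0, 2.0, 3.0], [0, 0, 1], [1, 3, 3])

values, rows, cols = _split_sparse_columns(arff_data, include_columns=[1, 3])
# values == [1.0, 2.0, 3.0]; rows == [0, 0, 1]; cols == [0, 1, 1]
# original column 1 is relabelled to 0 and column 3 to 1
```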
sklearn/datasets/_arff_parser.py
Outdated
```python
frame = pd.concat(dfs, ignore_index=True)
del dfs, first_df

frame = _cast_frame(frame, columns_info_openml, infer_casting)
```
This is the only difference from the previous implementation: we attempt to infer casting if requested.
sklearn/datasets/_arff_parser.py
Outdated
```python
    return y


def _cast_frame(frame, columns_info, infer_casting=False):
```
It is the equivalent of `_feature_to_dtype`. It makes smarter casting.
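For readers without the diff at hand, here is a minimal sketch of what such metadata-driven casting can look like. The `columns_info` layout (a mapping from column name to an OpenML-style feature dict with a `"data_type"` entry) is an assumption for this sketch, not the actual scikit-learn implementation:

```python
import pandas as pd


def cast_frame_sketch(frame: pd.DataFrame, columns_info: dict) -> pd.DataFrame:
    # Sketch only: cast raw string-typed columns using OpenML-style metadata.
    for name, info in columns_info.items():
        data_type = info["data_type"]  # assumed key, as in the OpenML feature listing
        if data_type == "integer":
            # nullable Int64 keeps missing values without forcing a float upcast
            frame[name] = pd.to_numeric(frame[name]).astype("Int64")
        elif data_type in ("numeric", "real"):
            frame[name] = pd.to_numeric(frame[name])
        elif data_type == "nominal":
            frame[name] = frame[name].astype("category")
    return frame
```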
Thank you for working on this @glemaitre . I left a comment on a possible way to make this PR smaller.
Still a big diff, but this is mainly due to the refactoring of the test file indeed.
Some first comments; I haven't completed my review yet.
More comments about the tests and the behavior of `parser="auto"`. I will do a quick final pass once addressed, but the PR LGTM otherwise.
sklearn/datasets/_arff_parser.py
Outdated
```python
)
columns_to_select = feature_names_to_select + target_names_to_select

nominal_attributes = {
```
Maybe this should be renamed to "categorical" because we do not really know if categorical variables should be treated as nominal or ordinal in ARFF (e.g. for categories encoded with integer values).
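For context, ARFF itself gives no hint about ordering; an illustrative header (made up for this example):

```
% Nothing in the format marks {1, 2, 3} as ordered, so the parser cannot
% tell nominal from ordinal when the categories are integers.
@RELATION example
@ATTRIBUTE rating {1, 2, 3}
@ATTRIBUTE class {yes, no}
```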
At a glance, there looks to be a refactor of `_liac_arff_parser`, where `_convert_arff_data` + `_convert_arff_data_dataframe` are moved into `_liac_arff_parser`? There is also a `_post_process_frame` + tests.
My sense is that the refactor can be done in another PR. How do you feel about splitting the refactor out? (I know we've done this once already, but the diff is quite big now.)
I can try to refactor the …
Okay, I'll review this PR. My remaining concern for this PR is that parquet support may be coming soon: openml/OpenML#1133. I suspect we will switch to parquet when it becomes available on OpenML. Do we want to maintain our own ARFF parser using pandas when we switch to parquet?
We could also make that decision when it happens? :D
I think that we can add support for parquet when it comes. However, it could be nice to have a reader that is fast enough in the meantime. Adding support for parquet will also add a new dependency (…
I think something's gone wrong with your merge from `main`. All the changes from other PRs are here.
sklearn/datasets/_openml.py
Outdated
```python
if parser == "auto":
    parser = "liac-arff" if return_sparse else "pandas"
```
Shouldn't we use `liac-arff` if `pandas` is not installed, instead of raising?
You can refer to the discussion with @ogrisel: #21938 (comment).
Basically, since the two parsers behave slightly differently, switching depending on which libraries are installed would be difficult to debug when it fails. I recall such a bug in `pandas` whose behavior changed depending on whether `numexpr` is installed or not, and this is not a good thing.
A potential fix is given in #22354. But it is ugly, causes a drop in performance, and I am not sure that we should spend more time trying to improve the LIAC-ARFF parser.
Also, `fetch_openml` with default arguments on a dense dataset will already raise an exception in scikit-learn 1.0, because `as_frame="auto"` already requires pandas if the dataset is dense. So it is better to have `parser` and `as_frame` follow the same logic by default.
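For reference, the user-facing behaviour this PR describes would look like the following sketch (the dataset choice is only an example):

```python
from sklearn.datasets import fetch_openml

# Explicit selection: parser="pandas" requires pandas to be installed,
# while parser="liac-arff" keeps the previous pure-Python behaviour.
adult = fetch_openml("adult", version=2, as_frame=True, parser="pandas")
print(adult.frame.dtypes.head())
```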
The remaining failure is linked with casting and handling of …
I think that's an old enough pandas, and it's not an essential dependency. We could bump the min version.
Gave this a deeper pass. Overall, looks good!
I went over each example that was changed and the output is the same as `main`.
I think this is ready to ship. LGTM
LGTM, just a final batch of nitpicks.
Such a PR! It turned out to be a LOT larger than anticipated at first. But very happy with it. It'd be nice if we could merge and have it in the release @jeremiedbb
CI fails
The CircleCI arm64 failure might be caused by a dependency upgrade?
Indeed, and it's being addressed in #23336.
Then shall we merge when CI is green? Not sure where the release is at (cc @jeremiedbb).
I already generated the wheels (#23321) and tagged. And the PR on conda-forge is already open (conda-forge/scikit-learn-feedstock#186) :/
Pity, then the versions need to be fixed and then we can merge :)
Congrats on the merge!
Builds upon #22026.

closes #19774
closes #11821
closes #14855
closes #11817

It also addresses 3 remaining examples from #21598 by reducing their computational time:

- plot_poisson_regression_non_normal_loss.py
- plot_sgd_early_stopping.py
- plot_stack_predictors.py

Some other examples should also benefit from these changes.
Add an option `parser` that can take `"liac-arff"` to get the current behaviour, `"pandas"` to get a C parser that is much faster, more memory efficient, and infers the proper data types, or `"auto"` (which defaults to `"pandas"` if installed, otherwise to `"liac-arff"`).

A couple of benchmarks: