ENH improve ARFF parser using pandas #21938

Merged: 305 commits merged into scikit-learn:main on May 12, 2022

Conversation

@glemaitre glemaitre commented Dec 10, 2021

build upon #22026
closes #19774
closes #11821
closes #14855
closes #11817

It also addresses the 3 remaining examples from #21598 by reducing their computational time:

  • plot_poisson_regression_non_normal_loss.py: 60s to 30s
  • plot_sgd_early_stopping.py: 91s to 35s
  • plot_stack_predictors.py: 34s to 23s

Some other examples should also benefit from these changes.

This adds a parser option that accepts "liac-arff" to get the current behaviour, "pandas" to get a C-based parser that is much faster, more memory efficient, and infers the proper data types, and "auto" (which defaults to "pandas" if pandas is installed, otherwise to "liac-arff").
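
For illustration, a minimal usage sketch of the new option (assuming the post-merge fetch_openml API; the "adult" dataset is only an example):

from sklearn.datasets import fetch_openml

# parser="pandas" uses the fast C-based pandas reader; parser="auto" would
# pick it automatically whenever pandas is installed.
adult = fetch_openml("adult", version=2, as_frame=True, parser="pandas")
print(adult.frame.dtypes)  # nominal columns come back as category dtype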

A couple of benchmarks:

[two benchmark images comparing the parsers were attached here]
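
A sketch of how such a timing comparison can be reproduced (the dataset choice is illustrative; the first call only warms the download cache so that parsing dominates the timed calls):

from time import perf_counter

from sklearn.datasets import fetch_openml

# Warm the download cache so that only parsing time is compared.
fetch_openml("adult", version=2, as_frame=True, parser="pandas")
for parser in ("liac-arff", "pandas"):
    tic = perf_counter()
    fetch_openml("adult", version=2, as_frame=True, parser=parser)
    print(f"{parser}: {perf_counter() - tic:.2f}s")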

Comment on lines 19 to 69
def _split_sparse_columns(
    arff_data: ArffSparseDataType, include_columns: List
) -> ArffSparseDataType:
    """Obtain several columns from a sparse ARFF representation.

    Additionally, the column indices are re-labelled, given the columns that
    are not included (e.g., when including [1, 2, 3], the columns will be
    relabelled to [0, 1, 2]).

    Parameters
    ----------
    arff_data : tuple
        A tuple of three lists of equal size: the first list holds the
        values, the second the row coordinates and the third the column
        coordinates.

    include_columns : list
        A list of columns to include.

    Returns
    -------
    arff_data_new : tuple
        Subset of the ARFF data containing only the columns indicated by the
        include_columns argument.
    """
    arff_data_new: ArffSparseDataType = (list(), list(), list())
    reindexed_columns = {
        column_idx: array_idx for array_idx, column_idx in enumerate(include_columns)
    }
    for val, row_idx, col_idx in zip(arff_data[0], arff_data[1], arff_data[2]):
        if col_idx in include_columns:
            arff_data_new[0].append(val)
            arff_data_new[1].append(row_idx)
            arff_data_new[2].append(reindexed_columns[col_idx])
    return arff_data_new


def _sparse_data_to_array(
    arff_data: ArffSparseDataType, include_columns: List
) -> np.ndarray:
    # Turn the sparse data back into an array. We cannot use the toarray()
    # function since it only works on numeric data.
    num_obs = max(arff_data[1]) + 1
    y_shape = (num_obs, len(include_columns))
    reindexed_columns = {
        column_idx: array_idx for array_idx, column_idx in enumerate(include_columns)
    }
    # TODO: improve for efficiency
    y = np.empty(y_shape, dtype=np.float64)
    for val, row_idx, col_idx in zip(arff_data[0], arff_data[1], arff_data[2]):
        if col_idx in include_columns:
            y[row_idx, reindexed_columns[col_idx]] = val
    return y
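
A tiny illustration of the relabelling described in the docstring (not part of the diff; the data triplet is made up):

# Keep columns 1 and 3 of a (values, rows, cols) triplet; the kept columns
# are relabelled to 0 and 1.
data = ([1.0, 2.0, 3.0], [0, 0, 1], [1, 2, 3])
print(_split_sparse_columns(data, [1, 3]))
# ([1.0, 3.0], [0, 1], [0, 1])
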
@glemaitre (Member Author):

Those lines are moved from the previous implementation.

frame = pd.concat(dfs, ignore_index=True)
del dfs, first_df

frame = _cast_frame(frame, columns_info_openml, infer_casting)
@glemaitre (Member Author):

This is the only difference from the previous implementation: here we attempt to infer the casting if requested.

return y


def _cast_frame(frame, columns_info, infer_casting=False):
@glemaitre (Member Author):

It is the equivalent of _feature_to_dtype. It performs smarter casting.
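
For context, a minimal sketch of what such smarter casting could look like (a hypothetical helper, not the implementation in this PR):

import pandas as pd

def cast_frame_sketch(frame, nominal_columns):
    # Nominal ARFF attributes become pandas categories; float columns that
    # hold only integral values (possibly with missing entries) become the
    # nullable Int64 dtype.
    for col in frame.columns:
        if col in nominal_columns:
            frame[col] = frame[col].astype("category")
        elif pd.api.types.is_float_dtype(frame[col]):
            non_na = frame[col].dropna()
            if len(non_na) and (non_na == non_na.round()).all():
                frame[col] = frame[col].astype("Int64")
    return frame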

@glemaitre glemaitre marked this pull request as ready for review December 14, 2021 14:13
@thomasjpfan thomasjpfan left a comment:

Thank you for working on this @glemaitre. I left a comment on a possible way to make this PR smaller.

@glemaitre (Member Author):

Still a big diff, but this is indeed mainly due to the refactoring of the test file.

@ogrisel ogrisel left a comment:

Some first comments; I haven't completed my review yet.

@ogrisel ogrisel left a comment:

More comments about the tests and the behavior of parser="auto". I will do a quick final pass once they are addressed, but the PR LGTM otherwise.

)
columns_to_select = feature_names_to_select + target_names_to_select

nominal_attributes = {
A member commented:

Maybe this should be renamed to "categorical" because we do not really know if categorical variables should be treated as nominal or ordinal in ARFF (e.g. for categories encoded with integer values).
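
To illustrate the distinction (an illustrative snippet, not from the PR): pandas categories are nominal (unordered) by default, and an ordinal interpretation must be opted into explicitly.

import pandas as pd

s = pd.Series(["low", "high", "medium"], dtype="category")
print(s.cat.ordered)  # False: plain categories are nominal
ordinal = s.astype(pd.CategoricalDtype(["low", "medium", "high"], ordered=True))
print(ordinal.cat.ordered)  # True: the order now carries meaning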

@thomasjpfan thomasjpfan left a comment:

At a glance, there looks to be a refactor of _liac_arff_parser, where _convert_arff_data + _convert_arff_data_dataframe are moved into _liac_arff_parser? There is also a new _post_process_frame + tests.

My sense is that the refactor can be done in another PR. How do you feel about splitting the refactor out? (I know we've done this once already, but the diff is quite big now.)

@glemaitre (Member Author):

At a glance, there looks to be a refactor of _liac_arff_parser, where _convert_arff_data + _convert_arff_data_dataframe are moved into _liac_arff_parser?

I can try to refactor the _liac_arff_parser part only. But in the end, out of the 1,934 changed lines, there are already 1,750 lines of test refactoring and a couple of lines that are only variable renaming. So after refactoring, I don't expect the diff to be reduced by many lines.

@thomasjpfan (Member):

So after refactoring, I don't expect the diff to be reduced by many lines.

Okay, I'll review this PR. My remaining concern is that Parquet support may be coming soon: openml/OpenML#1133. I suspect we will switch to Parquet when it becomes available on OpenML. Do we want to maintain our own ARFF parser using pandas when we switch to Parquet?

@adrinjalali (Member):

Okay, I'll review this PR. My remaining concern is that Parquet support may be coming soon: openml/OpenML#1133. I suspect we will switch to Parquet when it becomes available on OpenML. Do we want to maintain our own ARFF parser using pandas when we switch to Parquet?

We could also make that decision when it happens? :D

@glemaitre glemaitre commented Feb 8, 2022

I think that we can add support for Parquet when it comes. However, it would be nice to have a reader that is fast enough in the meantime. Adding support for Parquet will also add a new dependency (pyarrow, I assume) that is still optional with pandas. So I think it is worth adding the parser in the meantime.
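
For reference, reading Parquet with pandas is a one-liner once pyarrow is installed ("data.parquet" is a placeholder path):

import pandas as pd

# pyarrow is the optional engine pandas uses to read Parquet files.
frame = pd.read_parquet("data.parquet", engine="pyarrow")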

@github-actions github-actions bot added the cython label Feb 9, 2022
@adrinjalali adrinjalali left a comment:

I think something's gone wrong with your merge from main; there are changes from other PRs in here.

Comment on lines 911 to 912
if parser == "auto":
    parser = "liac-arff" if return_sparse else "pandas"
A member commented:

Shouldn't we use liac-arff if pandas is not installed, instead of raising?

@glemaitre (Member Author):

You can refer to the discussion with @ogrisel in #21938 (comment).

Basically, since the two parsers behave slightly differently, changing behaviour depending on which libraries are installed would be difficult to debug when it fails. I recall such a bug in pandas depending on whether numexpr is installed or not, and this is not a good thing.

@glemaitre (Member Author):

A potential fix is given in #22354. But it is ugly, causes a drop in performance, and I am not sure that we should spend more time trying to improve the LIAC-ARFF parser.

@ogrisel ogrisel commented Feb 25, 2022:

Also, fetch_openml with default arguments on a dense dataset will already raise an exception in scikit-learn 1.0, because as_frame="auto" already requires pandas if the dataset is dense.

So it is better to have parser and as_frame follow the same logic by default.
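
A hypothetical sketch of the behaviour agreed on in this thread (not the merged code; names are illustrative):

def resolve_parser(parser, return_sparse):
    # "auto" maps to "liac-arff" for sparse data and to "pandas" otherwise,
    # raising if pandas is missing, mirroring as_frame="auto".
    if parser != "auto":
        return parser
    if return_sparse:
        return "liac-arff"
    try:
        import pandas  # noqa: F401
    except ImportError as exc:
        raise ImportError(
            "Using parser='auto' on a dense dataset requires pandas."
        ) from exc
    return "pandas"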

@glemaitre (Member Author):

The remaining failure is linked to the casting and handling of pd.NA, which would provide integer missing values but requires pandas >= 1.0. So we would need to bump the minimum pandas version to have consistent behaviour.
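
For context, the nullable Int64 dtype introduced in pandas 1.0 represents missing integers with pd.NA instead of forcing a cast to float (a small illustration, unrelated to the PR code itself):

import pandas as pd

s = pd.Series([1.0, 2.0, float("nan")])
print(s.astype("Int64"))
# 0       1
# 1       2
# 2    <NA>
# dtype: Int64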

@adrinjalali (Member):

I think that's an old enough pandas, and it's not an essential dependency. We could bump the min version.

@thomasjpfan thomasjpfan left a comment:

Gave this a deeper pass. Overall, looks good!

glemaitre and others added 5 commits May 3, 2022 14:23
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@thomasjpfan thomasjpfan left a comment:

I went over each example that was changed, and the output is the same as on main.

I think this is ready to ship. LGTM

@ogrisel ogrisel left a comment:

LGTM, just a final batch of nitpicks.

glemaitre and others added 2 commits May 12, 2022 14:17
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
Co-authored-by: Olivier Grisel <olivier.grisel@ensta.org>
@adrinjalali adrinjalali left a comment:

Such a PR! It turned out to be a LOT larger than anticipated at first, but I am very happy with it. It'd be nice if we could merge and have it in the release @jeremiedbb

@adrinjalali (Member):

CI fails

@ogrisel ogrisel commented May 12, 2022:

The CircleCI arm64 failure might be caused by a dependency upgrade?

https://app.circleci.com/pipelines/github/scikit-learn/scikit-learn/26909/workflows/74388600-25aa-456a-8cd8-7136dca5e1a8/jobs/192262

Indeed, and it's being addressed in #23336.

@adrinjalali (Member):

Then shall we merge when CI is green? Not sure where the release stands (cc @jeremiedbb).

@jeremiedbb (Member):

I already generated the wheels (#23321) and tagged. And the PR on conda-forge is already open (conda-forge/scikit-learn-feedstock#186).

:/

@adrinjalali adrinjalali commented May 12, 2022:

Pity, then the versions need to be fixed and then we can merge :)

@ogrisel ogrisel modified the milestones: 1.1, 1.2 May 12, 2022
@glemaitre glemaitre merged commit a47d569 into scikit-learn:main May 12, 2022
@ogrisel ogrisel commented May 13, 2022:

Congrats on the merge!
