Skip to content

Fix performance regression in ColumnTransformer #29330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jun 28, 2024

Conversation

jeremiedbb
Copy link
Member

@jeremiedbb jeremiedbb commented Jun 21, 2024

Fixes #29229

#28822 was a quick fix for #28781 before releasing 1.5.0. It turns out that it caused a performance regression when the input can't be memory mapped, for instance columns where elements are arrays of variable length. In that case the whole X is copied in each subprocess.

Since #29018 has now been merged, we can revert #28822 because it also fixes #28781 without causing the performance regression (note that I kept the non regression test). The following results of the snippet from #29229 with this PR are close to the 1.4.2 ones:

sklearn version: 1.6.dev0 and n_jobs=1
5: Per col: 0.0206s / total 0.10 s
10: Per col: 0.0203s / total 0.20 s
15: Per col: 0.0204s / total 0.31 s
20: Per col: 0.0201s / total 0.40 s
25: Per col: 0.0214s / total 0.54 s
30: Per col: 0.0204s / total 0.61 s
35: Per col: 0.0202s / total 0.71 s
40: Per col: 0.0200s / total 0.80 s
45: Per col: 0.0208s / total 0.93 s

sklearn version: 1.6.dev0 and n_jobs=2
5: Per col: 0.1613s / total 0.81 s
10: Per col: 0.0352s / total 0.35 s
15: Per col: 0.0321s / total 0.48 s
20: Per col: 0.0334s / total 0.67 s
25: Per col: 0.0299s / total 0.75 s
30: Per col: 0.0296s / total 0.89 s
35: Per col: 0.0283s / total 0.99 s
40: Per col: 0.0294s / total 1.18 s
45: Per col: 0.0294s / total 1.32 s

Copy link

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 2bc3222. Link to the linter CI: here

Copy link
Member

@betatim betatim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks reasonable to me.

I don't think we can add a regression test for this. So it is Ok to not have one.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@ogrisel ogrisel merged commit 4bc61a0 into scikit-learn:main Jun 28, 2024
32 checks passed
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024
@jeremiedbb jeremiedbb mentioned this pull request Jul 2, 2024
11 tasks
jeremiedbb added a commit to jeremiedbb/scikit-learn that referenced this pull request Jul 2, 2024
snath-xoc pushed a commit to snath-xoc/scikit-learn that referenced this pull request Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment