
ENH Preserving dtype for np.float32 in LatentDirichletAllocation #22113


Closed
wants to merge 25 commits

Conversation

takoika
Contributor

@takoika takoika commented Jan 2, 2022

Reference Issues/PRs

This PR is part of #11000 .
Closes #13275

What does this implement/fix? Explain your changes.

This PR makes LatentDirichletAllocation preserve numpy.float32 when the input data is numpy.float32, so that the fitted attributes and outputs keep the input dtype instead of being upcast to numpy.float64.

Any other comments?

I used #20155 and #13275 as references to make this.
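As an illustration of the intended behavior (a minimal sketch, not taken from the PR's test suite: the dataset, sizes, and parameters below are arbitrary), float32 input should yield float32 fitted attributes and transform output:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Small synthetic corpus; LDA accepts any non-negative count-like matrix.
X, _ = make_multilabel_classification(random_state=0, n_samples=100, n_features=10)

lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0)
doc_topic = lda.fit_transform(X.astype(np.float32))

# With this change, both the fitted components and the transform output
# are expected to stay float32 rather than being upcast to float64.
print(lda.components_.dtype, doc_topic.dtype)
print(doc_topic.shape)  # (100, 3): one topic distribution per document
```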

@takoika
Contributor Author

takoika commented Jan 2, 2022

This PR tackles the same issue as the quite old PR #13275.

@takoika takoika changed the title ENH Preserving dtype for np.float32 in LatentDirichletAllocation [WIP] ENH Preserving dtype for np.float32 in LatentDirichletAllocation Jan 2, 2022
@takoika takoika changed the title [WIP] ENH Preserving dtype for np.float32 in LatentDirichletAllocation ENH Preserving dtype for np.float32 in LatentDirichletAllocation Jan 2, 2022
Member

@thomasjpfan thomasjpfan left a comment


Thank you for the PR @takoika !

Member

@ogrisel ogrisel left a comment


Overall LGTM once the review suggestions are handled.

@takoika takoika requested review from thomasjpfan and ogrisel January 6, 2022 10:54
Member

@thomasjpfan thomasjpfan left a comment


Thank you for the PR @takoika !

I gave this PR a quick pass and left a question regarding fused dtypes.

@takoika takoika requested a review from thomasjpfan January 10, 2022 16:36
@takoika takoika requested a review from thomasjpfan January 13, 2022 15:35
@takoika takoika requested a review from thomasjpfan January 17, 2022 15:56
Member

@thomasjpfan thomasjpfan left a comment


Minor comments, otherwise LGTM!

I'm not merging yet, since there have been Cython changes since @ogrisel's approval. I think it is worth it for @ogrisel to take another look.

Member

@jjerphan jjerphan left a comment


Just one comment and this LGTM.

takoika and others added 2 commits January 19, 2022 21:03
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@takoika takoika requested a review from jjerphan January 19, 2022 13:49
Member

@jjerphan jjerphan left a comment


LGTM. Thank you, @takoika.

PS: As @thomasjfox said, let's wait for @ogrisel review.
PPS: 🤦 oops, as @thomasjpfan indicated, let's wait for @ogrisel's review.

@thomasjfox
Contributor

thomasjfox commented Jan 20, 2022

Edit: As @thomasjfox said, let's wait for @ogrisel review.

thomasjfox is entirely unrelated to the scikit-learn project ;)

-> wrong guy

Member

@thomasjpfan thomasjpfan left a comment


I did a quick benchmark and noticed that this PR leads to a runtime regression:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import make_multilabel_classification
from time import perf_counter

X, _ = make_multilabel_classification(random_state=0, n_samples=2_000,
                                      n_features=20)
lda = LatentDirichletAllocation(n_components=5, random_state=0)

start = perf_counter()
lda.fit(X)
delta = perf_counter() - start
print(f"{delta} s")

With this PR I get 7.3 s and on main I get 4.4 s. We would need to investigate before merging.

@takoika
Contributor Author

takoika commented Feb 14, 2022

I did a quick benchmark and noticed that this PR leads to a runtime regression:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import make_multilabel_classification
from time import perf_counter

X, _ = make_multilabel_classification(random_state=0, n_samples=2_000,
                                      n_features=20)
lda = LatentDirichletAllocation(n_components=5, random_state=0)

start = perf_counter()
lda.fit(X)
delta = perf_counter() - start
print(f"{delta} s")

With this PR I get 7.3 s and on main I get 4.4 s. We would need to investigate before merging.

Thanks! I have reproduced the same slowdown. I will investigate it.

@ogrisel
Member

ogrisel commented Mar 3, 2022

You might want to try using a profiler such as viztracer, scalene or py-spy to pinpoint the part of the code responsible for the slowdown.
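Even without installing one of the tools above, the standard-library cProfile can narrow down which step of fit dominates the runtime. A minimal sketch (dataset and parameters are arbitrary, chosen only to keep the run short):

```python
import cProfile
import io
import pstats

from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0, n_samples=500, n_features=20)
lda = LatentDirichletAllocation(n_components=5, max_iter=3, random_state=0)

# Profile only the fit call, then print the ten most expensive functions
# by cumulative time to see where the regression could come from.
profiler = cProfile.Profile()
profiler.enable()
lda.fit(X)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Running the same script on this branch and on main and comparing the top entries should point at the function responsible for the slowdown.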

@jeremiedbb
Member

I identified and fixed the regression in PR #24528

@jeremiedbb
Member

Thanks @takoika, your contribution has been included in #24528


7 participants