
ENH Preserving dtype for np.float32 in LatentDirichletAllocation #22113


Closed
wants to merge 25 commits

Conversation

takoika
Contributor

@takoika takoika commented Jan 2, 2022

Reference Issues/PRs

This PR is part of #11000 .
Closes #13275

What does this implement/fix? Explain your changes.

This PR makes LatentDirichletAllocation preserve numpy.float32 when the input data is numpy.float32, so that the fitted attributes and outputs keep the input dtype instead of being upcast to numpy.float64.

Any other comments?

I used #20155 and #13275 as references to make this.
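As an illustration of the intended behavior (a minimal sketch, not taken from the PR's test suite: the dataset, sizes, and parameters below are arbitrary), float32 input should yield float32 fitted attributes and transform output:

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

# Small synthetic corpus; LDA accepts any non-negative count-like matrix.
X, _ = make_multilabel_classification(random_state=0, n_samples=100, n_features=10)

lda = LatentDirichletAllocation(n_components=3, max_iter=5, random_state=0)
doc_topic = lda.fit_transform(X.astype(np.float32))

# With this change, both the fitted components and the transform output
# are expected to stay float32 rather than being upcast to float64.
print(lda.components_.dtype, doc_topic.dtype)
print(doc_topic.shape)  # (100, 3): one topic distribution per document
```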

@takoika
Contributor Author

takoika commented Jan 2, 2022

This PR tackles the same issue as the quite old PR #13275.

@takoika takoika changed the title ENH Preserving dtype for np.float32 in LatentDirichletAllocation [WIP] ENH Preserving dtype for np.float32 in LatentDirichletAllocation Jan 2, 2022
@takoika takoika changed the title [WIP] ENH Preserving dtype for np.float32 in LatentDirichletAllocation ENH Preserving dtype for np.float32 in LatentDirichletAllocation Jan 2, 2022
Member

@thomasjpfan thomasjpfan left a comment


Thank you for the PR @takoika !

Member

@ogrisel ogrisel left a comment


Overall LGTM once the review suggestions are handled.

@takoika takoika requested review from thomasjpfan and ogrisel January 6, 2022 10:54
Member

@thomasjpfan thomasjpfan left a comment


Thank you for the PR @takoika !

I gave this PR a quick pass and left a question regarding fused dtypes.

@takoika takoika requested a review from thomasjpfan January 10, 2022 16:36
@takoika takoika requested a review from thomasjpfan January 13, 2022 15:35
@takoika takoika requested a review from thomasjpfan January 17, 2022 15:56
Member

@thomasjpfan thomasjpfan left a comment


Minor comments, otherwise LGTM!

I'm not merging yet, since there have been Cython changes since @ogrisel's approval. I think it is worth it for @ogrisel to take another look.

Member

@jjerphan jjerphan left a comment


Just one comment and this LGTM.

takoika and others added 2 commits January 19, 2022 21:03
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
@takoika takoika requested a review from jjerphan January 19, 2022 13:49
Member

@jjerphan jjerphan left a comment


LGTM. Thank you, @takoika.

PS: As @thomasjfox said, let's wait for @ogrisel review.
PPS: 🤦 oops, as @thomasjpfan indicated, let's wait for @ogrisel's review.

@thomasjfox
Contributor

thomasjfox commented Jan 20, 2022

Edit: As @thomasjfox said, let's wait for @ogrisel review.

thomasjfox is entirely unrelated to the scikit-learn project ;)

-> wrong guy

Member

@thomasjpfan thomasjpfan left a comment


I did a quick benchmark and noticed that this PR leads to a runtime regression:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import make_multilabel_classification
from time import perf_counter

X, _ = make_multilabel_classification(random_state=0, n_samples=2_000,
                                      n_features=20)
lda = LatentDirichletAllocation(n_components=5, random_state=0)

start = perf_counter()
lda.fit(X)
delta = perf_counter() - start
print(f"{delta} s")

With this PR I get 7.3 s and on main I get 4.4 s. We would need to investigate before merging.

@takoika
Contributor Author

takoika commented Feb 14, 2022

I did a quick benchmark and noticed that this PR leads to a runtime regression:

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import make_multilabel_classification
from time import perf_counter

X, _ = make_multilabel_classification(random_state=0, n_samples=2_000,
                                      n_features=20)
lda = LatentDirichletAllocation(n_components=5, random_state=0)

start = perf_counter()
lda.fit(X)
delta = perf_counter() - start
print(f"{delta} s")

With this PR I get 7.3 s and on main I get 4.4 s. We would need to investigate before merging.

Thanks! I have reproduced the same slowdown. I will investigate it.

@ogrisel
Member

ogrisel commented Mar 3, 2022

You might want to try using a profiler such as viztracer, scalene or py-spy to pinpoint the part of the code responsible for the slowdown.
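Even without installing one of the tools above, the standard-library cProfile can narrow down which step of fit dominates the runtime. A minimal sketch (dataset and parameters are arbitrary, chosen only to keep the run short):

```python
import cProfile
import io
import pstats

from sklearn.datasets import make_multilabel_classification
from sklearn.decomposition import LatentDirichletAllocation

X, _ = make_multilabel_classification(random_state=0, n_samples=500, n_features=20)
lda = LatentDirichletAllocation(n_components=5, max_iter=3, random_state=0)

# Profile only the fit call, then print the ten most expensive functions
# by cumulative time to see where the regression could come from.
profiler = cProfile.Profile()
profiler.enable()
lda.fit(X)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Running the same script on this branch and on main and comparing the top entries should point at the function responsible for the slowdown.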

@jeremiedbb
Member

I identified and fixed the regression in PR #24528

@jeremiedbb
Member

Thanks @takoika, your contribution has been included in #24528


7 participants