
[MRG] partial_fit implementation in TfidfTransformer #9014


Open: wants to merge 6 commits into main

Conversation

@n0mad commented Jun 6, 2017

Reference Issue

Closes #7549

What does this implement/fix? Explain your changes.

This PR implements a partial_fit method for TfidfTransformer.
As discussed in the thread #7549 (comment) , the number of features should not change after the partial_fit call, so I only update the DF. In order to do that, I slightly changed the logic of the TfidfTransformer: it maintains the vector of document frequencies (df) and the number of documents n_sample, not the actual idf vector. Instead, the idf vector is calculated when needed.

Eugene Kharitonov added 4 commits June 6, 2017 17:02
This commit simplifies building the partial_fit interface, which will update the document frequency vector and the number of observed documents.
partial_fit updates the document frequencies stored by TfIdfTransformer.
Fixes scikit-learn#7549
A small refactoring so that all calculations are in the same place
@n0mad changed the title from "partial_fit implementation in TfidfTransformer" to "[WIP] partial_fit implementation in TfidfTransformer" on Jun 6, 2017
@rth (Member) commented Jun 6, 2017

Thanks for the PR @n0mad! Could you please benchmark that this doesn't increase the run time of TfidfVectorizer(use_idf=True).fit_transform on the 20_newsgroups dataset and post the results here? Thanks.

@n0mad (Author) commented Jun 6, 2017

OK!

@n0mad (Author) commented Jun 6, 2017

Not sure that is the best way to benchmark, but running https://gist.github.com/n0mad/591cf186cc05ea191c4e9a4cbe43b685
I get differences of the means pretty much within 1 std:

$ git checkout tfidf-partial-fit
Switched to branch 'tfidf-partial-fit'
Your branch is up-to-date with 'origin/tfidf-partial-fit'.
$ cd ../
$ PYTHONPATH=./scikit-learn/ python bench.py
loading data
data loaded
11314 documents - 22.055MB (training set)
mean 4.07119162321 std 0.417126325148
$ cd scikit-learn/
$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
$ cd ./../
$ PYTHONPATH=./scikit-learn/ python bench.py
loading data
data loaded
11314 documents - 22.055MB (training set)
mean 4.06442222357 std 0.452654689778

# if _idf_diag is not set, this will raise an attribute error,
# which means hasattr(self, "idf_") is False
- return np.ravel(self._idf_diag.sum(axis=0))
+ return np.log(float(self._n_samples) / self._df) + 1.0
Member:

Remove float() and add division to the __future__ import above please. The __future__ is now!
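For reference, the suggestion amounts to something like this (a standalone illustration of the idf line, not the actual PR diff):

from __future__ import division  # "/" becomes true division on Python 2 as well

import numpy as np

n_samples = 10
df = np.array([1, 2, 5])
idf = np.log(n_samples / df) + 1.0  # no float() cast needed any more
print(idf)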

if n_features != expected_n_features:
    raise ValueError("Input has n_features=%d while the model"
                     " has been trained with n_features=%d" % (
                         n_features, expected_n_features))

idf_diag = sp.spdiags(self.idf_, diags=0, m=n_features,
Member:

Given the power law distribution of vocabularies, this is potentially doing a lot more work at each call to transform relative to the vocabulary used in each transformed document. We need a benchmark of repeated calls to transform. I now realise that converting the diagonal to CSR in transform already requires O(vocab_size) time so we're unlikely to see a big change (except that the previous O(vocab_size) operation should have been a little faster). :( Perhaps we should be storing idf_diag from call to call.

Author:

The scenario with somewhat degraded performance would be a sequence of transform() calls after a fit()/partial_fit(), since the idf vector would be recalculated each time.

If we think it's a problem, I see the following possibilities to avoid it:

  1. (a) cache idf_diag (maybe under a flag), or (b) simply update it after each partial_fit (i.e. store idf_diag, n_samples, and df);
  2. only store idf_diag and n_samples, and then use them to re-calculate df, update it, and store the updated idf_diag and n_samples again. Super ugly, and it has performance costs for partial_fit().

I think that in NLP applications the vocabulary would be upper-bounded by millions of terms, hence always storing both idf and df at the same time shouldn't be much of a problem.

Which option do you like most?
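A rough sketch of option 1 (b), with hypothetical attribute and method names:

import numpy as np
import scipy.sparse as sp

class _IdfDiagCacheSketch:
    """Illustrative only: refresh the cached diagonal once per (partial_)fit
    so that transform() does not have to rebuild it on every call."""

    def _refresh_idf_diag(self):
        # called at the end of fit() / partial_fit()
        idf = np.log((1.0 + self._n_samples) / (1.0 + self._df)) + 1.0
        n_features = self._df.shape[0]
        self._idf_diag = sp.spdiags(idf, diags=0, m=n_features,
                                    n=n_features, format='csr')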

Member:

So if I understand the issue correctly,

  • idf_ needs to be exposed and valid after each partial_fit since it's a public attribute
  • to calculate idf_ at each iteration you need the df from the previous iteration and the n_samples
  • in transform, to multiply X rows by idf_ we need to create a diagonal sparse matrix. We can do that either in partial_fit (and cache it) or directly in transform. The overhead would be the same, and in my opinion it's largely a matter of whether we would a) do numerous partial_fit calls or b) numerous transform calls. I think a) is always true by design (and thus more likely), and I'm not sure the possible use case of calling transform multiple times is worth adding an additional flag for caching idf_diag.

In any case, it might be worth actually benchmarking the overhead of creating sp.spdiags(self.idf_, ..) to see if it's anything we need to worry about...

Member:

@n0mad Aww, so you actually did benchmarks above. Could you also test, e.g., running a transform on a single document?
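A micro-benchmark along these lines (illustrative; uses timeit and the 20 newsgroups data) would answer that:

import timeit

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

data = fetch_20newsgroups(subset='train').data
counts = CountVectorizer().fit_transform(data)
tfidf = TfidfTransformer().fit(counts)

one_doc = counts[:1]  # a single document
timings = timeit.repeat(lambda: tfidf.transform(one_doc), number=100, repeat=5)
print("per-call transform time: %.4f ms" % (1000 * min(timings) / 100))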

@jnothman (Member) commented Jun 7, 2017 via email

@n0mad changed the title from "[WIP] partial_fit implementation in TfidfTransformer" to "[MRG] partial_fit implementation in TfidfTransformer" on Jun 7, 2017
        return self

    def partial_fit(self, X, y=None):
        """Update the df vector (global term weights),
Member:

PEP257: one-line summary here
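i.e. something along these lines (wording is only a suggestion):

def partial_fit(self, X, y=None):
    """Update document frequency statistics from a new batch of documents.

    The number of features must match what was seen in previous
    fit / partial_fit calls.
    """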

tfidf_full = tr_full.fit_transform(X).toarray()

tr_partial = TfidfTransformer(smooth_idf=True, norm='l2')
tr_partial.fit([[1, 1, 1]])
Member:

no, we'd usually use partial_fit all the way through, not fit the first time.

Author:

I was wondering what the semantics would be, given that 'partial_fit cannot change the number of features' (see the issue discussion).
Ok, will re-do
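A re-done test might look roughly like this (a sketch assuming the partial_fit from this PR; the data values are illustrative, not the ones in the test):

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

X = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 0]])

tr_full = TfidfTransformer(smooth_idf=True, norm='l2')
tfidf_full = tr_full.fit_transform(X).toarray()

tr_partial = TfidfTransformer(smooth_idf=True, norm='l2')
for row in X:
    tr_partial.partial_fit([row])  # partial_fit all the way through
tfidf_partial = tr_partial.transform(X).toarray()

np.testing.assert_array_almost_equal(tfidf_full, tfidf_partial)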

@rth (Member) commented Sep 1, 2017:

@n0mad When calling partial_fit all the way through, the use case would be to provide a fixed vocabulary at initialization (otherwise it wouldn't make sense indeed).

Generally, to do out-of-core text vectorization one would do a first pass over the dataset to estimate the vocabulary (but currently there is no way of doing that in scikit-learn without also computing the full Bag of Words matrix), then a second pass to actually compute the BoW matrix (though that's probably not compatible with the sklearn API). Gensim does something like that, I think, but I can't find the exact reference anymore...
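For illustration, out-of-core usage with a fixed vocabulary might look roughly like this (a sketch assuming the partial_fit from this PR; the vocabulary and batches are placeholders for a real streaming source):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vocabulary = ["apple", "banana", "cherry"]            # fixed up-front
vectorizer = CountVectorizer(vocabulary=vocabulary)   # no fitting pass required
tfidf = TfidfTransformer()

batches = [["apple banana", "banana banana cherry"],
           ["cherry apple", "apple apple banana"]]

for batch in batches:
    counts = vectorizer.transform(batch)  # BoW with a constant feature set
    tfidf.partial_fit(counts)             # only df and n_samples are updated

# afterwards, transform new batches with the accumulated idf statistics
weighted = tfidf.transform(vectorizer.transform(["banana cherry"]))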

@jnothman (Member) commented Jun 8, 2017 via email

@eyadsibai commented

Any updates to make this happen? This is a pretty good feature to have, especially with large datasets.

@deanp70 commented May 5, 2020

This is exactly what I was looking for. This seems pretty stale, but any chance it will be incorporated?

Base automatically changed from master to main January 22, 2021 10:49
@idoshlomo commented Feb 11, 2021

@jnothman @rth I'd like to pick this up, if it's still considered useful to the community.

I implemented this feature internally at my company a few years ago and it's been running in production very well ever since. Based on this code in my personal repo. Would love the opportunity to contribute here.

@bilelomrani1 commented

Any news on this issue? It is still very relevant today, and it would be very nice to have this feature natively in scikit-learn.

@cieske commented Nov 13, 2023

I think this partial_fit function looks really good, and I'm looking forward to using it in scikit-learn soon. Thanks for your hard work :)

@OmarManzoor (Contributor) commented

@adrinjalali Do you think this PR can be moved forward?

@adrinjalali (Member) commented

@OmarManzoor It seems that if we take the vocabulary as a constructor argument, then this can move forward. It would be nice if the people here who find this feature useful could tell us whether that's a reasonable requirement for them.

cc @cieske @bilelomrani1 @idoshlomo

@bilelomrani1 commented

Hi and thank you @adrinjalali. Your suggestion sounds very reasonable to me.

@cieske commented Aug 14, 2024

@adrinjalali Sorry for the late reply, and thanks for the reasonable suggestion!


Successfully merging this pull request may close these issues: TfIdf vectorizer partial_fit