
[MRG] partial_fit implementation in TfidfTransformer #9014


Open: wants to merge 6 commits into main

Conversation

@n0mad commented Jun 6, 2017

Reference Issue

Closes #7549

What does this implement/fix? Explain your changes.

This PR implements a partial_fit method for TfidfTransformer.
As discussed in the thread #7549 (comment) , the number of features should not change after the partial_fit call, so I only update the DF. In order to do that, I slightly changed the logic of the TfidfTransformer: it maintains the vector of document frequencies (df) and the number of documents n_sample, not the actual idf vector. Instead, the idf vector is calculated when needed.

Eugene Kharitonov added 4 commits June 6, 2017 17:02
This commit simplifies building the partial_fit interface, which will update the document frequency vector and the number of observed documents.
partial_fit updates the document frequencies stored by TfIdfTransformer.
Fixes scikit-learn#7549
A small refactoring so that all calculations are in the same place
@n0mad changed the title from "partial_fit implementation in TfidfTransformer" to "[WIP] partial_fit implementation in TfidfTransformer" on Jun 6, 2017
@rth (Member) commented Jun 6, 2017

Thanks for the PR @n0mad! Could you please benchmark that this doesn't increase the run time of TfidfVectorizer(use_idf=True).fit_transform on the 20_newsgroups dataset and post the results here? Thanks.

@n0mad (Author) commented Jun 6, 2017

OK!

@n0mad (Author) commented Jun 6, 2017

Not sure that is the best way to benchmark, but running https://gist.github.com/n0mad/591cf186cc05ea191c4e9a4cbe43b685
I get differences of the means pretty much within 1 std:

$ git checkout tfidf-partial-fit
Switched to branch 'tfidf-partial-fit'
Your branch is up-to-date with 'origin/tfidf-partial-fit'.
$ cd ../
$ PYTHONPATH=./scikit-learn/ python bench.py
loading data
data loaded
11314 documents - 22.055MB (training set)
mean 4.07119162321 std 0.417126325148
$ cd scikit-learn/
$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
$ cd ./../
$ PYTHONPATH=./scikit-learn/ python bench.py
loading data
data loaded
11314 documents - 22.055MB (training set)
mean 4.06442222357 std 0.452654689778

# if _idf_diag is not set, this will raise an attribute error,
# which means hasattr(self, "idf_") is False
- return np.ravel(self._idf_diag.sum(axis=0))
+ return np.log(float(self._n_samples) / self._df) + 1.0
Member:

Remove float() and add division to the __future__ import above please. The __future__ is now!
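For reference, the suggestion amounts to something like this (a standalone illustration of the idf line, not the actual PR diff):

from __future__ import division  # "/" becomes true division on Python 2 as well

import numpy as np

n_samples = 10
df = np.array([1, 2, 5])
idf = np.log(n_samples / df) + 1.0  # no float() cast needed any more
print(idf)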

if n_features != expected_n_features:
    raise ValueError("Input has n_features=%d while the model"
                     " has been trained with n_features=%d" % (
                         n_features, expected_n_features))

idf_diag = sp.spdiags(self.idf_, diags=0, m=n_features,
Member:

Given the power law distribution of vocabularies, this is potentially doing a lot more work at each call to transform relative to the vocabulary used in each transformed document. We need a benchmark of repeated calls to transform. I now realise that converting the diagonal to CSR in transform already requires O(vocab_size) time so we're unlikely to see a big change (except that the previous O(vocab_size) operation should have been a little faster). :( Perhaps we should be storing idf_diag from call to call.

Author:

The scenario with somewhat degraded performance would be a sequence of transform() calls after a fit()/partial_fit(), since the idf vector would be recalculated each time.

If we think it's a problem, I see the following possibilities to avoid it:

  1. (a) cache idf_diag (maybe under a flag), or (b) simply update it after each partial_fit (i.e. store idf_diag, n_samples, and df);
  2. only store idf_diag and n_samples, and then use them to re-calculate df, update it, and store the updated idf_diag and n_samples again. Super ugly, and it has performance costs for partial_fit().

I think that in NLP applications the vocabulary would be upper-bounded by millions of terms, hence always storing both idf and df at the same time shouldn't be much of a problem.

Which option do you like most?
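A rough sketch of option 1 (b), with hypothetical attribute and method names:

import numpy as np
import scipy.sparse as sp

class _IdfDiagCacheSketch:
    """Illustrative only: refresh the cached diagonal once per (partial_)fit
    so that transform() does not have to rebuild it on every call."""

    def _refresh_idf_diag(self):
        # called at the end of fit() / partial_fit()
        idf = np.log((1.0 + self._n_samples) / (1.0 + self._df)) + 1.0
        n_features = self._df.shape[0]
        self._idf_diag = sp.spdiags(idf, diags=0, m=n_features,
                                    n=n_features, format='csr')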

Member:

So if I understand the issue correctly,

  • idf_ needs to be exposed and valid after each partial_fit since it's a public attribute
  • to calculate idf_ at each iteration you need the df from the previous iteration and the n_samples
  • in transform, to multiply X rows by idf_ we need to create a diagonal sparse matrix. We can do that either in partial_fit (and cache it) or directly in transform. The overhead would be the same, and in my opinion it's largely a matter of whether we would a) do numerous partial_fit calls or b) numerous transform calls. I think a) is always true by design (and thus more likely), and I'm not sure the possible use case of calling transform multiple times is worth adding an additional flag for caching idf_diag.

In any case, it might be worth actually benchmarking the overhead of creating sp.spdiags(self.idf_, ..) to see if it's anything we need to worry about...

Member:

@n0mad Aww, so you actually did benchmarks above. Could you also test, e.g., running a transform on a single document?
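A micro-benchmark along these lines (illustrative; uses timeit and the 20 newsgroups data) would answer that:

import timeit

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

data = fetch_20newsgroups(subset='train').data
counts = CountVectorizer().fit_transform(data)
tfidf = TfidfTransformer().fit(counts)

one_doc = counts[:1]  # a single document
timings = timeit.repeat(lambda: tfidf.transform(one_doc), number=100, repeat=5)
print("per-call transform time: %.4f ms" % (1000 * min(timings) / 100))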

@jnothman (Member) commented Jun 7, 2017 via email

@n0mad changed the title from "[WIP] partial_fit implementation in TfidfTransformer" to "[MRG] partial_fit implementation in TfidfTransformer" on Jun 7, 2017
        return self

    def partial_fit(self, X, y=None):
        """Update the df vector (global term weights),
Member:

PEP257: one-line summary here
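i.e. something along these lines (wording is only a suggestion):

def partial_fit(self, X, y=None):
    """Update document frequency statistics from a new batch of documents.

    The number of features must match what was seen in previous
    fit / partial_fit calls.
    """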

tfidf_full = tr_full.fit_transform(X).toarray()

tr_partial = TfidfTransformer(smooth_idf=True, norm='l2')
tr_partial.fit([[1, 1, 1]])
Member:

no, we'd usually use partial_fit all the way through, not fit the first time.

Author:

I was wondering what the semantics would be, given that 'partial_fit cannot change the number of features' (see the issue discussion).
Ok, will re-do
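A re-done test might look roughly like this (a sketch assuming the partial_fit from this PR; the data values are illustrative, not the ones in the test):

import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

X = np.array([[1, 1, 1],
              [1, 1, 0],
              [1, 0, 0]])

tr_full = TfidfTransformer(smooth_idf=True, norm='l2')
tfidf_full = tr_full.fit_transform(X).toarray()

tr_partial = TfidfTransformer(smooth_idf=True, norm='l2')
for row in X:
    tr_partial.partial_fit([row])  # partial_fit all the way through
tfidf_partial = tr_partial.transform(X).toarray()

np.testing.assert_array_almost_equal(tfidf_full, tfidf_partial)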

@rth (Member) commented Sep 1, 2017:

@n0mad When calling partial_fit all the way through, the use case would be to provide a fixed vocabulary at initialization (otherwise it wouldn't make sense indeed).

Generally, to do out-of-core text vectorization one would do a first pass over the dataset to estimate the vocabulary (but currently there is no way of doing that in scikit-learn without also computing the full Bag of Words matrix), then a second pass to actually compute the BoW matrix (though that's probably not compatible with the sklearn API). Gensim does something like that, I think, but I can't find the exact reference anymore...
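For illustration, out-of-core usage with a fixed vocabulary might look roughly like this (a sketch assuming the partial_fit from this PR; the vocabulary and batches are placeholders for a real streaming source):

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vocabulary = ["apple", "banana", "cherry"]            # fixed up-front
vectorizer = CountVectorizer(vocabulary=vocabulary)   # no fitting pass required
tfidf = TfidfTransformer()

batches = [["apple banana", "banana banana cherry"],
           ["cherry apple", "apple apple banana"]]

for batch in batches:
    counts = vectorizer.transform(batch)  # BoW with a constant feature set
    tfidf.partial_fit(counts)             # only df and n_samples are updated

# afterwards, transform new batches with the accumulated idf statistics
weighted = tfidf.transform(vectorizer.transform(["banana cherry"]))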

@jnothman (Member) commented Jun 8, 2017 via email

@eyadsibai commented

Any updates to make this happen? This is a pretty good feature to have, especially with large datasets.

@deanp70 commented May 5, 2020

This is exactly what I was looking for. This seems pretty stale, but any chance it will be incorporated?

Base automatically changed from master to main January 22, 2021 10:49
@idoshlomo commented Feb 11, 2021

@jnothman @rth I'd like to pick this up, if it's still considered useful to the community.

I implemented this feature internally at my company a few years ago and it's been running in production very well ever since. Based on this code in my personal repo. Would love the opportunity to contribute here.

@bilelomrani1 commented

Any news on this issue? It is still very relevant today, and it would be very nice to have this feature natively in scikit-learn.

@cieske commented Nov 13, 2023

I think this partial_fit function looks really good, and I'm looking forward to using it in scikit-learn soon. Thanks for your hard work :)

@OmarManzoor (Contributor) commented

@adrinjalali Do you think this PR can be moved forward?

@adrinjalali (Member) commented

@OmarManzoor It seems that if we take the vocabulary as a constructor argument, then this can move forward. It would be nice if the people here who find this feature useful could tell us whether that's a reasonable requirement for them.

cc @cieske @bilelomrani1 @idoshlomo

@bilelomrani1 commented

Hi and thank you @adrinjalali. Your suggestion sounds very reasonable to me.

@cieske commented Aug 14, 2024

@adrinjalali Sorry for the late reply, and thanks for the reasonable suggestion!


Successfully merging this pull request may close these issues: TfIdf vectorizer partial_fit