[MRG] partial_fit implementation in TfidfTransformer #9014
Conversation
This commit simplifies building the partial_fit interface, which updates the document frequency vector and the number of observed documents.
partial_fit updates the document frequencies stored by TfidfTransformer. Fixes scikit-learn#7549
A small refactoring so that all calculations are in the same place
Thanks for the PR @n0mad! Could you please benchmark that this doesn't increase run time of …
OK!
Not sure that is the best way to benchmark, but running https://gist.github.com/n0mad/591cf186cc05ea191c4e9a4cbe43b685 : $ git checkout tfidf-partial-fit …
sklearn/feature_extraction/text.py
# if _idf_diag is not set, this will raise an attribute error,
# which means hasattr(self, "idf_") is False
return np.ravel(self._idf_diag.sum(axis=0))
return np.log(float(self._n_samples) / self._df) + 1.0
Remove float() and add division to the __future__ import above please. The __future__ is now!
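For context, a minimal standalone sketch of the idf expression being discussed, assuming from __future__ import division is in effect so the explicit float() cast becomes unnecessary (the function and argument names are illustrative, not the PR's attributes):

```python
from __future__ import division  # true division on Python 2, so float() is unneeded

import numpy as np

def idf_from_counts(n_samples, df):
    # hypothetical standalone version of the expression in the diff above
    return np.log(n_samples / df) + 1.0
```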
if n_features != expected_n_features:
    raise ValueError("Input has n_features=%d while the model"
                     " has been trained with n_features=%d" % (
                         n_features, expected_n_features))

idf_diag = sp.spdiags(self.idf_, diags=0, m=n_features,
Given the power law distribution of vocabularies, this is potentially doing a lot more work at each call to transform relative to the vocabulary used in each transformed document. We need a benchmark of repeated calls to .transform.
I now realise that converting the diagonal to CSR in transform already requires O(vocab_size) time, so we're unlikely to see a big change (except that the previous O(vocab_size) operation should have been a little faster). :( Perhaps we should be storing idf_diag from call to call.
The scenario with a somewhat degraded performance would be a sequence of transform() after a fit()/partial_fit(), since the idf vector would be re-calculated each time.
If we think it's a problem, I see the following possibilities to avoid it:
- cache idf_diag (a) (maybe under a flag), or simply update it after each partial_fit (b) (i.e. store idf_diag, n_samples, and df);
- store only idf_diag and n_samples, and then use them to re-calculate df, update it, and store the updated idf_diag & n_samples again. Super ugly, and has performance costs for partial_fit().
I think that in NLP applications the vocabulary would be upper-bounded by millions, hence always storing both idf and df at the same time shouldn't be much of a problem.
Which option do you like most?
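A minimal sketch of option (b), refreshing the cached diagonal at the end of every fit()/partial_fit() so that transform() can reuse it (the helper name is hypothetical, not part of the PR):

```python
import numpy as np
import scipy.sparse as sp

def refresh_idf_diag(df, n_samples):
    # hypothetical helper, run after df / n_samples are updated (option b);
    # the result would be stored, e.g. as self._idf_diag, and reused by transform()
    idf = np.log(n_samples / df) + 1.0       # assumes every term has df > 0
    n_features = df.shape[0]
    return sp.spdiags(idf, diags=0, m=n_features, n=n_features).tocsr()
```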
So if I understand the issue correctly,
- idf_ needs to be exposed and valid after each partial_fit since it's a public attribute
- to calculate idf_ at each iteration you need the df from the previous iteration and the n_samples
- in transform, to multiply X rows by idf_ we need to create a diagonal sparse matrix. We can do that either in partial_fit (and cache it) or directly in transform. The overhead would be the same, and in my opinion, it's largely a matter of whether we would a) do numerous partial_fit calls or b) numerous transform calls. I think a) is always true by design (and thus more likely), and I'm not sure if the possible use case of calling transform multiple times is worth adding an additional flag for caching idf_diag.
In any case, it might be worth actually benchmarking the overhead of creating sp.spdiags(self.idf_, ..) to see if it's anything that we need to worry about...
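A rough timing sketch of that overhead (the vocabulary size and repetition count below are arbitrary choices for illustration):

```python
import timeit

import numpy as np
import scipy.sparse as sp

n_features = 1000000          # on the order of the vocabularies discussed above
idf = np.random.rand(n_features) + 1.0

per_call = timeit.timeit(
    lambda: sp.spdiags(idf, diags=0, m=n_features, n=n_features).tocsr(),
    number=100) / 100
print("spdiags + tocsr: %.2f ms per transform call" % (per_call * 1e3))
```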
@n0mad Aww, so you actually did benchmarks above. Could you also test for e.g. running a transform on a single document?
Given that we're already doing an O(vocabulary) operation, I don't consider this an impediment to the PR either way.
    return self

def partial_fit(self, X, y=None):
    """Update the df vector (global term weights),
PEP257: one-line summary here
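For example, the docstring could open with a single summary line in PEP 257 style (the wording below is only a suggestion):

```python
def partial_fit(self, X, y=None):
    """Update the df vector (global term weights) with a new batch of documents."""
```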
tfidf_full = tr_full.fit_transform(X).toarray()

tr_partial = TfidfTransformer(smooth_idf=True, norm='l2')
tr_partial.fit([[1, 1, 1]])
No, we'd usually use partial_fit all the way through, not fit the first time.
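In other words, the test would call the proposed partial_fit for every batch, roughly like this (a sketch using the same data as the test above, and assuming the partial_fit added by this PR):

```python
from sklearn.feature_extraction.text import TfidfTransformer

X = [[1, 1, 1],
     [1, 1, 0],
     [1, 0, 0]]

tr_full = TfidfTransformer(smooth_idf=True, norm='l2')
tfidf_full = tr_full.fit_transform(X).toarray()

tr_partial = TfidfTransformer(smooth_idf=True, norm='l2')
for row in X:
    tr_partial.partial_fit([row])       # proposed API, not in released scikit-learn
tfidf_partial = tr_partial.transform(X).toarray()
# the two matrices should then agree up to floating point error
```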
I was wondering what the semantics would be, given that 'partial_fit cannot change the number of features' (see the issue discussion).
OK, will re-do
@n0mad When calling partial_fit all the way through, the use case would be to provide a fixed vocabulary at initialization (otherwise it wouldn't make sense indeed).
Generally, to do out-of-core text vectorization one would do the first pass on the dataset to estimate the vocabulary (but currently there is no way of doing that in scikit-learn without also computing the full Bag of Words matrix), then a second pass to actually compute the BoW matrix (though that's probably not compatible with sklearn API). Gensim does something like that I think, but I can't find the exact reference anymore...
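A rough sketch of that fixed-vocabulary, out-of-core flow (the vocabulary and batches below are made up; partial_fit here is the method proposed in this PR):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vocabulary = ["apple", "banana", "cherry"]            # estimated in a first pass
vectorizer = CountVectorizer(vocabulary=vocabulary)   # fixed, so n_features is constant
tfidf = TfidfTransformer()

batches = [["apple banana", "banana cherry"], ["apple apple cherry"]]
for batch in batches:
    counts = vectorizer.transform(batch)   # no fit needed with a fixed vocabulary
    tfidf.partial_fit(counts)              # proposed API: updates df and n_samples only
```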
We have to assume that the vectorizer stays constant, i.e. constant vocabulary size / n_features. Departing from that assumption would be a change in the input or the preceding vectorizer, not here in the transformer.
Any updates to make this happen? This is a pretty good feature to have, especially with large datasets.
This is exactly what I was looking for. This seems pretty stale, but any chance it will be incorporated?
@jnothman @rth I'd like to pick this up if it's still considered useful to the community? I implemented this feature internally at my company a few years ago and it's been running in production very well ever since. Based on this code in my personal repo. Would love the opportunity to contribute here.
Any news on this issue? It is still very relevant today and it would be very nice to have this feature natively in scikit-learn.
I think this partial_fit function looks really good and I'm looking forward to using this in scikit-learn soon. Thanks for your hard work :)
@adrinjalali Do you think this PR can be moved forward?
@OmarManzoor It seems that if we take the vocabulary as a constructor argument, then this can move forward. It would be nice if people here who find this feature useful could tell us whether that's a reasonable requirement for them.
Hi, and thank you @adrinjalali. Your suggestion sounds very reasonable to me.
@adrinjalali Sorry for the late reply, and thanks for the reasonable suggestion!
Reference Issue
Closes #7549
What does this implement/fix? Explain your changes.
This PR implements a partial_fit method for TfidfTransformer.
As discussed in the thread #7549 (comment), the number of features should not change after the partial_fit call, so I only update the DF. In order to do that, I slightly changed the logic of the TfidfTransformer: it maintains the vector of document frequencies (df) and the number of documents n_samples, not the actual idf vector. Instead, the idf vector is calculated when needed.
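A condensed, self-contained outline of that design, with idf_ derived on demand from the stored df and n_samples (the smoothing and attribute names follow the discussion above, but this is only a sketch, not the actual patch, and it omits normalization and input validation):

```python
import numpy as np
import scipy.sparse as sp

class PartialFitTfidf(object):
    """Outline of a TfidfTransformer-like object with partial_fit."""

    def __init__(self, smooth_idf=True):
        self.smooth_idf = smooth_idf
        self._df = None        # document frequency per term
        self._n_samples = 0    # number of documents seen so far

    def partial_fit(self, X):
        X = sp.csr_matrix(X)
        df = np.bincount(X.indices, minlength=X.shape[1]).astype(np.float64)
        if self._df is None:
            self._df = df
        elif df.shape[0] != self._df.shape[0]:
            raise ValueError("partial_fit cannot change the number of features")
        else:
            self._df += df
        self._n_samples += X.shape[0]
        return self

    @property
    def idf_(self):
        # the idf vector is recomputed from df and n_samples when needed
        smooth = int(self.smooth_idf)
        return np.log((self._n_samples + smooth) / (self._df + smooth)) + 1.0

    def transform(self, X):
        X = sp.csr_matrix(X, dtype=np.float64)
        n_features = X.shape[1]
        idf_diag = sp.spdiags(self.idf_, diags=0, m=n_features, n=n_features)
        return X * idf_diag    # scale each term column by its idf; l2 norm omitted
```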