Setting idf_ is impossible #7102
I'd be happy to see a setter for `idf_`.
> I'd be happy to see a setter for idf_.

But that's not what the `TfidfTransformer` uses internally. Shouldn't the user be setting the `_idf_diag` matrix instead? I agree that setting a private attribute is ugly, so the question is: shouldn't we make it public?
What's wrong with providing a setter function that transforms to `_idf_diag`?
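A minimal sketch of what such a setter could look like, on a hypothetical stand-in class (`TfidfLike` is made up for self-containment; this is not scikit-learn source): the length of the idf vector already gives `n_features`, so `_idf_diag` can be rebuilt from the vector alone.

```python
import numpy as np
import scipy.sparse as sp

class TfidfLike:
    # Hypothetical stand-in for TfidfTransformer, sketching the proposed
    # pattern: idf_ is derived from the private _idf_diag on read, and
    # rebuilds _idf_diag on write.

    @property
    def idf_(self):
        # The idf vector is the diagonal of the stored sparse matrix.
        return np.ravel(self._idf_diag.sum(axis=0))

    @idf_.setter
    def idf_(self, value):
        value = np.asarray(value, dtype=np.float64)
        n_features = value.shape[0]  # recoverable from the vector itself
        self._idf_diag = sp.spdiags(value, diags=0, m=n_features,
                                    n=n_features, format='csr')
```

With this, assigning `t.idf_ = [1.0, 2.0, 3.0]` and then reading `t.idf_` round-trips the vector.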
As the setter for `_idf_diag` is defined, this does give me a workaround for my problem. It leads me to two new questions.
Thank you for the quick response.

+1 on what @jnothman said.

Hey, I'd be interested in working on this one.

Sure, go ahead @manu-chroma :)
Sorry for not getting back earlier.

```python
# Let us say that CountV is the previously built CountVectorizer that we
# want to recreate identically.
from sklearn.feature_extraction.text import CountVectorizer

doc = ['some fake text that is fake to test the vectorizer']
c = CountVectorizer()  # must be instantiated, not the bare class
c.set_params(**CountV.get_params())
c.set_params(**{'vocabulary': CountV.vocabulary_})
```

Can someone elaborate on how the […]
Hi @manu-chroma,

Notice how in […]. Doing the same thing to […]
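The inline code references in this comment were lost; as a generic illustration of the `@property`/setter pattern being discussed (illustrative `Example` class only, not scikit-learn code):

```python
class Example:
    def __init__(self):
        self._data = None

    @property
    def data_(self):
        # Reading obj.data_ calls this method; the public attribute is
        # derived from private state instead of being stored directly.
        return self._data

    @data_.setter
    def data_(self, value):
        # Assigning obj.data_ = ... calls this method, which can validate
        # or transform the value before updating the private state.
        self._data = list(value)
```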
Hey @nelson-liu, thanks so much for explaining. I didn't know about […]. So essentially, a new setter function would be added to the `TfidfTransformer` class?
I looked through this, and it isn't really possible to add a setter for `idf_` that updates `_idf_diag`, @manu-chroma @nelson-liu @amueller @jnothman. The reason is that when we set `_idf_diag` we use `n_features`, which comes from `X` in the `fit` function (text.py:1018). So we would either have to keep track of `n_features` and add it to the setter, or move over to setting `_idf_diag` directly, as @GaelVaroquaux suggested. I would suggest we set `_idf_diag` directly, since we aren't really keeping the state necessary for `idf_`, but that's just my humble opinion.

*Note: when we get `idf_`, we are actually deriving it from `_idf_diag` (text.py:1069).*

```python
def test_copy_idf__tf():
    counts = [[3, 0, 1],
              [2, 0, 0],
              [3, 0, 0],
              [4, 0, 0],
              [3, 2, 0],
              [3, 0, 2]]
    t1 = TfidfTransformer()
    t2 = TfidfTransformer()
    t1.fit_transform(counts)
    t2.set_params(**t1.get_params())
    # t2.set_params(**{'idf': t1.idf_})  # ValueError: 'idf' is not a parameter
    t2.idf_ = t1.idf_  # AttributeError: idf_ is a property with no setter
```
Surely the number of features is already present in the value of `_idf_diag`?
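Building on that observation, a sketch of the workaround as it stands on the scikit-learn versions discussed in this thread (assigning the private `_idf_diag` attribute directly bypasses the missing setter; `counts` is made-up data):

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

counts = [[3, 0, 1],
          [2, 0, 0]]
t1 = TfidfTransformer()
t1.fit(counts)

# n_features is just the length of the idf vector, so nothing beyond
# idf_ needs to be persisted.
idf = t1.idf_
n_features = idf.shape[0]

t2 = TfidfTransformer(**t1.get_params())
t2._idf_diag = sp.spdiags(idf, diags=0, m=n_features,
                          n=n_features, format='csr')

# The rebuilt transformer produces the same output as the original.
np.testing.assert_array_almost_equal(t1.transform(counts).toarray(),
                                     t2.transform(counts).toarray())
```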
Context
Rather than a bug, I guess this would count as a sort of "enhancement proposal"?
I'm currently trying to persist a `TfidfTransformer` by saving its parameters in a MongoDB database and then rebuilding an identical copy. This technique works for `CountVectorizer` but simply fails for `TfidfTransformer`, as there is no way to set `idf_`.

Is there any actual architectural reason why setting this attribute raises an error? If yes, do you have an idea for a workaround? I obviously want to avoid keeping the matrix on which it was fitted, as that would completely mess up the architecture (I believe that the saving/loading process should be separated from the treatment/learning process, and keeping both would mean having to propagate a dirty return value).
Steps/Code to Reproduce
A functioning example with `CountVectorizer`:
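The code block itself did not survive extraction; the following is reconstructed from the copy of it quoted earlier in the thread (`CountV` stands for the previously fitted vectorizer):

```python
# CountV is the previously built CountVectorizer to recreate identically.
from sklearn.feature_extraction.text import CountVectorizer

doc = ['some fake text that is fake to test the vectorizer']
CountV = CountVectorizer()
CountV.fit(doc)

c = CountVectorizer()
c.set_params(**CountV.get_params())
c.set_params(**{'vocabulary': CountV.vocabulary_})
```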
This might not seem very impressive, but dictionaries can be stored inside MongoDB databases, which means that you can restore the `CountVectorizer` (or at least an identical copy of it) by simply storing `vocabulary_` and the output of `get_params()`.

Now the offending piece of code:
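This code block was also lost in extraction; a sketch of the two attempts described below, assuming they matched the test shown earlier in the thread (`tf1`/`tf2` names are made up):

```python
from sklearn.feature_extraction.text import TfidfTransformer

counts = [[3, 0, 1],
          [2, 0, 0]]
tf1 = TfidfTransformer()
tf1.fit(counts)

tf2 = TfidfTransformer()
tf2.set_params(**tf1.get_params())

# Run each attempt separately; both raise on 0.17.
tf2.set_params(**{'idf': tf1.idf_})  # ValueError: not a constructor parameter
tf2.idf_ = tf1.idf_                  # AttributeError: read-only property
```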
I would expect at least one of these to work; however, both raise an error.
I think that being able to reproduce a fitted object (even if only for non-classifier objects) without having to recompute it at each launch would benefit a lot of applications.

I'm currently developing a REST API that has to do heavy computations on data before feeding it to the vectorizer. Relearning the whole model on every run is very slow, and means I currently have to wait up to half an hour for modifications that are sometimes about one line of code.
Versions
```
Windows-10-10.0.10586
('Python', '2.7.11 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]')
('NumPy', '1.11.0')
('SciPy', '0.17.1')
('Scikit-Learn', '0.17.1')
```