Description
Context
Rather than a bug, I guess this would go as a sort of "enhancement proposal"?
I'm currently trying to persist a TfidfTransformer by saving its parameters in a MongoDB database and then rebuilding an identical copy from them. This technique works for CountVectorizer, but fails for TfidfTransformer, as there is no way to set idf_.
Is there an actual architectural reason why setting this attribute raises an error? If so, do you have an idea for a workaround? I obviously want to avoid keeping the matrix on which the transformer was fitted, as that would completely mess up the architecture: I believe the saving/loading process should be kept separate from the treatment/learning process, and keeping both would mean having to propagate a dirty return value.
Steps/Code to Reproduce
A working example with CountVectorizer:
# let us say that CountV is the previously built CountVectorizer
# that we want to recreate identically
from sklearn.feature_extraction.text import CountVectorizer

doc = ['some fake text that is fake to test the vectorizer']
c = CountVectorizer()  # note the parentheses: instantiate, don't alias the class
c.set_params(**CountV.get_params())
c.set_params(vocabulary=CountV.vocabulary_)

# now let us test whether they do the same conversion
m1 = CountV.transform(doc)
m2 = c.transform(doc)
print(m1.todense().tolist())  # just for visibility's sake
print(m2.todense().tolist())
# note: this code does what is expected
This might not seem very impressive, but dictionaries can be stored in MongoDB, which means you can restore the CountVectorizer (or at least an identical copy of it) by simply storing vocabulary_ and the output of get_params().
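For completeness, a minimal round-trip sketch, assuming a plain dict stands in for the MongoDB document (this is only an illustration: some params such as dtype, and numpy integer vocabulary values, may need casting before actual BSON serialization):

# store: plain dicts, so they can go straight into a MongoDB document
stored = {
    'params': CountV.get_params(),
    'vocabulary': {term: int(idx) for term, idx in CountV.vocabulary_.items()},
}

# restore: rebuild an equivalent vectorizer from the stored document
c = CountVectorizer()
c.set_params(**stored['params'])
c.set_params(vocabulary=stored['vocabulary'])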
Now the failing piece of code:
# let us say that TFtransformer is the previously computed transformer
from sklearn.feature_extraction.text import TfidfTransformer

t = TfidfTransformer()
t.set_params(**TFtransformer.get_params())

# now here comes the problem: two potential solutions, both fail
t.set_params(**{'idf': TFtransformer.idf_})  # error: 'idf' is not a valid parameter
t.idf_ = TFtransformer.idf_  # error: idf_ cannot be set directly
I would expect at least one of these to work; however, both raise an error.
- In the first case, this seems logical, as there is no idf/idf_ constructor parameter.
- In the second case, I suppose that encapsulation forbids setting the attribute directly.
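One possible workaround, as a sketch: rebuild the internal diagonal idf matrix that transform() actually uses. This relies on the private _idf_diag attribute (where fit() stores the idf values as a sparse diagonal matrix), so it is not a public API and may break between versions:

import numpy as np
import scipy.sparse as sp

t = TfidfTransformer()
t.set_params(**TFtransformer.get_params())

# rebuild the private diagonal idf matrix from the stored idf_ vector;
# _idf_diag is an internal attribute, so this may break between versions
idf = np.asarray(TFtransformer.idf_)
n_features = idf.shape[0]
t._idf_diag = sp.spdiags(idf, diags=0, m=n_features, n=n_features)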
I think that being able to reproduce a fitted object (even if only for non-classifier objects) without having to recompute it at each launch would benefit a lot of applications.
I'm currently developing a REST API that has to do heavy computation on the data before feeding it to the vectorizer; having to relearn the whole model on every run is very slow, and it means I currently have to wait up to half an hour to test modifications that are sometimes about one line of code.
Versions
Windows-10-10.0.10586
Python: 2.7.11 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]
NumPy: 1.11.0
SciPy: 0.17.1
Scikit-Learn: 0.17.1