
Setting idf_ is impossible #7102

@Sbelletier

Description

Context

Rather than a bug, I suppose this is more of an enhancement proposal.

I'm currently trying to persist a TfidfTransformer by saving its parameters in a MongoDB database and then rebuilding an identical copy from them. This technique works for CountVectorizer but simply fails for TfidfTransformer, as there is no way to set idf_.
Is there an actual architectural reason why setting this attribute raises an error? If so, do you have an idea for a workaround (one possibility is sketched after the error description below)? I obviously want to avoid keeping the matrix the transformer was fitted on, as that would completely mess up the architecture: I believe the saving/loading process should be kept separate from the treatment/learning process, and keeping both would mean having to propagate a dirty return value.

Steps/Code to Reproduce

A functioning example with CountVectorizer:

#let us say that CountV is the previously fitted CountVectorizer that we want to recreate identically
from sklearn.feature_extraction.text import CountVectorizer

doc = ['some fake text that is fake to test the vectorizer']

c = CountVectorizer()  # must instantiate the class, not just reference it
c.set_params(**CountV.get_params())
c.set_params(**{'vocabulary': CountV.vocabulary_})
# Now let us test whether they do the same conversion
m1 = CountV.transform(doc)
m2 = c.transform(doc)
print(m1.todense().tolist())  # just for visibility's sake
print(m2.todense().tolist())
# Note: this code does what is expected

This might not seem very impressive, but dictionaries can be stored in a MongoDB database, which means you can restore the CountVectorizer (or at least an identical copy of it) simply by storing vocabulary_ and the output of get_params().
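As a minimal sketch of that round trip (to_record and from_record are hypothetical helper names, and non-primitive parameters such as the dtype class would need extra handling before the record is BSON-serializable):

from sklearn.feature_extraction.text import CountVectorizer

def to_record(vectorizer):
    # everything needed to rebuild an identical copy, as a plain dict
    return {'params': vectorizer.get_params(),
            'vocabulary': vectorizer.vocabulary_}

def from_record(record):
    # rebuild the vectorizer from a previously stored record
    v = CountVectorizer()
    v.set_params(**record['params'])
    v.set_params(vocabulary=record['vocabulary'])
    return v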

Now, the offending piece of code:

#let us say that TFtransformer is the previously fitted transformer
from sklearn.feature_extraction.text import TfidfTransformer

t = TfidfTransformer()
t.set_params(**TFtransformer.get_params())
# Now here comes the problem: two potential solutions, both failing
t.set_params(**{'idf': TFtransformer.idf_})  # raises ValueError: invalid parameter
t.idf_ = TFtransformer.idf_  # raises AttributeError: can't set attribute

I would expect at least one of these to work; however, both raise an error.

  • In the first case, the failure seems logical, as there is no idf/idf_ constructor parameter
  • In the second case, I suppose that encapsulation forbids the direct setting (idf_ is a read-only property)
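
The kind of workaround I have in mind is sketched below. It rebuilds the fitted state through the private _idf_diag attribute that idf_ is derived from; since this relies on a private attribute it may break between scikit-learn releases, and rebuild_tfidf is just a hypothetical helper name.

import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfTransformer

def rebuild_tfidf(params, idf):
    # params: output of get_params(); idf: the stored idf_ array
    t = TfidfTransformer()
    t.set_params(**params)
    n_features = len(idf)
    # idf_ is a read-only property computed from the private _idf_diag
    # diagonal matrix, so populate that matrix directly (private API)
    t._idf_diag = sp.spdiags(idf, diags=0, m=n_features, n=n_features)
    return t

With that, the rebuilt transformer's transform() should behave like the original, since idf_ itself is recomputed from _idf_diag.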

I think that being able to reproduce a fitted object (even if only for non-classifier objects) without having to recompute it at each launch would benefit a lot of applications.
I'm currently developing a REST API that has to do heavy computations on data before feeding it to the vectorizer; having to relearn the whole model on every run is very slow, and means I currently have to wait up to half an hour for modifications that are sometimes about one line of code.

Versions

Windows-10-10.0.10586
Python: 2.7.11 |Continuum Analytics, Inc.| (default, Feb 16 2016, 09:58:36) [MSC v.1500 64 bit (AMD64)]
NumPy: 1.11.0
SciPy: 0.17.1
Scikit-Learn: 0.17.1
