
Correct TF-IDF formula in TfidfTransformer comments. #13054


Merged (4 commits) on Jan 29, 2019
29 changes: 15 additions & 14 deletions doc/modules/feature_extraction.rst
@@ -436,11 +436,12 @@ Using the ``TfidfTransformer``'s default settings,
 the term frequency, the number of times a term occurs in a given document,
 is multiplied with idf component, which is computed as

-:math:`\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1`,
+:math:`\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1`,

-where :math:`n_d` is the total number of documents, and :math:`\text{df}(d,t)`
-is the number of documents that contain term :math:`t`. The resulting tf-idf
-vectors are then normalized by the Euclidean norm:
+where :math:`n` is the total number of documents in the document set, and
+:math:`\text{df}(t)` is the number of documents in the document set that
+contain term :math:`t`. The resulting tf-idf vectors are then normalized by the
+Euclidean norm:

 :math:`v_{norm} = \frac{v}{||v||_2} = \frac{v}{\sqrt{v{_1}^2 +
 v{_2}^2 + \dots + v{_n}^2}}`.
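The corrected default formula (``smooth_idf=True``) and the Euclidean normalization are easy to check numerically. The sketch below is an editor's illustration in plain Python; the helper names are hypothetical, not scikit-learn API.

```python
import math

def idf_smooth(n, df):
    # smooth_idf=True default: idf(t) = log((1 + n) / (1 + df(t))) + 1
    return math.log((1 + n) / (1 + df)) + 1

def l2_normalize(v):
    # v_norm = v / ||v||_2
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# A term occurring in every one of n = 6 documents gets idf exactly 1:
# maximally down-weighted, but never zeroed out.
print(idf_smooth(6, 6))          # -> 1.0
print(l2_normalize([3.0, 4.0]))  # -> [0.6, 0.8]
```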
@@ -455,14 +456,14 @@ computed in scikit-learn's :class:`TfidfTransformer`
 and :class:`TfidfVectorizer` differ slightly from the standard textbook
 notation that defines the idf as

-:math:`\text{idf}(t) = log{\frac{n_d}{1+\text{df}(d,t)}}.`
+:math:`\text{idf}(t) = \log{\frac{n}{1+\text{df}(t)}}.`


 In the :class:`TfidfTransformer` and :class:`TfidfVectorizer`
 with ``smooth_idf=False``, the
 "1" count is added to the idf instead of the idf's denominator:

-:math:`\text{idf}(t) = log{\frac{n_d}{\text{df}(d,t)}} + 1`
+:math:`\text{idf}(t) = \log{\frac{n}{\text{df}(t)}} + 1`

 This normalization is implemented by the :class:`TfidfTransformer`
 class::
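Moving the "1" from the denominator (textbook form) to outside the log (scikit-learn's ``smooth_idf=False`` form) matters most for ubiquitous terms, which the two definitions treat quite differently. A small pure-Python check (the helper names are the editor's, not the library's):

```python
import math

def idf_sklearn(n, df):
    # scikit-learn with smooth_idf=False: idf(t) = log(n / df(t)) + 1
    return math.log(n / df) + 1

def idf_textbook(n, df):
    # standard textbook form: idf(t) = log(n / (1 + df(t)))
    return math.log(n / (1 + df))

# For a term that appears in all 6 of 6 documents:
print(idf_sklearn(6, 6))   # -> 1.0, the term is kept with weight 1
print(idf_textbook(6, 6))  # log(6/7), slightly negative
```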
@@ -509,21 +510,21 @@ v{_2}^2 + \dots + v{_n}^2}}`
 For example, we can compute the tf-idf of the first term in the first
 document in the `counts` array as follows:

-:math:`n_{d} = 6`
+:math:`n = 6`

-:math:`\text{df}(d, t)_{\text{term1}} = 6`
+:math:`\text{df}(t)_{\text{term1}} = 6`

-:math:`\text{idf}(d, t)_{\text{term1}} =
-log \frac{n_d}{\text{df}(d, t)} + 1 = log(1)+1 = 1`
+:math:`\text{idf}(t)_{\text{term1}} =
+\log \frac{n}{\text{df}(t)} + 1 = \log(1)+1 = 1`

 :math:`\text{tf-idf}_{\text{term1}} = \text{tf} \times \text{idf} = 3 \times 1 = 3`

 Now, if we repeat this computation for the remaining 2 terms in the document,
 we get

-:math:`\text{tf-idf}_{\text{term2}} = 0 \times (log(6/1)+1) = 0`
+:math:`\text{tf-idf}_{\text{term2}} = 0 \times (\log(6/1)+1) = 0`

-:math:`\text{tf-idf}_{\text{term3}} = 1 \times (log(6/2)+1) \approx 2.0986`
+:math:`\text{tf-idf}_{\text{term3}} = 1 \times (\log(6/2)+1) \approx 2.0986`

 and the vector of raw tf-idfs:
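The hand computation above, including the raw tf-idf vector for the first document, can be reproduced in plain Python. This sketch assumes the 6-document ``counts`` example from the surrounding docs and ``smooth_idf=False``; it is an editor's illustration, not library code.

```python
import math

counts = [[3, 0, 1],   # 6 documents, 3 terms, as in the docs' example
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

n = len(counts)                               # n = 6 documents
df = [sum(1 for doc in counts if doc[t] > 0)  # df(t): documents containing term t
      for t in range(3)]                      # -> [6, 1, 2]

# smooth_idf=False: idf(t) = log(n / df(t)) + 1
idf = [math.log(n / d) + 1 for d in df]

tf = counts[0]                                # first document: [3, 0, 1]
tfidf = [t * i for t, i in zip(tf, idf)]
print([round(x, 4) for x in tfidf])           # -> [3.0, 0.0, 2.0986]
```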

@@ -540,12 +541,12 @@ Furthermore, the default parameter ``smooth_idf=True`` adds "1" to the numerator
 and denominator as if an extra document was seen containing every term in the
 collection exactly once, which prevents zero divisions:

-:math:`\text{idf}(t) = log{\frac{1 + n_d}{1+\text{df}(d,t)}} + 1`
+:math:`\text{idf}(t) = \log{\frac{1 + n}{1+\text{df}(t)}} + 1`

 Using this modification, the tf-idf of the third term in document 1 changes to
 1.8473:

-:math:`\text{tf-idf}_{\text{term3}} = 1 \times log(7/3)+1 \approx 1.8473`
+:math:`\text{tf-idf}_{\text{term3}} = 1 \times \log(7/3)+1 \approx 1.8473`

 And the L2-normalized tf-idf changes to
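The smoothed value can be verified the same way, along with the resulting L2-normalized vector (an editor's sketch; the document frequencies are taken from the docs' ``counts`` example):

```python
import math

n = 6
df = [6, 1, 2]   # document frequencies in the docs' ``counts`` example
tf = [3, 0, 1]   # term counts of the first document

# smooth_idf=True: idf(t) = log((1 + n) / (1 + df(t))) + 1
idf = [math.log((1 + n) / (1 + d)) + 1 for d in df]
tfidf = [t * i for t, i in zip(tf, idf)]
print(round(tfidf[2], 4))         # -> 1.8473

norm = math.sqrt(sum(x * x for x in tfidf))
l2 = [x / norm for x in tfidf]
print([round(x, 4) for x in l2])  # -> [0.8515, 0.0, 0.5243]
```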
23 changes: 12 additions & 11 deletions sklearn/feature_extraction/text.py
@@ -1146,17 +1146,18 @@ class TfidfTransformer(BaseEstimator, TransformerMixin):
 informative than features that occur in a small fraction of the training
 corpus.

-The formula that is used to compute the tf-idf of term t is
-tf-idf(d, t) = tf(t) * idf(d, t), and the idf is computed as
-idf(d, t) = log [ n / df(d, t) ] + 1 (if ``smooth_idf=False``),
-where n is the total number of documents and df(d, t) is the
-document frequency; the document frequency is the number of documents d
-that contain term t. The effect of adding "1" to the idf in the equation
-above is that terms with zero idf, i.e., terms that occur in all documents
-in a training set, will not be entirely ignored.
-(Note that the idf formula above differs from the standard
-textbook notation that defines the idf as
-idf(d, t) = log [ n / (df(d, t) + 1) ]).
+The formula that is used to compute the tf-idf for a term t of a document d
+in a document set is tf-idf(t, d) = tf(t, d) * idf(t), and the idf is
+computed as idf(t) = log [ n / df(t) ] + 1 (if ``smooth_idf=False``), where
+n is the total number of documents in the document set and df(t) is the
+document frequency of t; the document frequency is the number of documents
+in the document set that contain the term t. The effect of adding "1" to
+the idf in the equation above is that terms with zero idf, i.e., terms
+that occur in all documents in a training set, will not be entirely
+ignored.
+(Note that the idf formula above differs from the standard textbook
+notation that defines the idf as
+idf(t) = log [ n / (df(t) + 1) ]).

 If ``smooth_idf=True`` (the default), the constant "1" is added to the
 numerator and denominator of the idf as if an extra document was seen
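The docstring's two idf variants can be condensed into a single pure-Python helper. This is the editor's sketch of the formulas only, not the class's actual implementation, which operates on sparse matrices.

```python
import math

def idf(n, df, smooth=True):
    # smooth_idf adds "1" to numerator and denominator, as if one extra
    # document contained every term in the collection exactly once.
    if smooth:
        return math.log((1 + n) / (1 + df)) + 1
    # smooth_idf=False: idf(t) = log(n / df(t)) + 1
    return math.log(n / df) + 1

# In a 6-document set, a rare term (df = 1) is weighted well above
# a ubiquitous one (df = 6), under either setting:
print(round(idf(6, 1), 4))                # -> 2.2528
print(idf(6, 6))                          # -> 1.0
print(round(idf(6, 1, smooth=False), 4))  # -> 2.7918
```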