Skip to content

Use a stable sorting algorithm when selecting the max_features in TfidfVectorizer #21446

Open
@mesejo

Description

@mesejo

Describe the workflow you want to enable

Currently when selecting max_features in TfidfVectorizers, the algorithm (line 1172) uses numpy.argsort with the quicksort enabled (the default argument).

Given that quicksort is not stable, this leads to inconsistent behavior across different datasets, for example:

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # ['aaa' 'aab' 'aac' 'aad' 'aae']

As all the features have the same frequency they are return in lexicographic order, this does not happens for the following dataset:

corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF', 'AAG', 'ABA', 'ABB', 'ABC', 'ABD', 'ACA', 'ACB', 'ADA', 'AEA',
          'AFA', 'BAA']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # ['aaa' 'aca' 'acb' 'ada' 'aea']

I would expect the result to be the same as the first example but is not.

Describe your proposed solution

Since 1.15.0 the option stable was added to numpy.argsort, so simply changing the line to:

mask_inds = (-tfs[mask]).argsort(kind="stable")[:limit]

should be enough to solve the problem.

Describe alternatives you've considered, if relevant

As an alternative use the option mergesort that is stable

mask_inds = (-tfs[mask]).argsort(kind="mergesort")[:limit]

Additional context

sklearn.__version__
'1.0'

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions