Open
Description
Describe the workflow you want to enable
Currently when selecting max_features
in TfidfVectorizers, the algorithm (line 1172) uses numpy.argsort with the quicksort
enabled (the default argument).
Given that quicksort is not stable, this leads to inconsistent behavior across different datasets, for example:
corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out()) # ['aaa' 'aab' 'aac' 'aad' 'aae']
As all the features have the same frequency they are return in lexicographic order, this does not happens for the following dataset:
corpus = ['AAA', 'AAB', 'AAC', 'AAD', 'AAE', 'AAF', 'AAG', 'ABA', 'ABB', 'ABC', 'ABD', 'ACA', 'ACB', 'ADA', 'AEA',
'AFA', 'BAA']
vectorizer = TfidfVectorizer(smooth_idf=False, max_features=5)
vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out()) # ['aaa' 'aca' 'acb' 'ada' 'aea']
I would expect the result to be the same as the first example but is not.
Describe your proposed solution
Since 1.15.0
the option stable was added to numpy.argsort, so simply changing the line to:
mask_inds = (-tfs[mask]).argsort(kind="stable")[:limit]
should be enough to solve the problem.
Describe alternatives you've considered, if relevant
As an alternative use the option mergesort
that is stable
mask_inds = (-tfs[mask]).argsort(kind="mergesort")[:limit]
Additional context
sklearn.__version__
'1.0'