The default token pattern in CountVectorizer breaks Indic sentences into non-sensical tokens #30935

Open
skesiraju opened this issue Mar 3, 2025 · 3 comments · May be fixed by #31130


skesiraju commented Mar 3, 2025

Describe the bug

The default token_pattern in CountVectorizer is r"(?u)\b\w\w+\b", which tokenizes Indic text incorrectly: it breaks whitespace-separated words into multiple chunks and even omits several valid characters. The resulting vocabulary doesn't make any sense!

Is this the expected behaviour?

Sample code is pasted in the sections below

Steps/Code to Reproduce

import sklearn
from sklearn.feature_extraction.text import CountVectorizer

tel = ["ప్రధానమంత్రిని కలుసుకున్నారు"]
hin = ["आधुनिक मानक हिन्दी"]
eng = ["They met the Prime Minister"]

cvect = CountVectorizer(
    ngram_range=(1, 1),
    max_features=None,
    min_df=1,
    strip_accents=None,
)
cvect.fit(tel + hin + eng)
print(cvect.vocabulary_)

Expected Results

{'ప్రధానమంత్రిని': 9, 'కలుసుకున్నారు': 8, 'आधुनिक': 5, 'मानक': 6, 'हिन्दी': 7, 'they': 4, 'met': 0, 'the': 3, 'prime': 2, 'minister': 1}

Actual Results

{'రధ': 9, 'నమ': 8, 'కల': 7, 'आध': 5, 'नक': 6, 'they': 4, 'met': 0, 'the': 3, 'prime': 2, 'minister': 1}

Versions

System:
    python: 3.11.10 (main, Oct  3 2024, 07:29:13) [GCC 11.2.0]
executable: miniconda3/envs/lolm/bin/python
   machine: Linux-6.1.0-25-amd64-x86_64-with-glibc2.36

Python dependencies:
      sklearn: 1.5.2
          pip: 24.2
   setuptools: 75.1.0
        numpy: 1.26.0
        scipy: 1.14.1
       Cython: None
       pandas: 2.2.3
   matplotlib: 3.9.2
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
    num_threads: 1
         prefix: libmkl_rt
       filepath: miniconda3/envs/lolm/lib/libmkl_rt.so.2
        version: 2023.1-Product
threading_layer: gnu

       user_api: openmp
   internal_api: openmp
    num_threads: 1
         prefix: libgomp
       filepath: miniconda3/envs/lolm/lib/libgomp.so.1.0.0
        version: None
@skesiraju skesiraju added Bug Needs Triage Issue requires triage labels Mar 3, 2025
betatim commented Mar 3, 2025

Thanks for reporting this. It is an interesting one. It is surprising to see the tokenisation fail like this and I am not quite sure why. The default token pattern is roughly "word boundary, two or more word characters, word boundary". A word boundary is the transition from a \W to a \w (or the reverse), with \W being a "not word character" and \w being a "word character". What is and isn't a word character seems to boil down to https://docs.python.org/3/library/stdtypes.html#str.isalpha ("Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”.")

So overall I'd have thought this should work.
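That assumption can be probed directly with the stdlib (a minimal check; the two code points are taken from the Telugu example in the report):

```python
import re
import unicodedata

# Two code points from the Telugu word above: a base consonant (category Lo)
# and the combining virama sign (category Mn).
letter = "\u0C30"  # TELUGU LETTER RA
virama = "\u0C4D"  # TELUGU SIGN VIRAMA

print(unicodedata.category(letter), bool(re.fullmatch(r"\w", letter)))  # Lo True
print(unicodedata.category(virama), bool(re.fullmatch(r"\w", virama)))
```

On the reporter's Python 3.11 the second line prints `Mn False`, i.e. the combining sign is not a word character, which puts a `\b` boundary in the middle of the word (whether `\w` matches combining marks may vary across Python versions).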

Interestingly, with token_pattern=r"\S\S+" the splitting works better. But it includes punctuation characters, so something like "They met: a prime minister" becomes ['they', 'met:', 'prime', 'minister'] (the : doesn't get removed, and 'a' is dropped because the pattern requires at least two characters).

I don't know enough about non-Latin scripts to understand if \w should work or not :-/
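The `\S\S+` behaviour described above can be reproduced with plain `re` (a minimal sketch of the pattern matching that CountVectorizer does internally on the lowercased text):

```python
import re

pattern = re.compile(r"\S\S+")

# The whole Telugu word survives, because \S does not care about Unicode
# categories -- anything that is not whitespace matches.
print(pattern.findall("ప్రధానమంత్రిని కలుసుకున్నారు"))
# ['ప్రధానమంత్రిని', 'కలుసుకున్నారు']

# But punctuation sticks to the neighbouring token, and single characters
# such as 'a' are dropped because the pattern requires two or more:
print(pattern.findall("they met: a prime minister"))
# ['they', 'met:', 'prime', 'minister']
```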

betatim commented Mar 4, 2025

Looking at what characters are in the string and which category they belong to, I think I now understand why the string gets split up in weird ways:

import unicodedata

for c in tel[0]:
    print(f"'{c}' is in category '{unicodedata.category(c)}' and is called '{unicodedata.name(c)}'")

Prints out:

'ప' is in category 'Lo' and is called 'TELUGU LETTER PA'
'్' is in category 'Mn' and is called 'TELUGU SIGN VIRAMA'
'ర' is in category 'Lo' and is called 'TELUGU LETTER RA'
'ధ' is in category 'Lo' and is called 'TELUGU LETTER DHA'
'ా' is in category 'Mn' and is called 'TELUGU VOWEL SIGN AA'
'న' is in category 'Lo' and is called 'TELUGU LETTER NA'
'మ' is in category 'Lo' and is called 'TELUGU LETTER MA'
'ం' is in category 'Mc' and is called 'TELUGU SIGN ANUSVARA'
'త' is in category 'Lo' and is called 'TELUGU LETTER TA'
'్' is in category 'Mn' and is called 'TELUGU SIGN VIRAMA'
'ర' is in category 'Lo' and is called 'TELUGU LETTER RA'
'ి' is in category 'Mn' and is called 'TELUGU VOWEL SIGN I'
'న' is in category 'Lo' and is called 'TELUGU LETTER NA'
'ి' is in category 'Mn' and is called 'TELUGU VOWEL SIGN I'
' ' is in category 'Zs' and is called 'SPACE'
'క' is in category 'Lo' and is called 'TELUGU LETTER KA'
'ల' is in category 'Lo' and is called 'TELUGU LETTER LA'
'ు' is in category 'Mc' and is called 'TELUGU VOWEL SIGN U'
'స' is in category 'Lo' and is called 'TELUGU LETTER SA'
'ు' is in category 'Mc' and is called 'TELUGU VOWEL SIGN U'
'క' is in category 'Lo' and is called 'TELUGU LETTER KA'
'ు' is in category 'Mc' and is called 'TELUGU VOWEL SIGN U'
'న' is in category 'Lo' and is called 'TELUGU LETTER NA'
'్' is in category 'Mn' and is called 'TELUGU SIGN VIRAMA'
'న' is in category 'Lo' and is called 'TELUGU LETTER NA'
'ా' is in category 'Mn' and is called 'TELUGU VOWEL SIGN AA'
'ర' is in category 'Lo' and is called 'TELUGU LETTER RA'
'ు' is in category 'Mc' and is called 'TELUGU VOWEL SIGN U'

Characters in the Mn (and, per the listing above, Mc) categories get treated as "not word characters", which puts word boundaries in the middle of a word. Maybe this is a case where the default pattern doesn't make sense and you need to use a custom pattern. If this is common, we should add some documentation or an example to help others figure this out.
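One possible custom tokenizer along those lines (a hedged sketch, not an official recommendation: it maps punctuation and symbol categories to spaces and otherwise splits on whitespace, so combining marks stay attached to their base letters; the name `unicode_word_tokenizer` is made up here):

```python
import unicodedata

def unicode_word_tokenizer(text):
    # Replace punctuation (P*) and symbols (S*) with spaces, then split on
    # whitespace; combining marks (Mn/Mc) stay attached to their letters.
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    # Mirror the default pattern's "two or more characters" rule.
    return [tok for tok in cleaned.split() if len(tok) > 1]

print(unicode_word_tokenizer("ప్రధానమంత్రిని కలుసుకున్నారు!"))
# ['ప్రధానమంత్రిని', 'కలుసుకున్నారు']
```

This could be passed as `CountVectorizer(tokenizer=unicode_word_tokenizer)`; when `tokenizer` is given, the `token_pattern` is not used (scikit-learn warns about the unused parameter), while lowercasing still happens in the preprocessing step.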

@ogrisel ogrisel added help wanted and removed Needs Triage Issue requires triage labels Mar 4, 2025
@andrelrodriguess
/take

andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
The default token_pattern r'(?u)\b\w\w+\b' splits Indic scripts (e.g., Telugu, Hindi) into incomplete tokens. This commit adds an optional _indic_tokenizer that removes punctuation and tokenizes on non-whitespace sequences (\S+), preserving full words in Unicode-heavy scripts. A multilingual test is added to validate the fix.

Signed-off-by: Andre Rodrigues <andre.lourenco.rodrigues@tecnico.ulisboa.pt>
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
The default r'(?u)\b\w\w+\b' splits Indic scripts (e.g.,
Telugu, Hindi) into partial tokens. This adds an optional
_indic_tokenizer to remove punctuation and tokenize on
non-whitespace (\S+), keeping full Unicode words. A test
is added to validate it.

Signed-off-by: Andre Rodrigues <andre.lourenco.rodrigues@tecnico.ulisboa.pt>
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
Signed-off-by: Andre Rodrigues <andre.lourenco.rodrigues@tecnico.ulisboa.pt>
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 30, 2025
@andrelrodriguess andrelrodriguess linked a pull request Apr 2, 2025 that will close this issue
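The approach those commit messages describe ("removes punctuation and tokenizes on non-whitespace (\S+)") can be sketched roughly as follows; this is a hypothetical reconstruction from the commit text only, not the actual code of #31130, and the real `_indic_tokenizer` may differ:

```python
import re
import string

# Replace ASCII punctuation with spaces, then take runs of non-whitespace
# (\S+), so combining marks stay attached to their base consonants.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

def indic_tokenizer(text):
    return re.findall(r"\S+", _PUNCT.sub(" ", text))

print(indic_tokenizer("ప్రధానమంత్రిని కలుసుకున్నారు."))
# ['ప్రధానమంత్రిని', 'కలుసుకున్నారు']
```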