The default token pattern in CountVectorizer breaks Indic sentences into non-sensical tokens #30935

Open
skesiraju opened this issue Mar 3, 2025 · 3 comments · May be fixed by #31130


skesiraju commented Mar 3, 2025

Describe the bug

The default token_pattern in CountVectorizer is r"(?u)\b\w\w+\b", which tokenizes Indic text incorrectly: it breaks whitespace-separated words into multiple chunks and even omits several valid characters. The resulting vocabulary doesn't make any sense!

Is this the expected behaviour?

Sample code is pasted in the sections below

Steps/Code to Reproduce

import sklearn
from sklearn.feature_extraction.text import CountVectorizer

tel = ["ప్రధానమంత్రిని కలుసుకున్నారు"]
hin = ["आधुनिक मानक हिन्दी"]
eng = ["They met the Prime Minister"]

cvect = CountVectorizer(
    ngram_range=(1, 1),
    max_features=None,
    min_df=1,
    strip_accents=None,
)
cvect.fit(tel + hin + eng)
print(cvect.vocabulary_)

Expected Results

{'ప్రధానమంత్రిని': 9, 'కలుసుకున్నారు': 8, 'आधुनिक': 5, 'मानक': 6, 'हिन्दी': 7, 'they': 4, 'met': 0, 'the': 3, 'prime': 2, 'minister': 1}

Actual Results

{'రధ': 9, 'నమ': 8, 'కల': 7, 'आध': 5, 'नक': 6, 'they': 4, 'met': 0, 'the': 3, 'prime': 2, 'minister': 1}

Versions

System:
    python: 3.11.10 (main, Oct  3 2024, 07:29:13) [GCC 11.2.0]
executable: miniconda3/envs/lolm/bin/python
   machine: Linux-6.1.0-25-amd64-x86_64-with-glibc2.36

Python dependencies:
      sklearn: 1.5.2
          pip: 24.2
   setuptools: 75.1.0
        numpy: 1.26.0
        scipy: 1.14.1
       Cython: None
       pandas: 2.2.3
   matplotlib: 3.9.2
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: mkl
    num_threads: 1
         prefix: libmkl_rt
       filepath: miniconda3/envs/lolm/lib/libmkl_rt.so.2
        version: 2023.1-Product
threading_layer: gnu

       user_api: openmp
   internal_api: openmp
    num_threads: 1
         prefix: libgomp
       filepath: miniconda3/envs/lolm/lib/libgomp.so.1.0.0
        version: None
@skesiraju skesiraju added Bug Needs Triage Issue requires triage labels Mar 3, 2025
betatim commented Mar 3, 2025

Thanks for reporting this. It is an interesting one. It is surprising to see the tokenisation fail like this and I am not quite sure why. The default token pattern is roughly "word boundary, two or more word characters, word boundary". A word boundary is the transition from a \W to a \w (or the reverse), with \W being a "not word character" and \w being a "word character". What is and isn't a word character seems to boil down to https://docs.python.org/3/library/stdtypes.html#str.isalpha ("Alphabetic characters are those characters defined in the Unicode character database as “Letter”, i.e., those with general category property being one of “Lm”, “Lt”, “Lu”, “Ll”, or “Lo”.")

So overall I'd have thought this should work.
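That assumption can be probed directly with the stdlib (a minimal check; the two code points are taken from the Telugu example in the report):

```python
import re
import unicodedata

# Two code points from the Telugu word above: a base consonant (category Lo)
# and the combining virama sign (category Mn).
letter = "\u0C30"  # TELUGU LETTER RA
virama = "\u0C4D"  # TELUGU SIGN VIRAMA

print(unicodedata.category(letter), bool(re.fullmatch(r"\w", letter)))  # Lo True
print(unicodedata.category(virama), bool(re.fullmatch(r"\w", virama)))
```

On the reporter's Python 3.11 the second line prints `Mn False`, i.e. the combining sign is not a word character, which puts a `\b` boundary in the middle of the word (whether `\w` matches combining marks may vary across Python versions).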

Interestingly, with token_pattern=r"\S\S+" the splitting works better. But it includes punctuation characters, so something like "They met: a prime minister" becomes ['they', 'met:', 'prime', 'minister'] (the : doesn't get removed, and 'a' is dropped because the pattern requires at least two characters).

I don't know enough about non-Latin scripts to understand if \w should work or not :-/
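The `\S\S+` behaviour described above can be reproduced with plain `re` (a minimal sketch of the pattern matching that CountVectorizer does internally on the lowercased text):

```python
import re

pattern = re.compile(r"\S\S+")

# The whole Telugu word survives, because \S does not care about Unicode
# categories -- anything that is not whitespace matches.
print(pattern.findall("ప్రధానమంత్రిని కలుసుకున్నారు"))
# ['ప్రధానమంత్రిని', 'కలుసుకున్నారు']

# But punctuation sticks to the neighbouring token, and single characters
# such as 'a' are dropped because the pattern requires two or more:
print(pattern.findall("they met: a prime minister"))
# ['they', 'met:', 'prime', 'minister']
```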

betatim commented Mar 4, 2025

Looking at what characters are in the string and which category they belong to, I think I now understand why the string gets split up in weird ways:

import unicodedata

for c in tel[0]:
    print(f"'{c}' is in category '{unicodedata.category(c)}' and is called '{unicodedata.name(c)}'")

Prints out:

'ప' is in category 'Lo' and is called 'TELUGU LETTER PA'
'్' is in category 'Mn' and is called 'TELUGU SIGN VIRAMA'
'ర' is in category 'Lo' and is called 'TELUGU LETTER RA'
'ధ' is in category 'Lo' and is called 'TELUGU LETTER DHA'
'ా' is in category 'Mn' and is called 'TELUGU VOWEL SIGN AA'
'న' is in category 'Lo' and is called 'TELUGU LETTER NA'
'మ' is in category 'Lo' and is called 'TELUGU LETTER MA'
'ం' is in category 'Mc' and is called 'TELUGU SIGN ANUSVARA'
'త' is in category 'Lo' and is called 'TELUGU LETTER TA'
'్' is in category 'Mn' and is called 'TELUGU SIGN VIRAMA'
'ర' is in category 'Lo' and is called 'TELUGU LETTER RA'
'ి' is in category 'Mn' and is called 'TELUGU VOWEL SIGN I'
'న' is in category 'Lo' and is called 'TELUGU LETTER NA'
'ి' is in category 'Mn' and is called 'TELUGU VOWEL SIGN I'
' ' is in category 'Zs' and is called 'SPACE'
'క' is in category 'Lo' and is called 'TELUGU LETTER KA'
'ల' is in category 'Lo' and is called 'TELUGU LETTER LA'
'ు' is in category 'Mc' and is called 'TELUGU VOWEL SIGN U'
'స' is in category 'Lo' and is called 'TELUGU LETTER SA'
'ు' is in category 'Mc' and is called 'TELUGU VOWEL SIGN U'
'క' is in category 'Lo' and is called 'TELUGU LETTER KA'
'ు' is in category 'Mc' and is called 'TELUGU VOWEL SIGN U'
'న' is in category 'Lo' and is called 'TELUGU LETTER NA'
'్' is in category 'Mn' and is called 'TELUGU SIGN VIRAMA'
'న' is in category 'Lo' and is called 'TELUGU LETTER NA'
'ా' is in category 'Mn' and is called 'TELUGU VOWEL SIGN AA'
'ర' is in category 'Lo' and is called 'TELUGU LETTER RA'
'ు' is in category 'Mc' and is called 'TELUGU VOWEL SIGN U'

Characters in the Mn (and, per the listing above, Mc) categories get treated as "not word characters", which puts word boundaries in the middle of a word. Maybe this is a case where the default pattern doesn't make sense and you need to use a custom pattern. If this is common, we should add some documentation or an example to help others figure this out.
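One possible custom tokenizer along those lines (a hedged sketch, not an official recommendation: it maps punctuation and symbol categories to spaces and otherwise splits on whitespace, so combining marks stay attached to their base letters; the name `unicode_word_tokenizer` is made up here):

```python
import unicodedata

def unicode_word_tokenizer(text):
    # Replace punctuation (P*) and symbols (S*) with spaces, then split on
    # whitespace; combining marks (Mn/Mc) stay attached to their letters.
    cleaned = "".join(
        " " if unicodedata.category(ch)[0] in "PS" else ch for ch in text
    )
    # Mirror the default pattern's "two or more characters" rule.
    return [tok for tok in cleaned.split() if len(tok) > 1]

print(unicode_word_tokenizer("ప్రధానమంత్రిని కలుసుకున్నారు!"))
# ['ప్రధానమంత్రిని', 'కలుసుకున్నారు']
```

This could be passed as `CountVectorizer(tokenizer=unicode_word_tokenizer)`; when `tokenizer` is given, the `token_pattern` is not used (scikit-learn warns about the unused parameter), while lowercasing still happens in the preprocessing step.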

@ogrisel ogrisel added help wanted and removed Needs Triage Issue requires triage labels Mar 4, 2025
@andrelrodriguess
/take

andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
The default token_pattern r'(?u)\b\w\w+\b' splits Indic scripts (e.g., Telugu, Hindi) into incomplete tokens. This commit adds an optional _indic_tokenizer that removes punctuation and tokenizes on non-whitespace sequences (\S+), preserving full words in Unicode-heavy scripts. A multilingual test is added to validate the fix.

Signed-off-by: Andre Rodrigues <andre.lourenco.rodrigues@tecnico.ulisboa.pt>
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
The default r'(?u)\b\w\w+\b' splits Indic scripts (e.g.,
Telugu, Hindi) into partial tokens. This adds an optional
_indic_tokenizer to remove punctuation and tokenize on
non-whitespace (\S+), keeping full Unicode words. A test
is added to validate it.

Signed-off-by: Andre Rodrigues <andre.lourenco.rodrigues@tecnico.ulisboa.pt>
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
Signed-off-by: Andre Rodrigues <andre.lourenco.rodrigues@tecnico.ulisboa.pt>
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 28, 2025
andrelrodriguess added a commit to andrelrodriguess/scikit-learn that referenced this issue Mar 30, 2025
@andrelrodriguess andrelrodriguess linked a pull request Apr 2, 2025 that will close this issue
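The approach those commit messages describe ("removes punctuation and tokenizes on non-whitespace (\S+)") can be sketched roughly as follows; this is a hypothetical reconstruction from the commit text only, not the actual code of #31130, and the real `_indic_tokenizer` may differ:

```python
import re
import string

# Replace ASCII punctuation with spaces, then take runs of non-whitespace
# (\S+), so combining marks stay attached to their base consonants.
_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")

def indic_tokenizer(text):
    return re.findall(r"\S+", _PUNCT.sub(" ", text))

print(indic_tokenizer("ప్రధానమంత్రిని కలుసుకున్నారు."))
# ['ప్రధానమంత్రిని', 'కలుసుకున్నారు']
```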