The default token pattern in CountVectorizer breaks Indic sentences into non-sensical tokens #30935
Comments
Thanks for reporting this. It is an interesting one. It is surprising to see the tokenisation fail like this and I am not quite sure why. The default token pattern is something like "word break, two or more word characters, word break" - a word break is the transition from a [...] So overall I'd have thought this should work. Interestingly with [...] I don't know enough about non-Latin scripts to understand if [...]
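To make the behaviour described above concrete, here is a minimal sketch that applies the default pattern with Python's built-in re module, which is roughly what the vectorizer does once the text has been preprocessed. The helper name default_tokens and the sample strings are illustrative assumptions, not taken from the report.

```python
import re

# The default token_pattern quoted in this issue.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def default_tokens(text):
    """Hypothetical helper: tokenize text with the default pattern."""
    return re.findall(DEFAULT_TOKEN_PATTERN, text)


english = "hello world"
# The Hindi word "namaste", written with escapes so each code point is explicit:
# NA, MA, SA, VIRAMA (a combining mark), TA, VOWEL SIGN E (a combining mark).
hindi = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(default_tokens(english))  # ['hello', 'world']
print(default_tokens(hindi))    # a single three-letter fragment, not the whole word
```

In Python's re module, \w does not match combining marks, so every virama or vowel sign acts as a word boundary, and any remaining run shorter than two characters is dropped.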
Looking at what characters are in the string and which category they belong to, I think I now understand why the string gets split up in weird ways:

    import unicodedata

    # `tel` is the Telugu string from the report
    for c in tel:
        print(f"'{c}' is in category '{unicodedata.category(c)}' and is called '{unicodedata.name(c)}'")

Prints out:
Characters in the [...]
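As a self-contained variant of the inspection above, the following sketch also records whether each character matches \w in Python's re module. The function name describe_characters and the two-character sample are assumptions for illustration; the sample is Devanagari rather than the Telugu string from the report.

```python
import re
import unicodedata


def describe_characters(text):
    """Return (char, category, name, matches_w) for every character in text."""
    rows = []
    for ch in text:
        rows.append(
            (
                ch,
                unicodedata.category(ch),
                unicodedata.name(ch, "<unnamed>"),
                re.fullmatch(r"\w", ch) is not None,
            )
        )
    return rows


# DEVANAGARI LETTER NA followed by DEVANAGARI SIGN VIRAMA
sample = "\u0928\u094d"
for ch, category, name, is_word in describe_characters(sample):
    print(f"U+{ord(ch):04X} {category} {name} matches \\w: {is_word}")
```

Letters (category Lo) match \w, while combining marks (categories starting with M) do not, which is consistent with the splits seen in the issue.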
/take
The default token_pattern r'(?u)\b\w\w+\b' splits Indic scripts (e.g., Telugu, Hindi) into incomplete tokens. This commit adds an optional _indic_tokenizer that removes punctuation and tokenizes on non-whitespace sequences (\S+), preserving full words in Unicode-heavy scripts. A multilingual test is added to validate the fix. Signed-off-by: Andre Rodrigues <andre.lourenco.rodrigues@tecnico.ulisboa.pt>
The default r'(?u)\b\w\w+\b' splits Indic scripts (e.g., Telugu, Hindi) into partial tokens. This adds an optional _indic_tokenizer to remove punctuation and tokenize on non-whitespace (\S+), keeping full Unicode words. A test is added to validate it. Signed-off-by: Andre Rodrigues <andre.lourenco.rodrigues@tecnico.ulisboa.pt>
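The approach described in these commit messages (drop punctuation, then split on runs of non-whitespace) can be sketched as follows. This is not the code from the commits: the name whitespace_tokenizer and the choice to detect punctuation through Unicode categories starting with "P" are assumptions made for illustration.

```python
import re
import unicodedata


def whitespace_tokenizer(text):
    """Hypothetical tokenizer: strip punctuation, then split on whitespace."""
    # Assumption: "punctuation" means any character whose Unicode category
    # starts with "P" (this includes "!" and the Devanagari danda).
    no_punct = "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )
    # Every run of non-whitespace characters becomes one token, so combining
    # marks stay attached to the letters they modify.
    return re.findall(r"\S+", no_punct)


# Two Hindi words followed by an exclamation mark, written with escapes.
word_one = "\u0928\u092e\u0938\u094d\u0924\u0947"
word_two = "\u0926\u0941\u0928\u093f\u092f\u093e"
print(whitespace_tokenizer(f"{word_one} {word_two}!"))
```

A callable like this could be passed through the existing tokenizer parameter of CountVectorizer, and token_pattern=r"\S+" is a simpler alternative that keeps punctuation attached to words. Note that this sketch deletes punctuation outright, so "a,b" would become the single token "ab".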
Describe the bug
The default token_pattern in CountVectorizer is r"(?u)\b\w\w+\b", which tokenizes Indic text incorrectly: it breaks whitespace-delimited words into multiple chunks and even omits several valid characters. The resulting vocabulary doesn't make any sense! Is this the expected behaviour?
Sample code is pasted in the sections below
Steps/Code to Reproduce
Expected Results
Actual Results
Versions