Fix Add _indic_tokenizer for Indic support #31130
Conversation
❌ Linting issues
This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. You can see the details of the linting issues under the `lint` job.
The same issue could exist in all other non-Latin languages then. We should also test with Arabic as well as Korean/Japanese/Chinese-like strings, and perhaps call the function something like "unicode_tokenizer"?
The function should also probably be invisible to the user. Maybe we could accept "unicode" as a string in the constructor argument? I'm not sure.
In general, I'm not sure we want to go down this rabbit hole, since it's not really the core of the library.
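For context, scikit-learn's vectorizers already accept a custom callable through the existing `tokenizer` constructor parameter, so a script-aware tokenizer can be plugged in today without a new string option. A minimal sketch of that approach (the `my_unicode_tokenizer` function below is a hypothetical placeholder, not part of the library or this PR):

```python
from sklearn.feature_extraction.text import CountVectorizer

def my_unicode_tokenizer(text):
    # Placeholder: split on whitespace; a real implementation would do
    # script-specific segmentation here.
    return text.split()

# token_pattern=None silences the warning about an unused token_pattern
# when a custom tokenizer is supplied.
vectorizer = CountVectorizer(tokenizer=my_unicode_tokenizer, token_pattern=None)
X = vectorizer.fit_transform(["नमस्ते दुनिया", "مرحبا بالعالم"])
print(vectorizer.get_feature_names_out())
```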
Reference Issues/PRs
Fixes #30935.
What does this implement/fix? Explain your changes.
This implements a new `_indic_tokenizer` function to add support for tokenization of Indic languages in scikit-learn. Issue #30935 reported that the existing tokenization tools lacked proper handling for scripts used in languages like Hindi, Tamil, and Bengali, which require specific segmentation rules due to their syllabic nature.

The changes include:
- Added `_indic_tokenizer` in `sklearn/feature_extraction/text.py`, which uses syllable-based segmentation tailored for Indic scripts (a rough sketch of the idea is shown after this list).
- Updated `tests/test_text.py` to include cases for Hindi and Tamil text, ensuring the tokenizer correctly splits words and handles edge cases (e.g., conjunct consonants).
- Made `_indic_tokenizer` an optional parameter in the relevant classes.
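For illustration only, here is a minimal sketch of what syllable-oriented segmentation for Devanagari/Tamil/Bengali text could look like. The `indic_syllable_tokenize` name, the chosen code-point ranges, and the regex are assumptions made for this example; they are not the PR's actual `_indic_tokenizer` implementation:

```python
import re

# Sketch only: viramas (conjunct markers) and dependent vowel signs for
# Devanagari, Tamil and Bengali. Real Indic segmentation covers more signs
# (anusvara, candrabindu, nukta, ...) and more scripts.
_VIRAMAS = "\u094d\u0bcd\u09cd"
_DEPENDENT_SIGNS = (
    "\u093e-\u094c"   # Devanagari vowel signs (matras)
    "\u0bbe-\u0bcc"   # Tamil vowel signs
    "\u09be-\u09cc"   # Bengali vowel signs
)

# One syllable-like unit: a base character, optionally followed by
# virama+consonant conjuncts and/or dependent vowel signs.
_SYLLABLE = re.compile(rf"\S(?:[{_VIRAMAS}]\S|[{_DEPENDENT_SIGNS}])*")


def indic_syllable_tokenize(text):
    """Split on whitespace, then break each word into syllable-like units."""
    tokens = []
    for word in text.split():
        tokens.extend(_SYLLABLE.findall(word))
    return tokens


# Example: "नमस्ते दुनिया" -> ['न', 'म', 'स्ते', 'दु', 'नि', 'या']
print(indic_syllable_tokenize("नमस्ते दुनिया"))
```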