Support obtaining Script_Extensions of a character #4

Closed
Manishearth opened this issue Jan 1, 2020 · 3 comments

Comments

@Manishearth
Member

This is needed for mixed script detection.

The easy way to do this is just to store a slice of script_extensions for each code point / range, but there's actually a limited set of ways script_extensions mix (taken from here):

 Adlam (Adlam),
 Adlam,Arabic,Hanifi_Rohingya,Mandaic,Manichaean,Psalter_Pahlavi,Sogdian,Syriac (Adlam,Arabic,Hanifi_Rohingya,Mandaic,Manichaean,Psalter_Pahlavi,Sogdian,Syriac),
 Ahom (Ahom),
 Anatolian_Hieroglyphs (Anatolian_Hieroglyphs),
 Arabic (Arabic),
 Arabic,Coptic (Arabic,Coptic),
 Arabic,Hanifi_Rohingya (Arabic,Hanifi_Rohingya),
 Arabic,Hanifi_Rohingya,Syriac,Thaana (Arabic,Hanifi_Rohingya,Syriac,Thaana),
 Arabic,Syriac (Arabic,Syriac),
 Arabic,Syriac,Thaana (Arabic,Syriac,Thaana),
 Arabic,Thaana (Arabic,Thaana),
 Armenian (Armenian),
 Armenian,Georgian (Armenian,Georgian),
 Avestan (Avestan),
 Balinese (Balinese),
 Bamum (Bamum),
 Bassa_Vah (Bassa_Vah),
 Batak (Batak),
 Bengali (Bengali),
 Bengali,Chakma,Syloti_Nagri (Bengali,Chakma,Syloti_Nagri),
 Bengali,Devanagari (Bengali,Devanagari),
 Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Limbu,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Limbu,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta),
 Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Dogra,Grantha,Gujarati,Gunjala_Gondi,Gurmukhi,Kannada,Khudawadi,Mahajani,Malayalam,Masaram_Gondi,Nandinagari,Oriya,Sinhala,Syloti_Nagri,Takri,Tamil,Telugu,Tirhuta),
 Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Sharada,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Sharada,Tamil,Telugu,Tirhuta),
 Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Tamil,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Gujarati,Gurmukhi,Kannada,Latin,Malayalam,Oriya,Tamil,Telugu,Tirhuta),
 Bengali,Devanagari,Grantha,Kannada (Bengali,Devanagari,Grantha,Kannada),
 Bengali,Devanagari,Grantha,Kannada,Nandinagari,Oriya,Telugu,Tirhuta (Bengali,Devanagari,Grantha,Kannada,Nandinagari,Oriya,Telugu,Tirhuta),
 Bhaiksuki (Bhaiksuki),
 Bopomofo (Bopomofo),
 Bopomofo,Han (Bopomofo,Han),
 Bopomofo,Han,Hangul,Hiragana,Katakana (Bopomofo,Han,Hangul,Hiragana,Katakana),
 Bopomofo,Han,Hangul,Hiragana,Katakana,Yi (Bopomofo,Han,Hangul,Hiragana,Katakana,Yi),
 Brahmi (Brahmi),
 Braille (Braille),
 Buginese (Buginese),
 Buginese,Javanese (Buginese,Javanese),
 Buhid (Buhid),
 Buhid,Hanunoo,Tagalog,Tagbanwa (Buhid,Hanunoo,Tagalog,Tagbanwa),
 Canadian_Aboriginal (Canadian_Aboriginal),
 Carian (Carian),
 Caucasian_Albanian (Caucasian_Albanian),
 Chakma (Chakma),
 Chakma,Myanmar,Tai_Le (Chakma,Myanmar,Tai_Le),
 Cham (Cham),
 Cherokee (Cherokee),
 Common (Common),
 Coptic (Coptic),
 Cuneiform (Cuneiform),
 Cypriot (Cypriot),
 Cypriot,Linear_A,Linear_B (Cypriot,Linear_A,Linear_B),
 Cypriot,Linear_B (Cypriot,Linear_B),
 Cyrillic (Cyrillic),
 Cyrillic,Glagolitic (Cyrillic,Glagolitic),
 Cyrillic,Latin (Cyrillic,Latin),
 Cyrillic,Old_Permic (Cyrillic,Old_Permic),
 Deseret (Deseret),
 Devanagari (Devanagari),
 Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Malayalam,Modi,Nandinagari,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Malayalam,Modi,Nandinagari,Takri,Tirhuta),
 Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Modi,Nandinagari,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Kannada,Khojki,Khudawadi,Mahajani,Modi,Nandinagari,Takri,Tirhuta),
 Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Khojki,Khudawadi,Mahajani,Modi,Takri,Tirhuta (Devanagari,Dogra,Gujarati,Gurmukhi,Kaithi,Khojki,Khudawadi,Mahajani,Modi,Takri,Tirhuta),
 Devanagari,Dogra,Kaithi,Mahajani (Devanagari,Dogra,Kaithi,Mahajani),
 Devanagari,Grantha (Devanagari,Grantha),
 Devanagari,Grantha,Kannada (Devanagari,Grantha,Kannada),
 Devanagari,Grantha,Latin (Devanagari,Grantha,Latin),
 Devanagari,Kannada,Malayalam,Oriya,Tamil,Telugu (Devanagari,Kannada,Malayalam,Oriya,Tamil,Telugu),
 Devanagari,Nandinagari (Devanagari,Nandinagari),
 Devanagari,Sharada (Devanagari,Sharada),
 Devanagari,Tamil (Devanagari,Tamil),
 Dogra (Dogra),
 Duployan (Duployan),
 Egyptian_Hieroglyphs (Egyptian_Hieroglyphs),
 Elbasan (Elbasan),
 Elymaic (Elymaic),
 Ethiopic (Ethiopic),
 Georgian (Georgian),
 Georgian,Latin (Georgian,Latin),
 Glagolitic (Glagolitic),
 Gothic (Gothic),
 Grantha (Grantha),
 Grantha,Tamil (Grantha,Tamil),
 Greek (Greek),
 Gujarati (Gujarati),
 Gujarati,Khojki (Gujarati,Khojki),
 Gunjala_Gondi (Gunjala_Gondi),
 Gurmukhi (Gurmukhi),
 Gurmukhi,Multani (Gurmukhi,Multani),
 Han (Han),
 Han,Hiragana,Katakana (Han,Hiragana,Katakana),
 Hangul (Hangul),
 Hanifi_Rohingya (Hanifi_Rohingya),
 Hanunoo (Hanunoo),
 Hatran (Hatran),
 Hebrew (Hebrew),
 Hiragana (Hiragana),
 Hiragana,Katakana (Hiragana,Katakana),
 Imperial_Aramaic (Imperial_Aramaic),
 Inherited (Inherited),
 Inscriptional_Pahlavi (Inscriptional_Pahlavi),
 Inscriptional_Parthian (Inscriptional_Parthian),
 Javanese (Javanese),
 Kaithi (Kaithi),
 Kannada (Kannada),
 Kannada,Nandinagari (Kannada,Nandinagari),
 Katakana (Katakana),
 Kayah_Li (Kayah_Li),
 Kayah_Li,Latin,Myanmar (Kayah_Li,Latin,Myanmar),
 Kharoshthi (Kharoshthi),
 Khmer (Khmer),
 Khojki (Khojki),
 Khudawadi (Khudawadi),
 Lao (Lao),
 Latin (Latin),
 Latin,Mongolian (Latin,Mongolian),
 Lepcha (Lepcha),
 Limbu (Limbu),
 Linear_A (Linear_A),
 Linear_B (Linear_B),
 Lisu (Lisu),
 Lycian (Lycian),
 Lydian (Lydian),
 Mahajani (Mahajani),
 Makasar (Makasar),
 Malayalam (Malayalam),
 Mandaic (Mandaic),
 Manichaean (Manichaean),
 Marchen (Marchen),
 Masaram_Gondi (Masaram_Gondi),
 Medefaidrin (Medefaidrin),
 Meetei_Mayek (Meetei_Mayek),
 Mende_Kikakui (Mende_Kikakui),
 Meroitic_Cursive (Meroitic_Cursive),
 Meroitic_Hieroglyphs (Meroitic_Hieroglyphs),
 Miao (Miao),
 Modi (Modi),
 Mongolian (Mongolian),
 Mongolian,Phags_Pa (Mongolian,Phags_Pa),
 Mro (Mro),
 Multani (Multani),
 Myanmar (Myanmar),
 Nabataean (Nabataean),
 Nandinagari (Nandinagari),
 New_Tai_Lue (New_Tai_Lue),
 Newa (Newa),
 Nko (Nko),
 Nushu (Nushu),
 Nyiakeng_Puachue_Hmong (Nyiakeng_Puachue_Hmong),
 Ogham (Ogham),
 Ol_Chiki (Ol_Chiki),
 Old_Hungarian (Old_Hungarian),
 Old_Italic (Old_Italic),
 Old_North_Arabian (Old_North_Arabian),
 Old_Permic (Old_Permic),
 Old_Persian (Old_Persian),
 Old_Sogdian (Old_Sogdian),
 Old_South_Arabian (Old_South_Arabian),
 Old_Turkic (Old_Turkic),
 Oriya (Oriya),
 Osage (Osage),
 Osmanya (Osmanya),
 Pahawh_Hmong (Pahawh_Hmong),
 Palmyrene (Palmyrene),
 Pau_Cin_Hau (Pau_Cin_Hau),
 Phags_Pa (Phags_Pa),
 Phoenician (Phoenician),
 Psalter_Pahlavi (Psalter_Pahlavi),
 Rejang (Rejang),
 Runic (Runic),
 Samaritan (Samaritan),
 Saurashtra (Saurashtra),
 Sharada (Sharada),
 Shavian (Shavian),
 Siddham (Siddham),
 Sign_Writing (Sign_Writing),
 Sinhala (Sinhala),
 Sogdian (Sogdian),
 Sora_Sompeng (Sora_Sompeng),
 Soyombo (Soyombo),
 Sundanese (Sundanese),
 Syloti_Nagri (Syloti_Nagri),
 Syriac (Syriac),
 Tagalog (Tagalog),
 Tagbanwa (Tagbanwa),
 Tai_Le (Tai_Le),
 Tai_Tham (Tai_Tham),
 Tai_Viet (Tai_Viet),
 Takri (Takri),
 Tamil (Tamil),
 Tangut (Tangut),
 Telugu (Telugu),
 Thaana (Thaana),
 Thai (Thai),
 Tibetan (Tibetan),
 Tifinagh (Tifinagh),
 Tirhuta (Tirhuta),
 Ugaritic (Ugaritic),
 Unknown (Unknown),
 Vai (Vai),
 Wancho (Wancho),
 Warang_Citi (Warang_Citi),
 Yi (Yi),
 Zanabazar_Square (Zanabazar_Square)

We can very easily make a single enum value for each one, and programmatically generate an intersect() function that calculates intersections. This would be faster than storing and intersecting slices.
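To make the enum idea concrete, here is a rough sketch over a tiny hand-picked subset of the combinations above (the Arabic/Syriac/Thaana ones). The names `ScriptExt`, `bits`, `from_bits` and the bit constants are all hypothetical, and the bitmask round-trip is just one possible way to implement intersect(); a generated version could equally emit a big match table.

```rust
/// Hypothetical subset of the combinations listed above.
/// A real table would be generated from the Unicode data files.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ScriptExt {
    Arabic,
    Syriac,
    Thaana,
    ArabicSyriac,
    ArabicThaana,
    ArabicSyriacThaana,
}

// One bit per base script (assumed encoding, illustration only).
const ARAB: u8 = 1 << 0;
const SYRC: u8 = 1 << 1;
const THAA: u8 = 1 << 2;

impl ScriptExt {
    const ALL: [ScriptExt; 6] = [
        ScriptExt::Arabic,
        ScriptExt::Syriac,
        ScriptExt::Thaana,
        ScriptExt::ArabicSyriac,
        ScriptExt::ArabicThaana,
        ScriptExt::ArabicSyriacThaana,
    ];

    /// Bitmask of the base scripts contained in this combination.
    const fn bits(self) -> u8 {
        match self {
            ScriptExt::Arabic => ARAB,
            ScriptExt::Syriac => SYRC,
            ScriptExt::Thaana => THAA,
            ScriptExt::ArabicSyriac => ARAB | SYRC,
            ScriptExt::ArabicThaana => ARAB | THAA,
            ScriptExt::ArabicSyriacThaana => ARAB | SYRC | THAA,
        }
    }

    /// Maps a bitmask back to a variant, if such a combination exists.
    fn from_bits(bits: u8) -> Option<ScriptExt> {
        Self::ALL.iter().copied().find(|s| s.bits() == bits)
    }

    /// Returns the shared scripts, or `None` when the two sets are disjoint.
    /// Also returns `None` if the result is not itself a listed combination,
    /// so a generated table would need to be closed under intersection.
    fn intersect(self, other: ScriptExt) -> Option<ScriptExt> {
        Self::from_bits(self.bits() & other.bits())
    }
}
```

With this subset, `ScriptExt::ArabicSyriac.intersect(ScriptExt::ArabicThaana)` gives `Some(ScriptExt::Arabic)`, and `ScriptExt::Syriac.intersect(ScriptExt::Thaana)` gives `None`. Common and Inherited would need special handling that this sketch leaves out.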

(For performance it would also probably be worth only running these checks on non-ASCII identifiers.)
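The ASCII shortcut could look roughly like the sketch below. `is_mixed_script` and its `slow_check` parameter are hypothetical names standing in for whatever the real check ends up being. The shortcut relies on ASCII letters all being Latin and ASCII digits and `_` being Common, so an all-ASCII identifier cannot mix scripts.

```rust
/// Hypothetical wrapper: skips the table-driven check for all-ASCII input.
/// `slow_check` is a placeholder for the full Script_Extensions-based check.
fn is_mixed_script(ident: &str, slow_check: impl Fn(&str) -> bool) -> bool {
    // ASCII letters are Latin; ASCII digits and `_` are Common,
    // so there is nothing to intersect and no mixing is possible.
    if ident.is_ascii() {
        return false;
    }
    slow_check(ident)
}
```

`str::is_ascii` is a single pass over the bytes, so the common case pays almost nothing before the table lookups are even considered.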

@Manishearth
Member Author

I might create a separate unicode-script crate for the guts of this.

@Manishearth
Member Author

#6
