A collection of pronunciation dictionaries and neural grapheme-to-phoneme models.
Install from PyPI:
pip install lexikos
Editable install from source:
git clone https://github.com/bookbot-hive/lexikos.git
pip install -e lexikos
>>> from lexikos import Lexicon
>>> lexicon = Lexicon()
>>> print(lexicon["added"])
{'ˈæ d ɪ d', 'ˈæ ɾ ə d', 'æ ɾ ɪ d', 'a d ɪ d', 'ˈa d ɪ d', 'æ ɾ ə d', 'ˈa d ə d', 'a d ə d', 'ˈæ d ə d', 'æ d ə d', 'æ d ɪ d', 'ˈæ ɾ ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'ɹ ʌ n ɝ', 'ˈr ʌ n ɝ'}
>>> print(lexicon["water"])
{'ˈʋ aː ʈ ə r ɯ', 'ˈw oː t ə', 'w ɑ t ə ɹ', 'ˈw aː ʈ ə r ɯ', 'ˈw ɔ t ɝ', 'w ɔ t ə ɹ', 'ˈw ɑ t ə ɹ', 'w ɔ t ɝ', 'w ɑ ɾ ɚ', 'ˈw ɑ ɾ ɚ', 'ˈʋ ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɔː t ə', 'ˈw oː ɾ ə', 'ˈw ɔ ʈ ə r'}
To get a lexicon where phonemes are normalized (diacritics removed, digraphs split):
>>> from lexikos import Lexicon
>>> lexicon = Lexicon(normalize_phonemes=True)
>>> print(lexicon["added"])
{'æ ɾ ɪ d', 'a d ɪ d', 'a d ə d', 'æ ɾ ə d', 'æ d ə d', 'æ d ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'r ʌ n ɝ', 'ɹ ʌ n ɝ'}
>>> print(lexicon["water"])
{'w o ɾ ə', 'w ɔ t ə', 'ʋ ɔ ʈ ə r', 'w a ʈ ə r ɯ', 'w ɔ t ə ɹ', 'ʋ a ʈ ə r ɯ', 'w ɑ ɾ ɚ', 'w o t ə', 'w ɔ t ɝ', 'w ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɑ t ə ɹ'}
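Roughly speaking, normalization strips stress and length marks and splits multi-character phones into individual IPA symbols. The snippet below is only an illustrative sketch of that idea (the set of marks is an assumption), not the library's actual implementation; use `Lexicon(normalize_phonemes=True)` for real work.

```python
# Illustrative sketch only: approximates what normalize_phonemes=True produces.
# The mark set is an assumption, not lexikos' internal list.
MARKS = {"ˈ", "ˌ", "ː"}  # primary/secondary stress and length marks

def normalize_pron(pron: str) -> str:
    symbols = []
    for phone in pron.split():
        phone = "".join(ch for ch in phone if ch not in MARKS)
        symbols.extend(phone)  # split any remaining multi-character phones into single symbols
    return " ".join(s for s in symbols if s)

print(normalize_pron("ˈw oː t ə"))  # w o t ə
```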
To include synthetic (non-dictionary-based) pronunciations:
>>> from lexikos import Lexicon
>>> lexicon = Lexicon(include_synthetic=True)
>>> print(lexicon["athletic"])
{'æ t l ɛ t ɪ k', 'æ θ ˈl ɛ t ɪ k', 'æ θ l ɛ t ɪ k'}
>>> from lexikos import G2p
>>> g2p = G2p(lang="en-us")
>>> g2p("Hello there! $100 is not a lot of money in 2023.")
['h ɛ l o ʊ', 'ð ɛ ə ɹ', 'w ʌ n', 'h ʌ n d ɹ ɪ d', 'd ɑ l ɚ z', 'ɪ z', 'n ɒ t', 'ə', 'l ɑ t', 'ʌ v', 'm ʌ n i', 'ɪ n', 't w ɛ n t i', 't w ɛ n t i', 'θ ɹ iː']
>>> g2p = G2p(lang="en-au")
>>> g2p("Hi there mate! Have a g'day!")
['h a ɪ', 'θ ɛ ə ɹ', 'm e ɪ t', 'h e ɪ v', 'ə', 'ɡ ə ˈd æ ɪ']
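In practice, the two can be combined: look a word up in the lexicon first and fall back to the neural G2P model for out-of-vocabulary words. The sketch below assumes that `Lexicon` is keyed by lowercase words and raises `KeyError` for unknown entries (standard mapping behavior); adjust if the actual API differs.

```python
from lexikos import G2p, Lexicon

lexicon = Lexicon()
g2p = G2p(lang="en-us")

def phonemize(word: str) -> str:
    """Dictionary lookup with a neural G2P fallback (sketch; see assumptions above)."""
    try:
        # The lexicon returns a set of attested pronunciations; pick one deterministically.
        return sorted(lexicon[word.lower()])[0]
    except KeyError:
        # Out-of-vocabulary word: let the neural model predict a pronunciation.
        return g2p(word)[0]

print(phonemize("runner"))
print(phonemize("lexikos"))  # likely not in any dictionary, handled by the G2P model
```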
Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-US | CMU Dict | ARPA | External Link | bookbot/byt5-small-cmudict |
en-US | CMU Dict IPA | IPA | External Link | |
en-US | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
en-US (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-us-broad |
en-US (Narrow) | Wikipron | IPA | External Link | |
en-US | LibriSpeech | IPA | Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-UK | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
en-UK (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-uk-broad |
en-UK (Narrow) | Wikipron | IPA | External Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-AU (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-au-broad |
en-AU (Narrow) | Wikipron | IPA | Link | |
en-AU | AusTalk | IPA | Link | |
en-AU | SC-CW | IPA | Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-CA (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-ca-broad |
en-CA (Narrow) | Wikipron | IPA | Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-NZ (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-nz-broad |
en-NZ (Narrow) | Wikipron | IPA | Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-IN (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-in-broad |
en-IN (Narrow) | Wikipron | IPA | Link | |
We modified 🤗 HuggingFace's sequence-to-sequence training script to train our G2P models. Refer to their installation requirements for more details.
Training a new G2P model generally follows the recipe below; the lines marked with + are the arguments to substitute with your own pretrained model, dataset, and Hub repository:
python run_translation.py \
+ --model_name_or_path $PRETRAINED_MODEL \
+ --dataset_name $DATASET_NAME \
--output_dir $OUTPUT_DIR \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 32 \
--learning_rate 2e-4 \
--lr_scheduler_type linear \
--warmup_ratio 0.1 \
--num_train_epochs 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--logging_strategy epoch \
--max_source_length 64 \
--max_target_length 64 \
--val_max_target_length 64 \
--pad_to_max_length True \
--overwrite_output_dir \
--do_train --do_eval \
--bf16 \
--predict_with_generate \
--report_to tensorboard \
--push_to_hub \
+ --hub_model_id $HUB_MODEL_ID \
--use_auth_token
For example, to fine-tune google/byt5-small on CMUDict and push it to the Hub:
python run_translation.py \
--model_name_or_path google/byt5-small \
--dataset_name bookbot/cmudict-0.7b \
--output_dir ./byt5-small-cmudict \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 32 \
--learning_rate 2e-4 \
--lr_scheduler_type linear \
--warmup_ratio 0.1 \
--num_train_epochs 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--logging_strategy epoch \
--max_source_length 64 \
--max_target_length 64 \
--val_max_target_length 64 \
--pad_to_max_length True \
--overwrite_output_dir \
--do_train --do_eval \
--bf16 \
--predict_with_generate \
--report_to tensorboard \
--push_to_hub \
--hub_model_id bookbot/byt5-small-cmudict \
--use_auth_token
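The script expects a dataset of grapheme/phoneme pairs; judging from the evaluation arguments below, the columns are named `source` (word) and `target` (pronunciation). A quick way to peek at the example dataset (a sketch; the `train` split name is an assumption):

```python
from datasets import load_dataset

# Inspect the grapheme-to-phoneme pairs used above.
# Column names follow the eval.py arguments (source/target); the split name is assumed.
cmudict = load_dataset("bookbot/cmudict-0.7b", split="train")
print(cmudict.column_names)
print(cmudict[0])
```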
Then, to evaluate (again substituting the lines marked with +):
python eval.py \
+ --model $PRETRAINED_MODEL \
+ --dataset_name $DATASET_NAME \
--source_text_column_name source \
--target_text_column_name target \
--max_length 64 \
--batch_size 64
For example:
python eval.py \
--model bookbot/byt5-small-cmudict \
--dataset_name bookbot/cmudict-0.7b \
--source_text_column_name source \
--target_text_column_name target \
--max_length 64 \
--batch_size 64
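Once trained (or using one of the published checkpoints listed above), the model can also be queried directly with 🤗 Transformers. A minimal inference sketch, assuming the checkpoint follows the standard ByT5 sequence-to-sequence setup:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Minimal sketch: batch G2P inference with a published checkpoint.
model_id = "bookbot/byt5-small-cmudict"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

inputs = tokenizer(["added", "runner", "water"], return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```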
Language Family | Code | Region | Corpus | G2P Model |
---|---|---|---|---|
African English | en-ZA | South Africa | | |
Australian English | en-AU | Australia | ✅ | ✅ |
East Asian English | en-CN, en-HK, en-JP, en-KR, en-TW | China, Hong Kong, Japan, South Korea, Taiwan | | |
European English | en-UK, en-HU, en-IE | United Kingdom, Hungary, Ireland | 🚧 | 🚧 |
Mexican English | en-MX | Mexico | | |
New Zealand English | en-NZ | New Zealand | ✅ | ✅ |
North American English | en-CA, en-US | Canada, United States | ✅ | ✅ |
Middle Eastern English | en-EG, en-IL | Egypt, Israel | | |
Southeast Asian English | en-TH, en-ID, en-MY, en-PH, en-SG | Thailand, Indonesia, Malaysia, Philippines, Singapore | | |
South Asian English | en-IN | India | ✅ | ✅ |
@inproceedings{lee-etal-2020-massively,
title = "Massively Multilingual Pronunciation Modeling with {W}iki{P}ron",
author = "Lee, Jackson L. and
Ashby, Lucas F.E. and
Garza, M. Elizabeth and
Lee-Sikka, Yeonju and
Miller, Sean and
Wong, Alan and
McCarthy, Arya D. and
Gorman, Kyle",
booktitle = "Proceedings of LREC",
year = "2020",
publisher = "European Language Resources Association",
pages = "4223--4228",
}
@misc{zhu2022byt5,
title={ByT5 model for massively multilingual grapheme-to-phoneme conversion},
author={Jian Zhu and Cong Zhang and David Jurgens},
year={2022},
eprint={2204.03067},
archivePrefix={arXiv},
primaryClass={cs.CL}
}