A collection of pronunciation dictionaries and neural grapheme-to-phoneme models.
Install from PyPI:
pip install lexikos
Editable install from source:
git clone https://github.com/bookbot-hive/lexikos.git
pip install -e lexikos
>>> from lexikos import Lexicon
>>> lexicon = Lexicon()
>>> print(lexicon["added"])
{'ˈæ d ɪ d', 'ˈæ ɾ ə d', 'æ ɾ ɪ d', 'a d ɪ d', 'ˈa d ɪ d', 'æ ɾ ə d', 'ˈa d ə d', 'a d ə d', 'ˈæ d ə d', 'æ d ə d', 'æ d ɪ d', 'ˈæ ɾ ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'ɹ ʌ n ɝ', 'ˈr ʌ n ɝ'}
>>> print(lexicon["water"])
{'ˈʋ aː ʈ ə r ɯ', 'ˈw oː t ə', 'w ɑ t ə ɹ', 'ˈw aː ʈ ə r ɯ', 'ˈw ɔ t ɝ', 'w ɔ t ə ɹ', 'ˈw ɑ t ə ɹ', 'w ɔ t ɝ', 'w ɑ ɾ ɚ', 'ˈw ɑ ɾ ɚ', 'ˈʋ ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɔː t ə', 'ˈw oː ɾ ə', 'ˈw ɔ ʈ ə r'}
To get a lexicon where phonemes are normalized (diacritics removed, digraphs split):
>>> from lexikos import Lexicon
>>> lexicon = Lexicon(normalize_phonemes=True)
>>> print(lexicon["added"])
{'æ ɾ ɪ d', 'a d ɪ d', 'a d ə d', 'æ ɾ ə d', 'æ d ə d', 'æ d ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'r ʌ n ɝ', 'ɹ ʌ n ɝ'}
>>> print(lexicon["water"])
{'w o ɾ ə', 'w ɔ t ə', 'ʋ ɔ ʈ ə r', 'w a ʈ ə r ɯ', 'w ɔ t ə ɹ', 'ʋ a ʈ ə r ɯ', 'w ɑ ɾ ɚ', 'w o t ə', 'w ɔ t ɝ', 'w ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɑ t ə ɹ'}
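Roughly speaking, normalization strips stress and length marks and splits multi-character phones into individual IPA symbols. The snippet below is only an illustrative sketch of that idea (the set of marks is an assumption), not the library's actual implementation; use `Lexicon(normalize_phonemes=True)` for real work.

```python
# Illustrative sketch only: approximates what normalize_phonemes=True produces.
# The mark set is an assumption, not lexikos' internal list.
MARKS = {"ˈ", "ˌ", "ː"}  # primary/secondary stress and length marks

def normalize_pron(pron: str) -> str:
    symbols = []
    for phone in pron.split():
        phone = "".join(ch for ch in phone if ch not in MARKS)
        symbols.extend(phone)  # split any remaining multi-character phones into single symbols
    return " ".join(s for s in symbols if s)

print(normalize_pron("ˈw oː t ə"))  # w o t ə
```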
To include synthetic (non-dictionary-based) pronunciations:
>>> from lexikos import Lexicon
>>> lexicon = Lexicon(include_synthetic=True)
>>> print(lexicon["athletic"])
{'æ t l ɛ t ɪ k', 'æ θ ˈl ɛ t ɪ k', 'æ θ l ɛ t ɪ k'}
>>> from lexikos import G2p
>>> g2p = G2p(lang="en-us")
>>> g2p("Hello there! $100 is not a lot of money in 2023.")
['h ɛ l o ʊ', 'ð ɛ ə ɹ', 'w ʌ n', 'h ʌ n d ɹ ɪ d', 'd ɑ l ɚ z', 'ɪ z', 'n ɒ t', 'ə', 'l ɑ t', 'ʌ v', 'm ʌ n i', 'ɪ n', 't w ɛ n t i', 't w ɛ n t i', 'θ ɹ iː']
>>> g2p = G2p(lang="en-au")
>>> g2p("Hi there mate! Have a g'day!")
['h a ɪ', 'θ ɛ ə ɹ', 'm e ɪ t', 'h e ɪ v', 'ə', 'ɡ ə ˈd æ ɪ']
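In practice, the two can be combined: look a word up in the lexicon first and fall back to the neural G2P model for out-of-vocabulary words. The sketch below assumes that `Lexicon` is keyed by lowercase words and raises `KeyError` for unknown entries (standard mapping behavior); adjust if the actual API differs.

```python
from lexikos import G2p, Lexicon

lexicon = Lexicon()
g2p = G2p(lang="en-us")

def phonemize(word: str) -> str:
    """Dictionary lookup with a neural G2P fallback (sketch; see assumptions above)."""
    try:
        # The lexicon returns a set of attested pronunciations; pick one deterministically.
        return sorted(lexicon[word.lower()])[0]
    except KeyError:
        # Out-of-vocabulary word: let the neural model predict a pronunciation.
        return g2p(word)[0]

print(phonemize("runner"))
print(phonemize("lexikos"))  # likely not in any dictionary, handled by the G2P model
```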
Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-US | CMU Dict | ARPA | External Link | bookbot/byt5-small-cmudict |
en-US | CMU Dict IPA | IPA | External Link | |
en-US | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
en-US (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-us-broad |
en-US (Narrow) | Wikipron | IPA | External Link | |
en-US | LibriSpeech | IPA | Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-UK | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
en-UK (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-uk-broad |
en-UK (Narrow) | Wikipron | IPA | External Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-AU (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-au-broad |
en-AU (Narrow) | Wikipron | IPA | Link | |
en-AU | AusTalk | IPA | Link | |
en-AU | SC-CW | IPA | Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-CA (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-ca-broad |
en-CA (Narrow) | Wikipron | IPA | Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-NZ (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-nz-broad |
en-NZ (Narrow) | Wikipron | IPA | Link | |

Language | Dictionary | Phone Set | Corpus | G2P Model |
---|---|---|---|---|
en-IN (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-in-broad |
en-IN (Narrow) | Wikipron | IPA | Link | |
We modified 🤗 HuggingFace's sequence-to-sequence training script to train our G2P models. Refer to their installation requirements for more details.
Training a new G2P model generally follows the recipe below; the lines marked with + are the arguments to substitute with your own pretrained model, dataset, and Hub repository:
python run_translation.py \
+ --model_name_or_path $PRETRAINED_MODEL \
+ --dataset_name $DATASET_NAME \
--output_dir $OUTPUT_DIR \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 32 \
--learning_rate 2e-4 \
--lr_scheduler_type linear \
--warmup_ratio 0.1 \
--num_train_epochs 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--logging_strategy epoch \
--max_source_length 64 \
--max_target_length 64 \
--val_max_target_length 64 \
--pad_to_max_length True \
--overwrite_output_dir \
--do_train --do_eval \
--bf16 \
--predict_with_generate \
--report_to tensorboard \
--push_to_hub \
+ --hub_model_id $HUB_MODEL_ID \
--use_auth_token
For example, to fine-tune google/byt5-small on CMUDict and push it to the Hub:
python run_translation.py \
--model_name_or_path google/byt5-small \
--dataset_name bookbot/cmudict-0.7b \
--output_dir ./byt5-small-cmudict \
--per_device_train_batch_size 128 \
--per_device_eval_batch_size 32 \
--learning_rate 2e-4 \
--lr_scheduler_type linear \
--warmup_ratio 0.1 \
--num_train_epochs 10 \
--evaluation_strategy epoch \
--save_strategy epoch \
--logging_strategy epoch \
--max_source_length 64 \
--max_target_length 64 \
--val_max_target_length 64 \
--pad_to_max_length True \
--overwrite_output_dir \
--do_train --do_eval \
--bf16 \
--predict_with_generate \
--report_to tensorboard \
--push_to_hub \
--hub_model_id bookbot/byt5-small-cmudict \
--use_auth_token
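The script expects a dataset of grapheme/phoneme pairs; judging from the evaluation arguments below, the columns are named `source` (word) and `target` (pronunciation). A quick way to peek at the example dataset (a sketch; the `train` split name is an assumption):

```python
from datasets import load_dataset

# Inspect the grapheme-to-phoneme pairs used above.
# Column names follow the eval.py arguments (source/target); the split name is assumed.
cmudict = load_dataset("bookbot/cmudict-0.7b", split="train")
print(cmudict.column_names)
print(cmudict[0])
```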
Then, to evaluate (again substituting the lines marked with +):
python eval.py \
+ --model $PRETRAINED_MODEL \
+ --dataset_name $DATASET_NAME \
--source_text_column_name source \
--target_text_column_name target \
--max_length 64 \
--batch_size 64
For example:
python eval.py \
--model bookbot/byt5-small-cmudict \
--dataset_name bookbot/cmudict-0.7b \
--source_text_column_name source \
--target_text_column_name target \
--max_length 64 \
--batch_size 64
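Once trained (or using one of the published checkpoints listed above), the model can also be queried directly with 🤗 Transformers. A minimal inference sketch, assuming the checkpoint follows the standard ByT5 sequence-to-sequence setup:

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Minimal sketch: batch G2P inference with a published checkpoint.
model_id = "bookbot/byt5-small-cmudict"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

inputs = tokenizer(["added", "runner", "water"], return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```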
Language Family | Code | Region | Corpus | G2P Model |
---|---|---|---|---|
African English | en-ZA | South Africa | | |
Australian English | en-AU | Australia | ✅ | ✅ |
East Asian English | en-CN, en-HK, en-JP, en-KR, en-TW | China, Hong Kong, Japan, South Korea, Taiwan | | |
European English | en-UK, en-HU, en-IE | United Kingdom, Hungary, Ireland | 🚧 | 🚧 |
Mexican English | en-MX | Mexico | | |
New Zealand English | en-NZ | New Zealand | ✅ | ✅ |
North American English | en-CA, en-US | Canada, United States | ✅ | ✅ |
Middle Eastern English | en-EG, en-IL | Egypt, Israel | | |
Southeast Asian English | en-TH, en-ID, en-MY, en-PH, en-SG | Thailand, Indonesia, Malaysia, Philippines, Singapore | | |
South Asian English | en-IN | India | ✅ | ✅ |
@inproceedings{lee-etal-2020-massively,
title = "Massively Multilingual Pronunciation Modeling with {W}iki{P}ron",
author = "Lee, Jackson L. and
Ashby, Lucas F.E. and
Garza, M. Elizabeth and
Lee-Sikka, Yeonju and
Miller, Sean and
Wong, Alan and
McCarthy, Arya D. and
Gorman, Kyle",
booktitle = "Proceedings of LREC",
year = "2020",
publisher = "European Language Resources Association",
pages = "4223--4228",
}
@misc{zhu2022byt5,
title={ByT5 model for massively multilingual grapheme-to-phoneme conversion},
author={Jian Zhu and Cong Zhang and David Jurgens},
year={2022},
eprint={2204.03067},
archivePrefix={arXiv},
primaryClass={cs.CL}
}