1 parent a227226 commit cb01ade
models/convert.py
@@ -111,7 +111,7 @@ def bytes_to_unicode():
  The reversible bpe codes work on unicode strings.
  This means you need a large # of unicode characters in your vocab if you want to avoid UNKs.
  When you're at something like a 10B token dataset you end up needing around 5K for decent coverage.
- This is a signficant percentage of your normal, say, 32K bpe vocab.
+ This is a significant percentage of your normal, say, 32K bpe vocab.
  To avoid that, we want lookup tables between utf-8 bytes and unicode strings.
  And avoids mapping to whitespace/control characters the bpe code barfs on.
  """
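For readers without the function body in front of them, the lookup table the docstring describes can be sketched roughly as follows. The hunk shows only the docstring, so this is an illustration modeled on the widely used byte-level BPE approach, not necessarily the exact body of `bytes_to_unicode` in this file. The names `bytes_to_unicode_sketch`, `to_printable` and `from_printable` are hypothetical.

```python
def bytes_to_unicode_sketch():
    """Map each of the 256 byte values to a distinct, visible unicode character."""
    # Bytes that are already printable and non-whitespace map to themselves.
    keep = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    keep_set = set(keep)
    byte_values = keep[:]
    code_points = keep[:]
    shift = 0
    for b in range(256):
        if b not in keep_set:
            # Whitespace/control bytes are shifted to code points >= 256,
            # so the BPE code never sees a raw space, newline, etc.
            byte_values.append(b)
            code_points.append(256 + shift)
            shift += 1
    return dict(zip(byte_values, (chr(c) for c in code_points)))


def to_printable(text, table):
    """Encode text as utf-8, then replace each byte with its table character."""
    return "".join(table[b] for b in text.encode("utf-8"))


def from_printable(mapped, table):
    """Invert to_printable using the reversed table."""
    inverse = {ch: b for b, ch in table.items()}
    return bytes(inverse[ch] for ch in mapped).decode("utf-8")
```

Because any input is first reduced to utf-8 bytes, 256 base symbols are enough to cover all text without UNKs, and the mapping is reversible, so `from_printable(to_printable(s, table), table)` returns `s`.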