Commit f1e2e42

Fix fast tokenizers too (huggingface#5562)

1 parent 5787e4c commit f1e2e42

2 files changed: 37 additions & 19 deletions

src/transformers/tokenization_gpt2.py

Lines changed: 19 additions & 10 deletions

@@ -297,21 +297,30 @@ def prepare_for_tokenization(self, text, is_pretokenized=False, **kwargs):
 
 class GPT2TokenizerFast(PreTrainedTokenizerFast):
     """
-    Constructs a "Fast" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library).
+    Constructs a "Fast" GPT-2 BPE tokenizer (backed by HuggingFace's `tokenizers` library), using byte-level
+    Byte-Pair-Encoding.
 
-    Peculiarities:
-
-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:
 
     ::
 
-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import GPT2TokenizerFast
+        >>> tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
+        >>> tokenizer("Hello world")['input_ids']
+        [15496, 995]
+        >>> tokenizer(" Hello world")['input_ids']
+        [18435, 995]
+
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
 
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer needs to be instantiated with
+        ``add_prefix_space=True``.
+
+    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
     should refer to the superclass for more information regarding methods.
 
     Args:
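The behavior the new docstring describes is easiest to see end to end. Below is a minimal sketch of both paths: the default encoding, and a tokenizer instantiated with ``add_prefix_space=True``, which the note above requires before passing ``is_pretokenized=True``. The expected ids in the comments are the ones quoted in the docstring; the rest is illustrative and uses the commit-era keyword names.

from transformers import GPT2TokenizerFast

# Default: no prefix space, so a sentence-initial word gets a different id
# than the same word preceded by a space (ids quoted from the docstring above).
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
print(tokenizer("Hello world")["input_ids"])   # [15496, 995]
print(tokenizer(" Hello world")["input_ids"])  # [18435, 995]

# Instantiated with add_prefix_space=True, every input is treated as if it
# started with a space; this is also what the note requires before using
# is_pretokenized=True.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2", add_prefix_space=True)
print(tokenizer("Hello world")["input_ids"])   # now matches " Hello world"

# Pre-tokenized input: a flat list of words is one sequence, not a batch.
print(tokenizer(["Hello", "world"], is_pretokenized=True)["input_ids"])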

src/transformers/tokenization_roberta.py

Lines changed: 18 additions & 9 deletions

@@ -260,19 +260,28 @@ def prepare_for_tokenization(self, text, is_pretokenized=False, **kwargs):
 
 class RobertaTokenizerFast(GPT2TokenizerFast):
     """
-    Constructs a "Fast" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library).
+    Constructs a "Fast" RoBERTa BPE tokenizer (backed by HuggingFace's `tokenizers` library), derived from the GPT-2
+    tokenizer, using byte-level Byte-Pair-Encoding.
 
-    Peculiarities:
-
-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:
 
     ::
 
-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import RobertaTokenizerFast
+        >>> tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
+        >>> tokenizer("Hello world")['input_ids']
+        [0, 31414, 232, 328, 2]
+        >>> tokenizer(" Hello world")['input_ids']
+        [0, 20920, 232, 2]
+
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer needs to be instantiated with
+        ``add_prefix_space=True``.
 
     This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the methods. Users
     should refer to the superclass for more information regarding methods.
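As with GPT-2, here is a short sketch of the pre-tokenized path the note describes, assuming the same ``roberta-base`` checkpoint used in the docstring. RoBERTa wraps each sequence in ``<s>``/``</s>`` special tokens, which is why the quoted ids start with 0 and end with 2; exact printed output may vary by version.

from transformers import RobertaTokenizerFast

# Per the note above, pre-tokenized input requires the tokenizer to be
# instantiated with add_prefix_space=True.
tokenizer = RobertaTokenizerFast.from_pretrained(
    "roberta-base", add_prefix_space=True
)

# One pre-tokenized sequence is a flat list of words
# (a list of lists would be a batch).
encoding = tokenizer(["Hello", "world"], is_pretokenized=True)
print(encoding["input_ids"])  # <s> ... </s> wrapped around the word ids

# Round-trip: decode restores the text, including the prefix space that
# add_prefix_space inserted before the first word.
print(tokenizer.decode(encoding["input_ids"]))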
