Commit 21f28c3

1 parent 9d9b872 commit 21f28c3

2 files changed: +32 -14 lines changed

src/transformers/tokenization_gpt2.py

Lines changed: 16 additions & 7 deletions

@@ -102,17 +102,26 @@ def get_pairs(word):
 
 class GPT2Tokenizer(PreTrainedTokenizer):
     """
-    GPT-2 BPE tokenizer. Peculiarities:
+    GPT-2 BPE tokenizer, using byte-level Byte-Pair-Encoding.
 
-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:
 
     ::
 
-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import GPT2Tokenizer
+        >>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
+        >>> tokenizer("Hello world")['input_ids']
+        [15496, 995]
+        >>> tokenizer(" Hello world")['input_ids']
+        [18435, 995]
+
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
     should refer to the superclass for more information regarding methods.
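The prefix-space behavior the new docstring describes can be illustrated with a toy sketch. The vocabulary below is hypothetical — it contains only the ids quoted in the doctest above, not the real GPT-2 merges table — and each word is assumed to map to a single token:

```python
# Toy byte-level BPE sketch. TOY_VOCAB is hypothetical: it holds only the ids
# quoted in the docstring above, not the real GPT-2 vocabulary.
# 'Ġ' (U+0120) is the byte-level marker GPT-2 uses for a preceding space.
TOY_VOCAB = {
    "Hello": 15496,        # word at the very start of the text, no space
    "\u0120Hello": 18435,  # the same word preceded by a space
    "\u0120world": 995,
}

def toy_encode(text, add_prefix_space=False):
    """Encode text with the toy vocabulary, treating spaces as part of tokens."""
    if add_prefix_space:
        text = " " + text
    pieces = text.split(" ")
    tokens = []
    if pieces[0]:  # non-empty first piece: the text did not start with a space
        tokens.append(pieces[0])
    # every later piece was preceded by a space, so it carries the 'Ġ' marker
    tokens.extend("\u0120" + piece for piece in pieces[1:])
    return [TOY_VOCAB[token] for token in tokens]

print(toy_encode("Hello world"))                         # [15496, 995]
print(toy_encode(" Hello world"))                        # [18435, 995]
print(toy_encode("Hello world", add_prefix_space=True))  # [18435, 995]
```

Because the space is folded into the following token, "Hello" at the start of a string and " Hello" after a space get different ids — which is exactly why `add_prefix_space=True` changes the encoding.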

src/transformers/tokenization_roberta.py

Lines changed: 16 additions & 7 deletions

@@ -62,17 +62,26 @@
 
 class RobertaTokenizer(GPT2Tokenizer):
     """
-    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities:
+    Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.
 
-    - Byte-level Byte-Pair-Encoding
-    - Requires a space to start the input string => the encoding methods should be called with the
-      ``add_prefix_space`` flag set to ``True``.
-      Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
-      the absence of a space at the beginning of a string:
+    This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
+    be encoded differently whether it is at the beginning of the sentence (without space) or not:
 
     ::
 
-        tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
+        >>> from transformers import RobertaTokenizer
+        >>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
+        >>> tokenizer("Hello world")['input_ids']
+        [0, 31414, 232, 2]
+        >>> tokenizer(" Hello world")['input_ids']
+        [0, 20920, 232, 2]
+
+    You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
+    call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
+
+    .. note::
+
+        When used with ``is_pretokenized=True``, this tokenizer will add a space before each word (even the first one).
 
     This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
     should refer to the superclass for more information regarding methods.
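The `is_pretokenized=True` behavior called out in the `.. note::` can be sketched the same way. The vocabulary and helper below are hypothetical (reusing the GPT-2 ids quoted in the first file's docstring), and each word is assumed to map to a single token:

```python
# Hypothetical toy vocabulary reusing the byte-level ids quoted in the GPT-2
# docstring above; 'Ġ' (U+0120) marks a preceding space.
TOY_VOCAB = {"\u0120Hello": 18435, "\u0120world": 995}

def toy_encode_pretokenized(words):
    """Mimic is_pretokenized=True: a space is prepended to every word,
    including the first one, so each token carries the 'Ġ' marker."""
    return [TOY_VOCAB["\u0120" + word] for word in words]

print(toy_encode_pretokenized(["Hello", "world"]))  # [18435, 995]
```

Note that the first word gets the space-marked id 18435 here, not 15496 — with pretokenized input there is no way to tell whether the original text started with a space, so the tokenizer assumes every word was space-preceded.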
