62 | 62 | 
63 | 63 | class RobertaTokenizer(GPT2Tokenizer):
64 | 64 |     """
65 |    | -     Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer. Peculiarities:
   | 65 | +     Constructs a RoBERTa BPE tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.
66 | 66 | 
67 |    | -     - Byte-level Byte-Pair-Encoding
68 |    | -     - Requires a space to start the input string => the encoding methods should be called with the
69 |    | -       ``add_prefix_space`` flag set to ``True``.
70 |    | -       Otherwise, this tokenizer ``encode`` and ``decode`` method will not conserve
71 |    | -       the absence of a space at the beginning of a string:
   | 67 | +     This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will
   | 68 | +     be encoded differently whether it is at the beginning of the sentence (without space) or not:
72 | 69 | 
73 | 70 |     ::
74 | 71 | 
75 |    | -         tokenizer.decode(tokenizer.encode("Hello")) = " Hello"
   | 72 | +         >>> from transformers import RobertaTokenizer
   | 73 | +         >>> tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
   | 74 | +         >>> tokenizer("Hello world")['input_ids']
   | 75 | +         [0, 31414, 232, 2]
   | 76 | +         >>> tokenizer(" Hello world")['input_ids']
   | 77 | +         [0, 20920, 232, 2]
   | 78 | +
   | 79 | +     You can get around that behavior by passing ``add_prefix_space=True`` when instantiating this tokenizer or when you
   | 80 | +     call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.
   | 81 | +
   | 82 | +     .. note::
   | 83 | +
   | 84 | +         When used with ``is_pretokenized=True``, this tokenizer will add a space before each word (even the first one).
76 | 85 | 
77 | 86 |     This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the methods. Users
78 | 87 |     should refer to the superclass for more information regarding methods.
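
The ``add_prefix_space`` workaround described in the added docstring (new lines 79-80) can be exercised directly. A minimal sketch, assuming the ``roberta-base`` checkpoint from the example above and a transformers version matching this diff; the expected ids come from that example, where " Hello" encodes to 20920::

    from transformers import RobertaTokenizer

    # Instantiating with add_prefix_space=True makes a sentence-initial word
    # encode as if it were preceded by a space.
    tokenizer = RobertaTokenizer.from_pretrained("roberta-base", add_prefix_space=True)
    print(tokenizer("Hello world")["input_ids"])  # [0, 20920, 232, 2]

    # Per the docstring, the flag can also be passed per call on a default
    # tokenizer ("when you call it on some text"):
    default = RobertaTokenizer.from_pretrained("roberta-base")
    print(default("Hello world", add_prefix_space=True)["input_ids"])  # same ids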
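Likewise, the behavior stated in the added ``.. note::`` (new line 84) can be checked with pre-split input. Same assumptions as above; note that ``is_pretokenized`` was later renamed ``is_split_into_words`` in newer transformers releases, so this sketch targets the version this diff documents::

    from transformers import RobertaTokenizer

    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")

    # With is_pretokenized=True a space is added before each word, so the
    # first word should map to the space-prefixed id (20920 in the example
    # above), not the sentence-initial one (31414).
    print(tokenizer(["Hello", "world"], is_pretokenized=True)["input_ids"])
    # expected: [0, 20920, 232, 2]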