
Commit 9e147d3

sgugger, LysandreJik, and patil-suraj authored
Deprecate prepare_seq2seq_batch (huggingface#10287)
* Deprecate prepare_seq2seq_batch
* Fix last tests
* Apply suggestions from code review
* More review comments

Co-authored-by: Lysandre Debut <lysandre@huggingface.co>
Co-authored-by: Suraj Patil <surajp815@gmail.com>
1 parent e73a3e1 commit 9e147d3
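The change is mechanical across all 31 files: every `tokenizer.prepare_seq2seq_batch(...)` call becomes a plain tokenizer call for the source text, with the target text encoded inside the new `as_target_tokenizer` context manager. A minimal before/after sketch of the pattern (the en-de Marian checkpoint is only an illustration, borrowed from the tokenizer docstring updated below):

    from transformers import MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
    src_texts = ["I am a small frog.", "Tom asked his teacher for advice."]
    tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."]

    # Before (deprecated by this commit):
    # batch = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts=tgt_texts, return_tensors="pt")

    # After: encode the source text with the regular __call__ ...
    inputs = tokenizer(src_texts, return_tensors="pt", padding=True)
    # ... and encode the target text inside the as_target_tokenizer context manager
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(tgt_texts, return_tensors="pt", padding=True)
    inputs["labels"] = labels["input_ids"]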

31 files changed: +325 −320 lines

docs/source/model_doc/fsmt.rst

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ FSMTTokenizer

 .. autoclass:: transformers.FSMTTokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
-              create_token_type_ids_from_sequences, prepare_seq2seq_batch, save_vocabulary
+              create_token_type_ids_from_sequences, save_vocabulary


 FSMTModel

docs/source/model_doc/marian.rst

Lines changed: 33 additions & 30 deletions
@@ -76,27 +76,29 @@ require 3 character language codes:

 .. code-block:: python

-    from transformers import MarianMTModel, MarianTokenizer
-    src_text = [
-        '>>fra<< this is a sentence in english that we want to translate to french',
-        '>>por<< This should go to portuguese',
-        '>>esp<< And this to Spanish'
-    ]
+    >>> from transformers import MarianMTModel, MarianTokenizer
+    >>> src_text = [
+    ...     '>>fra<< this is a sentence in english that we want to translate to french',
+    ...     '>>por<< This should go to portuguese',
+    ...     '>>esp<< And this to Spanish'
+    ... ]

-    model_name = 'Helsinki-NLP/opus-mt-en-roa'
-    tokenizer = MarianTokenizer.from_pretrained(model_name)
-    print(tokenizer.supported_language_codes)
-    model = MarianMTModel.from_pretrained(model_name)
-    translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
-    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
-    # ["c'est une phrase en anglais que nous voulons traduire en français",
-    # 'Isto deve ir para o português.',
-    # 'Y esto al español']
+    >>> model_name = 'Helsinki-NLP/opus-mt-en-roa'
+    >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
+    >>> print(tokenizer.supported_language_codes)
+    ['>>zlm_Latn<<', '>>mfe<<', '>>hat<<', '>>pap<<', '>>ast<<', '>>cat<<', '>>ind<<', '>>glg<<', '>>wln<<', '>>spa<<', '>>fra<<', '>>ron<<', '>>por<<', '>>ita<<', '>>oci<<', '>>arg<<', '>>min<<']

+    >>> model = MarianMTModel.from_pretrained(model_name)
+    >>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
+    >>> [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
+    ["c'est une phrase en anglais que nous voulons traduire en français",
+     'Isto deve ir para o português.',
+     'Y esto al español']


-Code to see available pretrained models:
+
+Here is the code to see all available pretrained models on the hub:

 .. code-block:: python

@@ -147,21 +149,22 @@ Example of translating english to many romance languages, using old-style 2 char

 .. code-block::python

-    from transformers import MarianMTModel, MarianTokenizer
-    src_text = [
-        '>>fr<< this is a sentence in english that we want to translate to french',
-        '>>pt<< This should go to portuguese',
-        '>>es<< And this to Spanish'
-    ]
+    >>> from transformers import MarianMTModel, MarianTokenizer
+    >>> src_text = [
+    ...     '>>fr<< this is a sentence in english that we want to translate to french',
+    ...     '>>pt<< This should go to portuguese',
+    ...     '>>es<< And this to Spanish'
+    ... ]

-    model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
-    tokenizer = MarianTokenizer.from_pretrained(model_name)
-    print(tokenizer.supported_language_codes)
+    >>> model_name = 'Helsinki-NLP/opus-mt-en-ROMANCE'
+    >>> tokenizer = MarianTokenizer.from_pretrained(model_name)

-    model = MarianMTModel.from_pretrained(model_name)
-    translated = model.generate(**tokenizer.prepare_seq2seq_batch(src_text, return_tensors="pt"))
-    tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
-    # ["c'est une phrase en anglais que nous voulons traduire en français", 'Isto deve ir para o português.', 'Y esto al español']
+    >>> model = MarianMTModel.from_pretrained(model_name)
+    >>> translated = model.generate(**tokenizer(src_text, return_tensors="pt", padding=True))
+    >>> tgt_text = [tokenizer.decode(t, skip_special_tokens=True) for t in translated]
+    ["c'est une phrase en anglais que nous voulons traduire en français",
+     'Isto deve ir para o português.',
+     'Y esto al español']


@@ -176,7 +179,7 @@ MarianTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.MarianTokenizer
-    :members: prepare_seq2seq_batch
+    :members: as_target_tokenizer


 MarianModel
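The first hunk's trailing context stops right before the model-listing snippet itself. For reference, a sketch of one way to enumerate the Marian checkpoints, assuming the `huggingface_hub` `list_models` helper (this API is an assumption of the sketch, not part of the diff):

    from huggingface_hub import list_models

    # Marian checkpoints live under the Helsinki-NLP organization (assumed for this sketch)
    model_ids = [m.modelId for m in list_models(author="Helsinki-NLP")]
    print(f"{len(model_ids)} checkpoints, e.g. {model_ids[:3]}")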

docs/source/model_doc/mbart.rst

Lines changed: 28 additions & 19 deletions
@@ -34,22 +34,31 @@ The Authors' code can be found `here <https://github.com/pytorch/fairseq/tree/ma
 Training of MBart
 _______________________________________________________________________________________________________________________

-MBart is a multilingual encoder-decoder (seq-to-seq) model primarily intended for translation task. As the model is
-multilingual it expects the sequences in a different format. A special language id token is added in both the source
-and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The target
-text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.
+MBart is a multilingual encoder-decoder (sequence-to-sequence) model primarily intended for translation tasks. As the
+model is multilingual it expects the sequences in a different format. A special language id token is added to both the
+source and target text. The source text format is :obj:`X [eos, src_lang_code]` where :obj:`X` is the source text. The
+target text format is :obj:`[tgt_lang_code] X [eos]`. :obj:`bos` is never used.

-The :meth:`~transformers.MBartTokenizer.prepare_seq2seq_batch` handles this automatically and should be used to encode
-the sequences for sequence-to-sequence fine-tuning.
+The regular :meth:`~transformers.MBartTokenizer.__call__` encodes the source text format, and it should be wrapped
+inside the :meth:`~transformers.MBartTokenizer.as_target_tokenizer` context manager to encode the target text format.

 - Supervised training

 .. code-block::

-    example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
-    expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
-    batch = tokenizer.prepare_seq2seq_batch(example_english_phrase, src_lang="en_XX", tgt_lang="ro_RO", tgt_texts=expected_translation_romanian, return_tensors="pt")
-    model(input_ids=batch['input_ids'], labels=batch['labels']) # forward pass
+    >>> from transformers import MBartForConditionalGeneration, MBartTokenizer
+
+    >>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")
+    >>> example_english_phrase = "UN Chief Says There Is No Military Solution in Syria"
+    >>> expected_translation_romanian = "Şeful ONU declară că nu există o soluţie militară în Siria"
+
+    >>> inputs = tokenizer(example_english_phrase, return_tensors="pt")
+    >>> with tokenizer.as_target_tokenizer():
+    ...     labels = tokenizer(expected_translation_romanian, return_tensors="pt")
+
+    >>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
+    >>> # forward pass
+    >>> model(**inputs, labels=labels["input_ids"])

 - Generation

@@ -58,14 +67,14 @@ the sequences for sequence-to-sequence fine-tuning.

 .. code-block::

-    from transformers import MBartForConditionalGeneration, MBartTokenizer
-    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
-    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro")
-    article = "UN Chief Says There Is No Military Solution in Syria"
-    batch = tokenizer.prepare_seq2seq_batch(src_texts=[article], src_lang="en_XX", return_tensors="pt")
-    translated_tokens = model.generate(**batch, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
-    translation = tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
-    assert translation == "Şeful ONU declară că nu există o soluţie militară în Siria"
+    >>> from transformers import MBartForConditionalGeneration, MBartTokenizer
+
+    >>> model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")
+    >>> tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX")
+    >>> article = "UN Chief Says There Is No Military Solution in Syria"
+    >>> inputs = tokenizer(article, return_tensors="pt")
+    >>> translated_tokens = model.generate(**inputs, decoder_start_token_id=tokenizer.lang_code_to_id["ro_RO"])
+    >>> tokenizer.batch_decode(translated_tokens, skip_special_tokens=True)[0]
+    "Şeful ONU declară că nu există o soluţie militară în Siria"

 Overview of MBart-50

@@ -160,7 +169,7 @@ MBartTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.MBartTokenizer
-    :members: build_inputs_with_special_tokens, prepare_seq2seq_batch
+    :members: as_target_tokenizer, build_inputs_with_special_tokens


 MBartTokenizerFast
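A note on the supervised-training hunk above: once the encoded target ids are passed as `labels`, the forward pass returns the sequence-to-sequence cross-entropy loss directly, which is what fine-tuning needs. A minimal sketch reusing the example's checkpoint and sentences:

    from transformers import MBartForConditionalGeneration, MBartTokenizer

    tokenizer = MBartTokenizer.from_pretrained("facebook/mbart-large-en-ro", src_lang="en_XX", tgt_lang="ro_RO")
    model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-en-ro")

    inputs = tokenizer("UN Chief Says There Is No Military Solution in Syria", return_tensors="pt")
    with tokenizer.as_target_tokenizer():
        labels = tokenizer("Şeful ONU declară că nu există o soluţie militară în Siria", return_tensors="pt")

    # Passing labels makes the model compute the cross-entropy loss over the target ids
    outputs = model(**inputs, labels=labels["input_ids"])
    print(outputs.loss)  # scalar tensor; call .backward() on it during fine-tuning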

docs/source/model_doc/pegasus.rst

Lines changed: 14 additions & 14 deletions
@@ -78,20 +78,20 @@ Usage Example

 .. code-block:: python

-    from transformers import PegasusForConditionalGeneration, PegasusTokenizer
-    import torch
-    src_text = [
-        """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
-    ]
+    >>> from transformers import PegasusForConditionalGeneration, PegasusTokenizer
+    >>> import torch
+    >>> src_text = [
+    ...     """ PG&E stated it scheduled the blackouts in response to forecasts for high winds amid dry conditions. The aim is to reduce the risk of wildfires. Nearly 800 thousand customers were scheduled to be affected by the shutoffs which were expected to last through at least midday tomorrow."""
+    ... ]

-    model_name = 'google/pegasus-xsum'
-    torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
-    tokenizer = PegasusTokenizer.from_pretrained(model_name)
-    model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
-    batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest', return_tensors="pt").to(torch_device)
-    translated = model.generate(**batch)
-    tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
-    assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."
+    >>> model_name = 'google/pegasus-xsum'
+    >>> device = 'cuda' if torch.cuda.is_available() else 'cpu'
+    >>> tokenizer = PegasusTokenizer.from_pretrained(model_name)
+    >>> model = PegasusForConditionalGeneration.from_pretrained(model_name).to(device)
+    >>> batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors="pt").to(device)
+    >>> translated = model.generate(**batch)
+    >>> tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)
+    >>> assert tgt_text[0] == "California's largest electricity provider has turned off power to hundreds of thousands of customers."

@@ -107,7 +107,7 @@ PegasusTokenizer
 warning: ``add_tokens`` does not work at the moment.

 .. autoclass:: transformers.PegasusTokenizer
-    :members: __call__, prepare_seq2seq_batch
+    :members:


 PegasusTokenizerFast

docs/source/model_doc/rag.rst

Lines changed: 1 addition & 1 deletion
@@ -56,7 +56,7 @@ RagTokenizer
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 .. autoclass:: transformers.RagTokenizer
-    :members: prepare_seq2seq_batch
+    :members:


 Rag specific outputs

docs/source/model_doc/t5.rst

Lines changed: 1 addition & 1 deletion
@@ -104,7 +104,7 @@ T5Tokenizer

 .. autoclass:: transformers.T5Tokenizer
     :members: build_inputs_with_special_tokens, get_special_tokens_mask,
-              create_token_type_ids_from_sequences, prepare_seq2seq_batch, save_vocabulary
+              create_token_type_ids_from_sequences, save_vocabulary


 T5TokenizerFast

scripts/fsmt/fsmt-make-super-tiny-model.py

Lines changed: 1 addition & 1 deletion
@@ -71,7 +71,7 @@
 print(f"num of params {tiny_model.num_parameters()}")

 # Test
-batch = tokenizer.prepare_seq2seq_batch(["Making tiny model"], return_tensors="pt")
+batch = tokenizer(["Making tiny model"], return_tensors="pt")
 outputs = tiny_model(**batch)

 print("test output:", len(outputs.logits[0]))

scripts/fsmt/fsmt-make-tiny-model.py

Lines changed: 1 addition & 1 deletion
@@ -42,7 +42,7 @@
 print(f"num of params {tiny_model.num_parameters()}")

 # Test
-batch = tokenizer.prepare_seq2seq_batch(["Making tiny model"], return_tensors="pt")
+batch = tokenizer(["Making tiny model"], return_tensors="pt")
 outputs = tiny_model(**batch)

 print("test output:", len(outputs.logits[0]))

src/transformers/models/marian/modeling_marian.py

Lines changed: 6 additions & 5 deletions
@@ -522,13 +522,14 @@ def dummy_inputs(self):
     >>> src = 'fr' # source language
     >>> trg = 'en' # target language
     >>> sample_text = "où est l'arrêt de bus ?"
-    >>> mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
+    >>> model_name = f'Helsinki-NLP/opus-mt-{src}-{trg}'

-    >>> model = MarianMTModel.from_pretrained(mname)
-    >>> tok = MarianTokenizer.from_pretrained(mname)
-    >>> batch = tok.prepare_seq2seq_batch(src_texts=[sample_text], return_tensors="pt") # don't need tgt_text for inference
+    >>> model = MarianMTModel.from_pretrained(model_name)
+    >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
+    >>> batch = tokenizer([sample_text], return_tensors="pt")
     >>> gen = model.generate(**batch)
-    >>> words: List[str] = tok.batch_decode(gen, skip_special_tokens=True) # returns "Where is the bus stop ?"
+    >>> tokenizer.batch_decode(gen, skip_special_tokens=True)
+    "Where is the bus stop ?"
     """

 MARIAN_INPUTS_DOCSTRING = r"""

src/transformers/models/marian/modeling_tf_marian.py

Lines changed: 6 additions & 5 deletions
@@ -557,13 +557,14 @@ def serving(self, inputs):
     >>> src = 'fr' # source language
     >>> trg = 'en' # target language
     >>> sample_text = "où est l'arrêt de bus ?"
-    >>> mname = f'Helsinki-NLP/opus-mt-{src}-{trg}'
+    >>> model_name = f'Helsinki-NLP/opus-mt-{src}-{trg}'

-    >>> model = MarianMTModel.from_pretrained(mname)
-    >>> tok = MarianTokenizer.from_pretrained(mname)
-    >>> batch = tok.prepare_seq2seq_batch(src_texts=[sample_text], return_tensors="tf") # don't need tgt_text for inference
+    >>> model = TFMarianMTModel.from_pretrained(model_name)
+    >>> tokenizer = MarianTokenizer.from_pretrained(model_name)
+    >>> batch = tokenizer([sample_text], return_tensors="tf")
     >>> gen = model.generate(**batch)
-    >>> words: List[str] = tok.batch_decode(gen, skip_special_tokens=True) # returns "Where is the bus stop ?"
+    >>> tokenizer.batch_decode(gen, skip_special_tokens=True)
+    "Where is the bus stop ?"
     """

 MARIAN_INPUTS_DOCSTRING = r"""

src/transformers/models/marian/tokenization_marian.py

Lines changed: 7 additions & 4 deletions
@@ -80,12 +80,15 @@ class MarianTokenizer(PreTrainedTokenizer):
     Examples::

         >>> from transformers import MarianTokenizer
-        >>> tok = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
+        >>> tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-de')
         >>> src_texts = [ "I am a small frog.", "Tom asked his teacher for advice."]
         >>> tgt_texts = ["Ich bin ein kleiner Frosch.", "Tom bat seinen Lehrer um Rat."] # optional
-        >>> batch_enc = tok.prepare_seq2seq_batch(src_texts, tgt_texts=tgt_texts, return_tensors="pt")
-        >>> # keys [input_ids, attention_mask, labels].
-        >>> # model(**batch) should work
+        >>> inputs = tokenizer(src_texts, return_tensors="pt", padding=True)
+        >>> with tokenizer.as_target_tokenizer():
+        ...     labels = tokenizer(tgt_texts, return_tensors="pt", padding=True)
+        >>> inputs["labels"] = labels["input_ids"]
+        >>> # keys [input_ids, attention_mask, labels]
+        >>> outputs = model(**inputs) # should work
     """

 vocab_files_names = vocab_files_names
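The docstring's last line assumes a `model` it never defines; a hedged completion that makes the example fully runnable, pairing the tokenizer with the matching MarianMTModel checkpoint:

    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
    model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")

    inputs = tokenizer(["I am a small frog."], return_tensors="pt", padding=True)
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(["Ich bin ein kleiner Frosch."], return_tensors="pt", padding=True)
    inputs["labels"] = labels["input_ids"]

    outputs = model(**inputs)  # keys: input_ids, attention_mask, labels
    print(outputs.loss, outputs.logits.shape)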
