
Commit 86f4986

polm authored and svlandeg committed
Clarify how to fill in init_tok2vec after pretraining (explosion#9639)
* Clarify how to fill in init_tok2vec after pretraining
* Ignore init_tok2vec arg in pretraining
* Update docs, config setting
* Remove obsolete note about not filling init_tok2vec early. This seems to have also caught some lines that needed cleanup.
1 parent 2cdc6e5 commit 86f4986

File tree

- spacy/training/pretrain.py
- website/docs/api/data-formats.md
- website/docs/usage/embeddings-transformers.md

3 files changed: +19 -20 lines changed

spacy/training/pretrain.py (2 additions & 0 deletions)

@@ -31,6 +31,8 @@ def pretrain(
     allocator = config["training"]["gpu_allocator"]
     if use_gpu >= 0 and allocator:
         set_gpu_allocator(allocator)
+    # ignore in pretraining because we're creating it now
+    config["initialize"]["init_tok2vec"] = None
     nlp = load_model_from_config(config)
     _config = nlp.config.interpolate()
     P = registry.resolve(_config["pretraining"], schema=ConfigSchemaPretrain)
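In effect, pretraining now discards any `initialize.init_tok2vec` value before the `nlp` object is built, since that file is the output of this very step. A minimal sketch of the guard, using a plain dict as a hypothetical stand-in for spaCy's resolved config object:

```python
# Hypothetical stand-in for the resolved config: a leftover init_tok2vec
# path that would only make sense for `spacy train`, not `spacy pretrain`.
config = {
    "training": {"gpu_allocator": None},
    "initialize": {"init_tok2vec": "pretrain/model4.bin"},
}

# The new guard: ignore in pretraining because we're creating it now.
config["initialize"]["init_tok2vec"] = None

assert config["initialize"]["init_tok2vec"] is None
```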

website/docs/api/data-formats.md (1 addition & 1 deletion)

@@ -248,7 +248,7 @@ Also see the usage guides on the
 | `after_init` | Optional callback to modify the `nlp` object after initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `before_init` | Optional callback to modify the `nlp` object before initialization. ~~Optional[Callable[[Language], Language]]~~ |
 | `components` | Additional arguments passed to the `initialize` method of a pipeline component, keyed by component name. If type annotations are available on the method, the config will be validated against them. The `initialize` methods will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Dict[str, Any]]~~ |
-| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ |
+| `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. Ignored when actually running pretraining, as you're creating the file to be used later. ~~Optional[str]~~ |
 | `lookups` | Additional lexeme and vocab data from [`spacy-lookups-data`](https://github.com/explosion/spacy-lookups-data). Defaults to `null`. ~~Optional[Lookups]~~ |
 | `tokenizer` | Additional arguments passed to the `initialize` method of the specified tokenizer. Can be used for languages like Chinese that depend on dictionaries or trained models for tokenization. If type annotations are available on the method, the config will be validated against them. The `initialize` method will always receive the `get_examples` callback and the current `nlp` object. ~~Dict[str, Any]~~ |
 | `vectors` | Name or path of pipeline containing pretrained word vectors to use, e.g. created with [`init vectors`](/api/cli#init-vectors). Defaults to `null`. ~~Optional[str]~~ |
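In other words, the setting is only consulted by `spacy train`; `spacy pretrain` now overwrites it with `None` internally, so it is safe to leave it filled in. A hedged sketch of the resulting two-step workflow (paths are illustrative, and `config.cfg` is assumed to already contain a valid `[pretraining]` block):

```python
import subprocess

# Step 1: pretraining writes pretrain/model0.bin ... pretrain/model4.bin.
# Any init_tok2vec value in config.cfg is ignored during this step.
subprocess.run(
    ["python", "-m", "spacy", "pretrain", "config.cfg", "pretrain/"],
    check=True,
)

# Step 2: training loads the chosen checkpoint via a config override.
subprocess.run(
    [
        "python", "-m", "spacy", "train", "config.cfg",
        "--paths.init_tok2vec", "pretrain/model4.bin",
    ],
    check=True,
)
```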

website/docs/usage/embeddings-transformers.md (16 additions & 19 deletions)

@@ -391,8 +391,8 @@ A wide variety of PyTorch models are supported, but some might not work. If a
 model doesn't seem to work feel free to open an
 [issue](https://github.com/explosion/spacy/issues). Additionally note that
 Transformers loaded in spaCy can only be used for tensors, and pretrained
-task-specific heads or text generation features cannot be used as part of
-the `transformer` pipeline component.
+task-specific heads or text generation features cannot be used as part of the
+`transformer` pipeline component.
 
 <Infobox variant="warning">
 

@@ -715,8 +715,8 @@ network for a temporary task that forces the model to learn something about
 sentence structure and word cooccurrence statistics.
 
 Pretraining produces a **binary weights file** that can be loaded back in at the
-start of training, using the configuration option `initialize.init_tok2vec`.
-The weights file specifies an initial set of weights. Training then proceeds as
+start of training, using the configuration option `initialize.init_tok2vec`. The
+weights file specifies an initial set of weights. Training then proceeds as
 normal.
 
 You can only pretrain one subnetwork from your pipeline at a time, and the

@@ -751,15 +751,14 @@ layer = "tok2vec"
 
 #### Connecting pretraining to training {#pretraining-training}
 
-To benefit from pretraining, your training step needs to know to initialize
-its `tok2vec` component with the weights learned from the pretraining step.
-You do this by setting `initialize.init_tok2vec` to the filename of the
-`.bin` file that you want to use from pretraining.
+To benefit from pretraining, your training step needs to know to initialize its
+`tok2vec` component with the weights learned from the pretraining step. You do
+this by setting `initialize.init_tok2vec` to the filename of the `.bin` file
+that you want to use from pretraining.
 
-A pretraining step that runs for 5 epochs with an output path of `pretrain/`,
-as an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`.
-To make use of the final output, you could fill in this value in your config
-file:
+A pretraining step that runs for 5 epochs with an output path of `pretrain/`, as
+an example, produces `pretrain/model0.bin` through `pretrain/model4.bin`. To
+make use of the final output, you could fill in this value in your config file:
 
 ```ini
 ### config.cfg

@@ -773,16 +772,14 @@ init_tok2vec = ${paths.init_tok2vec}
 
 <Infobox variant="warning">
 
-The outputs of `spacy pretrain` are not the same data format as the
-pre-packaged static word vectors that would go into
-[`initialize.vectors`](/api/data-formats#config-initialize).
-The pretraining output consists of the weights that the `tok2vec`
-component should start with in an existing pipeline, so it goes in
-`initialize.init_tok2vec`.
+The outputs of `spacy pretrain` are not the same data format as the pre-packaged
+static word vectors that would go into
+[`initialize.vectors`](/api/data-formats#config-initialize). The pretraining
+output consists of the weights that the `tok2vec` component should start with in
+an existing pipeline, so it goes in `initialize.init_tok2vec`.
 
 </Infobox>
 
-
 #### Pretraining objectives {#pretraining-objectives}
 
 > ```ini
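To make the "fill in this value" step concrete, here is a small sketch that writes the docs' `config.cfg` fragment with the final checkpoint filled in. It assumes thinc's `Config` class, which spaCy's config system builds on; a real training config would of course contain many more sections:

```python
from thinc.api import Config

# The config.cfg fragment from the docs, with paths.init_tok2vec pointing
# at the last checkpoint from the 5-epoch example (pretrain/model4.bin).
CFG = """
[paths]
init_tok2vec = "pretrain/model4.bin"

[initialize]
init_tok2vec = ${paths.init_tok2vec}
"""

# Keep the ${paths.init_tok2vec} reference unresolved when writing to disk.
config = Config().from_str(CFG, interpolate=False)
config.to_disk("config.cfg")
```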
