@@ -30,10 +30,16 @@ into three components:
tagging, parsing, lemmatization and named entity recognition, or `dep` for
only tagging, parsing and lemmatization).
2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`.
- 3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf` (`sm`: no word
- vectors, `md`: reduced word vector table with 20k unique vectors for ~500k
- words, `lg`: large word vector table with ~500k entries, `trf`: transformer
- pipeline without static word vectors)
+ 3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf`.
+
+ `sm` and `trf` pipelines have no static word vectors.
+
+ For pipelines with default vectors, `md` has a reduced word vector table with
+ 20k unique vectors for ~500k words and `lg` has a large word vector table
+ with ~500k entries.
+
+ For pipelines with floret vectors, `md` vector tables have 50k entries and
+ `lg` vector tables have 200k entries.
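
As a quick way to see these sizes for an installed pipeline, you can inspect the
vector table directly. A minimal sketch, assuming `en_core_web_md` (default
vectors) is installed:

```python
import spacy

# Assumes en_core_web_md (an `md` pipeline with default vectors) is installed
nlp = spacy.load("en_core_web_md")

# (number of unique vector rows, vector width) - about 20k rows for `md`
print(nlp.vocab.vectors.shape)

# Number of words mapped onto those rows - roughly 500k keys
print(nlp.vocab.vectors.n_keys)
```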

For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
pipeline trained on written web text (blogs, news, comments), that includes
@@ -90,19 +96,42 @@ Main changes from spaCy v2 models:
In the `sm`/`md`/`lg` models:
- The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec`
- component.
+ component. If the lemmatizer is trainable (v3.3+), `lemmatizer` also listens
+ to `tok2vec`.
- The `attribute_ruler` maps `token.tag` to `token.pos` if there is no
`morphologizer`. The `attribute_ruler` additionally makes sure whitespace is
tagged consistently and copies `token.pos` to `token.tag` if there is no
tagger. For English, the attribute ruler can improve its mapping from
`token.tag` to `token.pos` if dependency parses from a `parser` are present,
but the parser is not required.
- - The `lemmatizer` component for many languages (Catalan, Dutch, English,
- French, Greek, Italian, Macedonian, Norwegian, Polish and Spanish) requires
- `token.pos` annotation from either `tagger`+`attribute_ruler` or
- `morphologizer`.
+ - The `lemmatizer` component for many languages requires `token.pos` annotation
+ from either `tagger`+`attribute_ruler` or `morphologizer`.
- The `ner` component is independent with its own internal tok2vec layer.
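
To see this layout for an installed CNN/CPU pipeline, you can list the
components and summarize what each one assigns and requires. A minimal sketch,
assuming `en_core_web_sm` is installed:

```python
import spacy

# Assumes en_core_web_sm is installed
nlp = spacy.load("en_core_web_sm")

# Typical component order for the sm/md/lg design described above:
# ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']
print(nlp.pipe_names)

# Summarize the attributes each component assigns and requires
nlp.analyze_pipes(pretty=True)
```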
+ #### CNN/CPU pipelines with floret vectors
+
+ The Finnish, Korean and Swedish `md` and `lg` pipelines use
+ [floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're
+ running a trained pipeline on texts and working with [`Doc`](/api/doc) objects,
+ you shouldn't notice any difference with floret vectors. With floret vectors no
+ tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will
+ return `False` for all tokens.
+
+ If you access vectors directly for similarity comparisons, there are a few
+ differences because floret vectors don't include a fixed word list like the
+ vector keys for default vectors.
+
+ - If your workflow iterates over the vector keys, you need to use an external
+ word list instead:
+
+ ```diff
+ - lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+ + lexemes = [nlp.vocab[word] for word in external_word_list]
+ ```
+
+ - [`Vectors.most_similar`](/api/vectors#most_similar) is not supported because
+ there's no fixed list of vectors to compare your vectors to.
+
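
As a sketch of how this behaves in practice, assuming the Swedish
`sv_core_news_md` pipeline (floret vectors) is installed, per-token vectors and
similarity work the same way as with default vectors, even without a fixed
vector key list:

```python
import spacy

# Assumes sv_core_news_md (floret vectors) is installed
nlp = spacy.load("sv_core_news_md")

doc = nlp("Hunden jagade katten genom parken.")
for token in doc:
    # Every token gets a subword-based vector, so nothing is out-of-vocabulary
    print(token.text, token.has_vector, token.is_oov)

# Token-to-token similarity works as usual
print(doc[0].similarity(doc[2]))
```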
### Transformer pipeline design {#design-trf}

In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
@@ -133,10 +162,14 @@ nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemma
<Infobox variant="warning" title="Rule-based and POS-lookup lemmatizers require
Token.pos">
- The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for
- Catalan, Dutch, English, French, Greek, Italian, Macedonian, Norwegian, Polish
- and Spanish. If you disable any of these components, you'll see lemmatizer
- warnings unless the lemmatizer is also disabled.
+ The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for a
+ number of languages. If you disable any of these components, you'll see
+ lemmatizer warnings unless the lemmatizer is also disabled.
+
+ **v3.3**: Catalan, English, French, Russian and Spanish
+
+ **v3.0-v3.2**: Catalan, Dutch, English, French, Greek, Italian, Macedonian,
+ Norwegian, Polish, Russian and Spanish
</Infobox>
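
For example (a sketch, assuming `en_core_web_sm` is installed), if you don't
need POS tags or lemmas, disable the lemmatizer together with the POS-providing
components so no warnings are raised:

```python
import spacy

# Assumes en_core_web_sm is installed; disabling the lemmatizer along with
# tagger and attribute_ruler avoids the missing-POS lemmatizer warnings
nlp = spacy.load("en_core_web_sm", disable=["tagger", "attribute_ruler", "lemmatizer"])
print(nlp.pipe_names)  # only the enabled components remain active
```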
@@ -154,10 +187,34 @@ nlp.enable_pipe("senter")
The `senter` component is ~10× faster than the parser and more accurate
than the rule-based `sentencizer`.
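
A short usage sketch (assuming `en_core_web_sm` is installed) that follows the
pattern above: exclude the parser, enable `senter` and iterate over the
resulting sentences:

```python
import spacy

# Assumes en_core_web_sm is installed; senter is included but disabled by default
nlp = spacy.load("en_core_web_sm", exclude=["parser"])
nlp.enable_pipe("senter")

doc = nlp("This is a sentence. This is another sentence.")
for sent in doc.sents:
    print(sent.text)
```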
+ #### Switch from trainable lemmatizer to default lemmatizer
+
+ Since v3.3, a number of pipelines use a trainable lemmatizer. You can check whether
+ the lemmatizer is trainable:
+
+ ```python
+ nlp = spacy.load("de_core_news_sm")
+ assert nlp.get_pipe("lemmatizer").is_trainable
+ ```
+
+ If you'd like to switch to a non-trainable lemmatizer that's similar to v3.2 or
+ earlier, you can replace the trainable lemmatizer with the default non-trainable
+ lemmatizer:
+
+ ```python
+ # Requirements: pip install spacy-lookups-data
+ nlp = spacy.load("de_core_news_sm")
+ # Remove existing lemmatizer
+ nlp.remove_pipe("lemmatizer")
+ # Add non-trainable lemmatizer from language defaults
+ # and load lemmatizer tables from spacy-lookups-data
+ nlp.add_pipe("lemmatizer").initialize()
+ ```
+
#### Switch from rule-based to lookup lemmatization

For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish
- pipelines, you can switch from the default rule-based lemmatizer to a lookup
+ pipelines, you can swap out a trainable or rule-based lemmatizer for a lookup
lemmatizer:

```python