
Commit 497a708

Docs for v3.3 (explosion#10628)
* Temporarily disable CI tests
* Start v3.3 website updates
* Add trainable lemmatizer to pipeline design
* Fix Vectors.most_similar
* Add floret vector info to pipeline design
* Add Lower and Upper Sorbian
* Add span to sidebar
* Work on release notes
* Copy from release notes
* Update pipeline design graphic
* Upgrading note about Doc.from_docs
* Add tables and details
* Update website/docs/models/index.md

  Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix da lemma acc
* Add minimal intro, various updates
* Round lemma acc
* Add section on floret / word lists
* Add new pipelines table, minor edits
* Fix displacy spans example title
* Clarify adding non-trainable lemmatizer
* Update adding-languages URLs
* Revert "Temporarily disable CI tests"

  This reverts commit 1dee505.

* Spell out words/sec

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
1 parent 10377fb commit 497a708

10 files changed: +407 −82 lines changed


website/docs/api/doc.md

Lines changed: 1 addition & 1 deletion
@@ -621,7 +621,7 @@ relative clauses.
 
 To customize the noun chunk iterator in a loaded pipeline, modify
 [`nlp.vocab.get_noun_chunks`](/api/vocab#attributes). If the `noun_chunk`
-[syntax iterator](/usage/adding-languages#language-data) has not been
+[syntax iterator](/usage/linguistic-features#language-data) has not been
 implemented for the given language, a `NotImplementedError` is raised.
 
 > #### Example
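To see what the customization mentioned in this hunk looks like in practice, here's a minimal sketch of swapping in a custom noun chunk iterator (the `custom_noun_chunks` logic is a toy stand-in, not spaCy's built-in English rules):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline with a dependency parser

def custom_noun_chunks(doclike):
    # Syntax iterators yield (start, end, label) spans over token indices;
    # this toy version treats each nominal subject's subtree as one chunk
    for token in doclike:
        if token.dep_ == "nsubj":
            yield token.left_edge.i, token.i + 1, "NP"

# Set before processing: the Doc picks up the iterator from the vocab
nlp.vocab.get_noun_chunks = custom_noun_chunks
doc = nlp("The quick brown fox jumps over the lazy dog.")
print([chunk.text for chunk in doc.noun_chunks])  # ["The quick brown fox"]
```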

website/docs/api/span.md

Lines changed: 9 additions & 7 deletions
@@ -283,8 +283,9 @@ objects, if the document has been syntactically parsed. A base noun phrase, or
 it – so no NP-level coordination, no prepositional phrases, and no relative
 clauses.
 
-If the `noun_chunk` [syntax iterator](/usage/adding-languages#language-data) has
-not been implemeted for the given language, a `NotImplementedError` is raised.
+If the `noun_chunk` [syntax iterator](/usage/linguistic-features#language-data)
+has not been implemented for the given language, a `NotImplementedError` is
+raised.
 
 > #### Example
 >
@@ -520,12 +521,13 @@ sent = doc[sent.start : max(sent.end, span.end)]
 
 ## Span.sents {#sents tag="property" model="sentences" new="3.2.1"}
 
-Returns a generator over the sentences the span belongs to. This property is only available
-when [sentence boundaries](/usage/linguistic-features#sbd) have been set on the
-document by the `parser`, `senter`, `sentencizer` or some custom function. It
-will raise an error otherwise.
+Returns a generator over the sentences the span belongs to. This property is
+only available when [sentence boundaries](/usage/linguistic-features#sbd) have
+been set on the document by the `parser`, `senter`, `sentencizer` or some custom
+function. It will raise an error otherwise.
 
-If the span happens to cross sentence boundaries, all sentences the span overlaps with will be returned.
+If the span happens to cross sentence boundaries, all sentences the span
+overlaps with will be returned.
 
 > #### Example
 >
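To illustrate the cross-boundary behavior spelled out above, a quick sketch (any pipeline that sets sentence boundaries works; `en_core_web_sm` is just an example):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("First sentence. Second sentence here. Third one.")
# A span starting in the first sentence and ending in the second
span = doc[1:5]
# Span.sents yields every sentence the span overlaps with
print([sent.text for sent in span.sents])
```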

website/docs/api/vectors.md

Lines changed: 8 additions & 8 deletions
@@ -347,14 +347,14 @@ supported for `floret` mode.
 > most_similar = nlp.vocab.vectors.most_similar(queries, n=10)
 > ```
 
-| Name           | Description |
-| -------------- | --------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------- |
-| `queries`      | An array with one or more vectors. ~~numpy.ndarray~~ |
-| _keyword-only_ | |
-| `batch_size`   | The batch size to use. Default to `1024`. ~~int~~ |
-| `n`            | The number of entries to return for each query. Defaults to `1`. ~~int~~ |
-| `sort`         | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~ |
-| **RETURNS**    | tuple | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~ |
+| Name           | Description |
+| -------------- | ------------------------------------------------------------------------------------------------------------------------ |
+| `queries`      | An array with one or more vectors. ~~numpy.ndarray~~ |
+| _keyword-only_ | |
+| `batch_size`   | The batch size to use. Defaults to `1024`. ~~int~~ |
+| `n`            | The number of entries to return for each query. Defaults to `1`. ~~int~~ |
+| `sort`         | Whether to sort the entries returned by score. Defaults to `True`. ~~bool~~ |
+| **RETURNS**    | The most similar entries as a `(keys, best_rows, scores)` tuple. ~~Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]~~ |
 
 ## Vectors.get_batch {#get_batch tag="method" new="3.2"}
 
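A minimal sketch of calling `most_similar` with the parameters from the corrected table (assuming a pipeline with default, non-floret vectors such as `en_core_web_md`):

```python
import numpy
import spacy

nlp = spacy.load("en_core_web_md")
queries = numpy.vstack([nlp.vocab["cat"].vector, nlp.vocab["banana"].vector])
keys, best_rows, scores = nlp.vocab.vectors.most_similar(
    queries, batch_size=1024, n=5, sort=True
)
# keys has shape (num_queries, n); map the hashes back to strings
print([[nlp.vocab.strings[int(key)] for key in row] for row in keys])
```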

website/docs/images/pipeline-design.svg

Lines changed: 55 additions & 48 deletions

website/docs/models/index.md

Lines changed: 71 additions & 14 deletions
@@ -30,10 +30,16 @@ into three components:
    tagging, parsing, lemmatization and named entity recognition, or `dep` for
    only tagging, parsing and lemmatization).
 2. **Genre:** Type of text the pipeline is trained on, e.g. `web` or `news`.
-3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf` (`sm`: no word
-   vectors, `md`: reduced word vector table with 20k unique vectors for ~500k
-   words, `lg`: large word vector table with ~500k entries, `trf`: transformer
-   pipeline without static word vectors)
+3. **Size:** Package size indicator, `sm`, `md`, `lg` or `trf`.
+
+   `sm` and `trf` pipelines have no static word vectors.
+
+   For pipelines with default vectors, `md` has a reduced word vector table with
+   20k unique vectors for ~500k words and `lg` has a large word vector table
+   with ~500k entries.
+
+   For pipelines with floret vectors, `md` vector tables have 50k entries and
+   `lg` vector tables have 200k entries.
 
 For example, [`en_core_web_sm`](/models/en#en_core_web_sm) is a small English
 pipeline trained on written web text (blogs, news, comments), that includes
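To make the naming scheme concrete, a small sketch (assuming the example packages are installed; the `meta["vectors"]` field is part of the standard pipeline meta):

```python
import spacy

# lang_type_genre_size: English, core components, web text, small/medium
nlp_sm = spacy.load("en_core_web_sm")  # sm: no static word vectors
nlp_md = spacy.load("en_core_web_md")  # md: reduced word vector table
print(nlp_sm.meta["vectors"])
print(nlp_md.meta["vectors"])
```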
@@ -90,19 +96,42 @@ Main changes from spaCy v2 models:
 In the `sm`/`md`/`lg` models:
 
 - The `tagger`, `morphologizer` and `parser` components listen to the `tok2vec`
-  component.
+  component. If the lemmatizer is trainable (v3.3+), `lemmatizer` also listens
+  to `tok2vec`.
 - The `attribute_ruler` maps `token.tag` to `token.pos` if there is no
   `morphologizer`. The `attribute_ruler` additionally makes sure whitespace is
   tagged consistently and copies `token.pos` to `token.tag` if there is no
   tagger. For English, the attribute ruler can improve its mapping from
   `token.tag` to `token.pos` if dependency parses from a `parser` are present,
   but the parser is not required.
-- The `lemmatizer` component for many languages (Catalan, Dutch, English,
-  French, Greek, Italian Macedonian, Norwegian, Polish and Spanish) requires
-  `token.pos` annotation from either `tagger`+`attribute_ruler` or
-  `morphologizer`.
+- The `lemmatizer` component for many languages requires `token.pos` annotation
+  from either `tagger`+`attribute_ruler` or `morphologizer`.
 - The `ner` component is independent with its own internal tok2vec layer.
 
+#### CNN/CPU pipelines with floret vectors
+
+The Finnish, Korean and Swedish `md` and `lg` pipelines use
+[floret vectors](/usage/v3-2#vectors) instead of default vectors. If you're
+running a trained pipeline on texts and working with [`Doc`](/api/doc) objects,
+you shouldn't notice any difference with floret vectors. With floret vectors no
+tokens are out-of-vocabulary, so [`Token.is_oov`](/api/token#attributes) will
+return `False` for all tokens.
+
+If you access vectors directly for similarity comparisons, there are a few
+differences because floret vectors don't include a fixed word list like the
+vector keys for default vectors.
+
+- If your workflow iterates over the vector keys, you need to use an external
+  word list instead:
+
+  ```diff
+  - lexemes = [nlp.vocab[orth] for orth in nlp.vocab.vectors]
+  + lexemes = [nlp.vocab[word] for word in external_word_list]
+  ```
+
+- [`Vectors.most_similar`](/api/vectors#most_similar) is not supported because
+  there's no fixed list of vectors to compare your vectors to.
+
 ### Transformer pipeline design {#design-trf}
 
 In the transformer (`trf`) models, the `tagger`, `parser` and `ner` (if present)
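As a sketch of the floret workflow the hunk above describes (assuming the Finnish `fi_core_news_md` pipeline is installed; the word list is a stand-in for a real external list):

```python
import spacy

nlp = spacy.load("fi_core_news_md")  # md pipeline with floret vectors

# There's no fixed vector key list, so iterate an external word list
external_word_list = ["kissa", "koira", "talo"]
lexemes = [nlp.vocab[word] for word in external_word_list]

# Subword-based vectors exist for any string, so pairwise similarity
# still works, even for words never seen in training
print(lexemes[0].similarity(lexemes[1]))
```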
@@ -133,10 +162,14 @@ nlp = spacy.load("en_core_web_trf", disable=["tagger", "attribute_ruler", "lemma
 <Infobox variant="warning" title="Rule-based and POS-lookup lemmatizers require
 Token.pos">
 
-The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for
-Catalan, Dutch, English, French, Greek, Italian, Macedonian, Norwegian, Polish
-and Spanish. If you disable any of these components, you'll see lemmatizer
-warnings unless the lemmatizer is also disabled.
+The lemmatizer depends on `tagger`+`attribute_ruler` or `morphologizer` for a
+number of languages. If you disable any of these components, you'll see
+lemmatizer warnings unless the lemmatizer is also disabled.
+
+**v3.3**: Catalan, English, French, Russian and Spanish
+
+**v3.0-v3.2**: Catalan, Dutch, English, French, Greek, Italian, Macedonian,
+Norwegian, Polish, Russian and Spanish
 
 </Infobox>
 
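The pattern the infobox implies, mirroring the `en_core_web_trf` example in the hunk header: if you disable the components that produce `token.pos`, disable the lemmatizer along with them (a sketch, using `en_core_web_sm` as the example package):

```python
import spacy

# Disabling tagger/attribute_ruler alone would trigger lemmatizer
# warnings about missing token.pos, so disable all three together
nlp = spacy.load(
    "en_core_web_sm",
    disable=["tagger", "attribute_ruler", "lemmatizer"],
)
print(nlp.pipe_names)
```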
@@ -154,10 +187,34 @@ nlp.enable_pipe("senter")
 The `senter` component is ~10&times; faster than the parser and more accurate
 than the rule-based `sentencizer`.
 
+#### Switch from trainable lemmatizer to default lemmatizer
+
+Since v3.3, a number of pipelines use a trainable lemmatizer. You can check whether
+the lemmatizer is trainable:
+
+```python
+nlp = spacy.load("de_core_news_sm")
+assert nlp.get_pipe("lemmatizer").is_trainable
+```
+
+If you'd like to switch to a non-trainable lemmatizer that's similar to v3.2 or
+earlier, you can replace the trainable lemmatizer with the default non-trainable
+lemmatizer:
+
+```python
+# Requirements: pip install spacy-lookups-data
+nlp = spacy.load("de_core_news_sm")
+# Remove existing lemmatizer
+nlp.remove_pipe("lemmatizer")
+# Add non-trainable lemmatizer from language defaults
+# and load lemmatizer tables from spacy-lookups-data
+nlp.add_pipe("lemmatizer").initialize()
+```
+
 #### Switch from rule-based to lookup lemmatization
 
 For the Dutch, English, French, Greek, Macedonian, Norwegian and Spanish
-pipelines, you can switch from the default rule-based lemmatizer to a lookup
+pipelines, you can swap out a trainable or rule-based lemmatizer for a lookup
 lemmatizer:
 
 ```python
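The diff is truncated at the opening of that code block; for context, the lookup switch it introduces looks roughly like this (a sketch assuming `spacy-lookups-data` is installed, with `en_core_web_sm` as the example package):

```python
# Requirements: pip install spacy-lookups-data
import spacy

nlp = spacy.load("en_core_web_sm")
# Drop the current lemmatizer and add one in table-based "lookup" mode,
# which doesn't require token.pos annotation
nlp.remove_pipe("lemmatizer")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"}).initialize()
```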
