
Commit 58bdd86

Authored by polm, svlandeg, richardpaulhudson, Duygu Altinok, and HaakonME
Bump sudachipy version (explosion#9917)
* Edited Slovenian stop words list (explosion#9707)

* Noun chunks for Italian (explosion#9662)
  * added it vocab
  * copied portuguese
  * added possessive determiner
  * added conj'ed NPs
  * added nmod'ed NPs
  * test misc
  * more examples
  * fixed typo
  * fixed parenthesis
  * fixed comma
  * comma fix
  * added syntax iterators
  * fix some index problems
  * fixed index
  * corrected heads for test case
  * fixed test case
  * fixed determiner gender
  * cleaned up leftovers
  * added example with apostrophe

* French NP review (explosion#9667)
  * adapted from pt
  * added basic tests
  * added fr vocab
  * fixed noun chunks
  * more examples
  * typo fix
  * changed naming
  * changed the naming
  * typo fix

* Add Japanese kana characters to default exceptions (fix explosion#9693) (explosion#9742)

  This includes the main kana, or phonetic characters, used in Japanese. There are some supplemental kana blocks in Unicode outside the BMP that could also be included, but because their actual use is rare they are omitted for now, though maybe they should be added. The omitted blocks are:
  - Kana Supplement
  - Kana Extended (A and B)
  - Small Kana Extension

* Remove NER words from stop words in Norwegian (explosion#9820)

  The default stop words for Norwegian bokmål (nb) in spaCy contain important entities (e.g. France, Germany, Russia, Sweden and USA), a police district, important units of time (e.g. months and days of the week), and organisations. Nobody expects them among the default stop words, so there is a danger that users who follow the general recommendation to filter out stop words will unknowingly filter important entities out of their data. See the explanation in explosion#3052 (comment) and the comment in explosion#3052 (comment).

* Bump sudachipy version

* Update sudachipy versions

* Bump versions

  Bumping to the most recent dictionary just to keep things current. Bumping sudachipy to 0.5.2 because older versions don't support recent dictionaries.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Richard Hudson <richard@explosion.ai>
Co-authored-by: Duygu Altinok <duygu@explosion.ai>
Co-authored-by: Haakon Meland Eriksen <haakon.eriksen@far.no>
1 parent: a784b12 · commit: 58bdd86

File tree

10 files changed: +624 −162 lines


setup.cfg

Lines changed: 2 additions & 2 deletions
@@ -108,8 +108,8 @@ apple =
     thinc-apple-ops>=0.0.4,<1.0.0
 # Language tokenizers with external dependencies
 ja =
-    sudachipy>=0.4.9
-    sudachidict_core>=20200330
+    sudachipy>=0.5.2,!=0.6.1
+    sudachidict_core>=20211220
 ko =
     natto-py==0.9.0
 th =
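
With these pins, installing the `ja` extra resolves a sudachipy at or above 0.5.2 (skipping the excluded 0.6.1 release) together with a current dictionary. A minimal sanity check, a sketch rather than part of the commit, assuming the extra was installed with `pip install "spacy[ja]"` and that the common `packaging` utility library is available:

    # Sketch: confirm the resolved version satisfies the new constraint.
    from importlib.metadata import version  # Python 3.8+
    from packaging.version import Version

    # Distribution-name lookup; case-insensitive matching assumes a recent Python.
    v = Version(version("sudachipy"))
    assert v >= Version("0.5.2") and v != Version("0.6.1")
    print("sudachipy", v, "satisfies >=0.5.2,!=0.6.1")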

spacy/lang/char_classes.py

Lines changed: 5 additions & 0 deletions
@@ -45,6 +45,10 @@
 _hangul_jamo = r"\u1100-\u11FF"
 _hangul = _hangul_syllables + _hangul_jamo

+_hiragana = r"\u3040-\u309F"
+_katakana = r"\u30A0-\u30FFー"
+_kana = _hiragana + _katakana
+
 # letters with diacritics - Catalan, Czech, Latin, Latvian, Lithuanian, Polish, Slovak, Turkish, Welsh
 _latin_u_extendedA = (
     r"\u0100\u0102\u0104\u0106\u0108\u010A\u010C\u010E\u0110\u0112\u0114\u0116\u0118\u011A\u011C"
@@ -244,6 +248,7 @@
     + _tamil
     + _telugu
     + _hangul
+    + _kana
     + _cjk
 )
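
char_classes.py splices ranges like these into the grouped character classes that the tokenizer's prefix/suffix/infix rules use. A standalone sketch (not part of the diff) of how the new `_kana` range behaves when compiled into a character class:

    # Sketch: compile the new kana ranges the same way char_classes.py
    # builds its grouped alphabetic classes.
    import re

    _hiragana = r"\u3040-\u309F"
    _katakana = r"\u30A0-\u30FFー"  # includes the long-vowel mark explicitly
    _kana = _hiragana + _katakana

    kana_re = re.compile(f"[{_kana}]+")

    assert kana_re.fullmatch("ひらがな")    # hiragana block
    assert kana_re.fullmatch("カタカナー")  # katakana block, incl. ー
    assert not kana_re.fullmatch("漢字")    # ideographs belong to _cjk, not _kana
    print("kana ranges behave as expected")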

spacy/lang/fr/syntax_iterators.py

Lines changed: 60 additions & 12 deletions
@@ -6,16 +6,35 @@


 def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
-    """Detect base noun phrases from a dependency parse. Works on Doc and Span."""
-    # fmt: off
-    labels = ["nsubj", "nsubj:pass", "obj", "iobj", "ROOT", "appos", "nmod", "nmod:poss"]
-    # fmt: on
+    """
+    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
+    """
+    labels = [
+        "nsubj",
+        "nsubj:pass",
+        "obj",
+        "obl",
+        "obl:agent",
+        "obl:arg",
+        "obl:mod",
+        "nmod",
+        "pcomp",
+        "appos",
+        "ROOT",
+    ]
+    post_modifiers = ["flat", "flat:name", "flat:foreign", "fixed", "compound"]
     doc = doclike.doc  # Ensure works on both Doc and Span.
     if not doc.has_annotation("DEP"):
         raise ValueError(Errors.E029)
-    np_deps = [doc.vocab.strings[label] for label in labels]
-    conj = doc.vocab.strings.add("conj")
+    np_deps = {doc.vocab.strings.add(label) for label in labels}
+    np_modifs = {doc.vocab.strings.add(modifier) for modifier in post_modifiers}
     np_label = doc.vocab.strings.add("NP")
+    adj_label = doc.vocab.strings.add("amod")
+    det_label = doc.vocab.strings.add("det")
+    det_pos = doc.vocab.strings.add("DET")
+    adp_pos = doc.vocab.strings.add("ADP")
+    conj_label = doc.vocab.strings.add("conj")
+    conj_pos = doc.vocab.strings.add("CCONJ")
     prev_end = -1
     for i, word in enumerate(doclike):
         if word.pos not in (NOUN, PROPN, PRON):
@@ -24,16 +43,45 @@ def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
         if word.left_edge.i <= prev_end:
             continue
         if word.dep in np_deps:
-            prev_end = word.right_edge.i
-            yield word.left_edge.i, word.right_edge.i + 1, np_label
-        elif word.dep == conj:
+            right_childs = list(word.rights)
+            right_child = right_childs[0] if right_childs else None
+
+            if right_child:
+                if (
+                    right_child.dep == adj_label
+                ):  # allow chain of adjectives by expanding to right
+                    right_end = right_child.right_edge
+                elif (
+                    right_child.dep == det_label and right_child.pos == det_pos
+                ):  # cut relative pronouns here
+                    right_end = right_child
+                elif right_child.dep in np_modifs:  # Check if we can expand to right
+                    right_end = word.right_edge
+                else:
+                    right_end = word
+            else:
+                right_end = word
+            prev_end = right_end.i
+
+            left_index = word.left_edge.i
+            left_index = (
+                left_index + 1 if word.left_edge.pos == adp_pos else left_index
+            )
+
+            yield left_index, right_end.i + 1, np_label
+        elif word.dep == conj_label:
             head = word.head
-            while head.dep == conj and head.head.i < head.i:
+            while head.dep == conj_label and head.head.i < head.i:
                 head = head.head
             # If the head is an NP, and we're coordinated to it, we're an NP
             if head.dep in np_deps:
-                prev_end = word.right_edge.i
-                yield word.left_edge.i, word.right_edge.i + 1, np_label
+                prev_end = word.i
+
+                left_index = word.left_edge.i  # eliminate left attached conjunction
+                left_index = (
+                    left_index + 1 if word.left_edge.pos == conj_pos else left_index
+                )
+                yield left_index, word.i + 1, np_label


 SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
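
The net effect is that French chunks now cover oblique arguments and flat/fixed multiword names, expand rightward over adjective chains, and trim a leading adposition or conjunction. A hedged usage sketch, assuming the trained `fr_core_news_sm` package is installed; the exact spans depend on the parser's predictions:

    # Usage sketch -- requires: python -m spacy download fr_core_news_sm
    # Printed chunks are illustrative; boundaries depend on the parse.
    import spacy

    nlp = spacy.load("fr_core_news_sm")
    doc = nlp("Les grandes maisons du vieux village ont été vendues.")
    for chunk in doc.noun_chunks:
        print(chunk.text, "->", chunk.root.dep_)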

spacy/lang/it/__init__.py

Lines changed: 3 additions & 1 deletion
@@ -6,13 +6,15 @@
 from .punctuation import TOKENIZER_PREFIXES, TOKENIZER_INFIXES
 from ...language import Language, BaseDefaults
 from .lemmatizer import ItalianLemmatizer
+from .syntax_iterators import SYNTAX_ITERATORS


 class ItalianDefaults(BaseDefaults):
     tokenizer_exceptions = TOKENIZER_EXCEPTIONS
-    stop_words = STOP_WORDS
     prefixes = TOKENIZER_PREFIXES
     infixes = TOKENIZER_INFIXES
+    stop_words = STOP_WORDS
+    syntax_iterators = SYNTAX_ITERATORS


 class Italian(Language):
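
Registering SYNTAX_ITERATORS on ItalianDefaults is what exposes `doc.noun_chunks` for Italian at all; previously the language had no iterator and the property raised NotImplementedError. A small sketch of the guard in action on a blank pipeline (no trained model needed; E029 comes from the new iterator because a blank pipeline has no parser):

    # Sketch: blank Italian now has a noun_chunks iterator registered,
    # but without a dependency parse the iterator raises E029.
    import spacy

    nlp = spacy.blank("it")
    doc = nlp("Una frase qualsiasi.")
    try:
        list(doc.noun_chunks)
    except ValueError as err:
        print(err)  # [E029] ... requires the dependency parse ...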

spacy/lang/it/syntax_iterators.py

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
+from typing import Union, Iterator, Tuple
+
+from ...symbols import NOUN, PROPN, PRON
+from ...errors import Errors
+from ...tokens import Doc, Span
+
+
+def noun_chunks(doclike: Union[Doc, Span]) -> Iterator[Tuple[int, int, int]]:
+    """
+    Detect base noun phrases from a dependency parse. Works on both Doc and Span.
+    """
+    labels = [
+        "nsubj",
+        "nsubj:pass",
+        "obj",
+        "obl",
+        "obl:agent",
+        "nmod",
+        "pcomp",
+        "appos",
+        "ROOT",
+    ]
+    post_modifiers = ["flat", "flat:name", "fixed", "compound"]
+    dets = ["det", "det:poss"]
+    doc = doclike.doc  # Ensure works on both Doc and Span.
+    if not doc.has_annotation("DEP"):
+        raise ValueError(Errors.E029)
+    np_deps = {doc.vocab.strings.add(label) for label in labels}
+    np_modifs = {doc.vocab.strings.add(modifier) for modifier in post_modifiers}
+    np_label = doc.vocab.strings.add("NP")
+    adj_label = doc.vocab.strings.add("amod")
+    det_labels = {doc.vocab.strings.add(det) for det in dets}
+    det_pos = doc.vocab.strings.add("DET")
+    adp_label = doc.vocab.strings.add("ADP")
+    conj = doc.vocab.strings.add("conj")
+    conj_pos = doc.vocab.strings.add("CCONJ")
+    prev_end = -1
+    for i, word in enumerate(doclike):
+        if word.pos not in (NOUN, PROPN, PRON):
+            continue
+        # Prevent nested chunks from being produced
+        if word.left_edge.i <= prev_end:
+            continue
+        if word.dep in np_deps:
+            right_childs = list(word.rights)
+            right_child = right_childs[0] if right_childs else None
+
+            if right_child:
+                if (
+                    right_child.dep == adj_label
+                ):  # allow chain of adjectives by expanding to right
+                    right_end = right_child.right_edge
+                elif (
+                    right_child.dep in det_labels and right_child.pos == det_pos
+                ):  # cut relative pronouns here
+                    right_end = right_child
+                elif right_child.dep in np_modifs:  # Check if we can expand to right
+                    right_end = word.right_edge
+                else:
+                    right_end = word
+            else:
+                right_end = word
+            prev_end = right_end.i
+
+            left_index = word.left_edge.i
+            left_index = (
+                left_index + 1 if word.left_edge.pos == adp_label else left_index
+            )
+
+            yield left_index, right_end.i + 1, np_label
+        elif word.dep == conj:
+            head = word.head
+            while head.dep == conj and head.head.i < head.i:
+                head = head.head
+            # If the head is an NP, and we're coordinated to it, we're an NP
+            if head.dep in np_deps:
+                prev_end = word.i
+
+                left_index = word.left_edge.i  # eliminate left attached conjunction
+                left_index = (
+                    left_index + 1 if word.left_edge.pos == conj_pos else left_index
+                )
+                yield left_index, word.i + 1, np_label
+
+
+SYNTAX_ITERATORS = {"noun_chunks": noun_chunks}
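
With a trained pipeline the new iterator drives `doc.noun_chunks` directly. A hedged usage sketch, assuming the `it_core_news_sm` package is installed; output depends on the parser:

    # Usage sketch -- requires: python -m spacy download it_core_news_sm
    import spacy

    nlp = spacy.load("it_core_news_sm")
    doc = nlp("La bella casa di mio padre si trova in una piccola città.")
    for chunk in doc.noun_chunks:
        print(chunk.text, "->", chunk.root.dep_)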

spacy/lang/nb/stop_words.py

Lines changed: 13 additions & 17 deletions
@@ -4,46 +4,42 @@

 bak bare bedre beste blant ble bli blir blitt bris by både

-da dag de del dem den denne der dermed det dette disse drept du
+da dag de del dem den denne der dermed det dette disse du

 eller en enn er et ett etter

-fem fikk fire fjor flere folk for fortsatt fotball fra fram frankrike fredag
+fem fikk fire fjor flere folk for fortsatt fra fram
 funnet få får fått før først første

 gang gi gikk gjennom gjorde gjort gjør gjøre god godt grunn gå går

-ha hadde ham han hans har hele helt henne hennes her hun hva hvor hvordan
-hvorfor
+ha hadde ham han hans har hele helt henne hennes her hun

 i ifølge igjen ikke ingen inn

 ja jeg

 kamp kampen kan kl klart kom komme kommer kontakt kort kroner kunne kveld
-kvinner

-la laget land landet langt leder ligger like litt løpet lørdag
+la laget land landet langt leder ligger like litt løpet

-man mandag mange mannen mars med meg mellom men mener menn mennesker mens mer
-millioner minutter mot msci mye må mål måtte
+man mange med meg mellom men mener mennesker mens mer mot mye må mål måtte

-ned neste noe noen nok norge norsk norske ntb ny nye nå når
+ned neste noe noen nok ny nye nå når

-og også om onsdag opp opplyser oslo oss over
+og også om opp opplyser oss over

-personer plass poeng politidistrikt politiet president prosent
+personer plass poeng på

-regjeringen runde rundt russland
+runde rundt

-sa saken samme sammen samtidig satt se seg seks selv senere september ser sett
+sa saken samme sammen samtidig satt se seg seks selv senere ser sett
 siden sier sin sine siste sitt skal skriver skulle slik som sted stedet stor
-store står sverige svært så søndag
+store står svært så

-ta tatt tid tidligere til tilbake tillegg tirsdag to tok torsdag tre tror
-tyskland
+ta tatt tid tidligere til tilbake tillegg tok tror

-under usa ut uten utenfor
+under ut uten utenfor

 vant var ved veldig vi videre viktig vil ville viser vår være vært
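
The effect is easy to spot without a trained model, since STOP_WORDS ships with the language data. A sketch on a blank Norwegian pipeline:

    # Sketch: entity-like words removed by this commit are no longer
    # flagged as stop words; ordinary function words still are.
    import spacy

    nlp = spacy.blank("nb")
    for word in ["fredag", "frankrike", "usa", "og", "ikke"]:
        print(word, nlp.vocab[word].is_stop)
    # Expected after this change: only "og" and "ikke" print True.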
