Commit b2f8232

Add examples

1 parent 81c5b84 commit b2f8232

EXAMPLES.md
# Examples

This file provides instructions on how to train parsing models from scratch and evaluate them.
Some basic information is already given in the [`README`](README.md).
Here we describe the commands and other settings in detail.

## Dependency Parsing

Below are examples of training `biaffine` and `crf2o` dependency parsers on PTB.
```sh
# biaffine
$ python -u -m supar.cmds.biaffine_dep train -b -d 0 -c biaffine-dep-en -p model -f char \
    --train ptb/train.conllx \
    --dev ptb/dev.conllx \
    --test ptb/test.conllx \
    --embed glove.6B.100d.txt \
    --unk unk
# crf2o
$ python -u -m supar.cmds.crf2o_dep train -b -d 0 -c crf2o-dep-en -p model -f char \
    --train ptb/train.conllx \
    --dev ptb/dev.conllx \
    --test ptb/test.conllx \
    --embed glove.6B.100d.txt \
    --unk unk \
    --mbr \
    --proj
```
The option `-c` controls where to load predefined configs from; you can specify either a local file path or the short name of a pretrained model.
For CRF models, you need to specify `--proj` to remove non-projective trees.
Specifying `--mbr` to perform MBR decoding often leads to consistent improvements.
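
For example, `-c` could point to a local config file instead of a predefined short name. A minimal sketch, where `path/to/config.ini` is only a placeholder for your own config file:

```sh
# same as the biaffine command above, but loading configs from a local file
$ python -u -m supar.cmds.biaffine_dep train -b -d 0 -c path/to/config.ini -p model -f char \
    --train ptb/train.conllx \
    --dev ptb/dev.conllx \
    --test ptb/test.conllx \
    --embed glove.6B.100d.txt \
    --unk unk
```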

The model finetuned on [`roberta-large`](https://huggingface.co/roberta-large) achieves nearly state-of-the-art performance on English dependency parsing.
Here we provide some recommended hyper-parameters (not necessarily the best, but good enough).
You can set values of registered/unregistered parameters on the command line to override the default configs in the file.
```sh
$ python -u -m supar.cmds.biaffine_dep train -b -d 0 -c biaffine-dep-roberta-en -p model \
    --train ptb/train.conllx \
    --dev ptb/dev.conllx \
    --test ptb/test.conllx \
    --encoder=bert \
    --bert=roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
```
The pretrained multilingual model `biaffine-dep-xlmr` takes [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large) as its backbone architecture and finetunes it.
The training command is as follows:
```sh
$ python -u -m supar.cmds.biaffine_dep train -b -d 0 -c biaffine-dep-xlmr -p model \
    --train ud2.3/train.conllx \
    --dev ud2.3/dev.conllx \
    --test ud2.3/test.conllx \
    --encoder=bert \
    --bert=xlm-roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
```
To evaluate:
```sh
# biaffine
python -u -m supar.cmds.biaffine_dep evaluate -d 0 -p biaffine-dep-en --data ptb/test.conllx --tree --proj
# crf2o
python -u -m supar.cmds.crf2o_dep evaluate -d 0 -p crf2o-dep-en --data ptb/test.conllx --mbr --tree --proj
```
`--tree` and `--proj` ensure that the output trees are well-formed and projective, respectively.

The commands for training and evaluating Chinese models are similar, except that you need to specify `--punct` to include punctuation.
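For instance, a Chinese dependency model could be evaluated with a command along these lines (a sketch only: the model name `biaffine-dep-zh` and the CTB data path are assumptions to adapt to your own setup):

```sh
python -u -m supar.cmds.biaffine_dep evaluate -d 0 -p biaffine-dep-zh --data ctb7/test.conllx --punct --tree --proj
```
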
## Constituency Parsing

The command for training a `crf` constituency parser is simple.
We follow the instructions of [Benepar](https://github.com/nikitakit/self-attentive-parser) to preprocess the data.

To train a BiLSTM-based model:
```sh
$ python -u -m supar.cmds.crf_con train -b -d 0 -c crf-con-en -p model -f char --mbr \
    --train ptb/train.pid \
    --dev ptb/dev.pid \
    --test ptb/test.pid \
    --embed glove.6B.100d.txt \
    --unk unk
```
To finetune [`roberta-large`](https://huggingface.co/roberta-large):
```sh
$ python -u -m supar.cmds.crf_con train -b -d 0 -c crf-con-roberta-en -p model \
    --train ptb/train.pid \
    --dev ptb/dev.pid \
    --test ptb/test.pid \
    --encoder=bert \
    --bert=roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
```
The command for finetuning [`xlm-roberta-large`](https://huggingface.co/xlm-roberta-large) on the merged treebanks of 9 languages in the SPMRL dataset is:
```sh
$ python -u -m supar.cmds.crf_con train -b -d 0 -c crf-con-xlmr -p model \
    --train spmrl/train.pid \
    --dev spmrl/dev.pid \
    --test spmrl/test.pid \
    --encoder=bert \
    --bert=xlm-roberta-large \
    --lr=5e-5 \
    --lr-rate=20 \
    --batch-size=5000 \
    --epochs=10 \
    --update-steps=4
```
Unlike the conventional practice of executing `EVALB`, we internally integrate Python code for constituency tree evaluation.
As different treebanks do not share the same evaluation parameters, it is recommended to evaluate the results in interactive mode.

To evaluate English and Chinese models:
```py
>>> from supar import Parser
>>> Parser.load('crf-con-en').evaluate('ptb/test.pid',
                                       delete={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''},
                                       equal={'ADVP': 'PRT'},
                                       verbose=False)
(0.21318972731630007, UCM: 50.08% LCM: 47.56% UP: 94.89% UR: 94.71% UF: 94.80% LP: 94.16% LR: 93.98% LF: 94.07%)
>>> Parser.load('crf-con-zh').evaluate('ctb7/test.pid',
                                       delete={'TOP', 'S1', '-NONE-', ',', ':', '``', "''", '.', '?', '!', ''},
                                       equal={'ADVP': 'PRT'},
                                       verbose=False)
(0.3994724107416053, UCM: 24.96% LCM: 23.39% UP: 90.88% UR: 90.47% UF: 90.68% LP: 88.82% LR: 88.42% LF: 88.62%)
```
To evaluate the multilingual model:
```py
>>> Parser.load('crf-con-xlmr').evaluate('spmrl/eu/test.pid',
                                         delete={'TOP', 'ROOT', 'S1', '-NONE-', 'VROOT'},
                                         equal={},
                                         verbose=False)
(0.45620645582675934, UCM: 53.07% LCM: 48.10% UP: 94.74% UR: 95.53% UF: 95.14% LP: 93.29% LR: 94.07% LF: 93.68%)
```
## Semantic Dependency Parsing

The raw semantic dependency parsing datasets do not follow the `conllu` format.
We follow [Second_Order_SDP](https://github.com/wangxinyu0922/Second_Order_SDP) to preprocess the data into the format shown in the following example.
```txt
#20001001
1 Pierre Pierre _ NNP _ 2 nn _ _
2 Vinken _generic_proper_ne_ _ NNP _ 9 nsubj 1:compound|6:ARG1|9:ARG1 _
3 , _ _ , _ 2 punct _ _
4 61 _generic_card_ne_ _ CD _ 5 num _ _
5 years year _ NNS _ 6 npadvmod 4:ARG1 _
6 old old _ JJ _ 2 amod 5:measure _
7 , _ _ , _ 2 punct _ _
8 will will _ MD _ 9 aux _ _
9 join join _ VB _ 0 root 0:root|12:ARG1|17:loc _
10 the the _ DT _ 11 det _ _
11 board board _ NN _ 9 dobj 9:ARG2|10:BV _
12 as as _ IN _ 9 prep _ _
13 a a _ DT _ 15 det _ _
14 nonexecutive _generic_jj_ _ JJ _ 15 amod _ _
15 director director _ NN _ 12 pobj 12:ARG2|13:BV|14:ARG1 _
16 Nov. Nov. _ NNP _ 9 tmod _ _
17 29 _generic_dom_card_ne_ _ CD _ 16 num 16:of _
18 . _ _ . _ 9 punct _ _
```
By default, BiLSTM-based semantic dependency parsing models take POS tag, lemma, and character embeddings as model inputs.
Below are examples of training `biaffine` and `vi` semantic dependency parsing models:
```sh
# biaffine
$ python -u -m supar.cmds.biaffine_sdp train -b -c biaffine-sdp-en -d 0 -f tag char lemma -p model \
    --train dm/train.conllu \
    --dev dm/dev.conllu \
    --test dm/test.conllu \
    --embed glove.6B.100d.txt \
    --unk unk
# vi
$ python -u -m supar.cmds.vi_sdp train -b -c vi-sdp-en -d 1 -f tag char lemma -p model \
    --train dm/train.conllu \
    --dev dm/dev.conllu \
    --test dm/test.conllu \
    --embed glove.6B.100d.txt \
    --unk unk \
    --inference mfvi
```
To finetune [`roberta-large`](https://huggingface.co/roberta-large):
```sh
$ python -u -m supar.cmds.biaffine_sdp train -b -d 0 -c biaffine-sdp-roberta-en -p model \
    --train dm/train.conllu \
    --dev dm/dev.conllu \
    --test dm/test.conllu \
    --encoder=bert \
    --bert=roberta-large \
    --lr=5e-5 \
    --lr-rate=1 \
    --batch-size=500 \
    --epochs=10 \
    --update-steps=1
```
To evaluate:
```sh
python -u -m supar.cmds.biaffine_sdp evaluate -d 0 -p biaffine-sdp-en --data dm/test.conllu
```
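
The `vi` model can be evaluated in the same way; a sketch, assuming a model saved or released under the name `vi-sdp-en`:

```sh
python -u -m supar.cmds.vi_sdp evaluate -d 0 -p vi-sdp-en --data dm/test.conllu
```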
