
Commit e779bb1

Author: zysite
Commit message: Update the results
1 parent 88b1d03 commit e779bb1

File tree: 1 file changed, +107 −54 lines

README.md

Lines changed: 107 additions & 54 deletions
@@ -7,16 +7,18 @@
An implementation of "Deep Biaffine Attention for Neural Dependency Parsing".
- Details and [hyperparameter choices](#Hyperparameters) are almost identical to those described in the paper, except that we do not provide a decoding algorithm to ensure well-formedness, which does not seriously affect the results.
+ Details and [hyperparameter choices](#Hyperparameters) are almost identical to those described in the paper,
+ except that we provide the Eisner algorithm, rather than MST, to ensure well-formedness.
+ In practice, projective decoding like Eisner is the best choice, since PTB contains almost exclusively (99.9%) projective trees.
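For intuition on the projectivity statistic: an arc (h, d) is projective iff every token strictly between h and d is a descendant of h. Below is a minimal Python sketch of such a check (a hypothetical helper, not code from this repo; `heads` gives each token's 1-based head index, with 0 for the root):

```python
def is_projective(heads):
    """Return True iff the dependency tree is projective.

    heads[i] is the head of token i + 1 (1-based indices, 0 = root).
    An arc (h, d) is projective iff every token strictly between
    h and d is a descendant of h. Assumes `heads` encodes a valid tree.
    """
    for d, h in enumerate(heads, 1):
        lo, hi = sorted((h, d))
        for t in range(lo + 1, hi):
            a = t
            while a not in (0, h):  # walk up until we reach h or the root
                a = heads[a - 1]
            if a != h:
                return False
    return True
```

For example, the tree predicted for the `naive.conllx` sentence below, with heads (2, 0, 4, 2, 2, 7, 2, 2), passes this check.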

- Another version of the implementation is available on [char](https://github.com/zysite/biaffine-parser/tree/char) branch, which replaces the tag embedding with char lstm and achieves better performance.
+ Besides the basic implementation, we also provide other features to replace the POS tags (TAG),
+ i.e., character-level embeddings (CHAR) and BERT.

## Requirements

- ```txt
- python == 3.7.0
- pytorch == 1.0.0
- ```
+ * `python`: 3.7.0
+ * [`pytorch`](https://github.com/pytorch/pytorch): 1.3.0
+ * [`transformers`](https://github.com/huggingface/transformers): 2.1.1

## Datasets

@@ -30,22 +32,28 @@ For all datasets, we follow the conventional data splits:

## Performance

- |               |  UAS  |  LAS  |
- | ------------- | :---: | :---: |
- | tag embedding | 95.85 | 94.14 |
- | char lstm     | 96.02 | 94.38 |
+ | FEAT          |  UAS  |  LAS  | Speed (Sents/s) |
+ | ------------- | :---: | :---: | :-------------: |
+ | TAG           | 95.90 | 94.25 |     1696.22     |
+ | TAG + Eisner  | 95.93 | 94.28 |     350.46      |
+ | CHAR          | 95.99 | 94.38 |     1464.59     |
+ | CHAR + Eisner | 96.02 | 94.41 |     323.73      |
+ | BERT          | 96.64 | 95.11 |     438.72      |
+ | BERT + Eisner | 96.65 | 95.12 |     214.68      |

- Note that punctuation is excluded in all evaluation metrics.
+ Note that punctuation is ignored in all evaluation metrics for PTB.

Aside from using consistent hyperparameters, there are some key points that significantly affect the performance:

- Dividing the pretrained embeddings by their standard deviation
- Applying the same dropout mask at every recurrent timestep
- - Jointly dropping the words and tags
+ - Jointly dropping the word and additional feature representations

- For the above reasons, we may have to give up some native modules in pytorch (e.g., `LSTM` and `Dropout`), and use self-implemented ones instead.
+ For the above reasons, we may have to give up some native modules in pytorch (e.g., `LSTM` and `Dropout`),
+ and use custom ones instead.
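To illustrate the shared-mask trick from the list above, here is a pure-Python sketch (hypothetical, not the repo's tensor-based implementation): one dropout mask is sampled per sequence and reused at every timestep.

```python
import random

def shared_dropout(seq, p=0.33, seed=0):
    """Drop the same hidden units at every timestep of `seq`.

    seq: list of timesteps, each a list of floats (the hidden units).
    A single mask the size of one timestep is sampled once and reused,
    unlike standard dropout, which samples a fresh mask per timestep.
    Kept units are scaled by 1 / (1 - p) to preserve expectations.
    """
    rng = random.Random(seed)
    mask = [rng.random() >= p for _ in seq[0]]
    scale = 1 / (1 - p)
    return [[h * m * scale for h, m in zip(step, mask)] for step in seq]
```

Because the mask is fixed per sequence, the same hidden units are zeroed at every timestep, which is what keeps recurrent connections consistently regularized.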
4754

48-
As shown above, our results, especially on char lstm version, have outperformed the [offical implementation](https://github.com/tdozat/Parser-v1) (95.74 and 94.08).
55+
As shown above, our results have outperformed the [offical implementation](https://github.com/tdozat/Parser-v1) (95.74 and 94.08).
56+
Incorporating character-level features or external embeddings like BERT can further improve the performance of the model.

## Usage

@@ -67,16 +75,56 @@ Commands:
    train       Train a model.
```

- Before triggering the subparser, please make sure that the data files must be in CoNLL-X format. If some fields are missing, you can use underscores as placeholders.
+ Before triggering the subcommands, please make sure that the data files are in CoNLL-X format.
+ If some fields are missing, you can use underscores as placeholders.
+ Below are some examples:
+
+ ```sh
+ $ python run.py train -p -d=0 -f=exp/ptb.char --feat=char \
+       --ftrain=data/ptb/train.conllx \
+       --fdev=data/ptb/dev.conllx \
+       --ftest=data/ptb/test.conllx \
+       --fembed=data/glove.6B.100d.txt \
+       --unk=unk
+
+ $ python run.py evaluate -d=0 -f=exp/ptb.char --feat=char --tree \
+       --fdata=data/ptb/test.conllx
+
+ $ cat data/naive.conllx
+ 1  Too        _  _  _  _  _  _  _  _
+ 2  young      _  _  _  _  _  _  _  _
+ 3  too        _  _  _  _  _  _  _  _
+ 4  simple     _  _  _  _  _  _  _  _
+ 5  ,          _  _  _  _  _  _  _  _
+ 6  sometimes  _  _  _  _  _  _  _  _
+ 7  naive      _  _  _  _  _  _  _  _
+ 8  .          _  _  _  _  _  _  _  _
+
+ $ python run.py predict -d=0 -f=exp/ptb.char --feat=char --tree \
+       --fdata=data/naive.conllx \
+       --fpred=naive.conllx
+
+ $ cat naive.conllx
+ 1  Too        _  _  _  _  2  advmod  _  _
+ 2  young      _  _  _  _  0  root    _  _
+ 3  too        _  _  _  _  4  advmod  _  _
+ 4  simple     _  _  _  _  2  dep     _  _
+ 5  ,          _  _  _  _  2  punct   _  _
+ 6  sometimes  _  _  _  _  7  advmod  _  _
+ 7  naive      _  _  _  _  2  dep     _  _
+ 8  .          _  _  _  _  2  punct   _  _
+
+ ```
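For readers unfamiliar with the format used above: each CoNLL-X token line carries 10 whitespace-separated fields (ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with `_` as a placeholder. A minimal reader can be sketched as follows (`read_conllx` is a hypothetical helper, not part of `run.py`):

```python
def read_conllx(lines):
    """Parse CoNLL-X lines into sentences of (form, head, deprel) triples.

    Each token line has 10 whitespace-separated fields:
    ID FORM LEMMA CPOS POS FEATS HEAD DEPREL PHEAD PDEPREL,
    with '_' as a placeholder for missing fields.
    Sentences are separated by blank lines.
    """
    sentences, sent = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if sent:
                sentences.append(sent)
                sent = []
            continue
        fields = line.split()
        form, head, rel = fields[1], fields[6], fields[7]
        sent.append((form, None if head == "_" else int(head), rel))
    if sent:
        sentences.append(sent)
    return sentences
```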

- Optional arguments of the subparsers are as follows:
+ All the optional arguments of the subcommands are as follows:

```sh
$ python run.py train -h
usage: run.py train [-h] [--buckets BUCKETS] [--punct] [--ftrain FTRAIN]
                    [--fdev FDEV] [--ftest FTEST] [--fembed FEMBED]
-                   [--unk UNK] [--conf CONF] [--model MODEL] [--vocab VOCAB]
+                   [--unk UNK] [--conf CONF] [--file FILE] [--preprocess]
                    [--device DEVICE] [--seed SEED] [--threads THREADS]
+                   [--tree] [--feat {tag,char,bert}]

optional arguments:
  -h, --help            show this help message and exit
@@ -88,21 +136,22 @@ optional arguments:
  --fembed FEMBED       path to pretrained embeddings
  --unk UNK             unk token in pretrained embeddings
  --conf CONF, -c CONF  path to config file
-   --model MODEL, -m MODEL
-                         path to model file
-   --vocab VOCAB, -v VOCAB
-                         path to vocab file
+   --file FILE, -f FILE  path to saved files
+   --preprocess, -p      whether to preprocess the data first
  --device DEVICE, -d DEVICE
                        ID of GPU to use
  --seed SEED, -s SEED  seed for generating random numbers
  --threads THREADS, -t THREADS
                        max num of threads
+   --tree                whether to ensure well-formedness
+   --feat {tag,char,bert}
+                         choices of additional features

$ python run.py evaluate -h
usage: run.py evaluate [-h] [--batch-size BATCH_SIZE] [--buckets BUCKETS]
-                      [--punct] [--fdata FDATA] [--conf CONF] [--model MODEL]
-                      [--vocab VOCAB] [--device DEVICE] [--seed SEED]
-                      [--threads THREADS]
+                      [--punct] [--fdata FDATA] [--conf CONF] [--file FILE]
+                      [--preprocess] [--device DEVICE] [--seed SEED]
+                      [--threads THREADS] [--tree] [--feat {tag,char,bert}]

optional arguments:
  -h, --help            show this help message and exit
@@ -112,21 +161,22 @@ optional arguments:
  --punct               whether to include punctuation
  --fdata FDATA         path to dataset
  --conf CONF, -c CONF  path to config file
-   --model MODEL, -m MODEL
-                         path to model file
-   --vocab VOCAB, -v VOCAB
-                         path to vocab file
+   --file FILE, -f FILE  path to saved files
+   --preprocess, -p      whether to preprocess the data first
  --device DEVICE, -d DEVICE
                        ID of GPU to use
  --seed SEED, -s SEED  seed for generating random numbers
  --threads THREADS, -t THREADS
                        max num of threads
+   --tree                whether to ensure well-formedness
+   --feat {tag,char,bert}
+                         choices of additional features

$ python run.py predict -h
usage: run.py predict [-h] [--batch-size BATCH_SIZE] [--fdata FDATA]
-                     [--fpred FPRED] [--conf CONF] [--model MODEL]
-                     [--vocab VOCAB] [--device DEVICE] [--seed SEED]
-                     [--threads THREADS]
+                     [--fpred FPRED] [--conf CONF] [--file FILE]
+                     [--preprocess] [--device DEVICE] [--seed SEED]
+                     [--threads THREADS] [--tree] [--feat {tag,char,bert}]

optional arguments:
  -h, --help            show this help message and exit
@@ -135,39 +185,42 @@ optional arguments:
  --fdata FDATA         path to dataset
  --fpred FPRED         path to predicted result
  --conf CONF, -c CONF  path to config file
-   --model MODEL, -m MODEL
-                         path to model file
-   --vocab VOCAB, -v VOCAB
-                         path to vocab file
+   --file FILE, -f FILE  path to saved files
+   --preprocess, -p      whether to preprocess the data first
  --device DEVICE, -d DEVICE
                        ID of GPU to use
  --seed SEED, -s SEED  seed for generating random numbers
  --threads THREADS, -t THREADS
                        max num of threads
+   --tree                whether to ensure well-formedness
+   --feat {tag,char,bert}
+                         choices of additional features
```

## Hyperparameters

- | Param         | Description                                      | Value |
- | :------------ | :----------------------------------------------- | :---: |
- | n_embed       | dimension of word embedding                      | 100 |
- | n_tag_embed   | dimension of tag embedding                       | 100 |
- | embed_dropout | dropout ratio of embeddings                      | 0.33 |
- | n_lstm_hidden | dimension of lstm hidden state                   | 400 |
- | n_lstm_layers | number of lstm layers                            | 3 |
- | lstm_dropout  | dropout ratio of lstm                            | 0.33 |
- | n_mlp_arc     | arc mlp size                                     | 500 |
- | n_mlp_rel     | label mlp size                                   | 100 |
- | mlp_dropout   | dropout ratio of mlp                             | 0.33 |
- | lr            | starting learning rate of training               | 2e-3 |
- | betas         | hyperparameter of momentum and L2 norm           | (0.9, 0.9) |
- | epsilon       | stability constant                               | 1e-12 |
- | annealing     | formula of learning rate annealing               | <img src="https://latex.codecogs.com/gif.latex?.75^{\frac{t}{5000}}"/> |
- | batch_size    | approximate number of tokens per training update | 5000 |
- | epochs        | max number of epochs                             | 50000 |
- | patience      | patience for early stop                          | 100 |
+ | Param         | Description                                                  | Value |
+ | :------------ | :----------------------------------------------------------- | :---: |
+ | n_embed       | dimension of embeddings                                      | 100 |
+ | n_char_embed  | dimension of char embeddings                                 | 50 |
+ | n_bert_layers | number of bert layers to use                                 | 4 |
+ | embed_dropout | dropout ratio of embeddings                                  | 0.33 |
+ | n_lstm_hidden | dimension of lstm hidden states                              | 400 |
+ | n_lstm_layers | number of lstm layers                                        | 3 |
+ | lstm_dropout  | dropout ratio of lstm                                        | 0.33 |
+ | n_mlp_arc     | arc mlp size                                                 | 500 |
+ | n_mlp_rel     | label mlp size                                               | 100 |
+ | mlp_dropout   | dropout ratio of mlp                                         | 0.33 |
+ | lr            | starting learning rate of training                           | 2e-3 |
+ | betas         | hyperparameters of momentum and L2 norm                      | (0.9, 0.9) |
+ | epsilon       | stability constant                                           | 1e-12 |
+ | annealing     | formula of learning rate annealing                           | <img src="https://latex.codecogs.com/gif.latex?.75^{\frac{t}{5000}}"/> |
+ | batch_size    | approximate number of tokens per training update             | 5000 |
+ | epochs        | max number of epochs                                         | 50000 |
+ | patience      | patience for early stop                                      | 100 |
+ | min_freq      | minimum frequency of words in the training set not discarded | 2 |
+ | fix_len       | fixed length of a word                                       | 20 |
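The annealing formula in the table means the learning rate after t training steps is lr(t) = 2e-3 × 0.75^(t/5000). A small Python sketch of that schedule (a reading of the table, not code from the repo):

```python
def annealed_lr(step, base_lr=2e-3, decay=0.75, decay_steps=5000):
    """Learning rate after `step` updates: base_lr * decay ** (step / decay_steps)."""
    return base_lr * decay ** (step / decay_steps)
```

So every 5000 steps the learning rate shrinks by a further factor of 0.75.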

## References

* [Deep Biaffine Attention for Neural Dependency Parsing](https://arxiv.org/abs/1611.01734)