A 02
2017-11-06
Assignment 2
Readings: Read chapter 4 in Jurafsky-Martin.
Problems:
1. We want a program that computes all bigram probabilities from a given (training) corpus
and stores them in a file. For instance, from the file data/small.txt:
I live in Boston.
I like ants.
Ants like honey.
Therefore I like honey too!
we want to produce the contents of the file small_model_correct.txt. Note that:
- The first line contains two numbers, separated by a space: the vocabulary size V (= the
  number of unique tokens, including punctuation) and the size of the corpus N (= the
  total number of tokens).
- Then follow V lines, each containing three items: an identifier (0, 1, ...), a token, and
  the number of times that token appears in the corpus.
- Then follow a number of lines, one for each non-zero bigram probability. Each line
  contains three numbers: the identifiers of the first and second token of the bigram,
  respectively, followed by the logarithm of the bigram probability, printed with 15 decimals.
  The natural logarithm is used (as computed by the math.log library method).
- The final line is -1 to mark end-of-file.
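The file format above can be sketched in a few lines of Python. This is only an illustration and not the skeleton in BigramTrainer.py: the function name train_bigram_model is hypothetical, the input is assumed to be an already tokenized list, identifiers are assumed to be assigned in order of first appearance, and the bigram probability is assumed to be the bigram count divided by the count of the first token. The order of the bigram lines is also an assumption.

```python
import math
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Return the model as a list of text lines in the format described above.

    Hypothetical helper: `tokens` is assumed to be a list of already
    tokenized (and, if desired, lowercased) strings.
    """
    index = {}                           # token -> identifier (order of first appearance, an assumption)
    unigram_count = Counter()            # token -> number of occurrences
    bigram_count = defaultdict(Counter)  # first token -> Counter of following tokens

    previous = None
    for token in tokens:
        if token not in index:
            index[token] = len(index)
        unigram_count[token] += 1
        if previous is not None:
            bigram_count[previous][token] += 1
        previous = token

    # First line: vocabulary size V and corpus size N.
    lines = [f"{len(index)} {len(tokens)}"]

    # V lines: identifier, token, unigram count.
    for token, ident in index.items():
        lines.append(f"{ident} {token} {unigram_count[token]}")

    # One line per non-zero bigram probability, natural log, 15 decimals.
    # Assumption: P(second | first) = count(first second) / count(first).
    for first, first_id in index.items():
        for second, count in bigram_count[first].items():
            log_prob = math.log(count / unigram_count[first])
            lines.append(f"{first_id} {index[second]} {log_prob:.15f}")

    # End-of-file marker.
    lines.append("-1")
    return lines


tokens = "i like ants . ants like honey .".split()
model_lines = train_bigram_model(tokens)
print("\n".join(model_lines))
```

Only bigrams that actually occur are stored, so the file stays small even when V is large; all other bigram probabilities are implicitly zero.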
The BigramTrainer.py program contains a skeleton program for reading a corpus, computing
unigram counts and bigram probabilities, and printing the model. Your task is to extend
the code so that the program works correctly (look for the comments YOUR CODE HERE
in the program). Use the scripts run_trainer_small.sh and run_trainer_kafka.sh to run
the program on test examples.
You can use the -d option to save the model to a file. If you are using Windows, printing
the model to the terminal will likely lead to character encoding errors.
By adding the --check flag, you can verify that your results are correct.
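On the encoding issue mentioned above: one way to avoid depending on the terminal's encoding is to write the model lines to a file with an explicit UTF-8 encoding. The sketch below is not how the -d option is implemented in BigramTrainer.py; save_model, example_lines and example_path are hypothetical names used only for illustration.

```python
import os
import tempfile


def save_model(lines, path):
    """Write model lines to `path`, one per line, always as UTF-8.

    Hypothetical helper; passing encoding="utf-8" explicitly avoids
    relying on the platform's default encoding.
    """
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


# Tiny made-up model containing a non-ASCII token.
example_lines = ["2 3", "0 café 2", "1 . 1", "-1"]
example_path = os.path.join(tempfile.gettempdir(), "example_model.txt")
save_model(example_lines, example_path)
```

The same explicit encoding should then be used when the file is read back.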
2. The BigramTester.py program contains a skeleton program for reading a model in the
format described in the previous problem, reading a test corpus, and computing the entropy
of the test corpus given the model (the cross-entropy of the training set and the test set).
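Reading a model in that format could look roughly like the sketch below. It is not the code in BigramTester.py: the name read_model and the returned data structures are hypothetical, and tokens are assumed to contain no whitespace so that each line can be split on spaces.

```python
def read_model(lines):
    """Parse model lines (header, V vocabulary lines, bigram lines, "-1").

    Hypothetical helper. Returns (index, unigram_count, bigram_logprob,
    corpus_size), where bigram_logprob is keyed by (first_id, second_id).
    """
    it = iter(lines)

    # Header: vocabulary size V and corpus size N.
    vocab_size, corpus_size = map(int, next(it).split())

    index = {}          # token -> identifier
    unigram_count = {}  # identifier -> count
    for _ in range(vocab_size):
        ident, token, count = next(it).split()
        index[token] = int(ident)
        unigram_count[int(ident)] = int(count)

    bigram_logprob = {}
    for line in it:
        line = line.strip()
        if line == "-1":  # end-of-file marker
            break
        first, second, log_prob = line.split()
        bigram_logprob[(int(first), int(second))] = float(log_prob)

    return index, unigram_count, bigram_logprob, corpus_size


sample = ["2 3", "0 ants 2", "1 . 1", "0 1 -0.693147180559945", "-1"]
index, unigram_count, bigram_logprob, corpus_size = read_model(sample)
```

Storing the bigrams in a dictionary keyed by identifier pairs makes a missing bigram easy to detect: it is simply absent from the dictionary.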
(a) Extend the code so that the program works correctly (look for the comments YOUR
CODE HERE in the program). The entropy of the test set is computed as the negative
average log-probability:
\[ -\frac{1}{N} \sum_{i=1}^{N} \log P(w_{i-1} w_i) \]
where N is the number of tokens in the test corpus. To be able to handle missing words
and missing bigrams, use linear interpolation:
The values for the constants λ1, λ2 and λ3 are given in the code for the BigramTester
program. The script run_tester_small_kafka.sh tests the model built from small.txt
using kafka.txt as a test corpus, and the script run_tester_kafka_small.sh tests the
model built from kafka.txt on the test corpus small.txt. Compare with my numbers
by using the --check flag. (Your numbers might deviate slightly from my numbers, for
instance if you are using a different logarithm. I used the natural logarithm.)
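A minimal sketch of the entropy computation with interpolation follows. The exact interpolation formula is not reproduced in this text, so the sketch assumes the common form λ1 · P(w_i | w_{i-1}) + λ2 · P(w_i) + λ3; the function name, the data structures, and the λ values below are all hypothetical (the real values are the ones given in the BigramTester code). The first token is assumed to have no predecessor, so its bigram term is zero.

```python
import math


def interpolated_entropy(test_tokens, unigram_count, bigram_prob,
                         corpus_size, lambda1, lambda2, lambda3):
    """Negative average log-probability of `test_tokens` under the model.

    Hypothetical helper. `unigram_count` maps token -> training count,
    `bigram_prob` maps (previous, token) -> probability, and `corpus_size`
    is the number of training tokens N. Unknown words and unseen bigrams
    contribute probability 0, so lambda3 keeps the logarithm finite.
    """
    total_log_prob = 0.0
    previous = None
    for token in test_tokens:
        p_unigram = unigram_count.get(token, 0) / corpus_size
        p_bigram = bigram_prob.get((previous, token), 0.0)
        # Assumed interpolation form; see the lead-in.
        p = lambda1 * p_bigram + lambda2 * p_unigram + lambda3
        total_log_prob += math.log(p)
        previous = token
    return -total_log_prob / len(test_tokens)


# Made-up toy model and made-up lambda values, for illustration only.
toy_unigrams = {"a": 2, "b": 2}
toy_bigrams = {("a", "b"): 1.0, ("b", "a"): 0.5}
h = interpolated_entropy(["a", "b", "c"], toy_unigrams, toy_bigrams,
                         corpus_size=4, lambda1=0.9, lambda2=0.09, lambda3=0.01)
print(h)
```

In the toy example the unknown word "c" receives only the λ3 mass, which is why it dominates the entropy.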
(b) Build a model from the file data/guardian_training.txt and another model from
data/austen_training.txt. Compute the entropy of the test file guardian_test.txt
and the test file austen_test.txt, using both models. Report your numbers and your
conclusions from these experiments!