Question Answering System: 296: Natural Language Processing
by
Lakshmi Shankarrao
We would like to extend our gratitude to Prof. Ming-Hwa Wang, who inspired us to
undertake this project. We would also like to thank Santa Clara University, which provided us
the opportunity to learn the subject and apply its concepts in making this project.
Table of Contents
Acknowledgements
Abstract
Introduction
Objective: Build a Question Answering System using neural networks.
What is the problem
Why is this a project related to this class
Why other approaches are no good and why the chosen approach is better
Statement of the problem
Area or Scope of Investigation
Theoretical Bases and Literature Review:
Theoretical background of the problem
Related research to solve the problem
Advantages/disadvantages of that research
Our solution to solve this problem
Hypothesis/Goals
LSTM
biLSTM
QA-LSTM
ATTENTION-BASED QA-LSTM
Methodology
Data Sets
Setup
Results and Analysis
Implementation
Code
Design document and flowchart
Data Analysis and Discussion
Output generation
Output analysis
Try to Analyse why it went wrong
Abnormal case explanation (the most important task)
Conclusions and Recommendations
Summary and Conclusions
Recommendations for future studies
Bibliography
Appendices
Program Flowchart
Program source code with documentation
Input/Output listing
List Of Tables and Figures
Abstract
In this project we apply concepts of Deep Learning (DL) to the question answering task. This
approach avoids the tedious, extensive feature extraction performed by traditional linguistic
tools. We use bidirectional Long Short-Term Memory (biLSTM) models to generate embeddings of
questions and answers, and measure cosine similarity to compute the distance between question
and answer pairs, which is then used to select the appropriate answer. The model incorporates an
attention mechanism to increase the efficiency of the system. The models are evaluated on
different datasets such as InsuranceQA and the bAbI project, and the results are captured and
analysed.
Introduction
Objective:
Build a Question Answering System using neural networks.
Why other approaches are no good and why the chosen approach is
better
Neural networks are increasingly gaining focus in NLP-related tasks. Over the past
few years, neural networks have re-emerged as powerful machine-learning models,
yielding state-of-the-art results in fields such as image recognition and speech processing.
More recently, neural network models have also started to be applied to textual natural
language signals, again with very promising results (Yoav Goldberg, "A Primer on Neural
Network Models for Natural Language Processing", October 5, 2015). Conventional machine
learning approaches require feature engineering, whereas neural networks do not: what happens
inside the hidden layers of a neural network is not directly observable, and the network learns
all possible features, perhaps even ones that humans might not have thought of. Neural networks
are powerful learners, providing opportunities ranging from non-linear classification to
non-Markovian modeling of sequences and trees (Goldberg, 2015).
model to handle tasks such as Single Supporting Fact, Two or Three Supporting
Facts, Yes/No Questions, Counting and Lists/Sets, Basic Coreference, Conjunctions and
Compound Coreference, and to analyse and compare their performance.
representation (image/text, hidden-layer output matrix, etc.), reducing its dimensionality
and allowing assumptions to be made about the features contained in the binned sub-regions.
Max-pooling is generally said to be better than other forms of pooling because it
extracts the strongest local value for each dimension, so that more local information is
reflected in the output embeddings.
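As a minimal illustration of max pooling over time (not from the project code; the array values are made up), the strongest activation of each dimension is kept across all time steps of a hypothetical BLSTM output matrix:

import numpy as np

# Hypothetical BLSTM output: 5 time steps, 4-dimensional hidden vectors
hidden_states = np.array([[ 0.1, -0.3,  0.5,  0.0],
                          [ 0.4,  0.2, -0.1,  0.7],
                          [-0.2,  0.6,  0.3,  0.1],
                          [ 0.0,  0.1,  0.9, -0.4],
                          [ 0.3, -0.5,  0.2,  0.6]])

# Max pooling keeps the strongest value of each dimension across time,
# producing one fixed-size embedding regardless of sentence length.
sentence_embedding = hidden_states.max(axis=0)   # -> [0.4, 0.6, 0.9, 0.7]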
Fig 1
In this model, words in the input sentences are first converted to vector representations
learned from word2vec. A special start symbol (<S>) is inserted between the Question
and the Answer to differentiate them. The Question and Answer word vectors are then
read sequentially by the BLSTM from both directions. For each time step in the BLSTM
layer, the hidden (output) vector is generated by combining the cell memory vectors from
the two LSTMs, one for each direction. The final output of each time step is the label
indicating whether the candidate answer sentence should be selected as the correct answer
sentence for the input question. This objective encourages the BLSTMs to learn a weight
matrix that outputs a positive label if there is overlapping context information between
the two LSTM cell memories. Mean pooling is applied to all time step outputs during
training. During the test phase, the mean, sum and max poolings are collected as features,
and the final answer vector is extracted.
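A minimal sketch of this first model is given below, assuming TensorFlow 1.x (the framework used later in this project). The variable names (qa_seq, pooled, logit) and sizes are ours, not the original paper's:

import tensorflow as tf  # TensorFlow 1.x style

E, H = 300, 141   # word-vector size and LSTM hidden size (illustrative)
# qa_seq: question + "<S>" + answer, already mapped to word vectors
qa_seq = tf.placeholder(tf.float32, [None, None, E])

cell_fw = tf.contrib.rnn.BasicLSTMCell(H)
cell_bw = tf.contrib.rnn.BasicLSTMCell(H)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, qa_seq,
                                                      dtype=tf.float32)
states = tf.concat([out_fw, out_bw], axis=2)     # (batch, steps, 2H)
pooled = tf.reduce_mean(states, axis=1)          # mean pooling over time steps

W_out = tf.Variable(tf.random_normal([2 * H, 1]))
b_out = tf.Variable(tf.constant(0.1, shape=[1]))
logit = tf.matmul(pooled, W_out) + b_out         # positive label -> relevant answer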
2. QA-LSTM model
Fig 2
The BLSTM generates distributed representations for the question and the answer
independently, and cosine similarity is then used to measure their distance.
Representations are built from the word-level BLSTM outputs using average pooling,
max pooling, or concatenation of the last vectors in both directions; max pooling provides
the better performance. An architecture in which the question and answer sides share the
same network parameters is significantly better than one in which the question and answer
sides own their parameters separately, and it converges much faster.
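The following is a minimal sketch of this QA-LSTM idea in TensorFlow 1.x, with both sides sharing one set of BLSTM parameters through a variable scope; the helper name encode and the sizes are our own choices:

import tensorflow as tf  # TensorFlow 1.x style

E, H = 300, 141   # word-vector size and LSTM hidden size (illustrative)
question = tf.placeholder(tf.float32, [None, None, E])
answer = tf.placeholder(tf.float32, [None, None, E])

def encode(seq, reuse):
    # Shared BLSTM encoder followed by max pooling over time
    with tf.variable_scope("qa_lstm", reuse=reuse):
        cell_fw = tf.contrib.rnn.BasicLSTMCell(H)
        cell_bw = tf.contrib.rnn.BasicLSTMCell(H)
        (fw, bw), _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, seq,
                                                      dtype=tf.float32)
        return tf.reduce_max(tf.concat([fw, bw], axis=2), axis=1)   # (batch, 2H)

q_vec = encode(question, reuse=False)
a_vec = encode(answer, reuse=True)    # same network parameters on both sides

# cosine similarity between the question and answer representations
dot = tf.reduce_sum(q_vec * a_vec, axis=1)
norms = tf.sqrt(tf.reduce_sum(tf.square(q_vec), 1)) * tf.sqrt(tf.reduce_sum(tf.square(a_vec), 1))
cosine = dot / (norms + 1e-8)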
Fig 3
The above-mentioned models are said to produce very good results, but a simple
pooling layer may be incapable of keeping the local linguistic information. The question
and answer representations are therefore generated using a CNN structure built on the
outputs of the BLSTM, in order to give a more composite representation of questions and
answers. Unlike a traditional feed-forward neural network, where each output interacts
with each input, the convolutional structure only imposes local interactions between the
inputs within a filter of size m. As in typical CNNs, a max-k pooling layer is built on top
of the convolutional layer, which extracts the top-k values from each convolutional filter.
These top-k values indicate the highest degree to which a filter matches the input
sequence. In the end, two output vectors of dimension kN, for the questions and answers
respectively, are generated from N parallel filters with different parameter
initializations: the convolutional layer produces N-dimensional output vectors from the N
filters, and the top-k values are chosen from them.
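A minimal sketch of the convolution-plus-top-k-pooling step described above, in TensorFlow 1.x; the tensor shapes and parameter values (m, N, k) are illustrative:

import tensorflow as tf  # TensorFlow 1.x style

batch, steps, two_H = 8, 50, 282            # illustrative shapes; 2H = biLSTM output size
blstm_out = tf.random_normal([batch, steps, two_H])   # stand-in for real BLSTM outputs

m, N, k = 2, 100, 3                         # filter size m, N parallel filters, top-k pooling
filters = tf.Variable(tf.random_normal([m, two_H, N]))   # (filter_width, in_channels, out_channels)
conv = tf.nn.relu(tf.nn.conv1d(blstm_out, filters, stride=1, padding="SAME"))  # (batch, steps, N)

# take the k largest responses of each filter across the time dimension
values, _ = tf.nn.top_k(tf.transpose(conv, [0, 2, 1]), k=k)   # (batch, N, k)
representation = tf.reshape(values, [batch, N * k])           # kN-dimensional output vector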
Fig 4
In order to better distinguish candidate answers according to the question, a simple but
efficient attention model is introduced to this framework for generating the answer
embedding according to the question context. Here, prior to extracting representations
using average or mean pooling, each BLSTM output vector is multiplied by a softmax weight,
which is determined by the question embedding from the BLSTM.
Our solution to solve this problem
We plan to implement a QA-LSTM model with attention using biLSTMs. The answer
and question representations that are generated are filtered using max pooling, and the
cosine similarity between the selected question and answer representations is used to
choose the best answer.
Hypothesis/Goals
In this project, our goal is to build a framework which uses bidirectional LSTM-based
RNN models that are trained on both question and answer data, and then measures the
distance between question and answer pairs to obtain similarity metrics.
LSTM
The LSTM model helps alleviate the vanishing gradient problem of Recurrent Neural
Networks and hence provides a better memory. This is achieved with the help of gates.
Given an input sequence x = {x(1), x(2), ..., x(n)}, where x(t) is an E-dimensional
word vector, the hidden vector h(t) (of size H) at time step t is updated as follows.
A basic LSTM has three gates (input i, forget f and output o) and a cell memory vector C;
σ is the logistic sigmoid function. The input gate determines how the state of the memory
cell is altered based on the input vector x(t). The output gate determines the effect that
the particular memory cell has on the outputs. The forget gate determines whether to
remember or forget the cell's previous state. W ∈ R^(H×E), U ∈ R^(H×H) and b ∈ R^(H×1) are
the network parameters: W is the input weight matrix, U the recurrent weight matrix, and b
the bias, with σ the logistic sigmoid function.
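For reference, the update equations being described are the standard LSTM formulation (following the notation of Tan et al., 2015):

i(t) = σ(W_i x(t) + U_i h(t-1) + b_i)
f(t) = σ(W_f x(t) + U_f h(t-1) + b_f)
o(t) = σ(W_o x(t) + U_o h(t-1) + b_o)
C̃(t) = tanh(W_c x(t) + U_c h(t-1) + b_c)
C(t) = i(t) ⊙ C̃(t) + f(t) ⊙ C(t-1)
h(t) = o(t) ⊙ tanh(C(t))

where ⊙ denotes element-wise multiplication.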
biLSTM
Bidirectional Long Short-Term Memory (biLSTM) networks are a variation of LSTMs that use
both the previous and the future context by processing the input in two directions and
generating two independent LSTM output vectors. These models do a better job than
single-direction LSTMs, which do not make use of contextual information from the future
input. The input sequence is processed in the forward direction and also in the reverse
direction, and the output at any given time step is the concatenation of the two output
vectors.
QA-LSTM
The simple model of our project is shown in Fig 1. The biLSTM models generate
representations for the question and the answer separately, and cosine similarity is then
computed to measure the distance between them. Following the same ranking loss as in
(Feng et al., 2015), we define the training objective (eq. 7) as a hinge loss:

L = max{0, M − cos(q, a+) + cos(q, a−)}    (7)

where a+ is a ground-truth answer, a− is an incorrect answer randomly chosen from the
entire answer space, and M is a constant margin.
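A minimal TensorFlow 1.x sketch of this objective, assuming the cosine scores for the positive and negative answers have already been computed as in the QA-LSTM sketch above (the margin value is illustrative):

import tensorflow as tf  # TensorFlow 1.x style

M = 0.2   # constant margin (illustrative value)
# cosine similarity of the question with the ground-truth answer and with a
# randomly sampled incorrect answer, one value per training example
cos_pos = tf.placeholder(tf.float32, [None])
cos_neg = tf.placeholder(tf.float32, [None])

# hinge loss: zero once the positive answer beats the negative one by at least M
hinge_loss = tf.reduce_mean(tf.maximum(0.0, M - cos_pos + cos_neg))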
ATTENTION-BASED QA-LSTM
In this variation we use a simple attention model, conditioned on the question, for the
answer vector generation. The attention mechanism is used to dynamically align the
informative parts of the answer to the question. The same strategy has been used in
machine translation (Bahdanau et al., 2015; Sutskever et al., 2014) and factoid question
answering (Hermann et al., 2015).
Our word-level attention model will be similar to (essentially a replication of) the work
of (Tan, Ming et al., 2015). Fig 2 shows the structure.
Each output vector of the answer-side biLSTM is multiplied by a softmax weight, which is
computed from the question embedding obtained from the biLSTM. Given the output vector of
the biLSTM on the answer side at time step t, h_a(t), and the question embedding, o_q, the
attention-weighted vector h̃_a(t) for each answer is computed as

m_{a,q}(t) = tanh(W_am h_a(t) + W_qm o_q)    (9)
s_{a,q}(t) ∝ exp(w_ms^T m_{a,q}(t))          (10)
h̃_a(t) = h_a(t) s_{a,q}(t)                   (11)

where W_am, W_qm and w_ms are attention parameters. We will see in our experiments how
this attention helps increase the efficiency.
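A minimal TensorFlow 1.x sketch of equations (9), (10) and (11), followed by the max pooling used in our model; the variable names mirror the symbols above and the sizes are illustrative:

import tensorflow as tf  # TensorFlow 1.x style

H2 = 282   # 2H, the size of one biLSTM output vector (illustrative)
h_a = tf.placeholder(tf.float32, [None, None, H2])   # answer-side biLSTM outputs (batch, steps, 2H)
o_q = tf.placeholder(tf.float32, [None, H2])         # pooled question embedding (batch, 2H)

W_am = tf.Variable(tf.random_normal([H2, H2]))       # attention parameters
W_qm = tf.Variable(tf.random_normal([H2, H2]))
w_ms = tf.Variable(tf.random_normal([H2, 1]))

batch = tf.shape(h_a)[0]
steps = tf.shape(h_a)[1]
h_flat = tf.reshape(h_a, [-1, H2])                                        # (batch*steps, 2H)
m = tf.tanh(tf.reshape(tf.matmul(h_flat, W_am), [batch, steps, H2]) +
            tf.expand_dims(tf.matmul(o_q, W_qm), 1))                      # eq (9)
s = tf.reshape(tf.matmul(tf.reshape(m, [-1, H2]), w_ms), [batch, steps])
attn = tf.nn.softmax(s)                                                   # eq (10), softmax over time
h_a_tilde = h_a * tf.expand_dims(attn, 2)                                 # eq (11)
answer_vec = tf.reduce_max(h_a_tilde, axis=1)                             # max pooling over time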
Methodology
Data Sets
(how to generate/collect input data)
In this work we will try to experiment with our model on multiple datasets, such as
InsuranceQA and the bAbI project (the 20 QA bAbI tasks, the 6 dialog bAbI tasks, The
Children's Book Test, The Movie Dialog dataset, The WikiMovies dataset, The Dialog-based
Language Learning dataset, and The Simple Questions dataset), and see how our model
responds to the different datasets.
All of these datasets either provide training, validation and test sets separately, or
just one heterogeneous set of questions, answers and stories. In both cases we will divide
the data into three parts (training, validation and test sets) to train the model and
evaluate its accuracy, as sketched below.
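A minimal sketch of the three-way split, assuming the examples have already been loaded into one Python list (the list below is just a stand-in):

from sklearn.model_selection import train_test_split

# stand-in for a list of (story, question, answer_choices, label) examples
examples = list(range(1000))

train, rest = train_test_split(examples, test_size=0.3, random_state=42)
validation, test = train_test_split(rest, test_size=0.5, random_state=42)
print(len(train), len(validation), len(test))   # 700 150 150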
Setup
(how to solve the problem)
We plan to implement our model using the TensorFlow framework and run it on a single GPU
(NVIDIA GeForce GT 730M). We will build a biLSTM with attention, as described in equations
(9), (10) and (11). We will use the accuracy obtained on the validation set to choose the
best epoch and the best hyper-parameters, which we will then use for testing. The code
will be in Python 3.5. Apart from TensorFlow, we will be using NumPy, scikit-learn,
pandas, matplotlib and other Python libraries.
Implementation
Code
"""
This code is a modified version of the code from this link:
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/rec
urrent_network.py
"""
import preprocess_sample_3 as pre_pro
import skip_gram
import tensorflow as tf
import numpy as np
o = skip_gram.SkipGram()
p = pre_pro.Preprocess()
content, ans = p.read_input("train_dev_data", "train_dev_ans")
ip_sgm, words = p.generate_ip_for_SGM(content)
o.build_dataset(words)
o.run_skip_gram()
# hyperparameters
lr = 0.001
training_iters = 100000
batch_size = 30
16
# tf Graph input
x = tf.placeholder(tf.float32, [
None, n_steps, n_inputs])
y = tf.placeholder(tf.float32, [ None, n_classes])
# Define weights
weights = {
# (300, 512)
'in': tf.Variable(tf.random_normal([n_inputs, n_hidden_units])),
# (512, 4)
'out': tf.Variable(tf.random_normal([n_hidden_units, n_classes]))
}
biases = {
# (512, )
'in': tf.Variable(tf.constant(0.1, shape=[n_hidden_units, ])),
# (4, )
'out': tf.Variable(tf.constant(0.1, shape=[n_classes, ]))
}
# mul
X_in = ((30 batch * 300 steps), 512 inputs)
#
X_in = tf.matmul(X, weights['in']) + biases['in']
# reshape
X_in ==> (128 batch, 300 steps, 512 hidden)
#
X_in = tf.reshape(X_in, [-1, n_steps, n_hidden_units])
# cell
# dynamic_rnn receive Tensor (batch, steps, inputs) or (steps, batch, inputs) as X_in.
outputs, final_state = tf.nn.dynamic_rnn(lstm_cell, X_in, initial_state=_init_state,
time_major=False)
17
return results
Design document and flowchart
Fig 5: Flowchart
● Using the word vectors and the dictionary-indexed dataset, we generate word-vector input
for the dataset and pass it to the RNN (LSTM network). Details about the model are
explained in the next section.
Fig 7: LSTM model
Data Analysis and Discussion
The MCTest dataset is a freely available dataset with 450 stories, each of which has multiple
questions; each question has 4 answer choices, of which one is correct and the others are wrong.
The data was gathered using Mechanical Turk. One example is given below:
James the Turtle was always getting in trouble. Sometimes he'd reach into the freezer and empty out all the food.
Other times he'd sled on the deck and get a splinter. His aunt Jane tried as hard as she could to keep him out of
trouble, but he was sneaky and got into lots of trouble behind her back.
One day, James thought he would go into town and see what kind of trouble he could get into. He went to the
grocery store and pulled all the pudding off the shelves and ate two jars. Then he walked to the fast food restaurant
and ordered 15 bags of fries. He didn't pay, and instead headed home.
His aunt was waiting for him in his room. She told James that she loved him, but he would have to start acting like a
well-behaved turtle. After about a month, and after getting into lots of trouble, James finally made up his mind to be
a better turtle.
1) What is the name of the trouble making turtle?
A) Fries
B) Pudding
C) James
D) Jane
2) What did James pull off of the shelves in the grocery store?
A) pudding
B) fries
C) food
D) splinters
3) Where did James go after he went to the grocery store?
A) his deck
B) his freezer
C) a fast food restaurant
D) his room
4) What did James do after he ordered the fries?
A) went to the grocery store
B) went home without paying
C) ate them
D) made up his mind to be a better turtle
For data analysis we cleaned the text data so that it could be used for our project, created a
lexicon of all unique words present in the complete dataset, and finally converted the lexicon
into word vectors using a skip-gram model.
In cleaning the data we had to take out all punctuation, unwanted spaces, and characters
which are not part of the English dictionary.
In creating the lexicon we had to read the whole text dataset, collect the unique words,
and produce a data file that contains only these words.
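A minimal sketch of these cleaning and lexicon-building steps (the helper names are ours; the project's preprocess module may differ in detail):

import re
from collections import Counter

def clean_text(raw):
    # lower-case, drop punctuation and non-English characters, collapse unwanted spaces
    text = re.sub(r"[^a-z0-9\s]", " ", raw.lower())
    return re.sub(r"\s+", " ", text).strip()

def build_lexicon(cleaned_docs):
    # unique words over the whole dataset, most frequent first
    counts = Counter(word for doc in cleaned_docs for word in doc.split())
    return [word for word, _ in counts.most_common()]

docs = [clean_text(d) for d in ["James the Turtle was always getting in trouble.",
                                "He went to the grocery store!"]]
lexicon = build_lexicon(docs)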
Using this lexicon we had to create word vectors. First we tried the bag-of-words model,
but the word vectors generated by the bag-of-words model did not carry semantic or syntactic
information about the words, so we had to look for a better model. We therefore opted for the
skip-gram model, which produces word-vector representations of textual words generated by a
deep neural network.
Using these word vectors we generate embeddings for the dataset, taking the story, the
question and the answer choices as inputs for the recurrent neural network, which predicts the
answer. The model is an LSTM RNN which learns from the embeddings.
Output generation
Input preprocessing involved cleaning, i.e. removal of unwanted material from the dataset.
The data needed to be processed to remove unwanted punctuation and to obtain words in the order
of occurrence in the dataset; this was needed to build the lexicon. The lexicon was then
reordered by frequency of occurrence (most frequent first). Later, a reverse dictionary was
built to make it easy to index a word. Next, the whole data content was rewritten based on this
dictionary index. This is the actual data we pass to the SGM.
The computation graph and session operation for the Skip-Gram Model (SGM) are described
below. The dictionary-indexed content is the input to the SGM. We chose batch_size = 128,
embedding_size = 300, skip_window = 1 and num_skips = 2 for the SGM we have used. We initialize
all the variables discussed above and generate a batch with the chosen skip window and number
of skips, obtaining the labels and the corresponding training data, which we feed to our neural
network to obtain predictions. The tf.nn.embedding_lookup function looks up the embeddings
given the training data and the embedding dimensions. These embeddings have dimension
vocabulary_size x embedding_size as defined above, with values ranging from -1 to 1. This is
fed to the neural network, the output is multiplied with the weights and the biases are added,
and the resulting output is compared with the labels to determine the loss. The loss is fed to
the optimizer to minimize it. We have used the Adagrad optimizer with a learning rate of 1.0.
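A minimal sketch of such a skip-gram graph in TensorFlow 1.x is shown below. The hyper-parameter values follow the text above where stated; the sampled NCE loss and the vocabulary size are assumptions, since the report does not name the exact loss function:

import tensorflow as tf  # TensorFlow 1.x style

vocabulary_size = 10000              # assumption; set from the actual lexicon size
batch_size, embedding_size = 128, 300
num_sampled = 64                     # number of negative samples (assumption)

train_inputs = tf.placeholder(tf.int32, [batch_size])
train_labels = tf.placeholder(tf.int32, [batch_size, 1])

# embedding values range from -1 to 1, as described above
embeddings = tf.Variable(tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

nce_weights = tf.Variable(tf.truncated_normal([vocabulary_size, embedding_size],
                                              stddev=1.0 / embedding_size ** 0.5))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

# loss between the network output and the labels, minimized with Adagrad (lr = 1.0)
loss = tf.reduce_mean(tf.nn.nce_loss(weights=nce_weights, biases=nce_biases,
                                     labels=train_labels, inputs=embed,
                                     num_sampled=num_sampled, num_classes=vocabulary_size))
optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)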
Output
The skip-gram model gives an output which represents the word vectors. Below are some of the
words that are nearest to each other in the context of this particular dataset.
Nearest to eight: seven, nine, six, four, five, three, zero, two,
Nearest to he: she, it, they, there, who, we, this, never,
Nearest to i: we, he, they, mary, james, she, theirs, melanie,
Nearest to she: he, they, it, i, midnight, we, hugo, snowball,
Nearest to back: off, ended, over, down, out, beginning, toward, float,
Nearest to her: his, him, their, margaret, me, its, chili, twisted,
Nearest to this: it, which, he, some, checkerboard, another, the, amritsar,
Nearest to been: become, be, already, was, recently, whispering, successfully,
Nearest to they: we, there, he, you, she, it, i, not,
Nearest to new: different, turkic, boasted, various, neared, stabilizers, deviating,
Nearest to called: named, used, distorted, considered, referred, known, vat,
Nearest to where: what, twist, beneath, artist, toilet, needs, performances, liking,
Nearest to dog: cow, smashing, babysit, hazel, drawing, witch, owner, mouse,
Below is the TensorBoard output generated by the skip-gram model. It depicts the distribution
of words in space (a vector space model for the words in the dataset).
[Sample values from one learned word vector, truncated: -0.01357032 -0.05207205 0.04069532 -0.03321727 -0.0258834 0.05395871 0.03483186 -0.08999643]
The computation graph and session description for the LSTM QA model are given below.
We chose learning rate = 0.001, number of training iterations = 100000, batch_size = 30,
number of inputs = 300 (the same as the word-vector size), number of steps = 300, number of
hidden units = 512, number of classes = 4 and total steps = 1800 for our model. We generate a
batch of inputs and the corresponding labels; in our model we select 30 sets of questions for
each batch. The input is first multiplied with the weights and the biases are added before it
is passed to the model, which outputs a prediction. We take the argmax of this prediction,
compare it with the labels, and calculate the cost and the accuracy. The cost is then given to
an optimizer to reduce it; we use the Adam optimizer here. The graph of accuracy obtained
across batches is shown in the figure below.
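The cost, accuracy and optimizer portion just described can be sketched as follows in TensorFlow 1.x; last_output stands in for the final LSTM output from the Implementation section, and the exact wiring in the project code may differ:

import tensorflow as tf  # TensorFlow 1.x style

n_hidden_units, n_classes, lr = 512, 4, 0.001

last_output = tf.placeholder(tf.float32, [None, n_hidden_units])  # stand-in for the LSTM output
y = tf.placeholder(tf.float32, [None, n_classes])                 # one-hot labels over 4 choices

W_out = tf.Variable(tf.random_normal([n_hidden_units, n_classes]))
b_out = tf.Variable(tf.constant(0.1, shape=[n_classes]))
pred = tf.matmul(last_output, W_out) + b_out                      # logits over the answer choices

cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
train_op = tf.train.AdamOptimizer(learning_rate=lr).minimize(cost)

correct = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))           # compare argmax with the labels
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))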
Output analysis
We saw the accuracy ranging between 20% and 60%. The sample output of the project is given
below. The output is batch-wise:

[Accuracy, Loss]
[0.43333337, 9.1849651]
[0.40000004, 6.956367]
[0.50000006, 5.3823442]
[0.33333337, 7.3978882]
[0.36666667, 4.4839745]
[0.30000001, 5.5185552]
[0.33333337, 2.9936876]
[0.26666671, 6.6921492]
[0.26666668, 6.105495]
[0.20000002, 8.5729513]
[0.33333337, 31.36186]
[0.23333335, 21.811695]
[0.33333337, 18.389278]
[0.26666668, 16.924873]
[0.33333337, 16.615463]
[0.33333337, 6.8296127]
[0.36666667, 12.750624]
[0.20000002, 15.518364]
[0.30000001, 7.9172506]
[0.40000004, 15.276219]
[0.23333335, 30.156597]
[0.23333335, 17.513439]
[0.40000004, 15.64605]
[0.20000002, 14.187802]
[0.40000004, 4.8628817]
[0.30000001, 8.0700626]
[0.30000001, 6.3062959]
[0.23333335, 16.49066]
[0.20000002, 15.956774]
[0.30000001, 16.061716]
[0.30000001, 9.7183285]
[0.30000001, 6.5377407]
[0.16666669, 7.4950838]
[0.20000002, 7.0059314]
[0.23333335, 9.5372372]
[0.43333337, 1.2171307]
[0.23333335, 9.7734261]
[0.33333337, 6.7862101]
[0.16666669, 12.330596]
[0.40000004, 8.3129349]
[0.26666668, 9.4911385]
[0.30000001, 4.7775865]
[0.4666667, 2.4099643]
[0.23333335, 4.2682471]
[0.23333335, 4.5539985]
[0.33333337, 5.5995607]
[0.33333337, 4.0388737]
[0.26666668, 6.6063185]
[0.20000002, 5.0251484]
[0.4333334, 4.7363453]
[0.40000004, 3.7464991]
[0.30000001, 2.3158572]
[0.20000002, 4.1808128]
We decided to use TensorFlow for designing our neural network, and the model that we were
trying to build consisted of two neural networks side by side, which would learn the question
and the answer separately. We spent a lot of time digging into this problem and then learnt
that this is not straightforwardly supported by TensorFlow.
We were not able to use two RNN models in the same session: it gave errors pertaining to
shared parameters between the two models. We tried declaring variables separately for both
while still using the two models in the same session, but that only gave room for more errors.
As we were not able to resolve this, we raised an issue with the Google TensorFlow group, but
given the time frame we were not able to come up with a workaround for the problem.
So we resorted to another model, described in the paper "A Long Short-Term Memory Model
for Answer Sentence Selection in Question Answering". The paper describes a model with a
bidirectional LSTM. We were able to implement a simpler version of the model using LSTMs. We
plan to implement a biLSTM for our QA system in the future and hopefully obtain better accuracy.
We also faced challenges in selecting the dataset that works best for our model. We
explored numerous datasets such as TREC-QA, Insurance-QA, bAbI and MCTest. Initially we wanted
to use the TREC-QA dataset, but we faced a lot of difficulty in using it for our particular
model. We also tried using Insurance-QA, and we were finally able to use the bAbI and MCTest
datasets, eventually settling on the MCTest dataset as it was the one that suited us best.
This project gave us a great start in the field of neural networks. We gained good
experience with the implementation of conventional and recurrent neural networks using
TensorFlow. We were able to build an LSTM neural network for answer prediction which, given a
story and a question, predicts the correct answer from the answer choices. We obtained an
accuracy between 20% and 70%.
Bibliography
1. D. Wang and E. Nyberg. A long short-term memory model for answer sentence selection in
question answering. In ACL-IJCNLP 2015, July 26-31, 2015, Beijing, China, Volume 2: Short
Papers, pages 707-712, 2015.
2. Daniel Cohen and W. Bruce Croft. “End to End Long Short Term Memory Networks
for Non-Factoid Question Answering” ICTIR '16 Proceedings of the 2016 ACM
International Conference on the Theory of Information Retrieval.
3. Tan, Ming; dos Santos, Cicero; Xiang, Bing; and Zhou, Bowen. "LSTM-based Deep Learning
Models for Non-factoid Answer Selection". arXiv:1511.04108, 11/2015.
4. H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. K. Ward. Deep
sentence embedding using the long short term memory network: Analysis and
application to information retrieval. CoRR, abs/1502.06922, 2015.
5. Wikipedia https://en.wikipedia.org/wiki/
6. Minwei Feng, Bing Xiang, Michael R. Glass, Lidan Wang, and Bowen Zhou. Applying Deep
Learning to Answer Selection: A Study and An Open Task. ASRU 2015.
7. Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merrienboer,
Armand Joulin and Tomas Mikolov. "Towards AI-Complete Question Answering: A Set of
Prerequisite Toy Tasks". 12/2015. https://arxiv.org/pdf/1502.05698v10.pdf
8. D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning
to align and translate,” ICLR2015, 2015. [Online]. Available:
http://arxiv.org/abs/1409.0473
9. I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton, “On the importance of
initialization and momentum in deep learning,” in ICML (3)’13, 2013, pp.
1139–1147.
10. K. M. Hermann and P. Blunsom, “Multilingual models for compositional distributed
semantics,” arXiv preprint arXiv:1404.4641, 2014.
11. Feng, Minwei, Xiang, Bing, Glass, Michael, Wang, Lidan, and Zhou, Bowen. Applying
deep learning to answer selection: A study and an open task. IEEE Automatic Speech
Recognition and Understanding Workshop (ASRU), 2015.
Appendices
Program Flowchart
See Fig 5 in the Design document and flowchart section of this report.
Input/Output listing
See the Output generation and Output analysis sections of this report.