Project Report
Open-Domain Conversational AI with Hybrid Generative and Retrieval Mechanisms
Jishnu Ray Chowdhury (jraych2@uic.edu), University of Illinois Chicago
Mobashir Sadat (msadat3@uic.edu), University of Illinois Chicago
1 INTRODUCTION
Conversational bots are agents that can engage in conversations with a partner in natural language. They can be used as virtual tutors, digital assistants, customer service agents, virtual therapists, task-oriented services, and entertainment. Conversational bots broadly come in three forms: (i) rule-based models, (ii) retrieval (IR) models, and (iii) generative models. Each of these variants has its own advantages and disadvantages. In this project we attempt to combine aspects of all three approaches, with a predominant focus on retrieval and generation. Our overall model is a synergy of multiple sub-modules for retrieval, classification, generation, and ranking. We focus on building an open-domain chatbot. Our key contributions are listed below.
Contributions:
(1) We provide a unique method to combine classification, generation, and retrieval for open-domain dialogue.
(2) We combine MMI-inspired¹ [7] ranking with a standard ranking method to score candidate responses (both retrieved and generated ones).
(3) We propose and motivate future directions to enrich the model.

Figure 1: Model Overview.
2 RELATED WORKS
ELIZA [16] and ALICE [15] are some of the earliest implementations of chatbots. They usually rely entirely on explicit pre-defined rules or pattern-matching schemes. Mitsuku Bot², which won the Loebner Prize in 2013, 2016, 2017, and 2018³, is also based on a similar scheme. This goes to show that ALICE-like models are still competitive. However, rule-based bots are usually not as flexible, adaptive, or scalable. It can be a life-long task for a bot-master to periodically review chat logs and incrementally improve their rule-based bots.
Recent research on open-domain chatbots is mostly based on retrieval or generation. Generative models generate a response given a section of the conversation history. The newer generative bots are usually some variant of the neural seq2seq architecture [12, 14]. There are many works on developing conversational agents with seq2seq, addressing its typical issues (low diversity, generic responses, inability to track user characteristics, etc.⁴). One notable Recurrent Neural Network (RNN)-based neural conversational model that hierarchically encodes the conversation history is HRED [10]. HVRED [11], in addition to HRED, uses a Variational Autoencoder [6] based objective to create a latent variable that guides the response decoding process. Recently, however, Transformer-based models are gaining prominence. Some of the notable Transformer-based generative models are DialoGPT [23] and Transfer-Transfo [19]. They are based on fine-tuning large pre-trained language models.
Besides generative models, there are also retrieval-based models, which retrieve existing responses from a dataset based on some measure of relevancy between the responses and the user utterance in the given context. One classic deep learning-based approach uses a dual LSTM encoder (as a Siamese network) [8] to encode both the input utterances and the responses or queries in a dataset, and predicts the probability that the pair is relevant. Newer methods are based on Transformer-based PolyEncoder models [3, 5].
There are also multiple hybrid models that are ensembles of various approaches. In particular, most papers associated with the Alexa Prize⁵ are combinations of various modules. One recent paper [20] also explores this direction.

3 METHOD
Our overall model has five broad modules. The first module is based on retrieval from a collection of custom scripts with query-response pairs which are specifically made for the bot. The second module is dialogue act classification, based on which different downstream decisions are made. The third module is large-scale retrieval from a large Reddit corpus. The fourth module is a generative model based on DialoGPT [23], which is OpenAI's GPT-2 fine-tuned on Reddit data. The fifth module is a ranker of scripted, retrieved, and generated candidates.

1 https://github.com/jiweil/Jiwei-Thesis
2 http://www.square-bear.co.uk/mitsuku/home.htm
3 https://www.aisb.org.uk/events/loebner-prize
4 https://github.com/ricsinaruto/Seq2seqChatbots/wiki/
5 https://developer.amazon.com/alexaprize
Figure 2: Retriever
3.1 Meta-Sentence Embedding
Almost all of our modules utilize some form of sentence embedding. Inspired by Poerner et al. [9], we encode sentences or utterances with a meta-sentence embedding, where we concatenate embeddings from multiple pre-trained models. For this project we settled on the concatenation of ConveRT (multi-context version)⁶ [3] and the Universal Sentence Encoder QA (USE-QA)⁷ [2, 21]. To encode queries or user utterances, we concatenate the ConveRT context encoder (with the previous 5 turns of conversation history as extra context) and the USE-QA query encoder. To encode candidate responses (whether generated, retrieved, or scripted), we concatenate the ConveRT response encoder and the USE-QA answer encoder, with the previous response as context. Both encoders are based on pre-trained Transformers. USE-QA is a multilingual model which was fine-tuned on a SQuAD retrieval task. ConveRT was explicitly trained on Reddit data for conversational purposes, making it especially suited for our task.
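To make the construction concrete, here is a minimal sketch of the response-side encoding. We link version 2 of the USE-QA module in the footnote; the signature names below follow the TF Hub documentation for version 3 and may differ across versions, and the pre-computed ConveRT response vector is assumed to come from PolyAI's polyai-models wrappers:

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load USE-QA from TF Hub (signature names follow the v3 docs; the v2
# module we cite may expose a slightly different interface).
use_qa = hub.load(
    "https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/3")

def encode_response(response, prev_response, convert_response_vec):
    """Meta-embedding of a candidate response: USE-QA answer encoding
    (with the previous response as context) concatenated with a
    pre-computed ConveRT response encoding."""
    use_vec = use_qa.signatures["response_encoder"](
        input=tf.constant([response]),
        context=tf.constant([prev_response]))["outputs"].numpy()[0]
    # `convert_response_vec` is assumed to come from PolyAI's ConveRT
    # response encoder (https://github.com/PolyAI-LDN/polyai-models).
    return np.concatenate([convert_response_vec, use_vec])
```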
which was fine tuned in SQuAD retrieval task. ConveRT was ex- tasks are executed. For example, upon encountering “<JOKE>", the
plicitly trained on Reddit data for conversational purposes making bot may randomly retrieve some joke from r/jokes subreddit data
it especially suited for our task. and respond with it. This same technique can be used to implement
some task oriented services to our bot.
3.2 Scripted Response Module
When constructing the scripted response module we kept in mind
3.3 Dialog-Act Classification Module
the Zipf distribution law. We wanted this module to handle com- Dialog-Act classification uses a simple MLP for classifying dialog
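A minimal sketch of this nearest-query lookup, assuming the scripted queries are pre-encoded into a matrix (variable names are ours; the 0.75 threshold is the one used in Section 3.7):

```python
import numpy as np

def retrieve_candidates(query_vec, script_query_matrix, script_responses,
                        threshold=0.75):
    """Return the candidate responses mapped to the scripted query most
    similar to the user utterance, or None if below the threshold.

    query_vec:           (d,) encoded user utterance (meta-embedding)
    script_query_matrix: (n, d) pre-encoded scripted queries
    script_responses:    list of n lists of candidate responses
    """
    # Cosine similarity between the utterance and every scripted query.
    sims = script_query_matrix @ query_vec
    norms = (np.linalg.norm(script_query_matrix, axis=1)
             * np.linalg.norm(query_vec) + 1e-9)
    sims = sims / norms
    best = int(np.argmax(sims))
    if sims[best] < threshold:
        return None  # fall through to the downstream modules
    return script_responses[best]
```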
3.3 Dialog-Act Classification Module
Dialog-act classification uses a simple MLP. We use the MIDAS dataset⁹ [22] along with its annotation scheme for training. As input, the classifier receives the query encodings.
6 https://github.com/PolyAI-LDN/polyai-models
7 https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/2
8 https://github.com/gunthercox/chatterbot-corpus
9 https://github.com/DianDYu/MIDAS_dialog_act/tree/master/da_data
or MMI [7], which were used for the SOTA variant of DialoGPT. We tried to implement beam search but had difficulties generating diverse candidates. Simple local nucleus sampling [4] at each time step gave much more diverse responses. We tried different approaches to diversify beam search [13], but to no avail. Finally, we settled on simply running greedy nucleus sampling multiple times in parallel (in a batch) to generate multiple candidate responses. In this way, we generate 30 responses for each turn. We also use the last 3 turns of conversation history as extra context for the generator. The generator can sometimes generate good responses and sometimes bad ones for the same user utterance; thus, when creating multiple candidates, the chance of producing at least some good responses increases. We can then use a ranking module to select from the better responses.
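A minimal sketch of this batched sampling step through the HuggingFace transformers interface to DialoGPT (our actual script may differ; top_p and the response length cap are assumed values):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

def generate_candidates(history, num_candidates=30, top_p=0.9):
    """Sample multiple candidate responses given the last few turns.

    history: list of utterance strings (we use the last 3 turns as context).
    """
    # DialoGPT separates conversation turns with the EOS token.
    context = tokenizer.eos_token.join(history[-3:]) + tokenizer.eos_token
    input_ids = tokenizer(context, return_tensors="pt").input_ids
    outputs = model.generate(
        input_ids,
        do_sample=True,            # nucleus sampling at each time step
        top_p=top_p,
        max_length=input_ids.shape[-1] + 40,
        num_return_sequences=num_candidates,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the context; keep only the newly generated response tokens.
    return [tokenizer.decode(out[input_ids.shape[-1]:],
                             skip_special_tokens=True)
            for out in outputs]
```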
α and β are scalar weights that determine how much weight is given to the query-response matching scores and to the reverse-generation loss scores, respectively. We set them to 0.4 and 0.6, respectively, after some not-very-exhaustive experimentation and qualitative analysis. Ideally, this tuning could be done with reinforcement learning in a human-in-the-loop setting, or in some other way, but we are keeping things simple for now. A separate bias term ('Bias') is present for every candidate. We use the 'Bias' term to bias the candidate scoring for Reddit-retrieved candidates. This term can also be used to bias the response candidates from a specific source or module. We use it to bias the ranking towards retrieved candidates because otherwise we found our model was often a bit too biased towards generated responses, even in cases where the available retrieved candidates were preferable based on our subjective judgments. N is a normalizing function which converts the scores into a probability distribution:

N(x_i) = (x_i − min(x)) / Σ_j x_j

Something like softmax could also be used instead. After computing the scores, we filter out most of the lower-scoring candidates and normalize the scores of the remaining ones into a probability distribution, from which the final response is sampled. We find that the probability distribution is sometimes quite flat; so if we don't filter, it can be easy to sample lower-ranked candidates, which we do not want. The ranker module is shown in Figure 2.
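Putting the pieces together, here is a sketch of the candidate scoring and sampling described above. The per-candidate score inputs are assumed to be precomputed, and the normalization is a practical variant of N that guarantees the probabilities sum to one:

```python
import numpy as np

ALPHA, BETA = 0.4, 0.6  # weights from our qualitative tuning

def rank_and_sample(match_scores, reverse_scores, biases, keep_top=5):
    """Combine scores, filter low-ranked candidates, sample a response index.

    match_scores:   query-response matching scores (one per candidate)
    reverse_scores: MMI-style reverse-generation scores
    biases:         per-candidate source bias (e.g., for Reddit retrievals)
    """
    scores = (ALPHA * np.asarray(match_scores)
              + BETA * np.asarray(reverse_scores)
              + np.asarray(biases))
    # Keep only the top-scoring candidates; a flat distribution over all
    # candidates would too easily sample a low-ranked response.
    top = np.argsort(scores)[::-1][:keep_top]
    kept = scores[top]
    # Min-shift then renormalize (a variant of the N(x_i) above that is
    # guaranteed to form a valid probability distribution).
    shifted = kept - kept.min()
    total = shifted.sum()
    probs = shifted / total if total > 0 else np.full(len(kept), 1.0 / len(kept))
    return int(np.random.choice(top, p=probs))
```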
3.7 Module Interaction
When a user enters an utterance, it is first encoded using the query encoder. The encoding is then compared with our handcrafted query-response pairs in the scripted response module. If there is a high-confidence match beyond the threshold of 0.75 cosine similarity, we simply rank the associated candidates and return the response. If some candidates have command codes, we execute the related task. Otherwise, if the maximum cosine similarity score is less than the threshold, we store the scripted candidates and move on to the next sub-module, where we classify the dialog acts of the utterance. Based on the classified dialog acts, different decisions are made in a rule-based fashion (we just use "if-else" conditions). For example, if the dialog act is related to a factual question or a command, we add some bias for retrieved candidates (as described in subsection 3.6); otherwise we do not. For certain dialog acts which do not require an exact response, there is also a chance of going into an 'initiate' mode, where the bot may bring up some random fact or joke. Some special handcrafted response candidates are also mapped to the dialog act classes themselves. The specific details of the rules can be found in the code that we will make available¹⁴. If the model does not find its responses in the dialog-act submodule, it goes into the chatterbot-scripted response submodule and extracts candidates from there. If the maximum cosine similarity score is still less than 0.75, it goes to the large-scale retrieval and generation module. From all these modules, candidates are collected along with their source-based bias and sent to the ranker, which returns the final response. The model overview is shown in Figure 1.
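In pseudocode form, the cascade looks roughly like the sketch below; every helper name here is an illustrative placeholder for the corresponding module described above, not our actual API:

```python
def respond(utterance, history, modules):
    """Illustrative top-level control flow of the module cascade.

    `modules` is assumed to bundle callables for the components in
    Sections 3.1-3.6; none of these names come from our codebase.
    """
    query_vec = modules.encode_query(utterance, history)        # Section 3.1

    # 1. Handcrafted scripts: high-confidence matches short-circuit.
    cands = modules.handcrafted_lookup(query_vec, threshold=0.75)
    if cands is not None:
        return modules.execute_or_rank(cands)   # handles command codes too

    # 2. Dialog-act classification drives the rule-based decisions.
    act = modules.classify_dialog_act(query_vec)                # Section 3.3
    cands = modules.dialog_act_responses(act)

    # 3. Chatterbot-derived scripts, same 0.75 threshold.
    if cands is None:
        cands = modules.chatterbot_lookup(query_vec, threshold=0.75)

    # 4. Large-scale Reddit retrieval plus DialoGPT generation.
    if cands is None:
        cands = (modules.reddit_retrieve(query_vec)
                 + modules.generate_candidates(history))

    # 5. Rank all candidates with their source-based biases and sample.
    return modules.rank_and_sample(cands, act)
```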
4 EVALUATION
Our dialog-act classification model achieves a performance of about 86.12% on test data. For retrieval and generation, we use pre-trained modules which were already evaluated on quantitative measures in previous works [2, 3, 21, 23]. In our case, it is difficult to perform a quantitative analysis because we do not have any ground truths for the overall model. Instead, we do a qualitative analysis of the major modules separately (and also of the full model) on a fixed (preset) set of queries. We only use a few queries because our model is currently quite slow (there are some optimization issues with our TensorFlow Hub loading-encoding script which are slowing things down). Furthermore, we also had to load TensorFlow on the CPU because DialoGPT needs all the GPU memory. We are working on
14 https://github.com/JRC1995/Chatbot