
Open-Domain Conversational AI with Hybrid Generative and Retrieval Mechanisms

Jishnu Ray Chowdhury (jraych2@uic.edu), University of Illinois Chicago
Mobashir Sadat (msadat3@uic.edu), University of Illinois Chicago

1 INTRODUCTION
Conversational bots are programs that can engage in natural-language conversation with a partner. They can be used as virtual tutors, digital assistants, customer service agents, virtual therapists, task-oriented services, and entertainment. Conversational bots broadly come in three forms: (i) rule-based models, (ii) retrieval (IR) models, and (iii) generative models. Each of these variants has its own advantages and disadvantages. In this project we attempt to combine aspects of all these approaches, with a predominant focus on retrieval and generation. Our overall model is a synergy of multiple sub-modules for retrieval, classification, generation, and ranking. We focus on building an open-domain chatbot. Our key contributions are listed below.
Contributions:
(1) We provide a unique method to combine classification, generation, and retrieval for open-domain dialogue.
(2) We combine MMI-inspired [7] ranking (see https://github.com/jiweil/Jiwei-Thesis) with a standard ranking method to score candidate responses (both retrieved and generated ones).
(3) We propose and motivate future directions to enrich the model.

Figure 1: Model Overview.
2 RELATED WORKS
ELIZA [16] and ALICE [15] are some of the earliest implementations of chatbots. They rely entirely on explicit pre-defined rules or pattern-matching schemes. Mitsuku Bot (http://www.square-bear.co.uk/mitsuku/home.htm), which won the Loebner Prize (https://www.aisb.org.uk/events/loebner-prize) in 2013, 2016, 2017, and 2018, is also based on a similar scheme. This goes to show that ALICE-like models are still competitive. However, rule-based bots are not usually as flexible, adaptive, or scalable. It can be a life-long work for a bot-master to periodically review chat-logs and incrementally improve their rule-based bots.

Recent research on open-domain chatbots is mostly based on retrieval or generation. Generative models generate a response given a section of the conversation history. The newer generative bots are usually some variant of the neural seq2seq architecture [12, 14]. There are many works on developing conversational agents on seq2seq, addressing its typical issues (low diversity, generic responses, inability to track user characteristics, etc.; see https://github.com/ricsinaruto/Seq2seqChatbots/wiki/). One notable Recurrent Neural Network (RNN)-based neural conversational model that hierarchically encodes the conversation history is HRED [10]. HVRED [11], in addition to HRED, uses a Variational Autoencoder [6] based objective to create a latent variable that guides the response decoding process. Recently, however, Transformer-based models have been gaining prominence. Some of the notable Transformer-based generative models are DialoGPT [23] and Transfer-Transfo [19]. They are based on fine-tuning large pre-trained language models.

Besides generative models, there are also retrieval-based models, which retrieve existing responses from a dataset based on some measure of relevancy between the responses and the user utterance in the given context. One classic deep learning-based approach is using a dual LSTM encoder (as a Siamese network) [8] to encode both the input utterances and the responses or queries in a dataset, to predict the probability that a pair is relevant. Newer methods are based on Transformer-based PolyEncoder models [3, 5].

There are also multiple hybrid models that are ensembles of various approaches. In particular, most papers associated with the Alexa Prize (https://developer.amazon.com/alexaprize) are combinations of various modules. One recent paper [20] also explores this direction.

3 METHOD
Our overall model has five broad modules. The first module is based on retrieval from a collection of custom scripts with query-response pairs which are specifically made for the bot. The second module is dialogue act classification, based on which different downstream decisions are made. The third module is large-scale retrieval from a large Reddit corpus. The fourth module is a generative model based on DialoGPT [23], which is OpenAI's GPT-2 fine-tuned on Reddit data. The fifth module is a ranker of scripted, retrieved, and generated candidates.
Figure 2: Retriever

3.1 Meta-Sentence Embedding
Almost all of our modules utilize some form of sentence embedding. Inspired by Nina Poerner et al. [9], to encode sentences or utterances we take a meta-sentence embedding, where we concatenate multiple embeddings from different pre-trained models. For this project we settled on the concatenation of ConveRT (multi-context version; https://github.com/PolyAI-LDN/polyai-models) [3] and Universal Sentence Encoder QA (USE-QA; https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/2) [2, 21]. To encode queries or user utterances we concatenate the ConveRT context encoder (with the previous 5 turns of conversation history as extra context) and the USE-QA query encoder. To encode candidate responses (generated, retrieved, or scripted), we concatenate the ConveRT response encoder and the USE-QA answer encoder, with the previous response as context. Both encoders are based on pre-trained Transformers. USE-QA is a multilingual model which was fine-tuned on a SQuAD retrieval task. ConveRT was explicitly trained on Reddit data for conversational purposes, making it especially suited for our task. A sketch of the concatenation scheme follows.
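The following is a minimal sketch of the meta-embedding idea, not the project's actual code. The TF Hub URL is the USE-QA module cited above; the signature name follows the TF Hub documentation for that module family and may differ across versions, and encode_convert is a hypothetical stand-in for the PolyAI ConveRT context encoder.

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the USE-QA module cited in the text (signature names are assumptions).
USE_QA = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-qa/2")

def encode_query(utterance, history):
    # USE-QA question encoder.
    use_vec = USE_QA.signatures["question_encoder"](
        tf.constant([utterance]))["outputs"].numpy()[0]
    # ConveRT context encoder conditioned on the previous 5 turns.
    # `encode_convert` is a hypothetical helper standing in for PolyAI's API.
    convert_vec = encode_convert(utterance, context=history[-5:])
    # Meta-embedding: simple concatenation of the two embedding spaces.
    return np.concatenate([convert_vec, use_vec])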
3.2 Scripted Response Module
When constructing the scripted response module we kept the Zipfian distribution of utterances in mind: we wanted this module to handle the common, frequent kinds of questions and utterances ("How are you", "Who are you", "What do you do?", etc.). This module should also handle a lot of personal questions directed at the bot. It also gives the bot-developer some freedom to customize their bot for specific tasks and purposes. Our "scripting" consists of mapping potentially encounterable utterances or queries to a list of candidate responses. Besides personally handcrafting some of the scripted pairs, we also create a second set of mappings by extracting data from the Chatterbot Corpus (https://github.com/gunthercox/chatterbot-corpus). All the queries in the scripts are pre-encoded using the query encoder defined in the previous subsection. We use a retrieval mechanism on this module; precisely, we compute cosine-similarity scores between the encoded user utterance and the encoded queries in the script, as sketched below. We then find the query with the maximum score and retrieve all the candidate responses associated with that query. To some extent, this is analogous to classical AIML bots (instead of doing regex-based pattern matching, we are doing soft matching in semantic space). The retrieval mechanism is shown in Figure 2. Some of the queries in the script are mapped to 'command codes'. For example, "tell me a joke" is mapped to the command code "<JOKE>". When command codes are detected, specific tasks are executed. For example, upon encountering "<JOKE>", the bot may randomly retrieve a joke from r/jokes subreddit data and respond with it. This same technique can be used to add task-oriented services to our bot.
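As a concrete illustration, the script lookup reduces to a cosine-similarity search over pre-encoded script queries. This is a sketch under assumed data structures: script_queries is a matrix of pre-encoded query vectors and script_responses maps a query index to its candidate responses.

import numpy as np

def cosine_scores(query_vec, query_matrix):
    # Normalize both sides so a dot product yields cosine similarity.
    qn = query_vec / np.linalg.norm(query_vec)
    mn = query_matrix / np.linalg.norm(query_matrix, axis=1, keepdims=True)
    return mn @ qn

def scripted_candidates(user_vec, script_queries, script_responses):
    scores = cosine_scores(user_vec, script_queries)
    best = int(np.argmax(scores))
    # Return the candidates plus the match score so the caller can apply
    # the 0.75 confidence threshold described in Subsection 3.7.
    return script_responses[best], float(scores[best])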
3.3 Dialog-Act Classification Module
Dialog-act classification uses a simple MLP. We use the MIDAS dataset (https://github.com/DianDYu/MIDAS_dialog_act/tree/master/da_data) [22] along with its annotation scheme for training. As input it receives the query encodings described in Section 3.1 (thus, it also takes previous turns into account). This module can classify a given utterance, given its context, into dialog acts like command, open factual question, open opinion question, opinion, statement, back-channeling, etc. There are a total of 24 tags. It is a multi-label classification task where some samples have two gold classes and some have one. We use separate binary classifiers for each class with a sigmoid function. For prediction we use the two classes with the highest scores (predicted probabilities). We also add a "<NULL>" class, so if one of the predicted classes is "<NULL>" it means there is only one dialog act. Based on the predicted dialog acts, different downstream decisions are made, which will be explained in Subsection 3.7.
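A sketch of such a classifier head is given below. The hidden size, depth, and tag handling are illustrative assumptions, not the project's exact configuration.

import torch
import torch.nn as nn

class DialogActMLP(nn.Module):
    # Simple MLP over the query encoding, one sigmoid output per tag
    # (24 MIDAS tags plus "<NULL>").
    def __init__(self, embed_dim, num_tags=25, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_tags),
        )

    def forward(self, x):
        # Independent binary classifiers: sigmoid per tag, not softmax.
        return torch.sigmoid(self.net(x))

def predict_acts(model, query_vec, tag_names):
    # query_vec: a 1-D torch tensor holding the Section 3.1 encoding.
    probs = model(query_vec.unsqueeze(0)).squeeze(0)
    top2 = torch.topk(probs, k=2).indices.tolist()
    acts = [tag_names[i] for i in top2]
    # If one of the two predicted classes is "<NULL>", the utterance
    # carries a single dialog act.
    return [a for a in acts if a != "<NULL>"]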

3.4 Large-Scale Retrieval Module
This module uses the same mechanism as Section 3.2 for retrieval, but here it is done on organically created Reddit data (not something handcrafted for a conversational AI) and at a much larger scale. We used SQL to retrieve some of the highly upvoted threads and comments from the Reddit database in Google BigQuery (https://pushshift.io/using-bigquery-with-reddit-data/). We then pre-encode the queries and fill the relevant data into a local sqlite database so that we can easily find queries and candidate responses through some simple SQL. Using a database is also much easier on RAM. We used AWS to pre-encode the queries and fill the database.
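The storage scheme can be sketched as below. Table and column names are illustrative, not the project's actual schema; embeddings are stored as float32 bytes next to their candidate responses so lookups need only simple SQL rather than keeping the whole corpus in RAM.

import sqlite3
import numpy as np

conn = sqlite3.connect("reddit_retrieval.db")
conn.execute("""CREATE TABLE IF NOT EXISTS pairs (
    id INTEGER PRIMARY KEY,
    query_text TEXT,
    query_vec BLOB,
    response_text TEXT)""")

def add_pair(query_text, query_vec, response_text):
    # Store the pre-encoded query embedding as raw float32 bytes.
    conn.execute(
        "INSERT INTO pairs (query_text, query_vec, response_text) VALUES (?, ?, ?)",
        (query_text, query_vec.astype(np.float32).tobytes(), response_text))
    conn.commit()

def candidates_for(user_vec, top_k=10):
    # Scan stored embeddings and score by cosine similarity; chunked
    # SELECTs would keep memory modest at larger scales.
    rows = conn.execute("SELECT query_vec, response_text FROM pairs").fetchall()
    vecs = np.stack([np.frombuffer(v, dtype=np.float32) for v, _ in rows])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    scores = vecs @ (user_vec / np.linalg.norm(user_vec))
    order = np.argsort(-scores)[:top_k]
    return [(rows[i][1], float(scores[i])) for i in order]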
3.5 Generative Module
This module is based on the currently SOTA DialoGPT model, which is GPT-2 fine-tuned on millions of samples of Reddit dialogues. It is based on Huggingface's Transformers library [18] (https://github.com/huggingface/transformers), and we use their pretrained weights (https://github.com/microsoft/dialogpt). The original decoding code is not publicly shared, so we attempted to implement it ourselves. We started with some publicly shared code (https://colab.research.google.com/drive/1-_KjlAV3J1IVDw_9KogjKDCzgFY7Jp7E), but it does not use beam search or MMI [7], which were used for the SOTA variant of DialoGPT. We tried to implement beam search but had difficulties generating diverse candidates. Simple local nucleus sampling [4] at each time step gave much more diverse responses. We tried different approaches to diversify beam search [13], but to no avail. Finally, we settled on simply running nucleus sampling multiple times in parallel (in a batch) to generate multiple candidate responses, as sketched below. In this way, we generate 30 responses for each turn. We also use the last 3 turns of conversation history as extra context for the generator. The generator can sometimes generate good responses and sometimes bad ones for the same user utterance; thus, when creating multiple candidates, the chance of producing at least some good responses increases. We can then use a ranking module to select from the better responses.
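A minimal sketch of this batched sampling with the Hugging Face generate API follows. The checkpoint name and top_p value are illustrative; the text does not state the exact decoding parameters used.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

def generate_candidates(history, n_candidates=30, top_p=0.9):
    # DialoGPT separates turns with the EOS token; use the last 3 turns.
    context = tokenizer.eos_token.join(history[-3:]) + tokenizer.eos_token
    input_ids = tokenizer(context, return_tensors="pt").input_ids
    outputs = model.generate(
        input_ids,
        do_sample=True,               # nucleus sampling instead of beam search
        top_p=top_p,
        max_length=input_ids.shape[1] + 60,
        num_return_sequences=n_candidates,   # 30 candidates in one batch
        pad_token_id=tokenizer.eos_token_id,
    )
    # Strip the context prefix; keep only the newly generated tokens.
    return [tokenizer.decode(o[input_ids.shape[1]:], skip_special_tokens=True)
            for o in outputs]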
Figure 3: Ranker

3.6 Ranking Module
The bot first generates or retrieves multiple candidates for a given utterance or query. These candidates are then scored in this ranking module. Based on the scores, the top 3 candidates are chosen. Their scores are then normalized into a probability distribution, and a response is finally sampled from that distribution. During sampling we also check the conversation history to avoid repetitions if possible. The ranking module has two sub-components. One component computes cosine-similarity scores between query encodings and response encodings (the encoding methods are described in Subsection 3.1). For the other component we took some inspiration from the bidirectional anti-LM method introduced by Li et al. [7] for maximum mutual information (MMI). We would prefer less generic responses to be ranked higher than more generic ones. Intuitively, since a generic response applies to a large space of queries and utterances, it is hard to estimate the locality of the semantic space of the original utterance or query to which it was directed; from a more specific and informative response, estimating the theme of the original query or utterance is easier. For example, it is much easier to estimate that the original utterance or query is related to Loop Quantum Gravity from the response "Loop Quantum Gravity Theory attempts to unify relativity and Quantum Mechanics" than from the response "I don't know". Based on this intuition, we score each candidate by the cross-entropy error of generating the original utterance from the response with a 'reverse-generator'. This 'reverse-generator' is again DialoGPT, but with different weights and configuration: this version of DialoGPT was trained to predict the utterance or query context from the responses. Note that we are not using the full bidirectional anti-LM objective, which also uses log-probability scores for forward generation and a length penalty. While we have the log-probability scores immediately available for the generated sequences, we do not have them for the retrieved data. Though it is possible to obtain those scores by running the candidates through the forward generator, for the sake of simplicity we ignore them for now and only use the cross-entropy loss of the reverse generator for the MMI-based component.

The scores from the two components are then combined using the following formula, which we created:

    α · N(Score_cosine + Bias) + β · N(1 − N(Loss_reverse))
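Here, Loss_reverse is the reverse-generator cross-entropy described above; α, β, Bias, and N are explained after Table 1. A sketch of computing the reverse loss with Hugging Face Transformers follows. The checkpoint name is a placeholder: the actual component uses a separately trained reverse DialoGPT whose weights are not part of this sketch.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")  # placeholder
reverse_model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

@torch.no_grad()
def reverse_loss(response, utterance):
    # Condition on the candidate response, predict the original utterance.
    resp_ids = tok(response + tok.eos_token, return_tensors="pt").input_ids
    utt_ids = tok(utterance + tok.eos_token, return_tensors="pt").input_ids
    input_ids = torch.cat([resp_ids, utt_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, :resp_ids.shape[1]] = -100  # ignore loss on the conditioning part
    out = reverse_model(input_ids, labels=labels)
    # Lower loss suggests a less generic, more on-topic candidate.
    return out.loss.item()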

Say something: Hello, how are you?
Bot: Not a lot is going on for me at the moment
Say something: What is your name?
Bot: I don't really have a name.
Say something: Do you like movies? If so, what kind?
Bot: I like them all as long as the storyline and narration is good.
Say something: Are you sentient?
Bot: May be I am a bot pretending to be a human. May be I am a human pretending to be a bot. May be I am a bot pretending to be a human pretending to be a bot.
Say something: Tell me a Joke.
Bot: I just saw a midget get pickpocketed....... I still can't believe someone could stoop so low.
Say something: What can you tell me about Loop Quantum Gravity?
Bot: Quantum field theory put simply is relativistic quantum mechanics where you can create and destroy particles. It's not "*the*" quantum field theory; there are lots of quantum field theories. For example, quantum electrodynamics describes electromagnetic interactions.
Say something: Do you know anything about Sellar's myth of the given?
Bot: That is foundationalist epistemology, i.e. the notion that one can found & justify belief on an extra-theoretical basis. And he argues that this is a faulty position. https://plato.stanford.edu/entries/sellars/#4 So, Sellars is anti-foundationalist, instead claiming that all concepts, knowledge, even truth itself is wrapped up in a framework of inference. There's no such thing as a properly basic foundation to knowledge.
Say something: I am quite tired today; I didn't know that the project deadline was so soon. I had a sleepless night working on the project.
Bot: you can't make excuses for laziness; this is no big deal, but it's a slippery slope man. i'm a hypocrite though, i usually don't study and also don't sleep sooo
Say something: Did you know I accidentally deleted major portions of the project and had to reconstruct it again? Version control is important huh. Who knew?
Bot: Reminds me of the time I was working on a web development project in uni, the night before due date I kinda blew up the php code, and couldn't figure out what I broke.
Say something: Just my luck, I have three project presentations in the same day.
Bot: CLOSE REDDIT. GO ALL IN. YOU CAN DO IT. (also, what's the project about?)
Say something: By the way, can you tell me about Language Models?
Bot: On 1B dataset, [Exploring the Limits of Language Modeling](https://arxiv.org/abs/1602.02410) has 30 (23.7 with ensembles)
Say something: Please tell me more about them.
Bot: *Last year when I had to bake bread in the middle of the school day.* This sounds like it has a nice story behind it so I'd like to hear it :)

Table 1: Conversation with Scripts+Retrieval+Ranker.

α and β are scalar weights that determine how much weight is given to the query-response matching scores and how much to the reverse-generation loss scores, respectively. We set them to 0.4 and 0.6 respectively after some not-very-exhaustive experimentation and qualitative analysis. Ideally this tuning could be done with reinforcement learning in a human-in-the-loop setting, or in some other way, but we are keeping things simple for now. A separate bias term ('Bias') is present for every candidate. We use the 'Bias' term to bias the candidate scoring towards Reddit-retrieved candidates. This term can also be used for biasing the response candidates from a specific source or module. We add this bias towards ranking retrieval candidates because otherwise we found our model was often a bit too biased towards generated responses, even in some cases where the available retrieved candidates were preferable based on our subjective judgments. N is a normalizing function which converts the scores into a probability distribution:

    N(x_i) = (x_i − min(x)) / Σ_i x_i

Something like softmax could also be used alternatively. After computing the scores, we filter out most of the lower-scoring candidates and normalize the scores of the remaining ones into a probability distribution, from which the final response is sampled. We find that the probability distribution is sometimes quite flat, so if we don't filter, it is easy to sample lower-ranked candidates, which we do not want. The ranker module is shown in Figure 3.
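Putting the pieces together, the final ranking score can be computed as below. This is a sketch that follows the combination formula as written, with the stated α = 0.4 and β = 0.6 and N as defined above; edge cases (e.g., degenerate score sums) are ignored.

import numpy as np

ALPHA, BETA = 0.4, 0.6

def N(x):
    # Min-shifted normalization from the formula above.
    x = np.asarray(x, dtype=np.float64)
    return (x - x.min()) / x.sum()

def rank_scores(cosine_scores, biases, reverse_losses):
    # Higher cosine similarity and lower reverse-generation loss
    # should both raise a candidate's score.
    return (ALPHA * N(np.asarray(cosine_scores) + np.asarray(biases))
            + BETA * N(1.0 - N(reverse_losses)))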

Say something: Hello, how are you?
Bot: Am well. How about you?
Say something: What is your name?
Bot: I am Not Daniel Marsch. What is yours?
Say something: Do you like movies? If so, what kind?
Bot: No, but really, I like movies in general so I'll probably pick a genre!
Say something: Are you sentient?
Bot: I am sentient, so yes I am
Say something: Tell me a Joke.
Bot: I am a joke!
Say something: What can you tell me about Loop Quantum Gravity?
Bot: I remember as child doing quantum gravity when I was watching Super Star : The Movie the feels stopped.
Say something: Do you know anything about Sellar's myth of the given?
Bot: Nope. I thought it was a well known Science Fact of the Day, but I never knew it was actually proven.
Say something: I am quite tired today; I didn't know that the project deadline was so soon. I had a sleepless night working on the project.
Bot: Is the project as impressive as the last and the rainy day aliens.
Say something: Did you know I accidentally deleted major portions of the project and had to reconstruct it again? Version control is important huh. Who knew?
Bot: I knew.
Say something: Just my luck, I have three project presentations in the same day.
Bot: I'm trying to recreate those presentations because Sapphire
Say something: By the way, can you tell me about Language Models?
Bot: I got hooked on the language models
Say something: Please tell me more about them.
Bot: Just stick on the classics and type out some test.org stuff. Then tell me what you find when you get to their.org page

Table 2: Conversation with Generator+Ranker.

3.7 Module Interaction
When a user enters an utterance, it is first encoded using the query encoder. The encoding is then compared with our handcrafted query-response pairs in the scripted response module. If there is a high-confidence match beyond the threshold of 0.75 cosine similarity, we simply rank the associated candidates and return the response. If some candidates have command codes, we execute the related task. Otherwise, if the maximum cosine similarity score is less than the threshold, we store the scripted candidates and move on to the next sub-module, where we classify the dialog acts of the utterance. Based on the classified dialog acts, different decisions are made in a rule-based fashion (we just use "if-else" conditions). For example, if the dialog act is related to a factual question or a command, we add some bias for retrieved candidates (as described in Subsection 3.6); otherwise we do not. For certain dialog acts which do not require an exact response, there is also a chance of going into an 'initiate' mode, where the bot may bring up a random fact or joke. Some special handcrafted response candidates are also mapped to the dialog-act classes themselves. The specific details of the rules can be found in the code that we will make available. If the model does not find its response in the dialog-act submodule, it goes into the chatterbot-scripted response submodule and extracts candidates from there. If the maximum cosine similarity score is still less than 0.75, it goes to the large-scale retrieval and generation modules. Candidates from all these modules are collected, along with their source-based bias, and sent to the ranker, which returns the final response. A sketch of this routing logic follows. The model overview is shown in Figure 1.
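The routing can be summarized as plain if/else control flow, as in the sketch below. The helper names (encode_query, scripted_candidates, predict_acts, candidates_for, generate_candidates, rank) refer to the earlier sketches or are assumed entry points, the tag strings and bias value are illustrative, and glue such as tensor conversion is omitted; the actual rules live in the project code.

THRESHOLD = 0.75

def respond(utterance, history):
    user_vec = encode_query(utterance, history)

    # 1) Handcrafted scripts first (command-code execution omitted here).
    cands, score = scripted_candidates(user_vec, script_queries, script_responses)
    if score >= THRESHOLD:
        return rank(utterance, cands)

    # 2) Classify dialog acts; factual questions and commands bias retrieval.
    acts = predict_acts(act_model, user_vec, tag_names)
    bias = 0.1 if ("command" in acts or "open_question_factual" in acts) else 0.0

    # 3) Chatterbot scripts next, then large-scale retrieval + generation.
    pool = list(cands)
    cb_cands, cb_score = scripted_candidates(
        user_vec, chatterbot_queries, chatterbot_responses)
    pool += list(cb_cands)
    if cb_score < THRESHOLD:
        pool += [resp for resp, _ in candidates_for(user_vec)]
        pool += generate_candidates(history + [utterance])

    # 4) Rank everything; the ranker applies the source-based bias.
    return rank(utterance, pool, retrieval_bias=bias)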
4 EVALUATION
Our dialog-act classification model achieves a performance of about 86.12% on test data. For retrieval and generation we use pre-trained modules which were already evaluated on quantitative measures in previous works [2, 3, 21, 23]. In our case, it is difficult to do a quantitative analysis because we don't have any ground truths for the overall model. Instead, we do a qualitative analysis of the major modules separately (and also the full model) on a fixed (preset) set of queries. We only use a few queries because our model is currently quite slow (there are some optimization issues with our TensorFlow Hub loading-and-encoding script which are slowing things down). Furthermore, we also had to use the CPU for loading TensorFlow because DialoGPT needs all the GPU memory. We are working on speeding things up, or at least making a faster variant, which we may release on GitHub someday (https://github.com/JRC1995/Chatbot). From the evaluation we find that the modules perform quite well; the model retrieves more or less relevant texts, and even the generation is surprisingly good. The results are shown in Table 1, Table 2, and Table 3.

5 DISCUSSION
There is a lot of room for improvement in this project. The balancing between retrieval, generation, and the overall module interaction is done in a somewhat ad-hoc fashion. The overall model may have the best of both worlds of retrieval and generation, but it also has the worst of them. The retrieved responses often contain references and terms that give away the fact that the response was originally directed at someone else, although in a related context; this breaks immersion. The generative model is not perfect either. It can still produce simplistic replies, and some replies, although good, have noisy components (like a random nonsensical word or repetitions); we may need to check the decoding process and tune some parameters more exhaustively. Other rooms for improvement include knowledge grounding (one could connect the model to a database of Wikipedia information, with some IR and QA mechanisms and a new module for GPT-2 to take relevant knowledge-base facts into account), adding common-sense reasoning, a natural language inference module to filter candidates inconsistent with the conversational history, using latent variables to guide decoding, topic-tracking and topic-based database selection (where databases related to different topics are prepared with data from different internet sources), a text-to-speech synthesis module to make the bot able to speak, and so on. We initially tried to add some knowledge-base access in a shallow way by integrating DrQA [1] with a Wikipedia database, but we were not able to load some components of their project. We were also trying a custom Transformer-Encoder-LSTM-Decoder seq2seq code for paraphrase generation, but training did not go well; more hyper-parameter tuning and more training may be required. We initially wanted to add a paraphrase module to reduce the chance of repetition, potentially make retrieved sentences more interesting, and make generated sentences more coherent. Previous work [17] shows that an LSTM-based paraphrase generation module can even be used to make sentences more coherent and grammatically polished. Another serious concern is that there is nothing in this project to handle toxic and offensive content. Handling toxicity during ranking can be another future direction. It can also be worthwhile to explore distillation, model compression and, in general, inference speed-up techniques.

REFERENCES
[1] Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Association for Computational Linguistics (ACL).
[2] Muthuraman Chidambaram, Yinfei Yang, Daniel Cer, Steve Yuan, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2018. Learning Cross-Lingual Sentence Representations via a Multi-task Dual-Encoder Model. arXiv:cs.CL/1810.12836.
[3] Matthew Henderson, Iñigo Casanueva, Nikola Mrkšić, Pei-Hao Su, Ivan Vulić, et al. 2019. ConveRT: Efficient and Accurate Conversational Representations from Transformers. arXiv preprint arXiv:1911.03688.
[4] Ari Holtzman, Jan Buys, Maxwell Forbes, and Yejin Choi. 2019. The Curious Case of Neural Text Degeneration. arXiv preprint arXiv:1904.09751.
[5] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2019. Poly-encoders: Transformer Architectures and Pre-training Strategies for Fast and Accurate Multi-sentence Scoring. arXiv preprint arXiv:1905.01969.
[6] Diederik P. Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
[7] Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2015. A Diversity-Promoting Objective Function for Neural Conversation Models. arXiv preprint arXiv:1510.03055.
[8] Ryan Lowe, Nissan Pow, Iulian Serban, and Joelle Pineau. 2015. The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. arXiv preprint arXiv:1506.08909.
[9] Nina Poerner, Ulli Waltinger, and Hinrich Schütze. 2019. Sentence Meta-Embeddings for Unsupervised Semantic Textual Similarity. arXiv:cs.CL/1911.03700.
[10] Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C. Courville, and Joelle Pineau. 2015. Hierarchical Neural Network Generative Models for Movie Dialogues. CoRR abs/1507.04808. http://arxiv.org/abs/1507.04808
[11] Iulian Vlad Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. 2017. A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues. In AAAI. 3295–3301.
[12] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to Sequence Learning with Neural Networks. CoRR abs/1409.3215. http://arxiv.org/abs/1409.3215
[13] Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models. arXiv:cs.AI/1610.02424.
[14] Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model. CoRR abs/1506.05869. http://arxiv.org/abs/1506.05869
[15] Richard S. Wallace. 2009. The Anatomy of ALICE. In Parsing the Turing Test. Springer, 181–210.
[16] Joseph Weizenbaum. 1966. ELIZA: A Computer Program for the Study of Natural Language Communication Between Man and Machine. Commun. ACM 9, 1 (Jan. 1966), 36–45. https://doi.org/10.1145/365153.365168
[17] John Wieting and Kevin Gimpel. 2017. ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations. arXiv preprint arXiv:1711.05732.
[18] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, and Jamie Brew. 2019. HuggingFace's Transformers: State-of-the-Art Natural Language Processing. ArXiv abs/1910.03771.
[19] Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. TransferTransfo: A Transfer Learning Approach for Neural Network Based Conversational Agents. arXiv preprint arXiv:1901.08149.
[20] L. Yang, J. Hu, M. Qiu, C. Qu, J. Gao, W. B. Croft, X. Liu, Y. Shen, and J. Liu. 2019. A Hybrid Retrieval-Generation Neural Conversation Model. In CIKM '19.
[21] Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-Hsuan Sung, Brian Strope, and Ray Kurzweil. 2019. Multilingual Universal Sentence Encoder for Semantic Retrieval. arXiv:cs.CL/1907.04307.
[22] Dian Yu and Zhou Yu. 2019. MIDAS: A Dialog Act Annotation Scheme for Open Domain Human Machine Spoken Conversations. arXiv:cs.CL/1908.10023.
[23] Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. 2019. DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. arXiv preprint arXiv:1911.00536.

Say something: Hello, how are you?
Predicted Dialog Act(s): Open Question Factual
Bot: Not a lot is going on for me at the moment
Source: SCRIPT
Say something: What is your name?
Predicted Dialog Act(s): Open Question Factual
Bot: I don't really have a name.
Source: SCRIPT
Say something: Do you like movies? If so, what kind?
Predicted Dialog Act(s): Yes No Question
Bot: I like them all as long as the storyline and narration is good.
Source: SCRIPT
Say something: Are you sentient?
Predicted Dialog Act(s): Yes No Question
Bot: May be I am a bot pretending to be a human. May be I am a human pretending to be a bot. May be I am a bot pretending to be a human pretending to be a bot.
Source: SCRIPT
Say something: Tell me a Joke.
Predicted Dialog Act(s): Command
Bot: *Creating password*.... "MTWTFSS_MTWTFSS" ERROR: [Password two week] Edited for better understanding
Source: REDDIT (r/Jokes, r/WritingPrompts, r/TIL, etc.)
Say something: What can you tell me about Loop Quantum Gravity?
Predicted Dialog Act(s): Command
Bot: Quantum field theory put simply is relativistic quantum mechanics where you can create and destroy particles. It's not "*the*" quantum field theory; there are lots of quantum field theories. For example, quantum electrodynamics describes electromagnetic interactions.
Source: RETRIEVED FROM REDDIT
Say something: Do you know anything about Sellar's myth of the given?
Predicted Dialog Act(s): Yes No Question, Command
Bot: The only thing I know is that I know nothing at all.
Source: SCRIPT
Say something: I am quite tired today; I didn't know that the project deadline was so soon. I had a sleepless night working on the project.
Predicted Dialog Act(s): Statement
Bot: I had a relaxed afternoon working on the project.
Source: GENERATED
Say something: Did you know I accidentally deleted major portions of the project and had to reconstruct it again? Version control is important huh. Who knew?
Predicted Dialog Act(s): Yes No Question
Bot: Reminds me of the time I was working on a web development project in uni, the night before due date I kinda blew up the php code, and couldn't figure out what I broke.
Source: RETRIEVED FROM REDDIT
Say something: Just my luck, I have three project presentations in the same day.
Predicted Dialog Act(s): Statement
Bot: It's ok, there are lots of study groups that all go through the project anyway. Just keep practicing.ulatedatalalfeck
Source: GENERATED
Say something: By the way, can you tell me about Language Models?
Predicted Dialog Act(s): Command, Yes No Question
Bot: On 1B dataset, [Exploring the Limits of Language Modeling](https://arxiv.org/abs/1602.02410) has 30 (23.7 with ensembles)
Source: RETRIEVED FROM REDDIT
Say something: Please tell me more about them.
Predicted Dialog Act(s): Command
Bot: They are models on the English Language Language Wiki
Source: GENERATED

Table 3: Conversation with Full Model (In Verbose Mode).
