Coreference-Aware Dialogue Summarization

Coreference-Aware Dialogue Summarization
Zhengyuan Liu, Ke Shi, Nancy F. Chen

Institute for Infocomm Research, A*STAR, Singapore
{liu zhengyuan,shi ke,nfychen}@i2r.a-star.edu.sg
Abstract
Summarizing conversations via neural ap-
proaches has been gaining research traction
lately, yet it is still challenging to obtain
practical solutions. Examples of such chal-
lenges include unstructured information ex-
change in dialogues, informal interactions be-
tween speakers, and dynamic role changes of
speakers as the dialogue evolves. Many of
such challenges result in complex coreference
links. Therefore, in this work, we investi-
gate different approaches to explicitly incorpo-
rate coreference information in neural abstrac-
tive dialogue summarization models to tackle
the aforementioned challenges. Experimen-
tal results show that the proposed approaches
achieve state-of-the-art performance, implying
it is useful to utilize coreference information in
dialogue summarization. Evaluation results on
factual correctness suggest such coreference-
aware models are better at tracing the informa- Figure 1: An example of dialogue summarization: The
tion flow among interlocutors and associating original conversation (in grey) is abbreviated; the sum-
accurate status/actions with the corresponding mary generated by a baseline model is in blue; the
interlocutors and person mentions. summary generated by a coreference-aware model is
in orange. While these two summaries obtain similar
1 Introduction ROUGE scores, the summary from the baseline model
Text summarization condenses the source content is not factually correct; errors are highlighted in italic
and magenta.
into a shorter version while retaining essential and
informative content. Most prior work focuses on
summarizing well-organized single-speaker con- While there has been substantial progress on doc-
tent such as news articles (Hermann et al., 2015) ument summarization, dialogue summarization has
and encyclopedia documents (Liu* et al., 2018). received less attention. Unlike documents, conver-
Recently, models applied on text summarization sations are interactions among multiple speakers,
benefit favorably from sophisticated neural archi- they are less structured and are interspersed with
tectures and pre-trained contextualized language more informal linguistic usage (Sacks et al., 1978).
backbones: on the popular benchmark corpus Based on the characteristics of human-to-human
CNN/Daily Mail (Hermann et al., 2015), Liu and conversations (Jurafsky and Martin, 2008), chal-
Lapata (2019) explored fine-tuning BERT (De- lenges of summarizing dialogues stem from: (1)
vlin et al., 2019) to achieve state-of-the-art per- Multiple speakers: the interactive information ex-
formance for extractive news summarization, and change among interlocutors implies that essential
BART (Lewis et al., 2020) has also improved gen- information is referred to back and forth across
eration quality on abstractive summarization. speakers and dialogue turns; (2) Speaker role shift-
509
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 509–519
July 29–31, 2021. ©2021 Association for Computational Linguistics
ing: multi-turn dialogues often involve frequent capabilities by conducting probing analysis to aug-
role shifting from one type of interlocutor to an- ment our coreference injection design.
other type (e.g., questioner becomes responder and Experiments on SAMSum (Gliwa et al., 2019)
vice versa); (3) Ubiquitous referring expressions: show that the proposed methods achieve state-of-
aside from speakers referring to themselves and the-art performance. Furthermore, human evalua-
each other, speakers also mention third-party per- tion and error analysis suggest our models generate
sons, concepts, and objects. Moreover, referring more factually consistent summaries. As shown
could also take on forms such as anaphora or cat- in Figure 1, a model guided with coreference in-
aphora where pronouns are used, making corefer- formation accurately associates events with their
ence chains more elusive to track. Figure 1 shows corresponding subjects, and generates more trust-
one dialogue example: two speakers exchange in- worthy summaries compared with the baseline.
formation among interactive turns, where the pro-
noun “them” is used multiple times, referring to 2 Related Work
the word “sites”. Without sufficient understanding
of the coreference information, the base summa- In abstractive text summarization, recent stud-
rizer fails to link mentions with their antecedents, ies mainly focus on neural approaches. Rush
and produces an incorrect description (highlighted et al. (2015) proposed an attention-based neural
in magenta and italic) in the generation. From the summarizer with sequence-to-sequence generation.
aforementioned linguistic characteristics, dialogues Pointer-generator networks (See et al., 2017) were
possess multiple inherent sources of complex coref- designed to directly copy words from the source
erence, motivating us to explicitly consider coref- content, which resolved out-of-vocabulary issues.
erence information for dialogue summarization to Liu and Lapata (2019) leveraged the pre-trained
more appropriately model the context, to more dy- language model BERT (Devlin et al., 2019) on both
namically track the interactive information flow extractive and abstractive summarization. Lewis
throughout a conversation, and to enable the poten- et al. (2020) proposed BART, taking advantage of
tial of multi-hop dialogue reasoning. the bi-directional encoder in BERT and the auto-
regressive decoder of GPT (Radford et al., 2018) to
Previous work on dialogue summarization fo- obtain impressive results on language generation.
cuses on modeling conversation topics or dialogue While many prior studies focus on summarizing
acts (Goo and Chen, 2018; Liu et al., 2019; Li et al., well-organized text such as news articles (Hermann
2019; Chen and Yang, 2020). Few, if any, leverage et al., 2015), dialogue summarization has been gain-
on features from coreference information explicitly. ing traction. Shang et al. (2018) proposed an unsu-
On the other hand, large-scale pre-trained language pervised multi-sentence compression method for
models are shown only to implicitly model lower- meeting summarization. Goo and Chen (2018) in-
level linguistic knowledge such as part-of-speech troduced a sentence-gated mechanism to grasp the
and syntactic structure (Tenney et al., 2019; Jawa- relations between dialogue acts. Liu et al. (2019)
har et al., 2019). Without directly training on tasks proposed to utilize topic segmentation and turn-
that provide specific and explicit linguistic anno- level information (Liu and Chen, 2019) for conver-
tation such as coreference resolution or semantics- sational tasks. Zhao et al. (2019) proposed a neural
related reasoning, model performance remains sub- model with a hierarchical encoder and a reinforced
par for language generation tasks (Dasigi et al., decoder to generate meeting summaries. Chen and
2019). Therefore, in this paper, we propose to Yang (2020) used diverse conversational structures
improve abstractive dialogue summarization by like topic segments and conversational stages to
explicitly incorporating coreference information. design a multi-view summarizer, and achieved the
Since entities are linked to each other in coref- current state-of-the-art performance on the SAM-
erence chains, we postulate adding a graph neu- Sum corpus (Gliwa et al., 2019).
ral layer could readily characterize the underlying Improving factual correctness has received keen
structure, thus enhancing contextualized represen- attention in neural abstractive summarization lately.
tation. We further explore two parameter-efficient Cao et al. (2018) leveraged on dependency pars-
approaches: one with an additional coreference- ing and open information extraction to enhance
guided attention layer, and the other resourcefully the reliability of generated summaries. Zhu et al.
enhancing BART’s limited coreference resolution (2021) proposed a factual corrector model based on
510
Figure 2: Examples of three common issues in adopting a document coreference resolution model for dialogues
without additional domain adaptation training. Spans in blocks are items in coreference clusters with their cluster
ID number. We highlight some spans for better readability.
knowledge graphs, significantly improving factual conducted data post-processing on the automatic
correctness in text summarization. output: (1) First, we applied a model ensemble
strategy to obtain more accurate cluster predictions;
3 Dialogue Coreference Resolution (2) Then, we re-assigned coreference cluster labels
Since the common summarization datasets do not to the words with speaker roles that were not in-
contain coreference annotations, automatic coref- cluded in any chains; (3) Moreover, we compared
erence resolution is needed to process the samples. the clusters and merged those that presented the
Neural approaches (Joshi et al., 2020) have shown same coreference chain. Human evaluation on the
impressive performance on document coreference processed data showed that this post-processing
resolution. However, they are still sub-optimal reduced incorrect coreference assignments by ap-
for conversational scenarios (Chen et al., 2017), proximately 19%.2
and there are no large-scale annotated dialogue
4 Coreference-Aware Summarization
corpora for transfer learning. When applying a
document coreference resolution model (Lee et al., In this section, we adopt a neural model for ab-
2018; Joshi et al., 2020) on dialogue samples with- stractive dialogue summarization, and investigate
out domain adaptation,1 as shown in Figure 2, we various methods to enhance it with the coreference
observed some common issues: (1) Each dialogue information obtained in Section 3.
utterance is started with a speaker, but sometimes The base neural architecture is a sequence-to-
they were not recognized as a coreference-related sequence model Transformer (Vaswani et al., 2017).
entity, and thus not added in any coreference clus- Given a conversation containing n tokens T =
ters; (2) In dialogues, coreference chains often {t1 , t2 , ..., tn }, a self-attention-based encoder is
spanned across multiple turns, but sometimes they used to produce the contextualized hidden rep-
were split to multiple clusters; (3) When a dialogue resentations H = {h1 , h2 , ..., hn }, then an auto-
contained multiple coreference chain across multi- regressive decoder generates the target sequence
turns, speaker entities could be wrongly clustered. O = {w1 , w2 , ..., wk } sequentially. Here, we use
Based on the observation, to improve the over- BART (Lewis et al., 2020) as the pre-trained lan-
all quality of dialogue coreference resolution, we
2
In our pilot experiment, we observed that models with
1
The off-the-shelf version of coreference resolution original coreference resolution outputs showed 10% relative
model we used is allennlp-public-models/coref-spanbert- lower performance than that with the optimized data, validat-
large-2020.02.27, which is trained on OntoNotes 5.0 dataset. ing the effectiveness of our post-processing.
511
Figure 4: Architecture overview of the GNN-based
Figure 3: One dialogue example with labeled corefer- coreference fusion: the encoder is employed to encode
ence clusters: there are three coreference clusters in the input sequence; the coreference graph encoding
this conversation, where each cluster contains all men- layer is used to model the coreference connections be-
tions of one personal identity. tween all mentions; the auto-regressive decoder gener-
ates the summaries.
guage backbone, and conduct fine-tuning.

word tokenization), a coreference graph G is ini-
For each dialogue, there is a set of coreference
tialized with n nodes and an empty adjacent matrix
clusters {C1 , C2 , ..., Cu }, and each cluster Ci con-
G[:][:] = 0. Iterating each coreference cluster C,
tains entities {E1i , E2i ..., Em
i }. As the multi-turn
the first token ti of each mention (a word or a text
dialogue sample shown in Figure 3, there are three
span) is connected with the first token tj of its an-
coreference clusters (colored in yellow, red, and
tecedent in the same cluster with a bi-directional
blue, respectively), and each cluster consists a num-
edge, i.e., G[i][j] = 1 and G[j][i] = 1.
ber of words/spans in the same coreference chain.
During the conversational interaction, the referring 4.1.2 GNN Encoder
of pronouns is important for semantic context un- Given a graph G with the nodes (words/spans with
derstanding (Sacks et al., 1978), thus we postulate coreference information in the conversation) and
that incorporating coreference information explic- the edges (links between mentions), we employ
itly can be useful for abstractive dialogue summa- stacked graph modeling layers to update the hidden
rization. In this work, we focus on enhancing the representations H of all nodes. Here, we take a
encoder with auxiliary coreference features. single coreference graph encoding (CGE) layer as
4.1 GNN-Based Coreference Fusion an example: the input of the first CGE layer is the
output H from the Transformer encoder. We denote
As entities in coreference chains link to each other, the input of k-th CGE layer as H k = {hk1 , ..., hkn },
a graphical representation could readily character- and the representations of (k+1)-th layer H k+1 are
ize the underlying structure and facilitate compu- updated as follows:
tational modeling of the inter-connected relations.
In previous works, Graph Convolutional Networks uki = W0k ReLU(W1k hki + bk1 ) + bk2 (1)
(GCN) (Kipf and Welling, 2017) show strong ca-
pability of modeling graphical features in various vik = LayerNorm(hki + Dropout(uki )) (2)
tasks (Yasunaga et al., 2017; Xu et al., 2020), thus
we use it for the coreference feature fusion. X 1
wik = ReLU( W3k vjk + bk3 ) (3)
j∈Ni
|Ni |
4.1.1 Coreference Graph Construction
hk+1
i = LayerNorm(Dropout(wik ) + vik ) (4)
To build the chain of a coreference cluster, we add
links between each entity and their mentions. Un- where W and b denote the trainable parameter ma-
like previous work (Xu et al., 2020) where entities trix and bias, LayerN orm(∗) is the layer normal-
in one cluster are all pointed to the first occurrence, ization component, and Ni denotes the neighbor-
here we connect the adjacent pairs to retain more hood nodes of the i-th node. After feature propaga-
local information. More specifically, given a clus- tion in all stacked CGE layers, we obtain the final
ter Ci of entities {E1i , E2i ..., Em
i }, we add a link of
representations by adding the coreference-aware
each E to its precedent. hidden states H G = {hG G
1 , ..., hn } with the contex-
Then each coreference chain is transformed to a tualized hidden states H (here a weight λ is used,
graph, and fed to a graph neural network (GNN). and initialized as 0.7), then the auto-regressive de-
Given a text input of n tokens (here we use a sub- coder is applied to generate summaries.
512
Figure 6: Similarity distribution of head probing with
Figure 5: Architecture overview of coreference-guided
pre-defined coreference matrix. The X-axis shows the
attention model and an example of coreference atten-
heads in the 6-th layer of the Transformer encoder. Val-
tion weight matrix Ac , where {t1 ,t3 ,t7 } are in one
ues on the Y-axis denote the ratio that a head has the
coreference cluster and {t2 ,t5 } are in another cluster,
highest similarity with the coreference attention matrix.
while t4 and t6 are tokens without any coreference link.
we obtain the final representations with coreference
4.2 Coreference-Guided Attention information H A = {hA A
1 , ..., hn }, then they are fed
Aside from the GNN-based method which intro- to the decoder for output generation.
duces a certain number of additional parameters,
4.3 Coreference-Informed Transformer
we further explore a parameter-free method. With
the self-attention mechanism (Vaswani et al., 2017), While pre-trained models bring significant improve-
contextualized representation can be obtained with ment, they still present insufficient prior knowledge
attentive weighted sum. For entities in a corefer- for tasks requiring high-level semantic understand-
ence cluster, they all share the referring informa- ing such as coreference resolution. In this section,
tion at the semantic level. Therefore, we propose to we explore another parameter-free method by di-
fuse the coreference information via one additional rectly enhancing the language backbone. Since
attention layer in the contextualized representation. the encoder of our neural architecture uses the self-
Given a sample with coreference clusters, a attention mechanism, we proposed feature injection
coreference-guided attention layer is constructed by attention weight manipulation. In our case, the
to update the encoded representations H. The encoder of BART (Lewis et al., 2020) comprises 6
overview of adding the coreference-guided atten- multi-head self-attention layers, and each layer has
tion layer is shown in Figure 5. Since items in the 12 heads. To incorporate coreference information,
same coreference cluster are attended to each other, we selected heads and modified them with weights
values in the attention weight matrix Ac are nor- that present coreference mentions (see Figure 7).
malized with the number of all referring mentions 4.3.1 Attention Head Probing and Selection
in one cluster, then the representation hi of token i
To retain prior knowledge provided by the language
is updated according to the following:
backbone as much as possible, we first conduct a
X 1 probing task to strategically select attention heads.
ai = hj , if ti ∈ C ∗ (5)
∗
|C ∗ | Since different layers and heads convey linguis-
j∈C
tic features of different granularity (Hewitt and
hA
i = λhi + (1 − λ)ai (6) Manning, 2019), our target is to find the head that
represents the most coreference information. We
where ai is the attentive representation of ti , if ti probe the attention heads by measuring the cosine
belongs to one coreference cluster C ∗ , the repre- similarity between their attention weight matrix Ao
sentation of ti is updated, otherwise, it remains and a pre-defined coreference attention matrix Ac
unchanged. λ is an adjustable parameter and ini- as described in Section 4.2:
tialized as 0.7. In our experimental settings, we
observed that when λ is trainable, it is trained to be headprobe = arg max(cos(Aoi , Ac )) (7)
0.69 when our coreference-guided attention model i
achieved the best performance on the validation set. where Aoi is the attention weight matrix of the origi-
Following the coreference-guided attention layer, nal i-th head, and i ∈ (1, ..., Nh ), Nh is the number
513
# Conv # Sp # Turns # Ref Len
Train 14732 2.40 11.17 23.44
Validation 818 2.39 10.83 23.42
Test 819 2.36 11.25 23.12
Table 1: Data details of the SAMSum corpus. # Conv,

# Sp, # Turns and # Ref Len refer to the average number
of conversations, speakers, dialogue turns and the aver-
age number of words in the gold reference summaries.
5 Experiments
5.1 Dataset
Figure 7: Architecture overview of the coreference-
informed Transformer with attention head manipula- We evaluated the proposed methods on SAMSum
tion. The second attention head is selected and replaced (Gliwa et al., 2019), a dialogue summarization
by a coreference attention weight matrix Ac . dataset consisting of 16,369 conversations with
human-written summaries. Dataset statistics are
of heads in each layer. With all samples in the val- listed in Table 1.
idation set, we conducted probing on all heads in
the 5-th layer and 6-th layer of the ‘BART-base’ en- 5.2 Model Settings
coder. We observed that: (1) in the 5-th layer, the The vanilla sequence-to-sequence Transformer
7-th head obtained the highest similarity score on (Vaswani et al., 2017) was applied as the base ar-
95.2% evaluation samples; (2) in the 6-th layer, the chitecture. We used the pre-trained ‘BART-base’
5-th head obtained the highest similarity score on (Lewis et al., 2020) as language backbone. Then,
68.9% evaluation samples. The statistics of heads we enhanced the base model with following three
in 6-th encoding layer are shown in Figure 6. methods: Coref-GNN: Incorporating coreference
4.3.2 Coreference-Informed Multi-Head information by the GNN-based fusion (see Sec-
Self-Attention tion 4.1); Coref-Attention: Encoding corefer-
ence information by an additional attention layer
In order to explicitly utilize the coreference infor-
(see Section 4.2); Coref-Transformer: Model-
mation, we replaced the two predominant attention
ing coreference information by the attentive head
heads with coreference-informed attention weights.
probing and replacement (see Section 4.3). Sev-
The multi-head self-attention layers (Vaswani et al.,
eral baselines were selected for comparison: (1)
2017) are formulated as:
Pointer-Generator Network (See et al., 2017); (2)
QK T DynamicConv-News (Wu et al., 2019); (3) Fast-
Attention(Q, K, V ) = Softmax( √ )V (8)
dk Abs-RL-Enhanced (Chen and Bansal, 2018); (4)
headi = Attention(QWiQ , KWiK , V WiV ) (9) Multi-View BART (Chen and Yang, 2020), which
provides the state-of-the-art result.
MHA(Q, K, V ) = Concat(head1 , ..., headNh ) (10)
5.3 Training Configuration
FFN(xli ) = ReLU(xli W1F + bF F F
1 )W2 + b2 (11)
The proposed models were implemented in Py-
where Q, K and V are the sets of queries, keys Torch (Paszke et al., 2019), and Hugging Face
and values respectively. W and b are the trainable Transformers (Wolf et al., 2020). The Deep Graph
parameter matrix and bias. dk is the dimension Library (DGL) (Wang et al., 2019) was used for
of keys, xli is the representation of i-th token after implementing the Coref-GNN. The trainable param-
the l-th multi-head self-attention layer. FFN is eters were optimized by Adam (Kingma and Ba,
the point-wise feed forward layer. Based on the 2014). The learning rate of the GCN component
probing analysis in Section 4.3.1, we selected the was 1e-3, and that of BART was set at 2e-5. We
7-th head of 5-th encoding layer, and the 5-th head trained each model for 20 epochs and selected the
of 6-th encoding layer for coreference injection, best checkpoints on the validation set with ROUGE-
and observed that models with probing selection 2 score. All experiments were run on a single Tesla
outperformed that of random head selection. V100 GPU with 16GB memory.
514
ROUGE-1 ROUGE-2 ROUGE-L
Model
F P R F P R F P R
Pointer-Generator* 40.1 - - 15.3 - - 36.6 - -
Fast-Abs-RL-Enhanced* 42.0 - - 18.1 - - 39.2 - -
DynamicConv-News* 45.4 - - 20.6 - - 41.5 - -
BART-Large* 48.2 49.3 51.7 24.5 25.1 26.4 46.6 47.5 49.5
Multi-View BART-Large* 49.3 51.1 52.2 25.6 26.5 27.4 47.7 49.3 49.9
BART-Base 48.7 50.8 51.5 23.9 25.8 24.9 45.3 48.4 47.3
Coref-GNN 50.3 56.1 50.3 24.5 27.3 24.6 46.0 50.9 46.8
Coref-Attention 50.9 54.6 52.8 25.5 27.4 26.8 46.6 50.0 48.4
Coref-Transformer 50.3 55.5 50.9 25.1 27.7 25.6 46.2 50.9 46.9
Table 2: ROUGE scores of baselines and proposed models. * denotes the results from Chen and Yang (2020). F, P,
and R denote F1 Score, Precision and Recall, respectively.
6 Results Model Average # Words

Reference 23.12 ± 12.20
6.1 Automatic Evaluation BART-Base 22.72 ± 10.78
Coref-GNN 19.62 ± 8.75
We quantitatively evaluated the proposed methods
Coref-Attention 21.68 ± 10.27
with the standard metric ROUGE (Lin and Och, Coref-Transformer 20.54 ± 9.39
2004), and reported ROUGE-1, ROUGE-2 and
ROUGE-L.3 As shown in Table 2, our base model Table 3: Average word number with standard devia-
BART-Base outperformed Fast-Abs-RL-Enhanced tions of generated summaries.
and DynamicConv-News significantly, showing the
effectiveness of fine-tuning pre-trained language Model Average Scores
backbones for abstractive dialogue summarization. BART-Base 0.60
Coref-GNN 0.84
However, BART-Large did not bring substantial
Coref-Attention 1.16
improvement despite doubling the parameter size Coref-Transformer 0.96
and training time of BART-Base. As shown in Ta-
ble 2, compared with the base model BART-Base, Table 4: Human evaluation results: each summary is
the performance is improved significantly by our scored on the scale of [-2, 0, 2] as (Chen and Yang,
proposed methods. In particular, Coref-Attention 2020). Reported scores are averaged on 100 samples.
performed best with 4.95%, 6.69% and 2.87% rel-
ative F-measure score improvement, and Coref-
GNN achieved the highest scores on precision scores while the recall scores remains comparable
with 10.43% on ROUGE-1, 5.81% on ROUGE-2 with strong baselines. Moreover, as shown in Table
and 5.17% on ROUGE-L. Coref-Transformer also 3, the average length of generated summaries of the
showed consistent improvement. base model is 22.72, and that of the coref-models
Moreover, compared with the previous SOTA is slightly shorter. We speculated that the proposed
Multi-View BART-Large (Chen and Yang, 2020), models tend to generate more concise summaries
the proposed models performed better on ROUGE- while preserving the important information, which
1 scores, especially on the precision metrics. More is also supported by the analysis in Section 7.1.
specifically, precision scores are improved 9.78%,
6.2 Human Evaluation
6.85%, and 8.61% relatively by Coref-GNN, Coref-
Attention and Coref-Transformer, respectively. For As the example shown in Figure 1, ROUGE scores
ROUGE-2 and ROUGE-L, our models also obtain are insensitive to semantic errors such as incorrect
comparable performance. reference, thus we conducted human evaluation to
As shown in Table 2, we also observed that the complement objective metrics. Following Gliwa
most significant improvement is on the precision et al. (2019) and Chen and Yang (2020), each sum-
3
mary is scored on the scale of [-2, 0, 2], where -2
We used integrated functions in HuggingFace Transform-
ers (Wolf et al., 2020) to calculate ROUGE scores. Note that means the summary is unacceptable with the wrong
different libraries may result in different ROUGE scores. reference, extracted irrelevant information or does
515
Model Missing Information Redundant Information Wrong Reference Incorrect Reasoning
Base Model 34 26 22 20
Coref-GNN 32 [5.8% ↓] 8 [69% ↓] 14 [36% ↓] 16 [20% ↓]
Coref-Attention 28 [17% ↓] 4 [84% ↓] 12 [45% ↓] 9 [55% ↓]
Coref-Transformer 32 [5.8% ↓] 12 [53% ↓] 14 [36% ↓] 12 [40% ↓]
Table 5: Percentage of typical errors in summaries generated by the baseline and our proposed models. Values in
brackets denote the relative decrease compared with the base model.
Conversation (abbreviated) BART-Base Coref-Attention

Eric is not sure if he’s go-
(i) ... Ivan : so youre coming to the wedding Eric: your brother’s Ivan: Eric is coming to Ivan’s
ing to the wedding, be-
yea Eric: i dont know mannn Ivan: YOU DONT KNOW?? Eric: i brother’s wedding. Eric
cause he has a lot to do at
just have a lot to do at home, plus i dont know if my parents would has a lot to do at home
home and doesn’t know if
let me Ivan: ill take care of your parents Eric: youre telling me you and he can’t take care of
his parents would let him.
have the guts to talk to them XD Ivan: thats my problem Eric: okay his parents. Ivan will be
Ivan will come to Eric’s
man, if you say so Ivan: yea just be there Eric: alright there.
wedding.
(ii) Derek McCarthy: Filip - are you around? Would you have an
Tommy will call his wife
Android cable I could borrow for an hour? ... Tommy : I am in Filip will lend Derek Mc-
to borrow a phone charger
Poland but can ring my wife and she will give you one ... Tommy: 67 Carthy his Android cable.
from Derek McCarthy.
glenoaks close Derek McCarthy: That would be great if you could!! He will call his wife at 67
Tommy will be at home
... Tommy: Sent her msg. She will give it to you. Approx time when glenoaks close.
at 8:15 pm.
she will be at home is 8:15 pm Derek McCarthy: Thanks again!! ...
(iii) Ann: Congratulations!! Ann: You did great, both of you! Sue: Ann congratulates Sue
Ann and Julie are congrat-
Thanks, Ann Julie: I’m glad it’s over! Julie: That’s co cute of you, and Julie on their success.
ulating Sue on their suc-
my girl! Ann: Let’s have a little celebration tonight! Sue: I’m in Julie: Ann and Julie will cele-
cess.
me too!!! aww brate tonight.
Table 6: Three examples of generated summaries: For conversation i and conversation ii, Coref-Attention model
generated correct summaries by incorporating coreference information. Coref-Attention model generated an im-
perfect summary for conversation iii due to inaccurate coreference resolution provided.
not make logical sense, 0 means the summary is tent in the generated summary compared with the
acceptable but lacks of important information con- human-written reference.
verge, and 2 refers to a good summary which is con- Wrong References: The actions are associated
cise and informative. We randomly selected 100 with the wrong interlocutors or mentions (e.g., In
test samples, and scored the summaries generated the example of Figure 1, the summary generated
by the base model, Coref-GNN, Coref-Attention by base model confused “Payton” and “Max” in
and Coref-Transformer. Four linguistic experts the actions of “look for good places to buy clothes”
conducted the human evaluation, and their average and “love reading books”).
scores are reported in Table 4. Compared with the Incorrect Reasoning: The model incorrectly rea-
base model, our coref-models obtain higher scores sons the conclusion from context of multiple di-
in human ratings, which is consistent with the quan- alogue turns. Moreover, wrong reference and in-
titative ROUGE results. correct reasoning will lead to factual inconsistency
from source content.
7 Analysis We randomly sampled 100 conversations in the test
set and manually annotated the summaries gener-
7.1 Quantitative Analysis ated by the base and our proposed models with the
To further evaluate the generation quality and effec- four error types. As shown Table 5, 34% of sum-
tiveness of coreference fusion for dialogue summa- maries generated by the base model cannot summa-
rization, we annotated four types of common errors rize all the information included in the gold refer-
in the automatic summaries: ences, and models with coreference fusion improve
Missing Information: The content is incomplete the information coverage marginally. Coreference-
in the generated summary compared with the aware models essentially reduced the redundant
human-written reference. information: 84% relative reduction by Coref-
Redundant Information: There is redundant con- Attention, 69% relative reduction by Coref-GNN,
516
and 53% relative reduction by Coref-Transformer. conversations. Our proposed models compare fa-
Coref-Attention model also performed best on re- vorably with baselines without coreference guid-
ducing 45% of wrong reference errors relatively, ance and generate summaries with higher factual
Coref-GNN and Coref-Transformer both relatively consistency. Our work provides empirical evidence
reduced 36% of that. Encoding coreference infor- that coreference is useful in dialogue summariza-
mation by an additional attention layer substan- tion and opens up new possibilities of exploiting
tially improves the reasoning capability by reduc- coreference for other dialogue related tasks.
ing 55% relatively in incorrect reasoning, Coref-
Transformer and Coref-GNN also relatively re- Acknowledgments
duced this error by 40% and 20% compared with
This research was supported by funding from
the base model. This shows our models can gen-
the Institute for Infocomm Research (I2R) under
erate more concise summaries with less redundant
A*STAR ARES, Singapore. We thank Ai Ti Aw
content, and incorporating coreference information
and Minh Nguyen for insightful discussions. We
is helpful to reduce wrong references, and conduct
also thank the anonymous reviewers for their pre-
better multi-turn reasoning.
cious feedback to help improve and extend this
7.2 Sample Analysis piece of work.
Here we conducted a sample analysis as in (Lewis

et al., 2020). Table 6 shows 3 examples along References
with their corresponding summaries from the BART- Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018.
Base and Coref-Attention model. Conversation i Faithful to the original: Fact aware neural abstrac-
and ii contain multiple interlocutors and referrals. tive summarization. In Proceedings of the AAAI
The base model made some referring mistakes: (1) Conference on Artificial Intelligence, volume 32.
in conversation i, “your brother’s wedding” should Henry Y Chen, Ethan Zhou, and Jinho D Choi. 2017.
refer to “Ivan’s brother’s wedding”; (2) in con- Robust coreference resolution and entity linking on
versation ii, since “Fillip” and “Tommy” are ex- dialogues: Character identification on tv show tran-
actly the same person, pronouns “you” and “I” in scripts. In Proceedings of the 21st Conference on
Computational Natural Language Learning (CoNLL
“Would you have an Android cable I could borrow...” 2017), pages 216–225.
should refer to “Tommy” and “Derek McCarthy”,
respectively. In contrast, the Coref-Attention model Jiaao Chen and Diyi Yang. 2020. Multi-view sequence-
was able to make correct statements. However, to-sequence models with conversational structure
for abstractive dialogue summarization. In Proceed-
if the coreference resolution quality is poor, the ings of the 2020 Conference on Empirical Methods
coreference-aware models will be affected. For ex- in Natural Language Processing (EMNLP), pages
ample, in the conversation iii, when the pronouns 4106–4118, Online. Association for Computational
“you” and “my girl” in “Julie: That’s co cute of Linguistics.
you, my girl” are wrongly included in the corefer- Yen-Chun Chen and Mohit Bansal. 2018. Fast abstrac-
ence cluster of “Julie”, the model will also make tive summarization with reinforce-selected sentence
referring mistakes in the summary . rewriting. In Proceedings of the 56th Annual Meet-
ing of the Association for Computational Linguis-
8 Conclusion tics (Volume 1: Long Papers), pages 675–686, Mel-
bourne, Australia. Association for Computational
Linguistics.
In this paper, we investigated the effectiveness of
utilizing coreference information for summarizing Pradeep Dasigi, Nelson F. Liu, Ana Marasović,
multi-party conversations. We proposed three ap- Noah A. Smith, and Matt Gardner. 2019. Quoref:
proaches to explicitly incorporate coreference in- A reading comprehension dataset with questions re-
quiring coreferential reasoning. In Proceedings of
formation into neural abstractive dialogue summa- the 2019 Conference on Empirical Methods in Nat-
rization: (1) GNN-based coreference fusion; (2) ural Language Processing and the 9th International
coreference-guided attention; and (3) coreference- Joint Conference on Natural Language Processing
informed Transformer. These methods can be (EMNLP-IJCNLP), pages 5925–5932, Hong Kong,
China. Association for Computational Linguistics.
adopted on various neural architectures. Quantita-
tive results and human analysis suggest that coref- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and
erence information helps track referring chains in Kristina Toutanova. 2019. BERT: Pre-training of
517
deep bidirectional transformers for language under- Kenton Lee, Luheng He, and Luke Zettlemoyer. 2018.
standing. In Proceedings of the 2019 Conference Higher-order coreference resolution with coarse-to-
of the North American Chapter of the Association fine inference. In Proceedings of the 2018 Confer-
for Computational Linguistics: Human Language ence of the North American Chapter of the Associ-
Technologies, Volume 1 (Long and Short Papers), ation for Computational Linguistics: Human Lan-
pages 4171–4186, Minneapolis, Minnesota. Associ- guage Technologies, pages 687–692.
ation for Computational Linguistics.
Mike Lewis, Yinhan Liu, Naman Goyal, Mar-
Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and jan Ghazvininejad, Abdelrahman Mohamed, Omer
Aleksander Wawer. 2019. SAMSum corpus: A Levy, Veselin Stoyanov, and Luke Zettlemoyer.
human-annotated dialogue dataset for abstractive 2020. BART: Denoising sequence-to-sequence pre-
summarization. In Proceedings of the 2nd Workshop training for natural language generation, translation,
on New Frontiers in Summarization, pages 70–79, and comprehension. In Proceedings of the 58th An-
Hong Kong, China. Association for Computational nual Meeting of the Association for Computational
Linguistics. Linguistics, pages 7871–7880, Online. Association
for Computational Linguistics.
Chih-Wen Goo and Yun-Nung Chen. 2018. Ab-
Manling Li, Lingyu Zhang, Heng Ji, and Richard J.
stractive dialogue summarization with sentence-
Radke. 2019. Keep meeting summaries on topic:
gated modeling optimized by dialogue acts. In
Abstractive multi-modal meeting summarization. In
2018 IEEE Spoken Language Technology Workshop
Proceedings of the 57th Annual Meeting of the
(SLT), pages 735–742. IEEE.
Association for Computational Linguistics, pages
2190–2196, Florence, Italy. Association for Compu-
Karl Moritz Hermann, Tomas Kocisky, Edward Grefen-
tational Linguistics.
stette, Lasse Espeholt, Will Kay, Mustafa Suleyman,
and Phil Blunsom. 2015. Teaching machines to read Chin-Yew Lin and Franz Josef Och. 2004. Auto-
and comprehend. In Advances in neural information matic evaluation of machine translation quality us-
processing systems, pages 1693–1701. ing longest common subsequence and skip-bigram
statistics. In Proceedings of the 42nd Annual Meet-
John Hewitt and Christopher D. Manning. 2019. A ing of the Association for Computational Linguistics
structural probe for finding syntax in word repre- (ACL-04), pages 605–612, Barcelona, Spain.
sentations. In Proceedings of the 2019 Conference
of the North American Chapter of the Association Peter J. Liu*, Mohammad Saleh*, Etienne Pot, Ben
for Computational Linguistics: Human Language Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam
Technologies, Volume 1 (Long and Short Papers), Shazeer. 2018. Generating wikipedia by summariz-
pages 4129–4138, Minneapolis, Minnesota. Associ- ing long sequences. In International Conference on
ation for Computational Linguistics. Learning Representations.
Ganesh Jawahar, Benoı̂t Sagot, and Djamé Seddah. Yang Liu and Mirella Lapata. 2019. Text summariza-
2019. What does BERT learn about the structure tion with pretrained encoders. In Proceedings of
of language? In Proceedings of the 57th Annual the 2019 Conference on Empirical Methods in Nat-
Meeting of the Association for Computational Lin- ural Language Processing and the 9th International
guistics, pages 3651–3657, Florence, Italy. Associa- Joint Conference on Natural Language Processing
tion for Computational Linguistics. (EMNLP-IJCNLP), pages 3730–3740, Hong Kong,
China. Association for Computational Linguistics.
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S.
Zhengyuan Liu and Nancy Chen. 2019. Reading turn
Weld, Luke Zettlemoyer, and Omer Levy. 2020.
by turn: Hierarchical attention architecture for spo-
SpanBERT: Improving pre-training by representing
ken dialogue comprehension. In Proceedings of the
and predicting spans. Transactions of the Associa-
57th Annual Meeting of the Association for Compu-
tion for Computational Linguistics, 8:64–77.
tational Linguistics, pages 5460–5466.
Daniel Jurafsky and James H Martin. 2008. Speech Zhengyuan Liu, Angela Ng, Sheldon Lee, Ai Ti Aw,
and language processing: An introduction to speech and Nancy F Chen. 2019. Topic-aware pointer-
recognition, computational linguistics and natural generator networks for summarizing spoken conver-
language processing. Upper Saddle River, NJ: Pren- sations. In 2019 IEEE Automatic Speech Recogni-
tice Hall. tion and Understanding Workshop (ASRU), pages
814–821. IEEE.
Diederik Kingma and Jimmy Ba. 2014. Adam: A
method for stochastic optimization. International Adam Paszke, Sam Gross, Francisco Massa, Adam
Conference on Learning Representations (ICLR). Lerer, James Bradbury, Gregory Chanan, Trevor
Killeen, Zeming Lin, Natalia Gimelshein, Luca
Thomas N. Kipf and Max Welling. 2017. Semi- Antiga, et al. 2019. Pytorch: An imperative style,
supervised classification with graph convolutional high-performance deep learning library. In Ad-
networks. In Proceedings of International Confer- vances in Neural Information Processing Systems,
ence on Learning Representations (ICLR). pages 8024–8035.
518
Alec Radford, Karthik Narasimhan, Tim Salimans, and Quentin Lhoest, and Alexander Rush. 2020. Trans-
Ilya Sutskever. 2018. Improving language under- formers: State-of-the-art natural language process-
standing by generative pre-training. ing. In Proceedings of the 2020 Conference on Em-
pirical Methods in Natural Language Processing:
Alexander M. Rush, Sumit Chopra, and Jason Weston. System Demonstrations, pages 38–45, Online. Asso-
2015. A neural attention model for abstractive sen- ciation for Computational Linguistics.
tence summarization. In Proceedings of the 2015
Conference on Empirical Methods in Natural Lan- Felix Wu, Angela Fan, Alexei Baevski, Yann Dauphin,
guage Processing, pages 379–389, Lisbon, Portugal. and Michael Auli. 2019. Pay less attention with
Association for Computational Linguistics. lightweight and dynamic convolutions. In Interna-
tional Conference on Learning Representations.
HARVEY Sacks, EMANUEL A. SCHEGLOFF, and
GAIL JEFFERSON. 1978. A simplest systematics Jiacheng Xu, Zhe Gan, Yu Cheng, and Jingjing Liu.
for the organization of turn taking for conversation. 2020. Discourse-aware neural extractive text sum-
In JIM SCHENKEIN, editor, Studies in the Organi- marization. In Proceedings of the 58th Annual Meet-
zation of Conversational Interaction, pages 7 – 55. ing of the Association for Computational Linguistics,
Academic Press. pages 5021–5031, Online. Association for Computa-
tional Linguistics.
Abigail See, Peter J Liu, and Christopher D Manning.
Michihiro Yasunaga, Rui Zhang, Kshitijh Meelu,
2017. Get to the point: Summarization with pointer-
Ayush Pareek, Krishnan Srinivasan, and Dragomir
generator networks. In Proceedings of the 55th An-
Radev. 2017. Graph-based neural multi-document
nual Meeting of the Association for Computational
summarization. In Proceedings of the 21st Confer-
Linguistics (Volume 1: Long Papers), pages 1073–
ence on Computational Natural Language Learning
1083.
(CoNLL 2017), pages 452–462, Vancouver, Canada.
Association for Computational Linguistics.
Guokan Shang, Wensi Ding, Zekun Zhang, An-
toine Tixier, Polykarpos Meladianos, Michalis Vazir- Zhou Zhao, Haojie Pan, Changjie Fan, Yan Liu, Lin-
giannis, and Jean-Pierre Lorré. 2018. Unsuper- lin Li, Min Yang, and Deng Cai. 2019. Abstrac-
vised abstractive meeting summarization with multi- tive meeting summarization via hierarchical adap-
sentence compression and budgeted submodular tive segmental network learning. In The World Wide
maximization. In Proceedings of the 56th An- Web Conference, pages 3455–3461.
nual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), pages 664– Chenguang Zhu, William Hinthorn, Ruochen Xu,
674, Melbourne, Australia. Association for Compu- Qingkai Zeng, Michael Zeng, Xuedong Huang, and
tational Linguistics. Meng Jiang. 2021. Enhancing factual consistency
of abstractive summarization. In Proceedings of the
Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, 2021 Conference of the North American Chapter of
Adam Poliak, R Thomas McCoy, Najoung Kim, the Association for Computational Linguistics: Hu-
Benjamin Van Durme, Sam Bowman, Dipanjan Das, man Language Technologies, pages 718–733, On-
and Ellie Pavlick. 2019. What do you learn from line. Association for Computational Linguistics.
context? probing for sentence structure in contextu-
alized word representations. In International Con-
ference on Learning Representations.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob

Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
Kaiser, and Illia Polosukhin. 2017. Attention is all
you need. In Advances in neural information pro-
cessing systems, pages 5998–6008.
Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei

Li, Xiang Song, Jinjing Zhou, Chao Ma, Ling-
fan Yu, Yu Gai, Tianjun Xiao, Tong He, George
Karypis, Jinyang Li, and Zheng Zhang. 2019. Deep
graph library: A graph-centric, highly-performant
package for graph neural networks. arXiv preprint
arXiv:1909.01315.
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien

Chaumond, Clement Delangue, Anthony Moi, Pier-
ric Cistac, Tim Rault, Remi Louf, Morgan Funtow-
icz, Joe Davison, Sam Shleifer, Patrick von Platen,
Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu,
Teven Le Scao, Sylvain Gugger, Mariama Drame,
519

Coreference-Aware Dialogue Summarization

Uploaded by

Copyright:

Available Formats

Coreference-Aware Dialogue Summarization

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Coreference-Aware Dialogue Summarization

Uploaded by

Copyright:

Available Formats

Coreference-Aware Dialogue Summarization

Zhengyuan Liu, Ke Shi, Nancy F. Chen

guage backbone, and conduct fine-tuning.

Table 1: Data details of the SAMSum corpus. # Conv,

6 Results Model Average # Words

Conversation (abbreviated) BART-Base Coref-Attention

Here we conducted a sample analysis as in (Lewis

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob

Minjie Wang, Da Zheng, Zihao Ye, Quan Gan, Mufei

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien

You might also like