EMNLP 2024 Handbook Digital
2024
MIAMI, FLORIDA
November 12–16
Page
1 Conference Information 1
Message from the General Chair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Message from the Program Chairs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Message from the Local Chair . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Organizing Committee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Senior Program Committee . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Conference Organizers & Vendors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Anti-Harassment Policy 16
3 Meal Info 18
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4 Welcome Reception 20
5 Social Event Gala Dinner 21
6 Keynotes 25
7 Panel 29
10 Oral Presentations 40
Session 02 - Nov 12 (Tue) 11:00-12:30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Language Modeling 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
Interpretability and Analysis of Models for NLP 1 . . . . . . . . . . . . . . . . . . . . 41
Low-resource Methods for NLP 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
Human-centered NLP 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Machine Translation 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
Session 03 - Nov 12 (Tue) 14:00-15:30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Generation and Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Dialogue and Interactive Systems 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
Computational Social Science and Cultural Analytics 1 . . . . . . . . . . . . . . . . . 48
Special Theme: Efficiency in Model Algorithms, Training, and Inference 1 . . . . . . . 49
Resources and Evaluation 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
Session 04 - Nov 12 (Tue) 16:00-17:30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Ethics, Bias, and Fairness 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
Information Retrieval and Text Mining 2 . . . . . . . . . . . . . . . . . . . . . . . . . 52
Multimodality and Language Grounding to Vision, Robotics and Beyond 2 . . . . . . . 53
Linguistic Theories, Cognitive Modeling and Psycholinguistics 2 . . . . . . . . . . . . 54
Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
Session 06 - Nov 13 (Wed) 10:30-12:00 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Multimodality and Language Grounding to Vision, Robotics and Beyond 3 . . . . . . . 57
Ethics, Bias, and Fairness 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Discourse + Phonology + Syntax 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
Question Answering 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Session 09 - Nov 13 (Wed) 16:00-17:30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Resources and Evaluation 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Interpretability and Analysis of Models for NLP 4 . . . . . . . . . . . . . . . . . . . . 64
NLP Applications 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
Information Extraction 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Machine Learning for NLP 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Session 11 - Nov 14 (Thu) 10:30-12:00 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
NLP Applications 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
Computational Social Science and Cultural Analytics 3 . . . . . . . . . . . . . . . . . 70
Sentiment and Semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Language Modeling 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Multilinguality and Language Diversity 2 . . . . . . . . . . . . . . . . . . . . . . . . 73
Session 12 - Nov 14 (Thu) 14:00-15:30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
Interpretability and Analysis of Models for NLP 6 . . . . . . . . . . . . . . . . . . . . 74
Speech Processing and Spoken Language Understanding 2 . . . . . . . . . . . . . . . . 75
Resources and Evaluation 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Generation 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
Machine Learning for NLP 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
Session 03 - Nov 12 (Tue) 14:00-15:30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Demo 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Discourse + Phonology + Syntax 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
Ethics, Bias, and Fairness 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
Interpretability and Analysis of Models for NLP 2 . . . . . . . . . . . . . . . . . . . . 122
Language Modeling 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
Machine Learning for NLP 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Multilinguality and Language Diversity 1 . . . . . . . . . . . . . . . . . . . . . . . . 138
Session 04 - Nov 12 (Tue) 16:00-17:30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
Computational Social Science and Cultural Analytics 2 . . . . . . . . . . . . . . . . . 145
Demo 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
Machine Translation 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Question Answering 1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
Resources and Evaluation 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
Sentiment Analysis, Stylistic Analysis, and Argument Mining . . . . . . . . . . . . . . 169
Special Theme: Efficiency in Model Algorithms, Training, and Inference 2 . . . . . . . 173
Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
Session 06 - Nov 13 (Wed) 10:30-12:00 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Demo 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
Human-centered NLP 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
Interpretability and Analysis of Models for NLP 3 . . . . . . . . . . . . . . . . . . . . 186
Low-resource Methods for NLP 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
NLP Applications 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
Resources and Evaluation 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
Speech Processing and Spoken Language Understanding 1 . . . . . . . . . . . . . . . . 214
Session 09 - Nov 13 (Wed) 16:00-17:30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Demo 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 218
Dialogue and Interactive Systems 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
Generation and Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224
Information Retrieval and Text Mining 3 . . . . . . . . . . . . . . . . . . . . . . . . . 228
Language Modeling 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232
Multimodality and Language Grounding to Vision, Robotics and Beyond 4 . . . . . . . 238
Question Answering 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Semantics: Lexical, Sentence-level Semantics, Textual Inference and Other Areas . . . . 248
TACL + CL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 251
Session 11 - Nov 14 (Thu) 10:30-12:00 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Demo 6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Ethics, Bias, and Fairness 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
Generation 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 258
Interpretability and Analysis of Models for NLP 5 . . . . . . . . . . . . . . . . . . . . 264
Machine Learning for NLP 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
Machine Translation 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
NLP Applications 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Resources and Evaluation 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276
Special Theme: Efficiency in Model Algorithms, Training, and Inference 3 . . . . . . . 285
Session 12 - Nov 14 (Thu) 14:00-15:30 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
Computational Social Science and Cultural Analytics 4 . . . . . . . . . . . . . . . . . 290
Dialogue and Interactive Systems 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
Information Extraction 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304
Low-resource Methods for NLP 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
Multimodality and Language Grounding to Vision, Robotics and Beyond 5 . . . . . . . 310
NLP Applications 5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 316
Virtual Poster Session 1 - (Nov 12): 17:45-18:45 (Evening) . . . . . . . . . . . . . . . . . . 325
Computational Social Science and Cultural Analytics . . . . . . . . . . . . . . . . . . 325
Demo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 325
Dialogue and Interactive Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Ethics, Bias, and Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 326
Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Information Retrieval and Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 327
Interpretability and Analysis of Models for NLP . . . . . . . . . . . . . . . . . . . . . 328
Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
Linguistic Theories, Cognitive Modeling and Psycholinguistics . . . . . . . . . . . . . 329
Low-resource Methods for NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Machine Learning for NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
Multilinguality and Language Diversity . . . . . . . . . . . . . . . . . . . . . . . . . 332
Multimodality and Language Grounding to Vision, Robotics and Beyond . . . . . . . . 332
NLP Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 334
Resources and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
Semantics: Lexical, Sentence-level Semantics, Textual Inference and Other Areas . . . . 336
Sentiment Analysis, Stylistic Analysis, and Argument Mining . . . . . . . . . . . . . . 336
Special Theme: Efficiency in Model Algorithms, Training, and Inference . . . . . . . . 337
Speech Processing and Spoken Language Understanding . . . . . . . . . . . . . . . . . 337
Syntax: Tagging, Chunking and Parsing . . . . . . . . . . . . . . . . . . . . . . . . . 338
Virtual Poster Session 2 - (Nov 13): 7:45-8:45 (Morning) . . . . . . . . . . . . . . . . . . . 338
Computational Social Science and Cultural Analytics . . . . . . . . . . . . . . . . . . 338
Demo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
Dialogue and Interactive Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
Discourse and Pragmatics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 343
Ethics, Bias, and Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
Information Retrieval and Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 355
Interpretability and Analysis of Models for NLP . . . . . . . . . . . . . . . . . . . . . 356
Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 359
Linguistic Theories, Cognitive Modeling and Psycholinguistics . . . . . . . . . . . . . 362
Low-resource Methods for NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
Machine Learning for NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
Multilinguality and Language Diversity . . . . . . . . . . . . . . . . . . . . . . . . . 367
Multimodality and Language Grounding to Vision, Robotics and Beyond . . . . . . . . 368
NLP Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 380
Resources and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
Semantics: Lexical, Sentence-level Semantics, Textual Inference and Other Areas . . . . 386
Sentiment Analysis, Stylistic Analysis, and Argument Mining . . . . . . . . . . . . . . 387
Special Theme: Efficiency in Model Algorithms, Training, and Inference . . . . . . . . 388
Speech Processing and Spoken Language Understanding . . . . . . . . . . . . . . . . . 389
Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
Syntax: Tagging, Chunking and Parsing . . . . . . . . . . . . . . . . . . . . . . . . . 391
Virtual Poster Session 3 - (Nov 14): 13:00-14:00 (Afternoon) . . . . . . . . . . . . . . . . . 391
Computational Social Science and Cultural Analytics . . . . . . . . . . . . . . . . . . 391
Demo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 393
Dialogue and Interactive Systems 3 . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
Ethics, Bias, and Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
Industry . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
Information Retrieval and Text Mining . . . . . . . . . . . . . . . . . . . . . . . . . . 397
Interpretability and Analysis of Models for NLP . . . . . . . . . . . . . . . . . . . . . 397
Language Modeling 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
Linguistic Theories, Cognitive Modeling and Psycholinguistics . . . . . . . . . . . . . 399
Low-resource Methods for NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
Machine Learning for NLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
Machine Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
Multimodality and Language Grounding to Vision, Robotics and Beyond . . . . . . . . 400
NLP Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 401
Question Answering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Resources and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
Semantics: Lexical, Sentence-level Semantics, Textual Inference and Other Areas . . . . 404
Sentiment Analysis, Stylistic Analysis, and Argument Mining . . . . . . . . . . . . . . 404
Special Theme: Efficiency in Model Algorithms, Training, and Inference . . . . . . . . 405
Speech Processing and Spoken Language Understanding 1 . . . . . . . . . . . . . . . . 406
Syntax: Tagging, Chunking and Parsing . . . . . . . . . . . . . . . . . . . . . . . . . 406
15 Workshops 418
Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 418
W1 - BlackboxNLP 2024: Analyzing and interpreting neural networks for NLP . . . . . . . . 420
W2 - Seventh Workshop on Computational Models of Reference, Anaphora and Coreference . 421
W3 - Seventh Workshop on Fact Extraction and VERification (FEVER) . . . . . . . . . . . . 423
W4 - Workshop on the Future of Event Detection . . . . . . . . . . . . . . . . . . . . . . . 427
W5 - The Sixth Workshop on Narrative Understanding . . . . . . . . . . . . . . . . . . . . 428
W6 - Third Workshop on NLP for Positive Impact . . . . . . . . . . . . . . . . . . . . . . . 429
W7 - The Third Workshop on Text Simplification, Accessibility and Readability . . . . . . . 430
W8 - The Eighth Widening NLP Workshop (WiNLP 2024) . . . . . . . . . . . . . . . . . . 432
W9 - The SIGNLL Conference on Computational Natural Language Learning (CoNLL) . . . 433
W10 - Ninth Conference on Machine Translation (WMT24) . . . . . . . . . . . . . . . . . . 436
W11 - Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a
Domain, Application, Group, or Individual (CustomNLP4U) . . . . . . . . . . . . . . . 446
W12 - The 4th International Workshop on Natural Language Processing for Digital Humanities
(NLP4DH) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 448
W13 - GenBench: The second workshop on generalisation (benchmarking) in NLP . . . . . . 452
W14 - Natural Legal Language Processing (NLLP) Workshop 2024 . . . . . . . . . . . . . . 454
W15 - The 4th Workshop on Multilingual Representation Learning . . . . . . . . . . . . . . 457
W16 - NLP4Science: The First Workshop on Natural Language Processing for Science . . . . 458
W17 - The Second Workshop on Social Influence in Conversations (SICon 2024) . . . . . . . 459
W18 - The 11th Workshop on Asian Translation (WAT2024) . . . . . . . . . . . . . . . . . 460
W19 - The First Workshop on Advancing Natural Language Processing for Wikipedia (NLP for
Wikipedia) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 462
Sponsorship 519
1 Conference Information
Message from the General Chair
I am over the moon to welcome you to the 2024 edition of the Conference on Empirical Methods in Natural Language Processing! This year marks the 29th edition of EMNLP, at least according to ACL Anthology proceedings. I counted 14 papers in that first edition; how times have changed and how much our community has grown!
The organization's logistics have become much more complex, with a growing number of submissions and attendees. The effort and time that many volunteers have poured into making this meeting happen are tremendous and worthy of recognition. My first acknowledgments go to the entire organizing committee. My heartfelt thank you goes to each one of them:
• Program Chairs: Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung (Vivian) Chen
• Industry Track Chairs: Franck Dernoncourt, Daniel Preoțiuc-Pietro, and Anastasia Shimorina
• Demonstration Chairs: Delia Irazu Hernandez Farias, Tom Hope, and Manling Li
• Publication Chairs: Milad Alshomary, Danilo Croce, and Gözde Gül Şahin
• Student Volunteer Chairs: Shubhra Kanti (Santu) Karmaker, Nafise Sadat Moosavi, and Emily
Prud’hommeaux
Thank you to the wonderful Jenn Rachford, our amazing business manager, and the SIGDAT board, Isabelle Augenstein, Kai-Wei Chang, Alice Oh, and Juan Pino, for all their support and flexibility.
I hope this conference brings you inspiration, motivation, energy, and stronger connections; enjoy your
time in Miami!
I hope this conference is very productive for all of you!
Thamar Solorio
General Chair
Message from the Program Chairs
Welcome to EMNLP 2024! We are excited to welcome you to one of the most prominent conferences
in the field of Natural Language Processing. This year, EMNLP 2024 is being held in a hybrid format,
offering both virtual and in-person participation in beautiful Miami. Due to a record-breaking number
of submissions, we've expanded the total number of accepted papers to accommodate more cutting-edge
research from around the globe.
Limitations Statement
Continuing the practice from previous conferences, every submitted paper was required to include an
explicitly named Limitations section, discussing the limitations of the work. Importantly, this section does
not count toward the page limit. Although this rule was in place, we chose not to desk-reject papers that did not strictly comply during this submission phase.
Tracks
For a smooth submission process, EMNLP 2024 papers were categorized into 26 tracks, closely mirroring
the structure of previous EMNLP conferences and reflecting the established divisions within the field.
Among these tracks, NLP Applications, Resources and Evaluation, Interpretability and Analysis of Models
for NLP, and Multimodality and Language Grounding to Vision, Robotics, and Beyond were the most
popular, each receiving over 200 submissions.
ACL Rolling Review
The ACL Rolling Review (ARR) is an initiative of the Association for Computational Linguistics that
introduces a two-step process for reviewing and accepting papers: (1) a centralized rolling review and (2)
the opportunity for authors to commit their reviewed papers to a specific publication venue. For EMNLP
2024, we continued the practice from previous years where authors first submit their papers to ARR, then
commit the reviewed papers by a specified deadline. As part of the process, we served as ARR Editors
and worked in collaboration with the ARR Editors-in-Chief for the June cycle, overseeing the review
process: gathering reviews and meta-reviews, and coordinating with reviewers and meta-reviewers. The
committed papers underwent additional review by SACs, who provided recommendations. Final decisions
were made based on SAC recommendations and all information collected from the reviewing phase, taking
into account not just review scores, but also the quality of reviews, author responses, discussions, meta-reviews, and SAC/AC recommendations.
Awards
• Senior Area Chair's Awards are similar to Outstanding Papers, but specific to the corresponding research track.
• Best Theme Papers (= Senior Area Chair’s awards for the special theme track) make significant
new contributions to efficiency in model algorithms, training, and inference.
• Social Impact Papers have the potential for significant positive societal impact.
• Resource Papers announce, describe, and share a fascinating, valuable, or potentially field-changing
new resource.
Based on nominations from SACs and ACs, 114 candidates have been shortlisted for consideration for the
above awards. The final selection is made by the Best Paper Award Committee, and the winners will be
announced and will present their work during the closing ceremony.
Presentation Mode
When deciding between oral and poster presentations, our goal was not to base the choice solely on the
perceived quality or merit of the papers. Instead, we also considered the authors’ preferences for their
presentation mode, as well as our assessment of which format would best suit the content of each individual
paper for optimal engagement and clarity.
We are also delighted to feature keynote talks, including:
• Prof. Anca Dragan (University of California, Berkeley and Google DeepMind) on "My Journey in AI Safety and Alignment".
• Prof. Tom Griffiths (Princeton University) on "Bayes in the Age of Intelligent Machines".
Moreover, we will have an exciting panel to discuss the importance of NLP in the era of LLMs.
Gratitude
We would like to thank the following people for their support and contributions:
• The ARR Editors-in-Chief of the June 2024 cycle (Vincent Ng), Technical Staff (Jonathan K.
Kummerfeld), and the entire team (Mausam, Viviane Moreira, Lilja Øvrelid, Anna Rogers, Jun
Suzuki, Jing Jiang, Michael White);
• The OpenReview team, especially Celeste Martinez for multiple rounds of technical help in setting
up EMNLP 2024 on the OR platform;
• The awards committee chairs, Luke Zettlemoyer, Ivan Titov, and Claire T. Cardie, and 36 awards
committee members;
• The ethics chairs, Luciana Benotti, Snigdha Chaturvedi, and Sunipa Dev;
• The industry track chairs, Franck Dernoncourt, Daniel Preoțiuc-Pietro, and Anastasia Shimorina;
• The demonstration chairs, Delia Irazu Hernandez Farias, Tom Hope, and Manling Li;
• The publication chairs, Milad Alshomary, Danilo Croce, and Gözde Gül Şahin;
• The local organization chairs, Mark Finlayson and Zoey Liu, and their team;
• The student volunteer chairs, Shubhra Kanti (Santu) Karmaker, Nafise Sadat Moosavi, and Emily
Prud’hommeaux;
• The TACL editors-in-chief (Asli Celikyilmaz, Roi Reichart, Dilek Hakkani-Tür) and CL Editor-in-Chief Wei Lu for coordinating TACL and CL presentations with us;
• The NAACL 2024 Program Chairs (Kevin Duh, Helena Gomez, and Steven Bethard) and the ACL
2024 Program Chairs (Lun-Wei Ku, André F. T. Martins, Vivek Srikumar);
• Damira Mrsic and the Underline team;
• All the authors of papers submitted for review and committed to the conference.
We hope that you will enjoy this year's program and hybrid conference!
Yaser Al-Onaizan (Saudi Data and AI Authority, National Center for AI)
Mohit Bansal (University of North Carolina at Chapel Hill)
Yun-Nung (Vivian) Chen (National Taiwan University)
EMNLP 2024 Program Co-Chairs
Message from the Local Chair
It is our great pleasure to welcome you to EMNLP 2024, held in the lovely tropical city of Miami,
Florida, which is North America’s gateway to South America and the Caribbean.
Miami’s unique language characteristics are at the heart of it’s identity. Spanish is the dominant lan-
guage here, but not central American Spanish, as is often found elsewhere in the United States. The
dominant dialect is Cuban Spanish, with extensive local enclaves of Venezuelans, Colombians, and Ar-
gentinians. Indeed, every nationality and cultural group from South American and Caribbean is well
represented here. You will find neighborhoods which speak primarily Haitian Creole, as well as Brazilian
Portuguese. When you add in the large populations of Europeans from France, Spain, and the Balkans, the
local language picture becomes quite rich indeed. Miami even has it’s own dialect of English, which was
identified by Florida International University socio-linguist Phillip Carter: this dialect overlays standard
English with Spanish syntax and Cuban vocabulary and slang, and comes with it’s own distinctive accent.
So Miami has truly a distinctive language mix!
Given Miami’s differences from much of the rest of the United States—in language, population, and
climate—locals joke that we are not really in the United States at all, but rather in Latin America. The joke
continues with the observation that one of Miami’s most convenient attributes, as a supposedly independent
Latin American nation, is its proximity to the United States and the fact that we share an open border and a
currency. While tongue-in-cheek, if you have traveled elsewhere in the United States you will see a grain
of truth in this, and we hope you enjoy and appreciate Miami’s unusual character.
While you are here we hope you take full advantage of the cultural richness Miami has to offer, and
the many fun things to do. This includes our vibrant, burgeoning, world-class food scene (with many
Michelin-starred restaurants), our world-renowned nightlife, and our beautiful beaches. Try the widely available Latin American food, such as pastelitos or croquetas, or our many varieties of specialty coffee, such as cafecitos (Cuban coffee) or cortaditos. Enjoy a night out salsa dancing on Calle Ocho (8th Street) in Little Havana, or catch a Latin music concert at one of the many venues in Downtown, Brickell, or Wynwood. Dance the night away at our dance clubs featuring world-class electronic music DJs, or visit some of our fantastic museums, such as the Perez Art Museum Miami (the PAMM), the Frost Museum of Science (where the social event will be held), or the Vizcaya Museum and Gardens. Indulge your dark
desire for conspicuous consumption of luxury goods in our high-end stores in Brickell City Center or the
Miami Design District.
In the past Miami has been famous for its nightlife and Latin flavors, as a place to visit, relax, and have
fun. This is still true, but Miami has grown tremendously as a city in recent years in many other ways. For
example, we now boast two major research universities: Florida International University, a public research
university, is home to over 50,000 students and has recently been ranked as a top-50 public university in
the United States and a top-100 university overall. The University of Miami, also ranked in the top 100,
boasts a beautiful campus in Coral Gables, nearly 20,000 students, and a major research hospital. Miami
is also home to a fast-growing tech startup and cryptocurrency scene, with a variety of startup accelerators,
incubators, funders, and networking organizations, including Endeavor Miami, The Knight Foundation,
The Lab, Rokk3r Labs, eMerge Americas, the Miami Angels and Wyncode. Finally, Miami continues to
solidify its standing as a major hub of international finance for Latin America, with many banks and other
financial firms opening major branches here or even moving their headquarters to Miami.
Returning to EMNLP, we would like to extend our thanks to Jennifer Rachford and Megs Haddad, both
of the ACL business office, who provided quick, gracious, and ever-informative help in the quite laborious
process of issuing many hundreds of visa invitation letters for those coming from abroad. If you were one of those who needed a visa letter and you see them at the conference, please take a moment to thank them for their hard work.
In closing, we hope that you will thoroughly enjoy your stay in Miami, exploring its rich culture, taking
advantage of the many opportunities for fun, all the while getting the most out of the extensive technical
program of EMNLP.
Welcome to Miami!
Mark Finlayson
Florida International University, Miami, FL
Zoey Liu
University of Florida, Gainesville, FL
Organizing Committee
General Chair
Thamar Solorio, Mohamed bin Zayed University of Artificial Intelligence
and University of Houston
Program Chairs
Yaser Al-Onaizan, Saudi Data and AI Authority, National Center for AI
Mohit Bansal, University of North Carolina at Chapel Hill
Yun-Nung (Vivian) Chen, National Taiwan University
Local Chairs
Mark Finlayson, Florida International University
Zoey Liu, University of Florida
Workshop Chairs
David Vilar, Google Inc.
Xiaodan Zhu, Queen’s University
Marta R. Costa-Jussa, Meta AI
Tutorial Chairs
Jessy Li, The University of Texas at Austin
Fei Liu, Emory University
Ethics Chairs
Luciana Benotti, Facultad de Matemática, Astronomía, Física y Computación
Snigdha Chaturvedi, University of North Carolina at Chapel Hill
Sunipa Dev, Google Research
Demonstration Chairs
Delia Irazu Hernandez Farias, Instituto Nacional de Astrofísica, Óptica y Electrónica
Tom Hope, AI2
Manling Li, Northwestern University
Publication Chairs
Milad Alshomary, Columbia University
Danilo Croce, University of Rome Tor Vergata
Gözde Gül Şahin, Koç University
Handbook Chair
Marco Polignano, University of Bari Aldo Moro
Publicity Chairs
Shruti Rijhwani, Google DeepMind
Elias Stengel-Eskin, University of North Carolina
Sponsorship Chairs
Heba Elfardy, Amazon
Leonardo Neves, Snap Inc.
Website Chairs
Raj Dabre, National Institute of Information and Communications Technology (NICT), Japan
Tiago Torrent, Federal University of Juiz de Fora
Senior Program Committee
Generation
Mirella Lapata, Edinburgh University
Naoaki Okazaki, Tokyo Institute of Technology
Sebastian Gehrmann, Bloomberg LP
Yangfeng Ji, University of Virginia
Human-Centered NLP
David Mimno, Cornell University
Jeff Bigham, Carnegie Mellon University
Marine Carpuat, University of Maryland
Information Extraction
Derry Tanti Wijaya, Boston University
Lifu Huang, University of California
Ndapa Nakashole, University of California
Ruihong Huang, Texas A&M University
Scott Yih, Facebook AI Research
Pawan Goyal, Indian Institute of Technology
Wenhu Chen, University of Waterloo
Language Modeling
Anna Rumshisky, UMass Lowell
Nanyun Peng, University of California, Los Angeles
Swabha Swayamdipta, University of Southern California
Tatsunori Hashimoto, Stanford University
Machine Translation
Alexander Fraser, Technical University of Munich
Lei Li, Carnegie Mellon University
Paco Guzmán, Meta AI
Multimodality and Language Grounding to Vision, Robotics and Beyond
Gabriel Stanovsky, Hebrew University of Jerusalem
Hao Tan, Nottingham University Business School China
Jack Hessel, Samaya.ai
Jesse Thomason, University of Southern California
Roma Patel, Brown University
Zhe Gan, Apple
NLP Applications
Avirup Sil, IBM Research AI
Gholamreza Haffari, Monash University
Gokhan Tur, University of Illinois Urbana-Champaign
Joel Tetreault, University of Rochester
Kevin Small, Amazon
Makoto Miwa, Toyota Technological Institute
Parisa Kordjamshidi, Michigan State University
Roman Klinger, University of Bamberg
Sudha Rao, Microsoft Research
Wei Lu, University of Michigan
Question Answering
Eunsol Choi, New York University
Huan Sun, The Ohio State University
Mrinmaya Sachan, ETH Zürich
Siva Reddy, McGill University
Veronique Hoste, Ghent University
Zhongyu Wei, Fudan University
Summarization
Ramakanth Pasunuru, FAIR at Meta
Xiaojun Wan, Peking University
Conference Organizers & Vendors
Thank you to the entire organizing and program committees for your hard work, dedication, and countless
hours of effort—sometimes at the expense of sleep—to make this conference a success.
Sincerely,
Jennifer Rachford
ACL Director of Events/Business Manager
2 Anti-Harassment Policy
EMNLP 2024 adheres to the ACL Anti-Harassment Policy. Any participant who experiences harassment or hostile behavior may contact any current member of the ACL Professional Conduct Committee or Jennifer Rachford, who is usually available at the registration desk of the conference. Please be assured that if you approach us, your concerns will be kept in strict confidence, and we will consult with you on any actions taken. The open exchange of ideas, the freedom of thought and expression, and respectful scientific debate are central to the aims and goals of an ACL conference. These require a community and an environment that recognizes the inherent worth of every person and group, that fosters dignity, understanding, and mutual respect, and that embraces diversity. For these reasons, ACL is dedicated to providing a harassment-free experience for participants at our events and in our programs. Harassment and hostile behavior are unwelcome at any ACL conference. This includes speech or behavior (including in public presentations and online discourse) that intimidates, creates discomfort, or interferes with a person's participation or opportunity for participation in the conference. We aim for ACL conferences to be an environment where harassment in any form does not happen, including but not limited to: harassment based on race, gender, religion, age, color, national origin, ancestry, disability, sexual orientation, or gender identity. Harassment includes degrading verbal comments, deliberate intimidation, stalking, harassing photography or recording, inappropriate physical contact, and unwelcome sexual attention.
The ACL board members are listed at https://www.aclweb.org/portal/about. The full policy and its implementation are defined at https://aclweb.org/adminwiki/index.php/Anti-Harassment_Policy
Anti-Harassment Policy
The open exchange of ideas, the freedom of thought and expression, and respectful scientific debate are
central to the aims and goals of the ACL. These require a community and an environment that recognizes
the inherent worth of every person and group, that fosters dignity, understanding, and mutual respect, and
embraces diversity. For these reasons, ACL is dedicated to providing a harassment-free experience for all
the members, as well as participants at our events and in our programs.
Harassment and hostile behavior are unwelcome at any ACL conference, associated event, or in ACL-affiliated online discussions. This includes speech or behavior that intimidates, creates discomfort, or interferes with a person's participation or opportunity for participation in a conference or an event. We
aim for ACL-related activities to be an environment where harassment in any form does not happen,
including but not limited to: harassment based on race, gender, religion, age, color, appearance, national
origin, ancestry, disability, sexual orientation, or gender identity. Harassment includes degrading verbal
comments, deliberate intimidation, stalking, harassing photography or recording, inappropriate physical
contact, and unwelcome sexual attention. The policy is not intended to inhibit challenging scientific debate,
but rather to promote it by ensuring that all are welcome to participate in the shared spirit of scientific
inquiry. Vexatious complaints and willful misuse of this procedure will render the complainant subject to
the same sanctions as a violation of the anti-harassment policy.
It is the responsibility of the community as a whole to promote an inclusive and positive environment for
our scholarly activities. In addition, anyone who experiences harassment or hostile behavior may contact
any member of the Professional Conduct Committee. Members of this committee are instructed to keep
any such contact in strict confidence, and those who approach the committee will be consulted before any
actions are taken.
3 Meal Info
Overview
Breakfast
Nov 11 - Nov 16
Breakfast is not provided; the hotel has a market, open 24 hours, that sells breakfast sandwiches, pastries, and snacks.
Breaks∗
Nov 11 - Nov 16
Coffee, tea, pastries, and fruit are provided in the late morning and mid-afternoon.
Lunch
Nov 11 - Nov 16
On your own. Lunch is not provided; the hotel has a market, open 24 hours, that sells breakfast sandwiches, pastries, and snacks. See the detailed agenda for times.
Dinner
Nov 11 - Nov 16
On your own. Dinner is not provided, but there are plenty of cafes and restaurants within walking distance.
Social Events ∗∗
Nov 11, 18:30 - 21:00 - Welcome Reception
Light canapés and a drink ticket will be provided on Monday evening, November 11, 2024, at the Welcome Reception. It will be held on the lower level of the Hyatt Regency Miami. From the lobby, take the escalator down to the Terrace Level.
Nov 13, 19:00 - 22:30 - Social Event Gala Dinner
An international buffet dinner and a drink ticket will be provided on Wednesday evening, November 13, 2024, at the Social Event Gala Dinner (Social Gala), held at the Frost Science Museum.
Our primary goal for EMNLP 2024 is to make it an exceptional annual meeting. We want this conference
to be remembered not just for the outstanding lineup of speakers and the impressive conference venue
but also for the vibrant Social Programs for our Full Conference Attendees. Our aim is to provide an
experience that goes beyond the academic sessions, ensuring that every delegate gets a taste of the best
we have to offer. Allow us to provide you with a detailed glimpse into the exciting activities and events
that we have in store to enhance networking opportunities and foster a strong sense of community among
participants.
4 Welcome Reception
Venue: Hyatt Regency Miami Riverfront - 400 South East Second Ave - Miami, FL 33131
Lower Level Terrace (take the lobby escalator down to the Terrace level)
https://www.hyatt.com/
Kick off EMNLP 2024 with our Welcome Reception, an evening of networking, refreshments, and engaging conversations. This event offers a great opportunity to meet fellow attendees, reconnect with colleagues, and engage with the natural language processing community before the conference begins in full. We look forward to welcoming you to an evening that promises to set the stage for an inspiring and productive conference experience.
Please Note: Attendance at the Welcome Reception is included for Main Conference attendees. If you are not registered for the Main Conference but would like to attend, you may add this event to your registration at the Registration Solution Desk located on the lobby level.
5 Social Event Gala Dinner
The Frost Science Museum is a prominent science museum and planetarium. It features a variety of
interactive exhibits focusing on science, technology, engineering, and mathematics (STEM) topics. Here
are some highlights:
• Exhibits: The museum offers exhibits on diverse topics such as the human body, the physics of
flight, marine biology, and outer space exploration.
• Aquarium: It includes a 500,000-gallon Gulf Stream Aquarium showcasing the diverse marine
life found in Florida’s ecosystems.
• Planetarium: The Frost Planetarium hosts astronomy shows and immersive experiences about
space exploration and the universe.
• Interactive Learning: Many exhibits are hands-on, encouraging visitors to engage directly with
scientific concepts and phenomena.
The Frost Science Museum aims to inspire curiosity and a passion for science through its interactive
exhibits and educational programs. It's a popular destination for both locals and tourists interested in exploring science and technology in an engaging way.
[Frost Science Museum map: "Guide to Your Visit", showing the Knight Learning Center, Marine Conservation WetLab, the three-level aquarium (The Vista, The Dive, The Deep), River of Grass, The Dig, Feathers to the Stars, MeLab, the Frost Planetarium, the Power of Science exhibit, the Science Store, and special photo exhibits (Max Planck Images of Science; Black Wings: American Dreams of Flight).]
[Miami-Dade Metromover map and rider information: the Metromover serves Downtown (Inner Loop), Omni (North Extension), and Brickell (South Extension); it is free to ride, runs 5 a.m. to midnight seven days a week, connects to Metrorail at Government Center and Brickell and to Brightline at MiamiCentral, and bikes are welcome. Miami-Dade Transit publishes route information in Spanish and Haitian Creole and offers language assistance through its Call Center at 311 (305-468-5900), TTY/Florida Relay 711; ADA accommodations: 786-469-5225, DTPW-ADA@miamidade.gov. More information: miamidade.gov/transportation.]
6 Keynotes
Percy Liang
Stanford University, Stanford, California
Tuesday, November 12th – Time: from 09:30 to 10:30 – Room: James Knight Center
Abstract: As capabilities of foundation models skyrocket, openness plummets. In this talk, I argue that open-source models are essential for the long-term goal of building a rigorous foundation for AI. Greater access, from API to open-weight to open-source, enables deeper forms of research. API access allows us to push the frontier of agents, and I will present our recent work on simulation and problem-solving agents. Open weights enable reproducible research on safety, interpretability, and, more generally, model forensics. Open-source unlocks fundamental innovations in architectures, training procedures, and data curation methods. Of course, the key obstacle to building open-source models is the resources required (data, compute, and research/engineering). I will conclude with some promising directions that leverage the community and bring us closer to the vision of open-source foundation models.
Bio: Percy Liang is an Associate Professor of Computer Science at Stanford University (B.S. from MIT, 2004; Ph.D. from UC Berkeley, 2011) and the director of the Center for Research on Foundation Models (CRFM). He is currently focused on making foundation models (in particular, language models) more accessible through open-source and understandable through rigorous benchmarking. In the past, he has worked on many topics centered on machine learning and natural language processing, including robustness, interpretability, human interaction, learning theory, grounding, semantics, and reasoning. He is also a strong proponent of reproducibility through the creation of CodaLab Worksheets. His awards include the Presidential Early Career Award for Scientists and Engineers (2019), IJCAI Computers and Thought Award (2016), an NSF CAREER Award (2016), a Sloan Research Fellowship (2015), a Microsoft Research Faculty Fellowship (2014), and paper awards at ACL, EMNLP, ICML, COLT, ISMIR, CHI, UIST, and RSS.
Anca Dragan
University of California, Berkeley, California
Wednesday, November 13th – Time: from 09:00 to 10:00 – Room: James Knight Center
Abstract: For nearly a decade now, the problem that has been top of mind for me is how we might enable AI systems to robustly optimize for what people want, and to avoid causing harm: from robots and self-driving cars, to assistive devices and deep brain stimulation, to theory and toy models, to large language models and now Gemini. In this talk, I'll take the opportunity to share a bit about my journey in this space, what lessons I've learned, and how we're approaching the safety and alignment of frontier models at Google DeepMind.
Bio: Anca Dragan is an Associate Professor in the EECS Department at UC Berkeley, currently on leave
to head AI Safety and Alignment at Google DeepMind. The goal of her research at UC Berkeley has
been to enable AI agents (from robots to cars to LLMs to recommender systems) to work with, around,
and in support of people. Anca runs the InterACT Lab, where they focus on algorithms for human-AI
and human-robot interaction. One of the core problems the Lab has worked on since its inception is AI
alignment: getting AI agents to do what people actually want. This has meant learning reward functions interactively, from diverse forms of human feedback, across different modalities, while maintaining uncertainty. They have also contributed to algorithms for human-AI collaboration and coordination, like agents fluently working together with human-driven avatars in games, assistance and adaptation in brain-machine interfaces, and autonomous cars sharing the road with human drivers.
Tom Griffiths
Princeton University, Princeton, New Jersey
Thursday, November 14th – Time: from 09:00 to 10:00 – Room: James Knight Center
Abstract: Recent rapid progress in the creation of artificial intelligence (AI) systems has been driven in large part by innovations in architectures and algorithms for developing large-scale artificial neural networks. As a consequence, it's natural to ask what role abstract principles of intelligence, such as Bayes' rule, might play in developing intelligent machines. In this talk, I will argue that there is a new way in which Bayes can be used in the context of AI, more akin to how it is used in cognitive science: providing an abstract description of how agents should solve certain problems and hence a tool for understanding their behavior. This new role is motivated in large part by the fact that we have succeeded in creating intelligent systems that we do not fully understand, making the problem for the machine learning researcher more closely parallel that of the cognitive scientist. I will talk about how this perspective can help us think about making machines with better-informed priors about the world and give us insight into their behavior by directly creating cognitive models of neural networks.
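For reference, Bayes' rule, the abstract principle invoked here, prescribes how an ideal agent should update its beliefs over hypotheses $h$ after observing data $d$: the posterior combines the prior with the likelihood of the observed data,

$$
p(h \mid d) = \frac{p(d \mid h)\, p(h)}{\sum_{h'} p(d \mid h')\, p(h')}.
$$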
Bio: Tom Griffiths is the Henry R. Luce Professor of Information Technology, Consciousness and Culture in the Departments of Psychology and Computer Science at Princeton University, where he is also the Director of the Princeton Laboratory for Artificial Intelligence. His research explores connections between human and machine learning, using ideas from statistics and artificial intelligence to understand how people solve the challenging computational problems they encounter in everyday life. Tom completed his PhD in Psychology at Stanford University in 2005, and taught at Brown University and the University of California, Berkeley before moving to Princeton. He has received awards for his research from organizations ranging from the American Psychological Association to the National Academy of Sciences and is a co-author of the book Algorithms to Live By, introducing ideas from computer science and cognitive science to a general audience.
7 Panel
This diverse group of panelists will provide a comprehensive view of the latest trends and challenges in NLP and how they interact with the era of LLMs. Panelist details can be found on the following platforms: Underline, Whova, and the EMNLP 2024 website.
jects/) NLP and deep learning, and develop projects to make deep learning systems safer, more clear, and
easier to use. I work part-time at Hugging Face (http://huggingface.co/) and like to release various software
projects to support NLP and DL research.
Birds-of-a-Feather and Affinity Group Meetup
Tuesday, Nov 12
11:00 - 12:30 LLMs for Embodied Agents Organizer: Manling Li
Room: Foster (Convention Center 2nd level)
11:00 - 12:30 Law Law Land: Legal Language Meets Large Language Models Organizer: Santosh
Tokala - Room: Johnson (Convention Center 2nd level)
12:30 - 14:00 Large Multimodal Models for Biomedical Research Organizer: Tianyu Liu
Room: Miami Lecture Hall (Convention Center Level 2)
14:00 - 15:30 LLM Agents for Acoustics and Continuous Signals Organizer: Huck Yang
Room: Foster (Convention Center Level 2)
16:00 - 17:30 Queer in AI More details at: https://www.queerinai.com/
16:00 - 17:30 Southeast Asian NLP Organizer: Genta Winata
Room: Johnson (Convention Center Level 2)
16:00 - 17:30 NLP for Structured Data Organizer: Vivek Gupta
Room: Miami Lecture Hall (Convention Center Level 2)
Wednesday, Nov 13
10:30 - 12:00 Fostering Native and Cultural Inclusivity in LLMs Organizer: Firoj Alam
Room: Foster
10:30 - 12:00 NLP Tools for Community-Owned Religious Texts in Low-Resourced Languages Organizer: Inam Ullah - Room: Johnson
Thursday, Nov 14
10:30 - 12:00 2112: The AI Odyssey Organizer: Prasson Bajpai
Room: Foster
10:30 - 12:00 Embeddings, Reranker, Small LM for Better Search Organizer: Han Xiao
Room: Miami Lecture Hall
More details can be found on the following Platforms: Underline, Whova, EMNLP 24 Website
Main Conference Overview
• 18:30 - 21:00 Welcome Reception (Terrace Level - Take lobby escalator down to Terrace Level)
– Generation
– NLP Applications
– Information Retrieval
– Linguistic Theories
– All Demos located in Riverfront Hall
– Multimodality
– Industry Track
Oral Presentations:
– LLMs for Embodied Agents (Foster - Convention Center 2nd level) - Organizer: Manling
Li
– Law Law Land: Legal Language Meets Large Language Models (Johnson - Convention
Center 2nd level) - Organizer: Santosh Tokala
– Large Multimodal Models for Biomedical Research (Miami Lecture Hall - Convention Cen-
ter Level 2) - Organizer: Tianyu Liu
– Language Modeling
– Ethics
– Multilinguality
– Discourse + Phonology + Syntax
– All Demos located in Riverfront Hall
– Interpretability
– Machine Learning for NLP
Oral Presentations:
– Special Theme: Efficiency in Model Algorithms, Training, and Inference (Monroe Lower
Terrace Level)
– Resources and Evaluation 1 (Tuttle Lower Terrace Level)
– LLM Agents for Acoustics and Continuous Signals (Foster - Convention Center Level 2) -
Organizer: Huck Yang
Oral Presentations:
– Ethics, Bias, and Fairness 1 (Ashe Auditorium 2nd Floor Convention Level)
– Information Retrieval and Text Mining (Brickell Lower Terrace Level)
– Multimodality and Language Grounding to Vision, Robotics and Beyond 1 (Flagler Lower
Terrace Level)
– Linguistic Theories, Cognitive Modeling and Psycholinguistics (Monroe Lower Terrace
Level)
– Industry Track 1 (Tuttle Lower Terrace Level)
– Southeast Asian NLP (Johnson - Convention Center Level 2) - Organizer: Genta Winata
– NLP for Structured Data (Miami Lecture Hall - Convention Center Level 2) - Organizer:
Vivek Gupta
– Human-centered NLP
– Resources and Evaluation
– Speech Processing
– NLP Applications
– Low-resource
– Interpretability
Oral Presentations:
– Multimodality and Language Grounding to Vision, Robotics and Beyond (Ashe Auditorium
2nd Floor Convention Level)
– Ethics, Bias, and Fairness (Brickell Lower Terrace Level)
– Discourse, Phonology, and Syntax (Flagler Lower Terrace Level)
– Question Answering (Monroe Lower Terrace Level)
– Industry Track 2 (Tuttle Lower Terrace Level)
– Fostering Native and Cultural Inclusivity in LLMs (Foster - Convention Center Level 2) -
Organizer: Firoj Alam
– NLP Tools for Community-Owned Religious Texts in Low-Resourced Languages (Johnson
- Convention Center Level 2) - Organizer: Inam Ullah
• 13:30 - 14:15 Session 7: Business Meeting - James Knight Center (All Attendees Welcome)
– Dialogue
– Multimodality
– Semantics
– Information Retrieval
– Industry Track
– Language Modeling
– Question Answering
– TACL + CL
Oral Presentations:
– Generation
– Machine Learning for NLP
– Special Theme: Efficiency
– Resources and Evaluation
– Interpretability
– Ethics
Oral Presentations:
– NLP Applications 2 (Ashe Auditorium 2nd Floor Convention Level)
– Computational Social Science and Cultural Analytics 2 (Brickell Lower Terrace Level)
– Sentiment and Semantics (Flagler Lower Terrace Level)
– Language Modeling 2 (Monroe Lower Terrace Level)
– Multilinguality and Language Diversity (Tuttle Lower Terrace Level)
Birds-of-a-Feather and Affinity Group Meetup
– 2112: The AI Odyssey (Foster - Convention Center Level 2) - Organizer: Prasson Bajpai
– Embeddings, Reranker, Small LM for Better Search (Miami Lecture Hall - Convention
Center Level 2) - Organizer: Han Xiao
• 12:00 - 13:00 Lunch Break
• 13:00 - 14:00 Virtual Poster Session 3
• 14:00 - 15:30 Session 12: Orals/Posters/Demos Session G
Poster Presentation Tracks & Demos:
Riverfront Hall Lobby Level
– Dialogue
– NLP Applications
– Information Extraction
– Industry Track
Jasmine Lower Terrace Level
– Computational Social Science
– Multimodality
Oral Presentations:
– Interpretability and Analysis of Models for NLP 3 (Ashe Auditorium 2nd Floor Convention
Level)
– Speech Processing and Spoken Language Understanding (Brickell Lower Terrace Level)
– Resources and Evaluation 3 (Flagler Lower Terrace Level)
– Generation (Monroe Lower Terrace Level)
– Machine Learning for NLP 2 (Tuttle Lower Terrace Level)
• 15:30 - 16:00 Break
Riverfront Hall (Lobby Level of James Knight Convention Center)
• 16:00 - 17:00 Session 13: Best Paper Awards
James Knight Center (2nd Floor Convention Level)
• 17:00 - 17:30 Session 13: Closing Session James Knight Center (2nd Floor Convention Level)
10 Oral Presentations
Session 02 - Nov 12 (Tue) 11:00-12:30
Language Modeling 1
Nov 12 (Tue) 11:00-12:30 - Room: Ashe Auditorium
prompts, they are good at providing feedback about LLM outputs; we therefore introduce a new LLM-driven discrete prompt optimization
framework PROMST that incorporates human-designed feedback rules to automatically offer direct suggestions for improvement. We also
use an extra learned heuristic model that predicts prompt performance to efficiently sample from prompt candidates. This approach signifi-
cantly outperforms both human-engineered prompts and several other prompt optimization methods across 11 representative multi-step tasks
(an average improvement of 10.6%-29.3% over the current best methods across five LLMs). We believe our work can serve as a benchmark
for automatic prompt optimization for LLM-driven multi-step tasks.
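To make the candidate-sampling idea concrete, here is a minimal sketch in the spirit of PROMST, where a cheap learned scorer filters prompt candidates before expensive task evaluation. All functions below are hypothetical stand-ins for the LLM calls and the learned score model described in the abstract, not the authors' code.
```python
# Heuristic-guided prompt search (sketch; scorers are random stand-ins).
import random

def propose_candidates(prompt: str, feedback: str, k: int = 4) -> list[str]:
    # Stand-in for an LLM call that rewrites `prompt` given `feedback`.
    return [f"{prompt} [edit {i} addressing: {feedback}]" for i in range(k)]

def heuristic_score(prompt: str) -> float:
    # Stand-in for a learned model predicting a prompt's task performance.
    return random.random()

def evaluate(prompt: str) -> float:
    # Stand-in for actually running the multi-step task with this prompt.
    return random.random()

def optimize(prompt: str, feedback: str, rounds: int = 3, budget: int = 2) -> str:
    best, best_score = prompt, evaluate(prompt)
    for _ in range(rounds):
        cands = propose_candidates(best, feedback)
        # Use the cheap heuristic to decide which candidates to evaluate.
        cands.sort(key=heuristic_score, reverse=True)
        for cand in cands[:budget]:
            score = evaluate(cand)
            if score > best_score:
                best, best_score = cand, score
    return best

print(optimize("Solve the task step by step.", "be explicit about tool arguments"))
```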
12:00 - 12:15 - Ashe Auditorium
Retrieval-Pretrained Transformer: Long-range Language Modeling with Self-retrieval
Ohad Rubin, Jonathan Berant
Retrieval-augmented language models (LMs) have received much attention recently. However, typically the retriever is not trained jointly as
a native component of the LM, but added post-hoc to an already-pretrained LM, which limits the ability of the LM and the retriever to adapt
to one another. In this work, we propose the Retrieval-Pretrained Transformer (RPT), an architecture and training procedure for jointly training a retrieval-augmented LM from
scratch, and apply it to the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query
scratch and apply it to the task of modeling long texts. Given a recently generated text chunk in a long document, the LM computes query
representations, which are then used to retrieve earlier chunks in the document, located potentially tens of thousands of tokens before. In-
formation from retrieved chunks is fused into the LM representations to predict the next target chunk. We train the retriever component
with a semantic objective, where the goal is to retrieve chunks that increase the probability of the next chunk, according to a reference LM.
We evaluate RPT on four long-range language modeling tasks, spanning books, code, and mathematical writing, and demonstrate that RPT
improves retrieval quality and subsequently perplexity across the board compared to strong baselines.
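The semantic retrieval objective can be stated compactly: a candidate chunk is scored by how much conditioning on it raises a reference LM's probability of the next target chunk. A hedged sketch follows; the log-probability function is a toy stand-in, whereas the real training scores with a neural reference LM over documents tens of thousands of tokens long.
```python
# Sketch of an RPT-style semantic retrieval target.
def ref_lm_logprob(target: str, context: str) -> float:
    # Toy stand-in: more shared vocabulary -> higher "log-probability".
    overlap = len(set(target.split()) & set(context.split()))
    return float(overlap)

def retrieval_targets(candidates: list[str], query_chunk: str, next_chunk: str):
    """Score each earlier chunk by the log-prob gain it gives the next chunk."""
    base = ref_lm_logprob(next_chunk, query_chunk)
    return {
        cand: ref_lm_logprob(next_chunk, cand + " " + query_chunk) - base
        for cand in candidates
    }

print(retrieval_targets(["the cat sat", "stock prices rose"],
                        "the cat was tired", "so the cat slept"))
```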
12:15 - 12:30 - Ashe Auditorium
When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs
Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, Rui Zhang
Self-correction is an approach to improving responses from large language models (LLMs) by refining the responses using LLMs during
inference. Prior work has proposed various self-correction frameworks using different sources of feedback, including self-evaluation and
external feedback. However, there is still no consensus on the question of when LLMs can correct their own mistakes, as recent studies also
report negative results. In this work, we critically survey broad papers and discuss the conditions required for successful self-correction.
We first find that prior studies often do not define their research questions in detail and involve impractical frameworks or unfair evaluations
that over-evaluate self-correction. To tackle these issues, we categorize research questions in self-correction research and provide a checklist
for designing appropriate experiments. Our critical survey based on the newly categorized research questions shows that (1) no prior work
demonstrates successful self-correction with feedback from prompted LLMs, except for studies in tasks that are exceptionally suited for
self-correction, (2) self-correction works well in tasks that can use reliable external feedback, and (3) large-scale fine-tuning enables self-
correction.
lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and
we release Rusty-DAWG to facilitate further pretraining data research.
Zihang Liu, Yuanzhe Hu, Tianyu Pang, Yefan Zhou, Pu Ren, Yaoqing Yang
Recent advances in foundation models have emphasized the need to align pre-trained models with specialized domains using small, curated
datasets. Studies on these foundation models underscore the importance of low-data training and fine-tuning. This topic, well-known in natu-
ral language processing (NLP), has also gained increasing attention in the emerging field of scientific machine learning (SciML). To address
the limitations of low-data training and fine-tuning, we draw inspiration from Heavy-Tailed Self-Regularization (HT-SR) theory, analyzing
the shape of empirical spectral densities (ESDs) and revealing an imbalance in training quality across different model layers. To mitigate this
issue, we adapt a recently proposed layer-wise learning rate scheduler, TempBalance, which effectively balances training quality across layers
and enhances low-data training and fine-tuning for both NLP and SciML tasks. Notably, TempBalance demonstrates increasing performance
gains as the amount of available tuning data decreases. Comparative analyses further highlight the effectiveness of TempBalance and its
adaptability as an add-on method for improving model performance.
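HT-SR-style diagnostics are easy to prototype: fit a tail exponent to each layer's weight eigenspectrum and rescale that layer's learning rate accordingly. The sketch below is an illustrative assumption (scaling the rate up for layers whose fitted exponent is above average), not the paper's exact scheduler.
```python
# Illustrative layer-wise LR assignment from ESD shape (not TempBalance itself).
import numpy as np

def hill_alpha(weight: np.ndarray, k: int = 10) -> float:
    """Hill estimate of the power-law tail exponent of W's eigenspectrum."""
    eigs = np.sort(np.linalg.eigvalsh(weight @ weight.T))[-k:]  # top-k eigenvalues
    return 1.0 + k / np.sum(np.log(eigs / eigs[0]))

def layer_lrs(weights: dict[str, np.ndarray], base_lr: float = 1e-4):
    alphas = {name: hill_alpha(w) for name, w in weights.items()}
    mean = np.mean(list(alphas.values()))
    # Assumption: larger exponent ~ less heavy-tailed ~ less well trained,
    # so such layers receive a proportionally larger learning rate.
    return {name: base_lr * (a / mean) for name, a in alphas.items()}

rng = np.random.default_rng(0)
layers = {f"layer{i}": rng.normal(size=(64, 64)) for i in range(4)}
print(layer_lrs(layers))
```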
11:45 - 12:00 - Flagler
SciPrompt: Knowledge-Augmented Prompting for Fine-Grained Categorization of Scientific Topics
Zhiwen You, Kanyao Han, Haotian Zhu, Bertram Ludaescher, Jana Diesner
Prompt-based fine-tuning has become an essential method for eliciting information encoded in pre-trained language models for a variety of
tasks, including text classification. For multi-class classification tasks, prompt-based fine-tuning under low-resource scenarios has resulted in
performance levels comparable to those of fully fine-tuning methods. Previous studies have used crafted prompt templates and verbalizers,
mapping from the label terms space to the class space, to solve the classification problem as a masked language modeling task. However,
cross-domain and fine-grained prompt-based fine-tuning with an automatically enriched verbalizer remains unexplored, mainly due to the
difficulty and costs of manually selecting domain label terms for the verbalizer, which requires humans with domain expertise. To address
this challenge, we introduce SciPrompt, a framework designed to automatically retrieve scientific topic-related terms for low-resource text
classification tasks. To this end, we select semantically correlated and domain-specific label terms within the context of scientific literature for
verbalizer augmentation. Furthermore, we propose a new verbalization strategy that uses correlation scores as additional weights to enhance
the prediction performance of the language model during model tuning. Our method outperforms state-of-the-art, prompt-based fine-tuning
methods on scientific text classification tasks under few and zero-shot settings, especially in classifying fine-grained and emerging scientific
topics.
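The weighted-verbalizer idea fits in a few lines: each class aggregates masked-LM probabilities of its label terms, weighted by the retrieval correlation scores. The probabilities and terms below are hypothetical stand-ins for the masked-LM predictions and automatically enriched verbalizer described above.
```python
# Correlation-weighted verbalizer scoring (toy values, illustrative only).
mask_probs = {"astronomy": 0.02, "stellar": 0.03, "galaxy": 0.04,
              "genome": 0.01, "protein": 0.02}

# class -> [(label term, correlation score from retrieval)]
verbalizer = {
    "astrophysics": [("astronomy", 0.9), ("stellar", 0.8), ("galaxy", 0.7)],
    "bioinformatics": [("genome", 0.9), ("protein", 0.6)],
}

def class_scores(mask_probs, verbalizer):
    scores = {}
    for cls, terms in verbalizer.items():
        weighted = sum(w * mask_probs.get(t, 0.0) for t, w in terms)
        scores[cls] = weighted / sum(w for _, w in terms)  # weighted average
    return scores

scores = class_scores(mask_probs, verbalizer)
print(max(scores, key=scores.get))  # predicted class
```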
Human-centered NLP 1
Nov 12 (Tue) 11:00-12:30 - Room: Monroe
we collect a dataset of negotiation transcripts between MBA students. These transcripts come from trained negotiators and emulate realistic
bargaining scenarios. We use the dataset, along with expert consultations, to design an annotation scheme for detecting negotiation mistakes.
ACE employs this scheme to identify mistakes and provide targeted feedback to users. To test the effectiveness of ACE-generated feedback,
we conducted a user experiment with two consecutive trials of negotiation and found that it improves negotiation performances significantly
compared to a system that doesn’t provide feedback and one which uses an alternative method of providing feedback.
11:30 - 11:45 - Monroe
Do LLMs Plan Like Human Writers? Comparing Journalist Coverage of Press Releases with LLMs
Alexander Spangher, Nanyun Peng, Sebastian Gehrmann, Mark Dredze
Journalists engage in multiple steps in the news writing process that depend on human creativity, like exploring different “angles” (i.e. the
specific perspectives a reporter takes). These can potentially be aided by large language models (LLMs). By affecting planning decisions,
such interventions can have an outsize impact on creative output. We advocate a careful approach to evaluating these interventions to ensure
alignment with human values.In a case study of journalistic coverage of press releases, we assemble a large dataset of 250k press releases
and 650k articles covering them. We develop methods to identify news articles that _challenge and contextualize_ press releases. Finally,
we evaluate suggestions made by LLMs for these articles and compare these with decisions made by human journalists. Our findings are
three-fold: (1) Human-written news articles that challenge and contextualize press releases more take more creative angles and use more
informational sources. (2) LLMs align better with humans when recommending angles, compared with informational sources. (3) Both the
angles and sources LLMs suggest are significantly less creative than humans.
11:45 - 12:00 - Monroe
Roleplay-doh: Enabling Domain-Experts to Create LLM-simulated Patients via Eliciting and Adhering to Principles
Ryan Louie, Ananjan Nandi, William Fang, Cheng Chang, Emma Brunskill, Diyi Yang
Recent works leverage LLMs to roleplay realistic social scenarios, aiding novices in practicing their social skills. However, simulating sensi-
tive interactions, such as in the domain of mental health, is challenging. Privacy concerns restrict data access, and collecting expert feedback,
although vital, is laborious. To address this, we develop Roleplay-doh, a novel human-LLM collaboration pipeline that elicits qualitative
feedback from a domain-expert, which is transformed into a set of principles, or natural language rules, that govern an LLM-prompted role-
play. We apply this pipeline to enable senior mental health supporters to create customized AI patients as simulated practice partners for
novice counselors. After uncovering issues with basic GPT-4 simulations not adhering to expert-defined principles, we also introduce a novel
principle-adherence prompting pipeline which shows a 30% improvement in response quality and principle following for the downstream
task. Through a user study with 25 counseling experts, we demonstrate that the pipeline makes it easy and effective to create AI patients that
more faithfully resemble real patients, as judged by both creators and third-party counselors. We provide access to the code and data on our
project website: https://roleplay-doh.github.io/.
12:00 - 12:15 - Monroe
Toxicity Detection is NOT all you Need: Measuring the Gaps to Supporting Volunteer Content Moderators through a User-Centric
Method
Yang Trista Cao, Lovely-Frances Domingo, Sarah Gilbert, Michelle L. Mazurek, Katherine Shilton, Hal Daumé III
Extensive efforts in automated approaches for content moderation have been focused on developing models to identify toxic, offensive, and
hateful content with the aim of lightening the load for moderators. Yet, it remains uncertain whether improvements on those tasks have truly
addressed moderators' needs in accomplishing their work. In this paper, we surface gaps between past research efforts that have aimed to pro-
vide automation for aspects of content moderation and the needs of volunteer content moderators, regarding identifying violations of various
moderation rules. To do so, we conduct a model review on Hugging Face to reveal the availability of models to cover various moderation
rules and guidelines from three exemplar forums. We further put state-of-the-art LLMs to the test, evaluating how well these models perform
in flagging violations of platform rules from one particular forum. Finally, we conduct a user survey study with volunteer moderators to gain
insight into their perspectives on useful moderation models. Overall, we observe a non-trivial gap: models are missing for many rules, and LLMs
exhibit moderate to low performance on a significant portion of the rules. Moderators' reports provide guidance for future work on developing
moderation assistant models.
12:15 - 12:30 - Monroe
Successfully Guiding Humans with Imperfect Instructions by Highlighting Potential Errors and Suggesting Corrections
Min-Hsuan Yeh, Ruyuan Wan, Ting-Hao Kenneth Huang
Language models will inevitably err in situations with which they are unfamiliar. However, by effectively communicating uncertainties, they
can still guide humans toward making sound decisions in those contexts. We demonstrate this idea by developing HEAR, a system that
can successfully guide humans in simulated residential environments despite generating potentially inaccurate instructions. Diverging from
systems that provide users with only the instructions they generate, HEAR warns users of potential errors in its instructions and suggests cor-
rections. This rich uncertainty information effectively prevents misguidance and reduces the search space for users. Evaluation with 80 users
shows that HEAR achieves a 13% increase in success rate and a 29% reduction in final location error distance compared to only presenting
instructions to users. Interestingly, we find that by offering users possibilities to explore, HEAR motivates them to make more attempts at the
task, ultimately leading to a higher success rate. To the best of our knowledge, this work is the first to show the practical benefits of uncertainty
communication in a long-horizon sequential decision-making problem.
Machine Translation 1
Nov 12 (Tue) 11:00-12:30 - Room: Tuttle
transfer.
11:15 - 11:30 - Tuttle
PsFuture: A Pseudo-Future-based Zero-Shot Adaptive Policy for Simultaneous Machine Translation
Libo Zhao, Jing Li, Ziqian Zeng
Simultaneous Machine Translation (SiMT) requires target tokens to be generated in real-time as streaming source tokens are consumed.
Traditional approaches to SiMT typically require sophisticated architectures and extensive parameter configurations for training adaptive
read/write policies, which in turn demand considerable computational power and memory. We propose PsFuture, the first zero-shot adaptive
read/write policy for SiMT, enabling the translation model to independently determine read/write actions without the necessity for additional
training. Furthermore, we introduce a novel training strategy, Prefix-to-Full (P2F), specifically tailored to adjust offline translation models
for SiMT applications, exploiting the advantages of the bidirectional attention mechanism inherent in offline models. Experiments across
multiple benchmarks demonstrate that our zero-shot policy attains performance on par with strong baselines and the P2F method can further
enhance performance, achieving an outstanding trade-off between translation quality and latency.
11:30 - 11:45 - Tuttle
Simul-MuST-C: Simultaneous Multilingual Speech Translation Corpus Using Large Language Model
Mana Makinae, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Simultaneous Speech Translation (SiST) begins translating before the entire source input is received, making it crucial to balance quality and
latency. In real interpreting situations, interpreters manage this simultaneity by breaking sentences into smaller segments and translating them
while maintaining the source order as much as possible. SiST could benefit from this approach to balance quality and latency. However,
current corpora used for simultaneous tasks often involve significant word reordering in translation, which is not ideal given that interpreters
faithfully follow source syntax as much as possible. Inspired by conference interpreting by humans utilizing the salami technique, we intro-
duce the Simul-MuST-C, a dataset created by leveraging the Large Language Model (LLM), specifically GPT-4o, which aligns the target text
as closely as possible to the source text by using minimal chunks that contain enough information to be interpreted. Experiments on three
language pairs show that the effectiveness of segmented-base monotonicity in training data varies with the grammatical distance between the
source and the target, with grammatically distant language pairs benefiting the most in achieving quality while minimizing latency.
Training task-oriented dialogue systems typically requires turn-level annotations for interacting with their APIs: e.g. a dialogue state and the
system actions taken at each step. These annotations can be costly to produce, error-prone, and require both domain and annotation exper-
tise. With advances in LLMs, we hypothesize that unlabeled data and a schema definition are sufficient for building a working task-oriented
dialogue system, completely unsupervised. We consider a novel unsupervised setting with only (1) a well-defined API schema and (2) a set of un-
labeled dialogues between a user and agent. We propose an innovative approach using expectation-maximization (EM) that infers turn-level
annotations as latent variables using a noisy channel model to build an end-to-end dialogue agent. Evaluating our approach on the MultiWOZ
benchmark, our method more than doubles the dialogue success rate of a strong GPT-3.5 baseline.
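The EM recipe can be sketched generically: treat turn annotations as latent variables, score candidates with a noisy-channel model, and re-estimate the channel from the inferred labels. Everything below is a crude stand-in (the paper's channel model is LLM-based, and the state set and seed affinities here are hypothetical, standing in for the signal a real system gets from the schema definition).
```python
# Toy hard-EM loop for inferring latent turn annotations (illustrative only).
STATES = ["request_hotel", "request_taxi", "inform_time"]

def channel_logprob(utterance, state, theta):
    # Stand-in noisy channel p(utterance | state) via word/state affinities.
    return sum(theta.get((w, state), -1.0) for w in utterance.split())

def e_step(utterances, theta):
    # Hard E-step: pick the most probable latent annotation per turn.
    return [max(STATES, key=lambda s: channel_logprob(u, s, theta))
            for u in utterances]

def m_step(utterances, labels):
    # M-step: re-estimate word/state affinities from the inferred labels.
    theta = {}
    for u, s in zip(utterances, labels):
        for w in u.split():
            theta[(w, s)] = theta.get((w, s), -1.0) + 0.5
    return theta

utterances = ["book a hotel tonight", "need a taxi at five", "what time is it"]
# Weak seed affinities break ties; real systems derive this from the schema.
theta = {("hotel", "request_hotel"): 0.0, ("taxi", "request_taxi"): 0.0,
         ("time", "inform_time"): 0.0}
for _ in range(3):
    labels = e_step(utterances, theta)
    theta = m_step(utterances, labels)
print(list(zip(utterances, labels)))
```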
propose a novel PEFT method, which conducts row and column-wise sparse low-rank adaptation (RoseLoRA), to address this challenge.
RoseLoRA identifies and updates only the most important parameters for a specific task, maintaining efficiency while preserving other model
knowledge. By adding a sparsity constraint on the product of low-rank matrices and converting it to row and column-wise sparsity, we ensure
efficient and precise model updates. Our theoretical analysis guarantees the lower bound of the sparsity with respect to the matrix product.
Extensive experiments on five benchmarks across twenty datasets demonstrate that RoseLoRA outperforms baselines in both general fine-
tuning and knowledge editing tasks.
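The row/column-wise sparsity can be illustrated directly on the low-rank factors: keeping only a few rows of B and columns of A means the product BA touches only a sparse block of entries of W. The numpy sketch below uses plain magnitude as the importance score, which is an assumption for illustration; the paper uses sensitivity-based importance.
```python
# Row/column-sparse low-rank update, loosely in the spirit of RoseLoRA.
import numpy as np

def sparse_lowrank_update(W, B, A, keep_rows=4, keep_cols=4):
    """Apply W + (masked B) @ (masked A), keeping top rows of B / cols of A."""
    row_imp = np.abs(B).sum(axis=1)   # importance of each row of B (toy: magnitude)
    col_imp = np.abs(A).sum(axis=0)   # importance of each column of A
    B_mask = np.zeros_like(B)
    A_mask = np.zeros_like(A)
    B_mask[np.argsort(row_imp)[-keep_rows:], :] = 1.0
    A_mask[:, np.argsort(col_imp)[-keep_cols:]] = 1.0
    return W + (B * B_mask) @ (A * A_mask)

rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.normal(size=(d, d))
B = rng.normal(size=(d, r)) * 0.01
A = rng.normal(size=(r, d)) * 0.01
W_new = sparse_lowrank_update(W, B, A)
print("entries changed:", int((W_new != W).sum()), "of", d * d)  # 4x4 block
```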
15:15 - 15:30 - Monroe
RevMUX: Data Multiplexing with Reversible Adapters for Efficient LLM Batch Inference
Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
Large language models (LLMs) have brought a great breakthrough to the natural language processing (NLP) community, while creating the
challenge of handling concurrent customer queries due to their high throughput demands. Data multiplexing addresses this by merging
multiple inputs into a single composite input, allowing more efficient inference through a shared forward pass. However, as distinguishing
individuals from a composite input is challenging, conventional methods typically require training the entire backbone, yet still suffer from
performance degradation. In this paper, we introduce RevMUX, a parameter-efficient data multiplexing framework that incorporates a re-
versible design in the multiplexer, which can be reused by the demultiplexer to perform reverse operations and restore individual samples
for classification. Extensive experiments on four datasets and three types of LLM backbones demonstrate the effectiveness of RevMUX for
enhancing LLM inference efficiency while retaining a satisfactory classification performance.
lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of
long-context understanding. Extensive experiments indicate that existing long-context language models still exhibit considerable potential for
enhancement. Retrieval augmented generation (RAG) achieves poor performance, demonstrating that Loong can reliably assess the model’s
long-context modeling capabilities.
15:00 - 15:15 - Tuttle
MedReadMe: A Systematic Study for Fine-grained Sentence Readability in Medical Domain
Chao Jiang, Wei Xu
Medical texts are notoriously challenging to read. Properly measuring their readability is the first step towards making them more accessi-
ble. Here, we present the first systematic study on fine-grained readability measurements in the medical domain, at both sentence-level and
span-level. We first introduce a new dataset MedReadMe, which consists of manually annotated readability ratings and fine-grained complex
span annotation for 4,520 sentences, featuring two novel "Google-Easy" and "Google-Hard" categories. It supports our quantitative analysis,
which covers 650 linguistic features and additional complex span features, to answer why medical sentences are so hard. Enabled by our
high-quality annotation, we benchmark several state-of-the-art sentence-level readability metrics, including unsupervised, supervised, and
prompting-based methods using recently developed large language models (LLMs). Informed by our fine-grained complex span annotation,
we find that adding a single feature, capturing the number of jargon spans, into existing readability formulas can significantly improve their
correlation with human judgments, and also make them more stable. We will publicly release data and code.
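The single-feature augmentation is simple to state: take any existing readability formula and add a term for the number of jargon spans. The sketch below uses the classic Flesch-Kincaid grade level, with a hypothetical weight BETA standing in for a coefficient that would be fit against human ratings as in the paper.
```python
# Jargon-augmented readability score (sketch; BETA is a placeholder weight).
import re

BETA = 0.5

def count_syllables(word: str) -> int:
    # Very rough heuristic syllable counter, for illustration only.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(sentence: str) -> float:
    words = sentence.split()
    syllables = sum(count_syllables(w) for w in words)
    # Flesch-Kincaid grade level for a single sentence.
    return 0.39 * len(words) + 11.8 * (syllables / len(words)) - 15.59

def jargon_aware_grade(sentence: str, jargon_spans: list[str]) -> float:
    return fk_grade(sentence) + BETA * len(jargon_spans)

s = "Metformin attenuates hepatic gluconeogenesis in diabetic patients."
print(jargon_aware_grade(s, ["Metformin", "hepatic gluconeogenesis"]))
```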
15:15 - 15:30 - Tuttle
MAIR: A Benchmark for Evaluating Instructed Information Retrieval
Weiwei Sun, Zhengliang Shi, Wu Jiu Long, Lingyong Yan, Xinyu Ma, Yiding Liu, Min Cao, Dawei Yin, Zhaochun Ren
Recent information retrieval (IR) models are pre-trained and instruction-tuned on massive datasets and tasks, enabling them to perform well
on a wide range of tasks and potentially generalize to unseen tasks with instructions. However, existing IR benchmarks focus on a limited
scope of tasks, making them insufficient for evaluating the latest IR models. In this paper, we propose MAIR (Massive Instructed Retrieval
Benchmark), a heterogeneous IR benchmark that includes 126 distinct IR tasks across 6 domains, collected from existing datasets. We bench-
mark state-of-the-art instruction-tuned text embedding models and re-ranking models. Our experiments reveal that instruction-tuned models
generally achieve superior performance compared to non-instruction-tuned models on MAIR. Additionally, our results suggest that current
instruction-tuned text embedding models and re-ranking models still lack effectiveness in specific long-tail tasks. MAIR is publicly available
at https://github.com/sunnweiwei/Mair.
leverages only English retrieval data and general multilingual corpora, training models to focus on language-invariant retrieval by semantic
similarity without necessitating a vast parallel corpus. Experimental results on various datasets show our method consistently improves over
baselines, with extensive analyses demonstrating greater language agnosticism.
16:45 - 17:00 - Brickell
Taxonomy-guided Semantic Indexing for Academic Paper Search
SeongKu Kang, Yunyi Zhang, Pengcheng Jiang, Dongha Lee, Jiawei Han, Hwanjo Yu
Academic paper search is an essential task for efficient literature discovery and scientific advancement. While dense retrieval has advanced
various ad-hoc searches, it often struggles to match the underlying academic concepts between queries and documents, which is critical for
paper search. To enable effective academic concept matching for paper search, we propose the Taxonomy-guided Semantic Indexing (TaxoIn-
dex) framework. TaxoIndex extracts key concepts from papers and organizes them as a semantic index guided by an academic taxonomy,
and then leverages this index as foundational knowledge to identify academic concepts and link queries and documents. As a plug-and-play
framework, TaxoIndex can be flexibly employed to enhance existing dense retrievers. Extensive experiments show that TaxoIndex brings
significant improvements, even with highly limited training data, and greatly enhances interpretability.
and LLaMA-2) and three vision model architectures (ResNet, SegFormer, and MAE). Our experiments show that LMs partially converge
towards representations isomorphic to those of vision models, subject to dispersion, polysemy and frequency. This has important implications
for both multi-modal processing and the LM understanding debate (Mitchell and Krakauer, 2023).
16:45 - 17:00 - Flagler
MatchTime: Towards Automatic Soccer Game Commentary Generation
Jiayuan Rao, Haoning Wu, Chang Liu, Yanfeng Wang, Weidi Xie
Soccer is a globally popular sport with a vast audience. In this paper, we consider constructing an automatic soccer game commentary model
to improve the audiences’ viewing experience. In general, we make the following contributions: *First*, observing the prevalent video-
text misalignment in existing datasets, we manually annotate timestamps for 49 matches, establishing a more robust benchmark for soccer
game commentary generation, termed as *SN-Caption-test-align*; *Second*, we propose a multi-modal temporal alignment pipeline to au-
tomatically correct and filter the existing dataset at scale, creating a higher-quality soccer game commentary dataset for training, denoted as
*MatchTime*; *Third*, based on our curated dataset, we train an automatic commentary generation model, named **MatchVoice**. Ex-
tensive experiments and ablation studies have demonstrated the effectiveness of our alignment pipeline, and training model on the curated
datasets achieves state-of-the-art performance for commentary generation, showcasing that better alignment can lead to significant perfor-
mance improvements in downstream tasks.
17:00 - 17:15 - Flagler
Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding
Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu
Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding
these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only
be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences,
we ask: are today’s AI systems capable of similar understanding? We present VisArgs, a dataset of 1,611 images annotated with 5,112 visual premises
(with regions), 5,574 commonsense premises, and reasoning trees connecting them into structured arguments. We propose three tasks for
evaluating visual argument understanding: premise localization, premise identification, and conclusion deduction. Experiments show that 1)
machines struggle to capture visual cues: GPT-4o achieved 78.5% accuracy, while humans reached 98.0%. Models also performed 19.5%
worse when distinguishing between irrelevant objects within the image compared to external objects. 2) Providing relevant visual premises
improved model performance significantly.
17:15 - 17:30 - Flagler
Pre-trained Language Models Do Not Help Auto-regressive Text-to-Image Generation
Yuhui Zhang, Brandon McKinzie, Zhe Gan, Vaishaal Shankar, Alexander T Toshev
Recent advances in image tokenizers, such as VQ-VAE, have enabled text-to-image generation using auto-regressive methods, similar to lan-
guage modeling. However, these methods have yet to leverage pre-trained language models, despite their adaptability to various downstream
tasks. In this work, we explore this gap by adapting a pre-trained language model for auto-regressive text-to-image generation, and find
that pre-trained language models offer limited help. We provide a two-fold explanation by analyzing tokens from each modality. First, we
demonstrate that image tokens possess significantly different semantics compared to text tokens, rendering pre-trained language models no
more effective in modeling them than randomly initialized ones. Second, the text tokens in the image-text datasets are too simple compared
to normal language model pre-training data, which causes the catastrophic degradation of language models’ capability.
What do language models (LMs) do with language? They can produce sequences of (mostly) coherent strings closely resembling English.
But do those sentences mean something, or are LMs simply babbling in a convincing simulacrum of language use? We address one aspect of
this broad question: whether LMs' words can refer, that is, achieve word-to-world connections. There is prima facie reason to think they do
not, since LMs do not interact with the world in the way that ordinary language users do. Drawing on the externalist tradition in philosophy
of language, we argue that those appearances are misleading: Even if the inputs to LMs are simply strings of text, they are strings of text with
natural histories, and that may suffice for LMs' words to refer.
16:45 - 17:00 - Monroe
Fine-Grained Prediction of Reading Comprehension from Eye Movements
Omer Shubi, Yoav Meiri, Cfir Avraham Hadar, Yevgeni Berzak
Can human reading comprehension be assessed from eye movements in reading? In this work, we address this longstanding question using
large-scale eyetracking data. We focus on a cardinal and largely unaddressed variant of this question: predicting reading comprehension of
a single participant for a single question from their eye movements over a single paragraph. We tackle this task using a battery of recent
models from the literature, and three new multimodal language models. We evaluate the models in two different reading regimes: ordinary
reading and information seeking, and examine their generalization to new textual items, new participants, and the combination of both. The
evaluations suggest that the task is highly challenging, and highlight the importance of benchmarking against a strong text-only baseline.
While in some cases eye movements provide improvements over such a baseline, they tend to be small. This could be due to limitations of
current modelling approaches, limitations of the data, or because eye movement behavior does not sufficiently pertain to fine-grained aspects
of reading comprehension processes. Our study provides an infrastructure for making further progress on this question.
Industry
Nov 12 (Tue) 16:00-17:30 - Room: Tuttle
instructions or prompt-engineering. We propose a self-supervised denoising method to train the model from an existing dataset of such ob-
jects. The input query can be the existing object itself, in which case the system acts as a regenerator, completing, correcting, normalizing
the input, or any unstructured blurb to be structured. We show that the self-supervised denoising training provides a strong baseline, and
that additional supervised fine-tuning with small amount of human demonstrations leads to further improvement. Experimental results show
that the proposed method matches or outperforms prompt-engineered general-purpose state-of-the-art LLMs (Claude 3, Mixtral-8x7B), while
being order-of-magnitude more cost-efficient.
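The denoising setup reduces to pair construction: corrupt a structured object and train the model to map the corruption back to the original. The field names and noise operations below are hypothetical, illustrating the general recipe rather than the paper's own schema.
```python
# Self-supervised denoising pair construction for structured objects (sketch).
import random

def corrupt(obj: dict) -> dict:
    noisy = dict(obj)
    op = random.choice(["drop_field", "shuffle_case", "typo"])
    key = random.choice(list(noisy))
    if op == "drop_field":
        del noisy[key]                        # model must re-complete the field
    elif op == "shuffle_case":
        noisy[key] = str(noisy[key]).upper()  # model must normalize
    else:
        v = str(noisy[key])
        i = random.randrange(len(v))
        noisy[key] = v[:i] + "#" + v[i:]      # model must correct
    return noisy

record = {"name": "Acme Rocket Skates", "brand": "Acme", "size": "42"}
pairs = [(corrupt(record), record) for _ in range(3)]  # (input, target) pairs
for noisy, clean in pairs:
    print(noisy, "->", clean)
```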
16:20 - 16:30 - Tuttle
MARS: Multilingual Aspect-centric Review Summarisation
Sandeep Sricharan Mukku, Abinesh Kanagarajan, Chetan Aggarwal, Promod Yenigalla
Summarizing customer feedback to provide actionable insights for products/services at scale is an important problem for businesses across
industries. As reviews increasingly span multiple languages, the challenge of aggregating and understanding customer sentiment
across languages becomes increasingly vital. In this paper, we propose MARS, a novel framework involving a two-step Extract-
then-Summarise paradigm, to address domain-agnostic aspect-level multilingual review summarisation.
Extensive automatic and human evaluation shows that our approach brings substantial improvements over abstractive baselines and efficiency
to production systems.
16:30 - 16:40 - Tuttle
Two-tiered Encoder-based Hallucination Detection for Retrieval-Augmented Generation in the Wild
Ilana Zimmerman, Jadin Tredup, Ethan Selfridge, Joseph Bradley
Detecting hallucinations, where Large Language Models (LLMs) are not factually consistent with a Knowledge Base (KB), is a challenge
for Retrieval-Augmented Generation (RAG) systems. Current solutions rely on public datasets to develop prompts or fine-tune a Natural
Language Inference (NLI) model. However, these approaches are not focused on developing an enterprise RAG system; they do not consider
latency, train or evaluate on production data, nor do they handle non-verifiable statements such as small talk or questions. To address this, we
leverage the customer service conversation data of four large brands to evaluate existing solutions and propose a set of small encoder models
trained on a new dataset. We find the proposed models to outperform existing methods and highlight the value of combining a small amount
of in-domain data with public datasets.
16:40 - 16:50 - Tuttle
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks
Ibrahim Abdelaziz, Kinjal Basu, Mayank Agarwal, Sadhana Kumaravel, Matthew Stallone, Rameswar Panda, Yara Rizk, G P Shrivatsa Bhar-
gav, Maxwell Crouse, Chulaka Gunasekara, Shajith Ikbal, Sachindra Joshi, Hima Karanam, Vineet Kumar, Asim Munawar, Sumit Neelam,
Dinesh Raghu, Udit Sharma, Adriana Meza Soria, Dheeraj Sreedhar, Praveen Venkateswaran, Merve Unuvar, David Daniel Cox, Salim
Roukos, Luis A. Lastras, Pavan Kapanipathi
An emergent research trend explores the use of Large Language Models (LLMs) as the backbone of agentic systems (e.g., SWE-Bench,
Agent-Bench). To fulfill LLMs’ potential as autonomous agents, they must be able to identify, call, and interact with a variety of exter-
nal tools and application program interfaces (APIs). This capability of LLMs, commonly termed function calling, leads to a myriad of
advantages such as access to current and domain-specific information in databases and the outsourcing of tasks that can be reliably per-
formed by tools. In this work, we introduce GRANITE-20B-FunctionCalling, a model trained using a multi-task training approach on seven
fundamental tasks encompassed in function calling. Our comprehensive evaluation on multiple out-of-domain datasets, which compares
GRANITE-20B-FunctionCalling to more than 15 other best proprietary and open models, shows that GRANITE-20B-FunctionCalling has
better generalizability on multiple tasks across seven different evaluation benchmarks. Moreover, GRANITE-20B-FunctionCalling shows the
best performance among all open models and ranks among the top on the Berkeley Function Calling Leaderboard (BFCL).
answer quality.
17:20 - 17:30 - Tuttle
Divide-Conquer-Reasoning for Consistency Evaluation and Automatic Improvement of Large Language Models
Wendi Cui, Jiaxin Zhang, Zhuohang Li, Damien Lopez, Kamalika Das, Bradley A. Malin, Sricharan Kumar
Evaluating the quality and consistency of text generated by Large Language Models (LLMs) poses a significant, yet unresolved challenge for
industry research. We propose an automated framework for evaluating and improving the consistency of LLM-generated
texts using a divide-conquer-reasoning approach. Unlike existing LLM-based evaluators operating at the paragraph level, our method em-
ploys a divide-and-conquer evaluator that breaks down the paragraph-to-paragraph comparison into sentence-to-paragraph
comparisons. To facilitate this approach, we also introduce an automatic metric converter that translates the output from the
evaluator into an interpretable numeric score. Beyond the consistency evaluation, we further present a reason-assisted improver that
mitigates inconsistencies by leveraging the analytical reasons identified by the evaluator. Through comprehensive and systematic empirical
analysis, we show that our approach outperforms state-of-the-art methods by a large margin (e.g., +16.8% and +32.5% on the SummEval
dataset) in consistency evaluation across multiple benchmarks. Our approach also substantially reduces nearly 90% of output inconsistencies in
one iteration, showing promise for effective hallucination mitigation in real-world industrial applications.
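The divide-and-conquer evaluator reduces to a simple aggregation: judge each candidate sentence against the whole reference paragraph, then convert the per-sentence verdicts into a numeric score. In the sketch below, the per-sentence judge is a trivial stand-in for the LLM prompt the paper uses for each comparison.
```python
# Sentence-to-paragraph consistency scoring (sketch; judge is a stand-in).
def is_consistent(sentence: str, reference: str) -> bool:
    # Stand-in judge: "consistent" if all content words appear in the reference.
    content = [w for w in sentence.lower().split() if len(w) > 3]
    return all(w in reference.lower() for w in content)

def consistency_score(candidate: str, reference: str) -> float:
    sentences = [s.strip() for s in candidate.split(".") if s.strip()]
    verdicts = [is_consistent(s, reference) for s in sentences]
    return sum(verdicts) / len(verdicts)  # interpretable numeric score in [0, 1]

ref = "The meeting was moved to Thursday because the projector broke."
cand = "The meeting was moved to Thursday. It was moved because of rain."
print(consistency_score(cand, ref))  # 0.5: the second sentence is inconsistent
```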
Yueting Zhuang
Although most current large multimodal models (LMMs) can already understand photos of natural scenes and portraits, their understanding
of abstract images, e.g., charts, maps, or layouts, and their visual reasoning capabilities remain quite rudimentary. They often struggle with simple
daily tasks, such as reading time from a clock, understanding a flowchart, or planning a route using a road map. In light of this, we design
a multi-modal self-instruct strategy, utilizing large language models and their code capabilities to synthesize massive abstract images and visual
reasoning instructions across daily scenarios. Our strategy effortlessly creates a multimodal benchmark with 11,193 instructions for eight
visual scenarios: charts, tables, simulated maps, dashboards, flowcharts, relation graphs, floor plans, and visual puzzles. This benchmark,
constructed with simple lines and geometric elements, exposes the shortcomings of most advanced LMMs like GPT-4V and LLaVA in
abstract image understanding, spatial relations reasoning, and visual element induction. Besides, to verify the quality of our synthetic data,
we fine-tune an LMM using 62,476 synthetic chart, table and road map instructions. The results demonstrate improved chart understanding
and map navigation performance, and also demonstrate potential benefits for other visual reasoning tasks.
11:30 - 11:45 - Ashe Auditorium
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
Youngtaek Oh, Jae Won Cho, Dong-Jin Kim, In So Kweon, Junmo Kim
In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without
sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost
of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations
of images and texts. This global HN loss pushes HN texts that are highly similar to the original ones, damaging the model's multi-modal
representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard
negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's
representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that
FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is
available at: https://github.com/ytaek-oh/fsc-clip.
11:45 - 12:00 - Ashe Auditorium
TopViewRS: Vision-Language Models as Top-View Spatial Reasoners
Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, Ivan Vulić
Top-view perspective denotes a typical way in which humans read and reason over different types of maps, and it is vital for localization and
navigation of humans as well as of ‘non-human’ agents, such as the ones backed by large Vision-Language Models (VLMs). Nonetheless,
spatial reasoning capabilities of modern VLMs in this setup remain unattested and underexplored. In this work, we study their capability to
understand and reason over spatial relations from the top view. The focus on top view also enables controlled evaluations at different gran-
ularity of spatial reasoning; we clearly disentangle different abilities (e.g., recognizing particular objects versus understanding their relative
positions). We introduce the TopViewRS (Top-View Reasoning in Space) dataset, consisting of 11,384 multiple-choice questions with either
realistic or semantic top-view map as visual input. We then use it to study and evaluate VLMs across 4 perception and reasoning tasks with
different levels of complexity. Evaluation of 10 representative open- and closed-source VLMs reveals a gap of more than 50% compared
to average human performance, and in some cases performance is even lower than the random baseline. Although additional experiments show that
Chain-of-Thought reasoning can boost model capabilities by 5.82% on average, the overall performance of VLMs remains limited. Our
findings underscore the critical need for enhanced model capability in top-view spatial reasoning and set a foundation for further research
towards human-level proficiency of VLMs in real-world multimodal tasks.
significant discrepancies between the LLM- and human- written stories. While human-written stories are suspenseful, arousing, and diverse
in narrative structures, LLM stories are homogeneously positive and lack tension. Next, we measure narrative reasoning skills as a precursor
to generative capacities, concluding that most LLMs fall short of human abilities in discourse understanding. Finally, we show that explicit
integration of aforementioned discourse features can enhance storytelling, as is demonstrated by over 40% improvement in neural storytelling
in terms of diversity, suspense, and arousal. Such advances promise to facilitate greater and more natural roles for LLMs in human communication.
11:00 - 11:15 - Flagler
Revisiting Supertagging for faster HPSG parsing
Olga Zamaraeva, Carlos Gómez-Rodríguez
We present new supertaggers trained on English HPSG-based treebanks and test the effects of the best tagger on parsing speed and accuracy.
HPSG treebanks are produced automatically by large manually built grammars and feature high-quality annotation based on a well-developed
linguistic theory. The English Resource Grammar treebanks include diverse and challenging test datasets, beyond the usual WSJ section 23
and Wikipedia data. HPSG supertagging has previously relied on MaxEnt-based models. We use SVM and neural CRF- and BERT-based
methods and show that both SVM and neural supertaggers achieve considerably higher accuracy compared to the baseline and lead to an
increase not only in parsing speed but also in parser accuracy with respect to gold dependency structures. Our fine-tuned BERT-based
tagger achieves 97.26% accuracy on 950 sentences from WSJ23 and 93.88% on the out-of-domain technical essay The Cathedral and the
Bazaar. We present experiments with integrating the best supertagger into an HPSG parser and observe a speedup of a factor of 3 with respect
to the system which uses no tagging at all, as well as large recall gains and an overall precision gain. We also compare our system to an existing
integrated tagger and show that although the well-integrated tagger remains the fastest, our experimental system can be more accurate. Finally,
we hope that the diverse and difficult datasets we used for evaluation will gain more popularity in the field: we show that results can differ
depending on the dataset, even if it is an in-domain one. We contribute the complete datasets reformatted for Huggingface token classification.
11:15 - 11:30 - Flagler
Which questions should I answer? Salience Prediction of Inquisitive Questions
Yating Wu, Ritika Rajesh Mangla, Alex Dimakis, Greg Durrett, Junyi Jessy Li
Inquisitive questions — open-ended, curiosity-driven questions people ask as they read — are an integral part of discourse processing and
comprehension. Recent work in NLP has taken advantage of question generation capabilities of LLMs to enhance a wide range of applications.
But the space of inquisitive questions is vast: many questions can be evoked from a given context. So which of those should be prioritized to
find answers? Linguistic theories, unfortunately, have not yet provided an answer to this question. This paper presents QSalience, a salience
predictor of inquisitive questions. QSalience is instruction-tuned over our dataset of linguist-annotated salience scores of 1,766 (context,
question) pairs. A question scores high on salience if answering it would greatly enhance the understanding of the text. We show that highly
salient questions are empirically more likely to be answered in the same article, bridging potential questions with Questions Under Discussion.
We further validate our findings by showing that answering salient questions is an indicator of summarization quality in news.
11:30 - 11:45 - Flagler
CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich
Low Resource Languages
Pretam Ray, Jivnesh Sandhan, Amrith Krishna, Pawan Goyal
Neural dependency parsing has achieved remarkable performance for low resource morphologically rich languages. It has also been well-
studied that morphologically rich languages exhibit relatively free word order. This prompts a fundamental investigation: Is there a way to
enhance dependency parsing performance, making the model robust to word order variations utilizing the relatively free word order nature
of morphologically rich languages? In this work, we examine the robustness of graph-based parsing architectures on 7 relatively free word
order languages. We focus on scrutinizing essential modifications such as data augmentation and the removal of position encoding required
to adapt these architectures accordingly. To this end, we propose a contrastive self-supervised learning method to make the model robust to
word order variations. Furthermore, our proposed modification demonstrates a substantial average gain of 3.03/2.95 points in 7 relatively free
word order languages, as measured by the UAS/LAS Score metric when compared to the best performing baseline.
11:45 - 12:00 - Flagler
Tokenization Is More Than Compression
Craig W Schmidt, Varshini Reddy, Haoran Zhang, Alec Alameddine, Omri Uzan, Yuval Pinter, Chris Tanner
Tokenization is a foundational step in natural language processing (NLP) tasks, bridging raw text and language models. Existing tokenization
approaches like Byte-Pair Encoding (BPE) originate from the field of data compression, and it has been suggested that the effectiveness of
BPE stems from its ability to condense text into a relatively small number of tokens. We test the hypothesis that fewer tokens lead to better
downstream performance by introducing PathPiece, a new tokenizer that segments a document’s text into the minimum number of tokens
for a given vocabulary. Through extensive experimentation we find this hypothesis not to be the case, casting doubt on the understanding
of the reasons for effective tokenization. To examine which other factors play a role, we evaluate design decisions across all three phases
of tokenization: pre-tokenization, vocabulary construction, and segmentation, offering new insights into the design of effective tokenizers.
Specifically, we illustrate the importance of pre-tokenization and the benefits of using BPE to initialize vocabulary construction. We train 64
language models with varying tokenization, ranging in size from 350M to 2.4B parameters, all of which are made publicly available.
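Minimum-token segmentation for a fixed vocabulary is a classic dynamic program: dp[i] holds the fewest tokens covering the first i characters. The sketch below captures that core idea; the real PathPiece tokenizer operates over bytes with a maximum token width and differs in detail.
```python
# Dynamic program for minimum-token segmentation (the core idea of PathPiece).
def min_token_segmentation(text: str, vocab: set[str], max_len: int = 8):
    n = len(text)
    INF = float("inf")
    dp = [INF] * (n + 1)    # dp[i]: min tokens needed to cover text[:i]
    back = [-1] * (n + 1)   # backpointer to recover the segmentation
    dp[0] = 0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if text[j:i] in vocab and dp[j] + 1 < dp[i]:
                dp[i] = dp[j] + 1
                back[i] = j
    if dp[n] == INF:
        return None          # not segmentable with this vocabulary
    tokens, i = [], n
    while i > 0:
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}
print(min_token_segmentation("unbelievable", vocab))  # ['un', 'believ', 'able']
```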
Question Answering 2
Nov 13 (Wed) 10:30-12:00 - Room: Monroe
https://github.com/primeqa/clapnq.
10:45 - 11:00 - Monroe
Evidence-Focused Fact Summarization for Knowledge-Augmented Zero-Shot Question Answering
Sungho Ko, Hyunjin Cho, Hyungjoo Chae, Jinyoung Yeo, Dongha Lee
Recent studies have investigated utilizing Knowledge Graphs (KGs) to enhance Question Answering (QA) performance of Large Language
Models (LLMs), yet structured KG verbalization remains challenging. Existing methods, like concatenation or free-form textual conversion
of triples, have limitations, including duplicated entities or relations, reduced evidence density, and failure to highlight crucial evidence. To
address these issues, we propose EFSum, an Evidence-focused Fact Summarization framework for enhanced QA with knowledge-augmented
LLMs. We optimize an LLM as a fact summarizer through distillation and preference alignment. Our extensive experiments show that EFSum
improves LLM’s zero-shot QA performance with its helpful and faithful summaries, especially when noisy facts are retrieved.
11:00 - 11:15 - Monroe
Adaptive Query Rewriting: Aligning Rewriters through Marginal Probability of Conversational Answers
Tianhua Zhang, Kun LI, Hongyin Luo, Xixin Wu, James R. Glass, Helen M. Meng
Query rewriting is a crucial technique for passage retrieval in open-domain conversational question answering (CQA). It decontextualizes
conversational queries into self-contained questions suitable for off-the-shelf retrievers. Existing methods attempt to incorporate retriever’s
preference during the training of rewriting models. However, these approaches typically rely on extensive annotations such as in-domain
rewrites and/or relevant passage labels, limiting the models’ generalization and adaptation capabilities. In this paper, we introduce AdaQR
(Adaptive Query Rewriting), a framework for training query rewriting models with limited rewrite annotations from seed datasets and com-
pletely no passage label. Our approach begins by fine-tuning compact large language models using only 10% of rewrite annotations from
the seed dataset training split. The models are then utilized to self-sample rewrite candidates for each query instance, further eliminating the
expense for human labeling or larger language model prompting often adopted in curating preference data. A novel approach is then proposed
to assess retriever’s preference for these candidates with the probability of answers conditioned on the conversational query by marginalizing
the Top-K passages. This serves as the reward for optimizing the rewriter further using Direct Preference Optimization (DPO), a process free
of rewrite and retrieval annotations. Experimental results on four open-domain CQA datasets demonstrate that AdaQR not only enhances the
in-domain capabilities of the rewriter with limited annotation requirement, but also adapts effectively to out-of-domain datasets.
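The annotation-free reward marginalizes the answer probability over the Top-K retrieved passages: r(rewrite) = sum_k p(d_k | rewrite) * p(answer | query, d_k). The sketch below uses hypothetical probabilities; a real system would take retriever scores and LM answer likelihoods.
```python
# AdaQR-style reward: marginal answer probability over Top-K passages (sketch).
def reward(rewrite_passages, answer_probs):
    """rewrite_passages: [(passage_id, p(passage | rewrite))] for Top-K.
    answer_probs: passage_id -> p(answer | query, passage)."""
    return sum(p_d * answer_probs[d] for d, p_d in rewrite_passages)

# Two candidate rewrites of the same conversational query (toy numbers):
cand_a = [("doc1", 0.6), ("doc2", 0.3), ("doc3", 0.1)]
cand_b = [("doc4", 0.5), ("doc1", 0.4), ("doc2", 0.1)]
p_answer = {"doc1": 0.8, "doc2": 0.2, "doc3": 0.05, "doc4": 0.01}

ra, rb = reward(cand_a, p_answer), reward(cand_b, p_answer)
# The rewrite with the higher marginal answer probability becomes the DPO
# "preferred" sample; the other becomes the "rejected" one.
print(ra, rb, "prefer:", "A" if ra > rb else "B")
```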
11:15 - 11:30 - Monroe
MedCoT: Medical Chain of Thought via Hierarchical Expert
Jiaxiang Liu, Yuan Wang, Jiawei Du, Joey Tianyi Zhou, Zuozhu Liu
Artificial intelligence has advanced in Medical Visual Question Answering (Med-VQA), but prevalent research tends to focus on the accuracy
of the answers, often overlooking the reasoning paths and interpretability, which are crucial in clinical settings. Besides, current Med-VQA
algorithms, typically reliant on singular models, lack the robustness needed for real-world medical diagnostics which usually require col-
laborative expert evaluation. To address these shortcomings, this paper presents MedCoT, a novel hierarchical expert verification reasoning
chain method designed to enhance interpretability and accuracy in biomedical imaging inquiries. MedCoT is predicated on two principles:
The necessity for explicit reasoning paths in Med-VQA and the requirement for multi-expert review to formulate accurate conclusions. The
methodology involves an Initial Specialist proposing diagnostic rationales, followed by a Follow-up Specialist who validates these rationales,
and finally, a consensus is reached through a vote among a sparse Mixture of Experts within the locally deployed Diagnostic Specialist, which
then provides the definitive diagnosis. Experimental evaluations on four standard Med-VQA datasets demonstrate that MedCoT surpasses
existing state-of-the-art approaches, providing significant improvements in performance and interpretability.
11:30 - 11:45 - Monroe
QUITE: Quantifying Uncertainty in Natural Language Text in Bayesian Reasoning Scenarios
Timo Pierre Schrader, Lukas Lange, Simon Razniewski, Annemarie Friedrich
Reasoning is key to many decision-making processes. It requires consolidating a set of rule-like premises, often associated with degrees of uncertainty, with observations in order to draw conclusions. In this work, we address both the case where premises are specified as numeric probabilistic rules and situations in which humans state their estimates using words expressing degrees of certainty. Existing probabilistic reasoning datasets simplify the task, e.g., by requiring the model to only rank textual alternatives, by including only binary random variables, or by making use of a limited set of templates that result in less varied text. In this work, we present QUITE, a question answering dataset of real-world Bayesian reasoning scenarios with categorical random variables and complex relationships. QUITE provides high-quality natural language verbalizations of premises together with evidence statements, and expects the answer to a question in the form of an estimated probability. We conduct an extensive set of experiments, finding that logic-based models outperform out-of-the-box large language models on all reasoning types (causal, evidential, and explaining-away). Our results provide evidence that neuro-symbolic models are a promising direction for improving complex reasoning. We release QUITE and code for training and experiments on GitHub.
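To make the reasoning types concrete, here is a tiny Bayesian network of our own (not an instance from QUITE) showing an evidential query and the explaining-away effect by exact enumeration:

```python
from itertools import product

# Two causes (burglary, earthquake), one effect (alarm); toy parameters.
P_b = {True: 0.01, False: 0.99}
P_e = {True: 0.02, False: 0.98}
P_a = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}  # P(alarm | b, e)

def joint(b, e, a):
    pa = P_a[(b, e)]
    return P_b[b] * P_e[e] * (pa if a else 1 - pa)

def posterior_burglary(evidence):
    num = den = 0.0
    for b, e, a in product([True, False], repeat=3):
        if evidence(b, e, a):
            p = joint(b, e, a)
            den += p
            if b:
                num += p
    return num / den

# Evidential reasoning: the alarm rang -> P(burglary) rises to ~0.58.
print(posterior_burglary(lambda b, e, a: a))
# Explaining away: the alarm rang AND there was an earthquake -> drops to ~0.03.
print(posterior_burglary(lambda b, e, a: a and e))
```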
Industry
Nov 13 (Wed) 10:30-12:00 - Room: Tuttle
…mitigating counterfactual fairness in closed-source text safety classifiers. Our approach involves building an ensemble that not only outperforms the input classifiers and policy-aligns them, but also acts as a debiasing regularizer. We introduce two threshold-agnostic metrics to assess the counterfactual fairness of a model, and demonstrate how combining these metrics with Fair Data Reweighting (FDW) helps mitigate biases. We create an expanded OpenAI dataset and a new templated LLM-generated dataset based on user prompts, both of which are counterfactually balanced across identity groups and cover four key areas of safety; we will work towards publicly releasing these datasets. Our results show that our approach improves counterfactual fairness with minimal impact on model performance.
11:40 - 11:50 - Tuttle
Robust ASR Error Correction with Conservative Data Filtering
Takuma Udagawa, Masayuki Suzuki, Masayasu Muraoka, Gakuto Kurata
Error correction (EC) based on large language models is an emerging technology for enhancing the performance of automatic speech recognition (ASR) systems. Generally, training data for EC are collected by automatically pairing a large set of ASR hypotheses (as sources) with their gold references (as targets). However, the quality of such pairs is not guaranteed, and we observed various types of noise which can make EC models brittle, e.g., inducing overcorrection in out-of-domain (OOD) settings. In this work, we propose two fundamental criteria that EC training data should satisfy: EC targets should (1) improve linguistic acceptability over their sources and (2) be inferable from the available context (e.g., source phonemes). Through these criteria, we identify low-quality EC pairs and train the models not to make any correction in such cases, a process we refer to as conservative data filtering. In our experiments, we focus on Japanese ASR, using a strong Conformer-CTC as the baseline and fine-tuning Japanese LLMs for EC. Through evaluation on a suite of 21 internal benchmarks, we demonstrate that our approach can significantly reduce overcorrection and improve both the accuracy and quality of ASR results in challenging OOD settings.
11:50 - 12:00 - Tuttle
SHIELD: LLM-Driven Schema Induction for Predictive Analytics in EV Battery Supply Chain Disruptions
Zhi-Qi Cheng, Yifei Dong, Yuzhi Hu, Aike Shi, Wei Liu, Jason O’Connor, Alexander Hauptmann, Kate Whitefoot
The electric vehicle (EV) battery supply chain’s vulnerability to disruptions necessitates advanced predictive analytics. We present SHIELD
(Schema-based Hierarchical Induction for EV supply chain Disruption), a system integrating Large Language Models (LLMs) with domain
expertise for EV battery supply chain risk assessment. SHIELD combines: (1) LLM-driven schema learning to construct a comprehensive
knowledge library, (2) a disruption analysis system utilizing fine-tuned language models for event extraction, multi-dimensional similarity
matching for schema matching, and Graph Convolutional Networks (GCNs) with logical constraints for prediction, and (3) an interactive
interface for visualizing results and incorporating expert feedback to enhance decision-making. Evaluated on 12,070 paragraphs from 365
sources (2022-2023), SHIELD outperforms baseline GCNs and LLM+prompt methods (e.g. GPT-4o) in disruption prediction. These results
demonstrate SHIELD’s effectiveness in combining LLM capabilities with domain expertise for enhanced supply chain risk assessment.
…set of aspects for assessing data quality, namely style preservation, meaning preservation, and divergence, as a proxy for privacy. We introduce
metrics corresponding to each aspect. Moreover, through a set of generation strategies and representative tasks and baselines across domains,
we demonstrate the relation between the quality aspects of synthetic user generated content, generation strategies, metrics and downstream
performance. To our knowledge, our work is the first unified evaluation framework for user-generated text in relation to the specified aspects,
offering both intrinsic and extrinsic evaluation. We envisage it will facilitate developments towards shareable, high-quality synthetic language
data.
16:45 - 17:00 - Ashe Auditorium
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni, Mohammed Safi Ur Rahman Khan, Sshubam Verma, Mitesh M Khapra
Large Language Models (LLMs) are increasingly relied upon to evaluate text outputs of other LLMs, thereby influencing leaderboards and
development decisions. However, concerns persist over the accuracy of these assessments and the potential for misleading conclusions. In
this work, we investigate the effectiveness of LLMs as evaluators for text generation tasks. We propose FBI, a novel framework designed to examine the proficiency of Evaluator LLMs in assessing four critical abilities in other LLMs: factual accuracy, instruction following, coherence in long-form writing, and reasoning proficiency. By introducing targeted perturbations that clearly impact one of these key capabilities into answers generated by LLMs, we test whether an Evaluator LLM can detect the resulting quality drops. Creating a total of 2,400 perturbed answers covering 22 perturbation categories, we conduct a comprehensive study using different evaluation strategies on five prominent LLMs commonly used as evaluators in the literature. Our findings reveal significant shortcomings in current Evaluator LLMs, which failed to identify quality drops in over 50% of cases on average. Single-answer and pairwise evaluations demonstrated notable limitations, whereas reference-based evaluations showed comparatively better performance. These results underscore the unreliable nature of current Evaluator LLMs and advocate for cautious implementation in practical applications.
17:00 - 17:15 - Ashe Auditorium
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
Jiangshu Du, Yibo Wang, Wenting Zhao, Zhongfen Deng, Shuaiqi LIU, Renze Lou, Henry Peng Zou, Pranav Narayanan Venkit, Nan Zhang,
Mukund Srinath, Haoran Ranran Zhang, Vipul Gupta, Yinghui Li, Tao Li, Fei Wang, Qin Liu, Tianlin Liu, Pengzhi Gao, Congying Xia, Chen
Xing, Cheng Jiayang, Zhaowei Wang, Ying Su, Raj Sanjay Shah, Ruohao Guo, Jing Gu, Haoran Li, Kangda Wei, Zihao Wang, Lu Cheng,
Surangika Ranathunga, Meng Fang, Jie Fu, Fei Liu, Ruihong Huang, Eduardo Blanco, Yixin Cao, Rui Zhang, Philip S. Yu, Wenpeng Yin
Claim: This work is not advocating the use of LLMs for paper (meta-)reviewing. Instead, we present a comparative analysis to identify and distinguish LLM activities from human activities. Two research goals: i) enable better recognition of instances when someone implicitly uses LLMs for reviewing activities; ii) increase community awareness that LLMs, and AI in general, are currently inadequate for performing tasks that require a high level of expertise and nuanced judgment. This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload? This study focuses on the topic of LLMs as NLP researchers, particularly examining the effectiveness of LLMs in assisting paper (meta-)reviewing and the recognizability of LLM-generated reviews. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready versions) with both human-written and LLM-generated reviews, and (ii) "deficiency" labels and corresponding explanations for individual review segments, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) "LLMs as Reviewers": how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) "LLMs as Metareviewers": how effectively can LLMs identify potential issues, such as deficient or unprofessional segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.
…given context) when training LMs with these objectives. For instance, fine-tuning LLaMA-7B on instruction-following datasets renders it less faithful. Conversely, instruction-tuned Vicuna-7B shows degraded performance at following instructions when further optimized on tasks that require contextual grounding. One common remedy is multi-task learning (MTL) with data mixing, yet it remains far from achieving a synergistic outcome. We propose a simple yet effective method that relies on Reject-sampling by Self-instruct with Continued Fine-tuning (ReSet), which significantly outperforms vanilla MTL. Surprisingly, we find that less is more: training ReSet with high-quality yet substantially smaller data (three-fold less) yields superior results. Our findings offer a better understanding of objective discrepancies in alignment training of LMs.
16:30 - 16:45 - Brickell
Dissecting Fine-Tuning Unlearning in Large Language Models
Yihuai Hong, Yuelin Zou, Lijie Hu, Ziqian Zeng, Di Wang, Haiqin Yang
Fine-tuning-based unlearning methods prevail for erasing targeted harmful, sensitive, or copyrighted information within large language mod-
els while preserving overall capabilities. However, the true effectiveness of the methods is unclear. In this paper, we delve into the limitations
of fine-tuning-based unlearning through activation patching and parameter restoration experiments. Our findings reveal that these methods
alter the model’s knowledge retrieval process, rather than genuinely erasing the problematic knowledge embedded in the model parame-
ters. Furthermore, behavioral tests demonstrate that the unlearning mechanisms inevitably impact the global behavior of the models, affecting
unrelated knowledge or capabilities. Our work advocates the development of more resilient unlearning techniques for truly erasing knowledge.
16:45 - 17:00 - Brickell
Interpreting Arithmetic Mechanism in Large Language Models through Comparative Neuron Analysis
ZEPING YU, Sophia Ananiadou
We find that arithmetic ability resides within a limited number of attention heads, with each head specializing in distinct operations. To investigate why, we introduce the Comparative Neuron Analysis (CNA) method, which identifies an internal logic chain consisting of four distinct stages from input to prediction: feature enhancing with shallow FFN neurons, feature transferring by shallow attention layers, feature predicting by arithmetic heads, and prediction enhancing among deep FFN neurons. Moreover, we identify human-interpretable FFN neurons within both the feature-enhancing and feature-predicting stages. These findings lead us to investigate the mechanism of LoRA, revealing that it enhances prediction probabilities by amplifying the coefficient scores of FFN neurons related to predictions. Finally, we apply our method to model pruning for arithmetic tasks and model editing for reducing gender bias. Code is available at https://github.com/zepingyu0512/arithmetic-mechanism.
17:00 - 17:15 - Brickell
Pixology: Probing the Linguistic and Visual Knowledge of Pixel-based Language Models
Kushal Tatariya, Vladimir Araujo, Thomas Bauwens, Miryam de Lhoneux
Pixel-based language models have emerged as a compelling alternative to subword-based language modelling, particularly because they can
represent virtually any script. PIXEL, a canonical example of such a model, is a vision transformer that has been pre-trained on rendered text.
While PIXEL has shown promising cross-script transfer abilities and robustness to orthographic perturbations, it falls short of outperforming
monolingual subword counterparts like BERT in most other contexts. This discrepancy raises questions about the amount of linguistic knowl-
edge learnt by these models and whether their performance in language tasks stems more from their visual capabilities than their linguistic
ones. To explore this, we probe PIXEL using a variety of linguistic and visual tasks to assess its position on the vision-to-language spectrum.
Our findings reveal a substantial gap between the model's visual and linguistic understanding. The lower layers of PIXEL predominantly capture superficial visual features, whereas the higher layers gradually learn more syntactic and semantic abstractions. Additionally, we examine
variants of PIXEL trained with different text rendering strategies, discovering that introducing certain orthographic constraints at the input
level can facilitate earlier learning of surface-level features. With this study, we hope to provide insights that aid the further development of
pixel-based language models.
17:15 - 17:30 - Brickell
Towards Faithful Model Explanation in NLP: A Survey
Qing Lyu, Chris Callison-Burch, Marianna Apidianaki
End-to-end neural Natural Language Processing (NLP) models are notoriously difficult to understand. This has given rise to numerous efforts
towards model explainability in recent years. One desideratum of model explanation is faithfulness, i.e. an explanation should accurately
represent the reasoning process behind the model's prediction. In this survey, we review over 110 model explanation methods in NLP
through the lens of faithfulness. We first discuss the definition and evaluation of faithfulness, as well as its significance for explainability.
We then introduce recent advances in faithful explanation, grouping existing approaches into five categories: similarity methods, analysis of
model-internal structures, backpropagation-based methods, counterfactual intervention, and self-explanatory models. For each category, we
synthesize its representative studies, strengths, and weaknesses. Finally, we summarize their common virtues and remaining challenges, and
reflect on future work directions towards faithful explainability in NLP.
NLP Applications 3
Nov 13 (Wed) 16:00-17:30 - Room: Flagler
CareCorpus+: Expanding and Augmenting Caregiver Strategy Data to Support Pediatric Rehabilitation
Shahla Farzana, Ivana Lucero, Vivian Villegas, Vera C Kaelin, Mary Khetani, Natalie Parde
Caregiver strategy classification in pediatric rehabilitation contexts is strongly motivated by real-world clinical constraints but highly under-
resourced and seldom studied in natural language processing settings. We introduce a large dataset of 4,037 caregiver strategies in this setting,
a five-fold increase over the nearest contemporary dataset. These strategies are manually categorized into clinically established constructs
with high agreement (κ=0.68-0.89). We also propose two techniques to further address the identified data constraints. First, we manually supplement target-task data with relevant publicly available data from online child health forums. Next, we propose a novel data augmentation technique to generate synthetic caregiver strategies with high downstream task utility. Extensive experiments showcase the quality of our dataset. They also establish evidence that both the publicly available data and the synthetic strategies result in large performance gains, with relative F1 increases of 22.6% and 50.9%, respectively.
16:30 - 16:45 - Flagler
Conformal Prediction for Natural Language Processing: A Survey
Margarida M. Campos, António Farinhas, Chrysoula Zerva, Mário A. T. Figueiredo, André F. T. Martins
The rapid proliferation of large language models and natural language processing (NLP) applications creates a crucial need for uncertainty
quantification to mitigate risks such as hallucinations and to enhance decision-making reliability in critical applications. Conformal pre-
diction is emerging as a theoretically sound and practically useful framework, combining flexibility with strong statistical guarantees. Its
model-agnostic and distribution-free nature makes it particularly promising to address the current shortcomings of NLP systems that stem
from the absence of uncertainty quantification. This paper provides a comprehensive survey of conformal prediction techniques, their guar-
antees, and existing applications in NLP, pointing to directions for future research and open challenges.
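As a flavor of the framework the survey covers, here is a minimal split conformal prediction sketch of our own (not from the paper), producing prediction sets with marginal coverage of at least 1 - alpha under exchangeability:

```python
import numpy as np

rng = np.random.default_rng(0)

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    # Nonconformity score: 1 - probability assigned to the true label.
    scores = 1.0 - cal_probs[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, level, method="higher")

def prediction_set(test_prob, qhat):
    # Keep every label whose score falls under the calibrated threshold.
    return np.where(1.0 - test_prob <= qhat)[0]

# Fake calibration data: 200 examples over 3 classes, labels drawn from the
# model's own probabilities so the toy model is roughly calibrated.
cal_probs = rng.dirichlet([2, 1, 1], size=200)
cal_labels = np.array([rng.choice(3, p=p) for p in cal_probs])
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(np.array([0.55, 0.35, 0.10]), qhat))
```

The appeal for NLP is that nothing above depends on the model's internals: the same recipe wraps a classifier, an MT quality estimator, or an LLM scorer.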
16:45 - 17:00 - Flagler
Consistent Autoformalization for Constructing Mathematical Libraries
Lan Zhang, XIN QUAN, Andre Freitas
Autoformalization is the task of automatically translating mathematical content written in natural language to a formal language expression.
The growing language interpretation capabilities of Large Language Models (LLMs), including in formal languages, are lowering the barriers
for autoformalization. However, LLMs alone are not capable of consistently and reliably delivering autoformalization, particularly as the complexity and specialization of the target domain grow. As the field evolves in the direction of systematically applying autoformalization to large mathematical libraries, the need to improve syntactic, terminological, and semantic control increases. This paper proposes the coordinated use of three mechanisms, most-similar retrieval augmented generation (MS-RAG), denoising steps, and auto-correction with syntax error feedback (Auto-SEF), to improve autoformalization quality. Empirical analysis across different models demonstrates that these mechanisms can deliver autoformalization results which are syntactically, terminologically, and semantically more consistent. The mechanisms can be applied across different LLMs and have been shown to deliver improved results across different model types.
17:00 - 17:15 - Flagler
LMs learn governing principles of dynamical systems, revealing an in-context neural scaling law
Toni J.B. Liu, Nicolas Boulle, Raphaël Sarfati, Christopher Earls
We study LLMs’ ability to extrapolate the behavior of various dynamical systems, including stochastic, chaotic, continuous, and discrete
systems, whose evolution is governed by principles of physical interest. Our results show that LLaMA-2, a language model trained on text,
achieves accurate predictions of dynamical system time series without fine-tuning or prompt engineering. Moreover, the accuracy of the
learned physical rules increases with the length of the input context window, revealing an in-context version of a neural scaling law. Along
the way, we present a flexible and efficient algorithm for extracting probability density functions of multi-digit numbers directly from LLMs.
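A simplified view of the density-extraction idea (our illustration with made-up conditionals, not the paper's algorithm): because the LM emits numbers digit by digit, the probability of a multi-digit value is the product of per-digit conditionals, and collecting these products over a digit grid yields a full forecast PDF.

```python
import numpy as np

def number_prob(digit_conditionals):
    """digit_conditionals: list of P(d_i | previous digits) for one number."""
    return float(np.prod(digit_conditionals))

# Pretend conditionals for the 3-digit value "742" at one forecast step:
# P("7") * P("4" | "7") * P("2" | "74").
print(number_prob([0.31, 0.45, 0.22]))
```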
17:15 - 17:30 - Flagler
Text-Tuple-Table: Towards Information Integration in Text-to-Table Generation via Global Tuple Extraction
Zheye Deng, Chunkit Chan, Weiqi Wang, Yuxi Sun, Wei Fan, Tianshi Zheng, Yauwai Yim, Yangqiu Song
The task of condensing large chunks of textual information into concise and structured tables has gained attention recently due to the emer-
gence of Large Language Models (LLMs) and their potential benefit for downstream tasks, such as text summarization and text mining.
Previous approaches often generate tables that directly replicate information from the text, limiting their applicability in broader contexts,
as text-to-table generation in real-life scenarios necessitates information extraction, reasoning, and integration. However, there is a lack
of both datasets and methodologies towards this task. In this paper, we introduce LiveSum, a new benchmark dataset created for generat-
ing summary tables of competitions based on real-time commentary texts. We evaluate the performances of state-of-the-art LLMs on this
task in both fine-tuning and zero-shot settings, and additionally propose a novel pipeline called T3 (Text-Tuple-Table) to improve their performance. Extensive experimental results demonstrate that LLMs still struggle with this task even after fine-tuning, while our approach can offer substantial performance gains without explicit training. Further analyses demonstrate that our method exhibits strong generalization abilities, surpassing previous approaches on several other text-to-table datasets. Our code and data can be found at https://github.com/HKUST-KnowComp/LiveSum.
Information Extraction 1
Nov 13 (Wed) 16:00-17:30 - Room: Monroe
…hack the defense of LLMs and lead to dangerous outputs. However, similar to traditional text adversarial attacks, this approach, while effective, is limited by the challenge of discrete tokens. This gradient-based discrete optimization attack requires over 100,000 LLM calls, and because of the unreadability of the adversarial suffixes, it can be relatively easily penetrated by common defense methods such as perplexity filters. To cope with this challenge, in this paper we propose an Adversarial Suffix Embedding Translation Framework (ASETF), aimed at transforming continuous adversarial suffix embeddings into coherent and understandable text. This method greatly reduces the computational overhead of the attack process and helps to automatically generate multiple adversarial samples, which can be used as data to strengthen LLMs' security defenses. Experimental evaluations were conducted on Llama2, Vicuna, and other prominent LLMs, employing harmful directives sourced from the Advbench dataset. The results indicate that our method significantly reduces the computation time of adversarial suffixes and achieves a much better attack success rate than existing techniques, while significantly enhancing the textual fluency of the prompts. In addition, our approach can be generalized into a broader method for generating transferable adversarial suffixes that can successfully attack multiple LLMs, even black-box LLMs such as ChatGPT and Gemini.
16:15 - 16:30 - Tuttle
Context-Aware Machine Translation with Source Coreference Explanation
Huy Hien Vu, Hidetaka Kamigaito, Taro Watanabe
Classification systems are evaluated in a countless number of papers. However, we find that evaluation practice is often nebulous. Frequently,
metrics are selected without arguments, and blurry terminology invites misconceptions. For instance, many works use so-called ’macro’
metrics to rank systems (e.g., ’macro F1’) but do not clearly specify what they would expect from such a ’macro’ metric. This is problematic,
since picking a metric can affect paper findings as well as shared task rankings, and thus any clarity in the process should be maximized.
Starting from the intuitive concepts of bias and prevalence, we perform an analysis of common evaluation metrics, considering expectations
as found expressed in papers. Equipped with a thorough understanding of the metrics, we survey metric selection in recent shared tasks of
Natural Language Processing. The results show that metric choices are often not supported with convincing arguments, an issue that can make
any ranking seem arbitrary. This work aims at providing an overview of and guidance for more informed and transparent metric selection, fostering meaningful evaluation.
16:30 - 16:45 - Tuttle
Efficient Sequential Decision Making with Large Language Models
Dingyang Chen, Qi Zhang, Yinglun Zhu
This paper focuses on extending the success of large language models (LLMs) to sequential decision making. Existing efforts either (i) re-
train or finetune LLMs for decision making, or (ii) design prompts for pretrained LLMs. The former approach suffers from the computational
burden of gradient updates, and the latter approach does not show promising results. In this paper, we propose a new approach that leverages online model selection algorithms to efficiently incorporate LLM agents into sequential decision making. Statistically, our approach significantly outperforms both traditional decision making algorithms and vanilla LLM agents. Computationally, our approach avoids the need for expensive gradient updates of LLMs, and throughout the decision making process it requires only a small number of LLM calls. We conduct extensive experiments to verify the effectiveness of our proposed approach. As an example, on a large-scale Amazon dataset, our approach achieves more than a 6x performance gain over baselines while calling LLMs in only 1.5% of the time steps.
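To illustrate the general idea (this toy is ours, not the paper's algorithm), online model selection can treat each candidate decision-maker, e.g., a cheap bandit policy and a pretrained LLM agent, as an arm and learn which to trust, so the expensive LLM is queried only when the selector still favors it:

```python
import numpy as np

rng = np.random.default_rng(1)

def ucb_select(mean_reward, counts, t, c=0.7):
    # Upper-confidence-bound selection over candidate decision-makers.
    bonus = c * np.sqrt(np.log(t + 1) / np.maximum(counts, 1))
    scores = np.where(counts == 0, np.inf, mean_reward + bonus)
    return int(np.argmax(scores))

# Arm 0: cheap traditional policy; arm 1: costly "LLM agent" (worse here).
true_reward = {0: 0.75, 1: 0.35}
mean_reward, counts, llm_calls = np.zeros(2), np.zeros(2), 0
for t in range(2000):
    arm = ucb_select(mean_reward, counts, t)
    llm_calls += (arm == 1)
    r = float(rng.random() < true_reward[arm])   # bandit feedback
    counts[arm] += 1
    mean_reward[arm] += (r - mean_reward[arm]) / counts[arm]
print(f"LLM queried in {llm_calls / 2000:.1%} of steps")
```

The selector concentrates on whichever arm is actually better, which is how the LLM-call budget stays small without gradient updates.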
16:45 - 17:00 - Tuttle
Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
Yung-Sung Chuang, Linlu Qiu, Cheng-Yu Hsieh, Ranjay Krishna, Yoon Kim, James R. Glass
When asked to summarize articles or answer questions given a passage, large language models (LLMs) can hallucinate details and respond with unsubstantiated answers that are inaccurate with respect to the input context. This paper describes a simple approach for detecting such contextual hallucinations. We hypothesize that contextual hallucinations are related to the extent to which an LLM attends to information in the provided context versus its own generations. Based on this intuition, we propose a simple hallucination detection model whose input features are given by the ratio of attention weights on the context versus newly generated tokens (for each attention head). We find that a linear classifier based on these lookback ratio features is as effective as a richer detector that utilizes the entire hidden states of an LLM or a text-based entailment model. The lookback ratio-based detector, Lookback Lens, is found to transfer across tasks and even models, allowing a detector that is trained on a 7B model to be applied (without retraining) to a larger 13B model. We further apply this detector to mitigate contextual hallucinations, and find that a simple classifier-guided decoding approach is able to reduce the amount of hallucination, for example by 9.6% in the XSum summarization task.
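A toy computation of the lookback-ratio feature (shapes and names are ours, for illustration): for one attention head at one decoding step, the feature is the share of attention mass placed on the provided context versus the tokens generated so far.

```python
import numpy as np

def lookback_ratio(attn_weights, n_context):
    """attn_weights: attention distribution over all previous positions
    (context tokens first, then already-generated tokens)."""
    on_context = attn_weights[:n_context].sum()
    on_generated = attn_weights[n_context:].sum()
    return on_context / (on_context + on_generated)

# 8 context tokens, 4 generated so far; this head mostly looks back at context.
attn = np.array([0.10, 0.12, 0.15, 0.08, 0.09, 0.11, 0.05, 0.10,
                 0.06, 0.05, 0.04, 0.05])
print(f"lookback ratio: {lookback_ratio(attn, n_context=8):.2f}")
# Stacking per-step, per-head ratios gives the feature vector fed to the
# linear classifier that flags likely contextual hallucinations.
```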
NLP Applications 4
Nov 14 (Thu) 10:30-12:00 - Room: Ashe Auditorium
Standardize: Aligning Language Models with Expert-Defined Standards for Content Generation
Joseph Marvin Imperial, Gail Forey, Harish Tayyar Madabushi
Domain experts across engineering, healthcare, and education follow strict standards for producing quality content such as technical manuals,
medication instructions, and children’s reading materials. However, current works in controllable text generation have yet to explore using
these standards as references for control. Towards this end, we introduce Standardize, a retrieval-style in-context learning-based framework
to guide large language models to align with expert-defined standards. Focusing on English language standards in the education domain as
a use case, we consider the Common European Framework of Reference for Languages (CEFR) and Common Core Standards (CCS) for
the task of open-ended content generation. Our findings show that models can achieve a 45% to 100% increase in precise accuracy across the open and commercial LLMs evaluated, demonstrating that extracting knowledge artifacts from standards and integrating them into the generation process can effectively guide models to produce better standard-aligned content.
…applied to different transformer-based architectures. We show new SOTA performance on three different real-time change detection tasks.
11:45 - 12:00 - Brickell
The Empirical Variability of Narrative Perceptions of Social Media Texts
Joel Mire, Maria Antoniak, Elliott Ash, Andrew Piper, Maarten Sap
Most NLP work on narrative detection has focused on prescriptive definitions of stories crafted by researchers, leaving open the questions:
how do crowd workers perceive texts to be a story, and why? We investigate this by building StoryPerceptions, a dataset of 2,496 perceptions
of storytelling in 502 social media texts from 255 crowd workers, including categorical labels along with free-text storytelling rationales, au-
thorial intent, and more. We construct a fine-grained bottom-up taxonomy of crowd workers’ varied and nuanced perceptions of storytelling
by open-coding their free-text rationales. Through comparative analyses at the label and code level, we illuminate patterns of disagreement
among crowd workers and across other annotation contexts, including prescriptive labeling from researchers and LLM-based predictions.
Notably, plot complexity, references to generalized or abstract actions, and holistic aesthetic judgments (such as a sense of cohesion) are
especially important in disagreements. Our empirical findings broaden understanding of the types, relative importance, and contentiousness
of features relevant to narrative detection, highlighting opportunities for future work on reader-contextualized models of narrative reception.
Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering
Helena Bonaldi, Greta Damo, Nicolás Benjamín Ocampo, Elena Cabrio, Serena Villata, Marco Guerini
The potential effectiveness of counterspeech as a hate speech mitigation strategy is attracting increasing interest in the NLG research com-
munity, particularly towards the task of automatically producing it. However, automatically generated responses often lack the argumentative
richness which characterises expert-produced counterspeech. In this work, we focus on two aspects of counterspeech generation to produce
more cogent responses. First, by investigating the tension between helpfulness and harmlessness of LLMs, we test whether the presence of
safety guardrails hinders the quality of the generations. Secondly, we assess whether attacking a specific component of the hate speech results
in a more effective argumentative strategy to fight online hate. By conducting an extensive human and automatic evaluation, we show how the presence of safety guardrails can be detrimental even for a task that inherently aims at fostering positive social interactions. Moreover, our results show that attacking a specific component of the hate speech, in particular its implicit negative stereotype and its hateful parts, leads to higher-quality generations.
11:45 - 12:00 - Flagler
PsyGUARD: An Automated System for Suicide Detection and Risk Assessment in Psychological Counseling
Huachuan Qiu, Lizhi Ma, Zhenzhong Lan
As awareness of mental health issues grows, online counseling support services are becoming increasingly prevalent worldwide. Detect-
ing whether users express suicidal ideation in text-based counseling services is crucial for identifying and prioritizing at-risk individuals.
However, the lack of domain-specific systems to facilitate fine-grained suicide detection and corresponding risk assessment in online coun-
seling poses a significant challenge for automated crisis intervention aimed at suicide prevention. In this paper, we propose PsyGUARD, an
automated system for detecting suicide ideation and assessing risk in psychological counseling. To achieve this, we first develop a detailed
taxonomy for detecting suicide ideation based on foundational theories. We then curate a large-scale, high-quality dataset called PsySUICIDE
for suicide detection. To evaluate the capabilities of automated systems in fine-grained suicide detection, we establish a range of baselines.
Subsequently, to assist automated services in providing safe, helpful, and tailored responses for further assessment, we propose to build a suite
of risk assessment frameworks. Our study not only provides an insightful analysis of the effectiveness of automated risk assessment systems
based on fine-grained suicide detection but also highlights their potential to improve mental health services on online counseling platforms.
Code, data, and models are available at https://github.com/qiuhuachuan/PsyGUARD.
Language Modeling 4
Nov 14 (Thu) 10:30-12:00 - Room: Monroe
…gains in model accuracy on downstream evaluations. Then, we categorize the most widely used data source, web crawl snapshots, across the
attributes of toxicity, quality, type of speech, and domain. Finally, we show how such attribute information can be used to further refine and
improve the quality of a pretraining set. These findings constitute an actionable set of steps that practitioners can use to develop high quality
pretraining sets.
11:30 - 11:45 - Monroe
Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs
Oded Ovadia, Menachem Brief, Moshik Mishaeli, Oren Elisha
Large language models (LLMs) encapsulate a vast amount of factual information within their pre-trained weights, as evidenced by their
ability to answer diverse questions across different domains. However, this knowledge is inherently limited, relying heavily on the char-
acteristics of the training data. Consequently, using external datasets to incorporate new information or refine the capabilities of LLMs on
previously seen information poses a significant challenge. In this study, we compare two common approaches: unsupervised fine-tuning and
retrieval-augmented generation (RAG). We evaluate both approaches on a variety of knowledge-intensive tasks across different topics. Our
findings reveal that while unsupervised fine-tuning offers some improvement, RAG consistently outperforms it, both for existing knowledge
encountered during training and entirely new knowledge. Moreover, we find that LLMs struggle to learn new factual information through
unsupervised fine-tuning, and that exposing them to numerous variations of the same fact during training could alleviate this problem.
11:45 - 12:00 - Monroe
User Inference Attacks on Large Language Models
Nikhil Kandpal, Krishna Pillutla, Alina Oprea, Peter Kairouz, Christopher A. Choquette-Choo, Zheng Xu
Text written by humans makes up the vast majority of the data used to pre-train and fine-tune large language models (LLMs). Many sources of this data, like code, forum posts, personal websites, and books, are easily attributed to one or a few "users". In this paper, we ask if it is possible to infer whether any of a user's data was used to train an LLM. Not only would this constitute a breach of privacy, but it would also enable users to detect when their data was used for training. We develop the first effective attacks for user inference, at times with near-perfect success, against LLMs. Our attacks are easy to employ, requiring only black-box access to an LLM and a few samples from the user, which need not be the ones that were trained on. We find, both theoretically and empirically, that certain properties make users more susceptible to user inference: being an outlier, having highly correlated examples, and contributing a larger fraction of the training data. Based on these findings, we identify several methods for mitigating user inference, including training with example-level differential privacy, removing within-user duplicate examples, and reducing a user's contribution to the training data. Though these provide partial mitigation, our work highlights the need to develop methods to fully protect LLMs from user inference.
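A schematic version of such an attack statistic, with made-up numbers (our simplification, not the paper's exact test): average a per-document log-likelihood ratio between the target LLM and a reference model over a handful of the user's samples, then threshold; users whose data was trained on tend to score higher.

```python
import numpy as np

rng = np.random.default_rng(2)

def user_inference_score(target_loglik, reference_loglik):
    # Mean log-likelihood ratio across the user's documents.
    return float(np.mean(np.asarray(target_loglik) - np.asarray(reference_loglik)))

# Pretend per-document log-likelihoods for an in-training and a held-out user:
# the target model assigns higher likelihood to text it has seen.
member = user_inference_score(rng.normal(-2.0, 0.3, 5), rng.normal(-2.4, 0.3, 5))
non_member = user_inference_score(rng.normal(-2.4, 0.3, 5), rng.normal(-2.4, 0.3, 5))
print(f"member score {member:.2f} vs non-member score {non_member:.2f}")
```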
While most transliteration research focuses on single tokens such as named entities (for example, transliterating the name of the city Ahmedabad from the Gujarati script to the Latin script), the informal romanization prevalent in South Asia and elsewhere often requires transliteration of full sentences. The lack of large parallel text collections of full-sentence (as opposed to single-word) transliterations necessitates incorporating contextual information into transliteration via non-parallel resources, such as mono-script text collections. In this article, we present a number of methods for improving transliteration in context for such a use scenario. Some of these methods in fact improve performance without making use of sentential context, allowing for better quantification of the degree to which contextual information in particular is responsible for system improvements. Our final systems, which ultimately rely upon ensembles including large pretrained language models fine-tuned on simulated parallel data, yield substantial improvements over the best previously reported results for full-sentence transliteration from Latin to native script on all 12 languages in the Dakshina dataset (Roark et al. 2020), with an overall 3.3% absolute (18.6% relative) mean word-error rate reduction.
10:45 - 11:00 - Tuttle
Dotless Arabic text for Natural Language Processing
Irfan Ahmad, Maged S. Al-Shaibani
This paper introduces a novel representation of Arabic text as an alternative approach for Arabic NLP, inspired by the dotless script of ancient Arabic. We explored this representation through extensive analysis on various text corpora, differing in size and domain, tokenized using multiple tokenization techniques. Furthermore, we examined the information density of this representation and compared it with standard dotted Arabic text using text entropy analysis. Utilizing parallel corpora, we also drew comparisons between Arabic and English text analysis to gain additional insights. Our investigation extended to various upstream and downstream NLP tasks, including language modeling, text classification, sequence labeling, and machine translation, examining the implications of both representations. Specifically, we performed seven different downstream tasks using various tokenization schemes, comparing the standard dotted text with the dotless representation. Performance with the two representations was comparable across different tokenizations; however, the dotless representation achieves these results with significantly smaller vocabularies, in some scenarios showing reductions of up to 50%. Additionally, we present a system that restores dots to dotless Arabic text, which is useful for tasks that require Arabic text as output.
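An illustrative dot-stripping function (our sketch; the paper's exact mapping may differ): each dotted Arabic letter is mapped to its undotted skeleton (rasm), so several distinct letters collapse to one form, which is why the vocabulary shrinks.

```python
# Partial dotted-to-dotless mapping; e.g. the letters ب ت ث ن share one skeleton.
DOTLESS = {
    "ب": "ٮ", "ت": "ٮ", "ث": "ٮ", "ن": "ٮ", "ي": "ى",
    "ج": "ح", "خ": "ح", "ذ": "د", "ز": "ر", "ش": "س",
    "ض": "ص", "ظ": "ط", "غ": "ع", "ف": "ڡ", "ق": "ٯ",
    "ة": "ه",
}

def to_dotless(text: str) -> str:
    # Characters without a dotless counterpart pass through unchanged.
    return "".join(DOTLESS.get(ch, ch) for ch in text)

print(to_dotless("تجربة"))  # four of the five letters collapse to skeletons
```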
RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs
John Dang, Arash Ahmadian, Kelly Marchisio, Julia Kreutzer, Ahmet Üstün, Sara Hooker
Preference optimization techniques have become a standard final stage for training state-of-the-art large language models (LLMs). However, despite widespread adoption, the vast majority of work to date has focused on a small set of high-resource languages like English and Chinese. This not only captures a small fraction of the world's languages but also makes it unclear which aspects of current state-of-the-art research transfer to a multilingual setting. In this work, we perform an exhaustive study to achieve a new state of the art in aligning multilingual LLMs.
We introduce a novel, scalable method for generating high-quality multilingual feedback data to balance data coverage. We establish the
benefits of cross-lingual transfer and increased dataset size in preference training. Our preference-trained model achieves a 54.4% win-rate
against Aya 23 8B, the current state-of-the-art multilingual LLM in its parameter class, and a 69.5% win-rate or higher against widely used
models like Gemma, Mistral and Llama 3. As a result of our efforts, we expand the frontier of alignment techniques to 23 languages, covering
approximately half of the world’s population.
11:30 - 11:45 - Tuttle
Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners
Shimao Zhang, Changjiang Gao, Wenhao Zhu, Jiajun Chen, Xin Huang, Xue Han, Junlan Feng, Chao Deng, Shujian Huang
Recently, Large Language Models (LLMs) have shown impressive language capabilities, though most exhibit very unbalanced performance across different languages. Multilingual alignment based on translation parallel data is an effective method to enhance LLMs' multilingual capabilities. In this work, we discover and comprehensively investigate the spontaneous multilingual alignment of LLMs. First, we find that LLMs instruction-tuned on question translation data (i.e., without annotated answers) are able to encourage alignment between English and a wide range of languages, even including those unseen during instruction tuning. Additionally, we utilize different settings and mechanistic interpretability methods to comprehensively analyze the LLM's performance in the multilingual scenario. Our work suggests that LLMs have enormous potential for improving multilingual alignment efficiently, with strong language generalization and task generalization.
11:45 - 12:00 - Tuttle
What is "Typological Diversity" in NLP?
Esther Ploeger, Wessel Poelman, Miryam de Lhoneux, Johannes Bjerva
The NLP research community has devoted increased attention to languages beyond English, resulting in considerable improvements for multilingual NLP. However, these improvements only apply to a small subset of the world's languages. An increasing number of papers aspire to enhance generalizable multilingual performance across languages. To this end, linguistic typology is commonly used to motivate language selection, on the basis that a broad typological sample ought to imply generalization across a broad range of languages. These selections are often described as being typologically diverse. In this meta-analysis, we systematically investigate NLP research that includes claims regarding typological diversity. We find that there are no set definitions or criteria for such claims. We introduce metrics to approximate the diversity of the resulting language samples along several axes and find that the results vary considerably across papers. Crucially, we show that skewed language selection can lead to overestimated multilingual performance. We recommend that future work include an operationalization of typological diversity that empirically justifies the diversity of language samples. To help facilitate this, we release the code for our diversity measures.
This paper introduces Filtered Corpus Training, a method that trains language models (LMs) on corpora with certain linguistic constructions
filtered out from the training data, and uses it to measure the ability of LMs to perform linguistic generalization on the basis of indirect
evidence. We apply the method to both LSTM and Transformer LMs (of roughly comparable size), developing filtered corpora that target a
wide range of linguistic phenomena. Our results show that while transformers are better qua LMs (as measured by perplexity), both models
perform equally and surprisingly well on linguistic generalization measures, suggesting that they are capable of generalizing from indirect
evidence.
14:45 - 15:00 - Ashe Auditorium
Prompts have evil twins
Rimon Melamed, Lucas Hurley McCabe, Tanay Wakhare, Yejin Kim, H. Howie Huang, Enric Boix-Adserà
We discover that many natural-language prompts can be replaced by corresponding prompts that are unintelligible to humans but that provably
elicit similar behavior in language models. We call these prompts “evil twins” because they are obfuscated and uninterpretable (evil), but at
the same time mimic the functionality of the original natural-language prompts (twins). Remarkably, evil twins transfer between models. We
find these prompts by solving a maximum-likelihood problem which has applications of independent interest.
…adapting models using test samples at inference time. However, current ASR TTA methods have largely focused on non-continual TTA, which limits cross-sample knowledge learning compared to continual TTA. In this work, we first propose a Fast-slow TTA framework for ASR that leverages the advantages of both continual and non-continual TTA. Following this framework, we introduce Dynamic SUTA (DSUTA), an entropy-minimization-based continual TTA method for ASR. To enhance DSUTA's robustness on time-varying data, we design a dynamic reset strategy that automatically detects domain shifts and resets the model, making it more effective at handling multi-domain data. Our method demonstrates superior performance on various noisy ASR datasets, outperforming both non-continual and continual TTA baselines while maintaining robustness to domain changes without requiring domain boundary information.
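A bare-bones entropy-minimization TTA step of the kind SUTA-style methods build on (a schematic, not the authors' code; the linear layer stands in for an ASR encoder head): adapt a few parameters so the model becomes more confident on the incoming test utterance.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

model = torch.nn.Linear(16, 10)  # stand-in for an adapted ASR head
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def tta_step(features):
    logits = model(features)                          # (frames, vocab)
    # Mean per-frame entropy of the output distribution; no labels needed.
    entropy = -(F.softmax(logits, -1) * F.log_softmax(logits, -1)).sum(-1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return entropy.item()

x = torch.randn(20, 16)                               # one unlabeled utterance
print([round(tta_step(x), 4) for _ in range(3)])      # entropy should decrease
# DSUTA additionally monitors such signals over a stream of utterances and
# resets the model when a domain shift is detected.
```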
14:45 - 15:00 - Brickell
EH-MAM: Easy-to-Hard Masked Acoustic Modeling for Self-Supervised Speech Representation Learning
Ashish Seth, Ramaneswaran S, S Sakshi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
In this paper, we present EH-MAM (Easy-to-Hard adaptive Masked Acoustic Modeling), a novel self-supervised learning approach for speech
representation learning. In contrast to the prior methods that use random masking schemes for Masked Acoustic Modeling (MAM), we in-
troduce a novel selective and adaptive masking strategy. Specifically, during SSL training, we progressively introduce harder regions to the
model for reconstruction. Our approach automatically selects hard regions and is built on the observation that the reconstruction loss of
individual frames in MAM can provide natural signals to judge the difficulty of solving the MAM pre-text task for that frame. To identify
these hard regions, we employ a teacher model that first predicts the frame-wise losses and then decides which frames to mask. By learning
to create challenging problems, such as identifying harder frames and solving them simultaneously, the model is able to learn more effec-
tive representations and thereby acquire a more comprehensive understanding of the speech. Quantitatively, EH-MAM outperforms several
state-of-the-art baselines across various low-resource speech recognition and SUPERB benchmarks by 5%-10%. Additionally, we conduct a
thorough analysis to show that the regions masked by EH-MAM effectively capture useful context across speech frames.
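A sketch of the easy-to-hard selection idea (our simplification, with hypothetical names): a teacher predicts a per-frame reconstruction loss, and the frames it deems hardest are masked, with the masking budget growing over training.

```python
import numpy as np

rng = np.random.default_rng(3)

def select_mask(predicted_frame_loss, step, total_steps,
                min_ratio=0.05, max_ratio=0.30):
    # Linearly grow the fraction of masked frames as training progresses.
    ratio = min_ratio + (max_ratio - min_ratio) * step / total_steps
    k = max(1, int(ratio * len(predicted_frame_loss)))
    return np.argsort(predicted_frame_loss)[-k:]      # hardest k frames

frame_loss = rng.random(50)                           # teacher's loss estimates
print(sorted(select_mask(frame_loss, step=100, total_steps=1000)))
print(sorted(select_mask(frame_loss, step=900, total_steps=1000)))
```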
15:00 - 15:15 - Brickell
Interventional Speech Noise Injection for ASR Generalizable Spoken Language Understanding
YeonJoon Jung, Jaeseong Lee, Seungtaek Choi, Dohyeon Lee, Minsoo Kim, seung-won hwang
Recently, pre-trained language models (PLMs) have been increasingly adopted in spoken language understanding (SLU). However, automatic
speech recognition (ASR) systems frequently produce inaccurate transcriptions, leading to noisy inputs for SLU models, which can signifi-
cantly degrade their performance. To address this, our objective is to train SLU models to withstand ASR errors by exposing them to noises
commonly observed in ASR systems, referred to as ASR-plausible noises. Speech noise injection (SNI) methods have pursued this objective
by introducing ASR-plausible noises, but we argue that these methods are inherently biased towards specific ASR systems, or ASR-specific
noises. In this work, we propose a novel, less biased augmentation method that introduces noises plausible to any ASR system by cutting off the non-causal effects of noise. Experimental results and analyses demonstrate the effectiveness of our proposed method in enhancing the robustness and generalizability of SLU models against unseen ASR systems, by introducing more diverse and plausible ASR noises in advance.
15:15 - 15:30 - Brickell
SPIRIT-LM: Interleaved Spoken and Written Language Model
Tu Anh Nguyen, Benjamin Muller, Bokai Yu, Marta Costa-jussà, Maha Elbayad, Sravya Popuri, Paul-Ambroise Duquenne, Robin Algayres, Ruslan Mavlyutov, Itai Gat, Gabriel Synnaeve, Juan Pino, Benoît Sagot, Emmanuel Dupoux, Christophe Ropers, Mary Williamson
We introduce SPIRIT-LM, a foundation multimodal language model that freely mixes text and speech. Our model is based on a 7B pretrained
text language model that we extend to the speech modality by continuously training it on text and speech units. Speech and text sequences are
concatenated as a single stream of tokens, and trained with a word-level interleaving method using a small automatically-curated speech-text
parallel corpus. SPIRIT-LM comes in two versions: a BASE version that uses speech semantic units and an EXPRESSIVE version that
models expressivity using pitch and style units in addition to the semantic units. For both versions, the text is encoded with subword BPE
tokens. The resulting model displays both the semantic abilities of text models and the expressive abilities of speech models. Additionally,
we demonstrate that SPIRIT-LM is able to learn new tasks in a few-shot fashion across modalities (i.e. ASR, TTS, Speech Classification).
CoCoLoFa: A Dataset of News Comments with Common Logical Fallacies Written by LLM-Assisted Crowds
Min-Hsuan Yeh, Ruyuan Wan, Ting-Hao Kenneth Huang
Detecting logical fallacies in texts can help users spot argument flaws, but automating this detection is not easy. Manually annotating fal-
lacies in large-scale, real-world text data to create datasets for developing and validating detection models is costly. This paper introduces
CoCoLoFa, the largest known logical fallacy dataset, containing 7,706 comments for 648 news articles, with each comment labeled for fallacy
presence and type. We recruited 143 crowd workers to write comments embodying specific fallacy types (e.g., slippery slope) in response
to news articles. Recognizing the complexity of this writing task, we built an LLM-powered assistant into the workers’ interface to aid in
drafting and refining their comments. Experts rated the writing quality and labeling validity of CoCoLoFa as high and reliable. BERT-based
models fine-tuned using CoCoLoFa achieved the highest fallacy detection (F1=0.86) and classification (F1=0.87) performance on its test set,
outperforming the state-of-the-art LLMs. Our work shows that combining crowdsourcing and LLMs enables us to more effectively construct
datasets for complex linguistic phenomena that crowd workers find challenging to produce on their own.
14:45 - 15:00 - Flagler
Granular Privacy Control for Geolocation with Vision Language Models
Ethan Mendes, Yang Chen, James Hays, Sauvik Das, Wei Xu, Alan Ritter
Vision Language Models (VLMs) are rapidly advancing in their capability to answer information-seeking questions. As these models are
widely deployed in consumer applications, they could lead to new privacy risks due to emergent abilities to identify people in photos, geolo-
cate images, etc. As we demonstrate, somewhat surprisingly, current open-source and proprietary VLMs are very capable image geolocators,
making widespread geolocation with VLMs an immediate privacy risk, rather than merely a theoretical future concern. As a first step to
address this challenge, we develop a new benchmark, GPTGeoChat, to test the capability of VLMs to moderate geolocation dialogues with
users. We collect a set of 1,000 image geolocation conversations between in-house annotators and GPT-4v, which are annotated with the
granularity of location information revealed at each turn. Using this new dataset, we evaluate the ability of various VLMs to moderate GPT-4v geolocation conversations by determining when too much location information has been revealed. We find that custom fine-tuned models perform on par with prompted API-based models when identifying leaked location information at the country or city level; however, fine-tuning on supervised data appears to be needed to accurately moderate finer granularities, such as the name of a restaurant or building.
15:00 - 15:15 - Flagler
Measuring Psychological Depth in Language Models
Fabrice Y Harel-Canada, Hanyu Zhou, Sreya Muppalla, Zeynep Senahan Yildiz, Miryung Kim, Nanyun Peng, Amit Sahai
Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style,
coherence, and diversity. While these metrics are indispensable, they do not speak to a story’s subjective, psychological impact from a reader’s
perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM’s ability to
produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by
showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff’s alpha). We also explore techniques for automating
the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average
Spearman correlation of 0.51 with human judgment while Llama-3-70B with constrained decoding scores as high as 0.68 for empathy. Finally,
we compared the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed or were statistically indis-
tinguishable from highly-rated human-written stories sourced from Reddit. By shifting the focus from text to reader, the Psychological Depth
Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.
15:15 - 15:30 - Flagler
Exceptions, Instantiations, and Overgeneralization: Insights into How Language Models Process Generics
Emily Allaway, Chandra Bhagavatula, Jena D. Hwang, Kathleen McKeown, Sarah-Jane Leslie
Large language models (LLMs) have garnered a great deal of attention for their exceptional generative performance on commonsense and reasoning tasks. In this work, we investigate LLMs' capabilities for generalization using a particularly challenging type of statement: generics. Generics express generalizations (e.g., birds can fly) but do so without explicit quantification. They are notable because they generalize over their instantiations (e.g., sparrows can fly) yet hold true even in the presence of exceptions (e.g., penguins do not). For humans, these generic generalizations play a fundamental role in cognition, concept acquisition, and intuitive reasoning. We investigate how LLMs respond to and reason about generics. To this end, we first propose a framework grounded in pragmatics to automatically generate both exceptions and instantiations (collectively, exemplars). We make use of focus, a pragmatic phenomenon that highlights meaning-bearing elements in a sentence, to capture the full range of interpretations of generics across different contexts of use. This allows us to derive precise logical definitions for exemplars and operationalize them to automatically generate exemplars from LLMs. Using our system, we generate a dataset of approximately 370k exemplars across approximately 17k generics and conduct a human validation of a sample of the generated data. We use our final generated dataset to investigate how LLMs reason about generics. Humans have a documented tendency to conflate universally quantified statements (e.g., all birds can fly) with generics. Therefore, we probe whether LLMs exhibit similar overgeneralization behavior in terms of quantification and in property inheritance. We find that LLMs do show evidence of overgeneralization, although they sometimes struggle to reason about exceptions. Furthermore, we find that LLMs may exhibit similar non-logical behavior to humans when considering property inheritance from generics.
15:30 - 15:45 - Flagler
Step-by-Step Reasoning to Solve Grid Puzzles: Where do LLMs Falter?
Nemika Tyagi, Mihir Parmar, Mohith Kulkarni, Aswin RRV, Nisarg Patel, Mutsumi Nakamura, Arindam Mitra, Chitta Baral
Solving grid puzzles involves a significant amount of logical reasoning. Hence, it is a good domain to evaluate reasoning capability of a
model which can then guide us to improve the reasoning ability of models. However, most existing works evaluate only the final predicted
answer of a puzzle, without delving into an in-depth analysis of the LLMs’ reasoning chains (such as where they falter) or providing any
finer metrics to evaluate them. Since LLMs may rely on simple heuristics or artifacts to predict the final answer, it is crucial to evaluate the
generated reasoning chain beyond overall correctness measures, for accurately evaluating the reasoning abilities of LLMs. To this end, we
first develop GridPuzzle, an evaluation dataset comprising 274 grid-based puzzles with different complexities. Second, we propose a new
error taxonomy derived from manual analysis of reasoning chains from LLMs including GPT-4, Claude-3, Gemini, Mistral, and Llama-2.
Then, we develop an LLM-based framework for large-scale subjective evaluation (i.e., identifying errors) and an objective metric, PuzzleEval,
to evaluate the correctness of reasoning chains. Evaluating reasoning chains from LLMs leads to several interesting findings. We further show
that existing prompting methods used for enhancing models’ reasoning abilities do not improve performance on GridPuzzle. This highlights
the importance of understanding fine-grained errors and presents a challenge for future research to enhance LLMs’ puzzle-solving abilities by
developing methods that address these errors.
Generation 3
Nov 14 (Thu) 14:00-15:30 - Room: Monroe
both correctness and consistency traits for paraphrased queries. Recently, significant attempts have been made to benchmark datasets and
metrics to evaluate LLMs for these traits. However, structural simplicity (subject-relation-object) and contemporary association in their query
formulation limit the broader definition of factuality and consistency. In this study, we introduce TeCFaP, a novel Temporally Consistent Fac-
tuality Probe task to expand the consistent factuality probe in the temporal dimension. To this end, we propose TEMP-COFAC, a high-quality
dataset of prefix-style English query paraphrases. Subsequently, we extend the definitions of existing metrics to represent consistent factuality across the temporal dimension. We experiment with a diverse set of LLMs and find that most of them perform poorly on TeCFaP. Next, we
propose a novel solution CoTSeLF (Consistent-Time-Sensitive Learning Framework) combining multi-task instruction tuning (MT-IT) with
consistent-time-sensitive reinforcement learning (CTSRL) to improve temporally consistent factuality in LLMs. Our experiments demonstrate
the efficacy of CoTSeLF over several baselines.
Posters and Demos
Demo 1
Nov 12 (Tue) 11:00-12:30 - Room: Riverfront Hall
Bin Xu, Hanming Li, Jiaxi Yuan, Jifan Yu, Juanzi Li, Ruimiao Li, Yan Xuan, Zhanxin Hao, Zhiyuan Liu
Semi-structured interviews are a crucial method of data acquisition in qualitative research. Typically controlled by the interviewer, the pro-
cess progresses through a question-and-answer format, aimed at eliciting information from the interviewee. However, interviews are highly
time-consuming and demand considerable experience of the interviewers, which greatly limits the efficiency and feasibility of data col-
lection. Therefore, we introduce LM-Interview, a novel system designed to automate the process of preparing, conducting and analyzing
semi-structured interviews. Experimental results demonstrate that LM-Interview achieves performance comparable to that of skilled human
interviewers.
Generation 1
Nov 12 (Tue) 11:00-12:30 - Room: Riverfront Hall
features. Here we propose a method for evaluating diversity over syntactic features to characterize general repetition in models, beyond
frequent n-grams. Specifically, we define syntactic templates (e.g., strings comprising parts-of-speech) and show that models tend to produce
templated text in downstream tasks at a higher rate than what is found in human-reference texts. We find that most (76%) templates in model-
generated text can be found in pre-training data (compared to only 35% of human-authored text), and are not overwritten during fine-tuning
or alignment processes such as RLHF. The connection between templates in generated text and the pre-training data allows us to analyze
syntactic templates in models where we do not have the pre-training data. We also find that templates as features are able to differentiate
between models, tasks, and domains, and are useful for qualitatively evaluating common model constructions. Finally, we demonstrate the use
of templates as a useful tool for analyzing style memorization of training data in LLMs.
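As a rough illustration of the mechanism described above, the sketch below counts part-of-speech n-grams and flags recurring sequences as templates. It assumes nltk with the 'punkt' and 'averaged_perceptron_tagger' resources downloaded; the function names and the recurrence threshold are illustrative, not the paper's.

# Minimal sketch of syntactic-template detection: count POS n-grams and
# measure what fraction of text falls under recurring POS sequences.
from collections import Counter
import nltk  # requires nltk.download('punkt') and the perceptron tagger

def pos_ngrams(text, n=4):
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def template_rate(texts, n=4, min_count=3):
    """Fraction of POS n-grams that belong to a recurring template."""
    counts = Counter(g for t in texts for g in pos_ngrams(t, n))
    templates = {g for g, c in counts.items() if c >= min_count}
    total = sum(counts.values())
    return sum(counts[g] for g in templates) / max(total, 1)

outputs = ["The quick brown fox jumps over the lazy dog.",
           "The small gray cat sleeps on the warm mat."]
print(template_rate(outputs))  # higher values suggest more templated text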
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Automatic Instruction Evolving for Large Language Models
Weihao Zeng, Can Xu, Yingxiu Zhao, Jian-Guang Lou, Weizhu Chen
Fine-tuning large pre-trained language models with Evol-Instruct has achieved encouraging results across a wide range of tasks. However, de-
signing effective evolving methods for instruction evolution requires substantial human expertise. This paper proposes Auto Evol-Instruct, an
end-to-end framework that evolves instruction datasets using large language models without any human effort. The framework automatically
analyzes and summarizes suitable evolutionary strategies for the given instruction data and iteratively improves the evolving method based
on issues exposed during the instruction evolution process. Our extensive experiments demonstrate that the best method optimized by Auto
Evol-Instruct outperforms human-designed methods on various benchmarks, including MT-Bench, AlpacaEval, GSM8K, and HumanEval.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Improving Retrieval-augmented Text-to-SQL with AST-based Ranking and Schema Pruning
Zhili Shen, Pavlos Vougiouklis, Chenxin Diao, Kaustubh Vyas, Yuanyi Ji, Jeff Z. Pan
We focus on Text-to-SQL semantic parsing from the perspective of retrieval-augmented generation. Motivated by challenges related to the
size of commercial database schemata and the deployability of business intelligence solutions, we propose ASTReS that dynamically retrieves
input database information and uses abstract syntax trees to select few-shot examples for in-context learning. Furthermore, we investigate the extent to which an in-parallel semantic parser can be leveraged for generating approximated versions of the expected SQL queries, to support our retrieval. We take this approach to the extreme: we adapt a model of fewer than 500M parameters to act as an extremely
efficient approximator, enhancing it with the ability to process schemata in a parallelised manner. We apply ASTReS to monolingual and
cross-lingual benchmarks for semantic parsing, showing improvements over state-of-the-art baselines. Comprehensive experiments highlight
the contribution of modules involved in this retrieval-augmented generation setting, revealing interesting directions for future work.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
A Thorough Examination of Decoding Methods in the Era of LLMs
Chufan Shi, HAORAN YANG, Deng Cai, Zhisong Zhang, Yifan Wang, Yujiu Yang, Wai Lam
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers. Prior
research on decoding methods, primarily focusing on task-specific models, may not extend to the current era of general-purpose large lan-
guage models (LLMs). Moreover, the recent influx of decoding strategies has further complicated this landscape. This paper provides a
comprehensive and multifaceted analysis of various decoding methods within the context of LLMs, evaluating their performance, robustness
to hyperparameter changes, and decoding speeds across a wide range of tasks, models, and deployment environments. Our findings reveal
that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization.
Intriguingly, sensitivity analysis exposes that certain methods achieve superior performance at the cost of extensive hyperparameter tuning,
highlighting the trade-off between attaining optimal results and the practicality of implementation in varying contexts.
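For readers who want to reproduce a comparison like this in miniature, the following sketch contrasts common decoding methods via the Hugging Face transformers generate API. The model name is a stand-in and the paper's actual protocol (tasks, metrics, quantization settings) is far more extensive.

# Minimal sketch comparing decoding methods with Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder for any causal LM
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)
inputs = tok("The capital of France is", return_tensors="pt")

configs = {
    "greedy": dict(do_sample=False),
    "temperature": dict(do_sample=True, temperature=0.7),
    "top-p": dict(do_sample=True, top_p=0.9),
    "top-k": dict(do_sample=True, top_k=50),
}
for label, kwargs in configs.items():
    out = model.generate(**inputs, max_new_tokens=20,
                         pad_token_id=tok.eos_token_id, **kwargs)
    print(label, tok.decode(out[0], skip_special_tokens=True))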
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Learning to Correct for QA Reasoning with Black-box LLMs
Jaehyung Kim, Dongyoung Kim, Yiming Yang
An open challenge in recent machine learning is how to improve the reasoning capability of large language models (LLMs) in a black-box setting, i.e., without access to detailed information such as output token probabilities. Existing approaches either rely on such accessibility (which is often unrealistic) or involve significantly increased train- and inference-time costs. This paper addresses these limitations by proposing a novel approach, namely CoBB (Correct for improving QA reasoning of Black-Box LLMs). It uses a trained
adaptation model to perform a seq2seq mapping from the often-imperfect reasonings of the original black-box LLM to the correct or im-
proved reasonings. Specifically, the adaptation model is initialized with a relatively small open-source LLM and adapted over a collection of
sub-sampled training pairs. To select the representative pairs of correct and incorrect reasonings, we formulated the dataset construction as an
optimization problem that minimizes the statistical divergence between the sampled subset and the entire collection, and solved it via a ge-
netic algorithm. We then train the adaptation model over the sampled pairs by contrasting the likelihoods of correct and incorrect reasonings.
Our experimental results demonstrate that CoBB significantly improves reasoning accuracy across various QA benchmarks, compared to the
best-performing adaptation baselines.
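The genetic-algorithm subset-selection step can be pictured with the sketch below: a binary mask over the candidate pairs is evolved so that the selected subset's feature statistics stay close to the full collection's. Distance between mean feature vectors is used here as a simple stand-in for the paper's statistical divergence, and mutation is omitted for brevity; all names are illustrative.

# Hedged sketch of GA-based subset selection minimizing a divergence proxy.
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 16))   # placeholder features of all pairs
k, pop_size, gens = 100, 40, 200

def fitness(mask):
    return -np.linalg.norm(feats[mask].mean(0) - feats.mean(0))

def random_mask():
    m = np.zeros(len(feats), dtype=bool)
    m[rng.choice(len(feats), size=k, replace=False)] = True
    return m

pop = [random_mask() for _ in range(pop_size)]
for _ in range(gens):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:pop_size // 2]
    children = []
    for _ in range(pop_size - len(survivors)):
        a, b = rng.choice(len(survivors), 2, replace=False)
        union = np.flatnonzero(survivors[a] | survivors[b])  # crossover pool
        child = np.zeros(len(feats), dtype=bool)
        child[rng.choice(union, size=k, replace=False)] = True
        children.append(child)
    pop = survivors + children
pop.sort(key=fitness, reverse=True)
print("best divergence proxy:", -fitness(pop[0]))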
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
PostMark: A Robust Blackbox Watermark for Large Language Models
Yapei Chang, Kalpesh Krishna, Amir Houmansadr, John Frederick Wieting, Mohit Iyyer
The most effective techniques to detect LLM-generated text rely on inserting a detectable signature—or watermark—during the model’s de-
coding process. Most existing watermarking methods require access to the underlying LLM’s logits, which LLM API providers are loath to
share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper,
we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic
embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which
means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking
methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark
on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We
release our code, outputs, and annotations at https://github.com/lilakk/PostMark.
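The word-selection and detection halves of the idea can be sketched as follows. The embedding function is a deterministic placeholder (any real encoder would replace it), the watermark vocabulary is invented for illustration, and in the paper an instruction-following LLM performs the actual insertion.

# Hedged sketch of PostMark-style word selection and presence detection.
import zlib
import numpy as np

def embed(text):  # placeholder embedding; swap in a real encoder
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    v = rng.normal(size=64)
    return v / np.linalg.norm(v)

WATERMARK_VOCAB = ["lantern", "orchard", "velvet", "compass", "ember"]

def watermark_words(text, k=3):
    t = embed(text)
    sims = {w: float(embed(w) @ t) for w in WATERMARK_VOCAB}
    return sorted(sims, key=sims.get, reverse=True)[:k]

def detect(text, words, threshold=0.66):
    present = sum(w in text.lower() for w in words)
    return present / len(words) >= threshold

words = watermark_words("a story about a sailor lost at sea")
print(words)  # these words would be woven in post-hoc by an LLM rewrite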
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Assessing Implicit Retrieval Robustness of Large Language Models
Xiaoyu Shen, Rexhina Blloshmi, Dawei Zhu, Jiahuan Pei, Wei Zhang
Retrieval-augmented generation has gained popularity as a framework to enhance large language models with external knowledge. However,
its effectiveness hinges on the retrieval robustness of the model. If the model lacks retrieval robustness, its performance is constrained by
the accuracy of the retriever, resulting in significant compromises when the retrieved context is irrelevant. In this paper, we evaluate the
“implicit” retrieval robustness of various large language models, instructing them to directly output the final answer without explicitly judging
the relevance of the retrieved context. Our findings reveal that fine-tuning on a mix of gold and distracting context significantly enhances
the model’s robustness to retrieval inaccuracies, while still maintaining its ability to extract correct answers when retrieval is accurate. This
suggests that large language models can implicitly handle relevant or irrelevant retrieved context by learning solely from the supervision of
the final answer in an end-to-end manner. Introducing an additional process for explicit relevance judgment can be unnecessary and disrupts
the end-to-end approach.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
CleanGen: Mitigating Backdoor Attacks for Generation Tasks in Large Language Models
Yuetai Li, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Dinuka Sahabandu, Bhaskar Ramasubramanian, Radha Poovendran
The remarkable performance of large language models (LLMs) in generation tasks has enabled practitioners to leverage publicly available
models to power custom applications, such as chatbots and virtual assistants. However, the data used to train or fine-tune these LLMs is
often undisclosed, allowing an attacker to compromise the data and inject backdoors into the models. In this paper, we develop a novel
inference time defense, named CleanGen, to mitigate backdoor attacks for generation tasks in LLMs. CleanGen is a lightweight and effective
decoding strategy that is compatible with the state-of-the-art (SOTA) LLMs. Our insight behind CleanGen is that compared to other LLMs,
backdoored LLMs assign significantly higher probabilities to tokens representing the attacker-desired contents. These discrepancies in token
probabilities enable CleanGen to identify suspicious tokens favored by the attacker and replace them with tokens generated by another LLM
that is not compromised by the same attacker, thereby avoiding generation of attacker-desired content. We evaluate CleanGen against five
SOTA backdoor attacks. Our results show that CleanGen achieves lower attack success rates (ASR) compared to five SOTA baseline defenses
for all five backdoor attacks. Moreover, LLMs deploying CleanGen maintain helpfulness in their responses when serving benign user queries
with minimal added computational overhead.
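The core decoding intuition can be sketched in a few lines: if the (possibly backdoored) target model assigns a token a much higher probability than a clean reference model does, treat it as suspicious and take the reference model's token instead. Models and the suspicion threshold below are placeholders, not the paper's configuration.

# Hedged sketch of probability-ratio-based token replacement at decode time.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
target = AutoModelForCausalLM.from_pretrained("gpt2")          # suspect model
reference = AutoModelForCausalLM.from_pretrained("distilgpt2")  # clean model

ids = tok("The weather today is", return_tensors="pt").input_ids
RATIO = 5.0  # suspicion threshold on the probability ratio

for _ in range(20):
    with torch.no_grad():
        p_t = target(ids).logits[0, -1].softmax(-1)
        p_r = reference(ids).logits[0, -1].softmax(-1)
    tok_t = int(p_t.argmax())
    suspicious = p_t[tok_t] / (p_r[tok_t] + 1e-9) > RATIO
    next_id = int(p_r.argmax()) if suspicious else tok_t
    ids = torch.cat([ids, torch.tensor([[next_id]])], dim=-1)
print(tok.decode(ids[0]))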
omy, dividing puzzles into rule-based and rule-less categories, to critically assess LLMs through various methodologies, including prompting
techniques, neuro-symbolic approaches, and fine-tuning. Through a critical review of relevant datasets and benchmarks, we assess LLMs’
performance, identifying significant challenges in complex puzzle scenarios. Our findings highlight the disparity between LLM capabilities
and human-like reasoning, particularly in those requiring advanced logical inference. The survey underscores the necessity for novel strate-
gies and richer datasets to advance LLMs’ puzzle-solving proficiency and contribute to AI’s logical reasoning and creative problem-solving
advancements.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Atomic Self-Consistency for Better Long Form Generations
Raghuveer Thirukovalluru, Yukun Huang, Bhuwan Dhingra
Recent work has aimed to improve LLM generations by filtering out hallucinations, thereby improving the precision of the information in
responses. Correctness of a long-form response, however, also depends on the recall of multiple pieces of information relevant to the question.
In this paper, we introduce Atomic Self-Consistency (ASC), a technique for improving the recall of relevant information in an LLM response.
ASC follows recent work, Universal Self-Consistency (USC) in using multiple stochastic samples from an LLM to improve the long-form
response. Unlike USC which only focuses on selecting the best single generation, ASC picks authentic subparts from the samples and merges
them into a superior composite answer. Through extensive experiments and ablations, we show that merging relevant subparts of multiple
samples performs significantly better than picking a single sample. ASC demonstrates significant gains over USC on multiple factoid and open-ended QA datasets (ASQA, QAMPARI, QUEST, ELI5) with ChatGPT and Llama3. Our analysis also reveals untapped potential for
enhancing long-form generations using the approach of merging multiple samples.
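A toy version of the merge step looks like this: split several sampled responses into atomic parts, keep parts that recur across samples, and join the survivors. Recurrence is checked by normalized string match purely for illustration; the paper relies on more robust similarity measures.

# Hedged sketch of atomic-part voting and merging across samples.
import re
from collections import Counter

def atoms(response):
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]

def normalize(s):
    return re.sub(r"[^a-z0-9 ]", "", s.lower())

def asc_merge(samples, min_support=2):
    counts = Counter()
    first_form = {}
    for resp in samples:
        for a in set(map(normalize, atoms(resp))):
            counts[a] += 1
    for resp in samples:
        for a in atoms(resp):
            first_form.setdefault(normalize(a), a)
    kept = [first_form[a] for a, c in counts.items() if c >= min_support]
    return " ".join(kept)

samples = ["Paris is in France. It is the capital.",
           "Paris is in France. It has 2M people.",
           "Paris is in France. It is the capital."]
print(asc_merge(samples))  # keeps parts supported by at least two samples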
tion, which has been trained with our designed multi-perspective consistency verification and rating-oriented preference alignment methods.
Themis can conduct flexible and interpretable evaluations without references, and it exhibits superior evaluation performance on various NLG
tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
CorrSynth - A Correlated Sampling Method for Diverse dataset Generation from LLMs
Abhishek Divekar, Suhas S Kowshik, Vijit Malik
Large language models (LLMs) have demonstrated remarkable performance in diverse tasks using zero-shot and few-shot prompting. Even though their data synthesis capabilities have been well studied in recent years, the generated data suffers from a lack of diversity, limited adherence to the prompt, and potential biases that creep into the data from the generator model. In this work, we tackle the challenge of generating
datasets with high diversity, upon which a student model is trained for downstream tasks. Taking the route of decoding-time guidance-based
approaches, we propose CorrSynth, which generates data that is more diverse and faithful to the input prompt using a correlated sampling
strategy. Further, our method overcomes the complexity drawbacks of some other guidance-based techniques like classifier-based guidance.
With extensive experiments, we show the effectiveness of our approach and substantiate our claims. In particular, we perform intrinsic evalu-
ation to show the improvements in diversity. Our experiments show that CorrSynth improves both student metrics and intrinsic metrics upon
competitive baselines across four datasets, showing the innate advantage of our method.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation
Tu Vu, Kalpesh Krishna, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, Yun-Hsuan Sung
As large language models (LLMs) evolve, evaluating their output reliably becomes increasingly difficult due to the high cost of human eval-
uation. To address this, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on a diverse set of over
100 quality assessment tasks, incorporating 5M+ human judgments curated from publicly released human evaluations. FLAMe outperforms
models like GPT-4 and Claude-3 on various held-out tasks, and serves as a powerful starting point for fine-tuning, as shown in our reward
model evaluation case study (FLAMe-RM). On Reward-Bench, FLAMe-RM-24B achieves 87.8% accuracy, surpassing GPT-4-0125 (85.9%)
and GPT-4o (84.7%). Additionally, we introduce FLAMe-Opt-RM, an efficient tail-patch fine-tuning approach that offers competitive RewardBench performance using 25x fewer training datapoints. Our FLAMe variants outperform popular proprietary LLM-as-a-Judge models
on 8 of 12 autorater benchmarks, covering 53 quality assessment tasks, including RewardBench and LLM-AggreFact. Finally, our analysis
shows that FLAMe is significantly less biased than other LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Label Confidence Weighted Learning for Target-level Sentence Simplification
Jingshen Zhang, Xin Ying Qiu
Multi-level sentence simplification generates simplified sentences with varying language proficiency levels. We propose Label Confidence
Weighted Learning (LCWL), a novel approach that incorporates a label confidence weighting scheme in the training loss of the encoder-
decoder model, setting it apart from existing confidence-weighting methods primarily designed for classification. Experimentation on an English grade-level simplification dataset shows that LCWL outperforms state-of-the-art unsupervised baselines. Fine-tuning the LCWL model on in-domain data and combining it with Symmetric Cross Entropy (SCE) consistently delivers better simplifications compared to strong supervised
methods. Our results highlight the effectiveness of label confidence weighting techniques for text simplification tasks with encoder-decoder
architectures.
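A confidence-weighted training loss of this general shape can be written in a few lines of PyTorch. The sketch below assumes each target sequence carries a confidence score in [0, 1] (e.g., produced by the labeling pipeline); it illustrates the weighting scheme only, not the paper's exact formulation.

# Minimal sketch of a label-confidence-weighted sequence loss.
import torch
import torch.nn.functional as F

def lcwl_loss(logits, targets, confidences, pad_id=0):
    """logits: (B, T, V); targets: (B, T); confidences: (B,)"""
    ce = F.cross_entropy(logits.transpose(1, 2), targets,
                         ignore_index=pad_id, reduction="none")  # (B, T)
    mask = (targets != pad_id).float()
    per_seq = (ce * mask).sum(1) / mask.sum(1).clamp(min=1)
    return (confidences * per_seq).mean()  # down-weight low-confidence labels

logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))
conf = torch.tensor([0.9, 0.4])
print(lcwl_loss(logits, targets, conf))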
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
The Effect of Sampling Temperature on Problem Solving in Large Language Models
Matthew Renze
In this research study, we empirically investigate the effect of sampling temperature on the performance of Large Language Models (LLMs)
on various problem-solving tasks. We created a multiple-choice question-and-answer (MCQA) exam by randomly sampling problems from
standard LLM benchmarks. Then, we used nine popular LLMs with five prompt-engineering techniques to solve the MCQA problems while
increasing the sampling temperature from 0.0 to 1.6. Despite anecdotal reports to the contrary, our empirical results indicate that changes in
temperature from 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. In addition, these
results appear to generalize across LLMs, prompt-engineering techniques, and problem domains. All code, data, and supplemental materials
are available on GitHub at: https://github.com/matthewrenze/jhu-llm-temperature
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Con-
straints
Thomas Palmeira Ferraz, Kartik Mehta, Yu-Hsiang Lin, Haw-Shiuan Chang, Shereen Oraby, Sijia Liu, Vivek Subramanian, Tagyoung Chung,
Mohit Bansal, Nanyun Peng
Instruction following is a key capability for LLMs. However, recent studies have shown that LLMs often struggle with instructions contain-
ing multiple constraints (e.g. a request to create a social media post "in a funny tone" with "no hashtag"). Despite this, most evaluations
focus solely on synthetic data. To address this, we introduce RealInstruct, the first benchmark designed to evaluate LLMs’ ability to follow
real-world multi-constrained instructions by leveraging queries real users asked AI assistants. We also investigate model-based evaluation as
a cost-effective alternative to human annotation for this task. Our findings reveal that even the proprietary GPT-4 model fails to meet at least
one constraint on over 21% of instructions, highlighting the limitations of state-of-the-art models. To address the performance gap between
open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline, which enhances
LLMs’ ability to follow constraints. DeCRIM works by decomposing the original instruction into a list of constraints and using a Critic model
to decide when and where the LLM's response needs refinement. Our results show that DeCRIM improves Mistral's performance by 7.3% on
RealInstruct and 8.0% on IFEval even with weak feedback. Moreover, we demonstrate that with strong feedback, open-source LLMs with
DeCRIM can outperform GPT-4 on both benchmarks.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Compare without Despair: Reliable Preference Evaluation with Generation Separability
Sayan Ghosh, Tejas Srinivasan, Swabha Swayamdipta
Human evaluation of generated language through pairwise preference judgments is pervasive. However, under common scenarios, such as when generations from a model pair are very similar, or when stochastic decoding results in large variations in generations, this results in inconsistent preference ratings. We address these challenges by introducing a meta-evaluation measure, separability, which estimates how suitable
a test instance is for pairwise preference evaluation. For a candidate test instance, separability samples multiple generations from a pair of
models, and measures how distinguishable the two sets of generations are. Our experiments show that instances with high separability values
yield more consistent preference ratings from both human- and auto-raters. Further, the distribution of separability allows insights into which
test benchmarks are more valuable for comparing models. Finally, we incorporate separability into ELO ratings, accounting for how suitable
each test instance might be for reliably ranking LLMs. Overall, separability has implications for consistent, efficient and robust preference
evaluation of LLMs with both human- and auto-raters.
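One way to picture separability: sample several generations per model for a single test instance and compare within-model similarity to cross-model similarity; a large gap means the two models' outputs are distinguishable. The similarity function below is a cheap placeholder, and the exact formulation is the paper's, not this sketch's.

# Hedged sketch of a separability-style distinguishability score.
from itertools import combinations, product
from difflib import SequenceMatcher

def sim(a, b):  # stand-in; embedding or n-gram similarity would also work
    return SequenceMatcher(None, a, b).ratio()

def separability(gens_a, gens_b):
    within = [sim(x, y) for g in (gens_a, gens_b)
              for x, y in combinations(g, 2)]
    across = [sim(x, y) for x, y in product(gens_a, gens_b)]
    return sum(within) / len(within) - sum(across) / len(across)

gens_a = ["The cat sat on the mat.", "The cat sat on a mat."]
gens_b = ["Dogs bark loudly at night.", "Dogs often bark at night."]
print(separability(gens_a, gens_b))  # near 0 => hard to rate reliably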
Industry
Nov 12 (Tue) 11:00-12:30 - Room: Jasmine
Runhui Wang, Yefan Tao, Adit Krishnan, Luyang Kong, Xuanqing Liu, Yuqian Deng, Yunzhao Yang, Henrik Johnson, Andrew Borthwick,
Shobhit Gupta, Aditi Sinha, Davor Golac
Data deduplication is a critical task in data management and mining, focused on consolidating duplicate records that refer to the same entity.
Personally Identifiable Information (PII) is a critical class of data for deduplication across various industries. Consumer data, stored and
generated through various engagement channels, is crucial for marketers, agencies, and publishers. However, a major challenge to PII data
deduplication is the lack of open-source benchmark datasets due to stringent privacy concerns, which hinders the research, development, and
evaluation of robust solutions. This paper addresses this critical lack of PII deduplication benchmarks by introducing the first open-source,
high-quality dataset for this task. We provide two datasets: one with 1,000,000 unlabeled synthetic PII profiles and a subset of 10,000 pairs
curated and labeled by trained annotators as matches or non-matches. Our datasets contain synthetic profiles built from publicly available
sources that do not represent any real individuals, thus ensuring privacy and ethical compliance. We provide several challenging data varia-
tions to evaluate the effectiveness of various deduplication techniques, including traditional supervised methods, deep-learning approaches,
and large language models (LLMs). Our work aims to set a new standard for PII deduplication, paving the way for more accurate and secure
solutions. We share our data publicly at this link https://zenodo.org/records/12774140.
Nov 12 (Tue) 11:00-12:30 - Jasmine
MERLIN: Multimodal Embedding Refinement via LLM-based Iterative Navigation for Text-Video Retrieval-Rerank Pipeline
Donghoon Han, Eunhwan Park, Gisang Lee, Adam Lee, Nojun Kwak
The rapid expansion of multimedia content has made accurately retrieving relevant videos from large collections increasingly challenging.
Recent advancements in text-video retrieval have focused on cross-modal interactions, large-scale foundation model training, and proba-
bilistic modeling, yet often neglect the crucial user perspective, leading to discrepancies between user queries and the content retrieved. To
address this, we introduce MERLIN (Multimodal Embedding Refinement via LLM-based Iterative Navigation), a novel, training-free pipeline
that leverages Large Language Models (LLMs) for iterative feedback learning. MERLIN refines query embeddings from a user perspective,
enhancing alignment between queries and video content through a dynamic question answering process. Experimental results on datasets
like MSR-VTT, MSVD, and ActivityNet demonstrate that MERLIN substantially improves Recall@1, outperforming existing systems and
confirming the benefits of integrating LLMs into multimodal retrieval systems for more responsive and context-aware multimedia retrieval.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Identifying High Consideration E-Commerce Search Queries
Zhiyu Chen, Jason Ingyu Choi, Besnik Fetahu, Shervin Malmasi
In e-commerce, high consideration search missions typically require careful and elaborate decision making, and involve a substantial re-
search investment from customers. We consider the task of identifying High Consideration (HC) queries. Identifying such queries enables
e-commerce sites to better serve user needs using targeted experiences such as curated QA widgets that help users reach purchase decisions.
We explore the task by proposing an Engagement-based Query Ranking (EQR) approach, focusing on query ranking to indicate potential
engagement levels with query-related shopping knowledge content during product search. Unlike previous studies on predicting trends, EQR
prioritizes query-level features related to customer behavior, finance, and catalog information rather than popularity signals. We introduce
an accurate and scalable method for EQR and present experimental results demonstrating its effectiveness. Offline experiments show strong
ranking performance. Human evaluation shows a precision of 96% for HC queries identified by our model. The model was commercially
deployed, and shown to outperform human-selected queries in terms of downstream customer impact, as measured through engagement.
mitigate this issue, our work proposes to regularize the cross-entropy loss with an in-scope embedding reconstruction loss learned using an
auto-encoder. Our method achieves a 1-4% improvement in the area under the precision-recall curve for rejecting out-of-sample (OOS) in-
stances, without compromising intent classification performance.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Can Machine Unlearning Reduce Social Bias in Language Models?
Omkar Dige, Diljot Arneja, Tsz Fung Yau, Qixuan Zhang, Mohammad Bolandraftar, Xiaodan Zhu, Faiza Khan Khattak
Mitigating bias in language models (LMs) has become a critical problem due to the widespread deployment of LMs in the industry and
customer-facing applications. Numerous approaches revolve around data pre-processing and subsequent fine-tuning of language models,
tasks that can be both time-consuming and computationally demanding. As alternatives, machine unlearning techniques are being explored,
yet there is a notable lack of comparative studies evaluating the effectiveness of these methods. In this work, we explore the effectiveness
of two machine unlearning methods: Partitioned Contrastive Gradient Unlearning (PCGU) applied on decoder models, and Negation via
Task Vector, and compare them with Direct Preference Optimization (DPO) to reduce social biases in open-source LMs such as LLaMA-2
and OPT. We also implement distributed PCGU for large models. It is empirically shown, through quantitative and qualitative analyses, that the Negation via Task Vector method outperforms PCGU and is comparable to DPO in debiasing models with minimal deterioration in model
performance and perplexity. Negation via Task Vector reduces the bias score by 25.5% for LLaMA-2 and achieves bias reduction of up to
40% for OPT models. Moreover, it can be easily tuned to balance the trade-off between bias reduction and generation quality, unlike DPO.
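Negation via task vector reduces to simple parameter arithmetic, which the sketch below illustrates: form the task vector as the difference between a model fine-tuned on the unwanted behavior and the base model, then subtract a scaled copy from the base. The model names and scaling coefficient are placeholders; the coefficient is the knob behind the tunable trade-off mentioned above.

# Hedged sketch of task-vector negation via state-dict arithmetic.
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")
finetuned = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for a biased fine-tune
lam = 0.8  # trades bias reduction against generation quality

base_sd = base.state_dict()
ft_sd = finetuned.state_dict()
# debiased = base - lam * (finetuned - base)
debiased_sd = {k: base_sd[k] - lam * (ft_sd[k] - base_sd[k]) for k in base_sd}

debiased = AutoModelForCausalLM.from_pretrained("gpt2")
debiased.load_state_dict(debiased_sd)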
Nov 12 (Tue) 11:00-12:30 - Jasmine
Don’t be my Doctor! Recognizing Healthcare Advice in Large Language Models
Kellen Tan Cheng, Anna Lisa Gentile, Pengyuan Li, Chad Deluca, Guang-Jie Ren
Large language models (LLMs) have seen increasing popularity in daily use, with their widespread adoption by many corporations as virtual
assistants, chatbots, predictors, and many more. The growing influence of industry corporations in this field raises the need for safeguards and guardrails to ensure that the outputs from LLMs do not mislead or harm users. This is especially true for highly regulated domains such as healthcare, where misleading advice may influence users to unknowingly commit malpractice. Despite this vulnerability, the majority of guardrail benchmarking datasets do not focus specifically on medical advice. In this paper, we present the HEaL benchmark (HEalth Advice in LLMs), a health-advice benchmark dataset that has been manually curated and annotated to evaluate LLMs' capability in recognizing health advice, which we use to safeguard LLMs deployed in industrial settings. We use HEaL to assess several models and report a detailed analysis of the findings.
Nov 12 (Tue) 11:00-12:30 - Jasmine
OMG-QA: Building Open-Domain Multi-Modal Generative Question Answering Systems
Linyong Nan, Weining Fang, Aylin Rasteh, Pouya Lahabi, Weijin Zou, Yilun Zhao, Arman Cohan
We introduce OMG-QA, a new resource for question answering that is designed to evaluate the effectiveness of question answering systems
that perform retrieval augmented generation (RAG) in scenarios that demand reasoning on multi-modal, multi-document contexts. These
systems, given a user query, must retrieve relevant contexts from the web, which may include non-textual information, and then reason and
synthesize these contents to generate a detailed, coherent answer. Unlike existing open-domain QA datasets, OMG-QA requires systems
to navigate and integrate diverse modalities and a broad pool of information sources, making it uniquely challenging. We conduct a thor-
ough evaluation and analysis of a diverse set of QA systems, featuring various retrieval frameworks, document retrievers, document indexing
approaches, evidence retrieval methods, and LLMs tasked with both information retrieval and generation. Our findings reveal significant limi-
tations in existing approaches using RAG or LLM agents to address open questions that require long-form answers supported by multi-modal
evidence. We believe that OMG-QA will be a valuable resource for developing QA systems that are better equipped to handle open-domain,
multi-modal information-seeking tasks.
A Cost-Efficient Modular Sieve for Extracting Product Information from Company Websites
Anna Hätty, Dragan Milchevski, Kersten Döring, Marko Putnikovic, Mohsen Mesgar, Filip Novovi, Maximilian Braun, Karina Leoni Bori-
mann, Igor Stranjanac
Extracting product information is crucial for informed business decisions and strategic planning across multiple industries. However, re-
cent methods relying only on large language models (LLMs) are resource-intensive and computationally prohibitive due to website structure
differences and numerous non-product pages. To address these challenges, we propose a novel modular method that leverages low-cost
classification models to filter out company web pages, significantly reducing computational costs. Our approach consists of three modules:
web page crawling, product page classification using efficient machine learning models, and product information extraction using LLMs on
classified product pages. We evaluate our method on a new dataset of about 7000 product and non-product web pages, achieving a 6-point
improvement in F1-score, 95% reduction in computational time, and 87.5% reduction in cost compared to end-to-end LLMs. Our research
demonstrates the effectiveness of our proposed low-cost classification module to identify web pages containing product information, making
product information extraction more effective and cost-efficient.
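The sieve structure (cheap classifier first, expensive LLM second) can be sketched as below, where the classifier is an off-the-shelf scikit-learn pipeline and extract_with_llm is a placeholder for the costly extraction module; training data and threshold are invented for illustration.

# Hedged sketch of a classifier-gated LLM extraction pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_pages = ["Buy the X200 drill, 18V, $129", "About us and our history"]
train_labels = [1, 0]  # 1 = product page, 0 = non-product page

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(train_pages, train_labels)

def extract_with_llm(page):  # placeholder for the expensive LLM call
    return {"raw": page[:40]}

def sieve(pages, threshold=0.5):
    probs = clf.predict_proba(pages)[:, 1]
    return [extract_with_llm(p) for p, pr in zip(pages, probs) if pr >= threshold]

print(sieve(["X300 saw, 1200W, now $89", "Contact form and directions"]))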
Nov 12 (Tue) 11:00-12:30 - Jasmine
Predicting Entity Salience in Extremely Short Documents
Benjamin Bullough, Harrison Lundberg, Chen Hu, Weihang Xiao
A frequent challenge in applications that use entities extracted from text documents is selecting the most salient entities when only a small
number can be used by the application (e.g., displayed to a user). Solving this challenge is particularly difficult in the setting of extremely
short documents, such as the response from a digital assistant, where traditional signals of salience such as position and frequency are less
likely to be useful. In this paper, we propose a lightweight and data-efficient approach for entity salience detection on short text documents.
Our experiments show that our approach achieves competitive performance with respect to complex state-of-the-art models, such as GPT-4,
at a significant advantage in latency and cost. In limited data settings, we show that a semi-supervised fine-tuning process can improve
performance further. Furthermore, we introduce a novel human-labeled dataset for evaluating entity salience on short question-answer pair
documents.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Centrality-aware Product Retrieval and Ranking
Hadeel Saadany, Swapnil Bhosale, Samarth Agrawal, Constantin Orasan, Zhe Wu, Diptesh Kanojia
This paper addresses the challenge of improving user experience on e-commerce platforms by enhancing product ranking relevant to user’s
search queries. Ambiguity and complexity of user queries often lead to a mismatch between user’s intent and retrieved product titles or
documents. Recent approaches have proposed the use of Transformer-based models which need millions of annotated query-title pairs during
the pre-training stage, and this data often does not take user intent into account. To tackle this, we curate samples from existing datasets at
eBay, manually annotated with buyer-centric relevance scores, and centrality scores which reflect how well the product title matches the user's intent. We introduce a User-intent Centrality Optimization (UCO) approach for existing models, which optimizes for the user intent in seman-
tic product search. To that end, we propose a dual-loss based optimization to handle hard negatives, i.e., product titles that are semantically
relevant but do not reflect the user’s intent. Our contributions include curating challenging evaluation sets and implementing UCO, resulting
in significant improvements in product ranking efficiency, observed for different evaluation metrics. Our work aims to ensure that the most
buyer-centric titles for a query are ranked higher, thereby enhancing the user experience on e-commerce platforms.
(LLMs) for effective and interpretable legal case retrieval. By incorporating professional legal knowledge about crimes and law articles, we
enable large language models to accurately reformulate the original legal case into concise sub-facts of crimes, which contain the essential
information of the case. Extensive experiments on two legal case retrieval benchmarks demonstrate superior retrieval performance and ro-
bustness on complex legal case queries of KELLER over existing methods.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Large Language Models as Foundations for Next-Gen Dense Retrieval: A Comprehensive Empirical Assessment
Kun Luo, Minghao Qin, Zheng Liu, Shitao Xiao, Jun Zhao, Kang Liu
Pre-trained language models like BERT and T5 serve as crucial backbone encoders for dense retrieval. However, these models often exhibit
limited generalization capabilities and face challenges in improving in-domain accuracy. Recent research has explored using large language
models (LLMs) as retrievers, achieving state-of-the-art performance across various tasks. Despite these advancements, the specific benefits of
LLMs over traditional retrievers and the impact of different LLM configurations (such as parameter sizes, pre-training duration, and alignment processes) on retrieval tasks remain unclear. In this work, we conduct a comprehensive empirical study on a wide range of retrieval tasks,
including in-domain accuracy, data efficiency, zero-shot generalization, lengthy retrieval, instruction-based retrieval, and multi-task learning.
We evaluate over 15 different backbone LLMs and non-LLMs. Our findings reveal that larger models and extensive pre-training consistently
enhance in-domain accuracy and data efficiency. Additionally, larger models demonstrate significant potential in zero-shot generalization,
lengthy retrieval, instruction-based retrieval, and multi-task learning. These results underscore the advantages of LLMs as versatile and effec-
tive backbone encoders in dense retrieval, providing valuable insights for future research and development in this field.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
CHIQ: Contextual History Enhancement for Improving Query Rewriting in Conversational Search
Fengran Mo, Abbas Ghaddar, Kelong Mao, Mehdi Rezagholizadeh, Boxing Chen, Qun Liu, Jian-Yun Nie
In this paper, we study how open-source large language models (LLMs) can be effectively deployed for improving query rewriting in con-
versational search, especially for ambiguous queries. We introduce CHIQ, a two-step method that leverages the capabilities of LLMs to
resolve ambiguities in the conversation history before query rewriting. This approach contrasts with prior studies that predominantly use
closed-source LLMs to directly generate search queries from conversation history. We demonstrate on five well-established benchmarks that
CHIQ leads to state-of-the-art results across most settings, showing highly competitive performances with systems leveraging closed-source
LLMs. Our study provides a first step towards leveraging open-source LLMs in conversational search, as a competitive alternative to the
prevailing reliance on commercial LLMs. Data, models, and source code will be publicly available upon acceptance at https://github.com/fengranMark/CHIQ.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval
Shengyao Zhuang, Xueguang Ma, Bevan Koopman, Jimmy Lin, Guido Zuccon
Utilizing large language models (LLMs) for zero-shot document ranking is done in one of two ways: (1) prompt-based re-ranking methods,
which require no further training but are only feasible for re-ranking a handful of candidate documents due to computational costs; and (2)
unsupervised contrastive trained dense retrieval methods, which can retrieve relevant documents from the entire corpus but require a large
amount of paired text data for contrastive training. In this paper, we propose PromptReps, which combines the advantages of both categories:
no need for training and the ability to retrieve from the whole corpus. Our method only requires prompts to guide an LLM to generate query
and document representations for effective document retrieval. Specifically, we prompt the LLMs to represent a given text using a single word,
and then use the last token’s hidden states and the corresponding logits associated with the prediction of the next token to construct a hybrid
document retrieval system. The retrieval system harnesses both dense text embedding and sparse bag-of-words representations given by the LLM. Our experimental evaluation on the MSMARCO, TREC deep learning and BEIR zero-shot document retrieval datasets illustrates that
this simple prompt-based LLM retrieval method can achieve a similar or higher retrieval effectiveness than state-of-the-art LLM embedding
methods that are trained with large amounts of unsupervised data, especially when using a larger LLM.
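The two representations are easy to extract in code. The sketch below prompts a causal LM to compress a passage into one word, then takes the final position's hidden state as the dense vector and its next-token logits as a sparse bag-of-words weighting; prompt wording and model are placeholders rather than the paper's setup.

# Hedged sketch of dense + sparse representations from a one-word prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def prompt_reps(text, top_k=10):
    prompt = f'Passage: "{text}". Represent the passage with one word: "'
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    dense = out.hidden_states[-1][0, -1]   # last layer, last token
    logits = out.logits[0, -1]             # next-token distribution
    top = torch.topk(logits, top_k)
    sparse = {tok.decode([int(i)]).strip(): float(v)
              for i, v in zip(top.indices, top.values)}
    return dense, sparse

dense, sparse = prompt_reps("Solar panels convert sunlight into electricity.")
print(dense.shape, sparse)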
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Unifying Multimodal Retrieval via Document Screenshot Embedding
Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin
In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document
parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, prone to errors, and has information
loss. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a uni-
fied input format, which does not require any content extraction preprocess and preserves all the information in a document (e.g., text, image
and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval.
To evaluate our method, we first craft the dataset of Wiki-SS, a 1.3M Wikipedia web page screenshots as the corpus to answer the questions
from the Natural Questions dataset. In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to
other text retrieval methods relying on parsing. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally,
in a mixed-modality task of slide retrieval, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10. These
experiments show that DSE is an effective document retrieval paradigm for diverse types of documents. Model checkpoints, code, and Wiki-
SS collection will be released.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
GENRA: Enhancing Zero-shot Retrieval with Rank Aggregation
Georgios Katsimpras, Georgios Paliouras
Large Language Models (LLMs) have been shown to effectively perform zero-shot document retrieval, a process that typically consists of
two steps: i) retrieving relevant documents, and ii) re-ranking them based on their relevance to the query. This paper presents GENRA, a
new approach to zero-shot document retrieval that incorporates rank aggregation to improve retrieval effectiveness. Given a query, GENRA
first utilizes LLMs to generate informative passages that capture the query’s intent. These passages are then employed to guide the retrieval
process, selecting similar documents from the corpus. Next, we use LLMs again for a second refinement step. This step can be configured
for either direct relevance assessment of each retrieved document or for re-ranking the retrieved documents. Ultimately, both approaches
ensure that only the most relevant documents are kept. Upon this filtered set of documents, we perform multi-document retrieval, generating
individual rankings for each document. As a final step, GENRA leverages rank aggregation, combining the individual rankings to produce a
single refined ranking. Extensive experiments on benchmark datasets demonstrate that GENRA improves existing approaches, highlighting
the effectiveness of the proposed methodology in zero-shot retrieval.
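The abstract does not pin down the aggregation function, so the sketch below uses reciprocal rank fusion, a standard rank aggregator, purely to illustrate how individual rankings can be combined into one refined ranking.

# Hedged sketch of rank aggregation via reciprocal rank fusion.
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ordered lists of doc ids (best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# one ranking per generated passage / refinement pass
rankings = [["d3", "d1", "d2"], ["d1", "d3", "d4"], ["d3", "d4", "d1"]]
print(reciprocal_rank_fusion(rankings))  # -> ['d3', 'd1', 'd4', 'd2']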
rerankers have showcased superior performance and generalizability compared to existing supervised approaches. However, conventional
listwise LLM reranking methods lack efficiency as they provide ranking output in the form of a generated ordered sequence of candidate pas-
sage identifiers. Further, they are trained with the typical language modeling objective, which treats all ranking errors uniformly, potentially at
the cost of misranking highly relevant passages. Addressing these limitations, we introduce FIRST, a novel listwise LLM reranking approach
leveraging the output logits of the first generated identifier to directly obtain a ranked ordering of the candidates. Further, we incorporate a
learning-to-rank loss during training, prioritizing ranking accuracy for the more relevant passages. Empirical results demonstrate that FIRST
accelerates inference by 50% while maintaining a robust ranking performance with gains across the BEIR benchmark. Finally, to illustrate
the practical effectiveness of listwise LLM rerankers, we investigate their application in providing relevance feedback for retrievers during
inference. Our results show that LLM rerankers can provide a stronger distillation signal compared to cross-encoders, yielding substantial
improvements in retriever recall after relevance feedback.
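First-token listwise reranking of this kind can be pictured as follows: label the candidates A, B, C, ..., ask the model which is most relevant, and rank candidates by the first output position's logits over the label tokens. The prompt and model below are placeholders, and the learning-to-rank training is not shown.

# Hedged sketch of single-forward-pass listwise reranking from first-token logits.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def first_token_rank(query, passages):
    labels = [chr(ord("A") + i) for i in range(len(passages))]
    listing = "\n".join(f"{l}. {p}" for l, p in zip(labels, passages))
    prompt = f"Query: {query}\n{listing}\nMost relevant passage:"
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    label_ids = [tok.encode(" " + l)[0] for l in labels]  # leading-space tokens
    order = torch.argsort(logits[label_ids], descending=True)
    return [passages[int(i)] for i in order]

print(first_token_rank("capital of France",
                       ["Berlin is in Germany.", "Paris is France's capital."]))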
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Matryoshka-Adaptor: Unsupervised and Supervised Tuning for Smaller Embedding Dimensions
Jinsung Yoon, Rajarishi Sinha, Sercan O Arik, Tomas Pfister
Embeddings from Large Language Models (LLMs) have emerged as critical components in various applications, particularly for information
retrieval. While high-dimensional embeddings generally demonstrate superior performance as they contain more salient information, their
practical application is frequently hindered by elevated computational latency and the associated higher cost. To address these challenges,
we propose Matryoshka-Adaptor, a novel tuning framework designed for the customization of LLM embeddings. Matryoshka-Adaptor fa-
cilitates substantial dimensionality reduction while maintaining comparable performance levels, thereby achieving a significant enhancement
in computational efficiency and cost-effectiveness. Our framework directly modifies the embeddings from pre-trained LLMs and is designed to be seamlessly integrated with any LLM architecture, encompassing those accessible exclusively through black-box APIs. Also, it
exhibits efficacy in both unsupervised and supervised learning settings. A rigorous evaluation conducted across a diverse corpus of English,
multilingual, and multimodal datasets consistently reveals substantial gains with Matryoshka-Adaptor. Notably, with Google and OpenAI
Embedding APIs, Matryoshka-Adaptor achieves a reduction in dimensionality ranging from two- to twelve-fold without compromising per-
formance across multiple BEIR datasets.
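An unsupervised objective in this spirit can be sketched as below: train a small residual adaptor so that the first d' dimensions of the adapted embedding preserve the pairwise-similarity structure of the original full-dimensional embeddings. The architecture, objective, and dimensions are illustrative assumptions, not the paper's specification.

# Hedged sketch of a similarity-preserving adaptor for truncated embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

full_dim, small_dim = 256, 32
emb = F.normalize(torch.randn(512, full_dim), dim=-1)  # frozen LLM embeddings

adaptor = nn.Sequential(nn.Linear(full_dim, full_dim), nn.ReLU(),
                        nn.Linear(full_dim, full_dim))
opt = torch.optim.Adam(adaptor.parameters(), lr=1e-3)

target_sim = emb @ emb.T  # full-dimensional similarity structure to preserve
for step in range(200):
    adapted = adaptor(emb) + emb                      # residual connection
    small = F.normalize(adapted[:, :small_dim], dim=-1)  # truncated embedding
    loss = ((small @ small.T - target_sim) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))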
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
MixGR: Enhancing Retriever Generalization for Scientific Domain through Complementary Granularity
Fengyu Cai, Xinran Zhao, Tong Chen, Sihao Chen, Hongming Zhang, Iryna Gurevych, Heinz Koeppl
Recent studies show the growing significance of document retrieval in the generation of LLMs, i.e., RAG, within the scientific domain by
bridging their knowledge gap. However, dense retrievers often struggle with domain-specific retrieval and complex query-document rela-
tionships, particularly when query segments correspond to various parts of a document. To alleviate such prevalent challenges, this paper
introduces MixGR, which improves dense retrievers’ awareness of query-document matching across various levels of granularity in queries
and documents using a zero-shot approach. MixGR fuses various metrics based on these granularities into a unified score that reflects a com-
prehensive query-document similarity. Our experiments demonstrate that MixGR outperforms previous document retrieval by 24.7%, 9.8%,
and 6.9% on nDCG@5 with unsupervised, supervised, and LLM-based retrievers, respectively, averaged on queries containing multiple sub-
queries from five scientific retrieval datasets. Moreover, the efficacy of two downstream scientific question-answering tasks highlights the
advantage of MixGR to boost the application of LLMs in the scientific domain.
one is the pointwise mutual information (PMI) between a unit and its context, which turns out to yield the same predictive power as surprisal
when controlling for unigram frequency. Moreover, both PMI and surprisal are correlated with frequency. This means that neither PMI nor
surprisal contains information about context alone. In response to this, we propose a technique where we project surprisal onto the orthogonal
complement of frequency, yielding a new contextual predictor that is uncorrelated with frequency. Our experiments show that the proportion of variance in reading times explained by context is substantially smaller when context is represented by the orthogonalized predictor. From an interpretability standpoint, this indicates that previous studies may have overstated the role that context plays in predicting reading times.
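The orthogonalization amounts to residualizing surprisal against frequency, which the sketch below illustrates with synthetic data: regress surprisal on (log) unigram frequency by least squares and keep the residuals, giving a contextual predictor uncorrelated with frequency by construction.

# Minimal sketch of projecting surprisal onto frequency's orthogonal complement.
import numpy as np

rng = np.random.default_rng(0)
log_freq = rng.normal(size=500)
surprisal = -0.8 * log_freq + rng.normal(scale=0.5, size=500)  # correlated

X = np.column_stack([np.ones_like(log_freq), log_freq])
beta, *_ = np.linalg.lstsq(X, surprisal, rcond=None)
ortho_surprisal = surprisal - X @ beta  # residuals = orthogonalized predictor

print(np.corrcoef(log_freq, surprisal)[0, 1])        # strongly non-zero
print(np.corrcoef(log_freq, ortho_surprisal)[0, 1])  # ~0 by construction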
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Leading Whitespaces of Language Models’ Subword Vocabulary Poses a Confound for Calculating Word Probabilities
Byung-Doh Oh, William Schuler
Predictions of word-by-word conditional probabilities from Transformer-based language models are often evaluated to model the incremental
processing difficulty of human readers. In this paper, we argue that there is a confound posed by the most common method of aggregating
subword probabilities of such language models into word probabilities. This is due to the fact that tokens in the subword vocabulary of
most language models have leading whitespaces and therefore do not naturally define stop probabilities of words. We first prove that this
can result in distributions over word probabilities that sum to more than one, thereby violating the axiom that P(Ω) = 1. This property
results in a misallocation of word-by-word surprisal, where the unacceptability of the end of the current word is incorrectly carried over to the
next word. Additionally, this implicit prediction of word boundaries incorrectly models psycholinguistic experiments where human subjects
directly observe upcoming word boundaries. We present a simple decoding technique to reaccount the probability of the trailing whitespace
into that of the current word, which resolves this confound. Experiments show that this correction reveals lower estimates of garden-path
effects in transitive/intransitive sentences and poorer fits to naturalistic reading times.
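The boundary reassignment can be pictured as follows: with a leading-whitespace subword vocabulary like GPT-2's, the probability that the next token starts a new word is folded back into the current word. The sketch computes only that boundary mass; it is an illustration of the reassignment, not the full evaluation pipeline.

# Hedged sketch of measuring word-boundary probability mass under a
# leading-whitespace vocabulary (GPT-2's 'Ġ'-prefixed tokens).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# vocabulary ids whose token form begins a new word (leading whitespace)
space_ids = torch.tensor([i for i in range(len(tok))
                          if tok.convert_ids_to_tokens(i).startswith("Ġ")])

ids = tok("The keys to the cabinet", return_tensors="pt").input_ids
with torch.no_grad():
    probs = model(ids).logits[0, -1].softmax(-1)

boundary_mass = float(probs[space_ids].sum())
print("P(current word ends here):", boundary_mass)
# corrected word log-prob = sum of its subword log-probs + log(boundary_mass)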
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Semantic Training Signals Promote Hierarchical Syntactic Generalization in Transformers
Aditya Yedetore, Najoung Kim
Neural networks without hierarchical biases often struggle to learn linguistic rules that come naturally to humans. However, neural networks
are trained primarily on form alone, while children acquiring language additionally receive data about meaning. Would neural networks
generalize more like humans when trained on both form and meaning? We investigate this by examining if Transformers—neural networks
without a hierarchical bias—better achieve hierarchical generalization when trained on both form and meaning compared to when trained on
form alone. Our results show that Transformers trained on form and meaning do favor the hierarchical generalization more than those trained
on form alone, suggesting that statistical learners without hierarchical biases can leverage semantic training signals to bootstrap hierarchical
syntactic generalization.
express emotions. In psychology, variation in the ability of individuals to differentiate between emotion concepts is called emotion granu-
larity (determined through self-reports of one’s emotions). High emotion granularity has been linked with better mental and physical health;
whereas low emotion granularity has been linked with maladaptive emotion regulation strategies and poor health outcomes. In this work,
we propose computational measures of emotion granularity derived from temporally-ordered speaker utterances in social media (in lieu of self-reports that suffer from various biases). We then investigate the effectiveness of such text-derived measures of emotion granularity in
functioning as markers of various mental health conditions (MHCs). We establish baseline measures of emotion granularity derived from
textual utterances, and show that, at an aggregate level, emotion granularities are significantly lower for people self-reporting as having an
MHC than for the control population. This paves the way towards a better understanding of the MHCs, and specifically the role emotions play
in our well-being.
In the realm of vision models, the primary mode of representation is using pixels to rasterize the visual world. Yet this is not always the
best or unique way to represent visual content, especially for designers and artists who depict the world using geometry primitives such as
polygons. Vector graphics (VG), on the other hand, offer a textual representation of visual content, which can be more concise and powerful
for content like cartoons, sketches and scientific figures. Recent studies have shown promising results on processing vector graphics with
capable Large Language Models (LLMs). However, such works focus solely on qualitative results, understanding, or a specific type of vector
graphics. We propose VGBench, a comprehensive benchmark for LLMs on handling vector graphics through diverse aspects, including (a)
both visual understanding and generation, (b) evaluation of various vector graphics formats, (c) diverse question types, (d) wide range of
prompting techniques, (e) under multiple LLMs and (f) comparison with VLMs on rasterized representations. Evaluating on our collected
4279 understanding and 5845 generation samples, we find that LLMs show strong capability on both aspects while exhibiting less desirable
performance on low-level formats (SVG). Both data and evaluation pipeline will be open-sourced.
Nov 12 (Tue) 11:00-12:30 - Jasmine
VIMI: Grounding Video Generation through Multi-modal Instruction
Yuwei Fang, Willi Menapace, Aliaksandr Siarohin, Tsai-Shien Chen, Kuan-Chieh Wang, Ivan Skorokhodov, Graham Neubig, Sergey Tulyakov
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. This limitation stems from the absence of
large-scale multimodal prompt video datasets, resulting in a lack of visual grounding and restricting their versatility and application in mul-
timodal integration. To address this, we construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context
examples with the given text prompts and then utilize a two-stage training strategy to enable diverse video generation tasks within a model.
In the first stage, we propose a multimodal conditional video generation framework for pretraining on these augmented datasets, establishing
a foundational model for grounded video generation. Secondly, we fine-tune the model from the first stage on various video generation tasks,
incorporating multimodal instructions. This process further refines the model’s ability to handle diverse inputs and tasks, ensuring seamless
integration of multimodal information. After this two-stage training process, VIMI demonstrates multimodal understanding capabilities, pro-
ducing contextually rich and personalized videos grounded in the provided inputs, as shown in Figure 1. Compared to previous subject-driven
video generation methods, our generator can synthesize consistent and temporally coherent videos with large motion while retaining the
semantic control. Our generator also achieves state-of-the-art text-to-video generation results on UCF101 benchmark.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP
Samyadeep Basu, Shell Xu Hu, Maziar Sanjabi, Daniela Massiceti, Soheil Feizi
Image-text contrastive models like CLIP have wide applications in zero-shot classification, image-text retrieval, and transfer learning. How-
ever, they often struggle on compositional visio-linguistic tasks (e.g., attribute-binding or object-relationships) where their performance is no
better than random chance. To address this, we introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP’s
compositional visio-linguistic reasoning. Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image gen-
erative models like Stable-Diffusion, which are known for their strong visio-linguistic reasoning abilities. On the challenging Winoground
benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts
performance by up to 3%. This work underscores the potential of well-designed distillation objectives from generative models to enhance
contrastive image-text models with improved visio-linguistic reasoning capabilities.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models
Jeonghwan Kim, Heng Ji
Recent advances in instruction-tuned Large Vision-Language Models (LVLMs) have imbued the models with the ability to generate high-level,
image-grounded explanations with ease. While such capability is largely attributed to the rich world knowledge contained within the Large
Language Models (LLMs), our work reveals their shortcomings in fine-grained visual categorization (FGVC) across six different benchmark
settings. Most recent state-of-the-art LVLMs such as LLaVa-1.5, InstructBLIP and GPT-4V not only severely deteriorate in terms of clas-
sification performance, e.g., average drop of 65.58 in EM for Stanford Dogs for LLaVA-1.5, but also struggle to generate descriptive visual
attributes based on a concept that appears within an input image despite their prominent zero-shot image captioning ability. In-depth analyses
show that instruction-tuned LVLMs suffer from a modality gap, exhibiting discrepancies when given textual and visual inputs that correspond
to the same concept. In an effort to further the community’s endeavor in this direction, we propose a multiple granularity attribute-centric
benchmark and training mixture, Finer, which aims to establish a ground to evaluate LVLMs’ fine-grained visual comprehension ability and
provide significantly improved explainability.
Nov 12 (Tue) 11:00-12:30 - Jasmine
UOUO: Uncontextualized Uncommon Objects for Measuring Knowledge Horizons of Vision Language Models
Xinyu Pi, Mingyuan Wu, Jize Jiang, Haozhen Zheng, Beitong Tian, ChengXiang Zhai, Klara Nahrstedt, Zhiting Hu
Smaller-scale Vision-Language Models (VLMs) often claim to perform on par with larger models in general-domain visual grounding and
question-answering benchmarks while offering advantages in computational efficiency and storage. However, their ability to handle rare
objects, which fall into the long tail of data distributions, is less understood. To rigorously evaluate this aspect, we introduce the "Uncon-
textualized Uncommon Objects" (UOUO) benchmark. This benchmark focuses on systematically testing VLMs with both large and small
parameter counts on rare and specialized objects. Our comprehensive analysis reveals that while smaller VLMs maintain competitive perfor-
mance on common datasets, they significantly underperform on tasks involving uncommon objects. We also propose an advanced, scalable
pipeline for data collection and cleaning, ensuring the UOUO benchmark provides high-quality, challenging instances. These findings high-
light the need to consider long-tail distributions when assessing the true capabilities of VLMs. Code and project details for UOUO can be
found at https://zoezheng126.github.io/UOUO-Website/.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Encoding and Controlling Global Semantics for Long-form Video Question Answering
Thong Thanh Nguyen, Zhiyuan Hu, Xiaobao Wu, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
Seeking answers effectively for long videos is essential to build video question answering (videoQA) systems. Previous methods adaptively
select frames and regions from long videos to save computation. However, this fails to reason over the whole video sequence, leading
to sub-optimal performance. To address this problem, we introduce a state space layer (SSL) into multi-modal Transformer to efficiently
integrate global semantics of the video, which mitigates the video information loss caused by frame and region selection modules. Our
SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations. To further enhance the
controllability, we introduce a cross-modal compositional congruence objective to encourage global semantics aligned with the question. To
rigorously evaluate long-form videoQA capacity, we construct two new benchmarks Ego-QA and MAD-QA featuring videos of considerably
long length, i.e. 17.5 minutes and 1.9 hours, respectively. Extensive experiments demonstrate the superiority of our framework on these new
as well as existing datasets.
Nov 12 (Tue) 11:00-12:30 - Jasmine
The Factuality Tax of Diversity-Intervened Text-to-Image Generation: Benchmark and Fact-Augmented Intervention
Yixin Wan, Di Wu, Haoran Wang, Kai-Wei Chang
Prompt-based “diversity interventions” are commonly adopted to improve the diversity of Text-to-Image (T2I) models depicting individuals
with various racial or gender traits. However, will this strategy result in nonfactual demographic distribution, especially when generating
real historical figures? In this work, we propose **DemOgraphic FActualIty Representation (DoFaiR)**, a benchmark to systematically
quantify the trade-off between using diversity interventions and preserving demographic factuality in T2I models. DoFaiR consists of 756
meticulously fact-checked test instances to reveal the factuality tax of various diversity prompts through an automated evidence-supported
evaluation pipeline. Experiments on DoFaiR unveil that diversity-oriented instructions increase the number of different gender and racial
groups in DALLE-3’s generations at the cost of historically inaccurate demographic distributions. To resolve this issue, we propose **Fact-
Augmented Intervention** (FAI), which instructs a Large Language Model (LLM) to reflect on verbalized or retrieved factual information
about gender and racial compositions of generation subjects in history, and incorporate it into the generation context of T2I models. By orient-
ing model generations using the reflected historical truths, FAI significantly improves the demographic factuality under diversity interventions
while preserving diversity.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
Yue Fan, Lei Ding, Ching-Chen Kuo, Shan Jiang, Yang Zhao, Xinze Guan, Jie Yang, Yi Zhang, Xin Eric Wang
Graphical User Interfaces (GUIs) are central to our interaction with digital devices and growing efforts have been made to build models
for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on
user-indicated points, which we name the Screen Point-and-Read (ScreenPR) task. Currently, this task is predominantly handled by rigid, accessibility-oriented screen-reading tools, and is in great need of new models driven by advancements in Multimodal Large Language Models (MLLMs). In
this paper, we propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the ScreenPR task. Based on the
input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on the tree, our
ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements.
Such layout information is crucial for accurately interpreting information on the screen, distinguishing our ToL agent from other screen read-
ing tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed ScreenPR benchmark, which includes
GUIs from mobile, web, and operating systems. Last but not least, we test the ToL agent on mobile GUI navigation tasks, demonstrating
its utility in identifying incorrect actions along the path of agent execution trajectories. Code and data: https://screen-point-and-read.github.io.
Nov 12 (Tue) 11:00-12:30 - Jasmine
If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
Reza Esfandiarpoor, Cristina Menghini, Stephen Bach
Recent works often assume that Vision-Language Model (VLM) representations are based on visual attributes like shape. However, it is
unclear to what extent VLMs prioritize this information to represent concepts. We propose Extract and Explore (EX2), a novel approach
to characterize textual features that are important for VLMs. EX2 uses reinforcement learning to align a large language model with VLM
preferences and generates descriptions that incorporate features that are important for the VLM. Then, we inspect the descriptions to identify
features that contribute to VLM representations. Using EX2, we find that spurious descriptions have a major role in VLM representations
despite providing no helpful information, e.g., "Click to enlarge photo of CONCEPT". More importantly, among informative descriptions,
VLMs rely significantly on non-visual attributes like habitat (e.g., North America) to represent visual concepts. Also, our analysis reveals
that different VLMs prioritize different attributes in their representations. Overall, we show that VLMs do not simply match images to scene
descriptions and that non-visual or even spurious descriptions significantly influence their representations.
Nov 12 (Tue) 11:00-12:30 - Jasmine
MMOE: Enhancing Multimodal Models with Mixtures of Multimodal Interaction Experts
Haofei Yu, Zhengyang Qi, Lawrence Keunho Jang, Russ Salakhutdinov, Louis-Philippe Morency, Paul Pu Liang
Advances in multimodal models have greatly improved how interactions relevant to various tasks are modeled. Today’s multimodal models
mainly focus on the correspondence between images and text, using this for tasks like image-text matching. However, this covers only a subset
of real-world interactions. Novel interactions, such as sarcasm expressed through opposing spoken words and gestures or humor expressed
through utterances and tone of voice, remain challenging. In this paper, we introduce an approach to enhance multimodal models, which
we call Multimodal Mixtures of Experts (MMoE). The key idea in MMoE is to train separate expert models for each type of multimodal
interaction, such as redundancy present in both modalities, uniqueness in one modality, or synergy that emerges when both modalities are
fused. On a sarcasm detection task (MUStARD) and a humor detection task (URFUNNY), we obtain new state-of-the-art results. MMoE is
also able to be applied to various types of models to gain improvement.
Nov 12 (Tue) 11:00-12:30 - Jasmine
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
Lu Zhang, Tiancheng Zhao, Heting Ying, Yibo Ma, Kyusong Lee
Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive
video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant chal-
lenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result
in substantial information loss. To address these shortcomings, we develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features a Divide-and-Conquer Loop capable of autonomous
reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understand-
ing, significantly reducing information loss. Experimental results affirm OmAgent’s efficacy in handling various types of videos and complex
tasks. Moreover, we have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate
tasks.
Nov 12 (Tue) 11:00-12:30 - Jasmine
VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models
Jingtao Cao, Zhang Zheng, Hongru WANG, Kam-Fai Wong
Progress in Text-to-Image (T2I) models has significantly advanced the generation of images from textual descriptions. Existing metrics, such
as CLIP, effectively measure the semantic alignment between single prompts and their corresponding images. However, they fall short in eval-
uating a model’s ability to generalize across a broad spectrum of textual inputs. To address this gap, we propose the VLEU (Visual Language
Evaluation Understudy) metric. VLEU leverages the power of Large Language Models (LLMs) to sample from the visual text domain, en-
compassing the entire range of potential inputs for the T2I task, to generate a wide variety of visual text. The images generated by T2I models
from these prompts are then assessed for their alignment with the input text using the CLIP model. VLEU quantitatively measures a model’s
generalizability by computing the Kullback-Leibler (KL) divergence between the visual text marginal distribution and the conditional distribu-
tion over the images generated by the model. This provides a comprehensive metric for comparing the overall generalizability of T2I models,
beyond single-prompt evaluations, and offers valuable insights during the finetuning process. Our experimental results demonstrate VLEU’s
effectiveness in evaluating the generalizability of various T2I models, positioning it as an essential metric for future research and development
in image synthesis from text prompts. Our code and data will be publicly available at https://github.com/mio7690/VLEU.
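As a concrete reading of this computation: if sim[i, j] is the CLIP similarity between sampled prompt i and the image generated from prompt j, the score reduces to an average KL divergence between each image's conditional distribution over prompts and the prompt marginal. The NumPy sketch below illustrates that reading; the uniform marginal, softmax temperature, and KL direction are illustrative assumptions, not the paper's exact formulation.

import numpy as np

def vleu_sketch(sim, temp=0.01):
    # sim[i, j]: CLIP similarity of sampled prompt i to the image generated
    # from prompt j (illustrative; see the paper/repo for the exact metric).
    n = sim.shape[0]
    marginal = np.full(n, 1.0 / n)             # assumed uniform over prompts
    cond = np.exp(sim / temp)                  # softmax over prompts per image
    cond /= cond.sum(axis=0, keepdims=True)
    # mean KL(conditional || marginal) across generated images; a larger
    # divergence means the images discriminate their prompts more sharply
    return (cond * np.log(cond / marginal[:, None])).sum(axis=0).mean()

rng = np.random.default_rng(0)
sim = rng.normal(0.2, 0.05, size=(8, 8)) + 0.3 * np.eye(8)
print(vleu_sketch(sim))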
Visual Text Matters: Improving Text-KVQA with Visual Text Entity Knowledge-aware Large Multimodal Assistant
Abhirama Subramanyam Penamakuri, Anand Mishra
We revisit knowledge-aware text-based visual question answering, also known as Text-KVQA in the light of modern advancements in large
multimodal models (LMMs), and make the following contributions: (i) We propose VisTEL – a principled approach to perform visual text
entity linking. The proposed VisTEL module harnesses a state-of-the-art visual text recognition engine and the power of a large multimodal
model to jointly reason using textual and visual context obtained using surrounding cues in the image to link visual text entity to the correct
knowledge base entity. (ii) We present KaLMA – knowledge-aware large multimodal assistant that augments an LMM with knowledge as-
sociated with visual text entity in the image to arrive at an accurate answer. Further, we provide a comprehensive experimental analysis and
comparison of our approach with traditional visual question answering, pre-large multimodal models, and large multimodal models, as well
as prior top-performing approaches. Averaging over three splits of Text-KVQA, our proposed approach surpasses the previous best approach
by a substantial 23.3% on an absolute scale and establishes a new state of the art. We make our implementation publicly available.
Nov 12 (Tue) 11:00-12:30 - Jasmine
IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning
Soeun Lee, Si-Woo Kim, Taewhan Kim, Dong-Jin Kim
Recent advancements in image captioning have explored text-only training methods to overcome the limitations of paired image-text data.
However, existing text-only training methods often overlook the modality gap between using text data during training and employing images
during inference. To address this issue, we propose a novel approach called Image-like Retrieval, which aligns text features with visually rele-
vant features to mitigate the modality gap. Our method further enhances the accuracy of generated captions by designing a fusion module that
integrates retrieved captions with input features. Additionally, we introduce a Frequency-based Entity Filtering technique that significantly
improves caption quality. We integrate these methods into a unified framework, which we refer to as IFCap (**I**mage-like Retrieval and
**F**requency-based Entity Filtering for Zero-shot **Cap**tioning). Through extensive experimentation, our straightforward yet powerful
approach has demonstrated its efficacy, outperforming state-of-the-art zero-shot captioning methods based on text-only training by a significant margin in both image captioning and video captioning.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Bayesian Example Selection Improves In-Context Learning for Speech, Text, and Visual Modalities
Siyin Wang, Chao-Han Huck Yang, Ji Wu, Chao Zhang
Large language models (LLMs) can adapt to new tasks through in-context learning (ICL) based on a few examples presented in dialogue
history without any model parameter update. Despite such convenience, the performance of ICL heavily depends on the quality of the in-
context examples presented, which makes the in-context example selection approach a critical choice. This paper proposes a novel Bayesian
in-Context example Selection method (ByCS) for ICL. Extending the inference probability conditioned on in-context examples based on
Bayes’ theorem, ByCS focuses on the inverse inference conditioned on test input. Following the assumption that accurate inverse inference
probability (likelihood) will result in accurate inference probability (posterior), in-context examples are selected based on their inverse in-
ference results. Diverse and extensive cross-tasking and cross-modality experiments are performed with speech, text, and image examples.
Experimental results show the efficacy and robustness of our ByCS method on various models, tasks and modalities.
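Under one simplified reading of this selection rule, each candidate example is scored by how well the test input is recovered when conditioning on that candidate, and the top-scoring examples form the context. The sketch below makes that loop concrete; lm_logprob is a hypothetical stand-in that a real implementation would replace with a language model's log P(target | context).

def lm_logprob(context: str, target: str) -> float:
    # Hypothetical scorer: substitute a real LM's log-probability here.
    return -abs(len(context) - len(target)) / 10.0  # toy proxy for the demo

def bycs_select(candidates, test_input, k=4):
    # Rank candidates by the inverse-inference score of the test input.
    scored = sorted(candidates,
                    key=lambda xy: lm_logprob(f"{xy[0]}\n{xy[1]}", test_input),
                    reverse=True)
    return scored[:k]

pool = [("translate: chat", "cat"), ("translate: chien", "dog")]
print(bycs_select(pool, "translate: cheval", k=1))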
Nov 12 (Tue) 11:00-12:30 - Jasmine
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
Fatemeh Shiri, Xiao-Yu Guo, Mona Golestan Far, Xin Yu, Reza Haf, Yuan-Fang Li
Large Multimodal Models (LMMs) have achieved strong performance across a range of vision and language tasks. However, their spatial
reasoning capabilities are under-investigated. In this paper, we construct a novel VQA dataset, Spatial-MM, to comprehensively study LMMs’
spatial understanding and reasoning capabilities. Our analyses on object-relationship and multi-hop reasoning reveal several important find-
ings. Firstly, bounding boxes and scene graphs, even synthetic ones, can significantly enhance LMMs’ spatial reasoning. Secondly, LMMs
struggle more with questions posed from the human perspective than the camera perspective about the image. Thirdly, chain of thought (CoT)
prompting does not improve model performance on complex multi-hop questions involving spatial relations. Moreover, spatial reasoning
steps are much less accurate than non-spatial ones across MLLMs. Lastly, our perturbation analysis on GQA-spatial reveals that LMMs are
much stronger at basic object detection than complex spatial reasoning. We believe our new benchmark dataset and in-depth analyses can
spark further research on LMMs' spatial reasoning.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Nearest Neighbor Normalization Improves Multimodal Retrieval
Neil Chowdhury, Franklin Wang, Sumedh Shenoy, Douwe Kiela, Sarah Schwettmann, Tristan Thrush
Multimodal models leverage large-scale pretraining to achieve strong but still imperfect performance on tasks such as image captioning,
visual question answering, and cross-modal retrieval. In this paper, we present a simple and efficient method for correcting errors in trained
contrastive image-text retrieval models with no additional training, called Nearest Neighbor Normalization (NNN). We show an improvement
on retrieval metrics in both text retrieval and image retrieval for all of the contrastive models that we tested (CLIP, BLIP, ALBEF, SigLIP,
BEiT) and for both of the datasets that we used (MS-COCO and Flickr30k). NNN requires a reference database, but does not require any
training on this database, and can even increase the retrieval accuracy of a model after finetuning.
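One way to picture the correction, consistent with the description above: estimate a per-candidate bias from the candidate's k highest-scoring queries in the reference database and subtract it from the raw retrieval scores. The value of k and the exact bias estimator in this NumPy sketch are assumptions for illustration.

import numpy as np

def nnn_correct(scores, ref_scores, k=16):
    # scores[q, i]: similarity of test query q to candidate i.
    # ref_scores[r, i]: similarities from the (untrained) reference database.
    topk = np.sort(ref_scores, axis=0)[-k:, :]  # k nearest reference queries per candidate
    bias = topk.mean(axis=0)                    # assumed bias estimate
    return scores - bias

rng = np.random.default_rng(1)
corrected = nnn_correct(rng.random((4, 10)), rng.random((100, 10)))
print(corrected.shape)  # (4, 10): debiased scores, ready for ranking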
Nov 12 (Tue) 11:00-12:30 - Jasmine
PropTest: Automatic Property Testing for Improved Visual Programming
Jaywon Koo, Ziyan Yang, Paola Cascante-Bonilla, Baishakhi Ray, Vicente Ordonez
Visual Programming has recently emerged as an alternative to end-to-end black-box visual reasoning models. This type of method leverages
Large Language Models (LLMs) to generate the source code for an executable computer program that solves a given problem. This strategy
has the advantage of offering an interpretable reasoning path and does not require finetuning a model with task-specific data. We propose
PropTest, a general strategy that improves visual programming by further using an LLM to generate code that tests for visual properties in an
initial round of proposed solutions. Our method generates tests for data-type consistency, output syntax, and semantic properties. PropTest
achieves comparable results to state-of-the-art methods while using publicly available LLMs. This is demonstrated across different bench-
marks on visual question answering and referring expression comprehension. Particularly, PropTest improves ViperGPT by obtaining 46.1%
accuracy (+6.0%) on GQA using Llama3-8B and 59.5% (+8.1%) on RefCOCO+ using CodeLlama-34B.
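To make the idea tangible, here is the kind of property test such a system might emit for a counting-style visual question ("How many dogs are there?"); the specific checks are hand-written illustrations, whereas PropTest generates them with an LLM.

def property_test(answer) -> bool:
    if not isinstance(answer, str):   # data-type consistency
        return False
    token = answer.strip().lower()
    if not token.isdigit():           # output syntax: expect a bare number
        return False
    return 0 <= int(token) <= 100     # loose semantic sanity check

for candidate in ["3", "three", "-1", 7]:
    print(candidate, property_test(candidate))  # only "3" passes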
Nov 12 (Tue) 11:00-12:30 - Jasmine
MMedAgent: Learning to Use Medical Tools with Multi-modal Agent
Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, Yixin Wang
Multi-Modal Large Language Models (MLLMs), despite being successful, exhibit limited generality and often fall short when compared to
specialized models. Recently, LLM-based agents have been developed to address these challenges by selecting appropriate specialized models
as tools based on user inputs. However, such advancements have not been extensively explored within the medical domain. To bridge this gap,
this paper introduces the first agent explicitly designed for the medical field, named Multi-modal Medical Agent (MMedAgent). We curate
an instruction-tuning dataset comprising six medical tools solving seven tasks across five modalities, enabling the agent to choose the most
suitable tools for a given task. Comprehensive experiments demonstrate that MMedAgent achieves superior performance across a variety
of medical tasks compared to state-of-the-art open-source methods and even the closed-source model, GPT-4o. Furthermore, MMedAgent
exhibits efficiency in updating and integrating new medical tools.
Nov 12 (Tue) 11:00-12:30 - Jasmine
Can Textual Unlearning Solve Cross-Modality Safety Alignment?
Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael B. Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit Roy-Chowdhury, Chengyu Song
Recent studies reveal that integrating new modalities into large language models (LLMs), such as vision-language models (VLMs), creates
a new attack surface that bypasses existing safety training techniques like supervised fine-tuning (SFT) and reinforcement learning with hu-
man feedback (RLHF). While further SFT and RLHF-based safety training can be conducted in multi-modal settings, collecting multi-modal
training datasets poses a significant challenge. Inspired by the structural design of recent multi-modal models, where all input modalities are
ultimately fused into the language space, we explore whether unlearning solely in the textual domain can be effective for cross-modality safety
alignment. Our empirical evaluation across seven datasets demonstrates promising transferability — textual unlearning in VLMs significantly
reduces the Attack Success Rate (ASR) to less than 8% and in some cases, even as low as nearly 2% for both text-based and vision-text-based
attacks, alongside preserving the utility. Moreover, our experiments show that unlearning with a multi-modal dataset offers no potential ben-
efits but incurs significantly increased computational demands.
Nov 12 (Tue) 11:00-12:30 - Jasmine
What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
Junho Kim, KIM YEONJU, Yong Man Ro
This paper presents a way of enhancing the reliability of Large Multi-modal Models (LMMs) in addressing hallucination, where the models
generate cross-modal inconsistent responses. Without additional training, we propose Counterfactual Inception, a novel method that implants
counterfactual thinking into LMMs using self-generated counterfactual keywords. Our method is grounded in the concept of counterfac-
tual thinking, a cognitive process in which humans consider alternative realities, enabling more extensive context exploration. Bridging the
human cognition mechanism into LMMs, we aim for the models to engage with and generate responses that span a wider contextual scene
understanding, mitigating hallucinatory outputs. We further introduce Plausibility Verification Process (PVP), a simple yet robust keyword
constraint that effectively filters out sub-optimal keywords to enable the consistent triggering of counterfactual thinking in the model re-
sponses. Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual
thinking significantly reduces hallucination and helps to broaden contextual understanding based on true visual clues.
Nov 12 (Tue) 11:00-12:30 - Jasmine
LLM-A*: Large Language Model Enhanced Incremental Heuristic Search on Path Planning
Silin Meng, Yiwei Wang, Cheng-Fu Yang, Nanyun Peng, Kai-Wei Chang
Path planning is a fundamental scientific problem in robotics and autonomous navigation, requiring the derivation of efficient routes from
starting to destination points while avoiding obstacles. Traditional algorithms like A* and its variants are capable of ensuring path validity but
suffer from significant computational and memory inefficiencies as the state space grows. Conversely, large language models (LLMs) excel
in broader environmental analysis through contextual understanding, providing global insights into environments. However, they fall short in
detailed spatial and temporal reasoning, often leading to invalid or inefficient routes. In this work, we propose LLM-A*, a new LLM-based
route planning method that synergistically combines the precise pathfinding capabilities of A* with the global reasoning capability of LLMs.
This hybrid approach aims to enhance pathfinding efficiency in terms of time and space complexity while maintaining the integrity of path
validity, especially in large-scale scenarios. By integrating the strengths of both methodologies, LLM-A* addresses the computational and
memory limitations of conventional algorithms without compromising on the validity required for effective pathfinding.
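A compact way to see the hybrid is as ordinary A* whose heuristic is retargeted at the next of a sequence of LLM-suggested waypoints. In this sketch the waypoints are passed in directly rather than queried from an LLM, and the retargeting scheme is a simplifying assumption.

import heapq

def llm_a_star(grid, start, goal, waypoints):
    # grid: 0 = free cell, 1 = obstacle; returns the path cost or None.
    def h(p, t):
        return abs(p[0] - t[0]) + abs(p[1] - t[1])  # Manhattan distance

    targets = list(waypoints) + [goal]
    frontier = [(h(start, targets[0]), 0, start, 0)]  # (f, g, node, waypoint index)
    best = {(start, 0): 0}
    while frontier:
        _, g, node, wi = heapq.heappop(frontier)
        if node == goal:
            return g
        if wi < len(waypoints) and node == targets[wi]:
            wi += 1                                   # waypoint reached: aim at the next one
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dx, node[1] + dy)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and not grid[nxt[0]][nxt[1]]):
                ng = g + 1
                if ng < best.get((nxt, wi), float("inf")):
                    best[(nxt, wi)] = ng
                    heapq.heappush(frontier, (ng + h(nxt, targets[wi]), ng, nxt, wi))
    return None

grid = [[0, 0, 0], [1, 1, 0], [0, 0, 0]]
print(llm_a_star(grid, (0, 0), (2, 0), waypoints=[(0, 2)]))  # cost 6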
NLP Applications 1
Nov 12 (Tue) 11:00-12:30 - Room: Riverfront Hall
This technical report shares how we use GPT-4 as an automatic assignment evaluator in a university course with over 1000 students. Based on student
responses, we found that LLM-based assignment evaluators are generally acceptable to students when they have free access to these tools.
However, students also noted that the LLM sometimes fails to adhere to the evaluation instructions, resulting in unreasonable assessments.
Additionally, we observed that students can easily manipulate the LLM to output specific strings, allowing them to achieve high scores without
meeting the assignment rubric. Based on student feedback and our experience, we offer several recommendations for effectively integrating
LLMs into future classroom evaluations. Our observation also highlights potential directions for improving LLM-based evaluators, including
their instruction-following ability and vulnerability to prompt hacking.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
An Electoral Approach to Diversify LLM-based Multi-Agent Collective Decision-Making
Xiutian Zhao, Ke Wang, Wei Peng
Modern large language models (LLMs) have exhibited cooperative synergy on complex task-solving, and collective decision-making (CDM)
is a pivotal component in LLM-based multi-agent collaboration frameworks. Our survey on 52 recent such systems uncovers a severe lack of
diversity, with a heavy reliance on dictatorial and plurality voting for CDM. Through the lens of social choice theory, we scrutinize widely-
adopted CDM methods and identify their limitations. To enrich the current landscape of LLM-based CDM, we present GEDI, an electoral CDM
module that incorporates various ordinal preferential voting mechanisms. Our empirical case study across three benchmarks shows that the
integration of certain CDM methods can markedly improve the reasoning capabilities and robustness of some leading LLMs, all without
requiring intricate system designs. Additionally, we find that some CDM mechanisms generate positive synergies even with as few as three
agents. The voting-based methods also demonstrate robustness against single points of failure, as well as diversity in terms of hit-rate@k and
subject-wise impacts.
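For concreteness, one ordinal preferential mechanism a module of this kind can host is a Borda count over the agents' ranked answers; the snippet below is an illustration of that mechanism, not GEDI's interface.

from collections import Counter

def borda(rankings):
    # Each agent submits a full ranking; positions convert to points.
    tally = Counter()
    for ranking in rankings:
        for pos, candidate in enumerate(ranking):
            tally[candidate] += len(ranking) - 1 - pos
    return tally.most_common()

agents = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
print(borda(agents))  # A wins the collective decision with 5 points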
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
LLM4Decompile: Decompiling Binary Code with Large Language Models
Hanzhuo Tan, Qi Luo, Jing Li, Yuqun Zhang
Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are diffi-
cult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and
largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce
the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the
HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement
approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a
further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary
code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal
results.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Focused Large Language Models are Stable Many-Shot Learners
Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Heda Wang, Yao Hu, Kan Li
In-Context Learning (ICL) enables large language models (LLMs) to achieve rapid task adaptation by learning from demonstrations. With the
increase in available context length of LLMs, recent experiments have shown that the performance of ICL does not necessarily scale well in
many-shot (demonstration) settings. We hypothesize that the reason is that additional demonstrations disperse the model's attention away from the query, hindering its understanding of key content, which we validate both theoretically and experimentally. Inspired by how humans learn from examples, we propose a training-free method, FocusICL, which performs triviality filtering at the token level to prevent attention from being diverted by unimportant content, and applies hierarchical attention at the demonstration level to ensure sufficient attention to the current query.
We also design an efficient hyperparameter searching strategy for FocusICL based on model perplexity of demonstrations. Comprehensive
experiments validate that FocusICL achieves an average performance improvement of 5.2% over vanilla ICL and scales well with many-shot
demonstrations.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Personalized Pieces: Efficient Personalized Large Language Models through Collaborative Efforts
Zhaoxuan Tan, Zheyuan Liu, Meng Jiang
Personalized large language models (LLMs) aim to tailor interactions, content, and recommendations to individual user preferences. While
parameter-efficient fine-tuning (PEFT) methods excel in performance and generalization, they are costly and limit communal benefits when
used individually. To this end, we introduce Personalized Pieces (Per-Pcs), a framework that allows users to safely share and assemble person-
alized PEFT efficiently with collaborative efforts. Per-Pcs involves selecting sharers, breaking their PEFT into pieces, and training gates for
each piece. These pieces are added to a pool, from which target users can select and assemble personalized PEFT using their history data. This
approach preserves privacy and enables fine-grained user modeling without excessive storage and computation demands. Experimental results
show Per-Pcs outperforms non-personalized and PEFT retrieval baselines, offering performance comparable to OPPU with significantly lower
resource use across six tasks. Further analysis highlights Per-Pcs’s robustness concerning sharer count and selection strategy, pieces sharing
ratio, and scalability in computation time and storage space. Per-Pcs’s modularity promotes safe sharing, making LLM personalization more
efficient, effective, and widely accessible through collaborative efforts.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Efficient LLM Comparative Assessment: A Product of Experts Framework for Pairwise Comparisons
Adian Liusie, Vatsal Raina, Yassir Fathullah, Mark Gales
LLM-as-a-judge approaches are a practical and effective way of assessing a range of text tasks. However, when using pairwise comparisons
to rank a set of candidates, the computational cost scales quadratically with the number of candidates, which has practical limitations. This
paper introduces a Product of Expert (PoE) framework for efficient LLM Comparative Assessment. Here individual comparisons are con-
sidered experts that provide information on a pair’s score difference. The PoE framework combines the information from these experts to
yield an expression that can be maximized with respect to the underlying set of candidates, and is highly flexible where any form of expert
can be assumed. When Gaussian experts are used one can derive simple closed-form solutions for the optimal candidate ranking, as well
as expressions for selecting which comparisons should be made to maximize the probability of this ranking. Our approach enables efficient
comparative assessment, where by using only a small subset of the possible comparisons, one can generate score predictions that correlate
well with human judgements. We evaluate the approach on multiple NLG tasks and demonstrate that our framework can yield considerable
computational savings when performing pairwise comparative assessment. With many candidate texts, using as few as 2% of comparisons
the PoE solution can achieve similar performance to when all comparisons are used.
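Under the Gaussian-expert assumption, maximizing the product of experts reduces to ordinary least squares over score differences, which is easy to verify in a few lines. In this NumPy sketch each comparison (i, j, d) is an expert asserting score_i - score_j ≈ d; unit expert variances are assumed, and the gauge is fixed by pinning the mean score to zero.

import numpy as np

def poe_gaussian_scores(n, comparisons):
    rows, targets = [], []
    for i, j, d in comparisons:
        r = np.zeros(n)
        r[i], r[j] = 1.0, -1.0   # expert on the score difference s_i - s_j
        rows.append(r)
        targets.append(d)
    rows.append(np.ones(n))      # mean-zero constraint fixes the gauge
    targets.append(0.0)
    scores, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
    return scores

# Only 3 of the 6 unordered pairs are compared, yet a full ranking emerges.
print(poe_gaussian_scores(4, [(0, 1, 0.8), (1, 2, 0.5), (3, 2, -0.3)]))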
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
Junying Chen, Chi Gui, OuyangRuyi, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan,
Benyou Wang
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However,
these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text
data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed’s large-scale, de-
identified medical image-text pairs to address these limitations, they often fall short due to inherent data noise. To tackle this, we refined
medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an ’unblinded’ capacity to denoise and reformat the data, re-
sulting in the creation of the **PubMedVision** dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1)
PubMedVision can significantly enhance the medical multimodal capabilities of MLLMs, showing significant improvement in benchmarks
including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data
quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM **HuatuoGPT-
Vision**, which shows superior performance in medical multimodal scenarios among open-source MLLMs. Our code and data are available
at https://github.com/FreedomIntelligence/HuatuoGPT-Vision.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
FEDKIM: Adaptive Federated Knowledge Injection into Medical Foundation Models
Xiaochen Wang, Jiaqi Wang, Houping Xiao, Jinghui Chen, Fenglong Ma
Foundation models have demonstrated remarkable capabilities in handling diverse modalities and tasks, outperforming conventional artifi-
cial intelligence (AI) approaches that are highly task-specific and modality-reliant. In the medical domain, however, the development of
comprehensive foundation models is constrained by limited access to diverse modalities and stringent privacy regulations. To address these
constraints, this study introduces a novel knowledge injection approach, FedKIM, designed to scale the medical foundation model within
a federated learning framework. FedKIM leverages lightweight local models to extract healthcare knowledge from private data and inte-
grates this knowledge into a centralized foundation model using a designed adaptive Multitask Multimodal Mixture Of Experts (M3OE)
module. This method not only preserves privacy but also enhances the model’s ability to handle complex medical tasks involving mul-
tiple modalities. Our extensive experiments across twelve tasks in seven modalities demonstrate the effectiveness of FedKIM in various
settings, highlighting its potential to scale medical foundation models without direct access to sensitive data. Source codes are available at
https://github.com/XiaochenWang-PSU/FedKIM.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions
Nigel Fernandez, Alexander Scarlatos, Digory Smith, Simon Woodhead, Nancy Otero Ornelas, Andrew Lan
High-quality distractors are crucial to both the assessment and pedagogical value of multiple-choice questions (MCQs), where manually
crafting ones that anticipate knowledge deficiencies or misconceptions among real students is difficult. Meanwhile, automated distractor
generation, even with the help of large language models (LLMs), remains challenging for subjects like math. It is crucial to not only identify
plausible distractors but also understand the error behind them. In this paper, we introduce DiVERT (Distractor Generation with Variational
Errors Represented as Text), a novel variational approach that learns an interpretable representation of errors behind distractors in math
MCQs. Through experiments on a real-world math MCQ dataset with 1,434 questions used by hundreds of thousands of students, we show
that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on down-
stream distractor generation. We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of
comparable quality to human-authored ones.
components: biomedical entity identification (Named Entity Recognition, NER) and their interrelation determination (Relation Extraction,
RE). However, existing methods often neglect unique features of the biomedical literature, such as ambiguous entities, nested proper nouns,
and overlapping relation triplets, and underutilize prior knowledge, leading to an intolerable performance decline in the biomedical domain,
especially with limited annotated training data. In this paper, we propose the Biomedical Relation-First eXtraction (Bio-RFX) model by
leveraging sentence-level relation classification before entity extraction to tackle entity ambiguity. Moreover, we exploit structural constraints
between entities and relations to guide the model’s hypothesis space, enhancing extraction performance across different training scenarios.
Comprehensive experimental results on biomedical datasets show that Bio-RFX achieves significant improvements on both NER and RE
tasks. Even under the low-resource training scenarios, it outperforms all baselines in NER and has highly competitive performance compared
to the state-of-the-art fine-tuned baselines in RE.
Ernie Chang, Pin-Jie Lin, Yang Li, Changsheng Zhao, Daeil Kim, Rastislav Rabatin, Zechun Liu, Yangyang Shi, Vikas Chandra
Language model pretraining generally targets a broad range of use cases and incorporates data from diverse sources. However, there are
instances where we desire a model that excels in specific areas without markedly compromising performance in other areas. A cost-effective
and straightforward approach is sampling with low-dimensional data features, which allows selecting large-scale pretraining data for domain-
specific use cases. In this work, we revisit importance sampling with n-gram features consisting of multi-granular tokens, which strikes a
good balance between sentence compression and representation capabilities. We observed the sampled data to have a high correlation with the
target downstream task performance *while preserving its effectiveness on other tasks*. This leads to the proposed data sampling paradigm
where language models can be pretrained more efficiently on selected documents. On eight benchmarks we demonstrate that, with ∼1% of the data, pretrained models perform on par with those trained on the full RefinedWeb data and outperform randomly selected samples for model sizes ranging
from 125M to 1.5B.
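The selection rule can be sketched as scoring each document by the average log-ratio of its n-gram features under a target-domain count model versus a general-domain one; the bigram features and the smoothing constants below are illustrative assumptions.

import math
from collections import Counter

def importance_weight(doc_tokens, target_counts, general_counts, n=2):
    grams = list(zip(*(doc_tokens[i:] for i in range(n))))
    t_total = sum(target_counts.values()) + 1e6    # crude smoothing mass
    g_total = sum(general_counts.values()) + 1e6
    score = sum(math.log(((target_counts.get(g, 0) + 1) / t_total)
                         / ((general_counts.get(g, 0) + 1) / g_total))
                for g in grams)
    return score / max(len(grams), 1)

target = Counter({("protein", "folding"): 40, ("gene", "expression"): 25})
general = Counter({("of", "the"): 9000, ("protein", "folding"): 2})
doc = "advances in protein folding and gene expression".split()
print(importance_weight(doc, target, general))  # positive: keep for the domain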
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Not Everything is All You Need: Toward Low-Redundant Optimization for Large Language Model Alignment
Zhipeng Chen, Kun Zhou, Xin Zhao, Jingyuan Wang, Ji-Rong Wen
Large language models (LLMs) still struggle to align with human preferences in complex tasks and scenarios. They are prone to overfitting unexpected patterns or superficial styles in the training data. We conduct an empirical study that selects only the top 10% most updated parameters in LLMs for alignment training and observe improvements in the convergence process and final performance, indicating the existence of redundant neurons in LLMs for alignment training. To reduce their influence, we propose a low-redundant alignment method
named **ALLO**, focusing on optimizing the most related neurons with the most useful supervised signals. Concretely, we first identify the
neurons that are related to the human preference data by a gradient-based strategy, then identify the alignment-related key tokens by reward
models for computing loss. Besides, we also decompose the alignment process into the forgetting and learning stages, where we first forget
the tokens with unaligned knowledge and then learn aligned knowledge, by updating different ratios of neurons, respectively. Experimental
results on 10 datasets have shown the effectiveness of ALLO. Our code and data will be publicly released.
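The neuron-selection step of the empirical study can be sketched directly: compute gradients on preference data, keep the fraction of parameters with the largest gradient magnitude, and mask updates to the rest. Flattening every tensor into one pool, as this PyTorch sketch does, is a simplifying assumption.

import torch

def top_fraction_masks(model, loss, frac=0.10):
    params = [p for p in model.parameters()]
    grads = torch.autograd.grad(loss, params)
    flat = torch.cat([g.abs().flatten() for g in grads])
    k = max(1, int(frac * flat.numel()))
    threshold = torch.topk(flat, k).values.min()
    # 1.0 where a parameter is in the top fraction by |gradient|, else 0.0
    return [(g.abs() >= threshold).float() for g in grads]

model = torch.nn.Linear(8, 2)
x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = torch.nn.functional.cross_entropy(model(x), y)
masks = top_fraction_masks(model, loss)
print([round(m.mean().item(), 3) for m in masks])  # fraction kept per tensor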
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
ECCO: Can We Improve Model-Generated Code Efficiency Without Sacrificing Functional Correctness?
Siddhant Waghjale, Vishruth Veerendranath, Zhiruo Wang, Daniel Fried
Although large language models (LLMs) have been largely successful in generating functionally correct programs, conditioning models to
produce efficient solutions while ensuring correctness remains a challenge. Further, unreliability in benchmarking code efficiency is a hurdle
across varying hardware specifications for popular interpreted languages such as Python. In this paper, we present ECCO, a reproducible
benchmark for evaluating program efficiency via two paradigms: natural language (NL) based code generation and history-based code edit-
ing. On ECCO, we adapt and thoroughly investigate the three most promising existing LLM-based approaches: in-context learning, iterative
refinement with execution or NL feedback, and fine-tuning conditioned on execution and editing history. While most methods degrade func-
tional correctness and moderately increase program efficiency, we find that adding execution information often helps maintain functional correctness, while NL feedback yields larger gains in efficiency. We release our benchmark to support future work on LLM-based generation of
efficient code.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Effective Synthetic Data and Test-Time Adaptation for OCR Correction
Shuhao Guan, Cheng Xu, Moule Lin, Derek Greene
Post-OCR technology is used to correct errors in the text produced by OCR systems. This study introduces a method for constructing post-
OCR synthetic data with different noise levels using weak supervision. We define Character Error Rate (CER) thresholds for "effective" and
"ineffective" synthetic data, allowing us to create more useful multi-noise level synthetic datasets. Furthermore, we propose Self-Correct-
Noise Test-Time Adaptation (SCN-TTA), which combines self-correction and noise generation mechanisms. SCN-TTA allows a model to
dynamically adjust to test data without relying on labels, effectively handling proper nouns in long texts and further reducing CER. In our
experiments we evaluate a range of models, including multiple PLMs and LLMs. Results indicate that our method yields models that are
effective across diverse text types. Notably, the ByT5 model achieves a CER reduction of 68.67% without relying on manually annotated data.
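For reference, the CER these thresholds are defined over is the character-level edit distance normalized by reference length; a self-contained implementation:

def cer(reference: str, hypothesis: str) -> float:
    # Levenshtein distance with a rolling row, divided by reference length.
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost)
        prev = cur
    return prev[n] / max(m, 1)

print(cer("post-OCR correction", "p0st-0CR c0rrection"))  # 3 substitutions / 19 chars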
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Autoregressive Multi-trait Essay Scoring via Reinforcement Learning with Scoring-aware Multiple Rewards
Heejin Do, Sangwon Ryu, Gary Lee
Recent advances in automated essay scoring (AES) have shifted towards evaluating multiple traits to provide enriched feedback. Like typical
AES systems, multi-trait AES employs the quadratic weighted kappa (QWK) to measure agreement with human raters, aligning closely with
the rating schema; however, its non-differentiable nature prevents its direct use in neural network training. In this paper, we propose Scoring-
aware Multi-reward Reinforcement Learning (SaMRL), which integrates actual evaluation schemes into the training process by designing
QWK-based rewards with a mean-squared error penalty for multi-trait AES. Existing reinforcement learning (RL) applications in AES are
limited to classification models despite associated performance degradation, as RL requires probability distributions; instead, we adopt an au-
toregressive score generation framework to leverage token generation probabilities for robust multi-trait score predictions. Empirical analyses
demonstrate that SaMRL facilitates model training, notably enhancing scoring of previously inferior prompts.
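The QWK that SaMRL turns into a reward is the standard quadratically weighted agreement statistic between two raters' integer scores; a compact reference implementation:

import numpy as np

def quadratic_weighted_kappa(a, b, min_score, max_score):
    scores = np.arange(min_score, max_score + 1)
    n = len(scores)
    w = (scores[:, None] - scores[None, :]) ** 2 / (n - 1) ** 2  # quadratic penalty
    observed = np.zeros((n, n))
    for x, y in zip(a, b):
        observed[x - min_score, y - min_score] += 1
    observed /= observed.sum()
    expected = observed.sum(1)[:, None] * observed.sum(0)[None, :]  # chance agreement
    return 1 - (w * observed).sum() / (w * expected).sum()

print(quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 3], 1, 4))  # 0.875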
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
SEGMENT+: Long Text Processing with Short-Context Language Models
Wei Shi, Shuang Li, Kerun Yu, Jinglei Chen, Zujie Liang, Xinhui Wu, Yuxi Qian, Feng Wei, Bo Zheng, Jiaqing Liang, Jiangjie Chen, Yanghua
Xiao
There is a growing interest in expanding the input capacity of language models (LMs) across various domains. However, simply increas-
ing the context window does not guarantee robust performance across diverse long-input processing tasks, such as understanding extensive
documents and extracting detailed information from lengthy and noisy data. In response, we introduce Segment+, a general framework that
enables LMs to handle extended inputs within limited context windows efficiently. Segment+ utilizes structured notes and a filtering module
to manage information flow, resulting in a system that is both controllable and interpretable. Our extensive experiments across various model
sizes, focusing on long-document question-answering and Needle-in-a-Haystack tasks, demonstrate the effectiveness of Segment+ in improv-
ing performance.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference
Erxin Yu, Jing Li, Ming Liao, Siqi Wang, GAO Zuchen, Fei Mi, Lanqing HONG
As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research issue. Previous red teaming approaches
for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM
safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference
safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn
coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the
Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.
memorizing patterns in the synthetic training data. To this end, we propose the NLGift benchmark, an evaluation suite of LLM graph reasoning
generalization: whether LLMs could go beyond semantic, numeric, structural, reasoning patterns in the synthetic training data and improve
utility on real-world graph-based tasks. Extensive experiments with two LLMs across four graph reasoning tasks demonstrate that while
generalization on simple patterns (semantic, numeric) is somewhat satisfactory, LLMs struggle to generalize across reasoning and real-world
patterns, casting doubt on the benefit of synthetic graph tuning for real-world tasks with underlying network structures. We explore three
strategies to improve LLM graph reasoning generalization, and we find that while post-training alignment is most promising for real-world
tasks, empowering LLM graph reasoning to go beyond pattern memorization remains an open research question.
Nov 12 (Tue) 11:00-12:30 - Riverfront Hall
Modeling News Interactions and Influence for Financial Market Prediction
Mengyu Wang, Shay B Cohen, Tiejun Ma
The diffusion of financial news into market prices is a complex process, making it challenging to evaluate the connections between news
events and market movements. This paper introduces FININ (Financial Interconnected News Influence Network), a novel market prediction
model that captures not only the links between news and prices but also the interactions among news items themselves. FININ effectively
integrates multi-modal information from both market data and news articles. We conduct extensive experiments on two datasets, encom-
passing the S&P 500 and NASDAQ 100 indices over a 15-year period and over 2.7 million news articles. The results demonstrate FININ’s
effectiveness, outperforming advanced market prediction models with an improvement of 0.429 and 0.341 in the daily Sharpe ratio for the
two markets respectively. Moreover, our results reveal insights into the financial news, including the delayed market pricing of news, the long
memory effect of news, and the limitations of financial sentiment analysis in fully extracting predictive power from news data.
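For orientation, the daily Sharpe ratio reported above is the mean excess daily return divided by its standard deviation, with no annualization:

import numpy as np

def daily_sharpe(returns, risk_free=0.0):
    excess = np.asarray(returns) - risk_free
    return excess.mean() / excess.std(ddof=1)

print(daily_sharpe([0.002, -0.001, 0.003, 0.001]))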
Demo 2
Nov 12 (Tue) 14:00-15:30 - Room: Riverfront Hall
lexical meaning. Third, we introduce an efficient segmentation algorithm based on a subword bigram model that can be initialized with the
lexically aware segmentation method to avoid using Morfessor and large embedding tables at inference time. We evaluate the proposed ap-
proaches using two intrinsic metrics and measure their performance on two downstream tasks: part-of-speech tagging and machine translation.
Our experiments show significant improvements in the morphological plausibility of the segmentation when evaluated using segmentation
precision on morpheme boundaries and improved Rényi efficiency in 8 languages. Although the proposed tokenization methods do not have
a large impact on automatic translation quality, we observe consistent performance gains in the arguably more morphological task of part-of-
speech tagging.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Deterministic Weighted L* Algorithm
Clemente Pasti, Talu Karagöz, Franz Nowak, Anej Svete, Ryan Cotterell
Extracting finite state automata (FSAs) from black-box models offers a powerful approach to gaining interpretable insights into complex model behaviors. To support this pursuit, we present a weighted variant of Angluin's (1987) L* algorithm for learning FSAs. We stay faithful to the original formulation, devising a way to exactly learn deterministic weighted FSAs whose weights support division. Furthermore, we formulate the learning process in a manner that highlights the connection with FSA minimization, showing how L* directly learns a minimal automaton for the target language.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
On Eliciting Syntax from Language Models via Hashing
Yiran Wang, Masao Utiyama
Unsupervised parsing, also known as grammar induction, aims to infer syntactic structure from raw text. Recently, binary representation has
exhibited remarkable information-preserving capabilities at both lexicon and syntax levels. In this paper, we explore the possibility of lever-
aging this capability to deduce parsing trees from raw text, relying solely on the implicitly induced grammars within models. To achieve this,
we upgrade the bit-level CKY from zero-order to first-order to encode the lexicon and syntax in a unified binary representation space, switch
training from supervised to unsupervised under the contrastive hashing framework, and introduce a novel loss function to impose stronger yet
balanced alignment signals. Our model shows competitive performance on various datasets; therefore, we claim that our method is effective
and efficient enough to acquire high-quality parsing trees from pre-trained language models at a low cost.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Distributional Properties of Subword Regularization
Marco Cognetta, Vilém Zouhar, Naoaki Okazaki
Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting
the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization
schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We
show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization
are as mentioned, we hypothesize that this bias artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to
uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves
machine translation quality.
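The uniform-sampling idea admits a short dynamic-programming sketch: count the segmentations of every suffix, then walk left to right choosing each subword with probability proportional to the number of completions it leaves. The paper's exact algorithm and tokenizer integration may differ.

import random

def sample_uniform_tokenization(word, vocab, rng=random.Random(0)):
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1
    for i in range(n - 1, -1, -1):   # count segmentations of each suffix
        counts[i] = sum(counts[j] for j in range(i + 1, n + 1)
                        if word[i:j] in vocab)
    if counts[0] == 0:
        return None                  # word not segmentable with this vocab
    i, out = 0, []
    while i < n:                     # sample proportionally to completion counts
        options = [(j, counts[j]) for j in range(i + 1, n + 1) if word[i:j] in vocab]
        r = rng.randrange(sum(c for _, c in options))
        for j, c in options:
            if r < c:
                out.append(word[i:j])
                i = j
                break
            r -= c
    return out

vocab = {"un", "relate", "related", "d", "u", "n", "r", "e", "l", "a", "t"}
print(sample_uniform_tokenization("unrelated", vocab))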
Various linearizations have been proposed to cast syntactic dependency parsing as sequence labeling. However, these approaches do not sup-
port more complex graph-based representations, such as semantic dependencies or enhanced universal dependencies, as they cannot handle
reentrancy or cycles. By extending them, we define a range of unbounded and bounded linearizations that can be used to cast graph parsing
as a tagging task, enlarging the toolbox of problems that can be solved under this paradigm. Experimental results on semantic dependency
and enhanced UD parsing show that with a good choice of encoding, sequence-labeling semantic dependency parsers combine high efficiency
with accuracies close to the state of the art, in spite of their simplicity.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Subword Segmentation in LLMs: Looking at Inflection and Consistency
Marion Di Marco, Alexander Fraser
The role of subword segmentation in relation to capturing morphological patterns in LLMs is currently not well explored. Ideally, one would
train models like GPT using various segmentations and evaluate how well word meanings are captured. Since this is not computationally
feasible, we group words according to their segmentation properties and compare how well a model can solve a linguistic task for these
groups. We study two criteria: (i) adherence to morpheme boundaries and (ii) the segmentation consistency of the different inflected forms
of a lemma. We select word forms with high and low values for these criteria and carry out experiments on GPT-4o's ability to capture verbal
inflection for 10 languages. Our results indicate that in particular the criterion of segmentation consistency can help to predict the model's
ability to recognize and generate the lemma from an inflected form, providing evidence that subword segmentation is relevant.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
GDTB: Genre Diverse Data for English Shallow Discourse Parsing across Modalities, Text Types, and Domains
Yang Janet Liu, Tatsuya Aoyama, Wesley Scivetti, Yilun Zhu, Shabnam Behzad, Lauren Elizabeth Levine, Jessica Lin, Devika Tiwari, Amir
Zeldes
Work on shallow discourse parsing in English has focused on the Wall Street Journal corpus, the only large-scale dataset for the language in
the PDTB framework. However, the data is not openly available, is restricted to the news domain, and is by now 35 years old. In this paper,
we present and evaluate a new open-access, multi-genre benchmark for PDTB-style shallow discourse parsing, based on the existing UD
English GUM corpus, for which discourse relation annotations in other frameworks already exist. In a series of experiments on cross-domain
relation classification, we show that while our dataset is compatible with PDTB, substantial out-of-domain degradation is observed, which
can be alleviated by joint training on both datasets.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages
Jimin Sohn, Haeji Jung, Alex Cheng, Jooeon Kang, Yilin Du, David R Mortensen
Existing zero-shot cross-lingual NER approaches require substantial prior knowledge of the target language, which is impractical for low-
resource languages. In this paper, we propose a novel approach to NER using phonemic representation based on the International Phonetic
Alphabet (IPA) to bridge the gap between representations of different languages.Our experiments show that our method significantly outper-
forms baseline models in extremely low-resource languages, with the highest average F1 score (46.38%) and lowest standard deviation (12.67),
particularly demonstrating its robustness with non-Latin scripts. Our code is available at https://github.com/Gabriel819/zeroshot_ner.git
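The grapheme-to-IPA step can be illustrated with the open-source epitran package (the paper's own pipeline and model are not reproduced here; the language code and tokens below are arbitrary examples):

```python
import epitran   # pip install epitran

epi = epitran.Epitran("spa-Latn")   # rule-based Spanish grapheme-to-IPA
tokens = ["el", "gato", "duerme"]
print([epi.transliterate(t) for t in tokens])   # e.g. ['el', 'ɡato', 'dweɾme']
# An NER model trained on such IPA strings can be applied zero-shot to an
# unseen language through its phonemes rather than its orthography.
```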
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Argument Relation Classification through Discourse Markers and Adversarial Training
Michele Luca Contalbo, Francesco Guerra, Matteo Paganelli
Argument relation classification (ARC) identifies supportive, contrasting and neutral relations between argumentative units. The current ap-
proaches rely on transformer architectures which have proven to be more effective than traditional methods based on hand-crafted linguistic
features. In this paper, we introduce DISARM, which advances the state of the art with a training procedure combining multi-task and adver-
sarial learning strategies. By jointly solving the ARC and discourse marker detection tasks and aligning their embedding spaces into a unified
latent space, DISARM outperforms existing approaches in accuracy.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Getting The Most Out of Your Training Data: Exploring Unsupervised Tasks for Morphological Inflection
Abhishek Purushothama, Adam Wiemerslage, Katharina von der Wense
Pre-trained transformers such as BERT have been shown to be effective in many natural language tasks. However, they are under-explored
for character-level sequence to sequence tasks. In this work, we investigate pre-training transformers for the character-level task of morpho-
logical inflection in several languages. We compare various training setups and secondary tasks where unsupervised data taken directly from
the target task is used. We show that training on secondary unsupervised tasks increases inflection performance even without any external
data, suggesting that models learn from additional unsupervised tasks themselves—not just from additional data. We also find that this does
not hold true for specific combinations of secondary task and training setup, which has interesting implications for denoising objectives in
character-level tasks.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Will LLMs Replace the Encoder-Only Models in Temporal Relation Classification?
Gabriel Roccabruna, Massimo Rizzoli, Giuseppe Riccardi
The automatic detection of temporal relations among events has been mainly investigated with encoder-only models such as RoBERTa. Large
Language Models (LLM) have recently shown promising performance in temporal reasoning tasks such as temporal question answering.
Nevertheless, recent studies have tested the performance of only closed-source LLMs in detecting temporal relations, limiting the
interpretability of those results. In this work, we investigate LLMs’ performance and decision process in the Temporal Relation Classifica-
tion task. First, we assess the performance of seven open and closed-sourced LLMs experimenting with in-context learning and lightweight
fine-tuning approaches. Results show that LLMs with in-context learning significantly underperform smaller encoder-only models based on
RoBERTa. Then, we delve into the possible reasons for this gap by applying explainable methods. The outcome suggests a limitation of
LLMs in this task due to their autoregressive nature, which causes them to focus only on the last part of the sequence. Additionally, we
evaluate the word embeddings of these two models to better understand their pre-training differences. The code and the fine-tuned models
can be found on GitHub.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Automatic sentence segmentation of clinical record narratives in real-world data
Dongfang Xu, Davy Weissenbacher, Karen O’Connor, Siddharth Rawal, Graciela Gonzalez Hernandez
Sentence segmentation is a linguistic task and is widely used as a pre-processing step in many NLP applications. The need for sentence seg-
mentation is particularly pronounced in clinical notes, where ungrammatical and fragmented texts are common. We propose a straightforward
and effective sequence labeling classifier to predict sentence spans using a dynamic sliding window based on the prediction of each input
sequence. This sliding window algorithm allows our approach to segment long text sequences on the fly. To evaluate our approach, we anno-
tated 90 clinical notes from the MIMIC-III dataset. Additionally, we tested our approach on five other datasets to assess its generalizability
and compared its performance against state-of-the-art systems on these datasets. Our approach outperformed all the systems, achieving an F1
score that is 15% higher than the next best-performing system on the clinical dataset.
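The dynamic sliding-window idea can be sketched as follows (an illustration under assumptions, not the authors' classifier: `predict_labels` stands in for a learned model that marks sentence-initial tokens):

```python
def segment(tokens, predict_labels, window=128):
    """Split `tokens` into sentences with a window-level token classifier."""
    boundaries, start = [0], 0
    while start < len(tokens):
        labels = predict_labels(tokens[start:start + window])
        new = [start + i for i, y in enumerate(labels)
               if y == 1 and start + i > boundaries[-1]]
        boundaries.extend(new)
        if new and new[-1] > start and start + window < len(tokens):
            start = new[-1]      # re-anchor the window at the last boundary
        else:
            start += window      # no new boundary: slide past the window
    boundaries.append(len(tokens))
    return [tokens[a:b] for a, b in zip(boundaries, boundaries[1:])]

def toy_predict(chunk):
    # Stand-in for the learned classifier: a token begins a sentence if
    # the previous token in the chunk is a period.
    return [1 if i > 0 and chunk[i - 1] == "." else 0 for i in range(len(chunk))]

notes = "pt stable on meds . denies pain . will follow up .".split()
print(segment(notes, toy_predict, window=6))
```

Re-anchoring the window at the last predicted boundary means no sentence is ever split across two windows unless the input forces it, which is what lets arbitrarily long notes be processed on the fly.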
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Communicating with Speakers and Listeners of Different Pragmatic Levels
Kata Naszadi, Frans A Oliehoek, Christof Monz
This paper explores the impact of variable pragmatic competence on communicative success through simulating language learning and con-
versing between speakers and listeners with different levels of reasoning abilities. Through studying this interaction, we hypothesize that
matching levels of reasoning between communication partners would create a more beneficial environment for communicative success and
language learning. Our research findings indicate that learning from more explicit, literal language is advantageous, irrespective of the
learner’s level of pragmatic competence. Furthermore, we find that integrating pragmatic reasoning during language learning, not just during
evaluation, significantly enhances overall communication performance. This paper provides key insights into the importance of aligning rea-
soning levels and incorporating pragmatic reasoning in optimizing communicative interactions.
investigating **Misinformation Oversight Bias**, **Gender Bias**, **Authority Bias** and **Beauty Bias** on LLM and human judges.
We curate a dataset referring to the revised Bloom’s Taxonomy and conduct thousands of evaluations. Results show that human and LLM
judges are vulnerable to perturbations to various degrees, and that even the cutting-edge judges possess considerable biases. We further exploit
these biases to conduct attacks on LLM judges. We hope that our work can notify the community of the bias and vulnerability of human- and
LLM-as-a-judge, as well as the urgency of developing robust evaluation systems.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy, Pedro M Esperanca, Silviu Vlad Oprea, Mete Ozay
Guardrails have emerged as an alternative to safety alignment for content moderation of large language models (LLMs). Existing model-based
guardrails have not been designed for resource-constrained computational portable devices, such as mobile phones, more and more of which
are running LLM-based applications locally. We introduce LoRA-Guard, a parameter-efficient guardrail adaptation method that relies on
knowledge sharing between LLMs and guardrail models. LoRA-Guard extracts language features from the LLMs and adapts them for the
content moderation task using low-rank adapters, while a dual-path design prevents any performance degradation on the generative task. We
show that LoRA-Guard outperforms existing approaches with 100-1000x lower parameter overhead while maintaining accuracy, enabling
on-device content moderation.
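A hedged sketch of the dual-path design using the Hugging Face peft library (GPT-2 stands in for the chat model; the guard head, target modules, and ranks are illustrative assumptions, not the released LoRA-Guard code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tok = AutoTokenizer.from_pretrained("gpt2")
base = AutoModelForCausalLM.from_pretrained("gpt2")          # stand-in chat LLM
model = get_peft_model(base, LoraConfig(r=8, lora_alpha=16,
                                        target_modules=["c_attn"],
                                        task_type="CAUSAL_LM"))
guard_head = torch.nn.Linear(base.config.n_embd, 2)          # safe / unsafe

def moderate(text):
    ids = tok(text, return_tensors="pt")
    h = model(**ids, output_hidden_states=True).hidden_states[-1]
    return guard_head(h[:, -1])          # guard path: adapters + small head

def chat(text):
    ids = tok(text, return_tensors="pt")
    with model.disable_adapter():        # generative path: base weights only
        return model.generate(**ids, max_new_tokens=20)
```

Because the guard path only adds low-rank adapters and a tiny head on top of features the chat model already computes, the parameter overhead stays small and the generative path is untouched.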
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
"Global is Good, Local is Bad?": Understanding Brand Bias in LLMs
Mahammed Kamruzzaman, Hieu Minh Nguyen, Gene Louis Kim
Many recent studies have investigated social biases in LLMs but brand bias has received little attention. This research examines the biases
exhibited by LLMs towards different brands, a significant concern given the widespread use of LLMs in affected use cases such as product
recommendation and market analysis. Biased models may perpetuate societal inequalities, unfairly favoring established global brands while
marginalizing local ones. Using a curated dataset across four brand categories, we probe the behavior of LLMs in this space. We find a consis-
tent pattern of bias in this space—both in terms of disproportionately associating global brands with positive attributes and disproportionately
recommending luxury gifts for individuals in high-income countries. We also find LLMs are subject to country-of-origin effects which may
boost local brand preference in LLM outputs in specific contexts.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
ModSCAN: Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities
Yukun Jiang, Zheng Li, Xinyue Shen, Yugeng Liu, Michael Backes, Yang Zhang
Large vision-language models (LVLMs) have been rapidly developed and widely used in various fields, but the (potential) stereotypical bias
in the model is largely unexplored. In this study, we present a pioneering measurement framework, ModSCAN, to SCAN the stereotypical
bias within LVLMs from both vision and language Modalities. ModSCAN examines stereotypical biases with respect to two typical stereo-
typical attributes (gender and race) across three kinds of scenarios: occupations, descriptors, and persona traits. Our findings suggest that 1)
the currently popular LVLMs show significant stereotype biases, with CogVLM emerging as the most biased model; 2) these stereotypical
biases may stem from the inherent biases in the training dataset and pre-trained models; 3) the utilization of specific prompt prefixes (from
both vision and language modalities) performs well in reducing stereotypical biases. We believe our work can serve as the foundation for
understanding and addressing stereotypical bias in LVLMs.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
Yi Zeng, Weiyu Sun, Tran Ngoc Huynh, Dawn Song, Bo Li, Ruoxi Jia
Safety backdoor attacks in large language models (LLMs) enable harmful behaviors to be stealthily triggered while evading detection during
normal interactions. The high dimensionality of the trigger search space and the diverse range of potential malicious behaviors in LLMs
make this a critical open problem. This paper presents BEEAR, a novel mitigation method based on a key insight: backdoor triggers induce
a uniform drift in the model’s embedding space, irrespective of the trigger’s form or targeted behavior. Leveraging this observation, we
introduce a bi-level optimization approach. The inner level identifies universal perturbations to the decoder’s embeddings that steer the model
towards defender-defined unwanted behaviors; the outer level fine-tunes the model to reinforce safe behaviors against these perturbations.
Our experiments demonstrate the effectiveness of this approach, reducing the success rate of safety backdoor attacks from over 95% to <1%
for general harmful behaviors and from 47% to 0% for Sleeper Agents, without compromising the model’s helpfulness. Notably, our method
relies only on defender-defined sets of safe and unwanted behaviors without any assumptions about the trigger location or attack mechanism.
This work represents the first practical framework to counter safety backdoors in LLMs and provides a foundation for future advancements in
AI safety and security.
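The bi-level structure can be conveyed with a self-contained toy (a tiny linear classifier replaces the LLM, and the losses are placeholders for the defender-defined behavior sets, not BEEAR's exact objective):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 2)                  # stand-in for the LLM
x = torch.randn(64, 16)                         # stand-in prompt embeddings
safe = torch.zeros(64, dtype=torch.long)        # defender-defined safe label
unsafe = torch.ones(64, dtype=torch.long)       # defender-defined unwanted label

delta = torch.zeros(1, 16, requires_grad=True)  # universal embedding drift
opt_inner = torch.optim.Adam([delta], lr=0.1)
opt_outer = torch.optim.Adam(model.parameters(), lr=0.01)
ce = torch.nn.CrossEntropyLoss()

for _ in range(100):
    for _ in range(5):                           # inner: find the drift that
        opt_inner.zero_grad()                    # steers toward unwanted output
        ce(model(x + delta), unsafe).backward()
        opt_inner.step()
    opt_outer.zero_grad()                        # outer: fine-tune to stay
    ce(model(x + delta.detach()), safe).backward()  # safe under that drift
    opt_outer.step()
```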
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Linguistic Bias in ChatGPT: Language Models Reinforce Dialect Discrimination
Eve Fleisig, Genevieve Smith, Madeline Bossi, Ishita Rustagi, Xavier Yin, Dan Klein
We present a large-scale study of linguistic bias exhibited by ChatGPT covering ten dialects of English (Standard American English, Standard
British English, and eight widely spoken non-"standard" varieties from around the world). We prompted GPT-3.5 Turbo and GPT-4 with
text by native speakers of each variety and analyzed the responses via detailed linguistic feature annotation and native speaker evaluation.
We find that the models default to "standard" varieties of English; based on evaluation by native speakers, we also find that model responses
to non-"standard" varieties consistently exhibit a range of issues: stereotyping (19% worse than for "standard" varieties), demeaning content
(25% worse), lack of comprehension (9% worse), and condescending responses (15% worse). Moreover, if these models are asked to imitate
the writing style of prompts in non-"standard" varieties, they produce text that exhibits lower comprehension of the input and is especially
prone to stereotyping. GPT-4 improves on GPT-3.5 in terms of comprehension, warmth, and friendliness, but also exhibits a marked increase
in stereotyping (+18%). The results indicate that GPT-3.5 Turbo and GPT-4 can perpetuate linguistic discrimination toward speakers of
non-"standard" varieties.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
BiasAlert: A Plug-and-play Tool for Social Bias Detection in LLMs
Zhiting Fan, Ruizhe Chen, Ruiling Xu, Zuozhu Liu
Evaluating the bias of LLMs becomes more crucial with their rapid development. However, existing evaluation approaches rely on fixed-form
outputs and cannot adapt to the flexible open-text generation scenarios of LLMs (e.g., sentence completion and question answering). To ad-
dress this, we introduce BiasAlert, a plug-and-play tool designed to detect social bias in open-text generations of LLMs. BiasAlert integrates
external human knowledge with its inherent reasoning capabilities to detect bias reliably. Extensive experiments demonstrate that BiasAlert
significantly outperforms existing state-of-the-art methods like GPT-4-as-Judge in detecting bias. Furthermore, through application studies,
we showcase the utility of BiasAlert in reliable LLM fairness evaluation and bias mitigation across various scenarios. Model and code will be
publicly released.
tion. Across multiple datasets, languages, and types of users, our study shows that feminine post-editing demands significantly more technical
and temporal effort, also corresponding to higher financial costs. Existing bias measurements, however, fail to reflect the found disparities.
Our findings advocate for human-centered approaches that can inform the societal impact of bias.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Jailbreaking LLMs with Arabic Transliteration and Arabizi
Mansour Al Ghanim, Saleh Almohaimeed, Mengxin Zheng, Yan Solihin, Qian Lou
This study identifies the potential vulnerabilities of Large Language Models (LLMs) to ’jailbreak’ attacks, specifically focusing on the Arabic
language and its various forms. While most research has concentrated on English-based prompt manipulation, our investigation broadens
the scope to investigate the Arabic language. We initially tested the AdvBench benchmark in Standardized Arabic, finding that even with
prompt manipulation techniques like prefix injection, it was insufficient to provoke LLMs into generating unsafe content. However, when
using Arabic transliteration and chatspeak (or arabizi), we found that unsafe content could be produced on platforms like OpenAI GPT-4
and Anthropic Claude 3 Sonnet. Our findings suggest that using Arabic and its various forms could expose information that might remain
hidden, potentially increasing the risk of jailbreak attacks. We hypothesize that this exposure could be due to the model’s learned connection
to specific words, highlighting the need for more comprehensive safety training across all language forms.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Who is better at math, Jenny or Jingzhen? Uncovering Stereotypes in Large Language Models
Zara Siddique, Liam Turner, Luis Espinosa-Anke
Large language models (LLMs) have been shown to propagate and amplify harmful stereotypes, particularly those that disproportionately
affect marginalised communities. To understand the effect of these stereotypes more comprehensively, we introduce GlobalBias, a dataset
of 876k sentences incorporating 40 distinct gender-by-ethnicity groups alongside descriptors typically used in bias literature, which enables
us to study a broad set of stereotypes from around the world. We use GlobalBias to directly probe a suite of LMs via perplexity, which
we use as a proxy to determine how certain stereotypes are represented in the model’s internal representations. Following this, we generate
character profiles based on given names and evaluate the prevalence of stereotypes in model outputs. We find that the demographic groups
associated with various stereotypes remain consistent across model likelihoods and model outputs. Furthermore, larger models consistently
display higher levels of stereotypical outputs, even when explicitly instructed not to.
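The perplexity-probing step is standard and can be sketched with Hugging Face transformers (GPT-2 and the probe sentences are stand-ins; the GlobalBias data is not reproduced here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(sentence):
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(input_ids=ids, labels=ids).loss   # mean token NLL
    return torch.exp(loss).item()

# Lower perplexity on one variant suggests the association is more
# strongly encoded for that demographic group.
print(perplexity("Jenny is good at math."))
print(perplexity("Jingzhen is good at math."))
```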
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
The Generation Gap: Exploring Age Bias Underlying in the Value Systems of Large Language Models
Siyang Liu, Trisha Maturi, Bowen Yi, Siqi Shen, Rada Mihalcea
We explore the alignment of values in Large Language Models (LLMs) with specific age groups, leveraging data from the World Value
Survey across thirteen categories. Through a diverse set of prompts tailored to ensure response robustness, we find a general inclination of
LLM values towards younger demographics, especially when compared to the US population. Although a general inclination can be ob-
served, we also found that this inclination toward younger groups can be different across different value categories. Additionally, we explore
the impact of incorporating age identity information in prompts and observe challenges in mitigating value discrepancies with different age
cohorts. Our findings highlight the age bias in LLMs and provide insights for future work. Materials for our analysis will be available via
https://github.com/anonymous
in LLM serving as the origination of knowledge propagation as "deduction anchors". However, current KE approaches operate only on (subject,
relation, object) triples. We both theoretically and empirically observe that this simplified setting often leads to uncertainty when
determining the deduction anchors, causing low confidence in their answers. To mitigate this issue, we propose a novel task of event-based
knowledge editing that pairs facts with event descriptions. This task manifests not only a closer simulation of real-world editing scenarios
but also a more logically sound setting, implicitly defining the deduction anchor and enabling LLMs to propagate knowledge confidently.
We curate a new benchmark dataset Evedit derived from the CounterFact dataset and validate its superiority in improving model confidence.
Moreover, while we observe that the event-based setting is significantly challenging for existing approaches, we propose a novel approach
Self-Edit that showcases stronger performance, achieving 55.6% consistency improvement while maintaining the naturalness of generation.
Nov 12 (Tue) 14:00-15:30 - Jasmine
SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales
Tianyang Xu, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, Jing Gao
Large language models (LLMs) often generate inaccurate or fabricated information and generally fail to indicate their confidence, which limits
their broader applications. Previous work has elicited confidence from LLMs by direct or self-consistency prompting, or constructing specific
datasets for supervised finetuning. The prompting-based approaches have inferior performance, and the training-based approaches are limited
to binary or inaccurate group-level confidence estimates. In this work, we present SaySelf, a novel training framework that teaches LLMs to
express more fine-grained confidence estimates. In addition, beyond the confidence scores, SaySelf initiates the process of directing LLMs to
produce self-reflective rationales that clearly identify gaps in their parametric knowledge and explain their uncertainty. This is achieved by
using an LLM to automatically summarize the uncertainties in specific knowledge via natural language. The summarization is based on the
analysis of the inconsistency in multiple sampled reasoning chains, and the resulting data is utilized for supervised fine-tuning. Moreover, we
utilize reinforcement learning with a meticulously crafted reward function to calibrate the confidence estimates, motivating LLMs to deliver
accurate, high-confidence predictions and to penalize overconfidence in erroneous outputs. Experimental results demonstrate the effectiveness
of SaySelf in reducing the confidence calibration error and maintaining the task performance. The generated self-reflective rationales are also
reasonable and can further contribute to the calibration. The code is made public at https://github.com/xu1868/SaySelf.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Investigating Mysteries of CoT-Augmented Distillation
Somin Wadhwa, Silvio Amir, Byron C Wallace
Eliciting chain of thought (CoT) rationales (sequences of tokens that convey a "reasoning" process) has been shown to consistently improve
LLM performance on tasks like question answering. More recent efforts have shown that such rationales can also be used for model distil-
lation: Including CoT sequences (elicited from a large "teacher" model) in addition to target labels when fine-tuning a small student model
yields (often substantial) improvements. In this work we ask: Why and how does this additional training signal help in model distillation?
We perform ablations to interrogate this, and report some potentially surprising results. Specifically: (1) Placing CoT sequences after labels
(rather than before) realizes consistently better downstream performance – this means that no student "reasoning" is necessary at test time to
realize gains. (2) When rationales are appended in this way, they need not be coherent reasoning sequences to yield improvements; perfor-
mance increases are robust to permutations of CoT tokens, for example. In fact, (3) a small number of key tokens are sufficient to achieve
improvements equivalent to those observed when full rationales are used in model distillation.
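Finding (1) amounts to a change in how distillation examples are serialized, roughly as below (the field names and templates are illustrative, not the paper's exact prompts):

```python
def format_example(question, label, rationale, cot_after_label=True):
    """Serialize one distillation example (fields are illustrative)."""
    if cot_after_label:
        # Finding (1): label first, teacher rationale appended afterwards,
        # so the student needs no test-time "reasoning" to realize gains.
        return f"Q: {question}\nA: {label}\nBecause: {rationale}"
    # Conventional ordering: rationale first, then the answer.
    return f"Q: {question}\nBecause: {rationale}\nA: {label}"

print(format_example("2+2?", "4", "two plus two equals four"))
```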
Nov 12 (Tue) 14:00-15:30 - Jasmine
Personas as a Way to Model Truthfulness in Language Models
Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, He He
Large language models (LLMs) are trained on vast amounts of text from the internet, which contains both factual and misleading information
about the world. While unintuitive from a classic view of LMs, recent work has shown that the truth value of a statement can be elicited
from the model's representations. This paper presents an explanation for why LMs appear to know the truth despite not being trained with
truth labels. We hypothesize that the pretraining data is generated by groups of (un)truthful agents whose outputs share common features,
and they form an (un)truthful persona. By training on this data, LMs can infer and represent the persona in their activation space. This allows
the model to separate truth from falsehoods and controls the truthfulness of its generation. We show evidence for the persona hypothesis via
two observations: (1) we can probe whether a model's answer will be truthful before it is generated; (2) finetuning a model on a set of facts
improves its truthfulness on unseen topics. Next, using arithmetic as a synthetic environment, we show that structures of the pretraining data
are crucial for the model to infer the truthful persona. Overall, our findings suggest that models can exploit hierarchical structures in the data
to learn abstract concepts like truthfulness.
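Observation (1) is, mechanically, a linear probe on hidden activations; a self-contained sketch with synthetic activations (real ones would come from the LM's layers):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 768))            # stand-in hidden states
truthful = (acts[:, 0] > 0).astype(int)       # stand-in truthfulness labels

probe = LogisticRegression(max_iter=1000).fit(acts[:150], truthful[:150])
print("held-out probe accuracy:", probe.score(acts[150:], truthful[150:]))
```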
Nov 12 (Tue) 14:00-15:30 - Jasmine
Discovering Knowledge-Critical Subnetworks in Pretrained Language Models
Deniz Bayazit, Negar Foroutan, Zeming Chen, Gail Weiss, Antoine Bosselut
Pretrained language models (LMs) encode implicit representations of knowledge in their parameters. However, localizing these representa-
tions and disentangling them from each other remains an open problem. In this work, we investigate whether pretrained language models
contain various *knowledge-critical* subnetworks: particular sparse computational subgraphs that can, if removed, precisely suppress specific
knowledge the model has memorized. We propose a multi-objective differentiable masking scheme that can be applied to both weights and
neurons to discover such subnetworks and show that we can use them to precisely remove specific knowledge from models while minimizing
adverse effects on the behavior of the original model. We demonstrate our method on multiple GPT2 variants, uncovering highly sparse sub-
networks (98%+ sparsity) that are critical for expressing specific collections of relational knowledge. When these subnetworks are removed,
the remaining network maintains most of its initial abilities but struggles to represent the suppressed knowledge.
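A toy of the differentiable-masking ingredient (one frozen linear layer instead of a GPT-2 variant, and simplified objectives rather than the paper's full multi-objective scheme):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.Linear(32, 32)
for p in layer.parameters():
    p.requires_grad_(False)                       # freeze the "model"
logits = torch.nn.Parameter(torch.full_like(layer.weight, 3.0))
opt = torch.optim.Adam([logits], lr=0.05)

x_keep, x_forget = torch.randn(16, 32), torch.randn(16, 32)
for _ in range(200):
    mask = torch.sigmoid(logits)                  # soft 0/1 mask over weights
    out_keep = F.linear(x_keep, layer.weight * mask, layer.bias)
    out_forget = F.linear(x_forget, layer.weight * mask, layer.bias)
    keep_loss = ((out_keep - layer(x_keep)) ** 2).mean()       # preserve behavior
    forget_div = ((out_forget - layer(x_forget)) ** 2).mean()  # suppress target
    loss = keep_loss - 0.1 * forget_div.clamp(max=10.0) + 1e-3 * mask.mean()
    opt.zero_grad(); loss.backward(); opt.step()
```

The tension between the preservation term, the divergence term, and the sparsity pressure is what isolates a small set of weights that carry the targeted knowledge.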
Nov 12 (Tue) 14:00-15:30 - Jasmine
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Yuping Lin, Pengfei He, Han Xu, Yue Xing, Makoto Yamada, Hui Liu, Jiliang Tang
Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs to output harmful content.
Although there are diverse jailbreak attack strategies, there is no unified understanding on why some methods succeed and others fail. This
paper explores the behavior of harmful and harmless prompts in the LLM’s representation space to investigate the intrinsic properties of
successful jailbreak attacks. We hypothesize that successful attacks share some similar properties: They are effective in moving the rep-
resentation of the harmful prompt towards the direction of the harmless prompts. We incorporate hidden representations into the objective of
existing jailbreak attacks to move the attacks along the acceptance direction, and conduct experiments to validate the above hypothesis using
the proposed objective. We hope this study provides new insights into understanding how LLMs understand harmfulness information.
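The core quantity, a direction from harmful toward harmless representations, can be estimated as a mean difference (synthetic vectors below; in practice these are hidden states extracted from the LLM):

```python
import numpy as np

def acceptance_direction(harmless_reps, harmful_reps):
    """Both: (n_prompts, hidden_dim) arrays of prompt representations."""
    d = harmless_reps.mean(axis=0) - harmful_reps.mean(axis=0)
    return d / np.linalg.norm(d)

def projection_score(rep, direction):
    # Per the hypothesis, successful attacks move this score toward
    # the harmless side.
    return float(rep @ direction)

rng = np.random.default_rng(0)
harmless = rng.normal(0.5, 1.0, size=(32, 64))    # synthetic stand-ins
harmful = rng.normal(-0.5, 1.0, size=(32, 64))
d = acceptance_direction(harmless, harmful)
print(projection_score(harmless[0], d), projection_score(harmful[0], d))
```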
processing (NLP) tasks. However, there are significant differences in the knowledge and abilities required for different tasks. Therefore,
it is important to understand whether the same LLM processes different tasks in the same way. Are there specific neurons in an LLM for
different tasks? Inspired by neuroscience, this paper pioneers the exploration of whether distinct neurons are activated when an LLM handles
different tasks. Compared with current research exploring the neurons of language and knowledge, task-specific neurons present a greater
challenge due to their abstractness, diversity, and complexity. To address these challenges, this paper proposes a method for task-specific neu-
ron localization based on Causal Gradient Variation with Special Tokens (CGVST). CGVST identifies task-specific neurons by concentrating
on the most significant tokens during task processing, thereby eliminating redundant tokens and minimizing interference from non-essential
neurons. Compared to traditional neuron localization methods, our approach can more effectively identify task-specific neurons. We conduct
experiments across eight different public tasks. Experiments involving the inhibition and amplification of identified neurons demonstrate that
our method can accurately locate task-specific neurons.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Unveiling Factual Recall Behaviors of Large Language Models through Knowledge Neurons
Yifei Wang, Yuheng Chen, Wanting Wen, Yu Sheng, Linjing Li, Daniel Dajun Zeng
In this paper, we investigate whether Large Language Models (LLMs) actively recall or retrieve their internal repositories of factual knowl-
edge when faced with reasoning tasks. Through an analysis of LLMs’ internal factual recall at each reasoning step via Knowledge Neurons,
we reveal that LLMs fail to harness the critical factual associations under certain circumstances. Instead, they tend to opt for alternative,
shortcut-like pathways to answer reasoning questions. By manually manipulating the recall process of parametric knowledge in LLMs, we
demonstrate that enhancing this recall process directly improves reasoning performance whereas suppressing it leads to notable degradation.
Furthermore, we assess the effect of Chain-of-Thought (CoT) prompting, a powerful technique for addressing complex reasoning tasks. Our
findings indicate that CoT can intensify the recall of factual knowledge by encouraging LLMs to engage in orderly and reliable reasoning.
Finally, we explore how contextual conflicts affect the retrieval of facts during the reasoning process to gain a comprehensive under-
standing of the factual recall behaviors of LLMs. Code and data will be available soon.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment
Vyas Raina, Adian Liusie, Mark Gales
Large Language Models (LLMs) are powerful zero-shot assessors used in real-world situations such as assessing written exams and bench-
marking systems. Despite these critical applications, no existing work has analyzed the vulnerability of judge-LLMs to adversarial manip-
ulation. This work presents the first study on the adversarial robustness of assessment LLMs, where we demonstrate that short universal
adversarial phrases can be concatenated to deceive judge LLMs to predict inflated scores. Since adversaries may not know or have access
to the judge-LLMs, we propose a simple surrogate attack where a surrogate model is first attacked, and the learned attack phrase then trans-
ferred to unknown judge-LLMs. We propose a practical algorithm to determine the short universal attack phrases and demonstrate that when
transferred to unseen models, scores can be drastically inflated such that irrespective of the assessed text, maximum scores are predicted. It
is found that judge-LLMs are significantly more susceptible to these adversarial attacks when used for absolute scoring, as opposed to com-
parative assessment. Our findings raise concerns on the reliability of LLM-as-a-judge methods, and emphasize the importance of addressing
vulnerabilities in LLM assessment methods before deployment in high-stakes real-world scenarios.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Can Large Language Models Faithfully Express Their Intrinsic Uncertainty in Words?
Gal Yona, Roee Aharoni, Mor Geva
We posit that large language models (LLMs) should be capable of expressing their intrinsic uncertainty in natural language. For example, if
the LLM is equally likely to output two contradicting answers to the same question, then its generated response should reflect this uncertainty
by hedging its answer (e.g., "I'm not sure, but I think..."). We formalize faithful response uncertainty based on the gap between the model's
intrinsic confidence in the assertions it makes and the decisiveness by which they are conveyed. This example-level metric reliably indicates
whether the model reflects its uncertainty, as it penalizes both excessive and insufficient hedging. We evaluate a variety of aligned LLMs at
faithfully conveying uncertainty on several knowledge-intensive question answering tasks. Our results provide strong evidence that modern
LLMs are poor at faithfully conveying their uncertainty, and that better alignment is necessary to improve their trustworthiness.
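One illustrative way to read the metric (the paper's formal definition is more careful; this only shows its shape, penalizing over- and under-hedging symmetrically):

```python
def faithfulness_gap(confidence, decisiveness):
    """Both in [0, 1]; 0 means hedging exactly matches intrinsic confidence."""
    return abs(confidence - decisiveness)

print(faithfulness_gap(confidence=0.5, decisiveness=0.95))  # under-hedged
print(faithfulness_gap(confidence=0.9, decisiveness=0.40))  # over-hedged
```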
Nov 12 (Tue) 14:00-15:30 - Jasmine
Attribute or Abstain: Large Language Models as Long Document Assistants
Jan Buchmann, Xiao Liu, Iryna Gurevych
LLMs can help humans working with long documents, but are known to hallucinate. *Attribution* can increase trust in LLM responses: The
LLM provides evidence that supports its response, which enhances verifiability. Existing approaches to attribution have only been evaluated
in RAG settings, where the initial retrieval confounds LLM performance. This is crucially different from the long document setting, where
retrieval is not needed, but could help. Thus, a long document specific evaluation of attribution is missing. To fill this gap, we present LAB, a
benchmark of 6 diverse long document tasks with attribution, and experiments with different approaches to attribution on 5 LLMs of different
sizes. We find that *citation*, i.e. response generation and evidence extraction in one step, performs best for large and fine-tuned models,
while additional retrieval can help for small, prompted models. We investigate whether the "Lost in the Middle” phenomenon exists for
attribution, but do not find this. We also find that evidence quality can predict response quality on datasets with simple responses, but not so
for complex responses, as models struggle with providing evidence for complex claims. We release code and data for further investigation.
[Link](https://github.com/UKPLab/arxiv2024-attribute-or-abstain)
LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens.
However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b’s
tokenizer splits the word "patrolling" into two tokens, "pat" and "rolling", neither of which corresponds to semantically meaningful units like
"patrol" or "-ing." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be
directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level
representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure"
effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method
to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present
results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
instructions, our work reveals that they can indeed lead to non-trivial property inheritance behavior in LMs. However, this ability is incon-
sistent: with a minimal reformulation of the task, some LMs were found to pick up on shallow, non-semantic heuristics from their inputs,
suggesting that the computational principles of semantic property inference are yet to be mastered by LMs.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Latent Concept-based Explanation of NLP Models
Xuemin Yu, Fahim Dalvi, Nadir Durrani, Marzia Nouri, Hassan Sajjad
Interpreting and understanding the predictions made by deep learning models poses a formidable challenge due to their inherently opaque
nature. Many previous efforts aimed at explaining these predictions rely on input features, specifically, the words within NLP models. How-
ever, such explanations are often less informative due to the discrete nature of these words and their lack of contextual verbosity. To address
this limitation, we introduce the Latent Concept Attribution method (LACOAT), which generates explanations for predictions based on latent
concepts. Our foundational intuition is that a word can exhibit multiple facets, contingent upon the context in which it is used. Therefore,
given a word in context, the latent space derived from our training process reflects a specific facet of that word. LACOAT functions by
mapping the representations of salient input words into the training latent space, allowing it to provide latent context-based explanations of
the prediction.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Cluster-Norm for Unsupervised Probing of Knowledge
Walter Laurito, Sharan Maiya, Grégoire DHIMOÏLA, Owen Ho Wan Yeung, Kaarel Hänni
The deployment of language models brings challenges in generating reliable text, especially when these models are fine-tuned with human
preferences. To extract the encoded knowledge in these models without (potentially) biased human labels, unsupervised probing techniques
like Contrast-Consistent Search (CCS) have been developed (Burns et al., 2022). However, salient but unrelated features in activation space
can mislead these probes (Farquhar et al., 2023). Addressing this, we propose a cluster-normalization method to minimize the impact of such
features by clustering and normalizing activations of contrast pairs before applying unsupervised probing techniques. While this approach
does not address the issue of distinguishing between latent knowledge and that portrayed by a simulated agent, a major issue in the literature
of eliciting latent knowledge (Paul Christiano and Xu, 2021), it still significantly improves the accuracy of probes in identifying the intended
knowledge amidst distractions.
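The normalization step itself can be sketched with scikit-learn (an illustration of the idea, not the authors' code; the cluster count and scaling details are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_normalize(acts, n_clusters=4):
    """acts: (n_pairs, hidden_dim) contrast-pair activations."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(acts)
    out = acts.copy()
    for c in range(n_clusters):
        idx = labels == c
        out[idx] -= out[idx].mean(axis=0)          # remove cluster-level offset
        out[idx] /= out[idx].std(axis=0) + 1e-6    # rescale within cluster
    return out                     # then run CCS or another probe on this

rng = np.random.default_rng(0)
acts = rng.normal(size=(100, 16)) + rng.integers(0, 4, size=(100, 1)) * 5.0
normed = cluster_normalize(acts)
print(normed.mean(), normed.std())                 # roughly centered and scaled
```

Centering within clusters removes salient-but-irrelevant offsets (the distractor features) before the probe ever sees the activations.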
Nov 12 (Tue) 14:00-15:30 - Jasmine
LLM Task Interference: An Initial Study on the Impact of Task-Switch in Conversational History
Akash Gupta, Ivaxi Sheth, Vyas Raina, Mark Gales, Mario Fritz
With the recent emergence of powerful instruction-tuned large language models (LLMs), various helpful conversational Artificial Intelligence
(AI) systems have been deployed across many applications. When prompted by users, these AI systems successfully perform a wide range of
tasks as part of a conversation. To provide some sort of memory and context, such approaches typically condition their output on the entire
conversational history. Although this sensitivity to the conversational history can often lead to improved performance on subsequent tasks,
we find that performance can in fact also be negatively impacted, if there is a _task-switch_. To the best of our knowledge, our work makes
the first attempt to formalize the study of such vulnerabilities and interference of tasks in conversational LLMs caused by task-switches in the
conversational history. Our experiments across 5 datasets with 15 task switches using popular LLMs reveal that many of the task-switches
can lead to significant performance degradation.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Layer by Layer: Uncovering Where Multi-Task Learning Happens in Instruction-Tuned Large Language Models
Zheng Zhao, Yftah Ziser, Shay B Cohen
Fine-tuning pre-trained large language models (LLMs) on a diverse array of tasks has become a common approach for building models that
can solve various natural language processing (NLP) tasks. However, where and to what extent these models retain task-specific knowledge
remains largely unexplored. This study investigates the task-specific information encoded in pre-trained LLMs and the effects of instruction
tuning on their representations across a diverse set of over 60 NLP tasks. We use a set of matrix analysis tools to examine the differences be-
tween the way pre-trained and instruction-tuned LLMs store task-specific information. Our findings reveal that while some tasks are already
encoded within the pre-trained LLMs, others greatly benefit from instruction tuning. Additionally, we pinpointed the layers in which the
model transitions from high-level general representations to more task-oriented representations. This finding extends our understanding of the
governing mechanisms of LLMs and facilitates future research in the fields of parameter-efficient transfer learning and multi-task learning.
Our code is available at: https://github.com/zsquaredz/layer_by_layer/
Nov 12 (Tue) 14:00-15:30 - Jasmine
CMR Scaling Law: Predicting Critical Mixture Ratios for Continual Pre-training of Language Models
Jiawei Gu, Zacc Yang, Chuanghao Ding, Rui Zhao, Fei Tan
Large Language Models (LLMs) excel in diverse tasks but often underperform in specialized fields due to limited domain-specific or proprietary
corpora. Continual pre-training (CPT) enhances LLM capabilities by imbuing new domain-specific or proprietary knowledge while
replaying general corpus to prevent catastrophic forgetting. The data mixture ratio of general corpus and domain-specific corpus, however,
has been chosen heuristically, leading to sub-optimal training efficiency in practice. In this context, we attempt to re-visit the scaling be-
havior of LLMs under the hood of CPT, and discover a power-law relationship between loss, mixture ratio, and training tokens scale. We
formalize the trade-off between general and domain-specific capabilities, leading to a well-defined Critical Mixture Ratio (CMR) of general
and domain data. By striking the balance, CMR maintains the model’s general ability and achieves the desired domain transfer, ensuring the
highest utilization of available resources. Considering the balance between efficiency and effectiveness, CMR can be regarded as the optimal
mixture ratio. Through extensive experiments, we ascertain the predictability of CMR, propose CMR scaling law and have substantiated
its generalization. These findings offer practical guidelines for optimizing LLM training in specialized domains, ensuring both general and
domain-specific performance while efficiently managing training resources.
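The fitting step behind such a scaling law can be sketched as follows (the functional form, the irreducible-loss constant, and all numbers are illustrative assumptions, not the paper's measurements):

```python
import numpy as np

tokens = np.array([1e9, 2e9, 4e9, 8e9])     # made-up observations
loss = np.array([2.9, 2.7, 2.52, 2.36])
e = 2.0                                     # assumed irreducible loss
beta, log_a = np.polyfit(np.log(tokens), np.log(loss - e), 1)
print(f"loss ~ {np.exp(log_a):.3g} * tokens^{beta:.3f} + {e}")
# Repeating the fit across mixture ratios and comparing the resulting
# curves is one way to locate a critical mixture ratio.
```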
Nov 12 (Tue) 14:00-15:30 - Jasmine
Information Flow Routes: Automatically Interpreting Language Models at Scale
Javier Ferrando, Elena Voita
Information flows by routes inside the network via mechanisms implemented in the model. These routes can be represented as graphs where
nodes correspond to token representations and edges to computations. We automatically build these graphs in a top-down manner, for each
prediction leaving only the most important nodes and edges. In contrast to the existing workflows relying on activation patching, we do this
through attribution: this allows us to efficiently uncover existing circuits with just a single forward pass. Unlike with patching, we do not
need a human to carefully design prediction templates, and we can extract information flow routes for any prediction (not just the ones among
the allowed templates). As a result, we can analyze model behavior in general, for specific types of predictions, or different domains. We
experiment with Llama 2 and show that some attention head roles are overall important, e.g. previous token heads and subword merging
heads. Next, we find similarities in Llama 2 behavior when handling tokens of the same part of speech. Finally, we show that some model
models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs.
Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of
a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary’s
effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model's persuasive ability in
influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of
prompt-based mitigation as a defensive strategy.
Nov 12 (Tue) 14:00-15:30 - Jasmine
InternalInspector I²: Robust Confidence Estimation in LLMs through Internal States
Mohammad Beigi, Ying Shen, Runing Yang, Zihao Lin, Qifan Wang, Ankith Mohan, Jianfeng He, Ming Jin, Chang-Tien Lu, Lifu Huang
Despite their vast capabilities, Large Language Models (LLMs) often struggle with generating reliable outputs, frequently producing high-
confidence inaccuracies known as hallucinations. Addressing this challenge, our research introduces InternalInspector, a novel framework
designed to enhance confidence estimation in LLMs by leveraging contrastive learning on internal states including attention states, feed-
forward states, and activation states of all layers. Unlike existing methods that primarily focus on the final activation state, InternalInspector
conducts a comprehensive analysis across all internal states of every layer to accurately identify both correct and incorrect prediction pro-
cesses. By benchmarking InternalInspector against existing confidence estimation methods across various natural language understanding
and generation tasks, including factual question answering, commonsense reasoning, and reading comprehension, InternalInspector achieves
significantly higher accuracy in aligning the estimated confidence scores with the correctness of the LLM’s predictions and lower calibration
error. Furthermore, InternalInspector excels at HaluEval, a hallucination detection benchmark, outperforming other internal-based confidence
estimation methods in this task.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon
A text encoder within Vision-Language Models (VLMs) like CLIP plays a crucial role in translating textual input into an embedding space
shared with images, thereby facilitating the interpretative analysis of vision tasks through natural language. Despite the varying significance
of different textual elements within a sentence depending on the context, efforts to account for variation of importance in constructing text
embeddings have been lacking. We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI),
which incorporates controllability as well. SToRI refines the text encoding process in CLIP by differentially weighting semantic elements
based on contextual importance, enabling finer control over emphasis responsive to data-driven insights and user preferences. The efficacy of
SToRI is demonstrated through comprehensive experiments on few-shot image classification and image retrieval tailored to user preferences.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Infer-then-Verbalize: How do LMs Map true/false to cat/dog During In-Context Learning?
Junyi Tao, Xiaoyin Chen, Nelson F. Liu
Large language models (LMs) are capable of in-context learning from a few demonstrations (example-label pairs) to solve new tasks during
inference. Despite the intuitive importance of high-quality demonstrations, previous work has observed that, in some settings, ICL perfor-
mance is minimally affected by irrelevant labels (Min et al., 2022). We hypothesize that LMs perform ICL with irrelevant labels via two
sequential processes: an inference function that solves the task, followed by a verbalization function that maps the inferred answer to the
label space. Importantly, we hypothesize that the inference function is invariant to remappings of the label space (e.g., true/false to cat/dog),
enabling LMs to share the same inference function across settings with different label words. We empirically validate this hypothesis with
controlled layer-wise interchange intervention experiments. Our findings confirm the hypotheses on multiple datasets and tasks (natural lan-
guage inference, sentiment analysis, and topic classification) and further suggest that the two functions can be localized in specific layers
across various open-sourced models, including GEMMA-7B, MISTRAL-7B-V0.3, GEMMA-2-27B, and LLAMA-3.1-70B.
Language Modeling 2
Nov 12 (Tue) 14:00-15:30 - Room: Riverfront Hall
Commercially available models dominate academic leaderboards. While impressive, this has concentrated research on creating and adapting
general-purpose models to improve NLP leaderboard standings for large language models. However, leaderboards collect many individual
tasks and general-purpose models often underperform in specialized domains; domain-specific or adapted models yield superior results. This
focus on large general-purpose models excludes many academics and draws attention away from areas where they can make important con-
tributions. We advocate for a renewed focus on developing and evaluating domain- and task-specific models, and highlight the unique role of
academics in this endeavor.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Scaling Laws Across Model Architectures: A Comparative Analysis of Dense and MoE Models in Large Language Models
Siqi Wang, Zhengyu Chen, Bei Li, Keqing He, Min Zhang, Jingang Wang
The scaling of large language models (LLMs) is a critical research area for the efficiency and effectiveness of model training and deployment.
Our work investigates the transferability and discrepancies of scaling laws between Dense Models and Mixture of Experts (MoE) models.
Through a combination of theoretical analysis and extensive experiments, including consistent loss scaling, optimal batch size/learning rate
scaling, and resource allocation strategies scaling, our findings reveal that the power-law scaling framework also applies to MoE Models,
indicating that the fundamental principles governing the scaling behavior of these models are preserved, even though the architecture differs.
Additionally, MoE Models demonstrate superior generalization, resulting in lower testing losses with the same training compute budget com-
pared to Dense Models. These findings indicate the scaling consistency and transfer generalization capabilities of MoE Models, providing
new insights for optimizing MoE Model training and deployment strategies.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
Yunmo Chen, Tongfei Chen, Harsh Jhamtani, Patrick Xia, Richard Shin, Jason Eisner, Benjamin Van Durme
We introduce iterative retrieval, a novel framework that empowers retrievers to make iterative decisions through policy optimization. Finding
an optimal portfolio of retrieved items is a combinatorial optimization problem, generally considered NP-hard. This approach provides a
learned approximation to such a solution, meeting specific task requirements under a given family of large language models (LLMs). We
propose a training procedure based on reinforcement learning, incorporating feedback from LLMs. We instantiate an iterative retriever for
composing in-context learning (ICL) exemplars and apply it to various semantic parsing tasks that demand synthesized programs as outputs.
By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever, out-
performing previous methods in selecting ICL exemplars on semantic parsing datasets such as CalFlow, TreeDST, and MTOP. Additionally,
the trained iterative retriever generalizes across different inference LLMs beyond the one used during training.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
LIONs: An Empirically Optimized Approach to Align Language Models
Xiao Yu, Qingyang Wu, Yu Li, Zhou Yu
Alignment is a crucial step to enhance the instruction-following and conversational abilities of language models. Despite many recent works
proposing new algorithms, datasets, and training pipelines, there is a lack of comprehensive studies measuring the impact of various design
choices throughout the whole training process. We first conduct a rigorous analysis over a three-stage training pipeline consisting of super-
vised fine-tuning, offline preference learning, and online preference learning. We have found that using techniques like sequence packing,
loss masking in SFT, increasing the preference dataset size in DPO, and online DPO training can significantly improve the performance of
language models. We then train from Gemma-2b-base and LLama-3-8b-base, and find that our best models exceed the performance of the
official instruct models tuned with closed-source data and algorithms. Our code and models can be found at https://github.com/Columbia-
NLP-Lab/LionAlignment.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Birdie: Advancing State Space Models with a Minimalist Architecture and Novel Pre-training Objectives
Sam Blouir, Jimmy T.H. Smith, Antonios Anastasopoulos, Amarda Shehu
Efficient state space models (SSMs), including linear recurrent neural networks and linear attention variants, have emerged as potential al-
ternative language models to Transformers. While efficient, SSMs struggle with tasks requiring in-context retrieval, such as text copying
and associative recall, limiting their usefulness in practical settings. Prior work on how to meet this challenge has focused on the internal
model architecture and not investigated the role of the training procedure. This paper proposes a new training procedure that improves the
performance of SSMs on retrieval-intensive tasks. This novel pre-training procedure combines a bidirectional processing of the input with
dynamic mixtures of pre-training objectives to improve the utilization of the SSM’s fixed-size state. Our experimental evaluations show that
this procedure significantly improves performance on retrieval-intensive tasks that challenge current SSMs, such as phone book lookup, long
paragraph question-answering, and infilling tasks. Our findings offer insights into a new direction to advance the training of SSMs to close
the performance gap with Transformers.
Furthermore, we present TTE-StablePrompt, an extension for generating input-dependent prompts. It outperforms StablePrompt in tasks that
are hard to solve with a single prompt. This shows that StablePrompt is an extensible and stable RL framework for LLM.
complish the task. This modular framework facilitates individual updates and the potential use of smaller LLMs for building each capability.
To effectively train this framework, we introduce a two-stage training paradigm. First, we fine-tune a backbone LLM on the entire dataset
without discriminating sub-tasks, providing the model with a comprehensive understanding of the task. Second, the fine-tuned LLM is used
to instantiate the planner, caller, and summarizer respectively, which are continually fine-tuned on respective sub-tasks. Evaluation across
various tool-use benchmarks illustrates that our proposed multi-LLM framework surpasses the traditional single-LLM approach, highlighting
its efficacy and advantages in tool learning.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue
Jia-Chen Gu, Hao-Xiang Xu, Jun-Yu Ma, Pan Lu, Zhen-Hua Ling, Kai-Wei Chang, Nanyun Peng
Model editing is a technique that edits the large language models (LLMs) with updated knowledge to alleviate hallucinations without resource-
intensive retraining. While current model editing methods can effectively modify a model's behavior within a specific area of interest, they often overlook potential unintended side effects on the general abilities of LLMs, such as reasoning, natural language inference, and question answering. In this paper, we raise concerns that model editing's improvements in factuality may come at the cost of a significant degradation of the model's general abilities. We systematically analyze the side effects by evaluating four popular editing methods on three
LLMs across eight representative tasks. Our extensive empirical experiments show that it is challenging for current editing methods to si-
multaneously improve factuality of LLMs and maintain their general abilities. Our analysis reveals that the side effects are caused by model
editing altering the original model weights excessively, leading to overfitting to the edited facts. To mitigate this, a method named RECT is
proposed to regularize the edit update weights by imposing constraints on their complexity based on the RElative Change in weighT. Evalua-
tion results show that RECT can significantly mitigate the side effects of editing while still maintaining over 94% editing performance.
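The regularization idea is easy to picture: keep only the components of an edit whose relative weight change is large, and revert the rest. Below is an illustrative sketch of that thresholding step (not the authors' released code; the keep ratio and the exact masking rule are assumptions):

    import torch

    def rect_style_update(w_orig, w_edited, keep_ratio=0.2, eps=1e-8):
        # Rank edit components by relative change |delta| / |w| and
        # keep only the top fraction; the remainder is reverted.
        delta = w_edited - w_orig
        rel_change = delta.abs() / (w_orig.abs() + eps)
        k = max(1, int(keep_ratio * rel_change.numel()))
        threshold = rel_change.flatten().topk(k).values.min()
        return w_orig + delta * (rel_change >= threshold)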
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Björn Deiseroth, Manuel Brack, Samuel Weinbach, Patrick Schramowski, Kristian Kersting
Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain
inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and
head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-Free, which directly embeds words through sparse activation patterns over character triplets and
does not require a reference corpus. T-Free inherently exploits morphological similarities and allows for strong compression of embedding
layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than
85% on these layers. Further, T-Free shows significant improvements in cross-lingual transfer learning.
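To make the mechanism concrete, here is a minimal sketch of embedding a word through hashed character triplets; the slot count, the CRC hashing, and the mean pooling are illustrative assumptions rather than T-Free's exact design:

    import torch
    from zlib import crc32

    def trigram_ids(word, n_slots=8192):
        # Boundary markers included, as is common for character n-grams.
        padded = f"_{word}_"
        trigrams = {padded[i:i + 3] for i in range(len(padded) - 2)}
        return sorted({crc32(t.encode()) % n_slots for t in trigrams})

    def embed_word(word, emb_table):
        # A word's embedding aggregates the rows of its active triplets,
        # so morphologically similar words share most of their rows.
        ids = torch.tensor(trigram_ids(word, emb_table.size(0)))
        return emb_table[ids].mean(dim=0)

    emb_table = torch.randn(8192, 64)
    print(embed_word("tokenizer", emb_table).shape)  # torch.Size([64])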
Our experimental analysis reveals that the aligned models can provide responses that match various preferences among the "3H" (helpfulness, honesty, harmlessness) desiderata. Furthermore, by introducing diverse data and alignment goals, we surpass baseline methods in aligning with single objectives, hence mitigating the impact of the alignment tax and achieving improvements in multi-objective alignment. (Code: https://github.com/avalonstrel/Mitigating-the-Alignment-Tax-of-RLHF.git)
Nov 12 (Tue) 14:00-15:30 - Jasmine
Direct Multi-Turn Preference Optimization for Language Agents
Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, Fuli Feng
Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO)
is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement
Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function.
Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between
preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in
the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent
tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of
the DMPO loss.
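As a rough illustration of the loss shape, the following sketch applies length normalization inside a Bradley-Terry comparison over trajectory log-probabilities, in the spirit of DMPO; the paper's occupancy-measure constraint and exact derivation are not reproduced here:

    import torch
    import torch.nn.functional as F

    def dmpo_style_loss(logps_w, logps_l, beta=0.1):
        # logps_w / logps_l: per-step log-probs of the preferred and
        # dis-preferred multi-turn trajectories (1-D tensors). Dividing
        # by length removes the bias from unequal trajectory lengths.
        score_w = beta * logps_w.sum() / logps_w.numel()
        score_l = beta * logps_l.sum() / logps_l.numel()
        return -F.logsigmoid(score_w - score_l)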
Nov 12 (Tue) 14:00-15:30 - Jasmine
MoDULA: Mixture of Domain-Specific and Universal LoRA for Multi-Task Learning
Yufei Ma, Zihan Liang, Huangyu Dai, Ben Chen, Dehong Gao, Zhuoran Ran, Zihan Wang, Linbo Jin, Wen Jiang, Guannan Zhang, Xiaoyan
Cai, Libin Yang
The growing demand for larger-scale models in the development of Large Language Models (LLMs) poses challenges for efficient train-
ing within limited computational resources. Traditional fine-tuning methods often exhibit instability in multi-task learning and rely heavily
on extensive training resources. Here, we propose MoDULA (Mixture of Domain-Specific and Universal LoRA), a novel Parameter Ef-
ficient Fine-Tuning (PEFT) Mixture-of-Expert (MoE) paradigm for improved fine-tuning and parameter efficiency in multi-task learning.
The paradigm effectively improves the multi-task capability of the model by training universal experts, domain-specific experts, and routers
separately. MoDULA-Res is a new method within the MoDULA paradigm, which maintains the model’s general capability by connect-
ing universal and task-specific experts through residual connections. The experimental results demonstrate that the overall performance of
the MoDULA-Flan and MoDULA-Res methods surpasses that of existing fine-tuning methods on various LLMs. Notably, MoDULA-Res
achieves more significant performance improvements in multiple tasks while reducing training costs by over 80% without losing general
capability. Moreover, MoDULA displays flexible pluggability, allowing for the efficient addition of new tasks without retraining existing
experts from scratch. This progressive training paradigm circumvents data balancing issues, enhancing training efficiency and model stability.
Overall, MoDULA provides a scalable, cost-effective solution for fine-tuning LLMs with enhanced parameter efficiency and generalization
capability.
Nov 12 (Tue) 14:00-15:30 - Jasmine
SEEKR: Selective Attention-Guided Knowledge Retention for Continual Learning of Large Language Models
Jinghan He, Haiyun Guo, Kuan Zhu, Zihan Zhao, Ming Tang, Jinqiao Wang
Continual learning (CL) is crucial for language models to dynamically adapt to the evolving real-world demands. To mitigate the catastrophic
forgetting problem in CL, data replay has been proven a simple and effective strategy, and the subsequent data-replay-based distillation can
further enhance the performance. However, existing methods fail to fully exploit the knowledge embedded in models from previous tasks,
resulting in the need for a relatively large number of replay samples to achieve good results. In this work, we first explore and emphasize
the importance of attention weights in knowledge retention, and then propose a SElective attEntion-guided Knowledge Retention method
(SEEKR) for data-efficient replay-based continual learning of large language models (LLMs). Specifically, SEEKR performs attention distil-
lation on the selected attention heads for finer-grained knowledge retention, where the proposed forgettability-based and task-sensitivity-based
measures are used to identify the most valuable attention heads. Experimental results on two continual learning benchmarks for LLMs demon-
strate the superiority of SEEKR over the existing methods on both performance and efficiency. Explicitly, SEEKR achieves comparable or
even better performance with only 1/10 of the replayed data used by other methods, and reduces the proportion of replayed data to 1%. The
code is available at https://github.com/jinghan1he/SEEKR.
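A simplified picture of the attention-distillation term: match the current model's attention maps to the previous model's, but only on a chosen subset of heads. The head selection by forgettability and task sensitivity is the paper's contribution and is taken as given here; this sketch is illustrative only:

    import torch
    import torch.nn.functional as F

    def attn_distill_loss(attn_new, attn_old, head_ids, eps=1e-8):
        # attn_*: [heads, seq, seq] attention maps; distill only the
        # selected heads so retention focuses on the most valuable ones.
        loss = 0.0
        for h in head_ids:
            loss = loss + F.kl_div((attn_new[h] + eps).log(),
                                   attn_old[h], reduction="batchmean")
        return loss / len(head_ids)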
Nov 12 (Tue) 14:00-15:30 - Jasmine
SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning
Jinghan Jia, Yihua Zhang, Yimeng Zhang, Jiancheng Liu, Bharat Runwal, James Diffenderfer, Bhavya Kailkhura, Sijia Liu
Large Language Models (LLMs) have highlighted the necessity of effective unlearning mechanisms to comply with data regulations and
ethical AI practices. LLM unlearning aims at removing undesired data influences and associated model capabilities without compromising
utility beyond the scope of unlearning. While interest in studying LLM unlearning is growing, the impact of the optimizer choice for LLM
unlearning remains unexplored. In this work, we shed light on the significance of optimizer selection in LLM unlearning for the first time,
establishing a clear connection between second-order optimization and influence unlearning (a classical approach using influence functions to
update the model for data influence removal). This insight propels us to develop a second-order optimization-based LLM unlearning frame-
work, termed Second-Order UnLearning (SOUL), which extends the static, one-shot model update using influence unlearning to a dynamic,
iterative unlearning process. Our extensive experiments show that SOUL consistently outperforms conventional first-order methods across
various unlearning tasks, models, and metrics, indicating that second-order optimization offers an effective and broadly applicable solution
for LLM unlearning.
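For intuition, a second-order-flavoured update preconditions the gradient by a curvature estimate, so each unlearning step accounts for how sensitive the loss is in each direction. The toy sketch below assumes a diagonal curvature estimate (e.g., an EMA of squared gradients); SOUL itself is derived from influence unlearning and is more involved:

    import torch

    @torch.no_grad()
    def preconditioned_step(param, grad, curvature, lr=1e-4, eps=1e-8):
        # Inverse-curvature scaling: sharp directions move less, flat
        # directions move more, as in diagonal Newton-type methods.
        # `curvature` can be maintained as an EMA of grad**2 elsewhere.
        param -= lr * grad / (curvature + eps)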
simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data
to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method
not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate
our method on instruction following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference
Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of
76.7% based on Gemma-2-9b-it. We release the code and models at https://github.com/wzhouad/WPO.
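A schematic of the reweighting step follows; it is an assumption-laden sketch rather than the released implementation, since the paper defines its weight over both responses and treats length differently:

    import torch
    import torch.nn.functional as F

    def wpo_style_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                       len_w, beta=0.1):
        # Inputs are summed sequence log-probs under the policy and the
        # reference model. The pair is downweighted when the preferred
        # response is unlikely under the current policy, approximating
        # on-policy data. Per-token normalization avoids underflow.
        weight = torch.exp(logp_w / len_w).detach()
        margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
        return -weight * F.logsigmoid(margin)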
Nov 12 (Tue) 14:00-15:30 - Jasmine
Contextualized Sequence Likelihood: Enhanced Confidence Scores for Natural Language Generation
Zhen Lin, Shubhendu Trivedi, Jimeng Sun
The advent of large language models (LLMs) has dramatically advanced the state-of-the-art in numerous natural language generation tasks.
For LLMs to be applied reliably, it is essential to have an accurate measure of their confidence. Currently, the most commonly used confi-
dence score function is the likelihood of the generated sequence, which, however, conflates semantic and syntactic components. For instance,
in question-answering (QA) tasks, an awkward phrasing of the correct answer might result in a lower probability prediction. Additionally,
different tokens should be weighted differently depending on the context. In this work, we propose enhancing the predicted sequence prob-
ability by assigning different weights to various tokens using attention values elicited from the base LLM. By employing a validation set,
we can identify the relevant attention heads, thereby significantly improving the reliability of the vanilla sequence probability confidence
measure. We refer to this new score as the Contextualized Sequence Likelihood (CSL). CSL is easy to implement, fast to compute, and offers
considerable potential for further improvement with task-specific prompts. Across several QA datasets and a diverse array of LLMs, CSL
has demonstrated significantly higher reliability than state-of-the-art baselines in predicting generation quality, as measured by the AUROC
or AUARC.
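The score itself is a small computation once the attention-derived token weights are chosen; a minimal sketch (the validation-set head selection is omitted and the weights are assumed given):

    import torch

    def contextualized_seq_likelihood(token_logps, attn_weights):
        # token_logps: [T] log-probs of the generated tokens;
        # attn_weights: [T] nonnegative importance weights elicited
        # from selected attention heads. A weighted sum replaces the
        # uniform sum used by vanilla sequence likelihood.
        w = attn_weights / attn_weights.sum()
        return (w * token_logps).sum()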
Nov 12 (Tue) 14:00-15:30 - Jasmine
SparseGrad: A Selective Method for Efficient Fine-tuning of MLP Layers
Viktoriia A. Chekalina, Anna Rudenko, Gleb Mezentsev, Aleksandr Mikhalev, Alexander Panchenko, Ivan Oseledets
The performance of Transformer models has been enhanced by increasing the number of parameters and the length of the processed text.
Consequently, fine-tuning the entire model becomes a memory-intensive process. High-performance methods for parameter-efficient fine-
tuning (PEFT) typically work with Attention blocks and often overlook MLP blocks, which contain about half of the model parameters. We
propose a new selective PEFT method, namely SparseGrad, that performs well on MLP blocks. We transfer layer gradients to a space where
only about 1% of the layer’s elements remain significant. By converting gradients into a sparse structure, we reduce the number of updated
parameters. We apply SparseGrad to fine-tune BERT and RoBERTa for the NLU task and LLaMa-2 for the question-answering task. In these experiments, with identical memory requirements, our method outperforms LoRA and MeProp, two robust and popular state-of-the-art PEFT approaches.
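The core operation, keeping roughly 1% of gradient entries, can be sketched as a magnitude threshold; note that the paper first transfers gradients into a space where they are sparse, a step this illustration skips:

    import torch

    def sparsify_grad(grad, keep=0.01):
        # Zero out all but the largest-magnitude `keep` fraction of
        # entries, so only those parameters receive updates.
        k = max(1, int(keep * grad.numel()))
        threshold = grad.abs().flatten().topk(k).values.min()
        return grad * (grad.abs() >= threshold)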
Nov 12 (Tue) 14:00-15:30 - Jasmine
Explicit Memory Learning with Expectation Maximization
Zhangyue Yin, Qiushi Sun, Qipeng Guo, Zhiyuan Zeng, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang
Large Language Models (LLMs) have revolutionized the landscape of natural language processing, demonstrating remarkable abilities across
various complex tasks. However, their stateless nature limits the capability to retain information across interactions, hindering performance
in scenarios requiring historical context recall. To mitigate this, current approaches primarily use explicit memory to allow LLMs to store
useful information, which is accessible, readable, and interpretable. Nevertheless, explicit memory lacks the reliable learning mechanisms
of implicit memory, which can be optimized end-to-end. To harness the benefits of both, we introduce EM2 , a novel framework enhancing
explicit memory updates via the Expectation-Maximization (EM) algorithm. EM2 treats memory as a latent variable, ensuring continual
learning and improvement during updates. Experimental results on streaming inference tasks demonstrate that EM2 outperforms existing
methods without memory or with static external memory. Our in-depth analysis highlights that EM2 significantly enhances performance
across various backbones and memory strategies, providing a robust solution for advancing LLM memory management and enabling explicit
memory to learn and improve similarly to implicit memory.
Nov 12 (Tue) 14:00-15:30 - Jasmine
Semformer: Transformer Language Models with Semantic Planning
Yongjing Yin, Junran Ding, Kai Song, Yue Zhang
Next-token prediction serves as the dominant component in current neural language models. During the training phase, the model employs teacher forcing, which predicts tokens based on all preceding ground-truth tokens. However, this approach has been found to create shortcuts, utilizing the revealed prefix to spuriously fit future tokens, potentially compromising the accuracy of the next-token predictor. In this paper, we introduce Semformer, a novel method of training a Transformer language model that explicitly models the semantic planning of the response. Specifically, we incorporate a sequence of planning tokens into the prefix, guiding the planning token representations to predict the latent semantic representations of the response, which are induced by an autoencoder. In a minimal planning task (i.e., graph path-finding), our model exhibits near-perfect performance and effectively mitigates shortcut learning, a feat that standard training methods and baseline models have been unable to accomplish. Furthermore, we pretrain Semformer from scratch with 125M parameters, demonstrating its efficacy through measures of perplexity, in-context learning, and fine-tuning on summarization tasks.
Nov 12 (Tue) 14:00-15:30 - Jasmine
VerifyMatch: A Semi-Supervised Learning Paradigm for Natural Language Inference with Confidence-Aware MixUp
Seo Yeon Park, Cornelia Caragea
While natural language inference (NLI) has emerged as a prominent task for evaluating a model’s capability to perform natural language
understanding, creating large benchmarks for training deep learning models imposes a significant challenge since it requires extensive human
annotations. To overcome this, we propose to construct pseudo-generated samples (premise-hypothesis pairs) using class-specific fine-tuned large language models (LLMs), thereby reducing the human effort and the costs of annotating large amounts of data. However, despite the
impressive performance of LLMs, it is necessary to verify that the pseudo-generated labels are actually correct. Towards this goal, in this
paper, we propose VerifyMatch, a semi-supervised learning (SSL) approach in which the LLM pseudo-labels guide the training of the SSL
model and, at the same time, the SSL model acts as a verifier of the LLM-generated data. In our approach, we retain all pseudo-labeled
samples, but to ensure unlabeled data quality, we further propose to use MixUp whenever the verifier does not agree with the LLM-generated
label or when they both agree on the label but the verifier has a low confidence—lower than an adaptive confidence threshold. We achieve
competitive accuracy compared to strong baselines for NLI datasets in low-resource settings.
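The gating rule can be made concrete with a short sketch: MixUp is applied exactly when the verifier disagrees with the LLM label, or agrees with confidence below the adaptive threshold. Names and shapes here are illustrative assumptions:

    import torch
    import torch.nn.functional as F

    def gated_mixup(x, y_llm, y_ver, conf, tau, n_classes, alpha=0.4):
        # x: [B, D] inputs; y_llm / y_ver: [B] integer labels from the
        # LLM and the verifier; conf: [B] verifier confidence.
        gate = ((y_llm != y_ver) | (conf < tau)).unsqueeze(1)
        lam = torch.distributions.Beta(alpha, alpha).sample()
        perm = torch.randperm(x.size(0))
        y = F.one_hot(y_llm, n_classes).float()
        # Gated pairs get convex combinations of inputs and targets;
        # confident, agreeing pairs keep their pseudo-label as-is.
        x_out = torch.where(gate, lam * x + (1 - lam) * x[perm], x)
        y_out = torch.where(gate, lam * y + (1 - lam) * y[perm], y)
        return x_out, y_out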
Nov 12 (Tue) 14:00-15:30 - Jasmine
Defending Jailbreak Prompts via In-Context Adversarial Game
Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang
Large Language Models (LLMs) demonstrate remarkable capabilities across diverse applications. However, concerns regarding their security,
particularly the vulnerability to jailbreak attacks, persist. Drawing inspiration from adversarial training in deep learning and LLM agent
learning processes, we introduce the In-Context Adversarial Game (ICAG) for defending against jailbreaks without the need for fine-tuning.
ICAG leverages agent learning to conduct an adversarial game, aiming to dynamically extend knowledge to defend against jailbreaks. Unlike
traditional methods that rely on static datasets, ICAG employs an iterative process to enhance both the defense and attack agents. This con-
tinuous improvement process strengthens defenses against newly generated jailbreak prompts. Our empirical studies affirm ICAG’s efficacy,
where LLMs safeguarded by ICAG exhibit significantly reduced jailbreak success rates across various attack scenarios. Moreover, ICAG
demonstrates remarkable transferability to other LLMs, indicating its potential as a versatile defense mechanism. The code is available at
https://github.com/YujunZhou/In-Context-Adversarial-Game.
Large language models (LLMs) have shown strong results on a range of applications, including regression and scoring tasks. Typically, one obtains outputs from an LLM via autoregressive sampling from the model's output distribution. We show that this inference strategy can be sub-optimal for common regression and scoring evaluation metrics. As a remedy, we build on prior work on Minimum Bayes Risk decoding, and propose alternate inference strategies that estimate the Bayes-optimal solution for regression and scoring metrics in closed form from sampled responses. We show that our proposal significantly improves over baselines across datasets and models.
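For common metrics, the Bayes-optimal estimate from sampled responses has a familiar closed form, which the following sketch illustrates (the paper covers a broader family of metrics than these two):

    import numpy as np

    def bayes_estimate(samples, metric="squared"):
        # The minimizer of expected squared error is the sample mean;
        # for absolute error it is the sample median.
        samples = np.asarray(samples, dtype=float)
        return samples.mean() if metric == "squared" else np.median(samples)

    scores = [3, 4, 4, 5, 9]                   # e.g., sampled LLM ratings
    print(bayes_estimate(scores))              # 5.0
    print(bayes_estimate(scores, "absolute"))  # 4.0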
Nov 12 (Tue) 14:00-15:30 - Jasmine
Reconfidencing LLMs from the Grouping Loss Perspective
Lihu Chen, Alexandre Perez-Lebel, Fabian M. Suchanek, Gael Varoquaux
Large Language Models (LLMs), such as GPT and LLaMA, are susceptible to generating hallucinated answers in a confident tone. While
previous efforts to elicit and calibrate confidence scores have shown some success, they often overlook biases towards certain groups, such
as specific nationalities. Existing calibration methods typically focus on average performance, failing to address this disparity. In our study,
we demonstrate that the concept of grouping loss is an effective metric for understanding and correcting the heterogeneity in confidence
levels. We introduce a novel evaluation dataset, derived from a knowledge base, specifically designed to assess the confidence scores of LLM
responses across different groups. Our experimental results highlight significant variations in confidence, which are accurately captured by
grouping loss. To tackle this issue, we propose a new method to calibrate the confidence scores of LLMs by considering different groups, a
process we term reconfidencing. Our findings indicate that this approach effectively mitigates biases against minority groups, contributing to
the development of fairer LLMs.
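One simple way to operationalize reconfidencing is to recalibrate confidence scores separately within each group rather than on the pooled data; the per-group isotonic regression below is an illustrative baseline, not the paper's exact procedure:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    def fit_group_calibrators(scores, correct, groups):
        # One calibrator per group, so that confidence tracks accuracy
        # within each group instead of only on average.
        scores, correct, groups = map(np.asarray, (scores, correct, groups))
        calibrators = {}
        for g in np.unique(groups):
            mask = groups == g
            cal = IsotonicRegression(out_of_bounds="clip")
            cal.fit(scores[mask], correct[mask])
            calibrators[g] = cal
        return calibrators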
Nov 12 (Tue) 14:00-15:30 - Jasmine
LaRS: Latent Reasoning Skills for Chain-of-Thought Reasoning
Zifan Xu, Haozhu Wang, Dmitriy Bespalov, Xian Wu, Peter Stone, Yanjun Qi
Chain-of-thought (CoT) prompting is a popular in-context learning (ICL) approach for large language models (LLMs), especially when
tackling complex reasoning tasks. Traditional ICL approaches construct prompts using examples that contain questions similar to the input
question. However, CoT prompting, which includes crucial intermediate reasoning steps (rationales) within its examples, necessitates select-
ing examples based on these rationales rather than the questions themselves. Existing methods require human experts or pre-trained LLMs to
describe the skill, a high-level abstraction of rationales, to guide the selection. These methods, however, are often costly and difficult to scale.
Instead, this paper introduces a new approach named Latent Reasoning Skills (LaRS) that employs unsupervised learning to create a latent
space representation of rationales, with a latent variable called a reasoning skill. Concurrently, LaRS learns a reasoning policy to determine
the required reasoning skill for a given question. Then the ICL examples are selected by aligning the reasoning skills between past examples
and the question. This approach is theoretically grounded and compute-efficient, eliminating the need for auxiliary LLM inference or manual
prompt design. Empirical results demonstrate that LaRS consistently outperforms SOTA skill-based selection methods, processing example
banks four times faster, reducing LLM inferences during the selection stage by half, and showing greater robustness to sub-optimal example
banks.
performance for both low-resource and high-resource languages, likely due to limited model capacity (the "curse of multilinguality"). These
results suggest that massively multilingual pre-training may not be optimal for any languages involved, but that more targeted models can
significantly improve performance.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
An Analysis of Multilingual FActScore
Vu Trong Kim, Michael Krumdick, Varshini Reddy, Franck Dernoncourt, Viet Dac Lai
FActScore has gained popularity as a metric to estimate the factuality of long-form texts generated by Large Language Models (LLMs) in English. However, there has been no work studying the behavior of FActScore in other languages. This paper studies the limitations of each component in the four-component pipeline of FActScore in the multilingual setting. We introduce a new dataset for FActScore on texts generated by strong multilingual LLMs. Our evaluation shows that LLMs exhibit distinct behaviors in both fact extraction and fact scoring tasks. No LLM produces consistent and reliable FActScore across languages of varying levels of resources. We also find that the knowledge source plays an important role in the quality of the estimated FActScore. Using Wikipedia as the knowledge source may hinder the true FActScore of long-form text due to its limited coverage in medium- and low-resource languages. We also incorporate three mitigations into our knowledge source that ultimately improve FActScore estimation across all languages.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Orevaoghene Ahia, Anuoluwapo Aremu, Diana Abagyan, Hila Gonen, David Ifeoluwa Adelani, Daud Abolade, Noah A. Smith, Yulia Tsvetkov
Yorùbá—an African language with roughly 47 million speakers—encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus, YORULECT, across three domains and four regional Yorùbá dialects. To develop this corpus, we engaged native speakers, traveling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yorùbá and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yorùbá and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We will release the YORULECT dataset and models publicly under an open license.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Using Language Models to Disambiguate Lexical Choices in Translation
Josh Barua, Sanjay Subramanian, Kayo Yin, Alane Suhr
In translation, a concept represented by a single word in a source language can have multiple variations in a target language. The task of
lexical selection requires using context to identify which variation is most appropriate for a source text. We work with native speakers of
nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English.
We evaluate recent LLMs and neural machine translation systems on DTAiLS, with the best-performing model, GPT-4, achieving accuracy from 67% to 85% across languages. Finally, we use language models to generate English rules describing target-language concept variations.
Providing weaker models with high-quality lexical rules improves accuracy substantially, in some cases reaching or outperforming GPT-4.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
Holy Lovenia, Rahmad Mahendra, Salsabil Maulana Akbar, Lester James Validad Miranda, Jennifer Santoso, Elyanah Aco, Akhdan Fadhilah,
Jonibek Mansurov, Joseph Marvin Imperial, Onno P. Kampman, Joel Ruben Antony Moniz, Muhammad Ravi Shulthan Habibi, Frederikus
Hudi, Jann Railey Montalan, Ryan Ignatius Hadiwijaya, Joanito Agili Lopo, William Nixon, Börje F. Karlsson, James Jaya, Ryandito Dian-
daru, Yuze GAO, Patrick Amadeus Irawan, Bin Wang, Jan Christian Blaise Cruz, Chenxi Whitehouse, Ivan Halim Parmonangan, Maria Khelli,
Wenyu Zhang, Lucky Susanto, Reynard Adha Ryanda, Sonny Lazuardi Hermawan, Dan John Velasco, Muhammad Dehan Al Kautsar, Willy
Fitra Hendria, Yasmin Moslem, Noah Flynn, Muhammad Farid Adilazuarda, Haochen Li, Johanes Lee, R. Damanhuri, Shuo Sun, Muhammad
Reza Qorib, Amirbek Djanibekov, Wei Qi Leong, Quyet V. Do, Niklas Muennighoff, Tanrada Pansuwan, Ilham Firdausi Putra, Yan Xu, Tai
Ngee Chia, Ayu Purwarianti, Sebastian Ruder, William Chandra Tjhi, Peerat Limkonchotiwat, Alham Fikri Aji, Sedrick Keh, Genta Indra
Winata, Ruochen Zhang, Fajri Koto, Zheng Xin Yong, Samuel Cahyawijaya
Southeast Asia (SEA) is a region rich in linguistic diversity and cultural variety, with over 1,300 indigenous languages and a population of
671 million people. However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets from
SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging due to the scarcity of
high-quality datasets, compounded by the dominance of English training data, raising concerns about potential cultural misrepresentation. To
address these challenges, through a collaborative movement, we introduce SEACrowd, a comprehensive resource center that fills the resource
gap by providing standardized corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we
assess the quality of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in SEA.
Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and resource equity for the future of
AI in Southeast Asia.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Concept Space Alignment in Multilingual LLMs
Qiwei Peng, Anders Søgaard
Multilingual large language models (LLMs) seem to generalize somewhat across languages. We hypothesize this is a result of implicit vector
space alignment. Evaluating such alignment, we see that larger models exhibit very high-quality linear alignments between corresponding
concepts in different languages. Our experiments show that multilingual LLMs suffer from two familiar weaknesses: generalization works
best for languages with similar typology, and for abstract concepts. For some models, e.g., the Llama-2 family of models, prompt-based em-
beddings align better than word embeddings, but the projections are less linear – an observation that holds across almost all model families,
indicating that some of the implicitly learned alignments are broken somewhat by prompt-based methods.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Methods of Automatic Matrix Language Determination for Code-Switched Speech
Olga Iakovenko, Thomas Hain
Code-switching (CS), the practice of speakers alternating between two or more languages, is increasingly common in the modern world. To better describe CS speech, the Matrix Language Frame (MLF) theory introduces the concept of a Matrix Language, which is the language that provides the grammatical structure for a CS utterance. In this work, the MLF theory was used to develop systems for Matrix Language Identity (MLID) determination. The MLID of English/Mandarin and English/Spanish CS text and speech was compared to acoustic language identity (LID), which is a typical way to identify a language in monolingual utterances. MLID predictors from audio show higher correlation with the textual principles than LID in all cases, while also outperforming LID in an MLID recognition task based on F1 macro (60%) and correlation score (0.38). This novel approach shows that non-English languages (Mandarin and Spanish) are preferred over English as the Matrix Language, contrary to the monolingual choice of LID.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Verba volant, scripta volant? Don’t worry! There are computational solutions for protoword reconstruction
Liviu P Dinu, Ana Sabina Uban, Alina Maria Cristea, Ioan-Bogdan Iordache, Teodor-George Marchitan, Simona Georgescu, Laurentiu
Zoicas
We introduce a new database of cognate words and etymons for the five main Romance languages, the most comprehensive one to date. We
propose a strong benchmark for the automatic reconstruction of protowords for Romance languages, by applying a set of machine learning
models and features on these data. The best results reach 90% accuracy in predicting the protoword of a given cognate set, surpassing existing
state-of-the-art results for this task and showing that computational methods can be very useful in assisting linguists with protoword recon-
struction.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Understanding and Mitigating Language Confusion in LLMs
Kelly Marchisio, Wei-Yin Ko, Alexandre Berard, Théo Dehaze, Sebastian Ruder
We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user’s desired language. We create the Lan-
guage Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly-created
English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases,
finding that Llama Instruct and Mistral models exhibit high degrees of language confusion and even the strongest models fail to consistently
respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is
aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot
prompting, multilingual SFT and preference tuning. We release our language confusion benchmark, which serves as a first layer of efficient,
scalable multilingual evaluation.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
An Empirical Study of Multilingual Reasoning Distillation for Question Answering
Patomporn Payoungkhamdee, Peerat Limkonchotiwat, Jinheon Baek, Potsawee Manakul, Can Udomcharoenchaikit, Ekapol Chuangsuwanich,
Sarana Nutanong
Reasoning is one crucial capability in Large Language Models (LLMs), allowing them to perform complex tasks such as solving math prob-
lems and multi-step planning. While reasoning capability can emerge in larger models, smaller ones usually have to rely on distillation to
transfer this capability from a larger model. However, recent efforts to distill reasoning capabilities have focused mainly on English, leaving
multilingual distillation underexplored. To address this gap, this paper examines existing English reasoning distillation methods that utilize
a variety of positive rationales in multilingual settings and proposes d-CoT-nR, a novel approach that incorporates incorrect rationales as
additional guidance. Empirical results from multilingual high-school examinations show that d-CoT-nR significantly surpasses the baseline,
improving accuracy in unseen languages and correctness in step-by-step reasoning.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?
Pinzhen Chen, Simon Yu, Zhicheng Guo, Barry Haddow
Multilingual large language models are designed, claimed, and expected to cater to speakers of varied languages. We hypothesise that the
current practices of fine-tuning and evaluating these models may not perfectly align with this objective owing to a heavy reliance on trans-
lation, which cannot cover language-specific knowledge but can introduce translation defects. It remains unknown whether the nature of the
instruction data has an impact on the model output; conversely, it is questionable whether translated test sets can capture such nuances. Due
to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates
these issues using controlled native or translated data during the instruction tuning and evaluation stages. We show that native or generation
benchmarks reveal a notable difference between native and translated instruction data especially when model performance is high, whereas
other types of test sets cannot. The comparison between round-trip and single-pass translations reflects the importance of knowledge from
language-native resources. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.
The development of state-of-the-art generative large language models (LLMs) disproportionately relies on English-centric tokenizers, vocab-
ulary and pre-training data. Despite the fact that some LLMs have multilingual capabilities, recent studies have shown that their inference
efficiency deteriorates when generating text in languages other than English. This results in increased inference time and costs. Cross-lingual
vocabulary adaptation (CVA) methods have been proposed for adapting models to a target language aiming to improve downstream perfor-
mance. However, the effectiveness of these methods on increasing inference efficiency of generative LLMs has yet to be explored. In this
paper, we perform an empirical study of five CVA methods on four generative LLMs (including monolingual and multilingual models) across
four typologically-diverse languages and four natural language understanding tasks. We find that CVA substantially contributes to LLM in-
ference speedups of up to 271.5%. We also show that adapting LLMs that have been pre-trained on more balanced multilingual data results
in downstream performance comparable to the original models.
guages, with promising results thus far. Among the endangered languages spoken today, a significant number exhibit complex morphology.
The models employed in contemporary language documentation pipelines that utilize ASR, however, are predominantly based on isolating or
inflectional languages, often from the Indo-European family. This raises a critical concern: building models exclusively on such languages
may introduce a bias, resulting in better performance with simpler morphological structures. In this paper, we investigate the performance
of modern ASR architectures on morphologically complex languages. Results indicate that modern ASR architectures appear less robust in
managing high OOV rates for morphologically complex languages in terms of word error rate, while character error rates are consistently
higher for isolating languages.
Nov 12 (Tue) 14:00-15:30 - Riverfront Hall
LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization
Muhammad Farid Adilazuarda, Samuel Cahyawijaya, Genta Indra Winata, Ayu Purwarianti, Alham Fikri Aji
Pretrained language models (PLMs) have shown remarkable generalization toward multiple tasks and languages. Nonetheless, the generalization of PLMs towards unseen languages is poor, resulting in significantly worse language performance, or even nonsensical responses comparable to a random baseline. This limitation has been a longstanding problem of PLMs, raising concerns about diversity and equal access to language modeling technology. In this work, we address this limitation by introducing LinguAlchemy, a regularization technique that incorporates typological, geographical, and phylogenetic aspects of languages, constraining the resulting representations of PLMs to better characterize the corresponding linguistic constraints. LinguAlchemy significantly improves the accuracy of mBERT and XLM-R on unseen languages by 18% and 2%, respectively, compared to fully fine-tuned models, displaying a high degree of unseen language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search. LinguAlchemy enables better cross-lingual generalization to unseen languages, which is vital for better inclusivity and accessibility of PLMs.
humility" (IH), or acknowledging the potential limitations in one’s own beliefs. Specifically, we explore the development of computational
methods for measuring IH at scale. We manually curate and validate an IH codebook on 350 posts about religion drawn from subreddits
and use them to develop LLM-based models for automating this measurement. Our best model achieves a Macro-F1 score of 0.64 across
labels (and 0.70 when predicting IH/IA/Neutral at the coarse level), higher than an expected naive baseline of 0.51 (0.32 for IH/IA/Neutral)
but lower than a human annotator-informed upper bound of 0.85 (0.83 for IH/IA/Neutral). Our results both highlight the challenging nature
of detecting IH online—opening the door to new directions in NLP research—and also lay a foundation for computational social science
researchers interested in analyzing and fostering more IH in online public discourse.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Analyzing Key Factors Influencing Emotion Prediction Performance of VLLMs in Conversational Contexts
Jaewook Lee, Yeajin Jang, Hongjin KIM, Woojin Lee, Harksoo Kim
Emotional intelligence (EI) in artificial intelligence (AI), which refers to the ability of an AI to understand and respond appropriately to human
emotions, has emerged as a crucial research topic. Recent studies have shown that large language models (LLMs) and vision large language
models (VLLMs) possess EI and the ability to understand emotional stimuli in the form of text and images, respectively. However, factors
influencing the emotion prediction performance of VLLMs in real-world conversational contexts have not been sufficiently explored. This study aims to systematically analyze the key elements affecting the emotion prediction performance of VLLMs in conversational contexts.
To achieve this, we reconstructed the MELD dataset, which is based on the popular TV series Friends, and conducted experiments through
three sub-tasks: overall emotion tone prediction, character emotion prediction, and contextually appropriate emotion expression selection.
We evaluated the performance differences based on various model architectures (e.g., image encoders, modality alignment, and LLMs) and
image scopes (e.g., entire scene, person, and facial expression). In addition, we investigated the impact of providing persona information on
the emotion prediction performance of the models and analyzed how personality traits and speaking styles influenced the emotion prediction
process. We conducted an in-depth analysis of the impact of various other factors, such as gender and regional biases, on the emotion predic-
tion performance of VLLMs. The results revealed that these factors significantly influenced the model performance.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
ToxiCloakCN: Evaluating Robustness of Offensive Language Detection in Chinese with Cloaking Perturbations
Yunze Xiao, Yujia Hu, Kenny Tsu Wei Choo, Roy Ka-Wei Lee
Detecting hate speech and offensive language is essential for maintaining a safe and respectful digital environment. This study examines the
limitations of state-of-the-art large language models (LLMs) in identifying offensive content within systematically perturbed data, with a focus
on Chinese, a language particularly susceptible to such perturbations. We introduce ToxiCloakCN, an enhanced dataset derived from ToxiCN,
augmented with homophonic substitutions and emoji transformations, to test the robustness of LLMs against these cloaking perturbations.
Our findings reveal that existing models significantly underperform in detecting offensive content when these perturbations are applied. We
provide an in-depth analysis of how different types of offensive content are affected by these perturbations and explore the alignment between
human and model explanations of offensiveness. Our work highlights the urgent need for more advanced techniques in offensive language
detection to combat the evolving tactics used to evade detection mechanisms.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Locating Information Gaps and Narrative Inconsistencies Across Languages: A Case Study of LGBT People Portrayals on Wikipedia
Farhan Samir, Chan Young Park, Vered Shwartz, Anjalie Field, Yulia Tsvetkov
To explain social phenomena and identify systematic biases, much research in computational social science focuses on comparative text anal-
yses. These studies often rely on coarse corpus-level statistics or local word-level analyses, mainly in English. We introduce the InfoGap
method—an efficient and reliable approach to locating information gaps and inconsistencies in articles at the fact level, across languages. We
evaluate InfoGap by analyzing LGBT people’s portrayals, across 2.7K biography pages on English, Russian, and French Wikipedias. We
find large discrepancies in factual coverage across the languages. Moreover, our analysis reveals that biographical facts carrying negative
connotations are more likely to be highlighted in Russian Wikipedia. Crucially, InfoGap both facilitates large scale analyses, and pinpoints
local document- and fact-level information gaps, laying a new foundation for targeted and nuanced comparative language analysis at scale.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
From Local Concepts to Universals: Evaluating the Multicultural Understanding of Vision-Language Models
Mehar Bhatia, Sahithya Ravi, Aditya Chinchure, EunJeong Hwang, Vered Shwartz
Despite recent advancements in vision-language models, their performance remains suboptimal on images from non-western cultures due to
underrepresentation in training datasets. Various benchmarks have been proposed to test models’ cultural inclusivity. Still, they have limited
coverage of cultures and do not adequately assess cultural diversity across universal and culture-specific local concepts. To address these
limitations, we introduce the GlobalRG benchmark, comprising two challenging tasks: retrieval across universals and cultural visual ground-
ing. The former task entails retrieving culturally diverse images for universal concepts from 50 countries, while the latter aims at grounding
culture-specific concepts within images from 15 countries. Our evaluation across a wide range of models reveals that the performance varies
significantly across cultures – underscoring the necessity for enhancing multicultural understanding in vision-language models.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Bridging Modalities: Enhancing Cross-Modality Hate Speech Detection with Few-Shot In-Context Learning
Ming Shan Hee, Aditi Kumaresan, Roy Ka-Wei Lee
The widespread presence of hate speech on the internet, including formats such as text-based tweets and multimodal memes, poses a signif-
icant challenge to digital platform safety. Recent research has developed detection models tailored to specific modalities; however, there is a
notable gap in transferring detection capabilities across different formats. This study conducts extensive experiments using few-shot in-context
learning with large language models to explore the transferability of hate speech detection between modalities. Our findings demonstrate that
text-based hate speech examples can significantly enhance the classification accuracy of vision-language hate speech. Moreover, text-based
demonstrations outperform vision-language demonstrations in few-shot learning settings. These results highlight the effectiveness of cross-
modality knowledge transfer and offer valuable insights for improving hate speech detection systems.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Image, Tell me your story! Predicting the original meta-context of visual misinformation
Jonathan Tonglet, Marie-Francine Moens, Iryna Gurevych
To assist human fact-checkers, researchers have developed automated approaches for visual misinformation detection. These methods assign
veracity scores by identifying inconsistencies between the image and its caption, or by detecting forgeries in the image. However, they ne-
glect a crucial point of the human fact-checking process: identifying the original meta-context of the image. By explaining what is actually
true about the image, fact-checkers can better detect misinformation, focus their efforts on check-worthy visual content, engage in counter-
messaging before misinformation spreads widely, and make their explanation more convincing. Here, we fill this gap by introducing the task
of automated image contextualization. We create 5Pils, a dataset of 1,676 fact-checked images with question-answer pairs about their original
meta-context. Annotations are based on the 5 Pillars fact-checking framework. We implement a first baseline that grounds the image in its
original meta-context using the content of the image and textual evidence retrieved from the open web. Our experiments show promising
results while highlighting several open challenges in retrieval and reasoning.
challenging to scale for downstream applications. To address these limitations, in this work, we propose a computational approach to effi-
ciently model users’ latent susceptibility levels. As shown in previous work, susceptibility is influenced by various factors (e.g., demographic
factors, political ideology), and directly influences people’s reposting behavior on social media. To represent the underlying mental process,
our susceptibility modeling incorporates these factors as inputs, guided by the supervision of people’s sharing behavior. Using COVID-19 as a
testbed, our experiments demonstrate a significant alignment between the susceptibility scores estimated by our computational modeling and
human judgments, confirming the effectiveness of this latent modeling approach. Furthermore, we apply our model to annotate susceptibility
scores on a large-scale dataset and analyze the relationships between susceptibility with various factors. Our analysis reveals that political
leanings and other psychological factors exhibit varying degrees of association with susceptibility to COVID-19 misinformation, and shows
that susceptibility is unevenly distributed across different professional and geographical backgrounds.
need datasets with extensive parallel annotations from a variety of social and cultural groups. In this paper we introduce the D3CODE dataset: a large-scale cross-cultural dataset of parallel annotations for offensive language in over 4.5K English sentences annotated by a pool of more than 4K annotators, balanced across gender and age, from across 21 countries, representing eight geo-cultural regions. The dataset captures
annotators’ moral values along six moral foundations: care, equality, proportionality, authority, loyalty, and purity. Our analyses reveal sub-
stantial regional variations in annotators’ perceptions that are shaped by individual moral values, providing crucial insights for developing
pluralistic, culturally sensitive NLP models.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Virtual Personas for Language Models via an Anthology of Backstories
Suhong Moon, Marwa Abdulhai, Minwoo Kang, Joseph Suh, Widyadewi Soedarmadji, Eran Kohen Behar, David Chan
Large language models (LLMs) are trained from vast repositories of text authored by millions of distinct authors, reflecting an enormous
diversity of human traits. While these models bear the potential to be used as approximations of human subjects in behavioral studies, prior
efforts have been limited in steering model responses to match individual human users. In this work, we introduce Anthology, a method
for conditioning LLMs to particular virtual personas by harnessing open-ended life narratives, which we refer to as backstories. We show
that our methodology enhances the consistency and reliability of experimental outcomes while ensuring better representation of diverse sub-
populations. Across three nationally representative human surveys conducted as part of Pew Research Center’s American Trends Panel (ATP),
we demonstrate that Anthology achieves up to 18% improvement in matching the response distributions of human respondents and 27% im-
provement in consistency metrics.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Promoting Constructive Deliberation: Reframing for Receptiveness
Gauri Kambhatla, Matthew Lease, Ashwin Rajadesingan
To promote constructive discussion of controversial topics online, we propose automatic reframing of disagreeing responses to signal re-
ceptiveness to a preceding comment. Drawing on research from psychology, communications, and linguistics, we identify six strategies
for reframing. We automatically reframe replies to comments according to each strategy, using a Reddit dataset. Through human-centered
experiments, we find that the replies generated with our framework are perceived to be significantly more receptive than the original replies
and a generic receptiveness baseline. We illustrate how transforming receptiveness, a particular social science construct, into a computational framework can make LLM generations more aligned with human perceptions. We analyze and discuss the implications of our results, and
highlight how a tool based on our framework might be used for more teachable and creative content moderation.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Using RL to Identify Divisive Perspectives Improves LLMs Abilities to Identify Communities on Social Media
Nikhil Mehta, Dan Goldwasser
The large-scale usage of social media, combined with its significant impact, has made it increasingly important to understand it. In particular, identifying user communities can be helpful for many downstream tasks. However, doing so is difficult, particularly when models are trained on past data and tested on future data. In this paper, we propose to take advantage of Large Language Models (LLMs) to better identify user communities. Because many LLMs, such as ChatGPT, are fixed and must be treated as black boxes, we propose an approach to better prompt them by training a smaller LLM to do this. We devise strategies to train this smaller model, showing how it can improve the larger LLM's ability to detect communities. Experimental results show improvements on Reddit and Twitter data, on the tasks of community detection, bot detection, and news media profiling.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Enabling Cross-Platform Comparison of Online Communities Using Content and Opinion Similarity
Prasanna Lakkur Subramanyam, Jeng-Yu Chou, Kevin K. Nam, Brian Levine
With the continuous growth of online communities, understanding their similarities and dissimilarities is more crucial than ever for enhanc-
ing digital interactions, maintaining healthy interactions, and improving content recommendation and moderation systems. In this work, we
present two novel techniques: BOTS for finding similarity between online communities based on their opinion, and Emb-PSR for finding
similarity in the content they post. To facilitate finding the similarity based on opinion, we model the opinions on online communities us-
ing upvotes and downvotes as an indicator for community approval. Our results demonstrate that BOTS and Emb-PSR outperform existing
techniques at their individual tasks while also being flexible enough to allow for cross-platform comparison of online communities. We
demonstrate this novel cross-platform capability by comparing GAB with various subreddits.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
How Personality Traits Influence Negotiation Outcomes? A Simulation based on Large Language Models
Yin Jou Huang, Rafik Hadfi
Psychological evidence reveals the influence of personality traits on decision-making. For instance, agreeableness is generally associated with
positive outcomes in negotiations, whereas neuroticism is often linked to less favorable outcomes. This paper introduces a simulation frame-
work centered on large language model (LLM) agents endowed with synthesized personality traits. The agents negotiate within bargaining
domains and possess customizable personalities and objectives. The experimental results show that the behavioral tendencies of LLM-based
simulations can reproduce behavioral patterns observed in human negotiations. The contribution is twofold. First, we propose a simulation
methodology that investigates the alignment between the linguistic and economic capabilities of LLM agents. Secondly, we offer empirical
insights into the strategic impacts of Big Five personality traits on the outcomes of bilateral negotiations. We also provide an in-depth analysis
based on simulated bargaining dialogues to reveal intriguing behaviors, including deceitful and compromising behaviors.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Mental Disorder Classification via Temporal Representation of Text
Raja Kumar, Kishan Maharaj, Ashita Saxena, Pushpak Bhattacharyya
Mental disorders pose a global challenge, aggravated by the shortage of qualified mental health professionals. Mental disorder prediction from
social media posts by current LLMs is challenging due to the complexities of sequential text data and the limited context length of language
models. Current language model-based approaches split a single data instance into multiple chunks to compensate for limited context size.
The predictive model is then applied to each chunk individually, and the most voted output is selected as the final prediction. This results
in the loss of inter-post dependencies and important time-variant information, leading to poor performance. We propose a novel framework
which first compresses the large sequence of chronologically ordered social media posts into a series of numbers. We then use this time-variant
representation for mental disorder classification. We demonstrate the generalization capabilities of our framework by outperforming
the current SOTA in three different mental conditions: depression, self-harm, and anorexia, by an absolute improvement of 5% in the F1
score. We also investigate the situation when current data instances fall within the context length of language models and present empirical
results highlighting the importance of temporal properties of textual data. Furthermore, we utilize the proposed framework for a cross-domain
study, exploring commonalities across disorders and the possibility of inter-domain data usage.
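A minimal sketch of the compress-then-classify pipeline described above, assuming a placeholder per-post scorer and hand-picked temporal features (the paper's actual numeric compression differs):

import numpy as np
from sklearn.linear_model import LogisticRegression

NEG = {"sad", "alone", "tired", "hopeless"}  # toy negative lexicon

def post_score(post: str) -> float:
    # Placeholder scorer: fraction of negative-lexicon words in the post.
    words = post.lower().split()
    return sum(w in NEG for w in words) / max(len(words), 1)

def temporal_features(posts: list[str]) -> np.ndarray:
    # Compress a chronological post sequence into a fixed-size feature vector.
    s = np.array([post_score(p) for p in posts])
    trend = np.polyfit(np.arange(len(s)), s, 1)[0] if len(s) > 1 else 0.0
    return np.array([s.mean(), s.std(), s.max(), trend])

users = [["feeling fine today", "great run"],
         ["so tired and alone", "hopeless again"]]
labels = [0, 1]  # toy labels: 0 = control, 1 = at-risk
X = np.stack([temporal_features(u) for u in users])
clf = LogisticRegression().fit(X, labels)
print(clf.predict(X))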
Demo 3
Nov 12 (Tue) 16:00-17:30 - Room: Riverfront Hall
mains challenging. This is due to their causal attention mechanism and the misalignment between their pre-training objectives and the text
ranking tasks. Despite some recent efforts to address these issues, existing frameworks for LLM-based text embeddings have been limited by
their support for only a limited range of LLM architectures and fine-tuning strategies, limiting their practical application and versatility. In this
work, we introduce the Unified framework for Large Language Model Embedding (ULLME), a flexible, plug-and-play implementation that
enables bidirectional attention across various LLMs and supports a range of fine-tuning strategies. We also propose Generation-augmented
Representation Learning (GRL), a novel fine-tuning method to boost LLMs for text embedding tasks. GRL enforces consistency between
representation-based and generation-based relevance scores, leveraging LLMs' powerful generative abilities for learning passage embeddings.
To showcase our framework's flexibility and effectiveness, we release three pre-trained models from ULLME with different backbone archi-
tectures, ranging from 1.5B to 8B parameters, all of which demonstrate strong performance on the Massive Text Embedding Benchmark. Our
framework is publicly available at: https://github.com/nlp-uoregon/ullme. A demo video for ULLME can also be found at https://rb.gy/ws1ile.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
To the Globe (TTG): Towards Language-Driven Guaranteed Travel Planning
Aaron Foss, Andrew Cohen, Arman Zharmagambetov, Brandon Amos, Da JU, Justine T Kao, Maryam Fazel-Zarandi, Sasha Mitts, Song
Jiang, Xian Li, Yuandong Tian
Travel planning is a challenging and time-consuming task that aims to find an itinerary which satisfies multiple, interdependent constraints
regarding flights, accommodations, attractions, and other travel arrangements. In this paper, we propose To the Globe (TTG), a real-time demo
system that takes natural language requests from users, translates them into symbolic form via a fine-tuned Large Language Model, and produces
optimal travel itineraries with Mixed Integer Linear Programming solvers. The overall system takes 5 seconds to reply to the user request
with guaranteed itineraries. To train TTG, we develop a synthetic data pipeline that generates user requests, flight and hotel information in
symbolic form without human annotations, based on the statistics of real-world datasets, and fine-tune an LLM to translate NL user requests
to their symbolic form, which is sent to the symbolic solver to compute optimal itineraries. Our NL-symbolic translation achieves 91% exact
match on a backtranslation metric (i.e., whether the estimated symbolic form of the generated natural language matches the ground truth), and
its returned itineraries achieve a cost ratio of 0.979 relative to the optimal cost for the ground-truth user request. When evaluated by users, TTG
achieves consistently high Net Promoter Scores (NPS) of 35-40% on generated itineraries.
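The symbolic stage can be pictured as a small Mixed Integer Linear Program. The following toy sketch, with invented prices and a tiny constraint set (not TTG's actual formulation), uses the PuLP solver to pick the cheapest feasible flight-hotel combination:

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

flights = {"F1": 320, "F2": 410}  # illustrative flight prices
hotels = {"H1": 90, "H2": 150}    # illustrative hotel prices
budget = 500

prob = LpProblem("itinerary", LpMinimize)
f = {k: LpVariable(f"fly_{k}", cat="Binary") for k in flights}
h = {k: LpVariable(f"stay_{k}", cat="Binary") for k in hotels}

cost = lpSum(flights[k] * f[k] for k in flights) + lpSum(hotels[k] * h[k] for k in hotels)
prob += cost                    # objective: minimize total cost
prob += lpSum(f.values()) == 1  # exactly one flight
prob += lpSum(h.values()) == 1  # exactly one hotel
prob += cost <= budget          # user budget constraint

prob.solve()
chosen = [v.name for v in list(f.values()) + list(h.values()) if v.value() == 1]
print(chosen, value(prob.objective))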
Machine Translation 2
Nov 12 (Tue) 16:00-17:30 - Room: Riverfront Hall
provide a first investigation of what is forgotten, and why. We examine the relationship between forgetting and the in-domain data, and show
that the amount and type of forgetting is linked to that data’s target vocabulary coverage. Our findings pave the way toward better informed
NMT domain adaptation.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Can Automatic Metrics Assess High-Quality Translations?
Sweta Agrawal, António Farinhas, Ricardo Rei, Andre Martins
Automatic metrics for evaluating translation quality are typically validated by measuring how well they correlate with human assessments.
However, correlation methods tend to capture only the ability of metrics to differentiate between good and bad source-translation pairs, over-
looking their reliability in distinguishing alternative translations for the same source. In this paper, we confirm that this is indeed the case by
showing that current metrics are insensitive to nuanced differences in translation quality. This effect is most pronounced when the quality is
high and the variance among alternatives is low. Given this finding, we shift towards detecting high-quality correct translations, an important
problem in practical decision-making scenarios where a binary check of correctness is prioritized over a nuanced evaluation of quality. Using
the MQM framework as the gold standard, we systematically stress-test the ability of current metrics to identify translations with no errors
as marked by humans. Our findings reveal that current metrics often over or underestimate translation quality, indicating significant room for
improvement in machine translation evaluation.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Modeling User Preferences with Automatic Metrics: Creating a High-Quality Preference Dataset for Machine Translation
Bowen Xing, Lizi Liao, Minlie Huang, Ivor Tsang
Alignment with human preferences is an important step in developing accurate and safe large language models. This is no exception in
machine translation (MT), where better handling of language nuances and context-specific variations leads to improved quality. However,
preference data based on human feedback can be very expensive to obtain and curate at a large scale. Automatic metrics, on the other hand,
can induce preferences, but they might not match human expectations perfectly. In this paper, we propose an approach that leverages the
best of both worlds. We first collect sentence-level quality assessments from professional linguists on translations generated by multiple
high-quality MT systems and evaluate the ability of current automatic metrics to recover these preferences. We then use this analysis to curate
a new dataset, MT-Pref (metric-induced translation preference), which comprises 18k instances covering 18 language directions, using
texts sourced from multiple domains post-2022. We show that aligning TOWER models on MT-Pref significantly improves translation quality
on WMT23 and FLORES benchmarks.
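A hedged sketch of how an automatic metric can induce preference pairs (the actual MT-Pref curation is more involved; the margin threshold below is an assumption):

def build_preference_pairs(candidates, min_margin=0.05):
    """candidates: {source: [(translation, metric_score), ...]}"""
    pairs = []
    for src, cands in candidates.items():
        ranked = sorted(cands, key=lambda c: c[1], reverse=True)
        (best, s_best), (worst, s_worst) = ranked[0], ranked[-1]
        if s_best - s_worst >= min_margin:  # skip near-ties the metric can't resolve
            pairs.append({"prompt": src, "chosen": best, "rejected": worst})
    return pairs

cands = {"Guten Morgen": [("Good morning", 0.92), ("Morning good", 0.61)]}
print(build_preference_pairs(cands))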
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Ladder: A Model-Agnostic Framework Boosting LLM-based Machine Translation to the Next Level
Zhaopeng Feng, Ruizhe Chen, Yan Zhang, Zijie Meng, Zuozhu Liu
General-purpose Large Language Models (LLMs) like GPT-4 have achieved remarkable advancements in machine translation (MT) by lever-
aging extensive web content. On the other hand, translation-specific LLMs are built by pre-training on domain-specific monolingual corpora
and fine-tuning with human-annotated translation data. Despite the superior performance, these methods either demand an unprecedented
scale of computing and data or substantial human editing and annotation efforts. In this paper, we develop MT-Ladder, a novel model-
agnostic and cost-effective tool to refine the performance of general LLMs for MT. MT-Ladder is trained on pseudo-refinement triplets which
can be easily obtained from existing LLMs without additional human cost. During training, we propose a hierarchical fine-tuning strategy
with an easy-to-hard schema, improving MT-Ladder’s refining performance progressively. The trained MT-Ladder can be seamlessly inte-
grated with any general-purpose LLMs to boost their translation performance. By utilizing Gemma-2B/7B as the backbone, MT-Ladder-2B
can elevate raw translations to the level of top-tier open-source models (e.g., refining BigTranslate-13B with +6.91 BLEU and +3.52 COMET
for XX→En), and MT-Ladder-7B can further enhance model performance to be on par with the state-of-the-art GPT-4. Extensive ablation and
analysis corroborate the effectiveness of MT-Ladder in diverse settings.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Mitigating the Language Mismatch and Repetition Issues in LLM-based Machine Translation via Model Editing
Weichuan Wang, Zhaoyi Li, Defu Lian, Chen Ma, Linqi Song, Ying Wei
Large Language Models (LLMs) have recently revolutionized the NLP field, yet they still fall short on some specific downstream tasks.
In this work, we focus on utilizing LLMs to perform machine translation, where we observe that two patterns of errors frequently occur and
drastically affect the translation quality: language mismatch and repetition. The work sets out to explore the potential for mitigating these two
issues by leveraging model editing methods, e.g., by locating the Feed-Forward Network (FFN) neurons or other components responsible for the
errors and deactivating them at inference time. We find that directly applying such methods either has limited effect on the targeted errors or
has significant negative side effects on the general translation quality, indicating that the located components may also be crucial for keeping
machine translation with LLMs on the rails. To this end, we propose to refine the located components by taking the intersection of the locating
results under different language settings, filtering out the information that is irrelevant to the targeted errors. The experiment
results empirically demonstrate that our methods can effectively reduce the language mismatch and repetition ratios and meanwhile enhance
or keep the general translation quality in most cases.
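A minimal sketch of the intersection-based refinement idea, with invented neuron identifiers and attribution scores standing in for real locating results (not the paper's actual procedure):

def locate_error_neurons(attributions, top_k=3):
    """attributions: {neuron_id: error attribution score} -> top-k neuron set."""
    return set(sorted(attributions, key=attributions.get, reverse=True)[:top_k])

def refine_across_settings(per_setting_attributions, top_k=3):
    sets = [locate_error_neurons(a, top_k) for a in per_setting_attributions]
    return set.intersection(*sets)  # neurons implicated in every setting

# Hypothetical attribution scores under two language settings.
en_de = {"ffn.11.42": 0.9, "ffn.3.7": 0.8, "ffn.5.1": 0.2, "ffn.9.9": 0.7}
en_zh = {"ffn.11.42": 0.8, "ffn.2.4": 0.9, "ffn.9.9": 0.6, "ffn.3.7": 0.1}
print(refine_across_settings([en_de, en_zh]))  # {'ffn.11.42', 'ffn.9.9'}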
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation
Matthew Raffel, Victor Agostinelli, Lizhong Chen
Large language models (LLMs) have achieved state-of-the-art performance in various language processing tasks, motivating their adoption in
simultaneous translation. Current fine-tuning methods to adapt LLMs for simultaneous translation focus on prompting optimization strategies
using either data augmentation or prompt structure modifications. However, these methods suffer from several issues, such as unnecessarily
expanded training sets, computational inefficiency from dumping the key and value cache, increased prompt sizes, or restriction to a single
decision policy. To eliminate these issues, in this work, we propose SimulMask, a new paradigm for fine-tuning LLMs for simultaneous trans-
lation. It utilizes a novel attention mask approach that models simultaneous translation during fine-tuning by masking attention for a desired
decision policy. Applying the proposed SimulMask on a Falcon LLM for the IWSLT 2017 dataset, we have observed a significant translation
quality improvement compared to state-of-the-art prompting optimization strategies on five language pairs while reducing the computational
cost.
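To make the masking idea concrete, here is a hedged numpy sketch of an attention mask for a wait-k decision policy over a concatenated [source; target] sequence; SimulMask's actual construction is defined in the paper:

import numpy as np

def waitk_attention_mask(src_len, tgt_len, k):
    """Boolean mask (True = may attend) over a [source; target] sequence:
    causal within the sequence, and target step j additionally limited to
    the first min(j + k, src_len) source tokens, as under a wait-k policy."""
    n = src_len + tgt_len
    mask = np.tril(np.ones((n, n), dtype=bool))  # standard causal mask
    for j in range(tgt_len):
        row = src_len + j
        visible_src = min(j + k, src_len)
        mask[row, visible_src:src_len] = False   # hide unread source tokens
    return mask

print(waitk_attention_mask(src_len=4, tgt_len=3, k=2).astype(int))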
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation
Anas Himmi, Guillaume Staerman, Marine Picot, Pierre Colombo, Nuno M Guerreiro
Hallucinated translations pose significant threats and safety concerns when it comes to practical deployment of machine translation systems.
Previous research works have identified that detectors exhibit complementary performance — different detectors excel at detecting differ-
ent types of hallucinations. In this paper, we propose to address the limitations of individual detectors by combining them and introducing
a straightforward method for aggregating multiple detectors. Our results demonstrate the efficacy of our aggregated detector, providing a
formance of Chinese-English machine translation (MT) systems. Built on an extended version of the data from the WMT22 Metrics Shared
Task (with extra labels of 9 types of Chinese MWEs, and 19 types of Chinese multiword NEs) which includes scores and error annotations
provided by human experts, we make further extraction of MWE- and NE-related translation errors. By investigating the human evaluation
scores and the error rates on each category of MWEs and NEs, we find that: 1) MT systems tend to perform significantly worse on Chinese
sentences with most kinds of MWEs and NEs; 2) MWEs and NEs, which make up about twenty percent of tokens (i.e., characters in Chinese),
result in one-third of translation errors; 3) for 13 categories of MWEs and NEs, the error rates exceed 50%, with the highest being 84.8%.
Based on the results, we emphasize that MWEs and NEs are still a bottleneck issue for MT and special attention to MWEs and NEs should
be paid to further improving the performance of MT systems.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Cross-lingual Contextualized Phrase Retrieval
Huayang Li, Deng Cai, Zhi Qu, Qu Cui, Hidetaka Kamigaito, Lemao Liu, Taro Watanabe
Phrase-level dense retrieval has shown many appealing characteristics in downstream NLP tasks by leveraging the fine-grained information
that phrases offer. In our work, we propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval, which
aims to augment cross-lingual applications by addressing polysemy using context information. However, the lack of specific training data and
models is the primary challenge to achieving our goal. To address this, we extract pairs of cross-lingual phrases using word alignment information
automatically induced from parallel sentences. Subsequently, we train our Cross-lingual Contextualized Phrase Retriever (CCPR) using con-
trastive learning, which encourages the hidden representations of phrases with similar contexts and semantics to align closely. Comprehensive
experiments on both the cross-lingual phrase retrieval task and a downstream task, i.e., machine translation, demonstrate the effectiveness of
CCPR. On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points
higher. When utilizing CCPR to augment the large-language-model-based translator, it achieves average gains of 0.7 and 1.5 in BERTScore
for translations from X=>En and vice versa, respectively, on the WMT16 dataset. We will release our code and data.
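The contrastive training step can be sketched as a standard InfoNCE objective over aligned phrase pairs; the loss form and temperature below are assumptions for illustration, not CCPR's released code:

import torch
import torch.nn.functional as F

def info_nce(src_phrase_emb, tgt_phrase_emb, temperature=0.05):
    """src/tgt: [batch, dim] embeddings of aligned cross-lingual phrase pairs;
    other phrases in the batch act as in-batch negatives."""
    src = F.normalize(src_phrase_emb, dim=-1)
    tgt = F.normalize(tgt_phrase_emb, dim=-1)
    logits = src @ tgt.T / temperature  # pairwise similarities
    labels = torch.arange(src.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Random stand-ins for encoder outputs; real training would backprop into
# the phrase encoders.
src = torch.randn(8, 256, requires_grad=True)
tgt = torch.randn(8, 256, requires_grad=True)
loss = info_nce(src, tgt)
loss.backward()
print(float(loss))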
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Low-Resource Machine Translation through the Lens of Personalized Federated Learning
Viktor Moskvoretskii, Nazarii Tupitsa, Chris Biemann, Samuel Horváth, Eduard Gorbunov, Irina Nikishina
We present a new approach called MeritOpt based on the Personalized Federated Learning algorithm MeritFed that can be applied to Natural
Language Tasks with heterogeneous data. We evaluate it on the Low-Resource Machine Translation task, using the datasets of South East
Asian and Finno-Ugric languages. In addition to its effectiveness, MeritOpt is also highly interpretable, as it can be applied to track the impact
of each language used for training. Our analysis reveals that target dataset size affects weight distribution across auxiliary languages, that
unrelated languages do not interfere with the training, and auxiliary optimizer parameters have minimal impact. Our approach is easy to apply
with a few lines of code, and we provide scripts for reproducing the experiments (https://github.com/VityaVitalich/MeritOpt).
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Creative and Context-Aware Translation of East Asian Idioms with GPT-4
Kenan Tang, Peiyang Song, Yao Qin, Xifeng Yan
As a type of figurative language, an East Asian idiom condenses rich cultural background into only a few characters. Translating such idioms
is challenging for human translators, who often resort to choosing a context-aware translation from an existing list of candidates. However,
compiling a dictionary of candidate translations demands much time and creativity even for expert translators. To alleviate this burden, we
evaluate if GPT-4 can help generate high-quality translations. Based on automatic evaluations of faithfulness and creativity, we first identify
Pareto-optimal prompting strategies that can outperform translation engines from Google and DeepL. Then, at a low cost, our context-aware
translations can achieve far more high-quality translations per idiom than the human baseline. We open-source all code and data to facilitate
further research.
cover pathological translations, such as hallucinations. Our findings shed light on the internal workings of LLM-based MT which go beyond
those known for standard encoder-decoder MT models.
Question Answering 1
Nov 12 (Tue) 16:00-17:30 - Room: Riverfront Hall
pose KB-Plugin, a plug-and-play framework that enables LLMs to induce programs over any low-resourced KB. Firstly, KB-Plugin adopts
self-supervised learning to encode the detailed schema information of a given KB into a pluggable module, namely schema plugin. Secondly,
KB-Plugin utilizes abundant annotated data from a rich-resourced KB to train another pluggable module, namely PI plugin, which can help
the LLM extract question-relevant schema information from the schema plugin of any KB and utilize the information to induce programs over
this KB. Experiments show that KB-Plugin outperforms SoTA low-resourced PI methods with 25x smaller backbone LLM on both large-scale
and domain-specific KBs, and even approaches the performance of supervised methods.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
CoTKR: Chain-of-Thought Enhanced Knowledge Rewriting for Complex Knowledge Graph Question Answering
Yike Wu, Yi Huang, Nan Hu, YUNCHENG HUA, Guilin Qi, Jiaoyan Chen, Jeff Z. Pan
Recent studies have explored the use of Large Language Models (LLMs) with Retrieval Augmented Generation (RAG) for Knowledge Graph
Question Answering (KGQA). They typically require rewriting retrieved subgraphs into natural language formats comprehensible to LLMs.
However, when tackling complex questions, the knowledge rewritten by existing methods may include irrelevant information, omit crucial
details, or fail to align with the question’s semantics. To address these issues, we propose a novel rewriting method, CoTKR (Chain-of-Thought
Enhanced Knowledge Rewriting), which generates reasoning traces and corresponding knowledge in an interleaved manner, thereby mitigating
the limitations of single-step knowledge rewriting. Additionally, to bridge the preference gap between the knowledge rewriter and the question
answering (QA) model, we propose a training strategy, PAQAF (Preference Alignment from Question Answering Feedback), which leverages
feedback from the QA model to further optimize the knowledge rewriter. We conduct experiments using various LLMs across several KGQA
benchmarks. Experimental results demonstrate that, compared with previous knowledge rewriting methods, CoTKR generates the most ben-
eficial knowledge representation for QA models, which significantly improves the performance of LLMs in KGQA.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Teaching LLMs to Abstain across Languages via Multilingual Feedback
Shangbin Feng, Weijia Shi, Yike Wang, Wenxuan Ding, Orevaoghene Ahia, Shuyue Stella Li, Vidhisha Balachandran, Sunayana Sitaram, Yulia
Tsvetkov
Multilingual LLMs often have knowledge disparities across languages, with larger gaps in under-resourced languages. Teaching LLMs to ab-
stain in the face of knowledge gaps is thus a promising strategy to mitigate hallucinations in multilingual settings. However, previous studies
on LLM abstention primarily focus on English; we find that directly applying existing solutions beyond English results in up to 20.5% perfor-
mance gaps between high and low-resource languages, potentially due to LLMs’ drop in calibration and reasoning beyond a few resource-rich
languages. To this end, we propose strategies to enhance LLM abstention by learning from multilingual feedback, where LLMs self-reflect
on proposed answers in one language by generating multiple feedback items in related languages: we show that this helps identify the
knowledge gaps across diverse languages, cultures, and communities. Extensive experiments demonstrate that our multilingual feedback
approach outperforms various strong baselines, achieving up to 9.2% improvement for low-resource languages across three black-box and
open models on three datasets, featuring open-book, closed-book, and commonsense QA. Further analysis reveals that multilingual feed-
back is both an effective and a more equitable abstain strategy to serve diverse language speakers, and cultural factors have great impact on
language selection and LLM abstention behavior, highlighting future directions for multilingual and multi-cultural reliable language modeling.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
RAG-QA Arena: Evaluating Domain Robustness for Long-form Retrieval Augmented Question Answering
Rujun Han, Yuhao Zhang, Peng Qi, Yumo Xu, Jenyuan Wang, Lan Liu, William Yang Wang, Bonan Min, Vittorio Castelli
Question answering based on retrieval augmented generation (RAG-QA) is an important research topic in NLP and has a wide range of
real-world applications. However, most existing datasets for this task are either constructed using a single source corpus or consist of short
extractive answers, which fall short of evaluating large language model (LLM) based RAG-QA systems on cross-domain generalization.
To address these limitations, we create Long-form RobustQA (LFRQA), a new dataset comprising human-written long-form answers that
integrate short extractive answers from multiple documents into a single, coherent narrative, covering 26K queries and large corpora across
seven different domains. We further propose RAG-QA Arena by directly comparing model-generated answers against LFRQA’s answers
using LLMs as evaluators. We show via extensive experiments that RAG-QA Arena and human judgments on answer quality are highly
correlated. Moreover, only 41.3% of the most competitive LLM’s answers are preferred to LFRQA’s answers, demonstrating RAG-QA Arena
as a challenging evaluation platform for future research.
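A minimal sketch of the pairwise LLM-as-evaluator protocol, with a stub judge standing in for a real LLM call and a simplified prompt (not the RAG-QA Arena prompt):

def pairwise_win_rate(questions, model_answers, reference_answers, judge):
    """Fraction of questions where the judge prefers the model's answer."""
    wins = 0
    for q, a, ref in zip(questions, model_answers, reference_answers):
        prompt = (f"Question: {q}\nAnswer A: {a}\nAnswer B: {ref}\n"
                  "Which answer is better? Reply 'A' or 'B'.")
        wins += judge(prompt) == "A"
    return wins / len(questions)

def toy_judge(prompt: str) -> str:
    # Stand-in for an LLM call: naively prefer the longer answer.
    a = prompt.split("Answer A: ")[1].split("\nAnswer B: ")[0]
    b = prompt.split("Answer B: ")[1].split("\nWhich")[0]
    return "A" if len(a) > len(b) else "B"

rate = pairwise_win_rate(["Why is the sky blue?"],
                         ["Rayleigh scattering."],
                         ["Shorter wavelengths scatter more in air."],
                         toy_judge)
print(f"win rate vs. reference: {rate:.0%}")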
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain Question Answering
Yuhao Wang, Ruiyang Ren, Junyi Li, Xin Zhao, Jing Liu, Ji-Rong Wen
Considering the limited internal parametric knowledge, retrieval-augmented generation (RAG) has been widely used to extend the knowledge
scope of large language models (LLMs). Despite the extensive efforts on RAG research, in existing methods, LLMs cannot precisely assess
the relevance of retrieved documents, thus likely leading to misleading or even incorrect utilization of external knowledge (i.e., retrieved
documents). To address this issue, in this paper, we propose REAR, a RElevance-Aware Retrieval-augmented approach for open-domain
question answering (QA). As the key motivation, we aim to enhance the self-awareness regarding the reliability of external knowledge for
LLMs, so as to adaptively utilize external knowledge in RAG systems. Specifically, we develop a novel architecture for LLM-based RAG sys-
tems by incorporating a specially designed assessment module that precisely assesses the relevance of retrieved documents. Furthermore, we
propose an improved training method based on bi-granularity relevance fusion and noise-resistant training. By combining the improvements
in both architecture and training, our proposed REAR can better utilize external knowledge by effectively perceiving the relevance of retrieved
documents. Experiments on four open-domain QA tasks show that REAR significantly outperforms a number of previous competitive RAG
approaches. Our codes can be accessed at https://github.com/RUCAIBox/REAR.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Right for Right Reasons: Large Language Models for Verifiable Commonsense Knowledge Graph Question Answering
Armin Toroghi, Willis Guo, Mohammad Mahdi Abdollah Pour, Scott Sanner
Knowledge Graph Question Answering (KGQA) methods seek to answer Natural Language questions using the relational information stored
in Knowledge Graphs (KGs). With the recent advancements of Large Language Models (LLMs) and their remarkable reasoning abilities, there
is a growing trend to leverage them for KGQA. However, existing methodologies have only focused on answering factual questions, e.g., "In
which city was Silvio Berlusconi’s first wife born?", leaving questions involving commonsense reasoning that real-world users may pose
more often, e.g., "Do I need separate visas to see the Venus of Willendorf and attend the Olympics this summer?", unaddressed. In this work,
we first observe that existing LLM-based methods for KGQA struggle with hallucination on such questions, especially on queries targeting
long-tail entities (e.g., non-mainstream and recent entities), thus hindering their applicability in real-world applications especially since their
reasoning processes are not easily verifiable. In response, we propose Right for Right Reasons (R3), a commonsense KGQA methodology
that allows for a verifiable reasoning procedure by axiomatically surfacing intrinsic commonsense knowledge of LLMs and grounding every
factual reasoning step on KG triples. Through experimental evaluations across three different tasks (question answering, claim verification, and
preference matching), our findings showcase R3 as a superior approach, outperforming existing methodologies and notably reducing instances
of hallucination and reasoning errors.
KnowTuning, through automatic and human evaluations, across various sizes of LLMs. We further verify that KnowTuning generates more
facts with less factual error rate under fine-grained facts evaluation.
methods to map requests onto. In order to bridge this gap, we present CoXQL, the first dataset in the NLP domain for user intent recognition
in ConvXAI, covering 31 intents, seven of which require filling multiple slots. Subsequently, we enhance an existing parsing approach by
incorporating template validations, and conduct an evaluation of several LLMs on CoXQL using different parsing strategies. We conclude
that the improved parsing approach (MP+) surpasses the performance of previous approaches. We also discover that intents with multiple
slots remain highly challenging for LLMs.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
MedLogic-AQA: Enhancing Medicare Question Answering with Abstractive Models Focusing on Logical Structures
Aizan Zafar, Kshitij Mishra, Asif Ekbal
In Medicare question-answering (QA) tasks, the need for effective systems is pivotal in delivering accurate responses to intricate medical
queries. However, existing approaches often struggle to grasp the intricate logical structures and relationships inherent in medical contexts,
thus limiting their capacity to furnish precise and nuanced answers. In this work, we address this gap by proposing a novel Abstractive QA sys-
tem MedLogic-AQA that harnesses first-order logic-based rules extracted from both context and questions to generate well-grounded answers.
Through initial experimentation, we identified six pertinent first-order logical rules, which were then used to train a Logic-Understanding (LU)
model capable of generating logical triples for a given context, question, and answer. These logic triples are then integrated into the training
of MedLogic-AQA, enabling coherent and well-grounded reasoning during answer generation. This distinctive fusion of logical reasoning with
abstractive question answering equips our system to produce answers that are logically sound, relevant, and engaging. Both automated and
human-based evaluation demonstrate the robustness of MedLogic-AQA against strong baselines. Through empirical assessments
and case studies, we validate the efficacy of MedLogic-AQA in elevating the quality and comprehensiveness of answers in terms of reasoning
as well as informativeness.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
AHP-Powered LLM Reasoning for Multi-Criteria Evaluation of Open-Ended Responses
Xiaotian Lu, Jiyi Li, Koh Takeuchi, Hisashi Kashima
Question answering (QA) tasks have been extensively studied in the field of natural language processing (NLP). Answers to open-ended
questions are highly diverse and difficult to quantify, and cannot be simply evaluated as correct or incorrect, unlike closed-ended questions with
definitive answers. While large language models (LLMs) have demonstrated strong capabilities across various tasks, they exhibit relatively
weaker performance in evaluating answers to open-ended questions. In this study, we propose a method that leverages LLMs and the analytic
hierarchy process (AHP) to assess answers to open-ended questions. We utilized LLMs to generate multiple evaluation criteria for a question.
Subsequently, answers were subjected to pairwise comparisons under each criterion with LLMs, and scores for each answer were calculated
in the AHP. We conducted experiments on four datasets using both ChatGPT-3.5-turbo and GPT-4. Our results indicate that our approach
more closely aligns with human judgment compared to the four baselines. Additionally, we explored the impact of the number of criteria,
variations in models, and differences in datasets on the results.
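The AHP aggregation step admits a compact worked example. In the sketch below, the pairwise judgment values are invented stand-ins for LLM outputs, and weights are obtained from the principal eigenvector of each reciprocal comparison matrix:

import numpy as np

def ahp_weights(pairwise):
    """pairwise[i][j] = how strongly item i is preferred over item j
    (a reciprocal matrix); weights come from the principal eigenvector."""
    vals, vecs = np.linalg.eig(np.asarray(pairwise, dtype=float))
    principal = np.abs(np.real(vecs[:, np.argmax(np.real(vals))]))
    return principal / principal.sum()

# Two answers compared under two criteria (e.g., relevance, depth), plus
# criterion-importance judgments; all judgment values are illustrative.
answers_by_criterion = [ahp_weights([[1, 3], [1/3, 1]]),   # criterion 1
                        ahp_weights([[1, 1/2], [2, 1]])]   # criterion 2
criterion_weights = ahp_weights([[1, 2], [1/2, 1]])

scores = sum(w * a for w, a in zip(criterion_weights, answers_by_criterion))
print("answer scores:", scores)  # higher = better under the AHP aggregation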
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
SynTQA: Synergistic Table-based Question Answering via Mixture of Text-to-SQL and E2E TQA
Siyue Zhang, Anh Tuan Luu, Chen Zhao
Text-to-SQL parsing and end-to-end question answering (E2E TQA) are two main approaches for Table-based Question Answering task.
Despite success on multiple benchmarks, they have yet to be compared and their synergy remains unexplored. In this paper, we identify dif-
ferent strengths and weaknesses through evaluating state-of-the-art models on benchmark datasets: Text-to-SQL demonstrates superiority in
handling questions involving arithmetic operations and long tables; E2E TQA excels in addressing ambiguous questions, non-standard table
schema, and complex table contents. To combine both strengths, we propose a Synergistic Table-based Question Answering approach that
integrates different models via answer selection and is agnostic to model type. Further experiments validate that ensembling models
with either a feature-based or an LLM-based answer selector significantly improves performance over the individual models.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Typos that Broke the RAGs Back: Genetic Attack on RAG Pipeline by Simulating Documents in the Wild via Low-level Perturbations
Sukmin Cho, Soyeong Jeong, Jeongyeon Seo, Taeho Hwang, Jong C. Park
The robustness of recent Large Language Models (LLMs) has become increasingly crucial as their applicability expands across various do-
mains and real-world applications. Retrieval-Augmented Generation (RAG) is a promising solution for addressing the limitations of LLMs,
yet existing studies on the robustness of RAG often overlook the interconnected relationships between RAG components or the potential
threats prevalent in real-world databases, such as minor textual errors. In this work, we investigate two underexplored aspects when assessing
the robustness of RAG: 1) vulnerability to noisy documents through low-level perturbations and 2) a holistic evaluation of RAG robustness.
Furthermore, we introduce a novel attack method, the Genetic Attack on RAG (GARAG), which targets these aspects. Specifically, GARAG
is designed to reveal vulnerabilities within each component and test the overall system functionality against noisy documents. We validate
RAG robustness by applying our GARAG to standard QA datasets, incorporating diverse retrievers and LLMs. The experimental results show
that GARAG consistently achieves high attack success rates. Also, it significantly devastates the performance of each component and their
synergy, highlighting the substantial risk that minor textual inaccuracies pose in disrupting RAG systems in the real world. Code is available
at https://github.com/zomss/GARAG.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
A Notion of Complexity for Theory of Mind via Discrete World Models
X. Angelo Huang, Emanuele La Malfa, Samuele Marro, Andrea Asperti, Anthony G. Cohn, Michael J. Wooldridge
Theory of Mind (ToM) can be used to assess the capabilities of Large Language Models (LLMs) in complex scenarios where social reasoning
is required. While the research community has proposed many ToM benchmarks, their hardness varies greatly, and their complexity is not
well defined. This work proposes a framework inspired by cognitive load theory to measure the complexity of ToM tasks. We quantify a
problem’s complexity as the number of states necessary to solve it correctly. Our complexity measure also accounts for spurious states of a
ToM problem designed to make it apparently harder. We use our method to assess the complexity of five widely adopted ToM benchmarks.
On top of this framework, we design a prompting technique that augments the information available to a model with a description of how the
environment changes with the agents’ interactions. We name this technique Discrete World Models (DWM) and show how it elicits superior
performance on ToM tasks.
Bowen Jiang, Yangxinyu Xie, Zhuoqun Hao, Xiaomeng Wang, Tanwi Mallick, Weijie J Su, Camillo Jose Taylor, Dan Roth
This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or
primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical
reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our
framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning ca-
pabilities of LLMs. The findings in this study suggest, with statistical guarantees, that most LLMs still struggle with logical reasoning. While
they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby
raising concerns about their actual reasoning and generalization abilities.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs
Shadi Iskander, Sofia Tolmach, Ori Shapira, Nachshon Cohen, Zohar Karnin
Training large language models (LLMs) for external tool usage is a rapidly expanding field, with recent research focusing on generating
synthetic data to address the shortage of available data. However, the absence of systematic data quality checks poses complications for
properly training and testing models. To that end, we propose two approaches for assessing the reliability of data for training LLMs to use
external tools. The first approach uses intuitive, human-defined correctness criteria. The second approach uses a model-driven assessment
with in-context evaluation. We conduct a thorough evaluation of data quality on two popular benchmarks, followed by an extrinsic evaluation
that showcases the impact of data quality on model performance. Our results demonstrate that models trained on high-quality data outperform
those trained on unvalidated data, even when trained with a smaller quantity of data. These findings empirically support the significance of
assessing and ensuring the reliability of training data for tool-using LLMs.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
NLEBench+NorGLM: A Comprehensive Empirical Analysis and Benchmark Dataset for Generative Language Models in Norwegian
Peng Liu, Lemei Zhang, Terje Farup, Even W. Lauvrak, Jon Espen Ingvaldsen, Simen Eide, Jon Atle Gulla, Zhirong Yang
Norwegian, spoken by only 5 million people, is under-represented in the most impressive breakthroughs in NLP. To the best of our knowledge,
there has not yet been a comprehensive evaluation of existing language models (LMs) on Norwegian generation tasks at the time of writing.
To fill this gap, we 1) compiled existing Norwegian datasets and pre-trained four Norwegian Open Language Models of varying parameter
scales and architectures, collectively called NorGLM; 2) introduced a comprehensive benchmark,
NLEBench, for evaluating natural language generation capabilities in Norwegian, encompassing translation and human annotation. Based on
the investigation, we find that: 1) the mainstream, English-dominated LM GPT-3.5 has limited capability in understanding the Norwegian
context; 2) the increase in model parameter scales demonstrates limited impact on the performance of downstream tasks when the pre-training
dataset is constrained in size; 3) smaller models also demonstrate the reasoning capability through Chain-of-Thought; 4) a multi-task dataset
that includes synergy tasks can be used to verify the generalizability of LLMs on natural language understanding and, meanwhile, test the
interconnectedness of these NLP tasks. We share our resources and code for reproducibility under a CC BY-NC 4.0 license.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Boosting Scientific Concepts Understanding: Can Analogies from Teacher Models Empower Student Models?
Siyu Yuan, Cheng Jiayang, Lin Qiu, Deqing Yang
Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar
ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality
under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human edu-
cation process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs
in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies
can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on
scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available
at https://github.com/siyuyuan/SCUA.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
InterIntent: Investigating Social Intelligence of LLMs via Intention Understanding in an Interactive Game Context
Ziyi Liu, Abhishek Anand, Pei Zhou, Jen-tse Huang, Jieyu Zhao
Large language models (LLMs) have demonstrated the potential to mimic human social intelligence. However, most studies focus on sim-
plistic and static self-report or performance-based tests, which limits the depth and validity of the analysis. In this paper, we developed a
novel framework, InterIntent, to assess LLMs’ social intelligence by mapping their ability to understand and manage intentions in a game
setting. We focus on four dimensions of social intelligence: situational awareness, self-regulation, self-awareness, and theory of mind. Each
dimension is linked to a specific game task: intention selection, intention following, intention summarization, and intention guessing. Our
findings indicate that while LLMs exhibit high proficiency in selecting intentions, achieving an accuracy of 88%, their ability to infer the
intentions of others is significantly weaker, trailing human performance by 20%. Additionally, game performance correlates with intention
understanding, highlighting the importance of the four components towards success in this game. These findings underline the crucial role
of intention understanding in evaluating LLMs’ social intelligence and highlight the potential of using social deduction games as a complex
testbed to enhance LLM evaluation. InterIntent contributes a structured approach to bridging the evaluation gap in social intelligence within
multiplayer LLM-based games.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts, Varun Gumma, Aditya Yadavalli, Vivek Seshadri, Manohar Swaminathan, Sunayana Sitaram
Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors – the lack of benchmarks with sufficient
linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated
benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across
10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3
70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings, pairwise comparison and direct
assessment, and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting
but the agreement drops for direct assessment evaluation especially for languages such as Bengali and Odia. We also check for various biases
in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards
scaling up multilingual evaluation of LLMs.
agents (RPAs) are particularly popular, especially for fictional characters. The prerequisite for these RPAs lies in the capability of LLMs
to understand characters from fictional works. Previous efforts have evaluated this capability via basic classification tasks or characteristic
imitation, failing to capture the nuanced character understanding with LLMs. In this paper, we propose evaluating LLMs’ character under-
standing capability via the character profiling task, i.e., summarizing character profiles from corresponding materials, a widely adopted yet
understudied practice for RPA development. Specifically, we construct the CROSS dataset from literature experts and assess the generated
profiles by comparing them with ground truth references and evaluating their applicability in downstream tasks. Our experiments, which
cover various summarization methods and LLMs, have yielded promising results. These results strongly validate the character understanding
capability of LLMs. Resources are available at https://github.com/Joanna0123/character_profiling.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Precise Model Benchmarking with Only a Few Observations
Riccardo Fogliato, Pratik Patil, Nil-Jana Akpinar, Mathew Monfort
How can we precisely estimate a large language model’s (LLM) accuracy on questions belonging to a specific topic within a larger question-
answering dataset? The standard direct estimator, which averages the model’s accuracy on the questions in each subgroup, may exhibit high
variance for subgroups (topics) with small sample sizes. Synthetic regression modeling, which leverages the model’s accuracy on questions
about other topics, may yield biased estimates that are too unreliable for large subgroups. We prescribe a simple yet effective solution:
an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately, improving the precision of
subgroup-level estimates of model performance. Our experiments on multiple datasets show that this approach consistently provides more
precise estimates of the LLM performance compared to the direct and regression approaches, achieving substantial reductions in the mean
squared error. Confidence intervals for EB estimates also have near-nominal coverage and are narrower compared to those for the direct
estimator. Additional experiments on tabular and vision data validate the benefits of this EB approach.
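The shrinkage at the heart of an empirical Bayes estimator can be sketched in a few lines; the variance estimates and weighting below are standard EB heuristics assumed for illustration, not the paper's exact estimator:

import numpy as np

def eb_estimate(direct_acc, n, regression_acc, tau2):
    """Blend per-subgroup direct accuracy with a regression-style prediction,
    trusting the direct estimate more as the subgroup size n grows.
    tau2: assumed between-subgroup variance."""
    direct_acc, n = np.asarray(direct_acc, float), np.asarray(n, float)
    sampling_var = direct_acc * (1 - direct_acc) / np.maximum(n, 1)
    w = tau2 / (tau2 + sampling_var)  # weight on the direct estimate
    return w * direct_acc + (1 - w) * np.asarray(regression_acc, float)

direct = [0.90, 0.55]      # per-topic accuracies
n = [400, 12]              # large vs. tiny subgroup
regression = [0.82, 0.78]  # predictions from other topics' accuracies
# The tiny subgroup is shrunk strongly toward its regression estimate.
print(eb_estimate(direct, n, regression, tau2=0.002))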
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
"Rows, Columns and Values, Oh My!" Synthesizing Scientific Literature into Tables using Language Models
Benjamin Newman, Yoonjoo Lee, Aakanksha Naik, Pao Siangliulue, Raymond Fok, Juho Kim, Daniel S Weld, Joseph Chee Chang, Kyle Lo
When conducting literature reviews, scientists often create literature review tables: tables whose rows are publications and whose columns
constitute a schema, a set of aspects used to compare and contrast the papers. Can we automatically generate these tables using language
models (LMs)? In this work, we introduce a framework that leverages LMs to perform this task by decomposing it into separate schema and
value generation steps. To enable experimentation, we address two main challenges: First, we overcome a lack of high-quality datasets to
benchmark table generation by curating and releasing arxivDIGESTables, a new dataset of 2,228 literature review tables extracted from ArXiv
papers that synthesize a total of 7,542 research papers. Second, to support scalable evaluation of model generations against human-authored
reference tables, we develop DecontextEval, an automatic evaluation method that aligns elements of tables with the same underlying aspects
despite differing surface forms. Given these tools, we evaluate LMs' abilities to reconstruct reference tables, finding that this task benefits from
additional context to ground the generation (e.g. table captions, in-text references). Finally, through a human evaluation study we find that
even when LMs fail to fully reconstruct a reference table, their generated novel aspects can still be useful.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Reasoning Robustness of LLMs to Adversarial Typographical Errors
Esther Gan, Yiran Zhao, Liying Cheng, Mao Yancan, Anirudh Goyal, Kenji Kawaguchi, Min-Yen Kan, Michael Shieh
Large Language Models (LLMs) have demonstrated impressive capabilities in reasoning using Chain-of-Thought (CoT) prompting. However,
CoT can be biased by users’ instruction. In this work, we study the reasoning robustness of LLMs to typographical errors, which can naturally
occur in users’ queries. We design an Adversarial Typo Attack (ATA) algorithm that iteratively samples typos for words that are important to
the query and selects the edit that is most likely to succeed in attacking. It shows that LLMs are sensitive to minimal adversarial typographical
changes. Notably, with 1 character edit, Mistral-7B’s accuracy drops from 43.7% to 38.6% on GSM8K, while with 8 character edits the
performance further drops to 19.2%. To extend our evaluation to larger and closed-source LLMs, we develop the R2ATA benchmark, which
assesses models’ Reasoning Robustness to ATA. It includes adversarial typographical questions derived from three widely used reasoning
datasets (GSM8K, BBH, and MMLU) by applying ATA to open-source LLMs. R2ATA demonstrates remarkable transferability and causes
notable performance drops across multiple super-large and closed-source LLMs.
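A toy rendering of the iterative typo-attack loop, where the word-importance selection and the victim model's scorer are stubs rather than the paper's ATA components:

import random

random.seed(0)

def char_edits(word, n_samples=5):
    """Sample single-character substitutions of a word."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    edits = set()
    for _ in range(n_samples):
        i = random.randrange(len(word))
        edits.add(word[:i] + random.choice(letters) + word[i + 1:])
    return edits

def typo_attack(words, important_idx, victim_score, steps=2):
    """Greedily keep the character edit that most lowers the victim's score."""
    words = list(words)
    for _ in range(steps):
        best_score = victim_score(" ".join(words))
        for i in important_idx:
            for e in char_edits(words[i]):
                trial = words[:i] + [e] + words[i + 1:]
                s = victim_score(" ".join(trial))
                if s < best_score:
                    best_score, words = s, trial
    return " ".join(words)

orig = "What is twelve plus thirty ?"
# Stub "model accuracy": drops as the query drifts from the original text.
stub_score = lambda q: 1.0 - 0.1 * sum(a != b for a, b in zip(q, orig))
print(typo_attack(orig.split(), important_idx=[3], victim_score=stub_score))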
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
AmbigNLG: Addressing Task Ambiguity in Instruction for NLG
Ayana Niwa, Hayate Iso
We introduce AmbigNLG, a novel task designed to tackle the challenge of task ambiguity in instructions for Natural Language Generation
(NLG). Ambiguous instructions often impede the performance of Large Language Models (LLMs), especially in complex NLG tasks. To
tackle this issue, we propose an ambiguity taxonomy that categorizes different types of instruction ambiguities and refines initial instructions
with clearer specifications. Accompanying this task, we present AmbigSNI_NLG, a dataset comprising 2,500 instances annotated to facili-
tate research in AmbigNLG. Through comprehensive experiments with state-of-the-art LLMs, we demonstrate that our method significantly
enhances the alignment of generated text with user expectations, achieving up to a 15.02-point increase in ROUGE scores. Our findings high-
light the critical importance of addressing task ambiguity to fully harness the capabilities of LLMs in NLG tasks. Furthermore, we confirm the
effectiveness of our method in practical settings involving interactive ambiguity mitigation with users, underscoring the benefits of leveraging
LLMs for interactive clarification.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
DataTales: A Benchmark for Real-World Intelligent Data Narration
Yajing Yang, Qian Liu, Min-Yen Kan
We introduce DataTales, a novel benchmark designed to assess the proficiency of language models in data narration, a task crucial for trans-
forming complex tabular data into accessible narratives. Existing benchmarks often fall short in capturing the requisite analytical complexity
for practical applications. DataTales addresses this gap by offering 4.9k financial reports paired with corresponding market data, showcasing
the demand for models to create clear narratives and analyze large datasets while understanding specialized terminology in the field. Our
findings highlight the significant challenge that language models face in achieving the necessary precision and analytical depth for proficient
data narration, suggesting promising avenues for future model development and evaluation methodologies.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading
Tu Anh Dinh, Carlos Mullov, Leonard Bärmann, Zhaolin Li, Danni Liu, Simon Reiß, Jueun Lee, Nathan Lerzer, Jianfeng Gao, Fabian
Peller-Konrad, Alexander Waibel, Tamim Asfour, Michael Beigl, Rainer Stiefelhagen, Carsten Dachsbacher, Klemens Böhm, Jan Niehues
With the rapid development of Large Language Models (LLMs), it is crucial to have benchmarks which can evaluate the ability of LLMs on
different domains. One common use of LLMs is performing tasks on scientific topics, such as writing algorithms, querying databases, or giving
mathematical proofs. Inspired by the way university students are evaluated on such tasks, in this paper we propose SciEx, a benchmark
consisting of university computer science exam questions, to evaluate LLMs’ ability to solve scientific tasks. SciEx is (1) multilingual,
containing both English and German exams; (2) multi-modal, containing questions that involve images; and (3) composed of various types of
freeform questions with different difficulty levels, due to the nature of university exams. We evaluate the performance of various state-of-the-
art LLMs on our new benchmark. Since SciEx questions are freeform, it is not straightforward to evaluate LLM performance. Therefore, we
provide human expert grading of the LLM outputs on SciEx. We show that the free-form exams in SciEx remain challenging for the current
LLMs, where the best LLM only achieves 59.4% exam grade on average. We also provide detailed comparisons between LLM performance
and student performance on SciEx. To enable future evaluation of new LLMs, we propose using LLM-as-a-judge to grade the LLM answers
on SciEx. Our experiments show that, although they do not perform perfectly on solving the exams, LLMs are decent as graders, achieving
0.948 Pearson correlation with expert grading.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
"A good pun is its own reword": Can Large Language Models Understand Puns?
Zhijun Xu, Siyu Yuan, Lingjie Chen, Deqing Yang
Puns play a vital role in academic research due to their distinct structure and clear definition, which aid in the comprehensive analysis of
linguistic humor. However, the understanding of puns in large language models (LLMs) has not been thoroughly examined, limiting their
use in creative writing and humor creation. In this paper, we leverage three popular tasks, i.e., pun recognition, explanation and generation to
systematically evaluate the capabilities of LLMs in pun understanding. In addition to adopting the automated evaluation metrics from prior
research, we introduce new evaluation methods and metrics that are better suited to the in-context learning paradigm of LLMs. These new
metrics offer a more rigorous assessment of an LLM’s ability to understand puns and align more closely with human cognition than previ-
ous metrics. Our findings reveal the "lazy pun generation" pattern and identify the primary challenges LLMs encounter in understanding puns.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
Ben Bogin, Kejuan Yang, Shashank Gupta, Kyle Richardson, Erin Bransom, Peter Clark, Ashish Sabharwal, Tushar Khot
Given that Large Language Models (LLMs) have made significant progress in writing code, can they now be used to autonomously reproduce
results from research repositories? Such a capability would be a boon to the research community, helping researchers validate, understand,
and extend prior work. To advance towards this goal, we introduce SUPER, the first benchmark designed to evaluate the capability of LLMs in
setting up and executing tasks from research repositories. SUPER aims to capture the realistic challenges faced by researchers working with
Machine Learning (ML) and Natural Language Processing (NLP) research repositories. Our benchmark comprises three distinct problem sets:
45 end-to-end problems with annotated expert solutions, 152 sub-problems derived from the expert set that focus on specific challenges (e.g.,
configuring a trainer), and 602 automatically generated problems for larger-scale development. We introduce various evaluation measures
to assess both task success and progress, utilizing gold solutions when available or approximations otherwise. We show that state-of-the-art
approaches struggle to solve these problems with the best model (GPT-4o) solving only 16.3% of the end-to-end set, and 46.1% of the sce-
narios. This illustrates the challenge of this task, and suggests that SUPER can serve as a valuable resource for the community to make and
measure progress.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
DA-Code: Agent Data Science Code Generation Benchmark for Large Language Models
Yiming Huang, Jianwen Luo, Yan Yu, Yitong Zhang, Fangyu Lei, Yifan Wei, Shizhu He, Lifu Huang, Xiao Liu, Jun Zhao, Kang Liu
We introduce DA-Code, a code generation benchmark specifically designed to assess LLMs on agent-based data science tasks. This bench-
mark features three core elements: First, the tasks within DA-Code are inherently challenging, setting them apart from traditional code
generation tasks and demanding advanced coding skills in grounding and planning. Second, examples in DA-Code are all based on real and
diverse data, covering a wide range of complex data wrangling and analytics tasks. Third, to solve the tasks, the models must utilize complex
data science programming languages, including Python and SQL, to perform intricate data processing and derive the answers. We set up the
benchmark in a controllable and executable environment that aligns with real-world data analysis scenarios and is scalable. The annotators
meticulously designed the evaluation suite to ensure the accuracy and robustness of the evaluation. We developed the DA-Agent baseline.
Experiments show that although the baseline performs better than other existing frameworks, using the current best LLMs achieves only
30.5% accuracy, leaving ample room for improvement. We release our benchmark at https://github.com/yiyihum/dabench.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
DynamicER: Resolving Emerging Mentions to Dynamic Entities for RAG
Jinyoung Kim, Dayoon Ko, Gunhee Kim
In the rapidly evolving landscape of language, resolving new linguistic expressions in continuously updating knowledge bases remains a
formidable challenge. This challenge becomes critical in retrieval-augmented generation (RAG) with knowledge bases, as emerging expres-
sions hinder the retrieval of relevant documents, leading to generator hallucinations. To address this issue, we introduce a novel task aimed
at resolving emerging mentions to dynamic entities and present the DynamicER benchmark. Our benchmark includes a dynamic entity mention resolution task and an entity-centric knowledge-intensive QA task, evaluating entity linking and RAG models' adaptability to new expressions, respectively. We find that current entity linking models struggle to link these new expressions to entities. Therefore, we propose a temporal segmented clustering method with continual adaptation, effectively managing the temporal dynamics of evolving entities and emerging mentions. Extensive experiments demonstrate that our method outperforms existing baselines, enhancing RAG model performance on the QA task with resolved mentions.
complex reasoning. Despite recent advancements in Large Language Models (LLMs) for LFTQA, evaluating their effectiveness remains a sig-
nificant challenge. We introduce LFTQA-Eval, a meta-evaluation dataset comprising 2,988 human-annotated examples, to rigorously assess the efficacy of current automated metrics in evaluating LLM-based LFTQA systems, with a focus on faithfulness and comprehensiveness. Our
findings reveal that existing automatic metrics poorly correlate with human judgments and fail to consistently differentiate between factually
accurate responses and those that are coherent but factually incorrect. Additionally, our in-depth examination of the limitations associated
with automated evaluation methods provides essential insights for the improvement of LFTQA automated evaluation.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
FinDVer: Explainable Claim Verification over Long and Hybrid-content Financial Documents
Yilun Zhao, Yitao Long, Tintin Jiang, Weiyuan Chen, Chengye Wang, Hongjun Liu, Xiangru Tang, Yiming Zhang, Chen Zhao, Arman Cohan
We introduce FinDVer, a comprehensive benchmark specifically designed to evaluate the explainable claim verification capabilities of LLMs
in the context of understanding and analyzing long, hybrid-content financial documents. FinDVer contains 4,000 expert-annotated examples
across four subsets, each focusing on a type of scenario that frequently arises in real-world financial domains. We assess a broad spectrum of
25 LLMs under long-context and RAG settings. Our results show that even the current best-performing system (i.e., GPT-4o) significantly
lags behind human experts. Our detailed findings and insights highlight the strengths and limitations of existing LLMs in this new task.
We believe FinDVer can serve as a valuable benchmark for evaluating LLM capabilities in claim verification over complex, expert-domain
documents.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
AKEW: Assessing Knowledge Editing in the Wild
Xiaobao Wu, Liangming Pan, William Yang Wang, Anh Tuan Luu
Knowledge editing injects knowledge updates into language models to keep them correct and up-to-date. However, its current evaluations deviate significantly from practice: their knowledge updates consist solely of structured facts derived from meticulously crafted datasets, rather than from practical sources such as unstructured texts like news articles, and they often overlook practical real-world knowledge updates. To address
these issues, in this paper we propose AKEW (Assessing Knowledge Editing in the Wild), a new practical benchmark for knowledge editing.
AKEW fully covers three editing settings of knowledge updates: structured facts, unstructured texts as facts, and extracted triplets. It further
introduces new datasets featuring both counterfactual and real-world knowledge updates. Through extensive experiments, we demonstrate
the considerable gap between state-of-the-art knowledge-editing methods and practical scenarios. Our analyses further highlight key insights
to motivate future research for practical knowledge editing.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
WorryWords: Norms of Anxiety Association for 44,450 English Words
Saif M. Mohammad
Anxiety, the anticipatory unease about a potential negative outcome, is a common and beneficial human emotion. However, there is still much that is not known about anxiety, such as how it relates to our body and how it manifests in language; this is especially pertinent given the increasing impact of related disorders. In this work, we introduce WorryWords, the first large-scale repository of manually derived word–anxiety associations for over 44,450 English words. We show that the anxiety associations are highly reliable. We use WorryWords to study the relationship between anxiety and other emotion constructs, as well as the rate at which children acquire anxiety words with age. Finally, we show that using WorryWords alone, one can accurately track the change of anxiety in streams of text. WorryWords enables a wide variety of anxiety-related research in psychology, NLP, public health, and the social sciences. WorryWords (and its translations to over 100 languages) is freely available at http://saifmohammad.com/worrywords.html.
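As a rough illustration of the lexicon-based tracking the abstract describes (not the authors' code; the words and scores below are invented, not actual WorryWords values), a text stream can be scored by averaging the anxiety associations of the lexicon words it contains:

    # Illustrative only: toy lexicon entries, not real WorryWords scores.
    anxiety_lexicon = {"deadline": 0.8, "storm": 0.6, "calm": -0.7, "safe": -0.5}

    def anxiety_score(text):
        """Mean anxiety association of the lexicon words found in the text."""
        hits = [anxiety_lexicon[w] for w in text.lower().split() if w in anxiety_lexicon]
        return sum(hits) / len(hits) if hits else 0.0

    stream = ["the storm is near the deadline", "all safe and calm now"]
    for chunk in stream:
        print(f"{anxiety_score(chunk):+.2f}  {chunk}")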
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
YesBut: A High-Quality Annotated Multimodal Dataset for Evaluating the Satire Comprehension Capability of Vision-Language Models
Abhilash Nandy, Yash Agarwal, Ashish Patwa, Millon Madhur Das, Aman Bansal, ANKIT RAJ, Pawan Goyal, Niloy Ganguly
Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging
tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being
satirical), and Completion (given one half of the image, selecting the other half from two given options such that the complete image is satirical), and release a high-quality dataset, YesBut, consisting of 2547 images (1084 satirical and 1463 non-satirical) in different artistic styles to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario alongside a conflicting scenario that is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut dataset in zero-shot settings with respect to both automated and human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models
Eldar Kurtic, Amir Moeini, Dan Alistarh
We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning of large language models (LLMs), combining
ruleset interpretation, planning, and problem-solving. This benchmark is inspired by the Mathador game, where the objective is to reach a
target number using basic arithmetic operations on a given set of base numbers, following a simple set of rules. We show that, across leading
LLMs, we obtain stable average performance while generating benchmark instances dynamically, following a target difficulty level. Thus,
our benchmark alleviates concerns about test-set leakage into training data, an issue that often undermines popular benchmarks. Addition-
ally, we conduct a comprehensive evaluation of both open and closed-source state-of-the-art LLMs on Mathador-LM. Our findings reveal
that contemporary models struggle with Mathador-LM, scoring significantly lower than average 3rd graders. This stands in stark contrast to
their strong performance on popular mathematical reasoning benchmarks. The implementation of Mathador-LM benchmark is available at
https://github.com/IST-DASLab/Mathador-LM.
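For readers unfamiliar with the game, a minimal sketch of a Mathador-style instance checker makes the task concrete (our own simplification of the rules, not the benchmark's generator): combine the base numbers with +, -, *, and exact division to reach the target.

    # Illustrative brute-force check for a simplified Mathador-style instance:
    # can the target be reached from the base numbers, each used at most once?
    from itertools import combinations

    def reachable(numbers, target):
        if target in numbers:
            return True
        for a, b in combinations(range(len(numbers)), 2):
            x, y = numbers[a], numbers[b]
            rest = [n for i, n in enumerate(numbers) if i not in (a, b)]
            results = {x + y, x - y, y - x, x * y}
            if y != 0 and x % y == 0:
                results.add(x // y)   # only exact divisions are allowed
            if x != 0 and y % x == 0:
                results.add(y // x)
            if any(reachable(rest + [r], target) for r in results):
                return True
        return False

    print(reachable([3, 5, 7, 11], 26))  # True: 3 * 7 = 21, then 21 + 5 = 26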
their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books
that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy
analysis of future models.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
StorySpark: Expert-Annotated QA Pairs with Real-World Knowledge for Children Storytelling
Jiaju Chen, Yuxuan Lu, Shao Zhang, Bingsheng Yao, Yuanzhe Dong, Ying Xu, Yunyao Li, Qianwen Wang, Dakuo Wang, Yuling Sun
Interactive story reading is common in early childhood education, where teachers expect to teach both language skills and real-world knowl-
edge beyond the story. While many story reading systems have been developed for this activity, they often fail to infuse real-world knowledge
into the conversation. This limitation can be attributed to the existing question-answering (QA) datasets used for children's education, upon which the systems are built, failing to capture the nuances of how education experts think when conducting interactive story reading activities. To bridge this gap, we design an annotation framework, empowered by an existing knowledge graph, to capture experts' annotations and thinking processes, and leverage this framework to construct the StorySparkQA dataset, which comprises 5,868 expert-annotated QA pairs with real-world
knowledge. We conduct automated and human expert evaluations across various QA pair generation settings to demonstrate that our Sto-
rySparkQA can effectively support models in generating QA pairs that target real-world knowledge beyond story content. StorySparkQA is
available at https://huggingface.co/datasets/NEU-HAI/StorySparkQA.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Data Contamination Can Cross Language Barriers
Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang
The opacity in developing large language models (LLMs) is raising growing concerns about the potential contamination of public benchmarks
in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data,
which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods; it is deliberately injected by overfitting LLMs on translated versions of benchmark test sets. Then, we propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine an LLM's performance change after modifying the original benchmark by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices are not even wrong, as all the choices they memorized are correct. Experimental results demonstrate that cross-lingual contamination can easily fool existing
detection methods, but not ours. In addition, we discuss the potential utilization of cross-lingual contamination in interpreting LLMs’ working
mechanisms and in post-training LLMs for enhanced multilingual capabilities.
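A minimal sketch of the generalization probe described above (field names and structure are our assumptions, not the authors' code): every false option is replaced with a correct answer taken from another question, so a model that understands the task should find the probe easier, while a model that memorized the original options stumbles.

    # Build the "all options are correct somewhere" variant of an MCQ benchmark.
    import random

    def make_generalization_set(benchmark):
        """benchmark: list of {"question", "options", "answer_idx"} dicts."""
        pool = [q["options"][q["answer_idx"]] for q in benchmark]
        probed = []
        for i, q in enumerate(benchmark):
            correct = q["options"][q["answer_idx"]]
            others = [a for j, a in enumerate(pool) if j != i]
            # replace every false choice with a correct answer from another question
            options = [correct] + random.sample(others, k=len(q["options"]) - 1)
            random.shuffle(options)
            probed.append({"question": q["question"],
                           "options": options,
                           "answer_idx": options.index(correct)})
        return probed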
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
GuardBench: A Large-Scale Benchmark for Guardrail Models
Elias Bassani, Ignacio Sanchez
Generative AI systems powered by Large Language Models have become increasingly popular in recent years. Lately, due to the risk of
providing users with unsafe information, the adoption of those systems in safety-critical domains has raised significant concerns. To respond
to this situation, input-output filters, commonly called guardrail models, have been proposed to complement other measures, such as model
alignment. Unfortunately, the lack of a standard benchmark for guardrail models poses significant evaluation issues and makes it hard to
compare results across scientific publications. To fill this gap, we introduce GuardBench, a large-scale benchmark for guardrail models
comprising 40 safety evaluation datasets. To facilitate the adoption of GuardBench, we release a Python library providing an automated
evaluation pipeline built on top of it. With our benchmark, we also share the first large-scale prompt moderation datasets in German, French,
Italian, and Spanish. To assess the current state-of-the-art, we conduct an extensive comparison of recent guardrail models and show that a
general-purpose instruction-following model of comparable size achieves competitive results without the need for specific fine-tuning.
derived scientific QA dataset, designed to facilitate research on complex reasoning within the domain of question answering for scientific texts.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
ArMeme: Propagandistic Content in Arabic Memes
Firoj Alam, Abul Hasnat, Fatema Ahmad, Md. Arid Hasan, Maram Hasanain
With the rise of digital communication, memes have become a significant medium for cultural and political expression that is often used to mislead audiences. Identifying such misleading and persuasive multimodal content has become increasingly important for various stakeholders, including social media platforms, policymakers, and the broader society, as such content often causes harm to individuals, organizations, and/or society. While there have been efforts to develop AI-based automatic detection systems for resource-rich languages (e.g., English), relatively little to no such work exists for medium- to low-resource languages. In this study, we focus on developing an Arabic memes dataset with manual annotations of propagandistic content. We annotated ~6K Arabic memes collected from various social media platforms, constituting the first such resource for Arabic multimodal research, and we provide a comprehensive analysis aimed at developing computational tools for their detection. We make the dataset publicly available for the community.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Evaluating Large Language Models on Time Series Feature Understanding: A Comprehensive Taxonomy and Benchmark
Elizabeth Fons, Rachneet Kaur, Soham Palande, Zhen Zeng, Tucker Balch, Manuela Veloso, Svitlana Vyetrenko
Large Language Models (LLMs) offer the potential for automatic time series analysis and reporting, which is a critical task across many
domains, spanning healthcare, finance, climate, energy, and many more. In this paper, we propose a framework for rigorously evaluating
the capabilities of LLMs on time series understanding, encompassing both univariate and multivariate forms. We introduce a comprehensive
taxonomy of time series features, a critical framework that delineates various characteristics inherent in time series data. Leveraging this
taxonomy, we have systematically designed and synthesized a diverse dataset of time series, embodying the different outlined features, each
accompanied by textual descriptions. This dataset acts as a solid foundation for assessing the proficiency of LLMs in comprehending time
series. Our experiments shed light on the strengths and limitations of state-of-the-art LLMs in time series understanding, revealing which features these models comprehend readily and where they falter. In addition, we uncover the sensitivity of LLMs to factors including the formatting of the data, the position of the points queried within a series, and the overall time series length.
benchmark of 3 million papers to quantify these trends. Our benchmark shows that desirable policies for combining edge- and node-based methods depend on h and t. We release our benchmark, evaluation scripts, and embeddings.
in reasoning and bias mitigation can be seen. These findings provide important insights for the development of more robust MLLMs and
contribute to the broader goal of advancing multimodal AI systems capable of deeper understanding and reasoning. Our project page is at
https://github.com/OpenCausaLab/MORE.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
PythonSaga: Redefining the Benchmark to Evaluate Code Generating LLMs
Ankit Yadav, Mayank Singh, Himanshu Beniwal
Driven by the surge in code generation using large language models (LLMs), numerous benchmarks have emerged to evaluate these LLMs' capabilities. We conducted a large-scale human evaluation of HumanEval and MBPP, two popular benchmarks for Python code generation, analyzing their diversity and difficulty. Our findings unveil a critical bias towards a limited set of programming concepts, neglecting most of the other concepts entirely. Furthermore, we uncover a worrying prevalence of easy tasks that can inflate model performance estimations. To address these limitations, we propose a novel benchmark, PythonSaga, featuring 185 hand-crafted prompts in a balanced representation of 38 programming concepts across diverse difficulty levels. The robustness of our benchmark is demonstrated by the poor performance of existing Code-LLMs. The code and dataset are openly available to the NLP community at https://github.com/PythonSaga/PythonSaga.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
Xinyi Chen, Baohao Liao, Jirui Qi, Panagiotis Eustratiadis, Christof Monz, Arianna Bisazza, Maarten de Rijke
Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant chal-
lenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and
(iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models’ abilities to follow
multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is
verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question
answering, mathematics, and security rule following), each assessing different aspects of sequential instruction following. Our evaluation of
popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller
counterparts on the SIFo tasks, validating the benchmark’s effectiveness. All models struggle with following sequences of instructions, hinting
at an important lack of robustness of today’s language models.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts
Lotem Golany, Filippo Galgani, Maya Mamo, Nimrod Parasol, Omer Vandsburger, Nadav Bar, Ido Dagan
Automating data generation with Large Language Models (LLMs) has become increasingly popular. In this work, we investigate the feasibil-
ity and effectiveness of LLM-based data generation in the challenging setting of source-grounded information-seeking dialogs, with response
attribution, over long documents. Our source texts consist of long and noisy meeting transcripts, adding to the task complexity. Since automat-
ing attribution remains difficult, we propose a semi-automatic approach: dialog queries and responses are generated with LLMs, followed by
human verification and identification of attribution spans. Using this approach, we created MISeD (Meeting Information Seeking Dialogs), a dataset of information-seeking dialogs focused on meeting transcripts. Models finetuned with MISeD demonstrate superior performance compared to off-the-shelf models, even those of larger size. Finetuning on MISeD gives comparable response generation quality to
finetuning on fully manual data, while improving attribution quality and reducing time and effort.
Nov 12 (Tue) 16:00-17:30 - Riverfront Hall
IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Language Models in E-commerce
Wenxuan Ding, Weiqi Wang, Sze Heng Douglas Kwok, Minghao LIU, Tianqing Fang, Jiaxin Bai, Xin Liu, Changlong Yu, Zheng Li, Chen Luo,
Qingyu Yin, Bing Yin, Junxian He, Yangqiu Song
Enhancing Language Models’ (LMs) ability to understand purchase intentions in E-commerce scenarios is crucial for their effective assis-
tance in various downstream tasks. However, previous approaches that distill intentions from LMs often fail to generate meaningful and
human-centric intentions applicable in real-world E-commerce contexts. This raises concerns about the true comprehension and utilization of
purchase intentions by LMs. In this paper, we present IntentionQA, a double-task multiple-choice question answering benchmark to evaluate
LMs’ comprehension of purchase intentions in E-commerce. Specifically, LMs are tasked to infer intentions based on purchased products
and utilize them to predict additional purchases. IntentionQA consists of 4,360 carefully curated problems across three difficulty levels, con-
structed using an automated pipeline to ensure scalability on large E-commerce platforms. Human evaluations demonstrate the high quality
and low false-negative rate of our benchmark. Extensive experiments across 19 language models show that they still struggle with certain scenarios, such as understanding products and intentions accurately and jointly reasoning with products and intentions, in which they fall far behind human performance.
3 We will make all our prompts as well as LVLMs' responses open source for future research.
Authorship attribution aims to identify the origin or author of a document. Traditional approaches have heavily relied on manual features
and fail to capture long-range correlations, limiting their effectiveness. Recent advancements leverage text embeddings from pre-trained lan-
guage models, which require significant fine-tuning on labeled data, posing challenges in data dependency and limited interpretability. Large
Language Models (LLMs), with their deep reasoning capabilities and ability to maintain long-range textual associations, offer a promising al-
ternative. This study explores the potential of pre-trained LLMs in one-shot authorship attribution, specifically utilizing Bayesian approaches
and probability outputs of LLMs. Our methodology calculates the probability that a text entails previous writings of an author, reflecting a
more nuanced understanding of authorship. By utilizing only pre-trained models such as Llama-3-70B, our results on the IMDb and blog
datasets show an impressive 85% accuracy in one-shot authorship classification across ten authors. Our findings set new baselines for one-shot
authorship analysis using LLMs and expand the application scope of these models in forensic linguistics. This work also includes extensive
ablation studies to validate our approach.
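A hedged sketch of the Bayesian scoring idea (llm_logprob is a placeholder for whatever interface returns log P(text | context); nothing here is the authors' implementation): with a uniform prior, attribution reduces to picking the author whose prior writings make the candidate text most probable under the LLM.

    import math

    def llm_logprob(text, context):
        """Placeholder: return log P(text | context) from an actual LLM here."""
        raise NotImplementedError

    def attribute(text, author_samples, prior=None):
        """author_samples: {author: example_writing}; uniform prior by default."""
        scores = {}
        for author, sample in author_samples.items():
            log_prior = math.log(prior[author]) if prior else 0.0
            scores[author] = llm_logprob(text, context=sample) + log_prior
        return max(scores, key=scores.get)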
Nov 12 (Tue) 16:00-17:30 - Jasmine
Style-Specific Neurons for Steering LLMs in Text Style Transfer
Wen Lai, Viktor Hangya, Alexander Fraser
Text style transfer (TST) aims to modify the style of a text without altering its original meaning. Large language models (LLMs) demonstrate
superior performance across multiple tasks, including TST. However, in zero-shot setups, they tend to directly copy a significant portion of the
input text to the output without effectively changing its style. To enhance the stylistic variety and fluency of the text, we present sNeuron-TST,
a novel approach for steering LLMs using style-specific neurons in TST. Specifically, we identify neurons associated with the source and target
styles and deactivate source-style-only neurons to give target-style words a higher probability, aiming to enhance the stylistic diversity of the
generated text. However, we find that this deactivation negatively impacts the fluency of the generated text, which we address by proposing
an improved contrastive decoding method that accounts for rapid token probability shifts across layers caused by deactivated source-style
neurons. Empirical experiments demonstrate the effectiveness of the proposed method on six benchmarks, encompassing formality, toxicity,
politics, politeness, authorship, and sentiment.
Nov 12 (Tue) 16:00-17:30 - Jasmine
CasiMedicos-Arg: A Medical Question Answering Dataset Annotated with Explanatory Argumentative Structures
Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, Rodrigo Agerri
Explaining Artificial Intelligence (AI) decisions is a major challenge nowadays in AI, in particular when applied to sensitive scenarios like medicine and law. However, the need to explain the rationale behind decisions is also a main issue for human deliberation, as it is important to justify why a certain decision has been taken. Resident medical doctors, for instance, are required not only to provide a (possibly correct) diagnosis, but also to explain how they reached a certain conclusion. Developing new tools to help residents train their explanation skills is therefore a central objective of AI in education. In this paper, we follow this direction and present, to the best of our knowledge,
the first multilingual dataset for Medical Question Answering where correct and incorrect diagnoses for a clinical case are enriched with a
natural language explanation written by doctors. These explanations have been manually annotated with argument components (i.e., premise,
claim) and argument relations (i.e., attack, support). The Multilingual CasiMedicos-arg dataset consists of 558 clinical cases (English, Span-
ish, French, Italian) with explanations, where we annotated 5021 claims, 2313 premises, 2431 support relations, and 1106 attack relations.
We conclude by showing how competitive baselines perform over this challenging dataset for the argument mining task.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Flee the Flaw: Annotating the Underlying Logic of Fallacious Arguments Through Templates and Slot-filling
Irfan Robbani, Paul Reisert, Surawat Pothong, Naoya Inoue, Camélia Guerraoui, Wenzhi Wang, Shoichi Naito, Jungmin Choi, Kentaro Inui
Prior research in computational argumentation has mainly focused on scoring the quality of arguments, with less attention on explicating
logical errors. In this work, we introduce four sets of explainable templates for common informal logical fallacies designed to explicate a
fallacy's implicit logic. Using our templates, we conduct an annotation study on top of 400 fallacious arguments taken from the LOGIC dataset and achieve a high agreement score (Krippendorff's α of 0.54) and reasonable coverage (83%). Finally, we conduct an experiment on detecting the structure of fallacies and discover that state-of-the-art language models struggle to detect fallacy templates (0.47 accuracy). To
facilitate research on fallacies, we make our dataset and guidelines publicly available.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Let's discuss! Quality Dimensions and Annotated Datasets for Computational Argument Quality
Rositsa V Ivanova, Thomas Huber, Christina Niklaus
Research in the computational assessment of Argumentation Quality has gained popularity over the last ten years. Various quality dimensions
have been explored through the creation of domain-specific datasets and assessment methods. We survey the related literature (211 publi-
cations and 32 datasets), while addressing potential overlaps and blurry boundaries to related domains. This paper provides a representative
overview of the state of the art in Computational Argument Quality Assessment with a focus on quality dimensions and annotated datasets.
The aim of the survey is to identify research gaps and to aid future discussions and work in the domain.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Contrastive Classification via Linear Layer Extrapolation
Mayukh Sharma, Sean O’Brien, Julian McAuley
Certain abilities of Transformer-based language models consistently emerge in their later layers. Previous research has leveraged this phe-
nomenon to improve factual accuracy through self-contrast, penalizing early-exit predictions based on the premise that later-layer updates are
more factually reliable than earlier-layer associations. We observe a similar pattern for fine-grained emotion classification in text, demonstrat-
ing that self-contrast can enhance encoder-based text classifiers. Additionally, we reinterpret self-contrast as a form of linear extrapolation,
which motivates a refined approach that dynamically adjusts the contrastive strength based on the selected intermediate layer. Experiments
across multiple models and emotion classification datasets show that our method outperforms standard classification techniques in fine-grained
emotion classification tasks.
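A minimal numerical sketch of self-contrast read as linear extrapolation, in our own notation (alpha, the contrastive strength, and the logits are invented): the final-layer logits are pushed one step further along the layer-wise trajectory, away from an intermediate layer's logits.

    import numpy as np

    def extrapolated_logits(logits_final, logits_intermediate, alpha=0.5):
        # Linearly extrapolate the layer-wise logit trajectory one step
        # beyond the final layer: z' = z_final + alpha * (z_final - z_mid).
        return logits_final + alpha * (logits_final - logits_intermediate)

    z_mid = np.array([1.0, 0.8, 0.1])   # intermediate-layer logits
    z_fin = np.array([1.2, 0.5, 0.1])   # final-layer logits
    print(extrapolated_logits(z_fin, z_mid))  # class 0 sharpened, class 1 suppressed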
verify the model performance on MERC and MECPE tasks and achieve consistent improvements compared with the previous state-of-the-art
methods.
performance but also generalizes even to other domains, such as Aspect-based Sentiment Analysis. We make the code publicly available at https://github.com/kgarg8/Stanceformer.
Large Language Models (LLMs) face significant challenges at inference time due to their high computational demands. To address this, we
present Performance-Guided Knowledge Distillation (PGKD), a cost-effective and high-throughput solution for production text classification
applications. PGKD utilizes teacher-student Knowledge Distillation to distill the knowledge of LLMs into smaller, task-specific models.
PGKD establishes an active learning routine between the student model and the LLM; the LLM continuously generates new training data
leveraging hard-negative mining, student model validation performance, and early-stopping protocols to inform the data generation. By
employing a cyclical, performance-aware approach tailored for highly multi-class, sparsely annotated datasets prevalent in industrial text
classification, PGKD effectively addresses training challenges and outperforms traditional BERT-base models and other knowledge distilla-
tion methods on several multi-class classification datasets. Additionally, cost and latency benchmarking reveals that models fine-tuned with
PGKD are up to 130X faster and 25X less expensive than LLMs for inference on the same classification task. While PGKD is showcased
for text classification tasks, its versatile framework can be extended to any LLM distillation task, including language generation, making it a
powerful tool for optimizing performance across a wide range of AI applications.
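Schematically, the active-learning routine the abstract describes might look like the following loop (all function and attribute names are placeholders, not the paper's API):

    # Sketch of a PGKD-style cycle: train the student, validate, then ask the
    # teacher LLM for new data targeting the classes the student gets wrong.
    def pgkd_loop(student, teacher_llm, train_set, val_set, rounds=5, patience=2):
        best, stall = 0.0, 0
        for _ in range(rounds):
            student.fit(train_set)
            acc, per_class_errors = student.evaluate(val_set)
            if acc <= best:
                stall += 1
                if stall >= patience:   # early stopping on stalled validation
                    break
            else:
                best, stall = acc, 0
            # hard-negative mining: request examples for the weakest classes
            train_set += teacher_llm.generate(targets=per_class_errors)
        return student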
Nov 12 (Tue) 16:00-17:30 - Jasmine
Optimized Speculative Sampling for GPU Hardware Accelerators
Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet
In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial
portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the
workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. This results in profiling
time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate
speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results
in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a minor decline in accuracy. We conduct ex-
tensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
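As a toy illustration of the softmax-to-sigmoid approximation (a sketch of the idea, not the paper's GPU kernels): the per-token ratio used in speculative acceptance can be formed from unnormalized sigmoid scores, avoiding the reduction over the whole vocabulary that softmax normalization requires.

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()           # requires a sum over the full vocabulary

    rng = np.random.default_rng(0)
    z_draft, z_target = rng.normal(size=50), rng.normal(size=50)  # toy logits

    tok = 3  # candidate token proposed by the draft model
    ratio_softmax = softmax(z_target)[tok] / softmax(z_draft)[tok]
    sig = lambda z: 1 / (1 + np.exp(-z))
    ratio_sigmoid = sig(z_target[tok]) / sig(z_draft[tok])  # no vocab-wide reduction
    print(ratio_softmax, ratio_sigmoid)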
Nov 12 (Tue) 16:00-17:30 - Jasmine
MOSEL: Inference Serving Using Dynamic Modality Selection
Bodun Hu, Le Xu, Jeongyoon Moon, Neeraja J Yadwadkar, Aditya Akella
Rapid advancements over the years have helped machine learning models reach previously hard-to-achieve goals, sometimes even exceeding
human capabilities. However, achieving desired accuracy comes at the cost of larger model sizes and increased computational demands. Thus,
serving predictions from these models to meet any latency and cost requirements of applications remains a key challenge, despite recent work
in building inference serving systems as well as algorithmic approaches that dynamically adapt models based on inputs. Our paper introduces
a new form of dynamism, modality selection, where we adaptively choose modalities from inference inputs while maintaining the model
quality. We introduce MOSEL, an automated inference serving system for multi-modal ML models that carefully picks input modalities per
request based on user-defined performance and accuracy requirements. MOSEL exploits modality configurations extensively, improving sys-
tem throughput by 3.6 × with an accuracy guarantee. It also reduces job completion times by 11× compared to modality-agnostic approaches.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Fast Forwarding Low-Rank Training
Adir Rahamim, Naomi Saphra, Sara Kangaslahti, Yonatan Belinkov
Parameter efficient finetuning methods like low-rank adaptation (LoRA) aim to reduce the computational costs of finetuning pretrained Lan-
guage Models (LMs). Enabled by these low-rank settings, we propose an even more efficient optimization strategy: Fast Forward, a simple
and effective approach to accelerate large segments of SGD training. In a Fast Forward stage, we repeat the most recent optimizer step until
the loss stops improving on a tiny validation set. By alternating between regular optimization steps and Fast Forward stages, Fast Forward
provides up to an 87% reduction in FLOPs over standard SGD with Adam. We validate Fast Forward by finetuning various models on different
tasks and demonstrate that it speeds up training without compromising model performance. Additionally, we analyze when and how to apply
Fast Forward.
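A minimal sketch of the idea under our own simplifying assumptions (a snapshot of the parameters is taken just before the most recent optimizer step; this is an illustration, not the authors' code):

    import torch

    @torch.no_grad()
    def fast_forward(model, params_before_step, val_loss_fn, max_repeats=50):
        # delta = theta_after - theta_before: the most recent optimizer step
        deltas = [p - b for p, b in zip(model.parameters(), params_before_step)]
        best = val_loss_fn()                  # loss on a tiny validation set
        for _ in range(max_repeats):
            for p, d in zip(model.parameters(), deltas):
                p += d                        # repeat the same step
            loss = val_loss_fn()
            if loss >= best:                  # stopped improving: undo overshoot
                for p, d in zip(model.parameters(), deltas):
                    p -= d
                break
            best = loss

Between Fast Forward stages, training proceeds with ordinary optimizer steps; the snapshot can be taken as params_before_step = [p.detach().clone() for p in model.parameters()] right before optimizer.step().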
Nov 12 (Tue) 16:00-17:30 - Jasmine
Breaking ReLU Barrier: Generalized MoEfication for Dense Pretrained Models
Jaeseong Lee, seung-won hwang, Wonpyo Park, Mingi Ji
As the scale of language models (LMs) continues to grow, there is heightened interest in reducing the inference cost associated with these models. Mixture-of-Experts (MoEs) present an efficient alternative to dense models, but existing methods for converting pretrained dense models to MoEs are limited to ReLU-based models with natural sparsity. This paper introduces G-MoEfication, applicable to arbitrary dense models, where ReLU-based activation sparsity assumptions no longer hold. In generalizing, we encounter the dilemma of needing to
zero-out deactivated experts, while also avoiding excessive zeroing-out to retain dense activation information. We publicly release our code
and report results conducted with mBERT, SantaCoder-1.1B, Phi-2-2.7B, and Falcon-7B demonstrating the efficacy of our approach in gen-
eral scenarios: from multitask to multilingual, from fine-tuning to zero-shot evaluation.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun
Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in multilingual settings. To mitigate this challenge, this paper explores a training recipe for an assistant model in speculative decoding, which drafts future tokens that are then verified by the target LLM. We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, substantially speed up inference compared to previous methods. We validate these models across various languages in terms of inference time, out-of-domain speedup, and GPT-4o evaluation.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Markus Frohmann, Igor Sterner, Ivan Vulić, Benjamin Minixhofer, Markus Schedl
Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or
statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation,
we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high
efficiency. We introduce a new model Segment any Text (SaT) to solve this problem. To enhance robustness, we propose a new pretraining
scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning,
establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce
architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far
in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data,
acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach
174
Posters and Demos
for segmenting any text. Our method outperforms all baselines including strong LLMs across 8 corpora spanning diverse domains and
languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are
readily available at https://github.com/segment-any-text/wtpsplit under the MIT license.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Heterogeneous LoRA for Federated Fine-tuning of On-Device Foundation Models
Yae Jee Cho, Luyang Liu, Zheng Xu, Aldi Fahrezi, Gauri Joshi
Foundation models (FMs) adapt surprisingly well to downstream tasks with fine-tuning. However, their colossal parameter space prohibits
their training on resource-constrained edge-devices. For federated fine-tuning, we need to consider the smaller FMs of few billion parame-
ters at most, namely on-device FMs (ODFMs), which can be deployed on-device. Federated fine-tuning of ODFMs has unique challenges not present in standard fine-tuning: i) ODFMs generalize poorly to downstream tasks due to their limited size, making proper fine-tuning imperative to their performance, and ii) devices have limited and heterogeneous system capabilities and data that can hamper fine-tuning. Tackling these challenges, we propose HetLoRA, a feasible and effective federated fine-tuning method for ODFMs that lever-
ages the system and data heterogeneity at the edge. HetLoRA allows heterogeneous LoRA ranks across clients for their individual system
resources, and efficiently aggregates and distributes these LoRA modules in a data-aware manner by applying rank self-pruning locally and
sparsity-weighted aggregation at the server. It combines the advantages of high and low-rank LoRAs, achieving improved convergence speed
and final performance compared to homogeneous LoRA. Furthermore, HetLoRA has enhanced computation and communication efficiency
compared to full fine-tuning making it more feasible for the edge.
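To make rank-heterogeneous aggregation concrete, here is an illustrative server-side sketch (shapes and the per-client weights are our assumptions; the paper derives its weights from sparsity): each client's LoRA factors are zero-padded up to the maximum rank before a weighted average.

    import numpy as np

    def aggregate_lora(clients, weights, d_in=16, d_out=16, r_max=8):
        """clients: list of (A, B) with A: (r_i, d_in), B: (d_out, r_i);
        weights: per-client scalars standing in for sparsity-based weights."""
        A_agg = np.zeros((r_max, d_in))
        B_agg = np.zeros((d_out, r_max))
        for (A, B), w in zip(clients, weights):
            r = A.shape[0]
            A_agg[:r] += w * A          # zero-pad smaller ranks up to r_max
            B_agg[:, :r] += w * B
        total = sum(weights)
        return A_agg / total, B_agg / total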
Nov 12 (Tue) 16:00-17:30 - Jasmine
Chain and Causal Attention for Efficient Entity Tracking
Erwan Fagnou, Paul Caillon, Blaise Delattre, Alexandre Allauzen
This paper investigates the limitations of transformers for entity-tracking tasks in large language models. We identify a theoretical constraint,
showing that transformers require at least log2(n + 1) layers to handle entity tracking with n state changes. To address this issue, we propose an efficient and frugal enhancement to the standard attention mechanism, enabling it to manage long-term dependencies more efficiently. By considering attention as an adjacency matrix, our model can track entity states with a single layer. Empirical results demonstrate
significant improvements in entity tracking datasets while keeping competitive performance on standard natural language modeling. Our
modified attention allows us to achieve the same performance with drastically fewer layers. Additionally, our enhanced mechanism reveals
structured internal representations of attention. Extensive experiments on both toy and complex datasets validate our approach. Our contribu-
tions include theoretical insights, an improved attention mechanism, and empirical validation.
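The depth bound has a simple intuition once attention is read as an adjacency matrix over positions: each layer can at best compose the reachability relation with itself, doubling the chain length it covers, hence roughly log2(n + 1) layers for n state changes. A toy check (our illustration, not the paper's model):

    import numpy as np

    n = 7                                         # chain of 7 state changes
    A = np.eye(n + 1, dtype=int)                  # self-loops: states can persist
    A[np.arange(n), np.arange(1, n + 1)] = 1      # edge i -> i+1 (one state change)

    reach, layers = A.copy(), 0
    while not reach[0, n]:                        # until position 0 reaches position n
        reach = (reach @ reach > 0).astype(int)   # one more layer squares reachability
        layers += 1
    print(layers)                                 # 3 == log2(7 + 1)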
Nov 12 (Tue) 16:00-17:30 - Jasmine
Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Weilin Zhao, Yuxiang Huang, Xu Han, Wang Xu, Chaojun Xiao, Xinrong Zhang, Yewei Fang, Kaihuo Zhang, Zhiyuan Liu, Maosong Sun
Speculative decoding is a widely used method that accelerates the generation process of large language models (LLMs) with no compromise in
model performance. It achieves this goal by using an existing smaller model for drafting and then employing the target LLM to verify the draft
in a low-cost parallel manner. Under such a drafting-verification framework, drafting efficiency has become a bottleneck in the final speedup
of speculative decoding. Therefore, generating longer drafts at less cost can lead to better decoding speedup. To achieve this, we introduce
Ouroboros, which can generate draft phrases to parallelize the drafting process and meanwhile lengthen drafts in a training-free manner. The
experimental results on various typical text generation tasks show that Ouroboros can achieve speedups of up to 2.4× over speculative de-
coding and 3.9× over vanilla decoding, without fine-tuning draft and target models. Code available at https://github.com/thunlp/Ouroboros.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Fewer is More: Boosting Math Reasoning with Reinforced Context Pruning
Xijie Huang, Li Lyna Zhang, Kwang-Ting Cheng, Fan Yang, Mao Yang
Large Language Models (LLMs) have shown impressive capabilities, yet they still struggle with math reasoning. In this work, we propose
CoT-Influx, a novel approach that pushes the boundary of few-shot Chain-of-Thoughts (CoT) learning to improve LLM mathematical rea-
soning. Motivated by the observation that adding more concise CoT examples in the prompt can improve LLM reasoning performance,
CoT-Influx employs a coarse-to-fine pruner to maximize the input of effective and concise CoT examples. The pruner first selects as many
crucial CoT examples as possible and then prunes unimportant tokens to fit the context window. As a result, by enabling more CoT ex-
amples with double the context window size in tokens, CoT-Influx significantly outperforms various prompting baselines across various
LLMs (LLaMA2-7B, 13B, 70B) and 5 math datasets, achieving up to 4.55% absolute improvements. Remarkably, without any fine-tuning,
LLaMA2-70B with CoT-Influx surpasses GPT-3.5 and a wide range of larger LLMs (PaLM, Minerva 540B, etc.) on the GSM8K. CoT-Influx
is a plug-and-play module for LLMs, adaptable in various scenarios. It’s compatible with advanced reasoning prompting techniques, such as
self-consistency, and supports different long-context LLMs, including Mistral-7B-v0.3-32K and Yi-6B-200K.
5 Code available at https://github.com/C-W-D/CasCoD.
through novel importance metrics, effectively maintaining critical data even without access to future context. Our comprehensive evaluations
indicate that InfiniPot significantly outperforms models trained for long contexts in various NLP tasks, establishing its efficacy and versatility.
This work represents a substantial advancement toward making LLMs applicable to a broader range of real-world scenarios.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Exploring Space Efficiency in a Tree-based Linear Model for Extreme Multi-label Classification
He-Zhe Lin, Cheng-Hung Liu, Chih-Jen Lin
Extreme multi-label classification (XMC) aims to identify relevant subsets from numerous labels. Among the various approaches for XMC,
tree-based linear models are effective due to their superior efficiency and simplicity. However, the space complexity of tree-based methods
is not well-studied. Many past works assume that storing the model is not affordable and apply techniques such as pruning to save space,
which may lead to performance loss. In this work, we conduct both theoretical and empirical analyses of the space needed to store a tree model under the assumption of sparse data, a condition frequently met in text data. We find that some features may be unused when training the binary classifiers in a tree method, resulting in zero values in the weight vectors. Hence, storing only non-zero elements can greatly save space. Our
experimental results indicate that tree models can require less than 10% of the size of the standard one-vs-rest method for multi-label text
classification. Our research provides a simple procedure to estimate the size of a tree model before training any classifier in the tree nodes.
Then, if the model size is already acceptable, this approach can help avoid modifying the model through weight pruning or other techniques.
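A synthetic illustration of the storage estimate (data, sparsity level, and shapes are invented; this is not the paper's procedure): counting only non-zero weights shows how much smaller a tree model can be than dense one-vs-rest storage.

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, n_labels = 10_000, 500

    # dense one-vs-rest: one full weight vector per label
    dense_floats = n_features * n_labels

    # tree nodes see subsets of the data, so many features are never active
    # and their learned weights stay exactly zero (5% density assumed here)
    node_weights = [rng.normal(size=n_features) * (rng.random(n_features) < 0.05)
                    for _ in range(n_labels)]
    sparse_floats = sum(int(np.count_nonzero(w)) for w in node_weights)
    print(f"sparse/dense size ratio: {sparse_floats / dense_floats:.1%}")  # ~5%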
Nov 12 (Tue) 16:00-17:30 - Jasmine
FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping
AJAY KUMAR JAISWAL, Bodun Hu, Lu Yin, Yeonju Ro, Tianlong Chen, Shiwei Liu, Aditya Akella
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and
generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges for au-
toregressive token-by-token generation. To mitigate the computational overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed. Despite some promising success, owing to the redundancy across LLM layers, on metrics like ROUGE-L/BLEU, our careful knowledge-intensive evaluation unveils issues such as generation collapse, hallucination, and noticeable performance drops even at the trivial exit ratio of 10-15% of layers. We attribute these errors primarily to ineffective handling of the KV cache through state copying during early exit. In this work, we observe the saturation of the computationally expensive feed-forward blocks of LLM layers and propose FFN-SkipLLM, a novel fine-grained skip strategy for autoregressive LLMs. FFN-SkipLLM leverages an input-adaptive feed-forward
skipping approach that can skip 25-30% of FFN blocks of LLMs with marginal change in performance on knowledge-intensive generation
tasks without any requirement to handle the KV cache. Our extensive experiments and ablation studies across benchmarks like MT-Bench,
Factoid-QA, and variable-length text summarization illustrate how our simple and easy-to-use method can facilitate faster autoregressive
decoding.
Nov 12 (Tue) 16:00-17:30 - Jasmine
LLoCO: Learning Long Contexts Offline
Sijun Tan, Xiuyu Li, Shishir G Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E. Gonzalez, Raluca Popa
Processing long contexts remains a challenge for large language models (LLMs) due to the quadratic computational and memory overhead
of the self-attention mechanism and the substantial KV cache sizes during generation. We propose LLoCO, a novel approach to address
this problem by learning contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA. Our method
enables an LLM to create a concise representation of the original context and efficiently retrieve relevant information to answer questions
accurately. Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens. We evaluate
our approach on several long-context question-answering datasets, demonstrating that LLoCO significantly outperforms in-context learning
while using 30× fewer tokens during inference. LLoCO achieves up to 7.62× speed-up during inference and 11.52× higher throughput during finetuning, substantially reducing the cost of long-document question answering. This makes it a promising solution for efficient long
context processing.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Mentor-KD: Making Small Language Models Better Multi-step Reasoners
Hojae Lee, Junho Kim, SangKeun Lee
Large Language Models (LLMs) have displayed remarkable performances across various complex tasks by leveraging Chain-of-Thought
(CoT) prompting. Recently, studies have proposed a Knowledge Distillation (KD) approach, reasoning distillation, which transfers the reasoning ability of LLMs by fine-tuning smaller language models on multi-step rationales generated by LLM teachers. However, these studies have inadequately considered two challenges arising from insufficient distillation sets from the LLM teacher model, in terms of 1) data quality and 2) soft label provision. In this paper, we propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs while addressing the aforementioned challenges. Specifically, we exploit a mentor, an intermediate-sized task-specific fine-tuned model,
to augment additional CoT annotations and provide soft labels for the student model during reasoning distillation. We conduct extensive
experiments and confirm Mentor-KD’s effectiveness across various models and complex reasoning tasks.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Is C4 Dataset Enough for Pruning? An Investigation of Calibration Data for LLM Pruning
Abhinav Bandari, Lu Yin, Cheng-Yu Hsieh, AJAY KUMAR JAISWAL, Tianlong Chen, Li Shen, Ranjay Krishna, Shiwei Liu
Network pruning has emerged as a potential solution to make LLMs cheaper to deploy. However, existing LLM pruning approaches universally rely on the C4 dataset as the calibration data for calculating pruning scores, leaving its optimality unexplored. In this study, we evaluate the choice of calibration data for LLM pruning across a wide range of datasets that are most commonly used in LLM training and evaluation, including four pre-training datasets as well as three categories of downstream tasks encompassing nine datasets. Each downstream dataset is prompted with In-Context Learning (ICL) and Chain-of-Thought (CoT), respectively. Besides the already intriguing observation that the choice of calibration data significantly impacts the performance of pruned LLMs, our results also uncover several subtle and often unexpected findings, summarized as follows: (1) C4 is not the optimal choice for LLM pruning, even among commonly used pre-training datasets; (2) arithmetic datasets, when used as calibration data, perform on par with or even better than pre-training datasets; (3) pruning with downstream datasets does not necessarily help the corresponding downstream task, compared to pre-training data; (4) ICL is widely beneficial to all data categories, whereas CoT is only useful on certain tasks. Our findings shed light on the importance of carefully selecting calibration data for LLM pruning and pave the way for more efficient deployment of these powerful models in real-world applications. We release our code at: https://github.com/abx393/llm-pruning-calibration-data.
Nov 12 (Tue) 16:00-17:30 - Jasmine
PALM: Few-Shot Prompt Learning for Audio Language Models
Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki
Audio-Language Models (ALMs) have recently achieved remarkable success in zero-shot audio recognition tasks, matching features of audio waveforms with class-specific text prompt features, inspired by advancements in Vision-Language Models (VLMs). Given the sen-
sitivity of zero-shot performance to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for
VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Mod-
els (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that work in the input space, our
approach results in greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, en-
compassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method
is either on par with or outperforms other approaches while being computationally less demanding. Our code is publicly available at
https://asif-hanif.github.io/palm/.
Nov 12 (Tue) 16:00-17:30 - Jasmine
SpecHub: Provable Acceleration to Multi-Draft Speculative Decoding
Hanchi Sun, Tianyi Zhou, Xun Chen, Lichao Sun
Large Language Models (LLMs) have become essential in advancing natural language processing (NLP) tasks, but their sequential token
generation limits inference speed. Multi-Draft Speculative Decoding (MDSD) offers a promising solution by using a smaller draft model
to generate multiple token sequences, which the target LLM verifies in parallel. However, current heuristic approaches, such as Recursive Rejection Sampling (RRS), suffer from low acceptance rates in subsequent drafts, limiting the advantages of using multiple drafts. Meanwhile, Optimal Transport with Membership Cost (OTM) can theoretically improve acceptance rates, but its computational cost is too high for real-time use. We present SpecHub, a novel, efficient sampling-verification method for MDSD that improves acceptance rates with only linear computational overhead. By simplifying the OTM problem into a compact Linear Programming model, SpecHub significantly reduces computational complexity. It further accelerates sampling by leveraging a sparse joint distribution, focusing computation on high-probability token sequences, and it integrates seamlessly into existing MDSD frameworks. In extensive experiments, SpecHub consistently generates 0.05-0.27 and 0.02-0.16 more tokens per step than RRS and RRS without replacement, respectively. We attach our code at https://github.com/MasterGodzilla/Speculative_decoding_OT.
Nov 12 (Tue) 16:00-17:30 - Jasmine
ApiQ: Finetuning of 2-Bit Quantized Large Language Model
Baohao Liao, Christian Herold, Shahram Khadivi, Christof Monz
Memory-efficient finetuning of large language models (LLMs) has recently attracted huge attention with the increasing size of LLMs, primar-
ily due to the constraints posed by GPU memory limitations and the effectiveness of these methods compared to full finetuning. Despite the
advancements, current strategies for memory-efficient finetuning, such as QLoRA, exhibit inconsistent performance across diverse bit-width
quantizations and multifaceted tasks. This inconsistency largely stems from the detrimental impact of the quantization process on preserved
knowledge, leading to catastrophic forgetting and undermining the utilization of pretrained models for finetuning purposes. In this work, we
introduce a novel quantization framework named ApiQ, designed to restore the lost information from quantization by concurrently initial-
izing the LoRA components and quantizing the weights of LLMs. This approach ensures the maintenance of the original LLM’s activation
precision while mitigating the error propagation from shallower into deeper layers. Through comprehensive evaluations conducted on a spec-
trum of language tasks with various LLMs, ApiQ demonstrably minimizes activation error during quantization. Consequently, it consistently
achieves superior finetuning results across various bit-widths. Notably, one can even finetune a 2-bit Llama-2-70b with ApiQ on a single
NVIDIA A100-80GB GPU without any memory-saving techniques, and achieve promising results.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Memory-Efficient Fine-Tuning of Transformers via Token Selection
Antoine Simoulin, Namyong Park, Xiaoyi Liu, Grey Yang
Fine-tuning provides an effective means to specialize pre-trained models for various downstream tasks. However, fine-tuning often incurs
high memory overhead, especially for large transformer-based models, such as LLMs. While existing methods may reduce certain parts of the
memory required for fine-tuning, they still require caching all intermediate activations computed in the forward pass to update weights during
the backward pass. In this work, we develop TokenTune, a method to reduce memory usage, specifically the memory to store intermediate
activations, in the fine-tuning of transformer-based models. During the backward pass, TokenTune approximates the gradient computation by
backpropagating through just a subset of input tokens. Thus, with TokenTune, only a subset of intermediate activations are cached during the
forward pass. Also, TokenTune can be easily combined with existing methods like LoRA, further reducing the memory cost. We evaluate
our approach on pre-trained transformer models with up to billions of parameters, considering the performance on multiple downstream tasks
such as text classification and question answering in a few-shot learning setup. Overall, TokenTune achieves performance on par with full
fine-tuning or representative memory-efficient fine-tuning methods, while greatly reducing the memory footprint, especially when combined
with other methods with complementary memory reduction mechanisms. We hope that our approach will facilitate the fine-tuning of large
transformers, in specializing them for specific domains or co-training them with other neural components from a larger system. Our code is
available at https://github.com/facebookresearch/tokentune.
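A minimal sketch of the token-selection idea, simplified and with hypothetical names (not the TokenTune implementation): detaching the hidden states of unselected positions means only the selected subset needs cached activations for the backward pass.

import torch

def select_tokens_for_backprop(hidden, k):
    # hidden: (batch, seq, dim); keep gradients for k random positions
    batch, seq, _ = hidden.shape
    idx = torch.randperm(seq)[:k]
    mask = torch.zeros(seq, dtype=torch.bool)
    mask[idx] = True
    # detached positions carry no gradient and need no cached activations upstream
    return torch.where(mask.view(1, seq, 1), hidden, hidden.detach())

hidden = torch.randn(2, 128, 64, requires_grad=True)
pruned = select_tokens_for_backprop(hidden, k=16)
pruned.sum().backward()
print("positions with nonzero grad:", (hidden.grad.abs().sum((0, 2)) > 0).sum().item())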
Nov 12 (Tue) 16:00-17:30 - Jasmine
Sprout: Green Generative AI with Carbon-Efficient LLM Inference
Baolin Li, Yankai Jiang, Vijay Gadepally, Devesh Tiwari
The rapid advancement of generative AI has heightened environmental concerns, particularly regarding carbon emissions. Our framework,
Sprout, addresses these challenges by reducing the carbon footprint of inference in large language models (LLMs). Sprout introduces "gen-
eration directives" to guide the autoregressive generation process, achieving a balance between ecological sustainability and high-quality
outputs. By employing a strategic optimizer for directive assignment and a novel offline quality evaluator, Sprout reduces the carbon footprint
of generative LLM inference by over 40% in real-world evaluations, using the Llama model and global electricity grid data. This work is
crucial as the rising interest in inference time compute scaling laws amplifies environmental concerns, emphasizing the need for eco-friendly
AI solutions.
Nov 12 (Tue) 16:00-17:30 - Jasmine
DisGeM: Distractor Generation for Multiple Choice Questions with Span Masking
Devrim Çavuşoğlu, Seçil Şen, Ulaş Sert
Recent advancements in Natural Language Processing (NLP) have impacted numerous sub-fields such as natural language generation, natural
language inference, question answering, and more. However, in the field of question generation, the creation of distractors for multiple-choice
questions (MCQ) remains a challenging task. In this work, we present a simple, generic framework for distractor generation using readily
available Pre-trained Language Models (PLMs). Unlike previous methods, our framework relies solely on pre-trained language models and
does not require additional training on specific datasets. Building upon previous research, we introduce a two-stage framework consisting
of candidate generation and candidate selection. Our proposed distractor generation framework outperforms previous methods without the
need for training or fine-tuning. Human evaluations confirm that our approach produces more effective and engaging distractors. The related
codebase is publicly available at https://github.com/obss/disgem.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Summarization
Nov 12 (Tue) 16:00-17:30 - Room: Jasmine
ences. However, it cannot test if a model utilizes all three types of cues provided in ICPL prompts: (i) example summaries, (ii) user’s reading
histories, and (iii) contrast in user profiles. To address this, we propose the iCOPERNICUS framework, a novel In-Context Personalization
Learning Scrutiny of Summarization capability in LLMs that uses EGISES as a comparative measure. As a case-study, we evaluate 17
state-of-the-art LLMs based on their reported ICL performances and observe that 15 models’ ICPL degrades (min: 1.6%↓; max: 3.6%↓)
when probed with richer prompts, thereby showing a lack of true ICPL.
Nov 12 (Tue) 16:00-17:30 - Jasmine
Model-based Preference Optimization in Abstractive Summarization without Human Feedback
Jaepill choi, Kyubyung Chae, Jiwoo Song, Yohan Jo, Taesup Kim
In abstractive summarization, the challenge of producing concise and accurate summaries arises from the vast amount of information contained
in the source document. Consequently, although Large Language Models (LLMs) can generate fluent text, they often introduce inaccuracies
by hallucinating content not found in the original source. While supervised fine-tuning methods that maximize likelihood contribute to this
issue, they do not consistently enhance the faithfulness of the summaries. Preference-based optimization methods, such as Direct Preference
Optimization (DPO), can further refine the model to align with human preferences. However, these methods still heavily depend on costly
human feedback. In this work, we introduce a novel and straightforward approach called Model-based Preference Optimization (MPO) to
fine-tune LLMs for improved summarization abilities without any human feedback. By leveraging the model’s inherent summarization capa-
bilities, we create a preference dataset that is fully generated by the model using different decoding strategies. Our experiments on standard
summarization datasets and various metrics demonstrate that our proposed MPO significantly enhances the quality of generated summaries
without relying on human feedback. The code is publicly available at https://github.com/cjaep/MPO.
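A minimal sketch of building a model-generated preference dataset from two decoding strategies (here, beam search as "chosen" and high-temperature sampling as "rejected"); this pairing rule is our simplification for illustration, not the paper's exact construction:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

def preference_pair(document):
    inputs = tok(document, return_tensors="pt", truncation=True)
    # two decoding strategies from the same model yield a preference pair
    chosen = model.generate(**inputs, num_beams=4, max_new_tokens=64)
    rejected = model.generate(**inputs, do_sample=True, temperature=1.5,
                              max_new_tokens=64)
    return {"prompt": document,
            "chosen": tok.decode(chosen[0], skip_special_tokens=True),
            "rejected": tok.decode(rejected[0], skip_special_tokens=True)}

pair = preference_pair("Large language models often hallucinate content ...")
print(pair["chosen"])  # pairs like this can feed a DPO-style trainer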
Yuho Lee, Taewon Yun, Jason Cai, Hang Su, Hwanjun Song
Existing benchmarks for summarization quality evaluation often lack diverse input scenarios, focus on narrowly defined dimensions (e.g., faithfulness), and struggle with subjective and coarse-grained annotation schemes. To address these shortcomings, we create the UniSumEval benchmark, which extends the range of input context (e.g., domain, length) and provides fine-grained, multi-dimensional annotations. We use AI assistance in data creation, identifying potentially hallucinogenic input texts and helping human annotators reduce the difficulty of fine-grained annotation tasks. With UniSumEval, we benchmark nine of the latest language models as summarizers, offering insights into their
performance across varying input contexts and evaluation dimensions. Furthermore, we conduct a thorough comparison of SOTA automated
summary evaluators. Our benchmark data will be available at https://github.com/DISL-Lab/UniSumEval-v1.0.
Demo 4
Nov 13 (Wed) 10:30-12:00 - Room: Riverfront Hall
Human-centered NLP 2
Nov 13 (Wed) 10:30-12:00 - Room: Riverfront Hall
agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al. 2023), in which 350 con-
versations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of r = 0.59
with the average annotator rating, higher than the median annotator’s correlation with the average (r = 0.51). We show that larger datasets
are needed to resolve whether GPT-4 exhibits disparities in how well it correlates with different demographic groups. Also, there is substantial
idiosyncratic variation in correlation within groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we
find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Into the Unknown Unknowns: Engaged Human Learning through Participation in Language Model Agent Conversations
Yucheng Jiang, Yijia Shao, Dekun Ma, Sina Semnani, Monica Lam
While language model (LM)-powered chatbots and generative search engines excel at answering concrete queries, discovering information in
the terrain of unknown unknowns remains challenging for users. To emulate the common educational scenario where children/students learn
by listening to and participating in conversations of their parents/teachers, we create Collaborative STORM (Co-STORM). Unlike QA sys-
tems that require users to ask all the questions, Co-STORM lets users observe and occasionally steer the discourse among several LM agents.
The agents ask questions on the user’s behalf, allowing the user to discover unknown unknowns serendipitously. To facilitate user interaction,
Co-STORM assists users in tracking the discourse by organizing the uncovered information into a dynamic mind map, ultimately generating
a comprehensive report as takeaways. For automatic evaluation, we construct the WildSeek dataset by collecting real information-seeking
records with user goals. Co-STORM outperforms baseline methods on both discourse trace and report quality. In a further human evaluation,
70% of participants prefer Co-STORM over a search engine, and 78% favor it over a RAG chatbot.
Beiduo Chen, Xinpeng Wang, Siyao Peng, Robert Litschko, Anna Korhonen, Barbara Plank
Human label variation (HLV) is a valuable source of information that arises when multiple human annotators provide different labels for valid
reasons. In Natural Language Inference (NLI), earlier approaches to capturing HLV involve either collecting annotations from many crowd workers to represent the human judgment distribution (HJD) or using expert linguists to provide detailed explanations for their chosen labels. While the former method provides denser HJD information, obtaining it is resource-intensive. In contrast, the latter offers richer textual information but is challenging to scale up to many human judges. Meanwhile, large language models (LLMs) are increasingly used as evaluators ("LLM judges"), but with mixed results, and few works aim to study HJDs. This study proposes to exploit LLMs to approximate HJDs using a small
number of expert labels and explanations. Our experiments show that a few explanations significantly improve LLMs’ ability to approximate
HJDs with and without explicit labels, thereby providing a solution to scale up annotations for HJD. However, fine-tuning smaller soft-label
aware models with the LLM-generated model judgment distributions (MJDs) presents partially inconsistent results: while similar in distance,
their resulting fine-tuned models and visualized distributions differ substantially. We show the importance of complementing instance-level
distance measures with a global-level shape metric and visualization to more effectively evaluate MJDs against human judgment distributions.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Minimal Yet Big Impact: How AI Agent Back-channeling Enhances Conversational Engagement through Conversation Persistence
and Context Richness
Jin Yea Jang, Saim Shin, Gahgene Gweon
The increasing use of AI agents in conversational services, such as counseling, highlights the importance of back-channeling (BC) as an
active listening strategy to enhance conversational engagement. BC improves conversational engagement by providing timely acknowledg-
ments and encouraging the speaker to talk. This study investigates the effect of BC provided by an AI agent on conversational engagement,
offering insights for future AI conversational service design. We conducted an experiment with 55 participants, divided into Todak_BC and
Todak_NoBC groups based on the presence or absence of the BC feature in Todak, a conversational agent. Each participant engaged in
nine sessions with predetermined subjects and questions. We collected and analyzed approximately 6 hours and 30 minutes of conversation
logs to evaluate conversational engagement using both quantitative (conversation persistence, including conversation duration and number of
utterances) and qualitative metrics (context richness, including self-disclosure and topic diversity). The findings reveal significantly higher
conversational engagement in the Todak_BC group compared to the Todak_NoBC group across all metrics (p<0.05). Additionally, the impact
of BC varies across sessions, suggesting that conversation characteristics such as question type and topic sensitivity can influence BC effec-
tiveness.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Two Tales of Persona in LLMs: A Survey of Role-Playing and Personalization
Yu-Min Tseng, Yu-Chao Huang, Teng-Yun Hsiao, Wei-Lin Chen, Chao-Wei Huang, Yu Meng, Yun-Nung Chen
The concept of persona, originally adopted in the dialogue literature, has re-surged as a promising framework for tailoring large language models (LLMs) to specific contexts (e.g., personalized search, LLM-as-a-judge). However, the growing research on leveraging persona in LLMs is relatively disorganized and lacks a systematic taxonomy. To close the gap, we present a comprehensive survey to categorize the current state of the field. We identify two lines of research, namely (1) LLM Role-Playing, where personas are assigned to LLMs, and (2) LLM Personalization, where LLMs take care of user personas. Additionally, we introduce existing methods for LLM personality evaluation. To
the best of our knowledge, we present the first survey for role-playing and personalization in LLMs under the unified view of persona. We
continuously maintain a paper collection to foster future endeavors.
this phenomenon, we analyze the value-output vectors in these heads and discover that the vectors at each label position contain substantial
information about the corresponding labels. Furthermore, we observe that the prediction shift from "foo" to "bar" is due to the respective
reduction and increase in these heads’ attention scores at "foo" and "bar" positions. Therefore, we propose a hypothesis for ICL: in in-context
heads, the value-output matrices extract label features, while the query-key matrices compute the similarity between the features at the last
position and those at each label position. The query and key matrices can be considered as two towers that learn the similarity metric between
the last position’s features and each demonstration at label positions. Using this hypothesis, we explain the majority label bias and recency
bias in ICL and propose two methods to reduce these biases by 22% and 17%, respectively.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Contextual and Parametric Knowledge: More Context, More Focus
Yufei Tao, Adam Hiatt, Erik Haake, Antonie J. Jetter, Ameeta Agrawal
Large language models (LLMs) have demonstrated remarkable progress in leveraging diverse knowledge sources. This study investigates
how nine widely used LLMs allocate knowledge between local context and global parameters when answering open-ended questions in
knowledge-consistent scenarios. We introduce a novel dataset, WikiAtomic, and systematically vary context sizes to analyze how LLMs prioritize and utilize the provided information alongside their parametric knowledge. We also study their tendency to hallucinate under varying context sizes. Our findings reveal consistent patterns across models, including a reliance on both contextual (around 70%) and parametric (around 30%) knowledge, and a decrease in hallucinations with increasing context.
These insights highlight the importance of more effective context organization and developing models that use input more deterministically
for robust performance.
abilities of language models in the same context. In this paper, we investigate how language models map linguistic expressions of uncertainty
to numerical responses. Our approach assesses whether language models can employ theory of mind in this setting: understanding the uncer-
tainty of another agent about a particular statement, independently of the model’s own certainty about that statement. We find that 7 out of
10 models are able to map uncertainty expressions to probabilistic responses in a human-like manner. However, we observe systematically
different behavior depending on whether a statement is actually true or false. This sensitivity indicates that language models are substantially
more susceptible to bias based on their prior knowledge (as compared to humans). These findings raise important questions and have broad
implications for human-AI and AI-AI communication.
Nov 13 (Wed) 10:30-12:00 - Jasmine
The effects of distance on NPI illusive effects in BERT
So Young Lee, Mai Ha Vu
Previous studies have examined the syntactic capabilities of large pre-trained language models, such as BERT, by using stimuli from psy-
cholinguistic studies. Studying well-known processing errors, such as NPI illusive effects, can reveal whether a model prioritizes linear or
hierarchical information when processing language. Recent experiments have found that BERT is mildly susceptible to Negative Polarity
Item (NPI) illusion effects (Shin et al., 2023; Vu and Lee, 2022). We expand on these results by examining the effect of distance on the
illusive effect, using and modifying stimuli from Parker and Phillips (2016). We also further tease apart whether the model is more affected
by hierarchical distance or linear distance. We find that BERT is highly sensitive to syntactic hierarchical information: added hierarchical
layers affected its processing capabilities compared to added linear distance.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Unlocking Memorization in Large Language Models with Dynamic Soft Prompting
Zhepeng Wang, Runxue Bao, Yawen Wu, Jackson Taylor, Cao Xiao, Feng Zheng, Weiwen Jiang, Shangqian Gao, Yanfu Zhang
Pretrained large language models (LLMs) have excelled in a variety of natural language processing (NLP) tasks, including summarization,
question answering, and translation. However, LLMs pose significant security risks due to their tendency to memorize training data, lead-
ing to potential privacy breaches and copyright infringement. Therefore, accurate measurement of the memorization is essential to evaluate
and mitigate these potential risks. However, previous attempts to characterize memorization are constrained by either using prefixes only or
by prepending a constant soft prompt to the prefixes, which cannot react to changes in input. To address this challenge, we propose a novel
method for estimating LLM memorization using dynamic, prefix-dependent soft prompts. Our approach involves training a transformer-based
generator to produce soft prompts that adapt to changes in input, thereby enabling more accurate extraction of memorized data. Our method
not only addresses the limitations of previous methods but also demonstrates superior performance in diverse experimental settings compared
to state-of-the-art techniques. In particular, our method achieves maximum relative improvements of 135.3% and 39.8% over the vanilla baseline, on average, in terms of discoverable memorization rate for the text generation task and the code generation task, respectively. Our code
is available at https://github.com/wangger/llm-memorization-dsp.
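A minimal sketch of the prefix-dependent soft-prompt idea, with a hypothetical generator architecture of our own choosing (not the authors' network): a small trainable module maps the prefix embeddings to k soft-prompt vectors, which are prepended before the frozen LM.

import torch
import torch.nn as nn

class SoftPromptGenerator(nn.Module):
    def __init__(self, dim, k=8):
        super().__init__()
        self.k = k
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                                  batch_first=True)
        self.heads = nn.Linear(dim, k * dim)

    def forward(self, prefix_emb):                    # (batch, seq, dim)
        pooled = self.encoder(prefix_emb).mean(dim=1) # (batch, dim)
        return self.heads(pooled).view(-1, self.k, prefix_emb.size(-1))

gen = SoftPromptGenerator(dim=768)
prefix = torch.randn(2, 32, 768)               # embeddings of the prefix
soft = gen(prefix)                             # (2, 8, 768), input-dependent
lm_input = torch.cat([soft, prefix], dim=1)    # prepend, then run the frozen LM
print(lm_input.shape)

Because the soft prompts are a function of the prefix, they can react to changes in input, which is the property a constant soft prompt lacks.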
Nov 13 (Wed) 10:30-12:00 - Jasmine
LLMs Are Prone to Fallacies in Causal Inference
Nitish Joshi, Abulhair Saparov, Yixin Wang, He He
Recent work shows that causal facts can be effectively extracted from LLMs through prompting, facilitating the creation of causal graphs for
causal inference tasks. However, it is unclear if this success is limited to explicitly-mentioned causal facts in the pretraining data which the
model can memorize. Thus, this work investigates: Can LLMs infer causal relations from other relational data in text? To disentangle the role
of memorized causal facts vs inferred causal relations, we finetune LLMs on synthetic data containing temporal, spatial and counterfactual
relations, and measure whether the LLM can then infer causal relations. We find that: (a) LLMs are susceptible to inferring causal relations
from the order of two entity mentions in text (e.g. X mentioned before Y implies X causes Y); (b) if the order is randomized, LLMs still
suffer from the post hoc fallacy, i.e. X occurs before Y (temporal relation) implies X causes Y. We also find that while LLMs can correctly
deduce the absence of causal relations from temporal and spatial relations, they have difficulty inferring causal relations from counterfactuals,
questioning their understanding of causality.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Paraphrase Types Elicit Prompt Engineering Capabilities
Jan Philip Wahle, Terry Ruas, Yang Xu, Bela Gipp
Much of the success of modern language models depends on finding a suitable prompt to instruct the model. Until now, it has been largely
unknown how variations in the linguistic expression of prompts affect these models. This study systematically and empirically evaluates
which linguistic features influence models through paraphrase types, i.e., different linguistic changes at particular positions. We measure be-
havioral changes for five models across 120 tasks and six families of paraphrases (i.e., morphology, syntax, lexicon, lexico-syntax, discourse,
and others). We also control for other prompt engineering factors (e.g., prompt length, lexical diversity, and proximity to training data). Our
results show a potential for language models to improve tasks when their prompts are adapted in specific paraphrase types (e.g., 6.7% median
gain in Mixtral 8x7B; 5.5% in LLaMA 3 8B). In particular, changes in morphology and lexicon, i.e., the vocabulary used, showed promise
in improving prompts. These findings contribute to developing more robust language models capable of handling variability in linguistic
expression.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
Michael Lan, Philip Torr, Fazl Barez
While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Re-
cent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic
functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing
sequences of Arabic numerals, number words, and months. By applying circuit interpretability analysis, we identify a key sub-circuit in both
GPT-2 Small and Llama-2-7B responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis
reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Additionally, we show that this sub-circuit
has effects on various math-related prompts, such as on intervaled circuits, Spanish number word and months continuation, and natural lan-
guage word problems. Overall, documenting shared computational structures enables better model behavior predictions, identification of
errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned,
and interpretable language models.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Why Does New Knowledge Create Messy Ripple Effects in LLMs?
Jiaxin Qin, Zixuan Zhang, Chi Han, Pengfei Yu, Manling Li, Heng Ji
Extensive previous research has focused on post-training knowledge editing (KE) for language models (LMs) to ensure that knowledge re-
mains accurate and up-to-date. One desired property, and an open question, in KE is for edited LMs to correctly handle ripple effects, where the LM is expected to answer logically related knowledge accurately. In this paper, we answer the question of why most KE methods still create
messy ripple effects. We conduct extensive analysis and identify a salient indicator, GradSim, that effectively reveals when and why updated
knowledge ripples in LMs. GradSim is computed by the cosine similarity between gradients of the original fact and its related knowledge.
We observe a strong positive correlation between ripple effect performance and GradSim across different LMs, KE methods, and evaluation
metrics. Further investigations into three counter-intuitive failure cases (Negation, Over-Ripple, Multi-Lingual) of ripple effects demonstrate
that these failures are often associated with very low GradSim. This finding validates that GradSim is an effective indicator of when knowl-
edge ripples in LMs.
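A minimal sketch of the GradSim quantity itself: the cosine similarity between the loss gradients of the original fact and a logically related fact. The model and losses below are stand-ins for illustration, not the paper's setup.

import torch

def flat_grad(loss, model):
    grads = torch.autograd.grad(loss, [p for p in model.parameters()
                                       if p.requires_grad], retain_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def gradsim(model, loss_fact, loss_related):
    # cosine similarity between the two flattened gradient vectors
    g1 = flat_grad(loss_fact, model)
    g2 = flat_grad(loss_related, model)
    return torch.nn.functional.cosine_similarity(g1, g2, dim=0)

model = torch.nn.Linear(16, 16)          # stand-in for an LM
x = torch.randn(4, 16)
sim = gradsim(model, model(x).sum(), (model(x) ** 2).mean())
print(f"GradSim = {sim.item():.3f}")     # high values predict clean ripples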
Nov 13 (Wed) 10:30-12:00 - Jasmine
AnaloBench: Benchmarking the Identification of Abstract and Long-context Analogies
Xiao Ye, Andrew Wang, Jacob Choi, Yining Lu, Shreya Sharma, Lingfeng Shen, Vijay Murari Tiyyala, Nicholas Andrews, Daniel Khashabi
Humans regularly engage in analogical thinking, relating personal experiences to current situations (X is analogous to Y because of Z). Ana-
logical thinking allows humans to solve problems in creative ways, grasp difficult concepts, and articulate ideas more effectively. Can language
models (LMs) do the same? To answer this question, we propose AnaloBench, a benchmark to determine analogical reasoning ability in LMs.
Our benchmarking approach focuses on aspects of this ability that are common among humans: (i) recalling related experiences from a large
amount of information, and (ii) applying analogical reasoning to complex and lengthy scenarios. We collect a set of 340 high quality, human
written analogies for use in our benchmark, which constitutes the largest such collection to date. We then test a broad collection of models consisting of 12 open-source and 3 proprietary models of various sizes and architectures. As in prior results, scaling up LMs yields some performance boosts. Surprisingly, scale offers minimal gains when (i) analogies involve lengthy scenarios, or (ii) the task requires recalling relevant scenarios from a large pool of information, a process analogous to finding a needle in a haystack. We hope these observations encourage further research in this field.
Nov 13 (Wed) 10:30-12:00 - Jasmine
An Analysis and Mitigation of the Reversal Curse
Ang Lv, Kaiyi Zhang, Shufang Xie, Quan Tu, Yuhan Chen, Ji-Rong Wen, Rui Yan
Recent research observed a noteworthy phenomenon in large language models (LLMs), referred to as the "reversal curse." The reversal curse
is that when dealing with two entities, denoted as a and b, connected by their relation R and its inverse R⁻¹, LLMs excel in handling sequences of the form "aRb," but encounter challenges when processing "bR⁻¹a," whether in generation or comprehension. For instance, GPT-4 can accurately respond to the query "Tom Cruise’s mother is?" with "Mary Lee Pfeiffer," but it struggles to provide a satisfactory
answer when asked "Mary Lee Pfeiffer’s son is?" In this paper, we undertake the first-ever study of how the reversal curse happens in LLMs.
Our investigations reveal that the reversal curse can stem from the specific training objectives, which become particularly evident in the
widespread use of next-token prediction within most causal language models. We hope this initial investigation can draw more attention to
the reversal curse, as well as other underlying limitations in current LLMs.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
Eden Biran, Daniela Gottesman, Sohee Yang, Mor Geva, Amir Globerson
Large language models (LLMs) can solve complex multi-step problems, but little is known about how these computations are implemented
internally. Motivated by this, we study how LLMs answer multi-hop queries such as "The spouse of the performer of Imagine is". These
queries require two information extraction steps: a latent one for resolving the first hop ("the performer of Imagine") into the bridge entity
(John Lennon), and another for resolving the second hop ("the spouse of John Lennon") into the target entity (Yoko Ono). Understanding
how the latent step is computed internally is key to understanding the overall computation. By carefully analyzing the internal computations
of transformer-based LLMs, we discover that the bridge entity is resolved in the early layers of the model. Then, only after this resolution,
the two-hop query is solved in the later layers. Because the second hop commences in later layers, there could be cases where these layers no
longer encode the necessary knowledge for correctly predicting the answer. Motivated by this, we propose a novel "back-patching" analysis
method whereby a hidden representation from a later layer is patched back to an earlier layer. We find that in up to 66% of previously
incorrect cases there exists a back-patch that results in the correct generation of the answer, showing that the later layers indeed sometimes
lack the needed functionality. Overall our methods and findings open further opportunities for understanding and improving latent reasoning
in transformer-based LLMs.
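A minimal sketch of the back-patching analysis using forward hooks, with GPT-2-style module paths chosen by us for illustration (not the authors' code): take the hidden state of a later layer at the last position and substitute it at an earlier layer, then re-run the forward pass.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)
ids = tok("The spouse of the performer of Imagine is", return_tensors="pt").input_ids

with torch.no_grad():
    source = model(ids).hidden_states[9][:, -1]     # later layer, last token

def patch(module, args, output):
    hidden = output[0]
    hidden[:, -1] = source                          # back-patch the earlier layer
    return (hidden,) + output[1:]

handle = model.transformer.h[4].register_forward_hook(patch)
with torch.no_grad():
    logits = model(ids).logits[:, -1]
handle.remove()
print(tok.decode(logits.argmax(-1)))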
Sentiment classification (SC) often suffers from low-resource challenges such as domain-specific contexts, imbalanced label distributions, and
few-shot scenarios. The potential of the diffusion language model (LM) for textual data augmentation (DA) remains unexplored; moreover,
textual DA methods struggle to balance the diversity and consistency of new samples. Most DA methods either perform logical modifica-
tions or rephrase less important tokens in the original sequence with the language model. In the context of SC, strong emotional tokens
could act critically on the sentiment of the whole sequence. Therefore, contrary to rephrasing less important context, we propose Diffusion-
CLS to leverage a diffusion LM to capture in-domain knowledge and generate pseudo samples by reconstructing strong label-related tokens.
This approach ensures a balance between consistency and diversity, avoiding the introduction of noise and augmenting crucial features of
datasets. DiffusionCLS also comprises a Noise-Resistant Training objective to help the model generalize. Experiments demonstrate the effec-
tiveness of our method in various low-resource scenarios including domain-specific and domain-general problems. Ablation studies confirm
the effectiveness of our framework’s modules, and visualization studies highlight optimal deployment conditions, reinforcing our conclusions.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Self-Refine Instruction-Tuning for Aligning Reasoning in Language Models
Leonardo Ranaldi, Andre Freitas
The alignment of reasoning abilities between smaller and larger Language Models is largely conducted via supervised fine-tuning using demonstrations generated from robust Large Language Models (LLMs). Although these approaches deliver more performant models, they do not show sufficiently strong generalization ability, as the training relies only on the provided demonstrations. In this paper, we propose the Self-refine Instruction-tuning method that elicits Smaller Language Models to self-improve their abilities. Our approach is based on a two-stage process, where reasoning abilities are first transferred between LLMs and Small Language Models (SLMs) via Instruction-tuning on synthetic demonstrations provided by LLMs, and then the instructed models self-improve their abilities through preference optimization strategies. In particular, the second phase operates refinement heuristics based on Direct Preference Optimization, where the SLMs are elicited to deliver a series of reasoning paths by automatically sampling the generated responses and providing rewards using ground truths from the LLMs. Results obtained on commonsense and math reasoning tasks show that this approach consistently outperforms Instruction-tuning in both in-domain and out-of-domain scenarios, aligning the reasoning abilities of Smaller and Larger language models.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Collaborative Performance Prediction for Large Language Models
Qiyuan Zhang, Fuyuan Lyu, Xue Liu, Chen Ma
Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has
emerged as a pivotal challenge in NLP research. Pioneering work on downstream scaling laws demonstrated intrinsic similarities within model families and utilized such similarities for performance prediction. However, these works tend to overlook the similarities between model families and only consider the design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks and other design factors for both model and task. We also collect a collaborative dataset sourced from online platforms containing both historical performance and additional design factors. With the support of this collaborative data, CPP not only
surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance,
an area previously overlooked.
Nov 13 (Wed) 10:30-12:00 - Jasmine
A Generic Method for Fine-grained Category Discovery in Natural Language Texts
Chang Tian, Matthew B. Blaschko, Wenpeng Yin, Mingzhe Xing, Yinliang Yue, Marie-Francine Moens
Fine-grained category discovery using only coarse-grained supervision is a cost-effective yet challenging task. Previous training methods
focus on aligning query samples with positive samples and distancing them from negatives. They often neglect intra-category and inter-
category semantic similarities of fine-grained categories when navigating sample distributions in the embedding space. Furthermore, some
evaluation techniques that rely on pre-collected test samples are inadequate for real-time applications. To address these shortcomings, we
introduce a method that successfully detects fine-grained clusters of semantically similar texts guided by a novel objective function. The
method uses semantic similarities in a logarithmic space to guide sample distributions in the Euclidean space and to form distinct clusters
that represent fine-grained categories. We also propose a centroid inference mechanism to support real-time applications. The efficacy of
the method is both theoretically justified and empirically confirmed on three benchmark tasks. The proposed objective function is inte-
grated in multiple contrastive learning based neural models. Its results surpass existing state-of-the-art approaches in terms of Accuracy,
Adjusted Rand Index and Normalized Mutual Information of the detected fine-grained categories. Code and data are publicly available at
https://github.com/changtianluckyforever/F-grained-STAR.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Text Grafting: Near-Distribution Weak Supervision for Minority Classes in Text Classification
Letian Peng, Yi Gu, Chengyu Dong, Zihan Wang, Jingbo Shang
For extremely weak-supervised text classification, pioneer research generates pseudo labels by mining texts similar to the class names from
the raw corpus, which may end up with very limited or even no samples for the minority classes. Recent works have started to generate the
relevant texts by prompting LLMs using the class names or definitions; however, there is a high risk that LLMs cannot generate in-distribution
(i.e., similar to the corpus where the text classifier will be applied) data, leading to ungeneralizable classifiers. In this paper, we combine
the advantages of these two approaches and propose to bridge the gap via a novel framework, text grafting, which aims to obtain clean and
near-distribution weak supervision for minority classes. Specifically, we first use LLM-based logits to mine masked templates from the raw
corpus, which have a high potential for data synthesis into the target minority class. Then, the templates are filled by state-of-the-art LLMs to
synthesize near-distribution texts falling into minority classes. Text grafting shows significant improvement over direct mining or synthesis
on minority classes. We also present analysis and case studies to understand the properties of text grafting.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Incubating Text Classifiers Following User Instruction with Nothing but LLM
Letian Peng, Zilong Wang, Jingbo Shang
In this paper, we aim to generate text classification data given arbitrary class definitions (i.e., user instruction), so one can train a text classifier
without any human annotation or raw corpus. Recent advances in large language models (LLMs) lead to pioneer attempts to individually
generate texts for each class via prompting. In this paper, we propose Incubator, the first framework that can handle complicated and even
mutually dependent classes (e.g., "TED Talk given by Educator" and "Other"). Specifically, our Incubator is a fine-tuned LLM that takes the
instruction of all class definitions as input, and in each inference, it can jointly generate one sample for every class. First, we tune Incubator
on the instruction-to-data mappings that we obtained from classification datasets and descriptions on Hugging Face together with in-context
augmentation by GPT-4. To emphasize the uniformity and diversity in generations, we refine Incubator by fine-tuning with the cluster centers
of semantic textual embeddings of the generated samples. We compare Incubator on various classification tasks with strong baselines such
as direct LLM-based inference and training data generation by prompt engineering. Experiments show Incubator is able to (1) outperform
previous methods on traditional benchmarks, (2) take label interdependency and user preference into consideration, and (3) enable logical text
DogeRM: Equipping Reward Models with Domain Knowledge through Model Merging
Tzu-Han Lin, Chen-An Li, Hung-yi Lee, Yun-Nung Chen
Reinforcement learning from human feedback (RLHF) is a popular strategy for aligning large language models (LLMs) with desired behaviors.
Reward modeling is a crucial step in RLHF. However, collecting paired preference data for training reward models is often costly and time-
consuming, especially for domain-specific preferences requiring expert annotation. To address this challenge, we propose the Domain knowledge merged Reward Model (DogeRM), a novel framework that integrates domain-specific knowledge into a general reward model by model merging. Our experiments demonstrate that DogeRM enhances performance across different benchmarks, and we provide a detailed analysis showcasing the effects of model merging, highlighting its great potential for facilitating model alignment.
Nov 13 (Wed) 10:30-12:00 - Jasmine
HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy
YongKang Liu, Yiqun Zhang, Qian Li, Tong Liu, Shi Feng, Daling Wang, Yifei Zhang, Hinrich Schuetze
Full-parameter fine-tuning (FPFT) has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent
performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizers to conserve GPU memory, which potentially compromises the performance of LMs, as non-zeroth-order optimizers tend to converge more readily on most downstream tasks. We propose a novel, memory-efficient, optimizer-independent, end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT significantly reduces the number of gradients and optimizer state parameters residing in GPU memory at any one time, thereby reducing GPU memory usage. Our
results demonstrate that: (1) HiFT achieves comparable performance with parameter-efficient fine-tuning and standard FPFT. (2) Results on
six models show that HiFT reduces the number of trainable parameters by about 89.18% on average compared to FPFT. (3) HiFT supports
FPFT of 7B models for 24G GPU memory devices under mixed precision without using any memory saving techniques. (4) HiFT supports
various optimizers including AdamW, AdaGrad, SGD, etc. The source code link is https://github.com/misonsky/HiFT.
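A minimal sketch of the hierarchical stepwise-update idea, with grouping and scheduling chosen by us for illustration (not the HiFT implementation): at each step, only one group of layers has requires_grad enabled, so gradients and optimizer state for the other groups never occupy GPU memory.

import torch
import torch.nn as nn

model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(8)])
groups = [list(model[i:i + 2].parameters()) for i in range(0, 8, 2)]

opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
for step in range(8):
    active = groups[step % len(groups)]        # rotate through layer groups
    for p in model.parameters():
        p.requires_grad_(False)
    for p in active:
        p.requires_grad_(True)
    loss = model(torch.randn(4, 64)).pow(2).mean()
    loss.backward()                            # grads exist only for `active`
    opt.step()                                 # params with grad None are skipped
    opt.zero_grad(set_to_none=True)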
Nov 13 (Wed) 10:30-12:00 - Jasmine
Link, Synthesize, Retrieve: Universal Document Linking for Zero-Shot Information Retrieval
Dae Yon Hwang, Bilal Taha, Harshit Pande, Yaroslav Nechaev
Despite the recent advancements in information retrieval (IR), zero-shot IR remains a significant challenge, especially when dealing with new
domains, languages, and newly-released use cases that lack historical query traffic from existing users. For such cases, it is common to use
query augmentations followed by fine-tuning pre-trained models on the document data paired with synthetic queries. In this work, we propose
a novel Universal Document Linking (UDL) algorithm, which links similar documents to enhance synthetic query generation across multiple
datasets with different characteristics. UDL leverages entropy for the choice of similarity models and named entity recognition (NER) for the
link decision of documents using similarity scores. Our empirical studies demonstrate the effectiveness and universality of the UDL across
diverse datasets and IR models, surpassing state-of-the-art methods in zero-shot cases. The code for reproducibility is available at https://github.com/eoduself/UDL.
Nov 13 (Wed) 10:30-12:00 - Jasmine
A Two-Step Approach for Data-Efficient French Pronunciation Learning
Hoyeon Lee, Hyeeun Jang, JONGHWAN KIM, Jaemin Kim
Recent studies have addressed intricate phonological phenomena in French, relying on either extensive linguistic knowledge or a significant
amount of sentence-level pronunciation data. However, creating such resources is expensive and non-trivial. To this end, we propose a novel
two-step approach that encompasses two pronunciation tasks: grapheme-to-phoneme and post-lexical processing. We then investigate the
efficacy of the proposed approach with a notably limited amount of sentence-level pronunciation data. Our findings demonstrate that the
proposed two-step approach effectively mitigates the lack of extensive labeled data, and serves as a feasible solution for addressing French
phonological phenomena even under resource-constrained environments.
Nov 13 (Wed) 10:30-12:00 - Jasmine
DEFT-UCS: Data Efficient Fine-Tuning for Pre-Trained Language Models via Unsupervised Core-Set Selection
Devleena Das, Vivek Khetan
Recent advances have led to the availability of many pre-trained language models (PLMs); however, a question that remains is how much
data is truly needed to fine-tune PLMs for downstream tasks? In this work, we introduce DEFT-UCS, a data-efficient fine-tuning framework
that leverages unsupervised core-set selection to identify a smaller, representative dataset to fine-tune PLMs for text-generation needed for
text editing tasks such as simplification, grammar correction, clarity, etc. We examine the efficacy of DEFT-UCS across multiple text-editing
tasks, and compare to the state-of-the art text-editing model, CoEDIT. Our results demonstrate that DEFT-UCS models are just as accurate as
CoEDIT, across eight different datasets consisting of six different editing tasks, while finetuned on 70% less data.
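A minimal sketch of unsupervised core-set selection under our own simplifying assumptions (not the DEFT-UCS algorithm): embed the training texts, cluster them, and keep the examples nearest to each centroid as the fine-tuning subset.

import numpy as np
from sklearn.cluster import KMeans

def coreset_indices(embeddings, n_clusters=10, per_cluster=5):
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    selected = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        # keep the examples closest to the cluster centroid
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        selected.extend(members[np.argsort(dists)[:per_cluster]])
    return np.array(selected)

emb = np.random.randn(1000, 384)       # e.g., sentence-encoder embeddings
subset = coreset_indices(emb)
print(f"fine-tune on {len(subset)} of {len(emb)} examples")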
Nov 13 (Wed) 10:30-12:00 - Jasmine
CoverICL: Selective Annotation for In-Context Learning via Active Graph Coverage
Costas Mavromatis, Balasubramaniam Srinivasan, Zhengyuan Shen, Jiani Zhang, Huzefa Rangwala, Christos Faloutsos, George Karypis
In-context learning (ICL) adapts Large Language Models (LLMs) to new tasks without requiring any parameter updates, using only a few annotated examples as input. In this work, we investigate selective annotation for ICL, where there is a limited budget for annotating examples, similar to low-budget active learning (AL). Although uncertainty-based selection is unreliable with little annotated data, we present CoverICL, an
adaptive graph-based selection algorithm, that effectively incorporates uncertainty sampling into selective annotation for ICL. First, Cover-
ICL builds a nearest-neighbor graph based on the semantic similarity between candidate ICL examples. Then, CoverICL employs uncertainty
estimation by the LLM to identify hard examples for the task. Selective annotation is performed over the active graph of the hard exam-
ples, adapting the process to the particular LLM used and the task tackled. CoverICL selects the most representative examples by solving a
Maximum Coverage problem, approximating diversity-based sampling. Extensive experiments on ten datasets and seven LLMs show that,
by incorporating uncertainty via coverage on the active graph, CoverICL (1) outperforms existing AL methods for ICL by 2–4.6% accuracy
points, (2) is up to 2x more budget-efficient than SOTA methods for low-budget AL, and (3) generalizes better across tasks compared to
non-graph alternatives.
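A minimal sketch of the selection step as a greedy maximum-coverage problem on a nearest-neighbor graph of hard examples; the uncertainty mask and scoring below are our stand-ins, a simplified view rather than the CoverICL algorithm itself.

import numpy as np

def greedy_max_coverage(embeddings, hard_mask, budget=8, k=10):
    hard = np.where(hard_mask)[0]
    sims = embeddings[hard] @ embeddings[hard].T
    neigh = np.argsort(-sims, axis=1)[:, :k]        # kNN graph over hard set
    covered, picks = set(), []
    for _ in range(budget):
        # pick the node whose neighborhood covers the most uncovered nodes
        gains = [len(set(neigh[i]) - covered) if i not in picks else -1
                 for i in range(len(hard))]
        best = int(np.argmax(gains))
        picks.append(best)
        covered |= set(neigh[best])
    return hard[picks]

emb = np.random.randn(200, 64)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
uncertain = np.random.rand(200) > 0.5              # e.g., LLM uncertainty
print("annotate:", greedy_max_coverage(emb, uncertain))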
Nov 13 (Wed) 10:30-12:00 - Jasmine
CapEEN: Image Captioning with Early Exits and Knowledge Distillation
Divya Jyoti Bajpai, Manjesh Kumar Hanawal
Deep neural networks (DNNs) have made significant progress in recognizing visual elements and generating descriptive text in image-
captioning tasks. However, their improved performance comes at the cost of increased computational burden and inference latency. Early Exit (EE) strategies can be used to enhance their efficiency, but their adaptation presents challenges in image captioning, as it requires varying levels
of semantic information for accurate predictions. To overcome this, we introduce CapEEN to improve the performance of EE strategies
using knowledge distillation. Inference in CapEEN is completed at intermediary layers if prediction confidence exceeds a predefined value
learned from the training data. To account for real-world deployments, where target distributions could drift from that of training samples, we
introduce a variant, A-CapEEN, to adapt the thresholds on the fly using a multi-armed bandit framework. Experiments on the MS COCO and
Flickr30k datasets show that CapEEN gains speedup of 1.77× while maintaining competitive performance compared to the final layer, and
A-CapEEN additionally offers robustness against distortions. The source code is available at https://github.com/Div290/CapEEN.
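A minimal sketch of confidence-based early exiting, with a toy stack of layers and a fixed threshold standing in for the learned one (not the CapEEN model): each intermediate layer gets a classifier head, and inference stops as soon as the head's confidence clears the threshold.

import torch
import torch.nn as nn

class EarlyExitStack(nn.Module):
    def __init__(self, dim=64, vocab=100, n_layers=6, threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.exits = nn.ModuleList(nn.Linear(dim, vocab) for _ in range(n_layers))
        self.threshold = threshold

    def forward(self, h):
        for depth, (layer, exit_head) in enumerate(zip(self.layers, self.exits)):
            h = torch.relu(layer(h))
            probs = exit_head(h).softmax(-1)
            conf, pred = probs.max(-1)
            if conf.item() > self.threshold:   # exit early, saving compute
                return pred, depth
        return pred, depth                     # fell through: final layer

model = EarlyExitStack()
token, depth = model(torch.randn(1, 64))
print(f"predicted token {token.item()} at layer {depth}")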
Nov 13 (Wed) 10:30-12:00 - Jasmine
Intermediate Layer Distillation with the Reused Teacher Classifier: A Study on the Importance of the Classifier of Attention-based
Models
Hang Zhang, Seyyed Hasan Mozafari, James J. Clark, Brett H. Meyer, Warren J. Gross
Intermediate Layer Distillation (ILD) effectively compresses large-scale pre-trained language models (PLMs). Existing ILD methods underestimate the importance of utilizing the teacher’s discriminative classifier and face challenges in establishing proper layer mappings. Therefore, we propose ILD-RTC to show that a straightforward implementation of reusing the pre-trained teacher classifier improves student performance even with simple uniform layer mapping. Through extensive experiments, our method outperforms other ILD techniques, maintaining 97.7% of the performance of the original teacher BERT_base without additional trainable parameters. Projectors are developed to help the student match the hidden size of the teacher model, making our ILD-RTC applicable to students of different sizes. In addition, our technique achieves the same average GLUE score as students initialized from pre-trained LMs, saving over 80× the cost incurred by the pre-training step. Our method emphasizes the reuse of pre-trained teacher classifiers as an alternative to pre-training the student for initialization.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Active Learning for Abstractive Text Summarization via LLM-Determined Curriculum and Certainty Gain Maximization
Dongyuan Li, Ying Zhang, Zhen Wang, Shiyin Tan, Satoshi Kosugi, Manabu Okumura
For abstractive text summarization, laborious data annotation and time-consuming model training become two high walls, hindering its fur-
ther progress. Active Learning, selecting a few informative instances for annotation and model training, sheds light on solving these issues.
However, only few active learning-based studies focus on abstractive text summarization and suffer from low stability, effectiveness, and
efficiency. To solve the problems, we propose a novel LLM-determined curriculum active learning framework. Firstly, we design a prompt
to ask large language models to rate the difficulty of instances, which guides the model to train on from easier to harder instances. Secondly,
we design a novel active learning strategy, i.e., Certainty Gain Maximization, enabling to select instances whose distribution aligns well with
the overall distribution. Experiments show our method can improve stability, effectiveness, and efficiency of abstractive text summarization
backbones.
Nov 13 (Wed) 10:30-12:00 - Jasmine
LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation
Seyedarmin Azizi, Souvik Kundu, Massoud Pedram
Low-rank adaptation (LoRA) has become the default approach to fine-tune large language models (LLMs) due to its significant reduction in trainable parameters. However, the trainable parameter demand of LoRA increases with increasing model embedding dimensions, leading to high compute costs. Additionally, its backward updates require storing high-dimensional intermediate activations and optimizer states, demanding high peak GPU memory. In this paper, we introduce LaMDA, a novel approach to fine-tuning large language models, which leverages low-dimensional adaptation to achieve significant reductions in trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path while introducing a low-dimensional trainable square matrix, resulting in substantial reductions in trainable parameters and peak GPU memory usage. LaMDA gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates to further enhance parameter efficiency. We also present an enhancement, LaMDA++, incorporating a "lite-weight" adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning on different LLMs. Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7× fewer parameter updates and up to 1.32× lower peak GPU memory usage during fine-tuning. Code will be publicly available at https://github.com/ArminAzizi98/LaMDA.
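A minimal sketch of the low-dimensional adaptation path as we read it from the abstract (a simplified illustration, not the released LaMDA code): freeze the down-projection (PMA), train an r×r square matrix, and freeze the up-projection (PMB) after the early steps.

import torch
import torch.nn as nn

class LowDimAdapter(nn.Module):
    def __init__(self, dim, r=16):
        super().__init__()
        self.pma = nn.Linear(dim, r, bias=False)    # frozen from the start
        self.square = nn.Linear(r, r, bias=False)   # long-term trainable part
        self.pmb = nn.Linear(r, dim, bias=False)    # frozen after warmup
        self.pma.weight.requires_grad_(False)

    def freeze_pmb(self):
        self.pmb.weight.requires_grad_(False)

    def forward(self, x, frozen_out):
        return frozen_out + self.pmb(self.square(self.pma(x)))

adapter = LowDimAdapter(dim=768)
count = lambda m: sum(p.numel() for p in m.parameters() if p.requires_grad)
print(f"trainable params before warmup: {count(adapter)}")  # square + PMB
adapter.freeze_pmb()                                         # after early steps
print(f"after: {count(adapter)}")                            # just r*r = 256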
Nov 13 (Wed) 10:30-12:00 - Jasmine
MORL-Prompt: An Empirical Analysis of Multi-Objective Reinforcement Learning for Discrete Prompt Optimization
Yasaman Jafari, Dheeraj Mekala, Rose Yu, Taylor Berg-Kirkpatrick
RL-based techniques can be employed to search for prompts that, when fed into a target language model, maximize a set of user-specified
reward functions. However, in many target applications, the natural reward functions are in tension with one another – for example, content
preservation vs. style matching in style transfer tasks. Current techniques focus on maximizing the average of reward functions, which does
not necessarily lead to prompts that achieve balance across rewards – an issue that has been well-studied in the multi-objective and robust
optimization literature. In this paper, we conduct an empirical comparison of several existing multi-objective optimization techniques adapted
to this new setting: RL-based discrete prompt optimization. We compare two methods optimizing the volume of the Pareto reward surface
and one method that chooses an update direction that benefits all rewards simultaneously. We evaluate performance on two NLP tasks: style
transfer and machine translation, each using three competing reward functions. Our experiments demonstrate that multi-objective methods
that directly optimize the volume of the Pareto reward surface perform better and achieve a better balance of all rewards than those that attempt
to find monotonic update directions.
Nov 13 (Wed) 10:30-12:00 - Jasmine
TAP4LLM: Table Provider on Sampling, Augmenting, and Packing Semi-structured Data for Large Language Model Reasoning
Yuan Sui, Jiaru Zou, Mengyu Zhou, Xinyi He, Lun Du, Shi Han, Dongmei Zhang
Table reasoning tasks have shown remarkable progress with the development of large language models (LLMs), which involve interpreting
and drawing conclusions from tabular data based on natural language (NL) questions. Existing solutions mainly tested on smaller tables face
scalability issues and struggle with complex queries due to incomplete or dispersed data across different table sections. To alleviate these
challenges, we propose TAP4LLM as a versatile pre-processor suite for leveraging LLMs in table-based tasks effectively. It covers several
distinct components: (1) table sampling to decompose large tables into manageable sub-tables based on query semantics, (2) table augmenta-
tion to enhance tables with additional knowledge from external sources or models, and (3) table packing & serialization to convert tables into
various formats suitable for LLMs’ understanding. In each module, we design and compare several common methods for usage in various
scenarios, aiming to shed light on the best practices for leveraging LLMs for table-reasoning tasks. Our experiments show that our method
improves LLMs’ reasoning capabilities in various tabular tasks and enhances the interaction between LLMs and tabular data by employing
effective pre-processing.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Self-training Language Models in Arithmetic Reasoning
Marek Kadlčík, Michal Štefánik
Recent language models achieve impressive results in tasks involving complex multistep reasoning, but scaling these capabilities further traditionally requires expensive collection of more annotated data. In this work, we explore the potential of improving models’ reasoning capabilities without new data, merely using automated feedback on the validity of their predictions in arithmetic reasoning (self-training). In systematic experimentation across six different arithmetic reasoning datasets, we find that models can substantially improve in both single-round (offline) and online self-training, reaching a correct result in +13.9% and +25.9% more cases, respectively, underlining the importance of the actuality of self-training feedback. We further find that in single-round, offline self-training, traditional supervised training can deliver gains comparable to preference optimization, but in online self-training, preference optimization methods largely outperform supervised training thanks to their superior stability and robustness on unseen types of problems.
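A minimal sketch of a self-training round with automated arithmetic feedback; the helper names and answer-extraction rule are ours for illustration, and the parsing/training details differ in the paper.

import re

def extract_answer(generation: str):
    nums = re.findall(r"-?\d+\.?\d*", generation)
    return float(nums[-1]) if nums else None

def self_training_round(model_generate, problems):
    kept = []
    for prob in problems:
        for attempt in model_generate(prob["question"], n_samples=8):
            # automated feedback: keep only solutions whose answer checks out
            if extract_answer(attempt) == prob["gold_answer"]:
                kept.append({"question": prob["question"], "solution": attempt})
                break
    return kept   # fine-tune (or run preference optimization) on `kept`

# toy stand-in generator so the sketch runs end to end
demo = lambda q, n_samples: [f"reasoning ... the answer is {i}" for i in range(n_samples)]
data = [{"question": "2+3?", "gold_answer": 5.0}]
print(self_training_round(demo, data))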
Nov 13 (Wed) 10:30-12:00 - Jasmine
All You Need is Attention: Lightweight Attention-based Data Augmentation for Text Classification
Junehyung Kim, Sungjae Hwang
This paper introduces LADAM, a novel method for enhancing the performance of text classification tasks. LADAM employs attention
mechanisms to exchange semantically similar words between sentences. This approach generates a greater diversity of synthetic sentences
compared to simpler operations like random insertions, while maintaining the context of the original sentences. Additionally, LADAM is
an easy-to-use, lightweight technique that does not require external datasets or large language models. Our experimental results across five
datasets demonstrate that LADAM consistently outperforms baseline methods across diverse text classification conditions.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Generate then Refine: Data Augmentation for Zero-shot Intent Detection
I-Fan Lin, Faegheh Hasibi, Suzan Verberne
In this short paper, we propose a data augmentation method for intent detection in zero-resource domains. Existing data augmentation methods rely on a few labelled examples for each intent category, which can be expensive in settings with many possible intents. We use a two-stage approach: First, we generate utterances for intent labels using an open-source large language model in a zero-shot setting. Second, we develop a smaller sequence-to-sequence model (the Refiner) to improve the generated utterances. The Refiner is fine-tuned on seen domains and then applied to unseen domains. We evaluate our method by training an intent classifier on the generated data and evaluating it on real (human) data. We find that the Refiner significantly improves the data utility and diversity over the zero-shot LLM baseline for unseen domains and over common baseline approaches. Our results indicate that a two-step approach of a generative LLM in a zero-shot setting and a smaller sequence-to-sequence model can provide high-quality data for intent detection.
Nov 13 (Wed) 10:30-12:00 - Jasmine
BanglaTLit: A Benchmark Dataset for Back-Transliteration of Romanized Bangla
Md Fahim, Fariha Tanjim Shifat, Md Farhan Ishmam, Deeparghya Dutta Barua, Fabiha Haider, MD SAKIB UL RAHMAN SOUROVE, Md
Farhad Alam Bhuiyan
Low-resource languages like Bangla are severely limited by the lack of datasets. Romanized Bangla texts are ubiquitous on the internet, of-
fering a rich source of data for Bangla NLP tasks and extending the available data sources. However, due to the informal nature of romanized
text, they often lack the structure and consistency needed to provide insights. We address these challenges by proposing: (1) BanglaTLit,
the large-scale Bangla transliteration dataset consisting of 42.7k samples, (2) BanglaTLit-PT, a pre-training corpus on romanized Bangla
with 245.7k samples, (3) encoders further-pretrained on BanglaTLit-PT achieving state-of-the-art performance in several romanized Bangla
classification tasks, and (4) multiple back-transliteration baseline methods, including a novel encoder-decoder architecture using further pre-
trained encoders. Our results show the potential of automated Bangla back-transliteration in utilizing the untapped sources of romanized
Bangla to enrich this language. The code and datasets are publicly available: https://github.com/farhanishmam/BanglaTLit.
Nov 13 (Wed) 10:30-12:00 - Jasmine
Self-training Large Language Models through Knowledge Detection
Yeo Wei Jie, Teddy Ferdinan, Przemyslaw Kazienko, Ranjan Satapathy, Erik Cambria
Large language models (LLMs) often necessitate extensive labeled datasets and training compute to achieve impressive performance across
downstream tasks. This paper explores a self-training paradigm, where the LLM autonomously curates its own labels and selectively trains on
unknown data samples identified through a reference-free consistency method. Empirical evaluations demonstrate significant improvements
in reducing hallucination in generation across multiple subjects. Furthermore, the selective training framework mitigates catastrophic for-
getting in out-of-distribution benchmarks, addressing a critical limitation in training LLMs. Our findings suggest that such an approach can
substantially reduce the dependency on large labeled datasets, paving the way for more scalable and cost-effective language model training.
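One plausible reading of reference-free consistency as a knowledge detector is sketched below, with `sample` as a placeholder for a stochastic generation call (not the paper’s API): a prompt counts as “unknown” when repeated generations disagree.

```python
# Sketch of reference-free consistency filtering for selective self-training.
# `sample` is a placeholder that draws one stochastic generation.
from collections import Counter

def is_unknown(model, prompt, sample, k=5, threshold=0.6):
    """Low self-consistency across k samples is used as a proxy for
    missing knowledge, marking the sample for self-training."""
    answers = [sample(model, prompt) for _ in range(k)]
    top_fraction = Counter(answers).most_common(1)[0][1] / k
    return top_fraction < threshold
```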
Parameter-Efficient Fine-Tuning (PEFT) methods have gained significant popularity for adapting pre-trained Large Language Models (LLMs)
to downstream tasks, primarily due to their potential to significantly reduce memory and computational overheads. However, a common lim-
itation in most PEFT approaches is their application of a uniform architectural design across all layers. This uniformity involves identical
trainable modules and ignores the varying importance of each layer, leading to sub-optimal fine-tuning results. To overcome the above lim-
itation and obtain better performance, we develop a novel approach, Importance-aware Sparse Tuning (IST), to fully utilize the inherent
sparsity and select the most important subset of full layers with effective layer-wise importance scoring. The proposed IST is a versatile and
plug-and-play technique compatible with various PEFT methods that operate on a per-layer basis. By leveraging the estimated importance
scores, IST dynamically updates these selected layers in PEFT modules, leading to reduced memory demands. We further provide theoretical
proof of convergence and empirical evidence of superior performance to demonstrate the advantages of IST over uniform updating strategies.
Extensive experiments on a range of LLMs, PEFTs, and downstream tasks substantiate the effectiveness of our proposed method, showcasing
IST’s capacity to enhance existing layer-based PEFT methods. Our code is available at https://github.com/Kaiseem/IST
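The layer-selection step might look like the following PyTorch sketch, where `importance` is assumed to come from the method’s scoring pass and only the selected layers keep trainable parameters; this simplifies IST’s per-PEFT-module updates into a plain requires-grad toggle.

```python
# Minimal PyTorch sketch of importance-aware layer selection: only the
# top-k scored layers remain trainable in this update step. In the actual
# method, the toggling applies to per-layer PEFT modules, not full layers.
import torch.nn as nn

def select_layers_for_update(layers: nn.ModuleList, importance, k: int):
    topk = sorted(range(len(layers)), key=lambda i: importance[i], reverse=True)[:k]
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i in topk   # freeze everything outside the subset
    return topk
```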
Nov 13 (Wed) 10:30-12:00 - Jasmine
InfuserKI: Enhancing Large Language Models with Knowledge Graphs via Infuser-Guided Knowledge Integration
Fali Wang, Runxue Bao, Suhang Wang, Wenchao Yu, Yanchi Liu, Wei Cheng, Haifeng Chen
Large Language Models (LLMs) have achieved exceptional capabilities in open generation across various domains, yet they encounter dif-
ficulties with tasks that require intensive knowledge. To address these challenges, methods for integrating knowledge have been developed,
which augment LLMs with domain-specific knowledge graphs through external modules. These approaches, however, face data inefficiency
issues as they necessitate the processing of both known and unknown knowledge for fine-tuning. Thus, our research focuses on a novel
problem: efficiently integrating unknown knowledge into LLMs without unnecessary overlap of known knowledge. A risk of introducing
new knowledge is the potential forgetting of existing knowledge. To mitigate this risk, we propose the innovative InfuserKI framework. This
framework employs transformer internal states to determine when to enrich LLM outputs with additional information, effectively preventing
knowledge forgetting. Performance evaluations using the UMLS-2.5k and MetaQA domain knowledge graphs reveal that InfuserKI not only
successfully integrates new knowledge but also outperforms state-of-the-art baselines, reducing knowledge forgetting by 9% and 6%, respec-
tively.
NLP Applications 2
Nov 13 (Wed) 10:30-12:00 - Room: Riverfront Hall
Our extensive experiments reveal that gradients from a single Transformer layer, or even a single linear component with 0.54% parameters,
are susceptible to training data leakage. Additionally, we show that applying differential privacy on gradients during training offers limited
protection against the novel vulnerability of data disclosure.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Learning from Natural Language Explanations for Generalizable Entity Matching
Somin Wadhwa, ADIT KRISHNAN, Runhui Wang, Byron C Wallace, Luyang Kong
Entity matching is the task of linking records from different sources that refer to the same real-world entity. Past work has primarily treated en-
tity linking as a standard supervised learning problem. However, supervised entity matching models often do not generalize well to new data,
and collecting exhaustive labeled training data is often cost prohibitive. Further, recent efforts have adopted LLMs for this task in few/zero-
shot settings, exploiting their general knowledge. But LLMs are prohibitively expensive for performing inference at scale for real-world entity
matching tasks. As an efficient alternative, we re-cast entity matching as a conditional generation task as opposed to binary classification. This
enables us to "distill" LLM reasoning into smaller entity matching models via natural language explanations. This approach achieves strong
performance, especially on out-of-domain generalization tests (10.85% F-1) where standalone generative methods struggle. We perform ab-
lations that highlight the importance of explanations, both for performance and model robustness.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
On the Reliability of Psychological Scales on Large Language Models
Jen-tse Huang, Wenxuan Wang, Man Ho LAM, Eric John Li, Wenxiang Jiao, Michael Lyu
Recent research has focused on examining Large Language Models (LLMs) characteristics from a psychological standpoint, acknowledging
the necessity of understanding their behavioral characteristics. The administration of personality tests to LLMs has emerged as a noteworthy
area in this context. However, the suitability of employing psychological scales, initially devised for humans, on LLMs is a matter of ongoing
debate. Our study aims to determine the reliability of applying personality assessments to LLMs, explicitly investigating whether LLMs
demonstrate consistent personality traits. Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1,
reveals that various LLMs show consistency in responses to the Big Five Inventory, indicating a satisfactory level of reliability. Furthermore,
our research explores the potential of GPT-3.5 to emulate diverse personalities and represent various groups - a capability increasingly sought
after in social sciences for substituting human participants with LLMs to reduce costs. Our findings reveal that LLMs have the potential to
represent different personalities with specific prompt instructions.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Democratizing Large Language Models via Personalized Parameter-Efficient Fine-tuning
Zhaoxuan Tan, Qingkai Zeng, Yijun Tian, Zheyuan Liu, Bing Yin, Meng Jiang
Personalization in large language models (LLMs) is increasingly important, aiming to align the LLMs’ interactions, content, and recom-
mendations with individual user preferences. Recent advances have highlighted effective prompt design by enriching user queries with
non-parametric knowledge through behavior history retrieval and textual profiles. However, these methods faced limitations due to a lack
of model ownership, resulting in constrained customization and privacy issues, and often failed to capture complex, dynamic user behavior
patterns. To address these shortcomings, we introduce One PEFT Per User (OPPU), employing personalized parameter-efficient fine-tuning
(PEFT) modules to store user-specific behavior patterns and preferences. By plugging in personal PEFT parameters, users can own and use
their LLMs individually. OPPU integrates parametric user knowledge in the personal PEFT parameters with non-parametric knowledge from
retrieval and profiles, adapting LLMs to user behavior shifts. Experimental results demonstrate that OPPU significantly outperforms existing
prompt-based methods across seven diverse tasks in the LaMP benchmark. Further studies reveal OPPU’s enhanced capabilities in handling
user behavior shifts, modeling users at different activity levels, maintaining robustness across various user history formats, and displaying
versatility with different PEFT methods.
review tasks: (1) detect inconsistencies between code changes and commit messages, (2) identify vulnerability introductions, (3) validate code
style adherence, and (4) suggest code revisions. The results demonstrate CodeAgent’s effectiveness, contributing to a new state-of-the-art in
code review automation. Our data and code are publicly available (https://github.com/Daniel4SE/codeagent).
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
DECOR: Improving Coherence in L2 English Writing with a Novel Benchmark for Incoherence Detection, Reasoning, and Rewriting
Xuanming Zhang, Anthony Diaz, Zixun Chen, Qingyang Wu, Kun Qian, Erik Voss, Zhou Yu
Coherence in writing, an aspect that L2 English learners often struggle with, is crucial in assessing L2 English writing. Existing automated
writing evaluation systems primarily use basic surface linguistic features to detect coherence in writing. However, little effort has been made
to correct the detected incoherence, which could significantly benefit L2 language learners seeking to improve their writing. To bridge this
gap, we introduce DECOR, a novel benchmark that includes expert annotations for detecting incoherence in L2 English writing, identifying
the underlying reasons, and rewriting the incoherent sentences. To our knowledge, DECOR is the first coherence assessment dataset specifi-
cally designed for improving L2 English writing, featuring pairs of original incoherent sentences alongside their expert-rewritten counterparts.
Additionally, we fine-tuned models to automatically detect and rewrite incoherence in student essays. We find that incorporating specific rea-
sons for incoherence during fine-tuning consistently improves the quality of the rewrites, achieving a level that is favored in both automatic
and human evaluations.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Large Language Models Can Self-Correct with Key Condition Verification
Zhenyu Wu, Qingkai Zeng, Zhihan Zhang, Zhaoxuan Tan, Chao Shen, Meng Jiang
Intrinsic self-correction is a method that instructs large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, prior work concluded that LLMs cannot yet self-correct reasoning. We find that a simple yet effective prompting method enhances LLM performance in identifying and correcting inaccurate answers without external feedback: mask a key
condition in the question, add the current response to construct a verification question, and predict the condition to verify the response.
The condition can be an entity in an open-domain question or a numerical value in an arithmetic question, which requires minimal effort
(via prompting) to identify. We propose an iterative verify-then-correct framework to progressively identify and correct (probably) false
responses, named ProCo. We conduct experiments on three reasoning tasks. On average, ProCo, with GPT-3.5-Turbo-1106 as the back-
end LLM, yields +6.8 exact match on four open-domain question answering datasets, +14.1 accuracy on three arithmetic reasoning
datasets, and +9.6 accuracy on a commonsense reasoning dataset, compared to Self-Correct. Our implementation is made publicly available
at https://wzy6642.github.io/proco.github.io/.
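A minimal sketch of one verification step follows, assuming an `ask` placeholder for the LLM call and a pre-identified key condition; the real ProCo pipeline iterates this check inside a verify-then-correct loop.

```python
# Sketch of one ProCo-style verification step; `ask` is a placeholder
# for an LLM call, and condition extraction/masking is simplified.
def verify_by_key_condition(ask, question, key_condition, answer):
    """Mask the key condition, append the current answer, and check whether
    the model can recover the masked condition from its own answer."""
    masked = question.replace(key_condition, "[MASK]")
    probe = (f"{masked}\nSuppose the answer is: {answer}\n"
             f"What value should [MASK] be?")
    predicted = ask(probe)
    return predicted.strip() == key_condition.strip()
```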
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, Yang Zhang
We find that language models have difficulties generating fallacious and deceptive reasoning. When asked to generate deceptive outputs, lan-
guage models tend to leak honest counterparts but believe them to be false. Exploiting this deficiency, we propose a jailbreak attack method
that elicits an aligned language model for malicious output. Specifically, we query the model to generate a fallacious yet deceptively real
procedure for the harmful behavior. Since a fallacious procedure is generally considered fake and thus harmless by LLMs, it helps bypass
the safeguard mechanism. Yet the output is factually harmful since the LLM cannot fabricate fallacious solutions but proposes truthful ones.
We evaluate our approach over five safety-aligned large language models, comparing it with four previous jailbreak methods, and show that our approach achieves competitive performance with more harmful outputs. We believe the findings could be extended beyond model safety, such
as self-verification and hallucination.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
MolTRES: Improving Chemical Language Representation Learning for Molecular Property Prediction
Jun-Hyung Park, Yeachan Kim, Mingyu Lee, Hyuntae Park, SangKeun Lee
Chemical representation learning has gained increasing interest due to the limited availability of supervised data in fields such as drug and
materials design. This interest particularly extends to chemical language representation learning, which involves pre-training Transformers
on SMILES sequences - textual descriptors of molecules. Despite its success in molecular property prediction, current practices often lead
to overfitting and limited scalability due to early convergence. In this paper, we introduce a novel chemical language representation learn-
ing framework, called MolTRES, to address these issues. MolTRES incorporates generator-discriminator training, allowing the model to
learn from more challenging examples that require structural understanding. In addition, we enrich molecular representations by transferring
knowledge from scientific literature by integrating external materials embedding. Experimental results show that our model outperforms
existing state-of-the-art models on popular molecular property prediction tasks.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
SecCoder: Towards Generalizable and Robust Secure Code Generation
Boyu Zhang, Tianyu Du, Junkai Tong, Xuhong Zhang, Kingsum Chow, Sheng Cheng, Xun Wang, Jianwei Yin
As large models (LMs) have gained widespread acceptance in code-related tasks, their strong generative capacity has greatly promoted the adoption of code LMs. Nevertheless, the security of the generated code has drawn attention because of its potential for damage. Existing secure code generation methods have limited generalizability to unseen test cases and poor robustness against attacked models, leading to safety failures in code generation. In this paper, we propose SecCoder, a generalizable and robust secure code generation method that uses in-context learning (ICL) and a safe demonstration. A dense retriever is used to select the most helpful demonstration to maximize the improvement of the generated code’s security. Experimental results show the superior generalizability of SecCoder compared to the current secure code generation method, achieving a significant average security improvement of 7.20% on unseen test cases. The results also show the better robustness of SecCoder compared to the current attacked code LM, achieving a significant average security improvement of 7.74%. Our analysis indicates that SecCoder enhances the security of LMs in generating code, and it is more generalizable and robust.
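The prompting scheme could be sketched as follows, with `embed` and `generate` as placeholders for the dense retriever’s encoder and the code LM; this is a structural illustration, not the paper’s implementation.

```python
# Sketch of SecCoder-style prompting: retrieve the most relevant safe
# demonstration and prepend it as an in-context example.
import numpy as np

def secure_generate(query, demos, embed, generate):
    q = embed(query)
    scores = [float(np.dot(q, embed(d["task"]))) for d in demos]
    best = demos[int(np.argmax(scores))]    # most helpful safe demonstration
    prompt = (f"# Example task:\n{best['task']}\n"
              f"# Secure solution:\n{best['secure_code']}\n\n"
              f"# Task:\n{query}\n# Secure solution:\n")
    return generate(prompt)
```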
self-consistency baseline on the complicated MATH dataset, DynaThink achieved more than 3% increase in accuracy with lower cost. The
code will be made available upon publication.
facts to ground applicable rules at each step. Experiments indicate our framework’s effectiveness in rule application and its robustness across
various steps and settings.
Ran Xu, Wenqi Shi, Yue Yu, Yuchen Zhuang, Yanqiao Zhu, May Dongmei Wang, Joyce C. Ho, Chao Zhang, Carl Yang
Developing effective biomedical retrieval models is important for excelling at knowledge-intensive biomedical tasks but still challenging due
to the lack of sufficient publicly annotated biomedical data and computational resources. We present BMRetriever, a series of dense retrievers
for enhancing biomedical retrieval via unsupervised pre-training on large biomedical corpora, followed by instruction fine-tuning on a combi-
nation of labeled datasets and synthetic pairs. Experiments on 5 biomedical tasks across 11 datasets verify BMRetriever’s efficacy on various
biomedical applications. BMRetriever also exhibits strong parameter efficiency, with the 410M variant outperforming baselines up to 11.7
times larger, and the 2B variant matching the performance of models with over 5B parameters. The training data and model checkpoints are
released at https://huggingface.co/BMRetriever to ensure transparency, reproducibility, and application to new domains.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
MedAdapter: Efficient Test-Time Adaptation of Large Language Models Towards Medical Reasoning
Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Haotian Sun, Hang Wu, Carl Yang, May Dongmei Wang
Despite their improved capabilities in generation and reasoning, adapting large language models (LLMs) to the biomedical domain remains
challenging due to their immense size and privacy concerns. In this study, we propose MedAdapter, a unified post-hoc adapter for test-time
adaptation of LLMs towards biomedical applications. Instead of fine-tuning the entire LLM, MedAdapter effectively adapts the original
model by fine-tuning only a small BERT-sized adapter to rank candidate solutions generated by LLMs. Experiments on four biomedical
tasks across eight datasets demonstrate that MedAdapter effectively adapts both white-box and black-box LLMs in biomedical reasoning,
achieving average performance improvements of 18.24% and 10.96%, respectively, without requiring extensive computational resources or
sharing data with third parties. MedAdapter also yields enhanced performance when combined with train-time adaptation, highlighting a flex-
ible and complementary solution to existing adaptation methods. Faced with the challenges of balancing model performance, computational
resources, and data privacy, MedAdapter provides an efficient, privacy-preserving, cost-effective, and transparent solution for adapting LLMs
to the biomedical domain.
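The test-time ranking idea reduces to a generate-then-score loop; in this sketch `llm_sample` and `score` are placeholders for the (white- or black-box) LLM and the BERT-sized adapter, not the released interfaces.

```python
# Sketch of MedAdapter-style test-time adaptation: the LLM proposes
# candidates and a small scorer picks the best one.
def med_adapter_answer(llm_sample, score, question, k=8):
    candidates = [llm_sample(question) for _ in range(k)]
    return max(candidates, key=lambda c: score(question, c))
```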
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Coffee-Gym: An Environment for Evaluating and Improving Natural Language Feedback on Erroneous Code
Hyungjoo Chae, Taeyoon Kwon, Seungjun Moon, Yongho Song, Dongjin Kang, Kai Tzu-iunn Ong, Beong-woo Kwak, Seonghyeon Bae, seung-
won hwang, Jinyoung Yeo
This paper presents Coffee-Gym, a comprehensive RL environment for training models that provide feedback on code editing. Coffee-Gym
includes two major components: (1) Coffee, a dataset containing humans’ code edit traces for coding questions and human-written feedback
for editing erroneous code; (2) CoffeeEval, a reward function that faithfully reflects the helpfulness of feedback by assessing the performance
of the revised code in unit tests. With them, Coffee-Gym addresses the unavailability of high-quality datasets for training feedback models
with RL, and provides more accurate rewards than the SOTA reward model (i.e., GPT-4). By applying Coffee-Gym, we elicit feedback models
that outperform baselines in enhancing open-source code LLMs’ code editing, making them comparable with closed-source LLMs. We make
the dataset and the model checkpoint publicly available at https://huggingface.co/spaces/Coffee-Gym/Project-Coffee-Gym.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Entity Insertion in Multilingual Linked Corpora: The Case of Wikipedia
Tomás Feith, Akhil Arora, Martin Gerlach, Debjit Paul, Robert West
Links are a fundamental part of information networks, turning isolated pieces of knowledge into a network of information that is much richer
than the sum of its parts. However, adding a new link to the network is not trivial: it requires not only the identification of a suitable pair of
source and target entities but also the understanding of the content of the source to locate a suitable position for the link in the text. The latter
problem has not been addressed effectively, particularly in the absence of text spans in the source that could serve as anchors to insert a link
to the target entity. To bridge this gap, we introduce and operationalize the task of entity insertion in information networks. Focusing on the
case of Wikipedia, we empirically show that this problem is both relevant and challenging for editors. We compile a benchmark dataset in
105 languages and develop a framework for entity insertion called LocEI (Localized Entity Insertion) and its multilingual variant XLocEI. We
show that XLocEI outperforms all baseline models (including state-of-the-art prompt-based ranking with LLMs such as GPT-4) and that it
can be applied in a zero-shot manner on languages not seen during training with minimal performance drop. These findings are important for
applying entity insertion models in practice, e.g., to support editors in adding links across the more than 300 language versions of Wikipedia.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
CodeFort: Robust Training for Code Generation Models
Yuhao Zhang, Shiqi Wang, Haifeng Qian, Zijian Wang, Mingyue Shang, Linbo Liu, Sanjay Krishna Gouda, Baishakhi Ray, Murali Krishna
Ramanathan, Xiaofei Ma, Anoop Deoras
Code generation models are not robust to small perturbations, which often lead to incorrect generations and significantly degrade the perfor-
mance of these models. Although improving the robustness of code generation models is crucial to enhancing user experience in real-world
applications, existing research efforts do not address this issue. To fill this gap, we propose CodeFort, a framework to improve the robustness
of code generation models, generalizing a large variety of code perturbations to enrich the training data and enabling various robust training
strategies, mixing data augmentation, batch augmentation, adversarial logits pairing, and contrastive learning, all carefully designed to support
high-throughput training. Extensive evaluations show that we increase the average robust pass rates of baseline CodeGen models from 14.79
to 21.74. We notably decrease the robustness drop rate from 95.02% to 54.95% against code-syntax perturbations.
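Of the listed strategies, adversarial logits pairing is the easiest to sketch: pull the model’s logits on a perturbed input toward its logits on the clean input. The tensor shapes and the perturbation itself are assumptions made for illustration.

```python
# Sketch of an adversarial-logits-pairing loss for robust code-model training.
import torch
import torch.nn.functional as F

def logits_pairing_loss(clean_logits: torch.Tensor,
                        perturbed_logits: torch.Tensor) -> torch.Tensor:
    """KL(clean || perturbed) over the vocabulary, averaged over the batch:
    the perturbed-input distribution is pulled toward the clean one."""
    p = F.log_softmax(perturbed_logits, dim=-1)   # input: log-probabilities
    q = F.softmax(clean_logits, dim=-1)           # target: probabilities
    return F.kl_div(p, q, reduction="batchmean")
```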
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Exploring Automated Keyword Mnemonics Generation with Large Language Models via Overgenerate-and-Rank
Jaewook Lee, Hunter McNichols, Andrew Lan
In this paper, we study an under-explored area of language and vocabulary learning: keyword mnemonics, a technique for memorizing vo-
cabulary through memorable associations with a target word via a verbal cue. Typically, creating verbal cues requires extensive human effort
and is quite time-consuming, necessitating an automated method that is more scalable. We propose a novel overgenerate-and-rank method
via prompting large language models (LLMs) to generate verbal cues and then ranking them according to psycholinguistic measures and
takeaways from a pilot user study. To assess cue quality, we conduct both an automated evaluation of imageability and coherence, as well as a
human evaluation involving English teachers and learners. Results show that LLM-generated mnemonics are comparable to human-generated
ones in terms of imageability, coherence, and perceived usefulness, but there remains plenty of room for improvement due to the diversity in
background and preference among language learners.
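The ranking stage can be sketched as a weighted combination of per-cue scores; here `imageability` and `coherence` are placeholders for the psycholinguistic measures the paper ranks with.

```python
# Sketch of the rank step in overgenerate-and-rank mnemonic generation.
def rank_cues(cues, imageability, coherence, w=0.5):
    """Order LLM-generated verbal cues by a weighted proxy score."""
    scored = [(w * imageability(c) + (1 - w) * coherence(c), c) for c in cues]
    return [c for _, c in sorted(scored, reverse=True)]
```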
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
ESG-Kor: A Korean Dataset for ESG-related Information Extraction and Practical Use Cases
Jaeyoung Lee, Geonyeong Son, Misuk Kim
With the expansion of pre-trained language model usage in recent years, the importance of datasets for performing tasks in specialized domains
has significantly increased. Therefore, we have built a Korean dataset called ESG-Kor to automatically extract Environmental, Social, and
Governance (ESG) information, which has recently gained importance. ESG-Kor is a dataset consisting of a total of 118,946 sentences
extracted for each ESG component from Korean companies’ sustainability reports and manually labeled according to objective
rules provided by ESG evaluation agencies. To verify the effectiveness and applicability of the ESG-Kor dataset, classification performance
was evaluated using several Korean pre-trained language models, with strong results. Additionally, by extending the
ESG classification model to documents of small and medium enterprises and extracting information based on ESG key issues and in-depth
analysis, we demonstrated potential and practical use cases in the ESG field.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Explaining Graph Neural Networks with Large Language Models: A Counterfactual Perspective on Molecule Graphs
Yinhan He, Zaiyi Zheng, Patrick Soga, Yaochen Zhu, Yushun Dong, Jundong Li
In recent years, Graph Neural Networks (GNNs) have become successful in molecular property prediction tasks such as toxicity analysis.
However, due to the black-box nature of GNNs, their outputs can be concerning in high-stakes decision-making scenarios, e.g., drug discov-
ery. Facing such an issue, Graph Counterfactual Explanation (GCE) has emerged as a promising approach to improve GNN transparency.
However, current GCE methods usually fail to take domain-specific knowledge into consideration, which can result in outputs that are not
easily comprehensible by humans. To address this challenge, we propose a novel GCE method, LLM-GCE, to unleash the power of large
language models (LLMs) in explaining GNNs for molecular property prediction. Specifically, we utilize an autoencoder to generate the
counterfactual graph topology from a set of counterfactual text pairs (CTPs) based on an input graph. Meanwhile, we also incorporate a CTP
dynamic feedback module to mitigate LLM hallucination, which provides intermediate feedback derived from the generated counterfactuals
as an attempt to give more faithful guidance. Extensive experiments demonstrate the superior performance of LLM-GCE. Our code is released
on https://github.com/YinhanHe123/new_LLM4GNNExplanation.
Nevertheless, many pre-training datasets are restricted by patient privacy concerns, potentially containing noise that can adversely affect
downstream performance. Moreover, the growing reliance on multi-modal generation exacerbates this issue because of its susceptibility to
adversarial attacks. To investigate how VLMs trained on adversarial noisy data perform on downstream medical tasks, we first craft noisy
upstream datasets using multi-modal adversarial attacks. Through our comprehensive analysis, we unveil that moderate noise enhances model
robustness and transferability, but increasing noise levels negatively impact downstream task performance. To mitigate this issue, we propose the Rectify Adversarial Noise (RAN) framework, a recipe designed to effectively defend against adversarial attacks and rectify the influence of upstream noise during fine-tuning.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
BSharedRAG: Backbone Shared Retrieval-Augmented Generation for the E-commerce Domain
Kaisi Guan, Qian Cao, Yuchong Sun, Xiting Wang, Ruihua Song
Retrieval Augmented Generation (RAG) systems are important in domains such as e-commerce, which involve many long-tail entities and frequently
updated information. Most existing works adopt separate modules for retrieval and generation, which may be suboptimal since the retrieval
task and the generation task cannot benefit from each other to improve performance. We propose a novel Backbone Shared RAG framework
(BSharedRAG). It first uses a domain-specific corpus to continually pre-train a base model as a domain-specific backbone model and then
trains two plug-and-play Low-Rank Adaptation (LoRA) modules based on the shared backbone to minimize retrieval and generation losses
respectively. Experimental results indicate that our proposed BSharedRAG outperforms baseline models by 5% and 13% in Hit@3 on two
datasets in retrieval evaluation and by 23% in terms of BLEU-3 in generation evaluation. Our codes, models, and dataset are available at
https://bsharedrag.github.io.
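Structurally, the idea is one shared backbone (frozen after domain pre-training) with two lightweight task branches; in the sketch below, plain linear layers stand in for the paper’s LoRA modules, so it illustrates the architecture only under that simplification.

```python
# Structural sketch of a shared-backbone RAG model with two task branches.
# Linear heads stand in for the paper's plug-and-play LoRA modules.
import torch.nn as nn

class SharedBackboneRAG(nn.Module):
    def __init__(self, backbone: nn.Module, hidden: int, vocab: int):
        super().__init__()
        self.backbone = backbone                       # domain pre-trained
        for p in self.backbone.parameters():
            p.requires_grad = False                    # only branches train
        self.retrieval_head = nn.Linear(hidden, hidden)   # embedding branch
        self.generation_head = nn.Linear(hidden, vocab)   # LM-head branch

    def forward(self, inputs, task: str):
        h = self.backbone(inputs)                      # shared representation
        if task == "retrieval":
            return self.retrieval_head(h)
        return self.generation_head(h)
```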
Characters are at the heart of every story, driving the plot and engaging readers. In this study, we explore the understanding of characters in
full-length books, which contain complex narratives and numerous interacting characters. We define two tasks: character description, which
generates a brief factual profile, and character analysis, which offers an in-depth interpretation, including character development, personality,
and social context. We introduce the BookWorm dataset, pairing books from the Gutenberg Project with human-written descriptions and
analyses. Using this dataset, we evaluate state-of-the-art long-context models in zero-shot and fine-tuning settings, utilizing both retrieval-
based and hierarchical processing for book-length inputs. Our findings show that retrieval-based approaches outperform hierarchical ones in
both tasks. Additionally, fine-tuned models using coreference-based retrieval produce the most factual descriptions, as measured by fact- and
entailment-based metrics. We hope our dataset, experiments, and analysis will inspire further research in character-based narrative under-
standing.
nuanced aspects. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering
insights for the development of new metrics.
of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that
open web navigation remains a major challenge.
in accuracy with zero-shot reviewers compared to GPT-4. It also outperforms GPT-4 by 46.01% in Kendall correlation on new domains,
indicating its transferability
future research directions in this vital field. A summary of the related literature is available at https://github.com/cui-shaobo/causality-papers.
challenging and we present some insights for future research. We will release our dataset and source code upon acceptance to facilitate further studies in this direction.
of the state-of-the-art knowledge editing algorithms is very limited, as they cannot reduce the cases of outdatedness and output inconsistency.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Forecasting Future International Events: A Reliable Dataset for Text-Based Event Modeling
Daehoon Gwak, Junwoo Park, Minho Park, ChaeHun Park, Hyunchan Lee, Edward Choi, Jaegul Choo
Predicting future international events from textual information, such as news articles, has tremendous potential for applications in global
policy, strategic decision-making, and geopolitics. However, existing datasets available for this task are often limited in quality, hindering
the progress of related research. In this paper, we introduce a novel dataset designed to address these limitations by leveraging the advanced
reasoning capabilities of large-language models (LLMs). Our dataset features high-quality scoring labels generated through advanced prompt
modeling and rigorously validated by domain experts in political science. We showcase the quality and utility of our dataset for real-world
event prediction tasks, demonstrating its effectiveness through extensive experiments and analysis. Furthermore, we publicly release our
dataset along with the full automation source code for data collection, labeling, and benchmarking, aiming to support and advance research
in text-based event prediction.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models
Kedi Chen, Qin Chen, Jie Zhou, Yishen He, Liang He
Though large language models (LLMs) achieve significant success in recent years, the hallucination issue remains a challenge, and numerous
benchmarks are proposed for hallucination detection. Nevertheless, some of these benchmarks are not naturally generated by LLMs but are
intentionally induced. Also, many merely focus on factuality hallucination while ignoring faithfulness hallucination. Additionally, although the dialogue pattern is more widely utilized in the era of LLMs, current benchmarks only concentrate on sentence-level and passage-
level hallucination. In this study, we propose DiaHalu, the first dedicated dialogue-level hallucination evaluation benchmark for LLMs to
our knowledge. Initially, we integrate the collected topics into system prompts and facilitate a dialogue between two LLMs. Subsequently,
we manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic
human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common
multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments through
some well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for
further research.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
STARD: A Chinese Statute Retrieval Dataset Derived from Real-life Queries by Non-professionals
Weihang Su, Yiran HU, Anzhe Xie, Qingyao Ai, Zibing Que, Ning Zheng, Yun Liu, Weixing Shen, Yiqun LIU
Statute retrieval aims to find relevant statutory articles for specific queries. This process is the basis of a wide range of legal applications
such as legal advice, automated judicial decisions, legal document drafting, etc. Existing statute retrieval benchmarks emphasize formal and
professional queries from sources like bar exams and legal case documents, thereby neglecting non-professional queries from the general
public, which often lack precise legal terminology and references. To address this gap, we introduce the STAtute Retrieval Dataset (STARD),
a Chinese dataset comprising 1,543 query cases collected from real-world legal consultations and 55,348 candidate statutory articles. Un-
like existing statute retrieval datasets, which primarily focus on professional legal queries, STARD captures the complexity and diversity
of real queries from the general public. Through a comprehensive evaluation of various retrieval baselines, we reveal that existing retrieval approaches all fall short on these real queries issued by non-professional users. The best method achieves only a Recall@100 of 0.907, sug-
gesting the necessity for further exploration and additional research in this area.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Ask the experts: sourcing a high-quality nutrition counseling dataset through Human-AI collaboration
Simone Balloccu, Ehud Reiter, Karen Jia-Hui Li, Rafael Sargsyan, Vivek Kumar, Diego Reforgiato, Daniele Riboni, Ondrej Dusek
Large Language Models (LLMs) are being employed by end-users for various tasks, including sensitive ones such as health counseling, dis-
regarding potential safety concerns. It is thus necessary to understand how adequately LLMs perform in such domains. We conduct a case
study on ChatGPT in nutrition counseling, a popular use-case where the model supports a user with their dietary struggles. We crowd-source
real-world diet-related struggles, then work with nutrition experts to generate supportive text using ChatGPT. Finally, experts evaluate the
safety and text quality of ChatGPT’s output. The result is the HAI-coaching dataset, containing 2.4K crowdsourced dietary struggles and
97K corresponding ChatGPT-generated and expert-annotated supportive texts. We analyse ChatGPT’s performance, discovering potentially
harmful behaviours, especially for sensitive topics like mental health. Finally, we use HAI-coaching to test open LLMs on various downstream
tasks, showing that even the latest models struggle to achieve good performance. HAI-coaching is available at https://github.com/uccollab/hai-coaching/.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
"Vorbes, ti Românes, te?" A Recipe to Train Powerful Romanian LLMs with English Instructions
Mihai Masala, Denis Ilie-Ablachim, Alexandru Dima, Dragos Georgian Corlatescu, Miruna-Andreea Zavelca, Ovio Olaru, Simina-Maria
Terian, Andrei Terian, Marius Leordeanu, Horia Velicu, Marius Popescu, Mihai Dascalu, Traian Rebedea
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have
been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds that in other languages.
To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and
release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks,
MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the
usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e.,
data, training and evaluation code, models) with the goal of supporting and encouraging research on Romanian LLMs while concurrently
creating a generalizable recipe adequate for other low or less-resourced languages.
ence speed of Speech DDPMs by simply redirecting the generative target to the wavelet domain. This method not only achieves comparable
or superior performance to the original model in speech synthesis tasks but also demonstrates its versatility. By investigating and utilizing
different wavelet bases, our approach proves effective not just in speech synthesis, but also in speech enhancement.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Scaling Properties of Speech Language Models
Santiago Cuervo, Ricard Marxer
Speech Language Models (SLMs) aim to learn language from raw audio, without textual resources. Despite significant advances, our current
models exhibit weak syntactic and semantic abilities. However, if the scaling properties of neural language models hold for the speech modality,
these abilities will improve as the amount of compute used for training increases. In this paper, we use models of this scaling behavior to
estimate the scale at which our current methods will yield a SLM with the English proficiency of text-based Large Language Models (LLMs).
We establish a strong correlation between pre-training loss and downstream syntactic and semantic performance in SLMs and LLMs, which
results in predictable scaling of linguistic performance. We show that the linguistic performance of SLMs scales up to three orders of magni-
tude more slowly than that of text-based LLMs. Additionally, we study the benefits of synthetic data designed to boost semantic understanding
and the effects of coarser speech tokenization.
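The scaling analysis rests on power-law fits of the form loss(C) = a * C^(-alpha); a minimal numpy fit in log-log space, with purely illustrative numbers rather than the paper’s data, looks like this:

```python
# Sketch of fitting a power-law scaling trend in log-log space.
# The compute/loss values below are illustrative only.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])   # training compute C (FLOPs)
loss = np.array([3.2, 2.9, 2.65, 2.45])        # pre-training loss (made up)

slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"loss ~ {np.exp(log_a):.3f} * C^{slope:.3f}")   # slope is -alpha
```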
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
EmphAssess : a Prosodic Benchmark on Assessing Emphasis Transfer in Speech-to-Speech Models
Maureen de Seyssel, Antony D’Avirro, Adina Williams, Emmanuel Dupoux
We introduce EmphAssess, a prosodic benchmark designed to evaluate the capability of speech-to-speech models to encode and reproduce
prosodic emphasis. We apply this to two tasks: speech resynthesis and speech-to-speech translation. In both cases, the benchmark evaluates
the ability of the model to encode emphasis in the speech input and accurately reproduce it in the output, potentially across a change of speaker
and language. As part of the evaluation pipeline, we introduce EmphaClass, a new model that classifies emphasis at the frame or word level.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach
Maxime Poli, Emmanuel Chemla, Emmanuel Dupoux
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible. Generating speech through
a pipeline that operates at the text level typically loses nuances, intonations, and non-verbal vocalizations. Modeling directly from speech
opens up the path to more natural and expressive systems. On the other hand, speech-only systems require up to three orders of magnitude
more data to catch up to their text-based counterparts in terms of their semantic abilities. We show that fine-tuning speech representation
models on phoneme classification leads to more context-invariant representations, and language models trained on these units achieve com-
parable lexical comprehension to ones trained on a hundred times more data.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
On Mitigating Performance Disparities in Multilingual Speech Recognition
Monorama Swain, Anna Katrine van Zee, Anders Søgaard
How far have we come in mitigating performance disparities across genders in multilingual speech recognition? We compare the impact on
gender disparity of different fine-tuning algorithms for automated speech recognition across model sizes, languages and gender. We look
at both performance-focused and fairness-promoting algorithms. Across languages, we see slightly better performance for female speakers
for larger models regardless of the fine-tuning algorithm. The best trade-off between performance and parity is found using adapter fusion.
Fairness-promoting fine-tuning algorithms (Group-DRO and Spectral Decoupling) hurt performance compared to adapter fusion with only
slightly better performance parity. LoRA increases disparities slightly. Fairness-mitigating fine-tuning techniques led to slightly higher vari-
ance in performance across languages, with the exception of adapter fusion.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Muting Whisper: A Universal Acoustic Adversarial Attack on Speech Foundation Models
Vyas Raina, Rao Ma, Charles McGhee, Kate Knill, Mark Gales
Recent developments in large speech foundation models like Whisper have led to their widespread use in many automatic speech recognition
(ASR) applications. These systems incorporate ’special tokens’ in their vocabulary, such as <|endoftext|>, to guide their language
generation process. However, we demonstrate that these tokens can be exploited by adversarial attacks to manipulate the model’s behavior.
We propose a simple yet effective method to learn a universal acoustic realization of Whisper’s <|endoftext|> token, which, when
prepended to any speech signal, encourages the model to ignore the speech and only transcribe the special token, effectively ’muting’ the
model. Our experiments demonstrate that the same, universal 0.64-second adversarial audio segment can successfully mute a target Whisper
ASR model for over 97% of speech samples. Moreover, we find that this universal adversarial audio segment often transfers to new datasets
and tasks. Overall this work demonstrates the vulnerability of Whisper models to ‘muting’ adversarial attacks, where such attacks can pose
both risks and potential benefits in real-world settings: for example the attack can be used to bypass speech moderation systems, or conversely
the attack can also be used to protect private speech data.
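Conceptually, the universal attack optimizes one shared audio prefix so that the model’s first predicted token is end-of-text for every clip. The sketch below assumes hypothetical `asr_first_token_logits` and `eot_id` handles rather than the real Whisper API.

```python
# Conceptual sketch of learning a universal 'mute' prefix.
# `asr_first_token_logits(audio)` (vocab-sized logits for the first output
# token) and `eot_id` are placeholders, not the actual Whisper interface.
import torch

def learn_mute_prefix(asr_first_token_logits, clips, eot_id,
                      prefix_len=10240, steps=100, lr=1e-3):
    prefix = torch.zeros(prefix_len, requires_grad=True)
    opt = torch.optim.Adam([prefix], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for clip in clips:                    # universal: one shared prefix
            audio = torch.cat([prefix, clip])
            logits = asr_first_token_logits(audio)
            # Maximize probability of the end-of-text token coming first.
            loss = loss - torch.log_softmax(logits, dim=-1)[eot_id]
        opt.zero_grad()
        loss.backward()
        opt.step()
    return prefix.detach()
```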
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control
Haozhe Chen, Run Chen, Julia Hirschberg
While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select
emotion and control intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot
demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent
advances in foundation voice cloning models. Based on the few-shot capability of our emotion control framework, we propose two methods
to apply emotion control on emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced
emotions. To facilitate a more systematic emotional speech synthesis field, we introduce a set of evaluation metrics designed to rigorously
assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our
emotion control framework effectively embeds emotions into speech and surpasses emotion expressiveness of commercial TTS services.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers
Yuzhe Gu, Enmao Diao
Neural speech codecs aim to compress input signals into minimal bits while maintaining content quality in a low-latency manner. However,
existing neural codecs often trade model complexity for reconstruction performance. These codecs primarily use convolutional blocks for
feature transformation, which are not inherently suited for capturing the local redundancies in speech signals. To compensate, they require ei-
ther adversarial discriminators or a large number of model parameters to enhance audio quality. In response to these challenges, we introduce
the Efficient Speech Codec (ESC), a lightweight, parameter-efficient speech codec based on a cross-scale residual vector quantization scheme
and transformers. Our model employs mirrored hierarchical window transformer blocks and performs step-wise decoding from coarse-to-
fine feature representations. To enhance bitrate efficiency, we propose a novel combination of vector quantization techniques along with a
pre-training paradigm. Extensive experiments demonstrate that ESC can achieve high-fidelity speech reconstruction with significantly lower
model complexity, making it a promising alternative to existing convolutional audio codecs.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Towards Robust Speech Representation Learning for Thousands of Languages
William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji
Watanabe
Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However,
models are still far from supporting the world’s 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained
on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 million hours
of speech from existing publicly accessible corpora with a newly created corpus of 7400+ hours from 4057 languages, which will be publicly
released. To handle the diverse conditions of multilingual speech data, we augment the typical SSL masked prediction approach with a novel
dereverberation objective, increasing robustness. We evaluate XEUS on several benchmarks, and show that it consistently outperforms or
achieves comparable results to state-of-the-art (SOTA) SSL models across a variety of tasks. XEUS sets a new SOTA on the ML-SUPERB
benchmark: it outperforms MMS 1B and w2v-BERT 2.0 v2 by 0.8% and 4.4% respectively, despite having fewer parameters or pre-training
data. Checkpoints, code, and data are found in https://www.wavlab.org/activities/2024/xeus/.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Unveiling the Role of Pretraining in Direct Speech Translation
Belen Alastruey, Gerard I. Gállego, Marta R. Costa-jussà
Direct speech-to-text translation systems encounter an important drawback in data scarcity. A common solution consists of pretraining the
encoder on automatic speech recognition, hence losing efficiency in the training process. In this study, we compare the training dynamics of
a system using a pretrained encoder, the conventional approach, and one trained from scratch. We observe that, throughout the training, the
randomly initialized model struggles to incorporate information from the speech inputs for its predictions. Hence, we hypothesize that this
issue stems from the difficulty of effectively training an encoder for direct speech translation. While a model trained from scratch needs to
learn acoustic and semantic modeling simultaneously, a pretrained one can just focus on the latter. Based on these findings, we propose a
subtle change in the decoder cross-attention to integrate source information from earlier steps in training. We show that with this change, the
model trained from scratch can achieve comparable performance to the pretrained one, while reducing the training time.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Optimizing Rare Word Accuracy in Direct Speech Translation with a Retrieval-and-Demonstration Approach
Siqi Li, Danni Liu, Jan Niehues
Direct speech translation (ST) models often struggle with rare words. Incorrect translation of these words can have severe consequences,
impacting translation quality and user trust. While rare word translation is inherently challenging for neural models due to sparse learning
signals, real-world scenarios often allow access to translations of past recordings on similar topics. To leverage these valuable resources, we
propose a retrieval-and-demonstration approach to enhance rare word translation accuracy in direct ST models. First, we adapt existing ST
models to incorporate retrieved examples for rare word translation, which allows the model to benefit from prepended examples, similar to
in-context learning. We then develop a cross-modal (speech-to-speech, speech-to-text, text-to-text) retriever to locate suitable examples. We
demonstrate that standard ST models can be effectively adapted to leverage examples for rare word translation, improving rare word trans-
lation accuracy over the baseline by 17.6% with gold examples and 8.5% with retrieved examples. Moreover, our speech-to-speech retrieval
approach outperforms other modalities and exhibits higher robustness to unseen speakers. Our code is publicly available.
spoken dialogue, with just 2k hours of real-world spoken dialogue data. Synchronous LLMs outperform state-of-the-art in dialogue mean-
ingfulness while maintaining naturalness. Finally, we demonstrate the model’s ability to participate in full-duplex dialogue by simulating
interaction between two agents trained on different datasets, while considering Internet-scale latencies of up to 240 ms.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Casablanca: Data and Models for Multidialectal Arabic Speech Recognition
Bashar Talafha, Karima Kadaoui, Samar Mohamed Magdy, Mariem Habiboullah, Chafei Mohamed Chafei, Ahmed Oumar El-Shangiti, Hiba
Zayed, Mohamedou cheikh tourad, Rahaf Alhamouri, Rwaa Assi, Aisha Alraeesi, Hour Mohamed, Fakhraddin Alwajih, Abdelrahman Mo-
hamed, Abdellah EL MEKKI, El Moatez Billah Nagoudi, Benelhadj Djelloul Mama Saadia, Hamzah A. Alsayadi, Walid Al-Dhabyani, Sara
Shatnawi, Yasir ECH-CHAMMAKHY, AMAL MAKOUAR, Yousra Berrachedi, Mustafa Jarrar, Shady Shehata, Ismail Berrada, Muhammad
Abdul-Mageed
In spite of the recent progress in speech processing, the majority of world languages and dialects remain uncovered. This situation only
furthers an already wide technological divide, thereby hindering technological and socioeconomic inclusion. This challenge is largely due
to the absence of datasets that can empower diverse speech systems. In this paper, we seek to mitigate this obstacle for a number of Arabic
dialects by presenting Casablanca, a large-scale community-driven effort to collect and transcribe a multi-dialectal Arabic dataset. The dataset
covers eight dialects: Algerian, Egyptian, Emirati, Jordanian, Mauritanian, Moroccan, Palestinian, and Yemeni, and includes annotations for
transcription, gender, dialect, and code-switching. We also develop a number of strong baselines exploiting Casablanca. The project page for
Casablanca is accessible at: www.dlnlp.ai/speech/casablanca.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
MultiVerse: Efficient and Expressive Zero-Shot Multi-Task Text-to-Speech
Taejun Bak, Youngsik Eom, SeungJae Choi, Young-Sun Joo
Text-to-speech (TTS) systems that scale up the amount of training data have achieved significant improvements in zero-shot speech synthesis.
However, these systems have certain limitations: they require a large amount of training data, which increases costs, and often overlook
prosody similarity. To address these issues, we propose MultiVerse, a zero-shot multi-task TTS system that is able to perform TTS or speech
style transfer in zero-shot and cross-lingual conditions. MultiVerse requires much less training data than traditional data-driven approaches.
To ensure zero-shot performance even with limited data, we leverage source-filter theory-based disentanglement, utilizing the prompt for
modeling filter-related and source-related representations. Additionally, to further enhance prosody similarity, we adopt a prosody modeling
approach combining prompt-based autoregressive and non-autoregressive methods. Evaluations demonstrate the remarkable zero-shot multi-
task TTS performance of MultiVerse and show that MultiVerse not only achieves zero-shot TTS performance comparable to data-driven TTS
systems with much less data, but also significantly outperforms other zero-shot TTS systems trained with the same small amount of data.
In particular, our novel prosody modeling technique significantly contributes to MultiVerse’s ability to generate speech with high prosody
similarity to the given prompts.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Where Visual Speech Meets Language: VSP-LLM Framework for Efficient and Context-Aware Visual Speech Processing
Jeonghun Yeo, Seunghee Han, Minsu Kim, Yong Man Ro
In visual speech processing, context modeling capability is one of the most important requirements due to the ambiguous nature of lip move-
ments. For example, homophenes, words that share identical lip movements but produce different sounds, can be distinguished by considering
the context. In this paper, we propose a novel framework, namely Visual Speech Processing incorporated with LLMs (VSP-LLM), to maxi-
mize the context modeling ability by bringing the overwhelming power of LLMs. Specifically, VSP-LLM is designed to perform multi-tasks
of visual speech recognition and translation, where the given instructions control the type of task. The input video is mapped to the input
latent space of an LLM by employing a self-supervised visual speech model. Focused on the fact that there is redundant information in input
frames, we propose a novel deduplication method that reduces the embedded visual features by employing visual speech units. Through the
proposed deduplication and low-rank adaptation, VSP-LLM can be trained in a computationally efficient manner. On the MuAViC translation benchmark, we demonstrate that VSP-LLM trained on just 30 hours of labeled data translates more effectively than a recent model trained with 433 hours of data.
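As an editorial illustration of the deduplication step (not the authors' code), consecutive frames that map to the same discrete visual speech unit can be merged, for instance by average-pooling their features. A minimal sketch, with assumed names and pooling choice:

import torch

def deduplicate_features(features: torch.Tensor, units: torch.Tensor) -> torch.Tensor:
    """Merge consecutive frames that share the same discrete speech unit.

    features: (T, D) frame-level visual features
    units:    (T,)  discrete unit id per frame (e.g., from k-means over
                    self-supervised representations)
    Returns (T', D) with T' <= T, average-pooling each run of equal units.
    """
    merged, start = [], 0
    for t in range(1, len(units) + 1):
        # close the current run when the unit changes or the sequence ends
        if t == len(units) or units[t] != units[start]:
            merged.append(features[start:t].mean(dim=0))
            start = t
    return torch.stack(merged)

feats, units = torch.randn(6, 4), torch.tensor([3, 3, 7, 7, 7, 2])
print(deduplicate_features(feats, units).shape)  # torch.Size([3, 4])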
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Phonetic and Lexical Discovery of Canine Vocalization
Sinong Wang, Xingyuan Li, Chunhao Zhang, Mengyue Wu, Kenny Q. Zhu
This paper attempts to discover communication patterns automatically within dog vocalizations in a data-driven approach, breaking the barrier of previous approaches that rely on human prior knowledge and limited data. We present a self-supervised approach with HuBERT, en-
abling the accurate classification of phones, and an adaptive grammar induction method that identifies phone sequence patterns that suggest
a preliminary vocabulary within dog vocalizations. Our results show that a subset of this vocabulary has substantial causality relations with
certain canine activities, suggesting signs of stable semantics associated with these “words”.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech
Youngjae Kim, Yejin Jeon, Gary Lee
The difficulty of acquiring abundant, high-quality data, especially in multi-lingual contexts, has sparked interest in addressing low-resource
scenarios. Moreover, the current literature relies on fixed expressions from language IDs, which results in the inadequate learning of language
representations, and the failure to generate speech in unseen languages. To address these challenges, we propose a novel method that directly
extracts linguistic features from audio input while effectively filtering out miscellaneous acoustic information including speaker-specific
attributes like timbre. Subjective and objective evaluations affirm the effectiveness of our approach for multi-lingual text-to-speech, and high-
light its superiority in low-resource transfer learning for previously unseen languages.
Nov 13 (Wed) 10:30-12:00 - Riverfront Hall
Modeling Gender and Dialect Bias in Automatic Speech Recognition
Camille Harris, Chijioke Mgbahurike, Neha Kumar, Diyi Yang
Dialect and gender-based biases have become an area of concern in language-dependent AI systems, including automatic speech recognition (ASR), which processes speech audio into text. These potential biases raise concern for discriminatory outcomes with AI systems depending on demographics, particularly gender discrimination against women and racial discrimination against minorities with ethnic or cultural English dialects. As such, we aim to evaluate the performance of ASR systems across different genders and across dialects of English. Concretely, we take a deep dive into the performance of ASR systems on men and women across four US-based English dialects: Standard
American English (SAE), African American Vernacular English (AAVE), Chicano English, and Spanglish. To do this, we construct a labeled
dataset of 13 hours of podcast audio, transcribed by speakers of the represented dialects. We then evaluate zero-shot performance of different
automatic speech recognition models on our dataset, and further finetune models to better understand how finetuning can impact performance.
Our work fills the gap of investigating possible gender disparities within underrepresented dialects.
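The per-group evaluation this study performs can be sketched as follows; the jiwer library is one common choice for computing word error rate (WER), and the field names here are illustrative assumptions, not the paper's pipeline.

from collections import defaultdict
import jiwer

def wer_by_group(samples):
    """samples: dicts with 'dialect', 'gender', 'reference' (gold
    transcript), and 'hypothesis' (ASR output)."""
    refs, hyps = defaultdict(list), defaultdict(list)
    for s in samples:
        key = (s["dialect"], s["gender"])
        refs[key].append(s["reference"])
        hyps[key].append(s["hypothesis"])
    # corpus-level WER for each (dialect, gender) cell
    return {key: jiwer.wer(refs[key], hyps[key]) for key in refs}

samples = [
    {"dialect": "SAE", "gender": "F",
     "reference": "the weather is nice", "hypothesis": "the weather is nice"},
    {"dialect": "AAVE", "gender": "M",
     "reference": "we finna head out", "hypothesis": "we fin a head out"},
]
for group, wer in sorted(wer_by_group(samples).items()):
    print(group, round(wer, 3))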
Demo 5
Nov 13 (Wed) 16:00-17:30 - Room: Riverfront Hall
intermediate goals, achieving seamless knowledge transitions becomes tricky. This paper proposes a novel Bootstrapped Policy Learning
(BPL) framework, which adaptively tailors a progressively challenging subgoal curriculum for each complex goal through goal shaping, en-
suring a smooth knowledge transition. Goal shaping involves goal decomposition and evolution, decomposing complex goals into subgoals
with solvable maximum difficulty and progressively increasing difficulty as the policy improves. Moreover, to enhance BPL’s adaptability
across various environments, we explore various combinations of goal decomposition and evolution within BPL, and identify two univer-
sal curriculum patterns that remain effective across different dialogue environments, independent of specific environmental constraints. By
integrating the summarized curriculum patterns, our BPL has exhibited efficacy and versatility across four publicly available datasets with
different difficulty levels.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Retrospex: Language Agent Meets Offline Reinforcement Learning Critic
Yufei Xiang, Yiqun Shen, Yeqin Zhang, Nguyen Cam-Tu
Large language models (LLMs) possess extensive knowledge and commonsense reasoning capabilities, making them valuable for creating
powerful agents. However, existing LLM agent frameworks have not fully utilized past experiences for improvement. This work introduces a
new LLM-based agent framework called Retrospex, which addresses this challenge by analyzing past experiences in depth. Unlike previous
approaches, Retrospex does not directly integrate experiences into the LLM's context. Instead, it combines the LLM's action likelihood with
action values estimated by a Reinforcement Learning (RL) Critic, which is trained on past experiences through an offline retrospection pro-
cess. Additionally, Retrospex employs a dynamic action rescoring mechanism that increases the importance of experience-based values for
tasks that require more interaction with the environment. We evaluate Retrospex in ScienceWorld, ALFWorld and Webshop environments,
demonstrating its advantages over strong baselines.
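The action-selection rule described above can be sketched as follows: each candidate action is scored by combining the LLM's log-likelihood with the offline critic's value estimate, weighted more heavily toward experience as interaction grows. The linear schedule and names are assumptions for illustration, not the paper's exact mechanism.

def select_action(candidates, llm_logprob, critic_value, step, max_steps):
    """candidates: action strings; llm_logprob(a) and critic_value(a)
    return floats; step/max_steps tracks progress through the episode."""
    w = step / max_steps  # lean more on experience-based values over time
    scored = [((1 - w) * llm_logprob(a) + w * critic_value(a), a)
              for a in candidates]
    return max(scored)[1]

# toy example with stub scorers
pick = select_action(
    ["open drawer", "go to kitchen"],
    llm_logprob=lambda a: {"open drawer": -1.2, "go to kitchen": -0.7}[a],
    critic_value=lambda a: {"open drawer": 0.9, "go to kitchen": 0.1}[a],
    step=8, max_steps=10,
)
print(pick)  # "open drawer": the critic's experience outweighs the LLM prior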
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Synergizing In-context Learning with Hints for End-to-end Task-oriented Dialog Systems
Vishal Vivek Saley, Rocktim Jyoti Das, Dinesh Raghu, Mausam
End-to-end Task-Oriented Dialog (TOD) systems typically require extensive training datasets to perform well. In contrast, large language
model (LLM) based TOD systems can excel even with limited data due to their ability to learn tasks through in-context exemplars. However,
these models lack alignment with the style of responses in training data and often generate comprehensive responses, making it difficult for
users to grasp the information quickly. In response, we propose SyncTOD that synergizes LLMs with task-specific hints to improve alignment
in low-data settings. SyncTOD employs small auxiliary models to provide hints and select exemplars for in-context prompts. With ChatGPT,
SyncTOD achieves superior performance compared to LLM-based baselines and SoTA models in low-data settings, while retaining competi-
tive performance in full-data settings.
following Contrastive Decoding (CAPID). This framework generates dynamic, context-aware slot queries, effectively improving the model’s
transferability. Our context-aware auto-prompting approach tailors slot queries to the current dialogue context, increasing flexibility and
reducing ambiguities. Additionally, an instruction-following contrastive decoding strategy helps reduce errors related to off-topic slots by
penalizing deviations from the provided instructions. Extensive experiments on two datasets, with varying model sizes (from 60M to 7B),
demonstrate the superior performance of CAPID. The source code is provided for reproducibility.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Evaluating the Effectiveness of Large Language Models in Establishing Conversational Grounding
Biswesh Mohapatra, Manav Nitin Kapadnis, Laurent Romary, Justine Cassell
Conversational grounding, vital for building dependable dialog systems, involves ensuring a mutual understanding of shared information.
Despite its importance, there has been limited research on this aspect of conversation in recent years, especially after the advent of Large
Language Models (LLMs). Previous studies have highlighted the shortcomings of pre-trained language models in conversational grounding.
However, most testing for conversational grounding capabilities involves human evaluations that are costly and time-consuming. This has led
to a lack of testing across multiple models of varying sizes, a critical need given the rapid rate of new model releases. This gap in research
becomes more significant considering recent advances in language models, which have led to new emergent capabilities. In this paper, we
aim to evaluate the performance of LLMs in various aspects of conversational grounding and analyze why some models perform better than
others. We demonstrate a direct correlation between the size of the pre-training data and conversational grounding abilities, meaning that models have independently acquired a specific form of pragmatic capability from larger pre-training datasets. Finally, we propose ways to enhance
the capabilities of the models that lag in this aspect.
In the realm of multi-intent spoken language understanding, recent advancements have leveraged the potential of prompt learning frameworks.
However, critical gaps exist in these frameworks: the lack of explicit modeling of dual-task dependencies and the oversight of task-specific
semantic differences among utterances. To address these shortcomings, we propose DC-Instruct, a novel generative framework based on
Dual-task Inter-dependent Instructions (DII) and Supervised Contrastive Instructions (SCI). Specifically, DII guides large language models
(LLMs) to generate labels for one task based on the other task’s labels, thereby explicitly capturing dual-task inter-dependencies. Moreover,
SCI leverages utterance semantic differences by guiding LLMs to determine whether a pair of utterances share the same or similar labels. This improves LLMs' ability to extract and discriminate task-specific semantics, thus enhancing their SLU reasoning abilities. Extensive
experiments on public benchmark datasets show that DC-Instruct markedly outperforms current generative models and state-of-the-art meth-
ods, demonstrating its effectiveness in enhancing dialogue language understanding and reasoning.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
GDPO: Learning to Align Language Models with Diversity Using GFlowNets
Oh Joon Kwon, Daiki E. Matsunaga, Kee-Eung Kim
A critical component of the current generation of language models is preference alignment, which aims to precisely control the model’s
behavior to meet human needs and values. The most notable among such methods are Reinforcement Learning from Human Feedback (RLHF) and its offline variant Direct Preference Optimization (DPO), both of which seek to maximize a reward model based on human preferences.
In particular, DPO derives reward signals directly from the offline preference data, but in doing so overfits the reward signals and generates
suboptimal responses that may contain human biases in the dataset. In this work, we propose a practical application of a diversity-seeking RL
algorithm called GFlowNet-DPO (GDPO) in an offline preference alignment setting to curtail such challenges. Empirical results show GDPO
can generate far more diverse responses than the baseline methods while remaining relatively aligned with human values in dialog generation and summarization tasks.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Unsupervised Extraction of Dialogue Policies from Conversations
Makesh Narsimhan Sreedhar, Traian Rebedea, Christopher Parisien
Dialogue policies play a crucial role in developing task-oriented dialogue systems, yet their development and maintenance are challenging
and typically require substantial effort from experts in dialogue modeling. While large amounts of conversational data are often available for the task at hand, practitioners lack an effective solution for extracting dialogue policies from this data. In this paper, we address
this gap by first illustrating how Large Language Models (LLMs) can be instrumental in extracting dialogue policies from datasets, through
the conversion of conversations into a unified intermediate representation consisting of canonical forms. We then propose a novel method
for generating dialogue policies utilizing a controllable and interpretable graph-based methodology. By combining canonical forms across
conversations into a flow network, we find that running graph traversal algorithms helps in extracting dialogue flows. These flows are a
better representation of the underlying interactions than flows extracted by prompting LLMs. Our technique focuses on giving conversation
designers greater control, offering a productivity tool to improve the process of developing dialogue policies.
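The flow-network construction lends itself to a compact sketch: canonical forms become nodes, observed transitions become weighted edges, and a traversal recovers a dominant flow. The greedy walk below is an illustrative stand-in for the paper's graph traversal algorithms.

import networkx as nx

conversations = [  # each conversation as a sequence of canonical forms
    ["greet", "ask_account_id", "verify_identity", "resolve_issue", "goodbye"],
    ["greet", "ask_account_id", "verify_identity", "escalate", "goodbye"],
    ["greet", "verify_identity", "resolve_issue", "goodbye"],
]

G = nx.DiGraph()
for canon in conversations:
    for src, dst in zip(canon, canon[1:]):
        # accumulate transition counts as edge weights
        w = G.get_edge_data(src, dst, default={"weight": 0})["weight"]
        G.add_edge(src, dst, weight=w + 1)

# greedily follow the most frequent transition from the start node
node, flow = "greet", ["greet"]
while G.out_degree(node) > 0 and len(flow) < 10:
    node = max(G.successors(node), key=lambda n: G[node][n]["weight"])
    flow.append(node)
print(" -> ".join(flow))  # greet -> ask_account_id -> verify_identity -> ...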
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Generative Subgraph Retrieval for Knowledge Graph-Grounded Dialog Generation
Jinyoung Park, Minseok Joo, Joo-Kyung Kim, Hyunwoo J. Kim
Knowledge graph-grounded dialog generation requires retrieving a dialog-relevant subgraph from the given knowledge base graph and in-
tegrating it with the dialog history. Previous works typically represent the graph using an external encoder, such as graph neural networks,
and retrieve relevant triplets based on the similarity between single-vector representations of triplets and the dialog history. However, these
external encoders fail to leverage the rich knowledge of pretrained language models, and the retrieval process is also suboptimal due to the in-
formation bottleneck caused by the single-vector abstraction of the dialog history. In this work, we propose Dialog generation with Generative
Subgraph Retrieval (DialogGSR), which retrieves relevant knowledge subgraphs by directly generating their token sequences on top of lan-
guage models. For effective generative subgraph retrieval, we introduce two key methods: (i) structure-aware knowledge graph linearization
with self-supervised graph-specific tokens and (ii) graph-constrained decoding utilizing graph structural proximity-based entity informative-
ness scores for valid and relevant generative retrieval. DialogGSR achieves state-of-the-art performance in knowledge graph-grounded dialog
generation, as demonstrated on OpenDialKG and KOMODIS datasets.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Is this the real life? Is this just fantasy? The Misleading Success of Simulating Social Interactions With LLMs
Xuhui Zhou, Zhe Su, Tiwalayo Eisape, Hyunwoo Kim, Maarten Sap
Recent advances in large language models (LLMs) have enabled richer social simulations, allowing for the study of various social phenomena. However, most recent work has used a more omniscient perspective on these simulations (e.g., a single LLM to generate all interlocutors), which is fundamentally at odds with the non-omniscient, information-asymmetric interactions that involve humans and AI agents in the real world.
To examine these differences, we develop an evaluation framework to simulate social interactions with LLMs in various settings (omniscient,
non-omniscient). Our experiments show that LLMs perform better in unrealistic, omniscient simulation settings but struggle in ones that
more accurately reflect real-world conditions with information asymmetry. Moreover, we illustrate the limitations inherent in learning from
omniscient simulations. Our findings indicate that addressing information asymmetry remains a fundamental challenge for LLM-based agents.
needs and preferences, tailored assistance according to user personas is crucial. In this paper, we introduce ABLE (Adaptive, Bespoke,
Listen and Empathetic), a Conversational Support System for Physical Disabilities. By tracking user personas, including gender, age, and
personality traits based on the OCEAN model, ABLE ensures that support interactions are uniquely tailored to each user’s characteristics
and preferences. Moreover, integrating politeness and empathy levels in responses enhances user satisfaction and engagement, fostering a
supportive and respectful environment. The development of ABLE involves compiling a comprehensive conversational dataset enriched with
user profile annotations. Leveraging reinforcement learning techniques and diverse reward mechanisms, ABLE trains a model to generate
responses aligned with individual user profiles while maintaining appropriate levels of politeness and empathy. Based on rigorous empir-
ical analysis encompassing automatic and human evaluation metrics based on persona-consistency, politeness accuracy, empathy accuracy,
perplexity, and conversation coherence, the efficacy of ABLE is assessed. Our findings underscore ABLE’s success in delivering tailored
support to individuals grappling with physical disabilities. To the best of our knowledge, this is the first attempt at building a user-persona-oriented support system for physical disabilities.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
A Fairness-Driven Method for Learning Human-Compatible Negotiation Strategies
Ryan Shea, Zhou Yu
Despite recent advancements in AI and NLP, negotiation remains a difficult domain for AI agents. Traditional game theoretic approaches
that have worked well for two-player zero-sum games struggle in the context of negotiation due to their inability to learn human-compatible
strategies. On the other hand, approaches that only use human data tend to be domain-specific and lack the theoretical guarantees provided by
strategies grounded in game theory. Motivated by the notion of fairness as a criterion for optimality in general sum games, we propose a ne-
gotiation framework called FDHC which incorporates fairness into both the reward design and search to learn human-compatible negotiation
strategies. Our method includes a novel, RL+search technique called LGM-Zero which leverages a pre-trained language model to retrieve
human-compatible offers from large action spaces. Our results show that our method is able to achieve more egalitarian negotiation outcomes
and improve negotiation quality.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
From Pixels to Personas: Investigating and Modeling Self-Anthropomorphism in Human-Robot Dialogues
Yu Li, Devamanyu Hazarika, Di Jin, Julia Hirschberg, Yang Liu
Self-anthropomorphism in robots manifests itself through their display of human-like characteristics in dialogue, such as expressing pref-
erences and emotions. Our study systematically analyzes self-anthropomorphic expression within various dialogue datasets, outlining the
contrasts between self-anthropomorphic and non-self-anthropomorphic responses in dialogue systems. We show significant differences in
these two types of responses and propose transitioning from one type to the other. We also introduce Pix2Persona, a novel dataset aimed
at developing ethical and engaging AI systems in various embodiments. This dataset preserves the original dialogues from existing corpora
and enhances them with paired responses: self-anthropomorphic and non-self-anthropomorphic for each original bot response. Our work not
only uncovers a new category of bot responses that were previously under-explored but also lays the groundwork for future studies about
dynamically adjusting self-anthropomorphism levels in AI systems to align with ethical standards and user expectations.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Diverse and Effective Synthetic Data Generation for Adaptable Zero-Shot Dialogue State Tracking
James D. Finch, Jinho D. Choi
We demonstrate substantial performance gains in zero-shot dialogue state tracking (DST) by enhancing training data diversity through synthetic data generation. Existing DST datasets are severely limited in the number of application domains and slot types they cover due to the high costs of data collection, restricting their adaptability to new domains. This work addresses this challenge with a novel, fully automatic data generation approach that creates synthetic zero-shot DST datasets. Distinguished from previous methods, our approach can generate dialogues across a massive range of application domains, complete with silver-standard dialogue state annotations and slot descriptions. This technique is used to create the D0T dataset for training zero-shot DST models, encompassing an unprecedented 1,000+ domains. Experiments
on the MultiWOZ benchmark show that training models on diverse synthetic data improves Joint Goal Accuracy by 6.7%, achieving results
competitive with models 13.5 times larger than ours.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Xiaochen Li, Zheng Xin Yong, Stephen Bach
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore
zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual
generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can signif-
icantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations
drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM,
Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilin-
guality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence
retrieval can predict the cross-lingual transferability of DPO preference tuning.
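For reference, the published DPO objective applied here to English-only preference pairs can be sketched as follows; this is the standard loss, not the authors' training code.

import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Inputs are summed token log-probs of the preferred (w) and
    dispreferred (l) responses under the policy and a frozen reference."""
    chosen_margin = policy_logp_w - ref_logp_w      # implicit reward, chosen
    rejected_margin = policy_logp_l - ref_logp_l    # implicit reward, rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = dpo_loss(
    policy_logp_w=torch.tensor([-12.3]), policy_logp_l=torch.tensor([-15.1]),
    ref_logp_w=torch.tensor([-13.0]), ref_logp_l=torch.tensor([-14.2]),
)
print(float(loss))  # smaller when the policy prefers the non-toxic response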
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Active Listening: Personalized Question Generation in Open-Domain Social Conversation with User Model Based Prompting
Kevin Bowden, Yue Fan, Winson Chen, Wen Cui, Davan Harrison, Marilyn Walker, Xin Eric Wang
Large language models (LLMs) capable of casual conversation have recently become widely available. We hypothesize that users of con-
versational systems want a more personalized experience, and existing work shows that users are highly receptive to personalized questions
(PQs). Question Generation tasks, however, focus on factual questions from textual excerpts. To create a PQ generator, we first identify
over 400 real user interests by anonymously aggregating 39K user models. We then populate prompt templates with these 400 interests and
use an LLM to generate PQs customized to user interests. The result is PerQs, a novel corpus of 19K question/answer pairs. We evaluate
PerQs at scale in the unique context of the Alexa Prize. Our results show significant positive effects on perceived conversation quality. We
then fine-tune, deploy, and evaluate PerQy, a neural model that generates PQs in real-time. When evaluated against several competitive LLM
baselines, PerQy produced the most natural and engaging responses.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
UrbanLLM: Autonomous Urban Activity Planning and Management with Large Language Models
Yue Jiang, Qin Chao, Yile Chen, Xiucheng Li, Shuai Liu, Gao Cong
Location-based services play a critical role in improving the quality of our daily lives. Despite the proliferation of numerous specialized AI models within the spatio-temporal context of location-based services, these models struggle to autonomously tackle problems regarding complex urban planning and management. To bridge this gap, we introduce UrbanLLM, a fine-tuned large language model (LLM) designed to tackle diverse problems in urban scenarios. UrbanLLM functions as a problem-solver by decomposing urban-related queries into manageable
sub-tasks, identifying suitable spatio-temporal AI models for each sub-task, and generating comprehensive responses to the given queries.
Our experimental results indicate that UrbanLLM significantly outperforms other established LLMs, such as Llama and the GPT series, in
handling problems concerning complex urban activity planning and management. UrbanLLM exhibits considerable potential in enhancing
the effectiveness of solving problems in urban scenarios, reducing the workload of and reliance on human experts.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Multi-dimensional Evaluation of Empathetic Dialogue Responses
Zhichao Xu, Jiepu Jiang
Empathy is critical for effective and satisfactory conversational communication. Prior efforts to measure conversational empathy mostly focus on expressed communicative intents, that is, the way empathy is expressed. Yet, these works ignore the fact that conversation is also a collaboration involving both speakers and listeners. In contrast, we propose a multi-dimensional empathy evaluation framework to measure both expressed intents from the speaker's perspective and perceived empathy from the listener's perspective. We apply our analytical framework to examine internal customer-service dialogues. We find the two dimensions (expressed intent types and perceived empathy) are interconnected, while perceived empathy has high correlations with dialogue satisfaction levels. To reduce the annotation cost, we explore different options to automatically measure conversational empathy: prompting LLMs and training language model-based classifiers. Our experiments show that prompting methods, even with popular models like GPT-4 and Flan family models, perform relatively poorly on both public and our internal datasets. In contrast, instruction-finetuned classifiers based on FlanT5 family models outperform prior works and competitive baselines. We conduct a detailed ablation study to give more insights into the instruction finetuning method's strong performance.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Learning to Ask Informative Questions: Enhancing LLMs with Preference Optimization and Expected Information Gain
Davide Mazzaccara, Alberto Testoni, Raffaella Bernardi
Questions are essential tools for acquiring the necessary information to complete information-seeking tasks. However, large language models
(LLMs), especially open-source models, often perform poorly in generating informative questions, as measured by expected information gain
(EIG). In this paper, we propose a method to enhance the informativeness of LLM-generated questions in 20-question game dialogues. We
sample multiple questions from the same model (LLaMA 2-Chat 7B) for each game and create pairs of low-EIG and high-EIG questions to
apply a Direct Preference Optimization (DPO) algorithm. Our results show that this method produces more effective questions (in terms of
EIG), even in domains different from those used to train the DPO model.
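As a concrete illustration of the EIG quantity: under a uniform belief over the remaining candidates in a 20-questions game, a yes/no question that splits the candidates into fractions p and 1-p has expected information gain equal to the binary entropy of p. A minimal sketch, with an assumed candidate set:

import math

def binary_entropy(p: float) -> float:
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def eig(question_filter, candidates) -> float:
    """question_filter: predicate that answers 'yes' for a candidate."""
    yes = sum(1 for c in candidates if question_filter(c))
    return binary_entropy(yes / len(candidates))

animals = ["dog", "cat", "eagle", "trout", "salmon", "sparrow", "horse", "owl"]
print(eig(lambda a: a in {"eagle", "sparrow", "owl"}, animals))  # ~0.954 bits
print(eig(lambda a: a == "dog", animals))                        # ~0.544 bits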
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Can LLMs Understand the Implication of Emphasized Sentences in Dialogue?
Guan-Ting Lin, Hung-yi Lee
Emphasis is a crucial component in human communication, which indicates the speaker's intention and implication beyond the pure text in dialogue.
While Large Language Models (LLMs) have revolutionized natural language processing, their ability to understand emphasis in dialogue
remains uncertain. This paper introduces Emphasized-Talk, a benchmark dataset with annotated dialogue samples capturing the implications
of emphasis. We evaluate various LLMs, both open-source and commercial, to assess their performance in understanding and generating
emphasis. Additionally, we propose an automatic evaluation pipeline using GPT-4, which achieves a high correlation with human scoring. Our
findings reveal that although commercial LLMs generally perform better, there is still significant room for improvement in comprehending
emphasized sentences.
Industry
Nov 13 (Wed) 16:00-17:30 - Room: Riverfront Hall
Pre-trained chemical language models (CLMs) excel in the field of molecular property prediction, utilizing string-based molecular descriptors
such as SMILES for learning universal representations. However, such string-based descriptors implicitly contain limited structural infor-
mation, which is closely associated with molecular property prediction. In this work, we introduce Moleco, a novel contrastive learning
framework to enhance the understanding of molecular structures within CLMs. Based on the similarity of fingerprint vectors among dif-
ferent molecules, we train CLMs to distinguish structurally similar and dissimilar molecules in a contrastive manner. Experimental results
demonstrate that Moleco significantly improves the molecular property prediction performance of CLMs, outperforming state-of-the-art mod-
els. Moreover, our in-depth analysis with diverse Moleco variants verifies that fingerprint vectors are highly effective features in improving
CLMs’ understanding of the structural information of molecules.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Refining App Reviews: Dataset, Methodology, and Evaluation
Amrita Singh, Chirag Jain, Mohit Chaudhary, Preethu Rose Anish
With the growing number of mobile users, app development has become increasingly lucrative. Reviews on platforms such as Google Play
and Apple App Store provide valuable insights to developers, highlighting bugs, suggesting new features, and offering feedback. However,
many reviews contain typos, spelling errors, grammar mistakes, and complex sentences, hindering efficient interpretation and slowing down
app improvement processes. To tackle this, we introduce RARE (Repository for App review REfinement), a benchmark dataset of 10,000
annotated pairs of original and refined reviews from 10 mobile applications. These reviews were collaboratively refined by humans and
large language models (LLMs). We also conducted an evaluation of eight state-of-the-art LLMs for automated review refinement. The top-
performing model (Flan-T5) was further used to refine an additional 10,000 reviews, contributing to RARE as a silver corpus.
hensive assessment of eight LMs, revealing that larger models, such as Claude-3.5-Sonnet, exhibit superior performance in comprehending
contact-center conversations. We introduce methodologies to transfer this domain-specific knowledge to smaller models, by leveraging eval-
uation plans generated by more knowledgeable models, with optional human-in-the-loop refinement to enhance the capabilities of smaller
models. Notably, our experimental results demonstrate an improvement of up to 18.95% in Macro F1 on an in-house QA dataset. Our findings emphasize the importance of evaluation plans in guiding reasoning and highlight the potential of AI-assisted tools to advance objective, consistent, and scalable agent evaluation processes in contact centers.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Intelligent Predictive Maintenance RAG framework for Power Plants: Enhancing QA with StyleDFS and Domain Specific Instruc-
tion Tuning
Seongtae Hong, Shin Joong Min, Jaehyung Seo, Taemin Lee, Jeongbae Park, Cho Man Young, Byeongho Choi, Heuiseok Lim
Process plants are complex large-scale industrial facilities that convert raw materials or intermediate products into final products, requiring
continuous processes with high safety and efficiency standards. In particular, in nuclear process plants, Predictive Maintenance Systems
(PMS) play a critical role in predicting equipment anomalies and performing preventive maintenance. However, current PMS relies heavily
on the experience of a few experts, leading to knowledge loss upon their retirement and difficulty in swift response. Existing off-premise
Question-Answering (QA) systems based on Large Language Models (LLM) face issues such as data leakage and challenges in domain-
specific tuning. To address these problems, this study proposes an on-premise intelligent PMS framework utilizing a new chunking method,
StyleDFS, which effectively reflects the structural information of documents. Additionally, we demonstrate that Instruction tuning using rele-
vant domain-specific data improves LLM performance even under limited data conditions.
we propose an approach to enhance query understanding by augmenting queries with rich contextual signals derived from web search results
and large language models, stored in an online cache. Specifically, we use web search titles and snippets to ground queries in real-world
information, and utilize GPT-4 to generate query rewrites and explanations that clarify user intent. These signals are efficiently integrated
through a Fusion-in-Decoder based Unity architecture, enabling both dense and generative retrieval with serving costs on par with traditional
context-free models. To address scenarios where context is unavailable in the cache, we introduce context glancing, a curriculum learning
strategy that improves model robustness and performance even without contextual signals during inference. Extensive offline experiments
demonstrate that our context-aware approach substantially outperforms context-free models. Furthermore, online A/B testing on a prominent
search engine across 160+ countries shows significant improvements in user engagement and revenue.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Salient Information Prompting to Steer Content in Prompt-based Abstractive Summarization
Lei Xu, Mohammed Asad Karim, Saket Dingliwal, Aparna Elangovan
Large language models (LLMs) can generate fluent summaries across domains using prompting techniques, reducing the effort required for
summarization applications. However, crafting effective prompts that guide LLMs to generate summaries with the appropriate level of detail
and writing style remains a challenge. In this paper, we explore the use of salient information extracted from the source document to enhance
summarization prompts. We show that adding keyphrases in prompts can improve ROUGE F1 and recall, making the generated summaries
more similar to the reference and more complete. The number of keyphrases can control the precision-recall trade-off. Furthermore, our
analysis reveals that incorporating phrase-level salient information is superior to word- or sentence-level. However, the impact on summary
faithfulness is not universally positive across LLMs. To enable this approach, we introduce Keyphrase Signal Extractor (SigExt), a lightweight
model that can be finetuned to extract salient keyphrases. By using SigExt, we achieve consistent ROUGE improvements across datasets and
LLMs without any LLM customization. Our findings provide insights into leveraging salient information in building prompt-based summa-
rization systems.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Don’t Shoot The Breeze: Topic Continuity Model Using Nonlinear Naive Bayes With Attention
Shu-Ting Pi, Pradeep Bagavan, Yejia Li, Disha, Qun Liu
Utilizing Large Language Models (LLMs) as chatbots in diverse business scenarios often presents the challenge of maintaining topic continu-
ity. Abrupt shifts in topics can lead to poor user experiences and inefficient utilization of computational resources. In this paper, we present a
topic continuity model aimed at assessing whether a response aligns with the initial conversation topic. Our model is built upon the expansion
of the corresponding natural language understanding (NLU) model into quantifiable terms using a Naive Bayes approach. Subsequently, we
have introduced an attention mechanism and logarithmic nonlinearity to enhance its capability to capture topic continuity. This approach
allows us to convert the NLU model into an interpretable analytical formula. In contrast to many NLU models constrained by token limits,
our proposed model can seamlessly handle conversations of any length with linear time complexity. Furthermore, the attention mechanism
significantly improves the model’s ability to identify topic continuity in complex conversations. According to our experiments, our model
consistently outperforms traditional methods, particularly in handling lengthy and intricate conversations. This unique capability offers us an
opportunity to ensure the responsible and interpretable use of LLMs.
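The general construction can be sketched as a Naive Bayes log-likelihood-ratio score with per-token attention weights and a logarithmic squashing of the total; the particular attention values and nonlinearity below are illustrative stand-ins, not the paper's exact formula.

import math

def topic_continuity_score(tokens, p_on, p_off, attention, eps=1e-6):
    """tokens: response tokens; p_on/p_off: token probabilities under the
    on-topic and off-topic models; attention: per-token weights."""
    s = 0.0
    for tok in tokens:
        llr = math.log(p_on.get(tok, eps) / p_off.get(tok, eps))
        s += attention.get(tok, 1.0) * llr
    # logarithmic nonlinearity keeps long responses from dominating linearly
    return math.copysign(math.log1p(abs(s)), s)

p_on = {"refund": 0.05, "order": 0.04, "the": 0.06}
p_off = {"refund": 0.001, "order": 0.002, "the": 0.06}
attn = {"refund": 2.0, "order": 1.5, "the": 0.1}
print(topic_continuity_score(["the", "refund", "order"], p_on, p_off, attn))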
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Investigating the Personality Consistency in Quantized Role-Playing Dialogue Agents
Yixiao Wang, Homa Fashandi, Kevin Ferreira
This study explores the consistency of personality traits in quantized large language models (LLMs) for edge device role-playing scenarios.
Using the Big Five personality traits model, we evaluate how stable assigned personalities are for Quantized Role-Playing Dialog Agents
(QRPDA) during multi-turn interactions. We evaluate multiple LLMs with various quantization levels, combining binary indexing of person-
ality traits, explicit self-assessments, and linguistic analysis of narratives. To address personality inconsistency, we propose a non-parametric
method called Think2. Our multi-faceted evaluation framework demonstrates Think2’s effectiveness in maintaining consistent personality
traits for QRPDA. Moreover, we offer insights to help select the optimal model for QRPDA, improving its stability and reliability in real-
world applications.
tection task.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Evaluating D-MERIT of Partial-annotation on Information Retrieval
Royi Rassin, Yaron Fairstein, Oren Kalinsky, Guy Kushilevitz, Nachshon Cohen, Alexander Libov, Yoav Goldberg
Retrieval models are often evaluated on partially-annotated datasets. Each query is mapped to a few relevant texts, and the remaining corpus
is assumed to be irrelevant. As a result, models that successfully retrieve falsely labeled negatives are punished during evaluation. Unfortu-
nately, completely annotating all texts for every query is not resource-efficient. In this work, we show that using partially-annotated datasets
in evaluation can paint a distorted picture. We curate D-MERIT, a passage retrieval evaluation set from Wikipedia, aspiring to contain *all*
relevant passages for each query. Queries describe a group (e.g., “journals about linguistics”), and relevant passages are evidence that en-
tities belong to the group (e.g., a passage indicating that *Language* is a journal about linguistics). We show that evaluating on a dataset
containing annotations for only a subset of the relevant passages might result in a misleading ranking of retrieval systems, and that as more
relevant texts are included in the evaluation set, the rankings converge. We propose our dataset as a resource for evaluation and our study as
a recommendation for a balance between resource-efficiency and reliable evaluation when annotating evaluation sets for text retrieval. Our
dataset can be downloaded from https://D-MERIT.github.io.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
AGRaME: Any-Granularity Ranking with Multi-Vector Embeddings
Revanth Gangi Reddy, Omar Attia, Yunyao Li, Heng Ji, Saloni Potdar
Ranking is a fundamental problem in search; however, existing ranking algorithms usually restrict the granularity of ranking to full passages or require a specific dense index for each desired level of granularity. Such a lack of flexibility in granularity negatively affects many appli-
cations that can benefit from more granular ranking, such as sentence-level ranking for open-domain QA, or proposition-level ranking for
attribution. In this work, we introduce the idea of any-granularity ranking which leverages multi-vector embeddings to rank at varying levels
of granularity while maintaining encoding at a single (coarser) level of granularity. We propose a multi-granular contrastive loss for training
multi-vector approaches and validate its utility with both sentences and propositions as ranking units. Finally, we demonstrate the application
of proposition-level ranking to post-hoc citation addition in retrieval-augmented generation, surpassing the performance of prompt-driven
citation generation.
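The core idea can be sketched with a ColBERT-style max-sim computed over only a span's token vectors, so sentences or propositions can be ranked from a single passage-level encoding; the shapes and scoring rule follow common multi-vector practice rather than the authors' exact model.

import torch

def maxsim_score(query_vecs: torch.Tensor, span_vecs: torch.Tensor) -> float:
    """query_vecs: (Q, D); span_vecs: (S, D); both L2-normalized."""
    sim = query_vecs @ span_vecs.T             # (Q, S) token-token similarity
    return sim.max(dim=1).values.sum().item()  # best match per query token

def rank_sentences(query_vecs, passage_vecs, sentence_offsets):
    """sentence_offsets: (start, end) token indices of each sentence."""
    scores = [maxsim_score(query_vecs, passage_vecs[s:e])
              for s, e in sentence_offsets]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

q = torch.nn.functional.normalize(torch.randn(4, 64), dim=-1)
p = torch.nn.functional.normalize(torch.randn(30, 64), dim=-1)
print(rank_sentences(q, p, [(0, 10), (10, 22), (22, 30)]))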
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Exploring the Practicality of Generative Retrieval on Dynamic Corpora
Soyoung Yoon, Chaeeun Kim, Hyunji Lee, Joel Jang, Sohee Yang, Minjoon Seo
Benchmarking the performance of information retrieval (IR) is mostly conducted with a fixed set of documents (static corpora). However,
in realistic scenarios, this is rarely the case and the documents to be retrieved are constantly updated and added. In this paper, we focus on
Generative Retrievals (GR), which apply autoregressive language models to IR problems, and explore their adaptability and robustness in dy-
namic scenarios. We also conduct an extensive evaluation of computational and memory efficiency, crucial factors for real-world deployment
of IR systems handling vast and ever-changing document collections. Our results on the StreamingQA benchmark demonstrate that GR is
more adaptable to evolving knowledge (4-11%), robust in learning knowledge with temporal information, and efficient in terms of inference
FLOPs (x2), indexing time (x6), and storage footprint (x4) compared to Dual Encoders (DE), which are commonly used in retrieval systems.
Our paper highlights the potential of GR for future use in practical IR systems within dynamic environments.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Open-world Multi-label Text Classification with Extremely Weak Supervision
Xintong Li, Jinya Jiang, Ria Dharmani, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang
We study open-world multi-label text classification under extremely weak supervision (XWS), where the user only provides a brief description
for classification objectives without any labels or ground-truth label space. Similar single-label XWS settings have been explored recently,
however, these methods cannot be easily adapted for multi-label. We observe that (1) most documents have a dominant class covering the ma-
jority of content and (2) long-tail labels would appear in some documents as a dominant class. Therefore, we first utilize the user description
to prompt a large language model (LLM) for dominant keyphrases of a subset of raw documents, and then construct an (initial) label space via
clustering. We further apply a zero-shot multi-label classifier to locate the documents with small top predicted scores, so we can revisit their
dominant keyphrases for more long-tail labels. We iterate this process to discover a comprehensive label space and construct a multi-label
classifier as a novel method, X-MLClass. X-MLClass exhibits a remarkable increase in ground-truth label space coverage on various datasets,
for example, a 40% improvement on the AAPD dataset over topic modeling and keyword extraction methods. Moreover, X-MLClass achieves
the best end-to-end multi-label classification accuracy.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Unleashing the Power of Emojis in Texts via Self-supervised Graph Pre-Training
Zhou Zhang, Dongzeng Tan, Jiaan Wang, Yilong Chen, Jiarong Xu
Emojis have gained immense popularity on social platforms, serving as a common means to supplement or replace text. However, exist-
ing data mining approaches generally either completely ignore or simply treat emojis as ordinary Unicode characters, which may limit the
model’s ability to grasp the rich semantic information in emojis and the interaction between emojis and texts. Thus, it is necessary to release
the emoji’s power in social media data mining. To this end, we first construct a heterogeneous graph consisting of three types of nodes,
i.e. post, word and emoji nodes to improve the representation of different elements in posts. The edges are also well-defined to model how
these three elements interact with each other. To facilitate the sharing of information among post, word, and emoji nodes, we propose a graph pre-training framework for text and emoji co-modeling, which contains two graph pre-training tasks: node-level graph contrastive learning and
edge-level link reconstruction learning. Extensive experiments on the Xiaohongshu and Twitter datasets with two types of downstream tasks
demonstrate that our approach achieves significant improvements over previous strong baseline methods.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
LumberChunker: Long-Form Narrative Document Segmentation
André V. Duarte, João DS Marques, Miguel Graça, Miguel Freire, Lei Li, Arlindo L. Oliveira
Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated
by the premise that retrieval benefits from segments that can vary in size, such that the content's semantic independence is better captured. We
propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the
point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark
with 3,000 "needle in a haystack"-style question-answer pairs derived from 100 public domain narrative books available on Project Guten-
berg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance
(DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and
competitive baselines, such as Gemini 1.5 Pro.
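The iterative segmentation loop can be sketched as follows; the llm callable, prompt wording, and window size are hypothetical stand-ins rather than the paper's exact setup.

def lumber_chunk(paragraphs, llm, window=8):
    chunks, start = [], 0
    while start < len(paragraphs):
        group = paragraphs[start:start + window]
        if len(group) == 1:
            chunks.append(group)
            break
        numbered = "\n".join(f"[{i}] {p}" for i, p in enumerate(group))
        prompt = ("Below are sequential passages. Answer with only the index "
                  "of the first passage where the content shifts topic.\n"
                  + numbered)
        try:
            shift = max(1, min(int(llm(prompt)), len(group) - 1))
        except ValueError:
            shift = len(group) - 1  # fall back to a fixed-size cut
        chunks.append(paragraphs[start:start + shift])
        start += shift
    return chunks

# usage with a trivial stub in place of a real LLM call
fake_llm = lambda prompt: "2"
print(lumber_chunk([f"para {i}" for i in range(7)], fake_llm, window=4))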
space. However, such similarity often fails to capture the relevance. Alternatively, large language models (LLMs) have been used for ranking
contexts. However, they can encounter scalability issues when the number of candidate contexts grows and the context window sizes of the
LLMs remain constrained. Additionally, these approaches require fine-tuning LLMs with domain-specific data. In this work, we introduce a
scalable ranking framework that combines embedding similarity and LLM capabilities without requiring LLM fine-tuning. Our framework
uses a pre-trained LLM to hypothesize the user query based on each retrieved context and ranks the contexts based on the similarity between
the hypothesized queries and the user query. Our framework is efficient at inference time and is compatible with many other retrieval and
ranking techniques. Experimental results show that our method improves the ranking performance across multiple benchmarks.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Topic Modeling: Contextual Token Embeddings Are All You Need
Dimo Angelov, Diana Inkpen
The goal of topic modeling is to find meaningful topics that capture the information present in a collection of documents. The main challenges
of topic modeling are finding the optimal number of topics, labeling the topics, segmenting documents by topic, and evaluating topic model
performance. Current neural approaches have tackled some of these problems but none have been able to solve all of them. We introduce a
novel topic modeling approach, Contextual-Top2Vec, which uses contextual token embeddings of documents; it creates hierarchical topics, finds topic spans within documents, and labels topics with phrases rather than just words. We propose the use of BERTScore to evaluate topic
coherence and to evaluate how informative topics are of the underlying documents. Our model outperforms the current state-of-the-art models
on a comprehensive set of topic model evaluation metrics.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Dense Passage Retrieval: Is it Retrieving?
Benjamin Reichman, Larry Heck
Large Language Models (LLMs) internally store repositories of knowledge. However, their access to this repository is imprecise and they
frequently hallucinate information that is not true or does not exist. A paradigm called Retrieval Augmented Generation (RAG) promises to
fix these issues. Dense passage retrieval (DPR) is the first step in this paradigm. In this paper, we analyze the role of DPR fine-tuning and
how it affects the model being trained. DPR fine-tunes pre-trained networks to enhance the alignment of the embeddings between queries and
relevant textual data. We explore DPR-trained models mechanistically by using a combination of probing, layer activation analysis, and model
editing. Our experiments show that DPR training decentralizes how knowledge is stored in the network, creating multiple access pathways
to the same information. We also uncover a limitation in this training style: the internal knowledge of the pre-trained model bounds what the
retrieval model can retrieve. These findings suggest a few possible directions for dense retrieval: (1) expose the DPR training process to more
knowledge so more can be decentralized, (2) inject facts as decentralized representations, (3) model and incorporate knowledge uncertainty
in the retrieval process, and (4) directly map internal model knowledge to a knowledge base.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
R3-NL2GQL: A Model Coordination and Knowledge Graph Alignment Approach for NL2GQL
Yuhang Zhou, Yu He, Siyu Tian, Yuchen Ni, Zhangyue Yin, Xiang Liu, Chuanjun Ji, Sen Liu, Xipeng Qiu, Guangnan Ye, Hongfeng Chai
While current tasks of converting natural language to SQL (NL2SQL) using Foundation Models have shown impressive achievements, adapt-
ing these approaches for converting natural language to Graph Query Language (NL2GQL) encounters hurdles due to the distinct nature of
GQL compared to SQL, alongside the diverse forms of GQL. Moving away from traditional rule-based and slot-filling methodologies, we
introduce a novel approach, R3-NL2GQL, integrating both small and large Foundation Models for ranking, rewriting, and refining tasks.
This method leverages the interpretative strengths of smaller models for initial ranking and rewriting stages, while capitalizing on the superior
generalization and query generation prowess of larger models for the final transformation of natural language queries into GQL formats.
Addressing the scarcity of datasets in this emerging field, we have developed a bilingual dataset, sourced from graph database manuals and
selected open-source Knowledge Graphs (KGs). Our evaluation of this methodology on this dataset demonstrates its promising efficacy and
robustness.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
ConTReGen: Context-driven Tree-structured Retrieval for Open-domain Long-form Text Generation
Kashob Kumar Roy, Pritom Saha Akash, Lucian Popa, Kevin Chen-Chuan Chang
Open-domain long-form text generation requires generating coherent, comprehensive responses that address complex queries with both
breadth and depth. This task is challenging due to the need to accurately capture diverse facets of input queries. Existing iterative retrieval-
augmented generation (RAG) approaches often struggle to delve deeply into each facet of complex queries and integrate knowledge from
various sources effectively. This paper introduces ConTReGen, a novel framework that employs a context-driven, tree-structured retrieval ap-
proach to enhance the depth and relevance of retrieved content. ConTReGen integrates a hierarchical, top-down in-depth exploration of query
facets with a systematic bottom-up synthesis, ensuring comprehensive coverage and coherent integration of multifaceted information. Exten-
sive experiments on multiple datasets, including LFQA and ODSUM, alongside a newly introduced dataset, ODSUM-WikiHow, demonstrate
that ConTReGen outperforms existing state-of-the-art RAG models.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Prefix-VAE: Efficient and Consistent Short-Text Topic Modeling with LLMs
Pritom Saha Akash, Kevin Chen-Chuan Chang
Topic modeling is a powerful technique for uncovering hidden themes within a collection of documents. However, the effectiveness of tra-
ditional topic models often relies on sufficient word co-occurrence, which is lacking in short texts. Therefore, existing approaches, whether
probabilistic or neural, frequently struggle to extract meaningful patterns from such data, resulting in incoherent topics. To address this chal-
lenge, we propose a novel approach that leverages large language models (LLMs) to extend short texts into more detailed sequences before
applying topic modeling. To further improve the efficiency and solve the problem of semantic inconsistency from LLM-generated texts, we
propose to use prefix tuning to train a smaller language model coupled with a variational autoencoder for short-text topic modeling. Our
method significantly improves short-text topic modeling performance, as demonstrated by extensive experiments on real-world datasets with
extreme data sparsity, outperforming current state-of-the-art topic models.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Exploring the Best Practices of Query Expansion with Large Language Models
Le Zhang, Yihong Wu, Qian Yang, Jian-Yun Nie
Large Language Models (LLMs) are foundational in language technologies, particularly in information retrieval (IR). In this paper, we
thoroughly explore the best practice of leveraging LLMs for query expansion. To this end, we introduce a training-free, straightforward yet
effective framework called Multi-Text Generation Integration (MuGI). This approach leverages LLMs to generate multiple pseudo-references,
which are then integrated with the original queries to enhance both sparse and dense retrieval methods. Additionally, we introduce a retrieval
pipeline based on MuGI, which combines the strengths of sparse and dense retrievers to achieve superior performance without the need for
costly pre-indexing. Our empirical findings reveal that: (1) Increasing the number of samples from LLMs benefits IR systems; (2) A balance
between the query and pseudo-documents, and an effective integration strategy, is critical for high performance; (3) Contextual information from LLMs is essential, even boosting a 23M model to outperform a 7B baseline; (4) Pseudo-relevance feedback can further calibrate
queries for improved performance; and (5) Query expansion is widely applicable and versatile, consistently enhancing models ranging from
23M to 7B parameters. Our code and all generated references are made available at https://github.com/lezhang7/Retrieval_MuGI.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Revisiting Query Variation Robustness of Transformer Models
Tim Hagen, Harrisen Scells, Martin Potthast
The most commonly used transformers for retrieval at present, BERT and T5, have been shown not to be robust to query variations such as
typos or paraphrases. Although this is an important prerequisite for their practicality, this problem has hardly been investigated. More recent
large language models (LLMs), including instruction-tuned LLMs, have not been analyzed yet, and only one study looks beyond typos. We
close this gap by reproducing this study and extending it with a systematic analysis of more recent models, including Sentence-BERT, Char-
acterBERT, E5-Mistral, AnglE, and Ada v2. We further investigate if instruct-LLMs can be prompted for robustness. Our results are mixed
in that the previously observed robustness issues for cross-encoders also apply to bi-encoders that use much larger LLMs, albeit to a lesser
extent. While further LLM scaling may improve their embeddings, their cost-effective use for all but large deployments is limited. Training
data that includes query variations allows LLMs to be fine-tuned for more robustness, but focusing on a single category of query variation
may even degrade the effectiveness on others. Our code, results, and artifacts can be found at https://github.com/webis-de/EMNLP-24
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval
Yanfei Chen, Jinsung Yoon, Devendra Singh Sachan, Qingze Wang, Vincent Cohen-Addad, Mohammadhossein Bateni, Chen-Yu Lee, Tomas
Pfister
Recent advances in large language models (LLMs) have enabled autonomous agents with complex reasoning and task-fulfillment capabilities
using a wide range of tools. However, effectively identifying the most relevant tools for a given task becomes a key bottleneck as the toolset
size grows, hindering reliable tool utilization. To address this, we introduce Re-Invoke, an unsupervised tool retrieval method designed to
scale effectively to large toolsets without training. Specifically, we first generate a diverse set of synthetic queries that comprehensively cover
different aspects of the query space associated with each tool document during the tool indexing phase. Second, we leverage the LLM's query
understanding capabilities to extract key tool-related context and underlying intents from user queries during the inference phase. Finally,
we employ a novel multi-view similarity ranking strategy based on intents to pinpoint the most relevant tools for each query. Our evaluation
demonstrates that Re-Invoke significantly outperforms state-of-the-art alternatives in both single-tool and multi-tool scenarios, all within a
fully unsupervised setting. Notably, on the ToolE datasets, we achieve a 20% relative improvement in nDCG@5 for single-tool retrieval and
a 39% improvement for multi-tool retrieval.
Language Modeling 3
Nov 13 (Wed) 16:00-17:30 - Room: Jasmine
fewer queries, thereby establishing a new standard for interpretable jailbreaking methods.
Nov 13 (Wed) 16:00-17:30 - Jasmine
Initialization of Large Language Models via Reparameterization to Mitigate Loss Spikes
Kosuke Nishida, Kyosuke Nishida, Kuniko Saito
Loss spikes, a phenomenon in which the loss value suddenly diverges, are a fundamental issue in the pre-training of large language models. This paper posits that the non-uniformity of the norm of the parameters is one of the causes of loss spikes. In training neural networks, the scale of the gradients must be kept constant across layers to avoid the vanishing and exploding gradients problem. However, to meet this requirement in the Transformer model, the norm of the model parameters must be non-uniform, and thus parameters with smaller norms are more sensitive to parameter updates. To address this issue, we propose a novel technique, weight scaling as reparameterization (WeSaR). WeSaR introduces a gate parameter per parameter matrix and adjusts it to the value satisfying the requirement. Because of the gate parameter, WeSaR can set the norm of the original parameters uniformly, which results in stable training. Experimental results with Transformer decoders of 130 million, 1.3 billion, and 13 billion parameters showed that WeSaR stabilizes and accelerates training and that it outperforms compared methods, including popular initialization methods.
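A minimal sketch of the gating idea in PyTorch (the initialization constants are illustrative assumptions; the paper's exact scheme may differ):

    import torch
    import torch.nn as nn

    class GatedLinear(nn.Module):
        """Weight-scaling reparameterization: the stored weight uses one
        common std across all layers (uniform norm), and a learned scalar
        gate restores the layer-specific scale."""
        def __init__(self, d_in, d_out, target_std=0.02, common_std=0.4):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(d_out, d_in) * common_std)
            # gate * weight matches the conventional init scale at start
            self.gate = nn.Parameter(torch.tensor(target_std / common_std))

        def forward(self, x):
            return x @ (self.gate * self.weight).t()

    layer = GatedLinear(16, 32)
    print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 32])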
Tutor-ICL: Guiding Large Language Models for Improved In-Context Learning Performance
Ikhyun Cho, Gaeul Kwon, Julia Hockenmaier
There has been a growing body of work focusing on the in-context learning (ICL) abilities of large language models (LLMs). However, it is an
open question how effective ICL can be. This paper presents Tutor-ICL, a simple prompting method for classification tasks inspired by how
effective instructors might engage their students in learning a task. Specifically, we propose presenting exemplar answers in a *comparative
format* rather than the traditional single-answer format. We also show that including the test instance before the exemplars can improve
performance, making it easier for LLMs to focus on relevant exemplars. Lastly, we include a summarization step before attempting the test,
following a common human practice. Experiments on various classification tasks, conducted across both decoder-only LLMs (Llama 2, 3)
and encoder-decoder LLMs (Flan-T5-XL, XXL), show that Tutor-ICL consistently boosts performance, achieving up to a 13.76% increase in
accuracy.
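A minimal sketch of a prompt in this spirit (the exact wording below is an assumption based on the abstract: test instance first, comparative exemplar answers, then a summarization step):

    def tutor_icl_prompt(test_input, exemplars, labels):
        lines = [f"Question to answer later: {test_input}", "", "Examples:"]
        for text, gold in exemplars:
            others = ", ".join(l for l in labels if l != gold)
            lines.append(f"- Input: {text}")
            lines.append(f"  Answer: {gold} (preferred over: {others})")
        lines += ["", "Summarize the question, then answer it.",
                  f"Question: {test_input}", "Answer:"]
        return "\n".join(lines)

    print(tutor_icl_prompt("The plot was dull.",
                           [("A delightful film.", "positive")],
                           ["positive", "negative"]))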
Nov 13 (Wed) 16:00-17:30 - Jasmine
Beyond Fine-tuning: Unleashing the Potential of Continuous Pretraining for Clinical LLMs.
Clement Christophe, Tathagata Raha, Svetlana Maslenkova, Muhammad Umar Salman, Praveenkumar Kanithi, Marco AF Pimentel, Shadab
Khan
Large Language Models (LLMs) have demonstrated significant potential in revolutionizing clinical applications. In this study, we investigate
the efficacy of four techniques in adapting LLMs for clinical use-cases: continuous pretraining, instruct fine-tuning, NEFTune, and prompt
engineering. We employ these methods on Mistral 7B and Mixtral 8x7B models, leveraging a large-scale clinical pretraining dataset of 50
billion tokens and an instruct fine-tuning dataset of 500 million tokens. Our evaluation across various clinical tasks reveals nuanced insights.
While continuous pretraining beyond 250 billion tokens yields marginal improvements, instruct fine-tuning emerges as a more influential
factor. Notably, NEFTune, designed primarily to enhance generation quality, surprisingly demonstrates additional gains on our benchmark.
These findings underscore the importance of tailoring fine-tuning strategies and exploring innovative techniques to optimize LLM perfor-
mance in the clinical domain.
Nov 13 (Wed) 16:00-17:30 - Jasmine
Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability
Tsz Ting Chung, Leyang Cui, Lemao Liu, Xinting Huang, Shuming Shi, Dit-Yan Yeung
Large Language Models (LLMs) have demonstrated impressive capabilities in a wide range of natural language processing tasks when lever-
aging in-context learning. To mitigate the additional computational and financial costs associated with in-context learning, several prompt
compression methods have been proposed to compress the in-context learning prompts. Despite their success, these methods face challenges
with transferability due to model-specific compression, or rely on external training data, such as GPT-4. In this paper, we investigate the
ability of LLMs to develop a unified compression method that discretizes uninformative tokens, utilizing a self-supervised pre-training tech-
nique. By introducing a small number of parameters during the continual pre-training, the proposed Selection-p produces a probability for
each input token, indicating whether to preserve or discard it. Experiments show Selection-p achieves state-of-the-art performance across nu-
merous classification tasks, achieving compression rates of up to 10 times while experiencing only a marginal 0.8% decrease in performance.
Moreover, it exhibits superior transferability to different models compared to prior work. Additionally, we further analyze how Selection-p
helps maintain performance on in-context learning with long contexts.
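A minimal sketch of the selection step (the per-token probabilities below are hypothetical stand-ins for the learned head's outputs):

    def compress_prompt(tokens, keep_probs, threshold=0.5):
        """Keep tokens whose predicted keep-probability clears a threshold."""
        return [t for t, p in zip(tokens, keep_probs) if p >= threshold]

    tokens = ["Please", "kindly", "summarize", "the", "following", "report"]
    probs = [0.2, 0.1, 0.9, 0.4, 0.6, 0.95]  # hypothetical head outputs
    print(compress_prompt(tokens, probs))  # ['summarize', 'following', 'report']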
Nov 13 (Wed) 16:00-17:30 - Jasmine
Prompt-Based Bias Calibration for Better Zero/Few-Shot Learning of Language Models
Kang He, Yinghan Long, Kaushik Roy
Prompt-based learning is susceptible to intrinsic bias present in pre-trained language models (LMs), leading to sub-optimal performance in
prompt-based zero/few-shot settings. In this work, we propose a null-input prompting method to calibrate intrinsic bias encoded in pre-trained
LMs. Different from prior efforts that address intrinsic bias primarily for social fairness and often involve excessive computational cost, our
objective is to explore enhancing LMs’ performance in downstream zero/few-shot learning while emphasizing the efficiency of intrinsic bias
calibration. Specifically, we leverage a diverse set of auto-selected null-meaning inputs generated from GPT-4 to probe intrinsic bias of
pre-trained LMs. Utilizing the bias-reflected probability distribution, we formulate a distribution disparity loss for bias calibration, where we
exclusively update the bias parameters (0.1% of total parameters) of LMs towards an equal probability distribution. Experimental results show that
the calibration promotes an equitable starting point for LMs while preserving language modeling abilities. Across a wide range of datasets,
including sentiment analysis and topic classification, our method significantly improves zero/few-shot learning performance of LMs for both
in-context learning and prompt-based fine-tuning (on average 9% and 2%, respectively).
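A minimal sketch of the calibration objective (the squared-difference disparity and the probe outputs are illustrative assumptions, not the paper's exact loss):

    import torch

    def disparity_loss(null_label_probs: torch.Tensor) -> torch.Tensor:
        """null_label_probs: (num_null_inputs, num_labels) softmax outputs
        from probing the LM with null-meaning inputs."""
        mean_dist = null_label_probs.mean(dim=0)
        uniform = torch.full_like(mean_dist, 1.0 / mean_dist.numel())
        return torch.sum((mean_dist - uniform) ** 2)

    probs = torch.tensor([[0.7, 0.3], [0.6, 0.4]])  # biased toward label 0
    print(disparity_loss(probs))  # positive; zero once calibrated to 0.5/0.5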
Nov 13 (Wed) 16:00-17:30 - Jasmine
Auto-Evolve: Enhancing Large Language Model’s Performance via Self-Reasoning Framework
Krishna Aswani, Huilin Lu, Pranav Patankar, Priya Dhalwani, Xue Tan, Jayant Ganeshmohan, Simon Lacasse
Recent advancements in prompt engineering strategies, such as Chain-of-Thought (CoT) and Self-Discover, have demonstrated significant
potential in improving the reasoning abilities of Large Language Models (LLMs). However, these state-of-the-art (SOTA) prompting strate-
gies rely on a fixed set of static seed reasoning modules like "think step by step" or "break down this problem" intended to simulate the human approach to problem-solving. This constraint limits the flexibility of models in tackling diverse problems effectively. In this paper, we intro-
duce Auto-Evolve, a novel framework that enables LLMs to self-create dynamic reasoning modules and downstream action plan, resulting
in significant improvements over current SOTA methods. We evaluate Auto-Evolve on the challenging BigBench-Hard (BBH) dataset with
Claude 2.0, Claude 3 Sonnet, Mistral Large, and GPT-4, where it consistently outperforms the SOTA prompt strategies. Auto-Evolve outperforms CoT by up to 10.4% and by 7% on average across these four models. Our framework introduces two innovations: (a) Auto-Evolve dynamically generates reasoning modules for each task while aligning with the human reasoning paradigm, eliminating the need for predefined templates; (b) an iterative refinement component that incrementally refines instruction guidance for LLMs and boosts performance by an average of 2.8% compared to doing it in a single step.
Nov 13 (Wed) 16:00-17:30 - Jasmine
DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLMs Jailbreakers
Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh
Safety-aligned Large Language Models (LLMs) are still vulnerable to some manual and automated jailbreak attacks, which adversarially
trigger LLMs to output harmful content. However, existing jailbreaking methods usually view a harmful prompt as a whole and are thus ineffective at reducing LLMs' attention on malicious combinations of words, which well-aligned LLMs can easily reject. This paper discov-
ers that decomposing a malicious prompt into separated sub-prompts can effectively reduce LLMs’ attention on harmful words by presenting
them to LLMs in a fragmented form, thereby addressing these limitations and improving attack effectiveness. We introduce an automatic
prompt Decomposition and Reconstruction framework for jailbreaking Attack (DrAttack). DrAttack consists of three key components: (a)
’Decomposition’ of the original prompt into sub-prompts, (b) ’Reconstruction’ of these sub-prompts implicitly by In-Context Learning with
semantically similar but benign reassembling examples, and (c) 'Synonym Search' of sub-prompts, aiming to find sub-prompts' synonyms
that maintain the original intent while jailbreaking LLMs. An extensive empirical study across multiple open-source and closed-source LLMs
demonstrates that, with fewer queries, DrAttack obtains a substantial gain in success rate on powerful LLMs over prior SOTA attackers. Notably, its success rate of 80% on GPT-4 surpassed previous art by 65%. Code and data are made publicly available at https://turningpoint-ai.github.io/DrAttack/.
Nov 13 (Wed) 16:00-17:30 - Jasmine
POSIX: A Prompt Sensitivity Index For Large Language Models
Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, Tanmoy Chakraborty
Despite their remarkable capabilities, Large Language Models (LLMs) are found to be surprisingly sensitive to minor variations in prompts, often generating significantly divergent outputs in response to changes such as spelling errors, altered wording, or a different prompt template. However, when assessing the quality of an LLM, the focus often tends to be solely on its performance on downstream
tasks, while very little to no attention is paid to prompt sensitivity. To fill this gap, we propose POSIX - a novel PrOmpt Sensitivity IndeX as a
reliable measure of prompt sensitivity, thereby offering a more comprehensive evaluation of LLM performance. The key idea behind POSIX is
to capture the relative change in log-likelihood of a given response upon replacing the corresponding prompt with a different intent-preserving
prompt. We provide thorough empirical evidence demonstrating the efficacy of POSIX in capturing prompt sensitivity and subsequently use
it to measure and thereby compare the prompt sensitivity of various open-source LLMs. We find that merely increasing the parameter count or instruction tuning does not necessarily reduce prompt sensitivity, whereas adding few-shot exemplars, even just one, almost always leads to a significant decrease in prompt sensitivity. We also find that alterations to the prompt template lead to the highest sensitivity for MCQ-type tasks, whereas paraphrasing results in the highest sensitivity in open-ended generation tasks. The code for reproducing our results
is open-sourced at https://github.com/kowndinya-renduchintala/POSIX.
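A minimal sketch of a POSIX-style score (the scoring function below is a stand-in for a real model's length-normalized response log-likelihood):

    from itertools import permutations

    def sensitivity(prompts, response, log_p):
        """Average absolute change in the response's log-likelihood when
        one intent-preserving prompt is swapped for another."""
        pairs = list(permutations(range(len(prompts)), 2))
        return sum(abs(log_p(prompts[j], response) - log_p(prompts[i], response))
                   for i, j in pairs) / len(pairs)

    scores = {"Is this positive?": -1.2, "Positive sentiment?": -1.5}
    log_p = lambda prompt, resp: scores[prompt]  # hypothetical model scores
    print(sensitivity(list(scores), "yes", log_p))  # 0.3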
is hindered by issues of hallucination and generating non-factual content. This is particularly problematic in long-form responses, where
assessing and ensuring factual accuracy is complex. In this paper, we address this gap by proposing FactAlign, a novel alignment framework
designed to enhance the factuality of LLMs’ long-form responses while maintaining their helpfulness. We introduce fKTO, a fine-grained,
sentence-level alignment algorithm that extends the Kahneman-Tversky Optimization (KTO) alignment method. Leveraging recent advances
in automatic factuality evaluation, FactAlign utilizes fine-grained factuality assessments to guide the alignment process. Our experiments
on open-domain prompts and information-seeking questions demonstrate that FactAlign significantly improves the factual accuracy of LLM
responses while also improving their helpfulness. Further analyses identify that FactAlign is capable of training LLMs to provide more in-
formation without losing factual precision, thus improving the factual F1 score. Our source code, datasets, and trained models are publicly
available at https://github.com/MiuLab/FactAlign
Crucially, CLIP models exhibit dramatically better performance than LVLMs. Since the image encoders of LVLMs come from these CLIP
models, this points to inadequate alignment for fine-grained object distinction between the encoder and the LLM and warrants (pre)training
data with more fine-grained annotation. We release our code at ANONYMIZED.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?
Gregor Geigle, Radu Timofte, Goran Glavaš
Large vision-language models (LVLMs) have recently dramatically pushed the state of the art in image captioning and many image under-
standing tasks (e.g., visual question answering). LVLMs, however, often hallucinate and produce captions that mention concepts that cannot
be found in the image. These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiq-
uitous adoption. Recent work suggests that addition of grounding objectives—those that explicitly align image regions or objects to text
spans—reduces the amount of LVLM hallucination. Although intuitive, this claim is not empirically justified as the reduction effects have
been established, we argue, with flawed evaluation protocols that (i) rely on data (i.e., MSCOCO) that has been extensively used in LVLM
training and (ii) measure hallucination via question answering rather than open-ended caption generation. In this work, in contrast, we offer
the first systematic analysis of the effect of fine-grained object grounding on LVLM hallucination under an evaluation protocol that more
realistically captures LVLM hallucination in open generation. Our extensive experiments over three backbone LLMs reveal that grounding
objectives have little to no effect on object hallucination in open caption generation.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Decompose and Compare Consistency: Measuring VLMs’ Answer Reliability via Task-Decomposition Consistency Comparison
Qian Yang, Weixiang Yan, Aishwarya Agrawal
Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallu-
cinate and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a
VLM is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often
suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these, we
propose Decompose and Compare Consistency (DeCC) for reliability measurement. By comparing the consistency between the direct answer
generated using the VLM’s internal reasoning process, and the indirect answers obtained by decomposing the question into sub-questions
and reasoning over the sub-answers produced by the VLM, DeCC measures the reliability of the VLM's direct answer. Experiments across six
vision-language tasks with three VLMs show DeCC’s reliability estimation achieves better correlation with task accuracy compared to the
existing methods.
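A minimal sketch of the consistency comparison (exact-match agreement is a simplifying assumption; the paper's comparison may be more permissive):

    def decc_reliability(direct_answer, indirect_answers):
        """Fraction of decomposition-derived answers agreeing with the
        direct answer; higher agreement signals a more reliable answer."""
        norm = lambda a: a.strip().lower()
        agree = sum(norm(a) == norm(direct_answer) for a in indirect_answers)
        return agree / len(indirect_answers)

    print(decc_reliability("a red bus", ["a red bus", "a red bus", "a truck"]))
    # ~0.67: moderate agreement, hence moderate reliability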
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Teaching Embodied Reinforcement Learning Agents: Informativeness and Diversity of Language Use
Jiajun Xi, Yinong He, Jianing Yang, Yinpei Dai, Joyce Chai
In real-world scenarios, it is desirable for embodied agents to have the ability to leverage human language to gain explicit or implicit knowl-
edge for learning tasks. Despite recent progress, most previous approaches adopt simple low-level instructions as language inputs, which may
not reflect natural human communication. We expect human language to be informative (i.e., providing feedback on agents’ past behaviors
and offering guidance on achieving their future goals) and diverse (i.e., encompassing a wide range of expressions and style nuances). To
enable flexibility of language use in teaching agents tasks, this paper studies different types of language inputs in facilitating reinforcement
learning (RL) embodied agents. More specifically, we examine how different levels of language informativeness and diversity impact agent
learning and inference. Our empirical results based on four RL benchmarks demonstrate that agents trained with diverse and informative
language feedback can achieve enhanced generalization and fast adaptation to new tasks. These findings highlight the pivotal role of language
use in teaching embodied agents new tasks in an open world.
to train LLM-based agents. Extensive experiments on ALFRED demonstrate the state-of-the-art performance of our framework, achieving
first place on the ALFRED public leaderboard and showcasing its potential to improve long-horizon decision-making in diverse environments.
hypotheses to explain this phenomenon: 1) the effect of task-agnostic visual encoding on the updates of the hidden states, and 2) the difficulty
in performing visual grounding from the perspective of in-context multimodal retrieval. Our results indicate that a task-aware encoding yields
minimal performance gains on grounding; however, Transformers significantly outperform Mamba at in-context multimodal retrieval. Over-
all, Mamba shows promising performance on tasks where the correct output relies on a summary of the image but struggles when retrieval of
explicit information from the context is required.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
DocEditAgent: Document Structure Editing Via Multimodal LLM Grounding
Manan Suri, Puneet Mathur, Franck Dernoncourt, Rajiv Jain, Vlad I Morariu, Ramit Sawhney, Preslav Nakov, Dinesh Manocha
Document structure editing involves manipulating localized textual, visual, and layout components in document images based on the user’s
requests. Past works have shown that multimodal grounding of user requests in the document image and identifying the accurate structural
components and their associated attributes remain key challenges for this task. To address these, we introduce the DocEditAgent, a novel
framework that performs end-to-end document editing by leveraging Large Multimodal Models (LMMs). It consists of three novel compo-
nents – (1) Doc2Command to simultaneously localize edit regions of interest (RoI) and disambiguate user edit requests into edit commands.
(2) LLM-based Command Reformulation prompting to tailor edit commands originally intended for specialized software into edit instructions
suitable for generalist LMMs. (3) Processing of these outputs via Large Multimodal Models like GPT-4V and Gemini to parse the document layout, execute edits on the grounded Region of Interest (RoI), and generate the edited document image. Extensive exper-
iments on the DocEdit dataset show that DocEditAgent significantly outperforms strong baselines on edit command generation (2-33%), RoI
bounding box detection (12-31%), and overall document editing (1-12%) tasks.
answering tasks where models need to answer questions based on multiple images, a single image, and text-only descriptions, respectively.
While LLMs excel at text-based question answering, surpassing human accuracy, the open-sourced VLMs still fall short by 41% on multi-
image and 21% on single-image VQA tasks, although closed-weights models perform closer to human levels (within 10%). Our findings
highlight that understanding food and its cultural implications remains a challenging and under-explored direction.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Advancing Social Intelligence in AI Agents: Technical Challenges and Open Questions
Leena Mathur, Paul Pu Liang, Louis-Philippe Morency
Building socially-intelligent AI agents (Social-AI) is a multidisciplinary, multimodal research goal that involves creating agents that can
sense, perceive, reason about, learn from, and respond to affect, behavior, and cognition of other agents (human or artificial). Progress to-
wards Social-AI has accelerated in the past decade across several computing communities, including natural language processing, machine
learning, robotics, human-machine interaction, computer vision, and speech. Natural language processing, in particular, has been prominent
in Social-AI research, as language plays a key role in constructing the social world. In this position paper, we identify a set of underlying
technical challenges and open questions for researchers across computing communities to advance Social-AI. We anchor our discussion in
the context of social intelligence concepts and prior progress in Social-AI research.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Mitigating Open-Vocabulary Caption Hallucinations
Assaf Ben-Kish, Moran Yanuka, Morris Alper, Raja Giryes, Hadar Averbuch-Elor
While recent years have seen rapid progress in image-conditioned text generation, image captioning still suffers from the fundamental issue
of hallucinations, namely, the generation of spurious details that cannot be inferred from the given image. Existing methods largely use
closed-vocabulary object lists to mitigate or evaluate hallucinations in image captioning, ignoring the long-tailed nature of hallucinations that
occur in practice. To this end, we propose a framework for addressing hallucinations in image captioning in the open-vocabulary setting. Our
framework includes a new benchmark, OpenCHAIR, that leverages generative foundation models to evaluate open-vocabulary object hallu-
cinations for image captioning, surpassing the popular and similarly-sized CHAIR benchmark in both diversity and accuracy. Furthermore,
to mitigate open-vocabulary hallucinations without using a closed object list, we propose MOCHa, an approach harnessing advancements
in reinforcement learning. Our multi-objective reward function explicitly targets the trade-off between fidelity and adequacy in generations
without requiring any strong supervision. MOCHa improves a large variety of image captioning models, as captured by our OpenCHAIR
benchmark and other existing metrics. We will release our code and models.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
VideoINSTA: Zero-shot Long-Form Video Understanding via Informative Spatial-Temporal Reasoning
Ruotong Liao, Max Erler, Huiyu Wang, Guangyao Zhai, Gengyuan Zhang, Yunpu Ma, Volker Tresp
In the video-language domain, recent works in leveraging zero-shot Large Language Model-based reasoning for video understanding have be-
come competitive challengers to previous end-to-end models. However, long video understanding presents unique challenges due to the com-
plexity of reasoning over extended timespans, even for zero-shot LLM-based approaches. The challenge of information redundancy in long
videos prompts the question of what specific information is essential for large language models (LLMs) and how to leverage them for complex
spatial-temporal reasoning in long-form video analysis. We propose a framework, VideoINSTA, i.e., INformative Spatial-TemporAl Reasoning for zero-shot long-form video understanding. VideoINSTA contributes (1) a zero-shot framework for long video understanding using LLMs; (2) an event-based temporal reasoning and content-based spatial reasoning approach for LLMs to reason over spatial-temporal information in videos; and (3) a self-reflective information reasoning scheme based on information sufficiency and prediction confidence while balancing temporal factors. Our model significantly improves the state of the art on three long video question-answering benchmarks: EgoSchema, NextQA, and IntentQA, and the open question answering dataset ActivityNetQA. Code is released: https://github.com/mayhugotong/VideoINSTA.
Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex
questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone
to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex
visual programs remains a major bottleneck for visual reasoning. To address this, we introduce **VDebugger**, a novel critic-refiner frame-
work trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors
leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline
that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDe-
bugger’s effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger’s
ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task.
Question Answering 3
Nov 13 (Wed) 16:00-17:30 - Room: Jasmine
KARL: Knowledge-Aware Retrieval and Representations aid Retention and Learning in Students
Matthew Shu, Nishant Balepur, Shi Feng, Jordan Lee Boyd-Graber
Flashcard schedulers rely on 1) *student models* to predict the flashcards a student knows; and 2) *teaching policies* to pick which cards to show next via these predictions. Prior student models, however, just use study data like the student's past responses, ignoring the text on cards. We propose **content-aware scheduling**, the first schedulers exploiting flashcard content. To give the first evidence that such schedulers enhance student learning, we build KARL, a simple but effective content-aware student model employing deep knowledge tracing (DKT), retrieval, and BERT to predict student recall. We train KARL by collecting a new dataset of 123,143 study logs on diverse trivia questions. KARL bests existing student models in AUC and calibration error. To ensure our improved predictions lead to better student learning, we create a novel delta-based teaching policy to deploy KARL online. Based on 32 study paths from 27 users, KARL improves learning efficiency over SOTA, showing KARL's strength and encouraging researchers to look beyond historical study data to fully capture student abilities.
Nov 13 (Wed) 16:00-17:30 - Jasmine
LONGAGENT: Achieving Question Answering for 128k-Token-Long Documents through Multi-Agent Collaboration
Jun Zhao, Can Zu, Xu Hao, Yi Lu, Wei He, Yiwen Ding, Tao Gui, Qi Zhang, Xuanjing Huang
Large language models (LLMs) have achieved tremendous success in understanding language and processing text. However, question answering (QA) on lengthy documents faces challenges of resource constraints and a high propensity for errors, even for the most advanced models such as GPT-4 and Claude 2. In this paper, we introduce _LongAgent_, a multi-agent collaboration method that enables efficient and effective QA over 128k-token-long documents. _LongAgent_ adopts a _divide-and-conquer_ strategy, breaking down lengthy documents into shorter, more manageable text chunks. A leader agent comprehends the user's query and organizes the member agents to read their assigned chunks, reasoning out a final answer through multiple rounds of discussion. Due to members' hallucinations, it is difficult to guarantee that every response provided by each member is accurate. To address this, we develop an _inter-member communication_ mechanism that facilitates information sharing, allowing for the detection and mitigation of hallucinatory responses. Experimental results show that a LLaMA-2 7B model driven by _LongAgent_ can effectively support QA over 128k-token documents, achieving 16.42% and 1.63% accuracy gains over GPT-4 in single-hop and multi-hop QA settings, respectively.
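A minimal sketch of the divide-and-conquer step (chunk size and overlap are illustrative choices, not the paper's settings):

    def chunk_document(tokens, chunk_size=3000, overlap=200):
        """Split a long token sequence into overlapping chunks, each small
        enough for one member agent's context window."""
        step = chunk_size - overlap
        return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]

    doc = ["tok"] * 10000
    chunks = chunk_document(doc)
    print(len(chunks), len(chunks[0]))  # 4 chunks of up to 3000 tokens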
Nov 13 (Wed) 16:00-17:30 - Jasmine
Adaptive Question Answering: Enhancing Language Model Proficiency for Addressing Knowledge Conflicts with Source Citations
Sagi Shaier, Ari Kobren, Philip V. Ogren
Resolving knowledge conflicts is a crucial challenge in Question Answering (QA) tasks, as the internet contains numerous conflicting facts
and opinions. While some research has made progress in tackling ambiguous settings where multiple valid answers exist, these approaches
often neglect to provide source citations, leaving users to evaluate the factuality of each answer. On the other hand, existing work on citation
generation has focused on unambiguous settings with single answers, failing to address the complexity of real-world scenarios. Despite the
importance of both aspects, no prior research has combined them, leaving a significant gap in the development of QA systems. In this work,
we bridge this gap by proposing the novel task of QA with source citation in ambiguous settings, where multiple valid answers exist. To facil-
itate research in this area, we create a comprehensive framework consisting of: (1) five novel datasets, obtained by augmenting three existing
reading comprehension datasets with citation meta-data across various ambiguous settings, such as distractors and paraphrasing; (2) the first
ambiguous multi-hop QA dataset featuring real-world, naturally occurring contexts; (3) two new metrics to evaluate models' performance;
and (4) several strong baselines using rule-based, prompting, and finetuning approaches over five large language models. We hope that this
new task, datasets, metrics, and baselines will inspire the community to push the boundaries of QA research and develop more trustworthy
and interpretable systems.
Nov 13 (Wed) 16:00-17:30 - Jasmine
Generate-on-Graph: Treat LLM as both Agent and KG for Incomplete Knowledge Graph Question Answering
Yao Xu, Shizhu He, Jiabei Chen, Zihao Wang, Yangqiu Song, Hanghang Tong, Guang Liu, Jun Zhao, Kang Liu
To address the issues of insufficient knowledge and hallucination in Large Language Models (LLMs), numerous studies have explored inte-
grating LLMs with Knowledge Graphs (KGs). However, these methods are typically evaluated on conventional Knowledge Graph Question
Answering (KGQA) with complete KGs, where all factual triples required for each question are entirely covered by the given KG. In such
cases, LLMs primarily act as an agent to find answer entities within the KG, rather than effectively integrating the internal knowledge of LLMs
and external knowledge sources such as KGs. In fact, KGs are often too incomplete to cover all the knowledge required to answer questions.
To simulate these real-world scenarios and evaluate the ability of LLMs to integrate internal and external knowledge, we propose leveraging
LLMs for QA under Incomplete Knowledge Graph (IKGQA), where the provided KG lacks some of the factual triples for each question, and
construct corresponding datasets. To handle IKGQA, we propose a training-free method called Generate-on-Graph (GoG), which can gen-
erate new factual triples while exploring KGs. Specifically, GoG performs reasoning through a Thinking-Searching-Generating framework,
which treats LLM as both Agent and KG in IKGQA. Experimental results on two datasets demonstrate that our GoG outperforms all previous
methods.
RE-RAG: Improving Open-Domain QA Performance and Interpretability with Relevance Estimator in Retrieval-Augmented Gener-
ation
Kiseung Kim, Jay-Yoon Lee
The Retrieval Augmented Generation (RAG) framework utilizes a combination of parametric knowledge and external knowledge to demon-
strate state-of-the-art performance on open-domain question answering tasks. However, the RAG framework suffers from performance degra-
dation when the query is accompanied by irrelevant contexts. In this work, we propose the RE-RAG framework, which introduces a relevance
estimator (RE) that not only provides relative relevance between contexts, as previous rerankers did, but also provides a confidence score, which can be used to classify whether a given context is useful for answering the given question. We propose a weakly supervised method for training the RE that simply utilizes question-answer data without any labels for correct contexts. We show that an RE trained with a small generator (sLM) can not only improve the sLM fine-tuned together with it but also improve previously unreferenced large language models (LLMs). Furthermore, we investigate new decoding strategies that utilize the confidence measured by the RE, such as letting the user know that the question is unanswerable given the retrieved contexts, or choosing to rely on the LLM's parametric knowledge rather than unrelated contexts.
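A minimal sketch of such a confidence-guided decision (the thresholds are illustrative assumptions):

    def choose_strategy(context_confidences, low=0.2, high=0.5):
        """Route decoding based on the relevance estimator's confidence."""
        best = max(context_confidences)
        if best >= high:
            return "answer from retrieved contexts"
        if best >= low:
            return "fall back to parametric knowledge"
        return "report the question as unanswerable"

    print(choose_strategy([0.05, 0.10]))  # report the question as unanswerable
    print(choose_strategy([0.30, 0.65]))  # answer from retrieved contexts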
Nov 13 (Wed) 16:00-17:30 - Jasmine
ZEBRA: Zero-Shot Example-Based Retrieval Augmentation for Commonsense Question Answering
Francesco Maria Molfese, Simone Conia, Riccardo Orlando, Roberto Navigli
Current Large Language Models (LLMs) have shown strong reasoning capabilities in commonsense question answering benchmarks, but the
process underlying their success remains largely opaque. As a consequence, recent approaches have equipped LLMs with mechanisms for
knowledge retrieval, reasoning and introspection, not only to improve their capabilities but also to enhance the interpretability of their outputs.
However, these methods require additional training, hand-crafted templates or human-written explanations. To address these issues, we intro-
duce ZEBRA, a zero-shot question answering framework that combines retrieval, case-based reasoning and introspection and dispenses with
the need for additional training of the LLM. Given an input question, ZEBRA retrieves relevant question-knowledge pairs from a knowledge
base and generates new knowledge by reasoning over the relationships in these pairs. This generated knowledge is then used to answer the
input question, improving the model’s performance and interpretability. We evaluate our approach across 8 well-established commonsense
reasoning benchmarks, demonstrating that ZEBRA consistently outperforms strong LLMs and previous knowledge integration approaches,
achieving an average accuracy improvement of up to 4.5 points.
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented genera-
tion (RAG). However, existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for
medical knowledge queries, resulting in sub-optimal performance. To address these limitations, we propose a novel plug-and-play LLM-based
retrieval method called Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm. By
combining the reasoning capabilities of LLMs with the effectiveness of tree search, SeRTS boosts the zero-shot performance of retrieving
high-quality and informative results for RAG. We further enhance retrieval performance by fine-tuning LLMs with Proximal Policy Opti-
mization (PPO) objectives using the trajectories collected by SeRTS as feedback. Controlled experiments using the BioASQ-QA dataset with
GPT-3.5-Turbo and LLama2-7b demonstrate that our method significantly improves the performance of the BM25 retriever and surpasses the
strong baseline of self-reflection in both efficiency and scalability. Moreover, SeRTS generates higher-quality feedback for PPO training than
self-reflection. Our proposed method effectively adapts LLMs to document retrieval tasks, enhancing their ability to retrieve highly relevant
documents for RAG in the context of medical knowledge queries. This work presents a significant step forward in leveraging LLMs for
accurate and comprehensive biomedical question answering.
Nov 13 (Wed) 16:00-17:30 - Jasmine
Retrieving Contextual Information for Long-Form Question Answering using Weak Supervision
Philipp Christmann, Svitlana Vakulenko, Ionut Teodor Sorodoc, Adrià de Gispert, Bill Byrne
Long-form question answering (LFQA) aims at generating in-depth answers to end-user questions, providing relevant information beyond
the direct answer. However, existing retrievers are typically optimized towards information that directly targets the question, missing out on
such contextual information. Furthermore, there is a lack of training data for relevant context. To this end, we propose and compare different
weak supervision techniques to optimize retrieval for contextual information. Experiments demonstrate improvements on the end-to-end QA
performance on ASQA, a dataset for long-form question answering. Importantly, as more contextual information is retrieved, we improve the
relevant page recall for LFQA by 14.7% and the groundedness of generated long-form answers by 12.5%. Finally, we show that long-form
answers often anticipate likely follow-up questions, via experiments on a conversational QA dataset.
Nov 13 (Wed) 16:00-17:30 - Jasmine
Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA
Wenyu Huang, Guancheng Zhou, Hongru WANG, Pavlos Vougiouklis, Mirella Lapata, Jeff Z. Pan
Retrieval-Augmented Generation (RAG) is widely used to inject external non-parametric knowledge into large language models (LLMs). Re-
cent works suggest that Knowledge Graphs (KGs) contain valuable external knowledge for LLMs. Retrieving information from KGs differs
from extracting it from document sets. Most existing approaches seek to directly retrieve relevant subgraphs, thereby eliminating the need
for extensive SPARQL annotations, traditionally required by semantic parsing methods. In this paper, we model the subgraph retrieval task
as a conditional generation task handled by small language models. Specifically, we define a subgraph identifier as a sequence of relations,
each represented as a special token stored in the language models. Our base generative subgraph retrieval model, consisting of only 220M
parameters, achieves competitive retrieval performance compared to state-of-the-art models relying on 7B parameters, demonstrating that
small language models are capable of performing the subgraph retrieval task. Furthermore, our largest 3B model, when plugged with an LLM
reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks. Our model and data will be made available
online: https://github.com/hwy9855/GSR.
Nov 13 (Wed) 16:00-17:30 - Jasmine
SaSR-Net: Source-Aware Semantic Representation Network for Enhancing Audio-Visual Question Answering
Tianyu Yang, Yiyang Nan, Lisen Dai, Zhenwen Liang, Yapeng Tian, Xiangliang Zhang
Audio-Visual Question Answering (AVQA) is a challenging task that involves answering questions based on both auditory and visual infor-
mation in videos. A significant challenge is interpreting complex multi-modal scenes, which include both visual objects and sound sources,
and connecting them to the given question. In this paper, we introduce the Source-aware Semantic Representation Network (SaSR-Net), a
novel model designed for AVQA. SaSR-Net utilizes source-wise learnable tokens to efficiently capture and align audio-visual elements with
the corresponding question. It streamlines the fusion of audio and visual information using spatial and temporal attention mechanisms to iden-
tify answers in multi-modal scenes. Extensive experiments on the Music-AVQA and AVQA-Yang datasets show that SaSR-Net outperforms
state-of-the-art AVQA methods. We will release our source code and pre-trained models.
Nov 13 (Wed) 16:00-17:30 - Jasmine
SPINACH: SPARQL-Based Information Navigation for Challenging Real-World Questions
Shicheng Liu, Sina Semnani, Harold Triedman, Jialiang Xu, Isaac Dan Zhao, Monica Lam
Large Language Models (LLMs) have led to significant improvements in the Knowledge Base Question Answering (KBQA) task. However,
datasets used in KBQA studies do not capture the true complexity of KBQA tasks. They either have simple questions, use synthetically gen-
erated logical forms, or are based on small knowledge base (KB) schemas. We introduce the SPINACH dataset, an expert-annotated KBQA
dataset collected from discussions on Wikidata’s "Request a Query" forum with 320 decontextualized question-SPARQL pairs. The com-
plexity of these in-the-wild queries calls for a KBQA system that can dynamically explore large and often incomplete schemas and reason
about them, as it is infeasible to create a comprehensive training dataset. We also introduce an in-context learning KBQA agent, also called
SPINACH, that mimics how a human expert would write SPARQL queries to handle challenging questions. SPINACH achieves a new state of the art on the QALD-7, QALD-9 Plus, and QALD-10 datasets, improving F1 by 31.0%, 27.0%, and 10.0%, respectively, and comes within 1.6% of the fine-tuned LLaMA SOTA model on WikiWebQuestions. On our new SPINACH dataset, the SPINACH agent outperforms all baselines, including the best GPT-4-based KBQA agent, by at least 38.1% in F1.
Chain of Condition: Construct, Verify and Solve Conditions for Conditional Question Answering
Jiuheng Lin, Yuxuan Lai, Yansong Feng
Conditional question answering (CQA) is an important task that aims to find probable answers and identify missing conditions. Existing
approaches struggle with CQA due to two challenges: (1) precisely identifying necessary conditions and their logical relationships, and (2) verifying conditions to detect any that are missing. In this paper, we propose a novel prompting approach, chain of condition, which first identifies all conditions and constructs their logical relationships explicitly according to the document, then verifies whether these conditions are satisfied, and finally solves the logical expression to indicate any missing conditions and generates the answer accordingly. Experiments
on two CQA benchmark datasets show our chain of condition outperforms existing prompting baselines, establishing a new state of the art.
Furthermore, with only a few examples, our method can facilitate GPT-3.5-Turbo or GPT-4 to outperform all existing supervised models.
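A minimal sketch of the final solve step, once an LLM has extracted and verified conditions (the condition names and document logic below are invented for illustration):

    # Hypothetical verified conditions and document logic: the answer
    # applies if resident AND (filed form A OR paid the fee).
    conditions = {"is_resident": True, "filed_form_a": False, "paid_fee": True}
    satisfied = conditions["is_resident"] and (
        conditions["filed_form_a"] or conditions["paid_fee"])
    missing = [name for name, ok in conditions.items() if not ok]
    print(satisfied, "unmet conditions:", missing)
    # True unmet conditions: ['filed_form_a']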
tions. We introduce the AMS parser, a compositional, neurosymbolic semantic parser for DRT. It rests on a novel mechanism for predicting
quantifier scope. We show that the AMS parser reliably produces well-formed outputs and performs well on DRT parsing, especially on
complex sentences.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
Prisha Samdarshi, Mariam Mustafa, Anushka Kulkarni, Raven Rothkopf, Tuhin Chakrabarty, Smaranda Muresan
The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 438 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our
results show that even the best-performing LLM, Claude 3.5 Sonnet, which has otherwise shown impressive reasoning abilities on a wide
variety of benchmarks, can only fully solve 18% of the games. Novice and expert players perform better than Claude 3.5 Sonnet, with expert
human players significantly outperforming it. We create a taxonomy of the knowledge types required to successfully cluster and categorize
words in the Connections game. We find that while LLMs are decent at categorizing words based on semantic relations, they struggle with
other types of knowledge such as Encyclopedic Knowledge, Multiword Expressions or knowledge that combines both Word Form and Mean-
ing. Our results establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities
in humans and AI systems.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Transferability of Syntax-Aware Graph Neural Networks in Zero-Shot Cross-Lingual Semantic Role Labeling
Rachel Sidney Devianti, Yusuke Miyao
Recent models in cross-lingual semantic role labeling (SRL) barely analyze the applicability of their network selection. We believe that network selection is important since it affects the transferability of cross-lingual models, i.e., how the model can extract universal features from source languages to label target languages. Therefore, we comprehensively compare the transferability of different graph neural network (GNN)-based models enriched with universal dependency trees. GNN-based models include transformer-based, graph convolutional network-based, and graph attention network (GAT)-based models. We focus our study on a zero-shot setting by training the models in English and evaluating the models in 23 target languages provided by the Universal Proposition Bank. Based on our experiments, we consistently show that syntax from universal dependency trees is essential for cross-lingual SRL models to achieve better transferability. Dependency-aware self-attention with relative position representations (SAN-RPRs) transfers best across languages, especially in the long-range dependency distance. We also show that dependency-aware two-attention relational GATs transfer better than SAN-RPRs in languages where most arguments lie within a 1-2 dependency distance.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
The Daunting Dilemma with Sentence Encoders: Glowing on Standard Benchmarks, Struggling with Capturing Basic Semantic
Properties
Yash Mahajan, Naman Bansal, Eduardo Blanco, Santu Karmaker
Sentence embeddings play a pivotal role in a wide range of NLP tasks, yet evaluating and interpreting these real-valued vectors remains an
open challenge to date, especially in a task-free setting. To address this challenge, we introduce a novel task-free test bed for evaluating
and interpreting sentence embeddings. Our test bed consists of five semantic similarity alignment criteria, namely, *semantic distinction,
synonym replacement, antonym replacement, paraphrasing without negation, and sentence jumbling*. Using these criteria, we examined
five classical (e.g., Sentence-BERT, Universal Sentence Encoder (USE), etc.) and eight LLM-induced sentence embedding techniques (e.g.,
LLaMA2, GPT-3, OLMo, etc.) to test whether their semantic similarity spaces align with what a human mind would naturally expect. Our
extensive experiments with 13 different sentence encoders revealed that none of the studied embeddings aligned with all five semantic
similarity alignment criteria. Yet, most encoders performed highly on the SentEval dataset, a popular task-specific benchmark. This finding
demonstrates a significant limitation of the current practice in sentence embedding evaluation and associated popular benchmarks, a critical
issue that needs careful attention and reassessment by the NLP community. Finally, we conclude the paper by highlighting the utility of the
proposed alignment-based test bed for analyzing sentence embeddings in a novel way, especially in a task-free setting.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
CONTOR: Benchmarking Strategies for Completing Ontologies with Plausible Missing Rules
Na Li, Thomas Bailleux, Zied Bouraoui, Steven Schockaert
We consider the problem of finding plausible rules that are missing from a given ontology. A number of strategies for this problem have
already been considered in the literature. Little is known about the relative performance of these strategies, however, as they have thus far
been evaluated on different ontologies. Moreover, existing evaluations have focused on distinguishing held-out ontology rules from randomly
corrupted ones, which often makes the task unrealistically easy and leads to the presence of incorrectly labelled negative examples. To address
these concerns, we introduce a benchmark with manually annotated hard negatives and use this benchmark to evaluate ontology completion
models. In addition to previously proposed models, we test the effectiveness of several approaches that have not yet been considered for this
task, including LLMs and simple but effective hybrid strategies.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Are ELECTRA’s Sentence Embeddings Beyond Repair? The Case of Semantic Textual Similarity
Ivan Rep, David Dukić, Jan Šnajder
While BERT produces high-quality sentence embeddings, its pre-training computational cost is a significant drawback. In contrast, ELEC-
TRA provides a cost-effective pre-training objective and downstream task performance improvements, but worse sentence embeddings. The
community tacitly stopped utilizing ELECTRA’s sentence embeddings for semantic textual similarity (STS). We notice a significant drop in
performance for the ELECTRA discriminator’s last layer in comparison to prior layers. We explore this drop and propose a way to repair
the embeddings using a novel truncated model fine-tuning (TMFT) method. TMFT improves the Spearman correlation coefficient by over 8
points while increasing parameter efficiency on the STS Benchmark. We extend our analysis to various model sizes, languages, and two other
tasks. Further, we discover the surprising efficacy of ELECTRA’s generator model, which performs on par with BERT, using significantly
fewer parameters and a substantially smaller embedding size. Finally, we observe boosts by combining TMFT with word similarity or domain
adaptive pre-training.
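A minimal sketch of the truncation idea using Hugging Face Transformers (the layer index is an illustrative choice; TMFT additionally fine-tunes the truncated model, which is omitted here):

    import torch
    from transformers import AutoModel, AutoTokenizer

    name = "google/electra-base-discriminator"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name, output_hidden_states=True)

    def embed(sentence, layer=8):
        """Mean-pool token states from an intermediate layer instead of
        the degraded last layer."""
        inputs = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[layer]  # (1, seq, dim)
        return hidden.mean(dim=1).squeeze(0)

    print(embed("Embeddings from an intermediate ELECTRA layer.").shape)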
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
SpeciaLex: A Benchmark for In-Context Specialized Lexicon Learning
Joseph Marvin Imperial, Harish Tayyar Madabushi
Specialized lexicons are collections of words with associated constraints such as special definitions, specific roles, and intended target audi-
ences. These constraints are necessary for content generation and documentation tasks (e.g., writing technical manuals or children’s reading
materials), where the goal is to reduce the ambiguity of text content and increase its overall readability for a specific audience.
Understanding how large language models can capture these constraints can help researchers build better, more impactful tools for wider use
beyond the NLP community. Towards this end, we introduce SpeciaLex, a benchmark for evaluating a language model’s ability to follow
specialized lexicon-based constraints across 18 diverse subtasks with 1,785 test instances covering core tasks of Checking, Identification,
Rewriting, and Open Generation. We present an empirical evaluation of 15 open and closed-source LLMs and discuss insights on how factors
such as model scale, openness, setup, and recency affect performance when evaluated with the benchmark.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Monotonic Paraphrasing Improves Generalization of Language Model Prompting
Qin Liu, Fei Wang, Nan Xu, Tianyi Lorena Yan, Tao Meng, Muhao Chen
Performance of large language models (LLMs) may vary with different prompts or instructions for even the same task. One commonly recog-
nized factor for this phenomenon is the model’s familiarity with the given prompt or instruction, which is typically estimated by its perplexity.
However, finding the prompt with the lowest perplexity is challenging, given the enormous space of possible prompting phrases. In this
paper, we propose monotonic paraphrasing (MonoPara), an end-to-end decoding strategy that paraphrases given prompts or instructions into
their lower perplexity counterparts based on an ensemble of a paraphrase LM for prompt (or instruction) rewriting, and a target LM (i.e. the
prompt or instruction executor) that constrains the generation for lower perplexity. The ensemble decoding process can efficiently paraphrase
the original prompt without altering its semantic meaning, while monotonically decreasing the perplexity of each generation as calculated by
the target LM. We explore in detail both greedy and search-based decoding as two alternative decoding schemes of MonoPara. Notably,
MonoPara does not require any training and can monotonically lower the perplexity of the paraphrased prompt or instruction, leading to im-
proved performance of zero-shot LM prompting as evaluated on a wide selection of tasks. In addition, MonoPara is also shown to effectively
improve LMs’ generalization on perturbed and unseen task instructions.
Nov 13 (Wed) 16:00-17:30 - Riverfront Hall
Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition
Candida Maria Greco, Lucio La Cava, Andrea Tagarelli
Verbs form the backbone of language, providing the structure and meaning to sentences. Yet, their intricate semantic nuances pose a long-
standing challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings
and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment re-
lations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases,
namely WordNet and HyperLex. Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance, although at varying degrees of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models' performance. However, perfectly solving the task remains an unmet challenge for all examined LLMs, which calls for further research on this topic.
TACL + CL
Nov 13 (Wed) 16:00-17:30 - Room: Jasmine
evaluation) relying on their correlation with human judgments. However, conventional meta-evaluations in English GEC encounter several
challenges including biases caused by inconsistencies in evaluation granularity, and an outdated setup using classical systems. These problems
can lead to misinterpretation of metrics and potentially hinder the applicability of GEC techniques. To address these issues, this paper pro-
poses SEEDA, a new dataset for GEC meta-evaluation. SEEDA consists of corrections with human ratings along two different granularities:
edit-based and sentence-based, covering 12 state-of-the-art systems including large language models (LLMs), and two human corrections
with different focuses. The improved correlations obtained by aligning the granularity in the sentence-level meta-evaluation suggest that
edit-based metrics may have been underestimated in existing studies. Furthermore, correlations of most metrics decrease when changing from
classical to neural systems, indicating that traditional metrics are relatively poor at evaluating fluently corrected sentences with many edits.
Nov 13 (Wed) 16:00-17:30 - Jasmine
Investigating Hallucinations in Pruned Large Language Models for Abstractive Summarization
Nikolaos Aletras, Zhixue Zhao, George Chrysostomou, Miles Williams
Despite the remarkable performance of generative large language models (LLMs) on abstractive summarization, they face two significant
challenges: their considerable size and tendency to hallucinate. Hallucinations are concerning because they erode reliability and raise safety
issues. Pruning is a technique that reduces model size by removing redundant weights, enabling more efficient sparse inference. Pruned
models yield downstream task performance comparable to the original, making them ideal alternatives when operating on a limited budget.
However, the effect that pruning has upon hallucinations in abstractive summarization with LLMs has yet to be explored. In this paper, we provide an extensive empirical study across five summarization datasets, two state-of-the-art pruning methods, and five instruction-tuned LLMs. Surprisingly, we find that hallucinations are less prevalent in pruned LLMs than in the original models. Our analysis suggests that
pruned models tend to depend more on the source document for summary generation. This leads to a higher lexical overlap between the
generated summary and the source document, which could be a reason for the reduction in hallucination risk.
Nov 13 (Wed) 16:00-17:30 - Jasmine
Do LLMs Exhibit Human-like Response Biases? A Case Study in Survey Design
Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkar, Graham Neubig
One widely-cited barrier to the adoption of LLMs as proxies for humans in subjective tasks is their sensitivity to prompt wording, but interest-
ingly, humans also display sensitivities to instruction changes in the form of response biases. We investigate the extent to which LLMs reflect
human response biases, if at all. We look to survey design, where human response biases caused by changes in the wordings of "prompts"
have been extensively explored in social psychology literature. Drawing from these works, we design a dataset and framework to evaluate
whether LLMs exhibit human-like response biases in survey questionnaires. Our comprehensive evaluation of nine models shows that popular
open and commercial LLMs generally fail to reflect human-like behavior, particularly in models that have undergone RLHF. Furthermore,
even if a model shows a significant change in the same direction as humans, we find that they are sensitive to perturbations that do not elicit
significant changes in humans. These results highlight the pitfalls of using LLMs as human proxies, and underscore the need for finer-grained
characterizations of model behavior.
Demo 6
Nov 14 (Thu) 10:30-12:00 - Room: Riverfront Hall
datasets, such as the ASL Citizen dataset, are great resources for improving accessibility and preserving linguistic diversity, but they must
be used thoughtfully to avoid reinforcing existing biases. In this work, we utilize the rich information about participant demographics and
lexical features present in the ASL Citizen dataset to study and document the biases that may result from models trained on crowd-sourced
sign datasets. Further, we apply several bias mitigation techniques during model training, and find that these techniques reduce performance
disparities without decreasing accuracy. With the publication of this work, we release the demographic information about the participants in
the ASL Citizen dataset to encourage future bias mitigation work in this space.
Nov 14 (Thu) 10:30-12:00 - Jasmine
A Study of Nationality Bias in Names and Perplexity using Off-the-Shelf Affect-related Tweet Classifiers
Valentin Barriere, Sebastian Cifuentes
In this paper, we apply a method to quantify biases associated with named entities from various countries. We create counterfactual exam-
ples with small perturbations on target-domain data instead of relying on templates or specific datasets for bias detection. On widely used
classifiers for subjectivity analysis, including sentiment, emotion, hate speech, and offensive text using Twitter data, our results demonstrate
positive biases related to the language spoken in a country across all classifiers studied. Notably, the presence of certain country names in
a sentence can strongly influence predictions, up to a 23% change in hate speech detection and up to a 60% change in the prediction of
negative emotions such as anger. We hypothesize that these biases stem from the training data of pre-trained language models (PLMs) and
find correlations between affect predictions and PLM likelihood in English and unknown languages like Basque and Maori, revealing distinct
patterns with exacerbated correlations. Further, we traced these correlations between counterfactual examples derived from the same sentence
to remove the syntactic component, uncovering results suggesting that the impact of the pre-training data was stronger for
English-speaking-country names.
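The counterfactual-perturbation setup the abstract describes can be illustrated with a small sketch: swap a country or name slot in otherwise identical texts and measure how much an off-the-shelf classifier's score moves. The classifier, placeholder, and name list below are illustrative stand-ins, not the authors' experimental setup.

```python
# Illustrative counterfactual perturbation for probing nationality bias.
# `classify` is any off-the-shelf scorer; names/texts are toy placeholders.
def counterfactuals(text, placeholder, names):
    """Yield one variant of `text` per candidate name."""
    for name in names:
        yield name, text.replace(placeholder, name)

def prediction_shift(classify, text, placeholder, names):
    """Scores per name and the max spread across name swaps."""
    scores = {name: classify(variant)
              for name, variant in counterfactuals(text, placeholder, names)}
    spread = max(scores.values()) - min(scores.values())
    return scores, spread

if __name__ == "__main__":
    # Toy stand-in for an off-the-shelf affect classifier.
    def toy_classifier(text):
        return 0.9 if "CountryA" in text else 0.3

    template = "People from [COUNTRY] are protesting downtown."
    scores, spread = prediction_shift(
        toy_classifier, template, "[COUNTRY]", ["CountryA", "CountryB"])
    print(scores, spread)   # a large spread signals name-driven bias
```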
Nov 14 (Thu) 10:30-12:00 - Jasmine
"You Gotta be a Doctor, Lin" : An Investigation of Name-Based Bias of Large Language Models in Employment Recommendations
Huy Nghiem, John Prindle, Jieyu Zhao, Hal Daumé III
Social science research has shown that candidates with names indicative of certain races or genders often face discrimination in employment
practices. Similarly, Large Language Models (LLMs) have demonstrated racial and gender biases in various applications. In this study, we
utilize GPT-3.5-Turbo and Llama 3-70B-Instruct to simulate hiring decisions and salary recommendations for candidates with 320 first names
that strongly signal their race and gender, across over 750,000 prompts. Our empirical results indicate a preference among these models
for hiring candidates with White female-sounding names over other demographic groups across 40 occupations. Additionally, even among
candidates with identical qualifications, salary recommendations vary by as much as 5% between different subgroups. A comparison with
real-world labor data reveals inconsistent alignment with U.S. labor market characteristics, underscoring the necessity of risk investigation of
LLM-powered systems.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Resampled Datasets Are Not Enough: Mitigating Societal Bias Beyond Single Attributes
Yusuke Hirota, Jerone Andrews, Dora Zhao, Orestis Papakyriakopoulos, Apostolos Modas, Yuta Nakashima, Alice Xiang
We tackle societal bias in image-text datasets by removing spurious correlations between protected groups and image attributes. Traditional
methods only target labeled attributes, ignoring biases from unlabeled ones. Using text-guided inpainting models, our approach ensures
protected group independence from all attributes and mitigates inpainting biases through data filtering. Evaluations on multi-label image clas-
sification and image captioning tasks show our method effectively reduces bias without compromising performance across various models.
Specifically, we achieve an average societal bias reduction of 46.1% in leakage-based bias metrics for multi-label classification and 74.8% for
image captioning.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
Jinchuan Zhang, Yan Zhou, Yaxin Liu, Ziming Li, Songlin Hu
Automated red teaming is an effective method for identifying misaligned behaviors in large language models (LLMs). Existing approaches,
however, often focus primarily on improving attack success rates while overlooking the need for comprehensive test case coverage. Addition-
ally, most of these methods are limited to single-turn red teaming, failing to capture the multi-turn dynamics of real-world human-machine
interactions. To overcome these limitations, we propose HARM (Holistic Automated Red teaMing), which scales up
the diversity of test cases using a top-down approach based on an extensible, fine-grained risk taxonomy. Our method also leverages a novel
fine-tuning strategy and reinforcement learning techniques to facilitate multi-turn adversarial probing in a human-like manner. Experimental
results demonstrate that our framework enables a more systematic understanding of model vulnerabilities and offers more targeted guidance
for the alignment process.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Large Language Models Can Be Contextual Privacy Protection Learners
Yijia Xiao, Yiqiao Jin, Yushi Bai, Yue Wu, Xianjun Yang, Xiao Luo, Wenchao Yu, Xujiang Zhao, Yanchi Liu, Quanquan Gu, Haifeng Chen, Wei
Wang, Wei Cheng
The proliferation of Large Language Models (LLMs) has driven considerable interest in fine-tuning them with domain-specific data to create
specialized language models. Nevertheless, such domain-specific fine-tuning data often contains contextually sensitive personally identifi-
able information (PII). Directly fine-tuning LLMs on this data without privacy protection poses a risk of leaking sensitive PII during
inference time. To address this challenge, we introduce Contextual Privacy Protection Language Models (CPPLM), a novel paradigm for
fine-tuning LLMs that effectively injects domain-specific knowledge while safeguarding inference-time data privacy. Our work offers a theo-
retical analysis for model design and delves into various techniques such as corpus curation, penalty-based unlikelihood in training loss, and
instruction-based tuning, etc. Extensive experiments across diverse datasets and scenarios demonstrate the effectiveness of our approaches.
In particular, instruction tuning with both positive and negative examples stands out as a promising method, effectively protecting private
data while enhancing the model's knowledge. Our work underscores the potential of Large Language Models as robust contextual privacy
protection learners.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Applying Intrinsic Debiasing on Downstream Tasks: Challenges and Considerations for Machine Translation
Bar Iluz, Yanai Elazar, Asaf Yehudai, Gabriel Stanovsky
Most works on gender bias focus on intrinsic bias — removing traces of information about a protected group from the model’s internal
representation. However, these works are often disconnected from the impact of such debiasing on downstream applications, which is the
main motivation for debiasing in the first place. In this work, we systematically test how methods for intrinsic debiasing affect neural ma-
chine translation models, by measuring the extrinsic bias of such systems under different design choices. We highlight three challenges and
mismatches between the debiasing techniques and their end-goal usage, including the choice of embeddings to debias, the mismatch between
words and sub-word tokens debiasing, and the effect on different target languages. We find that these considerations have a significant impact
and strong learning ability of LLMs to harm the end-user: the backdoor. We demonstrate that LLMs can capture the combinational backdoor
representation. Only upon presentation of triggers together does the backdoor activate. We also verify empirically that this representation is
invariant to the position of the trigger utterance. Subsequently, inserting a single extra token into any two utterances of 5% of the data can
cause over 99% Attack Success Rate (ASR). Our results with 3 triggers demonstrate that this framework is generalizable, compatible with any
trigger in an adversary's toolbox in a plug-and-play manner. Defending against the backdoor can be challenging in the conversational setting because
of the large input and output space. Our analysis indicates that the distributed backdoor exacerbates the current challenges by polynomially
increasing the dimension of the attacked input space. Canonical textual defenses like ONION and BKI leverage auxiliary model forward
passes over individual tokens, scaling exponentially with the input sequence length and struggling to maintain computational feasibility. To
this end, we propose a decoding time defense - decayed contrastive decoding - that scales linearly with the assistant response sequence length
and reduces the backdoor to as low as 0.35%.
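The abstract does not spell out the exact form of decayed contrastive decoding; one plausible reading is a per-step logit contrast whose weight decays with position in the assistant response, keeping the total cost linear in response length. The sketch below assumes that reading; `expert_logits` and `amateur_logits` are hypothetical next-token logits from the defended and reference models.

```python
# Hedged sketch: contrast the defended model's logits against a reference
# ("amateur") model, with the contrast strength decaying over the response.
import numpy as np

def decayed_contrastive_logits(expert_logits, amateur_logits, t,
                               alpha0=1.0, decay=0.9):
    """Adjusted next-token logits at response position t."""
    alpha = alpha0 * (decay ** t)          # decayed contrast weight
    return expert_logits - alpha * amateur_logits

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    expert, amateur = rng.normal(size=50), rng.normal(size=50)
    for t in (0, 5, 20):                   # contrast fades as t grows
        print(t, int(np.argmax(decayed_contrastive_logits(expert, amateur, t))))
```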
Nov 14 (Thu) 10:30-12:00 - Jasmine
A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models
Ashutosh Sathe, Prachi Jain, Sunayana Sitaram
Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified
framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation encompasses all
supported inference modes of the recent VLMs, including image-to-text, text-to-text, text-to-image, and image-to-image. We create a syn-
thetic, high-quality dataset comprising text and images that intentionally obscure gender, race, and age distinctions across various professions.
The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in vision-language
models (VLMs). In our benchmarking of popular VLMs, we observe that different input-output modalities result
in distinct bias magnitudes and directions. We hope our work will help guide future progress in improving VLMs to learn socially unbiased
representations. We will release our data and code.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Hasan Abed Al Kader Hammoud, Umberto Michieli, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem, Mete Ozay
Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model,
retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging,
leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model
merging techniques, demonstrating that existing methods not only transfer domain expertise but also propagate misalignment. We propose
a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these
generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill
that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during
merging, resulting in models that excel in both domain expertise and alignment.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Re-examining Sexism and Misogyny Classification with Annotator Attitudes
Aiqi Jiang, Nikolas Vitsakis, Tanvi Dinkar, Gavin Abercrombie, Ioannis Konstas
Gender-Based Violence (GBV) is an increasing problem online, but existing datasets fail to capture the plurality of possible annotator per-
spectives or ensure the representation of affected groups. We revisit two important stages in the moderation pipeline for GBV: (1) manual
data labelling; and (2) automated classification. For (1), we examine two datasets to investigate the relationship between annotator identities
and attitudes and the responses they give to two GBV labelling tasks. To this end, we collect demographic and attitudinal information from
crowd-sourced annotators using three validated surveys from Social Psychology. We find that higher Right Wing Authoritarianism scores are
associated with a higher propensity to label text as sexist, while for Social Dominance Orientation and Neosexist Attitudes, higher scores are
associated with a lower tendency to do so. For (2), we conduct classification experiments using Large Language Models and five prompting
strategies, including infusing prompts with annotator information. We find: (i) annotator attitudes affect the ability of classifiers to predict
their labels; (ii) including attitudinal information can boost performance when we use well-structured brief annotator descriptions; and (iii)
models struggle to reflect the increased complexity and imbalanced classes of the new label sets.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Generating and Evaluating Synthetic Data for Privacy Preservation in High-Stakes Domains
Krithika Ramesh, Nupoor Gandhi, Pulkit Madaan, Lisa Bauer, Charith Peris, Anjalie Field
The difficulty of anonymizing text data hinders the development and deployment of NLP in high-stakes domains that involve private data, such
as healthcare and social services. Poorly anonymized sensitive data cannot be easily shared with annotators or external researchers, nor can it
be used to train public models. In this work, we explore the feasibility of using synthetic data generated from differentially private language
models in place of real data to facilitate the development of NLP in these domains without compromising privacy. In contrast to prior work,
we generate synthetic data for real high-stakes domains, and we propose and conduct use-inspired evaluations to assess data quality. Our
results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data. Overall, our
work underscores the need for further improvements to synthetic data generation for it to be a viable way to enable privacy-preserving data
sharing.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Evaluating Gender Bias of LLMs in Making Morality Judgements
Divij Bajaj, Yuanyuan Lei, Jonathan Tong, Ruihong Huang
Large Language Models (LLMs) have shown remarkable capabilities in a multitude of Natural Language Processing (NLP) tasks. However,
these models are still not immune to limitations such as social biases, especially gender bias. This work investigates whether current closed
and open-source LLMs possess gender bias, especially when asked to give moral opinions. To evaluate these models, we curate and introduce
a new dataset GenMO (Gender-bias in Morality Opinions) comprising parallel short stories featuring male and female characters respectively.
Specifically, we test models from the GPT family (GPT-3.5-turbo, GPT-3.5-turbo-instruct, GPT-4-turbo), Llama 3 and 3.1 families (8B/70B),
Mistral-7B and Claude 3 families (Sonnet and Opus). Surprisingly, despite employing safety checks, all production-standard models we
tested display significant gender bias, with GPT-3.5-turbo giving biased opinions in 24% of the samples. All models consistently
favour female characters, with GPT showing bias in 68-85% of cases and Llama 3 in around 81-85% of instances. Our study further
investigates the impact of model parameters on gender bias and explores real-world situations where LLMs reveal biases in moral decision-making.
Nov 14 (Thu) 10:30-12:00 - Jasmine
JobFair: A Framework for Benchmarking Gender Hiring Bias in Large Language Models
Ze Wang, Zekun Wu, Xin Guan, Michael Thaler, Adriano Koshiyama, Skylar Lu, Sachin Beepath, Ediz Ertekin, Maria Perez-Ortiz
The use of Large Language Models (LLMs) in hiring has led to legislative actions to protect vulnerable demographic groups. This paper
presents a novel framework for benchmarking hierarchical gender hiring bias in Large Language Models (LLMs) for resume scoring, reveal-
ing significant issues of reverse gender hiring bias and overdebiasing. Our contributions are fourfold: Firstly, we introduce a new construct
grounded in labour economics, legal principles, and critiques of current bias benchmarks: hiring bias can be categorized into two types: Level
bias (difference in the average outcomes between demographic counterfactual groups) and Spread bias (difference in the variance of outcomes
between demographic counterfactual groups); Level bias can be further subdivided into statistical bias (i.e. changing with non-demographic
content) and taste-based bias (i.e. consistent regardless of non-demographic content). Secondly, the framework includes rigorous statistical
and computational hiring bias metrics, such as Rank After Scoring (RAS), Rank-based Impact Ratio, Permutation Test, and Fixed Effects
Model. Thirdly, we analyze gender hiring biases in ten state-of-the-art LLMs. Seven out of ten LLMs show significant biases against males
in at least one industry. An industry-effect regression reveals that the healthcare industry is the most biased against males. Moreover, we
found that the bias performance remains invariant with resume content for eight out of ten LLMs. This indicates that the bias performance
measured in this paper might apply to other resume datasets with different resume qualities. Fourthly, we provide a user-friendly demo and
resume dataset to support the adoption and practical use of the framework, which can be generalized to other social traits and tasks.
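A minimal sketch of one of the statistical tools named above: a permutation test for level bias, i.e. the difference in mean scores between demographic counterfactual groups. The scores are toy values, not data from the paper.

```python
# Toy permutation test for level bias: difference in mean scores between
# two demographic counterfactual groups of resumes.
import random

def permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Two-sided p-value for the observed difference in group means."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    pooled = list(scores_a) + list(scores_b)
    k, hits = len(scores_a), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:k], pooled[k:]
        if abs(sum(a) / len(a) - sum(b) / len(b)) >= observed:
            hits += 1
    return hits / n_perm

if __name__ == "__main__":
    male = [7.1, 6.8, 7.4, 6.9, 7.0]        # toy LLM resume scores
    female = [7.6, 7.3, 7.8, 7.2, 7.5]
    print(permutation_test(male, female))   # small p-value -> level bias
```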
Nov 14 (Thu) 10:30-12:00 - Jasmine
Multi-Stage Balanced Distillation: Addressing Long-Tail Challenges in Sequence-Level Knowledge Distillation
Yuhang Zhou, Jing Zhu, Paiheng Xu, Xiaoyu Liu, Xiyao Wang, Danai Koutra, Wei Ai, Furong Huang
Large language models (LLMs) have significantly advanced various natural language processing tasks, but deploying them remains compu-
tationally expensive. Knowledge distillation (KD) is a promising solution, enabling the transfer of capabilities from larger teacher LLMs to
more compact student models. Particularly, sequence-level KD, which distills rationale-based reasoning processes instead of merely final out-
comes, shows great potential in enhancing students’ reasoning capabilities. However, current methods struggle with sequence-level KD under
long-tailed data distributions, adversely affecting generalization on sparsely represented domains. We introduce the Multi-Stage Balanced
Distillation (BalDistill) framework, which iteratively balances training data within a fixed computational budget. By dynamically selecting
representative head domain examples and synthesizing tail domain examples, BalDistill achieves state-of-the-art performance across diverse
long-tailed datasets, enhancing both the efficiency and efficacy of the distilled models.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Granularity is crucial when applying differential privacy to text
Doan Nam Long Vu, Timour Igamberdiev, Ivan Habernal
Applying differential privacy (DP) by means of the DP-SGD algorithm to protect individual data points during training is becoming increas-
ingly popular in NLP. However, the choice of granularity at which DP is applied is often neglected. For example, neural machine translation
(NMT) typically operates on the sentence-level granularity. From the perspective of DP, this setup assumes that each sentence belongs to a
single person and any two sentences in the training dataset are independent. This assumption is however violated in many real-world NMT
datasets, e.g., those including dialogues. For proper application of DP we thus must shift from sentences to entire documents. In this paper,
we investigate NMT at both the sentence and document levels, analyzing the privacy/utility trade-off for both scenarios, and evaluating the
risks of not using the appropriate privacy granularity in terms of leaking personally identifiable information (PII). Our findings indicate that
the document-level NMT system is more resistant to membership inference attacks, emphasizing the significance of using the appropriate
granularity when working with DP.
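The granularity choice can be made concrete with a small sketch: before per-unit gradient clipping in DP-SGD, training sentences are grouped into the chosen privacy unit. At sentence level every sentence is its own unit; at document level all sentences sharing a document id are clipped together. This is a schematic of the bookkeeping only, not the authors' code.

```python
# Sketch: grouping training sentences into DP "privacy units" prior to
# per-unit gradient clipping in DP-SGD.
from collections import defaultdict

def privacy_units(records, granularity="document"):
    """records: list of (doc_id, sentence) pairs. Each returned unit is the
    set of sentences whose combined gradient would be clipped together."""
    if granularity == "sentence":
        return [[sentence] for _, sentence in records]
    groups = defaultdict(list)
    for doc_id, sentence in records:
        groups[doc_id].append(sentence)
    return list(groups.values())

if __name__ == "__main__":
    data = [("d1", "Hi."), ("d1", "How are you?"), ("d2", "Bonjour.")]
    print(len(privacy_units(data, "sentence")))   # 3 units: sentence-level DP
    print(len(privacy_units(data, "document")))   # 2 units: document-level DP
```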
Generation 2
Nov 14 (Thu) 10:30-12:00 - Room: Riverfront Hall
from peer-clients. Lastly, we employ optimization methods such as client-batching and server-hierarchical scheduling, adopting different
acceleration strategies based on the actual computational capabilities of the server. Experimental results on NLU and generation tasks
demonstrate that FL-GLM achieves metrics comparable to the centralized ChatGLM model, validating the effectiveness of our federated
learning framework.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
RSA-Control: A Pragmatics-Grounded Lightweight Controllable Text Generation Framework
Yifan Wang, Vera Demberg
Despite significant advancements in natural language generation, controlling language models to produce texts with desired attributes re-
mains a formidable challenge. In this work, we introduce RSA-Control, a training-free controllable text generation framework grounded in
pragmatics. RSA-Control directs the generation process by recursively reasoning between imaginary speakers and listeners, enhancing the
likelihood that target attributes are correctly interpreted by listeners amidst distractors. Additionally, we introduce a self-adjustable rationality
parameter, which allows for automatic adjustment of control strength based on context. Our experiments, conducted with two task types
and two types of language models, demonstrate that RSA-Control achieves strong attribute control while maintaining language fluency and
content consistency. Our code is available at https://github.com/Ewanwong/RSA-Control.
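The recursive speaker-listener reweighting at the heart of RSA-style control can be sketched roughly as follows: next-token scores from the base LM are adjusted by an imaginary listener's belief that the target attribute, rather than a distractor, generated the text. This is a one-step illustration under assumed inputs, not the released RSA-Control implementation (see the repository above for that).

```python
# One-step RSA-style reweighting sketch (not the released implementation).
import numpy as np

def rsa_rescale(lm_logprobs, attr_logprobs, distractor_logprobs,
                rationality=2.0):
    """lm_logprobs: base LM log-probs per candidate token; attr/distractor:
    log-probs of each token under the target attribute and a distractor."""
    # Imaginary listener's belief that the target attribute is active.
    listener = attr_logprobs - np.logaddexp(attr_logprobs, distractor_logprobs)
    # Pragmatic speaker: base LM score plus rationality-weighted listener belief.
    return lm_logprobs + rationality * listener

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    base = rng.normal(size=10)
    pos, neg = rng.normal(size=10), rng.normal(size=10)
    print(int(np.argmax(base)), int(np.argmax(rsa_rescale(base, pos, neg))))
```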
to generate reliable citations in question answering tasks (substantially enhancing citation results without compromising answer accuracy).
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation
Abhishek Divekar, Greg Durrett
It is often desirable to distill the capabilities of large language models (LLMs) into smaller student models due to compute and memory con-
straints. One way to do this for classification tasks is via dataset synthesis, which can be accomplished by generating examples of each label
from the LLM. Prior approaches to synthesis use few-shot prompting, which relies on the LLM’s parametric knowledge to generate usable
examples. However, this leads to issues of repetition, bias towards popular entities, and stylistic differences from human text. In this work,
we propose Synthesize by Retrieval and Refinement (SynthesizRR), which uses retrieval augmentation to introduce variety into the dataset
synthesis process: as retrieved passages vary, the LLM is seeded with different content to generate its examples. We empirically study the
synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor, requiring complex synthesis strategies.
We find SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance, when
compared to 32-shot prompting and four prior approaches.
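The retrieval-seeded loop can be sketched in a few lines: each retrieved passage seeds one generation, so synthesized examples vary with the retrieved content. `retrieve` and `llm` below are hypothetical stand-ins for a retriever and an LLM call, not the SynthesizRR implementation.

```python
# Sketch of retrieval-seeded dataset synthesis; `retrieve` and `llm` are
# hypothetical stand-ins.
def synthesize(label, queries, retrieve, llm):
    """Generate (text, label) pairs, seeding each generation with a
    different retrieved passage to diversify content."""
    examples = []
    for query in queries:
        for passage in retrieve(query):
            prompt = (f"Rewrite the passage below as a '{label}' example "
                      f"for a text classifier.\n\nPassage: {passage}")
            examples.append((llm(prompt), label))
    return examples

if __name__ == "__main__":
    corpus = ["Stocks fell sharply today.", "The new phone has a great camera."]
    retrieve = lambda q: [p for p in corpus if q in p.lower()]
    llm = lambda prompt: "SYNTHETIC: " + prompt.split("Passage: ")[1]
    print(synthesize("negative sentiment", ["stocks"], retrieve, llm))
```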
across a range of scores. We develop three metrics for assessing LLM calibration and propose confidence elicitation methods based on self-
consistency and self-evaluation. Our experiments demonstrate that larger models don’t necessarily guarantee better calibration, that various
calibration metrics complement each other, and that self-consistency methods excel in factoid datasets. We also find that calibration can
be enhanced through techniques such as fine-tuning and temperature scaling. Finally, we illustrate one application of long-form calibration
through selective answering in long-form responses, optimizing correctness within a constrained API budget.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Margin Matching Preference Optimization: Enhanced Model Alignment with Granular Feedback
Kyuyoung Kim, Ah Jeong Seo, Hao Liu, Jinwoo Shin, Kimin Lee
Large language models (LLMs) fine-tuned with alignment techniques, such as reinforcement learning from human feedback, have been in-
strumental in developing some of the most capable AI systems to date. Despite their success, existing methods typically rely on simple
binary labels, such as those indicating preferred outputs in pairwise preferences, which fail to capture the subtle differences in relative quality
between pairs. To address this limitation, we introduce an approach called Margin Matching Preference Optimization (MMPO), which incor-
porates relative quality margins into optimization, leading to improved LLM policies and reward models. Specifically, given quality margins
in pairwise preferences, we design soft target probabilities based on the Bradley-Terry model, which are then used to train models with the
standard cross-entropy objective. Experiments with both human and AI feedback data demonstrate that MMPO consistently outperforms
baseline methods, often by a substantial margin, on popular benchmarks including MT-bench and RewardBench. Notably, the 7B model
trained with MMPO achieves state-of-the-art performance on RewardBench as of June 2024, outperforming other models of the same scale.
Our analysis also shows that MMPO is more robust to overfitting, leading to better-calibrated models.
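The margin-aware objective the abstract describes admits a compact sketch: a quality margin m is mapped to a Bradley-Terry soft target p* = sigmoid(m / beta), and the model's implied preference probability is trained toward p* with standard cross-entropy instead of a hard 0/1 label. The exact parameterization may differ from the paper's.

```python
# Sketch of a margin-aware soft-target preference loss.
import math

def soft_target(margin, beta=1.0):
    """Bradley-Terry win probability implied by a quality margin."""
    return 1.0 / (1.0 + math.exp(-margin / beta))

def mmpo_style_loss(model_pref_logit, margin, beta=1.0):
    """Cross-entropy between the model's implied preference probability
    (given as a logit) and the margin-derived soft target."""
    p_star = soft_target(margin, beta)
    log_p = -math.log1p(math.exp(-model_pref_logit))   # log sigmoid(x)
    log_1mp = -math.log1p(math.exp(model_pref_logit))  # log sigmoid(-x)
    return -(p_star * log_p + (1.0 - p_star) * log_1mp)

if __name__ == "__main__":
    # A large margin pushes the target near 1; a small margin keeps it soft.
    for m in (0.2, 2.0):
        print(m, round(soft_target(m), 3), round(mmpo_style_loss(1.0, m), 3))
```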
decoding across two algorithms (top-k and top-π) with various hyperparameters, using the Pythia language models. Results show that in most
configurations, global decoding performs worse than the local decoding versions of the same algorithms, despite preserving the distribution's
integrity. Our results thus suggest that distortion might be an important feature of local decoding algorithms.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
SARCAT: Generative Span-Act Guided Response Generation using Copy-enhanced Target Augmentation
Jeong-Doo Lee, Hyeongjun Choi, Beomseok Hong, Youngsub Han, Byoung-Ki Jeon, Seung-Hoon Na
In this paper, we present a novel extension to improve the document grounded response generation, by proposing the Generative Span Act
Guided Response Generation using Copy enhanced Target Augmentation (SARCAT) that consists of two major components as follows: 1)
Copy-enhanced target-side input augmentation is an extended data augmentation to deal with the exposure bias problem by additionally in-
corporating the copy mechanism on top of the target-side augmentation (Xie et al., 2021). 2) Span-act guided response generation, which first
predicts grounding spans and dialogue acts before generating a response. Experimental results on the validation set of MultiDoc2Dial show
that the proposed SARCAT improves over strong baselines in both seen and unseen settings and achieves state-of-the-art performance,
even with the base reader using the pretrained T5-base model.
that this "familiarity" significantly impacts learning performance. Training with LLM-generated responses not only enhances performance
but also helps maintain the model’s capabilities in other reasoning tasks after fine-tuning on a specific task.
text corpora curated from the Internet with minimal human intervention, and (iii) trained in an online fashion. These stark contrasts prevent
researchers from transferring lessons learned on model generalization and adaptation in deep learning contexts to LLMs. To this end, our short
paper introduces empirical observations that aim to shed light on further training of already pretrained language models. Specifically, we
demonstrate that training a model on a text domain could degrade its perplexity on the test portion of the same domain. We observe with
our subsequent analysis that the performance degradation is positively correlated with the similarity between the additional and the original
pretraining dataset of the LLM. Our further token-level perplexity analysis reveals that the perplexity degradation is due to a handful of tokens
that are not informative about the domain. We hope these findings will guide us in determining when to adapt a model vs when to rely on its
foundational capabilities.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Can LLMs Learn Uncertainty on Their Own? Expressing Uncertainty Effectively in A Self-Training Manner
Shudong Liu, Zhaocong Li, Xuebo Liu, Runzhe Zhan, Derek F. Wong, Lidia S. Chao, Min Zhang
Large language models (LLMs) often exhibit excessive, random, and uninformative uncertainty, rendering them unsuitable for decision-
making in human-computer interactions. In this paper, we aim to instigate a heightened awareness of self-uncertainty in LLMs, enabling them
to express uncertainty more effectively. To accomplish this, we propose an uncertainty-aware instruction tuning (UaIT) method, aligning
LLMs’ perception with the probabilistic uncertainty of the generation. We conducted experiments using LLaMA2 and Mistral on multiple
free-form QA tasks. Experimental results revealed a surprising 45.2% improvement in the effectiveness of uncertainty expression by LLMs,
accompanied by reasonably good out-of-domain generalization capabilities. Moreover, this uncertainty expression can serve as a valuable
real-time basis for human decision-making, e.g., retrieving external documents and incorporating stronger LLMs.
but struggles with complex multi-hop questions as the new fact alone fails to specify the chain of facts involved in such scenarios. In addition,
memory-based editing maintains additional storage for all edits and related facts, requiring continuous updates to stay effective. As a result of
the design limitations, the challenge remains, with the highest accuracy being only 33.8% on the MQuAKE-CF benchmarks for Vicuna-7B.
To address this, we propose RippleCOT, a novel ICL editing approach integrating Chain-of-Thought (COT) reasoning. RippleCOT structures
demonstrations as new fact, question, thought, answer, incorporating a thought component to identify and decompose the multi-hop logic
within questions. This approach effectively guides the model through complex multi-hop questions with chains of related facts. Compre-
hensive experiments demonstrate that RippleCOT significantly outperforms the state-of-the-art in the ripple effect, achieving accuracy gains
ranging from 7.8% to 87.1%.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Measuring Susceptibility to Irrelevant Context in Language Models
Tianyu Liu, Kevin Du, Mrinmaya Sachan, Ryan Cotterell
One strength of modern language models is their ability to incorporate information from a user-input context when answering queries. How-
ever, they are not equally sensitive to subtle changes in that context. To quantify this, Du et al. (2024) give an information-theoretic metric
to measure such sensitivity. Their metric, susceptibility, is defined as the degree to which contexts can influence a model's response to a
query at a distributional level. However, exactly computing susceptibility is difficult and, thus, Du et al. (2024) fall back on a Monte Carlo
approximation. Due to the large number of samples required, the Monte Carlo approximation is inefficient in practice. As a faster alternative,
we propose Fisher susceptibility, an efficient method to estimate susceptibility based on Fisher information. Empirically, we validate that
Fisher susceptibility is comparable to Monte Carlo estimated susceptibility across a diverse set of query domains despite being 70×
faster. Exploiting the improved efficiency, we apply Fisher susceptibility to analyze factors affecting the susceptibility of language models. We
observe that larger models are as susceptible as smaller ones.
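For intuition, the Monte Carlo baseline that Fisher susceptibility replaces can be sketched as follows: sample contexts, then average how far each context shifts the model's answer distribution from the no-context distribution. The precise definition in Du et al. (2024) differs in details; `answer_dist` is a hypothetical API.

```python
# Schematic Monte Carlo susceptibility: mean KL between the answer
# distribution with and without sampled contexts.
import numpy as np

def kl(p, q, eps=1e-12):
    p, q = np.asarray(p) + eps, np.asarray(q) + eps
    return float(np.sum(p * np.log(p / q)))

def mc_susceptibility(answer_dist, query, contexts):
    base = answer_dist(query, context=None)
    return float(np.mean([kl(answer_dist(query, context=c), base)
                          for c in contexts]))

if __name__ == "__main__":
    rng = np.random.default_rng(0)

    def toy_answer_dist(query, context=None):
        logits = np.array([2.0, 0.5, 0.1])
        if context is not None:            # irrelevant context jitters logits
            logits = logits + rng.normal(scale=0.5, size=3)
        e = np.exp(logits - logits.max())
        return e / e.sum()

    print(mc_susceptibility(toy_answer_dist, "Q?", ["ctx"] * 100))
```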
Nov 14 (Thu) 10:30-12:00 - Jasmine
Learning Semantic Structure through First-Order-Logic Translation
Akshay Chaturvedi, Nicholas Asher
In this paper, we study whether transformer-based language models can extract predicate argument structure from simple sentences. We first
show that language models sometimes confuse which predicates apply to which objects. To mitigate this, we explore two tasks: question
answering (Q/A), and first order logic (FOL) translation, and two regimes, prompting and finetuning. In FOL translation, we finetune several
large language models on synthetic datasets designed to gauge their generalization abilities. For Q/A, we finetune encoder models like BERT
and RoBERTa and use prompting for LLMs. The results show that FOL translation for LLMs is better suited to learn predicate argument
structure.
the residual stream space, and is language-independent. Finally, we demonstrate this direction has a causal effect on the model predictions,
effectively flipping the Spanish predicted verb number by intervening with the direction found in English.
Models (LMs) approximates human performance, they often exhibit a drop in performance on real-world noisy data. This lack of robustness
can be concerning, as even small perturbations in text, irrelevant to the target task, can cause classifiers to incorrectly change their predictions.
A potential solution can be the family of Prototype-Based Networks (PBNs) that classifies examples based on their similarity to prototypical
examples of a class (prototypes) and has been shown to be robust to noise for computer vision tasks. In this paper, we study whether the
robustness properties of PBNs transfer to text classification tasks under both targeted and static adversarial attack settings. Our results show
that PBNs, as a mere architectural variation of vanilla LMs, offer more robustness compared to vanilla LMs under both targeted and static
settings. We showcase how PBNs’ interpretability can help us understand PBNs’ robustness properties. Finally, our ablation studies reveal
the sensitivity of PBNs’ robustness to the strictness of clustering and the number of prototypes in the training phase, as tighter clustering and
a low number of prototypes result in less robust PBNs.
discrepancies in LLM behavior, showing that while the models may possess knowledge, they struggle to apply it consistently, exhibit biases
toward preferred readings, and display self-inconsistencies. This highlights the need to address entity ambiguity in the future for more trust-
worthy LLMs.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Faithful and Plausible Natural Language Explanations for Image Classification: A Pipeline Approach
Adam Wojciechowski, Mateusz Lango, Ondrej Dusek
Existing explanation methods for image classification struggle to provide faithful and plausible explanations. This paper addresses this issue
by proposing a post-hoc natural language explanation method that can be applied to any CNN-based classifier without altering its training
process or affecting predictive performance. By analysing influential neurons and the corresponding activation maps, the method generates a
faithful description of the classifier’s decision process in the form of a structured meaning representation, which is then converted into text by
a language model. Through this pipeline approach, the generated explanations are grounded in the neural network architecture, providing ac-
curate insight into the classification process while remaining accessible to non-experts. Experimental results show that the NLEs constructed
by our method are significantly more plausible and faithful than baselines. In particular, user interventions in the neural network structure
(masking of neurons) are three times more effective.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Enhancing Healthcare LLM Trust with Atypical Presentations Recalibration
Jeremy Qin, Bang Liu, Quoc Dinh Nguyen
Black-box large language models (LLMs) are increasingly deployed in various environments, making it essential for these models to ef-
fectively convey their confidence and uncertainty, especially in high-stakes settings. However, these models often exhibit overconfidence,
leading to potential risks and misjudgments. Existing techniques for eliciting and calibrating LLM confidence have primarily focused on
general reasoning datasets, yielding only modest improvements. Accurate calibration is crucial for informed decision-making and preventing
adverse outcomes but remains challenging due to the complexity and variability of tasks these models perform. In this work, we investigate the
miscalibration behavior of black-box LLMs within the healthcare setting. We propose a novel method, Atypical Presentations Recalibration,
which leverages atypical presentations to adjust the model’s confidence estimates. Our approach significantly improves calibration, reducing
calibration errors by approximately 60% on three medical question answering datasets and outperforming existing methods such as vanilla
verbalized confidence, CoT verbalized confidence and others. Additionally, we provide an in-depth analysis of the role of atypicality within
the recalibration framework.
Nov 14 (Thu) 10:30-12:00 - Jasmine
Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
Akshara Prabhakar, Thomas L. Griffiths, R. Thomas McCoy
Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs).
However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To
understand the factors influencing CoT reasoning we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers,
where letters are shifted forward some number of steps in the alphabet. We analyze the pattern of results produced by three LLMs—GPT-4,
Claude 3, and Llama 3.1—performing this task using CoT prompting. By focusing on a single relatively simple task, we are able to identify
three factors that systematically affect CoT performance: the probability of the task’s expected output (probability), what the model has im-
plicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We
show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output’s prob-
ability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization
and a probabilistic version of genuine reasoning.
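The shift-cipher probes underlying the case study are easy to reproduce in spirit: encode a word by shifting each letter k steps and ask the model to decode it. This sketch generates such probes; the authors' prompts and evaluation protocol may differ.

```python
# Generating shift-cipher probes: decode text shifted k steps in the alphabet.
import string

def shift(text, k):
    """Shift each lowercase letter k steps forward (wrapping around)."""
    table = str.maketrans(
        string.ascii_lowercase,
        string.ascii_lowercase[k:] + string.ascii_lowercase[:k])
    return text.translate(table)

def make_probe(plaintext, k):
    cipher = shift(plaintext, k)
    prompt = (f"The following text was encoded by shifting each letter "
              f"{k} steps forward in the alphabet: '{cipher}'. Decode it.")
    return prompt, plaintext

if __name__ == "__main__":
    prompt, gold = make_probe("stay", 3)   # rot-3: 'stay' -> 'vwdb'
    print(prompt)
    print("gold:", gold)
```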
ata from pre-collected, offline preference datasets. While recent studies indicate that existing offline DAP methods can directly benefit from
online training samples, we highlight the need to develop specific online DAP algorithms to fully harness the power of online training. Specif-
ically, we identify that the learned LLM should adhere to the proximity of the behavior LLM, which collects the training samples. To this
end, we propose online Preference Optimization in proximity to the Behavior LLM (BPO), emphasizing the importance of constructing a
proper trust region for LLM alignment. We conduct extensive experiments to validate the effectiveness and applicability of our approach by
integrating it with various DAP methods, resulting in significant performance improvements across a wide range of tasks when training with
the same amount of preference data. Even when only introducing one additional data collection phase, our online BPO improves its offline
DAP baseline from 72.0% to 80.2% on TL;DR and from 82.2% to 89.1% on Anthropic Helpfulness in terms of win rate against human
reference text.
trained tokens, and does not compromise text compression. Our experiments show that this method either improves downstream performance
or does not harm it.
random sampling? How much of a role does the prediction pipeline play in AL's success? We examine these questions in detail for the task
of text classification using pre-trained representations, which are ubiquitous today. Our primary contribution here is a rigorous evaluation of
AL techniques, old and new, across setups that vary with respect to datasets, text representations and classifiers. This unlocks multiple insights
around warm-up times, i.e., the number of labels before gains from AL are seen, the viability of an "Always ON" mode, and the relative
significance of different factors. Additionally, we release a framework for rigorous benchmarking of AL techniques for text classification.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Immunization against harmful fine-tuning attacks
Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Hassan Sajjad, Frank Rudzicz
Large Language Models (LLMs) are often trained with safety guards intended to prevent harmful text generation. However, such safety
training can be removed by fine-tuning the LLM on harmful datasets. While this emerging threat (harmful fine-tuning attacks) has been
characterized by previous work, there is little understanding of how we should proceed in constructing and validating defenses against these
attacks, especially when defenders have no control over the fine-tuning process. We introduce a formal framework based
on the training budget of an attacker, which we call "Immunization" conditions. Using a formal characterisation of the harmful fine-tuning
problem, we provide a thorough description of what a successful defense must comprise and establish guidelines for how rigorous,
confidence-building defense research should proceed.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Enhancing Large Language Model Based Sequential Recommender Systems with Pseudo Labels Reconstruction
Hyunsoo Na, Minseok Gang, Youngrok Ko, Jinseok Seol, Sang-goo Lee
Large language models (LLMs) are utilized in various studies, and they also demonstrate a potential to function independently as a recom-
mendation model. Nevertheless, training on sequences and text labels modifies LLMs' pre-trained weights, diminishing their inherent strength in
constructing and comprehending natural language sentences. In this study, we propose a reconstruction-based LLM recommendation model
(ReLRec) that harnesses the feature extraction capability of LLMs, while preserving LLMs’ sentence generation abilities. We reconstruct the
user and item pseudo-labels generated from user reviews, while training on sequential data, aiming to exploit the key features of both users
and items. Experimental results demonstrate the efficacy of label reconstruction in sequential recommendation tasks.
Xinyi Xu, Zhaoxuan Wu, Rui Qiao, Arun Verma, Yao Shu, Jingtan Wang, Xinyuan Niu, Zhenfeng He, Jiangwei Chen, Zijian Zhou, Gregory
Kang Ruey Lau, Hieu Dao, Lucas Agussurja, Rachael Hwee Ling Sim, Xiaoqiang Lin, Wenyang Hu, Zhongxiang Dai, Pang Wei Koh, Bryan
Kian Hsiang Low
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making a key
observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning)
of LLMs, and advocate that data-centric research should receive more attention from the community. We identify four specific scenarios cen-
tered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the
research community and, where applicable, the society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored
to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research
efforts and results, which can help promote openness and transparency in AI and LLM research.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Efficient Pointwise-Pairwise Learning-to-Rank for News Recommendation
Nithish Kannen, Yao Ma, Gerrit J.J. Van den Burg, Jean Baptiste Faddoul
News recommendation is a challenging task that involves personalization based on the interaction history and preferences of each user. Re-
cent works have leveraged the power of pretrained language models (PLMs) to directly rank news items by using inference approaches that
predominantly fall into three categories: pointwise, pairwise, and listwise learning-to-rank. While pointwise methods offer linear inference
complexity, they fail to capture crucial comparative information between items that is more effective for ranking tasks. Conversely, pair-
wise and listwise approaches excel at incorporating these comparisons but suffer from practical limitations: pairwise approaches are either
computationally expensive or lack theoretical guarantees and listwise methods often perform poorly in practice. In this paper, we propose
a novel framework for PLM-based news recommendation that integrates both pointwise relevance prediction and pairwise comparisons in a
scalable manner. We present a rigorous theoretical analysis of our framework, establishing conditions under which our approach guarantees
improved performance. Extensive experiments show that our approach outperforms the state-of-the-art methods on the MIND and Adressa
news recommendation datasets.
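One way to picture a pointwise-pairwise hybrid (a simplified sketch, not the paper's algorithm): cheap pointwise scores produce a shortlist in linear time, then pairwise comparisons reorder only the shortlist. `pointwise` and `pairwise` stand in for PLM-based scorers.

```python
# Simplified pointwise-then-pairwise ranking sketch.
def rank(items, pointwise, pairwise, shortlist=3):
    # Stage 1: linear-cost pointwise scoring over all candidates.
    ranked = sorted(items, key=pointwise, reverse=True)
    head, tail = ranked[:shortlist], ranked[shortlist:]
    # Stage 2: reorder only the shortlist with pairwise comparisons.
    for i in range(len(head)):
        for j in range(len(head) - 1 - i):
            if pairwise(head[j + 1], head[j]):   # True if the right item wins
                head[j], head[j + 1] = head[j + 1], head[j]
    return head + tail

if __name__ == "__main__":
    items = ["a", "b", "c", "d"]
    pointwise = {"a": 0.3, "b": 0.9, "c": 0.8, "d": 0.1}.get
    pairwise = lambda x, y: x == "c" and y == "b"   # toy: c beats b head-to-head
    print(rank(items, pointwise, pairwise))          # ['c', 'b', 'a', 'd']
```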
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Out-of-Distribution Detection through Soft Clustering with Non-Negative Kernel Regression
Aryan Gulati, Xingjian Dong, Carlos Hurtado, Sarath Shekkizhar, Swabha Swayamdipta, Antonio Ortega
As language models become more general purpose, increased attention needs to be paid to detecting out-of-distribution (OOD) instances, i.e.,
those not belonging to any of the distributions seen during training. Existing methods for detecting OOD data are computationally complex
and storage-intensive. We propose a novel soft clustering approach for OOD detection based on non-negative kernel regression. Our approach
greatly reduces computational and space complexities (up to 11× improvement in inference time and 87% reduction in storage requirements).
It outperforms existing approaches by up to 4 AUROC points on four benchmarks. We also introduce an entropy-constrained version of our
algorithm, leading to further reductions in storage requirements (up to 97% lower than comparable approaches) while retaining competitive
performance. Our soft clustering approach for OOD detection highlights its potential for detecting tail-end phenomena in extreme-scale data
settings. Our source code is available on Github.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Better Alignment with Instruction Back-and-Forth Translation
Thao Nguyen, Jeffrey Li, Sewoong Oh, Ludwig Schmidt, Jason E Weston, Luke Zettlemoyer, Xian Li
We propose a new method, instruction back-and-forth translation, to improve the quality of instruction-tuning data used for aligning large
language models (LLMs). Given preprocessed texts from an initial web corpus (e.g. Dolma (Soldaini et al., 2024)), we generate synthetic
instructions using the backtranslation approach proposed by Li et al. (2023), filter the generated data and rewrite the responses to improve
their quality further based on the initial texts. Given similar quantities of instructions, fine-tuning Llama-2 on our (synthetic instruction,
rewritten response) pairs yields better AlpacaEval win rates than using other common instruction datasets such as Humpback, ShareGPT,
Open Orca, Alpaca-GPT4 and Self-instruct, at both 7B and 70B parameter scales. We also demonstrate that rewriting the responses with
an LLM is different from direct distillation: the former process yields better win rate at 70B scale, and the two text distributions exhibit
significant distinction in the embedding space. In addition, we provide analyses showing that our backtranslated instructions are of higher quality
than other sources of synthetic instructions, while our responses are more diverse and complex than what can be obtained from distillation.
Overall we find that instruction back-and-forth translation combines the best of both worlds—making use of the information diversity and
quantity found on the web, while ensuring the quality of the responses which is necessary for effective alignment.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification
Tao Meng, Ninareh Mehrabi, Palash Goyal, Anil Ramakrishna, Aram Galstyan, Richard Zemel, Kai-Wei Chang, Rahul Gupta, Charith Peris
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control. Given a training corpus and
control criteria formulated as a sequence-level constraint on model outputs, our method fine-tunes the LLM on the training corpus while
enhancing constraint satisfaction with minimal impact on its utility and generation quality. Specifically, our approach regularizes the LLM
training by penalizing the KL divergence between the desired output distribution, which satisfies the constraints, and the LLM's posterior.
This regularization term can be approximated by an auxiliary model trained to decompose the sequence-level constraints into token-level
guidance, allowing the term to be measured by a closed-form formulation. To further improve efficiency, we design a parallel scheme for
concurrently updating both the LLM and the auxiliary model. We evaluate the empirical performance of our approach by controlling the
toxicity when training an LLM. We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving
competitive performance on benchmarks and a toxicity detection task.
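The regularized objective can be sketched schematically: the usual LM cross-entropy plus a KL term pulling the model's token-level distribution toward a constraint-satisfying target distribution supplied by an auxiliary guidance model. Shapes and the weighting are illustrative, not the paper's exact formulation.

```python
# Schematic KL-regularized fine-tuning loss; shapes/weighting illustrative.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def constrained_loss(lm_logits, target_ids, guided_probs, lam=0.1, eps=1e-12):
    """lm_logits: (T, V); target_ids: (T,); guided_probs: (T, V) desired
    token-level distribution from the auxiliary guidance model."""
    probs = softmax(lm_logits)
    # Standard next-token cross-entropy on the training corpus.
    ce = -np.mean(np.log(probs[np.arange(len(target_ids)), target_ids] + eps))
    # KL penalty pulling the model toward the constraint-satisfying targets.
    kl = np.mean(np.sum(guided_probs * np.log((guided_probs + eps) /
                                              (probs + eps)), axis=-1))
    return ce + lam * kl

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, V = 4, 8
    logits, guided = rng.normal(size=(T, V)), softmax(rng.normal(size=(T, V)))
    targets = rng.integers(0, V, size=T)
    print(round(constrained_loss(logits, targets, guided), 4))
```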
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
A Study of Parameter Efficient Fine-tuning by Learning to Efficiently Fine-Tune
Taha Ceritli, Savas Ozkan, Jeongwon Min, Eunchung Noh, Cho Jung Min, Mete Ozay
The growing size of large language models (LLMs) requires parameter-efficient fine-tuning (PEFT) methods for their adaptation to new tasks.
Existing methods, such as Low-Rank Adaptation (LoRA), typically involve model adaptation by training the PEFT parameters. One open
problem that must be solved to employ these methods effectively is the identification of PEFT parameters. More precisely, related works
identify PEFT parameters by projecting high dimensional parameters of LLMs onto low dimensional parameter manifolds with predefined
projections, or identifying PEFT parameters as projections themselves. To study this problem, we propose a new approach called Learning
to Efficiently Fine-tune (LEFT) where we aim to learn spaces of PEFT parameters from data. In order to learn how to generate the PEFT
parameters on a learned parameter space while fine-tuning the LLMs, we propose the Parameter Generation (PG) method. In the experimental
analyses, we examine the effectiveness of our solutions, exploring the accuracy of fine-tuned LLMs and the characteristics of PEFT parameters
on benchmark GLUE tasks.
Machine Translation 2
Nov 14 (Thu) 10:30-12:00 - Room: Riverfront Hall
NLP Applications 2
Nov 14 (Thu) 10:30-12:00 - Room: Riverfront Hall
Negar Arabzadeh, Siqing Huo, Nikhil Mehta, Qingyun Wu, Chi Wang, Ahmed Hassan Awadallah, Charles L. A. Clarke, Julia Kiseleva
The rapid development of Large Language Models (LLMs) has led to a surge in applications that facilitate collaboration among multiple
agents, assisting humans in their daily tasks. However, a significant gap remains in assessing to what extent LLM-powered applications
genuinely enhance user experience and task execution efficiency. This highlights the need to verify the utility of LLM-powered applications,
particularly by ensuring alignment between the application’s functionality and end-user needs. We introduce AgentEval, a novel frame-
work designed to simplify the utility verification process by automatically proposing a set of criteria tailored to the unique purpose of any
given application. This allows for a comprehensive assessment, quantifying the utility of an application against the suggested criteria. We
present a comprehensive analysis of the effectiveness and robustness of AgentEval on two open-source datasets covering math problem
solving and ALFWorld household tasks. For reproducibility purposes, we make the data, code and all the logs publicly available at
https://github.com/Narabzad/AgentEval
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
CELLO: Causal Evaluation of Large Vision-Language Models
Meiqi Chen, Bo Peng, Yan Zhang, Chaochao Lu
Causal reasoning is fundamental to human intelligence and crucial for effective decision-making in real-world environments. Despite recent
advancements in large vision-language models (LVLMs), their ability to comprehend causality remains unclear. Previous work typically
focuses on commonsense causality between events and/or actions, which is insufficient for applications like embodied agents and lacks the
explicitly defined causal graphs required for formal causal reasoning. To overcome these limitations, we introduce a fine-grained and unified
definition of causality involving interactions between humans and/or objects. Building on the definition, we construct a novel dataset, CELLO,
consisting of 14,094 causal questions across all four levels of causality: discovery, association, intervention, and counterfactual. This dataset
surpasses traditional commonsense causality by including explicit causal graphs that detail the interactions between humans and objects.
Extensive experiments on CELLO reveal that current LVLMs still struggle with causal reasoning tasks, but they can benefit significantly from
our proposed CELLO-CoT, a causally inspired chain-of-thought prompting strategy. Both quantitative and qualitative analyses from this study
provide valuable insights for future research. Our project page is at https://github.com/OpenCausaLab/CELLO.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Adversarial Math Word Problem Generation
Roy Xie, Chengxuan Huang, Junlin Wang, Bhuwan Dhingra
Large language models (LLMs) have significantly transformed the educational landscape. As current plagiarism detection tools struggle
to keep pace with LLMs’ rapid advancements, the educational community faces the challenge of assessing students’ true problem-solving
abilities in the presence of LLMs. In this work, we explore a new paradigm for ensuring fair evaluation: generating adversarial examples that preserve the structure and difficulty of the original questions intended for assessment but are unsolvable by LLMs. Focusing on the
domain of math word problems, we leverage abstract syntax trees to structurally generate adversarial examples that cause LLMs to produce
incorrect answers by simply editing the numeric values in the problems. We conduct experiments on various open- and closed-source LLMs,
quantitatively and qualitatively demonstrating that our method significantly degrades their math problem-solving ability. We identify shared
vulnerabilities among LLMs and propose a cost-effective approach to attack high-cost models. Additionally, we conduct automatic analysis
to investigate the cause of failure, providing further insights into the limitations of LLMs.
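To make the numeric-editing idea concrete, here is a minimal illustrative sketch (not the authors' implementation; the edit rule and the solution function standing in for the problem's syntax tree are hypothetical):
```python
import random
import re

def perturb_numbers(problem: str, solution_fn):
    """Hypothetical sketch: edit the numeric values in a math word problem,
    preserving its structure and difficulty, and recompute the gold answer
    from a solution function standing in for the problem's solution tree."""
    nums = [int(n) for n in re.findall(r"\d+", problem)]
    new_nums = [n + random.randint(2, 9) for n in nums]  # assumed edit rule
    adv = problem
    for old, new in zip(nums, new_nums):
        adv = adv.replace(str(old), str(new), 1)  # swap each value once
    return adv, solution_fn(*new_nums)

# Example: the lambda plays the role of the problem's solution program.
adv, gold = perturb_numbers("Sam has 3 apples and buys 4 more. How many now?",
                            lambda a, b: a + b)
```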
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
I Never Said That: A dataset, taxonomy and baselines on response clarity classification
Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou
Equivocation and ambiguity in public speech are well-studied discourse phenomena, especially in political science and analysis of political
interviews. Inspired by the well-grounded theory on equivocation, we aim to resolve the closely related problem of response clarity in ques-
tions extracted from political interviews, leveraging the capabilities of Large Language Models (LLMs) and human expertise. To this end, we
introduce a novel taxonomy that frames the task of detecting and classifying response clarity and a corresponding clarity classification dataset
which consists of question-answer (QA) pairs drawn from political interviews and annotated accordingly. Our proposed two-level taxonomy
addresses the clarity of a response in terms of the information provided for a given question (high-level) and also provides a fine-grained
taxonomy of evasion techniques that relate to unclear, ambiguous responses (lower-level). We combine ChatGPT and human annotators to
collect, validate and annotate discrete QA pairs from political interviews, to be used for our newly introduced response clarity task. We
provide a detailed analysis and conduct several experiments with different model architectures, sizes and adaptation methods to gain insights
and establish new baselines over the proposed dataset and task.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Evaluation of Question Answer Generation for Portuguese: Insights and Datasets
Felipe Paula, Cassiana Roberta Lizzoni Michelin, Viviane Moreira
Automatic question generation is an increasingly important task that can be applied in different settings, including educational purposes, data
augmentation for question-answering (QA), and conversational systems. More specifically, we focus on question answer generation (QAG),
which produces question-answer pairs given an input context. We adapt and apply QAG approaches to generate question-answer pairs for
different domains and assess their capacity to generate accurate, diverse, and abundant question-answer pairs. Our analyses combine both
qualitative and quantitative evaluations that allow insights into the quality and types of errors made by QAG methods. We also look into
strategies for error filtering and their effects. Our work concentrates on Portuguese, a widely spoken language that is underrepresented in
natural language processing research. To address the pressing need for resources, we generate and make available human-curated extractive
QA datasets in three diverse domains.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
From Generation to Selection: Findings of Converting Analogical Problem-Solving into Multiple-Choice Questions
Donghyeon Shin, Seungpil Lee, Klea Lena Kovacec, Sundong Kim
As artificial intelligence reasoning abilities gain prominence, generating reliable benchmarks becomes crucial. The Abstract and Reasoning
Corpus (ARC) offers challenging problems yet unsolved by AI. While ARC effectively assesses reasoning, its generation-based evaluation
overlooks other assessment aspects. Bloom’s Taxonomy suggests evaluating six cognitive stages: Remember, Understand, Apply, Analyze,
Evaluate, and Create. To extend ARC’s focus beyond the Create stage, we developed MC-LARC, a multiple-choice format suitable for as-
sessing stages like Understand and Apply in Large Language Models (LLMs). Our evaluation of ChatGPT4V’s analogical reasoning using
MC-LARC confirmed that this format supports LLMs’ reasoning capabilities and facilitates evidence analysis. However, we observed LLMs
using shortcuts in MC-LARC tasks. To address this, we propose a self-feedback framework where LLMs identify issues and generate im-
proved options. MC-LARC is available at https://mc-larc.github.io/.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
CERD: A Comprehensive Chinese Rhetoric Dataset for Rhetorical Understanding and Generation in Essays
Nuowei Liu, Xinhao Chen, Hongyi Wu, Changzhi Sun, Man Lan, Yuanbin Wu, Xiaopeng Bai, Shaoguang Mao, Yan Xia
Existing rhetorical understanding and generation datasets or corpora primarily focus on single coarse-grained categories or fine-grained cat-
egories, neglecting the common interrelations between different rhetorical devices by treating them as independent sub-tasks. In this paper,
we propose the Chinese Essay Rhetoric Dataset (CERD), consisting of 4 commonly used coarse-grained categories (metaphor, personification, hyperbole, and parallelism) and 23 fine-grained categories across both form and content levels. CERD is a manually annotated
and comprehensive Chinese rhetoric dataset with five interrelated sub-tasks. Unlike previous work, our dataset aids in understanding various
rhetorical devices, recognizing corresponding rhetorical components, and generating rhetorical sentences under given conditions, thereby
improving the author’s writing proficiency and language usage skills. Extensive experiments are conducted to demonstrate the interrelations
between multiple tasks in CERD, as well as to establish a benchmark for future research on rhetoric. The experimental results indicate that
Large Language Models achieve the best performance across most tasks, and jointly fine-tuning with multiple tasks further enhances perfor-
mance.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Few-shot clinical entity recognition in English, French and Spanish: masked language models outperform generative model prompting
Marco Naguib, Xavier Tannier, Aurélie Névéol
Large language models (LLMs) have become the preferred solution for many natural language processing tasks. In low-resource environments
such as specialized domains, their few-shot capabilities are expected to deliver high performance. Named Entity Recognition (NER) is a crit-
ical task in information extraction that is not covered in recent LLM benchmarks. There is a need for better understanding the performance
of LLMs for NER in a variety of settings including languages other than English. This study aims to evaluate generative LLMs, employed
through prompt engineering, for few-shot clinical NER. We compare 13 auto-regressive models using prompting and 16 masked models using
fine-tuning on 14 NER datasets covering English, French and Spanish. While prompt-based auto-regressive models achieve competitive F1
for general NER, they are outperformed within the clinical domain by lighter biLSTM-CRF taggers based on masked models. Additionally,
masked models exhibit lower environmental impact compared to auto-regressive models. Findings are consistent across the three languages
studied, which suggests that LLM prompting is not yet suited for NER production in the clinical domain.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
TurkishMMLU: Measuring Massive Multitask Language Understanding in Turkish
Arda Yüksel, Abdullatif Köksal, Lütfi Kerem Senel, Anna Korhonen, Hinrich Schuetze
Multiple choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models
(LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially
introduces culturally biased questions, especially in social sciences. We introduce the first multitask, multiple-choice Turkish QA bench-
mark, TurkishMMLU, to evaluate LLMs’ understanding of the Turkish language. TurkishMMLU includes over 10,000 questions, covering
9 different subjects from Turkish high-school education curricula. These questions are written by curriculum experts, suitable for the high-
school curricula in Turkey, covering subjects ranging from natural sciences and math questions to more culturally representative topics such
as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma,
Llama, MT5), closed-source (GPT 4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation,
including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question difficulty analysis along with model perfor-
mance. We provide an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the
Turkish language. We publicly release our code for the dataset and evaluation: https://github.com/ArdaYueksel/TurkishMMLU
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Generalists vs. Specialists: Evaluating Large Language Models for Urdu
Samee Arif, Abdul Hameed Azeemi, Agha Ali Raza, Awais Athar
In this paper, we compare general-purpose models, GPT-4-Turbo and Llama-3-8b, with special-purpose models (XLM-Roberta-large, mT5-large, and Llama-3-8b) that have been fine-tuned on specific tasks. We focus on seven classification and seven generation tasks to evaluate the performance of these models on the Urdu language. Urdu has 70 million native speakers, yet it remains underrepresented in Natural
Language Processing (NLP). Despite the frequent advancements in Large Language Models (LLMs), their performance in low-resource lan-
guages, including Urdu, still needs to be explored. We also conduct a human evaluation for the generation tasks and compare the results with
the evaluations performed by GPT-4-Turbo, Llama-3-8b and Claude 3.5 Sonnet. We find that special-purpose models consis-
tently outperform general-purpose models across various tasks. We also find that the evaluation done by GPT-4-Turbo for generation tasks
aligns more closely with human evaluation compared to the evaluation done by Llama-3-8b. This paper contributes to the NLP community
by providing insights into the effectiveness of general and specific-purpose LLMs for low-resource languages.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
MedINST: Meta Dataset of Biomedical Instructions
Wenhan Han, Meng Fang, Zihan Zhang, Yu Yin, Zirui Song, Ling Chen, Mykola Pechenizkiy, Qingyu Chen
The integration of large language model (LLM) techniques in the field of medical analysis has brought about significant advancements, yet the
scarcity of large, diverse, and well-annotated datasets remains a major challenge. Medical data and tasks, which vary in format, size, and other
parameters, require extensive preprocessing and standardization for effective use in training LLMs. To address these challenges, we introduce
MedINST, the Meta Dataset of Biomedical Instructions, a novel multi-domain, multi-task instructional meta-dataset. MedINST comprises
133 biomedical NLP tasks and over 7 million training samples, making it the most comprehensive biomedical instruction dataset to date. Us-
ing MedINST as the meta dataset, we curate MedINST32, a challenging benchmark with different task difficulties aiming to evaluate LLMs’
generalization ability. We fine-tune several LLMs on MedINST and evaluate on MedINST32, showcasing enhanced cross-task generalization.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Is GPT-4V (ision) All You Need for Automating Academic Data Visualization? Exploring Vision-Language Models’ Capability in
Reproducing Academic Charts
Zhehao Zhang, Weicheng Ma, Soroush Vosoughi
While effective data visualization is crucial to present complex information in academic research, its creation demands significant expertise
in both data management and graphic design. We explore the potential of using Vision-Language Models (VLMs) in automating the creation
of data visualizations by generating code templates from existing charts. As the first work to systematically investigate this task, we first
introduce AcademiaChart, a dataset comprising 2525 high-resolution data visualization figures with captions from a variety of AI confer-
ences, extracted directly from source codes. We then conduct large-scale experiments with six state-of-the-art (SOTA) VLMs, including both
closed-source and open-source models. Our findings reveal that SOTA closed-source VLMs can indeed be helpful in reproducing charts. In contrast, open-source ones are only effective at reproducing much simpler charts but struggle with more complex ones. Interestingly, the
application of Chain-of-Thought (CoT) prompting significantly enhances the performance of the most advanced model, GPT-4-V, while it
does not work as well for other models. These results underscore the potential of VLMs in data visualization while also highlighting critical
areas that need improvement for broader application.
Nandan Thakur, Luiz Bonifacio, Crystina Zhang, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu,
Boxing Chen, Mehdi Rezagholizadeh, Jimmy Lin
Retrieval-Augmented Generation (RAG) grounds Large Language Model (LLM) output by leveraging external knowledge sources to reduce
factual hallucinations. However, prior work lacks a comprehensive evaluation of different language families, making it challenging to evaluate
LLM robustness against errors in external retrieved knowledge. To overcome this, we establish NoMIRACL, a human-annotated dataset for evaluating LLM robustness in RAG across 18 typologically diverse languages. NoMIRACL includes both a non-relevant and a relevant subset. Queries in the non-relevant subset contain passages judged as non-relevant, whereas queries in the relevant subset include at least one judged relevant passage. We measure relevance assessment using: (i) hallucination rate, measuring the model's tendency to hallucinate when the answer is not present in the passages of the non-relevant subset, and (ii) error rate, measuring the model's inaccuracy in recognizing relevant
passages in the relevant subset. In our work, we observe that most models struggle to balance the two capacities. Models such as LLAMA-2
and Orca-2 achieve over 88% hallucination rate on the non-relevant subset. Mistral and LLAMA-3 hallucinate less but can achieve up to a
74.9% error rate on the relevant subset. Overall, GPT-4 is observed to provide the best tradeoff on both subsets, highlighting future work
necessary to improve LLM robustness. NoMIRACL dataset and evaluation code are available at: https://github.com/project-miracl/nomiracl.
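As a reading aid, the two metrics reduce to simple rates over the two subsets; a minimal sketch (function and flag names are hypothetical, not from the released evaluation code):
```python
def nomiracl_metrics(nonrel_preds, rel_preds):
    """Hypothetical sketch of the two NoMIRACL metrics.
    nonrel_preds: per-query flags, True if the model answered even though
                  no relevant passage exists (it should have abstained).
    rel_preds:    per-query flags, True if the model failed to recognize
                  an existing relevant passage."""
    hallucination_rate = sum(nonrel_preds) / len(nonrel_preds)
    error_rate = sum(rel_preds) / len(rel_preds)
    return hallucination_rate, error_rate
```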
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
What is the value of templates? Rethinking Document Information Extraction Datasets for LLMs
Ran Zmigrod, Pranav Shetty, Mathieu Sibue, Zhiqiang Ma, Armineh Nourbakhsh, Xiaomo Liu, Manuela Veloso
The rise of large language models (LLMs) for visually rich document understanding (VRDU) has kindled a need for prompt-response,
document-based datasets. As annotating new datasets from scratch is labor-intensive, the existing literature has generated prompt-response
datasets from available resources using simple templates. For the case of key information extraction (KIE), one of the most common VRDU
tasks, past work has typically employed the template "What is the value for the key?". However, given the variety of questions encountered
in the wild, simple and uniform templates are insufficient for creating robust models in research and industrial contexts. In this work, we
present K2Q, a diverse collection of five datasets converted from KIE to a prompt-response format using a plethora of bespoke templates.
The questions in K2Q can span multiple entities and be extractive or boolean. We empirically compare the performance of seven baseline
generative models on K2Q with zero-shot prompting. We further compare three of these models when training on K2Q versus training on simpler templates to motivate the need for our work. We find that creating diverse and intricate KIE questions enhances the performance and
robustness of VRDU models. We hope this work encourages future studies on data quality for generative model training.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
On Leakage of Code Generation Evaluation Datasets
Alexandre Matton, Tom Sherborne, Dennis Aumiller, Elena Tommasone, Milad Alizadeh, Jingyi He, Raymond Ma, Maxime Voisin, Ellen
Gilsenan-McMahon, Matthias Gallé
In this paper, we consider contamination by code generation test sets, in particular in their use in modern large language models. We discuss three possible sources of such contamination and show findings supporting each of them: (i) direct data leakage, (ii) indirect data leakage through the use of synthetic data, and (iii) overfitting to evaluation sets during model selection. To address this, we release Less Basic Python Problems (LBPP): an uncontaminated new benchmark of 161 prompts with their associated Python solutions. LBPP is released at https://huggingface.co/datasets/CohereForAI/lbpp
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
TOWER: Tree Organized Weighting for Evaluating Complex Instructions
Noah Ziems, Zhihan Zhang, Meng Jiang
Evaluating the ability of large language models (LLMs) to follow complex human-written instructions is essential for their deployment in
real-world applications. While benchmarks like Chatbot Arena use human judges to assess model performance, they are resource-intensive
and time-consuming. Alternative methods using LLMs as judges, such as AlpacaEval, MT Bench, WildBench, and InFoBench, offer improvements but still do not capture the fact that certain aspects of a complex instruction are more important than others to follow. To address this gap, we propose a novel evaluation metric, TOWER, that incorporates human-judged importance into the assessment of complex instruction following. We
show that human annotators agree with tree-based representations of these complex instructions nearly as much as they agree with other
human annotators. We release tree-based annotations of the InFoBench dataset and the corresponding evaluation code to facilitate future
research.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
BLADE: Benchmarking Language Model Agents for Data-Driven Science
Ken Gu, Ruoxi Shang, Ruien Jiang, Keying Kuang, Richard-John Lin, Donghe Lyu, Yue Mao, Youran Pan, Teng Wu, Jiaqian Yu, Yikun Zhang,
Tianmai M. Zhang, Lanyi Zhu, Mike A Merrill, Jeffrey Heer, Tim Althoff
Data-driven scientific discovery requires the iterative integration of scientific domain knowledge, statistical expertise, and an understanding
of data semantics to make nuanced analytical decisions, e.g., about which variables, transformations, and statistical models to consider. LM-
based agents equipped with planning, memory, and code execution capabilities have the potential to support data-driven science. However,
evaluating agents on such open-ended tasks is challenging due to multiple valid approaches, partially correct steps, and different ways to
express the same decisions. To address these challenges, we present BLADE, a benchmark to automatically evaluate agents’ multifaceted
approaches to open-ended research questions. BLADE consists of 12 datasets and research questions drawn from existing scientific literature,
with ground truth collected from independent analyses by expert data scientists and researchers. To automatically evaluate agent responses,
we developed corresponding computational methods to match different representations of analyses to this ground truth. Though language
models possess considerable world knowledge, our evaluation shows that they are often limited to basic analyses. However, agents capable
of interacting with the underlying data demonstrate improved, but still non-optimal, diversity in their analytical decision making. Our work
enables the evaluation of agents for data-driven science and provides researchers deeper insights into agents’ analysis approaches.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
LEGOBench: Scientific Leaderboard Generation Benchmark
Shruti Singh, Shoaib Alam, Husain Malwat, Mayank Singh
The ever-increasing volume of paper submissions makes it difficult to stay informed about the latest state-of-the-art research. To address
this challenge, we introduce LEGOBench, a benchmark for evaluating systems that generate scientific leaderboards. LEGOBench is curated
from 22 years of preprint submission data on arXiv and more than 11k machine learning leaderboards on the PapersWithCode portal. We
present one language-model-based and four graph-based leaderboard generation task configurations. We evaluate popular encoder-only scientific
language models as well as decoder-only large language models across these task configurations. State-of-the-art models showcase signifi-
cant performance gaps in automatic leaderboard generation on LEGOBench. The code is available on GitHub and the dataset is hosted on OSF.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
CHAmbi: A New Benchmark on Chinese Ambiguity Challenges for Large Language Models
Qin Zhang, Sihan Cai, Jiaxu Zhao, Mykola Pechenizkiy, Meng Fang
Ambiguity is an inherent feature of language, whose management is crucial for effective communication and collaboration. This is partic-
ularly true for Chinese, a language with extensive lexical-morphemic ambiguity. Despite the wide use of large language models (LLMs)
in numerous domains and their growing proficiency in Chinese, there is a notable lack of datasets to thoroughly evaluate LLMs’ ability to
handle ambiguity in Chinese. To bridge this gap, we introduce the CHAmbi dataset, a specialized Chinese multi-label disambiguation dataset
formatted in Natural Language Inference. It comprises 4,991 pairs of premises and hypotheses, including 824 examples featuring a wide range
of ambiguities. In addition to the dataset, we develop a series of tests and conduct an extensive evaluation of pre-trained LLMs’ proficiency in
identifying and resolving ambiguity in the Chinese language. Our findings reveal that GPT-4 consistently delivers commendable performance
across various evaluative measures, albeit with limitations in robustness. The performances of other LLMs, however, demonstrate variability
in handling ambiguity-related tasks, underscoring the complexity of such tasks in the context of Chinese. The overall results highlight the
challenge of ambiguity handling for current LLMs and underscore the imperative need for further enhancement in LLM capabilities for ef-
fective ambiguity resolution in the Chinese language.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
One-to-Many Testing for Code Generation from (Just) Natural Language
Mansi Uniyal, Mukul Singh, Gust Verbruggen, Sumit Gulwani, Vu Le
MBPP is a popular dataset for evaluating the task of code generation from natural language. Despite its popularity, there are three problems:
(1) it relies on providing test cases to generate the right signature, (2) there is poor alignment between instruction and evaluation test cases, and
(3) contamination of the exact phrasing being present in training datasets. We adapt MBPP to emphasize generating code from just natural
language by (1) removing ambiguity about the semantics of the task from the descriptions, and (2) evaluating generated code on multiple sets
of assertions to account for ambiguity in the syntax. We compare popular open and closed weight models on the original (MBPP) and adapted
(MBUPP) datasets.
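A minimal sketch of the one-to-many evaluation idea (the helper and data layout are hypothetical, not the MBUPP harness): a completion counts as correct if it passes at least one full assertion set.
```python
def passes_any(candidate_fn, assertion_sets) -> bool:
    """Hypothetical sketch: evaluate a generated function against several
    alternative assertion sets, each encoding one plausible reading of the
    natural-language task; success on any full set counts as correct."""
    for assertions in assertion_sets:
        if all(check(candidate_fn) for check in assertions):
            return True
    return False

# Example with two readings of "return the larger value":
assertion_sets = [
    [lambda f: f(2, 3) == 3, lambda f: f(-1, -5) == -1],  # max semantics
    [lambda f: f(2, 3) == (3, 2)],                        # sorted-pair reading
]
print(passes_any(max, assertion_sets))  # True: the first set passes
```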
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Probing the Capacity of Language Model Agents to Operationalize Disparate Experiential Context Despite Distraction
Sonny George, Chris Sypherd, Dylan Cashman
Large language model (LLM) agents show promise in an increasing number of domains. In many proposed applications, it is expected that
the agent reasons over accumulated experience presented in an input prompt. We propose the OEDD (Operationalize Experience Despite Dis-
traction) corpus, a human-annotator-validated body of scenarios with pre-scripted agent histories where the agent must make a decision based
on disparate experiential information in the presence of a distractor. We evaluate three state-of-the-art LLMs (GPT-3.5 Turbo, GPT-4o, and
Gemini 1.5 Pro) using a minimal chain-of-thought prompting strategy and observe that when (1) the input context contains over 1,615 tokens
of historical interactions, (2) a crucially decision-informing premise is the rightful conclusion over two disparate environment premises, and
(3) a trivial, but distracting red herring fact follows, all LLMs perform worse than random choice at selecting the better of two actions. Our
code and test corpus are publicly available at: github.com/sonnygeorge/OEDD.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
MalAlgoQA: A Pedagogical Approach for Evaluating Counterfactual Reasoning Abilities of Large Language Models
Shashank Sonkar, Naiming Liu, MyCo Le, Richard Baraniuk
This paper introduces MalAlgoQA, a novel dataset designed to evaluate the counterfactual reasoning capabilities of Large Language Models
(LLMs) through a pedagogical approach. The dataset comprises mathematics and reading comprehension questions, each accompanied by
four answer choices and their corresponding rationales. At the heart of MalAlgoQA are “malgorithms” - rationales behind incorrect answer
choices that represent flawed yet logically coherent reasoning paths. These malgorithms serve as counterfactual scenarios, allowing us to
assess an LLM’s ability to identify and analyze flawed reasoning patterns. We propose the Malgorithm Identification task, where LLMs are assessed based on their ability to identify the corresponding malgorithm given an incorrect answer choice. To evaluate model performance, we introduce two metrics: Algorithm Identification Accuracy (AIA) for correct-answer rationale identification, and Malgorithm Identification Accuracy (MIA) for incorrect-answer rationale identification. Our experiments reveal that state-of-the-art LLMs exhibit significant performance drops in MIA compared to AIA, highlighting the challenges in counterfactual reasoning. Surprisingly, we find that the chain-of-thought prompting technique not only fails to consistently enhance MIA but can sometimes lead to underperformance compared to simple prompting. These findings have important implications for developing LLMs with improved counterfactual reasoning, particularly relevant for AI-powered tutoring systems, where identifying and addressing student misconceptions is essential. The MalAlgoQA dataset is publicly available.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Gazelle: An Instruction Dataset for Arabic Writing Assistance
Samar Mohamed Magdy, Fakhraddin Alwajih, Sang Yun Kwon, Reem Abdel-Salam, Muhammad Abdul-Mageed
Writing has long been considered a hallmark of human intelligence and remains a pinnacle task for artificial intelligence (AI) due to the
intricate cognitive processes involved. Recently, rapid advancements in generative AI, particularly through the development of Large Lan-
guage Models (LLMs), have significantly transformed the landscape of writing assistance. However, underrepresented languages like Arabic
encounter significant challenges in the development of advanced AI writing tools, largely due to the limited availability of data. This scarcity
constrains the training of effective models, impeding the creation of sophisticated writing assistance technologies. To address these issues,
we present Gazelle, a comprehensive dataset for Arabic writing assistance. In addition, we offer an evaluation framework designed to enhance Arabic writing assistance tools. Our human evaluation of leading LLMs, including GPT-4, GPT-4o, Cohere Command R+, and Gemini 1.5 Pro, highlights their respective strengths and limitations in addressing the challenges of Arabic writing. Our findings underscore the need for continuous model training and dataset enrichment to manage the complexities of Arabic language processing, paving the way for more effective AI-powered Arabic writing tools.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains
Simeng Han, Aaron Yu, Rui Shen, Zhenting Qi, Martin Riddell, Wenfei Zhou, Yujie Qiao, Yilun Zhao, Semih Yavuz, Ye Liu, Shafiq Joty,
Yingbo Zhou, Caiming Xiong, Rex Ying, Arman Cohan, Dragomir Radev
Existing methods for understanding the capabilities of LLMs in logical reasoning rely on binary entailment classification or synthetically derived rationales, which are not sufficient for properly assessing a model's capabilities. We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains for a set of realistic logical reasoning stories also written by humans. P-FOLIO is collected with an annotation protocol that enables humans to annotate well-structured natural language proofs for first-order logic reasoning problems in a step-by-step manner. The number of reasoning steps in P-FOLIO spans from 0 to 20. We further use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities. We evaluate LLM reasoning capabilities at a fine granularity via single-step inference rule classification, with more diverse inference rules and higher levels of complexity than previous works. Given that a single model-generated reasoning chain can take a completely different path than the human-annotated one, we sample multiple reasoning chains from a model and use pass@k metrics to evaluate the quality of model-generated reasoning chains. We show that human-written reasoning chains significantly boost the logical reasoning capabilities of LLMs via many-shot prompting and fine-tuning. Furthermore, fine-tuning Llama3-7B on P-FOLIO improves model performance by 10% or more on three other out-of-domain logical reasoning datasets.
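For reference, the pass@k metric mentioned above is typically computed with the standard unbiased estimator introduced by Chen et al. (2021); a short sketch:
```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled reasoning chains of which
    c are judged correct, return the probability that at least one of k
    randomly drawn samples is correct."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```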
requirements, optimizes storage costs, and decreases memory bandwidth needs during inference. However, due to numerical representation limitations, traditional scalar-based weight quantization struggles to achieve such extremely low bit-widths. Recent research on Vector Quantization (VQ) for LLMs has demonstrated the potential for extremely low-bit model quantization by compressing vectors into indices using lookup tables. In this paper, we introduce Vector Post-Training Quantization (VPTQ) for extremely low-bit quantization of LLMs. We use Second-Order Optimization to formulate the LLM VQ problem and guide our quantization algorithm design by solving the optimization. We further refine the weights using Channel-Independent Second-Order Optimization for a granular VQ. In addition, by decomposing the optimization problem, we propose a brief and effective codebook initialization algorithm. We also extend VPTQ to support residual and outlier quantization, which enhances model accuracy and further compresses the model. Our experimental results show that VPTQ reduces model quantization perplexity by 0.01-0.34 on LLaMA-2, 0.38-0.68 on Mistral-7B, and 4.41-7.34 on LLaMA-3 over SOTA at 2-bit, with an average accuracy improvement of 0.79-1.5% on LLaMA-2, 1% on Mistral-7B, and 11-22% on LLaMA-3 on QA tasks. We utilize only 10.4-18.6% of the quantization algorithm execution time, resulting in a 1.6-1.8× increase in inference throughput compared to SOTA.
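The lookup-table mechanism underlying VQ-based weight compression can be illustrated in a few lines (a generic sketch of vector quantization, not the VPTQ algorithm itself):
```python
import torch

def vq_quantize(weights: torch.Tensor, codebook: torch.Tensor):
    """Generic vector-quantization sketch: split a weight matrix into short
    vectors and store, for each, the index of its nearest codebook entry.
    weights: [n, d] weight vectors; codebook: [K, d] learned centroids."""
    dists = torch.cdist(weights, codebook)  # [n, K] pairwise distances
    return dists.argmin(dim=-1)             # low-bit indices, e.g. K=256 -> 8 bits

def vq_dequantize(indices: torch.Tensor, codebook: torch.Tensor):
    """Dequantization is a table lookup: indices -> codebook vectors."""
    return codebook[indices]
```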
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Can Active Label Correction Improve LLM-based Modular AI Systems?
Karan Taneja, Ashok Goel
Modular AI systems can be developed using LLM-prompts-based modules to minimize deployment time even for complex tasks. However,
these systems do not always perform well and improving them using the data traces collected from a deployment remains an open challenge.
The data traces contain LLM inputs and outputs, but the annotations from LLMs are noisy. We hypothesize that Active Label Correction (ALC) can be used on the collected data to train smaller task-specific improved models that can replace LLM-based modules. In this paper,
we study the noise in three GPT-3.5-annotated datasets and their denoising with human feedback. We also propose a novel method ALC3
that iteratively applies three updates to the training dataset: auto-correction, correction using human feedback and filtering. Our results show
that ALC3 can lead to oracle performance with feedback on 17-24% fewer examples than the number of noisy examples in the dataset across
three different NLP tasks.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
ORPO: Monolithic Preference Optimization without Reference Model
Jiwoo Hong, Noah Lee, James Thorne
While recent preference alignment algorithms for language models have demonstrated promising results, supervised fine-tuning (SFT) re-
mains imperative for achieving successful convergence. In this paper, we revisit SFT in the context of preference alignment, emphasizing that
a minor penalty for the disfavored style is sufficient for preference alignment. Building on this foundation, we introduce a straightforward
reference model-free monolithic odds ratio preference optimization algorithm, ORPO, eliminating the need for an additional preference align-
ment phase. We demonstrate, both empirically and theoretically, that the odds ratio is a sensible choice for contrasting favored and disfavored
styles during SFT across diverse sizes from 125M to 7B. Specifically, fine-tuning Phi-2 (2.7B), Llama-2 (7B), and Mistral (7B) with ORPO
on the UltraFeedback alone surpasses the performance of state-of-the-art language models including Llama-2 Chat and Zephyr with more
than 7B and 13B parameters: achieving up to 12.20% on AlpacaEval 2.0 (Figure 1), and 7.32 in MT-Bench (Table 2). We release code and
model checkpoints for Mistral-ORPO-α (7B) and Mistral-ORPO-β (7B).
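The odds-ratio penalty can be written compactly; below is a minimal sketch assuming per-token-averaged sequence log-probabilities for the favored and disfavored responses (variable names and the weight `lam` are illustrative):
```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen, logp_rejected, nll_chosen, lam=0.1):
    """Sketch of a monolithic ORPO-style objective: the standard SFT loss
    on the favored response plus a minor odds-ratio penalty contrasting
    favored and disfavored styles. odds(y|x) = p / (1 - p), with p the
    length-averaged sequence probability exp(logp)."""
    log_odds_chosen = logp_chosen - torch.log1p(-torch.exp(logp_chosen))
    log_odds_rejected = logp_rejected - torch.log1p(-torch.exp(logp_rejected))
    penalty = -F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return nll_chosen + lam * penalty.mean()  # SFT term + odds-ratio term
```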
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher
Hyunjong Ok, Jegwang Ryu, Jaeho Lee
How can small-scale large language models (LLMs) efficiently utilize the supervision of LLMs to improve their generative quality? This
question has been well studied in scenarios where there is no restriction on the number of LLM supervisions one can use, giving birth to
many decoding algorithms that utilize supervision without further training. However, it is still unclear what is an effective strategy under
the limited supervision scenario, where we assume that no more than a few tokens can be generated by LLMs. To this end, we develop an
algorithm to effectively aggregate the small-scale LLM and LLM predictions on initial tokens so that the generated tokens can more accurately
condition the subsequent token generation by small-scale LLM only. Critically, we find that it is essential to adaptively overtrust or disregard
the LLM prediction based on the confidence of the small-scale LLM. Through our experiments on a wide range of models and datasets,
we demonstrate that our method provides a consistent improvement over conventional decoding strategies. Code: https://github.com/HJ-Ok/DecLimSup
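A minimal sketch of the confidence-gated idea (illustrative only; the gating rule and threshold `tau` are assumptions, not the paper's exact algorithm):
```python
import torch

def gated_first_tokens(student_logits, teacher_logits, tau=0.5):
    """For each initial position, keep the small model's distribution when
    it is confident, and defer to (overtrust) the teacher LLM otherwise.
    Both logits: [seq_len, vocab]."""
    p_student = student_logits.softmax(dim=-1)
    p_teacher = teacher_logits.softmax(dim=-1)
    confident = p_student.max(dim=-1).values >= tau  # [seq_len] gate
    probs = torch.where(confident.unsqueeze(-1), p_student, p_teacher)
    return probs.argmax(dim=-1)  # tokens that condition student-only decoding
```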
Qiu
The evolution of Large Language Models (LLMs) has led to significant advancements, with models like Claude and Gemini capable of pro-
cessing contexts up to 1 million tokens. However, efficiently handling long sequences remains challenging, particularly during the prefilling
stage when input lengths exceed GPU memory capacity. Traditional methods often segment sequences into chunks and compress them iteratively with fixed-size memory. However, our empirical analysis shows that fixed-size memory results in wasted computational and GPU memory resources. Therefore, we introduce Incremental Memory (IM), a method that starts with a small memory size and gradually
increases it, optimizing computational efficiency. Additionally, we propose Decremental Chunk based on Incremental Memory (IMDC),
which reduces chunk size while increasing memory size, ensuring stable and lower GPU memory usage. Our experiments demonstrate that
IMDC is consistently faster (1.45x) and reduces GPU memory consumption by 23.3% compared to fixed-size memory, achieving comparable
performance on the LongBench Benchmark.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Scalable Data Ablation Approximations for Language Models through Modular Training and Merging
Clara Na, Ian Magnusson, Ananya Harsh Jha, Tom Sherborne, Emma Strubell, Jesse Dodge, Pradeep Dasigi
Training data compositions for Large Language Models (LLMs) can significantly affect their downstream performance. However, a thorough
data ablation study exploring large sets of candidate data mixtures is typically prohibitively expensive since the full effect is seen only after
training the models; this can lead practitioners to settle for sub-optimal data mixtures. We propose an efficient method for approximating
data ablations which trains individual models on subsets of a training corpus and reuses them across evaluations of combinations of subsets. In
continued pre-training experiments, we find that, given an arbitrary evaluation set, the perplexity score of a single model trained on a candidate
set of data is strongly correlated with perplexity scores of parameter averages of models trained on distinct partitions of that data. From this
finding, we posit that researchers and practitioners can conduct inexpensive simulations of data ablations by maintaining a pool of models that
were each trained on partitions of a large training corpus, and assessing candidate data mixtures by evaluating parameter averages of com-
binations of these models. This approach allows for substantial improvements in amortized training efficiency – scaling only linearly with
respect to new data – by enabling reuse of previous training computation, opening new avenues for improving model performance through
rigorous, incremental data assessment and mixing.
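The reuse mechanism rests on parameter averaging; a minimal sketch of scoring a candidate mixture from models trained on its partitions (names are illustrative):
```python
import torch

def average_state_dicts(state_dicts):
    """Sketch: approximate "a model trained on the union of partitions"
    by the parameter average of models trained on each partition, then
    evaluate that average instead of retraining on the union."""
    return {
        key: torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
        for key in state_dicts[0]
    }

# e.g., score mixture A+B: model.load_state_dict(average_state_dicts([sd_a, sd_b]))
# then measure perplexity on the evaluation set of interest.
```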
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters
Zhiyu Guo, Hidetaka Kamigaito, Taro Watanabe
Scaling the context size of large language models (LLMs) enables them to perform various new tasks, e.g., book summarization. However,
the memory cost of the Key and Value (KV) cache in attention significantly limits the practical applications of LLMs. Recent works have
explored token pruning for KV cache reduction in LLMs, relying solely on attention scores as a token importance indicator. However, our
investigation into value vector norms revealed a notably non-uniform pattern questioning their reliance only on attention scores. Inspired by
this, we propose a new method: Value-Aware Token Pruning (VATP), which uses both attention scores and the ℓ1 norm of value vectors to
evaluate token importance. Extensive experiments on LLaMA2-7B-chat and Vicuna-v1.5-7B across 16 LongBench tasks demonstrate that
VATP outperforms attention-score-only baselines in over 12 tasks, confirming the effectiveness of incorporating value vector norms into token
importance evaluation of LLMs.
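The scoring rule is easy to state: weight each token's attention score by the ℓ1 norm of its value vector. A small sketch (tensor shapes and the pruning helper are illustrative):
```python
import torch

def vatp_importance(attn_scores: torch.Tensor, values: torch.Tensor):
    """Sketch of Value-Aware Token Pruning scores: attention score times
    the l1 norm of the value vector, so tokens with tiny value vectors
    rank low even when attended to.
    attn_scores: [n_tokens]; values: [n_tokens, head_dim]."""
    return attn_scores * values.abs().sum(dim=-1)

def prune_kv_cache(keys, values, attn_scores, keep: int):
    """Keep only the `keep` most important tokens in the KV cache."""
    idx = vatp_importance(attn_scores, values).topk(keep).indices.sort().values
    return keys[idx], values[idx]
```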
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
Scalable Fine-tuning from Multiple Data Sources: A First-Order Approximation Approach
Dongyue Li, Ziniu Zhang, Lu Wang, Hongyang R. Zhang
We study the problem of fine-tuning a language model (LM) for a target task by optimally using the information from n auxiliary tasks.
This problem has broad applications in NLP, such as targeted instruction tuning and data selection in chain-of-thought fine-tuning. The key
challenge of this problem is that not all auxiliary tasks are useful to improve the performance of the target task. Thus, choosing the right
subset of auxiliary tasks is crucial. Conventional subset selection methods, such as forward & backward selection, are unsuitable for LM
fine-tuning because they require repeated training on subsets of auxiliary tasks. This paper introduces a new algorithm to estimate model
fine-tuning performances without repeated training. Our algorithm first performs multitask training using the data of all the tasks to obtain a
meta initialization. Then, we approximate the model fine-tuning loss of a subset using functional values and gradients from the meta initial-
ization. Empirically, we find that this gradient-based approximation holds with remarkable accuracy for twelve transformer-based LMs. Thus,
we can now estimate fine-tuning performances on CPUs within a few seconds. We conduct extensive experiments to validate our approach,
delivering a speedup of 30× over conventional subset selection while incurring only 1% error of the true fine-tuning performances. In down-
stream evaluations of instruction tuning and chain-of-thought fine-tuning, our approach improves over prior methods that utilize gradient or
representation similarity for subset selection by up to 3.8%.
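The estimation step amounts to a first-order expansion around the meta initialization; a sketch with flattened parameter vectors (illustrative, not the released code):
```python
import torch

def approx_subset_loss(loss_meta: float, grad_meta: torch.Tensor,
                       theta_subset: torch.Tensor, theta_meta: torch.Tensor):
    """Sketch of the first-order approximation: estimate the fine-tuning
    loss of a model adapted toward an auxiliary-task subset from the loss
    and gradient at the multitask meta initialization,
    L(theta_S) ~ L(theta_meta) + <grad, theta_S - theta_meta>,
    so candidate subsets can be scored on CPU without retraining."""
    return loss_meta + torch.dot(grad_meta, theta_subset - theta_meta)
```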
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
STTATTS: Unified Speech-To-Text And Text-To-Speech Model
Hawau Olamide Toyin, Hao Li, Hanan Aldarmaki
Speech recognition and speech synthesis models are typically trained separately, each with its own set of learning objectives, training data, and
model parameters, resulting in two distinct large networks. We propose a parameter-efficient approach to learning ASR and TTS jointly via a
multi-task learning objective and shared parameters. Our evaluation demonstrates that the performance of our multi-task model is comparable to that of individually trained models while significantly saving computational and memory costs (∼50% reduction in the total number of
parameters required for the two tasks combined). We experiment with English as a resource-rich language, and Arabic as a relatively low-
resource language due to shortage of TTS data. Our models are trained with publicly available data, and both the training code and model
checkpoints are openly available for further research.
Nov 14 (Thu) 10:30-12:00 - Riverfront Hall
RoLoRA: Fine-tuning Rotated Outlier-free LLMs for Effective Weight-Activation Quantization
Xijie Huang, Zechun Liu, Shih-Yang Liu, Kwang-Ting Cheng
Low-Rank Adaptation (LoRA), as a representative Parameter-Efficient Fine-Tuning (PEFT) method, significantly enhances the training effi-
ciency by updating only a small portion of the weights in Large Language Models (LLMs). Recently, weight-only quantization techniques
have also been applied to LoRA methods to reduce the memory footprint of fine-tuning. However, applying weight-activation quantization to
the LoRA pipeline is under-explored, and we observe substantial performance degradation primarily due to the presence of activation outliers.
In this work, we propose RoLoRA, the first LoRA-based scheme to apply rotation for outlier elimination, and then fine-tune rotated outlier-
free LLMs for effective weight-activation quantization. Different from previous work tackling the outlier challenges from a post-training
perspective, we propose rotation-aware fine-tuning to eliminate and preserve the outlier-free characteristics brought by rotation operations.
RoLoRA can improve low-bit LoRA convergence and post-training quantization robustness in weight-activation settings. RoLoRA is evalu-
ated across various LLM series (LLaMA2, LLaMA3, LLaVA-1.5), tasks, and quantization settings, achieving up to 29.5% absolute accuracy
gain of 4-bit weight-activation quantized LLaMA2-13B on commonsense reasoning tasks compared to LoRA baseline. We further demon-
strate its effectiveness on Large Multimodal Models (LMMs) and prove the compatibility with advanced LoRA variants.
to various configurations, obtaining competitive quantization performance against strong baselines while obtaining an end-to-end throughput
increase of 1.5 to 2 times.
emerging field of historical psychology relies on computational techniques to extract aspects of psychology from historical corpora using new
methods developed in natural language processing (NLP). The present pipeline, called Contextualized Construct Representations (CCR), com-
bines expert knowledge in psychometrics (i.e., psychological surveys) with text representations generated via Transformer-based language
models to measure psychological constructs such as traditionalism, norm strength, and collectivism in classical Chinese corpora. Considering
the scarcity of available data, we propose an indirect supervised contrastive learning approach and build the first Chinese historical psychology
corpus (C-HI-PSY) to fine-tune pre-trained models. We evaluate the pipeline to demonstrate its superior performance compared with other
approaches. The CCR method outperforms word-embedding-based approaches across all of our tasks and exceeds prompting with GPT-4 in
most tasks. Finally, we benchmark the pipeline against objective, external data to further verify its validity.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Noise, Novels, Numbers. A Framework for Detecting and Categorizing Noise in Danish and Norwegian Literature
ALI ALLAITH, Daniel Hershcovich, Jens Bjerring-Hansen, Jakob Ingemann Parby, Alexander Conroy, Timothy R Tangherlini
We present a framework for detecting and categorizing noise in literary texts, demonstrated through its application to Danish and Norwegian literature from the late 19th century. Noise, understood as “aberrant sonic behaviour,” is not only an auditory phenomenon but also a cultural construct tied to the processes of civilization and urbanization. We begin by utilizing topic modeling techniques to identify noise-related documents, followed by fine-tuning BERT-based language models trained on Danish and Norwegian texts to analyze a corpus of over 800 novels. We identify and track the prevalence of noise in these texts, offering insights into the literary perceptions of noise during the Scandinavian “Modern Breakthrough” period (1870-1899). Our contributions include the development of a comprehensive dataset annotated for noise-related segments and their categorization into human-made, non-human-made, and musical noises. This study illustrates the framework's potential for enhancing the understanding of the relationship between noise and its literary representations, providing a deeper appreciation of the auditory elements in literary works, including as sources for cultural history.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Contrastive Entity Coreference and Disambiguation for Historical Texts
Abhishek Arora, Emily Silcock, Melissa Dell, Leander Heldring
Massive-scale historical document collections are crucial for social science research. Despite increasing digitization, these documents typi-
cally lack unique cross-document identifiers for individuals mentioned within the texts, as well as individual identifiers from external knowl-
edge bases like Wikipedia/Wikidata. Existing entity disambiguation methods often fall short in accuracy for historical documents, which
are replete with individuals not remembered in contemporary knowledge bases. This study makes three key contributions to improve cross-
document coreference resolution and disambiguation in historical texts: a massive-scale training dataset replete with hard negatives - that
sources over 190 million entity pairs from Wikipedia contexts and disambiguation pages - high-quality evaluation data from hand-labeled
historical newswire articles, and trained models evaluated on this historical benchmark. We contrastively train bi-encoder models for coref-
erencing and disambiguating individuals in historical texts, achieving accurate, scalable performance that identifies out-of-knowledge base
individuals. Our approach significantly surpasses other entity disambiguation models on our historical newswire benchmark. Our models also
demonstrate competitive performance on modern entity disambiguation benchmarks, particularly on certain news disambiguation datasets.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Susu Box or Piggy Bank: Assessing Cultural Commonsense Knowledge between Ghana and the US
Christabel Acquaye, Haozhe An, Rachel Rudinger
Recent work has highlighted the culturally-contingent nature of commonsense knowledge. We introduce AMAMMER, a test set of 525
multiple-choice questions designed to evaluate the commonsense knowledge of English LLMs, relative to the cultural contexts of Ghana and
the United States. To create AMAMMER, we select a set of multiple-choice questions (MCQs) from existing commonsense datasets and
rewrite them in a multi-stage process involving surveys of Ghanaian and U.S. participants. In three rounds of surveys, participants from both
pools are solicited to (1) write correct and incorrect answer choices, (2) rate individual answer choices on a 5-point Likert scale, and (3) select
the best answer choice from the newly-constructed MCQ items, in a final validation step. By engaging participants at multiple stages, our
procedure ensures that participant perspectives are incorporated both in the creation and validation of test items, resulting in high levels of
agreement within each pool. We evaluate several off-the-shelf English LLMs on AMAMMER. Uniformly, models prefer answer choices that align with the preferences of U.S. annotators over Ghanaian annotators. Additionally, when test items specify a cultural context (Ghana or the U.S.), models exhibit some ability to adapt, but performance is consistently better in U.S. contexts than Ghanaian ones. As large resources
are devoted to the advancement of English LLMs, our findings underscore the need for culturally adaptable models and evaluations to meet
the needs of diverse English-speaking populations around the world.
Nov 14 (Thu) 14:00-15:30 - Jasmine
The Lou Dataset - Exploring the Impact of Gender-Fair Language in German Text Classification
Andreas Waldis, Joel Birrer, Anne Lauscher, Iryna Gurevych
Gender-fair language, an evolving linguistic variation in German, fosters inclusion by addressing all genders or using neutral forms. However, there is a notable lack of resources to assess the impact of this language shift on language models (LMs), which might not have been trained on examples of this variation. Addressing this gap, we present Lou, the first dataset providing high-quality reformulations for German text classification
covering seven tasks, like stance detection and toxicity classification. We evaluate 16 mono- and multi-lingual LMs and find substantial label
flips, reduced prediction certainty, and significantly altered attention patterns. However, existing evaluations remain valid, as LM rankings
are consistent across original and reformulated instances. Our study provides initial insights into the impact of gender-fair language on clas-
sification for German. However, these findings are likely transferable to other languages, as we found consistent patterns in multi-lingual and
English LMs.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Understanding Slang with LLMs: Modelling Cross-Cultural Nuances through Paraphrasing
Ifeoluwa Wuraola, Nina Dethlefs, Daniel Marciniak
In the realm of social media discourse, the integration of slang enriches communication, reflecting the sociocultural identities of users. This
study investigates the capability of large language models (LLMs) to paraphrase slang within climate-related tweets from Nigeria and the
UK, with a focus on identifying emotional nuances. Using DistilRoBERTa as the baseline model, we observe its limited comprehension of slang. To improve cross-cultural understanding, we gauge the effectiveness of leading LLMs (ChatGPT 4, Gemini, and LLaMA3) in slang paraphrasing. While ChatGPT 4 and Gemini demonstrate comparable effectiveness in slang paraphrasing, LLaMA3 shows less coverage,
with all LLMs exhibiting limitations in coverage, especially of Nigerian slang. Our findings underscore the necessity for culturally sensitive
LLM development in emotion classification, particularly in non-anglocentric regions.
diverse groups is unclear. By including additional context in prompts, we comprehensively analyze LLMs' sensitivity to geographical priming, persona attributes, and numerical information to assess how well the needs of various groups are reflected. Our findings on two LLMs,
five languages, and six datasets reveal that mimicking persona-based attributes leads to annotation variability. Meanwhile, incorporating ge-
ographical signals leads to better regional alignment. We also find that the LLMs are sensitive to numerical anchors, indicating the ability to
leverage community-based flagging efforts and exposure to adversaries. Our work provides preliminary guidelines and highlights the nuances
of applying LLMs in culturally sensitive cases.
Nov 14 (Thu) 14:00-15:30 - Jasmine
AutoPersuade: A Framework for Evaluating and Explaining Persuasive Arguments
Till Raphael Saenger, Musashi Hinck, Justin Grimmer, Brandon M. Stewart
We introduce a three-part framework for constructing persuasive messages, AutoPersuade. First, we curate a large collection of arguments
and gather human evaluations of their persuasiveness. Next, we introduce a novel topic model to identify the features of these arguments that
influence persuasion. Finally, we use the model to predict the persuasiveness of new arguments and to assess the causal effects of argument
components, offering an explanation of the results. We demonstrate the effectiveness of AutoPersuade in an experimental study on arguments
for veganism, validating our findings through human studies and out-of-sample predictions.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Community-Cross-Instruct: Unsupervised Instruction Generation for Aligning Large Language Models to Online Communities
Zihao He, Rebecca Dorn, Minh Duc Chu, Siyi Guo, Kristina Lerman
Social scientists use surveys to probe the opinions and beliefs of populations, but these methods are slow, costly, and prone to biases. Recent
advances in large language models (LLMs) enable the creation of computational representations or "digital twins" of populations that generate
human-like responses mimicking the population’s language, styles, and attitudes. We introduce Community-Cross-Instruct, an unsupervised
framework for aligning LLMs to online communities to elicit their beliefs. Given a corpus of a community's online discussions, Community-Cross-Instruct uses an advanced LLM to automatically generate instruction-output pairs to (1) finetune a foundational LLM to faithfully represent that community, and (2) evaluate the alignment of the finetuned model to the community. We demonstrate the method's utility in accurately
representing political and diet communities on Reddit. Unlike prior methods requiring human-authored instructions, Community-Cross-
Instruct generates instructions in a fully unsupervised manner, enhancing scalability and generalization across domains. This work enables
cost-effective and automated surveying of diverse online communities.
Nov 14 (Thu) 14:00-15:30 - Jasmine
MemeCLIP: Leveraging CLIP Representations for Multimodal Meme Classification
Siddhant Bikram Shah, Shuvam Shiwakoti, Maheep Chaudhary, Haohan Wang
The complexity of text-embedded images presents a formidable challenge in machine learning given the need for multimodal understand-
ing of multiple aspects of expression conveyed by them. While previous research in multimodal analysis has primarily focused on singular
aspects such as hate speech and its subclasses, this study expands this focus to encompass multiple aspects of linguistics: hate, targets of
hate, stance, and humor. We introduce a novel dataset PrideMM comprising 5,063 text-embedded images associated with the LGBTQ+ Pride
movement, thereby addressing a serious gap in existing resources. We conduct extensive experimentation on PrideMM by using unimodal and
multimodal baseline methods to establish benchmarks for each task. Additionally, we propose a novel framework MemeCLIP for efficient
downstream learning while preserving the knowledge of the pre-trained CLIP model. The results of our experiments show that MemeCLIP
achieves superior performance compared to previously proposed frameworks on two real-world datasets. We further compare the perfor-
mance of MemeCLIP and zero-shot GPT-4 on the hate classification task. Finally, we discuss the shortcomings of our model by qualitatively
analyzing misclassified samples. Our code and dataset are publicly available at: https://github.com/SiddhantBikram/MemeCLIP.
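The efficient downstream learning described above can be pictured as a frozen CLIP backbone feeding a small trainable head. The sketch below is a linear-probe stand-in under that assumption; MemeCLIP itself adds further modules, and the class name here is illustrative, not from the paper.

```python
# Hedged sketch: meme classification on frozen CLIP image+text features
# with a lightweight linear head (a linear-probe stand-in, not MemeCLIP).
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor
from PIL import Image

class FrozenClipClassifier(nn.Module):  # illustrative name
    def __init__(self, num_classes, clip_name="openai/clip-vit-base-patch32"):
        super().__init__()
        self.clip = CLIPModel.from_pretrained(clip_name)
        for p in self.clip.parameters():  # preserve pre-trained CLIP knowledge
            p.requires_grad = False
        dim = self.clip.config.projection_dim  # 512 for ViT-B/32
        self.head = nn.Linear(2 * dim, num_classes)  # image + text features

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.clip.get_image_features(pixel_values=pixel_values)
        txt = self.clip.get_text_features(input_ids=input_ids,
                                          attention_mask=attention_mask)
        return self.head(torch.cat([img, txt], dim=-1))

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model = FrozenClipClassifier(num_classes=2)
inputs = processor(text=["meme caption here"],
                   images=[Image.new("RGB", (224, 224))],  # dummy image
                   return_tensors="pt", padding=True)
logits = model(inputs["pixel_values"], inputs["input_ids"],
               inputs["attention_mask"])
```

Freezing the backbone keeps the pre-trained representation intact while only the head is trained per task (hate, target, stance, humor).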
Nov 14 (Thu) 14:00-15:30 - Jasmine
Building a Multi-Platform, BERT Classifier for Detecting Connective Language
Josephine Lukito, Bin Chen, Gina M. Masullo, Natalie Jomini Stroud
This study presents an approach for detecting connective language—defined as language that facilitates engagement, understanding, and
conversation—from social media discussions. We developed and evaluated two types of classifiers: BERT and GPT-3.5 turbo. Our results
demonstrate that the BERT classifier significantly outperforms GPT-3.5 turbo in detecting connective language. Furthermore, our analysis
confirms that connective language is distinct from related concepts measuring discourse qualities, such as politeness and toxicity. We also
explore the potential of BERT-based classifiers for platform-agnostic tools. This research advances our understanding of the linguistic dimen-
sions of online communication and proposes practical tools for detecting connective language across diverse digital environments.
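For readers who want the mechanics, the sketch below shows a minimal binary BERT fine-tuning loop with Hugging Face Transformers, assuming a corpus of (text, label) pairs. The hyperparameters and toy examples are illustrative, not the authors' setup.

```python
# Hedged sketch: fine-tuning BERT to detect "connective language" (label 1).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

train = [("Thanks for explaining, that helps me see your side.", 1),
         ("You people never listen.", 0)]  # toy examples, not the dataset

model.train()
for epoch in range(3):
    for text, label in train:
        batch = tokenizer(text, truncation=True, return_tensors="pt")
        out = model(**batch, labels=torch.tensor([label]))
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```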
Nov 14 (Thu) 14:00-15:30 - Jasmine
Language is Scary when Over-Analyzed: Unpacking Implied Misogynistic Reasoning with Argumentation Theory-Driven Prompts
Arianna Muti, Federico Ruggeri, Khalid Al Khatib, Alberto Barrón-Cedeño, Tommaso Caselli
We propose misogyny detection as an Argumentative Reasoning task and we investigate the capacity of large language models (LLMs) to
understand the implicit reasoning used to convey misogyny in both Italian and English. The central aim is to generate the missing reasoning
link between a message and the implied meanings encoding the misogyny. Our study uses argumentation theory as a foundation to form a
collection of prompts in both zero-shot and few-shot settings. These prompts integrate different techniques, including chain-of-thought rea-
soning and augmented knowledge. Our findings show that LLMs fall short in reasoning about misogynistic comments and that
they mostly rely on their implicit knowledge derived from internalized common stereotypes about women to generate implied assumptions,
rather than on inductive reasoning.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Style-Shifting Behaviour of the Manosphere on Reddit
Jai Aggarwal, Suzanne Stevenson
Hate speech groups (HSGs) may negatively influence online platforms through their distinctive language, which may affect the tone and
topics of other spaces if spread beyond the HSGs. We explore the linguistic style of the Manosphere, a misogynistic HSG, on Reddit. We find
that Manospheric authors have a distinct linguistic style, marked not only by uncivil language but also by a greater focus on gendered topics, both of which are retained when posting in other communities. Thus, potentially harmful aspects of Manospheric style carry over into posts on non-Manospheric
subreddits, motivating future work to explore how this stylistic spillover may negatively influence community health.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Rater Cohesion and Quality from a Vicarious Perspective
Deepak Pandita, Tharindu Cyril Weerasooriya, Sujan Dutta, Sarah K. K. Luger, Tharindu Ranasinghe, Ashiqur R. KhudaBukhsh, Marcos
Zampieri, Christopher M Homan
Human feedback is essential for building human-centered AI systems across domains where disagreement is prevalent, such as AI safety,
content moderation, or sentiment analysis. Many disagreements, particularly in politically charged settings, arise because raters have op-
posing values or beliefs. Vicarious annotation is a method for breaking down disagreement by asking raters how they think others would
annotate the data. In this paper, we explore the use of vicarious annotation with analytical methods for moderating rater disagreement. We
employ rater cohesion metrics to study the potential influence of political affiliations and demographic backgrounds on raters’ perceptions
of offense. Additionally, we utilize CrowdTruth’s rater quality metrics, which consider the demographics of the raters, to score the raters
and their annotations. We study how the rater quality metrics influence the in-group and cross-group rater cohesion across the personal and
vicarious levels.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Shall We Team Up: Exploring Spontaneous Cooperation of Competing LLM Agents
Zengqing Wu, Run Peng, Shuyuan Zheng, Qianying Liu, Xu Han, Brian I. Kwon, Makoto Onizuka, Shaojie Tang, Chuan Xiao
Large Language Models (LLMs) have increasingly been utilized in social simulations, where they are often guided by carefully crafted instruc-
tions to stably exhibit human-like behaviors during simulations. Nevertheless, we doubt the necessity of shaping agents’ behaviors for accurate
social simulations. Instead, this paper emphasizes the importance of spontaneous phenomena, wherein agents deeply engage in contexts and
make adaptive decisions without explicit directions. We explored spontaneous cooperation across three competitive scenarios and success-
fully simulated the gradual emergence of cooperation, findings that align closely with human behavioral data. This approach not only aids the
computational social science community in bridging the gap between simulations and real-world dynamics but also offers the AI community
a novel method to assess LLMs' capability of deliberate reasoning. Our source code is available at https://github.com/wuzengqing001225/SABM_ShallWeTeamUp
Nov 14 (Thu) 14:00-15:30 - Jasmine
Toeing the party line: election manifestos as a key to understand political discourse on Twitter
Maximilian Maurer, Tanise Ceron, Sebastian Padó, Gabriella Lapesa
Political discourse on Twitter is a moving target: politicians continuously make statements about their positions. It is therefore crucial to
track their discourse on social media to understand their ideological positions and goals. However, Twitter data is also challenging to work
with since it is ambiguous and often dependent on social context, and consequently, recent work on political positioning has tended to focus
strongly on manifestos (parties' electoral programs) rather than social media. In this paper, we extend recently proposed methods to predict
pairwise positional similarities between parties from the manifesto case to the Twitter case, using hashtags as a signal to fine-tune text repre-
sentations, without the need for manual annotation. We verify the efficacy of fine-tuning and conduct a series of experiments that assess the
robustness of our method for low-resource scenarios. We find that our method yields stable positionings reflective of manifesto positionings,
both in scenarios with all tweets of candidates across years available and when only smaller subsets from shorter time periods are available.
This indicates that it is possible to reliably analyze the relative positioning of actors without the need for manual annotation, even in the noisier
context of social media.
Nov 14 (Thu) 14:00-15:30 - Jasmine
On the Rigour of Scientific Writing: Criteria, Analysis, and Insights
Joseph James, Chenghao Xiao, Yucheng Li, Chenghua Lin
Rigour is crucial for scientific research as it ensures the reproducibility and validity of results and findings. Despite its importance, little work
exists on modelling rigour computationally, and there is a lack of analysis of whether proposed criteria can effectively signal or measure the
rigour of scientific papers in practice. In this paper, we introduce a bottom-up, data-driven framework to automatically identify and define
rigour criteria and assess their relevance in scientific writing. Our framework includes rigour keyword extraction, detailed rigour definition
generation, and salient criteria identification. Furthermore, our framework is domain-agnostic and can be tailored to the evaluation of sci-
entific rigour for different areas, accommodating the distinct salient criteria across fields. We conducted comprehensive experiments based
on datasets collected from different domains (e.g. ICLR, ACL) to demonstrate the effectiveness of our framework in modelling rigour. In
addition, we analyse linguistic patterns of rigour, revealing that framing certainty is crucial for enhancing the perception of scientific rigour,
while suggestion certainty and probability uncertainty diminish it.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Conversation Redirection in Mental Health Therapy
Vivian Nguyen, Sang Min Jung, Lillian Lee, Thomas D. Hull, Cristian Danescu-Niculescu-Mizil
Mental-health therapy involves a complex conversation flow in which patients and therapists continuously negotiate what should be talked
about next. For example, therapists might try to shift the conversation's direction to keep the therapeutic process on track and avoid stagnation, or patients might push the discussion towards issues they want to focus on. How do such patient and therapist redirections relate to the development and quality of their relationship? To answer this question, we introduce a probabilistic measure of the extent to which a certain utterance immediately redirects the flow of the conversation, accounting for both the intention and the actual realization of such a change. We apply this new measure to characterize the development of patient-therapist relationships over multiple sessions in a very large, widely-used online therapy platform. Our analysis reveals that (1) patient control of the conversation's direction generally increases relative to that of
the therapist as their relationship progresses; and (2) patients who have less control in the first few sessions are significantly more likely to
eventually express dissatisfaction with their therapist and terminate the relationship.
Nov 14 (Thu) 14:00-15:30 - Jasmine
How Entangled is Factuality and Deception in German?
Aswathy Velutharambath, Amelie Wuehrl, Roman Klinger
The statement "The earth is flat" is factually inaccurate, but if someone truly believes and argues in its favor, it is not deceptive. Research on
deception detection and fact checking often conflates factual accuracy with the truthfulness of statements. This assumption makes it difficult
to (a) study subtle distinctions and interactions between the two and (b) gauge their effects on downstream tasks. The belief-based deception
framework disentangles these properties by defining texts as deceptive when there is a mismatch between what people say and what they
truly believe. In this study, we assess if presumed patterns of deception generalize to German language texts. We test the effectiveness of
computational models in detecting deception using an established corpus of belief-based argumentation. Finally, we gauge the impact of
deception on the downstream task of fact checking and explore if this property confounds verification models. Surprisingly, our analysis finds
no correlation with established cues of deception. Previous work claimed that computational models can outperform humans in deception
detection accuracy, however, our experiments show that both traditional and state-of-the-art models struggle with the task, performing no
better than random guessing. For fact checking, we find that natural language inference-based verification performs worse on non-factual and
deceptive content, while prompting large language models for the same task is less sensitive to these properties.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Do *they* mean ’us’? Interpreting Referring Expressions in Intergroup Bias
Venkata Subrahmanyan Govindarajan, Matianyu Zang, Kyle Mahowald, David Beaver, Junyi Jessy Li
The variations between in-group and out-group speech (intergroup bias) are subtle and could underlie many social phenomena like stereotype
perpetuation and implicit bias. In this paper, we model intergroup bias as a tagging task on English sports comments from forums dedicated
to fandom for NFL teams. We curate a dataset of over 6 million game-time comments from opposing perspectives (the teams in the game),
each comment grounded in a non-linguistic description of the events that precipitated these comments (live win probabilities for each team).
Expert and crowd annotations justify modeling the bias through tagging of implicit and explicit referring expressions and reveal the rich,
contextual understanding of language and the world required for this task. For large-scale analysis of intergroup variation, we use LLMs for
automated tagging, and discover that LLMs occasionally perform better when prompted with linguistic descriptions of the win probability at
the time of the comment, rather than numerical probability. Further, large-scale tagging of comments using LLMs uncovers linear variations
in the form of referent across win probabilities that distinguish in-group and out-group utterances.
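The finding that linguistic descriptions of win probability can work better than raw numbers suggests a simple prompting utility like the sketch below. The probability bins and phrasings are invented for illustration, not the authors' exact scheme.

```python
# Hedged sketch: convert a numeric win probability into a linguistic
# description before prompting an LLM tagger (bins are illustrative).
def describe_win_probability(p: float) -> str:
    if p >= 0.9: return "almost certain to win"
    if p >= 0.7: return "heavily favored"
    if p >= 0.55: return "slightly ahead"
    if p > 0.45: return "in a toss-up game"
    if p > 0.3: return "slightly behind"
    if p > 0.1: return "a heavy underdog"
    return "almost certain to lose"

prompt = (f"The commenter's team is {describe_win_probability(0.82)}. "
          "Tag each referring expression in the comment as in-group or out-group.")
```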
Nov 14 (Thu) 14:00-15:30 - Jasmine
Towards Effective Counter-Responses: Aligning Human Preferences with Strategies to Combat Online Trolling
Huije Lee, Hoyun Song, Jisu Shin, Sukmin Cho, SeungYoon Han, Jong C. Park
Trolling in online communities typically involves disruptive behaviors such as provoking anger and manipulating discussions, leading to a
polarized atmosphere and emotional distress. Robust moderation is essential for mitigating these negative impacts and maintaining a healthy
and constructive community atmosphere. However, effectively addressing trolls is difficult because their behaviors vary widely and require
different response strategies (RSs) to counter them. This diversity makes it challenging to choose an appropriate RS for each specific sit-
uation. To address this challenge, our research investigates whether humans have preferred strategies tailored to different types of trolling behaviors. Our findings reveal a correlation between the types of trolling encountered and the preferred RS. In this paper, we introduce a
methodology for generating counter-responses to trolls by recommending appropriate RSs, supported by a dataset aligning these strategies
with human preferences across various troll contexts. The experimental results demonstrate that our proposed approach guides constructive
discussion and reduces the negative effects of trolls, thereby enhancing the online community environment.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Communicate to Play: Pragmatic Reasoning for Efficient Cross-Cultural Communication
Isadora White, Sashrika Pandey, Michelle Pan
In this paper, we study how culture leads to differences in common ground and how this influences communication. Cultural differences in common ground during communication may result in pragmatic failure and misunderstandings. We develop our method
Rational Speech Acts for Cross-Cultural Communication (RSA+C3) to resolve cross-cultural differences in common ground. To measure the
success of our method, we study RSA+C3 in the collaborative referential game of Codenames Duet and show that our method successfully
improves collaboration between simulated players of different cultures. Our contributions are threefold: (1) creating Codenames players using
contrastive learning of an embedding space and LLM prompting that are aligned with human patterns of play, (2) studying culturally induced
differences in common ground reflected in our trained models, and (3) demonstrating that our method RSA+C3 can ease cross-cultural com-
munication in gameplay by inferring sociocultural context from interaction.
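For readers unfamiliar with RSA, the sketch below computes the standard literal-listener, pragmatic-speaker, pragmatic-listener chain that RSA+C3 builds on. The lexicon, prior, and rationality parameter are toy values, and the paper's C3 extensions (inferring sociocultural context from interaction) are not shown.

```python
# Hedged sketch: vanilla Rational Speech Acts (RSA) reasoning.
import numpy as np

lexicon = np.array([[1.0, 1.0, 0.0],   # utterance 0 is true of meanings 0,1
                    [0.0, 1.0, 1.0]])  # utterance 1 is true of meanings 1,2
prior = np.array([1/3, 1/3, 1/3])      # listener's prior over meanings
alpha = 1.0                            # speaker rationality

# Literal listener: L0(m|u) proportional to lexicon[u, m] * prior(m)
L0 = lexicon * prior
L0 = L0 / L0.sum(axis=1, keepdims=True)

# Pragmatic speaker: S1(u|m) proportional to exp(alpha * log L0(m|u))
with np.errstate(divide="ignore"):
    S1 = np.exp(alpha * np.log(L0))
S1 = S1 / S1.sum(axis=0, keepdims=True)  # normalize over utterances per meaning

# Pragmatic listener: L1(m|u) proportional to S1(u|m) * prior(m)
L1 = S1 * prior
L1 = L1 / L1.sum(axis=1, keepdims=True)
print(L1)  # pragmatic reasoning sharpens the interpretation of each utterance
```

In RSA+C3, differences in common ground would amount to players holding different lexicons or priors; the pragmatic layers are what let a player adapt to a partner's inferred background.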
Nov 14 (Thu) 14:00-15:30 - Jasmine
Improving Quotation Attribution with Fictional Character Embeddings
Gaspard Michel, Elena V. Epure, Romain Hennequin, Christophe Cerisara
Humans naturally attribute utterances of direct speech to their speaker in literary works. When attributing quotes, we process contextual information but also access mental representations of characters that we build and revise throughout the narrative. Recent methods to automatically attribute such utterances have explored simulating human logic with deterministic rules or learning new implicit rules with neural networks when processing contextual information. However, these systems inherently lack character representations, which often leads to errors in more challenging examples of attribution: anaphoric and implicit quotes. In this work, we propose to augment a popular quotation attribution system, BookNLP, with character embeddings that encode global stylistic information of characters derived from an off-the-shelf stylometric model, Universal Authorship Representation (UAR). We create DramaCV, a corpus of English drama plays from the 15th to 20th century that we automatically annotate for Authorship Verification of fictional characters' utterances, and release two versions of UAR trained on DramaCV that are tailored for literary character analysis. Then, through an extensive evaluation on 28 novels, we show that combining BookNLP's contextual information with our proposed global character embeddings improves the identification of speakers for anaphoric and implicit quotes, reaching state-of-the-art performance. Code and data can be found at https://github.com/deezer/character_embeddings_qa.
Nov 14 (Thu) 14:00-15:30 - Jasmine
The Language of Trauma: Modeling Traumatic Event Descriptions Across Domains with Explainable AI
Miriam Schirmer, Tobias Leemann, Gjergji Kasneci, Jürgen Pfeffer, David Jurgens
Psychological trauma can manifest following various distressing events and is captured in diverse online contexts. However, studies tradi-
tionally focus on a single aspect of trauma, often neglecting the transferability of findings across different scenarios. We address this gap
by training various language models with progressing complexity on trauma-related datasets, including genocide-related court data, a Red-
dit dataset on post-traumatic stress disorder (PTSD), counseling conversations, and Incel forum posts. Our results show that the fine-tuned
RoBERTa model excels in predicting traumatic events across domains, slightly outperforming large language models like GPT-4. Addition-
ally, SLALOM-feature scores and conceptual explanations effectively differentiate and cluster trauma-related language, highlighting different
trauma aspects and identifying sexual abuse and experiences related to death as a common traumatic event across all datasets. This transfer-
ability is crucial as it allows for the development of tools to enhance trauma detection and intervention in diverse populations and settings.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Beyond Demographics: Aligning Role-playing LLM-based Agents Using Human Belief Networks
Yun-Shiuan Chuang, Zach Studdiford, Krirk Nirunwiroj, Agam Goyal, Vincent V. Frigo, Sijia Yang, Dhavan V. Shah, Junjie Hu, Timothy T.
Rogers
Creating human-like large language model (LLM) agents is crucial for faithful social simulation. Having LLMs role-play based on de-
mographic information sometimes improves human likeness but often does not. This study assessed whether LLM alignment with human
behavior can be improved by integrating information from empirically-derived human belief networks. Using data from a human survey,
we estimated a belief network encompassing 64 topics loading on nine non-overlapping latent factors. We then seeded LLM-based agents
with an opinion on one topic, and assessed the alignment of its expressed opinions on remaining test topics with corresponding human data.
Role-playing based on demographic information alone did not align LLM and human opinions, but seeding the agent with a single belief
greatly improved alignment for topics related in the belief network, but not for topics outside the network. These results suggest a novel path
for human-LLM belief alignment in work seeking to simulate and understand patterns of belief distributions in society.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Large Language Models for Propaganda Span Annotation
Maram Hasanain, Fatema Ahmad, Firoj Alam
The use of propagandistic techniques in online content has increased in recent years, aiming to manipulate online audiences. Fine-grained propaganda detection and the extraction of textual spans where propaganda techniques are used are essential for more informed content consumption. Automatic systems targeting the task in lower-resourced languages are limited, usually obstructed by a lack of large-scale training
datasets. Our study investigates whether Large Language Models (LLMs), such as GPT-4, can effectively extract propagandistic spans. We
further study the potential of employing the model to collect more cost-effective annotations. Finally, we examine the effectiveness of labels
provided by GPT-4 in training smaller language models for the task. The experiments are performed over a large-scale in-house manually
annotated dataset. The results suggest that providing more annotation context to GPT-4 within prompts improves its performance compared to
human annotators. Moreover, when serving as an expert annotator (consolidator), the model provides labels that have higher agreement with
expert annotators and lead to specialized models that achieve state-of-the-art performance on an unseen Arabic test set. Finally, our work is the first to show the potential of utilizing LLMs to develop annotated datasets for the propagandistic span detection task by prompting them with annotations from human annotators with limited expertise. All scripts and annotations will be shared with the community.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Can Machines Resonate with Humans? Evaluating the Emotional and Empathic Comprehension of LMs
Muhammad Arslan Manzoor, Yuxia Wang, Minghan Wang, Preslav Nakov
Empathy plays a pivotal role in fostering prosocial behavior, often triggered by the sharing of personal experiences through narratives.
However, modeling empathy using NLP approaches remains challenging due to its deep interconnection with human interaction dynamics.
Previous approaches, which involve fine-tuning language models (LMs) on human-annotated empathic datasets, have had limited success.
In our pursuit of improving empathy understanding in LMs, we propose several strategies, including contrastive learning with masked LMs
and supervised fine-tuning with large language models. While these methods show improvements over previous methods, the overall results
remain unsatisfactory. To better understand this trend, we performed an analysis which reveals a low agreement among annotators. This lack
of consensus hinders training and highlights the subjective nature of the task. We also explore the cultural impact on annotations. To study
this, we meticulously collected story pairs in Urdu and found that subjectivity in interpreting empathy among annotators appears to be independent of cultural background. Our systematic exploration of LMs' understanding of empathy reveals substantial opportunities for
further investigation in both task formulation and modeling.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Are Large Language Models Consistent over Value-laden Questions?
Jared Moore, Tanvi Deshpande, Diyi Yang
Large language models (LLMs) appear to bias their survey answers toward certain values. Nonetheless, some argue that LLMs are too incon-
sistent to simulate particular values. Are they? To answer, we first define value consistency as the similarity of answers across 1) paraphrases
of one question, 2) related questions under one topic, 3) multiple-choice and open-ended use-cases of one question, and 4) multilingual trans-
lations of a question to English, Chinese, German, and Japanese. We apply these measures to a few large, open LLMs including llama-3,
as well as gpt-4o, using eight thousand questions spanning more than 300 topics. Unlike prior work, we find that models are relatively
consistent across paraphrases, use-cases, translations, and within a topic. Still, some inconsistencies remain. Models are more consistent on
uncontroversial topics (e.g., in the U.S., "Thanksgiving") than on controversial ones (e.g. "euthanasia"). Base models are both more consistent
compared to fine-tuned models and are uniform in their consistency across topics, while fine-tuned models, like our human participants, are more inconsistent about some topics (e.g., "euthanasia") than others (e.g., "women's rights").
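One plausible way to instantiate the paraphrase-level consistency measure defined above is the mean pairwise similarity of answer embeddings, as sketched below. The embedding model and the aggregation are assumptions, not necessarily the paper's exact metric.

```python
# Hedged sketch: value consistency as mean pairwise cosine similarity of a
# model's answers across paraphrases of one question.
import numpy as np
from itertools import combinations
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

def value_consistency(answers: list[str]) -> float:
    embs = embedder.encode(answers, normalize_embeddings=True)
    sims = [float(np.dot(embs[i], embs[j]))
            for i, j in combinations(range(len(answers)), 2)]
    return float(np.mean(sims))

paraphrase_answers = [
    "I support the right to euthanasia under strict safeguards.",
    "Assisted dying should be legal with careful regulation.",
    "Euthanasia is acceptable only in tightly controlled cases.",
]
print(value_consistency(paraphrase_answers))  # closer to 1.0 = more consistent
```

The same function applies unchanged to answers gathered across related questions, use-cases, or translations, matching the four facets of the definition.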
Nov 14 (Thu) 14:00-15:30 - Jasmine
Extrinsic Evaluation of Cultural Competence in Large Language Models
Shaily Bhatt, Fernando Diaz
Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive.
Prior works have evaluated models’ knowledge of cultural norms, values, and artefacts, without considering how this knowledge manifests
in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended
question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifi-
cally nationality, is perturbed in the prompts. Although we find that model outputs do vary when varying nationalities and feature culturally
relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these coun-
tries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.
Nov 14 (Thu) 14:00-15:30 - Jasmine
SocialGaze: Improving the Integration of Human Social Norms in Large Language Models
Anvesh Rao Vijjini, Rakesh R Menon, Shashank Srivastava, Snigdha Chaturvedi
While much research has explored enhancing the reasoning capabilities of large language models (LLMs) in the last few years, there is a
gap in understanding the alignment of these models with social values and norms. We introduce the task of judging social acceptance. So-
cial acceptance requires models to judge and rationalize the acceptability of people’s actions in social situations. For example, is it socially
acceptable for a neighbor to ask others in the community to keep their pets indoors at night? We find that LLMs’ understanding of social
acceptance is often misaligned with human consensus. To alleviate this, we introduce SocialGaze, a multi-step prompting framework, in
which a language model verbalizes a social situation from multiple perspectives before forming a judgment. Our experiments demonstrate
that the SocialGaze approach improves the alignment with human judgments by up to 11 F1 points with the GPT-3.5 model. We also identify
biases and correlations in how LLMs assign blame, related to features such as gender (males are significantly more likely to be judged unfairly) and age (LLMs are more aligned with humans for older narrators).
Nov 14 (Thu) 14:00-15:30 - Jasmine
ValueScope: Unveiling Implicit Norms and Values via Return Potential Model of Social Interactions
Chan Young Park, Shuyue Stella Li, Hayoung Jung, Svitlana Volkova, Tanu Mitra, David Jurgens, Yulia Tsvetkov
This study introduces ValueScope, a framework leveraging language models to quantify social norms and values within online communities,
grounded in social science perspectives on normative structures. We employ ValueScope to dissect and analyze linguistic and stylistic expres-
sions across 13 Reddit communities categorized under gender, politics, science, and finance. Our analysis provides a quantitative foundation
confirming that even closely related communities exhibit remarkably diverse norms. This diversity supports existing theories and adds a new
dimension to understanding community interactions. ValueScope not only delineates differences in social norms but also effectively tracks
their evolution and the influence of significant external events like the U.S. presidential elections and the emergence of new sub-communities.
The framework thus highlights the pivotal role of social norms in shaping online interactions, presenting a substantial advance in both the
theory and application of social norm studies in digital spaces.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Revealing Fine-Grained Values and Opinions in Large Language Models
Dustin Wright, Arnav Arora, Nadav Borenstein, Srishti Yadav, Serge Belongie, Isabelle Augenstein
Uncovering latent values and opinions embedded in large language models (LLMs) can help identify biases and mitigate potential harm.
Recently, this has been approached by prompting LLMs with survey questions and quantifying the stances in the outputs towards morally and
politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are
many ways to argue for or against a given position. In this work, we propose to address this by analysing a large and robust dataset of 156k
LLM responses to the 62 propositions of the Political Compass Test (PCT) generated by 6 LLMs using 420 prompt variations. We perform
coarse-grained analysis of their generated stances and fine-grained analysis of the plain text justifications for those stances. For fine-grained
analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts,
revealing natural patterns in the text that a given LLM is prone to produce. We find that demographic features added to prompts significantly
affect outcomes on the PCT, reflecting bias, as well as disparities between the results of tests when eliciting closed-form vs. open domain
responses. Additionally, patterns in the plain text rationales via tropes show that similar justifications are repeatedly generated across models
and prompts even with disparate stances.
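A rough sketch of the trope-mining idea: embed justification phrases gathered under many prompt variations and cluster them, treating recurrent clusters as trope candidates. The clustering algorithm, threshold, and toy phrases below are illustrative assumptions, not the authors' pipeline.

```python
# Hedged sketch: surface "tropes" as clusters of semantically similar
# justification phrases that recur across prompts.
from collections import Counter
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

justifications = [
    "Taxes fund essential public services.",
    "Public services depend on tax revenue.",
    "Individual freedom should come first.",
    "Personal liberty must take priority.",
]  # toy stand-ins for phrases mined from the 156k responses

embs = SentenceTransformer("all-MiniLM-L6-v2").encode(
    justifications, normalize_embeddings=True)
labels = AgglomerativeClustering(n_clusters=None, distance_threshold=0.8,
                                 metric="cosine",
                                 linkage="average").fit_predict(embs)
# Clusters that recur (here: size >= 2) are trope candidates.
for cluster, size in Counter(labels).items():
    if size >= 2:
        print([j for j, l in zip(justifications, labels) if l == cluster])
```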
Nov 14 (Thu) 14:00-15:30 - Jasmine
Automated Tone Transcription and Clustering with Tone2Vec
Yi Yang, Yiming Wang, ZhiQiang Tang, Jiahong Yuan
Lexical tones play a crucial role in Sino-Tibetan languages. However, current phonetic fieldwork relies on manual effort, resulting in sub-
stantial time and financial costs. This is especially challenging for the numerous endangered languages that are rapidly disappearing, often
compounded by limited funding. In this paper, we introduce pitch-based similarity representations for tone transcription, named Tone2Vec.
Experiments on dialect clustering and variance show that Tone2Vec effectively captures fine-grained tone variation. Utilizing Tone2Vec, we
develop the first automatic approach for tone transcription and clustering by presenting a novel representation transformation for transcrip-
tions. Additionally, these algorithms are systematically integrated into an open-sourced and easy-to-use package, ToneLab, which facilitates
automated fieldwork and cross-regional, cross-lexical analysis for tonal languages. Extensive experiments were conducted to demonstrate the
effectiveness of our methods.
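In the spirit of Tone2Vec's pitch-based similarity, the sketch below reads Chao tone letters (e.g. "214") as piecewise-linear pitch curves, resamples them to a fixed length, and compares them by mean absolute difference. The exact representation in the paper may differ.

```python
# Hedged sketch: pitch-curve similarity between tone transcriptions.
import numpy as np

def tone_curve(chao: str, n_points: int = 50) -> np.ndarray:
    levels = np.array([int(c) for c in chao], dtype=float)  # pitch levels 1..5
    if len(levels) == 1:
        return np.full(n_points, levels[0])  # level tone
    x_old = np.linspace(0.0, 1.0, len(levels))
    x_new = np.linspace(0.0, 1.0, n_points)
    return np.interp(x_new, x_old, levels)   # piecewise-linear contour

def tone_distance(a: str, b: str) -> float:
    return float(np.mean(np.abs(tone_curve(a) - tone_curve(b))))

print(tone_distance("214", "213"))  # similar dipping tones -> small distance
print(tone_distance("51", "35"))    # falling vs. rising -> large distance
```

Pairwise distances of this kind can feed directly into the clustering of dialects or transcriptions the abstract describes.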
demonstrates that D2F yields superior qualitative and quantitative results across diverse domains.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
What are the Generator Preferences for End-to-end Task-Oriented Dialog System?
Wanshi Xu, Xianwei Zhuang, Zhanpeng Chen, Zhihong Zhu, Xuxin Cheng, Yuexian Zou
Fully end-to-end task-oriented dialogue (EToD) systems have shown excellent performance, which requires the ability to retrieve entities
accurately for generation. Existing methods improve the accuracy of entity retrieval and construct data flows between retrieval results and the response generator, achieving promising results. However, most of them suffer from the following issues: (1) The entity is retrieved by directly
interacting with the context at a coarse-grained level, so the similarity score may be disturbed by irrelevant attributes; (2) The generator pays
equal attention to retrieved entities and the context and does not learn the generation preferences for the current turn. In this paper, we propose
a framework called Regulating Preferences of Generator (RPG) based on retrieval results, which includes a generator preference extractor, an
entity retriever, and a generator with a gate-controlled preference regulator. The generator preference extractor not only improves the entity
retriever by filtering the interference of irrelevant attributes but also provides more focused guidance to the generator by performing inter-turn
attribute prediction. Experiments and analyses on three standard benchmarks show that our framework outperforms existing methods and
improves the quality of the dialogue.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
MediTOD: An English Dialogue Dataset for Medical History Taking with Comprehensive Annotations
Vishal Vivek Saley, Goonjan Saha, Rocktim Jyoti Das, Dinesh Raghu, Mausam.
Medical task-oriented dialogue systems can assist doctors by collecting patient medical history, aiding in diagnosis, or guiding treatment
selection, thereby reducing doctor burnout and expanding access to medical services. However, doctor-patient dialogue datasets are not
readily available, primarily due to privacy regulations. Moreover, existing datasets lack comprehensive annotations involving medical slots
and their different attributes, such as symptoms and their onset, progression, and severity. These comprehensive annotations are crucial for
accurate diagnosis. Finally, most existing datasets are non-English, limiting their utility for the larger research community. In response, we
introduce MediTOD, a new dataset of doctor-patient dialogues in English for the medical history-taking task. Collaborating with doctors, we
devise a questionnaire-based labeling scheme tailored to the medical domain. Then, medical professionals create the dataset with high-quality
comprehensive annotations, capturing medical slots and their attributes. We establish benchmarks in supervised and few-shot settings on
MediTOD for natural language understanding, policy learning, and natural language generation subtasks, evaluating models from both TOD
and biomedical domains. We make MediTOD publicly available for future research.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
MP2D: An Automated Topic Shift Dialogue Generation Framework Leveraging Knowledge Graphs
Yerin Hwang, Yongil Kim, Yunah Jang, Jeesoo Bang, Hyunkyung Bae, Kyomin Jung
Despite advancements in on-topic dialogue systems, effectively managing topic shifts within dialogues remains a persistent challenge, largely
attributed to the limited availability of training datasets. To address this issue, we propose Multi-Passage to Dialogue (MP2D), a data gen-
eration framework that automatically creates conversational question-answering datasets with natural topic transitions. By leveraging the
relationships between entities in a knowledge graph, MP2D maps the flow of topics within a dialogue, effectively mirroring the dynamics of
human conversation. It retrieves relevant passages corresponding to the topics and transforms them into dialogues through the passage-to-
dialogue method. Through quantitative and qualitative experiments, we demonstrate MP2D's efficacy in generating dialogue with natural topic
shifts. Furthermore, this study introduces a novel benchmark for topic shift dialogues, TS-WikiDialog. Utilizing the dataset, we demonstrate
that even Large Language Models (LLMs) struggle to handle topic shifts in dialogue effectively, and we showcase the performance improve-
ments of models trained on datasets generated by MP2D across diverse topic shift dialogue tasks.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models
Chani Jung, Dongkwan Kim, Jiho Jin, Jiseon Kim, Yeon Seonwoo, Yejin Choi, Alice Oh, Hyunwoo Kim
While humans naturally develop theory of mind (ToM), the capability to understand other people’s mental states and beliefs, state-of-the-art
large language models (LLMs) underperform on simple ToM benchmarks. We posit that we can extend our understanding of LLMs’ ToM
abilities by evaluating key human ToM precursors, perception inference and perception-to-belief inference, in LLMs. We introduce two datasets, Percept-ToMi and Percept-FANToM, to evaluate these precursory inferences for ToM in LLMs by annotating characters' perceptions on ToMi and FANToM, respectively. Our evaluation of eight state-of-the-art LLMs reveals that the models generally perform well in perception inference while exhibiting limited capability in perception-to-belief inference (e.g., lack of inhibitory control). Based on these results, we present PercepToM, a novel ToM method leveraging LLMs' strong perception inference capability while supplementing their limited perception-to-belief inference. Experimental results demonstrate that PercepToM significantly enhances LLM performance, especially in
false belief scenarios.
Hate speech (HS) is a widely acknowledged societal problem with potentially grave effects on vulnerable individuals and minority groups.
Developing counter-narratives (CNs) that confront biases and stereotypes driving hateful narratives is considered an impactful strategy. Cur-
rent automatic methods focus on isolated utterances to detect and react to hateful content online, often omitting the conversational context
where HS naturally occurs. In this work, we explore strategies for the incorporation of conversational history for CN generation, comparing
text and graphical representations with varying degrees of context. Overall, automatic and human evaluations show that 1) contextualized
representations are comparable to those of isolated utterances, and 2) models based on graph representations outperform text representations,
thus opening new research directions for future work.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Rewarding What Matters: Step-by-Step Reinforcement Learning for Task-Oriented Dialogue
Huifang Du, Shuqin Li, Minghao Wu, Xuejing Feng, Yuan-Fang Li, Haofen Wang
Reinforcement learning (RL) is a powerful approach to enhance task-oriented dialogue (TOD) systems. However, existing RL methods
tend to mainly focus on generation tasks, such as dialogue policy learning (DPL) or response generation (RG), while neglecting dialogue
state tracking (DST) for understanding. This narrow focus prevents the systems from achieving globally optimal performance by overlooking the
interdependence between understanding and generation. Additionally, RL methods face challenges with sparse and delayed rewards, which
complicates training and optimization. To address these issues, we extend RL into both understanding and generation tasks by introducing
step-by-step rewards throughout the token generation. The understanding reward increases as more slots are correctly filled in DST, while
the generation reward grows with the accurate inclusion of user requests. Our approach provides a balanced optimization aligned with task
completion. Experimental results demonstrate that our approach effectively enhances the performance of TOD systems and achieves new
state-of-the-art results on three widely used datasets, including MultiWOZ2.0, MultiWOZ2.1, and In-Car. Our approach also shows superior
few-shot ability in low-resource settings compared to current models.
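The step-by-step reward design can be pictured as two simple progress signals, sketched below. The matching logic and normalization are illustrative assumptions, not the authors' exact reward functions.

```python
# Hedged sketch: step-wise rewards for TOD RL. The understanding reward
# grows as gold slots are correctly filled (DST); the generation reward
# grows as requested attributes appear in the response.
def understanding_reward(pred_slots: dict, gold_slots: dict) -> float:
    if not gold_slots:
        return 0.0
    correct = sum(1 for k, v in gold_slots.items() if pred_slots.get(k) == v)
    return correct / len(gold_slots)

def generation_reward(response: str, requested: list[str],
                      kb_entity: dict) -> float:
    if not requested:
        return 0.0
    included = sum(1 for attr in requested
                   if str(kb_entity.get(attr, "")).lower() in response.lower())
    return included / len(requested)

gold = {"area": "centre", "food": "italian"}
pred = {"area": "centre", "food": "chinese"}
print(understanding_reward(pred, gold))  # 0.5: one of two slots correct
print(generation_reward("Zizzi is at 47 Regent Street.", ["address"],
                        {"address": "47 Regent Street"}))  # 1.0
```

Emitting such partial rewards during token generation, rather than a single delayed reward at dialogue end, is what densifies the training signal.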
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Devil’s Advocate: Anticipatory Reflection for LLM Agents
Haoyu Wang, Tao Li, Zhiwei Deng, Dan Roth, Yang Li
In this work, we introduce a novel approach that equips LLM agents with introspection, enhancing consistency and adaptability in solving
complex tasks. Our approach prompts LLM agents to decompose a given task into manageable subtasks (i.e., to make a plan), and to con-
tinuously introspect upon the suitability and results of their actions. We implement a three-fold introspective intervention: 1) anticipatory
reflection on potential failures and alternative remedy before action execution, 2) post-action alignment with subtask objectives and backtrack-
ing with remedy to ensure utmost effort in plan execution, and 3) comprehensive review upon plan completion for future strategy refinement.
By deploying and experimenting with this methodology—a zero-shot approach—within WebArena for practical tasks in web environments,
our agent achieves a success rate of 23.5%, exceeding existing zero-shot methods by 3.5%. The experimental results suggest that our introspection-driven approach not only enhances the agent's ability to navigate unanticipated challenges through a robust mechanism of plan execution, but also improves efficiency, reducing the number of trials and plan revisions needed to achieve a task by 45%.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging
Priyanka Kargupta, Ishika Agarwal, Dilek Hakkani Tur, Jiawei Han
Socratic questioning is an effective teaching strategy, encouraging critical thinking and problem-solving. The conversational capabilities of
large language models (LLMs) show great potential for providing scalable, real-time student guidance. However, current LLMs often give
away solutions directly, making them ineffective instructors. We tackle this issue in the code debugging domain with TreeInstruct, an In-
structor agent guided by a novel state space-based planning algorithm. TreeInstruct asks probing questions to help students independently
identify and resolve errors. It estimates a student’s conceptual and syntactical knowledge to dynamically construct a question tree based on
their responses and current knowledge state, effectively addressing both independent and dependent mistakes concurrently in a multi-turn
interaction setting. In addition to using an existing single-bug debugging benchmark, we construct a more challenging multi-bug dataset of
150 coding problems, incorrect solutions, and bug fixes, all carefully constructed and annotated by experts. Extensive evaluation shows Tree-
Instruct’s state-of-the-art performance on both datasets, proving it to be a more effective instructor than baselines. Furthermore, a real-world
case study with five students of varying skill levels further demonstrates TreeInstruct’s ability to guide students to debug their code efficiently
with minimal turns and highly Socratic questioning.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Ask-before-Plan: Proactive Language Agents for Real-World Planning
Xuan Zhang, Yang Deng, Zifeng Ren, See-Kiong Ng, Tat-Seng Chua
The evolution of large language models (LLMs) has enhanced the planning capabilities of language agents in diverse real-world scenarios.
Despite these advancements, the potential of LLM-powered agents to comprehend ambiguous user instructions for reasoning and decision-
making is still under exploration. In this work, we introduce a new task, Proactive Agent Planning, which requires language agents to predict
clarification needs based on user-agent conversation and agent-environment interaction, invoke external tools to collect valid information, and
generate a plan to fulfill the user’s demands. To study this practical problem, we establish a new benchmark dataset, Ask-before-Plan. To
tackle the deficiency of LLMs in proactive planning, we propose a novel multi-agent framework, Clarification-Execution-Planning (CEP),
which consists of three agents specialized in clarification, execution, and planning. We introduce the trajectory tuning scheme for the clarifi-
cation agent and static execution agent, as well as the memory recollection mechanism for the dynamic execution agent. Extensive evaluations
and comprehensive analyses conducted on the Ask-before-Plan dataset validate the effectiveness of our proposed framework.
Industry
Nov 14 (Thu) 14:00-15:30 - Room: Jasmine
Materials science is an interdisciplinary field focused on studying and discovering materials around us. However, due to the vast space of
materials, datasets in this field are typically scarce and have limited coverage. This inherent limitation makes current adaptation methods
less effective when adapting pre-trained language models (PLMs) to materials science, as these methods rely heavily on the frequency infor-
mation from limited downstream datasets. In this paper, we propose Semantic Knowledge Transfer (SEED), a novel vocabulary expansion
method to adapt the pre-trained language models for materials science. The core strategy of SEED is to transfer the materials knowledge
of lightweight embeddings into the PLMs. To this end, we introduce knowledge bridge networks, which learn to transfer the latent knowl-
edge of the materials embeddings into ones compatible with PLMs. By expanding the embedding layer of PLMs with these transformed
embeddings, PLMs can comprehensively understand the complex terminology associated with materials science. We conduct extensive
experiments across a broad range of materials-related benchmarks. Comprehensive evaluation results convincingly demonstrate that SEED
mitigates the mentioned limitations of previous adaptation methods, showcasing the efficacy of transferring embedding knowledge into PLMs.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
TensorOpera Router: A Multi-Model Router for Efficient LLM Inference
Dimitris Stripelis, Zhaozhuo Xu, Zijian Hu, Alay Dilipbhai Shah, Han Jin, Yuhang Yao, Jipeng Zhang, Tong Zhang, Salman Avestimehr,
Chaoyang He
With the rapid growth of Large Language Models (LLMs) across various domains, numerous new LLMs have emerged, each possessing
domain-specific expertise. This proliferation has highlighted the need for quick, high-quality, and cost-effective LLM query response meth-
ods. Yet, no single LLM exists to efficiently balance this trilemma. Some models are powerful but extremely costly, while others are fast
and inexpensive but qualitatively inferior. To address this challenge, we present PolyRouter, a non-monolithic LLM querying system that
seamlessly integrates various LLM experts into a single query interface and dynamically routes incoming queries to the most performant expert based on the query's requirements. Through extensive experiments, we demonstrate that when compared to standalone expert models,
PolyRouter improves query efficiency by up to 40%, and leads to significant cost reductions of up to 30%, while maintaining or enhancing
model performance by up to 10%.
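Conceptually, a multi-model router trades off predicted quality, cost, and latency per query; the sketch below shows a hand-weighted version of that decision rule. PolyRouter's actual routing policy is learned, and all scores and weights here are toy values.

```python
# Hedged sketch: score candidate experts on quality/cost/latency and route
# each query to the best trade-off.
from dataclasses import dataclass

@dataclass
class Expert:
    name: str
    quality: dict   # predicted quality per domain, 0..1 (illustrative)
    cost: float     # $ per 1k tokens (illustrative)
    latency: float  # seconds per request (illustrative)

def route(domain: str, experts: list[Expert],
          w_quality=1.0, w_cost=0.3, w_latency=0.1) -> Expert:
    return max(experts,
               key=lambda e: w_quality * e.quality.get(domain, 0.0)
                             - w_cost * e.cost - w_latency * e.latency)

experts = [Expert("big-generalist", {"code": 0.9, "legal": 0.85}, 0.06, 4.0),
           Expert("small-coder", {"code": 0.85, "legal": 0.3}, 0.01, 0.8)]
print(route("code", experts).name)   # small-coder wins on the trade-off
print(route("legal", experts).name)  # big-generalist wins on quality
```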
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Personal Large Language Model Agents: A Case Study on Tailored Travel Planning
Harmanpreet Singh, Nikhil Verma, Yixiao Wang, Manasa Bharadwaj, Homa Fashandi, Kevin Ferreira, Chul Lee
Large Language Models (LLMs) have made significant progress, becoming more autonomous and capable of handling real-world tasks
through their access to tools, various planning strategies, and memory, referred to as LLM agents. One emerging area of focus is customizing
these models to cater to individual user preferences, thereby shaping them into personal LLM agents. This work investigates how the user
model, which encapsulates user-related information, preferences, and personal concepts, influences an LLM agent’s planning and reasoning
capabilities. We introduce a personalized version of TravelPlanner, called TravelPlanner+, and establish baselines for personal LLM agents.
Our evaluation strategy contains an LLM-as-a-Judge component, which provides further in-depth insights into the decision-making process
of a personal LLM agent by comparing generic and personal plans. Our findings reveal that while generic plans perform robustly, personal
plans show marked improvement in relevance and suitability, with preference rates up to 74.4% on validation and 87.3% on the test set. These
results highlight the potential of personal LLM agents to significantly enhance user satisfaction.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
TelBench: A Benchmark for Evaluating Telco-Specific Large Language Models
Sunwoo Lee, Dhammiko Arya, Seung-Mo Cho, Gyoung-Eun Han, Seokyoung Hong, Wonbeom Jang, Seojin Lee, Sohee Park, Sereimony Sek,
Injee Song, Sungbin Yoon, Eric Davis
The telecommunications industry, characterized by its vast customer base and complex service offerings, necessitates a high level of domain
expertise and proficiency in customer service center operations. Consequently, there is a growing demand for Large Language Models (LLMs)
to augment the capabilities of customer service representatives. This paper introduces a methodology for developing a specialized Telecom-
munications LLM (Telco LLM) designed to enhance the efficiency of customer service agents and promote consistency in service quality
across representatives. We present the construction process of TelBench, a novel dataset created for performance evaluation of customer
service expertise in the telecommunications domain. We also evaluate various LLMs and demonstrate the ability to benchmark both pro-
prietary and open-source LLMs on predefined telecommunications-related tasks, thereby establishing metrics that define telecommunications
performance.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
SMARTCAL: An Approach to Self-Aware Tool-Use Evaluation and Calibration
Yuanhao Shen, Xiaodan Zhu, Lei Chen
The tool-use ability of Large Language Models (LLMs) has a profound impact on a wide range of applications. However, LLMs’ self-
awareness and self-control in using tools appropriately remain understudied. The problem is consequential, as it signals a potential risk of degraded performance and threatens the trustworthiness of the models. In this paper, we conduct a study on a family of state-of-
the-art LLMs on three datasets with two mainstream tool-use frameworks. Our study reveals the tool-abuse behavior of LLMs, a tendency
for models to misuse tools along with models’ frequent overconfidence in tool choice. We also find that this is a common issue regardless of
model capability. Accordingly, we propose a novel framework, SMARTCAL, to mitigate the observed issues, and our results show an average 8.6% increase in QA performance across three test datasets and a 21.6% lower Expected Calibration Error (ECE) than existing methods.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Adapting LLMs for Structured Natural Language API Integration
Robin Chan, Katsiaryna Mirylenka, Thomas Gschwind, Christoph Miksovic, Paolo Scotton, Enrico Toniato, Abdel Labbi
Integrating APIs is crucial for enterprise systems, enabling seamless application interaction within workflows. However, the vast and diverse
API landscape makes combining calls based on user intent a significant challenge. Existing methods rely on Named Entity Recognition (NER)
and knowledge graphs, but struggle with control flow structures like conditionals and loops. We propose a novel framework that leverages
the success of Large Language Models (LLMs) in code generation for natural language API integration. Our approach involves fine-tuning
an LLM on automatically generated API flows derived from services’ OpenAPI specifications. This aims to surpass NER-based methods and
compare the effectiveness of different tuning strategies. Specifically, we investigate the impact of enforcing syntax through constrained gen-
eration or retrieval-augmented generation. To facilitate systematic comparison, we introduce targeted test suites that assess the generalization
capabilities and ability of these approaches to retain structured knowledge. We expect to observe that fine-tuned LLMs can: (a) learn structural
constraints implicitly during training, and (b) achieve significant improvements in both in-distribution and out-of-distribution performance.
variety of state-of-the-art large language models (LLMs) on mock CFA exams to provide an overview of their financial analysis capabilities
using the same evaluation standards applied for human professionals. We benchmark five leading proprietary models and eight open-source
models on all three levels of the CFA through challenging multiple-choice and essay questions. We find that flagship proprietary models
perform relatively well and can solidly pass levels I and II exams, but fail at level III due to essay questions. Open-source models generally
fall short of estimated passing scores, but still show strong performance considering their size, cost, and availability advantages. We also find
that using textbook data helps bridge the gap between open-source and proprietary models to a certain extent, despite reduced gains in CFA
levels II and III. By understanding the current financial analysis abilities of LLMs, we aim to guide practitioners on which models are best
suited for enhancing automation in the financial industry.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Athena: Safe Autonomous Agents with Verbal Contrastive Learning
Tanmana Sadhu, Ali Pesaranghader, Yanan Chen, Dong Hoon Yi
Due to emergent capabilities, large language models (LLMs) have been utilized as language-based agents to perform a variety of tasks and
make decisions with an increasing degree of autonomy. These autonomous agents can understand high-level instructions, interact with their
environments, and execute complex tasks using a selection of tools available to them. As the capabilities of the agents expand, ensuring their
safety and trustworthiness becomes more imperative. In this study, we introduce the Athena framework which leverages the concept of verbal
contrastive learning where past safe and unsafe trajectories are used as in-context (contrastive) examples to guide the agent towards safety
while fulfilling a given task. The framework also incorporates a critiquing mechanism to guide the agent to prevent risky actions at every step.
Furthermore, due to the lack of existing benchmarks on the safety reasoning ability of LLM-based agents, we curate a set of 80 toolkits across
8 categories with 180 scenarios to provide a safety evaluation benchmark. Our experimental evaluation, with both closed- and open-source
LLMs, indicates verbal contrastive learning and interaction-level critiquing improve the safety rate significantly.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Query-OPT: Optimizing Inference of Large Language Models via Multi-Query Instructions in Meeting Summarization
Md Tahmid Rahman Laskar, Elena Khasanova, Xue-Yong Fu, Cheng Chen, Shashi Bhushan Tn
This work focuses on the task of query-based meeting summarization in which the summary of a context (meeting transcript) is generated
in response to a specific query. When using Large Language Models (LLMs) for this task, a new call to the LLM inference endpoint/API is
required for each new query even if the context stays the same. However, repeated calls to the LLM inference endpoints would significantly
increase the costs of using them in production, making LLMs impractical for many real-world use cases. To address this problem, in this pa-
per, we investigate whether combining the queries for the same input context in a single prompt to minimize repeated calls can be successfully
used in meeting summarization. In this regard, we conduct extensive experiments by comparing the performance of various popular LLMs:
GPT-4, Gemini, Claude-3, LLaMA2, Mistral, Phi-3, and Qwen-2 in single-query and multi-query settings. We observe that the capability to
reliably generate the response in the expected format is usually limited to closed-source LLMs, with most open-source LLMs lagging behind
(except Mistral). We conclude that multi-query prompting could be useful to optimize the inference costs by significantly reducing calls to
the inference endpoints/APIs for the task of meeting summarization.
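The multi-query idea amounts to batching questions about one transcript into a single structured prompt, as in the sketch below. Here call_llm is a hypothetical stand-in for any chat-completion API, and the JSON prompt format is an assumption, not the paper's exact template.

```python
# Hedged sketch: one API call answering several queries about one transcript.
import json

def build_multi_query_prompt(transcript: str, queries: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(queries))
    return ("Meeting transcript:\n" + transcript + "\n\n"
            "Answer each question based only on the transcript. "
            'Respond with JSON of the form {"1": "...", "2": "..."}.\n'
            + numbered)

def summarize(transcript: str, queries: list[str], call_llm) -> dict:
    raw = call_llm(build_multi_query_prompt(transcript, queries))
    return json.loads(raw)  # format adherence varies by model, per the paper

queries = ["Summarize the decisions made.",
           "List all action items and their owners."]
# answers = summarize(transcript, queries, call_llm=my_api_call)
```

The cost saving comes from sending the (long) transcript once instead of once per query; the risk, as the abstract notes, is that weaker models fail to keep the structured multi-answer format.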
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Systematic Evaluation of Long-Context LLMs on Financial Concepts
Lavanya Gupta, Saket Sharma, Yiyun Zhao
Long-context large language models (LC LLMs) promise to increase reliability of LLMs in real-world tasks requiring processing and un-
derstanding of long input documents. However, this ability of LC LLMs to reliably utilize their growing context windows remains under
investigation. In this work, we evaluate the performance of state-of-the-art GPT-4 suite of LC LLMs in solving a series of progressively
challenging tasks, as a function of factors such as context length, task difficulty, and position of key information, by creating a real-world financial news dataset. Our findings indicate that LC LLMs exhibit brittleness at longer context lengths even for simple tasks, with performance
deteriorating sharply as task complexity increases. At longer context lengths, these state-of-the-art models experience catastrophic failures in
instruction following, resulting in degenerate outputs. Our prompt ablations also reveal continued sensitivity both to the placement
of the task instruction in the context window and to minor markdown formatting. Finally, we advocate for more rigorous evaluation of LC
LLMs by employing holistic metrics such as F1 (rather than recall) and reporting confidence intervals, thereby ensuring robust and conclusive
findings.
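As a concrete illustration of the advocated reporting, the sketch below computes a token-overlap F1 and a percentile-bootstrap confidence interval over per-example scores; the exact metric and CI procedure used in the paper may differ.

```python
# Token-overlap F1 plus a percentile-bootstrap CI over per-example scores.
import random

def f1(pred_tokens, gold_tokens):
    common = len(set(pred_tokens) & set(gold_tokens))
    if common == 0:
        return 0.0
    p = common / len(set(pred_tokens))
    r = common / len(set(gold_tokens))
    return 2 * p * r / (p + r)

def bootstrap_ci(scores, iters=10000, alpha=0.05, seed=0):
    """Percentile bootstrap over the mean of per-example scores."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(iters)
    )
    lo = means[int(alpha / 2 * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

print(bootstrap_ci([0.71, 0.64, 0.80, 0.55, 0.77]))
```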
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Let Me Speak Freely? A Study On The Impact Of Format Restrictions On Large Language Model Performance.
Zhi Rui Tam, Cheng-Kuang Wu, Yi-Lin Tsai, Chieh-Yen Lin, Hung-Yi Lee, Yun-Nung Chen
Structured generation, the process of producing content in standardized formats like JSON and XML, is widely utilized in real-world appli-
cations to extract key output information from large language models (LLMs). This study investigates whether such constraints on generation
space impact LLMs' abilities, including reasoning and domain knowledge comprehension. Specifically, we evaluate LLMs' performance when
restricted to adhere to structured formats versus generating free-form responses across various common tasks. Surprisingly, we observe a
significant decline in LLMs reasoning abilities under format restrictions. Furthermore, we find that stricter format constraints generally lead
to greater performance degradation in reasoning tasks.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
ASTRA: Automatic Schema Matching using Machine Translation
Tarang Chugh, Deepak Zambre
Many eCommerce platforms source product information from millions of sellers and manufacturers, each having their own proprietary
schemas, and employ schema matching solutions to structure it to enable informative shopping experiences. Meanwhile, state-of-the-art
machine translation techniques have demonstrated great success in building context-aware representations that generalize well to new lan-
guages with minimal training data. In this work, we propose modeling the schema matching problem as a neural machine translation task:
given product context and an attribute-value pair from a source schema, the model predicts the corresponding attribute, if available, in the
target schema. We utilize open-source seq2seq models, such as mT5 and mBART, fine-tuned on product attribute mappings to build a scalable
schema matching framework. We demonstrate that our proposed approach achieves a significant performance boost (15% precision and 7%
recall uplift) compared to the baseline system and can support new attributes with precision ≥ 95% using only five labeled samples per
attribute.
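The sketch below shows one plausible way to verbalize the task for an mT5-style seq2seq model, as the abstract describes. The field names and prompt format are assumptions, and a checkpoint fine-tuned on attribute mappings (not the base model loaded here) would be needed to produce meaningful outputs.

```python
# Schema matching cast as seq2seq translation: the model reads product
# context plus a source attribute-value pair and emits the target-schema
# attribute (or "none"). Verbalization is an illustrative assumption.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def match_attribute(product_context, src_attr, src_value):
    prompt = (f"product: {product_context} | "
              f"source attribute: {src_attr} = {src_value} | "
              f"target attribute:")
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=8)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```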
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Visual Editing with LLM-based Tool Chaining: An Efficient Distillation Approach for Real-Time Applications
Oren Sultan, Alex Khasin, Guy Shiran, Asnat Greenstein-Messica, Dafna Shahaf
We present a practical distillation approach to fine-tune LLMs for invoking tools in real-time applications. We focus on visual editing tasks;
specifically, we modify images and videos by interpreting user stylistic requests, specified in natural language (“golden hour”), using an
LLM to select the appropriate tools and their parameters to achieve the desired visual effect. We found that proprietary LLMs such as GPT-
3.5-Turbo show potential in this task, but their high cost and latency make them unsuitable for real-time applications. In our approach, we
fine-tune a (smaller) student LLM with guidance from a (larger) teacher LLM and behavioral signals. We introduce offline metrics to evaluate
student LLMs. Both online and offline experiments show that our student models manage to match the performance of our teacher model
(GPT-3.5-Turbo), significantly reducing costs and latency. Lastly, we show that fine-tuning was improved by 25% in low-data regimes using
augmentation.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
MARCO: Multi-Agent Real-time Chat Orchestration
Anubhav Shrimal, Stanley Kanagaraj, Kriti Biswas, Swarnalatha Raghuraman, Anish Nediyanchath, Yi Zhang, Promod Yenigalla
Large language model advancements have enabled the development of multi-agent frameworks to tackle complex, real-world problems such
as automating workflows that require interactions with diverse tools, reasoning, and human collaboration. We present MARCO, a Multi-
Agent Real-time Chat Orchestration framework for automating workflows using LLMs. MARCO addresses key challenges in utilizing LLMs
for complex, multi-step task execution in a production environment. It incorporates robust guardrails to steer LLM behavior, validate outputs,
and recover from errors that stem from inconsistent output formatting, function and parameter hallucination, and lack of domain knowledge.
Through extensive experiments we demonstrate MARCO’s superior performance with 94.48% and 92.74% accuracy on task execution for
Digital Restaurant Service Platform conversations and Retail conversations datasets respectively along with 44.91% improved latency and
33.71% cost reduction in a production setting. We also report the effect of guardrails on performance gains, along with comparisons of various
LLM models, both open-source and proprietary. The modular and generic design of MARCO allows it to be adapted for automating work-
flows across domains and to execute complex tasks through multi-turn interactions.
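A minimal sketch of the kind of guardrail loop the abstract describes: validate the model's tool call against a registry and re-prompt on malformed JSON or hallucinated functions/parameters. The tool registry and call_llm wrapper are hypothetical.

```python
# Guardrail sketch: validate an LLM's tool call and re-prompt on malformed
# output or hallucinated functions/parameters. Registry is made up.
import json

TOOLS = {"lookup_order": {"order_id"}, "cancel_order": {"order_id", "reason"}}

def guarded_tool_call(prompt, call_llm, max_retries=3):
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            call = json.loads(raw)
            name, args = call["tool"], call["args"]
        except (json.JSONDecodeError, KeyError, TypeError):
            prompt += "\nYour last output was not valid JSON with 'tool'/'args'. Retry."
            continue
        if name not in TOOLS:                      # function hallucination
            prompt += f"\n'{name}' does not exist. Choose from {sorted(TOOLS)}."
            continue
        if set(args) != TOOLS[name]:               # parameter hallucination
            prompt += f"\n'{name}' requires exactly {sorted(TOOLS[name])}."
            continue
        return name, args
    raise RuntimeError("Guardrails exhausted retries")
```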
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
SAAS: Solving Ability Amplification Strategy for Enhanced Mathematical Reasoning in Large Language Models
Hyeonwoo Kim, Gyoungjin Gim, Yungi Kim, Jihoo Kim, Byungju Kim, Wonseok Lee, Chanjun Park
This study presents a novel learning approach designed to enhance both mathematical reasoning and problem-solving abilities of Large Lan-
guage Models (LLMs). We focus on integrating the Chain-of-Thought (CoT) and the Program-of-Thought (PoT) learning, hypothesizing that
prioritizing the learning of mathematical reasoning ability is helpful for the amplification of problem-solving ability. Thus, the initial learning
with CoT is essential for solving challenging mathematical problems. To this end, we propose a sequential learning approach, named SAAS
(Solving Ability Amplification Strategy), which strategically transitions from CoT learning to PoT learning. Our empirical study, involving
an extensive performance comparison using several benchmarks, demonstrates that our SAAS achieves state-of-the-art (SOTA) performance.
The results underscore the effectiveness of our sequential learning approach, marking a significant advancement in the field of mathematical
reasoning in LLMs.
Information Extraction 2
Nov 14 (Thu) 14:00-15:30 - Room: Riverfront Hall
an intervention of treatment. However, the practical difficulty of locating such a match limits its feasibility. Addressing this issue, we use the
synthetic control method to generate such a ‘twin’ from relevant historical data, leveraging text embedding synthesis and inversion techniques.
This approach allows us to identify causal relations more robustly than previous methods, including GPT-4, which is demonstrated on a
causality benchmark, COPES-hard.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Integrating Structural Semantic Knowledge for Enhanced Information Extraction Pre-training
Xiaoyang Yi, Yuru Bao, Jian Zhang, Yifang Qin, Faxin Lin
Information Extraction (IE), aiming to extract structured information from unstructured natural language texts, can significantly benefit from
pre-trained language models. However, existing pre-training methods solely focus on exploiting textual knowledge, relying extensively on
annotated large-scale datasets, which is labor-intensive and thus limits the scalability and versatility of the resulting models. To address these
issues, we propose SKIE, a novel pre-training framework tailored for IE that integrates structural semantic knowledge via contrastive learning,
effectively alleviating the annotation burden. Specifically, SKIE utilizes Abstract Meaning Representation (AMR) as a low-cost supervision
source to boost model performance without human intervention. By enhancing the topology of AMR graphs, SKIE derives high-quality
cohesive subgraphs as additional training samples, providing diverse multi-level structural semantic knowledge. Furthermore, SKIE refines
the graph encoder to better capture cohesive information and edge relation information, thereby improving the pre-training efficacy. Exten-
sive experimental results demonstrate that SKIE outperforms state-of-the-art baselines across multiple IE tasks and showcases exceptional
performance in few-shot and zero-shot settings.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
AutoScraper: A Progressive Understanding Web Agent for Web Scraper Generation
Wenhao Huang, Zhouhong Gu, Chenghao Peng, Jiaqing Liang, Zhixu Li, Yanghua Xiao, liqian wen, Zulong Chen
Web scraping is a powerful technique that extracts data from websites, enabling automated data collection, enhancing data analysis capabil-
ities, and minimizing manual data entry efforts. Existing wrapper-based methods suffer from limited adaptability and scalability
when faced with a new website, while language agents, empowered by large language models (LLMs), exhibit poor reusability in diverse web
environments. In this work, we introduce the paradigm of generating web scrapers with LLMs and propose AutoScraper, a two-stage frame-
work that can handle diverse and changing web environments more efficiently. AutoScraper leverages the hierarchical structure of HTML and
similarity across different web pages for generating web scrapers. Besides, we propose a new executability metric for better measuring the
performance of web scraper generation tasks. We conduct comprehensive experiments with multiple LLMs and demonstrate the effectiveness
of our framework. Our work is now open-source.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
LogicST: A Logical Self-Training Framework for Document-Level Relation Extraction with Incomplete Annotations
Shengda Fan, Yanting Wang, Shasha Mo, Jianwei Niu
Document-level relation extraction (DocRE) aims to identify relationships between entities within a document. Due to the vast number of
entity pairs, fully annotating all fact triplets is challenging, resulting in datasets with numerous false negative samples. Recently, self-training-
based methods have been introduced to address this issue. However, these methods are purely black-box and sub-symbolic, making them
difficult to interpret and prone to overlooking symbolic interdependencies between relations. To remedy this deficiency, our insight is that
symbolic knowledge, such as logical rules, can be used as diagnostic tools to identify conflicts between pseudo-labels. By resolving these
conflicts through logical diagnoses, we can correct erroneous pseudo-labels, thus enhancing the training of neural models. To achieve this, we
propose **LogicST**, a neural-logic self-training framework that iteratively resolves conflicts and constructs the minimal diagnostic set for
updating models. Extensive experiments demonstrate that LogicST significantly improves performance and outperforms previous state-of-
the-art methods. For instance, LogicST achieves an increase of **7.94%** in F1 score compared to CAST (Tan et al., 2023a) on the DocRED
benchmark (Yao et al., 2019). Additionally, LogicST is more time-efficient than its self-training counterparts, requiring only **10%** of the
training time of CAST.
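To illustrate the core idea of rules as diagnostics, the sketch below flags a conflict when a pseudo-labeled premise implies a triple that was pseudo-labeled negative. The rule and relation names are invented for illustration; LogicST's actual rule handling and diagnosis construction are more involved.

```python
# Logical rules as diagnostics over pseudo-labels: a rule such as
# capital_of(x, y) -> located_in(x, y) flags a conflict whenever the
# premise holds but the implied triple was pseudo-labeled negative.

RULES = [("capital_of", "located_in")]  # premise relation -> implied relation

def find_conflicts(pseudo_labels):
    """pseudo_labels: dict mapping (head, relation, tail) -> True/False."""
    conflicts = []
    for (h, r, t), value in pseudo_labels.items():
        for premise, implied in RULES:
            if r == premise and value and pseudo_labels.get((h, implied, t)) is False:
                conflicts.append(((h, r, t), (h, implied, t)))
    return conflicts

labels = {("Paris", "capital_of", "France"): True,
          ("Paris", "located_in", "France"): False}
print(find_conflicts(labels))  # one conflict: the implied triple was denied
```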
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Efficient Performance Tracking: Leveraging Large Language Models for Automated Construction of Scientific Leaderboards
Furkan Şahinuç, Thy Thy Tran, Yulia Grishina, Yufang Hou, Bei Chen, Iryna Gurevych
Scientific leaderboards are standardized ranking systems that facilitate evaluating and comparing competitive methods. Typically, a leader-
board is defined by a task, dataset, and evaluation metric (TDM) triple, allowing objective performance assessment and fostering innovation
through benchmarking. However, the exponential increase in publications has made it infeasible to construct and maintain these leader-
boards manually. Automatic leaderboard construction has emerged as a solution to reduce manual labor. Existing datasets for this task are
based on the community-contributed leaderboards without additional curation. Our analysis shows that a large portion of these leaderboards
are incomplete, and some of them contain incorrect information. In this work, we present SciLead, a manually-curated Scientific Leader-
board dataset that overcomes the aforementioned problems. Building on this dataset, we propose three experimental settings that simulate
real-world scenarios where TDM triples are fully defined, partially defined, or undefined during leaderboard construction. While previous
research has only explored the first setting, the latter two are more representative of real-world applications. To address these diverse settings,
we develop a comprehensive LLM-based framework for constructing leaderboards. Our experiments and analysis reveal that various LLMs
often correctly identify TDM triples while struggling to extract result values from publications. We make our code and data publicly available.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Exploring Nested Named Entity Recognition with Large Language Models: Methods, Challenges, and Insights
Hongjin KIM, Jai-Eun Kim, Harksoo Kim
Nested Named Entity Recognition (NER) poses a significant challenge in Natural Language Processing (NLP), demanding sophisticated
techniques to identify entities within entities. This research investigates the application of Large Language Models (LLMs) to nested NER,
exploring methodologies from prior work and introducing specific reasoning techniques and instructions to improve LLM efficacy. Through
experiments conducted on the ACE 2004, ACE 2005, and GENIA datasets, we evaluate the impact of these approaches on nested NER
performance. Results indicate that output format critically influences nested NER performance, methodologies from previous works are less
effective, and our nested NER-tailored instructions significantly enhance performance. Additionally, we find that label information and de-
scriptions of nested cases are crucial in eliciting the capabilities of LLMs for nested NER, especially in specific domains (i.e., the GENIA
dataset). However, these methods still do not outperform BERT-based models, highlighting the ongoing need for innovative approaches in
nested NER with LLMs.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing
Haochen Zhang, Yuyang Dong, Chuan Xiao, Masafumi Oyamada
This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into
a clean format. We instruction-tune local LLMs as universal DP task solvers that operate on a single, low-cost local GPU, ensuring data
security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction
data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama
3-8B, and OpenOrca-Platypus2-13B, our models, Jellyfish-7B/8B/13B, deliver competitiveness compared to GPT-3.5/4 models and strong
generalizability to unseen tasks while barely compromising the base models’ abilities in NLP tasks. Meanwhile, Jellyfish offers enhanced
reasoning capabilities compared to GPT-3.5. Our models are available at: https://huggingface.co/NECOUDBFM/Jellyfish. Our instruction
dataset is available at: https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Extract, Define, Canonicalize: An LLM-based Framework for Knowledge Graph Construction
Bowen Zhang, Harold Soh
In this work, we are interested in automated methods for knowledge graph creation (KGC) from input text. Progress on large language models
(LLMs) has prompted a series of recent works applying them to KGC, e.g., via zero/few-shot prompting. Despite successes on small domain-
specific datasets, these models face difficulties scaling up to text common in many real-world applications. A principal issue is that, in prior
methods, the KG schema has to be included in the LLM prompt to generate valid triplets; larger and more complex schemas easily exceed
the LLM's context window length. Furthermore, there are scenarios where a fixed pre-defined schema is not available and we would like the
method to construct a high-quality KG with a succinct self-generated schema. To address these problems, we propose a three-phase frame-
work named Extract-Define-Canonicalize (EDC): open information extraction followed by schema definition and post-hoc canonicalization.
EDC is flexible in that it can be applied to settings where a pre-defined target schema is available and when it is not; in the latter case, it
constructs a schema automatically and applies self-canonicalization. To further improve performance, we introduce a trained component that
retrieves schema elements relevant to the input text; this improves the LLM's extraction performance in a retrieval-augmented generation-like
manner. We demonstrate on three KGC benchmarks that EDC is able to extract high-quality triplets without any parameter tuning and with
significantly larger schemas compared to prior works. Code for EDC is available at https://github.com/clear-nus/edc.
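A skeleton of how the three phases compose, with trivial placeholder implementations. The canonicalization here uses plain string similarity, whereas EDC uses schema definitions and a trained retriever; call_llm is a hypothetical completion wrapper.

```python
# Skeleton of the Extract-Define-Canonicalize phases; the phase bodies are
# placeholders meant only to show how the stages chain together.
from difflib import SequenceMatcher

def extract(text, call_llm):                     # phase 1: open IE
    return call_llm(f"List (subject, relation, object) triplets in: {text}")

def define(relations, call_llm):                 # phase 2: schema definition
    return {r: call_llm(f"Define the relation '{r}' in one sentence.") for r in relations}

def canonicalize(relation, schema):              # phase 3: canonicalization
    # Map an open relation phrase to the closest schema relation by string
    # similarity; EDC instead compares definitions with a trained retriever.
    return max(schema, key=lambda s: SequenceMatcher(None, relation, s).ratio())

print(canonicalize("was born in", ["birthplace", "occupation", "spouse"]))
```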
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
SPEED++: A Multilingual Event Extraction Framework for Epidemic Prediction and Preparedness
Tanmay Parekh, Jeffrey Kwan, Jiarui Yu, Sparsh Johri, Hyosang Ahn, Sreya Muppalla, Kai-Wei Chang, Wei Wang, Nanyun Peng
Social media is often the first place where communities discuss the latest societal trends. Prior works have utilized this platform to extract
epidemic-related information (e.g. infections, preventive measures) to provide early warnings for epidemic prediction. However, these works
only focused on English posts, while epidemics can occur anywhere in the world, and early discussions are often in the local, non-English
languages. In this work, we introduce the first multilingual Event Extraction (EE) framework SPEED++ for extracting epidemic event infor-
mation for any disease and language. To this end, we extend a previous epidemic ontology with 20 argument roles and curate our multilingual
EE dataset SPEED++ comprising 5.1K tweets in four languages for four diseases. Annotating data in every language is infeasible; thus we
develop zero-shot cross-lingual cross-disease models (i.e., training only on English COVID data) utilizing multilingual pre-training and show
their efficacy in extracting epidemic-related events for 65 diverse languages across different diseases. Experiments demonstrate that our
framework can provide epidemic warnings for COVID-19 in its earliest stages in Dec 2019 (3 weeks before global discussions) from Chinese
Weibo posts without any training in Chinese. Furthermore, we exploit our framework’s argument extraction capabilities to aggregate commu-
nity epidemic discussions like symptoms and cure measures, aiding misinformation detection and public attention monitoring. Overall, we
lay a strong foundation for multilingual epidemic preparedness.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Where am I? Large Language Models Wandering between Semantics and Structures in Long Contexts
Seonmin Koo, Jinsung Kim, YoungJoon Jang, Chanjun Park, Heuiseok Lim
As the utilization of Large Language Models (LLMs) becomes more widespread, there is a growing demand for their ability to handle more
complex and longer external knowledge across various use cases. Most existing evaluations of the open-ended question answering (ODQA)
task, which necessitates the use of external knowledge, focus solely on whether the model provides the correct answer. However, even when
LLMs answer correctly, they often fail to provide an obvious source for their responses. Therefore, it is necessary to jointly evaluate and
verify the correctness of the answers and the appropriateness of grounded evidence in complex external contexts. To address this issue, we
examine the phenomenon of discrepancies in abilities across two distinct tasks (QA and evidence selection) when performed simultaneously,
from the perspective of task alignment. To verify LLMs’ task alignment, we introduce a verification framework and resources considering
both semantic relevancy and structural diversity of the given long context knowledge. Through extensive experiments and detailed analysis,
we provide insights into the task misalignment between QA and evidence selection. Our code and resources will be available upon acceptance.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems
Italo Luis da Silva, Hanqi Yan, Lin Gui, Yulan He
The inherent ambiguity of cause and effect boundaries poses a challenge in evaluating causal event extraction tasks. Traditional metrics like
Exact Match and BERTScore poorly reflect model performance, so we trained evaluation models to approximate human evaluation, achieving
high agreement. We used them to perform Reinforcement Learning with extraction models to align them with human preference, prioritising
semantic understanding. We successfully explored our approach through multiple datasets, including transferring an evaluator trained on one
dataset to another as a way to decrease the reliance on human-annotated data. In that vein, we also propose a weak-to-strong supervision
method that uses a fraction of the annotated data to train an evaluation model while still achieving high performance in training an RL model.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Investigating Large Language Models for Complex Word Identification in Multilingual and Multidomain Setups
Răzvan-Alexandru Smădu, David-Gabriel ION, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel
Complex Word Identification (CWI) is an essential step in the lexical simplification task and has recently become a task on its own. Some
variations of this binary classification task have emerged, such as lexical complexity prediction (LCP) and complexity evaluation of multi-
word expressions (MWE). Large language models (LLMs) recently became popular in the Natural Language Processing community because
of their versatility and capability to solve unseen tasks in zero/few-shot settings. Our work investigates LLM usage, specifically open-source
models such as Llama 2, Llama 3, and Vicuna v1.5, and closed-source, such as ChatGPT-3.5-turbo and GPT-4o, in the CWI, LCP, and MWE
settings. We evaluate zero-shot, few-shot, and fine-tuning settings and show that LLMs struggle in certain conditions or achieve comparable
results against existing methods. In addition, we provide some views on meta-learning combined with prompt learning. In the end, we con-
clude that current LLMs barely outperform, or fail to outperform, existing methods, which are usually much smaller.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Seg2Act: Global Context-aware Action Generation for Document Logical Structuring
Zichao Li, Shaojie He, Meng Liao, Xuanang Chen, Yaojie Lu, Hongyu Lin, Yanxiong Lu, Xianpei Han, Le Sun
Document logical structuring aims to extract the underlying hierarchical structure of documents, which is crucial for document intelligence.
Traditional approaches often fall short in handling the complexity and the variability of lengthy documents. To address these issues, we
introduce Seg2Act, an end-to-end, generation-based method for document logical structuring, revisiting logical structure extraction as an
action generation task. Specifically, given the text segments of a document, Seg2Act iteratively generates the action sequence via a global
context-aware generative model, and simultaneously updates its global context and current logical structure based on the generated actions.
Experiments on ChCatExt and HierDoc datasets demonstrate the superior performance of Seg2Act in both supervised and transfer learning
settings.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Learning to Generate Rules for Realistic Few-Shot Relation Classification: An Encoder-Decoder Approach
Mayank Singh, Eduardo Blanco
We propose a neuro-symbolic approach for realistic few-shot relation classification via rules. Instead of building neural models to predict
relations, we design them to output straightforward rules that can be used to extract relations. The rules are generated using custom T5-style
Encoder-Decoder Language Models. Crucially, our rules are fully interpretable and pliable (i.e., humans can easily modify them to boost
performance). Through a combination of rules generated by these models along with a very effective, novel baseline, we demonstrate a
few-shot relation-classification performance that is comparable to or stronger than the state of the art on the Few-Shot TACRED and NYT29
benchmarks while increasing interpretability and maintaining pliability.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling
Philipp Seeberger, Dominik Wagner, Korbinian Riedhammer
With the advancement of multimedia technologies, news documents and user-generated content are often represented as multiple modalities,
making Multimedia Event Extraction (MEE) an increasingly important challenge. However, recent MEE methods employ weak alignment
strategies and data augmentation with simple classification models, which ignore the capabilities of natural language-formulated event tem-
plates for the challenging Event Argument Extraction (EAE) task. In this work, we focus on EAE and address this issue by introducing a
unified template filling model that connects the textual and visual modalities via textual prompts. This approach enables the exploitation of
cross-ontology transfer and the incorporation of event-specific semantics. Experiments on the M2E2 benchmark demonstrate the effectiveness
of our approach. Our system surpasses the current SOTA on textual EAE by +7% F1, and performs generally better than the second-best
systems for multimedia EAE.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Temporal Cognitive Tree: A Hierarchical Modeling Approach for Event Temporal Relation Extraction
Wanting Ning, Lishuang Li, Xueyang Qin, Yubo Feng, Jingyao Tang
Understanding and analyzing event temporal relations is a crucial task in Natural Language Processing (NLP). This task, known as Event
Temporal Relation Extraction (ETRE), aims to identify and extract temporal connections between events in text. Recent studies focus on
locating the relative position of event pairs on the timeline by designing logical expressions or auxiliary tasks to predict their temporal occur-
rence. Despite these advances, this modeling approach neglects the multidimensional information in temporal relations and the hierarchical
process of reasoning. In this study, we propose a novel hierarchical modeling approach for this task by introducing a Temporal Cognitive Tree
(TCT) that mimics human logical reasoning. Additionally, we also design an integrated model incorporating prompt optimization and deduc-
tive reasoning to exploit multidimensional supervised information. Extensive experiments on TB-Dense and MATRES datasets demonstrate
that our approach outperforms existing methods.
demonstrate competitive performance for both named entity recognition with GENIA and CoNLL03, and for relation extraction with SciERC
and CoNLL04.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
When and Where Did it Happen? An Encoder-Decoder Model to Identify Scenario Context
Enrique Noriega-Atala, Robert Vacareanu, Salena Torres Ashton, Adarsh Pyarelal, Clayton T Morrison, Mihai Surdeanu
We introduce a neural architecture finetuned for the task of scenario context generation: The relevant location and time of an event or entity
mentioned in text. Contextualizing information extraction helps to scope the validity of automated findings when aggregating them as knowl-
edge graphs. Our approach uses a high-quality curated dataset of time and location annotations in a corpus of epidemiology papers to train an
encoder-decoder architecture. We also explored the use of data augmentation techniques during training. Our findings suggest that a relatively
small fine-tuned encoder-decoder model performs better than out-of-the-box LLMs and semantic role labeling parsers at accurately predicting
the relevant scenario information of a particular entity or event.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Graph-tree Fusion Model with Bidirectional Information Propagation for Long Document Classification
Sudipta Singha Roy, Xindi Wang, Robert Mercer, Frank Rudzicz
Long document classification presents challenges in capturing both local and global dependencies due to their extensive content and complex
structure. Existing methods often struggle with token limits and fail to adequately model hierarchical relationships within documents. To
address these constraints, we propose a novel model leveraging a graph-tree structure. Our approach integrates syntax trees for sentence
encodings and document graphs for document encodings, which capture fine-grained syntactic relationships and broader document contexts,
respectively. We use Tree Transformers to generate sentence encodings, while a graph attention network models inter- and intra-sentence de-
pendencies. During training, we implement bidirectional information propagation from word-to-sentence-to-document and vice versa, which
enriches the contextual representation. Our proposed method enables a comprehensive understanding of content at all hierarchical levels
and effectively handles arbitrarily long contexts without token limit constraints. Experimental results demonstrate the effectiveness of our
approach in all types of long document classification tasks.
of model flexibility. We empirically show performance gaps between training on captions that come from native German perception and
captions that have been either machine-translated or human-translated from English into German. To address these gaps, we further propose
and evaluate caption augmentation strategies. While we achieve mean recall improvements (+1.3), gaps still remain, indicating an open area
of future work for the community.
Nov 14 (Thu) 14:00-15:30 - Jasmine
SURf: Teaching Large Vision-Language Models to Selectively Utilize Retrieved Information
Jiashuo Sun, Jihai Zhang, Yucheng Zhou, Zhaochen Su, Xiaoye Qu, Yu Cheng
Large Vision-Language Models (LVLMs) have become pivotal at the intersection of computer vision and natural language processing. How-
ever, the full potential of LVLMs’ Retrieval-Augmented Generation (RAG) capabilities remains underutilized. Existing works either focus
solely on the text modality or are limited to specific tasks. Moreover, most LVLMs struggle to selectively utilize retrieved information and
are sensitive to irrelevant or misleading references. To address these challenges, we propose a self-refinement framework designed to teach
LVLMs to Selectively Utilize Retrieved Information (SURf). Specifically, when given questions that are incorrectly answered by the LVLM
backbone, we obtain references that help correct the answers (positive references) and those that do not (negative references). We then fine-
tune the LVLM backbone using a combination of these positive and negative references. Our experiments across three tasks and seven datasets
demonstrate that our framework significantly enhances LVLMs' ability to effectively utilize retrieved multimodal references and improves their
robustness against irrelevant or misleading information. The source code is available at https://anonymous.4open.science/r/SURf-6433.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Attribute Diversity Determines the Systematicity Gap in VQA
Ian Berlot-Attwell, Kumar Krishna Agrawal, Annabelle Michael Carrell, Yash Sharma, Naomi Saphra
Although modern neural networks often generalize to new combinations of familiar concepts, the conditions that enable such compositional-
ity have long been an open question. In this work, we study the systematicity gap in visual question answering: the performance difference
between reasoning on previously seen and unseen combinations of object attributes. To test this, we introduce a novel diagnostic dataset, CLEVR-
HOPE. We find that the systematicity gap is not reduced by increasing the quantity of training data, but is reduced by increasing the diversity
of training data. In particular, our experiments suggest that the more distinct attribute type combinations are seen during training, the more
systematic we can expect the resulting model to be.
Nov 14 (Thu) 14:00-15:30 - Jasmine
TraveLER: A Modular Multi-LMM Agent Framework for Video Question-Answering
Chuyi Shang, Amos You, Sanjay Subramanian, Trevor Darrell, Roei Herzig
Recently, image-based Large Multimodal Models (LMMs) have made significant progress in video question-answering (VideoQA) using a
frame-wise approach by leveraging large-scale pretraining in a zero-shot manner. Nevertheless, these models need to be capable of finding
relevant information, extracting it, and answering the question simultaneously. Currently, existing methods perform all of these steps in a
single pass without being able to adapt if insufficient or incorrect information is collected. To overcome this, we introduce a modular multi-
LMM agent framework based on several agents with different roles, instructed by a Planner agent that updates its instructions using shared
feedback from the other agents. Specifically, we propose TraveLER, a method that can create a plan to "**Trave**rse" through the video, ask
questions about individual frames to "**L**ocate" and store key information, and then "**E**valuate" if there is enough information to an-
swer the question. Finally, if there is not enough information, our method is able to "**R**eplan" based on its collected knowledge. Through
extensive experiments, we find that the proposed TraveLER approach improves performance on several VideoQA benchmarks without the
need to fine-tune on specific datasets. Our code is available at https://github.com/traveler-framework/TraveLER.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Vision-Language Model Fine-Tuning via Simple Parameter-Efficient Modification
Ming Li, Jike Zhong, Chenxin Li, Liuzhuozheng Li, Nie Lin, Masashi Sugiyama
Recent advances in fine-tuning Vision-Language Models (VLMs) have witnessed the success of prompt tuning and adapter tuning, while
the classic model fine-tuning on inherent parameters seems to be overlooked. It is believed that fine-tuning the parameters of VLMs with
few-shot samples corrupts the pre-trained knowledge since fine-tuning the CLIP model even degrades performance. In this paper, we revisit
this viewpoint and propose a new perspective: fine-tuning specific parameters rather than all of them can uncover the power of classic model fine-
tuning on VLMs. Through our meticulous study, we propose CLIPFit, a simple yet effective method to fine-tune CLIP without introducing any
overhead of extra parameters. We demonstrate that by fine-tuning only specific bias terms and normalization layers, CLIPFit can improve
the performance of zero-shot CLIP by 7.27% average harmonic mean accuracy. Lastly, to understand how fine-tuning in CLIPFit affects the
pre-trained models, we conducted extensive experimental analyses w.r.t. changes in internal parameters and representations. We found that
low-level text bias layers and the first layer normalization layer change much more than other layers. The code will be released.
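The recipe lends itself to a short PyTorch sketch: freeze everything, then re-enable bias terms and LayerNorm parameters. This version marks all biases trainable, whereas the paper selects specific ones, and it is shown on a generic module rather than CLIP.

```python
# Freeze all parameters except bias terms and LayerNorm parameters,
# then fine-tune as usual. Generic module used in place of CLIP.
import torch.nn as nn

def mark_bias_and_norm_trainable(model: nn.Module):
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith(".bias")
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            for param in module.parameters():
                param.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

model = nn.Sequential(nn.Linear(8, 8), nn.LayerNorm(8), nn.Linear(8, 2))
trainable = mark_bias_and_norm_trainable(model)
print(sum(p.numel() for p in trainable), "trainable parameters")
```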
Nov 14 (Thu) 14:00-15:30 - Jasmine
ActPlan-1K: Benchmarking the Procedural Planning Ability of Visual Language Models in Household Activities
Ying Su, Zhan Ling, Haochen Shi, Cheng Jiayang, Yauwai Yim, Yangqiu Song
Large language models (LLMs) have been adopted to process textual task descriptions and accomplish procedural planning in embodied AI
tasks because of their powerful reasoning ability. However, there is still a lack of study on how vision language models (VLMs) behave when
multi-modal task inputs are considered. Counterfactual planning, which evaluates the model’s reasoning ability over alternative task situations,
is also underexplored. In order to evaluate planning ability in both its multi-modal and counterfactual aspects, we propose ActPlan-1K.
ActPlan-1K is a multi-modal planning benchmark constructed based on ChatGPT and household activity simulator iGibson2. The benchmark
consists of 153 activities and 1,187 instances. Each instance, describing one activity, has a natural language task description and multiple
environment images from the simulator. The gold plan of each instance is an action sequence over the objects in the provided scenes. Both
correctness and commonsense satisfaction are evaluated on typical VLMs. It turns out that current VLMs still struggle to generate
human-level procedural plans for both normal and counterfactual activities. We further provide automatic evaluation metrics, obtained by
fine-tuning a BLEURT model, to facilitate future research on our benchmark.
Nov 14 (Thu) 14:00-15:30 - Jasmine
On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and
Reasoning
Geewook Kim, Minjoon Seo
Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency, limiting
broader research and reproducibility. While open-source models handle general image tasks effectively, they face challenges with the high
computational demands of complex visually-situated text understanding. Such tasks often require increased token inputs and large vision
modules to harness high-resolution information. Striking a balance between model size and data importance remains an open question. This
study aims to redefine the design of vision-language models by identifying key components and creating efficient models with constrained in-
ference costs. By strategically formulating datasets, optimizing vision modules, and enhancing supervision techniques, we achieve significant
improvements in inference throughput while maintaining high performance. Extensive experiments across models ranging from 160M to 13B
parameters offer insights into model optimization. We will fully open-source our codebase, models, and datasets at https://github.com/naver-
ai/elva.
Nov 14 (Thu) 14:00-15:30 - Jasmine
IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning
Soumya Suvra Ghosal, Samyadeep Basu, Soheil Feizi, Dinesh Manocha
Image-text contrastive models such as CLIP learn transferable and robust representations for zero-shot transfer to a variety of downstream
tasks. However, to obtain strong downstream performances, prompts need to be carefully curated, which can be a tedious engineering task.
To address the issue of manual prompt engineering, prompt-tuning is used where a set of contextual vectors are learned by leveraging infor-
mation from the training data. Despite their effectiveness, existing prompt-tuning frameworks often lack interpretability, thus limiting their
ability to understand the compositional nature of images. In this work, we first identify that incorporating compositional attributes (e.g., a
"green" tree frog) in the design of manual prompts can significantly enhance image-text alignment scores. Building upon this observation,
we propose a novel and interpretable prompt-tuning method named IntCoOp, which learns to jointly align attribute-level inductive biases and
class embeddings during prompt-tuning. To assess the effectiveness of our approach, we evaluate IntCoOp across two representative tasks in
a few-shot learning setup: generalization to novel classes, and unseen domain shifts. Through extensive experiments across 10 downstream
datasets on CLIP, we find that introducing attribute-level inductive biases leads to superior performance against state-of-the-art prompt tuning
frameworks. Notably, in a 16-shot setup, IntCoOp improves CoOp by 7.35% in average performance across 10 diverse datasets.
Nov 14 (Thu) 14:00-15:30 - Jasmine
VIEWS: Entity-Aware News Video Captioning
Hammad Ayyubi, Tianqi Liu, Arsha Nagrani, Xudong Lin, Mingda Zhang, Anurag Arnab, feng han, Yukun Zhu, Xuande Feng, Kevin Zhang,
Jialu Liu, Shih-Fu Chang
Existing popular video captioning benchmarks and models often produce generic captions for videos that lack specific identification of in-
dividuals, locations, or organizations (named entities). However, in the case of news videos, the setting is more demanding, requiring the
inclusion of such named entities for meaningful summarization. Therefore, we introduce the task of directly summarizing news videos into
captions that are entity-aware. To facilitate research in this area, we have collected a large-scale dataset named VIEWS (VIdeo NEWS).
Within this task, we face challenges inherent to recognizing named entities and navigating diverse, dynamic contexts, all while relying solely
on visual cues. To address these challenges, we propose a model-agnostic approach that enriches visual information extracted from videos
with context sourced from external knowledge, enabling the generation of entity-aware captions. We validate the effectiveness of our approach
across three video captioning models. Additionally, we conduct a critical analysis of our methodology to gain insights into the complexity of
the task, the challenges it presents, and potential avenues for future research.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties
Keunwoo Peter Yu, Zheyuan Zhang, Fengyuan Hu, Shane Storks, Joyce Chai
A major reason behind the recent success of large language models (LLMs) is their in-context learning capability, which makes it possible to
rapidly adapt them to downstream text-based tasks by prompting them with a small number of relevant demonstrations. While large vision-
language models (VLMs) have recently been developed for tasks requiring both text and images, they largely lack in-context learning over
visual information, especially in understanding and generating text about videos. In this work, we implement Emergent In-context Learning
on Videos (EILeV), a novel training paradigm that induces in-context learning over video and text by capturing key properties of pre-training
data found by prior work to be essential for in-context learning in transformers. In our experiments, we show that EILeV-trained models out-
perform other off-the-shelf VLMs in few-shot video narration for novel, rare actions. Furthermore, we demonstrate that these key properties
of bursty distributions, skewed marginal distributions, and dynamic meaning each contribute to varying degrees to VLMs’ in-context learning
capability in narrating procedural videos. Our results, analysis, and EILeV-trained models yield numerous insights about the emergence of
in-context learning over video and text, creating a foundation for future work to optimize and scale VLMs for open-domain video understand-
ing and reasoning.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Retrieval-enriched zero-shot image classification in low-resource domains
Nicola Dall’Asen, Yiming Wang, Enrico Fini, Elisa Ricci
Low-resource domains, characterized by scarce data and annotations, present significant challenges for language and visual understanding
tasks, with the latter much under-explored in the literature. Recent advancements in Vision-Language Models (VLM) have shown promising
results in high-resource domains but fall short in low-resource concepts that are under-represented (e.g. only a handful of images per category)
in the pre-training set. We tackle the challenging task of zero-shot low-resource image classification from a novel perspective. By leverag-
ing a retrieval-based strategy, we achieve this in a training-free fashion. Specifically, our method, named CoRE (Combination of Retrieval
Enrichment), enriches the representation of both query images and class prototypes by retrieving relevant textual information from large web-
crawled databases. This retrieval-based enrichment significantly boosts classification performance by incorporating the broader contextual
information relevant to the specific class. We validate our method on a newly established benchmark covering diverse low-resource domains,
including medical imaging, rare plants, and circuits. Our experiments demonstrate that CoRE outperforms existing state-of-the-art methods
that rely on synthetic data generation and model fine-tuning.
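One plausible reading of the retrieval-based enrichment, sketched under assumptions: both the query image embedding and each class prototype are mixed with the mean embedding of their retrieved texts before nearest-prototype classification. The retrieve function and the mixing weight are hypothetical.

```python
# Retrieval-based enrichment sketch: mix an embedding with the mean of
# its retrieved-text embeddings, then classify by nearest prototype.
import numpy as np

def enrich(vec, retrieved_vecs, alpha=0.5):
    """alpha is an assumed mixing weight; result is L2-normalized."""
    mixed = alpha * vec + (1 - alpha) * np.mean(retrieved_vecs, axis=0)
    return mixed / np.linalg.norm(mixed)

def classify(image_vec, class_protos, retrieve):
    q = enrich(image_vec, retrieve(image_vec))
    protos = [enrich(p, retrieve(p)) for p in class_protos]
    scores = [float(q @ p) for p in protos]
    return int(np.argmax(scores))
```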
Nov 14 (Thu) 14:00-15:30 - Jasmine
Show and Guide: Instructional-Plan Grounded Vision and Language Model
Diogo Glória-Silva, David Semedo, Joao Magalhaes
Guiding users through complex procedural plans is an inherently multimodal task in which having visually illustrated plan steps is crucial to
deliver effective plan guidance. However, existing works on plan-following language models (LMs) are often not capable of multimodal
input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks
by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video
Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation,
where the model generates the next step in a plan, conditioned on an image of the user’s current progress. MM-PlanLLM is trained using
a novel multitask-multistage approach, designed to gradually expose the model to multimodal instructional-plans semantic layers, achieving
strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-
modal temporal and plan-structure representations aligned between textual plan steps and instructional video moments.
method decomposes the short- and long-range modeling aspects of LVQA into two stages. First, we use a short-term visual captioner to
generate textual descriptions of short video clips (0.5-8 seconds in length) densely sampled from a long input video. Afterward, an LLM
aggregates the densely extracted short-term captions to answer a given question. Furthermore, we propose a novel multi-round summarization
prompt that asks the LLM first to summarize the noisy short-term visual captions and then answer a given input question. To analyze what
makes our simple framework so effective, we thoroughly evaluate various components of our framework. Our empirical analysis reveals that
the choice of the visual captioner and LLM is critical for good LVQA performance. The proposed multi-round summarization prompt also
leads to a significant LVQA performance boost. Our method achieves the best-reported results on the EgoSchema dataset, best known for
very long-form video question-answering. LLoVi also outperforms the previous state-of-the-art by **10.2%** and **6.2%** on NExT-QA
and IntentQA for LVQA. Finally, we extend LLoVi to grounded VideoQA, which requires both QA and temporal localization, and show that
it outperforms all prior methods on NExT-GQA. Code is available at https://github.com/CeeZh/LLoVi.
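The multi-round summarization prompt reduces to two LLM calls, sketched below with a hypothetical call_llm wrapper; the wording of the prompts is an assumption, not the paper's exact phrasing.

```python
# Two-stage prompting: first summarize noisy short-clip captions, then
# answer the question from the summary. `call_llm` is hypothetical.

def llm_video_qa(clip_captions, question, call_llm):
    captions = "\n".join(f"[{i}] {c}" for i, c in enumerate(clip_captions))
    summary = call_llm(
        "Summarize these noisy captions of consecutive video clips, "
        f"keeping question-relevant details.\nQuestion: {question}\n{captions}"
    )
    return call_llm(f"Video summary:\n{summary}\n\nAnswer the question: {question}")
```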
Nov 14 (Thu) 14:00-15:30 - Jasmine
RECANTFormer: Referring Expression Comprehension with Varying Numbers of Targets
Bhathiya Hemanthage, Hakan Bilen, Phil Bartie, Christian Dondrup, Oliver Lemon
The Generalized Referring Expression Comprehension (GREC) task extends classic REC by generating image bounding boxes for objects
referred to in natural language expressions, which may indicate zero, one, or multiple targets. This generalization enhances the practical-
ity of REC models for diverse real-world applications. However, the presence of varying numbers of targets in samples makes GREC a
more complex task, both in terms of training supervision and final prediction selection strategy. Addressing these challenges, we introduce
RECANTFormer, a one-stage method for GREC that combines a decoder-free (encoder-only) transformer architecture with DETR-like Hun-
garian matching. Our approach consistently outperforms baselines by significant margins in three GREC datasets.
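The DETR-like Hungarian matching can be illustrated with SciPy's linear_sum_assignment; the cost below is plain L1 over box coordinates, whereas DETR-style matchers typically also include classification and generalized-IoU terms.

```python
# Hungarian matching sketch for a variable number of targets: predictions
# are assigned to ground-truth boxes by minimizing a pairwise cost, and
# unmatched predictions are trained toward "no object".
import numpy as np
from scipy.optimize import linear_sum_assignment

preds = np.array([[0.1, 0.1, 0.3, 0.3], [0.5, 0.5, 0.9, 0.9], [0.0, 0.0, 1.0, 1.0]])
golds = np.array([[0.48, 0.52, 0.88, 0.91]])  # a single referred target

cost = np.abs(preds[:, None, :] - golds[None, :, :]).sum(-1)  # (n_preds, n_golds)
pred_idx, gold_idx = linear_sum_assignment(cost)
print(list(zip(pred_idx, gold_idx)))  # [(1, 0)]: prediction 1 matches the target
```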
Nov 14 (Thu) 14:00-15:30 - Jasmine
Training-free Deep Concept Injection Enables Language Models for Video Question Answering
Xudong Lin, Manling Li, Richard Zemel, Heng Ji, Shih-Fu Chang
Recently, enabling pretrained language models (PLMs) to perform zero-shot crossmodal tasks such as video question answering has been
extensively studied. A popular approach is to learn a projection network that projects visual features into the input text embedding space of a
PLM, as well as feed-forward adaptation layers, with the weights of the PLM frozen. However, is it really necessary to learn such additional
layers? In this paper, we make the first attempt to demonstrate that the PLM is able to perform zero-shot crossmodal tasks without any cross-
modal pretraining, when the observed visual concepts are injected as both additional input text tokens and augmentation in the intermediate
features within each feed-forward network for the PLM. Specifically, inputting observed visual concepts as text tokens helps to inject them
through the self-attention layers in the PLM; to augment the intermediate features in a way that is compatible with the PLM, we propose
to construct adaptation layers based on the intermediate representation of concepts (obtained by solely inputting them to the PLM). These
two complementary injection mechanisms form the proposed Deep Concept Injection, which comprehensively enables the PLM to perceive
instantly without crossmodal pretraining. Extensive empirical analysis on zero-shot video question answering, as well as visual question
answering, shows Deep Concept Injection achieves competitive or even better results in both zero-shot and fine-tuning settings, compared to
state-of-the-art methods that require crossmodal pretraining.
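The first injection mechanism (concepts as input text tokens) is simple enough to sketch directly; the feature-level injection into each feed-forward network is omitted here. The prompt wording and the upstream concept detector are assumptions.

```python
# Sketch of the text-token half of Deep Concept Injection: detected visual
# concepts are verbalized and prepended as ordinary text tokens, so a frozen
# PLM can attend to them without any crossmodal training.

def inject_concepts_as_text(concepts, question):
    concept_str = ", ".join(concepts)
    return (f"Observed visual concepts: {concept_str}.\n"
            f"Question: {question}\nAnswer:")

prompt = inject_concepts_as_text(["dog", "frisbee", "park"],
                                 "What is the dog chasing?")
print(prompt)
```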
Nov 14 (Thu) 14:00-15:30 - Jasmine
MIBench: Evaluating Multimodal Large Language Models over Multiple Images
Haowei Liu, Xi Zhang, Haiyang Xu, Yaya Shi, Chaoya Jiang, Ming Yan, Ji Zhang, Fei Huang, Chunfeng Yuan, Bing Li, Weiming Hu
Built on the power of LLMs, numerous multimodal large language models (MLLMs) have recently achieved remarkable performance on
various vision-language tasks. However, most existing MLLMs and benchmarks primarily focus on single-image input scenarios, leaving
the performance of MLLMs when handling realistic multiple images underexplored. Although a few benchmarks consider multiple images,
their evaluation dimensions and samples are very limited. In this paper, we propose a new benchmark, MIBench, to comprehensively evaluate
fine-grained abilities of MLLMs in multi-image scenarios. Specifically, MIBench categorizes the multi-image abilities into three scenarios:
multi-image instruction (MII), multimodal knowledge-seeking (MKS) and multimodal in-context learning (MIC), and constructs 13 tasks
with a total of 13K annotated samples. During data construction, for MII and MKS, we extract correct options from manual annotations
and create challenging distractors to obtain multiple-choice questions. For MIC, to enable an in-depth evaluation, we set four sub-tasks and
transform the original datasets into in-context learning formats. We evaluate several open-source and closed-source MLLMs on the proposed
MIBench. The results reveal that although current models excel in single-image tasks, they exhibit significant shortcomings when faced with
multi-image inputs, such as limited fine-grained perception, multi-image reasoning and in-context learning abilities. The annotated data of
MIBench is available at https://huggingface.co/datasets/StarBottle/MIBench.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models
Donghoon Kim, Gusang Lee, Kyuhong Shim, Byonghyo Shim
Recently, we have observed that Large Multi-modal Models (LMMs) are revolutionizing the way machines interact with the world, unlocking
new possibilities across various multi-modal applications. To adapt LMMs for downstream tasks, parameter-efficient fine-tuning (PEFT)
which only trains additional prefix tokens or modules, has gained popularity. Nevertheless, there has been little analysis of how PEFT works
in LMMs. In this paper, we delve into the strengths and weaknesses of each tuning strategy, shifting the focus from the efficiency typically
associated with these approaches. We first discover that model parameter tuning methods such as LoRA and Adapters distort the feature
representation space learned during pre-training, limiting the full utilization of pre-trained knowledge. We also demonstrate that prefix-tuning
excels at preserving the representation space, despite its lower performance on downstream tasks. These findings suggest a simple two-
step PEFT strategy called Prefix-Tuned PEFT (PT-PEFT), which successively performs prefix-tuning and then another PEFT method (i.e., Adapter or
LoRA), combining the benefits of both. Experimental results show that PT-PEFT not only improves performance in image captioning and
visual question answering compared to vanilla PEFT methods but also helps preserve the representation space of the four pre-trained models.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Difficult Task Yes but Simple Task No: Unveiling the Laziness in Multimodal LLMs
Sihang Zhao, Youliang Yuan, Xiaoying Tang, Pinjia He
Multimodal Large Language Models (MLLMs) demonstrate a strong understanding of the real world and can even handle complex tasks.
However, they still fail on some straightforward visual question-answering (VQA) problems. This paper dives deeper into this issue, reveal-
ing that models tend to err when answering easy questions (e.g., Yes/No questions) about an image, even though they can correctly describe
it. We refer to this model behavior discrepancy between difficult and simple questions as model laziness. To systematically investigate model
laziness, we manually construct LazyBench, a benchmark that includes Yes/No, multiple choice, short answer questions, and image descrip-
tion tasks that are related to the same subjects in the images. Based on LazyBench, we observe that laziness widely exists in current advanced
MLLMs (e.g., GPT-4o, Gemini-1.5-pro, Claude 3, LLaVA-1.5, LLaVA-1.6, and QWen-VL). We also analyzed the failure cases of LLaVA-
1.5-13B on the VQA-v2 benchmark and discovered that about half of these failures are due to the model’s laziness. This further highlights the
importance of ensuring that the model fully utilizes its capability. To this end, we conduct a preliminary exploration of how to mitigate lazi-
ness and find that chain of thought can effectively avoid this issue. The data can be accessed at https://github.com/Akutagawa1998/LazyBench.
Nov 14 (Thu) 14:00-15:30 - Jasmine
VPL: Visual Proxy Learning Framework for Zero-Shot Medical Image Diagnosis
Jiaxiang Liu, Tianxiang Hu, Huimin Xiong, Jiawei Du, YANG FENG, Jian Wu, Joey Tianyi Zhou, Zuozhu Liu
Vision-language models like CLIP, utilizing class proxies derived from class name text features, have shown a notable capability in zero-shot
medical image diagnosis which is vital in scenarios with limited disease databases or labeled samples. However, insufficient medical text pre-
cision and the modal disparity between text and vision spaces pose challenges for such a paradigm. We show analytically and experimentally
that enriching medical texts with detailed descriptions can markedly enhance the diagnosis performance, with the granularity and phrasing of
these enhancements having a crucial impact on CLIP’s understanding of medical images; and learning proxies within the vision domain can
effectively circumvent the modal gap issue. Based on our analysis, we propose a medical visual proxy learning framework comprising two
key components: a text refinement module that creates high-quality medical text descriptions, and a stable Sinkhorn algorithm for the efficient
generation of pseudo labels which further guide the visual proxy learning. Our method elevates the Vanilla CLIP inference by supplying
meticulously crafted clues to leverage CLIP’s existing interpretive power and using the feature of refined texts to bridge the vision-text gap.
The effectiveness and robustness of our method are clearly demonstrated through extensive experiments. Notably, our method outperforms the
state-of-the-art zero-shot medical image diagnosis by a significant margin, ranging from 1.69% to 15.31% on five datasets covering various
diseases, confirming its immense potential in zero-shot diagnosis across diverse medical applications.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Unleashing the Potentials of Likelihood Composition for Multi-modal Language Models
Shitian Zhao, Renrui Zhang, Xu Luo, Yan Wang, Shanghang Zhang, Peng Gao
Model fusing has always been an important topic, especially in an era where large language models (LLM) and multi-modal language models
(MLM) with different architectures, parameter sizes and training pipelines, are being created all the time. In this work, we propose a post-hoc
framework for fusing heterogeneous models off-the-shelf, which we call likelihood composition; the basic idea is to compose multiple models' likelihood distributions when performing a multi-choice visual question answering task. Here the core concept, likelihood, is the log-probability of a candidate answer. In likelihood composition, we introduce some basic operations: debias, highlight, majority-vote, and ensemble. By combining (composing) these basic elements, we obtain mixed composition methods: mix-composition. Through con-
ducting comprehensive experiments on 9 VQA datasets and 10 MLMs, we prove the effectiveness of mix-composition compared with simple
ensemble or majority-vote methods. In this framework, people can propose new basic composition methods and combine them to get the new
mixed composition methods. We hope our proposed likelihood composition can provide a new perspective of fusing heterogeneous models
and inspire the exploration under this framework.
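To make the basic operations concrete, here is a minimal Python sketch over per-model log-probabilities of the candidate answers. The ensemble and majority-vote forms are straightforward; the context-free prior used for debias is an illustrative assumption, and highlight is omitted.

```python
def ensemble(likelihoods: list[list[float]]) -> list[float]:
    """Average the log-probabilities of each candidate across models."""
    n_models = len(likelihoods)
    return [sum(col) / n_models for col in zip(*likelihoods)]

def majority_vote(likelihoods: list[list[float]]) -> int:
    """Each model votes for its most likely candidate; ties go to the lowest index."""
    votes = [0] * len(likelihoods[0])
    for per_model in likelihoods:
        votes[per_model.index(max(per_model))] += 1
    return votes.index(max(votes))

def debias(per_model: list[float], prior: list[float]) -> list[float]:
    """Subtract a context-free prior over candidates (an assumed form of debiasing)."""
    return [lp - p for lp, p in zip(per_model, prior)]

# Two hypothetical models scoring candidates A/B/C for one question.
model_scores = [[-0.2, -1.9, -2.3], [-1.1, -0.4, -2.8]]
mixed = ensemble(model_scores)  # one ingredient of a "mix-composition"
print(mixed.index(max(mixed)), majority_vote(model_scores))  # both select candidate A
```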
tion, using a range of text descriptions: from single words to rich descriptions. Our results demonstrate strong improvements over previous
approaches, showing that zero-shot learning can be applied with little training data. Furthermore, we conduct an analysis with foundational
vision and language models, demonstrating that they struggle to generalize when describing what attributes the class lacks.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Grounding Complex Events in Multimodal Data
Kate Sanders, Reno Kriz, David Etter, Hannah Recknor, Alexander Martin, Cameron Carpenter, Jingyang Lin, Benjamin Van Durme
How are we able to learn about complex current events just from short snippets of video? While natural language enables straightforward
ways to represent under-specified, partially observable events, visual data does not facilitate analogous methods and, consequently, introduces
unique challenges in event understanding. With the growing prevalence of vision-capable AI agents, these systems must be able to model
events from collections of unstructured video data. To tackle robust event modeling in multimodal settings, we introduce a multimodal for-
mulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task. We propose a corresponding
benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents,
containing 22.8K labeled event-centric entities. We propose a collection of LLM-driven approaches to the task of multimodal event analysis,
and evaluate them on MultiVENT-G. Results illustrate the challenges that abstract event understanding poses and demonstrate promise in
event-centric video-language systems.
Nov 14 (Thu) 14:00-15:30 - Jasmine
EchoSight: Advancing Visual-Language Models with Wiki Knowledge
Yibin Yan, Weidi Xie
Knowledge-based Visual Question Answering (KVQA) tasks require answering questions about images using extensive background knowl-
edge. Despite significant advancements, generative models often struggle with these tasks due to the limited integration of external knowledge.
In this paper, we introduce EchoSight, a novel multimodal Retrieval-Augmented Generation (RAG) framework that enables large language models (LLMs) to answer visual questions requiring fine-grained encyclopedic knowledge. To achieve high-performing retrieval, EchoSight first searches wiki articles using visual-only information; these candidate articles are then reranked according to their relevance to the combined text-image query. This approach significantly improves the integration of multimodal knowledge, leading to
enhanced retrieval outcomes and more accurate VQA responses. Our experimental results on the E-VQA and InfoSeek datasets demonstrate
that EchoSight establishes new state-of-the-art results in knowledge-based VQA, achieving an accuracy of 41.8% on E-VQA and 31.3% on
InfoSeek.
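The two-stage retrieve-then-rerank pipeline can be sketched compactly. The embeddings, fusion rule, and scoring below are illustrative placeholders, not EchoSight's actual encoders.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)) + 1e-9)

def retrieve_then_rerank(image_vec, query_vec, articles, k=2):
    # Stage 1: visual-only search over article image embeddings.
    by_visual = sorted(articles, key=lambda a: -cosine(image_vec, a["img_emb"]))[:k]
    # Stage 2: rerank candidates against a combined text-image query (simple average fusion).
    fused = [0.5 * (i + q) for i, q in zip(image_vec, query_vec)]
    return sorted(by_visual, key=lambda a: -cosine(fused, a["txt_emb"]))

articles = [{"title": "A", "img_emb": [1, 0], "txt_emb": [0.9, 0.1]},
            {"title": "B", "img_emb": [0.7, 0.7], "txt_emb": [0.1, 0.9]}]
print([a["title"] for a in retrieve_then_rerank([1, 0], [0, 1], articles)])
```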
Nov 14 (Thu) 14:00-15:30 - Jasmine
Unraveling the Truth: Do LLMs really Understand Charts? A Deep Dive into Consistency and Robustness
Srija Mukhopadhyay, Adnan Qidwai, Aparna Garimella, Pritika Ramu, Vivek Gupta, Dan Roth
Chart question answering (CQA) is a crucial area of Visual Language Understanding. However, the robustness and consistency of current
Visual Language Models (VLMs) in this field remain under-explored. This paper evaluates state-of-the-art VLMs on comprehensive datasets,
developed specifically for this study, encompassing diverse question categories and chart formats. We investigate two key aspects: 1) the
models’ ability to handle varying levels of chart and question complexity, and 2) their robustness across different visual representations of the
same underlying data. Our analysis reveals significant performance variations based on question and chart types, highlighting both strengths
and weaknesses of current models. Additionally, we identify areas for improvement and propose future research directions to build more
robust and reliable CQA systems. This study sheds light on the limitations of current models and paves the way for future advancements in
the field.
Nov 14 (Thu) 14:00-15:30 - Jasmine
TransferCVLM: Transferring Cross-Modal Knowledge for Vision-Language Modeling
Dongha Choi, Jung-jae Kim, Hyunju Lee
Recent large vision-language multimodal models pre-trained with huge amounts of image-text pairs show remarkable performance in downstream tasks. However, multimodal pre-training has limitations in terms of resources and training time when it comes to obtaining new models that surpass existing ones. To overcome these issues, we propose TransferCVLM, a method of efficient knowledge transfer that integrates pre-trained uni-modal models (and a cross-modal fusion encoder) into a combined vision-language model (CVLM) without pre-training the CVLM on a large amount of multimodal data, and then, for each task application, fine-tunes the CVLM and transfers the multimodal knowledge of a teacher vision-language model to the CVLM using knowledge distillation techniques. We demonstrate that 1) the fine-tuned CVLM performs comparably to other vision-language models of similar size, that 2) the multimodal knowledge transfer consistently enhances the CVLM, and the knowledge-transferred CVLM composed of large-size unimodal models outperforms the teacher multimodal model in most downstream tasks, and that 3) TransferCVLM can also be used for model compression when using small-size unimodal models. We estimate that training TransferCVLM takes only 6% of the pre-training cost of other vision-language models. Our code
is available at https://github.com/DMCB-GIST/TransferCVLM.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Personalized Video Comment Generation
Xudong Lin, Ali Zare, Shiyuan Huang, Ming-Hsuan Yang, Shih-Fu Chang, Li Zhang
Generating personalized responses, particularly in the context of video, poses a unique challenge for language models. This paper introduces
the novel task of Personalized Video Comment Generation (PVCG), aiming to predict user comments tailored to both the input video
and the user's comment history, where the user is unseen during the model training process. Unlike existing video captioning tasks, which ignore personalization in the text generation process, we introduce PerVidCom, a new dataset specifically collected for this novel task
with diverse personalized comments from YouTube. Recognizing the limitations of existing captioning metrics for evaluating this task, we
propose a new automatic metric based on Large Language Models (LLMs) with few-shot in-context learning, named FICL-Score, specifically
measuring quality from the aspects of emotion, language style and content relevance. We verify the proposed metric with human evaluations.
We establish baselines using prominent Multimodal LLMs (MLLMs), analyze their performance discrepancies through extensive evaluation, and identify directions for future improvement on this important task. Our research opens up a new direction of personalizing MLLMs and
paves the way for future research.
Nov 14 (Thu) 14:00-15:30 - Jasmine
Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training
Xinyan Chen, Jiaxin Ge, Tianjun Zhang, Jiaming Liu, Shanghang Zhang
Diffusion models have shown impressive performance in many domains. However, the model’s capability to follow natural language instruc-
tions (e.g., spatial relationships between objects, generating complex scenes) is still unsatisfactory. In this work, we propose Iterative Prompt
Relabeling (IPR), a novel algorithm that aligns images to text through iterative image sampling and prompt relabeling with feedback. IPR first
samples a batch of images conditioned on the text, then relabels the text prompts of unmatched text-image pairs with classifier feedback. We
conduct thorough experiments on SDv2 and SDXL, testing their capability to follow instructions on spatial relations. With IPR, we improve performance by up to 15.22% (absolute) on the challenging spatial relation VISOR benchmark, demonstrating superior performance compared to previous RL methods. Our code is publicly available at https://github.com/cxy000000/IPR-RLDF.
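As a rough illustration of one IPR iteration, the sketch below treats the diffusion sampler, the spatial-relation classifier, and the relabeling step as hypothetical stand-ins; the paper's actual feedback and training signal may differ.

```python
from typing import Callable

def ipr_round(prompts, sample: Callable, matches: Callable, relabel: Callable):
    """One IPR iteration: sample images, keep matched pairs, relabel the rest."""
    dataset = []
    for prompt in prompts:
        image = sample(prompt)                  # image conditioned on the text
        if matches(image, prompt):              # classifier feedback on the pair
            dataset.append((image, prompt))
        else:                                   # fix the caption, not the image
            dataset.append((image, relabel(image, prompt)))
    return dataset  # used to fine-tune the model before the next iteration

# Toy stand-ins: every sample "fails" the check and gets a corrected prompt.
data = ipr_round(["a cat left of a dog"],
                 sample=lambda p: f"<image for {p!r}>",
                 matches=lambda img, p: False,
                 relabel=lambda img, p: "a cat right of a dog")
print(data)
```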
NLP Applications 5
Nov 14 (Thu) 14:00-15:30 - Room: Riverfront Hall
in performance when the context is long and the number of attributes is large. Our code is available at: https://anonymous.4open.science/r/EAVE-EA18.
also release a benchmark dataset serving as a test bed for evaluating infringement behaviors by LLMs and stress the need for future alignment.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Computational Meme Understanding: A Survey
Khoi P. N. Nguyen, Vincent Ng
Computational Meme Understanding, which concerns the automated comprehension of memes, has garnered interest over the last four years and presents both substantial opportunities and challenges. We survey this emerging area of research by first introducing a comprehensive
taxonomy for memes along three dimensions – forms, functions, and topics. Next, we present three key tasks in Computational Meme Un-
derstanding, namely, classification, interpretation, and explanation, and conduct a comprehensive review of existing datasets and models,
discussing their limitations. Finally, we highlight the key challenges and recommend avenues for future work.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Synthetic Knowledge Ingestion: Towards Knowledge Refinement and Injection for Enhancing Large Language Models
Jiaxin Zhang, Wendi Cui, Yiran Huang, Kamalika Das, Sricharan Kumar
Large language models (LLMs) are proficient in capturing factual knowledge across various domains. However, refining their capabilities
on previously seen knowledge or integrating new knowledge from external sources remains a significant challenge. In this work, we pro-
pose a novel synthetic knowledge ingestion method called skr, which leverages fine-grained synthesis, interleaved generation, and
assemble augmentation strategies to construct high-quality data representations from raw knowledge sources. We then integrate skr and its
variations with three knowledge injection techniques: Retrieval Augmented Generation (RAG), Supervised Fine-tuning (SFT), and Continual
Pre-training (CPT) to inject and refine knowledge in language models. Extensive empirical experiments are conducted on various question-
answering tasks spanning finance, biomedicine, and open-generation domains to demonstrate that skr significantly outperforms baseline
methods by facilitating effective knowledge injection. We believe that our work is an important step towards enhancing the factual accuracy
of LLM outputs by refining knowledge representation and injection capabilities.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Adversarial Text Generation using Large Language Models for Dementia Detection
Youxiang Zhu, Nana Lin, Kiran Sandilya Balivada, Daniel Haehn, Xiaohui Liang
Although large language models (LLMs) excel in various text classification tasks, regular prompting strategies (e.g., few-shot prompting) do not work well for dementia detection via picture description. The challenge is that the linguistic markers of dementia are unclear, and an LLM may struggle to relate its internal knowledge to dementia detection. In this paper, we present an accurate and interpretable classification approach based on Adversarial Text Generation (ATG), a novel decoding strategy that can relate dementia detection to other tasks. We further develop a comprehensive set of instructions corresponding to various tasks and use them to guide ATG, achieving a best accuracy of 85%, a >10% improvement over regular prompting strategies. In addition, we introduce feature context, a human-understandable text that reveals the underlying features the LLM uses to classify dementia. From feature contexts, we found that dementia detection can be related
to tasks such as assessing attention to detail, language, and clarity with specific features of the environment, character, and other picture
content or language-related features. Future work includes incorporating multi-modal LLMs to interpret speech and picture information.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
EHRAgent: Code Empowers Large Language Models for Few-shot Complex Tabular Reasoning on Electronic Health Records
Wenqi Shi, Ran Xu, Yuchen Zhuang, Yue Yu, Jieyu Zhang, Hang Wu, Yuanda Zhu, Joyce C. Ho, Carl Yang, May Dongmei Wang
Clinicians often rely on data engineers to retrieve complex patient information from electronic health record (EHR) systems, a process that
is both inefficient and time-consuming. We propose EHRAgent, a large language model (LLM) agent empowered with accumulative domain
knowledge and robust coding capability. EHRAgent enables autonomous code generation and execution to facilitate clinicians in directly
interacting with EHRs using natural language. Specifically, we formulate a multi-tabular reasoning task based on EHRs as a tool-use plan-
ning process, efficiently decomposing a complex task into a sequence of manageable actions with external toolsets. We first inject relevant
medical information to enable EHRAgent to effectively reason about the given query, identifying and extracting the required records from
the appropriate tables. By integrating interactive coding and execution feedback, EHRAgent then effectively learns from error messages
and iteratively improves its originally generated code. Experiments on three real-world EHR datasets show that EHRAgent outperforms the
strongest baseline by up to 29.6% in success rate, verifying its strong capacity to tackle complex clinical tasks with minimal demonstrations.
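The interactive coding loop can be sketched schematically, with the LLM and execution sandbox as hypothetical stand-ins (the real system additionally injects medical knowledge and plans tool use):

```python
def solve(query, llm, run_code, max_turns=3):
    """Generate code for a clinical query, execute it, and revise on errors."""
    code = llm(f"Write Python to answer: {query}")
    for _ in range(max_turns):
        ok, output = run_code(code)            # execute against the EHR tables
        if ok:
            return output                       # successful answer
        code = llm(f"Fix this code given the error:\n{code}\n{output}")
    return None                                 # give up after repeated failures
```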
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
MP-RNA: Unleashing Multi-species RNA Foundation Model via Calibrated Secondary Structure Prediction
Heng Yang, Ke Li
RNA foundation models (FMs) have been extensively used to interpret genomic sequences and address a wide range of in-silico genomic
tasks. However, current RNA FMs often overlook the incorporation of secondary structures in the pretraining of FMs, which impedes the ef-
fectiveness in various genomic tasks. To address this problem, we leverage filtered high-fidelity structure annotations for structure pretraining
to enhance the modeling ability of FMs in single nucleotide resolution tasks. Experimental evaluations across four comprehensive genomic
benchmarks demonstrate that our RNA FM consistently outperforms existing RNA FMs, achieving a 40% improvement in RNA secondary
structure prediction and obtaining top-tier results on DNA genomic benchmarks even though it has not been pretrained on any DNA genome.
We release the code and models to encourage further research to bridge the gap between in-silico predictions and biological reality.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Are LLMs Effective Negotiators? Systematic Evaluation of the Multifaceted Capabilities of LLMs in Negotiation Dialogues
Deuksin Kwon, Emily Weiss, Tara Kulshrestha, Kushal Chawla, Gale Lucas, Jonathan Gratch
A successful negotiation requires a range of capabilities, including comprehension of the conversation context, Theory-of-Mind (ToM) skills to infer the partner's motives, strategic reasoning, and effective communication, making it challenging for automated systems. Despite the re-
markable performance of LLMs in various NLP tasks, there is no systematic evaluation of their capabilities in negotiation. Such an evaluation
is critical for advancing AI negotiation agents and negotiation research, ranging from designing dialogue systems to providing pedagogical
feedback and scaling up data collection practices. This work aims to systematically analyze the multifaceted capabilities of LLMs across
diverse dialogue scenarios throughout the stages of a typical negotiation interaction. Our analysis highlights GPT-4's superior performance in
many tasks while identifying specific challenges, such as making subjective assessments and generating contextually appropriate, strategically
advantageous responses.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
When Raw Data Prevails: Are Large Language Model Embeddings Effective in Numerical Data Representation for Medical Machine
Learning Applications?
Yanjun Gao, Skatje Myers, Shan Chen, Dmitriy Dligach, Timothy A Miller, Danielle Bitterman, Matthew Churpek, Majid Afshar
The introduction of Large Language Models (LLMs) has advanced data representation and analysis, bringing significant progress in their use for medical question answering. Despite these advancements, integrating tabular data, especially numerical data pivotal in clinical
contexts, into LLM paradigms has not been thoroughly explored. In this study, we examine the effectiveness of vector representations from
last hidden states of LLMs for medical diagnostics and prognostics using electronic health record (EHR) data. We compare the performance
of these embeddings with that of raw numerical EHR data when used as feature inputs to traditional machine learning (ML) algorithms that
excel at tabular data learning, such as eXtreme Gradient Boosting. We focus on instruction-tuned LLMs in a zero-shot setting to represent abnormal physiological data and evaluate their utility as feature extractors to enhance ML classifiers for predicting diagnoses, length of stay,
and mortality. Furthermore, we examine prompt engineering techniques on zero-shot and few-shot LLM embeddings to measure their impact
comprehensively. Although findings suggest the raw data features still prevail in medical ML tasks, zero-shot LLM embeddings demonstrate
competitive results, suggesting a promising avenue for future research in medical applications.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
TriageAgent: Towards Better Multi-Agents Collaborations for Large Language Model-Based Clinical Triage
Meng Lu, Brandon Ho, Dennis Ren, Xuan Wang
The global escalation in emergency department patient visits poses significant challenges to efficient clinical management, particularly in
clinical triage. Traditionally managed by human professionals, clinical triage is susceptible to substantial variability and high workloads.
Although large language models (LLMs) demonstrate promising reasoning and understanding capabilities, directly applying them to clin-
ical triage remains challenging due to the complex and dynamic nature of the clinical triage task. To address these issues, we introduce
TriageAgent, a novel heterogeneous multi-agent framework designed to enhance collaborative decision-making in clinical triage. TriageAgent
leverages LLMs for role-playing, incorporating self-confidence and early-stopping mechanisms in multi-round discussions to improve docu-
ment reasoning and classification precision for triage tasks. In addition, TriageAgent employs the medical Emergency Severity Index (ESI)
handbook through a retrieval-augmented generation (RAG) approach to provide precise clinical knowledge and integrates both coarse- and
fine-grained ESI-level predictions in the decision-making process. Extensive experiments demonstrate that TriageAgent outperforms state-
of-the-art LLM-based methods on three clinical triage test sets. Furthermore, we have released the first public benchmark dataset for clinical
triage with corresponding ESI levels and human expert performance for comparison.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Enabling Discriminative Reasoning in LLMs for Legal Judgment Prediction
Chenlong Deng, Kelong Mao, Yuyao Zhang, Zhicheng Dou
Legal judgment prediction is essential for enhancing judicial efficiency. In this work, we identify that existing large language models (LLMs)
underperform in this domain due to challenges in understanding case complexities and distinguishing between similar charges. To adapt
LLMs for effective legal judgment prediction, we introduce the Ask-Discriminate-Predict (ADAPT) reasoning framework inspired by human
judicial reasoning. ADAPT involves decomposing case facts, discriminating among potential charges, and predicting the final judgment. We
further enhance LLMs through fine-tuning with multi-task synthetic trajectories to improve legal judgment prediction accuracy and efficiency
under our ADAPT framework. Extensive experiments conducted on two widely-used datasets demonstrate the superior performance of our
framework in legal judgment prediction, particularly when dealing with complex and confusing charges.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Retrieval and Reasoning on KGs: Integrate Knowledge Graphs into Large Language Models for Complex Question Answering
Yixin Ji, Kaixin Wu, Juntao Li, Wei Chen, mingjie zhong, Xu Jia, Min Zhang
Although Large Language Models (LLMs) have performed impressively in various Natural Language Processing (NLP) tasks, their inherent hallucination phenomena severely challenge their credibility in complex reasoning. Combining explainable Knowledge Graphs (KGs) with LLMs is a promising path to address this issue. However, structured KGs are difficult to utilize, and how to make LLMs understand and incorporate them is a challenging topic. We thereby reorganize KGs into a more efficient structure, while designing KG-related instruction tuning and continual pre-training strategies to enable LLMs to learn and internalize this form of representation effectively. Moreover, we construct subgraphs to further enhance the retrieval capabilities of KGs via CoT reasoning. Extensive experiments on two KGQA datasets demonstrate that our model achieves convincing performance compared to strong baselines. The code is available at https://github.com/Dereck0602/Retrieval-and-Reasoning-on-KGs.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
MetaKP: On-Demand Keyphrase Generation
Di Wu, Xiaoxian Shen, Kai-Wei Chang
Traditional keyphrase prediction methods predict a single set of keyphrases per document, failing to cater to the diverse needs of users and
downstream applications. To bridge the gap, we introduce on-demand keyphrase generation, a novel paradigm that requires keyphrases that
conform to specific high-level goals or intents. For this task, we present MetaKP, a large-scale benchmark comprising four datasets, 7500
documents, and 3760 goals across news and biomedical domains with human-annotated keyphrases. Leveraging MetaKP, we design both su-
pervised and unsupervised methods, including a multi-task fine-tuning approach and a self-consistency prompting method with large language
models. The results highlight the challenges of supervised fine-tuning, whose performance is not robust to distribution shifts. By contrast,
the proposed self-consistency prompting approach greatly improves the performance of large language models, enabling GPT-4o to achieve
0.548 SemF1, surpassing the performance of a fully fine-tuned BART-base model. Finally, we demonstrate the potential of our method to
serve as a general NLP infrastructure, exemplified by its application in epidemic event detection from social media.
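As a toy illustration of the self-consistency idea, the sketch below keeps keyphrases that recur across several sampled generations; the samples, threshold, and phrases are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def self_consistent_keyphrases(samples: list[list[str]], min_frac=0.5) -> list[str]:
    """Keep keyphrases that appear in at least min_frac of the sampled generations."""
    counts = Counter(kp.lower() for sample in samples for kp in set(sample))
    threshold = min_frac * len(samples)
    return [kp for kp, c in counts.most_common() if c >= threshold]

# Three hypothetical generations for the same document and goal.
samples = [["vaccine rollout", "supply chain"],
           ["vaccine rollout", "cold storage"],
           ["vaccine rollout", "supply chain"]]
print(self_consistent_keyphrases(samples))  # ['vaccine rollout', 'supply chain']
```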
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
PRESTO: Progressive Pretraining Enhances Synthetic Chemistry Outcomes
He CAO, Yanjun Shao, Zhiyuan Liu, Zijing Liu, Xiangru Tang, Yuan Yao, Yu Li
Multimodal Large Language Models (MLLMs) have seen growing adoption across various scientific disciplines. These advancements encour-
age the investigation of molecule-text modeling within synthetic chemistry, a field dedicated to designing and conducting chemical reactions
to synthesize new compounds with desired properties and applications. Current approaches, however, often neglect the critical role of multi-
molecule graph interaction in understanding chemical reactions, leading to suboptimal performance in synthetic chemistry tasks. This study
introduces PRESTO (Progressive Pretraining Enhances Synthetic Chemistry Outcomes), a new framework that bridges the molecule-text
modality gap by integrating a comprehensive benchmark of pretraining strategies and dataset configurations. It progressively improves mul-
timodal LLMs through cross-modal alignment and multi-graph understanding. Our extensive experiments demonstrate that PRESTO offers
competitive results in downstream synthetic chemistry tasks. The code can be found at https://github.com/IDEA-XL/PRESTO.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science
Junho Kim, Yeachan Kim, Jun-Hyung Park, Yerim Oh, Suho Kim, SangKeun Lee
We introduce a novel continued pre-training method, MELT (MatEriaLs-aware continued pre-Training), specifically designed to efficiently
adapt pre-trained language models (PLMs) to materials science. Unlike previous adaptation strategies that solely focus on constructing
a domain-specific corpus, MELT comprehensively considers both the corpus and the training strategy, given that the materials science corpus has
distinct characteristics from other domains. To this end, we first construct a comprehensive materials knowledge base from the scientific
corpus by building semantic graphs. Leveraging this extracted knowledge, we integrate a curriculum into the adaptation process that begins
with familiar and generalized concepts and progressively moves toward more specialized terms. We conduct extensive experiments across
diverse benchmarks to verify the effectiveness and generality of MELT. A comprehensive evaluation convincingly supports the strength of
MELT, demonstrating superior performance compared to existing continued pre-training methods. In-depth analysis also shows that MELT
enables PLMs to effectively represent materials entities compared to the existing adaptation methods, thereby highlighting its broad applica-
bility across a wide spectrum of materials science.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Cross-Lingual Unlearning of Selective Knowledge in Multilingual Language Models
Minseok Choi, Kyunghyun Min, Jaegul Choo
Pretrained language models memorize vast amounts of information, including private and copyrighted data, raising significant safety con-
cerns. Retraining these models after excluding sensitive data is prohibitively expensive, making machine unlearning a viable, cost-effective
alternative. Previous research has focused on machine unlearning for monolingual models, but we find that unlearning in one language does
not necessarily transfer to others. This vulnerability makes models susceptible to low-resource language attacks, where sensitive information
remains accessible in less dominant languages. This paper presents a pioneering approach to machine unlearning for multilingual language
models, selectively erasing information across different languages while maintaining overall performance. Specifically, our method employs
an adaptive unlearning scheme that assigns language-dependent weights to account for the varying performance of multilingual language
models. Empirical results demonstrate the effectiveness of our framework compared to existing unlearning baselines, setting a new standard
for secure and adaptable multilingual language models.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Using LLMs to simulate students’ responses to exam questions
Luca Benedetto, Giovanni Aradelli, Antonia Donvito, Alberto Lucchetti, Andrea Cappelli, Paula Buttery
Previous research leveraged Large Language Models (LLMs) in numerous ways in the educational domain. Here, we show that they can
be used to answer exam questions simulating students of different skill levels and share a prompt, engineered for GPT-3.5, that enables
the simulation of varying student skill levels on questions from different educational domains. We evaluate the proposed prompt on three
publicly available datasets (one from science exams and two from English reading comprehension exams) and three LLMs (two versions of
GPT-3.5 and one of GPT-4), and show that it is robust to different educational domains and capable of generalising to data unseen during
the prompt engineering phase. We also show that, being engineered for a specific version of GPT-3.5, the prompt does not generalise well
to different LLMs, stressing the need for prompt engineering for each model in practical applications. Lastly, we find that there is not a di-
rect correlation between the quality of the rationales obtained with chain-of-thought prompting and the accuracy in the student simulation task.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Revisiting the Impact of Pursuing Modularity for Code Generation
Deokyeong Kang, KiJung Seo, Taeuk Kim
Modular programming, which aims to construct the final program by integrating smaller, independent building blocks, has been regarded
as a desirable practice in software development. However, with the rise of recent code generation agents built upon large language models
(LLMs), a question emerges: is this traditional practice equally effective for these new tools? In this work, we assess the impact of modularity
in code generation by introducing a novel metric for its quantitative measurement. Surprisingly, and contrary to conventional wisdom, we
find that modularity is not a core factor for improving the performance of code generation models. We also explore potential explanations for
why LLMs do not exhibit a preference for modular code compared to non-modular code.
state-of-the-art (SOTA) knowledge editing methods in the multi-hop question answering benchmark, MQuAKE, especially in scenarios with
extensive knowledge edits.
Sara Pieri, Sahal Shaji Mullappilly, Fahad Shahbaz Khan, Rao Muhammad Anwer, Salman Khan, Timothy Baldwin, Hisham Cholakkal
In this paper, we introduce BiMediX, the first bilingual medical mixture of experts LLM designed for seamless interaction in both English
and Arabic. Our model facilitates a wide range of medical interactions in English and Arabic, including multi-turn chats to inquire about ad-
ditional details such as patient symptoms and medical history, multiple-choice question answering, and open-ended question answering. We
propose a semi-automated English-to-Arabic translation pipeline with human refinement to ensure high-quality translations. We also intro-
duce a comprehensive evaluation benchmark for Arabic medical LLMs. Furthermore, we introduce BiMed1.3M, an extensive Arabic-English
bilingual instruction set that covers 1.3 Million diverse medical interactions, including 200k synthesized multi-turn doctor-patient chats, in
a 1:2 Arabic-to-English ratio. Our model outperforms state-of-the-art Med42 and Meditron by average absolute gains of 2.5% and 4.1%,
respectively, computed across multiple medical evaluation benchmarks in English, while operating with 8× faster inference. Moreover, our
BiMediX outperforms the generic Arabic-English bilingual LLM, Jais-30B, by average absolute gains of 10% on our Arabic and 15% on
our bilingual evaluations across multiple datasets. Additionally, BiMediX exceeds the accuracy of GPT-4 by 4.4% on the open-ended question UPHILL evaluation and largely outperforms state-of-the-art open-source medical LLMs in human evaluations of multi-turn conversations.
Our trained models, instruction set, and source code are available at https://github.com/mbzuai-oryx/BiMediX.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Unleashing Large Language Models’ Proficiency in Zero-shot Essay Scoring
Sanwoo Lee, Yida Cai, Desong Meng, Ziyang Wang, Yunfang Wu
Advances in automated essay scoring (AES) have traditionally relied on labeled essays, requiring tremendous cost and expertise for their
acquisition. Recently, large language models (LLMs) have achieved great success in various tasks, but their potential is less explored in AES.
In this paper, we show that our zero-shot prompting framework, Multi Trait Specialization (MTS), elicits LLMs’ ample potential for essay
scoring. In particular, we automatically decompose writing proficiency into distinct traits and generate scoring criteria for each trait. Then, an
LLM is prompted to extract trait scores from several conversational rounds, each round scoring one of the traits based on the scoring criteria.
Finally, we derive the overall score via trait averaging and min-max scaling. Experimental results on two benchmark datasets demonstrate
that MTS consistently outperforms straightforward prompting (Vanilla) in average QWK across all LLMs and datasets, with maximum gains
of 0.437 on TOEFL11 and 0.355 on ASAP. Additionally, with the help of MTS, the small-sized Llama2-13b-chat substantially outperforms
ChatGPT, facilitating an effective deployment in real applications.
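The final aggregation step lends itself to a small sketch: average the per-trait scores an LLM produced, then min-max scale into the target score range. The trait names and score ranges below are illustrative assumptions.

```python
def overall_score(trait_scores: dict[str, float],
                  trait_min=1.0, trait_max=5.0,
                  out_min=0.0, out_max=12.0) -> float:
    """Derive an overall essay score via trait averaging and min-max scaling."""
    avg = sum(trait_scores.values()) / len(trait_scores)
    scaled = (avg - trait_min) / (trait_max - trait_min)  # min-max to [0, 1]
    return out_min + scaled * (out_max - out_min)

# Hypothetical per-trait scores extracted from separate conversational rounds.
print(overall_score({"coherence": 4.0, "grammar": 3.5, "vocabulary": 4.5}))  # 9.0
```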
Tianyu Liu, Yijia Xiao, Xiao Luo, Hua Xu, Wenjin Zheng, Hongyu Zhao
The applications of large language models (LLMs) are promising for biomedical and healthcare research. Despite the availability of open-
source LLMs trained using a wide range of biomedical data, current research on the applications of LLMs to genomics and proteomics is
still limited. To fill this gap, we propose a collection of finetuned LLMs and multimodal LLMs (MLLMs), known as Geneverse, for three
novel tasks in genomic and proteomic research. The models in Geneverse are trained and evaluated based on domain-specific datasets, and
we use advanced parameter-efficient finetuning techniques to achieve model adaptation for tasks including the generation of descriptions of gene functions, the inference of protein function from structure, and marker gene selection from spatial transcriptomic data. We demonstrate
that adapted LLMs and MLLMs perform well for these tasks and may outperform closed-source large-scale models based on our evaluations
focusing on both truthfulness and structural correctness. All of the training strategies and base models we used are freely accessible. Our
codes can be found at https://github.com/HelloWorldLTY/Geneverse.
Nov 14 (Thu) 14:00-15:30 - Riverfront Hall
Generating Media Background Checks for Automated Source Critical Reasoning
Michael Sejr Schlichtkrull
Not everything on the internet is true. This unfortunate fact requires both humans and models to perform complex reasoning about credibility
when working with retrieved information. In NLP, this problem has seen little attention. Indeed, retrieval-augmented models are not typically
expected to distrust retrieved documents. Human experts overcome the challenge by gathering signals about the context, reliability, and ten-
dency of source documents; that is, they perform source criticism. We propose a novel NLP task focused on finding and summarising such
signals. We introduce a new dataset of 6,709 "media background checks" derived from Media Bias / Fact Check, a volunteer-run website doc-
umenting media bias. We test open-source and closed-source LLM baselines with and without retrieval on this dataset, finding that retrieval
greatly improves performance. We furthermore carry out human evaluation, demonstrating that 1) media background checks are helpful for
humans, and 2) media background checks are helpful for retrieval-augmented models.
Demo
(Nov 12): 17:45-18:45 (Evening) - Room: Gather
Adam Fourney, Chi Wang, Erkang Zhu, Gagan Bansal, Jingya Chen, Saleema Amershi, Suff Syed, Victor Dibia
Multi-agent systems, where multiple agents (generative AI models + tools) collaborate, are emerging as an effective pattern for solving long-running, complex tasks in numerous domains. However, specifying their parameters (such as models, tools, and orchestration mechanisms) and debugging them remains challenging for most developers. To address this challenge, we present AUTOGEN STUDIO, a no-code developer tool for rapidly prototyping, debugging, and evaluating multi-agent workflows built upon the AUTOGEN framework. AUTOGEN STUDIO offers a web interface and a Python API for representing LLM-enabled agents using a declarative (JSON-based) specification. It provides an intuitive drag-and-drop UI for agent workflow specification, interactive evaluation and debugging of workflows, and a gallery of reusable agent components. We highlight four design principles for no-code multi-agent developer tools and contribute an open-source implementation: https://github.com/microsoft/autogen/tree/autogenstudio/samples/apps/autogen-studio
Generation
(Nov 12): 17:45-18:45 (Evening) - Room: Gather
Information Extraction
(Nov 12): 17:45-18:45 (Evening) - Room: Gather
Le Yan, Zhen Qin, Honglei Zhuang, Rolf Jagerman, Xuanhui Wang, Michael Bendersky, Harrie Oosterhuis
The powerful generative abilities of large language models (LLMs) show potential in generating relevance labels for search applications.
Previous work has found that directly asking about relevancy, such as "How relevant is document A to query Q?", results in suboptimal ranking. Instead, the pairwise-ranking prompting (PRP) approach produces promising ranking performance by asking about pairwise comparisons, e.g., "Is document A more relevant than document B to query Q?". Thus, while LLMs have an effective ranking ability, it is not reflected in their relevance label generation. In this work, we propose a post-processing method to consolidate the relevance labels generated by an LLM with its powerful ranking abilities. Our method takes as input both LLM-generated relevance labels and pairwise preferences.
The labels are then altered to satisfy the pairwise preferences of the LLM, while staying as close to the original values as possible. Our
experimental results indicate that our approach effectively balances label accuracy and ranking performance. Thereby, our work shows it is
possible to combine both the ranking and labeling abilities of LLMs through post-processing.
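As a rough sketch of the consolidation idea, the toy code below nudges LLM-generated labels until the stated pairwise preferences hold while keeping values close to the originals; the paper's actual optimization may differ.

```python
def consolidate(labels: dict, prefers: list, margin=0.01, max_iters=100):
    """Minimally adjust labels so every 'winner is more relevant than loser' holds."""
    labels = dict(labels)
    for _ in range(max_iters):          # guard against cyclic preferences
        violated = [(w, l) for w, l in prefers if labels[w] <= labels[l]]
        if not violated:
            break
        for w, l in violated:           # split the difference around the midpoint
            mid = (labels[w] + labels[l]) / 2
            labels[w], labels[l] = mid + margin, mid - margin
    return labels

print(consolidate({"docA": 2.0, "docB": 2.0, "docC": 1.0},
                  prefers=[("docA", "docB"), ("docB", "docC")]))
# {'docA': 2.01, 'docB': 1.99, 'docC': 1.0}
```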
(Nov 12): 17:45-18:45 (Evening) - Gather
Multi-Granularity History and Entity Similarity Learning for Temporal Knowledge Graph Reasoning
Shi Mingcong, Chunjiang Zhu, Detian Zhang, Shiting Wen, Qing Li
Temporal Knowledge Graph (TKG) reasoning, aiming to predict future unknown facts based on historical information, has attracted consid-
erable attention due to its great practical value. Insight into history is the key to predicting the future. However, most existing TKG reasoning models capture only repetitive history, ignoring an entity's multi-hop neighbour history, which can provide valuable background knowledge for TKG reasoning. In this paper, we propose the Multi-Granularity History and Entity Similarity Learning (MGESL) model for Temporal
Knowledge Graph Reasoning, which models historical information from both coarse-grained and fine-grained history. Since similar entities
tend to exhibit similar behavioural patterns, we also design a hypergraph convolution aggregator to capture the similarity between entities.
Furthermore, we introduce a more realistic setting for TKG reasoning, where candidate entities are already known at the timestamp to be
predicted. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our proposed model.
(Nov 12): 17:45-18:45 (Evening) - Gather
PepRec: Progressive Enhancement of Prompting for Recommendation
Yakun Yu, Shi-ang Qi, Baochun Li, Di Niu
With large language models (LLMs) achieving remarkable breakthroughs in natural language processing (NLP) domains, recent researchers
have actively explored the potential of LLMs for recommendation systems by converting the input data into textual sentences through prompt
templates. Although semantic knowledge from LLMs can help enrich the content information of items, to date it is still hard for them to
achieve comparable performance to traditional deep learning recommendation models, partly due to a lack of ability to leverage collabora-
tive filtering. In this paper, we propose a novel training-free prompting framework, PepRec, which aims to capture knowledge from both
content-based filtering and collaborative filtering to boost recommendation performance with LLMs, while providing interpretation for the
recommendation. Experiments based on two real-world datasets from different domains show that PepRec significantly outperforms various
traditional deep learning recommendation models and prompt-based recommendation systems.
effectively.
(Nov 12): 17:45-18:45 (Evening) - Gather
Formality Favored: Unraveling the Learning Preferences of Large Language Models on Data with Conflicting Knowledge
Jiahuan Li, Yiqing Cao, Shujian Huang, Jiajun Chen
Having been trained on massive pretraining data, large language models have shown excellent performance on many knowledge-intensive
tasks. However, pretraining data tends to contain misleading and even conflicting information, and it is intriguing to understand how LLMs
handle these noisy data during training. In this study, we systematically analyze LLMs' learning preferences for data with conflicting knowl-
edge. We find that pretrained LLMs establish learning preferences similar to humans, i.e., preferences towards formal texts and texts with
fewer spelling errors, resulting in faster learning and more favorable treatment of knowledge in data with such features when facing conflicts.
This finding is generalizable across models and languages and is more evident in larger models. An in-depth analysis reveals that LLMs tend
to trust data with features that signify consistency with the majority of data, and it is possible to instill new preferences and erase old ones by
manipulating the degree of consistency with the majority data.
Language Modeling
(Nov 12): 17:45-18:45 (Evening) - Room: Gather
trained on the code-nested data in Stage-2 to get the resulting MuMath-Code. Our MuMath-Code-7B achieves 83.8% on GSM8K and 52.4% on MATH, while the MuMath-Code-70B model achieves new state-of-the-art performance among open methods, achieving 90.7% on GSM8K and 55.1% on MATH. Extensive experiments validate the combination of tool use and data augmentation, as well as our two-stage training strategy. We release the proposed dataset along with the associated code for public use: https://github.com/youweihao-tal/MuMath-Code.
(Nov 12): 17:45-18:45 (Evening) - Gather
Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning
Jiahui Li, Hanlin Zhang, Fengda Zhang, Tai-Wei Chang, Kun Kuang, Long Chen, JUN ZHOU
Reinforcement learning from human feedback (RLHF) and AI-generated feedback (RLAIF) have become prominent techniques that signifi-
cantly enhance the functionality of pre-trained language models (LMs). These methods harness feedback, sourced either from humans or AI,
as direct rewards or to shape reward models that steer LM optimization. Nonetheless, the effective integration of rewards from diverse sources
presents a significant challenge due to their disparate characteristics. To address this, recent research has developed algorithms incorporating
strategies such as weighting, ranking, and constraining to handle this complexity. Despite these innovations, a bias toward disproportionately
high rewards can still skew the reinforcement learning process and negatively impact LM performance. This paper explores a methodology for
reward composition that enables simultaneous improvements in LMs across multiple dimensions. Inspired by fairness theory, we introduce
a training algorithm that aims to reduce disparity and enhance stability among various rewards. Our method treats the aggregate reward as a
dynamic weighted sum of individual rewards, with alternating updates to the weights and model parameters. For efficient and straightforward
implementation, we employ an estimation technique rooted in the mirror descent method for weight updates, eliminating the need for gradient
computations. The empirical results under various types of rewards across a wide range of scenarios demonstrate the effectiveness of our
method.
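A small illustrative sketch of the weight update, treating it as an exponentiated-gradient (mirror-descent-style) step on the simplex that shifts weight toward lagging rewards; the exact estimator and the reward definitions below are assumptions, not the paper's specification.

```python
import math

def update_weights(weights, rewards, eta=0.5):
    """Exponentiated-gradient step: down-weight rewards that are already high."""
    raw = [w * math.exp(-eta * r) for w, r in zip(weights, rewards)]
    total = sum(raw)
    return [x / total for x in raw]  # renormalize onto the simplex

def composite_reward(weights, rewards):
    """Aggregate reward as a dynamic weighted sum of the individual rewards."""
    return sum(w * r for w, r in zip(weights, rewards))

w = [1 / 3] * 3
for rewards in [[0.9, 0.2, 0.5], [0.8, 0.3, 0.5]]:  # e.g., helpfulness, safety, style
    print(round(composite_reward(w, rewards), 3))
    w = update_weights(w, rewards)                   # alternates with policy updates
```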
(Nov 12): 17:45-18:45 (Evening) - Gather
Leveraging BERT and TFIDF Features for Short Text Clustering via Alignment-Promoting Co-Training
Zetong Li, Qinliang Su, Shijing Si, Jianxing Yu
BERT and TFIDF features excel in capturing rich semantics and important words, respectively. Since most existing clustering methods are
solely based on the BERT model, they often fall short in utilizing keyword information, which, however, is very useful in clustering short
texts. In this paper, we propose a CO-Training Clustering (COTC) framework to make use of the collective strengths of
BERT and TFIDF features. Specifically, we develop two modules responsible for the clustering of BERT and TFIDF features, respectively.
We use the deep representations and cluster assignments from the TFIDF module outputs to guide the learning of the BERT module, seeking
to align them at both the representation and cluster levels. Conversely, we also use the BERT module outputs to train the TFIDF module,
thus leading to the mutual promotion. We then show that the alternating co-training framework can be placed under a unified joint training
objective, which allows the two modules to be connected tightly and the training signals to be propagated efficiently. Experiments on eight
benchmark datasets show that our method outperforms current SOTA methods significantly.
(Nov 12): 17:45-18:45 (Evening) - Gather
One-to-Many Communication and Compositionality in Emergent Communication
Heeyoung Lee
Compositional languages leverage rules that derive meaning from combinations of simpler constituents. This property is considered to be
the hallmark of human language as it enables the ability to express novel concepts and ease of learning. As such, numerous studies in the
emergent communication field explore the prerequisite conditions for the emergence of compositionality. Most of these studies set up a one-to-one communication environment wherein a speaker interacts with a single listener during a single round of a communication game. However,
real-world communications often involve multiple listeners; their interests may vary and they may even need to coordinate among themselves
to be successful at a given task. This work investigates the effects of one-to-many communication environment on emergent languages where
a single speaker broadcasts its message to multiple listeners to cooperatively solve a task. We observe that simply broadcasting the speaker’s
message to multiple listeners does not induce more compositional languages. We then find and analyze two axes of environmental pressures
that facilitate the emergence of compositionality: listeners with different interests and coordination among listeners.
Machine Translation
(Nov 12): 17:45-18:45 (Evening) - Room: Gather
model various information. Second, DeMPT employs a heuristic way to further discriminately enhance the utilization of the source-side inter-
and intra-sentence information at the final decoding phase. Experiments show that our approach significantly outperforms the concatenation
method, and further improves the performance of LLMs in discourse modeling.
NLP Applications
(Nov 12): 17:45-18:45 (Evening) - Room: Gather
Question Answering
(Nov 12): 17:45-18:45 (Evening) - Room: Gather
up on the nuances of language that signal distress or urgency. The dataset supports binary classification of social media posts into those requesting help and those not requesting help during the war. The dataset could significantly improve humanitarian efforts, allowing quicker and more targeted help for those facing the challenges of war. Moreover, baseline models are implemented, and GPT-3.5 achieved an accuracy of 81.15%.
(Nov 13): 7:45-8:45 (Morning) - Gather
On Fake News Detection with LLM Enhanced Semantics Mining
Xiaoxiao Ma, Yuchen Zhang, Kaize Ding, Jian Yang, Jia Wu, Hao Fan
Large language models (LLMs) have emerged as valuable tools for enhancing textual features in various text-related tasks. Despite their
superiority in capturing the lexical semantics between tokens for text analysis, our preliminary study on two popular LLMs, i.e., ChatGPT
and Llama2, showcases that simply applying the news embeddings from LLMs is ineffective for fake news detection. Such embeddings only
encapsulate the language styles between tokens. Meanwhile, the high-level semantics among named entities and topics, which reveal the
deviating patterns of fake news, have been ignored. Therefore, we propose a topic model together with a set of specially designed prompts
to extract topics and real entities from LLMs and model the relations among news, entities, and topics as a heterogeneous graph to facilitate
investigating news semantics. We then propose a Generalized Page-Rank model and a consistent learning criterion for mining the local and
global semantics centered on each news piece through the adaptive propagation of features across the graph. Our model shows superior
performance on five benchmark datasets over seven baseline methods and the efficacy of the key ingredients has been thoroughly validated.
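To give a flavor of the propagation step, here is a toy personalized-PageRank-style iteration over a tiny news-entity-topic graph; the adjacency, damping factor, and seeding are illustrative stand-ins for the paper's Generalized Page-Rank model.

```python
def propagate(adj, seed, alpha=0.85, iters=50):
    """Propagate scores from a seed news node across the heterogeneous graph."""
    n = len(adj)
    scores = seed[:]
    for _ in range(iters):
        new = [0.0] * n
        for i in range(n):
            out = sum(adj[i]) or 1                 # avoid division by zero
            for j in range(n):
                if adj[i][j]:
                    new[j] += alpha * scores[i] * adj[i][j] / out
        scores = [nv + (1 - alpha) * s for nv, s in zip(new, seed)]
    return scores

# Nodes: [news, entity, topic]; a symmetric toy heterogeneous graph.
adj = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
print([round(s, 3) for s in propagate(adj, seed=[1.0, 0.0, 0.0])])
```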
(Nov 13): 7:45-8:45 (Morning) - Gather
Message Passing on Semantic-Anchor-Graphs for Fine-grained Emotion Representation Learning and Classification
Pinyi Zhang, Jingyang Chen, Junchen Shen, Zijie Zhai, Ping Li, Jie Zhang, Kai Zhang
Emotion classification has wide applications in education, robotics, virtual reality, etc. However, identifying subtle differences between
fine-grained emotion categories remains challenging. Current methods typically aggregate numerous token embeddings of a sentence into
a single vector, which, while being an efficient compressor, may not fully capture complex semantic and temporal distributions. To solve
this problem, we propose SEmantic ANchor Graph Neural Networks (SEAN-GNN) for fine-grained emotion classification. It learns a group
of representative, multi-faceted semantic anchors in the token embedding space: using these anchors as a global reference, any sentence
can be projected onto them to form a "semantic-anchor graph", with node attributes and edge weights quantifying the semantic and tem-
poral information respectively. The graph structure is well aligned across sentences and, importantly, allows for generating comprehensive
emotion representations regarding K different anchors. Message passing on this graph can further integrate and refine the learned features.
Empirically, SEAN-GNN can generate meaningful semantic anchors and discriminative graph patterns for different emotions, with promising classification results on 6 popular benchmark datasets against state-of-the-art methods.
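As a rough sketch of the graph construction, the code below projects each token onto its nearest anchor, uses anchor hit rates as node attributes, and counts temporally adjacent hits as edge weights; the projection and weighting schemes are simplifying assumptions.

```python
def nearest_anchor(token, anchors):
    """Index of the anchor with the highest dot-product similarity."""
    return max(range(len(anchors)),
               key=lambda k: sum(t * a for t, a in zip(token, anchors[k])))

def semantic_anchor_graph(tokens, anchors):
    K = len(anchors)
    hits = [nearest_anchor(t, anchors) for t in tokens]      # project each token
    nodes = [hits.count(k) / len(tokens) for k in range(K)]  # semantic attributes
    edges = [[0] * K for _ in range(K)]
    for a, b in zip(hits, hits[1:]):                         # temporal adjacency
        edges[a][b] += 1
    return nodes, edges

anchors = [[1.0, 0.0], [0.0, 1.0]]             # two toy semantic anchors
tokens = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]  # token embeddings of a sentence
print(semantic_anchor_graph(tokens, anchors))
```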
Demo
(Nov 13): 7:45-8:45 (Morning) - Room: Gather
neck by equipping LLMs with external knowledge, a technique known as Retrieval Augmented Generation (RAG). However, two key issues have constrained the development of RAG. First, there is a growing lack of comprehensive and fair comparisons between novel RAG algorithms.
Second, open-source tools such as LlamaIndex and LangChain employ high-level abstractions, which results in a lack of transparency and
limits the ability to develop novel algorithms and evaluation metrics. To close this gap, we introduce RAGLAB, a modular and research-
oriented open-source library. RAGLAB reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG
algorithms. Leveraging RAGLAB, we conduct a fair comparison of 6 RAG algorithms across 10 benchmarks. With RAGLAB, researchers
can efficiently compare the performance of various algorithms and develop novel algorithms.
(Nov 13): 7:45-8:45 (Morning) - Gather
Sailor: Open Language Models for South-East Asia
Guangtao Zeng, Jia Guo, Jiahui Zhou, Longxu Dou, Min Lin, Qian Liu, Xin Mao, Wei Lu, jin ziqi
We present Sailor, a family of open language models ranging from 0.5B to 14B parameters, tailored for South-East Asian (SEA) languages.
Built on Qwen1.5, Sailor models are continually pre-trained on 200B to 400B tokens, primarily covering English, Chi-
nese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout for improving the
model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize the data mixture. Experimental results
on four typical tasks indicate that Sailor models demonstrate strong performance across different benchmarks, including commonsense rea-
soning, question answering, reading comprehension and examination. We share our insights to spark a wider interest in developing large
language models for multilingual use cases.
(Nov 13): 7:45-8:45 (Morning) - Gather
DeepPavlov 1.0: Your Gateway to Advanced NLP Models Backed by Transformers and Transfer Learning
Alexander Popov, Anastasia Voznyuk, Anna Korzanova, Dmitry Karpov, Fedor Ignatov, Maksim Savkin, Vasily Konovalov
We present DeepPavlov 1.0, an open-source framework for using Natural Language Processing (NLP) models by leveraging transfer learning
techniques. DeepPavlov 1.0 is created for modular and configuration-driven development of state-of-the-art NLP models and supports a wide
range of NLP model applications. DeepPavlov 1.0 is designed for practitioners with limited knowledge of NLP/ML. DeepPavlov is based on
PyTorch and supports HuggingFace transformers. DeepPavlov is publicly released under the Apache 2.0 license and provides access to an
online demo.
(Nov 13): 7:45-8:45 (Morning) - Gather
Medico: Towards Hallucination Detection and Correction with Multi-source Evidence Fusion
Baotian Hu, Jifang Wang, Jindi Yu, Min zhang, Xinping Zhao, Yibin Chen, zhenyu liu, dongfang li
Hallucinations prevail in Large Language Models (LLMs): the generated content is coherent but factually incorrect, which hampers the widespread application of LLMs. Previous studies have shown that LLMs can confidently state non-existent facts rather than answering "I don't know". Therefore, it is necessary to resort to external knowledge to detect and correct the hallucinated content. Since manual detection and correction of factual errors is labor-intensive, developing an automatic end-to-end
hallucination-checking approach is indeed a needful thing. To this end, we present Medico, a Multi-source evidence fusion enhanced hallu-
cination detection and correction framework. It fuses diverse evidence from multiple sources, detects whether the generated content contains
factual errors, provides the rationale behind the judgment, and iteratively revises the hallucinated content. Experimental results on evidence
retrieval (0.964 HR@5, 0.908 MRR@5), hallucination detection (0.927-0.951 F1), and hallucination correction (0.973-0.979 approval rate)
manifest the great potential of Medico. A video demo of Medico can be found at https://youtu.be/RtsO6CSesBI.
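The loop the abstract describes (fuse evidence, detect, explain, revise, repeat) can be sketched as control flow; every component below is a trivial stand-in, not Medico's actual retrievers or verifiers:

```python
# Control-flow sketch of an iterative fuse-detect-revise loop in the spirit
# of the Medico abstract. All components are toy stand-ins.

def retrieve_evidence(claim, sources):
    # Stand-in retrieval: gather facts sharing the claim's first word.
    key = claim.split()[0].lower()
    return [f for facts in sources.values() for f in facts if key in f.lower()]

def detect(claim, evidence):
    # Stand-in detector with a rationale for its judgment.
    supported = any(claim.lower() == f.lower() for f in evidence)
    return supported, "supported by evidence" if supported else "no supporting evidence"

def revise(claim, evidence):
    # Stand-in corrector: fall back to the best retrieved fact.
    return evidence[0] if evidence else claim

def check_and_correct(claim, sources, max_rounds=3):
    rationale = "no evidence retrieved"
    for _ in range(max_rounds):
        evidence = retrieve_evidence(claim, sources)
        ok, rationale = detect(claim, evidence)
        if ok:
            break
        claim = revise(claim, evidence)   # iterative revision
    return claim, rationale

sources = {"wiki": ["Miami is a city in Florida."], "news": []}
print(check_and_correct("Miami is a city in Georgia.", sources))
# -> ('Miami is a city in Florida.', 'supported by evidence')
```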
(Nov 13): 7:45-8:45 (Morning) - Gather
OpenOmni: A Collaborative Open Source Tool for Building Future-Ready Multimodal Conversational Agents
Qiang Sun, Sirui Li, Wei Liu, Wenxiao Zhang, Yuanyi Luo
Multimodal conversational agents are highly desirable because they offer natural and human-like interaction. However, there is a lack of
comprehensive end-to-end solutions to support collaborative development and benchmarking. While proprietary systems like GPT-4o and Gemini demonstrate impressive integration of audio, video, and text with response times of 200-250 ms, challenges remain in balancing latency, accuracy, cost, and data privacy. To better understand and quantify these issues, we developed OpenOmni, an open-source,
end-to-end pipeline benchmarking tool that integrates advanced technologies such as Speech-to-Text, Emotion Detection, Retrieval Aug-
mented Generation, Large Language Models, along with the ability to integrate customized models. OpenOmni supports local and cloud
deployment, ensuring data privacy and supporting latency and accuracy benchmarking. This flexible framework allows researchers to cus-
tomize the pipeline, focusing on real bottlenecks and facilitating rapid proof-of-concept development. OpenOmni can significantly en-
hance applications like indoor assistance for visually impaired individuals, advancing human-computer interaction. Our demonstration
video is available https://www.youtube.com/watch?v=zaSiT3clWqY, demo is available via https://openomni.ai4wa.com, code is available
via https://github.com/AI4WA/OpenOmniFramework.
340
Posters and Demos
users to upload any country-based language model generations they wish to analyze. To showcase CAVA’s efficacy, we present a case study
analyzing how several popular language models answer survey questions from the World Values Survey.
341
Posters and Demos
TransAgents employs specialized agents (Senior Editor, Junior Editor, Translator, Localization Specialist, and Proofreader) to collaboratively
produce translations that are accurate, culturally sensitive, and of high quality. Our system is flexible, allowing users to configure their transla-
tion company based on specific needs, and universal, with empirical evidence showing superior performance across various domains compared
to state-of-the-art methods. Additionally, TransAgents features a user-friendly interface and offers translations at a cost approximately 80×
cheaper than professional human translation services. Evaluations on literary, legal, and financial test sets demonstrate that TransAgents
produces translations preferred by human evaluators, even surpassing human-written references in literary contexts. Our live demo website is
available at https://www.transagents.ai/. Our demonstration video is available at https://www.youtube.com/watch?v=p7jIAtF-WKc.
342
Posters and Demos
Xianwei Zhuang, Zhihong Zhu, Zhanpeng Chen, Yuxin Xie, Liming Liang, Yuexian Zou
Large Vision-Language Models (LVLMs) may produce outputs that are unfaithful to reality, also known as visual hallucinations (VH), which
hinder their application in multimodal understanding and decision-making. In this work, we introduce a novel plug-and-play, training-free decod-
ing algorithm named Game and Tree based Hallucination Mitigation (GTHM), designed for mitigating VH. GTHM is inspired by empirical
observations that the fuzziness of multi-granularity view perception exacerbates VH. Based on this, GTHM leverages visual information to
construct a coarse-to-fine visual view tree (CFTree) that organizes visual objects, attributes, and relationships in a hierarchical manner. Addi-
tionally, we innovatively model the optimal visual-token matching process on the CFTree as a cooperative game. Specifically, we define the
Tree-based Shapley Value (TSV) for each visual view on the CFTree to assess its significant contribution to the overall visual understanding,
thereby determining the optimal visual granularity. Subsequently, we utilize the TSV as guidance to implement adaptive weight contrastive
decoding to achieve vision-aware decoding. Extensive experiments on four popular benchmarks confirm the effectiveness of our GTHM in
alleviating VH across different LVLM families without additional training or post-processing. Our code is published at https://github.com/mengchuang123/GTHM.
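For readers unfamiliar with contrastive decoding, the general recipe GTHM builds on looks roughly as follows; the per-step contrast weight stands in for the paper's TSV guidance, whose computation is not reproduced here:

```python
import numpy as np

def contrastive_decode_step(logits_full, logits_degraded, weight):
    """Generic contrastive decoding: amplify what the full visual view knows
    beyond a degraded view. `weight` plays the role GTHM assigns to its
    Tree-based Shapley Value guidance (simplified here)."""
    contrasted = (1.0 + weight) * logits_full - weight * logits_degraded
    probs = np.exp(contrasted - contrasted.max())   # stable softmax
    return probs / probs.sum()

rng = np.random.default_rng(0)
logits_full = rng.normal(size=8)
logits_degraded = logits_full + rng.normal(scale=0.5, size=8)  # blurrier view
for w in (0.0, 0.5, 1.0):  # larger weight -> stronger reliance on the contrast
    print(w, contrastive_decode_step(logits_full, logits_degraded, w).round(3))
```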
(Nov 13): 7:45-8:45 (Morning) - Gather
Large Language Model-based Human-Agent Collaboration for Complex Task Solving
Xueyang Feng, Zhi-Yuan Chen, Yujia Qin, Yankai Lin, Xu Chen, Zhiyuan Liu, Ji-Rong Wen
In recent developments within the research community, the integration of Large Language Models (LLMs) in creating fully autonomous
agents has garnered significant interest. Despite this, LLM-based agents frequently demonstrate notable shortcomings in adjusting to dy-
namic environments and fully grasping human needs. In this work, we introduce the problem of LLM-based human-agent collaboration for
complex task-solving, exploring their synergistic potential. To tackle the problem, we propose a Reinforcement Learning-based Human-
Agent Collaboration method, ReHAC, which trains a policy model designed to determine the most opportune stages for human intervention
within the task-solving process. We conduct experiments under real and simulated human-agent collaboration scenarios. Experimental results
demonstrate that the synergistic efforts of humans and LLM-based agents significantly improve performance in complex tasks, primarily
through well-planned, limited human intervention. Datasets and code are available at: https://github.com/XueyangFeng/ReHAC/.
(Nov 13): 7:45-8:45 (Morning) - Gather
BC-Prover: Backward Chaining Prover for Formal Theorem Proving
Yuhang He, Jihai Zhang, Jianzhu Bao, Fangquan Lin, Cheng Yang, Bing Qin, Ruifeng Xu, Wotao Yin
Despite the remarkable progress made by large language models in mathematical reasoning, interactive theorem proving in formal logic still
remains a prominent challenge. Previous methods resort to neural models for proofstep generation and search. However, they suffer from
exploring possible proofsteps empirically in a large search space. Moreover, they directly use a less rigorous informal proof for proofstep gen-
eration, neglecting the incomplete reasoning within. In this paper, we propose BC-Prover, a backward chaining framework guided by pseudo
steps. Specifically, BC-Prover uses pseudo steps to guide proofstep generation. The pseudo steps boost the proof construction in two aspects: (1) Backward Chaining, which decomposes the proof into sub-goals for goal-oriented exploration; and (2) Step Planning, which performs fine-grained
planning to bridge the gap between informal and formal proofs. Experiments on the miniF2F benchmark show significant performance gains
by our framework over the state-of-the-art approaches. Our framework is also compatible with existing provers and further improves their
performance with the backward chaining technique.
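Backward chaining itself is a classic strategy: to prove a goal, pick a rule that concludes it and recursively prove the rule's premises. A minimal sketch over Horn-style rules with toy facts (not the paper's formal-prover integration):

```python
# Minimal backward chaining: to prove a goal, find a rule concluding it and
# recursively prove the rule's premises (sub-goals). Rules and facts below
# are toy stand-ins.
RULES = {
    "even(n+m)": ["even(n)", "even(m)"],
    "even(n)": ["n = 2*k"],
    "even(m)": ["m = 2*j"],
}
FACTS = {"n = 2*k", "m = 2*j"}

def prove(goal, depth=0):
    print("  " * depth + "goal:", goal)
    if goal in FACTS:                      # sub-goal closed by a known fact
        return True
    premises = RULES.get(goal)
    if premises is None:                   # no rule concludes this goal
        return False
    return all(prove(p, depth + 1) for p in premises)

print(prove("even(n+m)"))                  # True: two provable sub-goals
```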
(Nov 13): 7:45-8:45 (Morning) - Gather
Thoughts to Target: Enhance Planning for Target-driven Conversation
Zhonghua Zheng, Lizi Liao, Yang Deng, Ee-Peng Lim, Minlie Huang, Liqiang Nie
In conversational AI, large-scale models excel in various tasks but struggle with target-driven conversation planning. Current methods, such
as chain-of-thought reasoning and tree-search policy learning techniques, either neglect plan rationality or require extensive human simulation
procedures. Addressing this, we propose a novel two-stage framework, named EnPL, to improve the LLMs’ capability in planning conver-
sations towards designated targets, including (1) distilling natural language plans from target-driven conversation corpus and (2) generating
new plans with demonstration-guided in-context learning. Specifically, we first propose a filter approach to distill a high-quality plan dataset,
ConvPlan (Resources of this paper can be found at https://github.com/pandazzh2020/ConvPlan). With the aid of corresponding conversational
data and support from relevant knowledge bases, we validate the quality and rationality of these plans. Then, these plans are leveraged to
help guide LLMs to further plan for new targets. Empirical results demonstrate that our method significantly improves the planning ability of
LLMs, especially in target-driven conversations. Furthermore, EnPL is demonstrated to be quite effective in collecting target-driven conver-
sation datasets and enhancing response generation, paving the way for constructing extensive target-driven conversational models.
(Nov 13): 7:45-8:45 (Morning) - Gather
Rescue Conversations from Dead-ends: Efficient Exploration for Task-oriented Dialogue Policy Optimization
Yangyang Zhao, Mehdi Dastani, Jinchuan Long, Zhenyu Wang, Shihan Wang
Training a task-oriented dialogue policy using deep reinforcement learning is promising but requires extensive environment exploration. The
amount of wasted invalid exploration makes policy learning inefficient. In this paper, we define dead-end states and argue that they are an important cause of invalid exploration. When a conversation enters a dead-end state, regardless of the actions taken afterward, it will continue in a
dead-end trajectory until the agent reaches a termination state or maximum turn. We propose a Dead-end Detection and Resurrection (DDR)
method that detects dead-end states in an efficient manner and provides a rescue action to guide and correct the exploration direction. To
prevent dialogue policies from repeating errors, DDR also performs dialogue data augmentation by adding relevant experiences that include
dead-end states and penalties into the experience pool. We first validate the dead-end detection reliability and then demonstrate the effective-
ness and generality of the method across various domains through experiments on four public dialogue datasets.
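A rough sketch of the detect-and-rescue loop follows; the value function, threshold, rescue action, and environment are invented stand-ins for the paper's learned components:

```python
import random

# Toy sketch of dead-end detection and resurrection (DDR) in a rollout.
THRESHOLD = 0.2

def state_value(state):
    # Stand-in critic: value decays as consecutive failed turns accumulate.
    return max(0.0, 1.0 - 0.3 * state["failed_turns"])

experience_pool = []                  # penalized dead-end experiences
state = {"failed_turns": 0}
for turn in range(8):
    if state_value(state) < THRESHOLD:          # dead-end detected
        experience_pool.append((dict(state), "dead_end", -1.0))  # penalty
        print(f"turn {turn}: dead-end, rescue action -> reset_and_reconfirm")
        state["failed_turns"] = 0               # rescued onto a viable path
        continue
    success = random.Random(turn).random() > 0.7   # toy environment dynamics
    state["failed_turns"] = 0 if success else state["failed_turns"] + 1
    print(f"turn {turn}: value={state_value(state):.2f}")
print(f"{len(experience_pool)} dead-end experience(s) added to the pool")
```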
343
Posters and Demos
discourse-aware rewards are developed to assess the coherence between the generated response and the target utterance, with the objective
of optimizing the policy. The experimental results and in-depth analyses on two popular datasets demonstrate that our RL-TRC significantly
outperforms the state-of-the-art baselines, particularly in generating responses that are more coherent with the target utterances.
(Nov 13): 7:45-8:45 (Morning) - Gather
Language Models in Dialogue: Conversational Maxims for Human-AI Interactions
Erik Miehling, Manish Nagireddy, Prasanna Sattigeri, Elizabeth M. Daly, David Piorkowski, John T. Richards
Modern language models, while sophisticated, exhibit some inherent shortcomings, particularly in conversational settings. We claim that
many of the observed shortcomings can be attributed to violation of one or more conversational principles. By drawing upon extensive re-
search from both the social science and AI communities, we propose a set of maxims – quantity, quality, relevance, manner, benevolence,
and transparency – for describing effective human-AI conversation. We first justify the applicability of the first four maxims (from Grice)
in the context of human-AI interactions. We then argue that two new maxims, benevolence (concerning the generation of, and engagement
with, harmful content) and transparency (concerning recognition of one’s knowledge boundaries, operational constraints, and intents), are
necessary for addressing behavior unique to modern human-AI interactions. We evaluate the degree to which various language models are
able to understand these maxims, and find that models possess an internal prioritization of principles that can significantly impact their accurate interpretation of the maxims.
344
Posters and Demos
Zecheng Tang, Keyan Zhou, Juntao Li, Yuyang Ding, Pinzheng Wang, Yan Bowen, Renjie Hua, Min Zhang
Text detoxification aims to minimize the risk of language models producing toxic content. Existing detoxification methods of directly con-
straining the model output or further training the model on the non-toxic corpus fail to achieve a decent balance between detoxification
effectiveness and generation quality. This issue stems from neglecting the constraints imposed by the context, since language models are de-
signed to generate output that closely matches the context while detoxification methods endeavor to ensure the safety of the output even if it
semantically deviates from the context. In view of this, we introduce a Context-aware Model self-Detoxification (CMD) framework that pays
attention to both the context and the detoxification process, i.e., first detoxifying the context and then making the language model generate
along the safe context. Specifically, the CMD framework involves two phases: utilizing language models to synthesize data, and applying these data for training. We also introduce a toxic contrastive loss that pushes the model's generations away from negative toxic samples. Experiments on various LLMs have verified the effectiveness of our CMD framework, which yields the best performance compared to
baselines.
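The abstract does not spell out the toxic contrastive loss; one plausible InfoNCE-style form, treating the non-toxic reference as the positive and toxic samples as negatives, is sketched below (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def toxic_contrastive_loss(gen, safe, toxic, tau=0.1):
    """One plausible 'toxic contrastive loss': the non-toxic sample is the
    positive, toxic samples are negatives (InfoNCE-style)."""
    gen = F.normalize(gen, dim=-1)
    safe = F.normalize(safe, dim=-1)
    toxic = F.normalize(toxic, dim=-1)
    pos = (gen * safe).sum(-1, keepdim=True) / tau   # (B, 1) positive logits
    neg = gen @ toxic.T / tau                        # (B, N_toxic) negatives
    logits = torch.cat([pos, neg], dim=-1)
    # class 0 = the positive; pushes generations toward safe, away from toxic
    return F.cross_entropy(logits, torch.zeros(gen.size(0), dtype=torch.long))

gen = torch.randn(4, 16)    # toy generation representations
safe = torch.randn(4, 16)   # matched non-toxic references
toxic = torch.randn(8, 16)  # toxic negatives
print(toxic_contrastive_loss(gen, safe, toxic))
```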
(Nov 13): 7:45-8:45 (Morning) - Gather
Universal Vulnerabilities in Large Language Models: Backdoor Attacks for In-context Learning
Shuai Zhao, Meihuizi Jia, Anh Tuan Luu, Fengjun Pan, Jinming Wen
In-context learning, a paradigm bridging the gap between pre-training and fine-tuning, has demonstrated high efficacy in several NLP tasks,
especially in few-shot settings. Despite being widely applied, in-context learning is vulnerable to malicious attacks. In this work, we raise
security concerns regarding this paradigm. Our studies demonstrate that an attacker can manipulate the behavior of large language models
by poisoning the demonstration context, without the need for fine-tuning the model. Specifically, we design a new backdoor attack method,
named ICLAttack, to target large language models based on in-context learning. Our method encompasses two types of attacks: poison-
ing demonstration examples and poisoning demonstration prompts, which can make models behave in alignment with predefined intentions.
ICLAttack does not require additional fine-tuning to implant a backdoor, thus preserving the model’s generality. Furthermore, the poisoned
examples are correctly labeled, enhancing the natural stealth of our attack method. Extensive experimental results across several language
models, ranging in size from 1.3B to 180B parameters, demonstrate the effectiveness of our attack method, exemplified by a high average
attack success rate of 95.0% across the three datasets on OPT models.
(Nov 13): 7:45-8:45 (Morning) - Gather
How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li
Large language models (LLMs) rely on safety alignment to avoid responding to malicious user inputs. Unfortunately, jailbreak can circum-
vent safety guardrails, resulting in LLMs generating harmful content and raising concerns about LLM safety. Because language models with massive parameters are often regarded as black boxes, the mechanisms of alignment and jailbreak are challenging to elucidate. In this paper,
we employ weak classifiers to explain LLM safety through the intermediate hidden states. We first confirm that LLMs learn ethical concepts
during pre-training rather than alignment and can identify malicious and normal inputs in the early layers. Alignment actually associates the
early concepts with emotion guesses in the middle layers and then refines them to the specific reject tokens for safe generations. Jailbreak
disturbs the transformation of early unethical classification into negative emotions. We conduct experiments on models from 7B to 70B across
various model families to support our conclusion. Overall, our paper reveals the intrinsic mechanism of LLM safety and how jailbreaks
circumvent safety guardrails, offering a new perspective on LLM safety and reducing concerns.
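The probing setup the abstract describes, fitting a weak classifier per layer, can be illustrated with synthetic activations (real usage would extract the LLM's per-layer hidden states):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of probing intermediate hidden states with weak classifiers: fit a
# simple linear probe per layer to separate malicious from normal inputs.
# Hidden states here are synthetic stand-ins for real LLM activations.
rng = np.random.default_rng(0)
n, d, layers = 200, 64, 6
labels = rng.integers(0, 2, size=n)       # 0 = normal, 1 = malicious

for layer in range(layers):
    sep = 0.4 * layer                      # deeper layers more separable (toy)
    states = rng.normal(size=(n, d)) + sep * labels[:, None]
    probe = LogisticRegression(max_iter=1000).fit(states, labels)
    print(f"layer {layer}: probe accuracy = {probe.score(states, labels):.2f}")
```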
Generation
(Nov 13): 7:45-8:45 (Morning) - Room: Gather
345
Posters and Demos
based context chunking; (3) Skewed length due to sentence-wise filter learning. To address these issues, we propose a model-based evidence
extraction learning framework, SEER, optimizing a vanilla model as an evidence extractor with desired properties through self-aligned learn-
ing. Extensive experiments show that our method largely improves the final RAG performance, enhances the faithfulness, helpfulness, and
conciseness of the extracted evidence, and reduces the evidence length by 9.25 times. The code will be available at https://github.com/HITsz-
TMG/SEER.
(Nov 13): 7:45-8:45 (Morning) - Gather
Learn to Refuse: Making Large Language Models More Controllable and Reliable through Knowledge Scope Limitation and Refusal
Mechanism
Lang Cao
Large language models (LLMs) have demonstrated impressive language understanding and generation capabilities, enabling them to answer
a wide range of questions across various domains. However, these models are not flawless and often produce responses that contain errors or
misinformation. These inaccuracies, commonly referred to as hallucinations, render LLMs unreliable and even unusable in many scenarios.
In this paper, our focus is on mitigating the issue of hallucination in LLMs, particularly in the context of question-answering. Instead of
attempting to answer all questions, we explore a refusal mechanism that instructs LLMs to refuse to answer challenging questions in order
to avoid errors. We then propose a simple yet effective solution called Learn to Refuse (L2R), which incorporates the refusal mechanism to
enable LLMs to recognize and refuse to answer questions that they find difficult to address. To achieve this, we utilize a structured knowledge
base to represent all the LLM’s understanding of the world, enabling it to provide traceable gold knowledge. This knowledge base is separate
from the LLM and initially empty. It can be filled with validated knowledge and progressively expanded. When an LLM encounters questions
outside its domain, the system recognizes its knowledge scope and determines whether it can answer the question independently. Addition-
ally, we introduce a method for automatically and efficiently expanding the knowledge base of LLMs. Through qualitative and quantitative
analysis, we demonstrate that our approach enhances the controllability and reliability of LLMs.
(Nov 13): 7:45-8:45 (Morning) - Gather
CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling
Yu Bai, Xiyuan Zou, Heyan Huang, Sanxing Chen, Marc-Antoine Rondeau, Yang Gao, Jackie CK Cheung
Long sequence modeling has gained broad interest as large language models (LLMs) continue to advance. Recent research has identified that
a large portion of hidden states within the key-value caches of Transformer models can be discarded (also termed evicted) without affecting
the perplexity performance in generating long sequences. However, we show that these methods, despite preserving perplexity performance,
often drop information that is important for solving downstream tasks, a problem which we call information neglect. To address this issue,
we introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful
for a downstream task into the eviction process of hidden states. In addition, we design a method for chunked sequence processing to further
improve efficiency. Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several
strong baselines under the same memory budget, while preserving language modeling perplexity. The code and data have been released at
https://github.com/ybai-nlp/CItruS.
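A simplified rendering of instruction-aware eviction: score cached key-value pairs by how strongly an instruction-derived query attends to them, and keep only the top-scoring entries (the released CItruS code differs in detail):

```python
import numpy as np

def instruction_aware_evict(keys, values, instr_query, budget):
    """Keep the `budget` cache entries that the instruction attends to most
    strongly; a simplified rendering of instruction-aware state eviction."""
    scores = keys @ instr_query / np.sqrt(keys.shape[-1])  # attention logits
    keep = np.argsort(scores)[-budget:]                    # top-scoring states
    keep.sort()                                            # preserve sequence order
    return keys[keep], values[keep]

rng = np.random.default_rng(0)
keys, values = rng.normal(size=(128, 32)), rng.normal(size=(128, 32))
instr_query = rng.normal(size=32)          # derived from the task instruction
k2, v2 = instruction_aware_evict(keys, values, instr_query, budget=32)
print(k2.shape, v2.shape)                  # (32, 32) (32, 32)
```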
(Nov 13): 7:45-8:45 (Morning) - Gather
MirrorStories: Reflecting Diversity through Personalized Narrative Generation with Large Language Models
Sarfaroz Yunusov, Hamza Sidat, Ali Emami
This study explores the effectiveness of Large Language Models (LLMs) in creating personalized "mirror stories" that reflect and resonate
with individual readers’ identities, addressing the significant lack of diversity in literature. We present MirrorStories, a corpus of 1,500 person-
alized short stories generated by integrating elements such as name, gender, age, ethnicity, reader interest, and story moral. We demonstrate
that LLMs can effectively incorporate diverse identity elements into narratives, with human evaluators identifying personalized elements
in the stories with high accuracy. Through a comprehensive evaluation involving 26 diverse human judges, we compare the effectiveness
of MirrorStories against generic narratives. We find that personalized LLM-generated stories not only outscore generic human-written and
LLM-generated ones across all metrics of engagement (with average ratings of 4.22 versus 3.37 on a 5-point scale), but also achieve higher
textual diversity while preserving the intended moral. We also provide analyses that include bias assessments and a study on the potential for
integrating images into personalized stories.
(Nov 13): 7:45-8:45 (Morning) - Gather
AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning
Hao Sun, Jiayi Wu, Hengyi Cai, Xiaochi Wei, Yue Feng, Bo Wang, Shuaiqiang Wang, Yan Zhang, Dawei Yin
Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for
generation quality and deploying local-based LLMs for lower computational cost. The former option is typically costly and inefficient, while
the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose
a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller local-deployed LLMs.
Our framework comprises two primary modules: the local agent instantiated with a relatively smaller LLM, handling less complex reasoning
steps, and the cloud agent equipped with a larger LLM, managing more intricate reasoning steps. This collaborative processing is enabled
through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent,
thereby effectively integrating the strengths of both locally-deployed and cloud-based LLMs, resulting in significant enhancements in task
completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from mathematical reasoning to complex
question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effec-
tively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while utilizing
much less computational overhead.
(Nov 13): 7:45-8:45 (Morning) - Gather
Towards Verifiable Text Generation with Evolving Memory and Self-Reflection
Hao Sun, Hengyi Cai, Bo Wang, Yingyan Hou, Xiaochi Wei, Shuaiqiang Wang, Yan Zhang, Dawei Yin
Despite the remarkable ability of large language models (LLMs) in language comprehension and generation, they often suffer from producing
factually incorrect information, also known as hallucination. A promising solution to this issue is verifiable text generation, which prompts
LLMs to generate content with citations for accuracy verification. However, verifiable text generation is non-trivial due to the focus-shifting
phenomenon, the intricate reasoning needed to align the claim with correct citations, and the dilemma between the precision and breadth
of retrieved documents. In this paper, we present VTG, an innovative framework for Verifiable Text Generation with evolving memory and
self-reflection. VTG introduces evolving long short-term memory to retain both valuable documents and recent documents. A two-tier verifier
equipped with an evidence finder is proposed to rethink and reflect on the relationship between the claim and citations. Furthermore, active
retrieval and diverse query generation are utilized to enhance both the precision and breadth of the retrieved documents. We conduct extensive
experiments on five datasets across three knowledge-intensive tasks and the results reveal that VTG significantly outperforms baselines.
(Nov 13): 7:45-8:45 (Morning) - Gather
346
Posters and Demos
347
Posters and Demos
Large Language Models (LLMs) have demonstrated remarkable performance across multiple tasks through in-context learning. For complex
reasoning tasks that require step-by-step thinking, Chain-of-Thought (CoT) prompting has given impressive results, especially when com-
bined with self-consistency. Nonetheless, some tasks remain particularly difficult for LLMs to solve. Tree of Thoughts (ToT) and Graph of
Thoughts (GoT) emerged as alternatives, dividing the complex problem into paths of subproblems. In this paper, we propose Tree of Problems
(ToP), a simpler version of ToT, which we hypothesise can work better for complex tasks that can be divided into identical subtasks. Our
empirical results show that our approach outperforms ToT and GoT, and in addition performs better than CoT on complex reasoning tasks.
All code for this paper will be made available.
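The decomposition idea can be shown on a toy task (last-letter concatenation), with a trivial rule standing in for the LLM call on each identical subtask; the chunking and merge functions are task-specific choices, not prescribed by the paper:

```python
# Toy rendering of Tree of Problems (ToP): split a task into identical
# subtasks, solve each independently, then merge the partial answers.
def solve_subproblem(words):
    return "".join(w[-1] for w in words)   # last-letter concatenation

def tree_of_problems(words, fanout=2):
    if len(words) <= fanout:               # small enough: solve directly
        return solve_subproblem(words)
    mid = len(words) // 2                  # two identical subtasks
    return tree_of_problems(words[:mid]) + tree_of_problems(words[mid:])

words = ["tree", "of", "problems", "is", "simple"]
assert tree_of_problems(words) == solve_subproblem(words)
print(tree_of_problems(words))             # "efsse"
```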
(Nov 13): 7:45-8:45 (Morning) - Gather
Not All Contexts Are Equal: Teaching LLMs Credibility-aware Generation
Ruotong Pan, Boxi Cao, Hongyu Lin, Xianpei Han, Jia Zheng, Sirui Wang, Xunliang Cai, Le Sun
The rapid development of large language models has led to the widespread adoption of Retrieval-Augmented Generation (RAG), which in-
tegrates external knowledge to alleviate knowledge bottlenecks and mitigate hallucinations. However, the existing RAG paradigm inevitably
suffers from the impact of flawed information introduced during the retrieval phase, thereby diminishing the reliability and correctness of the
generated outcomes. In this paper, we propose Credibility-aware Generation (CAG), a universally applicable framework designed to mitigate
the impact of flawed information in RAG. At its core, CAG aims to equip models with the ability to discern and process information based
on its credibility. To this end, we propose an innovative data transformation framework that generates data based on credibility, thereby
effectively endowing models with the capability of CAG. Furthermore, to accurately evaluate the models’ capabilities of CAG, we construct
a comprehensive benchmark covering three critical real-world scenarios. Experimental results demonstrate that our model can effectively
understand and employ credibility for generation, significantly outperform other models with retrieval augmentation, and exhibit robustness
despite the increasing noise in the context.
(Nov 13): 7:45-8:45 (Morning) - Gather
Extending Context Window of Large Language Models from a Distributional Perspective
Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
Scaling the rotary position embedding (RoPE) has become a common method for extending the context window of RoPE-based large lan-
guage models (LLMs). However, existing scaling methods often rely on empirical approaches and lack a profound understanding of the
internal distribution within RoPE, resulting in suboptimal performance in extending the context window length. In this paper, we propose to
optimize the context window extending task from the view of rotary angle distribution. Specifically, we first estimate the distribution of the
rotary angles within the model and analyze the extent to which length extension perturbs this distribution. Then, we present a novel extension
strategy that minimizes the disturbance between rotary angle distributions to maintain consistency with the pre-training phase, enhancing the
model's capability to generalize to longer sequences. Experimental results against strong baseline methods demonstrate that our approach reduces the distributional disturbance by up to 72% when extending LLaMA2's context window to 8k, and by up to 32% when extending to 16k. On the LongBench-E benchmark, our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods. Furthermore, our method maintains the model's performance on the Hugging Face Open LLM benchmark after
context window extension, with only an average performance fluctuation ranging from -0.12 to +0.22.
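The quantity being matched can be illustrated directly: standard RoPE assigns rotary angles theta[t, j] = t * base^(-2j/d), wrapped to [0, 2*pi). The snippet below histograms them at the pre-training length versus an extended length and measures the disturbance with a simple L1 distance; the paper's actual metric and extension strategy are not reproduced:

```python
import numpy as np

def rotary_angles(seq_len, dim=64, base=10000.0, scale=1.0):
    """Rotary angles theta[t, j] = (t / scale) * base**(-2j / dim), wrapped
    to [0, 2*pi). scale > 1 mimics position-interpolation-style extension."""
    inv_freq = base ** (-2.0 * np.arange(dim // 2) / dim)
    t = np.arange(seq_len) / scale
    return np.outer(t, inv_freq) % (2 * np.pi)

def angle_hist(angles, bins=64):
    h, _ = np.histogram(angles, bins=bins, range=(0, 2 * np.pi), density=True)
    return h

train = angle_hist(rotary_angles(4096))       # pre-training distribution
for scale in (1.0, 2.0):                      # direct vs. interpolated 8k
    dist = np.abs(angle_hist(rotary_angles(8192, scale=scale)) - train).sum()
    print(f"extend to 8k with scale={scale}: L1 histogram distance = {dist:.4f}")
```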
Industry
(Nov 13): 7:45-8:45 (Morning) - Room: Gather
348
Posters and Demos
pose feeding sentence embeddings developed from microblogs and search logs with a self-attention mechanism. Experiments showed that our
model outperformed two baselines, including a strong LLM-based one. We will release the dataset upon acceptance to support future research.
[9] Due to the company's policy, codes will be open-sourced to facilitate future research.
349
Posters and Demos
tasks. In this work, we propose a novel yet simple PEFT method, Prompt Aware Representation Adjustment (PARA), which installs a lightweight vector generator at each Transformer layer to generate vectors, conditioned on the input prompts, that modify the hidden representations. We have conducted experiments on various tasks, and the results demonstrate that (a) our PARA method can outperform recent PEFT baselines with a comparable number of tunable parameters, and (b) our PARA method is more efficient than LoRA under the single-backbone multi-tenant setting, showing great potential for industry use.
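A sketch of what such a per-layer generator might look like in PyTorch; the bottleneck size, mean pooling, and additive injection rule are all assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn

class PromptVectorGenerator(nn.Module):
    """Sketch of a lightweight per-layer generator in the spirit of the PARA
    abstract: condition on a pooled prompt representation and emit a vector
    that adjusts the layer's hidden states."""
    def __init__(self, hidden_size, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)  # keeps it parameter-light
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states):                   # (batch, seq, hidden)
        prompt_summary = hidden_states.mean(dim=1)      # pool the prompt
        delta = self.up(torch.tanh(self.down(prompt_summary)))
        return hidden_states + delta.unsqueeze(1)       # adjust every position

h = torch.randn(2, 10, 768)                             # toy hidden states
print(PromptVectorGenerator(768)(h).shape)              # torch.Size([2, 10, 768])
```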
(Nov 13): 7:45-8:45 (Morning) - Gather
ULMR: Unlearning Large Language Models via Negative Response and Model Parameter Average
Shaojie Shi, Xiaoyu Tan, Xihe Qiu, Chao Qu, Kexin Nie, Yuan Cheng, Wei Chu, Xu Yinghui, Yuan Qi
In recent years, large language models (LLMs) have attracted significant interest from the research community due to their broad applicabil-
ity in many language-oriented tasks, and are now widely used in numerous areas of production and daily life. One source of the powerful
capabilities of LLMs is the massive scale of their pre-training datasets. However, these datasets contain much outdated, harmful, and personally sensitive information, which is inevitably memorized by the LLM during pre-training. Eliminating this
undesirable data is crucial for ensuring the model’s safety and enhancing the user experience. However, the cost of extensively cleaning the
pre-training dataset and retraining the model from scratch is very high. In this work, we propose ULMR, an unlearning framework for LLMs,
which first uses carefully designed prompts to rewrite the instructions in the specified dataset, and generate corresponding negative responses.
Subsequently, to ensure that the model does not excessively deviate post-training, we perform model parameter averaging to preserve the
performance of the original LLM. We conducted experiments on two public datasets, TOFU and RWKU, demonstrating that our method can
effectively forget specified information while retaining the capabilities of the original LLM.
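Model parameter averaging, as named in the abstract, amounts to interpolating two weight sets; a minimal sketch with a hypothetical mixing coefficient alpha:

```python
import torch

def average_parameters(original, unlearned, alpha=0.5):
    """Interpolate between the original and post-unlearning weights so the
    model does not drift too far. alpha is a hypothetical mixing coefficient,
    not a value from the paper."""
    return {name: alpha * original[name] + (1 - alpha) * unlearned[name]
            for name in original}

original = {"w": torch.ones(3), "b": torch.zeros(3)}    # toy state_dicts
unlearned = {"w": torch.zeros(3), "b": torch.ones(3)}
print(average_parameters(original, unlearned, alpha=0.7))
# merged weights could be loaded back with model.load_state_dict(...)
```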
(Nov 13): 7:45-8:45 (Morning) - Gather
ProConSuL: Project Context for Code Summarization with LLMs
Vadim Lomshakov, Andrey Podivilov, Sergey Savin, Oleg Baryshnikov, Alena Lisevych, Sergey Nikolenko
We propose Project Context for Code Summarization with LLMs (ProConSuL), a new framework to provide a large language model (LLM)
with precise information about the code structure from program analysis methods such as a compiler or IDE language services and use task
decomposition derived from the code structure. ProConSuL builds a call graph to provide the context from callees and uses a two-phase
training method (SFT + preference alignment) to train the model to use the project context. We also provide a new evaluation benchmark
for C/C++ functions and a set of proxy metrics. Experimental results demonstrate that ProConSuL allows us to significantly improve code
summaries and reduce the number of hallucinations compared to the base model (CodeLlama-7B-instruct). We make our code and dataset
available at https://github.com/TypingCat13/ProConSuL.
(Nov 13): 7:45-8:45 (Morning) - Gather
Building an Efficient Multilingual Non-Profit IR System for the Islamic Domain Leveraging Multiprocessing Design in Rust
Vera Pavlova, Mohammed Makhlouf
The widespread use of large language models (LLMs) has dramatically improved many applications of Natural Language Processing (NLP),
including Information Retrieval (IR). However, domains that are not driven by commercial interest often lag behind in benefiting from AI-
powered solutions. One such area is religious and heritage corpora. Alongside similar domains, Islamic literature holds significant cultural
value and is regularly utilized by scholars and the general public. Navigating this extensive amount of text is challenging, and there is cur-
rently no unified resource that allows for easy searching of this data using advanced AI tools. This work focuses on the development of a
multilingual non-profit IR system for the Islamic domain. This process brings a few major challenges, such as preparing multilingual domain-
specific corpora when data is limited in certain languages, deploying a model on resource-constrained devices, and enabling fast search on a
limited budget. By employing methods like continued pre-training for domain adaptation and language reduction to decrease model size, a
lightweight multilingual retrieval model was prepared, demonstrating superior performance compared to larger models pre-trained on general
domain data. Furthermore, evaluating the proposed architecture that utilizes Rust Language capabilities shows the possibility of implementing
efficient semantic search in a low-resource setting.
350
Posters and Demos
351
Posters and Demos
Efficient Answer Retrieval System (EARS): Combining Local DB Search and Web Search for Generative QA
Nikita Krayko, Ivan Sidorov, Fedor Laputin, Daria Galimzianova, Vasily Konovalov
In this work we propose an efficient production-ready factoid question answering (QA) system that combines a local knowledge base search
and generative context-based QA. To assess the quality of the generated content, we devise metrics for both manual and automatic evaluation
of the answers to questions. A distinctive feature of our system is the Ranker component, which ranks potential answers according to their
relevance. This enhances the quality of local knowledge base retrieval by 23%. Another crucial aspect of the system is the LLM, which
utilizes contextual information from the internet to formulate responses. It boosts the utility of voice-based answers by 94%. EARS is
language-agnostic and the approach can be implemented for any data domain.
(Nov 13): 7:45-8:45 (Morning) - Gather
Mixture of Diverse Size Experts
Manxi Sun, Wei Liu, Jian Luan, Pengzhi Gao, Bin Wang
The Sparsely-Activated Mixture-of-Experts (MoE) architecture has gained popularity for scaling large language models (LLMs) due to the
sub-linearly increasing computational costs. Despite its success, most of the current structure designs face the challenge that the experts share
the same size, such that tokens have no chance to choose the experts with the most appropriate size to generate the next token. To mitigate
this defect, we propose Mixture of Diverse Size Experts (MoDSE), a new MoE architecture with designed layers where experts have different
sizes. Analysis on difficult token generation tasks shows that experts with different sizes give better predictions, and the routing path of the
experts tends to be stable after a period of training. However, the diversity of expert sizes can lead to load imbalance. To tackle this limitation, we
introduce an expert-pair allocation strategy to distribute the workload evenly across the GPUs. Comprehensive evaluations across multiple
benchmarks demonstrate the effectiveness of MoDSE, surpassing existing MoEs by adaptively assigning the parameter budget to experts
while maintaining the same total parameter size and number of experts.
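The core structural idea, experts of different widths behind a shared router, can be sketched as follows; the widths and hard top-1 routing are illustrative choices (a trained MoE would use a differentiable routing scheme):

```python
import torch
import torch.nn as nn

class DiverseSizeMoE(nn.Module):
    """Sketch of a mixture with different-sized experts, the core idea the
    MoDSE abstract describes: a router picks one expert per token, and the
    experts differ in hidden width."""
    def __init__(self, d_model=64, expert_widths=(32, 64, 128, 256)):
        super().__init__()
        self.router = nn.Linear(d_model, len(expert_widths))
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, w), nn.GELU(), nn.Linear(w, d_model))
            for w in expert_widths
        )

    def forward(self, x):                       # x: (tokens, d_model)
        choice = self.router(x).argmax(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i                  # route each token to its expert
            if mask.any():
                out[mask] = expert(x[mask])
        return out

x = torch.randn(16, 64)
print(DiverseSizeMoE()(x).shape)                # torch.Size([16, 64])
```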
(Nov 13): 7:45-8:45 (Morning) - Gather
Course-Correction: Safety Alignment Using Synthetic Preferences
Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Liu Yan, Tianwei Zhang, Wei Xu, Han Qiu
The risk of harmful contents generated by large language models (LLMs) becomes a critical concern. This paper systematically evaluates and
enhances LLMs' capability to perform course-correction, i.e., the model can autonomously steer away from generating harmful content. First, we introduce the C2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing the varying proficiency of current safety-tuned LLMs in course-correction. To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C2-Syn, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven learning. Experiments on Llama2-Chat 7B and Qwen2 7B show that
our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs’
safety, particularly in resisting jailbreak attacks.
(Nov 13): 7:45-8:45 (Morning) - Gather
GOVERN: Gradient Orientation Vote Ensemble for Multi-Teacher Reinforced Distillation
Wenjie Zhou, Zhenxin Ding, Xiaodong Zhang, Haibo Shi, Junfeng Wang, Dawei Yin
Pre-trained language models have become an integral component of question-answering systems, achieving remarkable performance. For
practical deployment, it is critical to carry out knowledge distillation to preserve high performance under computational constraints. In this
paper, we address a key question: given the importance of unsupervised distillation for student performance, how does one effectively en-
semble knowledge from multiple teachers at this stage without the guidance of labels? We propose a novel algorithm, GOVERN, to tackle
this issue. GOVERN has demonstrated significant improvements in both offline and online experiments. The proposed algorithm has been
successfully deployed in a real-world commercial question-answering system.
(Nov 13): 7:45-8:45 (Morning) - Gather
Scaling Parameter-Constrained Language Models with Quality Data
Ernie Chang, Matteo Paltenghi, Yang Li, Pin-Jie Lin, Changsheng Zhao, Patrick Huber, Zechun Liu, Rastislav Rabatin, Yangyang Shi, Vikas
Chandra
Scaling laws in language modeling traditionally quantify training loss as a function of dataset size and model parameters, providing compute-
optimal estimates but often neglecting the impact of data quality on model generalization. In this paper, we extend the conventional under-
standing of scaling law by offering a microscopic view of data quality within the original formulation – effective training tokens – which we
posit to be a critical determinant of performance for parameter-constrained language models. Specifically, we formulate the proposed term
of effective training tokens to be a combination of two readily-computed indicators of text: (i) text diversity and (ii) syntheticity as measured
by a teacher model. We pretrained over 200 models of 25M to 1.5B parameters on a diverse set of sampled, synthetic data, and estimated
the constants that relate text quality, model size, training tokens, and eight reasoning task accuracy scores. We demonstrate that the estimated constants yield a +0.83 Pearson correlation with true accuracies, and we analyze them in scenarios involving widely-used data techniques, such as data sampling and synthesis, that aim to improve data quality.
(Nov 13): 7:45-8:45 (Morning) - Gather
DL-QAT: Weight-Decomposed Low-Rank Quantization-Aware Training for Large Language Models
Wenjing Ke, Zhe Li, Dong Li, Lu Tian, Emad Barsoum
Improving the efficiency of inference in Large Language Models (LLMs) is a critical area of research. Post-training Quantization (PTQ) is a
popular technique, but it often faces challenges at low-bit levels, particularly in downstream tasks. Quantization-aware Training (QAT) can
alleviate this problem, but it requires significantly more computational resources. To tackle this, we introduce Weight-Decomposed Low-Rank Quantization-Aware Training (DL-QAT), which retains the advantages of QAT while training less than 1% of the total parameters. Specifically, we introduce a group-specific quantization magnitude to adjust the overall scale of each quantization group. Within each quantization group, we use LoRA matrices to update the weight magnitude and direction in the quantization space. We validate the effectiveness of our method on the LLaMA and LLaMA2 model families. The results show significant improvements over our baseline method across different quantization granularities. For instance, our approach outperforms the previous state-of-the-art method by 4.2% in MMLU on 3-bit LLaMA-7B. Additionally, our quantization results on pre-trained models also surpass previous QAT methods, demonstrating the
superior performance and efficiency of our approach.
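A sketch of the weight path as the abstract describes it: add a low-rank update to the frozen weight, then fake-quantize each group with its own magnitude. The straight-through estimator needed to train through round() is omitted, and the exact formulation may differ from the paper's:

```python
import torch

def dlqat_forward(W, A, B, s, bits=3, group=64):
    """Fake-quantize (W + B @ A) group-wise, each group with its own
    learnable magnitude s. A simplified sketch of the DL-QAT weight path."""
    q_max = 2 ** (bits - 1) - 1
    W_eff = W + B @ A                                  # low-rank LoRA update
    Wg = W_eff.reshape(-1, group)                      # group-wise view
    q = torch.clamp(torch.round(Wg / s), -q_max - 1, q_max)
    return (q * s).reshape(W.shape)                    # dequantized weights

d_out, d_in, rank = 128, 256, 8
W = torch.randn(d_out, d_in)
A = torch.randn(rank, d_in) * 0.01
B = torch.zeros(d_out, rank)                           # standard LoRA init
s = W.reshape(-1, 64).abs().max(dim=1, keepdim=True).values / 3  # group scales
W_q = dlqat_forward(W, A, B, s)
print((W - W_q).abs().mean())                          # mean quantization error
```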
352
Posters and Demos
efficiently combines a cloud-based LLM with a smaller client-side model through retrieval augmented memory. This integration enables the
client model to generate better responses, benefiting from the LLM’s capabilities and cloud-based data. Meanwhile, via a novel asynchronous
memory update mechanism, the client model can deliver real-time completions to user inputs without the need to wait for responses from the
cloud. Our experiments on five datasets demonstrate that Hybrid-RACA offers strong performance while maintaining low latency.
(Nov 13): 7:45-8:45 (Morning) - Gather
LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit
Ruihao Gong, Yang Yong, Shiqiao Gu, Yushi Huang, Chengtao Lv, Yunchen Zhang, Dacheng Tao, Xianglong Liu
Recent advancements in large language models (LLMs) are propelling us toward artificial general intelligence with their remarkable emergent
abilities and reasoning capabilities. However, their substantial computational and memory requirements limit widespread adoption. Quan-
tization, a key compression technique, can effectively mitigate these demands by compressing and accelerating LLMs, albeit with potential
risks to accuracy. Numerous studies have aimed to minimize the accuracy loss associated with quantization. However, their quantization
configurations vary from each other and cannot be fairly compared. In this paper, we present LLMC, a plug-and-play compression toolkit, to
fairly and systematically explore the impact of quantization. LLMC integrates dozens of algorithms, models, and hardware platforms, offering high
extensibility from integer to floating-point quantization, from LLM to vision-language (VLM) model, from fixed-bit to mixed precision, and
from quantization to sparsification. Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms
(three strategies), and data formats, providing novel insights and detailed analyses for further research and practical guidance for users. Our
toolkit is available at https://github.com/anonymous-emnlp123/llmc.
(Nov 13): 7:45-8:45 (Morning) - Gather
Context Matters: Pushing the Boundaries of Open-Ended Answer Generation with Graph-Structured Knowledge Context
Somnath Banerjee, Amruit Sahoo, Sayan Layek, Avik Dutta, Rima Hazra, Animesh Mukherjee
This paper introduces a novel framework that combines graph-driven context retrieval with knowledge-graph-based enhancement, honing the proficiency of LLMs, especially on domain-specific community question answering platforms like AskUbuntu, Unix, and Server-
Fault. We conduct experiments on various LLMs with different parameter sizes to evaluate their ability to ground knowledge and determine
factual accuracy in answers to open-ended questions. Our methodology GraphContextGen consistently outperforms dominant text-based
retrieval systems, demonstrating its robustness and adaptability to a larger number of use cases. This advancement highlights the importance
of pairing context rich data retrieval with LLMs, offering a renewed approach to knowledge sourcing and generation in AI systems. We also
show that, due to rich contextual data retrieval, the crucial entities, along with the generated answer, remain factually coherent with the gold
answer. We shall release the source code and datasets upon acceptance.
(Nov 13): 7:45-8:45 (Morning) - Gather
Pretraining and Finetuning Language Models on Geospatial Networks for Accurate Address Matching
Saket Maheshwary, Arpan Paul, Saurabh Sohoney
We propose a novel framework for pretraining and fine-tuning language models with the goal of determining whether two addresses repre-
sent the same physical building. For delivery and logistics, improving address matching positively impacts geocoding, route planning, and
delivery time estimations, leading to an efficient and reliable delivery experience. We propose to view a list of addresses as an address graph
and curate inputs for language models by placing geospatially linked addresses in the same context. Our approach jointly integrates concepts
from graph theory and weak supervision with address text and geospatial semantics. This integration enables us to generate informative
and diverse address pairs, facilitating pretraining and fine-tuning in a self-supervised manner. Experiments and ablation studies on manually
curated datasets and comparisons with state-of-the-art techniques demonstrate the efficacy of our proposed approach. We achieve a 24.49%
improvement in recall while maintaining 95% precision on average, in comparison to the current baseline across multiple geographies. Fur-
ther, we demonstrate the impact of improving address matching on geocode learning. We performed offline evaluations and launched online
A/B experiments which show that our proposed approach improves delivery precision by 14.68% and reduces delivery defects by 8.79% on
average across geographies.
(Nov 13): 7:45-8:45 (Morning) - Gather
LARA: Linguistic-Adaptive Retrieval-Augmentation for Multi-Turn Intent Classification
Junhua Liu, Yong Keat Tan, Bin Fu, Kwan Hui Lim
Multi-turn intent classification is notably challenging due to the complexity and evolving nature of conversational contexts. This paper in-
troduces LARA, a Linguistic-Adaptive Retrieval-Augmentation framework to enhance accuracy in multi-turn classification tasks across six
languages, accommodating numerous intents in chatbot interactions. LARA combines a fine-tuned smaller model with a retrieval-augmented
mechanism, integrated within the architecture of LLMs. The integration allows LARA to dynamically utilize past dialogues and relevant in-
tents, thereby improving the understanding of the context. Furthermore, our adaptive retrieval techniques bolster the cross-lingual capabilities
of LLMs without extensive retraining and fine-tuning. Comprehensive experiments demonstrate that LARA achieves state-of-the-art perfor-
mance on multi-turn intent classification tasks, improving the average accuracy by 3.67% over state-of-the-art single-turn intent classifiers.
Information Extraction
(Nov 13): 7:45-8:45 (Morning) - Room: Gather
353
Posters and Demos
Cross-domain Named Entity Recognition (CDNER) is crucial for Knowledge Graph (KG) construction and natural language processing
(NLP), enabling learning from source to target domains with limited data. Previous studies often rely on manually collected entity-relevant
sentences from the web or attempt to bridge the gap between tokens and entity labels across domains. These approaches are time-consuming
and inefficient, as these data are often weakly correlated with the target task and require extensive pre-training. To address these issues, we
propose automatically generating task-oriented knowledge (GTOK) using large language models (LLMs), focusing on the reasoning process
of entity extraction. Then, we employ task-oriented pre-training (TOPT) to facilitate domain adaptation. Additionally, current cross-domain
NER methods often lack explicit explanations for their effectiveness. Therefore, we introduce the concept of information density to better
evaluate the model's effectiveness before performing entity recognition. We conduct systematic experiments and analyses to demonstrate the
effectiveness of our proposed approach and the validity of using information density for model evaluation.
354
Posters and Demos
Recent research in zero-shot Relation Extraction (RE) has focused on using Large Language Models (LLMs) due to their impressive zero-
shot capabilities. However, current methods often perform suboptimally, mainly due to a lack of detailed, context-specific prompts needed for
understanding various sentences and relations. To address this, we introduce the Self-Prompting framework, a novel method designed to fully
harness the embedded RE knowledge within LLMs. Specifically, our framework employs a three-stage diversity approach to prompt LLMs,
generating multiple synthetic samples that encapsulate specific relations from scratch. These generated samples act as in-context learning
samples, offering explicit and context-specific guidance to efficiently prompt LLMs for RE. Experimental evaluations on benchmark datasets
show our approach outperforms existing LLM-based zero-shot RE methods. Additionally, our experiments confirm the effectiveness of our
generation pipeline in producing high-quality synthetic data that enhances performance.
355
Posters and Demos
with competitive baselines as well as the CLSD approaches trained with labeled data in target language.
(Nov 13): 7:45-8:45 (Morning) - Gather
Crafting Personalized Agents through Retrieval-Augmented Generation on Editable Memory Graphs
Zheng Wang, Zhongyang Li, Jiang Zeren, Dandan Tu, Wei Shi
In the age of mobile internet, user data, often referred to as memories, is continuously generated on personal devices. Effectively managing
and utilizing this data to deliver services to users is a compelling research topic. In this paper, we introduce a novel task of crafting person-
alized agents powered by large language models (LLMs), which utilize a user’s smartphone memories to enhance downstream applications
with advanced LLM capabilities. To achieve this goal, we introduce EMG-RAG, a solution that combines Retrieval-Augmented Generation
(RAG) techniques with an Editable Memory Graph (EMG). This approach is further optimized using Reinforcement Learning to address three
distinct challenges: data collection, editability, and selectability. Extensive experiments on a real-world dataset validate the effectiveness of
EMG-RAG, achieving an improvement of approximately 10% over the best existing approach. Additionally, the personalized agents have
been transferred into a real smartphone AI assistant, which leads to enhanced usability.
(Nov 13): 7:45-8:45 (Morning) - Gather
Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs
Cheng Gao, Chaojun Xiao, Zhenghao Liu, Huimin Chen, Zhiyuan Liu, Maosong Sun
Legal case retrieval (LCR) aims to provide similar cases as references for a given fact description. This task is crucial for promoting consistent
judgments in similar cases, effectively enhancing judicial fairness and improving work efficiency for judges. However, existing works face two main challenges for real-world applications: they mainly focus on case-to-case retrieval using lengthy queries, which does not match real-world scenarios; and the limited data scale, with current datasets containing only hundreds of queries, is insufficient to satisfy
the training requirements of existing data-hungry neural models. To address these issues, we introduce an automated method to construct
synthetic query-candidate pairs and build the largest LCR dataset to date, LEAD, which is hundreds of times larger than existing datasets.
This data construction method can provide ample training signals for LCR models. Experimental results demonstrate that model training with
our constructed data can achieve state-of-the-art results on two widely-used LCR benchmarks. Besides, the construction method can also be
applied to civil cases and achieve promising results. The data and codes can be found in https://github.com/thunlp/LEAD.
356
Posters and Demos
multilingualism in LLMs primarily focus on using English as the pivot language to enhance language understanding and reasoning. Given that multiple languages can compensate for the losses caused by a single language's limitations, it is a natural next step to enrich the model's learning context by integrating the original input with its multiple translations. In this paper, we start by revealing that
LLMs learn from parallel multilingual input (PMI). Our comprehensive evaluation shows that PMI enhances the model’s comprehension of
the input, achieving superior performance than conventional in-context learning (ICL). Furthermore, to explore how multilingual processing
affects prediction, we examine the activated neurons in LLMs. Surprisingly, involving more languages in the input activates fewer neurons,
leading to more focused and effective neural activation patterns. Also, this neural reaction coincidentally mirrors the neuroscience insight about
synaptic pruning, highlighting a similarity between artificial and biological ‘brains’.
(Nov 13): 7:45-8:45 (Morning) - Gather
MARE: Multi-Aspect Rationale Extractor on Unsupervised Rationale Extraction
Han Jiang, Junwen Duan, Zhe Qu, Jianxin Wang
Unsupervised rationale extraction aims to extract text snippets to support model predictions without explicit rationale annotation. Researchers
have made many efforts to solve this task. Previous works often encode each aspect independently, which may limit their ability to capture
meaningful internal correlations between aspects. While there has been significant work on mitigating spurious correlations, our approach fo-
cuses on leveraging the beneficial internal correlations to improve multi-aspect rationale extraction. In this paper, we propose a Multi-Aspect
Rationale Extractor (MARE) to explain and predict multiple aspects simultaneously. Concretely, we propose a Multi-Aspect Multi-Head
Attention (MAMHA) mechanism based on hard deletion to encode multiple text chunks simultaneously. Furthermore, multiple special to-
kens are prepended in front of the text, each corresponding to one particular aspect. Finally, multi-task training is deployed to reduce
the training overhead. Experimental results on two unsupervised rationale extraction benchmarks show that MARE achieves state-of-the-art
performance. Ablation studies further demonstrate the effectiveness of our method. Our code is available at https://github.com/CSU-NLP-Group/MARE.
(Nov 13): 7:45-8:45 (Morning) - Gather
Leveraging Estimated Transferability Over Human Intuition for Model Selection in Text Ranking
Jun Bai, Zhuofan Chen, Zhenzi Li, Hanhua Hong, Jianfei Zhang, Chen Li, Chenghua Lin, Wenge Rong
Text ranking has witnessed significant advancements, attributed to the utilization of dual-encoder architectures enhanced by Pre-trained Language Models
(PLMs). Given the proliferation of available PLMs, selecting the most effective one for a given dataset has become a non-trivial challenge. As
a promising alternative to human intuition and brute-force fine-tuning, Transferability Estimation (TE) has emerged as an effective approach
to model selection. However, current TE methods are primarily designed for classification tasks, and their estimated transferability may not
align well with the objectives of text ranking. To address this challenge, we propose to compute the expected rank as transferability, explic-
itly reflecting the model’s ranking capability. Furthermore, to mitigate anisotropy and incorporate training dynamics, we adaptively scale
isotropic sentence embeddings to yield an accurate expected rank score. Our resulting method, Adaptive Ranking Transferability (AiRTran),
can effectively capture subtle differences between models. On challenging model selection scenarios across various text ranking datasets,
it demonstrates significant improvements over previous classification-oriented TE methods, human intuition, and ChatGPT with minor time
consumption.
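The abstract does not define the estimator precisely; one natural reading of "expected rank" is the positive document's soft rank under a sigmoid relaxation of pairwise comparisons, sketched below (AiRTran's actual estimator and scaling may differ):

```python
import numpy as np

def expected_rank(sim, pos_idx, tau=0.05):
    """Soft (expected) rank of the positive document: 1 plus the expected
    number of documents scoring above it under a sigmoid relaxation."""
    margins = (sim - sim[pos_idx]) / tau
    p_beat = 1.0 / (1.0 + np.exp(-margins))       # P(doc outranks the positive)
    return 1.0 + p_beat.sum() - p_beat[pos_idx]   # exclude the positive itself

rng = np.random.default_rng(0)
q = rng.normal(size=32)
docs = rng.normal(size=(100, 32))
docs[7] = q + 0.1 * rng.normal(size=32)           # make doc 7 the true match
sim = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
print(expected_rank(sim, pos_idx=7))              # near 1 for a good encoder
```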
(Nov 13): 7:45-8:45 (Morning) - Gather
Enhancing Training Data Attribution for Large Language Models with Fitting Error Consideration
Kangxi Wu, Liang Pang, Huawei Shen, Xueqi Cheng
The black-box nature of large language models (LLMs) poses challenges in interpreting results, impacting issues such as data intellectual
property protection and hallucination tracing. Training data attribution (TDA) methods are considered effective solutions to address these
challenges. Most recent TDA methods rely on influence functions, assuming the model achieves minimized empirical risk. However, achiev-
ing this criterion is difficult, and sourcing accuracy can be compromised by fitting errors during model training. In this paper, we introduce
a novel TDA method called Debias and Denoise Attribution (DDA), which enhances influence functions by addressing fitting errors. Specif-
ically, the debias strategy seeks to improve the performance of influence functions by eliminating the knowledge bias present in the base
model before fine-tuning, while the denoise strategy aims to reduce discrepancies in influence scores arising from varying degrees of fitting
during the training process through smoothing techniques. Experimental results demonstrate that our method significantly outperforms existing approaches, achieving an average AUC of 91.64%. Moreover, DDA exhibits strong generality and scalability across various sources and
different-scale models like LLaMA2, QWEN2, and Mistral.
Mengqi Zhang, Xiaotian Ye, Qiang Liu, Pengjie Ren, Shu Wu, Zhumin Chen
Large language models (LLMs) are pivotal in advancing natural language processing (NLP) tasks, yet their efficacy is hampered by inaccura-
cies and outdated knowledge. Model editing emerges as a promising solution to address these challenges. However, existing editing methods
struggle to track and incorporate changes in knowledge associated with edits, which limits the generalization ability of post-edit LLMs in
processing edited knowledge. To tackle these problems, we propose a novel model editing method that leverages knowledge graphs for
enhancing LLM editing, namely GLAME. Specifically, we first utilize a knowledge graph augmentation module to uncover associated knowl-
edge that has changed due to editing, obtaining its internal representations within LLMs. This approach allows knowledge alterations within
LLMs to be reflected through an external graph structure. Subsequently, we design a graph-based knowledge edit module to integrate struc-
tured knowledge into the model editing. This ensures that the updated parameters reflect not only the modifications of the edited knowledge
but also the changes in other associated knowledge resulting from the editing process. Comprehensive experiments conducted on GPT-J and
GPT-2 XL demonstrate that GLAME significantly improves the generalization capabilities of post-edit LLMs in employing edited knowledge.
(Nov 13): 7:45-8:45 (Morning) - Gather
Transfer Learning for Text Classification via Model Risk Analysis
Yujie Sun, Chuyi Fan, Qun Chen
It has been well recognized that text classification can be satisfactorily performed by Deep Neural Network (DNN) models, provided that
there are sufficient in-distribution training data. However, in the presence of distribution drift, a well-trained DNN model may not perform
well on a new dataset even though class labels are aligned between training and target datasets. To alleviate this limitation, we propose a
novel approach based on model risk analysis to adapt a pre-trained DNN model towards a new dataset given only a small set of representative
data. We first present a solution of model risk analysis for text classification, which can effectively quantify misprediction risk of a classifier
on a dataset. Built upon the existing framework of LearnRisk, the proposed solution, denoted by LearnRisk-TC, first generates interpretable
risk features, then constructs a risk model by aggregating these features, and finally trains the risk model on a small set of labeled data.
Furthermore, we present a transfer learning solution based on model risk analysis, which can effectively fine-tune a pre-trained model toward
a target dataset by minimizing its misprediction risk. We have conducted extensive experiments on real datasets. Our experimental results
show that the proposed solution performs considerably better than the existing alternative approaches. By using text classification as a test
case, we demonstrate the potential applicability of risk-based transfer learning to various challenging NLP tasks. Our codes are available at
https://github.com/syjcomputer/LRTC.
(Nov 13): 7:45-8:45 (Morning) - Gather
Exploring Reward Model Strength’s Impact on Language Models
Yanjun Chen, Dawei Zhu, Yirong Sun, Xinghao Chen, Wei Zhang, Xiaoyu Shen
Reinforcement Learning from Human Feedback significantly enhances Natural Language Processing by aligning language models with hu-
man expectations. A critical factor in this alignment is the strength of reward models used during training. This study explores whether
stronger reward models invariably lead to better language models. Through experiments on relevance, factuality, and complete-
ness tasks using the QA-FEEDBACK dataset and reward models based on Longformer, we uncover a surprising paradox: language models
trained with moderately accurate reward models outperform those guided by highly accurate ones. This challenges the widely held belief
that stronger reward models always lead to better language models, and opens up new avenues for future research into the key factors driving
model performance and how to choose the most suitable reward models.
Language Modeling
(Nov 13): 7:45-8:45 (Morning) - Room: Gather
may still generate responses that sound plausible but contradict factual knowledge, a phenomenon known as hallucination. In this paper, we
demonstrate the feasibility of mitigating hallucinations by verifying and minimizing the inconsistency between external knowledge present in
the alignment data and the intrinsic knowledge embedded within foundation LLMs. Specifically, we propose a novel approach called Knowl-
edge Consistent Alignment (KCA), which employs a well-aligned LLM to automatically formulate assessments based on external knowledge
to evaluate the knowledge boundaries of foundation LLMs. To address knowledge inconsistencies in the alignment data, KCA implements
several specific strategies to deal with these data instances. We demonstrate the superior efficacy of KCA in reducing hallucinations across
six benchmarks, utilizing foundation LLMs of varying backbones and scales. This confirms the effectiveness of mitigating hallucinations by
reducing knowledge inconsistency. Our code, model weights, and data are openly accessible at https://github.com/fanqiwan/KCA.
(Nov 13): 7:45-8:45 (Morning) - Gather
Retrieved In-Context Principles from Previous Mistakes
Hao Sun, Yong Jiang, Bo Wang, Yingyan Hou, Yan Zhang, Pengjun Xie, Fei Huang
In-context learning (ICL) has been instrumental in adapting large language models (LLMs) to downstream tasks using correct input-output
examples. Recent advances have attempted to improve model performance through principles derived from mistakes, yet these approaches
suffer from lack of customization and inadequate error coverage. To address these limitations, we propose Retrieved In-Context Principles
(RICP), a novel teacher-student framework. In RICP, the teacher model analyzes mistakes from the student model to generate reasons and
insights for preventing similar mistakes. These mistakes are clustered by their underlying causes to develop task-level principles,
enhancing the error coverage of principles. During inference, the most relevant mistakes for each question are retrieved to create question-level
principles, improving the customization of the provided guidance. RICP is orthogonal to existing prompting methods and does not require
intervention from the teacher model during inference. Experimental results across seven reasoning benchmarks reveal that RICP effectively
enhances performance when applied to various prompting strategies.
(Nov 13): 7:45-8:45 (Morning) - Gather
KNN-Instruct: Automatic Instruction Construction with K Nearest Neighbor Deduction
Jianshang Kou, Benfeng Xu, Chiwei Zhu, Zhendong Mao
Supervised fine-tuning (SFT) is a critical procedure for aligning large language models. Despite its efficiency, the construction of SFT data
often struggles with issues of quality, diversity, and scalability. Many existing methods, inspired by the Self-Instruct framework, typically
generate synthetic instructions by prompting aligned proprietary models like ChatGPT. However, such a process suffers from a stale distribution,
resulting in instructions that are merely trivial variations of existing ones. In this paper, we introduce a novel bootstrapping approach termed
KNN-Instruct, which incorporates KNN deduction to produce meaningful new instructions by effectively summarizing and learning from
similar existing ones. We conduct an economical controlled experiment to preliminarily validate its effectiveness. In a further experiment,
we construct a high-quality SFT dataset named KNN-Inst-12k*. Applying the dataset to Qwen-2-7B, we get a MT-Bench score of 7.64,
which outperforms all 7B models on the LMSYS leaderboard, including Starling-LM-7B (7.48), OpenChat-3.5 (7.06) and Zephyr-7B-beta
(6.53). Our code and data are available at https://github.com/CrossmodalGroup/KNN-Instruct/.
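
A minimal sketch of one bootstrapping round in this spirit, assuming hypothetical embed(texts) and chat(prompt) helpers (neither is taken from the paper's repository, and the prompt wording is ours):

    import numpy as np

    def knn_instruct_round(instructions, embed, chat, k=3):
        # embed(texts) -> (n, d) array; chat(prompt) -> str (an aligned LLM).
        embs = embed(instructions)
        embs /= np.linalg.norm(embs, axis=1, keepdims=True)
        new_instructions = []
        for i, inst in enumerate(instructions):
            sims = embs @ embs[i]
            sims[i] = -np.inf  # exclude the seed itself
            neighbors = [instructions[j] for j in np.argsort(-sims)[:k]]
            prompt = (
                "Here are some related instructions:\n- "
                + "\n- ".join([inst] + neighbors)
                + "\nWrite ONE new instruction that generalizes what these have "
                  "in common but is not a trivial paraphrase of any of them."
            )
            new_instructions.append(chat(prompt))
        return new_instructions

Summarizing neighborhoods, rather than mutating single seeds, is what lets the synthesized instructions drift away from the stale seed distribution.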
(Nov 13): 7:45-8:45 (Morning) - Gather
InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance
Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Mozhi Zhang, Ke Ren, Botian Jiang, Xipeng Qiu
As large language models (LLMs) rapidly evolve, they are increasingly being customized through fine-tuning to suit the specific needs of
various applications. A critical aspect of this advancement is the alignment process, which ensures that these models perform tasks in ways
that align with human values and expectations. Current alignment methods, such as direct preference optimization (DPO) and reinforcement
learning from human feedback (RLHF), focus primarily on alignment during the training phase. However, these methods often involve complex and resource-intensive training processes, posing a significant challenge for their implementation. Therefore, we propose InferAligner, a simple yet effective method for harmlessness alignment during the inference phase. InferAligner decouples harmlessness from helpfulness. During
the training phase, it focuses solely on enhancing the target model’s capabilities on downstream tasks. In the inference phase, it utilizes safety
steering vectors extracted from the aligned model to guide the target model towards harmlessness alignment. Experimental results show
that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics, as well as to multimodal
large language models (MLLMs) such as LLaVA. It significantly diminishes the attack success rate (ASR) of both harmful instructions and
jailbreak instructions, while maintaining almost unchanged performance in downstream tasks.
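
A rough sketch of inference-time activation steering with a PyTorch forward hook, assuming a LLaMA-style module layout (model.model.layers) and a precomputed steering vector; the paper's actual vector extraction and guidance conditions are more involved than this.

    import torch

    def add_safety_steering(model, layer_idx, steering_vec, alpha=1.0):
        # Shift one decoder layer's hidden states along a safety direction.
        def hook(module, inputs, output):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + alpha * steering_vec.to(
                device=hidden.device, dtype=hidden.dtype)
            if isinstance(output, tuple):
                return (hidden,) + output[1:]
            return hidden
        # The returned handle can be .remove()d to restore the unsteered model.
        return model.model.layers[layer_idx].register_forward_hook(hook)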
(Nov 13): 7:45-8:45 (Morning) - Gather
First Heuristic Then Rational: Dynamic Use of Heuristics in Language Model Reasoning
Yoichi Aoki, Keito Kudo, Tatsuki Kuribayashi, Shusaku Sone, Masaya Taniguchi, Keisuke Sakaguchi, Kentaro Inui
Explicit multi-step reasoning, such as chain-of-thought, is widely adopted in the community to explore the better performance of language
models (LMs). We report on the systematic strategy that LMs use in this process. Our controlled experiments reveal that LMs rely more
heavily on heuristics, such as lexical overlap, in the earlier stages of reasoning when more steps are required to reach an answer. Conversely,
their reliance on heuristics decreases as LMs progress closer to the final answer. This suggests that LMs track only a limited number of future
steps and dynamically combine heuristic strategies with rational ones in solving tasks involving multi-step reasoning.
(Nov 13): 7:45-8:45 (Morning) - Gather
Re-Reading Improves Reasoning in Large Language Models
Xiaohan Xu, Chongyang Tao, Tao Shen, Can Xu, Hongbo Xu, Guodong Long, Jian-Guang Lou, Shuai Ma
To enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs), we introduce a simple, yet general and effective
prompting method, RE2, i.e., Re-Reading the question as input. Unlike most thought-eliciting prompting methods, such as Chain-of-Thought
(CoT), which aim to elicit the reasoning process in the output, RE2 shifts the focus to the input by processing questions twice, thereby enhanc-
ing the understanding process. Consequently, RE2 demonstrates strong generality and compatibility with most thought-eliciting prompting
methods, including CoT. Crucially, RE2 facilitates a "bidirectional" encoding in unidirectional decoder-only LLMs because the first pass
could provide global information for the second pass. We begin with a preliminary empirical study as the foundation of RE2, illustrating its
potential to enable "bidirectional" attention mechanisms. We then evaluate RE2 on extensive reasoning benchmarks across 14 datasets, span-
ning 112 experiments, to validate its effectiveness and generality. Our findings indicate that, with the exception of a few scenarios on vanilla
ChatGPT, RE2 consistently enhances the reasoning performance of LLMs through a simple re-reading strategy. Further analyses reveal RE2’s
adaptability, showing how it can be effectively integrated with different LLMs, thought-eliciting prompting, and ensemble strategies.
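
The method itself reduces to a prompt template; a minimal sketch follows (the exact wording below is ours, not the paper's template):

    def re2_prompt(question: str) -> str:
        # Re-Reading (RE2): present the question twice, then elicit reasoning.
        return (
            f"Q: {question}\n"
            f"Read the question again: {question}\n"
            "A: Let's think step by step."
        )

Because decoder-only models attend only leftward, the second copy of the question can attend to the first, which is what gives the "bidirectional" flavor described above.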
(Nov 13): 7:45-8:45 (Morning) - Gather
Zero-Shot Detection of LLM-Generated Text using Token Cohesiveness
Shixuan Ma, Quan Wang
The increasing capability and widespread usage of large language models (LLMs) highlight the desirability of automatic detection of LLM-
generated text. Zero-shot detectors, due to their training-free nature, have received considerable attention and notable success. In this paper,
we identify a new feature, token cohesiveness, that is useful for zero-shot detection, and we demonstrate that LLM-generated text tends to
exhibit higher token cohesiveness than human-written text. Based on this observation, we devise TOCSIN, a generic dual-channel detection
paradigm that uses token cohesiveness as a plug-and-play module to improve existing zero-shot detectors. To calculate token cohesiveness,
TOCSIN only requires a few rounds of random token deletion and semantic difference measurement, making it particularly suitable for a
practical black-box setting where the source model used for generation is not accessible. Extensive experiments with four state-of-the-art base
detectors on various datasets, source models, and evaluation settings demonstrate the effectiveness and generality of the proposed approach.
Code available at: https://github.com/Shixuan-Ma/TOCSIN.
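
A minimal sketch of the token-cohesiveness statistic, assuming a hypothetical semantic_diff(a, b) helper (the paper measures semantic difference with an off-the-shelf model; how the raw statistic is normalized and plugged into a base detector is a detail not reproduced here):

    import random

    def token_cohesiveness(text, semantic_diff, rounds=8, delete_ratio=0.25):
        # Average semantic change caused by deleting random tokens.
        tokens = text.split()
        diffs = []
        for _ in range(rounds):
            kept = [t for t in tokens if random.random() > delete_ratio]
            diffs.append(semantic_diff(text, " ".join(kept)))
        return sum(diffs) / len(diffs)

Note that only forward passes of a similarity model are needed, which is why the measure works in the black-box setting described above.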
(Nov 13): 7:45-8:45 (Morning) - Gather
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
Rima Hazra, Sayan Layek, Somnath Banerjee, Soujanya Poria
Ensuring the safe alignment of large language models (LLMs) with human values is critical as they become integral to applications like trans-
lation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, making models
vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework enhancing LLM safety across different
scenarios: Base models, Supervised fine-tuned models (SFT), and Edited models. Safety Arithmetic involves Harm Direction Removal to
avoid harmful content and Safety Alignment to promote safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit
instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves
safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.
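
As a rough illustration of "harm direction removal" in parameter space, here is a task-vector-style sketch (our simplification; the framework's actual operations on parameters and activations are more nuanced):

    import torch

    @torch.no_grad()
    def remove_harm_direction(target_sd, base_sd, harmful_sd, lam=0.5):
        # The harm vector is the parameter delta induced by harmful fine-tuning;
        # subtracting a scaled copy steers the target model away from it.
        # All three arguments are state dicts with identical keys/shapes.
        edited = {}
        for name, w in target_sd.items():
            harm_vec = harmful_sd[name] - base_sd[name]
            edited[name] = w - lam * harm_vec
        return edited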
(Nov 13): 7:45-8:45 (Morning) - Gather
Improving Referring Ability for Biomedical Language Models
Junfeng Jiang, Fei Cheng, Akiko Aizawa
Existing auto-regressive large language models (LLMs) are primarily trained using documents from general domains. In the biomedical
domain, continual pre-training is a prevalent method for domain adaptation to inject professional knowledge into powerful LLMs that have
been pre-trained in general domains. Previous studies typically conduct standard pre-training by randomly packing multiple documents into a
long pre-training sequence. Recently, some existing works suggest that enhancing the relatedness of documents within the same pre-training
sequence may be advantageous. However, these studies primarily focus on general domains, which cannot be readily applied in the biomed-
ical domain where the distinction of fine-grained topics is harder. Is it possible to further improve the pre-training for biomedical language
models (LMs) using exactly the same corpus? In this paper, we explore an improved approach to continual pre-training, which is a prevalent
method for domain adaptation, by utilizing information from the citation network in this challenging scenario. Empirical studies demonstrate
that our proposed LinkLM data improves both the intra-sample and inter-sample referring abilities of auto-regressive LMs in the biomedical
domain, encouraging more profound consideration of task-specific pre-training sequence design for continual pre-training.
ters on a 300B-token corpus, and evaluated with a total of 1,376 intermediate checkpoints on various downstream tasks. These tasks include
validation loss, commonsense reasoning, and information retrieval and generation. The study reveals that existing linear complexity language
models exhibit similar scaling capabilities as conventional transformer-based models while also demonstrating superior linguistic proficiency
and knowledge retention.
(Nov 13): 7:45-8:45 (Morning) - Gather
Null-Shot Prompting: Rethinking Prompting Large Language Models With Hallucination
Pittawat Taveekitworachai, Febri Abdullah, Ruck Thawonmas
This paper presents a series of investigations into an interesting phenomenon where we observe performance increases in large language mod-
els (LLMs) when providing a prompt that causes and exploits hallucination. We propose null-shot prompting, a counter-intuitive approach
where we intentionally instruct LLMs to look at and utilize information from a null section. We investigate null-shot prompting on a wide
range of tasks, including arithmetic reasoning, commonsense reasoning, and reading comprehension. We observe a substantial increase in
performance in arithmetic reasoning tasks for various models, with up to a 44.62% increase compared to a baseline in one model. We therefore investigate this task more deeply using a more challenging mathematics problem-solving benchmark. We observe that LLMs benefit
from hallucination in null-shot prompting in this task and discuss the mathematical topics that benefit the most from introducing hallucination
in the prompt. We continue our investigation by evaluating hallucination detection abilities of the LLMs when using null-shot prompting.
We find surprising results where hallucination in prompts can improve hallucination detection abilities of many LLMs. We also examine the
effects of introducing both reasoning, which is known to mitigate hallucination, and hallucination simultaneously in the prompt and observe
another surprising turn for the mathematics problem-solving benchmark with many performance improvements. We hope this paper will spark more interest, investigations, and discussions on how hallucination in prompts affects LLMs and even bolsters them in certain cases.
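
Null-shot prompting is again just a template; a minimal sketch with our own wording (the key is the reference to an "Examples" section that does not exist):

    def null_shot_prompt(task_instruction: str, question: str) -> str:
        # Deliberately point the model at a nonexistent section of the prompt.
        return (
            f"{task_instruction}\n"
            "Look at the examples in the 'Examples' section and utilize "
            "information from that section to perform the following task.\n"
            f"{question}"
        )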
(Nov 13): 7:45-8:45 (Morning) - Gather
LLaMAX: Scaling Linguistic Horizons of LLM by Enhancing Translation Capabilities Beyond 100 Languages
Yinquan Lu, Wenhao Zhu, Lei Li, Yu Qiao, Fei Yuan
Large Language Models (LLMs) demonstrate remarkable translation capabilities in high-resource language tasks, yet their performance in
low-resource languages is hindered by insufficient multilingual data during pre-training. To address this, we conduct extensive multilingual
continual pre-training on the LLaMA series models, enabling translation support across more than 100 languages. Through a comprehensive
analysis of training strategies, such as vocabulary expansion and data augmentation, we develop LLaMAX. Remarkably, without sacrificing
its generalization ability, LLaMAX achieves significantly higher translation performance compared to existing open-source LLMs (by more
than 10 spBLEU points) and performs on par with the specialized translation model M2M-100-12B on the Flores-101 benchmark. Extensive experiments indicate that LLaMAX can serve as a robust multilingual foundation model. The code (https://github.com/CONE-MT/LLaMAX/) and the models (https://huggingface.co/LLaMAX/) are publicly available.
(Nov 13): 7:45-8:45 (Morning) - Gather
Tokenization Falling Short: The Curse of Tokenization
Yekun Chai, Yewei Fang, Qiwei Peng, Xuhong Li
Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations while remaining largely oblivious to the internal structure of tokens: issues we term *the curse of tokenization*.
In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems.
This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex
problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parame-
ters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our
experiments show that subword regularization such as BPE-dropout can mitigate this issue. We release our evaluation code and data at
https://github.com/FloatAI/TKEval.
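
For readers who want to probe this failure mode themselves, here is a small character-level perturbation harness in the spirit of the paper's typographical-variation tests (the operations and rates are our choices, not the paper's protocol):

    import random

    def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
        # Randomly drop, swap, or repeat alphabetic characters.
        rng = random.Random(seed)
        chars = list(text)
        out = []
        i = 0
        while i < len(chars):
            if chars[i].isalpha() and rng.random() < rate:
                op = rng.choice(("drop", "swap", "repeat"))
                if op == "drop":
                    i += 1
                    continue
                if op == "swap" and i + 1 < len(chars):
                    out.extend([chars[i + 1], chars[i]])
                    i += 2
                    continue
                out.extend([chars[i], chars[i]])  # repeat
                i += 1
                continue
            out.append(chars[i])
            i += 1
        return "".join(out)

Comparing task accuracy on clean versus perturbed inputs gives a simple robustness estimate of the kind the paper reports.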
(Nov 13): 7:45-8:45 (Morning) - Gather
On Training Data Influence of GPT Models
Qingyi Liu, Yekun Chai, Shuohuan Wang, Yu Sun, Qiwei Peng, Hua Wu
Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models
is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training
examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instances on perfor-
mance trajectories, such as loss and other key metrics, on targeted test points but also enables a comprehensive comparison with existing
methods across various training scenarios in GPT models, ranging from 14 million to 2.8 billion parameters, across a range of downstream
tasks. Contrary to earlier methods that struggle with generalization to new data, GPTfluence introduces a parameterized simulation of training
dynamics, demonstrating robust generalization capabilities to unseen training data. This adaptability is evident across both fine-tuning and
instruction-tuning scenarios, spanning tasks in natural language understanding and generation. We make our code and data publicly available
at https://github.com/ernie-research/gptfluence.
optimal lexical systems are those where multiple words can apply to the same referent, conveying different amounts of information. Such
systems allow speakers to maximize communication accuracy and minimize the amount of information they convey when communicating
about referents in contexts.
(Nov 13): 7:45-8:45 (Morning) - Gather
HiCon-EG: Improving the Abstraction Ability of Language Models with Hierarchical Conceptual Entailment Graphs
Juncai Li, Ru Li, Xiaoli Li, Qinghua Chai, Jeff Z. Pan
The abstract inference capability of the Language Model plays a pivotal role in boosting its generalization and reasoning prowess in Nat-
ural Language Inference (NLI). Entailment graphs are crafted precisely for this purpose, focusing on learning entailment relations among
predicates. Yet, prevailing approaches overlook the *polysemy* and *hierarchical nature of concepts* during entity conceptualization. This
oversight disregards how arguments might entail differently across various concept levels, thereby missing potential entailment connec-
tions. To tackle this hurdle, we introduce the *concept pyramid* and propose the HiCon-EG (Hierarchical Conceptual Entailment Graph)
framework, which organizes arguments hierarchically, delving into entailment relations at diverse concept levels. By learning entailment rela-
tionships at different concept levels, the model is guided to better understand concepts so as to improve its abstract inference capabilities. Our
method enhances scalability and efficiency in acquiring common-sense knowledge by leveraging statistical language distributions instead of manual labeling. Experimental results show that entailment relations derived from HiCon-EG significantly bolster abstract detection tasks. Our code is available at https://github.com/SXUCFN/HiCon-EG.
(Nov 13): 7:45-8:45 (Morning) - Gather
Locally Measuring Cross-lingual Lexical Alignment: A Domain and Word Level Perspective
Taelin Karidi, Eitan Grossman, Omri Abend
NLP research on aligning lexical representation spaces to one another has so far focused on aligning language spaces in their entirety. How-
ever, cognitive science has long focused on a local perspective, investigating whether translation equivalents truly share the same meaning or
the extent that cultural and regional influences result in meaning variations. With recent technological advances and the increasing amounts of
available data, the longstanding question of cross-lingual lexical alignment can now be approached in a more data-driven manner. However,
developing metrics for the task requires some methodology for comparing metric efficacy. We address this gap and present a methodology
for analyzing both synthetic validations and a novel naturalistic validation using lexical gaps in the kinship domain. We further propose new
metrics, hitherto unexplored on this task, based on contextualized embeddings. Our analysis spans 16 diverse languages, demonstrating that
there is substantial room for improvement with the use of newer language models. Our research paves the way for more accurate and nuanced
cross-lingual lexical alignment methodologies and evaluation.
(Nov 13): 7:45-8:45 (Morning) - Gather
Detecting Subtle Differences between Human and Model Languages Using Spectrum of Relative Likelihood
Yang Xu, Yu Wang, Hao An, Yongyuan Li, Zhichen Liu
Human and model-generated texts can be distinguished by examining the magnitude of likelihood in language. However, it is becoming
increasingly difficult as language models' capabilities of generating human-like texts keep evolving. This study provides a new perspective
by using the relative likelihood values instead of absolute ones, and extracting useful features from the spectrum-view of likelihood for the
human-model text detection task. We propose a detection procedure with two classification methods, supervised and heuristic-based, respec-
tively, which results in competitive performances with previous zero-shot detection methods and a new state-of-the-art on short-text detection.
Our method can also reveal subtle differences between human and model languages, which find theoretical roots in psycholinguistics studies.
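
A minimal sketch of the spectrum view (our illustration: the mean-centering stands in for "relative" likelihood, and the binning and sizes are our choices, not the paper's exact feature design):

    import numpy as np

    def likelihood_spectrum_features(token_logprobs, n_bins=16):
        # FFT magnitude of the mean-centered token log-probability sequence,
        # pooled into a fixed-length, length-independent feature vector.
        x = np.asarray(token_logprobs, dtype=float)
        x = x - x.mean()  # relative, not absolute, likelihood
        spec = np.abs(np.fft.rfft(x))
        bins = np.array_split(spec, n_bins)
        return np.array([b.mean() if len(b) else 0.0 for b in bins])

Such features can feed either a supervised classifier or a simple heuristic threshold, matching the two detection methods described above.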
Effective Demonstration Annotation for In-Context Learning via Language Model-Based Determinantal Point Process
Peng Wang, Xiaobin Wang, Chao Lou, Shengyu Mao, Pengjun Xie, Yong Jiang
In-context learning (ICL) is a few-shot learning paradigm that involves learning mappings through input-output pairs and appropriately ap-
plying them to new instances. Despite the remarkable ICL capabilities demonstrated by Large Language Models (LLMs), existing works are
highly dependent on large-scale labeled support sets, not always feasible in practical scenarios. To refine this approach, we focus primarily on
an innovative selective annotation mechanism, which precedes the standard demonstration retrieval. We introduce the Language Model-based Determinantal Point Process (LM-DPP) that simultaneously considers the uncertainty and diversity of unlabeled instances for optimal selection. Consequently, this yields a subset for annotation that strikes a trade-off between the two factors. We apply LM-DPP to various language models, including GPT-J, LLaMA, and GPT-3. Experimental results on 9 NLU and 2 Generation datasets demonstrate that LM-DPP can
effectively select canonical examples. Further analysis reveals that LLMs benefit most significantly from subsets that are both low uncertainty
and high diversity.
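
The uncertainty/diversity trade-off can be made concrete with a standard greedy MAP sketch for a quality-weighted DPP kernel (a generic illustration of the selection principle under our assumptions, not the paper's implementation):

    import numpy as np

    def greedy_dpp_select(embs, quality, k):
        # Kernel L = diag(q) S diag(q): q (positive) encodes LM uncertainty,
        # S is cosine similarity, so the determinant rewards diverse, uncertain sets.
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        L = np.outer(quality, quality) * (embs @ embs.T)
        selected = []
        for _ in range(k):
            best, best_logdet = -1, -np.inf
            for i in range(len(embs)):
                if i in selected:
                    continue
                idx = selected + [i]
                sign, logdet = np.linalg.slogdet(L[np.ix_(idx, idx)])
                if sign > 0 and logdet > best_logdet:
                    best, best_logdet = i, logdet
            if best < 0:  # numerically degenerate; stop early
                break
            selected.append(best)
        return selected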
(Nov 13): 7:45-8:45 (Morning) - Gather
QUIK: Towards End-to-end 4-Bit Inference on Generative Large Language Models
Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh
Large Language Models (LLMs) from the GPT family have become extremely popular, leading to a race towards reducing their inference
costs to allow for efficient local computation. However, the vast majority of existing work focuses on weight-only quantization, which can re-
duce runtime costs in the memory-bound one-token-at-a-time generative setting, but does not address costs in compute-bound scenarios, such
as batched inference or prompt processing. In this paper, we address the general quantization problem, where both weights and activations
should be quantized, which leads to computational improvements in general. We show that the majority of inference computations for large
generative models can be performed with both weights and activations being cast to 4 bits, while at the same time maintaining good accuracy.
We achieve this via a hybrid quantization strategy called QUIK that compresses most of the weights and activations to 4-bit, while keeping a small fraction of “outlier” weights and activations in higher precision. Crucially, QUIK is designed with computational efficiency in mind:
we provide GPU kernels matching the QUIK format with highly-efficient layer-wise runtimes, which lead to practical end-to-end throughput
improvements of up to 3.4x relative to FP16 execution. We provide detailed studies for models from the OPT, LLaMA-2 and Falcon families,
as well as a first instance of accurate inference using quantization plus 2:4 sparsity. Anonymized code is available.
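
A toy illustration of the weight-side outlier split (activation quantization, calibration, and QUIK's GPU kernels are beyond this sketch, and the column-wise heuristic below is our simplification):

    import torch

    def split_and_quantize(W, outlier_frac=0.01):
        # Keep the highest-magnitude columns in full precision and
        # symmetrically round the rest to 4-bit levels (per column).
        n_out = max(1, int(W.shape[1] * outlier_frac))
        col_mag = W.abs().amax(dim=0)
        mask = torch.zeros(W.shape[1], dtype=torch.bool, device=W.device)
        mask[torch.topk(col_mag, n_out).indices] = True
        base = W[:, ~mask]
        scale = base.abs().amax(dim=0).clamp(min=1e-8) / 7.0  # int4 range [-8, 7]
        q = torch.clamp(torch.round(base / scale), -8, 7)
        return q * scale, W[:, mask], mask  # dequantized base, outliers, mask

Isolating the few outlier columns is what keeps the 4-bit rounding error of the remaining weights small enough to preserve accuracy.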
ically, they can be fine-tuned solely on tasks in the source language and subsequently applied to tasks in the target language. However, for
low-resource languages unseen during pre-training, relying solely on zero-shot language transfer often yields sub-optimal results. One com-
mon strategy is to continue training PLMs using masked language modeling objectives on the target language. Nonetheless, this approach can
be inefficient due to the need to adjust all parameters for language adaptation. In this paper, we propose a more efficient solution: soft-prompt
tuning for language adaptation. Our experiments demonstrate that with carefully designed prompts, soft-prompt tuning enables mPLMs to
achieve effective zero-shot cross-lingual transfer to downstream tasks in previously unseen languages. Notably, we found that prompt tuning
outperforms continually trained baselines on two text classification benchmarks, encompassing 20 low-resource languages while utilizing a
mere 0.28% of the tuned parameters. These results underscore the superior adaptability of mPLMs to previously unseen languages afforded
by soft-prompt tuning compared to traditional fine-tuning methods.
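
A minimal soft-prompt module as a sketch of the technique (generic, not the paper's exact configuration): a handful of trainable vectors are prepended to the input embeddings while the mPLM itself stays frozen.

    import torch
    import torch.nn as nn

    class SoftPrompt(nn.Module):
        def __init__(self, n_tokens: int, d_model: int):
            super().__init__()
            # The only trainable parameters for language adaptation.
            self.prompt = nn.Parameter(torch.randn(n_tokens, d_model) * 0.02)

        def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
            batch = input_embeds.shape[0]
            p = self.prompt.unsqueeze(0).expand(batch, -1, -1)
            return torch.cat([p, input_embeds], dim=1)

Because only n_tokens x d_model parameters are updated, the tiny tuned-parameter fraction reported above follows directly from this design.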
Machine Translation
(Nov 13): 7:45-8:45 (Morning) - Room: Gather
natural language processing (NLP), especially in Grammatical Error Correction (GEC). This work explores the complexities of applying GEC
systems to CSW texts. Our objectives include evaluating the performance of state-of-the-art GEC systems on an authentic CSW dataset from
English as a Second Language (ESL) learners, exploring synthetic data generation as a solution to data scarcity, and developing a model
capable of correcting grammatical errors in monolingual and CSW texts. We generated synthetic CSW GEC data, resulting in one of the first
substantial datasets for this task, and showed that a model trained on this data is capable of significant improvements over existing systems.
This work targets ESL learners, aiming to provide educational technologies that aid in the development of their English grammatical correct-
ness without constraining their natural multilingualism.
(Nov 13): 7:45-8:45 (Morning) - Gather
Compression Parity: Measuring and Predicting the Multilingual Capabilities of Language Models
Alexander Tsvetkov, Alon Kipnis
Large Language Models (LLMs) are increasingly deployed in user-facing applications worldwide, necessitating handling multiple languages
across various tasks. We propose a metric called Information Parity (IP) that can predict an LLM’s capabilities across multiple languages in
a task-agnostic manner. IP is well-motivated from an information-theoretic perspective: it is associated with the LLM's efficiency of com-
pressing the text in a given language compared to a reference language. We evaluate IP and other popular metrics such as Tokenization Parity
(TP) and Tokenizer Fertility (TF) on several variants of open-sourced LLMs (Llama2, Gemma, Mistral). Among all metrics known to us, IP
is better correlated with existing task-specific benchmark scores from the literature and thus better predicts such scores in a certain language.
These findings show that IP may be useful for ranking multilingual LLMs’ capabilities regardless of the downstream task.
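
A minimal sketch of the underlying quantity, assuming a HuggingFace-style causal LM API; the ratio convention at the end is our reading of the compression-efficiency idea and may differ in detail from the paper's definition.

    import math
    import torch

    @torch.no_grad()
    def total_nll_bits(model, tokenizer, text):
        # Total negative log-likelihood in bits: a proxy for the model's
        # codelength when "compressing" the text.
        ids = tokenizer(text, return_tensors="pt").input_ids
        loss = model(ids, labels=ids).loss  # mean nats per predicted token
        return loss.item() * (ids.shape[1] - 1) / math.log(2)

    def information_parity(model, tokenizer, text_lang, text_ref):
        # Compare codelengths of a parallel text in language L and in the
        # reference language under the same model.
        return total_nll_bits(model, tokenizer, text_ref) / total_nll_bits(
            model, tokenizer, text_lang)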
(Nov 13): 7:45-8:45 (Morning) - Gather
PreAlign: Boosting Cross-Lingual Transfer by Early Establishment of Multilingual Alignment
Jiahuan Li, Shujian Huang, Aarron Ching, Xinyu Dai, Jiajun Chen
Large language models demonstrate reasonable multilingual abilities, despite predominantly English-centric pretraining. However, the spon-
taneous multilingual alignment in these models is shown to be weak, leading to unsatisfactory cross-lingual transfer and knowledge sharing.
Previous works attempt to address this issue by explicitly injecting multilingual alignment information during or after pretraining. In the early stages of pretraining, however, the alignment is still too weak to support sharing information or knowledge across languages. In this paper, we propose
PreAlign, a framework that establishes multilingual alignment prior to language model pretraining. PreAlign injects multilingual alignment
by initializing the model to generate similar representations of aligned words and preserves this alignment using a code-switching strategy
during pretraining. Extensive experiments in a synthetic English to English-Clone setting demonstrate that PreAlign significantly outperforms
standard multilingual joint training in language modeling, zero-shot cross-lingual transfer, and cross-lingual knowledge application. Further
experiments in real-world scenarios further validate PreAlign’s effectiveness across various model sizes.
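
The code-switching component is simple to sketch (a generic illustration under our assumptions; the lexicon construction and switching schedule in the paper are more elaborate):

    import random

    def code_switch(tokens, lexicon, p=0.05, seed=0):
        # Randomly substitute words with translations from a bilingual lexicon,
        # so aligned word pairs keep co-occurring throughout pretraining.
        rng = random.Random(seed)
        return [lexicon[t] if t in lexicon and rng.random() < p else t
                for t in tokens]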
mantics between objects. This complexity and diversity in SGG leads to underrepresentation, where parts of triplet labels are rare or even
unseen during training, resulting in imprecise predictions. To tackle this, we propose integrating the pretrained Vision-language Models to
enhance representation. However, due to the gap between pretraining and SGG, direct inference of pretrained VLMs on SGG leads to severe
bias, which stems from the imbalanced predicates distribution in the pretraining language set. To alleviate the bias, we introduce a novel LM
Estimation to approximate the unattainable predicates distribution. Finally, we ensemble the debiased VLMs with SGG models to enhance
the representation, where we design a certainty-aware indicator to score each sample and dynamically adjust the ensemble weights. Our
training-free method effectively addresses the predicate bias in pretrained VLMs, enhances SGG's representation, and significantly improves performance.
(Nov 13): 7:45-8:45 (Morning) - Gather
Enhancing Advanced Visual Reasoning Ability of Large Language Models
Zhiyuan Li, Dongnan Liu, Chaoyi Zhang, Heng Wang, Tengfei Xue, Weidong Cai
Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning, challenging models’
advanced reasoning ability. Traditional Vision-Language models (VLMs) perform well in visual perception tasks while struggling with com-
plex reasoning scenarios. Conversely, Large Language Models (LLMs) demonstrate robust text reasoning capabilities; however, they lack
visual acuity. To bridge this gap, we propose Complex Visual Reasoning Large Language Models (CVR-LLM), capitalizing on VLMs' visual perception proficiency and LLMs' extensive reasoning capability. Unlike recent multimodal large
language models (MLLMs) that require a projection layer, our approach transforms images into detailed, context-aware descriptions using
an iterative self-refinement loop and leverages LLMs’ text knowledge for accurate predictions without extra training. We also introduce
a novel multi-modal in-context learning (ICL) methodology to enhance LLMs’ contextual understanding and reasoning. Additionally, we
introduce Chain-of-Comparison (CoC), a step-by-step comparison technique that enables contrasting various aspects of predictions. Our CVR-LLM presents the first comprehensive study across a wide array of complex visual reasoning tasks and achieves SOTA performance among all compared methods.
(Nov 13): 7:45-8:45 (Morning) - Gather
MantisScore: A Reliable Fine-grained Metric for Video Generation
Xuan He, Dongfu Jiang, Ge Zhang, Max Ku, Achint Soni, Sherman Siu, Haonan Chen, Abhranil Chandra, Ziyan Jiang, Aaran Arulraj, Kai
Wang, Quy Duc Do, Yuansheng Ni, Bohan Lyu, Yaswanth Narsupalli, Rongqi Fan, Zhiheng Lyu, Bill Yuchen Lin, Wenhu Chen
Recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind: none of the existing metrics is able to provide reliable scores for generated videos. The main barrier is the lack of a large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect scores over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman's correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further results on the other held-out EvalCrafter, GenAI-Bench, and VBench benchmarks show that VideoScore has consistently much higher correlation with human judges than other metrics. Given these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress and (2) simulate fine-grained human feedback in Reinforcement Learning from Human Feedback (RLHF) to improve current video generation models.
(Nov 13): 7:45-8:45 (Morning) - Gather
Empowering Large Language Model for Continual Video Question Answering with Collaborative Prompting
Chen Cai, Zheng Wang, Jianjun Gao, Wenyang Liu, Ye Lu, Runzhong Zhang, Kim-Hui Yap
In recent years, the rapid increase in online video content has underscored the limitations of static Video Question Answering (VideoQA)
models trained on fixed datasets, as they struggle to adapt to new questions or tasks posed by newly available content. In this paper, we explore
the novel challenge of VideoQA within a continual learning framework, and empirically identify a critical issue: fine-tuning a large language
model (LLM) for a sequence of tasks often results in catastrophic forgetting. To address this, we propose Collaborative Prompting (ColPro),
which integrates specific question constraint prompting, knowledge acquisition prompting, and visual temporal awareness prompting. These
prompts aim to capture textual question context, visual content, and video temporal dynamics in VideoQA, a perspective underexplored in
prior research. Experimental results on the NExT-QA and DramaQA datasets show that ColPro achieves superior performance compared to
existing approaches, achieving 55.14% accuracy on NExT-QA and 71.24% accuracy on DramaQA, highlighting its practical relevance and
effectiveness.
(Nov 13): 7:45-8:45 (Morning) - Gather
World to Code: Multi-modal Data Generation via Self-Instructed Compositional Captioning and Filtering
Jiacong Wang, Bohong Wu, Haiyong Jiang, Haoyuan Guo, Xin Xiao, zhou Xun, Jun Xiao
Recent advances in Vision-Language Models (VLMs) and the scarcity of high-quality multi-modal alignment data have inspired numerous
researches on synthetic VLM data generation. The conventional norm in VLM data construction uses a mixture of specialists in caption and
OCR, or stronger VLM APIs and expensive human annotation. In this paper, we present World to Code (W2C), a meticulously curated multi-
modal data construction pipeline that organizes the final generation output into a Python code format. The pipeline leverages the VLM itself to
extract cross-modal information via different prompts and filter the generated outputs again via a consistency filtering strategy. Experiments
have demonstrated the high quality of W2C by improving various existing visual question answering and visual grounding benchmarks
across different VLMs. Further analysis also demonstrates that the new code parsing ability of VLMs presents better cross-modal equivalence
than the commonly used detail caption ability. Our code is available at https://github.com/foundation-multimodal-models/World2Code.
(Nov 13): 7:45-8:45 (Morning) - Gather
RWKV-CLIP: A Robust Vision-Language Representation Learner
Tiancheng Gu, Kaicheng Yang, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng
Contrastive Language-Image Pre-training (CLIP) has significantly improved performance in various vision-language tasks by expanding the
dataset with image-text pairs obtained from the web. This paper further explores CLIP from the perspectives of data and model architecture.
To mitigate the impact of noisy data and enhance the quality of large-scale image-text data crawled from the internet, we introduce a di-
verse description generation framework that can leverage Large Language Models (LLMs) to combine and refine information from web-based
image-text pairs, synthetic captions, and detection tags. Additionally, we propose RWKV-CLIP, the first RWKV-driven vision-language rep-
resentation learning model that combines the effective parallel training of transformers with the efficient inference of RNNs. Extensive experi-
ments across different model scales and pre-training datasets demonstrate that RWKV-CLIP is a robust vision-language representation learner
and it achieves state-of-the-art performance across multiple downstream tasks, including linear probing, zero-shot classification, and zero-shot
image-text retrieval. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/RWKV-CLIP.
bols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical
for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables
are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering
categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique properties over mere visual em-
beddings, such as explainability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that
are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations.
Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural
and text-based representations. Moreover, they consistently enhance state-of-the-art multi-modal large language models across diverse bench-
marks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
state-of-the-art specialist models and MLLMs. Our findings indicate that there remains a significant gap in achieving satisfactory grounding
performance. We anticipate that our dataset will inspire new approaches to enhance visual reasoning and develop more advanced cross-modal
interaction strategies, ultimately unlocking the full potential of MLLMs.
(Nov 13): 7:45-8:45 (Morning) - Gather
VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models
Jiapeng Wang, Chengyu Wang, Kunzhe Huang, Jun Huang, Lianwen Jin
Contrastive Language-Image Pre-training (CLIP) has been widely studied and applied in numerous applications. However, the emphasis on
brief summary texts during pre-training prevents CLIP from understanding long descriptions. This issue is particularly acute regarding videos
given that videos often contain abundant detailed contents. In this paper, we propose the VideoCLIP-XL (eXtra Length) model, which aims
to unleash the long-description understanding capability of video CLIP models. Firstly, we establish an automatic data collection system and
gather a large-scale VILD pre-training dataset with VIdeo and Long-Description pairs. Then, we propose Text-similarity-guided Primary
Component Matching (TPCM) to better learn the distribution of feature space while expanding the long description capability. We also
introduce two new tasks namely Detail-aware Description Ranking (DDR) and Hallucination-aware Description Ranking (HDR) for further
understanding improvement. Finally, we construct a Long Video Description Ranking (LVDR) benchmark for evaluating the long-description
capability more comprehensively. Extensive experimental results on widely-used text-video retrieval benchmarks with both short and long
descriptions and our LVDR benchmark can fully demonstrate the effectiveness of our method.
(Nov 13): 7:45-8:45 (Morning) - Gather
In-Context Compositional Generalization for Large Vision-Language Models
Chuanhao Li, Chenchen Jing, Zhen Li, Mingliang Zhai, Yuwei Wu, Yunde Jia
Recent work has revealed that in-context learning for large language models exhibits compositional generalization capacity, which can be
enhanced by selecting in-context demonstrations similar to test cases to provide contextual information. However, how to exhibit in-context
compositional generalization (ICCG) of large vision-language models (LVLMs) is non-trivial. Due to the inherent asymmetry between visual
and linguistic modalities, ICCG in LVLMs faces an inevitable challenge—redundant information on the visual modality. The redundant
information affects in-context learning from two aspects: (1) Similarity calculation may be dominated by redundant information, resulting
in sub-optimal demonstration selection. (2) Redundant information in in-context demonstrations brings misleading contextual information to
in-context learning. To alleviate these problems, we propose a demonstration selection method to achieve ICCG for LVLMs, by considering
two key factors of demonstrations: content and structure, from a multimodal perspective. Specifically, we design a diversity-coverage-based
matching score to select demonstrations with maximum coverage, and avoid selecting demonstrations with redundant information via their
content redundancy and structural complexity. We build a GQA-ICCG dataset to simulate the ICCG setting, and conduct experiments on
GQA-ICCG and the VQA v2 dataset. Experimental results demonstrate the effectiveness of our method.
(Nov 13): 7:45-8:45 (Morning) - Gather
GRIZAL: Generative Prior-guided Zero-Shot Temporal Action Localization
Onkar Kishor Susladkar, Gayatri Sudhir Deshmukh, Vandan Gorade, Sparsh Mittal
Zero-shot temporal action localization (TAL) aims to temporally localize actions in videos without prior training examples. To address the
challenges of TAL, we offer GRIZAL, a model that uses multimodal embeddings and dynamic motion cues to localize actions effectively.
GRIZAL achieves sample diversity by using large-scale generative models such as GPT-4 for generating textual augmentations and DALL-E
for generating image augmentations. Our model integrates vision-language embeddings with optical flow insights, optimized through a blend
of supervised and self-supervised loss functions. On ActivityNet, Thumos14 and Charades-STA datasets, GRIZAL greatly outperforms state-
of-the-art zero-shot TAL models, demonstrating its robustness and adaptability across a wide range of video content. We will make all the
models and code publicly available by open-sourcing them.
(Grounding-based Metaphor Binding), which illustrates linguistic metaphors from the grounding perspective elaborated through LLMs.
GOME consists of two steps for metaphor illustration, including grounding-based elaboration and scenario visualization. In the elaboration
step, metaphorical knowledge is integrated into systematic instructions for LLMs, which employs a CoT prompting method rooted in rhetoric.
This approach specifies metaphorical devices such as vehicles and groundings to ensure accurate and faithful descriptions consumed by text-to-image models. In the visualization step, an inference-time metaphor binding method is realized based on the elaboration outputs, which registers attentional control during the diffusion process and captures the underlying attributes from the abstract metaphorical domain. Comprehensive evaluations using multiple downstream tasks confirm that GOME is superior to isolated LLMs, diffusion models, or their direct collaboration.
(Nov 13): 7:45-8:45 (Morning) - Gather
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Ji Zhang, Qin Jin, Fei Huang, Jingren Zhou
Structure information is critical for understanding the semantics of text-rich images, such as documents, tables, and charts. Existing Mul-
timodal Large Language Models (MLLMs) for Visual Document Understanding are equipped with text recognition ability but lack general
structure understanding abilities for text-rich document images. In this work, we emphasize the importance of structure information in Visual
Document Understanding and propose Unified Structure Learning to boost the performance of MLLMs. Based on publicly available text-rich
images, we build a comprehensive training set DocStruct4M to support structure-aware parsing tasks and multi-grained text localization tasks
across 5 domains: document, webpage, table, chart, and natural image. To better encode structure information, we design a simple and
effective vision-to-text module H-Reducer, which can not only maintain the layout information but also reduce the length of visual features
by merging horizontal adjacent patches through convolution, enabling the LLM to understand high-resolution images more efficiently. Our
model DocOwl 1.5 achieves state-of-the-art performance on 10 visual document understanding benchmarks. All codes, models, and datasets
are publicly available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/DocOwl1.5.
(Nov 13): 7:45-8:45 (Morning) - Gather
Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training
Wenbo Li, Guohao Li, Zhibin Lan, Xue Xu, Wanru Zhuang, Jiachen Liu, Xinyan Xiao, Jinsong Su
Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images
with legible visual texts. Existing backbone models have limitations such as misspelling, failing to generate texts, and lack of support for
Chinese texts, but their development shows promising potential. In this paper, we propose a series of methods, aiming to empower backbone
models to generate visual texts in English and Chinese. We first conduct a preliminary study revealing that BPE tokenization and insufficient
learning of cross-attention modules restrict the performance of the backbone models. Based on these observations, we make the following
improvements: (1) We design a mixed granularity input strategy to provide more suitable text representations; (2) We propose to augment the
conventional training objective with three glyph-aware training losses, which enhance the learning of cross-attention modules and encourage
the model to focus on visual texts. Through experiments, we demonstrate that our methods can effectively empower backbone models to
generate semantically relevant, aesthetically appealing, and accurate visual text images, while maintaining their fundamental image generation
quality.
(Nov 13): 7:45-8:45 (Morning) - Gather
How Does the Textual Information Affect the Retrieval of Multimodal In-Context Learning?
Yang Luo, Zangwei Zheng, Zirui Zhu, Yang You
The increase in parameter size of multimodal large language models (MLLMs) introduces significant capabilities, particularly multimodal
in-context learning, where MLLMs enhance task performance without updating pre-trained parameters. However, this effectiveness hinges
on the appropriate selection of in-context examples, a process currently biased towards visual data, overlooking textual information. More
importantly, supervised retrievers for multimodal in-context learning, which are crucial for optimal in-context example selection, remain underexplored. Our study provides an in-depth evaluation of the impact of textual information on the unsupervised selection
of in-context examples in multimodal contexts, uncovering a notable sensitivity of retriever performance to the employed modalities. Based
on the above finding, we introduce a novel supervised MLLM prompt retriever MSIER that leverages a trained retriever based on MLLM’s
confidence to select examples, which enhances multimodal in-context learning efficiency. This approach is validated through extensive testing
across three different tasks, demonstrating the method’s effectiveness. Additionally, we investigate the influence of modalities on our super-
vised retrieval method’s training and explore the transferability of the supervised prompt retriever. This exploration paves the way for future
advancements, highlighting the potential for refined in-context learning in MLLMs through the strategic use of multimodal data. The public
code is available at https://github.com/NUS-HPC-AI-Lab/Multimodal-ICL-Retriever.
(Nov 13): 7:45-8:45 (Morning) - Gather
Divide and Conquer Radiology Report Generation via Observation Level Fine-grained Pretraining and Prompt Tuning
Yuanpin Zhou, Huogen Wang
The automation of radiology report generation (RRG) holds immense potential to alleviate radiologists’ workloads and improve diagnostic ac-
curacy. Despite advancements in image captioning and vision-language pretraining, RRG remains challenging due to the lengthy and complex
nature of radiology reports. In this work, we propose the Divide and Conquer Radiology Report Generation (DCRRG) model, which breaks
down full-text radiology reports into concise observation descriptions. This approach enables the model to capture fine-grained representa-
tions from each observation through a two-stage process: an encoding stage focusing on observation prediction tasks to learn fine-grained
representations, and a decoding stage for integrating these descriptions into cohesive and comprehensive radiology reports. Experimental
results on two benchmark datasets demonstrate that DCRRG achieves significant improvements across all evaluated metrics, underscoring its
capability to generate semantically coherent and clinically accurate radiology reports.
NLP Applications
(Nov 13): 7:45-8:45 (Morning) - Room: Gather
user input text can be 100% recovered from the obfuscated embedded vectors. We further analyze security requirements on embedding ob-
fuscation and present several remedies to our proposed attack.
thy: both accurate and well-calibrated (the prediction confidence should align with its ground-truth correctness likelihood). Nowadays,
fine-tuning has become the most popular method for adapting a model to practical usage by significantly increasing accuracy on downstream
tasks. Despite the great accuracy it achieves, we found that fine-tuning is still far from satisfactory in trustworthiness due to "tuning-induced
mis-calibration". In this paper, we delve deeply into why and how mis-calibration exists in fine-tuned models, and how distillation can allevi-
ate the issue. Then we further propose a brand new method named Efficient Trustworthy Distillation (FIRST), which utilizes a small portion
of teacher’s knowledge to obtain a reliable language model in a cost-efficient way. Specifically, we identify the "concentrated knowledge"
phenomenon during distillation, which can significantly reduce the computational burden. Then we apply a "trustworthy maximization" pro-
cess to optimize the utilization of this small portion of concentrated knowledge before transferring it to the student. Experimental results
demonstrate the effectiveness of our method, where better accuracy (+2.3%) and less mis-calibration (-10%) are achieved on average across
both in-domain and out-of-domain scenarios, indicating better trustworthiness.
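To make the "concentrated knowledge" idea concrete, here is a minimal Python sketch of distilling from only the teacher's top-k token probabilities; it illustrates the general mechanism under hypothetical shapes and values, is not the authors' implementation, and omits the "trustworthy maximization" step:

    import torch
    import torch.nn.functional as F

    def topk_distill_loss(student_logits, teacher_logits, k=50, temperature=2.0):
        # Keep only the teacher's top-k probabilities per position (the
        # "concentrated knowledge") and distill toward that renormalized slice.
        t_probs = F.softmax(teacher_logits / temperature, dim=-1)
        top_p, top_idx = t_probs.topk(k, dim=-1)
        top_p = top_p / top_p.sum(dim=-1, keepdim=True)   # renormalize the slice
        s_logp = F.log_softmax(student_logits / temperature, dim=-1)
        return F.kl_div(s_logp.gather(-1, top_idx), top_p, reduction="batchmean")

    student = torch.randn(2, 1000)   # 2 positions, 1000-token toy vocabulary
    teacher = torch.randn(2, 1000)
    print(topk_distill_loss(student, teacher).item())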
(Nov 13): 7:45-8:45 (Morning) - Gather
Leveraging Context-aware Prompting for Commit Message Generation
Zhihua Jiang, Jianwei Chen, Dongning Rao, Guanghui Ye
Writing comprehensive commit messages is tedious yet important, because these messages describe changes of code, such as fixing bugs or
adding new features. However, most existing methods focus on either only the changed lines or nearest context lines, without considering
the effectiveness of selecting useful contexts. On the other hand, it is possible that introducing excessive contexts can lead to noise. To this
end, we propose a code model COMMIT (Context-aware prOMpting based comMIt-message generaTion) in conjunction with a code dataset
CODEC (COntext and metaData Enhanced Code dataset). Leveraging program slicing, CODEC consolidates code changes along with related
contexts via property graph analysis. Further, utilizing CodeT5+ as the backbone model, we train COMMIT via context-aware prompting on
CODEC. Experiments show that COMMIT can surpass all compared models, including pre-trained language models for code (code-PLMs)
such as CommitBART and large language models for code (code-LLMs) such as Code Llama. Besides, we investigate several research ques-
tions (RQs), further verifying the effectiveness of our approach. We release the data and code at: https://github.com/Jnunlplab/COMMIT.git.
(Nov 13): 7:45-8:45 (Morning) - Gather
Improving Knowledge Graph Completion with Structure-Aware Supervised Contrastive Learning
Jiashi Lin, Lifang Wang, Xinyu Lu, Zhongtian Hu, Wei Zhang, Wenxuan Lu
Knowledge Graphs (KGs) often suffer from incomplete knowledge, which restricts their utility. Recently, Contrastive Learning (CL)
has been introduced to Knowledge Graph Completion (KGC), significantly improving the discriminative capabilities of KGC models and
setting new benchmarks in performance. However, existing contrastive methods primarily focus on individual triples, overlooking the broader
structural connectivities and topologies of KGs. This narrow focus limits a comprehensive understanding of the graph’s structural knowledge.
To address this gap, we propose StructKGC, a novel contrastive learning framework designed to flexibly accommodate the diverse topolo-
gies inherent in KGs. Additionally, we introduce four contrastive tasks specifically tailored to KG data: Vertex-level CL, Neighbor-level
CL, Path-level CL, and Relation composition level CL. These tasks are trained synergistically during the fine-tuning of pre-trained language
models (PLMs), allowing for a more nuanced capture of subgraph semantics. To validate the effectiveness of our method, we perform a
comprehensive set of experiments on several real-world datasets. The experimental results demonstrate that our approach achieves SOTA
performance under standard supervised and low-resource settings. Furthermore, the different levels of structure-aware tasks introduced can
mutually reinforce each other, leading to consistent performance improvements.
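As a rough illustration of a structure-aware contrastive task, the sketch below scores a (head, relation) query against several neighbor-level positives with a multi-positive InfoNCE loss; all tensors and shapes are hypothetical, and this is not the StructKGC code:

    import torch
    import torch.nn.functional as F

    def multi_positive_info_nce(query, positives, negatives, tau=0.05):
        # InfoNCE with several positives: here, embeddings of tail entities
        # that are graph neighbors of the (head, relation) query.
        q = F.normalize(query, dim=-1)
        pos = F.normalize(positives, dim=-1) @ q / tau
        neg = F.normalize(negatives, dim=-1) @ q / tau
        # -log( sum over positives / sum over all candidates )
        return torch.logsumexp(torch.cat([pos, neg]), 0) - torch.logsumexp(pos, 0)

    d = 128
    query = torch.randn(d)           # encoding of (head, relation)
    neighbors = torch.randn(4, d)    # neighbor-level positives
    randoms = torch.randn(64, d)     # in-batch negatives
    print(multi_positive_info_nce(query, neighbors, randoms).item())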
(style and topic) of generated data and the generator model are not known in advance. In this work, we focus on the robustness of classifier-
based detectors of AI-generated text, namely their ability to transfer to unseen generators or semantic domains. We investigate the geometry
of the embedding space of Transformer-based text encoders and show that clearing out harmful linear subspaces helps to train a robust clas-
sifier, ignoring domain-specific spurious features. We investigate several subspace decomposition and feature selection strategies and achieve
significant improvements over state-of-the-art methods in cross-domain and cross-generator transfer. Our best approaches for head-wise and
coordinate-based subspace removal increase the mean out-of-distribution (OOD) classification score by up to 9% and 14% in particular setups
for RoBERTa and BERT embeddings respectively. We release our code and data: https://github.com/SilverSolver/RobustATD
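The core operation of clearing out a linear subspace can be shown in a few lines; the directions below are random stand-ins for whatever domain-predictive directions a decomposition finds, so this is a sketch of the general technique rather than the paper's exact procedure:

    import numpy as np

    def remove_subspace(X, directions):
        # Project each row of X onto the orthogonal complement of the span
        # of `directions` (e.g., directions predictive of the text's domain).
        Q, _ = np.linalg.qr(directions.T)   # orthonormal basis of the subspace
        return X - (X @ Q) @ Q.T

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 768))           # pretend sentence embeddings
    domain_dirs = rng.normal(size=(3, 768))   # pretend "harmful" directions
    X_clean = remove_subspace(X, domain_dirs)
    Q, _ = np.linalg.qr(domain_dirs.T)
    print(np.abs(X_clean @ Q).max())          # ~1e-14: subspace removed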
(Nov 13): 7:45-8:45 (Morning) - Gather
Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering
Yubo Wang, Xueguang Ma, Wenhu Chen
Large-scale language models (LLMs) like ChatGPT have demonstrated impressive abilities in generating responses based on human instruc-
tions. However, their use in the medical field can be challenging due to their lack of specific, in-depth knowledge. In this study, we present a
system called LLMs Augmented with Medical Textbooks (LLM-AMT) designed to enhance the proficiency of LLMs in specialized domains.
LLM-AMT integrates authoritative medical textbooks into the LLMs’ framework using plug-and-play modules. These modules include a
Query Augmenter, a Hybrid Textbook Retriever, and a Knowledge Self-Refiner. Together, they incorporate authoritative medical knowledge.
Additionally, an LLM Reader aids in contextual understanding. Our experimental results on three medical QA tasks demonstrate that LLM-
AMT significantly improves response quality, with accuracy gains ranging from 11.6% to 16.6%. Notably, with GPT-4-Turbo as the base
model, LLM-AMT outperforms the specialized Med-PaLM 2 model pre-trained on a massive medical corpus by 2-3%. We found
that, despite being 100× smaller in size, medical textbooks as a retrieval corpus prove to be a more effective knowledge database than
Wikipedia in the medical domain, boosting performance by 7.8%-13.7%.
(Nov 13): 7:45-8:45 (Morning) - Gather
Exploring Open Graph Models with Large Language Models
Lianghao Xia, Ben Kao, Chao Huang
Graph learning has become essential in various domains, including recommendation systems and social network analysis. Graph Neural
Networks (GNNs) have emerged as promising techniques for encoding structural information and improving performance in tasks like link
prediction and node classification. However, a key challenge remains: the difficulty of generalizing to unseen graph data with different prop-
erties. In this work, we propose a novel graph foundation model, called OpenGraph, to address this challenge. Our approach tackles several
technical obstacles. Firstly, we enhance data augmentation using a large language model (LLM) to overcome data scarcity in real-world
scenarios. Secondly, we introduce a unified graph tokenizer that enables the model to generalize effectively to diverse graph data, even when
encountering unseen properties during training. Thirdly, we develop a scalable graph transformer that captures node-wise dependencies within
the global topological context. Extensive experiments validate the effectiveness of our framework. By adapting OpenGraph to new graph
characteristics and comprehending diverse graphs, our approach achieves remarkable zero-shot graph learning performance across various
settings. We release the model implementation at https://github.com/HKUDS/OpenGraph.
(Nov 13): 7:45-8:45 (Morning) - Gather
MM-ChatAlign: A Novel Multimodal Reasoning Framework based on Large Language Models for Entity Alignment
Xuhui Jiang, Yinghan Shen, Zhichao Shi, Chengjin Xu, Wei Li, Huang Zihe, Jian Guo, Yuanzhuo Wang
Multimodal entity alignment (MMEA) integrates multi-source and cross-modal knowledge graphs, a crucial yet challenging task for data-
centric applications. Traditional MMEA methods derive the visual embeddings of entities and combine them with other modal data for align-
ment by embedding similarity comparison. However, these methods are hampered by the limited comprehension of visual attributes and
deficiencies in realizing and bridging the semantics of multimodal data. To address these challenges, we propose MM-ChatAlign, a novel
framework that utilizes the visual reasoning abilities of MLLMs for MMEA. The framework features an embedding-based candidate collec-
tion module that adapts to various knowledge representation strategies, effectively filtering out irrelevant reasoning candidates. Additionally,
a reasoning and rethinking module, powered by MLLMs, enhances alignment by efficiently utilizing multimodal information. Extensive ex-
periments on four MMEA datasets demonstrate MM-ChatAlign's superiority and underscore the significant potential of MLLMs in MMEA
tasks. The source code is available at https://github.com/jxh4945777/MMEA/.
(Nov 13): 7:45-8:45 (Morning) - Gather
Divide and Conquer: Legal Concept-guided Criminal Court View Generation
Qi Xu, Xiao Wei, Hang Yu, Qian Liu, Hao Fei
The Criminal Court View Generation task aims to produce explanations that inform judicial decisions. This necessitates a nuanced un-
derstanding of diverse legal concepts, such as Recidivism, Confess, and Robbery, which often coexist within cases, complicating
holistic analysis. However, existing methods mainly rely on the generation capability of language models, without paying enough atten-
tion to the important legal concepts. To enhance the precision and depth of such explanations, we introduce Legal Concept-guided Criminal
Court Views Generation (LeGen), a three-stage approach designed for iterative reasoning tailored to individual legal constructs. Specifically,
in the first stage, we design a decomposer to divide the court views into focused sub-views, each anchored around a distinct legal con-
cept. Next, a concept reasoning module generates targeted rationales by intertwining the deconstructed facts with their corresponding legal
frameworks, ensuring contextually relevant interpretations. Finally, a verifier and a generator are employed to align the rationale with the
case facts and to synthesize comprehensive and legally sound final court views, respectively. We evaluate LeGen by conducting extensive
experiments on a real-world dataset, and the experimental results validate the effectiveness of our proposed model. Our codes are available at
https://anonymous.4open.science/r/LeGen-5625.
(Nov 13): 7:45-8:45 (Morning) - Gather
ProTrix: Building Models for Planning and Reasoning over Tables with Sentence Context
Zirui Wu, Yansong Feng
Tables play a crucial role in conveying information in various domains. We propose a Plan-then-Reason framework to answer different types
of user queries over tables with sentence context. The framework first plans the reasoning paths over the context, then assigns each step
to program-based or textual reasoning to reach the final answer. This framework enhances the table reasoning abilities for both in-context
learning and fine-tuning methods. GPT-3.5-Turbo following the Plan-then-Reason framework surpasses other prompting baselines without self-
consistency while using fewer API calls and in-context demonstrations. We also construct an instruction tuning set, TrixInstruct, to evaluate the
effectiveness of fine-tuning with this framework. We present the ProTrix model family, built by fine-tuning models on TrixInstruct. Our experiments
show that the ProTrix family generalizes to diverse unseen tabular tasks with only 6k training instances. We further demonstrate that ProTrix can
generate accurate and faithful explanations to answer complex free-form questions. Our work underscores the importance of planning and
reasoning abilities for models on tabular tasks, benefiting both generalizability and interpretability. We will open-source our dataset and models.
(Nov 13): 7:45-8:45 (Morning) - Gather
GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation
Shihao Cai, Keqin Bao, Hangyu Guo, Jizhi Zhang, Jun Song, Bo Zheng
Large language models have seen widespread adoption in math problem-solving, yet for geometry problems, which often necessitate visual
aids even for humans, the most advanced multi-modal models still struggle to effectively utilize image information. High-quality data is
crucial for enhancing the geometric capabilities of multi-modal models, yet existing open-source datasets and related efforts are either too
challenging for direct model learning or suffer from misalignment between text and images. To overcome this issue, we introduce a novel
pipeline that leverages GPT-4 and GPT-4V to generate relatively basic geometry problems with aligned text and images, facilitating model
learning. We have produced a dataset of 4.9K geometry problems and combined it with 19K open-source data to form our GeoGPT4V
dataset. Experimental results demonstrate that the GeoGPT4V dataset significantly improves the geometry performance of various models on
the MathVista and MathVision benchmarks. The code is available at https://anonymous.4open.science/r/GeoGPT4V-08B2.
Large Language Models (LLMs) have shown remarkable performance on various basic natural language tasks. To complete a complex
task, we still need a plan for the task to guide LLMs to generate the specific solutions step by step. LLMs can directly generate task plans,
but these plans may still contain factual errors or be incomplete. A high-quality task plan contains correct step-by-step solutions for solving
all situations and behavioral instructions for avoiding mistakes. To obtain it, we propose the Learning to Plan method, which involves two
phases: (1) In the first, plan-learning phase, it iteratively updates the task plan with new step-by-step solutions and behavioral instructions,
which are obtained by prompting LLMs to derive them from training error feedback. (2) In the subsequent test phase, the LLM uses the learned
task plan to guide its inference on the test set. We demonstrate the effectiveness of our method on five different reasoning-type
tasks (8 datasets). Further, our analysis experiments show that the task plan learned by one LLM can directly guide another LLM to improve
its performance, which reveals a new transfer learning paradigm.
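A schematic of the two-phase loop, with a hypothetical call_llm stub standing in for any chat-completion client; this is an illustration of the idea, not the authors' code:

    def call_llm(prompt: str) -> str:
        # Hypothetical stand-in for any chat-completion client.
        raise NotImplementedError("plug in an LLM client here")

    def learn_task_plan(train_set, rounds=3):
        # Phase 1: iteratively revise the plan from training error feedback.
        plan = "(empty plan)"
        for _ in range(rounds):
            errors = [(x, y) for x, y in train_set
                      if call_llm(f"Plan:\n{plan}\nTask: {x}\nAnswer:") != y]
            if not errors:
                break
            feedback = "\n".join(f"input: {x} expected: {y}" for x, y in errors)
            plan = call_llm(
                f"Current plan:\n{plan}\nFailed cases:\n{feedback}\n"
                "Revise the plan: add corrected step-by-step solutions and "
                "behavioral instructions that avoid these mistakes.")
        return plan

    def answer(test_input, plan):
        # Phase 2: the learned plan guides inference; no weight updates.
        return call_llm(f"Plan:\n{plan}\nTask: {test_input}\nAnswer:")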
Question Answering
(Nov 13): 7:45-8:45 (Morning) - Room: Gather
Large language models (LLMs) are widely used in question-answering (QA) systems but often generate information with hallucinations.
Retrieval-augmented generation (RAG) offers a potential remedy, yet uneven retrieval quality and irrelevant content may distract LLMs. In
this work, we address these issues at the generation phase by treating RAG as a multi-document QA task. We propose a novel decoding strat-
egy, Dynamic Contrastive Decoding, which dynamically amplifies knowledge from selected documents during the generation phase. Our method
involves constructing inputs batchwise, designing new selection criteria to identify documents worth amplifying, and applying contrastive
decoding with a specialized weight calculation to adjust the final logits used for sampling answer tokens. Zero-shot experimental results on
ALCE-ASQA, NQ, TQA and PopQA benchmarks show that our method outperforms other decoding strategies. Additionally, we conduct
experiments to validate the effectiveness of our selection criteria, weight calculation, and general multi-document scenarios. Our method
requires no training and can be integrated with other methods to improve the RAG performance. Our codes will be publicly available at
https://github.com/JulieJin-km/Dynamic_Contrastive_Decoding.
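The logit arithmetic at the heart of contrastive decoding can be sketched as follows; the paper's batchwise input construction, document selection criteria, and weight calculation are more elaborate, and alpha here is a hypothetical fixed weight:

    import torch

    def dynamic_contrastive_step(logits_with_doc, logits_without_doc, alpha=1.0):
        # Amplify what the selected document contributes: contrast logits
        # conditioned on the document against logits computed without it.
        return (1 + alpha) * logits_with_doc - alpha * logits_without_doc

    vocab_size = 32000
    with_doc = torch.randn(vocab_size)
    without_doc = torch.randn(vocab_size)
    adjusted = dynamic_contrastive_step(with_doc, without_doc)
    print(int(adjusted.argmax()))   # next answer token id under greedy decoding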
(Nov 13): 7:45-8:45 (Morning) - Gather
CoTAR: Chain-of-Thought Attribution Reasoning with Multi-level Granularity
Moshe Berchansky, Daniel Fleischer, Moshe Wasserblat, Peter Izsak
State-of-the-art performance in QA tasks is currently achieved by systems employing Large Language Models (LLMs); however, these models
tend to hallucinate information in their responses. One approach focuses on enhancing the generation process by incorporating attribution
from the given input to the output. However, the challenge of identifying appropriate attributions and verifying their accuracy against a source
is a complex task that requires significant improvements in assessing such systems. We introduce an attribution-oriented Chain-of-Thought
reasoning method to enhance the accuracy of attributions. This approach focuses the reasoning process on generating an attribution-centric
output. Evaluations on two context-enhanced question-answering datasets using GPT-4 demonstrate improved accuracy and correctness of
attributions. In addition, the combination of our method with finetuning enhances the response and attribution accuracy of two smaller LLMs,
showing their potential to outperform GPT-4 in some cases.
(Nov 13): 7:45-8:45 (Morning) - Gather
Improving Zero-shot LLM Re-Ranker with Risk Minimization
Xiaowei Yuan, Zhao Yang, Yequan Wang, Jun Zhao, Kang Liu
In the Retrieval-Augmented Generation (RAG) system, advanced Large Language Models (LLMs) have emerged as effective Query Likeli-
hood Models (QLMs) in an unsupervised way, which re-rank documents based on the probability of generating the query given the content of
a document. However, directly prompting LLMs to approximate QLMs is inherently biased, as the estimated distribution might diverge
from the actual document-specific distribution. In this study, we introduce a novel framework, UR3, which leverages Bayesian decision the-
ory to both quantify and mitigate this estimation bias. Specifically, UR3 reformulates the problem as maximizing the probability of document
generation, thereby harmonizing the optimization of query and document generation probabilities under a unified risk minimization objective.
Our empirical results indicate that UR3 significantly enhances re-ranking, particularly in improving the Top-1 accuracy. It benefits the QA
tasks by achieving higher accuracy with fewer input documents.
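For orientation, the underlying QLM re-ranking signal (before UR3's bias correction, which is not reproduced here) can be sketched with any causal LM; GPT-2 below is an arbitrary choice, and token alignment at the prompt boundary is approximate:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    def qlm_score(document: str, query: str) -> float:
        # Score = log-probability of the query tokens given the document.
        prompt = f"Passage: {document}\nQuery:"
        enc = tok(prompt + " " + query, return_tensors="pt")
        n_prompt = len(tok(prompt)["input_ids"])
        labels = enc["input_ids"].clone()
        labels[:, :n_prompt] = -100          # ignore prompt tokens in the loss
        with torch.no_grad():
            loss = model(**enc, labels=labels).loss
        return -loss.item()                  # higher = query more likely

    docs = ["Aspirin reduces fever.", "Pandas eat bamboo."]
    print(sorted(docs, key=lambda d: qlm_score(d, "what lowers a fever?"), reverse=True))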
(Nov 13): 7:45-8:45 (Morning) - Gather
PCQPR: Proactive Conversational Question Planning with Reflection
Shasha Guo
Conversational Question Generation (CQG) enhances the interactivity of conversational question-answering systems in fields such as ed-
ucation, customer service, and entertainment. However, traditional CQG, focusing primarily on the immediate context, lacks the con-
versational foresight necessary to guide conversations toward specified conclusions. This limitation significantly restricts their ability to
achieve conclusion-oriented conversational outcomes. In this work, we redefine the CQG task as Conclusion-driven Conversational Ques-
tion Generation (CCQG) by focusing on proactivity, not merely reacting to the unfolding conversation but actively steering it towards a
conclusion-oriented question-answer pair. To address this, we propose a novel approach, called Proactive Conversational Question Planning
with self-Refining (PCQPR). Concretely, by integrating a planning algorithm inspired by Monte Carlo Tree Search (MCTS) with the an-
alytical capabilities of large language models (LLMs), PCQPR predicts future conversation turns and continuously refines its questioning
strategies. This iterative self-refining mechanism ensures the generation of contextually relevant questions strategically devised to reach a
specified outcome. Our extensive evaluations demonstrate that PCQPR significantly surpasses existing CQG methods, marking a paradigm
shift towards conclusion-oriented conversational question-answering systems.
(Nov 13): 7:45-8:45 (Morning) - Gather
LongRAG: A Dual-perspective Retrieval-Augmented Generation Paradigm for Long-Context Question Answering
Qingfei Zhao, Ruobing Wang, Yukuo Cen, Daren Zha, Shicheng Tan, Yuxiao Dong, Jie Tang
Long-Context Question Answering (LCQA), a challenging task, aims to reason over long-context documents to yield accurate answers to
questions. Existing long-context Large Language Models (LLMs) for LCQA often struggle with the "lost in the middle" issue. Retrieval-
Augmented Generation (RAG) mitigates this issue by providing external factual evidence. However, its chunking strategy disrupts the global
long-context information, and its low-quality retrieval in long contexts hinders LLMs from identifying effective factual details due to sub-
stantial noise. To this end, we propose LongRAG, a general, dual-perspective, and robust LLM-based RAG system paradigm for LCQA to
enhance RAG's understanding of complex long-context knowledge (i.e., global information and factual details). We design LongRAG as a
plug-and-play paradigm, facilitating adaptation to various domains and LLMs. Extensive experiments on three multi-hop datasets demonstrate
that LongRAG significantly outperforms long-context LLMs (up by 6.94%), advanced RAG (up by 6.16%), and Vanilla RAG (up by 17.25%).
Furthermore, we conduct quantitative ablation studies and multi-dimensional analyses, highlighting the effectiveness of the system's compo-
nents and fine-tuning strategies. Data and code are available at https://github.com/QingFei1/LongRAG.
comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we
employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training,
the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our
models’ outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors
greatly affect the detection accuracy, necessitating further research. Our dataset is publicly available (https://github.com/leolya/CD-ADD).
(Nov 13): 7:45-8:45 (Morning) - Gather
UNO Arena for Evaluating Sequential Decision-Making Capability of Large Language Models
Zhanyue Qin, Haochuan Wang, Deyuan Liu, Ziyang Song, Cunhang Fan, Zhao Lv, Jinlin Wu, Zhen Lei, Zhiying Tu, Dianhui Chu, Xiaoyan
Yu, Dianbo Sui
Sequential decision-making refers to algorithms that take into account the dynamics of the environment, where early decisions affect subse-
quent decisions. With large language models (LLMs) demonstrating powerful capabilities across tasks, we cannot help but ask: can current
LLMs effectively make sequential decisions? In order to answer this question, we propose the UNO Arena, based on the card game UNO,
to evaluate the sequential decision-making capability of LLMs and explain in detail why we choose UNO. In UNO Arena, we evaluate the
sequential decision-making capability of LLMs dynamically with novel metrics based on Monte Carlo methods. We set up random players,
DQN-based reinforcement learning players, and LLM players (e.g., GPT-4, Gemini-pro) for comparison testing. Furthermore, in order to
improve the sequential decision-making capability of LLMs, we propose the TUTRI player, which involves having LLMs reflect on their
own actions with a summary of the game history and the game strategy. Numerous experiments demonstrate that the TUTRI player achieves a
notable breakthrough in the performance of sequential decision-making compared to the vanilla LLM player.
Classification is a core NLP task with many potential applications. While large language models (LLMs) have brought substantial
advancements in text generation, their potential for enhancing classification tasks remains underexplored. To address this gap, we propose a
framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches. We
instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task. Our extensive experiments
and systematic comparisons with various training approaches and a representative selection of LLMs yield new insights into their application
for EIC. We investigate the generalizability of these findings on five further classification tasks. To demonstrate the proposed methods and
address the data shortage for empirical edit analysis, we use our best-performing EIC model to create Re3-Sci2.0, a new large-scale dataset
of 1,780 scientific document revisions with over 94k labeled edits. The quality of the dataset is assessed through human evaluation. The new
dataset enables an in-depth empirical study of human editing behavior in academic writing. We make our experimental framework, models
and data publicly available.
(Nov 13): 7:45-8:45 (Morning) - Gather
MetaBench: Planning of Multiple APIs from Various APPs for Complex User Instruction
Hongru WANG, Rui Wang, Boyang XUE, Heming Xia, Jingtao Cao, Zeming Liu, Jeff Z. Pan, Kam-Fai Wong
Large Language Models (LLMs) can interact with the real world by connecting with versatile external APIs, resulting in better problem-
solving and task automation capabilities. Previous research primarily either focuses on APIs with limited arguments from a single source
or overlooks the complex dependency relationship between different APIs. However, it is essential to utilize multiple APIs collaboratively
from various sources, especially for complex user instructions. In this paper, we introduce MetaBench, the first benchmark to evaluate
LLMs’ ability to plan and execute multiple APIs from various sources in order to complete the user’s task. Specifically, we consider two
significant challenges in multiple APIs: 1) graph structures: some APIs can be executed independently while others need to be executed one
by one, resulting in graph-like execution order; and 2) permission constraints: which source is authorized to execute an API call. We report
experimental results on 9 distinct LLMs; e.g., GPT-4o achieves only a 2.0% success rate on the most complex instructions, revealing that
existing state-of-the-art LLMs still cannot perform well in this situation, even with the help of in-context learning and finetuning. Our code
and data are publicly available at https://github.com/ruleGreen/AppBench.
(Nov 13): 7:45-8:45 (Morning) - Gather
On Creating an English-Thai Code-switched Machine Translation in Medical Domain
Parinthapat Pengpun, Krittamate Tiankanon, Amrest Chinkamol, Jiramet Kinchagawat, Pitchaya Chairuengjitjaras, Pasit Supholkhan, Pub-
ordee Aussavavirojekul, Chiraphat Boonnag, Kanyakorn Veerakanjana, Hirunkul Phimsiri, Boonthicha Sae-jia, Nattawach Sataudom, Piyalitt
Ittichaiwong, Peerat Limkonchotiwat
Machine translation (MT) in the medical domain plays a pivotal role in enhancing healthcare quality and disseminating medical knowledge.
Despite advancements in English-Thai MT technology, common MT approaches often underperform in the medical field due to their inability
to precisely translate medical terminologies. Our research prioritizes not merely improving translation accuracy but also maintaining medical
terminology in English within the translated text through code-switched (CS) translation. We developed a method to produce CS medical
translation data, fine-tuned a CS translation model with this data, and evaluated its performance against strong baselines, such as Google
Neural Machine Translation (NMT) and GPT-3.5/GPT-4. Our model demonstrated competitive performance in automatic metrics and was
highly favored in human preference evaluations. Our evaluation also shows that medical professionals significantly prefer CS trans-
lations that maintain critical English terms accurately, even if this slightly compromises fluency. Our code and test set are publicly available at
https://github.com/preceptorai-org/NLLB_CS_EM_NLP2024.
(Nov 13): 7:45-8:45 (Morning) - Gather
LongGenBench: Long-context Generation Benchmark
Xiang Liu, Peijie Dong, Xuming Hu, Xiaowen Chu
Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific infor-
mation within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of
a language model to generate coherent and contextually accurate text that spans across lengthy passages or documents. While recent studies
show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evalu-
ating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark,
LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond tradi-
tional benchmarks by redesigning the format of questions and necessitating that LLMs respond with a single, cohesive long-context answer.
Upon extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degra-
dation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance
degradation, with the Gemini-1.5-Flash model showing the least degradation among API-accessed models, and the Qwen2 series exhibiting
the least degradation in LongGenBench among open-source models.
(Nov 13): 7:45-8:45 (Morning) - Gather
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui
Wang, Gongshen Liu
Large language models (LLMs) have exhibited great potential in autonomously completing tasks across real-world applications. Despite this,
these LLM agents introduce unexpected safety risks when operating in interactive environments. Rather than centering on the harmlessness
of LLM-generated content, as most prior studies do, this work addresses the imperative need for benchmarking the behavioral safety of LLM
agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and iden-
tifying safety risks given agent interaction records. R-Judge comprises 569 records of multi-turn agent interaction, encompassing 27 key
risk scenarios among 5 application categories and 10 risk types. It features high-quality curation with annotated safety labels and risk descrip-
tions. Evaluation of 11 LLMs on R-Judge shows considerable room for enhancing the risk awareness of LLMs: the best-performing model,
GPT-4o, achieves 74.42%, while no other model significantly exceeds random performance. Moreover, we reveal that risk awareness in open agent
scenarios is a multi-dimensional capability involving knowledge and reasoning, and is thus challenging for LLMs. With further experiments, we
find that fine-tuning on safety judgment significantly improves model performance, while straightforward prompting mechanisms fail. R-Judge
is publicly available at an anonymized repository.
(Nov 13): 7:45-8:45 (Morning) - Gather
MoleculeQA: A Dataset to Evaluate Factual Accuracy in Molecular Comprehension
Xingyu Lu, He CAO, Zijing Liu, Shengyuan Bai, leqingchen, Yuan Yao, Hai-Tao Zheng, Yu Li
Large language models are playing an increasingly significant role in molecular research, yet existing models often generate erroneous infor-
mation. Traditional evaluations fail to assess a model's factual correctness. To rectify this gap, we present MoleculeQA, a novel question
answering (QA) dataset which comprises 62K QA pairs over 23K molecules. Each QA pair, composed of a manual question, a positive option
and three negative options, has consistent semantics with a molecular description from an authoritative corpus. MoleculeQA is not only the first
benchmark to evaluate molecular factual correctness but also the largest molecular QA dataset. A comprehensive evaluation on MoleculeQA
for existing molecular LLMs exposes their deficiencies in specific aspects and pinpoints crucial factors for molecular modeling. Furthermore,
we employ MoleculeQA in reinforcement learning to mitigate model hallucinations, thereby enhancing the factual correctness of generated
information.
(Nov 13): 7:45-8:45 (Morning) - Gather
SLANG: New Concept Comprehension of Large Language Models
Lingrui Mei, Shenghua Liu, Yiwei Wang, Baolong Bi, Xueqi Cheng
The dynamic nature of language, particularly evident in the realm of slang and memes on the Internet, poses serious challenges to the adapt-
ability of Large Language Models (LLMs). Traditionally anchored to static datasets, these models often struggle to keep up with the rapid
linguistic evolution characteristic of online communities. This research aims to bridge this gap by enhancing LLMs’ comprehension of
the evolving new concepts on the Internet, without the high cost of continual retraining. In pursuit of this goal, we introduce SLANG, a
benchmark designed to autonomously integrate novel data and assess LLMs' ability to comprehend emerging concepts, alongside FOCUS,
an approach that uses causal inference to enhance LLMs' understanding of new phrases and their colloquial context. Our benchmark and approach
involve understanding real-world instances of linguistic shifts, serving as contextual beacons, to form more precise and contextually relevant
connections between newly emerging expressions and their meanings. The empirical analysis shows that our causal inference-based approach
outperforms the baseline methods in terms of precision and relevance in the comprehension of Internet slang and memes.
(Nov 13): 7:45-8:45 (Morning) - Gather
Multilingual Synopses of Movie Narratives: A Dataset for Story Understanding
Yidan Sun, Jianfei Yu, Boyang Li
Story video-text alignment, a core task in computational story understanding, aims to align video clips with corresponding sentences in their
descriptions. However, progress on the task has been held back by the scarcity of manually annotated video-text correspondence and the heavy
concentration on English narrations of Hollywood movies. To address these issues, in this paper, we construct a large-scale multilingual video
story dataset named Multilingual Synopses of Movie Narratives (M-SyMoN), containing 13,166 movie summary videos in 7 languages, as
well as manual annotation of fine-grained video-text correspondences for 101.5 hours of video. Training on the human annotated data from
SyMoN outperforms the SOTA methods by 15.7 and 16.2 percentage points on Clip Accuracy and Sentence IoU scores, respectively, demon-
strating the effectiveness of the annotations. As benchmarks for future research, we create 6 baseline approaches with different multilingual
training strategies, compare their performance in both intra-lingual and cross-lingual setups, exemplifying the challenges of multilingual
video-text alignment. The dataset is released at: https://github.com/insundaycathy/M-SyMoN
(Nov 13): 7:45-8:45 (Morning) - Gather
A Novel Metric for Measuring the Robustness of Large Language Models in Non-adversarial Scenarios
Samuel Ackerman, Ella Rabinovich, Eitan Farchi, Ateret Anaby Tavor
We evaluate the robustness of several large language models on multiple datasets. Robustness here refers to the relative insensitivity of the
model’s answers to meaning-preserving variants of their input. Benchmark datasets are constructed by introducing naturally-occurring, non-
malicious perturbations, or by generating semantically equivalent paraphrases of input questions or statements. We further propose a novel
metric for assessing a model's robustness, and demonstrate its benefits in the non-adversarial scenario by empirical evaluation of several models
on the created datasets.
To study the requirements needed for a human-like language to develop, Language Emergence research uses jointly trained artificial agents
which communicate to solve a task, the most popular of which is a referential game. The targets that agents refer to typically involve a single
entity, which limits their ecological validity and the complexity of the emergent languages. Here, we present a simple multi-entity game in
which targets include multiple entities that are spatially related. We ask whether agents dealing with multi-entity targets benefit from the use
of graph representations, and explore four different graph schemes. Our game requires more sophisticated analyses to capture the extent to
which the emergent languages are compositional, and crucially, what the decomposed features are. We find that emergent languages from our
setup exhibit a considerable degree of compositionality, but not over all features.
(Nov 13): 7:45-8:45 (Morning) - Gather
SEAVER: Attention Reallocation for Mitigating Distractions in Language Models for Conditional Semantic Textual Similarity Measurement
Baixuan Li, Yunlong Fan, Zhiqiang Gao
Conditional Semantic Textual Similarity (C-STS) introduces specific limiting conditions to the traditional Semantic Textual Similarity (STS)
task, posing challenges for STS models. Language models employing cross-encoding demonstrate satisfactory performance in STS, yet their
effectiveness significantly diminishes in C-STS. In this work, we argue that the failure is due to the fact that the redundant information in the
text distracts language models from the required condition-relevant information. To alleviate this, we propose Self-Augmentation via Self-
Reweighting (SEAVER), which, based solely on models’ internal attention and without the need for external auxiliary information, adaptively
reallocates the model’s attention weights by emphasizing the importance of condition-relevant tokens. On the C-STS-2023 test set, SEAVER
consistently improves performance of all million-scale fine-tuning baseline models (up to around 3 points), and even surpasses performance
of billion-scale few-shot prompted large language models (such as GPT-4). Our code is available at https://github.com/BaixuanLi/SEAVER.
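A toy sketch of the self-reweighting idea: pool token states with the model's own attention weights after boosting condition-relevant tokens. The boost factor and masks are hypothetical, and this is not the SEAVER implementation:

    import torch

    def reweight_pool(hidden, cls_attention, condition_mask, boost=2.0):
        # hidden: (T, d) token states; cls_attention: (T,) attention the model
        # itself pays from [CLS]; condition_mask marks condition-relevant tokens.
        w = cls_attention.clone()
        w[condition_mask] *= boost        # reallocate attention toward condition
        w = w / w.sum()
        return (w.unsqueeze(-1) * hidden).sum(dim=0)

    T, d = 10, 768
    hidden = torch.randn(T, d)
    attn = torch.rand(T)
    cond = torch.zeros(T, dtype=torch.bool)
    cond[3:5] = True                      # tokens overlapping the condition
    print(reweight_pool(hidden, attn, cond).shape)   # torch.Size([768])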
(Nov 13): 7:45-8:45 (Morning) - Gather
Self-supervised Topic Taxonomy Discovery in the Box Embedding Space
Yuyin Lu, Hegang Chen, Pengbo Mao, Yanghui Rao, Haoran Xie, Fu Lee Wang, Qing Li
Topic taxonomy discovery aims at uncovering topics of different abstraction levels and constructing hierarchical relations between them. Un-
fortunately, most prior work can hardly model the semantic scopes of words and topics under the Euclidean embedding space assumption.
What’s worse, they infer asymmetric hierarchical relations by symmetric distances between topic embeddings. As a result, existing methods
suffer from problems of low-quality topics at high abstraction levels and inaccurate hierarchical relations. To alleviate these problems, this
paper develops a Box embedding-based Topic Model (BoxTM) that maps words and topics into the box embedding space, where the asym-
metric metric is defined to properly infer hierarchical relations among topics. Additionally, our BoxTM explicitly infers upper-level topics
based on correlations between specific topics through recursive clustering on topic boxes. Finally, extensive experiments validate the high
quality of the topic taxonomy learned by BoxTM.
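The asymmetry that box embeddings buy can be seen with hard (non-soft) boxes: containment of box A in box B, measured as intersection volume over A's volume, is direction-dependent. A minimal sketch with made-up 2-D topic boxes:

    import numpy as np

    def volume(lo, hi):
        return float(np.prod(np.clip(hi - lo, 0, None)))

    def containment(lo_a, hi_a, lo_b, hi_b):
        # P(B is an ancestor of A) ~ vol(A ∩ B) / vol(A): asymmetric by design.
        inter = volume(np.maximum(lo_a, lo_b), np.minimum(hi_a, hi_b))
        return inter / volume(lo_a, hi_a)

    sports = (np.array([0.0, 0.0]), np.array([4.0, 4.0]))   # broad topic, big box
    tennis = (np.array([1.0, 1.0]), np.array([2.0, 2.0]))   # specific, small box
    print(containment(*tennis, *sports))   # 1.0    -> tennis sits inside sports
    print(containment(*sports, *tennis))   # 0.0625 -> not vice versa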
large language models through query-dependent retrievals, these approaches still suffer from their complex implementation and prolonged
response times. Typically, a RAG workflow involves multiple processing steps, each of which can be executed in various ways. Here, we
investigate existing RAG approaches and their potential combinations to identify optimal RAG practices. Through extensive experiments,
we suggest several strategies for deploying RAG that balance both performance and efficiency. Moreover, we demonstrate that multimodal
retrieval techniques can significantly enhance question-answering capabilities about visual inputs and accelerate the generation of multimodal
content using a "retrieval as generation" strategy.
(Nov 13): 7:45-8:45 (Morning) - Gather
Expressive and Generalizable Low-rank Adaptation for Large Models via Slow Cascaded Learning
Siwei Li, Yifan Yang, Yifei Shen, Fangyun Wei, Zongqing Lu, Lili Qiu, Yuqing Yang
Efficient fine-tuning plays a fundamental role in modern large models, with low-rank adaptation emerging as a particularly promising ap-
proach. However, the existing variants of LoRA are hampered by limited expressiveness, a tendency to overfit, and sensitivity to hyper-
parameter settings. This paper presents LoRA Slow Cascade Learning (LoRASC), an innovative technique designed to enhance LoRA’s
expressiveness and generalization capabilities while preserving its training efficiency. Our approach augments expressiveness through a cas-
caded learning strategy that enables a mixture-of-low-rank adaptation, thereby increasing the model’s ability to capture complex patterns.
Additionally, we introduce a slow-fast update mechanism and cascading noisy tuning to bolster generalization. Extensive experiments on
various language and vision datasets, as well as robustness benchmarks, demonstrate that the proposed method not only significantly outper-
forms existing baselines, but also mitigates overfitting, enhances model stability, and improves OOD robustness.
(Nov 13): 7:45-8:45 (Morning) - Gather
Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging
Deyuan Liu, Zhanyue Qin, Hairu Wang, Zhao Yang, Zecheng Wang, Fangying Rong, Qingbin Liu, Yanchao Hao, Bo Li, Xi Chen, Cunhang
Fan, Zhao Lv, Dianhui Chu, Zhiying Tu, Dianbo Sui
While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environ-
ments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To
address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach that
uses manifold learning and the Information Bottleneck (IB) measure to merge similar layers, reducing model size while preserving essential
performance. We evaluate MKA on multiple benchmark datasets and various LLMs. Our findings show that MKA not only preserves model
performance but also achieves substantial compression ratios, outperforming traditional pruning methods. Moreover, when coupled with
quantization, MKA delivers even greater compression. Specifically, on the MMLU dataset using the Llama3-8B model, MKA achieves a
compression ratio of 43.75% with a minimal performance decrease of only 2.82%. The proposed MKA method offers a resource-efficient and
performance-preserving model compression technique for LLMs. We make our code available at https://github.com/SempraETY/Pruning-via-Merging.
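A heavily simplified sketch of the merging idea: pick the adjacent layer pair with the most similar activations on probe data and average their parameters. MKA's manifold alignment and Information Bottleneck measure are not reproduced; everything below is hypothetical:

    import torch
    import torch.nn.functional as F

    def most_mergeable_pair(layer_acts):
        # layer_acts: per-layer (N, d) activations on probe inputs; return the
        # index of the adjacent pair whose representations are most similar.
        sims = [F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()
                for a, b in zip(layer_acts, layer_acts[1:])]
        return max(range(len(sims)), key=sims.__getitem__)

    def merge_layers(state_a, state_b):
        # Naive merge: average the two layers' parameters key by key.
        return {k: (state_a[k] + state_b[k]) / 2 for k in state_a}

    acts = [torch.randn(32, 64) for _ in range(6)]
    i = most_mergeable_pair(acts)
    print(f"merge layers {i} and {i + 1}")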
training, improving the model to enhance both its reasonableness and stability. Experiments show that the proposed model, IDEAW, can
withstand various attacks with higher capacity and more efficient locating ability compared to existing methods.
Summarization
(Nov 13): 7:45-8:45 (Morning) - Room: Gather
GlobeSumm: A Challenging Benchmark Towards Unifying Multi-lingual, Cross-lingual and Multi-document News Summarization
Yangfan Ye, Xiachong Feng, Xiaocheng Feng, Weitao Ma, Libo Qin, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
News summarization in today's global scene can be daunting with its flood of multilingual content and varied viewpoints from different
sources. However, current studies often neglect such real-world scenarios as they tend to focus solely on either single-language or single-
document tasks. To bridge this gap, we aim to unify Multi-lingual, Cross-lingual and Multi-document Summarization into a novel task,
i.e., MCMS, which encapsulates the real-world requirements all-in-one. Nevertheless, the lack of a benchmark inhibits researchers from
adequately studying this invaluable problem. To tackle this, we have meticulously constructed the GLOBESUMM dataset by first collecting
a wealth of multilingual news reports and restructuring them into an event-centric format. Additionally, we introduce the method of protocol-
guided prompting for high-quality and cost-effective reference annotation. In MCMS, we also highlight the challenge of conflicts between
news reports, in addition to the issues of redundancies and omissions, further increasing the complexity of GLOBESUMM. Through exten-
sive experimental analysis, we validate the quality of our dataset and elucidate the inherent challenges of the task. We firmly believe that
GLOBESUMM, given its challenging nature, will greatly contribute to the multilingual communities and the evaluation of LLMs.
(Nov 13): 7:45-8:45 (Morning) - Gather
Identifying Factual Inconsistencies in Summaries: Grounding Model Inference via Task Taxonomy
Liyan Xu, Zhenlin Su, Mo Yu, Jin Xu, Jinho D. Choi, Jie Zhou, Fei Liu
Factual inconsistencies pose a significant hurdle for faithful summarization by generative models. While a major direction to enhance
inconsistency detection is to derive stronger Natural Language Inference (NLI) models, we propose an orthogonal aspect that underscores the
importance of incorporating task-specific taxonomy into the inference. To this end, we consolidate key error types of inconsistent facts in
summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs. Extensive experiments on ten datasets of
five distinct domains suggest that zero-shot LLM inference can benefit from the explicit solution space depicted by the error type taxonomy,
and achieves state-of-the-art performance overall, surpassing specialized non-LLM baselines, as well as recent LLM baselines. We further
distill models that fuse the taxonomy into their parameters through our designed prompt completions and supervised training strategies, efficiently
substituting for state-of-the-art zero-shot inference with much larger LLMs.
erarchical contrastive learning module incorporates two complementary strategies, event-level and intent-level, to establish cognitive anchors
that uncover the latent intentions of information disseminators. Event-level contrastive learning employs high-quality data augmentation and
adversarial perturbations to enhance model robustness. Intent-level contrastive learning leverages the intent encoder to capture latent intent
features and optimize consistency within the same intent while ensuring heterogeneity between different intents to clearly distinguish key
features from irrelevant elements. Experimental results demonstrate that IRDNet significantly improves the effectiveness of rumor detection
and effectively addresses the challenges present in the field of rumor detection.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Investigating LLMs as Voting Assistants via Contextual Augmentation: A Case Study on the European Parliament Elections 2024
Ilias Chalkidis
In light of the recent 2024 European Parliament elections, we investigate whether LLMs can be used as Voting Advice Applications (VAAs).
We audit MISTRAL and MIXTRAL models and evaluate their accuracy in predicting the stance of political parties based on the latest "EU and
I" voting assistance questionnaire. Furthermore, we explore alternatives to improve models’ performance by augmenting the input context via
Retrieval-Augmented Generation (RAG) relying on web search, and Self-Reflection using staged conversations that aim to re-collect relevant
content from the model's internal memory. We find that MIXTRAL is highly accurate, with 82% accuracy on average, though with a significant
performance disparity across different political groups (50-95%). Augmenting the input context with expert-curated information can lead to
a significant boost of approx. 9%, which remains an open challenge for automated RAG approaches, even considering curated content.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Enhancing Data Quality through Simple De-duplication: Navigating Responsible Computational Social Science Research
Yida Mu, Mali Jin, Xingyi Song, Nikolaos Aletras
Research in natural language processing (NLP) for Computational Social Science (CSS) heavily relies on data from social media platforms.
This data plays a crucial role in the development of models for analysing socio-linguistic phenomena within online communities. In this
work, we conduct an in-depth examination of 20 datasets extensively used in NLP for CSS to comprehensively assess data quality. Our
analysis reveals that social media datasets exhibit varying levels of data duplication. Consequently, this gives rise to challenges like label
inconsistencies and data leakage, compromising the reliability of models. Our findings also suggest that data duplication has an impact on the
current claims of state-of-the-art performance, potentially leading to an overestimation of model effectiveness in real-world scenarios. Finally,
we propose new protocols and best practices for improving dataset development from social media data and its usage.
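Even the simplest form of the advocated de-duplication, hashing normalized text to collapse exact and trivially-near duplicates, takes only a few lines; a minimal sketch, not the paper's exact protocol:

    import hashlib
    import re

    def normalize(text: str) -> str:
        text = text.lower()
        text = re.sub(r"https?://\S+", "", text)   # strip URLs
        text = re.sub(r"[^a-z0-9 ]", "", text)     # strip punctuation/emoji
        return " ".join(text.split())

    def dedup(posts):
        seen, keep = set(), []
        for p in posts:
            h = hashlib.md5(normalize(p).encode()).hexdigest()
            if h not in seen:
                seen.add(h)
                keep.append(p)
        return keep

    posts = ["Great game!!!", "great game", "Totally different post"]
    print(dedup(posts))   # the second post collapses into the first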
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Still Not Quite There! Assessing Large Language Models for Comorbid Mental Health Diagnosis
Amey Hengle, Atharva Kulkarni, Shantanu Deepak Patankar, Rashmi Gupta
In this study, we introduce ANGST, a novel, first-of-its-kind benchmark for depression-anxiety comorbidity classification from social media
posts. Unlike contemporary datasets that often oversimplify the intricate interplay between different mental health disorders by treating them
as isolated conditions, ANGST enables multi-label classification, allowing each post to be simultaneously identified as indicating depression
and/or anxiety. Comprising 2876 posts meticulously annotated by expert psychologists and an additional 7667 silver-labeled posts, ANGST
provides a more representative sample of online mental health discourse. Moreover, we benchmark ANGST using various state-of-the-art lan-
guage models, ranging from Mental-BERT to GPT-4. Our results provide significant insights into the capabilities and limitations of these
models in complex diagnostic scenarios. While GPT-4 generally outperforms other models, none achieve an F1 score exceeding 72% in
multi-class comorbid classification, underscoring the ongoing challenges in applying language models to mental health diagnostics.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Quantifying Generative Media Bias with a Corpus of Real-world and Generated News Articles
Filip Trhlík, Pontus Stenetorp
Large language models (LLMs) are increasingly being utilised across a range of tasks and domains, with a burgeoning interest in their appli-
cation within the field of journalism. This trend raises concerns due to our limited understanding of LLM behaviour in this domain, especially
with respect to political bias. Existing studies predominantly focus on LLMs undertaking political questionnaires, which offers only limited
insights into their biases and operational nuances. To address this gap, our study establishes a new curated dataset that contains 2,100 human-
written articles and utilises their descriptions to generate 56,700 synthetic articles using nine LLMs. This enables us to analyse shifts in
properties between human-authored and machine-generated articles, with this study focusing on political bias, detecting it using both super-
vised models and LLMs. Our findings reveal significant disparities between base and instruction-tuned LLMs, with instruction-tuned models
exhibiting consistent political bias. Furthermore, we are able to study how LLMs behave as classifiers, observing their display of political bias
even in this role. Overall, for the first time within the journalistic domain, this study outlines a framework and provides a structured dataset
for quantifiable experiments, serving as a foundation for further research into LLM political bias and its implications.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
F2RL: Factuality and Faithfulness Reinforcement Learning Framework for Claim-Guided Evidence-Supported Counterspeech Generation
Haiyang Wang, Yuchen Pan, Xin Song, Xuechen Zhao, Minghao Hu, Bin Zhou
Hate speech (HS) on social media exacerbates misinformation and baseless prejudices. Evidence-supported counterspeech (CS) is crucial for
correcting misinformation and reducing prejudices through facts. Existing methods for generating evidence-supported CS often lack clear
guidance with a core claim for organizing evidence and do not adequately address factuality and faithfulness hallucinations in CS within
anti-hate contexts. In this paper, to mitigate the aforementioned issues, we propose F2RL, a Factuality and Faithfulness Reinforcement Learning
framework for generating claim-guided and evidence-supported CS. Firstly, we generate counter-claims based on hate speech and design a
self-evaluation mechanism to select the most appropriate one. Secondly, we propose a coarse-to-fine evidence retrieval method. This method
initially generates broad queries to ensure the diversity of evidence, followed by carefully reranking the retrieved evidence to ensure its
relevance to the claim. Finally, we design a reinforcement learning method with a triplet-based factuality reward model and a multi-aspect
faithfulness reward model. The method rewards the generator to encourage greater factuality, more accurate refutation of hate speech, con-
sistency with the claim, and better utilization of evidence. Extensive experiments on three benchmark datasets demonstrate that the proposed
framework achieves excellent performance in CS generation, with strong factuality and faithfulness.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Moral Foundations of Large Language Models
Marwa Abdulhai, Gregory Serapio-García, Clement CREPY, Daria Valter, John Canny, Natasha Jaques
Moral foundations theory (MFT) is a psychological assessment tool that decomposes human moral reasoning into five factors, including
care/harm, liberty/oppression, and sanctity/degradation (Graham et al., 2009). People vary in the weight they place on these dimensions when
making moral decisions, in part due to their cultural upbringing and political ideology. As large language models (LLMs) are trained on
datasets collected from the internet, they may reflect the biases that are present in such corpora. This paper uses MFT as a lens to analyze
whether popular LLMs have acquired a bias towards a particular set of moral values. We analyze known LLMs and find they exhibit particular
moral foundations, and show how these relate to human moral foundations and political affiliations. We also measure the consistency of these
biases, or whether they vary strongly depending on the context of how the model is prompted. Finally, we show that we can adversarially
select prompts that encourage the model to exhibit a particular set of moral foundations, and that this can affect the model's behavior on down-
stream tasks. These findings help illustrate the potential risks and unintended consequences of LLMs assuming a particular moral stance.
Demo
(Nov 14): 13:00-14:00 (Afternoon) - Room: Gather
editor plugins, thereby making INCEpTION usable in an even wider range of annotation tasks.
favoring either the first or second answer regardless of content. To address this limitation, we propose PORTIA, an alignment-based system
designed to mimic human comparison strategies to calibrate position bias in a lightweight yet effective manner. Specifically, PORTIA splits
the answers into multiple segments, taking into account both length and semantics, and merges them back into a single prompt for evaluation
by LLMs. Extensive experiments with six LLMs on 11,520 answer pairs demonstrate that PORTIA markedly enhances the consistency rates
for all models and forms of comparison tested, achieving an average relative improvement of 47.46%. It also enables PORTIA-enhanced
GPT-3.5 to achieve agreement rates with humans comparable to GPT-4 and elevates GPT-4's consistency rate up to 98%. Subsequent human evaluations indicate that the PORTIA-enhanced GPT-3.5 model can even surpass standalone GPT-4 in alignment with human evaluators, highlighting PORTIA's ability to correct position bias, improve LLM consistency, and boost performance while maintaining cost efficiency.
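A minimal sketch of the split-and-merge idea is shown below, assuming a simple length-based split (the paper also accounts for semantics); the prompt wording is a hypothetical stand-in for the actual PORTIA templates.

```python
# Sketch of PORTIA-style split-and-merge for calibrating position bias,
# assuming a length-based split only (illustrative, not the paper's code).

def split(answer: str, k: int = 3) -> list[str]:
    words = answer.split()
    step = max(1, -(-len(words) // k))  # ceil division: at most k segments
    return [" ".join(words[i:i + step]) for i in range(0, len(words), step)]

def merge_for_judge(question: str, ans1: str, ans2: str, k: int = 3) -> str:
    # Interleave aligned segments so neither answer occupies a fixed position.
    parts = [f"Question: {question}"]
    for i, (s1, s2) in enumerate(zip(split(ans1, k), split(ans2, k)), 1):
        parts.append(f"[Answer 1, part {i}] {s1}\n[Answer 2, part {i}] {s2}")
    parts.append("Which answer is better overall?")
    return "\n\n".join(parts)

print(merge_for_judge("What causes tides?",
                      "Tides are caused mainly by the Moon's gravity ...",
                      "Ocean tides result from gravitational pull ..."))
```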
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Text Fluoroscopy: Detecting LLM-Generated Text through Intrinsic Features
Xiao Yu, Kejiang Chen, Qi Yang, Weiming Zhang, Nenghai Yu
Large language models (LLMs) have revolutionized the domain of natural language processing because of their excellent performance on various tasks. Despite their impressive capabilities, LLMs also have the potential to generate texts that pose risks of misuse. Consequently, detecting LLM-generated text has become increasingly important. Previous LLM-generated text detection methods use semantic features, which are stored in the last layer. This leads to methods that overfit the training-set domain and generalize poorly. We therefore argue that utilizing intrinsic features rather than semantic features for detection yields better performance. In this work, we design Text Fluoroscopy, a black-box method with better generalizability for detecting LLM-generated text by mining the intrinsic features of the text to be detected. Our method captures the text's intrinsic features by identifying the layer with the largest distribution difference from the last and first layers when projected to the vocabulary space. Our method achieves 7.36% and 2.84% average improvement in detection performance compared to the baselines when detecting texts from different domains generated by GPT-4 and Claude 3, respectively.
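The layer-selection step can be illustrated as follows: project each layer's hidden state to the vocabulary, then choose the intermediate layer whose distribution diverges most from the first and last layers. This is a toy sketch with random tensors standing in for real model states, and KL divergence is one plausible choice of distribution difference, not necessarily the paper's.

```python
# Toy sketch of selecting the "intrinsic-feature" layer: project each hidden
# layer to vocabulary space and pick the layer whose distribution diverges
# most from the first and last layers (shapes and tensors are illustrative).
import torch
import torch.nn.functional as F

def pick_layer(hidden_states: torch.Tensor, unembed: torch.Tensor) -> int:
    # hidden_states: (num_layers, hidden_dim); unembed: (hidden_dim, vocab)
    logp = F.log_softmax(hidden_states @ unembed, dim=-1)
    first, last = logp[0], logp[-1]
    kl = lambda p, q: F.kl_div(q, p, log_target=True, reduction="sum")
    scores = [kl(logp[i], first) + kl(logp[i], last)
              for i in range(1, logp.size(0) - 1)]
    return 1 + int(torch.stack(scores).argmax())

torch.manual_seed(0)
hs = torch.randn(12, 64)   # e.g. 12 layers of pooled hidden states
W = torch.randn(64, 1000)  # toy unembedding matrix
print("selected layer:", pick_layer(hs, W))
```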
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Can AI Relate: Testing Large Language Model Response for Mental Health Support
Saadia Gabriel, Isha Puri, Xuhai Xu, Matteo Malgaroli, Marzyeh Ghassemi
Large language models (LLMs) are already being piloted for clinical use in hospital systems like NYU Langone, Dana-Farber and the NHS. A proposed deployment use case is psychotherapy, where an LLM-powered chatbot can treat a patient undergoing a mental health crisis. Deployment of LLMs for mental health response could hypothetically broaden access to psychotherapy and provide new possibilities for personalizing care. However, recent high-profile failures, like damaging dieting advice offered by the Tessa chatbot to patients with eating disorders, have led to doubt about their reliability in high-stakes and safety-critical settings. In this work, we develop an evaluation framework for determining whether LLM response is a viable and ethical path forward for the automation of mental health treatment. Our framework measures equity in empathy and adherence of LLM responses to motivational interviewing theory. Using human evaluation with trained clinicians and automatic quality-of-care metrics grounded in psychology research, we compare the responses provided by peer-to-peer responders to those provided by a state-of-the-art LLM. We show that LLMs like GPT-4 use implicit and explicit cues to infer patient demographics such as race. We then show that there are statistically significant discrepancies between patient subgroups: responses to Black posters consistently show lower empathy than those to any other demographic group (2%-13% lower than the control group). Promisingly, we do find that the manner in which responses are generated significantly impacts the quality of the response. We conclude by proposing safety guidelines for the potential deployment of LLMs for mental health response.
Generation
(Nov 14): 13:00-14:00 (Afternoon) - Room: Gather
principles and ensures logical consistency. Extensive experimental results show that, benefiting from proof principle guidance, PESA gener-
ates argumentative essays with better logical validity and persuasiveness than strong baseline models.
Industry
(Nov 14): 13:00-14:00 (Afternoon) - Room: Gather
Information Extraction
(Nov 14): 13:00-14:00 (Afternoon) - Room: Gather
Language Modeling 2
(Nov 14): 13:00-14:00 (Afternoon) - Room: Gather
a variation of Direct Preference Optimization (DPO) that is more effective for knowledge modifications. Our method is based on an online approach that continually updates the knowledge stored in the model. We use the current knowledge as a negative sample and the new knowledge we want to introduce as a positive sample in a DPO-style process. We also use teacher-forcing for negative-sample generation and optimize using the positive sample, which helps maintain localized changes. We tested our KE method on various datasets and models, comparing it to several cutting-edge methods with 100 and 500 sequential edits. Additionally, we conducted an ablation study comparing our method to the standard DPO approach. Our experimental results show that our modified DPO method allows for more refined KE, achieving similar or better performance compared to previous methods.
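For reference, the underlying DPO objective with the old fact as the negative and the new fact as the positive can be sketched as below; the scalar log-probabilities are dummies standing in for per-sequence scores from the edited model and a frozen reference model.

```python
# Minimal sketch of a DPO-style objective for knowledge editing: the old
# fact is the rejected (negative) sequence, the new fact the chosen
# (positive) one. Log-probabilities here are dummy placeholders.
import torch
import torch.nn.functional as F

def dpo_loss(pi_pos, pi_neg, ref_pos, ref_neg, beta: float = 0.1):
    # Sequence-level log-probabilities under the policy (pi) and the
    # frozen reference model (ref).
    margin = beta * ((pi_pos - ref_pos) - (pi_neg - ref_neg))
    return -F.logsigmoid(margin).mean()

pi_pos, pi_neg = torch.tensor([-2.0]), torch.tensor([-1.0])   # new vs old fact
ref_pos, ref_neg = torch.tensor([-2.5]), torch.tensor([-0.9])
print(dpo_loss(pi_pos, pi_neg, ref_pos, ref_neg))
```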
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Dynamic Rewarding with Prompt Optimization Enables Tuning-free Self-Alignment of Language Models
Somanshu Singla, Zhen Wang, Tianyang Liu, Abdullah Ashfaq, Zhiting Hu, Eric P. Xing
Aligning Large Language Models (LLMs) traditionally relies on complex and costly training processes like supervised fine-tuning (SFT) and
reinforcement learning from human feedback (RLHF). To address the challenge of achieving alignment without these extensive tuning costs
and expensive annotations, we present a novel, tuning-free approach for self-alignment called Dynamic Rewarding with Prompt Optimization
(DRPO). Our approach enables self-alignment through a search-based prompt optimization framework, allowing the model to self-improve
and generate optimized prompts without additional training or human supervision. The core of DRPO leverages a dynamic rewarding mecha-
nism to identify and rectify model-specific alignment weaknesses, enabling LLMs to adapt quickly to various alignment challenges. Empirical
evaluations on eight recent LLMs, including both open- and closed-source, reveal that DRPO significantly enhances alignment performance,
enabling base models to outperform their SFT/RLHF-tuned counterparts. Moreover, DRPO’s automatically optimized prompts surpass those
curated by human experts, demonstrating its superior alignment capabilities. Our findings point to a highly cost-effective and adaptable solution for future alignment research.
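The search-based optimization loop at the core of such tuning-free alignment can be pictured as a simple hill climb over prompts, as in the sketch below; `mutate` and `reward` are hypothetical placeholders for the paper's prompt edits and dynamic reward model.

```python
# Sketch of search-based prompt optimization with a stubbed dynamic reward;
# the mutation and reward functions are hypothetical placeholders.
import random

def mutate(prompt: str) -> str:
    edits = [" Be concise.", " Cite evidence.", " Consider safety."]
    return prompt + random.choice(edits)

def reward(prompt: str) -> float:
    # Stand-in for a dynamic reward that scores model outputs under this
    # prompt against the currently diagnosed alignment weaknesses.
    return len(set(prompt.split()))  # dummy: favors diverse instructions

def optimize(seed_prompt: str, steps: int = 20) -> str:
    best, best_r = seed_prompt, reward(seed_prompt)
    for _ in range(steps):
        cand = mutate(best)
        if (r := reward(cand)) > best_r:
            best, best_r = cand, r
    return best

random.seed(0)
print(optimize("You are a helpful assistant."))
```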
generative tasks dataset, covering 15 Indic and 3 other languages, while using GPT-4 (one of the costliest LLM services released so far; see http://tinyurl.com/llm-costing) as a commercial LLM. We observe and analyze interesting patterns involving token count, cost, and quality across a multitude of languages and tasks. We show that choosing the best policy for interacting with the LLM can reduce cost by 90% while giving better or comparable performance, compared to communicating with the LLM in the original LRL.
Machine Translation
(Nov 14): 13:00-14:00 (Afternoon) - Room: Gather
by optimizing a token distribution matrix based on the text-to-image finetuning strategy, with a token-level bias obfuscation loss as regularization. We evaluate RAt on a large-scale text-to-image dataset with various concepts as targets in both in-domain and transfer-domain scenarios. The evaluation results demonstrate that, compared to other T2I-Refine schemes, RAt is capable of implicitly attacking input prompts to generate images with higher quality and explicit visual bias towards a specific concept group.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Android in the Zoo: Chain-of-Action-Thought for GUI Agents
Jiwen Zhang, Jihao Wu, Teng Yihua, Minghui Liao, Nuo Xu, Xiao Xiao, zhongyu wei, Duyu Tang
Large language models (LLMs) have led to a surge of autonomous GUI agents for smartphones, which complete tasks triggered by natural language by predicting a sequence of API actions. Even though such tasks rely heavily on past actions and visual observations, existing studies typically consider little of the semantic information carried by intermediate screenshots and screen operations. To address this, this work presents Chain-of-Action-Thought (dubbed CoAT), which takes into account the description of previous actions, the current screen, and, more importantly, action thinking about which actions should be performed and the outcomes of the chosen action. We demonstrate that, in a zero-shot setting on three off-the-shelf LMMs, CoAT significantly improves action prediction compared to previously proposed context modeling. To further facilitate research in this line, we construct a dataset, Android-In-The-Zoo (AitZ), which contains 18,643 screen-action pairs together with chain-of-action-thought annotations. Experiments show that fine-tuning a 1B model (i.e., AUTO-UI-base) on our AitZ dataset achieves on-par performance with CogAgent-Chat-18B.
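A chain-of-action-thought style prompt for a GUI agent might be assembled roughly as follows; the field names and wording are illustrative assumptions rather than the paper's exact format.

```python
# Sketch of a chain-of-action-thought style prompt for a GUI agent,
# assembling prior actions, the current screen description, and explicit
# action thinking before the next action is predicted (format is illustrative).

def coat_prompt(task, prev_actions, screen_desc):
    history = "\n".join(f"- {a}" for a in prev_actions) or "- (none)"
    return (
        f"Task: {task}\n"
        f"Previous actions:\n{history}\n"
        f"Current screen: {screen_desc}\n"
        "Action thinking: what should be done next, and what outcome "
        "do we expect from that action?\n"
        "Next action:"
    )

print(coat_prompt("Mute the phone",
                  ["open Settings", "tap 'Sound & vibration'"],
                  "Sound settings page with a 'Silent mode' toggle"))
```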
(Nov 14): 13:00-14:00 (Afternoon) - Gather
MEANT: Multimodal Encoder for Antecedent Information
Benjamin Irving, Annika Marie Schoene
The stock market provides a rich well of information that can be split across modalities, making it an ideal candidate for multimodal evaluation. Multimodal data plays an increasingly important role in the development of machine learning and has been shown to positively impact performance. But information can do more than exist across modes: it can also exist across time. How should we attend to temporal data that consists of multiple information types? This work introduces (i) the MEANT model, a Multimodal Encoder for Antecedent information, and (ii) a new dataset called TempStock, which consists of price, Tweets, and graphical data with over a million Tweets from all of the companies in the S&P 500 Index. We find that MEANT improves performance over existing baselines by more than 15%, and our ablation study shows that the textual information affects performance far more than the visual information on our time-dependent task. The code and dataset will be made available upon publication.
NLP Applications
(Nov 14): 13:00-14:00 (Afternoon) - Room: Gather
Yejie Wang, Keqing He, Dayuan Fu, Zhuoma GongQue, Heyang Xu, Yanxu Chen, Zhexu Wang, Yujia Fu, Guanting Dong, Muxi Diao, Jingang
Wang, Mengdi Zhang, Xunliang Cai, Weiran Xu
Recently, there has been growing interest in how to construct better code instruction tuning data. However, we observe that code models trained with these datasets exhibit high performance on HumanEval but perform worse on other benchmarks such as LiveCodeBench. Upon further investigation, we find that many datasets suffer from severe data leakage. After cleaning up most of the leaked data, some well-known high-quality datasets perform poorly. This discovery reveals a new challenge: identifying which datasets genuinely qualify as high-quality code instruction data. To address this, we propose an efficient code data pruning strategy for selecting good samples. Our approach is based on three dimensions: instruction complexity, response quality, and instruction diversity. Based on our selected data, we present XCoder, a family of models finetuned from LLaMA3. Our experiments show XCoder achieves new state-of-the-art performance using less training data, which verifies the effectiveness of our data strategy. Moreover, we perform a comprehensive analysis of the data composition and find existing code datasets have different characteristics according to their construction methods, which provides new insights for future code LLMs.
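A toy version of pruning along the three stated dimensions could look like the sketch below, with deliberately crude proxy scorers for complexity, quality, and diversity; the actual XCoder scorers are model-based and not reproduced here.

```python
# Sketch of pruning instruction-tuning data along three axes
# (complexity, quality, diversity); scorers are crude hypothetical proxies.

def complexity(sample) -> float:
    return len(sample["instruction"].split())      # proxy scorer

def quality(sample) -> float:
    return float("def " in sample["response"])     # proxy scorer

def select(data, k):
    ranked = sorted(data, key=lambda s: complexity(s) + quality(s), reverse=True)
    picked, seen = [], set()
    for s in ranked:                               # greedy de-duplication
        key = s["instruction"].lower().split()[-1] # crude diversity proxy
        if key not in seen:
            picked.append(s)
            seen.add(key)
        if len(picked) == k:
            break
    return picked

data = [{"instruction": "Write a Python function to reverse a list",
         "response": "def rev(xs): return xs[::-1]"},
        {"instruction": "Write a haiku", "response": "..."},
        {"instruction": "Write a Python class for a stack",
         "response": "class Stack: ..."}]
print([s["instruction"] for s in select(data, 2)])
```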
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Joint Pre-Encoding Representation and Structure Embedding for Efficient and Low-Resource Knowledge Graph Completion
Chenyu Qiu, Pengjiang Qian, Chuang Wang, Jian Yao, Li Liu, Fang wei, Eddie Y.K. Eddie
Knowledge graph completion (KGC) aims to infer missing or incomplete parts of a knowledge graph. Existing models are generally divided into structure-based and description-based models, among which description-based models often require longer training and inference times as well as increased memory usage. In this paper, we propose the Pre-Encoded Masked Language Model (PEMLM) to solve the KGC problem efficiently. By encoding textual descriptions into semantic representations before training, the required resources are significantly reduced. Furthermore, we introduce a straightforward but effective fusion framework to integrate structural embeddings with the pre-encoded semantic descriptions, which enhances the model's prediction performance on 1-N relations. The experimental results demonstrate that our proposed strategy attains state-of-the-art performance on the WN18RR (MRR +5.4% and Hits@1 +6.4%) and UMLS datasets. Compared to existing models, we increase inference speed by 30x and reduce training memory by approximately 60%.
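The pre-encode-then-fuse idea can be sketched as follows: description embeddings are computed once offline, and only a light fusion over learned structural embeddings runs during training; dimensions and module names are illustrative assumptions, not the paper's architecture.

```python
# Minimal sketch of fusing pre-encoded description embeddings with
# learned structural embeddings (illustrative dimensions and names).
import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    def __init__(self, sem_dim=384, struct_dim=128, n_entities=1000):
        super().__init__()
        self.struct = nn.Embedding(n_entities, struct_dim)  # learned structure
        self.fuse = nn.Linear(sem_dim + struct_dim, struct_dim)

    def forward(self, sem_emb, entity_ids):
        # sem_emb holds description embeddings pre-encoded once, offline,
        # so no text encoder runs during training or inference.
        z = torch.cat([sem_emb, self.struct(entity_ids)], dim=-1)
        return self.fuse(z)  # fused entity representation for link scoring

model = FusionEncoder()
sem = torch.randn(4, 384)  # stand-in for pre-encoded descriptions
print(model(sem, torch.arange(4)).shape)  # torch.Size([4, 128])
```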
(Nov 14): 13:00-14:00 (Afternoon) - Gather
PKAD: Pretrained Knowledge is All You Need to Detect and Mitigate Textual Backdoor Attacks
Yu Chen, Qi Cao, Kaike Zhang, Xuchao Liu, Huawei Shen
In textual backdoor attacks, attackers insert poisoned samples with triggered inputs and target labels into training datasets to manipulate model behavior, threatening the model's security and reliability. Current defense methods can generally be categorized into inference-time and training-time ones. The former often require a set of clean samples to set detection thresholds, which may be hard to obtain in practical application scenarios, while the latter usually require an additional retraining or unlearning process to obtain a clean model, significantly increasing training costs. To avoid these drawbacks, we focus on developing a practical defense method applied before model training and without using any clean samples. Our analysis reveals that, with the help of a pre-trained language model (PLM), poisoned samples, unlike clean ones, exhibit mismatched relationships and shared characteristics. Based on these observations, we further propose a two-stage poison detection strategy solely leveraging insights from the PLM before model training. Extensive experiments confirm our approach's effectiveness, achieving better performance than current leading methods more swiftly. Our code is available at https://github.com/Ascian/PKAD.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
RoBERT2VecTM: A Novel Approach for Topic Extraction in Islamic Studies
Sania Aftar, Amina El Ganadi, Luca Gagliardelli, Sonia Bergamaschi
Investigating "Hadith" texts, crucial for theological studies and Islamic jurisprudence, presents challenges due to the linguistic complexity of
Arabic, such as its complex morphology. In this paper, we propose an innovative approach to address the challenges of topic modeling in
Hadith studies by utilizing the Contextualized Topic Model (CTM). Our study introduces RoBERT2VecTM, a novel neural-based approach
that combines the RoBERTa transformer model with Doc2Vec, specifically targeting the semantic analysis of "Matn" (the actual content). The
methodology outperforms many traditional state-of-the-art NLP models by generating more coherent and diverse Arabic topics. The diversity
of the generated topics allows for further categorization, deepening the understanding of discussed concepts. Notably, our research highlights
the critical impact of lemmatization and stopwords in enhancing topic modeling. This breakthrough marks a significant stride in applying
NLP to non-Latin languages and opens new avenues for the nuanced analysis of complex religious texts.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
HyperBERT: Mixing Hypergraph-Aware Layers with Language Models for Node Classification on Text-Attributed Hypergraphs
Adrián Bazaga, Pietro Lio, Gos Micklem
Hypergraphs are characterized by complex topological structure, representing higher-order interactions among multiple entities through hy-
peredges. Lately, hypergraph-based deep learning methods to learn informative data representations for the problem of node classification
on text-attributed hypergraphs have garnered increasing research attention. However, existing methods struggle to simultaneously capture the full extent of hypergraph structural information and the rich linguistic attributes inherent in the nodes, which largely hampers
their effectiveness and generalizability. To overcome these challenges, we explore ways to further augment a pretrained BERT model with
specialized hypergraph-aware layers for the task of node classification. Such layers introduce higher-order structural inductive bias into the
language model, thus improving the model’s capacity to harness both higher-order context information from the hypergraph structure and
semantic information present in text. In this paper, we propose a new architecture, HyperBERT, a mixed text-hypergraph model which si-
multaneously models hypergraph relational structure while maintaining the high-quality text encoding capabilities of a pre-trained BERT.
Notably, HyperBERT presents results that achieve a new state-of-the-art on five challenging text-attributed hypergraph node classification
benchmarks.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
TRoTR: A Framework for Evaluating the Re-contextualization of Text Reuse
Francesco Periti, Pierluigi Cassotti, Stefano Montanelli, Nina Tahmasebi, Dominik Schlechtweg
Current approaches for detecting text reuse do not focus on recontextualization, i.e., how the new context(s) of a reused text differs from its
original context(s). In this paper, we propose a novel framework called TRoTR that relies on the notion of topic relatedness for evaluating
the diachronic change of context in which text is reused. TRoTR includes two NLP tasks: TRiC and TRaC. TRiC is designed to evaluate
the topic relatedness between a pair of recontextualizations. TRaC is designed to evaluate the overall topic variation within a set of recontex-
tualizations. We also provide a curated TRoTR benchmark of biblical text reuse, human-annotated with topic relatedness. The benchmark
exhibits an inter-annotator agreement of .811. We evaluate multiple, established SBERT models on the TRoTR tasks and find that they exhibit greater sensitivity to textual similarity than to topic relatedness. Our experiments show that fine-tuning these models can mitigate this sensitivity.
(Nov 14): 13:00-14:00 (Afternoon) - Gather
Sing it, Narrate it: Quality Musical Lyrics Translation
Zhuorui Ye, Jinhan Li, Rongwu Xu
Translating lyrics for musicals presents unique challenges due to the need to ensure high translation quality while adhering to singability
requirements such as length and rhyme. Existing song translation approaches often prioritize these singability constraints at the expense of
translation quality, which is crucial for musicals. This paper aims to enhance translation quality while maintaining key singability features.
Our method consists of three main components. First, we create a dataset to train reward models for the automatic evaluation of translation
quality. Second, to enhance both singability and translation quality, we implement a two-stage training process with filtering techniques.
Finally, we introduce an inference-time optimization framework for translating entire songs. Extensive experiments, including both automatic
and human evaluations, demonstrate significant improvements over baseline methods and validate the effectiveness of each component in our
approach.
Question Answering
(Nov 14): 13:00-14:00 (Afternoon) - Room: Gather
of the annotated publications apply good or very good quality management. However, we deem the effort of 30% of the works to be only subpar. Our analysis also shows common errors, especially when using inter-annotator agreement and computing annotation error rates.
for efficient adapter techniques. Despite this, the concurrent serving of multiple adapters, each with unique matrix shapes, poses significant
system-level challenges. To address these issues, we identify an opportunity in structurally sparse adapters, which, unlike low-rank adapters,
maintain consistent matrix shapes while varying in sparsity patterns. Leveraging this characteristic, we introduce SpartanServe, a system
designed for efficient concurrent serving of LLMs using multiple structurally sparse adapters. SpartanServe employs a unified matrix multi-
plication operation and a novel memory management technique to enable effective batching. Furthermore, the incorporation of Triton kernels
enhances the acceleration of matrix multiplication in the serving process. Experimental results demonstrate that SpartanServe achieves 2.12× speedup over S-LoRA when serving 96 adapters using a single NVIDIA A100 GPU (40GB), showcasing its efficacy in concurrent LLM serving.
Friday, November 15, 2024
Enhancing LLM Capabilities Beyond Scaling Up - Wenpeng Yin, Muhao Chen, Rui Zhang, Ben Zhou, Fei Wang and Dan Roth (Brickell/Flagler Ballrooms, Terrace Level)
Language Agents: Foundations, Prospects, and Risks - Yu Su, Diyi Yang, Shunyu Yao and Tao Yu (Brickell/Flagler Ballrooms, Terrace Level)
Tutorials
Please refer to the official program web page for last-minute updates: https://2024.emnlp.org/program/tutorials/
14 Tutorials Details
Tutorial Message
Welcome to the Tutorial Session of EMNLP 2024!
As the field of NLP continues to evolve, this year’s tutorials at EMNLP 2024 will give the audience comprehensive introductions of six
exciting topics by experts in these areas: natural language explanations, offensive speech, human-centered evaluation, AI for science, agents,
and enhancing capabilities of LLMs.
As in recent years, the process of calling for, submitting, reviewing, and selecting tutorials was a collaborative effort across ACL, EACL, NAACL, and EMNLP. Each tutorial proposal was meticulously reviewed by a panel of three reviewers, who assessed them based on criteria such as clarity, preparedness, novelty, timeliness, instructors' experience, potential audience, open access to teaching materials, and diversity (including multilingualism, gender, age, and geolocation). A total of six tutorials covering the aforementioned topics were selected for EMNLP.
We would like to thank the tutorial authors for their contributions, the tutorial chairs across conferences for this coordinated effort, as well as
the EMNLP conference organizers, especially the general chair Thamar Solorio.
Flor Miriam Plaza-del-Arco, Debora Nozza, Marco Guerini, Jeffrey Sorensen and
Marcos Zampieri
https://nlp-for-countering-hate-speech-tutorial.github.io/
Debora Nozza is an Assistant Professor in Computing Sciences at Bocconi University and a recipient of a €1.5 million ERC Starting Grant (2023) for her research on personalized and subjective approaches to Natural Language Processing. She previously secured a €120,000 grant from Fondazione Cariplo for her project MONICA, which monitors coverage, attitudes, and accessibility of Italian COVID-19 response measures. Her research focuses on NLP, especially in detecting and countering hate speech and algorithmic bias in multilingual social media data.
She has organized the 7th Workshop on Online Abuse and Harms (ACL 2023), the ICWSM 2023 Data Challenge, and tasks on misogyny
identification and multilingual hate speech detection at Evalita and SemEval.
Marco Guerini, FBK, Italy
email: info@marcoguerini.eu
website: https://www.marcoguerini.eu/
Marco Guerini is a researcher in Computational Linguistics and leads the Language and Dialogue Technologies group at Fondazione Bruno
Kessler (FBK). His work focuses on persuasive communication, sentiment analysis, and social media, with a recent emphasis on AI technolo-
gies for counter-narrative generation to combat online hate speech. He holds a Ph.D. in Information and Communication Technologies from
the University of Trento and has published extensively in top conferences and international journals. His research has gained international
media attention, including features in the Wall Street Journal, MIT Technology Review, and Harvard Business Review. He has worked with FBK's NLP group and Trento-Rise, and received a Google Research Award (2011) and a sponsorship from eBay (2016). Additionally, he consults for startups and companies and writes a technology blog for Corriere della Sera.
Jeffrey Sorensen, Jigsaw, USA
email: sorenj@google.com
website: https://research.google/people/author14753/
Jeffrey Sorensen is a researcher at Jigsaw, a unit within Google that explores threats to open societies, and builds technology that inspires
scalable solutions. His work spans the development of scalable algorithms for threat detection, countering online abuse, and fostering re-
silience against cyber threats. An experienced contributor to the fields of computational social science and ethical AI, Jeffrey has presented
his research at leading conferences, showcasing solutions that help safeguard online communities and enhance digital well-being on a global
scale.
Wenpeng Yin, Muhao Chen, Rui Zhang, Ben Zhou, Fei Wang and Dan Roth
https://www.wenpengyin.org/publications/beyond-llm-scaling-emnlp24
• Symbolic constraints as structures (e.g., human-written, mathematical constraints, and compiler constraints)
• Structures from decomposing the target problem
• Procedural structures that come from cognitive or problem-solving processes, such as DSP, ReAct, and RAP.
• Introducing inference-time threats (e.g., prompt injection, malicious task instructions, jailbreaking attacks, adversarial demonstra-
tions, and training-free backdoor attacks)
• Defense techniques (e.g., prompt robustness estimation, demonstration-based defense, and ensemble debiasing)
https://sites.google.com/view/reasoning-with-explanations
https://language-agent-tutorial.github.io/
Su Lin Blodgett, Jackie Chi Kit Cheung, Vera Liao and Ziang Xiao
https://human-centered-eval.github.io/
15 Workshops
Overview
Jasmine: W1 - BlackboxNLP 2024: Analyzing and interpreting neural networks for NLP (p.420)
Merrick 1: W2 - Seventh Workshop on Computational Models of Reference, Anaphora and Coreference (p.421)
Miami Lecture Hall: W3 - Seventh Workshop on Fact Extraction and VERification (FEVER) (p.423)
Merrick 2: W7 - The Third Workshop on Text Simplification, Accessibility and Readability (p.430)
Merrick 1: W11 - Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U) (p.446)
Merrick 2: W12 - The 4th International Workshop on Natural Language Processing for Digital Humanities (NLP4DH) (p.448)
Brickell: W13 - GenBench: The second workshop on generalisation (benchmarking) in NLP (p.452)
Hibiscus A: W14 - Natural Legal Language Processing (NLLP) Workshop 2024 (p.454)
Hibiscus B: W16 - NLP4Science: The First Workshop on Natural Language Processing for Science (p.458)
Pearson: W17 - The Second Workshop on Social Influence in Conversations (SICon 2024) (p.459)
Miami Lecture Hall: W18 - The 11th Workshop on Asian Translation (WAT2024) (p.460)
Johnson: W19 - The First Workshop on Advancing Natural Language Processing for Wikipedia (NLP for Wikipedia) (p.462)
W1 - BlackboxNLP 2024: Analyzing and interpreting neural
networks for NLP
Organizers:
Yonatan Belinkov, Najoung Kim, Jaap Jumelet, Hosein Mohebbi, Aaron Mueller,
Hanjie Chen
https://blackboxnlp.github.io/
Room: Jasmine
Friday, November 15, 2024
Many recent performance improvements in NLP have come at the cost of our understanding of the systems. How do we assess what representations and computations models learn? How do we formalize desirable properties of interpretable models, and measure the extent to which existing models achieve them? How can we build models that better encode these properties? What can new or existing tools tell us about these systems' inductive biases? The goal of this workshop is to bring together researchers focused on interpreting and explaining NLP models by taking inspiration from fields such as machine learning, psychology, linguistics, and neuroscience. We hope the workshop will serve as an interdisciplinary meetup that allows for cross-collaboration.
Time Session
09:00 - 09:10 Opening remarks
09:10 - 10:00 Invited talk by Jack Merullo
10:00 - 10:30 Oral presentations:
• Routing in Sparsely-gated Language Models responds to Context. Stefan
Arnold, Marian Fietta, Dilara Yesilbas
• Log Probabilities Are a Reliable Estimate of Semantic Plausibility in Base
and Instruction-Tuned Language Models. Carina Kauf, Emmanuele Chersoni,
Alessandro Lenci, Evelina Fedorenko, Anna A Ivanova
10:30 - 11:00 Break
11:00 - 12:30 In-person & virtual poster session 1
12:30 - 14:00 Lunch
14:00 - 15:00 Invited talk by Himabindu Lakkaraju
15:00 - 15:30 Oral presentations:
• Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on
Gemma 2. Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis
Smith, Nicolas Sonnerat, Vikrant Varma, Janos Kramar, Anca Dragan, Rohin
Shah, Neel Nanda
• Mechanistic? Naomi Saphra, Sarah Wiegreffe
15:30 - 16:00 Break
16:30 - 16:40 Closing remarks and awards
17:00 - 18:00 Panel discussion on Interpretability with: Dieuwke Hupkes, Vera Liao, Asma
Ghandeharioun, Marius Mosbach and Jack Merullo
W2 - Seventh Workshop on Computational Models of Reference,
Anaphora and Coreference
Organizers:
Maciej Ogrodniczuk, Sameer Pradhan, Anna Nedoluzhko, Massimo Poesio,
Vincent Ng
https://sites.google.com/view/crac2024/
Room: Merrick 1
Friday, November 15, 2024
Since 2016, the yearly CRAC (and its predecessor, CORBON) workshop has become the primary forum
for researchers interested in the computational modeling of reference, anaphora, and coreference to dis-
cuss and publish their results. Over the years, this workshop series has successfully organized five shared
tasks, which stimulated interest in new problems in this area of research, facilitated the discussion and
dissemination of results on new problems/directions (e.g., multimodal reference resolution), and helped
expand the coreference community that used to be dominated by European researchers to include young
researchers from the Americas. The aim of the workshop is to provide a forum where work on all aspects
of computational work on anaphora resolution and annotation can be presented.
Time Session
09:00 - 09:15 Opening and welcome (Vincent Ng and Maciej Ogrodniczuk)
09:15 - 10:30 Invited talk: Reference at the Heart of Natural Language Processing. Jackie Chi Kit Cheung
Findings Paper Session
11:00 - 11:20 • Challenges to Evaluating the Generalization of Coreference Resolution Models: A Measurement Modeling Perspective. Ian Porada, Alexandra Olteanu, Kaheer Suleman, Adam Trischler, and Jackie Chi Kit Cheung
11:20 - 11:40 • Any Other Thoughts, Hedgehog? Linking Deliberation Chains in Collaborative Dialogues. Abhijnan Nath, Videep Venkatesha, Mariah Bradford, Avyakta Chelle, Austin Collin Youngren, Carlos Mabrey, Nathaniel Blanchard, and Nikhil Krishnaswamy
11:40 - 11:50 • MMAR: Multilingual and Multimodal Anaphora Resolution in Instructional Videos. Cennet Oguz, Pascal Denis, Simon Ostermann, Emmanuel Vincent, Natalia Skachkova, and Josef van Genabith
EMNLP 2024 Paper
11:50 - 12:10 • Major Entity Identification: A Generalizable Alternative to Coreference Resolution. Kawshik S. Manikantan, Shubham Toshniwal, Makarand Tapaswi, and Vineet Gandhi
Research Paper Session
13:50 - 14:00 • Enriching Conceptual Knowledge in Language Models through Metaphorical Reference Explanation. Zixuan Zhang and Heng Ji
14:00 - 14:10 • Polish Coreference Corpus as an LLM Testbed: Evaluating Coreference Resolution within Instruction-Following Language Models by Instruction-Answer Alignment. Karol Saputa, Angelika Peljak-Łapińska, and Maciej Ogrodniczuk
14:10 - 14:30 • MSCAW-coref: Multilingual, Singleton and Conjunction-Aware Word-Level Coreference Resolution. Houjun Liu, John Bauer, Karel D'Oosterlinck, Christopher Potts, and Christopher D. Manning
14:30 - 14:50 • Unifying the Scope of Bridging Anaphora Types in English: Bridging Annotations in ARRAU and GUM. Lauren Levine and Amir Zeldes
14:50 - 15:10 • Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case. Vagrant Gautam, Julius Steuer, Eileen Bingert, Ray Johns, Anne Lauscher, and Dietrich Klakow
15:10 - 15:30 • DeepHCoref: A Deep Neural Coreference Resolution for Hindi Text. Kusum Lata, Kamlesh Dutta, Pardeep Singh, and Abhishek Kanwar
Shared Task Paper Session
16:00 - 16:30 • Findings of the Third Shared Task on Multilingual Coreference Resolution. Michal Novák, Barbora Dohnalová, Miloslav Konopík, Anna Nedoluzhko, Martin Popel, Ondřej Pražák, Jakub Sido, Milan Straka, Zdeněk Žabokrtský, and Daniel Zeman
16:30 - 16:50 • CorPipe at CRAC 2024: Predicting Zero Mentions from Raw Text. Milan Straka
16:50 - 17:10 • End-to-end Multilingual Coreference Resolution with Headword Mention Representation. Ondřej Pražák and Miloslav Konopík
17:10 - 17:20 • Multilingual coreference resolution as text generation. Natalia Skachkova
Panel Discussion
17:20 - 17:50 • The future of coreference resolution in the era of LLMs. Michal Novák, Ondřej Pražák, and Martin Popel
17:50 - 18:00 Closing of the workshop (Maciej Ogrodniczuk)
W3 - Seventh Workshop on Fact Extraction and VERification
(FEVER)
Organizers:
Michael Schlichtkrull, Yulong Chen, Chenxi Whitehouse, Zhenyun Deng, Mubashara Akhtar, Rami Aly, Rui Cao, Zhijiang Guo, Christos Christodoulopoulos, Oana Cocarascu, Arpit Mittal, James Thorne, Andreas Vlachos
https://fever.ai/workshop.html
Room: Miami Lecture Hall
Friday, November 15, 2024
With billions of individual pages on the web providing information on almost every conceivable topic, we
should have the ability to reason about a wide range of domains. However, in order to do so, we need to
ensure that we trust the accuracy of the sources of information that we use. Handling false information
coming from unreliable sources has become the focus of a lot of recent research and media coverage. In
an effort to jointly address these problems, we are organizing the 7th instalment of the Fact Extraction and
VERification (FEVER) workshop (http://fever.ai/) to promote research in this area.
In this year's workshop, we are also organising a new fact-checking shared task, AVeriTeC: Automated Verification of Textual Claims. The aim is to fact-check real-world claims using evidence from the web. For each claim, systems must return a label (Supported, Refuted, Not Enough Evidence, Conflicting Evidence/Cherry-picking) and appropriate evidence. The evidence must be retrieved from the document collection provided by the organisers or from the Web (e.g. using a search API).
Time Session
9:00-9:45 Opening Remarks & Shared Task Overview
FEVER Organizers
9:45-10:30 Keynote Talk 1: Omar Khattab
10:30-11:00 Coffee break
11:00-12:00 Poster Session
• Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024.
Christopher Malon
• Retrieving Semantics for Fact-Checking: A Comparative Approach using CQ
(Claim to Question) & AQ (Answer to Question). Nicolò Urbani, Sandip Modha
and Gabriella Pasi
• RAG-Fusion Based Information Retrieval for Fact-Checking. Yuki Momii, Tet-
suya Takiguchi and Yasuo Ariki
• UHH at AVeriTeC: RAG for Fact-Checking with Real-World Claims. Ozge Sevgili, Irina Nikishina, Seid Muhie Yimam, Martin Semmann and Chris Biemann
• Improving Evidence Retrieval on Claim Verification Pipeline through Question
Enrichment. Svetlana Churina, Anab Maulana Barik and Saisamarth Rajesh
Phaye
• Dunamu-mls Submissions on AVERITEC Shared Task. Heesoo Park, Dongjun
Lee, Jaehyuk Kim, ChoongWon Park and Changhwa Park
• FZI-WIM at AVeriTeC Shared Task: Real-World Fact-Checking with Question
Answering. Jin Liu, Steffen Thoma and Achim Rettinger
• Zero-Shot Learning and Key Points Are All You Need for Automated Fact-
Checking. Mohammad Ghiasvand Mohammadkhani, Ali Ghiasvand Moham-
madkhani and Hamid Beigy
• Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learn-
ing with LLMs. Ronit Singal, Pransh Patwa, Parth Patwa, Aman Chadha and
Amitava Das
• SK_DU Team: Cross-Encoder based Evidence Retrieval and Question Generation with Improved Prompt for the AVeriTeC Shared Task. Shrikant Malviya and Stamos Katsigiannis
• InFact: A Strong Baseline for Automated Fact-Checking. Mark Rothermel,
Tobias Braun, Marcus Rohrbach and Anna Rohrbach
• Exploring Retrieval Augmented Generation For Real-world Claim Verification.
Adjali Omar
• GProofT: A Multi-dimension Multi-round Fact Checking Framework Based
on Claim Fact Extraction. Jiayu Liu, Junhao Tang, Hanwen Wang, Baixuan Xu,
Haochen Shi, Weiqi Wang and Yangqiu Song
• HerO at AVeriTeC: The Herd of Open Large Language Models for Verifying
Real-World Claims. Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon and Kunwoo
Park
• AIC CTU system at AVeriTeC: Re-framing automated fact-checking as a sim-
ple RAG task. Herbert Ullrich, Tomás Mlynár and Jan Drchal
• Enhancing Fact Verification with Causal Knowledge Graphs and Transformer-
Based Retrieval for Deductive Reasoning. Fiona Anting Tan, Jay Desai and
Srinivasan H. Sengamedu
• Numerical Claim Detection in Finance: A New Financial Dataset, Weak-
Supervision Model, and Market Analysis. Agam Shah, Arnav Hiray, Pratvi
Shah, Arkaprabha Banerjee, Anushka Singh, Dheeraj Deepak Eidnani, Sahasra
Chava, Bhaskar Chaudhury and Sudheer Chava
• Streamlining Conformal Information Retrieval via Score Refinement. Yotam
Intrator, Regev Cohen, Ori Kelner, Roman Goldenberg, Ehud Rivlin and Daniel
Freedman
• Improving Explainable Fact-Checking via Sentence-Level Factual Reasoning.
Francielle Vargas, Isadora Salles, Diego Alves, Ameeta Agrawal, Thiago A. S.
Pardo and Fabrício Benevenuto
• Fast Evidence Extraction for Grounded Language Model Outputs. Pranav
Mani, Davis Liang and Zachary Chase Lipton
• Question-Based Retrieval using Atomic Units for Enterprise RAG. Vatsal
Raina and Mark Gales
• AMREx: AMR for Explainable Fact Verification. Chathuri Jayaweera, Sang-
pil Youm and Bonnie J Dorr
• Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation
Guidelines? Laura Majer and Jan Snajder
• Contrastive Learning to Improve Retrieval for Real-World Fact Checking.
Aniruddh Sriram, Fangyuan Xu, Eunsol Choi and Greg Durrett
• RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political
Fact-Checking using Multimodal Large Language Models. Mohammed Abdul
Khaliq, Paul Yu-Chun Chang, Mingyang Ma, Bernhard Pflugfelder and Filip
Miletic
• FactGenius: Combining Zero-Shot Prompting and Fuzzy Relation Mining to
Improve Fact Verification with Knowledge Graphs. Sushant Gautam and Roxana
Pop
• Fact or Fiction? Improving Fact Verification with Knowledge Graphs through
Simplified Subgraph Retrievals. Tobias Aanderaa Opsahl
• ProTrix: Building Models for Planning and Reasoning over Tables with Sen-
tence Context. Zirui Wu and Yansong Feng
• SparseCL: Sparse Contrastive Learning for Contradiction Retrieval. Haike Xu,
Zongyu Lin, Yizhou Sun, Kai-Wei Chang and Piotr Indyk
• Learning to Verify Summary Facts with Fine-Grained LLM Feedback. Jihwan
Oh, Jeonghwan Choi, Nicole Hee-Yeon Kim, Taewon Yun, Ryan Donghan Kwon
and Hwanjun Song
• DAHL: Domain-specific Automated Hallucination Evaluation of Long-Form
Text through a Benchmark Dataset in Biomedicine. Jean Seo, Jongwon Lim,
Dongjun Jang and Hyopil Shin
• Detecting Misleading News Representations on Social Media Posts. Satoshi
Tohda, Naoki Yoshinaga, Masashi Toyoda, Sho Cho and Ryota Kitabayashi
• Evidence Retrieval for Fact Verification using Multi-stage Reranking. Shrikant
Malviya and Stamos Katsigiannis
• Generating Media Background Checks for Automated Source Critical Reason-
ing. Michael Schlichtkrull
• DYNAMICQA: Tracing Internal Knowledge Conflicts in Language Models. Sara Vera Marjanovic, Haeun Yu, Pepa Atanasova, Maria Maistro, Christina Lioma and Isabelle Augenstein
• Zero-Shot Fact Verification via Natural Logic and Large Language Models.
Marek Strong, Rami Aly and Andreas Vlachos
• Do We Need Language-Specific Fact-Checking Models? The Case of Chinese.
Caiqi Zhang, Zhijiang Guo and Andreas Vlachos
12:00-12:35 Contributed Shared Task Talks
• InFact: A Strong Baseline for Automated Fact-Checking. Mark Rothermel,
Tobias Braun, Marcus Rohrbach and Anna Rohrbach
• HerO at AVeriTeC: The Herd of Open Large Language Models for Verifying
Real-World Claims. Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon and Kunwoo
Park
• AIC CTU system at AVeriTeC: Re-framing automated fact-checking as a sim-
ple RAG task. Herbert Ullrich, Tomás Mlynár and Jan Drchal
• Dunamu-mls Submissions on AVERITEC Shared Task. Heesoo Park, Dongjun
Lee, Jaehyuk Kim, ChoongWon Park and Changhwa Park
• Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024.
Christopher Malon
12:35-14:00 Lunch Break
14:00-14:45 Keynote Talk 2: Rada Mihalcea
14:45-15:30 Keynote Talk 3: Peter Cunliffe-Jones
15:30-16:00 Coffee Break
16:00-16:30 Contributed Shared Task Talks
• Enhancing Fact Verification with Causal Knowledge Graphs and Transformer-
Based Retrieval for Deductive Reasoning. Fiona Anting Tan, Jay Desai and
Srinivasan H. Sengamedu
• Contrastive Learning to Improve Retrieval for Real-World Fact Checking.
Aniruddh Sriram, Fangyuan Xu, Eunsol Choi and Greg Durrett
• FactGenius: Combining Zero-Shot Prompting and Fuzzy Relation Mining to
Improve Fact Verification with Knowledge Graphs. Sushant Gautam and Roxana
Pop
16:30-17:15 Keynote Talk 4: Peter Cunliffe-Jones
17:15-17:30 Closing Remarks
FEVER Organizers
W4 - Workshop on the Future of Event Detection
Organizers:
Joel Tetreault, Thien Huu Nguyen, Hemank Lamba, Amanda Hughes
https://future-of-event-detection.github.io/
Room: Johnson
Friday, November 15, 2024
In recent years, there has been a significant increase in the amount of publicly-generated digital data. One
prominent category of this data, and arguably the largest in terms of daily generation, pertains to vari-
ous real-world events, ranging from natural disasters to political occurrences to sports events. Detecting
these events serves various crucial purposes, including early warning systems, emergency response, situa-
tional awareness, tracking public health trends, and understanding societal shifts, among others. However,
automatic real-time event detection presents intriguing challenges, primarily stemming from the charac-
teristics of the data. These challenges include the diversity of public online data (multimodal nature), the
rapid pace at which data is produced (velocity), the sheer volume of data generated, and the reliability
of the data (veracity). Moreover, recent advancements in powerful Large Language Models (LLMs) and Generative AI systems offer new opportunities to revise event detection pipelines, enabling novel approaches and applications across various domains. The workshop focuses on three themes. Looking forward and looking back: the workshop will solicit ideas on how the field of event detection should evolve over the next twenty years, as well as papers reflecting on what has and has not worked in the field thus far. Expanding beyond NLP: many sibling areas actively research event detection, yet they have remained largely siloed, with little cross-communication despite working on similar problems; this workshop seeks to address this by actively soliciting research and invited speakers from these areas. Theory to application: finally, this workshop will emphasize how event detection technology can be used in real-world applications.
Time Session
09:00 - 09:15 Opening Remarks
09:15 - 10:00 Keynote: Heng Ji (UIUC)
10:00 - 10:30 DEGREE2: Efficient Extraction of Multiple Events Using Language Models; An Incremental Clustering Baseline for Event Detection on Twitter
10:30 - 11:00 Coffee Break
11:00 - 12:30 BERTrend: Neural Topic Modeling for Emerging Trends Detection; MUMOSA, Interactive Dashboard for MUlti-MOdal Situation Awareness; A Comprehensive Survey on Document-Level Information Extraction; Generative Approaches to Event Extraction: Survey and Outlook
12:30 - 14:15 Lunch
14:15 - 15:00 Keynote: Lise St. Denis (University of Colorado, Boulder)
15:00 - 15:30 When and Where Did it Happen? An Encoder-Decoder Model to Identify Scenario Context (Findings); Reasoning and Tools for Human-Level Forecasting
15:30 - 16:00 Coffee Break
16:00 - 16:20 Grounding Partially-Defined Events in Multimodal Data (Findings)
16:20 - 17:00 Panel
17:00 - 17:15 Concluding Remarks
W5 - The Sixth Workshop on Narrative Understanding
Organizers:
Faeze Brahman, Anneliese Brei, Khyathi Raghavi Chandu, Snigdha Chaturvedi,
Elizabeth Clark, Yash Kumar Lal, Mohit Iyyer
https://sites.google.com/cs.stonybrook.edu/wnu2024
Room: Pearson
Friday, November 15, 2024
This is the 6th iteration of the Narrative Understanding Workshop, which brings together an interdisci-
plinary group of researchers from AI, ML, NLP, Computer Vision and other related fields, as well as
scholars from the humanities to discuss methods to improve automatic narrative understanding capabili-
ties. The workshop will consist of talks from invited speakers, a panel of researchers and writers, and talks
and posters from accepted papers.
Time Session
09:00 - 10:00 Virtual Poster Session
10:00 - 10:10 Opening Remarks
10:10 - 10:50 Invited Talk: Mirella Lapata
10:50 - 11:10 Break
11:10 - 11:50 Invited Talk: Lydia Chilton
11:50 - 12:30 Invited Talk: David Mimno
12:30 - 14:00 Lunch
14:00 - 14:40 Invited Talk: Shashank Srivastava
14:40 - 15:20 Invited Talk: Maarten Sap
15:20 - 15:30 Break
15:30 - 16:30 In-Person Poster Session
W6 - Third Workshop on NLP for Positive Impact
Organizers:
Zhijing Jin, Rada Mihalcea, Joel Tetreault, Jieyu Zhao, Steven Wilson, Oana Ignat,
Daryna Dementieva, Giorgio Piatti
https://sites.google.com/view/nlp4positiveimpact
Room: Foster
Friday, November 15, 2024
The Third Workshop on NLP for Positive Impact continues the trend of responsible NLP model and application development, including fairness, sustainability, and inclusivity. We are connecting NLP with various socially important fields like healthcare, education, and the environment. Specifically, this year we have also invited NGOs that will showcase their challenges, together with insights from an NGO expert on digital violence. We welcome specialists from various perspectives to network, foster cross-disciplinary collaboration, and spark new ideas at our workshop.
Time Session
09:00 - 09:05 Opening Remark
09:05 - 09:30 Opening Talk by Rada Mihalcea
09:30 - 10:30 Theme Session 1 (Two Invited Talks, and 5-min Q&A)
09:30 - 10:00 Prof Yulia Tsvetkov (UW)
10:00 - 10:30 Prof Anjalie Field (JHU)
10:30 - 11:00 NGO lightning talk
11:00 - 12:00 Poster Session (In-person posters; virtual presenters in Zoom session)
12:00 - 13:00 Lunch break
13:00 - 14:00 Theme Session 2: Education (Two Invited Talks, and 5-min Q&A)
13:00 - 13:30 Prof Mrinmaya Sachan (ETH)
13:30 - 14:00 Stephen Mayhew (Duolingo)
14:00 - 15:00 Theme Session 3: Healthcare (Two Invited Talks, and 5-min Q&A)
14:00 - 14:30 Prof Veronica Perez-Rosa
14:30 - 15:00 Prof Louis-Philippe Morency
15:00 - 15:30 Oral Talk Sessions (5 talks & 5 min each, Q&A in the last 5 mins)
15:30 - 15:45 Coffee Break by EMNLP
15:45 - 16:05 Special Theme 'Digital Violence': NGO Talk by Cordelia Moore
16:05 - 17:00 Panel: ’Encouraging collaborations to advance NLP for Positive Impact’
Panelists: Cordelia Moore, Stephen Mayhew, Anjalie Field, and Jieyu Zhao
17:00 - 17:45 Research Brainstorming with attendees: Advancing NLP for Social Good
Topics: Community problems, AI solutions, collaborations
17:45 - 18:00 Best Paper Announcement & Closing
W7 - The Third Workshop on Text Simplification, Accessibility
and Readability
Organizers:
Matthew Shardlow, Fernando Alva-Manchego, Kai North, Regina Stodden, Sanja Štajner, Marcos Zampieri, Horacio Saggion
https://tsar-workshop.github.io/
Room: Merrick 2
Friday, November 15, 2024
The Text Simplification, Accessibility, and Readability (TSAR) workshop aims at bringing together re-
searchers, developers and industries of assistive technologies, public organizations representatives, and
other parties interested in the problem of making information more accessible to all citizens. We will
discuss recent trends and developments in the area of automatic text simplification, automatic readability
assessment, language resources and evaluation for text simplification.
Time Session
09:00 - 09:20 Welcome. Presented by: Matthew Shardlow
09:20 - 10:00 Invited Talk 1: Easy-to-understand Writing with AI Assistance. Walburga Fröhlich. Session Chair: Matthew Shardlow
10:00 - 10:30 Poster Micro-pitches. Session Chair: Matthew Shardlow
10:30 - 11:00 Coffee Break
11:00 - 12:00 Oral Session 1: Main Track. Session Chair: Fernando Alva-Manchego
• Cochrane-auto: An Aligned Dataset for the Simplification of Biomedical Ab-
stracts Jan Bakker and Jaap Kamps
• Images Speak Volumes: User-Centric Assessment of Image Generation for
Accessible Communication Miriam Anschütz, Tringa Sylaj and Georg Groh
• Society of Medical Simplifiers Chen Lyu and Gabriele Pergola
12:00 - 13:00 Lunch Break
13:00 - 14:40 Poster Session. Session Chairs: Matthew Shardlow (in-person), Kai North (virtual), Regina Stodden (virtual)
• CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin
and Cantonese [poster] Le Qiu, Shanyue Guo, Tak-sum Wong, Emmanuele Cher-
soni, John Sie Yuen Lee and Chu-Ren Huang
• MultiLS: An End-to-End Lexical Simplification Framework Kai North,
Tharindu Ranasinghe, Matthew Shardlow and Marcos Zampieri
• Considering Human Interaction and Variability in Automatic Text Simplifica-
tion [poster] Jenia Kim, Stefan Leijnen and Lisa Beinborn
• OtoBERT: Simplifying Suffixed Verbal Forms in Modern Hebrew Literature
Avi Shmidman and Shaltiel Shmidman
• Difficult for Whom? A Study of Japanese Lexical Complexity Adam Nohejl,
Akio Hayakawa, Yusuke Ide and Taro Watanabe
• EASSE-DE & EASSE-multi: Easier Automatic Sentence Simplification Eval-
uation for German & Multiple Languages [poster] Regina Stodden
• Evaluating the Simplification of Brazilian Legal Rulings in LLMs Using Read-
ability Scores as a Target [poster] Antonio Flavio Castro Paula and Celso G.
Camilo-Junior
• Measuring and Modifying the Readability of English Texts with GPT-4
[poster] Sean Trott and Pamela Rivière
• SpeciaLex: A Benchmark for In-Context Specialized Lexicon Learning (Find-
ings) Joseph Marvin Imperial and Harish Tayyar Madabushi
• README: Bridging Medical Jargon and Lay Understanding for Patient Edu-
cation through Data-Centric NLP (Findings) Zonghai Yao, Nandyala Siddharth
Kantu, Guanghao Wei, Hieu Tran, Zhangqi Duan, Sunjae Kwon, Zhichao Yang,
README annotation team, and Hong Yu
• Automating Easy Read Text Segmentation (Findings) Jesús Calleja, Thierry
Etchegoyhen, and David Ponce
14:40 - 15:30 Invited Talk 2: Artificial Intelligence and Plain Language. Iria da Cunha. Session Chair: Horacio Saggion
15:30 - 16:00 Coffee Break
16:00 - 16:45 Oral Session 2: Special Track. Session Chair: Marcos Zampieri
• Lexical Complexity Prediction and Lexical Simplification for Catalan and
Spanish Horacio Saggion, Stefan Bott, Sandra Szasz, Nelson Pérez, Saúl
Calderón and Martín Solís
• SciGisPy: A Novel Metric for Biomedical Text Simplification via Gist Infer-
ence Score Chen Lyu and Gabriele Pergola
16:45 - 17:30 Round Table. Session Chair: Horacio Saggion
17:30 - 17:35 Closing. Presented by: Matthew Shardlow
W8 - The Eighth Widening NLP Workshop (WiNLP 2024)
Organizers:
Atnafu Lambebo Tonja, Alfredo Gomez, Chanjun Park, Hellina Hailu Nigatu,
Santosh T.Y.S.S., Tanvi Anand, Wiem Ben Rim
https://www.winlp.org/winlp-2024-workshop/
Room: Hibiscus
Friday, November 15, 2024
The WiNLP workshop is open to all to foster an inclusive and welcoming ACL environment. It aims to
promote diversity and highlight the work of underrepresented groups in NLP: anyone who self-identifies
within an underrepresented group [based on gender, ethnicity, nationality, sexual orientation, disability sta-
tus, or otherwise] is invited to submit a two-page abstract for a poster presentation. In our 2024 iteration,
we hope to be more intentional about centering discussions of access and disability, as well as contribut-
ing to diversity in scientific background, discipline, training, obtained degrees, seniority, and communities
from underrepresented languages. The full-day event includes invited talks, oral presentations, and poster
sessions. The workshop provides an excellent opportunity for junior members in the community to show-
case their work and connect with senior mentors for feedback and career advice. It also offers recruitment
opportunities with leading industrial labs. Most importantly, the workshop will provide an inclusive and
accepting space, and work to lower structural barriers to joining and collaborating with the NLP commu-
nity at large.
Time Session
9:00 - 9:10 Welcome (Opening Session)
9:10 - 10:10 Keynote A: Danish Pruthi
10:10 - 11:00 Poster Session A
11:00 - 12:00 Panel A: Global Voices
12:00 - 12:45 Virtual Poster Session
12:45 - 13:30 Lunch
13:30 - 14:10 Mentorship Session
14:10 - 15:10 Panel B: Sailing the NLP Seas
15:10 - 16:00 Poster Session B
16:00 - 17:00 Keynote B: Alham Fikri Aji
17:00 - 17:10 Closing Session
W9 - The SIGNLL Conference on Computational Natural
Language Learning (CoNLL)
Organizers:
Libby Barak, Malihe Alikhani, Mert Inan and Julia Watson
https://conll.org/2024
Room: Tuttle
Friday, November 15, 2024 - Saturday, November 16, 2024
CoNLL is a yearly conference organized by SIGNLL (ACL's Special Interest Group on Natural Language Learning), focusing on theoretically, cognitively and scientifically motivated approaches to computational linguistics, rather than on work driven by particular engineering applications. This year, CoNLL is colocated with EMNLP 2024; registrations can be made through EMNLP (workshop 1).
Time Session
09:00 - 09:10 Opening Remarks
09:10 - 10:30 Keynote 1: Lorna Quandt
10:30 - 11:00 Coffee Break
11:00 - 12:30 Oral Session 1: Psycholinguistic Session (Chair: Libby Barak)
• Leveraging a Cognitive Model to Measure Subjective Similarity of Human and
GPT-4 Written Content. Tyler Malloy, Maria José Ferreira, Fei Fang, Cleotilde
Gonzalez
• SPAWNing Structural Priming Predictions from a Cognitively Motivated
Parser. Grusha Prasad, Tal Linzen
• Lossy Context Surprisal Predicts Task-Dependent Patterns in Relative Clause
Processing. Kate McCurdy, Michael Hahn
• Multimodal Large Language Models Foresee Objects Based on Verb Informa-
tion But Not Gender. Shuqi Wang, Xufeng Duan, Zhenguang Cai
12:30 - 13:45 Lunch
13:45 - 15:30 Poster Session 1
15:30 - 16:00 Coffee Break
16:00 - 17:30 Oral Session 2: Syntax and Structure Session (Chair: Omri Abend)
• Is Structure Dependence Shaped for Efficient Communication? A Case Study
on Coordination. Kohei Kajikawa, Yusuke Kubota, Yohei Oseki
• NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Lan-
guage Learning and Group Communication. Yuchen Lian, Tessa Verhoef, Ari-
anna Bisazza
• Solving the Challenge Set without Solving the Task: On Winograd Schemas
as a Test of Pronominal Coreference Resolution. Ian Porada, Jackie CK Cheung
• Global Learning with Triplet Relations in Abstractive Summarization. Jiaxin
Duan, Fengyu Lu, Junfei Liu
Day 2 (Saturday, Nov 16, 2024)
Time Session
09:00 - 09:10 Best Paper Awards
09:10 - 10:30 Keynote 2: Thamar Solorio
10:30 - 10:45 Coffee Break
10:45 - 12:15 Oral Session 3: LLM Session (Chair: Malihe Alikhani)
• Global-Pruner: A Stable and Efficient Pruner for Retraining-Free Pruning of
Encoder-Based Language Models. Guangzhen Yao, Sandong Zhu, Long Zhang,
MiaoQI
• Investigating large language models for their competence in extracting gram-
matically sound sentences from transcribed noisy utterances. Alina Wróblewska
• The Effect of Word Predictability on Reading Times in Information Seeking
and Repeated Reading. Keren Gruteke Klein, Yoav Meiri, Omer Shubi, Yevgeni
Berzak
• Multi-Cultural Norm Base: Frame-based Norm Discovery in Multi-Cultural
Settings. Viet Thanh Pham, Shilin Qu, Farhad Moghimifar, Suraj Sharma, Yuan-
Fang Li, Weiqing Wang, Reza Haf
12:15 - 13:45 Lunch
13:45 - 15:00 Poster Session 2
15:00 - 15:30 BabyLM Challenge (oral session)
15:30 - 16:00 Coffee Break
16:00 - 17:20 BabyLM Challenge (poster session)
17:20 - 17:30 Closing Remarks
Poster Sessions
• Text2Afford: Probing Object Affordance Prediction abilities of Language
Models solely from Text. Sayantan Adak, Daivik Agrawal, Animesh Mukher-
jee, Somak Aditya
• Transformer verbatim in-context retrieval across time and scale. Kristijan Armeni, Marko Pranjić, Senja Pollak
• Of Models and Men: Probing Neural Networks for Agreement Attraction with
Psycholinguistic Data. Maxim Bazhukov, Ekaterina Voloshina, Sergey Pletenev,
Arseny Anisimov, Oleg Serikov, Svetlana Toldova
• How Are Metaphors Processed by Language Models? The Case of Analogies.
Joanne Boisson
• AIStorySimilarity: Quantifying Story Similarity Using Narrative for Search,
IP Infringement, and Guided Creativity. Jon Chun
• Explaining the Hardest Errors of Contextual Embedding Based Classifiers.
Claudio Moisés Valiense de Andrade, Washington Cunha, Guilherme Fonseca,
Ana Clara Souza Pagano, Luana de Castro Santos, Adriana Silvina Pagano,
Leonardo Chaves Dutra da Rocha, Marcos André Gonçalves
• EditEval: An Instruction-Based Benchmark for Text Improvements. Jane
Dwivedi-Yu, Timo Schick, Zhengbao Jiang, Maria Lomeli, Patrick Lewis, Gau-
tier Izacard, Edouard Grave, Sebastian Riedel, Fabio Petroni
• Advancing Arabic Sentiment Analysis: ArSen Benchmark and the Improved
Fuzzy Deep Hybrid Network. Yang Fang, Cheng Xu, Shuhao Guan, Nan Yan,
Yuke Mei
• Critical Questions Generation: Motivation and Challenges. Blanca Calvo
Figueras, Rodrigo Agerri
• Generalizations across filler-gap dependencies in neural language models.
Katherine Howitt, Sathvik Nair, Allison Dods, Robert Melvin Hopkins
• Continuous Attentive Multimodal Prompt Tuning for Few-Shot Multimodal
Sarcasm Detection. Soumyadeep Jana, Animesh Dey, Ranbir Singh Sanasam
• Aligning Alignments: Do Colexification and Distributional Similarity Align as
Measures of cross-lingual Lexical Alignment?. Taelin Karidi, Eitan Grossman,
Omri Abend
• On Functional Competence of LLMs for Linguistic Disambiguation. Raihan
Kibria, Sheikh Intiser Uddin Dipta, Muhammad Abdullah Adnan
• TpT-ADE: Transformer Based Two-Phase ADE Extraction. Suryamukhi
Kuchibhotla, Manish Singh
• PRACT: Optimizing Principled Reasoning and Acting of LLM Agent. Zhiwei
Liu, Weiran Yao, Jianguo Zhang, Zuxin Liu, Liangwei Yang, Rithesh R N, Tian
Lan, Ming Zhu, Juntao Tan, Shirley Kokane, Thai Quoc Hoang, Juan Carlos
Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
• Mitigating Bias in Language Model Evaluators: A Causal ATE Approach.
Rahul Madhavan, Kahini Wadhawan
• Words That Stick: Using Keyword Cohesion to Improve Text Segmentation.
Amit Maraj, Miguel Vargas Martin, Masoud Makrehchi
• An Empirical Comparison of Vocabulary Expansion and Initialization Ap-
proaches For Language Models. Nandini Mundra, Aditya Nanda Kishore Khan-
davally, Raj Dabre, Ratish Puduppully, Anoop Kunchukuttan, Mitesh M Khapra
• Revisiting Hierarchical Text Classification: Inference and Metrics. Roman
Plaud, Matthieu Labeau, Antoine Saillenfest, Thomas Bonald
• Image-conditioned human language comprehension and psychometric bench-
marking of visual language models. Subha Nawer Pushpita, Roger P. Levy
• Large Language Model Recall Uncertainty is Modulated by the Fan Effect.
Jesse Roberts, Kyle Moore, Douglas Fisher, Oseremhen Ewaleifoh, Thao Pham
• Self-supervised speech representations display some human-like cross-
linguistic perceptual abilities. Joselyn Rodriguez, Kamala Sreepada, Ruolan
Leslie Famularo, Sharon Goldwater, Naomi Feldman
• One-Vs-Rest Neural Network English Grapheme Segmentation: A Linguistic
Perspective. Samuel Rose, Nina Dethlefs, C. Kambhampati
• CrowdCounter: A benchmark type-specific multi-target counterspeech dataset.
Punyajoy Saha, Abhilash Datta, Abhik Jana, Animesh Mukherjee
• Translating Across Cultures: LLMs for Intralingual Cultural Adaptation.
Pushpdeep Singh, Mayur Patidar, Lovekesh Vig
• Making Distilled Language Models Even Smaller: Lightweight Reconstruc-
tion of Rare Token Embeddings. Kohki Tamura, Naoki Yoshinaga, Masato Neishi
• A Novel Instruction Tuning Method for Vietnamese Math Reasoning using
Trainable Open-Source Large Language Models. Nguyen Quang Vinh, Thanh-
Do Nguyen, Vinh Van Nguyen, Nam Khac-Hoai Bui
• Information Association for Language Model Updating by Mitigating LM-
Logical Discrepancy. Pengfei Yu, Heng Ji
W10 - Ninth Conference on Machine Translation (WMT24)
Organizers:
Philipp Koehn, Barry Haddow, Christof Monz, Tom Kocmi
https://www2.statmt.org/wmt24/
Room: Ashe Auditorium
Friday, November 15, 2024 - Saturday, November 16, 2024
WMT24 focuses on advancing machine translation techniques, discussing progress, challenges, and future
directions. It features shared tasks on translation quality evaluation, multilingual translation, and domain-
specific translation.
Day 1 (Friday, Nov 15, 2024)
Time Session
8:45 - 9:00 Opening Remarks
Session 1: Shared Task Overview Papers I
9:00 - 9:30 Findings of the WMT24 General Machine Translation Shared Task: The LLM
Era Is Here but MT Is Not Solved Yet. Tom Kocmi, Eleftherios Avramidis,
Rachel Bawden, Ondřej Bojar, Anton Dvorkovich, Christian Federmann, Mark
Fishel, Markus Freitag, Thamme Gowda, Roman Grundkiewicz, Barry Had-
dow, Marzena Karpinska, Philipp Koehn, Benjamin Marie, Christof Monz, Ken-
ton Murray, Masaaki Nagata, Martin Popel, Maja Popović, Mariya Shmatova,
Steinþór Steingrímsson and Vilém Zouhar
9:30 - 9:45 Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared
Task. Markus Freitag, Nitika Mathur, Daniel Deutsch, Chi-kiu Lo, Elefthe-
rios Avramidis, Ricardo Rei, Brian Thompson, Frederic Blain, Tom Kocmi, Ji-
ayi Wang, David Ifeoluwa Adelani, Marianna Buchicchio, Chrysoula Zerva and
Alon Lavie
9:45 - 9:55 Findings of the Quality Estimation Shared Task at WMT 2024: Are LLMs Clos-
ing the Gap in QE?. Chrysoula Zerva, Frederic Blain, José G. C. de Souza,
Diptesh Kanojia, Sourabh Dattatray Deoghare, Nuno M. Guerreiro, Giuseppe
Attanasio, Ricardo Rei, Constantin Orasan, Matteo Negri, Marco Turchi, Rajen
Chatterjee, Pushpak Bhattacharyya, Markus Freitag and André Martins
9:55 - 10:10 Findings of the WMT 2024 Shared Task of the Open Language Data Initiative.
Jean Maillard, Laurie V. Burchell, Antonios Anastasopoulos, Christian Feder-
mann, Philipp Koehn and Skyler Wang
10:10 - 10:20 Results of the WAT/WMT 2024 Shared Task on Patent Translation. Shohei Hi-
gashiyama
10:20 - 10:30 Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets on
Abstract Level. Mariana Neves, Cristian Grozea, Philippe Thomas, Roland
Roller, Rachel Bawden, Aurélie Névéol, Steffen Castle, Vanessa Bonato, Gior-
gio Maria Di Nunzio, Federica Vezzani, Maika Vicente Navarro, Lana Yeganova
and Antonio Jimeno Yepes
10:30 - 11:00 Coffee Break
11:00 - 12:00 Session 2: Shared Task Posters I
General Translation Task
• MSLC24 Submissions to the General Machine Translation Task. Samuel
Larkin, Chi-kiu Lo and Rebecca Knowles
• IOL Research Machine Translation Systems for WMT24 General Machine
Translation Shared Task. Wenbo Zhang
• Choose the Final Translation from NMT and LLM Hypotheses Using MBR
Decoding: HW-TSC’s Submission to the WMT24 General MT Shared Task.
Zhanglin Wu, Daimeng Wei, Zongyao Li, Hengchao Shang, Jiaxin GUO, Shao-
jun Li, Zhiqiang Rao, Yuanchang Luo, Ning Xie and Hao Yang
• CycleGN: A Cycle Consistent Approach for Neural Machine Translation.
Sören Dreano, Derek Molloy and Noel Murphy
• UvA-MT’s Participation in the WMT24 General Translation Shared Task.
Shaomu Tan, David Stap, Seth Aycock, Christof Monz and Di Wu
• Tower v2: Unbabel-IST 2024 Submission for the General MT Shared Task.
Ricardo Rei, Jose Maria Pombal, Nuno M. Guerreiro, João Alves, Pedro Hen-
rique Martins, Patrick Fernandes, Helena Wu, Tania Vaz, Duarte Alves, Amin
Farajian, Sweta Agrawal, Antonio Farinhas, José G. C. de Souza and André
Martins
• TSU HITS’s Submissions to the WMT 2024 General Machine Translation
Shared Task. Vladimir Mynka and Nikolay Mikhaylovskiy
• Document-level Translation with LLM Reranking: Team-J at WMT 2024 Gen-
eral Translation Task. Keito Kudo, Hiroyuki Deguchi, Makoto Morishita, Ryo
Fujii, Takumi Ito, Shintaro Ozaki, Koki Natsumi, Kai Sato, Kazuki Yano, Ryosuke
Takahashi, Subaru Kimura, Tomomasa Hara, Yusuke Sakai and Jun Suzuki
• DLUT and GTCOM’s Neural Machine Translation Systems for WMT24. Hao
Zong, Chao Bei, Huan Liu, Conghu Yuan, Wentao Chen and Degen Huang
• CUNI at WMT24 General Translation Task: LLMs, (Q)LoRA, CPO and
Model Merging. Miroslav Hrabal, Josef Jon, Martin Popel, Nam Luu, Danil
Semin and Ondřej Bojar
• From General LLM to Translation: How We Dramatically Improve Transla-
tion Quality Using Human Evaluation Data for LLM Finetuning. Denis Elshin,
Nikolay Karpachev, Boris Gruzdev, Ilya Golovanov, Georgy Ivanov, Alexander
Antonov, Nickolay Skachkov, Ekaterina Latypova, Vladimir Layner, Ekaterina
Enikeeva, Dmitry Popov, Anton Chekashev, Vladislav Negodin, Vera Frantsu-
zova, Alexander Chernyshev and Kirill Denisov
• Cogs in a Machine, Doing What They’re Meant to Do – the AMI Submission to the WMT24 General Translation Task. Atli Jasonarson, Hinrik Hafsteinsson, Bjarki Ármannsson and Steinþór Steingrímsson
• IKUN for WMT24 General MT Task: LLMs Are Here for Multilingual
Machine Translation. Baohao Liao, Christian Herold, Shahram Khadivi and
Christof Monz
• NTTSU at WMT2024 General Translation Task. Minato Kondo, Ryo Fukuda,
Xiaotian Wang, Katsuki Chousa, Masato Nishimura, Kosei Buma, Takatomo
Kano and Takehito Utsuro
• SCIR-MT’s Submission for WMT24 General Machine Translation Task. Bao-
hang Li, Zekai Ye, Yichong Huang, Xiaocheng Feng and Bing Qin
• AIST AIRC Systems for the WMT 2024 Shared Tasks. Matiss Rikters and
Makoto Miwa
• Occiglot at WMT24: European Open-source Large Language Models Eval-
uated on Translation. Eleftherios Avramidis, Annika Grützner-Zahn, Manuel
Brack, Patrick Schramowski, Pedro Ortiz Suarez, Malte Ostendorff, Fabio Barth,
Shushen Manakhimova, Vivien Macketanz, Georg Rehm and Kristian Kersting
Test Suites
• CoST of breaking the LLMs. Ananya Mukherjee, Saumitra Yadav and Manish
Shrivastava
• WMT24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles.
Hillary Dawkins, Isar Nejadgholi and Chi-kiu Lo
• The GenderQueer Test Suite. Steinunn Rut Friðriksdóttir
• Domain Dynamics: Evaluating Large Language Models in English-Hindi
Translation. Soham Bhattacharjee, Baban Gain and Asif Ekbal
• Investigating the Linguistic Performance of Large Language Models in
Machine Translation. Shushen Manakhimova, Vivien Macketanz, Eleftherios
Avramidis, Ekaterina Lapshinova-Koltunski, Sergei Bagdasarov and Sebastian
Möller
• IsoChronoMeter: A Simple and Effective Isochronic Translation Evaluation
Metric. Nikolai Rozanov, Vikentiy Pankov, Dmitrii Mukhutdinov and Dima Vypi-
railenko
• A Test Suite of Prompt Injection Attacks for LLM-based Machine Translation.
Antonio Valerio Miceli Barone and Zhifan Sun
• Killing Two Flies with One Stone: An Attempt to Break LLMs Using English–Icelandic Idioms and Proper Names. Bjarki Ármannsson, Hinrik Hafsteinsson, Atli Jasonarson and Steinthor Steingrimsson
12:30 - 13:30 Session 3: Shared Task Posters II
Metrics Task
• MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human
Preference Calibration. David Anugraha, Garry Kuwanto, Lucky Susanto, Derry
Tanti Wijaya and Genta Indra Winata
• chrF-S: Semantics Is All You Need. Ananya Mukherjee and Manish Shrivas-
tava
• MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation
Quality. Rebecca Knowles, Samuel Larkin and Chi-kiu Lo
• MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task.
Juraj Juraska, Daniel Deutsch, Mara Finkelstein and Markus Freitag
• Evaluating WMT 2024 Metrics Shared Task Submissions on AfriMTE (the
African Challenge Set). Jiayi Wang, David Ifeoluwa Adelani and Pontus Stene-
torp
• Machine Translation Metrics Are Better in Evaluating Linguistic Errors on
LLMs than on Encoder-Decoder Systems. Eleftherios Avramidis, Shushen Man-
akhimova, Vivien Macketanz and Sebastian Möller
Quality Estimation Task
• TMU-HIT’s Submission for the WMT24 Quality Estimation Shared Task: Is
GPT-4 a Good Evaluator for Machine Translation?. Ayako Sato, Kyotaro Naka-
jima, Hwichan Kim, Zhousi Chen and Mamoru Komachi
• HW-TSC 2024 Submission for the Quality Estimation Shared Task. Weiqiao
Shan, Ming Zhu, Yuang Li, Mengyao Piao, Xiaofeng Zhao, Chang Su, Min
Zhang, Hao Yang and Yanfei Jiang
• HW-TSC’s Participation in the WMT 2024 QEAPE Task. Jiawei Yu, Xiaofeng
Zhao, Min Zhang, Zhao Yanqing, Yuang Li, Su Chang, Xiaosong Qiao, Ma Miao-
miao and Hao Yang
Open Language Data Initiative
• Expanding the FLORES+ Multilingual Benchmark with Translations for
Aragonese, Aranese, Asturian, and Valencian. Juan Antonio Perez-Ortiz, Fe-
lipe Sánchez-Martínez, Víctor M. Sánchez-Cartagena, Miquel Esplà-Gomis,
Aaron Galiano Jimenez, Antoni Oliver, Claudi Aventín-Boya, Alejandro Pardos,
Cristina Valdés, Jusèp Loís Sans Socasau and Juan Pablo Martínez
• The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language
Data Initiative Shared Task. Firoz Ahmed, Nitin Venkateswaran and Sarah
Moeller
• A High-quality Seed Dataset for Italian Machine Translation. Edoardo Fer-
rante
• Correcting FLORES Evaluation Dataset for Four African Languages. Idris
Abdulmumin, Sthembiso Mkhwanazi, Mahlatse S. Mbooi, Shamsuddeen Hassan
Muhammad, Ibrahim Said Ahmad, Neo N. Putini, Miehleketo Mathebula, Ma-
timba Shingange, Tajuddeen Gwadabe and Vukosi Marivate
• Expanding FLORES+ Benchmark for More Low-Resource Settings:
Portuguese-Emakhuwa Machine Translation Evaluation. Felermino Dario
Mario Ali, Henrique Lopes Cardoso and Rui Sousa-Silva
• Enhancing Tuvan Language Resources through the FLORES Dataset. Ali
Kuzhuget, Airana Mongush and Nachyn-Enkhedorzhu Oorzhak
• Machine Translation Evaluation Benchmark for Wu Chinese: Workflow and
Analysis. Hongjian Yu, Yiming Shi, Zherui Zhou and Christopher Haberland
• Open Language Data Initiative: Advancing Low-Resource Machine Transla-
tion for Karakalpak. Mukhammadsaid Mamasaidov and Abror Shopulatov
• FLORES+ Translation and Machine Translation Evaluation for the Erzya Lan-
guage. Isai Gordeev, Sergey Kuldin and David Dale
• Spanish Corpus and Provenance with Computer-Aided Translation for the
WMT24 OLDI Shared Task. Jose Cols
Patent Translation Task
• Efficient Terminology Integration for LLM-based Translation in Specialized
Domains. Sejoon Kim, Mingi Sung, Jeonghwan Lee, Hyunkuk Lim and Jorge
Froilan Gimenez Perez
• Rakuten’s Participation in WMT 2024 Patent Translation Task. Ohnmar Htun
and Alberto Poncelas
Biomedical Translation Task
• The SETU-ADAPT Submission for WMT 24 Biomedical Shared Task. Anto-
nio Castaldo, Maria Zafar, Prashanth Nayak, Rejwanul Haque, Andy Way and
Johanna Monti
14:00 - 15:00 Session 4: Invited Talk by Ricardo Rei and Nuno M. Guerreiro: ’What
Makes MT Research Special in the LLM Age?’
15:00 - 15:30 Coffee Break
Session 5: Featured Research Papers Oral Presentations
15:00 - 15:15 Translating Step-by-Step: Decomposing the Translation Process for Improved
Translation Quality of Long-Form Texts. Eleftheria Briakou, Jiaming Luo, Colin
Cherry and Markus Freitag
15:15 - 15:30 Is Preference Alignment Always the Best Option to Enhance LLM-Based Trans-
lation? An Empirical Analysis. Hippolyte Gisserot-Boukhlef, Ricardo Rei, Em-
manuel Malherbe, Céline Hudelot, Pierre Colombo and Nuno M. Guerreiro
15:30 - 15:45 On Instruction-Finetuning Neural Machine Translation Models. Vikas Raunak,
Roman Grundkiewicz and Marcin Junczys-Dowmunt
15:45 - 16:00 Quality or Quantity? On Data Scale and Diversity in Adapting Large Lan-
guage Models for Low-Resource Translation. Vivek Iyer, Bhavitvya Malik, Pavel
Stepachev, Pinzhen Chen, Barry Haddow and Alexandra Birch
16:00 - 16:15 Post-edits Are Preferences Too. Nathaniel Berger, Stefan Riezler, Miriam Exel
and Matthias Huck
16:15 - 16:30 Benchmarking Visually-Situated Translation of Text in Natural Images. Eliza-
beth Salesky, Philipp Koehn and Matt Post
Day 2 (Saturday, Nov 16, 2024)
Time Session
Session 6: Shared Task Overview Papers II
9:00 - 9:15 Findings of WMT 2024 Shared Task on Low-Resource Indic Languages Trans-
lation. Partha Pakray, Santanu Pal, Advaitha Vetagiri, Reddi Mohana Krishna,
Arnab Kumar Maji, Sandeep Kumar Dash, Lenin Laitonjam, Lyngdoh Sarah and
Riyanka Manna
9:15 - 9:30 Findings of WMT 2024’s MultiIndic22MT Shared Task for Machine Translation
of 22 Indian Languages. Raj Dabre and Anoop Kunchukuttan
9:30 - 9:45 Findings of WMT2024 English-to-Low Resource Multimodal Translation Task.
Shantipriya Parida, Ondřej Bojar, Idris Abdulmumin, Shamsuddeen Hassan
Muhammad and Ibrahim Said Ahmad
9:45 - 10:00 Findings of the WMT 2024 Shared Task Translation into Low-Resource Lan-
guages of Spain: Blending Rule-Based and Neural Systems. Felipe Sánchez-
Martínez, Juan Antonio Perez-Ortiz, Aaron Galiano Jimenez and Antoni Oliver
10:00 - 10:10 Findings of the WMT 2024 Shared Task on Discourse-Level Literary Transla-
tion. Longyue Wang, Siyou Liu, Chenyang Lyu, Wenxiang Jiao, Xing Wang, Jia-
hao Xu, Zhaopeng Tu, Yan Gu, Weiyu Chen, Minghao Wu, Liting Zhou, Philipp
Koehn, Andy Way and Yulin Yuan
10:10 - 10:20 Findings of the WMT 2024 Shared Task on Chat Translation. Wafaa Mohammed,
Sweta Agrawal, Amin Farajian, Vera Cabarrão, Bryan Eikema, Ana C Farinha
and José G. C. de Souza
10:20 - 10:30 Findings of the WMT 2024 Shared Task on Non-Repetitive Translation. Kazu-
taka Kinugawa, Hideya Mino, Isao Goto and Naoto Shirai
10:30 - 11:00 Coffee Break
11:00 - 12:00 Session 7: Shared Task Posters III
Low-Resource Indic Language Translation Task
• A3-108 Controlling Token Generation in Low Resource Machine Translation
Systems. Saumitra Yadav, Ananya Mukherjee and Manish Shrivastava
• Samsung R&D Institute Philippines @ WMT 2024 Indic MT Task. Matthew
Theodore Roque, Carlos Rafael Catalan, Dan John A. Velasco, Manuel Antonio
Rufino and Jan Christian Blaise Cruz
• DLUT-NLP Machine Translation Systems for WMT24 Low-Resource Indic
Language Translation. Chenfei Ju, Junpeng Liu, Kaiyu Huang and Degen Huang
• SRIB-NMT’s Submission to the Indic MT Shared Task in WMT 2024.
Pranamya Ajay Patil, Raghavendra HR, Aditya Raghuwanshi and Kushal Verma
• MTNLP-IIITH: Machine Translation for Low-Resource Indic Languages. Ab-
hinav P M, Ketaki Shetye and Parameswari Krishnamurthy
• Exploration of the CycleGN Framework for Low-Resource Languages. Sören Dreano, Derek Molloy and Noel Murphy
• The SETU-ADAPT Submissions to the WMT24 Low-Resource Indic Lan-
guage Translation Task. Neha Gajakos, Prashanth Nayak, Rejwanul Haque and
Andy Way
• SPRING Lab IITM’s Submission to Low Resource Indic Language Translation
Shared Task. Advait Joglekar, Hamees Ul Hasan Sayed and Srinivasan Umesh
• Machine Translation Advancements of Low-Resource Indian Languages by
Transfer Learning. Bin Wei, Zheng Jiawei, Zongyao Li, Zhanglin Wu, Jiaxin
GUO, Daimeng Wei, Zhiqiang Rao, Shaojun Li, Yuanchang Luo, Hengchao
Shang, Jinlong Yang, Yuhao Xie and Hao Yang
• NLIP_Lab-IITH Low-Resource MT System for WMT24 Indic MT Shared
Task. Pramit Sahoo, Maharaj Brahma and Maunendra Sankar Desarkar
• Yes-MT’s Submission to the Low-Resource Indic Language Translation Shared
Task in WMT 2024. Yash Bhaskar and Parameswari Krishnamurthy
MultiIndic22MT Task
• System Description of BV-SLP for Sindhi-English Machine Translation in
MultiIndic22MT 2024 Shared Task. Nisheeth Joshi, Pragya Katyayan, Palak
Arora and Bharti Nathani
• WMT24 System Description for the MultiIndic22MT Shared Task on Ma-
nipuri Language. Ningthoujam Justwant Singh, Kshetrimayum Boynao Singh,
Ningthoujam Avichandra Singh, Sanjita Phijam and Thoudam Doren Singh
• NLIP_Lab-IITH Multilingual MT System for WAT24 MT Shared Task. Ma-
haraj Brahma, Pramit Sahoo and Maunendra Sankar Desarkar
English-to-Lowres Multi-Modal Translation Task
• DCU ADAPT at WMT24: English to Low-resource Multi-Modal Translation
Task. Sami Ul Haq, Rudali Huidrom and Sheila Castilho
• English-to-Low-Resource Translation: A Multimodal Approach for Hindi,
Malayalam, Bengali, and Hausa. Ali Hatami, Shubhanker Banerjee, Mihael Ar-
can, Bharathi Raja Chakravarthi, Paul Buitelaar and John Philip McCrae
• OdiaGenAI’s Participation in WMT2024 English-to-Low Resource Multimodal
Translation Task. Shantipriya Parida, Shashikanta Sahoo, Sambit Sekhar, Upen-
dra Kumar Jena, Sushovan Jena and Kusum Lata
• Arewa NLP’s Participation at WMT24. Mahmoud Said Ahmad, Auwal
Abubakar Khalid, Lukman Jibril Aliyu, Babangida Sani and Mariya Sunusi Ab-
dullahi
• Multimodal Machine Translation for Low-Resource Indic Languages: A
Chain-of-Thought Approach Using Large Language Models. Pawan Kumar Ra-
jpoot, Nagaraj N. Bhat and Ashish Shrivastava
• Chitranuvad: Adapting Multi-lingual LLMs for Multimodal Translation. Sha-
harukh Khan, Ayush Tarun, Ali Faraz, Palash Kamble, Vivek Dahiya, Praveen
Pokala, Ashish Anand Kulkarni, Chandra Khatri, Abhinav Ravi and Shubham
Agarwal
• Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conver-
sations for Cross-Lingual Image Captioning. Siddharth Betala and Ishan Chok-
shi
12:30 - 13:30 Session 8: Shared Task Posters IV
Translation into Low-Resource Languages of Spain Task
• TIM-UNIGE Translation into Low-Resource Languages of Spain for WMT24.
Jonathan Mutal and Lucía Ormaechea
• TAN-IBE Participation in the Shared Task: Translation into Low-Resource
Languages of Spain. Antoni Oliver
• Enhanced Apertium System: Translation into Low-Resource Languages of
Spain Spanish - Asturian. Sofía García
• Universitat d’Alacant’s Submission to the WMT 2024 Shared Task on Transla-
tion into Low-Resource Languages of Spain. Aaron Galiano Jimenez, Víctor M.
Sánchez-Cartagena, Juan Antonio Perez-Ortiz and Felipe Sánchez-Martínez
• Samsung R&D Institute Philippines @ WMT 2024 Low-resource Languages
of Spain Shared Task. Dan John A. Velasco, Manuel Antonio Rufino and Jan
Christian Blaise Cruz
• Back to the Stats: Rescuing Low Resource Neural Machine Translation with
Statistical Methods. Menan Velayuthan, Dilith Randinu Jayakody, Nisansa de
Silva, Aloka Fernando and Surangika Dayani Ranathunga
• Hybrid Distillation from RBMT and NMT: Helsinki-NLP’s Submission to the
Shared Task on Translation into Low-Resource Languages of Spain. Ona de
Gibert, Mikko Aulamo, Yves Scherrer and Jörg Tiedemann
• Robustness of Fine-Tuned LLMs for Machine Translation with Varying Noise
Levels: Insights for Asturian, Aragonese and Aranese. Martin Bär, Elisa For-
cada Rodríguez and Maria Garcia-Abadillo Velasco
• Training and Fine-Tuning NMT Models for Low-Resource Languages Using
Apertium-Based Synthetic Corpora. Aleix Sant, Daniel Bardanca, José Ramom
Pichel Campos, Francesca De Luca Fornaciari, Carlos Escolano, Javier Garcia
Gilabert, Pablo Gamallo, Audrey Mash, Xixian Liao and Maite Melero
• Vicomtech@WMT 2024: Shared Task on Translation into Low-Resource Lan-
guages of Spain. David Ponce, Harritxu Gete and Thierry Etchegoyhen
• SJTU System Description for the WMT24 Low-Resource Languages of Spain
Task. Tianxiang Hu, Haoxiang Sun, Ruize Gao, Jialong Tang, Pei Zhang,
Baosong Yang and Rui Wang
• Multilingual Transfer and Domain Adaptation for Low-Resource Languages of
Spain. Yuanchang Luo, Zhanglin Wu, Daimeng Wei, Hengchao Shang, Zongyao
Li, Jiaxin GUO, Zhiqiang Rao, Shaojun Li, Jinlong Yang, Yuhao Xie, Zheng
Jiawei, Bin Wei and Hao Yang
• TRIBBLE - TRanslating IBerian languages Based on Limited E-resources.
Igor Kuzmin, Piotr Przybyła, Euan McGill and Horacio Saggion
Discourse-Level Literary Translation Task
• CloudSheep System for WMT24 Discourse-Level Literary Translation. Lisa
Liu, Ryan Liu, Angela Tsai and Jingbo Shang
• Final Submission of SJTULoveFiction to Literary Task. Haoxiang Sun, Tianx-
iang Hu, Ruize Gao, Jialong Tang, Pei Zhang, Baosong Yang and Rui Wang
• Context-aware and Style-related Incremental Decoding Framework for
Discourse-Level Literary Translation. Yuanchang Luo, Jiaxin GUO, Daimeng
Wei, Hengchao Shang, Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Shaojun Li,
Jinlong Yang and Hao Yang
• NovelTrans: System for WMT24 Discourse-Level Literary Translation.
Yuchen Liu, Yutong Yao, Runzhe Zhan, Yuchu Lin and Derek F. Wong
• LinChance×NTU for Unconstrained WMT2024 Literary Translation. Kechen
Li, Yaotian Tao, Hongyi Huang and Tianbo Ji
Chat Translation Task
• Improving Context Usage for Translating Bilingual Customer Support Chat
with Large Language Models. Jose Maria Pombal, Sweta Agrawal and André
Martins
• Optimising LLM-Driven Machine Translation with Context-Aware Sliding
Windows. Xinye Yang, Yida Mu, Kalina Bontcheva and Xingyi Song
• Context-Aware LLM Translation System Using Conversation Summarization
and Dialogue History. Mingi Sung, Seungmin Lee, Jiwon Kim and Sejoon Kim
• Enhancing Translation Quality: A Comparative Study of Fine-Tuning and
Prompt Engineering in Dialog-Oriented Machine Translation Systems. Insights
from the MULTITAN-GML Team. Lichao Zhu, Maria Zimina, Behnoosh Nam-
darzadeh, Nicolas Ballier and Jean-Baptiste Yunès
• The SETU-ADAPT Submissions to WMT 2024 Chat Translation Tasks. Maria
Zafar, Antonio Castaldo, Prashanth Nayak, Rejwanul Haque and Andy Way
• Exploring the Traditional NMT Model and Large Language Model for Chat
Translation. Jinlong Yang, Hengchao Shang, Daimeng Wei, Jiaxin GUO,
Zongyao Li, Zhanglin Wu, Zhiqiang Rao, Shaojun Li, Yuhao Xie, Yuanchang
Luo, Zheng Jiawei, Bin Wei and Hao Yang
Non-Repetitive Translation Task
• Reducing Redundancy in Japanese-to-English Translation: A Multi-Pipeline
Approach for Translating Repeated Elements in Japanese. Qiao Wang, Yixuan
Huang and Zheng Yuan
• SYSTRAN @ WMT24 Non-Repetitive Translation Task. Marko Avila and
Josep Crego
14:00 - 15:30 Session 9: Research Paper Boaster Session
15:30 - 16:30 Session 10: Research Paper Poster Session I
• Mitigating Metric Bias in Minimum Bayes Risk Decoding. Geza Kovacs,
Daniel Deutsch and Markus Freitag
• Beyond Human-Only: Evaluating Human-Machine Collaboration for Collect-
ing High-Quality Translation Data. Zhongtao Liu, Parker Riley, Daniel Deutsch,
Alison Lui, Mengmeng Niu, Apurva Shah and Markus Freitag
• How Effective Are State Space Models for Machine Translation? Hugo
Pitorro, Pavlo Vasylenko, Marcos Treviso and André Martins
• Evaluation and Large-scale Training for Contextual Machine Translation. Matt
Post and Marcin Junczys-Dowmunt
• A Multi-task Learning Framework for Evaluating Machine Translation of
Emotion-loaded User-generated Content. Shenbin Qian, Constantin Orasan,
Diptesh Kanojia and Félix do Carmo
• On Instruction-Finetuning Neural Machine Translation Models. Vikas Raunak,
Roman Grundkiewicz and Marcin Junczys-Dowmunt
• Benchmarking Visually-Situated Translation of Text in Natural Images. Eliza-
beth Salesky, Philipp Koehn and Matt Post
• Analysing Translation Artifacts: A Comparative Study of LLMs, NMTs, and
Human Translations. Fedor Sizov, Cristina España-Bonet, Josef van Genabith,
Roy Xie and Koel Dutta Chowdhury
• How Grammatical Features Impact Machine Translation: A New Test Suite
for Chinese-English MT Evaluation. Huacheng Song, Yi Li, Yiwen Wu, Yu Liu,
Jingxia Lin and Hongzhi Xu
• Improving Statistical Significance in Human Evaluation of Automatic Metrics
via Soft Pairwise Accuracy. Brian Thompson, Nitika Mathur, Daniel Deutsch
and Huda Khayrallah
• Speech Is More than Words: Do Speech-to-Text Translation Systems Leverage
Prosody? Ioannis Tsiamas, Matthias Sperber, Andrew Finch and Sarthak Garg
• Cultural Adaptation of Menus: A Fine-Grained Approach. Zhonghe Zhang,
Xiaoyu He, Vivek Iyer and Alexandra Birch
• Pitfalls and Outlooks in Using COMET. Vilém Zouhar, Pinzhen Chen, Tsz Kin
Lam, Nikita Moghe and Barry Haddow
16:30 - 17:00 Coffee Break
17:00 - 18:00 Session 11: Research Paper Poster Session II
• Post-edits Are Preferences Too. Nathaniel Berger, Stefan Riezler, Miriam Exel
and Matthias Huck
• Translating Step-by-Step: Decomposing the Translation Process for Improved
Translation Quality of Long-Form Texts. Eleftheria Briakou, Jiaming Luo, Colin
Cherry and Markus Freitag
• Scaling Laws of Decoder-Only Models on the Multilingual Machine Trans-
lation Task. Gaëtan Caillaut, Mariam Nakhlé, Raheel Qader, Jingshu Liu and
Jean-Gabriel Barthélemy
• Shortcomings of LLMs for Low-Resource Translation: Retrieval and Under-
standing Are Both the Problem. Sara Kay Court and Micha Elsner
• Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-
Quality Parallel Data Outperforms Traditional Web-Crawled Data. Mara Finkel-
stein, David Vilar and Markus Freitag
• Is Preference Alignment Always the Best Option to Enhance LLM-Based
Translation? An Empirical Analysis. Hippolyte Gisserot-Boukhlef, Ricardo Rei,
Emmanuel Malherbe, Céline Hudelot, Pierre Colombo and Nuno M. Guerreiro
• Quality or Quantity? On Data Scale and Diversity in Adapting Large Lan-
guage Models for Low-Resource Translation. Vivek Iyer, Bhavitvya Malik, Pavel
Stepachev, Pinzhen Chen, Barry Haddow and Alexandra Birch
• Efficient Technical Term Translation: A Knowledge Distillation Approach for
Parenthetical Terminology Translation. Myung Jiyoon, Jihyeon Park, Jungki
Son, Kyungro Lee and Joohyung Han
• Assessing the Role of Imagery in Multimodal Machine Translation. Nicholas
Kashani Motlagh, Jim Davis, Jeremy Gwinnup, Grant Erdmann and Tim Ander-
son
• Error Span Annotation: A Balanced Approach for Human Evaluation of Ma-
chine Translation. Tom Kocmi, Vilém Zouhar, Eleftherios Avramidis, Roman
Grundkiewicz, Marzena Karpinska, Maja Popović, Mrinmaya Sachan and Mariya
Shmatova
• Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for
South and East Asian Languages. Philipp Koehn
• Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking
across Diverse Vocabularies. Sai Koneru, Matthias Huck, Miriam Exel and Jan
Niehues
W11 - Workshop on Customizable NLP: Progress and Challenges
in Customizing NLP for a Domain, Application, Group, or
Individual (CustomNLP4U)
Organizers:
Chan Young Park, Vidhisha Balachandran, Weijia Shi, Shirley Anugrah Hayati,
Sachin Kumar
https://customnlp4u-24.github.io/
Room: Merrick 1
Saturday, November 16, 2024
CustomNLP4U explores the customization of NLP systems for specific domains or user groups, focusing on the challenges of tailoring NLP for personalized applications. For NLP models to be usable in practice, particularly in emerging scenarios with widely varying use cases, situations, and user expectations, there is a need to develop models that can be tailored to different consumers (individuals, groups, or organizations) and easily controlled by them; models that can reason about their users’ (often private) knowledge and context to provide personalized responses. The topics of this workshop include (but are not limited to):
• Data: Collection, processing, analysis, and annotation efforts to increase representation and aid customization; discussion and analysis of data sources not publicly available, and associated issues of privacy and copyright.
• Modeling: New pretraining, fine-tuning, and inference methods for customizing NLP models; customizing reward models and model alignment to diverse consumers; new modeling paradigms aimed at customization such as model ensembles, model averaging, federated learning, nonparametric models, etc.; customizing models at inference time via prompting, in-context learning, chain-of-thought prompting, etc.
• Evaluation: Evaluation of existing generalist, non-customized models, identifying their shortcomings for varied use cases; evaluation of customization techniques and customized models; interpretability and analysis of customization patterns across different kinds of consumers.
• Open Science: Best practices for open and reproducible science concerning customizable NLP: dataset release and licensing, open-sourcing models, and related privacy, copyright, and policy issues.
• Applications: E.g., information seeking on sensitive data comprising legal, medical, or financial information; NLP models for communities reflecting sociolects, dialects, or other language varieties; personalized AI assistants, etc.
• Ethical Issues: Privacy and copyright; personalization, intrusiveness, unintended biases; invisibility versus hypervisibility.
Time Session
9:00 - 9:15 Opening Remarks
9:15 - 10:00 Invited Talk 1 - Diyi Yang (Stanford)
10:00 - 10:45 Invited Talk 2 - Hannah Kirk (Oxford)
10:45 - 11:00 Coffee Break
11:00 - 12:00 Poster Session
13:00 - 13:15 Lightning Slides
13:15 - 14:00 Invited Talk 3 - Jared Roesch
14:00 - 14:45 Invited Talk 4
14:45 - 15:30 Invited Talk 5 - Maartje Ter Hoeve (Apple)
15:30 - 16:00 Coffee Break
16:00 - 16:30 Outstanding Papers Oral Presentations (10 min each)
1. Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge
2. Trustful LLMs: Customizing and Grounding Text Generation with knowledge
bases and Dual Decoders
3. Customizing LLM Generation in Safety Scenarios with Active Learning for
Enhanced Representativeness and Robustness
16:30 - 17:00 Best Paper Award + Closing Remarks
W12 - The 4th International Workshop on Natural Language
Processing for Digital Humanities (NLP4DH)
Organizers:
Mika Hämäläinen, Emily Öhman, Khalid Alnajjar, So Miyagawa, Yuri Bizzoni
https://www.nlp4dh.com/nlp4dh-2024
Room: Merrick 2
Saturday, November 16, 2024
The 4th International Conference on Natural Language Processing for Digital Humanities (NLP4DH 2024)
will be organized together with EMNLP 2024. The proceedings of the conference will be published in the ACL Anthology. The conference will take place in Miami, USA on November 16, 2024. The focus of the conference is on applying natural language processing techniques to digital humanities research. The topics can be anything of interest to digital humanities with a natural language processing or generation aspect. A list of suitable topics includes but is not limited to: Text analysis and processing related to humanities using computational methods; Thorough error analysis of an NLP system using (digital) humanities methods; Dataset creation and curation for NLP (e.g. digitization, digitalization, datafication, and data preservation); Research on cultural heritage collections such as national archives and libraries using NLP; NLP for error detection, correction, normalization and denoising data; Generation and analysis of literary works such as poetry and novels; Analysis and detection of text genres.
Time Session
9:00 - 9:10 Opening words
Oral session 1
9:10 - 9:30 Lightning talks
9:30 - 9:50 Text Length and the Function of Intentionality: A Case Study of Contrastive Subreddits. Emily Sofi Öhman and Aatu Liimatta
9:50 - 10:10 Tracing the Genealogies of Ideas with Sentence Embeddings. Lucian Li
10:10 - 10:30 Evaluating Computational Representations of Character: An Austen Character Similarity Benchmark. Funing Yang and Carolyn Jane Anderson
10:30 - 11:00 Coffee break
Oral session 2
11:00 - 11:20 Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis. Ray Umphrey, Jesse Roberts, Lindsey Roberts
11:20 - 11:40 Extracting Relations from Ecclesiastical Cultural Heritage Texts. Giulia Cruciani
11:40 - 12:00 Constructing a Sentiment-Annotated Corpus of Austrian Historical Newspapers: Challenges, Tools, and Annotator Experience. Lucija Krusic
12:00 - 12:20 It is a Truth Individually Acknowledged: Cross-references On Demand. Piper Vasicek, Courtni Byun, Kevin Seppi
12:20 - 12:40 Extracting Position Titles from Unstructured Historical Job Advertisements. Klara Venglarova, Raven Adam, Georg Vogeler
12:40 - 13:10 Lunch
Oral session 3
13:10 - 13:30 Language Resources From Prominent Born-Digital Humanities Texts are Still Needed in the Age of LLMs. Natalie Hervieux, Peiran Yao, Susan Brown, Denilson Barbosa
13:30 - 13:50 NLP for Digital Humanities: Processing Chronological Text Corpora. Adam Pawłowski, Tomasz Walkowiak
13:50 - 14:10 A Multi-task Framework with Enhanced Hierarchical Attention for Sentiment Analysis on Classical Chinese Poetry: Utilizing Information from Short Lines. Quanqi Du and Veronique Hoste
14:10 - 14:30 Exploring Similarity Measures and Intertextuality in Vedic Sanskrit Literature. So Miyagawa, Yuki Kyogoku, Yuzuki Tsukagoshi, Kyoko Amano
14:30 - 14:50 Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction. Laura Manrique-Gomez, Tony Montes, Arturo Rodriguez Herrera, Ruben Manrique
14:50 - 15:10 Canonical Status and Literary Influence: A Comparative Study of Danish Novels from the Modern Breakthrough (1870 - 1900). Pascale Feldkamp, Alie Lassche, Jan Kostkan, Márton Kardos, Kenneth Enevoldsen, Katrine Baunvig, Kristoffer Nielbo
15:10 - 15:30 Deciphering Psycho-Social Effects of Eating Disorder: Analysis of Reddit Posts using Large Language Models and Topic Modeling. Medini Chopra, Anindita Chatterjee, Lipika Dey, Partha Pratim Das
15:30 - 16:30 Posters and coffee
• Topic-Aware Causal Intervention for Counterfactual Detection. Thong Thanh Nguyen, Truc-My Nguyen
• UD for German Poetry. Stefanie Dipper, Ronja Laarmann-Quante
• Molyé: A Corpus-Based Approach to Language Contact in Colonial France. Rasul Dent, Juliette Janès, Thibault Clérice, Pedro Ortiz Suarez, Benoît Sagot
• Improving Latin Dependency Parsing by Combining Treebanks and Predictions. Hanna-Mari Kristiina Kupari, Erik Henriksson, Veronika Laippala, Jenna Kanerva
• From N-Grams to Pre-Trained Multilingual Models for Language Identification. Thapelo Andrew Sindane, Vukosi Marivate
• Visualising Changes in Semantic Neighbourhoods of English Noun Compounds over Time. Malak Rassem, Myrto Tsigkouli, Chris W Jenkins, Filip Miletić, Sabine Schulte im Walde
• SEFLAG: Systematic Evaluation Framework for NLP Models and Datasets in Latin and Ancient Greek. Konstantin Schulz, Florian Deichsler
• A Two-Model Approach for Humour Style Recognition. Mary Ogbuka Kenneth, Foaad Khosmood, Abbas Edalat
• N-Gram-Based Preprocessing for Sandhi Reversion in Vedic Sanskrit. Yuzuki Tsukagoshi, Ikki Ohmukai
• Evaluating Open-Source LLMs in Low-Resource Languages: Insights from Latvian High School Exams. Roberts Darģis, Guntis Bārzdiņš, Inguna Skadiņa, Baiba Saulīte
• Computational Methods for the Analysis of Complementizer Variability in Language and Literature: The Case of Hebrew ’she-’ and ’ki’. Avi Shmidman, Aynat Rubinstein
• From Discrete to Continuous Classes: A Situational Analysis of Multilingual Web Registers with LLM Annotations. Erik Henriksson, Amanda Myntti, Saara Hellström, Selcen Erten-Johansson, Anni Eskelinen, Liina Repo, Veronika Laippala
• Testing and Adapting the Representational Abilities of Large Language Models on Folktales in Low-Resource Languages. J. A. Meaney, Beatrice Alex, William Lamb
• Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus. Craig Messner, Thomas Lippincott
• Evaluating Language Models in Location Referring Expression Extraction from Early Modern and Contemporary Japanese Texts. Ayuki Katayama, Yusuke Sakai, Shohei Higashiyama, Hiroki Ouchi, Ayano Takeuchi, Ryo Bando, Yuta Hashimoto, Toshinobu Ogiso, Taro Watanabe
• Evaluating LLM Performance in Character Analysis: A Study of Artificial Beings in Recent Korean Science Fiction. Woori Jang, Seohyon Jung
• Sui Generis: Large Language Models for Authorship Attribution and Verification in Latin. Svetlana Gorovaia, Gleb Schmidt, Ivan P. Yamshchikov
16:30 - 17:30 Virtual posters (online on Gather)
• Classification of Buddhist Verses: The Efficacy and Limitations of Transformer-Based Models. Nikita Neveditsin, Ambuja Salgaonkar, Pawan Lingras, Vijay Mago
• Enhancing Swedish Parliamentary Data: Annotation, Accessibility, and Application in Digital Humanities. Shafqat Mumtaz Virk, Claes Ohlsson, Nina Tahmasebi, Henrik Björck, Leif Runefelt
• Adapting Measures of Literality for Use with Historical Language Data. Adam Roussel
• Vector Poetics: Parallel Couplet Detection in Classical Chinese Poetry. Maciej Kurzynski, Xiaotong Xu, Yu Feng
• Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora. Amanda Myntti, Liina Repo, Elian Freyermuth, Antti Kanner, Veronika Laippala, Erik Henriksson
• Text vs. Transcription: A Study of Differences Between the Writing and Speeches of U.S. Presidents. Mina Rajaei Moghadam, Mosab Rezaei, Gülşat Aygen, Reva Freedman
• Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language. Xinmeng Hou
• Enhancing Neural Machine Translation for Ainu-Japanese: A Comprehensive Study on the Impact of Domain and Dialect Integration. Ryo Igarashi, So Miyagawa
• Exploring Large Language Models for Qualitative Data Analysis. Tim Fischer, Chris Biemann
• Cross-Dialectal Transfer and Zero-Shot Learning for Armenian Varieties: A Comparative Analysis of RNNs, Transformers and LLMs. Chahan Vidal-Gorène, Nadi Tomeh, Victoria Khurshudyan
• Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference for Cost-Effective Cultural Heritage Dataset Generation. William Thorne, Ambrose Robinson, Bohua Peng, Chenghua Lin, Diana Maynard
• Assessing Large Language Models in Translating Coptic and Ancient Greek Ostraca. Audric-Charles Wannaz, So Miyagawa
• The Social Lives of Literary Characters: Combining Citizen Science and Language Models to Understand Narrative Social Networks. Andrew Piper, Michael Xu, Derek Ruths
• Multi-Word Expressions in Biomedical Abstracts and Their Plain English Adaptations. Sergei Bagdasarov, Elke Teich
• Assessing the Performance of ChatGPT-4, Fine-Tuned BERT and Traditional ML Models on Moroccan Arabic Sentiment Analysis. Mohamed Hannani, Abdelhadi Soudi, Kristof Van Laerhoven
• Analyzing Pokémon and Mario Streamers’ Twitch Chat with LLM-Based User Embeddings. Mika Hämäläinen, Jack Rueter, Khalid Alnajjar
• Corpus Development Based on Conflict Structures in the Security Field and LLM Bias Verification. Keito Inoshita
• Generating Interpretations of Policy Announcements. Andreas Marfurt, Ashley Thornton, David Sylvan, James Henderson
• Order Up! Micromanaging Inconsistencies in ChatGPT-4o Text Analyses. Erkki Mervaala, Ilona Kousa
• CIPHE: A Framework for Document Cluster Interpretation and Precision from Human Exploration. Anton Eklund, Mona Forsman, Frank Drewes
• Empowering Teachers with Usability-Oriented LLM-Based Tools for Digital Pedagogy. Melany Vanessa Macias, Lev Kharlashkin, Leo Einari Huovinen, Mika Hämäläinen
W13 - GenBench: The second workshop on generalisation
(benchmarking) in NLP
Organizers:
Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Amirhossein Kazemnejad,
Khuyagbaatar Batsuren, Ryan Cotterell, Christos Christodoulopoulos
https://genbench.org/workshop/
Room: Brickell
Saturday, November 16, 2024
The ability to generalise well is often mentioned as one of the primary desiderata for models of natural
language processing (NLP). Yet, there are still many open questions related to what it means for an NLP
model to generalise well, and how generalisation should be evaluated. LLMs, trained on gigantic training
corpora that are - at best - hard to analyse or not publicly available at all, bring a new set of challenges to
the topic. The second GenBench workshop aims to serve as a cornerstone to catalyse research on gener-
alisation in the NLP community. The workshop has two concrete goals: Bring together different expert
communities to discuss challenging questions relating to generalisation in NLP; Establish a shared plat-
form for state-of-the-art generalisation testing in NLP. We started this last year, and this year’s collaborative benchmarking task (CBT) is solely LLM-focused!
Time Session
09:00 - 09:15 Opening Remarks
09:15 - 10:00 Keynote 1, by Pascale Fung
10:00 - 10:30 Oral Presentations
• Is artificial intelligence still intelligence? LLMs generalize to novel adjective-noun pairs, but don’t mimic the full human distribution. Hayley Ross, Kathryn Davidson, Najoung Kim
• Investigating the Generalizability of Pretrained Language Models across Mul-
tiple Dimensions: A Case Study of NLI and MRC. Ritam Dutt, Sagnik Ray
Choudhury, Varun Venkat Rao, Carolyn Rose, V.G. Vinod Vydiswaran
10:30 - 11:00 Coffee Break
11:00 - 11:45 Keynote 2, by Najoung Kim
11:45 - 12:30 Spotlight Presentations:
• MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks.
Mirelle Candida Bueno, Roberto Lotufo, Rodrigo Frassetto Nogueira
• OmniDialog: A Multimodal Benchmark for Generalization Across Text, Vi-
sual, and Audio Modalities. Anton Razzhigaev, Maxim Kurkin, Elizaveta Gon-
charova, Irina Abdullaeva, Anastasia Lysenko, Alexander Panchenko, Andrey
Kuznetsov, Denis Dimitrov
• MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models.
Dojun Park, Jiwoo Lee, Seohyun Park, Hyeyun Jeong, Youngeun Koo, Soonha
Hwang, Seonwoo Park, Sungeun Lee
• The SlayQA benchmark of social reasoning: testing gender-inclusive generalization with neopronouns. Bastian Bunzeck, Sina Zarrieß
• MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large
Language Models. Wentian Wang, Sarthak Jain, Paul Kantor, Jacob Feldman,
Lazaros Gallos, Hao Wang
12:30 - 13:45 Lunch Break
13:45 - 15:00 Poster Session
15:00 - 15:45 Keynote 3, by Sameer Singh
15:45 - 16:00 Coffee Break
16:00 - 16:30 Panel
16:30 - 16:45 Closing Remarks and Best Paper Award
W14 - Natural Legal Language Processing (NLLP) Workshop
2024
Organizers:
Nikolaos Aletras, Leslie Barrett, Ilias Chalkidis, Catalina Goanta, Daniel
Preotiuc-Pietro, Gerasimos Spanakis
https://nllpw.org/workshop/
Room: Hibiscus A
Saturday, November 16, 2024
The Natural Legal Language Processing (NLLP) 2024 workshop, now at its sixth edition, brings together
researchers, practitioners, and policy makers from around the world who develop NLP techniques within the
legal domain. NLP technologies allow legal practitioners and decision-makers to make more informed
decisions, optimize legal strategies and serve clients/consumers/citizens in a more cost-efficient way. The
fast-paced, multi-jurisdictional world of law is a growing area of application for NLP, offering data sources
which are often multilingual and multimodal. For example, evidentiary data sets used in private and pub-
lic legal practice require in-depth image analysis and speech recognition technologies to complement text
data (e.g., opinions and judgments) currently dominating the area. Legal NLP research can create societal
impact by informing regulators how to best protect certain categories of citizens at risk (e.g. vulnerable
consumers), or by enhancing citizen education and access to justice. This is an exciting opportunity to
expand the boundaries of our field by identifying new problems and exploring new data as it interacts with
the full inventory of NLP and machine learning approaches.
Time Session
Session 1
09:00 - 09:15 Workshop opening
09:15 - 09:20 Summarizing Long Regulatory Documents with a Multi-Step Pipeline. Mika Sie,
Ruby Beek, Michiel Bots, Sjaak Brinkkemper, Albert Gatt
09:20 - 09:25 Towards an Automated Pointwise Evaluation Metric for Generated Long-Form
Legal Summaries. Shao Min Tan, Quentin Grail, Lee Quartey
09:25 - 09:30 Cross Examine: An Ensemble-based Approach to Leverage Large Language
Models for Legal Text Analytics. Saurav Chowdhury, Lipika Dey, Suyog Joshi
09:30 - 09:35 LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks
in English. Santosh T.Y.S.S, Cornelius Johannes Weiss, Matthias Grabmair
09:35 - 09:40 Algorithm for Automatic Legislative Text Consolidation. Matias Etcheverry,
Thibaud Real-del-sarte, Pauline Chavallard
09:40 - 09:50 Joint Q&A
09:50 - 09:55 LeGen: Complex Information Extraction from Legal Sentences using Generative
Models. Chaitra C R, Sankalp Kulkarni, Sai Rama Akash Varma Sagi, Shashank
Pandey, Rohit Yalavarthy, Dipanjan Chakraborty, Prajna Devi Upadhyay
09:55 - 10:00 Information Extraction for Planning Court Cases. Drish Mali, Rubash Mali,
Claire Barale
10:00 - 10:05 Automated Anonymization of Parole Hearing Transcripts. Abed El Rahman
Itani, Wassiliki Siskou, Annette Hautli-Janisz
10:05 - 10:10 BLT: Can Large Language Models Handle Basic Legal Text?. Andrew Blair-
Stanek, Nils Holzenberger, Benjamin Van Durme
10:10 - 10:15 Classify First, and Then Extract: Prompt Chaining Technique for Information
Extraction. Alice Saebom Kwak, Clayton T. Morrison, Derek Bambauer, Mihai
Surdeanu
10:15 - 10:20 HiCuLR: Hierarchical Curriculum Learning for Rhetorical Role Labeling of Le-
gal Documents. Santosh T.Y.S.S, Apolline Isaia, Shiyu Hong, Matthias Grabmair
10:20 - 10:30 Joint Q&A
10:30 - 11:00 Break
Session 2
11:00 - 11:05 Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of
Large Language Models. Shubham Kumar Nigam, Aniket Deroy, Subhankar
Maity, Arnab Bhattacharya
11:05 - 11:10 The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK
Employment Tribunal. Huiyuan Xie, Felix Steffek, Joana Ribeiro de Faria,
Christine Carter, Jonathan Rutherford
11:10 - 11:15 Transductive Legal Judgment Prediction Combining BERT Embeddings with
Delaunay-Based GNNs. Hugo Attali, Nadi Tomeh
11:15 - 11:20 Comparative Study of Explainability Methods for Legal Outcome Prediction.
Ieva Raminta Staliunaite, Josef Valvoda, Ken Satoh
11:20 - 11:25 Incorporating Precedents for Legal Judgement Prediction on European Court of
Human Rights Cases. Santosh T.Y.S.S, Mohamed Hesham Elganayni, Stanisław Sójka, Matthias Grabmair
11:25 - 11:30 The Craft of Selective Prediction: Towards Reliable Case Outcome Classification – An Empirical Study on European Court of Human Rights Cases. Santosh T.Y.S.S, Irtiza Chowdhury, Shanshan Xu, Matthias Grabmair
11:30 - 11:45 Joint Q&A
11:45 - 11:50 Quebec Automobile Insurance Question-Answering With Retrieval-Augmented
Generation. David Beauchemin, Richard Khoury, Zachary Gagnon
11:50 - 11:55 Attributed Question Answering for Preconditions in the Dutch Law. Felicia Re-
delaar, Romy van Drie, Suzan Verberne, Maaike de Boer
11:55 - 12:00 Measuring the Groundedness of Legal Question-Answering Systems. Dietrich
Trautmann, Natalia Ostapuk, Quentin Grail, Adrian Alan Pol, Guglielmo Boni-
fazi, Shang Gao, Martin Gajek
12:00 - 12:10 Joint Q&A
12:10 - 14:00 Lunch & In-Person Poster Session (Lunch provided)
Session 3
14:00 - 15:00 Keynote Talk: Omri Ben-Shahar. Privacy Protection, At What Cost?
15:00 - 15:15 Shared Task: Enhancing Legal Violation Identification with LLMs and Deep
Learning Techniques: Achievements in the LegalLens 2024 Competition. Ben Hagag, Gil Semo, Dor Bernsohn, Liav Harpaz, Pashootan Vaezipoor, Rohit Saha, Kyryl Truskovskyi, Gerasimos Spanakis
15:15 - 15:30 Shared Task Winner Presentation
15:30 - 16:00 Break
Session 4
16:00 - 16:05 LLMs to the Rescue: Explaining DSA Statements of Reason with Platform’s
Terms of Services. Marco Aspromonte, Andrea Filippo Ferraris, Federico Galli,
Giuseppe Contissa
16:05 - 16:10 Enhancing Contract Negotiations with LLM-Based Legal Document Compari-
son. Savinay Narendra, Kaushal Shetty, Adwait Ratnaparkhi
16:10 - 16:15 Multi-Property Multi-Label Documents Metadata Recommendation Based on
Encoder Embeddings. Nasredine Cheniki, Vidas Daudaravicius, Abdelfettah Fe-
liachi, Didier Hardy, Marc Wilhelm Küster
16:15 - 16:20 CLERC: A Dataset for U.S. Legal Case Retrieval and Retrieval-Augmented
Analysis Generation. Abe Bohan Hou, Orion Weller, Guanghui Qin, Eugene
Yang, Dawn Lawrie, Nils Holzenberger, Andrew Blair-Stanek, Benjamin Van
Durme
16:20 - 16:25 Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights.
Maksym Taranukhin, Sahithya Ravi, Gabor Lukacs, Evangelos Milios, Vered
Shwartz
16:25 - 16:30 The Impact of Formulaic Language in the Court of Justice of the European Union
on the Performance of Lexical and Dense Retrieval Methods. Larissa Mori,
Carlos Sousa de Oliveira, Yuehwern Yih, Mario Ventresca
16:30 - 16:40 Joint Q&A
16:40 - 16:45 Gaps or Hallucinations? Scrutinizing Machine-Generated Legal Analysis for
Fine-Grained Text Evaluations. Abe Bohan Hou, William Jurayj, Nils Holzen-
berger, Andrew Blair-Stanek, Benjamin Van Durme
16:45 - 16:50 How Many Van Goghs Does It Take to Van Gogh? Finding the Imitation Thresh-
old. Sahil Verma, Royi Rassin, Arnav Mohanty Das, Gantavya Bhatt, Preethi
Seshadri, Chirag Shah, Jeff Bilmes, Hannaneh Hajishirzi, Yanai Elazar
16:50 - 16:55 Towards Supporting Legal Argumentation with NLP: Is More Data Really All
You Need?. Santosh T.Y.S.S, Kevin Ashley, Katie Atkinson, Matthias Grabmair
16:55 - 17:00 Misinformation with Legal Consequences (MisLC): A New Task Towards Har-
nessing Societal Harm of Misinformation. Chu Fei Luo, Radin Shayanfar, Rohan
V Bhambhoria, Samuel Dahan, Xiaodan Zhu
17:00 - 17:10 Joint Q&A
17:10 - 17:15 LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of
the European Court of Human Rights. Odysseas S. Chlapanis, Dimitris Galanis,
Ion Androutsopoulos
17:15 - 17:20 Developing a Pragmatic Benchmark for Assessing Korean Legal Language Un-
derstanding in Large Language Models. Kimyeeun, Choi Youngrok, Eunkyung
Choi, JinHwan Choi, Hai Jin Park, Wonseok Hwang
17:20 - 17:25 Enhancing Legal Expertise in Large Language Models through Composite
Model Integration: The Development and Evaluation of Law-Neo. Zhihao Liu,
Yanzhen Zhu, Mengyuan Lu
17:25 - 17:30 Joint Q&A
17:30 - 17:40 Best Presentation Award
W15 - The 4th Workshop on Multilingual Representation
Learning
Organizers:
David Ifeoluwa Adelani, Duygu Ataman, Mammad Hajili, Raghav Mantri, Abraham Owodunni, Jonne Saleva, David Stap, Francesco Tinner
https://sigtyp.github.io/ws2024-mrl.html
Room: Jasmine
Saturday, November 16, 2024
Multilingual representation learning methods have recently been found to be extremely efficient at learning features useful for transfer learning between languages, and they show potential for successfully adapting natural language processing (NLP) models to languages or tasks with little to no training resources. On the other hand, many aspects of such models still call for further development and analysis to prove their applicability in various contexts. These contexts include different NLP tasks and also understudied language families, which face important obstacles to achieving practical advances that could improve the state of the art in NLP for various low-resource or underrepresented languages.
Time Session
09:00 - 09:10 Opening Remarks
09:10 - 09:50 Invited Talk by Karen Livescu
09:50 - 10:30 Invited Talk by Hila Gonen
10:30 - 11:00 Coffee Break
11:00 - 12:30 Poster Session
12:30 - 14:00 Lunch Break
14:00 - 14:30 Shared Task Session:
• Findings Paper
• Winning Team Presentation
14:30 - 15:30 Best Paper Session:
• Best Paper
• Honorable Mentions
15:30 - 16:00 Coffee Break
16:00 - 16:50 Invited Talk by Sebastian Ruder
16:50 - 17:00 Closing Remarks
W16 - NLP4Science: The First Workshop on Natural Language
Processing for Science
Organizers:
Nitay Calderon, Alex Chapanin, Rotem Dror, Amir Feder, Ariel Goldstein, Anna
Korhonen, Shir Lissak, Yaakov Ophir, Lotem Peled-Cohen, Roi Reichart, Ilanit
Sobol, Refael Tikochinski, Mor Ventura
https://sites.google.com/view/nlp4science/home
Room: Hibiscus B
Saturday, November 16, 2024
The NLP4Science workshop ventures into an important new frontier: leveraging NLP to better under-
stand the human mind. Researchers are increasingly using NLP, and LLMs in particular, for the scientific
modeling and understanding of human behavior. They apply NLP tools to gain invaluable insights into
social science, psychology, psychiatry, health, neuroscience, behavioral economics, and beyond. In this
workshop we will cover principles of NLP-driven scientific modeling, advanced methods for statistically
robust evaluation of NLP models, experimental design, causal inference, and causality-based methods for
text models in science.
Time Session
08:45 - 09:00 Gathering and Welcome
09:00 - 09:45 Invited Speaker - Amit Sharma
09:45 - 10:30 Invited Speaker - Rita Goldstein
10:30 - 11:00 Coffee Break
11:00 - 12:00 Panel Discussion and Q&A
12:00 - 13:00 Lunch Break
13:00 - 13:45 Invited Speaker - Roger Levy
13:45 - 14:30 Invited Speaker - Hadas Raviv
14:30 - 16:00 Poster Session + Coffee Break
16:00 - 16:15 Best Paper Announcement + Short Oral
16:15 - 16:45 Invited Speaker - Nitay Calderon
16:45 - 17:00 Closing Remarks
W17 - The Second Workshop on Social Influence in Conversations
(SICon 2024)
Organizers:
Muskan Garg, Kushal Chawla, Weiyan Shi, Ritam Dutt, Deuksin Brian Kwon,
James Hale, Daniel Hershcovich, Aina Gari Soler, Liang Qiu, Alexandros
Papangelis, Zhou Yu, Gale Lucas
https://sites.google.com/view/sicon2024/home
Room: Pearson
Saturday, November 16, 2024
Social influence (SI) is the change in an individual’s thoughts, feelings, attitudes, or behaviors from in-
teracting with another individual or a group. For example, a buyer uses SI skills to negotiate trade-offs
and build rapport with the seller. SI is ubiquitous in everyday life, and hence, realistic human-machine
conversations must reflect these dynamics, making it essential to model and understand SI in dialogue
research systematically. This would improve SI systems’ ability to understand users’ utterances, tailor
communication strategies, personalize responses, and actively lead conversations. These challenges draw
on perspectives not only from NLP and AI research but also from Game Theory, Affective Computing,
Communication, and Social Psychology. SI dialogue tasks like negotiation, persuasion, therapy, and ar-
gumentation have recently gained traction. Current conversational systems emphasize modeling system
strategies using dialogue acts and strategy annotations or modeling users. Prior work also explored related
tasks crucial for the eventual development of SI systems, namely outcome prediction, argument mining,
and lie detection. However, these efforts are scattered, and only limited work focuses on building useful systems exhibiting SI skills, such as chatbots. Ensuring AI-driven models’ safety, interpretability, and integration into real-time applications that simulate or analyze SI remains challenging.
Time Session
09:00–09:10 Opening Remarks
09:10–09:40 Invited Talk: David Jurgens
09:40–10:10 Invited Talk: Kyriaki Kalimeri
10:10–10:40 Invited Talk: Maurice Schweitzer
10:40–11:00 Coffee Break
11:00–12:00 Panel Discussion 1
12:00–13:30 Lunch Break
13:30–14:30 Poster Session 1
14:30–15:30 Poster Session 2
15:30–16:00 Invited Talk: Viktoria Spaiser
16:00–16:30 Invited Talk: Yulia Tsvetkov
16:30–16:45 Lightning Talks
16:45–17:00 Coffee Break
17:00–17:30 Invited Talk: Yi-Chia Wang
17:30–18:00 Invited Talk: Henning Wachsmuth
18:00 Closing Remarks
W18 - The 11th Workshop on Asian Translation (WAT2024)
Organizers:
Toshiaki Nakazawa, Isao Goto, Hidaya Mino, Kazutaka Kinugawa, Haiyue Song,
Raj Dabre, Shohei Higashiyama, Shantipriya Parida, Ondřej Bojar, Sadao Kurohashi,
Pushpak Bhattacharyya
https://lotus.kuee.kyoto-u.ac.jp/WAT/WAT2024/index.html
Room: Miami Lecture Hall
Saturday, November 16, 2024
Many Asian countries are growing rapidly these days, and the importance of communicating and exchanging information with them has intensified. To satisfy the demand for communication among these countries, machine translation technology is essential. Machine translation technology has evolved rapidly in recent years and is seeing practical use, especially between European languages. However, the translation quality of Asian languages is not as high as that of European languages, and machine translation technology for these languages has not yet reached a stage of proliferation. This is due not only to the lack of language resources for Asian languages but also to the lack of techniques to correctly transfer the meaning of sentences from and to Asian languages. Consequently, a place for gathering and sharing resources and knowledge about Asian language translation is necessary to enhance machine translation research for Asian languages. The Workshop on Machine Translation (WMT), the world’s largest machine translation workshop, mainly targets European languages and does not include Asian languages. The International Workshop on Spoken Language Translation (IWSLT) has spoken language translation tasks for some Asian languages using TED talk data, but there is no task for written language. The Workshop on Asian Translation (WAT) is an open machine translation evaluation campaign focusing on Asian languages. WAT gathers and shares the resources and knowledge of Asian language translation to understand the problems to be solved for the practical use of machine translation technologies among all Asian countries. WAT is unique in that it is an "open innovation platform": the test data is fixed and open, so participants can repeat evaluations on the same data and confirm changes in translation accuracy over time. WAT has no deadline for automatic translation quality evaluation (continuous evaluation), so participants can submit translation results at any time.
Time Session
09:00–09:05 Welcome (Toshiaki Nakazawa)
09:05–10:30 Panel Discussion: Machine Translation of Asian Languages in the LLM Era (Chair: Toshiaki Nakazawa; Panelists: Min Zhang, Thepchai Supnithi, Kozo Moriguchi, Fred Bane)
10:30–11:00 Break
11:00–12:40 Research Paper Presentations (20 mins. each)
• Machine Translation Of Marathi Dialects: A Case Study Of Kadodi. Raj Dabre, Mary Noel Dabre, Teresa Pereira
• AI-Tutor: Interactive Learning of Ancient Knowledge from Low-Resource Languages. Siddhartha R. Dalal, Rahul Aditya, Vethavikashini Chithrra Raghuram, Prahlad Koratamaddi
• An Empirical Study of Multilingual Vocabulary for Neural Machine Translation Models. Kenji Imamura, Masao Utiyama
• Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content? Shenbin Qian, Constantin Orasan, Diptesh Kanojia, Félix do Carmo
• Creative and Context-Aware Translation of East Asian Idioms with GPT-4. Kenan Tang, Peiyang Song, Yao Qin, Xifeng Yan
12:40–12:45 Closing Remarks (Toshiaki Nakazawa)
W19 - The First Workshop on Advancing Natural Language
Processing for Wikipedia (NLP for Wikipedia)
Organizers:
Lucie-Aimée Kaffee, Isaac Johnson, Angela Fan, Tajuddeen Gwadabe, Fabio
Petroni, Daniel van Strien
https://meta.wikimedia.org/wiki/NLP_for_Wikipedia_(EMNLP_2024)
Room: Johnson
Saturday, November 16, 2024
This workshop is a space both to celebrate Wikimedia’s contributions to the NLP community and to highlight approaches to ensuring the sustainability of this relationship for years to come. Wikipedia is a uniquely important resource for the NLP community: it is multilingual, can be freely reused under its open license, and is edited and maintained by a dedicated community of editors whose work has earned it a reputation as a very high-quality dataset for many applications. With this value, however, come many tensions. Despite Wikipedia’s presence in over 300 language editions, much focus in language modeling remains on the high-resource languages. Despite the openness of Wikipedia and its role in many advances in natural language modeling, there are concerns that some of these advances, such as generative text models, could undermine Wikipedia and threaten its sustainability as a community and, ultimately, as a data resource. And despite the heavy usage of Wikimedia data among the NLP community, few researchers work on developing tools that can contribute back to the Wikimedia community. We will invite researchers to contribute novel uses of Wikimedia data or studies of the impact of Wikimedia data within the NLP community. We will also discuss successful approaches to developing tooling that can assist the Wikimedia community in maintaining and improving the breadth of the Wikimedia projects.
Time Session
09:00 - 09:05 Opening remarks - Isaac/Lucie introducing the workshop
09:05 - 09:50 Keynote by Jess Wade - 30-min talk (remote) + Joint Q&A
09:50 - 10:30 2-minute paper lightning talks - Pre-recorded
10:30 - 11:00 Coffee break
11:00 - 12:00 Poster session - In-person or virtual on Gather
12:00 - 12:45 Lunch
12:45 - 13:30 Keynote by Scott A. Hale - 30-min talk (remote) + Joint Q&A
13:30 - 14:15 Panel: Misinformation + Wikipedia - Isabelle Augenstein, Andreas Vlachos
14:15 - 15:00 Panel: Impact of LLMs on Wikipedia - Ilyas Lebleu, David Adelani
15:00 - 15:30 Closing (30 minutes) - Statements by Leila Zia
16
Local Guide
Conference Venue
Venue: Hyatt Regency Miami - 400 South East Second Ave - Miami, FL 33131
https://www.hyatt.com/
Phone: (305) 358-1234
This year's venue for EMNLP 2024 is the Hyatt Regency Miami, located at 400 SE 2nd Ave, Miami, FL 33131, USA. Our downtown Miami hotel is next to Brickell, one of the trendiest neighborhoods in Miami. Our hotel overlooks the Miami River, and our ideal downtown location puts you steps from the Miami Riverwalk and Bayfront Park. The hotel is also close to the Port of Miami and the Kaseya Center (formerly FTX Arena). For a day of shopping, our hotel is near Brickell City Centre. Or, explore Little Havana and tour the Phillip & Patricia Frost Museum of Science. Rooms and suites at our downtown Miami hotel offer stunning city and Biscayne Bay views with all the comforts you expect from a luxury, urban destination. Hyatt Regency Miami's 615 rooms and suites are outfitted with beautiful wood furnishings, generous wardrobe space, and functional work areas.
Hotel Parking
Self-parking: $53.00
Valet parking: Standard vehicles: $53.00; Oversize vehicles: $78.00
Nearby parking garages:
• 225 SE. 2nd St. Garage: Located 0.1 mi away and costs $15
• 200 SE. 2nd St. Garage - P2531: Located 0.1 mi away and costs $25
• 250 SE. 3rd Ave. Garage - P2530: Located 0.1 mi away and costs $25
• 200 S. Biscayne Blvd. SE Financial Center Garage: Located 0.2 mi away and costs $20
• 81 SE. 5th St. 444 Brickell Ave. Garage: Located 0.3 mi away and costs $20
• 29 NW. 1st St. Cindy Lot: Located 0.3 mi away and costs $5
Market
In need of a late-night treat or early morning breakfast? Stop by the Market on the Lobby Level. The Market's inviting, airy feel makes it easy to dash in and grab a quick bite to go. Choose from a steady rotation of bakery items, pizzas, sandwiches, and healthy choices, as well as a full specialty coffee menu.
Need It To Go?
Enjoy a restaurant experience in the comfort of your room. The Market offers freshly prepared food in
eco-friendly containers for pickup or in-room delivery. Order from your in-room phone or use our mobile
app.
The hotel restaurant serves breakfast, lunch, and dinner. Enjoy views of the Miami River for breakfast or lunch. Stop by for a signature mojito as you take in the latest sporting event, and explore our lounge menu or full à la carte dinner offerings for the perfect bite.
Breakfast
Mon–Sun 7:00 AM - 11:00 AM
Lunch
Mon–Sun 12:00 PM - 4:00 PM
Dinner
Mon–Thu 4:00 PM - 11:00 PM
Fri & Sat 4:00 PM - 12:00 AM
Sun 4:00 PM - 11:00 PM
Amenities
Outlets
The USA uses two plug types, A and B. Plug type A has two flat parallel pins, and plug type B has two flat parallel pins and a grounding pin. The USA operates on a 120 V supply voltage at 60 Hz.
Accessibility
We are committed to providing equal access and opportunity for individuals with disabilities. The hotel's accessibility features also make it more accessible for older individuals and guests with changing abilities, ensuring a seamless experience. Our overall goal is to improve usability throughout the hotel for all guests.
About Miami
Explore Miami: https://www.miamiandbeaches.com/
Image credits: The Official Website of Greater Miami & Miami Beach
Miami, Florida is a renowned tourist destination located in the southeastern United States. Known for its stunning coastal beauty and vibrant cultural diversity, the city offers visitors an eclectic mix of art, history, and outdoor activities. Miami's year-round tropical climate makes it an ideal location for both relaxation and adventure. Below are some key highlights of the city:
Cultural Attractions
Miami is celebrated for its multicultural flair, heavily influenced by Latin American, Caribbean, and European cultures. Visitors can explore Little Havana, the heart of Miami's Cuban culture, where they can enjoy authentic Cuban cuisine, music, and art. The Wynwood Arts District is another cultural hotspot, famous for its street art and galleries. Art enthusiasts can visit the Pérez Art Museum Miami (PAMM) and the Art Deco Historic District for architectural inspiration.
Miami's blend of sun-soaked beaches, rich culture, diverse dining options, and vibrant nightlife makes it a top destination for tourists from around the world. Whether visitors seek adventure, relaxation, or cultural enrichment, Miami offers an array of experiences that cater to all interests. For more details, visit the official tourism website at: https://www.miamiandbeaches.com/
Things to do in Miami
Miami, Florida, is bursting with activities and attractions that cater to all interests.
Discover Nature
Take a trip to the Everglades National Park for unique wildlife encounters, airboat tours, and hiking
trails. Enjoy outdoor activities such as kayaking in Biscayne Bay.
Experience Nightlife
Miami's nightlife is legendary, with numerous nightclubs, bars, and live music venues. Popular areas include South Beach, Wynwood, and Brickell, offering a range of entertainment options.
Attend Events
Miami hosts various events throughout the year, including Art Basel, food festivals, and concerts, mak-
ing it an exciting destination for visitors.
• Bayside Marketplace
• Calle Ocho
• South Beach
Beaches
Beach Guide here.
Budget Friendly
For free things to do in Miami, click here.
Useful Information
The currency in Miami, FL is the U.S. Dollar. ATMs are readily accessible, and credit/debit cards as
well as Apple Pay and Google Pay are widely accepted.
Tipping
Weather
During November, Miami's weather is still very summer-like, with temperatures in the mid to low 80s °F (about 30 °C). Skies tend to be clear or partly sunny, with occasional rainfall. Summer attire works well most days.
Local Customs
Drinking
The federal legal age for buying and drinking alcohol is 21 years old. It's illegal to drink in public spaces, including the beach.
Beach Etiquette
Locals take the beach seriously, so avoid making faux pas by respecting people's space on the beach.
Beach Safety
Don't go to a beach that is displaying a purple flag. This indicates that jellyfish, stingrays, or other dangerous critters are in the water.
Start Late
Miami is a late-night city with much nightlife not in full swing until after midnight. Take your time: stretch
out dinner and enjoy some cocktails before heading out for the night.
Language
Spanish is used in day-to-day life in Miami. It pays to learn at least a few words. Here are a few!
• Hola - Hello
• Sí - Yes
• No - No
Walking
Walk on the right side of the sidewalk, and step off to the side if you want to stop to check your phone, look up directions, or take in a view.
Driving
Americans drive on the right-hand side of the road. Traffic laws can vary between states, so it's worth finding out about any local differences if you plan on driving. It is legal to turn right on a red light if it's safe to do so, unless there are signs stating otherwise.
Public Transport
Allow others to disembark before boarding, don't take up more than one seat, and stand to offer seating to pregnant women or someone with a disability.
Spitting
Spitting is considered rude in any public setting. Find more information about local customs and etiquette
in the United States generally here.
Food Options
• Market to Go
• Market
• Moxies
• Coyo Taco
• Novecento
• Bali Cafe
• Tacology
17
Venue Map
[Venue map: floor plans of the Hyatt Regency Miami and the James L. Knight Convention Center. Convention Center, Fourth Floor: Tequesta, Granada, Ibis, Pearson I, Pearson II, Ashe Auditorium. Convention Center, Third Floor: Gautier, Atrium, Merrick I, Merrick II, Zamora, North Hall, Central Hall. Hotel Lobby Level: Japengo, hotel check-in, Riverfront Hall, North Hall, Central Hall, South Hall; all-gender restrooms are located in Japengo. Lower Terrace Level: Gardenia, Azalea, Hibiscus A, Hibiscus B, Jasmine, Tuttle North/Center/South, Monroe, Flagler, Brickell North/Center/South, Regency Ballroom (Tuttle, Monroe, Flagler & Brickell combined), Orchid A–D, Lower and Upper Promenade, and the Riverwalk along the Miami River.]
Author Index
Bhosale, 90 Borgholt, 41
Bhuiya, 244 Borgne, 102
Bhuiyan, 72, 197 Borimann, 90
Bi, 329, 351, 359, 385 Borthwick, 88
Bibi, 257 Borzilov, 393
Biderman, 142 Bos, 249
Bie, 235 Bose, 309
Bielikova, 192, 211, 288 Bosselut, 96, 123, 394
Biemann, 154, 249 Bossi, 119
Biggs, 118 Bostrom, 135
Bikakis, 281 Bottarini, 70
Bilen, 313 Bouayad-Agha, 88
Binbin, 349 Boudin, 397
Bing, 326, 359, 379 Boulle, 66
Bingert, 59 Bouraoui, 250
Bingliwu, 50 Bout, 376
Biran, 189 Bouyarmane, 55
Birrer, 291 Bouzoubaa, 147
Bisazza, 168, 244 Bowden, 223
Bisk, 233, 275, 321 Bowen, 345
Bissyandé, 199 Boyd-Graber, 85, 158, 242, 245, 281, 321
Biswas, 303 Brack, 132
Bitterman, 212, 318 Bradford, 116
Bitton, 99 Bradley, 56
Biyik, 255 Bragg, 76, 253
Bjerring-Hansen, 291 Brahman, 321
Bjerva, 74 Brandon, 288
Blain, 45 Brannon, 48
Blanchard, 116 Bransom, 163
Blanco, 48, 64, 250, 307 Braun, 90
Blanco-Cuaresma, 303 Breazeal, 47, 48
Blaschko, 193 Brief, 73
Blevins, 140, 144 Brinner, 265
Blloshmi, 83, 261 Broeck, 68
Blodgett, 58 Broek, 248
Blouir, 130 Broman, 79
Bobinac, 154 Brown, 349, 366
Bocklet, 174 Browne, 122
Boeker, 137 Bruggeman, 171
Boenninghoff, 322 Bruni, 71
Bogdanov, 354 Bruns, 325
Bogin, 163, 207 Brunskill, 44
Boix-Adserà, 75 Bruseva, 230
Bolandraftar, 89 Brutti, 163
Boleda, 362 Bryksin, 393
Bollegala, 121, 347 Bu, 316, 401
Bonaldi, 72 Buchmann, 124
Bonifacio, 282 Buda, 111
Boonnag, 384 Budhiraja, 278
Borah, 121 Buettner, 310
Borchert, 142 Bugbee, 303
Borenstein, 130, 167, 295 Bui, 210
Borges, 96 Bukharin, 275
Gete, 275 Gong, 135, 192, 202, 204, 207, 216, 234, 238,
Geva, 64, 122, 124, 189 244, 285, 288, 308, 327, 353, 370,
Ghaddar, 91 386
Ghaffari, 243 Gongas, 143
Ghafouri, 325 GongQue, 402
Ghanem, 257 Gonzalez, 176
Ghanim, 121 Gonzalo, 78
Ghannay, 102 Goodman, 98
Ghassemi, 147, 395 Gor, 158
Ghazarian, 276 Gorade, 371
Ghinassi, 117 Gorbunov, 154
Ghodsi, 396 Gordon, 82, 235
Gholaminejad, 218 Goriely, 232
Ghosal, 65, 191, 312 Goswami, 118, 190
Ghose, 272 Gottesman, 122, 189
Ghosh, 53, 57, 76, 86, 132, 180, 191, 277, 278, Gou, 182, 201, 395
303, 390 Gouda, 203
Giadikiaroglou, 84 Govindarajan, 293
Gigant, 210 Gowda, 182
Gilbert, 44 Goyal, 60, 79, 92, 162, 164, 201, 274, 277,
Gill, 191 294, 309
Gillani, 145 Grabowski, 132
Gilsenan-McMahon, 282 Gratch, 318
Gim, 304 Graça, 229
Ginn, 141, 332 Greco, 251
Gipp, 180, 188, 226 Greene, 109
Giryes, 242 Greenstein-Messica, 302
Gispert, 247 Gretter, 163
Gittens, 262 Grezes, 303
Giulianelli, 96–98 Griffiths, 270
Giuliani, 340 Grimmelmann, 209
Gizzi, 335 Grimmer, 292
Glass, 61, 68 Grinsztajn, 272
Glavaš, 142, 230, 238, 239 Grishina, 305
Globerson, 189 Groschwitz, 249
Glória-Silva, 312 Gross, 196
Godbole, 85 Grossman, 363
Goddard, 396 Grotov, 393
Goel, 286 Grundkiewicz, 182
Goharian, 179 Gröner, 206
Golac, 88 Gschwind, 301
Golany, 168 Gu, 64, 84, 126, 132, 143, 169, 193, 215, 220,
Golazizian, 296 248, 253, 254, 267, 276, 282, 285,
Goldberg, 205, 229, 230 289, 305, 329, 330, 336, 344, 347,
Goldfarb-Tarrant, 85, 140 348, 352, 353, 369, 385
Goldman, 210, 233 Gualdoni, 362
Goldstein, 145 Guan, 101, 104, 109, 205, 221, 234, 242, 257,
Goldwasser, 48, 149 289
Gollakota, 216 Gubelmann, 96
Gomez, 216, 218 Guerberof-Arenas, 120
Gomez-Sebastia, 88 Guerini, 47, 72
Gonen, 139, 140 Guerra, 115
GONG, 157 Guerraoui, 170, 171
Huang, 44, 46, 50, 52, 53, 55, 59, 64, 67, 72, Ilin, 241
74, 75, 77, 82, 85, 87, 93, 94, 98, Iluz, 254
105, 114, 125, 128, 131, 133, 135, Ilyas, 227
136, 141, 147, 149, 151, 152, 156, Imoto, 365
159–161, 163, 166, 169, 175, 178, Imperial, 70, 139, 250
181, 186, 198, 199, 202, 205, 207, Indu, 227
209, 219, 221, 222, 228, 234, Indurthi, 135, 144
236–238, 240, 242, 243, 245, 247, Ineichen, 210
256–258, 261, 267, 268, 271, 273, Ingle, 225
277–279, 285–288, 304, 305, 308, Ingvaldsen, 161
313, 315–318, 324, 328, 329, Inkpen, 231
331–334, 336–338, 342, 343, Inoue, 170, 171
345–347, 349, 351, 353, 355, Inui, 170, 171, 190, 360
358–360, 367, 368, 371, 373, 377, ION, 307
379, 381, 385, 387–390, 393, 398, Iordache, 140
403 Ippolito, 340
huang, 64 Iqbal, 81, 253
Huber, 170, 352 Irawan, 139
Huda, 286 Irsoy, 129
Hudi, 139 Irving, 401
Huerta-Enochian, 235 Isaac, 255
hufeng, 371, 387 Ishigaki, 260
Hui, 84, 108 Ishii, 385
Hulden, 332 Ishmam, 197, 332
Hull, 293 Iskander, 161
Hunter, 116, 281 Islam, 165, 168, 262, 312
Huo, 151, 279, 332, 355 Iso, 162
Huot, 46, 69 Ittichaiwong, 384
Hupkes, 71
Ivanova, 170
Hurtado, 274
Ive, 63
Hus, 153
Iwasawa, 233
Hussenot, 275
Iyyer, 83, 164, 167, 263
Huynh, 107, 119
Izsak, 382
Hwang, 88, 135, 138, 141, 146, 159, 181, 195,
197, 245, 265, 276, 297
hwang, 76, 94, 111, 174, 203 Jaakkola, 68
Hwang, 77
Hyeon, 299 Jada, 336
Hänni, 126 Jafari, 196
Hätty, 90 Jagatap, 56
Jagerman, 328
Iakovenko, 139 Jaggi, 255
Iana, 230 Jahan, 72
Idris, 340 Jaidka, 253
Ie, 132 Jaimes, 167
Igamberdiev, 258 Jain, 53, 87, 89, 137, 157, 225, 241, 257, 260,
Ignatov, 340 338, 351
III, 44, 76, 158, 171, 254 JAISWAL, 176
Ikbal, 56 Jaitly, 289
Ikeda, 298 James, 293, 330
Ilhan, 288 Jampani, 300
Ilie-Ablachim, 212 Janardhanan, 62
Ilievski, 54, 268 Jandial, 253
Jang, 89, 94, 101, 146, 186, 195, 229, 297, Jing, 243, 371
299, 301, 307, 323, 350 Jinnai, 137
Jansen, 249, 265 Jinxiaojia, 342
Janssen, 208 Jo, 89, 180
Jaques, 392 Johansson, 357
Jaravine, 137 Johnson, 88, 326
Jarrar, 217 Johri, 306
Jatowt, 381 Jones, 57, 108
Jauhar, 322 Joo, 217, 222
Java, 253 Joshi, 46, 56, 62, 70, 76, 123, 175, 188, 281
Jaya, 139 Joty, 50, 62, 72, 155, 165, 166, 262, 283, 326
Jayanthi, 246 Jou, 53
Jedema, 208 JU, 151
Jen, 181 Ju, 329, 359
Jennings, 72 Jumelet, 74
Jeon, 89, 110, 185, 217, 263, 310 Junczys-Dowmunt, 182
Jeong, 79, 159, 225, 245, 378 Juneja, 201
Jetter, 187 Jung, 50, 70, 76, 115, 256, 293, 295, 297, 324,
Jeung, 174 350, 378, 396
Jha, 218, 287 Junker, 265
Jhamtani, 130, 135, 208, 281 Junlin, 219
Jhunjhunwala, 72 Jurgens, 269, 294, 295, 320
Ji, 55, 57, 72, 83, 91, 100, 103, 106, 122, 133, Jwa, 167
141, 174, 188, 218, 229, 231, 259,
296, 313, 319, 329, 330, 333, 372, K, 227
374, 376 Kabbara, 48
ji, 150 Kabir, 245
Jia, 119, 125, 134, 157, 158, 227, 249, 266, Kabra, 45
288, 319, 345, 371, 398 Kadaoui, 166, 217
Jian, 96, 370 Kadlík, 196
JIANG, 223 Kadurin, 191
Jiang, 50, 51, 53–55, 59, 67, 82, 84, 87, 100, Kaelin, 66
101, 106, 107, 118, 119, 123, 131, Kahn, 53
133, 134, 151, 154, 158, 161, 164, Kai, 303
177, 184, 188, 199, 200, 202, 219, Kailkhura, 134
224, 225, 229, 244, 249, 257, 266, Kairouz, 73
267, 273, 275, 278, 282, 289, 290, Kaiser, 297
310, 313, 321, 323, 337, 357, 358, Kale, 330
360, 361, 364, 369, 375, 377, 380, Kalinsky, 229
381, 389, 400 Kallala, 43
jiang, 364 Kamalloo, 282
Jiao, 110, 145, 155, 160, 198, 199, 202 Kambadur, 129
Jiayang, 64, 99, 161, 249, 311 Kambhatla, 149
Jiayu, 202 Kamigaito, 45, 68, 112, 153, 154, 272, 287
Jie, 197 Kammakomati, 259
JIN, 290 Kamoi, 41
Jin, 51, 69, 72, 93, 105, 122, 128, 134, 148, Kampman, 139
150, 155, 171, 178, 185, 187, 194, Kamruzzaman, 119
204, 209, 210, 223, 238–240, 254, Kan, 85, 93, 162, 213, 286, 314
256, 271, 276, 297, 301, 304, 306, Kanagaraj, 303
309, 351, 371–373, 381, 387, 392, Kanagarajan, 56
398 Kandpal, 73
JING, 240 Kaneko, 262
Kang, 53, 89, 94, 110, 115, 118, 141, 149, 169, Ke, 351, 352
203, 219, 230, 245, 266, 320, 378, KediChen, 212
404 Keh, 139
kang, 334 Keller, 205, 386
Kangaslahti, 174 Kelly, 187
Kanithi, 236 Kementchedjhieva, 53
Kannen, 274 Kersting, 132
Kanojia, 45, 90, 154 Kesarwani, 303
Kanoulas, 155 Keutzer, 176, 207, 218
Kantarcioglu, 52 Kew, 144
Kantu, 213 KHADEMI, 393
Kao, 151, 263, 377 Khadivi, 177
Kapadnis, 221 Khan, 64, 72, 236, 242, 323
Kapanipathi, 56 Khandelwal, 143, 213
Kapur, 220 Khanehzar, 325
Karagöz, 114 Khanuja, 63
Karami, 50 Khapra, 64
Karanam, 56 Kharlamov, 324
Kargupta, 298 Khasanova, 302
Karidi, 363 Khashabi, 122, 189
Karim, 228 Khasin, 302
Karimi, 336 Khatib, 172, 292
Karlinsky, 128 Khattab, 79, 299
Karls, 322 Khattak, 89
Karlsson, 139 Khelli, 139
Karmaker, 250 Khetan, 195
Karnin, 161 Khetani, 66
Karpinska, 164 Khondaker, 202
Karpov, 340 Khot, 163
Karpukhin, 396 Khrabrov, 191
Karray, 112 KhudaBukhsh, 292
Kartik, 213 Khule, 405
Karver, 121, 256 Kidambi, 275
Karypis, 195, 322 Kiegeland, 95
Kasai, 166 Kiela, 103
Kasat, 201 Kil, 134
Kashima, 159 Kilicoglu, 172
Kasianenko, 325 KIM, 146, 195, 199, 305
Kasneci, 294 Kim, 47, 54, 58, 59, 68, 69, 75–78, 83, 85, 88,
Kataria, 191 89, 93–95, 100, 102–105, 109,
Katsigiannis, 230 111, 116, 118, 119, 123, 128–130,
Katsimpras, 91 138, 139, 141, 146, 158, 160, 162,
Katz, 64, 230 163, 167, 171, 173–176, 179, 180,
Katz-Samuels, 235 185, 195, 197, 200–204, 217, 218,
Kaufman, 121, 256 222, 224, 225, 227–229, 234, 246,
Kaur, 166 248, 256, 259, 260, 262, 265, 276,
Kautsar, 139 279, 285, 288, 297, 299, 300, 304,
Kavathekar, 214 305, 307, 311, 313–315, 319–321,
Kawabata, 190 323, 324, 330, 341, 368, 371, 374,
Kawaguchi, 162, 286 378
Kawarada, 260 Kimyeeun, 276
Kayser, 74 Kinchagawat, 384
Kazienko, 197 King, 47, 113, 160, 246
279, 307, 312, 319, 321, 329, 332, Magalhaes, 300, 312
334, 335, 337, 340, 341, 354, 355, Magdy, 217, 283
362, 364, 369, 375, 376, 379, 384, Magnusson, 287
387, 389, 397, 398 Magooda, 171
Luan, 110, 111, 352 Mahadevan, 374
Lucas, 211, 318 mahajan, 250
Lucchetti, 320 Maharaj, 81, 149
Lucero, 66 Mahendra, 139, 325
Lucy, 211 Maheshwari, 263
Ludaescher, 43 Maheshwary, 261, 353
Luger, 292 Mahfouz, 301
Luhtaru, 320 Mahmood, 241
Lukasiewicz, 74 Mahmoud, 81
Lukasik, 137 Mahmud, 332
Lukito, 292 Mahongxia, 363
Lum, 255 Mahowald, 55, 125, 252, 293
Lundberg, 90 Mai, 404
Luo, 48, 50, 61, 81, 91, 103, 106, 131, 150, Maimaiti, 371
163, 168, 169, 181, 198, 204, 228, Maimon, 233
240, 246, 253, 254, 273, 304, 314, Maistro, 41, 269, 355
322, 325, 327, 340, 341, 344, 351, Maiti, 216
358, 359, 373, 379, 380, 388, 398 Maity, 262, 290
luo, 82 Maiya, 126
Lutz, 255 Majumder, 56
Luu, 100, 159, 164, 258, 286, 345, 379, 405 Mak, 366
Lv, 132, 189, 233, 234, 244, 353, 383, 388, 389 Makhlouf, 350
Lymperaiou, 84, 279 Makinae, 45, 153
Lyu, 65, 112, 147, 157, 171, 193, 198, 199, MAKOUAR, 217
282, 331, 369, 395 Malagutti, 97
Malaviya, 69, 207
Ma, 43, 51, 57, 67, 72, 91, 92, 94, 101, 105, Malfa, 159
107, 112, 129, 132, 134, 147, 152, Malgaroli, 395
158, 169, 184, 185, 193, 201, 203, Malik, 56, 86, 337
204, 215, 228, 242, 270, 273, 274, Malin, 52, 57
277, 280, 282, 301, 303, 316, 317, Mallick, 161
339, 340, 347, 359, 360, 377, 381, Malmasi, 88
382, 391, 394, 397, 398 Malviya, 230
Maaløe, 41 Malwat, 282
Mabrey, 116 Mamo, 168
MacAvaney, 52 Mamta, 58
MacDonald, 327 Man, 150, 331
Macina, 220 Manakul, 140, 354
Mackie, 52 Manatkar, 348
Macko, 211 Mancenido, 190
Madaan, 257 Manchanda, 322
Madabushi, 70, 250 Mandal, 278
Madan, 227 Mandelkern, 54, 71
Madhusudan, 52 Manerba, 59
Madhusudhan, 261 Mangla, 60
Madikeri, 216, 296 Manikantan, 114
Madotto, 89, 213 Manmatha, 374
Maduabuchi, 335 Manning, 49
Magai, 376 Manocha, 57, 76, 241, 242, 312
Qin, 81, 85, 91, 99, 147, 151, 154, 155, 183, Rame, 275
188, 270, 305, 307, 321, 328, 333, Ramesh, 234, 257
341, 343, 348, 361, 366, 378, 380, Ramis, 303
381, 383, 389, 391 Ramponi, 70
Qing, 234, 348, 378 Ramu, 190, 211, 315
Qiu, 68, 72, 82, 86, 131, 136, 160, 161, 171, Ran, 134, 272, 304
213, 231, 249, 258, 268, 287, 326, Ranaldi, 141, 193
328, 341, 344, 350, 352, 359, 360, Ranasinghe, 45, 292
372, 380, 381, 383, 389, 394, 402, Ranathunga, 64
405 Rando, 123
Qorib, 111, 139 Rangappa, 218
Qu, 153, 154, 271, 303, 311, 316, 329, 335, Ranger, 376
350, 357, 361, 376 Rangwala, 195
QUAN, 66, 71 Rani, 214
Quan, 131, 284, 359 Ranjit, 70
quezibing, 212 Rankel, 284
Quick, 63 Rao, 54, 338, 347, 358, 374, 375, 387
Rashkin, 46
Rabatin, 109, 352 Rassin, 229
Rabinovich, 385 Rasteh, 89
Radev, 50, 166, 283 Rastogi, 233
Radhakrishna, 199 Rathore, 144
Radharapu, 62 Rau, 246
Raffel, 152 Ravanelli, 265
Rafiei, 183, 324 Ravfogel, 52
Ragan-Kelley, 288 Ravi, 146
Ragazzi, 108 Rawal, 49, 115, 321
Raghu, 56, 180, 220, 297 Rawat, 85
Raghuraman, 303 Ray, 60, 69, 103, 201, 203, 260
Raha, 236 Raychev, 144
Rahaman, 69 Raza, 280, 374
Rahamim, 174 Razniewski, 61
Rahimi, 93 Rebedea, 132, 212, 222
Rahman, 72, 168, 262 Recknor, 315
Rahmati, 117 Reddy, 53, 60, 91, 139, 167, 218, 229
Raileanu, 300 Reed, 404
Raina, 106, 124, 126, 215 Reeson, 207
RAJ, 164 Reforgiato, 212
Raj, 256 Rei, 152, 155, 264
Rajabzadeh, 396 Reich, 95
Rajadesingan, 149 Reichart, 40, 145, 186, 314
Rajagopal, 118 Reichman, 231
Rajan, 268 Reisert, 170, 171
Rajmohan, 82, 352 Reiter, 212
Rajtmajer, 118 Reiß, 162
Ram, 49 Rekabsaz, 345
Ramachandran, 303 Ren, 43, 51, 89, 94, 112, 156, 157, 194, 201,
Ramakrishna, 49, 274, 381 207, 225, 264, 298, 319, 358, 360,
Ramakrishnan, 390 364, 382
Ramamoorthy, 63 Renduchintala, 237
Ramamurthy, 227 Renze, 86
Ramanathan, 203 Rep, 250
Ramasubramanian, 84, 303 Rezagholizadeh, 91, 135, 282, 396
Yun, 125, 173, 174, 182, 235, 238, 276, 323, 163–167, 173, 175, 176, 178, 181,
378 183, 184, 188, 189, 191, 193–196,
Yunusov, 346 200–209, 214, 216, 217, 220–222,
Yurochkin, 262 224–227, 229–231, 233, 234, 237,
Yüksel, 280 238, 240–242, 244, 245, 247–249,
Yıldız, 265 253, 254, 256, 258, 260, 264, 267,
271, 275, 276, 279, 280, 282,
zadeh, 78 284–290, 298, 301, 303–306, 308,
Zafar, 159 309, 311–315, 317–319, 324, 326,
Zaharia, 79 328, 329, 331, 333–335, 337,
Zaheer, 85 339–348, 350–361, 364, 367–370,
Zahera, 153 372–376, 378–391, 393–395, 397,
Zahid, 399 398, 400–403
Zahraei, 404 zhang, 266, 338, 340, 345, 374
Zaiane, 181, 387 Zhangjijun, 342
Zaib, 78 zhangwenlong, 99
Zaman, 127 Zhao, 42, 45, 49–51, 64, 67, 82, 83, 85, 89,
Zamaraeva, 60 91–93, 97, 99, 101, 105, 106, 109,
Zambre, 302 112, 113, 116, 120, 126, 127, 129,
Zampieri, 292 133–135, 144, 148, 155–157,
ZAN, 400 159–164, 166, 173, 175, 181, 187,
Zang, 293 192, 197, 198, 201, 206, 209, 210,
Zantedeschi, 197 218, 219, 221, 234, 237–240, 242,
Zaranis, 154 245, 247, 252–254, 266, 267, 273,
Zare, 315 275, 277, 282, 283, 302–304, 313,
Zarrieß, 169, 206, 265 338–343, 345, 348, 349, 352, 356,
Zavelca, 212 338–343, 345, 348, 349, 352, 356,
Zayed, 217 358, 359, 363, 370, 374, 378,
Zebaze, 347 380–382, 384, 387, 389, 392, 398
Zee, 215 zhao, 206, 216, 285, 341, 359, 365
Zeldes, 59, 115, 322 Zharmagambetov, 151
Zemel, 102, 274, 277, 313 Zharov, 393
ZENG, 339, 341 Zhen, 173
Zeng, 45, 50, 65, 82, 83, 119, 124, 131, 132, Zheng, 47, 66, 100, 101, 105, 109, 121, 129,
136, 148, 155, 166, 199, 200, 204, 145, 169, 172, 181, 182, 188, 204,
206, 233, 260, 278, 286, 329, 340, 212, 221, 227, 240, 249, 253, 256,
348, 355, 393, 404 258, 269, 271, 293, 308, 316, 325,
Zeren, 356 341, 343, 345, 348, 352, 363, 366,
Zerva, 66, 279 373, 376, 378, 384, 388, 391, 401
Zettlemoyer, 53, 140, 209, 274 zheng, 290
Zettsu, 376 Zhi, 108, 359
Zevallos, 194 Zhong, 56, 160, 187, 192, 289, 311, 330, 345,
Zha, 49, 382 348, 361, 364, 369, 378, 381
ZHAI, 127 zhong, 319
Zhai, 62, 100, 242, 339, 371 ZHOU, 178, 331
Zhan, 192, 266, 356 Zhou, 42, 43, 50, 57, 58, 61, 77, 93, 97,
ZHANG, 85, 122, 266, 326, 381 109–111, 113, 117, 120, 121,
Zhang, 40, 41, 43, 46, 49, 50, 52–55, 57, 58, 129–131, 133, 135, 136, 141, 143,
60–62, 64, 66–68, 71, 72, 74–76, 144, 147, 158, 160, 161, 166, 172,
78, 79, 81–83, 85–89, 92, 93, 96, 175, 177, 178, 181, 192, 194, 196,
99, 101, 103–107, 110, 111, 117, 200, 206, 207, 212, 214, 218, 219,
119, 120, 122, 128–136, 139, 141, 222, 230, 231, 233, 234, 236,
142, 144, 145, 150–152, 155–161, 240–242, 246, 247, 254, 258, 266,
Come build the
future with us
At Amazon, we fundamentally believe that innovation is essential to being the most customer-centric company in the world. It's the company's ability to have an impact at scale that allows us to attract some of the brightest minds in artificial intelligence and related fields.
Intuit’s investment in a robust AI infrastructure to democratize AI has made it possible for technologists
across the company to build AI capabilities into our products at scale. To fuel rapid innovation with
generative AI (GenAI), we built GenOS, a proprietary GenAI operating system that’s empowering Intuit
technologists to design, build, and deploy breakthrough experiences. For example, Intuit Assist is
embedded across our platform and products, using powerful and relevant contextual data sets spanning
small business, consumer finance, and tax to help our customers make smart financial decisions.
PLATINUM SPONSORS
GOLD SPONSORS
SILVER SPONSORS
BRONZE SPONSORS