A visually grounded natural language interface for reference to spatial scenes

Authors:

Deb Roy

ICMI '03: Proceedings of the 5th international conference on Multimodal interfaces

Pages 219 - 226

https://doi.org/10.1145/958432.958474

Published: 05 November 2003

Abstract

Many user interfaces, from graphic design programs to navigation aids in cars, share a virtual space with the user. Such applications are often ideal candidates for speech interfaces that allow the user to refer to objects in the shared space. We present an analysis of how people describe objects in spatial scenes using natural language. Based on this study, we describe a system that uses synthetic vision to "see" such scenes from the person's point of view, and that understands complex natural language descriptions referring to objects in the scenes. This system is based on a rich notion of semantic compositionality embedded in a grounded language understanding framework. We describe its semantic elements, their compositional behaviour, and their grounding through the synthetic vision system. To conclude, we evaluate the performance of the system on unconstrained input.
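As a rough illustration of what compositional, visually grounded reference resolution can look like, the following toy sketch treats each word as a function over a set of candidate objects and composes those functions to resolve a phrase. It is not the system described in the abstract: the `SceneObject` fields, the `LEXICON` entries, the `resolve` function, and the right-to-left composition order are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SceneObject:
    name: str   # hypothetical identifier, for illustration only
    color: str
    x: float    # horizontal position from the viewer's point of view
    y: float


Candidates = List[SceneObject]
Meaning = Callable[[Candidates], Candidates]


def color_filter(color: str) -> Meaning:
    """Ground a colour word as a filter over the visible objects."""
    return lambda objs: [o for o in objs if o.color == color]


def leftmost(objs: Candidates) -> Candidates:
    """Ground 'leftmost' as a selection of the object with the smallest x."""
    return [min(objs, key=lambda o: o.x)] if objs else []


# Assumed toy lexicon: every word maps a candidate set to a candidate set.
LEXICON: Dict[str, Meaning] = {
    "green": color_filter("green"),
    "purple": color_filter("purple"),
    "leftmost": leftmost,
    "one": lambda objs: objs,
    "the": lambda objs: objs,
}


def resolve(phrase: str, scene: Candidates) -> Candidates:
    """Apply word meanings from the head noun outward (right to left),
    so 'leftmost green one' selects the leftmost among the green objects."""
    candidates = list(scene)
    for word in reversed(phrase.lower().split()):
        candidates = LEXICON[word](candidates)
    return candidates


scene = [
    SceneObject("a", "purple", 0.1, 0.5),
    SceneObject("b", "green", 0.4, 0.2),
    SceneObject("c", "green", 0.8, 0.7),
]

print([o.name for o in resolve("the leftmost green one", scene)])  # ['b']
```

The order of composition matters in this sketch: applying "leftmost" before "green" would first select object `a` and then find no green object at all, which is one reason a fixed word-by-word filter is only a starting point for the kind of compositional behaviour the abstract refers to.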

Index Terms

A visually grounded natural language interface for reference to spatial scenes
1. Computing methodologies
  1. Artificial intelligence

Published In

ICMI '03: Proceedings of the 5th international conference on Multimodal interfaces

November 2003

318 pages

ISBN: 1581136218

DOI: 10.1145/958432

Conference Chair: Sharon Oviatt (Oregon Health & Science University)

Program Chairs: Trevor Darrell (Massachusetts Institute of Technology), Mark Maybury (MITRE), Wolfgang Wahlster (DFKI, Germany)

Copyright © 2003 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

ICMI-PUI03: International Conference on Multimodal User Interfaces

November 5 - 7, 2003

Vancouver, British Columbia, Canada

Acceptance Rates

ICMI '03 Paper Acceptance Rate 45 of 130 submissions, 35%;

Overall Acceptance Rate 453 of 1,080 submissions, 42%

Bibliometrics

Total Citations: 8
Total Downloads: 369
Downloads (last 12 months): 1
Downloads (last 6 weeks): 0

Reflects downloads up to 17 Feb 2025

Cited By

Nikolaus M, Alishahi A, Chrupała G (2022). Learning English with Peppa Pig. Transactions of the Association for Computational Linguistics 10, 922-936. Online publication date: 7-Sep-2022.
https://doi.org/10.1162/tacl_a_00498
Wyeld T (2018). Communicating Spatial Relations Using Online Chat. Visual Approaches to Cognitive Education With Technology Integration, 233-282. Online publication date: 2018.
https://doi.org/10.4018/978-1-5225-5332-8.ch011
Budhiraja P, Madhvanath S, Morency L, Bohus D, Aghajan H, Cassell J, Nijholt A, Epps J (2012). The blue one to the left. Proceedings of the 14th ACM international conference on Multimodal interaction, 57-58. Online publication date: 22-Oct-2012.
https://dl.acm.org/doi/10.1145/2388676.2388691
León C, de la Puente S, Dionne D, Hervás R, Gervás P (2009). Constructing World Abstractions for Natural Language in Virtual 3D Environments. New Directions in Intelligent Interactive Multimedia Systems and Services - 2, 77-86. Online publication date: 2009.
https://doi.org/10.1007/978-3-642-02937-0_8
Kerr W, Hoversten S, Hewlett D, Cohen P, Chang Y (2007). Learning in Wubble World. 2007 IEEE 6th International Conference on Development and Learning, 330-335. Online publication date: Jul-2007.
https://doi.org/10.1109/DEVLRN.2007.4354034
Puade O, Wyeld T (2006). A Qualitative Assessment of Communicating Spatial Concepts in Virtual and Physical Environments via a Text-Based Medium. Sixth IEEE International Conference on Advanced Learning Technologies (ICALT'06), 81-83. Online publication date: 2006.
https://doi.org/10.1109/ICALT.2006.1652371
André E, Dorfmüller-Ulhaas K, Rehm M (2005). Engaging in a conversation with synthetic characters along the virtuality continuum. Proceedings of the 5th international conference on Smart Graphics, 1-12. Online publication date: 22-Aug-2005.
https://dl.acm.org/doi/10.1007/11536482_1
Gorniak P, Roy D, Oviatt S, Darrell T, Maybury M, Wahlster W (2003). Augmenting user interfaces with adaptive speech commands. Proceedings of the 5th international conference on Multimodal interfaces, 176-179. Online publication date: 5-Nov-2003.
https://dl.acm.org/doi/10.1145/958432.958467
