DOI: 10.1145/958432.958474
Article

A visually grounded natural language interface for reference to spatial scenes

Published: 05 November 2003

Abstract

Many user interfaces, from graphic design programs to navigation aids in cars, share a virtual space with the user. Such applications are often ideal candidates for speech interfaces that allow the user to refer to objects in the shared space. We present an analysis of how people describe objects in spatial scenes using natural language. Based on this study, we describe a system that uses synthetic vision to "see" such scenes from the person's point of view, and that understands complex natural language descriptions referring to objects in the scenes. This system is based on a rich notion of semantic compositionality embedded in a grounded language understanding framework. We describe its semantic elements, their compositional behaviour, and their grounding through the synthetic vision system. To conclude, we evaluate the performance of the system on unconstrained input.

Published In

ICMI '03: Proceedings of the 5th international conference on Multimodal interfaces
November 2003
318 pages
ISBN:1581136218
DOI:10.1145/958432
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cognitive modelling
  2. computational semantics
  3. natural language understanding
  4. vision based semantics

Qualifiers

  • Article

Conference

ICMI-PUI03: International Conference on Multimodal User Interfaces
November 5 - 7, 2003
Vancouver, British Columbia, Canada

Acceptance Rates

ICMI '03 Paper Acceptance Rate 45 of 130 submissions, 35%;
Overall Acceptance Rate 453 of 1,080 submissions, 42%

Cited By

  • (2022) Learning English with Peppa Pig. Transactions of the Association for Computational Linguistics, 10, pp. 922-936. DOI: 10.1162/tacl_a_00498. Online publication date: 7-Sep-2022.
  • (2018) Communicating Spatial Relations Using Online Chat. Visual Approaches to Cognitive Education With Technology Integration, pp. 233-282. DOI: 10.4018/978-1-5225-5332-8.ch011. Online publication date: 2018.
  • (2012) The blue one to the left. Proceedings of the 14th ACM international conference on Multimodal interaction, pp. 57-58. DOI: 10.1145/2388676.2388691. Online publication date: 22-Oct-2012.
  • (2009) Constructing World Abstractions for Natural Language in Virtual 3D Environments. New Directions in Intelligent Interactive Multimedia Systems and Services - 2, pp. 77-86. DOI: 10.1007/978-3-642-02937-0_8. Online publication date: 2009.
  • (2007) Learning in Wubble World. 2007 IEEE 6th International Conference on Development and Learning, pp. 330-335. DOI: 10.1109/DEVLRN.2007.4354034. Online publication date: Jul-2007.
  • (2006) A Qualitative Assessment of Communicating Spatial Concepts in Virtual and Physical Environments via a Text-Based Medium. Sixth IEEE International Conference on Advanced Learning Technologies (ICALT'06), pp. 81-83. DOI: 10.1109/ICALT.2006.1652371. Online publication date: 2006.
  • (2005) Engaging in a conversation with synthetic characters along the virtuality continuum. Proceedings of the 5th international conference on Smart Graphics, pp. 1-12. DOI: 10.1007/11536482_1. Online publication date: 22-Aug-2005.
  • (2003) Augmenting user interfaces with adaptive speech commands. Proceedings of the 5th international conference on Multimodal interfaces, pp. 176-179. DOI: 10.1145/958432.958467. Online publication date: 5-Nov-2003.
