Advances in Intelligent Multimedia: Multimodal Semantic Representation
Paul Mc Kevitt
School of Computing & Intelligent Systems
Faculty of Engineering
University of Ulster, Magee
BT48 7JL, Derry/Londonderry
NORTHERN IRELAND
p.mckevitt@ulster.ac.uk
Abstract
Intelligent MultiMedia or MultiModal systems involve the computer processing, understanding and production of inputs and outputs from at least speech, text, and visual information in terms of semantic representations. One of the central questions for these systems is what form of semantic representation should be used. Here, we look at current trends in multimodal semantic representation, which are mainly XML- and frame-based, relate our experiences in the development of multimodal systems (CHAMELEON and CONFUCIUS) and conclude that producer/consumer, intention (speech acts), semantic-content, and timestamps are four important components of any multimodal semantic representation.
3 MultiModal experiences: CHAMELEON and CONFUCIUS

We have had experience with developing two MultiModal systems, CHAMELEON and CONFUCIUS, and each system has its own requirements in terms of MultiModal semantic representation.

Figure 1: Architecture of CHAMELEON
3.1 CHAMELEON
CHAMELEON has a distributed architecture of communicating agent modules processing inputs and outputs from different modalities, each of which can be tailored to a number of application domains. Process synchronisation and intercommunication for CHAMELEON modules is performed using the DACS (Distributed Applications Communication System) Inter Process Communication (IPC) software (see Fink et al. 1996), which enables CHAMELEON modules to be glued together and distributed across a number of servers. Presently, there are ten software modules in CHAMELEON: blackboard, dialogue manager, domain model, gesture recogniser, laser system, microphone array, speech recogniser, speech synthesiser, natural language processor (NLP), and Topsy, as shown in Figure 1. More detail on CHAMELEON can be found in Brøndsted et al. (1998, 2001).

Figure 2: Physical layout of the IntelliMedia WorkBench

An initial application of CHAMELEON is the IntelliMedia WorkBench, a hardware and software platform shown in Figure 2. One or more cameras and lasers can be mounted in the ceiling, a microphone array placed on the wall, and there is a table where things (objects, gadgets, people, pictures, 2D/3D models, building plans, or whatever) can be placed. The current domain is a Campus Information System which at present gives information on the architectural and functional layout of a building. 2-dimensional (2D) architectural plans of the building drawn on white paper are laid on the table and the user can ask questions about them. Presently, there is one static camera which calibrates the plans on the table and the laser, and interprets the user's pointing while the system points to locations and draws routes with a laser. Inputs are simultaneous speech and/or pointing gestures and outputs are synchronised speech synthesis and pointing. We currently run all of CHAMELEON on a standard Intel Pentium computer which handles input for the Campus Information System in real-time.

3.2 Frame semantics

CHAMELEON's blackboard stores semantic representations produced by each of the other modules and keeps a history of these over the course of an interaction. All modules communicate through the exchange of semantic representations with each other or the blackboard. The meaning of interactions over the course of a MultiModal dialogue is represented using a frame semantics with frames in the spirit of Minsky (1975). The intention is that all modules in the system can produce and read frames. Frames are coded in CHAMELEON with messages built as predicate-argument structures following a BNF definition. The frame semantics was first presented in Mc Kevitt and Dalsgaard (1997). Frames represent some crucial elements such as module, input/output, intention, location, and timestamp. Module is simply the name of the module producing the frame (e.g. NLP). Inputs are the input recognised, whether spoken (e.g. "Show me Hanne's office") or gestural (e.g. pointing coordinates), and outputs are the intended output, whether spoken (e.g. "This is Hanne's office.") or gestural (e.g. pointing coordinates). Timestamps can include the times a given module commenced and terminated processing and the time a frame was written on the blackboard. The frame semantics also includes representations for two key phenomena in language/vision integration: reference and spatial relations.
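As an illustration of the structure just described (and not of CHAMELEON's actual message format, which follows its own BNF definition), a frame carrying the four recurring components (producing module, intention, semantic content and timestamp) together with a minimal blackboard history might be sketched as follows in Python; the class and field names are illustrative assumptions only:

    # Illustrative sketch only: a frame with producer module, intention,
    # semantic content and timestamp, plus a minimal blackboard history.
    # This is not CHAMELEON's actual message format or code.
    from dataclasses import dataclass, field
    from time import time
    from typing import Any

    @dataclass
    class Frame:
        module: str                  # producer, e.g. "SPEECH-RECOGNISER"
        intention: str               # speech act or gesture type, e.g. "instruction!"
        content: dict[str, Any]      # semantic content, e.g. {"UTTERANCE": "Point to Hanne's office"}
        timestamp: float = field(default_factory=time)

    class Blackboard:
        """Keeps a history of frames over the course of an interaction."""
        def __init__(self) -> None:
            self.history: list[Frame] = []

        def post(self, frame: Frame) -> None:
            self.history.append(frame)

        def latest(self, module: str) -> Frame | None:
            frames = [f for f in self.history if f.module == module]
            return frames[-1] if frames else None

    bb = Blackboard()
    bb.post(Frame("SPEECH-RECOGNISER", "instruction!",
                  {"UTTERANCE": "Point to Hanne's office"}))
    bb.post(Frame("GESTURE", "pointing", {"GESTURE": "coordinates (3, 2)"}))
    print(bb.latest("GESTURE"))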
Frames can be grouped into three categories: (1) input, (2) output and (3) integration. Input frames are those which come from modules processing perceptual input, output frames are those produced by modules generating system output and integration frames are integrated meaning representations constructed over the course of a dialogue (i.e. all other frames). Here, we shall discuss frames with a focus more on frame semantics than on frame syntax and in fact the actual coding of frames as messages within CHAMELEON has a different syntax.

3.2.1 Input frames

An input frame takes the general form:

[MODULE
INPUT: input
INTENTION: intention-type
TIME: timestamp]

where MODULE is the name of the input module producing the frame, INPUT can be at least UTTERANCE or GESTURE, input is the utterance or gesture and intention-type includes different types of utterances and gestures. An utterance input frame can at least have intention-type (1) query?, (2) instruction! and (3) declarative. An example of an utterance input frame is:

[SPEECH-RECOGNISER
UTTERANCE: (Point to Hanne's office)
INTENTION: instruction!
TIME: timestamp]

A gesture input frame is where intention-type can be at least (1) pointing, (2) mark-area, and (3) indicate-direction. An example of a gesture input frame is:

[GESTURE
GESTURE: coordinates (3, 2)
INTENTION: pointing
TIME: timestamp]

3.2.2 Output frames

An output frame takes the general form:

[MODULE
INTENTION: intention-type
OUTPUT: output
TIME: timestamp]

where MODULE is the name of the output module producing the frame, intention-type includes different types of utterances and gestures and OUTPUT is at least UTTERANCE or GESTURE. An utterance output frame can at least have intention-type (1) query?, (2) instruction!, and (3) declarative. An example utterance output frame is:

[SPEECH-SYNTHESIZER
INTENTION: declarative
UTTERANCE: (This is Hanne's office)
TIME: timestamp]

A gesture output frame can at least have intention-type (1) description (pointing), (2) description (route), (3) description (mark-area), and (4) description (indicate-direction). An example gesture output frame is:

[LASER
INTENTION: description (pointing)
LOCATION: coordinates (5, 2)
TIME: timestamp]

3.2.3 Integration frames

Integration frames are all those other than input/output frames. An example utterance integration frame is:

[NLP
INTENTION: description (pointing)
LOCATION: office (tenant Hanne) (coordinates (5, 2))
UTTERANCE: (This is Hanne's office)
TIME: timestamp]

Things become even more complex with the occurrence of references and spatial relationships:
[MODULE
INTENTION: intention-type
LOCATION: location
LOCATION: location
LOCATION: location
SPACE-RELATION: beside
REFERENT: person
LOCATION: location
TIME: timestamp]

An example of such an integration frame is:

[DOMAIN-MODEL
INTENTION: query? (who)
LOCATION: office (tenant Hanne) (coordinates (5, 2))
LOCATION: office (tenant Jørgen) (coordinates (4, 2))
LOCATION: office (tenant Børge) (coordinates (3, 1))
SPACE-RELATION: beside
REFERENT: (person Paul-Dalsgaard)
LOCATION: office (tenant Paul-Dalsgaard) (coordinates (4, 1))
TIME: timestamp]

We have reported complete blackboard histories for the instruction "Point to Hanne's office" and the query "Whose office is this?" + [pointing] (exophoric/deictic reference) in Mc Kevitt and Dalsgaard (1997) and Brøndsted et al. (1998). With respect to spatial relations, we derive all the frames appearing on the blackboard for the example "Who's in the office beside him?" in Mc Kevitt (2000).

To summarise, in CHAMELEON and the IntelliMedia WorkBench we have found that producer/consumer, intention (speech acts), semantic-content, and timestamps are four important components of any multimodal semantic representation. With respect to multimodal semantic-content there is a requirement of representing two key elements of multimodal systems: reference and spatial relations.
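How such a SPACE-RELATION might be grounded in the coordinates carried by the location frames can be sketched as follows; the offices are those of the DOMAIN-MODEL example above, but the adjacency test and threshold are illustrative assumptions rather than CHAMELEON's actual spatial reasoning:

    # Illustrative sketch only: resolving a "beside" SPACE-RELATION over the
    # office coordinates from the DOMAIN-MODEL frame above. The adjacency test
    # and threshold are assumptions for illustration, not CHAMELEON's code.
    from math import dist

    offices = {
        "Hanne": (5, 2),
        "Jørgen": (4, 2),
        "Børge": (3, 1),
        "Paul-Dalsgaard": (4, 1),
    }

    def beside(a: str, b: str, threshold: float = 1.0) -> bool:
        """Treat two offices as beside each other if their plan coordinates
        lie within one grid unit of each other."""
        return 0 < dist(offices[a], offices[b]) <= threshold

    def offices_beside(referent: str) -> list[str]:
        """Answer a query such as "Who's in the office beside him?" once the
        referent has been resolved (here, to Paul-Dalsgaard)."""
        return [tenant for tenant in offices if beside(tenant, referent)]

    print(offices_beside("Paul-Dalsgaard"))   # ['Jørgen', 'Børge'] for these coordinates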
4 Seanchaí

Within an intelligent multimedia storytelling platform called Seanchaí we are interested in generating 3D animation automatically. Seanchaí will perform multimodal storytelling generation, interpretation and presentation and consists of Homer, a storytelling generation module, and CONFUCIUS, a storytelling interpretation and presentation module (see Figure 3). The output of the former module could be fed as input to the latter. Homer focuses on natural language story generation. It will receive two types of input from the user, (1) either the beginning or the ending of a story in the form of a sentence, and (2) stylistic specifications, and outputs natural language stories. CONFUCIUS focuses on story interpretation and multimodal presentation. It receives input natural language stories or (play/movie) scripts and presents them with 3D animation, speech and non-speech audio.

Figure 3: Intelligent multimodal storytelling platform - Seanchaí

The knowledge base and its visual knowledge semantic representation are used in CONFUCIUS (see Figure 4), and they could also be adopted in other vision and natural language processing integration applications. The dashed part in the figure includes the prefabricated objects such as characters, props, and animations for basic activities, which will be used in the Animation generation module. When the input is a story, it will be transferred to a script by the script writer, then parsed by the script parser and the natural language processing module respectively. The modules for Natural Language Processing (NLP), Text to Speech (TTS) and sound effects operate in parallel. Their outputs will be fused at code combination, which generates a holistic 3D world representation including animation, speech and sound effects. NLP will be performed using Gate and WordNet, TTS will be performed using Festival or Microsoft Whistler, VRML (Virtual Reality Modelling Language) will be used to model the story 3D virtual world, and visual semantics is represented using a Prolog-like formalism.

4.1 Visual knowledge representation

Existing multimodal semantic representations within various intelligent multimedia systems may represent the general organisation of semantic structure for various types of inputs and outputs and are usable at various stages such as media fusion and pragmatic aspects. However, there is a gap between high-level general multimodal semantic representation and lower-level representation that is capable of connecting meanings across modalities. Such a lower-level meaning representation, which links language modalities to visual modalities, is proposed in Ma and Mc Kevitt (2003, 2005). Figure 5 illustrates the multimodal semantic representation of CONFUCIUS. It is composed of language, visual and non-speech audio modalities. Between the multimodal semantics and each specific modality there are two levels of representation: one is a high-level multimodal semantic representation which is media-independent, the other is an intermediate-level media-dependent representation. CONFUCIUS will use an XML-based representation for high-level multimodal semantics and an extended predicate-argument representation for the intermediate representation which connects language with visual modalities, as shown in Figure 5. Our visual semantics decomposition method is at the intermediate representation level (see Ma and Mc Kevitt 2003, 2005). It is suitable for implementation in the 3D graphic modelling language VRML, and it will be translated to VRML code by a Java program in CONFUCIUS. We also plan to include non-speech audio in the media-dependent and media-independent semantic representations.

Figure 4: System architecture of CONFUCIUS

The predicate-argument format we apply to represent verb semantics has a Prolog-inspired nomenclature. Each non-atomic action is defined by one or more subgoals, and the name of every goal/subgoal reveals its purpose and effect. Primitives 1 through 14 are basic primitive actions in our framework (Figure 6). We do not claim that these fourteen cover all the necessary primitives needed in modelling observable verbs. Primitives 13 and 14 are actually not primitive actions, but they are necessary in processing complex space displacement. In the first twelve primitives, 1-3 describe position movement, 4 and 5 concern orientation changes, 6-9 focus on alignment, 10 is a composite action (not atomic) composed of lower level primitives, and 11 and 12 concern size (shape) changes. Figure 7 illustrates the hierarchical structure of the twelve primitives. Higher level actions are defined by lower level ones. For instance, alignment operations are composed of move() and/or moveTo() predicates. Definitions of the primitives are given in Ma and Mc Kevitt (2003).

1) move(obj, xInc, yInc, zInc)
2) moveTo(obj, loc)
3) moveToward(obj, loc, displacement)
4) rotate(obj, xAngle, yAngle, zAngle)
5) faceTo(obj1, obj2)
6) alignMiddle(obj1, obj2, axis)
7) alignMax(obj1, obj2, axis)
8) alignMin(obj1, obj2, axis)
9) alignTouch(obj1, obj2, axis)
10) touch(obj1, obj2, axis) (for the relation of support and contact)
11) scale(obj, rate) (scale up/down, change size)
12) squash(obj, rate, axis) (squash or lengthen an object)
13) group(x, [y|_], newObj)
14) ungroup(xyList, x, yList)

Figure 6: Basic predicate-argument primitives within CONFUCIUS

(Notes: in ungroup, yList is the rest of the list after deleting x from the original list, which is also a basic list operation in Prolog; a semantic constraint declares an instance of a type such as 'Animal', and metaphor usage of vegetal or inanimate characters is not considered here.)
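To make the layering concrete, the following sketch shows two of the primitives of Figure 6 and an alignment operation defined in terms of moveTo(); the object representation and axis handling are illustrative assumptions, and the actual definitions (which target VRML) are those of Ma and Mc Kevitt (2003):

    # Illustrative sketch only: a composite alignment operation expressed in
    # terms of the lower-level moveTo() primitive, mirroring the hierarchy
    # described above. Not the actual CONFUCIUS definitions.
    from dataclasses import dataclass

    AXES = {"x": 0, "y": 1, "z": 2}

    @dataclass
    class Obj:
        name: str
        centre: list[float]          # centre position in world coordinates

    def move(obj: Obj, x_inc: float, y_inc: float, z_inc: float) -> None:
        """Primitive 1: displace an object by an increment on each axis."""
        obj.centre = [c + d for c, d in zip(obj.centre, (x_inc, y_inc, z_inc))]

    def move_to(obj: Obj, loc: list[float]) -> None:
        """Primitive 2: move an object to an absolute location."""
        obj.centre = list(loc)

    def align_middle(obj1: Obj, obj2: Obj, axis: str) -> None:
        """Primitive 6 treated as a composite: align obj1's centre with obj2's
        on one axis by delegating to moveTo()."""
        target = list(obj1.centre)
        target[AXES[axis]] = obj2.centre[AXES[axis]]
        move_to(obj1, target)

    cup = Obj("cup", [0.0, 0.5, 0.0])
    table = Obj("table", [2.0, 0.5, 1.0])
    align_middle(cup, table, "x")
    move(cup, 0.0, 0.1, 0.0)         # nudge the cup upwards with the basic primitive
    print(cup.centre)                # [2.0, 0.6, 0.0]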
Figure 7: Hierarchical structure of CONFUCIUS' primitives

    move(x.feet, _, HEIGHT, _),
    move(x.body, _, HEIGHT, _),
    move(x.feet, _, -HEIGHT, _).

Example 2, call: as in "A is calling B" (verb tense is not considered here because it is at sentence level rather than word level). This is one word-sense of call, where calling is conducted by telephone. Here is the definition of one word-sense of call, which is at the first level of the visual semantic verb representation hierarchy:

    call(a):-
        type(a, Person),
        type(tel, Telephone),
        pickup(a, tel.receiver, a.leftEar),
        dial(a, tel.keypad),
        speak(a, tel.receiver),
        putdown(a, tel.receiver, tel.set).

Further examples are given in Ma and Mc Kevitt (2003).
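The way such a definition bottoms out in the primitives of Figure 6 can be sketched as a simple expansion; the particular subgoal definitions below are illustrative stand-ins, since CONFUCIUS expresses its definitions in a Prolog-like formalism (Ma and Mc Kevitt 2003):

    # Illustrative sketch only: expanding a non-atomic action into the sequence
    # of primitives it denotes, in the spirit of the call(a) definition above.
    # The definitions here are invented stand-ins, not CONFUCIUS's own.
    PRIMITIVES = {"move", "moveTo", "rotate", "faceTo", "touch"}

    DEFINITIONS = {                  # each non-atomic action is an ordered list of subgoals
        "call":    ["pickup", "dial", "speak", "putdown"],
        "pickup":  ["moveTo", "touch"],
        "dial":    ["touch"],
        "speak":   ["faceTo"],
        "putdown": ["moveTo"],
    }

    def expand(action: str) -> list[str]:
        """Recursively expand an action until only primitives remain."""
        if action in PRIMITIVES:
            return [action]
        return [p for subgoal in DEFINITIONS[action] for p in expand(subgoal)]

    print(expand("call"))            # ['moveTo', 'touch', 'touch', 'faceTo', 'moveTo']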
To summarise, in CONFUCIUS we have found that, as in CHAMELEON, higher-level media-independent semantic representations will be important in forms such as XML and frames, but also that intermediate-level media-dependent representations will be necessary in order to fully represent correspondences between modalities.

5 Discussion

Our experience with MultiModal semantic representation is that the representations required are dependent on the applications at hand and also on MultiModal system architectures. This is also clear from the discussions found in Romary (2001), Maybury (2001) and Bunt and Romary (2002). There are requirements for higher-level media-independent representations but also for lower-level, more media-dependent representations. We argue that producer/consumer, intention (speech acts), semantic-content, and timestamps are four important components of any higher-level multimodal semantic representation.

Many of the requirements in multimodal semantic representation come from the need to integrate information from different modalities. In terms of language and vision integration there are requirements for mapping the language and visual information into semantic components which can be fused and integrated and will be necessary for answering queries such as "Whose office is this?" In terms of language and computer graphics integration there are requirements for determining the visual meaning of language actions (verbs) so that, for example, language can be mapped into graphical presentations automatically. So, for example, with the verb "close" there could be three visual definitions: closing of a normal door (rotation on the y axis), closing of a sliding door (moving on the x axis), or closing of a rolling shutter door (a combination of rotation on the x axis and moving on the y axis).

Two key problems in language and vision integration are reference (see Brøndsted 1999, Kievet et al. 2001) and spatial relations (see Mc Kevitt 2000, Zelinsky-Wibbelt 1993), i.e. in multimodal systems there are regular deictic references to the visual context and also numerous spatial relations. Hence, it is a necessary requirement for adequate semantic-content representations to incorporate mechanisms for representing spatial relations and reference.
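Returning to the verb "close", a minimal sketch of how such sense selection might be keyed on the kind of object involved is given below; the door kinds and the particular primitive calls are illustrative assumptions only:

    # Illustrative sketch only: selecting among the three visual definitions of
    # "close" mentioned above, keyed on the kind of door being closed.
    def visual_close(door_kind: str) -> list[str]:
        """Return the primitive operations realising "close" for a given door."""
        definitions = {
            "hinged":  ["rotate(door, yAngle=-90)"],             # normal door: rotation on the y axis
            "sliding": ["move(door, xInc=-1.0)"],                # sliding door: movement on the x axis
            "shutter": ["rotate(door, xAngle=360)",              # rolling shutter: rotation on the x axis
                        "move(door, yInc=-2.0)"],                # combined with movement on the y axis
        }
        return definitions[door_kind]

    for kind in ("hinged", "sliding", "shutter"):
        print(kind, "->", visual_close(kind))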
6 Conclusion and future work

Although traditional and Intelligent MultiMedia or MultiModal systems are both concerned with text, voice, sound and video/graphics, with the former the computer has little or no understanding of the meaning of what it is presenting, and this is what distinguishes the two. With the current proliferation of multimodal systems the question that everyone is asking is: what is the correct multimodal meaning representation? From our experience in developing two multimodal systems, one which integrates the processing of spoken dialogue and vision for both input and output (CHAMELEON) and one which translates text stories into multimodal presentations with 3D graphics, spoken dialogue and non-speech audio (CONFUCIUS), we conclude that multimodal semantic representation: (1) depends on the task at hand, (2) depends on the system architecture, (3) will be necessary at different levels (media-independent and media-dependent), (4) will have at least the following four important components: producer/consumer, intention (speech acts), semantic-content, and timestamps, and (5) will have many forms of representation such as frames, XML, formal logics, event-logic truth conditions, X-schemas and f-structs, or connectionist models. With respect to multimodal semantic-content there is a requirement of representing two key elements of multimodal systems: reference and spatial relations. With respect to multimodal system architectures there are interesting questions as to where multimodal semantic representations lie in systems and whether all the semantics is contained in one single blackboard (CHAMELEON) or distributed throughout the system (Ymir and SmartKom).

Future work will involve experimenting with various semantic representations and architectures with numerous applications and, as we have found with knowledge representation in artificial intelligence, it may be the case that no single representation is the correct one; more significant will be how we use the representation and what can be achieved with it in terms of multimodality.
References

Ahn, R., R.J. Beun, T. Borghuis, H.C. Bunt and C. van Overveld (1995) The DENK architecture: a fundamental approach to user-interfaces. In Integration of natural language and vision processing (Vol. I): computational models and systems, P. Mc Kevitt (Ed.), 267-281. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Almeida, L., I. Amdal, N. Beires, M. Boualem, L. Boves, E. den Os, P. Filoche, R. Gomes, J.E. Knudsen, K. Kvale, J. Rugelbak, C. Tallec and N. Warakagoda (2002) Implementing and evaluating a multimodal and multilingual tourist guide. In Proc. of the International CLASS Workshop on Natural, Intelligent and Effective interaction in MultiModal Dialogue Systems, Copenhagen, Denmark, 28-29 June.

André, Elisabeth, G. Herzog, and T. Rist (1988) On the simultaneous interpretation of real-world image sequences and their natural language description: the system SOCCER. In Proceedings of the 8th European Conference on Artificial Intelligence, 449-454, Munich, Germany.

André, Elisabeth and Thomas Rist (1993) The design of illustrated documents as a planning task. In Intelligent multimedia interfaces, M. Maybury (Ed.), 75-93. Menlo Park, CA: AAAI Press.

André, E., J. Müller and T. Rist (1996) The PPP persona: a multipurpose animated presentation agent. In Advanced Visual Interfaces, T. Catarci, M.F. Constabile, S. Levialdi and G. Santucci (Eds.), 245-247. New York, USA: ACM Press.

André, E. and T. Rist (2000) Presenting through performing: on the use of multiple lifelike characters in knowledge-based presentation systems. In Proc. of the Second International Conference on Intelligent User Interfaces (IUI 2000), Los Angeles, CA, USA, 1-8.

André, E., T. Rist, S. van Mulken, M. Klesen and S. Baldes (2000) The automated design of believable dialogues for animated presentation teams. In Embodied conversational agents, J. Cassell, J. Sullivan, S. Prevost and E. Churchill (Eds.), 220-255. Cambridge, MA: The MIT Press.

André, E. and T. Rist (2001) Controlling the behaviour of animated presentation agents in the interface: scripting vs. instructing. In AI Magazine, 22(4), 53-66.

Bailey, D., J. Feldman, S. Narayanan & G. Lakoff (1997) Modeling embodied lexical development. In Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society (CogSci97), 19-24, Stanford, CA, USA.

Berners-Lee, T., J. Hendler and O. Lassila (2001) The semantic web. In Scientific American, May.

Brøndsted, T. (1999) Reference problems in CHAMELEON. In Proc. of the ESCA Tutorial and Research Workshop on Interactive Dialogue in MultiModal Systems (IDS-99), Paul Dalsgaard, Paul Heisterkamp and Chin-Hui Lee (Eds.), 133-136. Kloster Irsee, Germany, June.

Brøndsted, T., P. Dalsgaard, L.B. Larsen, M. Manthey, P. Mc Kevitt, T.B. Moeslund & K.G. Olesen (1998) A platform for developing Intelligent MultiMedia applications. Technical Report R-98-1004, Center for PersonKommunikation (CPK), Institute of Electronic Systems (IES), Aalborg University, Denmark, May.

Brøndsted, T., P. Dalsgaard, L.B. Larsen, M. Manthey, P. Mc Kevitt, T.B. Moeslund & K.G. Olesen (2001) The IntelliMedia WorkBench - an Environment for Building Multimodal Systems. In Advances in Cooperative Multimodal Communication: Second International Conference, CMC'98, Tilburg, The Netherlands, January 1998, Selected Papers, H. Bunt and R.J. Beun (Eds.), 217-233. Lecture Notes in Artificial Intelligence (LNAI) series, LNAI 2155, Berlin, Germany: Springer Verlag.

Bunt, H.C., R. Ahn, R.J. Beun, T. Borghuis and C. van Overveld (1998) Multimodal cooperation with the DENK system. In Multimodal Human-Computer Communication, H.C. Bunt, R.J. Beun and T. Borghuis (Eds.), 1-12. Berlin, Germany: Springer-Verlag.

Bunt, H. and L. Romary (2002) Towards multimodal content representation. In International standards of terminology and language resources management, LREC 2002, Las Palmas, Spain.

Carenini, G., F. Pianesi, M. Ponzi and O. Stock (1992) Natural language generation and hypertext access. IRST Technical Report 9201-06, Instituto Per La Scientifica E Tecnologica, Loc. Pant e Di Povo, I-138100 Trento, Italy.

Cassell, J., J. Sullivan, S. Prevost and E. Churchill (Eds.) (2000) Embodied conversational agents. Cambridge, MA: MIT Press.

Cassell, J., H. Vilhjalmsson and T. Bickmore (2001) BEAT: the behaviour expression animation toolkit. In SIGGRAPH 2001 Conference Proceedings, Los Angeles, August 12-17, 477-486.

Chai, J., S. Pan and M.X. Zhou (2002) MIND: semantics based multimodal interpretation framework. In Proc. of the International CLASS Workshop on Natural, Intelligent and Effective interaction in MultiModal Dialogue Systems, Copenhagen, Denmark, 28-29 June.

Coyne, B. & R. Sproat (2001) WordsEye: an automatic text-to-scene conversion system. In AT&T Labs. Computer Graphics Annual Conference, SIGGRAPH 2001 Conference Proceedings, Los Angeles, Aug 12-17, 487-496.

DAML+OIL (2002) DAML+OIL reference description. http://www.w3.org/TR/daml+oil-reference.

Denis, M. and M. Carfantan (Eds.) (1993) Images et langages: multimodalité et modelisation cognitive. Actes du Colloque Interdisciplinaire du Comité National de la Recherche Scientifique, Salle des Conférences, Siège du CNRS, Paris, April.

Dennett, Daniel (1991) Consciousness explained. Harmondsworth: Penguin.

Feldman, J., G. Lakoff, D. Bailey, S. Narayanan, T. Regier and A. Stolcke (1996) L0 - the first five years of an automated language acquisition project. In Integration of natural language and vision processing (Vol. III): theory and grounding representations, P. Mc Kevitt (Ed.), 205-231. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Fink, G.A., N. Jungclaus, H. Ritter, and G. Sagerer (1995) A communication framework for heterogeneous distributed pattern analysis. In Proc. International Conference on Algorithms and Applications for Parallel Processing, V.L. Narasimhan (Ed.), 881-890. IEEE, Brisbane, Australia.

Granström, Björn, David House and Inger Karlsson (Eds.) (2002) Multimodality in language and speech systems. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Grumbach, A. (1996) Grounding symbols into perceptions. In Integration of natural language and vision processing (Vol. III): theory and grounding representations, P. Mc Kevitt (Ed.), 233-248. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Herzog, G. and G. Retz-Schmidt (1990) Das System SOCCER: Simultane Interpretation und natürlichsprachliche Beschreibung zeitveränderlicher Szenen. In Sport und Informatik, J. Perl (Ed.), 95-119. Schorndorf: Hofmann.

Herzog, G., C.-K. Sung, E. André, W. Enkelmann, H.-H. Nagel, T. Rist, and W. Wahlster (1989) Incremental natural language description of dynamic imagery. In Wissensbasierte Systeme. 3. Internationaler GI-Kongress, C. Freksa and W. Brauer (Eds.), 153-162. Berlin: Springer-Verlag.

Kievet, L., P. Piewek, R. Jan-Beun and H. Bunt (2001) Multimodal cooperative resolution of referential expressions in the DENK system. In Advances in Cooperative Multimodal Communication: Second International Conference, CMC'98, Tilburg, The Netherlands, January 1998, Selected Papers, H.C. Bunt and R.J. Beun (Eds.), 197-214. Lecture Notes in Artificial Intelligence (LNAI) series, LNAI 2155, Berlin, Germany: Springer-Verlag.

Kosslyn, S.M. and J.R. Pomerantz (1977) Imagery, propositions and the form of internal representations. In Cognitive Psychology, 9, 52-76.

Ma, Minhua and Paul Mc Kevitt (2003) Semantic representation of events in 3D animation. In Proc. of the Fifth International Workshop on Computational Semantics (IWCS-5), Harry Bunt, Reinhard Muskens and Elias Thiesse (Eds.). Tilburg University, Tilburg, The Netherlands, January.

Ma, Minhua and Paul Mc Kevitt (2005) Visual semantics and ontology of eventive verbs. In Natural Language Processing - IJCNLP-04, First International Joint Conference, Hainan Island, China, March 22-24, 2004, Keh-Yih Su, Jun-Ichi Tsujii, Jong-Hyeok Lee and Oi Yee Kwong (Eds.), 187-196. Lecture Notes in Artificial Intelligence (LNAI) series, LNCS 3248. Berlin, Germany: Springer Verlag.

Maaß, Wolfgang, Peter Wizinski and Gerd Herzog (1993) VITRA GUIDE: Multimodal route descriptions for computer assisted vehicle navigation. Bereich Nr. 93, Universität des Saarlandes, FB 14 Informatik IV, Im Stadtwald 15, D-6600 Saarbrücken 11, Germany, February.

Maybury, Mark (1991) Planning multimedia explanations using communicative acts. In Proceedings of the Ninth American National Conference on Artificial Intelligence (AAAI-91), Anaheim, CA, July 14-19.

Maybury, Mark (Ed.) (1993) Intelligent multimedia interfaces. Menlo Park, CA: AAAI Press.

Maybury, Mark (Ed.) (1997) Intelligent multimedia information retrieval. Menlo Park, CA: AAAI/MIT Press.

Maybury, M. (2001) Would you build your dream house without a blueprint? Working group on software architectures for MultiModal Systems (WG 3). International Seminar on Coordination and Fusion in MultiModal Interaction, Schloss Dagstuhl International Conference and Research Center for Computer Science, Wadern, Saarland, Germany, 29 October - 2 November. (http://www.dfki.de/~wahlster/Dagstuhl Multi Modality/WG 3 The Architecture Dream Team/index.html).

Maybury, Mark and Wolfgang Wahlster (Eds.) (1998) Readings in intelligent user interfaces. Los Altos, CA: Morgan Kaufmann Publishers.

Mc Kevitt, Paul (1994) Visions for language. In Proceedings of the Workshop on Integration of Natural Language and Vision Processing, Twelfth American National Conference on Artificial Intelligence (AAAI-94), Seattle, Washington, USA, August, 47-57.

Mc Kevitt, Paul (Ed.) (1995/1996) Integration of Natural Language and Vision Processing (Vols. I-IV). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Mc Kevitt, Paul (2000) CHAMELEON meets spatial cognition. In Spatial cognition, Seán Ó Nualláin (Ed.), 149-170. US: John Benjamins. Also in Proceedings of MIND-III: The Annual Conference of the Cognitive Science Society of Ireland, Theme: Spatial Cognition, Mary Hegarty and Seán Ó Nualláin (Eds.), Part II, 70-87. Dublin City University (DCU), Dublin, Ireland, August, 1998.

Mc Kevitt, Paul and Paul Dalsgaard (1997) A frame semantics for an IntelliMedia TourGuide. In Proceedings of the Eighth Ireland Conference on Artificial Intelligence (AI-97), Volume 1, 104-111. University of Ulster, Magee, Derry, Northern Ireland, September.

Mc Kevitt, Paul, Seán Ó Nualláin and Conn Mulvihill (Eds.) (2002) Language, vision and music. Amsterdam, The Netherlands: John Benjamins Publishing Co.

Minsky, M. (1975) A framework for representing knowledge. In Readings in knowledge representation, R. Brachman and H. Levesque (Eds.), 245-262. Los Altos, CA: Morgan Kaufmann.

Narayanan, S., D. Manuel, L. Ford, D. Tallis & M. Yazdani (1995) Language visualisation: applications and theoretical foundations of a primitive-based approach. In Integration of Natural Language and Vision Processing (Volume II): Intelligent Multimedia, P. Mc Kevitt (Ed.), 143-163. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Neumann, B. and H.-J. Novak (1986) NAOS: Ein System zur natürlichsprachlichen Beschreibung zeitveränderlicher Szenen. In Informatik. Forschung und Entwicklung, 1(1): 83-92.

Okada, Naoyuki (1996) Integrating vision, motion and language through mind. In Integration of Natural Language and Vision Processing, Volume IV, Recent Advances, Mc Kevitt, Paul (Ed.), 55-80. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Okada, Naoyuki (1997) Integrating vision, motion and language through mind. In Proceedings of the Eighth Ireland Conference on Artificial Intelligence (AI-97), Volume 1, 7-16. University of Ulster, Magee, Derry, Northern Ireland, September.

OWL (2002) Feature synopsis for OWL Lite and OWL. http://www.w3.org/TR/WD-owl-features-2020729/.

Partridge, Derek (1991) A new guide to Artificial Intelligence. Norwood, New Jersey: Ablex Publishing Corporation.

Pentland, Alex (Ed.) (1993) Looking at people: recognition and interpretation of human action. IJCAI-93 Workshop (W28) at The 13th International Conference on Artificial Intelligence (IJCAI-93), Chambéry, France, EU, August.

Pylyshyn, Zenon (1973) What the mind's eye tells the mind's brain: a critique of mental imagery. In Psychological Bulletin, 80, 1-24.

Reithinger, N. (2001) Media coordination in SmartKom. International Seminar on Coordination and Fusion in MultiModal Interaction, Schloss Dagstuhl International Conference and Research Center for Computer Science, Wadern, Saarland, Germany, 29 October - 2 November. (http://www.dfki.de/~wahlster/Dagstuhl Multi Modality/Media Coordination In SmartKom/index.html).

Reithinger, N., C. Lauer and L. Romary (2002) MIAMM - Multidimensional information access using multiple modalities. In Proc. of the International CLASS Workshop on Natural, Intelligent and Effective interaction in MultiModal Dialogue Systems, Copenhagen, Denmark, 28-29 June.

Retz-Schmidt, Gudala (1991) Recognizing intentions, interactions, and causes of plan failures. In User Modelling and User-Adapted Interaction, 1: 173-202.

Retz-Schmidt, Gudala and Markus Tetzlaff (1991) Methods for the intentional description of image sequences. Bereich Nr. 80, Universität des Saarlandes, FB 14 Informatik IV, Im Stadtwald 15, D-6600 Saarbrücken 11, Germany, EU, August.

Rich, Elaine and Kevin Knight (1991) Artificial Intelligence. New York: McGraw-Hill.

Rickheit, Gert and Ipke Wachsmuth (1996) Collaborative Research Centre "Situated Artificial Communicators" at the University of Bielefeld, Germany. In Integration of Natural Language and Vision Processing, Volume IV, Recent Advances, Mc Kevitt, Paul (Ed.), 11-16. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Romary, L. (2001) Working group on multimodal meaning representation (WG 4). International Seminar on Coordination and Fusion in MultiModal Interaction, Schloss Dagstuhl International Conference and Research Center for Computer Science, Wadern, Saarland, Germany, 29 October - 2 November. (http://www.dfki.de/~wahlster/Dagstuhl Multi Modality/WG 4 Multimodal Meaning Representation/index.html).

Sales, N.J., R.G. Evans and I. Aleksander (1996) Successful naive representation grounding. In Integration of natural language and vision processing (Vol. III): computational models and systems, P. Mc Kevitt (Ed.), 185-204. Dordrecht, The Netherlands: Kluwer Academic Publishers.

SALT (2002) http://www.saltforum.org.

Schank, R.C. (1973) The fourteen primitive actions and their inferences. Memo AIM-183, Stanford Artificial Intelligence Laboratory, Stanford, CA, USA.

Siskind, J.M. (1995) Grounding language in perception. In Integration of Natural Language and Vision Processing (Volume I): Computational Models and Systems, P. Mc Kevitt (Ed.), 207-227. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Stock, Oliviero (1991) Natural language and exploration of an information space: the ALFresco interactive system. In Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI-91), 972-978, Darling Harbour, Sydney, Australia, August.

Thórisson, Kris R. (1996) Communicative humanoids: a computational model of psychosocial dialogue skills. Ph.D. thesis, Massachusetts Institute of Technology.

Thórisson, Kris R. (1997) Layered action control in communicative humanoids. In Proceedings of Computer Graphics Europe '97, June 5-7, Geneva, Switzerland.

VoiceXML (2002) http://www.voicexml.org.

Wahlster, Wolfgang (1988) One word says more than a thousand pictures: on the automatic verbalization of the results of image sequence analysis. Bereich Nr. 25, Universität des Saarlandes, FB 14 Informatik IV, Im Stadtwald 15, D-6600 Saarbrücken 11, Germany, February.

Wahlster, Wolfgang, Elisabeth André, Wolfgang Finkler, Hans-Jürgen Profitlich, and Thomas Rist (1993) Plan-based integration of natural language and graphics generation. In Artificial Intelligence, Special issue on natural language generation, 63, 387-427.

Wahlster, W., N. Reithinger and A. Blocher (2001) SmartKom: towards multimodal dialogues with anthropomorphic interface agents. In Proceedings of The International Status Conference: Lead Projects, "Human-Computer Interaction", G. Wolf and G. Klein (Eds.), 23-34. Berlin, Germany: Deutsches Zentrum für Luft- und Raumfahrt.

Waltz, David (1975) Understanding line drawings of scenes with shadows. In The psychology of computer vision, Winston, P.H. (Ed.), 19-91. New York: McGraw-Hill.

W3C-MMI (2002) http://www.w3.org/2002/mmi/.

Zelinsky-Wibbelt, Cornelia (Ed.) (1993) The semantics of prepositions: from mental processing to natural language processing (NLP 3). Berlin, Germany: Mouton de Gruyter.

Zhou, M.X. and S. Feiner (2001) IMPROVISE: automated generation of animated graphics for coordinated multimedia presentations. In Advances in Cooperative Multimodal Communication: Second International Conference, CMC'98, Tilburg, The Netherlands, January 1998, Selected Papers, H.C. Bunt and R.J. Beun (Eds.), 43-63. Lecture Notes in Artificial Intelligence (LNAI) series, LNAI 2155. Berlin, Germany: Springer-Verlag.