Advances in Intelligent MultiMedia:

MultiModal semantic representation

Paul Mc Kevitt
School of Computing & Intelligent Systems
Faculty of Engineering
University of Ulster, Magee
BT48 7JL, Derry/Londonderry
NORTHERN IRELAND
p.mckevitt@ulster.ac.uk

Abstract
Intelligent MultiMedia or MultiModal systems involve the computer processing, understanding and production of inputs and outputs from at least speech, text, and visual information in terms of semantic representations. One of the central questions for these systems is what form of semantic representation should be used. Here, we look at current trends in multimodal semantic representation, which are mainly XML- and frame-based, relate our experiences in the development of multimodal systems (CHAMELEON and CONFUCIUS) and conclude that producer/consumer, intention (speech acts), semantic-content, and timestamps are four important components of any multimodal semantic representation.

1 Introduction

What distinguishes traditional MultiMedia from Intelligent MultiMedia or MultiModal Systems is that although both are concerned with text, voice, sound and video/graphics, with possibly touch and virtual reality linked in, in the former the computer has little or no understanding of the meaning of what it is presenting. Intelligent MultiMedia or MultiModal systems involve the computer processing and understanding of perceptual signal and symbol input from at least speech, text and visual images, and then reacting to it, which is much more complex and involves signal and symbol processing techniques from not just engineering and computer science but also artificial intelligence and cognitive science (Mc Kevitt 1994, 1995/96, Mc Kevitt et al. 2002). With IntelliMedia systems, people can interact in spoken dialogues with machines, querying about what is being presented, and even their gestures and body language can be interpreted.

Although there has been much success in developing theories, models and systems in the areas of Natural Language Processing (NLP) and Vision Processing (VP) (Partridge 1991, Rich and Knight 1991), there has been little progress in integrating these two subareas of Artificial Intelligence (AI). In the beginning, although the general aim of the field was to build integrated language and vision systems, few were, and these two subfields quickly arose. It is not clear why there has not already been much activity in integrating NLP and VP. Is it because of the long-time reductionist trend in science up until the recent emphasis on chaos theory, non-linear systems, and emergent behaviour? Or is it because the people who have tended to work on NLP tend to be in other Departments, or of a different ilk, from those who have worked on VP? Dennett (1991, p. 57-58) says "Surely a major source of the widespread skepticism about "machine understanding" of natural language is that such systems almost never avail themselves of anything like a visual workspace in which to parse or analyze the input. If they did, the sense that they were actually understanding what they processed would be greatly heightened (whether or not it would still be, as some insist, an illusion). As it is, if a computer says, "I see what you mean" in response to input, there is a strong temptation to dismiss the assertion as an obvious fraud."

People are able to combine the processing of language and vision with apparent ease. In particular, people can use words to describe a picture, and can reproduce a picture from a language description. Moreover, people can exhibit this kind of behaviour over a very wide range of input pictures and language descriptions.
Even more impressive is the fact that people can look at images and describe not just the image itself but a set of abstract emotions evoked by it. Although there are theories of how we process vision and language, there are few theories about how such processing is integrated. There have been large debates in Psychology and Philosophy with respect to the degree to which people store knowledge as propositions or pictures (Kosslyn and Pomerantz 1977, Pylyshyn 1973).

There are at least two advantages of linking the processing of natural languages to the processing of visual scenes. First, investigations into the nature of human cognition may benefit. Such investigations are being conducted in the fields of Psychology, Cognitive Science, and Philosophy. Computer implementations of integrated VP and NLP can shed light on how people do it. Second, there are advantages for real-world applications. The combination of two powerful technologies promises new applications: automatic production of speech/text from images; automatic production of images from speech/text; and the automatic interpretation of images with speech/text. The theoretical and practical advantages of linking natural language and vision processing have also been described in Wahlster (1988).

Early work on synthesizing simple text from images was conducted by Waltz (1975), who produced an algorithm capable of labelling edges and corners in images of polyhedra. The labelling scheme obeys a constraint minimisation criterion so that only sets of consistent labellings are used. The system can be expected to become 'confused' when presented with an image where two mutually exclusive but self-consistent labellings are possible. This is important because in this respect the program can be regarded as perceiving an illusion such as what humans see in the Necker cube. However, the system seemed to be incapable of any higher-order text descriptions. For example, it did not produce natural language statements such as "There is a cube in the picture."

A number of natural language systems for the description of image sequences have been developed (Herzog and Retz-Schmidt 1990, Neumann and Novak 1986). These systems can verbalize the behaviour of human agents in image sequences about football and describe the spatio-temporal properties of the behaviour observed. Retz-Schmidt (1991) and Retz-Schmidt and Tetzlaff (1991) describe an approach which yields plan hypotheses about intentional entities from spatio-temporal information about agents. The results can be verbalized in natural language. The system called REPLAI-II takes observations from image sequences as input. Moving objects from two-dimensional image sequences have been extracted by a vision system (Herzog et al. 1989) and spatio-temporal entities (spatial relations and events) have been recognised by an event-recognition system. A focussing process selects interesting agents to be concentrated on during a plan-recognition process. Plan recognition provides a basis for intention recognition and plan-failure analysis. Each recognised intentional entity is described in natural language. A system called SOCCER (André et al. 1988, Herzog et al. 1989) verbalizes real-world image sequences of soccer games in natural language and REPLAI-II extends the range of capabilities of SOCCER. Here, NLP is used more for annotation through text generation with less focus on analysis.

Maaß et al. (1993) describe a system, called Vitra Guide, that generates multimodal route descriptions for computer assisted vehicle navigation. Information is presented in natural language, maps and perspective views. Three classes of spatial relations are described for natural language references: (1) topological relations (e.g. in, near), (2) directional relations (e.g. left, right) and (3) path relations (e.g. along, past). The output for all presentation modes relies on one common 3D model of the domain. Again, Vitra emphasizes annotation through generation of text, rather than analysis, and the vision module considers interrogation of a database of digitized road and city maps rather than vision analysis.

Some of the engineering work in NLP focusses on the exciting idea of incorporating NLP techniques with speech, touchscreen, video and mouse to provide advanced multimedia interfaces (Maybury 1993, Maybury and Wahlster 1998). Examples of such work are found in the ALFresco system, which is a multimedia interface providing information on Italian Frescoes (Carenini et al. 1992 and Stock 1991), the WIP system that provides information on assembling, using, and maintaining physical devices like an espresso machine or a lawnmower (André and Rist 1993 and Wahlster et al. 1993) with more recent work on interactive presentations with an animated agent in PPP (Personalised Plan Presenter) (André et al. 1996, André and Rist 2000), AiA (Adaptive Communication Assistant for Effective Infobahn Access) (André and Rist 2001) and Miau (Multiple Internet Agents for User-Adaptive Decision Support) (André et al. 2000), and a multimedia interface which identifies objects and conveys route plans from a knowledge-based cartographic information system (Maybury 1991).

Others developing general IntelliMedia platforms include CHAMELEON (Brøndsted et al. 1998, 2001), SmartKom (Reithinger 2001, Wahlster et al. 2001), Situated Artificial Communicators (Rickheit and Wachsmuth 1996), Communicative Humanoids (Thórisson 1996, 1997), AESOPWORLD (Okada 1996, 1997) and MultiModal Interfaces like INTERACT (Waibel et al. 1996).
Other moves towards integration are reported in Denis and Carfantan (1993), Granström et al. (2002), Maybury (1997), Maybury and Wahlster (1998), Mc Kevitt (1994, 1995/96), Mc Kevitt et al. (2002) and Pentland (1993).

With the current proliferation of work in the area of Intelligent MultiMedia or MultiModal Systems, one of the central questions people are asking is what is the correct semantic representation. And we must keep in mind of course that multimodal semantics not only applies to multimodal systems but also to efforts on semantic markup of the World Wide Web or The Semantic Web (see Berners-Lee et al. 2001).

2 MultiModal semantic representation

Detailed discussions on the nature and requirements of multimodal semantic representations are to be found in Romary (2001), Maybury (2001) and Bunt and Romary (2002). Chai et al. (2002) present their views on what such a semantics should contain. It is clear that a multimodal semantic representation must support interpretation and generation, any kind of multimodal input and output and a variety of semantic theories. The representation may contain architectural, environmental, and interactional information. Architectural information includes producer/consumer of the information, information confidence, and input/output devices. Environmental representation includes timestamps and spatial information. Interactional representation includes the speaker/user's state.

Much of the work in MultiModal Systems chooses frames or XML to represent multimodal semantics. Frames are used in CHAMELEON, AESOPWORLD, REA (Cassell et al. 2000), Ymir (Thórisson 1996, 1997) and WordsEye (Coyne and Sproat 2001). The semantics can be localised as in CHAMELEON, where the frames are stored in a central blackboard, or distributed throughout various modules as in Ymir. XML-based representations are used in BEAT (Cassell et al. 2001), SmartKom (Wahlster et al. 2001) using M3L (MultiModal Markup Language), MIAMM using MMIL (MultiModal Interface Language) (Reithinger et al. 2002), MUST (Almeida et al. 2002) using MXML (MUST XML) and IMPROVISE (Zhou and Feiner 2001).

There are other multimodal systems using alternative specialised semantic representations. Ahn et al. (1996) and Bunt et al. (1998) use type theoretical logic within the DenK system, an electronic cooperative assistant, to represent domain knowledge, dialogue context, and a context-change theory of communication. Siskind (1995) uses event-logic truth conditions for simple spatial motion verbs in ABIGAIL, which focusses on segmenting continuous motion pictures into distinct events and classifying those events into event types. Bailey et al. (1997) use x-schemas (eXecuting schemas) and f-structs (Feature-STRUCTures) representations which combine schemata representations with fuzzy set theory. They use a formalism of Petri nets to represent x-schemas as a stable state of a system that consists of small elements which interact with each other when the system is moving from state to state. Narayanan et al. (1995) discuss the possibility of developing visual primitives for language primitives and use Schank's (1973) Conceptual Dependency (CD) theory in a 3D language visualisation system. As an alternative to symbolic representation methods for multimodal semantics there are also connectionist methods. Sales et al. (1996) in their Neural State Machine investigate Weightless Artificial Neural Network connectionist representations for grounding visual and linguistic representations. Feldman et al. (1996) in the L0 project look at how a system can learn sentence-picture pairs. They started out using connectionist methods for grammar learning but then adopted a probabilistic framework which was thought to provide more versatile representations. Grumbach (1996) investigates how a hybrid connectionist model can be used to model implicit knowledge (e.g. sensorimotor associations) and explicit knowledge (e.g. a teacher giving verbal advice). Waibel et al. (1996) look at multimodal human computer interfaces with spoken dialogue, face recognition and gesture tracking with mainly neural network and statistical methods.

In addition to the various methods deployed for multimodal semantics within multimodal systems there are also moves from bodies, mainly industrial, to define markup languages for multimodal systems. SALT (Speech Application Language Tags) (2002) is an open standard attempt to augment existing XML-based markup languages in order to provide spoken access to many forms of content through a wide variety of devices, to promote multimodal interaction and to enable voice on the internet. The SALT specification language defines a set of lightweight tags as extensions to commonly used Web-based markup languages. VoiceXML (2002) arose from a need to define a markup language for over-the-telephone dialogues, at a time, 1999, when many pieces of the Web infrastructure as we know it today had not matured. There are also additional semantic markup languages within the XML family of the WorldWideWeb Consortium (W3C), such as the Ontology Web Language (OWL) published by the W3C's Web Ontology Working Group (OWL 2002). OWL is a derivative of the DAML+OIL (DARPA Agent Markup Language, Ontology Interchange Language) Web Ontology Language (DAML+OIL 2002) and builds upon the Resource Description Framework (RDF). Also relevant is the fact that W3C has a Working Group on Multimodal Interaction looking at multimodal interaction on the web, with specific focus on a markup specification for synchronisation across various modalities and devices with a wide range of capabilities (W3C-MMI 2002).

3 MultiModal experiences: CHAMELEON and CONFUCIUS

We have had experience with developing two MultiModal systems, CHAMELEON and CONFUCIUS, and each system has its own requirements in terms of MultiModal semantic representation.

3.1 CHAMELEON

CHAMELEON has a distributed architecture of communicating agent modules processing inputs and outputs from different modalities, each of which can be tailored to a number of application domains. The process synchronisation and intercommunication for CHAMELEON modules is performed using the DACS (Distributed Applications Communication System) Inter Process Communication (IPC) software (see Fink et al. 1996), which enables CHAMELEON modules to be glued together and distributed across a number of servers. Presently, there are ten software modules in CHAMELEON: blackboard, dialogue manager, domain model, gesture recogniser, laser system, microphone array, speech recogniser, speech synthesiser, natural language processor (NLP), and Topsy, as shown in Figure 1. More detail on CHAMELEON can be found in Brøndsted et al. (1998, 2001).

Figure 1: Architecture of CHAMELEON

An initial application of CHAMELEON is the IntelliMedia WorkBench, which is a hardware and software platform as shown in Figure 2. One or more cameras and lasers can be mounted in the ceiling, a microphone array placed on the wall, and there is a table where things (objects, gadgets, people, pictures, 2D/3D models, building plans, or whatever) can be placed. The current domain is a Campus Information System which at present gives information on the architectural and functional layout of a building. 2-dimensional (2D) architectural plans of the building drawn on white paper are laid on the table and the user can ask questions about them. Presently, there is one static camera which calibrates the plans on the table and the laser, and interprets the user's pointing while the system points to locations and draws routes with a laser. Inputs are simultaneous speech and/or pointing gestures and outputs are synchronised speech synthesis and pointing. We currently run all of CHAMELEON on a standard Intel Pentium computer which handles input for the Campus Information System in real-time.

Figure 2: Physical layout of the IntelliMedia WorkBench

3.2 Frame semantics

CHAMELEON's blackboard stores semantic representations produced by each of the other modules and keeps a history of these over the course of an interaction. All modules communicate through the exchange of semantic representations with each other or the blackboard. The meaning of interactions over the course of a MultiModal dialogue is represented using a frame semantics with frames in the spirit of Minsky (1975). The intention is that all modules in the system can produce and read frames.
Frames are coded in CHAMELEON with messages built as predicate-argument structures following a BNF definition. The frame semantics was first presented in Mc Kevitt and Dalsgaard (1997). Frames represent some crucial elements such as module, input/output, intention, location, and timestamp. Module is simply the name of the module producing the frame (e.g. NLP). Inputs are the input recognised whether spoken (e.g. "Show me Hanne's office") or gestures (e.g. pointing coordinates) and outputs the intended output whether spoken (e.g. "This is Hanne's office.") or gestures (e.g. pointing coordinates). Timestamps can include the times a given module commenced and terminated processing and the time a frame was written on the blackboard. The frame semantics also includes representations for two key phenomena in language/vision integration: reference and spatial relations.

Frames can be grouped into three categories: (1) input, (2) output and (3) integration. Input frames are those which come from modules processing perceptual input, output frames are those produced by modules generating system output and integration frames are integrated meaning representations constructed over the course of a dialogue (i.e. all other frames). Here, we shall discuss frames with a focus more on frame semantics than on frame syntax and in fact the actual coding of frames as messages within CHAMELEON has a different syntax.

3.2.1 Input frames

An input frame takes the general form:

[MODULE
INPUT: input
INTENTION: intention-type
TIME: timestamp]

where MODULE is the name of the input module producing the frame, INPUT can be at least UTTERANCE or GESTURE, input is the utterance or gesture and intention-type includes different types of utterances and gestures. An utterance input frame can at least have intention-type (1) query?, (2) instruction! and (3) declarative. An example of an utterance input frame is:

[SPEECH-RECOGNISER
UTTERANCE: (Point to Hanne's office)
INTENTION: instruction!
TIME: timestamp]

A gesture input frame is where intention-type can be at least (1) pointing, (2) mark-area, and (3) indicate-direction. An example of a gesture input frame is:

[GESTURE
GESTURE: coordinates (3, 2)
INTENTION: pointing
TIME: timestamp]

3.2.2 Output frames

An output frame takes the general form:

[MODULE
INTENTION: intention-type
OUTPUT: output
TIME: timestamp]

where MODULE is the name of the output module producing the frame, intention-type includes different types of utterances and gestures and OUTPUT is at least UTTERANCE or GESTURE. An utterance output frame can at least have intention-type (1) query?, (2) instruction! and (3) declarative. An example utterance output frame is:

[SPEECH-SYNTHESIZER
INTENTION: declarative
UTTERANCE: (This is Hanne's office)
TIME: timestamp]

A gesture output frame can at least have intention-type (1) description (pointing), (2) description (route), (3) description (mark-area), and (4) description (indicate-direction). An example gesture output frame is:

[LASER
INTENTION: description (pointing)
LOCATION: coordinates (5, 2)
TIME: timestamp]

3.2.3 Integration frames

Integration frames are all those other than input/output frames. An example utterance integration frame is:

[NLP
INTENTION: description (pointing)
LOCATION: office (tenant Hanne) (coordinates (5, 2))
UTTERANCE: (This is Hanne's office)
TIME: timestamp]
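As noted above, frames are actually coded in CHAMELEON as predicate-argument messages following a BNF definition, although that message syntax is not reproduced here. Purely as an illustrative sketch, and not CHAMELEON's actual message syntax, the NLP integration frame above might be serialised as a Prolog-like term along the following lines:

% Hypothetical predicate-argument rendering of the NLP integration frame
% above; the real CHAMELEON messages follow their own BNF and differ in detail.
frame(nlp,
      intention(description(pointing)),
      location(office(tenant(hanne), coordinates(5, 2))),
      utterance('This is Hanne''s office'),
      time(timestamp)).

Whatever the concrete syntax, such a message carries the same four components emphasised throughout this paper: producer, intention, semantic content and timestamp.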
Things become even more complex with the occurrence of references and spatial relationships:

[MODULE
INTENTION: intention-type
LOCATION: location
LOCATION: location
LOCATION: location
SPACE-RELATION: beside
REFERENT: person
LOCATION: location
TIME: timestamp]

An example of such an integration frame is:

[DOMAIN-MODEL
INTENTION: query? (who)
LOCATION: office (tenant Hanne) (coordinates (5, 2))
LOCATION: office (tenant Jørgen) (coordinates (4, 2))
LOCATION: office (tenant Børge) (coordinates (3, 1))
SPACE-RELATION: beside
REFERENT: (person Paul-Dalsgaard)
LOCATION: office (tenant Paul-Dalsgaard) (coordinates (4, 1))
TIME: timestamp]

We have reported complete blackboard histories for the instruction "Point to Hanne's office" and the query "Whose office is this?" + [pointing] (exophoric/deictic reference) in Mc Kevitt and Dalsgaard (1997) and Brøndsted et al. (1998). With respect to spatial relations we derive all the frames appearing on the blackboard for the example "Who's in the office beside him?" in Mc Kevitt (2000).

To summarise, in CHAMELEON and the IntelliMedia WorkBench we have found that producer/consumer, intention (speech acts), semantic-content, and timestamps are four important components of any multimodal semantic representation. With respect to multimodal semantic-content there is a requirement of representing two key elements of multimodal systems: reference and spatial relations.

4 Seanchaí

Within an intelligent multimedia storytelling platform called Seanchaí we are interested in generating 3D animation automatically. Seanchaí will perform multimodal storytelling generation, interpretation and presentation and consists of Homer, a storytelling generation module, and CONFUCIUS, a storytelling interpretation and presentation module (see Figure 3). The output of the former module could be fed as input to the latter. Homer focuses on natural language story generation. It will receive two types of input from the user, (1) either the beginning or the ending of a story in the form of a sentence, and (2) stylistic specifications, and outputs natural language stories. CONFUCIUS focuses on story interpretation and multimodal presentation. It receives input natural language stories or (play/movie) scripts and presents them with 3D animation, speech and non-speech audio.

Figure 3: Intelligent multimodal storytelling platform – Seanchaí

The knowledge base and its visual knowledge semantic representation are used in CONFUCIUS (see Figure 4), and they could also be adopted in other vision and natural language processing integration applications. The dashed part in the figure includes the prefabricated objects such as characters, props, and animations for basic activities, which will be used in the Animation generation module. When the input is a story, it will be transferred to a script by the script writer, then parsed by the script parser and the natural language processing module respectively. The modules for Natural Language Processing (NLP), Text to Speech (TTS) and sound effects operate in parallel. Their outputs will be fused at code combination, which generates a holistic 3D world representation including animation, speech and sound effects. NLP will be performed using Gate and WordNet, TTS will be performed using Festival or Microsoft Whistler, VRML (Virtual Reality Modelling Language) will be used to model the story 3D virtual world, and visual semantics is represented using a Prolog-like formalism.

Figure 4: System architecture of CONFUCIUS

4.1 Visual knowledge representation

Existing multimodal semantic representations within various intelligent multimedia systems may represent the general organisation of semantic structure for various types of inputs and outputs and are usable at various stages such as media fusion and pragmatic aspects. However, there is a gap between high-level general multimodal semantic representation and lower-level representation that is capable of connecting meanings across modalities.
Such a lower-level meaning representation, which links language modalities to visual modalities, is proposed in Ma and Mc Kevitt (2003, 2005). Figure 5 illustrates the multimodal semantic representation of CONFUCIUS. It is composed of language, visual and non-speech audio modalities. Between the multimodal semantics and each specific modality there are two levels of representation: one is a high-level multimodal semantic representation which is media-independent, the other is an intermediate level media-dependent representation. CONFUCIUS will use an XML-based representation for high-level multimodal semantics and an extended predicate-argument representation for the intermediate representation which connects language with visual modalities, as shown in Figure 5. Our visual semantics decomposition method is at the intermediate representation level (see Ma and Mc Kevitt 2003, 2005). It is suitable for implementation in the 3D graphic modelling language VRML. It will be translated to VRML code by a Java program in CONFUCIUS. We also plan to include non-speech audio in the media-dependent and media-independent semantic representations.

Figure 5: MultiModal semantic representation in CONFUCIUS

The predicate-argument format we apply to represent verb semantics has a Prolog-inspired nomenclature. Each non-atomic action is defined by one or more subgoals, and the name of every goal/subgoal reveals its purpose and effect. Primitives 1 through 14 are basic primitive actions in our framework (Figure 6). We do not claim that these fourteen cover all the necessary primitives needed in modelling observable verbs. Primitives 13 and 14 are actually not primitive actions, but they are necessary in processing complex space displacement. In the first twelve primitives, 1-3 describe position movement, 4 and 5 concern orientation changes, 6-9 focus on alignment, 10 is a composite action (not atomic) composed of lower level primitives, and 11 and 12 concern size (shape) changes. Figure 7 illustrates the hierarchical structure of the twelve primitives. Higher level actions are defined by lower level ones. For instance, alignment operations are composed of move() and/or moveTo() predicates. Definitions of the primitives are given in Ma and Mc Kevitt (2003).

1) move(obj, xInc, yInc, zInc)
2) moveTo(obj, loc)
3) moveToward(obj, loc, displacement)
4) rotate(obj, xAngle, yAngle, zAngle)
5) faceTo(obj1, obj2)
6) alignMiddle(obj1, obj2, axis)
7) alignMax(obj1, obj2, axis)
8) alignMin(obj1, obj2, axis)
9) alignTouch(obj1, obj2, axis)
10) touch(obj1, obj2, axis)
    (for the relation of support and contact)
11) scale(obj, rate)
    (scale up/down, change size)
12) squash(obj, rate, axis)
    (squash or lengthen an object)
13) group(x, [y|_], newObj)
14) ungroup(xyList, x, yList)

Figure 6: Basic predicate-argument primitives within CONFUCIUS

(As is the convention in the programming language Prolog, arguments can be replaced by an underscore if they are undetermined. ungroup removes element x from a list which contains it; yList is the rest of the list after deleting x from the original list, also a basic list operation in Prolog.)

Figure 7: Hierarchical structure of CONFUCIUS' primitives
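The hierarchical composition mentioned above (alignment operations built from move() and/or moveTo()) is defined in Ma and Mc Kevitt (2003) and is not reproduced in this paper. Purely as a rough sketch of what such a composition could look like, and assuming a hypothetical helper centre(obj, axis, value) returning an object's centre coordinate on an axis, alignMiddle might be expressed as:

% Illustrative sketch only, not the actual CONFUCIUS definition:
% align the centres of obj1 and obj2 on one axis by translating obj1.
% centre/3 and the loc term passed to moveTo/2 are assumptions.
alignMiddle(obj1, obj2, axis):-
    centre(obj2, axis, c),
    moveTo(obj1, loc(axis, c)).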
The predicate-argument primitives can be used to provide definitions of visual semantics of verbs. For example,

Example 1, jump:

jump(x):-
    type(x, Animal),
    move(x.feet, _, HEIGHT, _),
    move(x.body, _, HEIGHT, _),
    move(x.feet, _, -HEIGHT, _).

(Semantic constraint – declare an instance of the type 'Animal'. Metaphor usage of vegetal or inanimate characters is not considered here.)

Example 2, call:
– as in "A is calling B" (verb tense is not considered here because it is at sentence level rather than word level). This is one word-sense of call where calling is conducted by telephone. Here is the definition of one word-sense of call which is at the first level of the visual semantic verb representation hierarchy:

call(a):-
    type(a, Person),
    type(tel, Telephone),
    pickup(a, tel.receiver, a.leftEar),
    dial(a, tel.keypad),
    speak(a, tel.receiver),
    putdown(a, tel.receiver, tel.set).
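The subgoals used here (pickup, dial, speak, putdown) are themselves non-atomic actions which, as described above, eventually bottom out in the primitives of Figure 6. Purely as an illustrative sketch, and not the actual CONFUCIUS definition (the hand part a.rightHand and the choice of primitives are assumptions), pickup might decompose roughly as:

% Illustrative sketch only: reach for the object, establish contact and
% move it to the destination using Figure 6 primitives.
pickup(a, obj, destination):-
    type(a, Person),
    moveTo(a.rightHand, obj),
    touch(a.rightHand, obj, y),
    moveTo(obj, destination).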
Further examples are given in Ma and Mc Kevitt (2003).

To summarise, in CONFUCIUS we have found that, as in CHAMELEON, higher-level media-independent semantic representations will be important in forms such as XML and frames, but also that intermediate-level media-dependent representations will be necessary in order to fully represent the correspondences between modalities.

5 Discussion

Our experience with MultiModal semantic representation is that the representations required are dependent on the applications at hand and also on MultiModal system architectures. This is also clear from the discussions found in Romary (2001), Maybury (2001) and Bunt and Romary (2002). There are requirements for higher-level media-independent representations but also lower-level more media-dependent representations. We argue that producer/consumer, intention (speech acts), semantic-content, and timestamps are four important components of any higher-level multimodal semantic representation.

Many of the requirements in multimodal semantic representation come from the need to integrate information from different modalities. In terms of language and vision integration there are requirements for mapping the language and visual information into semantic components which can be fused and integrated and will be necessary for answering queries such as "Whose office is this?" In terms of language and computer graphics integration there are requirements for determining the visual meaning of language actions (verbs) so that, for example, language can be mapped into graphical presentations automatically. So for example with the verb "close" there could be three visual definitions: closing of a normal door (rotation on the y axis), closing of a sliding door (moving on the x axis), or closing of a rolling shutter door (a combination of rotation on the x axis and moving on the y axis).
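In the Prolog-like notation of Examples 1 and 2, and purely as an illustrative sketch rather than CONFUCIUS's actual definitions (the door types and the ANGLE/WIDTH/HEIGHT magnitudes are assumptions), these three visual word-senses of close might be written as:

% Illustrative sketches only, one clause per visual word-sense of "close".
close(x):-                      % hinged door: rotation about the y axis
    type(x, HingedDoor),
    rotate(x, 0, -ANGLE, 0).
close(x):-                      % sliding door: movement along the x axis
    type(x, SlidingDoor),
    move(x, -WIDTH, 0, 0).
close(x):-                      % rolling shutter: rotation about the x axis
    type(x, RollingShutterDoor),       % combined with movement on the y axis
    rotate(x, ANGLE, 0, 0),
    move(x, 0, -HEIGHT, 0).

Disambiguating between such word-senses is exactly the kind of decision that the intermediate, media-dependent level of representation has to support.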
Two key problems in language and vision integration are reference (see Brøndsted 1999, Kievet et al. 2001) and spatial relations (see Mc Kevitt 2000, Zelinsky-Wibbelt 1993), i.e. in multimodal systems there are regular deictic references to the visual context and also numerous spatial relations. Hence, it is a necessary requirement for adequate semantic-content representations to incorporate mechanisms for representing spatial relations and reference.

6 Conclusion and future work

Although traditional and Intelligent MultiMedia or MultiModal Systems are both concerned with text, voice, sound and video/graphics, with the former the computer has little or no understanding of the meaning of what it is presenting, and this is what distinguishes the two. With the current proliferation of multimodal systems the question that everyone is asking is what is the correct multimodal meaning representation. From our experience in developing two multimodal systems, one which integrates the processing of spoken dialogue and vision for both input and output (CHAMELEON) and one which translates text stories into multimodal presentations with 3D graphics, spoken dialogue and non-speech audio (CONFUCIUS), we conclude that multimodal semantic representation: (1) depends on the task at hand, (2) depends on the system architecture, (3) will be necessary at different levels (media-independent and media-dependent), (4) will have at least the following four important components: producer/consumer, intention (speech acts), semantic-content, and timestamps, and (5) will have many forms of representation such as frames, XML, formal logics, event-logic truth conditions, X-schemas and f-structs, or connectionist models. With respect to multimodal semantic-content there is a requirement of representing two key elements of multimodal systems: reference and spatial relations. With respect to multimodal system architectures there are interesting questions as to where multimodal semantic representations lie in systems and whether all the semantics is contained in one single blackboard (CHAMELEON) or distributed throughout the system (Ymir and SmartKom).

Future work will involve experimenting with various semantic representations and architectures with numerous applications and, as we have found with knowledge representation in artificial intelligence, it may be the case that no single representation is the correct one, but more significant will be how we use the representation and what can be achieved with it in terms of multimodality.

References

Ahn, R., R.J. Beun, T. Borghuis, H.C. Bunt and C. van Overveld (1995) The DENK architecture: a fundamental approach to user-interfaces. In Integration of natural language and vision processing (Vol. I): computational models and systems, P. Mc Kevitt (Ed.), 267-281. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Almeida, L., I. Amdal, N. Beires, M. Boualem, L. Boves, E. den Os, P. Filoche, R. Gomes, J.E. Knudsen, K. Kvale, J. Rugelbak, C. Tallec and N. Warakagoda (2002) Implementing and evaluating a multimodal and multilingual tourist guide. In Proc. of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in MultiModal Dialogue Systems, Copenhagen, Denmark, 28-29 June.

André, Elisabeth, G. Herzog and T. Rist (1988) On the simultaneous interpretation of real-world image sequences and their natural language description: the system SOCCER. In Proceedings of the 8th European Conference on Artificial Intelligence, 449-454, Munich, Germany.

André, Elisabeth and Thomas Rist (1993) The design of illustrated documents as a planning task. In Intelligent multimedia interfaces, M. Maybury (Ed.), 75-93. Menlo Park, CA: AAAI Press.

André, E., J. Müller and T. Rist (1996) The PPP persona: a multipurpose animated presentation agent. In Advanced Visual Interfaces, T. Catarci, M.F. Constabile, S. Levialdi and G. Santucci (Eds.), 245-247. New York, USA: ACM Press.

André, E. and T. Rist (2000) Presenting through performing: on the use of multiple lifelike characters in knowledge-based presentation systems. In Proc. of the Second International Conference on Intelligent User Interfaces (IUI 2000), Los Angeles, CA, USA, 1-8.

André, E., T. Rist, S. van Mulken, M. Klesen and S. Baldes (2000) The automated design of believable dialogues for animated presentation teams. In Embodied conversational agents, J. Cassell, J. Sullivan, S. Prevost and E. Churchill (Eds.), 220-255. Cambridge, MA: The MIT Press.

André, E. and T. Rist (2001) Controlling the behaviour of animated presentation agents in the interface: scripting vs. instructing. In AI Magazine, 22(4), 53-66.

Bailey, D., J. Feldman, S. Narayanan and G. Lakoff (1997) Modeling embodied lexical development. In Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society (CogSci97), 19-24, Stanford, CA, USA.

Berners-Lee, T., J. Hendler and O. Lassila (2001) The semantic web. In Scientific American, May.

Brøndsted, T. (1999) Reference problems in CHAMELEON. In Proc. of the ESCA Tutorial and Research Workshop on Interactive Dialogue in MultiModal Systems (IDS-99), Paul Dalsgaard, Paul Heisterkamp and Chin-Hui Lee (Eds.), 133-136. Kloster Irsee, Germany, June.

Brøndsted, T., P. Dalsgaard, L.B. Larsen, M. Manthey, P. Mc Kevitt, T.B. Moeslund and K.G. Olesen (1998) A platform for developing Intelligent MultiMedia applications. Technical Report R-98-1004, Center for PersonKommunikation (CPK), Institute of Electronic Systems (IES), Aalborg University, Denmark, May.

Brøndsted, T., P. Dalsgaard, L.B. Larsen, M. Manthey, P. Mc Kevitt, T.B. Moeslund and K.G. Olesen (2001) The IntelliMedia WorkBench - an environment for building multimodal systems. In Advances in Cooperative Multimodal Communication: Second International Conference, CMC'98, Tilburg, The Netherlands, January 1998, Selected Papers, H. Bunt and R.J. Beun (Eds.), 217-233. Lecture Notes in Artificial Intelligence (LNAI) series, LNAI 2155, Berlin, Germany: Springer Verlag.

Bunt, H.C., R. Ahn, R.J. Beun, T. Borghuis and C. van Overveld (1998) Multimodal cooperation with the DENK system. In Multimodal Human-Computer Communication, H.C. Bunt, R.J. Beun and T. Borghuis (Eds.), 1-12. Berlin, Germany: Springer-Verlag.

Bunt, H. and L. Romary (2002) Towards multimodal content representation. In International standards of terminology and language resources management, LREC 2002, Las Palmas, Spain.

Carenini, G., F. Pianesi, M. Ponzi and O. Stock (1992) Natural language generation and hypertext access. IRST Technical Report 9201-06, Istituto per la Ricerca Scientifica e Tecnologica, Loc. Pantè di Povo, I-38100 Trento, Italy.
Cassell, J., J. Sullivan, S. Prevost and E. Churchill (Eds.) (2000) Embodied conversational agents. Cambridge, MA: MIT Press.

Cassell, J., H. Vilhjalmsson and T. Bickmore (2001) BEAT: the behaviour expression animation toolkit. In SIGGRAPH 2001 Conference Proceedings, Los Angeles, August 12-17, 477-486.

Chai, J., S. Pan and M.X. Zhou (2002) MIND: semantics-based multimodal interpretation framework. In Proc. of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in MultiModal Dialogue Systems, Copenhagen, Denmark, 28-29 June.

Coyne, B. and R. Sproat (2001) WordsEye: an automatic text-to-scene conversion system. In AT&T Labs. Computer Graphics Annual Conference, SIGGRAPH 2001 Conference Proceedings, Los Angeles, Aug 12-17, 487-496.

DAML+OIL (2002) DAML+OIL reference description. http://www.w3.org/TR/daml+oil-reference.

Denis, M. and M. Carfantan (Eds.) (1993) Images et langages: multimodalité et modélisation cognitive. Actes du Colloque Interdisciplinaire du Comité National de la Recherche Scientifique, Salle des Conférences, Siège du CNRS, Paris, April.

Dennett, Daniel (1991) Consciousness explained. Harmondsworth: Penguin.

Feldman, J., G. Lakoff, D. Bailey, S. Narayanan, T. Regier and A. Stolcke (1996) L0 – the first five years of an automated language acquisition project. In Integration of natural language and vision processing (Vol. III): theory and grounding representations, P. Mc Kevitt (Ed.), 205-231. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Fink, G.A., N. Jungclaus, H. Ritter and G. Sagerer (1995) A communication framework for heterogeneous distributed pattern analysis. In Proc. International Conference on Algorithms and Applications for Parallel Processing, V.L. Narasimhan (Ed.), 881-890. IEEE, Brisbane, Australia.

Granström, Björn, David House and Inger Karlsson (Eds.) (2002) Multimodality in language and speech systems. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Grumbach, A. (1996) Grounding symbols into perceptions. In Integration of natural language and vision processing (Vol. III): theory and grounding representations, P. Mc Kevitt (Ed.), 233-248. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Herzog, G. and G. Retz-Schmidt (1990) Das System SOCCER: Simultane Interpretation und natürlichsprachliche Beschreibung zeitveränderlicher Szenen. In Sport und Informatik, J. Perl (Ed.), 95-119. Schorndorf: Hofmann.

Herzog, G., C.-K. Sung, E. André, W. Enkelmann, H.-H. Nagel, T. Rist and W. Wahlster (1989) Incremental natural language description of dynamic imagery. In Wissensbasierte Systeme. 3. Internationaler GI-Kongress, C. Freksa and W. Brauer (Eds.), 153-162. Berlin: Springer-Verlag.

Kievet, L., P. Piwek, R.J. Beun and H. Bunt (2001) Multimodal cooperative resolution of referential expressions in the DENK system. In Advances in Cooperative Multimodal Communication: Second International Conference, CMC'98, Tilburg, The Netherlands, January 1998, Selected Papers, H.C. Bunt and R.J. Beun (Eds.), 197-214. Lecture Notes in Artificial Intelligence (LNAI) series, LNAI 2155, Berlin, Germany: Springer-Verlag.

Kosslyn, S.M. and J.R. Pomerantz (1977) Imagery, propositions and the form of internal representations. In Cognitive Psychology, 9, 52-76.

Ma, Minhua and Paul Mc Kevitt (2003) Semantic representation of events in 3D animation. In Proc. of the Fifth International Workshop on Computational Semantics (IWCS-5), Harry Bunt, Reinhard Muskens and Elias Thijsse (Eds.). Tilburg University, Tilburg, The Netherlands, January.

Ma, Minhua and Paul Mc Kevitt (2005) Visual semantics and ontology of eventive verbs. In Natural Language Processing - IJCNLP-04, First International Joint Conference, Hainan Island, China, March 22-24, 2004, Keh-Yih Su, Jun-Ichi Tsujii, Jong-Hyeok Lee and Oi Yee Kwong (Eds.), 187-196. Lecture Notes in Artificial Intelligence (LNAI) series, LNCS 3248. Berlin, Germany: Springer Verlag.

Maaß, Wolfgang, Peter Wizinski and Gerd Herzog (1993) VITRA GUIDE: Multimodal route descriptions for computer assisted vehicle navigation. Bericht Nr. 93, Universität des Saarlandes, FB 14 Informatik IV, Im Stadtwald 15, D-6600 Saarbrücken 11, Germany, February.

Maybury, Mark (1991) Planning multimedia explanations using communicative acts. In Proceedings of the Ninth American National Conference on Artificial Intelligence (AAAI-91), Anaheim, CA, July 14-19.

Maybury, Mark (Ed.) (1993) Intelligent multimedia interfaces. Menlo Park, CA: AAAI Press.

Maybury, Mark (Ed.) (1997) Intelligent multimedia information retrieval. Menlo Park, CA: AAAI/MIT Press.

Maybury, M. (2001) Would you build your dream house without a blueprint? Working group on software architectures for MultiModal Systems (WG 3). International Seminar on Coordination and Fusion in MultiModal Interaction, Schloss Dagstuhl International Conference and Research Center for Computer Science, Wadern, Saarland, Germany, 29 October - 2 November. (http://www.dfki.de/~wahlster/Dagstuhl Multi Modality/WG 3 The Architecture Dream Team/index.html).
Maybury, Mark and Wolfgang Wahlster (Eds.) (1998) Readings in intelligent user interfaces. Los Altos, CA: Morgan Kaufmann Publishers.

Mc Kevitt, Paul (1994) Visions for language. In Proceedings of the Workshop on Integration of Natural Language and Vision Processing, Twelfth American National Conference on Artificial Intelligence (AAAI-94), Seattle, Washington, USA, August, 47-57.

Mc Kevitt, Paul (Ed.) (1995/1996) Integration of Natural Language and Vision Processing (Vols. I-IV). Dordrecht, The Netherlands: Kluwer Academic Publishers.

Mc Kevitt, Paul (2000) CHAMELEON meets spatial cognition. In Spatial cognition, Seán Ó Nualláin (Ed.), 149-170. US: John Benjamins. Also in Proceedings of MIND-III: The Annual Conference of the Cognitive Science Society of Ireland, Theme: Spatial Cognition, Mary Hegarty and Seán Ó Nualláin (Eds.), Part II, 70-87. Dublin City University (DCU), Dublin, Ireland, August, 1998.

Mc Kevitt, Paul and Paul Dalsgaard (1997) A frame semantics for an IntelliMedia TourGuide. In Proceedings of the Eighth Ireland Conference on Artificial Intelligence (AI-97), Volume 1, 104-111. University of Ulster, Magee, Derry, Northern Ireland, September.

Mc Kevitt, Paul, Seán Ó Nualláin and Conn Mulvihill (Eds.) (2002) Language, vision and music. Amsterdam, The Netherlands: John Benjamins Publishing Co.

Minsky, M. (1975) A framework for representing knowledge. In Readings in knowledge representation, R. Brachman and H. Levesque (Eds.), 245-262. Los Altos, CA: Morgan Kaufmann.

Narayanan, S., D. Manuel, L. Ford, D. Tallis and M. Yazdani (1995) Language visualisation: applications and theoretical foundations of a primitive-based approach. In Integration of Natural Language and Vision Processing (Volume II): Intelligent Multimedia, P. Mc Kevitt (Ed.), 143-163. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Neumann, B. and H.-J. Novak (1986) NAOS: Ein System zur natürlichsprachlichen Beschreibung zeitveränderlicher Szenen. In Informatik. Forschung und Entwicklung, 1(1): 83-92.

Okada, Naoyuki (1996) Integrating vision, motion and language through mind. In Integration of Natural Language and Vision Processing, Volume IV, Recent Advances, Mc Kevitt, Paul (Ed.), 55-80. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Okada, Naoyuki (1997) Integrating vision, motion and language through mind. In Proceedings of the Eighth Ireland Conference on Artificial Intelligence (AI-97), Volume 1, 7-16. University of Ulster, Magee, Derry, Northern Ireland, September.

OWL (2002) Feature synopsis for OWL Lite and OWL. http://www.w3.org/TR/WD-owl-features-2020729/.

Partridge, Derek (1991) A new guide to Artificial Intelligence. Norwood, New Jersey: Ablex Publishing Corporation.

Pentland, Alex (Ed.) (1993) Looking at people: recognition and interpretation of human action. IJCAI-93 Workshop (W28) at The 13th International Conference on Artificial Intelligence (IJCAI-93), Chambéry, France, EU, August.

Pylyshyn, Zenon (1973) What the mind's eye tells the mind's brain: a critique of mental imagery. In Psychological Bulletin, 80, 1-24.

Reithinger, N. (2001) Media coordination in SmartKom. International Seminar on Coordination and Fusion in MultiModal Interaction, Schloss Dagstuhl International Conference and Research Center for Computer Science, Wadern, Saarland, Germany, 29 October - 2 November. (http://www.dfki.de/~wahlster/Dagstuhl Multi Modality/Media Coordination In SmartKom/index.html).

Reithinger, N., C. Lauer and L. Romary (2002) MIAMM - Multidimensional information access using multiple modalities. In Proc. of the International CLASS Workshop on Natural, Intelligent and Effective Interaction in MultiModal Dialogue Systems, Copenhagen, Denmark, 28-29 June.

Retz-Schmidt, Gudula (1991) Recognizing intentions, interactions, and causes of plan failures. In User Modelling and User-Adapted Interaction, 1: 173-202.

Retz-Schmidt, Gudula and Markus Tetzlaff (1991) Methods for the intentional description of image sequences. Bericht Nr. 80, Universität des Saarlandes, FB 14 Informatik IV, Im Stadtwald 15, D-6600 Saarbrücken 11, Germany, EU, August.

Rich, Elaine and Kevin Knight (1991) Artificial Intelligence. New York: McGraw-Hill.

Rickheit, Gert and Ipke Wachsmuth (1996) Collaborative Research Centre "Situated Artificial Communicators" at the University of Bielefeld, Germany. In Integration of Natural Language and Vision Processing, Volume IV, Recent Advances, Mc Kevitt, Paul (Ed.), 11-16. Dordrecht, The Netherlands: Kluwer Academic Publishers.
Romary, L. (2001) Working group on multimodal meaning representation (WG 4). International Seminar on Coordination and Fusion in MultiModal Interaction, Schloss Dagstuhl International Conference and Research Center for Computer Science, Wadern, Saarland, Germany, 29 October - 2 November. (http://www.dfki.de/~wahlster/Dagstuhl Multi Modality/WG 4 Multimodal Meaning Representation/index.html).

Sales, N.J., R.G. Evans and I. Aleksander (1996) Successful naive representation grounding. In Integration of natural language and vision processing (Vol. III): computational models and systems, P. Mc Kevitt (Ed.), 185-204. Dordrecht, The Netherlands: Kluwer Academic Publishers.

SALT (2002) http://www.saltforum.org.

Schank, R.C. (1973) The fourteen primitive actions and their inferences. Memo AIM-183, Stanford Artificial Intelligence Laboratory, Stanford, CA, USA.

Siskind, J.M. (1995) Grounding language in perception. In Integration of Natural Language and Vision Processing (Volume I): Computational Models and Systems, P. Mc Kevitt (Ed.), 207-227. Dordrecht, The Netherlands: Kluwer Academic Publishers.

Stock, Oliviero (1991) Natural language and exploration of an information space: the ALFresco interactive system. In Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI-91), 972-978, Darling Harbour, Sydney, Australia, August.

Thórisson, Kris R. (1996) Communicative humanoids: a computational model of psychosocial dialogue skills. Ph.D. thesis, Massachusetts Institute of Technology.

Thórisson, Kris R. (1997) Layered action control in communicative humanoids. In Proceedings of Computer Graphics Europe '97, June 5-7, Geneva, Switzerland.

VoiceXML (2002) http://www.voicexml.org.

Wahlster, Wolfgang (1988) One word says more than a thousand pictures: On the automatic verbalization of the results of image sequence analysis. Bericht Nr. 25, Universität des Saarlandes, FB 14 Informatik IV, Im Stadtwald 15, D-6600 Saarbrücken 11, Germany, February.

Wahlster, Wolfgang, Elisabeth André, Wolfgang Finkler, Hans-Jürgen Profitlich and Thomas Rist (1993) Plan-based integration of natural language and graphics generation. In Artificial Intelligence, Special issue on natural language generation, 63, 387-427.

Wahlster, W., N. Reithinger and A. Blocher (2001) SmartKom: towards multimodal dialogues with anthropomorphic interface agents. In Proceedings of The International Status Conference: Lead Projects "Human-Computer Interaction", G. Wolf and G. Klein (Eds.), 23-34. Berlin, Germany: Deutsches Zentrum für Luft- und Raumfahrt.

Waltz, David (1975) Understanding line drawings of scenes with shadows. In The psychology of computer vision, Winston, P.H. (Ed.), 19-91. New York: McGraw-Hill.

W3C-MMI (2002) http://www.w3.org/2002/mmi/.

Zelinsky-Wibbelt, Cornelia (Ed.) (1993) The semantics of prepositions: from mental processing to natural language processing (NLP 3). Berlin, Germany: Mouton de Gruyter.

Zhou, M.X. and S. Feiner (2001) IMPROVISE: automated generation of animated graphics for coordinated multimedia presentations. In Advances in Cooperative Multimodal Communication: Second International Conference, CMC'98, Tilburg, The Netherlands, January 1998, Selected Papers, H.C. Bunt and R.J. Beun (Eds.), 43-63. Lecture Notes in Artificial Intelligence (LNAI) series, LNAI 2155. Berlin, Germany: Springer-Verlag.