The TREC2001 Video Track: Information Retrieval on
Digital Video Information
Alan F.Smeaton1, Paul Over2, Cash J. Costello3, Arjen P. de Vries4, David
Doermann5, Alexander Hauptmann6, Mark E. Rorvig7, John R. Smith8, Lide Wu9.
1
Centre for Digital Video Processing, Dublin City University, Dublin, 9, Ireland.
2
National Institute for Standards and Technology, Gaithersburg, Md., USA.
3
Johns Hopkins University Applied Physics Laboratory, Laurel, Md., USA.
4
CWI, Amsterdam, The Netherlands.
5
Laboratory for Language and Media Processing, University of Maryland,
College Park, MD. USA
6
School of Computer Science, Carnegie Mellon University, USA
7
School of Library Information Sciences, University of North Texas, Tx., USA
8
IBM T. J. Watson Research Center, Hawthorne, NY, USA.
9
Dept. of Computer Science, Fudan University, Shanghai, China..
The development of techniques to support content-based access to archives of
digital video information has recently started to receive much attention from the
research community. During 2001, the annual TREC activity, which has been
benchmarking the performance of information retrieval techniques on a range
of media for 10 years, included a “track” or activity which allowed
investigation into approaches to support searching through a video library. This
paper is not intended to provide a comprehensive picture of the different
approaches taken by the TREC2001 video track participants but instead we give
an overview of the TREC video search task and a thumbnail sketch of the
approaches taken by different groups. The reason for writing this paper is to
highlight the message from the TREC video track that there are now a variety
of approaches available for searching and browsing through digital video
archives, that these approaches do work, are scalable to larger archives and can
yield useful retrieval performance for users. This has important implications in
making digital libraries of video information attainable.
1. Introduction
The technical challenges associated with generation, storage and transmission of
digital video information have received much attention over the last few years and we
are now at the stage where we can regard these engineering problems as having made
significant progress. This now allows us to create large libraries of digital video
information and with that comes the associated challenge of developing effective,
efficient and scalable approaches to searching and browsing through video digital
libraries.
TREC is an annual activity which has been ongoing for the last decade and which
has been benchmarking the retrieval effectiveness of a variety of information retrieval
tasks. This has included retrieval on text documents, documents in a variety of
natural languages, spoken audio, web documents, documents corrupted by an OCR
process, and so on. In 2001, TREC included a “track” or activity line which explored
different approaches to searching through a collection of digital video information.
The goal of the TREC2001 video track was to promote progress in content-based
retrieval from digital video by using open, metrics-based evaluation and using
publicly available video.
The TREC2001 video track had 12 participating groups, 5 from US, 2 from Asia
and 5 from Europe and was divided into two distinct tasks namely shot boundary
detection and searching. Shot boundary detection is the task of automatically
determining the boundaries between different camera shots which is usually used as a
fundamental component of video structuring and further details of the shot boundary
detection task can be found in [1]. The searching task involved running queries
against the video collection and what made the queries particularly interesting and
challenging was that they were true multimedia queries as they all had video clips,
images, or audio clips as part of the query, in addition to a text description.
Participating groups used a variety of techniques to match these multimedia queries
against the video dataset, some running fully automated techniques and others
involving users in interactive search experiments. 11 hours of MPEG-1 data was
collected and distributed as well as 74 topics or queries.
The rest of this paper is organised as follows. In the next section we give an
introduction to the search task, covering the video data used, the topics and how they
were formed, the evaluation mechanism and the evaluation metrics adopted. In
section 3, each of the main groups who participated in the search task give an
overview of the approach that they have taken in the search task. Section 4 includes a
brief summary and comparison across the approaches as well as including some
indicative evaluation results in order to allow the reader to gauge the absolute
performance levels of the video retrieval systems. A concluding section assesses the
contribution that the TREC2001 video track has made.
2. The TREC2001 Video Track
Like most of the TREC activities, the video track in TREC2001 was coordinated
by the National Institute for Standards and Technology (NIST) though participating
groups contributed significant amounts of work towards the definition and running of
the track. The search tasks in the video track were extensions of their text analogues
from previous TRECs. Participating groups were asked to index a test collection of
video data and were asked to return lists of shots from the videos in the test collection
which met the information need for a set of topics. The boundaries for the units of
video to be retrieved were supposed to be shots and were not predefined and each
system made its own independent judgment of what frame sequences constituted a
relevant shot.
Participants were free to use whatever indexing and retrieval techniques they
wished though the search task was divided into two distinct classes, one for
interactive retrieval which involved some human in the search loop, and one for
automatic retrieval where the retrieved shots were determined completely
automatically. This distinction arose because the search task was designed to
replicate the situation where a user uses a video information retrieval system to satisfy
an information need, sometimes using interactive retrieval, sometimes completely
automated. Another feature of the search task, which also reflects its real world
nature, is that topics are either “known item” or “general”. In the case of known item
retrieval, the user knows that there is at least one relevant shot in the test collection
and the task is to find those shots known to satisfy the information need, while the
case of general searching reflects the situation where the user does not know whether
or not there are shots in the collection which satisfy the information need.
Although the track decided early on that it should work with more than text
recognised from spoken audio, systems were allowed to use transcripts created by
automatic speech recognition (ASR) and any group which did this had to submit a run
without the ASR or one using only ASR as a baseline. Three groups used ASR.
The test collection for the search task consisted of 85 video programmes
representing over 11 hours of video, encoded in MPEG-1 and totalling over 6 Gbytes
in size. The content came from the OpenVideo project [2], the NIST organisation
itself, and the BBC who provided some stock footage. Further details of the
collection can be found on the web pages for the video track [3]. The videos are
mostly of a documentary nature but vary in their age, production style, and quality.
The only manually created information that search systems were allowed to use was
that which was already as part of the test collection, namely the existing transcripts
associated with the NIST files and the existing descriptions associated with the BBC
material, though most groups did not use this information.
The search topics were designed as multimedia descriptions of an information
need, such as someone searching an archive of video might have in the course of
collecting material to include in a larger video or to answer questions. While today
this may be done largely by searching associated descriptive text created by a human
when the video material was added to the archive, the track's scenario envisioned
allowing the searcher to use a combination of other media in describing his or her
information need. How one might do this naturally and effectively is an open
question. Thus topics in the TREC2001 video track contained not only text but
possibly examples (including video, audio, images) which represent the searcher’s
information need. The topics expressed a very wide variety of needs for video clips:
of a particular object or class of objects, of an activity/event or class of
activities/events, of a particular person, of a kind of landscape, on a particular subject,
using a particular camera technique, answering a factual question, etc.
For a number of practical reasons, the topics were created by the participants which
is an example of the significant contribution to running the track made by those
participants. Each group was asked to formulate several topics they could imagine
being used by someone searching a video archive. NIST submitted topics as well, did
some selection and pruning, and negotiated revisions. All the topics were pooled and
all systems were expected to run on all of these if possible.
All topics contained a text description of the user information need and examples
in other media were optional. There were indicators of the appropriate processing
(automatic, manual or either) and finally, if the topic was a hunt for one or more
known-items, then the list of known-items was included. If examples to illustrate the
information need were included then these were to come from outside the test data.
74 topics were produced in this manner and Table 1 gives a summary of the use of
example media in those topics.
Number of topics
No. topics with image examples / Avg. number of images
No. topics with audio examples / Avg. number of audio
No. topics with video examples / Avg. number of videos
74
26 / 2.0
10 / 4.3
51 / 2.4
Table 1: Distribution of other media in topics
In the case of the known-item search submissions, these were evaluated by NIST but
the evaluation of known item retrieval turned out to be more difficult than anticipated.
One reason for this was because each group was able to define the start/stop
boundaries of the shots they returned we had to use a parameterised matching
procedure between known item and submitted results. Matching a submitted item to a
known-item defined with the topic was a function of the length of the known-item, the
length of the submitted item, the length of the intersection, and two variables which
measured the amount of desired overlap among these. Evaluations were run with
different settings of these overlaps. The measures calculated for the evaluation of
known-item searching were precision and recall with the ground truth or relevant
video clips from the collection being provided by the participants who formulated the
topics. The number of known-items across the topics varied from 1 to 60 with a mean
of 5.63, so the upper bound on precision in a result set of 100 items was quite low.
Submissions for the general search topics were evaluated by retired information
analysts at NIST. They were instructed to familiarize themselves with the topic
material and then judge each submitted clip relevant if it contained material which
met the need expressed in the topic as they understood it, even if there was nonrelevant material present, otherwise they were told to judge the clip as not relevant.
They used web-based software developed at NIST to allow them to (re)play the video,
audio, and image examples included in the topic as well as the submitted clips. A
second set of relevance judgments of the submitted materials was then performed and
overall, the two assessors agreed 84.6% of the time. The measure calculated for the
evaluation general searching was precision but we have also calculated a partial recall
score.
The detailed performance scores from the 8 groups who submitted a total of 21
runs are available online at http://www-nlpir.nist.gov/projects/trecvid/results.html but
before we address retrieval performance, the next section will give a thumbnail sketch
of the different approaches to video indexing and retrieval taken by the TREC2001
video track participants.
3. Participants in the TREC2001 Video Track Search Task
Of the 12 groups who took part in the TREC2001 video track, most completed the
shot boundary detection task and 8 completed the search task and the approaches that
each of these groups have taken is described here. Further descriptions on all of the
participants work can be found in their papers in the TREC2001 proceedings [4].
3.1 Carnegie Mellon University
The CMU Informedia Digital Video library's standard processing modules were used
for the TREC2001 Video evaluations. Among the processing features that were
utilized in Video TREC were: shot detection using simple color histogram
differences, keyframe extraction, speech recognition using the Sphinx speech
recognizer with a 64000 word vocabulary, face detection, video OCR, and image
search based on color histogram features in different color spaces and textures.
The Informedia interface was used in the interactive track with only minor
modifications, most of which involved user preference settings. For example, users
found they wanted to see as many shot results for each query as could fit on the
screen, while geographic maps were irrelevant. The main modification was the
addition of multiple image search engines, which allowed a user to switch between
image retrieval approaches, when nothing relevant could be found using a given
image retrieval approach.
For the automatic track, Informedia image retrieval was modified to process Iframes instead of merely keyframes for the image retrieval. We also added a speaker
identification component, which determined whether a given segment of audio might
have originated from the same speaker as the query audio. Post-mortem analysis of
the results showed that image retrieval and video OCR had the largest impact on
performance.
3.2 Dublin City University (Ireland)
The group from Dublin City University explored interactive search and retrieval from
digital video by employing more than 30 users to perform the search tasks under
controlled, timed conditions. In the Físchlár system developed at DCU [5], several
keyframe browser interfaces have been developed and the task DCU performed was
to evaluate the relative effectiveness of three different keyframe browsers. One of
these keyframe browsers was based on a timeline of groups of related keyframes, a
second browser interface simply played the keyframes on screen as a kind of
slideshow, and the final browser interface was a 4-level hierarchical browser which
allowed dynamic navigation through the keyframe sets. In the DCU experiments, 30
users (either final year undergraduates or research students) were employed to spend
between 5 and 10 minutes on each topic, and each volunteer did interactive searching
on 12 topics using one of the 3 different browsers per topic in round robin fashion.
This gave the DCU group the opportunity to compare the relative performances of the
three keyframe browser interfaces.
3.3 Fudan University (China)
The group from Fudan University tried 17 topics, including people searching, video
text searching, camera motion etc. In order to do the search they also developed
several feature extracting modules. These are qualitative camera motion analysis
module, face detection and recognition module, video text detection and recognition
module, and a speaker recognition and speaker clustering module. In addition they
also used the speech SDK from Microsoft to get transcripts. Based on the above
feature extraction modules, the Fudan retrieval system consists of two parts. One is
the off-line indexing sub-system and the other is on-line searching sub-system.
For the face detection and recognition modules, face detection consists of skincolor based segmentation, and motion and shape filtering; face recognition uses a new
optimal discrimination criterion to get features for recognition [6]. For the video text
detection and recognition module, the group used vertical edge based methods to
detect text blocks and an improved logical level technique to binarize the text blocks.
The recognition was done by commercial software after binarization..
3.4 IBM Research1
The IBM Research team developed a system for automatic and interactive contentbased retrieval of video using visual features and statistical models. The system used
IBM CueVideo for computing automatic shot boundary detection results and selecting
key-frames. The system indexed the key-frames of the video shots using MPEG-7
visual descriptors based on color histograms, color composition, texture and edge
histograms. The MPEG-7 visual descriptors were used for answering automatic
searches using content-based retrieval techniques. The system also used statistical
models for classifying events (fire, smoke, launch), scenes (greenery, land, outdoors,
rock, sand, sky, water), and objects (airplane, boat, rocket, vehicle, faces). The
classifiers were used to generate labels and corresponding confidence scores for each
shot. The features and models were then used together for answering interactive
searches where the user constructed query/filter pipelines that cascaded content-based
and model-based searches. This allowed integration of multiple searches using
different methods for each topic, for example, to retrieve “shots that have similar
color to this image, have label ‘outdoors’ and show a ‘boat.’”
The IBM team also developed a system based on automatic speech
recognition (ASR) and text indexing. The speech-based system was used as a
baseline for the content-based/model-based system. The overall results showed that
the content-based/model-based system performed relatively well compared to the
speech-based system and to other systems. In some cases the speech-based system
provided better results, for example, to retrieve “clips that deal with floods.” In other
cases, the content-based/model-based system provided better results, for example, to
retrieve “shots showing grasslands.” In two cases, the best result was obtained by
combining speech-based and content-based/model-based methods, for example, to
1
The IBM Research Team consisted of members from IBM T. J. Watson Research Center and
IBM Almaden Research Center.
retrieve “clips of Perseus high altitude plane.” The results show promise in particular
for the approach based on statistical modeling for video content classification. The
overall results show that significant improvements are still needed in retrieval
effectiveness in general to develop usable systems. The NIST video retrieval
benchmark is helping to accelerate the necessary technology development.
3.5 Johns Hopkins University
The JHU/APL research group developed an automatic retrieval system for the
TREC2001 video track that relied on the image content of the digital video frames.
Each keyframe in the video collection was indexed by its color histogram and image
texture features. The texture measures were calculated using a descriptor proposed by
Manjunath [7]. Ignoring audio clips or text descriptions, the query representation
consisted of the image and video portions of the information need. A weighted
distance between the image features of the query representation and the keyframes in
the index served as a similarity measure. The shots that were retrieved for a particular
query minimized this distance measure.
3.6 Lowlands Group (Netherlands)
A ‘joint venture’ between research institutes and universities in the Netherlands
approached the challenge offered by the Video Track as the ‘Lowlands Team’2. The
group submitted pure automatic as well as ‘interactive’ runs, investigating the
influence of human interaction on retrieval results.
The visual automatic system heuristically selected a set of filters based on
specialized detectors, by analyzing the query text with WordNet; e.g., the face
detector is associated to categories ‘person, human, individual’. The retrieval system
included a face detector, a camera motion detector (pan, tilt, zoom), a monologue
detector, and a detector for text found in the keyframes using OCR. The filtered
results are ranked with query example images or keyframes from example videos. A
transcript-based automatic system used speech transcripts provided by CMU in a
retrieval model based on language models. A trivial combination of these two
automatic systems has also been tried.
The first interactive run investigated whether better articulated queries are helpful;
e.g., Lunar Rover scenes are characterized by ‘a black sky’, and the Starwars scene by
‘shiny gold’. A second interactive run studied whether a user could improve, with
limited effort, the results by combining the four other approaches.
A (somewhat disappointing) lesson from the retrieval results was that the
transcript-only run outperformed all other approaches, including the interactive runs.
2
The Lowlands Team consisted of the database group of the CWI, the multimedia group of
TNO, the vision group of the University of Amsterdam, and the language technology group of
University of Twente.
3.7 University of Maryland
The University of Maryland, working with visiting researchers from the University of
Oulu, extended methods used for image retrieval based on the spatial correlation for
colors by using a novel color content method, the Temporal Color Correlogram, to
capture the spatio-temporal relationship of colors in a video shot using co-occurrence
statistics. The temporal correlogram is an extension of HSV color correlogram, and
computes an autocorrelation of the quantized HSV color values from a set of frame
samples taken from a video shot.
To implement the approach the video material was segmented to create shots using
VideoLogger video editing software from Virage and our own MERIT system. From
each shot, the first frame was selected as a representative key frame, and the static
image color correlogram was obtained. In order to calculate the temporal correlogram
non-exhaustively and to keep the number of samples in equal for varying shot lengths,
each shot was sampled evenly with a respective sampling delay so that the number of
sample frames did not exceed 40. After segmentation, shot features were fed into our
CMRS retrieval system and queries were defined using either example videos or
example images depending on the respective VideoTREC topic specification.
VideoTREC result submission contained retrieval results of two system
configurations. The first configuration was obtained using the temporal color
correlogram for the retrieval topics that contained video examples in the topic
definition and the second configuration used the color correlogram for topics that
contained example images in their definition.
3.8 University of North Texas
The University of North Texas team extracted frames from the collection at regular
five-second intervals. These frames were then run through a keyframe extraction
process, which removed the redundance of highly similar frames and ensured the
presence of frames outside the prescribed normal distribution limits. The resulting
keyframes were placed into UNT’s Brighton Image Searcher application, which is
based on mathematical measures that correspond to primitive image features. Two
members of the team independently used this application to attempt to retrieve
relevant keyframes for 13 of the original search topics. For each topic, the two people
performing the searches selected a keyframe that appeared to answer the question.
The chosen keyframe was then used as an exemplar to find keyframes similar to it.
Precision scores were better than expected due to the human judgment presence.
4. Summary and Analysis of Approaches
The brief review of the approaches to video indexing and retrieval taken by track
participants shows those approaches to be very varied indeed. Some sites ran
interactive searching with real users (DCU) while others did their query processing
entirely automatically (JHU). Some used automatic speech recognition transcripts
(CMU, IBM, Lowlands) while others based their retrieval entirely on the visual
aspects of video (UNT, UMd). Some groups used many automatically extracted video
features as part of their retrieval (CMU, IBM, Fudan, Lowlands) while others used
only a limited set of identified features (UMd, UNT). Some groups were experienced
in the video indexing field and were able to leverage upon previous experience and
background in working with video (IBM, CMU) while for other groups, this was their
first real experience of doing video indexing and retrieval (JHU, UNT).
As might be expected for the first running of an evaluation framework still very
much under construction, the results are probably most useful for small-scale
comparisons - within-topic and between closely related system variants. Plausible
cross-system comparison will have to wait on better consistency in topic formulation,
agreement on better measures, larger numbers of comparable data points. We expect
some of the participants will do further investigation and analysis of their own
TREC2001 video track results and such analysis may give further insights which will
be of benefit to those participants.
In terms of performance results, overall the absolute performance figures were very
mixed. In the known item search tasks the mean average precision for the best 2
interactive runs (1 site) was a little over 0.6, across ~31 topics, while another group
submitted two runs over the same topics and scored a consistent 0.23. Scores for
comparable automatic runs ranged from 0.002 to 0.609. The use of averages may be
misleading, particularly given the large number of topics for which any given system
found no relevant clips. For the general search tasks the results were generally even
poorer with mean partial average precision scores (based on half the collection)
ranging from 0.03 to 0.23 for interactive runs on 12 topics and from 0.02 to 0.11 for
automatic runs on 28 topics. The multiplicity of factors makes success as well as
failure analysis a real challenge. Ongoing examination will try to explain differences
in performance, but it may be that the first running of any TREC track will always be
the one which irons out the difficulties and throws up the unforseen problems and that
was certainly true here.
5. Conclusions and Contribution of the TREC2001 Video Track
The TREC2001 video track revealed that there are still a lot of issues to be addressed
successfully when it comes to evaluating the performance of retrieval on digital video
information. It was very encouraging to see interest from the community who
specialise in evaluation of interactive retrieval, in what was achieved in the video
track.
Overall, the track was successful with more participants than expected and the
promise of even more groups this year (2002). However the real impact of the track
was not in the measurement of the effectiveness of one approach to retrieval from
digital video libraries compared to another approach but was the fact that we have
now shown that there are several groups working in this area worldwide who have the
capability and the systems to support real information retrieval on significant volumes
of digital video content. As an indication of what our field is now capable of and of
the potential we have for future development, the TREC2001 video track was a
wonderful advertisement. There have also been many lessons learned from the track,
for example the technical issues related to defining frame numbers in video which are
consistent across the decoders used by different participants.
One of the interesting questions thrown up by the general search task was to do
with the complexity of the topics and the relationship between the text and nontextual parts of the topic where topics had image/audio/video examples. Often it was
not clear that all of the example was exemplary, but there was no way to indicate,
even to a human, what aspects of the example to emphasize or ignore. We’re not sure
what to do about this but it may be that by making the topics more focussed, as we are
planning this year, this issue may disappear.
For this year we will use a new dataset which is greater in size, and more
challenging in nature – at the time of writing it appears that the TREC2002 video
track will have over 20 participating groups and that we will repeat the searching task
with a more focussed set of topics, some with multimedia topic descriptions.
We are also expecting to have a variety of detection tasks such as the occurrence
and number of faces, identifying text in the image and then submitting it for OCR,
categorising the audio as either speech, audio or silence, and so on. The search task
will be as before, namely emulating the scenario where a user approaches a video
retrieval system with some information need which is satisfied by the retrieval of
some number of video clips from the video archive and the evaluation will, as before,
be done in terms of precision and recall.
Authors’ Note: The authors wish to extend our sympathies to the family and friends
of our co-author, Mark E. Rorvig, who passed away shortly before this paper was
submitted. We thank Diane Jenkins from UNT for helping us to clarify some of the
contributions from University of North Texas.
References
1.
2.
3.
4.
5.
6.
7.
Smeaton, A.F., Over, P. and Taban R. The TREC-2001 Video Track Report, in NIST
Special Publication 500-250: The Tenth Text REtrieval Conference (TREC 2001)
(available at http://trec.nist.gov/pubs/trec10/t10_proceedings.html)
The OpenVideo Project. Available at http://www.open-video.org/ (last visited 30 April
2002).
The TREC Video Track. Available at http://www-nlpir.nist.gov/projects/t01v/t01v.html
(last visited 30 April 2002)
Proceedings of the Tenth Text REtrieval Conference (TREC-2001), Gaithersburg,
Maryland, November 13-16, 2001 Available at http://trec.nist.gov/pubs.html (last visited
30 April 2002).
Lee, H. et al.: Implementation and Analysis of Several Keyframe-Based Browsing
Interfaces to Digital Video. In Proceedings of the Fourth European Conference on Digital
Libraries (ECDL), Lisbon, Portugal, Springer-Verlag LNCS 1923, J. Borbinha and T.
Baker (Eds), pp.206-218, September 2000.
Yuefei Guo, Lide Wu.“A novel optimal discriminant principal in high dimensional
spaces”, Proc. International Conference on Development and Learning, June 2002, MIT
Manjunath, B. Wu, P. Newsam, S. and Shin, H. A Texture Descriptor for Browsing and
Similarity Retrieval.’ Journal of Signal Processing: Image Communication, 16(1), pp. 3343, September 2000.