TRECVID 2003 Experiments
at MediaTeam Oulu and VTT
Mika Rautiainen, Jani Penttilä, Paavo Pietarila, Kai Noponen, Matti Hosio, Timo Koskela,
Satu-Marja Mäkelä, Johannes Peltola, Jialin Liu, Timo Ojala and Tapio Seppänen

MediaTeam Oulu
P.O. Box 4500, FIN-90014 University of Oulu, Finland
{firstname.lastname}@ee.oulu.fi (umlauts discarded)

VTT Technical Research Centre of Finland
P.O. Box 1100, Kaitoväylä 1, FIN-90571 Oulu, Finland
{firstname.lastname}@vtt.fi (umlauts discarded)
Abstract
MediaTeam Oulu and VTT Technical Research Centre of Finland participated jointly in the semantic feature extraction, manual search and interactive search tasks of TRECVID 2003. In the semantic feature extraction task we submitted results for 15 of the 17 defined semantic categories. Our approach utilized spatio-temporal visual features based on correlations of quantized gradient edges and color values, together with several physical features from the audio signal. The most recent version of our Video Browsing and Retrieval System (VIRE) contains an interactive cluster-temporal browser of video shots exploiting three semantic levels of similarity: visual, conceptual and lexical. The informativeness of the browser was enhanced by incorporating automatic speech recognition (ASR) transcripts into the visual views based on shot key frames. The results for the interactive search task were obtained in a user experiment with eight people and two system configurations: browsing by (I) visual features only (visual and conceptual browsing was allowed, but no browsing with ASR text) or (II) visual features and ASR text (all semantic browsing levels were available and ASR text content was visible). The interactive results using ASR-based features were better than those using only visual features, indicating the importance of successfully integrating both visual and textual features for video browsing. In contrast to the previous version of VIRE, which performed early feature fusion by training unsupervised self-organizing maps, the newest version capitalises on late fusion of feature queries, which was evaluated in the manual search task. This paper gives an overview of the developed system and summarises the results.
1 Introduction
This paper first describes our experiments in the semantic feature detection task. It then introduces the most recent version of our prototype Video Browsing and Retrieval System (VIRE) and the experiments in the manual and interactive search tasks. Concluding remarks are given at the end of the paper.
2 Semantic Feature/Concept Detection
To avoid confusion with low-level features, we denote semantic features as semantic concepts.
2.1 Visual Features
The efficiency of low-level visual features in the detection of semantic concepts was evaluated in our TREC 2002 Video Track experiments [15]. The results motivated further testing with a larger set of concepts. In brief, the propagation of small example sets uses the local neighbourhoods and geometric distances between low-level feature points to define high-level concept confidences for every shot in a feature space. We observed in [16] that different semantic concepts (i.e. high-level features) may co-exist in a video shot. Our visual concept detectors therefore do not partition the feature space into separable classes, but instead provide continuous confidence measures for all items in the space. This has made it possible to reduce the amount of training data to a small set of positive examples for each concept. The method uses the following three temporal low-level features computed from the visual content of a video shot.
Motion Activity
Motion is a prominent property of a video frame sequence. We have used features describing motion activity, based on the definitions of the MPEG-7 Visual standard [4]. The MPEG-7 Motion Activity descriptor computes four features from the motion vectors within a video sequence: discrete intensity, direction, spatial distribution and temporal distribution of motion activity. Our system uses the following values from these features:

• Intensity of motion is a discrete value where high intensity indicates high activity and vice versa. Intensity is defined as the variance of the motion vector magnitudes, normalized by the frame resolution and quantized into the range 1 to 5.

• Average intensity measures the average length of all macroblock vectors.

• Spatial distribution of activity indicates whether the activity is scattered across many regions or concentrated in one large region. This is achieved by measuring the short, medium and long runs of zeros, which provide information about the size and number of moving objects in the scene. Each attribute is extracted from the thresholded vectors obtained from the video data.

All feature values are normalized so that they can be used jointly with other features in self-organizing indexes.
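As a rough illustration of the intensity values above, the sketch below computes the discrete and average intensity from per-macroblock motion vectors. It assumes the vectors are given as an (N, 2) array of (dx, dy) displacements; the quantization thresholds are invented for the example and are not the bins defined by the MPEG-7 standard.

```python
import numpy as np

def motion_activity(vectors, frame_w, frame_h):
    """Sketch of two MPEG-7-style motion activity values from an
    (N, 2) array of per-macroblock (dx, dy) motion vectors."""
    mags = np.hypot(vectors[:, 0], vectors[:, 1])
    # Average intensity: mean macroblock vector length.
    avg_intensity = mags.mean()
    # Discrete intensity: variance of the magnitudes, normalized by the
    # frame resolution and quantized to 1..5 (illustrative bin edges).
    var = mags.var() / (frame_w * frame_h)
    thresholds = [1e-5, 1e-4, 1e-3, 1e-2]  # assumed, not from the standard
    discrete = 1 + sum(var > t for t in thresholds)
    return discrete, avg_intensity
```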
Temporal Color Correlogram (TCC)
The color properties of a video shot are captured in the Temporal Color Correlogram (TCC) feature. Its efficiency against traditional static color descriptors has been demonstrated in [2][3]. TCC captures the correlation of HSV color pixel values in spatio-temporal neighborhoods, approximating the temporal dispersion or congregation of uniform color regions within a video sequence. Traditional static color features do not consider temporal dynamics, as they are often computed from a single keyframe. TCC is computed by sampling 20 video frames evenly over the bounded video sequence. Details of the computational parameters (such as color space quantization and spatial constraints) are given in [3].
Temporal Gradient Correlogram (TGC)
The Temporal Gradient Correlogram (TGC) feature computes local correlations of specific edge orientations, producing an autocorrelogram whose elements correspond to the probabilities of edge directions occurring at particular spatial distances. The feature is computed over 20 video frames sampled evenly over the duration of the bounded video sequence. Due to the temporal sampling, the autocorrelogram is also able to capture temporal changes in spatial edge orientations. In each sample frame, the edge orientations (obtained with Prewitt kernels) are quantized into four bins: horizontal, vertical, or either of the diagonal directions. Details of the feature and the parameters used can be found in the study utilizing it for detecting city/landscape scenes and people in video shots [16].
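A heavily simplified, numpy-only sketch of the TGC idea follows: Prewitt gradients quantized into four orientation bins and a correlogram of same-orientation co-occurrences pooled over the sampled frames. For brevity only horizontal pixel offsets are counted and no gradient-magnitude threshold is applied, so this is an illustration of the principle rather than the parameters of [16].

```python
import numpy as np

def gradient_orientations(img):
    """Quantize per-pixel edge orientation into 4 bins
    (horizontal, vertical, two diagonals) using Prewitt kernels."""
    gx = np.zeros_like(img, dtype=float)
    gy = np.zeros_like(img, dtype=float)
    # Prewitt responses via shifted sums (interior pixels only).
    gx[1:-1, 1:-1] = (img[:-2, 2:] + img[1:-1, 2:] + img[2:, 2:]
                      - img[:-2, :-2] - img[1:-1, :-2] - img[2:, :-2])
    gy[1:-1, 1:-1] = (img[2:, :-2] + img[2:, 1:-1] + img[2:, 2:]
                      - img[:-2, :-2] - img[:-2, 1:-1] - img[:-2, 2:])
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    # Flat regions fall into bin 0 here; a real detector would first
    # threshold the gradient magnitude.
    return ((angle + 22.5) // 45).astype(int) % 4

def tgc(frames, distances=(1, 3, 5)):
    """Toy temporal gradient correlogram: probability that a pixel of
    orientation q co-occurs with the same orientation at horizontal
    offset d, pooled over the sampled frames."""
    counts = np.zeros((4, len(distances)))
    totals = np.zeros((4, len(distances)))
    for f in frames:
        q = gradient_orientations(f)
        for j, d in enumerate(distances):
            a, b = q[:, :-d], q[:, d:]
            for o in range(4):
                mask = a == o
                counts[o, j] += np.count_nonzero(mask & (b == o))
                totals[o, j] += np.count_nonzero(mask)
    return counts / np.maximum(totals, 1)
```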
Feature Fusion and Combined Result Sets
A simple method of propagating small positive example sets was used in the experiments to provide the final confidence values for the high-level concepts. In brief, positive examples are used to create ranked result sets that are later combined with a variant of the Borda count voting principle [17].

Geometric distances (dissimilarities) to the example shot in a low-level feature space yield initial rank-ordered lists D_l^f(k), where the semantic concept f of an example shot k is propagated by 'confidence ranks' to its nearest neighbours in the low-level feature space l. The ranked result list for each example was computed with the L1 distance in feature space l. The ranked lists D_l^f(k) created from the L feature spaces were then combined with variants of the Borda count. In this work, two fusion approaches, based on the minimum and sum operations, were tested:
R_n^f(k) = \Theta\left( \frac{D_1^f(k)}{D_{1,max}^f(k)}, \ldots, \frac{D_L^f(k)}{D_{L,max}^f(k)} \right)    (1)

where
R_n^f(k) = fused rank of a result shot n over the feature result sets 1…L
D_l^f(k) = rank with respect to the query example k representing semantic concept f in feature space l
D_{l,max}^f(k) = maximum rank with respect to the query example k in its result set; equals the size of the data
\Theta = fusion operator, minimum or sum
After fusion of the low-level feature results, the result set R^f(k) for feature f contains the items with ranks describing the 'voted' confidence with respect to the example shot k. Next, the result sets (R^f(1), …, R^f(K)) are further combined with a method that treats the ranks as votes. The method selects the best (minimum) rank for each item to form the final confidence rank. This operation, which is congruent with a Boolean OR operation in fuzzy systems, then sorts the shots and selects the 2000 most confident result shots ordered by the confidence rank S_n^f. The following formulae describe the combination procedure:
S_n^f = \min\left( \frac{R_n^f(1)}{R_{max}^f(1)}, \ldots, \frac{R_n^f(K)}{R_{max}^f(K)} \right)    (2)

C^f = [\,\mathrm{sort}\{S_1^f, \ldots, S_N^f\}\,]_{2000}    (3)

where
S_n^f = minimum rank of a result shot n over the examples 1…K
R_n^f(k) = rank with respect to the query example k with feature f
R_{max}^f(k) = maximum rank with respect to the query example k in its result set R^f(k); equals N
C^f = final ranked set of results for the semantic concept f
[·]_X = X top-ranked items in a sorted list
N = number of items in the feature space, equal to the size of the TRECVID 2003 test set
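A minimal sketch of Eqs. 1-3, with toy dictionaries of ranks standing in for the real feature indexes (shot identifiers and rank values are invented for illustration):

```python
def fuse_example(ranks_per_feature, op=min):
    """Eq. 1: combine rank lists from L feature spaces for one example
    shot k. ranks_per_feature is a list of dicts {shot_id: rank}."""
    shots = ranks_per_feature[0].keys()
    fused = {}
    for n in shots:
        # Normalize each rank by the maximum rank of its result set.
        norm = [r[n] / max(r.values()) for r in ranks_per_feature]
        fused[n] = op(norm)  # op is min or sum (the Borda variants)
    return fused

def combine_examples(fused_per_example, top=2000):
    """Eqs. 2-3: take the best (minimum) normalized rank over the K
    examples and keep the `top` most confident shots."""
    shots = fused_per_example[0].keys()
    score = {n: min(f[n] / max(f.values()) for f in fused_per_example)
             for n in shots}
    return sorted(shots, key=score.get)[:top]
```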
Face Detection
The previously described low-level features are insufficient for some high-level feature tasks in TRECVID 2003, as some of the features were based on the presence of a person in a shot. The features 'news subject face' and 'news subject monologue' required at least one person (a news subject) to be present in a shot. In order to detect these semantic concepts, face regions were located in the shots' key frame images, which were provided by NIST. Prior to the actual face detection, key frames containing skin were filtered to reduce the computational effort of the face classifier. Skin regions were detected from 10x10 image sub-blocks according to a method introduced in [16]. The connected skin-block regions were further scrutinized using a face detector that extracted facial features with overlapping 19x19 windows at multiple resolutions. The facial features were classified with a trained support vector machine [6]. Finally, heuristic rules were used to reduce false and overlapping detections. The face detector was empirically found to give apt results for 'news subject face' when one or two face regions were detected per key frame. The empirical findings were obtained using the TRECVID 2003 development set.
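The block-based skin pre-filter can be sketched as follows. The per-pixel colour rule below is a generic RGB heuristic chosen for illustration, not the classifier of [16], and the 0.5 block threshold is an assumption.

```python
import numpy as np

def skin_blocks(rgb, block=10, thresh=0.5):
    """Flag block x block sub-blocks whose share of skin-coloured
    pixels exceeds `thresh`; flagged blocks would then be passed to
    the (more expensive) face classifier."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    # Illustrative skin-colour rule (not the method of [16]).
    skin = (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b) & (abs(r - g) > 15)
    h, w = skin.shape
    flags = []
    for y in range(0, h - block + 1, block):
        row = []
        for x in range(0, w - block + 1, block):
            row.append(skin[y:y + block, x:x + block].mean() > thresh)
        flags.append(row)
    return np.array(flags)
```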
2.2 Audio Features
MPEG7 Audio
spreaddev, harmdev, flengthdev: Two of the basic algorithms were selected from the MPEG-7 Part 4 Audio standard [12]: spectral spread and harmonicity ratio. Harmonicity ratio describes the proportion of harmonic components in the spectrum; the algorithm outputs 1 for a purely periodic signal and 0 for white noise. The standard deviation of the harmonicity ratio over one shot was used as a feature. A useful feature derived from the harmonicity ratio was the length of the comb filter, an estimate of the delay that maximizes the autocorrelation function. Using this feature it was possible to classify brown-noise-like signals, such as car motor sound, which in general have the smallest possible comb filter length. The feature for one shot was the standard deviation of the comb filter length values. Spectral centroid is the center of gravity of the power spectrum; this feature is useful for audio signals that clearly have much of their energy in the lower or higher parts of the spectrum. Spectral spread is the RMS deviation of the spectrum around its centroid, and thereby describes whether the spectrum is widely spread out or concentrated around the centroid. The values used were standard deviations over a shot.
F0dev, F0med: F0 was calculated as presented in [13] and used for two purposes: F0 detection was used to make the voiced/unvoiced decision, and its value served as a basic feature. The features derived from F0 were its standard deviation and median over a shot. They were naturally used for detecting speech-related semantics: female voice detection and finding the monologue parts of the audio signal.
clustmean, clustmed, clustdev: In order to separate monologue and conversation parts, we used features derived from K-means clustering of mel-cepstrum coefficients. The mel-cepstrum models human perception and is widely used in speech and speaker recognition [14]. The method calculates mel-cepstrum coefficients for voiced frames and clusters the coefficients into 10 clusters with the K-means algorithm, returning the central vectors of the 10 clusters for each frame in a shot. The distances between the clusters in the current and previous frames are calculated, and the mean, median and standard deviation of the cluster distances were used as features to distinguish between monologue and conversation frames.
SNR: The signal-to-noise ratio (SNR) was calculated for each shot from the power ratio between the voiced and unvoiced frames in the shot. This was expected to give an indication of outdoor shots, as outdoor recordings have a higher background noise level.
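This voiced/unvoiced power ratio can be sketched in a few lines. The function below assumes the shot audio has already been cut into frames and that a parallel list of voiced/unvoiced flags is available from the F0 detector; the dB scaling is a common convention, not stated in the text.

```python
import numpy as np

def shot_snr(frames, voiced_flags):
    """SNR of a shot as the ratio of mean power in voiced frames
    (signal) to mean power in unvoiced frames (background), in dB."""
    powers = np.array([np.mean(f.astype(float) ** 2) for f in frames])
    v = np.asarray(voiced_flags)
    signal = powers[v].mean()    # voiced frames carry the speech
    noise = powers[~v].mean()    # unvoiced frames estimate the background
    return 10.0 * np.log10(signal / noise)
```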
Additional Features
We computed amplitude envelope functions, relative difference functions and onsets based on the methods described in [9] and [10], and derived various low-level features from this data. From this feature set we selected the following features for semantic concept detection:
spectral_kurtosis: This is the averaged kurtosis calculated across the frequency domain on each of the sample points on the subband
amplitude envelopes. This measure has its highest values when the subband amplitude envelopes have a sharp peak in their frequency
distribution and low values when the subband energy is evenly distributed across the frequencies. Thus, this measure is directly related
to the spectral flatness of the shot.
onset_kurtosis: This measure gives the average spectral kurtosis of onsets. It is calculated from the relative amplitude envelopes at
time points respective to onset times and gives a measure of how evenly the onset subband energies are distributed across the
frequency domain.
hi_onset_ratio: This is the proportion of onsets whose frequency distribution is slanted towards high frequencies.
spectral_variance: The averaged variance taken across the frequency domain on each of the sample points on the subband amplitude
envelopes. This gives a measure related to the spectral volatility of a shot.
ioi_dev: The standard deviation of the inter-onset intervals within a shot. This feature has its lowest values when the onset trail is
either periodical and steady or very fast.
onset_loudness_variance: The variance of onset loudness within a shot.
onset_skewness: Mean onset skewness is a measure of the asymmetry of the frequency distribution of onsets. It is calculated from the
relative amplitude envelopes at time points respective to onset times. We use this measure as a supplement to f0-based features in
female speech detection since the onsets in female speech are dominantly distributed towards the higher frequencies.
zero-crossing ratio: Zero-crossing ratios were calculated as described in [1].
Feature Selection and Classification
The classifier used was a standard quadratic Gaussian classifier. The training data features were normalized to unit variance and zero mean. The features were selected for each class by first choosing a larger feature base with heuristic methods; the features yielding the best classification performance were then validated for the intermediate-level classification.
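A minimal two-class quadratic Gaussian classifier of this kind can be sketched as below: one full-covariance Gaussian per class on z-normalized features, scored by the log-likelihood difference. Equal class priors and the small regularization term are assumptions for the sketch, not details from the paper.

```python
import numpy as np

class QuadraticGaussian:
    """Two-class quadratic Gaussian classifier sketch; the confidence
    is the log-likelihood difference between the two class models."""

    def fit(self, X, y):
        # z-normalize features (zero mean, unit variance), as in the paper.
        self.mean_, self.std_ = X.mean(0), X.std(0) + 1e-9
        Z = (X - self.mean_) / self.std_
        self.params = {}
        for c in (0, 1):
            Xc = Z[y == c]
            cov = np.cov(Xc.T) + 1e-6 * np.eye(X.shape[1])  # regularized
            self.params[c] = (Xc.mean(0), np.linalg.inv(cov),
                              np.linalg.slogdet(cov)[1])
        return self

    def confidence(self, X):
        Z = (X - self.mean_) / self.std_
        ll = {}
        for c, (mu, icov, logdet) in self.params.items():
            d = Z - mu
            # Quadratic form d^T Sigma^-1 d per sample, plus log|Sigma|.
            ll[c] = -0.5 * (np.einsum('ij,jk,ik->i', d, icov, d) + logdet)
        return ll[1] - ll[0]  # > 0 favours class 1
```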
Initially, classification was used to produce intermediate semantic concepts from opposing semantic pairs collected from the training data. Because the annotations were found to be unreliable, all training data was manually verified. Table 1 shows the intermediate semantic concepts and the features selected for their classification. The training data was selected from the collaborative annotation data, including all siblings of a given concept.
Semantic Concept Pairs | Used Features
'Outdoors' vs. 'Indoors' | spectral_kurtosis, onset_kurtosis, SNR, zero-crossing ratio
'Monologue' vs. 'Conversation' | hi_onset_ratio, F0dev, clustmean, clustmed, clustdev
'Male Speech & Monologue' vs. 'Female Speech & Monologue' | onset_skewness, F0med
'Sport' vs. 'Monologue & Conversation' | spectral_variance, ioi_dev, onset_loudness_variance, SNR
'Vehicle Noise' vs. 'Monologue & Conversation' | spreaddev, harmdev, flengthdev

Table 1. Intermediate semantic concepts
The 'Conversation' concept was created from the shots annotated as 'Speech' but not as 'Monologue'. The output of the classifier was taken as the confidence for a concept.

Table 2 shows the final semantic concepts and the intermediate semantic classifiers that produce the confidence value for the first class. The speech/music discriminator [11] confidence was combined with the speech-related concepts to improve detection. For the combination of the intermediate classifier outputs, we used exactly the same procedure as in the fusion of multiple modalities (see Eq. 4 in Chapter 2.3).
Final Semantic Concept | Intermediate Semantic Classifiers and Combinations
'Outdoors' | confidence of 'Outdoors' vs. 'Indoors'
'News Subject Monologue' | confidence of 'Monologue' vs. 'Conversation', masked with results from 'Speech/music'
'Sport' | confidence of 'Sport' vs. 'Monologue & Conversation'
'Female Speech' | confidence of 'Male Speech & Monologue' vs. 'Female Speech & Monologue', masked with results from 'Speech/music'
'Car, Truck, Bus' | confidence of 'Vehicle Noise' vs. 'Monologue & Conversation'

Table 2. Final semantic concepts for run MT6. The output of each intermediate classifier is used as the confidence value for the class corresponding to the final concept.
2.3 Fusion of Visual and Audio Features
In order to study the effect of combining multiple modalities, late fusion of independent feature detector outputs (auditory and visual) was tested on four semantic concept categories: 'Outdoors', 'News subject monologue', 'Sporting event' and 'Physical violence'. The fusion was performed according to the Borda count variant explained in Chapter 2.1. In order to emphasize results with correlated detector outputs, Equation 1 was modified to sum the normalized confidence ranks of the separate feature detectors instead of selecting the minimum (strongest confidence rank) from the detector outputs:
S_n^f = \mathrm{sum}\left( \frac{D_{audio}^{fx}}{D_{audio,max}^{fx}}, \frac{D_{visual}^{fy}}{D_{visual,max}^{fy}} \right)    (4)

where
S_n^f = sum of the normalized confidence ranks of item n in the visual and audio result sets
D_k^f = confidence rank of item n using detector k for feature f; fx = fy (fusing outputs of the same feature detector) or fx ≠ fy (fusing different feature detectors)
D_{k,max}^f = maximum rank for detector k in its result set R^k; equals the TRECVID 2003 test set size
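Eq. 4 reduces to a few lines of code; the sketch below uses toy rank dictionaries (shot identifiers invented for illustration), with lower fused scores meaning stronger confidence.

```python
def multimodal_fuse(audio_ranks, visual_ranks):
    """Eq. 4 sketch: sum of the normalized confidence ranks from the
    audio and visual detectors for each shot id."""
    a_max = max(audio_ranks.values())
    v_max = max(visual_ranks.values())
    return {n: audio_ranks[n] / a_max + visual_ranks[n] / v_max
            for n in audio_ranks}
```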
2.4 Semantic Feature Task Experiments in TRECVID 2003
Seven runs were submitted for the TRECVID 2003 semantic concept task. Table 3 shows the definitions of the submitted runs and their identifiers. Some of the runs used a fixed visual feature configuration, whereas others used pre-validated feature configurations for each semantic concept. A group of semantic concepts was tested with varying visual feature configurations; this group is denoted with the symbol Ψ and consists of the topics 1, 3, 4, 5, 6, 7, 9, 10, 12, 13, 14 and 16.
Run ID | Used Features | Feature fusion method | Semantic Concepts
FI_OU_MT1 | TGC, TCC, MA, face, f0 | MIN, SUM | Ψ, 2, 8
FI_OU_MT2 | TGC & TCC | MIN | Ψ
FI_OU_MT3 | MA & TGC & TCC | SUM | Ψ
FI_OU_MT4 | MA & TGC & TCC | MIN | Ψ
FI_OU_MT5 | TGC | - | Ψ
FI_OU_MT6 | MPEG-7 Fea, Additional Fea | Gaussian classifier & SUM | 1, 8, 9, 11, 13
FI_OU_MT7 | Visual, Audio Concepts | SUM | 1, 11, 13, 16

Table 3. Submitted semantic concept runs for TRECVID 2003. Cells with italic text reflect that the run used a validation process to obtain the best configuration for each semantic concept; normal text indicates that the property is fixed for every feature in the run. Ψ = {1,3,4,5,6,7,9,10,12,13,14,16}
Results of Visual Runs
Runs MT1 to MT5 tested sets of visual feature configurations and fusion methods for the fixed group of semantic concepts Ψ. Run MT1 consists of pre-validated (best) configurations for each concept in Ψ. Additionally, it contains results for the concepts 'news subject face' and 'female speech', which had specifically designed detectors (see 2.1 and 2.2, respectively, for details). The validation of the MT1 configurations was carried out on the TRECVID 2003 development set using collaborative annotation data [18].

The effect of the various feature configurations and fusion methods is visible in Table 4. The run-wise means of the average precisions (MAP) show that the pre-validated configurations of MT1 obtained the best overall detection result. The best non-validated result with fixed features was obtained using TGC and TCC with minimum-rank feature fusion. Surprisingly, the use of motion did not improve on this. Using only one feature, TGC, gave the weakest MAP. The semantic concepts with the highest individual results were 'weather news' (MT1: 0.501), 'news subject face' (MT1: 0.107), 'sporting event' (MT1: 0.106) and 'people' (MT1: 0.096). Of the fusion methods, minimum rank performed better than sum of ranks.

Regarding domain-specific features, our 'news subject face' detector obtained an average precision of 0.107, whereas the median and maximum results were 0.0835 and 0.182 (26 runs submitted results for 'news subject face').
Run ID | MAP of Group Ψ
FI_OU_MT1 | 0.090
FI_OU_MT2 | 0.075
FI_OU_MT3 | 0.037
FI_OU_MT4 | 0.063
FI_OU_MT5 | 0.043
Median | 0.084
Max | 0.314

Table 4. Mean of the average precisions (MAP) for Ψ. The table also shows the median and maximum MAP for the 26 runs that submitted results for these concepts.
Results of Audio Runs
The results obtained using only audio features were submitted in run MT6. Table 5 shows the average precisions for the individual concepts. Generally speaking, the results using audio alone were rather low. However, by fusing the audio detectors with visual features, encouraging results were obtained for some concepts (see the multi-modal results for details).
Semantic Concept | Audio Run (MT6)
Outdoors (1) | 0.013
News subject monologue (11) | 0.003
Sporting event (13) | 0.070
Female speech (8) | 0.042
Car/Truck/Bus (9) | 0.005

Table 5. Average precisions for the five semantic concepts. The best-performing concept is 'sporting event', whereas 'news subject monologue' is the worst.
Result of the Multi-modal Run
Run MT7 contains four multi-modal results obtained from combinations of two detectors (visual and audio-based). The selection of the detector combinations is based on empirical findings with the TRECVID 2003 development set and the collaborative annotation results. The fusion method used is explained in Chapter 2.3. Table 6 shows the run's concepts and their detector combinations.
Semantic Concept | Visual Semantic Concept Detector | Audio Semantic Concept Detector
'Outdoors' | 'Outdoors' from MT1 | 'Outdoors' from MT6
'News Subject Monologue' | 'News Subject Face' from MT1 | 'News Subject Monologue' from MT6
'Sporting Event' | 'Sporting Event' from MT1 | 'Sporting Event' from MT6
'Physical Violence' | 'Physical Violence' from MT1 | 'Music' from the 'Speech/music' detector (see 2.2)

Table 6. Final semantic concepts and their detector combinations
Table 7 shows the average precisions of the multi-modal feature detectors for the four semantic concept categories. Comparing the results to the individual visual and audio-based detectors shows that for two of the concepts the multi-modal features provide the best overall result.

Semantic Concept | Combined (MT7) | Visual (MT1) | Audio (MT6)
Outdoors (1) | 0.035 | 0.052 | 0.013
News subject monologue (11) | 0.013 | - | 0.003
Sporting event (13) | 0.245 | 0.106 | 0.07
Physical violence (16) | 0.024 | 0.031 | -

Table 7. Average precisions for the four semantic concepts. The table also shows the corresponding result from a single-modality detector, if available.
3 Manual and Interactive Search
3.1 Video Browsing and Retrieval System
Our video browsing and retrieval system VIRE was used in the manual and interactive search experiments. The system uses J2SE, QuickTime for Java and MySQL JDBC components. It consists of a server and client(s); the server provides query services for the various feature indexes and formulates the final result set by fusing the sub-results of the independent features. The client software offers two views, one for constructing automatic video queries and another for browsing. References to the media data and additional meta-information are stored in a MySQL database. The browsing view of the client offers cluster-temporal [19] navigation of the database videos.

VIRE provides search features at three semantic levels. The lowest level provides visual similarity between two video objects, also known as content-based example search. The second level draws out semantics from the low-level features with pre-trained semantic concept detectors. The third level is a highly semantic textual search over automatic speech recognition transcripts.
Visual Similarity with Generic Features
Generic features are measured from the physical properties of the visual video data. The generic feature vectors of the query and database shots are compared using a geometric distance metric that measures dissimilarity. Different features can be used simultaneously in a search; the results of the individual feature queries are combined with late fusion techniques.

The visual similarity is constructed from three physical properties of a shot: (I) Color is the most widely used content property in CBVR research. Similarity by color initially gives very good perceptual correspondence between two color images that are small or short of details; once images are reviewed in detail, other properties of similarity emerge. (II) The structure of the edges in visual imagery is a strong cue for many computer vision applications, such as classifying city and landscape images or segregating natural and non-natural objects. It can also prove invaluable in queries where statistical color information is insufficient to describe the main properties of an image. (III) Motion is a property that is intrinsic to video. It describes properties that cannot be perceived from a static image; on some occasions it can be the only property that reliably describes the content of a sequence, fast-paced action being one example. The following features from these categories have been used in our experiments: Temporal Color Correlogram (TCC), Temporal Gradient Correlogram (TGC) and Motion Activity (MA); they are described in Chapter 2.1. To compute the geometric dissimilarities of the feature vectors we have used the L1 norm, and each feature vector is normalized prior to computing the dissimilarity.
Conceptual Similarity with Semantic Concepts
Semantic features are lexically defined concepts that exist in a shot with a certain confidence value. Trained semantic concept detectors are used to create a series of confidences for each video shot, as described in Chapter 2. Our search system utilizes the concept confidences as features that are compared against other shots' confidence values. The problem with raw confidence values is that they are generated by an algorithm that attempts to estimate the degree to which a concept truly exists in the shot; below some threshold the confidence values therefore begin to generate non-relevant results for a concept query. Our system uses 15 semantic concepts to describe a shot conceptually. These features are ordered by their confidence values, and the ordered list is used in the computation of conceptual similarities between shots. The ordered list of concepts is used differently in manual and interactive searching.
A manual concept query is based on a user-defined concept list that specifies, for each of the 15 features, whether it is a required (1) or not required (0) property of the result. To compute the overall similarity, each list item l is used to compute the dissimilarity for that feature in a database shot n. The dissimilarity is computed from the difference between the detected confidence and the binarized confidence value defined by the user. After computing the dissimilarities for every concept, the overall dissimilarity between the concept list and a shot is computed as the sum of the individual dissimilarity values. All shots are ranked by the smallest dissimilarity value, and the ordered list of shots is output as the conceptual query result.
An example-based concept query is used when browsing by conceptual features. Here the list of required/not required concepts is constructed from the example shot's confidences. Each confidence value is measured against the upper and lower limits of the scale, resulting in distances to the maximum and minimum confidence (confidence distances to the values 1 and 0, respectively). The concepts are ranked by the shortest of these distances, resulting in a list of the most discriminative concepts that describe the shot in conceptual terms. For example, a shot containing an airplane flying in the sky could have 'airplane', 'outdoors' and 'no face' as its top three descriptive concepts. This list is then used in a similar manner to a manual concept query, where a list of binary concepts serves as the query parameter. Our system configuration was set to use the top three concepts for example-based concept queries.
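The two concept-query variants above can be sketched as follows; the concept names and confidence values are invented for illustration.

```python
def discriminative_concepts(confidences, top=3):
    """Example-based concept query: rank concepts by their distance to
    the nearer end of the [0, 1] confidence scale. A small distance
    means the detector is decisive (clearly present or clearly absent)."""
    dist = {c: min(1.0 - v, v) for c, v in confidences.items()}
    return sorted(dist, key=dist.get)[:top]

def concept_dissimilarity(binary_query, shot_conf):
    """Manual concept query: sum of |detected confidence - required
    value| over the concepts in the user's required(1)/not-required(0)
    list; shots are then ranked by the smallest value."""
    return sum(abs(shot_conf[c] - r) for c, r in binary_query.items())
```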
The following 15 semantic concepts were incorporated in our conceptual search: outdoors, news subject face, people, building, road, vegetation, animal, female speech, car/truck/bus, aircraft, news subject monologue, non-studio setting, sporting event, weather news and physical violence. The confidence values were obtained as a result of our participation in the semantic feature detection task (see Chapter 2 for details). The semantic detectors were taken from run MT1, except for features 11, 13, 16 and 8, which were taken from runs MT6 and MT7. The selected concepts were the outcome of an educated guess about the performance of the individual feature detectors.
Lexical Similarity with Term-Frequency Inverse Document Frequency
The text features are derived from the automatic speech recognition (ASR) data made available to all participants by LIMSI-CNRS [8]. The ASR transcripts were indexed into a database, treating each white-space-delimited token as a word. A stop word list was used to exclude grammatical and otherwise undiscriminating words that would have led to poor resolution. The remaining words were then stemmed using the Porter stemming algorithm [7]. The available speaker segmentation helped to create a better contextual organization for the index and to improve its semantic capabilities. This was achieved by expanding every index word to the neighbouring shots until the boundaries of the speaker segment were reached. As a result, every shot became topically connected to its neighbours, presuming that a speaker does not change his/her semantic context while speaking.
To evaluate queries, we used relevance metrics to measure which shots were suitable given a set of topic-related query words. The ranking of the shots was computed using a Term Frequency-Inverse Document Frequency (TFIDF) [5] based classification method that pinpoints relevant words occurring in the ASR transcripts. First, the given query words were automatically stemmed to remove any suffixes. Second, the TFIDF relevance metric was computed for every shot in the database. Finally, the shots were sorted according to the computed TFIDF metric.
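A minimal version of this ranking step is sketched below. It assumes stop-word removal and Porter stemming have already been applied to both the query terms and the per-shot ASR tokens, and uses a standard tf-idf weighting that the paper does not spell out in detail.

```python
import math
from collections import Counter

def tfidf_rank(query_terms, shot_docs):
    """Rank shots by summed tf-idf weight of the query terms.
    shot_docs maps shot_id -> list of (stemmed) ASR tokens."""
    n = len(shot_docs)
    df = Counter()                       # document frequency per term
    for tokens in shot_docs.values():
        df.update(set(tokens))
    scores = {}
    for sid, tokens in shot_docs.items():
        tf = Counter(tokens)
        score = 0.0
        for t in query_terms:
            if df[t]:                    # skip terms absent from the index
                score += (tf[t] / max(len(tokens), 1)) * math.log(n / df[t])
        scores[sid] = score
    return sorted(shot_docs, key=scores.get, reverse=True)
```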
Since lexical similarity was also a search feature for browsing, we constructed an example-based text search that uses the example shot's ASR text as the query source. A list of lexically similar shots is then created with TFIDF as described above.
Late Feature Fusion in Manual and Interactive Queries
In order to construct a final ranked list of the most similar shots, the system has to formulate sub-queries with the previously described similarities. The user can select whether to retrieve shots based on all or only some of the semantic levels. For example, the user can manually define the following complex search:

• Lexical: 'sphinx pyramid'
• Conceptual: the concepts 'outdoors', 'buildings' and 'road' are required
• Visual: example image (of the Sphinx) search using 'color' and 'structure'
The visual similarity search follows the same process used to generate the semantic concept results (Chapter 2.1), except for the clipping of the top-ranked results. The fusion operator Θ has been set to sum for the low-level visual feature search, and the features available in manual search are TCC, TGC and MA. For browsing with visual similarity, we have constrained the system to use only one search configuration: a TGC and TCC search combined with the sum operator.
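A minimal sketch of combining per-feature L1 dissimilarities with the sum operator Θ (the feature vectors and names below are illustrative placeholders, not the actual TGC/TCC descriptors):

```python
def l1(a, b):
    """Geometric L1 dissimilarity between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

def rank_by_sum(query, database, features=("TGC", "TCC")):
    """Theta = sum: add the per-feature L1 distances and sort ascending."""
    scores = {shot_id: sum(l1(query[f], feats[f]) for f in features)
              for shot_id, feats in database.items()}
    return sorted(scores, key=scores.get)
```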
To provide the final result lists for the example described above, the VIRE system commences sub-queries for each of the defined semantic levels. VIRE combines the sub-query result lists, which contain rankings based on brute-force computations of geometric L1 dissimilarities over the entire database. The sub-results can be considered as votes of individual feature 'experts' e. To fuse these feature lists, we use a variant of the Borda count voting method, as in Chapter 2.1. The following formulae describe the procedure:
S_n^t = \sum_{e=1}^{E} \frac{R_n^t(e)}{R_{\max}^t}    (4)

C_t = [\mathrm{sort}\{S_1^t, \ldots, S_N^t\}]_{1000}    (5)

where
S_n^t = combined normalized-rank score of a result shot n for the search topic t over the query 'experts' 1…E (E ∈ {1,2,3}),
R_n^t(e) = rank for the query topic t given by an 'expert' e (e.g. a visual similarity query service),
R_{\max}^t = maximum rank for topic t; equals the size of the TRECVID 2003 test set,
C_t = final ranked set of results for the search topic t,
[ ]_X = X top-ranked items in a sorted list.
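Formulae (4) and (5) can be implemented directly. The following sketch assumes each 'expert' returns a mapping from shot id to rank (1 = best); assigning shots that are absent from an expert's list the maximum rank is our own assumed convention, not stated in the paper.

```python
def fuse_borda(expert_rankings, r_max, top=1000):
    """Borda-count variant of formulae (4)-(5).
    expert_rankings: one dict per expert, shot_id -> rank (1 = best).
    r_max: maximum rank, i.e. the size of the test set."""
    shots = set().union(*(r.keys() for r in expert_rankings))
    scores = {}
    for n in shots:
        # formula (4): sum of ranks normalized by the maximum rank;
        # missing shots get r_max (an assumed, unstated convention)
        scores[n] = sum(r.get(n, r_max) / r_max for r in expert_rankings)
    # formula (5): sort by combined score and keep the 1000 best
    return sorted(shots, key=lambda n: scores[n])[:top]
```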
Manual Query Interface
The manual query interface provides search features on the three semantic levels. Changing a topic from the menu launches a timer and loads the new example shots or images provided by the topic description. The user can select any combination of the three visual features (TGC, TCC or MA) for any example video shot; for an example image, only TCC and TGC are supported. The user can set any configuration of '1s' and '0s' for the 15 semantic features. Additionally, a lexical word search can be constructed by selecting a set of words from the topic description. After the search results have arrived, the user can select any interesting shot as a starting point for navigation in the browsing interface.
Interactive Browsing Interface
The novelty of our approach in the interactive search task lies in cluster-temporal browsing [19] and in the ways the result shots can be viewed. The motivation is to reduce the effect of the ambiguous results usually obtained from a traditional content-based example search. Currently, two dominant approaches are used to realize video searching and browsing: systems either select a content-based presentation of video items or rely on a more traditional time-line-based organization into temporally adjacent items. The disadvantages of these approaches are their inability to associate the computed features with the user's information need (the ambiguity of content-based approaches) and to provide a holistic view over a linear temporal presentation (the inefficiency of time-line-based browsing).
Our approach combines both inter-video similarities and local temporal relations of video shots in a single interface. In interactive
search, users want the computer to act as a ‘humble servant’, providing enough cues and dimensions for users to navigate through the
vast search space towards the relevant objects. In VIRE, users can perform cluster-temporal browsing that combines timeline
presentation of videos with content-based retrieval. Cluster-temporal browsing implies that the video content is not utilized alone, but
in conjunction with temporal video structure.
Figures 1 (a) and (b) illustrate the browsing interface. The panel showing the first row of key frame images displays sequential shots from a single video on a chronological time-line. At any time, the user can scroll through the entire video shot sequence to get a fast overview of the video content. The lower right panel gives the user another content-oriented view, this time over the entire database. The columns below the topmost shots show the most similar matches organized in top-down rank order. Together the columns form a similarity view that provides linkage to other database videos. The similarity is measured based on the features selected in the lower left panel. The selectable features are named 'visual', 'conceptual' and 'lexical'. The user can select a single feature or any combination, depending on which properties he/she wants to browse with.
Figure 1. (a) Cluster-temporal browsing interface; the similarity view organizes result shots column-wise. (b) Alternative organization of the similarity view: similar shots grouped by videos. The images depict a news search for ice hockey and basketball sports content. (c) ASR text is visible under the result items.
When the user locates interesting shots in the similarity view, he/she can open the related video shot in the topmost row so that the interesting shot is located in the middle column. After the video shots in the top row have changed, the system re-computes the similarity view. At any time, the user can update the current view by switching to other feature combinations. The user can also choose between two ways of organizing similar shots in the similarity view. The traditional view puts the results of a single search into columns under the top-row shots. The alternative view counts the number of result shots originating from each video to create video ranks, which are then used to organize the shots into rows by their ranked parent video. Another novel feature in the browser is the added visualization of the textual shot content obtained from the ASR transcripts. Updating the similarity view is computationally demanding, since the browsing speed should be close to real time: to update a view, the system must perform parallel query processing for several example-based queries. Multi-threaded index queries with an efficient query cache provide reasonable access times even for the most complex feature parameters.
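The parallel sub-query evaluation with a query cache might look like the following sketch; `run_subquery` is a placeholder for a real index query, and all names here are illustrative rather than taken from VIRE.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

@lru_cache(maxsize=256)  # stands in for the 'efficient query cache'
def run_subquery(feature, example_shot):
    """Placeholder for one index query (visual, conceptual or lexical);
    a real implementation would hit the shot database."""
    return tuple(f"{feature}:{example_shot}:{rank}" for rank in range(3))

def update_similarity_view(features, example_shot):
    """Issue one sub-query per selected feature in parallel, then collect
    the per-feature result lists for fusion into the similarity view."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda f: run_subquery(f, example_shot), features))
```

Repeated views over the same example shot then hit the cache instead of re-querying the index, which is the point of the design described above.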
3.2 Experimental Setup
MediaTeam submitted 10 result runs, six for the manual and four for the interactive search task. NIST provided 25 search topics that were used in the experiments. Topic no. 120 was left out of the interactive experiments, leaving 24 search topics. A topic contained one or more example video clips or images and a textual topic description to aid the search process. For the visual similarity example search, only the TCC and TGC features were used. The textual search was based on the words given in the topic description.
NIST provided segmentation for all videos, of which more than 32000 video shots belonged to the feature search test collection. IBM organized a collaborative annotation effort for the TRECVID 2003 participants, which resulted in the annotation of the feature search development collection, creating over 60 hours of annotated training material. The TRECVID 2003 data consisted mainly of ABC/CNN news from 1998 and about 13 hours of C-SPAN programming.
In the manual queries, a single test user processed all 25 topics. He used the VIRE system's manual search interface to generate six different result sets with the following configurations (run ID):
• ASR-based lexical search (M5)
• single-example-based visual search (M6)
• all-examples-based visual search (M7)
• all-examples-based visual + ASR search (M8)
• all-examples-based visual + conceptual search (M9)
Run M6 was based on a list of single examples that was provided by Mr. Thijs Westerveld from the University of Twente.
The interactive search task was carried out by a group of eight new users. The test users, two of them female, were mainly information engineering undergraduate students with good computer skills but little experience in searching video databases; all were accustomed to web searching. The 8 users, 24 topics and two variants of the VIRE system were divided into the following configurations:
• I1V (variant A): S1[T1-T6], S7[T7-T12], S2[T13-T18], S8[T19-T24]
• I2VT (variant B): S2[T1-T6], S8[T7-T12], S1[T13-T18], S7[T19-T24]
• I3V (variant A): S3[T1-T6], S5[T7-T12], S6[T13-T18], S4[T19-T24]
• I4VT (variant B): S4[T1-T6], S6[T7-T12], S5[T13-T18], S3[T19-T24]
System variant A disabled the support for lexical searching and ASR text visualization, so that searching was based entirely on visual content. System variant B enabled lexical searching and textual visualization. By setting up the experiment in this way, the learning effect between the system variants was reduced and the effect of fatigue was minimized. The learning effect within the topic sets was not controlled; most of the users processed the topics in numerical order. All users were given a half-hour introduction to the system, with emphasis on the search and browsing interface functions, demonstrated with a couple of example searches. Users were told to spend approximately twelve minutes on each search, during which they navigated the shot database and selected shots that seemed to fit the topic description. The total time for the experiment was about three hours. Users were also asked to fill in a questionnaire about their experiences. The machines the system was running on were 400-800 MHz PCs with the Windows 2000 operating system. During the change of system configurations (halfway through the experiment), users were given refreshments and a break.
3.3 Search Results from TRECVID 2003
The average precisions for the 10 different search configurations are shown in Table 8. The descriptions of the runs can be found in Chapter 3.2. MAP shows the mean of the average precisions over the 25 (manual) and 24 (interactive) topics. As can be seen, the performance of the interactive search is much better than that of the manual search. The average time for an interactive search was 10.68 minutes with system variant A (visual) and 11.63 minutes with variant B (visual+textual). For system variant A, the average number of hits at depth 30 was 8.19, whereas for variant B the respective value was 10.88. The corresponding values for hits at depth 10 were 4.81 and 6.33, respectively. This means that people spent about 9% more time browsing with visual and textual cues, resulting in a 31-33% increase in found matches within the first 10-30 results.
The manual runs' average elapsed time was 1.5 minutes. With the best manual configuration (M5), the average hits per topic at depths 10 and 30 were 2 and 4.44, respectively.
Search Run ID               MAP     Elapsed Time Average (min)
OUMT_I1V (interactive)      0.172   11.2
OUMT_I2VT (interactive)     0.241   12.1
OUMT_I3V (interactive)      0.156   10.2
OUMT_I4VT (interactive)     0.207   11.2
Median (interactive)        0.184   9.5
Max (interactive)           0.476   15
OUMT_M5 (manual)            0.098   1.5
OUMT_M6 (manual)            0.005   1.5
OUMT_M7 (manual)            0.023   1.5
OUMT_M8 (manual)            0.024   1.5
OUMT_M9 (manual)            0.004   1.5
OUMT_M10 (manual)           0.006   1.5 (estimated)
Median (manual)             0.072   3.12
Max (manual)                0.218   15
Table 8. Results for the search runs over 25 (manual) and 24 (interactive) search topics.
Some of the most successful topics for the interactive test users were 116, 118, 114, 110, 106 and 102. According to the questionnaire answers, the VIRE system was easy to learn and somewhat easy to use. This was an improvement over the commentary we received about last year's VIRE [15], partially due to the added functionality that utilised the more semantic ASR transcripts. Of the different browser configurations, people preferred browsing with the visual and lexical features enabled, and the best system variant was type B (ASR transcripts visible) with the result shots organised in columns rather than in video rows.
4 Conclusions
We conducted large-scale experiments with TRECVID 2003. Our experiments showed that our remarkably simple method, based on training with small example sets using color and structural features, generated results slightly above the median. Our multimodal fusion did not behave as we expected; improvements could be achieved, for example, with multi-modal classifiers.
The most successful component of this work was the cluster-temporal browser, which provided many levels of semantics for the user to pursue more meaningful results. Our system provided the user with multiple parallel paths from which he/she could choose the direction of navigation. We combined the temporal adjacency of shots and content-based similarity searches into one interface, giving the user a better understanding of the various inter-relations of video shots during browsing.
The manual search methods based on text were the most successful; there was no improvement in results when combinations of visual, conceptual and lexical features were used together. One reason for the failure of the combined features may lie in implementation issues we found after the experiments were finished; in any case, it is clear that the fusion techniques should be further improved. Nevertheless, the second-best manual configuration was obtained with all visual examples together with a lexical search. Using a single search example performed worse than using all of them.
Acknowledgments
The financial support of the National Technology Agency of Finland and the Academy of Finland is gratefully
acknowledged.
References
[1] Kedem B (1986) Spectral analysis and discrimination by zero-crossings. Proc. IEEE, Vol. 74, No. 11.
[2] Ojala T, Rautiainen M, Matinmikko E & Aittola M (2001) Semantic image retrieval with HSV correlograms. Proc. 12th Scandinavian Conference on Image Analysis, Bergen, Norway, pp. 621-627.
[3] Rautiainen M & Doermann D (2002) Temporal color correlograms for video retrieval. Proc. 16th International Conference on Pattern Recognition, Quebec, Canada.
[4] MPEG-7 standard: ISO/IEC FDIS 15938-3 Information Technology - Multimedia Content Description Interface - Part 3: Visual.
[5] Salton G & Yang C (1973) On the specification of term values in automatic indexing. Journal of Documentation, Vol. 29, pp. 351-372.
[6] Heisele B, Poggio T & Pontil M (2000) Face Detection in Still Gray Images. Tech. Report 1687, Center for Biological and Computational Learning, MIT.
[7] Porter M (1980) An Algorithm for Suffix Stripping. Program, 14(3):130-137.
[8] Gauvain JL, Lamel L & Adda G (2002) The LIMSI Broadcast News Transcription System. Speech Communication, 37(1-2):89-108.
[9] Scheirer ED (1998) Tempo and Beat Analysis of Acoustic Musical Signals. J. Acoust. Soc. Am., 103(1):588-601.
[10] Klapuri A (1999) Sound Onset Detection by Applying Psychoacoustic Knowledge. Proc. IEEE International Conference on Acoustics, Speech and Signal Processing.
[11] Penttilä J, Peltola J & Seppänen T (2001) A speech/music discriminator-based audio browser with a degree of certainty measure. Proc. Infotech Oulu International Workshop on Information Retrieval, Oulu, Finland, pp. 125-131.
[12] MPEG-7 standard: ISO/IEC FDIS 15938-4 Information Technology - Multimedia Content Description Interface - Part 4: Audio.
[13] Dubnowski JJ, Schafer RW & Rabiner LR (1976) Real-Time Digital Hardware Pitch Detector. IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 24, No. 1, pp. 2-8.
[14] Deller JR, Hansen JHL & Proakis JG (2000) Discrete-Time Processing of Speech Signals. IEEE Press, 908 p.
[15] Rautiainen M, Penttilä J, Vorobiev D, Noponen K, Väyrynen P, Hosio M, Matinmikko E, Mäkelä SM, Peltola J, Ojala T & Seppänen T (2002) TREC 2002 Video Track experiments at MediaTeam Oulu and VTT. Text Retrieval Conference TREC 2002 Video Track, Gaithersburg, MD.
[16] Rautiainen M, Seppänen T, Penttilä J & Peltola J (2003) Detecting semantic concepts from video using temporal gradients and audio classification. International Conference on Image and Video Retrieval, Urbana, IL.
[17] Ho T, Hull J & Srihari S (1994) Decision combination in multiple classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(1):66-75.
[18] IBM VideoAnnEx MPEG-7 Server Page (31.10.2003) http://mp7.watson.ibm.com/VideoAnnEx/
[19] Rautiainen M, Ojala T & Seppänen T (2003) Cluster-temporal video browsing with semantic filtering. Advanced Concepts for Intelligent Vision Systems, Ghent, Belgium.
[20] Seppänen T, Väyrynen E & Toivanen J (2003) Prosody-based classification of emotions in spoken Finnish. 8th European Conference on Speech Communication and Technology, Geneva, Switzerland, p. 26.