Matching Slides To Presentation Videos Using SIFT and
Scene Background Matching

Quanfu Fan, Department of Computer Science, University of Arizona, Tucson, AZ 85721, quanfu@cs.arizona.edu
Kobus Barnard, Department of Computer Science, University of Arizona, Tucson, AZ 85721, kobus@cs.arizona.edu
Arnon Amir, IBM Almaden Research Center, 650 Harry Road, San Jose, CA 95120, arnon@almaden.ibm.com
Alon Efrat, Department of Computer Science, University of Arizona, Tucson, AZ 85721, alon@cs.arizona.edu
Ming Lin, Department of Management Information Systems, University of Arizona, Tucson, AZ 85721, mlin@email.arizona.edu
ABSTRACT
We present a general approach for automatically matching electronic slides to videos of corresponding presentations, for use in distance learning and video proceedings of conferences. We deal with a large variety of videos, various frame compositions and color balances, arbitrary slide sequences, and dynamic camera switching, pan, tilt and zoom. To achieve high accuracy, we develop a two-phase process with unsupervised scene background modelling. In the first phase, scale invariant feature transform (SIFT) keypoints are applied to frame-to-slide matching, under a constrained projective transformation (constrained homography), using random sample consensus (RANSAC). Successful first-phase matches are then used to automatically build a scene background model. In the second phase the background model is applied to the remaining unmatched frames to boost the matching performance for difficult cases, such as wide field of view camera shots where the slide shows as a small portion of the frame. We also show that color correction is helpful when color-related similarity measures are used for identifying slides. We provide detailed quantitative experimental results characterizing the effect of each part of our approach. The results show that our approach is robust and achieves high performance on matching slides to a number of videos with different styles.
Categories and Subject Descriptors
I.2.10 [Information Systems]: Information Storage and Retrieval
General Terms
Algorithms, Performance, Experimentation
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
MIR’06, October 26–27, 2006, Santa Barbara, California, USA.
Copyright 2006 ACM 1-59593-495-2/06/0010 ...$5.00.
Keywords
Distance Learning, Electronic Slides, Presentation Videos,
SIFT Keypoints, RANSAC, Video Indexing, Color Correction
1. INTRODUCTION
In this work we consider matching electronic slides to presentation videos captured by one or several cameras, either fixed or allowed to pan, tilt, and zoom. More formally, given
a sequence of video frames F = {f1 , f2 , . . . , fn } and images of the electronic slides S = {s1 , s2 , . . . , sm } associated
with them, we are interested in finding a mapping function
M : F → S such that M(fi ) = sj if frame fi contains slide
sj and M(fi ) = 0 when there is no slide visible in frame fi .
Matching slides to videos provides an attractive way of indexing videos by slides for searching and browsing. By finding the original electronic slide that is displayed at any point in time along the video, it is possible to display clear, high-resolution slides side by side with the video. This matching automates a currently manual process in the preparation of class content for distance learning, thereby reducing time to publish and labor costs. It further allows adjusting the quality of the slide images in video frames (resolution, color, and contrast), and indexing and retrieving video segments using the textual content of the corresponding electronic slides.
Slides to video matching has been studied for about a
decade. Early work such as the Classroom 2000 Project [1]
and BMRC Lecture Browser [14] manually edit time stamps
to match the slides to the video clips. Recently some automatic approaches [3, 5, 6, 10, 12, 15] have been proposed
to match or synchronize slides to videos. Such approaches
usually involve two steps: 1) locating slides or extracting text in the video frames; and 2) identifying slides using recognition methods such as template matching or string matching. In such a matching process, the first step is crucial, as it directly determines the performance of the follow-up recognition task.
In what follows, a slide refers to an image, automatically
extracted from a presentation file (e.g., PPT or PRZ files).
Video frames, often denoted keyframes when they represent
segments of the video, or in short frames, are extracted from
the (compressed) digital video of the presentation.
We divide the video frames into 3 types, or categories.
A frame is called a full-slide frame if the entire frame
shows the (usually entire) slide content. Otherwise, it is
called a small-slide frame if it contains both a slide area
and a substantial portion of scene background. These are
usually wide field of view shots of the presenter along with
the projection screen. A frame without any slide is referred
to as a no-slide frame.
In a video capturing system where several cameras are mixed and allowed to pan, tilt, and zoom, the contents of the video frames can vary greatly. For example, when
a slide is captured in the video, it may appear small, full-frame, or clipped (camera zoom-in). Further, the slide content may appear in whole or in part, such as during an animation, and might suffer partial occlusion, e.g., by the speaker. Even worse, not only do the geometry and content of the projected slide vary; the colors might also change greatly due to switching between cameras with different settings, and due to dynamic changes in camera settings, e.g., an automatic shutter response to changes in ambient illumination or slide brightness. When no electronic slide is present,
the frame may show the speaker and some scene background,
or just background (e.g., part of the classroom, the audience,
or both that is visible in some video frames). In other cases
the frame may still show the projector screen, displaying
non-slide content such as a demo, a video, or a web page.
Throughout the paper we use background to refer to the classroom scene background (not to be confused with the slide template or “background”, which is not discussed in this work).
Figure 1 shows different types of frames captured in a
presentation. In many cases the slide's text is not recognizable, so text-based matching approaches are not appropriate. Also, when a large amount of camera zoom is applied, it can become very challenging to accurately locate the corresponding slide area. It is therefore desirable to have a unified matching approach that can handle all these difficulties without being limited by the camera setup or the frame type.
Towards this goal, we propose a new robust approach to
match slides to video frames regardless of the frame type.
Our algorithm uses random sample consensus (RANSAC) [7]
to robustly estimate homographies between frames and slides
based on scale invariant feature transform (SIFT) keypoint
matching [11]. SIFT keypoint features are highly distinctive and are invariant to image scale and rotation. They can provide correct matching in images subject to noise, blurring, and illumination changes. Hence our approach can deal with a large variety of videos, including those of less than professional quality.
As one might expect, full-frame slides are the easiest to match. When slides are shown small in the video, the matching task is harder, and the RANSAC process might fail to find a good slide match. We propose several strategies to help RANSAC deal with this difficulty effectively and efficiently. Our approach processes the video frames in phases: in the first phase we attempt to find high-confidence matches for the “easier” cases, and the remaining unmatched frames are classified by type. The successful first-phase matches are then used to build a scene
background model automatically. In the second phase, the
model is applied to the remaining unmatched frames to
boost the matching performance of the more difficult cases.
We also show that color correction is helpful when color-related similarity measures are used for identifying slides.
The remainder of the paper is organized as follows. Section 2 gives a short review of SIFT features, homography, and RANSAC. Section 3 presents our slide-to-frame matching algorithm in detail. Section 4 describes a simple color correction model. Section 5 presents the experimental results. Finally, Section 6 concludes and outlines future work.
2. OVERVIEW: SIFT KEYPOINTS,
HOMOGRAPHIES AND RANSAC
The basic step in matching a video frame with its corresponding slide involves estimating the geometric transformation between the frame and the slide, identifying the corresponding slide (i.e., the slide number), and computing the level of confidence in this match. In contrast to most past approaches, which separate the transformation estimation from the slide identification by first looking for the slide region in the frame, here we simultaneously solve both the transformation and the identification problems using RANSAC on SIFT keypoints, and further use those to estimate the matching confidence. Our approach is fully automatic and can handle video capture systems with few limits on the motions (pan, tilt, zoom and even movement) of the cameras.
2.1 The SIFT Descriptor
Recently, great progress has been made in object recognition by using local descriptors such as the scale invariant feature transform (SIFT) keypoints [11] used here. SIFT keypoints are points of local gray-level maxima and minima, detected in a set of difference-of-Gaussian images in scale space. Each keypoint is associated with a location, scale, orientation, and a descriptor, a 128-dimensional feature vector that captures the statistics of gradient orientations around the keypoint. SIFT keypoints are scale and rotation invariant and have been shown to be robust to illumination and viewpoint changes. Figure 2 shows the SIFT keypoints detected in a slide. Since SIFT is based on the local gradient distribution, as seen in the figure, heavily textured regions produce more keypoints than color-homogeneous regions. Fortunately for our application, text on slides yields many distinctive keypoints.
Given the keypoints detected in two images A and B,
Lowe [11] presents a simple matching scheme based on the
saliency of the keypoints. A keypoint PA from image A is
considered a match to a keypoint PB from image B if PA is
the nearest neighbor of PB in the descriptor’s feature space
and

    d(PA, PB)^2 / d(PA', PB)^2 < τ^2        (1)

where d(·, ·) denotes the Euclidean distance between the descriptors of the two keypoints and PA' is the second nearest keypoint to PB in image A. For simplicity, we refer to this matching algorithm as the nearest neighbor (NN) algorithm.
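As a concrete illustration, the NN ratio test of Equation 1 can be sketched with a brute-force search over descriptor arrays. This is a minimal sketch of ours, not the paper's implementation; the array names and the exhaustive distance computation are illustrative.

```python
import numpy as np

def nn_matches(desc_a, desc_b, tau=0.8):
    """Lowe's ratio test: a keypoint of A matches a keypoint of B if it is
    the nearest neighbor in descriptor space and d1^2 / d2^2 < tau^2,
    where d2 is the distance to the second nearest neighbor in A."""
    matches = []
    for j, d in enumerate(desc_b):
        dist2 = np.sum((desc_a - d) ** 2, axis=1)   # squared distances to all of A
        i1, i2 = np.argsort(dist2)[:2]              # nearest and second nearest in A
        if dist2[i1] < tau ** 2 * dist2[i2]:
            matches.append((i1, j))                 # keypoint i1 in A matches j in B
    return matches
```

A real system would use a k-d tree or approximate nearest-neighbor search instead of the O(nm) loop above, but the acceptance criterion is the same.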
Figure 1: Different frame types captured by camera: (a) full slide; (b) zoom-in slide; (c) zoom-out slide; (d) slide with dramatic color change; (e) animated slide; (f) no slide. Except 1(f), which shows two frames without a slide, each pair includes an original slide image (left) and one of the sample video frames of the slide (right). In (b), the red box in the left image indicates the original slide area of the frame on the right. According to the definitions in Section 1, (a), (b), (d) and (e) are full-slide frames, (c) is a small-slide frame and (f) is a no-slide frame.
In [11], a threshold τ = 0.8 was selected for object recognition. In our experiments, we found that this threshold
excludes the majority of outliers while keeping a good portion of the correct matches. Hence, we set τ = 0.8 in all our
experiments. Fig 2 shows a putative matching result found
by this scheme.
The above matching scheme searches keypoint matches in
the whole images. However, if the transformation between
two images is given, we can project the keypoints in one
image to the other one and find keypoint matches locally
within some range r, i.e., we add another criterion:

    |PA(B) − PB| ≤ r        (2)

where |·| is the Euclidean distance between PB and PA(B), the projection of PA into image B. We refer to this search scheme as local mode, in contrast to the above global mode.
When slides are small, keypoints on them become less distinctive and a global search rejects many correct matches. A local search, by contrast, not only gives more keypoint correspondences but is also more likely to yield correct ones due to the geometric constraint (the known transformation). In the following section, we discuss how to find the transformation between a slide and a frame to make this scheme applicable.
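The local-mode search can be sketched as follows, under the assumption that a homography H from image A to image B is already known. The function and parameter names are ours, for illustration only.

```python
import numpy as np

def project(H, pts):
    """Apply a 3x3 homography H to an Nx2 array of points."""
    p = np.c_[pts, np.ones(len(pts))] @ H.T
    return p[:, :2] / p[:, 2:3]

def local_matches(kp_a, desc_a, kp_b, desc_b, H, r=10.0, tau=0.8):
    """Local-mode matching: only keypoints of B within radius r of the
    projected location of a keypoint of A are candidate matches."""
    proj_a = project(H, kp_a)                     # A keypoints mapped into image B
    matches = []
    for i, (pa, da) in enumerate(zip(proj_a, desc_a)):
        near = np.where(np.linalg.norm(kp_b - pa, axis=1) <= r)[0]
        if len(near) == 0:
            continue
        dist2 = np.sum((desc_b[near] - da) ** 2, axis=1)
        order = np.argsort(dist2)
        # accept outright if only one candidate; otherwise apply the ratio test
        if len(near) == 1 or dist2[order[0]] < tau ** 2 * dist2[order[1]]:
            matches.append((i, near[order[0]]))
    return matches
```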
Outlier ratio          10%   25%   50%   75%
# RANSAC iterations    4     12    71    1177

Table 1: The number of RANSAC iterations required to ensure 99% confidence that at least one sample will have no outliers, for a sample size of 4 keypoints (homography).
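The entries of Table 1 follow from the standard RANSAC sample-count formula N = log(1 − p) / log(1 − (1 − e)^s) with confidence p = 0.99 and sample size s = 4. A quick check, rounding to the nearest integer, reproduces the table:

```python
import math

def ransac_iterations(outlier_ratio, sample_size=4, confidence=0.99):
    """Number of RANSAC iterations so that, with probability `confidence`,
    at least one random sample of `sample_size` points is outlier-free."""
    p_good = (1.0 - outlier_ratio) ** sample_size       # prob. of an all-inlier sample
    return round(math.log(1.0 - confidence) / math.log(1.0 - p_good))

for e in (0.10, 0.25, 0.50, 0.75):
    print(e, ransac_iterations(e))
```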
2.2 Fitting a Homography using RANSAC
The mapping between a coplanar set of points and its
perspective projection on an image plane is provided by a
Homography. The homography H between a slide and its
projected image in the frame plane can be determined by
four or more pairs of corresponding keypoints by solving
X ′ = HX where X is a set of slide keypoints and X ′ is
the corresponding frame keypoints. In this work, we used
the Normalized Direct Linear Transformation (See [9] for
details.) to estimate H.
SIFT keypoints are highly distinctive. As shown in Fig
2, the simple matching scheme can give a set of putative
correspondences with a good portion of correct keypoint
matches. However, the outlying correspondences, even a
few, can severely affect the estimation of a homography.
Here we use RANSAC to search for the true keypoint correspondences by imposing a homography on the putative
correspondences found by the NN algorithm.
RANSAC is an iterative random algorithm. At each iteration, a randomly selected subset of four pairs of matched
keypoints is used to compute a hypothesized homography.
The hypothesis is then evaluated by checking how many
of the remaining matched pairs of keypoints are consistent
with it. Hence the required number of iterations to ensure
high probability of detection depends on the percentage of
the outliers in the data. Table 1 shows that the number
of iterations required for homography estimation increases
dramatically as the rate of outliers increases. In our experiments, when the slide area occupies all or most of the frame, there are fewer than 50% outliers among the putative matched keypoints. In such a case, testing 100 hypotheses is
sufficient to ensure a 99% chance of finding the correct homography H. In Section 3, we address some of the difficult
cases where more samplings are required.
3. THE SLIDES TO VIDEO MATCHING
ALGORITHM
(b) Putative matching
(c) Correct matching
Figure 2: Keypoint matching. The top two images
shows keypoints detected in two images. An arrow attached to each keypoint shows the associated
scale and rotation features. The image on the bottom left shows matches proposed by simple nearest
neighbor algorithm. The image on the right shows
proper matches that share a homography from slide
to frame. For clarity we only display about one quarter of the keypoint matches.
In general, RANSAC works very well on our test data and
achieves impressively high performance. However, it faces
a few challenges due to the high complexity of this data.
Firstly, when the slide’s area is very small in the frame and
the slide does NOT have rich texture (e.g., slides containing
only a single plot, a small table or very little text) RANSAC
might fail to find the homography due to the interference
of outliers from the background scene surrounding the slide
area in the frame. Secondly, more than a third of the frames in our data have no slides, and matching them to slides takes a lot of time. We propose an approach called background matching to overcome these problems.
Our algorithm includes three stages - two RANSAC-based
recognition phases with an unsupervised scene background
modelling in between. Initially, all frames are marked UNDECIDED, i.e. their types are unknown. RANSAC is
then run with a small number of iterations to gather useful information about the frame types. In the second stage
background matching and a binary classifier are applied to
further determine the frame types for remaining undecided
ones. Finally, RANSAC is run again, but only on the frames that failed to match a slide in the first run or remained undecided. Note that FULL, SMALL and NOSLIDE in the following algorithm correspond to the full-slide, small-slide and no-slide frames defined above, respectively.
The algorithm is summarized as follows:
1 Mark all frames as UNDECIDED.
2 Run RANSAC with a number of iterations N1 to find
slide matches for each frame. If an acceptable slide
match for a frame is found, mark the frame either as
FULL or as SMALL according to the ratio of the slide
area over the whole image.
3 Using the information obtained in Step 2, build a binary classifier to detect all full-slide frames and mark
them as FULL.
4 If no SMALL frames were found in Step 2, skip to step
5. Otherwise, create an (unsupervised) scene background model, apply background matching on all UNDECIDED frames and classify them as SMALL or
NOSLIDE.
5 Run RANSAC again with a number of iterations N2
on all frames that have not successfully claimed slide
matches in the first run and are not labelled NOSLIDE.
For SMALL frames, use for matching only keypoints
from the slide area.
In our experiments, we set N1 = 100 and N2 = 400 based on Table 1.
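The three-stage control flow above can be summarized in code. The stub signatures (ransac_match, classify_full, background_match) are our assumptions for illustration, standing in for the components described in Sections 2 and 3; they are not the paper's API.

```python
from enum import Enum

class Label(Enum):
    UNDECIDED, FULL, SMALL, NOSLIDE = range(4)

def match_slides(frames, ransac_match, classify_full, background_match,
                 n1=100, n2=400, small_ratio=0.5):
    # ransac_match(frame, iters) -> (slide_id, slide_area_ratio) or None
    labels = {f: Label.UNDECIDED for f in frames}
    matches = {}
    # Step 1-2: cheap RANSAC pass over all frames
    for f in frames:
        m = ransac_match(f, n1)
        if m:
            slide, ratio = m
            matches[f] = slide
            labels[f] = Label.FULL if ratio >= small_ratio else Label.SMALL
    # Step 3: binary classifier marks remaining full-slide frames
    for f in frames:
        if labels[f] is Label.UNDECIDED and classify_full(f):
            labels[f] = Label.FULL
    # Step 4: background matching needs at least one SMALL reference frame
    have_ref = any(l is Label.SMALL for l in labels.values())
    for f in frames:
        if labels[f] is Label.UNDECIDED and have_ref:
            labels[f] = Label.SMALL if background_match(f) else Label.NOSLIDE
    # Step 5: longer RANSAC pass on still-unmatched, non-NOSLIDE frames
    # (for SMALL frames, only keypoints inside the slide area would be used)
    for f in frames:
        if f not in matches and labels[f] is not Label.NOSLIDE:
            m = ransac_match(f, n2)
            if m:
                matches[f] = m[0]
    return labels, matches
```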
3.1 Matching Score
3.1.1 Keypoints-based Score
Each frame fj is compared to all the slides to find the best
matching slide. Let B(si |fj ) denote the quality of matching
between slide si and frame fj . It can be regarded as the
similarity between si and fj . Let kij be the number of keypoint correspondences between si and fj that are consistent
with the best homography found by the RANSAC between
si and fj, then a simple expression for B(si|fj) is,

    B(si|fj) = kij / Σi kij    if kij ≥ m
             = 0               otherwise        (3)
That is, the best matching slide is accepted if and only if the number of consistent correspondences passes a certain threshold, m. We experimented with different values of m; a higher value gives higher confidence in accepted matches, with the tradeoff of a higher risk of rejecting correct matches that have a small number of correspondences. The effect of this threshold on the number of errors is evaluated in Section 5.
3.1.2 Normalized Cross Correlation (NCC) Score
The normalized cross correlation (NCC) used in template matching [8] is another important similarity measure between two images. We define another similarity score as,

    B(si|fj) = C ρij    if kij ≥ m
             = 0        otherwise        (4)

where ρij is the NCC between the projected image of si in fj (after color correction) and the slide content of fj, and C is a constant that makes B a valid probability. ρ is set to 0 if it is negative. m is the same threshold as defined in Equation 3.
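Both scores can be sketched directly from Equations 3 and 4. The helper names are ours, and the normalization constant C is omitted for simplicity.

```python
import numpy as np

def keypoint_score(k, m=5):
    """Eq. (3): scores for all slides, given the vector k of
    RANSAC-consistent correspondence counts k_ij over slides i."""
    k = np.asarray(k, float)
    return np.where(k >= m, k / k.sum(), 0.0)

def ncc(a, b):
    """Normalized cross correlation of two equal-size gray images,
    clamped to 0 as in Eq. (4)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return max(0.0, float((a * b).sum() / denom)) if denom > 0 else 0.0
```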
3.2 Scene Background Matching
The scene background interference with the RANSAC algorithm is largely eliminated by using background matching. If the slide area of a small-slide frame is known, we can reduce the effect of background outliers on RANSAC by pruning the erroneous keypoint matches from the area surrounding the detected slide area. Although approaches proposed in previous work, such as [6] and [10], could be used to detect the slide area in the video, they are not robust when the color of the slide area is not very distinguishable from the scene in the frames. We present a new approach to detect the slide area of a small-slide frame by automatically matching the background between frames.
As can be observed, small-slide frames usually share some static objects in the scene, such as the floor, podium, walls, audience, etc. (see Fig 1(c)). Let f1 be a small-slide frame
with known slide area detected by RANSAC. We call f1 a
reference frame. Let f2 be another small-slide frame with
unknown slide area. Similar to matching a slide to a frame,
we match between frame f2 and frame f1 by RANSAC and
estimate the transformation H between them. The known
area of f1 can now be transformed to f2 by H, to produce
the slide area in f2 . Since the transformation H is established by matching the shared background objects between
the two frames, we call this method “background matching”. Note that the background is not a planar object, so the assumption underlying the simple homography does not hold. A better-grounded transformation in this case is the fundamental matrix [9] between two frames. However, as long as the parallax between the frames is minimal (that is, the camera may undergo pan, tilt and zoom but no significant translation), a homography is still an acceptable approximation.
Usually we only need one reference frame, which can be obtained from the first RANSAC run. If more than one reference frame is available, we can combine them into a “big” reference frame for efficiency. We briefly describe this idea as follows. Let F = {f1, f2, . . . , fm} be a set of small-slide frames with known slide areas R = {r1, r2, . . . , rm}. We pick the frame fk whose slide is located closest to the image center and match all the other reference frames to fk. For each frame fi (i ≠ k), we transform all its keypoints and its slide area ri to the coordinates of fk. Let Ki' and ri' be the new keypoint set and the new slide area for fi, respectively. Then, the new reference frame can be expressed as fnew = (∪i Ki', ∪i ri'). Redundant or similar keypoints are removed from the united set.
Once the slide area is spotted in a frame, we can prune the erroneous keypoint matches from the background and apply the local-mode search to find more correct keypoint matches. Thus, we greatly increase our chance of detecting the correct slide match for the frame in the second RANSAC run. Fig 3 shows the background matching between two frames and the slide area successfully identified by this approach.
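The slide-area transfer step can be sketched as projecting the reference frame's slide polygon through the frame-to-frame homography and then pruning keypoints that fall outside it. The axis-aligned bounding-box test is a simplification of ours, not necessarily the authors' pruning rule.

```python
import numpy as np

def transfer_slide_area(H, corners_f1):
    """Map the known slide-area polygon of reference frame f1 into
    frame f2 using the frame-to-frame homography H (f1 -> f2)."""
    p = np.c_[corners_f1, np.ones(len(corners_f1))] @ H.T
    return p[:, :2] / p[:, 2:3]

def inside_area(points, corners):
    """Bounding-box membership test used to prune background keypoints."""
    lo, hi = corners.min(axis=0), corners.max(axis=0)
    return np.all((points >= lo) & (points <= hi), axis=1)
```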
Figure 3: Scene background matching between two video frames: (a) scene background matching; (b) keypoint matching inside the slide area only. First, background matching is applied between two small-slide frames (left). The white box in the bottom frame bounds the slide area, which is known from a successful match in the first RANSAC phase. It is used to infer the slide area in the top frame. Next (right), keypoint matching is applied between the projected slide area and the slide, eliminating nearly all scene background keypoints.

3.3 Detecting No-slide Frames by SVM
Matching frames without slides against the slides takes a lot of time and can introduce false matches. Thus we would like to prune them from the process as soon as possible. However, without prior knowledge, detection of background and no-slide frames is not trivial. In this section, we show that it is possible to detect no-slide frames without any prior knowledge about the video.
Usually, there is a significant visual difference between full-slide frames and other frames containing substantial background (i.e., small-slide and no-slide frames). We can first separate full-slide frames from the others by using a binary classifier. For frames not classified as full-slide frames, the background matching described above can further tell whether they are small-slide or no-slide frames.
We use a linear-kernel SVM [4] as our binary classifier for this task. The image features used are Color Coherence Vectors (CCV), a color histogram that incorporates spatial information [13]. Given two images I and I', their CCV distance is defined in [13] as,

    dG(I, I') = Σi=1..n (|αi − α'i| + |βi − β'i|)        (5)

where GI = ⟨(α1, β1), . . . , (αn, βn)⟩ and GI' = ⟨(α'1, β'1), . . . , (α'n, β'n)⟩ are the normalized CCVs of I and I', respectively.
We construct the SVM training data set as follows. Let Fl be the set of full-slide frames detected in the first RANSAC run of the matching algorithm. Similarly, Fs is defined as the set of small-slide frames detected. We set the positive samples D+ = S ∪ Fl, where S is the set of original slides. The reason for including Fl in D+ is to account for possibly varied lighting conditions. The negative samples D− are selected as D− = Fs ∪ B, where B is the set of the k frames farthest from D+ in feature space. The distance of a frame fi to D+ is formally defined as,

    dD+(fi) = min over fj ∈ D+ of dG(Ifi, Ifj)        (6)

k is determined based on the sizes of D+ and Fs.
One drawback of this approach is that it has to rely on background matching to single out the small-slide frames first. If the first RANSAC run is unable to provide any information for background matching, this approach can only separate full-slide frames from the others.

4. COLOR CORRECTION
Images can vary greatly in color if captured under different lighting conditions, as shown in Fig 1(d). If an algorithm relies on a color-related measure to identify slides in videos, it has to take color into account. However, most previous work has not addressed this issue.

Figure 4: Color correction: (a) the original slide; (b) the same slide captured by camera; (c) the registered slide; (d) the slide after color correction.

One simple color constancy model is a single linear transformation. Let Cs = (Rs, Gs, Bs) be the color of a pixel in the original slide image and Cf = (Rf, Gf, Bf) be the color of a pixel in the image registered from a frame to the slide image. We can map Cs to Cf by,

    Cf = M Cs        (7)

Solving (7) in the least-squares sense yields M = Cf CsT (Cs CsT)^{-1}, and the corrected color Cs' can be obtained by,

    Cs' = M^{-1} Cf        (8)

Figure 4 shows a slide image after color correction.
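The least-squares solution for M in Equations 7 and 8 can be sketched in a few lines. Here Cs and Cf are 3×N matrices of corresponding pixel colors, and we assume Cs CsT is invertible; the function names are ours.

```python
import numpy as np

def fit_color_transform(Cs, Cf):
    """Least-squares 3x3 color transform M with Cf ≈ M Cs:
    M = Cf Cs^T (Cs Cs^T)^{-1}, for 3xN pixel matrices Cs, Cf."""
    return Cf @ Cs.T @ np.linalg.inv(Cs @ Cs.T)

def correct_colors(M, Cf):
    """Eq. (8): map observed frame colors back to slide colors."""
    return np.linalg.inv(M) @ Cf
```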
5. EXPERIMENTAL RESULTS
In this section, we conduct a detailed performance analysis of each part of our algorithm. We first look at the overall accuracy of our algorithm at recognizing slides in videos. We then examine in detail the effectiveness of the background matching method.
Dataset  Video  Duration (min)  Full Slides  Small Slides  No Slides  Total  PPT Slides
CONF1    #1     47              33           9             61         103    29
CONF1    #2     55              76           3             72         151    39
CONF1    #3     41              38           6             53         97     27
CONF1    #4     20              20           8             23         51     21
CONF1    #5     39              41           12            64         117    34
CONF1    #6     49              53           42            59         154    67
CONF2    #1     68              122          3             103        228    63
CONF2    #2     54              58           1             104        163    68
CONF2    #3     63              50           0             90         140    49
CONF2    #4     52              40           1             61         102    33
CONF2    #5     47              17           0             52         69     53
UNIV     #1     39              33           9             61         103    44
UNIV     #2     48              76           3             72         151    48

Table 2: Summary of the video data used in our experiments.
5.1 Video Data
We constructed a set of 13 presentation pairs (MPEG video + presentation file): 6 presentations from a corporate conference and 5 presentations from a scientific conference, both captured using a similar setup of three pan-tilt-zoom cameras with live video editing; one camera tracks the speaker, one camera covers the projector screen and is used to zoom in on the slides, and the third camera captures the audience [2]. Two more presentations are university seminars (denoted UNIV) captured by two cameras (one giving full-slide views and the other giving either small-slide or audience views). All presentation files were prepared and delivered by different speakers, although in the corporate conference all speakers used the same slide template. The video data is summarized in Table 2.
We manually constructed a ground truth matching between frames and slides for evaluation purposes. Each keyframe
containing a slide is labelled with the slide number and with
full-slide or small-slide . Keyframes containing no slides are
marked with 0. The few frames showing missing slides are
marked as “missing” and are not considered in the evaluation.
5.2 Video And Slide Image Processing
The videos were first processed by shot boundary detection and one keyframe was extracted from each shot. The
frame size is 320×240 for all the videos and the size of slides
is 443 × 342. We resized the slides to 320 × 247 for efficiency
considerations. A few PPT slides were missing in videos 3
and 4 because they were removed from the PPT files by the
speakers after the talk and before providing us with their
files.
The error measure P used in our experiments is defined as,

    P = 1 − (# of correctly identified frames) / (# of ground truth frames)        (9)
Throughout the results section we look at misrecognition error counts, reported as the number of misrecognized/total slides. A slide is considered correctly recognized if it is matched by the system to the same PPT slide number as marked in the ground truth. Hence an error counts as either matching a wrong slide (rare) or failing to match any slide (the more common failure mode).
We experimented with two image similarity measures: one
is the counts of keypoint matches (KP), described in Section 3.1.1 and the other is the normalized cross correlation (
NCC ) [8] defined in Section 3.1.2. The results on the first
data set are presented in Table 3, showing a clear advantage for using color correction with the NCC measure on full-slide frames. However, the keypoint-based score outperforms NCC and requires no color correction, so we selected it as our similarity measure for the rest of the experiments.
We considered the first-phase algorithm with a fixed 100 RANSAC iterations as the base algorithm (keypoint matching). Table 4 displays the results. It can be seen that baseline recognition results are more than 97% accurate for full-slide frames and even more accurate in the classification of no-slide frames. Most errors occur, as expected, in small-slide frames.
It is worth noting that we experimented with keypoint mapping either from frame to slide or from slide to frame. Since the frame and the slide produce substantially different sets of keypoints, and we use nearest neighbors to find the matches, this mapping is not a symmetric relationship. We found a clear advantage in mapping frame points to slide points over going the other direction, in particular for small-slide frames. This is partly because small-slide frames have much fewer keypoints in the slide area, and mapping those keypoints onto the slide has a much higher chance of finding correct correspondences than going the other direction. Hence all the experiments were carried out this way.
Next we compare the background matching algorithm described in Section 3.2 with the keypoint matching performance. The background matching algorithm uses a fixed number of 400 iterations in the second-phase RANSAC run. Table 5 shows the performance of these two algorithms on the data set. In this table, KP(100) denotes the keypoint matching performance and KP(400) denotes the performance of the same baseline algorithm when the RANSAC is allowed 400 iterations. The background matching, denoted as BP(), was tested twice, once in local-mode matching and once in global-mode matching. Each mode uses 100 RANSAC iterations in the first phase and 400 RANSAC iterations in the second phase, after background matching.
As we can see from Table 5, both keypoint matching and background matching achieve high recognition performance, with background matching performing significantly better than keypoint matching on small slides. The improvement is in part attributable to running 400 more RANSAC iterations; hence we provide the results of the baseline KP(400) for comparison. Note, however, that this run is much slower than the other runs because it runs 400 iterations on all the frames, as opposed to running them only on the small slides in the background matching method. Moreover, the background matching method still outperforms the KP(400) run.
There is a noticeable difference between the full-slide
recognition error on the CONF1 (3.10%) and CONF2
(23.34%) data sets. Checking these presentations, the higher number of errors in CONF2 is attributable to the use of many more
videos and animations, some slides with very little content
(including a couple of plain blue slides), and several cases
of duplicated identical slides. For the last, we expect that
introducing temporal analysis into the matching process,
taking into account the mostly sequential slide order during a
typical presentation, will make it possible to accurately label
the identical slides.
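As an illustration of what such temporal analysis could look like (a hypothetical sketch, not something implemented in this work), a simple dynamic program can choose one slide per frame by trading per-frame match scores against penalties for out-of-order slide changes; all names and penalty values below are our own:

```python
import numpy as np

def temporal_align(scores, switch_penalty=1.0, back_penalty=3.0):
    """Assign one slide per frame by maximizing the total match
    score minus transition penalties, preferring to stay on the
    current slide or move forward (slides are mostly sequential).
    scores[t][s] = match score of frame t against slide s."""
    scores = np.asarray(scores, float)
    T, S = scores.shape
    dp = np.zeros((T, S))          # best cumulative score ending at (t, s)
    back = np.zeros((T, S), int)   # backpointers for recovering the path
    dp[0] = scores[0]
    for t in range(1, T):
        for s in range(S):
            trans = np.array([0.0 if p == s else
                              -switch_penalty if p < s else
                              -back_penalty
                              for p in range(S)])
            prev = dp[t - 1] + trans
            back[t, s] = int(np.argmax(prev))
            dp[t, s] = prev[back[t, s]] + scores[t, s]
    seq = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        seq.append(int(back[t, seq[-1]]))
    return seq[::-1]
```

With equal scores for duplicated slides, the transition penalties break the tie in favor of the temporally consistent assignment, which is exactly the ambiguity described above.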
We further tested the robustness of scene background matching to the choice of the matching threshold m. Figure 5
shows how the number of misses increases as the threshold is raised. The background matching with local keypoint
Frame         Similarity   KP(100)   BP(100+400) Local
full-slide    NCC1         42/258    44/258
              NCC2         22/258    27/258
              KP            8/258    10/258
small-slide   NCC1         29/76     14/76
              NCC2         29/76     13/76
              KP           27/76     15/76
no-slide      NCC1          0/327     3/327
              NCC2          0/327     3/327
              KP            0/327     3/327
Total         NCC1         71/661    61/661
              NCC2         51/661    43/661
              KP           35/661    28/661
Table 3: Recognition errors/total frames compared under different similarity measures: NCC1 is
normalized cross-correlation without color correction, NCC2 is with color correction, and KP is keypoint
matching (no color correction is needed).
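NCC1 and NCC2 in the table are normalized cross-correlation scores computed without and with color correction, respectively. A minimal sketch of a zero-mean NCC similarity (illustrative only; the paper's region extraction and color-correction steps are not shown):

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalized cross-correlation between two equally
    sized image regions, returned in [-1, 1]."""
    a = np.asarray(a, float).ravel()
    b = np.asarray(b, float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0
```

Note that zero-mean NCC is already invariant to a single global gain and offset in intensity; a separate color-correction step addresses distortions beyond that.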
Dataset   Video   # full-slide      # small-slide     # no-slide       Total
CONF1     1       1/33              2/9               0/61             3/103
          2       2/76              3/3               0/72             5/151
          3       1/37              1/4               0/50             2/91
          4       0/18              4/6               0/23             4/47
          5       2/41              4/12              0/64             6/117
          6       2/53              13/42             0/57             15/152
          total   8/258 (3.10%)     27/76 (35.52%)    0/327 (0.00%)    35/661 (5.29%)
CONF2     1       14/92             2/3               0/103            16/198
          2       9/58              1/1               0/104            10/163
          3       10/50             0/0               1/90             11/140
          4       27/40             1/1               0/61             28/102
          5       0/17              0/0               0/52             0/69
          total   60/257 (23.34%)   4/5 (80.00%)      1/410 (0.24%)    65/672 (9.67%)
UNIV      1       0/48              9/39              0/66             9/153
          2       3/54              11/36             1/101            15/191
          total   3/102 (2.94%)     20/75 (26.66%)    1/167 (0.59%)    24/344 (6.97%)
Table 4: Error rates of the baseline algorithm (i.e., no background matching), marked by the number of
misrecognized/total frames (error percentage) for full-slide, small-slide and no-slide frames, per
presentation. Number of iterations of the first RANSAC = 100.
Frame         Dataset   KP(100)   KP(400)   BP(100+400) Global   BP(100+400) Local
full-slide    CONF1      8/258     9/258     8/258                10/258
              CONF2     60/257    56/257    60/257                51/257
              UNIV       3/102     3/102     3/102                 3/102
small-slide   CONF1     27/76     24/76     25/76                 15/76
              CONF2      4/5       4/5       4/5                   4/5
              UNIV      20/75     17/75     16/75                 12/75
no-slide      CONF1      0/327     0/327     0/327                 3/327
              CONF2      1/410     1/410     1/410                10/410
              UNIV       1/167     1/167     1/167                 1/167
Total         CONF1     35/661    33/661    33/661                28/661
              CONF2     65/672    61/672    65/672                65/672
              UNIV      24/344    21/344    20/344                16/344
Table 5: Performance comparison of four different algorithms. KP(100) and KP(400) denote the keypoint
matching performance when RANSAC is allowed 100 and 400 iterations, respectively. The background
matching, denoted as BP(), was tested twice, once in local-mode matching and once in global-mode matching. Each mode uses 100 RANSAC iterations in the first phase and 400 RANSAC iterations in the second
phase, after background matching. The two modes are discussed in Section 2.1. Results on three different frame types
are shown for the three data sets, CONF1, CONF2 and UNIV. For each method, frame type and data set, the
number of errors/total frames of that class is displayed. The background matching with local search method
outperforms the other methods.
[Figure 5 plot: number of misses (0-60) versus threshold for correct matching (6-16), for KP100, KP400, BP100+400(Global) and BP100+400(Local).]
Figure 5: Algorithm robustness is demonstrated by
repeating the experiment for different values of the
matching acceptance threshold, m, and measuring
the misrecognized small-slide frames. A higher
threshold provides greater confidence in the matching. The background matching with local keypoint
search, BP100+400(Local), outperforms the other
three methods.
search algorithm outperforms the other three variants. It is
nearly insensitive to the value of m, hence providing much
better homographies and stronger matching for small slides
than the alternatives. Changing m has no measurable impact on the performance of full-slide matching (for which confidence levels are much higher), nor on no-slide frame classification (for which m = 6 already produces nearly perfect
classification results, and increasing m only improves it).
6. CONCLUSIONS
We have demonstrated a comprehensive approach to matching slides to presentation videos. Experiments on three data
sets show that the approach is viable for real-world applications. We found that an implementation of keypoint matching together with constrained planar homography, tuned to
the application, provided a very solid starting point for the
development of our matching system. The unsupervised
scene background matching is a robust and fast way to boost
the performance of RANSAC in difficult cases such as small
slides. We also show that searching for initial keypoint matches locally can give more correct correspondences, thus making RANSAC more robust. Interestingly, the generally high
performance of this initial version made measuring incremental improvement difficult in some cases.
In this work, we did not exploit temporal information
such as the usually sequential order of slide changes, which
should further help resolve ambiguities between
extremely similar or identical slides that often occur in a
presentation. We intend to explore this direction in future
work.
7. ACKNOWLEDGMENTS
We would like to thank the many IBM colleagues who helped
us collect videos and slides data and generate manual
labeling. We also extend our deep thanks to all the presenters who were willing to let us use their presentations for this
study. This work was supported in part by the Arizona Center for Information Science and Technology (ACIST) and by IBM.

8. REFERENCES
[1] G. D. Abowd, C. G. Atkeson, A. Feinstein, C. E.
Hmelo, R. Kooper, S. Long, N. N. Sawhney, and
M. Tani. Teaching and learning as multimedia
authoring: The classroom 2000 project. In ACM
Multimedia, pages 187–198, 1996.
[2] A. Amir, G. Ashour, and S. Srinivasan. Automatic
generation of conference video proceedings. In Journal
of Visual Communication and Image Representation,
JVCI Special Issue on Multimedia Databases, pages
467–488, 2004.
[3] A. Behera, D. Lalanne, and R. Ingold. Looking at
projected documents: Event detection and document
identification, 2004.
[4] N. Cristianini and J. Shawe-Taylor. Support Vector
Machines and Other Kernel-based Learning Methods.
Cambridge University Press, 2002.
[5] B. Erol, J. J. Hull, and D. Lee. Linking multimedia
presentations with their symbolic source documents:
algorithm and applications. In ACM Multimedia,
pages 498–507, 2003.
[6] T. F. Syeda-Mahmood. Indexing for topics in videos
using foils. In IEEE Conference on Computer Vision
and Pattern Recognition, pages II: 312–319, 2000.
[7] M. A. Fischler and R. C. Bolles. Random sample
consensus: A paradigm for model fitting with
applications to image analysis and automated
cartography. Comm. of the ACM, 24:381–395, 1981.
[8] R. M. Haralick and L. G. Shapiro. Computer and Robot
Vision, Volume II. Addison-Wesley, 1992.
[9] R. Hartley and A. Zisserman. Multiple View Geometry
in Computer Vision. Cambridge University
Press, 2002.
[10] T. Liu, R. Hjelsvold, and J. R. Kender. Analysis and
enhancement of videos of electronic slide
presentations. IEEE International Conference on
Multimedia and Expo (ICME), 2002.
[11] D. Lowe. Distinctive image features from
scale-invariant keypoints. International Journal of
Computer Vision, pages 91–110, 2004.
[12] S. Mukhopadhyay and B. Smith. Passive capture and
structuring of lectures. In ACM Multimedia (1), pages
477–487, 1999.
[13] G. Pass, R. Zabih, and J. Miller. Comparing images
using color coherence vectors. In ACM Multimedia,
pages 65–73, 1996.
[14] L. A. Rowe and J. M. Gonzelez. BMRC lecture
browser.
http://bmrc.berkekey.edu/frame/projects/lb/index.html.
[15] F. Wang, C.-W. Ngo, and T.-C. Pong. Synchronization
of lecture videos and electronic slides by video text
analysis. In ACM Multimedia, pages 315–318, 2003.