Content Based Image and Video Retrieval Using Embedded Text

Chinmaya Misra and Shamik Sural
School of Information Technology, Indian Institute of Technology, Kharagpur, West Bengal 721302, India
{cmisra, shamik}@sit.iitkgp.ernet.in

P.J. Narayanan et al. (Eds.): ACCV 2006, LNCS 3852, pp. 111–120, 2006. © Springer-Verlag Berlin Heidelberg 2006

Abstract. Extraction of text from images and video is an important step in building efficient indexing and retrieval systems for multimedia databases. We adopt a hybrid approach to such text extraction by exploiting a number of characteristics of text blocks in color images and video frames. Our system detects both caption text and scene text of different font, size, color and intensity. We have developed an application for on-line extraction and recognition of text from videos. The extracted text is used for retrieval of video clips based on any given keyword. The application is available on the web for readers to repeat our experiments and to try text extraction and retrieval on their own videos.

1 Introduction

Text embedded in an image is usually closely related to its semantic content. Hence, text is often considered a strong candidate for use as a feature in high-level semantic indexing and content-based retrieval. An index built from extracted and recognized text enables keyword-based searches on a multimedia database. As an example, we can identify video frames on specific topics of discussion in an educational video if the frames display the corresponding text. One of the main challenges in this work is to locate text blocks in images with complex color combinations.

Text in images and video can be classified into two broad types: (i) caption text, also known as graphic text or overlay text, and (ii) scene text. Caption text, as shown in Fig. 1(a), is text that is synthetically added to a video or an image during editing. It serves many purposes, such as displaying the actor list and credits in a movie or the topics covered in an educational video. Caption text in a video frame typically has low resolution so that it does not occlude the scene objects. In contrast, scene text, as shown in Fig. 1(b), occurs in the field of view of the camera during video or still photography. Examples of scene text include street signs, billboards, topics covered through presentation slides in educational videos, number plates on cars, etc. Scene text is often more difficult to detect and extract than caption text because of its unlimited range of font, size, shape and color.

Fig. 1. (a) Caption text and (b) Scene text

It may be noted from Figs. 1(a) and (b) that images containing either scene text or caption text cannot serve as direct input to an Optical Character Recognition (OCR) system. The segmentation techniques built into existing OCR systems cannot handle the complexity of the color images in which such text regions are embedded. Instead, specialized methods are needed to identify the text blocks in images and video frames. The contents of these text blocks can then be submitted to an OCR for identification of the characters and words. Our goal is to accurately extract text blocks from color images and video frames, recognize the text using an OCR and store it as keywords in a database for indexing and retrieval.
2 Related Work

In recent years, attempts have been made to develop methods for extracting text blocks from still images and videos. Li et al [6] use a 16 × 16 window moved over various positions in an image. Each window position is classified as a text or a non-text block using a neural network. Text blocks identified by the classifier are then tracked across frame boundaries. This method detects text only at the block level. Jain and Yu [3] propose a method to locate text in color images using connected components. Their method can detect only text of large size and high contrast. While it is well suited for processing newspaper advertisements and web images, it is less effective at detecting text in complex and cluttered backgrounds. The accuracy of this approach is high for binary and gray-scale images, but the system is less accurate in locating text in full-color images. Lienhart and Wernicke propose a multi-resolution approach to detect potential text lines in images and video frames using edge orientation [7]. This method also uses small windows to find edge orientations and a complex-valued neural network to classify text regions of certain pre-defined sizes. They employ projection profiles as well as geometrical and morphological constraints to refine the text boxes. Nugroho et al [9] apply color reduction to decompose a multi-valued color image into a small number of meaningful color prototypes based on foreground color. Connected components are then extracted from each foreground image, and text and non-text components are classified with the help of stroke features. This approach works well for a limited range of characters, especially multi-segment characters like Japanese and Chinese.

Malobabic et al [8] detect artificial text in videos using features that capture foreground-to-background contrast, the density of short edges of varying orientations and the concentration of short vertical edges that are horizontally aligned. Various geometrical constraints are also applied to improve the result. Sato et al [10] investigate superimposed caption recognition in news videos. They use a spatial filter to localize the text regions as well as size and position constraints to refine the detected area. This algorithm can be applied only in a specific domain, namely news video analysis. Jung and Han [4] sequentially combine the advantages of texture-based methods and connected-component-based methods. A texture classifier detects text regions and filtering is done by the connected-component-based method using geometric and shape information. They detect text in images with multiple colors, intensities and fonts. However, since this method processes raw pixel values for each frame in the texture classifier and performs a number of stages of filtering and refinement, it takes a long time to process each image. Zhang et al [13] use a multiple hypothesis filtering approach on several binary images after decomposing a given image by color space partitioning. To find the candidate text regions, they use texture and motion energy as compressed-domain features. This method can be used to detect caption text in newscasts. However, it works on the assumption that most of the text is located in certain predefined regions of the video with high contrast and a simple background.
In contrast to the above-mentioned methods, we propose a hybrid approach in which multiple cues are used to detect text blocks in images and videos. Further, none of the existing methods mentions a complete system built around its text extraction techniques. We feel that, along with the development of new algorithms, it is equally important to be able to demonstrate the results. For this purpose, we have built a video retrieval system based on embedded text, which is available on the web. Interested readers can repeat our experiments and also perform their own retrievals using this application.

The rest of the paper is organized as follows. In the next section, we give a description of our system. The results are presented in section 4 and we conclude in the last section of the paper.

3 Hybrid Approach to Text Extraction

In this section, we first give an overview of our system followed by a detailed description of its building blocks.

3.1 Overview of the Approach

The input to our system can either be a still image or a video decomposed into frames. We first use a color reduction step in which the input is converted into a 64-color image. This step is necessary since a large number of colors can be present in an image, and processing individual color levels makes the system both inefficient and sensitive to noise. We next determine the Regions of Interest (ROIs) - regions in the image where text could potentially be located. This step, while meant to speed up subsequent processing, should not filter out the text regions. Care is therefore taken to ensure that only those regions that are certainly non-text are eliminated. After identification of the regions of interest, geometrical and morphological features are extracted from each ROI. A multilayer perceptron (MLP) is used as a feature-based classifier to determine whether the ROI contains text or non-text blocks. It should be noted that at this stage we classify an entire ROI as belonging to a text region or a non-text region, not its individual components. After classification of an ROI as text or non-text, the potential text regions are subjected to a connected component analysis to reduce false positives. Connected components of the regions of interest marked as text so far are examined for the existence of specific text features. If such features are not present in the connected components, they are eliminated. The remaining components are marked as text blocks. These text blocks are then given as input to an OCR. The OCR output, in the form of ASCII characters forming words, is stored in a database as keywords with frame references for future retrieval.

3.2 Detailed Description of the Steps

I-Frame Extraction. Text can be detected in static images or videos. For video sequences, since text must be present for at least half a second for viewers to read it, we use only I-frames for text extraction, assuming the typical IBBPBBPBBPBB sequence at a rate of 30 frames per second. Any text that occurs in a video for less than the time gap between successive I-frames is not useful to the viewers either and hence need not be considered. If a video follows any other frame sequence, we extract every twelfth frame for text extraction. This step is not required for processing still images.
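As an illustration of this sampling step, the following sketch (ours, not part of the original system) selects every twelfth decoded frame of a video. The use of OpenCV (cv2) and the function and file names are assumptions; the paper does not specify a decoding library.

    # Hypothetical sketch of the frame-sampling step: keep every twelfth frame,
    # roughly one I-frame per IBBPBBPBBPBB group at 30 frames per second.
    import cv2

    def sample_frames(video_path, step=12):
        """Yield (frame_index, frame) for every step-th decoded frame."""
        cap = cv2.VideoCapture(video_path)
        index = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                yield index, frame
            index += 1
        cap.release()

    # Example usage (hypothetical file name):
    # for idx, frame in sample_frames("lecture.mpg"):
    #     process(frame)   # subsequent text-extraction stages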
Color Reduction. Color reduction is an important pre-processing step for text extraction from complex still images and videos. We perform color reduction by taking the 2 higher-order bits of each of the R, G and B color bands. Each image then contains only 2^6 = 64 color combinations instead of 2^24. Fig. 2(a) and Fig. 2(b) show an original image and the corresponding color-reduced image, respectively. After color reduction, each pixel has a color value v belonging to the set ψ = {0, 1, 2, 3, ..., V-1}, V being the total number of colors. If only the two higher-order bits of each band are used, V = 64. We use the color-reduced image for the identification of regions of interest.

Fig. 2. (a) Original image and (b) Color reduced image

Region of Interest Identification. We next identify potential regions in the image where text may be located. Identification of such regions of interest speeds up the process of text extraction. Text regions have a high density of foreground pixels in some meaningful color plane. A projection profile of an image region is a compact representation of the spatial distribution of pixel content. The horizontal projection profile (HPP) for a given color is defined as a vector of the pixel frequency over each row for that color. The vertical projection profile (VPP) is defined similarly. A threshold T_H = 8 for the HPP and a threshold T_V = 2 for the VPP are set to refine the regions of interest (ROIs). Text is expected to be located in image regions where the count of pixels of a given color in the horizontal direction is greater than T_H and the count of pixels of the same color in the vertical direction is greater than T_V. Text does not have a fixed size in images and video frames. However, more than 99% of all text is less than half the image height and at least 4 pixels high, the minimum needed to be legible.

Geometrical and Morphological Feature Extraction. For each ROI, a number of features are extracted for each color. Before feature extraction, the regions of interest are binarized as follows. Let v(i,j) denote the color value of pixel (i,j) after color reduction. For a given color v_k in ψ, binarization is done as follows:

    for i = 1 to ROI_Height
        for j = 1 to ROI_Width
            if v(i,j) = v_k
                set v(i,j) = 1
            else
                set v(i,j) = 0

Thus, when we process a given color, we set all pixels of that color in the ROI to 1 and the rest to 0. A total of 7 features is extracted, as briefly described below; an illustrative sketch of the color reduction, binarization and feature computation follows the list.

i. Foreground Pixel Density - the number of pixels per unit area whose binarized value is 1.
ii. Ratio of Foreground Pixel Density to Background Pixel Density - background pixel density is calculated in a manner similar to the foreground pixel density described above.
iii. Edge Pixel Density - edge pixels are defined as those for which at least one of the eight neighbors has a binarized value of 0.
iv. Foreground Pixel to Edge Pixel Ratio - the ratio of foreground pixel density to edge pixel density.
v. Horizontal Edge Pixel Density.
vi. Vertical Edge Pixel Density.
vii. Diagonal Edge Pixel Density.
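The sketch below illustrates the color reduction, per-color binarization, projection profiles and two of the seven features. It is our own illustration under the assumption of a NumPy-based implementation; the paper does not specify an implementation language, and the function names are hypothetical.

    # Illustrative sketch (assumed NumPy-based) of 64-color reduction, per-color
    # binarization of an ROI, projection profiles and two of the seven features.
    import numpy as np

    def reduce_colors(rgb):
        """rgb: H x W x 3 uint8 image. Keep the 2 higher-order bits of each band,
        giving one of 2^6 = 64 color indices per pixel."""
        bits = rgb >> 6                     # values 0..3 per band
        return (bits[..., 0] << 4) | (bits[..., 1] << 2) | bits[..., 2]

    def binarize_roi(color_index, k):
        """Set pixels of color k to 1 and all other pixels to 0."""
        return (color_index == k).astype(np.uint8)

    def projection_profiles(binary):
        """HPP: pixel count per row; VPP: pixel count per column. Rows with
        HPP > T_H (= 8) and columns with VPP > T_V (= 2) indicate candidate
        text regions for this color."""
        return binary.sum(axis=1), binary.sum(axis=0)

    def foreground_pixel_density(binary):
        return binary.sum() / binary.size

    def edge_pixel_density(binary):
        """Edge pixels: foreground pixels with at least one 8-neighbor equal to 0."""
        h, w = binary.shape
        padded = np.pad(binary, 1, constant_values=0)
        neighbours = [padded[1 + di:1 + di + h, 1 + dj:1 + dj + w]
                      for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)]
        has_zero_neighbour = np.min(neighbours, axis=0) == 0
        return ((binary == 1) & has_zero_neighbour).sum() / binary.size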
MLP Based Classification. The geometrical and morphological features extracted from each region of interest are next used for classification by a multilayer perceptron (MLP). In the learning phase, we use features extracted from a set of images containing both text and non-text regions. Such regions are manually checked and assigned the corresponding ground truth. 200 text regions and an equal number of non-text regions are used for training the MLP. The MLP has 7 inputs, one hidden layer of 10 units and 1 output, which indicates whether the input block contains text or non-text. The MLP was trained with different initial conditions and was found to have similar performance in each case.

Connected Component Analysis. To reduce the number of false positives after MLP-based classification, we introduce connected component analysis as a post-processing step. The following heuristics are applied to filter out possible non-text blocks from the list of connected components:

i. Text lines are usually separated from the image boundaries.
ii. The base and ceiling of the text components lie on the same line.
iii. At least four text blocks must be present in an ROI for a meaningful text representation.

At the end of this post-processing step, most of the non-text blocks are removed and the remaining regions of interest are expected to contain only text.

OCR Based Identification. The text blocks in each region of interest are given as input to an OCR for recognition. The OCR output consists of ASCII characters, which are stored in a database as keywords for future indexing and retrieval. Fig. 3 illustrates the text recognition process. Fig. 3(a) shows an ROI identified as a text block. This ROI is separated from the rest of the image and binarized as shown in Fig. 3(b). When this ROI is given as input to the OCR, the corresponding ASCII output is as shown in Fig. 3(c). It is observed that while the text extraction part of our system detects the text blocks accurately even in a complex background, the OCR sometimes fails to recognize the text correctly. As seen in Fig. 3(c), the last word was mis-recognized due to the presence of noise. Another example image with multiple ROIs containing caption text is shown in Fig. 3(d). Here also the text regions have been identified correctly, as shown in Fig. 3(e). The corresponding OCR output is shown in Fig. 3(f). While a specific off-the-shelf OCR is currently used in our work, the character recognition accuracy, and hence the overall system performance, is expected to improve further if a better OCR is used.

Fig. 3. (a) Image with ROIs identified (b) Binarized text block (c) OCR output (d) Image with multiple ROIs (e) Multiple binarized text blocks (f) OCR output for multiple text blocks

The effect of the hybrid approach on the quality of text extraction is explained using Fig. 4. In Fig. 4(a), we show four original images of varying complexity. The detected regions of interest are shown in Fig. 4(b). It is observed that at this stage recall is very high (greater than 90%) but there are a number of false positives. The MLP-based classifier correctly detects most of the text blocks and eliminates a large number of non-text blocks. The output of the MLP is shown in Fig. 4(c). At this stage, the precision has improved considerably. In Fig. 4(d) we show the image after the connected-component-based post-processing step. It is seen that the final result has high recall as well as precision.

Fig. 4. Various stages of text extraction from an image (a) Original (b) After ROI detection (c) Output of MLP based classification (d) Final result
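To make the classification and post-processing steps of Section 3.2 concrete, the following sketch pairs an MLP with the 7-10-1 architecture described above with a connected-component filter applying the three heuristics. The use of scikit-learn and OpenCV is an assumption (the paper does not name its libraries), and the alignment tolerance and function names are illustrative only.

    # Illustrative sketch of ROI classification and connected-component filtering.
    import cv2
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_roi_classifier(features, labels):
        """features: (n_samples, 7) feature vectors; labels: 1 = text ROI, 0 = non-text.
        The paper trains on 200 text and 200 non-text regions."""
        clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)
        clf.fit(features, labels)
        return clf

    def filter_text_components(binary_roi, min_components=4):
        """Apply the three heuristics to the connected components of a candidate ROI."""
        binary = binary_roi.astype(np.uint8)
        h, w = binary.shape
        n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
        kept, tops, bottoms = [], [], []
        for i in range(1, n):                       # label 0 is the background
            x, y, cw, ch, area = stats[i]
            # Heuristic i: text components are separated from the image boundaries.
            if x == 0 or y == 0 or x + cw >= w or y + ch >= h:
                continue
            kept.append(i)
            tops.append(y)
            bottoms.append(y + ch)
        # Heuristic ii: base and ceiling of the components lie on the same line
        # (checked loosely here; the tolerance of 3 pixels is illustrative).
        if kept and (np.ptp(tops) > 3 or np.ptp(bottoms) > 3):
            return []
        # Heuristic iii: at least four components are needed for meaningful text.
        return kept if len(kept) >= min_components else []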
3.3 Web-Based Video Retrieval System

We have developed a web-based on-line video retrieval system using embedded text. It should be noted that, to facilitate blind review, the web site address has not been mentioned in this initial version of the paper. However, it will be made available in the accepted version.

To test the accuracy of any system, users are often interested in retrieving frames from their own video files. To facilitate this, we provide a utility to upload an external video file into our system. Keywords are extracted from the video and stored in a database in a fully automated manner. The user can then query the database with keywords of his choice. Sets of consecutive video frames containing the keywords are retrieved from the database. A short video clip is generated from each set of consecutive frames and returned to the user for viewing. Thus, the user gets back a collection of short video clips containing his chosen keywords. To the best of our knowledge, this feature is unique to our work and is not available in any other text extraction system in the research domain.

4 Results

In this section, we present quantitative results on the performance of the text extraction system. The performance can be measured in terms of true positives (TP) - text regions identified correctly as text regions, false positives (FP) - non-text regions identified as text regions, and false negatives (FN) - text regions missed by the system. Using these basic definitions, recall and precision of retrieval can be defined as follows:

Recall = TP/(TP+FN) and Precision = TP/(TP+FP)

While these definitions are generic, different researchers use different units of text for calculating recall and precision. Wong and Chen [12] consider the number of characters, while some other authors count the number of text boxes or text regions [1,6]. Jain and Yu [3] calculate recall and precision by considering either characters or blocks depending on the type of image. We adopt the second convention, in which text regions are the units for counting. The ground truth is obtained by manually marking the correct text regions. We have calculated recall and precision on a large number of text-rich images. For video processing, we have tested the system on different types of MPEG videos such as news clips, lecture clips and commercials. The videos contain both caption text and scene text of different font, color and intensity. Table 1 shows the performance of our proposed method on four types of video. It is seen that our method has an overall average recall of 82% and precision of 87%.

Another important consideration is the quality and complexity of the pictures used for evaluation. Jain and Yu consider large fonts in web images, advertisements and video clips [3]. Kim [5] does not detect low-contrast text and small fonts. Li et al [6] use text with different complex motions. Zhang et al [13] as well as Sato et al [10] detect only caption text in news video clips. We are able to detect text under a large number of different conditions: text with small fonts, low intensity, different colors and cluttered backgrounds, text from noisy video, news captions with horizontal scrolling, and both caption text and scene text.

Table 1. Recall and precision of text block extraction

                 No. of text blocks   TP    FP    FN   Recall (%)   Precision (%)
    News                780           624    52   156      80            92
    Sports              144           120    60    24      83.3          66.6
    Lectures            120            96    24    24      80            80
    Commercials         324           288    36    36      88            88
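As an added consistency check (not part of the original paper), the News row of Table 1 gives Recall = 624/(624+156) = 624/780 = 80% and Precision = 624/(624+52) = 624/676 ≈ 92%, in agreement with the tabulated values.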
The primary advantage of the proposed method is that it is very fast, since most of the computationally intensive algorithms are applied only to the regions of interest. Table 2 shows the processing time of our method, measured on a 2.4 GHz Pentium-IV machine over different types of video clips, along with the comparative times required by other algorithms, including those proposed in [4], [11] and [12]. For our algorithm, the average is taken over a number of different image sizes. It is seen that our algorithm requires the least time to process each frame. Since we process only I-frames, which occur at a rate of about 3 per second, we are able to achieve real-time processing speed in our system.

Table 2. Execution time of text extraction

                 Machine used      Image size   Processing time (sec)
    Proposed     PIV               -            0.14
    [12]         Sun Ultra sparc   320*240      1.2
    [4]          -                 -            1.7
    [11]         PIV               320*240      0.47

5 Conclusions

We have presented a hybrid approach for the detection of text regions and recognition of text from images and video frames. It can detect both scene text and caption text. A content-based video retrieval system has been developed in which keywords are extracted from video frames based on their textual content. The keywords are stored and indexed in a database for retrieval. We plan to extend our work to compressed-domain processing to make it even faster. A more accurate OCR will also improve the quality of retrieval further.

Acknowledgement

The work done by Shamik Sural is supported by research grants from the Department of Science and Technology, India, under Grant No. SR/FTP/ETA-20/2003 and by a grant from IIT Kharagpur under ISIRD scheme No. IIT/SRIC/ISIRD/2002-2003.

References

1. L. Agnihotri and N. Dimitrova: Text Detection in Video Segments. Proc. of Workshop on Content Based Access to Image and Video Libraries, pp. 109-113, June 1999
2. Y. M. Y. Hasan and L. J. Karam: Morphological Text Extraction from Images. IEEE Transactions on Image Processing, Vol. 9, Nov. 2000
3. A. K. Jain and B. Yu: Automatic Text Location in Images and Video Frames. Pattern Recognition, Vol. 31, No. 12, pp. 2055-2076, 1998
4. K. Jung and J. H. Han: Hybrid Approach to Efficient Text Extraction in Complex Color Images. Pattern Recognition Letters, Vol. 25, pp. 679-699, 2004
5. H.-K. Kim: Efficient Automatic Text Location Method and Content-Based Indexing and Structuring of Video Database. Journal of Visual Communication and Image Representation, Vol. 7, No. 4, pp. 336-344, Dec. 1996
6. H. Li, D. Doermann and O. Kia: Automatic Text Detection and Tracking in Digital Video. IEEE Transactions on Image Processing, Vol. 9, pp. 147-156, Jan. 2000
7. R. Lienhart and A. Wernicke: Localizing and Segmenting Text in Images and Videos. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 12, No. 4, pp. 256-268, April 2002
8. J. Malobabic, N. O'Connor, N. Murphy and S. Marlow: Automatic Detection and Extraction of Artificial Text in Video. Adaptive Information Cluster, Centre for Digital Video Processing, Dublin City University, 2002
9. A. S. Nugroho, S. Kuroyanagi and A. Iwata: An Algorithm for Locating Characters in Color Image using Stroke Analysis Neural Network. Proc. of the 9th International Conference on Neural Information Processing (ICONIP'02), November 18-22, 2002
10. T. Sato, T. Kanade, E. Hughes and M. Smith: Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions. Multimedia Systems, Vol. 7, pp. 385-394, 1999
11. J. C. Shim, C. Dorai and R. Bolle: Automatic Text Extraction from Video for Content-Based Annotation and Retrieval. Proc. of the 14th International Conference on Pattern Recognition, Vol. 1, pp. 618-620, Brisbane, Australia, August 1998
12. E. K. Wong and M. Chen: A New Robust Algorithm for Video Text Extraction. Pattern Recognition, Vol. 36, No. 6, pp. 1397-1406, June 2003
13. D. Zhang, B. L. Tseng, C. Y. Lin and S. F. Chang: Accurate Overlay Text Extraction for Digital Video Analysis. Columbia University ADVENT Group Technical Report, 2003