Content Based Image and Video Retrieval Using
Embedded Text
Chinmaya Misra and Shamik Sural
School of Information Technology, Indian Institute of Technology,
Kharagpur, West Bengal -721302, India
{cmisra, shamik}@sit.iitkgp.ernet.in
Abstract. Extraction of text from image and video is an important
step in building efficient indexing and retrieval systems for multimedia
databases. We adopt a hybrid approach for such text extraction by exploiting a number of characteristics of text blocks in color images and
video frames. Our system detects both caption text and scene text of different fonts, sizes, colors and intensities. We have developed an application for on-line extraction and recognition of text from videos. The extracted text is used for retrieval of video clips based on any given keyword.
The application is available on the web for the readers to repeat our
experiments and also to try text extraction and retrieval from their own
videos.
1 Introduction
Text embedded in an image is usually closely related to its semantic content.
Hence, text is often considered to be a strong candidate for use as a feature in
high level semantic indexing and content-based retrieval. An index built using
extracted and recognized text enables keyword-based searches on a multimedia
database. As an example, we can identify video frames on specific topics of
discussion from an educational video if the frames display corresponding text
information. One of the main challenges in this work is to be able to locate text
blocks in an image with complex color combinations.
Text in image and video can be classified into two broad types: (i) Caption
text - also known as Graphic text or Overlay text and (ii) Scene text. Caption
text, as shown in Fig. 1(a), is the type of text that is synthetically added to a video or an image during editing. It serves many different purposes, such as the display of the actor list and credits in a movie, the topics covered in an educational video, etc.
Caption text in a video frame typically has low resolution so that it does not
occlude the scene objects.
In contrast to caption text, scene text as shown in Fig. 1(b), usually occurs in
the field of view of a camera during video or still photography. Examples of scene
text include street signs, billboards, topics covered through presentation slides in
educational videos, number plates on cars, etc. Scene text is often more difficult
to detect and extract compared to caption text due to its unlimited range of font,
size, shape and color. It may be noted from Figs. 1(a) and (b) that the images
Fig. 1. (a) Caption text and (b) Scene text
containing either scene text or caption text cannot serve as a direct input to an
Optical Character Recognition (OCR) system. Existing segmentation techniques
built into OCR systems are not capable of handling the complexity of the color
images in which such text regions are embedded. Instead, it is essential to build
specialized methods for identifying the text blocks from images and video frames.
Contents of such text blocks can then be submitted to an OCR for identification
of the characters and words.
Our goal is to accurately extract text blocks from color images and video
frames, recognize the texts using an OCR and store them as keywords in a
database for indexing and retrieval.
2 Related Work
In recent years, attempts have been made to develop methods for extracting text
blocks from still images and videos. Li et al [6] use a 16 × 16 window moved over
various positions in an image. Each window position is classified as a text or a
non-text block using a neural network. Text blocks identified by the classifier
are then tracked across frame boundaries. This method detects text only at the
block level. Jain and Yu [3] propose a method to locate text in color images using
connected components. Their method can detect text only with large size and
high contrast. While it is well suited for processing newspaper advertisements
and web images, it is less effective at detecting text in complex and cluttered backgrounds. The accuracy of this approach is high for binary and gray-scale images, but the system is less accurate in locating text in full-color images.
Lienhart and Wernicke propose a multi-resolution approach to detect potential text lines in images and video frames using edge orientation [7]. This
method also uses small windows to find edge orientations and a complex-valued
neural network based method to classify text regions of certain pre-defined
sizes. They employ projection profiles as well as geometrical and morphological
constraints for refining the text boxes. Nugroho et al [9] apply color reduction
and decompose a multi-valued color image into a small number of meaningful
color prototypes based on foreground color. After that, connected components
are extracted from each foreground image, and text and non-text components are classified with the help of stroke features. This approach works well for a limited range of characters, especially multi-segment characters such as Japanese and Chinese.
Malobabic et al [8] detect artificial text in videos using features that capture
foreground to background contrast, density of short edges of varying orientations
and concentration of short vertical edges that are horizontally aligned. Various
geometrical constraints are also applied for improving the result.
Sato et al [10] investigate superimposed caption recognition in news videos.
They use a spatial filter to localize the text regions as well as size and position constraints to refine the detected area. This algorithm can be applied only
in a specific domain, namely, news video analysis. Jung and Han [4] sequentially combine the advantages of texture based methods and connected component based methods. A texture classifier detects text regions, and filtering is done by the connected component based method using geometric and shape information. They detect text in images with multiple colors, intensities and fonts. However, since this method processes raw pixel values for each frame in the texture classifier and performs a number of stages of filtering and refinement, it takes a lot of time
for processing each image. Zhang et al [13] use a multiple hypothesis filtering
approach on several binary images after decomposing a given image by color
space partitioning. To find the candidate text regions they use texture and motion energy as compressed domain features. This method can be used to detect
caption text from newscasts. However, it works on the assumption that most
of the text is located in some predefined regions with high contrast and simple
background in the video.
In contrast to the above-mentioned methods, we propose a hybrid approach
in which multiple cues are used for detecting text blocks in images and videos.
Further, none of the existing works describes a complete system built using the text extraction techniques. We feel that along
with the development of new algorithms, it is equally important to be able to
demonstrate the results. For this purpose, we have built a video retrieval system
based on embedded text, which is available on the web. Interested readers will
be able to repeat our experiments and also perform their own retrievals using
this application.
The rest of the paper is organized as follows. In the next section, we give a
description of our system. The results are presented in section 4 and we conclude
in the last section of the paper.
3 Hybrid Approach to Text Extraction
In this section, we first give an overview of our system followed by a detailed
description of the building blocks.
3.1 Overview of the Approach
The input to our system can either be a still image or a video decomposed into
frames. We first use a color reduction step in which the input is converted into
a 64-color image. This step is necessary since there can be a large number of
colors present in an image. Individual color level processing makes the system
both inefficient and sensitive to noise. We next determine the Regions of
Interest (ROIs) - Regions in the image where text could potentially be located.
This step, while meant to speed-up subsequent searches, should not filter out
the text regions. Care is, therefore, taken to ensure that only those regions that
are certainly of type non-text are eliminated.
After identification of the regions of interest, geometrical and morphological
features are extracted from each ROI. A multilayer perceptron (MLP) is used
as a feature-based classifier to determine if the ROI contains text or non-text
blocks. It should be noted that at this stage, we identify an entire ROI to either
belong to a text region or to a non-text region and not its individual components.
After classification of an ROI as text or non-text, the potential text regions are
subjected to a connected component analysis for reducing the false positives.
Connected components of the regions of interest so far marked as text are examined for the existence of specific text features. If such features are not present
in the connected components, they are eliminated. The remaining components
are marked as text blocks. These text blocks are next given as input to an OCR.
The OCR output in the form of ASCII characters forming words is stored in a
database as keywords with frame reference for future retrieval.
3.2 Detailed Description of the Steps
I-Frame Extraction. Text can be detected in static images or videos. For video sequences, since text must be present for at least half a second for viewers to read it, we use only I-frames for text extraction from videos with the typical IBBPBBPBBPBB sequence at a rate of 30 frames per second. Any text that occurs in a video for a duration shorter than the gap between successive I-frames is not useful to the viewers either and hence need not be considered.
If a video follows any other frame sequence, we extract every twelfth frame for
text extraction. This step is not required for processing still images.
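As an illustration of this sampling step, the short Python sketch below (assuming OpenCV is available) reads a video and keeps every twelfth frame. It only approximates I-frame selection, since the decoder used here does not expose frame types; with the IBBPBBPBBPBB pattern at 30 frames per second it yields roughly the I-frames, i.e. about 2-3 candidate frames per second. The file name in the usage comment is a placeholder.

    import cv2  # OpenCV, assumed available

    def sample_candidate_frames(video_path, gop_size=12):
        """Yield (frame_index, frame) for every gop_size-th decoded frame.

        With the typical IBBPBBPBBPBB group of pictures this approximates
        taking the I-frames of the video.
        """
        cap = cv2.VideoCapture(video_path)
        idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % gop_size == 0:
                yield idx, frame
            idx += 1
        cap.release()

    # Usage (placeholder file name):
    # for idx, frame in sample_candidate_frames("lecture.mpg"):
    #     ...process the frame...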
Color Reduction. Color reduction is an important pre-processing step for text
extraction from complex still images and videos. We perform color reduction by
taking the two higher order bits from each of the R, G and B color bands. Each image then contains only 2^6 = 64 color combinations instead of 2^24. In Fig. 2(a) and Fig. 2(b) we show an original image and the corresponding color-reduced image, respectively. After color reduction, each pixel has a color value v ∈ ψ, where ψ = {0, 1, 2, 3, ..., (V-1)} and V is the total number of colors. If only the two higher order bits of each band are used, V = 64.
Fig. 2. (a) Original image and (b) Color-reduced image
We use the color-reduced image for the identification of regions of interest.
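A minimal sketch of this color reduction, assuming the frame is held as an H x W x 3 array of 8-bit R, G, B values in NumPy: keeping the two most significant bits of each band leaves at most 2^6 = 64 colors, which can also be packed into a single index v in {0, ..., 63} per pixel.

    import numpy as np

    def reduce_colors(img):
        """Keep only the 2 high-order bits of each of the R, G and B bands.

        img: H x W x 3 uint8 array.
        Returns (reduced_img, color_index), where color_index holds the
        packed 6-bit color value v of each pixel, 0 <= v <= 63.
        """
        reduced = img & 0xC0                      # 0xC0 = 11000000b, drop the 6 low bits
        bits = (img >> 6).astype(np.uint8)        # each band reduced to {0, 1, 2, 3}
        color_index = (bits[..., 0] << 4) | (bits[..., 1] << 2) | bits[..., 2]
        return reduced, color_index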
Region of Interest Identification. We next identify potential regions in the
image where text may be located. Identification of such regions of interest helps
in speeding up the process of text extraction. In the text regions, there are high
densities of foreground pixels for some meaningful color plane. A projection profile of an image region is a compact representation of the spatial pixel content
distribution. Horizontal projection profile (HPP) for a given color is defined as
a vector of the pixel frequency over each row for that color. Vertical projection
profile (VPP) is defined similarly. A threshold for the HPP, T_H = 8, and a threshold for the VPP, T_V = 2, are set to refine the regions of interest (ROIs). Text is expected to be located in image regions where the count of pixels of a given color in the horizontal direction is greater than T_H and the count of pixels of the same color in the vertical direction is greater than T_V. Text does not have a fixed size in images and video frames. However, more than 99% of all text is less than half the image height and greater than 4 pixels in height to remain legible.
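The sketch below shows how the projection profiles and thresholds could be applied for one color of the reduced image. The grouping of qualifying rows and columns into rectangular regions of interest is not spelled out in the paper, so only the profile test and the height constraint are illustrated here.

    import numpy as np

    T_H, T_V = 8, 2   # HPP and VPP thresholds used above

    def candidate_rows_and_cols(color_index, color):
        """Rows and columns whose pixel count for `color` exceeds the thresholds."""
        mask = (color_index == color)
        hpp = mask.sum(axis=1)        # horizontal projection profile (per row)
        vpp = mask.sum(axis=0)        # vertical projection profile (per column)
        rows = np.where(hpp > T_H)[0]
        cols = np.where(vpp > T_V)[0]
        return rows, cols

    def roi_height_ok(height, image_height):
        """Size constraint: legible text is taller than 4 pixels and shorter
        than half the image height."""
        return 4 < height < image_height / 2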
Geometrical and Morphological Feature Extraction. For each ROI, a
number of features are extracted for each color. Before feature extraction, the
regions of interest are binarized as follows.
Let v_ij denote the color value of the pixel (i, j) after color reduction. For a given color v_k ∈ ψ, binarization is done as follows:

    for i = 1 to ROI_Height
        for j = 1 to ROI_Width
            if v_ij = v_k
                set v_ij = 1
            else
                set v_ij = 0
Thus, when we process any given color, we set all pixels in the ROI of that
color to 1 and the rest to 0.
A total of 7 features are extracted; they are briefly described below, and a sketch of their computation is given after the list.
i. Foreground Pixel Density - It is the number of pixels per unit area whose
binarized value is 1.
ii. Ratio of Foreground Pixel Density to Background Pixel Density - Background pixel density is calculated in a manner similar to foreground pixel
density described above.
iii. Edge Pixel Density - Edge pixels are defined as those pixels for which one of the eight neighbors has a binarized value of 0.
iv. Foreground Pixel to Edge Pixel Ratio - Ratio of foreground pixel density to edge pixel density.
v. Horizontal Edge Pixel Density
vi. Vertical Edge Pixel Density
vii. Diagonal Edge Pixel Density.
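The sketch below computes the seven features from a binarized ROI (a 2-D array of 0/1 values). The precise edge definitions are not fully specified in the paper, so the horizontal, vertical and diagonal edge densities here follow one plausible reading: a foreground pixel is counted for a direction if its neighbor in that direction is background.

    import numpy as np

    def roi_features(b):
        """Seven geometrical/morphological features of a binarized ROI b (0/1)."""
        area = b.size
        fg = b.sum()
        bg = area - fg

        # Pad with background so border pixels can be tested uniformly.
        p = np.pad(b, 1, constant_values=0)
        up, down = p[:-2, 1:-1], p[2:, 1:-1]
        left, right = p[1:-1, :-2], p[1:-1, 2:]
        diags = [p[:-2, :-2], p[:-2, 2:], p[2:, :-2], p[2:, 2:]]

        neigh_min = np.minimum.reduce([up, down, left, right] + diags)
        edge = (b == 1) & (neigh_min == 0)       # foreground pixel with a 0 neighbor

        horiz = (b == 1) & ((up == 0) | (down == 0))
        vert = (b == 1) & ((left == 0) | (right == 0))
        diag = (b == 1) & (np.minimum.reduce(diags) == 0)

        eps = 1e-9                               # guard against degenerate ROIs
        return [
            fg / area,                           # i.   foreground pixel density
            fg / (bg + eps),                     # ii.  foreground/background density ratio
            edge.sum() / area,                   # iii. edge pixel density
            fg / (edge.sum() + eps),             # iv.  foreground to edge pixel ratio
            horiz.sum() / area,                  # v.   horizontal edge pixel density
            vert.sum() / area,                   # vi.  vertical edge pixel density
            diag.sum() / area,                   # vii. diagonal edge pixel density
        ]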
MLP Based Classification. The geometrical and morphological features extracted from each region of interest are next used for classification by a multilayer
perceptron. In the learning phase, we use features extracted from a set of images
containing both text and non-text regions. Such regions are manually checked
and assigned the corresponding ground truth. 200 text regions and an equal
number of non-text regions are used for training the MLP. The MLP contains 7
inputs, one hidden layer of 10 units and 1 output. The output represents whether
the input block contains text or non-text. The MLP was trained with different
initial conditions and was found to have similar performance in each case.
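The paper does not name a particular MLP implementation or training algorithm; as one possibility, the 7-10-1 topology described above can be sketched with scikit-learn's MLPClassifier (the single output unit is implicit in a binary classifier). The random data in the usage part is a placeholder, not the paper's training set of 200 text and 200 non-text regions.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_text_classifier(X, y, seed=0):
        """X: N x 7 feature matrix, y: N labels (1 = text, 0 = non-text)."""
        clf = MLPClassifier(hidden_layer_sizes=(10,),   # one hidden layer of 10 units
                            activation="logistic",
                            max_iter=2000,
                            random_state=seed)
        clf.fit(X, y)
        return clf

    # Usage sketch with random placeholder data:
    rng = np.random.default_rng(0)
    X = rng.random((400, 7))
    y = np.concatenate([np.ones(200, dtype=int), np.zeros(200, dtype=int)])
    clf = train_text_classifier(X, y)
    is_text = clf.predict(X[:5])                        # 1 = text ROI, 0 = non-text ROI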
Connected Component Analysis. In order to reduce the number of false positives after MLP based classification, we introduce connected component analysis
as a post-processing step. The following heuristics are applied to filter out possible non-text blocks from the list of connected components.
i. Text lines are usually separated from image boundaries.
ii. Base and ceiling of the text components are in the same line.
iii. At least four text blocks are present in an ROI for meaningful text representations.
At the end of this post-processing step, most of the non-text blocks are removed
and the rest of the regions of interest are expected to contain only text.
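A rough sketch of this post-processing step using SciPy's connected component labelling. The three heuristics are interpreted loosely here: components touching the ROI border are dropped, the remaining components must share an approximately common base and ceiling line, and at least four components must survive. The alignment tolerance is an assumption, not a value given in the paper.

    import numpy as np
    from scipy import ndimage

    def filter_text_components(binary_roi, tol=2):
        """Return bounding boxes of kept components, or [] if the ROI is rejected."""
        labels, n = ndimage.label(binary_roi)
        h, w = binary_roi.shape
        kept = []
        for sl in ndimage.find_objects(labels):
            if sl is None:
                continue
            top, bottom = sl[0].start, sl[0].stop
            left, right = sl[1].start, sl[1].stop
            # Heuristic i: text components are separated from the boundary.
            if top == 0 or left == 0 or bottom == h or right == w:
                continue
            kept.append((top, bottom, left, right))
        if len(kept) < 4:                        # Heuristic iii: at least four blocks
            return []
        # Heuristic ii: base and ceiling of the components lie on the same line.
        tops = [b[0] for b in kept]
        bottoms = [b[1] for b in kept]
        aligned = [b for b in kept
                   if abs(b[0] - np.median(tops)) <= tol
                   and abs(b[1] - np.median(bottoms)) <= tol]
        return aligned if len(aligned) >= 4 else []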
OCR Based Identification. Text blocks in each region of interest are given
as input to an OCR for recognition. The generated outputs from the OCR are
ASCII characters, which are stored in a database as keywords for future indexing
and retrieval. In Fig. 3, we explain the process of text recognition in detail.
Fig. 3(a) shows an ROI identified as a text block. This ROI is separated out from the rest of the image and binarized as shown in Fig. 3(b).

Fig. 3. (a) Image with ROIs identified (b) Binarized text block (c) OCR output (d) Image with multiple ROIs (e) Multiple binarized text blocks (f) OCR output for multiple text blocks
Fig. 4. Various stages of text extraction from an image (a) Original (b) After ROI
detection (c) Output of MLP based classification (d) Final result
When this ROI is given
as input to the OCR, the corresponding ASCII output is shown in Fig. 3(c). It is
observed that while the text extraction part of our system detects the text blocks
accurately even in a complex background, the OCR sometimes fails to recognize
the text correctly. As seen in Fig. 3(c), the last word was mis-recognized due
to the presence of noise. Another example image with multiple ROIs containing
caption text is shown in Fig. 3(d). Here also the text regions have been identified
correctly as shown in Fig. 3(e). The corresponding OCR output is shown in
Fig. 3(f). While a specific off-the-shelf OCR is currently being used in our work,
it is expected that the character recognition accuracy and hence the overall
system performance will improve further if a better OCR is used. The effect of
the hybrid approach on the quality of text extraction is explained using Fig. 4.
In Fig. 4(a), we show four original images of varying complexity. The detected
regions of interest are shown in Fig. 4(b). It is observed that at this stage, recall
is very high (greater than 90%) but there are a number of false positives. The
MLP based classifier can correctly detect most of the text blocks and eliminate
a large number of non-text blocks. The output of the MLP is shown in Fig. 4(c).
At this stage, the precision has improved considerably. In Fig. 4(d) we show the
image after the connected component based post-processing step. It is seen that
the final result has high recall as well as precision.
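The paper does not state which off-the-shelf OCR engine is used. As an illustration of this final recognition step, the sketch below feeds a binarized text block to the Tesseract engine through pytesseract (a stand-in choice, not the system's actual OCR) and turns its output into keywords for the index.

    import re
    import numpy as np
    import pytesseract                # assumes the Tesseract engine is installed
    from PIL import Image

    def keywords_from_block(binary_block):
        """Recognize a binarized text block (2-D 0/1 array) and return keywords."""
        # Tesseract prefers dark text on a light background in an 8-bit image.
        img = Image.fromarray(((1 - binary_block) * 255).astype(np.uint8))
        text = pytesseract.image_to_string(img)
        # Keep alphanumeric words of length >= 2 as keywords.
        return [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text) if len(w) >= 2]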
3.3 Web-Based Video Retrieval System
We have developed a web-based on-line video retrieval system using embedded
text. It should be noted that to facilitate blind review, the web site address has
not been mentioned in this initial version of the paper. However, it will be made
available in the accepted version. To test the accuracy of any system, users are
often interested in retrieving frames from their own video files. To facilitate this,
we provide a utility to upload an external video file in our system. From the
video, keywords are extracted and stored in a database in a fully automated
manner. The user can then query the database with his choice of keywords.
Sets of consecutive video frames containing the keywords are retrieved from the
database. A short video clip is generated from each set of consecutive frames
and returned to the user for viewing. Thus, user gets back a collection of short
video clips containing his choice of keywords. To the best of our knowledge, this
feature is unique in our work and is not available in any other text extraction
system in the research domain.
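A minimal sketch of the keyword index and retrieval query, assuming a simple SQLite table that maps each recognized keyword to a video identifier and frame number. Merging consecutive hit frames into short clips with a small gap tolerance is an implementation assumption, not a detail given in the paper.

    import sqlite3

    def open_index(path="keywords.db"):               # placeholder file name
        con = sqlite3.connect(path)
        con.execute("""CREATE TABLE IF NOT EXISTS keywords
                       (word TEXT, video TEXT, frame INTEGER)""")
        return con

    def add_keyword(con, word, video, frame):
        con.execute("INSERT INTO keywords VALUES (?, ?, ?)", (word.lower(), video, frame))

    def query_clips(con, word, max_gap=12):
        """Return (video, first_frame, last_frame) ranges containing `word`,
        merging hits that are at most `max_gap` frames apart."""
        rows = con.execute(
            "SELECT video, frame FROM keywords WHERE word = ? ORDER BY video, frame",
            (word.lower(),)).fetchall()
        clips, current = [], None
        for video, frame in rows:
            if current and current[0] == video and frame - current[2] <= max_gap:
                current = (video, current[1], frame)
            else:
                if current:
                    clips.append(current)
                current = (video, frame, frame)
        if current:
            clips.append(current)
        return clips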
4 Results
In this section, we present quantitative results on the performance of the text
extraction system. The performance can be measured in terms of true positives
(TP) - text regions identified correctly as text regions, false positives (FP) - non-text regions identified as text regions, and false negatives (FN) - text regions missed by the system. Using these basic definitions, recall and precision of
retrieval can be defined as follows:
Recall = TP/(TP+FN) and Precision = TP/(TP+FP)
While the above definitions are generic, different researchers use different
units of text for calculating recall and precision. Wong and Chen [12] consider
the number of characters while some of the other authors count the number of
text boxes or text regions [1,6]. Jain and Yu [3] calculate recall and precision
by considering either characters or blocks depending on the type of image. We
adopt the second definition in which we consider the text regions as units for
counting. The ground-truth is obtained by manually marking the correct text
regions.
We have calculated recall and precision on a large number of text-rich images.
For video processing, we have tested the system on different types of mpeg
videos such as news clips, lecture clips and commercials. The videos contain
both caption texts as well as scene texts of different font, color and intensity.
Table 1 shows the performance of our proposed method on four types of video.
It is seen that our method has an overall average recall of 82% and precision of
87%. Another important consideration is the quality and complexity of pictures
for evaluation. Jain and Yu consider large fonts in web images, advertisements
and video clips [3]. Kim [5] does not detect low contrast text and small fonts.
Li et al [6] use text with different complex motions. Zhang et al [13] as well
as Sato et al [10] detect only caption text in news video clips. We are able to
detect text under a large number of different conditions, such as text with small fonts, low intensity, different colors and cluttered backgrounds, text from noisy video, news captions with horizontal scrolling, and both caption text and scene text.
Table 1. Recall and precision of text block extraction

                      News   Sports   Lectures   Commercials
No. of text blocks     780      144        120           324
TP                     624      120         96           288
FP                      52       60         24            36
FN                     156       24         24            36
Recall (%)              80     83.3         80            88
Precision (%)           92     66.6         80            88
Table 2. Execution time of text extraction

              Machine used      Image size   Processing Time (sec)
Proposed      PIV               —            0.14
[12]          Sun Ultra Sparc   320*240      1.2
[4]           —                 320*240      0.47
[11]          PIV               —            1.7
The primary advantage of the proposed method is that it is very fast, since most of the computationally intensive algorithms are applied only on the regions of interest. Table 2 shows the processing time for different types of video clips using a 2.4 GHz Pentium-IV machine. We show the comparative time required by different algorithms, including those proposed in [4], [11] and [12]. For our algorithm the average is taken over a number of different image sizes. It is seen that our algorithm requires the least time for processing each frame. Since we process only the I-frames, which occur at a rate of about three per second, we are able to achieve real-time processing speed in our system.
5 Conclusions
We have presented a hybrid approach for the detection of text regions and recognition of texts from images and video frames. It can detect both scene text and
caption text. A content-based video retrieval system has been developed in which
keywords are extracted from video frames based on their textual content. The
keywords are stored and indexed in a database for retrieval.
We plan to extend our work to compressed domain processing to make
it even faster. A more accurate OCR will also improve the quality of retrieval
further.
Acknowledgement
The work done by Shamik Sural is supported by research grants from the Department of Science and Technology, India, under Grant No. SR/FTP/ETA-20/2003
and by a grant from IIT Kharagpur under ISIRD scheme No. IIT/SRIC/ISIRD/
2002-2003.
References
1. L. Agnihotri and N. Dimitrova: Text Detection in Video Segments. Proc. of
Workshop on Content Based Access to Image and Video Libraries, pp. 109-113,
June 1999
2. Y. M. Y. Hasan and L. J. Karam: Morphological Text Extraction from Images.
IEEE Transactions on Image Processing. Vol. 9, Nov. 2000
3. A. K. Jain and B. Yu: Automatic Text Location in Images and Video Frames.
Pattern Recognition, Vol. 31, No.12, pp. 2055-2076, 1998
4. K. Jung and J. H. Han: Hybrid Approach to Efficient Text Extraction in Complex
Color Images. Pattern Recognition Letters Vol. 25, pp. 679-699, 2004
5. H-K. Kim: Efficient Automatic Text Location Method and Content-Based Indexing
and Structuring of Video Database. Journal of Visual Communication and Image
Representation, Vol. 7, No 4. pp. 336-344, Dec 1996
6. H. Li, D. Doermann and O. Kia: Automatic Text Detection and Tracking in Digital
Video. IEEE Transactions on Image Processing. Vol. 9, pp. 147-156, Jan. 2000
7. R. Lienhart and A Wernicke: Localizing and Segmenting Text in Images and
Videos. IEEE Transactions on Circuits and Systems for Video Technology, Vol.
12, No. 4, pp. 256-268, April 2002
8. J. Malobabic, N. O’Connor, N. Murphy and S. Marlow: Automatic Detection and
Extraction of Artificial Text in Video. Adaptive Information Cluster, Centre for Digital Video Processing, Dublin City University, 2002
9. A. S. Nugroho, S. Kuroyanagi and A. Iwata: An Algorithm for Locating Characters in Color Image using Stroke Analysis Neural Network. Proc. of the 9th
International Conference on Neural Information Processing (ICONIP’02), November 18-22, 2002
10. T. Sato, T. Kanade, E. Hughes and M. Smith: Video OCR: Indexing Digital News
Libraries by Recognition of Superimposed Captions. Multimedia Systems, Vol. 7,
pp. 385-394, 1999
11. J.C. Shim, C. Dorai and R. Bolle: Automatic Text Extraction from Video for
Content-Based Annotation and Retrieval. Proc. of the 14th International Conference on Pattern Recognition, Vol. 1, pp. 618-620, Brisbane, Australia, August
1998
12. E. K. Wong and M. Chen: A New Robust Algorithm for Video Text Extraction. Pattern
Recognition, Vol. 36, No. 6, pp. 1397-1406, June 2003
13. D. Zhang, B. L. Tseng, C. Y. Lin and S. F. Chang: Accurate Overlay Text Extraction For Digital Video Analysis. Columbia University Advent Group Technical
Report, 2003