
2016 15th International Conference on Frontiers in Handwriting Recognition

New Tampered Features for Scene and Caption Text Classification in Video Frame

Sangheeta Roy¹, Palaiahnakote Shivakumara¹, Umapada Pal², Tong Lu³ and Chew Lim Tan⁴
¹Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
²Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata, India
³National Key Lab for Novel Software Technology, Nanjing University, Nanjing, China
⁴Department of Computer Science, National University of Singapore
DOI: 10.1109/ICFHR.2016.17

Abstract—The presence of both caption/graphics/superimposed and scene texts in video frames is the major cause of the poor accuracy of text recognition methods. This paper proposes an approach for identifying tampered information by analyzing the spatial distribution of DCT coefficients in a new way for classifying caption and scene text. Since caption text is edited/superimposed, it results in artificially created text, in contrast to scene text, which exists naturally in frames. We exploit this fact to identify the presence of caption and scene texts in video frames based on the advantage of DCT coefficients. The proposed method analyzes the distributions of both zero and non-zero coefficients (only positive values) locally by moving a window, and studies histogram operations over each input text line image. This generates line graphs for the respective zero and non-zero coefficient coordinates. We further study the behavior of text lines, namely, linearity and smoothness, based on centroid location analysis and the principal axis direction of each text line for classification. Experimental results on standard datasets, namely, ICDAR 2013 video, ICDAR 2015 video, YVT video and our own data, show that the performances of text recognition methods are improved significantly after classification compared to before classification.

Keywords—Caption text recognition, Video text recognition, Tampered text, DCT coefficients, Classification of caption text, Classification of scene text.

I. INTRODUCTION

Video has become one of the main media for communication in entertainment, daily surveillance, security applications, etc. This results in a huge collection of databases containing heterogeneous videos [1], and it leads to vast potential demands for efficient algorithms for indexing and retrieval [1]. Despite several methods available in the literature for video indexing and retrieval in the field of content based image analysis, there is a need for filling the gap between high level and low level features. Therefore, recent research suggests the use of text that appears in video for generating semantics through text detection and recognition [2]. For video understanding through text information, many methods have been developed in the past decades [3-6]. However, the performances of these methods are inconsistent and not satisfactory. The main reason is the presence of two types of texts, namely, caption text, which is manually edited, and scene text, which appears naturally in images, in each single frame of a video. Since caption text is edited, it has good quality, clarity, contrast, uniform color, uniform text size and font, and very often it is horizontal and displayed at the bottom of the video. On the other hand, scene text exists naturally. It can have multiple colors, fonts, text sizes, orientations, contrasts, complex backgrounds and distortions due to illumination effects, motion blur, etc. Therefore, achieving better results for video with a single method is not as easy as for only one type of text (e.g. only natural scene text or scanned text). This is evident from Fig. 1, where we can see frames with only caption texts, only scene texts, and both caption and scene texts in Fig. 1(a)-Fig. 1(c), respectively. Figs. 1(a)-(c) show that the text detection method in [4], which proposes histogram oriented moments for text detection in video, and the recognition method in [5], which proposes a Bayesian classifier for video text recognition through binarization, give good results for video with caption texts as shown in Fig. 1(a). However, the same text detection method produces more false positives for the natural scene image and the recognition method fails for natural scene texts as shown in Fig. 1(b). Similarly, for the frame with both caption and scene texts in Fig. 1(c), the text detection method detects caption texts well but misses scene texts, while the recognition method gives good results for both texts. This shows that the presence of both caption and scene texts leads to confusion, and hence poor or inconsistent results are achieved by text detection and recognition methods.

[Fig. 1: Text detection, binarization and recognition results for three types of frames using the method in [5]. (a) Frame with only caption text. (b) Frame with only scene text. (c) Frame with both caption and scene texts.]

Hence, this work aims to develop a new method for classifying caption and scene texts, such that we can accordingly choose an appropriate method to achieve our goal. For instance, Khare et al. [4] proposed a multi-oriented video text detection method, which scores 78% F-measure on ICDAR 2013 video containing only scene type texts, while the same method scores 82% F-measure on their own video data, which contains both caption and scene texts. Anthimopoulos et al. [1] proposed a method for detecting both caption and scene texts in video, which scores 98% F-measure for video data and 70% for ICDAR 2003 scene data containing natural scene texts. The same observation is true for recognition of video texts. For example, Roy et al. [5] proposed a Bayesian classifier for recognition of video text, which scores a 56.18% character recognition rate for horizontal caption texts and a 21.12% character recognition rate for non-horizontal scene text data. The above discussion shows that the performances of the methods are not consistent when the dataset changes.

To overcome the above problems, existing methods are generally developed by focusing on only one type of text. For example, Wolf and Jolion [6] proposed a method for the extraction and recognition of artificial texts in multimedia documents. However, methods that focus on a single type of text may not be suitable for video text recognition, as discussed above. Therefore, to find a solution to this problem, Shivakumara et al. [7] recently proposed a method for the separation of graphics and scene texts in video frames. This method explores edge patterns of caption and scene texts for classification. However, it does not utilize temporal information. Xu et al. [8] proposed a method for the classification of graphics and scene texts using temporal frames. It is noted that since both methods rely on pixel information in the spatial domain, they are not robust to noise and distortion.

Therefore, in this work, we propose a novel idea which explores the property of DCT coefficients in the frequency domain for caption and scene text classification at the line level. Since caption text is edited, it is considered as tampered text, while scene text, being a part of the image, is considered as normal text according to the method in [9]. Inspired by the work proposed in [9] for text detection using forgery detection based on the blocking effect, we explore the same tampered property using DCT coefficients in this work. In addition, DCT coefficients have been explored to identify tampered regions in general images with the help of training and a classifier [10, 11]. It is true that the DCT coefficient matrix contains high energy values towards the top left corner, and the number of high energy values gradually decreases towards the bottom right corner for normal text line images [12, 13] (i.e., for scene text lines). We cannot expect the same pattern for a caption text line image because it is a tampered text. Therefore, in this work, we exploit zero and non-zero DCT coefficients for identifying tampered information for text lines in video frames. To the best of our knowledge, this is the first work to introduce tampered features for the classification of caption and scene texts.

II. PROPOSED METHOD

Based on a literature survey on text detection in video [2-4], we have found that the method in [3] gives reasonable results in spite of some inconsistency for video and scene text detection. In addition, this method does not have any limitation on orientation. Therefore, we use this method for text line detection rather than cropping text lines manually. However, it is found that text detection methods detect texts regardless of caption and scene text types with some inconsistency, and they do not have the ability to identify them [3, 7, 8]. Therefore, it is necessary to develop a method for differentiating them to improve recognition rates, because these two text types differ in quality, clarity, and contrast as discussed in Section I.

It is true that caption text in a video frame is superimposed and hence can be considered as tampered text [9], while scene text is a part of the image. This cue motivated us to explore DCT coefficients [12, 13]. Therefore, we explore the distribution of zero and non-zero coefficients (positive values) over text line images to identify them. To extract such cues, we obtain lines according to the distribution of zero and non-zero coefficients over a text line image. We propose a new idea of studying the linearity and smoothness properties of these lines based on iteratively checking whether the centroid of a line falls on the line itself. If the centroid falls on the same line, this is considered a straightness property, or else a cursiveness property. The relationship between the line that represents zero coefficients and the line that represents non-zero coefficients is defined as Rule-1 for classifying text types. Since this is a complex classification problem and one idea may not be sufficient, we propose one more idea that counts crossing points, where the principal axis crosses over the actual coefficient line, to study the smoothness of the lines with respect to zero and non-zero coefficients (positive values). The proposed method finds the relationship between these two lines to define Rule-2 for classifying them. Further, we combine both Rule-1 and Rule-2 to achieve better results.

[Fig. 2: DCT coefficients distribution for caption and scene text line images. (a) Inputs: caption text line and scene text line. (b) DCT coefficients of the caption and scene text images in (a). (c) Distribution of zero and non-zero coefficients for the caption and scene text line images in (b); green represents zero and red represents non-zero coefficients.]

A. Tampered Cue for Caption Text Presence Detection

For the text lines shown in Fig. 2(a) as caption and scene texts, the proposed method obtains the corresponding DCT images as shown in Fig. 2(b). It is noted from the DCT images in Fig. 2(b) that for caption texts, high DCT coefficients are scattered over the image, while for scene texts, DCT coefficients are clustered at the top left corner of the image. The same observation can be confirmed from the distribution of zero coefficients shown in Fig. 2(c), where zero coefficients are denoted by green color and non-zero coefficients are denoted by red color. It is seen from Fig. 2(c) that for caption texts, zero coefficients are scattered over the image, while for scene texts, dense zero coefficients gradually increase towards the bottom right corner. This cue leads to the extraction of tampered information for the classification of caption and scene text in video frames.
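As a concrete illustration of this cue, the following sketch (ours, not the authors' code) computes the 2-D DCT of a grayscale text-line crop and marks where zero and positive coefficients fall. It assumes NumPy and SciPy are available and that `line_img` is a grayscale crop of a detected text line; the threshold `eps` for treating a coefficient as zero is our assumption.

```python
# Illustrative sketch only: map zero vs. non-zero (positive) DCT coefficients
# of a text-line image, in the spirit of Fig. 2(b)-(c).
import numpy as np
from scipy.fft import dctn

def dct_coefficient_masks(line_img, eps=1e-6):
    """Return the 2-D DCT of the crop plus boolean masks of (near-)zero
    and positive coefficients."""
    coeffs = dctn(line_img.astype(np.float64), norm="ortho")
    zero_mask = np.abs(coeffs) < eps      # coefficients treated as zero
    positive_mask = coeffs > eps          # non-zero (positive) coefficients
    return coeffs, zero_mask, positive_mask

# For a scene text line, large coefficients are expected to cluster near the
# top-left corner; for a caption (tampered) line they tend to be scattered.
```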

To extract such cues given by the DCT coefficients, we perform a window operation in a non-overlapping way over text lines, as shown in Fig. 3(a) for a caption text line, where we take the height of the text as the width to define the window size. Since this work considers text lines detected by the text detection method, which also provides the direction of text lines, moving the window in arbitrary orientations is not an issue. For each window, we compute the percentages of zero and non-zero coefficients (positive values) as defined in equation (1) and equation (2), respectively. The percentage calculation makes the method invariant to the different dimensions of text lines. The effect of the distribution of percentage values for the whole text line can be seen in Fig. 3(b), where it is noted that for caption texts, the distributions of zero coefficients (red bars) and non-zero coefficients (blue bars) do not have uniform variations, while for scene texts, both coefficients have uniform variations. There is a gradual change in the percentage of zero coefficient values (red bars) as the window moves over the text line in the case of scene texts, and almost the same variation for non-zero coefficients (blue bars).

[Fig. 3: Linear and non-linear behavior of zero and non-zero DCT coefficients over caption and scene text lines. (a) Non-overlapping windows for the caption text line of Fig. 2(a). (b) Percentage of zero (red) and non-zero (blue) coefficients computed for each window of the caption and scene text lines of Fig. 2(a). (c) Line graphs for the values in (b): the red line represents zero coefficients and the blue line represents non-zero coefficients.]

To extract such behavior of the distributions of zero and non-zero coefficients, we plot line graphs as shown in Fig. 3(c) for the same values in Fig. 3(b), where we can visualize the same observations in the form of smoothness and non-smoothness of the lines to differentiate caption and scene texts. We map the line graphs to image formats to study the linearity and smoothness of the lines.

$$PZC_w = \frac{ZC_w \times 100}{ZC + NZC} \qquad (1)$$

where $ZC_w$ represents the count of zero coefficients in sliding window $w$, $ZC$ denotes the total number of zero coefficients in the image, and $NZC$ denotes the total number of non-zero coefficients.

$$PNZC_w = \frac{NZC_w \times 100}{ZC + NZC} \qquad (2)$$

where $NZC_w$ represents the count of non-zero coefficients in sliding window $w$.

B. Classification of Caption and Scene Texts

To extract features for studying the behavior of the coefficient lines, we map the lines to the spatial domain as shown in Fig. 4(a) for the lines shown in Fig. 3(c). Here the lines are displayed in image format. We consider each line in the images in Fig. 4(a) as input for studying the linearity and smoothness properties of the lines with respect to the zero and non-zero coefficients of caption and scene text lines. We propose a novel iterative method to check whether the centroid of a line falls on the line itself. It is a fact that if the centroid of all points, as well as of all subsets of the points of a line, falls on the line itself, then the line can be considered straight; otherwise it can be considered cursive. In the first iteration, the method considers the whole line and checks whether the centroid falls on it. In the second iteration, the method considers the line reduced by one pixel. This centroid check continues until the last pixel is reached. The proposed method then calculates the percentage of counts ($PMC$) for which the centroid falls on the line, as defined in equation (3). The process is illustrated in Fig. 4(b), where one can expect a larger percentage with respect to straightness for the non-zero coefficient line (top line of the scene text image) than for the zero coefficient line (bottom line of the scene text image) for scene texts. Similarly, the percentage of counts representing the straightness of zero coefficients (bottom line of the caption text image) is lower than that representing non-zero coefficients (top line of the caption text image) for caption texts. Therefore, we can formally define Rule-1 (R1) for identifying tampered text as caption text as in equation (4).

$$PMC = \frac{\text{count of centroids falling on the line} \times 100}{\text{total number of pixels in the line}} \qquad (3)$$

$$R1 = \begin{cases} 1 \ (\text{Scene}), & \text{if } PMC_{NZC} \geq PMC_{ZC} \\ 0 \ (\text{Caption}), & \text{otherwise} \end{cases} \qquad (4)$$

where $NZC$ denotes the non-zero coefficient line, $ZC$ denotes the zero coefficient line, and $PMC$ denotes the percentage of centroids falling on the respective line. Rule R1 helps us to identify a tampered text as a caption based on the linearity of the lines, which in turn helps in the classification of scene texts. Since the classification of caption and scene text is not a simple problem, due to the unpredictable nature of scene text, one property may be insufficient to achieve good results. Therefore, we propose one more novel idea of studying the smoothness and non-smoothness of the lines of caption and scene texts.
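The per-window statistics of equations (1) and (2) can be sketched in a few lines. This is a minimal illustration, assuming the zero/positive coefficient masks from the earlier sketch and non-overlapping windows whose width equals the text-line height; the variable names are ours.

```python
import numpy as np

def window_percentages(zero_mask, positive_mask):
    """Compute PZC_w and PNZC_w (equations (1) and (2)) for each
    non-overlapping window moved along the text line."""
    h, w = zero_mask.shape
    zc = int(zero_mask.sum())              # ZC: total zero coefficients
    nzc = int(positive_mask.sum())         # NZC: total non-zero coefficients
    denom = zc + nzc
    pzc, pnzc = [], []
    for x in range(0, w, h):               # window width = text height
        pzc.append(100.0 * zero_mask[:, x:x + h].sum() / denom)       # eq. (1)
        pnzc.append(100.0 * positive_mask[:, x:x + h].sum() / denom)  # eq. (2)
    return np.array(pzc), np.array(pnzc)

# Plotting pzc and pnzc against the window index gives line graphs such as
# those in Fig. 3(c): roughly uniform variation for scene text, irregular
# variation for caption text.
```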

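The iterative centroid test and Rule-1 (equations (3) and (4)) might be sketched as follows. This is one possible reading of the description, treating each coefficient line graph as a sequence of (index, value) points and interpolating to decide whether the centroid lies on the curve; the tolerance and helper names are our assumptions, not the authors'.

```python
import numpy as np

def pmc(values, tol=1.0):
    """PMC of equation (3): percentage of iterations in which the centroid of
    the remaining points of a line graph falls (within tol) on the line."""
    ys = np.asarray(values, dtype=float)
    xs = np.arange(len(ys), dtype=float)
    hits = 0
    for end in range(len(ys), 1, -1):          # drop one point per iteration
        cx, cy = xs[:end].mean(), ys[:end].mean()
        y_on_line = np.interp(cx, xs[:end], ys[:end])
        if abs(y_on_line - cy) <= tol:         # centroid falls on the line
            hits += 1
    return 100.0 * hits / len(ys)

def rule1(nzc_line, zc_line):
    """Rule-1 of equation (4): 1 -> Scene, 0 -> Caption."""
    return 1 if pmc(nzc_line) >= pmc(zc_line) else 0
```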
[Fig. 4: Extracting the behavior of caption and scene text lines. (a) Converting the line graphs shown in Fig. 3(c) to image format; the bottom line represents zero and the top line represents non-zero coefficients. (b) Studying the linearity and non-linearity behavior of the coefficient lines of caption and scene text lines by extracting the straightness and cursiveness properties; a centroid falling on the line is marked in cyan and a centroid not falling on the line is marked in magenta. (c) Studying the smooth and non-smooth behavior of the coefficient lines of caption and scene text lines by extracting the crossing points given by the principal lines of the text lines; the principal axis is marked as a yellow dotted line and crossing points are marked in green.]

For each line in the caption and scene text images, the method estimates the principal axis using the coordinates of the respective line, as shown in Fig. 4(c), where the yellow dotted line is the principal axis. To understand whether the line is smooth or not, we count the number of crossing points ($CP$) made by the principal axis with the line, as marked in green in Fig. 4(c) for the caption and scene text images. It is observed that the numbers of crossing points of the zero coefficient and non-zero coefficient lines are almost the same for scene texts, while this is not so for caption texts. Therefore, we calculate the percentage of crossing points to define Rule-2 (R2) for classifying caption and scene text as in equation (5).

$$R2 = \begin{cases} 1 \ (\text{Scene}), & \text{if } |PCP_{NZC} - PCP_{ZC}| \leq 1 \\ 0 \ (\text{Caption}), & \text{otherwise} \end{cases} \qquad (5)$$

where $PCP$ denotes the percentage of crossing points. Furthermore, the final classification combines Rule-1 and Rule-2 as defined in equation (6) and equation (7) for scene text and caption text, respectively.

$$C_{scene} = \bigcup_{n=1}^{m} (R1_n, R2_n) \qquad (6)$$

$$C_{caption} = \prod_{n=1}^{m} (R1_n = 0, R2_n = 0) \qquad (7)$$

where $m$ denotes the total number of scene and caption text images.

III. EXPERIMENTAL RESULTS

We use standard datasets which are publicly available as benchmark databases, namely, ICDAR 2013 [14], which contains only scene text lines with large variations, ICDAR 2015 [15], which is slightly more complex than ICDAR 2013, and YVT [16], which contains only scene texts with large background variations. These datasets contain 28, 49 and 30 videos, respectively, which gives 1150 frames in total. One can notice that all three databases contain only scene texts but not caption texts. However, when we consider videos of news channels, movies, sports, etc., the appearance of both caption and scene texts is common. Therefore, we create our own database by collecting videos from YouTube, where we consider different varieties of videos with different fonts, font sizes, backgrounds and contrasts. In total, we collect 32 videos ranging from 2-3 minutes to 15-20 minutes of content, which gives 350 frames with only caption texts, 180 frames with only scene texts, and 300 frames with both scene and caption texts. We use the text detection method in [4] for extracting text lines from the above frame database, which gives 900 caption and 650 scene text lines to evaluate the proposed classification method.

In this work, we use the classification rate computed through a confusion matrix and the recognition rate for evaluating the proposed classification method and the binarization methods, respectively. Classification is validated by conducting recognition experiments before and after classification at the text line level. Recognition before classification calculates the recognition rates of different binarization methods on both caption and scene texts together. Recognition after classification calculates the recognition rates for caption text and scene text separately through the different binarization methods. We expect the accuracies of the binarization methods after classification to be higher than those before classification. This is because, after classification, we can tune the same method or use a different method which suits the class to achieve good results. We use the character recognition rate as the performance measure, computed through several binarization methods and a publicly available OCR engine [17].

To show that the proposed method is superior to existing methods, we use a recent method which classifies graphics and scene texts based on pixel patterns [8] for a comparative study at the text line level. For validating the classification, we implement several binarization methods, namely, the method proposed by Howe [18] for binarizing scene text lines in natural scene images, the method proposed by Roy et al. [5] for binarizing text lines in video using a Bayesian classifier, the binarization method presented by Su et al. [19] based on local contrast information, and the image binarization method developed by Milyaev et al. [20] for natural scene images. The reason for choosing these methods is as follows. The method in [18] works well for low contrast texts and does not require much parameter tuning. The method in [5] is capable of binarizing texts in video, the method in [19] works well for degraded low contrast document images, while the method in [20] is developed for binarizing texts in natural scene images.


Since the considered databases contain low contrast text as in video, high contrast text as in natural scene images, and plain background text as in document analysis, we choose these methods from different domains for a fair validation of the proposed classification method.

A. Experiments for Caption and Scene Text Classification

The proposed classification method involves two key steps using two rules, namely, Rule-1 derived from centroid features and Rule-2 from crossing features. Therefore, to analyze the contribution of each rule, we compute confusion matrices for the individual rules and the combined rule, as reported in Table I. It is observed from Table I that each rule contributes almost equally to achieving better results, because both rules give a similar classification rate. As a result, the combined rule, which takes advantage of both Rule-1 and Rule-2, scores better results than the individual rules.

Sample successful classification results of the proposed method are shown in Fig. 5, where it can be seen that the proposed (combined) method classifies text lines of different backgrounds, fonts and font sizes correctly as caption and scene texts. It is also noticed from Fig. 5 that the proposed classification method is invariant to orientation. In the same way, we present unsuccessful results of the proposed method in Fig. 6, where it is shown that the proposed method still has some issues with background and font variations. Therefore, there is still room for improvement in the near future.

Table I: Confusion matrix of the proposed method using different features at the text line level (classification rate in %)

            Centroids          Crossing           Proposed (combined)
Types       Scene    Caption   Scene    Caption   Scene    Caption
Scene       60.52    39.48     63.81    36.19     68.66    31.34
Caption     42.3     57.7      38.4     61.6      28.62    71.38

[Fig. 5: Samples of successful classification results of the proposed method for caption text lines (left) and scene text lines (right).]

To show that the proposed method is superior to existing methods, we compare the results of the proposed method with the method in [8]. Sample qualitative results for the proposed and the existing method [8] are shown in Fig. 7 for caption and scene texts at the line level. Fig. 7 shows that the proposed method classifies the different text types successfully, while the existing method fails to classify them correctly. It is evident from the quantitative results of the proposed and existing methods reported in Table II that the proposed method is better than the existing method in terms of classification rate. The reason for the poor results of the existing method is that it is sensitive to pixels, as it works at the pixel level depending on the stroke widths of components. On the other hand, the proposed method uses new tampered features in the frequency domain, which are robust compared to spatial domain features.

[Fig. 6: Samples of unsuccessful classification results of the proposed method: caption classified as scene (left) and scene classified as caption (right).]

Table II: Performance of the proposed and existing methods for caption and scene text classification at the text line level (in %)

            Proposed method      Existing method [8]
Types       Scene    Caption     Scene    Caption
Scene       68.66    31.34       65.69    34.31
Caption     28.62    71.38       32.62    67.38

[Fig. 7: Sample qualitative results of the proposed and existing methods for caption text lines and scene text lines. The text line images shown are classified successfully by the proposed method, while the existing method fails on these images.]

B. Experiments for Validating the Classification Method

To show the usefulness of classification at the text line level, we conduct recognition experiments with different binarization methods before and after classification. Since the proposed classification method classifies caption and scene texts in video frames, we tune the key parameters of each binarization method according to the caption and scene text classes. The tuned parameter values are used for after-classification, and the default parameter values are used for before-classification.

Since we modify the parameter values of each method, we expect better results after classification compared to before classification. For the binarization methods listed in Table III, we tune the following parameters: the threshold values for the Canny image in [20], the threshold value for the Bayesian classifier in [5], and the window size for the binarization algorithm in [18]. Since Su et al. [19] provide only an executable, no tuning has been applied to it. The parameter values before and after classification are listed in Table III for each method. It is noticed from Table III that the recognition rates of the existing methods improve significantly after classification compared to before classification, and this is due to the improvement of binarization when different parameters are used for the different methods based on the classification results. In this work, we present one case study to show the effect of classification. However, one can use a new method to achieve even better recognition rates, as in document analysis, by taking advantage of the classification.

Table III: Character recognition rates of different binarization methods before and after classification at the text line level (in %). P denotes the parameter value.

                  Before Classification       After Classification
Methods           Scene and Caption    P      Scene    P      Caption   P
Howe [18]         51.3                 0.4    56.3     0.4    51.7      0.2
Roy et al. [5]    40.6                 0.05   42.8     0.2    47.2      0.03
Su et al. [19]    48.9                 -      55.2     -      49.6      -
Milyaev [20]      42.6                 10     49.5     10     45.7      8

IV. CONCLUSION AND FUTURE WORK

We have introduced a new idea of exploring DCT for identifying tampered information for the classification of caption and scene texts. The proposed method finds the unique relationship between the zero and non-zero coefficients of caption and scene texts to differentiate them. This relationship is extracted by a new iterative centroid checking method and a crossing point detection method based on the principal axis and the coefficient lines of caption and scene texts. To the best of our knowledge, this is the first work to introduce the tampering idea for this classification. Experimental results show that the proposed method classifies caption and scene text successfully at the text line level. The classification is validated by conducting recognition experiments with different binarization methods before and after classification. The performances of the binarization methods improve significantly after classification compared to before classification. It is seen from the experimental results that the proposed method still misclassifies some texts that have different fonts and backgrounds. Our future study would extend the method with the help of learning techniques, such as deep learning, to address this issue.

ACKNOWLEDGMENT

The work described in this paper was supported by the Natural Science Foundation of China under Grant No. 61272218 and No. 61321491, the Natural Science Foundation for Distinguished Young Scholars of Jiangsu under Grant No. BK20160021, and partly by the University of Malaya HIR under Grant No. M.C/625/1/HIR/210.

REFERENCES

[1] M. Anthimopoulos, B. Gatos and I. Pratikakis, "Detection of artificial and scene text in images and video frames", PAA, 2013, 431-446.
[2] Q. Ye and D. Doermann, "Text Detection and Recognition in Imagery: A Survey", IEEE Trans. PAMI, 2015, 1480-1500.
[3] G. Liang, P. Shivakumara, T. Lu and C. L. Tan, "Multi-Spectral Fusion Based Approach for Arbitrarily-Oriented Scene Text Detection in Video Images", IEEE Trans. IP, 2015, 4488-4501.
[4] V. Khare, P. Shivakumara and P. Raveendran, "A new Histogram Oriented Moments descriptor for multi-oriented moving text detection in video", ESWA, 2015, 7627-7640.
[5] S. Roy, P. Shivakumara, P. P. Roy, U. Pal and C. L. Tan, "Bayesian classifier for multi-oriented video text recognition system", ESWA, 2015, 5554-5566.
[6] C. Wolf and J. M. Jolion, "Extraction and recognition of artificial text in multimedia documents", PAA, 2003, 309-326.
[7] P. Shivakumara, N. V. Kumar, D. S. Guru and C. L. Tan, "Separation of graphics (superimposed) and scene text in videos", In Proc. DAS, 2014, 344-348.
[8] J. Xu, P. Shivakumara, T. Lu, T. Q. Phan and C. L. Tan, "Graphics and Scene Text Classification in Video", In Proc. ICPR, 2014, 4714-4719.
[9] A. Hooda, M. Kathuria and V. Pankajakshan, "Application of forgery localization in overlay text detection", In Proc. ICVGIP, 2014.
[10] W. Wang, J. Dong and T. Tan, "Exploring DCT Coefficient Quantization Effects for Local Tampering Detection", IEEE Trans. IFS, 2014, 1653-1666.
[11] X. H. Li, Y. Q. Zhao, M. Liao, F. Y. Shih and Y. Q. Shi, "Detection of tampered region for JPEG images by using mode-based first digit features", EURASIP, 2012, 1-10.
[12] Y. Zhong, H. Zhang and A. K. Jain, "Automatic caption localization in compressed video", IEEE Trans. PAMI, 2000, 385-392.
[13] H. Li, W. Luo and J. Huang, "Anti-forensics of double JPEG compression with the same quantization matrix", MTA, 2015, 6729-6744.
[14] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. Gomez i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan and L. P. de las Heras, "ICDAR 2013 robust reading competition", In Proc. ICDAR, 2013, 1115-1124.
[15] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann and V. R. Chandrasekhar, "ICDAR 2015 Competition on Robust Reading", In Proc. ICDAR, 2015, 1156-1160.
[16] P. Nguyen, K. Wang and S. Belongie, "Video Text Detection and Recognition: Dataset and Benchmark", In Proc. WACV, 2014, 776-783.
[17] Tesseract OCR. http://code.google.com/p/tesseract-ocr/.
[18] N. R. Howe, "Document binarization with automatic parameter tuning", IJDAR, 2013, 247-258.
[19] B. Su, S. Lu and C. L. Tan, "Robust document image binarization technique for degraded document images", IEEE Trans. IP, 2013, 1408-1417.
[20] S. Milyaev, O. Barinova, T. Novikova, P. Kohli and V. Lempitsky, "Image Binarization for End-to-End Text Understanding in Natural Images", In Proc. ICDAR, 2013, 128-132.
