New Tampered Features For Scene and Caption Text Classification in Video Frame
Abstract—The presence of both caption/graphics/superimposed text and scene text in video frames is the major cause of the poor accuracy of text recognition methods. This paper proposes an approach for identifying tampered information by analyzing the spatial distribution of DCT coefficients in a new way for classifying caption and scene text. Since caption text is edited/superimposed, it is artificially created, in contrast to scene text, which exists naturally in frames. We exploit this fact to identify the presence of caption and scene texts in video frames with the help of DCT coefficients. The proposed method analyzes the distributions of both zero and non-zero coefficients (only positive values) locally by moving a window, and studies histogram operations over each input text line image. This generates line graphs for the respective zero and non-zero coefficient coordinates. We further study the behavior of text lines, namely their linearity and smoothness, based on centroid location analysis and the principal axis direction of each text line, for classification. Experimental results on standard datasets, namely ICDAR 2013 video, ICDAR 2015 video, YVT video and our own data, show that the performances of text recognition methods are improved significantly after classification compared to before classification.

Keywords—Caption text recognition, Video text recognition, Tampered text, DCT coefficients, Classification of caption text, Classification of scene text.
I. INTRODUCTION
Video has become one of the main media for communication
in entertainment, daily surveillance, security applications, etc.
This results in huge collections of heterogeneous videos [1] and creates a strong demand for efficient indexing and retrieval algorithms [1]. Despite
several methods available in the literature for video indexing and
retrieval in the field of content-based image analysis, there is a need to fill the gap between high-level and low-level features. Therefore, recent research suggests using the text that appears in video to generate semantics through text detection and recognition [2]. Many methods have been developed in the past decades for video understanding through text information [3-6]. However, the performance of these methods is inconsistent and unsatisfactory. The main reason is the presence of two types of text, namely caption text, which is manually edited, and scene text, which appears naturally in images, within a single video frame. Since caption text is edited, it has good quality, clarity, contrast, uniform color, uniform text size and font, and very often it is horizontal and displayed at the bottom of the video. On the other hand, scene text exists naturally. It can have multiple colors, fonts, text sizes, orientations and contrasts, complex backgrounds, and distortions due to illumination effects, motion blur, etc. Therefore, achieving good results for video with a single method is not as easy as for only one type of text (e.g., only natural scene text or only scanned text). This is evident from Fig. 1, which shows frames with only caption text, only scene text, and both caption and scene texts in Fig. 1(a)-(c), respectively. Figs. 1(a)-(c) show that the text detection method in [4], which proposes histogram-oriented moments for text detection in video, and the recognition method in [5], which proposes a Bayesian classifier for video text recognition through binarization, give good results for video with caption text, as shown in Fig. 1(a). However, the same text detection method produces more false positives for the natural scene image, and the recognition method fails for natural scene text, as shown in Fig. 1(b). Similarly, for the frame with both caption and scene texts in Fig. 1(c), the text detection method detects caption text well but misses scene text, while the recognition method gives good results for both texts. This shows that the presence of both caption and scene texts leads to confusion, and hence poor or inconsistent results are achieved by text detection and recognition methods.

Fig. 1: Text detection, binarization and recognition results for three types of frames using the method in [5]. (a) Frame with only caption text. (b) Frame with only scene text. (c) Frame with both caption and scene texts.
color and non-zero coefficients are denoted by red color. It is seen from Fig. 2(c) that for caption texts, zero coefficients are scattered over the image, while for scene texts, dense zero coefficients gradually increase towards the bottom right corner. This cue leads to the extraction of tampered information for the classification of caption and scene text in video frames.

To extract the cues given by DCT coefficients, we perform a window operation in a non-overlapping way over text lines, as shown in Fig. 3(a) for caption texts, where we consider the height of the text as the width to define the window size. Since this work considers text lines detected by a text detection method which also provides the direction of the text lines, moving the window in arbitrary orientations is not an issue. For each window, we compute the percentages of zero and non-zero coefficients (positive values) as defined in equation (1) and equation (2), respectively. The percentage calculation makes the method invariant to the different dimensions of text lines. The effect of the distribution of percentage values for a whole text line can be seen in Fig. 3(b), where it is noted that for caption texts the distributions of zero coefficients (red bars) and non-zero coefficients (blue bars) do not have uniform variations, while for scene texts both coefficients have uniform variations. There is a gradual change in the percentage of zero coefficient values (red bars) as the window moves over the text line in the case of scene texts, and almost the same variations for non-zero coefficients (blue bars).

Fig. 3: Linear and non-linear behavior of zero and non-zero DCT coefficients over caption and scene text lines. (a) Non-overlapping windows for the caption text line of Fig. 2(a). (b) Percentage of zero (red line) and non-zero (blue line) coefficients computed for each window of the caption and scene text lines of Fig. 2(a). (c) Line graphs for the values in (b): the red line represents zero coefficients and the blue line represents non-zero coefficients.

To extract this behavior of the distributions of zero and non-zero coefficients, we plot line graphs as shown in Fig. 3(c) for the same values in Fig. 3(b), where we can visualize the same observations in the form of smoothness and non-smoothness of the lines to differentiate caption and scene texts. We map the line graphs to image formats to study the linearity and smoothness of the lines.

$$ZPC_w = \frac{ZC_w \times 100}{ZC + NZC} \quad (1)$$

where $ZC_w$ represents the count of $ZC$ in every sliding window $w$, $ZC$ denotes the total number of zero coefficient counts in the image, and $NZC$ refers to the total number of non-zero coefficient counts.

$$NZPC_w = \frac{NZC_w \times 100}{ZC + NZC} \quad (2)$$

where $NZC_w$ represents the count of $NZC$ in every sliding window $w$.

B. Classification of Caption and Scene Texts

To extract features for studying the behavior of the coefficient lines, we map the lines to the spatial domain, as shown in Fig. 4(a) for the lines shown in Fig. 3(c). Here the lines are displayed in image format. We consider each line in the images in Fig. 4(a) as input for studying the linearity and smoothness properties of the lines with respect to the zero and non-zero coefficients of caption and scene text lines. We propose a novel iterative method to check whether the centroid of a line falls on the line itself. If the centroid of all points, as well as of all subsets of the points of a line, falls on the line itself, then the line can be considered a straight one; otherwise, it can be considered a cursive line. In the first iteration, the method considers the whole line and checks whether the centroid falls on it. In the second iteration, the method considers the line reduced by one pixel. This process of checking the centroid continues until the last pixel is reached. The proposed method then calculates the percentage of centroids ($PMC$) that fall on the line, as defined in equation (3). The process is illustrated in Fig. 4(b), where one can expect a larger percentage with respect to straightness for the non-zero coefficient line (top line of the scene text image) than for the zero coefficient line (bottom line of the scene text image) for scene texts. Similarly, the percentage representing the straightness of zero coefficients (bottom line of the caption text image) is lower than that representing non-zero coefficients (top line of the caption text image) for caption texts. Therefore, formally, we can define Rule-1 (R1) for identifying tampered text as caption as in equation (4).

$$PMC = \frac{\text{count of centroids falling on the line} \times 100}{\text{total number of pixels in the line}} \quad (3)$$

$$R1 = \begin{cases} 1 \text{ (Scene)}, & \text{if } PMC_{NZC} \geq PMC_{ZC} \\ 0 \text{ (Caption)}, & \text{otherwise} \end{cases} \quad (4)$$

where $NZC$ denotes the non-zero coefficient line, $ZC$ denotes the zero coefficient line, and $PMC$ denotes the percentage of centroids falling on the respective lines. This rule (R1) helps us to identify tampered text as caption based on the linearity of the lines, which in turn helps in the classification of scene texts. Since the classification of caption and scene text is not a simple problem due to the unpredictable nature of scene text, one property may be insufficient to achieve good results. Therefore, we propose one more novel idea of studying the smoothness and non-smoothness of the lines of caption and scene texts.
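To make the feature computation above concrete, the following is a minimal sketch (not the authors' implementation) of equations (1)-(4), assuming a grayscale text-line image as input. The tolerance used to decide when a DCT coefficient counts as zero, the one-pixel test for "the centroid falls on the line", and the helper names coefficient_percentages, centroid_on_line_percentage and rule1 are our own illustrative choices; the comparison direction in rule1 follows our reading of equation (4).

```python
import numpy as np
from scipy.fft import dctn

def coefficient_percentages(text_line, eps=1e-6):
    """Per-window percentages of zero and non-zero (positive) DCT coefficients
    over a grayscale text-line image, as in eqs. (1)-(2). The window width
    equals the text height, and the windows do not overlap."""
    h, w = text_line.shape
    win = h
    # Totals ZC and NZC are computed over the whole text line image.
    coeffs_full = dctn(text_line.astype(float), norm="ortho")
    zc = int(np.sum(np.abs(coeffs_full) <= eps))   # zero coefficients (within tolerance)
    nzc = int(np.sum(coeffs_full > eps))           # positive (non-zero) coefficients
    denom = zc + nzc
    zpc, nzpc = [], []
    for x in range(0, w - win + 1, win):           # non-overlapping windows
        coeffs = dctn(text_line[:, x:x + win].astype(float), norm="ortho")
        zc_w = int(np.sum(np.abs(coeffs) <= eps))
        nzc_w = int(np.sum(coeffs > eps))
        zpc.append(100.0 * zc_w / denom)           # equation (1)
        nzpc.append(100.0 * nzc_w / denom)         # equation (2)
    return np.array(zpc), np.array(nzpc)

def centroid_on_line_percentage(points, tol=1.0):
    """Iterative centroid check behind Rule-1 (eqs. (3)-(4)): shrink the
    coefficient line by one point per iteration and count how often the
    centroid of the remaining points lies (within tol pixels) on the line."""
    pts = np.asarray(points, dtype=float)
    hits = 0
    for end in range(len(pts), 0, -1):
        subset = pts[:end]
        centroid = subset.mean(axis=0)
        if np.min(np.linalg.norm(subset - centroid, axis=1)) <= tol:
            hits += 1
    return 100.0 * hits / len(pts)                 # equation (3)

def rule1(pmc_nzc, pmc_zc):
    """Rule-1 (eq. (4)) under our reading: the non-zero coefficient line of a
    scene text line is at least as straight as its zero coefficient line."""
    return "scene" if pmc_nzc >= pmc_zc else "caption"
```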
Fig. 4: Extracting the behavior of the caption and scene text lines. (a) Converting the line graphs shown in Fig. 3(c) to image format; the bottom line represents zero and the top line represents non-zero coefficients. (b) Studying the linearity and non-linearity behavior of the coefficient lines of caption and scene text lines by extracting straightness and cursiveness properties; centroids falling on the line are marked in cyan and centroids not falling on the line are marked in magenta. (c) Studying the smooth and non-smooth behavior of the coefficient lines of caption and scene text lines by extracting the crossing points given by the principal axes of the text lines; the principal axis is marked as a yellow dotted line and the crossing points are marked in green.

For each line in the caption and scene text images, the method estimates the principal axis using the coordinates of the respective line, as shown in Fig. 4(c), where the yellow dotted line is the principal axis. To determine whether a line is smooth or not, we count the number of crossing points ($PC$) made by the principal axis with the line, as marked in green in Fig. 4(c) for the caption and scene text images. It is observed that the numbers of crossing points of the zero coefficient and non-zero coefficient lines are almost the same for scene texts, while this is not so for caption texts. Therefore, we calculate the percentage of crossing points to define Rule-2 (R2) for classifying caption and scene text as in equation (5):

$$R2 = \begin{cases} 1 \text{ (Scene)}, & \text{if } |PPC_{NZC} - PPC_{ZC}| \leq 1 \\ 0 \text{ (Caption)}, & \text{otherwise} \end{cases} \quad (5)$$

where $PPC$ denotes the percentage of crossing points. Furthermore, the final classification combines Rule-1 and Rule-2 as defined in equation (6) and equation (7) for scene text and caption text, respectively:

$$Scene = \bigcup_{n=1}^{m} (R1_n, R2_n) \quad (6)$$

$$Caption = \prod_{n=1}^{m} (R1_n = 0, R2_n = 0) \quad (7)$$

where $m$ denotes the total number of scene and caption text images.
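Before moving to the experiments, the following is a rough sketch (ours, not the paper's code) of Rule-2 and the final combination in equations (5)-(7). The principal axis is approximated here by a least-squares line fit (the major axis from PCA would be an equally valid choice), the tolerance of 1 in rule2 mirrors the threshold we read from equation (5), and the function names are illustrative.

```python
import numpy as np

def crossing_point_percentage(points):
    """Percentage of crossing points (PPC) behind Rule-2 (eq. (5)): the
    principal axis is approximated by a least-squares straight line, and a
    crossing is counted wherever the coefficient line passes from one side
    of that axis to the other."""
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    slope, intercept = np.polyfit(x, y, 1)     # fitted principal axis
    residual = y - (slope * x + intercept)     # signed offset from the axis
    crossings = int(np.sum(np.diff(np.sign(residual)) != 0))
    return 100.0 * crossings / len(pts)

def rule2(ppc_nzc, ppc_zc, tol=1.0):
    """Rule-2 (eq. (5)): similar crossing percentages on the zero and
    non-zero coefficient lines indicate scene text."""
    return "scene" if abs(ppc_nzc - ppc_zc) <= tol else "caption"

def combine(r1, r2):
    """Combination in the spirit of eqs. (6)-(7): a text line is labelled
    caption only when both rules vote caption, and scene otherwise."""
    return "caption" if (r1 == "caption" and r2 == "caption") else "scene"
```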
III. EXPERIMENTAL RESULTS

We use standard datasets which are publicly available as benchmark databases, namely ICDAR 2013 [14], which contains only scene text lines with large variations, ICDAR 2015 [15], which is slightly more complex than ICDAR 2013, and YVT [16], which contains only scene texts with large background variations, for experimentation. These datasets contain 28, 49 and 30 videos, respectively, which gives 1150 frames in total. One can notice that all three databases contain only scene texts but no caption texts. However, in videos of news channels, movies, sports, etc., the appearance of both caption and scene texts is common. Therefore, we create our own database by collecting videos from YouTube, where we consider different varieties of videos with different fonts, font sizes, backgrounds and contrasts. In total, we collect 32 videos ranging from 2-3 minutes to 15-20 minutes of content, which gives 350 frames with only caption texts, 180 frames with only scene texts, and 300 frames with both scene and caption texts. We use the text detection method in [4] to extract text lines from the above frame database, which gives 900 caption and 650 scene text lines to evaluate the proposed classification method.

In this work, we use the classification rate through a confusion matrix and the recognition rate to evaluate the proposed classification method and the binarization methods, respectively. Classification is validated by conducting experiments on recognition before and after classification at the text line level. Recognition before classification calculates the recognition rates of different binarization methods on both caption and scene texts together. Recognition after classification calculates the recognition rates for caption text and scene text separately through different binarization methods. We expect the accuracies of the binarization methods after classification to be higher than those before classification, because after classification we can tune the same method, or use a different method that suits the class, to achieve good results. We use the character recognition rate as the performance measure, obtained through several binarization methods and a publicly available OCR engine [17].

To show that the proposed method is superior to existing methods, we use a recent method which classifies graphics and scene texts based on the pixel patterns of graphics and scene texts [8] for a comparative study at the text line level. For validating the classification, we implement several binarization methods, namely the method proposed by Howe [18] for binarizing scene text lines in natural scene images, the method proposed by Roy et al. [5] for binarizing text lines in video using a Bayesian classifier, the binarization method presented by Su et al. [19] based on local contrast information, and the image binarization method developed by Milyaev et al. [20] for natural scene images. The reason for choosing these methods is as follows. The method in [18] works well for low contrast texts and does not require much parameter tuning. The method in [5] is capable of binarizing texts in video, the method in [19] works well for degraded low contrast document images, while the method in [20] is developed for binarizing texts in natural scene images. Since the considered databases contain low contrast text as in video, high contrast text as in natural scene images, and plain background text as in document analysis, we choose these methods from different domains for a fair validation of the proposed classification method.

A. Experiments for Caption and Scene Text Classification

The proposed classification method involves two key steps using two rules, namely Rule-1, derived with centroid features, and Rule-2, with crossing features. Therefore, to analyze the contribution of each rule, we compute confusion matrices for the individual rules and the combined rule, as reported in Table I. It is observed from Table I that each rule contributes almost equally, since both rules give similar classification rates. As a result, the combined rule, which takes advantage of both Rule-1 and Rule-2, scores better results than the individual rules.

Sample successful classification results of the proposed method are shown in Fig. 5, where it can be seen that the proposed method (the combined rule) classifies text lines of different backgrounds, fonts and font sizes correctly as caption and scene texts. It is noticed from Fig. 5 that the proposed classification method is invariant to orientation. In the same way, we also present unsuccessful results of the proposed method in Fig. 6, where it is shown that the proposed method still has some issues with background and font variations. Therefore, there is still room for improvement in the near future.

Table I: Confusion matrix of the proposed method at the text line level (classification rate using different features, in %)

Features | Centroids         | Crossing          | Proposed (combined)
Types    | Scene    Caption  | Scene    Caption  | Scene    Caption
Scene    | 60.52    39.48    | 63.81    36.19    | 68.66    31.34
Caption  | 42.3     57.7     | 38.4     61.6     | 28.62    71.38

method in [8]. Sample qualitative results for the proposed and the existing method [8] are shown in Fig. 7 for caption and scene texts at the line level. Fig. 7 shows that the proposed method classifies the different text types successfully, while the existing method fails to classify them correctly. It is evident from the quantitative results of the proposed and existing methods reported in Table II that the proposed method is better than the existing method in terms of classification rate. The reason for the poor results of the existing method is that it is sensitive to pixels, as it works at the pixel level depending on the stroke widths of components. On the other hand, the proposed method uses new tampered features in the frequency domain, which are robust compared to spatial domain features.

Fig. 6: Samples of unsuccessful classification results of the proposed method: caption classified as scene, and scene classified as caption.

Table II: Performance of the proposed and existing methods for caption and scene text classification at text line level (in %)

Types    | Proposed method   | Existing method [8]
         | Scene    Caption  | Scene    Caption
Scene    | 68.66    31.34    | 65.69    34.31
Caption  | 28.62    71.38    | 32.62    67.38
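As a small aside on how the rates in Tables I and II can be reproduced from per-line labels, here is a minimal sketch (our own, assuming each detected text line carries a ground-truth label and a predicted label) of a row-normalized two-class confusion matrix.

```python
import numpy as np

def confusion_matrix_percent(true_labels, pred_labels, classes=("scene", "caption")):
    """Row-normalized confusion matrix in percent: entry (i, j) is the share
    of class-i text lines classified as class j, which is how the
    classification rates in Tables I and II are reported."""
    index = {c: k for k, c in enumerate(classes)}
    counts = np.zeros((len(classes), len(classes)))
    for t, p in zip(true_labels, pred_labels):
        counts[index[t], index[p]] += 1
    return 100.0 * counts / counts.sum(axis=1, keepdims=True)

# Toy example with three scene lines and two caption lines.
print(confusion_matrix_percent(
    ["scene", "scene", "scene", "caption", "caption"],
    ["scene", "caption", "scene", "caption", "caption"]))
```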
classification. Since we modify the parameter values of each method, we expect better results after classification compared to before classification. For the binarization methods listed in Table III we tune the parameters, namely the threshold values for the Canny image [20], the threshold value for the Bayesian classifier [5], and the window size for the binarization algorithm [18]. Since Su et al. [19] provide only an executable, no tuning has been applied. The parameter values before and after classification are listed in Table III for each method. It is noticed from Table III that the recognition rates of the existing methods improve significantly after classification compared to before classification, and this is due to the improvement of binarization using different parameters for the different methods based on the classification results. In this work, we present one case study to show the effect of classification. However, one can use a new method, as in document analysis, to achieve even better recognition rates by taking advantage of the classification.

Table III: Character recognition rates of different binarization methods before and after classification at text line level (in %). P denotes the tuned parameter value.

Methods         | Before Classification        | After Classification
                | Scene and Caption      P     | Scene     P     | Caption     P
Howe [18]       | 51.3                   0.4   | 56.3      0.4   | 51.7        0.2
Roy et al. [5]  | 40.6                   0.05  | 42.8      0.2   | 47.2        0.03
Su et al. [19]  | 48.9                   -     | 55.2      -     | 49.6        -
Milyaev [20]    | 42.6                   10    | 49.5      10    | 45.7        8

IV. CONCLUSION AND FUTURE WORK

We have introduced a new idea of exploring DCT for identifying tampered information for the classification of caption and scene texts. The proposed method finds the unique relationship between the zero and non-zero coefficients of caption and scene texts to differentiate them. The unique relationship is extracted by a new iterative centroid checking method and a crossing point detection method based on the principal axes of the coefficient lines of caption and scene texts. To the best of our knowledge, this is the first work to introduce the idea of tampering for this classification. Experimental results show that the proposed method classifies caption and scene text successfully at the text line level. The classification is validated by conducting experiments on recognition through different binarization methods before and after classification. The performances of the binarization methods improve significantly after classification compared to before classification. It is seen from the experimental results that the proposed method still misclassifies some of the texts that have different fonts and backgrounds. Our future study will extend the method with the help of learning techniques, such as deep learning, to address this issue.

ACKNOWLEDGMENT

The work described in this paper was supported by the Natural Science Foundation of China under Grant No. 61272218 and No. 61321491, the Natural Science Foundation for Distinguished Young Scholars of Jiangsu under Grant No. BK20160021, and partly supported by the University of Malaya HIR under Grant No. M.C/625/1/HIR/210.

REFERENCES

[1] M. Anthimopoulos, B. Gatos and I. Pratikakis, "Detection of artificial and scene text in images and video frames", PAA, 2013, 431-446.
[2] Q. Ye and D. Doermann, "Text Detection and Recognition in Imagery: A Survey", IEEE Trans. PAMI, 2015, 1480-1500.
[3] G. Liang, P. Shivakumara, T. Lu and C. L. Tan, "Multi-Spectral Fusion Based Approach for Arbitrarily-Oriented Scene Text Detection in Video Images", IEEE Trans. IP, 2015, 4488-4501.
[4] V. Khare, P. Shivakumara and P. Raveendran, "A new Histogram Oriented Moments descriptor for multi-oriented moving text detection in video", ESWA, 2015, 7627-7640.
[5] S. Roy, P. Shivakumara, P. P. Roy, U. Pal and C. L. Tan, "Bayesian classifier for multi-oriented video text recognition system", ESWA, 2015, 5554-5566.
[6] C. Wolf and J. M. Jolion, "Extraction and recognition of artificial text in multimedia documents", PAA, 2003, 309-326.
[7] P. Shivakumara, N. V. Kumar, D. S. Guru and C. L. Tan, "Separation of graphics (superimposed) and scene text in videos", In Proc. DAS, 2014, 344-348.
[8] J. Xu, P. Shivakumara, T. Lu, T. Q. Phan and C. L. Tan, "Graphics and Scene Text Classification in Video", In Proc. ICPR, 2014, 4714-4719.
[9] A. Hooda, M. Kathuria and V. Pankajakshan, "Application of forgery localization in overlay text detection", In Proc. ICVGIP, 2014.
[10] W. Wang, J. Dong and T. Tan, "Exploring DCT Coefficients Quantization Effects for Local Tampering Detection", IEEE Trans. IFS, 2014, 1653-1666.
[11] X. H. Li, Y. Q. Zhao, M. Liao, F. Y. Shih and Y. Q. Shi, "Detection of tampered region for JPEG images by using mode-based first digit features", EURASIP, 2012, 1-10.
[12] Y. Zhong, H. Zhang and A. K. Jain, "Automatic caption localization in compressed video", IEEE Trans. PAMI, 2000, 385-392.
[13] H. Li, W. Luo and J. Huang, "Anti-forensics of double JPEG compression with the same quantization matrix", MTA, 2015, 6729-6744.
[14] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. i Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan and L. P. de las Heras, "ICDAR 2013 robust reading competition", In Proc. ICDAR, 2013, 1115-1124.
[15] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann and V. R. Chandrasekhar, "ICDAR 2015 Competition on Robust Reading", In Proc. ICDAR, 2015, 1156-1160.
[16] P. Nguyen, K. Wang and S. Belongie, "Video Text Detection and Recognition: Dataset and Benchmark", In Proc. WACV, 2014, 776-783.
[17] Tesseract OCR. http://code.google.com/p/tesseract-ocr/
[18] N. R. Howe, "Document binarization with automatic parameter tuning", IJDAR, 2013, 247-258.
[19] B. Su, S. Lu and C. L. Tan, "Robust Document Image Binarization Technique for Degraded Document Images", IEEE Trans. IP, 2013, 1408-1417.
[20] S. Milyaev, O. Barinova, T. Novikova, P. Kohli and V. S. Lempitsky, "Image Binarization for End-to-End Text Understanding in Natural Images", In Proc. ICDAR, 2013, 128-132.