A New Forensic Video Database for Source Smartphone Identification: Description and Analysis
ABSTRACT In recent years, digital imaging has advanced to the point where virtually every smartphone has a built-in video camera that can record high-quality video at no cost and without restrictions. At the same time, rapidly growing internet technology has contributed significantly to the widespread distribution of digital video via web-based multimedia systems and mobile smartphone applications such as YouTube, Facebook, Twitter, and WhatsApp. However, as the recording and distribution of digital videos have become affordable, security issues have become a worldwide threat. One of these issues is identifying the source camera of a video, and several new challenges in this area remain to be addressed. One such challenge is individual source camera identification (ISCI), which focuses on distinguishing each physical device regardless of its model. A first step towards solving these problems is a widely available video database recorded with modern smartphone devices, which can also serve the deep learning methods that are growing rapidly in the field of source camera identification. In this paper, a smartphone video database named the Qatar University Forensic Video Database (QUFVD) is introduced. The QUFVD includes 6000 videos from 20 modern smartphones representing five brands; each brand has two models, and each model has two identical smartphone devices. This database is suitable for evaluating different techniques, such as deep learning methods, for video source smartphone identification and verification. To evaluate the QUFVD, a series of experiments identifying the source camera with a deep learning technique is conducted. The results show that improvements are essential for the ISCI scenario on video.
INDEX TERMS Database, smartphone, source camera identification on videos, deep learning methods.
forged ones. The original videos are suitable for the source camera identification purpose. About 150 videos were collected from three sources; the set was later extended by [19]. The method presented in [20] was tested in the study.

The VISION database,2 introduced in [21], is the most popular database in the field. In total, 35 portable devices from 11 major brands contributed 34,427 images and 1914 videos, in both native and social formats (Facebook, YouTube, and WhatsApp are included). Videos were captured in indoor, outdoor, and flat scenarios: videos of flat surfaces such as walls and sky belong to the flat scenario, videos depicting offices or shops to the indoor scenario, and videos depicting gardens to the outdoor scenario. Three recording modes were used for each scenario: still mode, in which the user stands still while the video is recorded; move mode, in which the user walks while capturing the video; and panrot mode, which combines a pan with a rotation. The YouTube and WhatsApp social media platforms were used to exchange videos belonging to each scenario. In the study, the database was evaluated with the method presented in [4].

The publicly accessible Video-ACID database3 was presented in [14] for source camera identification. Over 12,000 videos were collected from 46 physical cameras representing 36 different camera models. All of these videos were shot manually to represent a range of lighting conditions, content, and motion. Moreover, this database is suitable for both SCMI and ISCI scenarios. The authors evaluated the deep learning method presented in [22] on it.

The authors of [1] presented the Daxing smartphone identification database,4 which includes both images and videos from an extensive set of smartphones of different brands, models, and devices. The data from 90 smartphones, representing 22 models and 5 brands, includes 43,400 images and 1400 videos. For the iPhone 6S (Plus) alone, 23 individual devices are available. The selected scenes typically include a sky, grass, rocks, trees, stairs, a vertical printer, a lobby wall, and a white wall in a classroom, among others. The videos were shot vertically in each scene, each scene contains at least three videos, and all videos were recorded over 10 seconds. The database was evaluated with the method presented in [23].

The SOCRatES database5 [24] was captured with smartphones: around 9700 images and 1000 videos were taken with 103 different smartphones from 15 different brands. The methods of [3] and [25] were assessed on this database.

III. MOTIVATION
The rapid development of new smartphones in the field of imaging may be an important reason for the development of databases for forensic analysis, especially for the identification of source cameras. On the other hand, covering and completing aspects that other databases have not considered may lead researchers to present a new database.

As described in the previous sections, most databases contain videos recorded with various camera recorders, and only one of these databases is dedicated to smartphones (Daxing) [1]. Although that database can be considered important in this field, as it covers a wide range of devices, there are some aspects that may lead researchers to develop a new database to meet new challenges.

Table 2 details the Daxing database by the number of videos for each device. Of the 90 devices used in the database, 85 were considered for recording the videos. As can be seen from the table, the number of videos recorded per device is limited. This may be because the Daxing database focuses on both videos and images. The smallest number of videos recorded by a device in the Daxing database is 4, while the largest is 106; only one device has 106 videos, and the rest have fewer than 31. Most devices have 12 to 28 videos, and the average number of videos per device is around 26. As a result, the assessment of PRNU-based methods may not be reliable. Furthermore, source camera identification techniques based on machine learning may face an unbalanced-data problem, since the number of training videos is small and differs across the devices. This prompts researchers to adjust and balance the database before use.

2 https://lesc.dinfo.unifi.it/VISION/
3 misl.ece.drexel.edu/video-acid
4 https://github.com/xyhcn/Daxing
5 http://socrates.eurecom.fr/
FIGURE 2. Sample frames from captured videos: (a) Huawei-Y7 (device 1), (b) Huawei-Y7 (device 2), (c) Huawei-Y9 (device 1), (d) Huawei-Y9 (device 2), (e) iPhone-8Plus (device 1), (f) iPhone-8Plus (device 2), (g) iPhone-XsMax (device 1), (h) iPhone-XsMax (device 2), (i) Nokia-5.4 (device 1), (j) Nokia-5.4 (device 2), (k) Nokia-7.1 (device 1), (l) Nokia-7.1 (device 2), (m) Samsung-A50 (device 1), (n) Samsung-A50 (device 2), (o) Samsung-Note9 (device 1), (p) Samsung-Note9 (device 2), (q) Xiaomi-RedmiNote8 (device 1), (r) Xiaomi-RedmiNote8 (device 2), (s) Xiaomi-RedmiNote9Pro (device 1), (t) Xiaomi-RedmiNote9Pro (device 2).
For example, for the iPhone 8 Plus, 24 videos were recorded with device #1 and only 4 with device #2.

As our experiments show (Section V), increasing the amount of training data can improve the results on our database. Also, since most machine learning methods require sufficient training data, a database with many more videos is clearly better suited to machine learning methods than Daxing for the ISCI scenario. As shown in Table 2, for the ISCI scenario only one device has 106 videos to train on, and the rest of the devices have fewer than 31. Additionally, to structure the Daxing database for a machine learning approach, the videos need to be divided into training, testing, and validation sets. If we take the database average of 26 videos per device and apply our split, we obtain 15, 7, and 4 videos for training, testing, and validation, respectively. This is clearly too little to compare machine learning methods fairly. It should be noted that the Daxing database may be more suitable for machine learning methods in the SCMI scenario, where more models are covered.

Finally, it should be noted that a new database can be combined with other databases such as Daxing to obtain more data and address new challenges.

IV. QUFVD DESCRIPTION
In this section, we discuss the features and structure of QUFVD. For describing a database, the following aspects are important: the number of videos and cameras, resolution, codec, and suitability for SCMI or ISCI. These aspects are described in more detail in this section. Table 3 summarizes our database and its features. The QUFVD is publicly available.6

A. DEVICES
Several popular manufacturers produce different smartphone brands, but only a few brands are widely used. In order to cover a variety of brands, we collected devices for video recording from 5 popular brands: iPhone, Samsung, Huawei, Xiaomi, and Nokia. For each brand, we selected two different models, and for each model, two devices. Therefore, four devices are considered per brand, and a total of 20 devices were used to collect this database.

B. SIZE PROPERTIES
With the development of deep learning methods in this area, a large number of videos or frames can improve the results, as shown in this article. Therefore, a database of suitable size serves both traditional methods, such as PRNU, and deep learning methods. In our database, 300 videos were collected for each device, making a total of 6000 videos. The videos are between 11 and 15 seconds long, at a frame rate of 30 frames per second. Since I-frames play an important role in identifying the source [12], [14], these frames are also extracted; depending on the length and content of each video, the number of extracted I-frames differs.

6 https://www.dropbox.com/sh/nb543na9qq0wlaz/AAAc5N8ecjawk2KlVF8kfkrya?dl=0
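For orientation, the brand/model/device hierarchy of Section IV-A can be written down programmatically. A minimal Python sketch, using the model names from Figure 2 (the exact folder labels in the released database may differ):

```python
# Minimal sketch of the QUFVD device hierarchy: 5 brands x 2 models x 2 devices.
# Model names follow Figure 2; actual labels in the release may differ.
QUFVD_MODELS = {
    "Huawei":  ["Y7", "Y9"],
    "iPhone":  ["8Plus", "XsMax"],
    "Nokia":   ["5.4", "7.1"],
    "Samsung": ["A50", "Note9"],
    "Xiaomi":  ["RedmiNote8", "RedmiNote9Pro"],
}

devices = [
    (brand, model, dev)
    for brand, models in QUFVD_MODELS.items()
    for model in models
    for dev in (1, 2)
]
assert len(devices) == 20  # 20 physical devices, 300 videos each -> 6000 videos
```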
TABLE 2. Number of videos captured for each device in the Daxing database.
A total of 76,531 I-frames are extracted with the FFmpeg7 software. Finally, to test deep learning methods, 500 patches of size 350 × 350 are extracted from each I-frame; in total, 980,580 patches are extracted for the training set.

C. CONTENT PROPERTIES
In this collection, we rely mainly on a static camera, although both static and moving camera states occur during recording. The database contains a very diverse collection of videos of different scenes, either outdoor or indoor, with moving or still objects: mainly gardens, sky, streets, shops, domestic objects, and the sea. Figure 2 shows samples of the data for each device.

D. ISCI PROPERTIES
One way to make source camera identification challenging is to capture videos with smartphones of the same camera model. In this database, two devices are considered for each smartphone model; for example, for the Samsung Galaxy A50, videos were captured with two devices. This challenge is studied in the evaluation section. Therefore, our database supports both the SCMI and ISCI scenarios for all models, defining a 10-class and a 20-class problem, respectively.

E. CODEC PROPERTIES
Video files are compressed with codecs, which always trade quality against size (better quality vs. larger file size). Compression reduces file size, which can reduce bandwidth usage and increase streaming speed. For encoding high-definition video, AVC is the standard codec used by several online video services, including YouTube and Vimeo. The MPEG-4 and H.264 standards are implemented by the 'libx264' library in FFmpeg. All smartphones used for our database recorded videos according to the H.264 video encoding standard, except for the iPhone Xs Max and the Samsung Note9 (H.265).

7 https://www.ffmpeg.org/
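As a rough sketch of how the I-frame and patch extraction described above could be reproduced: we used FFmpeg, but the exact flags and the patch sampling strategy below are illustrative assumptions, not the authors' exact pipeline (500 patches per I-frame implies overlapping or random sampling rather than the simple tiling shown here).

```python
import subprocess
from pathlib import Path
from PIL import Image

def extract_iframes(video: str, out_dir: str) -> list:
    """Dump only intra-coded (I) frames of a video as PNGs via FFmpeg's select filter."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video,
         "-vf", "select=eq(pict_type\\,I)",  # keep I-frames only
         "-vsync", "vfr",                    # emit one image per selected frame
         str(out / "%04d.png")],
        check=True,
    )
    return sorted(out.glob("*.png"))

def crop_patches(frame_path: Path, patch: int = 350, stride: int = 350):
    """Cut patch x patch crops from a frame (non-overlapping tiling assumed here)."""
    img = Image.open(frame_path)
    w, h = img.size
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            yield img.crop((x, y, x + patch, y + patch))
```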
F. I-FRAME PROPERTIES
In coding standards such as the MPEG series and H.264, a Group of Pictures (GOP) consists of I-frames, P-frames, and B-frames: intra-coded, predictive-coded, and bi-predictive-coded pictures, respectively. I-frames are the least compressible and do not require other video frames for decoding. P-frames can be decompressed using data from previous frames and are more compressible than I-frames. B-frames can additionally use both previous and subsequent frames as data references, achieving the highest compression. The I-frame is generally more detailed than the P- and B-frames. The GOP size, generally either fixed or variable, is the number of B- and P-frames between two consecutive I-frames. Several studies have demonstrated that methods based on I-frames give better results than those based on other frames [26]–[28].

FIGURE 3. Structure of the folders in the database.
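The frame-type composition of a video (and hence its GOP structure) can be inspected with ffprobe. A small sketch, assuming ffprobe is on the PATH; the helper name is ours:

```python
import subprocess
from collections import Counter

def frame_type_counts(video: str) -> Counter:
    """Count I/P/B pictures in a video using ffprobe's per-frame pict_type field."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "frame=pict_type", "-of", "csv=p=0", video],
        capture_output=True, text=True, check=True,
    ).stdout
    return Counter(tok.strip(",") for tok in out.split() if tok.strip(","))

# Example output shape: Counter({'P': 320, 'I': 13}) for a ~13 s clip with a ~1 s GOP.
```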
G. VIDEO NAMING
All videos are renamed according to the rule "Brand_Model_device No._Video No.". For example, "iPhone_XS Max_1_(1)" refers to the first video from the first device of the iPhone XS Max model, and "Samsung_A50_2_(30)" refers to video number 30 from the second device of the Samsung Galaxy A50 model. Also, for I-frames, the frame number and type are appended to the video name; e.g., "Huawei-Y7Prime2019-1(14)-31-I" indicates the 31st frame, of type I, extracted from video (14) of the first Huawei Y7 Prime 2019 device.
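A parser for this naming rule makes the labels machine-readable. A minimal sketch; the regular expression is our assumption, matched against the examples above:

```python
import re

# Matches e.g. "iPhone_XS Max_1_(1)" or "Samsung_A50_2_(30)"
VIDEO_NAME = re.compile(
    r"^(?P<brand>[^_]+)_(?P<model>[^_]+)_(?P<device>\d+)_\((?P<video>\d+)\)$")

def parse_video_name(name: str) -> dict:
    """Split a QUFVD video name into brand, model, device number, and video number."""
    m = VIDEO_NAME.match(name)
    if m is None:
        raise ValueError(f"not a QUFVD video name: {name!r}")
    d = m.groupdict()
    d["device"], d["video"] = int(d["device"]), int(d["video"])
    return d

assert parse_video_name("Samsung_A50_2_(30)") == {
    "brand": "Samsung", "model": "A50", "device": 2, "video": 30}
```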
H. RESOLUTION AND COLOR MODE
The resolution of a video is its width and height in pixels. All videos in the database were recorded with the rear camera of the smartphones. There are two resolutions in the database, namely 720 × 1280 and 1080 × 1920. Also, the frames are stored in two modes: color (true color) and grayscale. It can therefore be tested whether resolution and color mode affect the results.
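A grayscale copy of a frame can be produced with Pillow; a one-function sketch (paths and the function name are placeholders, and the paper does not specify the conversion tool used):

```python
from PIL import Image

def to_grayscale(frame_path: str, out_path: str) -> None:
    """Store a frame in grayscale mode; QUFVD keeps both true-color and grayscale copies."""
    Image.open(frame_path).convert("L").save(out_path)
```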
I. STRUCTURE OF THE DATABASE
The overall structure of QUFVD is shown in Figure 3. The structure can be modified by researchers according to their methods and facilities. Moreover, the database can be combined with other databases (e.g., the Daxing database) to address new scenarios and challenges and to offer a wider choice of brands.
V. QUFVD EVALUATION
The quality of our database is evaluated in this section by experimenting with the ISCI and SCMI scenarios under different settings based on a deep learning method. We divide the experiments into different scenarios that show the influence of certain conditions on the results. This provides a baseline for the accuracy of camera model identification on the QUFVD database and can be used for comparison with other methods. The division of the database for the experiments is as follows: 80% of the videos are considered the training set and the remaining 20% the test set; in addition, 20% of the training data is used as validation data. This structure can be used for both classical (e.g., PRNU) and machine learning methods. For example, in PRNU methods, reference patterns can be obtained from the videos in the training data and query patterns from the test data. Since, as mentioned earlier, I-frames lead to better results, the I-frames of the videos are extracted to evaluate the database. The statistics for training, testing, and validation at both the video and frame levels are shown in Table 4. For each video in each experimental series, we selected all I-frames belonging to the videos of the training, testing, and validation sets; a total of 76,531 I-frames were extracted.
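The 80/20 split with a further 20% of the training videos held out for validation, as described above, can be sketched as follows. Per-device splitting, the rounding, and the fixed seed are our assumptions:

```python
import random

def split_videos(video_ids: list, seed: int = 0):
    """80% train / 20% test, then 20% of train held out as validation."""
    rng = random.Random(seed)
    ids = list(video_ids)
    rng.shuffle(ids)
    n_test = round(0.2 * len(ids))
    test, train = ids[:n_test], ids[n_test:]
    n_val = round(0.2 * len(train))
    val, train = train[:n_val], train[n_val:]
    return train, val, test

# For the 300 videos of one QUFVD device this yields 192 train, 48 val, 60 test.
```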
The method presented in [29] is used to evaluate our database. In references [30], [31], and [22], this CNN (the MISLnet architecture [29]) was used to identify the source camera, using frames to train the network. The network uses a constrained convolutional layer, added as the first layer, with three kernels of size 5. This layer is constructed so that it captures relationships between adjacent pixels that are independent of the scene content. The method was tested on the VISION database [21], and the experiments showed that the constrained layer can improve results compared with deep learning architectures without it. The structure of the CNN used in the three studies is shown in Figure 4: a constrained convolutional layer is added to a simple CNN.
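A sketch of the constrained convolutional layer described above, in PyTorch. This is our reconstruction of the Bayar–Stamm-style constraint, not the authors' exact code: after every update, each kernel's centre weight is fixed to −1 and the remaining weights are rescaled to sum to 1, so the layer learns prediction-error (noise-like) residuals rather than scene content.

```python
import torch
import torch.nn as nn

class ConstrainedConv2d(nn.Conv2d):
    """First-layer constrained convolution (three 5x5 kernels, as in MISLnet [29])."""

    def constrain(self) -> None:
        with torch.no_grad():
            w = self.weight                      # (out_ch, in_ch, 5, 5)
            c = w.shape[-1] // 2
            w[:, :, c, c] = 0.0
            w /= w.sum(dim=(2, 3), keepdim=True)  # off-centre weights sum to 1
            w[:, :, c, c] = -1.0                  # centre weight fixed to -1

layer = ConstrainedConv2d(in_channels=1, out_channels=3, kernel_size=5, bias=False)
layer.constrain()  # re-apply after construction and after each optimizer step
```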
Our database is evaluated against the two main scenarios, ISCI and SCMI: a 10-class problem is considered for SCMI and a 20-class problem for ISCI. For each, the effect of the number and size of patches is examined, as well as the effect of the color mode, i.e., grayscale and true color. All videos were encoded according to the respective device codec, using the H.264 or H.265 video encoding standard, and no video was edited or re-encoded.

To identify a video based on its I-frames, all I-frames in the test set are considered. The class scores obtained by the CNN, taking the highest probability, determine the class of each I-frame. At the video level, a majority vote over all frames belonging to a video then decides.

A 64-bit operating system (Ubuntu 18) with an E5-2650 v4 CPU @ 2.20 GHz, 128.0 GB of RAM, and four NVIDIA GTX TITAN X GPUs was used to run our experiments.
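A minimal sketch of the frame-level argmax and video-level majority vote described above (function and variable names are ours):

```python
from collections import Counter

def video_label(frame_scores) -> int:
    """Majority vote over per-I-frame predictions.

    frame_scores: iterable of per-class score vectors, one per I-frame of the video;
    each frame votes for its highest-probability class.
    """
    votes = Counter(max(range(len(s)), key=s.__getitem__) for s in frame_scores)
    return votes.most_common(1)[0][0]

# Example: three frames voting for classes [2, 2, 7] -> the video is assigned class 2.
```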
A. ISCI VS SCMI
The performance of the network is measured by computing the accuracy at the frame level and the video level in both the ISCI and SCMI scenarios. In the classification stage, each frame/video in the test data is classified into one of the 10 classes (SCMI) or 20 classes (ISCI). The frame-level and video-level results for the SCMI scenario for each smartphone model are shown in Table 5.

To investigate the effects of device dependency, the ISCI scenario is considered. The frame-level and video-level results in terms of accuracy for the ISCI scenario are shown in Table 6, reported per device.

The overall accuracy, precision, recall, and F1-score at the frame level for both the ISCI and SCMI scenarios are reported in Table 7. Precision, also called Positive Predictive Value (PPV), measures how many of the predictions of a class actually belong to that class. Recall is also known as True Positive Rate (TPR), and the F1-score is the harmonic mean of precision and recall; its best value is 1, corresponding to perfect precision and recall.

Tables 5 and 6 also list the effects of the color mode, i.e., grayscale and true color. With this premise, Figures 5 and 6 provide a more comprehensive picture of camera identification performance, checking the quality of the CNN by presenting the Receiver Operating Characteristic (ROC) curves for the selected groups of ten and twenty cameras from our database. Two values are calculated for each threshold: the True Positive Rate (TPR) and the False Positive Rate (FPR). The TPR of a given class, e.g., Huawei Y7, is the number of outputs whose actual and predicted class is Huawei Y7 divided by the number of outputs whose actual class is Huawei Y7. The FPR is calculated by dividing the number of outputs whose actual class is not Huawei Y7 but whose predicted class is Huawei Y7 by the number of outputs whose actual class is not Huawei Y7.
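For reference, these per-class quantities can be written compactly in terms of the true/false positives and negatives (TP, FP, TN, FN) of a class:

```latex
\mathrm{PPV} = \frac{TP}{TP + FP}, \qquad
\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad
\mathrm{FPR} = \frac{FP}{FP + TN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{PPV} \cdot \mathrm{TPR}}{\mathrm{PPV} + \mathrm{TPR}}
```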
One of the most important factors in machine learning is how much training data the model needs to perform well. To quantify this, a series of experiments was conducted with increasing amounts of training data for the SCMI scenario, in both grayscale and color modes. Table 8 shows the effect of this factor.

In addition, the size of the patches can affect the performance of CNN methods. For this experiment, four different sizes were considered for the SCMI scenario, based on 10,000 grayscale patches (see Table 9).

For a more detailed analysis of the misclassifications, the confusion matrix for the ISCI scenario in grayscale mode is given in Table 10. Also, the processing time for a patch, a frame, and a video with 11 I-frames is shown in Table 11; the measurements were performed on frames of size 1920 × 1080.

B. RESULT DISCUSSION
State-of-the-art source camera identification methods have faced challenges such as compression, stabilization, and ISCI, and various methods have been presented to overcome them. Recently, deep learning methods have been introduced for this purpose. As mentioned earlier, our database is also evaluated using a deep learning method developed to address these problems. Overall, the frame-level and video-level results show that the method is successful for the SCMI problem but does not work well for the ISCI challenge. In both scenarios, reporting the results at the video level brings an improvement. The results are discussed in more detail below.

As shown in Table 5, at the frame level all models except the Y7, 8 Plus, and Redmi Note9 Pro achieve more than 70% accuracy in grayscale mode. The biggest improvement in this mode over color mode is for the Note 9. The best results are reported for the Note 9 and Xs Max, which share the same codec (H.265). At the video level, an overall improvement is seen for all devices. The best result, 95%, is again obtained for the Note 9. However, the Xs Max uses the same codec as the Note 9 and does not see a comparable improvement, so in general we cannot conclude that the codec has a direct effect on the results. Moreover, the resolution does not appear to affect the results, since the Y7 and Y9 have the lowest resolution but their results are not worse. Therefore, these two observations cannot confirm that codec and resolution are effective factors in this area. However, the grayscale results are consistently better than the color ones.

Based on Table 6 (ISCI scenario), although only 3 devices have an accuracy below 65%, half of the devices achieve an accuracy below 50% at the frame level. Even though Note 9 device 1 scores best among all devices, similar to the SCMI scenario, device 2 scores only 66.7%, which corresponds to fifth place. There are no other clear patterns in the table, except that grayscale mode still performs better than color mode at both the frame and video levels.

Figures 5 and 6 show the TPR against the FPR for the two scenarios, SCMI and ISCI, in the two modes (color and grayscale) at different frame-level thresholds. As can be seen from the figures, the devices behave differently in terms of TPR and FPR. The best performance is shown by the Nokia 5.4, with an Area Under the Curve (AUC) of 0.989, compared to the second-ranked Note 9 with an AUC of 0.987 in grayscale mode (Figure 5(b)). Moreover, as shown in Figure 6(a and b), the RedmiNote9Pro performs significantly better in grayscale mode. From the figure, it can be seen that Note 9 device 1 has the best performance, with an AUC of 0.989.

As shown in Table 7, all metrics are better in the SCMI scenario than in the ISCI scenario.
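The ROC curves and AUC values of Figures 5 and 6 can be reproduced from the frame-level scores. A sketch with scikit-learn, assuming a one-vs-rest treatment of each class (function and array names are ours):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def per_class_roc(scores: np.ndarray, labels: np.ndarray, cls: int):
    """One-vs-rest ROC for a single device/model class.

    scores: (n_frames, n_classes) softmax outputs; labels: (n_frames,) true class ids.
    """
    y_true = (labels == cls).astype(int)   # 1 for the target class, 0 otherwise
    y_score = scores[:, cls]               # score assigned to the target class
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return fpr, tpr, roc_auc_score(y_true, y_score)
```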
TABLE 4. The statistics for training, testing, and validation at both the video and frame levels.
TABLE 5. The results of the frame and video levels in terms of accuracy (%) for the SCMI scenario for each smartphone model.
TABLE 6. The results of the frame and video levels in terms of accuracy (%) for the ISCI scenario for each device.
TABLE 7. The results of the frame level in terms of accuracy, PPV, TPR, and F1-score (%) for the SCMI and ISCI scenarios.
FIGURE 5. True and false positive rates (ROC) obtained in the SCMI scenario: (a) 10 classes in color mode, (b) 10 classes in grayscale mode.
FIGURE 6. True and false positive rates (ROC) obtained in the ISCI scenario: (a) 20 classes in color mode, (b) 20 classes in grayscale mode.
Based on the results, Table 9 shows that while increasing the patch size can improve performance, the benefit is limited to 350 × 350: the experiment was conducted with 10,000 patches per class, and for sizes over 350 × 350 a drop in performance is observed. Therefore, we chose this size for all experiments in the evaluation.
TABLE 10. Confusion matrix of ISCI scenario in grayscale mode. Classes 1 to 20 are Y7 (device 1), Nokia 5.4 (device 2), Nokia 7.1 (device 1), Nokia 7.1
(device 2), A50 (device 1), A50 (device 2), Note 9 (device 1), Note 9 (device 2), RedmiNote8 (device 1), RedmiNote8 (device 2), RedmiNote9Pro (device 1),
Y7 (device 2), RedmiNote9Pro (device 2), Y9 (device 1), Y9 (device 2), 8 Plus (device 1), 8 Plus (device 2), Xs Max (device 1), Xs Max (device 2), Nokia 5.4
(device 1), respectively.
TABLE 11. The processing time (seconds) for a patch, a frame, and a video with 11 I-frames.
Table 10 shows the confusion matrix obtained for the ISCI scenario in grayscale mode. As mentioned earlier, this scenario is more challenging than SCMI, and the results can be improved in future studies. The confusion matrix shows the misclassifications between all classes. As shown in the table, misclassifications occur mostly between devices of the same brand; e.g., classes 14 and 15 (the two Y9 devices) are most often misidentified as each other.

As can be seen in Table 11, the processing time at the patch, frame, and video levels increases as the patch size increases.

VI. CONCLUSION
This paper presents a new smartphone video database (QUFVD) for source camera identification. The database includes five popular smartphone brands, with two models per brand and two devices per model, 6000 original videos, and 76,531 I-frames. The entire database is provided with an evaluation analysis for use by the research community. The database is suitable for new challenges such as ISCI and for use with deep learning methods. The results show that improvement is essential for ISCI. Although it is not a fair comparison, the deep learning method used in our study achieves promising results compared to the PRNU-based results reported for Daxing.

In order to improve the video-level results, different decision-making approaches, such as fusion methods based on weighting the classifier scores, can be applied in the future. We will add a few more tasks to the database, in which we transfer videos over social media such as WhatsApp and Facebook to study the impact of compression on source camera identification. To support video tampering detection, another task will add forged videos to the database. To obtain more data and new challenges, our database can be combined with other databases. An augmentation method can also be applied to the database to provide more training data. Although we cannot clearly see the effects of codec and resolution with this method, they can be studied with other methods.
NOOR AL-MAADEED (Member, IEEE) received the Ph.D. degree in computer engineering from Brunel University, U.K., in 2014. She is currently an Associate Professor with the Computer Science and Engineering Department, Qatar University. She has participated in many regional and international conferences and published a substantial number of research articles in prestigious peer-reviewed journals, book chapters, and conference proceedings. She has improved the relationship between academia and industry by leading many research projects, both domestically and abroad, totaling over eight million QAR, in her fields of specialization, such as image processing, speech and speaker recognition, intelligent pattern recognition, video-surveillance systems, and biometrics. She is a member of the First Batch Qatar Leadership Center, the Current and Future Leaders Program, the Qatar University Senate, and other committees. She is also a member of various international associations, such as IET, BA, and IAENG. She participates in activities that connect her to the community, such as working with charities and volunteering at sport events. She has received several awards, including the Qatar Education Excellence Platinum Award for new Ph.D. holders from His Highness the Emir of Qatar, in 2014 and 2015, the Premium Award from IET Biometrics, in 2017, and the Barzan Award, in 2019.

FOUAD KHELIFI (Member, IEEE) received the Ingenieur d'Etat degree in electrical engineering from the University of Jijel, Algeria, in 2000, and the Magistère degree in electronic engineering from the University of Annaba, Algeria, in 2003. He joined Queen's University Belfast, U.K., as a Research Student in 2004 and received the Ph.D. degree from its School of Computer Science in 2007. From 2008 to 2010, he held a research position with the Digital Media and Systems Research Institute, University of Bradford, U.K., before joining the Department of Computer and Information Sciences, Northumbria University at Newcastle, U.K., as a Lecturer (an Assistant Professor), where he is currently an Associate Professor (a Reader). He has authored and coauthored over 90 publications and has successfully supervised 12 Ph.D. students. His research interests include computer vision and machine learning, image and video watermarking, image/video authentication and perceptual hashing, data hiding, image forensics and biometrics, image and video coding, and medical image analysis.