Study of Subjective Quality and Objective Blind Quality Prediction of Stereoscopic Videos
SUBMITTED BY:
D T D ARAVIND KUMAR 21EC65R03
CHRIST LARSEN KUMAR EKKA 21EC65R05
SOURABH KUMAR SAHU 21EC65R06
SOURAV NAG 21EC65R11
SUDIP DAS 21EC65R12
SUBHAM GUCHHAIT 21EC65R27
Introduction
A study of quality is required after the acquisition of 3D content, because several post-acquisition processing steps, such as sampling and quantization, are applied to the video, and these in turn degrade the overall perceived 3D video quality. Quality assessment is of two types: subjective and objective.
Subjective quality assessment of S3D videos
In subjective assessment, human subjects perform the quality assessment task, and the resulting human opinion scores serve as a valuable benchmark for objective assessment algorithms. Quality analysis in this domain builds on the work of several authors, who drew the following conclusions from their observations of S3D videos:
- De Silva et al. [9] created an S3D video dataset containing H.264 and H.265 compression artifacts. The dataset has 14 reference and 116 test sequences of full HD resolution, downsampled to 960×1080. They concluded that higher quantization step sizes caused more significant perceptual quality differences than lower quantization step sizes.
- Hewage et al.[10] created an S3D video dataset which they used to explore the effects of
random packet losses on the overall perception of S3D videos. They used 9 reference
sequences, 54 test sequences, and 6 different packet loss rates. They concluded that
S3D perceptual quality was significantly affected by the loss of packets from either the
left or right views of an S3D video.
The test videos were rated by a group of individuals on a common 0 to 5 scale. The difference scores between the reference and distorted S3D videos were calculated as per the ITU-R recommendation:
$$d_{q_i q_j} = sub^{ref}_{q_i q_j} - sub_{q_i q_j}$$

where $q_i$ indicates the subject and $q_j$ indicates the video ID, $sub^{ref}_{q_i q_j}$ is the reference score and $sub_{q_i q_j}$ is the distorted score. This was followed by calculating the DMOS scores (the difference between the "reference" and "processed" Mean Opinion Scores in full-reference testing):

$$DMOS_{q_j} = \frac{\sum_{q_i = 1}^{Z} d_{q_i q_j}}{Z}$$

where $Z$ is the total number of subjects.
The efficacy of the subjective study was verified by examining the internal structure of the dataset: all of the collected DMOS scores were randomly divided into two halves in which the human subjects were mutually exclusive, and the LCC (Linear Correlation Coefficient) and Spearman's Rank Order Correlation Coefficient (SROCC) were computed between these two halves. This procedure was repeated 100 times to ensure statistical consistency, after which the mean, median and standard deviation were computed over all iterations, indicating how consistent the opinions were among the subjects. A minimal sketch of this check is given below.
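The following Python sketch illustrates the split-half consistency check; `dmos` is a hypothetical (subjects × videos) array of the difference scores $d_{q_i q_j}$, and the function name is illustrative, not from the report.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def split_half_consistency(dmos, n_iter=100, seed=0):
    """Split-half consistency check of subjective scores.

    `dmos` is a (subjects x videos) array of difference scores.
    Subjects are randomly split into two mutually exclusive halves,
    per-video DMOS is computed for each half, and LCC/SROCC are
    measured between the two halves; repeated n_iter times.
    """
    rng = np.random.default_rng(seed)
    n_subj = dmos.shape[0]
    lccs, sroccs = [], []
    for _ in range(n_iter):
        perm = rng.permutation(n_subj)
        half_a, half_b = perm[: n_subj // 2], perm[n_subj // 2:]
        dmos_a = dmos[half_a].mean(axis=0)   # DMOS per video, half A
        dmos_b = dmos[half_b].mean(axis=0)   # DMOS per video, half B
        lccs.append(pearsonr(dmos_a, dmos_b)[0])
        sroccs.append(spearmanr(dmos_a, dmos_b)[0])
    # mean, median and standard deviation over all iterations
    summarize = lambda v: (np.mean(v), np.median(v), np.std(v))
    return summarize(lccs), summarize(sroccs)
```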
Objective Quality Assessment:
Various psycho-visual experiments have been performed on the mammalian visual cortex to
find out the disparity selectivity in visual area MT and the dependencies that exist between
motion and disparity. The middle temporal visual area (MT or V5) is a region of the extrastriate visual cortex. In several species of both New World and Old World monkeys, the MT area contains a high concentration of direction-selective neurons. MT in primates is thought to play a major role in the perception of motion, the integration of local motion signals into global percepts, and the guidance of some eye movements. The result of the
experiments was that a large portion of area MT is responsible for disparity processing and
that these components exhibit patchy, distributive, and directional dependencies. Luminance
and disparity sub-band coefficients of S3D pictures have sharp peaks and heavy tails that can
be modelled using an UGGD (Univariate Generalized Gaussian Distribution). A series of
experiments has been performed on S3D scene components (spatial, disparity and
motion/temporal) of natural S3D images and videos to explore the statistical dependencies
that arise among these scene components. These experiments showed that S3D scene components exhibit strong dependencies, and that these dependencies can be well modelled by a BGGD (Bivariate Generalized Gaussian Distribution). The psychovisual studies and the S3D scene component statistical studies led to a completely blind S3D NR VQA algorithm based on a BGGD model of the joint statistical dependencies between motion and disparity.
Let the multivariate random vector $\mathbf{x} \in \mathbb{R}^N$ follow a Multivariate Generalized Gaussian Distribution (MGGD) with density function given by

$$p(\mathbf{x} \mid \mathbf{M}, \alpha, \beta) = \frac{1}{|\mathbf{M}|^{1/2}}\, g_{\alpha,\beta}\!\left(\mathbf{x}^{T} \mathbf{M}^{-1} \mathbf{x}\right)$$

$$g_{\alpha,\beta}(y) = \frac{\beta\, \Gamma\!\left(\frac{N}{2}\right)}{\left(2^{1/\beta}\pi\alpha\right)^{N/2} \Gamma\!\left(\frac{N}{2\beta}\right)}\, e^{-\frac{1}{2}\left(\frac{y}{\alpha}\right)^{\beta}}$$

where $\mathbf{M}$ is an $N \times N$ covariance matrix, $\alpha$ is a scale parameter, $\beta$ is a shape parameter and $g_{\alpha,\beta}(\cdot)$ is the density generator. The popular Maximum Likelihood Estimation (MLE) method is used to compute the parameters $\alpha$, $\beta$ and $\mathbf{M}$ of the BGGD.
In the model, motion and disparity provide the primary features and this results in N = 2.
Therefore, the multivariate GGD becomes a bivariate GGD (BGGD). The BGGD model
parameters α and β, and the coherence score (Ψ) are used for quality prediction. The coherence score is defined as:
$$\Psi = \frac{\left(\lambda_{max} - \lambda_{min}\right)^2}{\left(\lambda_{max} + \lambda_{min}\right)^2}$$

where $\lambda_{max}$ and $\lambda_{min}$ represent the maximum and minimum eigenvalues of $\mathbf{M}$. These eigenvalues are capable of accurately capturing directional dependencies between the motion and disparity components. A minimal sketch of this computation follows.
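The coherence score follows directly from the eigenvalues of the estimated covariance matrix; a short sketch, assuming `M` is the 2×2 BGGD covariance matrix as a NumPy array:

```python
import numpy as np

def coherence_score(M):
    """Coherence score Psi from the BGGD covariance matrix M (2x2):
    Psi = (l_max - l_min)^2 / (l_max + l_min)^2."""
    eigvals = np.linalg.eigvalsh(M)          # ascending order (M symmetric)
    l_min, l_max = eigvals[0], eigvals[-1]
    return (l_max - l_min) ** 2 / (l_max + l_min) ** 2
```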
The motion vector and disparity maps are decomposed at multiple scales (3 scales) and multiple orientations (0°, 30°, 60°, 90°, 120°, 150°) using the steerable pyramid decomposition. Motion vectors and disparity maps are computed on a frame-by-frame basis. A three-step search is employed to estimate the motion vectors, and an SSIM-based algorithm is used for disparity estimation. Each pair of corresponding motion and disparity sub-bands is jointly modeled using the BGGD, as sketched below.
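A sketch of the decomposition step is shown below. It assumes the third-party `pyrtools` package, whose `SteerablePyramidFreq` implements the steerable pyramid (order = 5 yields six orientation bands); the function name and structure here are illustrative, not from the report.

```python
import pyrtools as pt

def subband_maps(component_map, n_scales=3, n_orients=6):
    """Decompose a motion or disparity map at 3 scales and 6
    orientations (0 to 150 degrees in 30-degree steps) with a
    steerable pyramid. Returns only the oriented bands, keyed by
    (scale, orientation) tuples; residual bands are dropped."""
    pyr = pt.pyramids.SteerablePyramidFreq(
        component_map, height=n_scales, order=n_orients - 1)
    return {k: v for k, v in pyr.pyr_coeffs.items() if isinstance(k, tuple)}
```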
Figure (a) shows the 150th frame of the left video of the ‘Domain Parklands’ S3D video. Figures (b), (c) and (d) show the distorted versions of the same frame. The distortions are H.264 (CRF = 50), H.265 (CRF = 50) and blur (radius = 7), respectively. Figures (e), (f) and (g) show the frame-wise α, β and Ψ scores of the ‘Domain Parklands’ pristine S3D video and its H.264 compressed versions, respectively. Figures (i), (j) and (k) show scatter plots of the α, β and Ψ scores of the same reference and distorted S3D video sequences.
From the plots, it is clear that the features follow a number of trends:
1. The features can clearly discriminate videos having large perceptual quality differences, e.g., the computed BGGD features (α, β, Ψ) of the S3D video compressed at (CL = 35, CR = 35) significantly differ from those of the S3D video compressed at (CL = 50, CR = 50).
2. The features take similar values on videos that are perceptually similar in quality. For example, the BGGD features computed on the S3D videos compressed at (CL = 45, CR = 45), (CL = 45, CR = 50) and (CL = 50, CR = 50) yield similar feature values.
These observations further motivate us to use them as quality features in the proposed MoDi3D algorithm. The plots correspond to the first scale, 0° orientation of the steerable pyramid decomposition, and they use the negative logarithmic scores of all features for better visualization. The x-axis represents the frame sequence number of the S3D video set. The frame-wise average NIQE scores and scatter plots of the average NIQE scores of the left and right views of the ‘Domain Parklands’ pristine S3D video and its H.264 compressed versions are shown in Figures (h) and (l), respectively. The plots clearly show quality variations with respect to the distortion levels.
Proposed method:
The first stage of the proposed method computes the motion and disparity features of the S3D video. The second stage performs the MoDi2D score computation. In the third stage, spatial features of the individual views of an S3D video are computed using the NIQE model. In the last stage, we compute the MoDi3D score of an S3D video.
$$T_t = \sqrt{T_H^2 + T_V^2}$$

where $T_t$ represents the motion vector strength and $T_H$, $T_V$ are the horizontal and vertical motion vector components, respectively.
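In code, the strength map follows directly from the per-block motion components; a minimal sketch, assuming `th` and `tv` are NumPy arrays of horizontal and vertical motion vectors over the block grid:

```python
import numpy as np

def motion_strength(th, tv):
    """Motion vector strength T_t = sqrt(T_H^2 + T_V^2),
    computed element-wise over the block grid."""
    return np.sqrt(th.astype(np.float64) ** 2 + tv.astype(np.float64) ** 2)
```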
The three-step search motion estimation algorithm was taken from the reference “Block-based motion estimation algorithms: a survey”; a short summary of that reference follows. In block-based motion estimation (BBME), the current frame is divided into N×N pixel-size macroblocks (MBs), and for each MB a certain area of the reference frame is searched to minimize a block difference measure (BDM), which is usually a sum of absolute differences (SAD) between the current and the reference MB. The displacement within the search area (SA) which gives the minimum BDM value is called a motion vector (MV). MVs, together with transformed and quantized block differences (residuals), are entropy-coded into the video bitstream.
The simple and regular three-step search (TSS) utilizes a square search pattern with eight search points (SPs) around the centre at each step. The initial step size is d/2 and is halved in the subsequent steps. When d equals 7, the number of steps is 3 and the number of SPs used is 25. For larger search ranges, TSS can be easily extended to n steps, with the number of search points equal to 1 + 8⌈log₂(d + 1)⌉. A minimal sketch of TSS is given below.
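The following sketch implements TSS for a single macroblock, assuming grayscale NumPy frames and a SAD block-difference measure; the function names and block layout are illustrative, not from the report.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences (the BDM) between two blocks."""
    return np.abs(a.astype(np.int32) - b.astype(np.int32)).sum()

def three_step_search(cur, ref, top, left, block=8, d=7):
    """Three-step search motion estimation for one macroblock.

    (top, left) locates the macroblock in the current frame `cur`;
    the search starts with step size ~d/2 around the same position
    in the reference frame `ref` and halves the step each iteration.
    Returns the motion vector (dy, dx) minimizing the SAD.
    """
    h, w = ref.shape
    cur_blk = cur[top:top + block, left:left + block]
    cy, cx = top, left                       # current search centre
    step = (d + 1) // 2                      # 4 for d = 7 -> 3 steps
    while step >= 1:
        best = (sad(cur_blk, ref[cy:cy + block, cx:cx + block]), cy, cx)
        for dy in (-step, 0, step):          # 8 points around the centre
            for dx in (-step, 0, step):
                y, x = cy + dy, cx + dx
                if 0 <= y <= h - block and 0 <= x <= w - block:
                    cost = sad(cur_blk, ref[y:y + block, x:x + block])
                    if cost < best[0]:
                        best = (cost, y, x)
        _, cy, cx = best                     # recentre on the best point
        step //= 2
    return cy - top, cx - left
```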
To estimate disparity, the algorithm computes the best-matching block in the right view for each corresponding block of the left view; the disparity search range was limited to 30. The disparity map is computed for a given S3D pair on a frame-wise basis.
The SSIM-based stereo matching algorithm was taken from the reference “Image Quality Assessment: From Error Visibility to Structural Similarity”; a short summary is written below.
The luminance of the surface of an object being observed is the product of the illumination
and the reflectance, but the structures of the objects in the scene are independent of the
illumination. Consequently, to explore the structural information in an image, we wish to
separate the influence of the illumination. We define the structural information in an image as
those attributes that represent the structure of objects in the scene, independent of the
average luminance and contrast.
Suppose x and y are two nonnegative image signals, which have been aligned with each other
(e.g., spatial patches extracted from each image). If we consider one of the signals to have
perfect quality, then the similarity measure can serve as a quantitative measurement of the
quality of the second signal. The system separates the task of similarity measurement into
three comparisons: luminance, contrast and structure.
First, the luminance of each signal is compared. Assuming discrete signals, this is estimated as the mean intensity:

$$\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i$$

The luminance comparison function $l(x, y)$ is then a function of $\mu_x$ and $\mu_y$. Second, we remove the mean intensity from the signal and use the standard deviation (the square root of the variance) as an estimate of the signal contrast. An unbiased estimate in discrete form is given by

$$\sigma_x = \left(\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \mu_x\right)^2\right)^{1/2}$$

The contrast comparison $c(x, y)$ is then the comparison of $\sigma_x$ and $\sigma_y$. Third, the signal is normalized (divided) by its own standard deviation, so that the two signals being compared have unit standard deviation, and the structure comparison $s(x, y)$ is conducted on these normalized signals. Finally, the three components are combined to yield an overall similarity measure:

$$S(x, y) = f\big(l(x, y),\ c(x, y),\ s(x, y)\big)$$

The expressions for the luminance, contrast and structure components are

$$l(x,y) = \frac{2\mu_x \mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1}, \quad c(x,y) = \frac{2\sigma_x \sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2}, \quad s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x \sigma_y + C_3}$$

and the combined index is $SSIM(x,y) = [l(x,y)]^{\alpha}\,[c(x,y)]^{\beta}\,[s(x,y)]^{\gamma}$, where $\alpha > 0$, $\beta > 0$ and $\gamma > 0$ are parameters used to adjust the relative importance of the three components.
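The sketch below computes the three comparison terms and their product for two aligned patches, with all exponents set to 1 and the usual stabilizing constants from the SSIM paper (assuming 8-bit signals, K1 = 0.01, K2 = 0.03, C3 = C2/2):

```python
import numpy as np

def ssim_patch(x, y, c1=6.5025, c2=58.5225):
    """SSIM between two aligned patches as the product of the
    luminance, contrast and structure comparison terms."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    mx, my = x.mean(), y.mean()
    sx, sy = x.std(ddof=1), y.std(ddof=1)          # unbiased std
    sxy = ((x - mx) * (y - my)).sum() / (x.size - 1)
    c3 = c2 / 2
    lum = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    con = (2 * sx * sy + c2) / (sx ** 2 + sy ** 2 + c2)
    struct = (sxy + c3) / (sx * sy + c3)
    return lum * con * struct
```

In stereo matching, this patch similarity replaces the SAD cost: for each left-view block, the right-view block maximizing SSIM within the disparity search range is taken as the match.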
The steerable pyramid decomposition was performed on the motion and disparity maps at multiple scales and multiple orientations. Since the motion vectors were estimated using a block size of 8×8, we downsampled the sub-bands of the disparity map to the same size by averaging over 8×8 blocks, as in the sketch below.
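A short sketch of the block-averaging downsampling step, assuming NumPy sub-bands (boundaries are cropped when the dimensions are not multiples of the block size):

```python
import numpy as np

def block_average(band, bs=8):
    """Downsample a sub-band by averaging non-overlapping bs x bs
    blocks, so disparity sub-bands match the 8x8 motion-block grid."""
    h, w = band.shape
    band = band[: h - h % bs, : w - w % bs]    # crop to a multiple of bs
    return band.reshape(h // bs, bs, w // bs, bs).mean(axis=(1, 3))
```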
The overall spatial quality of an S3D video is obtained by averaging the frame-level NIQE scores of both views:

$$S = \frac{1}{2Q}\sum_{j=1}^{Q}\left(NIQE_j^L + NIQE_j^R\right)$$

where j represents the frame number, Q represents the total number of S3D video frames, and L, R represent the left and right views of an S3D video. $NIQE_j^L$ and $NIQE_j^R$ represent the frame-level NIQE scores of the left and right views, while S denotes the overall spatial quality of an S3D video.
MoDi2D computation:
As stated previously, the motion vector maps and disparity maps of an S3D video are
decomposed at three scales and six orientations using the steerable pyramid decomposition.
We estimate the BGGD model parameters (α, β) and the coherence score (Ψ) at each sub-band of each view of an S3D video, denoted as:

$$f_\alpha = [\alpha_{ji}], \quad f_\beta = [\beta_{ji}], \quad f_\Psi = [\Psi_{ji}]$$

where i represents the sub-band level (1 ≤ i ≤ 18). The total number of motion vector maps computed in an S3D video is Q − 1; therefore, 1 ≤ j ≤ Q − 1. $f_\alpha$, $f_\beta$ and $f_\Psi$ are video-level feature sets of the α, β and Ψ scores, respectively.
The BGGD model parameters α, β and the Ψ scores are estimated over all sub-bands of the motion vector and disparity maps from the 34 pristine S3D video sequences. Three individual feature sets are created from the α, β and Ψ scores of the reference S3D video set:

$$f_\alpha^p = [\alpha_{ji}^p], \quad f_\beta^p = [\beta_{ji}^p], \quad f_\Psi^p = [\Psi_{ji}^p]; \quad 1 \le p \le P$$

where p represents a pristine S3D video and P represents the total number of pristine videos (P = 34).
As in NIQE, the pristine S3D video parameter sets ($f_\alpha^p$, $f_\beta^p$ and $f_\Psi^p$) are modeled using a Multivariate Gaussian (MVG) distribution denoted by N(ν, Σ), where ν and Σ are the mean vector and covariance matrix, respectively. Specifically, the means ($\nu_\alpha^p$, $\nu_\beta^p$, $\nu_\Psi^p$) and the corresponding covariances are estimated for the $f_\alpha^p$, $f_\beta^p$ and $f_\Psi^p$ sets, respectively.
Distorted Feature Set
First, the BGGD parameters α, β and the Ψ scores are calculated over all sub-bands of the frame-wise motion vector and disparity maps of each distorted S3D video. The feature sets of the distorted S3D videos are

$$f_\alpha^d = [\alpha_{ji}^d], \quad f_\beta^d = [\beta_{ji}^d], \quad f_\Psi^d = [\Psi_{ji}^d]$$
Now, to check whether a given test video is pristine or distorted, the likelihood of its parameters is evaluated under the pristine MVG distribution. The likelihood is calculated on a frame-level basis and is a single value per frame. The video-level likelihood is estimated by averaging the frame-level estimates. The Δ terms denote the mean values of the frame-level likelihood estimation scores of the individual features, and γ represents the overall departure of a distorted video's statistics with respect to the pristine model. The product of the Δ and γ scores measures the joint quality of the motion and disparity components of each test S3D video, giving the MoDi2D score; a minimal sketch of this step follows.
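The sketch below illustrates the MVG modeling and likelihood evaluation, assuming feature matrices with rows as frame-level samples; the function names are illustrative, and the pooling follows the description above.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_pristine_mvg(pristine_feats):
    """Fit an MVG N(nu, Sigma) to pristine feature vectors
    (rows = frame-level samples, columns = feature dimensions)."""
    nu = pristine_feats.mean(axis=0)
    sigma = np.cov(pristine_feats, rowvar=False)
    return multivariate_normal(mean=nu, cov=sigma, allow_singular=True)

def video_level_likelihood(mvg, frame_feats):
    """Frame-level likelihoods under the pristine MVG, averaged
    into a single video-level estimate."""
    return mvg.pdf(frame_feats).mean()

# Pooling, per the description above: the Delta terms are the mean
# frame-level likelihoods of the individual features, gamma is the
# overall departure from the pristine model, and
#   MoDi2D = Delta * gamma
```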
MoDi3D = MoDi2D × S.
The LCC, SROCC, and Root Mean Square Error (RMSE) are used to evaluate the proposed MoDi3D algorithm's performance. LCC measures the linear relationship between two variables, SROCC measures the monotonic relationship between two input sets, and RMSE measures the magnitude of the error between estimated objective scores and subjective DMOS scores. Higher LCC and SROCC values indicate good agreement between subjective and objective measures, and lower RMSE signifies more accurate prediction performance. A sketch of these metrics is given below.
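A minimal sketch of the evaluation metrics (note that in VQA practice a logistic fit is often applied to the objective scores before computing LCC and RMSE; that step is omitted here):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(objective, dmos):
    """LCC, SROCC and RMSE between objective predictions and
    subjective DMOS scores."""
    objective, dmos = np.asarray(objective), np.asarray(dmos)
    lcc = pearsonr(objective, dmos)[0]
    srocc = spearmanr(objective, dmos)[0]
    rmse = np.sqrt(np.mean((objective - dmos) ** 2))
    return lcc, srocc, rmse
```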
Conclusion
There were two main contributions in this report. First, a comprehensive subjective quality evaluation was performed on a symmetrically and asymmetrically distorted full HD S3D video dataset. The dataset contains 12 pristine S3D video sequences and 288 test stimuli. The test video sequences are a combination of H.264 and H.265 compression, blur distortions, and frame freezes. 20 subjects were involved in the study, which was conducted using the ACR-HR method.
Secondly, a completely blind S3D NR VQA algorithm based on computing the joint
statistical dependencies between motion and disparity sub band coefficients of an S3D video
was proposed. The BGGD parameters (α, β), and the coherence score (Ψ) from the
eigenvalues of the covariance matrix were extracted and the features were found to be
distortion discriminable. An unsupervised 2D NR IQA model (NIQE) was used to estimate
spatial quality.
Finally, these features were pooled to predict the overall quality of an S3D video. The
proposed objective algorithm MoDi3D demonstrates competitive performance as compared
to popular 2D and 3D FR and supervised NR image and video quality assessment models,
even though it is not trained on any distorted S3D videos nor on any annotations of them.
Q. Compute the performance of the objective blind predictor on the databases.
The best performance was achieved by the Video Quality Evaluation using Motion and Depth Statistics (VQUEMODES) model under 3D NR VQA (supervised), which is shown to outperform popular state-of-the-art 2D IQA/VQA and 3D IQA/VQA models when evaluated over all the datasets. The performance of this algorithm for all the datasets and performance metrics is as shown.
Although NIQE and other NR QA algorithms perform well on symmetrically distorted videos,
they fall short when it comes to asymmetrically distorted S3D videos. On both symmetrically
and asymmetrically distorted S3D videos, the proposed MoDi3D model (combination of
MoDi2D and Spatial NIQE score) performs consistently.
Due to the blank frames with H.264 compression artifacts during frame freezes, the joint
dependencies between motion and disparity components varied more when compared to the
H.264 and H.265 compressions. The MoDi3D model effectively captures these statistical
variations and delivers better performance on frame freezes than on other compression
artifacts. Blur is a spatial distortion that does not significantly change the motion information
properties of an S3D video. Therefore, the dependency variation between motion and
disparity components is lower compared to compression artifacts. As a result, the proposed model cannot capture the statistical connections between motion and disparity components as well, and MoDi3D performs worse on blur than on compression-based distortions. Finally, and most importantly, the proposed approach is completely unaware of subjective opinion, i.e., it is 'fully blind', and it performs well across a wide range of distortions and datasets.
References
[1] Y. Liu, L. K. Cormack, and A. C. Bovik, “Statistical modeling of 3-D natural scenes with application to Bayesian stereopsis,” IEEE Transactions on Image Processing, vol. 20, pp. 2515–2530, Sept 2011.
[2] “eMarketer: Better research. Better business decisions.” https://www.emarketer.com/.
[3] R. Tenniswood, L. Safonova, and M. Drake, “3D's effect on a film's box office and profitability,” 2010.
[4] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, “Study of subjective and objective quality assessment of video,” IEEE Transactions on Image Processing, vol. 19, pp. 1427–1441, June 2010.
[5] J. Wang, S. Wang, and Z. Wang, “Asymmetrically compressed stereoscopic
3D videos: Quality assessment and rate-distortion performance evaluation,”
IEEE Transactions on Image Processing, vol. 26, pp. 1330– 1343, March 2017.
[6] B. Appina, K. Manasa, and S. S. Channappayya, “Subjective and objective
study of the relation between 3D and 2D views based on depth and bitrate,”
Electronic Imaging, vol. 2017, no. 5, pp. 145–150, 2017.
[7] K. Wang, M. Barkowsky, R. Cousseau, K. Brunnström, R. Olsson, P. Le Callet, and M. Sjöström, “Subjective evaluation of HDTV stereoscopic videos in IPTV scenarios using absolute category rating,” in Proc. SPIE, vol. 7863, 2011.
[8] Z. Chen, W. Zhou, and W. Li, “Blind stereoscopic video quality assessment:
From depth perception to overall experience,” IEEE Transactions on Image
Processing, vol. 27, no. 2, pp. 721–734, 2018.
[9] E. Dumic, S. Grgić, K. Šakić, P. M. R. Rocha, and L. A. da Silva Cruz, “3D video subjective quality: a new database and grade comparison study,” Multimedia Tools and Applications, vol. 76, no. 2, pp. 2087–2109, 2017.
[10] “Lab for Video and Image Analysis (LFOVIA) Downloads,” http://www.iith.ac.in/∼lfovia/downloads.html.
[20] S. L. P. Yasakethu, C. T. E. R. Hewage, W. A. C. Fernando, and A. M. Kondoz, “Quality analysis for 3D video using 2D video quality models,” IEEE Transactions on Consumer Electronics, vol. 54, pp. 1969–1976, November 2008.
[11] J. Yang, H. Wang, W. Lu, B. Li, A. Badii, and Q. Meng, “A no-reference optical flow-based quality evaluator for stereoscopic videos in curvelet domain,” Information Sciences, vol. 414, pp. 133–146, 2017.
[12] G. Jiang, S. Liu, M. Yu, F. Shao, Z. Peng, and F. Chen, “No reference stereo
video quality assessment based on motion feature in tensor decomposition
domain,” Journal of Visual Communication and Image Representation, 2017.
[13] B. Appina, A. Jalli, S. S. Battula, and S. S. Channappayya, “No-reference stereoscopic video quality assessment algorithm using joint motion and depth statistics,” in 25th International Conference on Image Processing, IEEE, pp. 2800–2804, 2018.
[14] E. Cheng, P. Burton, J. Burton, A. Joseski, and I. Burnett, “RMIT3DV: Pre-announcement of a creative commons uncompressed HD 3D video database,” in Fourth International Workshop on Quality of Multimedia Experience, pp. 212–217, July 2012.
[15] M. T. Pourazad, Z. Mai, P. Nasiopoulos, K. Plataniotis, and R. K. Ward,
“Effect of brightness on the quality of visual 3D perception,” in International
Conference on Image Processing, IEEE, pp. 989–992, Sept 2011.