"Modernizing Videos Using Ai: A Mini Project Report Submitted in Partial Fulfillment of The Requirements Fo
"Modernizing Videos Using Ai: A Mini Project Report Submitted in Partial Fulfillment of The Requirements Fo
By
Piyush Raikwar (2017IMT-062)
Aryanshu Verma (2017IMT-019)
Harshit Patel (2017IMT-039)
Dhananjai Kumar (2017IMT-033)
2020
CANDIDATES DECLARATION
We hereby certify that the work, which is being presented in the report, entitled
“Modernizing Videos Using AI” in partial fulfillment of the requirement for
the award of the Degree of Bachelor of Technology and submitted to the
institution is an authentic record of our own work carried out during the period
March 2020 to May 2020 under the supervision of Prof. Dr. Somesh Kumar.
We have also cited the references for the text(s)/figure(s)/table(s) from wherever they
have been taken.
Piyush Raikwar
Harshit Patel
Aryanshu Verma
Dhananjai Kumar
This is to certify that the above statement made by the candidates is correct to
the best of my knowledge.
Have you ever wanted to bring back the good old days of watching classic black-and-white
shows like 'Charlie Chaplin', or not-so-old colour shows like 'Malgudi Days', which are no
longer up to the mark by today's display standards?
Nowadays, people are accustomed to watching movies, TV shows and other
video content in high definition. As a result, they do not enjoy watching
old movies and TV shows that suffer from poor quality, black-and-white footage,
inconsistent playback and so on. These problems have reduced the interest of
modern viewers. What if we told you that you could bring those days back
and re-experience those shows in colour, in high resolution and
without any static noise?
This is exactly the idea behind this project, i.e., to develop a fully automatic
approach for the restoration and enhancement of such videos. Our project is about
turning old-style, black-and-white, low-resolution, noise-ridden images or
videos into modern-style, coloured, high-resolution and noise-free
counterparts. We have created a web app which takes an image or a video
as input and produces an enhanced version of it depending upon the filters
selected by the user. The filters available are:
1. Colorization: adds plausible colours to the input.
2. Super-Resolution: outputs a high-resolution version of the input.
3. De-noising: removes static noise from the input.
We are highly indebted to Prof. Dr. Somesh Kumar, and are obliged for giving
us the autonomy of functioning and experimenting with ideas. We would like to
take this opportunity to express our profound gratitude to them not only for
their academic guidance but also for their personal interest in our project and
constant support coupled with confidence boosting and motivating sessions
which proved very fruitful and were instrumental in infusing self-assurance and
trust within us. The nurturing and blossoming of the present work is mainly due
to their valuable guidance, suggestions, astute judgment, constructive criticism
and an eye for perfection. Our mentor always answered our myriad doubts
with smiling graciousness and prodigious patience, never letting us feel that we
were novices, always lending an ear to our views, appreciating and improving
them, and giving us a free hand in our project. It is only because of their
overwhelming interest and helpful attitude that the present work has attained the
stage it has.
CONTENTS
1. Introduction
2. End-to-End Design
3. Related Work / Literature Review
   i. Image Colorization
   ii. Video Super-Resolution
   iii. Image De-noising
4. Method / Approach
   i. Image Colorization
   ii. Image Super-Resolution
   iii. Image De-noising
5. Results
   i. Image Colorization
   ii. Image Super-Resolution
   iii. Image De-noising
6. References
1. Introduction
The core idea of this project is to provide an easy-to-use and interactive web app
for enhancing your old or distorted videos. You can apply any of three filters, namely
Colorization, Super-Resolution and De-noising. The methods and algorithms used are
based on the Deep Learning, Computer Vision and Image Representation domains. A
video is converted into individual frames, and the filters are applied either to a single
image or to the individual frames of a video.
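To make this per-frame pipeline concrete, here is a minimal Python sketch using OpenCV; apply_filter is a hypothetical placeholder for any of the three models, not the project's actual code.

```python
import cv2

def apply_filter(frame):
    # Hypothetical placeholder: in the real app this would call the
    # colorization, super-resolution, or de-noising model on the frame.
    return frame

def enhance_video(in_path, out_path):
    cap = cv2.VideoCapture(in_path)            # read the input video
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = None
    while True:
        ok, frame = cap.read()                 # grab the next frame
        if not ok:
            break
        out = apply_filter(frame)              # enhance this single frame
        if writer is None:                     # lazily create the output writer
            h, w = out.shape[:2]
            fourcc = cv2.VideoWriter_fourcc(*"mp4v")
            writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
        writer.write(out)                      # append the enhanced frame
    cap.release()
    if writer is not None:
        writer.release()
```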
Figure: Example input grayscale photos and output colorizations from our algorithm.
The task of super-resolving an image (increasing the spatial dimensions of a given
low-resolution image) is an ill-posed problem and is an active research
topic within computer vision. As part of our project we explore the use of this
technique on videos, with some modifications that help produce better results.
Modifications to the standard implementation of SRGANs [2] were necessary for
videos; without them, the resulting super-resolved video contained several
"flickering" frames, i.e. frames with varying brightness. Our network is able to
super-resolve a variety of videos with gains in perceptual quality.
The idea of single image super-resolution (SISR) is to take a low-resolution
image and upscale it to a higher resolution. Many techniques have been
researched and showcased over the past years. The earliest techniques were either
bicubic interpolation based or prediction based. Though these were fast, they
resulted in overly smoothed textures. More recently, convolutional neural network based
super-resolution techniques have shown promising results. Some used deep networks
with a loss function closer to perceptual similarity, which recovers visually more
convincing HR images. Based on these works, and with the aim of retaining textures in
super-resolved images, Ledig et al. [2] proposed the use of GANs, which provided up to
4x upscaling while generating sharp textures. The success of this technique on images
raises the question of whether SRGANs are feasible on videos as well. In this project we
use a modified SRGAN to apply super-resolution to images and videos.
From left to right: bicubic interpolation, deep residual network optimized for MSE, deep residual
generative adversarial network optimized for a loss more sensitive to human perception, original HR
image. Corresponding PSNR and SSIM are shown in brackets. [4× upscaling]
For video or image de-noising: one of the major problems associated with old
images and videos is the presence of heavy noise. It can take many forms, such
as flickering, unusual contrast changes over time, etc. A lot of techniques have been
introduced to solve this issue. Some of them are based on simple image processing,
but they do not work that well, so here we use a well-known machine learning based
approach, de-noising with the help of an autoencoder, to tackle this problem.
As the name convolutional autoencoder suggests, this is a convolutional neural
network based approach. Autoencoders have been widely applied to dimensionality
reduction and image noise reduction. Modelling image data requires a special approach
in the neural network world: the best known neural network for modelling image data is
the convolutional neural network (CNN, or ConvNet), and its autoencoder variant is
called a convolutional autoencoder.
Several methods have been proposed to remove the noise and recover the true
image u. Even though they may use very different tools, it must be emphasized that a
wide class of them shares the same basic principle: denoising is achieved by averaging.
This averaging may be performed locally (the Gaussian smoothing model, anisotropic
filtering and the neighbourhood filtering of Tomasi), by the calculus of variations
(Total Variation minimization), or in the frequency domain (empirical Wiener filters and
wavelet thresholding methods).
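For contrast with the learned approach used in this project, here is a minimal sketch of two such classical averaging-based filters with OpenCV (purely illustrative; not part of our pipeline, and the file paths are placeholders):

```python
import cv2

# Load a noisy grayscale image (path is illustrative).
noisy = cv2.imread("noisy_frame.png", cv2.IMREAD_GRAYSCALE)

# Local averaging: Gaussian smoothing with a 5x5 kernel, sigma = 1.5.
gaussian = cv2.GaussianBlur(noisy, (5, 5), 1.5)

# Neighbourhood filtering: non-local means averages similar patches.
nlm = cv2.fastNlMeansDenoising(noisy, None, 10, 7, 21)

cv2.imwrite("gaussian_denoised.png", gaussian)
cv2.imwrite("nlm_denoised.png", nlm)
```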
2. End-to-End Design
The technologies used in the implementation of the frontend are HTML5, CSS3,
JavaScript and Bootstrap.
CSS is used to create the layout and design of the HTML elements on the page.
JavaScript applies changes to the HTML, adds logic to the web page by processing
data, and helps us receive and/or submit data between the browser and the server.
Bootstrap 4 gives us the ability to easily create responsive designs, enabling faster
and easier web development.
➔ The backend server receives a request from the user's web browser, wrapped up in
JSON.
➔ The backend pushes the job into a queue.
➔ The backend replies to the user: "Please wait". The backend is then free to serve
other users.
➔ The user's web browser starts displaying a 'please wait' spinner.
➔ Eventually, a worker picks up the job, removes it from the queue, and processes it
through the appropriate ML model. It saves the process-related info to a database.
➔ Meanwhile, the user's web browser polls the backend every 30 seconds, providing
the job id, to ask whether the job is done yet. The backend checks whether the
database has a result stored for that id and replies accordingly. Any of our multiple
horizontal backends is able to serve the user's request. You might imagine that the
shared database is a single point of failure, and you'd be right! But separately, we
provisioned replicas and some failover mechanism, maybe sharding/load
balancing, so it's all good.
➔ After 15 minutes plus a bit, the user polls for a result, and we are able to serve
it up (a minimal sketch of this flow is shown below).
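A minimal, hypothetical sketch of this request/queue/poll flow in Python with Flask; the endpoint names, in-memory storage and worker are assumptions, and the deployed app may differ.

```python
import uuid, threading, queue
from flask import Flask, request, jsonify

app = Flask(__name__)
jobs = queue.Queue()   # pending jobs
results = {}           # job id -> result (stands in for the shared database)

def run_model(payload):
    # Placeholder for the colorization / super-resolution / de-noising model.
    return {"status": "done", "output": "enhanced.mp4"}

def worker():
    # A worker picks jobs off the queue and runs the model.
    while True:
        job_id, payload = jobs.get()
        results[job_id] = run_model(payload)

@app.route("/enhance", methods=["POST"])
def enhance():
    job_id = str(uuid.uuid4())
    jobs.put((job_id, request.get_json()))                    # push job into the queue
    return jsonify({"id": job_id, "message": "Please wait"})  # reply immediately

@app.route("/result/<job_id>")
def result(job_id):
    # The browser polls this endpoint with the job id.
    if job_id in results:
        return jsonify(results[job_id])
    return jsonify({"status": "pending"})

if __name__ == "__main__":
    threading.Thread(target=worker, daemon=True).start()
    app.run()
```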
Webapp Operating Procedure:
3. Related Work / Literature Review
3. i. Image Colorization
Colorization algorithms mostly differ in the ways they obtain and treat the data for
modeling the correspondence between grayscale and color. Non-parametric methods,
given an input grayscale image, first define one or more color reference images
(provided by a user or retrieved automatically) to be used as source data. Then,
following the Image Analogies framework, color is transferred onto the input image from
analogous regions of the reference image(s). Parametric methods, on the other hand,
learn prediction functions from large datasets of color images at training time, posing
the problem as either regression onto continuous color space [4,5] or classification of
quantized color values [6]. Our method also learns to classify colors, but does so with a
larger model, trained on more data, and with several innovations in the loss function and
mapping to a final continuous output.
In parallel, Larsson et al. [7] and Iizuka et al. have developed similar systems,
which leverage large-scale data and CNNs. The methods differ in their CNN
architectures and loss functions. While we use a classification loss, with rebalanced rare
classes, Larsson et al. use an un-rebalanced classification loss, and Iizuka et al. use a
regression loss. The CNN architectures are also somewhat different: Larsson et al. use
hypercolumns on a VGG network, Iizuka et al. use a two-stream architecture in which
they fuse global and local features, and we use a single-stream, VGG-styled network
with added depth and dilated convolutions. In addition, while we and Larsson et al. train
our models on ImageNet [8], Iizuka et al. train their model on Places.
3. ii. Video Super-Resolution
Several recent works extend super-resolution to video while maintaining real-time
speed. All these works use different information from the frames preceding and
succeeding the frame being super-resolved. For example, "End-to-end learning of video
super-resolution with motion compensation" uses optical flow across the frames. In this
project we examine how photo-realistic a video looks when it has been super-resolved
with an SRGAN that only looks at a single frame at a time. With respect to single image
super-resolution, learning the upscaling filters becomes pivotal, as it improves both
accuracy and speed over models like the one of Dong et al. [2], for example, which
upscales the image with bicubic interpolation before feeding it to the CNN. We also see
that for videos, unlike images, preserving a coherent average brightness across frames
is essential.
3. iii. Image De-noising
Image denoising aims to remove noise from a noisy image ([10], Baldi and Hornik)
so as to restore the true image. However, since noise, edges and texture are all
high-frequency components, it is difficult to distinguish them in the process of denoising,
and the denoised images inevitably lose some detail. Overall, recovering meaningful
information from noisy images in the process of noise removal, so as to obtain
high-quality images, remains an important problem.
In fact, image denoising is a classic problem and has been studied for a long
time. However, it remains a challenging and open task.
4. Method
4. i. Image Colorization
● Design
We train a CNN to map from a grayscale input to a distribution over quantized
color value outputs using the architecture shown in the following figure. In the following,
we focus on the design of the objective function, and our technique for inferring point
estimates of color from the predicted color distribution.
Figure: Our network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and
ReLU layers, followed by a BatchNorm layer. The net has no pool layers. All changes in resolution
are achieved through spatial downsampling or upsampling between conv blocks.
● Objective function
Given an input lightness channel $X \in \mathbb{R}^{H \times W \times 1}$, our objective is to learn a mapping
$\hat{Y} = \mathcal{F}(X)$ to the two associated color channels $Y \in \mathbb{R}^{H \times W \times 2}$, where $H$, $W$ are the image
dimensions. (We denote predictions with a $\hat{\cdot}$ symbol and ground truth without.) We
perform this task in the CIE Lab color space. Because distances in this space model
perceptual distance, a natural objective function, as used in [4,5], is the Euclidean loss
$L_2(\cdot,\cdot)$ between predicted and ground truth colors:
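The displayed equation is not reproduced in this copy; a LaTeX reconstruction of the Euclidean loss, assuming the standard per-pixel formulation, is:

```latex
L_2(\hat{Y}, Y) = \frac{1}{2} \sum_{h,w} \left\| Y_{h,w} - \hat{Y}_{h,w} \right\|_2^2
```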
However, this loss is not robust to the inherent ambiguity and multimodal nature
of the colorization problem. If an object can take on a set of distinct ab values, the
optimal solution to the Euclidean loss will be the mean of the set. In color prediction, this
averaging effect favors grayish, desaturated results. Instead, we treat the problem as
multinomial classification: the ab output space is quantized into bins, the ground truth
color $Y$ is encoded as a distribution $Z$ over these bins, and the network predicts a
distribution $\hat{Z}$ which is trained with a multinomial cross-entropy loss, where $v(\cdot)$ is a
weighting term that can be used to rebalance the loss based on color-class rarity.
Finally, we map the predicted probability distribution $\hat{Z}$ to color values $\hat{Y}$ with the
function $\hat{Y} = H(\hat{Z})$.
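A LaTeX reconstruction of the class-rebalanced classification loss referred to above (the exact indexing over quantized bins $q$ is an assumption following Zhang et al.'s formulation):

```latex
L_{cl}(\hat{Z}, Z) = -\sum_{h,w} v\!\left(Z_{h,w}\right) \sum_{q} Z_{h,w,q} \, \log \hat{Z}_{h,w,q}
```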
4. ii. Image Super-Resolution
● Adversarial network architecture
In our project, as in the Ledig et al. [2] paper, there is a discriminator network
$D_{\theta_D}$ which is optimized in an alternating manner along with the generator network
$G_{\theta_G}$ to solve the adversarial min-max problem:
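The displayed min-max objective is missing from this copy; a LaTeX reconstruction following the SRGAN formulation of [2] (notation assumed from that paper):

```latex
\min_{\theta_G} \max_{\theta_D} \;
\mathbb{E}_{I^{HR} \sim p_{\text{train}}(I^{HR})}\!\left[ \log D_{\theta_D}(I^{HR}) \right]
+ \mathbb{E}_{I^{LR} \sim p_{G}(I^{LR})}\!\left[ \log\!\left(1 - D_{\theta_D}\!\left(G_{\theta_G}(I^{LR})\right)\right) \right]
```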
The idea behind this is that it allows one to train a generative model G with the goal of
fooling a differentiable discriminator D that is trained to distinguish super-resolved
images from real images, thus encouraging perceptually superior solutions.
● Loss Function
The perceptual loss function $l^{SR}$ is pivotal for the performance of the
generator network. The perceptual loss is the weighted sum of a content loss $l_X^{SR}$
and an adversarial loss component:
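The displayed definition is missing here; a LaTeX reconstruction following [2], where the $10^{-3}$ weighting is the value used in that paper and is assumed here:

```latex
l^{SR} = \underbrace{l_X^{SR}}_{\text{content loss}} \;+\; \underbrace{10^{-3}\, l_{Gen}^{SR}}_{\text{adversarial loss}}
```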
In the following we describe possible choices for the content loss $l_X^{SR}$ and the
adversarial loss $l_{Gen}^{SR}$.
➔ Content Loss:
The first part of the content loss is the VGG loss, based on the ReLU activation
layers of the pre-trained 19-layer VGG network. However, using this alone as the content
loss will not capture the pixel intensities. This was not a problem in [2], as it was only
concerned with a photo-realistic super-resolution of a single image. But when it comes to
super-resolving a video, the pixel intensities become important as well, because in a
video the pixel intensities do not change much from one frame to the next. Hence there
is a need to maintain consistency of pixel intensities. Thus we add an L1 loss to the
content loss to enforce pixel-intensity constancy.
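A minimal TensorFlow/Keras sketch of such a combined VGG + L1 content loss; the chosen feature layer and weighting are assumptions, not the project's exact implementation.

```python
import tensorflow as tf

# Pre-trained VGG19 truncated at a deep ReLU feature layer (layer choice assumed).
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
feature_extractor = tf.keras.Model(
    vgg.input, vgg.get_layer("block5_conv4").output)
feature_extractor.trainable = False

def content_loss(hr, sr, l1_weight=1.0):
    """VGG feature (perceptual) loss plus an L1 pixel loss.

    hr, sr: batches of high-resolution and super-resolved images in [0, 1].
    The L1 term encourages consistent pixel intensities across video frames.
    """
    hr_feat = feature_extractor(
        tf.keras.applications.vgg19.preprocess_input(hr * 255.0))
    sr_feat = feature_extractor(
        tf.keras.applications.vgg19.preprocess_input(sr * 255.0))
    vgg_loss = tf.reduce_mean(tf.square(hr_feat - sr_feat))  # MSE in feature space
    l1_loss = tf.reduce_mean(tf.abs(hr - sr))                # pixel intensity term
    return vgg_loss + l1_weight * l1_loss
```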
➔ Adversarial Loss:
To favor solutions that reside on the manifold of natural images, [2] also adds an
adversarial loss. This generative loss $l_{Gen}^{SR}$ is defined based on the probabilities of the
discriminator $D_{\theta_D}(G_{\theta_G}(I^{LR}))$ over all training samples as:
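The displayed formula is missing here; a LaTeX reconstruction following [2]:

```latex
l_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\left(G_{\theta_G}(I^{LR})\right)
```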
➔ Training Procedure:
It is to be noted that it is possible for the generator network to learn an
up-sampling function by minimizing just the content loss $l_X^{SR}$ on pairs of
low-resolution and high-resolution images. The addition of the adversarial loss and the
GAN training procedure incentivizes the generator to produce more realistic
high-resolution images with finer details.
4. iii. Image De-noising
● Architecture
The encoder part of the network is a typical convolutional pyramid. Each
convolutional layer is followed by a max-pooling layer to reduce the spatial dimensions.
The decoder, though, might be something new to you: it needs to convert from a narrow
representation back to a wide reconstructed image. For example, the bottleneck
representation could be a 4x4x8 max-pool layer; this is the output of the encoder, but
also the input to the decoder. We want to get a 28x28x1 image out of the decoder, so
we need to work our way back up from the narrow decoder input layer. A schematic of
the network is shown below.
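As a concrete illustration of this encoder-decoder design, here is a minimal Keras sketch of a convolutional denoising autoencoder for 28x28x1 inputs; the layer sizes and filter counts are illustrative assumptions, not the exact network used.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_denoising_autoencoder():
    inputs = layers.Input(shape=(28, 28, 1))

    # Encoder: conv + max-pool pyramid narrows the representation.
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D(2, padding="same")(x)           # 14x14
    x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
    encoded = layers.MaxPooling2D(2, padding="same")(x)     # 7x7x8 bottleneck

    # Decoder: upsample back to the original 28x28 resolution.
    x = layers.Conv2D(8, 3, activation="relu", padding="same")(encoded)
    x = layers.UpSampling2D(2)(x)                            # 14x14
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
    x = layers.UpSampling2D(2)(x)                            # 28x28
    outputs = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
```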
1. The Convolution Layer
The convolution step creates many small pieces called the feature maps or
features like the green, red or navy blue squares in Figure (E). These squares preserve
the relationship between pixels in the input image. Let each feature scan through the
original image like what’s shown in Figure (F). This process in producing the scores is
called filtering.
After scanning through the original image, each feature produces a filtered image
with high scores and low scores as shown in Figure (G). If there is a perfect match,
there is a high score in that square. If there is a low match or no match, the score is low
or zero. For example, the red square found four areas in the original image that show a
perfect match with the feature, so scores are high for those four areas.
The decoder uses "Upsample" layers that you might not have seen before. First,
consider what these layers are not. Usually, one sees transposed
convolution layers used to increase the width and height of the layers. They work almost
exactly the same as convolutional layers, but in reverse. A stride in the input layer
results in a larger stride in the transposed convolution layer. For example, if you have a
3x3 kernel, a 3x3 patch in the input layer will be reduced to one unit in a convolutional
layer. Comparatively, one unit in the input layer will be expanded to a 3x3 patch in a
transposed convolution layer. The TensorFlow API provides us with an easy way to
create the layers, tf.nn.conv2d_transpose.
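A small sketch showing how a single unit expands to a 3x3 patch with tf.nn.conv2d_transpose; the shapes are chosen purely for illustration.

```python
import tensorflow as tf

# A single "unit": a 1x1 input with one channel.
x = tf.ones([1, 1, 1, 1])          # [batch, height, width, channels]
kernel = tf.ones([3, 3, 1, 1])     # [kh, kw, out_channels, in_channels]

# Transposed convolution expands the 1x1 input to a 3x3 output.
y = tf.nn.conv2d_transpose(x, kernel,
                           output_shape=[1, 3, 3, 1],
                           strides=[1, 1, 1, 1],
                           padding="VALID")
print(y.shape)  # (1, 3, 3, 1)
```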
5. Results
5. i. Image Colorization
Here we assess the graphics aspect of our algorithm, evaluating the perceptual
realism of our colorizations, along with other measures of accuracy. We compare our
full algorithm to several variants. Finally, we show qualitative examples on legacy black
and white images.
We train our network on the 1.3M images from the ImageNet training set [8],
validate on the first 10k images in the ImageNet validation set, and test on a separate
10k images in the validation set, same as in [7].
Figure: Example results from our ImageNet test set. Our classification loss with rebalancing
produces more accurate and vibrant results than a regression loss or a classification loss without
rebalancing.
5. ii. Image Super-Resolution
● Testing on Videos:
We applied the SRGAN model to several videos and what was observed initially
was that even though the model was able to recover details in the upscaled frames,
there was a significant change in the overall brightness of the output frames.
This rapid change in brightness between frames led to a flickering effect in the
super-resolved video. The figures above attempt to depict this phenomenon, but it is
much more prominent when one views the videos. Looking at the "VGG" row in the
figures, we can see that there is a significant change in brightness between successive
frames. This was, however, minimized by adding an L1 loss to the content loss
and retraining the SRGAN on the DIV2K dataset.
Figure: 1st row: low-resolution frame; 2nd row: frame super-resolved with SRGAN and L1 loss;
3rd row: frame super-resolved with SRGAN.
5. iii. Image De-noising
We use MNIST, which is a well-known database of handwritten digits. Keras has
an MNIST dataset utility; we can download the data as follows:
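The code itself is not reproduced in this copy; a minimal sketch of loading and normalizing MNIST with Keras:

```python
import numpy as np
from tensorflow.keras.datasets import mnist

# Download (if needed) and load the MNIST digits.
(x_train, _), (x_test, _) = mnist.load_data()

# Scale to [0, 1] and add a channel axis: (N, 28, 28, 1).
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
```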
● Testing on images
We add noise to the test images and pass them through the autoencoder. It
does a surprisingly good job of removing the noise, even when it is sometimes
difficult to tell what the original digit is.
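A minimal sketch of this test, assuming `x_test` from the loading snippet above and a trained `autoencoder` model (e.g. one built with the hypothetical build_denoising_autoencoder sketched in the Method section); the noise level is an assumption.

```python
import numpy as np

# Add Gaussian noise to the test images and clip back into [0, 1].
noise_factor = 0.5
x_test_noisy = x_test + noise_factor * np.random.normal(size=x_test.shape)
x_test_noisy = np.clip(x_test_noisy, 0.0, 1.0)

# Pass the noisy digits through the trained autoencoder to de-noise them.
denoised = autoencoder.predict(x_test_noisy)
```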
● Testing on Videos:
Output
6. References
[1] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution:
Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, July 2017.
[3] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-time video
super-resolution with spatio-temporal networks and motion compensation. arXiv preprint
arXiv:1611.05250, 2016.
[4] Z. Cheng, Q. Yang, and B. Sheng. Deep colorization. In Proceedings of the IEEE International
Conference on Computer Vision, pages 415–423, 2015.
[6] G. Charpiat, M. Hofmann, and B. Schölkopf. Automatic image colorization via multimodal
predictions. In Computer Vision – ECCV 2008, pages 126–139. Springer, 2008.
[7] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic
colorization. In European Conference on Computer Vision, 2016.
[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International
Journal of Computer Vision, 115(3):211–252, 2015.
[9] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
propagation. In Parallel Distributed Processing, Vol. 1: Foundations. MIT Press, Cambridge, MA,
1986.
[10] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from
examples without local minima. Neural Networks, 2(1):53–58, 1988.