
“MODERNIZING VIDEOS USING AI”

A mini project report submitted in partial fulfillment of the
requirements for B.Tech. Mini Project

Integrated Post Graduate

By
Piyush Raikwar (2017IMT-062)
Aryanshu Verma (2017IMT-019)
Harshit Patel (2017IMT-039)
Dhananjai Kumar (2017IMT-033)

under the supervision of
Dr. Somesh Kumar

ABV INDIAN INSTITUTE OF INFORMATION TECHNOLOGY AND MANAGEMENT, GWALIOR-474010

2020
CANDIDATES' DECLARATION
We hereby certify that the work being presented in the report entitled
“Modernizing Videos Using AI”, in partial fulfillment of the requirements for
the award of the Degree of Bachelor of Technology and submitted to the
institution, is an authentic record of our own work carried out during the period
March 2020 to May 2020 under the supervision of Prof. Dr. Somesh Kumar.
We have also cited the references for the text(s)/figure(s)/table(s) from where they
have been taken.

Date: 10 June 2020                                Signatures of the Candidates

Piyush Raikwar

Harshit Patel

Aryanshu Verma

Dhananjai Kumar

This is to certify that the above statement made by the candidates is correct to
the best of my knowledge.

Date: Signatures of the Research Supervisors


ABSTRACT

Ever wanted to bring back the good old days of watching classic black-and-white
shows like ‘Charlie Chaplin’, or not-so-old colour shows like ‘Malgudi Days’,
which are not up to the mark by today's display standards?
Nowadays, people are accustomed to watching movies, TV shows and other
video content in high definition. As a result, they no longer enjoy watching
old movies and TV shows with poor quality, black-and-white footage and
inconsistent playback, and these problems have reduced the interest of
modern viewers. What if we told you that you could bring those days back
and re-experience those shows in coloured form, in high resolution and
without any static noise?

This is exactly the idea behind this project, i.e., to develop a fully automatic
approach for the restoration and enhancement of such videos. Our project is about
turning old-style, black-and-white, low-resolution, static-noise-ridden images or
videos into modern-style, coloured, high-resolution and noise-free
counterparts. We have created a web app which can take an image or a video
as input and produce an enhanced version of it depending upon the filters
selected by the user. The filters available are:
1. Colorization: adds sensible colours to the input.
2. Super-Resolution: outputs a high-resolution version of the input.
3. De-noising: removes static noise from the input.

Keywords: Colorization, Vision for Graphics, CNNs, Super-Resolution, DNNs, Image and Video Denoising, Autoencoder.
ACKNOWLEDGEMENTS

We are highly indebted to Prof. Dr. Somesh Kumar, and are obliged for giving
us the autonomy of functioning and experimenting with ideas. We would like to
take this opportunity to express our profound gratitude to him not only for
his academic guidance but also for his personal interest in our project and
constant support, coupled with confidence-boosting and motivating sessions
which proved very fruitful and were instrumental in infusing self-assurance and
trust within us. The nurturing and blossoming of the present work is mainly due
to his valuable guidance, suggestions, astute judgment, constructive criticism
and an eye for perfection. Our mentor always answered our myriad doubts
with smiling graciousness and prodigious patience, never letting us feel that we
were novices, always lending an ear to our views, appreciating and improving
them, and giving us a free hand in our project. It is only because of his
overwhelming interest and helpful attitude that the present work has attained the
stage it has.

Finally, we are grateful to our Institution and colleagues whose constant
encouragement served to renew our spirit, refocus our attention and energy,
and helped us in carrying out this work.

(Piyush Raikwar - 2017IMT062)
(Aryanshu Verma - 2017IMT019)
(Harshit Patel - 2017IMT039)
(Dhananjai Kumar - 2017IMT033)
TABLE OF CONTENTS
0. Abstract
1. Introduction
2. End-to-End Design
3. Related Work / Literature Review
   i. Image Colorization
   ii. Image Super-Resolution
   iii. Image De-noising
4. Method / Approach
   i. Image Colorization
   ii. Image Super-Resolution
   iii. Image De-noising
5. Results
   i. Image Colorization
   ii. Image Super-Resolution
   iii. Image De-noising
6. References
1. Introduction

The core idea of this project is to provide an easy-to-use and interactive web app
for enhancing your old/distorted videos. You can apply any of the three filters, namely
Colorization, Super-Resolution and De-noising. The methods and algorithms used are
drawn from the Deep Learning, Computer Vision and Image Representation domains. A
video is converted into individual frames, and the selected filters are applied either to a
single image or to the individual frames of a video.
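As a rough sketch of this per-frame pipeline (OpenCV-based; function and file handling here are illustrative, not the project's actual code):

```python
import cv2

def enhance_video(in_path, out_path, enhance_frame):
    """Split a video into frames, apply a filter to each frame, and
    re-encode the result. `enhance_frame` stands in for any of the
    three filters (colorization, super-resolution, de-noising)."""
    cap = cv2.VideoCapture(in_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    writer = None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        out = enhance_frame(frame)                      # process one frame
        if writer is None:                              # output size known after first frame
            h, w = out.shape[:2]
            fourcc = cv2.VideoWriter_fourcc(*"mp4v")
            writer = cv2.VideoWriter(out_path, fourcc, fps, (w, h))
        writer.write(out)
    cap.release()
    if writer is not None:
        writer.release()
```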

For Image/Video Colorization, given a grayscale photograph (an individual frame
of a video) as input, we implement a solution to the problem of hallucinating a
plausible colour version of the photograph. As the following figure shows, in many
cases the semantics of the scene and its surface texture provide ample cues for many
regions in each image: the grass is typically green, the sky is typically blue, and the
ladybug is most definitely red. This problem is clearly underconstrained, so previous
approaches have either relied on significant user interaction or resulted in desaturated
colorizations. Our system is implemented as a feed-forward pass in a CNN at test time
and is trained on over a million colour images. Given the lightness channel L, the system
predicts the corresponding a and b colour channels of the image in the CIE Lab
colour space. To solve this problem, we leverage large-scale data. Predicting colour has
the nice property that training data is practically free: any colour photo can be used as a
training example, simply by taking the image's L channel as input and its ab channels
as the supervisory signal.

Figure: Example input grayscale photos and output colorizations from our algorithm. 
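As a minimal sketch of how such training pairs can be formed (using scikit-image for the Lab conversion; the helper name is illustrative):

```python
from skimage import color

def lab_training_pair(rgb_image):
    """Any colour photo gives a free training example: the L channel is
    the network input, the ab channels are the supervisory signal."""
    lab = color.rgb2lab(rgb_image)   # (H, W, 3): L, a, b channels
    L = lab[..., :1]                 # input lightness, shape (H, W, 1)
    ab = lab[..., 1:]                # target colour channels, shape (H, W, 2)
    return L, ab
```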

The task of super-resolving an image (increasing the spatial dimensions of an
image given a low-resolution input) is an ill-posed problem and an active research
topic within computer vision. As part of our project we explore the use of this
technique on videos, with some modifications that help produce better results.
Modifications to the standard implementation of SRGANs [2] on videos were
necessary because, without them, the resulting super-resolved video contained several
“flickering” frames, i.e. frames with varying brightness. Our network is able to
super-resolve a variety of videos with gains in perceptual quality.

The idea of single image super-resolution (SISR) is to take a low-resolution image
and upscale it to a higher resolution. Many techniques have been researched and
showcased over the past years. The earliest techniques used bicubic interpolation or were
prediction based; though these were fast, they resulted in overly smooth textures. More
recently, convolutional neural network based super-resolution techniques have shown
promising results. Some used deep networks with a loss function closer to perceptual
similarity, which recovers visually more convincing HR images. Based on these works,
and with the aim of retaining textures in super-resolved images, Ledig et al. [2] proposed
the use of GANs, which provided up to 4× upscaling while generating sharp textures.
The success of this technique on images raises the question of the feasibility of SRGANs
on videos as well. In this project we use a modified SRGAN to apply super-resolution
to images and videos.

Figure: From left to right: bicubic interpolation, deep residual network optimized for MSE, deep residual
generative adversarial network optimized for a loss more sensitive to human perception, original HR
image. Corresponding PSNR and SSIM are shown in brackets. [4× upscaling]

For Video or Image De-noising, one of the major problems associated with old
images and videos is the presence of high noise, which can take many forms, such as
flickering and unusual contrast changes over time. Many techniques have been introduced
to address this issue. Some of them are based on simple image processing, but they do not
work very well, so here we use a well-known machine-learning-based approach,
denoising with the help of an autoencoder, to tackle this problem.

This is a convolutional neural network based approach, as is clear from the name
convolutional autoencoder. Autoencoders have been widely applied in dimensionality
reduction and image noise reduction. Modelling image data requires a special approach in
the neural network world; the best known neural network for modelling image data is the
Convolutional Neural Network (CNN, or ConvNet), which in this setting is called a
convolutional autoencoder.

Several methods have been proposed to remove the noise and recover the true
image u. Even though they may be very different in their tools, it must be emphasized that
a wide class of them share the same basic remark: denoising is achieved by averaging.
This averaging may be performed locally (the Gaussian smoothing model, anisotropic
filtering and the neighbourhood filtering of Tomasi), by the calculus of variations (Total
Variation minimization), or in the frequency domain (the empirical Wiener filters and
wavelet thresholding methods).

2. End-to-End Design:
The technologies used in the implementation of the frontend are:
HTML5, CSS3, JavaScript and Bootstrap.

HTML: structures the website content.

CSS: is used to create a layout and design for the HTML elements on the page.

JS: applies changes to the HTML, allows us to add logic to the webpage by processing
data, and helps us receive and/or submit data from the browser to/from a server.

Bootstrap 4: provides the ability to easily create responsive designs, enabling faster
and easier web development.

Backend Concept and Implementation:

In general, we want to run as many backend instances as possible, for scalability. Each
instance has to remain stateless: finish handling the HTTP request and exit. We don't
want anything in memory between requests, because a client's first request might go to
one server and a subsequent request to another. It is bad if we have a long-running
endpoint: it would tie up one of our servers (say, doing some ML task), leaving it
unable to handle other users' requests. We need to keep the web server responsive and
have it hand off long-running tasks, with some kind of shared persistence so that when
the user checks progress or requests the result, any server can report it. The answer is a
first-in, first-out (FIFO) queue. The backend simply enqueues jobs.

➔ The backend server receives a request from the user's web browser, wrapped up in
JSON.
➔ The backend pushes the job into a queue.
➔ The backend replies to the user: "Please wait", and is then free to serve other
users.
➔ The user's web browser starts displaying a ‘please wait’ spinner.
➔ Eventually, a worker picks up the job, removes it from the queue, and processes it
through the relevant ML model. It saves the process-related info to a database.
➔ Meanwhile, the user's web browser polls the backend every 30 seconds to
ask whether the job is done yet, providing the job id. The backend checks whether
the database has a result stored and replies accordingly. Any of our multiple horizontal
backends is able to serve the user's request. You might imagine that the shared
database is a single point of failure, and you'd be right! But separately, we
provisioned replicas and some failover mechanism, maybe sharding/load
balancing, so it's all good.
➔ After 15 minutes plus a bit, the user polls for a result, and we are able to serve
it up.
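A minimal sketch of this enqueue-and-poll pattern, assuming a Flask backend and a Redis list as the FIFO queue (endpoint names and keys are illustrative, not the project's actual code); a separate worker process would pop jobs from the list, run the ML model, and write the result key:

```python
import json
import uuid

import redis
from flask import Flask, jsonify, request

app = Flask(__name__)
store = redis.Redis()

@app.route("/enhance", methods=["POST"])
def enhance():
    # Wrap the job up in JSON and push it onto the FIFO queue.
    job_id = str(uuid.uuid4())
    job = {"id": job_id, "filter": request.json["filter"],
           "video_path": request.json["video_path"]}
    store.rpush("jobs", json.dumps(job))
    # Reply immediately so this backend instance stays free for other users.
    return jsonify({"status": "Please wait", "id": job_id})

@app.route("/status/<job_id>")
def status(job_id):
    # The browser polls with the job id; any backend instance can answer
    # because the result lives in shared storage, not in process memory.
    result = store.get(f"result:{job_id}")
    if result is None:
        return jsonify({"done": False})
    return jsonify({"done": True, "result_path": result.decode()})
```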

Web app working Flowchart:

Webapp Operating Procedure:

A simple UI is designed using the technologies discussed above, and an easy-to-use
interface is provided to the user. Procedure to use the web app:
➢ Open the web app in any browser.
➢ Click on the ‘Browse’ button to locate a video or image on your device.
➢ Click on the green button ‘Here!’ to upload the video.
➢ The video will now be visible under the ‘Video added’ section.
➢ Select the filter of your choice.
➢ Locate the red button ‘Here’ at the bottom and press it to apply the filter.
➢ Download the video with the ‘DOWNLOAD’ button once the progress bars are
complete, indicating successful completion.

3. Related Work / Literature Review

3. i. Image Colorization
Colorization algorithms mostly differ in the ways they obtain and treat the data for
modeling the correspondence between grayscale and color. Non-parametric methods,
given an input grayscale image, first define one or more color reference images
(provided by a user or retrieved automatically) to be used as source data. Then,
following the Image Analogies framework, color is transferred onto the input image from
analogous regions of the reference image(s). Parametric methods, on the other hand,
learn prediction functions from large datasets of color images at training time, posing
the problem as either regression onto continuous color space [4,5] or classification of
quantized color values [6]. Our method also learns to classify colors, but does so with a
larger model, trained on more data, and with several innovations in the loss function and
mapping to a final continuous output.

Concurrently, Larsson et al. [7] and Iizuka et al. have developed similar systems,
which leverage large-scale data and CNNs. The methods differ in their CNN
architectures and loss functions. While we use a classification loss, with rebalanced rare
classes, Larsson et al. use an un-rebalanced classification loss, and Iizuka et al. use a
regression loss. The CNN architectures are also somewhat different: Larsson et al. use
hypercolumns on a VGG network, Iizuka et al. use a two-stream architecture in which
they fuse global and local features, and we use a single-stream, VGG-styled network
with added depth and dilated convolutions. In addition, while we and Larsson et al. train
our models on ImageNet [8], Iizuka et al. train their model on Places.

3. ii. Image Super-Resolution


There has been substantial work on super-resolution of videos. The motivation to
do this is fueled by the numerous implications it might have: if successful, areas like
video compression, transfer of video over a network, and video processing might have
scope for improvement with these new techniques. One end-to-end video
super-resolution network has been proposed that included the estimation of optical flow
in the overall network architecture; this work showed that processing whole images,
rather than independent patches, is responsible for a large increase in accuracy. Another
research work showcased a spatio-temporal sub-pixel convolution network that
effectively exploited temporal redundancies and improved reconstruction accuracy while
maintaining real-time speed. All these works use information from the frames
preceding and succeeding the frame being super-resolved; for example, “End-to-end
learning of video super-resolution with motion compensation” uses optical flow across the
frames. In this project we see how photo-realistic a video looks when it has been
super-resolved with an SRGAN that looks at just a single frame at a time.
With respect to single image super-resolution, learning the upscaling filters becomes
pivotal, as it improves accuracy along with speed over models such as the one in
Dong et al., which uses bicubic interpolation to upscale the image before feeding it to
the CNN. We also see that for videos, unlike images, preserving coherent average
brightness across multiple frames is essential.

3. iii. Image De-noising

Owing to the influence of the environment, the transmission channel, and other factors,
images are inevitably contaminated by noise during acquisition, compression, and
transmission, leading to distortion and loss of image information. In the presence of
noise, subsequent image processing tasks, such as video processing, image
analysis, and tracking, are adversely affected. Therefore, image denoising plays an
important role in modern image processing systems [10].

Image denoising aims to remove noise from a noisy image [9], so as to restore the
true image. However, since noise, edges, and texture are all high-frequency components,
it is difficult to distinguish them in the process of denoising, and the denoised images
inevitably lose some details. Overall, recovering meaningful information from noisy
images in the process of noise removal, so as to obtain high-quality images, remains an
important problem.

In fact, image denoising is a classic problem and has been studied for a long
time. However, it remains a challenging and open task.

4. Method

4. i. Image Colorization
● Design
We train a CNN to map from a grayscale input to a distribution over quantized
color value outputs using the architecture shown in the following figure. In the following,
we focus on the design of the objective function, and our technique for inferring point
estimates of color from the predicted color distribution.

Figure: Our network architecture. Each conv layer refers to a block of 2 or 3 repeated conv and
ReLU layers, followed by a BatchNorm layer. The net has no pool layers. All changes in resolution
are achieved through spatial downsampling or upsampling between conv blocks.
● Objective function
Given an input lightness channel X ∈ R^{H×W×1}, our objective is to learn a mapping
Ŷ = F(X) to the two associated color channels Y ∈ R^{H×W×2}, where H, W are the image
dimensions. (We denote predictions with a ^ symbol and ground truth without.) We
perform this task in CIE Lab color space. Because distances in this space model
perceptual distance, a natural objective function, as used in [4,5], is the Euclidean loss
L2(·, ·) between the predicted and ground-truth colors.
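Written out, summing over all pixel locations (h, w), this loss is:

$$L_2(\hat{Y}, Y) = \frac{1}{2} \sum_{h,w} \left\lVert Y_{h,w} - \hat{Y}_{h,w} \right\rVert_2^2$$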

However, this loss is not robust to the inherent ambiguity and multimodal nature
of the colorization problem. If an object can take on a set of distinct ab values, the
optimal solution to the Euclidean loss will be the mean of the set. In color prediction, this
averaging effect favors grayish, desaturated results.

Instead, we treat the problem as multinomial classification. We quantize the ab
output space into bins with grid size 10 and keep the Q = 313 values which are
in-gamut. For a given input X, we learn a mapping Ẑ = G(X) to a probability distribution
over possible colors Z ∈ [0, 1]^{H×W×Q}, where Q is the number of quantized ab values. To
compare the predicted Ẑ against the ground truth, we define a function Z = H⁻¹(Y), which
converts the ground-truth color Y to the vector Z using a soft-encoding scheme. We then
use the multinomial cross-entropy loss Lcl(·, ·).
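With v(·) the per-pixel weighting term defined below, this loss takes the standard weighted cross-entropy form:

$$L_{cl}(\hat{Z}, Z) = -\sum_{h,w} v(Z_{h,w}) \sum_{q} Z_{h,w,q} \log \hat{Z}_{h,w,q}$$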

where v(·) is a weighting term that can be used to rebalance the loss based on
color-class rarity. Finally, we map the predicted probability distribution Ẑ to color values Ŷ
with the function Ŷ = H(Ẑ).

4. ii. Image Super-Resolution

We have designed a framework that takes a low-resolution video as input and
super-resolves it by 4×. To achieve this, we separate the video into frames and super-resolve
each of these frames as if they were single images, as in single image super-resolution. In
single image super-resolution, the aim is to estimate a super-resolved image I^SR from
a low-resolution input image I^LR. Here I^LR is the low-resolution version of its
high-resolution counterpart I^HR; the high-resolution images are only available during
training. For training, I^HR is obtained from the DIV2K dataset [1], and I^LR is
obtained by downsampling the respective I^HR. We then use these pairs to train a generator
network implemented as a feed-forward convolutional neural network.

● Adversarial network architecture
In our project, as in the Ledig et al. [2] paper, there is a discriminator network D_θD
which is optimized in an alternating manner along with the generator network G_θG to
solve the adversarial min-max problem.
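Following Ledig et al. [2], this problem is:

$$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I^{HR} \sim p_{\text{train}}(I^{HR})}\!\left[\log D_{\theta_D}(I^{HR})\right] + \mathbb{E}_{I^{LR} \sim p_{G}(I^{LR})}\!\left[\log\!\left(1 - D_{\theta_D}\!\left(G_{\theta_G}(I^{LR})\right)\right)\right]$$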

The idea behind this is that it allows one to train a generative model G with the goal of
fooling a differentiable discriminator D that is trained to distinguish super-resolved
images from real images, thus encouraging perceptually superior solutions.

● Sub-pixel Convolution Layer

In the generator architecture, all the activation maps that are computed have the
same spatial dimensions as the low-resolution image. There are repeated units of {Conv,
Batch-norm, ReLU and element-wise sums}. The spatial dimensions of the activation
maps are increased by a multiplicative factor using a sub-pixel convolution layer as
proposed by Shi et al. This layer essentially uses regular strided convolutional layers
followed by a specific type of image reshaping called a phase shift. In other words,
instead of putting zeros in between pixels and having to do extra computation, it
calculates more convolutions in low resolution and resizes the resulting maps into an
upscaled image. This way, no meaningless zeros are necessary.
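A minimal TensorFlow sketch of such a sub-pixel (pixel-shuffle) upscaling block; the layer sizes are illustrative assumptions, not the exact generator configuration:

```python
import tensorflow as tf

def subpixel_upscale(x, scale=2, channels=64):
    """Sub-pixel convolution (Shi et al.): compute extra channels with a
    regular conv at low resolution, then rearrange them into space."""
    # Conv produces scale**2 times more channels at the low resolution.
    x = tf.keras.layers.Conv2D(channels * scale ** 2, 3, padding="same")(x)
    # Phase shift: (H, W, C*s*s) -> (H*s, W*s, C); no zero insertion needed.
    return tf.nn.depth_to_space(x, scale)
```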

● Loss Function
The perceptual loss function l^SR is pivotal for the performance of the
generator network. The perceptual loss is the weighted sum of a content loss l_X^SR
and an adversarial loss component l_Gen^SR.
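In Ledig et al. [2] the adversarial component is weighted by a factor of 10⁻³:

$$l^{SR} = l_X^{SR} + 10^{-3}\, l_{Gen}^{SR}$$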

In the following we describe possible choices for the content loss l_X^SR and the
adversarial loss l_Gen^SR.
➔ Content Loss:
The first part of the content loss is the VGG loss, based on the ReLU activation
layers of the pre-trained 19-layer VGG network. However, using this alone as the content
loss will not capture the pixel intensities. This was not an issue in [2], which was only
concerned with photo-realistic super-resolution of a single image, but when it comes to
super-resolving a video the pixel intensities become important as well: in videos, the pixel
intensities change little from one frame to the next, so there is a need to maintain
consistency of pixel intensities. Thus we added an L1 loss to the content loss to enforce
pixel-intensity constancy.
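A sketch of this combined content loss (VGG feature loss plus the added L1 term), assuming a Keras VGG19 feature extractor; the chosen feature layer and the L1 weight are illustrative, and inputs are assumed to be already preprocessed for VGG:

```python
import tensorflow as tf

# Pre-trained VGG19 used purely as a fixed feature extractor.
vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
features = tf.keras.Model(vgg.input, vgg.get_layer("block5_conv4").output)
features.trainable = False

def content_loss(hr, sr, l1_weight=1e-2):
    """VGG (perceptual) loss plus an L1 term on raw pixels; the L1 term
    keeps average brightness consistent from frame to frame."""
    vgg_loss = tf.reduce_mean(tf.square(features(hr) - features(sr)))
    l1_loss = tf.reduce_mean(tf.abs(hr - sr))
    return vgg_loss + l1_weight * l1_loss
```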

➔ Adversarial Loss:
To favor solutions that reside on the manifold of natural images, [2] also adds an
adversarial loss. This generative loss l_Gen^SR is defined based on the probabilities of the
discriminator D_θD(G_θG(I^LR)) over all training samples.
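As in [2], it takes the form:

$$l_{Gen}^{SR} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\left(G_{\theta_G}\!\left(I^{LR}_n\right)\right)$$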

➔ Training Procedure:
It is to be noted that the generator network could learn an up-sampling function by
minimizing just the content loss l_X^SR on pairs of low-resolution and high-resolution
images. The addition of the adversarial loss and the GAN training procedure incentivizes
the generator to produce more realistic high-resolution images with finer details.

4. iii. Image De-noising

● Architecture
The encoder part of the network is a typical convolutional pyramid: each
convolutional layer is followed by a max-pooling layer to reduce the spatial dimensions.
The decoder needs to convert from a narrow representation back to a wide reconstructed
image. For example, the representation could be a 4x4x8 max-pool layer; this is the
output of the encoder, but also the input to the decoder. We want to get a 28x28x1
image out of the decoder, so we need to work our way back up from the narrow decoder
input layer. A schematic of the network is shown below.
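A minimal Keras sketch of such an encoder-decoder, assuming 28x28x1 inputs and a 4x4x8 bottleneck as described above (filter counts are illustrative):

```python
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(28, 28, 1))
# Encoder: conv + max-pool pyramid down to a narrow 4x4x8 representation.
x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
x = layers.MaxPooling2D(2, padding="same")(x)                  # 14x14
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.MaxPooling2D(2, padding="same")(x)                  # 7x7
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
encoded = layers.MaxPooling2D(2, padding="same")(x)            # 4x4x8 bottleneck
# Decoder: work back up from the narrow representation to 28x28x1.
x = layers.Conv2D(8, 3, activation="relu", padding="same")(encoded)
x = layers.UpSampling2D(2)(x)                                  # 8x8
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)                                  # 16x16
x = layers.Conv2D(32, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)                                  # 32x32
x = layers.Cropping2D(2)(x)                                    # back to 28x28
decoded = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = models.Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
```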

How Do Convolutional Autoencoders Work?

The feature extraction above may seem magical. How does it really work? It involves the
following three layers: the convolution layer, the ReLU layer and the pooling layer.

1. The Convolution Layer

The convolution step creates many small pieces called feature maps or
features, like the green, red or navy blue squares in Figure (E). These squares preserve
the relationship between pixels in the input image. Each feature scans through the
original image, as shown in Figure (F). This process of producing the scores is
called filtering.

After scanning through the original image, each feature produces a filtered image
with high scores and low scores, as shown in Figure (G). If there is a perfect match,
there is a high score in that square. If there is a low match or no match, the score is low
or zero. For example, the red square found four areas in the original image that show a
perfect match with the feature, so the scores are high for those four areas.

2. What's going on in the decoder

The decoder uses “upsample” layers that may be unfamiliar, so it is worth first
noting what these layers are not. Usually, transposed convolution layers are used to
increase the width and height of the layers. They work almost exactly like convolutional
layers, but in reverse: a stride in the input layer results in a larger stride in the transposed
convolution layer. For example, with a 3x3 kernel, a 3x3 patch in the input layer is
reduced to one unit in a convolutional layer; comparatively, one unit in the input layer is
expanded to a 3x3 patch in a transposed convolution layer. The TensorFlow API provides
an easy way to create such layers, tf.nn.conv2d_transpose.
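For illustration, a toy call to this API (shapes chosen arbitrarily): a stride-2 transposed convolution expands a 4x4 single-channel map to 8x8.

```python
import tensorflow as tf

x = tf.random.normal([1, 4, 4, 1])            # batch of one 4x4 feature map
kernel = tf.random.normal([3, 3, 1, 1])       # (h, w, out_channels, in_channels)
y = tf.nn.conv2d_transpose(x, kernel, output_shape=[1, 8, 8, 1],
                           strides=[1, 2, 2, 1], padding="SAME")
print(y.shape)                                # (1, 8, 8, 1)
```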

5. Results

5. i. Image Colorization
Here we assess the graphics aspect of our algorithm, evaluating the perceptual
realism of our colorizations along with other measures of accuracy. We compare our
full algorithm to several variants. Finally, we show qualitative examples on legacy black-
and-white images.
We train our network on the 1.3M images from the ImageNet training set [8],
validate on the first 10k images in the ImageNet validation set, and test on a separate
10k images from the validation set, the same as in [7].

Figure: Applying our method to legacy black and white images. 

Figure: Example results from our ImageNet test set. Our classification loss with rebalancing
produces more accurate and vibrant results than a regression loss or a classification loss without
rebalancing.

5. ii. Image Super-Resolution

● Testing on MNIST Data

To test the model, the first step was to test it on a domain-specific dataset. We
used the MNIST dataset, which consists of grayscale images of handwritten numerals.
The figure shows the results on images that were not used during the training process.
The SRGAN super-resolved images (scaling factor of 2) show significantly better
perceptual quality than the ones upscaled using bi-cubic interpolation.

● Testing on the DIV2K dataset

The next step was to train the model on a dataset with more variation. We made
use of the DIVerse 2K [1] dataset. This dataset consists of 800 high-resolution images
(not specific to any class); for the corresponding low-resolution images, we
downsampled the HR images by a factor of 4 using bi-cubic interpolation.
The figure below shows the SRGAN results on a test image where the input LR image was
upscaled by a factor of 4. We see that the fine details have been reconstructed.

● Testing on Videos:
We applied the SRGAN model to several videos, and what was observed initially
was that even though the model was able to recover details in the upscaled frames,
there was a significant change in the overall brightness of the output frames.

This rapid change in brightness between frames led to a flickering effect in
the super-resolved video. The figures above attempt to depict this phenomenon, but it is
much more prominent when one views the videos. Looking at the “VGG” row in the
figures, we can see that between successive frames there is a significant change in
brightness. This, however, was minimized by adding an L1 loss to the content loss
and retraining the SRGAN on the DIV2K dataset.

Figure: 1st row: low-resolution frame; 2nd row: frame super-resolved with SRGAN and L1 loss;
3rd row: frame super-resolved with SRGAN.

5. iii. Image De-noising

● Testing on MNIST Data

We use MNIST, a well-known database of handwritten digits. Keras has an
MNIST dataset utility that lets us download the data directly.
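A sketch using the standard Keras loader (the scaling to [0, 1] is our own preprocessing choice):

```python
from tensorflow.keras.datasets import mnist

# Labels are not needed for denoising, so they are discarded.
(x_train, _), (x_test, _) = mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0
```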

The shape of each image is 28x28 and there is no color information.

● Testing on images

We add noise to the test images and pass them through the autoencoder. It
does a surprisingly good job of removing the noise, even though it is sometimes
difficult to tell what the original number is.
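A sketch of this step, assuming the normalized arrays and the `autoencoder` model from the earlier sketches (the noise factor is an arbitrary choice):

```python
import numpy as np

noise_factor = 0.5
x_test_noisy = x_test + noise_factor * np.random.normal(size=x_test.shape)
x_test_noisy = np.clip(x_test_noisy, 0.0, 1.0)

# Channel axis added to match the (28, 28, 1) input the network expects.
denoised = autoencoder.predict(x_test_noisy[..., np.newaxis])
```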

● Testing on Videos:

In this case we simply took old movie footage.


Original

Output

6. References
[1] E. Agustsson and R. Timofte. NTIRE 2017 challenge on single image super-resolution:
Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, July 2017.

[2] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani,
J. Totz, Z. Wang, et al. Photo-realistic single image super-resolution using a generative
adversarial network. arXiv preprint arXiv:1609.04802, 2016.

[3] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-time video
super-resolution with spatio-temporal networks and motion compensation. arXiv
preprint arXiv:1611.05250, 2016.

[4] Z. Cheng, Q. Yang, and B. Sheng. Deep colorization. In Proceedings of the IEEE International
Conference on Computer Vision, 2015, 415–423.

[5] R. Dahl. Automatic colorization. http://tinyclouds.org/colorize/, 2016.

[6] G. Charpiat, M. Hofmann, and B. Schölkopf. Automatic image colorization via multimodal
predictions. In Computer Vision–ECCV 2008, Springer, 2008, 126–139.

[7] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic
colorization. European Conference on Computer Vision, 2016.

[8] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy,
A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. International
Journal of Computer Vision 115(3), 2015, 211–252.

[9] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error
propagation. In Parallel Distributed Processing, Vol. 1: Foundations. MIT Press, Cambridge, MA,
1986.

[10] P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from
examples without local minima. Neural Networks, 2(1):53–58, 1988.

