APM 598 Final Project Report
Abstract
1 Introduction
Generative models have recently made news for applications such as creating fake images and
swapping faces in celebrity photos, but these applications pose a serious social challenge:
discriminating between real and fake images. Generative models are built with unsupervised
learning (analyzing structure in unlabeled data); once the structure is learned, new data that
did not previously exist can be generated. The two most popular kinds of generative models are
the Generative Adversarial Network (GAN) and the Variational Autoencoder (VAE). This project
focuses on explaining three key applications of generative modeling for images:
1. Image Restoration: restoring old, degraded photos using variational autoencoders.
2. Image Upscaling: enlarging and enhancing a small image using an enhanced super-resolution GAN.
3. Neural Style Transfer: generating a digital image that adopts the style of a different image.
2 Implementation
We implemented the project to test state-of-the-art generative networks for image restoration
and image upscaling via super-resolution, and then used style transfer to give the images an
artistic touch. Further details for each of these methods are provided below.
2.1 Image Restoration
Real old photos suffer from unknown degradation (any combination of structured and
unstructured degradation), which is difficult to characterize accurately; this makes it hard to
build a degradation model that realistically renders old-photo artifacts. Hence, there is a need
to construct a degradation model that includes both real degraded images and synthetically
generated data. Previous deep-learning methods used supervised learning and did not give good
results on real old photos, because the degradation model consisted of synthetically degraded
photos that were nowhere near similar to real old photos. As a result, a domain gap is created
between real old photos and the photos synthesized for training the model. To reduce this domain
gap, triplet domain translation [8] is used to bridge the domains of real old photos (R),
synthetic photos (X) constructed for training, and the ground-truth domain (Y) consisting of
images without degradation. Images r ∈ R, x ∈ X, and y ∈ Y are mapped to the corresponding
latent spaces through
E_R : R → Z_R,   E_X : X → Z_X,   E_Y : Y → Z_Y
The latent spaces of the synthetic photos and the real old photos are aligned in a shared
domain such that Z_R and Z_X are close to each other (Z_R ≈ Z_X), using variational
autoencoders (VAEs). This shared latent space is used for performing image restoration.
After latent-space translation, a real old photo r can be restored by sequentially
performing the mappings

r_{R→Y} = G_Y ∘ T_Z ∘ E_R(r)
Autoencoders take high-dimensional input data and compress it by passing it through an encoder
to create a smaller representation (of lower dimension than the input), known as the
(bottleneck) latent-space representation, which is then given as input to a decoder that
reconstructs the high-dimensional data. The reconstructed data and the input data are compared
to obtain an error function, and an iterative optimization method is used to train the neural
network to reproduce its inputs from the compressed code. However, a plain autoencoder can
produce unrealistic images when variation is expected, because of discontinuities in its
latent-space representation; this is what motivates the variational formulation used here.
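As a toy illustration of this encode-compress-decode idea, a minimal convolutional autoencoder
in Keras could look like the sketch below. This is a generic sketch, not the network used in [8];
the layer sizes are arbitrary choices for illustration.

import tensorflow as tf
from tensorflow.keras import layers

# Encoder: compress a 64x64 RGB image down to a small bottleneck feature map.
inputs = layers.Input(shape=(64, 64, 3))
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inputs)    # 32x32
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)         # 16x16
z = layers.Conv2D(8, 3, padding="same", activation="relu", name="bottleneck")(x)  # latent representation

# Decoder: reconstruct the image from the bottleneck.
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(z)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
outputs = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)

autoencoder = tf.keras.Model(inputs, outputs)
# The reconstruction error between output and input drives training.
autoencoder.compile(optimizer="adam", loss="mse")
# autoencoder.fit(images, images, epochs=10)   # the inputs serve as their own targets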
The given figure describes the architecture of the proposed network [8]. Here, two VAEs are
used. One VAE handles the old photos R and the synthetic photos X, with a shared encoder
E_{R,X} and generator G_{R,X}, so that both kinds of degraded images are mapped into the same
shared latent space. The other VAE is used for the output domain Y, with encoder E_Y and
generator G_Y. VAEs are used because they learn the mapping for old and synthetic photos and
generalize well to real photos by reducing the domain gap. Afterwards, image restoration is
performed for the synthetic pair {X, Y} using the mapping T, which includes ResBlocks and
partial nonlocal blocks. The nonlocal blocks deal with structured degradation (patches, holes,
scratches) and the ResBlocks deal with unstructured degradation (blur, color fading, noise,
low resolution). The combination of these blocks enhances the capability of the latent-space
mapping. Given that Z_R ≈ Z_X, the generator G_Y then produces a clean image without
degradation.
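Conceptually, once the encoders, the latent mapping and the clean-domain generator are trained,
restoring a photo is just the composition r_{R→Y} = G_Y ∘ T_Z ∘ E_R(r). The sketch below only
illustrates this composition; encoder_R, mapping_T and generator_Y are placeholder callables,
not the actual modules of [8].

def restore_old_photo(r, encoder_R, mapping_T, generator_Y):
    """Restore a real old photo r via latent-space translation."""
    z_r = encoder_R(r)       # E_R(r): map the degraded photo into the shared latent space Z_R ≈ Z_X
    z_y = mapping_T(z_r)     # T(.): ResBlocks + partial nonlocal blocks translate it to the clean latent space
    return generator_Y(z_y)  # G_Y(.): decode a clean, degradation-free image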
Old photos typically capture special moments, which often include the faces of loved ones.
When generating restored images, unwanted textures are sometimes observed on the generated
faces. Therefore, a face refinement network is included to recover the fine facial details
present in the old photos from the latent space z. As a result, the perceptual quality of the
faces is greatly enhanced.
2.2 Image Upscaling (Super-Resolution)
In many CSI-style movies, there is a scene where someone finds a small, obscured image and
gets a clear picture out of it by "zooming and enhancing". Is this really possible? Mostly no;
those movies are nowhere near technically accurate. But to some extent, yes: it is indeed
possible to enlarge and enhance images. The process of upscaling and enhancing an image is
called super-resolution.
In information theory, there is a concept called the data processing inequality. It states that
no matter how we process data, we cannot add information that is not already there. This
implies that missing data cannot be recovered by further processing. Does that mean
super-resolution is theoretically impossible? Not if we have an additional source of
information.
A neural network can learn to hallucinate details based on prior information it collects from
a large set of images. The details added to an image this way still do not violate the data
processing inequality, because the information is there, somewhere in the training set, even
if it is not in the input image. First, we can create a dataset by collecting high-resolution
images and downscaling them, or we can simply use one of the existing super-resolution
datasets, such as DIV2K. Then, we can build a convolutional neural network that takes only the
low-resolution images as input and train it to produce higher-resolution images that best
match the originals. As shown in figure 2, the SRCNN [1] paper simply minimized the squared
difference between pixel values to produce images that are as close as possible to the
original high-resolution images.
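A stripped-down sketch of this pixel-wise approach, roughly in the spirit of SRCNN [1] (the
exact filter sizes and channel counts here are illustrative), is:

import tensorflow as tf
from tensorflow.keras import layers

# SRCNN-style sketch: the low-resolution image is first upscaled to the target size
# (e.g. bicubically), then a few convolutional layers learn to sharpen it.
model = tf.keras.Sequential([
    layers.Conv2D(64, 9, padding="same", activation="relu",
                  input_shape=(None, None, 3)),               # patch extraction
    layers.Conv2D(32, 1, padding="same", activation="relu"),  # non-linear mapping
    layers.Conv2D(3, 5, padding="same"),                      # reconstruction
])
# Minimize the squared difference between predicted and true high-resolution pixels.
model.compile(optimizer="adam", loss="mse")
# model.fit(bicubic_lr_images, hr_images, ...)  # pairs built e.g. from DIV2K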
Before discussing super-resolution with GANs, it is useful to know how Generative Adversarial
Networks work in general.

In GANs [2], we have two neural networks: a generator and a discriminator. The generator tries
to generate a high-resolution image, and the discriminator tries to determine whether or not
that image is real. Imagine a counterfeiter who wants to create an image that looks identical
to a real one; it is obviously fake, but he takes it to a pawn shop to try to get some money
for it. The store owner then critiques the artwork to decide whether or not it is real. This is
exactly how GANs work: the counterfeiter is the generator and the critic is the discriminator.
We feed low-resolution images to the generator, it creates a high-resolution image (the
"artwork"), and the discriminator tries to tell whether that image is fake or real. As can be
seen in figure 1, there are two models (both neural networks). The generator receives the input
Z (which can be a low-resolution image) and outputs X̂. X̂ is fed to the discriminator network,
which compares X̂ with the real high-resolution image X and judges X̂ to be real or fake.

The loss is constructed so that the generator is incentivized to generate X̂ as close as
possible to X, while the discriminator is incentivized to differentiate between X̂ and X. In
theory, the generator eventually becomes so good that X̂ is indistinguishable from X and the
discriminator says X̂ is real every time.
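In code, this two-player game boils down to two loss terms. The sketch below uses the binary
cross-entropy formulation; generator and discriminator stand for any pair of Keras models and
are not the specific SRGAN [3] networks.

import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits):
    # The critic is rewarded for labelling X as real (1) and X̂ as fake (0).
    return (bce(tf.ones_like(real_logits), real_logits)
            + bce(tf.zeros_like(fake_logits), fake_logits))

def generator_loss(fake_logits):
    # The counterfeiter is rewarded when the critic is fooled into saying "real".
    return bce(tf.ones_like(fake_logits), fake_logits)

# One training step (schematically): Z is the low-resolution input, X the real high-res image.
#   X_hat = generator(Z)
#   d_loss = discriminator_loss(discriminator(X), discriminator(X_hat))
#   g_loss = generator_loss(discriminator(X_hat))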
There is another paper called Enhanced SRGAN [4], which proposes a few tricks to improve the
results further. Enhanced SRGAN, or ESRGAN for short, became popular in the gaming community:
it was used for upscaling vintage games and worked quite well. It is surprising how well it
worked on video-game graphics despite being trained only on natural images.

One of the enhancements was the removal of batch-normalization layers from the network
architecture. Batch normalization helps a lot for many computer-vision tasks, but for
image-processing tasks such as super-resolution or image restoration in general, it can create
artifacts. The researchers also added more layers and connections to the model architecture.
It is not surprising that a more sophisticated model produces better images, but deeper models
can be trickier to train, especially when they do not use batch-normalization layers. So, the
authors of ESRGAN used tricks such as residual scaling to stabilize the training of such a
network. In addition to the changes in the model architecture, they also modified the loss
functions. We have used ESRGAN for our implementation of super-resolution.
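Residual scaling simply multiplies the output of a residual branch by a small constant β (0.2
in ESRGAN) before adding it back to the identity path, which helps keep a deep, batch-norm-free
network stable. A simplified residual block with residual scaling (not the full
Residual-in-Residual Dense Block of ESRGAN) looks like:

import tensorflow as tf
from tensorflow.keras import layers

class ScaledResidualBlock(tf.keras.layers.Layer):
    """Residual block without batch normalization, using residual scaling."""
    def __init__(self, filters=64, beta=0.2):
        super().__init__()
        self.conv1 = layers.Conv2D(filters, 3, padding="same")
        self.conv2 = layers.Conv2D(filters, 3, padding="same")
        self.act = layers.LeakyReLU(0.2)
        self.beta = beta  # residual scaling factor

    def call(self, x):
        r = self.conv2(self.act(self.conv1(x)))
        return x + self.beta * r  # scale the residual before adding it back

# Example: y = ScaledResidualBlock()(tf.random.normal([1, 32, 32, 64]))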
2.3 Neural Style Transfer
In this section, we take an artistic image (the style), such as a Van Gogh painting or a
psychedelic image, and capture its features. The style is then applied to an ordinary
photograph (the content), and we can visualize the artistic result. The motivation for
obtaining such a style-transferred image is to imagine how a person would be painted by Van
Gogh, or simply artistic curiosity.

The model available on TF-Hub was built by the team at Google Brain [9]. It was trained on the
ImageNet dataset [10] for content images and on the Kaggle Painter by Numbers dataset [11],
along with the Describable Textures Dataset [12], for style images. The model consists of two
networks, one for style prediction and one for style transfer. The style prediction network is
loosely based on the Inception-v3 architecture [13]; it predicts an embedding vector S, which
is the input to the style transfer network along with the content image. The style transfer
network largely follows [14].
The objective of the style transfer model is to minimize

ℒ_c(x, c) + λ_s ℒ_s(x, s)

where ℒ_c is the content loss, ℒ_s is the style loss, and λ_s is a Lagrange multiplier that
weights the relative strength of the style loss. The content and style losses are defined as
ℒ_c = Σ_{j ∈ C} (1/n_j) ‖f_j(x) − f_j(c)‖₂²

ℒ_s = Σ_{i ∈ S} (1/n_i) ‖G[f_i(x)] − G[f_i(s)]‖_F²
where f_l(x) denotes the network activations in the l-th layer, n_l is the number of units in
the l-th layer, and G[f_l(x)] is the square, symmetric Gram matrix that measures the spatially
averaged correlation structure across the filters of the l-th layer activations.
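For concreteness, a direct single-layer rendering of these formulas is sketched below; f_x, f_c
and f_s stand for VGG activation tensors of shape (1, H, W, C) taken from one content layer and
one style layer.

import tensorflow as tf

def gram(f):
    # Spatially averaged correlation between filters: G = Fᵀ F / (H·W)
    _, h, w, c = f.shape
    F = tf.reshape(f, (h * w, c))
    return tf.matmul(F, F, transpose_a=True) / tf.cast(h * w, tf.float32)

def content_loss(f_x, f_c):
    n = tf.cast(tf.size(f_c), tf.float32)           # number of units in the layer
    return tf.reduce_sum(tf.square(f_x - f_c)) / n  # (1/n_j) ‖f_j(x) − f_j(c)‖₂²

def style_loss(f_x, f_s):
    n = tf.cast(tf.size(f_s), tf.float32)
    return tf.reduce_sum(tf.square(gram(f_x) - gram(f_s))) / n  # Gram-matrix (Frobenius) difference

# Total objective for one layer each: content_loss(f_x, f_c) + lambda_s * style_loss(f_x, f_s)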
Here, we implemented style transfer using the pretrained VGG-19 [15] network. First, we load
the content image and test the VGG-19 network to check whether the image-classification model
predicts a reasonable label. We then load the VGG-19 network without the classification head,
take its intermediate layers, and use them to represent the content and style images; this is
analogous to the latent-space representation of generative networks. We can do this because,
at some point before the classification label is predicted, the model acts as a feature
extractor. Using the intermediate layers, we describe the content and style of the input
images: the content of an image is given by the intermediate feature-map values, and the style
of the image by the means and correlations across the various feature maps. After building the
content and style tensor extractor, we run gradient descent with the Adam optimizer, setting
the style and content weights. We also regularize the high-frequency components of the image
using the total variation loss, which is computed from image gradients and therefore acts much
like an edge detector. We use TensorFlow's built-in function for the total variation loss.
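As a small illustration of how the pieces are combined, the regularized objective can be
assembled with TensorFlow's built-in total variation operation as sketched below;
total_objective is an illustrative helper (the weight values are the ones used in our notebook,
see the appendix).

import tensorflow as tf

style_weight, content_weight, tv_weight = 1e-2, 1e4, 30

def total_objective(image, style_loss, content_loss):
    # tf.image.total_variation sums the absolute differences between neighbouring
    # pixels, penalizing high-frequency artifacts in the image being optimized.
    return (content_weight * content_loss
            + style_weight * style_loss
            + tv_weight * tf.image.total_variation(image))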
3 Results
In this project, we give an old degraded photo as the initial input; the system removes the
unstructured degradation and returns a clean restored image as output. In order to restore
images with structured degradation, we need to tell the system that the image contains
scratches, so that it deals with both unstructured and structured degradation and produces a
clean output image.
We took the restored image and passed it through the ESR-GAN generator. The results
were as follows -
Figure 9 : Low Resolution Image input to ESRGAN
Figure 11 : Style transfer on content image with TF-Hub and our model
4 Experiments
We conducted some experiments to test our models and find some empirical properties of the
generative models that we used for image restoration, image upscaling and style transfer.
Apart from the old images available in Microsoft's dataset, we tested the model on our own old
printed images to observe its efficiency. As observed in figure 14, the faces present in the
images are enhanced and the color is corrected; however, scratches and unwanted patches are not
completely removed (see the top-right corner of the first restored image and the left-center of
the second restored image). Datasets that include degraded images with such types of patches
could help improve the model.
We conducted an experiment to see how the upscaled image compares to the original
image. We fed the image as given in figure 15 as the input to the network. The upscaled
image obtained is shown in figure 16.
Figure 15 : Input image given to the model
Figure 16 : Upscaled output image
If we take a close look at the face in figure 17, we make some interesting observations.
ESRGAN applies some smoothing and paints in some of the details. We can clearly see that the
teeth are not visible in the input image; the generative model has painted in these details
(in this case, teeth), which is undesirable. Taking a closer look at the right ear, we can also
see that the model has failed to draw the ear properly. This gives us some insight into the
validity of our model and into what kind of training data should be used for further training.
The ESRGAN model we used was trained for anime upscaling; training it further on a real-life
dataset rather than an anime dataset could help alleviate these problems.
We conducted experiments to see how different content images are stylized using just one style
image. The content images below are taken from self-photographs and screenshots [18] [19],
while the style is the psychedelic image shown in figure 18. The style and content losses for
each are also plotted.

As we can see from using the same style on different content images, the style features toward
the outer regions (bubbles) are applied to the outer parts of each content image in different
areas. The style transfer is not uniform across content images, and it is not a simple
superimposition of the content and style images: the style features are extracted and
transferred in varying degrees and orientations according to the content-image features.
5 Discussion and Conclusion
From our study, it can be concluded that the use of VAEs has helped reduce the domain gap and
generate realistic, clean images from old degraded ones. The model works efficiently for
restoring old degraded images with unstructured degradations; however, it is less effective for
some types of structured degradation (patches). We used the Google Colab platform, which runs
entirely in the cloud and provides access to a GPU. We observed that CUDA runs out of memory
when very large images are given as input to the model. Hence, the model could be improved to
handle large input images and to function well on local machines.

After the experiments, it can be concluded that although the results of ESRGAN are quite good,
there is still some room for improvement, especially with regard to the datasets on which the
model is trained. Also, since training GANs is notoriously difficult, the issues noted in the
experiments could also arise from overfitting. To conclude, the results are already quite good
and will only get better with more research.
6 References
[1] Image Super-Resolution Using Deep Convolutional Networks,
https://arxiv.org/abs/1501.00092
[2] Generative Adversarial Networks, https://arxiv.org/abs/1406.2661
[3] Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,
https://arxiv.org/abs/1609.04802
[4] ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks,
https://arxiv.org/abs/1809.00219
[5] Digitalcameraworld.com, “Its official looking at old photos is more relaxing than meditating”,
https://www.digitalcameraworld.com/news/its-official-looking-at-old-photos-is-more-relaxing-tha
n-meditating
[6] Wan, Ziyu & Zhang, Bo & Chen, Dongdong & Zhang, Pan & Chen, Dong & Liao, Jing &
Wen, Fang. (2020). Bringing Old Photos Back to Life. 2744-2754.
10.1109/CVPR42600.2020.00282.
[7] towardsdatascience.com, “Intuitively understanding variational autoencoders ”
https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
[8] Z. Wan et al., "Old Photo Restoration via Deep Latent Space Translation," in IEEE
Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2022.3163183.
[9] Golnaz Ghiasi, Honglak Lee, Manjunath Kudlur, Vincent Dumoulin, Jonathon Shlens.
Exploring the structure of a real-time, arbitrary neural artistic stylization network. Proceedings of
the British Machine Vision Conference (BMVC), 2017.
[10] ImageNet dataset, https://www.image-net.org/
[11] Kaggle Painter by Numbers https://www.kaggle.com/competitions/painter-by-numbers/data
[12] Describable Textures Dataset https://www.robots.ox.ac.uk/~vgg/data/dtd/
[13] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception
architecture for computer vision. IEEE Computer Vision and Pattern Recognition (CVPR), 2015.
[14] V. Dumoulin, J. Shlens, and M. Kudlur. A learned representation for artistic style.
International Conference of Learned Representations (ICLR), 2016.
[15] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale
image recognition." arXiv preprint arXiv:1409.1556 (2014).
[16] https://www.freeimages.com/
[17] C. Irving. Fake: the story of Elmyr de Hory: the greatest art forger of our time. McGraw-Hill,
1969.
[18] Eichiro Oda, TOEI, FUNimation Entertainment (Firm). (2022). One Piece: Episode 1015.
[19] Hajime Isayama, Wit Studios (Firm). (2013). Shingeki no Kyojin: Episode 12.
Appendix
Submitted to : Sebastien Motsch, Submitted by : Akshat Sharma ([email protected]) ,
Aradhita Sharma ([email protected]), Apoorva Uplap ([email protected])
#◢ Microsoft's Image Restoration
!git clone https://github.com/microsoft/Bringing-Old-Photos-Back-to-Life.git photo_restoration
%cd Global/detection_models
!git clone https://github.com/vacancy/Synchronized-BatchNorm-PyTorch
!cp -rf Synchronized-BatchNorm-PyTorch/sync_batchnorm .
%cd ../../
%cd Global/
!wget https://facevc.blob.core.windows.net/zhanbo/old_photo/pretrain/Global/checkpoints.zip
!unzip checkpoints.zip
%cd ../
/content/photo_restoration/Face_Enhancement/models/networks
Cloning into 'Synchronized-BatchNorm-PyTorch'...
remote: Enumerating objects: 188, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 188 (delta 10), reused 27 (delta 10), pack-reused 161
Resolving dlib.net (dlib.net)... 107.180.26.78
Connecting to dlib.net (dlib.net)|107.180.26.78|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 64040097 (61M)
Saving to: ‘shape_predictor_68_face_landmarks.dat.bz2’
/content/photo_restoration
/content/photo_restoration/Face_Enhancement
--2022-04-29 18:25:41--
https://facevc.blob.core.windows.net/zhanbo/old_photo/pretrain/Face_En
hancement/checkpoints.zip
Resolving facevc.blob.core.windows.net
(facevc.blob.core.windows.net)... 20.150.78.196
Connecting to facevc.blob.core.windows.net
(facevc.blob.core.windows.net)|20.150.78.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 684354563 (653M) [application/x-zip-compressed]
Saving to: ‘checkpoints.zip’
Archive: checkpoints.zip
creating: checkpoints/
creating: checkpoints/Setting_9_epoch_100/
inflating: checkpoints/Setting_9_epoch_100/latest_net_G.pth
creating: checkpoints/FaceSR_512/
inflating: checkpoints/FaceSR_512/latest_net_G.pth
/content/photo_restoration
/content/photo_restoration/Global
--2022-04-29 18:26:17--
https://facevc.blob.core.windows.net/zhanbo/old_photo/pretrain/Global/
checkpoints.zip
Resolving facevc.blob.core.windows.net
(facevc.blob.core.windows.net)... 20.150.78.196
Connecting to facevc.blob.core.windows.net
(facevc.blob.core.windows.net)|20.150.78.196|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2036400762 (1.9G) [application/x-zip-compressed]
Saving to: ‘checkpoints.zip’
Archive: checkpoints.zip
creating: checkpoints/
creating: checkpoints/restoration/
creating: checkpoints/restoration/VAE_B_scratch/
inflating: checkpoints/restoration/VAE_B_scratch/latest_net_G.pth
inflating:
checkpoints/restoration/VAE_B_scratch/latest_optimizer_G.pth
inflating:
checkpoints/restoration/VAE_B_scratch/latest_optimizer_D.pth
inflating: checkpoints/restoration/VAE_B_scratch/latest_net_D.pth
creating: checkpoints/restoration/VAE_A_quality/
inflating: checkpoints/restoration/VAE_A_quality/latest_net_G.pth
inflating:
checkpoints/restoration/VAE_A_quality/latest_net_featD.pth
inflating:
checkpoints/restoration/VAE_A_quality/latest_optimizer_G.pth
inflating:
checkpoints/restoration/VAE_A_quality/latest_optimizer_D.pth
inflating:
checkpoints/restoration/VAE_A_quality/latest_optimizer_featD.pth
inflating: checkpoints/restoration/VAE_A_quality/latest_net_D.pth
creating: checkpoints/restoration/mapping_Patch_Attention/
inflating:
checkpoints/restoration/mapping_Patch_Attention/latest_net_mapping_net
.pth
inflating:
checkpoints/restoration/mapping_Patch_Attention/latest_net_D.pth
creating: checkpoints/restoration/mapping_quality/
inflating:
checkpoints/restoration/mapping_quality/latest_net_mapping_net.pth
inflating:
checkpoints/restoration/mapping_quality/latest_optimizer_mapping_net.p
th
inflating:
checkpoints/restoration/mapping_quality/latest_optimizer_D.pth
inflating: checkpoints/restoration/mapping_quality/latest_net_D.pth
creating: checkpoints/restoration/mapping_scratch/
inflating:
checkpoints/restoration/mapping_scratch/latest_net_mapping_net.pth
inflating:
checkpoints/restoration/mapping_scratch/latest_optimizer_mapping_net.p
th
inflating:
checkpoints/restoration/mapping_scratch/latest_optimizer_D.pth
inflating: checkpoints/restoration/mapping_scratch/latest_net_D.pth
creating: checkpoints/restoration/VAE_B_quality/
inflating: checkpoints/restoration/VAE_B_quality/latest_net_G.pth
inflating:
checkpoints/restoration/VAE_B_quality/latest_optimizer_G.pth
inflating:
checkpoints/restoration/VAE_B_quality/latest_optimizer_D.pth
inflating: checkpoints/restoration/VAE_B_quality/latest_net_D.pth
creating: checkpoints/detection/
inflating: checkpoints/detection/FT_Epoch_latest.pt
/content/photo_restoration
import os

basepath = os.getcwd()
input_folder = "test_images/old"   # folder containing the old photos (no scratches)
output_folder = "output"
input_path = os.path.join(basepath, input_folder)
output_path = os.path.join(basepath, output_folder)
os.makedirs(output_path, exist_ok=True)
!rm -rf /content/photo_restoration/output/*
!python run.py --input_folder /content/photo_restoration/test_images/old --output_folder /content/photo_restoration/output/ --GPU 0
/content/photo_restoration
Running Stage 1: Overall restoration
Mapping: You are using the mapping model without global restoration.
Now you are processing a.png
Now you are processing b.png
Now you are processing c.png
Now you are processing d.png
Now you are processing e.png
Now you are processing f.png
Now you are processing g.png
Now you are processing h.png
Finish Stage 1 ...
import io
import IPython.display
import numpy as np
import PIL.Image
def make_grid(I1, I2, resize=True):
    # Place the original image (I1) and the restored image (I2) side by side for display.
    I1 = np.asarray(I1)
    H, W = I1.shape[0], I1.shape[1]
    if I1.ndim >= 3:
        I2 = np.asarray(I2.resize((W, H)))
        I_combine = np.zeros((H, W*2, 3))
        I_combine[:, :W, :] = I1[:, :, :3]
        I_combine[:, W:, :] = I2[:, :, :3]
    else:
        I2 = np.asarray(I2.resize((W, H)).convert('L'))
        I_combine = np.zeros((H, W*2))
        I_combine[:, :W] = I1[:, :]
        I_combine[:, W:] = I2[:, :]
    I_combine = PIL.Image.fromarray(np.uint8(I_combine))
    W_base = 600
    if resize:
        ratio = W_base / (W*2)
        H_new = int(H * ratio)
        I_combine = I_combine.resize((W_base, H_new), PIL.Image.LANCZOS)
    return I_combine
filenames = sorted(os.listdir(input_path))
for filename in filenames:
    print(filename)
    image_original = PIL.Image.open(os.path.join(input_path, filename))
    image_restore = PIL.Image.open(os.path.join(output_path, 'final_output', filename))
    display(make_grid(image_original, image_restore))
a.png
b.png
c.png
d.png
e.png
f.png
g.png
h.png
%cd /content/photo_restoration/
import os
import io
import IPython.display
import numpy as np
import PIL.Image
basepath = os.getcwd()
input_folder = "test_images/old_w_scratch"
output_folder = "output"
input_path = os.path.join(basepath, input_folder)
output_path = os.path.join(basepath, output_folder)
filenames = sorted(os.listdir(input_path))
for filename in filenames:
    print("file", filename)
    image_original = PIL.Image.open(os.path.join(input_path, filename))
    image_restore = PIL.Image.open(os.path.join(output_path, 'final_output', filename))
    display(make_grid(image_original, image_restore))
/content/photo_restoration
file a.png
file ab.png
file b.png
file c.png
file d.png
#◢ SUPER-RESOLUTION (ESRGAN)
import os
import time
from PIL import Image
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import matplotlib.pyplot as plt
os.environ["TFHUB_DOWNLOAD_PROGRESS"] = "True"
# Declaring Constants
#IMAGE_PATH = "original.png"
IMAGE_PATH = "/content/photo_restoration/output/final_output/b.png"
SAVED_MODEL_PATH = "https://tfhub.dev/captain-pool/esrgan-tf2/1"
def preprocess_image(image_path):
    """ Loads image from path and preprocesses to make it model ready
        Args:
          image_path: Path to the image file
    """
    hr_image = tf.image.decode_image(tf.io.read_file(image_path))
    # If PNG, remove the alpha channel. The model only supports
    # images with 3 color channels.
    if hr_image.shape[-1] == 4:
        hr_image = hr_image[..., :-1]
    hr_size = (tf.convert_to_tensor(hr_image.shape[:-1]) // 4) * 4
    hr_image = tf.image.crop_to_bounding_box(hr_image, 0, 0, hr_size[0], hr_size[1])
    hr_image = tf.cast(hr_image, tf.float32)
    return tf.expand_dims(hr_image, 0)
%matplotlib inline
def plot_image(image, title=""):
    """
    Plots images from image tensors.
    Args:
      image: 3D image tensor. [height, width, channels].
      title: Title to display in the plot.
    """
    image = np.asarray(image)
    image = tf.clip_by_value(image, 0, 255)
    image = Image.fromarray(tf.cast(image, tf.uint8).numpy())
    plt.imshow(image)
    plt.axis("off")
    plt.title(title)
# Load the pre-trained ESRGAN generator from TF-Hub and upscale the restored image
model = hub.load(SAVED_MODEL_PATH)
hr_image = preprocess_image(IMAGE_PATH)
start = time.time()
fake_image = model(hr_image)
fake_image = tf.squeeze(fake_image)
print("Time Taken: %f" % (time.time() - start))
# Save the upscaled result so it can be displayed alongside the inputs below
sr_image = Image.fromarray(tf.cast(tf.clip_by_value(fake_image, 0, 255), tf.uint8).numpy())
sr_image.save("/content/photo_restoration/Super Resolution.jpg")
from google.colab.patches import cv2_imshow
import cv2

# Original degraded photo, restored photo, and ESRGAN-upscaled result for b.png
img = cv2.imread("/content/photo_restoration/test_images/old_w_scratch/b.png")
cv2_imshow(img)
img = cv2.imread("/content/photo_restoration/output/final_output/b.png")
cv2_imshow(img)
img = cv2.imread("/content/photo_restoration/Super Resolution.jpg")
cv2_imshow(img)

# Same comparison for ab.png
img = cv2.imread("/content/photo_restoration/test_images/old_w_scratch/ab.png")
cv2_imshow(img)
img = cv2.imread("/content/photo_restoration/output/final_output/ab.png")
cv2_imshow(img)
img = cv2.imread("/content/photo_restoration/Super Resolution.jpg")
cv2_imshow(img)
Neural Style Transfer
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
import os
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np
import warnings
import random
from tensorflow import keras
from keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
import numpy as np # for math and arrays
import pandas as pd
import PIL.Image
import time
import functools
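The notebook cells defining the image-loading and display helpers used below (load_img and
tensor_to_image) are not included in this extract; minimal versions, following the standard
TensorFlow style-transfer tutorial that this notebook is based on, would be:

def tensor_to_image(tensor):
    # Convert a float tensor in [0, 1] (possibly batched) back to a PIL image.
    tensor = np.array(tensor * 255, dtype=np.uint8)
    if np.ndim(tensor) > 3:
        assert tensor.shape[0] == 1
        tensor = tensor[0]
    return PIL.Image.fromarray(tensor)

def load_img(path_to_img, max_dim=512):
    # Load an image, scale its longest side to max_dim and add a batch dimension.
    img = tf.io.read_file(path_to_img)
    img = tf.image.decode_image(img, channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)
    shape = tf.cast(tf.shape(img)[:-1], tf.float32)
    scale = max_dim / tf.reduce_max(shape)
    new_shape = tf.cast(shape * scale, tf.int32)
    img = tf.image.resize(img, new_shape)
    return img[tf.newaxis, :]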
def imshow(image, title=None):
    if len(image.shape) > 3:
        image = tf.squeeze(image, axis=0)  # drop the batch dimension before plotting
    plt.imshow(image)
    if title:
        plt.title(title)
#content_image = load_img('drive/MyDrive/StyleTransfer/Content/Eagle.jpg')
content_image = load_img('/content/photo_restoration/output/final_output/b.png')
style_image = load_img('Psychedelic.jpeg')
plt.subplot(1, 2, 1)
imshow(content_image, 'Content Image')
plt.subplot(1, 2, 2)
imshow(style_image, 'Style Image')
Show the Output given by the TF Hub Style Transfer Model
import tensorflow_hub as hub
hub_model = hub.load('https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2')
stylized_image = hub_model(tf.constant(content_image), tf.constant(style_image))[0]
tensor_to_image(stylized_image)
Downloaded https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2, Total size: 89.77MB
Load the VGG19 model from Keras
x = tf.keras.applications.vgg19.preprocess_input(content_image*255)
x = tf.image.resize(x, (224, 224))
vgg = tf.keras.applications.VGG19(include_top=True, weights='imagenet')
prediction_probabilities = vgg(x)
prediction_probabilities.shape
TensorShape([1, 1000])
Predict the label for the image to test whether VGG is working for Image classification
predicted_top_5 = tf.keras.applications.vgg19.decode_predictions(prediction_probabilities.numpy())[0]
[(class_name, prob) for (number, class_name, prob) in predicted_top_5]
[('bow_tie', 0.27081335),
('military_uniform', 0.077421315),
('cowboy_hat', 0.05717317),
('seat_belt', 0.023244375),
('barbershop', 0.022434594)]
print()
for layer in vgg.layers:
    print(layer.name)
input_2
block1_conv1
block1_conv2
block1_pool
block2_conv1
block2_conv2
block2_pool
block3_conv1
block3_conv2
block3_conv3
block3_conv4
block3_pool
block4_conv1
block4_conv2
block4_conv3
block4_conv4
block4_pool
block5_conv1
block5_conv2
block5_conv3
block5_conv4
block5_pool
# Content layer used to represent the content of the image (see the "Contents" output below)
content_layers = ['block5_conv2']

# Style layers of interest
style_layers = ['block1_conv1',
                'block2_conv1',
                'block3_conv1',
                'block4_conv1',
                'block5_conv1']

num_content_layers = len(content_layers)
num_style_layers = len(style_layers)
def vgg_layers(layer_names):
    """ Creates a vgg model that returns a list of intermediate output values."""
    # Load our model. Load pretrained VGG, trained on imagenet data
    vgg = tf.keras.applications.VGG19(include_top=False, weights='imagenet')
    vgg.trainable = False
    outputs = [vgg.get_layer(name).output for name in layer_names]
    model = tf.keras.Model([vgg.input], outputs)
    return model
style_extractor = vgg_layers(style_layers)
style_outputs = style_extractor(style_image*255)
block1_conv1
shape: (1, 384, 512, 64)
min: 0.0
max: 797.95276
mean: 35.36682
block2_conv1
shape: (1, 192, 256, 128)
min: 0.0
max: 5022.046
mean: 198.26442
block3_conv1
shape: (1, 96, 128, 256)
min: 0.0
max: 8470.7705
mean: 225.4287
block4_conv1
shape: (1, 48, 64, 512)
min: 0.0
max: 21628.117
mean: 784.6738
block5_conv1
shape: (1, 24, 32, 512)
min: 0.0
max: 3131.514
mean: 53.838207
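The gram_matrix helper used by StyleContentModel below is also not included in this extract;
a minimal version, again following the standard TensorFlow style-transfer tutorial, would be:

def gram_matrix(input_tensor):
    # Correlations between feature channels, averaged over all spatial locations.
    result = tf.linalg.einsum('bijc,bijd->bcd', input_tensor, input_tensor)
    input_shape = tf.shape(input_tensor)
    num_locations = tf.cast(input_shape[1] * input_shape[2], tf.float32)
    return result / num_locations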
class StyleContentModel(tf.keras.models.Model):
    def __init__(self, style_layers, content_layers):
        super(StyleContentModel, self).__init__()
        self.vgg = vgg_layers(style_layers + content_layers)
        self.style_layers = style_layers
        self.content_layers = content_layers
        self.num_style_layers = len(style_layers)
        self.vgg.trainable = False

    def call(self, inputs):
        # Expects float input in [0, 1]
        preprocessed_input = tf.keras.applications.vgg19.preprocess_input(inputs * 255.0)
        outputs = self.vgg(preprocessed_input)
        style_outputs = outputs[:self.num_style_layers]
        content_outputs = outputs[self.num_style_layers:]
        style_outputs = [gram_matrix(style_output)
                         for style_output in style_outputs]
        content_dict = {content_name: value
                        for content_name, value
                        in zip(self.content_layers, content_outputs)}
        style_dict = {style_name: value
                      for style_name, value
                      in zip(self.style_layers, style_outputs)}
        return {'content': content_dict, 'style': style_dict}
extractor = StyleContentModel(style_layers, content_layers)
results = extractor(tf.constant(content_image))
print('Styles:')
for name, output in sorted(results['style'].items()):
    print("  ", name)
    print("    shape: ", output.numpy().shape)
    print("    min: ", output.numpy().min())
    print("    max: ", output.numpy().max())
    print("    mean: ", output.numpy().mean())
    print()

print("Contents:")
for name, output in sorted(results['content'].items()):
    print("  ", name)
    print("    shape: ", output.numpy().shape)
    print("    min: ", output.numpy().min())
    print("    max: ", output.numpy().max())
    print("    mean: ", output.numpy().mean())
Styles:
block1_conv1
shape: (1, 64, 64)
min: 0.008798338
max: 20073.625
mean: 146.20078
block2_conv1
shape: (1, 128, 128)
min: 0.0
max: 24671.422
mean: 5006.4023
block3_conv1
shape: (1, 256, 256)
min: 0.0
max: 105253.766
mean: 4994.0215
block4_conv1
shape: (1, 512, 512)
min: 0.0
max: 1066156.0
mean: 71412.266
block5_conv1
shape: (1, 512, 512)
min: 0.0
max: 73688.96
mean: 636.2896
Contents:
block5_conv2
shape: (1, 32, 32, 512)
min: 0.0
max: 1013.4015
mean: 8.951838
style_targets = extractor(style_image)['style']
content_targets = extractor(content_image)['content']
image = tf.Variable(content_image)
def clip_0_1(image):
    return tf.clip_by_value(image, clip_value_min=0.0, clip_value_max=1.0)
style_weight=1e-2
content_weight=1e4
rank_1_tensor = tf.constant([])
def style_content_loss(outputs):
    style_outputs = outputs['style']
    content_outputs = outputs['content']
    style_loss = tf.add_n([tf.reduce_mean((style_outputs[name] - style_targets[name])**2)
                           for name in style_outputs.keys()])
    style_loss *= style_weight / num_style_layers
    content_loss = tf.add_n([tf.reduce_mean((content_outputs[name] - content_targets[name])**2)
                             for name in content_outputs.keys()])
    content_loss *= content_weight / num_content_layers
    loss = style_loss + content_loss
    return loss
opt = tf.keras.optimizers.Adam(learning_rate=0.02, beta_1=0.99, epsilon=1e-1)  # Adam settings as in the TF style-transfer tutorial
@tf.function()
def train_step(image):
    with tf.GradientTape() as tape:
        outputs = extractor(image)
        loss = style_content_loss(outputs)
    grad = tape.gradient(loss, image)
    opt.apply_gradients([(grad, image)])
    image.assign(clip_0_1(image))
array([53277.508], dtype=float32)
total_variation_weight=30
@tf.function()
def train_step(image):
    with tf.GradientTape() as tape:
        outputs = extractor(image)
        loss = style_content_loss(outputs)
        loss += total_variation_weight*tf.image.total_variation(image)
    grad = tape.gradient(loss, image)
    opt.apply_gradients([(grad, image)])
    image.assign(clip_0_1(image))
import IPython.display as display

# Re-initialize the image variable and optimize it for 50 x 100 steps
image = tf.Variable(content_image)
start = time.time()

epochs = 50
steps_per_epoch = 100

step = 0
for n in range(epochs):
    for m in range(steps_per_epoch):
        step += 1
        train_step(image)
        print(".", end='', flush=True)
    display.clear_output(wait=True)
    display.display(tensor_to_image(image))
    print("Train step: {}".format(step))

end = time.time()
print("Total time: {:.1f}".format(end - start))
Train step: 5000
Total time: 1658.4
# Save the final stylized image and download it (file name chosen here; not shown in the notebook)
file_name = 'stylized_image.png'
tensor_to_image(image).save(file_name)

try:
    from google.colab import files
except ImportError:
    pass
else:
    files.download(file_name)