ML_Seminar
Abstract
This project implements real-time neural style transfer using Adaptive Instance
Normalization (AdaIN). The model performs artistic style transfer by aligning feature
statistics between content and style images. Unlike traditional methods that are slow
and require style-specific training, this approach enables fast and flexible stylization of
arbitrary images using a single feed-forward model. The implementation is based on
PyTorch and follows the architecture proposed by Huang and Belongie. This report
outlines the methodology, training setup, loss functions, and results of the model.
1 Introduction
Neural Style Transfer (NST) is a technique that recomposes the content of one image in
the visual appearance or "style" of another. Traditional approaches, such as the method
introduced by Gatys et al. (2016), use a pre-trained convolutional neural network (e.g.,
VGG-19) to iteratively optimize a randomly initialized image so that its deep features match
those of a given content and style image. While this method yields visually impressive results,
it is computationally intensive and cannot be used in real-time applications.
Subsequent approaches attempted to accelerate the process by training feed-forward net-
works to mimic the optimization. However, these networks were either limited to specific
styles or required multi-style training and large models. To overcome these constraints, Xun
Huang and Serge Belongie introduced the concept of Adaptive Instance Normalization
(AdaIN) in 2017. AdaIN enables arbitrary style transfer in real-time using a single feed-
forward network by aligning the channel-wise mean and variance of content features to match
those of the style features.
This report is based on a PyTorch implementation of the AdaIN method, as provided in
the open-source repository: https://github.com/RUPESH-KUMAR01/AdaIn_style_transfer.
It includes detailed discussion of the architecture, loss functions, training procedures, and
evaluation results.
2 Methodology
The AdaIN model consists of three major components: a pre-trained encoder, the AdaIN
feature transformation layer, and a trainable decoder. The overall architecture is inspired
by the one proposed in the original AdaIN paper, and the implementation closely follows
the provided PyTorch code.
2.1 Encoder
The encoder is a truncated version of the VGG-19 network, stopping at layer relu4_1. It
is used to extract high-level feature representations of both the content and style images.
In the code, this is implemented by loading the pre-trained VGG-19 model and freezing its
parameters to avoid updates during training.
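As a concrete illustration, the following sketch shows one way to obtain such a truncated, frozen encoder with torchvision. The function name and the slicing index are assumptions made for this report and may not match the repository exactly.

```python
import torch.nn as nn
from torchvision import models

def build_encoder() -> nn.Module:
    """Truncated VGG-19 encoder up to relu4_1, with frozen weights."""
    vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features
    # In torchvision's VGG-19 layer ordering, index 20 is the ReLU following
    # conv4_1 (i.e. relu4_1), so the first 21 modules cover everything up to it.
    encoder = nn.Sequential(*list(vgg.children())[:21])
    for p in encoder.parameters():
        p.requires_grad = False  # the encoder is never updated during training
    return encoder.eval()
```

2.2 AdaIN Layer
Between the encoder and the decoder, the AdaIN layer aligns the channel-wise mean and
variance of the content features $x$ with those of the style features $y$:
$$\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ are computed per channel over spatial positions. The layer has no
learnable parameters. A minimal PyTorch sketch of this operation (not taken verbatim from
the repository) is:

```python
import torch

def adain(content_feat: torch.Tensor, style_feat: torch.Tensor,
          eps: float = 1e-5) -> torch.Tensor:
    """Align the channel-wise mean/std of content features with the style features.

    Both inputs are feature maps of shape (N, C, H, W).
    """
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps
    return (content_feat - c_mean) / c_std * s_std + s_mean
```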
2.3 Decoder
The decoder network is a symmetric convolutional network designed to reconstruct an image
from AdaIN-transformed features. It includes upsampling layers (nearest-neighbor interpola-
tion) followed by convolutional blocks. The decoder is trained from scratch while the encoder
remains fixed.
The goal of the decoder is to produce an output image that reflects the style of the
reference image while preserving the structure of the content image. In the implementation,
the decoder is trained using a joint content and style loss (described in the next section).
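A compact sketch of such a decoder is shown below. The number of blocks, the channel widths, and the use of reflection padding are illustrative assumptions in the spirit of the architecture described above, not the exact configuration from the repository.

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """Reflection-padded 3x3 convolution followed by ReLU."""
    return nn.Sequential(
        nn.ReflectionPad2d(1),
        nn.Conv2d(in_ch, out_ch, kernel_size=3),
        nn.ReLU(inplace=True),
    )

# Roughly mirrors the truncated VGG-19 encoder: each nearest-neighbor
# Upsample undoes one pooling stage, and the channel count shrinks back
# from 512 (relu4_1 features) to 3 (RGB output).
decoder = nn.Sequential(
    conv_block(512, 256),
    nn.Upsample(scale_factor=2, mode="nearest"),
    conv_block(256, 256),
    conv_block(256, 256),
    conv_block(256, 128),
    nn.Upsample(scale_factor=2, mode="nearest"),
    conv_block(128, 64),
    nn.Upsample(scale_factor=2, mode="nearest"),
    conv_block(64, 64),
    nn.ReflectionPad2d(1),
    nn.Conv2d(64, 3, kernel_size=3),  # final RGB prediction, no activation
)
```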
3 Training
3.1 Datasets
The training procedure uses two distinct datasets:
• Content images: Typically sampled from MS-COCO, a large dataset of natural
images.
• Style images: Sourced from artistic datasets such as WikiArt, which contain paintings
in various artistic styles.
Each training batch samples one content and one style image randomly. The images are
resized and normalized before being passed through the encoder.
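A minimal preprocessing pipeline consistent with this description might look as follows; the exact resize and crop sizes are assumptions, since this report does not specify them.

```python
from torchvision import transforms

# Illustrative preprocessing applied to both content and style images.
train_transform = transforms.Compose([
    transforms.Resize(512),       # resize the shorter side
    transforms.RandomCrop(256),   # random square training crop
    transforms.ToTensor(),        # HWC uint8 -> CHW float in [0, 1]
    # ImageNet statistics for the VGG encoder (assumed, not stated in this report)
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```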
3.2 Optimization
Only the decoder is trained; the encoder (VGG-19) is kept frozen. Training uses the Adam
optimizer with the following setting:
• Batch size: 8
The training objective is to minimize a weighted combination of content and style losses,
encouraging the decoder to generate outputs that resemble the style image in appearance
but retain the content structure.
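Since the encoder is frozen, only the decoder's parameters are handed to the optimizer. A minimal setup, reusing the decoder sketched in Section 2.3, could look like this; the learning rate is an illustrative choice rather than a value reported here.

```python
import torch

# Only the decoder is optimized; the frozen VGG-19 encoder is excluded entirely.
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)  # lr is an assumption
```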
4 Loss Functions
The training objective consists of two parts: content loss and style loss. Both are computed
using features extracted from the encoder.
4.1 Content Loss
The content loss ensures that the output image maintains the structure and semantics of the
content image. It is computed as the squared Euclidean distance between the AdaIN-transformed
target features and the encoder features of the output image:
$$L_c = \lVert f(g(t)) - t \rVert_2^2$$
where f is the encoder, g is the decoder, and t is the AdaIN-transformed target feature.
4.2 Style Loss
The style loss encourages the output image to reproduce the feature statistics of the style
image. It compares channel-wise means and standard deviations across several layers of the
encoder:
$$L_s = \sum_{i=1}^{L} \bigl\lVert \mu(\phi_i(g(t))) - \mu(\phi_i(s)) \bigr\rVert_2 + \sum_{i=1}^{L} \bigl\lVert \sigma(\phi_i(g(t))) - \sigma(\phi_i(s)) \bigr\rVert_2$$
Here, $\phi_i$ denotes the features extracted from the $i$-th layer of the encoder, $s$ is the style
image, and $\mu$, $\sigma$ are the channel-wise mean and standard deviation. The overall training
objective combines the two terms:
$$L_{\text{total}} = L_c + \lambda L_s$$
The hyperparameter $\lambda$ balances content and style. In practice, $\lambda = 10$ gives good results.
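The sketch below ties the pieces together into a single loss computation, reusing the adain helper and the encoder/decoder from the earlier sections. For brevity it measures the style statistics only at the encoder's final layer and uses mean-squared error in place of the norms above; the actual implementation compares statistics at several VGG layers.

```python
import torch.nn.functional as F

def compute_losses(encoder, decoder, content_img, style_img, lam=10.0):
    """One forward pass returning (total, content, style) losses."""
    f_c = encoder(content_img)   # content features
    f_s = encoder(style_img)     # style features
    t = adain(f_c, f_s)          # AdaIN-transformed target features
    output = decoder(t)          # stylized image
    f_o = encoder(output)        # features of the stylized image

    # Content loss: output features should match the AdaIN target t.
    loss_c = F.mse_loss(f_o, t)

    # Style loss: match channel-wise mean and standard deviation of the
    # output features to those of the style features.
    loss_s = F.mse_loss(f_o.mean(dim=(2, 3)), f_s.mean(dim=(2, 3))) \
           + F.mse_loss(f_o.std(dim=(2, 3)), f_s.std(dim=(2, 3)))

    return loss_c + lam * loss_s, loss_c, loss_s
```

In a training step, the returned total loss is backpropagated and optimizer.step() updates the decoder weights only.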
5 Results
5.1 Stylized Output
The model is capable of generating high-quality stylized images that retain the spatial struc-
ture of the content image while adopting the visual appearance (colors, textures, brush-
strokes) of the style image. This is achieved without training a separate model for each
style, demonstrating the flexibility of AdaIN-based style transfer.
Figure 1 shows a sample result, where the content image has been transformed with
the style of a reference painting. The model generalizes well to unseen styles and produces
results in real-time, making it suitable for interactive applications.
Figure 1: Example of stylized image output. The output preserves the structure of the
content image while adopting the artistic characteristics of the style image.
Figure 2: Training loss over time. The total loss combines content and style objectives.
6 Conclusion
This project successfully reimplements the AdaIN-based style transfer architecture for real-
time stylization using PyTorch. The approach demonstrates how statistical alignment of
feature maps using AdaIN allows arbitrary style transfer without requiring per-style retrain-
ing. The encoder is fixed, and the decoder is trained to reconstruct stylized outputs using a
content-style trade-off loss.
The results confirm that AdaIN produces perceptually convincing stylized images that
preserve the content structure while adopting artistic styles. Compared to earlier methods,
this model is lightweight and fast enough for real-time use.
Future improvements could include temporal consistency for video stylization, user-
guided controls for blending styles, or improving high-frequency texture detail in stylized
results.
References
• Huang, Xun, and Serge Belongie. "Arbitrary Style Transfer in Real-Time with Adaptive
Instance Normalization." arXiv preprint arXiv:1703.06868 (2017).
• Gatys, Leon A., Alexander S. Ecker, and Matthias Bethge. "Image Style Transfer Using
Convolutional Neural Networks." In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), 2016.