

State of the Art on Deep Learning-enhanced Rendering Methods

Qi Wang 1, Zhihua Zhong 1, Yuchi Huo 2,1, Hujun Bao 1, Rui Wang 1
1 State Key Laboratory of CAD&CG, Zhejiang University, Hangzhou 310058, China
2 Zhejiang Laboratory, Hangzhou 311121, China

Abstract: Photorealistic rendering of the virtual world is an important and classic problem in the field of computer graphics. With the development of GPU hardware and continuous research on computer graphics, representing and rendering virtual scenes has become easier and more efficient. However, there are still unresolved challenges in efficiently rendering global illumination effects. At the same time, machine learning and computer vision provide real-world image analysis and synthesis methods, which can be exploited by computer graphics rendering pipelines. Deep learning-enhanced rendering combines techniques from deep learning and computer vision into the traditional graphics rendering pipeline to enhance existing rasterization or Monte Carlo integration renderers. This state-of-the-art report summarizes recent studies of deep learning-enhanced rendering in the computer graphics community. Specifically, we focus on works whose renderers are represented using neural networks, whether the scene is represented by neural networks or traditional scene files. These works target either general scenes or specific scenes, which are differentiated by the need to retrain the network for new scenes.

Keywords: Neural rendering, computer graphics, scene representation, rendering, post-processing.

Citation: Q. Wang, Z. Zhong, Y. Huo, H. Bao, R. Wang. State of the art on deep learning-enhanced rendering methods. Machine Intelligence Research, vol. 20, no. 6, pp. 799–821, 2023. https://doi.org/10.1007/s11633-022-1400-x

Review
Manuscript received on July 11, 2022; accepted on November 29, 2022
Recommended by Associate Editor Jing-Yi Yu
Colored figures are available in the online version at https://link.springer.com/journal/11633
© Institute of Automation, Chinese Academy of Sciences and Springer-Verlag GmbH Germany, part of Springer Nature 2023

1 Introduction

For traditional computer graphics, generating photorealistic rendering results of a scene is an important research direction, and researchers have developed a variety of algorithms to solve this problem in recent decades, including modeling complex materials[1−3], sophisticated sampling methods[4, 5], acceleration of global illumination computation[6, 7], etc. These methods are mainly applied in two fields: the rasterization pipeline for real-time rendering that serves the latest video games, and the ray tracing pipeline commonly used in the film industry for offline rendering that deals with global illumination effects. Regardless of the rendering pipeline, much time-consuming manual work by artists and programmers is essential, i.e., complex and sophisticated renderers or shaders must be written by experienced programmers, and scene construction, including geometry, material textures, lighting conditions, and animations, is the responsibility of artists. These preliminary preparations greatly increase the time and capital cost of photorealistic rendering. However, with the rapid development of the computer vision and deep learning research fields in recent years, the combination of traditional graphics rendering pipelines and deep learning provides a new direction to solve the above problems. A number of deep generative models for generating high-resolution styled[8] or high-fidelity 2D images have emerged, e.g., the seminal generative adversarial networks (GANs)[9] and their follow-ups[10, 11], as well as variational auto-encoder networks (VAEs)[12−14]. Reference [15] even realizes control of the generated image through an additional condition input.

Armed with powerful neural network-based image generation tools, researchers have considered how to represent traditional scenes as data types that neural networks can handle and feed into generative networks to render scenes. The first seminal method that combines a deep neural network and a traditional rendering pipeline is the generative query network (GQN)[16]. The network takes several rendered images and the corresponding camera parameters as input to encode the complete scene information as a vector, and the vector is fed to a generative network to enable rendering of the scene from any viewpoint (Section 4.4). Although the rendering results generated by their method are not realistic enough, they inspire a vast amount of subsequent work and create a new field of research: neural rendering. Compared with other deep learning research fields, the focus of neural rendering is not only on delicate network structure and design but also on the combination of physical and mathematical knowledge from traditional rendering.

In fact, for a specific rendering task, the difficulty lies in embedding the corresponding domain knowledge into the network, e.g., for human skin rendering, how to embed the subsurface scattering process into the network to allow the neural generator to generate more realistic skin effects. Compared with the traditional rendering pipeline, the rendering quality of neural rendering is closely related to the quantity, distribution, and quality of the input dataset. Thus, how to render high-quality results with insufficient data is also an important concern of neural rendering.

This state-of-the-art report (STAR) summarizes and classifies the different types of deep learning-enhanced rendering approaches. It should be noted that our work is different from another review on neural rendering[17]. We only focus on approaches that are integrated into traditional rendering pipelines with neural networks, i.e., the forward subset of neural rendering which assumes known input scenes (geometry, lighting, material, viewpoint) and does not concern the specific representation. However, the concept of "rendering" in their review is broader, including a series of GAN-based 2D image generation works and image-based rendering[18]. At the same time, our classification of deep learning-enhanced rendering methods is more in line with traditional graphics research, and our demonstration of each work is more detailed. The central scheme around which we structure this report is the generality and application scenarios of each approach, which is essential for most kinds of graphics applications. Novel view synthesis and relighting are commonly achieved in the following methods; thus, we do not classify the approaches by them. We start by clarifying the scope of our report. Then we discuss the theoretical fundamentals of physically-based rendering and deep generative networks to provide readers with a better understanding of the methods described below. Then we discuss the landscape of applications that is enabled by deep learning-enhanced rendering. Finally, we summarize the entire report.

2 Scope of this STAR

In this state-of-the-art report, we focus on the classic and latest applications that combine deep neural networks with renderer components in the computer graphics rendering pipeline (Fig. 1). Specifically, we discuss how neural networks can replace or enhance the work of renderers in traditional rendering pipelines, and the advantages and disadvantages of the combinations. We categorize the deep learning-based rendering techniques and representative works that appear in this survey in Table 1. For a clearer understanding, we first introduce the fundamentals of traditional physically based rendering and deep neural networks that relate to image synthesis. Then, we discuss some classic and up-to-date works, based on several aspects.

We do not cover any work based on neural radiance fields (NeRF)[70−78], which is not related to traditional rendering pipelines. NeRF is a novel view synthesis and 3D reconstruction method with an implicit scene representation (density field) combined with the ray marching algorithm that has drawn great attention in the field of computer vision. Please refer to [80] for a deep comprehension. Similar to NeRF, the signed distance function (SDF) approach also utilizes an implicit scene representation (signed distance field), which achieves better 3D reconstruction results. Although the SDF-based approach is somewhat different from the traditional rendering pipeline, it utilizes the volume rendering algorithm, which is an important algorithm in traditional graphics. In view of the fact that there is no review on SDF, we introduce one of the most popular SDF methods[36] (Section 5.4). Please refer to [81, 82] for more related work.

Although deep learning-based Monte Carlo denoising methods, which aim to reconstruct denoised results from synthetic images generated at low samples per pixel (SPP), have made significant progress in recent years, and these processes are typically used as a post-processing stage in a traditional ray tracing pipeline, we will not discuss them because they have been well studied by a recent survey[79].

Fig. 1 Traditional rendering pipeline (scene representation → renderer → post-processing) and the corresponding learning-based methods. Each method replaces one or more steps in the traditional rendering pipeline.

Table 1 Categories of deep learning-based rendering techniques and representative works in each category

Technique category                              | Papers (Generalize) | Papers (Specific)
Scene representation – Voxel-based              | [19–21]             | [22]
Scene representation – Vector-based             | [16, 23]            | [24, 25]
Scene representation – Mesh-based               | [26]                | [27–29]
Scene representation – Point-based              |                     | [30–33]
Scene representation – Network-based            | [34, 35]            | [36–39]
Global illumination – Ambient occlusion         | [40–43]             |
Global illumination – Direct illumination       | [44]                |
Global illumination – Indirect illumination     | [45–47]             | [48–50]
Global illumination – Volume and subsurface     | [51–56]             |
Human-related rendering                         | [57–60]             | [61–64]
Post-processing                                 | [65–69]             |
NeRF (not discussed)                            | [70–78]
Denoising (not discussed)                       | [79]

3 Theoretical foundation

3.1 Physically based rendering

Traditional graphics pipelines model image formation as a physical process in the real world: the photons emitted by the light source interact with objects in the scene, as described by a bidirectional scattering distribution function (BSDF) determined by geometry and material properties, and are then recorded by the camera. This process is known as light transport and can be formulated by an equation, the classical rendering equation[83]:

$$L_o(x, \omega_o, \lambda, t) = L_e(x, \omega_o, \lambda, t) + \int_{\Omega} f_r(x, \omega_i, \omega_o, \lambda, t)\, L_i(x, \omega_i, \lambda, t)\, (\omega_i \cdot n)\, \mathrm{d}\omega_i \tag{1}$$

where L_o represents the outgoing radiance from a surface, x denotes the surface position, ω_o denotes the outgoing direction of the light path, λ denotes the light wavelength, t denotes the moment of interaction, n denotes the surface normal, ω_i denotes the incident direction of the light path, L_i denotes the incident radiance, f_r denotes the BSDF function, and Ω denotes the hemisphere around the surface point. This equation omits consideration of transparent objects and any effects of subsurface or volumetric scattering. The most classic solver for this integral is Monte Carlo simulation[84]. In practice, the film only records three different wavelengths corresponding to the R, G, and B spectra. The BSDF function is usually obtained by fitting actually measured data of different materials. For more discussion on modelling lighting, materials, cameras, and geometry, please refer to [85].
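To make the integral in (1) concrete, the following minimal Python sketch estimates the hemispherical integral at a single surface point with cosine-weighted Monte Carlo sampling. It is an illustration only, not code from any of the surveyed systems; the Lambertian BRDF, the constant sky radiance, and all function names are placeholder assumptions.

    import numpy as np

    def cosine_sample_hemisphere(n, rng):
        """Sample a direction around normal n with pdf = cos(theta) / pi."""
        u1, u2 = rng.random(), rng.random()
        r, phi = np.sqrt(u1), 2.0 * np.pi * u2
        local = np.array([r * np.cos(phi), r * np.sin(phi), np.sqrt(1.0 - u1)])
        # Build an orthonormal basis around n and rotate the local sample into it.
        t = np.cross(n, [0.0, 1.0, 0.0]) if abs(n[1]) < 0.9 else np.cross(n, [1.0, 0.0, 0.0])
        t = t / np.linalg.norm(t)
        b = np.cross(n, t)
        return local[0] * t + local[1] * b + local[2] * n

    def estimate_outgoing_radiance(x, n, w_o, brdf, incident_radiance, emitted, spp=256, seed=0):
        """One-point Monte Carlo estimate of Eq. (1): L_o = L_e + integral of f_r * L_i * cos."""
        rng = np.random.default_rng(seed)
        total = 0.0
        for _ in range(spp):
            w_i = cosine_sample_hemisphere(n, rng)
            cos_theta = max(np.dot(w_i, n), 0.0)
            pdf = cos_theta / np.pi                      # cosine-weighted pdf
            if pdf > 0.0:
                total += brdf(x, w_i, w_o) * incident_radiance(x, w_i) * cos_theta / pdf
        return emitted(x, w_o) + total / spp

    # Example: Lambertian surface (f_r = albedo / pi) under a constant white sky.
    L_o = estimate_outgoing_radiance(
        x=np.zeros(3), n=np.array([0.0, 0.0, 1.0]), w_o=np.array([0.0, 0.0, 1.0]),
        brdf=lambda x, wi, wo: 0.8 / np.pi,
        incident_radiance=lambda x, wi: 1.0,
        emitted=lambda x, wo: 0.0)
    # For this setup the exact answer is the albedo 0.8, so L_o should be close to 0.8.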

3.2 Deep generative network

Traditional generative adversarial networks (GANs)[9] synthesize virtual images with statistics resembling the training set from a sampled random vector. The specific content of the generated pictures cannot be controlled. However, this is far from sufficient for scene rendering, as generating a random image for a specific scene is meaningless. To address this problem, feed-forward neural networks are trained with a distance to generate images given conditional inputs[86]. However, these networks usually suffer from blurry results caused by the distance that only accounts for individual pixels in image space and ignores the complex visual structure[87]. Later work proposed perceptual similarity distances[88−90] computed by pre-trained networks (usually VGGNet) to measure the distance between the generated image and ground truth in high-dimensional feature space. Additionally, the structural similarity index measure (SSIM)[91, 92] distance is considered to improve the prediction quality.

Although pairwise supervised training might achieve better metrics, the generated images may look unnatural. Conditional GANs (cGANs)[45] and StyleGAN[93] aim to generate images matching the conditional distribution of outputs given inputs, which are indistinguishable to the human visual system. Although the generation can be controlled by the condition, the network cannot yet achieve explicit scene-level control.
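As an illustration of such an image-space distance, the sketch below implements a plain uniform-window variant of SSIM in Python. It follows the standard SSIM formula but is not the exact implementation used in [91, 92]; the window size and constants are conventional defaults.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def ssim(x, y, data_range=1.0, win=7, k1=0.01, k2=0.03):
        """Mean structural similarity between two grayscale images in [0, data_range]."""
        x = x.astype(np.float64)
        y = y.astype(np.float64)
        c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
        mu_x = uniform_filter(x, win)
        mu_y = uniform_filter(y, win)
        # Local (biased) variances and covariance from local raw moments.
        var_x = uniform_filter(x * x, win) - mu_x ** 2
        var_y = uniform_filter(y * y, win) - mu_y ** 2
        cov_xy = uniform_filter(x * y, win) - mu_x * mu_y
        ssim_map = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
            (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
        return ssim_map.mean()

    # 1 - ssim(prediction, target) can serve as a structural distance term next to
    # pixel-wise, perceptual, and adversarial losses.
    d = 1.0 - ssim(np.random.rand(64, 64), np.random.rand(64, 64))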

4 General methods

There are many applications of deep learning-enhanced rendering, including surface rendering, subsurface rendering, volume rendering, novel view synthesis, relighting, photorealistic human appearance rendering, etc. Here, we categorize these applications into general methods and specific methods, because the ability to apply a trained network to different input scenes is important for rendering. Under each category, we detail each application by the renderer's input and output types. Instead of classifying by novel view synthesis and relighting[17], we focus on the overall pipeline of the application as it is closer to the traditional computer graphics process. General methods only need to be trained once and can then be applied to a range of scenes without retraining. The applications described below are general methods by default, and no additional explanations will be given.

4.1 Ambient occlusion generation methods

Ambient occlusion (AO) is a typical screen-space effect that is usually used in a real-time rendering pipeline to simulate the occlusion effect of objects in the scene. High-quality AO is usually generated by offline rendering, while there is still work on generating approximate AO in real time[94]. Deep shading[40] presents a novel technique to generate several rendering effects, including AO, utilizing deferred shading buffers and convolutional neural networks. This is an early classic work using neural networks as renderers, using a U-shaped network that takes deferred shading buffers as input to generate specific rendering results. Erra et al.[41] introduce another method to generate AO; different from [40], the input to their network is not deferred shading buffers but sampled normals in object space, and they use the OpenGL Shading Language to implement network inference, which enables direct integration into the real-time rendering pipeline. Similar to [40], Zhang et al.[42] use a similar U-shaped network structure and deferred shading buffers to generate AO. They also implement a compute shader library to integrate the network into a real-time rendering pipeline. AOGAN[43] is the latest approach to generate screen-space AO. Different from all the above methods, they build a GAN-based neural network with a self-attention module, taking position and normal shading buffers as input. They also combine the perceptual loss of the VGG structure with the adversarial loss of the GAN structure to train the generator and discriminator jointly. Benefitting from their advanced network structure, they generate results close to offline rendering in real time, see Fig. 2.

Fig. 2 Left to right: Results from Deep shading, AOGAN, and ground truth, respectively. Images taken from Ren and Song[43].
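The sketch below shows, in simplified form, the kind of buffer-to-AO mapping these methods learn: a small encoder-decoder (a stand-in for the U-shaped networks in [40, 42, 43]) that takes position and normal buffers and predicts a screen-space AO map. Layer sizes and buffer layout are placeholder assumptions, not the published architectures.

    import torch
    import torch.nn as nn

    class TinyAONet(nn.Module):
        """Toy screen-space AO predictor: G-buffer in, single-channel AO map out."""
        def __init__(self, in_ch=6, base=32):     # 6 = view-space position (3) + normal (3)
            super().__init__()
            self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU(inplace=True))
            self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU(inplace=True))
            self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
            self.dec = nn.Sequential(nn.Conv2d(base * 3, base, 3, padding=1), nn.ReLU(inplace=True),
                                     nn.Conv2d(base, 1, 3, padding=1), nn.Sigmoid())

        def forward(self, gbuffer):
            f1 = self.enc1(gbuffer)                       # full-resolution features
            f2 = self.up(self.enc2(f1))                   # half resolution, then back up
            return self.dec(torch.cat([f2, f1], dim=1))   # skip connection, AO in [0, 1]

    gbuffer = torch.randn(1, 6, 256, 256)                 # position + normal buffers from the rasterizer
    ao_map = TinyAONet()(gbuffer)                         # shape (1, 1, 256, 256)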
4.2 Volume and subsurface rendering

Techniques for rendering participating media are a crucial part of traditional computer graphics; they can represent effects such as clouds, smoke, flames, waxes, liquids, skin, etc. These rendering techniques can be divided into two types: one is the volume rendering of the medium as particles, including homogeneous volume rendering[95] and heterogeneous volume rendering[96, 97], and the other is the approximation of internal scattering, i.e., bidirectional scattering surface reflectance distribution function (BSSRDF) based methods[3, 98], commonly used to render high-density, high-albedo media such as wax, skin, marble, etc. Although the rendering of participating media is relatively mature, there are still problems with time-consuming volume rendering and the inaccuracy of BSSRDF-based methods, so deep learning-based volume and subsurface rendering techniques have emerged in recent years.

Deep scattering[51] proposed a method to synthesize multi-scattered illumination in clouds using deep radiance-predicting neural networks (RPNN), which efficiently synthesize the in-scattered radiance to replace the costly evaluation of Monte Carlo (MC) integration. Instead of predicting the full radiance directly, they opt for only multi-scattered transport and employ MC integration for the rest of the transport. Their method achieves an up to 4 000× speedup over path tracing, and the bias is visually acceptable. Panin and Nikolenko[52] improved Deep scattering by proposing the Faster RPNN, which is 2–3 times faster than the RPNN. They decrease the RPNN network size by using a baking network to bake the light of a single directional light source and decrease the descriptor size by passing a much smaller cloud descriptor into the rendering network, thus saving time both on inference and collection while obtaining a lower bias compared to RPNN.

Abbas and Babahenini[53] introduce the latest method for rendering forest fog using a method similar to [40]. Although fog is a cloud-like medium, instead of considering a specific scattering process, they simply generate forest fog rendering results utilizing shading buffers (normal map, depth map, albedo map, RGB color map without fog) and a U-shape-based generative adversarial neural network. Their ground truth is images of forest fog rendered by traditional rendering pipelines. Zheng et al.[54] proposed a new method for rendering heterogeneous volumes that utilizes three neural networks to predict visibility, single scattering, and multiple scattering. Different from [51], this work not only predicts multiple scattering but predicts all scattering processes at once, and they represent multiple scattering by spherical harmonic (SH) basis functions; thus, the network only needs to predict the SH coefficients. Fig. 3 shows their rendering results for translucent and opaque materials.

Fig. 3 Rendering results and ground truth of translucent material and opaque material under different environment maps. Images taken from Zheng et al.[54]

In addition to the volume rendering methods, there are also works related to subsurface scattering. Hermosilla et al.[55] introduced a deep learning-based method to learn the latent space of light transport from a 3D point cloud to represent the ambient occlusion, global illumination, and subsurface scattering shading effects. Compared to screen-space methods, their method represents the 3D scene as an unstructured 3D point cloud, which is later projected to the 2D output image during rendering. For each effect, they trained a network individually. For the subsurface scattering network, its input is the position, normal, albedo, direct illumination, and scattering coefficients corresponding to each point of the point cloud, and the final output is the rendering result of the point. The state-of-the-art approach for subsurface scattering rendering was proposed by Vicini et al.[56] To address the error caused by the semi-infinite plane assumption and the diffusion-based approximation in the traditional BSSRDF method, they abandon the idea of diffusion approximation and use a neural network to predict the sampling points directly, together with their contributions, while rendering. By fitting the scene surface to a quadratic polynomial, their network can handle arbitrarily shaped inputs. Their work produces a more realistic appearance and lower error compared to a photon beam diffusion path-traced reference (see Fig. 4).

Fig. 4 Rendering of translucent soap blocks with photon beam diffusion (a) and their shape-adaptive BSSRDF (b). Images taken from Vicini et al.[56]

4.3 Voxel-based scene representation rendering methods

Regardless of the rendering method, the representation of the scene determines the input form of the rendering neural network. The traditional renderer takes a scene representation file as input. However, this representation cannot be expressed as a tensor, so it cannot be used as an input to a neural network. Inspired by recent progress in computer vision, many approaches that represent the scene as a voxel grid have emerged.

Visual object networks (VON)[19] presents a novel generative model, synthesizing natural images of objects with a disentangled 3D representation. Inspired by classic graphics rendering pipelines, they decomposed the generation model into three independent factors: shape, viewpoint, and texture. They first learn a shape-generative adversarial network that maps a randomly sampled shape code to a voxel grid. Then they project the voxel grid to 2.5D sketches with their differentiable projection module under a sampled viewpoint. Finally, they train a texture network to add realistic, diverse textures to the 2.5D sketches to generate 2D images that cannot be distinguished from real images by an image discriminator. The whole model is trained end-to-end on both 2D and 3D data. Their scenes, although randomly generated, are still represented using a voxel grid to represent three-dimensional structures. RenderNet[20] proposed a differentiable rendering convolutional network with a projection unit that can render 2D images from 3D shapes represented by a voxel grid. Benefitting from their differentiable renderer, their work enables relighting, different kinds of shading (Phong, contour line, cartoon, ambient occlusion), novel view synthesis, and shape reconstruction from images. RenderNet takes a voxel grid, camera pose, and light position as input, and applies a view-projection transformation to convert the voxel grid to the camera coordinate system.

After trilinear sampling, the transformed voxel grid is sent to a 3D convolution network with a projection unit to produce 2D feature maps that are sent to a 2D convolution network to compute shading. The network can alternatively produce normal maps of the 3D input. They also demonstrate the ability to iteratively recover a 3D voxel grid representation of a scene from a single image utilizing the differentiable renderer. The neural voxel renderer NVR[21] presents a deep learning-based rendering method that maps a voxelized scene into a high-quality image. Their method allows control of the scene similar to a classic graphics pipeline, including geometric and appearance modifications, lighting condition modification, and camera position modification. They demonstrate the effectiveness of their method by rendering scenes with varying scene settings. Their main contribution is presenting a novel neural network model that takes a voxel representation of the scene as input and learns how to render it. Two neural renderers, NVR and NVR+, are designed to render the scene. However, NVR generates blurry results with artifacts when the color pattern of the input voxels forms a high-frequency and irregular texture. RenderNet[20] is the backbone of the NVR network. As the input of the NVR network, the voxels are first sent to the 3D encoder, which contains a series of 3D convolutions, and then passed through the reshape unit to become a 2D feature. These features are finally subjected to a series of 2D convolutions as the final feature of the voxels. Light conditions are also processed by two fully connected layers and tiled onto the final feature so that the lighting information is encoded. Finally, a 2D decoder processes the final feature to generate the output image. The NVR+ network adds a splatting processing network and a neural rendering network on top of the NVR network. The splatting processing network first synthesizes an image by splatting the centers of the colored voxels in the target view and then passes this image through a 2D convolution encoder. The output of this network is then concatenated with the features from NVR, and the final result is processed by a U-Net (the neural rendering network) to generate the output image. Fig. 5 shows the neural rendering results of objects with NVR+. Although scene parameters can be modified, there are still some fixed attributes, such as light color, camera focal length, object material, etc.

Fig. 5 Neural rendering of cars and textured objects with NVR+. Images taken from Rematas and Ferrari[21].
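The view transformation step shared by these voxel-based renderers amounts to resampling the grid at transformed coordinates. The following Python sketch shows plain trilinear sampling of a dense volume; it is an illustration under simplified assumptions, not RenderNet's or NVR's actual implementation.

    import numpy as np

    def trilinear_sample(volume, coords):
        """Sample a dense voxel grid (D, H, W) at continuous voxel coordinates.
        coords: (N, 3) array of (z, y, x) positions, e.g. the target grid's voxel
        centres transformed into the source grid. Out-of-range samples clamp."""
        d, h, w = volume.shape
        z, y, x = coords[:, 0], coords[:, 1], coords[:, 2]
        z0 = np.clip(np.floor(z).astype(int), 0, d - 2); z1 = z0 + 1
        y0 = np.clip(np.floor(y).astype(int), 0, h - 2); y1 = y0 + 1
        x0 = np.clip(np.floor(x).astype(int), 0, w - 2); x1 = x0 + 1
        fz = np.clip(z - z0, 0.0, 1.0)
        fy = np.clip(y - y0, 0.0, 1.0)
        fx = np.clip(x - x0, 0.0, 1.0)
        c00 = volume[z0, y0, x0] * (1 - fx) + volume[z0, y0, x1] * fx
        c01 = volume[z0, y1, x0] * (1 - fx) + volume[z0, y1, x1] * fx
        c10 = volume[z1, y0, x0] * (1 - fx) + volume[z1, y0, x1] * fx
        c11 = volume[z1, y1, x0] * (1 - fx) + volume[z1, y1, x1] * fx
        c0 = c00 * (1 - fy) + c01 * fy
        c1 = c10 * (1 - fy) + c11 * fy
        return c0 * (1 - fz) + c1 * fz

    # To resample a rotated grid, transform the target voxel centres by the inverse
    # view rotation and look the values up in the source volume with this function.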
4.4 Vector-based scene representation rendering methods

Since deep neural networks deal with tensors, it is intuitive to represent the scene as a vector that can be directly passed to neural networks. There have also been methods dealing with vector-based scene representations in recent years.

Eslami et al.[16] introduced the generative query network (GQN), which is a framework for learning a vector embedding of a scene relying only on a few scene observations. The GQN takes several images taken from different viewpoints and the corresponding camera poses as input and constructs a latent vector that encodes information about the underlying scene. This latent vector is designed to represent the complete scene (e.g., object geometries, colors, positions, lighting, and scene layout), and it is unaware of the viewpoints. Each time a new observation is added, the latent vector representing the scene sums up the latent vector of the observation to obtain a more complete and accurate scene representation. The GQN's generator is responsible for generating an image, given the scene representation and a new camera viewpoint as input. Only when the latent vector representing the scene is accurate enough can the generation network synthesize the correct novel view image. At the same time, although only a few viewpoints per scene are used to train GQN, it is able to render unseen scenes from arbitrary viewpoints. However, their approach can only handle simple scenes with basic shapes and unrealistic lighting.

Liao et al.[23] defined a new task of 3D controllable image synthesis and proposed a method for solving it. They considered generating a vector-based scene representation from a controllable 3D generator and rendering it using a 2D generator. Their method consists of three main parts: a 3D generator, a differentiable projection layer, and a 2D generator. The 3D generator maps a latent code drawn from a Gaussian distribution into a set of abstract 3D primitives. Then, the differentiable projection layer takes each 3D primitive as input and outputs a feature map, an alpha map, and a depth map. Finally, the GAN-based 2D generator refines them and produces the final rendered image. A background vector is also projected and rendered to composite with the final render, generating the full rendered result. Although their work can generate controllable scenes (object rotation and translation), the generated geometry and lighting are relatively simple, and there is no control over properties such as materials and lighting.
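A minimal sketch of the order-invariant aggregation idea behind GQN[16] discussed above follows: per-observation codes are summed into one scene vector, which conditions a decoder together with the query pose. The architecture, sizes, and pose encoding are placeholder assumptions and are deliberately much smaller than the original model.

    import torch
    import torch.nn as nn

    class TinyGQN(nn.Module):
        """Order-invariant scene embedding: sum per-observation codes, then decode an
        image for a query camera. A toy stand-in for the GQN idea, not the original
        architecture."""
        def __init__(self, pose_dim=7, code_dim=128):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(inplace=True),
                nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.obs_head = nn.Linear(64 + pose_dim, code_dim)
            self.decoder = nn.Sequential(
                nn.Linear(code_dim + pose_dim, 16 * 16 * 32), nn.ReLU(inplace=True),
                nn.Unflatten(1, (32, 16, 16)),
                nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
                nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

        def forward(self, images, poses, query_pose):
            # images: (K, 3, H, W), poses: (K, pose_dim), query_pose: (pose_dim,)
            feats = self.encoder(images)                          # (K, 64)
            codes = self.obs_head(torch.cat([feats, poses], 1))   # (K, code_dim)
            scene = codes.sum(dim=0, keepdim=True)                # order-invariant scene vector
            return self.decoder(torch.cat([scene, query_pose[None]], 1))  # (1, 3, 64, 64)

    model = TinyGQN()
    novel_view = model(torch.rand(3, 3, 64, 64), torch.rand(3, 7), torch.rand(7))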

4.5 Network-based scene representation rendering methods

Whether the scene is represented by voxels, a point cloud, a mesh, or any discrete form, the precision of the representation is limited, so interpolation is always applied. However, implicit representation of a scene utilizing a neural network provides a continuous 3D scene representation with effectively infinite precision, which is independent of the original scene representation. Therefore, methods for rendering implicit scene representations have emerged in recent years.

Oechsle et al.[34] proposed a novel implicit representation of surface light fields that captures the visual appearance of an object. They condition the surface light field with respect to the location and spectrum of a small light source, which allows relighting and novel view synthesis using environment maps or manipulating the light source. Taking the encoding of an input image, the encoding of the corresponding input shape, and a lighting configuration, the conditional implicit surface light field (cSLF) outputs a predicted image on which a photometric loss is computed against the ground truth image. The cSLF network is a two-step model. First, the 3D location, shape feature vector, and image feature vector are mapped to a D-dimensional appearance feature. This appearance feature is a localized appearance representation independent of the viewpoint and lighting condition. Then, the appearance vector, lighting vector, viewpoint, and shape feature vector are fed into the lighting model to synthesize the RGB image. Their cSLF is capable of inferring light fields of novel unseen objects and preserving texture, reflection, and shadow effects. However, their work can only handle relatively simple scenes and lighting conditions and cannot restore specular effects well.

IBRNet[35] proposed a method to synthesize novel view images of complex scenes by interpolating a sparse set of nearby views. Utilizing an MLP and a ray transformer, they estimate radiance and volume density at continuous 5D locations (3D spatial positions and 2D viewing directions), taking only multi-view images as input. Unlike NeRF-based methods that need to be retrained for a novel scene, they learn a generic view interpolation function that generalizes to novel scenes. The framework of their method is very similar to NeRF and is divided into three parts. They first identify a set of neighboring source views and extract their image features using a shared U-Net-based convolutional neural network. Then, for each ray in the target view, IBRNet predicts the colors and densities for each sample along the ray. In practice, they aggregate the image colors, features, and view directions from the neighboring source views as the MLP input and output a color and a density feature. The density features are then passed to the ray transformer, which applies positional encoding and multi-head self-attention to the sequence of density features to predict the final density value for each sample. They use volume rendering to accumulate colors and densities along the ray to render the final image. Although their approach can handle novel unseen scenes, they still need to fine-tune on each scene to obtain comparable results (see Fig. 6).

Fig. 6 Qualitative comparison on real-forward-facing data[99] (Orchid, T-rex, Horns, and Fern scenes). From left to right: ground truth, their fine-tuned method, NeRF[70], their method, and LLFF[99]. Images taken from Wang et al.[35]
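The volume rendering accumulation that IBRNet (and NeRF-style methods generally) performs along each ray is the standard emission-absorption quadrature. A minimal Python sketch, with placeholder per-sample colors and densities, is given below.

    import numpy as np

    def composite_ray(colors, densities, deltas):
        """Emission-absorption quadrature along one ray.
        colors:    (S, 3) per-sample RGB predicted by the network
        densities: (S,)   per-sample volume density (sigma >= 0)
        deltas:    (S,)   distances between consecutive samples"""
        alphas = 1.0 - np.exp(-densities * deltas)          # opacity of each segment
        trans = np.cumprod(1.0 - alphas + 1e-10)            # transmittance after each sample
        trans = np.concatenate([[1.0], trans[:-1]])         # light reaching sample i from the camera
        weights = alphas * trans
        rgb = (weights[:, None] * colors).sum(axis=0)       # accumulated colour
        return rgb, weights                                  # weights also give expected depth, opacity, etc.

    rgb, w = composite_ray(np.random.rand(64, 3), np.random.rand(64), np.full(64, 0.03))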
4.6 Mesh-based scene representation rendering methods

A polygon mesh is the most traditional way to represent scenes for traditional graphics pipelines. However, it is not commonly used in deep learning-based rendering due to the non-differentiable nature of rasterization, which prevents back-propagation. However, there are still works that solve this problem in a number of ways.

The neural 3D mesh renderer[26] proposes a method to approximate gradients for rasterization, which enables the back-propagation of neural networks. They are able to perform single-image mesh reconstruction supervised with silhouette images utilizing the neural renderer. Their method demonstrates the potential of integrating mesh renderers into neural networks. The focus of this work is on the computation of rasterization gradients. Traditional rasterization is a discrete operation that determines the color of each pixel by judging whether it overlaps with the mesh. They replace the sudden change in pixel color caused by the intersection with the mesh with a gradual change using linear interpolation. Thus, the color change becomes a continuous process associated with the mesh vertices.
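The following one-dimensional toy illustrates the underlying idea: hard rasterization gives a coverage value whose derivative with respect to the edge position is zero almost everywhere, while blending the transition linearly over a small band restores a usable gradient. This is a simplified linear-blend illustration, not the exact gradient definition of [26].

    import numpy as np

    def hard_coverage(edge_x, pixel_x):
        """Classic rasterization: the pixel flips colour abruptly as the edge crosses it,
        so d(coverage)/d(edge_x) is zero almost everywhere."""
        return 1.0 if pixel_x < edge_x else 0.0

    def soft_coverage(edge_x, pixel_x, band=1.0):
        """Blend the transition linearly over a band of one pixel: the coverage now
        changes continuously with the edge (vertex) position and has a nonzero slope."""
        return float(np.clip((edge_x - pixel_x) / band + 0.5, 0.0, 1.0))

    pixel = 5.0
    for e in (4.4, 4.8, 5.2, 5.6):
        print(e, hard_coverage(e, pixel), soft_coverage(e, pixel))
    # Pixel colour = coverage * face_colour + (1 - coverage) * background, so gradients
    # with respect to vertex positions can now flow back through the renderer.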
4.7 Global illumination rendering methods

Although the rendering methods mentioned in the above paragraphs can cope with various scene representations and achieve novel view synthesis and relighting, they basically do not handle global illumination caused by light interaction with different objects in the scene.

According to our survey, most deep learning-enhanced global illumination rendering methods do not explicitly use a neural renderer to generate the final rendered image but replace part of the rendering equation with a neural network while integrating. Therefore, we introduce the following works briefly.

Deep illumination[46] presents a novel deep learning technique for approximating global illumination (GI) in real time using conditional generative adversarial networks (cGANs)[45]. Their pipeline is intuitive: First, they generate deferred shading buffers (normal map, direct lighting, diffuse map, depth map) and global illumination ground truth via VXGI[100] and GPU path tracing[101]. Then, the generated buffers are passed to a U-Net-based generator network to predict the GI image. Finally, the predicted GI image and buffers or ground truth image are passed to the discriminator network. Their method is a relatively early work and provides a baseline for global illumination neural rendering. Neural control variates (NCV)[47] propose a method for unbiased variance reduction in parametric Monte Carlo integration. Using a neural network to learn a function that is close to the rendering equation integrand, as well as a neural importance sampler to produce the sampling probability, and another neural network that infers the solution of the integral equation, they dramatically reduce the noise at the cost of negligible visible bias.
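The control-variate identity that NCV builds on can be shown with a hand-picked analytic approximation in place of the learned networks. The sketch below estimates a simple 1D integral this way and is purely illustrative.

    import numpy as np

    def mc_with_control_variate(f, g, G, n=10000, seed=0):
        """Estimate the integral of f over [0, 1] as G + E[f - g], where g approximates f
        and its integral G is known in closed form (in NCV, g and G come from networks)."""
        rng = np.random.default_rng(seed)
        x = rng.random(n)
        return G + np.mean(f(x) - g(x))

    f = np.exp                          # target integrand, true integral = e - 1 ≈ 1.71828
    g = lambda x: 1.0 + x               # cheap approximation of f, integral G = 1.5
    plain = np.mean(f(np.random.default_rng(0).random(10000)))
    cv = mc_with_control_variate(f, g, 1.5)
    # Both estimators are unbiased, but the control-variate one has noticeably lower
    # variance because f - g is nearly constant over [0, 1].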
4.8 Direct illumination rendering methods

In general, direct lighting can be easily and efficiently obtained through rasterization or ray tracing. However, there is still work on learning the rendering of direct lighting through neural networks.

Suppan et al.[44] proposed a neural direct-illumination renderer (NDR) to render direct-illumination images of any geometry with opaque materials under distant illumination. The network framework is relatively simple: Given deferred shading buffers (normal map, roughness, depth), they first generate coarse diffuse and specular shading results; then the illumination, which is encoded as a vector of 75 SH coefficients, combined with the coarse shading and deferred shading buffers, is fed into the NDR to generate the diffuse and specular shading results. Finally, they obtain the final render result by multiplying the albedo input with the shading results and adding the diffuse and specular parts together.
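A small Python sketch of such an SH lighting encoding is given below: it projects a directional radiance function onto the first nine real SH basis functions by Monte Carlo integration and evaluates the reconstruction. The exact order and layout of the 75 coefficients in [44] may differ; this is an independent illustration.

    import numpy as np

    def sh_basis_l2(d):
        """First 9 real spherical harmonics evaluated at unit direction(s) d = (..., 3)."""
        x, y, z = d[..., 0], d[..., 1], d[..., 2]
        return np.stack([
            0.282095 * np.ones_like(x),                                # l = 0
            0.488603 * y, 0.488603 * z, 0.488603 * x,                  # l = 1
            1.092548 * x * y, 1.092548 * y * z,                        # l = 2
            0.315392 * (3.0 * z * z - 1.0),
            1.092548 * x * z, 0.546274 * (x * x - y * y)], axis=-1)

    def project_to_sh(radiance_fn, n=20000, seed=0):
        """Monte Carlo projection of a directional radiance function onto 9 SH coefficients."""
        rng = np.random.default_rng(seed)
        z = 2.0 * rng.random(n) - 1.0
        phi = 2.0 * np.pi * rng.random(n)
        r = np.sqrt(np.maximum(0.0, 1.0 - z * z))
        d = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=-1)   # uniform on the sphere
        vals = radiance_fn(d)[:, None] * sh_basis_l2(d)
        return vals.mean(axis=0) * 4.0 * np.pi                         # divide by the uniform pdf 1/(4*pi)

    env = lambda d: np.maximum(d[..., 2], 0.0)           # a simple "sky from above" radiance lobe
    coeffs = project_to_sh(env)
    approx_up = sh_basis_l2(np.array([0.0, 0.0, 1.0])) @ coeffs   # reconstructed radiance toward +z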
4.9 Post-processing methods

As higher resolutions and refresh rates, as well as more photorealistic effects, bring great challenges to real-time rendering, neural networks are used in the post-processing stage to alleviate the burden on rendering pipelines. Superresolution and frame interpolation enable the rendering pipeline to work at a lower resolution or frame rate and recover the target resolution and frame rate by deep learning methods.

Superresolution was introduced to real-time rendering by NVIDIA. NVIDIA released DLSS 2.0[65] in 2019, which is the first deep learning-enhanced superresolution method that can be applied in practice. However, since DLSS 2.0 relies on NVIDIA's hardware platform, no technical information is publicly available. Xiao et al.[66] proposed NSRR, using U-Net[102] as the backbone to reconstruct the final result with input including the low-resolution color, depth map, and motion vectors over multiple historical frames. High-resolution results can be achieved with NSRR at a real-time frame rate, and most of the high-frequency detail can be recovered.

Frame interpolation in rendering is another way to reduce the rendering workload. Guo et al.[67] proposed ExtraNet to predict an extrapolated frame according to previous frames and the current G-buffers. Briedis et al.[68] presented a frame interpolation method for offline rendering applications.

Deep CG2Real[69] presented a method to improve the quality of OpenGL-rendered images as a two-stage post-processing process. Their two-stage pipeline first generates accurate shading with the supervision of physically-based renderings (PBR). Furthermore, they increase the realism of texture and shading utilizing a CycleGAN[11] network. They demonstrate that their method yields more realistic results compared to other approaches via evaluations on the SUNCG[103] dataset. They first leverage a generative neural network that takes deferred shading buffers (albedo map, normal map, and OpenGL shading map) as input and predicts the PBR shading map. This shading map is then multiplied with the albedo map to generate the PBR image. Note that this training process is supervised with PBR rendering results. Later, another generative neural network predicts the real albedo and shading images, which are responsible for generating the realistic result. This stage of training is supervised with unpaired data. Fig. 7 shows their predicted realistic image compared to the OpenGL image and the CycleGAN result.

Fig. 7 Compared to OpenGL rendering (a) and single-stage prediction with CycleGAN (b), their predicted result (c) restores more realistic lighting and textures. Images taken from Bi et al.[69]
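Both superresolution and frame extrapolation in this setting rely on reprojecting previously rendered frames with per-pixel motion vectors. The sketch below shows a plain backward warp with bilinear sampling; the motion-vector sign convention and the absence of disocclusion handling are simplifying assumptions, and this is not the actual NSRR or ExtraNet code.

    import numpy as np

    def warp_previous_frame(prev, motion):
        """Backward-warp a previous frame into the current one using per-pixel motion.
        prev:   (H, W, C) previous colour frame
        motion: (H, W, 2) offsets (dx, dy) from each current pixel to where that surface
                point was in the previous frame (sign convention is an assumption here)."""
        h, w = prev.shape[:2]
        ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
        sx = np.clip(xs + motion[..., 0], 0, w - 1.001)
        sy = np.clip(ys + motion[..., 1], 0, h - 1.001)
        x0, y0 = np.floor(sx).astype(int), np.floor(sy).astype(int)
        fx, fy = (sx - x0)[..., None], (sy - y0)[..., None]
        top = prev[y0, x0] * (1 - fx) + prev[y0, x0 + 1] * fx
        bot = prev[y0 + 1, x0] * (1 - fx) + prev[y0 + 1, x0 + 1] * fx
        return top * (1 - fy) + bot * fy

    history = warp_previous_frame(np.random.rand(270, 480, 3), np.zeros((270, 480, 2)))
    # The warped history is then fed, together with the current low-resolution colour
    # and depth, into the reconstruction network.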
4.10 Human-related rendering methods

Human-related rendering has drawn great attention in computer graphics, including skin rendering, hair rendering, face rendering, body animation rendering, etc. Traditional computer graphics usually models human appearance as the physical process of light interacting with the human body and renders it with real-time rasterization or offline ray tracing. With the development of neural networks in recent years, deep learning-based human-related rendering methods have started to proliferate and are gradually replacing traditional methods. Some of the methods described below are similar to those in the previous paragraphs, but the focus of this section is on human-related rendering; thus, it is all covered here.

Wei et al.[57] present an adversarial network for rendering photorealistic hair that takes a strand-based 3D hair model as input and provides user control of color and lighting through reference images. Benefiting from the simple forward pass of their network, they achieve a real-time rendering rate. Given a natural image, they generate four processed images by three sequential image processing operators: the segmented hair image, the gray image, the orientation map, and the edge activation map. The segmented hair, gray image, and orientation map are each encoded into their own latent space with a feature vector. Given a 3D hair model at inference time, they first extract the edge activation map from a randomized rendered image of the desired viewpoint. Then generators are applied sequentially in the inverse order of the image processing flow (see Fig. 8).

Fig. 8 Pipeline of [57]. The top row shows the image processing flow of an input natural image, and the bottom row shows the inference flow from right to left. Image taken from Wei et al.[57]

LookinGood[58] proposed a novel method to augment a real-time performance capture system with a deep neural network that takes a coarse rendered textured 3D reconstruction from a novel viewpoint and outputs high-quality rendering results, performing super-resolution, denoising, and completion of the original images. They test their method in two situations: one involving an upper-body reconstruction of an actor from a single RGB-D camera, and the second consisting of full-body capture. They use extra cameras, in addition to the reconstruction camera, to provide ground truth, which achieves self-supervised training. The backbone of LookinGood is a U-Net-like architecture. The system is specifically designed for VR and AR headsets and accounts for consistency between two stereo views. Fig. 9 shows the re-rendering results w.r.t. viewpoint changes. Although their method can be generalized to a different actor, the quality for an unseen actor is reduced.

Fig. 9 Neural re-rendering results of different viewpoints. Image taken from Martin-Brualla et al.[58]

Meka et al.[59] introduced a method that combines traditional graphics pipelines with neural rendering to generate photorealistic renderings of dynamic performances under novel viewpoints and lighting, assuming the availability of an approximate geometry of the subject for every frame of the performance. Their method is capable of rendering unseen subject poses and novel subject identities and significantly outperforms the existing state-of-the-art solutions. A U-Net architecture is first exploited to extract features from the two spherical gradient illumination images of each viewpoint, with each pixel concatenated with the view direction. After acquiring the features, they warp the features of every viewpoint using warp fields and pool all of them together into a single tensor to remove the dependency on the order of the input images, according to feature weights computed by the dot product between the camera viewing direction and the surface normals. The texture coordinate features are then sampled by a warp corresponding to the target camera view to generate the resampled features. Then, they generate a reflection map and a light visibility map and multiply the light visibility map elementwise with the concatenation of the resampled features, reflection map, and view direction, generating the neural diffuse rendering image. Additionally, the resampled features are fed into an alpha matting network, predicting the alpha mask.

Finally, the neural diffuse rendering image and alpha mask are passed through a U-Net that generates the actual rendered images; see Fig. 10 for a comprehensive understanding.

Fig. 10 Neural rendering pipeline. Given a target camera, the UV features are resampled. A neural shading model adds view-related information to the resampled features. An alpha mask is predicted by the alpha matting network. A U-Net finally renders the target images. Image taken from Meka et al.[59]

Rendering with style[60] proposes combining the traditional rendering pipeline and neural rendering of faces, automatically and seamlessly generating full-head photorealistic portrait renders from only a facial skin render without any artist intervention. Their method is also capable of rendering and preserving identity over animated performance sequences. They first synthesize a high-quality skin render via a traditional rendering pipeline, with an alpha mask, from a 3D face geometry and appearance maps. This rendered image is then projected into a pre-trained StyleGAN2 network[104] to realistically inpaint the missing pixels of the portrait (eyes, hair, the interior of the mouth). The final compositing step overlays the raytraced skin appearance on top of the projection results.

5 Specific methods

Usually, only scenes or objects specified during network training can be rendered by specific methods. A new network needs to be retrained for every new scene or object. For example, if the method operates on a single car scene (with a specific lighting condition, in a specific location), then changing the instance of the car, increasing the number of cars, changing the lighting conditions, etc., will disable the network. In general, specific methods produce higher quality than general methods at the expense of training time. The applications described below are specific methods by default, and no additional explanations will be given.

5.1 Voxel-based scene representation rendering methods

DeepVoxels[22] is a learned viewpoint-invariant, persistent, and uniform 3D voxel grid of feature representations that encodes the view-dependent appearance of a 3D scene without explicitly modelling its geometry. The final rendered image is formed based on a 2D network that receives a perspective-resampled version of the 3D volume. The scene-specific DeepVoxels feature representation is formed from a set of multi-view images without explicit 3D supervision. They first extract 2D feature maps using a 2D U-Net and explicitly lift the features to 3D based on a differentiable lifting layer. The lifted 3D feature volume is fused using a gated recurrent network architecture. After feature fusion, the feature volume is processed by a 3D U-Net and then mapped to the camera coordinate systems of the two target views via a differentiable reprojection layer. An occlusion network then computes the soft visibility of each voxel. Finally, a learned 2D U-Net rendering network generates the two final output images. Their network is trained end-to-end by a 2D re-rendering loss that forces the predictions to match the target views. They show several novel view synthesis results on challenging scenes and outperform baseline methods (see Fig. 11).

Fig. 11 Comparison of the best three performing models to ground truth and other samples of novel views generated by their model. From left to right: ground truth, Worrall et al.[105], Isola et al.[15] (Pix2pix), DeepVoxels (ours), and DeepVoxels on test views. Images taken from Sitzmann et al.[22]

5.2 Point-based scene representation rendering methods

In addition to voxel representations, another common method of scene representation is using a point cloud. The advantage of this representation is that there have been works in computer vision to reconstruct point clouds from images, and the reconstructed point clouds can be rendered directly using these methods.

Meshry et al.[30] applied traditional 3D reconstruction to internet photos of a tourist landmark to generate a point cloud corresponding to the landmark. They train a neural rendering network that takes deferred shading buffers (depth, color, semantic labelling) as input and generates realistic renderings of the landmark with relighting and novel view synthesis. Given a large internet photo collection of a scene, they first reconstruct a dense colored point cloud using structure-from-motion[106] and multi-view stereo[107] and then render the point cloud from the viewpoint of each image to generate the aligned dataset. Per-pixel albedo and depth are generated by using point splatting with a z-buffer. To model the different appearances with relighting under a viewpoint, they pre-train an appearance encoder that takes deferred buffers and real images as inputs using a triplet loss. Then a neural renderer is trained using a reconstruction loss and a GAN loss and is finally fine-tuned with the appearance encoder. To account for transient objects in the scene (pedestrians, cars, etc.), they also concatenate a semantic label to the deferred buffer. The ground truth semantic segmentations are computed using DeepLab[108] on the input images, while they train a separate semantic labelling network that takes deferred shading buffers as input for inference. However, their work produces poor results for landmark details, such as text, and when there are not enough input images for a scene.

Aliev et al.[31] presented a novel point-based method that uses a raw point cloud representation of the scene and generates novel view synthesis render results with a learnable neural descriptor for each point and a deep rendering network. They first attach an 8-dimensional descriptor to each point and rasterize the points with a z-buffer at several resolutions corresponding to the given camera parameters. Each rasterization is fed to a different downsampling layer of a U-Net for synthesis of the final render result. They optimize the parameters of the rendering network and the neural descriptors by back-propagating the perceptual loss function. They also show that their approach is able to model and render scenes captured by hand-held RGBD cameras as well as simple RGB streams.
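Both [30] and [31] start by rasterizing the reconstructed point cloud into screen space with a z-buffer before any network runs. The sketch below shows a minimal nearest-point splatting of per-point attributes under an assumed pinhole camera; the loop-based z-test is for clarity only and is not how the published systems implement it.

    import numpy as np

    def splat_points(xyz_cam, attrib, fx, fy, cx, cy, h, w):
        """Project camera-space points and keep, per pixel, the attribute of the nearest one.
        xyz_cam: (N, 3) points in camera coordinates (z > 0 in front of the camera)
        attrib:  (N, C) per-point values (colour, learned descriptor, ...)"""
        image = np.zeros((h, w, attrib.shape[1]))
        zbuf = np.full((h, w), np.inf)
        z = xyz_cam[:, 2]
        valid = z > 1e-6
        u = np.round(fx * xyz_cam[:, 0] / z + cx).astype(int)
        v = np.round(fy * xyz_cam[:, 1] / z + cy).astype(int)
        inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        for i in np.flatnonzero(inside):        # simple loop; real systems do this on the GPU
            if z[i] < zbuf[v[i], u[i]]:
                zbuf[v[i], u[i]] = z[i]
                image[v[i], u[i]] = attrib[i]
        return image, zbuf

    img, depth = splat_points(np.random.rand(1000, 3) + [0.0, 0.0, 2.0],
                              np.random.rand(1000, 8),
                              fx=300, fy=300, cx=160, cy=120, h=240, w=320)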
Dai et al.[32] presented a novel neural point cloud rendering pipeline using multi-plane projections. The neural network takes the raw point cloud of a scene as input and outputs an image or image sequence from novel camera views. They propose a method to project 3D points into a layered volume of the camera frustum so that the network automatically learns the visibility of the 3D points. The whole framework consists of two modules: multi-plane-based voxelization and multi-plane rendering. The first module divides the 3D space of the camera view frustum uniformly into small frustum voxels according to the image size and a predefined number of planes. Aggregation operations are also adopted for each small frustum to generate a multi-plane 3D representation, which is concatenated with the normalized view direction and sent to the render network. The render network predicts a 4-channel output (RGB + blend weight) for each plane. The final output is produced by blending all planes according to the blend weights. Finally, the whole framework is supervised by a perceptual loss. They demonstrate that their method produces more stable renderings compared to previous methods (see Fig. 12).

Fig. 12 Two comparisons on cases with noisy depth. From left to right: (a) point clouds, (b) Pix2pix, (c) NPG, (d) their results, and (e) ground truth. NPG[31] and Pix2Pix[15] either completely miss the correct objects or produce a mixture of foreground and background. Image taken from Dai et al.[32]

Sanzenbacher et al.[33] trained a deep neural network to generate photorealistic rendering results of a specific scene in real time by learning light transport in a static or dynamic scene. Their approach operates in both 3D and 2D space, thus enabling global illumination effects and 3D scene geometry manipulation. They first represent the scene in the form of an unstructured point cloud sampled from the scene's surface and attach additional properties (albedo, light spectrum) to each point. The point cloud is then processed with a light transport layer, which is a PointNet-based architecture[109] with ResNet blocks[110] of depth two, to learn light transport in the scene. The network output is projected into a 2D image. Then, combined with additional image-space information (depth, normal, albedo, view ray), the projection features are sent to the image synthesis layer to synthesize the final image. By minimizing the MSE error between the generated image and a noisy rendering obtained from a physically-based renderer, they jointly optimize the whole model. They also prove that, using noisy images as ground truth, the gradient estimates are unbiased.

5.3 Vector-based scene representation rendering methods

Chen et al.[24] proposed a novel relightable neural renderer (RNR) for novel view synthesis and relighting utilizing multi-view images as input. RNR models the physical rendering process of image generation, specifically in terms of environment lighting, object intrinsic attributes, and the light transport function (LTF). RNR conducts regression on these three individual components rather than translating deep features to appearance. Benefiting from the physically-based rendering process, their method improves the quality of novel view synthesis and relighting. They decompose the rendering equation into albedo, LTFs, and lighting, and use spherical harmonics (SH) to fit the lighting. They first follow the steps of [111] and apply a K-nearest neighbor (K-NN) method to search the neighborhood of each 3D mesh vertex and then apply multiple graph convolutional networks (GCNs) to extract the global features of the 3D geometry as a vector. After that, they repeat and concatenate the feature vector with the U-Net feature map after the first downsampling layer. This U-Net is the light transport net (LTN); it takes the neural texture, normal map, and light direction map as input and outputs a light transport map that contains the per-pixel light transport at each sampled light direction. Finally, they retrieve the radiance at each sampled light direction and integrate it with the albedo and SHs to render the final image. They use the L1 loss for the difference between rendered images and ground truth images. Fig. 13 shows the relighting and novel view synthesis results of RNR.

Fig. 13 Relighting and novel view synthesis results of RNR on real data. Image taken from Chen et al.[24]

Granskog et al.[25] present a technique to adaptively disentangle lighting, material, and geometric information, generating a vector-based scene representation that preserves the orthogonality of these components.

Granskog et al.[25] present a technique to adaptively disentangle lighting, material, and geometric information, generating a vector-based scene representation that preserves the orthogonality of these components. The scene encoding network takes several high-quality observations of the scene attached with deferred shading buffers (position, normal, depth) and camera parameters as input and produces a view-independent neural scene representation vector. This vector is the average of all generated observation feature vectors. For a novel view, the representation vector, camera parameters, and corresponding deferred shading buffer are passed into a neural renderer to obtain an image of the novel viewpoint. Their method is similar to [16] but focuses on adaptively partitioning the neural scene representation and in-depth analysis of existing image generators with respect to the partitioned representation. Since their work disentangles elements in a scene, it is possible to use the lighting of one scene to relight another scene by replacing the lighting part of the scene representation vector (see Fig. 14).

Fig. 14 The lighting partition of scene A replaces the lighting partition of scene B. The relighting result is shown in the right column. Image taken from Granskog et al.[25]
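The averaging of observation encodings can be written in a few lines; the following sketch uses a placeholder convolutional encoder (channel counts and sizes are assumptions, not those of [25]) and only demonstrates how averaging yields an order-independent, view-independent scene vector.

import torch
import torch.nn as nn

class ObservationEncoder(nn.Module):
    """Placeholder encoder: an observation (image stacked with its deferred shading
    buffers) is mapped to a fixed-size feature vector."""
    def __init__(self, in_ch: int = 10, dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(64, dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(obs).flatten(1))

def scene_representation(encoder: nn.Module, observations: torch.Tensor) -> torch.Tensor:
    """Average the per-observation codes into one view-independent scene vector;
    the mean makes the representation independent of observation order and count."""
    codes = encoder(observations)            # N x dim
    return codes.mean(dim=0, keepdim=True)   # 1 x dim

encoder = ObservationEncoder()
obs = torch.randn(4, 10, 64, 64)             # four observations of one scene (toy sizes)
z = scene_representation(encoder, obs)       # averaged neural scene representation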
5.4 Network-based scene representation rendering methods

Scene representation networks (SRNs)[37] is a classic method that proposes a continuous 3D-structure-aware scene representation that encodes both geometry and appearance. They map the world coordinates to a feature representation of local scene properties. Taking only 2D images and their camera poses as input, SRNs can be trained end-to-end with a differentiable ray-marching algorithm. In practice, the scene representation function is represented by a multi-layer perceptron (MLP) that learns to map a spatial location to a feature representation of scene properties of that spatial location. A two-step differentiable ray-marching algorithm is used to generate the final rendered image by first finding the world coordinates of the intersections of the camera ray with scene geometry and then mapping the feature vector to a color. They introduced a ray marching long short-term memory (RM-LSTM) to handle the first problem and a per-pixel MLP to map a single feature vector to a single RGB vector. After training, the view-independent MLP can be queried by a novel camera view with the ray-marching algorithm and then rendered by the per-pixel MLP.
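To make the two-step structure concrete, here is a toy sketch with a coordinate MLP and a per-pixel color decoder; for brevity, the learned RM-LSTM is replaced by a small head that predicts the next marching step from the current feature, so this is only an illustration of the interface, not of SRNs themselves.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneMLP(nn.Module):
    """Coordinate network: 3D point -> feature vector of local scene properties."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, dim), nn.ReLU(),
                                 nn.Linear(dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class PixelMLP(nn.Module):
    """Per-pixel decoder: feature vector -> RGB color."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                 nn.Linear(64, 3), nn.Sigmoid())

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        return self.net(f)

def march_and_shade(scene, step_head, decoder, origins, dirs, n_steps: int = 10):
    """Differentiable marching: repeatedly query the scene network and advance each
    ray by a predicted positive step, then decode the final feature to a color."""
    t = torch.zeros(origins.shape[0], 1)
    for _ in range(n_steps):
        feat = scene(origins + t * dirs)
        t = t + F.softplus(step_head(feat))      # learned step (an LSTM in SRNs)
    return decoder(scene(origins + t * dirs))

scene, decoder = SceneMLP(), PixelMLP()
step_head = nn.Linear(128, 1)
o = torch.zeros(4, 3)                            # toy ray origins
d = F.normalize(torch.randn(4, 3), dim=-1)       # toy ray directions
rgb = march_and_shade(scene, step_head, decoder, o, d)   # 4 x 3 predicted colors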

Differentiable volumetric rendering (DVR)[38] presents a differentiable rendering formulation for implicit shape and texture representations. Similar to [37], they also represent the shape and appearance with a neural network. However, they design an occupancy network assigning a probability of occupancy to every point in 3D space and extract the object surface using isosurface extraction techniques[112] instead of the RM-LSTM network. They also generate the final rendered image directly from the texture field[113] instead of the per-pixel MLP. For the single-view reconstruction task, they process the input image with an encoder and use the output to condition the occupancy network and texture field. They show their multi-view 3D reconstruction results and single-view reconstructions (see Fig. 15). Similar to [37], their method can also achieve novel view synthesis.

Fig. 15 They show the shape, normals, and the textured shape for their method trained with 2D images and sparse depth maps for scan 106 of the DTU dataset. Image taken from Niemeyer et al.[38]
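The surface localization behind such occupancy-based rendering can be sketched as root finding along each ray; the code below uses a toy analytic occupancy in place of the learned network and a coarse march followed by secant refinement, which is one common way to implement the idea (step counts and thresholds are arbitrary).

import torch

def occupancy(x: torch.Tensor) -> torch.Tensor:
    """Toy analytic occupancy standing in for the learned network: a unit sphere."""
    return torch.sigmoid(10.0 * (1.0 - x.norm(dim=-1)))

def find_surface(o, d, t_near=0.0, t_far=4.0, n_coarse=64, n_secant=8, tau=0.5):
    """March coarsely until the occupancy first crosses tau, then refine the
    crossing point with a few secant iterations."""
    ts = torch.linspace(t_near, t_far, n_coarse)
    prev_t, prev_v = ts[0], occupancy(o + ts[0] * d) - tau
    for t in ts[1:]:
        v = occupancy(o + t * d) - tau
        if prev_v < 0 and v >= 0:                  # sign change: surface in (prev_t, t)
            lo, hi, f_lo, f_hi = prev_t, t, prev_v, v
            mid = 0.5 * (lo + hi)
            for _ in range(n_secant):
                mid = lo - f_lo * (hi - lo) / (f_hi - f_lo)
                f_mid = occupancy(o + mid * d) - tau
                if f_mid < 0:
                    lo, f_lo = mid, f_mid
                else:
                    hi, f_hi = mid, f_mid
            return o + mid * d                     # estimated surface point
        prev_t, prev_v = t, v
    return None                                    # the ray misses the object

hit = find_surface(torch.tensor([0.0, 0.0, -3.0]), torch.tensor([0.0, 0.0, 1.0]))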
Yariv et al.[36] modelled the volume density as a function of the geometry, different from previous work modelling the geometry as a function of the volume density. They defined the volume density function as Laplace's cumulative distribution function (CDF) applied to a signed distance function (SDF) representation. This new density representation produces high-quality geometry reconstruction and enables the disentanglement of shape and appearance. Their framework consists of two MLPs, the first approximating the SDF of the learned geometry and a global geometry feature of dimension 256, and the second representing the scene's radiance field. Fig. 16 shows qualitative results sampled from the BlendedMVS[115] dataset.

Fig. 16 Qualitative results sampled from the BlendedMVS dataset. From top to bottom: the ground truth, the NeRF++[114] results, and their results. Image taken from Yariv et al.[36]
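Following the Laplace-CDF construction described above, the density can be written compactly as below; the exact parameterization (for example, how the two scale parameters are tied or learned) follows [36], so treat this as a schematic form rather than a verbatim reproduction:

\sigma(x) = \alpha\,\Psi_\beta\!\big(-d(x)\big), \qquad
\Psi_\beta(s) =
\begin{cases}
\tfrac{1}{2}\exp\!\big(\tfrac{s}{\beta}\big), & s \le 0,\\
1 - \tfrac{1}{2}\exp\!\big(-\tfrac{s}{\beta}\big), & s > 0,
\end{cases}

where d(x) is the signed distance (positive outside the surface) and α, β > 0 are scale parameters; as β → 0 the density approaches a scaled indicator of the object interior, which is what ties the density to the geometry.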
Neural lumigraph rendering (NLR)[39] implicitly represents a scene surface and radiance field using a neural network; it accelerates state-of-the-art neural rendering by approximately two orders of magnitude and is compatible with traditional graphics pipelines, which enables real-time rendering rates. They represent both the shape and appearance of 3D objects similar to IDR[116]. However, their backbone network is a sinusoidal representation network (SIREN)[117] instead of an MLP. They model the shapes of the scene as a SIREN-based SDF representation, and the appearance is modelled as a radiance field over viewing directions. They take multi-view 2D images and object masks as input to supervise the 3D representation. The loss function is relatively complex and contains an L1 image reconstruction loss for true foreground pixels, an eikonal constraint to regularize the scene representation network, a soft mask loss proposed in [116] defined for the non-foreground pixels, and a smoothness term to linearize the angular behavior of SIREN. Compared to NeRF-based methods, they only need to use sphere tracing to find the first intersection of the ray and the model and then query the value of the radiance field, without accumulating samples along the ray, which leads to a faster rendering process. They also embedded their method into the traditional rasterizing pipeline to achieve a real-time rendering rate by extracting a mesh from the SDF using marching cubes and then rasterizing the mesh using OpenGL to compute the vertex position buffer and the angles between the ray towards the current rendering camera and the rays towards each of the projective texture map viewpoints. Finally, they apply the unstructured lumigraph rendering technique[118], generating the rendered image.
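Sphere tracing itself is only a few lines; the sketch below uses a toy analytic SDF in place of the SIREN network to show why a single root-finding walk per ray is cheaper than accumulating many samples.

import torch

def sdf(x: torch.Tensor) -> torch.Tensor:
    """Toy SDF standing in for the SIREN network: a unit sphere at the origin."""
    return x.norm(dim=-1, keepdim=True) - 1.0

def sphere_trace(o, d, n_iter: int = 64, eps: float = 1e-4, t_max: float = 10.0):
    """Advance each ray by the SDF value, which is a safe step size; after enough
    iterations the rays that hit the surface have converged onto it."""
    t = torch.zeros(o.shape[0], 1)
    for _ in range(n_iter):
        t = t + sdf(o + t * d)
    x = o + t * d
    hit = (sdf(x).abs() < eps) & (t < t_max)
    return x, hit

o = torch.tensor([[0.0, 0.0, -3.0]])
d = torch.tensor([[0.0, 0.0, 1.0]])
x_surf, mask = sphere_trace(o, d)     # converges to (0, 0, -1), the first intersection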
5.5 Mesh-based scene representation methods

Deep surface light fields (DSLF)[27] present a neural network called the DSLF to model light transport on the vertices of an object mesh using only moderate sampling of multi-view images. Their DSLF can achieve a high data compression ratio while performing real-time rendering on the GPU. They first obtain the 3D model of the object as a mesh by structure-from-motion (SFM) and then register the multi-view images with the mesh using feature extraction and matching and perspective-n-point (PnP)[119] techniques. They also conduct texture-aware remeshing to avoid blurring of the line features. The deep network finally takes the vertex position (represented by texture coordinates) and ray direction as input and outputs the final light transport of that vertex. During rendering, their method integrates well with traditional rasterization pipelines by replacing the vertex shader with the DSLF to predict the vertex color. Fig. 17 shows the novel view synthesis results on synthetic and real scenes.

Fig. 17 DSLF rendering results from different viewpoints. Their method produces high fidelity results in both real and synthetic scenes with different materials. Images taken from Chen et al.[27]
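The per-vertex interface can be illustrated with a small MLP that maps texture coordinates and a view direction to a color; the real DSLF uses a more structured, factorized architecture, so this sketch only shows the inputs and outputs of the vertex-shader replacement.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SurfaceLightField(nn.Module):
    """Per-vertex light field sketch: texture coordinates plus view direction -> RGB."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, uv: torch.Tensor, view_dir: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([uv, view_dir], dim=-1))

model = SurfaceLightField()
uv = torch.rand(1024, 2)                          # per-vertex texture coordinates
dirs = F.normalize(torch.randn(1024, 3), dim=-1)  # per-vertex view directions
colors = model(uv, dirs)                          # vertex colors, in place of a vertex shader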

Deferred neural rendering[28] introduces the neural textures, a learned feature map that is stored as maps on top of 3D mesh proxies that contain significantly more information than traditional textures. Different from the original 2D generative neural networks, their method achieves explicit control over the generated output. They show the effectiveness of their method on novel view synthesis, scene editing, and facial reenactment. They first obtain the coarse geometric proxy with UV-map parameterization and camera parameters using the COLMAP[107] structure-from-motion technique. Taking the geometry mesh and a neural texture as input, the standard graphics pipeline is used to render a view-dependent screen space feature map. This feature map is then converted to a photorealistic image via a U-Net-based deferred neural renderer.
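A compact way to see the data flow is to sample a learnable UV-space feature map with the rasterized UV coordinates and hand the result to an image-space network; the sketch below stands in a tiny convolutional stack for the U-Net and picks channel counts arbitrarily.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTexture(nn.Module):
    """Learnable feature map attached to the mesh's UV atlas (16 channels, illustrative)."""
    def __init__(self, channels: int = 16, size: int = 512):
        super().__init__()
        self.tex = nn.Parameter(0.01 * torch.randn(1, channels, size, size))

    def forward(self, uv: torch.Tensor) -> torch.Tensor:
        # uv: B x H x W x 2 in [0, 1], the rasterized UV coordinates of the visible surface
        grid = uv * 2.0 - 1.0                                  # grid_sample expects [-1, 1]
        tex = self.tex.expand(uv.shape[0], -1, -1, -1)
        return F.grid_sample(tex, grid, mode='bilinear', align_corners=False)

renderer = nn.Sequential(                                      # stand-in for the U-Net renderer
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())

uv_image = torch.rand(1, 256, 256, 2)                          # would come from rasterizing the proxy mesh
features = NeuralTexture()(uv_image)                           # view-dependent screen-space feature map
rgb = renderer(features)                                       # photorealistic image in the full method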
Deferred neural lighting[29] proposes a novel method for novel viewpoint relighting of a specific scene. Different from traditional methods, which require dense samples of the view direction and lighting condition combination, their method utilizes unstructured photographs taken from a handheld acquisition scheme that only requires two cellphones. They demonstrate the effectiveness of their method in a variety of real-world and synthetic scenes. Similar to deferred neural rendering[28], they also reconstructed the geometric mesh with UV-mapping via COLMAP and generate a neural texture to represent the feature of the object. Instead of directly passing the projected neural texture to a neural renderer, they combined (via per-pixel multiplication) radiance cues, which are synthesized by rendering scene-independent basis materials under the target light onto the rough geometry, with the projected neural texture, and passed the result to a neural renderer to produce the final relit image of the scene. Moreover, they also predict a binary mask from the projected neural texture for compositing the relit appearance. Fig. 18 shows their simultaneous novel view synthesis and relighting results on captured scenes.

Fig. 18 Qualitative comparison between captured scenes' ground truth (right) and the simultaneous novel view synthesis and relighting results of their method (left). Difference images are shown in the insets. Images taken from Gao et al.[29]

5.6 Global illumination rendering methods

Ren et al.[48] proposed the first method to model global illumination with a neural network. They introduce a radiance regression function (RRF), presenting a non-linear mapping from local attributes to indirect illuminations. They first define the closed-form of indirect illumination and then train an MLP that takes position, view direction, point light position, and normal and reflectance parameters as input and predicts the RGB components of indirect illuminations. Combined with direct illumination, they finally obtain the global illumination result. Neural radiosity[49] directly uses a neural network to predict the solution of the rendering equation by minimizing the norm of its residual for each point in a 3D scene. They derive the MC estimate of the residual norm and the MC approximation of the residual norm gradient with respect to network parameters. Different from traditional neural network optimization, their model optimizes network parameters in the traditional ray tracing framework and computes gradients using the formula they derived. After training, images from arbitrary viewpoints can be computed efficiently (see Fig. 19).

Fig. 19 Multiple views of a solution of their network. Images taken from Hadadan et al.[49]
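The quantity minimized by neural radiosity can be stated schematically in terms of the rendering equation; the notation below (emitted radiance E, BRDF f, and the ray-casting function x' giving the point visible from x in direction ω_i) is introduced here only to make the residual explicit and is not a verbatim reproduction of [49]:

r_\theta(x,\omega_o) = L_\theta(x,\omega_o) - E(x,\omega_o) - \int_{\mathcal{H}^2} f(x,\omega_i,\omega_o)\, L_\theta\!\big(x'(x,\omega_i),-\omega_i\big)\,\lvert\cos\theta_i\rvert\,\mathrm{d}\omega_i,

and the training loss is the squared norm of r_θ integrated over surface points and outgoing directions, with both the inner scattering integral and the outer norm (and its gradient with respect to θ) estimated by Monte Carlo sampling.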

Diolatzis et al.[50] introduced the latest active exploration (AE) method using Markov chain Monte Carlo (MCMC) to render novel scene instances given an explicit parameterization of the scene configuration. The scene configuration controls the variables of the scene, such as changing objects, materials, lights, and viewpoint. The MCMC method generates scene configuration samples that best help training hard light transport paths (e.g., caustics and transmittance). During training, they explicitly model the scene and the set of variable parameters as a vector. They generate difficult instances of the variable scene to guide the PixelGenerator network using AE. In addition to the scene representation vector, the PixelGenerator also takes auxiliary deferred shading buffers as input and predicts the global illumination image. At inference time, the explicit scene representation vector which contains the requested configuration information is fed into the PixelGenerator with deferred shading buffers of the corresponding scene configuration to predict the final image. Note that their method is relatively efficient (4–6 FPS) and is capable of interactively altering the scene illumination by moving objects, changing the viewpoint, and modifying materials. Fig. 20 shows the interactive rendering results of their method and the controllable variables depicted in red.

Fig. 20 Interactive rendering results. From left to right: Ground truth path traced images; their interactive neural renderer with 4 to 6 FPS; the rendering result of a varied scene; the variable parts of the scene depicted in red. Images taken from Diolatzis et al.[50]

5.7 Human-related rendering methods

Lombardi et al.[61] introduced a data-driven deep appearance model for rendering the human face that learns both facial geometry and appearance from a multi-view capture system. Their method generates realistic novel view images with no need for an accurate geometry model, which is a significant departure from the traditional graphics pipeline. They also integrated their model with an unsupervised technique for mapping images to facial states for use in virtual reality applications. Beginning with multi-view input photos of an identity and a reconstructed mesh, they first unwrap the photos to generate the view-specific texture maps. They then computed the average texture of the texture maps. The average texture and the mesh are sent to a variational autoencoder (VAE)[12] which is conditioned by an output viewpoint and predicts a mesh and view-specific texture corresponding to the output viewpoint. With texture and geometry, it can easily render images from a novel point of view. The whole VAE is supervised by the generated mesh and view-specific texture reconstruction loss.
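The view-conditioned decoding step can be sketched as follows; the module and its layer sizes are illustrative assumptions rather than the architecture of [61], and the sketch only shows that geometry is decoded from the latent code alone while the texture additionally depends on the target viewpoint.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewConditionedDecoder(nn.Module):
    """Sketch: a latent facial-state code plus the target viewpoint produce a
    view-specific texture and mesh vertex positions (sizes are illustrative)."""
    def __init__(self, z_dim=128, view_dim=3, n_verts=1000, tex=64):
        super().__init__()
        self.tex_head = nn.Sequential(
            nn.Linear(z_dim + view_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * tex * tex))
        self.mesh_head = nn.Sequential(
            nn.Linear(z_dim, 256), nn.ReLU(),
            nn.Linear(256, 3 * n_verts))
        self.tex, self.n_verts = tex, n_verts

    def forward(self, z, view):
        texture = self.tex_head(torch.cat([z, view], dim=-1))   # view-dependent appearance
        mesh = self.mesh_head(z)                                 # view-independent geometry
        return texture.view(-1, 3, self.tex, self.tex), mesh.view(-1, self.n_verts, 3)

dec = ViewConditionedDecoder()
z = torch.randn(2, 128)                               # latent code from the VAE encoder
view = F.normalize(torch.randn(2, 3), dim=-1)         # target viewpoint (toy encoding)
texture, mesh = dec(z, view)                          # view-specific texture + mesh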
Liu et al.[62] proposed a method to generate video-realistic animations of real humans under user control. Compared to traditional human character rendering, they do not require a high-quality photorealistic 3D model, but a video sequence and a 3D template model of the person. They first reconstruct a 3D character model of the target person from static posture images and then obtain the training motion data from the monocular training video based on the method of [120]. These motion data are then fed to the character-to-image translation network with the color and depth of body part images to produce video-realistic output frames. At inference time, the character-to-image translation network takes the source motion data with the conditioning input (color and depth body part images with a 3D character model) as input to reenact the target human character.

The source motion data can be provided not only by monocular video but also by user-defined motion or motion capture data. Their results outperformed the state-of-the-art methods in learning-based human video synthesis.

Wu et al.[63] present a neural human renderer (NHR) for rendering photorealistic free-view video (FVV) from dynamic human captures under a multi-view setting. Experiments show that NHR outperforms the state-of-the-art neural and image-based rendering techniques, especially on hands, hair, nose, foot, etc. As the input to the NHR, the multi-view stereo (MVS), which consists of a synchronized, multi-view video sequence, is exploited to construct a point cloud at each frame. Each point in the point cloud has color, computed through reprojection on the input view images. Next, feature extraction (FE) based on PointNet++[121] was used to process the spatiotemporal point cloud sequence, generating 3D-point descriptors. The descriptor with camera parameters is projected and rasterized to produce a feature map and depth map corresponding to the viewpoint. Finally, a U-Net-based renderer maps the feature map and depth map to the output RGB image and mask image. The point cloud reconstructed from MVS produces patches of holes on textureless or occluded regions. Thus, they refine their geometry by rendering a dense set of new views, using the resulting masks as silhouettes, and conducting visual hull reconstruction based on space-carving or shape-from-silhouettes (SfS). Fig. 21 shows the FVV results on a dance scene.

Fig. 21 Free view video results on a challenging dance scene using NHR. Red blouses impose significant challenges in 3D reconstruction. Image taken from Wu et al.[63]
Zhang et al.[64] proposed a method for learning a neural representation of light transport (LT) of the human body with a rough 3D geometry and multi-view one light at a time (OLAT) images. They model non-diffuse and global LT in texture space as residuals added to physically based diffuse rendering and enable high-quality (with complex material effects and global illumination) novel view synthesis and relighting simultaneously. Their framework consists of two main paths: the observation path and query path. The observation path first takes several nearby texture-space residual maps (observed minus diffuse base) sampled around the target light and viewing direction. The physically-based diffuse base textures are generated by multiplying the albedo texture, the light cosines texture, and the view visibility texture corresponding to the sampled view and lighting direction. The residual maps are then fed into an encoder generating multiscale features that are pooled to remove the dependence on their order and number. The pooled feature is concatenated to the feature activations of the query path network, which takes the query light cosines map, query view cosines map, and diffuse base map of the target view and lighting direction. The query path network then synthesizes the non-diffuse residuals, which represent global illumination and complex material effects. Finally, the non-diffuse residuals and diffuse base textures are warped to image space utilizing the UV warping predefined by the coarse geometry. Fig. 22 shows the simultaneous relighting and view synthesis results of their method.

Fig. 22 Their model is able to perform simultaneous relighting and view synthesis, and it produces accurate renderings for unobserved viewpoints and light directions. Image taken from Zhang et al.[64]
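The diffuse base construction is just a per-texel product; the sketch below shows that step with toy texture-space maps (resolutions and values are placeholders), followed by the addition of the predicted non-diffuse residuals described above.

import torch

def diffuse_base(albedo, light_cos, view_vis):
    """Texture-space diffuse base for one light/view pair: a per-texel product of the
    albedo texture, the clamped light-cosine texture, and the view-visibility texture."""
    return albedo * light_cos.clamp(min=0.0) * view_vis

# Toy texture-space maps (H x W resolution, 3 channels for albedo).
H, W = 256, 256
albedo = torch.rand(3, H, W)
light_cos = torch.rand(1, H, W)                 # cos(theta_light) baked per texel
view_vis = (torch.rand(1, H, W) > 0.2).float()  # binary view visibility per texel
base = diffuse_base(albedo, light_cos, view_vis)

# The learned query path predicts non-diffuse residuals that are added on top.
residual = torch.zeros_like(base)               # placeholder for the network's output
relit_texture = base + residual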

6 Conclusions

Deep learning-enhanced rendering has drawn great attention in both computer graphics and computer vision research fields in recent years. This state-of-the-art report spans a variety of use cases that range from general and specific methods of ambient occlusion generation, volume and subsurface rendering, multiple scene representation rendering, global illumination rendering, direct illumination rendering, and human-related rendering, which embodies the rapid growth of deep learning-enhanced rendering methods. Deep learning-enhanced rendering has already shown impressive ability at real-time global illumination rendering, novel view synthesis, and relighting with only several images as input. We believe that the traditional graphics rendering pipeline can be partially or completely replaced by deep learning-enhanced rendering in the future. We hope that our report can provide researchers with a deep understanding of deep learning-enhanced rendering, and help them develop the next generation of deep learning-enhanced rendering and graphics applications.

Declarations of conflict of interest

The authors declared that they have no conflicts of interest to this work.

References

[1] C. Donner, H. W. Jensen. A spectral BSSRDF for shading human skin. In Proceedings of the 17th Eurographics Conference on Rendering Techniques, Eurographics Association, Nicosia, Cyprus, pp. 409–417, 2006.
[2] L. Q. Yan, M. Hašan, B. Walter, S. Marschner, R. Ramamoorthi. Rendering specular microgeometry with wave optics. ACM Transactions on Graphics, vol. 37, no. 4, Article number 75, 2018. DOI: 10.1145/3197517.3201351.
[3] L. Q. Yan, W. L. Sun, H. W. Jensen, R. Ramamoorthi. A BSSRDF model for efficient rendering of fur with global illumination. ACM Transactions on Graphics, vol. 36, no. 6, Article number 208, 2017. DOI: 10.1145/3130800.3130802.
[4] E. Veach, L. J. Guibas. Metropolis light transport. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, ACM, Los Angeles, USA, pp. 65–76, 1997. DOI: 10.1145/258734.258775.
[5] M. Pauly, T. Kollig, A. Keller. Metropolis light transport for participating media. In Proceedings of Eurographics Workshop on Rendering Techniques, Springer, Brno, Czech Republic, pp. 11–22, 2000. DOI: 10.1007/978-3-7091-6303-0_2.
[6] Y. Ouyang, S. Liu, M. Kettunen, M. Pharr, J. Pantaleoni. ReSTIR GI: Path resampling for real-time path tracing. Computer Graphics Forum, vol. 40, no. 8, pp. 17–29, 2021. DOI: 10.1111/cgf.14378.
[7] M. McGuire, M. Mara, D. Nowrouzezahrai, D. Luebke. Real-time global illumination using precomputed light field probes. In Proceedings of the 21st ACM SIGGRAPH Symposium on Interactive 3D Graphics and Games, San Francisco, USA, pp. 2, 2017. DOI: 10.1145/3023368.3023378.
[8] D. P. Fan, Z. L. Huang, P. Zheng, H. Liu, X. B. Qin, L. Van Gool. Facial-sketch synthesis: A new challenge. Machine Intelligence Research, vol. 19, no. 4, pp. 257–287, 2022. DOI: 10.1007/s11633-022-1349-9.
[9] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 2672–2680, 2014.
[10] M. Arjovsky, S. Chintala, L. Bottou. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, pp. 214–223, 2017.
[11] J. Y. Zhu, T. Park, P. Isola, A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of IEEE International Conference on Computer Vision, Venice, Italy, pp. 2242–2251, 2017. DOI: 10.1109/ICCV.2017.244.
[12] D. P. Kingma, M. Welling. Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations, Banff, Canada, 2013. DOI: 10.48550/arXiv.1312.6114.
[13] Y. C. Pu, Z. Gan, R. Henao, X. Yuan, C. Y. Li, A. Stevens, L. Carin. Variational autoencoder for deep learning of images, labels and captions. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, pp. 2360–2368, 2016.
[14] A. Vahdat, J. Kautz. NVAE: A deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems, vol. 33, pp. 19667–19679, 2020.
[15] P. Isola, J. Y. Zhu, T. H. Zhou, A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, USA, pp. 5967–5976, 2017. DOI: 10.1109/CVPR.2017.632.
[16] S. M. A. Eslami, D. Jimenez Rezende, F. Besse, F. Viola, A. S. Morcos, M. Garnelo, A. Ruderman, A. A. Rusu, I. Danihelka, K. Gregor, D. P. Reichert, L. Buesing, T. Weber, O. Vinyals, D. Rosenbaum, N. Rabinowitz, H. King, C. Hillier, M. Botvinick, D. Wierstra, K. Kavukcuoglu, D. Hassabis. Neural scene representation and rendering. Science, vol. 360, no. 6394, pp. 1204–1210, 2018. DOI: 10.1126/science.aar6170.
[17] A. Tewari, O. Fried, J. Thies, V. Sitzmann, S. Lombardi, K. Sunkavalli, R. Martin-Brualla, T. Simon, J. Saragih, M. Nießner, R. Pandey, S. Fanello, G. Wetzstein, J. Y. Zhu, C. Theobalt, M. Agrawala, E. Shechtman, D. B. Goldman, M. Zollhöfer. State of the art on neural rendering. Computer Graphics Forum, vol. 39, no. 2, pp. 701–727, 2020. DOI: 10.1111/cgf.14022.
[18] C. Zhang, T. Chen. A survey on image-based rendering – representation, sampling and compression. Signal Processing: Image Communication, vol. 19, no. 1, pp. 1–28, 2004. DOI: 10.1016/j.image.2003.07.001.
[19] J. Y. Zhu, Z. T. Zhang, C. K. Zhang, J. J. Wu, A. Torralba, J. B. Tenenbaum, W. T. Freeman. Visual object networks: Image generation with disentangled 3D representation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Canada, pp. 118–129, 2018.
[20] T. H. Nguyen-Phuoc, C. Li, S. Balaban, Y. L. Yang. RenderNet: A deep convolutional network for differentiable rendering from 3D shapes. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, Canada, pp. 7902–7912, 2018.
[21] K. Rematas, V. Ferrari. Neural voxel renderer: Learning an accurate and controllable rendering tool. In Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition, IEEE, Seattle, USA, pp. 5416–

5426, 2020. DOI: 10.1109/CVPR42600.2020.00546. In Proceedings of International Conference on 3D Vision,


IEEE, Fukuoka, Japan, pp. 452–462, 2020. DOI: 10.1109/
[22] V. Sitzmann, J. Thies, F. Heide, M. Nießner, G. Wetz-
3DV50981.2020.00055.
stein, M. Zollhöfer. Deepvoxels: Learning persistent 3D
feature embeddings. In Proceedings of IEEE/CVF Con- [35] Q. Q. Wang, Z. C. Wang, K. Genova, P. Srinivasan, H.
ference on Computer Vision and Pattern Recognition, Zhou, J. T. Barron, R. Martin-Brualla, N. Snavely, T.
IEEE, Long Beach, USA, pp. 2432–2441, 2019. DOI: 10. Funkhouser. IBRNet: Learning multi-view image-based
1109/CVPR.2019.00254. rendering. In Proceedings of IEEE/CVF Conference on
Computer Vision and Pattern Recognition, IEEE,
[23] Y. Liao, K. Schwarz, L. Mescheder, A. Geiger. Towards Nashville, USA, pp. 4688–4697, 2021. DOI: 10.1109/CV-
unsupervised learning of generative models for 3D con- PR46437.2021.00466.
trollable image synthesis. In Proceedings of IEEE/CVF
Conference on Computer Vision and Pattern Recogni- [36] L. Yariv, J. T. Gu, Y. Kasten, Y. Lipman. Volume ren-
tion, IEEE, Seattle, USA, pp. 5870–5879, 2020. DOI: 10. dering of neural implicit surfaces. In Proceedings of the
1109/CVPR42600.2020.00591. 34th International Conference on Neural Information
Processing Systems, pp. 4805–4815, 2021.
[24] Z. Chen, A. P. Chen, G. L. Zhang, C. Y. Wang, Y. Ji, K.
N. Kutulakos, J. Y. Yu. A neural rendering framework for [37] V. Sitzmann, M. Zollhöfer, G. Wetzstein. Scene repres-
free-viewpoint relighting. In Proceedings of IEEE/ entation networks: Continuous 3D-structure-aware neur-
CVF Conference on Computer Vision and Pattern Recog- al scene representations. In Proceedings of the 33rd Inter-
nition, IEEE, Seattle, USA, pp. 5598–5609, 2020. DOI: national Conference on Neural Information Processing
10.1109/CVPR42600.2020.00564. Systems, NeurIPS, Vancouver, Canada, pp. 101, 2019.
[25] J. Granskog, F. Rousselle, M. Papas, J. Novák. Composi- [38] M. Niemeyer, L. Mescheder, M. Oechsle, A. Geiger. Dif-
tional neural scene representations for shading inference. ferentiable volumetric rendering: Learning implicit 3D
ACM Transactions on Graphics, vol. 39, no. 4, Article representations without 3D supervision. In Proceedings of
number 135, 2020. DOI: 10.1145/3386569.3392475. IEEE/CVF Conference on Computer Vision and Pattern
Recognition, IEEE, Seattle, USA, pp. 3501–3512, 2020.
[26] H. Kato, Y. Ushiku, T. Harada. Neural 3D mesh renderer. DOI: 10.1109/CVPR42600.2020.00356.
In Proceedings of IEEE/CVF Conference on Computer
Vision and Pattern Recognition, IEEE, Salt Lake City, [39] P. Kellnhofer, L. C. Jebe, A. Jones, R. Spicer, K. Pulli, G.
USA, pp. 3907–3916, 2018. DOI: 10.1109/CVPR.2018. Wetzstein. Neural lumigraph rendering. In Proceedings of
00411. IEEE/CVF Conference on Computer Vision and Pattern
Recognition, IEEE, Nashville, USA, pp. 4285–4295, 2021.
[27] A. P. Chen, M. Y. Wu, Y. L. Zhang, N. Y. Li, J. Lu, S. H. DOI: 10.1109/CVPR46437.2021.00427.
Gao, J. Y. Yu. Deep surface light fields. Proceedings of
ACM on Computer Graphics and Interactive Techniques, [40] O. Nalbach, E. Arabadzhiyska, D. Mehta, H. P. Seidel, T.
vol. 1, no. 1, Article number 14, 2018. DOI: 10.1145/ Ritschel. Deep shading: Convolutional neural networks
3203192. for screen space shading. Computer Graphics Forum,
vol. 36, no. 4, pp. 65–78, 2017. DOI: 10.1111/cgf.13225.
[28] J. Thies, M. Zollhöfer, M. Nießner. Deferred neural ren-
dering: Image synthesis using neural textures. ACM [41] U. Erra, N. Capece, R. Agatiello. Ambient occlusion bak-
Transactions on Graphics, vol. 38, no. 4, Article number ing via a feed-forward neural network. In Proceedings of
66, 2019. DOI: 10.1145/3306346.3323035. European Association for Computer Graphics: Short Pa-
pers, Lyon, France, pp. 13–16, 2017. DOI: 10.2312/egsh.
[29] D. Gao, G. J. Chen, Y. Dong, P. Peers, K. Xu, X. Tong. 20171003.
Deferred neural lighting: Free-viewpoint relighting from
unstructured photographs. ACM Transactions on Graph- [42] D. J. Zhang, C. H. Xian, G. L. Luo, Y. H. Xiong, C. Han.
ics, vol. 39, no. 6, Article number 258, 2020. DOI: 10. DeepAO: Efficient screen space ambient occlusion gener-
1145/3414685.3417767. ation via deep network. IEEE Access, vol. 8, pp. 64434–
64441, 2020. DOI: 10.1109/ACCESS.2020.2984771.
[30] M. Meshry, D. B. Goldman, S. Khamis, H. Hoppe, R.
Pandey, N. Snavely, R. Martin-Brualla. Neural rerender- [43] L. Ren, Y. Song. AOGAN: A generative adversarial net-
ing in the wild. In Proceedings of IEEE/CVF Confer- work for screen space ambient occlusion. Computational
ence on Computer Vision and Pattern Recognition, Visual Media, vol. 8, no. 8, pp. 483–494, 2022. DOI: 10.
IEEE, Long Beach, USA, pp. 6871–6880, 2019. DOI: 10. 1007/s41095-021-0248-2.
1109/CVPR.2019.00704. [44] C. Suppan, A. Chalmers, J. Zhao, A. Doronin, T. Rhee.
[31] K. A. Aliev, A. Sevastopolsky, M. Kolos, D. Ulyanov, V. Neural screen space rendering of direct illumination. In
Lempitsky. Neural point-based graphics. In Proceedings Proceedings of the 29th Pacific Conference on Computer
of the 16th European Conference on Computer Vision, Graphics and Applications, Pacific Graphics, Wellington,
New Zealand, pp. 37–42, 2021.
Springer, Glasgow, UK, pp. 696–712, 2020. DOI: 10.1007/
978-3-030-58542-6_42. [45] M. Mirza, S. Osindero. Conditional generative adversari-
al nets, [Online], Available: https://arxiv.org/abs/1411.
[32] P. Dai, Y. D. Zhang, Z. W. Li, S. C. Liu, B. Zeng. Neural
1784, 2014.
point cloud rendering via multi-plane projection. In Pro-
ceedings of IEEE/CVF Conference on Computer Vision [46] M. M. Thomas, A. G. Forbes. Deep illumination: Approx-
and Pattern Recognition, IEEE, Seattle, USA, pp. 7827– imating dynamic global illumination with generative ad-
7836, 2020. DOI: 10.1109/CVPR42600.2020.00785. versarial network, [Online], Available: https://arxiv.org/
abs/1710.09834, 2017.
[33] P. Sanzenbacher, L. Mescheder, A. Geiger. Learning
neural light transport, [Online], Available: https://arxiv. [47] T. Müller, F. Rousselle, A. Keller, J. Novák. Neural con-
org/abs/2006.03427, 2020. trol variates. ACM Transactions on Graphics, vol. 39,
no. 6, Article number 243, 2020. DOI: 10.1145/3414685.
[34] M. Oechsle, M. Niemeyer, C. Reiser, L. Mescheder, T.
3417804.
Strauss, A. Geiger. Learning implicit surface light fields.

[48] P. R. Ren, J. P. Wang, M. M. Gong, S. Lin, X. Tong, B. [61] S. Lombardi, J. Saragih, T. Simon, Y. Sheikh. Deep ap-
N. Guo. Global illumination with radiance regression pearance models for face rendering. ACM Transactions
functions. ACM Transactions on Graphics, vol. 32, no. 4, on Graphics, vol. 37, no. 4, Article number 68, 2018. DOI:
Article number 130, 2013. DOI: 10.1145/2461912. 10.1145/3197517.3201401.
2462009.
[62] L. J. Liu, W. P. Xu, M. Zollhöfer, H. Kim, F. Bernard, M.
[49] S. Hadadan, S. H. Chen, M. Zwicker. Neural radiosity. Habermann, W. P. Wang, C. Theobalt. Neural rendering
ACM Transactions on Graphics, vol. 40, no. 6, Article and reenactment of human actor videos. ACM Transac-
number 236, 2021. DOI: 10.1145/3478513.3480569. tions on Graphics, vol. 38, no. 5, Article number 139,
2019. DOI: 10.1145/3333002.
[50] S. Diolatzis, J. Philip, G. Drettakis. Active exploration
for neural global illumination of variable scenes. ACM [63] M. Y. Wu, Y. H. Wang, Q. Hu, J. Y. Yu. Multi-view
Transactions on Graphics, vol. 41, no. 5, Article number neural human rendering. In Proceedings of IEEE/CVF
171, 2022. DOI: 10.1145/3522735. Conference on Computer Vision and Pattern Recogni-
tion, IEEE, Seattle, USA, pp. 1679–1688, 2020. DOI: 10.
[51] S. Kallweit, T. Müller, B. Mcwilliams, M. Gross, J. 1109/CVPR42600.2020.00175.
Novák. Deep scattering: Rendering atmospheric clouds
with radiance-predicting neural networks. ACM Transac- [64] X. M. Zhang, S. Fanello, Y. T. Tsai, T. C. Sun, T. F. Xue,
tions on Graphics, vol. 36, no. 6, pp. 1–11, 2017. DOI: 10. R. Pandey, S. Orts-Escolano, P. Davidson, C. Rhemann,
1145/3130800.3130880. P. Debevec, J. T. Barron, R. Ramamoorthi, W. T. Free-
man. Neural light transport for relighting and view syn-
[52] M. Panin, S. Nikolenko. Faster RPNN: Rendering clouds thesis. ACM Transactions on Graphics, vol. 40, no. 1, Art-
with latent space light probes. In Proceedings of SIG- icle number 9, 2021. DOI: 10.1145/3446328.
GRAPH Asia Technical Briefs, ACM, Brisbane, Aus-
tralia, pp. 21–24, 2019. DOI: 10.1145/3355088.3365150. [65] A. Edelsten, P. Jukarainen, A. Patney. Truly Next-GEN:
Adding Deep Learning to Games & Graphics. In Proceed-
[53] F. Abbas, M. C. Babahenini. Babahenini. Forest fog ren- ings of Game Developers Conference Recording (GDC
dering using generative adversarial networks. The Visual Vault), USA, 2019.
Computer, vol. 39, no. 3, pp. 943–952, 2023.
[66] L. Xiao, S. Nouri, M. Chapman, A. Fix, D. Lanman, A.
[54] Q. Zheng, G. Singh, H. P. Seidel. Neural relightable parti- Kaplanyan. Neural supersampling for real-time render-
cipating media rendering. In Proceedings of the 34th In- ing. ACM Transactions on Graphics, vol. 39, no. 4, Art-
ternational Conference on Neural Information Processing icle number 142, 2020. DOI: 10.1145/3386569.3392376.
Systems, pp. 15203–15215, 2021.
[67] J. Guo, X. H. Fu, L. Q. Lin, H. J. Ma, Y. W. Guo, S. Q.
[55] P. Hermosilla, S. Maisch, T. Ritschel, T. Ropinski. Deep- Liu, L. Q. Yan. ExtraNet: Real-time extrapolated render-
learning the latent space of light transport. Computer ing for low-latency temporal supersampling. ACM Trans-
Graphics Forum, vol. 38, no. 4, pp. 207–217, 2019. DOI: actions on Graphics, vol. 40, no. 6, Article number 278,
10.1111/cgf.13783. 2021. DOI: 10.1145/3478513.3480531.
[56] D. Vicini, V. Koltun, W. Jakob. A learned shape-adapt- [68] K. M. Briedis, A. Djelouah, M. Meyer, I. McGonigal, M.
ive subsurface scattering model. ACM Transactions on Gross, C. Schroers. Neural frame interpolation for
Graphics, vol. 38, no. 4, Article number 127, 2019. DOI: rendered content. ACM Transactions on Graphics,
10.1145/3306346.3322974. vol. 40, no. 6, Article number 239, 2021. DOI: 10.1145/
[57] L. Y. Wei, L. W. Hu, V. Kim, E. Yumer, H. Li. Real-time 3478513.3480553.
hair rendering using sequential adversarial networks. In [69] S. Bi, K. Sunkavalli, F. Perazzi, E. Shechtman, V. Kim,
Proceedings of the 15th European Conference on Com- R. Ramamoorthi. Deep CG2Real: Synthetic-to-real
puter Vision, Springer, Munich, Germany, pp. 105–122, translation via image disentanglement. In Proceedings of
2018. DOI: 10.1007/978-3-030-01225-0_7. IEEE/CVF International Conference on Computer Vis-
[58] R. Martin-Brualla, R. Pandey, S. R. Yang, P. ion, IEEE, Seoul, Republic of Korea, pp. 2730–2739, 2019.
Pidlypenskyi, J. Taylor, J. Valentin, S. Khamis, P. Dav- DOI: 10.1109/ICCV.2019.00282.
idson, A. Tkach, P. Lincoln, A. Kowdle, C. Rhemann, D. [70] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron,
B. Goldman, C. Keskin, S. Seitz, S. Izadi, S. Fanello. R. Ramamoorthi, R. Ng. NeRF: Representing scenes as
LookinGood: Enhancing performance capture with real- neural radiance fields for view synthesis. In Proceedings
time neural re-rendering. ACM Transactions on Graph- of the 16th European Conference on Computer Vision,
ics, vol. 37, no. 6, Article number 255, 2018. DOI: 10. Springer, Glasgow, UK, pp. 405–421, 2020. DOI: 10.1007/
1145/3272127.3275099. 978-3-030-58452-8_24.
[59] A. Meka, R. Pandey, C. Häne, S. Orts-Escolano, P. [71] J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R.
Barnum, P. David-Son, D. Erickson, Y. D. Zhang, J. Martin-Brualla, P. P. Srinivasan. Mip-NeRF: A
Taylor, S. Bouaziz, C. Legendre, W. C. Ma, R. Overbeck, multiscale representation for anti-aliasing neural radi-
T. Beeler, P. Debevec, S. Izadi, C. Theobalt, C. Rhem- ance fields. In Proceedings of IEEE/CVF International
ann, S. Fanello. Deep relightable textures: Volumetric Conference on Computer Vision, IEEE, Montreal,
performance capture with neural rendering. ACM Trans- Canada, pp. 5835–5844, 2021. DOI: 10.1109/ICCV48922.
actions on Graphics, vol. 39, no. 6, Article number 259, 2021.00580.
2020. DOI: 10.1145/3414685.3417814.
[72] R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T.
[60] P. Chandran, S. Winberg, G. Zoss, J. Riviere, M. Gross, Barron, A. Dosovitskiy, D. Duckworth. NeRF in the wild:
P. Gotardo, D. Bradley. Rendering with style: Combin- Neural radiance fields for unconstrained photo collec-
ing traditional and neural approaches for high-quality tions. In Proceedings of IEEE/CVF Conference on Com-
face rendering. ACM Transactions on Graphics, vol. 40, puter Vision and Pattern Recognition, IEEE, Nashville,
no. 6, Article number 223, 2021. DOI: 10.1145/3478513. USA, pp. 7206–7215, 2021. DOI: 10.1109/CVPR46437.
3480509. 2021.00713.

[73] K. Park, U. Sinha, P. Hedman, J. T. Barron, S. Bouaziz, DOI: 10.1109/CVPR.2015.7298761.


D. B. Goldman, R. Martin-Brualla, S. M. Seitz. HyperN-
[87] Y. Blau, T. Michaeli. The perception-distortion tradeoff.
eRF: A higher-dimensional representation for topologic-
In Proceedings of IEEE/CVF Conference on Computer
ally varying neural radiance fields. ACM Transactions on
Vision and Pattern Recognition, IEEE, Salt Lake City,
Graphics, vol. 40, no. 6, Article number 238, 2021. DOI:
USA, pp. 6228–6237, 2018. DOI: 10.1109/CVPR.2018.
10.1145/3478513.3480487.
00652.
[74] A. Pumarola, E. Corona, G. Pons-Moll, F. Moreno-
[88] X. E. Zhang, R. Ng, Q. F. Chen. Single image reflection
Noguer. D-NeRF: Neural radiance fields for dynamic sce-
separation with perceptual losses. In Proceedings of
nes. In Proceedings of IEEE/CVF Conference on Compu-
IEEE/CVF Conference on Computer Vision and Pattern
ter Vision and Pattern Recognition, IEEE, Nashville,
Recognition, IEEE, Salt Lake City, USA, pp. 4786–4794,
USA, pp. 10313–10322, 2021. DOI: 10.1109/CVPR46437.
2018. DOI: 10.1109/CVPR.2018.00503.
2021.01018.
[89] C. Atapattu, B. Rekabdar. Improving the realism of syn-
[75] S. Y. Su, F. Yu, M. Zollhöfer, H. Rhodin. A-NeRF: Artic-
thetic images through a combination of adversarial and
ulated neural radiance fields for learning human shape, perceptual losses. In Proceedings of International Joint
appearance, and pose. In Proceedings of the 34th Interna- Conference on Neural Networks, IEEE, Budapest, Hun-
tional Conference on Neural Information Processing Sys- gary, pp. 1–7, 2019. DOI: 10.1109/IJCNN.2019.8852449.
tems, pp. 12278–12291, 2021.
[90] J. Johnson, A. Alahi, L. Fei-Fei. Perceptual losses for
[76] G. Gafni, J. Thies, M. Zollhöfer, M. Nießner. Dynamic real-time style transfer and super-resolution. In Proceed-
neural radiance fields for monocular 4D facial avatar re- ings of the 14th European Conference on Computer Vis-
construction. In Proceedings of IEEE/CVF Conference ion, Springer, Amsterdam, The Netherlands, pp. 694–711,
on Computer Vision and Pattern Recognition, IEEE, 2016. DOI: 10.1007/978-3-319-46475-6_43.
Nashville, USA, pp. 8645–8654, 2021. DOI: 10.1109/CV-
PR46437.2021.00854. [91] A. Rehman, Z. Wang. SSIM-based non-local means im-
age denoising. In Proceedings of the 18th IEEE Interna-
[77] C. Gao, Y. Shih, W. S. Lai, C. K. Liang, J. B. Huang. tional Conference on Image Processing, Brussels, Belgi-
Portrait neural radiance fields from a single image, [On- um, pp. 217–220, 2011. DOI: 10.1109/ICIP.2011.6116065.
line], Available: https://arxiv.org/abs/2012.05903, 2020.
[92] J. Hwang, C. S. Yu, Y. Shin. SAR-to-optical image trans-
[78] B. B. Yang, Y. D. Zhang, Y. H. Xu, Y. J. Li, H. Zhou, H. lation using SSIM and perceptual loss based cycle-con-
J. Bao, G. F. Zhang, Z. P. Cui. Learning object-composi- sistent GAN. In Proceedings of International Conference
tional neural radiance field for editable scene rendering. on Information and Communication Technology Conver-
In Proceedings of IEEE/CVF International Conference gence, IEEE, Jeju, Republic of Korea, pp. 191–194, 2020.
on Computer Vision, IEEE, Montreal, Canada, pp. 13759– DOI: 10.1109/ICTC49870.2020.9289381.
13768, 2021. DOI: 10.1109/ICCV48922.2021.01352.
[93] T. Karras, S. Laine, T. Aila. A style-based generator ar-
[79] Y. C. Huo, S. E. Yoon. A survey on deep learning-based chitecture for generative adversarial networks. In Pro-
Monte Carlo denoising. Computational Visual Media, ceedings of IEEE/CVF Conference on Computer Vision
vol. 7, no. 2, pp. 169–185, 2021. DOI: 10.1007/s41095-021- and Pattern Recognition, IEEE, Long Beach, USA,
0209-9. pp. 4396–4405, 2019. DOI: 10.1109/CVPR.2019.00453.
[80] K. Gao, Y. N. Gao, H. J. He, D. N. Lu, L. L. Xu, J. Li. [94] L. Bavoil. Horizon-based Ambient Occlusion Using Com-
NeRF: Neural radiance field in 3D vision, a comprehens- pute Shaders, NVIDIA, USA, 2011.
ive review, [Online], Available: https://arxiv.org/abs/
2210.00379, 2022. [95] J. Wilhelms, A. Van Gelder. A coherent projection ap-
proach for direct volume rendering. ACM SIGGRAPH
[81] P. Wang, L. J. Liu, Y. Liu, C. Theobalt, T. Komura, W. Computer Graphics, vol. 25, no. 4, pp. 275–284, 1991.
P. Wang. NeuS: Learning neural implicit surfaces by DOI: 10.1145/127719.122758.
volume rendering for multi-view reconstruction. In Pro-
ceedings of the 34th International Conference on Neural [96] P. Kutz, R. Habel, Y. K. Li, J. Novák. Spectral and de-
Information Processing Systems, pp. 27171–27183, 2021. composition tracking for rendering heterogeneous
volumes. ACM Transactions on Graphics, vol. 36, no. 4,
[82] Z. H. Yu, S. Y. Peng, M. Niemeyer, T. Sattler, A. Geiger. Article number 111, 2017. DOI: 10.1145/3072959.3073
MonoSDF: Exploring monocular geometric cues for neur- 665.
al implicit surface reconstruction, [Online], Available: ht-
tps://arxiv.org/abs/2206.00665, 2022. [97] B. Miller, I. Georgiev, W. Jarosz. A null-scattering path
integral formulation of light transport. ACM Transac-
[83] J. T. Kajiya. The rendering equation. In Proceedings of tions on Graphics, vol. 38, no. 4, Article number 44, 2019.
the 13th Annual Conference on Computer Graphics and DOI: 10.1145/3306346.3323025.
Interactive Techniques, ACM, Dallas, USA, pp. 143–150,
1986. DOI: 10.1145/15922.15902. [98] C. Donner, J. Lawrence, R. Ramamoorthi, T. Hachisuka,
H. W. Jensen, S. Nayar. An empirical BSSRDF model.
[84] E. Veach. Robust Monte Carlo methods for light trans- New Orleans Louisiana, USA, Article number 30, 2009.
port simulation, Ph. D. dissertation, Stanford University, DOI: 10.1145/1576246.1531336.
Stanford, USA, 1998.
[99] B. Mildenhall, P. P. Srinivasan, R. Ortiz-Cayon, N. K.
[85] M. Pharr, W. Jakob, G. Humphreys. Physically Based Kalantari, R. Ramamoorthi, R. Ng, A. Kar. Local light
Rendering: From Theory to Implementation, 3rd ed., field fusion: Practical view synthesis with prescriptive
Cambridge, USA: Morgan Kaufmann, 2016. sampling guidelines. ACM Transactions on Graphics,
vol. 38, no. 4, Article number 29, 2019. DOI: 10.1145/
[86] A. Dosovitskiy, J. T. Springenberg, T. Brox. Learning to
3306346.3322980.
generate chairs with convolutional neural networks. In
Proceedings of IEEE Conference on Computer Vision and [100] C. Crassin, F. Neyret, M. Sainz, S. Green, E. Eisemann.
Pattern recognition, Boston, USA, pp. 1538–1546, 2015. Interactive indirect illumination using voxel cone tracing.

Computer Graphics Forum, vol. 30, no. 7, pp. 1921–1930, IEEE, Long Beach, USA, pp. 4455–4465, 2019. DOI: 10.
2011. DOI: 10.1111/j.1467-8659.2011.02063.x. 1109/CVPR.2019.00459.
[101] Y. Tokuyoshi, S. Ogaki. Real-time bidirectional path tra- [113] M. Oechsle, L. Mescheder, M. Niemeyer, T. Strauss, A.
cing via rasterization. In Proceedings of ACM SIG- Geiger. Texture fields: Learning texture representations
GRAPH Symposium on Interactive 3D Graphics and in function space. In Proceedings of IEEE/CVF Interna-
Games, Costa Mesa, USA, pp. 183–190, 2012. DOI: 10. tional Conference on Computer Vision, IEEE, Seoul, Re-
1145/2159616.2159647. public of Korea, pp. 4530–4539, 2019. DOI: 10.1109/IC-
[102] O. Ronneberger, P. Fischer, T. Brox. U-Net: Convolu- CV.2019.00463.
tional networks for biomedical image segmentation. In [114] K. Zhang, G. Riegler, N. Snavely, V. Koltun. NeRF++:
Proceedings of the 18th International Conference on
Analyzing and improving neural radiance fields, [Online],
Medical Image Computing and Computer-assisted Inter-
Available: https://arxiv.org/abs/2010.07492, 2020.
vention, Springer, Munich, Germany, pp. 234–241, 2015.
DOI: 10.1007/978-3-319-24574-4_28. [115] Y. Yao, Z. X. Luo, S. W. Li, J. Y. Zhang, Y. F. Ren, L.
Zhou, T. Fang, L. Quan. BlendedMVS: A large-scale
[103] S. R. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, T.
Funkhouser. Semantic scene completion from a single dataset for generalized multi-view stereo networks. In
depth image. In Proceedings of IEEE Conference on Proceedings of IEEE/CVF Conference on Computer Vis-
Computer Vision and Pattern Recognition, Honolulu, ion and Pattern Recognition, IEEE, Seattle, USA,
USA, pp. 190–198, 2017. DOI: 10.1109/CVPR.2017.28. pp. 1787–1796, 2020. DOI: 10.1109/CVPR42600.2020.
00186.
[104] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen,
T. Aila. Analyzing and improving the image quality of [116] L. Yariv, Y. Kasten, D. Moran, M. Galun, M. Atzmon, R.
StyleGAN. In Proceedings of IEEE/CVF Conference on Basri, Y. Lipman. Multiview neural surface reconstruc-
Computer Vision and Pattern Recognition, IEEE, tion by disentangling geometry and appearance. In Pro-
Seattle, USA, pp. 8107–8116, 2020. DOI: 10.1109/ ceedings of the 34th International Conference on Neural
CVPR42600.2020.00813. Information Processing Systems, Vancouver, Canada,
Article number 210, 2020.
[105] D. E. Worrall, S. J. Garbin, D. Turmukhambetov, G. J.
Brostow. Interpretable transformations with encoder-de- [117] V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B.
coder networks. In Proceedings of IEEE International Lindell, G. Wetzstein. Implicit neural representations
Conference on Computer Vision, Venice, Italy, pp. 5737– with periodic activation functions. Advances in Neural
5746, 2017. DOI: 10.1109/ICCV.2017.611. Information Processing Systems, vol. 33, no. 7, pp.
[106] J. L. Schönberger, J. M. Frahm. Structure-from-motion 7462–7473, 2020.
revisited. In Proceedings of IEEE Conference on Com- [118] C. Buehler, M. Bosse, L. McMillan, S. Gortler, M. Cohen.
puter Vision and Pattern Recognition, Las Vegas, USA,
Unstructured lumigraph rendering. In Proceedings of the
pp. 4104–4113, 2016. DOI: 10.1109/CVPR.2016.445.
28th Annual Conference on Computer graphics and Inter-
[107] J. L. Schönberger, E. L. Zheng, J. M. Frahm, M. active Techniques, ACM, Los Angeles, USA, pp. 425–432,
Pollefeys. Pixelwise view selection for unstructured 2001. DOI: 10.1145/383259.383309.
multi-view stereo. In Proceedings of the 14th European
Conference on Computer Vision, Springer, Amsterdam, [119] V. Lepetit, F. Moreno-Noguer, P. Fua. EPnP: An accur-
The Netherlands, pp. 501–518, 2016. DOI: 10.1007/978-3- ate O(n) solution to the PnP problem. International
319-46487-9_31. Journal of Computer Vision, vol. 81, no. 2, pp. 155–166,
2009. DOI: 10.1007/s11263-008-0152-6.
[108] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A.
L. Yuille. DeepLab: Semantic image segmentation with [120] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M.
deep convolutional nets, atrous convolution, and fully Shafiei, H. P. Seidel, W. P. Xu, D. Casas, C. Theobalt.
connected CRFs. IEEE Transactions on Pattern Analys- VNect: Real-time 3D human pose estimation with a
is and Machine Intelligence, vol. 40, no. 4, pp. 834–848, single RGB camera. ACM Transactions on Graphics,
2018. DOI: 10.1109/TPAMI.2017.2699184. vol. 36, no. 4, Article number 44, 2017. DOI: 10.1145/
3072959.3073596.
[109] R. Q. Charles, S. Hao, K. C. Mo, L. J. Guibas. PointNet:
Deep learning on point sets for 3D classification and seg- [121] C. R. Qi, L. Yi, H. Su, L. J. Guibas. PointNet++: Deep
mentation. In Proceedings of IEEE Conference on Com- hierarchical feature learning on point sets in a metric
puter Vision and Pattern Recognition, Honolulu, USA, space. In Proceedings of the 31st International Confer-
pp. 77–85, 2017. DOI: 10.1109/CVPR.2017.16. ence on Neural Information Processing Systems, Long
[110] K. M. He, X. Y. Zhang, S. Q. Ren, J. Sun. Deep residual Beach, USA, pp. 5105–5114, 2017.
learning for image recognition. In Proceedings of IEEE
Conference on Computer Vision and Pattern Recogni- Qi Wang received the B. Eng. and
tion, IEEE, Las Vegas, USA, pp. 770–778, 2016. DOI: 10.
M. Eng. degrees in computer science and
1109/CVPR.2016.90.
technology from Beijing Institute of Tech-
[111] G. H. Li, M. Müller, A. Thabet, B. Ghanem. DeepGCNs: nology, China in 2017 and 2019, respect-
Can GCNs go as deep as CNNs? In Proceedings of 2019 ively. He is currently a Ph. D. degree can-
IEEE/CVF International Conference on Computer Vis- didate in computer graphics at State Key
ion, IEEE, Seoul, Republic of Korea, pp. 9266–9275, 2019. Laboratory of CAD&CG, Zhejiang Uni-
DOI: 10.1109/ICCV.2019.00936. versity, China.
[112] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, A. His research direction is human-related
Geiger. Occupancy networks: Learning 3D reconstruc- rendering.
tion in function space. In Proceedings of IEEE/CVF Con- E-mail: [email protected]
ference on Computer Vision and Pattern Recognition, ORCID iD: 0000-0002-6326-3209

Zhihua Zhong received the B. Eng. de- Hujun Bao received the B. Eng. and
gree in computer network from Jinan Uni- Ph. D. degrees in applied mathematics
versity, China in 2017. He is currently a from Zhejiang University, China in 1987
master student in computer graphics at and 1993, respectively. He is currently the
State Key Laboratory of CAD&CG, Zheji- deputy director of Zhejiang Laboratory
ang University, China. and the deputy director of Informatics De-
His research direction is superresolu- partment of the Science and Technology
tion in real-time rendering. Committee of the Ministry of Education,
E-mail: [email protected] China.
His research interests include rendering, modelling and virtu-
al reality.
E-mail: [email protected]

Yuchi Huo received the Ph. D. degree Rui Wang received the B. Sc. degree in
from State Key Laboratory of CAD&CG, computer science and the Ph. D. degree in
Zhejiang University, China. He is a “ Hun- mathematics from Zhejiang University,
dred Talent Program” researcher in State China in 2001 and 2007, respectively. He is
Key Laboratory of CAD&CG, Zhejiang currently a professor at State Key Labor-
University, China. atory of CAD&CG, Zhejiang University,
His research interests include rendering, China.
deep learning, image processing, and com- His research interests include real-time
putational optics, which are aiming for the rendering, realistic rendering, GPU-based
realization of next-generation neural rendering pipeline and computation and 3D display techniques.
physical-neural computation. E-mail: [email protected] (Corresponding author)
E-mail: [email protected] ORCID iD: 0000-0003-4267-0347
