0% found this document useful (0 votes)
2 views

On_the_Role_of_ViT_and_CNN_in_Semantic_Communications_Analysis_and_Prototype_Validation

This document discusses the integration of Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) in semantic communications, highlighting a proposed ViT-based model that outperforms CNN variants by achieving a peak signal-to-noise ratio gain of +0.5 dB. The study introduces new analytical measures and validates the model through a real wireless channel prototype, marking a significant advancement in understanding the dynamics of semantic communication systems. Additionally, the authors provide open-source code to facilitate further research and reproducibility of their findings.

Uploaded by

751486692
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
2 views

On_the_Role_of_ViT_and_CNN_in_Semantic_Communications_Analysis_and_Prototype_Validation

This document discusses the integration of Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) in semantic communications, highlighting a proposed ViT-based model that outperforms CNN variants by achieving a peak signal-to-noise ratio gain of +0.5 dB. The study introduces new analytical measures and validates the model through a real wireless channel prototype, marking a significant advancement in understanding the dynamics of semantic communication systems. Additionally, the authors provide open-source code to facilitate further research and reproducibility of their findings.

Uploaded by

751486692
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

Received 5 June 2023, accepted 12 June 2023, date of publication 3 July 2023, date of current version 19 July 2023.

Digital Object Identifier 10.1109/ACCESS.2023.3291405

On the Role of ViT and CNN in Semantic


Communications: Analysis and
Prototype Validation
HANJU YOO 1 , (Graduate Student Member, IEEE), LINGLONG DAI2 , (Fellow, IEEE),
SONGKUK KIM 1 , (Member, IEEE), AND CHAN-BYOUNG CHAE 1 , (Fellow, IEEE)
1 School of Integrated Technology, Yonsei University, Seoul 03722, South Korea
2 Department of Electronic Engineering, Tsinghua University, Beijing 100084, China
Corresponding authors: Chan-Byoung Chae ([email protected]) and Songkuk Kim ([email protected])
This work was supported in part by the Institute of Information and Communications Technology Planning and Evaluation (IITP) and the
National Research Foundation of Korea (NRF) Grant funded by the Korean Government [Ministry of Science and ICT (MSIT)] under
Grant 2021-0-02208, Grant 2022R1A5A1027646, and Grant 2022R1F1A1076493; and in part by Samsung Electronics.

ABSTRACT Semantic communications have shown promising advancements by optimizing source and
channel coding jointly. However, the dynamics of these systems remain understudied, limiting research and
performance gains. Inspired by the robustness of Vision Transformers (ViTs) in handling image nuisances,
we propose a ViT-based model for semantic communications. Our approach achieves a peak signal-to-noise
ratio (PSNR) gain of +0.5 dB over convolutional neural network variants. We introduce novel measures,
average cosine similarity and Fourier analysis, to analyze the inner workings of semantic communications
and optimize the system’s performance. We also validate our approach through a real wireless channel
prototype using software-defined radio (SDR). To the best of our knowledge, this is the first investigation of
the fundamental workings of a semantic communications system, accompanied by the pioneering hardware
implementation. To facilitate reproducibility and encourage further research, we provide open-source code,
including neural network implementations and LabVIEW codes for SDR-based wireless transmission
systems (Source codes available at https://bit.ly/SemViT).

INDEX TERMS 6G, deep neural network, real-time wireless communications, semantic communications,
wireless image transmission.

I. INTRODUCTION To bridge the gap between theoretical assumptions and


Conventional communications systems traditionally employ real-world conditions, there is growing interest in semantic
separate blocks for source coding and channel coding. This communications. Semantic communications aims to address
modular approach stems from Shannon’s separation the- these challenges by integrating source coding, channel cod-
orem [1], which asserts that source coding and channel ing, and modulation within a joint optimization framework
coding can be independently optimized without compromis- based on deep learning techniques [3], [4], [5]. By con-
ing optimality, under idealized communication conditions sidering the interplay between these components, semantic
like infinite code length, independent and identically dis- communications holds promise for achieving improved trans-
tributed (i.i.d.) symbols, or stationary channels. However, mission performance and enhanced efficiency in real-world
in practical communication scenarios, these assumptions communication systems.
often do not hold [2]. Typically, these systems treat wireless channels as a non-
trainable, noise-adding layer, and are trained end-to-end [4],
The associate editor coordinating the review of this manuscript and [5], [6], [7], [8]. Because neural networks consist of various
approving it for publication was Olutayo O. Oyerinde . matrix operations, they are highly parallelizable and can be

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.


71528 For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ VOLUME 11, 2023
Yoo et al.: On the Role of ViT and CNN in Semantic Communications

scaled to meet performance and latency requirements, which reversely generates the output of the latent representations
is critical in emerging applications such as extended real- to match the original signals. In that process of squeezing
ity (XR) [9]. Unlike traditional coding methods that aim to and expanding, only the core part of the input is extracted in
recover the original symbols precisely, deep-learning-based an unsupervised manner [3], [12]. One way to exploit those
models can be optimized for goal-oriented communications. properties is a denoising autoencoder [13], which removes
For example, learned representations can target more accu- artifacts in a noised image.
rate classifications of an image [10] rather than full signal The semantic communication system, however, may be
recovery, which may not be necessary. distinguished from the denoising autoencoder regarding the
The semantic communications system can be viewed as location of the noise. In the denoising autoencoder, the orig-
a deep neural network problem that involves reconstructing inal image contains the artifacts. In the contrary, in the
the original signal from the corrupted latent representations. semantic communications system, perturbations are added to
One of the main research focuses is selecting building blocks the compressed latent features (i.e., encoded symbols). Those
for the neural network, such as convolutional [4], [7], self- perturbations typically consist of theoretical Rayleigh fad-
attention [6], [8], or recurrent neural network blocks [5], ing or AWGN. The networks learn to generate compressed,
or finding optimal network architecture. While heuristics can noise-resilient signal representations by using loss functions
be used to find optimal architectures, a deeper analysis of that measure the distance between the original signals and the
how deep neural networks jointly perform source/channel reconstructed ones (e.g., mean-squared error loss). If those
coding or modulations, which is currently understudied, can learned features are composed of two-dimensional complex
significantly enhance these architecture search procedures numbers, those autoencoder networks can be viewed as a
and facilitate research. mapping function from the source to a symbol. Unlike tradi-
In this paper, we extend our prior work [11], which tional communications systems that separate source coding,
merely adopted the Vision Transformer (ViT) architecture to channel coding, and modulation, a semantic communications
semantic communications systems, and carefully fine-tune system conducts joint source-channel coding using a deep
the network to find better architectures for the system. neural network.
Moreover, we thoroughly analyze the results to understand Ongoing research on semantic communications covers
how the image semantic communications systems work various domains, including texts [5], [6], [14], speech sig-
and what the advantage of the ViT is. We also verify the nals [15], images [4], [7], and videos [16]. For example,
results in real wireless channels using a software-defined- the authors in [14] proposed a transformer-based seman-
radio (SDR)-based testbed. Full source codes is available at tic communications architecture, DeepSC. Reference [15]
https://bit.ly/SemViT, including SDR implementations and expanded the domain to speech transmission with squeeze-
trained neural network parameters. Our contributions are as and-excitation-aided convolutional neural networks (CNN).
follows: The work [4] proposed a joint source and channel coding
• We carefully design semantic communications systems (JSCC) network for images, which compresses given images
that harmonize Vision Transformers with CNNs, with with CNNs. Their coding method achieved 3 dB peak signal-
appropriate priors and insights from the computer vision to-noise-ratio (PSNR) gain to JPEG+LDPC coding scheme
community. under the Rayleigh fading channel. Reference [16] presented
• We conduct extensive analysis about how image seman- DeepWiVe, an video transmission JSCC system that uses
tic communications systems work in an additive white CNN and non-local blocks [17] to capture the redundancies
Gaussian noise (AWGN) or Rayleigh channel, yielding between frames.
various insights while introducing analysis metrics such However, most prior work mentioned above has the fol-
as average cosine similarities and Fourier analysis. lowing limitations: 1) they mostly have focused on the
• We build a SDR-based wireless semantic communica- implementation rather than delicate considerations of the
tions system prototype and verify that the simulated network architecture or careful analysis of why the proposed
results fit well with the real wireless channels. system works; 2) they have evaluated the system only in the
• We publicly release our source codes, including deep simulated channel environments (i.e., Rayleigh and AWGN
neural network implementations, trained parameters, channels). In this paper, we propose a ViT-based image
or SDR-based testbed implementations, to facilitate semantic communication system inspired by existing analysis
follow-up studies. of the computer vision communities on ViTs; we carefully
analyze the results with various metrics to figure out how the
II. BACKGROUNDS ViTs work in the semantics communications system. We also
A. SEMANTIC COMMUNICATIONS verify the system with the SDR-based wireless prototype to
A semantic communications system is a kind of autoencoder, show the system’s feasibility in a real wireless channel.
a neural network capable of compressing given signals and Recently, there have been a few pioneering works that
reconstructing them. A typical autoencoder encodes the given have used Vision Transformer-based backbones for semantic
image into the smaller but high-dimensional features and then communications. For example, the authors in [18] used a ViT

VOLUME 11, 2023 71529


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

FIGURE 1. (a) Proposed system architecture and (b) a ViT block. For convolutional layers, kernel size k and output channel C are denoted by (k × k,
C ). For ViT layers, the number of heads Nh and dimensions per head dh are denoted as (Nh , dh ). H × W × C denotes the height, width, and channels
of the image/features, and DS and US stand for spatial downsampling and upsampling, respectively.

encoder to produce semantic-noise resilient features inspired and Wqry , Wkey , Wval ∈ R√ Din ×Dout are query, key, value

by ViT’s patch-wise image processing. The authors in [8] transformation matricies and Din is a normalization factor,
used a Swin Transformer-based backbone for image analysis/ respectively. P ∈ RN ×N is a learnable positional encoding
transform to bring the hyperprior concept to the semantic that alleviates its permutation-equivariance and translation-
communications. variance, which can be problematic for images. Attention
However, they used ViTs only to realize their ideas, while matrix A can be interpreted as a similarity matrix between
we aim here–through extensive analysis–to figure out how the input tokens in a latent space, and self-attention produces
ViT works in sematic communications via extensive analysis new features by using mutual similarities of given inputs
and try to provide some insights. We also verify our results as weights by softmaxing and multiplying the attention
on a real wireless channel with our SDR-based testbed. Our matrix.
contributions are parallel to those works and can be utilized
simultaneously to improve communications systems. 2) MULTI-HEAD SELF-ATTENTION
Multi-head self-attention is done by conducting multiple
B. MULTI-HEAD SELF-ATTENTION AND CONVOLUTIONS self-attentions (‘‘heads’’) in parallel to enable multiple inter-
Multi-Head Self-Attention (MHSA) mechanism is the pretations of the same input exploiting different query,
essence of the Vision Transformer architecture. It is dif- key, and value transformations. It consists of Nh heads
ferentiated from convolution in that it has global receptive and Dh dimensions per each head, and multiple head
fields and content-adaptivity. In this section, we introduce outputs are combined to produce final representations as
the mathematical formulation of the multi-head self-attention follows:
and compare them to convolutions. The notation used is
mainly borrowed from [19]. MHSA(X) = concat[Self-Attentionh (X)]Wout (3)
h∈[Nh ]

1) SELF-ATTENTION where Wout ∈ RDout ×Dout is a transformation matrix used


Let X ∈ RN ×Din represent an input matrix, which has N for projecting concatenated head outputs, and the output
tokens (or pixels) and Din dimensions. An (single-headed) dimension of Self-Attentionh is Dh . Typically Dh is set as
self-attention block maps an input X into output Y ∈ RN ×Dout Dh × Nh = Dout .
dimension output as follows:
p 3) CONVOLUTIONS
Y = Self-Attention(X) = softmax(A/ Din )XWval (1)
Convolutional layers have been widely adopted in building
where A is an attention matrix and is defined as: neural networks for images, even after the advent of self-
attentions. For a input feature X ∈ RH ×W ×Din , convolutional
A = XWqry WTkey XT + P (2) layers with learned kernel matrix K ∈ Rk×k×Din can be

71530 VOLUME 11, 2023


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

TABLE 1. Tested architectures, their computational complexity (GFLOPs), trainable parameters, and decoded image quality.

denoted as follows: is desirable as low-pass filtering is a typical way to enhance


k k
noised images.
⌊2⌋ ⌊2⌋
X X Inspired by helpful characteristics of ViTs, such as
Conv(X)i,j = KT⌊ k ⌋+a,⌊ k ⌋+b,: Xi+a,j+b,: . (4) low-pass filtering effects and robustness to impaired sig-
2 2
a=⌊− k2 ⌋ b=⌊− k2 ⌋ nals, we chose the Vision Transformer architecture and its
multi-head self-attention mechanism as our primary building
As seen above, convolutional layers transform given matri- blocks. Our rationale for using ViTs in semantic communica-
ces only depending on learned kernels shared by all pixels tions systems is as follows:
and are content-agnostic. Their receptive fields, i.e., the range • ViTs can perform source coding better than CNNs.
of input pixels utilized to produce a single output token, are Vision Transformers have content-adaptivity and global
limited to the kernel size k. In the contrary, multi-head self- receptive fields, whereas CNNs learn kernel weights and
attention has a global receptive field and calculates the inner have local receptive fields. Those properties can help
products of the queries and keys, enabling content-dependent source coding, where reducing content redundancies
operations. However, it requires much more computation and among all features is critical.
memory resources due to the calculation of N × N inner • ViTs are more robust to noised signals than CNNs.
products of the attention matrix. It should be carefully used ViTs behave like low-pass filters [27] in image clas-
when there are a number of tokens, e.g., high-resolution sifcation, unlike CNNs, which behave like high-pass
images. filters. As the decoding process can be viewed as
denoising, which removes high-frequency artifacts from
C. VISION TRANSFORMERS the channel noises, ViT can outperform CNNs in seman-
Vision Transformers were first introduced in [20], inspired by tic communications systems. ViTs are also known for
a Transformer [21], which is a de facto standard architecture their robustness to occlusions or perturbations in images,
in a natural language processing research community. Unlike which resemble channel noises.
CNN, it adopts the multi-head self-attention mechanism to
enable content-adaptive operations and has global recep- III. SEMANTIC ViT
tive fields. ViTs and their variants outperform CNNs and A. SYSTEM ARCHITECTURE
record state-of-the-art performances in various computer- In this paper, we propose SemViT, an abbreviation for
vision fields, including image classification [22], object Semantic Vision Transformer. It follows the typical autoen-
detection [23], or image restoration [24]. coder design of a semantic communications system,
In the recent research [25], it is found that ViTs are as described in Section II-A. It has ten blocks, 5 for the
more robust on common image corruptions, such as occlu- encoder and decoder networks, respectively. The encoder
sions, permutations, natural perturbations, or even adversarial network maps input color image X ∈ RH ×W ×3 into complex
attacks. When 80% of an image is randomly dropped, for in- and quadrature-phase messages M ∈ CS , and the decoder
example, it shows ∼60% classification accuracy, whereas reconstructs the decoded image X̂ ∈ RH ×W ×3 from the
CNN maintains zero accuracies. It also recorded 36% lower channel-corrupted symbols. Following prior works [4], [7],
mean corruption error [26] for natural perturbations, or 30%p we denote the proportion of the number of complex symbols
S
more classification accuracy when an adversarial patch – 5% sent and the number of pixels in the original image, H ×W ×3 ,
of the total image size – is added. Also, the authors in [27] as a bandwidth ratio.
argued that ViTs behave like low-pass filters in image clas- As shown in Fig. 1, we combine convolutional and ViT
sification, unlike CNNs resembling high-pass filters, which layers to build the semantic communications systems. For

VOLUME 11, 2023 71531


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

FIGURE 2. System architecture of the USRP-based wireless semantic communications system testbed.

FIGURE 3. Example I/Q signals transmitted in the wireless testbed. (a) Modulated symbol plot obtained from 64 test images, corresponding to a total of
32,768 symbols. (b) Raw in-phase sequence received in the USRP device. The red vertical lines indicate the estimated symbol duration, obtained by
autocorrelation of the pilot sequence added to the front and back of the modulated symbols. (c) Visualization of the I/Q imbalance problem. Despite
transmitting only in-phase symbols, the same signal is also received (albeit with weak power) in the quadrature phase.

convolutional layers, we used a kernel size of 9 × 9 or 5 × 5, was based on the observation that convolutional layers
following [7], [28]. We used [20] and [22]-inspired ViT lay- excel at processing local patterns, which are more preva-
ers, which consist of MHSA layers (described in Section II-B) lent in the early stages of the network. By assessing the
and multi-layer perceptron layers for ViT blocks. For posi- performance of these individual combinations, we identi-
tional encoding, we used learnable 2D positional encoding fied the most effective configurations for both the encoder
P ∈ RN ×N that are sampled from R(2N −1)×(2N −1) based on and decoder parts. Subsequently, we assembled the best
the relative distance between the key and query pixels. combinations to determine the optimal network architec-
To find the best combinations of the convolution and ViT ture for the entire system. Considering the time-consuming
layers, we started from the CNN baseline (based on [7] and nature of testing all possible combinations, this approach
[28]) and replaced one by one each convolutional layer with allowed us to efficiently determine the most effective
a ViT layer from the middle of the architecture. We did configurations.
so because ViTs are known to perform poorly at the first Table 1 shows the tested architecture, computational com-
few layers of the neural network (called ‘‘stem’’) [22]. plexity (GFLOPs), the number of parameters, and the average
To overcome the time-consuming training process associated quality of the reconstructed images (PSNR). We report the
with testing all possible combinations, we adopted a more image PSNR results with 1/6 bandwidth ratio, 10 dB SNR.
efficient approach. Specifically, we replaced convolutional GDN denotes whether the generalized divisive normaliza-
layers selectively either in the encoder (e.g., C-C-V-C-C-C) tion layer [28] is adopted. We used ‘C’ for convolutional
or the decoder (e.g., C-C-C-V-C-C) to test each combination layers and ‘V’ for ViT layers, and their index denotes their
independently. layer number. For example, C-C-V-V-C-C means we used
We also adhered to the approach suggested in [22], ViT layers at layers 2 and 3 (see Fig. 1) and convolutional
imposing a constraint that convolution stages should pre- layers for all other layers. Each architecture is trained for
cede Transformer stages [17], [22], [29]. This decision 100 epochs.

71532 VOLUME 11, 2023


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

The C-C-V and V-C-C architecture performed best To address hardware implementation challenges and
regarding produced image quality at the encoder and accommodate the Gaussian-like nature of the neural-
the decoder, respectively. As a result, we adopted the network-modulated symbols (as depicted in Fig. 3a), symbol
C-C-V-V-C-C architecture to our semantic communications clipping was employed using constant thresholds to align
system. Furthermore, this architecture had the lowest com- with the digital-to-analog converter (DAC) requirements.
putation complexity and trainable parameters, thanks to the Additionally, pilot symbols were introduced at the beginning
parameter-FLOPs efficiency of the ViTs compared to CNNs and end of the transmitted symbols, to enable signal detection
with large kernel sizes (e.g., 5 or 9). We also found that via autocorrelation and gain compensation (see Fig. 3b and
the generalized divisive normalization (GDN) layer, which is Fig. 3c).
a normalization layer typically used in image compression Fig. 3c reveals the presence of I/Q imbalance problems
community [28] and prior works [7], is not beneficial for arising from impairments in our USRP device. To miti-
ViTs. This is possibly due to the Layer Normalization [30] in gate these issues, we developed a model to characterize the
the ViT layer, and is also consistent with the recent research interference between the I/Q symbols, as demonstrated in
result [31], which reports that GDN resulted in training (5). In these equations, x̂i /x̂q represents the interfered I/Q
instability with ViTs. We thus removed it to get additional symbols, xi /xq denotes the original I/Q sequence, and ki /kq
GFLOPs and parameter reduction. Note that due to hardware represents the interference constant associated with each
architecture and software optimizations, calculated GFLOPs respective component:
are not proportional to latencies (e.g., the C-C-V-V-C-C
architecture is about 1.3× slower at training than is the x̂i = xi + kq xq , (5)
C-C-C-C-C-C). x̂q = ki xi + xq . (6)
In the next step, we estimated the constants ki and kq by
B. TRAINING AND SYSTEM SETUP
utilizing the pilot signal transmitted prior to the main symbol
For training and evaluation, we used a desktop PC equipped
transmission. Through this calibration process, we success-
with Nvidia GeForce RTX 2080 Ti 11 GB. We used the
fully obtained the I/Q symbols without interference. It is
CIFAR-10 [32] dataset for training/validations and reported
important to note that these calibration procedures, as well
test results based on the CIFAR-100 test dataset, both
as our coarse hardware implementations, introduced addi-
of which consist of 60,000 32 × 32-pixel color images
tional noise. More precise hardware implementations have
(50,000 training, 10,000 test images). Unless stated other-
the potential to further enhance performance, and we leave
wise, we used an Adam optimizer with 0.0001 constant
them for future work.
learning rate. We trained the model for 600 epochs.
We implemented a wireless semantic communications
IV. RESULTS AND ANALYSIS
system based on the NI USRP-2943R FPGA platform
A. IMAGE PSNR PERFORMANCE
and PXIe-1082 chassis for experiments in a real wireless
channel. Fig. 2 shows the wireless transmission process Fig. 4 compares decoded image quality (PSNR) between
of USRP-based semantic communication systems. We first proposed SemViT, CNN-based DeepJSCC, and conventional
encoded the given image with the encoder network in the better portable graphics1 (BPG) [33] format. To facilitate a
host PC. The encoded symbols were then sent to the USRP direct comparison with previous works [7], we considered
via LAN with UDP protocol, after which the USRP con- several bandwidth ratios, namely 1/12, 1/6, 1/4, 1/3, and 1/2.
ducted the wireless transmission of the received symbol. The These ratios were chosen to align with the parameters used
USRP then delivered the channel-corrupted symbol back to in the referenced study, allowing for easy and meaningful
the PC, where the signal was decoded to produce the output comparisons between our results and those reported in the
image. literature.
We conducted wireless transmission experiments using For the BPG+LDPC approach, we evaluated the perfor-
a single USRP device equipped with two omnidirectional mance in terms of the best PSNR value across various
antennas: one for transmission and another for reception. combinations of code rates and modulations. We evaluated
The transmission was carried out with a base frequency of various combinations of code rates and modulations, includ-
2 GHz and a bandwidth of 1 MHz. The experiments were ing (3072, 6144), (3072, 4608), and (1536, 4608), which
performed in line-of-sight (LoS) environments, as depicted correspond to 1/2, 2/3, and 1/3 code rates, respectively.
in Figs. 2 and 9a. To decode the received signals, we initially In terms of modulations, we considered options such as
performed gain compensation using the pilot signals. Subse- BPSK, 4-QAM, 16-QAM, or 64-QAM. To ensure compat-
quently, we employed a neural network decoder that had been ibility between the code bits and the bits per symbol in the
trained in simulated AWGN environments. It is important to selected modulation schemes, we made sure that the number
note that our focus in this study is to present initial proof-of- of code bits was divisible by the number of bits per symbol
concepts for neural network-modulated symbol transmission, 1 BPG format is an image format based on HEVC (H.265) video codec
and as such, the specific parameters mentioned above can be and is one of the most efficient image compression methods among the non-
adjusted as needed. neural network-based image codec.

VOLUME 11, 2023 71533


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

FIGURE 4. (a): Decoded image quality at bandwidth ratio=1/6 with respect to channel SNRs. (b), (c): Transmitted image PSNR with regard to bandwidth
ratio at 0 dB and 10 dB SNR, respectively. All results are reported from the models trained at the same SNR and BW ratio to the evaluation setup.

(1, 2, 4, or 6). On the other hand, for the BPG+capacity B. COSINE SIMILARITY ANALYSIS
approach, we focused on calculating the theoretical capac- We chose a metric spatial-wise average cosine similarity to
ity of the AWGN channel based on the SNR, which was show the diversity of the produced features. We interpreted
expressed in bit/s/Hz or bit/symbol. Using this information, each layer’s output features X ∈ RH ×W ×C as a set of
we determined the required image file size that matched the C-dimensional vectors and computed average cosine similar-
number of symbols transmitted. Consequently, we adjusted ities S between all H × W vectors as follows:
the quality factor of BPG compression to ensure compliance
1 X X XTi,j,: Xp,q,:
with the file size restrictions. Finally, we reported the PSNR S= . (7)
of the encoded image obtained using this approach. HW (HW − 1) p,q Xi,j,: Xp,q,:
i,j
As expected, the proposed SemViT outperformed (p,q)̸ =(i,j)

DeepJSCCs in all regions, proving ViT’s benefits in semantic SemViT produces more diverse features. In Fig. 5a,
communications. Interestingly, the performance gap between we show the average cosine similarity with respect to the
SemViT and DeepJSCC increased as the channel SNR and number of symbols in layer 2, which produces final fea-
bandwidth ratio rose, effectively narrowing the gap between tures before being projected to symbols. SemViT consistently
conventional separate source-channel coding-based meth- reduced the average cossims as the number of symbols
ods and semantic communications in the high SNR and increased, while DeepJSCC failed to reduce them in a large
bandwidth ratio region. DeepJSCC particularly underper- number of symbol regions. Instead, in DeepJSCC, sym-
formed BPG+capacity at AWGN 10 dB, 1/4 bandwidth ratio bol diversification is conducted by the mere weighted sum
(Fig. 4c), but SemViT effectively utilized given data rates and of the features in the linear projection layer behind (see
SNR to beat the DeepJSCC and even BPG+capacity. Fig. 5a and Fig. 5c). This coincides with the tendency for
This is in accord with our first intuitions–SemViT the image-quality gap to increase as the number of sym-
can effectively reduce redundancies between represen- bols increases. Furthermore, among all regions, the average
tations and diversify output features, thanks to its cosine similarity of the SemViT is significantly lower than
content-adaptiveness and global receptive field. According DeepJSCC. Considering both results, we can interpret the
to this interpretation, the ViT’s ability to produce diverse performance gain as a result of SemViT’s ability to pro-
features led to a more significant performance gap in the duce more diverse features, thanks to its content-adaptiveness
higher SNR and bandwidth ratio region, where semantic com- and global receptive field. CNN’s failure to produce diverse
munication systems should convey more detailed information features can be explained by the significant redundancy of
(e.g., high-frequency components of the image). The lower learned filters [34] (a well-known problem), leading to redun-
gap in the lower SNR and bandwidth region can be explained dant representations.
by the CNN’s ability to extract critical features in the image Note that cosine similarity was not incorporated as a metric
or the necessity of redundant features to deal with the harsher in our loss function during the training of the neural net-
channels. To support these ideas, we chose two metrics for work, and the cosine similarity results are obtained from
the analysis–cosine similarity and Fourier analysis. Detailed the network’s primary objective of maximizing its decoding
analysis and rationales are explained in the following performance. Our insights based on cosine similarity analysis
sections. are as follows:

71534 VOLUME 11, 2023


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

FIGURE 5. (a), (b): Cosine similarity at layer 2 (last features before symbol projection layer) with respect to bandwidth ratio and channel SNR,
respectively. (c), (d): Cosine similarity at complex symbol with respect to BW ratio and SNR. (a) and (c) are measured in AWGN 10 dB channel, and
(b) and (d) are with 512 symbols. Note that layer two output has 256-dims while the complex symbol is a 2-dimensional vector, so a direct comparison
of cossim values might be unfair.

FIGURE 6. Evolution of average cosine similarities over epochs in encoder network. Every encoder layer produces features with lower average cosine
similarities as the training continues. (a), (b), (c): Cosine similarities at layers 0, 1, and 2, respectively. See Fig. 1 for layer naming.

ViTs can be good at source coding, while CNNs may image quality. Furthermore, as shown in Fig. 5a, the cosine
be so at channel coding. In Fig. 5b, DeepJSCC shows a similarity gap between SemViT and DeepJSCC increases as
weak tendency to diversify its output features as the channel the channel SNR and bandwidth ratio rises, which aligns with
SNR rises, whereas SemViT shows no significant tendency the PSNR results. We are not arguing that average cosine
to do so. SemViT’s invariability to channel SNRs is rather similarity is the perfect metric that measures the amount of
complemented by the linear projection layer behind (see information, but we believe it is good enough to show the
Fig. 5b and Fig. 5d). This implies the specialization between difference between the ViTs and CNNs.
ViTs (for the source coding) and projection layers (kind of
channel coding). In contrast, CNN’s cosine similarity at C. FOURIER ANALYSIS
layer 2 is inversely proportional to channel SNRs, while the We perform a Fourier analysis on the input image, analyzing
tendency becomes weaker in the symbols produced by the the behavior of each layer’s frequency filtering up to the final
1D-projection layer (See Fig. 5b and Fig. 5d). This pattern feature prior to image generation (i.e., the output of layer 3).
suggests the possibility that the CNNs are best at the channel Specifically, we begin by averaging the features produced
coding among the ViTs, projection layers, and CNNs. Also, by layer 2 across the channel dimension to obtain a matrix
the inconsistency of symbol cosine similarity with respect X ∈ RH ×W , and then calculate the log-amplitude difference
to train SNRs may be due to oversimplified symbol produc- from the DC component.To maintain clarity and consistency
tion. As a solution for building symbols, linear projection with prior work [27], we report only the half-diagonal portion
followed by reshaping might not be a good one. Note that of the 2D discrete Fourier transform (DFT) of the averaged
the architecture of our proposed SemViT is not based on these features. The mathematical formula for obtaining the Fourier
insights, as our purpose in designing SemViT is to find initial H
analysis result y ∈ R⌊ 2 ⌋+1 of the given features, computed
designs of ViT-based semantic communications systems and from 2D DFT representation F of the given features X, is as
to provide preliminary analysis. We leave the architecture follows:
design based on those understanding to future work.
Appropriateness of the metric. Judging the effectiveness of
H −1 W −1
the chosen metric is a difficult work. Hence, we show that X X nk ml
Fk,l = Xn,m e−j2π H + W , (8)
the network is trained to reduce average cosine similarity.
n=0 m=0
Fig. 6 shows the average cosine similarity of the encoder
ŷ = log(diag(|F|)) − log(diag(|F0,0 |I)), (9)
layers decreases as the training epochs lengthen, showing
the association between lower cosine similarities and the y = ŷ0:⌊ H ⌋+1 . (10)
2

VOLUME 11, 2023 71535


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

FIGURE 7. top: fourier analysis from the input image to the final layer output. The gray arrow shows how the relative amplitude changes as the layer
index increases. Bottom: Layer-wise average cosine similarity. We denoted whether the given layer is a high-pass filter (HPF) or a low-pass filter (LPF)
based on the fourier analysis. The colored region (layers 0-2) represents the encoder part.

Note that the operator diag(·) generates the diagonal vec- DeepJSCC and the extraction of high-dimensional features
tor from a given matrix, while log(·) carries out an (256-dims) from the extremely low-dimensional symbol vec-
element-wise logarithmic operation. Additionally, |F| indi- tors (e.g., for 1/6 BW ratio, reshaped dimension of layer 3’s
cates element-wise absolute value of complex-valued matrix input is 8). Also, layer 4 in SemViT decreases cosine similar-
F, and |F0,0 | signifies the magnitude of the DC component. ity while acting like a low-pass filter, which can be because of
Equation 10 calculates the half-diagonal elements from the selective solid amplification of mid-band signal (around 0.5π
computed difference in log-amplitude. Based on the Fourier radians, see Fig. 7d) while inhibiting other high-frequency
analysis, we have observed the following results: regions.
Encoders are high-pass filters, while decoders are low- ViTs produce many more high-frequency details, especially
pass filters. As can be seen in Fig. 7, in the encoder network, in the high-BW region. Comparing Fig. 7a and Fig. 7c,
the amplitude of high-frequency components is continuously or Fig. 7b and Fig. 7d suggests the key difference between
increased, whereas the decoder works oppositely. Layer 5 the low- and high-BW ratio region is the amplitude of the
exceptionally acts as a weak high-pass filter, possibly due high-frequency components. Generally, the high-frequency
to the generation of high-frequency details of the image, components of the image contain the details (e.g., textures),
but the relative amplitude difference is not that significant. while the low-frequency parts consist of the rough shape.
Interestingly, unlike CNNs that behave like HPFs in the image Therefore, we can interpret the PSNR gain on the high-BW
classification tasks [27], convolutional layers in the decoder ratio in Fig. 4 is thanks to the increased details. Further-
network consistently behave like LPFs. This is likely due to more, SemViT shows a much larger amplitude in mid- and
the ‘‘unpacking’’ properties of the decoder, which decodes high-frequency regions than DeepJSCC, which are also com-
the highly compressed (high-pass filtered) features produced pliant with the more significant PSNR gap at a larger BW
by encoder networks. Also note that the operating dynamics ratio (Fig. 4c) and the conclusion of Section IV-B. Note that
of the CNN can be different between the image classification the ViT decoder still behaves like a robust low-pass filter (i.e.,
network and the semantic communications system, as the the overall amplitude difference of the final encoder output
training objective is different. and the ViT decoder features are relatively huge in high-
ViTs behave like strong LPFs in the decoder. This affirms frequency regions), but selectively retains high amplitude in
that ViT’s usage in the decoder network is a good idea, as the certain high-frequency bands (e.g., 0.5π and 1π)
decoders are essentially low-pass filters (possibly to suppress
the high-frequency noise induced by the channel). Fig. 7
shows the relative amplitude differences of layer 3, in the D. ATTENTION MAPS
high-frequency regime, are significant in SemViT compared In Fig. 8, we visualize the sublayer attention map of layer 2
to DeepJSCC. Interestingly, although layer 3 acts like low- (the last layer before the symbol projection) and layer 3
pass filters, the cosine similarity decreases in DeepJSCC; (the first layer of the decoder). For visibility, we chose the
This can be due to the weak low-pass filtering effects of attention map at the index (4, 4) and reshaped it to match the

71536 VOLUME 11, 2023


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

FIGURE 8. Visualization of sublayer attention map at layers 2 and 3 on the index (4, 4). The symmetrical structure of global-to-local attention is clearly
visible.

feature map size (8 × 8). The results are averaged over all imperfect gain compensations, reflections, and quantization
CIFAR-10 test sets (10,000 images). errors of the DAC). We further validated the system in the
Surprisingly, the attention map clearly shows the symmet- crowded indoor environment at CES 2023 (Fig. 9), and con-
ric structure of global-to-local attention; This resembles the firmed there was no significant difference in the performance
context, global- and local- hyperprior structure, which was (Fig. 10).
recently proposed by deep-learning-based image compres- Interestingly, the performance gap between the simulated
sion community [35]. This might be further evidence showing and real environment rises as the channel SNR degrades. This
ViT’s strength in source coding. The evident cross-structure is likely due to the disparity in the precision between the
of the attention map is due to the additive positional encoding simulation (32-bit floating points) and the USRP hardware
(not shown in the paper). The low self-weighting is likely due (12-bit fixed-point DAC hardware). As we manipulated the
to the residual connection of the model architecture (i.e., the signal power (and the resulting channel SNR) by reducing
previous features are added to the next feature even without the signal amplitude, more quantization errors were induced
the attention procedure). in the lower SNRs (due to the fixed-point arithmetic of DAC
hadware) to degenerate the reported image quality. This sug-
V. RESULTS IN OTHER ENVIRONMENTS
gests that the semantic communications system should be
To see the generalization of the results in other channel trained to consider RF hardware restrictions (e.g., DAC quan-
environments or metrics, we additionally show the results in tizations, power-to-average-power-ratio (PAPR) constraints).
the Rayleigh fading channel, in the real wireless channel, and We leave them for future work.
Fig. 11c reports the SSIM score of the proposed SemViT
with the structural similarity index metric (SSIM) [36], which
and DeepJSCC in the AWGN channel. Unsurprisingly, the
is a perceptual image quality metric.
proposed SemViT shows better SSIM scores compared to
DeepJSCC, which are coincident with image PSNR results
A. DECODED IMAGE QUALITY (Fig. 4b). Note that the reported SSIM values are with the
Fig. 11a shows the results in the slow Rayleigh fading model trained with PSNR loss and thus can be improved
channel, where the channel is kept unchanged during the by using the SSIM loss in the training procedure. More
transmission of the entire image. We retrained the entire perceptual loss, e.g., MS-SSIM [37] or VGG loss [38], might
network to match the Rayleigh fading environment without be utilized to enable more semantical compression of the
any channel state information to the network. SemViT still images.
outperforms DeepJSCC, especially in the high SNR regions.
The PSNR gap between DeepJSCC and the proposed SemViT B. ANALYSIS IN THE RAYLEIGH FADING CHANNEL
increases as the SNR rises, coinciding with the AWGN results To see if our analysis in the AWGN channel can be gen-
and our previous analysis. eralized to the Rayleigh fading channel, we conducted the
Fig. 11b shows the PSNR results in the real wireless Fourier and average spatial-wise cosine similarity analysis in
channel, which are measured in the USRP-based prototype the Rayleigh channel. Fig. 12a and Fig. 12b shows the Fourier
mentioned in Section III-B. We transmitted encoded constel- analysis results. The filter characteristics of the encoder and
lations at different SNR regions by adjusting the amplitude the decoder in AWGN and Rayleigh channel are almost
of the transmitted signals, i.e., dividing the encoded signal identical. The proposed SemViT trained in the Rayleigh
with varying constants. Each points in the figure denote the channel behaves like a more robust low- or high-pass filter,
average image PSNR of the transmission of the 64 random coinciding with the previous analysis. However, the ampli-
images. As expected, the proposed SemViT shows better tude deviation of encoded features is relatively minor in
image quality than DeepJSCC in the real wireless channel. the Rayleigh channel in both DeepJSCC and SemViT, i.e.,
However, there exists about a 3 dB PSNR gap between the the encoder network produces much less diverse features
simulated (AWGN) results and actual measurements due to to deal with the more unpredictable channel. Also, layer 3
non-Gaussian errors of the channel (i.e., errors induced by in DeepJSCC does not conduct notable transformations in

VOLUME 11, 2023 71537


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

FIGURE 9. (a): Real-time demonstration of the proposed system in a crowded indoor environment (at CES 2023, Las Vegas, USA), (b) screenshots of the
client/server (for neural encoding/decoding), and (c) the LabVIEW block diagram (for real-time wireless transmission). Both softwares, including neural
network parameters, are available open-source. See https://bit.ly/SemViT/.

FIGURE 10. Examples of original (left) and transmitted images using proposed SemViT (middle) and conventional JPEG (right). For the JPEG image,
we assumed LTE’s modulation-and-constellation scheme targetting 0 dB SNR and inversely calculated the required image bytes to send an equal number
of symbols compared to our proposed system. Transmitted images are randomly chosen from the CIFAR-10 test set.

frequency domains, whereas SemViT still shows obvious image and the layer five output. This explains the lower image
LPF behavior. PSNR quality in the Rayleigh channel.
Notably, DeepJSCC produces low-pass-filtered features in The top part of Fig. 12c shows the layer-wise cosine sim-
layer two, in which SemViT in the Rayleigh channel or even ilarity and whether the layer is HPF or LPF, based on the
DeepJSCC in the AWGN channel conducts high-pass filter- Fourier analysis. The key difference compared to AWGN
ing (top left in Fig. 12c). Due to the CNN decoder’s weakness results is that the cosine similarity of the last layer’s fea-
in low-pass filtering, or decoding, DeepJSCC should pro- tures almost equals 1 in both DeepJSCC and SemViT. One
duce more redundant features in the encoder (LPF behavior) possible interpretation is that the semantic communication
rather than dimensionality reduction (HPF behavior) as like systems try to average all given symbols spatially to deal
in AWGN channel or SemViT, leading to poorer decoded with harsh channel conditions and produce a single pol-
image quality (Fig. 11a). This phenomenon is also seen in ished vector from which the network reconstructs the entire
cosine similarity analysis results – in Rayleigh channels, image. The DeepJSCC decoder’s consistent increase of the
DeepJSCC’s average cosine similarity of layer 2 output cosine similarity, both in Rayleigh and AWGN (Fig. 12c
increases (top left of Fig. 11c) whereas SemViT (top right) or and Fig. 7a), can be explained well in this way. However,
DeepJSCC in AWGN (Fig. 7a) lowers the feature cosine sim- layer 4 of AWGN-trained SemViT behaves differently from
ilarity in layer 2. DeepJSCC and SemViT conduct low-pass the DeepJSCC or Rayleigh-trained SemViT – it diversifies
filtering in the last layer (layer 5), leading to more amplitude its output features while performing low-pass filtering if the
differences in high-frequency regions between the original channel is good enough (e.g., AWGN 10 dB SNR as in

71538 VOLUME 11, 2023


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

FIGURE 11. (a), (b): PSNR results in the Rayleigh and real wireless channel, respectively. (c): Image SSIM results in the AWGN, 0 dB SNR.
We borrow the BPG+capacity data in (a) from [7].

FIGURE 12. (a), (b): Fourier and (c): cosine similarity analysis in Rayleigh fading channels.

(Fig. 7b). Considering the content-adaptivity of the ViTs, this VI. DISCUSSIONS
may mean that SemViT decodes the signal by aggregating In this section, we provide some insights and possible
similar information across all symbols to produce multiple research directions based on the analysis given in Section IV.
‘‘pure’’ information sources, while DeepJSCC can only aver- • Using heterogeneous architectures for semantic commu-
age the adjacent symbols due to its content-agnostic property. nications, which combine the strengths of both ViTs and
This suggests that using ViTs may also benefit interference- CNNs, may be more beneficial than relying on either
canceling applications, e.g., inter-symbol interference or approach alone. This could involve using ViTs for source
self-interference cancellation in full-duplex communications. coding and CNNs for channel coding, with possible
In Fig. 12, the bottom left illustrates the 2D symbol, while applications including the extraction of hierarchical fea-
the bottom right shows the layer two feature cosine similarity tures and the creation of more robust models that are
analysis. As in Fig. 5b, the layer two cosine similarity of resistant to channel noise.
DeepJSCC gradually decreases as the channel SNR rises, i.e., • ViTs can serve as effective LPFs that aid in decoding.
encoded features has some channel adaptivity. Even SemViT, To test their impact on performance, we can incorporate
whose features were channel-agnostic in AWGN, produces at least one ViT layer into the decoder network or intro-
more diverse symbols as the channel improves. This could duce a non-trainable, explicit blurring layer prior to the
be because the Rayleigh channel was so harsh that even the decoder network and analyze the results.
channel-insensitive ViT had to adapt to the SNR, or decoding • A more robust network may be necessary for symbol-
the symbols in AWGN might be too easy for ViTs. The producing layers. Although we did not observe any
cosine similarity of the 2D symbols does not show any clear inverse-proportional relationships between chan-
evident tendencies to adapt better to channel environments, nel SNR and symbol cosine similarities, this may be
which is corresponsive to AWGN results (Fig. 5d), and attributed to the current oversimplified symbol pro-
still suggests using a more robust network for final symbol jection layer. To address this, future research may
production. explore the use of channel-wise attention to generate

VOLUME 11, 2023 71539


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

symbols instead of the simple projection and reshaping [7] D. B. Kurka and D. Gündüz, ‘‘DeepJSCC-f: Deep joint source-channel
method. coding of images with feedback,’’ IEEE J. Sel. Areas Inf. Theory, vol. 1,
no. 1, pp. 178–193, May 2020.
• More efficient ViT-based models can be developed [8] J. Dai, S. Wang, K. Tan, Z. Si, X. Qin, K. Niu, and P. Zhang, ‘‘Nonlinear
to reduce encoding/decoding latencies. Our work did transform source-channel coding for semantic communications,’’ IEEE J.
not specifically focus on the computational efficiency Sel. Areas Commun., vol. 40, no. 8, pp. 2300–2316, Aug. 2022.
[9] C. Liaskos, ‘‘XR-RF imaging enabled by software-defined metasurfaces
of SemViT or compare it to conventional image and machine learning: Foundational vision, technologies and challenges,’’
transmission systems. This is because conventional IEEE Access, vol. 10, pp. 119841–119862, 2022.
communication systems typically operate on special- [10] E. C. Strinati and S. Barbarossa, ‘‘6G networks: Beyond Shannon towards
ized hardware, whereas our proposed system utilizes a semantic and goal-oriented communications,’’ Comput. Netw., vol. 190,
May 2021, Art. no. 107930.
general-purpose graphics processing unit, making a fair [11] H. Yoo, T. Jung, L. Dai, S. Kim, and C. Chae, ‘‘Demo: real-time semantic
comparison difficult. However, with the critical latency communications with a vision transformer,’’ in Proc. IEEE Int. Conf.
requirements of future B5G/6G systems in mind, it is Commun. Workshops (ICC Workshops), May 2022, pp. 1–2.
[12] D. Gündüz, Z. Qin, I. E. Aguerri, H. S. Dhillon, Z. Yang, A. Yener,
imperative to research more efficient neural network K. K. Wong, and C. Chae, ‘‘Beyond transmitting bits: Context, semantics,
architectures for semantic communications. and task-oriented communications,’’ IEEE J. Sel. Areas Commun., vol. 41,
no. 1, pp. 5–41, Jan. 2023.
[13] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, ‘‘Extracting and
VII. CONCLUSION
composing robust features with denoising autoencoders,’’ in Proc. 25th Int.
The SemViT system proposed in this paper used a Vision Conf. Mach. Learn. (ICML), 2008, pp. 1096–1103.
Transformer to enhance image transmission performance [14] H. Xie, Z. Qin, G. Y. Li, and B. Juang, ‘‘Deep learning enabled seman-
in semantic communications. The experiments conducted tic communication systems,’’ IEEE Trans. Signal Process., vol. 69,
pp. 2663–2675, 2021.
in various regions show that SemViT outperforms conven- [15] Z. Weng and Z. Qin, ‘‘Semantic communication systems for speech trans-
tional CNN-based methods in all regions, particularly in mission,’’ IEEE J. Sel. Areas Commun., vol. 39, no. 8, pp. 2434–2444,
high-SNR and bandwidth ratio regimes. We verified the Aug. 2021.
[16] T. Tung and D. Gündüz, ‘‘DeepWiVe: Deep-Learning-Aided wireless
system’s availability in real-world wireless channels by con- video transmission,’’ IEEE J. Sel. Areas Commun., vol. 40, no. 9,
ducting extensive experiments on a USRP-based wireless pp. 2570–2583, Sep. 2022.
semantic communications testbed, which have been made [17] X. Wang, R. Girshick, A. Gupta, and K. He, ‘‘Non-local neural networks,’’
publicily available as an open-source to enable reproducibil- in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018,
pp. 7794–7803.
ity. We also conducted a thorough analysis to determine how [18] Q. Hu, G. Zhang, Z. Qin, Y. Cai, G. Yu, and G. Ye Li, ‘‘Robust semantic
a Vision Transformer can improve semantic communications communications against semantic noise,’’ 2022, arXiv:2202.03338.
systems. Our analysis stated that 1) encoders are essentially [19] J.-B. Cordonnier, A. Loukas, and M. Jaggi, ‘‘On the relationship between
self-attention and convolutional layers,’’ in Proc. Int. Conf. Learn. Repre-
HPF and decoders are LPF, 2) ViTs are good source-coders sent. (ICLR), May 2019.
and diversify the encoded representations, 3) ViTs are bene- [20] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai,
ficial in decoders thanks to their strong LPF behavior, and T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly,
4) CNN might be good at channel coding, affirming the J. Uszkoreit, and N. Houlsby, ‘‘An image is worth 16×16 words: Trans-
formers for image recognition at scale,’’ in Proc. Int. Conf. Learn.
combined usage of ViT and convolutional layers. We hope Represent. (ICLR), May 2021.
our work provides some deeper insights and facilitates further [21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
studies. L. U. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv.
Neural Inf. Process. Syst. (NeurIPS), vol. 30, Dec. 2017, pp. 6000–6010.
[22] Z. Dai, H. Liu, Q. V. Le, and M. Tan, ‘‘CoAtNet: Marrying convolution
ACKNOWLEDGMENT and attention for all data sizes,’’ in Proc. Adv. Neural Inf. Process. Syst.
The authors are grateful to T. Jung for his valuable advice and (NeurIPS), vol. 34, Dec. 2021, pp. 3965–3977.
[23] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and
help in implementing their semantic communications testbed. S. Zagoruyko, ‘‘End-to-end object detection with transformers,’’ in Proc.
Eur. Conf. Comput. Vis. (ECCV), Jul. 2020, pp. 213–229.
REFERENCES [24] S. W. Zamir, A. Arora, S. Khan, M. Hayat, F. S. Khan, and M. Yang,
[1] C. E. Shannon, ‘‘A mathematical theory of communication,’’ Bell Syst. ‘‘Restormer: Efficient transformer for high-resolution image restora-
Tech. J., vol. 27, no. 3, pp. 379–423, Jul. 1948. tion,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jun. 2022, pp. 5718–5729.
[2] S. Vembu, S. Verdu, and Y. Steinberg, ‘‘The source-channel separation
theorem revisited,’’ IEEE Trans. Inf. Theory, vol. 41, no. 1, pp. 44–54, [25] M. M. Naseer, K. Ranasinghe, S. H. Khan, M. Hayat, F. Shahbaz Khan, and
Jan. 1995. M.-H. Yang, ‘‘Intriguing properties of vision transformers,’’ in Proc. Adv.
[3] D. Gündüz, Z. Qin, I. E. Aguerri, H. S. Dhillon, Z. Yang, A. Yener, Neural Inf. Process. Syst. (NeurIPS), vol. 34, Dec. 2021, pp. 23296–23308.
K. K. Wong, and C. Chae, ‘‘Guest editorial special issue on beyond [26] D. Hendrycks and T. Dietterich, ‘‘Benchmarking neural network robust-
transmitting bits: Context, semantics, and task-oriented communications,’’ ness to common corruptions and perturbations,’’ 2019, arXiv:1903.12261.
IEEE J. Sel. Areas Commun., vol. 41, no. 1, pp. 1–4, Jan. 2023. [27] N. Park and S. Kim, ‘‘How do vision transformers work,’’ in Proc. Int.
[4] E. Bourtsoulatze, D. B. Kurka, and D. Gündüz, ‘‘Deep joint source-channel Conf. Learn. Represent. (ICLR), Apr. 2022.
coding for wireless image transmission,’’ IEEE Trans. Cognit. Commun. [28] J. Ballé, V. Laparra, and E. P. Simoncelli, ‘‘End-to-end optimized
Netw., vol. 5, no. 3, pp. 567–579, Sep. 2019. image compression,’’ in Proc. Int. Conf. Learn. Representations (ICLR),
[5] N. Farsad, M. Rao, and A. Goldsmith, ‘‘Deep learning for joint source- Apr. 2017.
channel coding of text,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal [29] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. Girshick, ‘‘Early
Process. (ICASSP), Apr. 2018, pp. 2326–2330. convolutions help transformers see better,’’ Proc. Adv. Neural Inf. Process.
[6] H. Xie and Z. Qin, ‘‘A lite distributed semantic communication system Syst. (NeurIPS), vol. 34, pp. 30392–30400, Dec. 2021.
for Internet of Things,’’ IEEE J. Sel. Areas Commun., vol. 39, no. 1, [30] J. L. Ba, J. R. Kiros, and G. E. Hinton, ‘‘Layer normalization,’’ 2016,
pp. 142–153, Jan. 2021. arXiv:1607.06450.

71540 VOLUME 11, 2023


Yoo et al.: On the Role of ViT and CNN in Semantic Communications

[31] Y. Bai, X. Yang, X. Liu, J. Jiang, Y. Wang, X. Ji, and W. Gao, ‘‘Towards SONGKUK KIM (Member, IEEE) received the
end-to-end image compression and analysis with transformers,’’ in Proc. Ph.D. degree in computer science from the Univer-
AAAI Conf. Artif. Intell., vol. 36, no. 1, Jun. 2022, pp. 104–112. sity of Michigan, Ann Arbor, MI, USA, in 2005.
[32] A. Krizhevsky, ‘‘Learning multiple layers of features from tiny images,’’ From 2005 to 2007, he was a Research Staff with
Univ. Toronto, Toronto, ON, Canada, Tech. Rep. TR-2009, Apr. 2009. the Xerox Research Center and a Software Engi-
[33] BPG Image Format. Accessed: Oct. 13, 2022. [Online]. Available: neer with Google, until 2011. He is currently an
https://bellard.org/bpg/ Assistant Professor with the School of Integrated
[34] M. Denil, B. Shakibi, L. Dinh, M. A. Ranzato, and N. de Freitas, ‘‘Predict- Technology, Yonsei University, South Korea. His
ing parameters in deep learning,’’ in Proc. Adv. Neural Inf. Process. Syst.
research interests include machine learning, big
(NeurIPS), vol. 26, Dec. 2013, pp. 2148–2156.
data mining, and cloud computing.
[35] J. Kim, B. Heo, and J. Lee, ‘‘Joint global and local hierarchical priors
for learned image compression,’’ in Proc. IEEE/CVF Conf. Comput. Vis.
Pattern Recognit. (CVPR), Jun. 2022, pp. 5982–5991.
[36] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, ‘‘Image quality
assessment: From error visibility to structural similarity,’’ IEEE Trans.
Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[37] Z. Wang, E. P. Simoncelli, and A. C. Bovik, ‘‘Multiscale structural simi-
larity for image quality assessment,’’ in Proc. 37th Asilomar Conf. Signals,
Syst. Comput., Nov. 2003, pp. 1398–1402.
[38] J. Johnson, A. Alahi, and F.-F. Li, ‘‘Perceptual losses for real-time style
transfer and super-resolution,’’ in Proc. Eur. Conf. Comput. Vis. (ECCV),
Oct. 2016, pp. 694–711.

HANJU YOO (Graduate Student Member, IEEE)


received the B.S. degree (summa cum laude)
from the School of Integrated Technology, Yonsei
University, in 2021, where he is currently pur-
suing the Ph.D. degree. His research interests
include deep neural networks for computer
vision, learned image compression, semantic
communications/deep joint source-channel cod-
ing, and vision transformer architecture.

LINGLONG DAI (Fellow, IEEE) received the


B.S. degree from Zhejiang University, Hangzhou,
China, in 2003, the M.S. degree (Hons.) from CHAN-BYOUNG CHAE (Fellow, IEEE) received
the China Academy of Telecommunications the Ph.D. degree in electrical and computer engi-
Technology, Beijing, China, in 2006, and the Ph.D. neering from The University of Texas at Austin
degree (Hons.) from Tsinghua University, Beijing, (UT), in 2008.
in 2011. Prior to joining UT, he was a Research Engineer
From 2011 to 2013, he was a Postdoctoral with the Telecommunications Research and Devel-
Research Fellow with the Department of Elec- opment Center, Samsung Electronics, Suwon,
tronic Engineering, Tsinghua University. He has South Korea, from 2001 to 2005. He is currently
coauthored the book MmWave Massive MIMO: A Paradigm for 5G an underwood Distinguished Professor with the
(Academic Press, 2016). His current research interests include massive School of Integrated Technology, Yonsei Univer-
MIMO, reconfigurable intelligent surface (RIS), millimeter-wave and ter- sity, South Korea. Before joining Yonsei University, he was with Bell Labs,
ahertz communications, machine learning for wireless communications, and Alcatel-Lucent, Murray Hill, NJ, USA, from 2009 to 2011, as a member
electromagnetic information theory. of the Technical Staff, and Harvard University, Cambridge, MA, USA,
Dr. Dai has received five IEEE Best Paper Awards from the IEEE ICC and from 2008 to 2009, and as a Postdoctoral Research Fellow and a Lecturer.
the IEEE VTC. He was a recipient of the China National Excellent Doctoral Dr. Chae is a fellow of NAEK. He was a recipient/co-recipient of the CES
Dissertation Nomination Award, in 2013, the URSI Young Scientist Award, Innovation Award, in 2023, the IEEE ICC Best Demo Award, in 2022, the
in 2014, the IEEE TRANSACTIONS ON BROADCASTING Best Paper Award, in 2015, IEEE WCNC Best Demo Award, in 2020, the Best Young Engineer Award
the National Natural Science Foundation of China for Outstanding Young from the National Academy of Engineering of Korea (NAEK), in 2019,
Scholars, in 2017, the IEEE ComSoc AP Outstanding Young Researcher the IEEE DySPAN Best Demo Award, in 2018, the IEEE/KICS JOURNAL
Award, in 2017, the IEEE ComSoc AP Outstanding Paper Award, in 2018, OF COMMUNICATIONS AND NETWORKS Best Paper Award, in 2018, the IEEE
the China Communications Best Paper Award, in 2019, the IEEE ACCESS Best INFOCOM Best Demo Award, in 2015, the IEIE/IEEE Joint Award for
Multimedia Award, in 2020, the IEEE Communications Society Leonard G. Young IT Engineer of the Year, in 2014, the KICS Haedong Young Scholar
Abraham Prize, in 2020, the IEEE ComSoc Stephen O. Rice Prize, in 2022, Award, in 2013, the IEEE Signal Processing Magazine Best Paper Award,
and the IEEE ICC Outstanding Demo Award, in 2022. He has been listed as a in 2013, the IEEE ComSoc AP Outstanding Young Researcher Award,
Highly Cited Researcher by Clarivate Analytics, since 2020. He has served as in 2012, and the IEEE VTS Dan. E. Noble Fellowship Award, in 2008. He has
an Editor for the IEEE TRANSACTIONS ON COMMUNICATIONS, from 2017 to 2021, held several editorial positions, including the Editor-in-Chief for the IEEE
the IEEE TRANSACTIONS ON VEHICULAR TECHNOLOGY, from 2016 to 2020, and TRANSACTIONS ON MOLECULAR, BIOLOGICAL, AND MULTI-SCALE COMMUNICATIONS,
the IEEE COMMUNICATIONS LETTERS, from 2016 to 2020. He has also served as a Senior Editor for the IEEE WIRELESS COMMUNICATIONS LETTERS, and an
a Guest Editor for the IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS Editor for the IEEE Communications Magazine, IEEE TRANSACTIONS ON
(JSAC), IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, and IEEE WIRELESS COMMUNICATIONS, and IEEE WIRELESS COMMUNICATIONS LETTERS.
Wireless Communications Magazine. He is currently serving as an Area He was an IEEE ComSoc Distinguished Lecturer, from 2020 to 2023.
Editor for the IEEE COMMUNICATIONS LETTERS.

VOLUME 11, 2023 71541

You might also like