Deep-TEMPEST: Using Deep Learning to Eavesdrop on HDMI from its Unintended Electromagnetic Emanations


Santiago Fernández, Emilio Martínez, Gabriel Varela, Pablo Musé, Federico Larroca
[email protected]
Facultad de Ingeniería, Universidad de la República
Montevideo, Uruguay

arXiv:2407.09717v1 [cs.CR] 12 Jul 2024
Abstract

In this work, we address the problem of eavesdropping on digital video displays by analyzing the electromagnetic waves that unintentionally emanate from the cables and connectors, particularly HDMI. This problem is known as TEMPEST. Compared to the analog case (VGA), the digital case is harder due to a 10-bit encoding that results in a much larger bandwidth and a non-linear mapping between the observed signal and the pixel's intensity. As a result, eavesdropping systems designed for the analog case obtain unclear and difficult-to-read images when applied to digital video. The proposed solution is to recast the problem as an inverse problem and train a deep learning module to map the observed electromagnetic signal back to the displayed image. However, this approach still requires a detailed mathematical analysis of the signal, firstly to determine the frequency at which to tune, but also to produce training samples without actually needing a real TEMPEST setup. This saves time and avoids the need to obtain these samples, especially if several configurations are being considered. Our focus is on improving the average Character Error Rate in text, and our system improves this rate by over 60 percentage points compared to previously available implementations. The proposed system is based on widely available Software Defined Radio and is fully open-source, seamlessly integrated into the popular GNU Radio framework. We also share the dataset we generated for training, which comprises both simulated signals and over 1000 real captures. Finally, we discuss some countermeasures to minimize the potential risk of being eavesdropped on by systems designed based on similar principles.

CCS Concepts

• Security and privacy → Side-channel analysis and countermeasures; • Computing methodologies → Neural networks.

Keywords

Software Defined Radio, Side-channel attack, Deep Learning

ACM Reference Format:
Santiago Fernández, Emilio Martínez, Gabriel Varela, Pablo Musé, and Federico Larroca. 2024. Deep-TEMPEST: Using Deep Learning to Eavesdrop on HDMI from its Unintended Electromagnetic Emanations. In Proceedings of Submitted. ACM, New York, NY, USA, 10 pages. https://doi.org/XXXXXXX.XXXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Submitted, 2024
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/18/06
https://doi.org/XXXXXXX.XXXXXXX

1 Introduction

TEMPEST is a term used to describe the unintentional emanation of sensitive or confidential information from electrical equipment. While it may refer to any kind of emission, such as acoustic and other types of vibrations [31], it primarily deals with electromagnetic waves. In particular, this article focuses on electromagnetic emissions from video displays. The issue of inferring the content displayed on a monitor from the electromagnetic waves emitted by it and its connectors has a long history, dating back to the 1980s with the first public demonstrations by Wim van Eck. This problem is sometimes referred to as Van Eck Phreaking, but for the remainder of this article, we will use the term TEMPEST [29].

Van Eck's research was focused on the then-prevalent CRT monitors. However, Markus Kuhn's work in the early 2000s [15] studied modern digital displays, including both the analog interface VGA (Video Graphics Array) and the digital interfaces HDMI (High-Definition Multimedia Interface) and DVI (Digital Visual Interface). Nevertheless, reproducing these studies was challenging due to the need for expensive and specialized hardware, such as a wide-band AM receiver. This entry barrier has been significantly reduced in recent years by the development of Software Defined Radio (SDR) [30]. SDR employs generic hardware that down-converts the signal to baseband and then provides the sampled signal to the PC, making the hardware more affordable and signal processing simpler, since it is performed in software. These advantages resulted in two open-source implementations of TEMPEST (TempestSDR [21] and gr-tempest [17]) and several empirical studies of the problem, particularly focusing on the HDMI interface [4–6, 10, 18–20, 24, 28]. However, despite all of these efforts, "this threat still is not well-documented and understood" [4]. Our first contribution is precisely to address this issue by providing an analytical expression of the signal's complex samples as received by the SDR when spying on an HDMI display. Virtually all of the above-mentioned studies use an AM demodulation step as part of their processing chain, similar to the first studies by Van Eck with VGA, with the exception of [4],
which experimentally observed that by using FM demodulation, the attacker may also obtain significant information on the display's content. As we will see, our analytical model explains why both the magnitude and the phase of the complex samples provide information on the eavesdropped image. Furthermore, these expressions are crucial when setting up the eavesdropping system to choose the frequency one should tune to in order to get maximum energy.
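As a concrete illustration, for the standard 1920 × 1080 @ 60 Hz timing analyzed in detail in Sec. 3 (2200 × 1125 total pixels per frame, including blanking), the candidate tuning frequencies are simply the first few harmonics of the pixel rate; a minimal sketch:

```python
# Candidate tuning frequencies for a given video mode: harmonics of
# the pixel rate 1/Tp. Timings are the standard CEA-861 values for
# 1920x1080 @ 60 Hz (2200 x 1125 total pixels, including blanking).
Px, Py = 2200, 1125        # total pixels per line, total lines per frame
frame_rate = 60.0          # frames per second

pixel_rate = Px * Py * frame_rate              # 1/Tp in Hz
candidates = [k * pixel_rate for k in (1, 2, 3)]

print(pixel_rate / 1e6)                        # 148.5 (MHz)
print([f / 1e6 for f in candidates])           # [148.5, 297.0, 445.5]
```

Only these few frequencies need to be tested for a given screen, instead of sweeping the whole spectrum by trial-and-error.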
Instead of tuning the SDR to the frequency that obtains the best Signal-to-Noise Ratio through trial-and-error (as in [18–20]), the frequencies to be tested for a particular screen are manageable when based on our analysis.

Equipped with this model, our second contribution is to re-cast the TEMPEST problem as an inverse one; that is, recovering the source image from the baseband complex samples gathered from the SDR. Motivated by the success of deep learning in solving inverse problems in other contexts [23], we propose designing and training a deep convolutional neural network to infer the source image from the baseband complex samples.

To our knowledge, three other works propose deep learning-based algorithms for TEMPEST attacks [10, 18, 19]. Our work differs significantly, overcoming some limitations of these previous studies. In [19], the focus is on smartphone displays rather than HDMI or DVI, which emit much lower power signals. They classified almost unintelligible images from TempestSDR into digits, a simpler 10-class classification task. The works in [18] and [10] target HDMI but are less applicable to realistic scenarios, processing patches with only a few characters. They both apply a denoiser to the grayscale images produced by TempestSDR. Another relevant work is [20], which reconstructs images from electromagnetic emissions of embedded cameras. They used a modified TempestSDR and a GAN-based image translator to restore spied images, offering a potential adaptation to TEMPEST attacks.

Figure 1: Proposed system. The HDMI cable and connectors emit unintended electromagnetic signals, which are captured by the SDR and processed by gr-tempest, obtaining a degraded complex-valued image, which in turn is fed to a convolutional neural network to infer the source image. All three images correspond to actual results.
More in particular, our contributions in this respect are twofold. Firstly, we have developed and publicly shared an open-source implementation of an end-to-end deep-learning architecture. Figure 1 presents an illustrative diagram of the system, including an example of actual results. Our primary focus is on the restoration of text. Our architecture surpasses vanilla implementations of either TempestSDR or gr-tempest, producing significantly higher-quality reconstructed images and achieving an over 60 percentage point reduction in the average Character Error Rate (CER). Furthermore, and based on the insights provided by our analytical model, we avoid the AM demodulation step that all previous works use (as they are based on TempestSDR), which further distorts the signal, and instead learn to map directly from the complex samples to the original image; i.e. we solve the inverse problem. As we report in Sec. 6, using the complex samples and avoiding the information loss incurred in demodulation results in a significant gain in performance.

Secondly, we have made this article's complete dataset publicly available. It includes two sources of data: several real-life signals and a GNU Radio-based simulator, which we developed and are sharing, that, given an image, produces the spied signal. This simulator is based on the analytical expressions derived in this work. Furthermore, we discuss how to train the learning module (partially) based on these simulations, significantly reducing the time-consuming stage of acquiring real-life signals without negatively impacting the quality of the recovered images. The full dataset comprises around 3500 samples, out of which approximately 1300 are real captures. Our aim is that this openness proves useful in further advancing research in this area. Please visit https://github.com/emidan19/deep-tempest for the complete dataset and code.

The rest of the article is structured as follows. The next section discusses the threat model, whereas Sec. 3 provides a detailed overview of the HDMI signal. In Sec. 4, we summarize the working principle of SDR and characterize the forward operator by giving a mathematical expression of the samples produced by the hardware given an input image. How to recover the image from these samples by means of deep learning is discussed in Sec. 5. The obtained results and countermeasures are presented in Secs. 6 and 7. Closing remarks and future work are discussed in Sec. 8.
We assume that the attacker is equipped with off-the-shelf hardware to capture and process these emanations. The necessary equipment includes a laptop with a GPU (although a CPU-only laptop is a viable, albeit slower, alternative), an SDR hardware (see Sec. 4 for a discussion), an antenna, and a Low Noise Amplifier (LNA).
We foresee two separate operational scenarios. Firstly, one where the attacker remains unnoticed, e.g., if the spied system is close to a wall and the attacker operates from the other side. In this case, the setup may include somewhat large directive antennas, and an online operation is viable where, for instance, the attacker adjusts the antenna's direction until a proper image is obtained and only saves the images that they are interested in.

A second scenario is one where only the attacker's hardware goes unnoticed. For instance, a small omnidirectional antenna is left near the HDMI cable and connectors of the spied system, and the spying PC is not visible or does not draw attention. In this case, which requires physical proximity to the spied system, the attacker's PC may periodically (e.g., every second) record a signal, process it to obtain an image, and save it for offline visualization. If hard drive space is not an issue, the attacker may even record the raw samples of the SDR periodically and apply our method to these recordings.

Figure 2: An illustration of the transmission of a frame on a single TMDS channel. The red arrow indicates the order in which the signal is transmitted. Video is actually sent only during the video data periods.

3 Unintended Electromagnetic Emanations of HDMI

3.1 Digital signal

Although there are seven different versions of HDMI (ranging from 1.0 up to 2.1) and five types of connectors (A to E), video is encoded the same way for all of them except for version 2.1. This last version, released in 2017, is typically used only in high-end TVs with 4K or 8K video, and we will not consider it in this work. In any case, HDMI is backward compatible with single-link DVI, so our results are also valid for DVI-D or DVI-I.

To transmit audio and video, HDMI uses three separate TMDS channels which, for video, correspond to the red, blue, and green components; each channel is sent serially over three separate pins (positive, negative, and ground; further details regarding the electrical signal are presented in the next subsection). While 𝑌𝐶𝑏𝐶𝑟 pixel encoding and other color depths are possible, the default configuration is 𝑅𝐺𝐵 encoding with 24 bits. We will thus only consider this configuration for brevity, although extensions to these scenarios are straightforward. As illustrated in Fig. 2, and just as in VGA, each video frame includes a horizontal and vertical blanking, where no video is transmitted. During these periods, audio or control packets are transmitted instead (the so-called control and data island periods).

This means that the pixel rate is actually higher than what is being displayed. For instance, for a resolution of 1920 × 1080 with progressive scan, there are actually 2200 × 1125 pixels per frame (including blanking). In terms of the notation of Fig. 2, this means that 𝑝𝑥 = 1920, 𝑝𝑦 = 1080, 𝑃𝑥 = 2200 and 𝑃𝑦 = 1125, which at a frame rate of 60 Hz represents a pixel rate of 1/𝑇𝑝 = 2200 × 1125 × 60 Hz = 148.5 MHz. Supported resolutions and the corresponding timings may be consulted in the EIA/CEA-861 standard, but it is important to note that the possibilities are limited (e.g. 197 possible timings and resolutions in HDMI 2.0, and only 64 in HDMI 1.4).

Differently from VGA, the intensity of each color (from 256 possible values) is encoded into 10 bits before transmission. The 8-bit input word is first differentially XORed or XNORed using the first bit as reference. The encoder uses the operation that results in fewer bit transitions given the input word, and the choice is indicated in the ninth bit. The second stage either negates the first 8 bits or not (flagged by the tenth bit) to even out the 1s and 0s in the encoded stream. Note that each video data period is encoded independently, meaning that the process is restarted for each line.

3.2 Electrical and electromagnetic signal

After analyzing the digital signal generated by the video, we can now examine the resulting electromagnetic signal surrounding the cable. Our main interest is to determine where the largest portion of its power lies in the spectrum so we can tune our system to that frequency. Additionally, we want to obtain an approximate expression of this electromagnetic signal, which will help us simulate it. This will enable us to produce samples that we can use to train and evaluate our learning system without necessarily using an actual TEMPEST setup. We will defer this last problem to the next section since it also includes the effects of the SDR hardware.

HDMI uses differential signaling, basically meaning that every channel is composed of two cables, where the bit value is estimated from the difference in voltage between the two. That is to say, for any of the three TMDS channels, the voltage signals 𝑥+(𝑡) and 𝑥−(𝑡) in both cables would be:

𝑥+(𝑡) = 𝑉𝑐𝑐 + Σ𝑘 𝑥𝑏[𝑘] 𝑝(𝑡 − 𝑘𝑇𝑏),  (1)

𝑥−(𝑡) = 𝑉𝑐𝑐 − Σ𝑘 𝑥𝑏[𝑘] 𝑝(𝑡 − 𝑘𝑇𝑏),  (2)

where 𝑉𝑐𝑐 is a constant, 𝑥𝑏[𝑘] corresponds to the mapping of the 𝑘-th bit (e.g. a negative voltage for 0 and a positive one for 1), 𝑇𝑏 is the bit duration, and 𝑝(𝑡) is the shaping pulse (typically a rectangular pulse of duration 𝑇𝑏).

The immediate consequence is that under an ideal system and observing both cables together as in our case, we would measure 𝑥(𝑡) = 𝑥+(𝑡) + 𝑥−(𝑡) = 2𝑉𝑐𝑐, which is independent of the information-carrying sequence 𝑥𝑏[𝑘]. However, as observed in previous works [28], the pulses in 𝑥+(𝑡) and 𝑥−(𝑡) are not perfectly aligned nor exactly the same. For instance, assuming that 𝑥−(𝑡) is
delayed a time 𝜖𝑇𝑏 with respect to 𝑥+(𝑡), we would obtain

𝑥(𝑡) = 𝑥+(𝑡) + 𝑥−(𝑡) = 2𝑉𝑐𝑐 + Σ𝑘 𝑥𝑏[𝑘] 𝑞(𝑡 − 𝑘𝑇𝑏),  (3)

where 𝑞(𝑡) = 𝑝(𝑡) − 𝑝(𝑡 − 𝜖𝑇𝑏).  (4)

That is to say, ignoring the constant 2𝑉𝑐𝑐, we obtain a classic PCM (Pulse-Code Modulation) signal with conforming pulse 𝑞(𝑡). By adding a random delay to 𝑥(𝑡), we can study it as a Wide-Sense Stationary signal whose Power Spectral Density (i.e. the expected power per Hertz) has the following well-known expression:

𝑆𝑋(𝑓) = (|𝑄(𝑓)|²/𝑇𝑏) 𝑆𝑋𝑏(𝑓) = (4 sin²(𝜋𝑓𝜖𝑇𝑏)/𝑇𝑏) sinc²(𝑓𝑇𝑏) 𝑆𝑋𝑏(𝑓),  (5)

where 𝑆𝑋𝑏(𝑓) = Σ𝑙 𝑅𝑋𝑏[𝑙] e^{−j2𝜋𝑓𝑙𝑇𝑏} and 𝑅𝑋𝑏[𝑙] = E{𝑥𝑏[𝑘]𝑥𝑏[𝑘+𝑙]}; that is to say, the Discrete-Time Fourier Transform 𝑆𝑋𝑏(𝜔) of the auto-correlation of the sequence 𝑥𝑏[𝑘], evaluated at 𝜔 = 2𝜋𝑓𝑇𝑏. Note that 𝑆𝑋𝑏(𝑓) is a periodic function of period 1/𝑇𝑏 (the bit rate).

It is typically the case that consecutive frames in the spied monitor are very similar (if not identical). This is also true for contiguous lines. Denoting as 𝑇𝑝 the pixel time (i.e. 𝑇𝑝 = 10𝑇𝑏), and recalling that each line is encoded independently, the previous two observations mean that high values of 𝑆𝑋𝑏(𝑓) should be expected at multiples of 𝑓 = 1/(𝑃𝑥𝑃𝑦𝑇𝑝) (the frame rate) as well as 𝑓 = 1/(𝑃𝑥𝑇𝑝) (the horizontal line rate). Furthermore, given that TMDS encoding enforces no DC component, 𝑆𝑋𝑏(0) ≈ 0.

Figure 3: The power spectral density of a TMDS encoded signal, computed by multiplying an estimation of 𝑆𝑋𝑏(𝑓) and |𝑄(𝑓)|²/𝑇𝑏 (the dashed red curve, shown for reference); cf. Eq. (5). Both curves are normalized to their maximum value for clarity. Significant spikes every multiple of 0.1/𝑇𝑏 are clearly visible. In the zoom-in around 𝑓 = 0.3/𝑇𝑏 shown below, smaller but nevertheless important spikes every multiple of 1/(𝑃𝑥𝑇𝑝) (the inverse of the duration of each horizontal line) are also clearly visible.
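To make the role of the encoding concrete, the following is a sketch of the TMDS coding described in Sec. 3.1. The first (transition-minimizing) stage follows the DVI/HDMI specification; the second (DC-balancing) stage is deliberately simplified here, inverting whenever the word has more 1s than 0s and ignoring the running disparity counter that a real encoder maintains across the video data period:

```python
def tmds_encode(byte):
    """Sketch of TMDS 8b/10b encoding for one pixel-color byte.

    Stage 1 (as in the DVI/HDMI specification): chain the input bits
    with XOR or XNOR, choosing the operation that yields fewer
    transitions; bit 8 flags the choice. Stage 2 (simplified here):
    invert bits 0-7 when they contain more 1s than 0s; bit 9 flags
    the inversion. A real encoder instead tracks a running disparity
    across the whole video data period.
    """
    d = [(byte >> i) & 1 for i in range(8)]      # input bits, LSB first
    n1 = sum(d)                                  # number of 1s
    use_xnor = n1 > 4 or (n1 == 4 and d[0] == 0)
    q = [d[0]]
    for i in range(1, 8):
        bit = q[i - 1] ^ d[i]
        q.append(1 - bit if use_xnor else bit)   # XNOR = negated XOR
    q.append(0 if use_xnor else 1)               # bit 8: operation used
    invert = sum(q[:8]) > 4                      # simplified DC balance
    word = [1 - b for b in q[:8]] if invert else q[:8]
    return word + [q[8], 1 if invert else 0]     # 10 bits, LSB first

print(tmds_encode(0x10))  # [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
```

The inversion stage is what tends to even out the 1s and 0s in the stream, and hence the lack of a DC component discussed above.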
The other relevant time scale is precisely 𝑇𝑝, since consecutive pixels are similar. Note that the analysis in this case is complicated by the non-linear encoding we discussed before. As a first step, let us consider a constant image, which produces at most two different encoded words (the differentially encoded word or its negation), which are sent alternately, least significant bit first. This process will produce an 𝑆𝑋𝑏(𝑓) with large spikes at every multiple of 1/𝑇𝑝, since under a constant image, bits 10 bits apart are typically the opposite (i.e. typically 𝑥𝑏[𝑘] = −𝑥𝑏[𝑘 + 10]). Another significant spike should be present at 1/(2𝑇𝑝), too, since bits 20 bits apart are typically the same.

This intuition is verified for more complex encoded images, as shown in Fig. 3, which displays an estimation of 𝑆𝑋𝑏(𝑓) for a TMDS signal corresponding to eight frames of a user typing in a word processor, multiplied by |𝑄(𝑓)|²/𝑇𝑏 (cf. Eq. (5)), along with |𝑄(𝑓)|² for reference (using 𝜖 = 0.002). Note that the significant increase in 𝑆𝑋𝑏(𝑓) at 𝑓 ≈ 0.05/𝑇𝑏 = 1/(2𝑇𝑝) is attenuated by |𝑄(𝑓)|², whereas the peaks at every multiple of 0.1/𝑇𝑏 = 1/𝑇𝑝 are not. The lower graph in the figure displays a zoom-in to the third pixel harmonic (marked with a blue slashed rectangle), where the peaks corresponding to multiples of 1/(𝑃𝑥𝑇𝑝) are clearly visible.

The conclusion of this section is that most of the power of the emanations from an HDMI signal is located at the first few multiples of the pixel rate. Naturally, the precise expression of 𝑞(𝑡) in (3) is not known a priori. In (5), we have only assumed unaligned pulses (with an unknown 𝜖), but other differences may also exist. Regarding where most of the leaked power lies, a first approximation, like the one we presented, is enough. Furthermore, and quite interestingly, as discussed in the following two sections, this expression will also be enough to produce simulations that may be used to train a learning system that maps samples of the emitted signal to the source image that produced them.

Figure 4: Diagram of an SDR. The drivers provide complex samples 𝑦[𝑙] whose real and imaginary parts correspond to the in-phase and quadrature components.

4 Software Defined Radio

Having characterized our signal of interest 𝑥(𝑡) in (3), let us now discuss how to intercept it and, furthermore, provide an analytic expression for the signal captured by the SDR, and thus the one we may consider to perform the eavesdropping.

4.1 Hardware

As illustrated in Fig. 4, an SDR hardware moves the signal to baseband and provides its filtered samples. These samples will be processed in software to produce the eavesdropped image. Starting from (3), and ignoring the constant term, we may interpret 𝑥(𝑡) as a train of Dirac deltas that goes through a filter with impulse response 𝑞(𝑡). However, since we are down-converting this signal to baseband, the complex baseband representation of this channel is actually a filter with impulse response 𝑔(𝑡) = F⁻¹{𝑄(𝑓 + 𝑓𝑐)𝐻𝐿𝑃𝐹(𝑓)} (see for example [9]); that is to say, the inverse Fourier transform of the product between the Fourier transform of 𝑞(𝑡), moved to zero from the tuning frequency 𝑓𝑐 (which, as we discussed before, will be equal to a harmonic of 1/𝑇𝑝), and the transfer function of the SDR's low-pass filter. If a sampling rate 𝑓𝑠 is used, then 𝐻𝐿𝑃𝐹(𝑓) is ideally zero for |𝑓| > 𝑓𝑠/2 and a constant otherwise. In other words, instead of filtering the train of Dirac deltas with 𝑞(𝑡), we
use 𝑔(𝑡), whose Fourier transform 𝐺(𝑓) is 𝑄(𝑓) evaluated around 𝑓𝑐 and zeroed for |𝑓| > 𝑓𝑠/2. This process is illustrated in Fig. 5 using 𝑞(𝑡) as defined in (4), 𝑓𝑐 = 3/𝑇𝑝 and 𝑓𝑠 = 1/(30𝑇𝑏).

Figure 5: Normalized Fourier Transform of 𝑞(𝑡) (i.e. Eq. 4 with 𝜖 = 0.002) and 𝑔(𝑡), the complex baseband representation of the channel as seen by the SDR.

All in all, after sampling, the following sequence is obtained:

𝑦[𝑙] = Σ𝑘 𝑥𝑏[𝑘] 𝑔(𝑙/𝑓𝑠 − 𝑘𝑇𝑏).  (6)

We may further enrich the model by adding noise, small errors to 𝑓𝑐 (instead of precisely a multiple of the pixel rate), and offsets in both time and phase (uniform between zero and 1/𝑓𝑠 or 2𝜋, respectively). These impairments are included in our simulations to make the learning system more robust to these non-idealities. Note, however, that we are ignoring the antenna's bandwidth and possible non-linearities.

Regarding the sampling rate, mid-level SDRs allow for, at most, some tens of MHz. For example, the USRP B200-mini [7] we used in our experiments has a maximum sampling rate of 𝑓𝑠 = 50 MHz. Just as in the example in Fig. 5, this is only a third of the pixel rate at a resolution of 1920 × 1080 @ 60 Hz (for which 1/𝑇𝑝 = 148.5 MHz), meaning that each sample 𝑦[𝑙] will actually be a linear combination of several tens of encoded bits, further complicating the image reconstruction.

In fact, since the anti-aliasing filter of the SDR produces a 𝐺(𝑓) that is zero for |𝑓| > 𝑓𝑠/2, and 𝑓𝑠 ≪ 1/𝑇𝑏 as we just discussed, the resulting loss of information means that the attacker cannot recover the sequence of bits 𝑥𝑏[𝑘] by observing the samples 𝑦[𝑙]. It may appear that a viable alternative is to increase the sampling rate 𝑓𝑠 up to 1/𝑇𝑏 and, after equalization, sample each bit separately and decode the image. There are three important drawbacks to this approach. Firstly, it would require an SDR that operates with a sampling rate and a corresponding instantaneous bandwidth of at least some GHz, which even high-end and extremely expensive solutions struggle to provide (e.g. the USRP X440 by Ettus Research provides up to 3200 MHz of bandwidth at a cost of over 25,000 dollars [8]). Secondly, it is unclear whether the interference from other sources (received due to the increased receiver bandwidth) would not prove detrimental to recovering the image. Last but not least, there is the problem of processing such an enormous amount of samples, which would further impact the resulting cost of the spying setup, this time in terms of the required PC.

For the above reasons, we will consider a sampling rate 𝑓𝑠 such as those obtained from less expensive (and also less conspicuous) hardware, which will thus unavoidably result in an unrecoverable bit sequence 𝑥𝑏[𝑘]. However, recall that the attacker's actual objective, as in any communications problem, is to estimate the most plausible image that generated the observed complex sequence 𝑦[𝑙]. We propose a data-driven approach to this problem that leverages the a priori information regarding what kind of images are typically displayed on a monitor (i.e., the original images used in the training set should be representative of desktop content). This is accomplished through a deep-learning module, which we present in detail in the next section. Before that, the following subsection discusses how, for the sake of simplicity, this estimation is simply computed as |𝑦[𝑙]| in TempestSDR.

4.2 Software

Regarding software, samples are provided by the driver and then processed arbitrarily by the spying PC. Both TempestSDR and gr-tempest adapt the sampling rate 𝑓𝑠 to produce an integer number of samples for every 𝑃𝑥 pixels, i.e., 𝑃𝑥𝑇𝑝 = 𝑚/𝑓𝑠 for some integer 𝑚. When the sampling rate is successfully synchronized this way, these 𝑚 samples correspond to a line, and thus displaying 𝑃𝑦 of these lines produces a non-skewed and static image. Correlations such as the ones we discussed before are searched for in the signal and used in a PLL-like system to estimate the precise value of 𝑓𝑠 (see [17] and [21] for details).

Given that (6) is a complex signal (as seen in Fig. 5, since |𝐺(𝑓)| is not symmetric around zero), TempestSDR actually takes the magnitude of the samples (i.e. an envelope detector, termed AM demodulator in some contexts, e.g. [20]), which further distorts the signal. To avoid this unnecessary degradation, for the case of VGA gr-tempest instead applies an equalization filter to the complex signal to produce much better results. We will also consider the complex signal so as to provide the learning system with the most information available. As we will see, this choice will have a non-negligible impact on the performance of the model.

The other significant difference between TempestSDR and gr-tempest is that the former was coded from scratch, whereas the latter uses GNU Radio [1]. This is a framework that represents a processing chain as a series of interconnected blocks (a so-called flowgraph), each executing a well-defined operation on the signal (e.g. filtering or resampling). New blocks can be easily created and added to the already vast list of available ones. These new blocks can be programmed either in C++ or Python. In the latter case, Numpy is used to represent data, which further simplifies the integration of deep learning frameworks such as PyTorch, as in our case. All of these features have been the main motivation behind our choice of gr-tempest as the starting point of our system.

5 Eavesdropping Images from gr-tempest Complex Sequences

5.1 Deep Learning to Solve the Inverse Problem

In this section, we consider the inverse problem of recovering a clean or source image 𝑿 ∈ R^(𝑝𝑦×𝑝𝑥) from a degraded observation 𝒀 ∈ C^(𝑝𝑦×𝑝𝑥), which is an array of complex numbers with the same size as the source image. This observation is modeled as:

𝒀 = T(𝑿) + 𝑵,  (7)
where T : R𝑝 𝑦 ×𝑝𝑥 → C𝑝 𝑦 ×𝑝𝑥 is a non-linear degradation operator, Skip Connection

and 𝑵 ∈ C𝑝 𝑦 ×𝑝𝑥 is an additive complex noise, for which real and


imaginary parts are assumed to be mutually independent, each of
them being a white Gaussian noise image of variance 𝜎 2 . Recall
that in our case, 𝑿 refers to a monitor image to be spied on (and
thus of shape 𝑝 𝑦 × 𝑝𝑥 ), while 𝒀 corresponds to an array of complex
ling Up
sam
samples defined by (6) and synchronized by gr-tempest. More mp
wnsa plin
g
Do
details on how we construct 𝑿 and 𝒀 are discussed in the following
subsection.
Due to the aforementioned inter-symbol interference, the degra- Figure 6: DRUNet architecture takes as input the in-phase
dation operator T is severely ill-posed, so achieving perfect restora- and quadrature components (red and green channels, respec-
tion of 𝑿 is impossible. Therefore, we must settle for obtaining an tively) of the eavesdropped image and outputs a grayscale
estimation 𝑿ˆ by introducing regularization and hope to get as close image.
as possible to the original image. This corresponds to performing
Bayesian estimation to solve a Maximum A Posteriori problem,
which can be formulated as follows:
𝑿̂ = argmin_𝑿 (1/(2𝜎²)) ∥𝒀 − T(𝑿)∥² + 𝜆R(𝑿),    (8)

where the solution minimizes a data term (1/(2𝜎²)) ∥𝒀 − T(𝑿)∥² and a regularization term 𝜆R(𝑿) with regularization parameter 𝜆. Specifically, the data term is responsible for demanding similarity with the degradation process, while the regularization term is composed of a function R : ℝ^(𝑝𝑦 × 𝑝𝑥) → ℝ₊ that holds responsibility for delivering a stable solution. The proper choice of a regularizer is not a trivial task as it involves considering prior knowledge of the kind of images to be recovered. However, traditional hand-crafted priors (e.g. Tikhonov regularization) are usually too over-simplistic and do not capture the complexity of real images. This is why recent methods follow learning-based approaches that, using large datasets of pairs of source/degraded image samples, directly learn the mapping from the degraded observations to the source images [34] or learn decoupled priors combined with the MAP formulation [33].

Figure 7: Experimental setup. The enumeration corresponds to 1) antenna, 2) RF filters and amplifier, 3) SDR, and 4) the spying computer running a GNU Radio flowgraph.
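To fix ideas, (8) can be solved by plain gradient descent when T is linear and R is a hand-crafted Tikhonov term. The sketch below does exactly that on a toy 1-D signal, with a 3-tap moving-average blur standing in for T; the operator, sizes, and hyper-parameters are all illustrative assumptions, not the paper's pipeline (where T is the non-linear HDMI channel and the prior is learned rather than hand-crafted):

```python
# Toy MAP estimation as in Eq. (8): minimize
#   1/(2*sigma2) * ||y - T(x)||^2 + lam * ||x||^2
# by gradient descent, with T a 3-tap moving-average blur on a 1-D "image".
# Illustrative only: the real T is the non-linear HDMI/TMDS channel.

def blur(x):
    """Degradation operator T: 3-tap moving average (edges replicated)."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]

def gradient(x, y, sigma2, lam):
    """Gradient of the MAP objective; T is symmetric here, so T^t = T."""
    residual = [t - v for t, v in zip(blur(x), y)]
    data_grad = blur(residual)                       # T^t (T x - y)
    return [d / sigma2 + 2.0 * lam * v for d, v in zip(data_grad, x)]

def map_estimate(y, sigma2=0.01, lam=0.001, lr=0.005, iters=2000):
    x = list(y)                                      # start at the observation
    for _ in range(iters):
        x = [v - lr * g for v, g in zip(x, gradient(x, y, sigma2, lam))]
    return x

truth = [0.0] * 8 + [1.0] * 8                        # a sharp edge
y = blur(truth)                                      # degraded observation
x_hat = map_estimate(y)
err_obs = sum((a - b) ** 2 for a, b in zip(y, truth))
err_hat = sum((a - b) ** 2 for a, b in zip(x_hat, truth))
print(err_hat < err_obs)                             # regularized inversion helps
```

This hand-crafted ∥𝑿∥² prior is precisely what the learning-based formulation below replaces with knowledge absorbed by a CNN from clean-degraded pairs.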
In this work, we propose to train an end-to-end deep convolutional neural network (CNN) as a regressor 𝑿̂ = 𝑓(𝒀, Θ) to learn to map the degraded complex signals, spied, into the clean source images. This training is performed by minimizing a certain loss function L on a training set containing 𝑁 clean-degraded image pairs {(𝑿𝑖, 𝒀𝑖)}_{𝑖=1}^{𝑁}, i.e.

min_Θ ∑_{𝑖=1}^{𝑁} L(𝑓(𝒀𝑖, Θ), 𝑿𝑖).    (9)

Note that the CNN regressor 𝑓(𝒀𝑖, Θ) does not depend on the degradation operator T explicitly, but it does so in an implicit way since the clean-degraded image pairs that are used to compute its weights in (9) may be synthetically generated via 𝒀𝑖 = T(𝑿𝑖).

For the network 𝑓(𝒀𝑖, Θ) we use DRUNet (Deep Residual UNet) [32], a popular CNN with high expressive power. Its architecture, depicted in Fig. 6, is composed of a succession of interconnected convolutional layers, activation functions, and pooling or subsampling layers. Inspired by UNet [26], DRUNet uses an encoder-decoder structure: in the first series of convolutional layers, the image is down-sampled to a lower-dimensional space, and then, throughout the second series of convolutional layers, the image is up-sampled to its original size. Furthermore, as in other architectures like ResNet [13], it is possible to interconnect non-adjacent convolutional layers using residual blocks and skip connections. This strategy has been shown to enhance the model capacity.

5.2 Generating the training set

Let us now discuss how we constructed the training set. Each pair (𝑿𝑖, 𝒀𝑖) stems from two possible sources: actual spied signals or simulations. The former was obtained using the experimental setup shown in Fig. 7. The antenna was placed somewhat close to the cable and complemented with a Mini-Circuits ZJL-6G+ amplifier and a band-pass filter composed of an SLP-450+ low-pass filter and an SHP-250+ high-pass filter, both from Mini-Circuits. This would correspond to the second scenario we discussed in Sec. 2. It is worth mentioning anyhow that we are not interested in proving the feasibility of TEMPEST, which has already been demonstrated [15, 21, 24, 28], but in improving the results obtained by the state of the art (i.e. when gr-tempest or TempestSDR obtain reasonable results, our system should further improve them). Our simple setup is sufficient to this end.

It is important to emphasize that obtaining real captures is not a simple task. We used a monitor with a resolution of 1600 × 900 @ 60 fps, tuning the SDR to the third harmonic of the pixel frequency (324 MHz) using a modified version of the flowgraph of gr-tempest. Modifications include minor improvements to the tuning frequency
Deep-TEMPEST: Using Deep Learning to Eavesdrop on HDMI from its Unintended Electromagnetic Emanations Submitted, 2024,

correction algorithm, and naturally output the complex samples instead of their magnitudes. Furthermore, as we mentioned before, gr-tempest automatically adapts the sampling rate 𝑓𝑠 to produce an integer number 𝑚 of samples per image row. We have further interpolated this signal to produce 𝑃𝑥 complex samples per line (i.e. using an interpolator with ratio 𝑃𝑥/𝑚).

The most challenging aspect was tagging which sample corresponds to the first pixel of the image. This is a key step in performing a pixel-by-pixel matching of the captured image with its respective original version, which is necessary for the model's supervised training. Although gr-tempest, in addition to adapting the sampling rate 𝑓𝑠, also provides an automatic algorithm that re-centers the image, in our experience, the results of the latter were not sufficient for our purposes.

To address this limitation, we first detect the blanking periods using the Hough Line Transform [11]. We then both remove them entirely and shift the image, leaving the capture adjusted to the original version. Detection is achieved by keeping only those lines whose distance between each other corresponds to the blanking size (for both horizontal and vertical). A grayscale conversion of the original image constitutes 𝑿𝑖 (more in particular, the average of the three RGB image color channels), whereas the re-centered complex array of the samples constitutes the corresponding 𝒀𝑖.

The rest of the degraded images were simulated under the same conditions as the SDR (sampling rate and tuning frequency) and the system being eavesdropped (resolution). The synthetic dataset was generated with a Python script, also available at the project's repository, that simulates the pipeline composed of the HDMI transmission protocol, the SDR baseband down-conversion, and low-pass filtering and sampling (i.e. Eq. (6)). Gaussian noise, small frequency errors, and a random delay were also added. To explore the effects of using a precise expression of the pulse 𝑞(𝑡), we have tested two different possibilities: the difference between two delayed rectangular pulses (as in (4) with 𝜖 = 0.1), or simply a rectangular pulse. As we will see, quite interestingly, the trained system is robust to this choice.

6 Experiments and Results

We gathered a set of 3491 clean-degraded image pairs following the procedure presented in the previous section. The dataset includes 2189 simulated samples for each pulse (1738 used for training, 148 for validation, and 303 for test) as well as 1302 real-life samples (882 for training, 120 for validation, and 300 for test). The dataset was carefully constructed to represent the content of an actual screen image, ranging from online sales pages [16] to conference articles [22] and manual screenshots on a variety of web pages.

To evaluate the performance of the trained models, we first need to define a representative restoration metric. Typical image restoration metrics are the Peak Signal-to-Noise Ratio (PSNR) or the Structural Similarity Index Measure (SSIM) [25]. However, it is reasonable to assume that the eavesdropper is mostly interested in the text being displayed on the monitor. In this case, neither of them are suitable indicators as they are sensitive to changes in the images' contrast and are thus not indicative of the legibility of the recovered text. For this reason, we chose to also report the Character Error Rate (CER), which was computed using the Tesseract optical character recognition software [27]. We remark that the OCR system was only used to evaluate performance, not for model training. In particular, we compare the text produced by Tesseract on the original image and on the recovered one. The percentage of different characters between both outputs is the CER, and we report the average over all images in the test set.

The hardware used for training and evaluation tasks consists of an Intel Core i7-10700F CPU with 64GB of RAM and an NVIDIA GeForce RTX 3090 GPU with 24GB of VRAM. Inference on 1600 × 900 sized images takes approximately 0.5s with GPU and 15s on CPU. The model parameters were optimized by minimizing the 𝐿2 norm between the recovered image and its ground truth. We used the Adam optimizer [14] to train on image patches of 256 × 256 pixels (patch size) and batches of 48 patches (batch size). A Total Variation regularizer [3] was also added to reduce noise while preserving the edges. The values of the learning rate (𝑙𝑟 = 1.56 × 10⁻⁵) and the regularization weight (𝜆_TV = 2.2 × 10⁻¹³) were found through a hyper-parameter search using the Optuna framework [2]. Weights of the DRUNet architecture were initialized with He's Normal weights [12], except for certain cases we discuss below.

Synthetic data only. Let us first consider an ideal case where we perfectly know the electromagnetic signal's behavior, i.e. a model trained and evaluated only on the synthetic data. We shall denote it as Base Model, and it will be useful both to assess the impact of the approximations we performed when deriving (6), but also to evaluate what performance we may expect (at best) when using real-life signals. As we mentioned in the previous section, we have trained and evaluated our system using two different pulses: a rectangular pulse, or a difference of two rectangular pulses as in (4), with 𝜖 = 0.1. We trained both models 180 epochs, resulting in a CER of around 30% when tested over their respective synthetic samples. The complete set of results is summarized in Table 1.

Evaluation in real-life data. Next, we consider real-life signals acquired with the setup displayed in Fig. 7. If we evaluate both Base Models on this data, their performance drops significantly to a CER of about 50%, still much better than those of the grayscale images produced by both TempestSDR or vanilla gr-tempest, which obtain a CER of over 90%. Furthermore, the fact that both Base Models obtain similar results indicates that a precise expression for the conforming pulse 𝑞(𝑡) is unnecessary, which we will further explore in the next section. However, synthetic data will prove significantly useful when combined with real-life signals, dramatically decreasing the number of samples required in training, a discussion we defer to the end of the section.

The next step is, naturally, to re-train the model by using only real-life data. We will refer to the resulting system as the Pure Model. Evaluation of its inferred images results in a CER of about 35%, very similar to those obtained by the Base Models when evaluated on synthetic data. These are excellent results, which mean that only about one-third of the characters are incorrectly detected by Tesseract on the inferred image. Redundancy enables a human operator to recover most (if not all) of the rest of the text present in the image. A representative inference example is shown in Fig. 8. Further zoomed-in results are shown in Fig. 9, including the results of vanilla gr-tempest. Note how, in the example on the left, the text is restored with higher quality when the font size is larger, even if the original text color is blue. Furthermore, the one on the right shows
Submitted, 2024, Santiago Fernández, Emilio Martínez, Gabriel Varela, Pablo Musé, and Federico Larroca

great text restoration performance except for some characters (such as “𝜏”, “𝜋” and “𝑥̃” symbols), which are less common and therefore under-represented in the training set.

Figure 8: Example of a complete inference using the Pure Model in a real-life sample.

Figure 9: Zoomed-in examples obtained by vanilla gr-tempest (top), Pure Model (middle), and the original image (bottom).

Table 1: Performance of all trained models, evaluated on test sets of both synthetic and real captures. The best performance for each dataset and metric is indicated in bold text.

Model | PSNR (dB) | SSIM | CER (%)
Synthetic Data
Base (ideal pulse) | 21.3 | 0.913 | 29.5
Base (real pulse) | 20.2 | 0.908 | 32.8
Real-life Data
Base (ideal pulse) | 10.0 | 0.610 | 49.4
Base (real pulse) | 10.0 | 0.601 | 55.2
Raw image magnitude (gr-tempest) | 8.57 | 0.345 | 92.2
Pure (w/ complex values) | 15.2 | 0.787 | 35.3
Pure (w/ magnitude only) | 14.2 | 0.754 | 43.6

Table 2: Performance of the fine-tuned Base Model as we vary the number of real-life samples (as a fraction of the complete dataset) used in training.

Fraction | PSNR (dB) | SSIM | CER (%)
5% | 14.6 | 0.766 | 39.0
10% | 15.2 | 0.791 | 35.0
20% | 15.4 | 0.797 | 33.3
50% | 15.6 | 0.803 | 31.4
100% | 15.7 | 0.806 | 29.8

Denoising the grayscale images. A pertinent question is how much information would have actually been lost had we not re-cast the TEMPEST problem as an inverse one. That is to say, what would the performance be had we proceeded as in [10, 18, 20] and applied a denoiser to the grayscale image as produced by TempestSDR or gr-tempest. We have thus trained a model with only real-life signals as before, but taking the magnitude of the complex samples. This results in a significant increase in the CER, reaching almost 44%. This shows that using the complex samples as an input to the network is a better choice, as the system can leverage information from both magnitude and phase.

On the utility of synthetic data. As we discuss in the next section, robustness of the spying system requires signals that span several monitor configurations (i.e. resolutions) as well as the SDR's parameters (i.e. harmonic and sampling rate). This means that the attacker has to build a training set including several thousands of real-life samples, whose acquisition then constitutes a significant bottleneck in developing a robust spying system. It is crucial, then, to study how to reduce the number of real-life signals required and if it is possible to do so without affecting the resulting performance.

The first idea is simply to build a smaller training set. For instance, if we use a third of the training set on the Pure Model, the CER would increase roughly by three percentage points, more precisely resulting in a CER of 38.3%. Instead of training the Pure Model from scratch, a very interesting and useful alternative is to use the Base Model as a starting point, whose training samples are virtually free to produce. The idea is to expose the Base Model to real-life samples so that it can leverage what it has learned from the simulations to better infer images from real-life signals. More in particular, we start from the weights of the Ideal Base Model and further train it for another 100 epochs using only a subset of real-life samples. The results obtained with this methodology, a so-called Model Fine-Tuning (which may be interpreted as Few-Shot Learning in this case), are shown in Table 2. Note how simulated data may be leveraged to obtain the same performance as the Pure Model but using only 10% of the real-life samples. Quite interestingly, this fine-tuning produces the best results from all of the evaluated models.

7 Robustness and Countermeasures

7.1 Robustness

This section evaluates our system's performance when modifications are introduced in both the acquisition phase and the reference images. For instance, the training set was generated with a fixed sampling rate, tuning frequency, and monitor resolution configuration for both actual signals and simulations. It is essential to assess which changes in these parameters require complete retraining.

Robustness to the Signal Acquisition Process. We start by exploring changes in SDR tuning frequency. Our choice of the third pixel harmonic was based on the absence of other significant sources of radio-frequency interference, but this is not always the case, and the operator may need to tune to, for instance, the fourth one. Note that in this case, the most important difference between
the samples in the training set and the observed signal lies in the form of 𝑔(𝑡) (cf. Eq. (6)), which will now correspond to another 𝑓𝑐. However, as illustrated in Fig. 5, the difference in the corresponding pulses is not significant, and the learning system should obtain reasonable results. This is confirmed in Fig. 10a, which shows an inference example using a real signal tuned at 𝑓𝑐 = 4/𝑇𝑝. The resulting CER in this example was 26%, demonstrating the robustness to changes in the tuning frequency.

Figure 10: Model inferences over non-trained setup spied images. (a) 1600 × 900 resolution image spied at the 4th pixel frequency rate (CER = 26.6%). (b) 1280 × 720 resolution image spied at the 4th pixel frequency rate (CER = 50%). The inference of 10b shows the model does not assure a good performance at other spying setups.

Figure 11: Image inferences when synthetic low-level noise is added to the original image. Inference performance is significantly degraded, even with an imperceptible noise level.

As a second step, let us additionally modify the monitor's resolution (thus resulting in a different pixel rate 1/𝑇𝑝) and choose again 𝑓𝑐 = 4/𝑇𝑝. We interpolated the captured complex image resolution to 1600 × 900 before computing the inference to feed the learning module with the same array size that it was trained on, thus avoiding any disadvantage compared to the previous configuration. An example inference (using 1280 × 720 @ 60 fps) is shown in Fig. 10b. In this case, the performance was clearly degraded, resulting in a CER of 50%. Differently from the previous case, differences in the resulting shaping pulses are enough to produce samples where the learning system's performance degrades significantly.

In any case, expecting the system to perform well under all possible resolutions and harmonics would not be reasonable. However, since the number of possible configurations is limited, we may envisage a set of different parameters for the DRUNet, each trained on signals acquired when a specific resolution was used in the monitor and a certain configuration was used on the SDR. As discussed in the previous section, we may fine-tune the model trained on simulations, so the acquisition process should not be time-consuming.

Robustness to the Images' Content. Text fonts not used for training are another point to consider for testing the model's robustness, appearing in the examples we showed previously, especially that of Fig. 9. Given that several of the images we included in our dataset come from PDF documents obtained from a conference (and thus with the same font), it is interesting to evaluate whether the system presents certain overfitting to these kinds of images. To measure the performance of the model for unseen fonts, we created a new dataset consisting of 800 new simulated samples. Each of these images consists of random text, where each line alternates between 147 different font types (those included in the default Ubuntu installation and that contain the Latin script). The simulation uses the same image resolution, pixel harmonic frequency, and sampling rate as in the previous section.

Using a subset of 300 of these images to evaluate the Base Model with the ideal pulse results in an increase of the average CER, which moves from about the 30% that we obtained before (cf. Table 1) to 48.7%. However, simply by further training the model for another 10 epochs, where the remaining 500 samples were added to the training set, the resulting CER drops again to 29.8%. This experiment shows that the architecture has the potential to learn new text font types with a few training epochs and provides further evidence of its expressiveness.

7.2 Countermeasures

It is essential to expose the spying system's flaws so the counterpart (e.g., the computer user) can exploit them and ensure the protection of personal or classified information. To this end, we mention two countermeasures such that, by modifying the displayed image (in a primarily eye-imperceptible manner to the computer user), inference based on the resulting emanations fails. These defects stem from the analysis discussed in Sec. 3 and leverage the non-linearity of the TMDS encoding.

One way to accomplish this is by adding low-level noise to the image displayed on the monitor, creating an adversarial attack on the neural network. This noise may be, for instance, an additive Gaussian noise with a constant variance. The example in Fig. 11 illustrates this possibility by artificially adding a very small noise to the original image (𝜎 = 3). Note how most of the text in the inference becomes illegible.

A more perceptible but definitive solution is to use a color gradient on the images' background, as illustrated in Fig. 12a. When using a horizontal gradient (a white-to-black ramp, for example), we are changing the grayscale linearly over the image, but the TMDS encoding will produce significant changes on the eavesdropped signal (see Fig. 12b). In this case, also shown in Fig. 12b, the inference fails completely.

8 Conclusion

In this work, we have presented an open-source implementation of a deep learning architecture trained to map from the electromagnetic signal emanating from an HDMI cable to the displayed image. The complete dataset, including simulations based on the analytical expressions we derived (as well as scripts to generate them), is also made available. Notably, the system obtains much better results than previous implementations, significantly improving the Character Error Rate when eavesdropping text.
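As an aside on the headline metric: the CER reported throughout is, in essence, a character-level edit distance between Tesseract's transcript of the original image and its transcript of the recovered one, normalized by the reference length. A minimal sketch, with plain strings standing in for the two OCR outputs (Tesseract itself is omitted):

```python
# Minimal sketch of the Character Error Rate (CER): the Levenshtein (edit)
# distance between the reference text (OCR of the original image) and the
# hypothesis (OCR of the recovered image), divided by the reference length.
# In the paper both strings come from Tesseract; here they are stand-ins.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    return levenshtein(reference, hypothesis) / max(len(reference), 1)

print(round(cer("tempest", "tempost"), 3))  # one substitution in 7 chars -> 0.143
```

The per-image CER values are then averaged over the test set, as described in Sec. 6.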
(a) Image with horizontal gradient. (b) Eavesdropped image.

Figure 12: Gradient background experiment scenario. A horizontal 0-127 grayscale ramp is subtracted from the original image (a), resulting in an observed complex image (upper b) with several vertical bands. Inference (lower b) thus fails to restore the text.

This work paves the way for several interesting and challenging research avenues. As we discussed in Sec. 7, the trained architecture's performance degrades as we modify the spied system's parameters (e.g., the resolution or the tuned frequency). A possible solution is to train several architectures, one for each foreseeable set of parameters. Simulations will naturally come in handy in this otherwise extremely time-consuming process. An alternative is to leverage the fact that we have an explicit expression for the degradation operator and strive at solving (8) directly. Deep learning has also been successfully applied to these so-called plug-and-play methods, in particular, to apply the prior distribution or regularization term, which takes the form of a denoiser (see [33] for example). The main challenge in the case of TEMPEST is how to efficiently find the optimum of the data term since the degradation operator is highly non-linear.

We may also enrich the signal we are using for inference. As we discussed before, the eavesdropped samples present significant redundancy, which we implicitly used through gr-tempest to align 𝒀 and 𝑿. However, this redundancy may also be used to produce even better results. We may, for instance, use several consecutive complex arrays of samples to construct a complex tensor, which may then be fed to a network that infers the original image.

Finally, it is important to highlight that the architecture we used takes some seconds to produce each inference. This is hardly real-time, and it would be interesting to undertake a faster implementation now that the method's feasibility has been verified.

References

[1] 2024. GNU Radio. The free & open software radio ecosystem. https://www.gnuradio.org/.
[2] Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama. 2019. Optuna: A next-generation hyperparameter optimization framework. In 25th ACM SIGKDD.
[3] T. Chan, Selim Esedoglu, Frederick Park, and A. Yip. 2006. Total variation image restoration: Overview and recent developments. Handbook of mathematical models in computer vision (2006), 17–31.
[4] Pieterjan De Meulemeester, Bart Scheers, and Guy A.E. Vandenbosch. 2020. Differential Signaling Compromises Video Information Security Through AM and FM Leakage Emissions. IEEE Transactions on Electromagnetic Compatibility 62, 6 (2020), 2376–2385. https://doi.org/10.1109/TEMC.2020.3000830
[5] Pieterjan De Meulemeester, Bart Scheers, and Guy A.E. Vandenbosch. 2020. Eavesdropping a (Ultra-)High-Definition Video Display from an 80 Meter Distance Under Realistic Circumstances. In IEEE EMCSI 2020.
[6] Pieterjan De Meulemeester, Bart Scheers, and Guy A.E. Vandenbosch. 2020. A Quantitative Approach to Eavesdrop Video Display Systems Exploiting Multiple Electromagnetic Leakage Channels. IEEE Transactions on Electromagnetic Compatibility 62, 3 (2020), 663–672. https://doi.org/10.1109/TEMC.2019.2923026
[7] Ettus Research. 2024. USRP B200mini. https://www.ettus.com/all-products/usrp-b200mini/.
[8] Ettus Research. 2024. USRP X440. https://www.ettus.com/all-products/usrp-x440/.
[9] Robert G. Gallager. 2008. Principles of Digital Communication. Cambridge University Press, Cambridge, UK.
[10] J. Galvis, S. Morales, C. Kasmi, and F. Vega. 2021. Denoising of Video Frames Resulting From Video Interface Leakage Using Deep Learning for Efficient Optical Character Recognition. IEEE Letters on Electromagnetic Compatibility Practice and Applications 3, 2 (2021), 82–86. https://doi.org/10.1109/LEMCPA.2021.3073663
[11] Allam Shehata Hassanein, Sherien Mohammad, Mohamed Sameer, and Mohammad Ehab Ragab. 2015. A survey on Hough transform, theory, techniques and applications. arXiv preprint arXiv:1502.02160 (2015).
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852 [cs.CV] (2015).
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE CVPR 2016.
[14] Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG] (2017).
[15] Markus G. Kuhn. 2003. Compromising emanations: eavesdropping risks of computer displays. Technical Report UCAM-CL-TR-577. University of Cambridge, Computer Laboratory. https://doi.org/10.48456/tr-577
[16] Anurendra Kumar, Keval Morabia, William Wang, Kevin Chang, and Alex Schwing. 2022. CoVA: Context-aware Visual Attention for Webpage Information Extraction. In 5th ECNLP.
[17] Federico Larroca, Pablo Bertrand, Felipe Carrau, and Victoria Severi. 2022. gr-tempest: an open-source GNU Radio implementation of TEMPEST. In 2022 AsianHOST. https://doi.org/10.1109/AsianHOST56390.2022.10022149
[18] Florian Lemarchand, Cyril Marlin, Florent Montreuil, Erwan Nogues, and Maxime Pelcat. 2020. Electro-Magnetic Side-Channel Attack Through Learned Denoising and Classification. In ICASSP 2020.
[19] Z. Liu, N. Samwel, L.J.A. Weissbart, Z. Zhao, D. Lauret, L. Batina, and M. Larson. 2021. Screen Gleaning: A Screen Reading TEMPEST Attack on Mobile Devices Exploiting an Electromagnetic Side Channel. In NDSS 2021.
[20] Yan Long, Qinhong Jiang, Chen Yan, Tobias Alam, Xiaoyu Ji, Wenyuan Xu, and Kevin Fu. 2024. EM Eye: Characterizing Electromagnetic Side-channel Eavesdropping on Embedded Cameras. In NDSS 2024.
[21] Martin Marinov. 2014. Remote video eavesdropping using a software-defined radio platform. MS thesis, University of Cambridge (2014). https://github.com/martinmarinov/TempestSDR.
[22] Paul Mooney. 2019. CVPR 2019 Papers. https://www.kaggle.com/datasets/paultimothymooney/cvpr-2019-papers. Visited on 2023-08-04.
[23] Gregory Ongie, Ajil Jalal, Christopher A. Metzler, Richard G. Baraniuk, Alexandros G. Dimakis, and Rebecca Willett. 2020. Deep Learning Techniques for Inverse Problems in Imaging. IEEE Journal on Selected Areas in Information Theory 1, 1 (2020), 39–56. https://doi.org/10.1109/JSAIT.2020.2991563
[24] Christian David O'Connell. 2019. Exploiting quasiperiodic electromagnetic radiation using software-defined radio. PhD thesis, University of Cambridge (2019). https://doi.org/10.17863/CAM.38085
[25] Marius Pedersen and Jon Yngve Hardeberg. 2012. Full-Reference Image Quality Metrics: Classification and Evaluation. Foundations and Trends® in Computer Graphics and Vision 7, 1 (2012), 1–80. https://doi.org/10.1561/0600000037
[26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In MICCAI 2015. Springer.
[27] Ray Smith. 2007. An Overview of the Tesseract OCR Engine. In ICDAR '07. IEEE Computer Society, Washington, DC, USA, 629–633.
[28] Tae-Lim Song, Yi-Ru Jeong, and Jong-Gwan Yook. 2015. Modeling of Leaked Digital Video Signal and Information Recovery Rate as a Function of SNR. IEEE Transactions on Electromagnetic Compatibility 57, 2 (2015), 164–172.
[29] Wim van Eck. 1985. Electromagnetic radiation from video display units: An eavesdropping risk? Computers & Security 4, 4 (1985), 269–286.
[30] Alexander M. Wyglinski, Don P. Orofino, Matthew N. Ettus, and Thomas W. Rondeau. 2016. Revolutionizing software defined radio: case studies in hardware, software, and education. IEEE Communications Magazine 54, 1 (2016), 68–75.
[31] Jiadi Yu, Li Lu, Yingying Chen, Yanmin Zhu, and Linghe Kong. 2021. An Indirect Eavesdropping Attack of Keystrokes on Touch Screen through Acoustic Sensing. IEEE Transactions on Mobile Computing (2021).
[32] Kai Zhang, Yawei Li, Wangmeng Zuo, Lei Zhang, Luc Van Gool, and Radu Timofte. 2021. Plug-and-play image restoration with deep denoiser prior. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 10 (2021), 6360–6376.
[33] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. 2017. Learning deep CNN denoiser prior for image restoration. In IEEE CVPR 2017. 3929–3938.
[34] Kai Zhang, Wangmeng Zuo, and Lei Zhang. 2018. FFDNet: Toward a Fast and Flexible Solution for CNN-Based Image Denoising. IEEE Trans. Image Process. 27, 9 (2018), 4608–4622. https://doi.org/10.1109/TIP.2018.2839891