Deep-TEMPEST: Using Deep Learning To Eavesdrop On HDMI From Its Unintended Electromagnetic Emanations
Abstract
In this work, we address the problem of eavesdropping on digital video displays by analyzing the electromagnetic waves that unintentionally emanate from the cables and connectors, particularly HDMI. This problem is known as TEMPEST. Compared to the analog case (VGA), the digital case is harder due to a 10-bit encoding that results in a much larger bandwidth and a non-linear mapping between the observed signal and the pixel's intensity. As a result, eavesdropping systems designed for the analog case obtain unclear and difficult-to-read images when applied to digital video. The proposed solution is to recast the problem as an inverse problem and train a deep learning module to map the observed electromagnetic signal back to the displayed image. However, this approach still requires a detailed mathematical analysis of the signal, firstly to determine the frequency at which to tune, but also to produce training samples without actually needing a real TEMPEST setup. This saves time and avoids the need to obtain these samples, especially if several configurations are being considered. Our focus is on improving the average Character Error Rate in text, and our system improves this rate by over 60 percentage points compared to previously available implementations. The proposed system is based on widely available Software Defined Radio and is fully open-source, seamlessly integrated into the popular GNU Radio framework. We also share the dataset we generated for training, which comprises both simulated captures and over 1000 real ones. Finally, we discuss some countermeasures to minimize the potential risk of being eavesdropped on by systems designed on similar principles.

CCS Concepts
• Security and privacy → Side-channel analysis and countermeasures; • Computing methodologies → Neural networks.

Keywords
Software Defined Radio, Side-channel attack, Deep Learning

ACM Reference Format:
Santiago Fernández, Emilio Martínez, Gabriel Varela, Pablo Musé, and Federico Larroca. 2024. Deep-TEMPEST: Using Deep Learning to Eavesdrop on HDMI from its Unintended Electromagnetic Emanations. In Proceedings of Submitted. ACM, New York, NY, USA, 10 pages. https://doi.org/XXXXXXX.XXXXXXX

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Submitted, 2024
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-XXXX-X/18/06
https://doi.org/XXXXXXX.XXXXXXX

1 Introduction
TEMPEST is a term used to describe the unintentional emanation of sensitive or confidential information from electrical equipment. While it may refer to any kind of emissions, such as acoustic and other types of vibrations [31], it primarily deals with electromagnetic waves. In particular, this article focuses on electromagnetic emissions from video displays. The issue of inferring the content displayed on a monitor from the electromagnetic waves emitted by it and its connectors has a long history, dating back to the 1980s with the first public demonstrations by Wim van Eck. This problem is sometimes referred to as Van Eck Phreaking, but for the remainder of this article, we will use the term TEMPEST [29].

Van Eck's research was focused on the then-prevalent CRT monitors. However, Markus Kuhn's work in the early 2000s [15] studied modern digital displays, including both the analog interface VGA (Video Graphics Array) and the digital interfaces HDMI (High-Definition Multimedia Interface) and DVI (Digital Visual Interface). Nevertheless, reproducing these studies was challenging due to the need for expensive and specialized hardware, such as a wide-band AM receiver. This entrance barrier has been significantly lowered in recent years by the development of Software Defined Radio (SDR) [30]. SDR employs generic hardware that down-converts the signal to baseband and then provides the sampled signal to the PC, making the hardware more affordable and the signal processing simpler, since it is performed in software. These advantages resulted in two open-source implementations of TEMPEST (TempestSDR [21] and gr-tempest [17]) and several empirical studies of the problem, particularly focusing on the HDMI interface [4–6, 10, 18–20, 24, 28]. However, despite all of these efforts, "this threat still is not well-documented and understood" [4]. Our first contribution is precisely to address this issue by providing an analytical expression of the signal's complex samples as received by the SDR when spying on an HDMI display. Virtually all of the above-mentioned studies use an AM demodulation step as part of their processing chain, similar to the first studies by Van Eck with VGA, with the exception of [4],
Submitted, 2024, Santiago Fernández, Emilio Martínez, Gabriel Varela, Pablo Musé, and Federico Larroca
[Figure 5 plot: |𝑄(𝑓)| and |𝐺(𝑓)| versus 𝑓 [1/𝑇𝑏], with 𝑓𝑐 = 3/𝑇𝑝 and 𝑓𝑠 = 1/(30𝑇𝑏).]

Figure 5: Normalized Fourier Transform of 𝑞(𝑡) (i.e. Eq. 4 with 𝜖 = 0.002) and 𝑔(𝑡), the complex baseband representation of the channel as seen by the SDR.

use 𝑔(𝑡), whose Fourier transform 𝐺(𝑓) is 𝑄(𝑓) evaluated around 𝑓𝑐 and zeroed for |𝑓| > 𝑓𝑠/2. This process is illustrated in Fig. 5 using 𝑞(𝑡) as defined in (4), 𝑓𝑐 = 3/𝑇𝑝 and 𝑓𝑠 = 1/(30𝑇𝑏). All in all, after sampling, the following sequence is obtained:

𝑦[𝑙] = ∑𝑘 𝑥𝑏[𝑘] 𝑔(𝑙/𝑓𝑠 − 𝑘𝑇𝑏).    (6)

We may further enrich the model by adding noise, small errors to 𝑓𝑐 (instead of precisely a multiple of the pixel rate), and offsets in both time and phase (uniform between zero and 1/𝑓𝑠 or 2𝜋, respectively). These impairments are included in our simulations to make the learning system more robust to these non-idealities. Note, however, that we are ignoring the antenna's bandwidth and possible non-linearities.

Regarding the sampling rate, mid-level SDRs allow for, at most, some tens of MHz. For example, the USRP B200-mini [7] we used in our experiments has a maximum sampling rate of 𝑓𝑠 = 50 MHz. Just as in the example in Fig. 5, this is only a third of the pixel rate at a resolution of 1920 × 1080@60Hz (resulting in 1/𝑇𝑝 = 148 MHz), meaning that each sample 𝑦[𝑙] will actually be a linear combination of several tens of encoded bits, further complicating the image reconstruction.

In fact, since the anti-aliasing filter of the SDR produces a 𝐺(𝑓) that is zero for |𝑓| > 𝑓𝑠/2, and if 𝑓𝑠 ≪ 1/𝑇𝑏 as we just discussed, the resulting loss of information means that the attacker cannot recover the sequence of bits 𝑥𝑏[𝑘] by observing the samples 𝑦[𝑙]. It may appear that a viable alternative is to increase the sampling rate 𝑓𝑠 up to 1/𝑇𝑏, and, after equalization, sample each bit separately and decode the image. There are three important drawbacks to this approach. Firstly, it would require an SDR that operates with a sampling rate and a corresponding instantaneous bandwidth of at least some GHz, which even high-end and extremely expensive solutions struggle to provide (e.g. the USRP X440 by Ettus Research provides up to 3200 MHz of bandwidth at a cost of over 25,000 dollars [8]). Secondly, it is unclear whether the interference from other sources (received due to the increased receiver's bandwidth) would prove detrimental in recovering the image. Last but not least, there is the problem of processing such an enormous amount of samples, which would further increase the cost of the spying setup, this time in terms of the required PC.

For the above reasons, we will consider sampling rate values 𝑓𝑠 as those obtained from less expensive (and also less conspicuous) hardware, which will thus unavoidably result in an unrecoverable bit sequence 𝑥𝑏[𝑘]. However, recall that the attacker's actual objective, as in any communications problem, is to estimate the most plausible image that generated the observed complex sequence 𝑦[𝑙]. We propose a data-driven approach to this problem that leverages the a priori information regarding what kind of images are typically displayed on a monitor (i.e., the original images used in the training set should be representative of desktop content). This is accomplished through a deep-learning module, which we present in detail in the next section. Before that, the following subsection discusses how, for the sake of simplicity, this estimation is simply computed as |𝑦[𝑙]| in TempestSDR.

4.2 Software
Regarding software, samples are provided by the driver and then processed arbitrarily by the spying PC. Both TempestSDR and gr-tempest adapt the sampling rate 𝑓𝑠 to produce an integer number of samples for every 𝑃𝑥 pixels, i.e., 𝑃𝑥𝑇𝑝 = 𝑚/𝑓𝑠 for some integer 𝑚. When the sampling rate is successfully synchronized this way, these 𝑚 samples correspond to a line, and thus, displaying 𝑃𝑦 of these lines produces a non-skewed and static image. Correlations such as the one we discussed before are searched for in the signal and used in a PLL-like system to estimate the precise value of 𝑓𝑠 (see [17] and [21] for details).

Given that (6) is a complex signal (as seen in Fig. 5, since |𝐺(𝑓)| is not symmetric around zero), TempestSDR actually takes the magnitude of the samples (i.e. an envelope detector, termed an AM demodulator in some contexts, e.g. [20]), which further distorts the signal. To avoid this unnecessary degradation, for the case of VGA gr-tempest instead applies an equalization filter to the complex signal, producing much better results. We will also consider the complex signal so as to provide the learning system with the most information available. As we will see, this choice will have a non-negligible impact on the performance of the model.

The other significant difference between TempestSDR and gr-tempest is that the former was coded from scratch, whereas the latter uses GNU Radio [1]. This is a framework that represents a processing chain as a series of interconnected blocks (a so-called flowgraph), each executing a well-defined operation on the signal (e.g. filtering or resampling). New blocks can be easily created and added to the already vast list of available ones. These new blocks can be programmed either in C++ or Python. In the latter case, Numpy is used to represent data, which further simplifies the integration of deep learning frameworks such as PyTorch, as in our case. All of these features have been the main motivation behind our choice of gr-tempest as the starting point of our system.

5 Eavesdropping Images from gr-tempest Complex Sequences
5.1 Deep Learning to Solve the Inverse Problem
In this section, we consider the inverse problem of recovering a clean or source image 𝑿 ∈ R^(𝑝𝑦×𝑝𝑥) from a degraded observation 𝒀 ∈ C^(𝑝𝑦×𝑝𝑥), an array of complex numbers of the same size as the source image. This observation is modeled as:

𝒀 = T(𝑿) + 𝑵,    (7)
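Before moving on, the sampled model (6) lends itself to a direct simulation. The following numpy sketch uses a toy bit sequence and a toy band-limited channel 𝑔(𝑡) (both are illustrative stand-ins, not the project's actual simulation script) to show how, when 𝑓𝑠 = 1/(30𝑇𝑏), each complex sample 𝑦[𝑙] mixes tens of encoded bits:

```python
import numpy as np

# Minimal sketch of Eq. (6): y[l] = sum_k xb[k] g(l/fs - k*Tb).
# The bit sequence and the channel g(t) below are toy stand-ins.

rng = np.random.default_rng(0)

Tb = 1.0                       # bit period (normalized)
fs = 1.0 / (30.0 * Tb)         # sampling rate, 1/30 of the bit rate as in Fig. 5
n_bits = 3000
xb = rng.integers(0, 2, n_bits).astype(float)   # stand-in for the TMDS bits

def g(t):
    # Toy complex baseband channel: band-limited to |f| < fs/2 (a sinc here,
    # mimicking the SDR's anti-aliasing filter) with a residual phase rotation.
    return np.sinc(fs * t) * np.exp(-2j * np.pi * t / (10.0 * Tb))

n_samples = round(n_bits * Tb * fs)
l = np.arange(n_samples)
k = np.arange(n_bits)
# Broadcast to a (n_samples, n_bits) grid and sum over the bits.
y = (xb[None, :] * g(l[:, None] / fs - k[None, :] * Tb)).sum(axis=1)

print(y.shape, np.iscomplexobj(y))
```

With 3000 bits only 100 complex samples are produced, so no individual bit is recoverable from 𝑦[𝑙] — precisely the information loss discussed above.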
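The rate adaptation performed by TempestSDR and gr-tempest (choosing 𝑓𝑠 so that 𝑃𝑥𝑇𝑝 = 𝑚/𝑓𝑠 for an integer 𝑚, then interpolating by 𝑃𝑥/𝑚) can be made concrete with the standard 1080p60 timing: 2200 total pixel periods per line at a 148.5 MHz pixel clock. The snippet below is an illustrative computation, not code from the project:

```python
from fractions import Fraction

# Sketch of the sampling-rate adaptation: choose fs so that one video line
# (Px pixel periods of duration Tp) spans an integer number m of samples,
# i.e. Px * Tp = m / fs, then interpolate by Px/m to get one sample per pixel.
# Timing values are the standard 1080p60 ones; the SDR rate is 50 MHz.

pixel_clock = 148_500_000      # 1/Tp in Hz (1080p60)
Px = 2200                      # total pixel periods per line (1080p60)
fs = 50_000_000                # nominal SDR sampling rate in Hz

m = fs * Px / pixel_clock      # samples per line at the nominal fs
print(m)                       # not an integer: fs must be nudged

m_target = round(m)            # nearest integer number of samples per line
fs_sync = m_target * pixel_clock / Px   # synchronized sampling rate
ratio = Fraction(Px, m_target)          # interpolation ratio Px/m
print(fs_sync, ratio)
```

The PLL-like tracking in gr-tempest then refines this nominal `fs_sync` against the actual (slightly drifting) pixel clock of the monitor.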
correction algorithm, and naturally output the complex samples instead of their magnitudes. Furthermore, as we mentioned before, gr-tempest automatically adapts the sampling rate 𝑓𝑠 to produce an integer number 𝑚 of samples per image row. We have further interpolated this signal to produce 𝑃𝑥 complex samples per line (i.e. using an interpolator with ratio 𝑃𝑥/𝑚).

The most challenging aspect was tagging which sample corresponds to the first pixel of the image. This is a key step in performing a pixel-by-pixel matching of the captured image with its respective original version, which is necessary for the model's supervised training. Although gr-tempest, in addition to adapting the sampling rate 𝑓𝑠, also provides an automatic algorithm that re-centers the image, in our experience the results of the latter were not sufficient for our purposes.

To address this limitation, we first detect the blanking periods using the Hough Line Transform [11]. We then both remove them entirely and shift the image, leaving the capture aligned with the original version. Detection is achieved by keeping only those lines whose distance to each other corresponds to the blanking size (for both horizontal and vertical). A grayscale conversion of the original image constitutes 𝑿𝑖 (more precisely, the average of the three RGB color channels), whereas the re-centered complex array of samples constitutes the corresponding 𝒀𝑖.

The rest of the degraded images were simulated under the same conditions as the SDR (sampling rate and tuning frequency) and the system being eavesdropped (resolution). The synthetic dataset was generated with a Python script, also available at the project's repository, that simulates the pipeline composed of the HDMI transmission protocol, the SDR baseband down-conversion, and low-pass filtering and sampling (i.e. Eq. (6)). Gaussian noise, small frequency errors, and a random delay were also added. To explore the effect of using a precise expression of the pulse 𝑞(𝑡), we have tested two different possibilities: the difference between two delayed rectangular pulses (as in (4) with 𝜖 = 0.1), or simply a rectangular pulse. As we will see, quite interestingly, the trained system is robust to this choice.

6 Experiments and Results
We gathered a set of 3491 clean-degraded image pairs following the procedure presented in the previous section. The dataset includes 2189 simulated samples for each pulse (1738 used for training, 148 for validation, and 303 for test) as well as 1302 real-life samples (882 for training, 120 for validation, and 300 for test). The dataset was carefully constructed to represent the content of an actual screen image, ranging from online sales pages [16] to conference articles [22] and manual screenshots of a variety of web pages.

To evaluate the performance of the trained models, we first need to define a representative restoration metric. Typical image restoration metrics are the Peak Signal-to-Noise Ratio (PSNR) or the Structural Similarity Index Measure (SSIM) [25]. However, it is reasonable to assume that the eavesdropper is mostly interested in the text being displayed on the monitor. In this case, neither is a suitable indicator, as they are sensitive to changes in the images' contrast and are thus not indicative of the legibility of the recovered text. For this reason, we chose to also report the Character Error Rate (CER), which was computed using the Tesseract optical character recognition software [27]. We remark that the OCR system was only used to evaluate performance, not for model training. In particular, we compare the text produced by Tesseract on the original image and on the recovered one. The percentage of differing characters between both outputs is the CER, and we report the average over all images in the test set.

The hardware used for training and evaluation consists of an Intel Core i7-10700F CPU with 64GB of RAM and an NVIDIA GeForce RTX 3090 GPU with 24GB of VRAM. Inference on 1600 × 900 images takes approximately 0.5 s on GPU and 15 s on CPU. The model parameters were optimized by minimizing the 𝐿2 norm between the recovered image and its ground truth. We used the Adam optimizer [14] to train on image patches of 256 × 256 pixels (patch size) and batches of 48 patches (batch size). A Total Variation regularizer [3] was also added to reduce noise while preserving edges. The values of the learning rate (𝑙𝑟 = 1.56 × 10⁻⁵) and the regularization weight (𝜆𝑇𝑉 = 2.2 × 10⁻¹³) were found through a hyper-parameter search using the Optuna framework [2]. Weights of the DRUNet architecture were initialized with He's Normal initialization [12], except for certain cases we discuss below.

Synthetic data only. Let us first consider an ideal case where we perfectly know the electromagnetic signal's behavior, i.e. a model trained and evaluated only on the synthetic data. We shall denote it the Base Model, and it will be useful both to assess the impact of the approximations we performed when deriving (6), and to evaluate what performance we may expect (at best) when using real-life signals. As we mentioned in the previous section, we have trained and evaluated our system using two different pulses: a rectangular pulse, or a difference of two rectangular pulses as in (4), with 𝜖 = 0.1. We trained both models for 180 epochs, resulting in a CER of around 30% when tested on their respective synthetic samples. The complete set of results is summarized in Table 1.

Evaluation on real-life data. Next, we consider real-life signals acquired with the setup displayed in Fig. 7. If we evaluate both Base Models on this data, their performance drops significantly to a CER of about 50%, still much better than that of the grayscale images produced by either TempestSDR or vanilla gr-tempest, which obtain a CER of over 90%. Furthermore, the fact that both Base Models obtain similar results indicates that a precise expression for the conforming pulse 𝑞(𝑡) is unnecessary, which we will further explore in the next section. However, synthetic data will prove significantly useful when combined with real-life signals, dramatically decreasing the number of samples required in training, a discussion we defer to the end of the section.

The next step is, naturally, to re-train the model using only real-life data. We will refer to the resulting system as the Pure Model. Evaluation of its inferred images results in a CER of about 35%, very similar to those obtained by the Base Models when evaluated on synthetic data. These are excellent results, which mean that only about one-third of the characters are incorrectly detected by Tesseract on the inferred image. Redundancy enables a human operator to recover most (if not all) of the rest of the text present in the image. A representative inference example is shown in Fig. 8. Further zoomed-in results are shown in Fig. 9, including the results of vanilla gr-tempest. Note how, in the example on the left, the text is restored with higher quality when the font size is larger, even if the original text color is blue. Furthermore, the one on the right shows
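Since the CER drives all of the comparisons in this section, it may help to make its computation explicit. The sketch below implements a plain character-level edit distance between two OCR outputs (toy strings here; in the paper, the inputs are Tesseract's transcriptions of the original and recovered images) under the usual definition of CER as edit distance normalized by the reference length:

```python
# Character Error Rate between two OCR transcriptions, as a fraction in [0, 1]
# (multiply by 100 for a percentage). Generic sketch, not the paper's script.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[-1] + 1,                   # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    # ref: OCR output on the original image; hyp: OCR output on the recovery.
    return levenshtein(ref, hyp) / max(len(ref), 1)

print(cer("tempest", "tempest"))   # 0.0: identical transcriptions
print(cer("tempest", "tewpesl"))   # 2 substitutions over 7 characters
```

The reported metric is then the average of this quantity over all test images.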
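The training objective just described (an 𝐿2 fidelity term plus a Total Variation regularizer) can be written down compactly. The following numpy version is for illustration only; the actual training minimizes the analogous PyTorch loss over DRUNet's parameters with Adam:

```python
import numpy as np

# Sketch of the training objective: L2 norm between the recovered image and
# its ground truth, plus a Total Variation (TV) penalty that reduces noise
# while preserving edges. lam_tv is the value found by the hyper-parameter
# search; the patch below is random and purely illustrative.

def total_variation(img):
    # Anisotropic TV: sum of absolute differences between neighboring pixels.
    return (np.abs(np.diff(img, axis=0)).sum()
            + np.abs(np.diff(img, axis=1)).sum())

def loss(x_hat, x_true, lam_tv=2.2e-13):
    return ((x_hat - x_true) ** 2).sum() + lam_tv * total_variation(x_hat)

rng = np.random.default_rng(0)
patch = rng.random((256, 256))       # training uses 256x256 patches
# For a perfect recovery, only the (tiny) TV term remains:
print(loss(patch, patch) < 1e-6)     # True
```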
(a) 1600 × 900 resolution image spied at the 4th pixel-frequency harmonic (CER = 26.6%).
(b) 1280 × 720 resolution image spied at the 4th pixel-frequency harmonic (CER = 50%).

Figure 10: Model inferences on images spied with setups not used in training. The inference in 10b shows that the model does not guarantee good performance for other spying setups.

Figure 11: Image inferences when synthetic low-level noise is added to the original image. Inference performance is significantly degraded, even with an imperceptible noise level.

the samples in the training set and the observed signal lies in the form of 𝑔(𝑡) (cf. Eq. (6)), which will now correspond to another 𝑓𝑐. However, as illustrated in Fig. 5, the difference in the corresponding pulses is not significant, and the learning system should obtain reasonable results. This is confirmed in Fig. 10a, which shows an inference example using a real signal tuned at 𝑓𝑐 = 4/𝑇𝑝. The resulting CER in this example was 26%, demonstrating robustness to changes in the tuning frequency.

As a second step, let us additionally modify the monitor's resolution (thus resulting in a different pixel rate 1/𝑇𝑝) and again choose 𝑓𝑐 = 4/𝑇𝑝. We interpolated the captured complex image to a 1600 × 900 resolution before computing the inference, so as to feed the learning module with the same array size it was trained on, thus avoiding any disadvantage compared to the previous configuration. An example inference (using 1280 × 720@60fps) is shown in Fig. 10b. In this case, the performance was clearly degraded, resulting in a CER of 50%. Differently from the previous case, the differences in the resulting shaping pulses are enough to produce samples on which the learning system's performance degrades significantly.

In any case, expecting the system to perform well under all possible resolutions and harmonics would not be reasonable. However, since the number of possible configurations is limited, we may envisage a set of different parameters for the DRUNet, each trained on signals acquired when a specific resolution was used on the monitor and a certain configuration was used on the SDR. As discussed in the previous section, we may fine-tune the model trained on simulations, so the acquisition process should not be time-consuming.

Robustness to the Images' Content. Text fonts not used for training are another point to consider when testing the model's robustness, appearing in the examples we showed previously, especially that of Fig. 9. Given that several of the images we included in our dataset come from PDF documents obtained from a conference (and thus with the same font), it is interesting to evaluate whether the system presents a certain overfitting to these kinds of images. To measure the performance of the model on unseen fonts, we created a new dataset consisting of 800 new simulated samples. Each of these images consists of random text, where each line alternates between 147 different font types (those included in the default Ubuntu installation that contain the Latin script). The simulation uses the same image resolution, pixel harmonic frequency, and sampling rate as in the previous section.

Using a subset of 300 of these images to evaluate the Base Model with the ideal pulse results in an increase of the average CER, which moves from the roughly 30% we obtained before (cf. Table 1) to 48.7%. However, simply by further training the model for another 10 epochs, with the remaining 500 samples added to the training set, the resulting CER drops again to 29.8%. This experiment shows that the architecture has the potential to learn new text font types within a few training epochs, and provides further evidence of its expressiveness.

7.2 Countermeasures
It is essential to expose the spying system's flaws so that the counterpart (e.g., the computer user) can exploit them and ensure the protection of personal or classified information. To this end, we mention two countermeasures that modify the displayed image (in a manner largely imperceptible to the computer user) so that inference based on the resulting emanations fails. These countermeasures stem from the analysis discussed in Sec. 3 and leverage the non-linearity of the TMDS encoding.

One way to accomplish this is by adding low-level noise to the image displayed on the monitor, creating an adversarial attack on the neural network. This noise may be, for instance, additive Gaussian noise with a constant variance. The example in Fig. 11 illustrates this possibility by artificially adding a very small amount of noise to the original image (𝜎 = 3). Note how most of the text in the inference becomes illegible.

A more perceptible but definitive solution is to use a color gradient on the images' background, as illustrated in Fig. 12a. When using a horizontal gradient (a white-to-black ramp, for example), we are changing the grayscale linearly over the image, but the TMDS encoding will produce significant changes in the eavesdropped signal (see Fig. 12b). In this case, also shown in Fig. 12b, the inference fails completely.

8 Conclusion
In this work, we have presented an open-source implementation of a deep learning architecture trained to map the electromagnetic signal emanating from an HDMI cable to the displayed image. The complete dataset, including simulations based on the analytical expressions we derived (as well as scripts to generate them), is also made available. Notably, the system obtains much better results than previous implementations, significantly improving the Character Error Rate when eavesdropping on text.
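As a closing illustration, both countermeasures of Sec. 7.2 are simple image operations. The toy numpy sketch below (assuming grayscale pixel values in [0, 255]; not the code used for the experiments) shows the additive Gaussian noise of Fig. 11 and the white-to-black gradient background of Fig. 12a:

```python
import numpy as np

# Toy versions of the two countermeasures: (1) low-level additive Gaussian
# noise with sigma = 3, as in Fig. 11, and (2) a horizontal white-to-black
# gradient used as the image background, as in Fig. 12a.

def add_noise(img, sigma=3.0, seed=0):
    # Adversarial-style perturbation: nearly invisible to the user, but
    # enough to derail the eavesdropper's neural network.
    rng = np.random.default_rng(seed)
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)

def gradient_background(shape):
    # Horizontal ramp: the grayscale changes linearly across the screen,
    # while the TMDS encoding of neighboring pixels changes drastically.
    return np.tile(np.linspace(255.0, 0.0, shape[1]), (shape[0], 1))

screen = np.full((90, 160), 128.0)        # a plain gray toy "screen"
print(add_noise(screen).std() > 0)        # True: small but nonzero noise
bg = gradient_background(screen.shape)
print(bg[0, 0], bg[0, -1])                # 255.0 0.0
```

Text would then be rendered on top of `bg`, keeping it readable to the user while breaking the pixel-to-signal mapping the model was trained on.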