Container_Runtime_Sec
Accepted: 5/13/2024
Abstract
1. Introduction
Virtual machines (VMs) are the fundamental building blocks that power most
applications today. VMs are powered by a program known as a hypervisor that can
expose the hardware of a physical system to multiple virtual machines. Each virtual
machine runs an operating system entirely separate from the host system and any other
virtual machines running on the same hardware. The overhead of running an entire
operating system nested inside of another operating system does have performance
implications. Still, those drawbacks are usually greatly outweighed by the scaling,
operational, and security benefits that VMs provide.
Despite the numerous benefits VMs offer, they are a cumbersome level of
abstraction for many workloads. The overhead incurred by running a VM is high relative
to the resources many applications require. Many small services can run on a single
virtual machine, but for security purposes and management reasons, it's often better to
keep them separate. Before containers, as known today, many technologies were
developed to provide application isolation at the OS level and allow administrators to
prevent unwanted interference – malicious or otherwise – between applications running
on the same host.
One of the earliest examples of OS-level virtualization was the chroot1 system
call, developed by Bell Laboratories and introduced in Unix Version 7. Chroot allowed
a process to set a specific starting point for paths beginning with "/." There is little
documentation regarding the original use of chroot. Still, in 1994, the Computer
Systems Research Group at the University of California, Berkeley, added the chroot2
command line tool to the 4.4BSD operating system. In an accompanying book, the
authors state that the syscall was primarily used to set up restricted access to a system
(Bostic et al., 1996, p. 38). For example, if chroot has been called with the directory
/var/application and the application tries to read /etc/shadow, it would
automatically translate to /var/application/etc/shadow. In 2000, FreeBSD 4.0 – a
1 /usr/man/man2/chdir.2 – Unix Version 7
2 /usr/share/man/cat8/chroot.0 – Berkeley Software Distribution 4.4
derivative of BSD (Lucas, 2019, p. xxxvi) – introduced the jail3 system call, which is
built on top of chroot and provides additional security guarantees. Similarly, the Linux
operating system developed its own OS-level virtualization. Most notably, Linux
Containers and Docker (Hildred, 2015).
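The path translation that chroot performs can be sketched in a few lines of Python. This is an illustration of the concept only, not code from any operating system, and chroot_translate is a hypothetical helper name:

```python
import os.path

def chroot_translate(new_root: str, path: str) -> str:
    """Mimic how path resolution behaves after chroot(new_root):
    every path beginning with "/" is reinterpreted relative to new_root."""
    # os.path.join discards earlier components when given an absolute
    # path, so strip the leading "/" before joining.
    return os.path.join(new_root, path.lstrip("/"))

# A process chrooted into /var/application that opens /etc/shadow
# actually touches /var/application/etc/shadow.
print(chroot_translate("/var/application", "/etc/shadow"))
```

The real system call changes the starting point the kernel uses when resolving paths for the calling process; the helper above only models the resulting translation.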
Large organizations with complex systems and sprawling infrastructure found that
neither VMs nor containers provided the right level of abstraction for their workloads4,5,6.
Because of this, many companies have developed custom orchestration systems to
automatically provision, schedule, and manage workloads using a combination of VMs
and containers (Verma et al., 2015, p. 12). One of the most influential orchestration
systems was an internally developed tool at Google called Borg. It allowed users to
define 'jobs' using a declarative syntax, and Borg would take care of scheduling,
execution, availability, and monitoring (Verma et al., 2015). Borg directly inspired the
open-source container orchestrator Kubernetes.
Containers aren't inherently insecure; they simply lack the same level of security
that virtual machines can provide. The deployment speed and scale container
orchestrators enable is a double-edged sword for modern security teams. Kubernetes is a
complex system that requires careful implementation to be effective and secure, even
before considering the security of the applications developers will run inside a deployed cluster.
Additionally, beyond standard security hygiene and best practices, many workloads
require an added level of security and isolation due to specific risks they entail, such as
multi-tenant environments.
3 https://man.freebsd.org/cgi/man.cgi?query=jail&sektion=2&manpath=FreeBSD+4.0-RELEASE
4 https://kubernetes.io/case-studies/spotify
5 https://kubernetes.io/case-studies/squarespace
6 https://kubernetes.io/case-studies/box
organizations need more control over what code executes, and the sheer volume of
pipelines makes it impossible for them to review every individual execution. To improve
the security posture of their pipelines, they use temporary containers or virtual machines
spawned for each job in each pipeline. CI/CD pipelines are generally not time- or
resource-sensitive, making them good candidates for added runtime security.
The small amount of overhead incurred by many tools would easily go unnoticed.
Additionally, adding observability and security to CI/CD processes has become
increasingly important as threat actors target software supply chains.
called Jupyter Hub, which is a browser-based interface that allows them to collaborate
and execute code in a shared environment. These shared environments are commonly run
within containers or VMs and provide a way to give users a common pre-configured
environment. While administrators can tightly control the libraries that are available since
they control the images on the backend, sandboxing provides an added layer of security if
a zero-day vulnerability is found within the container runtime or malicious code is later
discovered in a previously approved library (Hafner et al., 2021; Kaplan & Qian,
2021; Schlueter, 2016; Zahan et al., 2022).
1.4. Learning
Many languages today offer the ability to test your code in a browser via
"playgrounds."7,8 These environments will upload code to be compiled and run on a
remote server and then return the output to the user's browser. Sandboxes add an
additional layer of security to the servers handling the code. There are also learning
services such as HackerRank and Codewars, which offer hundreds of interactive
exercises for dozens of languages. These services work similarly, allowing users to
upload the code they wrote, which is compiled and executed. Without proper sandboxing,
both examples would be incredibly risky for any organization, since each purposefully
allows arbitrary remote code execution.
2. Runtime Security
The key to running code safely is controlling exactly what resources the
environment has access to. Improperly configured containers can allow the processes
running inside them to execute with higher permissions than they should normally have
and glean additional information about the underlying host. In many cases, this is easier
said than done. Even trivial applications can require access to hundreds of files and
directories to load libraries, read and write data, and otherwise interact with a system.
Access to the Internet or other network resources is another common requirement.
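One concrete measure of what a process is allowed to do on Linux is its capability set, which the kernel reports as a hexadecimal bitmask (e.g., the CapEff field in /proc/self/status). The snippet below is an illustrative helper, not part of any cited tool; the bit positions follow linux/capability.h, only a handful of well-known capabilities are listed, and the sample value is the bounding set commonly reported for default Docker containers:

```python
# Map a few well-known capability bit positions (from linux/capability.h)
# to their names. A real auditing tool would cover all of them.
CAPS = {
    0: "CAP_CHOWN",
    1: "CAP_DAC_OVERRIDE",
    2: "CAP_DAC_READ_SEARCH",
    7: "CAP_SETUID",
    12: "CAP_NET_ADMIN",
    19: "CAP_SYS_PTRACE",
    21: "CAP_SYS_ADMIN",
}

def decode_caps(mask_hex: str) -> list:
    """Decode a capability bitmask as found in /proc/<pid>/status."""
    mask = int(mask_hex, 16)
    return [name for bit, name in sorted(CAPS.items()) if mask & (1 << bit)]

# Sample value commonly reported for a default Docker container;
# note the absence of CAP_SYS_ADMIN and CAP_NET_ADMIN.
print(decode_caps("00000000a80425fb"))
```

Comparing a container's effective capabilities against what the workload actually needs is one practical way to spot processes that can "execute with higher permissions than they should normally have."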
7 https://play.rust-lang.org
8 https://go.dev/play
Security teams must often make trade-offs between following security best
practices and meeting stakeholders' requirements. There are sometimes dozens of
products to evaluate when looking for a solution to a specific problem. This level of
choice gives administrators the flexibility they need to meet the requirements specific to
their organization. However, it also makes reviewing and understanding every available
option difficult. This paper aims to make understanding the ever-changing landscape of
choices easier by focusing on the most common underlying technologies that power
runtime security tools.
9 https://gvisor.dev/docs
system calls before passing them to the host's kernel, certain tasks such as filesystem I/O
and maintaining network connections have a noticeable overhead.
Kata containers have the same goal as gVisor, but instead of acting as a shim, the
kata runtime launches the container in a "lightweight" virtual machine10. Kata's OCI-
compliant runtime is exposed through the containerd-shim-kata-v2 command line
tool. The runtime interacts with a hypervisor to launch a dedicated virtual machine for
each container. Because Kata launches a VM for each container, it can guarantee the
same level of isolation that VMs offer. The virtual machine images and hypervisors11 that
Kata supports are highly optimized for containerized workloads. Still, it does incur a
performance penalty due to the added overhead. By default, it uses QEMU, a type 2
hypervisor that leverages KVM, but it also supports newer, container-optimized Virtual Machine
Monitors (VMM) such as Cloud Hypervisor and Firecracker. While a full analysis is out
of scope for this paper, Firecracker was specifically developed by AWS for their Lambda
and Fargate services to provide faster function execution and higher isolation via a
purpose-built hypervisor (Gupta & Lian, 2018).
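In a Kubernetes cluster where the Kata shim is installed, selecting the runtime for a given workload is typically done with a RuntimeClass. The sketch below assumes the containerd handler is named "kata"; the handler value must match whatever runtime name is configured in containerd on the nodes:

```yaml
# RuntimeClass mapping a name to the containerd handler for Kata
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata            # must match the runtime entry in containerd's config
---
# A pod that opts into the Kata runtime; its containers run inside a VM
apiVersion: v1
kind: Pod
metadata:
  name: sandboxed-workload
spec:
  runtimeClassName: kata
  containers:
  - name: app
    image: nginx
```

This pattern lets administrators reserve the heavier, VM-backed runtime for untrusted workloads while the rest of the cluster continues to use the default runtime.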
10 https://github.com/kata-containers/kata-containers/tree/3.3.0/docs/design/architecture
11 https://github.com/kata-containers/kata-containers/blob/3.3.0/docs/hypervisors.md
prevalent security modules are AppArmor and SELinux, which implement forms of
mandatory access control (MAC). On a standard Linux system, files and processes are
associated with a user, group, and a "mode" that determines the permissions given to the
owner, group, and any other users on the system12. The "mode" is expressed as a three-
digit octal number where the first, second, and third digits correspond to the owner,
group, and 'others,' respectively. Each number represents the permissions – read, write, or
execute – that can be given to the specific entity in that position. Each permission has a
numeric value of 1 (execute), 2 (write), or 4 (read). Adding them together can express
multiple permissions. For example, 'read' and 'execute' (1+4) would be 5. A complete
example of a real file might look like this:
12 https://tldp.org/LDP/intro-linux/html/sect_03_04.html
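The octal arithmetic described above can be made concrete with a short helper. This is an illustration added for clarity; mode_to_string is a hypothetical name, not a standard API:

```python
def mode_to_string(octal_mode: str) -> str:
    """Expand a three-digit octal mode (e.g. "755") into rwx notation,
    one digit each for the owner, group, and others."""
    perms = ""
    for digit in octal_mode:
        value = int(digit, 8)
        perms += "r" if value & 4 else "-"  # read  = 4
        perms += "w" if value & 2 else "-"  # write = 2
        perms += "x" if value & 1 else "-"  # exec  = 1
    return perms

# 'read' and 'execute' (4+1) is 5, so mode 755 gives the group
# and others read and execute but not write.
print(mode_to_string("755"))  # rwxr-xr-x
```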
aren't as strong as the isolation of a virtual machine, they can still be an incredibly
effective security measure if properly applied.
2.3. eBPF
Extended Berkeley Packet Filter (eBPF) is another technology inside the Linux
kernel that provides powerful networking, observability, and security features. BPF was
initially introduced as the "BSD Packet Filter" for the BSD operating system in 1992
(McCanne & Jacobson, 1993). It provided a way for users to efficiently capture and filter
network packets by implementing a BPF virtual machine within the kernel that could
execute user-provided instructions via an interpreter. This novel approach had two
benefits. First, it was highly efficient because it avoided the expensive operation of
copying every network packet to a user-space program and instead executed the
instructions directly in the kernel – only returning the packets matched by the filter. The
second benefit was the safety provided by an additional layer of verification in the BPF
virtual machine: before executing a user's instructions, the interpreter checks that they
are valid and safe.
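The validate-then-execute model can be sketched with a toy filter machine. This is a greatly simplified illustration of the idea only; real BPF has registers, jumps, and a far more sophisticated verifier, and the instruction names below are invented for this sketch:

```python
# Toy filter machine in the spirit of classic BPF: user-supplied programs
# use a fixed instruction set and are verified before they ever run.
ALLOWED_OPS = {"ld_byte", "jeq_accept", "ret"}

def verify(program, packet_len=64):
    """Reject programs with unknown instructions or out-of-bounds loads."""
    for op, arg in program:
        if op not in ALLOWED_OPS:
            raise ValueError("illegal instruction: " + op)
        if op == "ld_byte" and not 0 <= arg < packet_len:
            raise ValueError("load out of bounds")
    return program

def run(program, packet):
    """Execute a verified program against one packet; True means 'match'."""
    acc = 0
    for op, arg in program:
        if op == "ld_byte":      # load packet byte at offset arg
            acc = packet[arg]
        elif op == "jeq_accept": # accept the packet if accumulator == arg
            if acc == arg:
                return True
        elif op == "ret":        # terminate with a fixed verdict
            return bool(arg)
    return False

# Accept packets whose first byte is 0x45 (an IPv4 header with IHL=5).
prog = verify([("ld_byte", 0), ("jeq_accept", 0x45), ("ret", 0)])
print(run(prog, bytes([0x45, 0x00, 0x28])))  # True
```

As in real BPF, the filter runs where the data already is, and verification up front means an untrusted filter can misclassify packets but cannot crash the interpreter or read outside its sandbox.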
Despite its differences from "classic BPF" (the original implementation), eBPF
maintains the same performance and safety guarantees as its predecessor. These
guarantees are a large part of the success it has seen today. When faced with the choice
between implementing a kernel module or a BPF program, developers usually find BPF the much better choice.
Some of the most popular Kubernetes runtime security tools today are powered by
eBPF: Tracee13, Falco14, and Tetragon15. The way these runtimes work is similar to
gVisor. However, instead of inserting a process between the container and the host that
intercepts system calls before they reach the kernel, the runtimes inject eBPF programs
directly into the kernel that monitor for specific events and syscalls. This leads to much
better performance with the added flexibility of being able to write custom rules similar
to the "profiles" that LSMs use to deny or allow actions. Another benefit is the increased
visibility eBPF provides. Rather than intercepting and blocking certain actions, these
runtimes can feed events to external monitoring systems, which allows administrators to
collect detailed metrics, improve incident response, and proactively respond to unexpected
behaviors.
3. Research Method
Each technology will be analyzed based on the isolation level it provides from the
infrastructure it runs within and the performance cost that isolation incurs. Generally
speaking, the higher the level of isolation, the higher the performance cost. However, the
diverse nature of workloads and the various optimizations that exist complicate this rule:
GPU-optimized workloads, for example, may be less affected by higher levels of isolation
from the host since their calculations are offloaded to the GPU, while certain tasks may
be more sensitive to network latency than to system call bottlenecks.
A set of tests was chosen from the Phoronix Test Suite, a set of standardized
benchmarks covering CPU, memory, and I/O intensive workloads.
13 https://github.com/aquasecurity/tracee
14 https://github.com/falcosecurity/falco
15 https://github.com/cilium/tetragon
16 https://docs.turingpi.com/docs/turing-rk1-specs-and-io-ports
run a node at a given time. Pods will be limited to consuming at most 4 CPU cores and 8
GB of memory. Each node is running the following software and versions:
Software Version
Host Operating System Ubuntu 22.04.4 LTS
Kernel 5.10.160-rockchip aarch64
AppArmor 3.0.4-2ubuntu2.3
containerd 1.7.13
runc 1.1.12
gVisor (runsc) release-20240311.0
kata-runtime 3.3.0
qemu-system-aarch64 7.2.0
Tetragon 1.0.3
The testing and result collection will be orchestrated via the Phoronix Test Suite,
which is the tool developed by Phoronix Media to power OpenBenchmarking.org. The
Phoronix Test Suite contains hundreds of open-source benchmarks that cover dozens of
different applications and performance tools. Leveraging an existing and widely deployed
benchmarking tool will help reduce the chances of introducing testing biases and provide
results that can be objectively compared against the millions of existing results hosted by
OpenBenchmarking.org. The following tests will be executed as part of the test suite for
this research:
Test Parameters
pts/nginx-3.0.1 20 and 100 active connections
pts/redis-1.4.0 SET, GET, LPUSH, LPOP, and SADD with 50 parallel connections
pts/unpack-linux-1.2.0 N/A
pts/compress-zstd-1.6.0 Compression Level 3
connections.
Unfortunately for gVisor and Kata containers, the first test highlights their biggest
weakness, handling network connections. The drastic difference here is the overhead
added to the syscalls required for every request. In the case of Kata containers, the
inefficiencies of running a network stack within a virtual machine have a significant
impact on performance. Still, the impact is not nearly as large as the one gVisor imposes. gVisor
suffers from the overhead the Sentry adds by needing to intercept and translate the
required syscalls for every request it handles. Tetragon performed similarly to a standard
container, which is to be expected since the underlying mechanism – an eBPF program –
is running directly inside the kernel. There does appear to be a slight drop in capacity for
the 100-connection test, but the difference is negligible.
17 https://github.com/facebook/zstd
Compared to the previous NGINX example, the results between the containers
were practically indistinguishable. Kata containers even showed a slight improvement in
compression speed. The cause of this improvement was unclear at the time of testing but
may have been related to the fact that the QEMU virtual machines launched by the kata-
runtime were using a more recent Linux kernel (6.1.62-126), which could have contained
patches and optimizations that the host kernel did not have.
4.1.3. Redis
Redis is an in-memory key-value store used by many applications. It is favored
for its speed and scalability, which enable it to handle millions of requests efficiently.
Even in this small test environment, the number of requests the benchmark tools were
able to generate was impressive. Unfortunately, this test highlighted the weakness of kata
containers and gVisor, which is similar to the NGINX benchmark.
gVisor took another large performance hit in this scenario due to the overhead
introduced by its syscall filtering. Interestingly, kata containers did not appear to take as
much of a performance hit. They performed marginally below the Baseline and
Tetragon benchmark runs, but the gap was not as large as in the NGINX benchmark. As was
mentioned previously, this is not something that would typically need to run in a sandbox
since Redis is a "trusted" application and doesn't run any arbitrary code, but it does
highlight the importance of understanding an organization's applications and testing for
its specific environment. It would have been easy to assume that Kata might have
suffered a similar penalty in this test based on the NGINX benchmark. However, this is
not the case.
The results of this test were initially surprising. The benchmark runs with no
sandboxing and the runs with Tetragon performed similarly – and had the best
performance – but the gVisor and Kata runs showed that file operations were
significantly impacted. Kata containers also had a large variance between runs, with the
slowest taking 99 seconds and the fastest run still taking 66 seconds. None of the other
runtimes had such wide variances. Upon further investigation, the performance impact
seemed to derive from QEMU's implementation of shared host paths.
One of the most promising developments discovered during the research for this
paper is the progress and capabilities shown by BPF. While hardened runtimes are
18 https://kubernetes.io/docs/concepts/storage/volumes#emptydir
necessary in some scenarios, the added security and observability provided by BPF-based
runtime security tools are hard to overlook. They should be considered a standard
addition to any new environment. Tetragon specifically positioned itself as a complement
to existing LSMs and provides powerful capabilities that enable administrators to respond
proactively to emerging threats.
6. Conclusion
This paper has shown that there are many available tools to choose from when
administrators need to add additional layers of defense to their infrastructure. The
container runtime security landscape constantly evolves, and each new technology aims
to be better than the last but inevitably involves trade-offs.
Nevertheless, this paper's examination found that because of that constant improvement,
developers and administrators should be able to find the right balance of security and
performance for their needs. Additionally, our research identified eBPF as a major area of
future growth and research. The observability, control, and performance offered by BPF
can allow an administrator to monitor and secure their infrastructure through a common
interface and allow them to react to emerging threats in a way that was not possible
previously.
References
Belim, S. V., & Belim, S. Yu. (2018). Implementation of Mandatory Access Control in
1126. https://doi.org/10.3103/S0146411618080357
https://www.tuhs.org/cgi-bin/utree.pl?file=V7
Bostic, K., Karels, M. J., & Quarterman, J. S. (1996). The design and implementation of
Gregg, B. (2019, July 15). BPF Performance Tools: Linux System and Application
07-15/bpf-performance-tools-book.html
Gupta, A., & Lian, L. (2018, November 27). Announcing the Firecracker Open Source
Technology: Secure and Fast microVM for Serverless Computing. AWS Open
secure-fast-microvm-serverless/
Hafner, A., Mur, A., & Bernard, J. (2021). Node package manager’s dependency network
Hildred, T. (2015, August 28). The History of Containers. Red Hat Blog.
https://www.redhat.com/en/blog/history-containers
Kaplan, B., & Qian, J. (2021). A Survey on Common Threats in npm and PyPi Registries
security/open-sourcing-gvisor-a-sandboxed-container-runtime
Lucas, M. W. (2019). Absolute FreeBSD: The complete guide to FreeBSD (3rd edition).
No Starch Press.
Manor, E. (2018, July 24). Bringing the best of serverless to you. Google Cloud Platform
Blog. https://cloudplatform.googleblog.com/2018/07/bringing-the-best-of-
serverless-to-you.html
McCanne, S., & Jacobson, V. (1993, January). The BSD Packet Filter: A New
winter-1993-conference/bsd-packet-filter-new-architecture-user-level-packet
(1.2.0). https://github.com/opencontainers/runtime-spec/releases/tag/v1.2.0
Schlueter, I. (2016, March 26). Kik, left-pad, and npm. Npm Blog.
https://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm
The Kata Authors. (n.d.). Kata Containers (3.3.0) [Computer software]. OpenInfra
Foundation. https://github.com/kata-containers/kata-containers
The Kubernetes Authors. (n.d.-a). Case study: Box. Kubernetes User Case Studies.
The Kubernetes Authors. (n.d.-b). Case study: Spotify. Kubernetes User Case Studies.
The Kubernetes Authors. (n.d.-c). Case study: Squarespace. Kubernetes User Case
studies/squarespace/
The Phoronix Test Suite Authors. (n.d.). Phoronix Test Suite (10.8.4) [Computer
The Tetragon Authors. (n.d.). Tetragon (1.0.3) [Computer software]. Cloud Native
Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., & Wilkes, J. (2015).
https://doi.org/10.1145/2741948.2741964
Zahan, N., Zimmermann, T., Godefroid, P., Murphy, B., Maddila, C., & Williams, L.
(2022). What are Weak Links in the npm Supply Chain? Proceedings of the 44th
Appendix
Results
Baseline - Test Results by Node kube01 kube02 kube03 kube04 Average
Unpacking The Linux Kernel - linux-5.19.tar.xz (sec) 15.28 14.49 13.95 14.00 14.43
Zstd Compression - Compression Level: 3 - Compression Speed (MB/s) 177.80 176.20 176.10 176.30 176.60
Zstd Compression - Compression Level: 3 - Decompression Speed (MB/s) 721.30 715.40 724.90 739.80 725.35
Redis - Test: GET - Parallel Connections: 50 (Reqs/sec) 1,347,345 1,325,022 1,367,589 1,370,072 1,352,507
Redis - Test: SET - Parallel Connections: 50 (Reqs/sec) 921,270 893,255 912,752 924,104 912,846
Redis - Test: LPOP - Parallel Connections: 50 (Reqs/sec) 1,315,050 1,313,340 1,324,483 1,345,346 1,324,555
Redis - Test: SADD - Parallel Connections: 50 (Reqs/sec) 1,115,189 1,122,310 1,122,221 1,157,980 1,129,425
Redis - Test: LPUSH - Parallel Connections: 50 (Reqs/sec) 722,436 715,132 715,621 735,950 722,285
nginx - Connections: 20 (Reqs/sec) 6,122 6,236 6,186 6,437 6,245
nginx - Connections: 100 (Reqs/sec) 6,080 6,359 6,174 6,467 6,270
Kata Containers - Test Results by Node kube01 kube02 kube03 kube04 Average
Unpacking The Linux Kernel - linux-5.19.tar.xz (sec) 66.99 99.08 78.58 91.52 84.04
Unpacking The Linux Kernel - linux-5.19.tar.xz - virtio-fs (sec) 71.775 81.059 79.302 62.676 73.70
Zstd Compression - Compression Level: 3 - Compression Speed (MB/s) 268.80 202.70 212.50 210.40 223.60
Zstd Compression - Compression Level: 3 - Decompression Speed (MB/s) 718.10 704.70 717.00 725.70 716.38
Redis - Test: GET - Parallel Connections: 50 (Reqs/sec) 1,180,419 1,212,683 1,226,434 1,243,539 1,215,769
Redis - Test: SET - Parallel Connections: 50 (Reqs/sec) 869,500 883,480 895,624 904,463 888,267
Redis - Test: LPOP - Parallel Connections: 50 (Reqs/sec) 1,223,896 1,227,998 1,251,235 1,264,323 1,241,863
Redis - Test: SADD - Parallel Connections: 50 (Reqs/sec) 988,015 1,001,220 1,018,159 1,028,757 1,009,038
Redis - Test: LPUSH - Parallel Connections: 50 (Reqs/sec) 732,325 771,398 767,261 792,585 765,892
nginx - Connections: 20 (Reqs/sec) 2,956 1,815 1,858 1,860 2,122
nginx - Connections: 100 (Reqs/sec) 2,905 1,735 1,830 1,865 2,084