Rootless Containers &
Unresolved Issues
Akihiro Suda / NTT (@_AkihiroSuda_)
May 17, 2019
1
Agenda
• Introduction to Rootless Containers
• How it works
• Adoption status
• Unresolved issues
• containerd dev plan
2
Introduction
3
Rootless Containers
• Run containers, runtimes, and orchestrators as a non-root
user
• Don’t confuse with:
– usermod -aG docker penguin
– docker run --user
– dockerd --userns-remap
4
Motivation for Rootless Containers
• To mitigate potential vulnerabilities in container runtimes and
orchestrators (the primary motivation)
• To allow users of shared machines (e.g. HPC) to run
containers without the risk of breaking other users’
environments
– Still unsuitable for “multi-tenancy” where you can’t really
trust other users
• To isolate nested containers, e.g. “Docker-in-Docker”
5
Runtime vulnerabilities
• Docker “Shocker” (2014)
– A malicious container was allowed to access the host file system,
as CAP_DAC_READ_SEARCH was effective by default
• Docker CVE-2014-9357
– A malicious docker build container could run an arbitrary binary on
the host as root due to an LZMA archive issue
• containerd #2001 (2018)
– A malicious container image could remove /tmp on the host when
the image was pulled (not when actually launched!)
6
• These are vulnerabilities of the daemons, not of the containers per se,
so --userns-remap is not effective against them
Runtime vulnerabilities
• runc #1962 (2019)
– Container break-out via
/proc/sys/kernel/core_pattern or
/sys/kernel/uevent_helper
– Hosts with the initrd rootfs (DOCKER_RAMDISK) were
affected (e.g. Minikube)
• runc CVE-2019-5736
– Container break-out via /proc/self/exe
8
Other vulnerabilities
• Kubernetes CVE-2017-1002101, CVE-2017-1002102
– A malicious container was allowed to access the host filesystem via
vulnerabilities related to volumes
• Kubernetes CVE-2018-1002105
– A malicious API call could be used to gain cluster-admin (and
hence root privileges on the nodes)
• Git CVE-2018-11235 (affected Kubernetes gitRepo volumes)
– A malicious repo could execute an arbitrary binary as root when
it was cloned
9
• For these, --userns-remap might not be effective
Play-with-Docker.com vulnerability
• Play-with-Docker.com: Online Docker playground,
implemented using Docker-in-Docker with custom
AppArmor profiles
• A malicious kernel module was loadable due to an AppArmor
misconfiguration (revealed on Jan 14, 2019)
– Not really an issue of Docker
11
https://www.cyberark.com/threat-research-blog/how-i-hacked-play-with-docker-and-remotely-ran-code-on-the-host/
What Rootless Containers can do
• Prohibit accessing files owned by other users
• Prohibit modifying firmware and kernel (→ undetectable
malware)
• Prohibit other privileged operations like ARP spoofing,
rebooting, ...
12
What Rootless Containers cannot do
• If an attacker breaks out of a container, they might still be able
to
– Mine cryptocurrencies
– Use the host as a springboard to attack other hosts
• Not effective against kernel / VM / HW vulnerabilities
– But gVisor could be used together with them to mitigate some of
these
13
How it works
14
User Namespaces
• User namespaces allow non-root users to pretend to be
root
• Root-in-UserNS can have “fake” UID 0 and also create other
namespaces (MountNS, NetNS, ...)
15
User Namespaces
16
$ id -u
1001
$ ls -ln
-rw-rw---- 1 1001 1001 42 May 1 12:00 foo
$ docker-rootless run -v $(pwd):/mnt -it alpine
/ # id -u
0
/ # ls -ln /mnt
-rw-rw---- 1 0 0 42 May 1 12:00 foo
User Namespaces
17
$ docker-rootless run -v /:/host -it alpine
/ # ls -ln /host/dev/sda
brw-rw---- 1 65534 65534 8, 0 May 1 12:00 /host/dev/sda
/ # cat /host/dev/sda
cat: can't open '/host/dev/sda': Permission denied
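The 65534 owner in the listing above is what the kernel reports for host IDs that have no mapping in the current user namespace (the overflow UID, 65534 by default). A minimal Python sketch of that translation follows. It assumes the three-column format of /proc/<pid>/uid_map (ID inside the namespace, ID outside, length); the function names and the sample map are illustrative assumptions, not taken from any runtime.

```python
OVERFLOW_UID = 65534  # kernel default, see /proc/sys/kernel/overflowuid


def parse_uid_map(text):
    """Parse uid_map-style text into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        if line.strip():
            inside, outside, length = (int(field) for field in line.split())
            entries.append((inside, outside, length))
    return entries


def host_to_namespace_uid(host_uid, entries):
    """Return the UID that a process inside the namespace sees for host_uid."""
    for inside, outside, length in entries:
        if outside <= host_uid < outside + length:
            return inside + (host_uid - outside)
    return OVERFLOW_UID  # unmapped IDs show up as "nobody"


# Hypothetical map: the invoking user (1001) becomes root, plus one sub-user range.
SAMPLE_MAP = """\
         0       1001          1
         1     100000      65536
"""

if __name__ == "__main__":
    entries = parse_uid_map(SAMPLE_MAP)
    print(host_to_namespace_uid(1001, entries))  # the user's own files
    print(host_to_namespace_uid(0, entries))     # files owned by host root
```

With this map, files owned by the invoking user appear as owned by UID 0, while files owned by host root (such as a block device node) fall through to the overflow UID.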
Sub-users (and sub-groups)
• Put users in your user account so you can be a user while
you are being a user
• Sub-users are used as non-root users in a container
– USER in Dockerfile
– docker run --user
18
Sub-users (and sub-groups)
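• If /etc/subuid contains “1001:100000:65536”
• Having 65536 sub-users should be enough for most
containers
19
[Figure: UID mapping between the host (0 to 2^32) and the UserNS. Host UID 1001 (primary user) maps to UserNS UID 0; host UIDs 100000-165535 (sub-users start = 100000, sub-users length = 65536) map to UserNS UIDs 1-65536.]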
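The mapping above can be sketched in Python as the argument list one would hand to newuidmap (pid, then triples of inside ID, outside ID, count). This is only an illustration under assumptions: the helper names and the PID 4242 are hypothetical, and real tools may order or split ranges differently.

```python
def parse_subuid(text, user):
    """Return the (start, length) ranges granted to `user` in /etc/subuid-style text."""
    ranges = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, start, length = line.split(":")
        if name == user:
            ranges.append((int(start), int(length)))
    return ranges


def newuidmap_args(pid, primary_uid, ranges):
    """Build a newuidmap command line: primary user -> 0, sub-users -> 1 and up."""
    args = ["newuidmap", str(pid), "0", str(primary_uid), "1"]
    next_inside = 1
    for start, length in ranges:
        args += [str(next_inside), str(start), str(length)]
        next_inside += length
    return args


if __name__ == "__main__":
    ranges = parse_subuid("1001:100000:65536\n", "1001")
    # 4242 is a made-up PID of a process already running in the new UserNS.
    print(" ".join(newuidmap_args(4242, 1001, ranges)))
```

The same shape applies to newgidmap with /etc/subgid.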
Sub-users (and sub-groups)
• Sub-users are configured via the SUID binaries
/usr/bin/{newuidmap, newgidmap}
• SUID binaries can be dangerous; newuidmap &
newgidmap have had two CVEs so far:
– CVE-2016-6252 (CVSS v3: 7.8): integer overflow issue
– CVE-2018-7169 (CVSS v3: 5.3): supplementary GID issue
20
Sub-users (and sub-groups)
• Sub-users are also hard to maintain
– LDAP / AD
– Nesting user namespaces might need a huge number of
sub-users
21
Sub-users (and sub-groups)
• Alternative way: Single-mapping mode
• Does not require newuidmap/newgidmap
• Ptrace and/or Seccomp can be used for intercepting
syscalls to emulate sub-users
– user.rootlesscontainers xattr can be used for
chown emulation
22
Network Namespaces
• An unprivileged user can create network namespaces along
with user namespaces
• With network namespaces, the user can
– isolate abstract (pathless) UNIX sockets
• important to prevent container breakout
– create iptables rules
– set up overlay networking with VXLAN
– run tcpdump
– ...
23
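Abstract UNIX sockets have no path on disk, so hiding the filesystem does not hide them; on Linux they are scoped by network namespace instead. The Python sketch below (Linux only) shows such a socket being reachable without any file existing. The socket name is an arbitrary assumption for the demo.

```python
import os
import socket


def abstract_echo_once(name):
    """Bind an abstract UNIX socket, connect to it, and return what the server receives."""
    # A leading NUL byte selects the abstract namespace (Linux-specific).
    address = b"\0" + name.encode()
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        server.bind(address)
        server.listen(1)
        client.connect(address)  # succeeds for any process in the same NetNS
        conn, _ = server.accept()
        with conn:
            client.sendall(b"hello")
            return conn.recv(5)
    finally:
        client.close()
        server.close()


if __name__ == "__main__":
    demo_name = "rootless-demo-%d" % os.getpid()  # hypothetical name
    print(abstract_echo_once(demo_name))
    print(os.path.exists(demo_name))  # no file is created for the socket
```

A process placed in a different network namespace would get a connection error for the same name, which is why NetNS isolation matters for break-out prevention.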
Network Namespaces
• But an unprivileged user cannot set up veth pairs across
the host and namespaces, i.e. no Internet connection
24
[Figure: the UserNS + NetNS sits inside the host with no link to the Internet.]
Network Namespaces
25
• The lxc-user-nic SUID binary allows unprivileged users to
create veth pairs, but we are not huge fans of SUID binaries
• Our approach: use a completely unprivileged usermode
network (“Slirp”) with a TAP device
[Figure: a TAP device lives in the UserNS + NetNS inside the host; its file descriptor (TAPFD) is sent as an SCM_RIGHTS cmsg to “Slirp”, which connects it to the Internet.]
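Passing a file descriptor as an SCM_RIGHTS control message can be sketched in Python as follows. This is a generic illustration, not slirp4netns code: a pipe stands in for the TAP device (opening a real TAP needs a network namespace), and the helper names are assumptions.

```python
import array
import os
import socket


def send_fd(sock, fd):
    """Send one file descriptor over a UNIX socket as an SCM_RIGHTS cmsg."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    sock.sendmsg([b"x"], ancillary)  # at least one byte of ordinary data is sent along


def recv_fd(sock):
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _, ancdata, _, _ = sock.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
            return fds[0]
    raise RuntimeError("no file descriptor received")


if __name__ == "__main__":
    read_end, write_end = os.pipe()  # stand-in for the TAP fd
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    send_fd(left, read_end)
    received = recv_fd(right)
    os.write(write_end, b"ping")
    print(os.read(received, 4))  # data is readable through the received descriptor
```

The receiver gets its own descriptor referring to the same open file, so the side that holds it can read and write packets without ever having had the privilege to create the device itself.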
Network Namespaces
Benchmark of several “Slirp” implementations:
• slirp4netns (our own implementation based on QEMU Slirp) is the
fastest because it avoids copying packets across the namespaces

                   MTU=1500    MTU=4000     MTU=16384    MTU=65520
vde_plug           763 Mbps    Unsupported  Unsupported  Unsupported
VPNKit             514 Mbps    526 Mbps     540 Mbps     Unsupported
slirp4netns        1.07 Gbps   2.78 Gbps    4.55 Gbps    9.21 Gbps
cf. rootful veth   52.1 Gbps   45.4 Gbps    43.6 Gbps    51.5 Gbps

Benchmark: iperf3 (netns -> host), measured on Travis CI. See rootless-containers/rootlesskit#12
26
Multi-node networking
• Flannel VXLAN is known to work
– Encapsulates Ethernet packets in UDP packets
– Provides L2 connectivity across rootless containers on
different nodes
• Other protocols should work as well, except ones that
require access to raw Ethernet
27
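The "Ethernet in UDP" encapsulation can be illustrated with the 8-byte VXLAN header from RFC 7348. The Python sketch below only shows the wire format; it is not how Flannel implements VXLAN, and the function names are assumptions.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag defined in RFC 7348


def vxlan_header(vni):
    """Return the 8-byte VXLAN header for a 24-bit VXLAN Network Identifier."""
    if not 0 <= vni < 1 << 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) + reserved (24 bits); word 1: VNI (24 bits) + reserved (8 bits).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(vni, ethernet_frame):
    """Prefix an Ethernet frame with a VXLAN header; the result is a UDP payload."""
    return vxlan_header(vni) + ethernet_frame


def decapsulate(payload):
    """Split a UDP payload into (vni, ethernet_frame)."""
    word0, word1 = struct.unpack("!II", payload[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag is not set")
    return word1 >> 8, payload[8:]


if __name__ == "__main__":
    packet = encapsulate(42, b"fake-ethernet-frame")
    print(packet[:8].hex())
    print(decapsulate(packet))
```

Because the result travels as ordinary UDP, no raw Ethernet access on the host is needed, which matches the exception noted on the slide.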
Snapshotting
• OverlayFS is currently unavailable in UserNS (except on
Ubuntu kernels)
• FUSE-OverlayFS can be used instead with kernel 4.18+
• XFS reflink can also be used to deduplicate files (but it is slow)
28
Cgroup
• pam_cgfs can be used for delegating permissions to
unprivileged users, but it is considered insecure by the systemd
folks https://github.com/containers/libpod/issues/1429
• cgroup2 provides proper support for delegation, but it has not
been adopted by OCI at the moment
29
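Whether a cgroup2 subtree has been delegated to the current user can be probed roughly as below. This Python sketch assumes the cgroup2 file layout (cgroup.controllers and cgroup.procs inside each cgroup directory) and treats a writable cgroup.procs as the sign of delegation; the function name is hypothetical and the check is a simplification.

```python
import os


def delegated_controllers(cgroup_dir):
    """Return the controllers usable in cgroup_dir if the caller can manage it, else []."""
    controllers_file = os.path.join(cgroup_dir, "cgroup.controllers")
    procs_file = os.path.join(cgroup_dir, "cgroup.procs")
    if not os.path.isfile(controllers_file):
        return []  # not a cgroup2 directory
    if not os.access(procs_file, os.W_OK):
        return []  # present, but not delegated to this user
    with open(controllers_file) as f:
        return f.read().split()


if __name__ == "__main__":
    # /sys/fs/cgroup is the usual mount point of the unified hierarchy.
    print(delegated_controllers("/sys/fs/cgroup"))
```

An empty result for an unprivileged user is the expected outcome on hosts without delegation.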
Rootless Containers in Containers
• Urgent demand for building images on Kubernetes clusters
• Seccomp and AppArmor need to be disabled for the parent
containers
• To allow the children to mount procfs (pid-namespaced),
maskedPaths and readonlyPaths for /proc/* for the
parent need to be removed (weird!)
– Same applies to sysfs (net-namespaced)
30
Rootless Containers in Containers
• So --privileged has typically been required anyway :(
– Or at least --security-opt
{seccomp,apparmor}=unconfined
• Docker 19.03 supports --security-opt
systempaths=unconfined for allowing procfs & sysfs
mounts (Kube: securityContext.procMount, but no
sysMount yet)
– Make sure to lock the root account in the container!
(passwd -l root, Alpine CVE-2019-5021)
31
Adoption status
32
Adoption status: runtimes
33
                          Docker v19.03 /        Podman (≈ CRI-O) /   LXC                   Singularity
                          containerd / runc      crun
NetNS isolation with      VPNKit, slirp4netns,   slirp4netns          lxc-user-nic (SUID)   No support
Internet connectivity     lxc-user-nic (SUID)
Supports FUSE-OverlayFS   No                     Yes                  No                    No
Cgroup                    No                     Limited support      pam_cgfs              No
                                                 for cgroup2
Adoption status: runtimes::GPU
• nvidia-container-runtime is known to work
• cgroup needs to be disabled manually
• A rootful NVIDIA container needs to be executed on every
system startup
• Other devices such as FPGAs should probably work as well
(untested)
34
Adoption status: runtimes::single-mapping mode
• udocker does not need subuid configuration, as it can
emulate sub-users with ptrace (based on PRoot)
– but no persistent chown
• runROOTLESS (not to be confused with upstream rootless runc)
supports persistent chown as well, using the
user.rootlesscontainers xattr
– the xattr value is a pair of UID and GID in protobuf
encoding
– the xattr convention is compatible with umoci
35
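The "pair of UID and GID in protobuf encoding" can be sketched by hand in Python. The field layout used here (uid as field 1, gid as field 2, both uint32, proto3 rules so zero values are omitted) is an assumption made for illustration; consult the rootlesscontainers proto definition for the authoritative schema.

```python
def _varint(value):
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_resource(uid, gid):
    """Encode (uid, gid) as protobuf bytes; assumed fields: uid = 1, gid = 2."""
    payload = b""
    if uid:
        payload += b"\x08" + _varint(uid)  # tag: field 1, wire type 0 (varint)
    if gid:
        payload += b"\x10" + _varint(gid)  # tag: field 2, wire type 0 (varint)
    return payload


def decode_resource(payload):
    """Decode bytes produced by encode_resource back into (uid, gid)."""
    fields = {1: 0, 2: 0}
    pos = 0
    while pos < len(payload):
        tag = payload[pos]  # single-byte tags suffice for field numbers below 16
        pos += 1
        value = shift = 0
        while True:
            byte = payload[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        fields[tag >> 3] = value
    return fields[1], fields[2]


if __name__ == "__main__":
    blob = encode_resource(1000, 1000)
    print(blob.hex(), decode_resource(blob))
```

On Linux, such a value could be stored with os.setxattr(path, "user.rootlesscontainers", blob), so the emulated owner survives after the process exits.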
Adoption status: runtimes::single-mapping mode
• Ptrace is slow https://github.com/rootless-containers/runrootless/issues/14
• seccomp can be used for acceleration, but it is hard to
implement correctly
36
Adoption status: image builders
• BuildKit / img / Buildah support rootless mode
– Work in containers as well as on the host
– Do not need --privileged, but Seccomp and
AppArmor need to be disabled
37
Adoption status: image builders
• Similar but different work: Kaniko & Makisu
– Rootful
– But no need to disable seccomp and AppArmor,
because they don’t create containers for RUN
instructions in Dockerfile
38
Adoption status: Kubernetes
• The Usernetes project provides patches for rootless Kubernetes,
but they have not been proposed upstream yet
– Supports all major CRI runtimes: dockershim, containerd,
CRI-O
– Flannel VXLAN is known to work
– Lack of cgroup might be a huge concern
• But Usernetes is already integrated into k3s!
(5 less than k8s)
39
$ k3s server --rootless
You can rootlessify your own project easily!
• RootlessKit does almost everything needed to rootlessify your
container project (or almost any rootful app)
– Creates UserNS with sub-users and sub-groups
– Creates MountNS with writable /etc, /run but without
chroot
– Creates NetNS with VPNKit/slirp4netns/lxc-user-nic
– Provides REST API on UNIX socket for port forwarding
management
40
You can rootlessify your own project easily!
41
$ rootlesskit --net=slirp4netns --copy-up=/etc \
    --port-driver=builtin bash
# id -u
0
# touch /etc/here-is-writable-tmpfs
# ip a
...
2: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP>
inet 10.0.2.100/24 scope global tap0
...
# rootlessctl add-ports 0.0.0.0:8080:80/tcp
You can rootlessify your own project easily!
• With RootlessKit, you just need to work on disabling cgroup
stuff, sysctl stuff, and changing the data path from /var/lib
to /home
• Used by Docker, BuildKit, k3s
42
Unresolved Issues
43
Kernel has vulns
• UserNS tends to have privilege escalation vulns
– CVE-2013-1858: UserNS + CLONE_FS
– CVE-2014-4014: UserNS + chmod
– CVE-2015-1328: UserNS + OverlayFS (Ubuntu-only)
• So rootless OverlayFS has still not been merged upstream
– CVE-2018-18955: UserNS + complex ID mapping
44
Kernel has vulns
• A bunch of code paths can hang up the kernel
– e.g. CVE-2018-7191 (published today):
creating a tap device with an illegal name
– And more, see
https://medium.com/@jain.sm/security-challenges-with-kubernetes-818fad4a89f2
• Unlimited resources, e.g.
– Pending signals
– Max user processes
– Max FDs per user
(see the same URL above)
45
Kernel has vulns
• So I’ve never suggested using rootless containers for real
multi-tenancy ¯_(ツ)_/¯
46
Kernel has vulns
• gVisor might be able to mitigate them, but it has significant
overhead and syscall incompatibilities
• UML (20 yo, still alive!) is almost compatible with real Linux,
but it even lacks support for SMP
• linuxd: similar to UML but accelerated with host kernel
patches
– Still no public code
https://schd.ws/hosted_files/ossna18/db/Containerize%20Linux%20Kernel.pdf
47
Cgroups
• cgroup2 is not adopted in OCI
• crun is trying to support cgroup2 without changing OCI spec
48
Mount
• Only supports:
– tmpfs
– bind
– procfs (PID-namespaced)
– sysfs (net-namespaced)
– FUSE (since kernel 4.18)
– Overlay (Ubuntu only)
• No support for mounting any block devices (even loopback
devices)
49
Landlock
• Landlock: an unprivileged sandbox LSM
• Not merged in the upstream kernel, but promising as an
AppArmor alternative
50
LDAP / Active Directory
• /etc/sub{u,g}id configuration is painful for LDAP/AD
• Alternatively, implementing an NSS module is under
discussion, but there is no code yet https://github.com/shadow-maint/shadow/issues/154
51
Single-mapping mode
• runROOTLESS / PRoot could be accelerated with seccomp,
but the implementation is broken
• Kernel 5.0 seccomp could be used for getting rid of ptrace
completely
52
containerd dev plan
53
containerd dev plan
• Implement FUSE-OverlayFS snapshotter plugin
– Probably in a separate repo
– Should not be difficult
• Support cgroup2
– Probably we want to wait for OCI Runtime Spec and runc
to be revised
– But we can also consider beginning to support cgroup2
right now with crun
54
containerd dev plan
• Support running containerd inside gVisor
– So as to allow running rootless containers in a container
without disabling seccomp & AppArmor
– And to mitigate potential kernel vulns
– Currently MountNS is not working
https://github.com/google/gvisor/issues/221
55

More Related Content

What's hot (20)

PDF
Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...
Vietnam Open Infrastructure User Group
 
PPTX
initとプロセス再起動
Takashi Takizawa
 
PDF
Diving Through The Layers: Investigating runc, containerd, and the Docker eng...
Phil Estes
 
PDF
Red Hat OpenStack 17 저자직강+스터디그룹_1주차
Nalee Jang
 
PDF
An introduction to SSH
nussbauml
 
PDF
Linux Profiling at Netflix
Brendan Gregg
 
PDF
Interrupt Affinityについて
Takuya ASADA
 
PDF
Docker Introduction
Jeffrey Ellin
 
PPTX
Dockers and containers basics
Sourabh Saxena
 
PDF
Docker Swarm 0.2.0
Docker, Inc.
 
PPT
Docker introduction
Phuc Nguyen
 
PDF
Extreme Linux Performance Monitoring and Tuning
Milind Koyande
 
PDF
HTTP Request Smuggling via higher HTTP versions
neexemil
 
PPTX
Docker, LinuX Container
Araf Karsh Hamid
 
ODP
Kubernetes Architecture
Knoldus Inc.
 
PDF
Comparing Next-Generation Container Image Building Tools
Akihiro Suda
 
PPTX
Docker 101 - Nov 2016
Docker, Inc.
 
PDF
並行実行制御の最適化手法
Sho Nakazono
 
PDF
Rootless Kubernetes
Akihiro Suda
 
PDF
Rootless Containers
Akihiro Suda
 
Room 3 - 1 - Nguyễn Xuân Trường Lâm - Zero touch on-premise storage infrastru...
Vietnam Open Infrastructure User Group
 
initとプロセス再起動
Takashi Takizawa
 
Diving Through The Layers: Investigating runc, containerd, and the Docker eng...
Phil Estes
 
Red Hat OpenStack 17 저자직강+스터디그룹_1주차
Nalee Jang
 
An introduction to SSH
nussbauml
 
Linux Profiling at Netflix
Brendan Gregg
 
Interrupt Affinityについて
Takuya ASADA
 
Docker Introduction
Jeffrey Ellin
 
Dockers and containers basics
Sourabh Saxena
 
Docker Swarm 0.2.0
Docker, Inc.
 
Docker introduction
Phuc Nguyen
 
Extreme Linux Performance Monitoring and Tuning
Milind Koyande
 
HTTP Request Smuggling via higher HTTP versions
neexemil
 
Docker, LinuX Container
Araf Karsh Hamid
 
Kubernetes Architecture
Knoldus Inc.
 
Comparing Next-Generation Container Image Building Tools
Akihiro Suda
 
Docker 101 - Nov 2016
Docker, Inc.
 
並行実行制御の最適化手法
Sho Nakazono
 
Rootless Kubernetes
Akihiro Suda
 
Rootless Containers
Akihiro Suda
 

Similar to Rootless Containers & Unresolved issues (20)

PDF
The State of Rootless Containers
Akihiro Suda
 
PDF
20240201 [HPC Containers] Rootless Containers.pdf
Akihiro Suda
 
PDF
[KubeCon NA 2020] containerd: Rootless Containers 2020
Akihiro Suda
 
PDF
Podman rootless containers
Giuseppe Scrivano
 
PPTX
Usernetes: Kubernetes as a non-root user
Akihiro Suda
 
PPTX
Exploring Docker Security
Patrick Kleindienst
 
PDF
DCSF19 Hardening Docker daemon with Rootless mode
Docker, Inc.
 
PDF
[DockerCon 2019] Hardening Docker daemon with Rootless mode
Akihiro Suda
 
PDF
Docker, Linux Containers, and Security: Does It Add Up?
Jérôme Petazzoni
 
PDF
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
Yandex
 
PDF
Docker, Linux Containers (LXC), and security
Jérôme Petazzoni
 
PDF
LXC, Docker, security: is it safe to run applications in Linux Containers?
Jérôme Petazzoni
 
PDF
Docker Container: isolation and security
宇 傅
 
PDF
The internals and the latest trends of container runtimes
Akihiro Suda
 
PDF
Containers & Security
All Things Open
 
PDF
[DockerCon 2020] Hardening Docker daemon with Rootless Mode
Akihiro Suda
 
ODP
Practical Container Security by Mrunal Patel and Thomas Cameron, Red Hat
Docker, Inc.
 
PDF
Lightweight Virtualization with Linux Containers and Docker | YaC 2013
dotCloud
 
PDF
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Docker, Inc.
 
PDF
How Secure Is Your Container? ContainerCon Berlin 2016
Phil Estes
 
The State of Rootless Containers
Akihiro Suda
 
20240201 [HPC Containers] Rootless Containers.pdf
Akihiro Suda
 
[KubeCon NA 2020] containerd: Rootless Containers 2020
Akihiro Suda
 
Podman rootless containers
Giuseppe Scrivano
 
Usernetes: Kubernetes as a non-root user
Akihiro Suda
 
Exploring Docker Security
Patrick Kleindienst
 
DCSF19 Hardening Docker daemon with Rootless mode
Docker, Inc.
 
[DockerCon 2019] Hardening Docker daemon with Rootless mode
Akihiro Suda
 
Docker, Linux Containers, and Security: Does It Add Up?
Jérôme Petazzoni
 
"Lightweight Virtualization with Linux Containers and Docker". Jerome Petazzo...
Yandex
 
Docker, Linux Containers (LXC), and security
Jérôme Petazzoni
 
LXC, Docker, security: is it safe to run applications in Linux Containers?
Jérôme Petazzoni
 
Docker Container: isolation and security
宇 傅
 
The internals and the latest trends of container runtimes
Akihiro Suda
 
Containers & Security
All Things Open
 
[DockerCon 2020] Hardening Docker daemon with Rootless Mode
Akihiro Suda
 
Practical Container Security by Mrunal Patel and Thomas Cameron, Red Hat
Docker, Inc.
 
Lightweight Virtualization with Linux Containers and Docker | YaC 2013
dotCloud
 
Lightweight Virtualization with Linux Containers and Docker I YaC 2013
Docker, Inc.
 
How Secure Is Your Container? ContainerCon Berlin 2016
Phil Estes
 
Ad

More from Akihiro Suda (20)

PDF
20250617 [KubeCon JP 2025] containerd - Project Update and Deep Dive.pdf
Akihiro Suda
 
PDF
20250616 [KubeCon JP 2025] VexLLM - Silence Negligible CVE Alerts Using LLM.pdf
Akihiro Suda
 
PDF
20250403 [KubeCon EU] containerd - Project Update and Deep Dive.pdf
Akihiro Suda
 
PDF
20250403 [KubeCon EU Pavilion] containerd.pdf
Akihiro Suda
 
PDF
20250402 [KubeCon EU Pavilion] Lima.pdf_
Akihiro Suda
 
PDF
20241115 [KubeCon NA Pavilion] Lima.pdf_
Akihiro Suda
 
PDF
20241113 [KubeCon NA Pavilion] containerd.pdf
Akihiro Suda
 
PDF
【情報科学若手の会 (2024/09/14】なぜオープンソースソフトウェアにコントリビュートすべきなのか
Akihiro Suda
 
PDF
【Vuls祭り#10 (2024/08/20)】 VexLLM: LLMを用いたVEX自動生成ツール
Akihiro Suda
 
PDF
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
Akihiro Suda
 
PDF
20240321 [KubeCon EU Pavilion] Lima.pdf_
Akihiro Suda
 
PDF
20240320 [KubeCon EU Pavilion] containerd.pdf
Akihiro Suda
 
PDF
[Podman Special Event] Kubernetes in Rootless Podman
Akihiro Suda
 
PDF
[KubeConNA2023] Lima pavilion
Akihiro Suda
 
PDF
[KubeConNA2023] containerd pavilion
Akihiro Suda
 
PDF
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
Akihiro Suda
 
PDF
[CNCF TAG-Runtime] Usernetes Gen2
Akihiro Suda
 
PDF
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
Akihiro Suda
 
PDF
[KubeConEU2023] Lima pavilion
Akihiro Suda
 
PDF
[KubeConEU2023] containerd pavilion
Akihiro Suda
 
20250617 [KubeCon JP 2025] containerd - Project Update and Deep Dive.pdf
Akihiro Suda
 
20250616 [KubeCon JP 2025] VexLLM - Silence Negligible CVE Alerts Using LLM.pdf
Akihiro Suda
 
20250403 [KubeCon EU] containerd - Project Update and Deep Dive.pdf
Akihiro Suda
 
20250403 [KubeCon EU Pavilion] containerd.pdf
Akihiro Suda
 
20250402 [KubeCon EU Pavilion] Lima.pdf_
Akihiro Suda
 
20241115 [KubeCon NA Pavilion] Lima.pdf_
Akihiro Suda
 
20241113 [KubeCon NA Pavilion] containerd.pdf
Akihiro Suda
 
【情報科学若手の会 (2024/09/14】なぜオープンソースソフトウェアにコントリビュートすべきなのか
Akihiro Suda
 
【Vuls祭り#10 (2024/08/20)】 VexLLM: LLMを用いたVEX自動生成ツール
Akihiro Suda
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
Akihiro Suda
 
20240321 [KubeCon EU Pavilion] Lima.pdf_
Akihiro Suda
 
20240320 [KubeCon EU Pavilion] containerd.pdf
Akihiro Suda
 
[Podman Special Event] Kubernetes in Rootless Podman
Akihiro Suda
 
[KubeConNA2023] Lima pavilion
Akihiro Suda
 
[KubeConNA2023] containerd pavilion
Akihiro Suda
 
[DockerConハイライト] OpenPubKeyによるイメージの署名と検証.pdf
Akihiro Suda
 
[CNCF TAG-Runtime] Usernetes Gen2
Akihiro Suda
 
[DockerCon 2023] Reproducible builds with BuildKit for software supply chain ...
Akihiro Suda
 
[KubeConEU2023] Lima pavilion
Akihiro Suda
 
[KubeConEU2023] containerd pavilion
Akihiro Suda
 
Ad

Recently uploaded (20)

PDF
HiHelloHR – Simplify HR Operations for Modern Workplaces
HiHelloHR
 
PPTX
Platform for Enterprise Solution - Java EE5
abhishekoza1981
 
PDF
Mobile CMMS Solutions Empowering the Frontline Workforce
CryotosCMMSSoftware
 
PPTX
Equipment Management Software BIS Safety UK.pptx
BIS Safety Software
 
PPTX
Human Resources Information System (HRIS)
Amity University, Patna
 
PPTX
Comprehensive Guide: Shoviv Exchange to Office 365 Migration Tool 2025
Shoviv Software
 
PDF
Efficient, Automated Claims Processing Software for Insurers
Insurance Tech Services
 
PDF
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
PDF
Streamline Contractor Lifecycle- TECH EHS Solution
TECH EHS Solution
 
PDF
Alexander Marshalov - How to use AI Assistants with your Monitoring system Q2...
VictoriaMetrics
 
PDF
Capcut Pro Crack For PC Latest Version {Fully Unlocked} 2025
hashhshs786
 
PDF
Unlock Efficiency with Insurance Policy Administration Systems
Insurance Tech Services
 
PPTX
How Apagen Empowered an EPC Company with Engineering ERP Software
SatishKumar2651
 
PPTX
Writing Better Code - Helping Developers make Decisions.pptx
Lorraine Steyn
 
PDF
Salesforce CRM Services.VALiNTRY360
VALiNTRY360
 
PDF
Alarm in Android-Scheduling Timed Tasks Using AlarmManager in Android.pdf
Nabin Dhakal
 
PDF
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
PPTX
A Complete Guide to Salesforce SMS Integrations Build Scalable Messaging With...
360 SMS APP
 
PDF
Beyond Binaries: Understanding Diversity and Allyship in a Global Workplace -...
Imma Valls Bernaus
 
PPTX
The Role of a PHP Development Company in Modern Web Development
SEO Company for School in Delhi NCR
 
HiHelloHR – Simplify HR Operations for Modern Workplaces
HiHelloHR
 
Platform for Enterprise Solution - Java EE5
abhishekoza1981
 
Mobile CMMS Solutions Empowering the Frontline Workforce
CryotosCMMSSoftware
 
Equipment Management Software BIS Safety UK.pptx
BIS Safety Software
 
Human Resources Information System (HRIS)
Amity University, Patna
 
Comprehensive Guide: Shoviv Exchange to Office 365 Migration Tool 2025
Shoviv Software
 
Efficient, Automated Claims Processing Software for Insurers
Insurance Tech Services
 
vMix Pro 28.0.0.42 Download vMix Registration key Bundle
kulindacore
 
Streamline Contractor Lifecycle- TECH EHS Solution
TECH EHS Solution
 
Alexander Marshalov - How to use AI Assistants with your Monitoring system Q2...
VictoriaMetrics
 
Capcut Pro Crack For PC Latest Version {Fully Unlocked} 2025
hashhshs786
 
Unlock Efficiency with Insurance Policy Administration Systems
Insurance Tech Services
 
How Apagen Empowered an EPC Company with Engineering ERP Software
SatishKumar2651
 
Writing Better Code - Helping Developers make Decisions.pptx
Lorraine Steyn
 
Salesforce CRM Services.VALiNTRY360
VALiNTRY360
 
Alarm in Android-Scheduling Timed Tasks Using AlarmManager in Android.pdf
Nabin Dhakal
 
Odoo CRM vs Zoho CRM: Honest Comparison 2025
Odiware Technologies Private Limited
 
A Complete Guide to Salesforce SMS Integrations Build Scalable Messaging With...
360 SMS APP
 
Beyond Binaries: Understanding Diversity and Allyship in a Global Workplace -...
Imma Valls Bernaus
 
The Role of a PHP Development Company in Modern Web Development
SEO Company for School in Delhi NCR
 

Rootless Containers & Unresolved issues

  • 1. Rootless Containers & Unresolved Issues Akihiro Suda / NTT (@_AkihiroSuda_) May 17, 2019 1
  • 2. Agenda • Introduction to Rootless Containers • How it works • Adoption status • Unresolved issues • containerd dev plan 2
  • 4. Rootless Containers • Run containers, runtimes, and orchestrators as a non-root user • Don’t confuse with: – usermod -aG docker penguin – docker run --user – dockerd --userns-remap 4
  • 5. Motivation of Rootless Containers • To mitigate potential vulnerability of container runtimes and orchestrator (the primary motivation) • To allow users of shared machines (e.g. HPC) to run containers without the risk of breaking other users environments – Still unsuitable for “multi-tenancy” where you can’t really trust other users • To isolate nested containers, e.g. “Docker-in-Docker” 5
  • 6. Runtime vulnerabilities • Docker “Shocker” (2014) – A malicious container was allowed to access the host file system, as CAP_DAC_READ_SEARCH was effective by default • Docker CVE-2014-9357 – A malicious docker build container could run arbitrary binary on the host as the root due to an LZMA archive issue • containerd #2001 (2018) – A malicious container image could remove /tmp on the host when the image was pulled (not when actually launched!) 6
  • 7. Runtime vulnerabilities • Docker “Shocker” (2014) – A malicious container was allowed to access the host file system, as CAP_DAC_READ_SEARCH was effective by default • Docker CVE-2014-9357 – A malicious docker build container could run arbitrary binary on the host as the root due to an LZMA archive issue • containerd #2001 (2018) – A malicious container image could remove /tmp on the host when the image was pulled (not when actually launched!) 7 Vulnerability of daemons, not containers per se So --userns-remap is not effective
  • 8. Runtime vulnerabilities • runc #1962 (2019) – Container break-out via /proc/sys/kernel/core_pattern or /sys/kernel/uevent_helper – Hosts with the initrd rootfs (DOCKER_RAMDISK) were affected (e.g. Minikube) • runc CVE-2019-5736 – Container break-out via /proc/self/exe 8
  • 9. Other vulnerabilities • Kubernetes CVE-2017-1002101, CVE-2017-1002102 – A malicious container was allowed to access the host filesystem via vulnerabilities related to volumes • Kubernetes CVE-2018-1002105 – A malicious API call could be used to gain cluster-admin (and hence the root privileges on the nodes) • Git CVE-2018-11235 (affected Kubernetes gitRepo volumes) – A malicious repo could execute an arbitrary binary as the root when it was cloned 9
  • 10. Other vulnerabilities • Kubernetes CVE-2017-1002101, CVE-2017-1002102 – A malicious container was allowed to access the host filesystem via vulnerabilities related to volumes • Kubernetes CVE-2018-1002105 – A malicious API call could be used to gain cluster-admin (and hence the root privileges on the nodes) • Git CVE-2018-11235 (affected Kubernetes gitRepo volumes) – A malicious repo could execute an arbitrary binary as the root when it was cloned 10 --userns-remap might not be effective
  • 11. Play-with-Docker.com vulnerability • Play-with-Docker.com: Online Docker playground, implemented using Docker-in-Docker with custom AppArmor profiles • Malicious kernel module was loadable due to AppArmor misconfiguration (revealed on Jan 14, 2019) – Not really an issue of Docker 11https://www.cyberark.com/threat-research-blog/how-i-hacked-play-with-docker-and-remotely-ran-code-on-the-host/
  • 12. What Rootless Containers can • Prohibit accessing files owned by other users • Prohibit modifying firmware and kernel (→ undetectable malware) • Prohibit other privileged operations like ARP spoofing, rebooting,... 12
  • 13. What Rootless Containers cannot • If a container was broke out, the attacker still might be able to – Mine cryptocurrencies – Springboard-attack to other hosts • Not effective for kernel / VM/ HW vulns – But we could use gVisor together for mitigating some of them 13
  • 15. User Namespaces • User namespaces allow non-root users to pretend to be the root • Root-in-UserNS can have “fake” UID 0 and also create other namespaces (MountNS, NetNS..) 15
  • 16. User Namespaces 16 $ id -u 1001 $ ls -ln -rw-rw---- 1 1001 1001 42 May 1 12:00 foo $ docker-rootless run -v $(pwd):/mnt -it alpine / # id -u 0 / # ls -ln /mnt -rw-rw---- 1 0 0 42 May 1 12:00 foo
  • 17. User Namespaces 17 $ docker-rootless run -v /:/host -it alpine / # ls -ln /host/dev/sda brw-rw---- 1 65534 65534 8, 0 May 1 12:00 /host/dev/sda / # cat /host/dev/sda cat: can’t open ‘/host/dev/sda’: Permission denied
  • 18. Sub-users (and sub-groups) • Put users in your user account so you can be a user while you are being a user • Sub-users are used as non-root users in a container – USER in Dockerfile – docker run --user 18
  • 19. Sub-users (and sub-groups) • If /etc/subuid contains “1001:100000:65536” • Having 65536 sub-users should be enough for most containers 19 0 1001 100000 165535 232 Host UserNS primary user sub-users start sub-users length 0 1 65536
  • 20. Sub-users (and sub-groups) • Sub-users are configured via SUID binaries /usr/bin/{newuidmap, newgidmap} • SETUID binary can be dangerous; newuidmap & newgidmap had two CVEs so far: – CVE-2016-6252 (CVSS v3: 7.8): integer overflow issue – CVE-2018-7169 (CVSS v3: 5.3): supplementary GID issue 20
  • 21. Sub-users (and sub-groups) • Also hard to maintain sub-users – LDAP / AD – Nesting user namespaces might need huge number of sub-users 21
  • 22. Sub-users (and sub-groups) • Alternative way: Single-mapping mode • Does not require newuidmap/newgidmap • Ptrace and/or Seccomp can be used for intercepting syscalls to emulate sub-users – user.rootlesscontainers xattr can be used for chown emulation 22
  • 23. Network Namespaces • An unprivileged user can create network namespaces along with user namespaces • With network namespaces, the user can – isolate abstract (pathless) UNIX sockets • important to prevent container breakout – create iptables rules – set up overlay networking with VXLAN – run tcpdump – ... 23
  • 24. Network Namespaces • But an unprivileged user cannot set up veth pairs across the host and namespaces, i.e. No internet connection 24 The Internet Host UserNS + NetNS
  • 25. Network Namespaces 25 • lxc-user-nic SUID binary allows unprivileged users to create veth, but we are not huge fun of SUID binaries • Our approach: use completely unprivileged usermode network (“Slirp”) with a TAP device TAP “Slirp” TAPFD send fd as a SCM_RIGHTS cmsg The Internet Host UserNS + NetNS
  • 26. Network Namespaces Benchmark of several “Slirp” implementations: • slirp4netns (our own implementation based on QEMU Slirp) is the fastest because it avoids copying packets across the namespaces MTU=1500 MTU=4000 MTU=16384 MTU=65520 vde_plug 763 Mbps Unsupported Unsupported Unsupported VPNKit 514 Mbps 526 Mbps 540 Mbps Unsupported slirp4netns 1.07 Gbps 2.78 Gbps 4.55 Gbps 9.21 Gbps cf. rootful veth 52.1 Gbps 45.4 Gbps 43.6 Gbps 51.5 Gbps Benchmark: iperf3 (netns -> host), measured on Travis CI. See rootless-containers/rootlesskit#12 26
  • 27. Multi-node networking • Flannel VXLAN is known to work – Encapsulates Ethernet packets in UDP packets – Provides L2 connectivity across rootless containers on different nodes • Other protocols should work as well, except ones that require access to raw Ethernet
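The encapsulation step can be sketched as follows. This is a minimal Python illustration of the 8-byte VXLAN header defined in RFC 7348, not Flannel's implementation. The function names and the sample VNI are hypothetical, and sending the result as a UDP payload is left out.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348


def vxlan_encapsulate(vni, ethernet_frame):
    """Prefix an Ethernet frame with a VXLAN header; the result is a UDP payload."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) + 24 reserved bits. Word 1: VNI (24 bits) + 8 reserved bits.
    header = struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)
    return header + ethernet_frame


def vxlan_decapsulate(payload):
    """Return (vni, ethernet_frame) from a VXLAN UDP payload."""
    if len(payload) < 8:
        raise ValueError("payload shorter than a VXLAN header")
    word0, word1 = struct.unpack("!II", payload[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI flag not set")
    return word1 >> 8, payload[8:]


if __name__ == "__main__":
    packet = vxlan_encapsulate(1, b"\xff" * 14)  # dummy 14-byte Ethernet header
    print(packet.hex(), vxlan_decapsulate(packet)[0])
```

Because the outer packet is ordinary UDP, an unprivileged process inside a network namespace can send it without any access to raw Ethernet on the host.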
  • 28. Snapshotting • OverlayFS is currently unavailable in UserNS (except on Ubuntu kernels) • FUSE-OverlayFS can be used instead with kernel 4.18+ • XFS reflink can also be used to deduplicate files (but it is slow)
  • 29. Cgroup • pam_cgfs can be used for delegating permissions to unprivileged users, but it is considered insecure by the systemd folks https://github.com/containers/libpod/issues/1429 • cgroup2 provides proper support for delegation, but is not adopted by OCI at the moment
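Whether a host exposes cgroup2 can be checked with a small Python sketch. It assumes the common convention that a cgroup2 (unified) hierarchy has a `cgroup.controllers` file at its root, and that the hierarchy is mounted at /sys/fs/cgroup. The function names are hypothetical, and the directory is a parameter so the check can be pointed at any path.

```python
import os


def is_cgroup2(root="/sys/fs/cgroup"):
    """Return True if `root` looks like the root of a cgroup2 hierarchy."""
    return os.path.isfile(os.path.join(root, "cgroup.controllers"))


def available_controllers(cgroup_dir):
    """List the controllers available in a cgroup2 directory (e.g. a delegated subtree)."""
    with open(os.path.join(cgroup_dir, "cgroup.controllers")) as f:
        return f.read().split()


if __name__ == "__main__":
    if is_cgroup2():
        print("cgroup2:", available_controllers("/sys/fs/cgroup"))
    else:
        print("cgroup2 hierarchy not found at /sys/fs/cgroup")
```

For a rootless runtime, the interesting directory is the subtree delegated to the unprivileged user rather than the hierarchy root.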
  • 30. Rootless Containers in Containers • Urgent demand for building images on Kubernetes clusters • Seccomp and AppArmor need to be disabled for the parent containers • To allow the children to mount procfs (pid-namespaced), maskedPaths and readonlyPaths for /proc/* for the parent need to be removed (weird!) – Same applies to sysfs (net-namespaced)
  • 31. Rootless Containers in Containers • So --privileged has typically been required anyway :( – Or at least --security-opt {seccomp,apparmor}=unconfined • Docker 19.03 supports --security-opt systempaths=unconfined for allowing procfs & sysfs mounts (Kube: securityContext.procMount, but no sysMount yet) – Make sure to lock the root account in the container! (passwd -l root, Alpine CVE-2019-5021)
  • 33. Adoption status: runtimes
      Feature | Docker v19.03 / containerd / runc | Podman (≈ CRI-O) / crun | LXC | Singularity
      NetNS isolation with Internet connectivity | VPNKit, slirp4netns, lxc-user-nic (SUID) | slirp4netns | lxc-user-nic (SUID) | No support
      Supports FUSE-OverlayFS | No | Yes | No | No
      Cgroup | No | Limited support for cgroup2 | pam_cgfs | No
  • 34. Adoption status: runtimes::GPU • nvidia-container-runtime is known to work • Cgroup needs to be disabled manually • A rootful NVIDIA container needs to be executed on every system startup • Other devices such as FPGAs should probably work as well (untested)
  • 35. Adoption status: runtimes::single-mapping mode • udocker does not need subuid configuration, as it can emulate sub-users with ptrace (based on PRoot) – but no persistent chown • runROOTLESS (not to be confused with upstream rootless runc) supports persistent chown as well, using the user.rootlesscontainers xattr – the xattr value is a pair of UID and GID in protobuf encoding – the xattr convention is compatible with umoci
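The xattr value can be illustrated with a hand-rolled protobuf encoder in Python. This is a sketch under a stated assumption: the message is taken to have `uid` as field 1 and `gid` as field 2, both varint-encoded unsigned integers. That layout is an assumption for illustration, so check the rootlesscontainers protobuf definition before relying on it. The function names are hypothetical.

```python
def _varint(value):
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_owner(uid, gid):
    """Encode (uid, gid); assumed layout: field 1 = uid, field 2 = gid, wire type varint."""
    return b"\x08" + _varint(uid) + b"\x10" + _varint(gid)


def decode_owner(blob):
    """Decode a blob produced by encode_owner; missing fields default to 0."""
    fields = {}
    i = 0
    while i < len(blob):
        tag = blob[i]
        i += 1
        if tag & 0x07 != 0:
            raise ValueError("only varint fields are expected")
        value = shift = 0
        while True:
            byte = blob[i]
            i += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        fields[tag >> 3] = value
    return fields.get(1, 0), fields.get(2, 0)


if __name__ == "__main__":
    blob = encode_owner(1000, 1000)
    print(blob.hex(), decode_owner(blob))
```

On Linux the resulting bytes could be stored with `os.setxattr(path, "user.rootlesscontainers", blob)`, so the emulated owner survives across runs even though the file stays owned by the unprivileged user on disk.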
  • 36. Adoption status: runtimes::single-mapping mode • Ptrace is slow https://github.com/rootless-containers/runrootless/issues/14 • seccomp can be used for acceleration but is hard to implement correctly
  • 37. Adoption status: image builders • BuildKit / img / Buildah support rootless mode – Works in containers as well as on the host – Does not need --privileged, but Seccomp and AppArmor need to be disabled
  • 38. Adoption status: image builders • Similar but different work: Kaniko & Makisu – Rootful – But no need to disable seccomp and AppArmor, because they don’t create containers for RUN instructions in Dockerfile
  • 39. Adoption status: Kubernetes • The Usernetes project provides patches for rootless Kubernetes, but they are not proposed to the upstream yet – Supports all major CRI runtimes: dockershim, containerd, CRI-O – Flannel VXLAN is known to work – Lack of cgroup might be a huge concern • But Usernetes is already integrated into k3s! (5 less than k8s)
      $ k3s server --rootless
  • 40. You can rootlessify your own project easily! • RootlessKit does almost everything needed to rootlessify your container project (or almost any rootful app) – Creates UserNS with sub-users and sub-groups – Creates MountNS with writable /etc, /run but without chroot – Creates NetNS with VPNKit/slirp4netns/lxc-user-nic – Provides REST API on UNIX socket for port forwarding management
  • 41. You can rootlessify your own project easily!
      $ rootlesskit --net=slirp4netns --copy-up=/etc --port-driver=builtin bash
      # id -u
      0
      # touch /etc/here-is-writable-tmpfs
      # ip a
      ...
      2: tap0: <BROADCAST,MULTICAST,UP,LOWER_UP>
          inet 10.0.2.100/24 scope global tap0
      ...
      # rootlessctl add-ports 0.0.0.0:8080:80/tcp
  • 42. You can rootlessify your own project easily! • With RootlessKit, you just need to work on disabling cgroup stuff, sysctl stuff, and changing the data path from /var/lib to /home • Used by Docker, BuildKit, k3s
  • 44. Kernel has vulns • UserNS tends to have privilege-escalation vulns – CVE-2013-1858: UserNS + CLONE_FS – CVE-2014-4014: UserNS + chmod – CVE-2015-1328: UserNS + OverlayFS (Ubuntu-only), so rootless OverlayFS is still not merged upstream – CVE-2018-18955: UserNS + complex ID mapping
  • 45. Kernel has vulns • A bunch of code paths can hang the kernel – e.g. CVE-2018-7191 (published today): creating a tap device with an illegal name – And more, see https://medium.com/@jain.sm/security-challenges-with-kubernetes-818fad4a89f2 • Unlimited resources, e.g. – Pending signals – Max user processes – Max FDs per user (see the same URL above)
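The limits that do exist for the current process can be inspected with a short Python sketch using the standard `resource` module. The mapping of slide items to rlimits is an approximation and an assumption of this sketch. In particular, RLIMIT_NOFILE limits open files per process, not per user. The labels and the function name are hypothetical.

```python
import resource

# Closest rlimits to the resources listed on the slide (an approximation).
LIMITS = {
    "pending signals": "RLIMIT_SIGPENDING",
    "user processes": "RLIMIT_NPROC",
    "open files per process": "RLIMIT_NOFILE",
}


def current_limits():
    """Return {label: (soft, hard)} for the rlimits available on this platform."""
    report = {}
    for label, name in LIMITS.items():
        const = getattr(resource, name, None)  # some constants are Linux-only
        if const is None:
            continue
        report[label] = resource.getrlimit(const)
    return report


if __name__ == "__main__":
    for label, (soft, hard) in current_limits().items():
        print(f"{label}: soft={soft} hard={hard}")
```

A value equal to `resource.RLIM_INFINITY` means the resource is unlimited for this process.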
  • 46. Kernel has vulns • So I’ve never suggested using rootless containers for real multi-tenancy ¯_(ツ)_/¯
  • 47. Kernel has vulns • gVisor might be able to mitigate them, but with significant overhead and syscall incompatibility • UML (20 years old, still alive!) is almost compatible with real Linux, but it even lacks support for SMP • linuxd: similar to UML but accelerated with host kernel patches – Still no public code https://schd.ws/hosted_files/ossna18/db/Containerize%20Linux%20Kernel.pdf
  • 48. Cgroups • cgroup2 is not adopted in OCI • crun is trying to support cgroup2 without changing the OCI spec
  • 49. Mount • Only supports: – tmpfs – bind – procfs (PID-namespaced) – sysfs (net-namespaced) – FUSE (since kernel 4.18) – Overlay (Ubuntu only) • No support for mounting any block devices (even loopback devices)
  • 50. Landlock • landlock: unprivileged sandbox LSM • Not merged in the upstream kernel, but promising as an AppArmor alternative
  • 51. LDAP / Active Directory • /etc/sub{u,g}id configuration is painful for LDAP/AD • Alternatively, implementing an NSS module is under discussion, but there is no code yet https://github.com/shadow-maint/shadow/issues/154
  • 52. Single-mapping mode • runROOTLESS / PRoot could be accelerated with seccomp, but the implementation is broken • Kernel 5.0 seccomp could be used for getting rid of ptrace completely
  • 54. containerd dev plan • Implement FUSE-OverlayFS snapshotter plugin – Probably in a separate repo – Should not be difficult • Support cgroup2 – Probably we want to wait for OCI Runtime Spec and runc to be revised – But we can also consider beginning cgroup2 support right now with crun
  • 55. containerd dev plan • Support running containerd inside gVisor – So as to allow running rootless containers in a container without disabling seccomp & AppArmor – And to mitigate potential kernel vulns – Currently MountNS is not working https://github.com/google/gvisor/issues/221