An Exploitation Chain To Breakout of VMware ESXi
abilities used in the first virtual machine escape of VMware ESXi.

3. State-of-the-art and reusable exploitation techniques for manipulating memory layout, constructing an arbitrary-address-write primitive, and achieving persistent exploitation in VMware ESXi.

Threat Model. In this research, we assume that an adversary can execute arbitrary code in the user space and kernel space on the guest OS.

2 Background

2.1 The Architecture of the ESXi

As Figure 1 illustrates, VMware ESXi integrates its own operating system (OS), called VMkernel, which provides resource scheduling, I/O stacks, network stacks, storage stacks, and device drivers; all processes run on top of it. VMkernel also implements a simple in-memory file system to hold staged patches, configurations, and system logs.

[Figure 1 diagram. Labels: Hostd and its plugins; VMX (sandboxed user world) with hypercall handlers 0..N and virtual hardware; guest virtual machine; VMM; user world API; VMKernel (resource scheduling, I/O stack, VMFS, drivers); physical hardware.]

Figure 1: The architecture of VMware ESXi [5]. VMKernel, a POSIX-like OS, is designed to multiplex the virtual machines and provides some core fundamentals such as resource scheduling, I/O stacks, file system (VMFS), and drivers. The guest machines can communicate with the host through hypercalls, which include normal hypercalls such as VM-Exit and special hypercalls such as backdoor and VMCI. The term "user world" refers to a process running in the VMkernel. A significant user world is "VMX," a Ring-3 process, because it contains RPC handlers and virtual hardware. (Note that the VMX process and some other processes are sandboxed by the vSphere sandbox.) The virtual machine monitor (VMM) is a process that provides the execution environment for guest virtual machines.

To communicate with the hypervisor, the guest taps into the VMM through VM-Exit in most circumstances. Notably, VMware also introduced another hypercall mechanism called backdoor. Interestingly, although it is named "backdoor," it is merely a communication channel between the guest and the hypervisor. It has also been widely applied in VMware products, e.g., some functionalities in open-vm-tools [18], such as drag-and-drop and copy-and-paste, are implemented on top of the backdoor mechanism. As Figure 2 illustrates, the guest machine that runs in Ring-3 of a protected-mode OS executes the in or out instruction with a specific port, which raises a corresponding exception. Normally, this results in the crash of the process. However, in a VMware virtual machine, the hypervisor captures the exception and dispatches it to a proper handler on the host OS. Therefore, there are no exceptions in the end. Compared to a normal hypercall, which usually requires CPL ⩽ IOPL, some backdoor requests can be issued from Ring-3 directly. Consequently, a channel for the communication between guest and host can be established. For example, as Figure 3 delineates, backdoor can be leveraged to send data from the guest to the host. By putting the required parameters into specific registers, like a simple function call, a process running in protected mode can invoke the RPC directly. In the sample, we first create a new "RPCI" channel and retrieve the channel number. Then, we can send data to the host through this channel. Based on this mechanism, some high-level and complicated protocols could be developed. Furthermore, in this paper, we also use this feature to reliably manipulate the memory layout. The details are discussed in §3.

[Figure 2 diagram. Labels: virtual machine Ring-3 and Ring-0; In/Out (special ports); exceptions; captured by the hypervisor; return to user-land.]

Figure 2: The backdoor remote procedure call. Under the default I/O privilege level, a Ring-3 program should not be able to issue I/O operations. As a result, the in or out instruction should cause the Ring-3 process to fault and crash. However, in this scenario, the hypervisor captures the fault and handles it, supporting the backdoor RPC mechanism.

2.2 Virtual Machine Escape

VM escape is the process of breaking out of a virtual machine from a guest OS, so that the guest can execute arbitrary code with the privileges of the host operating system [21]. Specifically, ESXi completely isolates the guest operating systems from each other by leveraging hardware virtualization technologies, such as Intel VT or AMD-V. Any privileged instructions from the guest operating systems will be captured
and sanitized by the hypervisor. In normal circumstances, the guest cannot execute code or affect security-critical behaviors, such as the system configuration and network connections, of other guests or the host. By exploiting the vulnerabilities in ESXi, adversaries can cross the security boundary between the guest and the host to execute arbitrary code on the host operating system (i.e., virtual machine escape in ESXi).

2.3 Security Analysis of the ESXi

Attack Surfaces. The virtualization layer is the most significant part of the lifetime of the guest operating system. Any interaction between the guest OS and the hypervisor is a potential attack vector that could be exploited by adversaries. Generally, the guest can communicate with the host of ESXi in several ways:

1. VMKernel and Core Virtualization Infrastructures: There are some fundamentals such as VM-Exit handlers, memory management, and memory virtualization infrastructures offered by the hypervisor running in the kernel space. An adversary who attacks the hypervisor successfully could take over the kernel of the host operating system directly.

2. Virtual Hardware: To support I/O virtualization, VMware designed a batch of virtual hardware and devices. Most of them are integrated into the VMX process of ESXi [3, 15]. The guest OS can communicate with the virtual hardware through port I/O (PIO) or memory-mapped I/O (MMIO). Table 1 shows some significant virtual hardware integrated in VMware ESXi. Most of them require root privilege in the guest operating system to be opened.

3. RPC Channels: VMware developed some RPC protocols, such as backdoor and the VMware Virtual Machine Communication Interface (VMCI), to accelerate the communication between the guest OS and the host OS. They have been used in VMware's virtual machines for decades, because they do not rely on hardware virtualization extensions. Thus, they also expose some virtual machine escape attack surfaces. Some RPC handlers exist in the VMX process. By exploiting the bugs in these handlers, adversaries can escape from the guest OS.

    Interface   Category             Privilege
    SVGA2D      Virtual Graphic      ROOT
    SVGA3D      Virtual Graphic      ROOT
    e1000       Virtual Ethernet     ROOT
    e1000e      Virtual Ethernet     ROOT
    VMXNET3     Virtual Ethernet     ROOT
    xHCI        Virtual USB          ROOT
    uHCI        Virtual USB          ROOT
    aHCI        Virtual SATA         ROOT
    Lsilogic    Virtual SCSI         ROOT
    Printer     Virtual COM Device   ROOT

Table 1: The virtual hardware has been demonstrated to affect the interaction of guest-to-host. The privilege field indicates the requirement to open the device in the guest OS.

Common Mitigations. The host maintains a POSIX-like operating system, and some Linux-like security mitigations are integrated into the host.

1. ASLR: Address Space Layout Randomization (ASLR) [20] was introduced to mitigate the exploitation of memory corruption vulnerabilities. In ESXi, the addresses encompassing the program, stack, heap, and libraries of the user space binaries are randomized. In ESXi's VMX process, which contains most of the virtual hardware, some hardware such as the network card runs in Ring-3. Therefore, an attacker who wants to attack virtual hardware or other Ring-3 services in ESXi first has to leak code pointers (i.e., information leakage) to further hijack the control flow of ESXi.

2. NX/DEP: This option is referred to as Data Execution Prevention (DEP) or No-Execute (NX). It works with the processor to help prevent buffer overflow attacks by blocking code execution from memory that is marked as non-executable [11]. In ESXi, when a process tries to execute shellcode on the stack, heap, or data segments, it crashes.

3. Compact VMX: Compared with VMware Workstation, the type-2 hypervisor developed by VMware, ESXi's

    // Creating a new channel
    asm(
        "movl $0x564d5868,%%eax\n\t"   // magic bytes: 'VMXh'
        "movl $0xc9435052,%%ebx\n\t"   // magic bytes for RPCI
        "movl $0x1e,%%ecx\n\t"         // MESSAGE_TYPE_OPEN
        "movl $0x5658,%%edx\n\t"       // special I/O port
        "out %%eax,%%dx\n\t"
        "movl %%edx, %%eax\n\t"        // ret channel number (EDX(HI))
        "movl %%ecx, %%ebx\n\t"        // success or failure
        ...
    );

    // Sending data through a specific channel
    asm(
        "movl %0, %%edx\n\t"           // channel number (EDX(HI))
        "movl $0x41414141,%%ebx\n\t"   // 4 bytes to be sent
        "movl $0x564d5868,%%eax\n\t"   // magic bytes: 'VMXh'
        "movl $0x0002001e,%%ecx\n\t"   // MESSAGE_TYPE_SEND
        "movw $0x5658,%%dx\n\t"        // special I/O port
        "out %%eax,%%dx\n\t"
        "movl %%ecx, %%eax"            // success or failure
        ...
    );

Figure 3: By using "RPCI," a sort of backdoor-based mechanism, guest machines can communicate with the host OS. By reading or writing a special I/O port (0x5658/0x5659), a process running in Ring-3 can invoke the RPC directly.
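The register convention annotated in Figure 3 can be sketched as plain data, without executing any port I/O. The constants below are taken from the figure; the struct `bdoor_regs`, the helper names, and the assumption that the channel number is a 16-bit value carried in the upper half of EDX (the figure only annotates it as "EDX(HI)") are ours, for illustration only.

```c
#include <stdint.h>

#define BDOOR_MAGIC    0x564d5868u /* 'VMXh', loaded into EAX */
#define BDOOR_PORT     0x5658u     /* special I/O port, loaded into DX */
#define RPCI_MAGIC     0xc9435052u /* magic bytes for RPCI, loaded into EBX */
#define BDOOR_MSG_OPEN 0x0000001eu /* MESSAGE_TYPE_OPEN, loaded into ECX */
#define BDOOR_MSG_SEND 0x0002001eu /* MESSAGE_TYPE_SEND, loaded into ECX */

/* Hypothetical container for the four registers a request uses. */
typedef struct {
    uint32_t eax, ebx, ecx, edx;
} bdoor_regs;

/* Register values for opening a new RPCI channel. */
bdoor_regs bdoor_open_request(void)
{
    bdoor_regs r = { BDOOR_MAGIC, RPCI_MAGIC, BDOOR_MSG_OPEN, BDOOR_PORT };
    return r;
}

/* Register values for sending 4 bytes on a channel.
 * Assumption: the channel number occupies the high 16 bits of EDX. */
bdoor_regs bdoor_send_request(uint16_t channel, uint32_t payload)
{
    bdoor_regs r;
    r.eax = BDOOR_MAGIC;
    r.ebx = payload;
    r.ecx = BDOOR_MSG_SEND;
    r.edx = ((uint32_t)channel << 16) | BDOOR_PORT;
    return r;
}

/* Extract the channel number from the EDX value returned by an open. */
uint16_t bdoor_channel_from_reply(uint32_t edx_reply)
{
    return (uint16_t)(edx_reply >> 16);
}
```

Keeping the request as data separates the protocol layout from the privileged instruction itself, so the layout can be inspected on any machine; only the final in/out step needs to run inside a VMware guest.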
    # Rules applicable for all VMs

    -s genericSys grant
    -s ioctlSys grant
    -s vsiReadSys grant
    ...

    -c unix_socket_create grant
    -c unix_stream_socket_bind grant
    -c unix_dgram_socket_bind grant
    ...

    -p inet_socket_bind all grant
    -p inet_socket_connect loopback grant
    -p inet_socket_connect nonloopback grant
    ...

    -d tpm2emuObj tpm2emuDom file_exec grant

    -r /var/run rw
    -r /var/lock rw
    ...

Figure 4: A sample rule for global VMs. It grants what system calls a VMX process is allowed to call, what network connections a VMX process is allowed to establish, and what directories a VMX process is allowed to read or write.

[Figure 5: listing of vmxnet3_reg_cmd() and dma_memory_create()]

    void __usercall vmxnet3_reg_cmd(vmxnet3_class *a1,
        __int64 read_or_write, _DWORD *data, __int64 a4, __int64 a5)
    {
      ...
      case 4: // VMXNET3_CMD_UPDATE_MAC_FILTERS
        if ( a1->field_1A20 ) {
    ⋆     dma_memory_create(a1->driver_shared_addr + 8, 0x2B0ui64, 1,
    ⋆                       a1->state->field_B8, &page);
          vmxnet3_cmd_update_mac_filters(v6, &page, a5);
    ⋆     destruct_page_struct(&page);
          sub_14017CB30(v6);
        }
        break;
      ...
    }

    char __fastcall dma_memory_create(unsigned __int64 addr, unsigned
        __int64 size, int a3, int a4, page_struct *page)
    {
      unsigned __int64 v5;

      v5 = *(qword_140DAA810 + 12160);
      // check the addr
    ⋆ if ( addr > v5 || !size || size > v5 - addr + 1 )
    ⋆   return 0;
      set_page_struct(addr, size, a3, a4, page);
      return 1;
    }
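The range check in the dma_memory_create() listing, and the fact that its caller discards the return value, can be illustrated with a small sketch. Everything below is a hypothetical stand-in written for illustration (the names `range_ok`, `page_desc`, `page_create`, and `handle_command`, and the use of malloc for the backing buffer, are assumptions, not the VMX code): it reproduces the same comparison and shows a caller that zero-initializes its stack descriptor and tests the result before cleanup.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the page structure; field names are assumed. */
typedef struct {
    int   ready;
    void *buf;
} page_desc;

/* Same comparison as the listing: accept only a non-empty range that
 * starts at or below `limit` and does not extend past it. */
int range_ok(uint64_t addr, uint64_t size, uint64_t limit)
{
    if (addr > limit || size == 0 || size > limit - addr + 1)
        return 0;
    return 1;
}

/* Initializes *page only on success; on failure *page is left untouched,
 * which mirrors the early "return 0" path of the listing. */
int page_create(uint64_t addr, uint64_t size, uint64_t limit, page_desc *page)
{
    if (!range_ok(addr, size, limit))
        return 0;
    page->buf = malloc((size_t)size);
    page->ready = (page->buf != NULL);
    return page->ready;
}

/* A caller that checks the result before releasing anything.
 * Returns 1 if the range was accepted and released, 0 otherwise. */
int handle_command(uint64_t addr, uint64_t size, uint64_t limit)
{
    page_desc page;
    memset(&page, 0, sizeof page);  /* no stale stack bytes in the descriptor */
    if (!page_create(addr, size, limit, &page))
        return 0;                   /* nothing was allocated, nothing to free */
    free(page.buf);
    return 1;
}
```

For example, `handle_command(0x1000, 0x2B0, 0xFFFF)` accepts the range, while `handle_command(0x10000, 0x2B0, 0xFFFF)` returns early without touching the descriptor's buffer pointer.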
In the command VMXNET3_CMD_UPDATE_MAC_FILTERS, the dma_memory_create() function creates a page structure used to read/write memory between guest and host. The destruct_page_struct() function is responsible for releasing the memory of the page structure.

According to Figure 5, the page structure is allocated and initialized in the function set_page_struct(). At the beginning, the dma_memory_create() function checks the validity of the physical address given by the guest. If the guest provides an invalid physical address, the function returns immediately without initializing the page structure. However, after the call to dma_memory_create(), the VMXNET3_CMD_UPDATE_MAC_FILTERS handler fails to check whether the allocation of the page structure was successful, resulting in an uninitialized use in the destruct_page_struct() function.

Technically, this bug can also be turned into an information leakage bug. However, to improve the stability of the exploitation, we decided to chain a separate information leakage bug into the exploitation.

CVE-2018-6982. This bug is also caused by an uninitialized stack variable in the memory of ESXi, and we utilized it to independently retrieve memory address information from the host. There is another command handler in vmxnet3_reg_cmd() called vmxnet3_cmd_get_coalesce(). Figure 6 depicts its core logic. The get_args() function is used to retrieve some data from a memory region of the VMX process. The sanity_check() function verifies that v19 satisfies v19 ⩽ 16. Also, the write_back_to_guest() function will write 16 bytes of data into the guest context. Unfortunately, only 8 bytes of them (v20) are initialized.

        ...
          // first 8-byte of v20 is initialized, but 16-byte is read
          // (HIDWORD(v18) == 16)
    ⋆     write_back_to_guest(v19, &v20, HIDWORD(v18), 0, *(_DWORD *)
    ⋆                         (v14 + 184));
          return 1;
        }
        ...
    }

Figure 6: A code snippet of vmxnet3_cmd_get_coalesce(). The get_args() function reads a memory region of the VMX process. Ultimately, v18 and v19 are controllable. The second parameter of write_back_to_guest() indicates the source buffer, and the third one indicates the size.

3.2 Exploitation

A significant challenge of exploiting uninitialized use bugs is how to control the uninitialized variable. In this section, we illustrate the entire process of turning the uninitialized use bug into arbitrary code execution and how we overcome the challenges.

1) Arbitrary address free primitive. As Figure 7 delineates, first, we leak some addresses to break ASLR through the information leakage bug. Next, the destruct_page_struct() function (Figure 8) frees a field of the uninitialized stack memory. Hence, after filling a pointer into the uninitialized memory, an arbitrary-address-free primitive can be constructed.

2) Arbitrary address write primitive. As Figure 11 illustrates, the metadata of the Backdoor-RPC channels exists in the data segment. Therefore, we use this feature to construct an arbitrary-address-write primitive. First, we open several Backdoor-RPC channels; thus, some metadata structures of the channels in the data segment are activated. Second, we fake a glibc fast-bin chunk on them to perform the House of Spirit Attack.

Specifically, after leaking the address of the data segment, we calculate the addresses of the metadata for the backdoor, and put them into the uninitialized stack memory using the function handle_port_io(). For example, in Figure 9, when the size of the data does not exceed 0x8000, the function puts all of the data onto the stack. Next, we use the arbitrary-address-free primitive to free the fake fast-bin chunk.

House of Spirit Attack. Because ESXi uses a variant of glibc to maintain the Ring-3 heap, we decided to fake a fast-bin chunk on the global metadata of the Backdoor-RPC channels, i.e., the House of Spirit Attack of glibc [1,13,22]. However, glibc has several integrity checks to mitigate memory corruption attacks. To bypass them, we need to construct the fake chunk and pick the size properly.

After investigating, as Figure 10 illustrates, we determined some constraints in the free() function of glibc that need to be satisfied:
1. The ISMMAP bit of the fake chunk is 0.

2. The fake chunk's address is aligned.

3. The size of the fake chunk is 32 bytes to 128 bytes and aligned.

4. For the next chunk's size θ: 2 * SIZE_SZ < θ < av->system_mem (Figure 10 rejects θ ⩽ 2 * SIZE_SZ and θ ⩾ av->system_mem).

5. The first chunk in the fast-bin is not the fake chunk.

Then, we reallocate the fake chunk by leveraging other Backdoor-RPC channel operations, i.e., when a new channel is opened, the channel allocates a new buffer with a controllable length, which is pointed to by the data field in the metadata of the channel. This is a flexible and reusable trick to manipulate the heap of ESXi. Finally, we overwrite the next data pointer

[Figure 7 diagram: exploitation workflow. Recoverable steps: 1. retrieve some information as fingerprints to construct the exp dynamically; 2. trigger the uninitialized stack memory read; 3. use the leaked info to calculate the addr of the .data segment; 4. open several RPC channels and fake a fast chunk; 5. put the address of the fake fast chunk (on .data) into the stack through handle_port_io(); 6. free the fake chunk using the arbitrary address free primitive; ...; 12. stack pivot; 13. ROP chain on stack; 14. shellcodes. Other labels: channel 1 ... channel N+1 metadata on the .data segment (state, create time, data, len); uninitialized addr; fake chunk addr; mmap RWX memory.]

    void __fastcall destruct_page_struct(page_struct *a1)
    {
      int v1; // eax
      page_struct *v2; // rbx
      unsigned int v3; // edi
      __int64 v4; // rbp
      __int64 v5; // rsi
      __int64 v6; // r12
      __int64 v7; // rax

      v1 = a1->ready;
      v2 = a1;
      if ( v1 == 1 )
      {...}
      else
      {
        v3 = 0;
        if ( v1 )
        {
          v4 = 0i64;
          v5 = 0i64;
          do
          {
            ...
          }
          while ( v3 < v2->ready );
        }
        free(v2->field_18); // free the pointer on stack
      }
    }

Figure 8: A code snippet of destruct_page_struct(). We use it to free arbitrary addresses.

    void __usercall handle_port_io(__int64 a1, __int64 a2, __int64 a3)
    { ...
      char *v11; // rsi
      ...
      __int64 v35; // [rsp+A0h] [rbp-8038h]
      __int64 v36; // [rsp+80B0h] [rbp-28h]

      v3 = *(a1 + 4);
      v4 = *(a1 + 13);
      read_or_write = *(a1 + 48);
      ...
      if ( *(a1 + 60) && (v10 = *(a1 + 52) << 12, v10 > 0x8000) )
        v11 = malloc_heap_memory(v10); // copy the data into heap
      else
        v11 = &v35;
      if ( read_or_write & 1 )
      { if ( *(v8 + 60) )
        { ...
          v15 = v11;
          do
          { ...
            memcpy(v15, v18, v17); // copy the data into stack
            ...

Figure 9: A code snippet of handle_port_io(). We use it to spray the stack.
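The five constraints on the fake fast-bin chunk can be sketched as a single predicate. This is a minimal illustration, not glibc code: the function name `fake_chunk_passes` and its parameters are hypothetical, SIZE_SZ is fixed at 8 under the assumption of a 64-bit target, and the 32-to-128-byte range is taken from the constraint list above rather than from a particular glibc build.

```c
#include <stdint.h>

#define SIZE_SZ      8u             /* assumption: 64-bit target */
#define MALLOC_ALIGN (2u * SIZE_SZ) /* chunk alignment */
#define IS_MMAPPED   0x2u           /* mmap flag bit in the size field */
#define FLAG_MASK    0x7u           /* low three bits of the size field are flags */

/* Returns 1 if a fake chunk would satisfy the five listed constraints.
 * chunk_addr      : address of the fake chunk header
 * size_field      : raw size field of the fake chunk (with flag bits)
 * next_size_field : raw size field found at chunk_addr + size
 * system_mem      : the arena's system_mem value
 * fastbin_head    : current first chunk of the matching fast-bin (0 if empty) */
int fake_chunk_passes(uint64_t chunk_addr, uint64_t size_field,
                      uint64_t next_size_field, uint64_t system_mem,
                      uint64_t fastbin_head)
{
    uint64_t size = size_field & ~(uint64_t)FLAG_MASK;
    uint64_t next = next_size_field & ~(uint64_t)FLAG_MASK;

    if (size_field & IS_MMAPPED)                  /* 1. mmap bit must be 0 */
        return 0;
    if (chunk_addr % MALLOC_ALIGN)                /* 2. address is aligned */
        return 0;
    if (size < 32 || size > 128 || size % MALLOC_ALIGN)
        return 0;                                 /* 3. fast-bin sized and aligned */
    if (next_size_field <= 2 * SIZE_SZ || next >= system_mem)
        return 0;                                 /* 4. next chunk size is sane */
    if (fastbin_head == chunk_addr)               /* 5. not already the bin head */
        return 0;
    return 1;
}
```

For instance, a 0x40-byte fake chunk at an aligned address whose following size field is 0x21 passes, while the same chunk with the mmap bit set (size field 0x42) is rejected by the first check.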
    void public_fREe(Void_t* mem)
    {
      mstate ar_ptr;
      mchunkptr p;
      ...
      p = mem2chunk(mem);
      if (chunk_is_mmapped(p)) // check mmap bit
      {
        munmap_chunk(p);
        return;
      }
      ...
      ar_ptr = arena_for_chunk(p);
      ...
      _int_free(ar_ptr, mem);
    }

    void _int_free(mstate av, Void_t* mem)
    {
      mchunkptr p;
      INTERNAL_SIZE_T size;
      mfastbinptr* fb;
      ...
      p = mem2chunk(mem);
      size = chunksize(p);
      ...
      // check current size
      if ((unsigned long)(size) <= (unsigned long)(av->max_fast))
      {
        // check next chunk
        if (chunk_at_offset(p, size)->size <= 2 * SIZE_SZ
            || __builtin_expect(chunksize(chunk_at_offset(p, size))
                                >= av->system_mem, 0))
        {
          errstr = "free(): invalid next size (fast)";
          goto errout;
        }
        ...
        fb = &(av->fastbins[fastbin_index(size)]);
        ...
        p->fd = *fb;
        *fb = p;
      }
    }

Figure 10: To fake a fast-bin chunk successfully, we need to bypass some constraints in glibc.

[Figure 11 diagram. Labels: .data segment; Channel N state with data (offset 0x20) and size (offset 0x28); fake fast-bin chunk with prev size and size, drawn level with the data and size rows of Channel N; Channel N+1 state with data (offset 0x20) and size (offset 0x28); target addr.]

Figure 11: Arbitrary Address Write. By faking a fast-bin chunk on the metadata of Backdoor-RPC channels, we can reallocate the fake chunk through Backdoor-RPC operations. After reallocating, we can overwrite the next channel's metadata to corrupt arbitrary addresses.

/var/run/inetd.conf. In this way, we can bind a shell on a specific port by overwriting the inetd.conf file. Note that files in /var/run/* are not persistent; they are copied from the backup firmware in the /bootbank/* directory after rebooting.

Forcing the process to restart. To activate the configuration and spawn a shell, we need to force the inetd process to restart. However, we cannot simply restart the entire OS, because the inetd.conf file is not persistent. Files in the VMFS are copied from the bootbank after the host OS restarts. Fortunately, a watchdog can help us restart some processes. As a result, we use the kill() system call to terminate the inetd process. After that, the watchdog restarts the process, and a bind shell spawns.
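The terminate step relies only on the standard POSIX kill() call; the restart itself is done by the ESXi watchdog and is not modeled here. As a minimal sketch under that assumption, the hypothetical helpers below (`spawn_idle_child`, `terminate_process`, `reap_signal`) start a placeholder child process, terminate it with SIGTERM, and report which signal ended it.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that simply waits for signals. Returns the child's pid
 * in the parent, or -1 if fork() fails. */
pid_t spawn_idle_child(void)
{
    pid_t pid = fork();
    if (pid == 0) {
        for (;;)
            pause();        /* child: sleep until a signal arrives */
    }
    return pid;
}

/* Ask the process to terminate. Returns 0 on success, -1 on error. */
int terminate_process(pid_t pid)
{
    return kill(pid, SIGTERM);
}

/* Wait for the process and return the number of the signal that
 * terminated it, or 0 if it did not die from a signal (or on error). */
int reap_signal(pid_t pid)
{
    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return 0;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

A supervising process that observes the child's death through waitpid() is the point at which a watchdog would typically start a fresh instance, which then re-reads its configuration.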