How To Handle Globally Distributed QCOW2 Chains
● Eyal Moscovici
  – With Oracle Ravello since 2015
  – Software Engineer in the Virtualization group, focusing on the Linux kernel and QEMU
● Amit Abir
  – With Oracle Ravello since 2011
  – Virtual Storage & Networking Team Leader
Agenda
➔ Oracle Ravello Introduction
➔ Storage Layer Design
➔ Storage Layer Implementation
➔ Challenges and Solutions
➔ Summary
Oracle Ravello - Introduction
● Founded in 2011 by Qumranet founders, acquired in 2016 by Oracle
● Oracle Ravello is a virtual cloud provider
● Allows a seamless “lift and shift”:
  – Migrate on-premise data-center workloads to the public cloud
● No need to change:
  – The VM images
  – Network configuration
  – Storage configuration
Migration to the Cloud - Challenges
● Virtual hardware
  – Different hypervisors have different virtual hardware
  – Chipsets, disk/net controllers, SMBIOS/ACPI, etc.
● Network topology and capabilities
  – Clouds only support L3 IP-based communication
  – No switches, VLANs, mirror ports, etc.
Virtual hardware support
● Solved by nested virtualization:
  – HVX: our own binary-translation hypervisor
  – KVM: when HW assist is available
● Enhanced QEMU, SeaBIOS & OVMF supporting:
  – i440BX chipset
  – VMXNET3, PVSCSI
  – Multiple para-virtual interfaces (including VMware backdoor ports)
  – SMBIOS & ACPI interfaces
  – Boot from LSILogic & PVSCSI
Network capabilities support
● Solved by our Software Defined Network (SDN)
● Leveraging Linux SDN components
  – Tun/Tap, TC actions, bridge, eBPF, etc.
● Fully distributed network functions
  – Leverages Open vSwitch
Oracle Ravello Flow
[Diagram: 1. Import – VMs running on a hypervisor in the on-premise data center are imported through the Ravello Console into Ravello Image Storage. 2. Publish – the VMs are published to the public cloud, where they run on KVM/HVX nested inside a cloud VM (KVM/Xen).]
Storage Layer - Challenges
● Where to place the VM disk data?
● Must support multiple clouds and regions
● Fetch data in real time
● Clone a VM fast
● Writes to the disk must be persistent
Storage Layer – Basic Solution
● Place the VM disk images directly on cloud volumes (e.g. EBS)
● Advantages:
  – Performance
  – Zero time to first byte
● Disadvantages:
  – Bound to one cloud and region
  – Long cloning time
[Diagram: QEMU in a cloud VM accesses the disk image directly on a cloud volume.]
Storage Layer – Alternative Solution
● Place a raw file in the cloud object storage
● Advantages:
  – Globally available
  – Fast cloning
  – Inexpensive
● Disadvantages:
  – Long boot time
  – Long snapshot time
  – Same sectors stored many times
[Diagram: the raw image (data) in object storage is accessed remotely by QEMU in a cloud VM and placed on a volume (/dev/sdb/data).]
Storage Layer – Our Solution
● Place the image in the object storage and upload deltas to create a chain
● Advantages:
  – Boot starts immediately
  – Store only new data
  – Globally available
  – Fast cloning
  – Inexpensive
● Disadvantages:
  – Performance penalty
[Diagram: QEMU in a cloud VM performs remote reads from the chain in object storage and local writes to the tip (/dev/sdb/tip) on a cloud volume.]
Storage Layer Architecture
● The VM disk is backed by a QCOW2 image chain
● Reads are performed by Cloud FS, our read-only storage-layer file system:
  – Translates disk reads to HTTP requests
  – Supports multiple cloud object storages
  – Caches read data locally
  – FUSE-based
[Diagram: QEMU’s disk is the QCOW2 tip on a cloud volume; the rest of the chain lives in object storage and is read through Cloud FS via a local cache.]
CloudFS - Read Flow
● QEMU reads from /mnt/cloudfs/diff4; Cloud FS translates the read into an HTTP range request to the cloud object storage:

  GET /diff4 HTTP/1.1
  Host: ravello-vm-disks.s3.amazonaws.com
  x-amz-date: Wed, 18 Oct 2017 21:32:02 GMT
  Range: bytes=1024-1535
CloudFS - Write Flow
● A new tip for the QCOW2 chain is created with qemu-img create (see the sketch below):
  – Before a VM starts
  – Before a snapshot (using QMP blockdev-snapshot-sync)
● The tip is uploaded to the cloud storage:
  – After the VM stops
  – During a snapshot
[Diagram: QEMU in a cloud VM; the tip is uploaded to object storage.]
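A minimal sketch of both paths, assuming illustrative file names, a QMP socket at /var/run/qmp.sock and a drive named drive0 (none of which appear in the talk):

# Offline: create a new empty tip on top of the current chain head
> qemu-img create -f qcow2 -b diff4.qcow2 -F qcow2 tip.qcow2

# Live: switch a running VM to a new tip with blockdev-snapshot-sync
> socat - UNIX-CONNECT:/var/run/qmp.sock <<'EOF'
{ "execute": "qmp_capabilities" }
{ "execute": "blockdev-snapshot-sync",
  "arguments": { "device": "drive0",
                 "snapshot-file": "/mnt/cloudfs/tip.qcow2",
                 "format": "qcow2" } }
EOF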
Accelerate Remote Access
● Small requests are extended to 2MB requests (see the sketch below)
  – Assumes data read locality
  – A latency vs. throughput trade-off
  – Experiments show that 2MB is optimal
● QCOW2 chain files have random names
  – So consecutive requests hit different cloud storage workers
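A minimal sketch of the extension logic, reusing the bucket name from the read-flow slide; the offsets and the curl call are illustrative, not the actual Cloud FS code:

# Extend a small guest read (offset, length) to an aligned 2 MiB window
> CHUNK=$((2 * 1024 * 1024))
> offset=123456; length=512
> start=$(( offset / CHUNK * CHUNK ))
> end=$(( (offset + length + CHUNK - 1) / CHUNK * CHUNK - 1 ))
> curl -s -H "Range: bytes=${start}-${end}" \
       "https://ravello-vm-disks.s3.amazonaws.com/diff4" -o chunk.bin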
Globally Distributed Chains
● A VM can start on any cloud or region
● New data is uploaded to the same local region
  – Data locality is assumed
● Globally distributed chains are created
● Problem: reading data from remote regions can take a long time
[Diagram: one chain spread across regions – Base and diff1 in AWS Sydney, diff2 and diff3 in OCI Phoenix, diff4 in GCE Frankfurt.]
Globally Distributed Chains - Solution
● Every region has its own cache for the parts of the chain that live in other regions
● The first time a VM starts in a new region, every remote sector read is copied to the regional cache (see the sketch below)
[Diagram: reads of Base and diff1 are served from the regional cache.]
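A read-through sketch of the idea; both endpoints are hypothetical and only illustrate the cache-miss path:

# Try the regional cache first; on a miss, fetch from the origin
# region and populate the cache (both URLs are hypothetical)
> key=diff1; range="bytes=0-2097151"
> if ! curl -sf -H "Range: $range" \
        "https://cache-frankfurt.example.com/$key" -o part.bin; then
      curl -sf -H "Range: $range" \
          "https://ravello-vm-disks.s3.amazonaws.com/$key" -o part.bin
      curl -sf -X PUT --data-binary @part.bin \
          "https://cache-frankfurt.example.com/$key"
  fi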
Performance Drawbacks of QCOW2 Chains
● A QCOW2 image keeps minimal information about the entire chain: only its immediate backing file
  – QEMU must “walk the chain” to load image metadata (L1 tables) into RAM
● Some metadata (L2 tables) is spread across the image
  – A single disk read can create multiple random remote reads of metadata from multiple remote files
● qemu-img commands work on the whole virtual disk
  – Hard to bound execution time
Keep QCOW2 Chains Short
● A new tip for the QCOW2 chain is created:
  – Each time a VM starts
  – On each snapshot
● Problem: chains keep getting longer!
  – For example: a VM with one disk that was started 100 times has a chain 100 links deep
● Long chains cause:
  – High latency: each data/metadata read may have to “walk the chain”
  – High memory usage: each file has its own metadata (L1 tables).
    1MB (L1 size) * 100 (links) = 100MB per disk; 10 VMs with 4 disks each add up to 4GB of memory overhead
[Diagram: virtual disk backed by a chain Tip → A → … → Base.]
Keep QCow2 Chains Short (Cont.)
● Solution: merge the tip with its backing file before upload
  – Rebase the tip over the grandparent
  – Only when the backing file is small (~300MB), to keep snapshot time minimal
● This is done live or offline (see the sketch below):
  – Live: using the QMP block-stream job command
  – Offline: using qemu-img rebase
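A minimal sketch of both variants, assuming illustrative file names, a QMP socket at /var/run/qmp.sock and a drive named drive0:

# Live: pull the backing data into the tip with a block-stream job,
# stopping at the grandparent (which becomes the new backing file)
> socat - UNIX-CONNECT:/var/run/qmp.sock <<'EOF'
{ "execute": "qmp_capabilities" }
{ "execute": "block-stream",
  "arguments": { "device": "drive0", "base": "/disks/grandparent.qcow2" } }
EOF

# Offline: rebase the inactive tip directly over the grandparent
> qemu-img rebase -b grandparent.qcow2 tip.qcow2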
qemu-img rebase
● Problem: per-byte comparison between ALL allocated sectors not present in the tip
  – The logic differs from QMP block-stream rebase
  – Requires fetching these sectors
[Diagram: virtual disk with chain Tip → A → B (rebase target).]

static int img_rebase(int argc, char **argv)
{
    ...
    for (sector = 0; sector < num_sectors; sector += n) {
        ...
        ret = blk_pread(blk_old_backing, sector << BDRV_SECTOR_BITS,
                        buf_old, n << BDRV_SECTOR_BITS);
        ...
        ret = blk_pread(blk_new_backing, sector << BDRV_SECTOR_BITS,
                        buf_new, n << BDRV_SECTOR_BITS);
        ...
        while (written < n) {
            if (compare_sectors(buf_old + written * 512,
                                buf_new + written * 512,
                                n - written, &pnum)) {
                ret = blk_pwrite(blk,
                                 (sector + written) << BDRV_SECTOR_BITS,
                                 buf_old + written * 512,
                                 pnum << BDRV_SECTOR_BITS, 0);
            }
            written += pnum;
        }
    }
}
qemu-img rebase (2)
● Solution: optimized rebase within the same image chain
  – Only compare sectors that were changed after the rebase target
  – Since the rebase target is already in the tip’s backing chain, sectors not overwritten between the tip and the target are identical in both backing files and can be skipped
Reduce first remote read latency
● Problem: high latency on the first remote read of data
  – Prolongs boot time
  – Prolongs user application startup
  – Gets worse with long chains (more remote reads)
[Diagram: QEMU in a cloud VM reading the tip’s backing chain remotely from object storage.]
Prefetch Disk Data
● Solution: prefetch disk data (see the sketch below)
  – While the VM is running, start reading the disks’ data from the cloud
  – Read all disks in parallel
  – Only in relatively idle times
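A naive sketch of the idea, assuming the disks are already exposed as block devices; the device names and the use of ionice are illustrative assumptions:

# Prefetch all disks in parallel, at idle I/O priority so the
# running VM is not disturbed (device names are illustrative)
> for dev in /dev/nbd0 /dev/nbd1; do
      ionice -c3 dd if="$dev" of=/dev/null bs=2M &
  done
> wait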
Prefetch Disk Data (Cont.)
● Naive solution: read ALL the files in the chain
● Problem: we may fetch a lot of redundant data
  – An image may contain data that was overwritten higher up in the chain
[Diagram: chain Tip → A → B, with overwritten (redundant) data in the lower images.]
Avoid pre-fetching redundant data
● Solution: fetch data from the virtual disk exposed to the guest
  – Mount the tip image as a block device
  – Read data from the block device
  – QEMU will fetch only the relevant data

> qemu-nbd --connect=/dev/nbd0 tip.qcow
> dd if=/dev/nbd0 of=/dev/null
[Diagram: chain Tip → A → B; the overwritten data is never read.]
Avoid pre-fetching redundant data (2)
● Problem: reading the raw block device reads ALL sectors
  – Reading unallocated sectors wastes CPU cycles
● Solution: use qemu-img map (see the sketch below)
  – Returns a map of allocated sectors
  – Allows us to read only the allocated sectors
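A minimal sketch combining qemu-img map with the NBD export above; the jq filtering and GNU dd byte-granular flags are assumptions, not part of the talk:

# Read only the extents that qemu-img map reports as allocated
# (requires jq and GNU dd; file names are illustrative)
> qemu-nbd --connect=/dev/nbd0 tip.qcow
> qemu-img map --output=json tip.qcow |
      jq -c '.[] | select(.data == true)' |
      while read -r extent; do
          start=$(echo "$extent" | jq '.start')
          length=$(echo "$extent" | jq '.length')
          dd if=/dev/nbd0 of=/dev/null bs=2M \
             iflag=skip_bytes,count_bytes skip="$start" count="$length"
      done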
Avoid pre-fetching redundant data (3)
● Problem: qemu-img map works on the whole disk
  – Takes a long time to finish
  – We can’t start prefetching until the map completes
Avoid pre-fetching redundant data (4)
● Solution: split the map of the disk (see the sketch below)
  – We added offset and length parameters to the map operation
  – Bounds execution time
  – Prefetching can start quickly
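A sketch of a bounded map invocation. The flag names below follow the --start-offset/--max-length options that later appeared in upstream qemu-img; the talk's internal version may have spelled them differently:

# Map only a 512 MiB window so prefetching can begin immediately
# (flag names assume the later upstream options)
> qemu-img map --output=json \
      --start-offset=0 --max-length=$((512 * 1024 * 1024)) tip.qcow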
Summary
● The Oracle Ravello storage layer is implemented using QCOW2 chains
  – Stored in the public clouds’ object storage
● The QCOW2 and QEMU implementations are not ideal for our use case
  – QCOW2 keeps minimal metadata about the entire chain
  – QCOW2 metadata is spread across the file
  – QEMU must often “walk the chain”
● We would like to work with the community to improve performance in use cases such as ours
Questions?
Thank you!