
How to Handle Globally Distributed QCOW2 Chains?

Eyal Moscovici & Amit Abir


Oracle Ravello
About Us


● Eyal Moscovici
  – With Oracle Ravello since 2015
  – Software Engineer in the Virtualization group, focusing on the Linux kernel and QEMU

● Amit Abir
  – With Oracle Ravello since 2011
  – Virtual Storage & Networking Team Leader

Agenda


● Oracle Ravello Introduction

● Storage Layer Design

● Storage Layer Implementation

● Challenges and Solutions

● Summary

Oracle Ravello - Introduction


● Founded in 2011 by Qumranet founders, acquired in 2016 by Oracle

● Oracle Ravello is a Virtual Cloud Provider

● Allows seamless "Lift and Shift":
  – Migrate on-premises data-center workloads to the public cloud

● No need to change:
  – The VM images
  – Network configuration
  – Storage configuration

Migration to the Cloud - Challenges


● Virtual hardware
  – Different hypervisors have different virtual hardware
  – Chipsets, disk/net controllers, SMBIOS/ACPI, etc.

● Network topology and capabilities
  – Clouds only support L3 IP-based communication
  – No switches, VLANs, mirror ports, etc.

Virtual hardware support


● Solved by nested virtualization:
  – HVX: our own binary-translation hypervisor
  – KVM: when HW assist is available

● Enhanced QEMU, SeaBIOS & OVMF supporting:
  – i440BX chipset
  – VMXNET3, PVSCSI
  – Multiple para-virtual interfaces (including VMware backdoor ports)
  – SMBIOS & ACPI interface
  – Boot from LSILogic & PVSCSI

Network capabilities support


● Solved by our Software Defined Network (SDN)

● Leveraging Linux SDN components
  – Tun/Tap, TC actions, bridge, eBPF, etc.

● Fully distributed network functions
  – Leverages Open vSwitch

Oracle Ravello Flow

[Diagram: (1) Import: VMs running on a hypervisor in the on-premises data center are imported through the Ravello Console into the Ravello Image Storage. (2) Publish: the VMs are published to a public cloud, where they run on KVM/HVX nested inside a cloud VM (KVM/Xen).]

Storage Layer - Challenges


● Where to place the VM disk data?

● Should support multiple clouds and regions

● Fetch data in real time

● Clone a VM fast

● Writes to the disk should be persistent

Storage Layer – Basic Solution


● Place the VM disk images directly on cloud volumes (e.g. EBS)

● Advantages:
  – Performance
  – Zero time to first byte

● Disadvantages:
  – Cloud and region bounded
  – Long cloning time
  – Too expensive

[Diagram: QEMU in a cloud VM accesses /dev/sdb, which is backed directly by a cloud volume holding the disk data.]

Storage Layer – Alternative Solution


● Place a raw file in the cloud object storage

● Advantages:
  – Globally available
  – Fast cloning
  – Inexpensive

● Disadvantages:
  – Long boot time
  – Long snapshot time
  – Same sectors stored many times

[Diagram: the raw image in object storage is fetched by remote access onto a cloud volume, which QEMU in the cloud VM reads as /dev/sdb.]

Storage Layer – Our Solution


● Place the image in the object storage and upload deltas to create a chain

● Advantages:
  – Boot starts immediately
  – Store only new data
  – Globally available
  – Fast cloning
  – Inexpensive

● Disadvantages:
  – Performance penalty

[Diagram: QEMU in the cloud VM writes locally to the tip (/dev/sdb) on a cloud volume; reads of older data go as remote reads to the chain in object storage.]

Storage Layer Architecture

● VM disk is backed by a QCow2 image chain

● Reads are performed by CloudFS: our read-only storage-layer file system
  – Translates disk reads to HTTP requests
  – Supports multiple cloud object storages
  – Caches read data locally
  – FUSE based

[Diagram: inside the cloud VM, QEMU's disk points at a local QCow2 tip; the read-only backing chain is served by CloudFS from object storage, with a cache on the cloud volume.]

CloudFS - Read Flow
1. QEMU issues a read against the mounted chain file:

   read("/mnt/cloudfs/diff4", offset=1024, size=512, ...)

2. FUSE forwards the request to CloudFS:

   fuse_op_read("/mnt/cloudfs/diff4", offset=1024, size=512, ...)

3. CloudFS translates it into an HTTP range request to the cloud object storage:

   GET /diff4 HTTP/1.1
   Host: ravello-vm-disks.s3.amazonaws.com
   x-amz-date: Wed, 18 Oct 2017 21:32:02 GMT
   Range: bytes=1024-1535

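Since CloudFS is FUSE based, the translation above lives in a FUSE read handler. A minimal sketch, assuming the libfuse 2.x high-level API; object_store_read is a hypothetical stub standing in for the HTTP client, not Ravello's code:

/* cloudfs_sketch.c: gcc cloudfs_sketch.c $(pkg-config fuse --cflags --libs) */
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical helper: would issue the range GET shown above
 * ("Range: bytes=offset-(offset+size-1)") and fill buf. */
static int object_store_read(const char *path, char *buf,
                             size_t size, off_t offset)
{
    char range[64];
    snprintf(range, sizeof(range), "bytes=%lld-%lld",
             (long long)offset, (long long)(offset + size - 1));
    (void)path; (void)buf; (void)range;
    return -EIO;                     /* stub: HTTP client omitted */
}

/* The fuse_op_read() from the flow above. */
static int cloudfs_read(const char *path, char *buf, size_t size,
                        off_t offset, struct fuse_file_info *fi)
{
    (void)fi;
    int ret = object_store_read(path, buf, size, offset);
    return ret < 0 ? ret : (int)size;    /* bytes read, or -errno */
}

static struct fuse_operations cloudfs_ops = {
    .read = cloudfs_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &cloudfs_ops, NULL);
}

Only the read path is shown; a mountable file system would also need getattr, open, readdir, and so on.
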
CloudFS - Write Flow

● A new tip for the QCow chain is created with qemu-img create:
  – Before a VM starts
  – Before a snapshot (using QMP): blockdev-snapshot-sync

● The tip is uploaded to the cloud storage:
  – After the VM stops
  – During a snapshot

[Diagram: QEMU in the cloud VM writes to the local tip, which is uploaded to object storage.]

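For illustration, the two cases might look like this (file and device names are hypothetical; both interfaces are stock QEMU):

qemu-img create -f qcow2 -b diff3.qcow2 tip.qcow2

{ "execute": "blockdev-snapshot-sync",
  "arguments": { "device": "drive0",
                 "snapshot-file": "/vm/tip2.qcow2",
                 "format": "qcow2" } }

The first creates a fresh tip over the current top of the chain before the VM starts; the second, sent over QMP while the VM runs, installs a new tip so the old one can be uploaded.
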
Accelerate Remote Access


● Small requests are extended to 2MB requests
  – Assumes data read locality
  – Trades latency for throughput
  – Experiments show that 2MB is optimal

● QCow2 chain files have random names
  – So consecutive requests hit different cloud storage workers

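A minimal sketch of the request extension, in the spirit of the deck's QEMU snippets (the names are ours, and clamping to the file size is omitted):

#include <stdint.h>
#include <stdio.h>

#define CHUNK (2ULL * 1024 * 1024)         /* 2MB remote request granularity */

/* Round a small guest read [offset, offset+size) out to the enclosing
 * run of 2MB-aligned chunks; the result becomes the HTTP Range request. */
static void extend_request(uint64_t offset, uint64_t size,
                           uint64_t *req_off, uint64_t *req_len)
{
    uint64_t first = offset / CHUNK;               /* first chunk touched */
    uint64_t last  = (offset + size - 1) / CHUNK;  /* last chunk touched  */

    *req_off = first * CHUNK;
    *req_len = (last - first + 1) * CHUNK;
}

int main(void)
{
    uint64_t off, len;

    /* The 512-byte read from the read-flow example becomes one 2MB fetch. */
    extend_request(1024, 512, &off, &len);
    printf("Range: bytes=%llu-%llu\n",
           (unsigned long long)off, (unsigned long long)(off + len - 1));
    return 0;
}

The extended range is what goes on the wire; since CloudFS caches read data locally, the extra bytes serve nearby reads without another round trip.
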
Globally Distributed Chains


● A VM can start on any cloud or region

● New data is uploaded to the same local region
  – Data locality is assumed

● Globally distributed chains are created

● Problem: Reading data from remote regions can be slow

[Diagram: one chain spread across regions: Base and diff1 in AWS Sydney, diff2 and diff3 in OCI Phoenix, diff4 in GCE Frankfurt.]

Globally Distributed Chains - Solution


● Every region has its own cache for the parts of the chain that live in other regions

● The first time a VM starts in a new region, every remote sector read is copied to the regional cache

[Diagram: OCI Phoenix holds diff2 and diff3, plus a regional cache of Base and diff1 fetched from AWS Sydney.]

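A sketch of the resulting read-through behavior (the three helpers are hypothetical placeholders, not Ravello's CloudFS API):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical helpers: a regional cache and a cross-region fetch. */
int local_cache_read(const char *f, uint64_t off, size_t len, char *buf);
int local_cache_put(const char *f, uint64_t off, size_t len, const char *buf);
int remote_region_get(const char *f, uint64_t off, size_t len, char *buf);

int region_read(const char *file, uint64_t off, size_t len, char *buf)
{
    /* Serve from this region's cache if the range was fetched before. */
    if (local_cache_read(file, off, len, buf) == 0)
        return 0;

    /* Otherwise pay the cross-region latency once... */
    if (remote_region_get(file, off, len, buf) != 0)
        return -1;

    /* ...and keep a regional copy for later reads and other VMs. */
    return local_cache_put(file, off, len, buf);
}
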
Performance Drawbacks of QCow Chains

● QCow keeps minimal information about the entire chain: each image knows only its own backing file
  – QEMU must "walk the chain" to load image metadata (L1 tables) into RAM

● Some metadata (L2 tables) is spread across the image
  – A single disk read can trigger multiple random remote reads of metadata from multiple remote files

● qemu-img commands work on the whole virtual disk
  – Hard to bound execution time

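The chain walk is visible from the command line; for a hypothetical three-link chain (output abridged):

qemu-img info --backing-chain tip.qcow

image: tip.qcow
file format: qcow2
backing file: A.qcow
...
image: A.qcow
file format: qcow2
backing file: Base.qcow
...
image: Base.qcow
file format: qcow2

Every link must be opened, and its L1 table loaded, before the disk is usable.
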
Keep QCow2 Chains Short
● A new tip for the QCow chain is created:
  – Each time a VM starts
  – Each snapshot

● Problem: Chains keep getting longer!
  – For example: a VM with one disk that was started 100 times has a chain 100 links deep

● Long chains cause:
  – High latency: reading data/metadata requires "walking the chain"
  – High memory usage: each file has its own metadata (L1 tables).
    1MB (L1 size) * 100 (links) = 100MB per disk. Assume 10 VMs with 4 disks each: 4GB of memory overhead

[Diagram: virtual disk composed of a chain from Tip through A down to Base.]

Keep QCow2 Chains Short (Cont.)


● Solution: merge the tip with its backing file before upload
  – Rebase the tip over the grandparent
  – Only when the backing file is small (~300MB), to keep snapshot time minimal

● This can be done live or offline:
  – Live: using the QMP block-stream job command
  – Offline: using qemu-img rebase

[Diagram: before the merge the chain is Tip -> A -> B (rebase target); afterwards the rebased tip sits directly on B.]

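Concretely, the two paths might look like this (device and file names are hypothetical; both interfaces are stock QEMU):

Live, over QMP:
{ "execute": "block-stream",
  "arguments": { "device": "drive0", "base": "B.qcow2" } }

Offline:
qemu-img rebase -b B.qcow2 tip.qcow2

block-stream copies the data above B into the running tip until the tip's backing file is B; the offline rebase produces the same layout with the VM stopped.
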
qemu-img rebase
● Problem: per-byte comparison between ALL allocated sectors not present in the tip
  – The logic is different from the QMP block-stream rebase
  – Requires fetching these sectors

From qemu-img.c (abridged):

static int img_rebase(int argc, char **argv)
{
    ...
    for (sector = 0; sector < num_sectors; sector += n) {
        ...
        /* Read the sector run from both the old and the new backing file. */
        ret = blk_pread(blk_old_backing, sector << BDRV_SECTOR_BITS,
                        buf_old, n << BDRV_SECTOR_BITS);
        ...
        ret = blk_pread(blk_new_backing, sector << BDRV_SECTOR_BITS,
                        buf_new, n << BDRV_SECTOR_BITS);
        ...
        /* Copy into the image only the runs where the two backings differ. */
        while (written < n) {
            if (compare_sectors(buf_old + written * 512,
                                buf_new + written * 512,
                                n - written, &pnum)) {
                ret = blk_pwrite(blk,
                                 (sector + written) << BDRV_SECTOR_BITS,
                                 buf_old + written * 512,
                                 pnum << BDRV_SECTOR_BITS, 0);
            }
            written += pnum;
        }
    }
}

[Diagram: virtual disk chain Tip -> A -> B (rebase target).]

qemu-img rebase (2)


● Solution: Optimized rebase within the same image chain
  – Only compare sectors that were changed above the rebase target

static int img_rebase(int argc, char **argv)
{
    ...
    // check if blk_new_backing and blk are in the same chain
    same_chain = ...
    for (sector = 0; sector < num_sectors; sector += n) {
        ...
        m = n;
        if (same_chain) {
            /* Runs with no allocation between the tip and the rebase
             * target are identical in both backings: no need to
             * compare this part. */
            ret = bdrv_is_allocated_above(blk, blk_new_backing,
                                          sector, m, &m);
            if (!ret) continue;
            ...
        }
    }
}

[Diagram: chain Tip -> A -> B (rebase target); sectors untouched above B need no comparison.]

Reduce first remote read latency


● Problem: High latency on the first remote read of each piece of data
  – Prolongs boot time
  – Prolongs user application startup
  – Gets worse with long chains (more remote reads)

[Diagram: QEMU in the cloud VM reading remotely from the chain in object storage.]

Prefetch Disk Data


● Solution: Prefetch disk data
  – While the VM is running, start reading the disks' data from the cloud
  – Read all disks in parallel
  – Only in relatively idle times

Prefetch Disk Data (Cont.)

● Naive solution: read ALL the files in the chain

● Problem: We may fetch a lot of redundant data
  – An image may contain data that was later overwritten higher up the chain

[Diagram: chain Tip -> A -> B, where B holds redundant (overwritten) data.]

Avoid pre-fetching redundant data


● Solution: Fetch data through the virtual disk exposed to the guest
  – Mount the tip image as a block device
  – Read data from the block device
  – QEMU will fetch only the relevant data

> qemu-nbd --connect=/dev/nbd0 tip.qcow
> dd if=/dev/nbd0 of=/dev/null

[Diagram: chain Tip -> A -> B; the overwritten data in B is never read.]

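Presumably the device is detached again once the prefetch pass completes:

> qemu-nbd --disconnect /dev/nbd0
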
Avoid pre-fetching redundant data (2)


● Problem: Reading the raw block device reads ALL sectors
  – Reading unallocated sectors wastes CPU cycles

● Solution: use qemu-img map
  – Returns a map of the allocated sectors
  – Allows us to read only the allocated sectors

> qemu-img map tip.qcow

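The human-readable output lists which guest ranges are allocated and in which file of the chain they live (values here are illustrative); the prefetcher then reads only those ranges from /dev/nbd0:

Offset          Length          Mapped to       File
0               0x20000         0x50000         tip.qcow
0x20000         0x100000        0x90000         A.qcow
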
Avoid pre-fetching redundant data (3)


● Problem: qemu-img map works on the whole disk
  – Takes a long time to finish
  – We can't start prefetching until the map finishes

Avoid pre-fetching redundant data (4)


● Solution: split the map of the disk
  – We added offset and length parameters to the operation
  – Bounds execution time
  – Starts prefetching data quickly

> qemu-img map -offset 0 -length 1G tip.qcow

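A sketch of the resulting driver loop, using the patched flags shown above (this flag syntax is the deck's modified qemu-img, not upstream at the time; the window list is illustrative):

# map and prefetch the disk one 1G window at a time
for off in 0 1G 2G 3G; do
    qemu-img map -offset $off -length 1G tip.qcow
    # ...feed each returned range to the reader (e.g. dd on /dev/nbd0)...
done

Each window is cheap to map, so prefetching can start as soon as the first window's ranges are known.
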
Summary


● The Oracle Ravello storage layer is implemented using QCow2 chains
  – Stored in the public cloud's object storage

● The QCow2 format and QEMU implementation are not ideal for our use case
  – QCow2 keeps minimal metadata about the entire chain
  – QCow2 metadata is spread across the file
  – QEMU must often "walk the chain"

● We would like to work with the community to improve performance in use cases such as ours

Questions?

Thank you!

