How To Handle Globally Distributed QCOW2 Chains
● Eyal Moscovici
  – With Oracle Ravello since 2015
  – Software Engineer in the Virtualization group, focusing on the Linux kernel and QEMU
● Amit Abir
  – With Oracle Ravello since 2011
  – Virtual Storage & Networking Team Leader
Agenda
➔ Oracle Ravello Introduction
➔ Storage Layer Design
➔ Storage Layer Implementation
➔ Challenges and Solutions
➔ Summary
Oracle Ravello - Introduction
● Founded in 2011 by Qumranet founders, acquired in 2016 by Oracle
● Oracle Ravello is a virtual cloud provider
● Allows a seamless “lift and shift”:
  – Migrate on-premise data-center workloads to the public cloud
● No need to change:
  – The VM images
  – Network configuration
  – Storage configuration
Migration to the Cloud - Challenges
● Virtual hardware
  – Different hypervisors have different virtual hardware
  – Chipsets, disk/net controllers, SMBIOS/ACPI, etc.
● Network topology and capabilities
  – Clouds only support L3 IP-based communication
  – No switches, VLANs, mirror ports, etc.
Virtual hardware support
● Solved by nested virtualization:
  – HVX: our own binary-translation hypervisor
  – KVM: when HW assist is available
● Enhanced QEMU, SeaBIOS & OVMF supporting:
  – i440BX chipset
  – VMXNET3, PVSCSI
  – Multiple para-virtual interfaces (including VMware backdoor ports)
  – SMBIOS & ACPI interfaces
  – Boot from LSILogic & PVSCSI
Network capabilities support
● Solved by our Software Defined Network (SDN)
● Leveraging Linux SDN components
  – Tun/Tap, TC actions, bridge, eBPF, etc.
● Fully distributed network functions
  – Leverages Open vSwitch
Oracle Ravello Flow
[Diagram: 1. Import – VMs running on a hypervisor in the on-premise data center are imported through the Ravello Console into Ravello Image Storage. 2. Publish – the VMs are published to the public cloud, where they run on KVM/HVX nested inside a cloud VM (KVM/Xen).]
Storage Layer - Challenges
● Where to place the VM disk data?
● Must support multiple clouds and regions
● Fetch data in real time
● Clone a VM fast
● Writes to the disk must be persistent
Storage Layer – Basic Solution
● Place the VM disk images directly on cloud volumes (e.g. EBS)
● Advantages:
  – Performance
  – Zero time to first byte
● Disadvantages:
  – Bound to one cloud and region
  – Long cloning time
[Diagram: QEMU in a cloud VM accesses the disk image directly on a cloud volume.]
Storage Layer – Alternative Solution
● Place a raw file in the cloud object storage
● Advantages:
  – Globally available
  – Fast cloning
  – Inexpensive
● Disadvantages:
  – Long boot time
  – Long snapshot time
  – Same sectors stored many times
[Diagram: the raw image (data) in object storage is accessed remotely by QEMU in a cloud VM and placed on a volume (/dev/sdb/data).]
Storage Layer – Our Solution
● Place the image in the object storage and upload deltas to create a chain
● Advantages:
  – Boot starts immediately
  – Store only new data
  – Globally available
  – Fast cloning
  – Inexpensive
● Disadvantages:
  – Performance penalty
[Diagram: QEMU in a cloud VM performs remote reads from the chain in object storage and local writes to the tip (/dev/sdb/tip) on a cloud volume.]
Storage Layer Architecture
● The VM disk is backed by a QCOW2 image chain
● Reads are performed by Cloud FS, our read-only storage-layer file system:
  – Translates disk reads to HTTP requests
  – Supports multiple cloud object storages
  – Caches read data locally
  – FUSE-based
[Diagram: QEMU’s disk is the QCOW2 tip on a cloud volume; the rest of the chain lives in object storage and is read through Cloud FS via a local cache.]
CloudFS - Read Flow
● QEMU reads from /mnt/cloudfs/diff4; Cloud FS translates the read into an HTTP range request to the cloud object storage:

  GET /diff4 HTTP/1.1
  Host: ravello-vm-disks.s3.amazonaws.com
  x-amz-date: Wed, 18 Oct 2017 21:32:02 GMT
  Range: bytes=1024-1535
CloudFS - Write Flow
● A new tip for the QCOW2 chain is created with qemu-img create (see the sketch below):
  – Before a VM starts
  – Before a snapshot (using QMP blockdev-snapshot-sync)
● The tip is uploaded to the cloud storage:
  – After the VM stops
  – During a snapshot
[Diagram: QEMU in a cloud VM; the tip is uploaded to object storage.]
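A minimal sketch of both paths, assuming illustrative file names, a QMP socket at /var/run/qmp.sock and a drive named drive0 (none of which appear in the talk):

# Offline: create a new empty tip on top of the current chain head
> qemu-img create -f qcow2 -b diff4.qcow2 -F qcow2 tip.qcow2

# Live: switch a running VM to a new tip with blockdev-snapshot-sync
> socat - UNIX-CONNECT:/var/run/qmp.sock <<'EOF'
{ "execute": "qmp_capabilities" }
{ "execute": "blockdev-snapshot-sync",
  "arguments": { "device": "drive0",
                 "snapshot-file": "/mnt/cloudfs/tip.qcow2",
                 "format": "qcow2" } }
EOF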
Accelerate Remote Access
● Small requests are extended to 2MB requests (see the sketch below)
  – Assumes data read locality
  – A latency vs. throughput trade-off
  – Experiments show that 2MB is optimal
● QCOW2 chain files have random names
  – So consecutive requests hit different cloud storage workers
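A minimal sketch of the extension logic, reusing the bucket name from the read-flow slide; the offsets and the curl call are illustrative, not the actual Cloud FS code:

# Extend a small guest read (offset, length) to an aligned 2 MiB window
> CHUNK=$((2 * 1024 * 1024))
> offset=123456; length=512
> start=$(( offset / CHUNK * CHUNK ))
> end=$(( (offset + length + CHUNK - 1) / CHUNK * CHUNK - 1 ))
> curl -s -H "Range: bytes=${start}-${end}" \
       "https://ravello-vm-disks.s3.amazonaws.com/diff4" -o chunk.bin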
Globally Distributed Chains
● A VM can start on any cloud or region
● New data is uploaded to the same local region
  – Data locality is assumed
● Globally distributed chains are created
● Problem: reading data from remote regions can take a long time
[Diagram: one chain spread across regions – Base and diff1 in AWS Sydney, diff2 and diff3 in OCI Phoenix, diff4 in GCE Frankfurt.]
Globally Distributed Chains - Solution
● Every region has its own cache for the parts of the chain that live in other regions
● The first time a VM starts in a new region, every remote sector read is copied to the regional cache (see the sketch below)
[Diagram: reads of Base and diff1 are served from the regional cache.]
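A read-through sketch of the idea; both endpoints are hypothetical and only illustrate the cache-miss path:

# Try the regional cache first; on a miss, fetch from the origin
# region and populate the cache (both URLs are hypothetical)
> key=diff1; range="bytes=0-2097151"
> if ! curl -sf -H "Range: $range" \
        "https://cache-frankfurt.example.com/$key" -o part.bin; then
      curl -sf -H "Range: $range" \
          "https://ravello-vm-disks.s3.amazonaws.com/$key" -o part.bin
      curl -sf -X PUT --data-binary @part.bin \
          "https://cache-frankfurt.example.com/$key"
  fi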
Performance Drawbacks of QCOW2 Chains
● A QCOW2 image keeps minimal information about the entire chain: only its immediate backing file
  – QEMU must “walk the chain” to load image metadata (L1 tables) into RAM
● Some metadata (L2 tables) is spread across the image
  – A single disk read can create multiple random remote reads of metadata from multiple remote files
● qemu-img commands work on the whole virtual disk
  – Hard to bound execution time
Keep QCOW2 Chains Short
● A new tip for the QCOW2 chain is created:
  – Each time a VM starts
  – On each snapshot
● Problem: chains keep getting longer!
  – For example: a VM with one disk that was started 100 times has a chain 100 links deep
● Long chains cause:
  – High latency: each data/metadata read may have to “walk the chain”
  – High memory usage: each file has its own metadata (L1 tables).
    1MB (L1 size) * 100 (links) = 100MB per disk; 10 VMs with 4 disks each add up to 4GB of memory overhead
[Diagram: virtual disk backed by a chain Tip → A → … → Base.]
Keep QCow2 Chains Short (Cont.)
● Solution: merge the tip with its backing file before upload
  – Rebase the tip over the grandparent
  – Only when the backing file is small (~300MB), to keep snapshot time minimal
● This is done live or offline (see the sketch below):
  – Live: using the QMP block-stream job command
  – Offline: using qemu-img rebase
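A minimal sketch of both variants, assuming illustrative file names, a QMP socket at /var/run/qmp.sock and a drive named drive0:

# Live: pull the backing data into the tip with a block-stream job,
# stopping at the grandparent (which becomes the new backing file)
> socat - UNIX-CONNECT:/var/run/qmp.sock <<'EOF'
{ "execute": "qmp_capabilities" }
{ "execute": "block-stream",
  "arguments": { "device": "drive0", "base": "/disks/grandparent.qcow2" } }
EOF

# Offline: rebase the inactive tip directly over the grandparent
> qemu-img rebase -b grandparent.qcow2 tip.qcow2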
qemu-img rebase
● Problem: per-byte comparison between ALL allocated sectors not present in the tip
  – The logic differs from QMP block-stream rebase
  – Requires fetching these sectors
[Diagram: virtual disk with chain Tip → A → B (rebase target).]

static int img_rebase(int argc, char **argv)
{
    ...
    for (sector = 0; sector < num_sectors; sector += n) {
        ...
        ret = blk_pread(blk_old_backing, sector << BDRV_SECTOR_BITS,
                        buf_old, n << BDRV_SECTOR_BITS);
        ...
        ret = blk_pread(blk_new_backing, sector << BDRV_SECTOR_BITS,
                        buf_new, n << BDRV_SECTOR_BITS);
        ...
        while (written < n) {
            if (compare_sectors(buf_old + written * 512,
                                buf_new + written * 512,
                                n - written, &pnum)) {
                ret = blk_pwrite(blk,
                                 (sector + written) << BDRV_SECTOR_BITS,
                                 buf_old + written * 512,
                                 pnum << BDRV_SECTOR_BITS, 0);
            }
            written += pnum;
        }
    }
}
qemu-img rebase (2)
● Solution: optimized rebase within the same image chain
  – Only compare sectors that were changed after the rebase target
  – Since the rebase target is already in the tip’s backing chain, sectors not overwritten between the tip and the target are identical in both backing files and can be skipped
Reduce first remote read latency
● Problem: high latency on the first remote read of data
  – Prolongs boot time
  – Prolongs user application startup
  – Gets worse with long chains (more remote reads)
[Diagram: QEMU in a cloud VM reading the tip’s backing chain remotely from object storage.]
Prefetch Disk Data
● Solution: prefetch disk data (see the sketch below)
  – While the VM is running, start reading the disks’ data from the cloud
  – Read all disks in parallel
  – Only in relatively idle times
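A naive sketch of the idea, assuming the disks are already exposed as block devices; the device names and the use of ionice are illustrative assumptions:

# Prefetch all disks in parallel, at idle I/O priority so the
# running VM is not disturbed (device names are illustrative)
> for dev in /dev/nbd0 /dev/nbd1; do
      ionice -c3 dd if="$dev" of=/dev/null bs=2M &
  done
> wait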
Prefetch Disk Data (Cont.)
● Naive solution: read ALL the files in the chain
● Problem: we may fetch a lot of redundant data
  – An image may contain data that was overwritten higher up in the chain
[Diagram: chain Tip → A → B, with overwritten (redundant) data in the lower images.]
Avoid pre-fetching redundant data
● Solution: fetch data from the virtual disk exposed to the guest
  – Mount the tip image as a block device
  – Read data from the block device
  – QEMU will fetch only the relevant data

> qemu-nbd --connect=/dev/nbd0 tip.qcow
> dd if=/dev/nbd0 of=/dev/null
[Diagram: chain Tip → A → B; the overwritten data is never read.]
Avoid pre-fetching redundant data (2)
● Problem: reading the raw block device reads ALL sectors
  – Reading unallocated sectors wastes CPU cycles
● Solution: use qemu-img map (see the sketch below)
  – Returns a map of allocated sectors
  – Allows us to read only the allocated sectors
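A minimal sketch combining qemu-img map with the NBD export above; the jq filtering and GNU dd byte-granular flags are assumptions, not part of the talk:

# Read only the extents that qemu-img map reports as allocated
# (requires jq and GNU dd; file names are illustrative)
> qemu-nbd --connect=/dev/nbd0 tip.qcow
> qemu-img map --output=json tip.qcow |
      jq -c '.[] | select(.data == true)' |
      while read -r extent; do
          start=$(echo "$extent" | jq '.start')
          length=$(echo "$extent" | jq '.length')
          dd if=/dev/nbd0 of=/dev/null bs=2M \
             iflag=skip_bytes,count_bytes skip="$start" count="$length"
      done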
Avoid pre-fetching redundant data (3)
● Problem: qemu-img map works on the whole disk
  – Takes a long time to finish
  – We can’t start prefetching until the map completes
Avoid pre-fetching redundant data (4)
● Solution: split the map of the disk (see the sketch below)
  – We added offset and length parameters to the map operation
  – Bounds execution time
  – Prefetching can start quickly
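A sketch of a bounded map invocation. The flag names below follow the --start-offset/--max-length options that later appeared in upstream qemu-img; the talk's internal version may have spelled them differently:

# Map only a 512 MiB window so prefetching can begin immediately
# (flag names assume the later upstream options)
> qemu-img map --output=json \
      --start-offset=0 --max-length=$((512 * 1024 * 1024)) tip.qcow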
Summary
● The Oracle Ravello storage layer is implemented using QCOW2 chains
  – Stored in the public clouds’ object storage
● The QCOW2 and QEMU implementations are not ideal for our use case
  – QCOW2 keeps minimal metadata about the entire chain
  – QCOW2 metadata is spread across the file
  – QEMU must often “walk the chain”
● We would like to work with the community to improve performance in use cases such as ours
Questions?
Thank you!