Introduction to Cloud Data Center
       and Network Issues



  Presenter: Jason, Tsung-Cheng, HOU
  Advisor: Wanjiun Liao
                                       July 2nd, 2012
                                                        1
Agenda
• Cloud Computing / Data Center
  Basic Background
• Enabling Technology
• Infrastructure as a Service
  A Cloud DC System Example
• Networking Issues in Cloud DC




                                  2
Brand New Technology?
• Not exactly; large-scale computing existed in the past:
  utility/mainframe computing, grid computing, supercomputers
• Past demand: scientific computing, large-scale
  engineering (finance, construction, aerospace)
• New demand: search, e-commerce, content
  streaming, application/web hosting, IT
  outsourcing, mobile/remote apps, big data
  processing…
• Difference: aggregated individual small
  demands, highly volatile and dynamic, not all
  profitable
  – Seek economies of scale to cut costs
  – Rely on resilient, flexible, and scalable infrastructure   3
Cloud Data Center
                   Traditional Data Center      Cloud Data Center
Servers            Co-located                   Integrated
                   Dependent Failure            Fault-Tolerant
Resources          Partitioned                  Unified
                   Performance Interrelated     Performance Isolated
Management         Separated                    Centralized Full Control
                   Manual                       With Automation
Scheduling         Plan Ahead                   Flexible
                   Overprovisioning             Scalable
Renting            Per Physical Machine         Per Logical Usage
Application /      Fixed on Designated          Runs and Moves across
Services           Servers                      All VMs
Cloud Computing stack: End Device / Client → Common App. Platform →
Cloud Data Center

Cloud DC Requirements (network dependent):
• On-Demand Self-Service      • Measured Usage
• Resource Pooling            • Broad Network Access
• Rapid Elasticity                                                     4
Server and Switch Organization




                 What’s on Amazon?
                 Dropbox, Instagram, Netflix, Pinterest
                 Foursquare, Quora, Twitter, Yelp
                 Nasdaq, New York Times….
                 …and a lot more
Data Center Components




                         6
Clusters of Commodities
• Current cloud DC achieves high performance
  using commodity servers and switches
  →no specialized solution for supercomputing
• Supercomputing still exists, an example is
  Symmetric Multi-Processing server
  →128-core on shared RAM-like memory
• Compare to 32 LAN-connected 4-core servers
  – Accessing global data: SMP ~100 ns, LAN ~100 µs
• Computing penalty from delayed LAN-access
• Performance gain when clusters grow large
                                                7
Penalty for Latency in LAN-access

  This is not a comparison of a server cluster with a single high-end server.

  [Chart: performance penalty vs. cluster size]
  f: # of global data I/O operations per 10 ms
  High: f = 100
  Medium: f = 10
  Low: f = 1
                                                                           8
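The chart's message can be reproduced with a toy latency model (my own back-of-the-envelope sketch, not taken from the slide's data): assume each 10 ms slice of work issues f global-data accesses at the interconnect's latency.

```python
# Back-of-the-envelope model (my own sketch, not from the slide's data):
# each 10 ms slice of work issues f global-data accesses, each paying the
# interconnect latency: ~100 ns for SMP shared memory, ~100 us for LAN.
def slowdown(f, lan=100e-6, smp=100e-9, work=10e-3):
    """Ratio of LAN-cluster time to SMP time for the same work slice."""
    return (work + f * lan) / (work + f * smp)

for f in (1, 10, 100):  # low / medium / high communication intensity
    print(f"f={f:3d}: LAN-access is {slowdown(f):.2f}x slower")
```

For f = 100 the LAN cluster spends as long waiting on the network as computing, roughly a 2x slowdown, which is why communication-heavy workloads favored SMP.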
Performance Gain When Clusters Grow Large




                                      9
Agenda
• Cloud Computing / Data Center
  Basic Background
• Enabling Technology
• Infrastructure as a Service
  A Cloud DC System Example
• Networking Issues in Cloud DC




                                  10
A DC-wide System
• Has software systems consisting of:
  – Distributed system, logical clocks, coordination
    and locks, remote procedural call…etc
  – Distributed file system
  – (We do not go deeper into above components)
  – Parallel computation: MapReduce, Hadoop
• Virtualized Infrastructure:
  – Computing: Virtual Machine / Hypervisor
  – Storage: Virtualized / distributed storage
  – Network: Network virtualization…the next step?
                                                       11
MapReduce
• 100 TB datasets
  – Scanning on 1 node – 23 days
  – On 1000 nodes – 33 minutes
• Single machine performance does not matter
  – Just add more… but HOW to use so many clusters ?
  – How to make distributed programming simple and elegant ?
• Sounds great, but what about MTBF?
  – MTBF = Mean Time Between Failures
  – 1 node – once per 3 years
  – 1000 nodes – 1 node per 1 day
• MapReduce refers to both:
  – Programming framework
  – Fault-tolerant runtime system
                                                         12
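The 23-day / 33-minute figures above check out under a plausible per-node scan rate (the ~50 MB/s sequential-read rate is my assumption, not stated on the slide):

```python
# Rough check of the scan-time figures, assuming ~50 MB/s sequential read
# per node (the rate is my assumption; a common 2012-era disk figure).
DATASET = 100e12   # 100 TB in bytes
RATE = 50e6        # bytes per second per node

one_node = DATASET / RATE                 # seconds on a single node
print(one_node / 86400, "days")           # ~23 days
print(one_node / 1000 / 60, "minutes")    # ~33 minutes on 1000 nodes
```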
MapReduce: Word Counting

          Shuffle and Sort
          ↓↓↓




                             13
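The word-count pipeline on this slide can be sketched as a single-process Python toy (illustration only, not Hadoop code): map emits (word, 1) pairs, shuffle-and-sort groups them by key, and reduce sums the counts.

```python
from itertools import groupby

# Single-process sketch of the word-count example (illustration, not
# Hadoop code): map emits (word, 1), shuffle-and-sort groups by key,
# reduce sums the counts.
def map_phase(document):
    for word in document.split():
        yield (word, 1)

def shuffle_sort(pairs):
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield key, [v for _, v in group]

def reduce_phase(key, values):
    return key, sum(values)

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [p for d in docs for p in map_phase(d)]
counts = dict(reduce_phase(k, vs) for k, vs in shuffle_sort(pairs))
print(counts["the"], counts["fox"])  # 3 2
```

In the real runtime the map and reduce calls run on different workers and the shuffle happens over the network, but the programmer only writes the two phase functions.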
MapReduce: A Diagram




                 ←Shuffle
                 ← Sort




                            14
Distributed Execution Overview
[Diagram: the user program forks a master and workers; the master assigns
map tasks over input splits (Split 0/1/2) and reduce tasks; map workers
write intermediate results locally; reduce workers do remote read + sort
(shuffle & sort) and write the output files (Output File 0/1)]
Master also deals with:
• Worker status updates
• Fault-tolerance
• I/O scheduling
• Automatic distribution
• Automatic parallelization
VM and Hypervisor
• Virtual Machine: A software
  package, sometimes using hardware
  acceleration, that allows an isolated guest
  operating system to run within a host
  operating system.
• Stateless: Once shut down, all HW states
  disappear.
• Hypervisor: A software platform that is
  responsible for creating, running, and
  destroying multiple virtual machines.
• Type 1 and Type 2 hypervisors                 16
Type 1 vs. Type 2 Hypervisor




                               17
Concept of Virtualization
• Decoupling HW/SW by abstraction & layering
• Using, demanding,
  but not owning or configuring
• Resource pool: flexible to
  slice, resize, combine, and distribute
• A degree of automation by software
        HOST 1         HOST 2         HOST 3        HOST 4




 VMs

         Hypervisor:
         Turns 1 server into many “virtual machines” (instances or VMs)
         (VMware ESX, Citrix XenServer, KVM, etc.)                       18
Concept of Virtualization
• Hypervisor: abstraction for HW/SW
• For SW: Abstraction and automation of
  physical resources
  – Pause, erase, create, and monitor
  – Charge services per usage units
• For HW: Generalized interaction with SW
  – Access control
  – Multiplex and demultiplex
• The operator retains ultimate hypervisor control
• Benefit? Monetizes the operator’s capital expenditure
                                            19
I/O Virtualization Model
• Protects I/O access; multiplexes / demultiplexes traffic
• Delivers packets among VMs via shared memory
• Performance bottleneck: overhead when communicating
  between the driver domain and VMs
• VM scheduling and long queues → delay / throughput variance


                                                    Bottleneck:
                                                 CPU/RAM I/O lag
                                                   VM Scheduling
                                                  I/O Buffer Queue




                                                               20
Agenda
• Cloud Computing / Data Center
  Basic Background
• Enabling Technology
• Infrastructure as a Service
  A Cloud DC System Example
• Networking Issues in Cloud DC




                                  21
OpenStack Status
• OpenStack
    – Founded by NASA and Rackspace in 2010
    – Today 183 companies and 3,386 people
    – Was only 125 and 1,500 in fall 2011
    – Growing fast; latest release Essex, Apr. 5th
•   Release cycle aligned with Ubuntu (Apr. / Oct.)
•   Aims to be the “Linux” of cloud computing systems
•   Open-source alternative vs. Amazon and VMware
•   Start-ups are forming around OpenStack
•   Still lacks big use cases and implementations
                                                     22
A Cloud Management Layer Is Missing
Questions arise as the environment grows...
“VM sprawl” can make things unmanageable very quickly

• How do you make your apps cloud-aware?
• How do you empower employees to self-service?
• Where should you provision new VMs?
• How do you keep track of it all?

1. Server Virtualization      2. Cloud Data Center      3. Cloud Federation
Solution: OpenStack, the Cloud Operating System
A new management layer that adds automation and control

[Diagram: APPS, USERS, and ADMINS sitting on top of the
CLOUD OPERATING SYSTEM]

1. Server Virtualization      2. Cloud Data Center      3. Cloud Federation
A Common Platform Is Here
OpenStack is open-source software powering public and private clouds.

[Diagram: a private cloud and a public cloud on the same platform]
OpenStack Enables Cloud Federation
Connecting clouds to create global resource pools

[Diagram: clouds in Texas, California, Washington, and Europe joined by a
common software platform that makes federation possible]

1. Server Virtualization      2. Cloud Data Center      3. Cloud Federation
OpenStack Key Components
[Diagram: Horizon (dashboard) on top of Nova (compute), Glance (image
service), Swift (object storage), and Keystone (identity)]
Keystone Main Functions
• Provides 4 primary services:
  – Identity: User information authentication
  – Token: after login, tokens replace account / password credentials
  – Service catalog: Service units registered
  – Policies: Enforces different user levels
• Can be backed by different databases.




                                                 27
Swift Main Components
Swift Implementation
• Duplicated storage, load balancing
• Logical view vs. physical arrangement
• Object servers: store the real objects
• Container servers: store object metadata
• Account servers: store container / object metadata
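As an illustration of the replicated-placement idea, here is a simplified ring-style sketch. This is a stand-in for Swift's actual ring implementation, and the node names and the 64 virtual points per node are made-up values.

```python
import hashlib
from bisect import bisect

# Illustrative ring-style placement: a simplified stand-in for Swift's
# ring, not its actual implementation. Node names and the 64 virtual
# points per node are made-up values.
NODES = ["storage-1", "storage-2", "storage-3"]
RING = sorted((int(hashlib.md5(f"{n}#{v}".encode()).hexdigest(), 16), n)
              for n in NODES for v in range(64))
KEYS = [k for k, _ in RING]

def place(obj_name, replicas=2):
    """Walk the ring clockwise from the object's hash to pick replicas."""
    h = int(hashlib.md5(obj_name.encode()).hexdigest(), 16)
    i, out = bisect(KEYS, h), []
    while len(out) < replicas:
        node = RING[i % len(RING)][1]
        if node not in out:
            out.append(node)   # duplicated storage: distinct nodes only
        i += 1
    return out

print(place("photos/cat.jpg"))  # two distinct storage nodes
```

Hashing names onto many virtual points is what gives the load balancing; walking past the first hit is what gives the duplicated storage on distinct nodes.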
Glance
• Image storage and indexing
• Keeps a database of metadata associated with each image;
  supports discovering, registering, and retrieving images
• Built on top of Swift; images are stored in Swift
• Two servers:
  – Glance-api: public interface for uploading and
    managing images.
  – Glance-registry: private interface to metadata
    database
• Support multiple image formats
                                                     30
Glance Process


Upload or Store




                      Download or Get


                                   31
Nova
• Major components:
  – API: public facing interface
  – Message Queue: Broker to handle interactions
    between services, currently based on RabbitMQ
  – Scheduler: coordinates all services, determines
    placement of new resources requested
  – Compute Worker: hosts VMs; controls the hypervisor
    and VMs when it receives commands on the message queue
  – Volume: manages permanent storage


                                                32
Messaging (RabbitMQ)




                       33
General Nova Process




                       34
Launching a VM




                 35
Complete System Logical View




                               36
Agenda
• Cloud Computing / Data Center
  Basic Background
• Enabling Technology
• Infrastructure as a Service
  A Cloud DC System Example
• Networking Issues in Cloud DC




                                  37
Primitive OpenStack Network
• Each VM network is owned by one network host
  – Simply a Linux server running the Nova-network daemon
• The Nova network node is the only gateway
• Flat Network Manager:
  – A Linux networking bridge forms a subnet
  – All instances attach to the same bridge
  – Manually configure servers, controller, and IPs
• Flat DHCP Network Manager:
  – Adds a DHCP server on the same bridge
• Limitations: single gateway, per-cluster scope, fragmentation
                                                    38
OpenStack Network




Linux server running the Nova-network daemon: the only gateway for all
NICs bridged into the network. VMs are bridged onto a raw Ethernet
device.                                                                  39
Conventional DCN Topology
Public Internet

  DC Layer-3

  DC Layer-2




• Oversubscription                        • Scale-up proprietary design – expensive
• Fragmentation of resources:             • Inflexible addressing, static routing
  Network limits cross-DC communication   • Inflexible network configuration
• Hinders applications’ scalability         Protocol baked / embedded on chips
• Only reachability isolation:
  performance bottlenecks remain interdependent                                 40
A New DCN Topology
                                                                              Core Switches

                                                                                Full Bisection

Aggr
                                                                                Full Bisection
Edge




               Pod-0               Pod-1              k = 4
• k pods, each with k²/4 hosts and k switches
• (k/2)² core switches, (k/2)² paths for each S-D pair
• 5k²/4 k-port switches, k³/4 hosts in total
• 48-port switches: 27,648 hosts, 2,880 switches
• Full bisection bandwidth at each level
• Modular, scale-out, cheap design
Challenges:
• Cabling explosion, copper transmission range limits
• Existing addressing / routing / forwarding do not work well on fat-tree / Clos
• Scalability issues with millions of end hosts
• Configuration of millions of parts

                                                                                          41
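The fat-tree bookkeeping above can be checked directly:

```python
# Quick check of the fat-tree arithmetic on the slide: k pods with k^2/4
# hosts and k switches each, (k/2)^2 core switches, 5k^2/4 k-port
# switches and k^3/4 hosts in total.
def fat_tree(k):
    hosts = k**3 // 4
    core = (k // 2) ** 2
    switches = 5 * k**2 // 4  # edge + aggregation + core, all k-port
    return hosts, switches, core

hosts, switches, core = fat_tree(48)
print(hosts, switches)  # 27648 2880, matching the 48-port figures
```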
Cost of DCN




              42
Addressing
[Diagram: a host ARPs for 10.2.4.5; the controller proxies the ARP and
answers with the PMAC 00:02:00:02:00:01 rather than the actual MAC;
switches rewrite between AMAC and PMAC using the tables below.]

Controller mapping:
    IP          PMAC (Location)
    10.5.1.2    00:00:01:02:00:01
    10.2.4.5    00:02:00:02:00:01

Switch PKT rewrite table:
    IP          AMAC (Identity)       PMAC (Location)
    10.2.4.5    00:19:B9:FA:88:E2     00:02:00:02:00:01

• AMAC: identity, maintained at switches
• PMAC: (pod, position, port, vmid); IP → PMAC mapped at the controller
• Routing: static VLAN or ECMP hashing (to be presented later)
• Switches: 32–64 K flow entries, 640 KB
• Assume 10,000k VMs on 500k servers
• Identity-based: 10,000k flat entries, ~100 MB (huge); flexible, per-VM/app;
  supports VM migration with continuous connections
• Location-based: 1k hierarchical entries, ~10 KB (easy storage); fixed,
  per-server; easy forwarding, no extra reconfiguration
• Consistency / efficiency / fault tolerance? Solved by assigning different
  roles to (controller, switch, host)
• Implemented as server-centric and switch-centric designs                   43
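A sketch of the PMAC encoding, assuming the (pod:16, position:8, port:8, vmid:16) bit layout implied by the slide's example addresses (the actual PortLand field widths may differ):

```python
# Sketch of the PMAC layout inferred from the slide's example addresses:
# pod (16 bits) : position (8) : port (8) : vmid (16). These widths are
# an assumption based on the byte grouping shown, as in PortLand.
def encode_pmac(pod, position, port, vmid):
    raw = (pod << 32) | (position << 24) | (port << 16) | vmid
    return ":".join(f"{(raw >> s) & 0xFF:02x}" for s in range(40, -8, -8))

print(encode_pmac(pod=0, position=1, port=2, vmid=1))   # 00:00:01:02:00:01
print(encode_pmac(pod=2, position=0, port=2, vmid=1))   # 00:02:00:02:00:01
```

Because the hierarchy is baked into the address, a switch can forward on a PMAC prefix (pod, then position) with a handful of entries instead of one flat entry per VM.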
Load Balancing / Multipathing
[Diagram: per-flow hashing / randomization on the left vs. pre-configured
VLAN tunnels on the right]

  End hosts stay “transparent”: they send traffic to the network as usual, without seeing details
  OpenFlow: the controller talks to HW/SW switches and kernel agents, and manipulates entries

• Clusters grow larger, nodes demand faster links
• Network delay / packet loss → performance ↓
• Still, only commodity hardware
• Aggregated individual small demands
  → traffic extremely volatile / unpredictable
• Traffic matrix: dynamic, evolving, not steady
• Users don’t know the infrastructure or topology
• Operators don’t know the applications or traffic
• Need to utilize multiple paths and capacity!
• VLAN: multiple preconfigured tunnels → topology-dependent
• Multipath TCP: modified transport mechanism
  → distributes and shifts load among paths
• ECMP/VLB: randomization, header hashing
  → only randomizes upward paths, only for symmetric traffic              44
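Per-flow ECMP hashing can be illustrated with a toy path selector (illustration only; real switches hash in hardware with vendor-specific functions):

```python
import hashlib

# Toy per-flow ECMP path choice (illustration only; real switches hash
# in hardware with vendor-specific functions): hashing the 5-tuple keeps
# all packets of one flow on one path while spreading flows across paths.
def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

path = ecmp_path("10.0.0.1", "10.0.1.1", 5000, 80, "tcp", n_paths=4)
assert path == ecmp_path("10.0.0.1", "10.0.1.1", 5000, 80, "tcp", n_paths=4)
```

Keeping a flow on one path avoids packet reordering, but it is exactly this static choice that lets two elephant flows collide, which motivates the flow scheduling on the next slide.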
Flow Scheduling

• ECMP hashing → static per-flow paths
• Long-lived elephant flows may collide
• Some links full, others under-utilized
• Flow-to-core mappings: re-allocate flows
• What time granularity? Fast enough?
• Controller computation? Scalable?
                                                                                  45
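A simplified, Hedera-like greedy placement (my sketch, not the published algorithm) shows how a controller could re-allocate elephant flows to the least-loaded core path:

```python
# A simplified, Hedera-like greedy placement (my sketch, not the
# published algorithm): put each elephant flow on the currently
# least-loaded of the equal-cost core paths, biggest flows first.
def schedule_flows(flow_demands, n_paths):
    load = [0.0] * n_paths
    placement = {}
    for flow, demand in sorted(flow_demands.items(), key=lambda kv: -kv[1]):
        path = min(range(n_paths), key=lambda p: load[p])
        placement[flow] = path
        load[path] += demand
    return placement, load

placement, load = schedule_flows({"f1": 0.9, "f2": 0.8, "f3": 0.2}, 2)
print(placement)  # the two big flows land on different paths
```

The open questions on the slide apply directly to this loop: how often it can run, and whether the controller can compute placements for a data-center-scale flow population.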
Reactive Reroute
[Diagram: switch queue with occupancy Q, equilibrium Qeq, and offset Qoff;
feedback message FB sent back to the source]

• Congestion Point (CP): switch
  – Samples incoming packets
  – Monitors and maintains the queue level
  – Sends feedback messages to the source, scaled by queue length
  – May choose to re-hash elephant flows
• Reaction Point (RP): source rate limiter
  – Decreases rate according to feedback
  – Increases rate by counter / timer
• QCN in the IEEE 802.1Qau task group
  – For converged networks; assures zero drop
  – Like TCP AIMD but at L2, with a different purpose
  – The CP reacts directly, not end-to-end
  – Can be utilized for reactive reroute
• May differentiate feedback messages
  – Decrease more for lower classes (QoS)
  – Decrease more for larger flows (fairness)
• Large flows are suppressed → high delay                                 46
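A simplified rendering of the QCN-style feedback loop (the weight W and gain GD values are assumed, and the real 802.1Qau state machine has more detail):

```python
# Simplified QCN-style feedback (illustration; the 802.1Qau state
# machine has more detail). W and GD are assumed parameter values.
W, GD = 2.0, 1 / 128

def feedback(q, q_old, q_eq):
    q_off = q - q_eq        # how far the queue sits above equilibrium Qeq
    q_delta = q - q_old     # how fast the queue is growing
    return -(q_off + W * q_delta)

def react(rate, fb):
    # Rate limiter at the source: multiplicative decrease on congestion.
    return rate * (1 - GD * min(abs(fb), 64)) if fb < 0 else rate

fb = feedback(q=40, q_old=30, q_eq=20)  # congested: negative feedback
print(fb, react(10e9, fb))              # rate drops below 10 Gb/s
```

Because the feedback magnitude scales with queue length, large flows receive bigger cuts, which is both the fairness lever and the source of the high-delay suppression noted above.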
Controller
• DCN relies on controller for many functions:
  – Address mapping / mgmt / registration / reuse
  – Traffic load scheduling / balancing
  – Route computation, switch entries configuration
  – Logical network view ↔ physical construction
• An example: Onix
  – Distributed system
  – Maintain, exchange &
    distribute net states
     • Hard static: SQL DB
     • Soft dynamic: DHT
  – Asynchronous but
    eventually consistent
                                                  47
Tenant View vs Provider View
Onix Functions
[Diagram: the control plane / applications issue control commands through
an API; Onix (the network OS) and its network hypervisor map the logical
forwarding plane (logical state abstractions) to real network state via
OpenFlow, distributing and configuring switches through a Network
Information Base]
                                                                            49
OpenStack Quantum Service




XenServer: Domain 0   Kernel-based VM: Linux Kernel
Always Call for Controller?

ASIC switching rate
Latency: ~5 µs




                                      51
Always Call for Controller?
CPU Controller
Latency: 2 ms
A huge waste
of resources!




                                    52
Conclusion
• Concept of cloud computing is not brand new
  – But with new usage, demand, and economy
  – Aggregated individual small demands
  – Thus pressures traditional data centers
  – Clusters of commodities for performance and
    economy of scale
• Data Center Network challenges
  – Carry tons of apps, tenants, and compute tasks
  – Network delay / loss = service bottleneck
  – Still no consistent system / traffic / analysis model
  – Large-scale constructs, no public traces; how practical?
                                                    53
Questions?




             54
Reference
•   YA-YUNN SU, “Topics in Cloud Computing”, NTU CSIE 7324
•   Luiz André Barroso and Urs Hölzle, “The Datacenter as a Computer - An Introduction to the
    Design of Warehouse-Scale Machines”, Google Inc.
•   吳柏均,郭嘉偉, “MapReduce: Simplified Data Processing on Large Clusters”, CSIE 7324 in
    class presentation slides.
•   Stanford, “Data Mining”, CS345A,
    http://www.stanford.edu/class/cs345a/slides/02-mapreduce.pdf
•   Dr. Allen D. Malony, CIS 607: Seminar in Cloud Computing, Spring 2012, U. Oregon
    http://prodigal.nic.uoregon.edu/~hoge/cis607/
•   Manel Bourguiba, Kamel Haddadou, Guy Pujolle, “Packet aggregation based network i/o
    virtualization for cloud computing”, Computer Communication 35, 2012
•   Eric Keller, Jennifer Rexford, “The ‘Platform as a Service’ Model for Networking”, in Proc.
    INM/WREN, 2010
•   Martin Casado, Teemu Koponen, Rajiv Ramanathan, Scott Shenker, “Virtualizing the
    Network Forwarding Plane”, in Proc. PRESTO (November 2010)
•   Guohui Wang, T. S. Eugene Ng, “The Impact of Virtualization on Network Performance of
    Amazon EC2 Data Center”, IEEE INFOCOM 2010
•   OpenStack Documentation
    http://docs.openstack.org/
                                                                                           55
Reference
•   Bret Piatt, OpenStack Overview, OpenStack Tutorial
    http://salsahpc.indiana.edu/CloudCom2010/slides/PDF/tutorials/OpenStackTutorialIEEEClo
    udCom.pdf
    http://www.omg.org/news/meetings/tc/ca-10/special-events/pdf/5-3_Piatt.pdf
•   Vishvananda Ishaya, Networking in Nova
    http://unchainyourbrain.com/openstack/13-networking-in-nova
•   Jaesuk Ahn, OpenStack, XenSummit Asia
    http://www.slideshare.net/ckpeter/openstack-at-xen-summit-asia
    http://www.slideshare.net/xen_com_mgr/2-xs-asia11kahnopenstack
•   Salvatore Orlando, Quantum: Virtual Networks for Openstack
    http://qconlondon.com/dl/qcon-london-
    2012/slides/SalvatoreOrlando_QuantumVirtualNetworksForOpenStackClouds.pdf
•   Dan Wendlandt, Openstack Quantum: Virtual Networks for OpenStack
    http://www.ovirt.org/wp-content/uploads/2011/11/Quantum_Ovirt_discussion.pdf
•   David A. Maltz, “Data Center Challenges: Building Networks for Agility, Senior
    Researcher, Microsoft”, Invited Talk, 3rd Workshop on I/O Virtualization, 2011
    http://static.usenix.org/event/wiov11/tech/slides/maltz.pdf
•   Amin Vahdat, “PortLand: Scaling Data Center Networks to 100,000 Ports and
    Beyond”, Stanford EE Computer Systems Colloquium, 2009
    http://www.stanford.edu/class/ee380/Abstracts/091118-DataCenterSwitch.pdf
                                                                                     56
Reference
•   Mohammad Al-Fares , Alexander Loukissas , Amin Vahdat, “A scalable, commodity data
    center network architecture”, ACM SIGCOMM 2008
•   Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon
    Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, Sudipta Sengupta, “VL2: a scalable
    and flexible data center network”, ACM SIGCOMM 2009
•   Radhika Niranjan Mysore, Andreas Pamboris, Nathan Farrington, Nelson Huang, Pardis
    Miri, Sivasankar Radhakrishnan, Vikram Subramanya, Amin Vahdat, “PortLand: a scalable
    fault-tolerant layer 2 data center network fabric”, ACM SIGCOMM 2009
•   Jayaram Mudigonda, Praveen Yalagandula, Mohammad Al-Fares, Jeffrey C.
    Mogul, “SPAIN: COTS data-center Ethernet for multipathing over arbitrary
    topologies”, USENIX NSDI 2010
•   Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer
    Rexford, Scott Shenker, Jonathan Turner, “OpenFlow: enabling innovation in campus
    networks”, ACM SIGCOMM 2008
•   Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin
    Vahdat, “Hedera: dynamic flow scheduling for data center networks”, USENIX NSDI 2010
•   M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M.
    Seaman, “Data center transport mechanisms: Congestion control theory and IEEE
    standardization,” Communication, Control, and Computing, 2008 46th Annual Allerton
    Conference on
                                                                                           57
Reference
•   A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar. “AF-QCN: Approximate
    fairness with quantized congestion notification for multitenanted data centers”, In High
    Performance Interconnects (HOTI), 2010, IEEE 18th Annual Symposium on
•   Adrian S.-W. Tam, Kang Xi, H. Jonathan Chao, “Leveraging Performance of Multiroot Data
    Center Networks by Reactive Reroute”, 2010 18th IEEE Symposium on High Performance
    Interconnects
•   Daniel Crisan, Mitch Gusat, Cyriel Minkenberg, “Comparative Evaluation of CEE-based
    Switch Adaptive Routing”, 2nd Workshop on Data Center - Converged and Virtual Ethernet
    Switching (DC CAVES), 2010
•   Teemu Koponen et al., “Onix: A distributed control platform for large-scale production
    networks”, OSDI, Oct, 2010
•   Andrew R. Curtis (University of Waterloo); Jeffrey C. Mogul, Jean Tourrilhes, Praveen
    Yalagandula, Puneet Sharma, Sujata Banerjee (HP Labs), SIGCOMM 2011




                                                                                             58
Backup Slides




                59
Symmetric Multi-Processing (SMP): Several
 CPUs on shared RAM-like memory




↑Data distributed evenly among nodes


                                             60
Computer Room Air Conditioning
CIFDAQ
 
PPTX
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
PPTX
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
PDF
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
PDF
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
PDF
From Code to Challenge: Crafting Skill-Based Games That Engage and Reward
aiyshauae
 
PDF
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
PDF
July Patch Tuesday
Ivanti
 
PDF
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
PDF
Chris Elwell Woburn, MA - Passionate About IT Innovation
Chris Elwell Woburn, MA
 
PDF
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
PDF
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
PDF
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
PDF
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 
PDF
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
PDF
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
PPTX
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 
"Autonomy of LLM Agents: Current State and Future Prospects", Oles` Petriv
Fwdays
 
OpenID AuthZEN - Analyst Briefing July 2025
David Brossard
 
LLMs.txt: Easily Control How AI Crawls Your Site
Keploy
 
CIFDAQ Weekly Market Wrap for 11th July 2025
CIFDAQ
 
COMPARISON OF RASTER ANALYSIS TOOLS OF QGIS AND ARCGIS
Sharanya Sarkar
 
Building Search Using OpenSearch: Limitations and Workarounds
Sease
 
CIFDAQ Market Insights for July 7th 2025
CIFDAQ
 
Log-Based Anomaly Detection: Enhancing System Reliability with Machine Learning
Mohammed BEKKOUCHE
 
From Code to Challenge: Crafting Skill-Based Games That Engage and Reward
aiyshauae
 
Fl Studio 24.2.2 Build 4597 Crack for Windows Free Download 2025
faizk77g
 
July Patch Tuesday
Ivanti
 
Agentic AI lifecycle for Enterprise Hyper-Automation
Debmalya Biswas
 
Chris Elwell Woburn, MA - Passionate About IT Innovation
Chris Elwell Woburn, MA
 
Bitcoin for Millennials podcast with Bram, Power Laws of Bitcoin
Stephen Perrenod
 
Reverse Engineering of Security Products: Developing an Advanced Microsoft De...
nwbxhhcyjv
 
[Newgen] NewgenONE Marvin Brochure 1.pdf
darshakparmar
 
CIFDAQ Token Spotlight for 9th July 2025
CIFDAQ
 
NewMind AI - Journal 100 Insights After The 100th Issue
NewMind AI
 
Building Real-Time Digital Twins with IBM Maximo & ArcGIS Indoors
Safe Software
 
WooCommerce Workshop: Bring Your Laptop
Laura Hartwig
 

Introduction to Cloud Data Center and Network Issues

  • 1. Introduction to Cloud Data Center and Network Issues Presenter: Jason, Tsung-Cheng, HOU Advisor: Wanjiun Liao July 2nd, 2012 1
  • 2. Agenda • Cloud Computing / Data Center Basic Background • Enabling Technology • Infrastructure as a Service A Cloud DC System Example • Networking Issues in Cloud DC 2
  • 3. Brand New Technology ?? • Not exactly, for large scale computing in the past: utility mainframe, grid computing, super computer • Past demand: scientific computing, large scale engineering (finance, construction, aerospace) • New demand: search, e-commerce, content streaming, application/web hosting, IT outsourcing, mobile/remote apps, big data processing… • Difference: aggregated individual small demand, highly volatile and dynamic, not all profitable – Seek economies of scale to cut costs – Rely on resilient, flexible and scalable infrastructure 3
  • 4. Cloud Data Center Traditional Data Center Cloud Data Center Co-located Integrated Servers Dependent Failure Fault-Tolerant Partitioned Unified Resources Performance Interrelated Performance Isolated Separated Centralized Full Control Management Manual With Automation Plan Ahead Flexible Scheduling Overprovisioning Scalable Renting Per Physical Machines Per Logical Usage Application / Runs and Moves across Fixes on Designated Servers Services All VMs Cloud Computing: Cloud DC Requirements: End Device / Client • On-Demand Self-Service • Measured Usage • Resource Pooling • Broad Network Access Common App. Platform • Rapid Elasticity Network Dependent Cloud Data Center 4
  • 5. Server and Switch Organization What’s on Amazon? Dropbox, Instagram, Netflix, Pinterest Foursquare, Quora, Twitter, Yelp Nasdaq, New York Times…. …and a lot more
  • 7. Clusters of Commodities • Current cloud DC achieves high performance using commodity servers and switches →no specialized solution for supercomputing • Supercomputing still exists, an example is the Symmetric Multi-Processing server →128 cores on shared RAM-like memory • Compared to 32 LAN-connected 4-core servers – Accessing global data, SMP: 100ns, LAN: 100us • Computing penalty from delayed LAN-access • Performance gain when clusters grow large 7
  • 8. Penalty for Latency in LAN-access This is not a comparison of server-cluster and single high-end server. f: # global data I/O in 10ms High: f = 100 Medium: f = 10 Low: f = 1 ?
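The penalty above can be sanity-checked with the slide's own numbers (SMP global access ~100 ns, LAN ~100 µs, f global accesses per 10 ms of work). A hedged back-of-envelope sketch, not a model from the talk:

```python
def slowdown(f, work_ms=10.0, lan_us=100.0, smp_ns=100.0):
    """Ratio of LAN-cluster time to SMP time for 10 ms of work
    that performs f global-data accesses (slide's numbers)."""
    lan_ms = work_ms + f * lan_us / 1000.0   # each LAN access adds 100 us
    smp_ms = work_ms + f * smp_ns / 1e6      # each SMP access adds 100 ns
    return lan_ms / smp_ms

# High case f = 100: ~10 ms of waiting per 10 ms of work -> about 2x slower
print(round(slowdown(100), 2))  # → 2.0
```

This is why the slide flags high-f workloads: LAN latency, not compute, dominates.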
  • 9. Performance gain when clusters grow large 9
  • 10. Agenda • Cloud Computing / Data Center Basic Background • Enabling Technology • Infrastructure as a Service A Cloud DC System Example • Networking Issues in Cloud DC 10
  • 11. A DC-wide System • Has software systems consisting of: – Distributed system, logical clocks, coordination and locks, remote procedural call…etc – Distributed file system – (We do not go deeper into above components) – Parallel computation: MapReduce, Hadoop • Virtualized Infrastructure: – Computing: Virtual Machine / Hypervisor – Storage: Virtualized / distributed storage – Network: Network virtualization…the next step? 11
  • 12. MapReduce • 100 TB datasets – Scanning on 1 node – 23 days – On 1000 nodes – 33 minutes • Single machine performance does not matter – Just add more… but HOW to use so many clusters ? – How to make distributed programming simple and elegant ? • Sounds great, but what about MTBF? – MTBF = Mean Time Between Failures – 1 node – once per 3 years – 1000 nodes – 1 node per 1 day • MapReduce refer to both: – Programming framework – Fault-tolerant runtime system 12
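The MTBF bullets scale as the slide suggests; assuming independent node failures (my assumption, not stated on the slide), the expected time between failures shrinks linearly with cluster size:

```python
DAYS_PER_YEAR = 365

def cluster_mtbf_days(node_mtbf_years, n_nodes):
    # with n independent nodes, failures arrive ~n times as often
    return node_mtbf_years * DAYS_PER_YEAR / n_nodes

print(cluster_mtbf_days(3, 1))     # → 1095.0  (one node: a failure every ~3 years)
print(cluster_mtbf_days(3, 1000))  # → 1.095   (1000 nodes: ~one failure per day)
```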
  • 13. MapReduce: Word Counting Shuffle and Sort ↓↓↓ 13
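The word-counting slide can be sketched in a few lines, a minimal single-process imitation of the map → shuffle/sort → reduce pipeline (not the fault-tolerant runtime, which is MapReduce's real contribution):

```python
from collections import defaultdict

def map_phase(text):
    # map: emit (word, 1) for every word
    return [(word, 1) for word in text.split()]

def shuffle(pairs):
    # shuffle & sort: group values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: sum the counts per word
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase("the quick the lazy the")))
print(counts["the"])  # → 3
```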
  • 14. MapReduce: A Diagram ←Shuffle ← Sort 14
  • 15. Distributed Execution Overview Master also deals with: • Worker status updates User • Fault-tolerance Program • I/O Scheduling fork fork • Automatic distribution fork • Automatic parallelization Master assign assign map reduce Input Data Worker write Output local Worker File 0 Split 0 read write Split 1 Worker Split 2 Output Worker File 1 Worker remote read,sort ↑↑↑↑↑ Shuffle & Sort
  • 16. VM and Hypervisor • Virtual Machine: A software package, sometimes using hardware acceleration, that allows an isolated guest operating system to run within a host operating system. • Stateless: Once shut down, all HW states disappear. • Hypervisor: A software platform that is responsible for creating, running, and destroying multiple virtual machines. • Type I and Type II hypervisor 16
  • 17. Type 1 vs. Type 2 Hypervisor 17
  • 18. Concept of Virtualization • Decoupling HW/SW by abstraction & layering • Using, demanding, but not owning or configuring • Resource pool: flexible to slice, resize, combine, and distribute • A degree of automation by software HOST 1 HOST 2 HOST 3 HOST 4, VMs Hypervisor: Turns 1 server into many “virtual machines” (instances or VMs) (VMWare ESX, Citrix XEN Server, KVM, Etc.) 18
  • 19. Concept of Virtualization • Hypervisor: abstraction for HW/SW • For SW: Abstraction and automation of physical resources – Pause, erase, create, and monitor – Charge services per usage units • For HW: Generalized interaction with SW – Access control – Multiplex and demultiplex • Ultimate hypervisor control from operator • Benefit? Monetize operator capital expense 19
  • 20. I/O Virtualization Model • Protect I/O access, multiplex / demultiplex traffic • Deliver PKTs among VMs in shared memory • Performance bottleneck: Overhead when communicating between driver domain and VMs • VM scheduling and long queue→delay/throughput variance Bottleneck: CPU/RAM I/O lag VM Scheduling I/O Buffer Queue 20
  • 21. Agenda • Cloud Computing / Data Center Basic Background • Enabling Technology • Infrastructure as a Service A Cloud DC System Example • Networking Issues in Cloud DC 21
  • 22. OpenStack Status • OpenStack – Founded by NASA and Rackspace in 2010 – Today 183 companies and 3386 people – Was only 125 and 1500 in fall 2011 – Growing fast now, latest release Essex, Apr. 5th • Aligned release cycle with Ubuntu, Apr. / Oct. • Aims to be the “Linux” of cloud computing systems • Open-source vs. Amazon and VMware • Start-ups are happening around OpenStack • Still lacks big use cases and implementations 22
  • 23. A Cloud Management Layer Is Missing Questions arise as the environment grows... “VM sprawl” can make things unmanageable very quickly APPS USERS ADMINS How do you empower employees to self-service? How do you make your apps cloud aware? Where should you provision new VMs? How do you keep track of it all? 1. Server Virtualization 2. Cloud Data Center 3. Cloud Federation
  • 24. Solution: OpenStack, The Cloud Operating System Cloud Operating System A new management layer that adds automation and control APPS USERS ADMINS CLOUD OPERATING SYSTEM 1. Server Virtualization Server Virtualization 2. Cloud Data Center 3. Cloud Federation
  • 25. A common platform is here. Common Platform OpenStack is open source software powering public and private clouds. Private Cloud: Public Cloud: OpenStack enables cloud federation Connecting clouds to create global resource pools Washington Common software platform making Federation possible Texas California Europe 1. Server Virtualization Virtualization 2. Cloud Data Center 3. Cloud Federation
  • 26. Horizon OpenStack Key Components Glance Swift Nova Keystone
  • 27. Keystone Main Functions • Provides 4 primary services: – Identity: User information authentication – Token: After logged in, replace account-password – Service catalog: Service units registered – Policies: Enforces different user levels • Can be backed by different databases. 27
  • 29. Swift Implementation Duplicated storage, load balancing ↑ Logical view ↓Physical arrangement ← Stores real objects ←Stores object metadata ↑Stores container / object metadata
  • 30. Glance • Image storage and indexing. • Keeps a database of metadata associated with an image, discover, register, and retrieve. • Built on top of Swift, images store in Swift • Two servers: – Glance-api: public interface for uploading and managing images. – Glance-registry: private interface to metadata database • Support multiple image formats 30
  • 31. Glance Process Upload or Store Download or Get 31
  • 32. Nova • Major components: – API: public facing interface – Message Queue: Broker to handle interactions between services, currently based on RabbitMQ – Scheduler: coordinates all services, determines placement of new resources requested – Compute Worker: hosts VMs, controls hypervisor and VMs when receives cmds on Msg Queue – Volume: manages permanent storage 32
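The broker-mediated interaction between the scheduler and a compute worker can be illustrated with a plain in-process queue standing in for RabbitMQ (a hedged sketch of the pattern only; real Nova speaks AMQP through its own RPC layer, and `run_instance` here is just an illustrative method name):

```python
from queue import Queue

bus = Queue()  # stand-in for the RabbitMQ broker

def scheduler_cast(method, **kwargs):
    # scheduler publishes a request onto the message queue
    bus.put({"method": method, "args": kwargs})

def compute_worker_poll():
    # compute worker consumes one message and acts on it
    msg = bus.get()
    if msg["method"] == "run_instance":
        return f"booted {msg['args']['instance_id']}"
    return "ignored"

scheduler_cast("run_instance", instance_id="vm-1")
print(compute_worker_poll())  # → booted vm-1
```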
  • 37. Agenda • Cloud Computing / Data Center Basic Background • Enabling Technology • Infrastructure as a Service A Cloud DC System Example • Networking Issues in Cloud DC 37
  • 38. Primitive OpenStack Network • Each VM network owned by one network host – Simply a Linux running Nova-network daemon • Nova Network node is the only gateway • Flat Network Manager: – Linux networking bridge forms a subnet – All instances attached same bridge – Manually configure server, controller, and IP • Flat DHCP Network Manager: – Add DHCP server along same bridge • Only gateway, per-cluster, fragmentation 38
  • 39. OpenStack Network Linux server running Nova-network daemon. VMs bridged into a raw Ethernet device 39 The only gateway of all NICs bridged into the net.
  • 40. Conventional DCN Topology Public Internet DC Layer-3 DC Layer-2 • Oversubscription • Scale-up proprietary design – expensive • Fragmentation of resources: • Inflexible addressing, static routing Network limits cross-DC communication • Inflexible network configuration • Hinders applications’ scalability Protocol baked / embedded on chips • Only reachability isolation Dependent performance bottleneck 40
  • 41. A New DCN Topology Core Switches Full Bisection Aggr Full Bisection Edge Pod-0 Pod-1 k=4 • k pods, each with k²/4 hosts and k switches • Cabling explosion, copper trans. range • (k/2)² core switches, (k/2)² paths per S-D pair • Existing addressing/routing/forwarding do • 5k²/4 k-port switches, k³/4 hosts in total not work well on fat-tree / clos • 48-port: 27,648 hosts, 2,880 switches • Scalability issue with millions of end hosts • Full bisection BW at each level • Configuration of millions of parts • Modular scale-out cheap design 41
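The slide's counts follow directly from the fat-tree formulas; a small sketch that reproduces the 48-port numbers:

```python
def fat_tree(k):
    """Host/switch counts for a k-ary fat-tree (k even):
    k pods, each with k^2/4 hosts and k switches, plus (k/2)^2 cores."""
    hosts = k ** 3 // 4
    core = (k // 2) ** 2
    switches = k * k + core   # k switches per pod * k pods + cores = 5k^2/4
    return hosts, switches

print(fat_tree(48))  # → (27648, 2880), matching the slide
```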
  • 43. IP PMAC (Location) 10.5.1.2 10.2.4.5 (00:00):01:02:(00:01) (00:02):00:02:(00:01) Addressing Controller 4 Proxy ARP 2 Switch PKT Rewrite IP AMAC (Identity) PMAC (Location) 5 10.2.4.5 00:19:B9:FA:88:E2 00:02:00:02:00:01 dst IP dst MAC MAC 1 ARP 10.2.4.5 00:02:00:02:00:01 ??? 3 • Switches: 32~64 K flow entries, 640 KB • AMAC: Identity, maintained at switches • Assume 10,000k VMs on 500k servers • PMAC: (pod,position,port,vmid) • Identity-based: 10,000k flat entries, IP→ PMAC, mapped at controller 100 MB huge, flexible, per-VM/APP • Routing: Static VLAN or ECMP-hashing VM migration, continuous connection (To be presented later) • Location-based: 1k hierarchical entries • Consistency / efficiency / fault-tolerant? 10 KB easy storage, fixed, per-server Solve by (controller, SW, host) diff. roles Easy forwarding, no extra reconfiguration • Implemented: server- / switch- centric 43
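The slide's PMAC example can be reproduced by packing (pod, position, port, vmid) into a 48-bit MAC. The 16/8/8/16-bit field split below is an assumption chosen to match the slide's rendering (00:00):01:02:(00:01), not necessarily PortLand's exact layout:

```python
def pmac(pod, position, port, vmid):
    # pack pod(16b) : position(8b) : port(8b) : vmid(16b) into 48 bits
    value = (pod << 32) | (position << 24) | (port << 16) | vmid
    octets = [(value >> (8 * i)) & 0xFF for i in reversed(range(6))]
    return ":".join(f"{o:02x}" for o in octets)

# the VM at pod 2, position 0, port 2, vmid 1 from the slide's table
print(pmac(pod=2, position=0, port=2, vmid=1))  # → 00:02:00:02:00:01
```

Because location is encoded hierarchically, a switch needs only ~1k prefix-like entries instead of 10,000k flat AMAC entries.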
  • 44. Load Balancing / Multipathing Per-flow hashing Pre-configured Randomization VLAN Tunnels End hosts “transparent”: Sends traffic to networks as usual, without seeing detail OpenFlow: Controller talks to (HW/SW switches, kernel agents), manipulates entries • Clusters grow larger, nodes demand faster • Need to utilize multiple paths and • Network delay / PKT loss → Performance ↓ capacity! • Still, only commodity hardware • VLAN: multiple preconfigured tunnels • Aggregated individual small demand →Topological dependent → Traffic extremely volatile / unpredictable • Multipath-TCP: modified transport mech. • Traffic matrix: dynamic, evolving, not steady →Distributes and shifts load among paths • User: Don’t know infrastructure, topology • ECMP/VLB: Randomization, header hash • Operator: Don’t know application, traffic →Only randomized upward paths 44 →Only for symmetric traffic
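Per-flow hashing, the mechanism behind the ECMP/VLB bullets above, can be sketched as hashing the 5-tuple so every packet of a flow takes the same path while different flows spread across paths (a simplified model; real switches hash in ASIC, not with SHA-256):

```python
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, n_paths):
    # hash the flow's 5-tuple, then pick one of n equal-cost paths
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

a = ecmp_path("10.0.0.1", "10.0.1.2", 5000, 80, "tcp", 4)
b = ecmp_path("10.0.0.1", "10.0.1.2", 5000, 80, "tcp", 4)
print(a == b)  # → True: a flow is pinned to one path (hence elephant collisions)
```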
  • 45. Flow Scheduling • ECMP-hashing → per-flow static path • Flow-to-core mappings, re-allocate flows • Long-lived elephant flows may collide • What time granularity? Fast enough? • Some links full, others under-utilized • Controller computation? Scalable? 45
  • 46. Reactive Reroute Qeq Qoff Q • Congestion Point: Switch • QCN in IEEE 802.1Q task group -Samples incoming PKTs -For converged networks, assure zero drop FB -Monitor and maintain queue level -Like TCP AIMD but on L2, w/ diff purpose -Send feedback msg to src -CP directly reacts, not end-to-end -Feedback msg according to Q-len -Can be utilized for reactive reroute -Choose to re-hash elephant flows • May differentiate FB msg • Reaction Point: Source Rate Limiter -Decrease more for lower classes (QoS) -Decrease rate according to feedback -Decrease more for larger flows (Fairness) -Increase rate by counter / timer • Large flows are suppressed → High delay 46
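The reaction-point behavior on the slide (multiplicative decrease on feedback, timer-driven increase) resembles AIMD at L2; a hedged sketch with made-up constants (the 0.5 gain and halfway recovery are illustrative, not the 802.1Qau parameters):

```python
class RateLimiter:
    """QCN-style reaction point at the traffic source."""
    def __init__(self, rate_gbps):
        self.rate = rate_gbps
        self.target = rate_gbps   # rate remembered before the last cut

    def on_feedback(self, fb):
        # congestion feedback fb in (0, 1]: stronger msg -> deeper cut
        self.target = self.rate
        self.rate *= 1.0 - 0.5 * fb

    def on_timer(self):
        # recover halfway toward the pre-cut rate
        self.rate = (self.rate + self.target) / 2.0

r = RateLimiter(10.0)
r.on_feedback(1.0)  # heavy congestion: 10 -> 5 Gb/s
print(r.rate)       # → 5.0
r.on_timer()        # recovery: 5 -> 7.5 Gb/s
print(r.rate)       # → 7.5
```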
  • 47. Controller • DCN relies on controller for many functions: – Address mapping / mgmt / registration / reuse – Traffic load scheduling / balancing – Route computation, switch entries configuration – Logical network view ↔ physical construction • An example: Onix – Distributed system – Maintain, exchange & distribute net states • Hard static: SQL DB • Soft dynamic: DHT – Asynchronous but eventually consistent 47
  • 48. Tenant View vs Provider View
  • 49. Onix Functions Control Plane / Applications API Provides Abstraction Logical Forwarding Plane Control Logical States Provides Commands Abstractions Network Distributed Mapping Info Base System Network Hypervisor Onix / Network OS Distributes, Configures Real States OpenFlow 49
  • 50. OpenStack Quantum Service XenServer: Domain 0 Kernel-based VM: Linux Kernel
  • 51. Always Call for Controller? ASIC switching rate Latency: 5 s 51
  • 52. Always Call for Controller? CPU Controller Latency: 2 ms A huge waste of resources! 52
  • 53. Conclusion • Concept of cloud computing is not brand new – But with new usage, demand, and economy – Aggregated individual small demands – Thus pressures traditional data centers – Clusters of commodities for performance and economy of scale • Data Center Network challenges – Carry tons of apps, tenants, and compute tasks – Network delay / loss = service bottleneck – Still no consistent sys / traffic / analysis model – Large scale construct, no public traces, practical? 53
  • 55. Reference • YA-YUNN SU, “Topics in Cloud Computing”, NTU CSIE 7324 • Luiz André Barroso and Urs Hölzle, “The Datacenter as a Computer - An Introduction to the Design of Warehouse-Scale Machines”, Google Inc. • 吳柏均,郭嘉偉, “MapReduce: Simplified Data Processing on Large Clusters”, CSIE 7324 in-class presentation slides. • Stanford, “Data Mining”, CS345A, http://www.stanford.edu/class/cs345a/slides/02-mapreduce.pdf • Dr. Allen D. Malony, CIS 607: Seminar in Cloud Computing, Spring 2012, U. Oregon http://prodigal.nic.uoregon.edu/~hoge/cis607/ • Manel Bourguiba, Kamel Haddadou, Guy Pujolle, “Packet aggregation based network I/O virtualization for cloud computing”, Computer Communications 35, 2012 • Eric Keller, Jennifer Rexford, “The ‘Platform as a Service’ Model for Networking”, in Proc. INM/WREN, 2010 • Martin Casado, Teemu Koponen, Rajiv Ramanathan, Scott Shenker, “Virtualizing the Network Forwarding Plane”, in Proc. PRESTO (November 2010) • Guohui Wang, T. S. Eugene Ng, “The Impact of Virtualization on Network Performance of Amazon EC2 Data Center”, INFOCOM 2010 • OpenStack Documentation http://docs.openstack.org/ 55
  • 56. Reference • Bret Piatt, OpenStack Overview, OpenStack Tutorial http://salsahpc.indiana.edu/CloudCom2010/slides/PDF/tutorials/OpenStackTutorialIEEECloudCom.pdf http://www.omg.org/news/meetings/tc/ca-10/special-events/pdf/5-3_Piatt.pdf • Vishvananda Ishaya, Networking in Nova http://unchainyourbrain.com/openstack/13-networking-in-nova • Jaesuk Ahn, OpenStack, XenSummit Asia http://www.slideshare.net/ckpeter/openstack-at-xen-summit-asia http://www.slideshare.net/xen_com_mgr/2-xs-asia11kahnopenstack • Salvatore Orlando, Quantum: Virtual Networks for OpenStack http://qconlondon.com/dl/qcon-london-2012/slides/SalvatoreOrlando_QuantumVirtualNetworksForOpenStackClouds.pdf • Dan Wendlandt, OpenStack Quantum: Virtual Networks for OpenStack http://www.ovirt.org/wp-content/uploads/2011/11/Quantum_Ovirt_discussion.pdf • David A. Maltz, “Data Center Challenges: Building Networks for Agility”, Senior Researcher, Microsoft, Invited Talk, 3rd Workshop on I/O Virtualization, 2011 http://static.usenix.org/event/wiov11/tech/slides/maltz.pdf • Amin Vahdat, “PortLand: Scaling Data Center Networks to 100,000 Ports and Beyond”, Stanford EE Computer Systems Colloquium, 2009 http://www.stanford.edu/class/ee380/Abstracts/091118-DataCenterSwitch.pdf 56
  • 57. Reference • Mohammad Al-Fares , Alexander Loukissas , Amin Vahdat, “A scalable, commodity data center network architecture”, ACM SIGCOMM 2008 • Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, Sudipta Sengupta, “VL2: a scalable and flexible data center network”, ACM SIGCOMM 2009 • Radhika Niranjan Mysore, Andreas Pamboris, Nathan Farrington, Nelson Huang, Pardis Miri, Sivasankar Radhakrishnan, Vikram Subramanya, Amin Vahdat, “PortLand: a scalable fault-tolerant layer 2 data center network fabric”, ACM SIGCOMM 2009 • Jayaram Mudigonda, Praveen Yalagandula, Mohammad Al-Fares, Jeffrey C. Mogul, “SPAIN: COTS data-center Ethernet for multipathing over arbitrary topologies”, USENIX NSDI 2010 • Nick McKeown, Tom Anderson, Hari Balakrishnan, Guru Parulkar, Larry Peterson, Jennifer Rexford, Scott Shenker, Jonathan Turner, “OpenFlow: enabling innovation in campus networks”, ACM SIGCOMM 2008 • Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, Amin Vahdat, “Hedera: dynamic flow scheduling for data center networks”, USENIX NSDI 2010 • M. Alizadeh, B. Atikoglu, A. Kabbani, A. Lakshmikantha, R. Pan, B. Prabhakar, and M. Seaman, “Data center transport mechanisms: Congestion control theory and IEEE standardization,” Communication, Control, and Computing, 2008 46th Annual Allerton Conference on 57
  • 58. Reference • A. Kabbani, M. Alizadeh, M. Yasuda, R. Pan, and B. Prabhakar. “AF-QCN: Approximate fairness with quantized congestion notification for multitenanted data centers”, In High Performance Interconnects (HOTI), 2010, IEEE 18th Annual Symposium on • Adrian S.-W. Tam, Kang Xi H., Jonathan Chao , “Leveraging Performance of Multiroot Data Center Networks by Reactive Reroute”, 2010 18th IEEE Symposium on High Performance Interconnects • Daniel Crisan, Mitch Gusat, Cyriel Minkenberg, “Comparative Evaluation of CEE-based Switch Adaptive Routing”, 2nd Workshop on Data Center - Converged and Virtual Ethernet Switching (DC CAVES), 2010 • Teemu Koponen et al., “Onix: A distributed control platform for large-scale production networks”, OSDI, Oct, 2010 • Andrew R. Curtis (University of Waterloo); Jeffrey C. Mogul, Jean Tourrilhes, Praveen Yalagandula, Puneet Sharma, Sujata Banerjee (HP Labs), SIGCOMM 2011 58
  • 60. Symmetric Multi-Processing (SMP): Several CPUs on shared RAM-like memory ↑Data distributed evenly among nodes 60
  • 61. Computer Room Air Conditioning

Editor's Notes

  • #16: This is the distributed execution overview. 1. The user runs the program; the input data is split into many pieces, each 64 MB. 2. The program is copied to many machines; one of them is the master, which assigns some workers to be mappers and others to be reducers. 3. Each mapper reads the content of its corresponding input split, passes each key-value pair to the map function, and buffers the intermediate results in memory. 4. The mapper writes intermediate data to local disk periodically. 5. After all mappers finish, each reducer reads its corresponding intermediate data and sorts the key-value pairs by key; this ensures that data with the same key is grouped together. 6. The reducer runs the reduce function and outputs the result. 7. When all map and reduce tasks finish, the MapReduce job is finished.