Vineet Gupta - GM - Software Engineering - Directi: Intelligent People. Uncommon Ideas
Characteristics
App Tier Scaling
Replication
Partitioning
Consistency
Normalization
Caching
Data Engine Types
Offline Processing (Batching / Queuing)
Distributed Processing
Map Reduce
Non-blocking IO
Fault Detection, Tolerance and Recovery
22M+ users
Dozens of DB servers
Dozens of Web servers
Six specialized graph database servers to run the recommendations engine
Source: http://highscalability.com/digg-architecture
1 TB / day
100 M blogs indexed / day
10 B objects indexed / day
0.5 B photos and videos
Data doubles in 6 months
Users double in 6 months
Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-millionblogs-indexed-everyday/
2 PB raw storage
470 M photos, 4-5 sizes each
400 k photos added / day
35 M photos in Squid cache (total)
2 M photos in Squid RAM
38k reqs / sec to Memcached
4 B queries / day
Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters
2 PB of data
26 B SQL queries / day
1 B page views / day
3 B API calls / month
15,000 App servers
Source: http://highscalability.com/ebay-architecture/
450,000 low cost commodity servers in 2006
Indexed 8 B web pages in 2005
200 GFS clusters (1 cluster = 1,000 to 5,000 machines)
Read / write throughput = 40 GB / sec across a cluster
Map-Reduce
100k jobs / day
20 PB of data processed / day
10k MapReduce programs
Source: http://highscalability.com/google-architecture/
Data size: ~ PB
Data growth: ~ TB / day
No. of servers: 10s to 10,000
No. of datacenters: 1 to 10
Queries: B+ / day
Specialized needs: more / other than RDBMS
[Diagram: App Server and DB Server on one Host]
Cons
Finite limit
Hardware does not scale linearly (diminishing returns for each incremental unit)
Requires downtime
Increases downtime impact
Incremental costs increase exponentially
[Diagram: App Server and DB Server on separate Hosts]
Pros
Increases per-application availability
Task-based specialization, optimization and tuning possible
Reduces context switching
Simple to implement for out-of-band processes
No changes to App required
Flexibility increases
Cons
Sub-optimal resource utilization
May not increase overall availability
Finite scalability
[Diagram: Load Balancer, three Web Servers, DB Server]
Load Balancing
Hardware balancers are faster
Software balancers are more customizable
[Diagram: User 1, User 2, Load Balancer, three Web Servers, DB Server]
[Diagram: User 1, User 2, Load Balancer, three Web Servers, DB Server]
Asymmetrical load distribution
Downtime
[Diagram: User 1, User 2, Load Balancer, three Web Servers, Session Store]
SPOF
Reads and writes generate network + disk IO
[Diagram: User 1, User 2, Load Balancer, three Web Servers]
Pros
No SPOF
Easier to set up
Fast reads
Cons
n x Writes
Increase in network IO with increase in nodes
Stale data (rare)
[Diagram: User 1, User 2, Load Balancer, three Web Servers, DB Server]
No Sessions
Stuff state in a cookie and sign it!
Cookie is sent with every request / response
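A minimal sketch of the signed-cookie idea, assuming Python's standard hmac module. SECRET_KEY, sign_state and verify_state are illustrative names, not taken from the slides. Signing only detects tampering; the state is still readable by the client, so nothing secret should go in it.

```python
import base64
import hashlib
import hmac
import json

SECRET_KEY = b"change-me"  # hypothetical server-side secret, never sent to the client


def sign_state(state: dict) -> str:
    """Serialize the state and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(state, sort_keys=True).encode())
    signature = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_state(cookie: str):
    """Return the state dict if the signature matches, otherwise None."""
    payload, _, signature = cookie.rpartition(".")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload.encode()))
```

Any web server behind the load balancer can verify the cookie with the shared key, so no server-side session store is needed.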
Bad
Sticky sessions
Good
Clustered sessions for small number of nodes and / or small write volume
Central sessions for large number of nodes or large write volume
Great
No Sessions!
CDN
Get closer to your user
Akamai, Limelight
App-Layer
Add more nodes and load balance!
Avoid Sticky Sessions
Avoid Sessions!!
Data Store
Tricky! Very Tricky!!!
[Diagram: App Layer]
Each node has its own copy of data
Shared Nothing Cluster
Master-Slave
Writes sent to one node, cascaded to others
Multi-Master
Writes can be sent to multiple nodes
Can lead to deadlocks
Requires conflict management
[Diagram: App Layer, one Master, four Slaves]
n x Writes - Async vs. Sync
SPOF
Async - Critical Reads from Master!
[Diagram: App Layer, two Masters, three Slaves]
Asynchronous
Guaranteed, but out-of-band replication from Master to Slave
Master updates its own db and returns a response to client
Replication from Master to Slave takes place asynchronously
Faster response to a client
Slave data is marginally behind the Master
Requires modification to App to send critical reads and writes to master, and load balance all other reads
Synchronous
Guaranteed, in-band replication from Master to Slave
Master updates its own db, and confirms all slaves have updated their db before returning a response to client
Slower response to a client
Slaves have the same data as the Master at all times
Requires modification to App to send writes to master and load balance all reads
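The App modification described for asynchronous replication can be sketched as a small router. This is an illustration only: ReplicatedRouter, node_for and the node names are hypothetical, and round robin is just one possible way to load balance reads.

```python
import itertools


class ReplicatedRouter:
    """Send writes and critical reads to the master, other reads to slaves."""

    def __init__(self, master, slaves):
        self.master = master
        # Round robin over the slaves; assumes at least one slave is given.
        self._slaves = itertools.cycle(slaves)

    def node_for(self, operation, critical=False):
        # With async replication a slave may lag, so reads that must see
        # the latest write are flagged critical and go to the master.
        if operation == "write" or critical:
            return self.master
        return next(self._slaves)
```

With synchronous replication the critical flag becomes unnecessary, since every slave already holds the same data as the master.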
Critical reads are sent to a Master
In most cases RDBMS agnostic
Slower and in some cases less reliable
[Diagram: nodes labeled with Read and Read / Write paths]
Vertical Partitioning
Divide data on tables / columns
Scale to as many boxes as there are tables or columns
Finite
Horizontal Partitioning
Divide data on rows
Scale to as many boxes as there are rows!
Limitless scaling
[Diagram: App Layer over tables T1 to T5]
Facebook - User table, posts table can be on separate nodes
Joins need to be done in code (Why have them?)
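A rough sketch of what a join done in code can look like once the user table and posts table sit on separate nodes. The two in-memory structures below are hypothetical stand-ins for those nodes; a real App would issue one query to each.

```python
# Hypothetical stand-ins for two tables living on separate nodes.
USERS_NODE = {1: {"name": "alice"}, 2: {"name": "bob"}}
POSTS_NODE = [
    {"post_id": 10, "user_id": 1, "text": "hello"},
    {"post_id": 11, "user_id": 2, "text": "hi"},
    {"post_id": 12, "user_id": 1, "text": "again"},
]


def posts_with_authors():
    """Join posts to users in application code instead of in SQL."""
    # Step 1: read from the posts node and collect the user ids needed.
    user_ids = {post["user_id"] for post in POSTS_NODE}
    # Step 2: fetch those users from the users node in one batched lookup.
    users = {uid: USERS_NODE[uid] for uid in user_ids}
    # Step 3: stitch the two result sets together in memory.
    return [
        {"text": post["text"], "author": users[post["user_id"]]["name"]}
        for post in POSTS_NODE
    ]
```

Batching the second lookup keeps the cost at two round trips rather than one per post.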
[Diagram: App Layer; tables T1 to T5, each shown on multiple nodes]
Value Based
Split on timestamp of posts
Split on first letter of user name
Hash Based
Use a hash function to determine cluster
Lookup Map
First Come First Serve
Round Robin
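The three schemes above can be sketched side by side. Everything here is illustrative: NUM_SHARDS, the function names and the choice of SHA-256 are assumptions, not details from the slides.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of clusters


def shard_by_value(user_name: str) -> int:
    """Value based: split a-z into NUM_SHARDS ranges on the first letter."""
    first = user_name[0].lower()
    if not "a" <= first <= "z":
        return 0  # assumption: non-alphabetic names go to shard 0
    return (ord(first) - ord("a")) * NUM_SHARDS // 26


def shard_by_hash(key: str) -> int:
    """Hash based: a stable hash of the key determines the shard."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS


class LookupMap:
    """Lookup map: store an explicit key -> shard entry, assigned round robin."""

    def __init__(self):
        self._assignments = {}
        self._next = 0

    def shard_for(self, key: str) -> int:
        if key not in self._assignments:
            self._assignments[key] = self._next
            self._next = (self._next + 1) % NUM_SHARDS
        return self._assignments[key]
```

Value-based splits keep ranges together but can load shards unevenly, hashing spreads keys evenly but loses range locality, and a lookup map allows any placement at the cost of one extra lookup.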
Consistency
Availability
Partition Tolerance
Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
Transaction Serializability
The behavior is as if a serial order exists
[Diagram: transactions Ta through To]
Ti Doesn't Know About These Transactions and They Don't Know About Ti
Each Transaction Only Sees a Simple Advancing of Time with a Clear Set of Preceding Transactions
Source: http://blogs.msdn.com/pathelland/
There is no simultaneity at a distance!
Similar to speed of light
Knowledge travels at speed of light
By the time you see a distant object it may have changed!
By the time you see a message, the data may have changed!
Services, transactions, and locks bound simultaneity!
Inside a transaction, things appear simultaneous (to others)
Simultaneity only inside a transaction!
Simultaneity only inside a service!
Source: http://blogs.msdn.com/pathelland/
All data from distant stars is from the past
10 light years away; 10 year old knowledge
The sun may have blown up 5 minutes ago
We won't know for 3 minutes more
All data seen from a distant service is from the past
By the time you see it, it has been unlocked and may change
This is like going from Newtonian to Einsteinian physics
Newton's time marched forward uniformly
Instant knowledge
Classic distributed computing: many systems look like one
RPC, 2-phase commit, remote method calls
In Einstein's world, everything is relative to one's perspective
Today: No attempt to blur the boundary
Source: http://blogs.msdn.com/pathelland/
Can't have the same data at many locations
Unless it is a snapshot
Changing distributed data needs versions
Creates a snapshot
[Diagram: versioned copies of the Price-List: Monday's, Tuesday's and Wednesday's]
Source: http://blogs.msdn.com/pathelland/
Given what I know here and now, make a decision
Remember the versions of all the data used to make this decision
Record the decision as being predicated on these versions
Other copies of the object may make divergent decisions
Try to sort out conflicts within the family
If necessary, programmatically apologize
Very rarely, whine and fuss for human help
Subjective Consistency
Given the information I have at hand, make a decision and act on it!
Remember the information at hand!
Eventually, all the copies of the object share their changes
I'll show you mine if you show me yours!
Now, apply subjective consistency:
Given the information I have at hand, make a decision and act on it!
Everyone has the same information, everyone comes to the same conclusion about the decisions to take
Eventual Consistency
Given the same knowledge, produce the same result!
Everyone sharing their knowledge leads to the same result...
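One way to picture subjective and eventual consistency is with replicas that each hold a set of observed events. This is a toy sketch under that assumption, not the method from the cited slides; merge, decide and the event tuples are invented for illustration.

```python
def merge(knowledge_a: frozenset, knowledge_b: frozenset) -> frozenset:
    """Share knowledge: set union gives the same result in any order."""
    return knowledge_a | knowledge_b


def decide(knowledge: frozenset) -> int:
    """A deterministic decision from the knowledge at hand: stock remaining."""
    return sum(delta for _event_id, delta in knowledge)


# Each replica has seen the initial stock (e1) plus one local sale.
replica_1 = frozenset({("e1", +10), ("e2", -3)})
replica_2 = frozenset({("e1", +10), ("e3", -4)})
```

Before sharing, the replicas decide differently (7 and 6 units left). After merging in either order both hold the same three events and reach the same answer, which is the property the slides call eventual consistency.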
Emp Phone  Mgr #  Mgr Name  Mgr Phone
5-1234     13     Sam       6-9876
3-3123     38     Harry     5-6782
2-1112     13     Sam       6-9876
5-7349     02     Betty     4-0101
Source: http://blogs.msdn.com/pathelland/
affiliations table
affiliation_id  description   member_count
42              Microsoft     18,656
598             Georgia Tech  23,488

user table
user_id  first_name  last_name  sex   hometown     relationship_status  interested_in  religious_views  political_views
12345    John        Doe        Male  Atlanta, GA  married              women          (null)           (null)

user_id (foreign_key)  affiliation_id (foreign key)
12345                  42
12345                  598

user_phone_numbers table
user_id (foreign_key)  phone_number  phone_type
12345                  425-555-1203  Home
12345                  425-555-6161  Work
12345                  206-555-0932  Cell

user_screen_names table
user_id (foreign_key)  screen_name             im_service
12345                  geeknproud@example.com  AIM
12345                  voip4life@example.org   Skype

user_work_history table
user_id (foreign_key)  company_affiliation_id (foreign key)  company_name     job_title
12345                  42                                    Microsoft        Program Manager
12345                  78                                    i2 Technologies  Quality Assurance Engineer
Many Kinds of Computing are Append-Only
Lots of observations are made about the world
Debits, credits, Purchase-Orders, Customer-Change-Requests, etc.
As time moves on, more observations are added
You can't change the history but you can add new observations
Derived Results May Be Calculated
Estimate of the current inventory
Frequently inaccurate
Historic Rollups Are Calculated
Monthly bank statements
[Diagram: versioned copies of the Price-List: Monday's, Tuesday's and Wednesday's]
Source: http://blogs.msdn.com/pathelland/
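The append-only idea above can be sketched as a ledger where history is never edited and the current state is derived from it. The names ledger, record and balances are hypothetical.

```python
from collections import defaultdict

# Append-only list of observations; entries are never updated or deleted.
ledger = []


def record(account: str, amount: int) -> None:
    """Add a new observation (a debit or credit) to the history."""
    ledger.append((account, amount))


def balances() -> dict:
    """Derived result: recomputed by replaying the full history."""
    totals = defaultdict(int)
    for account, amount in ledger:
        totals[account] += amount
    return dict(totals)
```

A mistake is corrected by recording a compensating entry rather than by changing an old one, so every earlier derived result can still be reproduced.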
[Diagram: three App Servers, each paired with a Cache]
In-memory Distributed Hash Table
Memcached instance manifests as a process (often on the same machine as web-server)
Memcached Client maintains a hash table
Which item is stored on which instance
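A simplified sketch of how a client can decide which instance holds a key. This is not the code of any particular Memcached client; INSTANCES and instance_for are made-up names and the addresses are placeholders.

```python
import hashlib

# Hypothetical instance addresses; a real deployment would list its own.
INSTANCES = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]


def instance_for(key: str, instances=INSTANCES) -> str:
    """Map a cache key to one instance using a stable hash modulo the count."""
    digest = hashlib.sha1(key.encode()).digest()
    return instances[int.from_bytes(digest[:4], "big") % len(instances)]
```

The instances never talk to each other; all placement logic lives in the client. With plain modulo hashing, adding or removing an instance remaps most keys, which is the problem consistent hashing is meant to reduce.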
Amazon - S3, SimpleDB, Dynamo
Google - App Engine Datastore, BigTable
Microsoft - SQL Data Services, Azure Storage
Facebook - Cassandra
LinkedIn - Project Voldemort
Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, HBase, Hypertable
Basic Concepts
No tables - Containers-Entity
No schema - each tuple has its own set of properties
Google BigTable
Sparse, distributed, multi-dimensional sorted map
Indexed by row key, column key, timestamp
Each value is an un-interpreted array of bytes
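The data model described above can be pictured with a toy in-memory map. This is only an illustration of the indexing scheme, not BigTable's API; SparseSortedMap and its methods are invented names.

```python
class SparseSortedMap:
    """Toy model: (row key, column key, timestamp) -> uninterpreted bytes."""

    def __init__(self):
        # Sparse: only cells that were written take up space.
        self._cells = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        self._cells[(row, column, timestamp)] = value

    def get_latest(self, row: str, column: str):
        """Return the value with the highest timestamp, or None if absent."""
        versions = [
            (ts, value)
            for (r, c, ts), value in self._cells.items()
            if (r, c) == (row, column)
        ]
        return max(versions)[1] if versions else None

    def scan(self, start_row: str, end_row: str):
        """Yield cells in sorted key order for rows in [start_row, end_row)."""
        for key in sorted(self._cells):
            if start_row <= key[0] < end_row:
                yield key, self._cells[key]
```

Keeping keys sorted is what makes range scans over adjacent rows cheap, and the timestamp dimension lets several versions of a cell coexist.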
Amazon Dynamo
Data partitioned and replicated using consistent hashing
Decentralized replica sync protocol
Consistency through versioning
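A minimal sketch of partitioning and replication with consistent hashing. It is a generic illustration rather than Dynamo's implementation; HashRing, preference_list, the vnodes parameter and the use of SHA-256 are all assumptions made for the example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Place a string on the ring as a 64-bit integer."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class HashRing:
    """Consistent hashing ring with virtual nodes per physical node."""

    def __init__(self, nodes, vnodes=64):
        # Each node owns several points on the ring to even out the load.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def preference_list(self, key: str, replicas: int = 3):
        """Walk clockwise from the key's position, collecting distinct nodes."""
        start = bisect.bisect_right(self._points, _hash(key))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found
```

The first node in the list holds the primary copy and the following ones hold replicas. Removing a node only moves the keys that node owned; every other key keeps its owner.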
Facebook Cassandra
Used for Inbox search
Open Source
Scalaris
Keys stored in lexicographical order
Improved Paxos to provide ACID
Memory resident, no persistence
Real life scaling requires trade-offs
No Silver Bullet
Need to learn new things
Need to un-learn
Balance!