Jamie Oconnor - Research Paper - Modern Distributed Databases
Student Name:
Jamie OConnor
Student ID:
A00180489
Group:
Web
Introduction
This report examines the importance of distributed databases in modern computing and why
large organisations rely on them to handle vast amounts of data. I focus on the distributed
databases used by Google and Netflix, discussing the challenges associated with designing
large-scale modern distributed databases and how both organisations have successfully
overcome them.
Issues & challenges faced
There are many challenges in designing a large-scale modern distributed database. Google
spent up to four and a half years designing the foundation for their own. (Hsieh, 2012)
Here are some of the issues faced:
Scalability - Google and Netflix are constantly growing their services. To keep up with public
demand and maintain performance on a global scale, the architecture must be stable and
able to grow.
Security - Storing data at multiple sites increases the probability of security lapses.
Reliability - Without a reliable system you may not be able to provide your services to the
expected standard. Google and Netflix could lose millions of dollars and many customers, and
tarnish their reputations.
Availability - Google and Netflix customers want continuous access to data. What happens if
a datacentre fails? Will customers be able to continue working?
The above is only a summary of the issues faced. Transaction management, concurrency
control, query optimization and data integrity must all be addressed while maintaining
transparency.
GOOGLE
General Description of the Google Services and Database
Google are a multinational organisation widely respected in the software industry. Their
services include Gmail, Chrome, YouTube, Android and Google AdWords. Their stated goal is to
make it as easy as possible for you to find the information you need and get the things you
need to do done. (Google, 2015) Services at this scale need advanced databases to store their data.
Spanner is the system they created. Spanner uses replication for both global availability and
geographic locality. Spanner's main focus is managing cross-datacentre replicated data, but
its designers have also spent time designing and implementing important database features on
top of their distributed-systems infrastructure. (Corbett, et al., 2012)
Architecture of the Google Distributed Database
A deployment of Spanner is called a universe. Each universe is organised into zones, which are
the set of locations across which data can be replicated. Zones can be added to or removed
from a running system. There may be more than one zone in a datacentre, for example when
different applications' data must be partitioned across different sets of servers within the
same datacentre.
A zone has one zonemaster and between one hundred and several thousand spanservers. The
zonemaster assigns data to spanservers, which serve that data to clients. Clients use the
zone's location proxies to locate the spanservers holding their data.
The universemaster displays status information about each zone, and the placement driver is
responsible for the automated movement of data across zones. Each spanserver manages between
100 and 1000 tablets (data structures) and runs a Paxos state machine on top of each tablet
to support replication. (Corbett, et al., 2012)
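The hierarchy described above can be pictured with a small sketch. This is only an illustrative model, not Spanner's implementation: the class and method names (Universe, Zone, Spanserver, assign, locate, status) are hypothetical, and the "least-loaded spanserver" assignment policy is an assumption made purely for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Spanserver:
    name: str
    tablets: List[str] = field(default_factory=list)  # tablet ids served


@dataclass
class Zone:
    name: str
    spanservers: List[Spanserver] = field(default_factory=list)

    def assign(self, tablet: str) -> Spanserver:
        """Zonemaster role: hand a tablet to a spanserver.

        Assumption: pick the spanserver currently holding the fewest tablets.
        """
        target = min(self.spanservers, key=lambda s: len(s.tablets))
        target.tablets.append(tablet)
        return target

    def locate(self, tablet: str) -> Optional[Spanserver]:
        """Location-proxy role: find which spanserver serves a tablet."""
        for server in self.spanservers:
            if tablet in server.tablets:
                return server
        return None


@dataclass
class Universe:
    zones: Dict[str, Zone] = field(default_factory=dict)

    def add_zone(self, zone: Zone) -> None:
        self.zones[zone.name] = zone

    def remove_zone(self, name: str) -> None:
        self.zones.pop(name, None)

    def status(self) -> Dict[str, int]:
        """Universemaster role: report the number of tablets per zone."""
        return {
            name: sum(len(s.tablets) for s in zone.spanservers)
            for name, zone in self.zones.items()
        }


universe = Universe()
zone = Zone("zone-a", [Spanserver("s1"), Spanserver("s2")])
universe.add_zone(zone)
for tablet in ("t1", "t2", "t3"):
    zone.assign(tablet)
print(zone.locate("t2").name, universe.status())
```

The sketch leaves out the placement driver and Paxos replication entirely; it only shows how the zonemaster, location proxy and universemaster roles relate to one another.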
Hardware & Software used by Google
Spanner makes use of hardware-assisted time synchronization, using GPS clocks and atomic
clocks to ensure global consistency. (Corbett, et al., 2012) Google had to install antennas on
the roofs of its datacentres and connect them to the hardware below. According to Andrew
Fikes, the GPS units they use are relatively inexpensive devices available from many different
vendors. The time keepers are kept in racks alongside the servers and, again, need only
connect to some machines in the datacentre. (Metz, 2012) TrueTime is implemented by a
set of time master machines per datacentre and does not require specialized servers. Google
also make use of Paxos state machines and spanserver machines. (Corbett, et al., 2012)
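To illustrate why synchronized clocks matter for global consistency, here is a minimal sketch of the TrueTime idea as described by Corbett et al. (2012): the current time is reported as an interval rather than a single value, and a transaction waits until its chosen timestamp has definitely passed before it becomes visible. The class below is a toy stand-in built on the local system clock. The epsilon value, the clock parameter and the helper name commit_wait are assumptions for illustration, not Google's implementation.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy stand-in for TrueTime: a clock reading plus an assumed error bound."""

    def __init__(self, epsilon: float = 0.004, clock=time.time):
        self.epsilon = epsilon  # assumed clock uncertainty in seconds (illustrative)
        self.clock = clock      # injectable so the sketch can be tested

    def now(self) -> TTInterval:
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True if time t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True if time t has definitely not arrived yet."""
        return self.now().latest < t


def commit_wait(tt: TrueTime, poll: float = 0.001) -> float:
    """Choose a commit timestamp, then wait until it is definitely in the past."""
    s = tt.now().latest
    while not tt.after(s):
        time.sleep(poll)
    return s


tt = TrueTime()
stamp = commit_wait(tt)
print("commit timestamp:", stamp)
```

In this sketch the wait lasts roughly twice epsilon, which is why a smaller clock uncertainty (from GPS and atomic clocks) translates directly into shorter commit delays.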
Security of the Google Distributed Database
Google state that they take security at their data centres very seriously and keep customer
data safe and secure by using dozens of critical security features. (Google, 2015) Google build
custom servers exclusively for their own use, containing only the necessary hardware and
software, and their datacentres have emergency backup generators. Data is automatically split
into randomly named chunks and distributed across many computers in different datacentres,
avoiding single points of failure. The location and status of each hard drive in their
datacentres is tracked, and drives that have reached the end of their lives are destroyed in
a thorough, multi-step process. To physically protect the sites at all times, Google's
datacentres have access controls, guards on duty 24/7, video surveillance and perimeter
fencing. (Google, 2015)
Netflix
General Description of the Netflix Services and Database
Netflix are an international provider of on-demand internet streaming media. Netflix advertise
new movies and TV shows arriving all the time, options for subtitles or dubbing, and
award-winning original series and documentaries that you won't find anywhere else. (Netflix,
2015) The distributed database they use is Apache Cassandra (the DataStax Enterprise edition).
(Datastax, 2014) Cassandra is a distributed storage system for managing structured data that
is designed to scale to a very large size across many commodity servers, with no single point
of failure. (Lakshman, 2008) Cassandra was originally developed by Facebook and is now
used by Netflix for 95% of their database needs. Subscriber data, video metadata, pause
locations and every user interaction are stored and processed to build recommendations for
individual users. (Kalantzis, 2014)
Architecture of the Netflix Distributed Database
Cassandra doesn't support a full relational data model. Instead, it provides clients with a
simple data model that supports dynamic control over data layout and format. (Lakshman &
Malik, 2010) An instance of Cassandra has one table made up of one or more column families,
as defined by the user. Each column family can consist of columns or supercolumns, which are
created dynamically. There is no limit on the number of these that can be stored within a
family. A column has a name, a value and a timestamp. A supercolumn has a name and an
unbounded number of columns. Every row has a unique key, which is a string of any size, and
rows need not have the same columns: key K4 could have 94 columns/supercolumns while key K5
has only 20. (Lakshman, 2008)
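The data model described above can be sketched in a few lines. This is a simplified illustration, not Cassandra's storage engine or client API: the class names, the column family name "users" and the column names are hypothetical, and resolving conflicting writes by keeping the newest timestamp is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Column:
    name: str
    value: str
    timestamp: float


@dataclass
class SuperColumn:
    """A named group holding any number of columns."""
    name: str
    columns: Dict[str, Column] = field(default_factory=dict)


class ColumnFamily:
    """Rows keyed by a string; each row holds its own dynamically created columns."""

    def __init__(self, name: str):
        self.name = name
        self.rows: Dict[str, Dict[str, Column]] = {}

    def insert(self, key: str, column: Column) -> None:
        row = self.rows.setdefault(key, {})
        current = row.get(column.name)
        # Assumption for this sketch: the write with the newest timestamp wins.
        if current is None or column.timestamp >= current.timestamp:
            row[column.name] = column

    def get(self, key: str, name: str) -> Optional[Column]:
        return self.rows.get(key, {}).get(name)


users = ColumnFamily("users")
users.insert("K4", Column("email", "a@example.com", 1.0))
users.insert("K4", Column("plan", "premium", 1.0))
users.insert("K5", Column("email", "b@example.com", 1.0))
users.insert("K4", Column("plan", "basic", 2.0))    # newer write replaces the old value
users.insert("K4", Column("plan", "trial", 0.5))    # stale write is ignored

address = SuperColumn("address", {"city": Column("city", "Dublin", 1.0)})
print(users.get("K4", "plan").value, len(users.rows["K4"]), len(users.rows["K5"]))
```

Rows K4 and K5 end up with different numbers of columns, which is the flexibility the column-family model offers over a fixed relational schema.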
The way Cassandra manages persistent state in the face of component failures drives the
reliability and scalability of the software systems relying on this service. (Lakshman, 2008)
Historical Issues
Netflix's previous Oracle database went down for 48+ hours. The fault lay not with the
database itself but with the Storage Area Network that was storing all the data. This outage
was the reason Netflix decided to look for an alternative. Netflix tried using SimpleDB, but
it wasn't scalable enough for their requirements. (Kalantzis, 2014)
References
Brodkin, J., 2015. Netflix shuts down its last data center, but it still runs a big IT operation.
[Online]
Available at: http://arstechnica.com/information-technology/2015/08/netflix-shuts-down-its-last-data-center-but-still-runs-a-big-it-operation/
[Accessed 23 October 2015].
Corbett, J. C. et al., 2012. Spanner: Google's Globally-Distributed Database. [Online]
Available at: http://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf
[Accessed 12 October 2015].
Datastax, 2014. Netflix Personalizes Viewing for Over 50 Million Customers with DataStax.
[Online]
Available at: http://www.datastax.com/wp-content/uploads/2011/09/CS-Netflix.pdf?3
[Accessed 27 October 2015].
DataStax, 2015. DataStax Enterprise Advanced Security. [Online]
Available at: http://www.datastax.com/products/datastax-enterprise-security
[Accessed 26 October 2015].
Google, 2015. Google Datacenters. [Online]
Available at: https://www.google.com/about/datacenters/inside/data-security/
[Accessed 12 October 2015].
Google, 2015. Our products and services. [Online]
Available at: https://www.google.com/about/company/products/
[Accessed 12 October 2015].
Hsieh, W., 2012. Wilson Hsieh - Spanner: Google's Globally-Distributed Database - OSDI
2012. [Online]
Available at: https://www.youtube.com/watch?v=NthK17nbpYs
[Accessed 12 October 2015].
Kalantzis, C., 2014. Netflix: Cassandra @ Netflix Building a House of Cards on a Solid
Foundation. [Online]
Available at: https://www.youtube.com/watch?v=RMSNLP_ORg8
[Accessed 23 October 2015].
Lakshman, A., 2008. Cassandra - A structured storage system on a P2P Network. [Online]
Available at: https://www.facebook.com/notes/facebook-engineering/cassandra-a-structured-storage-system-on-a-p2p-network/24413138919
[Accessed 23 October 2015].
Lakshman, A. & Malik, P., 2010. Cassandra - A Decentralized Structured Storage System.
[Online]
Available at: https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
[Accessed 23 October 2015].
Metz, C., 2012. Exclusive: Inside Google Spanner, the Largest Single Database on Earth.
[Online]
Available at: http://www.wired.com/2012/11/google-spanner-time/
[Accessed 23 October 2015].
Netflix, 2015. [Online]
Available at: https://www.netflix.com/ie/
[Accessed 23 October 2015].
Netflix, 2015. Privacy Statement. [Online]
Available at: https://www.netflix.com/privacy
[Accessed 23 October 2015].