Ch02 - Big Data Storage Concepts
Clusters
• A cluster is a tightly coupled collection
of servers, or nodes.
• The servers usually have the same
hardware specifications and are
connected together via a network to
work as a single unit.
• Each node in the cluster has its own
dedicated resources, such as memory, a
processor, and a hard drive.
• A cluster can execute a task by splitting
it into small pieces and distributing their
execution onto different computers that
belong to the cluster.
Figure 2.1 - The symbol used to represent a cluster.
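The split-and-distribute idea above can be sketched in a few lines of Python. This is a hypothetical illustration only: a thread pool stands in for the cluster's nodes, and the function names (`count_words`, `cluster_word_count`) are invented for the example, not part of any real cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(piece):
    """Process one small piece of the overall task."""
    return len(piece.split())

def cluster_word_count(text, nodes=3):
    """Split a task into small pieces and distribute their execution.

    A thread pool stands in for the cluster's nodes here; in a real
    cluster, each piece would be executed on a different machine.
    """
    pieces = text.splitlines()               # split the task into pieces
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partial_results = pool.map(count_words, pieces)
    return sum(partial_results)              # combine the partial results
```

The shape is the same as in a real cluster: partition the work, execute the parts in parallel, then merge the partial results.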
File Systems and Distributed File Systems
• A file system is the method of storing
and organizing data on a storage device,
such as flash drives, DVDs and hard
drives.
• A file is an atomic unit of storage used
by the file system to store data.
Figure 2.2 - The symbol used to represent a file system.
• A file system provides a logical view of
the data stored on the storage device
and presents it as a tree structure of
directories and files.
• Operating systems employ file systems
to store and retrieve data on behalf of
applications.
Distributed File Systems
• A distributed file system is a file
system that can store large files spread
across the nodes of a cluster.
• To the client, files appear to be local;
however, this is only a logical view.
• Physically, the files are distributed
throughout the cluster.
• This local view is presented via the
distributed file system and it enables
the files to be accessed from multiple
locations.
Figure 2.3 - The symbol used to represent distributed file systems.
• Examples include the Google File
System (GFS) and Hadoop
Distributed File System (HDFS).
NoSQL Database
• A Not-only SQL (NoSQL) database is a non-relational
database.
• It is highly scalable, fault-tolerant and specifically
designed to house semi-structured and unstructured data.
• NoSQL databases often provide an API-based query interface that
can be called from within an application.
• They also support query languages other than Structured
Query Language (SQL), as SQL was designed to query
structured data stored within a relational database.
• For example:
– a NoSQL database that is optimized to store XML
files will often use XQuery as the query language.
– a NoSQL database designed to store RDF data will
use SPARQL to query the relationships it contains.
Figure 2.4 - A NoSQL database can provide an API- or SQL-like query interface.
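A minimal sketch of what an API-based query interface looks like from an application's point of view. `TinyNoSQL` and its `insert`/`find` methods are invented for this illustration; real NoSQL databases (MongoDB, Couchbase, etc.) expose far richer APIs.

```python
class TinyNoSQL:
    """A toy document store with an API-based query interface.

    Illustrative only: documents are plain dicts (semi-structured
    data), and queries are method calls rather than SQL statements.
    """
    def __init__(self):
        self._docs = []                      # no fixed schema required

    def insert(self, doc):
        """Store one semi-structured document."""
        self._docs.append(doc)

    def find(self, **criteria):
        """Return documents whose fields match all criteria."""
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

# Usage: querying via the API instead of SQL
db = TinyNoSQL()
db.insert({"name": "Ada", "role": "engineer"})
db.insert({"name": "Bob", "role": "analyst"})
engineers = db.find(role="engineer")         # API call, not a SQL string
```

Note that nothing forces the two documents to share the same fields, which is what makes such stores a fit for semi-structured data.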
Sharding
• Sharding is the process of horizontally partitioning a large
dataset into a collection of smaller, more manageable datasets
called shards.
• The shards are distributed across multiple nodes, where a node
is a server or a machine.
• Each shard
– is stored on a separate node and each node is responsible
for only the data stored on it.
– shares the same schema, and all shards collectively
represent the complete dataset.
Figure 2.5 - An example of sharding where a dataset is spread across Node A and
Node B, resulting in Shard A and Shard B, respectively
Sharding…
• Sharding allows the distribution of processing loads across multiple
nodes to achieve horizontal scalability.
• Horizontal scaling is a method for increasing a system’s capacity by
adding similar or higher capacity resources alongside existing resources.
• Since each node is responsible for only a part of the whole dataset,
read/write times are greatly improved.
• How sharding works in practice:
1. Each shard can independently service reads and writes for the
specific subset of data that it is responsible for.
2. Depending on the query, data may need to be fetched from both
shards.
• A benefit of sharding is that it provides partial tolerance toward failures.
– In case of a node failure, only data stored on that node is affected.
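The mechanics above can be sketched as a toy key-value store. This is a hypothetical illustration, assuming hash-based shard placement (`ShardedStore` and its methods are invented names); real systems also support range-based sharding, rebalancing and failover.

```python
import hashlib

class ShardedStore:
    """Toy horizontal partitioning: each shard holds a subset of keys."""

    def __init__(self, num_shards=2):
        # One dict per shard; each would live on a separate node.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        """Hash the key to decide which node is responsible for it."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def write(self, key, value):
        self._shard_for(key)[key] = value    # only the owning shard is touched

    def read(self, key):
        return self._shard_for(key).get(key)  # served by a single shard

    def scan(self):
        """A query that must fetch data from all shards and merge it."""
        merged = {}
        for shard in self.shards:
            merged.update(shard)
        return merged
```

`write` and `read` show step 1 (each shard independently services its own subset), while `scan` shows step 2 (some queries must touch every shard).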
Replication
• Replication stores multiple copies of a dataset, known as
replicas, on multiple nodes.
• Replication provides scalability and availability, since the
same data is replicated on various nodes.
• Fault tolerance is also achieved since data redundancy ensures
that data is not lost when an individual node fails.
• There are two different methods that are used to implement
replication:
1. master-slave
2. peer-to-peer
Figure 2.6 - An example of replication where a dataset is replicated to Node A and
Node B, resulting in Replica A and Replica B.
Master-Slave replication
• Nodes are arranged in a master-slave configuration, and all data is
written to a master node.
• Once saved, the data is replicated over to multiple slave nodes.
• All external write requests, including insert, update and delete,
occur on the master node, whereas read requests can be fulfilled by
any slave node.
• It is ideal for read-intensive rather than write-intensive loads,
since growing read demands can be managed by horizontally scaling
to add more slave nodes.
• Writes are consistent, as all writes are coordinated by the master
node.
– write performance will suffer as the amount of writes increases.
• If the master node fails, reads are still possible via any of the slave
nodes.
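The write and read paths described above can be sketched as follows. This is an illustrative model only (`MasterSlaveStore` is an invented name), and it replicates synchronously for simplicity; real systems often replicate to slaves asynchronously, which is what opens the read-inconsistency window discussed next.

```python
import random

class MasterSlaveStore:
    """Toy master-slave replication."""

    def __init__(self, num_slaves=2):
        self.master = {}
        self.slaves = [dict() for _ in range(num_slaves)]

    def write(self, key, value):
        """All writes (insert/update/delete) go through the master."""
        self.master[key] = value
        for slave in self.slaves:        # then replicate to the slaves
            slave[key] = value

    def read(self, key):
        """Reads can be fulfilled by any slave node."""
        return random.choice(self.slaves).get(key)
```

Because every write is coordinated by the single master, writes are consistent, but the master also becomes the bottleneck as write volume grows.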
Figure 2.7 - An example of master-slave replication where Master A is the single point of
contact for all writes, and data can be read from Slave A and Slave B.
• A slave node can be configured as a backup node for the master
node.
• Read inconsistency can be an issue if a slave node is read
prior to an update to the master being copied to it.
• To ensure read consistency, a voting system can be implemented
where a read is declared consistent if the majority of the slaves
contain the same version of the record.
• Implementation of such a voting system requires a reliable and fast
communication mechanism between the slaves.
• Figure 2.8 shows how an inconsistent read can occur:
1. User A updates data.
2. The data is copied over to Slave A by the Master.
3. Before the data is copied over to Slave B, User B tries to read
the data from Slave B, which results in an inconsistent read.
4. The data will eventually become consistent when Slave B is
updated by the Master.
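The voting system mentioned above can be sketched as a majority read. A hypothetical illustration: `voted_read` is an invented helper, the slaves are modeled as plain dicts, and real implementations use versioned records and quorum protocols rather than direct value comparison.

```python
from collections import Counter

def voted_read(key, slaves):
    """Majority-vote read across replicas.

    The read is declared consistent only if a majority of the
    slaves hold the same version of the record.
    """
    values = [slave.get(key) for slave in slaves]
    value, votes = Counter(values).most_common(1)[0]
    if votes > len(slaves) // 2:
        return value                      # majority agrees: consistent
    raise RuntimeError("no majority: read is inconsistent")
```

In the Figure 2.8 scenario, a vote taken between steps 2 and 3 would still return the old value consistently if most slaves had not yet been updated, and the new value once the update had propagated to a majority.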
Figure 2.8 - An example of master-slave replication where read inconsistency occurs.
Peer-to-peer replication
• With peer-to-peer replication, all nodes operate at the same
level.
• In other words, there is not a master-slave relationship
between the nodes.
• Each node, known as a peer, is equally capable of handling
reads and writes.
• Each write is copied to all peers.
Figure 2.9 - Writes are copied to Peers A, B and C simultaneously. Data is read from Peer A,
but it can also be read from Peers B or C.
• Peer-to-peer replication is prone to write inconsistencies that occur as a result of a
simultaneous update of the same data across multiple peers.
• This can be addressed by implementing either a pessimistic or optimistic
concurrency strategy.
– Pessimistic concurrency is a proactive strategy that prevents inconsistency.
• It uses locking to ensure that only one update to a record can occur at a
time. However, this is detrimental to availability since the database record
being updated remains unavailable until all locks are released.
– Optimistic concurrency is a reactive strategy that does not use locking.
Instead, it allows inconsistency to occur with knowledge that eventually
consistency will be achieved after all updates have propagated.
• With optimistic concurrency, peers may remain inconsistent for some period of
time before attaining consistency. However, the database remains available as no
locking is involved.
• Reads can be inconsistent during the time period when some of the peers have
completed their updates while others perform their updates.
• However, reads eventually become consistent when the updates have been
executed on all peers.
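Optimistic concurrency on a single peer can be sketched with version numbers, in the style of compare-and-set. This is an invented illustration (`Peer`, `get`, `put` are hypothetical names): an update only succeeds if the writer saw the latest version, so no locks are taken and the record stays available.

```python
class Peer:
    """Toy optimistic concurrency control on one peer."""

    def __init__(self):
        self.data = {}                       # key -> (value, version)

    def get(self, key):
        """Return (value, version); unknown keys start at version 0."""
        return self.data.get(key, (None, 0))

    def put(self, key, value, expected_version):
        """Apply the update only if no one else updated it first."""
        _, current = self.get(key)
        if current != expected_version:
            return False                     # stale write: caller must retry
        self.data[key] = (value, current + 1)
        return True
```

If two writers both read version 0 and then write, the first `put` succeeds and the second is rejected and must re-read and retry; a pessimistic strategy would instead have blocked the second writer with a lock until the first finished.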
• To ensure read consistency, a voting system can be implemented
where a read is declared consistent if the majority of the peers
contain the same version of the record.
• As previously indicated, implementation of such a voting system
requires a reliable and fast communication mechanism between the
peers.
• The following steps demonstrate a scenario where an inconsistent read occurs.
1. User A updates data.
2. a. The data is copied over to Peer A.
b. The data is copied over to Peer B.
3. Before the data is copied over to Peer C, User B tries to read
the data from Peer C, resulting in an inconsistent read.
4. The data will eventually be updated on Peer C, and the
database will once again become consistent.
Sharding and Replication
• To improve on the limited fault tolerance offered by sharding,
while additionally benefiting from the increased availability
and scalability of replication, both sharding and replication
can be combined.
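The combination can be sketched by nesting the two earlier ideas: the dataset is partitioned into shards, and each shard is then replicated within its own group of nodes. A hypothetical illustration only (`ShardedReplicatedStore` is an invented name); real systems add failover, quorum reads and rebalancing on top of this basic layout.

```python
class ShardedReplicatedStore:
    """Toy sharding combined with replication.

    Each shard group holds several replicas of one shard, so a
    single node failure costs neither data nor availability for
    that shard's portion of the dataset.
    """
    def __init__(self, num_shards=2, replicas_per_shard=2):
        self.groups = [[dict() for _ in range(replicas_per_shard)]
                       for _ in range(num_shards)]

    def _group(self, key):
        """Pick the shard group responsible for this key."""
        return self.groups[hash(key) % len(self.groups)]

    def write(self, key, value):
        for replica in self._group(key):   # replicate within the group
            replica[key] = value

    def read(self, key, replica=0):
        """Read from any replica in the owning shard group."""
        return self._group(key)[replica].get(key)
```

Sharding gives horizontal scalability, and replication inside each shard group restores the fault tolerance that sharding alone only partially provides.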
Figure 2.10 - A comparison of sharding and replication that shows how a dataset is
distributed between two nodes with the different approaches.