
Chapter Two

Big Data Storage Concepts


Kibret Zewdu (MSc)
Faculty of Computing
Jimma Institute of Technology (JiT)
Outline
• By the end of this chapter, you will be able to understand each of the
following concepts:
– Clusters
– File Systems and Distributed File Systems
– NoSQL
– Sharding
– Replication
– Sharding and Replication
– CAP Theorem (reading assignment)
– ACID (reading assignment)
– BASE (reading assignment)
Introduction
• Data acquired from external sources is often not in a format or
structure that can be directly processed.
• To overcome these incompatibilities and prepare data for storage
and processing, data wrangling is necessary.
• Data wrangling includes steps to filter, cleanse and otherwise
prepare the data for downstream analysis.
• From a storage perspective, a copy of the data is first stored in its
acquired format, and, after wrangling, the prepared data needs to be
stored again.
• Typically, storage is required whenever the following occurs:
– external datasets are acquired, or internal data will be used in a
Big Data environment
– data is manipulated to be made amenable for data analysis
– data is processed via an ETL activity, or output is generated as a
result of an analytical operation

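As a minimal sketch of such a wrangling step (not part of the original slides), the Python snippet below filters and cleanses a small acquired dataset with pandas before both copies are stored; the column names and cleansing rules are hypothetical.

    import pandas as pd

    # Hypothetical raw data as acquired from an external source.
    raw = pd.DataFrame({
        "user_id": [1, 2, 2, None, 4],
        "age": [34, -1, 25, 25, 41],   # -1 is an invalid placeholder
    })

    # Wrangling: filter out invalid rows, cleanse duplicates and missing keys.
    prepared = (
        raw.dropna(subset=["user_id"])           # drop rows missing the key
           .drop_duplicates(subset=["user_id"])  # remove duplicate records
           .query("age >= 0")                    # discard invalid ages
    )

    # Both copies are stored: the raw acquired data and the prepared data.
    raw.to_csv("acquired_raw.csv", index=False)
    prepared.to_csv("prepared.csv", index=False)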
Clusters
• A cluster is a tightly coupled collection
of servers, or nodes.
• These servers usually have the same
hardware specifications and are
connected together via a network to
work as a single unit.
• Each node in the cluster has its own
dedicated resources, such as memory, a
processor, and a hard drive.
• A cluster can execute a task by splitting it into small pieces and
distributing their execution onto different computers that belong to
the cluster.

Figure 2.1 - The symbol used to represent a cluster.

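A hedged, single-machine analogy of this split-and-distribute idea: the sketch below uses Python's multiprocessing pool, with each worker process standing in for a cluster node; in a real cluster the pieces would be shipped over the network.

    from multiprocessing import Pool

    def process_piece(piece):
        # Each "node" works on its own piece independently.
        return sum(piece)

    if __name__ == "__main__":
        task = list(range(1_000_000))
        # Split the task into small pieces, one per node.
        n_nodes = 4
        pieces = [task[i::n_nodes] for i in range(n_nodes)]
        # Distribute the pieces and combine the partial results.
        with Pool(n_nodes) as cluster:
            partial = cluster.map(process_piece, pieces)
        print(sum(partial))  # same answer as processing on one machine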
File Systems and Distributed File Systems
• A file system is the method of storing
and organizing data on a storage device,
such as flash drives, DVDs and hard
drives.
• A file is an atomic unit of storage used by the file system to store data.

Figure 2.2 - The symbol used to represent a file system.
• A file system provides a logical view of
the data stored on the storage device
and presents it as a tree structure of
directories and files.
• Operating systems employ file systems
to store and retrieve data on behalf of
applications.

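As a small illustration of the tree structure a file system presents, the sketch below walks a directory with Python's standard library; the starting path is just a placeholder.

    import os

    # Walk the directory tree the file system presents for a given path.
    # "." is a placeholder; any mounted path works the same way.
    for directory, subdirs, files in os.walk("."):
        depth = directory.count(os.sep)
        print("  " * depth + os.path.basename(directory) + "/")
        for name in files:
            print("  " * (depth + 1) + name)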
Distributed File Systems
• A distributed file system is a file
system that can store large files spread
across the nodes of a cluster.
• To the client, files appear to be local;
however, this is only a logical view.
• Physically, the files are distributed
throughout the cluster.
• This local view is presented via the distributed file system, and it
enables the files to be accessed from multiple locations.

Figure 2.3 - The symbol used to represent distributed file systems.
• Examples include the Google File
System (GFS) and Hadoop
Distributed File System (HDFS).

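A toy model (illustrative only, not the actual GFS or HDFS API) of how a distributed file system can present one logical file while tracking which cluster nodes physically hold its blocks:

    # Toy namespace of a distributed file system. Real systems such as
    # HDFS keep this mapping in a dedicated master node (the NameNode).
    block_locations = {
        "/logs/clicks.log": [                   # one logical file
            ("block-0", ["node-a", "node-c"]),  # blocks live on real nodes
            ("block-1", ["node-b", "node-a"]),
            ("block-2", ["node-c", "node-b"]),
        ],
    }

    def read_file(path):
        # The client asks for a logical path; the DFS resolves the
        # physical blocks scattered across the cluster.
        for block_id, nodes in block_locations[path]:
            print(f"fetch {block_id} from {nodes[0]} (fallback: {nodes[1:]})")

    read_file("/logs/clicks.log")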
NoSQL Database
• A Not-only SQL (NoSQL) database is a non-relational database.
• It is highly scalable, fault-tolerant and specifically
designed to house semi-structured and unstructured data.
• NoSQL databases often provide an API-based query interface that
can be called from within an application.
• They also support query languages other than Structured
Query Language (SQL), as SQL was designed to query
structured data stored within a relational database.
• For example,
– a NoSQL database that is optimized to store XML files will often
use XQuery as the query language.
– a NoSQL database designed to store RDF data will use SPARQL to
query the relationships it contains.

Figure 2.4 - A NoSQL database can provide an API- or SQL-like query interface.

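As one concrete, hedged example of an API-based query interface (MongoDB is not named in the slides, but it is a widely used NoSQL document store), the sketch below queries documents through pymongo method calls rather than SQL; it assumes a server on localhost and hypothetical database and collection names.

    from pymongo import MongoClient

    # Assumes a MongoDB server on localhost; the database/collection
    # names here are hypothetical.
    client = MongoClient("mongodb://localhost:27017")
    collection = client["shop"]["orders"]

    # The query is expressed through the API as a filter document, not SQL.
    for order in collection.find({"status": "shipped", "total": {"$gt": 100}}):
        print(order["_id"], order["total"])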
Sharding
• Sharding is the process of horizontally partitioning a large
dataset into a collection of smaller, more manageable datasets
called shards.
• The shards are distributed across multiple nodes, where a node
is a server or a machine.
• Each shard
– is stored on a separate node and each node is responsible
for only the data stored on it.
– shares the same schema, and all shards collectively
represent the complete dataset.

Figure 2.5 - An example of sharding where a dataset is spread across Node A and
Node B, resulting in Shard A and Shard B, respectively.

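A minimal sketch (not from the slides) of one common way to shard: hash each record's key and map it deterministically to a node, so the same key always routes to the same shard, matching the Node A / Node B split of Figure 2.5.

    import hashlib

    NODES = ["node-a", "node-b"]  # Shard A and Shard B from Figure 2.5

    def shard_for(key: str) -> str:
        # Hash the key and map it to a node; the same key always
        # routes to the same shard.
        digest = hashlib.md5(key.encode()).hexdigest()
        return NODES[int(digest, 16) % len(NODES)]

    for customer in ["alice", "bob", "carol"]:
        print(customer, "->", shard_for(customer))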
Sharding…
• Sharding allows the distribution of processing loads across multiple
nodes to achieve horizontal scalability.
• Horizontal scaling is a method for increasing a system’s capacity by
adding similar or higher capacity resources alongside existing resources.
• Since each node is responsible for only a part of the whole dataset,
read/write times are greatly improved.
• How sharding works in practice:
1. Each shard can independently service reads and writes for the
specific subset of data that it is responsible for.
2. Depending on the query, data may need to be fetched from both
shards (see the scatter-gather sketch below).
• A benefit of sharding is that it provides partial tolerance toward failures.
– In case of a node failure, only data stored on that node is affected.

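The scatter-gather sketch referenced above: when a query spans the dataset, it is run on every shard and the partial results are merged. The in-memory dicts below stand in for the shard nodes; the data is hypothetical.

    # In-memory stand-ins for two shards (hypothetical customer balances).
    shard_a = {"alice": 120, "carol": 75}
    shard_b = {"bob": 240, "dave": 60}

    def query_all(min_balance):
        # Scatter: run the query on every shard; gather: merge the results.
        results = {}
        for shard in (shard_a, shard_b):
            results.update(
                {name: bal for name, bal in shard.items() if bal >= min_balance}
            )
        return results

    print(query_all(100))  # this query needs data from both shards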
Replication
• Replication stores multiple copies of a dataset, known as
replicas, on multiple nodes.
• Replication provides scalability and availability because the same
data is replicated on various nodes.
• Fault tolerance is also achieved since data redundancy ensures
that data is not lost when an individual node fails.
• There are two different methods that are used to implement
replication:
1. master-slave
2. peer-to-peer

Figure 2.6 - An example of replication where a dataset is replicated to Node A and
Node B, resulting in Replica A and Replica B.
Master-Slave Replication
• Nodes are arranged in a master-slave configuration, and all data is
written to a master node.
• Once saved, the data is replicated over to multiple slave nodes.
• All external write requests, including insert, update and delete,
occur on the master node, whereas read requests can be fulfilled by
any slave node.
• It is ideal for read-intensive loads rather than write-intensive loads,
since growing read demands can be managed by horizontal scaling
to add more slave nodes.
• Writes are consistent, as all writes are coordinated by the master
node.
– write performance will suffer as the amount of writes increases.
• If the master node fails, reads are still possible via any of the slave
nodes.

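A toy, in-memory sketch of the master-slave flow described above: all writes go through the master and are copied to the slaves, while reads can be served by any slave. Replication here is synchronous for simplicity; real systems often copy asynchronously, which is exactly what makes the read inconsistency discussed next possible.

    import random

    class MasterSlaveStore:
        """Toy model: one master, N slaves (in-memory dicts, no network)."""

        def __init__(self, n_slaves=2):
            self.master = {}
            self.slaves = [{} for _ in range(n_slaves)]

        def write(self, key, value):
            # All inserts/updates/deletes go through the master...
            self.master[key] = value
            # ...and are then replicated over to every slave.
            for slave in self.slaves:
                slave[key] = value

        def read(self, key):
            # Reads can be served by any slave node.
            return random.choice(self.slaves).get(key)

    store = MasterSlaveStore()
    store.write("user:1", "alice")
    print(store.read("user:1"))  # 'alice'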
Figure 2.7 - An example of master-slave replication where Master A is the single point of
contact for all writes, and data can be read from Slave A and Slave B.
• A slave node can be configured as a backup node for the master
node.
• Read inconsistency can be an issue if a slave node is read
prior to an update to the master being copied to it.
• To ensure read consistency, a voting system can be implemented
where a read is declared consistent if the majority of the slaves
contain the same version of the record.
• Implementation of such a voting system requires a reliable and fast
communication mechanism between the slaves.
• The following scenario, shown in Figure 2.8, illustrates an
inconsistent read:
1. User A updates data.
2. The data is copied over to Slave A by the Master.
3. Before the data is copied over to Slave B, User B tries to read
the data from Slave B, which results in an inconsistent read.
4. The data will eventually become consistent when Slave B is
updated by the Master.

Figure 2.8 - An example of master-slave replication where read inconsistency occurs.
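A hedged sketch of the voting idea: each slave is asked for its version of the record, and the read is declared consistent only when a majority agree. The replica contents are hypothetical and mirror the Figure 2.8 scenario, where Slave B has not yet received the update.

    from collections import Counter

    def quorum_read(replicas, key):
        # Ask every slave for its version of the record and vote.
        versions = [replica.get(key) for replica in replicas]
        value, votes = Counter(versions).most_common(1)[0]
        if votes > len(replicas) // 2:
            return value      # a majority agree: consistent read
        raise RuntimeError("no majority: inconsistent read, retry later")

    # Slave B has not yet received the update from the master.
    slave_a = {"user:1": "v2"}
    slave_b = {"user:1": "v1"}
    slave_c = {"user:1": "v2"}
    print(quorum_read([slave_a, slave_b, slave_c], "user:1"))  # 'v2'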
Peer-to-Peer Replication
• With peer-to-peer replication, all nodes operate at the same
level.
• In other words, there is not a master-slave relationship
between the nodes.
• Each node, known as a peer, is equally capable of handling
reads and writes.
• Each write is copied to all peers.

Figure 2.9 - Writes are copied to Peers A, B and C simultaneously. Data is read from Peer A,
but it can also be read from Peers B or C.

• Peer-to-peer replication is prone to write inconsistencies that occur as a result of a
simultaneous update of the same data across multiple peers.
• This can be addressed by implementing either a pessimistic or optimistic
concurrency strategy.
– Pessimistic concurrency is a proactive strategy that prevents inconsistency.
• It uses locking to ensure that only one update to a record can occur at a
time. However, this is detrimental to availability since the database record
being updated remains unavailable until all locks are released.
– Optimistic concurrency is a reactive strategy that does not use locking.
Instead, it allows inconsistency to occur with the knowledge that
consistency will eventually be achieved after all updates have propagated.
• With optimistic concurrency, peers may remain inconsistent for some period of
time before attaining consistency. However, the database remains available as no
locking is involved.
• Reads can be inconsistent during the period when some of the peers have
completed their updates while others are still performing theirs.
• However, reads eventually become consistent when the updates have been
executed on all peers.

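A single-process sketch contrasting the two strategies under stated assumptions: pessimistic concurrency locks the record for the duration of an update, while optimistic concurrency skips locking and detects conflicting updates with a version number (one common realization; real databases vary).

    import threading

    # --- Pessimistic: lock the record so only one update runs at a time.
    record_lock = threading.Lock()
    record = {"value": 0}

    def pessimistic_update(new_value):
        with record_lock:          # record is unavailable to others here
            record["value"] = new_value

    # --- Optimistic: no lock; detect conflicts with a version number.
    versioned = {"value": 0, "version": 0}

    def optimistic_update(expected_version, new_value):
        # The update only applies if nobody else changed the record meanwhile.
        if versioned["version"] != expected_version:
            return False           # conflict: caller must re-read and retry
        versioned["value"] = new_value
        versioned["version"] += 1
        return True

    pessimistic_update(1)
    print(optimistic_update(0, 10))  # True: no conflicting update happened
    print(optimistic_update(0, 20))  # False: version moved on, retry needed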
• To ensure read consistency, a voting system can be implemented
where a read is declared consistent if the majority of the peers
contain the same version of the record.
• As previously indicated, implementation of such a voting system
requires a reliable and fast communication mechanism between the
peers.
• The following scenario demonstrates an inconsistent read:
1. User A updates data.
2. a. The data is copied over to Peer A.
b. The data is copied over to Peer B.
3. Before the data is copied over to Peer C, User B tries to read
the data from Peer C, resulting in an inconsistent read.
4. The data will eventually be updated on Peer C, and the
database will once again become consistent.

Sharding and Replication
• To improve on the limited fault tolerance offered by sharding,
while additionally benefiting from the increased availability
and scalability of replication, both sharding and replication
can be combined.

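A toy placement sketch of the combined approach: the dataset is split into shards, and each shard is replicated onto a second node, so a single node failure leaves every shard available. Node and shard names are illustrative.

    # Each shard is stored on a primary node and replicated to a second
    # node, so losing any single node leaves every shard still available.
    placement = {
        "shard-a": ["node-1", "node-2"],
        "shard-b": ["node-2", "node-3"],
        "shard-c": ["node-3", "node-1"],
    }

    def survives(failed_node):
        # The combined scheme tolerates the failure if every shard
        # still has at least one live replica.
        return all(
            any(node != failed_node for node in nodes)
            for nodes in placement.values()
        )

    print(all(survives(n) for n in ["node-1", "node-2", "node-3"]))  # True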
Figure 2.10 - A comparison of sharding and replication that shows how a dataset is
distributed between two nodes with the different approaches.