Ch02 - Big Data Storage Concepts
Clusters
• A cluster is a tightly coupled collection
of servers, or nodes.
• The servers usually have the same
hardware specifications and are
connected together via a network to
work as a single unit.
• Each node in the cluster has its own
dedicated resources, such as memory, a
processor, and a hard drive.
• A cluster can execute a task by splitting
it into small pieces and distributing their
execution onto different computers that
belong to the cluster.
Figure 2.1 - The symbol used to represent a cluster.
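The split-and-distribute idea above can be sketched in a few lines of Python. This is a hypothetical illustration only: a thread pool stands in for the cluster's nodes, and the function names (`count_words`, `cluster_word_count`) are invented for the example, not part of any real cluster framework.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(piece):
    """Process one small piece of the overall task."""
    return len(piece.split())

def cluster_word_count(text, nodes=3):
    """Split a task into small pieces and distribute their execution.

    A thread pool stands in for the cluster's nodes here; in a real
    cluster, each piece would be executed on a different machine.
    """
    pieces = text.splitlines()               # split the task into pieces
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partial_results = pool.map(count_words, pieces)
    return sum(partial_results)              # combine the partial results
```

The shape is the same as in a real cluster: partition the work, execute the parts in parallel, then merge the partial results.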
File Systems and Distributed File Systems
• A file system is the method of storing
and organizing data on a storage device,
such as flash drives, DVDs and hard
drives.
• A file is an atomic unit of storage used
by the file system to store data.
Figure 2.2 - The symbol used to represent a file system.
• A file system provides a logical view of
the data stored on the storage device
and presents it as a tree structure of
directories and files.
• Operating systems employ file systems
to store and retrieve data on behalf of
applications.
Distributed File Systems
• A distributed file system is a file
system that can store large files spread
across the nodes of a cluster.
• To the client, files appear to be local;
however, this is only a logical view.
• Physically, the files are distributed
throughout the cluster.
• This local view is presented via the
distributed file system and it enables
the files to be accessed from multiple
locations.
Figure 2.3 - The symbol used to represent distributed file systems.
• Examples include the Google File
System (GFS) and Hadoop
Distributed File System (HDFS).
NoSQL Database
• A Not-only SQL (NoSQL) database is a non-relational
database.
• It is highly scalable, fault-tolerant and specifically
designed to house semi-structured and unstructured data.
• NoSQL databases often provide an API-based query interface that
can be called from within an application.
• They also support query languages other than Structured
Query Language (SQL), as SQL was designed to query
structured data stored within a relational database.
• For example:
– a NoSQL database that is optimized to store XML
files will often use XQuery as the query language.
– a NoSQL database designed to store RDF data will
use SPARQL to query the relationships it contains.
Figure 2.4 - A NoSQL database can provide an API- or SQL-like query interface.
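A minimal sketch of what an API-based query interface looks like from an application's point of view. `TinyNoSQL` and its `insert`/`find` methods are invented for this illustration; real NoSQL databases (MongoDB, Couchbase, etc.) expose far richer APIs.

```python
class TinyNoSQL:
    """A toy document store with an API-based query interface.

    Illustrative only: documents are plain dicts (semi-structured
    data), and queries are method calls rather than SQL statements.
    """
    def __init__(self):
        self._docs = []                      # no fixed schema required

    def insert(self, doc):
        """Store one semi-structured document."""
        self._docs.append(doc)

    def find(self, **criteria):
        """Return documents whose fields match all criteria."""
        return [d for d in self._docs
                if all(d.get(k) == v for k, v in criteria.items())]

# Usage: querying via the API instead of SQL
db = TinyNoSQL()
db.insert({"name": "Ada", "role": "engineer"})
db.insert({"name": "Bob", "role": "analyst"})
engineers = db.find(role="engineer")         # API call, not a SQL string
```

Note that nothing forces the two documents to share the same fields, which is what makes such stores a fit for semi-structured data.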
Sharding
• Sharding is the process of horizontally partitioning a large
dataset into a collection of smaller, more manageable datasets
called shards.
• The shards are distributed across multiple nodes, where a node
is a server or a machine.
• Each shard
– is stored on a separate node and each node is responsible
for only the data stored on it.
– shares the same schema, and all shards collectively
represent the complete dataset.
Figure 2.5 - An example of sharding where a dataset is spread across Node A and
Node B, resulting in Shard A and Shard B, respectively
Sharding…
• Sharding allows the distribution of processing loads across multiple
nodes to achieve horizontal scalability.
• Horizontal scaling is a method for increasing a system’s capacity by
adding similar or higher capacity resources alongside existing resources.
• Since each node is responsible for only a part of the whole dataset,
read/write times are greatly improved.
• How sharding works in practice:
1. Each shard can independently service reads and writes for the
specific subset of data that it is responsible for.
2. Depending on the query, data may need to be fetched from both
shards.
• A benefit of sharding is that it provides partial tolerance toward failures.
– In case of a node failure, only data stored on that node is affected.
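The mechanics above can be sketched as a toy key-value store. This is a hypothetical illustration, assuming hash-based shard placement (`ShardedStore` and its methods are invented names); real systems also support range-based sharding, rebalancing and failover.

```python
import hashlib

class ShardedStore:
    """Toy horizontal partitioning: each shard holds a subset of keys."""

    def __init__(self, num_shards=2):
        # One dict per shard; each would live on a separate node.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        """Hash the key to decide which node is responsible for it."""
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.shards[int(digest, 16) % len(self.shards)]

    def write(self, key, value):
        self._shard_for(key)[key] = value    # only the owning shard is touched

    def read(self, key):
        return self._shard_for(key).get(key)  # served by a single shard

    def scan(self):
        """A query that must fetch data from all shards and merge it."""
        merged = {}
        for shard in self.shards:
            merged.update(shard)
        return merged
```

`write` and `read` show step 1 (each shard independently services its own subset), while `scan` shows step 2 (some queries must touch every shard).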
Replication
• Replication stores multiple copies of a dataset, known as
replicas, on multiple nodes.
• Replication provides scalability and availability, since the
same data is replicated on various nodes.
• Fault tolerance is also achieved since data redundancy ensures
that data is not lost when an individual node fails.
• There are two different methods that are used to implement
replication:
1. master-slave
2. peer-to-peer
Figure 2.6 - An example of replication where a dataset is replicated to Node A and
Node B, resulting in Replica A and Replica B.
Master-Slave replication
• Nodes are arranged in a master-slave configuration, and all data is
written to a master node.
• Once saved, the data is replicated over to multiple slave nodes.
• All external write requests, including insert, update and delete,
occur on the master node, whereas read requests can be fulfilled by
any slave node.
• It is ideal for read-intensive rather than write-intensive loads,
since growing read demands can be managed by horizontally scaling
to add more slave nodes.
• Writes are consistent, as all writes are coordinated by the master
node.
– write performance will suffer as the amount of writes increases.
• If the master node fails, reads are still possible via any of the slave
nodes.
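The write and read paths described above can be sketched as follows. This is an illustrative model only (`MasterSlaveStore` is an invented name), and it replicates synchronously for simplicity; real systems often replicate to slaves asynchronously, which is what opens the read-inconsistency window discussed next.

```python
import random

class MasterSlaveStore:
    """Toy master-slave replication."""

    def __init__(self, num_slaves=2):
        self.master = {}
        self.slaves = [dict() for _ in range(num_slaves)]

    def write(self, key, value):
        """All writes (insert/update/delete) go through the master."""
        self.master[key] = value
        for slave in self.slaves:        # then replicate to the slaves
            slave[key] = value

    def read(self, key):
        """Reads can be fulfilled by any slave node."""
        return random.choice(self.slaves).get(key)
```

Because every write is coordinated by the single master, writes are consistent, but the master also becomes the bottleneck as write volume grows.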
Figure 2.7 - An example of master-slave replication where Master A is the single point of
contact for all writes, and data can be read from Slave A and Slave B.
• A slave node can be configured as a backup node for the master
node.
• Read inconsistency can be an issue if a slave node is read
prior to an update to the master being copied to it.
• To ensure read consistency, a voting system can be implemented
where a read is declared consistent if the majority of the slaves
contain the same version of the record.
• Implementation of such a voting system requires a reliable and fast
communication mechanism between the slaves.
• Figure 2.8 shows how an inconsistent read can occur:
1. User A updates data.
2. The data is copied over to Slave A by the Master.
3. Before the data is copied over to Slave B, User B tries to read
the data from Slave B, which results in an inconsistent read.
4. The data will eventually become consistent when Slave B is
updated by the Master.
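The voting system mentioned above can be sketched as a majority read. A hypothetical illustration: `voted_read` is an invented helper, the slaves are modeled as plain dicts, and real implementations use versioned records and quorum protocols rather than direct value comparison.

```python
from collections import Counter

def voted_read(key, slaves):
    """Majority-vote read across replicas.

    The read is declared consistent only if a majority of the
    slaves hold the same version of the record.
    """
    values = [slave.get(key) for slave in slaves]
    value, votes = Counter(values).most_common(1)[0]
    if votes > len(slaves) // 2:
        return value                      # majority agrees: consistent
    raise RuntimeError("no majority: read is inconsistent")
```

In the Figure 2.8 scenario, a vote taken between steps 2 and 3 would still return the old value consistently if most slaves had not yet been updated, and the new value once the update had propagated to a majority.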
Figure 2.8 - An example of master-slave replication where read inconsistency occurs.
Peer-to-peer replication
• With peer-to-peer replication, all nodes operate at the same
level.
• In other words, there is not a master-slave relationship
between the nodes.
• Each node, known as a peer, is equally capable of handling
reads and writes.
• Each write is copied to all peers.
Figure 2.9 - Writes are copied to Peers A, B and C simultaneously. Data is read from Peer A,
but it can also be read from Peers B or C.
• Peer-to-peer replication is prone to write inconsistencies that occur as a result of a
simultaneous update of the same data across multiple peers.
• This can be addressed by implementing either a pessimistic or optimistic
concurrency strategy.
– Pessimistic concurrency is a proactive strategy that prevents inconsistency.
• It uses locking to ensure that only one update to a record can occur at a
time. However, this is detrimental to availability since the database record
being updated remains unavailable until all locks are released.
– Optimistic concurrency is a reactive strategy that does not use locking.
Instead, it allows inconsistency to occur with knowledge that eventually
consistency will be achieved after all updates have propagated.
• With optimistic concurrency, peers may remain inconsistent for some period of
time before attaining consistency. However, the database remains available as no
locking is involved.
• Reads can be inconsistent during the time period when some of the peers have
completed their updates while others perform their updates.
• However, reads eventually become consistent when the updates have been
executed on all peers.
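Optimistic concurrency on a single peer can be sketched with version numbers, in the style of compare-and-set. This is an invented illustration (`Peer`, `get`, `put` are hypothetical names): an update only succeeds if the writer saw the latest version, so no locks are taken and the record stays available.

```python
class Peer:
    """Toy optimistic concurrency control on one peer."""

    def __init__(self):
        self.data = {}                       # key -> (value, version)

    def get(self, key):
        """Return (value, version); unknown keys start at version 0."""
        return self.data.get(key, (None, 0))

    def put(self, key, value, expected_version):
        """Apply the update only if no one else updated it first."""
        _, current = self.get(key)
        if current != expected_version:
            return False                     # stale write: caller must retry
        self.data[key] = (value, current + 1)
        return True
```

If two writers both read version 0 and then write, the first `put` succeeds and the second is rejected and must re-read and retry; a pessimistic strategy would instead have blocked the second writer with a lock until the first finished.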
• To ensure read consistency, a voting system can be implemented
where a read is declared consistent if the majority of the peers
contain the same version of the record.
• As previously indicated, implementation of such a voting system
requires a reliable and fast communication mechanism between the
peers.
• The following steps demonstrate a scenario where an inconsistent read occurs.
1. User A updates data.
2. a. The data is copied over to Peer A.
b. The data is copied over to Peer B.
3. Before the data is copied over to Peer C, User B tries to read
the data from Peer C, resulting in an inconsistent read.
4. The data will eventually be updated on Peer C, and the
database will once again become consistent.
Sharding and Replication
• To improve on the limited fault tolerance offered by sharding,
while additionally benefiting from the increased availability
and scalability of replication, both sharding and replication
can be combined.
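The combination can be sketched by nesting the two earlier ideas: the dataset is partitioned into shards, and each shard is then replicated within its own group of nodes. A hypothetical illustration only (`ShardedReplicatedStore` is an invented name); real systems add failover, quorum reads and rebalancing on top of this basic layout.

```python
class ShardedReplicatedStore:
    """Toy sharding combined with replication.

    Each shard group holds several replicas of one shard, so a
    single node failure costs neither data nor availability for
    that shard's portion of the dataset.
    """
    def __init__(self, num_shards=2, replicas_per_shard=2):
        self.groups = [[dict() for _ in range(replicas_per_shard)]
                       for _ in range(num_shards)]

    def _group(self, key):
        """Pick the shard group responsible for this key."""
        return self.groups[hash(key) % len(self.groups)]

    def write(self, key, value):
        for replica in self._group(key):   # replicate within the group
            replica[key] = value

    def read(self, key, replica=0):
        """Read from any replica in the owning shard group."""
        return self._group(key)[replica].get(key)
```

Sharding gives horizontal scalability, and replication inside each shard group restores the fault tolerance that sharding alone only partially provides.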
Figure 2.10 - A comparison of sharding and replication that shows how a dataset is
distributed between two nodes with the different approaches.