09 - Cloud-Enabling Technologies - v2
Enabling Technologies: Storage and Computing
Big data
4Vs of Big Data
[Figure: the four Vs — Volume, Velocity, Variety, Veracity]
Characteristics of Big Data:
1-Scale (Volume)
• Data Volume
• 44x increase from 2009 to 2020
• From 0.8 zettabytes (ZB) to 35 ZB
• Data volume is increasing exponentially
Exponential increase in collected/generated data
Data storage on a cloud
• An ever increasing number of cloud-based services collect detailed data about
their services and information about the users of these services. The service
providers use the clouds to analyze the data.
• Humongous amounts of data - in 2013:
• Internet video will generate over 18 EB/month.
• Global mobile data traffic will reach 2 EB/month.
(1 EB = 10^18 bytes, 1 PB = 10^15 bytes, 1 TB = 10^12 bytes, 1 GB = 10^9 bytes)
Characteristics of Big Data:
2-Complexity (Variety)
• Various formats, types, and structures
• Numerical, text, images, audio, video,
sequences, time series, social media data,
multi-dim arrays, etc…
• A single application can be
generating/collecting many types of data
Characteristics of Big Data:
3-Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Static data ➔ Streaming data
• Online Data Analytics
• Late decisions ➔ missing opportunities
• Examples
• E-Promotions: based on your current location, your purchase history, and what
you like ➔ send promotions right now for the store next to you
Characteristics of Big Data:
4-Uncertainty (Veracity)*
• * Veracity is sometimes added as a fourth V
• In the big data era, data can be in doubt
• Uncertainty due to data inconsistency & incompleteness
• For example, in the concurrency scenario
• Model approximations
Data storage in the age of cloud computing
Major challenges
• The storage system design philosophy has shifted from performance-at-any-cost to reliability-at-the-lowest-possible-cost.
• Important implications on software complexity.
• Maintaining consistency among multiple copies of data records
• increases the data management software complexity
• could negatively affect the storage system performance if data is frequently
updated.
• Sophisticated strategies to reduce the access time for data streaming
and content delivery.
• Data replication allows concurrent access to data from multiple
processors and decreases the chances of data loss.
Data Base Management System (DBMS)
• Database ➔ a collection of logically-related records.
• Data Base Management System (DBMS) ➔ the software that controls the
access to the database.
• Query language ➔ a dedicated programming language used to develop
database applications.
• Most cloud applications do not interact directly with the file system, but
through a DBMS.
• Database models ➔ reflect the limitations of the hardware available at the
time and the requirements of the most popular applications of each period.
• navigational model of the 1960s.
• relational model of the 1970s.
• object-oriented model of the 1980s.
• NoSQL model of the first decade of the 2000s.
Storage requirements of cloud applications
• Most cloud applications are data-intensive and
test the limitations of the existing infrastructure.
Requirements:
• Rapid application development and short time to market.
• Low latency.
• Scalability.
• High availability.
• Consistent view of the data.
• These requirements cannot be satisfied
simultaneously by existing database models; e.g.,
relational databases are easy to use for
application development but do not scale well.
• Joining tables takes time!
CAP Theorem
• A distributed data store can provide at most two of the following three guarantees simultaneously: Consistency, Availability, and Partition tolerance.
https://toppertips-bx67a.ondigitalocean.app/cap-theorem/
CAP Theorem Proof
[Figure: proof sketch — when the network partitions, a write applied on one side cannot be seen by reads on the other side, so the system cannot be both consistent and available.]
Case study: SQL vs NoSQL
ACID property of Relational Database –
SQL
• Atomicity - All changes to data are performed as if they are a single operation. That is, all the
changes are performed, or none of them are.
• For example, in an application that transfers funds from one account to another, the
atomicity property ensures that, if a debit is made successfully from one account, the
corresponding credit is made to the other account.
• Consistency - Data is in a consistent state when a transaction starts and when it ends.
• For example, in an application that transfers funds from one account to another, the
consistency property ensures that the total value of funds in both the accounts is the same
at the start and end of each transaction.
• Isolation - The intermediate state of a transaction is invisible to other transactions. As a result,
transactions that run concurrently appear to be serialized.
• For example, in an application that transfers funds from one account to another, the
isolation property ensures that another transaction sees the transferred funds in one
account or the other, but not in both, nor in neither.
• Durability - After a transaction successfully completes, changes to data persist and are not
undone, even in the event of a system failure.
• For example, in an application that transfers funds from one account to another, the
durability property ensures that the changes made to each account will not be reversed.
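The funds-transfer example above can be sketched in code. This is a minimal illustration using SQLite transactions; the table and account names are invented for the example.

```python
# Sketch of an atomic funds transfer: either both updates commit, or neither does.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Debit src and credit dst as a single transaction (atomicity)."""
    with conn:  # commits on success, rolls back on any exception
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                     (amount, dst))

transfer(conn, "alice", "bob", 30)
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# Consistency: the total value of funds is unchanged by the transaction.
```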
NoSQL databases
• The name NoSQL is misleading. Stonebraker notes that “blinding performance depends
on removing overhead. Such overhead has nothing to do with SQL, it revolves around
traditional implementations of ACID transactions, multi-threading, and disk
management.”
• NoSQL provides fewer assurances than relational databases, but it scales very well
and reacts well to rapid data changes.
• Attributes:
• Scale well.
• Do not exhibit a single point of failure.
• Have built-in support for consensus-based decisions.
• Support partitioning and replication as basic primitives.
BASE property of NoSQL databases
• Basically Available ➔ the system guarantees availability, possibly in a degraded form.
• Soft state ➔ the state of the system may change over time, even without new input.
• Eventually consistent ➔ replicas converge to a consistent state once updates stop arriving.
Four main types of NoSQL databases
• Document databases — the document can vary from record to record. They
store data in JSON (JavaScript Object Notation) or BSON (Binary JSON) data
formats, which provide flexibility in working with data of all types.
• Support structured or semi-structured data
• Key-value stores employ a simple schema, with data stored as a simple
collection of key-value pairs (e.g., Redis).
• Keys are unique, and the value associated with a key can range from simple
primitives to complex objects.
• Wide column stores capture huge volumes of data in a row-and-column
format. While they are considered NoSQL databases, their format makes
them similar to relational databases.
• They differ from relational databases in that every row is not required to have the
same number of columns.
• Wide column stores are often built to handle big data use cases that require
aggregation for queries.
• Graph databases have a fundamentally different structure, in that data
elements and their relationships are stored as a graph (e.g. Amazon
Neptune)
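The first two models above can be illustrated with plain Python structures. This is a toy sketch, not a real client; a deployment would use a system such as Redis or a document database, and all keys and fields here are invented.

```python
# Toy illustration of the key-value and document NoSQL models.
import json

# Key-value store: unique keys mapping to opaque values.
kv_store = {}
kv_store["session:42"] = json.dumps({"user": "alice", "ttl": 3600})

# Document store: records in the same collection may have different fields.
documents = [
    {"_id": 1, "name": "alice", "email": "[email protected]"},
    {"_id": 2, "name": "bob", "tags": ["admin"], "last_login": "2024-01-01"},
]

# No fixed schema: queries must tolerate missing fields.
admins = [d for d in documents if "admin" in d.get("tags", [])]
```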
File and File System
Logical and physical organization of a file
• File ➔ a linear array of cells stored on a persistent storage device. Viewed
by an application as a collection of logical records; the file is stored on a
physical device as a set of physical records, or blocks, of size dictated by the
physical media.
• File pointer ➔ identifies a cell used as the starting point for a read or write
operation.
• The logical organization of a file ➔ reflects the data model, the view of the
data from the perspective of the application.
• The physical organization of a file ➔ reflects the storage model and
describes the manner the file is stored on a given storage media.
File systems
• File system ➔ collection of directories; each directory provides information
about a set of files.
• Traditional – Unix File System.
• Distributed file systems.
• Network File Systems (NFS) - very popular, have been used for some time, but do not
scale well and have reliability problems; an NFS server could be a single point of failure.
• Storage Area Networks (SAN) - allow cloud servers to deal with non-disruptive
changes in the storage configuration. The storage in a SAN can be pooled and
then allocated based on the needs of the servers. A SAN-based implementation
of a file system can be expensive, as each node must have a Fibre Channel
adapter to connect to the network.
• Parallel File Systems (PFS) - scalable, capable of distributing files across a large
number of nodes, with a global naming space. Several I/O nodes serve data to all
computational nodes; it includes also a metadata server which contains
information about the data stored in the I/O nodes. The interconnection network
of a PFS could be a SAN.
Unix File System (UFS)
UFS layering
The UFS layered design separates the physical file structure from the logical one. The upper three layers implement the logical organization, while the lower three implement the physical file structure:
• Symbolic path name layer
• Absolute path name layer
• Path name layer
• Inode layer
• File layer
• Block layer
Network File System (NFS)
• Design objectives:
• Provide the same semantics as a local Unix File System
(UFS) to ensure compatibility with existing applications.
• Facilitate easy integration into existing UFS.
• Ensure that the system will be widely used; thus, support
clients running on different operating systems.
• Accept a modest performance degradation due to remote
access over a network with a bandwidth of several Mbps.
• NFS is based on the client-server paradigm. The client runs on
the local host while the server is at the site of the remote file
system; they interact by means of Remote Procedure Calls
(RPC).
• A remote file is uniquely identified by a file handle (fh) - a 32-
byte internal name.
[Figure: the NFS client-server interaction over the communication network.]
The vnode layer implements file operations in a uniform manner, regardless of
whether the file is local or remote. An operation targeting a local file is
directed to the local file system, while one for a remote file involves NFS;
the NFS client packages the relevant information about the target and sends it
to the NFS server, which passes it to the vnode layer on the remote host and,
in turn, directs it to the remote file system.
Application (API) ➔ NFS client (RPC) ➔ NFS server action
• OPEN(fname, flags, mode) ➔ LOOKUP(dfh, fname) or CREATE(dfh, fname, mode) ➔ look up fname in directory dfh and return fh (the file handle) and file attributes, or create a new file.
• CLOSE(fh) ➔ no RPC; the client removes fh from the open-file table of the process.
• READ(fd, buf, count) ➔ READ(fh, offset, count) ➔ read data from file fh at offset and of length count, and return it.
• RENAME(fromfname, tofname) ➔ RENAME(dfh, fromfname, tofh, tofname) ➔ rename the file.
• LINK(fname, linkname) ➔ LOOKUP(dfh, fname), READLINK(fh), LINK(dfh, fname) ➔ create a link.
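The client-side translation of POSIX-style calls into handle-based RPCs can be sketched as a toy model. This is only an illustration of the stateless, file-handle protocol; class and file names are invented, and the handle here is a fixed-length digest standing in for the 32-byte fh.

```python
# Toy model of the NFS LOOKUP/READ interaction: the server identifies files
# by handles and reads take an explicit offset, so the server keeps no
# per-client open-file state.
import hashlib

class ToyNFSServer:
    def __init__(self):
        self.by_name = {"readme.txt": b"hello nfs"}  # files in one directory
        self.handles = {}                            # fh -> file name

    def lookup(self, fname):
        """LOOKUP(dfh, fname): return a file handle for fname."""
        fh = hashlib.sha256(fname.encode()).hexdigest()[:32]
        self.handles[fh] = fname
        return fh

    def read(self, fh, offset, count):
        """READ(fh, offset, count): stateless read at an explicit offset."""
        data = self.by_name[self.handles[fh]]
        return data[offset:offset + count]

server = ToyNFSServer()
fh = server.lookup("readme.txt")   # the client caches fh for later READs
chunk = server.read(fh, 0, 5)
```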
General Parallel File System (GPFS)
SAN: Storage Area Networks
Case Study: GFS
Google File System (GFS)
I/O throughput lags behind
• Storing all data in one place adds the risk of hardware failures
Google datacenter
• Lots of cheap, commodity PCs, each with disk and CPU
A cool idea! But wait…
• Stuff breaks
• If you have one server, it may stay up 3 years (1,000 days)
• If you have 10k servers, expect to lose 10 a day
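The failure arithmetic above checks out: with a mean lifetime of about 1,000 days per server, a fleet of 10,000 machines loses roughly 10 per day.

```python
# Expected daily failures = fleet size / mean server lifetime in days.
n_servers = 10_000
mean_lifetime_days = 1_000
expected_failures_per_day = n_servers / mean_lifetime_days
# expected_failures_per_day == 10.0
```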
GFS: The Google File System
Target environment
• Thousands of computers
• Distributed
• Computers have their own disks, and the file system spans
those disks
GFS chunks
• GFS files are collections of fixed-size segments called chunks.
• The chunk size is 64 MB; this choice is motivated by the desire to optimize
the performance for large files and to reduce the amount of metadata
maintained by the system.
• A large chunk size increases the likelihood that multiple operations will be
directed to the same chunk; it thus reduces the number of requests to locate
the chunk and, at the same time, allows the application to maintain a
persistent network connection with the server where the chunk is located.
• A chunk consists of 64 KB blocks, and each block has a 32-bit checksum.
• Ensure reliability through replication with 3+ copies.
• Chunks are stored on Linux file systems and are replicated on multiple sites; a user
may change the number of replicas from the standard value of three to any
desired value.
• At the time of file creation each chunk is assigned a unique chunk handle.
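The chunk layout above can be sketched as follows. This is a minimal illustration of the sizes involved, not GFS code: a file splits into 64 MB chunks, each chunk into 64 KB blocks, and each block carries a 32-bit checksum (CRC32 is used here as a stand-in).

```python
# Sketch: split data into 64 MB chunks and checksum each 64 KB block.
import zlib

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB
BLOCK_SIZE = 64 * 1024          # 64 KB

def split_into_chunks(data: bytes):
    """A file is a sequence of fixed-size chunks (the last may be short)."""
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def block_checksums(chunk: bytes):
    """One 32-bit checksum per 64 KB block within a chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

data = b"x" * (BLOCK_SIZE * 3 + 10)   # just over 3 blocks, under one chunk
chunks = split_into_chunks(data)
checksums = block_checksums(chunks[0])
```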
[Figure: a GFS cluster — the application sends the file name & chunk index to the master and receives the chunk handle & data; the master exchanges instructions and state information with the chunk servers over the communication network; chunk data flows directly between the application and the chunk servers.]
• The architecture of a GFS cluster: the master maintains state information about all
system components and controls a number of chunk servers. A chunk server runs
under Linux; the application uses metadata provided by the master to communicate
directly with the chunk servers. Data and control paths are shown separately, data
paths with thick lines and control paths with thin lines. Arrows show the flow of
control between the application, the master, and the chunk servers.
Distributed File System
⚫ Chunk Servers
– File is split into contiguous chunks
– Typically each chunk is 16–64 MB
– Each chunk replicated (usually 2x or 3x)
– Try to keep replicas in different racks
[Figure: chunks C0–C5 and D0–D1 replicated across several chunk servers, with replicas placed in different racks]
• Map-Reduce!
• Will be covered next week.