
Lecture 8

DATA-INTENSIVE TECHNOLOGIES
FOR CLOUD COMPUTING

Date: 26/10/2018

1
TRENDS
 Massive data
 Thousands to millions of cores
 Consolidated data centers
 Shift from the clock-rate battle, to multicore, to many-core…

 Cheap hardware
 Failures are the norm

 VM based systems

 Making systems accessible (easy to use)

 More people requiring large-scale data processing
 Shift from academia to industry…
2
MOVING TOWARDS..
 Distributed File Systems
 HDFS, etc..
 Distributed Key-Value stores
 Data intensive parallel application frameworks
 MapReduce
 High level languages
 Science in the clouds

3
DISTRIBUTED DATA STORAGE
4
CLOUD DATA STORES (NO-SQL)
 Schema-less:
 Shared nothing architecture

 Elasticity

 Sharding

 Asynchronous replication

 BASE instead of ACID

http://nosqlpedia.com/wiki/Survey_distributed_databases
CLOUD DATA STORES (NO-SQL)
 Schema-less:
 “Tables” don’t have a pre-defined schema. Records have a variable number of
fields that can vary from record to record. Record contents and semantics are
enforced by applications (see the sketch at the end of this slide).
 Shared nothing architecture
 Instead of using a common storage pool (e.g., SAN), each server uses only
its own local storage. This allows storage to be accessed at local disk speeds
instead of network speeds, and it allows capacity to be increased by adding
more nodes. Cost is also reduced since commodity hardware can be used.
 Elasticity
 Both storage and server capacity can be added on-the-fly by merely adding
more servers. No downtime is required. When a new node is added, the
database begins giving it something to do and requests to fulfill.
6

http://nosqlpedia.com/wiki/Survey_distributed_databases
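To make the schema-less point concrete, here is a minimal illustrative sketch (in Python, not tied to any particular NoSQL product): records in one logical "table" carry different fields, and an application-level rule supplies the semantics. The field names are invented for the example.

```python
# Illustrative sketch: records in the same logical "table" need not share a
# schema; the application decides which fields are required and how to
# interpret them.

users = [
    {"id": "u1", "name": "Alice", "email": "alice@example.org"},
    {"id": "u2", "name": "Bob", "last_login": "2018-10-20", "tags": ["admin"]},
    {"id": "u3"},  # minimal record; missing fields are simply absent
]

def display_name(record):
    # Application-level rule: fall back to the id when no name is stored.
    return record.get("name", record["id"])

for r in users:
    print(display_name(r), sorted(r.keys()))
```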
CLOUD DATA STORES (NO-SQL)
 Sharding
 Instead of viewing the storage as a monolithic space, records are partitioned into
shards. Usually, a shard is small enough to be managed by a single server, though
shards are usually replicated. Sharding can be automatic (e.g., an existing shard
splits when it gets too big), or applications can assist in data sharding by
assigning each record a partition ID.
 Asynchronous replication
 Compared to RAID storage (mirroring and/or striping) or synchronous
replication, NoSQL databases employ asynchronous replication. This allows
writes to complete more quickly since they don’t depend on extra network traffic.
One side effect of this strategy is that data is not immediately replicated and
could be lost in certain windows. Also, locking is usually not available to protect
all copies of a specific unit of data.
 BASE instead of ACID
 NoSQL databases emphasize performance and availability. This requires
prioritizing the components of the CAP theorem (described elsewhere), which
tends to make true ACID transactions implausible.
7
http://nosqlpedia.com/wiki/Survey_distributed_databases
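As a rough illustration of the sharding bullet above, the sketch below assigns each record to a shard by hashing its partition key. The shard count and key names are assumptions made for this example; real stores also handle shard splitting, replication, and rebalancing when nodes join or leave.

```python
import hashlib

NUM_SHARDS = 8  # assumed number of shards for this sketch

def shard_for(partition_key: str) -> int:
    """Map a record's partition key to one of NUM_SHARDS shards (hash-based sharding)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

for key in ["user:1001", "user:1002", "order:42"]:
    print(key, "-> shard", shard_for(key))
```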
ACID VS BASE

ACID                                   BASE
‹ Strong consistency                   ‹ Weak consistency – stale data OK
‹ Isolation                            ‹ Availability first
‹ Focus on “commit”                    ‹ Best effort
‹ Nested transactions                  ‹ Approximate answers OK
‹ Availability?                        ‹ Aggressive (optimistic)
‹ Conservative (pessimistic)           ‹ Simpler!
‹ Difficult evolution (e.g. schema)    ‹ Faster
                                       ‹ Easier evolution

ACID = Atomicity, Consistency, Isolation, and Durability
BASE = Basically Available, Soft state, Eventual consistency
8


GOOGLE BIGTABLE
 Data Model
 A sparse, distributed, persistent multidimensional sorted map

 Indexed by a row key, column key, and a timestamp


 A table contains column families
 Column keys are grouped into column families
 Row ranges are stored as tablets (sharding)
 Supports single-row transactions
 Uses the Chubby distributed lock service to manage masters and tablet locks
 Based on GFS
 Supports running scripts and MapReduce
(A toy in-memory sketch of this data model follows this slide.)

Fay Chang, et al. “Bigtable: A Distributed Storage System for Structured Data”.
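A toy, in-memory sketch of the data model described above: a sparse map indexed by (row key, column key, timestamp), kept sorted by row key so that row ranges (tablets) can be scanned. The class and method names are hypothetical and only illustrate the model, not Bigtable's actual API.

```python
from bisect import insort

class ToyTable:
    """Sparse map: (row key, column family:qualifier, timestamp) -> value, sorted by row key."""

    def __init__(self):
        self._rows = {}      # row key -> {column key -> [(timestamp, value), ...]}
        self._row_keys = []  # sorted row keys, enabling range scans over row ranges

    def put(self, row, column, timestamp, value):
        if row not in self._rows:
            insort(self._row_keys, row)
            self._rows[row] = {}
        cell = self._rows[row].setdefault(column, [])
        cell.append((timestamp, value))
        cell.sort(reverse=True)  # newest version first

    def get(self, row, column):
        versions = self._rows.get(row, {}).get(column, [])
        return versions[0] if versions else None  # latest (timestamp, value)

    def scan(self, start_row, end_row):
        # Row-range scan: the access pattern that tablets (shards) serve.
        for key in self._row_keys:
            if start_row <= key < end_row:
                yield key, self._rows[key]

t = ToyTable()
t.put("com.cnn.www", "anchor:cnnsi.com", 3, "CNN")
t.put("com.cnn.www", "contents:html", 5, "<html>...</html>")
print(t.get("com.cnn.www", "anchor:cnnsi.com"))  # (3, 'CNN')
```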
AMAZON DYNAMO
Problem: Partitioning
  Technique: Consistent hashing
  Advantage: Incremental scalability

Problem: High availability for writes
  Technique: Vector clocks with reconciliation during reads
  Advantage: Number of versions is decoupled from update rates

Problem: Handling temporary failures
  Technique: Sloppy quorum and hinted handoff
  Advantage: Provides high availability and durability guarantee when some of the replicas are not available

Problem: Recovering from permanent failures
  Technique: Anti-entropy using Merkle trees
  Advantage: Synchronizes divergent replicas in the background

Problem: Membership and failure detection
  Technique: Gossip-based membership protocol and failure detection
  Advantage: Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

10

DeCandia, G., et al. 2007. Dynamo: Amazon’s highly available key-value store. In Proceedings of Twenty-First ACM SIGOPS Symposium on
Operating Systems Principles (Stevenson, Washington, USA, October 14–17, 2007). SOSP ’07. ACM, 205–220.
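The partitioning row above relies on consistent hashing. Below is a minimal sketch of a hash ring with virtual nodes; it omits Dynamo's replication, preference lists, and hinted handoff, and the node names are made up for the example.

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hashing ring in the spirit of Dynamo's partitioning scheme."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets several virtual points on the ring to smooth the load.
        self._ring = sorted((_hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def node_for(self, key: str) -> str:
        # Walk clockwise from the key's hash to the first virtual node.
        idx = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
for k in ["cart:17", "cart:18", "session:9f3"]:
    print(k, "->", ring.node_for(k))
```

Adding or removing a node only remaps the keys adjacent to its virtual points, which is what gives the "incremental scalability" advantage in the table.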
NO-SQL DATA STORES

11

http://nosqlpedia.com/wiki/Survey_distributed_databases
GOOGLE DISK FARM
[Figure: Google's disk farm hardware – early days vs. today]

12
MOTIVATION
 Need for a scalable DFS
 Large distributed data-intensive applications

 High data processing needs

 Performance, Reliability, Scalability and Availability

 More than traditional DFS

13
ASSUMPTIONS –
ENVIRONMENT
 Commodity Hardware
 inexpensive

 Component Failure
 the norm rather than the exception

 TBs of Space
 must support TBs of space

14
DESIGN

 Design factors
 Failures are common (built from inexpensive commodity components)
 Files
 large (multi-GB)
 mutation principally via appending new data
 low-overhead atomicity essential
 Co-design applications and file system API
 Sustained bandwidth more critical than low latency

 File structure
 Files are divided into 64 MB chunks
 Each chunk is identified by a 64-bit handle
 Chunks are replicated (default: 3 replicas)
 Chunks are divided into 64 KB blocks
 Each block has a 32-bit checksum
(A short sketch of this chunk/block arithmetic follows this slide.)
15
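A small sketch of the chunk and block arithmetic above: mapping a byte offset to a chunk index and computing a 32-bit checksum per 64 KB block. CRC32 is used here only for illustration; the slide states that each block carries a 32-bit checksum, not which algorithm GFS uses.

```python
import zlib

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in the design above
BLOCK_SIZE = 64 * 1024          # 64 KB blocks, each protected by a 32-bit checksum

def chunk_index(file_offset: int) -> int:
    """Which chunk a byte offset falls into (computed client-side before asking
    the master for that chunk's handle and replica locations)."""
    return file_offset // CHUNK_SIZE

def block_checksums(chunk_data: bytes):
    """One 32-bit CRC per 64 KB block; a sketch of a chunkserver-side integrity check."""
    return [zlib.crc32(chunk_data[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk_data), BLOCK_SIZE)]

print(chunk_index(200 * 1024 * 1024))              # offset 200 MB -> chunk index 3
print(len(block_checksums(b"x" * (256 * 1024))))   # 256 KB of data -> 4 block checksums
```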
ARCHITECTURE

[Figure: GFS architecture – clients exchange metadata with the master and read/write data directly from chunkservers]

 Master
 Manages namespace/metadata
 Manages chunk creation, replication, placement
 Performs snapshot operation to create duplicate of file or directory
tree
 Performs checkpointing and logging of changes to metadata

 Chunkservers
 Stores chunk data and checksum for each block
 On startup/failure recovery, reports chunks to master
 Periodically reports a subset of its chunks to the master (to detect chunks
that are no longer needed)
16
ARCHITECTURE

 Contact single master


 Obtain chunk locations
 Contact one of chunkservers
 Obtain data
17
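A hedged sketch of the read path on this slide: the client asks the single master only for chunk locations (metadata) and then reads the data from a chunkserver. The classes and method names below are hypothetical stand-ins, not the real GFS or HDFS client API.

```python
CHUNK_SIZE = 64 * 1024 * 1024

class FakeMaster:
    """Hypothetical stand-in for the GFS master: serves metadata only."""
    def __init__(self, table):
        self._table = table  # (filename, chunk index) -> (chunk handle, replica names)
    def lookup(self, filename, chunk_index):
        return self._table[(filename, chunk_index)]

class FakeChunkserver:
    """Hypothetical stand-in for a chunkserver: holds the chunk bytes."""
    def __init__(self, chunks):
        self._chunks = chunks  # chunk handle -> bytes
    def read_chunk(self, handle, offset, length):
        return self._chunks[handle][offset:offset + length]

def read(master, chunkservers, filename, offset, length):
    index = offset // CHUNK_SIZE
    handle, replicas = master.lookup(filename, index)   # 1. metadata from the master
    server = chunkservers[replicas[0]]                   # 2. pick one replica (often the closest)
    return server.read_chunk(handle, offset % CHUNK_SIZE, length)  # 3. data from a chunkserver

cs = FakeChunkserver({"h1": b"hello gfs"})
master = FakeMaster({("/data/log", 0): ("h1", ["cs-1"])})
print(read(master, {"cs-1": cs}, "/data/log", 6, 3))     # b'gfs'
```

The point of the split is that the master never sits on the data path, which keeps a single master from becoming a throughput bottleneck.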
MASTER
 Metadata
 Three types
 File & chunk namespaces
 Mapping from files to chunks
 Locations of chunks’ replicas
 Replicated on multiple remote machines
 Kept in memory

 Operations
 Replica placement
 New chunk and replica creation
 Load balancing
 Unused storage reclamation
IMPLEMENTATION –
CONSISTENCY MODEL
 Relaxed consistency model
 Two types of mutations
 Writes
 Cause data to be written at an application-specified file offset
 Record appends
 Operations that append data to a file
 Cause data to be appended atomically at least once
 Offset chosen by GFS, not by the client

 States of a file region after a mutation


 Consistent
 All clients see the same data, regardless which replicas they read from
 Defined
 consistent + all clients see what the mutation writes in its entirety
 Undefined
 consistent, but it may not reflect what any one mutation has written
 Inconsistent
 Clients see different data at different times

19
IMPLEMENTATION –
LEASES AND MUTATION ORDER
 Master uses leases to maintain a consistent mutation order among
replicas

 Primary is the chunkserver who is granted a chunk lease

 All others containing replicas are secondaries

 Primary defines a mutation order between mutations

 All secondaries follow this order

20
MUTATION OPERATIONS
 Primary replica
 Holds lease assigned by master (60 sec. default)
 Assigns serial order for all mutation operations
performed on replicas

 Write operation
 1-2: client obtains replica locations and identity of
primary replica
 3: client pushes data to replicas (stored in LRU
buffer by chunk servers holding replicas)
 4: client issues update request to primary
 5: primary forwards/performs write request
 6: primary receives replies from replicas
 7: primary replies to client

 Record append operation


 Performed atomically (one byte sequence)
 At-least-once semantics
 Append location chosen by GFS and returned to client
 Extension to step 5:
 If record fits in current chunk: write record and tell replicas the offset
 If record exceeds chunk: pad the chunk, reply to client to use next chunk

21
CONSISTENCY GUARANTEES
[Figure: mutation outcomes across primary and replicas – defined, consistent (but undefined), and inconsistent regions]


 Write
 Concurrent writes may be consistent but undefined
 Write operations that are large or cross chunk boundaries
are subdivided by client into individual writes
 Concurrent writes may become interleaved
 Record append
 Atomically, at-least-once semantics
 Client retries failed operation
 After successful retry, replicas are defined
in region of append but may have
intervening undefined regions
 Application safeguards
 Use record append rather than write
 Insert checksums in record headers to detect fragments
 Insert sequence numbers to detect duplicates

22
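The application safeguards listed above (checksums in record headers, sequence numbers) can be sketched as a small framing format. The format below is invented purely for illustration: a CRC lets a reader skip padding or fragments, and the sequence number lets it drop duplicates produced by at-least-once record appends.

```python
import json
import zlib

def frame(seq, payload: bytes) -> bytes:
    # Invented framing: 4-byte header length, JSON header, then the payload.
    header = json.dumps({"seq": seq, "len": len(payload),
                         "crc": zlib.crc32(payload)}).encode()
    return len(header).to_bytes(4, "big") + header + payload

def parse(buf: bytes, seen):
    hlen = int.from_bytes(buf[:4], "big")
    header = json.loads(buf[4:4 + hlen])
    payload = buf[4 + hlen:4 + hlen + header["len"]]
    if zlib.crc32(payload) != header["crc"]:
        return None          # corrupted fragment or padding: skip it
    if header["seq"] in seen:
        return None          # duplicate from a retried append: skip it
    seen.add(header["seq"])
    return payload

seen = set()
rec = frame(1, b"event-42")
print(parse(rec, seen), parse(rec, seen))   # b'event-42' None
```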
METADATA MANAGEMENT
Logical structure:

pathname          lock    chunk list
/home             read    Chunk4400488, …
/save                     Chunk8ffe07783, …
/home/user/foo    write   Chunk6254ee0, …
/home/user        read    Chunk88f703, …
 Namespace
 Logically a mapping from pathname to chunk list
 Allows concurrent file creation in same directory
 Read/write locks prevent conflicting operations (see the locking sketch after this slide)
 File deletion by renaming to a hidden name; removed during regular scan
 Operation log
 Historical record of metadata changes
 Kept on multiple remote machines
 Checkpoint created when log exceeds threshold
 When checkpointing, switch to new log and create checkpoint in separate thread
 Recovery made from most recent checkpoint and subsequent log
 Snapshot
 Revokes leases on chunks in file/directory
 Log operation
 Duplicate metadata (not the chunks!) for the source
 On first client write to chunk:
 Required for client to gain access to chunk
 Reference count > 1 indicates a duplicated chunk
 Create a new chunk and update chunk list for duplicate

23
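A minimal sketch of the namespace locking scheme above: an operation takes read locks on every ancestor path and a read or write lock on the leaf, so two creations in the same directory only share read locks and therefore do not conflict. The helper below just enumerates the locks an operation would need; actually acquiring them is left out.

```python
def locks_needed(pathname, write=False):
    """Return the (path, lock mode) pairs a namespace operation must hold."""
    parts = pathname.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf_mode = "write" if write else "read"
    return [(p, "read") for p in ancestors] + [(pathname, leaf_mode)]

# Creating /home/user/foo: read locks on /home and /home/user, write lock on the leaf.
print(locks_needed("/home/user/foo", write=True))
# A concurrent creation of /home/user/bar shares only read locks on the prefix,
# so the two operations can proceed at the same time.
print(locks_needed("/home/user/bar", write=True))
```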
CHUNK/REPLICA MANAGEMENT
 Placement
 On chunkservers with below-average disk space utilization
 Limit number of “recent” creations on a chunkserver (since access traffic will follow)
 Spread replicas across racks (for reliability)

 Reclamation
 Chunks become garbage when the file of which they are a part is deleted
 A lazy strategy (garbage collection) is used: no attempt is made to reclaim chunks at the time of deletion
 In periodic “HeartBeat” messages, each chunkserver reports a subset of its current chunks to the master
 Master identifies which reported chunks are no longer accessible (i.e., are garbage)
 Chunkserver reclaims garbage chunks

 Stale replica detection


 Master assigns a version number to each chunk/replica
 Version number incremented each time a lease is granted
 Replicas on failed chunkservers will not have the current version number
 Stale replicas removed as part of garbage collection

24
PERFORMANCE

25
BENEFITS AND LIMITATIONS
 Simple design with single master
 Fault tolerance

 Custom designed

 Only viable in a specific environment

 Limited security

26
SECTOR

27
File system comparison: GFS/HDFS vs. Lustre vs. Sector

Architecture
  GFS/HDFS: cluster-based, asymmetric, parallel
  Lustre: cluster-based, asymmetric, parallel
  Sector: cluster-based, asymmetric, parallel

Communication
  GFS/HDFS: RPC/TCP
  Lustre: network independence
  Sector: UDT

Naming
  GFS/HDFS: central metadata server
  Lustre: central metadata server
  Sector: multiple metadata masters

Synchronization
  GFS/HDFS: write-once-read-many, locks on object leases
  Lustre: hybrid locking mechanism using leases, distributed lock manager
  Sector: general purpose I/O

Consistency and replication
  GFS/HDFS: server-side replication, asynchronous replication, checksum
  Lustre: server-side metadata replication, client-side caching, checksum
  Sector: server-side replication

Fault tolerance
  GFS/HDFS: failure as norm
  Lustre: failure as exception
  Sector: failure as norm

Security
  GFS/HDFS: N/A
  Lustre: authentication, authorization
  Sector: security-server-based authentication, authorization
28
DATA INTENSIVE PARALLEL
PROCESSING FRAMEWORKS
29
MAPREDUCE
 General purpose massive data analysis in brittle
environments
 Commodity clusters
 Clouds

 Efficiency, Scalability, Redundancy, Load Balance, Fault


Tolerance
 Apache Hadoop
 HDFS

 Microsoft DryadLINQ

30
EXECUTION OVERVIEW

31

Source: http://code.google.com/edu/parallel/mapreduce-tutorial.html
MAPREDUCE
1. The MapReduce library in the user program first shards the input files
into M pieces of typically 16 megabytes to 64 megabytes (MB) per
piece. It then starts up many copies of the program on a cluster of
machines.
2. One of the copies of the program is special: the master. The rest are
workers that are assigned work by the master. There are M map tasks
and R reduce tasks to assign. The master picks idle workers and assigns
each one a map task or a reduce task.
3. A worker who is assigned a map task reads the contents of the
corresponding input shard. It parses key/value pairs out of the input data
and passes each pair to the user-defined Map function. The intermediate
key/value pairs produced by the Map function are buffered in memory.
4. Periodically, the buffered pairs are written to local disk, partitioned into
R regions by the partitioning function. The locations of these buffered
pairs on the local disk are passed back to the master, who is responsible
for forwarding these locations to the reduce workers.
32


MAPREDUCE
5. When a reduce worker is notified by the master about these
locations, it uses remote procedure calls to read the buffered data
from the local disks of the map workers. When a reduce worker has
read all intermediate data, it sorts it by the intermediate keys so that
all occurrences of the same key are grouped together. If the amount
of intermediate data is too large to fit in memory, an external sort is
used.
6. The reduce worker iterates over the sorted intermediate data and for
each unique intermediate key encountered, it passes the key and the
corresponding set of intermediate values to the user's Reduce
function. The output of the Reduce function is appended to a final
output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the
master wakes up the user program. At this point, the MapReduce call in the
user program returns back to the user code.
33
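The following is a tiny, single-process sketch of steps 1–7: shard the input, run one map task per shard, partition the intermediate pairs into R regions, then sort each region by key and reduce. It ignores distribution, local disks, and fault tolerance, and reuses the word count example from the next slides.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn, R=2):
    # "Map phase": one map task per input shard, buffering intermediate pairs
    # into R partitions (the role of the partitioning function).
    partitions = [defaultdict(list) for _ in range(R)]
    for shard in inputs:
        for key, value in map_fn(shard):
            partitions[hash(key) % R][key].append(value)
    # "Reduce phase": each reduce task sorts its region by key and calls the
    # user's reduce function once per unique key.
    output = {}
    for region in partitions:
        for key in sorted(region):
            output[key] = reduce_fn(key, region[key])
    return output

def word_map(line):
    for word in line.split():
        yield word, 1

def word_reduce(word, counts):
    return sum(counts)

print(run_mapreduce(["foo car bar", "foo bar foo", "car car car"], word_map, word_reduce))
# e.g. {'bar': 2, 'car': 4, 'foo': 3} (key order may vary)
```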
WORD COUNT
Input:
  foo car bar
  foo bar foo
  car car car

Mapping (one (word, 1) pair per word):
  (foo,1) (car,1) (bar,1)
  (foo,1) (bar,1) (foo,1)
  (car,1) (car,1) (car,1)

Shuffling (group by key):
  foo: 1, 1, 1
  bar: 1, 1
  car: 1, 1, 1, 1

Reducing (sum per key):
  foo, 3
  bar, 2
  car, 4

34
WORD COUNT
Input:
  foo car bar
  foo bar foo
  car car car

Mapping:
  (foo,1) (car,1) (bar,1)
  (foo,1) (bar,1) (foo,1)
  (car,1) (car,1) (car,1)

Shuffling and sorting (grouped and sorted by key):
  bar, <1, 1>
  car, <1, 1, 1, 1>
  foo, <1, 1, 1>

Reducing:
  bar, 2
  car, 4
  foo, 3

35
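The same word count can be run under Hadoop Streaming, which pipes input lines to a mapper on stdin and the sorted intermediate pairs to a reducer, both as plain scripts. The sketch below is hedged: the single-file map/reduce switch and the file name are conventions chosen for this example, not anything mandated by Hadoop.

```python
#!/usr/bin/env python
# wordcount.py (hypothetical name) - Hadoop Streaming style mapper and reducer.
# Streaming feeds lines on stdin and collects tab-separated key/value pairs on
# stdout; the framework performs the shuffle/sort between the two phases.
import sys

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    current, total = None, 0
    for line in sys.stdin:              # reducer input arrives already sorted by key
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Locally it can be smoke-tested with a shell pipeline such as `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where the `sort` step mimics the framework's shuffle/sort.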
HADOOP & DRYADLINQ
Apache Hadoop
[Figure: master node with Job Tracker and Name Node; data/compute nodes run map (M) and
reduce (R) tasks over replicated data blocks stored in HDFS]
 Apache implementation of Google's MapReduce
 The Hadoop Distributed File System (HDFS) manages the data
 Map/Reduce tasks are scheduled based on data locality in HDFS (replicated data blocks)

Microsoft DryadLINQ
[Figure: standard LINQ and DryadLINQ operations are translated by the DryadLINQ compiler into a
Directed Acyclic Graph (DAG) based execution flow, run by the Dryad execution engine; a vertex is
an execution task, an edge is a communication path]
 Dryad processes the DAG, executing vertices on compute clusters
 LINQ provides a query interface for structured data
 Provides Hash, Range, and Round-Robin partition patterns

Both provide job creation, resource management, and fault tolerance through re-execution of failed tasks/vertices.
36

Judy Qiu, “Cloud Technologies and Their Applications”, Indiana University Bloomington, March 26, 2010
Feature comparison: programming model / data storage / communication / scheduling & load balancing

Hadoop
  Programming model: MapReduce
  Data storage: HDFS
  Communication: TCP
  Scheduling & load balancing: data locality, rack-aware dynamic task scheduling through a global queue, natural load balancing

Dryad
  Programming model: DAG based execution flows
  Data storage: Windows shared directories (Cosmos)
  Communication: shared files / TCP pipes / shared memory FIFO
  Scheduling & load balancing: data locality / network-topology-based run-time graph optimizations, static scheduling

Twister
  Programming model: Iterative MapReduce
  Data storage: shared file system / local disks
  Communication: Content Distribution Network / direct TCP
  Scheduling & load balancing: data locality based static scheduling

MapReduceRoles4Azure
  Programming model: MapReduce
  Data storage: Azure Blob Storage
  Communication: TCP through Azure Blob Storage (direct TCP)
  Scheduling & load balancing: dynamic scheduling through a global queue, good natural load balancing

MPI
  Programming model: variety of topologies
  Data storage: shared file systems
  Communication: low latency communication channels
  Scheduling & load balancing: available processing capabilities / user controlled
37
Feature comparison: failure handling / monitoring / language support / execution environment

Hadoop
  Failure handling: re-execution of map and reduce tasks
  Monitoring: web-based monitoring UI, API
  Language support: Java; executables are supported via Hadoop Streaming; PigLatin
  Execution environment: Linux cluster, Amazon Elastic MapReduce, FutureGrid

Dryad
  Failure handling: re-execution of vertices
  Monitoring: monitoring support for execution graphs
  Language support: C# + LINQ (through DryadLINQ)
  Execution environment: Windows HPCS cluster

Twister
  Failure handling: re-execution of iterations
  Monitoring: API to monitor the progress of jobs
  Language support: Java; executables via Java wrappers
  Execution environment: Linux cluster, FutureGrid

MapReduceRoles4Azure
  Failure handling: re-execution of map and reduce tasks
  Monitoring: API, web-based monitoring UI
  Language support: C#
  Execution environment: Windows Azure Compute, Windows Azure Local Development Fabric

MPI
  Failure handling: program-level checkpointing
  Monitoring: minimal support for task-level monitoring
  Language support: C, C++, Fortran, Java, C#
  Execution environment: Linux/Windows cluster
38
INHOMOGENEOUS DATA
PERFORMANCE
[Figure: Randomly distributed inhomogeneous data – mean: 400, dataset size: 10000.
X-axis: standard deviation (0–300); Y-axis: time (s), roughly 1500–1900;
series: DryadLINQ SWG, Hadoop SWG, Hadoop SWG on VM.]

Inhomogeneity of the data does not have a significant effect when the sequence
lengths are randomly distributed.
39
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataPlex (32 nodes)
INHOMOGENEOUS DATA PERFORMANCE
[Figure: Skewed distributed inhomogeneous data – mean: 400, dataset size: 10000.
X-axis: standard deviation (0–300); Y-axis: total time (s), roughly 0–6,000;
series: DryadLINQ SWG, Hadoop SWG, Hadoop SWG on VM.]

This shows the natural load balancing of Hadoop MapReduce's dynamic task
assignment using a global pipeline, in contrast to DryadLINQ's static
assignment.
40
Dryad with Windows HPCS compared to Hadoop with Linux RHEL on iDataPlex (32 nodes)
MAPREDUCEROLES4AZURE

41
SEQUENCE ASSEMBLY PERFORMANCE

42
OTHER ABSTRACTIONS
 All-pairs
 DAG
 Wavefront

43
APPLICATIONS
44
APPLICATION CATEGORIES
1. Synchronous
 Easiest to parallelize. E.g., SIMD
2. Asynchronous
 Entities evolve dynamically in time and with different evolution algorithms.
3. Loosely Synchronous
 Middle ground: dynamically evolving members, synchronized now and then.
E.g., iterative MapReduce
4. Pleasingly Parallel
5. Meta problems
45

GC Fox, et al. Parallel Computing Works. http://www.netlib.org/utk/lsi/pcwLSI/text/node25.html#props


APPLICATIONS

 Bioinformatics
 Sequence alignment
 SmithWaterman-GOTOH (SW-G) all-pairs alignment
 Sequence assembly
 Cap3
 CloudBurst
 Data mining
 MDS, GTM & interpolations

[Figure: all-pairs block decomposition for SW-G alignment – sequences are split into row/column
blocks (1–100, 101–200, 201–300, 301–400, …, N); map tasks M1, M2, …, M(N*(N+1)/2) compute the
block-pair alignments, and Reduce task i collects row block i into hdfs://.../rowblock_i.out.
A task-enumeration sketch follows this slide.]

46
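A small sketch of how the block-pair tasks in the figure above can be enumerated: with N row/column blocks, the symmetric all-pairs computation needs N*(N+1)/2 block-pair map tasks (only the lower triangle is computed) and one reduce output per row block. The output naming simply reuses the elided hdfs://.../rowblock_<i>.out pattern from the figure.

```python
def all_pairs_tasks(num_blocks):
    """Enumerate (row block, column block) map tasks and per-row reduce outputs."""
    map_tasks = [(i, j) for i in range(1, num_blocks + 1) for j in range(1, i + 1)]
    reduce_outputs = {i: f"hdfs://.../rowblock_{i}.out" for i in range(1, num_blocks + 1)}
    return map_tasks, reduce_outputs

tasks, outputs = all_pairs_tasks(4)
print(len(tasks))   # 10 block pairs = 4*5/2
print(tasks)        # [(1, 1), (2, 1), (2, 2), (3, 1), ...]
print(outputs[2])   # hdfs://.../rowblock_2.out
```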
WORKFLOWS
 Represent and manage complex distributed scientific
computations
 Composition and representation
 Mapping to resources (data as well as compute)
 Execution and provenance capturing

 Type of workflows
 Sequence of tasks, DAGs, cyclic graphs, hierarchical
workflows (workflows of workflows)
 Data Flows vs Control flows
 Interactive workflows

47
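As a minimal illustration of DAG-style workflows, the sketch below declares tasks with data dependencies and runs them in topological order. The task names are invented; real workflow systems such as Pegasus, DAGMan, or LEAD add resource mapping, data staging, provenance capture, and retry on failure on top of this core idea.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Task -> set of tasks it depends on (a DAG, not a cyclic graph).
workflow = {
    "fetch_reads": set(),
    "fetch_reference": set(),
    "align": {"fetch_reads", "fetch_reference"},
    "assemble": {"align"},
    "publish": {"assemble"},
}

def run(task_name):
    print("running", task_name)   # placeholder for submitting the task to a resource

# Execute tasks in an order that respects every dependency edge.
for task in TopologicalSorter(workflow).static_order():
    run(task)
```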
LEAD – LINKED ENVIRONMENTS FOR
DYNAMIC DISCOVERY
 Based on WS-BPEL and
SOA infrastructure

48
PEGASUS AND DAGMAN
 Pegasus
 Resource and data discovery
 Maps computation to resources
 Orchestrates data transfers
 Publishes results
 Graph optimizations

 DAGMan
 Submits tasks to execution resources
 Monitors the execution
 Retries in case of failure
 Maintains dependencies
49
CONCLUSION
 Scientific analysis is moving more and more towards Clouds
and related technologies
 Many cutting-edge technologies are available in industry that we can
use to facilitate data-intensive computing.

 Motivation
 Developing easy-to-use efficient software frameworks to
facilitate data intensive computing

50
 Thank You !!!

51
