Fillatre Big Data

This document discusses tools for Big Data in astronomy. It begins with an introduction to Big Data and the Hadoop ecosystem, including HDFS for distributed storage and MapReduce for distributed processing. It then discusses specific Hadoop tools like Pig, Hive, and Spark. It provides an example of using MapReduce for image coaddition in astronomy. Overall, the document outlines how Hadoop and its ecosystem can provide scalable tools for storing, processing, and analyzing the large datasets common in modern astronomy.


Outils informatiques pour le Big Data

en astronomie
Lionel Fillatre
Université Nice Sophia Antipolis
Polytech Nice Sophia
Laboratoire I3S

École d'été thématique CNRS BasMatI


1
3 juin 2015
Outline
 What is Big Data (including the Hadoop Ecosystem)
 HDFS (Hadoop Distributed File System)
 What is MapReduce?
 Image Coaddition with MapReduce
 What is NoSQL?
 What is Pig?
 What is Hive?
 What is Spark?
 Conclusion

2
What is Big Data?

3
Big Data Definition
 No single standard definition…

“Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

4
Characteristics of Big Data:
1-Scale (Volume)

 Data Volume
 44x increase from 2009 to 2020
 From 0.8 zettabytes to 35 zettabytes
 Data volume is increasing exponentially

Exponential increase in
collected/generated data

5
Characteristics of Big Data:
2-Complexity (Variety)

 Various formats, types, and structures


 Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dim arrays, etc…
 Static data vs. streaming data
 A single application can be
generating/collecting many types of
data

To extract knowledge
 all these types of data need to be linked together

6
Characteristics of Big Data:
3-Speed (Velocity)
 Data is generated fast and needs to be processed fast
 Online Data Analytics
 Late decisions → missing opportunities
 Examples
 E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
 Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurement requires an immediate reaction

7
Some Make it 5V’s

8
What technology for Big Data?

9
10
11
12
Hadoop Origins
 Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
 Hadoop is an open-source implementation of Google MapReduce and the Google File System (GFS).
 Hadoop fulfills the need for a common infrastructure:
 Efficient, reliable, easy to use,
 Open Source, Apache License.

13
Hadoop Ecosystem (main elements)

14
Data Storage
 Storage capacity has grown exponentially but read
speed has not kept up
 1990:
 Store 1,400 MB
 Transfer speed of 4.5MB/s
 Read the entire drive in ~ 5 minutes
 2010:
 Store 1 TB
 Transfer speed of 100MB/s
 Read the entire drive in ~ 3 hours
 Hadoop - 100 drives working at the same time can
read 1TB of data in 2 minutes
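The read times quoted above follow from quick arithmetic; a minimal sketch (the 1 TB = 1,000,000 MB and 100 MB/s figures are the slide's own; the function name is illustrative):

```python
# Sanity-check the read times quoted above.
# Assumes 1 TB = 1_000_000 MB and a 100 MB/s transfer speed per drive.

def read_time_seconds(total_mb, drives, mb_per_sec):
    """Time to read total_mb when it is spread evenly across `drives` disks."""
    return (total_mb / drives) / mb_per_sec

single = read_time_seconds(1_000_000, 1, 100)      # one 2010-era drive
parallel = read_time_seconds(1_000_000, 100, 100)  # 100 drives in parallel

print(single / 3600)  # ~2.8 hours, the "~3 hours" above
print(parallel / 60)  # ~1.7 minutes, the "2 minutes" above
```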

15
Hadoop Cluster
 A set of "cheap" commodity hardware
 No need for super-computers, use commodity unreliable hardware
 Not desktops
 Networked together
 May reside in the same location
– Set of servers in a set of racks in a data center

16
Scale-Out Instead of Scale-Up
 It is harder and more expensive to scale-up
 Add additional resources to an existing node (CPU, RAM)
 Moore’s Law can’t keep up with data growth
 New units must be purchased if required resources cannot be added
 Also known as scale vertically
 Scale-Out
 Add more nodes/machines to an existing distributed application
 Software layer is designed for node additions or removal
 Hadoop takes this approach - A set of nodes are bonded together as a
single distributed system
 Very easy to scale down as well

17
Code to Data
 Traditional data processing architecture
 Nodes are broken up into separate processing and storage nodes
connected by high-capacity link
 Many data-intensive applications are not CPU-bound, so moving data across the network becomes the bottleneck

18
Code to Data
 Hadoop co-locates processors and storage
 Code is moved to data (size is tiny, usually in KBs)
 Processors execute code and access underlying local storage

19
Failures are Common
 Given a large number of machines, failures are common
 Large warehouses may see machine failures weekly or even daily
 Hadoop is designed to cope with node failures
 Data is replicated
 Tasks are retried

20
Comparison to RDBMS
 Relational Database Management Systems
(RDBMS) for batch processing
 Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
 Hadoop doesn’t fully replace relational products; many
architectures would benefit from both Hadoop and a Relational
product
 RDBMS products scale up
 Expensive to scale for larger installations
 Hits a ceiling when storage reaches 100s of terabytes
 Structured Relational vs. Semi-Structured vs. Unstructured
 Hadoop was not designed for real-time or low latency queries

21
HDFS
(Hadoop Distributed File System)

22
HDFS
 Appears as a single disk
 Runs on top of a native filesystem
 Fault Tolerant
 Can handle disk crashes, machine crashes, etc...
 Based on Google's Filesystem (GFS or GoogleFS)

23
HDFS is Good for...
 Storing large files
 Terabytes, Petabytes, etc...
 Millions rather than billions of files
 100MB or more per file
 Streaming data
 Write once and read-many times patterns
 Optimized for streaming reads rather than random reads
 “Cheap” Commodity Hardware
 No need for super-computers, use less reliable commodity hardware

24
HDFS is not so good for...

 Low-latency reads
 High-throughput rather than low latency for small chunks of data
 HBase addresses this issue
 Large amounts of small files
 Better for millions of large files instead of billions of small files
 For example each file can be 100MB or more
 Multiple Writers
 Single writer per file
 Writes only at the end of a file; no support for arbitrary offsets

25
HDFS Daemons

26
Files and Blocks

27
HDFS File Write

28
HDFS File Read

29
What is MapReduce?

30
Hadoop MapReduce
 Model for processing large amounts of data in
parallel
 On commodity hardware
 Lots of nodes
 Derived from functional programming
 Map and reduce functions
 Can be implemented in multiple languages
 Java, C++, Ruby, Python, etc.

31
Hadoop MapReduce History

32
Main principle
 Map: ( f, [a, b, c, ...] ) -> [ f(a), f(b), f(c), ... ]
 Apply a function to all the elements of a list
 ex.: map((f: x -> x + 1), [1, 2, 3]) = [2, 3, 4]
 Intrinsically parallel

 Reduce: ( g, [a, b, c, ...] ) -> g(a, g(b, g(c, ... )))
 Apply a function to a list recursively
 ex.: reduce(sum, [1, 2, 3, 4]) = sum(1, sum(2, sum(3, 4)))

 Purely functional
 No global variables, no side effects

33
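The two primitives above map directly onto Python's built-in map and functools.reduce; a minimal illustration using the slide's own examples:

```python
from functools import reduce

# Map: apply a function to every element of a list.
mapped = list(map(lambda x: x + 1, [1, 2, 3]))
print(mapped)  # [2, 3, 4]

# Reduce: apply a function recursively, i.e. sum(1, sum(2, sum(3, 4))).
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
print(total)  # 10
```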
WordCount example

34
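The WordCount example can be sketched as a single-process simulation of the three phases (map, shuffle, reduce); the function names here are illustrative, not the Hadoop API, and the real framework runs each phase distributed across nodes:

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for each word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle-and-sort does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "big data"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2, 'clusters': 1}
```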
MapReduce Framework
 Takes care of distributed processing and coordination
 Scheduling
 Jobs are broken down into smaller chunks called tasks.
 These tasks are scheduled.
 Task localization with Data
 Framework strives to place tasks on the nodes that host the
segment of data to be processed by that specific task
 Code is moved to where the data is

35
MapReduce Framework
 Error Handling
 Failures are an expected behavior so tasks are automatically re-tried
on other machines
 Data Synchronization
 Shuffle and Sort barrier re-arranges and moves data between
machines
 Input and output are coordinated by the framework

36
Map Reduce 2.0 on YARN
 Yet Another Resource Negotiator (YARN)
 Various applications can run on YARN
 MapReduce is just one choice (the main choice at this point)
 http://wiki.apache.org/hadoop/PoweredByYarn

37
YARN Cluster

38
YARN: Running an Application

39
YARN: Running an Application

40
YARN: Running an Application

41
YARN: Running an Application

42
YARN: Running an Application

43
YARN and MapReduce
 YARN does not know or care what kind of application it is
running
 MapReduce uses YARN
 Hadoop includes a MapReduce ApplicationMaster to manage
MapReduce jobs
 Each MapReduce job is an instance of an application

44
Running a MapReduce2 Application

45
Running a MapReduce2 Application

46
Running a MapReduce2 Application

47
Running a MapReduce2 Application

48
Running a MapReduce2 Application

49
Running a MapReduce2 Application

50
Running a MapReduce2 Application

51
Running a MapReduce2 Application

52
Running a MapReduce2 Application

53
Image Coaddition with
MapReduce

54
What is Astronomical Survey Science
from Big Data point of view ?
 Surveys gather millions of images, requiring TBs/PBs of storage.
 Require high-throughput data reduction pipelines.
 Require sophisticated off-line data analysis tools.
 The following example is extracted from:
Wiley K., Connolly A., Gardner J., Krughoff S., Balazinska M., Howe B., Kwon Y., Bu Y., “Astronomy in the Cloud: Using MapReduce for Image Co-Addition,” Publications of the Astronomical Society of the Pacific, 2011, vol. 123, no. 901, pp. 366-380.

55
FITS (Flexible Image Transport System)
 An image format that knows where it is looking.
 Common astronomical image representation file format.
 Metadata tags (like EXIF):
 Most importantly: Precise astrometry (position on sky)
 Other:
 Geolocation (telescope location)
 Sky conditions, image quality, etc.

56
Image Coaddition
 Given multiple partially overlapping images and a query (color and sky bounds):
 Find images’ intersections with the query bounds.
 Project bitmaps to the bounds.
 Stack and mosaic into a final product.

57
Image Stacking (Signal Averaging)
 Stacking improves SNR: makes fainter objects visible.
 Example (SDSS, Stripe 82):
 Top: Single image, R-band
 Bottom: 79-deep stack (~9x SNR improvement)
 Variable conditions (e.g., atmosphere, PSF, haze) mean stacking algorithm complexity can exceed a mere sum.

58
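The ~9x figure matches the sqrt(N) noise reduction expected from averaging N = 79 exposures. A small numerical illustration of that scaling (not the paper's coaddition pipeline; the flux and noise values are invented):

```python
import random
import statistics

random.seed(0)
true_flux = 10.0   # assumed pixel signal
sigma = 5.0        # assumed per-exposure noise

def exposure():
    # One noisy measurement of the same pixel.
    return true_flux + random.gauss(0, sigma)

# Estimate the noise of a single exposure vs. a 79-deep stack empirically.
singles = [exposure() for _ in range(10_000)]
stacks = [statistics.mean(exposure() for _ in range(79)) for _ in range(10_000)]

ratio = statistics.stdev(singles) / statistics.stdev(stacks)
print(ratio)  # ~8.9, i.e. roughly sqrt(79)
```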
Advantages of MapReduce
 High-level problem description. No effort spent on
internode communication, message-passing, etc.
 Programmed in Java (accessible to most science researchers,
not just computer scientists and engineers).
 Runs on cheap commodity hardware, potentially in the
cloud, e.g., Amazon’s EC2.
 Scalable: 1000s of nodes can be added to the cluster with no
modification to the researcher’s software.
 Large community of users/support.

59
Coaddition in Hadoop

60
What is NoSQL?

61
What is NoSQL?
 Stands for Not Only SQL
 Class of non-relational data storage systems
 Usually do not require a fixed table schema nor do they use the
concept of joins
 All NoSQL offerings relax one or more of the ACID properties
(CAP theorem)
 For data storage, an RDBMS cannot be the be-all/end-all
 Just as there are different programming languages, need to have
other data storage tools in the toolbox
 A NoSQL solution is more acceptable to a client now

62
The CAP Theorem

Theorem: You can have at most two of these properties for any shared-data system:
 Consistency
 Availability
 Partition tolerance

63
The CAP Theorem
 Consistency: once a writer has written, all readers will see that write

64
Consistency
 Two kinds of consistency:
 strong consistency – ACID (Atomicity Consistency
Isolation Durability)
 weak consistency – BASE (Basically Available Soft-state
Eventual consistency)
• Basically Available: The database system always seems to work!
• Soft State: It does not have to be consistent all the time.
• Eventually Consistent: The system will eventually become
consistent when the updates propagate, in particular, when there
are not too many updates.

65
The CAP Theorem

 Availability: the system is available during software and hardware upgrades and node failures

66
Availability
 A guarantee that every request receives a response about whether
it succeeded or failed.
 Traditionally thought of as the server/process being available 99.999% of the time (“five nines”).
 However, for large node system, at almost any point in time
there’s a good chance that a node is either down or there is a
network disruption among the nodes.

67
The CAP Theorem

 Partition tolerance: the system can continue to operate in the presence of network partitions

68
Failure is the rule
 Amazon:
 Datacenter with 100,000 disks
 From 6,000 to 10,000 disks fail per year (roughly 25 disks per day)
 Sources of failures are numerous:
 Hardware (disk)
 Network
 Power
 Software
 Software and OS updates.
69
The CAP Theorem

70
Different Types of NoSQL Systems
• Distributed Key-Value Systems - look up a single value for a key
• Amazon’s Dynamo
• Document-based Systems - access data by key or by search of “document” data
• CouchDB
• MongoDB
• Column-based Systems
• Google’s BigTable
• HBase
• Facebook’s Cassandra
• Graph-based Systems - use a graph structure
• Google’s Pregel
• Neo4j
71
Key-Value Pair (KVP) Stores
“Value” is stored as a “blob”
• Without caring or knowing what is inside
• Application is responsible for understanding the data

In simple terms, a NoSQL Key-Value store is a single table with two columns: one
being the (Primary) Key, and the other being the Value.

72
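In code, such a store is little more than a map from keys to opaque blobs; a minimal sketch (the key format and payload are invented for illustration):

```python
import json

# One "table", two "columns": key -> blob.
store = {}

# The store neither cares nor knows what is inside the value.
store["user:42"] = json.dumps({"name": "Ada", "cart": ["book"]}).encode()

# Only get/put by key is possible; the application must decode the blob
# itself, and there is no query over the value's contents.
blob = store["user:42"]
print(json.loads(blob)["name"])  # Ada
```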
Document storage
• Records within a single table can have different structures; each record may have a different schema.
• An example record from Mongo, using JSON format, might look like:
{
  "_id" : ObjectId("4fccbf281168a6aa3c215443"),
  "first_name" : "Thomas",
  "last_name" : "Jefferson",
  "address" : {
    "street" : "1600 Pennsylvania Ave NW",
    "city" : "Washington",
    "state" : "DC"
  }
}
(the "address" member is an embedded object)
• Records are called documents.
• You can also modify the structure of any document on the fly by adding and removing members from the document.
• Unlike simple key-value stores, both keys and values are fully searchable in document databases.
73
Column-based Stores
• Based on Google’s BigTable store:
• Each record = (row:string, column:string, time:int64)
• Distributed data storage, especially versioned data (time-stamps).
• What is a column-based store? - Data tables are stored as sections of
columns of data, rather than as rows of data.

74
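A minimal sketch of the row vs. column layouts described above (the table contents are invented):

```python
# Row-based layout: one record per row, all attributes together.
rows = [
    {"id": 1, "name": "M31", "mag": 3.5},
    {"id": 2, "name": "M42", "mag": 4.5},
]

# Column-based layout: one array per column.
columns = {
    "id": [1, 2],
    "name": ["M31", "M42"],
    "mag": [3.5, 4.5],
}

# An aggregate over one attribute scans a single contiguous array
# instead of touching every record.
print(sum(columns["mag"]) / len(columns["mag"]))  # 4.0
```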
Graph Database
• Apply graph theory in the storage of information about the relationship
between entries

• A graph database is a database that uses graph structures with nodes,


edges, and properties to represent and store data.

• In general, graph databases are useful when you are more interested in
relationships between data than in the data itself:
• for example, in representing and traversing social networks,
generating recommendations, or conducting forensic investigations
(e.g. pattern detection).

75
Example

76
What is Pig?

77
Pig
 In brief:
“is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure
for evaluating these programs.”

 Top Level Apache Project


 http://pig.apache.org

 Pig is an abstraction on top of Hadoop
 Provides a high-level programming language designed for data processing
 Scripts are converted into MapReduce jobs and executed on Hadoop clusters

 Pig is widely accepted and used
 Yahoo!, Twitter, Netflix, etc...
 At Yahoo!, 70% of MapReduce jobs are written in Pig

78
Disadvantages of Raw MapReduce
1. Extremely rigid data flow: Map then Reduce. Other flows (join, union, split, chains) are constantly hacked in.
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
• Resulting code is difficult to reuse and maintain; shifts focus and attention away from data analysis

79
Pig and MapReduce
 MapReduce requires programmers
 Must think in terms of map and reduce functions
 More than likely will require Java programmers
 Pig provides high-level language that can be used by
 Analysts
 Data Scientists
 Statisticians
 Etc...
 Originally implemented at Yahoo! to allow analysts to
access data

80
Pig’s Features
 Main operators:
 Join Datasets
 Sort Datasets
 Filter
 Data Types
 Group By
 User Defined Functions
 Etc..

 Example:
>movies = LOAD '/home/movies_data.csv' USING PigStorage(',') as
(id,name,year,rating,duration);
>movies_greater_than_four = FILTER movies BY (float)rating>4.0;
>DUMP movies_greater_than_four;
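For comparison, the same filter written in plain Python (the records are inlined here rather than read from the /home/movies_data.csv of the Pig script, and the movie rows are invented):

```python
import csv
import io

# Inlined stand-in for the CSV file loaded by the Pig script above.
data = io.StringIO(
    "1,The Matrix,1999,4.5,136\n"
    "2,Plan 9,1959,2.1,79\n"
)

# Same schema as the Pig LOAD ... AS (...) clause.
fields = ["id", "name", "year", "rating", "duration"]
movies = [dict(zip(fields, row)) for row in csv.reader(data)]

# Equivalent of: FILTER movies BY (float)rating > 4.0;
movies_greater_than_four = [m for m in movies if float(m["rating"]) > 4.0]
for m in movies_greater_than_four:
    print(m["name"])  # The Matrix
```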

81
What is Hive?

82
Hive
 Data Warehousing Solution built on top of Hadoop
 Provides SQL-like query language named HiveQL
 Minimal learning curve for people with SQL expertise
 Data analysts are target audience
 Early Hive development work started at Facebook in 2007
 Today Hive is an Apache project under Hadoop
 http://hive.apache.org

83
Advantages and Drawbacks
 Hive provides
 Ability to bring structure to various data formats
 Simple interface for ad hoc querying, analyzing and summarizing large
amounts of data
 Access to files on various data stores such as HDFS and HBase

 Hive does not provide
 Low-latency or real-time queries
 Even querying small amounts of data may take minutes
 Designed for scalability and ease-of-use rather than low-latency responses

84
Hive
 Translates HiveQL statements into a set of MapReduce Jobs which are
then executed on a Hadoop Cluster

85
What is Spark?

86
A Brief History: Spark

87
A general view of Spark

88
Current programming models
 Current popular programming models for clusters transform data flowing from stable storage to stable storage
 E.g., MapReduce: Input → Map tasks → Reduce tasks → Output
 Benefit of data flow: the runtime can decide where to run tasks and can automatically recover from failures

89
MapReduce I/O

90
Spark
 Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:
 Iterative algorithms (many in machine learning)
 Interactive data mining tools (R, Excel, Python)
 Spark makes working sets a first-class concept to efficiently support these apps.
91
Goal: Sharing at Memory Speed

92
Resilient Distributed Dataset (RDD)
 Provide distributed memory abstractions for clusters to support apps with working sets.
 Retain the attractive properties of MapReduce:
 Fault tolerance (for crashes & stragglers)
 Data locality
 Scalability
 Solution: augment the data flow model with “resilient distributed datasets” (RDDs)

93
Programming Model with RDD
 Resilient distributed datasets (RDDs)
 Immutable collections partitioned across a cluster that can be rebuilt if a partition is lost
 Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
 Can be cached across parallel operations

 Parallel operations on RDDs
 Reduce, collect, count, save, …

 Restricted shared variables
 Accumulators, broadcast variables

94
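The lineage idea behind RDDs can be sketched as a toy in a few lines, not the Spark API: each transformation records how to recompute its data from stable storage, so a lost partition can be rebuilt rather than replicated (class and method names here are invented for illustration):

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, rebuilt from lineage on demand."""

    def __init__(self, source, transform=lambda x: x):
        self.source = source        # stable storage (here: a plain list)
        self.transform = transform  # lineage: how to rebuild each element

    def map(self, f):
        # Returns a NEW ToyRDD; the original is never mutated.
        prev = self.transform
        return ToyRDD(self.source, lambda x: f(prev(x)))

    def collect(self):
        # (Re)compute from the source through the recorded lineage.
        return [self.transform(x) for x in self.source]

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
print(rdd.collect())  # [11, 21, 31]
```

Because `collect` recomputes from `source` every time, losing the computed values costs nothing but CPU; this is the fault-tolerance property the slide attributes to lineage.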
Example: Logistic Regression
 Goal: find best line separating two sets of points

random initial line

target

95
Logistic Regression (SCALA Code)
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

96
Conclusion

97
Conclusion
 Data storage needs are rapidly increasing
→ Hadoop has become the de facto standard for handling these massive data sets.
 Storage of Big Data requires new storage models
→ NoSQL solutions.
 Parallel processing of Big Data requires a new programming paradigm
→ the MapReduce programming model.
 “Big data” is moving beyond one-pass batch jobs, to low-latency apps that need data sharing
→ Apache Spark is an alternative solution.
98
