Big Data in Astronomy
Lionel Fillatre
Université Nice Sophia Antipolis
Polytech Nice Sophia
Laboratoire I3S
What is Big Data?
Big Data Definition
No single standard definition…
Characteristics of Big Data:
1 - Scale (Volume)
Data volume is increasing exponentially:
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 ZB
Characteristics of Big Data:
2 - Complexity (Variety)
To extract knowledge, all these types of data need to be linked together
Characteristics of Big Data:
3 - Speed (Velocity)
Data is generated fast and needs to be processed fast
Online Data Analytics
Late decisions mean missed opportunities
Example
E-Promotions: based on your current location, your purchase history and what you like,
send promotions right now for the store next to you
Some Make it 5V’s
What technology for Big Data?
Hadoop Origins
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
Hadoop is an open-source implementation of Google MapReduce and the Google File System (GFS).
Hadoop fulfills the need for a common infrastructure:
Efficient, reliable, easy to use
Open source, Apache License
Hadoop Ecosystem (main elements)
Data Storage
Storage capacity has grown exponentially, but read speed has not kept up
1990:
Store 1,400 MB
Transfer speed of 4.5 MB/s
Read the entire drive in ~5 minutes
2010:
Store 1 TB
Transfer speed of 100 MB/s
Read the entire drive in ~3 hours
Hadoop: 100 drives working at the same time can read 1 TB of data in ~2 minutes
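The arithmetic behind these figures, using the transfer speeds above: reading $1\,\text{TB} = 10^6\,\text{MB}$ from one drive takes $10^6\,\text{MB} / 100\,\text{MB/s} = 10^4\,\text{s} \approx 2.8\,\text{h}$, while striping the terabyte across 100 drives reading in parallel gives $10^4\,\text{s} / 100 = 100\,\text{s} \approx 2\,\text{min}$.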
Hadoop Cluster
A set of "cheap" commodity hardware
No need for super-computers; use commodity, unreliable hardware
Not desktops
Networked together
May reside in the same location
Set of servers in a set of racks in a data center
Scale-Out Instead of Scale-Up
Scale-Up (also known as scaling vertically)
Add additional resources to an existing node (CPU, RAM)
It is harder and more expensive to scale up
Moore's Law can't keep up with data growth
New units must be purchased if the required resources cannot be added
Scale-Out (also known as scaling horizontally)
Add more nodes/machines to an existing distributed application
The software layer is designed for node additions or removals
Hadoop takes this approach: a set of nodes is bonded together as a single distributed system
Very easy to scale down as well
Code to Data
Traditional data processing architecture
Nodes are broken up into separate processing and storage nodes connected by a high-capacity link
Many data-intensive applications are not CPU-demanding, which makes the network the bottleneck
Code to Data
Hadoop co-locates processors and storage
Code is moved to the data (code size is tiny, usually a few KB)
Processors execute the code and access the underlying local storage
Failures are Common
Given a large number of machines, failures are common
Large warehouses may see machine failures weekly or even daily
Hadoop is designed to cope with node failures
Data is replicated
Tasks are retried
Comparison to RDBMS
Relational Database Management Systems (RDBMS) for batch processing
Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
Hadoop doesn't fully replace relational products; many architectures would benefit from both Hadoop and a relational product
RDBMS products scale up
Expensive to scale for larger installations
Hits a ceiling when storage reaches 100s of terabytes
Structured relational vs. semi-structured vs. unstructured data
Hadoop was not designed for real-time or low-latency queries
HDFS (Hadoop Distributed File System)
HDFS
Appears as a single disk
Runs on top of a native filesystem
Fault Tolerant
Can handle disk crashes, machine crashes, etc...
Based on Google's Filesystem (GFS or GoogleFS)
HDFS is Good for...
Storing large files
Terabytes, petabytes, etc.
Millions rather than billions of files
100 MB or more per file
Streaming data
Write-once, read-many-times access pattern
Optimized for streaming reads rather than random reads
"Cheap" commodity hardware
No need for super-computers; use less reliable commodity hardware
HDFS is Not So Good for...
Low-latency reads
High throughput rather than low latency for small chunks of data
HBase addresses this issue
Large amounts of small files
Better for millions of large files than for billions of small files
For example, each file can be 100 MB or more
Multiple writers
Single writer per file
Writes only at the end of the file; no support for writing at arbitrary offsets
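A minimal sketch of this write-once, read-many pattern, using the standard Hadoop FileSystem API from Scala (the path and file contents are hypothetical):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsSketch extends App {
  val fs = FileSystem.get(new Configuration())  // picks up core-site.xml
  val path = new Path("/data/example.txt")      // hypothetical path
  val out = fs.create(path)                     // single writer, append-only stream
  out.write("hello hdfs\n".getBytes("UTF-8"))
  out.close()                                   // after close, the file is read-only
  val in = fs.open(path)                        // any number of readers can stream it
  println(scala.io.Source.fromInputStream(in).mkString)
  in.close()
}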
HDFS Daemons
Files and Blocks
HDFS File Write
HDFS File Read
What is MapReduce?
Hadoop MapReduce
Model for processing large amounts of data in parallel
On commodity hardware
Lots of nodes
Derived from functional programming
Map and reduce functions
Can be implemented in multiple languages
Java, C++, Ruby, Python, etc.
Hadoop MapReduce History
Main principle
Map: (f, [a, b, c, ...]) -> [f(a), f(b), f(c), ...]
Apply a function to all the elements of a list
e.g.: map((f: x -> x + 1), [1, 2, 3]) = [2, 3, 4]
Intrinsically parallel
Purely functional
No global variables, no side effects
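A minimal Scala sketch of this principle (the function name inc is illustrative):

val inc: Int => Int = x => x + 1
List(1, 2, 3).map(inc)        // List(2, 3, 4)
// No shared state is touched, so elements can be processed in parallel:
List(1, 2, 3).par.map(inc)    // same result, computed in parallel
// (.par is built into Scala 2.12; in 2.13 it lives in scala-parallel-collections)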
WordCount example
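A minimal WordCount in plain Scala, mirroring the map and reduce phases (a sketch of the idea, not the actual Hadoop API; the input lines are made up):

val lines = List("to be or not to be", "to be is to do")
// Map phase: emit a (word, 1) pair for every word
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))
// Shuffle + reduce phase: group the pairs by word, then sum the counts
val counts = pairs.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }
// counts: Map(to -> 4, be -> 3, or -> 1, not -> 1, is -> 1, do -> 1)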
MapReduce Framework
Takes care of distributed processing and coordination
Scheduling
Jobs are broken down into smaller chunks called tasks; these tasks are scheduled
Task locality with data
The framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task
Code is moved to where the data is
MapReduce Framework
Error handling
Failures are expected behavior, so tasks are automatically retried on other machines
Data synchronization
The shuffle-and-sort barrier re-arranges and moves data between machines
Input and output are coordinated by the framework
MapReduce 2.0 on YARN
Yet Another Resource Negotiator (YARN)
Various applications can run on YARN
MapReduce is just one choice (the main choice at this point)
http://wiki.apache.org/hadoop/PoweredByYarn
YARN Cluster
YARN: Running an Application
YARN and MapReduce
YARN does not know or care what kind of application it is running
MapReduce uses YARN
Hadoop includes a MapReduce ApplicationMaster to manage MapReduce jobs
Each MapReduce job is an instance of an application
Running a MapReduce2 Application
Image Coaddition with MapReduce
What is Astronomical Survey Science from a Big Data Point of View?
Gathers millions of images, requiring TBs/PBs of storage
Requires high-throughput data reduction pipelines
Requires sophisticated off-line data analysis tools
The following example is extracted from:
Wiley K., Connolly A., Gardner J., Krughoff S., Balazinska M., Howe B., Kwon Y., Bu Y., "Astronomy in the Cloud: Using MapReduce for Image Co-Addition," Publications of the Astronomical Society of the Pacific, 2011, vol. 123, no. 901, pp. 366-380.
FITS (Flexible Image Transport System)
An image format that knows where it is looking.
Common astronomical image representation file format.
Metadata tags (like EXIF):
Most importantly: Precise astrometry (position on sky)
Other:
Geolocation (telescope location)
Sky conditions, image quality, etc.
Image Coaddition
Given multiple partially overlapping images and a query (color and sky bounds):
Find the images' intersections with the query bounds
Project the bitmaps to the bounds
Stack and mosaic into a final product
Image Stacking (Signal Averaging)
Stacking improves the SNR (signal-to-noise ratio): it makes fainter objects visible.
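A one-line justification, assuming each registered image carries the same signal $S$ plus independent zero-mean noise of standard deviation $\sigma$: averaging $N$ images leaves $S$ unchanged while the noise standard deviation drops to $\sigma/\sqrt{N}$, so

$\mathrm{SNR}_N = \dfrac{S}{\sigma/\sqrt{N}} = \sqrt{N}\,\mathrm{SNR}_1$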
Coaddition in Hadoop
What is NoSQL?
Stands for "Not Only SQL"
Class of non-relational data storage systems
Usually do not require a fixed table schema, nor do they use the concept of joins
All NoSQL offerings relax one or more of the ACID properties (see the CAP theorem)
For data storage, an RDBMS cannot be the be-all/end-all
Just as there are different programming languages, we need other data storage tools in the toolbox
A NoSQL solution is more acceptable to a client now
The CAP Theorem
[Venn diagram: Consistency, Availability, Partition tolerance]
The CAP Theorem
Consistency: once a writer has written, all readers will see that write
Consistency
Two kinds of consistency:
Strong consistency - ACID (Atomicity, Consistency, Isolation, Durability)
Weak consistency - BASE (Basically Available, Soft-state, Eventual consistency)
• Basically Available: the database system always seems to work!
• Soft State: it does not have to be consistent all the time.
• Eventually Consistent: the system will eventually become consistent when the updates propagate, in particular when there are not too many updates.
Availability
A guarantee that every request receives a response about whether it succeeded or failed
Traditionally thought of as the server/process being available "five 9's" of the time (99.999%)
However, for a large node system, at almost any point in time there is a good chance that a node is either down or there is a network disruption among the nodes
Failure is the Rule
Amazon:
Data center with 100,000 disks
From 6,000 to 10,000 disks fail per year (up to ~25 disks per day)
Sources of failure are numerous:
Hardware (disks)
Network
Power
Software
Software and OS updates
Different Types of NoSQL Systems
• Distributed Key-Value Systems - look up a single value for a key
• Amazon's Dynamo
In simple terms, a NoSQL key-value store is a single table with two columns: one being the (primary) key, and the other being the value.
• Column-based Systems
• Google's BigTable
• HBase
• Facebook's Cassandra
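At its core, then, a key-value store behaves like a map from a key to an opaque value; a toy Scala sketch (the key naming scheme is made up):

val kv = scala.collection.mutable.Map[String, Array[Byte]]()
kv.put("user:42", "Alice".getBytes("UTF-8"))                     // write a value under a key
kv.get("user:42").foreach(v => println(new String(v, "UTF-8")))  // look up by key only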
Document Storage
• Each record may have a different schema: records within a single table can have different structures.
• An example record from Mongo, using JSON format, might look like:
{
  "_id" : ObjectId("4fccbf281168a6aa3c215443"),
  "first_name" : "Thomas",
  "last_name" : "Jefferson",
  "address" : {
    "street" : "1600 Pennsylvania Ave NW",
    "city" : "Washington",
    "state" : "DC"
  }
}
• The "address" field is an embedded object.
• Records are called documents.
• You can also modify the structure of any document on the fly by adding and removing members from the document.
• Unlike simple key-value stores, both keys and values are fully searchable in document databases.
Column-based Stores
• Based on Google's BigTable store:
• Each record = (row: string, column: string, time: int64)
• Distributed data storage, especially for versioned data (time-stamps)
• What is a column-based store? Data tables are stored as sections of columns of data, rather than as rows of data.
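A toy Scala sketch of this record model, using the webtable example from Google's BigTable paper (the cell contents are abbreviated):

type RowKey = String; type ColumnKey = String; type Timestamp = Long
// Each cell is addressed by (row, column, timestamp) and holds a string value
val webtable = Map[(RowKey, ColumnKey, Timestamp), String](
  ("com.cnn.www", "contents:html", 3L)     -> "<html>...</html>",
  ("com.cnn.www", "anchor:cnnsi.com", 9L)  -> "CNN"
)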
Graph Databases
• Apply graph theory to the storage of information about the relationships between entries
• In general, graph databases are useful when you are more interested in the relationships between the data than in the data itself:
• for example, in representing and traversing social networks, generating recommendations, or conducting forensic investigations (e.g., pattern detection)
Example
What is Pig?
Pig
In brief, Pig "is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs."
Disadvantages of Raw MapReduce
1. Extremely rigid data flow: Map → Reduce
Other flows are constantly hacked in, e.g., Map → Map → Reduce → Map
Pig's Features
Main operators:
Join datasets
Sort datasets
Filter
Data types
Group by
User-defined functions
Etc.
Example:
-- Load a CSV of movies, keep those rated above 4.0, and print the result
movies = LOAD '/home/movies_data.csv' USING PigStorage(',') AS (id, name, year, rating, duration);
movies_greater_than_four = FILTER movies BY (float)rating > 4.0;
DUMP movies_greater_than_four;
What is Hive?
Hive
Data Warehousing Solution built on top of Hadoop
Provides SQL-like query language named HiveQL
Minimal learning curve for people with SQL expertise
Data analysts are target audience
Early Hive development work started at Facebook in 2007
Today Hive is an Apache project under Hadoop
http://hive.apache.org
Advantages and Drawbacks
Hive provides:
The ability to bring structure to various data formats
A simple interface for ad hoc querying, analyzing and summarizing large amounts of data
Access to files on various data stores such as HDFS and HBase
Hive
Translates HiveQL statements into a set of MapReduce jobs which are then executed on a Hadoop cluster
What is Spark?
A Brief History: Spark
A general view of Spark
Current programming models
Current popular programming models for clusters transform data flowing from stable storage to stable storage
E.g., MapReduce:
[Diagram: data flows from stable storage through Map and Reduce stages back to stable storage]
Benefits of data flow: the runtime can decide where to run tasks and can automatically recover from failures
MapReduce I/O
Spark
Acyclic data flow is a powerful abstraction, but it is not efficient for applications that repeatedly reuse a working set of data:
Iterative algorithms (many in machine learning)
Interactive data mining tools (R, Excel, Python)
Goal: Sharing at Memory Speed
Resilient Distributed Dataset (RDD)
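A minimal sketch of the idea, using the classic Spark RDD API (assumes a SparkContext named sc, as in the Spark shell; the path is a placeholder):

val lines = sc.textFile("hdfs://...")                   // RDD of lines from HDFS
val errors = lines.filter(_.contains("ERROR"))          // lazily transformed RDD
errors.cache()                                          // keep the working set in memory
println(errors.count())                                 // first action computes and caches
println(errors.filter(_.contains("timeout")).count())   // reuses the cached data at memory speed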
Example: Logistic Regression
Goal: find the best line separating two sets of points
[Figure: two point sets and the target separating line]
Logistic Regression (Scala Code)
val data = spark.textFile(...).map(readPoint).cache()
var w = Vector.random(D)
// Iteratively refine w: each pass computes the logistic-loss gradient
// over the cached RDD and takes a gradient step
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)
Conclusion
Data storage needs are rapidly increasing
Hadoop has become the de-facto standard for handling these massive data sets
Storage of Big Data requires new storage models
NoSQL solutions
Parallel processing of Big Data requires a new programming paradigm
The MapReduce programming model
Big Data is moving beyond one-pass batch jobs, to low-latency applications that need data sharing
Apache Spark is an alternative solution