Fillatre Big Data

This document discusses tools for Big Data in astronomy. It begins with an introduction to Big Data and the Hadoop ecosystem, including HDFS for distributed storage and MapReduce for distributed processing. It then discusses specific Hadoop tools like Pig, Hive, and Spark. It provides an example of using MapReduce for image coaddition in astronomy. Overall, the document outlines how Hadoop and its ecosystem can provide scalable tools for storing, processing, and analyzing the large datasets common in modern astronomy.


Outils informatiques pour le Big Data

en astronomie
Lionel Fillatre
Université Nice Sophia Antipolis
Polytech Nice Sophia
Laboratoire I3S

École d'été thématique CNRS BasMatI


1
3 juin 2015
Outline
 What is Big Data (including the Hadoop Ecosystem)
 HDFS (Hadoop Distributed File System)
 What is MapReduce?
 Image Coaddition with MapReduce
 What is NoSQL?
 What is Pig?
 What is Hive?
 What is Spark?
 Conclusion

2
What is Big Data?

3
Big Data Definition
 No single standard definition…

“Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

4
Characteristics of Big Data:
1-Scale (Volume)

 Data Volume
 44x increase from 2009 to 2020
 From 0.8 zettabytes to 35 zettabytes
 Data volume is increasing exponentially

Exponential increase in
collected/generated data

5
Characteristics of Big Data:
2-Complexity (Variety)

 Various formats, types, and structures


 Text, numerical, images, audio, video,
sequences, time series, social media
data, multi-dim arrays, etc…
 Static data vs. streaming data
 A single application can be
generating/collecting many types of
data

To extract knowledge
 all these types of data need to be linked together

6
Characteristics of Big Data:
3-Speed (Velocity)
 Data is generated fast and needs to be processed fast
 Online Data Analytics
 Late decisions → missing opportunities
 Examples
 E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you
 Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurement requires an immediate reaction

7
Some Make it 5V’s

8
What technology for Big Data?

9
10
11
12
Hadoop Origins
 Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
 Hadoop is an open-source implementation of Google MapReduce and the Google File System (GFS).
 Hadoop fulfills the need for a common infrastructure:
 Efficient, reliable, easy to use,
 Open Source, Apache License.

13
Hadoop Ecosystem (main elements)

14
Data Storage
 Storage capacity has grown exponentially but read
speed has not kept up
 1990:
 Store 1,400 MB
 Transfer speed of 4.5MB/s
 Read the entire drive in ~ 5 minutes
 2010:
 Store 1 TB
 Transfer speed of 100MB/s
 Read the entire drive in ~ 3 hours
 Hadoop - 100 drives working at the same time can
read 1TB of data in 2 minutes
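The read times quoted above follow from quick arithmetic; a minimal sketch (the 1 TB = 1,000,000 MB and 100 MB/s figures are the slide's own; the function name is illustrative):

```python
# Sanity-check the read times quoted above.
# Assumes 1 TB = 1_000_000 MB and a 100 MB/s transfer speed per drive.

def read_time_seconds(total_mb, drives, mb_per_sec):
    """Time to read total_mb when it is spread evenly across `drives` disks."""
    return (total_mb / drives) / mb_per_sec

single = read_time_seconds(1_000_000, 1, 100)      # one 2010-era drive
parallel = read_time_seconds(1_000_000, 100, 100)  # 100 drives in parallel

print(single / 3600)  # ~2.8 hours, the "~3 hours" above
print(parallel / 60)  # ~1.7 minutes, the "2 minutes" above
```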

15
Hadoop Cluster
 A set of "cheap" commodity hardware
 No need for super-computers, use commodity unreliable hardware
 Not desktops
 Networked together
 May reside in the same location
– Set of servers in a set of racks in a data center

16
Scale-Out Instead of Scale-Up
 It is harder and more expensive to scale-up
 Add additional resources to an existing node (CPU, RAM)
 Moore’s Law can’t keep up with data growth
 New units must be purchased if required resources cannot be added
 Also known as scale vertically
 Scale-Out
 Add more nodes/machines to an existing distributed application
 Software layer is designed for node additions or removal
 Hadoop takes this approach - A set of nodes are bonded together as a
single distributed system
 Very easy to scale down as well

17
Code to Data
 Traditional data processing architecture
 Nodes are broken up into separate processing and storage nodes
connected by high-capacity link
 Many data-intensive applications are not CPU-bound, so moving data across the network becomes the bottleneck

18
Code to Data
 Hadoop co-locates processors and storage
 Code is moved to data (size is tiny, usually in KBs)
 Processors execute code and access underlying local storage

19
Failures are Common
 Given a large number of machines, failures are common
 Large warehouses may see machine failures weekly or even daily
 Hadoop is designed to cope with node failures
 Data is replicated
 Tasks are retried

20
Comparison to RDBMS
 Relational Database Management Systems
(RDBMS) for batch processing
 Oracle, Sybase, MySQL, Microsoft SQL Server, etc.
 Hadoop doesn’t fully replace relational products; many
architectures would benefit from both Hadoop and a Relational
product
 RDBMS products scale up
 Expensive to scale for larger installations
 Hits a ceiling when storage reaches 100s of terabytes
 Structured Relational vs. Semi-Structured vs. Unstructured
 Hadoop was not designed for real-time or low latency queries

21
HDFS
(Hadoop Distributed File System)

22
HDFS
 Appears as a single disk
 Runs on top of a native filesystem
 Fault Tolerant
 Can handle disk crashes, machine crashes, etc...
 Based on Google's Filesystem (GFS or GoogleFS)

23
HDFS is Good for...
 Storing large files
 Terabytes, Petabytes, etc...
 Millions rather than billions of files
 100MB or more per file
 Streaming data
 Write once and read-many times patterns
 Optimized for streaming reads rather than random reads
 “Cheap” Commodity Hardware
 No need for super-computers, use less reliable commodity hardware

24
HDFS is not so good for...

 Low-latency reads
 High-throughput rather than low latency for small chunks of data
 HBase addresses this issue
 Large amounts of small files
 Better for millions of large files instead of billions of small files
 For example each file can be 100MB or more
 Multiple Writers
 Single writer per file
 Writes only at the end of a file; no support for arbitrary offsets

25
HDFS Daemons

26
Files and Blocks

27
HDFS File Write

28
HDFS File Read

29
What is MapReduce?

30
Hadoop MapReduce
 Model for processing large amounts of data in
parallel
 On commodity hardware
 Lots of nodes
 Derived from functional programming
 Map and reduce functions
 Can be implemented in multiple languages
 Java, C++, Ruby, Python, etc.

31
Hadoop MapReduce History

32
Main principle
 Map: ( f, [a, b, c, ...] ) -> [ f(a), f(b), f(c), ... ]
 Apply a function to all the elements of a list
 ex.: map((f: x -> x + 1), [1, 2, 3]) = [2, 3, 4]
 Intrinsically parallel

 Reduce: ( g, [a, b, c, ...] ) -> g(a, g(b, g(c, ... )))
 Apply a function to a list recursively
 ex.: reduce(sum, [1, 2, 3, 4]) = sum(1, sum(2, sum(3, 4)))

 Purely functional
 No global variables, no side effects

33
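The two primitives above map directly onto Python's built-in map and functools.reduce; a minimal illustration using the slide's own examples:

```python
from functools import reduce

# Map: apply a function to every element of a list.
mapped = list(map(lambda x: x + 1, [1, 2, 3]))
print(mapped)  # [2, 3, 4]

# Reduce: apply a function recursively, i.e. sum(1, sum(2, sum(3, 4))).
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])
print(total)  # 10
```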
WordCount example

34
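The WordCount example can be sketched as a single-process simulation of the three phases (map, shuffle, reduce); the function names here are illustrative, not the Hadoop API, and the real framework runs each phase distributed across nodes:

```python
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for each word in the line.
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, as the framework's shuffle-and-sort does.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big clusters", "big data"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(shuffle(pairs)))  # {'big': 3, 'data': 2, 'clusters': 1}
```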
MapReduce Framework
 Takes care of distributed processing and coordination
 Scheduling
 Jobs are broken down into smaller chunks called tasks.
 These tasks are scheduled.
 Task localization with Data
 Framework strives to place tasks on the nodes that host the
segment of data to be processed by that specific task
 Code is moved to where the data is

35
MapReduce Framework
 Error Handling
 Failures are an expected behavior so tasks are automatically re-tried
on other machines
 Data Synchronization
 Shuffle and Sort barrier re-arranges and moves data between
machines
 Input and output are coordinated by the framework

36
Map Reduce 2.0 on YARN
 Yet Another Resource Negotiator (YARN)
 Various applications can run on YARN
 MapReduce is just one choice (the main choice at this point)
 http://wiki.apache.org/hadoop/PoweredByYarn

37
YARN Cluster

38
YARN: Running an Application

39
YARN: Running an Application

40
YARN: Running an Application

41
YARN: Running an Application

42
YARN: Running an Application

43
YARN and MapReduce
 YARN does not know or care what kind of application it is
running
 MapReduce uses YARN
 Hadoop includes a MapReduce ApplicationMaster to manage
MapReduce jobs
 Each MapReduce job is an instance of an application

44
Running a MapReduce2 Application

45
Running a MapReduce2 Application

46
Running a MapReduce2 Application

47
Running a MapReduce2 Application

48
Running a MapReduce2 Application

49
Running a MapReduce2 Application

50
Running a MapReduce2 Application

51
Running a MapReduce2 Application

52
Running a MapReduce2 Application

53
Image Coaddition with
MapReduce

54
What is Astronomical Survey Science
from Big Data point of view ?
 Surveys gather millions of images, requiring TBs/PBs of storage.
 Require high-throughput data reduction pipelines.
 Require sophisticated off-line data analysis tools.
 The following example is extracted from:
Wiley K., Connolly A., Gardner J., Krughoff S., Balazinska M., Howe B., Kwon Y., Bu Y., “Astronomy in the Cloud: Using MapReduce for Image Co-Addition,” Publications of the Astronomical Society of the Pacific, 2011, vol. 123, no. 901, pp. 366-380.

55
FITS (Flexible Image Transport System)
 An image format that knows where it is looking.
 Common astronomical image representation file format.
 Metadata tags (like EXIF):
 Most importantly: Precise astrometry (position on sky)
 Other:
 Geolocation (telescope location)
 Sky conditions, image quality, etc.

56
Image Coaddition
 Given multiple partially overlapping images and a query (color and sky bounds):
 Find images’ intersections with the query bounds.
 Project bitmaps to the bounds.
 Stack and mosaic into a final product.

57
Image Stacking (Signal Averaging)
 Stacking improves SNR: makes fainter objects visible.
 Example (SDSS, Stripe 82):
 Top: Single image, R-band
 Bottom: 79-deep stack (~9x SNR improvement)
 Variable conditions (e.g., atmosphere, PSF, haze) mean stacking algorithm complexity can exceed a mere sum.

58
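The ~9x figure matches the sqrt(N) noise reduction expected from averaging N = 79 exposures. A small numerical illustration of that scaling (not the paper's coaddition pipeline; the flux and noise values are invented):

```python
import random
import statistics

random.seed(0)
true_flux = 10.0   # assumed pixel signal
sigma = 5.0        # assumed per-exposure noise

def exposure():
    # One noisy measurement of the same pixel.
    return true_flux + random.gauss(0, sigma)

# Estimate the noise of a single exposure vs. a 79-deep stack empirically.
singles = [exposure() for _ in range(10_000)]
stacks = [statistics.mean(exposure() for _ in range(79)) for _ in range(10_000)]

ratio = statistics.stdev(singles) / statistics.stdev(stacks)
print(ratio)  # ~8.9, i.e. roughly sqrt(79)
```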
Advantages of MapReduce
 High-level problem description. No effort spent on
internode communication, message-passing, etc.
 Programmed in Java (accessible to most science researchers,
not just computer scientists and engineers).
 Runs on cheap commodity hardware, potentially in the
cloud, e.g., Amazon’s EC2.
 Scalable: 1000s of nodes can be added to the cluster with no
modification to the researcher’s software.
 Large community of users/support.

59
Coaddition in Hadoop

60
What is NoSQL?

61
What is NoSQL?
 Stands for Not Only SQL
 Class of non-relational data storage systems
 Usually do not require a fixed table schema nor do they use the
concept of joins
 All NoSQL offerings relax one or more of the ACID properties
(CAP theorem)
 For data storage, an RDBMS cannot be the be-all/end-all
 Just as there are different programming languages, need to have
other data storage tools in the toolbox
 A NoSQL solution is more acceptable to a client now

62
The CAP Theorem

Theorem: You can have at most two of these properties for any shared-data system:
 Consistency
 Availability
 Partition tolerance

63
The CAP Theorem
 Consistency: once a writer has written, all readers will see that write

64
Consistency
 Two kinds of consistency:
 strong consistency – ACID (Atomicity Consistency
Isolation Durability)
 weak consistency – BASE (Basically Available Soft-state
Eventual consistency)
• Basically Available: The database system always seems to work!
• Soft State: It does not have to be consistent all the time.
• Eventually Consistent: The system will eventually become
consistent when the updates propagate, in particular, when there
are not too many updates.

65
The CAP Theorem

 Availability: the system is available during software and hardware upgrades and node failures

66
Availability
 A guarantee that every request receives a response about whether
it succeeded or failed.
 Traditionally thought of as the server/process being available 99.999% of the time (“five nines”).
 However, for large node system, at almost any point in time
there’s a good chance that a node is either down or there is a
network disruption among the nodes.

67
The CAP Theorem

 Partition tolerance: the system can continue to operate in the presence of network partitions

68
Failure is the rule
 Amazon:
 Datacenter with 100,000 disks
 From 6,000 to 10,000 disks fail per year (roughly 25 disks per day)
 Sources of failures are numerous:
 Hardware (disk)
 Network
 Power
 Software
 Software and OS updates.
69
The CAP Theorem

70
Different Types of NoSQL Systems
• Distributed Key-Value Systems - look up a single value for a key
• Amazon’s Dynamo
• Document-based Systems - access data by key or by search of “document” data
• CouchDB
• MongoDB
• Column-based Systems
• Google’s BigTable
• HBase
• Facebook’s Cassandra
• Graph-based Systems - use a graph structure
• Google’s Pregel
• Neo4j
71
Key-Value Pair (KVP) Stores
“Value” is stored as a “blob”
• Without caring or knowing what is inside
• Application is responsible for understanding the data

In simple terms, a NoSQL Key-Value store is a single table with two columns: one
being the (Primary) Key, and the other being the Value.

72
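In code, such a store is little more than a map from keys to opaque blobs; a minimal sketch (the key format and payload are invented for illustration):

```python
import json

# One "table", two "columns": key -> blob.
store = {}

# The store neither cares nor knows what is inside the value.
store["user:42"] = json.dumps({"name": "Ada", "cart": ["book"]}).encode()

# Only get/put by key is possible; the application must decode the blob
# itself, and there is no query over the value's contents.
blob = store["user:42"]
print(json.loads(blob)["name"])  # Ada
```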
Document storage
• Records within a single table can have different structures; each record may have a different schema.
• An example record from Mongo, using JSON format, might look like:
{
  "_id" : ObjectId("4fccbf281168a6aa3c215443"),
  "first_name" : "Thomas",
  "last_name" : "Jefferson",
  "address" : {
    "street" : "1600 Pennsylvania Ave NW",
    "city" : "Washington",
    "state" : "DC"
  }
}
(the "address" member is an embedded object)
• Records are called documents.
• You can also modify the structure of any document on the fly by adding and removing members from the document.
• Unlike simple key-value stores, both keys and values are fully searchable in document databases.
73
Column-based Stores
• Based on Google’s BigTable store:
• Each record = (row:string, column:string, time:int64)
• Distributed data storage, especially versioned data (time-stamps).
• What is a column-based store? - Data tables are stored as sections of
columns of data, rather than as rows of data.

74
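A minimal sketch of the row vs. column layouts described above (the table contents are invented):

```python
# Row-based layout: one record per row, all attributes together.
rows = [
    {"id": 1, "name": "M31", "mag": 3.5},
    {"id": 2, "name": "M42", "mag": 4.5},
]

# Column-based layout: one array per column.
columns = {
    "id": [1, 2],
    "name": ["M31", "M42"],
    "mag": [3.5, 4.5],
}

# An aggregate over one attribute scans a single contiguous array
# instead of touching every record.
print(sum(columns["mag"]) / len(columns["mag"]))  # 4.0
```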
Graph Database
• Apply graph theory in the storage of information about the relationship
between entries

• A graph database is a database that uses graph structures with nodes,


edges, and properties to represent and store data.

• In general, graph databases are useful when you are more interested in
relationships between data than in the data itself:
• for example, in representing and traversing social networks,
generating recommendations, or conducting forensic investigations
(e.g. pattern detection).

75
Example

76
What is Pig?

77
Pig
 In brief:
“is a platform for analyzing large data sets that consists of a high-level
language for expressing data analysis programs, coupled with infrastructure
for evaluating these programs.”

 Top Level Apache Project


 http://pig.apache.org

 Pig is an abstraction on top of Hadoop
 Provides a high-level programming language designed for data processing
 Scripts are converted into MapReduce jobs and executed on Hadoop clusters

 Pig is widely accepted and used
 Yahoo!, Twitter, Netflix, etc...
 At Yahoo!, 70% of MapReduce jobs are written in Pig

78
Disadvantages of Raw MapReduce
1. Extremely rigid data flow: Map then Reduce. Other flows (join, union, split, chains) are constantly hacked in.
2. Common operations must be coded by hand
• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
• Difficult to maintain, extend, and optimize
• Resulting code is difficult to reuse and maintain; shifts focus and attention away from data analysis

79
Pig and MapReduce
 MapReduce requires programmers
 Must think in terms of map and reduce functions
 More than likely will require Java programmers
 Pig provides high-level language that can be used by
 Analysts
 Data Scientists
 Statisticians
 Etc...
 Originally implemented at Yahoo! to allow analysts to
access data

80
Pig’s Features
 Main operators:
 Join Datasets
 Sort Datasets
 Filter
 Data Types
 Group By
 User Defined Functions
 Etc..

 Example:
>movies = LOAD '/home/movies_data.csv' USING PigStorage(',') as
(id,name,year,rating,duration);
>movies_greater_than_four = FILTER movies BY (float)rating>4.0;
>DUMP movies_greater_than_four;
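For comparison, the same filter written in plain Python (the records are inlined here rather than read from the /home/movies_data.csv of the Pig script, and the movie rows are invented):

```python
import csv
import io

# Inlined stand-in for the CSV file loaded by the Pig script above.
data = io.StringIO(
    "1,The Matrix,1999,4.5,136\n"
    "2,Plan 9,1959,2.1,79\n"
)

# Same schema as the Pig LOAD ... AS (...) clause.
fields = ["id", "name", "year", "rating", "duration"]
movies = [dict(zip(fields, row)) for row in csv.reader(data)]

# Equivalent of: FILTER movies BY (float)rating > 4.0;
movies_greater_than_four = [m for m in movies if float(m["rating"]) > 4.0]
for m in movies_greater_than_four:
    print(m["name"])  # The Matrix
```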

81
What is Hive?

82
Hive
 Data Warehousing Solution built on top of Hadoop
 Provides SQL-like query language named HiveQL
 Minimal learning curve for people with SQL expertise
 Data analysts are target audience
 Early Hive development work started at Facebook in 2007
 Today Hive is an Apache project under Hadoop
 http://hive.apache.org

83
Advantages and Drawbacks
 Hive provides
 Ability to bring structure to various data formats
 Simple interface for ad hoc querying, analyzing and summarizing large
amounts of data
 Access to files on various data stores such as HDFS and HBase

 Hive does not provide
 Low-latency or real-time queries
 Even querying small amounts of data may take minutes
 Designed for scalability and ease-of-use rather than low-latency responses

84
Hive
 Translates HiveQL statements into a set of MapReduce Jobs which are
then executed on a Hadoop Cluster

85
What is Spark?

86
A Brief History: Spark

87
A general view of Spark

88
Current programming models
 Current popular programming models for clusters transform data flowing from stable storage to stable storage
 E.g., MapReduce: Input → Map tasks → Reduce tasks → Output
 Benefit of data flow: the runtime can decide where to run tasks and can automatically recover from failures

89
MapReduce I/O

90
Spark
 Acyclic data flow is a powerful abstraction, but is not efficient for applications that repeatedly reuse a working set of data:
 Iterative algorithms (many in machine learning)
 Interactive data mining tools (R, Excel, Python)
 Spark makes working sets a first-class concept to efficiently support these apps.
91
Goal: Sharing at Memory Speed

92
Resilient Distributed Dataset (RDD)
 Provide distributed memory abstractions for clusters to support apps with working sets.
 Retain the attractive properties of MapReduce:
 Fault tolerance (for crashes & stragglers)
 Data locality
 Scalability
 Solution: augment the data flow model with “resilient distributed datasets” (RDDs)

93
Programming Model with RDD
 Resilient distributed datasets (RDDs)
 Immutable collections partitioned across a cluster that can be rebuilt if a partition is lost
 Created by transforming data in stable storage using data flow operators (map, filter, group-by, …)
 Can be cached across parallel operations

 Parallel operations on RDDs
 Reduce, collect, count, save, …

 Restricted shared variables
 Accumulators, broadcast variables

94
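The lineage idea behind RDDs can be sketched as a toy in a few lines, not the Spark API: each transformation records how to recompute its data from stable storage, so a lost partition can be rebuilt rather than replicated (class and method names here are invented for illustration):

```python
class ToyRDD:
    """Toy stand-in for an RDD: immutable, rebuilt from lineage on demand."""

    def __init__(self, source, transform=lambda x: x):
        self.source = source        # stable storage (here: a plain list)
        self.transform = transform  # lineage: how to rebuild each element

    def map(self, f):
        # Returns a NEW ToyRDD; the original is never mutated.
        prev = self.transform
        return ToyRDD(self.source, lambda x: f(prev(x)))

    def collect(self):
        # (Re)compute from the source through the recorded lineage.
        return [self.transform(x) for x in self.source]

rdd = ToyRDD([1, 2, 3]).map(lambda x: x * 10).map(lambda x: x + 1)
print(rdd.collect())  # [11, 21, 31]
```

Because `collect` recomputes from `source` every time, losing the computed values costs nothing but CPU; this is the fault-tolerance property the slide attributes to lineage.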
Example: Logistic Regression
 Goal: find best line separating two sets of points

random initial line

target

95
Logistic Regression (SCALA Code)
val data = spark.textFile(...).map(readPoint).cache()

var w = Vector.random(D)

for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

96
Conclusion

97
Conclusion
 Data storage needs are rapidly increasing
→ Hadoop has become the de facto standard for handling these massive data sets.
 Storage of Big Data requires new storage models
→ NoSQL solutions.
 Parallel processing of Big Data requires a new programming paradigm
→ the MapReduce programming model.
 “Big data” is moving beyond one-pass batch jobs, to low-latency apps that need data sharing
→ Apache Spark is an alternative solution.
98
