Big Data
Fundamentals


Raj Jain
Washington University in Saint Louis
Saint Louis, MO 63130
Jain@cse.wustl.edu
These slides and audio/video recordings of this class lecture are at:
http://www.cse.wustl.edu/~jain/cse570-13/
Washington University in St. Louis

http://www.cse.wustl.edu/~jain/cse570-13/

10-1

©2013 Raj Jain
Overview
1. Why Big Data?
2. Terminology
3. Key Technologies: Google File System, MapReduce, Hadoop
4. Hadoop and other database tools
5. Types of Databases

Ref: J. Hurwitz, et al., “Big Data for Dummies,” Wiley, 2013, ISBN:978-1-118-50422-2
Big Data






Big Data is measured by the 3 V's:
Volume: Amount of data, e.g., in TB
Velocity: Speed of creation or change, e.g., in TB/sec
Variety: Type (text, audio, video, images, geospatial, ...)
Increasing processing power, storage capacity, and networking have caused data to grow in all 3 dimensions.
Some definitions add more dimensions: Volume, Location, Velocity, Churn, Variety, Veracity (accuracy, correctness, applicability)
⇒ Examples: social network data, sensor networks, Internet search, genomics, astronomy, …


Why Big Data Now?
1. Low-cost storage to store data that was discarded earlier
2. Powerful multi-core processors
3. Low latency possible through distributed computing: compute clusters and grids connected via high-speed networks
4. Virtualization ⇒ Partition, aggregate, and isolate resources in any size and change them dynamically ⇒ Minimize latency at any scale
5. Affordable storage and computing with minimal manpower via clouds
⇒ Possible because of advances in networking

Why Big Data Now? (Cont)
6. Better understanding of task distribution (MapReduce) and computing architecture (Hadoop)
7. Advanced analytical techniques (machine learning)
8. Managed Big Data platforms: Cloud service providers such as Amazon Web Services provide Elastic MapReduce, Simple Storage Service (S3), and HBase, a column-oriented database. Google offers BigQuery and the Prediction API.
9. Open-source software: OpenStack, PostgreSQL
10. March 29, 2012: Obama announced $200M for Big Data research, distributed via NSF, NIH, DOE, DoD, DARPA, and USGS (Geological Survey)

Big Data Applications




Monitor premature infants to alert when intervention is needed
Predict machine failures in manufacturing
Prevent traffic jams, save fuel, reduce pollution

ACID Requirements







Atomicity: All or nothing. If anything fails, the entire transaction fails. Example: payment and ticketing.
Consistency: If there is an error in the input, the output will not be written to the database. The database goes from one valid state to another valid state. Valid = does not violate any defined rules.
Isolation: Multiple parallel transactions will not interfere with each other.
Durability: After the output is written to the database, it stays there forever, even after power loss, crashes, or errors.
Relational databases provide ACID, while non-relational databases aim for BASE (Basically Available, Soft state, and Eventual Consistency).
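The payment-and-ticketing example of atomicity can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the `accounts` and `tickets` tables, the `book_ticket` helper, and the simulated failure are hypothetical and not part of the lecture.

```python
import sqlite3

def book_ticket(conn, customer, price, fail_after_payment=False):
    """Charge the customer and issue a ticket as one all-or-nothing transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (price, customer),
            )
            if fail_after_payment:
                # Simulated failure between the payment and the ticketing step
                raise RuntimeError("ticketing system unavailable")
            conn.execute("INSERT INTO tickets (owner) VALUES (?)", (customer,))
        return True
    except RuntimeError:
        return False

# Hypothetical schema and data, used only for this illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE tickets (owner TEXT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.commit()

ok = book_ticket(conn, "alice", 40, fail_after_payment=True)
balance = conn.execute(
    "SELECT balance FROM accounts WHERE name = 'alice'"
).fetchone()[0]
tickets = conn.execute("SELECT COUNT(*) FROM tickets").fetchone()[0]
```

Because the failure happens inside the transaction, the payment is rolled back together with the missing ticket: the balance is unchanged and no ticket row exists.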

Ref: http://en.wikipedia.org/wiki/ACID
Terminology









Structured Data: Data that has a pre-set format, e.g., address books, product catalogs, banking transactions
Unstructured Data: Data that has no pre-set format, e.g., movies, audio, text files, web pages, computer programs, social media
Semi-Structured Data: Unstructured data that can be put into a structure by available format descriptions
80% of data is unstructured.
Batch vs. Streaming Data
Real-Time Data: Streaming data that needs to be analyzed as it comes in, e.g., intrusion detection. Aka “Data in Motion”
Data at Rest: Non-real-time data, e.g., sales analysis.
Metadata: Definitions, mappings, schema

Ref: Michael Minelli, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses," Wiley, 2013, ISBN:111814760X
Relational Databases and SQL


Relational Database: Stores data in tables. A “Schema” defines the tables, the fields in the tables, and the relationships between them. Each column stores one attribute.
Order Table: Order Number | Customer ID | Product ID | Quantity | Unit Price
Customer Table: Customer ID | Customer Name | Customer Address | Gender | Income Range
SQL (Structured Query Language): Most commonly used language for creating, retrieving, updating, and deleting (CRUD) data in a relational database
Example: To find the gender of customers who bought XYZ:
SELECT c.CustomerID, c.Gender, o.ProductID FROM "Customer Table" c JOIN "Order Table" o ON c.CustomerID = o.CustomerID WHERE o.ProductID = 'XYZ'
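A runnable version of this query can be sketched with Python's built-in `sqlite3` module. The table names `customers` and `orders`, the column names, and the sample rows are assumptions made for the illustration; they only mirror the Customer and Order tables described on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and rows modeled on the two tables above
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, gender TEXT);
CREATE TABLE orders (order_number INTEGER PRIMARY KEY, customer_id INTEGER,
                     product_id TEXT, quantity INTEGER, unit_price REAL);
INSERT INTO customers VALUES (1, 'Ann', 'F'), (2, 'Bob', 'M'), (3, 'Cy', 'M');
INSERT INTO orders VALUES (10, 1, 'XYZ', 2, 9.5), (11, 2, 'ABC', 1, 3.0),
                          (12, 3, 'XYZ', 1, 9.5);
""")

# Join the two tables on the shared customer ID, then filter by product
rows = conn.execute(
    """
    SELECT c.customer_id, c.gender, o.product_id
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    WHERE o.product_id = ?
    ORDER BY c.customer_id
    """,
    ("XYZ",),
).fetchall()
```

The join condition on the customer ID is what ties each order to its customer; without it the query would pair every customer with every order.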



Ref: http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
Non-relational Databases







NoSQL: Not Only SQL. Any database that uses non-SQL
interfaces, e.g., Python, Ruby, C, etc. for retrieval.
Typically store data in key-value pairs.
Not limited to rows or columns. Data structures and queries are specific to the data type
High-performance in-memory databases
RESTful (Representational State Transfer) web-like APIs
Eventual consistency: BASE in place of ACID
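A toy key-value store can illustrate what eventual consistency means. This is a minimal sketch, not how any particular NoSQL product works: the class `EventuallyConsistentStore`, its two in-memory replicas, and the explicit `sync` step are all assumptions made for the example.

```python
class EventuallyConsistentStore:
    """Toy key-value store: writes land on one replica and reach the others later."""

    def __init__(self, replica_count=2):
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet copied to the other replicas

    def put(self, key, value):
        # The write is acknowledged after updating only the first replica
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def get(self, key, replica=0):
        # A read may hit any replica, including one that is still stale
        return self.replicas[replica].get(key)

    def sync(self):
        # Background propagation: after this, all replicas agree
        for key, value in self.pending:
            for replica in self.replicas[1:]:
                replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.put("user:1", {"name": "Ann"})
stale_read = store.get("user:1", replica=1)   # second replica has not seen the write yet
store.sync()
fresh_read = store.get("user:1", replica=1)   # replicas have converged
```

The store stays available for reads and writes the whole time; the price is that a read between `put` and `sync` can return old data.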

NewSQL Databases






Overcome scaling limits of MySQL
Same scalable performance as NoSQL but using SQL
Providing ACID
Also called Scale-out SQL
Generally use distributed processing.

Ref: http://en.wikipedia.org/wiki/NewSQL
Columnar Databases
ID  | Name  | Salary
101 | Smith | 10000
105 | Jones | 20000
106 | Jones | 15000

In relational databases, data in each row of the table is stored together: 001:101,Smith,10000; 002:105,Jones,20000; 003:106,Jones,15000
⇒ Easy to find all information about a person.
⇒ Difficult to answer queries about the aggregate:
⇒ How many people have a salary between 12k and 15k?
In columnar databases, data in each column is stored together:
101:001, 105:002, 106:003; Smith:001, Jones:002,003; 10000:001, 20000:002, 15000:003
Easy to get column statistics
Very easy to add columns
Good for data with high variety ⇒ simply add columns
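The difference between the two layouts can be sketched in a few lines of Python using the three-row table above. The helper names `count_salary_between` and `add_column` are hypothetical, and plain lists and dictionaries stand in for on-disk storage.

```python
# Row layout: each record keeps all of its fields together
rows = [
    {"id": 101, "name": "Smith", "salary": 10000},
    {"id": 105, "name": "Jones", "salary": 20000},
    {"id": 106, "name": "Jones", "salary": 15000},
]

# Column layout: each field is stored as its own list
columns = {key: [row[key] for row in rows] for key in rows[0]}

def count_salary_between(columns, low, high):
    """Aggregate query that touches only the salary column."""
    return sum(1 for salary in columns["salary"] if low <= salary <= high)

def add_column(columns, name, default=None):
    """Adding a column is one new list; existing columns are untouched."""
    length = len(next(iter(columns.values())))
    columns[name] = [default] * length

in_range = count_salary_between(columns, 12000, 15000)
add_column(columns, "department")
```

The aggregate scans a single list instead of visiting every record, which is the reason column statistics are cheap in this layout.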

Ref: http://en.wikipedia.org/wiki/Column-oriented_DBMS
Types of Databases
Relational Databases: PostgreSQL, SQLite, MySQL
- NewSQL Databases: Scale-out using distributed processing
Non-relational Databases:
- Key-Value Pair (KVP) Databases: Data is stored as Key:Value, e.g., Riak Key-Value Database
- Document Databases: Store documents or web pages, e.g., MongoDB, CouchDB
- Columnar Databases: Store data in columns, e.g., HBase
- Graph Databases: Store nodes and relationships, e.g., Neo4j
- Spatial Databases: For map and navigational data, e.g., OpenGEO, PostGIS, ArcSDE
- In-Memory Databases (IMDB): All data in memory. For real-time applications
Cloud Databases: Any database that is run in a cloud using IAAS, VM Image, or DAAS
Google File System





Commodity computers serve as “Chunk Servers” and store multiple copies of data blocks
A master server keeps a map of all chunks of files and the locations of those chunks.
All writes are propagated by the writing chunk server to the other chunk servers that have copies.
The master server controls all read-write accesses
Figure: A master server (name space, block map) and four chunk servers holding replicated copies of blocks B1 to B4; a write to one chunk server is replicated to the others.
Ref: S. Ghemawat, et al., "The Google File System," SOSP 2003, http://research.google.com/archive/gfs.html
BigTable






Distributed storage system built on Google File System
Data stored in rows and columns
Optimized for sparse, persistent, multidimensional sorted map.
Uses commodity servers
Not distributed outside of Google but accessible via Google
App Engine

Ref: F. Chang, et al., "Bigtable: A Distributed Storage System for Structured Data," 2006, http://research.google.com/archive/bigtable.html
MapReduce







Software framework to process massive amounts of
unstructured data in parallel
Goals:
- Distributed: over a large number of inexpensive processors
- Scalable: expand or contract as needed
- Fault tolerant: continue in spite of some failures
Map: Takes a set of data and converts it into another set of key-value pairs.
Reduce: Takes the output from Map as input and outputs a smaller set of key-value pairs.
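The Map and Reduce roles described above can be sketched on a single machine with a word count. This only shows the data flow; it has none of the distribution or fault tolerance of a real framework, and the function names `map_fn`, `shuffle`, `reduce_fn`, and `map_reduce` are chosen for the illustration.

```python
from collections import defaultdict

def map_fn(line):
    """Map: turn one input record into a list of (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_fn(key, values):
    """Reduce: collapse the values of one key into a single pair."""
    return key, sum(values)

def map_reduce(records, map_fn, reduce_fn):
    mapped = [pair for record in records for pair in map_fn(record)]
    return dict(reduce_fn(key, values) for key, values in shuffle(mapped).items())

counts = map_reduce(["big data", "Big clusters"], map_fn, reduce_fn)
```

In a cluster, each call to `map_fn` and each call to `reduce_fn` could run on a different processor, because neither depends on the others' results.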
Figure: Input is split across several Map tasks; a Shuffle step routes their output to the Reduce tasks, each of which writes its own Output.

Ref: J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004, http://research.google.com/archive/mapreduce-osdi04.pdf
MapReduce Example









100 files with daily temperatures in two cities.
Each file has 10,000 entries.
For example, one file may have (Toronto 20), (New York 30), ...
Our goal is to compute the maximum temperature in each of the two cities.
Assign the task to 100 Map processors, each of which works on one file. Each processor outputs a list of key-value pairs holding the maximum it found for each city, e.g., (Toronto 30), (New York 65), …
Now we have 100 lists, each with two elements. We give these lists to two reducers: one for Toronto and another for New York.
The reducers produce the final answer: (Toronto 55), (New York 65)
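The temperature example can be traced in Python with three small stand-in files instead of 100 large ones. The sample readings and the helper names `map_max` and `reduce_max` are invented for the illustration; only the final answer matches the slide.

```python
def map_max(entries):
    """Map step for one file: keep only the highest reading per city."""
    best = {}
    for city, temp in entries:
        if city not in best or temp > best[city]:
            best[city] = temp
    return list(best.items())

def reduce_max(city, temps):
    """Reduce step for one city: the maximum over all per-file maxima."""
    return city, max(temps)

# Three hypothetical files stand in for the 100 files on the slide
files = [
    [("Toronto", 20), ("New York", 30), ("Toronto", 30)],
    [("Toronto", 55), ("New York", 65), ("New York", 12)],
    [("Toronto", 18), ("New York", 41)],
]

# One map output list per file, each with two elements
map_outputs = [map_max(entries) for entries in files]

# Shuffle: route each city's values to that city's reducer
by_city = {}
for output in map_outputs:
    for city, temp in output:
        by_city.setdefault(city, []).append(temp)

result = dict(reduce_max(city, temps) for city, temps in by_city.items())
```

Each map processor shrinks its file to two pairs before anything crosses the network, so the reducers only see one value per city per file.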

Ref: IBM, “What is MapReduce?,” http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
MapReduce Optimization






Scheduling:
- The task is broken into pieces that can be computed in parallel
- Map tasks are scheduled before the reduce tasks.
- If there are more map tasks than processors, map tasks continue until all of them are complete.
- A new strategy is used to assign Reduce jobs so that they can be done in parallel
- The results are combined.
Synchronization: The map jobs should be comparable so that they finish together. Similarly, reduce jobs should be comparable.
Code/Data Collocation: The data for map jobs should be at the processors that are going to map.
Fault/Error Handling: If a processor fails, its task needs to be assigned to another processor.
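The fault-handling rule, reassigning a failed processor's task to another processor, can be sketched as a simple sequential loop. This is a simulation under stated assumptions: `run_with_reassignment`, the round-robin choice of the next processor, and the `run_task` stub that always fails on processor `p2` are hypothetical and not taken from any real scheduler.

```python
def run_with_reassignment(tasks, processors, run_task):
    """Give each task to a processor; if that processor fails, try the next one."""
    results = {}
    for index, task in enumerate(tasks):
        for attempt in range(len(processors)):
            processor = processors[(index + attempt) % len(processors)]
            try:
                results[task] = run_task(processor, task)
                break  # task finished, move on to the next one
            except RuntimeError:
                continue  # processor failed, reassign the task
        else:
            raise RuntimeError(f"no processor could run task {task!r}")
    return results

def run_task(processor, task):
    # Stub that simulates one permanently failed processor
    if processor == "p2":
        raise RuntimeError("processor down")
    return f"{task} done on {processor}"

results = run_with_reassignment(["m1", "m2", "m3"], ["p1", "p2", "p3"], run_task)
```

Task `m2` is first given to the failed processor and then completes on the next one, so the job as a whole still finishes.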

Story of Hadoop





Doug Cutting at Yahoo and Mike Cafarella were working on a project called “Nutch” to build a large web index.
They saw the Google papers on MapReduce and the Google File System and used those ideas.
Hadoop was the name of a yellow plush elephant toy that Doug’s son had.
In 2008, Amr Awadallah left Yahoo to found Cloudera.
In 2009, Doug joined Cloudera.

Ref: Michael Minelli, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses," Wiley, 2013, ISBN:111814760X
Hadoop






An open source implementation of the MapReduce framework
Three components:
- Hadoop Common Package (files needed to start Hadoop)
- Hadoop Distributed File System: HDFS
- MapReduce Engine
HDFS requires data to be broken into blocks. Each block is stored on 2 or more data nodes on different racks.
Name node: Manages the file system name space ⇒ keeps track of where each block is.
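The block map kept by the name node can be sketched as a dictionary from block IDs to data nodes. This is an illustration only: the `place_blocks` function, the round-robin rack choice, and the node and rack names are assumptions, and HDFS's actual placement policy is more involved.

```python
def place_blocks(data, block_size, nodes_by_rack, replicas=2):
    """Split data into blocks and pick one data node on each of `replicas` racks."""
    racks = sorted(nodes_by_rack)
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas")
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]
    block_map = {}  # what the name node would remember: block ID -> data nodes
    for n in range(len(blocks)):
        locations = []
        for r in range(replicas):
            rack = racks[(n + r) % len(racks)]   # a different rack for each copy
            nodes = nodes_by_rack[rack]
            locations.append(nodes[n % len(nodes)])
        block_map[f"B{n + 1}"] = locations
    return blocks, block_map

# Hypothetical cluster: two racks with two data nodes each
cluster = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
blocks, block_map = place_blocks(b"abcdefghij", 4, cluster)
```

Placing the copies on different racks means that losing one rack, for example through a switch failure, still leaves a copy of every block reachable.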
Figure: The name node (name space, block map) and data nodes holding replicated copies of blocks B1 to B4; a write to one data node is replicated to the others.
Hadoop (Cont)





Data node: Constantly asks the job tracker if there is something for it to do ⇒ These requests are also used to track which data nodes are up or down
Job tracker: Assigns map jobs to task tracker nodes that have the data or are close to the data (same rack)
Task Tracker: Keeps the work as close to the data as possible.
Figure: Four racks, each with its own switch and several nodes that run both a data node and a task tracker (DN+TT). The racks also host the job tracker, name node, secondary job tracker (Sec. Job Tracker), and secondary name node (Sec. NN). DN = Data Node, TT = Task Tracker.
Hadoop (Cont)




Data nodes get the data if necessary, do the map function, and write the results to disks.
The job tracker then assigns the reduce jobs to data nodes that have the map output or are close to it.
All data has a checksum attached to it to verify its integrity.
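The integrity check can be sketched with a CRC-32 checksum from Python's standard `zlib` module. The choice of CRC-32 and the helper names `attach_checksum` and `verify` are assumptions made for the illustration.

```python
import zlib

def attach_checksum(block: bytes):
    """Store a block together with the checksum computed when it was written."""
    return block, zlib.crc32(block)

def verify(block: bytes, checksum: int) -> bool:
    """Recompute the checksum on read and compare it with the stored one."""
    return zlib.crc32(block) == checksum

block, checksum = attach_checksum(b"map output for Toronto")
corrupted = b"map output for Toront0"  # one byte changed, e.g., by a disk error

intact_ok = verify(block, checksum)
corrupted_ok = verify(corrupted, checksum)
```

A reader that finds a mismatch knows the copy is damaged and can fetch the block from another node that holds a replica.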

Apache Hadoop Tools







Apache Hadoop: Open source Hadoop framework in Java.
Consists of Hadoop Common Package (filesystem and OS
abstractions), a MapReduce engine (MapReduce or YARN), and
Hadoop Distributed File System (HDFS)
Apache Mahout: Machine learning algorithms for collaborative
filtering, clustering, and classification using Hadoop
Apache Hive: Data warehouse infrastructure for Hadoop.
Provides data summarization, query, and analysis using a SQL-like language called HiveQL.
By default, stores its metadata in an embedded Apache Derby database.
Apache Pig: Platform for creating MapReduce programs using a
high-level “Pig Latin” language. Makes MapReduce
programming similar to SQL. Can be extended by user defined
functions written in Java, Python, etc.

Ref: http://hadoop.apache.org/, http://mahout.apache.org/, http://hive.apache.org/, http://pig.apache.org/
Apache Hadoop Tools (Cont)







Apache Avro: Data serialization system.
Avro IDL is the interface description language syntax for Avro.
Apache HBase: Non-relational DBMS that is part of the Hadoop project. Designed for large quantities of sparse data (like BigTable). Provides a Java API for MapReduce jobs to access the data. Used by Facebook.
Apache ZooKeeper: Distributed configuration service,
synchronization service, and naming registry for large
distributed systems like Hadoop.
Apache Cassandra: Distributed database management system.
Highly scalable.

Ref: http://avro.apache.org/, http://cassandra.apache.org/, http://hbase.apache.org/, http://zookeeper.apache.org/
Apache Hadoop Tools (Cont)






Apache Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters
Apache Chukwa: A data collection system for managing large
distributed systems
Apache Sqoop: Tool for transferring bulk data between
structured databases and Hadoop
Apache Oozie: A workflow scheduler system to manage
Apache Hadoop jobs

Ref: http://incubator.apache.org/chukwa/, http://oozie.apache.org/, https://sqoop.apache.org/, http://incubator.apache.org/ambari/
Other Apache Big Data Tools







Apache Accumulo: Sorted distributed key/value store based on Google’s BigTable design. Third most popular NoSQL wide-column system. Provides cell-level security: users can see only authorized keys and values. Originally funded by DoD.
Apache Thrift: IDL to create services using many languages
including C#, C++, Java, Perl, Python, Ruby, etc.
Apache Beehive: Java application framework to allow
development of Java based applications.
Apache Derby: A RDBMS that can be embedded in Java
programs. Needs only 2.6MB disk space. Supports JDBC (Java
Database Connectivity) and SQL.

Ref: http://en.wikipedia.org/wiki/Apache_Accumulo, http://en.wikipedia.org/wiki/Apache_Thrift, http://en.wikipedia.org/wiki/Apache_Beehive, http://en.wikipedia.org/wiki/Apache_derby
Other Big Data Tools







Cascading: Open Source software abstraction layer for
Hadoop.
Allows developers to create a .jar file that describes their data
sources, analysis, and results without knowing MapReduce.
Hadoop .jar file contains Cascading .jar files.
Storm: Open source event processor and distributed
computation framework alternative to MapReduce. Allows
batch distributed processing of streaming data using a sequence
of transformations.
Elastic MapReduce (EMR): Amazon Web Services offering that automates provisioning, running, and terminating of the Hadoop cluster.
HyperTable: Hadoop compatible database system.

Ref: http://en.wikipedia.org/wiki/Cascading, http://en.wikipedia.org/wiki/Hypertable, http://en.wikipedia.org/wiki/Storm_%28event_processor%29
Other Big Data Tools (Cont)







Filesystem in User-space (FUSE): Users can create their own
virtual file systems. Available for Linux, Android, OSX, etc.
Cloudera Impala: Open source SQL query execution on
HDFS and Apache HBase data
MapR Hadoop: Enhanced versions of Apache Hadoop
supported by MapR. Google, EMC, Amazon use MapR
Hadoop.
Big SQL: SQL interface to Hadoop (IBM)
Hadapt: Analysis of massive data sets using SQL with Apache
Hadoop.

Ref: http://en.wikipedia.org/wiki/Cloudera_Impala, http://en.wikipedia.org/wiki/Filesystem_in_Userspace, http://en.wikipedia.org/wiki/Big_SQL, http://en.wikipedia.org/wiki/MapR, http://en.wikipedia.org/wiki/Hadapt
Analytics
Analytics: Guide decision making by discovering patterns in data using statistics, programming, and operations research.
- SQL Analytics: Count, Mean, OLAP
- Descriptive Analytics: Analyzing historical data to explain past successes or failures.
- Predictive Analytics: Forecasting using historical data.
- Prescriptive Analytics: Suggest decision options. Continually update these options with new data.
- Data Mining: Discovering patterns, trends, and relationships using association rules, clustering, and feature extraction
- Simulation: Discrete event simulation, Monte Carlo, agent-based
- Optimization: Linear, non-linear
- Machine Learning: Algorithmic techniques for learning from empirical data and then using those lessons to predict future outcomes for new data
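The difference between descriptive and predictive analytics can be shown on a tiny, made-up sales series. The numbers and the `fit_line` helper are assumptions for the illustration; a least-squares straight line stands in for the much richer forecasting methods used in practice.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    return slope, y_bar - slope * x_bar

# Hypothetical historical data: sales for months 1 to 4
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

# Descriptive: summarize what already happened
average_sales = mean(sales)

# Predictive: extend the fitted trend to month 5
slope, intercept = fit_line(months, sales)
forecast_month_5 = slope * 5 + intercept
```

A prescriptive step would go one further and use such forecasts to rank decision options, updating the ranking as new months arrive.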
Ref: Michael Minelli, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses," Wiley, 2013, ISBN:111814760X
Analytics (Cont)




Web Analytics: Analytics of Web Accesses and Web users.
Learning Analytics: Analytics of learners (students)
Data Science: Field of data analytics

Summary
1. Big data has become possible due to low cost storage, high performance servers, high-speed networking, new analytics
2. Google File System, BigTable Database, and MapReduce framework sparked the development of Apache Hadoop.
3. Key components of Hadoop systems are HDFS, Avro data serialization system, MapReduce or YARN computation engine, Pig Latin high level programming language, Hive data warehouse, HBase database, and ZooKeeper for reliable distributed coordination.
4. Discovering patterns in data and using them is called Analytics. It can be descriptive, predictive, or prescriptive
5. Types of Databases: Relational, SQL, NoSQL, NewSQL, Key-Value Pair (KVP), Document, Columnar, Graph, and Spatial

Reading List





J. Hurwitz, et al., “Big Data for Dummies,” Wiley, 2013, ISBN:978-1-118-50422-2 (Safari Book)
A. Shieh, “Sharing the Data Center Network,” NSDI 2011, http://www.usenix.org/event/nsdi11/tech/full_papers/Shieh.pdf
IBM, “What is MapReduce?,” http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/
Michael Minelli, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses," Wiley, 2013, ISBN:111814760X

Wikipedia Links















http://en.wikipedia.org/wiki/Database
http://en.wikipedia.org/wiki/Big_data
http://en.wikipedia.org/wiki/Apache_Hadoop
http://en.wikipedia.org/wiki/ACID
http://en.wikipedia.org/wiki/Analytics
http://en.wikipedia.org/wiki/Prescriptive_analytics
http://en.wikipedia.org/wiki/Predictive_analytics
http://en.wikipedia.org/wiki/Apache_HBase
http://en.wikipedia.org/wiki/Apache_Hive
http://en.wikipedia.org/wiki/Apache_Mahout
http://en.wikipedia.org/wiki/Apache_Pig
http://en.wikipedia.org/wiki/Apache_ZooKeeper
http://en.wikipedia.org/wiki/Apache_Accumulo

Wikipedia Links (Cont)















http://en.wikipedia.org/wiki/Apache_Avro
http://en.wikipedia.org/wiki/Apache_Beehive
http://en.wikipedia.org/wiki/Apache_Cassandra
http://en.wikipedia.org/wiki/Relational_database
http://en.wikipedia.org/wiki/Relational_database_management_system
http://en.wikipedia.org/wiki/Column-oriented_DBMS
http://en.wikipedia.org/wiki/Spatial_database
http://en.wikipedia.org/wiki/SQL
http://en.wikipedia.org/wiki/NoSQL
http://en.wikipedia.org/wiki/NewSQL
http://en.wikipedia.org/wiki/MySQL
http://en.wikipedia.org/wiki/Create,_read,_update_and_delete
http://en.wikipedia.org/wiki/Unstructured_data
http://en.wikipedia.org/wiki/Semi-structured_data

Wikipedia Links (Cont)












http://en.wikipedia.org/wiki/Apache_derby
http://en.wikipedia.org/wiki/Apache_Thrift
http://en.wikipedia.org/wiki/Big_SQL
http://en.wikipedia.org/wiki/Cascading
http://en.wikipedia.org/wiki/Cloudera_Impala
http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
http://en.wikipedia.org/wiki/Filesystem_in_Userspace
http://en.wikipedia.org/wiki/Hadapt
http://en.wikipedia.org/wiki/Hypertable
http://en.wikipedia.org/wiki/MapR
http://en.wikipedia.org/wiki/Storm_%28event_processor%29

References


















J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004, http://research.google.com/archive/mapreduce-osdi04.pdf
S. Ghemawat, et al., "The Google File System," SOSP 2003, http://research.google.com/archive/gfs.html
F. Chang, et al., "Bigtable: A Distributed Storage System for Structured Data," 2006, http://research.google.com/archive/bigtable.html
http://avro.apache.org/
http://cassandra.apache.org/
http://hadoop.apache.org/
http://hbase.apache.org/
http://hive.apache.org/
http://incubator.apache.org/ambari/
http://incubator.apache.org/chukwa/
http://mahout.apache.org/
http://oozie.apache.org/
http://pig.apache.org/
http://zookeeper.apache.org/
https://sqoop.apache.org/

Acronyms

















ACID     Atomicity, Consistency, Isolation, Durability
API      Application Programming Interface
ArcSDE   Arc Spatial Database Engine
BASE     Basically Available, Soft, and Eventual Consistency
CRUD     Create, Retrieve, Update, and Delete
DAAS     Database as a Service
DARPA    Defense Advanced Research Projects Agency
DBMS     Database Management System
DN       Data Node
DoD      Department of Defense
EMC      Name of a company
FUSE     Filesystem in User-space
HDFS     Hadoop Distributed File System
IAAS     Infrastructure as a Service
IBM      International Business Machines
ID       Identifier

Acronyms (Cont)

















IDL          Interface Description Language
IMDB         In-Memory Database
JDBC         Java Database Connectivity
KVP          Key-Value Pair
NewSQL       New SQL
NoSQL        Not Only SQL
OLAP         Online Analytical Processing
OpenGEO
OSDI         Operating Systems Design and Implementation
OSX          Apple Mac Operating System version 10
PostGIS      Spatial (geographic information system) extension for PostgreSQL
PostgreSQL   Postgres SQL
RDBMS        Relational Database Management System
REST         Representational State Transfer
SQL          Structured Query Language
TB           Terabyte

Acronyms (Cont)






TT     Task Tracker
US     United States
USGS   United States Geological Survey
VM     Virtual Machine
YARN   Yet Another Resource Negotiator

Quiz 10A






The 3V's that define Big Data are _______________,
_______________, and _______________
ACID stands for _______________, _______________,
_______________, and _______________
BASE stands for _______________ _______________,
_______________, and _______________ Consistency.
_______________ data is the data that has pre-set format.
Data in _______________ is the data that is streaming.

Your Name: ________________________
Solution to Quiz 10A






The 3V's that define Big Data are *Volume*, *velocity*, and
*variety*
ACID stands for *Atomicity*, *Consistency*, *Isolation*, and
*Durability*
BASE stands for *Basically* *Available*, *Soft*, and
*Eventual* Consistency.
*Structured* data is the data that has pre-set format.
Data in *Motion* is the data that is streaming.

Washington University in St. Louis

http://www.cse.wustl.edu/~jain/cse570-13/

10-42

©2013 Raj Jain


Big Data Fundamentals

  • 1. Big Data Fundamentals . Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 [email protected] These slides and audio/video recordings of this class lecture are at: http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-1 ©2013 Raj Jain
  • 2. Overview 1. 2. 3. 4. 5. Why Big Data? Terminology Key Technologies: Google File System, MapReduce, Hadoop Hadoop and other database tools Types of Databases Ref: J. Hurwitz, et al., “Big Data for Dummies,” Wiley, 2013, ISBN:978-1-118-50422-2 http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-2 ©2013 Raj Jain
  • 3. Big Data      Data is measured by 3V's: Volume: TB Velocity: TB/sec. Speed of creation or change Variety: Type (Text, audio, video, images, geospatial, ...) Increasing processing power, storage capacity, and networking have caused data to grow in all 3 dimensions. Volume, Location, Velocity, Churn, Variety, Veracity (accuracy, correctness, applicability)  Examples: social network data, sensor networks, Internet Search, Genomics, astronomy, …  Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-3 ©2013 Raj Jain
  • 4. Why Big Data Now? 1. 2. 3. 4. 5. Low cost storage to store data that was discarded earlier Powerful multi-core processors Low latency possible by distributed computing: Compute clusters and grids connected via high-speed networks Virtualization  Partition, Aggregate, isolate resources in any size and dynamically change it  Minimize latency for any scale Affordable storage and computing with minimal man power via clouds  Possible because of advances in Networking Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-4 ©2013 Raj Jain
  • 5. Why Big Data Now? (Cont) 6. 7. 8. 9. 10. Better understanding of task distribution (MapReduce), computing architecture (Hadoop), Advanced analytical techniques (Machine learning) Managed Big Data Platforms: Cloud service providers, such as Amazon Web Services provide Elastic MapReduce, Simple Storage Service (S3) and HBase – column oriented database. Google’ BigQuery and Prediction API. Open-source software: OpenStack, PostGresSQL March 12, 2012: Obama announced $200M for Big Data research. Distributed via NSF, NIH, DOE, DoD, DARPA, and USGS (Geological Survey) Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-5 ©2013 Raj Jain
  • 6. Big Data Applications    Monitor premature infants to alert when interventions is needed Predict machine failures in manufacturing Prevent traffic jams, save fuel, reduce pollution Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-6 ©2013 Raj Jain
  • 7. ACID Requirements      Atomicity: All or nothing. If anything fails, entire transaction fails. Example, Payment and ticketing. Consistency: If there is error in input, the output will not be written to the database. Database goes from one valid state to another valid states. Valid=Does not violate any defined rules. Isolation: Multiple parallel transactions will not interfere with each other. Durability: After the output is written to the database, it stays there forever even after power loss, crashes, or errors. Relational databases provide ACID while non-relational databases aim for BASE (Basically Available, Soft, and Eventual Consistency) Ref: http://en.wikipedia.org/wiki/ACID Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-7 ©2013 Raj Jain
  • 8. Terminology         Structured Data: Data that has a pre-set format, e.g., Address Books, product catalogs, banking transactions, Unstructured Data: Data that has no pre-set format. Movies, Audio, text files, web pages, computer programs, social media, Semi-Structured Data: Unstructured data that can be put into a structure by available format descriptions 80% of data is unstructured. Batch vs. Streaming Data Real-Time Data: Streaming data that needs to analyzed as it comes in. E.g., Intrusion detection. Aka “Data in Motion” Data at Rest: Non-real time. E.g., Sales analysis. Metadata: Definitions, mappings, scheme Ref: Michael Minelli, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses," Wiley, 2013, ISBN:'111814760X http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis ©2013 Raj Jain 10-8
  • 9. Relational Databases and SQL  Relational Database: Stores data in tables. A “Schema” defines the tables, the fields in tables and relationships between the two. Data is stored one column/attribute Order Table Customer Table Order Number Customer ID Product ID Quantity Unit Price Customer ID Customer Name Customer Address Gender Income Range SQL (Structured Query Language): Most commonly used language for creating, retrieving, updating, and deleting (CRUD) data in a relational database Example: To find the gender of customers who bought XYZ: Select CustomerID, State, Gender, ProductID from “Customer Table”, “Order Table” where ProductID = XYZ  Ref: http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-9 ©2013 Raj Jain
  • 10. Non-relational Databases      NoSQL: Not Only SQL. Any database that uses non-SQL interfaces, e.g., Python, Ruby, C, etc. for retrieval. Typically store data in key-value pairs. Not limited to rows or columns. Data structure and query is specific to the data type High-performance in-memory databases RESTful (Representational State Transfer) web-like APIs Eventual consistency: BASE in place of ACID Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-10 ©2013 Raj Jain
  • 11. NewSQL Databases      Overcome scaling limits of MySQL Same scalable performance as NoSQL but using SQL Providing ACID Also called Scale-out SQL Generally use distributed processing. Ref: http://en.wikipedia.org/wiki/NewSQL Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-11 ©2013 Raj Jain
  • 12. Columnar Databases ID 101 105 106   Name Salary Smith 10000 Jones 20000 Jones 15000 In Relational databases, data in each row of the table is stored together: 001:101,Smith,10000; 002:105,Jones,20000; 003:106,John;15000  Easy to find all information about a person.  Difficult to answer queries about the aggregate:  How many people have salary between 12k-15k? In Columnar databases, data in each column is stored together. 101:001,105:002,106:003; Smith:001, Jones:002,003; 10000:001, 20000:002, 150000:003    Easy to get column statistics Very easy to add columns Good for data with high variety  simply add columns Ref: http://en.wikipedia.org/wiki/Column-oriented_DBMS http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-12 ©2013 Raj Jain
  • 13. Types of Databases Relational Databases: PostgreSQL, SQLite, MySQL  NewSQL Databases: Scale-out using distributed processing Non-relational Databases:  Key-Value Pair (KVP) Databases: Data is stored as Key:Value, e.g., Riak Key-Value Database  Document Databases: Store documents or web pages, e.g., MongoDB, CouchDB  Columnar Databases: Store data in columns, e.g., HBase  Graph Databases: Stores nodes and relationship, e.g., Neo4J  Spatial Databases: For map and nevigational data, e.g., OpenGEO, PortGIS, ArcSDE  In-Memory Database (IMDB): All data in memory. For real time applications Cloud Databases: Any data that is run in a cloud using IAAS, VM Image, DAAS http://www.cse.wustl.edu/~jain/cse570-13/  Washington University in St. Louis ©2013 Raj Jain 10-13
  • 14. Google File System     Commodity computers serve as “Chunk Servers” and store multiple copies of data blocks A master server keeps a map of all chunks of files and location of those chunks. All writes are propagated by the writing chunk server to other chunk servers that have copies. Master server controls all read-write accesses Name Space Block Map Master Server Chunk Server B1 B2 B3 Chunk Server B3 B2 B4 Chunk Server Replicate Chunk Server B4 B4 B2 B1 B3 B1 Write Ref: S. Ghemawat, et al., "The Google File System", OSP 2003, http://research.google.com/archive/gfs.html http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-14 ©2013 Raj Jain
  • 15. BigTable      Distributed storage system built on Google File System Data stored in rows and columns Optimized for sparse, persistent, multidimensional sorted map. Uses commodity servers Not distributed outside of Google but accessible via Google App Engine Ref: F. Chang, et al., "Bigtable: A Distributed Storage System for Structured Data," 2006, http://research.google.com/archive/bigtable.html http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-15 ©2013 Raj Jain
  • 16. MapReduce     Software framework to process massive amounts of unstructured data in parallel Goals:  Distributed: over a large number of inexpensive processors  Scalable: expand or contract as needed  Fault tolerant: Continue in spite of some failures Map: Takes a set of data and converts it into another set of key-value pairs.. Reduce: Takes the output from Map as input and outputs a smaller set of key-value pairs. Shuffle Input Map Reduce Output Reduce Output Map Map Ref: J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004, http://research.google.com/archive/mapreduce-osdi04.pdf http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-16 ©2013 Raj Jain
  • 17. MapReduce Example       100 files with daily temperature in two cities. Each file has 10,000 entries. For example, one file may have (Toronto 20), (New York 30), .. Our goal is to compute the maximum temperature in the two cities. Assign the task to 100 Map processors each works on one file. Each processor outputs a list of key-value pairs, e.g., (Toronto 30), New York (65), … Now we have 100 lists each with two elements. We give this list to two reducers – one for Toronto and another for New York. The reducer produce the final answer: (Toronto 55), (New York 65) Ref: IBM. “What is MapReduce?,” http://www-01.ibm.com/software/data/infosphere/hadoop/mapreduce/ http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-17 ©2013 Raj Jain
  • 18. MapReduce Optimization     Scheduling:  Task is broken into pieces that can be computed in parallel  Map tasks are scheduled before the reduce tasks.  If there are more map tasks than processors, map tasks continue until all of them are complete.  A new strategy is used to assign Reduce jobs so that it can be done in parallel  The results are combined. Synchronization: The map jobs should be comparables so that they finish together. Similarly reduce jobs should be comparable. Code/Data Collocation: The data for map jobs should be at the processors that are going to map. Fault/Error Handling: If a processor fails, its task needs to be assigned to another processor. Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-18 ©2013 Raj Jain
  • 19. Story of Hadoop     Doug Cutting at Yahoo and Mike Caferella were working on creating a project called “Nutch” for large web index. They saw Google papers on MapReduce and Google File System and used it Hadoop was the name of a yellow plus elephant toy that Doug’s son had. In 2008 Amr left Yahoo to found Cloudera. In 2009 Doug joined Cloudera. Ref: Michael Minelli, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses," Wiley, 2013, ISBN:'111814760X http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis ©2013 Raj Jain 10-19
  • 20. Hadoop     An open source implementation of MapReduce framework Three components:  Hadoop Common Package (files needed to start Hadoop)  Hadoop Distributed File System: HDFS  MapReduce Engine HDFS requires data to be broken into blocks. Each block is stored on 2 or more data nodes on different racks. Name node: Manages the file system name space  keeps track of where each block is. Name Space Block Map Name Node B1 B2 B3 B3 B2 B4 Replicate B4 B2 B1 B4 B3 B1 Write Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-20 ©2013 Raj Jain
  • 21. Hadoop (Cont)    Data node: Constantly ask the job tracker if there is something for them to do  Used to track which data nodes are up or down Job tracker: Assigns the map job to task tracker nodes that have the data or are close to the data (same rack) Task Tracker: Keep the work as close to the data as possible. Switch Switch Job Tracker Name Node DN+TT DN+TT DN+TT DN+TT Rack Washington University in St. Louis Rack Switch Sec. Job Tracker DN+TT DN+TT Rack Switch Sec. NN DN+TT DN+TT Rack http://www.cse.wustl.edu/~jain/cse570-13/ 10-21 DN= Data Node TT = Task Tracker ©2013 Raj Jain
  • 22. Hadoop (Cont)    Data nodes get the data if necessary, do the map function, and write the results to disks. Job tracker than assigns the reduce jobs to data nodes that have the map output or close to it. All data has a check attached to it to verify its integrity. Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-22 ©2013 Raj Jain
  • 23. Apache Hadoop Tools     Apache Hadoop: Open source Hadoop framework in Java. Consists of Hadoop Common Package (filesystem and OS abstractions), a MapReduce engine (MapReduce or YARN), and Hadoop Distributed File System (HDFS) Apache Mahout: Machine learning algorithms for collaborative filtering, clustering, and classification using Hadoop Apache Hive: Data warehouse infrastructure for Hadoop. Provides data summarization, query, and analysis using a SQLlike language called HiveQL. Stores data in an embedded Apache Derby database. Apache Pig: Platform for creating MapReduce programs using a high-level “Pig Latin” language. Makes MapReduce programming similar to SQL. Can be extended by user defined functions written in Java, Python, etc. Ref: http://hadoop.apache.org/, http://mahout.apache.org/, http://hive.apache.org/, http://pig.apache.org/ http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-23 ©2013 Raj Jain
  • 24. Apache Hadoop Tools (Cont)     Apache Avro: Data serialization system. Avro IDL is the interface description language syntax for Avro. Apache HBase: Non-relational DBMS part of the Hadoop project. Designed for large quantities of sparse data (like BigTable). Provides a Java API for map reduce jobs to access the data. Used by Facebook. Apache ZooKeeper: Distributed configuration service, synchronization service, and naming registry for large distributed systems like Hadoop. Apache Cassandra: Distributed database management system. Highly scalable. Ref: http://avro.apache.org/, http://cassandra.apache.org/, http://hbase.apache.org/ , http://zookeeper.apache.org/ http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-24 ©2013 Raj Jain
  • 25. Apache Hadoop Tools (Cont)     Apache Ambari: A web-based tool for provision, managing and monitoring Apache Hadoop cluster Apache Chukwa: A data collection system for managing large distributed systems Apache Sqoop: Tool for transferring bulk data between structured databases and Hadoop Apache Oozie: A workflow scheduler system to manage Apache Hadoop jobs Ref: http://incubator.apache.org/chukwa/, http://oozie.apache.org/, https://sqoop.apache.org/, http://incubator.apache.org/ambari/ http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis ©2013 Raj Jain 10-25
  • 26. Apache Other Big Data Tools     Apache Accumulo: Sorted distributed key/value store based on Google’s BigTable design. 3rd Most popular NOSQL widecolumn system. Provides cell-level security. Users can see only authorized keys and values. Originally funded by DoD. Apache Thrift: IDL to create services using many languages including C#, C++, Java, Perl, Python, Ruby, etc. Apache Beehive: Java application framework to allow development of Java based applications. Apache Derby: A RDBMS that can be embedded in Java programs. Needs only 2.6MB disk space. Supports JDBC (Java Database Connectivity) and SQL. Ref: http://en.wikipedia.org/wiki/Apache_Accumulo, http://en.wikipedia.org/wiki/Apache_Thrift, http://en.wikipedia.org/wiki/Apache_Beehive, http://en.wikipedia.org/wiki/Apache_derby, http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-26 ©2013 Raj Jain
  • 27. Other Big Data Tools     Cascading: Open Source software abstraction layer for Hadoop. Allows developers to create a .jar file that describes their data sources, analysis, and results without knowing MapReduce. Hadoop .jar file contains Cascading .jar files. Storm: Open source event processor and distributed computation framework alternative to MapReduce. Allows batch distributed processing of streaming data using a sequence of transformations. Elastic MapReduce (EMR): Automated provisioning of the Hadoop cluster, running, and terminating. Aka Hive. HyperTable: Hadoop compatible database system. Ref: http://en.wikipedia.org/wiki/Cascading, http://en.wikipedia.org/wiki/Hypertable, http://en.wikipedia.org/wiki/Storm_%28event_processor%29 http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis 10-27 ©2013 Raj Jain
  • 28. Other Big Data Tools (Cont)      Filesysem in User-space (FUSE): Users can create their own virtual file systems. Available for Linux, Android, OSX, etc. Cloudera Impala: Open source SQL query execution on HDFS and Apache HBase data MapR Hadoop: Enhanced versions of Apache Hadoop supported by MapR. Google, EMC, Amazon use MapR Hadoop. Big SQL: SQL interface to Hadoop (IBM) Hadapt: Analysis of massive data sets using SQL with Apache Hadoop. Ref: http://en.wikipedia.org/wiki/Cloudera_Impala http://en.wikipedia.org/wiki/Filesystem_in_Userspace, http://en.wikipedia.org/wiki/Big_SQL, http://en.wikipedia.org/wiki/Cloudera_Impala , http://en.wikipedia.org/wiki/MapR, http://en.wikipedia.org/wiki/Hadapt http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis ©2013 Raj Jain 10-28
  • 29. Analytics Analytics: Guide decision making by discovering patterns in data using statistics, programming, and operations research.  SQL Analytics: Count, Mean, OLAP  Descriptive Analytics: Analyzing historical data to explain past success or failures.  Predictive Analytics: Forecasting using historical data.  Prescriptive Analytics: Suggest decision options. Continually update these options with new data.  Data Mining: Discovering patterns, trends, and relationships using Association rules, Clustering, Feature extraction  Simulation: Discrete Event Simulation, Monte Carlo, Agent-based  Optimization: Linear, non-Linear  Machine Learning: An algorithm technique for learning from empirical data and then using those lessons to predict future outcomes of new data Ref: Michael Minelli, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses," Wiley, 2013, ISBN:111814760X http://www.cse.wustl.edu/~jain/cse570-13/ Washington University in St. Louis ©2013 Raj Jain 10-29
  • 30. Analytics (Cont)    Web Analytics: Analytics of Web Accesses and Web users. Learning Analytics: Analytics of learners (students) Data Science: Field of data analytics Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-30 ©2013 Raj Jain
  • 31. Summary 1. 2. 3. 4. 5. Big data has become possible due to low cost storage, high performance servers, high-speed networking, new analytics Google File System, BigTable Database, and MapReduce framework sparked the development of Apache Hadoop. Key components of Hadoop systems are HDFS, Avro data serialization system, MapReduce or YARN computation engine, Pig Latin high level programming language, Hive data warehouse, HBase database, and ZooKeeper for reliable distributed coordination. Discovering patterns in data and using them is called Analytics. It can be descriptive, predictive, or prescriptive Types of Databases: Relational, SQL, NoSQL, NewSQL, Key-Value Pair (KVP), Document, Columnar, Graph, and Spatial Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-31 ©2013 Raj Jain
  • 32. Reading List     J. Hurwitz, et al., “Big Data for Dummies,” Wiley, 2013, ISBN:978-1-118-50422-2 (Safari Book) A. Shieh, “Sharing the Data Center Network,” NSDI 2011, http://www.usenix.org/event/nsdi11/tech/full_papers/Shieh.pdf IBM. “What is MapReduce?,” http://www01.ibm.com/software/data/infosphere/hadoop/mapreduce/ Michael Minelli, "Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses," Wiley, 2013, ISBN:111814760X Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-32 ©2013 Raj Jain
  • 33. Wikipedia Links               http://en.wikipedia.org/wiki/Database http://en.wikipedia.org/wiki/Big_data http://en.wikipedia.org/wiki/Apache_Hadoop http://en.wikipedia.org/wiki/ACID http://en.wikipedia.org/wiki/Analytics http://en.wikipedia.org/wiki/Prescriptive_analytics http://en.wikipedia.org/wiki/Predictive_analytics http://en.wikipedia.org/wiki/Prescriptive_Analytics http://en.wikipedia.org/wiki/Apache_HBase http://en.wikipedia.org/wiki/Apache_Hive http://en.wikipedia.org/wiki/Apache_Mahout http://en.wikipedia.org/wiki/Apache_Pig http://en.wikipedia.org/wiki/Apache_ZooKeeper http://en.wikipedia.org/wiki/Apache_Accumulo Washington University in St. Louis http://www.cse.wustl.edu/~jain/cse570-13/ 10-33 ©2013 Raj Jain
  • 34. Wikipedia Links (Cont)
    http://en.wikipedia.org/wiki/Apache_Avro
    http://en.wikipedia.org/wiki/Apache_Beehive
    http://en.wikipedia.org/wiki/Apache_Cassandra
    http://en.wikipedia.org/wiki/Relational_database
    http://en.wikipedia.org/wiki/Relational_database_management_system
    http://en.wikipedia.org/wiki/Column-oriented_DBMS
    http://en.wikipedia.org/wiki/Spatial_database
    http://en.wikipedia.org/wiki/SQL
    http://en.wikipedia.org/wiki/NoSQL
    http://en.wikipedia.org/wiki/NewSQL
    http://en.wikipedia.org/wiki/MySQL
    http://en.wikipedia.org/wiki/Create,_read,_update_and_delete
    http://en.wikipedia.org/wiki/Unstructured_data
    http://en.wikipedia.org/wiki/Semi-structured_data
  • 35. Wikipedia Links (Cont)
    http://en.wikipedia.org/wiki/Apache_derby
    http://en.wikipedia.org/wiki/Apache_Thrift
    http://en.wikipedia.org/wiki/Big_SQL
    http://en.wikipedia.org/wiki/Cascading
    http://en.wikipedia.org/wiki/Cloudera_Impala
    http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems
    http://en.wikipedia.org/wiki/Filesystem_in_Userspace
    http://en.wikipedia.org/wiki/Hadapt
    http://en.wikipedia.org/wiki/Hypertable
    http://en.wikipedia.org/wiki/MapR
    http://en.wikipedia.org/wiki/Storm_%28event_processor%29
  • 36. References
    J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” OSDI 2004, http://research.google.com/archive/mapreduce-osdi04.pdf
    S. Ghemawat, et al., “The Google File System,” SOSP 2003, http://research.google.com/archive/gfs.html
    F. Chang, et al., “Bigtable: A Distributed Storage System for Structured Data,” 2006, http://research.google.com/archive/bigtable.html
    http://avro.apache.org/
    http://cassandra.apache.org/
    http://hadoop.apache.org/
    http://hbase.apache.org/
    http://hive.apache.org/
    http://incubator.apache.org/ambari/
    http://incubator.apache.org/chukwa/
    http://mahout.apache.org/
    http://oozie.apache.org/
    http://pig.apache.org/
    http://zookeeper.apache.org/
    https://sqoop.apache.org/
  • 37. Acronyms
    ACID     Atomicity, Consistency, Isolation, Durability
    API      Application Programming Interface
    ArcSDE   Arc Spatial Database Engine
    BASE     Basically Available, Soft state, and Eventual consistency
    CRUD     Create, Retrieve, Update, and Delete
    DAAS     Database as a Service
    DARPA    Defense Advanced Research Projects Agency
    DBMS     Database Management System
    DN       Data Node
    DoD      Department of Defense
    EMC      Name of a company
    FUSE     Filesystem in Userspace
    HDFS     Hadoop Distributed File System
    IAAS     Infrastructure as a Service
    IBM      International Business Machines
    ID       Identifier
  • 38. Acronyms (Cont)
    IDL          Interface Description Language
    IMDB         In-Memory Database
    JDBC         Java Database Connectivity
    KVP          Key-Value Pair
    NewSQL       New SQL
    NoSQL        Not Only SQL
    OLAP         Online Analytical Processing
    OpenGEO
    OSDI         Operating Systems Design and Implementation
    OSX          Apple Mac Operating System version 10
    PortGIS      Port Geographical Information System
    PostgreSQL   Postgres SQL
    RDBMS        Relational Database Management System
    REST         Representational State Transfer
    SQL          Structured Query Language
    TB           Terabyte
  • 39. Acronyms (Cont)
    TT     Task Tracker
    US     United States
    USGS   United States Geological Survey
    VM     Virtual Machine
    YARN   Yet Another Resource Negotiator
  • 40. Quiz 10A
    The 3V's that define Big Data are _______________, _______________, and _______________.
    ACID stands for _______________, _______________, _______________, and _______________.
    BASE stands for _______________ _______________, _______________, and _______________ Consistency.
    _______________ data is data that has a pre-set format.
    Data in _______________ is data that is streaming.
    Your Name: ________________________
  • 41. Solution to Quiz 10A
    The 3V's that define Big Data are *Volume*, *Velocity*, and *Variety*.
    ACID stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*.
    BASE stands for *Basically* *Available*, *Soft*, and *Eventual* Consistency.
    *Structured* data is data that has a pre-set format.
    Data in *Motion* is data that is streaming.