This document introduces Hadoop and its components. It discusses the motivation for Hadoop in dealing with big data that exceeds single computer capabilities. It describes the Hadoop architecture including HDFS for reliable storage across clusters and MapReduce for distributed processing of large datasets in parallel. It provides an example of how MapReduce and HDFS can be used to solve a word counting problem on big data in a distributed manner.

SQL on Hadoop - Analyzing Big Data with Hive
Ahmad Alkilani
www.pluralsight.com

Introduction to Hadoop
Outline

 Why Hadoop? Motivation
 Hadoop architecture and distributed computing
 HDFS
 MapReduce
 Getting up and running
Motivation for Hadoop

[Diagram: a single machine with its CPU, memory, and disk]

Google
 ~40 billion web pages x 30 KB each ≈ a petabyte of data
 Today’s average disk reads about 120 MB/sec
 A little over 3 months just to read the web
 Approximately 1,000 drives to store and use it
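The slide's arithmetic can be checked in a few lines. This is a back-of-envelope sketch; the page count, page size, and disk speed are the slide's rough figures, not measurements.

```python
pages = 40e9            # ~40 billion web pages
page_size = 30e3        # ~30 KB each, in bytes
disk_speed = 120e6      # ~120 MB/sec sequential read

total_bytes = pages * page_size          # 1.2e15 bytes, about a petabyte
seconds = total_bytes / disk_speed       # time for ONE drive to read it all
months = seconds / (60 * 60 * 24 * 30)

print(f"{total_bytes / 1e15:.1f} PB, ~{months:.1f} months to read")
```

A single drive would need nearly four months of continuous sequential reads, which is exactly why the work has to be spread across many machines.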
Distributed Computing Challenges

 Scale out with distributed computing
 Hadoop based on Google’s implementation
 Volume, Velocity, and Variety
 Recover from failures
 Shared-nothing architecture
 Hadoop file system (HDFS)
 MapReduce

[Diagram: a Name Node and Job Tracker coordinating racks of Data Nodes, each with its own CPU and disk]
Hadoop File System (HDFS)

[Diagram: a file split into 64 MB blocks, replicated across Data Nodes in Server Rack A and Server Rack B]
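The block mechanics above are easy to sketch: HDFS splits a file into fixed-size blocks (64 MB in classic Hadoop) and stores the remainder as a shorter tail block. The helper below is an illustrative Python sketch, not HDFS code.

```python
def hdfs_blocks(file_size_bytes, block_size=64 * 1024 * 1024):
    """Return the sizes of the blocks a file of the given size splits into."""
    full, last = divmod(file_size_bytes, block_size)
    return [block_size] * full + ([last] if last else [])

# A 200 MB file becomes three full 64 MB blocks plus one 8 MB tail block.
sizes = hdfs_blocks(200 * 1024 * 1024)
print(len(sizes), sizes[-1] // (1024 * 1024))  # → 4 8
```

Each of those blocks is then replicated (three copies by default) across Data Nodes on different racks, so losing one server or even one rack loses no data.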
MapReduce

 One mapper per block
 Parallel, distributed processing given a file split into blocks across multiple servers

[Diagram: each Data Node runs a Map task over its local block of data, emitting key-value pairs (e.g. 5→Value, 9→Value, 2→Value); Shuffle and Sort then routes each key to Reducer A or Reducer B, and the reducers write their output to a folder in HDFS]
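How does Shuffle and Sort decide which reducer gets which key? Typically by hashing the key modulo the number of reducers. The sketch below is a Python analogue of that idea, mirroring the behavior of Hadoop's default hash partitioner rather than reproducing its Java code.

```python
def partition(key, num_reducers):
    # Hadoop's default partitioner hashes the key and takes it modulo the
    # reducer count; Python's built-in hash() stands in for hashCode() here.
    return hash(key) % num_reducers

# Keys emitted by the mappers in the diagram, routed to two reducers.
mapper_keys = [5, 9, 2, 3, 7]
assignments = {key: partition(key, 2) for key in mapper_keys}
print(assignments)
```

The crucial property is determinism: every occurrence of the same key, no matter which mapper emitted it, lands on the same reducer, so that reducer sees all of the key's values together after the sort.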
Word Count Example

Input (key = byte offset, value = line of text):

  Byte offset | This is the first line
  Byte offset | This is the second line

Mapper output (one key-value pair per word):

  Mapper 1: This 1, is 1, the 1, first 1, line 1
  Mapper 2: This 1, is 1, the 1, second 1, line 1

After Shuffle and Sort (all pairs for a given key grouped on one reducer):

  Reducer A: This 1, This 1, the 1, the 1, second 1, first 1
  Reducer B: line 1, line 1, is 1, is 1

Reducer output:

  Reducer A: first 1, second 1, the 2, This 2
  Reducer B: is 2, line 2
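The whole flow on this slide can be simulated locally in a few lines. This is a single-process Python sketch of the map, shuffle-and-sort, and reduce phases, not Hadoop itself.

```python
from collections import defaultdict

lines = ["This is the first line", "This is the second line"]

# Map: emit (word, 1) for every word in every input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle and sort: group all values for the same key together.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(values) for word, values in sorted(grouped.items())}
print(counts)
# {'This': 2, 'first': 1, 'is': 2, 'line': 2, 'second': 1, 'the': 2}
```

Note that "This" and "the" stay separate counts, just as on the slide: the mappers emit literal tokens, so case folding would have to happen in the map phase if it were wanted.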
Basic commands using HDFS

Hadoop Demo
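The demo exercises the standard `hadoop fs` shell. A few of its most common commands, for reference (the paths shown are illustrative, and running them requires a Hadoop installation):

```
hadoop fs -ls /                         # list the root of HDFS
hadoop fs -mkdir /user/demo             # create a directory
hadoop fs -put words.txt /user/demo/    # copy a local file into HDFS
hadoop fs -cat /user/demo/words.txt     # print a file's contents
hadoop fs -get /user/demo/words.txt .   # copy a file back to local disk
hadoop fs -rm /user/demo/words.txt      # delete a file
```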
Environment Setup

 Course focus is on development
 Use a Virtual Machine image to follow along with the examples
 Pseudo-distributed sandbox
 Replication factor set to 1
 Name Node, Job Tracker, Data Node, and Task Tracker on a single machine
 Demos use Hortonworks’ HDP sandbox
 Hive 0.10, 0.11 and above
Summary

 Distributed computing and scaling out to solve big data problems
 Key system characteristics
 Built to handle failures
 Move processing to the data
 Failures are inevitable; embracing this allows for solutions built on commodity servers
 MapReduce
 Mapper assigned to each block of data
 Key-value pairs are both the input to and output of each phase
 Keys must implement the WritableComparable interface
 Shuffle and Sort plays a key role in solving the problem