Hadoop
QQ: 113228115,1661815153
http://www.magedu.com
ASF: Hadoop
MapReduce
HDFS
1250MB, 1.2G, 10PB
10PB
GFS: Google File System
A distributed file system
GFS: distributed storage
MapReduce: distributed computing
Nutch DFS: NDFS
Hadoop
Hadoop DFS = HDFS (GFS)
MapReduce:
MapReduce API
MapReduce runtime environment
MapReduce implementation
Map, Reduce
key/value
Hadoop cluster, 10G, 40
map, reduce
partitioner
ETL
Hadoop: Java
HDFS:
Bigtable: NoSQL
HBase, HDFS
Cloudera: Impala
Hive: SQL
Pig: Yahoo
50TB
3 billion, 50 billion
MapReduce:
reduce
functional programming API
runtime framework
1G
20T
Scribe (Facebook)
Flume
JobTracker
Slot:
JobTracker
(Diagram: map outputs, keyed by values such as key1, are routed from the mappers to reducer1 and reducer2.)
Cloud computing IaaS:
OpenStack
OpenNebula
PaaS
SaaS
Massive data, big data
PageRank
MapReduce: programming framework
GFS: DFS
2 billion records, 50TB
Map:
1.1.1.1 1
1.1.1.2 1
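As a sketch of the map side, here is a minimal Hadoop mapper in Java that emits one (ip, 1) pair per log line, as above. The log layout (IP as the first whitespace-separated field) and the class name are assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IpCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text ip = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed log layout: the client IP is the first field of each line.
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            ip.set(fields[0]);
            context.write(ip, ONE);   // e.g. (1.1.1.1, 1)
        }
    }
}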
Lucene
Nutch (MapReduce + NDFS)
Hadoop
MapReduce
HDFS
Hadoop
Map
mapper: map task
Reduce
JobTracker
reducer: reduce task
shuffle and sort
combiner, partitioner
20TB
map
key-value pairs, e.g.:
na 1
na 1
na 1
reduce aggregates per key, e.g.: na 1234567
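The matching reduce side, again as an illustrative sketch: after shuffle and sort the framework groups the pairs by key, and the reducer sums them, turning many (na, 1) pairs into a single (na, 1234567):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s the mappers emitted for this key.
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        total.set(sum);
        context.write(key, total);   // e.g. (na, 1234567)
    }
}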
A.txt
1GB file, 64MB blocks: 16 blocks
Hadoop
MapReduce
programming framework
runtime environment
HDFS
HDFS
Saving data to the HDFS file system
Reading data from HDFS
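A minimal write-then-read sketch using Hadoop's FileSystem API; the NameNode URI and file path are placeholder assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");   // placeholder URI
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/demo.txt");                 // placeholder path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the data to them block by block.
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Read: the client gets block locations from the NameNode,
        // then reads the blocks directly from the DataNodes.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
        fs.close();
    }
}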
HDFS: initialization (formatting), recycle bin (trash)
JobTracker
TaskTracker
(Diagram: data blocks A, C, F)
mapper, reducer
functional programming, Lisp
map, reduce
API
runtime framework
implementation
NameNode availability
The simplest approach is to store multiple replicas of the NameNode's persistent metadata, in real time, on different storage devices.
Provide a Secondary NameNode.
The Secondary NameNode does not truly act as a NameNode; its main task is to periodically merge the edit log into the namespace image file so that the edit log does not grow too large.
It runs on a separate physical host and needs as much memory as the NameNode to perform the merge.
It also keeps a copy of the namespace image.
However, by the nature of this mechanism, the Secondary NameNode lags behind the primary, so if the NameNode fails, some data loss is still unavoidable.
Hadoop 0.23 introduced a high-availability mechanism for the NameNode: two NameNodes work in an active/standby model, and when the active one fails, all of its services immediately fail over to the standby.
Example: multiple replicated metadata directories (likely dfs.name.dir entries): /dfs/imags/a, /dfs/imags/b, /dfs/imags/c
MapReduce
MapReduce data flow with a single reduce task
MapReduce data flow with multiple reduce tasks
A MapReduce job with no reducer
A MapReduce client submits a job
MapReduce logical architecture
partitioner and combiner
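A sketch of where the partitioner plugs in: it decides which reduce task receives each map-output key. This example just mirrors the default hash partitioning; the class name and key/value types are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Same key -> same reducer; the mask keeps the index non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A job installs it with job.setPartitionerClass(KeyHashPartitioner.class). A combiner is set with job.setCombinerClass(...), typically reusing the reducer when the aggregation (such as summing) is associative and commutative; it cuts the data volume shuffled between map and reduce.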
(Diagram: Hadoop ecosystem: HDFS client; Hive, Pig, Crunch; JobTracker; input splits feeding map tasks; reduce tasks writing part-r-00001 outputs; Avro, Protocol Buffers, Thrift; Sqoop)
NameNode
Hive: ETL
Thrift
driver
JobTracker, NameNode
HDFS
random access
Bigtable: big table, GFS, NoSQL (column-oriented database)
HBase
random access
real-time access
An open-source, distributed, multi-version, column-oriented storage system
Hive, Pig
MapReduce
ZooKeeper
HBase
HDFS
Protocol Buffers, Thrift, Avro
Column Family
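A minimal HBase client sketch in Java (0.20-era API; the table, row key, and column names are invented for illustration), showing that a cell is addressed by row key, column family, and qualifier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // hypothetical table

        // Write one cell: row "10", column family "info", qualifier "age".
        Put put = new Put(Bytes.toBytes("10"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("30"));
        table.put(put);

        // Read the same cell back.
        Get get = new Get(Bytes.toBytes("10"));
        Result result = table.get(get);
        byte[] age = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("age"));
        System.out.println("Age: " + Bytes.toString(age));

        table.close();
    }
}

Only cells that are actually written consume storage, which is what makes this layout sparse.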
Flume (ASF)
Chukwa (ASF)
Scribe (Facebook)
Hadoop
MapReduce
HDFS
HBase
RDB
row-oriented database
BigTable, HBase (Data source, Data sink)
column-oriented database
Functional programming
MapReduce
combiner
JobTracker
NameNode
SecondaryNameNode
DataNode
TaskTracker
2 map slots
2 reduce slots
HDFS
HDFS Read
HDFS Write
Hadoop
R, RHadoop
Hadoop
RPC, web GUI
start-all.sh
Cloudera, CDH
Hive
NoSQL: a sparse-format storage scheme
CRUD:
(Example: row 10, Age: 30)
Hadoop 0.20.2
JobTracker
TaskTracker
NameNode
SecondaryNameNode
DataNode
Local mode
Pseudo-distributed mode
Fully distributed mode
/hadoop/temp
snn (Secondary NameNode)
NameNode
JobTracker
DataNode
TaskTracker
Hadoop data ingress and egress
Key elements of ingress and egress
idempotent
aggregation
data format transformation
recoverability
correctness
resource consumption and performance
monitoring
Moving data into Hadoop
Two primary methods that can be used for moving data into Hadoop:
writing external data at the HDFS level (a data push)
reading external data at the MapReduce level (more like a pull)
Data sources
Hadoop data ingress across a spectrum of data sources:
log files
semistructured or binary files
databases
HBase
Pushing log files into Hadoop
Flume, Chukwa, and Scribe are log collection and distribution frameworks that can use HDFS as a data sink for log data.
FLUME
A distributed system for collecting streaming data
CHUKWA
An Apache subproject of Hadoop that also offers a large-scale mechanism to collect and store data in HDFS
SCRIBE
A rudimentary streaming log distribution service, developed and used heavily by Facebook
Pushing and pulling semistructured and binary files
Scheduling regular ingress activities with Oozie
Oozie can be used to ingress data into HDFS, and can also be used to execute post-ingress activities such as launching a MapReduce job to process the ingested data.
Pulling data from databases
Using Hadoop for data ingress, joining, and egress to OLAP
MapReduce contains DBInputFormat and DBOutputFormat classes, which can be used to read data from and write data to databases via JDBC.
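A sketch of the read side with DBInputFormat; the JDBC driver, URL, credentials, table, and column names are illustrative assumptions:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbIngressJob {
    // Maps one row of a hypothetical "orders" table (columns id, amount).
    public static class OrderRecord implements Writable, DBWritable {
        long id;
        double amount;
        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            amount = rs.getDouble("amount");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setDouble(2, amount);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            amount = in.readDouble();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeDouble(amount);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder JDBC driver, URL, and credentials.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/sales", "user", "password");
        Job job = new Job(conf, "db-ingress");
        job.setInputFormatClass(DBInputFormat.class);
        DBInputFormat.setInput(job, OrderRecord.class, "orders",
                null, "id", "id", "amount");   // conditions, orderBy, fields
        // ... set the mapper, output types, and output path as usual, then submit.
    }
}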
Avro
A language-neutral data serialization system
To address the major downside of Hadoop Writables: lack of language portability (see the sketch after this list)
Apache Thrift
Google’s Protocol Buffers
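A small sketch using Avro's generic API in Java, with an invented two-field schema, to show the language-neutral round trip: any language with an Avro library could read these bytes back using the same schema:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a user record with a name and an age.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
            "{\"name\":\"name\",\"type\":\"string\"}," +
            "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "na");
        user.put("age", 30);

        // Serialize: no generated code is needed, only the schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Deserialize with the same schema.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded.get("name") + " / " + decoded.get("age"));
    }
}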
Avro container file format
Comparing SequenceFiles, Protocol Buffers, Thrift, and Avro
Using Sqoop to import data from MySQL
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
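A typical import invocation looks like the following (host, database, table, and credentials are placeholders); Sqoop generates a MapReduce job whose map tasks pull disjoint slices of the table in parallel:

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username user --password password \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4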
Sqoop import overview
Moving data out of Hadoop
Sqoop
HBase
About 马哥教育 (MaGe Education)
Blog: http://mageedu.blog.51cto.com
Homepage: http://www.magedu.com
QQ: 1661815153, 113228115
QQ groups: 203585050, 279599283
Thank You!