
Hadoop

Uploaded by

vmo6q4di7

Instructor: Ma Yongliang (Ma Ge)

QQ: 113228115,1661815153
http://www.magedu.com
 ASF: hadoop
 MapReduce
 HDFS

 2002: Nutch, Lucene


 2008: Hadoop

2
1250MB, 1.2G, 10PB

3
10PB
 GFS: Google File System
 Distributed file system

GFS: distributed storage
MapReduce: distributed computing

4
 Nutch DFS: NDFS

 Hadoop
 Hadoop DFS = HDFS (GFS)
 MapReduce:
 MapReduce API
 MapReduce runtime environment
 MapReduce implementation

 Map, Reduce
 key/value
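
The Map and Reduce steps over key/value pairs can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not the Hadoop API; the function names `map_words`, `shuffle`, `reduce_counts`, and `word_count` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce: collapse all values for one key into a single total."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

print(word_count(["how are you", "you and tom"]))
```

In a real cluster the map calls run in parallel on different nodes and the grouping happens over the network; here everything runs in one process to keep the data flow visible.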

5
6
Hadoop cluster, 10G, 40
 map reduce
partitioner

7
 ETL
 Hadoop: Java
 HDFS:
 Bigtable: NoSQL
 HBase, HDFS
 Cloudera: Impala

 hive: SQL
 pig: yahoo

8
9
 50TB

tom 51 you 100


how 49 tom 10

10
3 billion, 50 billion
 Map reduce:
reduce
 Functional programming API
 Runtime framework

11
1G

12
13
20T

Scribe, facebook
flume

14
15
JobTracker

16
Slot:
JobTracker

key1

mapper

reducer1
mapper2

mapper
reducer2

mapper4

17
18
Cloud computing IaaS:
OpenStack
OpenNebula
PaaS
SaaS

19
Massive data, big data

PageRank

MapReduce: programming framework
GFS: DFS
2 billion records, 50TB

Map:
1.1.1.1 1
1.1.1.2 1
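
The map output shown on this slide (one `ip 1` pair per log record) can be sketched as follows. The log line layout, with the client IP as the first whitespace-separated field, is an assumption, and `map_ip` and `count_hits` are hypothetical names. Grouping is done by sorting the pairs and then walking runs of equal keys, which mirrors how sorted map output reaches a reducer.

```python
from itertools import groupby
from operator import itemgetter

def map_ip(log_line):
    """Map: emit (ip, 1), assuming the IP is the first field of the line."""
    ip = log_line.split()[0]
    return (ip, 1)

def count_hits(log_lines):
    """Sort the mapped pairs by key, then sum each run of equal keys."""
    pairs = sorted(map_ip(line) for line in log_lines if line.strip())
    return {
        ip: sum(value for _, value in group)
        for ip, group in groupby(pairs, key=itemgetter(0))
    }

logs = [
    "1.1.1.1 GET /index.html",
    "1.1.1.2 GET /index.html",
    "1.1.1.1 GET /about.html",
]
print(count_hits(logs))
```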

20
 Lucene
 Nutch (MapReduce + NDFS)
 Hadoop
 MapReduce
 HDFS

21
Hadoop
 Map
 mapper: map task
 Reduce
JobTracker
 reducer: reduce task

map map map


TaskTracker
map

22
shuffle and sort
 combiner  partitioner
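
A rough sketch of where the combiner and partitioner sit in shuffle and sort, written in plain Python rather than against the Hadoop API. The names `combine`, `partition`, and `shuffle_and_reduce` are hypothetical, and `zlib.crc32` is used only as a stable stand-in for a key hash; the idea shown is "hash of the key modulo the number of reducers".

```python
import zlib
from collections import Counter

def combine(pairs):
    """Combiner: sum the values per key within a single mapper's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def partition(key, num_reducers):
    """Partitioner: pick the reducer for a key (stable hash modulo reducer count)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle_and_reduce(mapper_outputs, num_reducers):
    """Combine per mapper, route pairs to reducers, then sort and sum per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in combine(pairs):
            buckets[partition(key, num_reducers)].append((key, value))
    results = []
    for bucket in buckets:
        totals = Counter()
        for key, value in sorted(bucket):  # the "sort" half of shuffle and sort
            totals[key] += value
        results.append(dict(totals))
    return results

mapper_outputs = [
    [("tom", 1), ("you", 1), ("tom", 1)],  # output of one mapper
    [("how", 1), ("tom", 1)],              # output of another mapper
]
print(shuffle_and_reduce(mapper_outputs, num_reducers=2))
```

The combiner reduces the number of pairs sent over the network, and the partitioner guarantees that every pair with the same key arrives at the same reducer.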

23
20TB
 map

key-value
key-value

na 1
na 1 na 1234567
na 0
na 1

24
 A.txt

25
1GB:64MB
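
The arithmetic behind "1GB:64MB" can be worked through in a few lines: a 1 GB file cut into 64 MB blocks yields 16 blocks. The helper `split_into_blocks` is a hypothetical name, and the sketch only computes offsets and lengths; it does not touch HDFS.

```python
MB = 1024 * 1024
GB = 1024 * MB

def split_into_blocks(file_size, block_size=64 * MB):
    """Return (offset, length) for each block; the last block may be shorter."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

blocks = split_into_blocks(1 * GB)
print(len(blocks))  # 16 blocks of 64 MB each
```

A file whose size is not a multiple of the block size simply ends with a shorter block, for example a 130 MB file gives two 64 MB blocks and one 2 MB block.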

26
27
 Hadoop
 MapReduce
 Programming framework
 Runtime environment
 HDFS

28
HDFS

29
Writing data to the HDFS file system

30
Reading data from HDFS

31
HDFS: initialization, trash (recycle bin)

jobtracker

tasktracker
A C F

32
 mapper, reducer
 Functional programming, Lisp
 map reduce
 API
 Runtime framework
 Implementation

33
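NameNode availability
 The simplest approach is to write multiple copies of the NameNode's persistent metadata to different storage devices in real time
 Provide a Secondary NameNode
 The Secondary NameNode does not actually act as a NameNode; its main task is to periodically merge the edit log into the namespace image file so that the edit log does not grow too large
 It runs on a separate physical host and needs as much memory as the NameNode to perform the merge
 It also keeps a copy of the namespace image
 However, because of how it works, the Secondary NameNode lags behind the primary, so some data loss is still unavoidable when the NameNode fails
 Hadoop 0.23 introduced a high-availability mechanism for the NameNode: two NameNodes run in an active/standby model, and when the active node fails, all of its services are immediately transferred to the standby node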
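
The periodic merge performed by the Secondary NameNode can be pictured as replaying the edit log on top of the last namespace image. The sketch below is a simplified model, not Hadoop code: the namespace is a plain dict from path to replication factor, the three operation names and the helpers `apply_edit` and `checkpoint` are hypothetical.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to the in-memory namespace."""
    op, path = edit[0], edit[1]
    if op == "create":
        namespace[path] = edit[2]          # edit = ("create", path, replication)
    elif op == "delete":
        namespace.pop(path, None)          # edit = ("delete", path)
    elif op == "rename":
        namespace[edit[2]] = namespace.pop(path)  # edit = ("rename", old, new)
    else:
        raise ValueError(f"unknown operation: {op}")
    return namespace

def checkpoint(fsimage, edit_log):
    """Merge the edit log into a copy of the image; return (new image, empty log)."""
    merged = dict(fsimage)
    for edit in edit_log:
        apply_edit(merged, edit)
    return merged, []

fsimage = {"/a": 3}
edit_log = [("create", "/b", 2), ("rename", "/a", "/c"), ("delete", "/b")]
new_image, new_log = checkpoint(fsimage, edit_log)
print(new_image, new_log)
```

Any edits written after the last checkpoint exist only on the NameNode, which is why the copy held by the Secondary NameNode lags behind.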
34
/dfs/imags/a /dfs/imags/b /dfs/imags/c

35
MapReduce

36
MapReduce data flow with a single reduce task

37
MapReduce data flow with multiple reduce tasks

38
A MapReduce job with no reducer

39
A MapReduce client submits a job

40
A B

41
42
MapReduce logical architecture

43
Partitioner and combiner

44
Partitioner and combiner

45
HDFS Client
Hive Pig Crunch

JobTracker

input split1 map


reduce part-r-00000

input split2 map

reduce part-r-00001
input split3 map
Avro (Protocol Buffer, Thrift)
sqoop

46
namenode

datanode datanode datanode datanode datanode

47
 Hive: ETL

Hive QL JDBC/ODBC Web interface

Thrift

Driver

JobTracker NameNode

48
 HDFS
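 Random access
 Bigtable: "big table", GFS, NoSQL (column-oriented database)
 HBase
 Random access
 Real-time access
 An open-source, distributed, multi-version, column-oriented storage system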

Hive Pig

MapReduce
zookeeper

HBase

HDFS
49
50
 proto buffer, thrift, avro

51
52
 Column family

tom M 32 alice Mars


First ave, mars
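
One way to picture the column-family layout on this slide is as a sparse, versioned map: row key, then column family, then column qualifier, then timestamped values. The sketch below is plain Python, not the HBase client API; the class `Table`, the family names `info` and `address`, and the qualifiers are hypothetical choices loosely based on the sample values above.

```python
from collections import defaultdict

class Table:
    """Toy column-family store: families are fixed, columns are created on write."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> {(family, qualifier): [(timestamp, value), ...] newest first}
        self.rows = defaultdict(dict)

    def put(self, row, family, qualifier, value, ts):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.rows[row].setdefault((family, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda version: version[0], reverse=True)

    def get(self, row, family, qualifier):
        """Return the newest value of a cell, or None if the cell was never written."""
        versions = self.rows.get(row, {}).get((family, qualifier))
        return versions[0][1] if versions else None

table = Table(["info", "address"])
table.put("tom", "info", "gender", "M", ts=1)
table.put("tom", "info", "age", 31, ts=1)
table.put("tom", "info", "age", 32, ts=2)      # a newer version of the same cell
table.put("alice", "address", "planet", "Mars", ts=1)
print(table.get("tom", "info", "age"), table.get("alice", "info", "age"))
```

Cells that were never written take no space at all, which is what makes the format sparse.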

53
 flume (ASF)
 chukwa (ASF)
 scribe (facebook)

54
 Hadoop
 MapReduce
 HDFS
 HBase

55
 RDB
 Row-oriented database
 BigTable, HBase (Data source, Data sink)
 Column-oriented database

56
Functional programming

57
58
59
MapReduce

60
61
62
63
64
combiner

65
JobTracker

NameNode

secondaryNameNode

DataNode
TaskTracker
2 map
2 reduce

66
HDFS

67
HDFS Read

68
HDFS Write

69
Hadoop

70
71
72
73
R,RHadoop

74
75
Hadoop

76
77
RPC, web_gui
 start-all

78
Cloudera, CDH

hive

79
NoSQL: sparse-format storage scheme
 CRUD:

10
Age: 30

80
 0.20.2
 JobTracker
 TaskTracker
 NameNode
 SecondaryNameNode
 DataNode

81
 Local mode
 Pseudo-distributed mode
 Fully distributed mode

82
/hadoop/temp
snn
NameNode
JobTracker

DataNode
TaskTracker

83
Hadoop data ingress and egress

84
Key elements of ingress and egress

 idempotence
 aggregation
 data format transformation
 recoverability
 correctness
 resource consumption and performance
 monitoring

85
Moving data into Hadoop
 Two primary methods that can be used for moving
data into Hadoop
 writing external data at the HDFS level (a data push)
 reading external data at the MapReduce level (more like
a pull).

86
Data sources
 Hadoop data ingress across a spectrum of data
sources
 log files
 semistructured or binary files
 Databases
 HBase

87
Pushing log files into Hadoop
 Flume, Chukwa, and Scribe are log collecting and
distribution frameworks that have the capability to
use HDFS as a data sink for that log data

88
FLUME
 A distributed system for collecting streaming data

89
CHUKWA
 An Apache subproject of Hadoop that also offers a
large-scale mechanism to collect and store data in
HDFS

90
SCRIBE
 A rudimentary streaming log distribution service,
developed and used heavily by Facebook

91
Pushing and pulling semistructured and binary files
 HDFS File Slurper
 A simple utility that supports copying files from a local directory into HDFS and vice versa

92
Scheduling regular ingress activities with Oozie

 Oozie is a server-based workflow engine specialized in running workflow jobs with actions that run Hadoop Map/Reduce and Pig jobs

93
 Oozie can be used to ingress data into HDFS, and can also be used to execute post-ingress activities such as launching a MapReduce job to process the ingressed data

94
Pulling data from databases
 Using Hadoop for data ingress, joining, and egress to OLAP
 Using Hadoop for OLAP and feedback to OLTP systems

95
 MapReduce contains the DBInputFormat and DBOutputFormat classes, which can be used to read data from and write data to databases via JDBC
 Avro
 A language-neutral data serialization system
 To address the major downside of Hadoop Writables:
lack of language portability
 Apache Thrift
 Google’s Protocol Buffers
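
The idea behind reading a database table from MapReduce, mentioned in the DBInputFormat bullet above, is that each map task reads its own slice of the table. The following is a sketch of that idea only, not of the Hadoop classes: Python's built-in `sqlite3` stands in for a JDBC connection, the `users` table and its columns are hypothetical, and slicing by `LIMIT`/`OFFSET` is an assumption made for this illustration.

```python
import sqlite3

def make_splits(total_rows, num_splits):
    """Divide total_rows into contiguous (offset, limit) ranges, one per map task."""
    if total_rows == 0:
        return []
    chunk = -(-total_rows // num_splits)  # ceiling division
    return [
        (start, min(chunk, total_rows - start))
        for start in range(0, total_rows, chunk)
    ]

def read_split(conn, offset, limit):
    """Read one slice of the table; a stable ORDER BY keeps slices disjoint."""
    cursor = conn.execute(
        "SELECT id, name FROM users ORDER BY id LIMIT ? OFFSET ?",
        (limit, offset),
    )
    return cursor.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (id, name) VALUES (?, ?)",
    [(i, f"user{i}") for i in range(1, 11)],
)

total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
for offset, limit in make_splits(total, num_splits=3):
    print(offset, limit, read_split(conn, offset, limit))
```

Each slice could then be handed to a separate map task; for large tables a bulk-transfer tool is usually a better fit than many offset queries.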

96
Avro container file format

97
Comparing SequenceFiles, Protocol
Buffers, Thrift, and Avro

98
Using Sqoop to import data from MySQL
 A tool designed for efficiently transferring bulk data
between Apache Hadoop and structured datastores
such as relational databases

99
Sqoop import overview

100
Moving data out of Hadoop
 Sqoop
 HBase

101
102
About Magedu (Ma Ge Education)
 Blog: http://mageedu.blog.51cto.com
 Home page: http://www.magedu.com
 QQ: 1661815153, 113228115
 QQ groups: 203585050, 279599283

103
Thank You!
