Hadoop
QQ: 113228115,1661815153
http://www.magedu.com
ASF: Hadoop
MapReduce
HDFS
1250MB, 1.2G, 10PB
10PB
GFS: Google File System
A distributed file system
GFS: distributed storage
MapReduce: distributed computing
Nutch DFS: NDFS
Hadoop
Hadoop DFS = HDFS (GFS)
MapReduce:
MapReduce API
MapReduce runtime environment
MapReduce implementation
Map, Reduce
key/value
Hadoop cluster, 10G, 40
map, reduce
partitioner
ETL
Hadoop: Java
HDFS:
Bigtable: NoSQL
HBase, HDFS
Cloudera: Impala
Hive: SQL
Pig: Yahoo
50TB
3 billion, 50 billion
MapReduce:
reduce
functional programming API
runtime framework
1G
20T
Scribe (Facebook)
Flume
JobTracker
Slot:
JobTracker
(Diagram: map outputs, keyed by values such as key1, are routed from the mappers to reducer1 and reducer2.)
Cloud computing IaaS:
OpenStack
OpenNebula
PaaS
SaaS
Massive data, big data
PageRank
MapReduce: programming framework
GFS: DFS
2 billion records, 50TB
Map:
1.1.1.1 1
1.1.1.2 1
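As a sketch of the map side, here is a minimal Hadoop mapper in Java that emits one (ip, 1) pair per log line, as above. The log layout (IP as the first whitespace-separated field) and the class name are assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IpCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text ip = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumed log layout: the client IP is the first field of each line.
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            ip.set(fields[0]);
            context.write(ip, ONE);   // e.g. (1.1.1.1, 1)
        }
    }
}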
Lucene
Nutch (MapReduce + NDFS)
Hadoop
MapReduce
HDFS
Hadoop
Map
mapper: map task
Reduce
JobTracker
reducer: reduce task
shuffle and sort
combiner, partitioner
20TB
map
key-value pairs, e.g.:
na 1
na 1
na 1
reduce aggregates per key, e.g.: na 1234567
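The matching reduce side, again as an illustrative sketch: after shuffle and sort the framework groups the pairs by key, and the reducer sums them, turning many (na, 1) pairs into a single (na, 1234567):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // Sum all the 1s the mappers emitted for this key.
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        total.set(sum);
        context.write(key, total);   // e.g. (na, 1234567)
    }
}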
A.txt
1GB file, 64MB blocks: 16 blocks
Hadoop
MapReduce
programming framework
runtime environment
HDFS
HDFS
Saving data to the HDFS file system
Reading data from HDFS
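A minimal write-then-read sketch using Hadoop's FileSystem API; the NameNode URI and file path are placeholder assumptions:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000");   // placeholder URI
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/tmp/demo.txt");                 // placeholder path

        // Write: the client asks the NameNode for target DataNodes,
        // then streams the data to them block by block.
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Read: the client gets block locations from the NameNode,
        // then reads the blocks directly from the DataNodes.
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());
        in.close();
        fs.close();
    }
}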
HDFS: initialization (formatting), recycle bin (trash)
JobTracker
TaskTracker
(Diagram: data blocks A, C, F)
mapper, reducer
functional programming, Lisp
map, reduce
API
runtime framework
implementation
NameNode availability
The simplest approach is to store multiple replicas of the NameNode's persistent metadata, in real time, on different storage devices.
Provide a Secondary NameNode.
The Secondary NameNode does not truly act as a NameNode; its main task is to periodically merge the edit log into the namespace image file so that the edit log does not grow too large.
It runs on a separate physical host and needs as much memory as the NameNode to perform the merge.
It also keeps a copy of the namespace image.
However, by the nature of this mechanism, the Secondary NameNode lags behind the primary, so if the NameNode fails, some data loss is still unavoidable.
Hadoop 0.23 introduced a high-availability mechanism for the NameNode: two NameNodes work in an active/standby model, and when the active one fails, all of its services immediately fail over to the standby.
Example: multiple replicated metadata directories (likely dfs.name.dir entries): /dfs/imags/a, /dfs/imags/b, /dfs/imags/c
MapReduce
MapReduce data flow with a single reduce task
MapReduce data flow with multiple reduce tasks
A MapReduce job with no reducer
A MapReduce client submits a job
MapReduce logical architecture
partitioner and combiner
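A sketch of where the partitioner plugs in: it decides which reduce task receives each map-output key. This example just mirrors the default hash partitioning; the class name and key/value types are illustrative:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Same key -> same reducer; the mask keeps the index non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

A job installs it with job.setPartitionerClass(KeyHashPartitioner.class). A combiner is set with job.setCombinerClass(...), typically reusing the reducer when the aggregation (such as summing) is associative and commutative; it cuts the data volume shuffled between map and reduce.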
(Diagram: Hadoop ecosystem: HDFS client; Hive, Pig, Crunch; JobTracker; input splits feeding map tasks; reduce tasks writing part-r-00001 outputs; Avro, Protocol Buffers, Thrift; Sqoop)
NameNode
Hive: ETL
Thrift
driver
JobTracker, NameNode
HDFS
random access
Bigtable: big table, GFS, NoSQL (column-oriented database)
HBase
random access
real-time access
An open-source, distributed, multi-version, column-oriented storage system
Hive, Pig
MapReduce
ZooKeeper
HBase
HDFS
Protocol Buffers, Thrift, Avro
Column Family
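A minimal HBase client sketch in Java (0.20-era API; the table, row key, and column names are invented for illustration), showing that a cell is addressed by row key, column family, and qualifier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseCellDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "users");   // hypothetical table

        // Write one cell: row "10", column family "info", qualifier "age".
        Put put = new Put(Bytes.toBytes("10"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("age"), Bytes.toBytes("30"));
        table.put(put);

        // Read the same cell back.
        Get get = new Get(Bytes.toBytes("10"));
        Result result = table.get(get);
        byte[] age = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("age"));
        System.out.println("Age: " + Bytes.toString(age));

        table.close();
    }
}

Only cells that are actually written consume storage, which is what makes this layout sparse.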
Flume (ASF)
Chukwa (ASF)
Scribe (Facebook)
Hadoop
MapReduce
HDFS
HBase
RDB
row-oriented database
BigTable, HBase (Data source, Data sink)
column-oriented database
Functional programming
MapReduce
combiner
JobTracker
NameNode
SecondaryNameNode
DataNode
TaskTracker
2 map slots
2 reduce slots
HDFS
HDFS Read
HDFS Write
Hadoop
R, RHadoop
Hadoop
RPC, web GUI
start-all.sh
Cloudera, CDH
Hive
NoSQL: a sparse-format storage scheme
CRUD:
(Example: row 10, Age: 30)
Hadoop 0.20.2
JobTracker
TaskTracker
NameNode
SecondaryNameNode
DataNode
Local mode
Pseudo-distributed mode
Fully distributed mode
/hadoop/temp
snn (Secondary NameNode)
NameNode
JobTracker
DataNode
TaskTracker
Hadoop data ingress and egress
Key elements of ingress and egress
idempotent
aggregation
data format transformation
recoverability
correctness
resource consumption and performance
monitoring
Moving data into Hadoop
Two primary methods that can be used for moving data into Hadoop:
writing external data at the HDFS level (a data push)
reading external data at the MapReduce level (more like a pull)
Data sources
Hadoop data ingress across a spectrum of data sources:
log files
semistructured or binary files
databases
HBase
Pushing log files into Hadoop
Flume, Chukwa, and Scribe are log collection and distribution frameworks that can use HDFS as a data sink for log data.
FLUME
A distributed system for collecting streaming data
CHUKWA
An Apache subproject of Hadoop that also offers a large-scale mechanism to collect and store data in HDFS
SCRIBE
A rudimentary streaming log distribution service, developed and used heavily by Facebook
Pushing and pulling semistructured and binary files
Scheduling regular ingress activities with Oozie
Oozie can be used to ingress data into HDFS, and can also be used to execute post-ingress activities such as launching a MapReduce job to process the ingested data.
Pulling data from databases
Using Hadoop for data ingress, joining, and egress to OLAP
MapReduce contains DBInputFormat and DBOutputFormat classes, which can be used to read data from and write data to databases via JDBC.
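A sketch of the read side with DBInputFormat; the JDBC driver, URL, credentials, table, and column names are illustrative assumptions:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

public class DbIngressJob {
    // Maps one row of a hypothetical "orders" table (columns id, amount).
    public static class OrderRecord implements Writable, DBWritable {
        long id;
        double amount;
        public void readFields(ResultSet rs) throws SQLException {
            id = rs.getLong("id");
            amount = rs.getDouble("amount");
        }
        public void write(PreparedStatement ps) throws SQLException {
            ps.setLong(1, id);
            ps.setDouble(2, amount);
        }
        public void readFields(DataInput in) throws IOException {
            id = in.readLong();
            amount = in.readDouble();
        }
        public void write(DataOutput out) throws IOException {
            out.writeLong(id);
            out.writeDouble(amount);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder JDBC driver, URL, and credentials.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/sales", "user", "password");
        Job job = new Job(conf, "db-ingress");
        job.setInputFormatClass(DBInputFormat.class);
        DBInputFormat.setInput(job, OrderRecord.class, "orders",
                null, "id", "id", "amount");   // conditions, orderBy, fields
        // ... set the mapper, output types, and output path as usual, then submit.
    }
}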
Avro
A language-neutral data serialization system
To address the major downside of Hadoop Writables: lack of language portability (see the sketch after this list)
Apache Thrift
Google’s Protocol Buffers
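A small sketch using Avro's generic API in Java, with an invented two-field schema, to show the language-neutral round trip: any language with an Avro library could read these bytes back using the same schema:

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class AvroDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema: a user record with a name and an age.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":[" +
            "{\"name\":\"name\",\"type\":\"string\"}," +
            "{\"name\":\"age\",\"type\":\"int\"}]}");

        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "na");
        user.put("age", 30);

        // Serialize: no generated code is needed, only the schema.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(user, encoder);
        encoder.flush();

        // Deserialize with the same schema.
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(out.toByteArray(), null);
        GenericRecord decoded = new GenericDatumReader<GenericRecord>(schema).read(null, decoder);
        System.out.println(decoded.get("name") + " / " + decoded.get("age"));
    }
}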
Avro container file format
Comparing SequenceFiles, Protocol Buffers, Thrift, and Avro
Using Sqoop to import data from MySQL
A tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
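A typical import invocation looks like the following (host, database, table, and credentials are placeholders); Sqoop generates a MapReduce job whose map tasks pull disjoint slices of the table in parallel:

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username user --password password \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4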
Sqoop import overview
Moving data out of Hadoop
Sqoop
HBase
About 马哥教育 (MaGe Education)
Blog: http://mageedu.blog.51cto.com
Homepage: http://www.magedu.com
QQ: 1661815153, 113228115
QQ groups: 203585050, 279599283
Thank You!