Cloud Computing Era Practice
Phoenix Liau
Trend Micro
Cloud Computing
Big Data
Mobile
Cloud computing as defined by NIST:
Essential Characteristics
Service Models
Deployment Models
Scalable and elastic IT capabilities delivered as a service over the Internet
Enterprise Data Warehouse
Cloud Computing
SaaS
PaaS
IaaS
Generate Big Data
Lead to Business Insights
Create Competition, Innovation, Productivity
What is Big Data?
A set of files
A database
A single file
Data transfer rate: 75 MB/sec
Time taken to transfer 100 GB of data to the processor: approx. 22 minutes!
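The 22-minute figure follows directly from the two numbers above. A quick check, assuming decimal units (1 GB = 1000 MB) and a single sequential reader:

```latex
t = \frac{100\ \text{GB}}{75\ \text{MB/s}}
  = \frac{100{,}000\ \text{MB}}{75\ \text{MB/s}}
  \approx 1333\ \text{s}
  \approx 22\ \text{minutes}
```

With binary units (1 GB = 1024 MB) the result is about 1365 s, which still rounds to roughly 22 to 23 minutes.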
When to use?
Enterprise Database: Multi-step Transactions; Lots of Inserts/Updates/Deletes
Hadoop: Affordable Storage/Compute; Unstructured or Semi-structured
Apache Hadoop project
Inspired by Google's MapReduce and Google File System papers.
Hadoop Core
MapReduce
HDFS
HDFS
Hadoop Distributed File System
Redundancy
Fault Tolerant
Scalable
Self Healing
Write Once, Read Many Times
Java API
Command Line Tool
MapReduce
Two Phases of Functional Programming
Redundancy
Fault Tolerant
Scalable
Self Healing
Java API
Hadoop Core
MapReduce (Java)
HDFS (Java)
Map input: Key: offset, Value: line
Map output: Key: word, Value: count
Reduce output: Key: word, Value: sum of count
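The three key/value stages above can be sketched without a cluster. The following is a minimal in-memory illustration in plain Java, not Hadoop code: the class name MapReduceSketch and its methods are hypothetical, and the grouping step stands in for the shuffle that the real framework performs between map and reduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process sketch of the map -> shuffle -> reduce flow.
public class MapReduceSketch {

    // Map phase: (offset, line) -> list of (word, 1) pairs.
    public static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Reduce phase: (word, [counts]) -> sum of counts.
    public static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: runs map on every line, groups values by key, then reduces each group.
    public static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        long offset = 0;
        for (String line : lines) {
            for (Map.Entry<String, Integer> kv : map(offset, line)) {
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
            // Offset of the next line, assuming '\n' line endings and single-byte characters.
            offset += line.length() + 1;
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(
                "Hello World Bye World",
                "Hello Hadoop Goodbye Hadoop")));
    }
}
```

The map function ignores its offset key, which mirrors the usual word-count setup: the offset only identifies where the line came from.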
Relation Map
Hue (Web Console)
Mahout (Data Mining)
Oozie (Job Workflow & Scheduling)
ZooKeeper (Coordination)
Sqoop/Flume (Data integration)
MapReduce Runtime (Dist. Programming Framework)
HBase (Column NoSQL DB)
What is ZooKeeper?
A centralized service for
Maintaining configuration information
Providing distributed synchronization
Flume Architecture
Log, Log, ... -> Flume Node, Flume Node -> HDFS
Sqoop
Easy, parallel database import/export
What do you want to do?
Import data from an RDBMS into HDFS
Export data from HDFS back into an RDBMS
HDFS <-> Sqoop <-> RDBMS
Sqoop Examples
$ sqoop import --connect jdbc:mysql://localhost/world --username root --table City
...
$ hadoop fs -cat City/part-m-00000
1,Kabul,AFG,Kabol,1780000
2,Qandahar,AFG,Qandahar,237500
3,Herat,AFG,Herat,186800
4,Mazar-e-Sharif,AFG,Balkh,127800
5,Amsterdam,NLD,Noord-Holland,731200
...
Hive
Developed by Facebook
What is Hive?
An SQL-like interface to Hadoop
SQL -> Hive -> MapReduce
Pig
Initiated by Yahoo!
A = load 'a.txt' as (id, name, age, ...);
B = load 'b.txt' as (id, address, ...);
C = JOIN A BY id, B BY id;
STORE C into 'c.txt';
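To make the JOIN in the script above concrete, here is a minimal sketch of what an inner join on id computes, written as a plain-Java hash join. It is not how Pig executes the join on a cluster; the class JoinSketch, the method joinById, and the sample rows are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory illustration of JOIN A BY id, B BY id.
public class JoinSketch {

    // Inner join of two relations on their first field (the id).
    // Each row is a String[] whose element 0 is the id.
    public static List<String[]> joinById(List<String[]> a, List<String[]> b) {
        // Build a hash index over B: id -> rows of B with that id.
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] rowB : b) {
            index.computeIfAbsent(rowB[0], k -> new ArrayList<>()).add(rowB);
        }
        // Probe the index with every row of A and concatenate matching rows.
        List<String[]> joined = new ArrayList<>();
        for (String[] rowA : a) {
            for (String[] rowB : index.getOrDefault(rowA[0], new ArrayList<>())) {
                String[] row = new String[rowA.length + rowB.length];
                System.arraycopy(rowA, 0, row, 0, rowA.length);
                System.arraycopy(rowB, 0, row, rowA.length, rowB.length);
                joined.add(row);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for a.txt and b.txt.
        List<String[]> a = Arrays.asList(
                new String[] {"1", "alice", "30"},
                new String[] {"2", "bob", "25"});
        List<String[]> b = Arrays.asList(
                new String[] {"1", "Taipei"},
                new String[] {"3", "Tokyo"});
        for (String[] row : joinById(a, b)) {
            System.out.println(String.join(",", row));
        }
    }
}
```

As in Pig, each joined row keeps the fields of both inputs, so the id appears twice. Rows without a partner on the other side are dropped.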
Pig Script -> Pig -> MapReduce
Hive vs. Pig
Language: Hive uses HiveQL (SQL-like)
Schema: Hive uses table definitions that are stored in a metastore
Programmatic access: Pig offers PigServer
WordCount Example
Input
Hello World Bye World
Hello Hadoop Goodbye Hadoop
The reduce just sums up the values:
<Bye, 1>
<Goodbye, 1>
<Hadoop, 2>
<Hello, 2>
<World, 2>
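import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {
    // Mapper: emits <word, 1> for every token of each input line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "wordcount");
        job.setJarByClass(WordCount.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.waitForCompletion(true);
    }
}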
Hive (SQL) and Pig (Script)
MapReduce (Java)
HDFS (Java)
Sqoop (RDBMS) and Flume (POSIX FS)
Structured-data vs Raw-data
Inspired by Google's Bigtable
Coordinated by Zookeeper
Low Latency
Random Reads And Writes
Distributed Key/Value Store
Simple API
PUT
GET
DELETE
SCAN
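The four operations above can be illustrated with a sorted in-memory map. This is only a single-machine sketch of the key/value model, not the HBase client API: the class SortedKeyValueSketch and the sample row keys are hypothetical. It relies on one fact about HBase, that rows are kept sorted by row key, so a scan is a range read with an inclusive start and an exclusive stop.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for a sorted key/value table.
public class SortedKeyValueSketch {
    // Rows are kept sorted by key, which is what makes range scans cheap.
    private final TreeMap<String, String> rows = new TreeMap<>();

    public void put(String rowKey, String value) {
        rows.put(rowKey, value);
    }

    public String get(String rowKey) {
        return rows.get(rowKey);
    }

    public void delete(String rowKey) {
        rows.remove(rowKey);
    }

    // Returns all rows with startKey <= key < stopKey, in key order.
    public SortedMap<String, String> scan(String startKey, String stopKey) {
        return rows.subMap(startKey, stopKey);
    }

    public static void main(String[] args) {
        SortedKeyValueSketch table = new SortedKeyValueSketch();
        table.put("user1#msg001", "hi");
        table.put("user1#msg002", "hello");
        table.put("user2#msg001", "hey");
        // '$' is the character right after '#', so this range covers every key starting with "user1#".
        System.out.println(table.scan("user1#", "user1$"));
        table.delete("user1#msg001");
        System.out.println(table.get("user1#msg001"));
    }
}
```

Prefixing the row key with the owner, as in the made-up keys here, is one way to turn "all items for one user" into a single contiguous range.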
HBase workflow
HBase Examples
hbase> ...
What is Oozie?
Job 1 Job 2
Job 3
Job 4 Job 5
Oozie Features
Component Independent
MapReduce
Hive
Pig
Sqoop
Streaming
What is Mahout?
Machine-learning tool
Distributed and scalable machine learning algorithms on the Hadoop platform
Makes building intelligent applications easier and faster
Conclusion
Today, we introduced:
Why Hadoop is needed
The basic concepts of HDFS and MapReduce
What sort of problems can be solved with Hadoop
What other projects are included in the Hadoop ecosystem
Case Study
CDN / xSP
Honeypot
Web Crawler
Trend Micro Mail Protection
Trend Micro Web Protection
Trend Micro Endpoint Protection
Issues to Address
Raw Data -> Information -> Threat Intelligence/Solution
Volume: Infinite
Time: No Delay
Target: Keep Changing Threats
SPN Feedback
SPAM
CDN Log
HTTP POST
L4
Log Receiver
Log Receiver
Web Pages
L4
Log Post Processing
Log Post Processing
HTTP Download
Log Post Processing
SPN infrastructure
Adhoc-Query (Pig)
MapReduce
HBase
Lumber Jack
Circus (Ambari)
Tracking Logging System (TLS)
Malware Classification
Correlation Platform
Global Object Cache (GOC)
Feedback Information
Message Bus
Application
Email Reputation Service
Web Reputation Service
File Reputation Service
85 Web Reputation
30 Email Reputation
70 File Reputation
6 TB raw logs
1.5
Trend Micro Products / Technology
CDN Cache
Hadoop Cluster
Web Crawling
Machine Learning
Data Mining
Operation
User Traffic | Honeypot
8 billion/day
40% filtered
Akamai
4.8 billion/day
82% filtered
Page Download
15 Minutes
Process
99.98% filtered
Threat Analysis
25,000 malicious URLs/day
Consistency in HBase
Contact model: use column qualifier to store
Support range query (e.g. message box)
Pig at LinkedIn
Facebook Messages
Questions?
Thank you!