Scott Miao
2013/12/14
Who am I
• RD, SPN, Trend Micro
• 3 years of experience with the Hadoop ecosystem
• Expertise in HDFS/MR/HBase
• @takeshi.miao
THREATCONNECT
[Diagram labels: Product 1, Product 2, Product 3, …; IP, domain, URL, filename, process, file hash, virus detection, registry key, etc.; Sandbox, APT KB, Threat Connect, Virus DB, TE, Family Writeup, File Detection, Threat Web, Web Reputation]
• Processes and correlates different data sources
• Most relevant threat report with actionable intelligence on a single portal
A GRAPH
The problems
• Storing large-scale graph data
• Accessing large-scale graph data
• Processing large-scale graph data
Big Data
STORE
Property Graph Model (1/3)
https://github.com/tinkerpop/blueprints/wiki/Property-Graph-Model
Property Graph Model (2/3)
• A property graph has these elements
– a set of vertices
• each vertex has a unique identifier.
• each vertex has a set of outgoing edges.
• each vertex has a set of incoming edges.
• each vertex has a collection of properties defined by a map from key to value.

– a set of edges
• each edge has a unique identifier.
• each edge has an outgoing tail vertex.
• each edge has an incoming head vertex.
• each edge has a label that denotes the type of relationship between its two vertices.
• each edge has a collection of properties defined by a map from key to value.
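The element list above can be sketched as a small in-memory data structure. This is only an illustration of the model, not the Blueprints implementation; the class and method names (`PropertyGraphSketch`, `connect`) and the edge-id format are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory property graph; names are illustrative only. */
final class PropertyGraphSketch {

    static final class Vertex {
        final String id;                                        // unique identifier
        final List<Edge> outEdges = new ArrayList<>();          // outgoing edges
        final List<Edge> inEdges = new ArrayList<>();           // incoming edges
        final Map<String, Object> properties = new HashMap<>(); // key -> value

        Vertex(String id) {
            this.id = id;
        }
    }

    static final class Edge {
        final String id;      // unique identifier
        final Vertex tail;    // outgoing (source) vertex
        final Vertex head;    // incoming (target) vertex
        final String label;   // type of relationship
        final Map<String, Object> properties = new HashMap<>();

        Edge(String id, Vertex tail, String label, Vertex head) {
            this.id = id;
            this.tail = tail;
            this.label = label;
            this.head = head;
        }
    }

    /** Creates an edge tail --label--> head and registers it on both endpoints. */
    static Edge connect(Vertex tail, String label, Vertex head) {
        // The id format is an assumption chosen for this sketch.
        Edge edge = new Edge(tail.id + "-->" + label + "-->" + head.id, tail, label, head);
        tail.outEdges.add(edge);
        head.inEdges.add(edge);
        return edge;
    }

    public static void main(String[] args) {
        Vertex a = new Vertex("a");
        Vertex b = new Vertex("b");
        Edge knows = connect(a, "knows", b);
        knows.properties.put("since", 2013);
        System.out.println(knows.id + " out=" + a.outEdges.size() + " in=" + b.inEdges.size());
    }
}
```

Keeping both the outgoing and incoming edge lists on each vertex mirrors the model's definition and makes traversal in either direction a simple list walk.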
Property Graph Model (3/3)
The domain model for Property Graph Model
The relational model for Property Graph Model
Massively scalable?
Active community?
Analyzable?
The winner is…
• We use HBase as our graph storage
– Google BigTable and PageRank
– HBaseCon2012

Yeah
We are NO. 1 !!
Use HBase to store Graph data (1/3)
• Schema design (each entry is row key, column, value)
– Table: vertex
‘<vertex-id>@<entity-type>’, ‘property:<property-key>@<property-value-type>’, <property-value>

– Table: edge
‘<vertex1-row-key>--><label>--><vertex2-row-key>’, ‘property:<property-key>@<property-value-type>’, <property-value>
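A minimal sketch of how the row keys and column names in this schema could be built and parsed. The helper class `GraphRowKeys` and its methods are hypothetical and not taken from HGraph; the sketch also assumes that the `-->` delimiter never occurs inside a vertex id or label.

```java
import java.util.regex.Pattern;

/** Hypothetical helpers for the row-key layout described above; not part of HGraph. */
final class GraphRowKeys {
    static final String TYPE_DELIM = "@";
    static final String EDGE_DELIM = "-->";

    /** Vertex row key: <vertex-id>@<entity-type>. */
    static String vertexKey(String vertexId, String entityType) {
        return vertexId + TYPE_DELIM + entityType;
    }

    /** Edge row key: <vertex1-row-key>--><label>--><vertex2-row-key>. */
    static String edgeKey(String tailVertexKey, String label, String headVertexKey) {
        return tailVertexKey + EDGE_DELIM + label + EDGE_DELIM + headVertexKey;
    }

    /** Column name: property:<property-key>@<property-value-type>. */
    static String propertyColumn(String key, String valueType) {
        return "property:" + key + TYPE_DELIM + valueType;
    }

    /** Common prefix of all edge rows that start at the given vertex. */
    static String outEdgePrefix(String tailVertexKey) {
        return tailVertexKey + EDGE_DELIM;
    }

    /** Splits an edge key into {tail, label, head}; assumes the delimiter is not used inside ids. */
    static String[] parseEdgeKey(String edgeKey) {
        String[] parts = edgeKey.split(Pattern.quote(EDGE_DELIM), -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("Not an edge row key: " + edgeKey);
        }
        return parts;
    }

    public static void main(String[] args) {
        String domain = vertexKey("myapps-ups.com", "domain");
        System.out.println(edgeKey(domain, "host", vertexKey("example-id", "url")));
        System.out.println(propertyColumn("ip", "String"));
    }
}
```

Because HBase stores rows sorted by row key, every edge leaving a vertex shares the prefix returned by `outEdgePrefix`, so the outgoing edges of one vertex can in principle be read with a single prefix scan on the edge table.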
Use HBase to store Graph data (2/3)
• Sample
– Table: vertex
‘myapps-ups.com@domain’, ‘property:ip@String’, ‘…’
‘myapps-ups.com@domain’, ‘property:asn@String’, ‘…’
…
‘http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’, ‘property:path@String’, ‘…’
‘http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’, ‘property:parameter@String’, ‘…’

– Table: edge
‘myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’, ‘property:property1’, ‘…’
‘myapps-ups.com@domain-->host-->http://track.muapps-ups.com/InvoiceA1423AC.JPG.exe@url’, ‘property:property2’, ‘…’
Use HBase to store Graph data (3/3)
• Tables
– create 'test.vertex', {NAME => 'property',
BLOOMFILTER => 'ROW', COMPRESSION => 'lzo',
TTL => '7776000'}
– create 'test.edge', {NAME => 'property',
BLOOMFILTER => 'ROW', COMPRESSION => 'lzo',
TTL => '7776000'}
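HBase column-family TTL values are given in seconds, so the `7776000` used for both tables corresponds to a 90-day retention window. The small check below only illustrates that arithmetic; the class name `TtlCheck` is made up for the example.

```java
import java.time.Duration;

/** Illustrative arithmetic only: converts a retention period in days to TTL seconds. */
final class TtlCheck {
    static long ttlSeconds(int days) {
        return Duration.ofDays(days).getSeconds();
    }

    public static void main(String[] args) {
        // 90 days * 24 h * 3600 s
        System.out.println(ttlSeconds(90)); // prints 7776000
    }
}
```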
It’s not me,
actually…

ACCESS
[Diagram: HBase at the center; 1. Put data (Data Sources), 2. Get data (Clients), 3. Process data (Algorithms)]
Put Data
• The HBase schema design is simple and human-readable
• It is easy to write your own dumping tool as needed
– MR/Pig/completebulkload
– A cron job can clean up broken-edge data
– TTL can also help retire old data

• We already have a lot of practice with this task
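The broken-edge cleanup mentioned above could look roughly like the following. The sketch assumes that a "broken edge" means an edge row whose tail or head vertex row no longer exists (for example after its TTL expired); that reading, the class name `BrokenEdgeFinder`, and the use of in-memory collections in place of HBase scans are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical cleanup helper; works on in-memory keys instead of real HBase scans. */
final class BrokenEdgeFinder {
    private static final Pattern EDGE_DELIM = Pattern.compile(Pattern.quote("-->"));

    /**
     * Returns the edge row keys whose tail or head vertex is missing from vertexKeys.
     * Keys that do not match <tail>--><label>--><head> are reported as broken as well.
     */
    static List<String> findBrokenEdges(Set<String> vertexKeys, Iterable<String> edgeKeys) {
        List<String> broken = new ArrayList<>();
        for (String edgeKey : edgeKeys) {
            String[] parts = EDGE_DELIM.split(edgeKey, -1);
            boolean wellFormed = parts.length == 3;
            if (!wellFormed
                    || !vertexKeys.contains(parts[0])
                    || !vertexKeys.contains(parts[2])) {
                broken.add(edgeKey);
            }
        }
        return broken;
    }
}
```

In a real job the two inputs would come from scans over the vertex and edge tables, and the returned keys would be turned into deletes; those steps are left out here to keep the sketch free of cluster dependencies.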
Get Data (1/2)
• A Graph API
• Better semantics for manipulating graph data
– Acts as a wrapper around the HBase Client API
– Rather than using the HBase Client API directly

• Simple to use
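import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;

// Look up a vertex by id, then walk its outgoing edges labelled "knows", "foo" or "bar".
Vertex vertex = this.graph.getVertex("40012");
Vertex subVertex = null;
Iterable<Edge> edges =
    vertex.getEdges(Direction.OUT, "knows", "foo", "bar");
for (Edge edge : edges) {
    // In Blueprints, Direction.IN on an edge is its head, i.e. the neighbour reached from vertex.
    subVertex = edge.getVertex(Direction.IN);
    ...
}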
Get Data (2/2)
• We implement the Blueprints API
– It provides interfaces as a specification for users to implement
– Currently the basic query methods are implemented
– We can benefit from it
• Other libraries can be supported if we implement more of the Blueprints API
– http://www.tinkerpop.com/
– RESTful server, graph algorithms, dataflow, etc.
Attack on graph
PROCESS
• Thanks to the human-readable HBase schema design and HBase's natural random access, you can:
– Write your own MR jobs
– Write your own Pig scripts/UDFs

• Example: PageRank
– http://zh.wikipedia.org/wiki/Pagerank
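To make the PageRank example concrete, here is a minimal single-machine sketch of the iteration. It is not the MR job shipped with HGraph; the class name `PageRankSketch`, the adjacency-map input, and the handling of vertices without outgoing edges (their rank is spread evenly over all vertices) are assumptions made for this example. It also assumes every vertex, including pure targets, appears as a key of the map.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory PageRank by power iteration; for illustration only. */
final class PageRankSketch {

    static Map<String, Double> pageRank(Map<String, List<String>> outLinks,
                                        double damping, int iterations) {
        final int n = outLinks.size();
        Map<String, Double> rank = new HashMap<>();
        for (String v : outLinks.keySet()) {
            rank.put(v, 1.0 / n);                     // start from a uniform distribution
        }

        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String v : outLinks.keySet()) {
                next.put(v, (1.0 - damping) / n);     // random-jump share
            }
            double danglingMass = 0.0;
            for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
                double r = rank.get(e.getKey());
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    danglingMass += r;                // no out-edges: redistribute later
                    continue;
                }
                double share = damping * r / targets.size();
                for (String t : targets) {
                    next.merge(t, share, Double::sum);
                }
            }
            final double danglingShare = damping * danglingMass / n;
            next.replaceAll((v, r) -> r + danglingShare);
            rank = next;
        }
        return rank;
    }
}
```

With this formulation the ranks keep summing to 1 after every iteration, which is a convenient sanity check when porting the same logic to MR or Pig.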
HGraph
• An open-source project hosted on GitHub
– https://github.com/takeshimiao/HGraph

• A partial implementation released from our internal pilot project
– Follows the HBase schema design
– Reads data via the Blueprints API
– Processes data with PageRank

• Download or ‘git clone’ it
– Build with ‘mvn clean package’
– Run on a Unix-like OS
• Running on Windows may produce some errors
There is another project

http://thinkaurelius.github.io/faunus/

http://thinkaurelius.github.io/titan/
OBSERVATIONS
YARN
• It seems to be turning Hadoop into a de facto big data platform
– It loosens the binding to the MR framework and accommodates other frameworks

• A bunch of data processing frameworks have migrated to it
http://hortonworks.com/hadoop/yarn/
SQL-on-Hadoop
• Impala vs. Hive (Stinger and Tez)
– Impala seems more mature than Hive
Hive is built on top of a batch processing framework (even MRv2), but Impala goes its own way!
Todd Lipcon
Committer/PMC member on the Apache Thrift, HBase, and Hadoop projects

• YARN!
– Hive Stinger and Tez are based on YARN (HDP2)
– Impala also plans to migrate to YARN (CDH5)
– Even HBase! (HOYA)
HBase is a popular NoSQL store
• From what I saw in Europe/CA/China, HBase is the most popular NoSQL solution if you have already adopted Hadoop
• Other NoSQL stores will not rescue you from ops pain points
• So the best approach is to pick the right tool and use it well
http://www.slideshare.net/Hadoop_Summit/what-is-the-point-of-hadoop?from_search=1 #p34