Hadoop summit 2017 enterprise graph analytics

IBM Analytics Platform Group
Enterprise Graph Analytics
Enterprise large-scale graph analytics and computing based on a distributed
graph database (Titan DB with HBase/Solr), distributed in-memory graph
computing (TinkerPop Hadoop-Gremlin SparkGraphComputer), and Hadoop 2
Jun(Terry) Yang • yangjuncn@cn.ibm.com • Linkedin.com/in/terryjunyang
Jing Chen(Jerry) He • jinghe@us.ibm.com • Linkedin.com/in/jing-chen-jerry-he-1553511
Hadoop Summit 2017

© IBM 2017 Hadoop Summit 2017
Agenda
• Challenges in hybrid data analytics
• Enterprise data quality analytics system based on graphed metadata
• Graph in enterprise data quality analytics solution

Hybrid data analytics and challenges
• How was “total quantity” calculated? Show me the lineage.
• What are the source-to-target mappings for the DW?
• Who read the “sales” data in non-working time?
• How to ensure data quality?
Asked by: Data Warehouse Architect, Auditor, Business Person, Data Architect

How to handle the challenges?
Data Governance
• Data Lifecycle Management
• Data Quality Management: Correctness, Consistency, Completeness, Timeliness
• Metadata, …
• Master Data Management, …

What is Metadata?
• The data used to describe other data
− Simple Metadata
− Rich Metadata
File systems metadata
• inode attributes for file management
• Filesystem object attributes include metadata such as modify time, access, owner, and permission
DBMS/DW/NoSQL metadata
• Schema for data management
• Ownership information of data
• Server/database information of data
How to manage metadata in a hybrid data analytics environment?
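As an illustration of file system metadata, the following minimal Python sketch reads a few attributes of a file through the standard `os.stat` call. The helper name `describe_file` and the sample file contents are hypothetical and not part of the original deck.

```python
import os
import stat
import tempfile
import time


def describe_file(path):
    """Collect simple filesystem metadata for one file as a plain dict."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size": info.st_size,                       # size in bytes
        "owner_uid": info.st_uid,                   # numeric owner id
        "permission": stat.filemode(info.st_mode),  # e.g. "-rw-------"
        "modify_time": time.strftime(
            "%Y-%m-%dT%H:%M:%S", time.localtime(info.st_mtime)
        ),
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as handle:
    handle.write(b"region,total_quantity\nnorth,42\n")
    sample_path = handle.name

file_metadata = describe_file(sample_path)
os.remove(sample_path)
print(file_metadata)
```

Each key in the returned dict is a candidate property once such metadata is loaded into a graph.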

Advantage of Graph in Metadata management
Traditional solution
• Limited to one server/system
• Metadata managed within a single server/system
Property Graph based solution
• Integrates metadata
• Handles storage pressure
• Efficient processing and querying
• Lineage
• Wide range of metadata managed

Property Graph
G = ( V, E ): a Graph is a set of Vertices and Edges
• Vertex: has a label and properties (Key1:value1, Key2:value2)
• Edge: has a label (label1) and properties (Key1:value1, Key2:value2)
• Born for relationships
• Intuitive modeling
• Expressive querying
• Native analysis
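The property graph model can be sketched in a few lines of Python. This is an illustrative in-memory structure, not Titan's or TinkerPop's implementation; the class names and sample vertices are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: int
    label: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    label: str
    out_v: int  # id of the vertex the edge leaves
    in_v: int   # id of the vertex the edge enters
    properties: dict = field(default_factory=dict)


@dataclass
class PropertyGraph:
    vertices: dict = field(default_factory=dict)  # id -> Vertex
    edges: list = field(default_factory=list)

    def add_vertex(self, id, label, **properties):
        self.vertices[id] = Vertex(id, label, properties)
        return self.vertices[id]

    def add_edge(self, label, out_v, in_v, **properties):
        edge = Edge(label, out_v, in_v, properties)
        self.edges.append(edge)
        return edge

    def out_neighbors(self, vertex_id, edge_label):
        """Vertices reached by following outgoing edges with the given label."""
        return [
            self.vertices[e.in_v]
            for e in self.edges
            if e.out_v == vertex_id and e.label == edge_label
        ]


graph = PropertyGraph()
graph.add_vertex(1, "user", name="alice")
graph.add_vertex(2, "program", name="etl_job")
graph.add_edge("run", 1, 2, ts_hour=10)
print([v.properties["name"] for v in graph.out_neighbors(1, "run")])
```

Both vertices and edges carry a label plus free-form key:value properties, which is what distinguishes a property graph from a plain G = (V, E).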

Using Graph Analytics to Find Complex Patterns
• Graph queries are a natural way to analyze relationship patterns
✓ Less complex than SQL
✓ Can handle high degrees of relationship with ease
• Graph schema facilitates visualization and exploration of relationships
[Figure: 1st, 2nd, and 3rd degree relationships]
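To make 1st, 2nd, and 3rd degree relationships concrete, here is a minimal breadth-first search sketch in Python that groups vertices by hop distance from a starting vertex. The function name and the sample adjacency list are hypothetical.

```python
from collections import deque


def relationships_by_degree(adjacency, start, max_degree):
    """Group vertices by shortest hop distance from start, up to max_degree."""
    distance = {start: 0}
    queue = deque([start])
    by_degree = {degree: [] for degree in range(1, max_degree + 1)}
    while queue:
        current = queue.popleft()
        if distance[current] == max_degree:
            continue  # do not expand beyond the requested degree
        for neighbor in adjacency.get(current, []):
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                by_degree[distance[neighbor]].append(neighbor)
                queue.append(neighbor)
    return by_degree


adjacency = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "dan"],
    "cat": ["ann", "dan"],
    "dan": ["bob", "cat", "eve"],
    "eve": ["dan"],
}
degrees = relationships_by_degree(adjacency, "ann", 3)
print(degrees)
```

In a relational schema each extra degree typically costs another self-join, whereas here it is one more level of the same traversal.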

Case study - Audit data access
• Data theft risk in enterprise hybrid data environments
– Most data is stolen by insiders.
– Most data theft happens in non-working time.
– Over-granting of privileges may lead to data theft.

Enterprise data quality analytics system based on graphed metadata
• Data ingest: finance data, consumption data, credit data, behavioral data, graphed metadata, …
• Feature Selection: statistical learning, data analysis, (graphed) metadata analysis, …
• Advanced Feature Selection: Gradient Boosting Decision Tree, Support Vector Machine, Random Forests, PageRank (Graph), …
• Modeling: customer risk rating, consumption capacity, graph model, …
• Recommendation: consumer behavior, fraud detection, risk analytics (audit), …

Data ingest
Metadata Integration: identify entities and relationships, then map metadata to a graph for graph-based traversal
• Entities → Vertices: User, Program, Data, …
• Relationships → Edges: User run program, Program read data, …
• Attributes → Properties: Name, …
Vertex and edge properties
• user: id, name, group, permission, …
• program: name, job id, params, config, inputs, outputs, start_ts, finish_ts, …
• Data: name, size, location, department, permission, parent, children, …
• Run / Read edges: ts_hour, ts_min, ts_sec, status, …
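The mapping from metadata to graph can be sketched as follows. This is a simplified Python illustration, assuming a hypothetical job-log record layout (`user`, `program`, `ts_hour`, `inputs`); it is not the ingest code used in the solution.

```python
def metadata_to_graph(job_records):
    """Turn job-log records into vertices and edges of a property graph."""
    vertices = {}  # (label, name) -> properties
    edges = []     # (out_key, edge_label, in_key, properties)
    for record in job_records:
        user_key = ("user", record["user"])
        program_key = ("program", record["program"])
        vertices.setdefault(user_key, {"name": record["user"]})
        vertices.setdefault(program_key, {"name": record["program"]})
        # Relationship "user run program" becomes an edge with a timestamp.
        edges.append((user_key, "run", program_key, {"ts_hour": record["ts_hour"]}))
        for data_name, department in record["inputs"]:
            data_key = ("data", data_name)
            vertices.setdefault(
                data_key, {"name": data_name, "department": department}
            )
            # Relationship "program read data" becomes an edge.
            edges.append((program_key, "read", data_key, {}))
    return vertices, edges


job_records = [
    {"user": "alice", "program": "report_job", "ts_hour": 10,
     "inputs": [("sales_2017.csv", "sales")]},
    {"user": "bob", "program": "export_job", "ts_hour": 23,
     "inputs": [("sales_2017.csv", "sales"), ("hr_roster.csv", "hr")]},
]
vertices, edges = metadata_to_graph(job_records)
print(len(vertices), "vertices,", len(edges), "edges")
```

Entities that appear in several records, such as the shared sales file, collapse into a single vertex, which is what integrates metadata from separate systems.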

Feature Selection
Who read the sensitive sales data in non-working time?
Query: userFeaSele = graph.traversal().V().has("department","sales").inE("read").outV().hasLabel('program').inE("run").has("ts_hour",not(between(9,17))).outV()
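The intent of this traversal can be mirrored in plain Python: start from sales data, walk back over "read" edges to programs, then over "run" edges to users, keeping only runs outside working hours. The sketch below assumes working hours of 9:00 up to 17:00 and uses made-up sample data; all names are hypothetical.

```python
WORK_START, WORK_END = 9, 17  # assumed working hours: 9 <= hour < 17

# Hypothetical metadata: data -> department, program -> data, user -> program.
data_department = {"sales_2017.csv": "sales", "hr_roster.csv": "hr"}
read_edges = [
    ("export_job", "sales_2017.csv"),
    ("payroll_job", "hr_roster.csv"),
    ("report_job", "sales_2017.csv"),
]
run_edges = [  # (user, program, ts_hour)
    ("alice", "report_job", 10),
    ("bob", "export_job", 23),
    ("carol", "payroll_job", 2),
]


def off_hours_readers(department):
    """Users who ran a program that read the department's data off-hours."""
    datasets = {name for name, dept in data_department.items() if dept == department}
    programs = {program for program, data in read_edges if data in datasets}
    return sorted(
        {
            user
            for user, program, hour in run_edges
            if program in programs and not (WORK_START <= hour < WORK_END)
        }
    )


print(off_hours_readers("sales"))
```

Only bob qualifies here: alice read sales data during working hours, and carol ran her job at night but on HR data.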
Advanced Feature Selection
Which users have access to a large amount of data?
Query: … withComputer(SparkGraphComputer) …
userAdvFeaSele = userFeaSele.pageRank().by('pageRank').order().by('pageRank', decr).limit(30)
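For readers unfamiliar with PageRank, the following is a textbook power-iteration sketch in Python. It is not TinkerPop's PageRankVertexProgram, whose normalization and parameters may differ; the damping factor, iteration count, and sample graph are assumptions.

```python
def page_rank(out_links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration; ranks sum to 1."""
    vertices = set(out_links)
    for targets in out_links.values():
        vertices.update(targets)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(iterations):
        # Rank held by vertices without outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in vertices if not out_links.get(v))
        new_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in vertices
        }
        for source, targets in out_links.items():
            if targets:
                share = damping * rank[source] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


out_links = {"a": ["c"], "b": ["c"], "c": ["d"], "d": []}
rank = page_rank(out_links)
# Highest-ranked vertices first, analogous to ordering by rank and limiting.
top_ranked = sorted(rank, key=rank.get, reverse=True)[:2]
print(top_ranked)
```

Sorting in descending order before taking the first k entries is what selects the highest-ranked vertices rather than the lowest.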

Modeling
• Model risk analysis with graphed metadata and information from ERP.
• Analyze each user against employee information from ERP, such as years of
service, age, and role, to identify suspects. A non-sales person, for
example an application R&D person, who read sensitive sales data would be a suspect.
• Audit recommendation.
Risk analysis model
• Inputs: Graph: User List (userAdvFeaSele) from Advanced Feature Selection; ERP: Employee information; ERP: Violation information; other systems
• Outputs: Audit Recommendation; Risk analysis report; suspects who stole sensitive data
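A rule-based stand-in for this step might look like the sketch below. It only illustrates joining the graph-derived user list with ERP records; the actual risk model is not described in the deck, and the field names, the `expected_roles` rule, and the sample data are all assumptions.

```python
def rank_suspects(flagged_users, employees, violations, expected_roles=("sales",)):
    """Flag users whose ERP role does not explain access to sales data."""
    suspects = []
    for user in flagged_users:
        employee = employees.get(user)
        if employee is None or employee["role"] in expected_roles:
            continue  # unknown to ERP, or a role that normally reads sales data
        suspects.append(
            {
                "user": user,
                "role": employee["role"],
                "years_of_service": employee["years_of_service"],
                "violations": violations.get(user, 0),
            }
        )
    # Users with more recorded violations are listed first.
    return sorted(suspects, key=lambda s: s["violations"], reverse=True)


flagged_users = ["bob", "dave", "erin"]  # stand-in for userAdvFeaSele
employees = {
    "bob": {"role": "application R&D", "years_of_service": 1},
    "dave": {"role": "sales", "years_of_service": 6},
    "erin": {"role": "finance", "years_of_service": 3},
}
violations = {"erin": 2}
suspects = rank_suspects(flagged_users, employees, violations)
print([s["user"] for s in suspects])
```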

Graph in enterprise data quality analytics solution
• Data Source: user data, machine data, log data, behavioral data, third-party data, graphed metadata
• Ingest (load)
• Enterprise data quality system: feature analysis, lineage, metadata management, cleansing, …
• Platform: Hadoop, HBase, Hive, HDFS, Spark, Titan, Solr
• Business Application: risk management, data audit, cost analytics, ……

How to choose Enterprise Graph Database?
Evaluate a graph database from the following perspectives:
• Data storing features
• Operation and manipulation features
• Graph data structures
• Query features
• Schema and instance representation
• Easy and centralized Management
• Expose service
• Security features
• Fast computing

Titan
• What is Titan
− Distributed Graph Database
− Based on TinkerPop (Gremlin)
− Open Source
• Titan Features
− Distributed
− Scalable: billions of edges and vertices
− Real-time
− Transactional database (concurrent users/ACID/..)
− Global graph compute: graph data analytics, report, ETL
− Search: geo, numeric range, and full text search

Titan solution architecture
Application
Management API | TinkerPop API - Gremlin
Internal API layer
Database layer (Tx, Data, Mgmt, Optimizer)
Storage and Index Interface Layer | OLAP I/O Interface
• Storage Backend: HBase
• External Index Backend: Solr
• Big Data Platform: Hadoop, Spark (Gremlin GraphComputer)
• Workloads: OLTP, OLAP
✓ Optimized for storing and querying billions of vertices and edges over a cluster
✓ Supports thousands of concurrent users
✓ Can execute local queries (OLTP) or distributed queries across a cluster (OLAP)

Backend – HBase & Solr
• HBase
− Tight integration with the Hadoop ecosystem.
− Native support for strong consistency.
− Linear scalability with the addition of more machines.
− Strictly consistent reads and writes.
− Convenient base classes for backing Hadoop MapReduce jobs with HBase tables.
− Support for exporting metrics via JMX.
− Open source under the liberal Apache 2 license.
• Solr
− Solr is the popular, blazing fast open source enterprise search platform from the
Apache Lucene project.
− Solr is a standalone enterprise search server with a REST-like API.
− Solr is highly reliable, scalable and fault tolerant, providing distributed indexing,
replication and load-balanced querying, automated failover and recovery, centralized
configuration and more.
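A Titan graph is pointed at its HBase and Solr backends through a properties file. The Python sketch below renders a minimal example of such a file; the host names are placeholders, and the option set shown is an assumption about a typical Titan 1.0 setup rather than the configuration used in this solution.

```python
def render_properties(settings):
    """Render a dict as Java-style key=value properties text."""
    return "\n".join("%s=%s" % (key, value) for key, value in settings.items()) + "\n"


# Placeholder hosts; adjust to the actual ZooKeeper quorum of the cluster.
titan_settings = {
    "storage.backend": "hbase",
    "storage.hostname": "zk1.example.com,zk2.example.com,zk3.example.com",
    "index.search.backend": "solr",
    "index.search.solr.mode": "cloud",
    "index.search.solr.zookeeper-url": "zk1.example.com:2181",
}
properties_text = render_properties(titan_settings)
print(properties_text)
```

Here `search` is just the name given to the index backend; the storage backend holds the graph itself while the index backend serves geo, numeric range, and full-text lookups.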
Evaluation checklist:
✓ Data storing features
✓ Operation and manipulation features
✓ Graph data structures
✓ Query features
✓ Schema and instance representation
• Easy and centralized Management
• Expose service
• Security features
• Fast computing

Integration and management
Titan in Ambari
• Titan deployment: installation, uninstallation, Titan client deployment, Titan server deployment
• Titan server operation: start server, stop server, service check
• Titan configuration: HBase backend, Solr backend, SparkGraphComputer, Titan server, Titan environment, Titan security
• Titan security support: SSL, SASL, LDAP, Kerberos, Knox, HBase access control
Evaluation checklist:
✓ Query features
✓ Easy and centralized Management
• Expose service
• Security features
• Fast computing

Titan service
• Titan server: Gremlin Server, exposing {RESTful} and {Web Socket} endpoints
• Titan client: Gremlin Console (gremlin>), local or remote
• Titan Engine: Mgmt API and TinkerPop API (Gremlin), Internal API layer, Database layer, OLAP I/O
• Backends: HBase, Solr, Spark (Gremlin GraphComputer)
Evaluation checklist:
✓ Query features
✓ Expose service
• Security features
• Fast computing
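To show what calling the RESTful endpoint could look like, the Python sketch below builds (but does not send) an HTTP POST for Gremlin Server, whose HTTP endpoint accepts a JSON body with a `gremlin` script and optional `bindings`. The host, port 8182, and the sample query are assumptions; a secured deployment would also need TLS and authentication.

```python
import json
import urllib.request


def build_gremlin_request(script, host="localhost", port=8182, bindings=None):
    """Build an HTTP POST carrying a Gremlin script as a JSON payload."""
    payload = {"gremlin": script}
    if bindings:
        payload["bindings"] = bindings  # parameters referenced by the script
    return urllib.request.Request(
        url="http://%s:%d" % (host, port),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_gremlin_request(
    "g.V().has('department', dept).count()", bindings={"dept": "sales"}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing values through `bindings` instead of concatenating them into the script keeps the query text constant across calls. Sending the request would be a call to `urllib.request.urlopen(request)` against a running server.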

Titan security enhancement
• Titan client to Titan server (local or remote): SSL, Knox, SASL
• Titan user authentication: LDAP / OS / Kerberized
• Titan server to storage: HBase access control
• Kerberized cluster: Spark (Gremlin GraphComputer), HBase, Solr
Evaluation checklist:
✓ Query features
✓ Expose service
✓ Security features
• Fast computing

Integrate TinkerPop SparkGraphComputer with Titan DB
• Titan layers: Internal API layer, Database layer, OLAP I/O Interface, HBase, Solr
• Gremlin GraphComputer: SparkGraphComputer (spark-gremlin on hadoop-gremlin), running on Spark over a Graph RDD
• Vertex programs: PageRankVertexProgram, PeerPressureVertexProgram, BulkDumperVertexProgram, BulkLoaderVertexProgram, TraversalVertexProgram
Evaluation checklist:
✓ Query features
✓ Expose service
✓ Security features
✓ Fast computing
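Vertex programs run in a vertex-centric, superstep-based style: in each superstep every vertex reads its incoming messages, updates its own state, and sends messages to its neighbors. The single-process Python sketch below illustrates that execution model with a hypothetical min-label propagation program (it finds connected components); it is not one of TinkerPop's vertex programs and involves no Spark.

```python
def run_min_label_program(adjacency, max_supersteps=20):
    """Propagate the smallest vertex id through each connected component."""
    # Every vertex starts with its own id as its component label.
    label = {v: v for v in adjacency}
    # Setup step: every vertex sends its label to its neighbors.
    inbox = {v: [] for v in adjacency}
    for v, neighbors in adjacency.items():
        for neighbor in neighbors:
            inbox[neighbor].append(label[v])
    supersteps = 0
    while any(inbox.values()) and supersteps < max_supersteps:
        supersteps += 1
        outbox = {v: [] for v in adjacency}
        for v, messages in inbox.items():
            # A vertex only acts when it learns of a smaller label.
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                for neighbor in adjacency[v]:
                    outbox[neighbor].append(label[v])
        inbox = outbox  # messages are delivered in the next superstep
    return label


# Undirected sample graph with two components: {1, 2, 3} and {4, 5}.
adjacency = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
components = run_min_label_program(adjacency)
print(components)
```

Because each vertex touches only its own state and its inbox, the per-vertex work in a superstep can be partitioned across a cluster, which is the property a distributed graph computer relies on.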

Open source Graph Database
JanusGraph: a new Linux Foundation project formed to continue development of
the TitanDB graph database.
The last Titan release, 1.0.0, was on Sep 20, 2015.

References & Contacts
• Graph
− Titan: http://titan.thinkaurelius.com
− JanusGraph: http://janusgraph.org
− TinkerPop: https://tinkerpop.apache.org
Jun(Terry) Yang
Team Lead
yangjuncn@cn.ibm.com
Linkedin.com/in/terryjunyang
Jing Chen(Jerry) He
Architect
jinghe@us.ibm.com
Linkedin.com/in/jing-chen-jerry-he-1553511

Thanks!
Questions?
