Apache Eagle
Monitor Hadoop in Real Time
Hao Chen
http://people.apache.org/~hao
ABOUT SPEAKER
PMC & Committer of Apache Eagle
hao@apache.org
MTS Software Engineer at eBay
hchen9@ebay.com
Hao Chen / 陈浩
Agenda
•Introduction
•Use Case
•Architecture
•Q & A
Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop from eBay.
Open sourced as Apache Incubator Project on Oct 26th 2015
Secure Hadoop in real time: a data activity monitoring solution that instantly identifies access to sensitive data, recognizes attacks and malicious activity, and blocks access in real time.
See http://eagle.incubator.apache.org or http://github.com/apache/incubator-eagle
Apache Eagle
Eagle was started at the end of 2013 to monitor the Hadoop ecosystem, because existing tools such as Zabbix and Ganglia could not handle the huge volume of metrics and logs generated by the Hadoop systems at eBay.
• 2014/2016: 10,000 nodes, 150,000+ cores, 200 PB, 2,000+ users
• 2012: 3,000+ nodes, 10,000+ cores, 50+ PB
• 2011: 1,000+ nodes, 10,000+ cores, 10+ PB
• 2010: 100+ nodes, 1,000+ cores, 1 PB
• 2009: 50+ nodes
• 2007: 1-10 nodes
Hadoop Data
• Security
• Activity
Hadoop Platform
• Health
• Availability
• Performance
Hadoop @ eBay Inc
Initiative – Why build Eagle?
• 7+ CLUSTERS
• 10,000+ NODES
• 200+ PB DATA
• 10 B+ EVENTS / DAY
• 500+ METRIC TYPES
• 50,000+ JOBS / DAY
• 50,000,000+ TASKS / DAY
Eagle Use Cases
• Data Activity Monitoring: Secure Hadoop in real time. A data activity monitoring solution that instantly identifies access to sensitive data, recognizes attacks and malicious activity, and blocks access in real time.
• Job Performance Monitoring: Hadoop and Spark job profiling and performance monitoring, cluster health anomaly detection.
• eBay Unified Monitoring Platform: Provides unified monitoring-as-a-service for everything around infrastructure or business.
Eagle Data Activity Monitoring
• Data Loss Prevention: Get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
• Malicious Logins: Detect logins in which a malicious user tries to guess a password. Eagle creates user profiles using machine learning algorithms to detect anomalies.
• Unauthorized Access: Detect and stop a malicious user trying to access classified data without privilege.
• Malicious User Operation: Detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles. Eagle supports multiple native operation types.
[Figure labels: User, Privileges, Common Data Sets, Patterns, Commands, Zones, Query, Columns]
HDFS Policies
• Access to sensitive files
• HDFS commands used (read, write, update…)
• Client host
• Destination
• Security zones
Hive Policies
• Access to tables with PII data
• SQL query profiles
• Client host
• Security zone
Eagle Data Activity Monitoring
Hadoop Data Security: Detect anomalies in accessing HDFS and Hive
Kernel Density Function
• Offline: Determine the bandwidth, a parameter of the kernel density function (KDE), from the training dataset.
• Online: If a test data point lies outside the trained bandwidth, it is an anomaly (policy).
Principal components (PCs) in EVD (eigenvalue decomposition)
Eagle Machine Learning User Profile
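The offline and online steps above can be sketched roughly as follows. This is a minimal one-dimensional illustration, not Eagle's implementation: the class name `KdeUserProfile`, the Gaussian kernel, the normal-reference rule of thumb for the bandwidth, and the choice of "lowest density seen in training" as the anomaly threshold are all assumptions made for the example.

```java
import java.util.Arrays;

/** Hypothetical 1-D user profile built on a Gaussian kernel density estimate. */
final class KdeUserProfile {
    private final double[] samples;
    private final double bandwidth;
    private final double minTrainingDensity;

    /** Offline step: derive the bandwidth and a density threshold from training data. */
    KdeUserProfile(double[] training) {
        if (training.length < 2) {
            throw new IllegalArgumentException("need at least two training points");
        }
        this.samples = training.clone();
        this.bandwidth = ruleOfThumbBandwidth(samples);
        if (bandwidth <= 0.0) {
            throw new IllegalArgumentException("training data has no spread");
        }
        // Assumed threshold: the lowest density observed at any training point.
        double min = Double.MAX_VALUE;
        for (double x : samples) {
            min = Math.min(min, density(x));
        }
        this.minTrainingDensity = min;
    }

    /** Normal-reference rule of thumb: h = 1.06 * sd * n^(-1/5). */
    static double ruleOfThumbBandwidth(double[] xs) {
        final double mean = Arrays.stream(xs).average().orElse(0.0);
        double sumSq = Arrays.stream(xs).map(x -> (x - mean) * (x - mean)).sum();
        double sd = Math.sqrt(sumSq / (xs.length - 1));
        return 1.06 * sd * Math.pow(xs.length, -0.2);
    }

    /** Gaussian kernel density estimate at x. */
    double density(double x) {
        double sum = 0.0;
        for (double s : samples) {
            double u = (x - s) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
        }
        return sum / (samples.length * bandwidth);
    }

    /** Online step: a point is anomalous if it is less likely than any training point. */
    boolean isAnomaly(double x) {
        return density(x) < minTrainingDensity;
    }

    public static void main(String[] args) {
        // Made-up feature values, e.g. reads per minute for one user.
        KdeUserProfile profile =
            new KdeUserProfile(new double[] {10, 11, 9, 10.5, 9.5, 10.2, 9.8, 10.1});
        System.out.println("10.0 anomalous? " + profile.isAnomaly(10.0));
        System.out.println("50.0 anomalous? " + profile.isAnomaly(50.0));
    }
}
```

A real profile would be multi-dimensional (one feature per operation type, for example); the single feature here only keeps the sketch short.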
• Use case: Detect node anomalies by analyzing the task failure ratio across all nodes
• Assumption: The task failure ratio should be approximately equal on every node
• Algorithm: Node-by-node comparison (symmetry violation) and per-node trend
Eagle Job Performance Monitoring
• Alerting: Anomaly detection alerting
• Insight: Task failure drill-down
Task Failure Based Anomaly Host Detection
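The node-by-node comparison can be illustrated with a small sketch. It assumes, purely for the example, that a node violates symmetry when its task failure ratio exceeds the cluster median by more than a fixed margin; the class name, the use of the median, and the margin value are hypothetical, and the per-node trend check is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical symmetry check on per-node task failure ratios. */
final class TaskFailureAnomaly {

    /** counts maps node name to {failedTasks, totalTasks}; idle nodes get ratio 0. */
    static Map<String, Double> failureRatios(Map<String, long[]> counts) {
        Map<String, Double> ratios = new LinkedHashMap<>();
        for (Map.Entry<String, long[]> e : counts.entrySet()) {
            long failed = e.getValue()[0];
            long total = e.getValue()[1];
            ratios.put(e.getKey(), total == 0 ? 0.0 : (double) failed / total);
        }
        return ratios;
    }

    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    /** Returns nodes whose ratio exceeds the cluster median by more than margin. */
    static List<String> anomalousNodes(Map<String, Double> ratios, double margin) {
        List<String> result = new ArrayList<>();
        if (ratios.isEmpty()) {
            return result;
        }
        double[] values = ratios.values().stream().mapToDouble(Double::doubleValue).toArray();
        double med = median(values);
        for (Map.Entry<String, Double> e : ratios.entrySet()) {
            if (e.getValue() - med > margin) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up counts: {failed, total} per node.
        Map<String, long[]> counts = new LinkedHashMap<>();
        counts.put("node1", new long[] {2, 100});
        counts.put("node2", new long[] {3, 100});
        counts.put("node3", new long[] {40, 100});
        counts.put("node4", new long[] {1, 100});
        System.out.println(anomalousNodes(failureRatios(counts), 0.1));
    }
}
```

The median is used instead of the mean so that one badly failing node does not shift the baseline it is compared against.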
  
• Use case: Detect data skew by statistics and distributions for attempt execution durations and counters
• Assumption: Durations and counters should follow a normal distribution
Counters & Features
• mapDuration
• reduceDuration
• mapInputRecords
• reduceInputRecords
• combineInputRecords
• mapSpilledRecords
• reduceShuffleRecords
• mapLocalFileBytesRead
• reduceLocalFileBytesRead
• mapHDFSBytesRead
• reduceHDFSBytesRead
Modeling & Statistics
• Avg
• Min
• Max
• Distributions
• Max z-score
• Top-N
• Correlation
Threshold & Detection
• Counters: Correlation > 0.9 & Max(Z-Score) > 90%
Hadoop Job Data Skew Detection
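As a rough illustration of the threshold rule above, the sketch below flags a job as skewed when the task durations correlate strongly with one counter and the largest duration z-score is high. The class and method names are hypothetical. The slide's "90%" is not explained further, so the example treats it as an assumption and uses 1.2816, the one-sided 90th percentile of the standard normal distribution, as the z-score cut-off.

```java
/** Hypothetical data skew check over per-attempt durations and one counter. */
final class DataSkewDetector {

    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Population standard deviation. */
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / xs.length);
    }

    /** Largest z-score in xs, or 0 when all values are equal. */
    static double maxZScore(double[] xs) {
        double m = mean(xs);
        double sd = stdDev(xs);
        if (sd == 0.0) {
            return 0.0;
        }
        double max = Double.NEGATIVE_INFINITY;
        for (double x : xs) {
            max = Math.max(max, (x - m) / sd);
        }
        return max;
    }

    /** Pearson correlation of two equally long series, or 0 if either is constant. */
    static double correlation(double[] xs, double[] ys) {
        if (xs.length != ys.length) {
            throw new IllegalArgumentException("series must have equal length");
        }
        double mx = mean(xs);
        double my = mean(ys);
        double sdx = stdDev(xs);
        double sdy = stdDev(ys);
        if (sdx == 0.0 || sdy == 0.0) {
            return 0.0;
        }
        double cov = 0.0;
        for (int i = 0; i < xs.length; i++) {
            cov += (xs[i] - mx) * (ys[i] - my);
        }
        return cov / xs.length / (sdx * sdy);
    }

    static boolean isSkewed(double[] durations, double[] counter,
                            double minCorrelation, double minMaxZ) {
        return correlation(durations, counter) > minCorrelation
            && maxZScore(durations) > minMaxZ;
    }

    public static void main(String[] args) {
        // Made-up attempts: the last one reads far more records and runs far longer.
        double[] mapDuration = {10, 11, 9, 10, 12, 60};
        double[] mapInputRecords = {100, 110, 90, 100, 120, 600};
        System.out.println("skewed? " + isSkewed(mapDuration, mapInputRecords, 0.9, 1.2816));
    }
}
```

Requiring both conditions separates a slow attempt caused by a large input share (skew) from one that is slow for unrelated reasons, such as a bad host.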
Eagle @ eBay Inc.
• 7+ CLUSTERS
• 10,000+ NODES
• 200+ PB DATA
• 10 B+ EVENTS / DAY
• 500+ METRIC TYPES
• 50,000+ JOBS / DAY
• 50,000,000+ TASKS / DAY
Eagle Deployment at eBay Production
• 100+ security policies
• 8 nodes
• 30 worker processes
• 64 Kafka partitions
Eagle Performance
• Avg latency: ~50 ms
• Max throughput per cluster: 300 k/s
Technical Challenges
1. Large Scale Processing and Storage
• Scale IO complexity (stream)
• Scale computation complexity (policy)
• Scale storage & query (event, log, metric)
2. Real-Time Alerting
• Real-time data collection
• Real-time stream processing
• Real-time alerting
3. Expressive Correlation Model
• Complex policy model
• Stream GroupBy, Join, Window
• Machine learning
4. Hadoop Ecosystem Integration
• Data source integration
• Anomaly detection model
Eagle Architecture
Distributed Policy Engine
[Architecture diagram: Real-time Event Stream (Stream_{1} … Stream_{*}, Dynamic Stream Schema) → Stream Processing → AlertExecutor_{1}, AlertExecutor_{2}, …, AlertExecutor_{N} → Real Time Alerts; Metadata Manager: Policy Management, Dynamic Policy Deployment]
• Real-time Streaming: Apache Storm (execution engine) + Kafka (message bus)
• Declarative Policy: SQL (CEP) on streaming + hot deploy
• Linear Scalability: Data volume scale + computation scale
• Metadata-Driven: Schema management and dynamic policy lifecycle
Distributed Policy Engine
Declarative Policy
• Filter
• Join
• Aggregation: Avg, Sum, Min, Max, etc.
• Group by
• Having
• Stream handlers for windows: Time Window, Batch Window, Length Window
• Conditions and expressions: and, or, not, ==, !=, >=, >, <=, <, and arithmetic operations
• Pattern processing
• Sequence processing
• Event tables: integrate historical data in real-time processing
• SQL-like query: query, stream definition, and query plan compilation
Distributed SQL on Streaming: Siddhi CEP + Storm by default
from MetricStream[(name == 'ReplLag') and (value > 1000)]
select * insert into outputStream;
Declarative Policy - Examples
Example 1: Alert if Hadoop NameNode capacity usage exceeds 90 percent
from hadoopJmxMetricEventStream[metric == "hadoop.namenode.fsnamesystemstate.capacityused" and value > 0.9]
select metric, host, value, timestamp, component, site
insert into alertStream;
Example 2: Alert if the Hadoop NameNode HA state switches
from every a = hadoopJmxMetricEventStream[metric == "hadoop.namenode.fsnamesystem.hastate"]
  -> b = hadoopJmxMetricEventStream[metric == a.metric and b.host == a.host and a.value != value]
  within 10 min
select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as timestamp, b.metric as metric, b.component as component, b.site as site
insert into alertStream;
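To make the intent of Example 2 concrete, here is a simplified plain-Java approximation of what the pattern checks. It does not use the Siddhi API and does not reproduce the full semantics of Siddhi's `every a -> b` pattern: it only remembers the most recent HA state per host and reports a change seen within ten minutes. The class `HaSwitchDetector`, the `MetricEvent` type, and the numeric encoding of the HA state are assumptions for the sketch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the HA-switch policy, tracking the last state per host. */
final class HaSwitchDetector {
    static final String HA_METRIC = "hadoop.namenode.fsnamesystem.hastate";
    static final long WINDOW_MS = 10 * 60 * 1000L; // "within 10 min"

    /** Minimal event with the fields the policy reads. */
    static final class MetricEvent {
        final String metric;
        final String host;
        final double value;
        final long timestamp;

        MetricEvent(String metric, String host, double value, long timestamp) {
            this.metric = metric;
            this.host = host;
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    private final Map<String, MetricEvent> lastByHost = new HashMap<>();

    /** Returns an alert message when the HA state of a host changed, otherwise null. */
    String onEvent(MetricEvent e) {
        if (!HA_METRIC.equals(e.metric)) {
            return null; // filter: only the HA state metric is relevant
        }
        MetricEvent prev = lastByHost.put(e.host, e);
        boolean changed = prev != null && prev.value != e.value;
        if (changed && e.timestamp - prev.timestamp <= WINDOW_MS) {
            return e.host + ": HA state changed from " + prev.value + " to " + e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        HaSwitchDetector detector = new HaSwitchDetector();
        // Made-up events for one NameNode host; timestamps in milliseconds.
        System.out.println(detector.onEvent(new MetricEvent(HA_METRIC, "nn1", 0.0, 0L)));
        System.out.println(detector.onEvent(new MetricEvent(HA_METRIC, "nn1", 1.0, 60_000L)));
    }
}
```

In the declarative form the same logic is a few lines of policy text that can be deployed without restarting the topology, which is the point the slide makes.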
Distributed Policy Engine - Scalability
[Diagram: Distributed Streaming Cluster Environment: Stream_{1} … Stream_{*} → Stream Processing → AlertExecutor_{1}, AlertExecutor_{2}, …, AlertExecutor_{N}]
Dynamic policy partition by {event} * {policy}
• N users with 3 partitions and M policies with 2 partitions give 3 * 2 physical tasks
• Physical partition + policy-level partition
Linear Scalability Principle
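One way to read the {event} * {policy} partitioning is sketched below: events are hashed by user into event partitions, policies are hashed into policy partitions, and each pair of partitions maps to one physical task. The class name, the hash-based assignment, and the task numbering are assumptions for illustration; the slide does not specify how Eagle assigns partitions.

```java
import java.util.Arrays;

/** Hypothetical mapping of (event partition, policy partition) pairs to physical tasks. */
final class PolicyPartitioner {
    private final int eventPartitions;
    private final int policyPartitions;

    PolicyPartitioner(int eventPartitions, int policyPartitions) {
        if (eventPartitions <= 0 || policyPartitions <= 0) {
            throw new IllegalArgumentException("partition counts must be positive");
        }
        this.eventPartitions = eventPartitions;
        this.policyPartitions = policyPartitions;
    }

    /** Total number of physical tasks, e.g. 3 * 2 = 6. */
    int physicalTasks() {
        return eventPartitions * policyPartitions;
    }

    int eventPartition(String user) {
        return Math.floorMod(user.hashCode(), eventPartitions);
    }

    int policyPartition(String policyId) {
        return Math.floorMod(policyId.hashCode(), policyPartitions);
    }

    /** The single task that evaluates this policy for this user's events. */
    int taskFor(String user, String policyId) {
        return eventPartition(user) * policyPartitions + policyPartition(policyId);
    }

    /** All tasks that must receive an event of this user: one per policy partition. */
    int[] tasksForEvent(String user) {
        int[] tasks = new int[policyPartitions];
        int base = eventPartition(user) * policyPartitions;
        for (int j = 0; j < policyPartitions; j++) {
            tasks[j] = base + j;
        }
        return tasks;
    }

    public static void main(String[] args) {
        PolicyPartitioner partitioner = new PolicyPartitioner(3, 2);
        System.out.println("physical tasks: " + partitioner.physicalTasks());
        System.out.println("tasks for alice: " + Arrays.toString(partitioner.tasksForEvent("alice")));
    }
}
```

Under this reading, adding event partitions spreads data volume and adding policy partitions spreads computation, which matches the two scaling dimensions named on the architecture slide.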
Algorithm | Weights of executors (partitioned by user)
Random    | 0.0484  0.152   0.3535  0.105   0.203   0.072   0.042   0.024
Greedy    | 0.0837  0.0837  0.0837  0.0837  0.0737  0.0637  0.0437  0.0837
Stream Partition Skew (15:1)
Distributed Policy Engine – Optimization
Stream Partition Problem: https://en.wikipedia.org/wiki/Partition_problem
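The slide does not show how the greedy assignment is computed. A common greedy heuristic for the multiway partition problem, shown here only as an assumed stand-in, sorts the partition weights in descending order and always gives the next weight to the executor with the smallest load so far. The class name and the sample weights are made up.

```java
import java.util.Arrays;

/** Hypothetical greedy balancing of weighted stream partitions across executors. */
final class GreedyPartitioner {

    /** Returns the total load per executor after greedy assignment. */
    static double[] assign(double[] weights, int executors) {
        if (executors <= 0) {
            throw new IllegalArgumentException("executors must be positive");
        }
        double[] sorted = weights.clone();
        Arrays.sort(sorted); // ascending; iterate from the end for descending order
        double[] load = new double[executors];
        for (int i = sorted.length - 1; i >= 0; i--) {
            int lightest = 0;
            for (int j = 1; j < executors; j++) {
                if (load[j] < load[lightest]) {
                    lightest = j;
                }
            }
            load[lightest] += sorted[i];
        }
        return load;
    }

    /** Ratio of the heaviest to the lightest executor load. */
    static double skew(double[] load) {
        double max = Arrays.stream(load).max().orElse(0.0);
        double min = Arrays.stream(load).min().orElse(0.0);
        return max / min;
    }

    public static void main(String[] args) {
        // Made-up per-partition event weights.
        double[] weights = {8, 7, 6, 5, 4};
        double[] load = assign(weights, 2);
        System.out.println(Arrays.toString(load) + " skew=" + skew(load));
    }
}
```

The heuristic is not optimal in general, but it is cheap enough to rerun whenever the measured partition weights change.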
Distributed Real-time Policy Engine
[Diagram: Siddhi CEP Policy Evaluator, Machine Learning Policy Evaluator, Extensible Policy Evaluator]
• First-class support for WSO2 Siddhi CEP
• Extensible policy engine implementation
• Extensible policy lifecycle management
• Metadata-based module management
import java.util.List;

public interface PolicyEvaluatorServiceProvider {
    String getPolicyType();       // literal string that identifies one type of policy
    Class getPolicyEvaluator();   // policy evaluator implementation class
    List getBindingModules();     // modules mapping JSON policy text to objects
}
[Diagram: Metadata Manager, Policy/Metadata]
Distributed Policy Engine – Extensibility
Policy Engine Extensibility
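A custom policy type could plug in through the provider interface roughly as follows. The sketch redeclares the interface locally (with wildcard generics) so that it compiles on its own; `ThresholdPolicyEvaluator`, `ThresholdPolicyProvider`, `PolicyEvaluatorRegistry`, and the policy type string are hypothetical names, and the registry lookup is an assumption about how a provider might be resolved, not Eagle's actual loading mechanism.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Local copy of the provider interface from the slide, for a self-contained example. */
interface PolicyEvaluatorServiceProvider {
    String getPolicyType();
    Class<?> getPolicyEvaluator();
    List<?> getBindingModules();
}

/** Hypothetical evaluator implementation; its logic is irrelevant to the sketch. */
final class ThresholdPolicyEvaluator {
}

/** Hypothetical provider that announces a new policy type. */
final class ThresholdPolicyProvider implements PolicyEvaluatorServiceProvider {
    @Override
    public String getPolicyType() {
        return "thresholdPolicy";
    }

    @Override
    public Class<?> getPolicyEvaluator() {
        return ThresholdPolicyEvaluator.class;
    }

    @Override
    public List<?> getBindingModules() {
        return Collections.emptyList(); // no JSON binding modules in this sketch
    }
}

/** Hypothetical registry that resolves an evaluator class by policy type. */
final class PolicyEvaluatorRegistry {
    private final Map<String, PolicyEvaluatorServiceProvider> providers = new HashMap<>();

    void register(PolicyEvaluatorServiceProvider provider) {
        providers.put(provider.getPolicyType(), provider);
    }

    Class<?> evaluatorFor(String policyType) {
        PolicyEvaluatorServiceProvider provider = providers.get(policyType);
        if (provider == null) {
            throw new IllegalArgumentException("Unknown policy type: " + policyType);
        }
        return provider.getPolicyEvaluator();
    }

    public static void main(String[] args) {
        PolicyEvaluatorRegistry registry = new PolicyEvaluatorRegistry();
        registry.register(new ThresholdPolicyProvider());
        System.out.println(registry.evaluatorFor("thresholdPolicy").getSimpleName());
    }
}
```

Keying the lookup on the policy type string lets policy metadata select an evaluator without the engine knowing the concrete class in advance.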
Stream Processing Framework
Optimizer
1. Development
2. Optimization
3. Compile to native app
Use the Eagle alert framework as a library
• Lightweight ORM framework for HBase/RDBMS
• Full-function SQL-like REST query
• Optimized rowkey design for time-series data
• Native HBase coprocessor
• Secondary index support
@Table("alertdef")
@ColumnFamily("f")
@Prefix("alertdef")
@Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME)
@JsonIgnoreProperties(ignoreUnknown = true)
@TimeSeries(false)
@Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"})
@Indexes({
    @Index(name = "Index_1_alertExecutorId", columns = {"alertExecutorID"}, unique = true),
})
public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity {
    @Column("a")
    private String desc;
    @Column("b")
    private String policyDef;
    @Column("c")
    private String dedupeDef;
    // remaining members are not shown on the slide
}
Query=AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}
Large Scale Storage and Query
Uniform HBase rowkey design
Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Metric
  Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | …
• Entity
  Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | …
• Log
  Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …
  Rowvalue ::= Log Content
http://opentsdb.net
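The rowkey grammar can be illustrated with a small byte-layout sketch. The encoding is an assumption made for the example: strings are reduced to their 4-byte `hashCode`, the timestamp is stored as an 8-byte big-endian long, and tags are ordered by name. The class `RowkeySketch` is hypothetical and the slides do not state Eagle's actual byte encoding.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical encoder for: Prefix | Partition Keys | timestamp | tagName | tagValue | ... */
final class RowkeySketch {

    static byte[] build(String prefix, List<String> partitionKeys,
                        long timestamp, Map<String, String> tags) {
        // Sort tags by name so the same tag set always yields the same key.
        TreeMap<String, String> sortedTags = new TreeMap<>(tags);
        int size = 4 + 4 * partitionKeys.size() + 8 + 8 * sortedTags.size();
        ByteBuffer buffer = ByteBuffer.allocate(size);
        buffer.putInt(prefix.hashCode());
        for (String key : partitionKeys) {
            buffer.putInt(key.hashCode());
        }
        buffer.putLong(timestamp);
        for (Map.Entry<String, String> tag : sortedTags.entrySet()) {
            buffer.putInt(tag.getKey().hashCode());
            buffer.putInt(tag.getValue().hashCode());
        }
        return buffer.array();
    }

    public static void main(String[] args) {
        // Made-up tag values for an entity with prefix "alertdef".
        Map<String, String> tags = new HashMap<>();
        tags.put("site", "sandbox");
        tags.put("dataSource", "hiveQueryLog");
        byte[] rowkey = build("alertdef", Arrays.asList("sandbox"), 1000L, tags);
        System.out.println("rowkey length: " + rowkey.length);
    }
}
```

Because the prefix and partition keys come first, rows of one entity type and partition sit next to each other and a time range becomes a contiguous scan, which is the benefit a layout of this kind aims for.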
Multi-Tenants – Topology Scheduler
• Dynamic topology management
• No-downtime topology maintenance
• Topology high availability & balance
• Resource scheduling & isolation: runtime, worker, topology, or cluster
Multi-Tenants - Dynamic Correlation
• Dynamic correlation at runtime:
  • Sort, Group by, Join, Window
  • Hot deploy logic
  • Policy management
• Multi-correlation on a single stream
  • Group by different fields of the same stream
  • Re-sort the same stream in a different order
  • Join a given stream in different ways
• Multi-correlation on multiple streams
  • Cross-stream join
  • Real-time & historical stream join
Eagle Ecosystem
1. Eagle Framework: Distributed real-time framework for efficiently developing highly scalable monitoring applications
2. Eagle Apps: Security / Hadoop / Operational Intelligence / …
3. Eagle Interface: REST Service / Management UI / Customizable Analytics Visualization
4. Eagle Integration: Ambari / Docker / Ranger / Dataguise
5. Open Source: Community-driven and cross-community cooperation
[Diagram: Eagle Framework; Apps → Security, Hadoop, Cloud, Database; Interface → Web Portal, REST Services, Analytics Visualization; Integration → Ambari, Docker, Ranger, Dataguise]
Learn More about Apache Eagle
Community
• Website: http://eagle.incubator.apache.org
• GitHub: http://github.com/apache/incubator-eagle
• Mailing list: dev@eagle.incubator.apache.org
Resources
• Documentation: http://eagle.incubator.apache.org/docs/
• Docker images: https://hub.docker.com/r/apacheeagle/sandbox/
Publications & Patents
• EAGLE: USER PROFILE-BASED ANOMALY DETECTION IN HADOOP CLUSTER (IEEE)
• EAGLE: DISTRIBUTED REALTIME MONITORING FRAMEWORK FOR HADOOP CLUSTER
Thank you!
http://eagle.incubator.apache.org
The slides are licensed under the Creative Commons Attribution 4.0 International license.
Ad

More Related Content

What's hot (20)

Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
DataWorks Summit/Hadoop Summit
 
Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?Ingest and Stream Processing - What will you choose?
Ingest and Stream Processing - What will you choose?
DataWorks Summit/Hadoop Summit
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
DataWorks Summit/Hadoop Summit
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
DataWorks Summit
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
DataWorks Summit/Hadoop Summit
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
DataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit
 
Achieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on TezAchieving 100k Queries per Hour on Hive on Tez
Achieving 100k Queries per Hour on Hive on Tez
DataWorks Summit/Hadoop Summit
 
Cost-based Query Optimization
Cost-based Query Optimization Cost-based Query Optimization
Cost-based Query Optimization
DataWorks Summit/Hadoop Summit
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Charles Givre
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
The state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the CloudThe state of SQL-on-Hadoop in the Cloud
The state of SQL-on-Hadoop in the Cloud
DataWorks Summit/Hadoop Summit
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
DataWorks Summit/Hadoop Summit
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
DataWorks Summit/Hadoop Summit
 
Evolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage SubsystemEvolving HDFS to a Generalized Storage Subsystem
Evolving HDFS to a Generalized Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
DataWorks Summit/Hadoop Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
DataWorks Summit
 
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod NarasimhaSpark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit EU talk by Debasish Das and Pramod Narasimha
Spark Summit
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...
DataWorks Summit
 
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark ClustersA Container-based Sizing Framework for Apache Hadoop/Spark Clusters
A Container-based Sizing Framework for Apache Hadoop/Spark Clusters
DataWorks Summit/Hadoop Summit
 
Spark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop SummitSpark crash course workshop at Hadoop Summit
Spark crash course workshop at Hadoop Summit
DataWorks Summit
 
Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?Fast SQL on Hadoop, Really?
Fast SQL on Hadoop, Really?
DataWorks Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Charles Givre
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
Interactive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using DruidInteractive Analytics at Scale in Apache Hive Using Druid
Interactive Analytics at Scale in Apache Hive Using Druid
DataWorks Summit/Hadoop Summit
 
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/PigHivemall: Scalable machine learning library for Apache Hive/Spark/Pig
Hivemall: Scalable machine learning library for Apache Hive/Spark/Pig
DataWorks Summit/Hadoop Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
DataWorks Summit
 

Viewers also liked (20)

The truth about SQL and Data Warehousing on Hadoop
The truth about SQL and Data Warehousing on HadoopThe truth about SQL and Data Warehousing on Hadoop
The truth about SQL and Data Warehousing on Hadoop
DataWorks Summit/Hadoop Summit
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
DataWorks Summit/Hadoop Summit
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
DataWorks Summit/Hadoop Summit
 
Rebuilding Web Tracking Infrastructure for Scale
Rebuilding Web Tracking Infrastructure for ScaleRebuilding Web Tracking Infrastructure for Scale
Rebuilding Web Tracking Infrastructure for Scale
DataWorks Summit/Hadoop Summit
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
DataWorks Summit/Hadoop Summit
 
Case study of DevOps for Hadoop in Recruit.
Case study of DevOps for Hadoop in Recruit.Case study of DevOps for Hadoop in Recruit.
Case study of DevOps for Hadoop in Recruit.
DataWorks Summit/Hadoop Summit
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
DataWorks Summit/Hadoop Summit
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
DataWorks Summit/Hadoop Summit
 
The real world use of Big Data to change business
The real world use of Big Data to change businessThe real world use of Big Data to change business
The real world use of Big Data to change business
DataWorks Summit/Hadoop Summit
 
Apache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduceApache Hadoop 3.0 What's new in YARN and MapReduce
Apache Hadoop 3.0 What's new in YARN and MapReduce
DataWorks Summit/Hadoop Summit
 
Comparison of Transactional Libraries for HBase
Comparison of Transactional Libraries for HBaseComparison of Transactional Libraries for HBase
Comparison of Transactional Libraries for HBase
DataWorks Summit/Hadoop Summit
 
Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016
Arun Karthick Manoharan
 
Path to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered JourneyPath to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered Journey
DataWorks Summit/Hadoop Summit
 
Protecting Enterprise Data In Apache Hadoop
Protecting Enterprise Data In Apache HadoopProtecting Enterprise Data In Apache Hadoop
Protecting Enterprise Data In Apache Hadoop
DataWorks Summit/Hadoop Summit
 
Streamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache AmbariStreamline Hadoop DevOps with Apache Ambari
Streamline Hadoop DevOps with Apache Ambari
DataWorks Summit/Hadoop Summit
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
DataWorks Summit/Hadoop Summit
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Apache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in NutshellApache NiFi 1.0 in Nutshell
Apache NiFi 1.0 in Nutshell
DataWorks Summit/Hadoop Summit
 
Data science lifecycle with Apache Zeppelin
Data science lifecycle with Apache ZeppelinData science lifecycle with Apache Zeppelin
Data science lifecycle with Apache Zeppelin
DataWorks Summit/Hadoop Summit
 
SEGA : Growth hacking by Spark ML for Mobile games
SEGA : Growth hacking by Spark ML for Mobile gamesSEGA : Growth hacking by Spark ML for Mobile games
SEGA : Growth hacking by Spark ML for Mobile games
DataWorks Summit/Hadoop Summit
 
Using Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch dataUsing Hadoop to build a Data Quality Service for both real-time and batch data
Using Hadoop to build a Data Quality Service for both real-time and batch data
DataWorks Summit/Hadoop Summit
 
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
From a single droplet to a full bottle, our journey to Hadoop at Coca-Cola Ea...
DataWorks Summit/Hadoop Summit
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
DataWorks Summit/Hadoop Summit
 
Security and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache AtlasSecurity and Data Governance using Apache Ranger and Apache Atlas
Security and Data Governance using Apache Ranger and Apache Atlas
DataWorks Summit/Hadoop Summit
 
Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016
Arun Karthick Manoharan
 
Path to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered JourneyPath to 400M Members: LinkedIn’s Data Powered Journey
Path to 400M Members: LinkedIn’s Data Powered Journey
DataWorks Summit/Hadoop Summit
 
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPANNetwork for the Large-scale Hadoop cluster at Yahoo! JAPAN
Network for the Large-scale Hadoop cluster at Yahoo! JAPAN
DataWorks Summit/Hadoop Summit
 
Evolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage SubsystemEvolving HDFS to a Generalized Distributed Storage Subsystem
Evolving HDFS to a Generalized Distributed Storage Subsystem
DataWorks Summit/Hadoop Summit
 
Ad




Apache Eagle - Monitor Hadoop in Real Time

  • 1. Apache Eagle: Monitor Hadoop in Real Time. Hao Chen, http://people.apache.org/~hao
  • 2. About the speaker: Hao Chen / 陈浩. PMC member and committer of Apache Eagle (hao@apache.org); MTS Software Engineer at eBay (hchen9@ebay.com).
  • 4. Apache Eagle is a distributed real-time monitoring and alerting engine for Hadoop from eBay, open sourced as an Apache Incubator project on Oct 26th, 2015. Secure Hadoop in real time: a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks and malicious activity, and block access in real time. See http://eagle.incubator.apache.org or http://github.com/apache/incubator-eagle
  • 5. Initiative: Why build Eagle?
      Eagle was started at the end of 2013 for Hadoop ecosystem monitoring, because existing tools such as Zabbix and Ganglia could not handle the huge volume of metrics and logs generated by the Hadoop systems at eBay.
      Hadoop @ eBay Inc:
        2007: 1-10 nodes
        2009: 50+ nodes
        2010: 100+ nodes, 1000+ cores, 1 PB
        2011: 1000+ nodes, 10,000+ cores, 10+ PB
        2012: 3000+ nodes, 10,000+ cores, 50+ PB
        2014/2016: 10,000 nodes, 150,000+ cores, 200 PB, 2000+ users
      Hadoop Data: security, activity
      Hadoop Platform: health, availability, performance
      Scale: 7+ clusters, 10,000+ nodes, 200+ PB data, 10 B+ events/day, 500+ metric types, 50,000+ jobs/day, 50,000,000+ tasks/day
  • 6. Eagle Use Cases
      • Data Activity Monitoring: secure Hadoop in real time; a data activity monitoring solution to instantly identify access to sensitive data, recognize attacks and malicious activity, and block access in real time.
      • Job Performance Monitoring: Hadoop and Spark job profiling and performance monitoring, cluster health anomaly detection.
      • eBay Unified Monitoring Platform: provide unified monitoring-as-a-service for everything around infrastructure or business.
  • 7. Eagle Data Activity Monitoring
      • Data loss prevention: get alerted and stop a malicious user trying to copy, delete, or move sensitive data from the Hadoop cluster.
      • Malicious logins: detect logins where a malicious user tries to guess a password. Eagle creates user profiles using machine learning algorithms to detect anomalies.
      • Unauthorized access: detect and stop a malicious user trying to access classified data without privilege.
      • Malicious user operation: detect and stop a malicious user trying to delete a large amount of data. Operation type is one parameter of Eagle user profiles; Eagle supports multiple native operation types.
      Diagram labels: User, Privileges, Common Data Sets, Patterns, Commands, Zones, Query, Columns
  • 8. Eagle Data Activity Monitoring. Hadoop data security: detect anomalies in accessing HDFS and Hive
      HDFS policies: access to sensitive files; HDFS commands used (read, write, update…); client host; destination; security zones
      Hive policies: access to tables with PII data; SQL query profiles; client host; security zone
  • 9. Eagle Machine Learning User Profile
      Models: kernel density function (KDE); PCs (principal components) in EVD (eigenvalue decomposition)
      Offline: determine the bandwidth, i.e. the kernel density function parameters, from the training dataset.
      Online: if a test data point lies outside the trained bandwidth, it is an anomaly (policy).
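The offline/online split on this slide can be illustrated with a small sketch. It is not Eagle's implementation: it assumes a single numeric feature, a Gaussian kernel, Silverman's rule of thumb for the bandwidth, and it interprets "outside the trained bandwidth" as "estimated density lower than any density seen on the training data". The class name `KdeProfileSketch` and all of its methods are hypothetical.

```java
/** Hypothetical one-dimensional KDE user profile; an illustration, not Eagle's code. */
final class KdeProfileSketch {
    private final double[] train;
    private final double bandwidth;
    private final double minTrainDensity;

    /** Offline step: fit the bandwidth and the density threshold from training data. */
    KdeProfileSketch(double[] trainingData) {
        if (trainingData.length == 0) throw new IllegalArgumentException("empty training set");
        this.train = trainingData.clone();
        double mean = 0.0;
        for (double v : train) mean += v;
        mean /= train.length;
        double sumSq = 0.0;
        for (double v : train) sumSq += (v - mean) * (v - mean);
        double std = Math.sqrt(sumSq / Math.max(1, train.length - 1));
        // Silverman's rule of thumb (an assumption); the floor avoids a zero bandwidth.
        this.bandwidth = Math.max(1e-9, 1.06 * std * Math.pow(train.length, -0.2));
        double min = Double.POSITIVE_INFINITY;
        for (double v : train) min = Math.min(min, density(v));
        this.minTrainDensity = min;
    }

    /** Gaussian kernel density estimate at x. */
    double density(double x) {
        double sum = 0.0;
        for (double t : train) {
            double u = (x - t) / bandwidth;
            sum += Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
        }
        return sum / (train.length * bandwidth);
    }

    /** Online step: a point is anomalous if it is less likely than every training point. */
    boolean isAnomaly(double x) {
        return density(x) < minTrainDensity;
    }
}
```

For example, a profile trained on daily operation counts around 10 would accept a new count of 10 and flag a count of 100. A production profile would presumably use several features and a tuned threshold rather than the training minimum.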
  • 10. Eagle Job Performance Monitoring
      Use case: detect node anomalies by analyzing the task failure ratio across all nodes.
      Assumption: the task failure ratio should be approximately equal for every node.
      Algorithm: node-by-node comparison (symmetry violation) and per-node trend.
  • 11. Task-failure-based anomaly host detection. Alerting: anomaly detection alerting. Insight: task failure drill-down.
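A minimal sketch of the node-by-node comparison follows, assuming the "symmetry violation" check can be approximated by comparing each node's task failure ratio with the cluster median. The `margin` parameter, the class name, and the input layout are illustrative assumptions; the per-node trend part of the algorithm is not covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical symmetry check on task failure ratios; not Eagle's implementation. */
final class NodeFailureRatioSketch {
    /**
     * taskCounts maps a node name to {failedTasks, totalTasks}.
     * Returns, in name order, the nodes whose failure ratio exceeds the
     * cluster median by more than margin.
     */
    static List<String> anomalousNodes(Map<String, long[]> taskCounts, double margin) {
        Map<String, Double> ratios = new TreeMap<>();
        for (Map.Entry<String, long[]> e : taskCounts.entrySet()) {
            long failed = e.getValue()[0];
            long total = e.getValue()[1];
            if (total > 0) ratios.put(e.getKey(), (double) failed / total); // skip idle nodes
        }
        List<String> anomalies = new ArrayList<>();
        if (ratios.isEmpty()) return anomalies;

        double[] sorted = ratios.values().stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int mid = sorted.length / 2;
        double median = sorted.length % 2 == 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;

        for (Map.Entry<String, Double> e : ratios.entrySet()) {
            if (e.getValue() - median > margin) anomalies.add(e.getKey());
        }
        return anomalies;
    }
}
```

The median is used instead of the mean so that one badly failing node does not shift the baseline it is compared against.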
  • 12. Hadoop Job Data Skew Detection
      Use case: detect data skew from statistics and distributions of attempt execution durations and counters.
      Assumption: durations and counters should follow a normal distribution.
      Counters & features: mapDuration, reduceDuration, mapInputRecords, reduceInputRecords, combineInputRecords, mapSpilledRecords, reduceShuffleRecords, mapLocalFileBytesRead, reduceLocalFileBytesRead, mapHDFSBytesRead, reduceHDFSBytesRead
      Modeling & statistics: avg, min, max, distributions, max z-score, top-N, correlation
      Threshold & detection: counters correlation > 0.9 and Max(Z-Score) > 90%
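The two statistics named in the detection rule can be sketched as follows. This is an assumption-laden illustration: the slide does not say which counters are correlated or how "Max(Z-Score) > 90%" is evaluated, so the sketch correlates attempt durations with input record counts and takes the z-score threshold as a caller-supplied parameter. `DataSkewSketch` and its method names are hypothetical.

```java
/** Hypothetical data-skew check over per-attempt durations and one counter. */
final class DataSkewSketch {
    static double mean(double[] xs) {
        double s = 0.0;
        for (double x : xs) s += x;
        return xs.length == 0 ? 0.0 : s / xs.length;
    }

    /** Population standard deviation. */
    static double std(double[] xs) {
        if (xs.length == 0) return 0.0;
        double m = mean(xs), s = 0.0;
        for (double x : xs) s += (x - m) * (x - m);
        return Math.sqrt(s / xs.length);
    }

    /** Largest z-score in the sample; 0 when all values are equal. */
    static double maxZScore(double[] xs) {
        double m = mean(xs), sd = std(xs), max = 0.0;
        if (sd == 0.0) return 0.0;
        for (double x : xs) max = Math.max(max, (x - m) / sd);
        return max;
    }

    /** Pearson correlation; 0 when either series is constant. */
    static double correlation(double[] xs, double[] ys) {
        if (xs.length != ys.length) throw new IllegalArgumentException("length mismatch");
        double mx = mean(xs), my = mean(ys), sx = std(xs), sy = std(ys);
        if (xs.length == 0 || sx == 0.0 || sy == 0.0) return 0.0;
        double cov = 0.0;
        for (int i = 0; i < xs.length; i++) cov += (xs[i] - mx) * (ys[i] - my);
        return cov / xs.length / (sx * sy);
    }

    /** Skew is reported when a slow outlier exists and duration tracks input size. */
    static boolean isSkewed(double[] durations, double[] inputRecords,
                            double corrThreshold, double zThreshold) {
        return correlation(durations, inputRecords) > corrThreshold
                && maxZScore(durations) > zThreshold;
    }
}
```

The intuition is that one attempt being much slower than the rest is only attributed to data skew when the slowness is explained by that attempt receiving far more input.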
  • 14. Eagle @ eBay Inc.
      Scale: 7+ clusters, 10,000+ nodes, 200+ PB data, 10 B+ events/day, 500+ metric types, 50,000+ jobs/day, 50,000,000+ tasks/day
      Eagle deployment in eBay production: 100+ security policies, 8 nodes, 30 worker processes, 64 Kafka partitions
      Eagle performance: average latency ~50 ms; max throughput per cluster 300 k/s
  • 15. Technical Challenges
      1. Large-scale processing and storage: scale IO complexity (stream); scale computation complexity (policy); scale storage and query (event, log, metric)
      2. Real-time alerting: real-time data collection; real-time stream processing; real-time alerting
      3. Expressive correlation model: complex policy model; stream group-by, join, window; machine learning
      4. Hadoop ecosystem integration: data source integration; anomaly detection model
  • 17. Distributed Policy Engine
      Diagram components: Metadata Manager (policy management, dynamic policy deployment, dynamic stream schema); real-time event stream (Stream_{1} … Stream_{*}); stream processing; AlertExecutor_{1}, AlertExecutor_{2}, …, AlertExecutor_{N}; real-time alerts
      • Real-time streaming: Apache Storm (execution engine) + Kafka (message bus)
      • Declarative policy: SQL (CEP) on streaming + hot deploy
      • Linear scalability: data volume scale + computation scale
      • Metadata-driven: schema management and dynamic policy lifecycle
  • 18. Distributed Policy Engine (same diagram and key points as slide 17), with an example policy:
      from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;
  • 19. Declarative Policy. Distributed SQL on streaming: Siddhi CEP + Storm by default
      • Filter
      • Join
      • Aggregation: avg, sum, min, max, etc.
      • Group by
      • Having
      • Stream handlers for windows: time window, batch window, length window
      • Conditions and expressions: and, or, not, ==, !=, >=, >, <=, <, and arithmetic operations
      • Pattern processing
      • Sequence processing
      • Event tables: integrate historical data in real-time processing
      • SQL-like query: query, stream definition and query plan compilation
      Example: from MetricStream[(name == 'ReplLag') and (value > 1000)] select * insert into outputStream;
  • 20. Declarative Policy - Examples. Example 1: Alert if Hadoop NameNode capacity usage exceeds 90 percent: from hadoopJmxMetricEventStream[metric == "hadoop.namenode.fsnamesystemstate.capacityused" and value > 0.9] select metric, host, value, timestamp, component, site insert into alertStream; Example 2: Alert if Hadoop NameNode HA switches: from every a = hadoopJmxMetricEventStream[metric=="hadoop.namenode.fsnamesystem.hastate"] -> b = hadoopJmxMetricEventStream[metric==a.metric and b.host == a.host and a.value != value] within 10 min select a.host, a.value as oldHaState, b.value as newHaState, b.timestamp as timestamp, b.metric as metric, b.component as component, b.site as site insert into alertStream;
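The pattern in Example 2 can be approximated in plain Java as a per-host state comparison. The sketch below is a simplification under stated assumptions: it remembers only the most recent HA state per host and emits one alert per transition, whereas a CEP `every a -> b` pattern may match several earlier `a` events against the same `b`. The class and method names are hypothetical and are not part of Eagle or Siddhi.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for the "HA switch within 10 min" pattern (assumption). */
final class HaSwitchDetectorSketch {
    static final long WINDOW_MILLIS = 10L * 60L * 1000L; // "within 10 min"

    /** Last observed hastate value and its timestamp for one host. */
    private static final class Observation {
        final double value;
        final long timestampMillis;

        Observation(double value, long timestampMillis) {
            this.value = value;
            this.timestampMillis = timestampMillis;
        }
    }

    private final Map<String, Observation> lastByHost = new HashMap<>();
    final List<String> alerts = new ArrayList<>();

    /** Feed one hastate metric event; events are assumed to arrive in time order. */
    void onEvent(String host, double value, long timestampMillis) {
        Observation a = lastByHost.get(host);
        boolean switched = a != null
                && a.value != value                                        // a.value != b.value
                && timestampMillis - a.timestampMillis <= WINDOW_MILLIS;   // within 10 min
        if (switched) {
            alerts.add(host + ": " + a.value + " -> " + value);
        }
        lastByHost.put(host, new Observation(value, timestampMillis));
    }
}
```

Keying the state by host mirrors the `b.host == a.host` condition, so a state change on one NameNode never triggers an alert for another.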
  • 21. Distributed Policy Engine - Scalability. Linear Scalability Principle: dynamic policy partition by {event} * {policy} • N users with 3 partitions and M policies with 2 partitions yield 3*2 = 6 physical tasks • Physical partition + policy-level partition [Diagram labels: Distributed Streaming Cluster Environment; Stream_{1} … Stream_{*}; Stream Processing; AlertExecutor_{1}, AlertExecutor_{2}, …, AlertExecutor_{N}]
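The {event} * {policy} partitioning rule can be sketched as a small routing function. This is an illustration under assumptions, not Eagle's scheduler: the `PolicyPartitionSketch` class, the use of `String.hashCode()` as the partitioner, and the row-major task numbering are all hypothetical choices.

```java
/** Hypothetical routing of (user, policy) pairs onto physical tasks. */
final class PolicyPartitionSketch {
    /** Maps a key to one of `partitions` buckets; floorMod keeps the result non-negative. */
    static int partitionOf(String key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Number of physical tasks: stream partitions times policy partitions. */
    static int physicalTasks(int streamPartitions, int policyPartitions) {
        return streamPartitions * policyPartitions;
    }

    /** Index of the physical task that evaluates `policyId` for events of `user`. */
    static int taskIndex(String user, String policyId,
                         int streamPartitions, int policyPartitions) {
        int streamPartition = partitionOf(user, streamPartitions);
        int policyPartition = partitionOf(policyId, policyPartitions);
        return streamPartition * policyPartitions + policyPartition;
    }
}
```

With 3 stream partitions and 2 policy partitions, `physicalTasks(3, 2)` gives the 6 tasks from the slide, and adding partitions on either axis grows the task count multiplicatively, which is the basis of the linear-scalability claim.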
  • 22. Distributed Policy Engine – Optimization. Stream Partition Problem (https://en.wikipedia.org/wiki/Partition_problem). Stream Partition Skew (15:1). Weights of Executors, By Partition User: Random: 0.0484, 0.152, 0.3535, 0.105, 0.203, 0.072, 0.042, 0.024 | Greedy: 0.0837, 0.0837, 0.0837, 0.0837, 0.0737, 0.0637, 0.0437, 0.0837
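The slide does not say which greedy algorithm Eagle uses, so the following is only a sketch of one common heuristic for the partition problem: sort the keys by weight, then give each key (heaviest first) to the currently lightest executor. The `GreedyPartitionSketch` class and its methods are hypothetical names introduced for illustration.

```java
import java.util.Arrays;

/** Heaviest-first greedy balancing of weighted keys across executors (assumed heuristic). */
final class GreedyPartitionSketch {
    /** Returns the total weight assigned to each executor. */
    static double[] balance(double[] keyWeights, int executors) {
        double[] sorted = keyWeights.clone();
        Arrays.sort(sorted); // ascending, so walk from the end for heaviest-first
        double[] load = new double[executors];
        for (int i = sorted.length - 1; i >= 0; i--) {
            int lightest = 0;
            for (int e = 1; e < executors; e++) {
                if (load[e] < load[lightest]) {
                    lightest = e;
                }
            }
            load[lightest] += sorted[i];
        }
        return load;
    }

    /** Ratio of the heaviest to the lightest executor load, e.g. 15.0 for a 15:1 skew. */
    static double skew(double[] load) {
        double max = Arrays.stream(load).max().orElse(0.0);
        double min = Arrays.stream(load).min().orElse(0.0);
        return max / min;
    }
}
```

Comparing `skew` before and after balancing gives a single number for the kind of improvement the Random and Greedy rows on the slide illustrate.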
  • 23. Distributed Policy Engine – Extensibility. Policy Engine Extensibility: Extensible Policy Evaluator • Supports WSO2 Siddhi CEP as a first-class policy evaluator • Extensible Policy Engine Implementation • Extensible Policy Lifecycle Management • Metadata-based Module Management public interface PolicyEvaluatorServiceProvider { public String getPolicyType(); // literal string to identify one type of policy public Class getPolicyEvaluator(); // get policy evaluator implementation public List getBindingModules(); // policy text with json format to object mapping } [Diagram labels: Distributed Real-time Policy Engine; Siddhi CEP Policy Evaluator; Machine Learning Policy Evaluator; Metadata Manager; Policy/Metadata]
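One way to picture how a provider interface like the one on this slide enables pluggable evaluators is a registry keyed by policy type. The sketch below is an assumption for illustration only: `EvaluatorProviderSketch` is a simplified hypothetical interface (it omits `getBindingModules`), and `EvaluatorRegistrySketch` is not an Eagle class; how Eagle actually discovers providers is not shown in the slide.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified, hypothetical counterpart of the provider interface on the slide. */
interface EvaluatorProviderSketch {
    String getPolicyType();        // literal string identifying one type of policy
    Class<?> getPolicyEvaluator(); // evaluator implementation class for that type
}

/** Hypothetical registry that resolves a policy's type to its evaluator class. */
final class EvaluatorRegistrySketch {
    private final Map<String, EvaluatorProviderSketch> providers = new HashMap<>();

    void register(EvaluatorProviderSketch provider) {
        providers.put(provider.getPolicyType(), provider);
    }

    Class<?> evaluatorFor(String policyType) {
        EvaluatorProviderSketch provider = providers.get(policyType);
        if (provider == null) {
            throw new IllegalArgumentException(
                    "No evaluator registered for policy type: " + policyType);
        }
        return provider.getPolicyEvaluator();
    }
}
```

Under this design a new evaluator, for example a machine-learning one, only needs to register a provider with its own policy type string; the engine code that looks up evaluators does not change.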
  • 24. Stream Processing Framework: 1. Development 2. Optimization 3. Compile to native app. Use the Eagle alert framework as a library. [Diagram label: Optimizer]
  • 25. Large Scale Storage and Query • Light-weight ORM Framework for HBase/RDBMS • Full-function SQL-Like REST Query • Optimized rowkey design for time-series data • Native HBase Coprocessor • Secondary Index Support Entity definition example (excerpt): @Table("alertdef") @ColumnFamily("f") @Prefix("alertdef") @Service(AlertConstants.ALERT_DEFINITION_SERVICE_ENDPOINT_NAME) @JsonIgnoreProperties(ignoreUnknown = true) @TimeSeries(false) @Tags({"site", "dataSource", "alertExecutorId", "policyId", "policyType"}) @Indexes({ @Index(name="Index_1_alertExecutorId", columns = { "alertExecutorID" }, unique = true), }) public class AlertDefinitionAPIEntity extends TaggedLogAPIEntity{ @Column("a") private String desc; @Column("b") private String policyDef; @Column("c") private String dedupeDef; Query example: Query=AlertDefinitionService[@dataSource="hiveQueryLog"]{@policyDef}
  • 26. Large Scale Storage and Query. Uniform HBase rowkey design: Rowkey ::= Prefix | Partition Keys | timestamp | tagName | tagValue | … • Metric: Rowkey ::= Metric Name | Partition Keys | timestamp | tagName | tagValue | … • Entity: Rowkey ::= Default Prefix | Partition Keys | timestamp | tagName | tagValue | … • Log: Rowkey ::= Log Type | Partition Keys | timestamp | tagName | tagValue | …, Rowvalue ::= Log Content. See also http://opentsdb.net
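The rowkey grammar above can be illustrated with a small byte-layout sketch. The field order follows the slide (`Prefix | Partition Keys | timestamp | tagName | tagValue | …`), but everything else is an assumption made for illustration: encoding each string as its 4-byte `hashCode()`, writing the timestamp as an 8-byte long, and sorting tags by name so the same tag set always yields the same key. Eagle's actual encoding may differ.

```java
import java.nio.ByteBuffer;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Hypothetical fixed-width encoding of the uniform rowkey layout. */
final class RowkeySketch {
    static byte[] build(String prefix, List<String> partitionKeys,
                        long timestamp, Map<String, String> tags) {
        // Sort tags by name so insertion order does not change the resulting key.
        SortedMap<String, String> sortedTags = new TreeMap<>(tags);
        int size = 4 + 4 * partitionKeys.size() + 8 + 8 * sortedTags.size();
        ByteBuffer buffer = ByteBuffer.allocate(size);

        buffer.putInt(prefix.hashCode());               // Prefix (metric name, log type, ...)
        for (String key : partitionKeys) {
            buffer.putInt(key.hashCode());              // Partition Keys
        }
        buffer.putLong(timestamp);                      // timestamp
        for (Map.Entry<String, String> tag : sortedTags.entrySet()) {
            buffer.putInt(tag.getKey().hashCode());     // tagName
            buffer.putInt(tag.getValue().hashCode());   // tagValue
        }
        return buffer.array();
    }
}
```

Placing the prefix and partition keys ahead of the timestamp keeps rows for one metric or log type contiguous, so a time-range query becomes a bounded scan within that prefix.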
  • 27. Multi-Tenancy – Topology Scheduler • Dynamic Topology Management • No-downtime Topology Maintenance • Topology High Availability & Balance • Resource Scheduling & Isolation: Runtime, Worker, Topology or Cluster
  • 28. Multi-Tenancy - Dynamic Correlation • Dynamic Correlation at Runtime: Sort, GroupBy, Join, Window; Hot Deploy Logic; Policy Management • Multi-Correlation on a Single Stream: Group by different fields of the same stream; Re-sort the same stream in a different order; Join a given stream in different ways • Multi-Correlation on Multiple Streams: Cross-Stream Join; Real-time & Historical Stream Join
  • 30. Eagle Ecosystem: (1) Eagle Framework: Distributed real-time framework for efficiently developing highly scalable monitoring applications (2) Eagle Apps: Security / Hadoop / Operational Intelligence / … (3) Eagle Interface: REST Service / Management UI / Customizable Analytics Visualization (4) Eagle Integration: Ambari / Docker / Ranger / Dataguise (5) Open Source: Community-driven and cross-community cooperation [Diagram labels: Apps → Security → Hadoop → Cloud → Database; Interface → Web Portal → REST Services → Analytics Visualization; Integration → Ambari → Docker → Ranger → Dataguise; Eagle Framework]
  • 31. Learn More about Apache Eagle. Community • Website: http://eagle.incubator.apache.org • GitHub: http://github.com/apache/incubator-eagle • Mailing list: [email protected] Resources • Documentation: http://eagle.incubator.apache.org/docs/ • Docker images: https://hub.docker.com/r/apacheeagle/sandbox/ Publications & Patents • EAGLE: USER PROFILE-BASED ANOMALY DETECTION IN HADOOP CLUSTER (IEEE) • EAGLE: DISTRIBUTED REALTIME MONITORING FRAMEWORK FOR HADOOP CLUSTER
  • 32. Thank you! http://eagle.incubator.apache.org These slides are licensed under the Creative Commons Attribution 4.0 International license.