Ambari metrics system - Apache ambari meetup (DataWorks Summit 2017)

Ambari Metrics System
Apache Ambari Meetup @ DataWorks Summit 2017

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
Introduction
New Features
Horizontal Scalability
Future Work

Introduction & Architecture
• Metrics Collector – API daemon
• Sinks – Service daemons configured to publish metrics
• Metric Monitors – Lightweight daemon for system metrics
• Managed HBase (Embedded / Distributed)
• Phoenix schema designed for fast reads
• Grafana Integration (Ambari 2.2.2)
High-level component architecture (diagram): Ambari, Grafana, Collector API, Phoenix, Metrics Collector, Sinks (HDP services), Monitors (system)

Feature Highlights (AMS)
• Simple POST API
• Rich GET API
• Aggregation and downsampling
• Metadata API
• Highly tunable
• Abstract Sink implementation

Metric Sinks

Contributors (Karma)
• Sid Wagle (Hortonworks)
• Aravindan Vijayan (Hortonworks)
• Dmitry Sen (Hortonworks)
• Prajwal Rao (ITRenew)
• Myroslav Papyrkovskyy (Hortonworks)
• Yusaku Sako (Hortonworks)
• Qin Liu (IBM)
• Tim Thorpe (IBM)
• Jungtaek Lim (Hortonworks)
• Jameel Naina (Microsoft)
• Masahiro Tanaka (NTT DATA)

New Features

Features (2.2.2 -> 2.5.x)
Top N (Request)
• Ability to query for Top N series from a set of series
• Top N ‘Hosts’ vs Top N ‘Metrics’
• Multiple Top N functions supported (max, avg, sum)
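The Top N request described above can be illustrated with a small sketch. This is not the AMS implementation or its query syntax; the function name `top_n`, the `RANKERS` table and the sample data are hypothetical, and only the three ranking functions (max, avg, sum) come from the slide.

```python
from statistics import mean

# Ranking functions named on the slide; the dictionary itself is illustrative.
RANKERS = {"max": max, "avg": mean, "sum": sum}


def top_n(series, n, function="max"):
    """Return the n series that rank highest under the chosen function."""
    rank = RANKERS[function]
    ranked = sorted(series.items(), key=lambda kv: rank(kv[1]), reverse=True)
    return dict(ranked[:n])


# One metric reported by three hosts (the Top N 'Hosts' case).
cpu_by_host = {
    "host1": [10.0, 20.0, 30.0],
    "host2": [50.0, 5.0, 2.0],
    "host3": [25.0, 25.0, 25.0],
}

top_by_max = top_n(cpu_by_host, 2, "max")
top_by_avg = top_n(cpu_by_host, 2, "avg")
```

The same data ranks differently depending on the function: `host2` leads on max because of a single spike, while `host3` leads on avg.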
Top N (Source)
• Ability to run a custom downsampling function on a set of metrics.
• Use case: HDFS top users, operations
Series Aggregation
• Ability to aggregate a set of series on the GET path.
• Helps with ad-hoc aggregation on the fly.
• Use case: Aggregate across Storm topologies or Kafka topics
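A minimal sketch of what aggregating a set of series on the read path could look like, assuming each series is a plain `{timestamp: value}` mapping. The function name `aggregate_series` and the sample Kafka-topic data are hypothetical and do not reflect the AMS GET API.

```python
from collections import defaultdict


def aggregate_series(series_list, function=sum):
    """Combine several {timestamp: value} series into one value per timestamp."""
    buckets = defaultdict(list)
    for series in series_list:
        for ts, value in series.items():
            buckets[ts].append(value)
    # Timestamps missing from a series simply contribute no value.
    return {ts: function(values) for ts, values in sorted(buckets.items())}


# Hypothetical per-topic message rates, keyed by timestamp.
topic_a = {1000: 5.0, 2000: 7.0}
topic_b = {1000: 3.0, 2000: 1.0, 3000: 4.0}

total = aggregate_series([topic_a, topic_b])
peak = aggregate_series([topic_a, topic_b], max)
```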

Features (2.2.2 -> 2.5.x) continued
HTTPS Support
• HTTPS support for AMS and Grafana
• SSLv23 and TLSv1 support
Metric Based Alerts
• Service alerts based on metrics from AMS.
• Example: Deviation in Daily/Weekly Namenode RPC queue latency
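One way to picture a deviation-based alert is to compare the current window of a metric with the same window a day or a week earlier. The sketch below is an assumption about how such a check might be written, not Ambari's alert logic; the function names, the 20% and 50% thresholds and the sample latencies are all hypothetical.

```python
from statistics import mean


def deviation_percent(current, baseline):
    """Percentage change of the current window's mean versus the baseline's mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; deviation is undefined")
    return (mean(current) - base) / base * 100.0


def alert_state(current, baseline, warn=20.0, crit=50.0):
    """Map the absolute deviation onto OK / WARNING / CRITICAL (thresholds assumed)."""
    deviation = abs(deviation_percent(current, baseline))
    if deviation >= crit:
        return "CRITICAL"
    if deviation >= warn:
        return "WARNING"
    return "OK"


# Hypothetical RPC queue latencies (ms) for the same hour, yesterday and today.
yesterday = [10.0, 12.0, 11.0, 11.0]
today = [15.0, 14.0, 13.0, 13.0]

state = alert_state(today, yesterday)
```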
Multi Cluster Support
• Multi-cluster support in AMS and Grafana.
• Blueprint defined.

Horizontal Scalability (Ambari 2.5.0)

Operational Statistics (charts: Disk Usage, Write Load)

Horizontal Scalability (AMBARI-15901)
Motivation and Operational Requirements
• Horizontal scalability proportional to cluster size
• High availability for metrics service
• Restart resiliency for metrics service discovery
• Distribute heavy-weight operations
• Automatic failover
• Usability – Easy addition of extra collectors
• Distributed mode – HBase writing to HDFS
• Cluster Zookeeper – HBase and Collector dependency
• First release still requires a restart to get new sink bits

Horizontal Scalability (AMBARI-15901)
Architectural Requirements
• Distributed lock problem
– Leader election
– Ephemeral state for discovery
– Partition tolerance
• Service discovery
• Persistent distributed storage – aggregator checkpoints

Why Helix vs Curator
Limitations of using Curator
Curator recipes applicable
• Leader election
• Shared lock
• Service discovery
Soft drawbacks
• Every solution looks different (HMaster, Kafka, etc.)
• Need to build abstractions for FSM and resources
• Lack of primitives for cluster management
• Possibly needs a couple of application versions to achieve stability

Apache Helix – High level
• Controller – Coordinates transitions and tries to maintain the Ideal State
• Participant – Process hosting distributed resources
• Spectator – Observer / Router

Under the hood
Helix architecture
• The Controller models a state machine based on the various partitions in the cluster.
• Uses ZK to maintain cluster state and as a notification system.
• States
– IdealState: The desired state of the cluster when all nodes are up and running
– CurrentState: Actual current state of each node in the cluster
– ExternalView: The combined view of the CurrentState of all nodes
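The relationship between the three states can be sketched as a toy model. This does not use the Helix API; the function names, the ONLINE / OFFLINE state strings and the node and partition names are hypothetical, and a missing partition is assumed to be OFFLINE.

```python
def external_view(current_states):
    """Merge per-node CurrentState maps into one partition -> {node: state} view."""
    view = {}
    for node, partitions in current_states.items():
        for partition, state in partitions.items():
            view.setdefault(partition, {})[node] = state
    return view


def pending_transitions(ideal_state, current_states):
    """List (partition, node, from_state, to_state) steps needed to reach the ideal state."""
    view = external_view(current_states)
    transitions = []
    for partition, wanted in sorted(ideal_state.items()):
        for node, target in sorted(wanted.items()):
            # Assumption: a partition a node does not report is treated as OFFLINE.
            actual = view.get(partition, {}).get(node, "OFFLINE")
            if actual != target:
                transitions.append((partition, node, actual, target))
    return transitions


# IdealState: partition -> {node: desired state}.
ideal = {"p0": {"node1": "ONLINE"}, "p1": {"node2": "ONLINE"}}
# CurrentState: node -> {partition: actual state}.
current = {"node1": {"p0": "ONLINE"}, "node2": {"p1": "OFFLINE"}}

view = external_view(current)
todo = pending_transitions(ideal, current)
```

Here the controller's job reduces to driving `todo` to an empty list, which is the "try to maintain Ideal State" behaviour in miniature.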

AMS – Helix usage – Primitives
(Diagram) Columns: Instance, Resource, Partitions / Replicas
• Resource: Aggregators
• Partitions: Host, Cluster
• State Model: Online, Offline

Distributed Writes – Sinks
(Flowchart)
• Bootstrapped with an initial set of collectors
• Find list of live collectors from configured collectors
• If unable to find a live collector, ask Zookeeper for the list of live collectors
• Choose a collector based on hostname, with expiry
• Collector supplier is ok: push metrics to collector (DONE)
• Expire collector supplier: on Timer, or when Failed counter > threshold
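The sink-side flow can be approximated by the sketch below. It is a simplified model under stated assumptions, not the AMS sink code: the class name `CollectorSupplier`, its parameters, the SHA-256 based hostname hash and the default TTL and threshold are all hypothetical. The liveness check and the Zookeeper lookup are passed in as plain callables so the example stays self-contained.

```python
import hashlib
import time


class CollectorSupplier:
    """Caches the chosen collector until a timer or repeated failures expire it."""

    def __init__(self, configured, is_live, ask_zookeeper, hostname,
                 ttl_seconds=300.0, failure_threshold=3, clock=time.monotonic):
        self.configured = list(configured)   # initial bootstrap list
        self.is_live = is_live               # callable: collector -> bool
        self.ask_zookeeper = ask_zookeeper   # callable: () -> list of live collectors
        self.hostname = hostname
        self.ttl = ttl_seconds
        self.threshold = failure_threshold
        self.clock = clock
        self._cached = None
        self._expires_at = 0.0
        self._failures = 0

    def _find_live(self):
        live = [c for c in self.configured if self.is_live(c)]
        if not live:
            # Fall back to Zookeeper only when no configured collector is live.
            live = list(self.ask_zookeeper())
        return sorted(live)

    def _choose(self):
        live = self._find_live()
        if not live:
            return None
        # Hostname-based sharding: the same host maps to the same live collector.
        digest = hashlib.sha256(self.hostname.encode("utf-8")).hexdigest()
        return live[int(digest, 16) % len(live)]

    def get(self):
        now = self.clock()
        if self._cached is None or now >= self._expires_at:
            self._cached = self._choose()
            self._expires_at = now + self.ttl
            self._failures = 0
        return self._cached

    def report_failure(self):
        self._failures += 1
        if self._failures > self.threshold:
            self._cached = None  # expire the supplier; next get() re-selects


live_collectors = {"c1"}
supplier = CollectorSupplier(
    ["c1", "c2"], lambda c: c in live_collectors, lambda: ["c3"],
    hostname="host-a", failure_threshold=1,
)
first_choice = supplier.get()
```

Sorting the live list before hashing keeps the choice stable across sinks that discover the same collectors in a different order.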

Distributed Writes – Monitors
• Hostname-based sharding strategy similar to sinks
• Initial configured collector list
• When a collector is inferred to be down, it is blacklisted.
• Needs a restart if no known collector is live
• No ZK fallback.
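The monitor-side behaviour differs from the sinks mainly in the blacklist and the missing Zookeeper fallback. A minimal sketch, assuming a CRC32 hash of the hostname for sharding; the class name `MonitorCollectorPicker` and its methods are hypothetical and not taken from the monitor code.

```python
import zlib


class MonitorCollectorPicker:
    """Picks a collector from a fixed list, skipping blacklisted ones."""

    def __init__(self, configured, hostname):
        self.configured = sorted(configured)
        self.hostname = hostname
        self.blacklist = set()

    def pick(self):
        candidates = [c for c in self.configured if c not in self.blacklist]
        if not candidates:
            # No ZK fallback: the monitor cannot discover new collectors by itself.
            raise RuntimeError("no known collector is live; restart the monitor")
        index = zlib.crc32(self.hostname.encode("utf-8")) % len(candidates)
        return candidates[index]

    def mark_down(self, collector):
        """Blacklist a collector that was inferred to be down."""
        self.blacklist.add(collector)


picker = MonitorCollectorPicker(["c1", "c2"], hostname="host-a")
first = picker.pick()
picker.mark_down(first)
second = picker.pick()
```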

Reads
Ambari Server
• Ambari Server handles multiple collectors seamlessly.
• Event-based notification whenever a collector host down situation is sensed.
– Using AmbariEvent framework
Grafana
• Configured with 1 collector during startup.
• If that collector goes down,
– Manually update the datasource to another collector
– (or) Restart Grafana.

AMS Multiple Collectors Architecture (diagram)
• Cluster Zookeeper
• Metrics Monitor
• Metrics Sinks: YARN, Kafka, Flume, HBase, Storm, Hive, NiFi, HDFS
• Metrics Collector (shown twice), each with HBase (Master + RS), Phoenix, Aggregators, Collector API and a Helix Participant

Future Work (Ambari 3.0.0)
• Major revamp in the storage and access layer in AMS.
• Moving to a UUID-based row key instead of the current long and redundant key. (AMBARI-20773)
• Aggregation V2 to tackle scale issues in AMS
– Time aggregation (downsampling) handed off to the individual monitors (AMBARI-20758)
– Cluster aggregation done online only for metrics which need it.
• Tee to external storage by providing a pluggable sink interface
• Metric-based anomaly detection using statistical and machine learning techniques (AMBARI-21105)
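Time aggregation (downsampling) as mentioned above can be sketched as bucketing raw points into fixed windows. This only illustrates the general technique and is not the design tracked in AMBARI-20758; the function name `downsample`, the one-minute window and the sample points are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, window_ms, function=mean):
    """Aggregate {timestamp_ms: value} into one value per fixed-size time window."""
    buckets = defaultdict(list)
    for ts, value in points.items():
        # Align each point to the start of its window.
        buckets[ts - ts % window_ms].append(value)
    return {start: function(values) for start, values in sorted(buckets.items())}


# Hypothetical raw points collected every 10 seconds.
raw = {0: 1.0, 10_000: 3.0, 20_000: 5.0, 60_000: 10.0, 70_000: 20.0}
per_minute = downsample(raw, 60_000)
```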

Thank You
Last Meetup Slides :
https://www.slideshare.net/prajrao/apache-ambari-meetup-ams-grafana
