Monitoring Node.js Microservices on CloudFoundry with
Open Source Tools and a Shoestring Budget
Tony Erwin, aerwin@us.ibm.com
Agenda
• Introduction to Bluemix UI & Architecture
• Importance of Monitoring w/ Microservices
• Overview of Monitoring Architecture
• Using Monitoring Data
• Building Your Own Monitoring System
• Synthetic Measurements
Bluemix UI
• Front-end to IBM’s open cloud Bluemix offering
• Lets users view and manage CF resources, containers, virtual servers, user accounts, billing/usage, etc.
• Runs on top of the Bluemix PaaS layer (Cloud Foundry)
[Screenshots: Dashboard, Catalog, Resource Details, and more]
Bluemix UI Architecture
• Migrated from a monolithic to a microservice architecture over the last couple of years
• Composed of 25+ Node.js apps deployed to Cloud Foundry
• See talk from earlier this week for more details
  – To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservices on CloudFoundry
[Architecture diagram. Labels: Bluemix UI (Client); Proxy; Home, Catalog, …, Dashboard, Pricing, Orgs/Spaces; Common Monitoring Framework; Session Store; NoSQL DB; Backend APIs (CF, Containers, VMs, BSS, MCCP, etc.); Bluemix PaaS; Cloud Foundry]
Importance of Monitoring
• Root cause analysis when a problem occurs
  – Bluemix UI is the most visible part of the platform and acts as a “canary in the mine shaft” for the whole platform
  – When a critical event or outage occurs, it often starts with reports like:
    • “Can’t login to console”
    • “Console doesn’t work…”
    • “Console is slow…”
  – When this happens in the middle of the night, my team is regularly the first to get a PagerDuty alert
    • Being able to quickly find root cause is a matter of self-preservation
  – Console behavior is often (but not always!) a symptom of something going on elsewhere (like CF having problems, networking being down, etc.)
• Auto-detection of problems
  – Ideally, we want to find and fix problems before a user hits them
  – Example: Send a PagerDuty alert when error rates for a given API go above a threshold
• Tracking against performance and quality targets
  – Can’t meet goals for something you can’t measure over time
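The auto-detection example above (alert when an API's error rate goes above a threshold) might be sketched as follows. This is only an illustration under stated assumptions, not the Bluemix UI team's code: the function names, the record shape `{ url, status, timestamp }`, the rule that only 5xx responses count as errors, and the default window, threshold, and minimum sample size are all hypothetical.

```javascript
// Hypothetical sketch: decide whether to raise an alert from recent request data.
// Each request record is assumed to look like { url, status, timestamp }.

// Fraction of requests that failed; treating only 5xx as errors is an assumption.
function errorRate(requests) {
  if (requests.length === 0) return 0;
  const errors = requests.filter((r) => r.status >= 500).length;
  return errors / requests.length;
}

// windowMs, threshold, and minSamples are illustrative defaults, not values from the talk.
function shouldAlert(
  requests,
  now,
  { windowMs = 5 * 60 * 1000, threshold = 0.05, minSamples = 20 } = {}
) {
  // Only look at requests inside the sliding window ending at `now`.
  const recent = requests.filter((r) => now - r.timestamp <= windowMs);
  // With very little traffic a single failure would dominate the rate, so skip.
  if (recent.length < minSamples) return false;
  return errorRate(recent) > threshold;
}
```

The minimum-sample guard is a design choice worth considering for any threshold alert: without it, one failed request at 3 a.m. on a quiet API could page someone.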
What to Monitor?
• Metrics we were especially interested in:
  – Data for every inbound/outbound request for every microservice
    • Response time
    • HTTP response code
    • Etc.
  – Memory usage, CPU usage, uptime, and crashes for every instance of every microservice
  – General health of our own services and their dependencies
Monitoring Architecture
[Architecture diagram. Labels: Bluemix UI (Client); Proxy; App 1 … App N; Cloud Foundry; Backend APIs (CF, Containers, VMs, BSS, MCCP, etc.); MQTT; Monitor Storage; Monitor Alerts; Space Scanner; InfluxDB; PagerDuty, Slack, etc.]
Monitoring Components
• Each microservice is bound to an MQTT service (which happens to be provided by the IBM Internet of Things service)
• Each microservice adds middleware (a private npm module) that publishes inbound/outbound request data to MQTT in a “fire and forget” manner
  – Also supports a general “publish” function to send arbitrary metrics to MQTT (e.g., overall system health, number of times we retrieve JSON from the Redis cache instead of the API, etc.)
• Storage microservice:
  – Subscribes to the same queue, does some massaging of the data (such as tagging with URL “category”), and writes to InfluxDB
• Alerts microservice:
  – Subscribes to the same queue, aggregates the inputs over the last X minutes, and sends alerts (to Slack, PagerDuty, etc.)
• Scanner microservice:
  – Calls CF APIs every 60 seconds to get data for each app instance on memory usage, CPU usage, uptime, and crashes
  – Publishes the data to MQTT
• Grafana dashboards display data from the data series in InfluxDB
• A details app is deployed that can pull data from InfluxDB to complement Grafana:
  – Shows details of all of the requests in tabular format
  – Provides capabilities to make special queries against the InfluxDB data
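The request middleware described above is a private module, so the following is only a guess at its general shape, written for an Express-style `(req, res, next)` signature. The factory name `createMonitorMiddleware`, the topic `monitor/requests`, and the metric fields are hypothetical. The `publish` function is injected; in a real app it might wrap an MQTT client's publish call.

```javascript
// Hypothetical sketch of "fire and forget" request monitoring middleware.
// `publish(topic, payload)` is supplied by the caller; nothing here talks to
// a real broker. Topic name and metric fields are made up for illustration.

function createMonitorMiddleware(
  publish,
  { service = 'unknown', topic = 'monitor/requests' } = {}
) {
  return function monitor(req, res, next) {
    const start = Date.now();

    // 'finish' is emitted by Node's http.ServerResponse once the response
    // has been handed off for sending.
    res.on('finish', () => {
      const metric = {
        service,
        direction: 'inbound',
        method: req.method,
        url: req.url,
        status: res.statusCode,
        responseTimeMs: Date.now() - start,
      };
      try {
        // Fire and forget: the result is never awaited.
        const result = publish(topic, JSON.stringify(metric));
        // If publish happens to return a promise, swallow its rejection too.
        if (result && typeof result.catch === 'function') {
          result.catch(() => {});
        }
      } catch (err) {
        // Deliberately ignored: a monitoring failure must not break the app.
      }
    });

    next();
  };
}
```

Injecting `publish` keeps the middleware testable without a broker and makes the "never let monitoring hurt the request path" rule easy to verify.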
Using Monitoring Data
Grafana Dashboards
• Grafana dashboards used to visualize data over time for any microservice
• Data includes:
  – Total requests
  – Response time (mean, median, 90th-percentile time)
  – Error rate
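The dashboard figures listed above can be derived from raw request records along these lines. This is a minimal sketch, not how Grafana or InfluxDB compute them: the `summarize` function and the record shape `{ responseTimeMs, status }` are hypothetical, the 90th percentile uses the nearest-rank method, and counting only 5xx responses as errors is an assumption.

```javascript
// Hypothetical sketch: summary statistics for one microservice's requests.
// Each record is assumed to look like { responseTimeMs, status }.

function summarize(requests) {
  const times = requests.map((r) => r.responseTimeMs).sort((a, b) => a - b);
  const n = times.length;
  if (n === 0) {
    return { total: 0, mean: null, median: null, p90: null, errorRate: 0 };
  }

  const mean = times.reduce((sum, t) => sum + t, 0) / n;

  // Median: middle value, or the average of the two middle values.
  const mid = Math.floor(n / 2);
  const median = n % 2 === 1 ? times[mid] : (times[mid - 1] + times[mid]) / 2;

  // 90th percentile by nearest rank: the smallest value with at least
  // 90% of the samples at or below it.
  const p90 = times[Math.ceil((90 * n) / 100) - 1];

  // Assumption: only 5xx responses are counted as errors.
  const errors = requests.filter((r) => r.status >= 500).length;

  return { total: n, mean, median, p90, errorRate: errors / n };
}
```

Reporting the median and 90th percentile next to the mean matters because a handful of very slow requests can move the mean a lot while leaving the typical request unchanged.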
Identifying a Problem in Grafana
• Like a cardiologist reading an echocardiogram, we’ve gotten good at identifying anomalies in these charts
• Data to the left shows a recent “outage” where error rates and response times spiked for a period of time
Root Cause Analysis
• We can dive into more detailed data to do root cause analysis
• In the chart below, response time is broken down by “category” (e.g., CF, UAA, Containers, etc.)
• We can see timeouts in a large number of components, indicating a broader systemic issue
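The per-category breakdown relies on each request being tagged with a URL “category” before it is stored. A rough sketch of that tagging and of a per-category average follows. The rule table, the hostname patterns, and the `Other` fallback are invented for illustration; the real rules used by the storage microservice are not given in the talk.

```javascript
// Hypothetical sketch: tag outbound requests with a category, then compute
// the mean response time per category. Patterns below are made-up examples.

const CATEGORY_RULES = [
  { category: 'UAA', pattern: /^https?:\/\/uaa\./ },
  { category: 'CF', pattern: /^https?:\/\/api\./ },
  { category: 'Containers', pattern: /^https?:\/\/containers\./ },
];

// First matching rule wins; unmatched URLs fall into 'Other'.
function categorize(url) {
  const rule = CATEGORY_RULES.find((r) => r.pattern.test(url));
  return rule ? rule.category : 'Other';
}

// Each record is assumed to look like { url, responseTimeMs }.
function meanResponseTimeByCategory(requests) {
  const totals = new Map();
  for (const { url, responseTimeMs } of requests) {
    const category = categorize(url);
    const entry = totals.get(category) || { sum: 0, count: 0 };
    entry.sum += responseTimeMs;
    entry.count += 1;
    totals.set(category, entry);
  }
  const result = {};
  for (const [category, { sum, count }] of totals) {
    result[category] = sum / count;
  }
  return result;
}
```

If many categories slow down at once, as in the outage described above, the cause is more likely shared infrastructure than any single backend.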
Details View
• Can drill down and get a tabular view with aggregated details about the requests making up a chart
• Can drill down again to see a list of individual requests (with timestamps) as well as get more detailed statistics on individual URLs
Wall of Shame
• Building on the details view from the previous page, we can build walls of “shame” to help drive improvements
  – Show the 10 slowest API calls made to/from a specific microservice that have been called at least 1000 times during the last 24 hours
  – Show the top 10 requests with the most error responses that are invoked at least X times over an arbitrary time period
  – Etc.
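The first “wall of shame” query above can be expressed over in-memory request records as in the sketch below. In the system described, this would more likely be a query against InfluxDB; the function name `slowestCalls`, the record shape, and ranking by mean response time are assumptions. The caller is assumed to have already restricted the input to the time period of interest (for example, the last 24 hours).

```javascript
// Hypothetical sketch: the N slowest URLs (by mean response time) among those
// called at least `minCalls` times. Records look like { url, responseTimeMs }.

function slowestCalls(requests, { minCalls = 1000, limit = 10 } = {}) {
  // Accumulate call count and total time per URL.
  const byUrl = new Map();
  for (const { url, responseTimeMs } of requests) {
    const stats = byUrl.get(url) || { url, calls: 0, totalMs: 0 };
    stats.calls += 1;
    stats.totalMs += responseTimeMs;
    byUrl.set(url, stats);
  }

  return [...byUrl.values()]
    // Ignore rarely called URLs so one-off slow calls do not top the list.
    .filter((s) => s.calls >= minCalls)
    .map((s) => ({ url: s.url, calls: s.calls, meanMs: s.totalMs / s.calls }))
    // Slowest first.
    .sort((a, b) => b.meanMs - a.meanMs)
    .slice(0, limit);
}
```

The same shape works for the error-focused wall: count error responses per URL instead of summing response times, then sort by that count.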
Memory, CPU Usage, Crashes
• Another important set of data includes memory usage, CPU usage, and crashes for all instances of all microservices
• The chart below shows a major CPU usage issue we found in a dev system, so we were able to fix it before it found its way to production
Building Your Own Monitoring System
Node Application Metrics (appmetrics)
• Had planned on publishing some of my monitoring code, but in prep for CF Summit learned of the appmetrics project being driven by some fellow IBMers
• Shares much in common with the middleware I mentioned earlier that publishes metrics to MQTT, but goes even deeper to provide additional performance insights
• Fully open source
  – https://github.com/RuntimeTools/appmetrics
• Proves yet again that IBM is a big place :)
Default Capabilities and MQTT
• Sends data to MQTT, meaning you can subscribe to updates
• Provides an Event API which allows:
  – custom triggers based on the monitoring data
  – publication of custom events
• This would be enough to support other pieces of the Bluemix UI monitoring system (like the storage service or the alerts service)
App Metrics – Default Capabilities
Data Storage
• Can be configured to store data in:
  – Elasticsearch
    • https://github.com/RuntimeTools/appmetrics-elk
  – StatsD
    • https://github.com/RuntimeTools/appmetrics-statsd
• No support for InfluxDB yet, but I’ve suggested to the team they should add it
Collecting Synthetic Data
• Monitoring discussed so far only paints a picture of the server side
• It’s also important to get a perspective from the client
• Continuously run scripts that leverage Sitespeed.io (https://www.sitespeed.io/) to load the major pages of the product
• Collects data such as perf score, first visual change, speed index, etc. and stores it in Graphite
  – Grafana dashboards built to allow us to visualize the data
  – Scripts can be run from multiple geo locations
The End
Questions?
Tony Erwin
Email: aerwin@us.ibm.com
Twitter: @tonyerwin
See also presentation from earlier this week:
To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservices on CloudFoundry (http://sched.co/AJmh)
Monitoring Node.js Microservices on CloudFoundry with Open Source Tools and a Shoestring Budget

  • 1. Monitoring Node.js Microservices on CloudFoundry with Open Source Tools and a Shoestring Budget Tony Erwin, aerwin@us.ibm.com
  • 2. Agenda • Introduction to Bluemix UI & Architecture • Importance of Monitoring w/ Microservices • Overview of Monitoring Architecture • Using Monitoring Data • Building Your Own Monitoring System • Synthetic Measurements
  • 3. Bluemix UI • Front-end to IBM’s open cloud Bluemix offering • Lets users view and manage CF resources, containers, virtual servers, user accounts, billing/usage, etc • Runs on top of Bluemix PaaS Layer (Cloud Foundry) Dashboard Catalog Resource Details And More!
  • 4. Bluemix UI Architecture • Migrated from a monolithic to a microservice architecture over the last couple of years • Composed of 25+ Node.js apps deployed to Cloud Foundry • See talk from earlier this week for more details – To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservices on CloudFoundry [Diagram labels: Home, Catalog, …, Dashboard, Pricing, Orgs/Spaces; Backend APIs (CF, Containers, VMs, BSS, MCCP, etc.); Bluemix UI (Client); Bluemix PaaS; Proxy; Common; Monitoring Framework; Session Store; NoSQL DB; Cloud Foundry]
  • 6. Importance of Monitoring • Root cause analysis when a problem occurs – Bluemix UI is the most visible part of the platform and acts as a “canary in the mine shaft” for the whole platform – When a critical event or outage occurs, it often starts with reports like: • “Can’t log in to console” • “Console doesn’t work…” • “Console is slow…” – When this happens in the middle of the night, my team is regularly the first to get a PagerDuty alert • Being able to quickly find the root cause is a matter of self-preservation – Console behavior is often (but not always!) a symptom of something going on elsewhere (like CF is having problems, networking is down, etc.) • Auto-detection of problems – Ideally, we want to find and fix problems before a user hits them – Example: Send a PagerDuty alert when error rates for a given API go above a threshold • Tracking against performance and quality targets – Can’t meet goals for something you can’t measure over time
  • 7. What to Monitor? • Metrics we were especially interested in: – Data for every inbound/outbound request for every microservice • Response time • HTTP response code • Etc. – Memory usage, CPU usage, uptime, and crashes for every instance of every microservice – General health of ourselves and dependencies
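One way to capture the per-request data listed above (response time and HTTP response code) is a small piece of Node.js middleware. The following is a minimal sketch, not the private module used by the Bluemix UI team: the `requestMetrics` name, the `publish` callback, the `metrics/inbound` topic, and the record fields are all assumptions made for illustration.

```javascript
// Sketch: record response time and status code for each inbound request.
// `publish` is a hypothetical callback, e.g. a thin wrapper around an MQTT client.
// `now` is injectable so the timing can be tested; it must return nanoseconds as a BigInt.
function requestMetrics(publish, now = () => process.hrtime.bigint()) {
  return function (req, res, next) {
    const start = now();
    // 'finish' is emitted by http.ServerResponse once the response has been sent.
    res.once('finish', () => {
      const record = {
        method: req.method,
        url: req.url,
        statusCode: res.statusCode,
        durationMs: Number(now() - start) / 1e6,
        timestamp: Date.now(),
      };
      // Fire and forget: a failure to publish must never break the request.
      try {
        publish('metrics/inbound', record);
      } catch (err) {
        // Intentionally ignored.
      }
    });
    next();
  };
}
```

Because the function has the usual `(req, res, next)` shape, it could be registered with a framework such as Express via `app.use(requestMetrics(myPublish))`, where `myPublish` is whatever transport is available.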
  • 9. Monitoring Architecture [Diagram labels: Bluemix UI (Client); Proxy; App 1 … App N; MQTT; Monitor Storage; InfluxDB; Monitor Alerts; PagerDuty, Slack, etc.; Space Scanner; Backend APIs (CF, Containers, VMs, BSS, MCCP, etc.); Cloud Foundry]
  • 10. Monitoring Components • Each microservice bound to an MQTT service (which happens to be provided by the IBM Internet of Things service) • Each microservice adds middleware (private npm module) that publishes inbound / outbound request data to MQTT in a “fire and forget” manner – Also supports a general “publish” function to send arbitrary metrics to MQTT (e.g., overall system health, number of times we retrieve JSON from Redis cache instead of API, etc.) • Storage microservice: – Subscribes to the same queue, does some massaging of the data (such as tagging with URL “category”), and writes to InfluxDB • Alerts microservice: – Subscribes to the same queue, aggregates the inputs over the last X minutes, and sends alerts (like Slack, PagerDuty, etc.) • Scanner microservice: – Calls CF APIs every 60 seconds to get data for each app instance on mem usage, CPU usage, uptime, and crashes – Publishes the data to MQTT • Grafana dashboards display data from data series in InfluxDB • Details app is deployed that can pull data from InfluxDB to complement Grafana: – Shows details of all of the requests in tabular format – Provides capabilities to make special queries against the InfluxDB data
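The alerts microservice described above aggregates inputs over the last X minutes before sending a notification. A rough sketch of that idea follows; it is not the actual service. The `ErrorRateAlerter` class, its options, and the choice to count status codes of 500 and above as errors are assumptions, and records are assumed to arrive in timestamp order.

```javascript
// Sketch: sliding-window error-rate check. All names and thresholds are illustrative.
class ErrorRateAlerter {
  constructor({ windowMs, threshold, minRequests, notify }) {
    this.windowMs = windowMs;       // length of the window ("last X minutes") in ms
    this.threshold = threshold;     // error rate (0..1) above which to alert
    this.minRequests = minRequests; // avoid alerting on a handful of requests
    this.notify = notify;           // hypothetical hook, e.g. post to Slack or PagerDuty
    this.samples = [];              // { timestamp, isError }, oldest first
    this.alerting = false;
  }

  // Call once per request record received from the message queue.
  record({ timestamp, statusCode }) {
    // Assumption: only 5xx responses count as errors.
    this.samples.push({ timestamp, isError: statusCode >= 500 });

    // Drop samples that have aged out of the window.
    const cutoff = timestamp - this.windowMs;
    while (this.samples.length && this.samples[0].timestamp < cutoff) {
      this.samples.shift();
    }

    const errors = this.samples.filter((s) => s.isError).length;
    const rate = errors / this.samples.length;
    const breached = this.samples.length >= this.minRequests && rate > this.threshold;

    // Notify only on the transition into the alerting state, so a sustained
    // outage produces one alert rather than one per request.
    if (breached && !this.alerting) {
      this.notify({ rate, total: this.samples.length });
    }
    this.alerting = breached;
  }
}
```

Notifying only on the state transition is a design choice made here to keep the sketch from paging repeatedly; a real service would likely also want per-API windows and a re-notification interval.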
  • 12. Grafana Dashboards • Grafana dashboards used to visualize data over time for any microservice • Data includes: – Total requests – Response time (mean, median, 90th-percentile time) – Error rate
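In the deck these figures come from Grafana querying InfluxDB, but it can help to see how the dashboard numbers are defined. The sketch below computes them from an in-memory list of request records; the record shape (`durationMs`, `statusCode`), the nearest-rank percentile method, and the rule that 5xx responses count as errors are all assumptions for illustration.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of the data at or below it.
// Expects an array already sorted in ascending order.
function percentile(sortedValues, p) {
  if (sortedValues.length === 0) return NaN;
  const rank = Math.ceil((p * sortedValues.length) / 100);
  return sortedValues[Math.max(rank, 1) - 1];
}

// Summarize a batch of request records into the dashboard-style metrics.
function summarize(requests) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const total = requests.length;
  const sum = durations.reduce((acc, d) => acc + d, 0);
  // Assumption: only 5xx responses count as errors.
  const errors = requests.filter((r) => r.statusCode >= 500).length;
  return {
    total,
    mean: total ? sum / total : NaN,
    median: percentile(durations, 50), // nearest-rank, so the lower middle value for even counts
    p90: percentile(durations, 90),
    errorRate: total ? errors / total : NaN,
  };
}
```

Tracking the median and 90th percentile alongside the mean matters because a few very slow requests can shift the mean without affecting most users, while the 90th percentile exposes the slow tail directly.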
  • 13. Identifying a Problem in Grafana • Like a cardiologist reading an echocardiogram, we’ve gotten good at identifying anomalies in these charts • Data to left shows a recent “outage” where error rates and response times spiked for a period of time
  • 14. Root Cause Analysis • We can dive into more detailed data to do root cause analysis • In chart below, response time is broken down by “category” (e.g., CF, UAA, Containers, etc.) • We can see timeouts in a large number of components, indicating a broader systemic issue
  • 15. Details View • Can drill down and get tabular view with aggregated details about the requests making up a chart • Can drill down again to see list of individual requests (with timestamps) as well as get more detailed statistics on individual URLs
  • 16. Wall of Shame • Building on the details view from the previous page, we can build walls of “shame” to help drive improvements – Show the 10 slowest API calls made to/from a specific microservice that have been called at least 1000 times during the last 24 hours – Show the top 10 requests with the most error responses that are invoked at least X times over an arbitrary time period – Etc.
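The first "wall of shame" query above (slowest calls with a minimum invocation count) can be sketched as a simple aggregation. This is an illustration over in-memory records rather than the InfluxDB query used by the details app; the `slowestCalls` name, the record shape, and the use of mean duration as the ranking metric are assumptions, and the caller is assumed to pass only records from the time period of interest (for example, the last 24 hours).

```javascript
// Sketch: rank URLs by mean response time, ignoring rarely called ones.
function slowestCalls(requests, { minCalls = 1000, limit = 10 } = {}) {
  const byUrl = new Map();
  for (const { url, durationMs } of requests) {
    const entry = byUrl.get(url) || { url, calls: 0, totalMs: 0 };
    entry.calls += 1;
    entry.totalMs += durationMs;
    byUrl.set(url, entry);
  }
  return [...byUrl.values()]
    .filter((e) => e.calls >= minCalls) // drop URLs without enough traffic to be meaningful
    .map((e) => ({ url: e.url, calls: e.calls, meanMs: e.totalMs / e.calls }))
    .sort((a, b) => b.meanMs - a.meanMs) // slowest first
    .slice(0, limit);
}
```

The minimum call count keeps a single slow one-off request from dominating the list, which is the reason the slide pairs "slowest" with "called at least 1000 times".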
  • 17. Memory, CPU Usage, Crashes • Another important set of data includes memory, CPU usage, and crashes for all instances of all microservices • Chart below shows a major CPU usage issue we found in a dev system, so we were able to fix it before it found its way to production
  • 18. Building Your Own Monitoring System
  • 19. Node Application Metrics (appmetrics) • Had planned on publishing some of my monitoring code, but in prep for CF Summit learned of the appmetrics project being driven by some fellow IBMers • Shares much in common with the middleware I mentioned earlier that publishes metrics to MQTT, but goes even deeper to provide additional performance insights • Fully open source – https://github.com/RuntimeTools/appmetrics • Proves yet again that IBM is a big place :)
  • 20. Default Capabilities and MQTT • Sends data to MQTT, meaning you can subscribe to updates • Provides an Event API which allows: – custom triggers based on the monitoring data – publication of custom events • This would be enough to support other pieces of the Bluemix UI monitoring system (like the storage service or the alerts service)
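A custom trigger of the kind mentioned above might look like the following sketch. It is written against a plain `EventEmitter` so that it runs without the library installed. The assumption, which should be checked against the appmetrics documentation, is that the object returned by `appmetrics.monitor()` emits `'http'` events whose data includes `url` and `duration` (in milliseconds). The `onSlowRequest` helper and the threshold are hypothetical.

```javascript
// Sketch: invoke `handler` whenever a monitored HTTP request exceeds a threshold.
// `monitoring` is any EventEmitter that emits 'http' events carrying
// { url, duration }; treating appmetrics' monitor object as such is an assumption.
function onSlowRequest(monitoring, thresholdMs, handler) {
  monitoring.on('http', (data) => {
    if (data.duration > thresholdMs) {
      handler({ url: data.url, duration: data.duration });
    }
  });
}

// Hypothetical wiring with the real library (not executed here):
//   const monitoring = require('appmetrics').monitor();
//   onSlowRequest(monitoring, 2000, (slow) => console.warn('slow request', slow));
```

The handler is where a storage or alerts service would hook in, for example by writing the event to a time-series database or forwarding it to a paging tool.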
  • 21. App Metrics – Default Capabilities
  • 22. Data Storage • Can be configured to store data: – Elasticsearch • https://github.com/RuntimeTools/appmetrics-elk – StatsD • https://github.com/RuntimeTools/appmetrics-statsd • No support for InfluxDB yet, but I’ve suggested to the team they should add it
  • 24. Collecting Synthetic Data • Monitoring discussed so far only paints a picture of the server side • It’s also important to get a perspective from the client • Continuously run scripts that leverage Sitespeed.io (https://www.sitespeed.io/) to load the major pages of the product • Collects data such as perf score, first visual change, speed index, etc. and stores it in Graphite – Grafana dashboards built to allow us to visualize the data – Scripts can run from multiple geo locations
  • 25. The End Questions? Tony Erwin Email: aerwin@us.ibm.com Twitter: @tonyerwin See also presentation from earlier this week: To Kill a Monolith: Slaying the Demons of a Monolith with Node.js Microservices on CloudFoundry (http://sched.co/AJmh)