ZABBIX FOR HPC MONITORING AND SUPPORT
Mikhail Serkov
Delivery Manager/HPC Engineer
2016
CONFIDENTIAL 2
AGENDA
#1 High Performance Computing – what is it about
#2 Overview of the customer infrastructure and software stack
#3 HPC monitoring – differences from classic support model
#4 How do we use Zabbix
#5 What’s next?
Novartis Institute For Biomedical Research (NIBR)
• Scientific research in the pharma area: bioinformatics, computational chemistry, drug discovery, etc.
• About 10k CPU cores used for scientific computation.
• Shared clusters: different workflows can run simultaneously within the same cluster.
• About 500 different scientific tools.
• Custom software (Python, Java, R).
HIGH PERFORMANCE COMPUTING
• Hundreds or even thousands of computation nodes
• Grid computing technologies and software (SGE, UGE, SoGE, PBS, etc.)
• Massively parallel computation across the nodes
• Strict requirements for all subsystems at the hardware and software level (storage, network, power, OS)
• No magic: Linux boxes and shell scripts at the low level ☺
Example of a job submission:
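A job submission to a Grid Engine style scheduler might look roughly like the sketch below. It only assembles a `qsub` command line and prints it; nothing is submitted. The script name, job name, resource values, and the parallel environment name `smp` are assumptions made for illustration (parallel environment names are site-specific), while `-N`, `-cwd`, `-pe`, and `-l h_vmem=...,h_rt=...` are standard Grid Engine `qsub` options.

```python
import shlex


def build_qsub_command(script, job_name, slots, mem_per_slot, runtime, pe_name="smp"):
    """Assemble (but do not run) a Grid Engine qsub command line.

    pe_name is a hypothetical parallel environment; real names are site-specific.
    """
    return [
        "qsub",
        "-N", job_name,                 # job name shown in qstat
        "-cwd",                         # run the job in the current directory
        "-pe", pe_name, str(slots),     # request CPU slots
        "-l", f"h_vmem={mem_per_slot},h_rt={runtime}",  # memory and runtime limits
        script,
    ]


# Hypothetical job: 4 slots, 4 GB per slot, 2 hours of runtime.
cmd = build_qsub_command("align_reads.sh", "align_reads", 4, "4G", "02:00:00")
print(shlex.join(cmd))
```

The resource request attached to the job (slots, memory, runtime) is what the scheduler uses to place it, and it is also what makes "expected" utilization available for monitoring.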
OVERVIEW OF THE CUSTOMER INFRASTRUCTURE
250 GPUs
70 TB RAM
35-40 kW/rack
TYPICAL COMPUTATION NODE CONFIGURATION
• 28 CPU cores (2 sockets x 14 cores each)
− Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
• 200 GB RAM
• 10 Gb Ethernet + InfiniBand interfaces
• 8 GPU cores (4 cards x 2 cores each)
• NFS over 10 Gb Ethernet
• Lustre over InfiniBand
OVERVIEW OF SOFTWARE STACK
• More than 500 scientific tools
• Bioinformatics, computational chemistry, crystallography, molecular dynamics, etc.
• RHEL6.5
• Univa Grid Engine
• Zabbix 2.4
HPC MONITORING DIFFERENCES
• We need "who, what, when" information, not only system metrics.
• Users are allowed to run whatever they want on the computation nodes through the grid scheduler.
• 100% CPU utilization and 100% RAM utilization on a node is perfectly fine.
• A node crash is not a big deal.
• Aggregated metrics are used to prevent global issues.
• Metrics serve not only monitoring but also performance analysis.
• Users have (restricted) access to the monitoring system.
WHY ZABBIX?
• Able to monitor huge systems with a lot of metrics
• Flexible
• Works out of the box
• Ability to aggregate metrics
• API for data extraction
• GUI convenient for both the support team and scientists
• Autodiscovery
• Automatic configuration of new nodes
ZABBIX CONFIGURATION
Server configuration:
• 20 CPU cores ( 2 sockets x 10 cores each )
− Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
• 120 GB RAM
Number of hosts: 601
Number of items: ~200k
Number of triggers: ~37k
DB Size: 187GB
WHAT DO WE MONITOR
Local metrics (node level)
• All default Linux checks (load average, CPU utilization, RAM, swap, etc.) - agent
• Every single GPU core (temperature, utilization if possible) - agent
• Every single CPU core (utilization, temperature) - agent
• NFS shares availability / utilization / mount details - agent
• Slots / RAM reserved - external scripts
• HPC jobs - external scripts
• ...
Global metrics (cluster level)
• Meta CPU utilization – aggregation of CPU utilization of HPC nodes
• NFS global transmit/retransmit - aggregation of nodes values
• Grid specific – used/active slots, running jobs, pending jobs, top users - external scripts
• CPU/Memory oversubscription - aggregation of nodes values
• Overloaded nodes - aggregation of HPC values
• Pending time - external scripts
• ...
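As a rough illustration of how a cluster-level metric can be derived from node-level values, the sketch below does two things. It formats a Zabbix aggregate item key of the form `groupfunc["Host group","Item key",itemfunc,timeperiod]` (the syntax used by Zabbix 2.x aggregate checks), and it computes the same kind of group average in plain Python. The host group name `HPC nodes`, the choice of `system.cpu.util[,user]` as the underlying item, and the use of a plain average for "meta CPU utilization" are assumptions, not details taken from the slides.

```python
def aggregate_key(group_func, host_group, item_key, item_func="last", period=0):
    """Format a Zabbix 2.x aggregate item key, e.g. grpavg[...]."""
    return f'{group_func}["{host_group}","{item_key}",{item_func},{period}]'


def meta_cpu_utilization(per_node_util):
    """Average CPU utilization (percent) across nodes; None if there is no data.

    Assumption: the cluster-level value is a simple mean of node values.
    """
    values = list(per_node_util.values())
    if not values:
        return None
    return sum(values) / len(values)


# Hypothetical host group and node names.
key = aggregate_key("grpavg", "HPC nodes", "system.cpu.util[,user]")
cluster_util = meta_cpu_utilization({"node001": 100.0, "node002": 50.0})
print(key)
print(cluster_util)
```

Letting the server aggregate existing agent items avoids adding any collection load on the nodes themselves.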
HPC specific examples
1) Expected utilization vs. real utilization
Every job carries a resource request (number of CPUs, RAM, etc.). At any moment we can compare the real utilization with the expected one. If they are not close, we need to investigate whether someone is oversubscribing resources or overloading nodes.
Solution: Zabbix not only checks current system metrics but also keeps the expected values. If the two differ too much, we receive a warning.
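The comparison of expected and real utilization can be sketched as follows. This is a minimal illustration, not the trigger logic used on the cluster: the expected CPU utilization is assumed to be the share of cores reserved by running jobs, and the 15-point tolerance is an arbitrary placeholder.

```python
def expected_cpu_percent(requested_slots, total_cores):
    """Expected CPU utilization if every job uses exactly the slots it requested."""
    return 100.0 * requested_slots / total_cores


def classify_node(requested_slots, total_cores, real_cpu_percent, tolerance=15.0):
    """Compare real utilization with the expected one.

    tolerance (percentage points) is a hypothetical threshold.
    """
    expected = expected_cpu_percent(requested_slots, total_cores)
    if real_cpu_percent > expected + tolerance:
        return "oversubscribed"   # jobs use more CPU than they requested
    if real_cpu_percent < expected - tolerance:
        return "underused"        # reserved resources sit idle
    return "ok"


# A 28-core node with 14 slots reserved but 95% CPU in use.
print(classify_node(14, 28, 95.0))
```

Both directions are worth flagging: real usage above the request points to oversubscription, while usage far below it means reserved capacity is being wasted.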
2) Users on a computation node
Users are not restricted from connecting to any node over SSH (debugging, tracing a job in real time, interactive jobs, etc.). However, we should check whether a user has a job on the node they are logged into.
Solution: We have a trigger that notifies us if anyone is logged in to a node where they have no running job. Additionally, we store the list of logged-in users at every moment.
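The check behind such a trigger amounts to a set difference between the users logged in to a node and the owners of jobs running on it. The sketch below assumes both lists have already been collected (for example from `who` and from the scheduler's job list); that collection step, the user names, and the ignore list are hypothetical.

```python
def users_without_jobs(logged_in_users, job_owners, ignore=("root",)):
    """Return users logged in to a node who own no running job there.

    ignore lists accounts that are allowed to log in without a job (assumption).
    """
    return sorted(set(logged_in_users) - set(job_owners) - set(ignore))


# alice has a job on the node, bob does not.
offenders = users_without_jobs(["alice", "bob", "root"], ["alice"])
print(offenders)
```

A non-empty result is the condition that would raise the notification; the raw list of logged-in users can be stored alongside it as a text item.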
HPC specific examples
Pending time probes
It is really hard to predict the pending time for any particular job in the pending list, because jobs have different resource requests and runtimes. The queue is not FIFO, and the pending time always depends on the resources the user requests.
Solution: Zabbix runs "pending probes" (empty jobs) and measures how long they take. This is a good indicator of the queue state at the moment.
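A pending probe can be sketched as a small timing loop: submit an empty job, poll until the scheduler starts it, and report the elapsed time. The `submit` and `has_started` callables below are hypothetical hooks standing in for the real scheduler calls, and the polling interval and timeout are placeholder values.

```python
import time


def run_pending_probe(submit, has_started, poll_interval=1.0, timeout=600.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Submit an empty job and return how long it stayed pending, in seconds.

    Returns None if the job has not started within `timeout` seconds.
    submit() must return a job id; has_started(job_id) must return a bool.
    """
    job_id = submit()
    submitted_at = clock()
    while True:
        waited = clock() - submitted_at
        if has_started(job_id):
            return waited
        if waited >= timeout:
            return None
        sleep(poll_interval)


# Simulated scheduler: the probe starts on the third poll.
_now = {"t": 0.0}
_polls = iter([False, False, True])
pending_seconds = run_pending_probe(
    submit=lambda: "probe-1",
    has_started=lambda job_id: next(_polls),
    clock=lambda: _now["t"],
    sleep=lambda s: _now.update(t=_now["t"] + s),
)
print(pending_seconds)
```

The returned value is what an external script would hand back to Zabbix as the "pending time" item; a `None` result can be mapped to a separate alert.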
WHAT DO WE MONITOR: GLOBAL METRICS
• Global cluster utilization
• RAM oversubscription
• CPU time oversubscription
• Meta CPU utilization
• Aggregated cluster status
• Storage operational metrics
WHAT DO WE MONITOR: LOCAL METRICS
USER ACCESS
We want to provide a limited amount of information to users. They don’t need any information about triggers and issues, only metrics. We have patched Zabbix to remove all unnecessary data for guest access.
Before / After
Benefits
• Better understanding of global issues on the cluster and the reasons why they happened.
• Great performance indicators for other infrastructure teams (especially the Storage team).
• Performance tuning of scientific workflows and job profiling. In some cases the information we can get from Zabbix helps us significantly improve job performance.
• Proactive monitoring. With Zabbix it is easier to see when something is not right on the cluster or with a particular job. In most cases we are able to prevent global cluster issues, or at least minimize their impact.
• One monitoring system for clusters and HPC infrastructure.
• “All in one”: less effort to support and maintain monitoring system(s).
WHAT’S NEXT?
• Tight integration with grid HPC software.
• Data analysis using external tools, with Zabbix as the data source.
• A set of CLI utilities for getting Zabbix statistics in a “human-readable” format.
• Automation of job profiling using the Zabbix API.
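A CLI utility of the kind mentioned above would talk to the Zabbix JSON-RPC API. The sketch below only builds a `history.get` request body and summarizes a response; it makes no network call. Passing the session token in the `auth` field matches the API of the Zabbix 2.x era; the token, item id, timestamps, and the summary format are placeholders chosen for illustration.

```python
import json


def history_get_request(auth_token, item_id, time_from, time_till, request_id=1):
    """Build a JSON-RPC body for history.get (numeric float history)."""
    return {
        "jsonrpc": "2.0",
        "method": "history.get",
        "params": {
            "output": "extend",
            "history": 0,            # 0 = numeric float values
            "itemids": [item_id],
            "time_from": time_from,  # Unix timestamps
            "time_till": time_till,
            "sortfield": "clock",
            "sortorder": "ASC",
        },
        "auth": auth_token,
        "id": request_id,
    }


def summarize(history_rows):
    """Turn history.get rows (values arrive as strings) into a one-line summary."""
    values = [float(row["value"]) for row in history_rows]
    if not values:
        return "no data"
    avg = sum(values) / len(values)
    return "min=%.1f avg=%.1f max=%.1f" % (min(values), avg, max(values))


# Hypothetical token, item id, and sample response rows.
body = history_get_request("TOKEN", "23296", 1451606400, 1451610000)
print(json.dumps(body, indent=2))
print(summarize([{"clock": "1451606400", "value": "10.0"},
                 {"clock": "1451606460", "value": "30.0"}]))
```

The same request and summary pair could feed job profiling: query the node items for the time window in which a job ran and attach the summary to the job record.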
Questions?
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Zabbix
 
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Zabbix
 
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Zabbix
 
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Zabbix
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Zabbix
 
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Zabbix
 
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
Zabbix
 
Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016
Zabbix
 
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Zabbix
 
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Zabbix
 
Rihards Olups - Zabbix log management
Rihards Olups - Zabbix log managementRihards Olups - Zabbix log management
Rihards Olups - Zabbix log management
Zabbix
 
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and ZabbixZabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix
 
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil CommunityZabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix Conference LatAm 2016 - Andre Deo - Zabbix Brazil Community
Zabbix
 
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and ZabbixZabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix Conference LatAm 2016 - Andre Deo - SNMP and Zabbix
Zabbix
 
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix Conference LatAm 2016 - Rodrigo Mohr - Challenges on Large Env with Or...
Zabbix
 
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix Conference LatAm 2016 - Marcio Prop - Monitoring Complex Environments ...
Zabbix
 
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix Conference LatAm 2016 - Filipe Paternot - Zbx@Globo Automation+Integra...
Zabbix
 
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMPZabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix Conference LatAm 2016 - Douglas Esteves - Zabbix at UNICAMP
Zabbix
 
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Ryan Armstrong - Monitoring More Than 6000 Devices in Zabbix | ZabConf2016
Zabbix
 
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Rafael Martinez Guerrero - Zabbix at the University of Oslo | ZabConf2016
Zabbix
 
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Wolfgang Alper - Zabbix Meets OPS Control / Rundeck | ZabConf2016
Zabbix
 
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Sumit Goel - Monitoring Cloud Applications Using Zabbix | ZabConf2016
Zabbix
 
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Erik Skytthe - Monitoring Mesos, Docker, Containers with Zabbix | ZabConf2016
Zabbix
 
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Konstantin Yakovlev - Event Analysis Toolset | ZabConf2016
Zabbix
 
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016Ingus Vilnis -  Benefits of Zabbix Training | ZabConf2016
Ingus Vilnis - Benefits of Zabbix Training | ZabConf2016
Zabbix
 
Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016Alexei Vladishev - Opening Speech | ZabConf2016
Alexei Vladishev - Opening Speech | ZabConf2016
Zabbix
 
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Alexander Naydenko - Nagios to Zabbix Migration | ZabConf2016
Zabbix
 
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Alain Ganuchaud - Trouble Ticket Integration with Zabbix in Large Environment...
Zabbix
 
Rihards Olups - Zabbix log management
Rihards Olups - Zabbix log managementRihards Olups - Zabbix log management
Rihards Olups - Zabbix log management
Zabbix
 
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and ZabbixZabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix Conference LatAm 2016 - Paulo Deolindo - Case Study_BBTS and Zabbix
Zabbix
 

Recently uploaded (20)

Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 
Technology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data AnalyticsTechnology Trends in 2025: AI and Big Data Analytics
Technology Trends in 2025: AI and Big Data Analytics
InData Labs
 
Role of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered ManufacturingRole of Data Annotation Services in AI-Powered Manufacturing
Role of Data Annotation Services in AI-Powered Manufacturing
Andrew Leo
 
Rusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond SparkRusty Waters: Elevating Lakehouses Beyond Spark
Rusty Waters: Elevating Lakehouses Beyond Spark
carlyakerly1
 
TrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business ConsultingTrsLabs - Fintech Product & Business Consulting
TrsLabs - Fintech Product & Business Consulting
Trs Labs
 
Build Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For DevsBuild Your Own Copilot & Agents For Devs
Build Your Own Copilot & Agents For Devs
Brian McKeiver
 
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
Transcript: #StandardsGoals for 2025: Standards & certification roundup - Tec...
BookNet Canada
 
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
IEDM 2024 Tutorial2_Advances in CMOS Technologies and Future Directions for C...
organizerofv
 
What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...What is Model Context Protocol(MCP) - The new technology for communication bw...
What is Model Context Protocol(MCP) - The new technology for communication bw...
Vishnu Singh Chundawat
 
Generative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in BusinessGenerative Artificial Intelligence (GenAI) in Business
Generative Artificial Intelligence (GenAI) in Business
Dr. Tathagat Varma
 
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath MaestroDev Dives: Automate and orchestrate your processes with UiPath Maestro
Dev Dives: Automate and orchestrate your processes with UiPath Maestro
UiPathCommunity
 
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded DevelopersLinux Support for SMARC: How Toradex Empowers Embedded Developers
Linux Support for SMARC: How Toradex Empowers Embedded Developers
Toradex
 
Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.Greenhouse_Monitoring_Presentation.pptx.
Greenhouse_Monitoring_Presentation.pptx.
hpbmnnxrvb
 
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc Webinar: Consumer Expectations vs Corporate Realities on Data Broker...
TrustArc
 
2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx2025-05-Q4-2024-Investor-Presentation.pptx
2025-05-Q4-2024-Investor-Presentation.pptx
Samuele Fogagnolo
 
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptxSpecial Meetup Edition - TDX Bengaluru Meetup #52.pptx
Special Meetup Edition - TDX Bengaluru Meetup #52.pptx
shyamraj55
 
Cybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure ADCybersecurity Identity and Access Solutions using Azure AD
Cybersecurity Identity and Access Solutions using Azure AD
VICTOR MAESTRE RAMIREZ
 
Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)Into The Box Conference Keynote Day 1 (ITB2025)
Into The Box Conference Keynote Day 1 (ITB2025)
Ortus Solutions, Corp
 
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager APIUiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPath Community Berlin: Orchestrator API, Swagger, and Test Manager API
UiPathCommunity
 
Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025Splunk Security Update | Public Sector Summit Germany 2025
Splunk Security Update | Public Sector Summit Germany 2025
Splunk
 
Electronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploitElectronic_Mail_Attacks-1-35.pdf by xploit
Electronic_Mail_Attacks-1-35.pdf by xploit
niftliyevhuseyn
 

Mikhail Serkov - Zabbix for HPC Cluster Support | ZabConf2016

  • 1. ZABBIX FOR HPC MONITORING AND SUPPORT. Mikhail Serkov, Delivery Manager/HPC Engineer, 2016
  • 2. AGENDA #1 High Performance Computing: what is it about? #2 Overview of the customer infrastructure and software stack #3 HPC monitoring: differences from the classic support model #4 How do we use Zabbix #5 What's next?
  • 3. Novartis Institute For Biomedical Research (NIBR) • Scientific research in the pharma area: Bioinformatics, Computational Chemistry, Drug Discovery, etc. • About 10k CPU cores used for scientific computation. • Shared clusters: different workflows can run simultaneously within the same cluster. • About 500 different scientific tools. • Custom software (Python, Java, R).
  • 4. HIGH PERFORMANCE COMPUTING • Hundreds or even thousands of computation nodes • Grid computing technologies and software (SGE, UGE, SoGE, PBS, etc.) • Massively parallel computation across the nodes • Strong requirements for all subsystems at the hardware and software level (storage, network, power, OS) • No magic: Linux boxes and shell scripts at the low level ☺ Example of a job submission: [figure not captured in the transcript]
  • 5. OVERVIEW OF THE CUSTOMER INFRASTRUCTURE • 250 GPUs • 70 TB RAM • 35-40 kW per rack
  • 6. TYPICAL COMPUTATION NODE CONFIGURATION • 28 CPU cores (2 sockets x 14 cores each), Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz • 200 GB RAM • 10 Gb Ethernet + InfiniBand interfaces • 8 GPU cores (4 cards x 2 cores each) • NFS over 10 Gb Ethernet • Lustre over InfiniBand
  • 7. OVERVIEW OF SOFTWARE STACK • More than 500 scientific tools • Bioinformatics, Computational Chemistry, Crystallography, Molecular Dynamics, etc. • RHEL 6.5 • Univa Grid Engine • Zabbix 2.4
  • 8. HPC MONITORING DIFFERENCES • We need information like 'who, what, when', not only system metrics. • Users are allowed to run whatever they want on the computation nodes through the grid scheduler. • 100% CPU utilization and 100% RAM utilization on a node is perfectly fine. • A node crash is not such a big deal. • Aggregated metrics are used to prevent global issues. • Metrics serve not only monitoring but also performance analysis. • Users have (restricted) access to the monitoring system.
  • 9. WHY ZABBIX? • Able to monitor huge systems with a lot of metrics • Flexible • Works out of the box • Ability to aggregate metrics • API for data extraction • GUI convenient for both the support team and scientists • Autodiscovery • Automatic configuration of new nodes
  • 10. ZABBIX CONFIGURATION. Server configuration: • 20 CPU cores (2 sockets x 10 cores each), Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz • 120 GB RAM. Number of hosts: 601 • Number of items: ~200k • Number of triggers: ~37k • DB size: 187 GB
  • 11. WHAT DO WE MONITOR
    Local metrics (node level):
      • All default Linux checks (load average, CPU utilization, RAM, swap, etc.): agent
      • Every single GPU core (temperature, utilization if possible): agent
      • Every single CPU core (utilization, temperature): agent
      • NFS shares availability / utilization / mount details: agent
      • Slots / RAM reserved: external scripts
      • HPC jobs: external scripts
      • ...
    Global metrics (cluster level):
      • Meta CPU utilization: aggregation of the CPU utilization of HPC nodes
      • NFS global transmit/retransmit: aggregation of node values
      • Grid specific (used/active slots, running jobs, pending jobs, top users): external scripts
      • CPU/memory oversubscription: aggregation of node values
      • Overloaded nodes: aggregation of HPC values
      • Pending time: external scripts
      • ...
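The cluster-level metrics on slide 11 are described as aggregations of per-node values. As a rough illustration of that arithmetic only, the sketch below averages per-node CPU utilization and lists nodes that look overloaded. The function names, the sample data shape, and the "load average above core count" definition of overloaded are assumptions for illustration; the slides do not define them, and in Zabbix itself this role would more likely be played by aggregate items than by Python.

```python
from statistics import mean


def meta_cpu_utilization(cpu_util_by_node):
    """Average CPU utilization (percent) across all reporting nodes."""
    if not cpu_util_by_node:
        raise ValueError("no node values to aggregate")
    return mean(cpu_util_by_node.values())


def overloaded_nodes(load_by_node, cores_by_node, factor=1.0):
    """Names of nodes whose load average exceeds factor * core count.

    The threshold is a hypothetical definition of 'overloaded'.
    """
    return sorted(
        node
        for node, load in load_by_node.items()
        if load > factor * cores_by_node[node]
    )
```

Called with a mapping of node name to value, for instance `overloaded_nodes({"node01": 56.0}, {"node01": 28})`, the second helper returns the node names to report, so the cluster-level item could simply store their count.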
  • 12. HPC specific examples
    1) Expected utilization vs. real utilization. Every job has a resource request for the number of CPUs, RAM, etc. At any moment we can compare the real utilization with the expected one. If they are not close, we need to investigate whether someone is oversubscribing resources or overloading nodes. Solution: Zabbix not only checks current system metrics but also stores the expected values. If they differ too much, we receive a warning.
    2) Users on a computation node. Users are not restricted from connecting to any node over SSH (debugging, tracing a job in real time, interactive jobs, etc.). However, we should check whether a user has a job on the node they are logged into. Solution: We have a trigger that notifies us if anyone is logged in on a node with no job running there. Additionally, we store the list of logged-in users for every moment.
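Both checks on slide 12 reduce to small comparisons. The following is a minimal sketch, not the author's implementation: the function names, the 25% tolerance, and the assumption that logged-in users and job owners are already available as plain lists (for example from an agent check and an external script) are all hypothetical.

```python
def utilization_gap(requested_cpus, used_cpus, tolerance=0.25):
    """Compare real CPU usage with the amount requested from the scheduler.

    Returns (ratio, flagged). flagged is True when usage deviates from the
    request by more than the tolerance, an assumed threshold.
    """
    if requested_cpus <= 0:
        raise ValueError("requested_cpus must be positive")
    ratio = used_cpus / requested_cpus
    return ratio, abs(ratio - 1.0) > tolerance


def users_without_jobs(logged_in_users, job_owners):
    """Users logged in on a node who own no job running on that node."""
    return sorted(set(logged_in_users) - set(job_owners))
```

A job that requested 8 CPUs but keeps 16 busy gives a ratio of 2.0 and is flagged; a non-empty result from `users_without_jobs` is the condition the trigger described above would fire on.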
  • 13. HPC specific examples: Pending time probes. It is really hard to predict the pending time for any particular job in the pending list, as jobs all have different resource requests and runtimes. The queue is not FIFO, and the pending time always depends on the resources the user requests. Solution: Zabbix runs 'pending probes' (empty jobs) and measures how long they take. This is a good indicator of the queue state at the moment.
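A pending probe can be approximated by timing an empty job from submission to completion, since the job itself does almost no work. The sketch below assumes this approach; it is not taken from the slides. `time_probe` and `grid_engine_probe` are hypothetical names, and the `qsub -sync y -b y /bin/true` invocation is one possible way to submit a blocking no-op job on a Grid Engine cluster, not necessarily what the author's external script does.

```python
import subprocess
import time


def time_probe(submit_and_wait, clock=time.monotonic):
    """Return the wall-clock seconds that submit_and_wait() takes.

    submit_and_wait is any callable that submits an empty job and blocks
    until it has finished.
    """
    start = clock()
    submit_and_wait()
    return clock() - start


def grid_engine_probe():
    """Assumed probe: submit /bin/true and wait for it to complete."""
    subprocess.run(
        ["qsub", "-sync", "y", "-b", "y", "/bin/true"],
        check=True,
        stdout=subprocess.DEVNULL,
    )
```

An external script could print `time_probe(grid_engine_probe)` for Zabbix to collect. Note that the measured value includes submission overhead and the short runtime of the job, so it is an upper bound on the pure pending time.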
  • 14. WHAT DO WE MONITOR: GLOBAL METRICS. Global cluster utilization
  • 15. WHAT DO WE MONITOR: GLOBAL METRICS. RAM oversubscription
  • 16. WHAT DO WE MONITOR: GLOBAL METRICS. CPU time oversubscription
  • 17. WHAT DO WE MONITOR: GLOBAL METRICS. Meta CPU utilization
  • 18. WHAT DO WE MONITOR: GLOBAL METRICS. Aggregated cluster status
  • 19. WHAT DO WE MONITOR: GLOBAL METRICS. Storage operational metrics
  • 20. WHAT DO WE MONITOR: LOCAL METRICS
  • 21. WHAT DO WE MONITOR: LOCAL METRICS
  • 22. USER ACCESS. We want to provide a limited amount of information to users. They don't need any information about triggers and issues, only metrics. We have patched Zabbix to remove all unnecessary data from guest access. [Screenshots: after and before]
  • 23. Benefits • Better understanding of global issues on the cluster and the reasons why they happened. • Great performance indicators for other infrastructure teams (especially the Storage team). • Performance tuning of scientific workflows and job profiling. In some cases the information we can get from Zabbix helps us significantly improve the performance of jobs. • Proactive monitoring. With Zabbix it's easier to understand if something is not right on the cluster or with some job. In most cases we are able to prevent global cluster issues, or at least minimize their impact. • One monitoring system for clusters and HPC infrastructure. • "All in one": less effort to support and maintain the monitoring system(s).
  • 24. WHAT'S NEXT? • Tight integration with Grid HPC software. • Data analysis using external tools, with Zabbix as the data source. • A set of CLI utilities for getting Zabbix statistics in a 'human-readable' format. • Automation of job profiling using the Zabbix API.
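The planned CLI utilities and API-based profiling would both start from a Zabbix API call. As a hedged sketch of what that might involve, the code below builds a JSON-RPC `history.get` request body and formats a raw byte count for people. It assumes a Zabbix 2.4-style API where the session token from `user.login` is passed in the `auth` field and the request is POSTed to the frontend's `api_jsonrpc.php`; the helper names and the item ID used in the usage note are hypothetical, and nothing is sent over the network here.

```python
import json


def history_request(auth_token, item_id, time_from, time_till,
                    value_type=0, request_id=1):
    """Build a JSON-RPC body asking for one item's history in a time window."""
    return {
        "jsonrpc": "2.0",
        "method": "history.get",
        "params": {
            "output": "extend",
            "history": value_type,  # 0 selects numeric (float) history
            "itemids": [item_id],
            "time_from": time_from,  # Unix timestamps
            "time_till": time_till,
            "sortfield": "clock",
            "sortorder": "ASC",
        },
        "auth": auth_token,
        "id": request_id,
    }


def human_readable(value, unit="B"):
    """Format a raw number with binary prefixes, e.g. bytes as GB."""
    for prefix in ("", "K", "M", "G", "T"):
        if abs(value) < 1024:
            return f"{value:.1f} {prefix}{unit}"
        value /= 1024
    return f"{value:.1f} P{unit}"
```

For example, `json.dumps(history_request(token, "23296", start, end))` yields the body to POST, and `human_readable(187 * 1024**3)` renders a database size as `187.0 GB`.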