This is a talk that I gave at the San Francisco DevOps meetup on 9/29/15. I talk about how Yelp performs service discovery using SmartStack and Docker.
Kafka Summit NYC 2017 - Running Hundreds of Kafka Clusters with 5 People (confluent)
Tom Crayford discusses his experience running hundreds of Apache Kafka clusters on Heroku with a small team. Some key points discussed include:
- Using automation to manage clusters and reduce manual work required
- Common issues encountered like disk growth from log compaction bugs and addressing them by scanning clusters for anomalies
- Kafka's built-in high availability and how it helped during an AWS EBS failure event
- Novel failure cases encountered like a JVM memory leak from gzip usage and working to fix it
- Importance of taking breaks and not wasting time when operating clusters at scale.
ZooKeeper - wait free protocol for coordinating processes (Julia Proskurnia)
ZooKeeper is a service for coordinating processes within distributed systems. The tool was stress tested. Reliable Multicast and dynamic LogBack system configuration management were implemented with ZooKeeper.
More details: http://proskurnia.in.ua/wiki/zookeeper_research
Docker and Maestro for fun, development and profit (Maxime Petazzoni)
Presentation on MaestroNG, an orchestration and management tool for multi-host container deployments with Docker.
#lspe meetup, February 20th, 2014 at Yahoo!'s URL café.
Introduction to ZooKeeper - TriHUG May 22, 2012 (mumrah)
Presentation given at TriHUG (Triangle Hadoop User Group) on May 22, 2012. Gives a basic overview of Apache ZooKeeper as well as some common use cases, 3rd party libraries, and "gotchas".
Demo code available at https://github.com/mumrah/trihug-zookeeper-demo
Wanting distributed volumes - Experiences with ceph-docker (Ewout Prangsma)
Slides of a Docker meetup presentation in Cologne (April 28, 2016).
The presentation talks about how to run Ceph in Docker containers and how to use Ceph filesystems as volumes for Docker containers that need persistent storage.
The document discusses switching from Nagios to Sensu for monitoring. Sensu separates concerns better by having specialized tools for each task like alerting vs graphing. Sensu is also more customizable and extensible. Configuration management systems can be used to define checks and subscriptions based on infrastructure roles. Sensu uses a decentralized model where checks are not tied to specific hosts. Checks can be routed to multiple handlers like PagerDuty and Graphite. Converting existing Nagios checks mainly involves updating the configuration management system to scope variables and add checks to roles.
This talk covers why Apache Zookeeper is a good fit for coordinating processes in a distributed environment, prior Python attempts at a client and the current state of the art Python client library, how unifying development efforts to merge several Python client libraries has paid off, features available to Python processes, and how to gracefully handle failures in a set of distributed processes.
This document discusses Knewton's use of ZooKeeper and PettingZoo to implement distributed machine learning on a Python cluster. It begins by explaining what ZooKeeper is and how it provides services for distributed synchronization. It then discusses the state of ZooKeeper libraries for Python, including incomplete bindings and lack of high-level recipes. PettingZoo is introduced as Knewton's library that implements common ZooKeeper recipes for Python, allowing their machine learning models to be sharded and scaled across multiple machines. Distributed discovery, distributed bags, leader queues, and role matching are highlighted as key recipes that enable dynamic reconfiguration and load balancing of their distributed system.
The complexity of a typical OpenNebula installation brings a special set of challenges on the monitoring side. In this talk, I will show monitoring of the full stack, from the physical servers to the storage layer and the ONE daemon. Providing an aggregated view of this information allows you to see the real impact of a certain failure. I would also like to present a use case for a “closed-loop” setup where new VMs are automatically added to the monitoring without human intervention, allowing for an efficient approach to monitoring the services an OpenNebula setup provides.
So we're running Apache ZooKeeper. Now What? By Camille Fournier Hakka Labs
The ZooKeeper framework was originally built at Yahoo! to make it easy for the company’s applications to access configuration information in a robust and easy-to-understand way, but it has since grown to offer a lot of features that help coordinate work across distributed clusters. Apache ZooKeeper has become a de facto standard coordination service and is used by Storm, Hadoop, HBase, ElasticSearch and other distributed computing frameworks.
Jörg Schad - NO ONE PUTS Java IN THE CONTAINER - Codemotion Milan 2017 (Codemotion)
The current craze of Docker has everyone sticking their processes inside a container… but do you really understand cgroups and how they work? Do you understand the difference between CPU Sets and CPU Shares? Spark is a Scala application that lives inside a Java Runtime; do you understand the impact the cgroup constraints have on the JRE? This talk starts with a deep understanding of Java’s memory management and GC characteristics and how JRE characteristics change based on core count. We will continue the talk looking at containers and how resource isolation works.
This document summarizes John Griffith's presentation about using Docker volume plugins with OpenStack Cinder block storage. Some key points:
- Griffith developed a Cinder volume plugin for Docker to provide persistent block storage to containers. This allows using existing Cinder backends without vendor lock-in.
- He demonstrated deploying a Swarm cluster on OpenStack using docker-machine and the built-in OpenStack driver. The Cinder plugin was installed on each node to enable volume provisioning.
- As a proof of concept, Griffith deployed a Redis service with a Cinder-backed volume for persistence, and a web frontend service, demonstrating stateful applications in containers with Swarm orchestration and Cinder storage.
The nova scheduler determines where to run virtual machine instances in OpenStack. It uses filters and weights to identify the best compute host from available information. An instance request is fulfilled by the scheduler selecting a host, informing the conductor, and having the compute node launch the instance. For large clouds, a horizontally scalable scheduler that uses flavor-based queues and avoids the database may improve performance. A Scheduler-as-a-Service project is also planned to provide a generic scheduler for other OpenStack components.
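The filter-and-weight selection described in the summary above can be illustrated with a minimal sketch. This is not nova's code: the `Host` fields, the two filter functions, the `ram_weight` weigher and the request dictionary are all hypothetical names chosen for illustration, assuming a request that only asks for RAM and vCPUs.

```python
from dataclasses import dataclass


@dataclass
class Host:
    # Hypothetical view of what a scheduler might know about a compute host.
    name: str
    free_ram_mb: int
    free_vcpus: int


def ram_filter(host, request):
    # Keep only hosts with enough free memory for the request.
    return host.free_ram_mb >= request["ram_mb"]


def vcpu_filter(host, request):
    # Keep only hosts with enough free vCPUs for the request.
    return host.free_vcpus >= request["vcpus"]


def ram_weight(host):
    # Prefer the host with the most free memory (an assumed policy).
    return host.free_ram_mb


def select_host(hosts, request, filters=(ram_filter, vcpu_filter), weigher=ram_weight):
    # Step 1: filtering removes hosts that cannot satisfy the request.
    candidates = [h for h in hosts if all(f(h, request) for f in filters)]
    if not candidates:
        return None
    # Step 2: weighing ranks the survivors and the best one is chosen.
    return max(candidates, key=weigher)


hosts = [
    Host("node-a", free_ram_mb=2048, free_vcpus=2),
    Host("node-b", free_ram_mb=8192, free_vcpus=1),
    Host("node-c", free_ram_mb=4096, free_vcpus=4),
]
request = {"ram_mb": 1024, "vcpus": 2}
chosen = select_host(hosts, request)
```

Here `node-b` is filtered out for lacking vCPUs, and `node-c` wins over `node-a` because it has more free memory.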
Red Hat Openstack and Ceph Meetup, Pune | 28th NOV 2015
Sadique Puthen, Principal Technical Support Engineer at Red Hat, Inc., gave an introduction to Red Hat Openstack (RDO) and its components. He discussed how Openstack provides infrastructure services like compute (Nova), storage (Cinder, Swift), networking (Neutron), and database (Trove) as a service. He also covered Openstack deployment options like Packstack, TripleO, and Ironic for bare metal provisioning. The meetup aimed to introduce Openstack components and services and their role in providing infrastructure as a service through a cloud platform.
Jacopo Nardiello - Monitoring Cloud-Native applications with Prometheus - Cod... (Codemotion)
We are going to talk about Prometheus and how to use it to monitor microservice-based "Cloud-Native" applications. We are going to dive deep into the Prometheus monitoring model; we will see which components are behind this system and how they integrate with each other to provide an efficient and modern monitoring system. We will also glance at Prometheus's native integrations for cloud-native environments such as Kubernetes.
You have amazing content and you want to get it to your users as fast as possible. In today’s industry, milliseconds matter and slow websites will never keep up. You can use a CDN but they are expensive, make you dependent on a third party to deliver your content, and can be notoriously inflexible. Enter Varnish, a powerful, open-source caching reverse proxy that lives in your network and lets you take control of how your content is managed and delivered. We’ll discuss how to install and configure Varnish in front of a typical web application, how to handle sessions and security, and how you can customize Varnish to your unique needs. This session will teach you how Varnish can help you give your users a better experience while saving your company and clients money at the same time.
Distributed system coordination by zookeeper and introduction to kazoo python... (Jimmy Lai)
ZooKeeper is a coordination tool that makes it easier to build distributed systems. In these slides, the author summarizes the usage of ZooKeeper and provides the Kazoo Python library as an example.
Ansible is an open source automation platform, written in Python, that can be used for configuration-management, application deployment, cloud provisioning, ad-hoc task-execution, multinode orchestration and so on. This talk is an introduction to Ansible for beginners, including tips like how to use containers to mimic multiple machines while iteratively automating some tasks or testing.
This document provides an overview of Terraform, an open-source infrastructure as code tool. It discusses Terraform's goals of providing a unified view of infrastructure, composing multiple tiers of infrastructure from IaaS to PaaS to SaaS, and safely changing infrastructure over time with one workflow. Key features highlighted include being open source, using infrastructure as code, resource providers that interface with cloud APIs, and the plan and apply workflow. The document also covers topics like collaboration and version history in Terraform Enterprise, file examples, the plan and apply commands, resource providers, and new features in recent Terraform versions like destroy provisioners, remote backends, state locking, and state environments.
Anatomy of the libvirt virtualization library
http://www.ibm.com/developerworks/library/l-libvirt/
libvirt
http://libvirt.org/index.html
Scheduling
http://docs.openstack.org/icehouse/config-reference/content/section_compute-scheduler.html
Openstack Zoning – Region/Availability Zone/Host Aggregate
https://kimizhang.wordpress.com/2013/08/26/openstack-zoning-regionavailability-zonehost-aggregate/
Availability Zones and Host Aggregates in OpenStack Compute (Nova)
http://blog.russellbryant.net/2013/05/21/availability-zones-and-host-aggregates-in-openstack-compute-nova/
An Introduction to Droplet Metadata
https://www.digitalocean.com/community/tutorials/an-introduction-to-droplet-metadata
HOW WE USE CLOUDINIT IN OPENSTACK HEAT
http://sdake.io/2013/03/03/how-we-use-cloudinit-in-openstack-heat/
How to inject file/meta/ssh key/root password/userdata/config drive to a VM during nova boot
https://kimizhang.wordpress.com/2014/03/18/how-to-inject-filemetassh-keyroot-passworduserdataconfig-drive-to-a-vm-during-nova-boot/
Cloud-init
https://cloudinit.readthedocs.org/en/latest/
OpenStack is an open source cloud computing platform that provides infrastructure as a service. It abstracts compute, storage, and networking resources from physical hardware into a dashboard that manages these resources as virtual machines, object storage, and virtual networks. OpenStack uses a central dashboard and various components like Nova (compute), Glance (images), Swift (object storage), Neutron (networking), and Keystone (identity) that can work with different underlying hardware and be deployed either publicly or privately. Neutron provides network as a service and tools for building advanced virtual networks using plugins that support technologies like Open vSwitch, Linux bridges, NSX, and OpenDaylight.
This document provides a summary of a presentation about using Docker volume plugins with OpenStack Cinder block storage.
The presentation discusses:
1. The speaker introducing themselves and their background with OpenStack Cinder.
2. An overview of the Docker volume plugin API and how the speaker created a Cinder volume plugin in Golang to provide block storage to Docker containers.
3. A demonstration of deploying a sample web application on a Docker Swarm cluster using the Cinder volume plugin to persist Redis data, showing how storage can be provided to containers across nodes.
This document provides an overview of using Prometheus for monitoring and alerting. It discusses using Node Exporters and other exporters to collect metrics, storing metrics in Prometheus, querying metrics using PromQL, and configuring alert rules and the Alertmanager for notifications. Key aspects covered include scraping configs, common exporters, data types and selectors in PromQL, operations and functions, and setting up alerts and the Alertmanager for routing alerts.
This document discusses MySQL and how it is used at Yelp. It provides an overview of MySQL's history and features. It then describes how Yelp uses over 100 MySQL servers with InnoDB and replication. Yelp utilizes tools like Puppet, Nagios, Ganglia, and Percona Toolkit to manage and monitor their MySQL infrastructure. The document also provides tips for using MySQL for new and existing projects, including suggestions for troubleshooting, backups, and community resources.
"Using ElasticSearch to Scale Near Real-Time Search" by John Billings (Presen... (Yelp Engineering)
The document discusses using ElasticSearch to enable fast and scalable search of reviews. It describes how ElasticSearch allows for tokenization, stemming, stop words removal and faceting to improve search performance compared to a basic SQL search. An example query and response show how ElasticSearch returns search results and highlights matching text. The document also briefly outlines how data could be indexed in ElasticSearch through a queueing system and how shards and replicas can provide replication and scalability. It closes by noting some potential performance issues to be aware of with ElasticSearch.
Scaling Traffic from 0 to 139 Million Unique Visitors (Yelp Engineering)
This document summarizes the traffic history and infrastructure changes at Yelp from 2005 to the present. It outlines the key milestones and technology changes over time as Yelp grew from handling around 200k searches per day with 1 database in 2005-2007 to serving traffic across 29 countries in 2014 with a distributed, scalable infrastructure utilizing technologies like Elasticsearch, Kafka, and Pyleus for real-time processing.
"Optimal Learning for Fun and Profit" by Scott Clark (Presented at The Yelp E... (Yelp Engineering)
Scott Clark gave a presentation on optimal learning techniques. He discussed multi-armed bandits, which address the challenge of collecting information efficiently from multiple options with unknown outcomes. He provided an example of exploring various slot machines to maximize rewards. Clark also discussed Bayesian global optimization and Yelp's Metrics Optimization Engine (MOE), which uses Gaussian processes to suggest optimal parameters for A/B tests based on past experiment results, in order to efficiently optimize metrics. MOE is now being used in Yelp's live experiments to continuously improve performance.
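The slot machine example in the summary above can be made concrete with a small simulation. The sketch below uses epsilon-greedy, one simple bandit strategy chosen here for illustration; the summary does not say which strategies the talk covered, and this is not MOE or Bayesian global optimization. The payout rates, `epsilon` value and function name are assumptions.

```python
import random


def epsilon_greedy(true_rates, pulls=5000, epsilon=0.1, seed=0):
    """Simulate pulling slot machines with hidden payout rates.

    Returns how many times each machine (arm) was pulled.
    """
    rng = random.Random(seed)
    counts = [0] * len(true_rates)    # pulls per arm
    totals = [0.0] * len(true_rates)  # accumulated reward per arm
    for _ in range(pulls):
        if rng.random() < epsilon or 0 in counts:
            # Explore: try a random arm (always, until every arm was tried once).
            arm = rng.randrange(len(true_rates))
        else:
            # Exploit: pick the arm with the best observed average reward.
            arm = max(range(len(true_rates)), key=lambda i: totals[i] / counts[i])
        # The arm pays out 1 with its hidden probability, else 0.
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


# Three hypothetical machines; the third has the best hidden payout rate.
pull_counts = epsilon_greedy([0.2, 0.5, 0.8])
```

The trade-off the sketch exposes is the one the summary names: every exploratory pull costs reward now but improves the estimate used for later pulls.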
This document provides guidance on how to conduct effective design critiques. It establishes that critiques should have clear roles for the presenter, audience, and facilitator. The feedback session should focus all participants on understanding the problem at hand before providing feedback. Constructive feedback should ask questions, build upon the design, and remain objective rather than being personal or critical. Laptops and phones should remain closed during the critique.
This document describes a project analyzing Yelp business data using an HDInsight Hadoop cluster on Azure. The project involves downloading Yelp data, converting it to CSV, loading it onto the cluster, and using HiveQL to query and visualize the data. Key aspects analyzed include business locations, categories, ratings over time, and reviews. Visualizations were created using PowerBI. The document outlines the cluster configuration, tools used, data processing flow, sample queries, and potential extensions like natural language processing.
This document discusses how scaling teams to support big data growth at Yelp can negatively impact deployment speed due to an exponential increase in the probability of failures as the number of developers increases. It proposes service-oriented architecture and focusing on mean time to recovery rather than just preventing failures as ways to mitigate these risks and maintain rapid iteration. Continuous delivery, reliable but not exhaustive testing, and treating all processes as distributed are also recommended to support scaling teams while preserving deployment speed.
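The relationship between team size and failure probability mentioned above can be illustrated with a back-of-the-envelope model. This is an assumed simplification, not necessarily the model used in the document: each change breaks a deployment independently with the same probability, so the chance that a batch of changes deploys cleanly shrinks exponentially with the number of changes. The function name and the 1% figure are hypothetical.

```python
def deploy_failure_probability(p_per_change, n_changes):
    """Probability that at least one of n independent changes breaks a deploy."""
    # All n changes must succeed for the deploy to be clean: (1 - p) ** n.
    return 1.0 - (1.0 - p_per_change) ** n_changes


# With a hypothetical 1% failure chance per change:
one_change = deploy_failure_probability(0.01, 1)         # 0.01
hundred_changes = deploy_failure_probability(0.01, 100)  # about 0.63
```

Under this model, a deploy that bundles a hundred such changes fails more often than it succeeds, which is one way to see why the summary stresses mean time to recovery over failure prevention alone.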
Building a smarter application Stack by Tomas Doran from Yelp (dotCloud)
This document discusses SmartStack, a solution for service discovery and load balancing in distributed systems, including Docker-based ones. It addresses problems like dynamically wiring dependent microservices and handling failures gracefully. SmartStack consists of Synapse, which generates HAProxy configurations for discovery, and Nerve, which registers services and checks health. Ambassadors provide simple connections for containers. It aims to reduce complexity compared to alternatives while working on traditional infrastructure, VMs, and Docker.
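The division of labour between the two components described above can be sketched in a few lines. This is a toy, in-memory illustration and not the real Nerve or Synapse: the function names, the `registry` dictionary (standing in for whatever shared registry a real deployment uses), the service name and the addresses are all hypothetical.

```python
def nerve_report(registry, service, address, healthy):
    """Nerve-like role: register an instance while healthy, drop it when not."""
    instances = registry.setdefault(service, set())
    if healthy:
        instances.add(address)
    else:
        instances.discard(address)


def synapse_render(registry, service):
    """Synapse-like role: turn registered instances into an HAProxy backend."""
    lines = [f"backend {service}"]
    for i, address in enumerate(sorted(registry.get(service, ()))):
        lines.append(f"    server {service}-{i} {address} check")
    return "\n".join(lines)


registry = {}
nerve_report(registry, "reviews", "10.0.0.5:8080", healthy=True)
nerve_report(registry, "reviews", "10.0.0.6:8080", healthy=True)
# A failed health check removes the instance from discovery.
nerve_report(registry, "reviews", "10.0.0.5:8080", healthy=False)
config = synapse_render(registry, "reviews")
```

After the failed health check, the rendered backend lists only `10.0.0.6:8080`, so clients going through the proxy stop reaching the unhealthy instance without any change on their side.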
The document discusses implementing a hybrid database solution using both MongoDB and MySQL. It describes storing less frequently changing and reference data like users and products in MongoDB for flexibility, while storing transactional data like orders and inventory counts in MySQL for ACID compliance. The system keeps the data in sync between the two databases using listeners that update MySQL whenever related data is created or changed in MongoDB.
Linux Performance Analysis: New Tools and Old Secrets (Brendan Gregg)
Talk for USENIX/LISA2014 by Brendan Gregg, Netflix. At Netflix performance is crucial, and we use many high to low level tools to analyze our stack in different ways. In this talk, I will introduce new system observability tools we are using at Netflix, which I've ported from my DTraceToolkit, and are intended for our Linux 3.2 cloud instances. These show that Linux can do more than you may think, by using creative hacks and workarounds with existing kernel features (ftrace, perf_events). While these are solving issues on current versions of Linux, I'll also briefly summarize the future in this space: eBPF, ktap, SystemTap, sysdig, etc.
Talk for PerconaLive 2016 by Brendan Gregg. Video: https://www.youtube.com/watch?v=CbmEDXq7es0 . "Systems performance provides a different perspective for analysis and tuning, and can help you find performance wins for your databases, applications, and the kernel. However, most of us are not performance or kernel engineers, and have limited time to study this topic. This talk summarizes six important areas of Linux systems performance in 50 minutes: observability tools, methodologies, benchmarking, profiling, tracing, and tuning. Included are recipes for Linux performance analysis and tuning (using vmstat, mpstat, iostat, etc), overviews of complex areas including profiling (perf_events), static tracing (tracepoints), and dynamic tracing (kprobes, uprobes), and much advice about what is and isn't important to learn. This talk is aimed at everyone: DBAs, developers, operations, etc, and in any environment running Linux, bare-metal or the cloud."
Broken benchmarks, misleading metrics, and terrible tools. This talk will help you navigate the treacherous waters of Linux performance tools, touring common problems with system tools, metrics, statistics, visualizations, measurement overhead, and benchmarks. You might discover that tools you have been using for years, are in fact, misleading, dangerous, or broken.
The speaker, Brendan Gregg, has given many talks on tools that work, including giving the Linux PerformanceTools talk originally at SCALE. This is an anti-version of that talk, to focus on broken tools and metrics instead of the working ones. Metrics can be misleading, and counters can be counter-intuitive! This talk will include advice for verifying new performance tools, understanding how they work, and using them successfully.
Video: https://www.youtube.com/watch?v=JRFNIKUROPE . Talk for linux.conf.au 2017 (LCA2017) by Brendan Gregg, about Linux enhanced BPF (eBPF). Abstract:
A world of new capabilities is emerging for the Linux 4.x series, thanks to enhancements that have been included in Linux for to Berkeley Packet Filter (BPF): an in-kernel virtual machine that can execute user space-defined programs. It is finding uses for security auditing and enforcement, enhancing networking (including eXpress Data Path), and performance observability and troubleshooting. Many new open source tools that have been written in the past 12 months for performance analysis that use BPF. Tracing superpowers have finally arrived for Linux!
For its use with tracing, BPF provides the programmable capabilities to the existing tracing frameworks: kprobes, uprobes, and tracepoints. In particular, BPF allows timestamps to be recorded and compared from custom events, allowing latency to be studied in many new places: kernel and application internals. It also allows data to be efficiently summarized in-kernel, including as histograms. This has allowed dozens of new observability tools to be developed so far, including measuring latency distributions for file system I/O and run queue latency, printing details of storage device I/O and TCP retransmits, investigating blocked stack traces and memory leaks, and a whole lot more.
This talk will summarize BPF capabilities and use cases so far, and then focus on its use to enhance Linux tracing, especially with the open source bcc collection. bcc includes BPF versions of old classics, and many new tools, including execsnoop, opensnoop, funcccount, ext4slower, and more (many of which I developed). Perhaps you'd like to develop new tools, or use the existing tools to find performance wins large and small, especially when instrumenting areas that previously had zero visibility. I'll also summarize how we intend to use these new capabilities to enhance systems analysis at Netflix.
Video: https://www.youtube.com/watch?v=FJW8nGV4jxY and https://www.youtube.com/watch?v=zrr2nUln9Kk . Tutorial slides for O'Reilly Velocity SC 2015, by Brendan Gregg.
There are many performance tools nowadays for Linux, but how do they all fit together, and when do we use them? This tutorial explains methodologies for using these tools, and provides a tour of four tool types: observability, benchmarking, tuning, and static tuning. Many tools will be discussed, including top, iostat, tcpdump, sar, perf_events, ftrace, SystemTap, sysdig, and others, as well observability frameworks in the Linux kernel: PMCs, tracepoints, kprobes, and uprobes.
This tutorial is updated and extended on an earlier talk that summarizes the Linux performance tool landscape. The value of this tutorial is not just learning that these tools exist and what they do, but hearing when and how they are used by a performance engineer to solve real world problems — important context that is typically not included in the standard documentation.
This talk discusses Linux profiling using perf_events (also called "perf") based on Netflix's use of it. It covers how to use perf to get CPU profiling working and overcome common issues. The speaker will give a tour of perf_events features and show how Netflix uses it to analyze performance across their massive Amazon EC2 Linux cloud. They rely on tools like perf for customer satisfaction, cost optimization, and developing open source tools like NetflixOSS. Key aspects covered include why profiling is needed, a crash course on perf, CPU profiling workflows, and common "gotchas" to address like missing stacks, symbols, or profiling certain languages and events.
Basically everything you need to get started on your Zookeeper training, and setup apache Hadoop high availability with QJM setup with automatic failover.
This summary provides an overview of the lightning talks presented at the NetflixOSS Open House:
- Jordan Zimmerman from Netflix presented on several NetflixOSS projects he works on including Curator, a Java library that makes using ZooKeeper easier, and Blitz4j, an asynchronous logging library that improves performance over Log4j.
- Additional talks covered Eureka, a REST service for discovering middle-tier services; Ribbon for load balancing between middle-tier instances; Archaius for dynamic configuration; Astyanax for interacting with Cassandra; and various other NetflixOSS projects.
- The talks highlighted the motivation for these projects including addressing challenges of scaling for Netflix's large data
Troubleshooting common oslo.messaging and RabbitMQ issuesMichael Klishin
This document discusses common issues with oslo.messaging and RabbitMQ and how to diagnose and resolve them. It provides an overview of oslo.messaging and how it uses RabbitMQ for RPC calls and notifications. Examples are given of where timeouts could occur in RPC calls. Methods for debugging include enabling debug logging, examining RabbitMQ queues and connections, and correlating logs from services. Specific issues covered include RAM usage, unresponsive nodes, rejected TCP connections, TLS connection failures, and high latency. General tips emphasized are using tools to gather data and consulting log files.
This document discusses various approaches to implementing high availability (HA) in OpenStack including active/active and active/passive configurations. It provides an overview of HA techniques used at Deutsche Telekom and eBay/PayPal including load balancing APIs and databases, replicating RabbitMQ and MySQL, and configuring Pacemaker/Corosync for OpenStack services. It also discusses lessons learned around testing failures, placing services across availability zones, and having backups for HA infrastructures.
Comparison between zookeeper, etcd 3 and other distributed coordination systemsImesha Sudasingha
This is a comparison between popular distributed coordination systems including zookeeper (which powers Apache Hadoop), etcd 3 (which powers Kubernetes), consul and hazelcast. This comparison was made in second half of 2016. Therefore, please note that some of these technologies have improved immensely over the time. Anyway, this presentation will provide an initial idea of each distributed coordination systems.
This document discusses scaling up logging and metrics in OpenShift Container Platform (OCP). It provides an overview of the logging stack including Elasticsearch, Fluentd, and Kibana. It also summarizes the metrics stack including Cassandra, Heapster, and Hawkular. The document outlines testing done to evaluate limits and scaling of these components on large OCP clusters with thousands of nodes and pods. It provides recommendations for configuring and deploying the infrastructure to support high throughput logging and metrics collection.
Como creamos QuestDB Cloud, un SaaS basado en Kubernetes alrededor de QuestDB...javier ramirez
QuestDB es una base de datos open source de alto rendimiento. Mucha gente nos comentaba que les gustaría usarla como servicio, sin tener que gestionar las máquinas. Así que nos pusimos manos a la obra para desarrollar una solución que nos permitiese lanzar instancias de QuestDB con provisionado, monitorización, seguridad o actualizaciones totalmente gestionadas.
Unos cuantos clusters de Kubernetes más tarde, conseguimos lanzar nuestra oferta de QuestDB Cloud. Esta charla es la historia de cómo llegamos ahí. Hablaré de herramientas como Calico, Karpenter, CoreDNS, Telegraf, Prometheus, Loki o Grafana, pero también de retos como autenticación, facturación, multi-nube, o de a qué tienes que decir que no para poder sobrevivir en la nube.
Practice and challenges from building IaaSShawn Zhu
It is an invited presentation for NCSC2012 (China National Conference on Social Computing) on cloud computing from industry.
It summarized what we learn on developing and operating an Infrastructure as a Service in a highly scalable manner. The service described inside the corporation is kind of dogfood that engineers work with in their daily work.
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
A presentation on how applying Cloud Architecture Patterns using Docker Swarm as orchestrator is possible to create reliable, resilient and scalable FIWARE platforms.
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Monal Daxini
Netflix Keystone Pipeline processing 600 billion events a day, and detailed treatise on the modification of and use of Samza for real time routing of events including docker.
We are using Elasticsearch to power the search feature of our public frontend, serving 10k queries per hour across 8 markets in SEA.
Here we are sharing our experiences of running Elasticsearch on Kubernetes, presenting our general setup, configuration tweaks and possible pitfalls.
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst ITOpenStack
Audience: Advanced
About: Real world lessons and war stories about Catalyst IT’s experience in rolling out an OpenStack based public cloud in New Zealand.
This presentation will provide tips and advice that may save you a lot of time, money and nights of sleep if you are planning to run OpenStack in the future. It may also bring some insights to people that are already running OpenStack in production.
Topics covered will include: selection of hardware for optimal costs, techniques that drive quality and service levels up, common deployment mistakes, in place upgrades, how to identify the maturity level of each project and decide what is ready for production, and much more!
Speaker Bio: Bruno Lago – Entrepreneur, Catalyst IT Limited
Bruno Lago is a solutions architect that has been involved with the Catalyst Cloud (New Zealand’s first public cloud based on OpenStack) from its inception. He is passionate about open source software, cloud computing and disruptive technologies.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
OpenStack is an open source cloud operating system that provides on-demand provisioning of compute, storage, and networking resources. It consists of several interconnected components that are managed through a dashboard interface. The key components include Horizon (dashboard), Keystone (authentication), Swift (object storage), Glance (image repository), Nova (compute), Quantum (networking), and Cinder (block storage). Nova is responsible for running virtual machine instances by retrieving images from Glance and scheduling instances on compute hosts using the Nova scheduler. The Nova scheduler uses filters and weights to determine the most suitable host for an instance based on availability, capabilities, and load.
Highly Available Load Balanced Galera MySql ClusterAmr Fawzy
Describing the major principles of well designed cloud system application including high availability and load balancing as well as implementing highly available load balanced galera mysql cluster
Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. It provides mechanisms for scheduling containers, load balancing, storage orchestration, and declarative deployments. The document provides examples of how Kubernetes can help manage containerized applications through concepts like pods, services, replication controllers, deployments, jobs, secrets and configmaps. It also compares Kubernetes to other orchestration systems and container platforms like OpenShift, AWS ECS, Azure Container Service and OpenStack.
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses the evolution of Netflix's data pipeline to incorporate Kafka to handle 400 billion events per day. It describes how Netflix uses Kafka clusters with different priorities and configurations. It also outlines some of the challenges of using Kafka at Netflix's scale, such as Zookeeper client issues and cluster scaling, and the solutions Netflix developed to address these challenges.
This document provides an overview of microservices and how to develop them using Spring. It discusses the challenges of distributed systems and how Spring Boot and Spring Cloud Netflix address areas like configuration, service registration, load balancing, fault tolerance, and monitoring. Examples are provided for building microservices with Spring Boot, integrating configuration with Spring Cloud Config, registering services with Eureka, load balancing with Ribbon and Feign, handling faults with Hystrix, and monitoring with Hystrix Dashboard. Reactive programming with RxJava is also introduced as an approach for concurrent API integration.
3. ● This works from (almost) any host in Yelp
● This works from Python, Java, command line etc.
● If a service supports HTTP or TCP then it can be made discoverable.
○ This includes third-party services such as MySQL and scribe
● It’s dynamic: for a given service, if new instances are added then they
will automatically become available.
Very Important Things to Note
4. ● SmartStack (nerve and synapse) was written by Airbnb
● We’ve added some features
● The work here has been carried out by many people across Yelp
Credits
7. Nerve registers service instances in ZooKeeper:
/nerve/region:myregion
├── service_1
│ └── server_1_0000013614
├── service_2
│ └── server_1_0000000959
├── service_3
│ ├── server_1_0000002468
│ └── server_2_0000002467
[...]
ZooKeeper data
8. The data in a znode is all that is required to connect to the corresponding
service instance.
We’ll shortly see how this is used for discovery.
{
"host":"10.0.0.123",
"port":31337,
"name":"server_1",
"weight":10
}
ZooKeeper data
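As a minimal sketch of how a client could turn such a znode payload into a connectable endpoint: the function name `endpoint_from_znode` is hypothetical and only the Python standard library is used. It assumes the payload is UTF-8 encoded JSON with the `host` and `port` fields shown above.

```python
import json


def endpoint_from_znode(data: bytes) -> tuple[str, int]:
    """Return (host, port) from a registration payload like the one above."""
    record = json.loads(data.decode("utf-8"))
    return record["host"], int(record["port"])


# Example payload, copied from the slide (hypothetical values).
payload = b'{"host": "10.0.0.123", "port": 31337, "name": "server_1", "weight": 10}'
print(endpoint_from_znode(payload))
```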
9. hacheck
Normally hacheck just acts as a transparent proxy for our healthchecks:
$ curl -s yocalhost:6666/http/service_1/1234/status | jq .
{
"uptime": 5693819.315988064,
"pid": 2595160,
"host": "server_1",
"version": "b6309e09d71da8f1e28213d251f7c3515878caca"
}
10. hacheck
We can also use it to fail healthchecks before we shut down a service.
This allows us to gracefully shutdown a service.
(Also provides a 1s cache to limit healthcheck rate.)
$ hadown service_1
$ curl -v yocalhost:6666/http/service_1/1234/status
Service service_1 in down state since 1443217910: billings
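The two behaviours described on these slides (failing healthchecks for a service marked down, and caching results for 1s) can be modelled roughly as follows. This is a simplified illustration, not hacheck's actual implementation: the class `HealthcheckProxy` and its methods are hypothetical names, and the real check is passed in as a plain callable.

```python
import time


class HealthcheckProxy:
    """Toy model of a healthcheck proxy with a down switch and a short cache."""

    def __init__(self, check, ttl=1.0, clock=time.monotonic):
        self._check = check    # callable(service) -> bool, the real healthcheck
        self._ttl = ttl        # how long a result is reused, in seconds
        self._clock = clock    # injectable clock, handy for testing
        self._down = {}        # service -> reason it was marked down
        self._cache = {}       # service -> (expires_at, healthy)

    def mark_down(self, service, reason=""):
        """Fail all healthchecks for this service (analogous to `hadown`)."""
        self._down[service] = reason

    def mark_up(self, service):
        """Resume forwarding healthchecks for this service."""
        self._down.pop(service, None)

    def is_healthy(self, service):
        if service in self._down:
            return False       # fail without touching the service
        now = self._clock()
        cached = self._cache.get(service)
        if cached is not None and cached[0] > now:
            return cached[1]   # reuse the recent result
        healthy = self._check(service)
        self._cache[service] = (now + self._ttl, healthy)
        return healthy
```

Marking a service down before stopping it lets the load balancer drain traffic first, while the cache keeps many pollers from multiplying the load on the service itself.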
11. configure_nerve.py
How do we know what services to advertise? Every service host
periodically runs a script to regenerate the nerve configuration, reading
from the following sources:
● yelpsoa-configs
runs_on:
  - server_1
  - server_2
● puppet
nerve_simple::puppet_service { 'foo': }
● mesos slave API
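For the yelpsoa-configs source, deciding what to advertise amounts to checking which services list the local host under `runs_on`. A rough sketch, assuming the configs have already been loaded into a dictionary; the function name and the in-memory layout are hypothetical, and the puppet and mesos sources are not modelled:

```python
def services_to_advertise(hostname, soa_configs):
    """Return the sorted names of services whose runs_on list includes hostname."""
    return sorted(
        name
        for name, config in soa_configs.items()
        if hostname in config.get("runs_on", [])
    )


# Hypothetical parsed configuration, keyed by service name.
SOA_CONFIGS = {
    "service_1": {"runs_on": ["server_1"]},
    "service_2": {"runs_on": ["server_2"]},
    "service_3": {"runs_on": ["server_1", "server_2"]},
}

print(services_to_advertise("server_1", SOA_CONFIGS))
```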
14. HAProxy
● By default bind to 0.0.0.0
● Bind only to yocalhost on public servers.
● HAProxy gives us a lot of goodies for all clients:
○ Redispatch on connection failures
○ Zero-downtime restarts (once you know how :)
○ Easy to insert connection logging
● Each host also exposes an HAProxy status page for easy introspection
15. configure_synapse.py
Every client host periodically runs a script to regenerate the synapse
configuration, reading service definitions from yelpsoa-configs.
For each service, it reads a smartstack.yaml file.
Restarts synapse if configuration has changed.
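The "restart only if the configuration has changed" step could look something like the sketch below. It is an illustration under assumptions, not the actual configure_synapse.py: the function name `write_if_changed` is hypothetical, the configuration is assumed to be serialised as JSON, and the caller is expected to restart synapse when the function returns True.

```python
import json
import os
import tempfile


def write_if_changed(path, config):
    """Write config to path as JSON; return True only if the file content changed."""
    new_text = json.dumps(config, indent=2, sort_keys=True)
    try:
        with open(path, encoding="utf-8") as f:
            if f.read() == new_text:
                return False  # nothing changed, so no restart is needed
    except FileNotFoundError:
        pass  # first run: fall through and write the file
    # Write to a temporary file in the same directory, then rename it into
    # place so readers never observe a half-written configuration.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(new_text)
    os.replace(tmp_path, path)
    return True
```

Serialising with `sort_keys=True` keeps the output stable, so a reordering of dictionary keys alone does not trigger a restart.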
18. Escape hatch
Some client libraries like to do their own load balancing e.g. cassandra,
memcached. Use synapse to dump the registration information to disk:
$ cat /var/run/synapse/services/devops.demo.json | jq .
[
{
"host":"10.0.0.123",
"port":31337,
"name":"server_1",
"weight":10
}
]
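A client library that does its own load balancing could consume this file along the following lines. This is a sketch only: the function names are hypothetical, and treating `weight` as a relative selection weight is an assumption about how a client might use that field.

```python
import json
import random


def load_backends(path):
    """Read the list of registered instances that synapse dumped to disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def pick_backend(backends, rng=random):
    """Pick one (host, port), choosing randomly in proportion to each weight."""
    weights = [backend.get("weight", 1) for backend in backends]
    chosen = rng.choices(backends, weights=weights, k=1)[0]
    return chosen["host"], chosen["port"]


# In-memory example with the same shape as the file shown above.
backends = [{"host": "10.0.0.123", "port": 31337, "name": "server_1", "weight": 10}]
print(pick_backend(backends))
```

In practice the client would call `load_backends` on a path such as the one shown above and reload it periodically. Note that `random.choices` raises `ValueError` if every weight is zero, so a real client would need to handle that case.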
20. Architecture
[Diagram: one haproxy runs on the host. Host interfaces: lo 127.0.0.1, lo:0 169.254.255.254, eth0 10.0.1.2, and the docker0 bridge 169.254.1.1. Docker containers 1 and 2 each have their own lo 127.0.0.1, with eth0 addresses 169.254.14.17 and 169.254.14.18.]
21. yocalhost
● We’d like to run only one nerve / synapse / haproxy per host
● What address should we bind haproxy to?
● 127.0.0.1 won’t work from within a container
● Instead we pick a link-local address 169.254.255.254 (yocalhost)
● This also works on servers without docker
23. Overview
We run services both in our own datacenters and in AWS.
We logically group these environments according to latency.
Service authors get to decide how ‘widely’ their service instances are
advertised.
Everything is controlled via smartstack.yaml files.
25. advertise / discover
main:
  proxy_port: 20973
  advertise: [habitat]
  discover: habitat
● advertise: Nerve should register this service in the habitat directory of its local ZooKeeper
● discover: Synapse should look in the habitat directory in its local ZooKeeper
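From the client's point of view, the settings above boil down to one fixed address per service. The sketch below assumes that clients reach a service through the local HAProxy on the yocalhost address (169.254.255.254, from the yocalhost slide) at the service's `proxy_port`; the helper name `service_url` and the use of plain HTTP are illustrative assumptions.

```python
YOCALHOST = "169.254.255.254"  # link-local address that haproxy binds to


def service_url(namespace_config, path="/"):
    """Build the local proxy URL for a service from its smartstack.yaml section."""
    return f"http://{YOCALHOST}:{namespace_config['proxy_port']}{path}"


# The "main" section from the example above, as a parsed dictionary.
main = {"proxy_port": 20973, "advertise": ["habitat"], "discover": "habitat"}
print(service_url(main, "/status"))
```

Because the address is the same on every host and inside every container, the client needs no discovery logic of its own.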
27. Extra advertisements
“Wouldn’t it be useful if we could make a service running in datacenter A
available in an (arbitrary) datacenter B?”
Why?
● Makes it easier to bring up a new datacenter
● Makes it easier to add more capacity to a datacenter in an emergency
● Makes it easier to keep a datacenter going in an emergency if a service
fails
30. Unix 4eva
● Lots of little components, each doing one thing well
● Very simple interface for clients and services
○ If it speaks TCP or HTTP we can register it
● Easy to independently replace components
○ HAProxy -> NGINX?
● Easy to observe behavior of components
31. It’s OK if ZooKeeper fails
● Nerve and Synapse keep retrying
● HAProxy keeps running but with no updates
● HAProxy performs its own healthchecks against service instances
○ If a service instance becomes unavailable then it will stop receiving
traffic after a short period
● The website stays up :)
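The "keep retrying" behaviour can be pictured as a loop with capped exponential backoff. This is a generic sketch, not code from Nerve or Synapse: the function name, the delays, and the use of `ConnectionError` are all assumptions made for illustration.

```python
import time


def retry_until_connected(connect, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure."""
    delay = base_delay
    while True:
        try:
            return connect()
        except ConnectionError:
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait between attempts


# Demo: a connection that fails twice before succeeding (no real sleeping).
attempts = {"count": 0}


def flaky_connect():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("ZooKeeper unavailable")
    return "connected"


print(retry_until_connected(flaky_connect, sleep=lambda seconds: None))
```

While such a loop is retrying, the load balancer simply keeps serving from its last known configuration, which is why a ZooKeeper outage does not take traffic down.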
32. Does it scale?
● Used to have scaling issues with internal load balancers, this is not a
problem with SmartStack :)
● Hit some scaling issues at 10s of thousands of ZooKeeper connections
○ Addressed this by using just a single ZooKeeper connection from
each nerve and synapse
● Used to have lots of HAProxy healthchecks hitting services
○ hacheck insulates services from this
○ We limit HAProxy restart rate
33. What about etcd / consul / …?
● We try to use boring components :)
● We’re already using ZooKeeper for Kafka and Elasticsearch so it’s
natural to use it for our service discovery system too.
● etcd would probably also work, and is supported by SmartStack
● Conceptually similar to consul / consul-template
34. What about DNS?
● What TTL are you going to use?
● Are your clients even going to honor the TTL?
● Does the DNS resolution happen inline with requests?
35. Conclusions
● We’ve used SmartStack to create a robust service discovery system
● It’s UNIXy: lots of separate components, each doing one thing well
● It’s flexible: locality-aware discovery
● It’s reliable: new devs at Yelp view discovery as a solved problem
● It’s useful: SmartStack is the glue that holds our SOA together