Building and Scaling a WebSockets Pubsub System
Kapil Reddy @ Helpshift
About me - Kapil
Staff Engineer @ Helpshift
Clojure
Distributed Systems
Games
Music
Books/Comics
Football
Helpshift is a mobile CRM SaaS product. We help app developers connect with their customers, since everything is now on mobile.
Scale
• ~2 TB data broadcast / day
• Outgoing - 75 k msg/sec
• Incoming - 1.5 k msg/sec
• Concurrency - 3.5k
Here are some scale numbers for the platform we have built.
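As a rough sanity check on these figures, the sketch below derives the implied average outgoing throughput and message size. It rests on two assumptions the slides do not confirm: that 2 TB means 2 × 10^12 bytes, and that 75k msg/sec is a sustained average rather than a peak.

```python
# Back-of-the-envelope check of the scale numbers.
# Assumptions (not stated in the talk): 1 TB = 10**12 bytes, and the
# outgoing rate of 75k msg/sec is a sustained average, not a peak.
SECONDS_PER_DAY = 24 * 60 * 60


def average_bytes_per_second(bytes_per_day: float) -> float:
    """Average throughput if the daily volume were spread evenly."""
    return bytes_per_day / SECONDS_PER_DAY


def average_message_size(bytes_per_day: float, msgs_per_second: float) -> float:
    """Average bytes per message at the given sustained message rate."""
    return average_bytes_per_second(bytes_per_day) / msgs_per_second


if __name__ == "__main__":
    daily_bytes = 2 * 10**12
    print(f"{average_bytes_per_second(daily_bytes) / 1e6:.1f} MB/s outgoing")
    print(f"{average_message_size(daily_bytes, 75_000):.0f} bytes per message")
```

Under those assumptions the average works out to roughly 23 MB/s outgoing and about 300 bytes per message. If 75k msg/sec is a peak, the real average message is larger.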
PubSub Platform
We built a generic publish-and-subscribe platform. Subscribers are JavaScript clients listening on WebSocket connections, and publishers are any backend servers that use ZMQ to publish messages.
A simplified version of the platform’s architecture. Browsers (subscribers) connect to Dirigent using WebSockets, and backend servers (publishers) connect to Dirigent using ZMQ.
Zooming in a bit, we see that the architecture contains two different types of services. They also talk to each other internally using ZMQ. ZooKeeper is used for coordination between Dirigent services.
We also have multiple clusters, and they can talk to each other. Each cluster has its own set of subscribers. Publishers can come from another cluster.
Evolution
In v1 of the platform we used different transport mechanisms: HTTP streaming to deliver messages to browsers, and plain HTTP to deliver messages to Dirigent servers. HTTP posed a problem because it coupled the platform to the backend servers. Whenever the Dirigent platform went down under load, the HTTP connections timed out and created a cascading failure in the backend servers. We switched to ZMQ there.
Problems with HTTP streaming
A browser client needs only a subset of the data, but unsubscribing from topics and subscribing to new ones was not possible over HTTP streaming because it is a unidirectional channel. The only option was to push everything to all clients for a specific subdomain. Initially that sounded like a good idea, but once we hit scale we were running out of network bandwidth per machine. We switched to WebSockets, where the client can ask for specific information based on UI actions.
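The difference a bidirectional channel makes can be sketched as a small topic registry driven by client messages. This is an illustration only: the JSON message shape (`action`, `topic`), the `TopicRegistry` class, and the connection ids are hypothetical, and the talk does not describe Dirigent's actual protocol.

```python
import json
from collections import defaultdict


class TopicRegistry:
    """Tracks which connections are subscribed to which topics."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # topic -> set of connection ids

    def subscribe(self, conn_id, topic):
        self._subscribers[topic].add(conn_id)

    def unsubscribe(self, conn_id, topic):
        self._subscribers[topic].discard(conn_id)
        if not self._subscribers[topic]:
            # Drop empty topics so the registry does not grow without bound.
            del self._subscribers[topic]

    def recipients(self, topic):
        """Connections that should receive a message published on topic."""
        return set(self._subscribers.get(topic, ()))


def handle_client_message(registry, conn_id, raw):
    """Apply a subscribe/unsubscribe request sent by a browser client.

    The message format is a hypothetical one chosen for this sketch.
    """
    msg = json.loads(raw)
    if msg["action"] == "subscribe":
        registry.subscribe(conn_id, msg["topic"])
    elif msg["action"] == "unsubscribe":
        registry.unsubscribe(conn_id, msg["topic"])
    else:
        raise ValueError(f"unknown action: {msg['action']!r}")
```

With a registry like this, a publish only fans out to `recipients(topic)` instead of every client on the subdomain, which is where the bandwidth saving comes from.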
Under the hood
• Clojure (JVM)
• Http-kit (NIO based web sockets server)
• ZMQ
• Zookeeper
Monitoring
All the messages we publish are important data and need to be rendered in time. The data is ephemeral by nature: we don’t store it anywhere, so auditing is hard. That made monitoring crucial for us.
Under the hood
• StatsD protocol
• Graphite - Storage
• Grafana - Frontend
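For readers unfamiliar with the StatsD protocol, the sketch below formats a counter and a timing in the StatsD plain-text line format and sends them over UDP. The metric names, host, and port are assumptions for illustration (8125 is the conventional StatsD port); the talk does not show the metrics Dirigent actually emits.

```python
import socket


def format_counter(name, value=1):
    """StatsD counter line, e.g. 'dirigent.ws.published:1|c'."""
    return f"{name}:{value}|c"


def format_timing(name, millis):
    """StatsD timing line in milliseconds, e.g. 'dirigent.ws.publish_time:12|ms'."""
    return f"{name}:{millis}|ms"


def send_metric(line, host="127.0.0.1", port=8125):
    """Send one metric line over UDP, fire-and-forget.

    Host and port are assumptions for this sketch.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric names.
    send_metric(format_counter("dirigent.ws.published"))
    send_metric(format_timing("dirigent.ws.publish_time", 12))
```

Because the transport is UDP, emitting a metric never blocks the message path, which matters when the thing being measured is delivery latency.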
*Example of monitoring: comparison of different stages*
Since auditing this kind of data is hard, we compare metrics for the data at different stages of the platform. But because the numbers are big, it’s hard to spot an anomaly in the raw counts. What we are looking for is variance.
Message variance is easy to parse visually. If variance is low, some stage of the platform is dropping data. We have also set up alerts on this same query.
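One way to read the stage comparison is as a per-stage loss check. The sketch below is an assumption about how such a check could look, not the query used in the talk: the stage names, the 1% tolerance, and the idea that counts at adjacent stages are directly comparable (measured after fan-out) are all hypothetical.

```python
def relative_drops(stage_counts):
    """Return (from_stage, to_stage, fraction_lost) for each adjacent pair.

    stage_counts is an ordered list of (stage_name, message_count) tuples
    covering the same time window.
    """
    results = []
    for (src, src_count), (dst, dst_count) in zip(stage_counts, stage_counts[1:]):
        lost = 0.0 if src_count == 0 else (src_count - dst_count) / src_count
        results.append((src, dst, lost))
    return results


def stages_dropping(stage_counts, tolerance=0.01):
    """Adjacent stage pairs that lose more than `tolerance` of their messages."""
    return [
        (src, dst)
        for src, dst, lost in relative_drops(stage_counts)
        if lost > tolerance
    ]


if __name__ == "__main__":
    # Hypothetical stage names and counts for one time window.
    window = [("routed", 1000), ("queued", 1000), ("written", 950)]
    print(stages_dropping(window))
```

Normalising to a fraction is what makes the anomaly visible: a loss of 50 messages is invisible next to raw counts in the tens of thousands per second, but a 5% drop between two stages stands out and is easy to alert on.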
Another important metric is the time taken to publish a message to a WebSocket connection. Since the near-real-time SLA is so important, we look at p99s for anomalies. We have set up alerts on these as well.
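To show why p99 is a better signal than an average here, the sketch below computes a nearest-rank percentile over a window of publish latencies. The sample values are made up, and in the setup described in the talk the percentile would come from the metrics stack rather than application code.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Made-up latencies in milliseconds: most publishes are fast, 2% are slow.
    latencies_ms = [10] * 98 + [900] * 2
    mean = sum(latencies_ms) / len(latencies_ms)
    print(f"mean={mean:.1f} ms, p99={percentile(latencies_ms, 99)} ms")
```

In this made-up window the mean stays under 30 ms while the p99 is 900 ms, so an alert on the mean would miss the clients that are actually seeing stale data.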
Cost saving
Costs are always a concern for us! Two important factors add up to the cost: outgoing bandwidth usage and the number of machines.
Compression
First we started using gzip compression for WebSockets. It’s a standard compression mechanism supported by browsers, but as always with browsers, there are quirks.
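The potential bandwidth saving can be estimated offline before enabling compression. On the wire, WebSocket compression is negotiated through the permessage-deflate extension (RFC 7692), which uses the same DEFLATE algorithm as gzip. The sketch below does not negotiate anything; it only compresses a made-up JSON payload with `zlib` to show the kind of ratio repetitive messages can reach. The payload fields are hypothetical.

```python
import json
import zlib


def deflate(payload: bytes) -> bytes:
    """Compress a payload with DEFLATE (zlib container, default-ish level)."""
    return zlib.compress(payload, 6)


def inflate(compressed: bytes) -> bytes:
    """Reverse of deflate()."""
    return zlib.decompress(compressed)


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(deflate(payload)) / len(payload)


def sample_payload() -> bytes:
    # Hypothetical message batch; real payloads will compress differently.
    events = [
        {"topic": "issue.updates", "issue_id": i, "status": "open", "unread": True}
        for i in range(50)
    ]
    return json.dumps(events).encode("utf-8")


if __name__ == "__main__":
    payload = sample_payload()
    print(f"{len(payload)} bytes -> {len(deflate(payload))} bytes "
          f"(ratio {compression_ratio(payload):.2f})")
```

Very small messages can compress poorly or even grow, so measuring on real traffic samples is worth doing before assuming a saving.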
Re-visiting features
The biggest change you can make to save costs is to revisit the features and business logic themselves and optimise there. This reduced our bandwidth usage by a significant amount.
Auto scaling
To save on the number of machines used, we started investigating how to do auto scaling. Auto scaling was not straightforward, since all the connections are long-running and can stay alive for as long as 8 hours.
HAProxy with least conn
We went with the obvious choice of least connection with HAProxy doing the load balancing.
Least load connection works. Sometimes.
The problem with least-connection balancing is the assumption that the number of connections a server is handling is directly proportional to the amount of work it is doing. That assumption was wrong, and it led to uneven distribution, server crashes, and bad, sleepless nights.
Feedback load balancing
Feedback load balancing is something we started doing with Herald, an internal tool we built at Helpshift. It helps HAProxy decide which server to choose when routing a new connection. All the servers expose their current load to Herald, which in turn tells HAProxy which server to choose. If all servers are loaded, we scale out. If all servers are underloaded, we scale in.
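The contrast between least-connection and feedback-based selection, and the scaling rule described above, can be sketched as follows. This is not Herald's implementation, which the talk does not show: the `Server` fields, the normalised 0.0 to 1.0 load value, and the `high`/`low` thresholds are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    connections: int  # open WebSocket connections
    load: float       # self-reported load, assumed normalised to 0.0..1.0


def pick_least_conn(servers):
    """Least-connection choice: ignores how busy each connection is."""
    return min(servers, key=lambda s: s.connections)


def pick_least_load(servers):
    """Feedback choice: route the new connection to the least loaded server."""
    return min(servers, key=lambda s: s.load)


def scaling_decision(servers, high=0.8, low=0.3):
    """Scale out when every server is loaded, scale in when every server is idle.

    The thresholds are hypothetical.
    """
    if all(s.load >= high for s in servers):
        return "scale_out"
    if all(s.load <= low for s in servers):
        return "scale_in"
    return "hold"


if __name__ == "__main__":
    fleet = [
        Server("ws-1", connections=100, load=0.9),  # few, very chatty clients
        Server("ws-2", connections=400, load=0.2),  # many, mostly idle clients
    ]
    print(pick_least_conn(fleet).name)  # the overloaded server
    print(pick_least_load(fleet).name)  # the server with spare capacity
    print(scaling_decision(fleet))
```

In the example fleet, least-connection routing keeps sending new clients to the server that is already struggling, which is the failure mode described on the previous slide.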
Summary
• Building a WebSockets infrastructure on EC2 is possible, but it has quirks
• Use feedback load balancing for WebSockets / long-running connection traffic
• ZMQ and the JVM are solid building blocks for a realtime pubsub platform
• Instrumentation at multiple stages of the platform is a good way to keep track of a realtime system
