The Challenge of 100k Concurrent WebSocket Users — and How to Solve It. Part 1.
From concept to PoC: building a scalable WebSocket architecture
Foreword
Handling a large number of users has become an increasingly ordinary task for developers and architects, but that does not make the problem any less complex. This article describes the architecture design approach and the Proof of Concept (development, testing, justification) of a solution that can handle 100k users in parallel, providing real-time communication with the ability to scale.
The article is split into two parts.
The first part covers architecture, design, and analysis.
Introduction
The customer operates in the transportation services domain, and one of the core features of the project is real-time GPS tracking of vehicles. The system is already running in production, and active development is ongoing.
The system is deployed on the AWS Cloud platform. GPS events from the vehicles are collected and sent to an AWS MSK topic. From there, GPS messages are distributed via WebSocket to users who are subscribed to the corresponding vehicles.
Problem Analysis and Architectural Motivation
We started receiving complaints from end users who were no longer seeing real-time vehicle movement in the application after new vehicles were added and new users were onboarded. The system was growing fast: lots of new vehicles, lots of new users :)
After investigation, we discovered an enormous Kafka lag (1–2 billion messages), which caused significant delays in GPS data distribution over WebSockets. As a result, users were receiving vehicle location updates with delays ranging from 40 seconds to 15 minutes.
Let me describe the existing solution and the problems with its implementation.
The application is built as a monolith that handles all business logic.
WebSocket functionality is part of this monolith and is implemented through custom code, including:
- Connection handling
- Disconnect/reconnect logic
- Cleanup of dead WebSocket sessions
- Session storage, message sending, and message routing
- Authentication/authorization
The non-functional requirements (NFRs) for REST APIs and WebSockets are different. This architecture doesn’t provide proper scalability, performance efficiency, or cost-effectiveness.
Multiple application instances are deployed. Each instance has its own cache for WebSocket sessions (no efficient handling of WebSocket sessions, and problems with traffic routing during reconnections). Each instance consumes the entire Kafka topic (no efficient partition consumption) so that it can track any vehicle.
AWS ALB is used to balance traffic across the instances. By default, ALB resets a TCP connection after 60 seconds of idle time. This forces users to reconnect constantly, adding overhead to session management.
This could be mitigated by increasing the ALB idle timeout, but that alone is not a good idea: all TCP connections would then stay open for long periods. That is acceptable for WebSockets, but if an application instance serves both REST and WebSocket endpoints, keeping REST connections idle for a long time is undesirable. In that case, timeout handling should be split per connection type. A better approach is to introduce a dedicated ALB for WebSockets with a longer idle timeout, separating the traffic types and tuning timeouts accordingly.
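If a dedicated WebSocket ALB is introduced, its idle timeout can be raised independently of the REST one. A minimal sketch using the AWS SDK for JavaScript v3 (the load balancer ARN and the timeout value are placeholders):

```typescript
import {
  ElasticLoadBalancingV2Client,
  ModifyLoadBalancerAttributesCommand,
} from "@aws-sdk/client-elastic-load-balancing-v2";

const elb = new ElasticLoadBalancingV2Client({});

// Raise the idle timeout on the dedicated WebSocket ALB only,
// leaving the REST ALB at its default 60 seconds.
await elb.send(
  new ModifyLoadBalancerAttributesCommand({
    LoadBalancerArn: "arn:aws:elasticloadbalancing:...:loadbalancer/app/ws-alb/...", // placeholder
    Attributes: [{ Key: "idle_timeout.timeout_seconds", Value: "300" }], // 5 minutes for long-lived connections
  })
);
```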
Architecturally Significant Requirements
To design an efficient and scalable system, key non-functional requirements and technical constraints were identified.
Non-Functional Requirements (NFRs)
The fleet currently consists of 70,000 vehicles, with continuous growth expected.
For the first MVP, the system must support:
100,000 standard users and 15,000 business users, all connected in parallel, with future scalability in mind.
A standard user typically tracks 1–3 vehicles, while a business user may track anywhere from 1 to 1,000 vehicles.
Each GPS event must be delivered to the end user within 5 seconds of being received by the system (i.e., once it reaches the AWS MSK topic).
Technical Details and Constraints
WebSocket is the primary technology used for real-time messaging.
Each vehicle sends a GPS update every 10 seconds.
The payload size per message sent to end users is approximately 600 bytes.
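A quick back-of-envelope estimate based on these numbers (the fan-out figure is a worst-case upper bound, assuming every user tracks the maximum number of vehicles):

```typescript
const vehicles = 70_000;
const updateIntervalSec = 10; // each vehicle sends a GPS update every 10 seconds
const payloadBytes = 600;

// Inbound load on the Kafka topic.
const inboundMsgPerSec = vehicles / updateIntervalSec;      // 7,000 msg/s
const inboundBytesPerSec = inboundMsgPerSec * payloadBytes; // ~4.2 MB/s

// Worst-case outbound fan-out: 100k standard users x 3 vehicles
// plus 15k business users x 1,000 vehicles, each vehicle at 0.1 msg/s.
const maxSubscriptions = 100_000 * 3 + 15_000 * 1_000;          // 15.3M subscriptions
const outboundMsgPerSec = maxSubscriptions / updateIntervalSec; // ~1.53M msg/s

console.log({ inboundMsgPerSec, inboundBytesPerSec, outboundMsgPerSec });
```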
Rethinking the Architecture
To overcome the existing limitations and support the required scale and performance, several architectural improvements were identified:
1. Session Management Improvements
Enhance session lifecycle handling, including:
Connection and reconnection logic
Cleanup of stale or dead sessions
Efficient association of vehicles with WebSocket sessions
2. Scalable and Efficient WebSocket Handling
WebSocket functionality should be extracted into a separate microservice, decoupling it from the monolithic application to allow independent scaling and resource optimization between REST and WebSocket traffic.
Partition consumption strategy should be improved:
Instead of each instance consuming all Kafka partitions, distribute specific partition ranges across the instances (see the consumer sketch after this list).
This will reduce redundant load and improve scalability.
Eliminate AWS ALB connection drops
Address ALB idle timeout limitations to avoid forced reconnections that cause overhead in session management.
3. Auto-Scaling Strategy
Implement auto-scaling for GPS distributor instances based on key metrics, such as:
CPU and memory usage
Number of connected users
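For the partition strategy from item 2, here is a minimal sketch of static partition assignment using node-rdkafka, assuming each instance is told its partition range via environment variables (PARTITION_START and PARTITION_END are hypothetical names, as is the topic name):

```typescript
import Kafka from "node-rdkafka";

// Hypothetical env vars: each instance owns a specific partition range.
const start = Number(process.env.PARTITION_START ?? 0);
const end = Number(process.env.PARTITION_END ?? 99);

const consumer = new Kafka.KafkaConsumer(
  {
    "metadata.broker.list": process.env.MSK_BROKERS ?? "localhost:9092",
    "group.id": "gps-distributor",
  },
  {}
);

consumer.on("ready", () => {
  // Static assignment instead of a group subscription:
  // this instance reads only its own partition range, not the whole topic.
  const assignment: { topic: string; partition: number }[] = [];
  for (let p = start; p <= end; p++) {
    assignment.push({ topic: "gps-topic", partition: p });
  }
  consumer.assign(assignment);
  consumer.consume();
});

consumer.on("data", (message) => {
  // Route the GPS event to the relevant pub-sub channel / WebSocket session here.
});

consumer.connect();
```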
Architectural Scenarios and Design Decisions
I kept the problem, limitations, and non-functional requirements in mind while designing the system. I conducted an in-depth analysis and identified several possible ways to implement the solution.
Solution options
Option 1, Dynamic load distribution
Element catalog
AMSK — AWS Managed Streaming for Kafka; a message broker used as the source of GPS events from vehicles.
GPS Topic — The Kafka topic that is used for GPS data.
Partition 1, 2, 3 — Kafka topic partitions used to enable scalable consumption. In reality, there are more than 300 partitions.
GPS Consumer Server — Reads GPS events and distributes them to the appropriate Redis pub-sub channel. Each pub-sub corresponds to a specific WebSocket server.
Redis — Acts as a cache and a secondary broker for routing GPS events to the correct WebSocket server instance.
SET — A Redis set used to register WebSocket server instances and their associated pub-sub channels. It stores mappings of which pub-sub channel is responsible for which vehicle. When a WebSocket server receives a request to track a new vehicle, this mapping is updated.
Pub-Sub 1, 2, 3, …, N — Dedicated pub-sub channels between the GPS Consumer Server and WebSocket servers. Each pub-sub receives GPS data only for the vehicles that its associated WebSocket server is tracking. The GPS Consumer Server reads from the SET to determine which GPS message should be delivered to which pub-sub.
WebSocket Server — A microservice responsible for managing client WebSocket connections and distributing GPS events.
ECS Fargate Service — A container orchestration service used to group task instances with auto-scaling support.
AWS ALB — AWS Application Load Balancer configured to handle WebSocket connections. The idle timeout is increased (from the default 60 seconds) to support long-lived connections (5+ minutes).
ECS Fargate Service Auto-Scaling — A scaling policy that adjusts the number of WebSocket server instances based on application metrics such as CPU usage, memory consumption, or the number of connected WebSocket clients.
AWS Route 53 — Provides a top-level domain name and DNS resolution, associating it with the internal AWS ALB.
AWS CloudWatch — Used for collecting logs and metrics, and for triggering ECS auto-scaling actions based on defined thresholds.
WebSocket Clients — The end users' WebSocket clients that receive GPS vehicle data in real time.
Description
The system consists of two main components: the GPS consumer server and the WebSocket server. These components are decoupled using message brokers and an event-driven architecture, ensuring independent scalability.
The GPS consumer server group processes GPS event data from a Kafka topic, which is partitioned to enhance scalability. GPS consumers distribute events only to the relevant pub-sub channels, which are dynamically determined from the Redis SET mappings that link pub-sub channels to vehicles. This Redis SET is managed by the WebSocket server.
When the WebSocket server needs to start or stop tracking a vehicle, it updates the mapping between pub-sub channels and vehicles in Redis.
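A minimal sketch of this mapping with ioredis (the key layout vehicle:{id}:channels is an assumption for illustration, not the project's actual schema):

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// WebSocket server side: update the mapping when a client starts/stops tracking a vehicle.
export async function trackVehicle(vehicleId: string, channel: string): Promise<void> {
  await redis.sadd(`vehicle:${vehicleId}:channels`, channel);
}

export async function untrackVehicle(vehicleId: string, channel: string): Promise<void> {
  await redis.srem(`vehicle:${vehicleId}:channels`, channel);
}

// GPS consumer side: publish an incoming GPS event only to the pub-sub
// channels of the WebSocket servers that actually track this vehicle.
export async function routeGpsEvent(vehicleId: string, payload: string): Promise<void> {
  const channels = await redis.smembers(`vehicle:${vehicleId}:channels`);
  await Promise.all(channels.map((ch) => redis.publish(ch, payload)));
}
```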
The system’s scalability is managed by an ECS Fargate auto-scaling policy, which monitors CPU and memory utilization. When an instance reaches 70% CPU usage, a new instance is launched; similarly, scaling occurs based on memory utilization and custom metrics such as the number of connected users. Downscaling occurs when resource usage decreases below the threshold.
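A sketch of such a target-tracking policy via the Application Auto Scaling API (cluster and service names are placeholders; the capacity bounds are illustrative):

```typescript
import {
  ApplicationAutoScalingClient,
  RegisterScalableTargetCommand,
  PutScalingPolicyCommand,
} from "@aws-sdk/client-application-auto-scaling";

const scaling = new ApplicationAutoScalingClient({});
const resourceId = "service/gps-cluster/websocket-server"; // placeholder cluster/service names

// Register the ECS service as a scalable target.
await scaling.send(new RegisterScalableTargetCommand({
  ServiceNamespace: "ecs",
  ResourceId: resourceId,
  ScalableDimension: "ecs:service:DesiredCount",
  MinCapacity: 2,
  MaxCapacity: 20,
}));

// Target-tracking on CPU: scale out when average CPU crosses 70%, scale in below it.
await scaling.send(new PutScalingPolicyCommand({
  PolicyName: "ws-cpu-70",
  ServiceNamespace: "ecs",
  ResourceId: resourceId,
  ScalableDimension: "ecs:service:DesiredCount",
  PolicyType: "TargetTrackingScaling",
  TargetTrackingScalingPolicyConfiguration: {
    TargetValue: 70,
    PredefinedMetricSpecification: { PredefinedMetricType: "ECSServiceAverageCPUUtilization" },
  },
}));
```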
The sequence diagram
Option 2, socket.io
Description
The core concept is to leverage the socket.io library. It offers out-of-the-box functionality for handling client connections (WebSocket with an HTTP long-polling fallback), supports scaling, provides server/client implementations in various languages, and enables event distribution by groups.
The key idea is grouping clients who wish to receive information about GPS vehicle events. socket.io facilitates this with its room functionality. When a client wants to receive events for a particular vehicle, it joins a specific room and awaits the events. socket.io manages everything under the hood.
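A minimal sketch of the room-per-vehicle idea (the event names track/untrack/gps are illustrative, not an agreed contract):

```typescript
import { Server } from "socket.io";

const io = new Server(3000);

io.on("connection", (socket) => {
  // A client asks to follow a vehicle: join the room named after the vehicle id.
  socket.on("track", (vehicleId: string) => socket.join(`vehicle:${vehicleId}`));
  socket.on("untrack", (vehicleId: string) => socket.leave(`vehicle:${vehicleId}`));
});

// Called by the GPS consumer for every incoming event:
// only sockets in that vehicle's room receive the update.
export function broadcastGps(vehicleId: string, event: unknown): void {
  io.to(`vehicle:${vehicleId}`).emit("gps", event);
}
```

With multiple server instances, socket.io's Redis adapter (@socket.io/redis-adapter) propagates room broadcasts across all instances.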
Option 3, AWS Gateway WebSocket API
AMSK — AWS Managed Streaming for Kafka; the message broker used as the source of GPS events from vehicles.
GPS Topic — The Kafka topic that is used for GPS data.
Partition 1,2,3 — Partitions of the GPS Topic used to ensure scalability in message consumption. In reality, there are more than 300 partitions.
GPS Distributor Server — Reads GPS events and distributes them one by one to the appropriate clients (whose connection ids are stored in Redis) via POST requests to the WebSocket API.
Connection Manager — Manages user connectionIds (save/remove) on connect/disconnect events triggered by the AWS WebSocket Gateway.
Redis — Acts as a cache, storing associations between clientId and the list of vehicleIds each client is tracking.
SET — A Redis data structure used to associate a connectionId with vehicleIds.
ECS Fargate Service — Manages task instances for services (such as GPS Distributor and Connection Manager), allowing them to scale and balance traffic efficiently.
ECS Fargate Auto-Scaling — Auto-scaling policies that adjust the number of running tasks (e.g., GPS Distributor or Connection Manager) based on metrics like CPU usage, memory, and the number of WebSocket clients.
AWS Gateway WebSocket API — Handles connection lifecycle events (connect, disconnect, reconnect) for WebSocket clients.
AWS Route 53 — Provides the top-level domain name and DNS routing for the WebSocket endpoint.
AWS CloudWatch — Collects logs and metrics, and triggers ECS auto-scaling actions based on configured thresholds.
WebSocket Client — The front-end or external clients connected via WebSocket that receive real-time GPS vehicle data.
Description
The main idea here is built around the auto-scalable WebSocket API, which can handle any number of WebSocket clients. When a WebSocket client connects, the AWS WebSocket API generates a unique id (connectionId) that can be used to send information back to the client (a simple POST request to the WebSocket API). The Connection Manager handles connect/disconnect events and saves/deletes ids in the Redis SET storage. On the other side, the GPS Distributor uses the SET to get connectionIds and send information back to clients.
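A sketch of both sides under this design, using the AWS SDK for JavaScript v3; the Redis key layout and the endpoint value are assumptions for illustration:

```typescript
import {
  ApiGatewayManagementApiClient,
  PostToConnectionCommand,
  GoneException,
} from "@aws-sdk/client-apigatewaymanagementapi";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

// Endpoint format: https://{api-id}.execute-api.{region}.amazonaws.com/{stage}
const api = new ApiGatewayManagementApiClient({
  endpoint: process.env.WS_API_ENDPOINT, // placeholder, set per environment
});

// Connection Manager side: invoked from the $connect/$disconnect routes of the WebSocket API.
export async function onConnect(connectionId: string, vehicleIds: string[]): Promise<void> {
  await Promise.all(vehicleIds.map((v) => redis.sadd(`vehicle:${v}:connections`, connectionId)));
}

export async function onDisconnect(connectionId: string, vehicleIds: string[]): Promise<void> {
  await Promise.all(vehicleIds.map((v) => redis.srem(`vehicle:${v}:connections`, connectionId)));
}

// GPS Distributor side: POST the event to every connection tracking this vehicle.
export async function distribute(vehicleId: string, payload: Uint8Array): Promise<void> {
  const connectionIds = await redis.smembers(`vehicle:${vehicleId}:connections`);
  await Promise.all(
    connectionIds.map(async (connectionId) => {
      try {
        await api.send(new PostToConnectionCommand({ ConnectionId: connectionId, Data: payload }));
      } catch (err) {
        // GoneException means the client already disconnected; drop the stale id.
        if (err instanceof GoneException) {
          await redis.srem(`vehicle:${vehicleId}:connections`, connectionId);
        } else {
          throw err;
        }
      }
    })
  );
}
```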
The sequence diagrams
Cost calculation
The price of a solution is one of the crucial parts of architectural design and analysis and should be taken into account before decision-making. All calculations are based on the requirement to handle 100k users.
Prices are based on the AWS calculator as of 12.2024.
Option 1, Dynamic load distribution
- 3 instances of GPS Consumer and 6 instances of WebSocket server (assuming each server can handle approximately 16k users)
- Instance size: 4 vCPU and 8 GB RAM = 144.16 USD per month
- 10 instances per month = 1,441.61 USD for the WebSocket service
- Redis: cache.m5.xlarge (4 vCPU, 12.93 GiB) = 227.03 USD per month
Total = 2,103 USD per month
Option 2, socket.io
The same as for option 1. Total = 2103 USD
Option 3, AWS Gateway WebSocket API
The AWS Gateway WebSocket API bills per message sent and per connection minute.
Consider the following cases.
Case 1: 100k users actively listen to WebSockets for 4 hours during the day (2 hours in the morning, 2 hours in the evening). Each user listens to 2 vehicles, each vehicle produces 6 messages per minute (12 messages per user per minute), and there are 20 working days in a month.
- Messages per month: 100,000 × (4 × 60 × 20) × 12 = 5,760,000,000
- Message pricing: the first billion messages cost 0.000001 USD each; each message after that costs 0.0000008 USD
- Message cost: 1,000,000,000 × 0.000001 = 1,000 USD; 4,760,000,000 × 0.0000008 = 3,808 USD
- Connection duration: 0.00000025 USD per connection minute
- Connection minutes: 100,000 × (4 × 60) = 24,000,000 per day; 24,000,000 × 0.00000025 = 6 USD per day; 6 × 20 = 120 USD per month
- Total: 1,000 + 3,808 + 120 = 4,928 USD per month
Case 2: 100k users actively listen to WebSockets for 2 hours during the day; each user listens to 2 vehicles.
- Messages per month: 100,000 × (2 × 60 × 20) × 12 = 2,880,000,000; the first 1,000,000,000 cost 1,000 USD, and the remaining 1,880,000,000 × 0.0000008 = 1,504 USD
- Connection minutes: 100,000 × (2 × 60) = 12,000,000 per day; 12,000,000 × 0.00000025 × 20 = 60 USD per month
- Total: 1,000 + 1,504 + 60 = 2,564 USD per month
Case 3: 100k users actively listen to WebSockets for 1 hour during the day; the calculation assumes 6 messages per user per minute.
- Messages per month: 100,000 × (1 × 60 × 20) × 6 = 720,000,000 = 720 USD per month
- Connection minutes: 100,000 × (1 × 60) = 6,000,000 per day; 6,000,000 × 0.00000025 × 20 = 30 USD per month
- Total: 720 + 30 = 750 USD per month
Total WebSocket API costs (monthly): approximately 750–4,930 USD
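A small helper reproducing the arithmetic above (prices as of 12.2024; the hardcoded tier boundary and rates match the figures quoted earlier):

```typescript
// Monthly AWS WebSocket API cost estimate for a given usage scenario.
function monthlyWebSocketApiCost(
  users: number,
  hoursPerDay: number,
  msgPerUserPerMin: number,
  daysPerMonth = 20
): number {
  const minutes = hoursPerDay * 60 * daysPerMonth;
  const messages = users * minutes * msgPerUserPerMin;
  const firstTier = Math.min(messages, 1_000_000_000) * 0.000001;   // first billion messages
  const nextTier = Math.max(messages - 1_000_000_000, 0) * 0.0000008; // everything above
  const connectionCost = users * minutes * 0.00000025;              // per connection minute
  return firstTier + nextTier + connectionCost;
}

console.log(monthlyWebSocketApiCost(100_000, 4, 12)); // ≈ 4,928 USD (case 1)
console.log(monthlyWebSocketApiCost(100_000, 2, 12)); // ≈ 2,564 USD (case 2)
console.log(monthlyWebSocketApiCost(100_000, 1, 6));  // ≈ 750 USD (case 3)
```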
The price for compute instances and Redis
- GPS Distributor: 3 instances, 4 vCPU and 8 GB RAM, 144.16 USD each ≈ 435 USD per month
- Connection Manager: 2 instances, 4 vCPU and 8 GB RAM, 144.16 USD each = 288.32 USD per month
- Redis: cache.m5.xlarge (4 vCPU, 12.93 GiB) = 227.03 USD per month
Total ≈ 750–4,930 USD for the AWS WebSocket API + 951 USD for the compute instances and Redis.
Comparative Analysis and Decision-Making
Conclusion
After a detailed investigation, we decided to move forward with AWS Gateway WebSocket API (Option 3).
This decision was driven by its fully managed, auto-scalable architecture, which simplifies integration and aligns well with non-functional requirements.
While the cost is higher compared to self-managed solutions, it’s justified by the significant reduction in development and maintenance effort.
We gain elasticity, resilience, and focus — without having to reinvent infrastructure components.
The next step is to develop a proof of concept to validate that this choice meets all performance and functional expectations.
In the second part, we'll dive deep into the technical side: from source code and infrastructure setup to real-world performance testing.