Store stream data on Data Lake

Principal Data Architect at Home24
Data Services: Search, Recommendations, Ranking
Worked on: Here Maps, Sapo.pt, DataJet, Xing, …
Scala, Perl, Prolog, Java, SQL, R, …
AWS: Step-Functions, Lambda Function, EMR, EC2,
Batch, SQS, SNS, Firehose, Athena, API Gateway, ...

● 15 people of 12 nationalities
● Serverless Lovers. For data ingestion we have:
● AWS Technologies: Step-Functions, Cloud-Formation, Lambda Functions,
Athena, EMR, Redshift, S3, ...
                              Production               Development
Number of Lambdas             625                      2311
Number of Step Functions      113                      490
Consumed time (a month)       3,383,525 sec (39 days)  5,371,037 sec (62 days)
Number of requests (a month)  2,014,203                3,300,118

● The majority of our Streams have a low message rate
● The Big Stream doesn’t have an easily predictable rate of
messages and can peak at 100 messages/sec
● We will have many more low rate Streams

Main requirements
● Store new Stream Data in Raw S3 Bucket
● Refine Raw S3 Bucket data to a Refined S3 Bucket
● Malformed messages shall not stop the flow
● Notification shall be sent on bad data
● Data must be refined in less than 10 minutes
Other
● Able to replay many days of data fast
● For development, every developer shall be able to deploy their own version
independently

Requirements
● Collect data from SNS
● The data must be stored as received in S3.
● File size must be small enough to process on
Lambda (< 10 MB)
● At least 1 file per minute must be created

Architecture
● An SQS Queue collects all data from SNS
● A Lambda copies the data from the SQS to a
Firehose
● The Lambda Function is invoked once a
minute via CloudWatch Event
● Firehose merges the data and creates files
on Raw S3 Bucket
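
The copy step above could be sketched roughly as follows. This is an illustration, not the code from the talk: `firehose_batches` is a hypothetical helper, the newline delimiter between messages is an assumption, and the constants are the PutRecordBatch limits (500 records and 4 MiB per call).

```python
from typing import Iterable, Iterator, List

# Firehose PutRecordBatch limits: at most 500 records and 4 MiB per call.
MAX_RECORDS = 500
MAX_BYTES = 4 * 1024 * 1024


def firehose_batches(bodies: Iterable[str]) -> Iterator[List[dict]]:
    """Group SQS message bodies into PutRecordBatch-sized record lists."""
    batch: List[dict] = []
    size = 0
    for body in bodies:
        # Assumption: one message per line, so the merged S3 file can be
        # split back into messages by the Refiner.
        data = (body.rstrip("\n") + "\n").encode("utf-8")
        if batch and (len(batch) >= MAX_RECORDS or size + len(data) > MAX_BYTES):
            yield batch
            batch, size = [], 0
        batch.append({"Data": data})
        size += len(data)
    if batch:
        yield batch
```

Each yielded list could then be passed as `Records` to the boto3 Firehose client's `put_record_batch`. The sketch does not guard against a single message larger than the per-record limit.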

Requirement
● When some messages cannot be
processed, send a notification.

Architecture
● The data is deleted from the SQS
Queue after successful copy to
Firehose
● In case of error, the messages end up
in the Dead-Letter Queue
● A non-empty Dead-Letter Queue means
there is an error in the data
● After fixing the Lambda function, one
can always copy the messages back
to the Raw SQS
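
A minimal sketch of the "delete only after a successful copy" rule, assuming `sqs` and `firehose` behave like boto3 clients. The function name `copy_and_delete` and its parameters are hypothetical, and the move to the Dead-Letter Queue is assumed to come from a redrive policy on the Raw queue rather than from this code.

```python
def copy_and_delete(sqs, firehose, queue_url: str, stream_name: str) -> int:
    """Copy one batch of SQS messages to Firehose; delete only the copied ones.

    Returns the number of messages deleted from the queue.
    """
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10)
    messages = resp.get("Messages", [])
    if not messages:
        return 0

    records = [{"Data": (m["Body"] + "\n").encode("utf-8")} for m in messages]
    result = firehose.put_record_batch(
        DeliveryStreamName=stream_name, Records=records
    )

    # RequestResponses is index-aligned with Records; failed entries
    # carry an ErrorCode instead of a RecordId.
    copied = [
        {"Id": str(i), "ReceiptHandle": m["ReceiptHandle"]}
        for i, (m, r) in enumerate(zip(messages, result["RequestResponses"]))
        if "ErrorCode" not in r
    ]
    if copied:
        sqs.delete_message_batch(QueueUrl=queue_url, Entries=copied)
    # Messages that were not deleted become visible again; once they exceed
    # the queue's maxReceiveCount they move to the Dead-Letter Queue.
    return len(copied)
```

Because nothing is deleted before Firehose acknowledges it, a crash in the middle of a run can at worst duplicate a message, never lose one.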

Requirements
● Decompress or decode data (zip, deflate, gz,
base64, ...)
● Normalize fields (dates, for example)
● Add metadata
● Convert everything to JSON
● Store on S3
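
To make these steps concrete, here is a small sketch of a refine function. Everything specific in it is an assumption for illustration: the encoding labels, the `created_at` field holding epoch seconds, and the `_meta` key are not from the talk, and zip archives are left out for brevity.

```python
import base64
import gzip
import json
import zlib
from datetime import datetime, timezone


def decode_payload(raw: bytes, encoding: str) -> bytes:
    """Undo the transport encoding named by `encoding` (hypothetical labels)."""
    if encoding == "gz":
        return gzip.decompress(raw)
    if encoding == "deflate":
        return zlib.decompress(raw)
    if encoding == "base64":
        return base64.b64decode(raw)
    return raw


def refine(raw: bytes, encoding: str, source_key: str) -> str:
    """Return one JSON line with a normalized date and added metadata."""
    record = json.loads(decode_payload(raw, encoding))
    # Assumption: the producer sends `created_at` as epoch seconds.
    ts = datetime.fromtimestamp(record["created_at"], tz=timezone.utc)
    record["created_at"] = ts.isoformat()
    # Metadata: remember which raw file the record came from.
    record["_meta"] = {"source_key": source_key}
    return json.dumps(record, sort_keys=True)
```

A malformed payload raises an exception here instead of being skipped, which fits a flow where failed messages are expected to land on a Dead-Letter Queue.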

Architecture
● When a new file is created on the Raw S3
Bucket, a message is sent to SQS via
SNS
● The Lambda Function is invoked once
a minute via CloudWatch Event and
processes all unprocessed files
● A file with the same key as the Raw file is
created on the Refined S3 Bucket
● Messages that fail to process will end up
on the Dead-Letter Queue
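
The Refiner has to find out which Raw file each queue message refers to. A possible sketch, assuming the SNS subscription does not use raw message delivery, so that the S3 event arrives as a JSON string inside the SNS envelope's `Message` field. `raw_objects` is a hypothetical helper name.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus


def raw_objects(sqs_body: str) -> List[Tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event delivered via SNS to SQS."""
    envelope = json.loads(sqs_body)
    # SNS wraps the S3 event notification as a JSON string in "Message".
    event = json.loads(envelope["Message"])
    return [
        # Object keys in S3 event notifications are URL-encoded.
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]
```

The refined object can then be written under the same key in the Refined bucket, which is what makes reprocessing a file overwrite its earlier result.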

Requirements
● Replay multiple days of data

Architecture
● A Lambda Function lists the files on the
Raw S3 Bucket and sends
messages to SQS
● Since the files in Raw and Refined
have the same key, replayed files will
always overwrite the existing ones
● The execution time of the Refiner
Lambda will rise and the Refiner
Lambdas will work in parallel

Parallelism:
● Our Lambda execution time rises to ~190 sec, with 3 Lambdas
running in parallel.
● 9198 S3 objects
● 30 GB of GZip data, 10 GB/hour
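
The replay step could look roughly like the sketch below. It assumes `s3` and `sqs` behave like boto3 clients. The `{"bucket", "key"}` message body is a made-up format for illustration, so a real replay would have to emit whatever shape the Refiner already expects.

```python
import json
from typing import Iterator, List


def batches_of(items: List[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def replay(s3, sqs, bucket: str, prefix: str, queue_url: str) -> int:
    """Enqueue one message per raw object under `prefix`; return the count."""
    sent = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        entries = [
            {
                "Id": str(i),  # only needs to be unique within one batch
                "MessageBody": json.dumps({"bucket": bucket, "key": obj["Key"]}),
            }
            for i, obj in enumerate(page.get("Contents", []))
        ]
        # SendMessageBatch accepts at most 10 entries per call.
        for batch in batches_of(entries, 10):
            sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)
            sent += len(batch)
    return sent
```

Calling it with a date-like prefix would replay only that slice of the bucket, provided the keys are laid out by date.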

Requirement
● Developers shall be able to
deploy their Stream
Processors
● No interaction with external
team shall be required

Architecture
● We created an internal SNS topic
where we clone the external
messages
● SNS can write to multiple
SQS Queues
● With some CloudFormation magic,
every developer can
deploy their own environment
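
One way the per-developer setup might be expressed is a template generator like the sketch below. The function name, the resource names and the `<env>-raw-stream` naming scheme are assumptions for illustration. Only the CloudFormation resource types and properties are standard.

```python
import json


def stream_stack(env: str, topic_arn: str) -> dict:
    """Build a CloudFormation template with one Raw queue per environment."""
    queue_arn = {"Fn::GetAtt": ["RawQueue", "Arn"]}
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "RawQueue": {
                "Type": "AWS::SQS::Queue",
                # Hypothetical naming scheme: prefix every resource with env.
                "Properties": {"QueueName": f"{env}-raw-stream"},
            },
            "RawQueuePolicy": {
                # Allows the internal SNS topic to send to this queue.
                "Type": "AWS::SQS::QueuePolicy",
                "Properties": {
                    "Queues": [{"Ref": "RawQueue"}],
                    "PolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Principal": {"Service": "sns.amazonaws.com"},
                            "Action": "sqs:SendMessage",
                            "Resource": queue_arn,
                            "Condition": {
                                "ArnEquals": {"aws:SourceArn": topic_arn}
                            },
                        }],
                    },
                },
            },
            "RawSubscription": {
                "Type": "AWS::SNS::Subscription",
                "Properties": {
                    "TopicArn": topic_arn,
                    "Protocol": "sqs",
                    "Endpoint": queue_arn,
                },
            },
        },
    }
```

Serializing the result with `json.dumps(stream_stack("alice", topic_arn))` gives a template body that each developer could deploy as a separate stack, all subscribed to the same internal topic.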

EC2 vs Lambda

CPU / Price
● EC2: 1 t2.nano (5% vCPU and 500 MB)
0.0063*24*30 = 4.536$/month
● Lambda: considering 3 seconds a minute with the highest memory
(2 vCPU and 1536 MB)
3*60*24*30*10*(0.000002501+0.0000002) = 3.5$/month
Devops
● EC2: Higher
● Lambda: Low
Scale
● EC2: scales up to 1 vCPU while it has credits. To have more vCPUs you
need to use more expensive instance types or implement autoscaling
● Lambda: out of the box until a certain level.
2 vCPU * 5 Lambdas = 10 vCPUs

Price-wise, Lambda seems a good solution. For our problems, 10 vCPUs is
clearly more than enough.
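
The two price formulas above can be reproduced with a few lines of arithmetic. The constants are copied from the comparison. The second Lambda term is kept exactly as in the formula, and the comment on what it represents is a guess.

```python
HOURS_PER_MONTH = 24 * 30
MINUTES_PER_MONTH = 60 * 24 * 30

# Figures copied from the EC2 vs Lambda comparison.
T2_NANO_PER_HOUR = 0.0063
LAMBDA_PER_100MS = 0.000002501    # price per 100 ms at 1536 MB
LAMBDA_SECOND_TERM = 0.0000002    # second term of the formula (looks like the request charge)


def ec2_monthly(price_per_hour: float = T2_NANO_PER_HOUR) -> float:
    """Monthly cost of one always-on instance."""
    return price_per_hour * HOURS_PER_MONTH


def lambda_monthly(busy_seconds_per_minute: float = 3) -> float:
    """Monthly cost of a Lambda that runs a few seconds every minute."""
    units_100ms = busy_seconds_per_minute * MINUTES_PER_MONTH * 10
    return units_100ms * (LAMBDA_PER_100MS + LAMBDA_SECOND_TERM)


print(round(ec2_monthly(), 3), round(lambda_monthly(), 2))
```

With these numbers the two options cost the same at roughly 3.9 busy seconds per minute. Below that, Lambda is cheaper.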

Kinesis vs SQS

We analyze our 2 types of data stream:
● Slow Stream: 1 message/second (2.6 million requests/month)
● Fast Stream: 25 messages/second (64.8 million requests/month)
with spikes of 100 messages/second
On SQS you pay for PUTs and GETs; on Kinesis you pay for shards and PUTs.

Slow stream
● Kinesis: 2 Shards 24.5$/month, Puts 0.042$/month
● SQS: Requests 2.07$/month
Fast stream
● Kinesis: 3 Shards 36.7$/month, Puts 1.1$/month
● SQS: Requests 51.8$/month
Errors
● Kinesis: errors have to be controlled externally
● SQS: errors go to the Dead-Letter Queue
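
The monthly figures in this comparison can be approximated as follows. The unit prices are assumptions chosen because they reproduce the numbers above (0.40$ per million SQS requests, 0.017$ per shard-hour and 0.0165$ per million Kinesis PUTs). They are not quoted from a price list.

```python
SECONDS_PER_MONTH = 60 * 60 * 24 * 30
HOURS_PER_MONTH = 24 * 30

# Assumed unit prices, back-derived from the figures in the comparison.
SQS_PER_MILLION_REQUESTS = 0.40
KINESIS_PER_SHARD_HOUR = 0.017
KINESIS_PER_MILLION_PUTS = 0.0165


def messages_per_month(rate_per_sec: float) -> float:
    return rate_per_sec * SECONDS_PER_MONTH


def sqs_monthly(rate_per_sec: float) -> float:
    # Each message is paid for twice: once for the PUT and once for the GET.
    requests = 2 * messages_per_month(rate_per_sec)
    return requests / 1e6 * SQS_PER_MILLION_REQUESTS


def kinesis_monthly(rate_per_sec: float, shards: int) -> float:
    shard_cost = shards * KINESIS_PER_SHARD_HOUR * HOURS_PER_MONTH
    put_cost = messages_per_month(rate_per_sec) / 1e6 * KINESIS_PER_MILLION_PUTS
    return shard_cost + put_cost
```

Under these assumptions the slow stream is much cheaper on SQS, while the fast stream is somewhat cheaper on Kinesis.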

● You just pay for what you use
● Scalability is not an issue at our messages volume (top 100
messages/second)
○ SQS and Firehose can easily process that volume of messages
○ Multiple Lambdas can work in parallel in case of high traffic or
replay.
● Separate Lambdas per Stream make the logs easier to understand
● Separate environments simplify the developers' work
● Data is on S3 and it can be queried via Athena, EMR, Redshift
Spectrum, ...
