Planet-scale Data Ingestion Pipeline
Bigdam
PLAZMA TD Internal Day 2018/02/19
#tdtech
Satoshi Tagomori (@tagomoris)
Satoshi Tagomori (@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, Woothee, ...
Treasure Data, Inc.
Backend Team
Planet-scale Data Ingestion Pipeline: Bigdam
• Design for Large Scale Data Ingestion
• Issues to Be Solved
• Re-designing Systems
• Re-designed Pipeline: Bigdam
• Consistency
• Scaling
Large Scale Data Ingestion:
Traditional Pipeline
Data Ingestion in Treasure Data
• Accept requests from clients
• td-agent
• TD SDKs (incl. HTTP requests w/ JSON)
• Format data into MPC1
• Store MPC1 files into Plazmadb
[Diagram] clients → (json / msgpack.gz) → Data Ingestion Pipeline → (MPC1) → Plazmadb → Presto / Hive
Traditional Pipeline
• Streaming Import API for td-agent
• API Server (RoR), Temporary Storage (S3)
• Import task queue (perfectqueue), workers (Java)
• 1 msgpack.gz file in request → 1 MPC1 file on Plazmadb
[Diagram] td-agent → (msgpack.gz) → api-import (RoR) → (msgpack.gz) → S3 → PerfectQueue → Import Worker → (MPC1) → Plazmadb
Traditional Pipeline: Event Collector
• APIs for TD SDKs
• Event Collector nodes (hosted Fluentd)
• on the top of Streaming Import API
• 1 MPC1 file on Plazmadb per 3min. per Fluentd process
[Diagram] TD SDKs → (json) → event-collector (Fluentd) → (msgpack.gz) → api-import (RoR) → S3 → PerfectQueue → Import Worker → (MPC1) → Plazmadb
Growing Traffic on the Traditional Pipeline
• Throughput of perfectqueue
• Latency until queries via Event-Collector
• Maintaining Event-Collector code
• Many small temporary files on S3
• Many small imported files on Plazmadb on S3
[Diagram] TD SDKs → (json) → event-collector; td-agent → (msgpack.gz) → api-import (RoR) → S3 → PerfectQueue → Import Worker → (MPC1) → Plazmadb
Perfectqueue Throughput Issue
• Perfectqueue
• "PerfectQueue is a highly available distributed queue built on top of
RDBMS."
• Fair scheduling
• https://github.com/treasure-data/perfectqueue
• Perfectqueue is NOT "perfect"...
• Needs a wide lock on the table: poor concurrency
Latency until Queries via Event-Collector
• Event-collector buffers data in its storage
• 3min. + α
• Customers have to wait 3+ min. until a record becomes visible on Plazmadb
• Halving the buffering time doubles the number of MPC1 files
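The tradeoff above can be made concrete with a rough calculation (a sketch; the 3-minute interval is from the deck, the rest is illustrative arithmetic):

```java
// Rough model of the buffering tradeoff: each event-collector process
// flushes one MPC1 file per buffer interval, so halving the interval
// halves visibility latency but doubles the daily file count.
public class BufferTradeoff {
    // Files produced per day by one process flushing every `intervalSec` seconds.
    static long filesPerDay(int intervalSec) {
        return 86400L / intervalSec;
    }

    public static void main(String[] args) {
        System.out.println(filesPerDay(180)); // 3 min. interval -> 480 files/day
        System.out.println(filesPerDay(90));  // 1.5 min. interval -> 960 files/day
    }
}
```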
Maintaining Event-Collector Code
• Mitsu says: "No problem about maintaining event-collector code"
• :P
• Event-collector processes HTTP requests in Ruby code
• Hard to test
Many Small Temporary Files on S3
• api-import uploads all requested msgpack.gz files to S3
• An S3 outage is a critical issue
• AWS S3 outage in us-east-1 on Feb 28th, 2017
• Many uploaded files make costs expensive
• costs per object
• costs per operation
Many Small Imported Files on Plazmadb on S3
• 1 MPC1 file on Plazmadb from 1 msgpack.gz file
• on Plazmadb realtime storage
• https://www.slideshare.net/treasure-data/td-techplazma
• Many MPC1 files:
• S3 request cost to store
• S3 request cost to fetch (from Presto, Hive)
• Performance regression when fetching many small files in queries (256MB expected vs. 32MB actual)
Re-designing Systems
Make "Latency" Shorter (1)
• Clients to our endpoints
• JS SDK on customers' pages sends data to our endpoints from mobile devices
• Longer latency increases % of dropped records
• Many endpoints on the Earth: US, Asia + others
• Plazmadb in us-east-1 as "central location"
• Many geographically distributed "edge locations"
Make "Latency" Shorter (2)
• Shorter waiting time to query records
• Flexible import task scheduling - better if configurable
• Decouple buffers from endpoint server processes
• More frequent import with aggregated buffers
[Diagram] BEFORE: each endpoint process has its own buffer and writes its own MPC1 file. AFTER: endpoints share decoupled buffers, which are merged into fewer MPC1 files.
Redesigning Queues
• Fair scheduling is not required for import tasks
• Import tasks are FIFO (First In, First Out)
• Small payload - (apikey, account_id, database, table)
• More throughput
• Using Queue service + RDBMS
• Queue service for enqueuing/dequeuing
• RDBMS to provide at-least-once
S3-free Temporary Storage
• Make the pipeline free from S3 outage
• Distributed storage cluster as buffer for uploaded data (w/ replication)
• Buffer transferring between edge and central locations
[Diagram] clients → endpoints → buffers → Storage Cluster (edge location) → Storage Cluster (central location) → (MPC1)
Merging Temporary Buffers into a File on Plazmadb
• No more 1-by-1 conversion from msgpack.gz to MPC1
• Buffers can be gathered using secondary index
• primary index: buffer_id
• secondary index: account_id, database, table, apikey
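The merge idea can be sketched as follows, with a hypothetical `Buffer` record (not Bigdam's actual classes): buffers fetched via the secondary index are grouped per (account_id, database, table), so many small buffers become one MPC1 file per table.

```java
import java.util.*;
import java.util.stream.*;

// Group buffered chunks by their secondary-index key so that one import
// task can merge many buffers into a single MPC1 file per table.
public class BufferMerge {
    record Buffer(String bufferId, int accountId, String database, String table) {}

    static Map<String, List<Buffer>> groupByTable(List<Buffer> buffers) {
        // Key: account_id/database/table (apikey omitted in this sketch).
        return buffers.stream().collect(Collectors.groupingBy(
                b -> b.accountId + "/" + b.database + "/" + b.table));
    }

    public static void main(String[] args) {
        List<Buffer> buffers = List.of(
                new Buffer("b1", 1, "db1", "t1"),
                new Buffer("b2", 1, "db1", "t1"),
                new Buffer("b3", 9, "db8", "t7"));
        // The two db1/t1 buffers end up in one group, i.e. one MPC1 file.
        System.out.println(groupByTable(buffers).get("1/db1/t1").size()); // 2
    }
}
```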
Should It Provide Read-After-Write Consistency?
• BigQuery provides Read-After-Write consistency
• Pros: an inserted record can be queried immediately
• Cons:
• Much longer latency (especially from non-US regions)
• Much more expensive to host API servers for longer HTTP sessions
• Much more expensive to host Query nodes for smaller files on Plazmadb
• Much more trouble
• So we say "No!" to it
Appendix
Bigdam
Bigdam: Planet-scale!
Edge locations on the Earth + the central location
Planet-scale Data Ingestion Pipeline: Bigdam
Bigdam-Gateway (mruby on h2o)
• HTTP Endpoint servers
• Rack-like API for mruby handlers
• Easy to write, easy to test (!)
• Async HTTP requests from mruby, managed by h2o using Fiber
• HTTP/2 capability in future
• Handles all requests from td-agent and TD SDKs
• decode/authorize requests
• send data to storage nodes in parallel (to replicate)
Bigdam-Pool (Java)
• Distributed Storage for buffering
• Expected data size: 1KB (a json) ~ 32MB (a msgpack.gz from td-agent)
• Append data into a buffer
• Query buffers using secondary index
• Transfer buffers from edge to central
[Diagram] Edge location: chunks are appended into buffers; a buffer is committed by size or timeout. Committed buffers are transferred to the central location over the Internet using HTTPS or HTTP/2, where import workers query them by account_id, database, table.
Bigdam-Scheduler (Golang)
• Scheduler server
• Bigdam-pool requests bigdam-scheduler to schedule import tasks (many times per second)
• Bigdam-scheduler enqueues import tasks into bigdam-queue (once per configured interval; default: 1 min.)
[Diagram] bigdam-pool nodes → bigdam-scheduler → bigdam-queue: pool nodes request scheduling for every committed buffer; the scheduler enqueues once per minute per account/db/table (task key: account_id, database, table, apikey).
1. bigdam-pool requests to schedule an import task for every committed buffer
2. the requested task (e.g. account1, db1, table1, apikeyA) is added to the scheduler entries, if missing
3. the task is scheduled to be enqueued after a timeout from entry creation
4. the import task is enqueued into bigdam-queue
5. the entry is removed from the scheduler if enqueueing succeeded
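The steps above amount to a debouncing scheduler: the first notification for a key creates an entry with a deadline, repeated notifications are no-ops, and one task is emitted when the deadline passes. A minimal in-memory sketch (hypothetical names, not bigdam-scheduler's real code):

```java
import java.util.*;

// Debouncing scheduler sketch: collapses many buffer notifications per
// (account, db, table, apikey) key into one import task per interval.
public class SchedulerSketch {
    private final long intervalMillis;
    private final Map<String, Long> entries = new HashMap<>(); // key -> deadline

    SchedulerSketch(long intervalMillis) { this.intervalMillis = intervalMillis; }

    // Called by pool nodes for every committed buffer (steps 1-2).
    void notifyBuffer(String key, long nowMillis) {
        entries.putIfAbsent(key, nowMillis + intervalMillis);
    }

    // Called periodically: emit tasks whose deadline passed, and drop
    // their entries on success (steps 3-5).
    List<String> drainDue(long nowMillis) {
        List<String> due = new ArrayList<>();
        entries.entrySet().removeIf(e -> {
            if (e.getValue() <= nowMillis) { due.add(e.getKey()); return true; }
            return false;
        });
        return due;
    }

    public static void main(String[] args) {
        SchedulerSketch s = new SchedulerSketch(60_000);
        s.notifyBuffer("account1/db1/table1/apikeyA", 0);
        s.notifyBuffer("account1/db1/table1/apikeyA", 10_000); // collapsed
        System.out.println(s.drainDue(30_000).size()); // 0: interval not elapsed
        System.out.println(s.drainDue(60_000).size()); // 1: one task enqueued
    }
}
```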
Bigdam-Queue (Java)
• High throughput queue for import tasks
• Enqueue/dequeue using AWS SQS (standard queue)
• Task state management using AWS Aurora
• Roughly ordered, At-least-once
[Diagram] Enqueue tasks (from bigdam-scheduler, via the bigdam-queue server):
1. INSERT the task INTO AWS Aurora (state: enqueued)
2. enqueue the task into AWS SQS (standard)
Dequeue tasks (requested by bigdam-import):
1. dequeue a task from AWS SQS
2. UPDATE the task row in AWS Aurora (state: running)
Finish:
1. DELETE the task row from AWS Aurora
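The protocol can be sketched in memory, with a `Deque` standing in for SQS and a `Map` for Aurora (illustrative names, not the real API). The Aurora-side state is what provides at-least-once: a task that was dequeued but never finished is still recorded, so it can be re-delivered.

```java
import java.util.*;

// In-memory sketch of the bigdam-queue protocol.
public class QueueSketch {
    enum State { ENQUEUED, RUNNING }
    private final Deque<String> sqs = new ArrayDeque<>();       // stands in for AWS SQS
    private final Map<String, State> aurora = new HashMap<>();  // stands in for AWS Aurora

    void enqueue(String taskId) {
        aurora.put(taskId, State.ENQUEUED); // 1. INSERT INTO Aurora
        sqs.addLast(taskId);                // 2. enqueue into SQS
    }

    String dequeue() {
        String taskId = sqs.pollFirst();    // 1. dequeue from SQS
        if (taskId != null) aurora.put(taskId, State.RUNNING); // 2. UPDATE
        return taskId;
    }

    void finish(String taskId) {
        aurora.remove(taskId);              // 1. DELETE the row
    }

    // At-least-once recovery: re-enqueue tasks that were dequeued but
    // never finished (e.g. the worker crashed mid-import).
    void requeueStuck() {
        for (var e : aurora.entrySet())
            if (e.getValue() == State.RUNNING) { e.setValue(State.ENQUEUED); sqs.addLast(e.getKey()); }
    }

    public static void main(String[] args) {
        QueueSketch q = new QueueSketch();
        q.enqueue("task-1");
        q.dequeue();
        q.requeueStuck();                // worker died before finishing
        System.out.println(q.dequeue()); // task-1 is delivered again
    }
}
```

Re-delivery is why the import side must deduplicate (see bigdam-dddb below, in the "Roughly ordered, At-least-once" design).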
Bigdam-Import (Java)
• Import worker
• Convert source (json/msgpack.gz) to MPC1
• Execute import tasks in parallel
• Dequeue tasks from bigdam-queue
• Query and download buffers from bigdam-pool
• Make a list of chunk ids and put it to bigdam-dddb
• Execute deduplication to determine chunks to be imported
• Make MPC1 files and put them into Plazmadb
Bigdam-Dddb (Java)
• Database service for deduplication
• Based on AWS Aurora and S3
• Stores unique chunk ids per import task, so that the same chunk is never imported twice
[Diagram] For a small list of chunk ids:
1. bigdam-import stores the chunk-id list in the bigdam-dddb server (Java)
2. INSERT (task-id, list-of-chunk-ids) INTO AWS Aurora
For a huge list of chunk ids:
1. upload the encoded chunk-ids to AWS S3
2. store the task-id and the S3 object path
3. INSERT (task-id, path-of-ids) INTO AWS Aurora
Fetching chunk-id lists imported in the past:
1. bigdam-import queries the lists of past tasks
2. SELECT from AWS Aurora
3. download from S3 if needed
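A minimal sketch of the deduplication check (hypothetical names, not bigdam-dddb's real API): before importing, the worker unions the chunk-id lists of past tasks and imports only the chunks not seen before.

```java
import java.util.*;

// Deduplication sketch: past tasks' chunk-id lists stand in for the
// Aurora/S3 records kept by bigdam-dddb.
public class DedupSketch {
    private final Map<String, Set<String>> pastTasks = new HashMap<>(); // task-id -> chunk ids

    void recordTask(String taskId, Set<String> chunkIds) {
        pastTasks.put(taskId, chunkIds);
    }

    // Which chunks of a new task still need importing?
    Set<String> chunksToImport(Set<String> candidateChunks) {
        Set<String> seen = new HashSet<>();
        for (Set<String> ids : pastTasks.values()) seen.addAll(ids);
        Set<String> fresh = new LinkedHashSet<>(candidateChunks);
        fresh.removeAll(seen); // drop chunks already imported in past tasks
        return fresh;
    }

    public static void main(String[] args) {
        DedupSketch d = new DedupSketch();
        d.recordTask("task-1", Set.of("c1", "c2"));
        // A retried task sees c2 again after a partial failure: only c3 remains.
        System.out.println(d.chunksToImport(new LinkedHashSet<>(List.of("c2", "c3")))); // [c3]
    }
}
```

This is what lets the rest of the pipeline stay merely at-least-once: duplicates are dropped once, at the end.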
Consistency and Scaling
Executing Deduplication at the End of the Pipeline
• Make it simple & reliable
[Diagram] clients (data input) → gateway → pool (edge) → pool (central) → import worker → Plazmadb, with scheduler, queue and dddb alongside. At-least-once everywhere; deduplication (transaction + retries) at the very end.
At-Least-Once: Bigdam-pool Data Replication
• Client-side replication, for large chunks (1MB~): the client uploads 3 replicas to 3 nodes in parallel
• Server-side replication, for small chunks (~1MB): the primary node appends chunks to an existing buffer and replicates them (for equal contents/checksums across nodes)
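How a client might pick the replication path by chunk size can be sketched as below. The 1MB threshold is from the slide; the method name and the exact mapping of sizes to strategies are illustrative assumptions, not Bigdam's real code.

```java
// Sketch of choosing a replication path by chunk size: large standalone
// chunks can be uploaded to 3 nodes by the client in parallel, while
// small chunks go to a primary node that appends them to a shared
// buffer and replicates it (keeping contents/checksums equal).
public class ReplicationChoice {
    static final int THRESHOLD = 1 << 20; // 1 MB

    static String strategyFor(int chunkBytes) {
        return chunkBytes >= THRESHOLD ? "client-side" : "server-side";
    }

    public static void main(String[] args) {
        System.out.println(strategyFor(32 << 20)); // 32MB msgpack.gz from td-agent
        System.out.println(strategyFor(1024));     // 1KB json from a TD SDK
    }
}
```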
At-Least-Once: Bigdam-pool Data Replication
Server-side replication for transferred buffers
Scaling-out (almost) Everywhere
• Scalable components on EC2 (& ready for AWS autoscaling)
• AWS Aurora (w/o table locks) + AWS SQS (+ AWS S3)
[Diagram] clients (data input) → gateway → pool (edge) → pool (central) → import worker → Plazmadb; gateway, pool, import worker, dddb and queue all scale out.
Scaling-up Just For A Case: Scheduler
• The scheduler needs to collect notifications for all buffers
• and cannot be parallelized across nodes (in an easy way)
• Solution: a high-performance singleton server: 90k+ reqs/sec
[Diagram] The scheduler is the one singleton server in the pipeline; all other components scale out.
Bigdam Current status: Under Testing
It's great fun to design distributed systems!
Thank you!
@tagomoris
We're Hiring!

Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdfMicrosoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
Microsoft AI Nonprofit Use Cases and Live Demo_2025.04.30.pdf
TechSoup
 
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
How to Batch Export Lotus Notes NSF Emails to Outlook PST Easily?
steaveroggers
 
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...Exploring Code Comprehension  in Scientific Programming:  Preliminary Insight...
Exploring Code Comprehension in Scientific Programming: Preliminary Insight...
University of Hawai‘i at Mānoa
 
EASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License CodeEASEUS Partition Master Crack + License Code
EASEUS Partition Master Crack + License Code
aneelaramzan63
 
WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)WinRAR Crack for Windows (100% Working 2025)
WinRAR Crack for Windows (100% Working 2025)
sh607827
 
Download YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full ActivatedDownload YouTube By Click 2025 Free Full Activated
Download YouTube By Click 2025 Free Full Activated
saniamalik72555
 
Societal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainabilitySocietal challenges of AI: biases, multilinguism and sustainability
Societal challenges of AI: biases, multilinguism and sustainability
Jordi Cabot
 
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New VersionPixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
Pixologic ZBrush Crack Plus Activation Key [Latest 2025] New Version
saimabibi60507
 
Landscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature ReviewLandscape of Requirements Engineering for/by AI through Literature Review
Landscape of Requirements Engineering for/by AI through Literature Review
Hironori Washizaki
 
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
What Do Contribution Guidelines Say About Software Testing? (MSR 2025)
Andre Hora
 
Solidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license codeSolidworks Crack 2025 latest new + license code
Solidworks Crack 2025 latest new + license code
aneelaramzan63
 
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Proactive Vulnerability Detection in Source Code Using Graph Neural Networks:...
Ranjan Baisak
 
Not So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java WebinarNot So Common Memory Leaks in Java Webinar
Not So Common Memory Leaks in Java Webinar
Tier1 app
 
Exploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the FutureExploring Wayland: A Modern Display Server for the Future
Exploring Wayland: A Modern Display Server for the Future
ICS
 
Revolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptxRevolutionizing Residential Wi-Fi PPT.pptx
Revolutionizing Residential Wi-Fi PPT.pptx
nidhisingh691197
 
Kubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptxKubernetes_101_Zero_to_Platform_Engineer.pptx
Kubernetes_101_Zero_to_Platform_Engineer.pptx
CloudScouts
 
Automation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath CertificateAutomation Techniques in RPA - UiPath Certificate
Automation Techniques in RPA - UiPath Certificate
VICTOR MAESTRE RAMIREZ
 
Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025Adobe After Effects Crack FREE FRESH version 2025
Adobe After Effects Crack FREE FRESH version 2025
kashifyounis067
 
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025Why Orangescrum Is a Game Changer for Construction Companies in 2025
Why Orangescrum Is a Game Changer for Construction Companies in 2025
Orangescrum
 
PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025PDF Reader Pro Crack Latest Version FREE Download 2025
PDF Reader Pro Crack Latest Version FREE Download 2025
mu394968
 

Latency until Queries via Event-Collector
• Event-collector buffers data in its storage: 3min. + α
• Customers have to wait 3+ min. until a record becomes visible on Plazmadb
• Halving the buffering time doubles the number of MPC1 files
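The latency/file-count trade-off above follows from simple arithmetic. A sketch in Ruby (the 3-minute interval is from the slides; the process count is illustrative):

```ruby
# Each event-collector (Fluentd) process flushes one MPC1 file per buffering
# interval, so the file count is inversely proportional to the interval.
def mpc1_files_per_hour(processes:, interval_min:)
  processes * (60.0 / interval_min)
end

baseline = mpc1_files_per_hour(processes: 10, interval_min: 3.0)  # 3 min. buffering
halved   = mpc1_files_per_hour(processes: 10, interval_min: 1.5)  # 1/2 buffering time
halved == baseline * 2  # => true: halving the wait doubles the number of files
```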
Maintaining Event-Collector Code
• Mitsu says: "No problem about maintaining event-collector code" :P
• Event-collector processes HTTP requests in Ruby code
• Hard to test
Many Small Temporary Files on S3
• api-import uploads all requested msgpack.gz files to S3
• S3 outage is a critical issue
• AWS S3 outage in us-east-1 on Feb 28th, 2017
• Many uploaded files make costs high
• costs per object
• costs per operation
Many Small Imported Files on Plazmadb on S3
• 1 MPC1 file on Plazmadb from 1 msgpack.gz file
• on Plazmadb realtime storage
• https://www.slideshare.net/treasure-data/td-techplazma
• Many MPC1 files:
• S3 request cost to store
• S3 request cost to fetch (from Presto, Hive)
• Performance regression when fetching many small files in queries (256MB expected vs. 32MB actual)
Make "Latency" Shorter (1)
• Clients to our endpoints
• JS SDK on customers' pages sends data to our endpoints from mobile devices
• Longer latency increases the percentage of dropped records
• Many endpoints on the Earth: US, Asia + others
• Plazmadb in us-east-1 as the "central location"
• Many geographically-distributed "edge locations"
Make "Latency" Shorter (2)
• Shorter waiting time to query records
• Flexible import task scheduling - better if configurable
• Decouple buffers from endpoint server processes
• More frequent imports with aggregated buffers
(diagram: BEFORE - one buffer per endpoint process; AFTER - endpoints share aggregated buffers feeding MPC1 files)
Redesigning Queues
• Fair scheduling is not required for import tasks
• Import tasks are FIFO (First In, First Out)
• Small payload - (apikey, account_id, database, table)
• More throughput
• Using Queue service + RDBMS
• Queue service for enqueuing/dequeuing
• RDBMS to provide at-least-once
S3-free Temporary Storage
• Make the pipeline free from S3 outages
• Distributed storage cluster as buffer for uploaded data (w/ replication)
• Buffer transferring between edge and central locations
(diagram: clients → storage cluster at the edge location → storage cluster at the central location → MPC1)
Merging Temporary Buffers into a File on Plazmadb
• Non-1-by-1 conversion from msgpack.gz to MPC1
• Buffers can be gathered using a secondary index
• primary index: buffer_id
• secondary index: account_id, database, table, apikey
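The secondary-index lookup above can be sketched with an in-memory model (hypothetical structures; bigdam-pool's real index is server-side in Java):

```ruby
# A buffer keyed by its primary index (buffer_id) plus the secondary index
# fields (account_id, database, table, apikey) used to gather import targets.
Buffer = Struct.new(:buffer_id, :account_id, :database, :table, :apikey, :chunks)

def gather_for_import(buffers, account_id:, database:, table:)
  # secondary-index lookup: every buffer for one import target...
  target = buffers.select do |b|
    b.account_id == account_id && b.database == database && b.table == table
  end
  # ...merged into one chunk list, instead of one MPC1 file per buffer
  target.flat_map(&:chunks)
end

buffers = [
  Buffer.new("b1", 1, "db1", "t1", "keyA", ["c1", "c2"]),
  Buffer.new("b2", 1, "db1", "t1", "keyA", ["c3"]),
  Buffer.new("b3", 2, "db9", "t9", "keyB", ["c4"]),
]
gather_for_import(buffers, account_id: 1, database: "db1", table: "t1")
# => ["c1", "c2", "c3"] - two buffers become one import unit
```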
Should It Provide Read-After-Write Consistency? (Appendix)
• BigQuery provides Read-After-Write consistency
• Pros: Inserted records can be queried immediately
• Cons:
• Much longer latency (especially from non-US regions)
• Much more expensive to host API servers for longer HTTP sessions
• Much more expensive to host query nodes for smaller files on Plazmadb
• Much more trouble
• Say "No!" to it
Bigdam: Planet-scale!
• Edge locations on the earth + the central location
Bigdam-Gateway (mruby on h2o)
• HTTP endpoint servers
• Rack-like API for mruby handlers
• Easy to write, easy to test (!)
• Async HTTP requests from mruby, managed by h2o using Fiber
• HTTP/2 capability in the future
• Handles all requests from td-agent and TD SDKs
• decode/authorize requests
• send data to storage nodes in parallel (to replicate)
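A Rack-like handler of the kind described above can be sketched in plain Ruby and unit-tested without a server. This is a hypothetical handler, not the production bigdam-gateway code; the header format, replica count, and storage-node API are all illustrative:

```ruby
require "stringio"

REPLICAS = 3  # illustrative replica count

# Build a Rack-style handler: env hash in, [status, headers, body] out.
def make_handler(storage_nodes)
  lambda do |env|
    return [405, {}, ["method not allowed"]] unless env["REQUEST_METHOD"] == "POST"
    apikey = env["HTTP_AUTHORIZATION"].to_s.sub(/\ATD1 /, "")
    return [401, {}, ["unauthorized"]] if apikey.empty?
    body = env["rack.input"].read
    # send data to storage nodes to replicate (serial here for clarity;
    # the real gateway does this in parallel via h2o-managed async requests)
    storage_nodes.take(REPLICAS).each { |node| node << body }
    [200, {"content-type" => "application/json"}, ['{"status":"ok"}']]
  end
end

# Testing is just calling a lambda with a hash - no HTTP server needed:
nodes = Array.new(3) { [] }
handler = make_handler(nodes)
status, _headers, _body = handler.call(
  "REQUEST_METHOD"     => "POST",
  "HTTP_AUTHORIZATION" => "TD1 my-apikey",
  "rack.input"         => StringIO.new("msgpack.gz bytes")
)
# status => 200, and all 3 nodes now hold a copy of the request body
```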
Bigdam-Pool (Java)
• Distributed storage for buffering
• Expected data size: 1KB (a json) ~ 32MB (a msgpack.gz from td-agent)
• Append data into a buffer
• Query buffers using the secondary index
• Transfer buffers from edge to central
(diagram: at the edge location, chunks are appended into buffers keyed by account_id/database/table; a buffer is committed on size or timeout, then transferred over the Internet using HTTPS or HTTP/2 to the central location, where import workers pick it up)
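The commit condition from the diagram (buffer committed on size or timeout) can be modeled in a few lines. A sketch only; the thresholds and class shape are illustrative, not bigdam-pool's actual Java implementation:

```ruby
# A buffer becomes committed (ready for transfer/import) once it reaches a
# size limit OR a timeout since its first append - whichever comes first.
class PoolBuffer
  def initialize(size_limit:, timeout:)
    @size_limit = size_limit   # bytes
    @timeout    = timeout      # seconds
    @bytes      = 0
    @created_at = nil
  end

  def append(data, now)
    @created_at ||= now        # clock starts at the first append
    @bytes += data.bytesize
  end

  def committed?(now)
    return false if @created_at.nil?
    @bytes >= @size_limit || (now - @created_at) >= @timeout
  end
end

buf = PoolBuffer.new(size_limit: 32 * 1024 * 1024, timeout: 60)
buf.append("x" * 1024, 0)
buf.committed?(10)   # => false: still small and fresh
buf.committed?(60)   # => true: timeout reached
```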
Bigdam-Scheduler (Golang)
• Scheduler server
• Bigdam-pool requests bigdam-scheduler to schedule import tasks (many times in seconds)
• Bigdam-scheduler enqueues import tasks to bigdam-queue (once per configured interval: default 1min.)
• for every committed buffer, once a minute per account/db/table
Scheduler entries are keyed by (account_id, database, table, apikey):
1. bigdam-pool requests to schedule import tasks for every buffer
2. the requested task is added to the scheduler entries, if missing
3. the task is scheduled to be enqueued after a timeout from entry creation
4. the import task is enqueued into bigdam-queue
5. the entry is removed from the schedule if enqueuing succeeded
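The entry lifecycle above can be sketched as follows (a Ruby model of the logic rather than the real Golang implementation; the interval and structures are illustrative):

```ruby
# Many per-buffer notifications collapse into one entry per
# (account_id, database, table, apikey); each entry becomes one import task
# after the configured interval, then the entry is removed.
class SchedulerEntries
  Entry = Struct.new(:key, :created_at)

  def initialize(interval: 60)   # default: once per minute per key
    @interval = interval
    @entries  = {}               # key => Entry
  end

  # steps 1-2: add an entry for this key, if missing
  def notify(key, now)
    @entries[key] ||= Entry.new(key, now)
  end

  # steps 3-5: entries past the timeout become tasks and are removed
  def due_tasks(now)
    due = @entries.values.select { |e| now - e.created_at >= @interval }
    due.each { |e| @entries.delete(e.key) }   # removed once enqueued
    due.map(&:key)
  end
end

sched = SchedulerEntries.new(interval: 60)
key = [1, "db1", "t1", "apikeyA"]
5.times { sched.notify(key, 0) }   # many notifications, one entry
sched.due_tasks(59)  # => [] - interval not reached yet
sched.due_tasks(60)  # => [[1, "db1", "t1", "apikeyA"]] - enqueued exactly once
```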
Bigdam-Queue (Java)
• High-throughput queue for import tasks
• Enqueue/dequeue using AWS SQS (standard queue)
• Task state management using AWS Aurora
• Roughly ordered, at-least-once
• Enqueue (from bigdam-scheduler): 1. INSERT (task, enqueued) into Aurora, 2. enqueue to SQS
• Dequeue (from bigdam-import): 1. dequeue from SQS, 2. UPDATE to (task, running) in Aurora
• Finish (from bigdam-import): 1. DELETE the task row from Aurora
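The state transitions above can be modeled in memory. This is a sketch of the protocol only, with an Array standing in for SQS and a Hash standing in for Aurora; the real system's failure handling and re-delivery are omitted:

```ruby
# Queue service for enqueue/dequeue, RDBMS for task state: the row is written
# before the message is enqueued, and deleted only on success, so a crashed
# worker leaves a recoverable "running" row behind (at-least-once).
class ImportQueue
  def initialize
    @sqs = []   # stand-in for the SQS standard queue
    @db  = {}   # stand-in for Aurora: task_id => state
  end

  def enqueue(task_id)
    @db[task_id] = :enqueued   # 1. INSERT the state row first
    @sqs.push(task_id)         # 2. then enqueue to the queue service
  end

  def dequeue
    task_id = @sqs.shift               # 1. dequeue
    @db[task_id] = :running if task_id # 2. UPDATE the state row
    task_id
  end

  def finish(task_id)
    @db.delete(task_id)        # DELETE the row on success
  end

  def state(task_id)
    @db[task_id]
  end
end

q = ImportQueue.new
q.enqueue("task-1")
q.state("task-1")  # => :enqueued
q.dequeue          # => "task-1"
q.state("task-1")  # => :running
q.finish("task-1")
q.state("task-1")  # => nil
```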
Bigdam-Import (Java)
• Import worker
• Convert source (json/msgpack.gz) to MPC1
• Execute import tasks in parallel
• Dequeue tasks from bigdam-queue
• Query and download buffers from bigdam-pool
• Make a list of chunk ids and put it to bigdam-dddb
• Execute deduplication to determine chunks to be imported
• Make MPC1 files and put them into Plazmadb
Bigdam-Dddb (Java)
• Database service for deduplication
• Based on AWS Aurora and S3
• Stores unique chunk ids per import task, not to import the same chunk twice
• For a small list of chunk ids: 1. store the chunk-id list, 2. INSERT (task-id, list-of-chunk-ids) into Aurora
• For a huge list of chunk ids: 1. upload encoded chunk-ids to AWS S3, 2. store the task-id and S3 object path, 3. INSERT (task-id, path-of-ids) into Aurora
• To fetch chunk-id lists imported in the past: 1. query lists of past tasks, 2. SELECT from Aurora, 3. download from S3 if needed
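The deduplication step itself is a set difference over chunk ids. A minimal sketch, assuming in-memory data (bigdam-dddb persists these lists in Aurora/S3):

```ruby
require "set"

# Import only the chunks not already recorded by past tasks for this target,
# so a retried task never imports the same chunk twice.
def chunks_to_import(current_chunk_ids, past_tasks)
  imported = past_tasks.values.flatten.to_set
  current_chunk_ids.reject { |id| imported.include?(id) }
end

past = { "task-1" => ["c1", "c2"], "task-2" => ["c3"] }
chunks_to_import(["c2", "c3", "c4"], past)  # => ["c4"]: c2 and c3 were imported before
```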
Executing Deduplication at the End of the Pipeline
• Make it simple & reliable
• At-least-once everywhere: clients (data input) → gateway → pool (edge) → pool (central) → scheduler/queue → import worker → dddb → Plazmadb
• Deduplication (transaction + retries) only at the final step
At-Least-Once: Bigdam-pool Data Replication
• Client-side replication, for large chunks (1MB~): the client uploads 3 replicas to 3 nodes in parallel
• Server-side replication, for small chunks (~1MB): the primary node appends chunks to an existing buffer, then replicates them (so nodes hold equal contents/checksums)
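Client-side replication can be sketched as a parallel fan-out. This is an illustrative model, not bigdam-pool's code: the node API is a stand-in, the replica count is from the slide, and retry/failover are omitted:

```ruby
REPLICA_COUNT = 3  # replicas per chunk, as on the slide

# Upload the same chunk to 3 nodes in parallel and wait for all of them,
# so the chunk survives the loss of any single node.
def replicate(chunk, nodes)
  targets = nodes.take(REPLICA_COUNT)
  threads = targets.map { |node| Thread.new { node << chunk } }
  threads.each(&:join)   # wait for every replica write
  targets.size
end

nodes = Array.new(3) { [] }   # stand-ins for storage nodes
replicate("chunk-bytes", nodes)  # => 3
# every node now holds an identical copy of the chunk
```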
At-Least-Once: Bigdam-pool Data Replication (cont.)
• Server-side replication for transferred buffers
Scaling-out (almost) Everywhere
• Scalable components on EC2 (& ready for AWS autoscaling): gateway, pool (edge/central), import worker
• AWS Aurora (w/o table locks) + AWS SQS (+ AWS S3)
Scaling-up Just For A Case: Scheduler
• The scheduler needs to collect notifications of all buffers
• and cannot be parallelized across nodes (in any easy way)
• Solution: a high-performance singleton server: 90k+ reqs/sec
Bigdam Current Status: Under Testing
It's great fun to design Distributed Systems! Thank you! @tagomoris