Rakuten Technology Conference 2017
A Distributed SQL Database
For Data Analysis, Astra Project
2017-10-28
Yosuke Hara (原 陽亮)

Rakuten Institute of Technology

Rakuten, Inc. rev. 1.0.5
R&D Projects
Skylab: A Microservices Framework
LeoFS: A Distributed Storage
Astra: A Distributed SQL Database For Data Analytics
Introducing Astra
* “Astra” is the code name of a product under development
One Of The Backgrounds
More “Connected Things” In The World
Consumer Applications to Represent 63% of Total IoT Applications in 2017
IoT Units Installed Base by Category (Millions of Units)
Category                    |    2016 |    2017 |     2018 |     2020
----------------------------+---------+---------+----------+---------
Consumer                    |   3,963 | 5,244.3 |  7,036.3 |   12,863
Business: Cross-Industry    | 1,102.1 |   1,501 |  2,132.6 |  4,381.4
Business: Vertical-Specific | 1,316.6 | 1,635.4 |  2,027.7 |    3,171
Total                       |    6.4B |    8.4B |    11.2B |    20.4B
2017 total: +31% over 2016; 2017 category shares: 63% (Consumer), 18% and 19% (Business)
Source: Gartner (January 2017)
Providing A Database With Which Anyone Can Analyze Data
Initial Concept
Provides Components of DataLake as a Service
Data Science
+
DataLake
Data Governance, Job Scheduler
+
Distributed Computing
Data Store
Astra, Skylab
Spark, Hadoop
Self-Service Analytics
Current Concept
Advanced Data Analysis In Semi-Realtime At Low Cost
Aggregate and Analyze Data
Find Insights
Streaming Data
Un/Semi-Structured Data
Store Data Into Astra
Data, Intelligence, Action
Tools / Apps
Automated Systems
Current Concept: Depends on Single Source Of Truth
Self-Service Analytics
Data Governance
Astra’s Components
Distributed Computing For Massively Parallel Processing
+ Distributed Database For Aggregation and Analysis
+ Distributed Storage (DataLake Store)
In-place Analysis
Features
Unified Components
Database
SQL Engine
Data Science
Analysis Functions On Distributed Computing
Reliability, Scalability, and Massively Parallel Processing
Ad-hoc Query
Various Data Without Limit
Data Store
The Features - ANSI SQL99 Standard
Conforms To The ANSI SQL99 Standard
• Communicates With Any BI / Data Visualization Tools and Apps
• Able To Call All Of Astra’s Functions, UDFs, and ML With SQL
astra:test> SELECT workclass, COUNT(income)
         -> AS income_count
         -> FROM adult_income
         -> WHERE income = '<=50K'
         -> GROUP BY workclass
         -> ORDER BY workclass;
    workclass     | income_count
------------------+--------------
 ?                |         2534
 Federal-gov      |          871
 Local-gov        |         2209
 Never-worked     |           10
 Private          |        26519
 Self-emp-inc     |          757
 Self-emp-not-inc |         2785
 State-gov        |         1451
 Without-pay      |           19
(9 rows)
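The query above can be tried without an Astra cluster. The following is a minimal sketch, assuming SQLite (through Python's standard `sqlite3` module) as a stand-in SQL engine. The five sample rows and the helper name `income_counts` are made up for illustration and are not the adult_income data set, so the counts differ from the slide.

```python
import sqlite3

# Hypothetical miniature of the adult_income table; these rows are invented
# for illustration and are NOT the data behind the slide's output.
ROWS = [
    ("Private", "<=50K"),
    ("Private", "<=50K"),
    ("Private", ">50K"),
    ("State-gov", "<=50K"),
    ("Local-gov", ">50K"),
]


def income_counts(rows):
    """Run the slide's GROUP BY query against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE adult_income (workclass TEXT, income TEXT)")
        conn.executemany("INSERT INTO adult_income VALUES (?, ?)", rows)
        query = """
            SELECT workclass, COUNT(income) AS income_count
            FROM adult_income
            WHERE income = '<=50K'
            GROUP BY workclass
            ORDER BY workclass
        """
        return conn.execute(query).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for workclass, count in income_counts(ROWS):
        print(f"{workclass:<16} | {count}")
```

Because the statement uses only standard SELECT, WHERE, GROUP BY, and ORDER BY clauses, it runs unchanged on SQLite here.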
The Features - Advanced Data Analytics
Advanced Data Analytics On Distributed Computing With Massively Parallel Processing
• Built-In Analysis Functions and UDFs
• Machine Learning
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Feedback
Able To Repeat Trial And Error Without Limit
The Features - Availability and Scalability
High Availability
• Automated Data Replication, Recovery, and Failover
High Scalability
• An Elastic Cluster - Nodes That Can Flexibly Attach And Detach
Clients
Request / Response (HTTP)
Coordinator(s)
Monitoring Resources
Scheduling Jobs
Requesting Jobs
Message with Gossip Protocol
Workers
Circuit Breaker
Figure: Akka Circuit breaker
* Circuit Breaker: martinfowler.com/bliki/CircuitBreaker.html
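The slide only names the Circuit Breaker pattern and points to Martin Fowler's description. As a rough illustration of that pattern (not Astra's or Akka's implementation), a minimal sketch might look like the following. The class names, the thresholds, and the injectable clock are all assumptions made for this example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: closed -> open after repeated failures -> half-open after a timeout."""

    def __init__(self, max_failures=3, reset_timeout=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call while half-open, or too many failures
            # while closed, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A coordinator could wrap each request to a worker in `breaker.call(...)`, so that a failing worker is skipped quickly instead of being retried on every job.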
Architecture
High-level Architecture
SQL Engine
Workers
Database Layer
DataStore Layer
Astra CLI
Clients
SQL over ODBC/JDBC
Astra DataStore
AstraSQL
AstraBase
- Original Data
- Semi-Structured Data
- Cold Data
- Columnar Tables
- Metadata Store
- Record Operation
- Record Set Cache (Hot Data)
- Distributed Computing
- Data Analysis
- Data Converter
- Semi-Structured Data To Columnar Table
Original Data Load
Operate Astra
Multi-Coordinator
LeoFS For Astra DataStore
LeoFS is a software-defined storage (SDS) for DataLake and Web
LeoFS is an enterprise open source storage system: a highly available, distributed, eventually consistent object/blob store
Goals:
- High Availability
- High Cost Performance Ratio
- High Scalability
Processing Flow - Store a CSV File, Then Query Data
Components: AstraCLI, AstraSQL, AstraBase Coordinator(s), AstraBase Workers, Resource Monitor + Scheduler, Astra DataStore (LeoFS)
Interfaces: REST-API, gRPC, S3-API, O/JDBC
[Store a CSV File]
1-1. Put Original Data w/AstraCLI
1-2. Create Metadata
2. Store the Data and Metadata
[Execute Query]
3. Query Data For Aggregation Or Data Analysis
[Convert Data Format Asynchronously]
4. Request Conversion of a Table’s Data Format
5. Convert the Table’s Data Format and Update the Table’s Metadata
6. Store Converted Data
Processing Flow - Query for Advanced Analysis
Components: AstraSQL, AstraBase Coordinator(s), AstraBase Workers, Resource Monitor + Scheduler, Astra DataStore (LeoFS)
Interfaces: O/JDBC, gRPC, S3-API
[Execute Query]
1. Execute SQL For Data Analysis
2-1. Request Data Analysis to AstraBase
2-2. Send Request Messages to AstraBase’s Workers
[Retrieve Records]
3-1. Retrieve Target Records from the Cache
3-2. Retrieve Target Records From LeoFS (Cache Miss)
4. Process Data Analysis in Parallel
[Reply]
5. Reply To AstraBase Coordinator, Then Summarize the Result on the Coordinator
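The record retrieval in this flow (cache first, LeoFS on a miss) is a read-through cache. The sketch below shows the idea under stated assumptions: `RecordSetCache` and `load_from_store` are hypothetical names, and any callable can stand in for the S3-API read from LeoFS. It is not Astra's actual cache.

```python
class RecordSetCache:
    """Read-through cache: serve hot records from memory, else load from the data store."""

    def __init__(self, load_from_store):
        # load_from_store is a hypothetical callable standing in for an
        # S3-API read against the data store (LeoFS in the slides).
        self._load = load_from_store
        self._cache = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._cache:
            # Cache hit: return the hot record set directly.
            self.hits += 1
            return self._cache[key]
        # Cache miss: load from the data store and keep the result hot.
        self.misses += 1
        records = self._load(key)
        self._cache[key] = records
        return records


if __name__ == "__main__":
    store = {"adult_income/chunk-0": [("Private", "<=50K")]}
    cache = RecordSetCache(store.__getitem__)
    cache.get("adult_income/chunk-0")  # miss, loaded from the store
    cache.get("adult_income/chunk-0")  # hit, served from memory
    print(cache.hits, cache.misses)
```

A real cache would also need eviction and invalidation when a table's format is converted; both are omitted here to keep the sketch short.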
1. Data Load
Store Files Into Astra (Original Data, Semi-Structured Files)
Data Validation
Data Verification
Data Type Inference
Partition Into Multiple Chunks
Store Chunks and Metadata
2. Asynchronous Data Conversion
CSV / TSV / JSON To Parquet / CarbonData SerDes
Data is partitioned by a condition on a specified column
To Handle Multiple Data Formats In A Table
Able To Run Self-Service Analytics Even During Data Conversion
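Partitioning by a condition on a specified column can be illustrated with a few lines of standard-library Python. This is a minimal sketch that assumes the simplest possible condition (one chunk per distinct column value). The function name `partition_by_column` and the sample data are hypothetical, and the slides do not say which conditions Astra supports.

```python
import csv
import io
from collections import defaultdict


def partition_by_column(csv_text, column):
    """Group CSV rows into chunks keyed by the value of one column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    chunks = defaultdict(list)
    for row in reader:
        # Assumed partition condition: equality on the chosen column.
        chunks[row[column]].append(row)
    return dict(chunks)


if __name__ == "__main__":
    sample = "id,country\n1,JP\n2,US\n3,JP\n"
    for key, rows in partition_by_column(sample, "country").items():
        print(key, [r["id"] for r in rows])
```

Each resulting chunk could then be converted to a columnar format independently, which is one way a table can hold more than one data format while conversion is in progress.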
Data Storage
Supports Multiple Data Formats And SerDes
Supported Data Formats and SerDes
- CSV, TSV, and Custom Delimiter Files
- JSON
- RegEx SerDes for Unstructured Data
- Parquet SerDes (A Columnar Storage Format)
- CarbonData SerDes (A Columnar Storage Format)
Supported Compression Methods
- SNAPPY
- ZLIB
- GZIP
- LZO
An Example of METADATA as JSON
Table Schema
Parquet Format
CSV Format
Data Type Inference
Stores Each File Into Astra DataStore, LeoFS
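The metadata example on this slide was an image and is not recoverable from the transcript. As a stand-in, the following sketch shows how column types might be inferred from a CSV file and recorded as JSON metadata. The JSON layout, the function names, and the `double` type name are assumptions for illustration only; `integer` and `varchar` are the type names that appear in the DESCRIBE output of this deck.

```python
import csv
import io
import json


def infer_type(values):
    """Return the narrowest assumed type name that fits every non-empty value."""

    def all_parse(parser):
        for value in values:
            if value == "":
                continue  # empty cells do not constrain the type
            try:
                parser(value)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "double"
    return "varchar"


def build_metadata(table, csv_text):
    """Build a hypothetical table-metadata dict from CSV text with a header row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    columns = [
        {"name": name, "type": infer_type([row[i] for row in body])}
        for i, name in enumerate(header)
    ]
    return {"table": table, "format": "csv", "columns": columns}


if __name__ == "__main__":
    sample = "age,workclass,score\n39,State-gov,1.5\n50,Private,2\n"
    print(json.dumps(build_metadata("adult_income", sample), indent=2))
```

After a conversion step, the `format` field of such a record could be switched from `csv` to `parquet` for the converted chunks.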
Machine Learning on Astra - Modeling
Components: AstraSQL, AstraBase Coordinator(s), AstraBase Workers, Resource Monitor + Scheduler, Astra DataStore (LeoFS)
Interfaces: O/JDBC, gRPC, S3-API
[Request Modeling]
1. Send A Modeling Request To An Initiator Of AstraBase
[Create A Model, Then Store It]
2. Generate Tasks From A Job On A Coordinator
3. Send Tasks To Workers
4-1. Execute Function(s) In Parallel On Each Worker
4-2. Load Data From The Data Store If It Is Not In The Cache
5. Summarize The Result On A Coordinator, Then Store The Model Into The Cluster For Reuse
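The modeling flow above is a scatter-gather pattern: a coordinator splits a job into tasks, workers reduce their slices in parallel, and the coordinator combines the partial results into a model. The sketch below illustrates this with a least-squares line fit. It assumes a thread pool as a stand-in for AstraBase workers and uses hypothetical function names; the slides do not describe Astra's actual algorithms.

```python
from concurrent.futures import ThreadPoolExecutor


def make_tasks(records, n_tasks):
    """Split a job's (x, y) records into n_tasks slices (coordinator side)."""
    return [records[i::n_tasks] for i in range(n_tasks)]


def worker_partial(task):
    """Reduce one slice to sufficient statistics (worker side)."""
    n = len(task)
    sx = sum(x for x, _ in task)
    sy = sum(y for _, y in task)
    sxx = sum(x * x for x, _ in task)
    sxy = sum(x * y for x, y in task)
    return n, sx, sy, sxx, sxy


def fit_line(records, n_workers=4):
    """Scatter tasks, gather partial sums, then fit y = slope * x + intercept."""
    tasks = make_tasks(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial, tasks))
    # Coordinator side: summarize the partial results into one model.
    n, sx, sy, sxx, sxy = (sum(p[i] for p in partials) for i in range(5))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # undefined if all x are equal
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    data = [(x, 2 * x + 1) for x in range(10)]
    print(fit_line(data))
```

Only the five partial sums travel back to the coordinator, so the amount of data exchanged stays constant no matter how many records each worker holds.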
Integration With BI Tools
Integration With Tableau (BI Tool)
astra:test> DESCRIBE adult_income
         -> ;
     Column      |  Type   | Extra | Comment
-----------------+---------+-------+---------
 age             | integer |       |
 workclass       | varchar |       |
 fnlwgt          | integer |       |
 education       | varchar |       |
 educational-num | integer |       |
 marital-status  | varchar |       |
 occupation      | varchar |       |
 relationship    | varchar |       |
 race            | varchar |       |
 gender          | varchar |       |
 capital-gain    | integer |       |
 capital-loss    | integer |       |
 hours-per-week  | varchar |       |
 native-country  | varchar |       |
 income          | varchar |       |
(15 rows)
astra:test> SELECT workclass, COUNT(income)
         -> as income_count
         -> FROM adult_income
         -> WHERE income = '<=50K'
         -> GROUP BY workclass
         -> ORDER BY workclass;
    workclass     | income_count
------------------+--------------
 ?                |         2534
 Federal-gov      |          871
 Local-gov        |         2209
 Never-worked     |           10
 Private          |        26519
 Self-emp-inc     |          757
 Self-emp-not-inc |         2785
 State-gov        |         1451
 Without-pay      |           19
(9 rows)
Visualizing Data With 3rd Party Tools
Communicates With Data Visualization And BI Tools
Dundas BI
Qlik Sense
Microsoft Power BI
Future Plans
By end of Oct 2017: Alpha
Nov 2017 - end of June 2018: 1st Beta, 2nd Beta
Q3 2018: Publish It
- Alpha
  - Un/Semi-Structured Data and Parquet SerDes Support
  - BI Tools and Visualization Tools Integration
- 1st Beta, Step-Growth Phase
  - Record Set Cache
  - Distributed Computing For UDF and ML
  - Other SerDes Support
THANK YOU
