February 16th, 2016
louis.rabiet@squidsolutions.com
Migrating structured data between Hadoop and RDBMS
Who am I?
• Full-stack engineer at Squid Solutions.
• Specialised in big data.
• Fun fact: I sleep alone in my tent on top of the world's highest mountains.
What do I do?
• Develop an analytics toolbox.
• No setup. No SQL. No compromise.
• Generate SQL with a REST API.
It is open source!
https://github.com/openbouquet
Topic of today
• You need scalability?
• You need a machine learning toolbox?
Hadoop is the solution.
• But you still need structured data?
Our tool provides a solution.
=> We need both!
What does that mean?
• Create a dataset in Bouquet
• Send the dataset to Spark
• Enrich it inside Spark
• Re-inject it into the original database
How do we do it?
[Diagram: User input, Relational DB, Bouquet, Spark. Caption: Create and Send]
How does it work?
[Diagram: Relational DB, Bouquet, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
• The user selects the data. Bouquet generates the corresponding SQL code.
• Data is read from the SQL database.
• The BI tool creates an Avro schema and sends the data to Kafka.
• The Kafka broker(s) receive the data.
• The Hive metastore is updated and the HDFS connector writes into HDFS.
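The first three steps (run the generated SQL, derive an Avro schema, send the records) can be sketched roughly as follows. This is only an illustration, not Bouquet's code: an in-memory SQLite database stands in for the relational DB, a plain list stands in for the Kafka topic, and `PY_TO_AVRO`, `avro_schema_from_rows` and `export_dataset` are hypothetical names.

```python
import json
import sqlite3

# Hypothetical mapping from Python value types (as returned by the DB driver)
# to Avro primitive types; a real tool would rely on the column metadata.
PY_TO_AVRO = {int: "long", float: "double", str: "string", bytes: "bytes", bool: "boolean"}


def avro_schema_from_rows(name, columns, rows):
    """Build an Avro record schema from the first row (assumes at least one row)."""
    first = rows[0]
    fields = [
        {"name": col, "type": PY_TO_AVRO[type(value)]}
        for col, value in zip(columns, first)
    ]
    return {"type": "record", "name": name, "fields": fields}


def export_dataset(conn, name, sql, send):
    """Run the generated SQL, derive a schema, and hand each record to `send`."""
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    schema = avro_schema_from_rows(name, columns, rows)
    for row in rows:
        send(name, dict(zip(columns, row)))
    return schema


# Stand-ins: an in-memory SQLite database and a list instead of a Kafka topic.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (gender TEXT)")
conn.executemany("INSERT INTO artist VALUES (?)", [("F",), ("M",), ("F",)])

sent = []
schema = export_dataset(
    conn,
    "ArtistGender",
    "SELECT COUNT(*) AS count, gender FROM artist GROUP BY gender ORDER BY gender",
    lambda topic, record: sent.append((topic, record)),
)
print(json.dumps(schema))
```

In a real deployment the `send` callback would be a Kafka producer using an Avro serializer; that part is left out here so the sketch stays self-contained.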
How to keep the data structured?
Use a schema registry (Avro in Kafka).
Each schema has a corresponding Kafka topic and a distinct Hive table.
{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}
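To make the "one schema, one topic, one Hive table" rule concrete, here is a small sketch that derives a topic name and a Hive DDL statement from the ArtistGender schema (written with `string` in lower case, as Avro expects). The naming convention and the `topic_and_table` helper are assumptions made for illustration; in the pipeline described here the table creation is handled by the HDFS connector and the Hive metastore.

```python
import json

# Avro primitive types and their usual Hive counterparts.
AVRO_TO_HIVE = {
    "boolean": "BOOLEAN",
    "int": "INT",
    "long": "BIGINT",
    "float": "FLOAT",
    "double": "DOUBLE",
    "string": "STRING",
    "bytes": "BINARY",
}


def topic_and_table(schema):
    """Derive a Kafka topic name and a Hive DDL statement from one Avro schema.

    Using the lower-cased record name for both is an assumed convention.
    """
    name = schema["name"].lower()
    columns = ", ".join(
        "`{}` {}".format(f["name"], AVRO_TO_HIVE[f["type"]]) for f in schema["fields"]
    )
    ddl = "CREATE TABLE IF NOT EXISTS {} ({})".format(name, columns)
    return name, ddl


schema = json.loads("""
{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}
""")

topic, ddl = topic_and_table(schema)
print(topic)
print(ddl)
```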
Challenges
- Automatic creation of a Kafka topic and a Hive table for each dataset coming from Bouquet.
- JDBC reads are too slow for something like Kafka.
- Type conversion issues: for example, null is not supported in all cases (issue 272 on schema-registry).
- Versions: Kafka 0.9.0, Tachyon 0.7.1, Spark 1.5.2 with Hortonworks 2.3.4 (Dec 2015).
- Hive: setting the warehouse directory.
- Tachyon: setting up the hostname.
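On the null-handling point: SQL columns are often nullable, while a plain Avro type such as "string" rejects null. One common way to model this in Avro is a union with "null" listed first and a null default. The sketch below shows that general technique only; it does not reproduce the schema-registry issue mentioned above, and `make_nullable` is a hypothetical helper.

```python
def make_nullable(schema):
    """Return a copy of an Avro record schema where every field accepts null."""
    fields = []
    for field in schema["fields"]:
        field_type = field["type"]
        if isinstance(field_type, list):
            # Already a union: put "null" first so that a null default is valid.
            branches = ["null"] + [t for t in field_type if t != "null"]
        else:
            branches = ["null", field_type]
        # json.dumps renders the None default as JSON null.
        fields.append({"name": field["name"], "type": branches, "default": None})
    return {**schema, "fields": fields}


strict = {
    "type": "record",
    "name": "ArtistGender",
    "fields": [
        {"name": "count", "type": "long"},
        {"name": "gender", "type": "string"},
    ],
}
nullable = make_nullable(strict)
print(nullable["fields"][1])
```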
Tachyon?
• Used as an in-memory filesystem in place of HDFS.
• Interacts with Spark through the HDFS plugin.
• Transparent from the user's point of view.
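From the user's side, the switch mostly comes down to the URI scheme and host that a path points to. The sketch below rewrites an `hdfs://` path into a `tachyon://` one; the helper name, the host names and the master port `19998` are assumptions for illustration and should be checked against the actual cluster configuration.

```python
from urllib.parse import urlsplit, urlunsplit


def to_tachyon_uri(hdfs_uri, tachyon_master="localhost:19998"):
    """Keep the path, swap the scheme and authority (master address is assumed)."""
    parts = urlsplit(hdfs_uri)
    return urlunsplit(("tachyon", tachyon_master, parts.path, "", ""))


tachyon_path = to_tachyon_uri("hdfs://namenode:8020/warehouse/artistgender")
print(tachyon_path)
```

The resulting path can then be passed to Spark wherever an HDFS path would be used, provided the Tachyon client is on the classpath.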
Status
Injection SQL -> Spark: OK
Spark usage: OK
Re-injection: In alpha stage.
Re-injection
Two solutions:
• The Spark user notifies Bouquet that the data has changed (using a custom function)
• Bouquet pulls the data from Spark
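The difference between the two options can be sketched with plain Python objects. Everything here is hypothetical: `DatasetStore` stands in for the Spark side, `Reinjector` for the Bouquet side, and the version counter is an assumed way to detect changes when pulling.

```python
class DatasetStore:
    """Stand-in for the Spark side: holds enriched datasets and a version counter."""

    def __init__(self):
        self._data = {}
        self._versions = {}
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def save(self, name, rows):
        self._data[name] = list(rows)
        self._versions[name] = self._versions.get(name, 0) + 1
        # Push model: tell every listener that the dataset changed.
        for callback in self._listeners:
            callback(name, self._versions[name])

    def version(self, name):
        return self._versions.get(name, 0)

    def read(self, name):
        return list(self._data[name])


class Reinjector:
    """Stand-in for the Bouquet side: records what would be written back."""

    def __init__(self, store):
        self.store = store
        self.seen = {}
        self.reinjected = []

    def on_change(self, name, version):
        # Push model: called by the store when data has changed.
        self._load(name, version)

    def poll(self, name):
        # Pull model: check the version and load only if it is newer.
        version = self.store.version(name)
        if version > self.seen.get(name, 0):
            self._load(name, version)
            return True
        return False

    def _load(self, name, version):
        self.seen[name] = version
        self.reinjected.append((name, version, self.store.read(name)))


store = DatasetStore()
pushed = Reinjector(store)
store.subscribe(pushed.on_change)   # option 1: notification
pulled = Reinjector(store)          # option 2: polling

store.save("artistgender", [{"gender": "F", "count": 2}])
first_poll = pulled.poll("artistgender")
second_poll = pulled.poll("artistgender")
print(len(pushed.reinjected), first_poll, second_poll)
```

Pushing reacts immediately but requires the Spark user to call the notification function; pulling needs no cooperation from the Spark side but only sees changes at the next poll.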
We use it for real!
We are collaborating with La Poste to use Spark and the re-injection mechanism together with Bouquet and a geographical visualisation.
In the future
• Notebook integration
• We have a DSL for the Bouquet API; we may want built-in Spark support.
• Improve scalability (bulk unload and Kafka fine-tuning)
QUESTIONS
OPENBOUQUET.IO
Bouquet Architecture
[Diagram labels: DB, HD (SQL data) · JDBC · Bouquet Server: Dynamic Caching & Indexing, Business Modeling, REST API, OAuth2, Multi-Tenant · Redis, Elastic, MongoDB · JS/SDK · Generic Apps, Custom Apps]

HUG France Feb 2016 - Migrating structured data between Hadoop and RDBMS, by Louis Rabiet (Squid Solutions)
