HUG France Feb 2016 - Migrating structured data between Hadoop and RDBMS, by Louis Rabiet (Squid Solutions)

The document discusses a method for migrating structured data between Hadoop and RDBMS using an analytics toolbox called Bouquet, which generates SQL via a REST API. It highlights the integration of various technologies including Spark, Kafka, and Avro schema for efficient data management and enrichment. Challenges in implementation, as well as future improvements for scalability and functionality, are also outlined.

February 16th, 2016
louis.rabiet@squidsolutions.com
Migrating structured data between Hadoop and RDBMS

Who am I?
• Full Stack engineer at Squid Solutions.
• Specialised in big data.
• Fun fact: I sleep by myself in my tent on top of the highest mountains in the world.

What do I do?
• Develop an analytics toolbox.
• No setup. No SQL. No compromise.
• Generate SQL with a REST API.
It is open source!
https://github.com/openbouquet

Topic of today
• You need scalability?
• You need a machine learning toolbox?
Hadoop is the solution.
• But you still need structured data?
Our tool provides a solution.
=> We need both!

What does that mean?
• Create a dataset in Bouquet
• Send the dataset to Spark
• Enrich it inside Spark
• Re-inject it into the original database

How do we do it?
[Diagram: user input, Bouquet, relational DB, Spark]

Create and Send

How does it work?
[Diagram, repeated on five slides with one caption each: Bouquet, relational DB, Kafka, Spark, HDFS/Tachyon, Hive Metastore]
The captions, in order:
• The user selects the data. Bouquet generates the corresponding SQL code.
• The data is read from the SQL database.
• The BI tool creates an Avro schema and sends the data to Kafka.
• The Kafka broker(s) receive the data.
• The Hive metastore is updated and the HDFS connector writes into HDFS.
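
The first three steps of this flow (run the generated SQL, derive a schema, hand the rows to a producer) can be sketched roughly as follows. This is only an illustration under several assumptions: an in-memory SQLite database stands in for the relational DB, the type mapping and the helper names `avro_schema_for` and `publish` are hypothetical, rows are JSON-encoded rather than Avro-encoded, and `send` is a plain callable standing in for a Kafka producer. It is not Bouquet's actual code.

```python
import json
import sqlite3

# Hypothetical mapping from SQLite declared types to Avro primitive types.
SQLITE_TO_AVRO = {"INTEGER": "long", "REAL": "double", "TEXT": "string"}


def avro_schema_for(conn, table, record_name):
    """Build an Avro record schema (as a dict) from a table's declared columns."""
    columns = conn.execute(f"PRAGMA table_info({table})").fetchall()
    fields = [
        {"name": name, "type": SQLITE_TO_AVRO[decl_type.upper()]}
        for _cid, name, decl_type, *_rest in columns
    ]
    return {"type": "record", "name": record_name, "fields": fields}


def publish(conn, sql, topic, send):
    """Run the SQL and pass each row to `send` as a JSON-encoded dict."""
    cursor = conn.execute(sql)
    names = [col[0] for col in cursor.description]
    count = 0
    for row in cursor:
        send(topic, json.dumps(dict(zip(names, row))))
        count += 1
    return count


# Stand-in for the relational database.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE artist_gender ("count" INTEGER, gender TEXT)')
conn.executemany(
    "INSERT INTO artist_gender VALUES (?, ?)", [(12, "female"), (30, "male")]
)

schema = avro_schema_for(conn, "artist_gender", "ArtistGender")

sent = []  # in-memory stand-in for a Kafka producer
n = publish(
    conn,
    'SELECT "count", gender FROM artist_gender ORDER BY "count"',
    "ArtistGender",
    lambda topic, payload: sent.append((topic, payload)),
)
print(json.dumps(schema, indent=2))
print(n, "rows sent")
```

In a real deployment the `send` callable would be replaced by a producer that serialises with the registered Avro schema; that part is deliberately left out here.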

How to keep the data structured?
Use a schema registry (Avro in Kafka).
Each schema has a corresponding Kafka topic and a distinct Hive table.
{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}
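
To make the one-schema, one-topic, one-table idea concrete, here is a small sketch that parses the ArtistGender schema (written with Avro's lowercase `string` primitive), derives a topic name and a Hive table name from it, and checks whether a record matches the declared fields. The naming rule in `targets_for` and the helper `conforms` are invented for this example; a real setup would rely on the schema registry and an Avro library for validation.

```python
import json

SCHEMA_JSON = """
{
  "type": "record",
  "name": "ArtistGender",
  "fields": [
    {"name": "count", "type": "long"},
    {"name": "gender", "type": "string"}
  ]
}
"""

# Python types accepted for the two Avro primitives used by this schema.
AVRO_TO_PYTHON = {"long": int, "string": str}


def targets_for(schema):
    """Derive one topic name and one Hive table name from the schema name.

    The naming rule is an assumption made for this sketch.
    """
    name = schema["name"]
    return {"topic": name, "hive_table": name.lower()}


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields with matching types."""
    fields = {f["name"]: f["type"] for f in schema["fields"]}
    if set(record) != set(fields):
        return False
    # Exact type comparison, so that a bool is not accepted as a long.
    return all(type(record[n]) is AVRO_TO_PYTHON[t] for n, t in fields.items())


schema = json.loads(SCHEMA_JSON)
print(targets_for(schema))
print(conforms({"count": 12, "gender": "female"}, schema))
print(conforms({"count": "12", "gender": "female"}, schema))
```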

Challenges
- Automatic creation of a topic and a Hive table for each dataset coming from Bouquet.
- JDBC reads are too slow for something like Kafka.
- Issues with type conversion: for example, null is not supported in all cases (issue 272 on schema-registry).
- Versions: Kafka 0.9.0, Tachyon 0.7.1, Spark 1.5.2 with Hortonworks 2.3.4 (Dec 2015).
- Hive: setting the warehouse directory.
- Tachyon: setting up the hostname.
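
On the null point: in Avro, a field that may be null is declared as a union such as `["null", "string"]`, and when the default is null the `"null"` branch has to come first. The sketch below rewrites a record schema that way. The helper name `make_nullable` is hypothetical, and whether this approach avoids the specific schema-registry issue mentioned above is not something the slides state.

```python
import copy
import json


def make_nullable(schema):
    """Return a copy of an Avro record schema where every field also accepts null."""
    result = copy.deepcopy(schema)
    for field in result["fields"]:
        current = field["type"]
        branches = current if isinstance(current, list) else [current]
        # Put "null" first: a union's default must match its first branch.
        field["type"] = ["null"] + [b for b in branches if b != "null"]
        field["default"] = None
    return result


schema = {
    "type": "record",
    "name": "ArtistGender",
    "fields": [
        {"name": "count", "type": "long"},
        {"name": "gender", "type": "string"},
    ],
}

nullable = make_nullable(schema)
print(json.dumps(nullable, indent=2))
```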

Tachyon?
• Used as an in-memory filesystem to replace HDFS.
• Interacts with Spark through the HDFS plugin.
• Transparent from the user's point of view.

Status
Injection SQL -> Spark: OK
Spark usage: OK
Re-injection: In alpha stage.

Re-injection
Two solutions:
• The Spark user notifies Bouquet that the data has changed (using a custom function)
• Bouquet pulls the data from Spark
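
The difference between the two options is essentially push versus pull. The sketch below contrasts them with in-memory stand-ins: `DatasetStore` plays the Spark side and `Reinjector` plays the Bouquet side. Both classes and all their methods are hypothetical and only illustrate the two patterns; they say nothing about how the alpha-stage re-injection is actually implemented.

```python
class DatasetStore:
    """In-memory stand-in for the enriched dataset held on the Spark side."""

    def __init__(self):
        self.version = 0
        self.rows = []
        self._listeners = []

    def subscribe(self, callback):
        self._listeners.append(callback)

    def update(self, rows):
        self.rows = list(rows)
        self.version += 1
        # Push: notify every subscriber that the data has changed.
        for callback in self._listeners:
            callback(self.version, self.rows)


class Reinjector:
    """Stand-in for the Bouquet side, which writes rows back to the database."""

    def __init__(self):
        self.seen_version = 0
        self.written = []

    def on_change(self, version, rows):
        self.seen_version = version
        self.written.append(list(rows))

    def poll(self, store):
        # Pull: fetch the rows only when the store holds a newer version.
        if store.version > self.seen_version:
            self.on_change(store.version, store.rows)
            return True
        return False


store = DatasetStore()
pushed = Reinjector()
pulled = Reinjector()

store.subscribe(pushed.on_change)   # option 1: the Spark side notifies
store.update([("female", 12)])

first_poll = pulled.poll(store)     # option 2: the Bouquet side pulls
second_poll = pulled.poll(store)    # nothing new, so nothing is fetched
```

Push reacts immediately but requires the Spark user to call the notification function; pull needs no cooperation from the Spark side but only sees changes at the next poll.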

We use it for real!
We are collaborating with La Poste so that they can use Spark and the
re-injection mechanism together with Bouquet and a geographical
visualisation.

In the future
• Notebook integration
• We have a DSL for the Bouquet API; we may want built-in Spark support.
• Improve scalability (bulk unload and Kafka fine-tuning)

Bouquet Architecture
[Diagram labels: DB, HD, SQL, DATA, JDBC, Bouquet Server (Dynamic Caching & Indexing, Business Modeling, Multi-Tenant), REDIS, Elastic, MongoDB, REST API, OAuth2, JS/SDK, Generic Apps, Custom Apps]

Questions?
openbouquet.io
