Data Science with the Help of Metadata
Jim Dowling
Associate Prof @ KTH
Senior Researcher @ SICS
CEO @ Logical Clocks AB
www.hops.io
@hopshadoop
Metadata for Source Code
•Metadata for Source Code
- Enables questions like: who, when, what, why?
•Metadata for Automation
- Enables testing, quality-control, deployment.
•Metadata for Collaboration
- Github projects, teams
Metadata for Datasets?
•Access Control
•Data provenance
•Auditing
•Development
- Schema for the dataset
- How can I load/download this dataset?
- Quality control
Metadata can simplify development
from pyspark.sql import HiveContext

# With Hive, the table schema comes from the metastore.
sqlContext = HiveContext(sc)
f1_df = sqlContext.sql("""
    SELECT id, count(*) AS nb_entries
    FROM my_db.log
    WHERE ts = '20160515'
    GROUP BY id
""")
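from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType

# Without Hive, the schema has to be declared by hand.
sqlContext = SQLContext(sc)
# Assumption: fields are tab-separated; adjust the delimiter to the log format.
f0 = sc.textFile('logfile').map(lambda line: line.split('\t'))
fpFields = [
    StructField('ts', StringType(), True),
    StructField('id', StringType(), True),
    StructField('it', StringType(), True)
]
fpSchema = StructType(fpFields)
df_f0 = sqlContext.createDataFrame(f0, fpSchema)
df_f0.registerTempTable('log')
f1_df = sqlContext.sql("""
    SELECT log.id, count(*) AS nb_entries
    FROM log WHERE ts = '20160515'
    GROUP BY log.id
""")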
SparkSQL
Hive-on-Spark
Hive is Metadata for HDFS files
Metadata for Files/Directories in HDFS
Add Schemas using
the Filesystem API
Add auditing using
the FSImage API
Add access control using
a Filesystem Plugin
Access Control in Hadoop
hdfs dfs -chmod -R 000 /apps/hive
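To make the effect of mode `000` concrete, here is a minimal sketch using Python's standard `stat` module to render an octal mode as the familiar permission string. The helper name `describe_mode` is made up for this example, and it only illustrates the owner/group/other permission bits, not HDFS itself.

```python
import stat

def describe_mode(octal_mode: str, is_dir: bool = True) -> str:
    """Render an octal permission string such as '000' the way `ls -l` would."""
    bits = int(octal_mode, 8)
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | bits)

# Mode 000 grants no permission bits to owner, group, or others.
print(describe_mode("000"))
print(describe_mode("750"))
```

With every bit cleared, access has to be granted by some other mechanism, which is the role the linked post assigns to an authorization plugin.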
[http://hortonworks.com/blog/best-practices-in-hdfs-authorization-with-apache-ranger]
Metadata Totem Poles in Hadoop
How do you ensure the consistency of the metadata and the data?
Why are the Metadata Services Siloed?
HDFS v2
DataNodes
HDFS Client
Journal Nodes Zookeeper
NameNode Standby
NameNode
Max 200 GB
metadata
YARN
NodeManagers
YARN Client
Zookeeper
ResourceMgr Standby
ResourceMgr
Metadata on the
JVM Heap Again
Hops: Distributed Metadata for Hadoop
HopsFS Architecture
NameNodes
NDB
Leader
HDFS Client
DataNodes
> 12 TB
> 2.6 X
Throughput
[HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases, Niazi et al., arXiv 2016]
HopsYARN Architecture
ResourceMgrs
NDB
Scheduler
YARN Client
NodeManagers
Resource Trackers
Leader Election for
Failed Scheduler
Up to 10K
Node Clusters
Experience Designing Metadata in Hops
Hops Metadata services
Elasticsearch
Database
[HDFS/YARN]
Kafka
Zookeeper
Metadata API
The Distributed Database is the Single Source-of-Truth for Metadata
Metadata for HDFS and YARN
Files
Directories
Containers
Provenance
Security
Quotas
Projects
Datasets
Metadata + Data in the same Database
2-phase commit (transactions)
Strong Consistency for Metadata.
Metadata Integrity maintained using 2PC and Foreign Keys.
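A minimal sketch of the idea above, assuming SQLite as a stand-in for the distributed database. It therefore shows a single-node transaction rather than a real two-phase commit across nodes, and the table and column names (`inodes`, `file_metadata`, `meta_key`, `meta_value`) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.execute("CREATE TABLE inodes (id INTEGER PRIMARY KEY, path TEXT UNIQUE NOT NULL)")
conn.execute("""
    CREATE TABLE file_metadata (
        inode_id INTEGER NOT NULL REFERENCES inodes(id) ON DELETE CASCADE,
        meta_key TEXT NOT NULL,
        meta_value TEXT NOT NULL
    )
""")

def add_file_with_metadata(path, metadata):
    """Insert a file and its metadata atomically: both commit or neither does."""
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute("INSERT INTO inodes (path) VALUES (?)", (path,))
        conn.executemany(
            "INSERT INTO file_metadata VALUES (?, ?, ?)",
            [(cur.lastrowid, k, v) for k, v in metadata.items()],
        )

def count(table):
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

add_file_with_metadata("/Projects/demo/log.csv", {"schema": "ts,id,it"})

# Metadata that points at a non-existent file is rejected by the foreign key.
try:
    with conn:
        conn.execute("INSERT INTO file_metadata VALUES (999, 'owner', 'alice')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# Deleting the file removes its metadata in the same transaction.
with conn:
    conn.execute("DELETE FROM inodes WHERE path = ?", ("/Projects/demo/log.csv",))
```

Because the metadata rows live next to the file rows, no separate cleanup job is needed to keep the two in step.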
Metadata in Elasticsearch
Files
Directories
Metadata
Search
Indexes
Database to Elasticsearch: one-way replication
Eventual Consistency for Metadata.
Metadata Integrity maintained by Asynchronous Replication.
[ePipe Tutorial, BOSS Workshop, VLDB 2016]
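One-way asynchronous replication can be pictured with a toy model: writes go to the database and are appended to an ordered change log, and a separate step later applies that log to the search index. This is only a sketch under those assumptions. The dictionaries and function names are invented here and do not reflect how ePipe is implemented.

```python
from collections import deque

database = {}         # source of truth: path -> metadata
search_index = {}     # replica, standing in for the search index
change_log = deque()  # ordered changes not yet applied to the replica

def db_put(path, metadata):
    database[path] = metadata
    change_log.append(("put", path, metadata))

def db_delete(path):
    database.pop(path, None)
    change_log.append(("delete", path, None))

def replicate(max_events=None):
    """Apply pending changes, in order, to the replica. Returns the number applied."""
    applied = 0
    while change_log and (max_events is None or applied < max_events):
        op, path, metadata = change_log.popleft()
        if op == "put":
            search_index[path] = metadata
        else:
            search_index.pop(path, None)
        applied += 1
    return applied

db_put("/a", {"owner": "alice"})
db_put("/b", {"owner": "bob"})
db_delete("/a")
stale = dict(search_index)  # the replica has not caught up yet
replicate()
```

Until `replicate()` runs, a search sees stale results. Once the log is drained, the index matches the database, which is what eventual consistency means here.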
Metadata for Kafka
Topics
Partitions
ACLs
Zookeeper/Kafka
Database
Eventual Consistency for Metadata.
Metadata integrity maintained by custom recovery logic and polling.
Metadata API
polling
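A polling loop of this kind can be sketched as a reconciliation step: compare the topics recorded in the database with the topics that actually exist, then create or delete the difference. The sketch assumes the database is authoritative, in line with the single source-of-truth slide. Plain Python sets stand in for both systems, and the function names are hypothetical.

```python
def reconcile(desired_topics, actual_topics):
    """Return (to_create, to_delete) so that actual matches desired."""
    to_create = sorted(desired_topics - actual_topics)
    to_delete = sorted(actual_topics - desired_topics)
    return to_create, to_delete

def poll_once(database_topics, kafka_topics):
    """One polling round: bring the (mutable) Kafka topic set in line with the database."""
    to_create, to_delete = reconcile(set(database_topics), set(kafka_topics))
    kafka_topics.update(to_create)
    kafka_topics.difference_update(to_delete)
    return to_create, to_delete

kafka = {"clicks", "orphan"}
first_round = poll_once({"clicks", "orders"}, kafka)
second_round = poll_once({"clicks", "orders"}, kafka)
```

A second round with nothing to do is the sign that the two sides have converged.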
Case Study: Self-Service Multi-Tenant Projects
Problem: Sensitive Data needs its own Cluster
NSA DataSet
User DataSet
Alice can copy/cross-link between data sets
Alice has only one Kerberos Identity.
Neither attribute-based access control nor dynamic roles are supported in Hadoop.
Alice
Solution: Project-Specific UserIDs
Project NSA
Project Users
Member of
NSA__Alice
Users__Alice
Member of
HDFS enforces
access control
How can we share DataSets between Projects?
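The naming scheme on this slide (`NSA__Alice`, `Users__Alice`) can be sketched as a pair of helpers. The double-underscore separator comes from the slide. The function names and the rule that names may not contain the separator are assumptions made for this example.

```python
SEPARATOR = "__"

def project_user(project: str, user: str) -> str:
    """Build the project-specific user ID, e.g. ('NSA', 'Alice') -> 'NSA__Alice'."""
    if SEPARATOR in project or SEPARATOR in user:
        raise ValueError("project and user names must not contain '__'")
    return f"{project}{SEPARATOR}{user}"

def split_project_user(name: str):
    """Recover (project, user) from a project-specific user ID."""
    project, user = name.split(SEPARATOR, 1)
    return project, user

print(project_user("NSA", "Alice"))
print(project_user("Users", "Alice"))
```

Since each ID is an ordinary user as far as the file system is concerned, ordinary file permissions are enough to keep the two projects apart.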
Sharing DataSets between Projects
Project NSA
Project Users
Member of
DataSet
owns
Add members of Project
NSA to the DataSet group
NSA__Alice
Users__Alice
Member of
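Sharing by group membership can be sketched as follows: each project and each DataSet has a group, and sharing a DataSet with another project adds that project's members to the DataSet's group. The dictionary layout and function names are invented for illustration, and the group name `Users__DataSet` is hypothetical.

```python
groups = {}  # group name -> set of project-specific user IDs

def add_member(project, user):
    """Add a user to a project's group under their project-specific ID."""
    groups.setdefault(project, set()).add(f"{project}__{user}")

def share_dataset(dataset_group, with_project):
    """Give every current member of `with_project` access to the DataSet."""
    groups.setdefault(dataset_group, set()).update(groups.get(with_project, set()))

def can_access(user_id, dataset_group):
    return user_id in groups.get(dataset_group, set())

add_member("NSA", "Alice")
add_member("Users", "Alice")
share_dataset("Users__DataSet", "Users")   # the owning project's own members
before = can_access("NSA__Alice", "Users__DataSet")
share_dataset("Users__DataSet", "NSA")     # share with Project NSA
after = can_access("NSA__Alice", "Users__DataSet")
```

The sketch copies membership at the moment of sharing. Keeping the group in step with later membership changes would need extra logic that is not shown.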
HopsWorks (WebApp) Enforces Dynamic Roles
Alice@gmail.com
NSA__Alice
Authenticate
Users__Alice
HopsWorks
HopsFS
HopsYARN
Projects
Secure
Impersonation
Kafka
X.509
Certificates
X.509 Certificate Per Project-Specific User
Alice@gmail.com
Authenticate
Add/Del
Users
Distributed
Database
Insert/Remove Certs
Project Mgr
Root
CA
Services
Hadoop
Spark
Kafka
etc
Cert Signing
Requests
Project
•A project is a collection of
- Members
- HDFS DataSets
- Kafka Topics
- Notebooks, Jobs
•A project has an owner
•A project has quotas
project
dataset 1
dataset N
Topic 1
Topic N
Kafka
HDFS
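The project abstraction described on this slide maps naturally onto a small record type. The sketch below is one possible shape. The field names and quota units are assumptions, not the Hopsworks data model.

```python
from dataclasses import dataclass, field

@dataclass
class Project:
    name: str
    owner: str
    members: set = field(default_factory=set)
    datasets: list = field(default_factory=list)   # HDFS DataSets
    topics: list = field(default_factory=list)     # Kafka topics
    notebooks: list = field(default_factory=list)
    jobs: list = field(default_factory=list)
    yarn_cpu_quota_seconds: float = 0.0            # assumed unit
    hdfs_storage_quota_bytes: int = 0              # assumed unit

    def __post_init__(self):
        # Assumption: the owner is always also a member of the project.
        self.members.add(self.owner)

demo = Project(name="demo", owner="alice")
demo.datasets.append("dataset 1")
demo.topics.append("Topic 1")
```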
Project Roles
Data Owner Privileges
- Import/Export data
- Manage Membership
- Share DataSets, Topics
Data Scientist Privileges
- Write and Run code
We delegate administration of privileges to users
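The two roles can be sketched as a lookup from role to allowed actions. The action names are invented labels for the privileges listed above. The sketch also assumes that a Data Owner can do everything a Data Scientist can, which the slide does not state.

```python
from enum import Enum

class Role(Enum):
    DATA_OWNER = "Data Owner"
    DATA_SCIENTIST = "Data Scientist"

SCIENTIST_PRIVILEGES = frozenset({"write_code", "run_code"})
# Assumption: the Data Owner role is a superset of the Data Scientist role.
OWNER_PRIVILEGES = SCIENTIST_PRIVILEGES | {
    "import_export_data",
    "manage_membership",
    "share_datasets_topics",
}

PRIVILEGES = {
    Role.DATA_SCIENTIST: SCIENTIST_PRIVILEGES,
    Role.DATA_OWNER: OWNER_PRIVILEGES,
}

def is_allowed(role: Role, action: str) -> bool:
    """Return True if the given role may perform the action."""
    return action in PRIVILEGES[role]
```

For example, `is_allowed(Role.DATA_SCIENTIST, "import_export_data")` is false, so a data scientist can analyze data inside the project but cannot move it out.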
Elastic Hadoop
Each Project has a:
• YARN CPU Quota
• HDFS Storage Quota
Uber-Style Pricing to
incentivize cluster usage
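A hedged sketch of how a CPU quota and a variable price might interact: usage is charged against the project's quota at a multiplier chosen by the caller. The class, the units, and the multiplier are all hypothetical. The slides do not say how prices are actually computed.

```python
class QuotaExceeded(Exception):
    pass

class ProjectQuota:
    def __init__(self, cpu_seconds: float):
        self.cpu_seconds = cpu_seconds  # remaining quota

    def charge(self, used_cpu_seconds: float, price_multiplier: float = 1.0) -> float:
        """Deduct usage at the given price; raise if the quota cannot cover it."""
        cost = used_cpu_seconds * price_multiplier
        if cost > self.cpu_seconds:
            raise QuotaExceeded(f"need {cost}, have {self.cpu_seconds}")
        self.cpu_seconds -= cost
        return cost

quota = ProjectQuota(cpu_seconds=100)
off_peak_cost = quota.charge(40, price_multiplier=0.5)  # cheaper when the cluster is idle
```

Under this model the same job costs less quota when the multiplier is low, which nudges users toward quiet periods.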
Sharing DataSets/Topics between Projects
The same as Sharing Folders in Dropbox
Added Multi-Tenancy to Zeppelin
www.hops.site
A 2 MW datacenter research and test environment
5 lab modules, planned up to 3,000-4,000 servers, 2,000-3,000 square meters
[Slide by Prof. Tor Björn Minde, CEO SICS North Swedish ICT AB]
Demo
Status and Upcoming
•Automated installation support using Vagrant/Chef
or Karamel/Chef
•First official release of Hopsworks coming soon
•Globally shared datasets with peer-to-peer
technology, backed by our data center.
•Support for Apache Beam
Summing Up
Metadata services have the potential to make your
life easier as a Data Scientist
Most Hadoop Metadata services are proprietary and
require an administrator-in-the-loop
Hops provides an open, tinker-friendly platform for
building consistent metadata
Hopsworks shows how you can leverage metadata to
build a self-service project-based model for
Hadoop/Spark/Flink applications
The Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde,
Gautier Berthou, Salman Niazi, Mahmoud Ismail,
Theofilos Kakantousis, Johan Svedlund Nordström,
Vasileios Giannokostas, Ermias Gebremeskel,
Antonios Kouzoupis, Misganu Dessalegn, Rizvi Hasan,
Paul Mälzer, Bram Leenders, Juan Roca.
Alumni: K. “Sri” Srijeyanthan, Steffen Grohsschmiedt,
Alberto Lorente, Andre Moré, Ali Gholami,
Stig Viaene, Hooman Peiro, Evangelos Savvidis,
Jude D’Souza, Qi Qi, Gayana Chandrasekara,
Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos,
Peter Buechler, Pushparaj Motamari, Hamid Afzali,
Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Join us!
http://github.com/hopshadoop
www.hops.io
@hopshadoop
