Building Data Pipelines with the Kite SDK
Joey Echeverria // Software Engineer
Problem
Hadoop
©2015 Cloudera, Inc. All rights reserved.
Logs
[Diagram: many Apache HTTPD servers, each with log files or syslog on local disk; Kafka and Flume; HDFS]
RDBMS
[Diagram: RDBMS, Sqoop, HDFS]
Sea of text files
[Diagram: a grid of CSV files]
A note on Hadoop
Hadoop
• Technically:
– HDFS, YARN, MapReduce
• Hadoop ecosystem:
– Hadoop, HBase, Flume, Sqoop, Kafka, Oozie, Hive, Impala, Pig, Crunch, Spark, etc.
– I’ll also call this just “Hadoop”
Introduction to the Kite SDK
Data
• Hadoop is all about data
• Bring all of your data to one platform
• Access data using the best engine for your use case
Open source core
• Hadoop ecosystem built from open source components
• Benefits:
– Shared investments
– No vendor lock-in
– Fast evolution
• Costs:
– APIs tend to be low-level
– Integration is ad-hoc
Storage APIs
• HDFS
– Filesystem
• HBase
– Byte array keys -> byte array values
Relational systems
[Diagram: Application (user code), JDBC Driver (provided), Database, Data files (maintained by the database)]
Hadoop without Kite
[Diagram: Application, JDBC Driver, Database, Data files, HBase, User code]
Hadoop with Kite
[Diagram: Application, JDBC Driver, Database, Kite, Data files, HBase; data files and HBase are labeled as maintained by Kite]
Kite
• Kite is the data API for the Hadoop ecosystem
• Kite makes it easy to put your data into Hadoop and to use it once it’s there.
Abstractions
• Data is stored in datasets
• Datasets are made up of entities
• Related datasets are grouped into namespaces
Datasets
• A collection of entities/records
– Like a relational database table
• Data types and field names defined by an Avro schema
• Identified by URI
– dataset:hdfs:/datasets/movie/ratings
– dataset:hive:movie/ratings
– dataset:hbase:zk1,zk2,zk3/ratings
Entities
• A single record in a dataset
– Think row in a relational database table
• Entities can be complex and nested
– Avro compiled objects
– Avro generic objects
– Plain old Java objects (POJOs)
Namespaces
• Namespaces group related datasets
– Think database or schema in a relational system
• Dataset names are unique within the same namespace
Dataset URIs
Scheme    Pattern                                           Example
Hive      dataset:hive:<namespace>/<dataset-name>           dataset:hive:movielens/movies
HDFS      dataset:hdfs:/<path>/<namespace>/<dataset-name>   dataset:hdfs:/datasets/movielens/movies
Local FS  dataset:file:/<path>/<namespace>/<dataset-name>   dataset:file:/tmp/data/movielens/movies
HBase     dataset:hbase:<zookeeper-hosts>/<dataset-name>    dataset:hbase:zoo-1,zoo-2,zoo-3/movies
• Hive URIs accept an optional location parameter for external tables
– dataset:hive:movielens/movies?location=/datasets/movielens/movies
• HDFS URIs accept an optional nameservice and host
– dataset:hdfs://namenode:8020/datasets/movielens/movies
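To make the URI patterns above concrete, here is a minimal sketch that splits a dataset URI into its storage scheme and dataset name. It uses only the Java standard library and is not the Kite API: the class DatasetUriSketch and its methods are hypothetical names, and the rule "the dataset name is the last path segment" is an assumption read off the patterns in the table.

```java
/** Illustrative only: splits a Kite-style dataset URI; not part of the Kite SDK. */
final class DatasetUriSketch {
    static final String PREFIX = "dataset:";

    /** Returns the storage scheme (for example hive, hdfs, file or hbase). */
    static String scheme(String uri) {
        if (!uri.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a dataset URI: " + uri);
        }
        String rest = uri.substring(PREFIX.length());
        int colon = rest.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("Missing scheme: " + uri);
        }
        return rest.substring(0, colon);
    }

    /** Returns the last path segment, ignoring any ?query parameters (assumed rule). */
    static String datasetName(String uri) {
        String scheme = scheme(uri);
        // Skip "dataset:", the scheme, and the colon that follows it.
        String rest = uri.substring(PREFIX.length() + scheme.length() + 1);
        int query = rest.indexOf('?');
        if (query >= 0) {
            rest = rest.substring(0, query);
        }
        return rest.substring(rest.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        String uri = "dataset:hive:movielens/movies?location=/datasets/movielens/movies";
        System.out.println(scheme(uri) + " / " + datasetName(uri));
    }
}
```

In real code the Kite library resolves these URIs itself; the sketch only shows how the scheme selects the storage backend while the rest of the URI locates the dataset.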
What Kite isn’t
• Ingestion framework
– Integrates with Sqoop, Flume, and Kafka; doesn’t replace them
• ETL tool
– Basic command-line tool
– Complete ETL tools can build on Kite
• Processing language
– SQL, Crunch, MapReduce, Spark, Pig, etc.
Ingest integration
• Flume
– Stream log events directly into Kite datasets
• Sqoop
– Ingest relational database tables into Kite datasets
• Kafka
– Integration is through Flafka (Flume/Kafka integration)
Data processing integration
• MapReduce
– Input/OutputFormats
• Crunch
– Source and target
• Spark
– Use Input/OutputFormats to convert datasets to RDDs
• Impala, Hive, Pig
– Use underlying file format support
What does Kite do for you?
• Codifies best practices
• Interoperability
• Shields you from Hadoop, Hive, etc. version changes
• Get up and running faster
Open source
• Kite is Apache 2.0 licensed
• Hosted on GitHub
• Compatibility:
– Tested against upstream Apache Hadoop 1.0 and 2.3 as well as CDH4/5
• Contributors:
– Cloudera, Cerner, Capital One, Intel, Pivotal
• Distributions:
– Cloudera, Hortonworks, Pivotal, MapR
Resources
• Site
– http://kitesdk.org
• Kite guide
– http://tiny.cloudera.com/KiteGuide
• Data module overview
– http://tiny.cloudera.com/Datasets
• Command-line interface tutorial
– http://tiny.cloudera.com/KiteCLI
• Kite examples
– https://github.com/kite-sdk/kite-examples
Using Kite
Architecture
[Diagram: the Kite CLI infers an Avro schema from CSV, creates a dataset from that schema, and loads the dataset into HDFS; Crunch processes it and writes to HDFS; Impala produces the report]
Dataset schemes
• Pluggable dataset interface with multiple schemes
• Schemes determine underlying storage mechanism and metadata provider
• HDFS
– Data stored in HDFS directories
– Metadata stored in an Avro schema file and a Java properties file in the dataset directory
• Hive
– Data stored in HDFS directories
– Metadata stored in Hive metastore
• HBase
– Data and metadata
Which scheme?
• HDFS
– Best for raw data and intermediate data in an ETL pipeline
– No SQL access
• Hive
– Best for data that is ready for query or SQL ETL
– No performance difference between Hive and HDFS-backed datasets
• HBase
– Best for online serving applications
– Provides sorted keys
– Optimistic concurrency control
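The "optimistic concurrency control" bullet can be illustrated with a generic sketch: a writer remembers the version it read and its update succeeds only if that version is still current. This is not how Kite or HBase implements the feature. It is a standard-library illustration built on ConcurrentHashMap, and the names OptimisticUpdateSketch, Versioned and update are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: version-checked updates on an in-memory map. */
final class OptimisticUpdateSketch {
    /** An immutable value tagged with the version at which it was written. */
    static final class Versioned {
        final long version;
        final String value;

        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    /**
     * Writes newValue only if the stored entry still has expectedVersion.
     * Returns false when another writer got there first, so the caller can
     * re-read the entry and retry.
     */
    static boolean update(ConcurrentHashMap<String, Versioned> store,
                          String key, long expectedVersion, String newValue) {
        Versioned current = store.get(key);
        if (current == null || current.version != expectedVersion) {
            return false;
        }
        // replace() is atomic and succeeds only if the key still maps to 'current'.
        return store.replace(key, current, new Versioned(expectedVersion + 1, newValue));
    }

    public static void main(String[] args) {
        ConcurrentHashMap<String, Versioned> store = new ConcurrentHashMap<>();
        store.put("movie-1", new Versioned(0, "draft"));
        System.out.println(update(store, "movie-1", 0, "first edit"));
        System.out.println(update(store, "movie-1", 0, "stale edit"));
    }
}
```

No lock is held between the read and the write; conflicts are detected at write time, which suits online serving workloads where most updates do not collide.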
Dataset formats
• Physical serialization format
• Avro
– Row-based storage format with schemas and compression
• Parquet
– Column-based storage format optimized for query access
• CSV
– Read-only format
– Used by ETL jobs to read raw data files
Avro
[Diagram: row-based layout, with items numbered 1 to 7]
Parquet
[Diagram: column-based layout, with items labeled a to j]
When to choose which format
• Avro
– Access all fields of a record at the same time
– Intermediate/non-long-lived data
• Parquet
– Access subset of fields/columns at a time
– SQL tables (Impala/Hive)
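The row-versus-column trade-off can be sketched with plain arrays. This does not use Avro or Parquet and says nothing about their file layouts on disk. It only shows why reading one field is cheaper when each field's values are stored together. The class LayoutSketch and the sample ratings are hypothetical.

```java
/** Illustrative only: contrasts row-oriented and column-oriented layouts in memory. */
final class LayoutSketch {
    // Hypothetical ratings: each inner array is one record {userId, movieId, stars}.
    static final long[][] ROWS = { {1, 10, 5}, {2, 10, 3}, {3, 11, 4} };

    /** Transposes row-oriented records into one array per field. */
    static long[][] toColumns(long[][] rows) {
        int fieldCount = rows[0].length;
        long[][] columns = new long[fieldCount][rows.length];
        for (int r = 0; r < rows.length; r++) {
            for (int f = 0; f < fieldCount; f++) {
                columns[f][r] = rows[r][f];
            }
        }
        return columns;
    }

    /** Row layout: every record is visited even though only one field is needed. */
    static long sumFieldFromRows(long[][] rows, int field) {
        long sum = 0;
        for (long[] record : rows) {
            sum += record[field];
        }
        return sum;
    }

    /** Column layout: only the array holding the requested field is visited. */
    static long sumFieldFromColumns(long[][] columns, int field) {
        long sum = 0;
        for (long value : columns[field]) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[][] columns = toColumns(ROWS);
        System.out.println(sumFieldFromRows(ROWS, 2));
        System.out.println(sumFieldFromColumns(columns, 2));
    }
}
```

Both methods return the same total. The difference is how much data has to be touched to get it, which is why a query over a subset of columns favors the column-based format.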
Compression type
• Uncompressed
– Nope. Nope. Nope. Nope.
• Snappy
– Default
– Balances compression ratio and speed
– Fastest for query
• Deflate/gzip
– Good for archived/infrequently accessed data
– Slow writes, decent read performance
Configuration
• Schema
– Record fields, like a table definition
Demo
• Demo schema inference/generation
Configuration
• Schema
– Record fields, like a table definition
• Partition strategy
– Physical layout/storage key definition
Partitioning
• Map entity fields to partitions
• Unlike Hive, partitions are tied to per-entity data
• Common partition types: values, hashes, timestamp parsing
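As a rough illustration of "timestamp parsing" partitioning, the sketch below derives a year/month/day partition path from a timestamp field on the entity itself. It is not Kite code. The class PartitionPathSketch is hypothetical, the Hive-style "name=value" directory naming is an assumption, and so is the use of UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Illustrative only: maps an entity's timestamp field to a partition path. */
final class PartitionPathSketch {
    /**
     * Builds a relative path such as year=YYYY/month=MM/day=DD from a
     * timestamp in milliseconds since the epoch, interpreted in UTC.
     */
    static String partitionPath(long epochMillis) {
        ZonedDateTime time = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                time.getYear(), time.getMonthValue(), time.getDayOfMonth());
    }

    public static void main(String[] args) {
        // The partition comes from the record's own data, not from a value
        // supplied separately at load time.
        System.out.println(partitionPath(System.currentTimeMillis()));
    }
}
```

Because the path is a pure function of the entity's fields, a writer can route each record to its partition without the caller naming the partition explicitly.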
Demo
• Demo partition definition
Command-line interface
• Experiment before understanding
• Creates configuration files
• Handles dataset lifecycle
– create, update, delete
• Basic ETL tasks
– copy datasets
– transform individual records
• Import CSV
Example
1. Describe your data
kite-dataset obj-schema org.grouplens.Rating \
  --jar group-lens-1.0.jar -o rating.avsc
2. Describe your layout
kite-dataset partition-config ts:year ts:month ts:day \
  --schema rating.avsc -o ymd.json
3. Create a dataset
kite-dataset create ratings --schema rating.avsc \
  --partition-by ymd.json
Command-line interface
• Two packages
– Standalone for on-cluster use
– Tarball with dependencies for remote access (CDH5-only)
• Environment variables
– HIVE_HOME, HIVE_CONF_DIR, HBASE_HOME, HADOOP_MAPRED_HOME, HADOOP_COMMON_HOME
• Debug environment
– debug=true ./kite-dataset <command>
• Verbose output
– ./kite-dataset -v <command>
Demo
• Demo dataset creation with the CLI
• Demo dataset loading with the CLI
Maven parent POM
• Consolidated Kite and Hadoop dependencies
• To use:
– Set kite-app-parent-cdh4 or kite-app-parent-cdh5 as your project’s parent POM
<parent>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-app-parent-cdh5</artifactId>
  <version>0.17.1</version>
</parent>
Demo
• Demo Maven project using Kite parent POM
Crunch
• Java dataflow API
• Runs pipelines in memory, MapReduce, or Spark
• Parallel collections
Use Crunch with Kite
• CrunchDatasets helper class
– CrunchDatasets.asSource(View view)
– CrunchDatasets.asTarget(View view)
• Supports Crunch write modes: default, overwrite and append
PCollection<Movie> movies = getPipeline().read(
    CrunchDatasets.asSource("dataset:hive:movies", Movie.class));
• Re-partition data before writing
PCollection<Movie> partitionedMovies = CrunchDatasets.partition(
    movies, targetDataset);
Demo
• Demo Crunch processing on Kite
Impala
• Massively parallel processing (MPP) database
• SQL
• Distributed
• Fast
Demo
• Demo querying a Kite dataset with Impala
Architecture
[Diagram: the Kite CLI infers an Avro schema from CSV, creates a dataset from that schema, and loads the dataset into HDFS; Crunch processes it and writes to HDFS; Impala produces the report]
Thank you
Application architectures with hadoop – big data techcon 2014
Jonathan Seidman
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
hadooparchbook
 
Building Hadoop Data Applications with Kite
Building Hadoop Data Applications with KiteBuilding Hadoop Data Applications with Kite
Building Hadoop Data Applications with Kite
huguk
 
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA
 
Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016Introduction to Kudu - StampedeCon 2016
Introduction to Kudu - StampedeCon 2016
StampedeCon
 
Application Architectures with Hadoop
Application Architectures with HadoopApplication Architectures with Hadoop
Application Architectures with Hadoop
hadooparchbook
 
Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015Introduction to HBase - NoSqlNow2015
Introduction to HBase - NoSqlNow2015
Apekshit Sharma
 
What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?What is New in Apache Hive 3.0?
What is New in Apache Hive 3.0?
DataWorks Summit
 
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019Hive 3 New Horizons DataWorks Summit Melbourne February 2019
Hive 3 New Horizons DataWorks Summit Melbourne February 2019
alanfgates
 
What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?What is new in Apache Hive 3.0?
What is new in Apache Hive 3.0?
DataWorks Summit
 

Building data pipelines with Kite

  • 1. Building Data Pipelines with the Kite SDK Joey Echeverria // Software Engineer
  • 4. Hadoop ©2015 Cloudera, Inc. All rights reserved.
  • 5. Logs [diagram: many Apache HTTPD servers, each writing log files or syslog to local disk; Kafka; Flume; HDFS]
  • 6. RDBMS [diagram: RDBMS, Sqoop, HDFS]
  • 7. Sea of text files [diagram: a grid of CSV files]
  • 8. A note on Hadoop
  • 9. Hadoop • Technically: – HDFS, YARN, MapReduce • Hadoop ecosystem: – Hadoop, HBase, Flume, Sqoop, Kafka, Oozie, Hive, Impala, Pig, Crunch, Spark, etc. – I’ll also call this just “Hadoop”
  • 10. Introduction to the Kite SDK
  • 11. Data • Hadoop is all about data • Bring all of your data to one platform • Access data using the best engine for your use case
  • 12. Open source core • Hadoop ecosystem built from open source components • Benefits: – Shared investments – No vendor lock-in – Fast evolution • Costs: – APIs tend to be low-level – Integration is ad hoc
  • 13. Storage APIs • HDFS – Filesystem • HBase – Byte array keys -> byte array values
  • 14. Relational systems [diagram labels: Application, JDBC Driver, Database, Data files; legend: User code, Provided, Maintained by the database]
  • 15. Hadoop without Kite [diagram labels: Application, JDBC Driver, Database, Data files, HBase; legend: User code]
  • 16. Hadoop with Kite [diagram labels: Application, JDBC Driver, Kite, Database, Data files, HBase; legend: Maintained by Kite]
  • 17. Kite • Kite is the data API for the Hadoop ecosystem • Kite makes it easy to put your data into Hadoop and to use it once it’s there.
  • 18. Abstractions • Data is stored in datasets • Datasets are made up of entities • Related datasets are grouped into namespaces
  • 19. Datasets • A collection of entities/records – Like a relational database table • Data types and field names defined by an Avro schema • Identified by URI – dataset:hdfs:/datasets/movie/ratings – dataset:hive:movie/ratings – dataset:hbase:zk1,zk2,zk3/ratings
  • 20. Entities • A single record in a dataset – Think row in a relational database table • Entities can be complex and nested – Avro compiled objects – Avro generic objects – Plain old Java objects (POJOs)
  • 21. Namespaces • Namespaces group related datasets – Think database or schema in a relational system • Dataset names are unique within the same namespace
  • 22. Dataset URIs
      Scheme | Pattern | Example
      Hive | dataset:hive:<namespace>/<dataset-name> | dataset:hive:movielens/movies
      HDFS | dataset:hdfs:/<path>/<namespace>/<dataset-name> | dataset:hdfs:/datasets/movielens/movies
      Local FS | dataset:file:/<path>/<namespace>/<dataset-name> | dataset:file:/tmp/data/movielens/movies
      HBase | dataset:hbase:<zookeeper-hosts>/<dataset-name> | dataset:hbase:zoo-1,zoo-2,zoo-3/movies
      • Hive URIs accept an optional location parameter for external tables – dataset:hive:movielens/movies?location=/datasets/movielens/movies
      • HDFS URIs accept an optional nameservice and host – dataset:hdfs://namenode:8020/datasets/movielens/movies
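The URI patterns on slide 22 can be picked apart with plain string handling. The sketch below is only an illustration of the pattern `dataset:<scheme>:<location>/<dataset-name>`: `DatasetUriSketch` is a hypothetical helper, not part of the Kite API, and Kite's own URI handling may well differ (for example, it presumably interprets the `location` query parameter, which this sketch simply drops).

```java
/**
 * Hypothetical helper, NOT part of the Kite API: splits a dataset URI of the
 * form dataset:<scheme>:<location>/<dataset-name>[?query] into its parts.
 */
final class DatasetUriSketch {
    final String scheme;    // hive, hdfs, file, or hbase
    final String location;  // namespace, path, or ZooKeeper host list
    final String name;      // dataset name (last path segment)

    private DatasetUriSketch(String scheme, String location, String name) {
        this.scheme = scheme;
        this.location = location;
        this.name = name;
    }

    static DatasetUriSketch parse(String uri) {
        final String prefix = "dataset:";
        if (!uri.startsWith(prefix)) {
            throw new IllegalArgumentException("Not a dataset URI: " + uri);
        }
        String rest = uri.substring(prefix.length());
        int colon = rest.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("Missing scheme: " + uri);
        }
        String scheme = rest.substring(0, colon);
        String path = rest.substring(colon + 1);
        // Drop an optional query string such as ?location=... (assumption: ignored here).
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        int slash = path.lastIndexOf('/');
        if (slash < 0 || slash == path.length() - 1) {
            throw new IllegalArgumentException("Missing dataset name: " + uri);
        }
        return new DatasetUriSketch(scheme, path.substring(0, slash), path.substring(slash + 1));
    }

    public static void main(String[] args) {
        String[] uris = {
            "dataset:hive:movielens/movies",
            "dataset:hdfs:/datasets/movielens/movies",
            "dataset:hbase:zoo-1,zoo-2,zoo-3/movies"
        };
        for (String uri : uris) {
            DatasetUriSketch d = parse(uri);
            System.out.println(d.scheme + " | " + d.location + " | " + d.name);
        }
    }
}
```

Keeping the scheme separate from the location mirrors the table: the scheme selects the storage and metadata provider, while the rest of the URI says where the dataset lives.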
  • 23. What Kite isn’t • Ingestion framework – Integrates with Sqoop, Flume, and Kafka; doesn’t replace them • ETL tool – Basic command-line tool – Complete ETL tools can build on Kite • Processing language – SQL, Crunch, MapReduce, Spark, Pig, etc.
  • 24. Ingest integration • Flume – Stream log events directly into Kite datasets • Sqoop – Ingest relational database tables into Kite datasets • Kafka – Integration is through Flafka (Flume/Kafka integration)
  • 25. Data processing integration • MapReduce – Input/OutputFormats • Crunch – Source and target • Spark – Use Input/OutputFormats to convert datasets to RDDs • Impala, Hive, Pig – Use underlying file format support
  • 26. What does Kite do for you? • Codifies best practices • Interoperability • Shields you from Hadoop, Hive, etc. version changes • Get up and running faster
  • 27. Open source • Kite is Apache 2.0 licensed • Hosted on GitHub • Compatibility: – Tested against upstream Apache Hadoop 1.0 and 2.3 as well as CDH4/5 • Contributors: – Cloudera, Cerner, Capital One, Intel, Pivotal • Distributions: – Cloudera, Hortonworks, Pivotal, MapR
  • 28. Resources • Site – http://kitesdk.org • Kite guide – http://tiny.cloudera.com/KiteGuide • Data module overview – http://tiny.cloudera.com/Datasets • Command-line interface tutorial – http://tiny.cloudera.com/KiteCLI • Kite examples – https://github.com/kite-sdk/kite-examples
  • 29. Using Kite
  • 30. Architecture [diagram labels: CSV, Schema, Kite CLI, HDFS, Crunch, Impala, Report; steps: infer Avro schema, create dataset, load dataset]
  • 31. Dataset schemes • Pluggable dataset interface with multiple schemes • Schemes determine underlying storage mechanism and metadata provider • HDFS – Data stored in HDFS directories – Metadata stored in an Avro schema file and a Java properties file in the dataset directory • Hive – Data stored in HDFS directories – Metadata stored in Hive metastore • HBase – Data and metadata
  • 32. Which scheme? • HDFS – Best for raw data and intermediate data in an ETL pipeline – No SQL access • Hive – Best for data that is ready for query or SQL ETL – No performance difference between Hive and HDFS-backed datasets • HBase – Best for online serving applications – Provides sorted keys – Optimistic concurrency control
  • 33. Dataset formats • Physical serialization format • Avro – Row-based storage format with schemas and compression • Parquet – Column-based storage format optimized for query access • CSV – Read-only format – Used by ETL jobs to read raw data files
  • 34. Avro [diagram: cells numbered 1 to 7]
  • 35. Parquet [diagram: cells labelled a to j]
  • 36. When to choose which format • Avro – Access all fields of a record at the same time – Intermediate/non-long-lived data • Parquet – Access subset of fields/columns at a time – SQL tables (Impala/Hive)
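The Avro/Parquet choice on slide 36 comes down to row-oriented versus column-oriented layout. The following is a minimal in-memory sketch of that idea only: it does not use Avro or Parquet, and the `Rating` class and its fields are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class LayoutSketch {
    /** Row-oriented: all fields of one record sit together (the Avro-style case). */
    static final class Rating {
        final long userId;
        final long movieId;
        final int stars;

        Rating(long userId, long movieId, int stars) {
            this.userId = userId;
            this.movieId = movieId;
            this.stars = stars;
        }
    }

    /** Column-oriented: each field is stored contiguously (the Parquet-style case). */
    static final class RatingColumns {
        final long[] userIds;
        final long[] movieIds;
        final int[] stars;

        RatingColumns(List<Rating> rows) {
            int n = rows.size();
            userIds = new long[n];
            movieIds = new long[n];
            stars = new int[n];
            for (int i = 0; i < n; i++) {
                Rating r = rows.get(i);
                userIds[i] = r.userId;
                movieIds[i] = r.movieId;
                stars[i] = r.stars;
            }
        }
    }

    /** Walks whole records even though only one field is needed. */
    static double averageStarsFromRows(List<Rating> rows) {
        long sum = 0;
        for (Rating r : rows) {
            sum += r.stars;
        }
        return rows.isEmpty() ? 0.0 : (double) sum / rows.size();
    }

    /** Touches only the stars column; the other columns are never read. */
    static double averageStarsFromColumns(RatingColumns cols) {
        long sum = 0;
        for (int s : cols.stars) {
            sum += s;
        }
        return cols.stars.length == 0 ? 0.0 : (double) sum / cols.stars.length;
    }

    public static void main(String[] args) {
        List<Rating> rows = Arrays.asList(
            new Rating(1, 10, 4), new Rating(2, 10, 2), new Rating(3, 11, 3));
        System.out.println(averageStarsFromRows(rows));
        System.out.println(averageStarsFromColumns(new RatingColumns(rows)));
    }
}
```

Both methods return the same answer; the difference is how much data each has to touch, which is why a query over a few columns favors the columnar layout while whole-record processing favors the row layout.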
  • 37. Compression type • Uncompressed – Nope. Nope. Nope. Nope. • Snappy – Default – Balances compression ratio and speed – Fastest for query • Deflate/gzip – Good for archived/infrequently accessed data – Slow writes, decent read performance
  • 38. Configuration • Schema – Record fields, like a table definition
  • 39. Demo • Demo schema inference/generation
  • 40. Configuration • Schema – Record fields, like a table definition • Partition strategy – Physical layout/storage key definition
  • 41. Partitioning • Map entity fields to partitions • Unlike Hive, partitions are tied to per-entity data • Common partition types: values, hashes, timestamp parsing
  • 42. Demo • Demo partition definition
  • 43. Command-line interface • Experiment before understanding • Creates configuration files • Handles dataset lifecycle – create, update, delete • Basic ETL tasks – copy datasets – transform individual records • Import CSV
  • 44. Example
      1. Describe your data: kite-dataset obj-schema org.grouplens.Rating --jar group-lens-1.0.jar -o rating.avsc
      2. Describe your layout: kite-dataset partition-config ts:year ts:month ts:day --schema rating.avsc -o ymd.json
      3. Create a dataset: kite-dataset create ratings --schema rating.avsc --partition-by ymd.json
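The `ts:year ts:month ts:day` layout in step 2 maps each entity's timestamp to a partition. As a rough sketch of that mapping, assuming `ts` is epoch milliseconds, UTC is the time zone, and partitions use Hive-style `key=value` directory names (all three are assumptions for illustration, as is the `hashBucket` helper; Kite's actual directory naming and hash function may differ):

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

/** Hypothetical illustration of field-to-partition mapping; not Kite code. */
final class PartitionSketch {
    /** Derives a year/month/day partition path from a record's timestamp field. */
    static String ymdPartition(long epochMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
            t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    /** Hash partitioning: spreads values over a fixed number of buckets. */
    static int hashBucket(String value, int buckets) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(value.hashCode(), buckets);
    }

    public static void main(String[] args) {
        System.out.println(ymdPartition(1425340800000L)); // 2015-03-03T00:00:00Z
        System.out.println(hashBucket("user-42", 16));
    }
}
```

Because the partition is computed from the entity itself, a writer never has to be told which partition a record belongs to, which is the contrast with Hive drawn on slide 41.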
  • 45. Command-line interface • Two packages – Standalone for on-cluster use – Tarball with dependencies for remote access (CDH5-only) • Environment variables – HIVE_HOME, HIVE_CONF_DIR, HBASE_HOME, HADOOP_MAPRED_HOME, HADOOP_COMMON_HOME • Debug environment – debug=true ./kite-dataset <command> • Verbose output – ./kite-dataset -v <command>
  • 46. Demo • Demo dataset creation with the CLI • Demo dataset loading with the CLI
  • 47. Maven parent POM • Consolidated Kite and Hadoop dependencies • To use: – Set kite-app-parent-cdh4 or kite-app-parent-cdh5 as your project’s parent POM
      <parent>
        <groupId>org.kitesdk</groupId>
        <artifactId>kite-app-parent-cdh5</artifactId>
        <version>0.17.1</version>
      </parent>
  • 48. Demo • Demo Maven project using Kite parent POM
  • 49. Crunch • Java dataflow API • Runs pipelines in memory, MapReduce, or Spark • Parallel collections
  • 50. Use Crunch with Kite • CrunchDatasets helper class – CrunchDatasets.asSource(View view) – CrunchDatasets.asTarget(View view) • Supports Crunch write modes: default, overwrite and append
      PCollection<Movie> movies = getPipeline().read(CrunchDatasets.asSource("dataset:hive:movies", Movie.class));
      • Re-partition data before writing
      PCollection<Movie> partitionedMovies = CrunchDatasets.partition(movies, targetDataset);
  • 51. Demo • Demo Crunch processing on Kite
  • 52. Impala • Massively parallel processing (MPP) database • SQL • Distributed • Fast
  • 53. Demo • Demo querying a Kite dataset with Impala
  • 54. Architecture [diagram labels: CSV, Schema, Kite CLI, HDFS, Crunch, Impala, Report; steps: infer Avro schema, create dataset, load dataset]