Building data pipelines with kite

Building Data Pipelines with the Kite SDK
Joey Echeverria // Software Engineer

4
Hadoop
©2015 Cloudera, Inc. All rights reserved.

5
Logs
[Diagram: twelve Apache HTTPD servers, each writing to local disk (ten to log files, two to syslog), with Kafka, Flume, and HDFS downstream]

6
RDBMS
Sqoop
HDFS
RDBMS

7
Sea of text files
CSV CSV CSV CSV CSV
CSV CSV CSV CSV CSV

9
Hadoop
• Technically:
– HDFS, YARN, MapReduce
• Hadoop ecosystem:
– Hadoop, HBase, Flume, Sqoop, Kafka, Oozie, Hive, Impala, Pig, Crunch, Spark, etc.
– I’ll also call this just “Hadoop”

10
Introduction to the Kite SDK

11
• Hadoop is all about data
• Bring all of your data to one platform
• Access data using the best engine for your use case
Data

12
• Hadoop ecosystem built from open source components
• Benefits:
– Shared investments
– No vendor lock-in
– Fast evolution
• Costs:
– APIs tend to be low-level
– Integration is ad-hoc
Open source core

13
• HDFS
– Filesystem
• HBase
– Byte array keys -> byte array values
Storage APIs

14
Relational systems
[Diagram labels: Application, JDBC Driver, Database, Data files. Legend: User code, Provided, Maintained by the database]

15
Hadoop without Kite
[Diagram labels: Application, JDBC Driver, Database, Data files, HBase. Legend: User code]

16
Hadoop with Kite
[Diagram labels: Application, Application, JDBC Driver, Database, Kite, Data files, HBase. Legend: Maintained by Kite]

17
• Kite is the data API for the Hadoop ecosystem
• Kite makes it easy to put your data into Hadoop and to use it once it’s there.
Kite

18
• Data is stored in datasets
• Datasets are made up of entities
• Related datasets are grouped into namespaces
Abstractions

19
• A collection of entities/records
– Like a relational database table
• Data types and field names defined by an Avro schema
• Identified by URI
– dataset:hdfs:/datasets/movie/ratings
– dataset:hive:movie/ratings
– dataset:hbase:zk1,zk2,zk3/ratings
Datasets

20
• A single record in a dataset
– Think row in a relational database table
• Entities can be complex and nested
– Avro compiled objects
– Avro generic objects
– Plain old Java objects (POJOs)
Entities

21
• Namespaces group related datasets
– Think database or schema in a relational system
• Dataset names are unique within the same namespace
Namespaces

22
Scheme    Pattern                                          Example
Hive      dataset:hive:<namespace>/<dataset-name>          dataset:hive:movielens/movies
HDFS      dataset:hdfs:/<path>/<namespace>/<dataset-name>  dataset:hdfs:/datasets/movielens/movies
Local FS  dataset:file:/<path>/<namespace>/<dataset-name>  dataset:file:/tmp/data/movielens/movies
HBase     dataset:hbase:<zookeeper-hosts>/<dataset-name>   dataset:hbase:zoo-1,zoo-2,zoo-3/movies
Dataset URIs
• Hive URIs accept an optional location parameter for external tables
– dataset:hive:movielens/movies?location=/datasets/movielens/movies
• HDFS URIs accept an optional nameservice and host
– dataset:hdfs://namenode:8020/datasets/movielens/movies
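The Hive, HDFS, and local FS patterns above share one shape: the last two path elements are the namespace and the dataset name. The sketch below takes such a URI apart using only the JDK. It is a hypothetical helper for illustration, not Kite's own URI handling; the class name `DatasetUriSketch` and its fields are assumptions, and it deliberately rejects HBase URIs (which carry ZooKeeper hosts instead of a namespace) and URIs without an explicit namespace.

```java
// Hypothetical helper for illustration only; Kite resolves dataset URIs itself.
final class DatasetUriSketch {
    final String scheme;     // "hive", "hdfs", or "file"
    final String namespace;  // second-to-last path element
    final String name;       // last path element

    private DatasetUriSketch(String scheme, String namespace, String name) {
        this.scheme = scheme;
        this.namespace = namespace;
        this.name = name;
    }

    static DatasetUriSketch parse(String uri) {
        // Split into "dataset", the scheme, and everything after the scheme.
        String[] parts = uri.split(":", 3);
        if (parts.length != 3 || !parts[0].equals("dataset")) {
            throw new IllegalArgumentException("Not a dataset URI: " + uri);
        }
        String scheme = parts[1];
        if (!scheme.equals("hive") && !scheme.equals("hdfs") && !scheme.equals("file")) {
            throw new IllegalArgumentException("Scheme not handled by this sketch: " + scheme);
        }
        // Drop optional query parameters such as ?location=...
        String path = parts[2];
        int query = path.indexOf('?');
        if (query >= 0) {
            path = path.substring(0, query);
        }
        String[] elements = path.split("/");
        if (elements.length < 2 || elements[elements.length - 2].isEmpty()) {
            throw new IllegalArgumentException("Expected <namespace>/<dataset-name>: " + uri);
        }
        return new DatasetUriSketch(
            scheme, elements[elements.length - 2], elements[elements.length - 1]);
    }

    public static void main(String[] args) {
        DatasetUriSketch d = parse("dataset:hdfs:/datasets/movielens/movies");
        System.out.println(d.scheme + " " + d.namespace + " " + d.name);
    }
}
```

The same call works for the host-qualified HDFS form, because the host and port simply become earlier path elements that the sketch ignores.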

23
• Ingestion framework
– Integrates with Sqoop, Flume, and Kafka; doesn’t replace them
• ETL tool
– Basic command-line tool
– Complete ETL tools can build on Kite
• Processing language
– SQL, Crunch, MapReduce, Spark, Pig, etc.
What Kite isn’t

24
• Flume
– Stream log events directly into Kite datasets
• Sqoop
– Ingest relational database tables into Kite datasets
• Kafka
– Integration is through Flafka (Flume/Kafka integration)
Ingest integration

25
• MapReduce
– Input/OutputFormats
• Crunch
– Source and target
• Spark
– Use Input/OutputFormats to convert datasets to RDDs
• Impala, Hive, Pig
– Use underlying file format support
Data processing integration

26
• Codifies best practices
• Interoperability
• Shields you from Hadoop, Hive, etc. version changes
• Get up and running faster
What does Kite do for you?

27
• Kite is Apache 2.0 licensed
• Hosted on GitHub
• Compatibility:
– Tested against upstream Apache Hadoop 1.0 and 2.3 as well as CDH4/5
• Contributors:
– Cloudera, Cerner, Capital One, Intel, Pivotal
• Distributions:
– Cloudera, Hortonworks, Pivotal, MapR
Open source

28
• Site
– http://kitesdk.org
• Kite guide
– http://tiny.cloudera.com/KiteGuide
• Data module overview
– http://tiny.cloudera.com/Datasets
• Command-line interface tutorial
– http://tiny.cloudera.com/KiteCLI
• Kite examples
– https://github.com/kite-sdk/kite-examples
Resources

29
Using Kite

30
Architecture
[Diagram: CSV -> Kite CLI (infer Avro schema) -> Schema -> Kite CLI (create dataset) -> HDFS; Kite CLI (load dataset); Crunch -> HDFS -> Impala -> Report]

31
Dataset schemes
• Pluggable dataset interface with multiple schemes
• Schemes determine underlying storage mechanism and metadata provider
• HDFS
– Data stored in HDFS directories
– Metadata stored in an Avro schema file and a Java properties file in the dataset directory
• Hive
– Data stored in HDFS directories
– Metadata stored in the Hive metastore
• HBase
– Data and metadata stored in HBase

32
Which scheme?
• HDFS
– Best for raw data and intermediate data in an ETL pipeline
– No SQL access
• Hive
– Best for data that is ready for query or SQL ETL
– No performance difference between Hive and HDFS-backed datasets
• HBase
– Best for online serving applications
– Provides sorted keys
– Optimistic concurrency control

33
Dataset formats
• Physical serialization format
• Avro
– Row-based storage format with schemas and compression
• Parquet
– Column-based storage format optimized for query access
• CSV
– Read-only format
– Used by ETL jobs to read raw data files

34
Avro
[Diagram labels: 1 2 3 4 5 6 7]

35
Parquet
a b c d e f g h i j

36
When to choose which format
• Avro
– Access all fields of a record at the same time
– Intermediate/non-long-lived data
• Parquet
– Access subset of fields/columns at a time
– SQL tables (Impala/Hive)
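To make the trade-off concrete, the sketch below lays out the same three records row by row (the way a row-based format orders values) and column by column (the way a columnar format orders them). It is a toy in-memory model with made-up rating fields, not the actual Avro or Parquet encoding; the class and method names are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of row-based versus column-based value ordering.
final class LayoutSketch {
    // Three made-up records with fields (userId, movieId, rating).
    static final int[][] RECORDS = { {1, 10, 5}, {2, 10, 3}, {3, 20, 4} };

    // Row layout: every field of record 0, then every field of record 1, ...
    static List<Integer> rowLayout(int[][] records) {
        List<Integer> out = new ArrayList<>();
        for (int[] record : records) {
            for (int value : record) {
                out.add(value);
            }
        }
        return out;
    }

    // Column layout: field 0 of every record, then field 1 of every record, ...
    static List<Integer> columnLayout(int[][] records) {
        List<Integer> out = new ArrayList<>();
        int fieldCount = records[0].length;
        for (int field = 0; field < fieldCount; field++) {
            for (int[] record : records) {
                out.add(record[field]);
            }
        }
        return out;
    }

    // In the column layout, one field's values form a single contiguous slice.
    static List<Integer> readColumn(List<Integer> columns, int field, int recordCount) {
        return columns.subList(field * recordCount, (field + 1) * recordCount);
    }

    public static void main(String[] args) {
        System.out.println("rows:    " + rowLayout(RECORDS));
        List<Integer> columns = columnLayout(RECORDS);
        System.out.println("columns: " + columns);
        System.out.println("ratings: " + readColumn(columns, 2, RECORDS.length));
    }
}
```

Reading one whole record is a contiguous read in the row layout, while reading one field across all records is a contiguous read in the column layout, which is why query engines that touch a few columns favor the latter.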

37
Compression type
• Uncompressed
– Nope. Nope. Nope. Nope.
• Snappy
– Default
– Balances compression ratio and speed
– Fastest for query
• Deflate/gzip
– Good for archived/infrequently accessed data
– Slow writes, decent read performance
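Snappy is not part of the JDK, so the sketch below uses the JDK's built-in Deflate implementation to show the general idea of trading write effort for size. It compresses a made-up, highly repetitive CSV-like payload at a chosen level and restores it. The class name, the sample line, and the choice of levels are assumptions for illustration; this is not how Kite configures compression.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Illustrative Deflate round trip using only java.util.zip.
final class DeflateSketch {
    // Compress with a Deflater level, e.g. Deflater.BEST_SPEED or BEST_COMPRESSION.
    static byte[] compress(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Restore the original bytes; malformed input becomes an unchecked exception.
    static byte[] decompress(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("Truncated or unsupported input");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("Corrupt input", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // Made-up repetitive payload: the same CSV-like line 1000 times.
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("196,242,3,881250949\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = sampleData();
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("best speed:       " + compress(raw, Deflater.BEST_SPEED).length);
        System.out.println("best compression: " + compress(raw, Deflater.BEST_COMPRESSION).length);
    }
}
```

Real data is far less repetitive than this sample, so actual ratios and timings need to be measured on your own datasets.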

38
• Schema
– Record fields, like a table definition
Configuration

39
• Demo schema inference/generation
Demo

40
• Schema
– Record fields, like a table definition
• Partition strategy
– Physical layout/storage key definition
Configuration

41
• Map entity fields to partitions
• Unlike Hive, partition values are derived from each entity's own fields
• Common partition types: values, hashes, timestamp parsing
Partitioning
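As a rough illustration of mapping entity fields to partitions, the sketch below derives a year/month/day partition from an epoch-millisecond timestamp field and a hash bucket from an arbitrary field value. It uses only the JDK and is not Kite's partitioner code; the class name, the `year=.../month=.../day=...` path format, the UTC time zone, and the use of `hashCode` are assumptions of this sketch.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

// Hypothetical partition functions for illustration only.
final class PartitionSketch {
    // Timestamp parsing: derive year, month, and day from the entity's timestamp.
    static String ymdPartition(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
            t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    // Hash partitioning: spread field values over a fixed number of buckets.
    // floorMod keeps the result non-negative even for negative hash codes.
    static int hashPartition(Object fieldValue, int buckets) {
        return Math.floorMod(fieldValue.hashCode(), buckets);
    }

    public static void main(String[] args) {
        long ts = 1426291200000L; // 2015-03-14T00:00:00Z
        System.out.println(ymdPartition(ts));
        System.out.println(hashPartition("some-user-id", 16));
    }
}
```

Because the partition is computed from the record itself, a writer never has to be told which partition a record belongs to.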

42
• Demo partition definition
Demo

43
• Experiment before understanding
• Creates configuration files
• Handles dataset lifecycle
– create, update, delete
• Basic ETL tasks
– copy datasets
– transform individual records
• Import CSV
Command-line interface

44
1. Describe your data
kite-dataset obj-schema org.grouplens.Rating \
  --jar group-lens-1.0.jar -o rating.avsc
2. Describe your layout
kite-dataset partition-config ts:year ts:month ts:day \
  --schema rating.avsc -o ymd.json
3. Create a dataset
kite-dataset create ratings --schema rating.avsc \
  --partition-by ymd.json
Example

45
• Two packages
– Standalone for on-cluster use
– Tarball with dependencies for remote access (CDH5-only)
• Environment variables
– HIVE_HOME, HIVE_CONF_DIR, HBASE_HOME, HADOOP_MAPRED_HOME, HADOOP_COMMON_HOME
• Debug environment
– debug=true ./kite-dataset <command>
• Verbose output
– ./kite-dataset -v <command>
Command-line interface

46
• Demo dataset creation with the CLI
• Demo dataset loading with the CLI
Demo

47
Maven parent POM
• Consolidated Kite and Hadoop dependencies
• To use:
– Set kite-app-parent-cdh4 or kite-app-parent-cdh5 as your project’s parent POM
<parent>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-app-parent-cdh5</artifactId>
  <version>0.17.1</version>
</parent>

48
• Demo Maven project using the Kite parent POM
Demo

49
• Java dataflow API
• Runs pipelines in memory, MapReduce, or Spark
• Parallel collections
Crunch

50
Use Crunch with Kite
• CrunchDatasets helper class
– CrunchDatasets.asSource(View view)
– CrunchDatasets.asTarget(View view)
• Supports Crunch write modes: default, overwrite and append
PCollection<Movie> movies = getPipeline().read(
    CrunchDatasets.asSource("dataset:hive:movies", Movie.class));
• Re-partition data before writing
PCollection<Movie> partitionedMovies =
    CrunchDatasets.partition(movies, targetDataset);

51
• Demo Crunch processing on Kite
Demo

52
Impala
• Massively parallel processing (MPP) database
• SQL
• Distributed
• Fast

53
• Demo querying a Kite dataset with Impala
Demo

54
Architecture
[Diagram: CSV -> Kite CLI (infer Avro schema) -> Schema -> Kite CLI (create dataset) -> HDFS; Kite CLI (load dataset); Crunch -> HDFS -> Impala -> Report]
