Tajolabigdatacamp2014 140618135810-phpapp01 hyunsik-choi

Apache Tajo:
A Big Data Warehouse System
on Hadoop

Hyunsik Choi
Director of Research, Gruter
Big Data Camp LA 2014

Talk Outline
•  Introduction to Apache Tajo
•  What you can do with Tajo
•  Why you should use Tajo
•  Current Status of Tajo Project
•  Demonstration

About Me
•  Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
•  PhD (Computer Science Engineering, 2013), Korea Univ.
•  Director of Research, Gruter Corp
•  Open-source Involvement
–  Full-time contributor to Apache Tajo (2013.6 ~ )
–  Apache Tajo PMC member and committer (2013.3 ~ )
–  Apache Giraph PMC member and committer (2011. 8 ~ )
•  Contact Info
–  Email: hyunsik@apache.org
–  LinkedIn: http://linkedin.com/in/hyunsikchoi/

Apache Tajo
•  Open-source “SQL-on-Hadoop” “Big DW” system
•  Apache Top-level project since March 2014
•  Supports SQL standards
•  Handles both low-latency queries and long-running batch queries
•  Features
–  Supports Joins (inner and all outer), Groupby, and Sort
–  Window function
–  Most SQL data types supported (except for Decimal)
•  Recent 0.8.0 release
–  https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0

What You Can Do with Tajo
•  Batch queries
–  Long-running queries (~ hours)
•  Dynamic Scheduling
•  Fault Tolerance
–  ETL workloads
•  Interactive Ad-hoc Queries
–  Very low latency (from 100 ms)
–  A few seconds on a dataset of several TB if your cluster
has enough capacity

Why You Should Use Tajo
•  SQL Standards
–  Plus non-standard features from PostgreSQL and Oracle
•  Simple Installation and Operation
–  http://tajo.apache.org/docs/0.8.0/getting_started.html
•  Simple Software Stack Requirement
–  No MapReduce and No Tez
–  YARN is supported but not required
–  Tajo + Linux system for single node cluster
–  Tajo + HDFS for a distributed cluster

•  Mature SQL Feature Set
–  Fully distributed query executions
•  Inner join, and left/right/full outer join
•  Groupby, sort, multiple distinct aggregation, window function
–  SQL data types
•  CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT
•  TIMESTAMP, DATE, TIME, and INTERVAL
•  DECIMAL (work in progress)
–  Various file formats
•  Text file (CSV), RCFile, Parquet (flat schema), and
Avro (flat schema)

•  Fully community-driven open source
•  Stable development team
–  5 full-time contributors plus many other contributors
•  Performance and speed
–  1.5 to 10 times faster than Hive 0.10
–  Tajo vs. Hive 0.13?
–  Tajo vs. Impala?

•  Integration with Hadoop Ecosystem
–  Hadoop 2.2.0 – 2.4.0 support
–  Can connect to the Hive Metastore
–  Directly process tables managed by Hive
–  Yarn support (backport)
•  Enable Tajo to deploy and run on Yarn cluster
•  Allow users to add nodes to or remove nodes from a Tajo
cluster at runtime
•  Contributed by Min Zhou (committer), Linkedin Engineer
•  https://github.com/coderplay/tajo-yarn

Current Status – Overall
•  In beta: most key features are nearly ready
•  Most SQL features are implemented
•  Working on hundreds of clusters for
production
–  Collaboration with the biggest telco in S. Korea
•  We have just started work on low-level
optimizations.
–  Runtime byte code generation (v0.9)
–  Unsafe-based hash table for hash aggregation/join
–  Vectorized execution engine

Current Status – Logical Plan Optimizer
•  Basic Rewrite Rule
–  Common sub expression elimination
–  Constant folding (CF), and Null propagation
•  Projection Push Down (PPD)
–  Push expressions down to the lowest possible operators
–  Narrow the set of columns to read
–  Remove duplicated expressions
•  when several expressions share a common subexpression
•  Filter Push Down (FPD)
–  Reduce the number of rows to process as early as possible
•  Extensible Rewrite Rule
–  Allow developers to write their own rewrite rules
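
As a rough illustration of the basic rewrite rules above, the sketch below folds constant subexpressions and propagates NULL over a tiny expression tree. The `Const`, `Column`, `BinOp`, and `fold` names are hypothetical and chosen only for this example; they are not Tajo's planner classes.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Const:
    value: Optional[float]  # None stands in for SQL NULL


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str  # "+" or "*"
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Column, BinOp]


def fold(expr: Expr) -> Expr:
    """Fold constant subtrees bottom-up; a NULL operand makes the result NULL."""
    if not isinstance(expr, BinOp):
        return expr
    left, right = fold(expr.left), fold(expr.right)
    # Null propagation: arithmetic with a NULL constant yields NULL.
    for side in (left, right):
        if isinstance(side, Const) and side.value is None:
            return Const(None)
    # Constant folding: both operands are already known at planning time.
    if isinstance(left, Const) and isinstance(right, Const):
        if expr.op == "+":
            return Const(left.value + right.value)
        if expr.op == "*":
            return Const(left.value * right.value)
    return BinOp(expr.op, left, right)


# sum_price * (1.2 * 0.3): the right-hand side collapses to one constant.
folded = fold(BinOp("*", Column("sum_price"), BinOp("*", Const(1.2), Const(0.3))))
# price + (2.0 * NULL): the whole expression collapses to NULL.
null_result = fold(BinOp("+", Column("price"), BinOp("*", Const(2.0), Const(None))))
```

After folding, the per-row work for the first expression is a single multiplication by a constant instead of two multiplications.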

Original:
SELECT
  item_id,
  order_id,
  sum_price * (1.2 * 0.3) AS total
FROM (
  SELECT
    item_id,
    order_id,
    sum(price) AS sum_price
  FROM
    ITEMS
  GROUP BY item_id, order_id
) a
WHERE item_id = 17234

Rewritten (after CF + PPD, then FPD):
SELECT
  item_id,
  order_id,
  sum(price) * 0.36 AS total
FROM
  ITEMS
WHERE item_id = 17234
GROUP BY
  item_id,
  order_id
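
One way to convince yourself that such a rewrite preserves results is to run both forms against a small table. The sketch below does this with Python's built-in `sqlite3` module rather than Tajo; the sample rows are made up, the constant is folded by hand (1.2 * 0.3 = 0.36), and an `ORDER BY` is added to both queries only to make the comparison deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id INTEGER, order_id INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(17234, 1, 10.0), (17234, 1, 5.0), (17234, 2, 20.0), (99, 3, 7.0)],
)

# Original form: aggregate every item, then filter and scale in the outer query.
original = """
SELECT item_id, order_id, sum_price * (1.2 * 0.3) AS total
FROM (
  SELECT item_id, order_id, sum(price) AS sum_price
  FROM items
  GROUP BY item_id, order_id
) a
WHERE item_id = 17234
ORDER BY order_id
"""

# Rewritten form: constant folded, expression and filter pushed below the aggregation.
rewritten = """
SELECT item_id, order_id, sum(price) * 0.36 AS total
FROM items
WHERE item_id = 17234
GROUP BY item_id, order_id
ORDER BY order_id
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()

# Compare keys exactly and totals with a small floating-point tolerance.
same = len(original_rows) == len(rewritten_rows) and all(
    a[:2] == b[:2] and abs(a[2] - b[2]) < 1e-9
    for a, b in zip(original_rows, rewritten_rows)
)
```

In the rewritten form the filter runs before the aggregation, so rows for other items never reach the GROUP BY.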

•  Cost-based Join Order (since v0.2)
–  No need to guess the right join order by hand anymore
–  Greedy heuristic algorithm
•  Resulting in a bushy join tree instead of left-deep join tree
[Figure: left-deep join tree vs. bushy join tree]
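
To make the idea of a greedy heuristic concrete, here is a minimal sketch that repeatedly joins the cheapest pair of sub-plans. The cost model (estimated output rows from made-up row counts and selectivities) and the `greedy_join_order` function are assumptions for illustration only; Tajo's actual cost model and algorithm may differ.

```python
from itertools import combinations


def greedy_join_order(rows, selectivity):
    """Repeatedly join the cheapest pair of sub-plans until one plan remains.

    rows: {table name: estimated row count}
    selectivity: {frozenset({a, b}): fraction of the cross product that survives}
    Returns (plan, estimated rows); a plan is a table name or a (left, right) tuple.
    """
    # Each entry: (plan tree, set of base tables it covers, estimated rows)
    plans = [(name, frozenset([name]), float(n)) for name, n in rows.items()]

    def join_estimate(a, b):
        # Multiply the selectivities of every join predicate linking the two sides.
        sel, connected = 1.0, False
        for x in a[1]:
            for y in b[1]:
                key = frozenset([x, y])
                if key in selectivity:
                    sel *= selectivity[key]
                    connected = True
        return a[2] * b[2] * sel, connected

    while len(plans) > 1:
        best = None
        for a, b in combinations(plans, 2):
            est, connected = join_estimate(a, b)
            # Prefer pairs with a join predicate, then the smallest estimated output.
            key = (not connected, est)
            if best is None or key < best[0]:
                best = (key, a, b, est)
        _, a, b, est = best
        plans = [p for p in plans if p is not a and p is not b]
        plans.append(((a[0], b[0]), a[1] | b[1], est))
    return plans[0][0], plans[0][2]


# Hypothetical statistics for a chain query A - B - C - D.
plan, estimated_rows = greedy_join_order(
    {"A": 1000, "B": 100, "C": 2000, "D": 50},
    {
        frozenset(["A", "B"]): 0.001,
        frozenset(["B", "C"]): 0.01,
        frozenset(["C", "D"]): 0.01,
    },
)
```

With these numbers the heuristic joins A with B and C with D independently and only then joins the two intermediate results, which is a bushy tree; a left-deep planner would have to extend a single chain one table at a time.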

Current Status – Window Function
•  OVER clause
–  row_number() and rank()
–  Aggregation function support
–  PARTITION and ORDER BY clause
SELECT depname, empno, salary, enroll_date FROM (
  SELECT
    depname, empno, salary, enroll_date,
    rank() OVER (PARTITION BY depname
                 ORDER BY salary DESC, empno) AS pos
  FROM empsalary
) AS ss
WHERE
  pos < 3;
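
For readers new to window functions, the following sketch emulates `rank() OVER (PARTITION BY ... ORDER BY ...)` in plain Python and then keeps the two top-ranked rows per department, assuming the filter in the query is `pos < 3`. The `rank_over` helper and the sample rows are hypothetical, and `enroll_date` is left out for brevity.

```python
from itertools import groupby


def rank_over(rows, partition_key, order_key):
    """Return (row, rank) pairs, emulating rank() OVER (PARTITION BY .. ORDER BY ..)."""
    ranked = []
    ordered = sorted(rows, key=lambda r: (partition_key(r), order_key(r)))
    for _, partition in groupby(ordered, key=partition_key):
        rank, prev = 0, None
        for position, row in enumerate(partition, start=1):
            current = order_key(row)
            if current != prev:  # ties share a rank; the next distinct key skips ahead
                rank, prev = position, current
            ranked.append((row, rank))
    return ranked


empsalary = [
    {"depname": "develop", "empno": 8, "salary": 6000},
    {"depname": "develop", "empno": 10, "salary": 5200},
    {"depname": "develop", "empno": 11, "salary": 5200},
    {"depname": "sales", "empno": 1, "salary": 5000},
    {"depname": "sales", "empno": 3, "salary": 4800},
    {"depname": "sales", "empno": 4, "salary": 4800},
]

ranked = rank_over(
    empsalary,
    partition_key=lambda r: r["depname"],
    order_key=lambda r: (-r["salary"], r["empno"]),  # salary DESC, empno
)
# Equivalent of the outer WHERE pos < 3.
top_two = [(r["depname"], r["empno"]) for r, pos in ranked if pos < 3]
```

The rank restarts at 1 in every partition, which is why the outer filter yields two rows per department rather than two rows overall.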

Current Status – Join
•  Join
–  NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
–  SEMI, ANTI Join (planned for v0.9)
•  Join Predicates
–  WHERE and ON predicates
–  De facto standard outer join behavior with both
predicates
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
WHERE t2.value = 'xxx';

SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
AND t2.value = 'xxx';
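
The difference between filtering in `WHERE` and filtering in `ON` for an outer join is easy to see on a tiny dataset. The sketch below uses Python's built-in `sqlite3` module instead of Tajo, with made-up rows, and selects only two columns to keep the output small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (num INTEGER, name TEXT);
CREATE TABLE t2 (num INTEGER, value TEXT);
INSERT INTO t1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO t2 VALUES (1, 'xxx'), (3, 'yyy'), (5, 'zzz');
""")

# Predicate in WHERE: applied after the join, so NULL-extended rows are dropped.
where_rows = conn.execute(
    "SELECT t1.num, t2.value FROM t1 LEFT JOIN t2 ON t1.num = t2.num "
    "WHERE t2.value = 'xxx' ORDER BY t1.num"
).fetchall()

# Predicate in ON: part of the join condition, so every t1 row is preserved.
on_rows = conn.execute(
    "SELECT t1.num, t2.value FROM t1 LEFT JOIN t2 "
    "ON t1.num = t2.num AND t2.value = 'xxx' ORDER BY t1.num"
).fetchall()
```

Here `where_rows` contains only the matching row, while `on_rows` keeps all three rows of `t1` and fills `value` with NULL where no `t2` row satisfies the full join condition.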

Current Status – Table Partitions
•  Column Value Partition
–  Hive Compatible Partition
•  Range Partition (planned for 1.0)
–  Table will be partitioned by disjoint ranges.
–  Will remove the partition granularity problem of
Hive Partition
CREATE TABLE T1 (C1 INT, C2 TEXT)
USING PARQUET
WITH ('parquet.compression' = 'SNAPPY')
PARTITION BY COLUMN (C3 INT, C4 TEXT);
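
Hive-style column value partitioning maps each distinct combination of partition column values to its own directory, named `column=value`. The sketch below shows that mapping under simplifying assumptions: the `/warehouse/t1` base path and both helper functions are hypothetical, and real systems also escape special characters in values.

```python
from collections import defaultdict


def partition_path(table_dir, partition_columns, row):
    """Build a Hive-style partition directory such as base/c3=10/c4=us."""
    parts = [f"{col}={row[col]}" for col in partition_columns]
    return "/".join([table_dir.rstrip("/")] + parts)


def write_plan(table_dir, partition_columns, rows):
    """Group rows by target directory; partition values live in the path, not the data."""
    plan = defaultdict(list)
    for row in rows:
        data = {k: v for k, v in row.items() if k not in partition_columns}
        plan[partition_path(table_dir, partition_columns, row)].append(data)
    return dict(plan)


rows = [
    {"c1": 1, "c2": "a", "c3": 10, "c4": "us"},
    {"c1": 2, "c2": "b", "c3": 10, "c4": "us"},
    {"c1": 3, "c2": "c", "c3": 20, "c4": "kr"},
]
plan = write_plan("/warehouse/t1", ["c3", "c4"], rows)
```

A query that filters on `c3` or `c4` can then skip whole directories. Because every distinct value gets its own directory, high-cardinality columns produce many small partitions, which is the granularity issue that range partitioning is meant to address.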

Future Works
•  Multi-tenant Scheduler (v0.9)
–  Support multiple users and multiple queries
•  Runtime byte code generation for
expressions (v0.9)
–  Eliminate the interpretation overhead of expression evaluation
•  Authentication and SQL Standard Access Control
•  JIT-based Vectorized Processing Engine
–  Refer to the Hadoop Summit 2014 slides
(http://goo.gl/jWghhp)
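
The motivation for runtime code generation can be shown with a loose analogy in Python: instead of walking an expression tree for every row, translate the tree once into code and call the compiled function per row. This is only an analogy, not Tajo's implementation (which targets JVM byte code); the tuple-based tree format and the `interpret`, `to_source`, and `generate` functions are hypothetical.

```python
def interpret(expr, row):
    """Walk the expression tree for every row (the overhead to be removed)."""
    kind = expr[0]
    if kind == "col":
        return row[expr[1]]
    if kind == "const":
        return expr[1]
    left, right = interpret(expr[1], row), interpret(expr[2], row)
    return left + right if kind == "+" else left * right


def to_source(expr):
    """Translate the tree into Python source text once, at planning time."""
    kind = expr[0]
    if kind == "col":
        return f"row[{expr[1]!r}]"
    if kind == "const":
        return repr(expr[1])
    return f"({to_source(expr[1])} {kind} {to_source(expr[2])})"


def generate(expr):
    """Compile the generated source into a function that is called once per row."""
    code = compile(f"lambda row: {to_source(expr)}", "<generated>", "eval")
    return eval(code)


# price * 2 + tax, with "+" and "*" as the only supported operators.
expr = ("+", ("*", ("col", "price"), ("const", 2)), ("col", "tax"))
compiled = generate(expr)
row = {"price": 10, "tax": 3}
interpreted_value = interpret(expr, row)
compiled_value = compiled(row)
```

Both paths return the same value; the generated function simply avoids re-dispatching on node types for every row.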

Get Involved!
•  We are recruiting contributors!
•  General
–  http://tajo.apache.org
•  Getting Started
–  http://tajo.apache.org/docs/0.8.0/getting_started.html
•  Downloads
–  http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
•  Jira – Issue Tracker
–  https://issues.apache.org/jira/browse/TAJO
•  Join the mailing list
–  dev-subscribe@tajo.apache.org
–  issues-subscribe@tajo.apache.org
