Apache Tajo:
A Big Data Warehouse System
on Hadoop
 
Hyunsik Choi
Director of Research, Gruter
Big Data Camp LA 2014
Talk Outline
•  Introduction to Apache Tajo
•  What you can do with Tajo
•  Why you should use Tajo
•  Current Status of Tajo Project
•  Demonstration
About Me
•  Hyunsik Choi (pronounced “Hyeon-shick Cheh”)
•  PhD (Computer Science & Engineering, 2013), Korea Univ.
•  Director of Research, Gruter Corp
•  Open-source Involvement
–  Full-time contributor to Apache Tajo (2013.6 ~ )
–  Apache Tajo PMC member and committer (2013.3 ~ )
–  Apache Giraph PMC member and committer (2011. 8 ~ )
•  Contact Info
–  Email: hyunsik@apache.org
–  LinkedIn: http://linkedin.com/in/hyunsikchoi/
Apache Tajo
•  Open-source “SQL-on-Hadoop” “Big DW” (big data warehouse) system
•  Apache Top-level project since March 2014
•  Supports SQL standards
•  Handles both low-latency queries and long-running batch queries
•  Features
–  Supports Joins (inner and all outer), Groupby, and Sort
–  Window function
–  Most SQL data types supported (except for Decimal)
•  Recent 0.8.0 release
–  https://blogs.apache.org/tajo/entry/apache_tajo_0_8_0
Overall Architecture
What You Can Do with Tajo
•  Batch queries
–  Long-running queries (~ hours)
•  Dynamic Scheduling
•  Fault Tolerance
–  ETL workloads
•  Interactive Ad-hoc Queries
–  Very low-latency (100 ms ~)
–  A few seconds on a dataset of several TB if your cluster capacity is sufficient
Why You Should Use Tajo
•  SQL Standards
–  Plus non-standard features from PostgreSQL and Oracle
•  Simple Installation and Operation
–  http://tajo.apache.org/docs/0.8.0/getting_started.html
•  Simple Software Stack Requirement
–  No MapReduce and No Tez
–  YARN is supported but not mandatory
–  Tajo + Linux system for single node cluster
–  Tajo + HDFS for a distributed cluster
Why You Should Use Tajo
•  Mature SQL Feature Set
–  Fully distributed query executions
•  Inner join, and left/right/full outer join
•  Groupby, sort, multiple distinct aggregation, window function
–  SQL data types
•  CHAR, BOOL, INT, BIGINT, REAL, DOUBLE, and TEXT
•  TIMESTAMP, DATE, TIME, and INTERVAL
•  DECIMAL (in progress)
–  Various file formats
•  Text file (CSV), RCFile, Parquet (flat schema), and Avro (flat schema)
Why You Should Use Tajo
•  Fully community-driven open source
•  Stable development team
–  5 full-time contributors plus many other contributors
•  Performance and speed
–  Faster than Hive 0.10 (1.5 – 10 times)
–  Tajo vs. Hive 0.13?
–  Tajo vs. Impala?
Why You Should Use Tajo
•  Integration with Hadoop Ecosystem
–  Hadoop 2.2.0 – 2.4.0 support
–  Can connect to Hive Metastore
–  Directly process tables managed by Hive
–  YARN support (backport)
•  Enables Tajo to be deployed and run on a YARN cluster
•  Lets users add nodes to or remove nodes from a Tajo cluster at runtime
•  Contributed by Min Zhou (committer), LinkedIn engineer
•  https://github.com/coderplay/tajo-yarn
Current Status – Overall
•  In beta stage: the majority of key features are nearly ready
•  Most SQL features are implemented
•  Working on hundreds of clusters for production
–  Collaboration with the biggest telco in S. Korea
•  We have just started work on low-level optimizations:
–  Runtime byte code generation (v0.9)
–  Unsafe-based hash table for hash aggregation/join
–  Vectorized execution engine
Current Status – Logical Plan Optimizer
•  Basic Rewrite Rule
–  Common sub expression elimination
–  Constant folding (CF), and Null propagation
•  Projection Push Down (PPD)
–  Push expressions down to the lowest possible operators
–  Narrow the set of columns read
–  Remove duplicated expressions
•  when several expressions share a common sub-expression
•  Filter Push Down (FPD)
–  Reduce the number of rows to be processed as early as possible
•  Extensible Rewrite Rule
–  Allow developers to write their own rewrite rules
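The constant folding and null propagation rules listed above can be illustrated with a small sketch. The `Const`, `Column`, `Mul`, and `fold` names are hypothetical and are not Tajo classes; the sketch only assumes that a multiplication with a NULL operand evaluates to NULL, as in standard SQL.

```python
from dataclasses import dataclass
from typing import Optional, Union


# Hypothetical expression nodes for illustration only; not Tajo's plan classes.
@dataclass(frozen=True)
class Const:
    value: Optional[float]  # None stands for SQL NULL


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Mul:
    left: "Expr"
    right: "Expr"


Expr = Union[Const, Column, Mul]


def fold(expr: Expr) -> Expr:
    """Rewrite an expression bottom-up, folding constants and propagating NULL."""
    if isinstance(expr, Mul):
        left, right = fold(expr.left), fold(expr.right)
        # Null propagation: arithmetic with a NULL operand yields NULL.
        if Const(None) in (left, right):
            return Const(None)
        # Constant folding: both operands are known at planning time.
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value * right.value)
        return Mul(left, right)
    return expr


# sum_price * (1.2 * 0.3): the inner product is folded into a single constant.
folded = fold(Mul(Column("sum_price"), Mul(Const(1.2), Const(0.3))))
```

After folding, the multiplication is evaluated once at planning time rather than once per row.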
Current Status – Logical Plan Optimizer
Original:

SELECT
  item_id,
  order_id,
  sum_price * (1.2 * 0.3) AS total
FROM (
  SELECT
    item_id,
    order_id,
    sum(price) AS sum_price
  FROM
    ITEMS
  GROUP BY item_id, order_id
) a
WHERE item_id = 17234

Rewritten:

SELECT
  item_id,
  order_id,
  sum(price) * 0.36 AS total   -- CF + PPD
FROM
  ITEMS
WHERE item_id = 17234          -- FPD
GROUP BY
  item_id,
  order_id
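One way to check that such a rewrite preserves results is to run both forms against the same data. The sketch below uses SQLite through Python's standard `sqlite3` module rather than Tajo; the sample rows and the `ORDER BY` clauses are assumptions added for a deterministic comparison. Note that 1.2 * 0.3 folds to 0.36, and that the pushed-down filter is written before `GROUP BY`, as standard SQL requires.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (item_id INTEGER, order_id INTEGER, price REAL)")
# Made-up sample rows for illustration.
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [(17234, 1, 10.0), (17234, 1, 5.0), (17234, 2, 20.0), (99, 3, 7.0)],
)

# Original form: aggregate everything in a subquery, then filter and scale.
original = """
SELECT item_id, order_id, sum_price * (1.2 * 0.3) AS total
FROM (SELECT item_id, order_id, sum(price) AS sum_price
      FROM items GROUP BY item_id, order_id) a
WHERE item_id = 17234
ORDER BY order_id
"""

# Rewritten form: constant folded, expression and filter pushed down.
rewritten = """
SELECT item_id, order_id, sum(price) * 0.36 AS total
FROM items
WHERE item_id = 17234
GROUP BY item_id, order_id
ORDER BY order_id
"""

rows_original = conn.execute(original).fetchall()
rows_rewritten = conn.execute(rewritten).fetchall()
```

Both queries return one row per (item_id, order_id) group for item 17234, but the rewritten one never aggregates rows for other items.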
Current Status – Logical Plan Optimizer
•  Cost-based Join Order (since v0.2)
–  No need to guess the right join order anymore
–  Greedy heuristic algorithm
•  Resulting in a bushy join tree instead of left-deep join tree
[Figure: left-deep join tree vs. bushy join tree]
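A greedy heuristic of this kind can be sketched as follows. This is not Tajo's actual cost model or algorithm: the function name, the row-count estimates, and the selectivity table are all hypothetical. The sketch repeatedly merges the pair of sub-plans with the smallest estimated output, which can produce a bushy tree because two intermediate results may be joined with each other.

```python
from itertools import combinations


def greedy_join_order(sizes, selectivity):
    """Greedily merge the cheapest joinable pair until one tree remains.

    sizes: {relation name: estimated row count}
    selectivity: {frozenset({a, b}): join selectivity between base relations}
    Returns a nested tuple such as (("a", "b"), ("c", "d")).
    """
    # Each entry: (plan tree, set of base relations covered, estimated rows).
    plans = [(name, frozenset([name]), rows) for name, rows in sizes.items()]

    def estimate(left, right):
        # Multiply the selectivities of all join edges crossing the two sides.
        factor, connected = 1.0, False
        for a in left[1]:
            for b in right[1]:
                s = selectivity.get(frozenset([a, b]))
                if s is not None:
                    factor, connected = factor * s, True
        return left[2] * right[2] * factor if connected else None

    while len(plans) > 1:
        best = None
        for left, right in combinations(plans, 2):
            rows = estimate(left, right)
            if rows is not None and (best is None or rows < best[0]):
                best = (rows, left, right)
        if best is None:
            raise ValueError("join graph is disconnected")
        rows, left, right = best
        plans.remove(left)
        plans.remove(right)
        plans.append(((left[0], right[0]), left[1] | right[1], rows))
    return plans[0][0]


# Hypothetical statistics: a-b and c-d are cheap joins, b-c is expensive.
tree = greedy_join_order(
    {"a": 1000, "b": 10, "c": 2000, "d": 20},
    {frozenset("ab"): 0.01, frozenset("cd"): 0.01, frozenset("bc"): 0.1},
)
```

With these numbers the heuristic joins a with b and c with d first, then joins the two intermediate results, which is a bushy shape that a left-deep planner could not produce.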
Current Status – Window Function
•  OVER clause
–  row_number() and rank()
–  Aggregation function support
–  PARTITION and ORDER BY clause
SELECT depname, empno, salary, enroll_date FROM (
  SELECT
    depname, empno, salary, enroll_date,
    rank() OVER (PARTITION BY depname
                 ORDER BY salary DESC, empno) AS pos
  FROM empsalary
) AS ss
WHERE
  pos < 3;
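To make the semantics of the query concrete, the sketch below emulates `rank() OVER (PARTITION BY depname ORDER BY salary DESC, empno)` in plain Python, assuming the outer filter keeps positions below 3 (`pos < 3`). The `top_ranked` helper and the sample rows are made up for illustration, and `enroll_date` is omitted for brevity.

```python
from itertools import groupby


def top_ranked(rows, limit):
    """Rank rows within each department, then keep those with pos < limit."""
    # Sort by partition key first, then by the window's ORDER BY keys.
    ordered = sorted(rows, key=lambda r: (r["depname"], -r["salary"], r["empno"]))
    result = []
    for _, partition in groupby(ordered, key=lambda r: r["depname"]):
        previous_key, pos = None, 0
        for index, row in enumerate(partition, start=1):
            key = (row["salary"], row["empno"])
            # rank() gives tied rows the same position and leaves gaps after ties.
            if key != previous_key:
                pos, previous_key = index, key
            if pos < limit:
                result.append({**row, "pos": pos})
    return result


# Made-up sample data for illustration.
rows = [
    {"depname": "sales", "empno": 1, "salary": 5000},
    {"depname": "sales", "empno": 3, "salary": 4800},
    {"depname": "sales", "empno": 4, "salary": 4800},
    {"depname": "develop", "empno": 8, "salary": 6000},
    {"depname": "develop", "empno": 10, "salary": 5200},
    {"depname": "develop", "empno": 11, "salary": 5200},
]
top = top_ranked(rows, 3)
```

Because `empno` acts as a tie-breaker in the ordering, no two rows in a partition tie, so each department contributes exactly its two best-paid employees.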
Current Status – Join
•  Join
–  NATURAL, INNER, OUTER (LEFT, RIGHT, FULL)
–  SEMI, ANTI Join (planned for v0.9)
•  Join Predicates
–  WHERE and ON predicates
–  De facto standard outer join behavior with both predicates
SELECT * FROM t1 LEFT JOIN t2 ON t1.num = t2.num
WHERE t2.value = 'xxx';

SELECT * FROM t1 LEFT JOIN t2 WHERE t1.num = t2.num AND t2.value = 'xxx';
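The difference between placing a predicate in `WHERE` and placing it in `ON` matters for outer joins. The sketch below shows the commonly implemented behavior using SQLite through Python's `sqlite3` module, not Tajo itself. The sample rows are made up, and the second query puts the filter into the `ON` clause, which is an assumption about the contrast the slide intends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (num INTEGER);
CREATE TABLE t2 (num INTEGER, value TEXT);
INSERT INTO t1 VALUES (1), (2), (3);
INSERT INTO t2 VALUES (1, 'xxx'), (2, 'yyy');
""")

# Filter in WHERE: applied after the join, so null-extended rows are dropped.
where_rows = conn.execute(
    "SELECT t1.num, t2.value FROM t1 LEFT JOIN t2 ON t1.num = t2.num "
    "WHERE t2.value = 'xxx' ORDER BY t1.num"
).fetchall()

# Filter in ON: part of the join condition, so every t1 row is preserved.
on_rows = conn.execute(
    "SELECT t1.num, t2.value FROM t1 LEFT JOIN t2 "
    "ON t1.num = t2.num AND t2.value = 'xxx' ORDER BY t1.num"
).fetchall()
```

The `WHERE` form keeps only the matching row, while the `ON` form keeps all three rows of `t1` and fills `value` with NULL where no `t2` row satisfies the join condition.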
Current Status – Table Partitions
•  Column Value Partition
–  Hive Compatible Partition
•  Range Partition (planned for 1.0)
–  Table will be partitioned by disjoint ranges.
–  Will remove the partition granularity problem of Hive Partition
CREATE TABLE T1 (C1 INT, C2 TEXT)
USING PARQUET
WITH ('parquet.compression' = 'SNAPPY')
PARTITION BY COLUMN (C3 INT, C4 TEXT);
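Hive-style column value partitioning stores each distinct combination of partition column values in its own directory, named with one `key=value` level per partition column. The sketch below builds such a path; the `partition_path` helper and the `/warehouse/t1` table root are hypothetical, and the escaping of special characters that Hive applies to partition values is ignored.

```python
from pathlib import PurePosixPath


def partition_path(table_root, partition_columns, row):
    """Build a Hive-style directory path: one key=value level per partition column."""
    path = PurePosixPath(table_root)
    for column in partition_columns:
        path = path / f"{column}={row[column]}"
    return str(path)


# A row of T1 with made-up values; c3 and c4 are the partition columns.
path = partition_path(
    "/warehouse/t1",
    ["c3", "c4"],
    {"c1": 7, "c2": "x", "c3": 2014, "c4": "la"},
)
```

A query that filters on `c3` or `c4` can then skip whole directories, which is also why the granularity of partitions is tied to the number of distinct column values.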
Future Work
•  Multi-tenant Scheduler (v0.9)
–  Support multiple users and multiple queries
•  Runtime byte code generation for expressions (v0.9)
–  Eliminates the interpretation overhead of expression evaluation
•  Authentication and SQL Standard Access Control
•  JIT-based Vectorized Processing Engine
–  Refer to the Hadoop Summit 2014 slides (http://goo.gl/jWghhp)
Get Involved!
•  We are recruiting contributors!
•  General
–  http://tajo.apache.org
•  Getting Started
–  http://tajo.apache.org/docs/0.8.0/getting_started.html
•  Downloads
–  http://tajo.apache.org/docs/0.8.0/getting_started/downloading_source.html
•  Jira – Issue Tracker
–  https://issues.apache.org/jira/browse/TAJO
•  Join the mailing list
–  dev-subscribe@tajo.apache.org
–  issues-subscribe@tajo.apache.org
