
2 SQL Hadoop Analyzing Big Data Hive m2 Intro Slides

Hive provides a SQL-like interface for querying large datasets stored in Hadoop. These slides introduce schema on read and the Hive warehouse, with metadata about Hive objects persisted in the metastore. HiveQL can be used to run queries, create databases and tables, load data, and analyze data stored in HDFS. The demo creates a database and a table, loads sample data, and runs queries against the Hive table.
Introduction to Hive

Ahmad Alkilani
www.pluralsight.com
Outline

 Why Hive? Motivation

 Hive’s Architecture

 Hive Principles – Schema on Read

 Hive Principles – The Hive Warehouse

 HiveQL – SELECT, Sub queries, UNION ALL, CREATE DATABASE, CREATE TABLE

 Demo – Working with Hive and loading data


Hive Motivation

 Opens up Big Data to the masses

 Provides a SQL-like query language and interfaces

 Builds on Hadoop core using MapReduce for execution

 Originally started at Facebook


 MapReduce development is time consuming
 Requires intimate knowledge of the framework
 Limited resources with required expertise
 No schema to help understand data in HDFS
Hive Architecture

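[Architecture diagram, top to bottom]
 Client interfaces: Hive CLI, Hive Web UI, HDInsight, and JDBC/ODBC through the Thrift Server
 HiveQL: the query language
 Driver: query processing, compiling, optimizing
 Metastore: metadata about Hive objects
 Execution: MapReduce
 Storage: HDFS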
Hive Principles – Schema on Read

 Imposes no Hive-specific format


 Uses Serializers/Deserializers to read and write data

[Diagram: XML, JSON, log files and plain text stored in HDFS; Hive reads them as the structure a table defines]
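Schema on read can be illustrated with two table definitions over the same files. This is a minimal sketch, not taken from the slides: the path /data/weblogs, both table names, and all column names are hypothetical, and the files are assumed to be tab-separated text.

```sql
-- Hypothetical example: /data/weblogs and all names below are assumptions.

-- View 1: no structure imposed. With the default delimiter (Ctrl-A),
-- a tab-separated line lands entirely in the single STRING column.
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs_raw (
  line STRING
)
LOCATION '/data/weblogs';

-- View 2: the same files, now read as three tab-separated columns.
-- Nothing is rewritten on disk; only the table definition differs.
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs_parsed (
  request_time STRING,
  client_ip    STRING,
  status_code  INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/data/weblogs';

-- The structure is applied at query time. A field that cannot be
-- converted to the declared type (for example a non-numeric status_code)
-- comes back as NULL instead of failing the load.
SELECT client_ip, status_code
FROM weblogs_parsed
LIMIT 10;
```

Because the schema is applied only when data is read, a wrong table definition shows up as NULLs in query results rather than as an error when the files are put in place.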
Hive Principles – The Hive Warehouse

Hive warehouse
 Metadata about all the objects known to Hive, persisted in the metastore
 Consists of
 Databases (in the diagram: Database_A, Database_B)
 Tables
 Partitions (in the diagram: 2012)
 Buckets/Clusters

 Local Hive warehouse
 Managed by Hive
 Typically under /hive/warehouse
 Dropping a table drops the data as well as the metadata.
 External Tables
 Hive manages the metadata only
 Data can live anywhere on the Hadoop file system
 Dropping a table in Hive removes only the table’s definition; the data remains untouched.
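The difference between the two kinds of table shows most clearly on DROP TABLE. The following is a minimal sketch; the table names, columns, and the /data/events path are hypothetical.

```sql
-- Hypothetical names and path, for illustration only.

-- Managed table: Hive creates a directory for it inside the warehouse
-- and owns both the metadata and the files.
CREATE TABLE IF NOT EXISTS managed_events (
  event_id INT,
  payload  STRING
);

-- External table: Hive records only the definition; the files under
-- /data/events belong to whoever put them there.
CREATE EXTERNAL TABLE IF NOT EXISTS external_events (
  event_id INT,
  payload  STRING
)
LOCATION '/data/events';

-- Removes the definition and the table's data in the warehouse.
DROP TABLE IF EXISTS managed_events;

-- Removes only the definition; the files under /data/events stay put.
DROP TABLE IF EXISTS external_events;
```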
Basic commands using HiveQL

Hive Basics
The SELECT statement
    SELECT exp1, exp2, exp3
    FROM some_table
    WHERE where_condition
    LIMIT number_of_records;

• DISTINCT Clause
    SELECT DISTINCT col1, col2, col3 FROM some_table;
• Aliasing
    SELECT col1 + col2 AS col3 FROM some_table;
• REGEX Column Specification (written in backticks; this pattern selects every column except ID and Name)
    SELECT `(ID|Name)?+.+` FROM some_table;
• Interchangeable constructs (FROM may come before SELECT)
    FROM some_table
    SELECT exp1, exp2, exp3
    WHERE where_condition;
• HiveQL keywords and identifiers are not case sensitive
• Semicolon to terminate statements
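The constructs above can be tried on a concrete table. This sketch assumes a hypothetical employees table with columns name, department, salary, and bonus; none of these columns come from the slides.

```sql
-- Hypothetical table: employees(name, department, salary, bonus).

-- From Hive 0.13 on, backticks are treated as quoted identifiers by
-- default; this setting restores the regex column specification.
SET hive.support.quoted.identifiers=none;

-- DISTINCT: one row per department.
SELECT DISTINCT department FROM employees;

-- Aliasing an expression, with a filter and a row limit.
SELECT salary + bonus AS total_pay
FROM employees
WHERE salary > 50000
LIMIT 10;

-- Regex column specification: every column except salary and bonus.
SELECT `(salary|bonus)?+.+` FROM employees;

-- FROM-first form; equivalent to SELECT ... FROM ... WHERE ...
FROM employees
SELECT name, department
WHERE department = 'Finance';
```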
Sub queries & Union

Sub query in the FROM clause (the inner query takes no semicolon and must be given an alias):
    SELECT subq.mycol
    FROM (
        SELECT col_a + col_b AS mycol
        FROM some_table
    ) subq;

UNION ALL:
    SELECT col_a + col_b AS mycol
    FROM some_table
    UNION ALL
    SELECT col_y AS mycol
    FROM another_table;

UNION ALL inside a sub query, joined to another table:
    SELECT t3.mycol
    FROM (
        SELECT col_a + col_b AS mycol
        FROM some_table
        UNION ALL
        SELECT col_y AS mycol
        FROM another_table
    ) t3
    JOIN t4 ON (t4.col_x = t3.mycol);
Create Database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT some_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, ...)];
USE db_name;
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name;

[Diagram: in HDFS, /hive/warehouse holds marketing.db and finance.db; humanresources.db sits under /somewhere/on/hdfs]
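As a rough illustration of the syntax, the statements below create one database in the default warehouse location and one at an explicit path. The database names follow the diagram; the comments, the property, and the exact LOCATION value are assumptions.

```sql
-- No LOCATION: Hive creates marketing.db under the warehouse directory.
CREATE DATABASE IF NOT EXISTS marketing
COMMENT 'Marketing data';

-- Explicit LOCATION: the database directory lives outside the warehouse.
-- The full path and the 'owner' property are assumed for illustration.
CREATE DATABASE IF NOT EXISTS humanresources
COMMENT 'HR data kept outside the warehouse'
LOCATION '/somewhere/on/hdfs/humanresources.db'
WITH DBPROPERTIES ('owner' = 'hr_team');

SHOW DATABASES;

-- Make humanresources the current database for later statements.
USE humanresources;

-- Fails if the database still contains tables, unless CASCADE is added.
DROP DATABASE IF EXISTS marketing;
```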
Create Table
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
[(col_name data_type [COMMENT col_comment], ...)]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
[ROW FORMAT row_format] [STORED AS file_format]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)];

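[Diagram: HDFS layout. /hive/warehouse holds advertising and finance.db, with tables sales and deals and partition directories ctry=USA and ctry=UAE. /somewhere/on/hdfs holds humanresources.db with the employees table. The external table my_ext_table is drawn alongside the data directories /mydata/2013/07/21, /mydata/2013/07/26 and /mydata/2012/03/19.]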
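The layout in the diagram could be produced by statements along the following lines. This is one reading of the picture, not the author's script: the column names, the choice of sales as the partitioned table, and the dt partition column on my_ext_table are all assumptions.

```sql
CREATE DATABASE IF NOT EXISTS finance;
USE finance;

-- Managed, partitioned table. Each ctry value gets its own
-- sub-directory (ctry=USA, ctry=UAE) under the table's directory.
CREATE TABLE IF NOT EXISTS sales (
  sale_id INT    COMMENT 'Hypothetical key column',
  amount  DOUBLE COMMENT 'Hypothetical measure'
)
PARTITIONED BY (ctry STRING COMMENT 'Country code')
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

ALTER TABLE sales ADD IF NOT EXISTS PARTITION (ctry = 'USA');
ALTER TABLE sales ADD IF NOT EXISTS PARTITION (ctry = 'UAE');

-- External, partitioned table whose partitions point at directories
-- that already exist in HDFS (assumed reading of the /mydata paths).
CREATE EXTERNAL TABLE IF NOT EXISTS my_ext_table (
  line STRING
)
PARTITIONED BY (dt STRING)
LOCATION '/mydata';

ALTER TABLE my_ext_table ADD IF NOT EXISTS
  PARTITION (dt = '2013-07-21') LOCATION '/mydata/2013/07/21';

-- Filtering on the partition column lets Hive read only ctry=USA.
SELECT SUM(amount) FROM sales WHERE ctry = 'USA';
```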
Working with Hive

Demo
Demo Recap

 Pluralsight database
 Hive creates pluralsight.db directory

 Created movies as a Hive-managed table

 Placed u.info in the movies table; Hive doesn’t complain at load time, but queries return NULLs
 Placed the correct u.item data in the movies table

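 LOAD DATA [LOCAL] INPATH [path] INTO TABLE [table]

 Moves the data into the table’s directory if the source is HDFS
 Copies the data if the source is LOCAL
 Syntax for a local source: LOAD DATA LOCAL INPATH [path] INTO TABLE [table]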

 Consider using EXTERNAL tables if data is already in HDFS
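The demo steps can be approximated as follows. The slides do not give the table schema or file paths, so this sketch assumes u.item is pipe-delimited with an id, a title, and a release date as its first three fields, and uses hypothetical paths.

```sql
CREATE DATABASE IF NOT EXISTS pluralsight;   -- creates pluralsight.db
USE pluralsight;

-- Assumed schema: only the first three fields of each line are mapped;
-- any further fields in the file are ignored when reading.
CREATE TABLE IF NOT EXISTS movies (
  movie_id     INT,
  title        STRING,
  release_date STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

-- LOCAL source: the file is copied from the local file system
-- (hypothetical path) into the table's warehouse directory.
LOAD DATA LOCAL INPATH '/tmp/ml-100k/u.item' INTO TABLE movies;

-- HDFS source: without LOCAL the file would be moved, not copied.
-- LOAD DATA INPATH '/user/demo/u.item' INTO TABLE movies;

-- Loading a file with the wrong layout (such as u.info) succeeds too,
-- but the mismatched columns read back as NULL here.
SELECT movie_id, title FROM movies LIMIT 5;
```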


Summary

 Hive as an important player in the Big Data community

 Hive warehouse and schema on read concepts

 HiveQL
 SELECT, UNION ALL, Sub Queries, DISTINCT, Aliasing
 Create database
 External and Hive managed tables
 Loading data into the Hive warehouse
 Truncate or overwrite
 Different methods for creating tables
 CTAS
 LIKE
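The last few summary items can be sketched briefly. The statements assume the movies table from the demo with hypothetical columns movie_id and title; the new table names and the file path are also assumptions.

```sql
USE pluralsight;

-- CTAS: create a table from the result of a query.
CREATE TABLE movies_titled AS
SELECT movie_id, title
FROM movies
WHERE title IS NOT NULL;

-- LIKE: copy the definition of an existing table, without its data.
CREATE TABLE movies_staging LIKE movies;

-- Overwrite on load: replaces whatever the table currently holds.
LOAD DATA LOCAL INPATH '/tmp/ml-100k/u.item'
OVERWRITE INTO TABLE movies_staging;

-- Overwrite from a query.
INSERT OVERWRITE TABLE movies_staging
SELECT * FROM movies WHERE title IS NOT NULL;

-- Truncate: remove all rows but keep the table definition.
TRUNCATE TABLE movies_staging;
```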
