BDA Unit-5-PPT

Apache Hive is a data warehouse infrastructure tool designed for processing structured data in Hadoop, facilitating easy querying and analysis of big data. It uses HiveQL for querying and is structured for OLAP rather than OLTP, with features like schema storage in a database and integration with HDFS. Hive supports various data types, table operations, partitioning, and built-in operators, making it a comprehensive solution for managing large datasets.


APACHE HIVE

What is Hive?
 Hive is a data warehouse infrastructure tool to process structured data in
Hadoop.
 It is used to summarize big data, and it makes querying and analysis easy.
 Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive.
 Hive is not
 A relational database
 A design for OnLine Transaction Processing (OLTP)
 A language for real-time queries and row-level updates
Features of Hive
 It stores the schema in a database and the processed data in HDFS.
 It is designed for OLAP.
 It provides SQL type language for querying called HiveQL or HQL.
 It is familiar, fast, scalable, and extensible.
Architecture of Hive
Unit Name              Operation
User Interface         Creates the interaction between the user and HDFS.
Meta Store             Hive chooses the respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine  Used for querying on the schema information in the Metastore. It avoids writing a MapReduce program in Java; instead, we write a query for the MapReduce job and process it.
Execution Engine       Processes the query and generates results the same as MapReduce results.
HDFS or HBASE          The data storage techniques used to store data into the file system.
Working of Hive - workflow between Hive and Hadoop
Step No.  Name           Operation
1         Execute Query  The Hive interface sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2         Get Plan       The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3         Get Metadata   The compiler sends a metadata request to the Metastore (any database).
4         Send Metadata  The Metastore sends the metadata as a response to the compiler.
5         Send Plan      The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6         Execute Plan   The driver sends the execute plan to the execution engine.
7         Execute Job    Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which assigns it to a TaskTracker. Here, the query executes the MapReduce job.
7.1       Metadata Ops   Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8         Fetch Result   The execution engine receives the results from the data nodes.
9         Send Results   The execution engine sends those resultant values to the driver.
10        Send Results   The driver sends the results to the Hive interfaces.
HIVE DATA TYPES
 All of these data types are used in table creation.
 They are classified into four groups: Column types, Literals, Null values, and Complex types.
 Column Types: Integral types, Strings, Timestamp, Dates, Decimals.
Type        Postfix / Format                        Example
TINYINT     Y                                       10Y
SMALLINT    S                                       10S
INT         –                                       10
BIGINT      L                                       10L
VARCHAR     Variable length: 1 to 65535             'rama' or "rama"
CHAR        Fixed length: up to 255                 –
TIMESTAMP   Format: yyyy-mm-dd hh:mm:ss.fffffffff   –
DATE        Format: YYYY-MM-DD                      –
DECIMAL     DECIMAL(precision, scale)               decimal(10,0)
 Union Types
 Union is a collection of heterogeneous data types.
 UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
 Ex:
 {0:1}
 {1:2.0}
 {2:["three","four"]}
 {3:{"a":5,"b":"five"}}
 {2:["six","seven"]}
 {3:{"a":8,"b":"eight"}}
 {0:9}
 {1:10.0}
 FLOAT , DOUBLE – for storing floating point numbers.
 Missing values are represented by the special value NULL.
 Complex Types
 Arrays
 Arrays in Hive are used the same way they are used in Java.
 Syntax: ARRAY<data_type>
 Maps
 Maps in Hive are similar to Java Maps.
 Syntax: MAP<primitive_type, data_type>
 Structs
 Structs in Hive group related fields into a single column; each field can carry an optional comment.
 Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
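 As an illustrative sketch (not taken from the slides), a table combining these complex types could be declared as follows; the table name complex_demo and its columns are hypothetical, and full SerDe support for UNIONTYPE columns is limited:
hive> CREATE TABLE complex_demo (
        tags    ARRAY<STRING>,
        scores  MAP<STRING, INT>,
        address STRUCT<street:STRING, city:STRING, zip:INT>,
        misc    UNIONTYPE<INT, DOUBLE, ARRAY<STRING>>
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':';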
CREATE DATABASE and DROP
 Can define databases and tables to analyse structured data.
 Hive contains a default database named default.
 A database in Hive is a namespace or a collection of tables.
 IF NOT EXISTS is an optional clause; when it is used, Hive does not raise an error if a database with the same name already exists.
 Drop Database is a statement that drops all the tables and deletes the
database.
 DROP (DATABASE|SCHEMA) [IF EXISTS] database_name
[RESTRICT|CASCADE];
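 A minimal sketch of both statements, using a hypothetical database named userdb:
hive> CREATE DATABASE IF NOT EXISTS userdb;
hive> SHOW DATABASES;
hive> DROP DATABASE IF EXISTS userdb CASCADE;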
Table Operations
 Create Table is a statement used to create a table in Hive.
 Let us assume you need to create a table named employee using the CREATE TABLE statement.
 Each field of the employee table is declared along with its data type.
 The definition also includes a comment and row-format details such as the field terminator, the line terminator, and the stored file type; a hedged sketch follows.
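 A hedged sketch of such a statement; the field names and types (eid, name, salary, designation) are assumptions, not taken from the slides:
hive> CREATE TABLE IF NOT EXISTS employee (
        eid INT,
        name STRING,
        salary STRING,
        designation STRING
      )
      COMMENT 'Employee details'
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
      LINES TERMINATED BY '\n'
      STORED AS TEXTFILE;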
 Load Data Statement
 We can insert data into a table using the LOAD DATA statement.
 There are two ways to load data: from the local file system or from the Hadoop file system (HDFS).
 We will insert the following data into the table: a text file named sample.txt in the /home/user directory.
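 A sketch of loading that file from the local file system (drop the LOCAL keyword to load from HDFS instead):
hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;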
 ALTER TABLE :
 Here we can alter the attributes of a table such as changing its table name,
changing column names, adding columns, and deleting or replacing columns.
 This statement takes any of the following syntaxes based on what attributes we
wish to modify in a table.
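 The common ALTER TABLE forms, sketched with placeholder names:
hive> ALTER TABLE table_name RENAME TO new_table_name;
hive> ALTER TABLE table_name ADD COLUMNS (col_name data_type [COMMENT col_comment], ...);
hive> ALTER TABLE table_name CHANGE old_col_name new_col_name new_data_type;
hive> ALTER TABLE table_name REPLACE COLUMNS (col_name data_type, ...);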
 Change Statement:
 The CHANGE clause is used to rename a column of the employee table or to change its data type (or both), as in the sketch below.
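 For example, assuming the employee table sketched earlier, a CHANGE statement can rename a column or change its type (the specific columns here are assumptions):
hive> ALTER TABLE employee CHANGE name ename STRING;
hive> ALTER TABLE employee CHANGE salary salary DOUBLE;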
 Add Columns Statement

 hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');

 Replace Statement
 The following query deletes all the columns from the employee table and replaces them with emp and name columns:
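 A sketch of such a statement, with the new column types assumed:
hive> ALTER TABLE employee REPLACE COLUMNS (emp INT, name STRING);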
 Drop Table Statement
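 A minimal example, assuming the employee table used throughout:
hive> DROP TABLE IF EXISTS employee;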
PARTITIONING
 It is a way of dividing a table into related parts based on the values of
partitioned columns such as date, city, and department.
 Using partition, it is easy to query a portion of the data.
 Tables or partitions are sub-divided into buckets to provide extra structure to the data, which may be used for more efficient querying.
 Bucketing works based on the value of a hash function applied to some column of the table.
 We can add partitions to a table by altering the table.
 Renaming a Partition

 Dropping a Partition
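 Hedged examples of adding, renaming, and dropping a partition, assuming employee was created with PARTITIONED BY (year STRING); the values and location are placeholders:
hive> ALTER TABLE employee ADD PARTITION (year='2012') LOCATION '/user/hive/warehouse/employee/year=2012';
hive> ALTER TABLE employee PARTITION (year='2012') RENAME TO PARTITION (year='2013');
hive> ALTER TABLE employee DROP IF EXISTS PARTITION (year='2013');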
BUILT-IN OPERATORS
 There are four types of operators in Hive:
 1. Relational Operators
 2. Arithmetic Operators
 3. Logical Operators
 4. Complex Operators
 Relational Operators
 A = B, A != B, A < B, A > B, A <= B, A >= B – all primitive types
 A IS NULL, A IS NOT NULL – all types
 A LIKE B, A RLIKE B, A REGEXP B -- Strings
 Arithmetic Operators:
 A + B, A - B, A * B, A / B, A % B – binary operators on all number types
 A & B, A | B, A ^ B – bitwise AND, OR, and XOR on all number types
 ~A – unary bitwise NOT

 Logical Operators: (BOOLEAN OPERANDS)


 AND - &&
 OR - ||
 NOT - !
 Complex Operators
 These provide expressions to access the elements of complex types: A[n] for array elements, M[key] for map values, and S.x for struct fields, as in the sketch below.
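 An illustrative query combining these operator groups; the employee columns and the complex_demo table are the hypothetical ones sketched earlier:
hive> SELECT name, salary, dept
      FROM employee
      WHERE salary >= 40000 AND dept IS NOT NULL;
hive> SELECT tags[0], scores['math'], address.city FROM complex_demo;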
VIEWS AND INDEXES
 Views are generated based on user requirements.
 You can save any result set data as a view.
 You can create a view at the time of executing a SELECT statement.
 Creating an Index
 An Index is nothing but a pointer on a particular column of a table.
 Creating an index means creating a pointer on a particular column of a table.
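 Sketches of both statements against the assumed employee table; the index handler shown is Hive's built-in compact handler (note that indexes were removed in Hive 3.x):
hive> CREATE VIEW emp_30000 AS
      SELECT * FROM employee WHERE salary > 30000;
hive> CREATE INDEX index_salary ON TABLE employee(salary)
      AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
      WITH DEFERRED REBUILD;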
HIVEQL SELECT…WHERE
 The Hive Query Language (HiveQL) is a query language for Hive to process
and analyze structured data in a Metastore.

 SELECT statement is used to retrieve the data from a table.


 The WHERE clause works like a condition: it filters the data using the condition and returns only the matching rows.
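 A typical example, assuming the employee table and its salary column:
hive> SELECT * FROM employee WHERE salary > 30000;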
 JOINS:
 It is a clause that is used for combining specific fields from two tables by using
values common to each one.
 It is used to combine records from two or more tables in the database.
Example tables: CUSTOMERS and ORDERS.

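 A hedged join over the two example tables; the column names (id, name, amount, order_date, customer_id) are assumptions:
hive> SELECT c.id, c.name, o.amount, o.order_date
      FROM customers c JOIN orders o
      ON (c.id = o.customer_id);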