Hive Query Language (HQL)
Example 
• Here's one way of populating a single-row table: 
% echo 'X' > /tmp/dummy.txt 
% hive -e "CREATE TABLE dummy (value STRING);  
LOAD DATA LOCAL INPATH '/tmp/dummy.txt'  
OVERWRITE INTO TABLE dummy" 
• Use the 'hive' command to start the Hive shell 
• SHOW TABLES; 
• SELECT * FROM dummy; 
• DROP TABLE dummy; 
• SHOW TABLES;
Example 
• Let’s see how to use Hive to run a query on the weather dataset 
• The first step is to load the data into Hive’s managed storage. 
• Here we’ll have Hive use the local file-system for storage; later we’ll see how 
to store tables in HDFS. 
• Just like an RDBMS, Hive organizes its data into tables. 
• We create a table to hold the weather data using the CREATE TABLE 
statement: 
CREATE TABLE records (year STRING, temperature INT, quality INT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'; 
• The first line declares a records table with three columns: year, temperature, 
and quality. 
• The type of each column must be specified, too: here the year is a string, 
while the other two columns are integers. 
• So far, the SQL is familiar. The ROW FORMAT clause, however, is particular to 
HiveQL. 
• It is saying that each row in the data file is tab-delimited text.
Example 
• Next we can populate Hive with the data. This is just a small sample, for 
exploratory purposes: 
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt' 
OVERWRITE INTO TABLE records; 
• Running this command tells Hive to put the specified local file in its 
warehouse directory. 
• This is a simple filesystem operation. 
• There is no attempt, for example, to parse the file and store it in an internal 
database format, since Hive does not mandate any particular file format. 
Files are stored verbatim: they are not modified by Hive. 
• Tables are stored as directories under Hive’s warehouse directory, which is 
controlled by the hive.metastore.warehouse.dir property and defaults to 
/user/hive/warehouse. 
• The OVERWRITE keyword in the LOAD DATA statement tells Hive to delete 
any existing files in the directory for the table. If it is omitted, then the new 
files are simply added to the table’s directory (unless they have the same 
names, in which case they replace the old files).
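• Since the file is stored verbatim, you can inspect the table's directory from the 
Hive CLI; a minimal sketch, assuming the default warehouse location: 
-- assumes hive.metastore.warehouse.dir is left at its default 
dfs -ls /user/hive/warehouse/records;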
Example 
• Query: 
SELECT year, MAX(temperature) 
FROM records 
WHERE temperature != 9999 
AND (quality = 0 OR quality = 1 OR quality = 4 OR quality = 5 OR quality = 9) 
GROUP BY year; 
• Hive is configured using an XML configuration file like Hadoop’s. The file is 
called hive-site.xml and is located in Hive’s conf directory.
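• Besides editing hive-site.xml, individual properties can be inspected or 
overridden per session from the Hive shell with SET; a minimal sketch (the 
value shown is simply the default): 
-- print the current value of a property 
SET hive.metastore.warehouse.dir; 
-- override it for this session only 
SET hive.metastore.warehouse.dir=/user/hive/warehouse;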
Hive Services 
• cli 
– The command line interface to Hive (the shell). This is the default service. 
• hiveserver 
– Runs Hive as a server exposing a Thrift service, enabling access from a range of 
clients written in different languages. Applications using the Thrift, JDBC, and 
ODBC connectors need to run a Hive server to communicate with Hive. Set the 
HIVE_PORT environment variable to specify the port the server will listen on 
(defaults to 10,000). 
• hwi 
– The Hive Web Interface. 
• jar 
– The Hive equivalent to hadoop jar, a convenient way to run Java applications that 
includes both Hadoop and Hive classes on the classpath. 
• metastore 
– By default, the metastore is run in the same process as the Hive service. Using 
this service, it is possible to run the metastore as a standalone (remote) process. 
Set the METASTORE_PORT environment variable to specify the port the server 
will listen on.
Hive architecture
Clients 
• Thrift Client 
– The Hive Thrift Client makes it easy to run Hive commands from a wide range of 
programming languages. Thrift bindings for Hive are available for C++, Java, PHP, 
Python, and Ruby. They can be found in the src/service/src subdirectory in the 
Hive distribution. 
• JDBC Driver 
– Hive provides a Type 4 (pure Java) JDBC driver, defined in the class 
org.apache.hadoop.hive.jdbc.HiveDriver. When configured with a JDBC URI of 
the form jdbc:hive://host:port/dbname, a Java application will connect to a Hive 
server running in a separate process at the given host and port. (The driver 
makes calls to an interface implemented by the Hive Thrift Client using the Java 
Thrift bindings.) You may alternatively choose to connect to Hive via JDBC in 
embedded mode using the URI jdbc:hive://. In this mode, Hive runs in the same 
JVM as the application invoking it, so there is no need to launch it as a 
standalone server since it does not use the Thrift service or the Hive Thrift Client. 
• ODBC Driver 
– The Hive ODBC Driver allows applications that support the ODBC protocol to 
connect to Hive. (Like the JDBC driver, the ODBC driver uses Thrift to 
communicate with the Hive server.) The ODBC driver is still in development, so 
you should refer to the latest instructions on the Hive wiki for how to build and 
run it.
The Metastore (Embedded) 
• The metastore is the central repository of Hive metadata. The metastore is 
divided into two pieces: 
– a service and 
– the backing store for the data 
• By default, the metastore service runs in the same JVM as the Hive service 
and contains an embedded Derby database instance backed by the local 
disk. 
• This is called the embedded metastore configuration. 
• Using an embedded metastore is a simple way to get started with Hive; 
however, only one embedded Derby database can access the database files 
on disk at any one time, which means you can only have one Hive session 
open at a time that shares the same metastore.
Local Metastore 
• The solution to supporting multiple sessions (and therefore multiple users) is 
to use a standalone database. 
• This configuration is referred to as a local metastore, since the metastore 
service still runs in the same process as the Hive service, but connects to a 
database running in a separate process, either on the same machine or on a 
remote machine. 
• Any JDBC-compliant database may be used. 
• MySQL is a popular choice for the standalone metastore. 
• In this case, javax.jdo.option.ConnectionURL is set to 
jdbc:mysql://host/dbname?createDatabaseIfNotExist=true, and 
javax.jdo.option.ConnectionDriverName is set to com.mysql.jdbc.Driver. (The 
user name and password should be set, too, of course.)
Remote Metastore 
• Going a step further, there’s another metastore configuration called a 
remote metastore, where one or more metastore servers run in separate 
processes to the Hive service. 
• This brings better manageability and security, since the database tier can be 
completely firewalled off, and the clients no longer need the database 
credentials.
Updates, transactions, and indexes 
• Updates, transactions, and indexes are mainstays of traditional 
databases. 
• Yet, until recently, these features have not been considered a 
part of Hive’s feature set. 
• This is because Hive was built to operate over HDFS data using 
MapReduce, where full-table scans are the norm and a table 
update is achieved by transforming the data into a new table. 
• For a data warehousing application that runs over large portions 
of the dataset, this works well.
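• A minimal sketch of that transform-into-a-new-table pattern, using a 
hypothetical records_clean table (not from the original deck): 
-- illustrative only: an 'update' done by rewriting the data into another table 
CREATE TABLE records_clean (year STRING, temperature INT, quality INT); 
INSERT OVERWRITE TABLE records_clean 
SELECT year, temperature, quality 
FROM records 
WHERE temperature != 9999;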
HiveQL or HQL 
• Hive’s SQL dialect, called HiveQL, does not support the full SQL-92 
specification.
Tables 
• A Hive table is logically made up of the data being stored and the associated 
metadata describing the layout of the data in the table. 
• The data typically resides in HDFS, although it may reside in any Hadoop 
filesystem, including the local filesystem or S3. 
• Hive stores the metadata in a relational database known as the metastore, 
not in HDFS. 
• Many relational databases have a facility for multiple namespaces, which 
allow users and applications to be segregated into different databases or 
schemas. 
• Hive supports the same facility, and provides commands such as CREATE 
DATABASE dbname, USE dbname, and DROP DATABASE dbname. 
• You can fully qualify a table by writing dbname.tablename. If no database is 
specified, tables belong to the default database.
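• For example (the database name weather_db is hypothetical): 
CREATE DATABASE weather_db; 
USE weather_db; 
-- or, assuming a records table exists in that database, qualify it explicitly 
SELECT * FROM weather_db.records;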
Managed Tables and External Tables 
• When you create a table in Hive, by default Hive will manage the data, which 
means that Hive moves the data into its warehouse directory. Alternatively, 
you may create an external table, which tells Hive to refer to the data that is 
at an existing location outside the warehouse directory. 
• The difference between the two types of table is seen in the LOAD and DROP 
semantics. 
• When you load data into a managed table, it is moved into Hive’s warehouse 
directory. 
• CREATE EXTERNAL TABLE 
• With the EXTERNAL keyword, Hive knows that it is not managing the data, so 
it doesn't move it to its warehouse directory (a sketch follows at the end of 
this slide). 
• Hive organizes tables into partitions, a way of dividing a table into coarse-grained 
parts based on the value of a partition column, such as date. Using 
partitions can make it faster to do queries on slices of the data. 
• CREATE TABLE logs (ts BIGINT, line STRING) 
PARTITIONED BY (dt STRING, country STRING); 
• LOAD DATA LOCAL INPATH 'input/hive/partitions/file1' 
INTO TABLE logs 
PARTITION (dt='2001-01-01', country='GB'); 
• SHOW PARTITIONS logs;
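• As noted above, an external table leaves the data where it is; a minimal 
sketch (the LOCATION path is hypothetical): 
CREATE EXTERNAL TABLE external_records (year STRING, temperature INT, quality INT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t' 
LOCATION '/user/training/external_records'; 
-- DROP TABLE external_records would delete only the metadata, not the files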
Buckets 
• There are two reasons why you might want to organize your 
tables (or partitions) into buckets. 
• The first is to enable more efficient queries. 
• Bucketing imposes extra structure on the table, which Hive can 
take advantage of when performing certain queries. 
• In particular, a join of two tables that are bucketed on the same 
columns—which include the join columns—can be efficiently 
implemented as a map-side join.
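• A sketch of such a join, assuming a second table bucketed_orders that is also 
clustered by id (bucketed_users is defined on the next slide; the property 
below enables the bucketed map-side join optimization): 
SET hive.optimize.bucketmapjoin=true; 
SELECT u.id, u.name 
FROM bucketed_users u 
JOIN bucketed_orders o ON (u.id = o.id);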
Buckets 
• The second reason to bucket a table is to make sampling more 
efficient. 
• When working with large datasets, it is very convenient to try 
out queries on a fraction of your dataset while you are in the 
process of developing or refining them. 
• CREATE TABLE bucketed_users (id INT, name STRING) 
CLUSTERED BY (id) INTO 4 BUCKETS; 
• Any particular bucket will effectively have a random set of users 
in it. 
• The data within a bucket may additionally be sorted by one or 
more columns. This allows even more efficient map-side joins, 
since the join of each bucket becomes an efficient merge-sort. 
• CREATE TABLE bucketed_users (id INT, name STRING) 
CLUSTERED BY (id) SORTED BY (id ASC) INTO 4 BUCKETS;
Buckets 
• Physically, each bucket is just a file in the table (or partition) 
directory. 
• In fact, buckets correspond to MapReduce output file partitions: 
a job will produce as many buckets (output files) as reduce 
tasks. 
• We can do a sampling of the table using the TABLESAMPLE 
clause, which restricts the query to a fraction of the buckets in 
the table rather than the whole table: 
• SELECT * FROM bucketed_users 
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id); 
• Bucket numbering is 1-based, so this query retrieves all the 
users from the first of four buckets. 
• For a large, evenly distributed dataset, approximately one 
quarter of the table’s rows would be returned.
Storage Formats 
• There are two dimensions that 
govern table storage in Hive: the 
row format and the file format. 
• The row format dictates how 
rows, and the fields in a particular 
row, are stored. 
• In Hive parlance, the row format 
is defined by a SerDe. 
• When you create a table with no 
ROW FORMAT or STORED AS 
clauses, the default format is 
delimited text, with a row per 
line. 
• You can use sequence files in Hive 
by using the declaration STORED 
AS SEQUENCEFILE in the CREATE 
TABLE statement. 
• Hive provides another binary 
storage format called RCFile, short 
for Record Columnar File. RCFiles 
are similar to sequence files, 
except that they store data in a 
column-oriented fashion.
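• For example, the file format is declared when the table is created; a minimal 
sketch with illustrative table names: 
CREATE TABLE records_seq (year STRING, temperature INT, quality INT) 
STORED AS SEQUENCEFILE; 
CREATE TABLE records_rc (year STRING, temperature INT, quality INT) 
STORED AS RCFILE;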
Let's do some hands-on 
• Look at the data in /input/tsv/, and choose any one of the TSV files 
• Understand the column types and names 
• drop table X; 
drop table bucketed_X; 
• CREATE TABLE X (id STRING, name STRING) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'; 
• LOAD DATA LOCAL INPATH '/input/tsv/X.tsv' 
OVERWRITE INTO TABLE X; 
• SELECT * FROM X; 
• Write a query of your own, for example (x, y, and 'ABC' are placeholders): 
SELECT x 
FROM X 
WHERE y = 'ABC' 
GROUP BY x; 
• dfs -cat /user/hive/warehouse/X/X.tsv; 
• CREATE TABLE bucketed_X (id STRING, name STRING) 
CLUSTERED BY (id) INTO 4 BUCKETS; 
• Or, with sorted buckets: 
CREATE TABLE bucketed_X (id STRING, name STRING) 
CLUSTERED BY (id) SORTED BY (id) INTO 4 BUCKETS; 
• SELECT * FROM X;
Let's do some hands-on 
• SET hive.enforce.bucketing=true; 
• INSERT OVERWRITE TABLE bucketed_X 
SELECT * FROM X; 
• dfs -ls /user/hive/warehouse/bucketed_X; 
• SELECT * FROM bucketed_X 
TABLESAMPLE(BUCKET 1 OUT OF 4 ON id); 
• SELECT * FROM bucketed_X 
TABLESAMPLE(BUCKET 1 OUT OF 2 ON id); 
• SELECT * FROM X 
TABLESAMPLE(BUCKET 1 OUT OF 4 ON rand());
Querying Data 
• Sorting data in Hive can be achieved by use of a standard ORDER BY clause, 
but there is a catch. 
• ORDER BY produces a result that is totally sorted, as expected, but to do so it 
sets the number of reducers to one, making it very inefficient for large 
datasets. 
• When a globally sorted result is not required—and in many cases it isn’t— 
then you can use Hive’s nonstandard extension, SORT BY instead. 
• SORT BY produces a sorted file per reducer. 
• In some cases, you want to control which reducer a particular row goes to, 
typically so you can perform some subsequent aggregation. This is what 
Hive’s DISTRIBUTE BY clause does. 
• FROM records 
SELECT year, temperature 
DISTRIBUTE BY year 
SORT BY year ASC, temperature DESC;
MapReduce Scripts 
• Using an approach like Hadoop Streaming, the TRANSFORM, MAP, and 
REDUCE clauses make it possible to invoke an external script or program 
from Hive. 
is_good_quality.py: 
#!/usr/bin/env python 
import re 
import sys 

for line in sys.stdin: 
    (year, temp, q) = line.strip().split() 
    if (temp != "9999" and re.match("[01459]", q)): 
        print "%s\t%s" % (year, temp) 

max_temperature_reduce.py: 
#!/usr/bin/env python 
import sys 

(last_key, max_val) = (None, 0) 
for line in sys.stdin: 
    (key, val) = line.strip().split("\t") 
    if last_key and last_key != key: 
        print "%s\t%s" % (last_key, max_val) 
        (last_key, max_val) = (key, int(val)) 
    else: 
        (last_key, max_val) = (key, max(max_val, int(val))) 
if last_key: 
    print "%s\t%s" % (last_key, max_val) 

• ADD FILE Hadoop_training/code/hive/python/is_good_quality.py; 
• ADD FILE Hadoop_training/code/hive/python/max_temperature_reduce.py;
Example 
CREATE TABLE records (year STRING, temperature INT, quality INT) 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t'; 
LOAD DATA LOCAL INPATH 'input/ncdc/micro-tab/sample.txt' 
OVERWRITE INTO TABLE records; 
FROM records 
SELECT TRANSFORM(year, temperature, quality) 
USING 'is_good_quality.py' 
AS year, temperature; 
FROM ( 
FROM records 
MAP year, temperature, quality 
USING 'is_good_quality.py' 
AS year, temperature) map_output 
REDUCE year, temperature 
USING 'max_temperature_reduce.py' 
AS year, temperature;
Joins 
• The simplest kind of join is the inner join, where each match in the input 
tables results in a row in the output. 
• SELECT sales.*, things.* 
FROM sales JOIN things ON (sales.id = things.id); 
• Outer Join: 
• SELECT sales.*, things.* 
FROM sales LEFT OUTER JOIN things ON (sales.id = things.id); 
• SELECT sales.*, things.* 
FROM sales RIGHT OUTER JOIN things ON (sales.id = things.id); 
• SELECT sales.*, things.* 
FROM sales FULL OUTER JOIN things ON (sales.id = things.id);
Joins 
• SELECT * 
FROM things 
WHERE things.id IN (SELECT id FROM sales); 
• This is not possible in Hive, but it can be rewritten as: 
• SELECT * 
FROM things LEFT SEMI JOIN sales ON (sales.id = things.id);
End of session 
Day – 4: Hive Query Language (HQL)
