Hive Data Manipulation

The document provides an overview of common HiveQL commands and functions for data manipulation in Hive, including loading, inserting, partitioning, and exporting data, as well as selecting, filtering, joining, aggregating, and sampling data. Key HiveQL commands and functions covered include LOAD DATA, INSERT, SELECT, WHERE, GROUP BY, JOIN, TABLESAMPLE, and CREATE VIEW.

Uploaded by

pa ott

Hive Data Manipulation

HiveQL
Loading Data
Managed Tables
LOAD DATA LOCAL INPATH '${env:HOME}/employees_data'
OVERWRITE INTO TABLE employees
PARTITION (country = 'US', state = 'TX');
${env:HOME} can be replaced by /home/cloudera in Cloudera
The directory given to INPATH cannot contain any subdirectories
LOAD DATA LOCAL copies local data to HDFS
LOAD DATA (without LOCAL) moves data within HDFS:
both source and destination must be in HDFS
In the above the destination HDFS file is in the directory:
/user/cloudera/hive/warehouse/mydb.db/employees/country=US/state=TX
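The non-LOCAL form described above can be sketched as follows. This is only an illustration: the staging path is a hypothetical HDFS directory, not one named in the slides, and the partition values simply reuse the country=US/state=TX example.

```sql
-- Assumption: /user/cloudera/staging/employees_tx is a hypothetical HDFS
-- directory that already holds the data files for this partition.
-- Without LOCAL, Hive moves (not copies) the files into the table's directory.
-- Without OVERWRITE, existing files in the partition are kept.
LOAD DATA INPATH '/user/cloudera/staging/employees_tx'
INTO TABLE employees
PARTITION (country = 'US', state = 'TX');
```

After this statement the files no longer exist at the staging path, since they were moved rather than copied.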
Inserting Data
INSERT statement
INSERT OVERWRITE TABLE employees
SELECT * FROM employees1 e1
WHERE e1.cnty = 'US' AND e1.st = 'CA';
(Assumes the data is already in another
table called employees1)
The employees1 table is scanned once for EACH separate INSERT statement
Partitioning
FROM employees1 e1
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'CA')
SELECT * WHERE e1.cnty = 'US' AND e1.st = 'CA'
INSERT OVERWRITE TABLE employees
PARTITION (country = 'US', state = 'NY')
SELECT * WHERE e1.cnty = 'US' AND e1.st = 'NY'
INSERT INTO TABLE employees
PARTITION (country = 'US', state = 'NV')
SELECT * WHERE e1.cnty = 'US' AND e1.st = 'NV';
You can mix OVERWRITE and INTO clauses
Dynamic Partition Inserts
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., e1.cnty, e1.st
FROM employees1 e1;
Hive determines the values of the partition keys (country, state) from the last two columns in the SELECT clause
Source column values and output column values are matched by POSITION, not by name
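A fuller sketch of a dynamic partition insert is shown below. The two SET properties are standard Hive settings that are typically needed when every partition key is dynamic. The column list is an assumption: it supposes employees1 carries the same data columns that the slides use for employees (name, salary, subordinates, deductions, address) plus cnty and st.

```sql
-- Allow dynamic partitions, including the case where no key is static.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Assumed column layout for employees1 (hypothetical, for illustration only).
-- The last two SELECT columns feed the partition keys by position:
-- e1.cnty -> country, e1.st -> state.
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT e1.name, e1.salary, e1.subordinates, e1.deductions, e1.address,
       e1.cnty, e1.st
FROM employees1 e1;
```

Because matching is positional, swapping e1.cnty and e1.st in the SELECT list would silently write states into the country key and vice versa.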
Create and Load Table in One Query

CREATE TABLE ca_employees
AS SELECT name, salary, address
FROM employees
WHERE state = 'CA';
- Hive takes the schema from the SELECT clause
- Loads the data with three fields: name, salary and address
- This can also be used to extract subsets from large tables
Exporting Data
How do we get data out of the tables?
hadoop fs -cp source_path target_path
Or
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT ...
This will write the data into the /tmp/ca_employees directory
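As an illustration of the export form above with the SELECT spelled out, the following sketch reuses the columns and filter from the ca_employees example. Treat the chosen columns as an assumption about what one might want to export.

```sql
-- Writes the query result as one or more files under /tmp/ca_employees
-- on the local filesystem, replacing whatever was in that directory.
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE state = 'CA';
```

Dropping the LOCAL keyword would direct the output to an HDFS directory instead.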
Hive - SELECT
SELECT name, salary FROM employees;
SELECT e.name, e.salary FROM employees e;
SELECT name, subordinates[0] FROM employees;  -- element from an ARRAY
SELECT name, deductions["State Taxes"] FROM employees;  -- value from a MAP
SELECT name, address FROM employees;  -- whole STRUCT (address)
SELECT name, deductions FROM employees;  -- whole MAP (deductions)
- The last two queries (address and deductions) print those columns in JSON format
- Use dot notation for struct fields: address.city
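The dot notation for STRUCT fields might look like this in a full query. It assumes the address struct has street and city fields, both of which appear elsewhere in these slides.

```sql
-- Each struct field is returned as an ordinary column
-- rather than as part of a JSON-formatted struct.
SELECT name, address.street, address.city
FROM employees;
```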
Columns
SELECT symbol, `price.*` FROM stocks;
- gets all columns whose names start with price (the backquoted text is a regular expression)
SELECT upper(name), salary * 1.1 FROM employees;
- does column calculations
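One caveat worth sketching: in Hive 0.13 and later, a backquoted name is treated as a quoted identifier by default, so the regular-expression column form generally needs the property below to be set first. The stocks table and its price columns are taken from the slide and are otherwise unspecified.

```sql
-- Make Hive interpret backquoted column specs as regular expressions.
SET hive.support.quoted.identifiers=none;

-- Selects symbol plus every column whose name matches the regex price.*
SELECT symbol, `price.*` FROM stocks;
```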
Arithmetic Operators
Operator Description
+ Add
- Subtract
* Multiply
/ Divide
% Modulo
& Bitwise AND
| Bitwise OR
^ Bitwise XOR
~ Bitwise NOT
Built-in Functions

round(d)            round(d, N)          floor(d)
ceil(d)             ceiling(d)           rand()
rand(seed)          exp(d)               ln(d)
log10(d)            log2(d)              log(base, d)
pow(d, p)           power(d, p)          sqrt(d)
abs(d)              e()                  pi()
count(*)            count(expr)          count(DISTINCT expr)
sum(col)            sum(DISTINCT col)    avg(col)
avg(DISTINCT col)   min(col)             max(col)
count(expr) counts only the rows where expr is not null
There are others; please see the Hive documentation
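A few of the functions listed above can be combined in one query, as in this sketch against the employees table used throughout the slides. The column aliases are illustrative names only.

```sql
-- Aggregates over the whole table; round() trims the average to 2 decimals.
SELECT count(*)               AS num_employees,
       count(DISTINCT salary) AS distinct_salaries,
       round(avg(salary), 2)  AS avg_salary,
       min(salary)            AS min_salary,
       max(salary)            AS max_salary
FROM employees;
```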
Table Generating Functions
explode(array): returns one row for each element in the array
explode(map): (v0.8.0 or later) returns one row for each key-value pair in the map
json_tuple(jsonstr, p1, p2, ..., pn): returns a tuple; jsonstr is the input, p1, p2, ..., pn are the output columns
stack(n, col1, col2, ..., colM): converts M columns into n rows of size M/n
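To show how a table generating function is typically used next to ordinary columns, here is a sketch with LATERAL VIEW and explode over the subordinates array from the earlier SELECT examples. The aliases subs and sub are arbitrary names chosen for illustration.

```sql
-- Produces one output row per (employee, subordinate) pair.
-- Employees with an empty subordinates array produce no rows.
SELECT name, sub
FROM employees
LATERAL VIEW explode(subordinates) subs AS sub;
```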
Other Built-in-Functions
test IN (v1, v2, ..., vn): returns true if test is in the list of values
length(s): length of the string
reverse(s): reverse of the string
concat(s1, s2, ..., sn)
concat_ws(separator, s1, s2, ..., sn)
substr(s, start_index)
substr(s, start, length)
upper(s)
lower(s)
trim(s), ltrim(s), rtrim(s)
regexp_replace(s, regex, repl_str)
to_date(timestamp), year(ts), month(ts), day(ts)
split(s, pattern)
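Several of these string functions can be sketched together on the employees table. The aliases and the 'Ave' to 'Avenue' substitution are made up for illustration; note that substr positions start at 1.

```sql
SELECT concat_ws(', ', upper(name), address.city)      AS label,
       length(name)                                    AS name_length,
       substr(name, 1, 3)                              AS name_prefix,
       regexp_replace(address.street, 'Ave', 'Avenue') AS street
FROM employees;
```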
Others
LIMIT
Nested SELECT
CASE ... WHEN ... THEN
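The three constructs listed here (LIMIT, nested SELECT, and CASE) can appear in a single query, as in the following sketch. The salary thresholds and the salary_band alias are hypothetical values chosen only to make the example concrete.

```sql
-- Inner query labels each employee; outer query filters on the label.
SELECT e.name, e.salary_band
FROM (
  SELECT name,
         CASE
           WHEN salary < 50000.0 THEN 'low'
           WHEN salary < 70000.0 THEN 'middle'
           ELSE 'high'
         END AS salary_band
  FROM employees
) e
WHERE e.salary_band <> 'low'
LIMIT 10;
```

The subquery must be given an alias (e here) for Hive to accept it in the FROM clause.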
WHERE
Predicate operators (A = B, A <= B, A >= B, etc.) can be used in WHERE, JOIN ... ON, and HAVING clauses
LIKE: address.street LIKE '%AVE'
RLIKE can use Java regular expressions:
address.street RLIKE '.*(Chicago|Ontario).*'
GROUP BY, HAVING, JOIN
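A sketch that puts WHERE, GROUP BY, and HAVING together on the employees table follows. The pattern and the salary threshold are assumptions for illustration. WHERE filters rows before grouping, while HAVING filters the groups afterwards.

```sql
-- Average salary per state, considering only employees whose street
-- contains 'Ave', and keeping only states with a high enough average.
SELECT state, avg(salary) AS avg_salary
FROM employees
WHERE address.street LIKE '%Ave%'
GROUP BY state
HAVING avg(salary) > 50000.0;
```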
JOINS
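JOIN (inner join): all non-matching records are discarded;
matching records must be found in every joined table
ON clause specifies the condition for joining
LEFT OUTER JOIN: keeps every record from the left table; NULL where the right table has no match
RIGHT OUTER JOIN: keeps every record from the right table; NULL where the left table has no match
FULL OUTER JOIN: keeps all records from all tables; NULL where the other table has no match
For outer joins, the JOIN is evaluated first and then the WHERE clause is applied!
LEFT SEMI JOIN: returns records from the left table if matching records are found in the right table
Cartesian JOIN (cross product): use JOIN without ON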
ORDER BY, SORT BY (ASC and DESC)
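The join types and ordering clauses above can be sketched with the people and cart tables that the view example in these slides uses. Only the columns named there (people.id, people.lastname, cart.people_id) are assumed to exist.

```sql
-- LEFT OUTER JOIN: every person appears; people_id is NULL without a cart.
SELECT p.lastname, c.people_id
FROM people p LEFT OUTER JOIN cart c ON (c.people_id = p.id);

-- LEFT SEMI JOIN: people who have at least one cart row.
-- Columns of the right table may only be referenced in the ON clause.
SELECT p.lastname
FROM people p LEFT SEMI JOIN cart c ON (c.people_id = p.id);

-- ORDER BY gives a total ordering of the result;
-- SORT BY only orders rows within each reducer.
SELECT lastname FROM people ORDER BY lastname ASC;
SELECT lastname FROM people SORT BY lastname DESC;
```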
SAMPLING
SELECT * FROM numbers
TABLESAMPLE(BUCKET 3 OUT OF 10 ON rand()) s;
SELECT * FROM numbers
TABLESAMPLE(BUCKET 3 OUT OF 10 ON number) s;
SELECT * FROM numbers
TABLESAMPLE(0.1 PERCENT) s;
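To make the bucket syntax more concrete: the number after OUT OF is how many buckets the rows are hashed into, and the number after BUCKET picks one of them. The sketch below, on the same numbers table, splits the rows into two buckets on the number column, so the two queries together should cover every row exactly once.

```sql
-- First half of the hash space of the number column.
SELECT * FROM numbers TABLESAMPLE(BUCKET 1 OUT OF 2 ON number) s;

-- Second half; the same rows land here on every run because the
-- bucket is derived from the column value rather than from rand().
SELECT * FROM numbers TABLESAMPLE(BUCKET 2 OUT OF 2 ON number) s;
```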
VIEWS
A view allows a query to be saved and treated like a table
It is a logical construct and is not materialized
View
CREATE VIEW short AS
SELECT * FROM people JOIN cart
ON (cart.people_id = people.id)
WHERE firstname = 'john';
SELECT lastname FROM short WHERE id = 3;
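A variation on the view above, sketched with an explicit column list so the view exposes only known column names. The view name john_carts is hypothetical; the tables and columns are the ones used in the slide.

```sql
-- IF NOT EXISTS avoids an error when the view is already defined.
CREATE VIEW IF NOT EXISTS john_carts AS
SELECT p.id, p.firstname, p.lastname
FROM people p JOIN cart c ON (c.people_id = p.id)
WHERE p.firstname = 'john';

-- The view is queried like a table; Hive expands it at query time.
SELECT lastname FROM john_carts WHERE id = 3;

-- Dropping the view removes only its definition, not any data.
DROP VIEW IF EXISTS john_carts;
```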
