Unit 5 Big Data
HBase – data model and implementations – HBase clients – HBase examples – praxis. Pig –
Grunt – Pig data model – Pig Latin – developing and testing Pig Latin scripts. Hive – data types
and file formats – HiveQL data definition – HiveQL data manipulation – HiveQL queries.
HBase
Hadoop can perform only batch processing, and data will be accessed only in a sequential
manner. That means one has to search the entire dataset even for the simplest of jobs.
A huge dataset when processed results in another huge data set, which should also be
processed sequentially. At this point, a new solution is needed to access any point of data in a
single unit of time (random access).
What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is
an open-source project and is horizontally scalable.
HBase is a data model, similar to Google's Bigtable, designed to provide quick random
access to huge amounts of structured data. It leverages the fault tolerance provided by the
Hadoop File System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data
in the Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer
reads/accesses the data in HDFS randomly using HBase. HBase sits on top of the Hadoop
File System and provides read and write access.
HBase and HDFS
HDFS | HBase
HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS.
HDFS does not support fast individual record lookups. | HBase provides fast lookups for larger tables.
It provides high-latency batch processing. | It provides low-latency access to single rows from billions of records (random access); there is no concept of batch processing.
It provides only sequential access to data. | HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
HBase is a column-oriented database and the tables in it are sorted by row key. The table
schema defines only column families, which are the key-value pairs. A table has multiple
column families and each column family can have any number of columns. Subsequent column
values are stored contiguously on disk. Each cell value of the table has a timestamp. In
short, in HBase:
• A table is a collection of rows.
• A row is a collection of column families.
• A column family is a collection of columns.
• A column is a collection of key-value pairs.
Column-oriented databases are those that store data tables as sections of columns of data,
rather than as rows of data. In short, they are organized around column families.
HBase | RDBMS
HBase is schema-less; it doesn't have the concept of a fixed-columns schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of its tables.
It is built for wide tables and is horizontally scalable. | It is thin and built for small tables; hard to scale.
No transactions are there in HBase. | RDBMS is transactional.
It has de-normalized data. | It will have normalized data.
It is good for semi-structured as well as structured data. | It is good for structured data.
Features of HBase
• Apache HBase is used to have random, real-time read/write access to Big Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable
acts upon the Google File System; likewise, Apache HBase works on top of Hadoop and
HDFS.
HBase Data Model
The following are the data model terms used in Apache HBase.
1. Table
Apache HBase organizes data into tables. A table name is a string made up of characters that
are safe to use in a file system path.
2. Row
Apache HBase stores its data based on rows and each row has its unique row key. The row
key is represented as a byte array.
3. Column Family
Column families group the columns within a row and provide the structure for storing data
in Apache HBase. A column family name is composed of characters and strings that can be
used in a file system path. Every row in a table has the same set of column families, but a
row does not need to store data in all of its column families.
4. Column Qualifier
A column qualifier is used to point to the data that is stored in a column family. It is always
represented as a byte array.
5. Cell
A cell is identified by the combination of row key, column family, and column qualifier; the
data stored in it is generally called the cell's value.
6. Timestamp
The values stored in a cell are versioned, and each version is identified by a timestamp
assigned when it is written. If we don't mention a timestamp while writing data, the current
time is used.
In HBase, tables are split into regions and are served by the region servers. Regions are
vertically divided by column families into “Stores”. Stores are saved as files in HDFS. Shown
below is the architecture of HBase.
Note: The term ‘store’ is used for regions to explain the storage structure.
Components of Apache HBase Architecture
i. HMaster
HMaster is the master server of HBase. It assigns regions to the region servers, handles load
balancing across them, and is responsible for schema changes and other metadata operations
such as the creation of tables and column families.
ii. Region Server
Regions are nothing but tables that are split up and spread across the region servers. When
we take a deeper look into a region server, it contains regions and stores.
The store contains the MemStore and HFiles. The MemStore is just like a cache memory:
anything that is entered into HBase is stored here initially. Later, the data is transferred and
saved in HFiles as blocks, and the MemStore is flushed.
iii. ZooKeeper
1. Keeping Track of Cluster Info: It stores important information about the HBase
cluster, like where data is stored and which servers are working.
2. Choosing a Leader: When there are multiple HBase servers, ZooKeeper helps pick
one as the leader to manage everything smoothly.
3. Spotting Problems: ZooKeeper watches over the servers. If one stops working,
ZooKeeper alerts HBase so it can fix things quickly.
4. Managing Settings: It helps HBase adjust its settings based on what's happening in
the cluster.
5. Helping Clients Find Data: ZooKeeper tells clients where to look for data in the
HBase cluster.
HBase Clients
This section describes the Java client API for HBase that is used to perform CRUD operations
on HBase tables. HBase is written in Java and has a native Java API. Therefore it provides
programmatic access to the Data Manipulation Language (DML).
Class HTable
Constructors
1. HTable()
2. HTable(TableName tableName, ClusterConnection connection, ExecutorService pool) – Using this constructor, you can create an object to access an HBase table.
Methods
1. void close() – Releases all the resources of the HTable.
2. void delete(Delete delete) – Deletes the specified cells/row.
3. boolean exists(Get get) – Tests the existence of columns in the table, as specified by Get.
4. Result get(Get get) – Retrieves certain cells from a given row.
5. org.apache.hadoop.conf.Configuration getConfiguration() – Returns the Configuration object used by this instance.
6. TableName getName() – Returns the table name instance of this table.
7. HTableDescriptor getTableDescriptor() – Returns the table descriptor for this table.
8. byte[] getTableName() – Returns the name of this table.
9. void put(Put put) – Inserts data into the table.
Class Put
This class is used to perform Put operations for a single row. It belongs to the
org.apache.hadoop.hbase.client package.
Constructors
1. Put(byte[] row) – Creates a Put operation for the specified row.
2. Put(byte[] rowArray, int rowOffset, int rowLength) – Makes a copy of the passed-in row key to keep local.
3. Put(byte[] rowArray, int rowOffset, int rowLength, long ts) – Makes a copy of the passed-in row key to keep local, using the given timestamp.
4. Put(byte[] row, long ts) – Creates a Put operation for the specified row, using a given timestamp.
Methods
1. Put addColumn(byte[] family, byte[] qualifier, byte[] value) – Adds the specified column and value to this Put operation.
2. Put addColumn(byte[] family, byte[] qualifier, long ts, byte[] value) – Adds the specified column and value, with the given timestamp as its version, to this Put operation.
Class Get
This class is used to perform Get operations on a single row. It belongs to the
org.apache.hadoop.hbase.client package.
Constructors
1. Get(byte[] row) – Creates a Get operation for the specified row.
2. Get(Get get) – Copy constructor.
Methods
1. Get addColumn(byte[] family, byte[] qualifier) – Retrieves the column from the specified family with the specified qualifier.
2. Get addFamily(byte[] family) – Retrieves all columns from the specified family.
Class Delete
This class is used to perform Delete operations on a single row. To delete an entire row,
instantiate a Delete object with the row to delete. It belongs to the
org.apache.hadoop.hbase.client package.
Constructors
1. Delete(byte[] row) – Creates a Delete operation for the specified row.
2. Delete(byte[] rowArray, int rowOffset, int rowLength) – Creates a Delete operation for the specified row.
3. Delete(byte[] rowArray, int rowOffset, int rowLength, long ts) – Creates a Delete operation for the specified row and timestamp.
4. Delete(byte[] row, long timestamp) – Creates a Delete operation for the specified row and timestamp.
Methods
1. Delete addColumn(byte[] family, byte[] qualifier) – Deletes the latest version of the specified column.
2. Delete addFamily(byte[] family) – Deletes all versions of all columns of the specified family.
Class Result
This class is used to get a single row result of a Get or a Scan query. It belongs to the
org.apache.hadoop.hbase.client package.
Constructors
1. Result() – Using this constructor, you can create an empty Result with no KeyValue payload; rawCells() returns null.
Methods
1. byte[] getValue(byte[] family, byte[] qualifier) – Gets the latest version of the specified column.
2. Cell[] rawCells() – Returns the array of Cells backing this Result.
HBase Examples
1. Creating a Table:
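A minimal sketch using the HBase Java client API, assuming a running cluster whose hbase-site.xml is on the classpath; the table name employee and column family personal are illustrative names, not part of any standard.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml / hbase-default.xml from the classpath
        Configuration config = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(config);
             Admin admin = connection.getAdmin()) {
            // Table descriptor: table name plus one column family
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("employee"));
            table.addFamily(new HColumnDescriptor("personal"));
            admin.createTable(table);
            System.out.println("Table created");
        }
    }
}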
2. Inserting Data:
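A sketch of inserting one cell with the Put class, reusing the same assumed table and column family; the row key row1 and qualifier name are illustrative.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class InsertDataExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("employee"))) {
            // Put for row key "row1": personal:name = "John"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("John"));
            table.put(put);
        }
    }
}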
3. Getting Data:
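A sketch of reading a single cell back with the Get class, under the same assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class GetDataExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("employee"))) {
            // Read back the personal:name cell of row "row1"
            Get get = new Get(Bytes.toBytes("row1"));
            get.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(value));
        }
    }
}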
4. Scanning Data:
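A sketch of scanning rows with the Scan class, under the same assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanDataExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("employee"))) {
            // Scan the whole table, returning only the "personal" column family
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("personal"));
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result row : scanner) {
                    System.out.println(row);
                }
            }
        }
    }
}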
5. Deleting Data:
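A sketch of deleting a row with the Delete class, under the same assumptions.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteDataExample {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("employee"))) {
            // Deleting the whole row "row1"; addColumn/addFamily could narrow the delete
            Delete delete = new Delete(Bytes.toBytes("row1"));
            table.delete(delete);
        }
    }
}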
These examples demonstrate basic operations in HBase, including creating a table, inserting
data, retrieving data, scanning data, and deleting data. Remember to properly handle
exceptions and ensure you have established a connection to the HBase cluster using the
appropriate configuration settings.
Applications of HBase
1. Medical
In the medical field, HBase is used for storing genome sequences and running MapReduce on
them, and for storing the disease histories of patients.
2. Sports
In sports, HBase is used for storing match histories for better analytics and prediction.
3. Web
It is also used to store user history and preferences for better customer targeting.
4. Oil and petroleum industry
HBase is used in the oil and petroleum industry to store exploration data for analysis and to
predict probable places where oil can be found.
5. E-commerce
It is used for recording and storing logs about customer search history and for performing
analytics, and then targeting advertisements for better business.
Prominent users of HBase include Aadhaar (UIDAI) and Yahoo.
5.3 Pig
Pig is a scripting platform that runs on Hadoop clusters designed to process and
analyze large datasets. Pig is extensible, self-optimizing, and easily programmed.
Programmers can use Pig to write data transformations without knowing Java. Pig uses both
structured and unstructured data as input to perform analytics and uses HDFS to store the
results.
Components of Pig
Pig has two main components: the Pig Latin script language and a runtime engine.
Pig Latin script language
Pig Latin is a procedural data flow language. It contains syntax and commands that
can be applied to implement business logic. Examples of Pig Latin commands are LOAD and STORE.
A runtime engine
The runtime engine is a compiler that produces sequences of MapReduce programs. It uses
HDFS to store and retrieve data. It is also used to interact with the Hadoop system (HDFS
and MapReduce).
Pig Commands
Given below in the table are some frequently used Pig commands; a short script that uses several of them follows the table.
Command | Function
load | Reads data from the file system
store | Writes data to the file system
foreach | Applies expressions to each record and outputs one or more records
filter | Applies a predicate and removes records that do not return true
group/cogroup | Collects records with the same key from one or more inputs
join | Joins two or more inputs based on a key
order | Sorts records based on a key
distinct | Removes duplicate records
union | Merges data sets
split | Splits data into two or more sets based on filter conditions
stream | Sends all records through a user-provided binary
dump | Writes output to stdout
limit | Limits the number of records
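A short illustrative script that strings several of these commands together; the input file student.txt and its schema are assumed for the example, not prescribed by Pig.

-- Load comma-separated records (assumed file and schema)
students = LOAD 'student.txt' USING PigStorage(',') AS (id:int, name:chararray, city:chararray);
-- Keep only the students from Chennai
chennai = FILTER students BY city == 'Chennai';
-- Remove duplicate records and sort by name
unique_students = DISTINCT chennai;
sorted_students = ORDER unique_students BY name;
-- Write the result to stdout
DUMP sorted_students;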
Apache Pig is generally used by data scientists for performing tasks involving ad-hoc
processing and quick prototyping, for example to process huge data sources such as web logs,
to perform data processing for search platforms, and to process time-sensitive data loads.
In 2006, Apache Pig was developed as a research project at Yahoo, especially to create and
execute MapReduce jobs on every dataset. In 2007, Apache Pig was open sourced via
Apache incubator. In 2008, the first release of Apache Pig came out. In 2010, Apache Pig
graduated as an Apache top-level project.
The main reason why programmers have started using Hadoop Pig is that it converts the
scripts into a series of MapReduce tasks making their job easy. Below is the architecture of
Pig Hadoop:
1. Parser: When a Pig Latin script is sent to Hadoop Pig, it is first handled by the
parser. The parser is responsible for checking the syntax of the script, along with other
miscellaneous checks. Parser gives an output in the form of a Directed Acyclic Graph
(DAG) that contains Pig Latin statements, together with other logical operators
represented as nodes.
2. Optimizer: After the output from the parser is retrieved, a logical plan for DAG is
passed to a logical optimizer. The optimizer is responsible for carrying out the logical
optimizations.
3. Compiler: The role of the compiler comes in when the output from the optimizer is
received. The compiler compiles the logical plan sent by the optimizer; the logical
plan is then converted into a series of MapReduce tasks or jobs.
4. Execution Engine: After the logical plan is converted to MapReduce jobs, these jobs
are sent to Hadoop in a properly sorted order, and these jobs are executed on Hadoop
for yielding the desired result.
5.4 Grunt
Apache Pig Grunt is an interactive shell that enables users to enter Pig Latin interactively and
provides a shell to interact with HDFS and local file system commands. You can enter Pig
Latin commands directly into the Grunt shell for execution. Apache Pig starts executing the
Pig Latin language when it receives the STORE or DUMP command. Before executing a
command, the Pig Grunt shell checks its syntax and semantics to avoid errors.
To start the Grunt shell in local mode:
$ pig -x local
grunt>
We can use the Pig Grunt shell to run HDFS commands as well. Starting from Pig version 0.5,
all Hadoop fs shell commands are available for use. They are accessed using the keyword fs
followed by the command.
Let us see a few HDFS commands from the Pig Grunt shell.
fs -ls
Syntax:
grunt> fs subcommand subcommand_parameters;
Command:
grunt> fs -ls /
fs -cat
Command:
grunt> fs -cat file_name
fs -mkdir
Command:
grunt> fs -mkdir directory_name
We can use the Pig Grunt shell to run basic shell commands as well. Any shell command can
be invoked using sh.
Let us see a few shell commands from the Pig Grunt shell. We cannot execute those
commands which are part of the shell environment, such as cd.
sh ls
Syntax:
grunt> sh subcommand subcommand_parameters
Command:
grunt> sh ls
sh cat
Command:
grunt> sh cat file_name
Pig Grunt also supports utility commands such as help, clear, and history; apart from these,
Grunt provides commands for controlling Pig and MapReduce, such as exec, run, and kill.
Help Command
Syntax:
grunt> help
Clear Command
The clear command is used to clear the screen of the Grunt shell.
Syntax:
grunt> clear
History Command
The history command displays the list of statements executed so far in the Grunt shell.
Syntax:
grunt> history
Set Command
The set command is used to assign values to keys used in Pig; the keys and their values are
case sensitive. If the set command is used without any arguments, all the system properties
and configurations are printed.
Syntax:
grunt> set [key 'value']
Kill Command
The kill command will attempt to kill any MapReduce jobs associated with the Pig job.
Syntax:
grunt> kill JobId
5.6 Pig Latin
Pig Latin is the language used to analyze data in Hadoop using Apache Pig. In this chapter,
we are going to discuss the basics of Pig Latin such as Pig Latin statements, data types,
general and relational operators, and Pig Latin UDFs.
The data model of Pig is fully nested. A Relation is the outermost structure of the Pig Latin
data model, and it is a bag, where:
• A bag is a collection of tuples.
• A tuple is an ordered set of fields.
• A field is a piece of data.
While processing data using Pig Latin, statements are the basic constructs.
• These statements work with relations. They include expressions and schemas.
• Every statement ends with a semicolon (;).
• We will perform various operations using operators provided by Pig Latin, through
statements.
• Except LOAD and STORE, while performing all other operations, Pig Latin
statements take a relation as input and produce another relation as output.
• As soon as you enter a Load statement in the Grunt shell, only its semantic checking is
carried out. To see the contents of the loaded relation, you need to use the Dump operator;
only after performing the dump operation is the MapReduce job that reads the data from
the file system actually carried out.
Example
Given below is a Pig Latin statement, which loads data to Apache Pig.
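A representative LOAD statement (the file name and schema are illustrative):

student_data = LOAD 'student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, city:chararray);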
The following table describes the arithmetic operators of Pig Latin. Suppose a = 10 and b = 20.
Operator | Description | Example
+ | Addition − Adds the values on either side of the operator | a + b will give 30
− | Subtraction − Subtracts the right-hand operand from the left-hand operand | a − b will give −10
* | Multiplication − Multiplies the values on either side of the operator | a * b will give 200
/ | Division − Divides the left-hand operand by the right-hand operand | b / a will give 2
% | Modulus − Divides the left-hand operand by the right-hand operand and returns the remainder | b % a will give 0
?: | Bincond − Evaluates a Boolean expression. It has three operands: variable x = (expression) ? value1 if true : value2 if false | b = (a == 1) ? 20 : 30; if a == 1 the value of b is 20, otherwise it is 30
CASE WHEN THEN ELSE END | Case − The case operator is equivalent to a nested bincond operator | CASE f2 % 2 WHEN 0 THEN 'even' WHEN 1 THEN 'odd' END
The following table describes the comparison operators of Pig Latin.
Operator | Description | Example
== | Equal − Checks if the values of two operands are equal; if yes, then the condition becomes true | (a == b) is not true
!= | Not Equal − Checks if the values of two operands are not equal; if the values are not equal, then the condition becomes true | (a != b) is true
> | Greater than − Checks if the value of the left operand is greater than the value of the right operand; if yes, then the condition becomes true | (a > b) is not true
< | Less than − Checks if the value of the left operand is less than the value of the right operand; if yes, then the condition becomes true | (a < b) is true
>= | Greater than or equal to − Checks if the value of the left operand is greater than or equal to the value of the right operand; if yes, then the condition becomes true | (a >= b) is not true
<= | Less than or equal to − Checks if the value of the left operand is less than or equal to the value of the right operand; if yes, then the condition becomes true | (a <= b) is true
matches | Pattern matching − Checks whether the string on the left-hand side matches the constant on the right-hand side | f1 matches '.*tutorial.*'
The following table describes the type construction operators of Pig Latin.
Operator | Description
() | Tuple constructor operator − used to construct a tuple
{} | Bag constructor operator − used to construct a bag
[] | Map constructor operator − used to construct a map
Pig Latin also provides relational operators, which include:
• Filtering − FILTER, DISTINCT, FOREACH, GENERATE, STREAM
• Sorting − ORDER BY, LIMIT
• Diagnostic Operators − DUMP, DESCRIBE, EXPLAIN, ILLUSTRATE
Developing and Testing Pig Latin Scripts
Developing and testing Pig Latin scripts typically involves several steps. Pig Latin is a
scripting language used with Apache Pig, a platform for analyzing large data sets. Here's a
general guide on how to develop and test Pig Latin scripts (a sample script is sketched after these steps):
1. Set up your environment: First, ensure that you have Apache Pig installed and
configured properly on your system. You can download Pig from the Apache Pig
website and follow the installation instructions.
2. Write your Pig Latin script: Create a Pig Latin script using a text editor or an
integrated development environment (IDE). Pig Latin scripts are typically written in
.pig files. These scripts consist of a series of statements that define the data flow and
transformations you want to apply to your data.
3. Understand the data: Before writing your script, understand the structure and format
of your data. Pig works well with structured and semi-structured data, such as CSV
files, JSON, or log files. Ensure that you have sample data available for testing your
script.
4. Test your script locally: Once you have written your Pig Latin script, you can test it
locally on a small sample of data to ensure that it behaves as expected. You can run
Pig in local mode using the pig command followed by the name of your script file.
For example:
pig myscript.pig
5. Debugging: If your script encounters errors during execution, use Pig's built-in
debugging features to identify and fix the issues. You can use the grunt shell to
interactively debug your script and inspect the intermediate results of each step.
6. Scale up testing: Once your script works correctly on a small sample of data, scale
up your testing by running it on larger datasets. You can do this by executing your
script on a Hadoop cluster using Pig's distributed mode. This allows you to test the
scalability and performance of your script under real-world conditions.
7. Optimize performance: If your script is slow or inefficient, optimize it for better
performance. This may involve restructuring your script, using built-in Pig
optimization techniques, or leveraging user-defined functions (UDFs) for custom
processing tasks.
8. Automate testing: Consider automating the testing of your Pig Latin scripts using
tools like Apache Oozie or Apache Airflow. This allows you to schedule and run your
scripts automatically, monitor their execution, and capture any errors or failures.
9. Document your script: Finally, document your Pig Latin script to make it easier for
others to understand and use. Include comments, annotations, and documentation that
explain the purpose of each step and how to run the script.
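A minimal sketch of such a script, assuming an input file sales.csv with comma-separated product and amount fields (both the file and the schema are illustrative):

-- myscript.pig: total sales per product
sales = LOAD 'sales.csv' USING PigStorage(',') AS (product:chararray, amount:double);
by_product = GROUP sales BY product;
totals = FOREACH by_product GENERATE group AS product, SUM(sales.amount) AS total;
STORE totals INTO 'sales_totals' USING PigStorage(',');

It can be tried first on a small local sample (pig -x local myscript.pig) and then run against HDFS data in MapReduce mode (pig myscript.pig).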
Hive
Hive architecture includes the following major components:
• Hive Clients: Hive offers a variety of drivers designed for communication with
different applications. For example, Hive provides Thrift clients for Thrift-based
applications. These clients and drivers then communicate with the Hive server, which
falls under Hive services.
• Hive Services: Hive services perform client interactions with Hive. For example, if a
client wants to perform a query, it must talk with Hive services.
• Hive Storage and Computing: Hive services such as the file system, job client, and
metastore in turn communicate with Hive storage, where things like table metadata and
query results are kept.
Hive's Features
• Hive is designed for querying and managing only structured data stored in tables
• Hive is scalable, fast, and uses familiar concepts
• Schema gets stored in a database, while processed data goes into a Hadoop Distributed
File System (HDFS)
• Tables and databases get created first; then data gets loaded into the proper tables
• Hive supports four file formats: ORC, SEQUENCEFILE, RCFILE (Record Columnar
File), and TEXTFILE
• Hive uses an SQL-inspired language, sparing the user from dealing with the complexity
of MapReduce programming. It makes learning more accessible by utilizing familiar
concepts found in relational databases, such as columns, tables, rows, and schema, etc.
• The most significant difference between the Hive Query Language (HQL) and SQL is
that Hive executes queries on Hadoop's infrastructure instead of on a traditional
database
• Since Hadoop's programming works on flat files, Hive uses directory structures to
"partition" data, improving performance on specific queries
• Hive supports partition and buckets for fast and simple data retrieval
• Hive supports custom user-defined functions (UDF) for tasks like data cleansing and
filtering. Hive UDFs can be defined according to programmers' requirements
Limitations of Hive
Of course, no resource is perfect, and Hive has some limitations. They are:
• Hive doesn’t support OLTP. Hive supports Online Analytical Processing (OLAP), but
not Online Transaction Processing (OLTP).
• It offers only limited subquery support.
• It has high query latency.
• Hive tables don't support delete or update operations, except on ACID-enabled
(transactional) tables.
Hive Modes
Depending on the size of the Hadoop cluster's data nodes, Hive can operate in two different modes:
• Local mode
• Map-reduce mode
Local mode is used when:
• Hadoop is installed under pseudo mode, possessing only one data node
• The data size is smaller and limited to a single local machine
• Users expect faster processing because the local machine contains smaller datasets
Map-reduce mode is used when:
• Hadoop has multiple data nodes, and the data is distributed across these different nodes
• Users must deal with larger data sets
Hive Data Types
This section takes you through the different data types in Hive, which are involved in
table creation. All the data types in Hive are classified into four types, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
Column Types
Column types are used as the column data types of Hive tables. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range
exceeds the range of INT, you need to use BIGINT and if the data range is smaller than the
INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
Type | Postfix | Example
TINYINT | Y | 10Y
SMALLINT | S | 10S
INT | - | 10
BIGINT | L | 10L
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). It
contains two data types: VARCHAR and CHAR. Hive follows C-style escape characters.
Type | Length
VARCHAR | 1 to 65535
CHAR | 255
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision, in the format
"YYYY-MM-DD HH:MM:SS.fffffffff".
Dates
DATE values are described in year/month/day format, i.e., YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for
representing immutable arbitrary-precision values. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using create
union. The syntax and an example are as follows:
UNIONTYPE<int, double, array<string>, struct<a:int,b:string>>
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
Floating Point Types
Floating point types are nothing but numbers with decimal points. Generally, this type of data
is composed of the DOUBLE data type.
Decimal Type
Decimal type data is nothing but a floating point value with a higher range than the DOUBLE
data type. The range of the decimal type is approximately -10^-308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java maps.
Syntax: MAP<primitive_type, data_type>
A table definition that uses these complex types is sketched below.
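An illustrative table definition combining these complex types; the table and column names are assumed for the example.

CREATE TABLE employee_profile (
  id INT,
  name STRING,
  skills ARRAY<STRING>,
  phone_numbers MAP<STRING, STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':';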
File Formats:
Hive can handle several special file formats, such as the following:
Text File Format: This is a simple format where data is stored as plain text files. Each line
typically represents a record, and fields within the record are separated by delimiters such as
commas or tabs.
Sequence File Format: This is a binary file format optimized for storing key-value pairs. It's
commonly used in Hadoop MapReduce jobs and is efficient for large-scale data processing.
RC File (Row Column File Format): RC File is a columnar storage format designed to
optimize query performance by storing data in columnar fashion, making it suitable for
analytics and data warehousing workloads.
Avro Files: Avro is a data serialization system that provides rich data structures and compact
binary formats. Avro files are self-describing and support schema evolution, making them
suitable for data interchange and long-term storage.
ORC Files (Optimized Row Columnar File Format): ORC is another columnar storage
format developed specifically for Hive. It offers advanced compression techniques and
improved performance for complex queries, making it ideal for analytics and data warehousing
use cases.
Parquet: Parquet is a columnar storage format that is highly optimized for efficient data
storage and query performance. It supports advanced features such as nested data structures,
predicate pushdown, and column pruning, making it suitable for a wide range of analytical
workloads.
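The file format of a Hive table is chosen with the STORED AS clause; a brief sketch (the table and column names are assumed):

-- Plain text staging table
CREATE TABLE logs_text (log_time STRING, message STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- Columnar table for analytics workloads
CREATE TABLE logs_orc (log_time STRING, message STRING)
STORED AS ORC;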
HiveQL Data Definition
Hive DDL commands are the statements used for defining and changing the structure of a
table or database in Hive. They are used to build or modify tables and other objects in the
database.
1. CREATE
2. SHOW
3. DESCRIBE
4. USE
5. DROP
6. ALTER
7. TRUNCATE
1. CREATE DATABASE in Hive
The CREATE DATABASE statement is used to create a database in Hive. The keywords
DATABASE and SCHEMA are interchangeable; we can use either.
Syntax:
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name;
2. SHOW DATABASES in Hive
The SHOW DATABASES statement lists all the databases present in Hive.
Syntax:
SHOW (DATABASES|SCHEMAS);
3. DESCRIBE DATABASE in Hive
The DESCRIBE DATABASE statement in Hive shows the name of Database in Hive, its
comment (if set), and its location on the file system.
4. USE DATABASE in Hive
The USE statement in Hive is used to select a specific database for the session; all
subsequent HiveQL statements are executed against it.
Syntax:
USE database_name;
5. DROP DATABASE in Hive
The DROP DATABASE statement in Hive is used to Drop (delete) the database.
The default behavior is RESTRICT which means that the database is dropped only when it is
empty. To drop the database with tables, we can use CASCADE.
Syntax:
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
6. ALTER DATABASE in Hive
The ALTER DATABASE statement in Hive is used to change the metadata associated with
a database in Hive.
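A short sequence tying these DDL statements together (the database and table names are assumed for illustration):

-- Create and select a database
CREATE DATABASE IF NOT EXISTS college;
SHOW DATABASES;
USE college;
-- Create a table inside it and inspect its structure
CREATE TABLE student (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
DESCRIBE student;
-- Drop the database along with its tables
DROP DATABASE IF EXISTS college CASCADE;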
HiveQL Data Manipulation
Apache Hive DML (Data Manipulation Language) statements are used to insert, update,
delete, and fetch data from Hive tables. Using DML commands we can load files into Apache
Hive tables, write data into the filesystem from Hive queries, perform merge operations on
tables, and so on; a short usage sketch follows the list of commands.
• LOAD
• SELECT
• INSERT
• DELETE
• UPDATE
• EXPORT
• IMPORT
LOAD: The LOAD statement is used to load data from external files (e.g., in HDFS) into a
Hive table. It's more of a data loading operation rather than a traditional DML statement. For
example:
LOAD DATA LOCAL INPATH '/path/to/data/file' INTO TABLE mytable;
• SELECT: The SELECT statement is used for querying data from Hive tables. It retrieves
data and does not modify the underlying data.
• INSERT: Hive provides an INSERT INTO statement to insert data into tables. However, it
appends data to existing data rather than performing updates or modifications. It's more of an
insert operation than a traditional SQL INSERT statement:
INSERT INTO mytable VALUES (1, 'John', 25);
• DELETE: Hive does support a DELETE statement for deleting rows from a table, but this
operation is limited to ACID-compliant (transactional) Hive tables stored in the ORC format
with transactions enabled. It's not widely used in typical Hive workloads.
• UPDATE: Hive traditionally does not support a standard SQL UPDATE statement to modify
data in existing rows. Instead, you often need to work with Hive's batch processing model
and recreate tables with updated data.
• EXPORT and IMPORT: These are HiveQL extensions for exporting data from Hive
tables into external storage (EXPORT) or importing data from external storage into Hive
tables (IMPORT). They are not standard SQL DML statements.
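A sketch of typical DML usage, assuming a table employee(id INT, name STRING, salary DOUBLE) and a staging table new_hires with the same columns:

-- Load a local file into the table (files are moved/copied into the table's directory)
LOAD DATA LOCAL INPATH '/tmp/employee.csv' INTO TABLE employee;
-- Query the data
SELECT name, salary FROM employee WHERE salary > 50000;
-- Append rows selected from another table
INSERT INTO TABLE employee SELECT id, name, salary FROM new_hires;
-- Rewrite the table contents from a query (the usual way to "update" in classic Hive)
INSERT OVERWRITE TABLE employee SELECT id, name, salary * 1.1 FROM employee;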
HiveQL queries
In Apache Hive, HiveQL (Hive Query Language) is used to query data stored in Hive tables.
HiveQL provides a SQL-like syntax for querying structured and semi-structured data. Here
are some commonly used query types in HiveQL:
1. SELECT:
o Used to retrieve data from one or more tables.
Syntax:
SELECT column1, column2 FROM table_name WHERE condition;
2. JOIN:
o Used to combine records from two or more tables based on a common column.
Syntax:
SELECT a.column1, b.column2 FROM table1 a JOIN table2 b ON (a.key = b.key);
3. GROUP BY:
o Used to group rows that have the same values in specified columns, usually together with aggregate functions.
Syntax:
SELECT column1, COUNT(*) FROM table_name GROUP BY column1;
4. HAVING:
o Used to filter the groups produced by GROUP BY.
Syntax:
SELECT column1, COUNT(*) FROM table_name GROUP BY column1 HAVING COUNT(*) > 10;
5. ORDER BY:
o Used to sort the result set by one or more columns.
Syntax:
SELECT * FROM table_name ORDER BY column1 DESC;
6. LIMIT:
o Used to restrict the number of rows returned.
Syntax:
SELECT * FROM table_name LIMIT 10;
7. Subqueries:
o A query nested inside another query.
Syntax:
SELECT * FROM table_name WHERE column1 IN (SELECT column1 FROM other_table);
8. Conditional Expressions:
o CASE ... WHEN ... THEN ... ELSE ... END expressions used for conditional logic inside queries.
Syntax:
SELECT column1, CASE WHEN column2 > 100 THEN 'high' ELSE 'low' END FROM table_name;
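A compact worked example combining several of these clauses, against assumed tables employee(id, name, dept_id, salary) and department(dept_id, dept_name):

-- Average salary per department having more than 5 employees,
-- highest averages first, top 3 rows only
SELECT d.dept_name, AVG(e.salary) AS avg_salary
FROM employee e
JOIN department d ON (e.dept_id = d.dept_id)
GROUP BY d.dept_name
HAVING COUNT(*) > 5
ORDER BY avg_salary DESC
LIMIT 3;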
These are some of the common query types in HiveQL. They allow you to retrieve, filter,
aggregate, and sort data from Hive tables. HiveQL supports various SQL-like constructs,
making it easier to work with structured data in Hive.