UNIT 5 Notes
Module V
HBase, data model and implementations, HBase clients, HBase examples, praxis. Cassandra,
Cassandra data model, Cassandra examples, Cassandra clients, Hadoop integration. Hive,
datatypes and file formats, HiveQL data definition, HiveQL data manipulation, HiveQL
queries.
5.1 HBase
5.1.1 What is HBase?
HBase is a distributed column-oriented database built on top of the Hadoop file system. It is an open-
source project and is horizontally scalable.
HBase has a data model similar to Google's Bigtable and is designed to provide quick random access to
huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File
System (HDFS).
It is a part of the Hadoop ecosystem that provides random real-time read/write access to data in the
Hadoop File System.
One can store the data in HDFS either directly or through HBase. Data consumer reads/accesses the data
in HDFS randomly using HBase. HBase sits on top of the Hadoop File System and provides read and
write access.
5.1.2 HBase and HDFS
HDFS: HDFS is a distributed file system suitable for storing large files.
HBase: HBase is a database built on top of HDFS.
HDFS: HDFS does not support fast individual record lookups.
HBase: HBase provides fast lookups for larger tables.
HDFS: It provides high-latency batch processing; there is no concept of random reads and writes.
HBase: It provides low-latency access to single rows from billions of records (random access).
HDFS: It provides only sequential access to data.
HBase: HBase internally uses hash tables and provides random access, and it stores the data in indexed HDFS files for faster lookups.
HBase is a column-oriented database and the tables in it are sorted by row key. The table schema defines
only column families, which are the key-value pairs. A table can have multiple column families, and each
column family can have any number of columns. Subsequent column values are stored contiguously on
the disk. Each cell value of the table has a timestamp. In short, in an HBase:
Table is a collection of rows.
Row is a collection of column families.
Column family is a collection of columns.
Column is a collection of key-value pairs.
5.1.4 Column Oriented and Row Oriented
Column-oriented databases are those that store data tables as sections of columns of data, rather than as
rows of data. In short, they have column families. Row-oriented databases are designed for a small
number of rows and columns, whereas column-oriented databases are designed for huge tables.
HBase vs RDBMS:
HBase: HBase is schema-less; it does not have the concept of a fixed-column schema and defines only column families.
RDBMS: An RDBMS is governed by its schema, which describes the whole structure of its tables.
HBase: It is built for wide tables and is horizontally scalable.
RDBMS: It is thin and built for small tables; it is hard to scale.
HBase: It is good for semi-structured as well as structured data.
RDBMS: It is good for structured data.
Features of HBase
Apache HBase is used to have random, real-time read/write access to Big Data.
It hosts very large tables on top of clusters of commodity hardware.
Apache HBase is a non-relational database modeled after Google's Bigtable. Just as Bigtable acts upon
the Google File System, Apache HBase works on top of Hadoop and HDFS.
HBase Data Model
The HBase data model is designed to handle semi-structured data that may differ in field size, data type,
and number of columns. The data model's layout partitions the data into simpler components and spreads
them across the cluster. HBase's data model consists of several logical components: tables, rows, column
families, columns, cells, and versions.
Table:
An HBase table is made up of multiple rows and column families. The column families of a table are
defined upfront at the time of schema specification.
Row:
An HBase row consists of a row key and one or more associated value columns. Row keys are
uninterpreted byte arrays. Rows are sorted lexicographically by row key, so the row with the lowest key
appears first in the table. The design of the row key is therefore very important for efficient access.
Column:
A column in HBase consists of a column family and a column qualifier, which are separated by a
colon (:) character.
Apache HBase columns are grouped into column families. The column families physically colocate a
group of columns and their values to improve performance. Every row in a table has the same set of
column families, but a given row may store nothing in a particular column family.
All column members of a column family share the same prefix. For example, the columns courses:
history and courses: math are both members of the courses column family. The colon (:) character
separates the column family from the column qualifier. The column family prefix must be made up of
printable characters.
Column families must be declared upfront at schema definition time, whereas columns are not specified
at schema time; they can be added on the fly while the table is up and running. Physically, all members
of a column family are stored together on the file system.
The column qualifier is added to a column family to identify an individual column. For example, for a
column family content, the qualifiers content:html and content:pdf could identify which representation of
the content a cell stores. Although column families are fixed at table creation, column qualifiers are
mutable and can vary significantly from row to row.
Cell: A cell stores data and is uniquely identified by the combination of row key, column family, and
column qualifier. The data stored in a cell is called its value, and regardless of its logical data type it is
always treated as a byte[].
Timestamp:
A timestamp is written alongside each value and is the identifier for a given version of the value. By
default, the timestamp reflects the time at which the data was written on the RegionServer, but a
different timestamp value can be supplied when putting data into a cell.
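As a small illustration of how a cell is addressed, the following sketch (using the same pre-1.0 client API as the full insert example later in this unit) writes one cell with an explicit timestamp. The table name cell_demo, column family cf, and qualifier attr are illustrative assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ExplicitTimestampPut {
   public static void main(String[] args) throws Exception {
      Configuration config = HBaseConfiguration.create();
      HTable table = new HTable(config, "cell_demo");   // assumed table with column family "cf"
      long ts = System.currentTimeMillis();             // the version identifier for this cell

      // A cell is the combination of row key + column family + qualifier;
      // the timestamp names one version of its value.
      Put p = new Put(Bytes.toBytes("row1"), ts);       // Put(byte[] row, long ts)
      p.add(Bytes.toBytes("cf"), Bytes.toBytes("attr"), Bytes.toBytes("value-at-" + ts));
      table.put(p);
      table.close();
   }
}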
The main components of the HBase architecture are:
1. HMaster –
The implementation of the Master Server in HBase is HMaster. It is the process that assigns regions
to Region Servers and handles DDL (create, delete table) operations. It monitors all Region Server
instances present in the cluster. In a distributed environment, the Master runs several background
threads. HMaster also handles tasks such as controlling load balancing and failover.
2. Region Server –
HBase tables are divided horizontally by row-key range into regions. Regions are the basic building
blocks of an HBase cluster; each region holds a contiguous slice of a table's rows together with its
column families. A Region Server runs on an HDFS DataNode in the Hadoop cluster and is
responsible for handling, managing, and executing read and write operations on the set of regions
assigned to it. The default size of a region is 256 MB in older HBase versions (newer versions
default to much larger regions).
3. Zookeeper –
ZooKeeper acts like a coordinator in HBase. It provides services such as maintaining configuration
information, naming, distributed synchronization, and server-failure notification. Clients use
ZooKeeper to locate region servers and then communicate with the region servers directly.
Key characteristics of HBase include the following:
Distributed and Scalable: HBase is designed to be distributed and scalable, which means it can handle
large datasets and can scale out horizontally by adding more nodes to the cluster.
Column-oriented Storage: HBase stores data in a column-oriented manner, which means data is
organized by columns rather than rows. This allows for efficient data retrieval and aggregation.
Hadoop Integration: HBase is built on top of Hadoop, which means it can leverage Hadoop’s
distributed file system (HDFS) for storage and MapReduce for data processing.
Consistency and Replication: HBase provides strong consistency guarantees for read and write
operations, and supports replication of data across multiple nodes for fault tolerance.
Built-in Caching: HBase has a built-in caching mechanism that can cache frequently accessed data in
memory, which can improve query performance.
Compression: HBase supports compression of data, which can reduce storage requirements and
improve query performance.
Flexible Schema: HBase supports flexible schemas, which means the schema can be updated on the fly
without requiring a database schema migration.
Note – HBase is extensively used for online, real-time read/write operations; for example, in banking
applications it can serve real-time data updates such as those generated by ATM transactions.
HBase Client API
Class HBaseConfiguration
This class adds HBase configuration files to a Configuration object. It belongs to the
org.apache.hadoop.hbase package; a configuration object is created with its static create() method.
Class HTable
HTable is an HBase internal class that represents an HBase table. It is an implementation of Table that is
used to communicate with a single HBase table. This class belongs to the
org.apache.hadoop.hbase.client package.
Constructors
S.No. Constructors and Description
1 HTable()

Class Put
This class is used to perform Put operations for a single row. It belongs to the
org.apache.hadoop.hbase.client package.
Constructors
S.No. Constructors and Description
1 Put(byte[] row)
Using this constructor, you can create a Put operation for the specified row.
2 Put(byte[] rowArray, int rowOffset, int rowLength)
Using this constructor, you can make a copy of the passed-in row key to keep local.
3 Put(byte[] rowArray, int rowOffset, int rowLength, long ts)
Using this constructor, you can make a copy of the passed-in row key to keep local, using the given
timestamp.
4 Put(byte[] row, long ts)
Using this constructor, you can create a Put operation for the specified row, using the given
timestamp.
Methods
S.No. Methods and Description
1 add(byte[] family, byte[] qualifier, byte[] value)
Using this method, you can add the specified column family, column qualifier (column name), and
value to this Put operation.
Class Get
This class is used to perform Get operations on a single row. This class belongs to the
org.apache.hadoop.hbase.client package.
Class Result
This class is used to get a single row result of a Get or a Scan query.
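To show how the Get and Result classes fit together, here is a minimal read-side sketch (same pre-1.0 client API as the insert example below); it assumes the emp table and the personal column family used in that example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class RetrieveData {
   public static void main(String[] args) throws IOException {
      Configuration config = HBaseConfiguration.create();
      HTable table = new HTable(config, "emp");

      // Get operates on a single row, identified by its row key
      Get g = new Get(Bytes.toBytes("row1"));
      Result result = table.get(g);

      // Result holds the cells of the returned row; every value is a byte[]
      byte[] name = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
      byte[] city = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("city"));
      System.out.println("name: " + Bytes.toString(name) + ", city: " + Bytes.toString(city));

      table.close();
   }
}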
HBase Example: Creating Data in a Table
This section demonstrates how to create data in an HBase table. To create data in an HBase table, the
following commands and methods are used:
put command,
add() method of Put class, and
put() method of HTable class.
Let us insert the first row values into the emp table using the put command in the HBase shell; the
command takes the table name, row key, column family:qualifier, and value.
Insert the remaining rows using the put command in the same way. If you insert the whole table and scan
it, you will get output of the following form (abridged; only a few of the returned cells are shown):
ROW COLUMN+CELL
1 column=personal data:city, timestamp=1417524216501, value=hyderabad
... value=manager
... value=sr:engg
... value=jr:engg
3 column=professional data:salary, timestamp=1417524702514, value=25000
You can insert data into HBase using the add() method of the Put class and save it using the put()
method of the HTable class. These classes belong to the org.apache.hadoop.hbase.client package. The
steps to create data in a table of HBase are given below.
The Configuration class adds HBase configuration files to its object. You can create a configuration
object using the create() method of the HBaseConfiguration class as shown below.
You have a class called HTable, an implementation of Table in HBase. This class is used to
communicate with a single HBase table. While instantiating this class, it accepts configuration object
and table name as parameters. You can instantiate HTable class as shown below.
To insert data into an HBase table, the add() method and its variants are used. This method belongs to
the Put class, so a Put instance must be created. The Put constructor requires the row key of the row into
which you want to insert the data, as a byte array (a String key is typically converted with
Bytes.toBytes()). You can instantiate the Put class as shown below.
The add() method of Put class is used to insert data. It requires 3 byte arrays representing column
family, column qualifier (column name), and the value to be inserted, respectively. Insert data into the
HBase table using the add() method as shown below.
After inserting the required values, save the changes by passing the Put instance to the put() method of
the HTable class as shown below.
hTable.put(p);
After creating data in the HBase Table, close the HTable instance using the close() method as shown
below.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class InsertData {
   public static void main(String[] args) throws IOException {

      // Instantiating Configuration class
      Configuration config = HBaseConfiguration.create();

      // Instantiating HTable class
      HTable hTable = new HTable(config, "emp");

      // Instantiating Put class; accepts a row key
      Put p = new Put(Bytes.toBytes("row1"));

      // Adding values using add(); accepts column family, qualifier (column name), value
      p.add(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
      p.add(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("hyderabad"));
      p.add(Bytes.toBytes("professional"), Bytes.toBytes("designation"), Bytes.toBytes("manager"));
      p.add(Bytes.toBytes("professional"), Bytes.toBytes("salary"), Bytes.toBytes("50000"));

      // Saving the Put instance to the HTable
      hTable.put(p);
      System.out.println("data inserted");

      // Closing the HTable
      hTable.close();
   }
}
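Note that HTable and Put.add() belong to the older (pre-1.0) HBase client API and are deprecated in recent releases. A rough modern equivalent, using the Connection/Table API and addColumn(), might look like the following sketch (same emp table and columns as above):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class InsertDataModern {
   public static void main(String[] args) throws IOException {
      Configuration config = HBaseConfiguration.create();
      try (Connection connection = ConnectionFactory.createConnection(config);
           Table table = connection.getTable(TableName.valueOf("emp"))) {
         Put p = new Put(Bytes.toBytes("row1"));
         // addColumn() replaces the deprecated add(); arguments are family, qualifier, value
         p.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("raju"));
         p.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("city"), Bytes.toBytes("hyderabad"));
         table.put(p);
      }
   }
}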
5.5 Cassandra
The data model of Cassandra is significantly different from what we normally see in an RDBMS. This
section provides an overview of how Cassandra stores its data.
Cluster
Cassandra database is distributed over several machines that operate together. The outermost container
is known as the Cluster. For failure handling, every node contains a replica, and in case of a failure, the
replica takes charge. Cassandra arranges the nodes in a cluster, in a ring format, and assigns data to
them.
Keyspace
Keyspace is the outermost container for data in Cassandra. The basic attributes of a Keyspace in
Cassandra are −
Replication factor − It is the number of machines in the cluster that will receive copies of the
same data.
Replica placement strategy − It is nothing but the strategy to place replicas in the ring. We
have strategies such as simple strategy (rack-unaware strategy), old network topology strategy
(rack-aware strategy), and network topology strategy (datacenter-aware strategy).
Column families − Keyspace is a container for a list of one or more column families. A column
family, in turn, is a container of a collection of rows. Each row contains ordered columns.
Column families represent the structure of your data. Each keyspace has at least one and often
many column families.
The syntax of creating a Keyspace is as follows −
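The CQL statement itself is not reproduced in these notes; its general form is CREATE KEYSPACE <name> WITH replication = {'class': '<strategy>', 'replication_factor': <n>}. A minimal sketch issuing it through the DataStax Java driver follows; the keyspace name demo_ks, the contact point, and the replication settings are illustrative assumptions.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class CreateKeyspaceExample {
   public static void main(String[] args) {
      // Connect to a local Cassandra node (the address is an assumption)
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      Session session = cluster.connect();

      // Keyspace with SimpleStrategy and a replication factor of 3
      session.execute("CREATE KEYSPACE IF NOT EXISTS demo_ks "
            + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}");
      cluster.close();
   }
}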
Column Family
A column family is a container for an ordered collection of rows. Each row, in turn, is an ordered
collection of columns. The following table lists the points that differentiate a column family from a table
of relational databases.
Relational Table: A schema in a relational model is fixed. Once we define certain columns for a table, while inserting data, in every row all the columns must be filled at least with a null value.
Cassandra Column Family: In Cassandra, although the column families are defined, the columns are not. You can freely add any column to any column family at any time.
Relational Table: Relational tables define only columns, and the user fills in the table with values.
Cassandra Column Family: In Cassandra, a table contains columns, or can be defined as a super column family.
A Cassandra column family has the following attributes −
Note − Unlike a relational table, whose schema is fixed, a Cassandra column family does not have a
fixed schema: Cassandra does not force individual rows to have all the columns.
Column
A column is the basic data structure of Cassandra, with three parts: a key (the column name), a value,
and a timestamp. Conceptually, a column is the triplet (name, value, timestamp).
SuperColumn
A super column is a special column, therefore, it is also a key-value pair. But a super column stores a
map of sub-columns.
Generally, column families are stored on disk in individual files. Therefore, to optimize performance, it
is important to keep columns that you are likely to query together in the same column family, and a
super column can be helpful here. Conceptually, a super column is a (name, map of sub-columns) pair.
Data Models of Cassandra and RDBMS
The following table lists down the points that differentiate the data model of Cassandra from that of an
RDBMS.
RDBMS: RDBMS deals with structured data.
Cassandra: Cassandra deals with unstructured data.
RDBMS: In RDBMS, a table is an array of arrays (ROW x COLUMN).
Cassandra: In Cassandra, a table is a list of "nested key-value pairs" (ROW x COLUMN key x COLUMN value).
RDBMS: A database is the outermost container that contains data corresponding to an application.
Cassandra: A keyspace is the outermost container that contains data corresponding to an application.
RDBMS: Tables are the entities of a database.
Cassandra: Tables or column families are the entities of a keyspace.
Suppose that we are storing Facebook posts of different users in Cassandra. One of the common query
patterns will be fetching the top ‘N‘ posts made by a given user.
Thus, we need to store all data for a particular user on a single partition as per the above guidelines.
Also, using the post timestamp as the clustering key will be helpful for retrieving the top ‘N‘ posts more
efficiently.
Let’s define the Cassandra table schema for this use case:
Now, let’s write a query to find the top 20 posts for the user Anna:
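The original schema and query are not reproduced in these notes; the sketch below shows one possible design through the DataStax Java driver. The keyspace social and the table/column names (posts_by_user, username, post_timestamp, content) are illustrative assumptions; the user is the partition key and the post timestamp is the clustering key in descending order, following the guidelines above.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class TopPostsExample {
   public static void main(String[] args) {
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      Session session = cluster.connect("social");   // keyspace name is an assumption

      // All posts of one user live on a single partition;
      // clustering by timestamp (DESC) keeps the newest posts first.
      session.execute("CREATE TABLE IF NOT EXISTS posts_by_user ("
            + "username text, post_timestamp timestamp, content text, "
            + "PRIMARY KEY ((username), post_timestamp)) "
            + "WITH CLUSTERING ORDER BY (post_timestamp DESC)");

      // Top 20 posts for the user Anna: a single-partition read, already in the right order
      ResultSet rs = session.execute(
            "SELECT post_timestamp, content FROM posts_by_user WHERE username = 'Anna' LIMIT 20");
      for (Row row : rs) {
         System.out.println(row.getTimestamp("post_timestamp") + " : " + row.getString("content"));
      }
      cluster.close();
   }
}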
Suppose that we are storing the details of different partner gyms across the different cities and states of
many countries and we would like to fetch the gyms for a given city.
Also, let’s say we need to return the results having gyms sorted by their opening date.
Based on the above guidelines, we should store the gyms located in a given city of a specific state and
country on a single partition and use the opening date and gym name as a clustering key.
Now, let’s look at a query that fetches the first ten gyms by their opening date for the city of Phoenix
within the U.S. state of Arizona:
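The table and query are not reproduced in these notes; the sketch below shows one possible version via the DataStax Java driver. The keyspace partner_gyms and the column names (country_code, state, city, gym_name, opening_date) are illustrative assumptions; the partition key combines country, state, and city, and the clustering key is the opening date followed by the gym name.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class GymsByCityExample {
   public static void main(String[] args) {
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      Session session = cluster.connect("partner_gyms");   // keyspace name is an assumption

      // Gyms of one (country, state, city) share a partition;
      // rows are clustered by opening date and gym name (ascending by default).
      session.execute("CREATE TABLE IF NOT EXISTS gyms_by_city ("
            + "country_code text, state text, city text, gym_name text, opening_date timestamp, "
            + "PRIMARY KEY ((country_code, state, city), opening_date, gym_name))");

      // First ten gyms for Phoenix, Arizona, US, newest opening date first.
      // ORDER BY ... DESC reverses the table's clustering order (see the note below).
      ResultSet rs = session.execute(
            "SELECT gym_name, opening_date FROM gyms_by_city "
            + "WHERE country_code = 'US' AND state = 'Arizona' AND city = 'Phoenix' "
            + "ORDER BY opening_date DESC LIMIT 10");
      for (Row row : rs) {
         System.out.println(row.getString("gym_name") + " opened " + row.getTimestamp("opening_date"));
      }
      cluster.close();
   }
}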
Note: As the last query's sort order (descending by opening date) is opposite of the clustering order
defined during table creation, the query will run slower because Cassandra has to read the partition in
reverse order.
Let's say we are running an e-commerce store and that we are storing the Customer and Product
information within Cassandra. Common query patterns around this use case include looking up
customers and products by their identifiers, finding the customers who recently liked a given product,
and finding the products recently liked by a given customer.
We will start by using separate tables for storing the Customer and Product information. However, we
need to introduce a fair amount of denormalization to support the last two query patterns.
We will create two more tables to achieve this – "Customer_by_Product" and "Product_by_Customer".
Note: To support both the queries, recently-liked products by a given customer and customers who
recently liked a given product, we have used the “liked_on” column as a clustering key.
Let’s look at the query to find the ten Customers who most recently liked the product “Pepsi“:
And let’s see the query that finds the recently-liked products (up to ten) by a customer named “Anna“:
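The two queries are not reproduced in these notes; the sketch below shows plausible versions through the DataStax Java driver. The keyspace store and the column names customer_name and product_name are illustrative assumptions; liked_on is the clustering key mentioned in the note above, assumed to be ordered descending so that LIMIT 10 returns the most recent likes.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class RecentLikesExample {
   public static void main(String[] args) {
      Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
      Session session = cluster.connect("store");   // keyspace name is an assumption

      // Ten customers who most recently liked the product "Pepsi"
      // (Customer_by_Product is partitioned by product_name and clustered by liked_on DESC)
      ResultSet byProduct = session.execute(
            "SELECT customer_name, liked_on FROM customer_by_product "
            + "WHERE product_name = 'Pepsi' LIMIT 10");
      for (Row row : byProduct) {
         System.out.println(row.getString("customer_name") + " @ " + row.getTimestamp("liked_on"));
      }

      // Up to ten products most recently liked by the customer "Anna"
      // (Product_by_Customer is partitioned by customer_name and clustered by liked_on DESC)
      ResultSet byCustomer = session.execute(
            "SELECT product_name, liked_on FROM product_by_customer "
            + "WHERE customer_name = 'Anna' LIMIT 10");
      for (Row row : byCustomer) {
         System.out.println(row.getString("product_name") + " @ " + row.getTimestamp("liked_on"));
      }
      cluster.close();
   }
}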
Cassandra Clients
A Cassandra client is used to connect to, manage, and develop against a Cassandra database. A database
client lets you manage your Cassandra database with actions such as inserting, updating, and deleting
data in tables.
Apache Cassandra® is a free and open-source, distributed, wide column store, NoSQL database
management system designed to handle large amounts of data across many commodity servers,
providing high availability with no single point of failure.
This quickstart guide shows how to build a REST application using the Cassandra Quarkus extension,
which allows you to connect to an Apache Cassandra, DataStax Enterprise (DSE) or DataStax Astra
database, using the DataStax Java driver.
This is a simplistic example of connecting to a trial Cassandra cluster, creating a time-series data table,
filling it with realistic-looking data, querying it, and saving the results into a CSV file for graphing (a
condensed sketch is given at the end of this description). To customise the code for your cluster, change
the public IP addresses, and provide the data centre name and user/password details (it is safest to use a
non-superuser).
The Cluster.builder() call uses a fluent API to configure the client with the IP addresses, load balancing
policy, port number and user/password information. Fluent programming is all about the cascading of
method invocations, and fluent-style APIs are common in modern Java (for example, Java 8 Streams).
This is a very simple configuration which I'll revisit in the future with the Instaclustr recommended
settings for production clusters.
The program then builds the cluster object, gets the metadata and prints out the host and cluster
information, and then creates a session by connecting. You have the option of dropping the test table if it
already exists or adding data to the existing table.
Next, we fill the table with some realistic time series sensor data. You can change how many host names
(100 by default) are used, and how many timestamps are generated; for each timestamp, three metrics
and their values are inserted. There are several types of statements in the Java client, including simple and prepared
statements. In theory prepared statements are faster so there’s an option to use either in the code. In
practice it seems that prepared statements may not improve response time significantly but may be
designed to improve throughput. Realistic looking data is generated by a simple random walk.
The code illustrates some possible queries (SELECTs), including a simple aggregate function (max) and
retrieving all the values for one host/metric combination, finding all host/metric permutations (to assist
with subsequent queries as we made the primary key a compound key of host and metric so both are
needed to select on), and finally retrieving the whole table and reporting the number of rows and total
bytes returned.
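The full program referred to above is not included in these notes. The following condensed sketch, assuming a local contact point, placeholder credentials, a keyspace named timeseries, and a table hosts(host, metric, time, value), illustrates the pieces the description mentions: the fluent Cluster.builder() configuration, cluster metadata, a prepared INSERT driven by a simple random walk, and a simple aggregate SELECT (max) for one host/metric combination.

import java.util.Date;
import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Metadata;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;

public class TimeSeriesDemo {
   public static void main(String[] args) {
      // Fluent configuration: contact point, load balancing policy, port and credentials
      // chained on the builder (replace the placeholders with your cluster's details).
      Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            .withLoadBalancingPolicy(DCAwareRoundRobinPolicy.builder().withLocalDc("datacenter1").build())
            .withPort(9042)
            .withCredentials("demo_user", "demo_password")   // non-superuser placeholder
            .build();

      // Cluster and host metadata, printed before connecting a session.
      Metadata metadata = cluster.getMetadata();
      System.out.println("Connected to cluster: " + metadata.getClusterName());
      for (Host host : metadata.getAllHosts()) {
         System.out.println("Host " + host.getAddress() + " in DC " + host.getDatacenter());
      }

      Session session = cluster.connect("timeseries");   // keyspace name is an assumption
      session.execute("CREATE TABLE IF NOT EXISTS hosts ("
            + "host text, metric text, time timestamp, value double, "
            + "PRIMARY KEY ((host, metric), time))");

      // Prepared statement: parsed once, then bound and executed many times.
      PreparedStatement insert = session.prepare(
            "INSERT INTO hosts (host, metric, time, value) VALUES (?, ?, ?, ?)");
      double value = 50.0;
      long start = System.currentTimeMillis();
      for (int i = 0; i < 10; i++) {
         value += Math.random() - 0.5;                    // simple random walk
         BoundStatement bound = insert.bind("host1", "cpu", new Date(start + i * 1000L), value);
         session.execute(bound);
      }

      // Simple aggregate: maximum value for one host/metric combination.
      ResultSet rs = session.execute(
            "SELECT max(value) AS max_value FROM hosts WHERE host = 'host1' AND metric = 'cpu'");
      Row row = rs.one();
      System.out.println("max value = " + row.getDouble("max_value"));
      cluster.close();
   }
}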
Hadoop Integration
The Hadoop architecture is designed to be easily integrated with other systems. Integration is very
important because, although we can process data efficiently in Hadoop, we should also be able to send
the results to other systems to take the data to the next level. Data has to be integrated with other systems
to achieve interoperability and flexibility.
The following figure depicts the Hadoop system integrated with different systems and with some
implemented tools for reference:
Hadoop Integration with other systems
Hadoop Distributed File System (HDFS) – It provides a robust distributed file system. Hadoop
also includes YARN, a framework for job scheduling and cluster resource management.
Hadoop MapReduce – It is a system for parallel processing of large data sets that implements the
MapReduce model of distributed programming.
Hadoop provides simple distributed storage through HDFS and an analysis system through MapReduce.
Its architecture is designed to scale up or down, from one server to hundreds or thousands of computers,
as the user requires, with a high degree of fault tolerance. Hadoop has proven itself in big data processing
and efficient storage management; it provides practically unlimited scalability and is supported by major
vendors in the software industry.
Why integrate R with Hadoop?
R is an open-source programming language best suited for statistical and graphical analysis. If we need
strong data analytics and visualization features on big data, we can combine R with Hadoop.
Hadoop and R complement each other very well in terms of big data visualization and analytics. There
are four ways of using Hadoop and R together; two of them – RHadoop and Hadoop Streaming – are
described below:
R Hadoop
RHadoop is a collection of R packages; it contains three main packages: rmr, rhbase, and rhdfs.
The rmr package provides MapReduce functionality for the Hadoop framework by letting you write the
mapping and reducing code in R.
The rhbase package provides R database management capability with HBase integration, and the rhdfs
package provides HDFS file management from R.
Hadoop Streaming
Hadoop Streaming is a utility that allows users to create and run MapReduce jobs with any executable or
script as the mapper and/or the reducer. Using the streaming system, we can develop working Hadoop
jobs with only minimal Java knowledge – for example, by writing the mapper and reducer as two R
scripts that work in tandem.
The combination of R and Hadoop appears as a must-have toolkit for people working with large data
sets and statistics. However, some Hadoop enthusiasts have raised a red flag when dealing with very
large Big Data excerpts.
They claim that the benefit of R is not its syntax, but the entire library of primitives for visualization and
data. These libraries are fundamentally non-distributed, making data retrieval a time-consuming affair.
This is an inherent flaw with R, and if you choose to ignore it, both R and Hadoop can work together.
5.9 Hive
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of
Hadoop to summarize Big Data, and makes querying and analyzing easy.
Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and
developed it further as open source under the name Apache Hive. It is used by different companies.
For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive
Architecture of Hive
The main components of Hive and their roles are:
Meta Store: Hive chooses respective database servers to store the schema or metadata of tables,
databases, columns in a table, their data types, and the HDFS mapping.
HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It
is one of the replacements of the traditional approach of writing a MapReduce program: instead of
writing a MapReduce program in Java, we can write a HiveQL query and have it processed as a
MapReduce job.
HDFS or HBASE: The Hadoop Distributed File System or HBase is the data storage technique used to
store the data into the file system.
Working of Hive
The workflow between Hive and Hadoop is, in outline, as follows: a query is submitted through a Hive
interface (CLI or Web UI) to the driver; the driver, with the help of the compiler, checks the syntax and
builds a query plan, fetching the required metadata from the Metastore; the execution engine then runs
the plan as a MapReduce job on Hadoop; finally, the results are fetched and returned to the user.
Hive Data Types
This section takes you through the different data types in Hive, which are involved in table creation. All
the data types in Hive are classified into four types, as follows:
Column Types
Literals
Null Values
Complex Types
Column Types
Column types are used as the column data types of Hive tables. They are as follows:
Integral Types
Integer type data can be specified using integral data types, INT. When the data range exceeds the range
of INT, you need to use BIGINT and if the data range is smaller than the INT, you use SMALLINT.
TINYINT is smaller than SMALLINT.
Type      Postfix   Example
TINYINT   Y         10Y
SMALLINT  S         10S
INT       –         10
BIGINT    L         10L
String Types
String type data can be specified using single quotes (' ') or double quotes (" "). Hive has two string data
types: VARCHAR and CHAR. Hive follows C-style escape characters.
Data Type   Length
VARCHAR     1 to 65535
CHAR        255
Timestamp
It supports the traditional UNIX timestamp with optional nanosecond precision, in the format
"YYYY-MM-DD HH:MM:SS.fffffffff".
Dates
DATE values describe a particular year/month/day in the form YYYY-MM-DD.
Decimals
The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing
immutable arbitrary-precision decimal numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
Union Types
Union is a collection of heterogeneous data types. You can create an instance using create_union. For
example, a column of type UNIONTYPE<int, double, array<string>, struct<a:int,b:string>> might hold
values such as the following, where the leading tag indicates which member of the union is set:
{0:1}
{1:2.0}
{2:["three","four"]}
{3:{"a":5,"b":"five"}}
{2:["six","seven"]}
{3:{"a":8,"b":"eight"}}
{0:9}
{1:10.0}
Literals
Floating Point Types
Floating point types are numbers with decimal points. Generally, this type of data is stored as the
DOUBLE data type.
Decimal Type
Decimal type data is a floating-point value with a higher range than the DOUBLE data type. The range
of the decimal type is approximately -10^308 to 10^308.
Null Value
Missing values are represented by the special value NULL.
Complex Types
Arrays
Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Maps
Maps in Hive are similar to Java maps. Syntax: MAP<primitive_type, data_type>
Structs
A struct groups together named fields, possibly of different types. Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
Hive File Formats
ORC:
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data in a Hive
table. This file format was designed to overcome limitations of the other Hive file formats. ORC reduces
I/O overhead by accessing only the columns that are required for the current query. It requires
significantly fewer seek operations because all columns within a single group of row data are stored
together on disk. The ORC file format storage option is selected by specifying
"STORED AS ORC" at the end of the table creation.
6) PARQUET:
Parquet is a column-oriented binary file format. It is open source, available to any project in the Hadoop
ecosystem, and designed for effective and efficient flat columnar data storage compared with row-based
files such as CSV or TSV. It reads only the necessary columns, which significantly reduces I/O and thus
makes it highly efficient for large-scale queries. Parquet tables use Snappy, a fast data compression and
decompression library, as the default compression.
The parquet file format storage option is defined by specifying “STORED AS PARQUET” at the end
of the table creation.
We can implement our own "input format" and "output format" in case the data comes in a different
format. These "input format" and "output format" concepts are similar to Hadoop MapReduce's input
and output formats.
HiveQL Data Definition
Data Definition Language (DDL) is used to describe data and data structures of a database. Hive has its
own DDL, similar to SQL DDL, which is used for creating, altering, and dropping databases, tables, and
other objects in a database.
Similar to other SQL databases, Hive databases also contain namespaces for tables. If
the name of the database is not specified, the table is created in the default database.
Creating Databases
In order to create a database, we can use the CREATE DATABASE command. A table can then be
created inside a specific database by adding the database name before the table name; for example,
creating temp_table with the temp_database prefix adds temp_table to temp_database. In addition, you
can also create a table in a database by first using the USE statement, which sets the current database for
all subsequent HiveQL statements; in this way, you do not need to add the name of the database before
the table name, and a plain CREATE TABLE temp_table statement then creates the table in
temp_database. Furthermore, you can specify DBPROPERTIES in the form of key-value pairs at
database-creation time (a JDBC sketch issuing these statements appears after the Creating Tables
subsection below).
Viewing a Database
You can view all the databases present by using the SHOW DATABASES command.
Dropping a Database
Dropping a database means deleting it from its storage location; this is done with the DROP DATABASE
command.
Creating Tables
You can create a table in a database by using the CREATE TABLE command, as discussed earlier. Now,
let's learn how to provide the complete definition of a table in a database. In the commands sketched
below, temp_database is first set with USE and then the table is created with the name employee. In the
employee table, the columns (ename, salary, and designation) are specified with their respective data
types. TBLPROPERTIES is a set of key-value properties. Comments can also be added to the table and
its columns to provide more details.
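The HiveQL statements referred to in the last few paragraphs are not reproduced in these notes. The following sketch issues representative versions of them through Hive's JDBC driver; the HiveServer2 address, the column types, and the property values are illustrative assumptions based on the descriptions above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDdlExample {
   public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "");
      Statement stmt = con.createStatement();

      // Create a database, optionally with DBPROPERTIES (key-value pairs)
      stmt.execute("CREATE DATABASE IF NOT EXISTS temp_database "
            + "WITH DBPROPERTIES ('creator' = 'notes', 'purpose' = 'demo')");

      // Either qualify the table name with the database name ...
      stmt.execute("CREATE TABLE IF NOT EXISTS temp_database.temp_table (id INT, name STRING)");

      // ... or set the current database with USE and create tables without the prefix
      stmt.execute("USE temp_database");
      stmt.execute("CREATE TABLE IF NOT EXISTS employee ("
            + "ename STRING COMMENT 'employee name', "
            + "salary FLOAT, "
            + "designation STRING) "
            + "COMMENT 'employee details' "
            + "TBLPROPERTIES ('created_by' = 'notes')");

      // Viewing databases (SHOW DATABASES); DROP DATABASE would remove one
      ResultSet rs = stmt.executeQuery("SHOW DATABASES");
      while (rs.next()) {
         System.out.println(rs.getString(1));
      }
      // stmt.execute("DROP DATABASE temp_database CASCADE");   // deletes the database and its tables
      con.close();
   }
}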
Altering Tables
Existing tables can be altered, for example to:
1. Rename tables (e.g., ALTER TABLE employee RENAME TO emp)
2. Modify columns (e.g., ALTER TABLE emp CHANGE COLUMN salary salary DOUBLE)
HiveQL Data Manipulation
After specifying the database schema and creating a database, the data can be modified by using a set of
procedures/mechanisms defined by a special language known as Data Manipulation Language (DML).
While loading data into tables, Hive does not perform any transformations. The data load operations in
Hive are, at present, pure copy/move operations, which move data files from one location to another.
You can load data into Hive tables from the local file system as well as from HDFS. The general syntax
of loading data from files into tables is LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE]
INTO TABLE tablename [PARTITION (partcol1=val1, ...)].
When the LOCAL keyword is specified in the LOAD DATA command, Hive
searches for the local directory. If the LOCAL keyword is not used, Hive checks the
directory on HDFS. On the other hand, when the OVERWRITE keyword is specified,
it deletes all the files under Hive’s warehouse directory for the given table. After that,
the latest files get uploaded. If you do not specify the OVERWRITE keyword, the
latest files are added in the already existing folder.
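A sketch of the LOAD DATA forms just described, again over JDBC; the file paths and the table name sales are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveLoadDataExample {
   public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/temp_database", "", "");
      Statement stmt = con.createStatement();

      // LOCAL: the file is taken from the local file system of the HiveServer2 host
      stmt.execute("LOAD DATA LOCAL INPATH '/tmp/sales.txt' INTO TABLE sales");

      // Without LOCAL the path is looked up on HDFS; OVERWRITE replaces the table's existing files
      stmt.execute("LOAD DATA INPATH '/user/hive/input/sales.txt' OVERWRITE INTO TABLE sales");

      con.close();
   }
}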
Inserting Data into Tables
Data can also be inserted using the INSERT statement. Its general forms are INSERT OVERWRITE
TABLE tablename [PARTITION (partcol1=val1, ...) [IF NOT EXISTS]] select_statement FROM
from_statement and INSERT INTO TABLE tablename [PARTITION (partcol1=val1, ...)]
select_statement FROM from_statement. The INSERT OVERWRITE statement overwrites the current
data in the table or partition; the IF NOT EXISTS option applies to a partition.
On the other hand, the INSERT INTO statement either appends the table or creates a
partition without modifying the existing data. The insert operation can be performed
on a table or a partition. You can also specify multiple insert clauses in the same
query.
Consider two tables, T1 and T2. We want to copy the sal column from T2 to T1 by using the INSERT
command. This, together with the partitioned inserts discussed next, is shown in the combined sketch at
the end of this subsection.
Static Partition Insertion
Static partition insertion refers to the task of inserting data into a table by specifying a partition column
value explicitly (see the combined sketch below).
Dynamic Partition Insertion
In dynamic partition insertion, you need to specify a list of partition column names in
the PARTITION() clause along with the optional column values. A dynamic partition
column always has a corresponding input column in the SELECT statement. If the
SELECT statement has multiple column names, the dynamic partition columns must
be specified at the end of the columns and in the same order in which they appear in
the PARTITION() clause. By default, the feature of dynamic partition is disabled.
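A combined sketch of the INSERT variants discussed above, over JDBC. The tables T1 and T2, the partitioned table emp_part (partitioned by a country column), and an employee source table assumed to have a country column are all illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveInsertExamples {
   public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/temp_database", "", "");
      Statement stmt = con.createStatement();

      // Copy the sal column from T2 into T1 (INSERT ... SELECT)
      stmt.execute("INSERT OVERWRITE TABLE T1 SELECT sal FROM T2");

      // Static partition insertion: the partition column value ('US') is given explicitly
      stmt.execute("INSERT OVERWRITE TABLE emp_part PARTITION (country = 'US') "
            + "SELECT ename, salary FROM employee WHERE country = 'US'");

      // Dynamic partition insertion: the partition column comes last in the SELECT list;
      // the feature is disabled by default, so it has to be switched on first.
      stmt.execute("SET hive.exec.dynamic.partition = true");
      stmt.execute("SET hive.exec.dynamic.partition.mode = nonstrict");
      stmt.execute("INSERT OVERWRITE TABLE emp_part PARTITION (country) "
            + "SELECT ename, salary, country FROM employee");

      con.close();
   }
}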
Inserting Data into Local Files
Sometimes, you might need to save the result of a SELECT query in flat files so that you do not have to
execute the query again and again. This is done with the INSERT OVERWRITE [LOCAL]
DIRECTORY statement, as in the following example:
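A short sketch of exporting query results to a local directory; the directory path and the sales table (with a region column) are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveExportExample {
   public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/temp_database", "", "");
      Statement stmt = con.createStatement();

      // Writes the query result as flat files into the given local directory
      stmt.execute("INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sales_us' "
            + "SELECT * FROM sales WHERE region = 'US'");

      con.close();
   }
}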
Delete in Hive
The delete operation is available in Hive from version 0.14 onward. It can only be performed on tables
that support the ACID properties (transactional tables). The general syntax is DELETE FROM tablename
[WHERE expression].
5.13 Data Retrieval queries
Hive allows you to perform data retrieval queries by using the SELECT command along with various
types of operators and clauses. This section covers the SELECT statement, the WHERE, GROUP BY,
and HAVING clauses, and joins.
The SELECT statement is the most common operation in SQL. You can filter the required columns,
rows, or both. The basic syntax of the SELECT command is SELECT [ALL | DISTINCT] select_expr,
... FROM table_reference [WHERE where_condition] [GROUP BY col_list] [HAVING
having_condition] [ORDER BY col_list] [LIMIT number].
Using the WHERE Clause
The WHERE clause is used to search records of a table on the basis of a given condition; the condition is
a boolean expression. For example, a query can return only those sales records that have an amount
greater than 15000 from the US region. Hive also supports a number of comparison operators (such as >
and <) in the WHERE clause. An example query using the WHERE clause (together with GROUP BY
and HAVING) is given in the sketch after the HAVING discussion below.
The GROUP BY clause is used to put all the related records together. It can also be
used with aggregate functions. Often, it is required to group the resultsets in complex
queries.
The HAVING clause is used to specify a condition on the groups produced by the GROUP BY clause.
Support for the HAVING clause was added in Hive version 0.7.0.
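A sketch combining the clauses above: a WHERE filter like the sales example mentioned earlier, a GROUP BY aggregation, and a HAVING condition on the groups. The sales table and its columns (region, amount) are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSelectExamples {
   public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/temp_database", "", "");
      Statement stmt = con.createStatement();

      // WHERE: sales records with an amount greater than 15000 from the US region
      ResultSet r1 = stmt.executeQuery(
            "SELECT * FROM sales WHERE amount > 15000 AND region = 'US'");

      // GROUP BY with an aggregate, plus HAVING to keep only the large regions
      ResultSet r2 = stmt.executeQuery(
            "SELECT region, SUM(amount) AS total FROM sales "
            + "GROUP BY region HAVING SUM(amount) > 100000");
      while (r2.next()) {
         System.out.println(r2.getString("region") + " -> " + r2.getDouble("total"));
      }
      con.close();
   }
}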
Hive supports joining two or more tables to combine their information. The various joins supported by
Hive are:
Inner joins
Outer joins
Inner Joins
In the case of inner joins, only the records satisfying the given condition get selected; all the other
records get discarded.
Let’s take an example to describe the concept of inner joins. Consider two tables,
order and customer
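An inner-join sketch for the two tables just mentioned; the table is named orders here (to avoid HiveQL's reserved word ORDER), and the column names (order_id, customer_id, customer_name, amount) are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveInnerJoinExample {
   public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/temp_database", "", "");
      Statement stmt = con.createStatement();

      // Only rows with a matching customer_id in BOTH tables are returned
      ResultSet rs = stmt.executeQuery(
            "SELECT c.customer_name, o.order_id, o.amount "
            + "FROM orders o JOIN customer c ON (o.customer_id = c.customer_id)");
      while (rs.next()) {
         System.out.println(rs.getString(1) + " | " + rs.getString(2) + " | " + rs.getDouble(3));
      }
      con.close();
   }
}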
Sometimes, you need to retrieve all the records from one table and only some records from the other
table. In such cases, you have to use an outer join.
Right Outer Join
In this type of join, all the records from the table on the right side of the join are retained.
Left Outer Join
In this type of join, all the records from the table on the left side of the join are
retained.
Full Outer Join
In this case, all the fields from both tables are included. For the entries that do not have any match, a
NULL value is displayed.
Cartesian Product Joins
In cartesian product joins, all the records of one table are combined with another
table in all possible combinations.
This type of join does not involve any key column to join the tables. The following sketch shows a query
with a Cartesian product (cross) join:
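A Cartesian (cross) product sketch for the same assumed orders and customer tables; every row of one table is paired with every row of the other, so no join key appears.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveCrossJoinExample {
   public static void main(String[] args) throws Exception {
      Class.forName("org.apache.hive.jdbc.HiveDriver");
      Connection con = DriverManager.getConnection("jdbc:hive2://localhost:10000/temp_database", "", "");
      Statement stmt = con.createStatement();

      // Every order is combined with every customer (no ON condition)
      ResultSet rs = stmt.executeQuery(
            "SELECT o.order_id, c.customer_name FROM orders o CROSS JOIN customer c");
      while (rs.next()) {
         System.out.println(rs.getString(1) + " | " + rs.getString(2));
      }
      con.close();
   }
}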
Joining Tables
You can combine the data of two or more tables in Hive by using HiveQL queries.
For this, we need to create tables and load them into Hive from HDFS.