Unit 5 (BDC)

The document provides an overview of Hive, an open-source data warehouse infrastructure built on Hadoop for processing structured data using SQL-like queries. It explains Hive's architecture, including components like the Metastore, HiveQL process engine, and execution engine, as well as data types, partitioning, and bucketing techniques. Additionally, it introduces HBase, a distributed column-oriented database that offers random access to large datasets within the Hadoop ecosystem.


Hive Introduction

• The term 'Big Data' is used for collections of large datasets characterized by huge volume, high velocity, and a wide variety of data that grows day by day. Such data is difficult to process using traditional data management systems. Therefore, the Apache Software Foundation introduced a framework called Hadoop to solve Big Data management and processing challenges.
Hadoop
• Hadoop is an open-source framework to store and process Big Data in a distributed environment. It contains two core modules: MapReduce and the Hadoop Distributed File System (HDFS).
• MapReduce: It is a parallel programming model for processing large amounts of
structured, semi-structured, and unstructured data on large clusters of commodity
hardware.
• HDFS: The Hadoop Distributed File System is part of the Hadoop framework and is used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.
Hive Introduction
• The Hadoop ecosystem contains different sub-projects (tools) such as
Sqoop, Pig, and Hive that are used to help Hadoop modules.
• Sqoop: It is used to import and export data between HDFS and relational databases (RDBMS).
• Pig: It is a procedural language platform used to develop a script for
MapReduce operations.
• Hive: It is a platform used to develop SQL-like scripts to perform MapReduce operations.
Hive Introduction
• Note: There are various ways to execute MapReduce operations:
• The traditional approach using Java MapReduce program for
structured, semi-structured, and unstructured data.
• The scripting approach for MapReduce to process structured and semi-structured data using Pig.
• The Hive Query Language (HiveQL or HQL) for MapReduce to process
structured data using Hive.
Hive Introduction
• Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
• Initially Hive was developed by Facebook; later the Apache Software Foundation took it up and developed it further as an open-source project under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Hive Introduction
Features of Hive
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP (Online Analytical Processing).
• It provides an SQL-like query language called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
Architecture of Hive
• The following component diagram depicts the architecture of Hive:
Architecture of Hive
• This component diagram contains different units. The following table describes each unit:

User Interface: Hive is data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HD Insight (on Windows Server).

Metastore: Hive chooses the respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.

HiveQL Process Engine: HiveQL is similar to SQL and is used for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program. Instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and let Hive process it.

Execution Engine: The conjunction of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates results the same as MapReduce results. It uses the flavor of MapReduce.

HDFS or HBase: The Hadoop Distributed File System or HBase is the data storage technique used to store data into the file system.
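
For instance, a HiveQL query like the one below (table and column names are illustrative, not from the original slides) is compiled by the process engine and run by the execution engine as a MapReduce job, instead of hand-writing the job in Java:

hive> SELECT course, COUNT(*) FROM student GROUP BY course;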
Working of Hive
• The following diagram depicts the workflow between Hive and
Hadoop.
• The following steps define how Hive interacts with the Hadoop framework:
1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan or the requirement of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.
6. Execute Plan: The driver sends the execute plan to the execution engine.
7. Execute Job: Internally, the process of executing the job is a MapReduce job. The execution engine sends the job to the JobTracker, which is in the Name node, and it assigns this job to the TaskTracker, which is in the Data node. Here, the query executes the MapReduce job.
7.1 Metadata Ops: Meanwhile, during execution, the execution engine can execute metadata operations with the Metastore.
8. Fetch Result: The execution engine receives the results from the Data nodes.
9. Send Results: The execution engine sends those resultant values to the driver.
10. Send Results: The driver sends the results to the Hive interfaces.


Hive - Data Types
• This topic takes you through the different data types in Hive, which
are involved in the table creation. All the data types in Hive are
classified into four types, given as follows:
• Column Types
• Literals
• Null Values
• Complex Types
Hive - Data Types
Column Types
• Column types are used as the column data types of Hive. They are as follows:
Integral Types
• Integer data can be specified using the integral data type INT. When the data range exceeds the range of INT, you need to use BIGINT, and if the data range is smaller than that of INT, you use SMALLINT. TINYINT is smaller than SMALLINT.
• The following table depicts various INT data types:
Hive - Data Types
• The following table depicts various INT data types:
Type Postfix Example

TINYINT Y 10Y

SMALLINT S 10S

INT - 10

BIGINT L 10L
Hive - Data Types
• String Types
• String data types can be specified using single quotes (' ') or double quotes (" "). Hive contains two such data types: VARCHAR and CHAR. Hive follows C-style escape characters.
• The following table depicts the string data types:

Data Type Length

VARCHAR 1 to 65535

CHAR 255
Hive - Data Types
Timestamp
• It supports traditional UNIX timestamp with optional nanosecond precision.
It supports java.sql.Timestamp format “YYYY-MM-DD HH:MM:SS.fffffffff”
and format “yyyy-mm-dd hh:mm:ss.ffffffffff”.
Dates
• DATE values are described in year/month/day format in the form
{{YYYY-MM-DD}}.
Decimals
• The DECIMAL type in Hive is the same as the Big Decimal format of Java. It is used for representing immutable arbitrary-precision numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
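
Putting these column types together, a minimal sketch of a table definition (the table and column names here are illustrative assumptions, not from the original slides):

hive> CREATE TABLE sales_sample (
        item_code   TINYINT,
        quantity    SMALLINT,
        item_id     INT,
        view_count  BIGINT,
        item_name   VARCHAR(100),
        item_flag   CHAR(1),
        created_at  TIMESTAMP,
        sale_date   DATE,
        price       DECIMAL(10,2)
      );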
Hive - Data Types
• Union Types
• Union is a collection of heterogeneous data types. You can create an
instance using create union. The syntax and example is as follows:
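
A plausible sketch of the syntax (the table and column names are assumptions):

hive> CREATE TABLE union_demo (
        col1 UNIONTYPE<INT, DOUBLE, ARRAY<STRING>, STRUCT<a:INT, b:STRING>>
      );
-- values for a UNIONTYPE column are typically produced with the built-in create_union() function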
Hive - Data Types
Literals
• The following literals are used in Hive:
Floating Point Types
• Floating point types are nothing but numbers with decimal points. Generally, this type of data is represented by the DOUBLE data type.
Decimal Type
• Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^308 to 10^308.
Null Value
• Missing values are represented by the special value NULL.
Hive - Data Types
• Complex Types
• The Hive complex data types are as follows:
Arrays
• Arrays in Hive are used the same way they are used in Java.

Syntax: ARRAY<data_type>
Hive - Data Types
• Maps
• Maps in Hive are similar to Java Maps.

Syntax: MAP<primitive_type, data_type>


Hive - Data Types
• Structs
• Structs in Hive are similar to using complex data with comments.

Syntax: STRUCT<col_name : data_type [COMMENT col_comment], ...>
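
A short sketch combining the complex types in one table definition (the table, column, and delimiter choices are assumptions for illustration):

hive> CREATE TABLE employee_complex (
        name    STRING,
        skills  ARRAY<STRING>,
        phones  MAP<STRING, STRING>,
        address STRUCT<street:STRING, city:STRING, pin:INT>
      )
      ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      COLLECTION ITEMS TERMINATED BY '|'
      MAP KEYS TERMINATED BY ':';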


What is Hive Partitioning and Bucketing?
• Apache Hive is an open source data warehouse system used for querying
and analyzing large datasets. Data in Apache Hive can be categorized
into Table, Partition, and Bucket. The table in Hive is logically made up of
the data being stored.
Partitioning in Hive
• Since Hadoop is used to handle huge amounts of data, it is important to use the best approach to deal with it. Partitioning in Hive is a good example of this.
• Let's assume we have data on 10 million students studying in an institute. Now, we have to fetch the students of a particular course. If we use a traditional approach, we have to scan the entire data. This leads to performance degradation. In such a case, we can adopt a better approach, i.e., partitioning in Hive, and divide the data into different datasets based on particular columns.
What is Hive Partitioning and Bucketing?
The partitioning in Hive can be executed in two ways –
• Static Partitioning
• Dynamic Partitioning
What is Hive Partitioning and Bucketing?
Static Partitioning
• In static or manual partitioning, it is required to pass the values of
partitioned columns manually while loading the data into the table.
Hence, the data file doesn't contain the partitioned columns.
Example of Static Partitioning
• First, select the database in which we want to create a table.
hive> use test;
• Create the table and provide the partitioned columns by using the
following command: -
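For example (the column names and delimiter are assumptions; the table name student and the partition column course match the example described below):

hive> CREATE TABLE student (id INT, name STRING, age INT, institute STRING)
      PARTITIONED BY (course STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';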
What is Hive Partitioning and Bucketing?
• Let's retrieve the information associated with the table.
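For example:

hive> DESCRIBE student;

The output should list the regular columns along with the partition column course (shown under a "# Partition Information" section in recent Hive versions).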
What is Hive Partitioning and Bucketing?
• Load the data into the table and pass the values of partition columns
with it by using the following command: -
• Here, we are partitioning the students of an institute based on
courses.
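For example (the local file path and the course value 'java' are assumptions):

hive> LOAD DATA LOCAL INPATH '/home/user/student_java.csv'
      INTO TABLE student
      PARTITION (course = 'java');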
What is Hive Partitioning and Bucketing?
• Load the data of another file into the same table and pass the values
of partition columns with it by using the following command: -
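For example (again, the file path and the course value 'hadoop' are assumptions):

hive> LOAD DATA LOCAL INPATH '/home/user/student_hadoop.csv'
      INTO TABLE student
      PARTITION (course = 'hadoop');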
What is Hive Partitioning and Bucketing?
• In the following screenshot, we can see that the table student is
divided into two categories.
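The two partitions can be listed with:

hive> SHOW PARTITIONS student;

With the assumed values above, this would show course=java and course=hadoop, each stored as its own directory under the table's HDFS location.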
What is Hive Partitioning and Bucketing?
Dynamic Partitioning
• In dynamic partitioning, the values of partitioned columns exist within
the table. So, it is not required to pass the values of partitioned
columns manually.
• First, select the database in which we want to create a table.
What is Hive Partitioning and Bucketing?
• Enable the dynamic partition by using the following commands: -
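These are typically the following settings:

hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;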

• Create a dummy table to store the data.
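A dummy (staging) table could look like this (the table and column names are assumptions chosen to match the student example):

hive> CREATE TABLE stud_demo (id INT, name STRING, age INT, institute STRING, course STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';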


What is Hive Partitioning and Bucketing?
• Now, load the data into the table.
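For example (the file path is an assumption):

hive> LOAD DATA LOCAL INPATH '/home/user/student_details.csv' INTO TABLE stud_demo;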

• Create a partition table by using the following command: -
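For example (column names are assumptions; the table name student_part matches the example below):

hive> CREATE TABLE student_part (id INT, name STRING, age INT, institute STRING)
      PARTITIONED BY (course STRING)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';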


What is Hive Partitioning and Bucketing?
• Now, insert the data of dummy table into the partition table.
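For example (assuming the dummy table stud_demo sketched above):

hive> INSERT INTO TABLE student_part PARTITION (course)
      SELECT id, name, age, institute, course FROM stud_demo;

Note that in a dynamic partition insert, the partition column must come last in the SELECT list.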
What is Hive Partitioning and Bucketing?
• In the following screenshot, we can see that the table student_part is
divided into two categories.
What is Hive Partitioning and Bucketing?
Bucketing in Hive
• The bucketing in Hive is a data organizing technique. It is similar to
partitioning in Hive with an added functionality that it divides large
datasets into more manageable parts known as buckets. So, we can
use bucketing in Hive when the implementation of partitioning
becomes difficult. However, we can also divide partitions further in
buckets.
What is Hive Partitioning and Bucketing?
• The concept of bucketing is based on the hashing technique.
• Here, the modulus of the hash of the current column value and the number of required buckets is calculated (say, F(x) % 3).
• Now, based on the resulting value, the data is stored in the corresponding bucket.
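
A minimal sketch matching the F(x) % 3 illustration (table and column names are assumptions, reusing the student example):

-- required on older Hive releases; bucketing is enforced automatically in Hive 2.x
hive> SET hive.enforce.bucketing = true;
hive> CREATE TABLE student_bucket (id INT, name STRING, course STRING)
      CLUSTERED BY (id) INTO 3 BUCKETS;
hive> INSERT INTO TABLE student_bucket SELECT id, name, course FROM stud_demo;

Each row is placed in the bucket given by hash(id) % 3.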
HBase - Overview
Limitations of Hadoop
• Hadoop can perform only batch processing, and data will be accessed
only in a sequential manner. That means one has to search the entire
dataset even for the simplest of jobs.
• A huge dataset when processed results in another huge data set,
which should also be processed sequentially. At this point, a new
solution is needed to access any point of data in a single unit of time
(random access).
Hadoop Random Access Databases
• Applications such as HBase, Cassandra, CouchDB, Dynamo, and MongoDB are some of the databases that store huge amounts of data and access the data in a random manner.
HBase - Overview
What is HBase?
• HBase is a distributed column-oriented database built on top of the
Hadoop file system. It is an open-source project and is horizontally scalable.
• HBase is a data model similar to Google's Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).
• It is a part of the Hadoop ecosystem that provides random real-time
read/write access to data in the Hadoop File System.
• One can store the data in HDFS either directly or through HBase. Data
consumer reads/accesses the data in HDFS randomly using HBase. HBase
sits on top of the Hadoop File System and provides read and write access.
HBase - Overview
HBase - Overview
Storage Mechanism in Hbase
• HBase is a column-oriented database and the tables in it are sorted by row. The table schema defines only column families, which are the key-value pairs. A table can have multiple column families and each column family can have any number of columns. Subsequent column values are stored contiguously on disk. Each cell value of the table has a timestamp. In short, in HBase:
• Table is a collection of rows.
• Row is a collection of column families.
• Column family is a collection of columns.
• Column is a collection of key value pairs.
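A brief sketch in the HBase shell illustrating these concepts (the table, column families, and cell values are illustrative assumptions):

hbase> create 'emp', 'personal', 'professional'
hbase> put 'emp', '1', 'personal:name', 'Raju'
hbase> put 'emp', '1', 'professional:role', 'Engineer'
hbase> get 'emp', '1'

Here 'emp' is the table, '1' is the row key, 'personal' and 'professional' are column families, and 'personal:name' is a column; each cell value is stored with a timestamp.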
HBase - Overview
HBase - Overview
Features of HBase
• HBase is linearly scalable.
• It has automatic failure support.
• It provides consistent reads and writes.
• It integrates with Hadoop, both as a source and a destination.
• It has an easy Java API for clients.
• It provides data replication across clusters.
HBase - Overview
Where to Use HBase
• Apache HBase is used to have random, real-time read/write access to Big
Data.
• It hosts very large tables on top of clusters of commodity hardware.
• Apache HBase is a non-relational database modeled after Google's Bigtable. Bigtable acts upon the Google File System; likewise, Apache HBase works on top of Hadoop and HDFS.
Applications of HBase
• It is used whenever there is a need for write-heavy applications.
• HBase is used whenever we need to provide fast random access to
available data.
• Companies such as Facebook, Twitter, Yahoo, and Adobe use HBase
internally.
HBase - Architecture
• In HBase, tables are split into regions and are served by the region
servers. Regions are vertically divided by column families into
“Stores”. Stores are saved as files in HDFS. Shown below is the
architecture of HBase.
• Note: The term ‘store’ is used for regions to explain the storage
structure.
HBase - Architecture
• HBase has three major components: the client library, a master server, and
region servers. Region servers can be added or removed as per
requirement.
MasterServer
• The master server -
• Assigns regions to the region servers and takes the help of Apache
ZooKeeper for this task.
• Handles load balancing of the regions across region servers. It unloads the
busy servers and shifts the regions to less occupied servers.
• Maintains the state of the cluster by negotiating the load balancing.
• Is responsible for schema changes and other metadata operations such as
creation of tables and column families.
HBase - Architecture
Regions
• Regions are nothing but tables that are split up and spread across the
region servers.
Region server
• The region servers have regions that -
• Communicate with the client and handle data-related operations.
• Handle read and write requests for all the regions under it.
• Decide the size of the region by following the region size thresholds.
HBase - Architecture
HBase - Architecture
Zookeeper
• Zookeeper is an open-source project that provides services like
maintaining configuration information, naming, providing distributed
synchronization, etc.
• Zookeeper has ephemeral nodes representing different region
servers. Master servers use these nodes to discover available servers.
• In addition to availability, the nodes are also used to track server
failures or network partitions.
• Clients communicate with region servers via zookeeper.
• In pseudo and standalone modes, HBase itself will take care of
zookeeper.
Spark-shell
• Apache Spark by default comes with the spark-shell command, which is used to interact with Spark from the command line. This is usually used to quickly analyze data or test Spark commands from the command line. The shell is referred to as a REPL (Read Eval Print Loop). Apache Spark supports spark-shell for Scala, pyspark for Python, and sparkr for the R language. Java is not supported at this time.
• Spark Shell Key Points –

• The Spark shell is referred to as a REPL (Read Eval Print Loop), which is used to quickly test Spark/PySpark statements.
• The Spark shell supports only Scala, Python, and R (Java might be supported in previous versions).
• The spark-shell2 command is used to launch Spark with the Scala shell.
• The pyspark command is used to launch Spark with the Python shell, also called PySpark.
• The sparkr command is used to launch Spark with the R language.
• In the Spark shell, Spark by default provides the spark and sc variables: spark is an object of SparkSession and sc is an object of SparkContext.
• In the shell you cannot create your own SparkContext.
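
A small illustrative session (output omitted):

$ spark-shell
scala> spark.version                      // spark is the SparkSession created by the shell
scala> sc.parallelize(1 to 100).sum()     // sc is the SparkContext created by the shell
scala> spark.range(5).count()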
Apache Flume - Introduction
What is Flume?
• Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
• Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
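A minimal sketch of a Flume agent configuration that tails a web-server log and writes it to HDFS (the agent name, log path, and HDFS path are assumptions):

# agent a1: one source, one in-memory channel, one HDFS sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/httpd/access_log
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/weblogs
a1.sinks.k1.channel = c1

Such an agent would be started with something like: flume-ng agent --conf conf --conf-file flume.conf --name a1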
Apache Flume - Introduction
Applications of Flume
• Assume an e-commerce web application wants to analyze customer behavior from a particular region. To do so, it would need to move the available log data into Hadoop for analysis. Here, Apache Flume comes to our rescue.
• Flume is used to move the log data generated by application servers into HDFS at a higher speed.
Advantages of Flume
• Here are the advantages of using Flume −
• Using Apache Flume we can store the data into any of the centralized stores (HBase, HDFS).
• When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume
acts as a mediator between data producers and the centralized stores and provides a steady flow of data
between them.
• Flume provides the feature of contextual routing (events can be routed to different destinations based on their context, such as information carried in the event headers).
• The transactions in Flume are channel-based where two transactions (one sender and one receiver) are
maintained for each message. It guarantees reliable message delivery.
• Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Sqoop - Introduction
• When Big Data storages and analyzers such as MapReduce, Hive, HBase, Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a tool to interact with relational database servers for importing and exporting the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop ecosystem to provide feasible interaction between relational database servers and Hadoop's HDFS.
• Sqoop − “SQL to Hadoop and Hadoop to SQL”
• Sqoop is a tool designed to transfer data between Hadoop and relational
database servers. It is used to import data from relational databases such as
MySQL, Oracle to Hadoop HDFS, and export from Hadoop file system to
relational databases. It is provided by the Apache Software Foundation.
How Sqoop Works?
Sqoop Import
• The import tool imports individual tables from RDBMS to HDFS. Each
row in a table is treated as a record in HDFS. All records are stored as
text data in text files or as binary data in Avro and Sequence files.
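For example, importing a table might look like this (the connection string, credentials, table name, and target directory are assumptions):

$ sqoop import \
    --connect jdbc:mysql://localhost/testdb \
    --username root -P \
    --table emp \
    --target-dir /user/hadoop/emp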
Sqoop Export
• The export tool exports a set of files from HDFS back to an RDBMS.
The files given as input to Sqoop contain records, which are called rows in the table. These are read and parsed into a set of records and delimited with a user-specified delimiter.
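
A corresponding export sketch (again with assumed names and paths):

$ sqoop export \
    --connect jdbc:mysql://localhost/testdb \
    --username root -P \
    --table emp_export \
    --export-dir /user/hadoop/emp \
    --input-fields-terminated-by ','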
