Unit 5 (BDC)
• The term ‘Big Data’ is used for collections of large datasets that include huge volume, high
velocity, and a variety of data that is increasing day by day. Using traditional data
management systems, it is difficult to process Big Data. Therefore, the Apache Software
Foundation introduced a framework called Hadoop to solve Big Data management and
processing challenges.
Hadoop
• Hadoop is an open-source framework to store and process Big Data in a distributed
environment. It contains two modules, one is MapReduce and another is Hadoop
Distributed File System (HDFS).
• MapReduce: It is a parallel programming model for processing large amounts of
structured, semi-structured, and unstructured data on large clusters of commodity
hardware.
• HDFS: Hadoop Distributed File System is a part of the Hadoop framework, used to store and
process the datasets. It provides a fault-tolerant file system that runs on commodity
hardware.
Hive Introduction
• The Hadoop ecosystem contains different sub-projects (tools) such as
Sqoop, Pig, and Hive that are used to help Hadoop modules.
• Sqoop: It is used to import and export data between HDFS and RDBMS.
• Pig: It is a procedural language platform used to develop a script for
MapReduce operations.
• Hive: It is a platform used to develop SQL type scripts to do
MapReduce operations.
Hive Introduction
• Note: There are various ways to execute MapReduce operations:
• The traditional approach using Java MapReduce program for
structured, semi-structured, and unstructured data.
• The scripting approach for MapReduce to process structured and
semi-structured data using Pig.
• The Hive Query Language (HiveQL or HQL) for MapReduce to process
structured data using Hive.
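• For example, here is a minimal HiveQL sketch of the third approach (the table name web_logs and its columns are assumed for illustration); this single query is compiled into a MapReduce job with no Java code:

-- Hypothetical log table; Hive translates this one statement into a MapReduce job.
SELECT status_code, COUNT(*) AS hits
FROM web_logs
GROUP BY status_code;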
Hive Introduction
• Hive is a data warehouse infrastructure tool to process structured data in
Hadoop. It resides on top of Hadoop to summarize Big Data, and makes
querying and analyzing easy.
• Hive was initially developed by Facebook; later, the Apache Software
Foundation took it up and developed it further as open source under
the name Apache Hive. It is used by different companies. For example,
Amazon uses it in Amazon Elastic MapReduce.
Hive is not
• A relational database
• A design for OnLine Transaction Processing (OLTP)
• A language for real-time queries and row-level updates
Hive Introduction
Features of Hive
• It stores the schema in a database and the processed data in HDFS.
• It is designed for OLAP (Online Analytical Processing).
• It provides SQL type language for querying called HiveQL or HQL.
• It is familiar, fast, scalable, and extensible.
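• As a small, hedged illustration of these features (the table name, columns, and HDFS path are assumed), the DDL below registers only the schema in the metastore, while the data itself remains as files in HDFS:

-- Schema goes to the metastore; rows live as plain files in HDFS.
CREATE TABLE IF NOT EXISTS page_views (
  user_id  BIGINT,
  url      STRING,
  view_ts  TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- LOAD DATA moves the file into the table's directory in HDFS.
LOAD DATA INPATH '/user/demo/page_views.tsv' INTO TABLE page_views;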
Architecture of Hive
• The Hive architecture contains the following units (the component diagram is not reproduced here):
• User Interface: Hive is a data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
• Meta Store: Hive chooses respective database servers to store the schema or metadata of tables, databases, columns in a table, their data types, and the HDFS mapping.
• HiveQL Process Engine: HiveQL is similar to SQL for querying schema information in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
• Execution Engine: The conjunction part of the HiveQL Process Engine and MapReduce is the Hive Execution Engine. The execution engine processes the query and generates the same results as MapReduce. It uses the flavor of MapReduce.
Hive - Data Types
Integral Types
• Hive provides the following integral data types; each (except INT) takes a postfix on its literals:
• TINYINT: postfix Y (example: 10Y)
• SMALLINT: postfix S (example: 10S)
• INT: no postfix (example: 10)
• BIGINT: postfix L (example: 10L)
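• A quick, hedged check of these postfixes (recent Hive versions accept a standalone SELECT):

-- Each literal is typed by its postfix: Y, S, none, L.
SELECT 10Y AS tiny_val, 10S AS small_val, 10 AS int_val, 10L AS big_val;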
Hive - Data Types
• String Types
• String type data can be specified using single quotes (' ') or double quotes (" "). Hive contains two string data types, VARCHAR and CHAR, and follows C-style escape characters.
• The string data types and their lengths are:
• VARCHAR: 1 to 65535 characters
• CHAR: 255 characters
Hive - Data Types
Timestamp
• It supports the traditional UNIX timestamp with optional nanosecond precision, using the java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (also written as "yyyy-mm-dd hh:mm:ss.ffffffffff").
Dates
• DATE values are described in year/month/day format, in the form YYYY-MM-DD.
Decimals
• The DECIMAL type in Hive is the same as Java's BigDecimal format. It is used for representing immutable arbitrary-precision numbers. The syntax and an example are as follows:
DECIMAL(precision, scale)
decimal(10,0)
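• As a hedged sketch tying the scalar types together (the table and column names are assumed for illustration):

-- One table definition using the integral, string, decimal, timestamp, and date types above.
CREATE TABLE IF NOT EXISTS orders_demo (
  order_id    BIGINT,
  quantity    SMALLINT,
  customer    VARCHAR(100),
  region      CHAR(2),
  amount      DECIMAL(10,2),
  created_at  TIMESTAMP,
  order_date  DATE
);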
Hive - Data Types
• Union Types
• Union is a collection of heterogeneous data types. You can create an
instance using create_union. The syntax and an example are sketched below.
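• A minimal sketch, assuming the hypothetical table name union_demo:

CREATE TABLE IF NOT EXISTS union_demo (
  col1 UNIONTYPE<INT, DOUBLE, ARRAY<STRING>, STRUCT<a:INT, b:STRING>>
);
-- The built-in create_union(tag, values...) UDF constructs a union value; the tag
-- selects which declared member is populated, e.g. create_union(0, 1, 2.0) picks the INT.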
Hive - Data Types
Literals
• The following literals are used in Hive:
Floating Point Types
• Floating point types are nothing but numbers with decimal points.
Generally, this type of data is composed of DOUBLE data type.
Decimal Type
• Decimal type data is nothing but a floating point value with a higher range than the DOUBLE data type. The range of the decimal type is approximately -10^308 to 10^308.
Null Value
• Missing values are represented by the special value NULL.
Hive - Data Types
• Complex Types
• The Hive complex data types are as follows:
Arrays
• Arrays in Hive are used the same way they are used in Java.
Syntax: ARRAY<data_type>
Hive - Data Types
• Maps
• Maps in Hive are similar to Java Maps. Syntax: MAP<primitive_type, data_type> (see the sketch below for ARRAY and MAP columns together).
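• A minimal sketch (the table, column, and key names are assumed) showing ARRAY and MAP columns and how individual elements are read:

CREATE TABLE IF NOT EXISTS employee_demo (
  name    STRING,
  skills  ARRAY<STRING>,
  scores  MAP<STRING, INT>
);

-- Array elements are indexed from 0; map values are looked up by key.
SELECT name, skills[0] AS first_skill, scores['hive'] AS hive_score
FROM employee_demo;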
Spark Shell
• The Spark shell is referred to as a REPL (Read Eval Print Loop), which is used to quickly test Spark/PySpark statements.
• The Spark shell supports only Scala, Python, and R (Java might be supported in previous versions).
• The spark-shell command is used to launch Spark with the Scala shell.
• The pyspark command is used to launch Spark with the Python shell, also called PySpark.
• The sparkr command is used to launch Spark with the R language.
• In the Spark shell, Spark by default provides the spark and sc variables: spark is an object of SparkSession and sc is an object of SparkContext.
• In the shell, you cannot create your own SparkContext.
Apache Flume - Introduction
What is Flume?
• Apache Flume is a tool/service/data ingestion mechanism for collecting, aggregating, and
transporting large amounts of streaming data, such as log files and events, from various sources
to a centralized data store.
• Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy
streaming data (log data) from various web servers to HDFS.
Apache Flume - Introduction
Applications of Flume
• Assume an e-commerce web application wants to analyze customer behavior from a particular
region. To do so, it would need to move the available log data into Hadoop for analysis. Here, Apache
Flume comes to our rescue.
• Flume is used to move the log data generated by application servers into HDFS at a higher speed.
Advantages of Flume
• Here are the advantages of using Flume:
• Using Apache Flume, we can store the data into any of the centralized stores (HBase, HDFS).
• When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume
acts as a mediator between data producers and the centralized stores and provides a steady flow of data
between them.
• Flume provides the feature of contextual routing (events can be routed to different channels or destinations based on information carried with the event, such as its header values).
• The transactions in Flume are channel-based where two transactions (one sender and one receiver) are
maintained for each message. It guarantees reliable message delivery.
• Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Sqoop - Introduction
• When Big Data storages and analyzers such as MapReduce, Hive, HBase,
Cassandra, Pig, etc. of the Hadoop ecosystem came into the picture, they required a
tool to interact with relational database servers for importing and exporting
the Big Data residing in them. Here, Sqoop occupies a place in the Hadoop
ecosystem to provide feasible interaction between relational database servers and
Hadoop's HDFS.
• Sqoop − “SQL to Hadoop and Hadoop to SQL”
• Sqoop is a tool designed to transfer data between Hadoop and relational
database servers. It is used to import data from relational databases such as
MySQL and Oracle to Hadoop HDFS, and to export data from the Hadoop file system to
relational databases. It is provided by the Apache Software Foundation.
How Sqoop Works?
Sqoop Import
• The import tool imports individual tables from RDBMS to HDFS. Each
row in a table is treated as a record in HDFS. All records are stored as
text data in text files or as binary data in Avro and Sequence files.
Sqoop Export
• The export tool exports a set of files from HDFS back to an RDBMS.
The files given as input to Sqoop contain records, which are called
rows in the table. These are read and parsed into a set of records and
delimited with a user-specified delimiter.