Big Data With Hadoop
Structured vs. unstructured vs. semi-structured data:
Structured data: It is less flexible and difficult to scale. It is schema dependent. Versioning is possible over tuples, rows and tables. Financial data and bar codes are some examples of structured data.
Unstructured data: It is flexible and scalable. It is schema independent. Versioning is applied to the data as a whole. Media logs, videos and audio files are some examples of unstructured data.
Semi-structured data: It is more flexible and easier to scale than structured data, but less so than unstructured data. Versioning over tuples is possible. Tweets organised by hashtags and folders organised by topic are some examples of semi-structured data.
HDFS:
HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files.
HDFS consists of two core components i.e.
1. Name node
2. Data Node
The Name Node is the prime node; it holds the metadata (data about data) and requires comparatively fewer resources than the Data Nodes, which store the actual data. These Data Nodes run on commodity hardware in the distributed environment, which is what makes Hadoop cost effective.
HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
YARN:
YARN (Yet Another Resource Negotiator), as the name implies, helps manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
YARN consists of three major components:
1. Resource Manager
2. Node Manager
3. Application Manager
The Resource Manager has the authority to allocate resources to the applications in the system, whereas the Node Managers manage the resources (such as CPU, memory and bandwidth) on each machine and report back to the Resource Manager. The Application Manager acts as an interface between the Resource Manager and the Node Managers and negotiates resources as required by the two.
MapReduce:
By making use of distributed and parallel algorithms, MapReduce carries the processing logic to the data and helps write applications that transform big data sets into manageable ones.
MapReduce uses two functions, Map() and Reduce(), whose tasks are:
o Map() performs sorting and filtering of the data and organizes it into groups. Map generates key-value pairs that are later processed by the Reduce() method.
o Reduce(), as the name suggests, performs summarization by aggregating the mapped data. Simply put, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples.
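The actual Hadoop MapReduce API is Java-based; as a purely conceptual sketch, the same key-value flow can be imitated on an in-memory collection in Scala (the sample lines and the word-count task are illustrative assumptions, not the Hadoop API):
// Conceptual sketch of the Map and Reduce phases using plain Scala collections.
// This is NOT the Hadoop MapReduce API; it only illustrates the key-value flow.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val lines = Seq("big data with hadoop", "hadoop stores big data")
    // Map phase: emit a (word, 1) pair for every word in every line.
    val mapped = lines.flatMap(_.split("\\s+")).map(word => (word, 1))
    // Shuffle/sort phase: group the pairs by key (Hadoop does this between the two phases).
    val grouped = mapped.groupBy(_._1)
    // Reduce phase: aggregate all values for each key into a single count.
    val reduced = grouped.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
    reduced.foreach(println)   // e.g. (hadoop,2), (big,2), (data,2), ...
  }
}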
Apache Sqoop
Apache Sqoop, a command-line interface tool, moves data between relational databases
and Hadoop.
It is used to export data from the Hadoop file system to relational databases and to import
data from relational databases such as MySQL and Oracle into the Hadoop file system.
Important Features of Apache Sqoop
Apache Sqoop has many essential features. Some of them are discussed here:
Sqoop uses the YARN framework to import and export data, which provides parallelism as well as fault tolerance.
We may import the outcomes of a SQL query into HDFS using Sqoop.
Sqoop offers connectors for several RDBMSs, including MySQL and Microsoft SQL Server.
Sqoop supports the Kerberos computer network authentication protocol, allowing nodes
to authenticate users while securely communicating across an unsafe network.
Sqoop can load the full table or specific sections with a single command.
Unit 2
HDFS Architecture
HDFS is composed of master-slave architecture, which includes the following
elements:
NameNode
All the blocks on DataNodes are handled by NameNode, which is known as the
master node. It performs the following functions:
Monitors and controls all the DataNode instances.
Permits the user to access a file.
Keeps a record of all the blocks stored on each DataNode instance.
Records every write operation in the EditLog, which is committed to disk; together with the FsImage, the EditLog allows the NameNode's metadata to be rebuilt after a failure or restart.
Tracks, through regular heartbeats and block reports, which DataNodes and block replicas are alive, and arranges re-replication of the blocks held by failed DataNodes.
The NameNode is aware of every DataNode in the cluster, and it alone manages communication with all the DataNodes. Each DataNode runs its own software and operates independently, so if a DataNode fails, its blocks are simply re-replicated to other DataNodes. The failure of a single DataNode therefore does not impact the rest of the cluster.
Secondary NameNode
The Secondary NameNode periodically performs checkpoints of the NameNode's metadata; it is a checkpointing helper, not a standby NameNode. The Secondary NameNode performs the following duties.
It fetches the EditLog and the FsImage from the NameNode and merges them into a new, up-to-date FsImage, so that the EditLog does not grow without bound and the transaction history is consolidated in one place.
It copies the merged FsImage back to the NameNode, which can then recover from this checkpoint after a restart or failure instead of replaying a very long EditLog.
DataNode
Every slave machine that stores data runs a DataNode. The DataNode stores its blocks in the local file system (for example, ext3 or ext4). DataNodes do the following:
They store the actual data blocks.
They handle the requested operations on files, such as reading block content and writing new data, as described above.
They follow the NameNode's instructions, including block creation, deletion and replication.
MapReduce
MapReduce is a Java-based processing technique and a programming model for distributed computing.
Map and Reduce are the two crucial tasks that make up the MapReduce algorithm. A map transforms a data set into another set, in which individual elements are broken down into tuples (key/value pairs).
The second task is the reduce task, which takes a map's output as input and combines those data tuples into a smaller collection of tuples.
The reduction work is always carried out following the map job, as the name MapReduce
implies.
The main benefit of MapReduce is that data processing can be scaled easily over several computing nodes.
The following diagram shows the logical flow of a MapReduce programming model.
3. Fair Scheduler
The Fair Scheduler is very similar to the Capacity Scheduler.
The priority of the job is kept in consideration.
With the Fair Scheduler, YARN applications can share the resources of a large Hadoop cluster, and these resources are assigned dynamically, so there is no need to reserve capacity in advance.
The resources are distributed in such a manner that all applications within a cluster get an equal share of time.
The Fair Scheduler makes scheduling decisions on the basis of memory; it can also be configured to take CPU into account.
As noted, it is similar to the Capacity Scheduler, but the key difference is that whenever a high-priority job arrives in the same queue, the Fair Scheduler processes it in parallel by reclaiming some portion of the slots already allocated to running tasks.
Advantages:
Resources assigned to each application depend upon its priority.
It can limit the number of concurrently running tasks in a particular pool or queue.
Disadvantages: Configuration is required.
UNIT 3
Discuss the important characteristics of Hive. How do you compare Hive and
Relational Databases
Apache Hive
Apache Hive is a data warehouse and ETL tool that provides an SQL-like interface between the user and the Hadoop Distributed File System (HDFS).
It is built on top of Hadoop.
It is a software project that provides data query and analysis.
It facilitates reading, writing and managing large datasets that are stored in distributed storage and queried using Structured Query Language (SQL) syntax.
It is not built for Online Transactional Processing (OLTP) workloads.
It is frequently used for data warehousing tasks like data encapsulation, Ad-hoc Queries, and
analysis of huge datasets.
It is designed to enhance scalability, extensibility, performance, fault-tolerance and loose-
coupling with its input formats.
With a neat diagram explain the Architecture of Hive. Explain the major components of Hive
Architecture of Hive
The following architecture explains the flow of submission of query into Hive.
Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients such as:-
Thrift Server - It is a cross-language service provider platform that serves requests from all the programming languages that support Thrift.
JDBC Driver - It is used to establish a connection between hive and Java applications. The
JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
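As a rough illustration of the JDBC client path described above, the following Scala sketch opens a connection to HiveServer2 and runs a query; the host, port (10000 is the usual HiveServer2 default), database and table names are assumptions, and the driver class shown (org.apache.hive.jdbc.HiveDriver) is the HiveServer2 JDBC driver:
import java.sql.DriverManager
// Minimal sketch of a client talking to Hive over JDBC (HiveServer2).
// Host, port, database and table names here are illustrative assumptions.
object HiveJdbcSketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")   // HiveServer2 JDBC driver
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    try {
      val rs = conn.createStatement().executeQuery("SELECT COUNT(*) FROM sample_table")
      while (rs.next()) println(s"row count = ${rs.getLong(1)}")
    } finally conn.close()
  }
}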
Hive Services
The following are the services provided by Hive:-
Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive
queries and commands.
Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a
web-based GUI for executing Hive queries and commands.
Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata about columns and their types, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
Hive Server - It is referred to as Apache Thrift Server. It accepts the request from different
clients and provides it to Hive Driver.
Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
Hive Execution Engine - The optimizer generates the execution plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then executes the incoming tasks in the order of their dependencies.
Spark Transformation
Spark Transformation is a function that produces new RDD from the existing RDDs.
It takes RDD as input and produces one or more RDD as output.
It creates a new RDD each time we apply a transformation.
Thus, the input RDDs cannot be changed, since RDDs are immutable in nature.
Narrow Transformations:
These transformations map each input partition to exactly one output partition: each partition of the parent RDD is used by at most one partition of the child RDD, i.e., each child partition depends on a single parent partition.
This kind of transformation is fast.
It does not require any data shuffling over the cluster network, i.e., no data movement.
Operations such as map() and filter() belong to these transformations.
Wide Transformations:
This type of transformation has input partitions contributing to many output partitions: a partition of the parent RDD may be used by multiple partitions of the child RDD, i.e., a child partition can depend on multiple parent partitions.
They are slow compared to narrow transformations; speed can be significantly affected because data may need to be shuffled between different nodes when the new partitions are created.
They may require data shuffling over the cluster network.
Functions such as groupByKey(), aggregateByKey(), aggregate(), join() and repartition() are some examples of wide transformations.
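A minimal Scala sketch contrasting the two kinds of dependencies, assuming a local Spark session; toDebugString shows a shuffle stage only for the wide case:
import org.apache.spark.sql.SparkSession
// Sketch contrasting a narrow transformation (map) with a wide one (groupByKey).
object NarrowVsWide {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("NarrowVsWide").master("local[*]").getOrCreate()
    val sc = spark.sparkContext
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 3)
    val narrow = pairs.map { case (k, v) => (k, v * 10) }   // no shuffle, partition-local
    val wide   = pairs.groupByKey()                         // shuffles data across the network
    println(narrow.toDebugString)   // single stage, MapPartitionsRDD only
    println(wide.toDebugString)     // shows a ShuffledRDD (shuffle dependency)
    spark.stop()
  }
}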
map(func)
The map function iterates over every element of an RDD and produces a new RDD.
Using map() transformation we take in any function, and that function is applied to every
element of RDD.
In the map, we have the flexibility that the input and the return type of RDD may differ from
each other. For example, we can have input RDD type as String, after applying the
map() function the return RDD can be Boolean.
For example, for the RDD {1, 2, 3, 4, 5}, if we apply rdd.map(x => x + 2) we will get the result (3, 4, 5, 6, 7).
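A runnable Scala version of that example (assuming a local Spark session), also showing a map that changes the element type from Int to Boolean:
import org.apache.spark.sql.SparkSession
// Runnable version of the rdd.map(x => x + 2) example from the text,
// plus a map that changes the element type (Int => Boolean).
object MapExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MapExample").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4, 5))
    println(rdd.map(x => x + 2).collect().mkString(", "))       // 3, 4, 5, 6, 7
    println(rdd.map(x => x % 2 == 0).collect().mkString(", "))  // false, true, false, true, false
    spark.stop()
  }
}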
filter(func)
Spark RDD filter() function returns a new RDD, containing only the elements that meet a
predicate. It is a narrow operation because it does not shuffle data from one partition to many
partitions.
For example, suppose an RDD contains the first five natural numbers (1, 2, 3, 4 and 5) and the predicate checks for even numbers. The resulting RDD after the filter will contain only the even numbers, i.e., 2 and 4.
Filter() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())
groupByKey()
When we use groupByKey() on a dataset of (K, V) pairs, the data is shuffled according to the key value K into another RDD. In this transformation, a lot of unnecessary data may get transferred over the network.
Spark provides the provision to save data to disk when there is more data shuffled onto a single
executor machine than can fit in memory.
groupByKey() example:
val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)), 3)
val group = data.groupByKey().collect()
group.foreach(println)
reduceByKey(func, [numTasks])
When we use reduceByKey on a dataset (K, V), the pairs on the same machine with the same
key are combined, before the data is shuffled.
reduceByKey() example:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
data.foreach(println)
Features Of RDD
i. In-memory Computation
Spark RDDs provide in-memory computation: intermediate results are stored in distributed memory (RAM) instead of stable storage (disk).
ii. Lazy Evaluations
All transformations in Apache Spark are lazy, in that they do not compute their results right
away. Instead, they just remember the transformations applied to some base data set.
Spark computes transformations when an action requires a result for the driver program.
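A small Scala sketch of this behaviour, assuming a local Spark session: the map is only recorded until count() forces execution:
import org.apache.spark.sql.SparkSession
// Sketch of lazy evaluation: the map below is only recorded, not executed,
// until the count() action forces the computation.
object LazyEvalSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("LazyEvalSketch").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 5)
    // Transformation: nothing runs yet, Spark only records the lineage.
    val doubled = rdd.map { x => println(s"mapping $x"); x * 2 }
    println("no 'mapping ...' lines printed so far")
    // Action: now the map function actually executes on the executors.
    println(s"count = ${doubled.count()}")
    spark.stop()
  }
}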
iii. Fault Tolerance
Spark RDDs are fault tolerant because they track data lineage information: each RDD remembers how it was created from other datasets (by transformations like map, join or groupBy), so lost data can be rebuilt automatically on failure.
iv. Immutability
Because RDDs are immutable, data is safe to share across processes. An RDD can also be created or retrieved at any time, which makes caching, sharing and replication easy. Thus, immutability is a way to reach consistency in computations.
v. Partitioning
Partitioning is the fundamental unit of parallelism in Spark RDD. Each partition is one logical division of the data. One can create new partitions through transformations on existing partitions.
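A short Scala sketch, assuming a local Spark session, that inspects the partition count and derives a differently partitioned RDD (the partition counts 4 and 8 are arbitrary choices):
import org.apache.spark.sql.SparkSession
// Sketch of RDD partitioning: inspect the number of partitions and create
// a new RDD with a different partitioning via repartition().
object PartitionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("PartitionSketch").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 100, numSlices = 4)
    println(s"initial partitions  = ${rdd.getNumPartitions}")            // 4
    val repartitioned = rdd.repartition(8)                               // wide transformation, shuffles data
    println(s"after repartition() = ${repartitioned.getNumPartitions}")  // 8
    spark.stop()
  }
}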
UNIT 4
With a neat diagram explain spark ecosystem
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and monitor multiple
applications.
Let's understand each Spark component in detail.
Spark Core
The Spark Core is the heart of Spark and performs the core functionality.
It holds the components for task scheduling, fault recovery, interacting with storage systems
and memory management.
Spark SQL
The Spark SQL is built on the top of Spark Core. It provides support for structured data.
It allows querying the data via SQL (Structured Query Language) as well as via the Apache Hive variant of SQL called HQL (Hive Query Language).
It supports JDBC and ODBC connections that establish a relation between Java objects and
existing databases, data warehouses and business intelligence tools.
It also supports various sources of data like Hive tables, Parquet, and JSON.
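A brief Scala sketch of this usage, assuming a local Spark session; the people.json path and the name/age columns are illustrative assumptions:
import org.apache.spark.sql.SparkSession
// Sketch of Spark SQL over a JSON source; path and column names are assumptions.
object SparkSqlJsonSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSqlJsonSketch").master("local[*]").getOrCreate()
    val people = spark.read.json("people.json")   // structured source
    people.createOrReplaceTempView("people")       // expose it to SQL
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()
    spark.stop()
  }
}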
Spark Streaming
Spark Streaming is a Spark component that supports scalable and fault-tolerant processing
of streaming data.
It uses Spark Core's fast scheduling capability to perform streaming analytics.
It accepts data in mini-batches and performs RDD transformations on that data.
Its design ensures that the applications written for streaming data can be reused to analyse
batches of historical data with little modification.
The log files generated by web servers can be considered as a real-time example of a data
stream.
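A minimal Scala DStream sketch of this mini-batch model; the socket source (localhost:9999) and the 5-second batch interval are illustrative assumptions:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
// Sketch: count words in each 5-second mini-batch read from a socket.
object StreamingSketch {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing.
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
    counts.print()
    ssc.start()              // begin receiving and processing mini-batches
    ssc.awaitTermination()   // block until the stream is stopped
  }
}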
MLlib
The MLlib is a Machine Learning library that contains various machine learning algorithms.
These include correlations and hypothesis testing, classification and regression, clustering,
and principal component analysis.
It is nine times faster than the disk-based implementation used by Apache Mahout.
GraphX
The GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
It facilitates creating a directed graph with arbitrary properties attached to each vertex and edge.
To manipulate graphs, it supports various fundamental operators such as subgraph, joinVertices, and aggregateMessages.
The Cluster Manager is responsible for allocating resources in the cluster. Apache Spark is
designed to be compatible with a range of options:
Standalone Cluster Manager: A straightforward, pre-integrated option bundled with
Spark, suitable for managing smaller workloads.
Hadoop YARN: Often the preferred choice due to its scalability and seamless integration
with Hadoop's data storage systems, ideal for larger, distributed workloads.
Apache Mesos: A robust option that manages resources across entire data centres,
making it suitable for large-scale, diverse workloads.
Kubernetes: A modern container orchestration platform that gained popularity as a
cluster manager for Spark applications owing to its robustness and compatibility with
containerized environments.
This flexible approach allows users to select the Cluster Manager that best fits their specific
needs, whether those pertain to workload scale, hardware type, or application requirements.
The Executors or Worker Nodes are the “slaves” responsible for the task completion. They
process tasks on the partitioned RDDs and return the result back to SparkContext.
Operations of RDD
Two kinds of operations can be applied to an RDD: transformations and actions.
Transformations
Transformations are the processes that you perform on an RDD to get a result that is also an RDD. Examples include applying functions such as filter(), union(), map(), flatMap(), distinct(), reduceByKey(), mapPartitions() and sortBy(), each of which creates another resultant RDD. Lazy evaluation is applied in the creation of these RDDs.
Actions
Actions return results to the driver program or write them to storage, and they kick off a computation. Some examples are count(), first(), collect(), take(), countByKey(), collectAsMap(), and reduce().
Transformations will always return RDD whereas actions return some other data type.
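A short Scala sketch of this split, assuming a local Spark session: filter() returns another RDD, while count() and collect() return ordinary values and trigger execution:
import org.apache.spark.sql.SparkSession
import org.apache.spark.rdd.RDD
// Sketch of the transformation/action split: transformations return new RDDs,
// actions return ordinary values (or write out data) and trigger execution.
object TransformVsAction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TransformVsAction").master("local[*]").getOrCreate()
    val rdd = spark.sparkContext.parallelize(1 to 10)
    val evens: RDD[Int] = rdd.filter(_ % 2 == 0)     // transformation -> RDD, lazily evaluated
    val total: Long = evens.count()                  // action -> Long, runs the job
    val values: Array[Int] = evens.collect()         // action -> Array[Int]
    println(s"count = $total, values = ${values.mkString(", ")}")
    spark.stop()
  }
}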
UNIT 5
Producers
Just like in the messaging world, Producers in Kafka are the ones who produce and send the messages to
the topics.
As said before, messages are sent in a round-robin way by default. Ex: message 01 goes to partition 0 of Topic 1, and message 02 to partition 1 of the same topic. This means we can't guarantee that messages produced by the same producer will always be delivered to the same partition. If we need that guarantee, we specify a key when sending the message; Kafka will generate a hash based on that key and use it to decide which partition to deliver the message to.
That hash takes into account the number of partitions of the topic, which is why changing that number after the topic has been created breaks the key-to-partition mapping.
When we are working with the concept of messages, there’s something called Acknowledgment
(ack). The ack is basically a confirmation that the message was delivered. In Kafka, we can configure this
ack when producing the messages. There are three different levels of configuration for that:
ack = 0: When we configure the ack = 0, we’re saying that we don’t want to receive the ack from
Kafka. In case of broker failure, the message will be lost;
ack = 1: This is the default configuration, with that we’re saying that we want to receive an ack
from the leader of the partition. The data will only be lost if the leader goes down (still there’s a
chance);
ack = all: This is the most reliable configuration. We are saying that we want to receive a confirmation not only from the leader but from its replicas as well. This is the most secure configuration since there is no data loss. Remember that the replicas need to be in sync (ISR); if a single replica isn't, Kafka will wait for it to sync before sending back the ack.
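A minimal Scala sketch of a producer using the Kafka Java client with acks=all; the broker address, topic name and record key are illustrative assumptions:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
// Sketch of a Kafka producer configured with acks=all.
object ProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
    props.put(ProducerConfig.ACKS_CONFIG, "all")   // wait for the leader and the in-sync replicas
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer")
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
              "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    try {
      // Records with the same key are hashed to the same partition of the topic.
      producer.send(new ProducerRecord[String, String]("orders", "customer-42", "order created"))
      producer.flush()
    } finally producer.close()
  }
}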
Catalyst Optimizer
Catalyst Optimizer is a component of Apache Spark's SQL engine, known as Spark SQL. It is a powerful
tool used for optimizing query performance in Spark. The primary goal of the Catalyst Optimizer is to
transform and optimize the user's SQL or DataFrame operations into an efficient physical execution plan.
The Catalyst Optimizer applies a series of optimization techniques to improve query execution time.
Some of these optimizations include but are not limited to:
Predicate Pushdown: This optimization pushes down filters and predicates closer to the data source,
reducing the amount of unnecessary data that needs to be processed.
Column Pruning: It eliminates unnecessary columns from being read or loaded during query
execution, reducing I/O and memory usage.
Constant Folding: This optimization identifies and evaluates constant expressions during query
analysis, reducing computational overhead during execution.
Projection Pushdown: It projects only the required columns of a table or dataset, reducing the
amount of data that needs to be processed, thus improving query performance.
Join Reordering: The Catalyst Optimizer reorders join operations based on statistical information
about the data, which can lead to more efficient join strategies.
Cost-Based Optimization: It leverages statistics and cost models to estimate the cost of different
query plans and selects the most efficient plan based on these estimates.
By applying these and other optimization techniques, the Catalyst Optimizer aims to generate an
optimized physical plan that executes the query efficiently, thereby improving the overall performance of
Spark SQL queries.
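One way to observe Catalyst's output is to ask Spark for the query plans with explain(); in the sketch below (Scala, local session) the Parquet path and column names are assumptions, and with a Parquet source the physical plan typically reports the pruned columns and the pushed-down filter:
import org.apache.spark.sql.SparkSession
// Sketch: inspect the plans Catalyst produces for a simple query.
object CatalystExplainSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CatalystExplainSketch").master("local[*]").getOrCreate()
    val df = spark.read.parquet("users.parquet")
      .select("name", "age")   // column pruning: only these columns are needed
      .filter("age > 30")      // candidate for predicate pushdown
    // Prints the parsed, analyzed and optimized logical plans plus the physical plan.
    df.explain(extended = true)
    spark.stop()
  }
}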
Spark SQL
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction
called DataFrames and can also act as a distributed SQL query engine.
It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and
data.
It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL
query processing with machine learning).
Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored
both in RDDs (Spark’s distributed datasets) and in external sources.
Spark SQL conveniently blurs the lines between RDDs and relational tables.
Spark SQL also includes a cost-based optimizer, columnar storage, and code generation to make
queries fast.
At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine, which provides full mid-query fault tolerance, without having to worry about using a different engine for historical data.
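A short Scala sketch of that blurring, assuming a local Spark session; the Person case class and its fields are illustrative assumptions: an ordinary RDD is converted to a DataFrame and queried with SQL:
import org.apache.spark.sql.SparkSession
// Sketch of how Spark SQL blurs the line between RDDs and relational tables.
case class Person(name: String, age: Int)
object RddToSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddToSqlSketch").master("local[*]").getOrCreate()
    import spark.implicits._
    val rdd = spark.sparkContext.parallelize(Seq(Person("Ana", 34), Person("Ben", 19)))
    val df = rdd.toDF()                   // RDD[Person] -> DataFrame with columns name, age
    df.createOrReplaceTempView("people")
    spark.sql("SELECT name FROM people WHERE age >= 21").show()
    spark.stop()
  }
}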