Big Data With Hadoop

Big data is characterized by the five V's: volume, variety, value, velocity, and veracity. It includes structured, semi-structured, and unstructured data. A big data architecture must be able to handle large scale data and support different user needs through ingestion, processing, storage, and visualization layers. Common big data tools include Apache Cassandra, Statwing, Tableau, Apache Hadoop, and MongoDB. These tools support ingesting, storing, processing, and visualizing large and diverse datasets.


UNIT 1

Characteristics of Big Data


The Five V’s of Big Data
Big Data is simply a catchall term used to describe data too large and complex to store in
traditional databases.
The five V’s of Big Data are:
Volume – The amount of data generated. The name Big Data itself is related to an enormous size: Big Data is a vast volume of data generated daily from many sources, such as business processes, machines, social media platforms, networks, and human interactions.
E.g.: Facebook generates approximately a billion messages, records around 4.5 billion "Like" actions, and receives more than 350 million new posts each day.
Variety - Big Data can be structured, unstructured, or semi-structured, and it is collected from many different sources. In the past, data was collected mainly from databases and spreadsheets; these days it arrives in an array of forms, such as PDFs, emails, audio, social media posts, photos, and videos.
Value - The ability to turn data into useful insights. Value is an essential characteristic of big data: only data that is valuable and reliable is worth storing, processing, and analysing.
Velocity - The speed at which data is generated, collected, and analysed. Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, and mobile devices. For example, every 60 seconds sees roughly 10,000 tweets, 650,000 post updates, and 217 new mobile users.
Veracity - Trustworthiness in terms of quality and accuracy. Veracity describes how reliable the data is, and covers the ability to filter, clean, and manage data efficiently so that it can be trusted for business decisions. For example, Facebook posts tagged with hashtags vary widely in quality and accuracy.

Types of Data – Structured, Semi-structured, Unstructured data


Structured data –
Structured data is data whose elements are addressable for effective analysis. It is organized into a formatted repository, typically a database, and covers all data that can be stored in a SQL database in tables with rows and columns. Structured data has relational keys and can easily be mapped into pre-designed fields. Today, it is the most mature and the simplest kind of data to process and manage. Example: relational data.
Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyse. With some processing it can be stored in a relational database (though for some kinds of semi-structured data this is very hard), but the semi-structured form is kept because it is easier and more compact to work with. Example: XML data.
Unstructured data –
Unstructured data is data that is not organized in a predefined manner and does not have a predefined data model, so it is not a good fit for a mainstream relational database. Alternative platforms exist for storing and managing unstructured data; it is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word documents, PDFs, text, media logs.

| Structured Data | Unstructured Data | Semi-structured Data |
| --- | --- | --- |
| Well organised data | Not organised at all | Partially organised |
| Less flexible and difficult to scale; schema dependent | Flexible and scalable; schema independent | More flexible and simpler to scale than structured data, but less so than unstructured data |
| Based on a relational database | Based on character and binary data | Based on XML/RDF |
| Versioning over tuples, rows and tables | Versioning of the data as a whole | Versioning over tuples is possible |
| Easy analysis | Difficult analysis | Analysis more difficult than structured data but easier than unstructured data |
| Examples: financial data, bar codes | Examples: media logs, videos, audio | Examples: tweets organised by hashtags, folders organised by topic |

What is Big Data Architecture?


 The term "Big Data architecture" refers to the systems and software used to manage Big
Data.
 A Big Data architecture must be able to handle the scale, complexity, and variety of Big Data.
It must also be able to support the needs of different users, who may want to access and
analyse the data differently.
 The Big Data pipeline architecture must support all these activities so users can effectively
work with Big Data.
 It includes the organizational structures and processes used to manage data.
 Some Big Data Architecture Examples include - Azure Big Data architecture, Hadoop big data
architecture, and Spark architecture in Big Data.
Big Data Architecture Layers
There are four main layers in a Big Data architecture:
1. Data Ingestion
This layer is responsible for collecting and storing data from various sources. In Big Data, data ingestion is the process of extracting data from various sources and loading it into a data repository. It is a key component of a Big Data architecture because it determines how data will be ingested, transformed, and stored.
2. Data Processing
Data processing is the second layer, responsible for collecting, cleaning, and preparing the data
for analysis. This layer is critical for ensuring that the data is high quality and ready to be used in
the future.
3. Data Storage
Data storage is the third layer, responsible for storing the data in a format that can be easily
accessed and analysed. This layer is essential for ensuring that the data is accessible and
available to the other layers.
4. Data Visualization
Data visualization is the fourth layer and is responsible for creating visualizations of the data
that humans can easily understand. This layer is important for making the data accessible.

5 Big Data Tools


1. Apache Cassandra
Many of the big data tools and programs typically used are open source, meaning that they are
freely available and can be changed and redistributed. Apache Cassandra is one of these must-
know open-source tools that all big data professionals should master. With Apache Cassandra,
you can easily manage your database, as this tool is equipped to handle large amounts of data.
Features of APACHE Cassandra:
 Data Storage Flexibility: It supports all forms of data, i.e. structured, semi-structured, and unstructured, and allows users to change the data model as their needs change.
 Data Distribution System: Data is easy to distribute by replicating it across multiple data centres.
 Fast Processing: Cassandra has been designed to run on efficient commodity hardware and also offers fast storage and data processing.
 Fault tolerance: If any node fails, it is replaced without delay.
2. Statwing
Statwing isn’t just a useful tool for those working in big data, but it is also a valuable tool for
those working in other professions, such as market research. This statistical tool can take
massive amounts of data and quickly create bar charts, scatterplots, and other graphs, and then
export these visuals to PowerPoint or Excel.
3. Tableau
Tableau is another tool that can create visuals with your data, but it also allows for no-code data
queries. For optimal convenience, users can operate Tableau while on the go with its mobile
option. Tableau also works well for teams, as it offers shareable and interactive dashboards.
4. Apache Hadoop
When it comes to big data frameworks, Apache Hadoop is often the go-to tool. Providing cross-
platform support, this Java-based tool is highly scalable. It has the capability of handling any
variation of data needed, including photos and videos.
Features of Apache Hadoop:
 Free to use and offers an efficient storage solution for businesses.
 Offers quick access via HDFS (Hadoop Distributed File System).
 Highly flexible and integrates easily with sources such as MySQL and JSON.
 Highly scalable as it can distribute a large amount of data in small segments.
 It works on inexpensive commodity hardware such as JBOD (just a bunch of disks).
5. MongoDB
Data can change often, and if you’re dealing with large datasets that continually require
modifications, MongoDB is a great tool. As such, mastering this tool is crucial if you’re pursuing
a career in big data. It allows for the storage of data from various sources, including product
catalogs, mobile applications, and much more.
Features of Mongo DB:
 Written in C++: It is a schema-less database that can hold a variety of documents.
 Simplifies the Stack: With the help of MongoDB, a user can easily store files without any disturbance in the stack.
 Master-Slave Replication: Data is written to and read from the master, while slave copies can be used for backup.

Big Data Ecosystem with a neat diagram


 Hadoop Ecosystem is a platform or a suite which provides various services to solve the big
data problems.
 It includes Apache projects and various commercial tools and solutions.
 There are four major elements of Hadoop i.e. HDFS, MapReduce, YARN, and Hadoop
Common.
 Most of the tools or solutions are used to supplement or support these major elements.
 All these tools work collectively to provide services such as ingestion, analysis, storage and maintenance of data.
Following are the components that collectively form a Hadoop ecosystem:
 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries

HDFS:
 HDFS is the primary or major component of Hadoop ecosystem and is responsible for storing
large data sets of structured or unstructured data across various nodes and thereby
maintaining the metadata in the form of log files.
 HDFS consists of two core components i.e.
1. Name node
2. Data Node
 Name Node is the prime node and contains metadata (data about data), requiring comparatively fewer resources than the data nodes that store the actual data. These data nodes are commodity hardware in the distributed environment, which undoubtedly makes Hadoop cost-effective.
 HDFS maintains all the coordination between the clusters and hardware, thus working at the
heart of the system.
YARN:
 Yet Another Resource Negotiator: as the name implies, YARN helps to manage the resources across the cluster. In short, it performs scheduling and resource allocation for the Hadoop system.
 It consists of three major components, i.e.
1. Resource Manager
2. Node Manager
3. Application Manager
 Resource manager has the privilege of allocating resources for the applications in a system
whereas Node managers work on the allocation of resources such as CPU, memory,
bandwidth per machine and later on acknowledges the resource manager. Application
manager works as an interface between the resource manager and node manager and
performs negotiations as per the requirement of the two.
MapReduce:
 By making use of distributed and parallel algorithms, MapReduce makes it possible to distribute the processing logic and helps to write applications which transform big data sets into manageable ones.
 MapReduce makes use of two functions, Map() and Reduce(), whose tasks are:
o Map() performs sorting and filtering of data and thereby organizes it into groups. Map generates a key-value-pair-based result which is later processed by the Reduce() method.
o Reduce(), as the name suggests, does the summarization by aggregating the mapped data. In simple terms, Reduce() takes the output generated by Map() as input and combines those tuples into a smaller set of tuples (a short sketch follows).
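A minimal word-count sketch of the same Map()/Reduce() idea. It is expressed here with Apache Spark's Scala RDD API (Spark is covered in later units) rather than the Java MapReduce API, and the input file name is an assumption:

import org.apache.spark.{SparkConf, SparkContext}

object WordCountSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkContext just for illustration; a real job would run on a cluster.
    val sc = new SparkContext(new SparkConf().setAppName("word-count").setMaster("local[*]"))

    // "Map" phase: split each line into words and emit (word, 1) key-value pairs.
    val pairs = sc.textFile("input.txt")                 // hypothetical input file
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))

    // "Reduce" phase: aggregate the counts for each key (word).
    val counts = pairs.reduceByKey(_ + _)

    counts.collect().foreach { case (word, n) => println(s"$word\t$n") }
    sc.stop()
  }
}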

Apache Sqoop
 Apache Sqoop, a command-line interface tool, moves data between relational databases
and Hadoop.
 It is used to export data from the Hadoop file system to relational databases and to import
data from relational databases such as MySQL and Oracle into the Hadoop file system.
Important Features of Apache Sqoop
Apache Sqoop has many essential features. Some of them are discussed here:
 Sqoop uses the YARN framework to import and export data, which provides fault tolerance on top of parallelism.
 We may import the outcomes of a SQL query into HDFS using Sqoop.
 For several RDBMSs, including MySQL and Microsoft SQL servers, Sqoop offers
connectors.
 Sqoop supports the Kerberos computer network authentication protocol, allowing nodes
to authenticate users while securely communicating across an unsafe network.
 Sqoop can load the full table or specific sections with a single command.

How does Sqoop work?


 Sqoop executes user commands via a command-line interface.
 Connectors aid in the transfer of data from any external source to Hadoop.
 They also help isolate production tables from corruption in the event of job failures.
 The map job populates the tables and merges them with the destination table to convey the
data.
 You can utilize specialized connectors to external systems to optimize import and export. These connectors are plugins that can be installed into Sqoop through its extension framework.
 Sqoop is only capable of importing and exporting data depending on human instructions; it
cannot aggregate data.
Let us look in detail at the two main operations of Sqoop:
Sqoop Import:
The procedure is carried out with the aid of the Sqoop import command. With the import command we can import a table from a relational database management system into the Hadoop database server. Each record is loaded as a single record and kept in text files in HDFS as part of the Hadoop framework. While importing data, we can also load it directly into Hive and control how it is split. Sqoop also enables incremental import of data: if we have already imported a database and want to add a few more rows, we can import only those rows rather than the entire database.
Sqoop Export:
The Sqoop export command performs the operation in reverse. Here we transfer data from the Hadoop file system to the relational database management system. Before the operation finishes, the data to be exported is converted into database records. The export involves two steps: first the database is searched for metadata, and then the data is moved.

Unit 2

HDFS Architecture
HDFS is composed of master-slave architecture, which includes the following
elements:
NameNode
All the blocks on DataNodes are handled by NameNode, which is known as the
master node. It performs the following functions:
 Monitors and controls all the DataNode instances.
 Permits the user to access a file.
 Stores the metadata about all the blocks on the DataNode instances (the file system namespace and the block-to-DataNode mapping) rather than the data itself.
 EditLogs are committed to disk after every write operation to the NameNode's metadata storage; together with the FsImage, they allow the namespace to be rebuilt after a restart.
 DataNodes send regular heartbeats and block reports to the NameNode, so the NameNode always knows which blocks are alive and on which nodes they are stored.
 Every DataNode runs its own software and is completely independent, while the NameNode coordinates the communication with all of them. If a DataNode fails, the NameNode arranges for its blocks to be re-replicated onto other DataNodes, so the failure of a single DataNode does not impact the rest of the cluster.
Secondary NameNode
The Secondary NameNode is a helper that periodically performs checkpoints for the NameNode; it is not a hot standby. It performs the following duties:
 It periodically downloads the current FsImage and the accumulated EditLogs from the NameNode, merges them into a new, compacted FsImage, and uploads the result back to the NameNode.
 This keeps the EditLog from growing without bound and shortens NameNode restart time, since the namespace can be reloaded from the latest checkpoint plus only the most recent edits.
DataNode
Every slave machine in the cluster runs a DataNode. DataNodes store the actual data blocks as files on the local file system (for example ext3 or ext4). DataNodes do the following (see the sketch below):
 They store the actual data of every file, split into blocks.
 They handle the requested operations on files, such as reading file content and creating new data, as described above.
 They follow the NameNode's instructions, including block creation, deletion, and replication, and report back through heartbeats and block reports.
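A minimal sketch of talking to HDFS programmatically through the Hadoop FileSystem API, shown here from Scala; the NameNode address and the paths are assumptions:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object HdfsSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    conf.set("fs.defaultFS", "hdfs://namenode-host:8020")   // hypothetical NameNode address

    val fs = FileSystem.get(conf)                           // the client asks the NameNode for metadata

    // Write a small file: the bytes go to DataNodes, the NameNode records the block locations.
    val out = fs.create(new Path("/user/demo/hello.txt"))   // hypothetical path
    out.writeBytes("hello hdfs\n")
    out.close()

    // List the directory: a pure metadata operation answered by the NameNode.
    fs.listStatus(new Path("/user/demo")).foreach { st =>
      println(s"${st.getPath} len=${st.getLen} replication=${st.getReplication}")
    }
    fs.close()
  }
}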

MapReduce
 Java-based MapReduce is basically a processing method and a model for a distributed
computing program.
 Map and Reduce are the two crucial jobs that make up the MapReduce algorithm. A data set is transformed into another set through the map, where individual elements are broken down into tuples (key/value pairs).
 The second job is the reduce task, which takes a map's output as input and combines those data tuples into a smaller collection of tuples.
 The reduction work is always carried out following the map job, as the name MapReduce
implies.
 The main benefit of MapReduce would be that data processing can be scaled easily over
several computing nodes.

The logical flow of a MapReduce programming model passes through the following stages.


 Input: This is the input data / file to be processed.
 Split: Hadoop splits the incoming data into smaller pieces called “splits”.
 Map: In this step, MapReduce processes each split according to the logic defined in the map() function. Each mapper works on one split at a time. Each mapper is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
 Combine: This is an optional step used to improve performance by reducing the amount of data transferred across the network. The combiner typically applies the same logic as the reduce step and aggregates the output of the map() function before it is passed to the subsequent steps.
 Shuffle & Sort: In this step, the outputs from all the mappers are shuffled, sorted to put them in order, and grouped before being sent to the next step.
 Reduce: This step aggregates the outputs of the mappers using the reduce() function. The output of the reducer is sent to the next and final step. Each reducer is treated as a task, and multiple tasks are executed across different TaskTrackers and coordinated by the JobTracker.
 Output: Finally, the output of the reduce step is written to a file in HDFS.

Hadoop YARN Architecture


The main components of YARN architecture include:
 Client: It submits map-reduce jobs.
 Resource Manager: It is the master
daemon of YARN and is responsible for
resource assignment and management
among all the applications. Whenever it
receives a processing request, it forwards it
to the corresponding node manager and
allocates resources for the completion of
the request accordingly. It has two major
components:
o Scheduler: It performs scheduling based on the allocated application and the available resources. It is a pure scheduler, meaning it does not perform other tasks such as monitoring or tracking, and it does not guarantee a restart if a task fails. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition the cluster resources.
o Application manager: It is responsible for accepting the application and negotiating
the first container from the resource manager. It also restarts the Application Master
container if a task fails.
 Node Manager: It takes care of an individual node in the Hadoop cluster and manages the applications and workflow on that particular node. Its primary job is to stay in sync with the Resource Manager: it registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage, performs log management, and kills containers based on directions from the Resource Manager. It is also responsible for creating the container process and starting it at the request of the Application Master.
 Application Master: An application is a single job submitted to a framework. The application
master is responsible for negotiating resources with the resource manager, tracking the
status and monitoring progress of a single application. The application master asks the node manager to launch containers by sending a Container Launch Context (CLC), which includes everything the application needs to run. Once the application is started, it sends health reports to the resource manager from time to time.
 Container: It is a collection of physical resources such as RAM, CPU cores and disk on a
single node. The containers are invoked by Container Launch Context(CLC) which is a record
that contains information such as environment variables, security tokens, dependencies etc.

Explain various schedulers used in Hadoop


1. FIFO Scheduler
 As the name suggests, FIFO means First In First Out, so the tasks or applications that come first are served first.
 This is the default Scheduler we use in Hadoop.
 The tasks are placed in a queue and the tasks are
performed in their submission order.
 In this method, once the job is scheduled, no intervention
is allowed.
 So sometimes the high-priority process has to wait for a long time since the priority of the
task does not matter in this method.
Advantage:
 No need for configuration
 First Come First Serve
 simple to execute
Disadvantage:
 Priority of task doesn’t matter, so high priority jobs need to wait
 Not suitable for shared cluster
2. Capacity Scheduler
 In Capacity Scheduler we have multiple job queues for
scheduling our tasks.
 The Capacity Scheduler allows multiple occupants to
share a large size Hadoop cluster.
 In Capacity Scheduler corresponding for each job
queue, we provide some slots or cluster resources for
performing job operation.
 Each job queue has its own slots to perform its tasks. If only one queue has tasks to perform, its tasks can also use the free slots of the other queues; when a new task arrives in another queue, the borrowed slots are handed back so that queue can run its own jobs.
 The Capacity Scheduler mainly contains three types of queues, root, parent, and leaf, which represent the cluster, an organization or subgroup, and the point of application submission, respectively.
Advantage:
 Best for working with Multiple clients or priority jobs in a Hadoop cluster
 Maximizes throughput in the Hadoop cluster
Disadvantage:
 More complex
 Not easy to configure for everyone

3. Fair Scheduler
 The Fair Scheduler is very much similar to that of the
capacity scheduler.
 The priority of the job is kept in consideration.
 With the help of Fair Scheduler, the YARN applications
can share the resources in the large Hadoop Cluster and
these resources are maintained dynamically so no need
for prior capacity.
 The resources are distributed in such a manner that all
applications within a cluster get an equal amount of time.
 Fair Scheduler takes scheduling decisions on the basis of memory; it can be configured to work with CPU as well.
 As noted, it is similar to the Capacity Scheduler, but the key difference is that whenever a high-priority job arrives in the same queue, the Fair Scheduler processes it in parallel by reallocating some portion of the already dedicated slots.
Advantages:
 Resources assigned to each application depend upon its priority.
 It can limit the number of concurrently running tasks in a particular pool or queue.
Disadvantage: configuration is required.

Explain how Failures are handled in Hadoop

UNIT 3

Discuss the important characteristics of Hive. How do you compare Hive and
Relational Databases
Apache Hive
 Apache Hive is a data warehouse and ETL tool which provides an SQL-like interface between the user and data stored in the Hadoop Distributed File System (HDFS), with which it integrates.
 It is built on top of Hadoop.
 It is a software project that provides data query and analysis.
 It facilitates reading, writing, and handling large datasets that are stored in distributed storage and queried using SQL-like syntax.
 It is not built for Online Transactional Processing (OLTP) workloads.
 It is frequently used for data warehousing tasks like data encapsulation, Ad-hoc Queries, and
analysis of huge datasets.
 It is designed to enhance scalability, extensibility, performance, fault-tolerance and loose-
coupling with its input formats.

| Relational Database | Hive |
| --- | --- |
| Maintains a database | Maintains a data warehouse |
| Fixed schema | Varied schema |
| Sparse tables | Dense tables |
| Doesn't support partitioning | Supports automatic partitioning |
| Stores normalized data | Stores both normalized and denormalized data |
| Uses SQL (Structured Query Language) | Uses HQL (Hive Query Language) |

With a neat diagram explain the Architecture of Hive. Explain the measure
components of Hive
Architecture of Hive
The following architecture explains the flow of submission of query into Hive.

Hive Client
Hive allows writing applications in various languages, including Java, Python, and C++. It
supports different types of clients such as:-
 Thrift Server - It is a cross-language service provider platform that serves requests from all the programming languages that support Thrift.
 JDBC Driver - It is used to establish a connection between hive and Java applications. The
JDBC Driver is present in the class org.apache.hadoop.hive.jdbc.HiveDriver.
 ODBC Driver - It allows the applications that support the ODBC protocol to connect to Hive.
Hive Services
The following are the services provided by Hive:-
 Hive CLI - The Hive CLI (Command Line Interface) is a shell where we can execute Hive
queries and commands.
 Hive Web User Interface - The Hive Web UI is just an alternative of Hive CLI. It provides a
web-based GUI for executing Hive queries and commands.
 Hive MetaStore - It is a central repository that stores all the structural information of the various tables and partitions in the warehouse. It also includes metadata for each column and its type, the serializers and deserializers used to read and write data, and the corresponding HDFS files where the data is stored.
 Hive Server - It is referred to as Apache Thrift Server. It accepts the request from different
clients and provides it to Hive Driver.
 Hive Driver - It receives queries from different sources like web UI, CLI, Thrift, and
JDBC/ODBC driver. It transfers the queries to the compiler.
 Hive Compiler - The purpose of the compiler is to parse the query and perform semantic
analysis on the different query blocks and expressions. It converts HiveQL statements into
MapReduce jobs.
 Hive Execution Engine - The optimizer generates the execution plan in the form of a DAG of map-reduce tasks and HDFS tasks. The execution engine then runs the incoming tasks in the order of their dependencies (a small client-side sketch follows).
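A small, hypothetical sketch of issuing HiveQL from Scala through a SparkSession with Hive support enabled (one common client besides the Hive CLI and Beeline; it assumes Hive libraries and a metastore are available, and the table name is an assumption):

import org.apache.spark.sql.SparkSession

object HiveQlSketch {
  def main(args: Array[String]): Unit = {
    // Hive support lets Spark use the Hive metastore and run HiveQL statements.
    val spark = SparkSession.builder()
      .appName("hive-sketch")
      .master("local[*]")              // assumption: local run, just for illustration
      .enableHiveSupport()
      .getOrCreate()

    // The DDL and the query pass through the driver -> compiler -> metastore -> execution
    // engine flow described in the next section.
    spark.sql("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING, dept STRING)")
    spark.sql("SELECT dept, COUNT(*) AS cnt FROM employees GROUP BY dept").show()

    spark.stop()
  }
}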

Job Execution flow in Hive with Hadoop is demonstrated step by step.


 Step-1: Execute Query –
A Hive interface such as the command line or the web UI delivers the query to the driver for execution. The UI calls the execute interface of the driver (for example over JDBC or ODBC).
 Step-2: Get Plan –
The driver creates a session handle for the query and transfers the query to the compiler to make an execution plan; in other words, the driver interacts with the compiler.
 Step-3: Get Metadata –
The compiler sends a metadata request to the metastore and receives the necessary metadata from it.
 Step-4: Send Metadata –
The metastore transfers the metadata to the compiler as an acknowledgment.
 Step-5: Send Plan –
The compiler sends the execution plan it has created back to the driver to execute the query.
 Step-6: Execute Plan –
The execution plan is sent to the execution engine by the driver.
o Execute Job
o Job Done
o DFS operation (metadata operation)
 Step-7: Fetch Results –
The driver fetches the results and delivers them to the user interface (UI).
 Step-8: Send Results –
The driver sends the query to the execution engine; when the result is retrieved from the data nodes, the execution engine returns it to the driver, which passes it on to the user interface (UI).

Spark Transformation
 Spark Transformation is a function that produces new RDD from the existing RDDs.
 It takes RDD as input and produces one or more RDD as output.
 Each time we apply a transformation, it creates a new RDD; the input RDDs themselves cannot be changed, since RDDs are immutable in nature.
Narrow Transformations:
These types of transformations convert each input partition to only one output partition: each partition of the parent RDD is used by at most one partition of the child RDD, i.e. each child partition depends on a single parent partition.
 This kind of transformation is fast.
 It does not require any data shuffling or data movement over the cluster network.
 Operations such as map() and filter() belong to these transformations.

Wide Transformations:
This type of transformation has input partitions contributing to many output partitions: each partition of the parent RDD may be used by multiple partitions of the child RDD, i.e. each child partition can depend on multiple parent partitions.
 Slower than narrow transformations; performance can be significantly affected because data may have to be shuffled between nodes when the new partitions are created.
 May require data shuffling over the cluster network.
 Functions such as groupByKey(), aggregateByKey(), aggregate(), join(), and repartition() are examples of wider transformations.

 map(func)
The map function iterates over every element in the RDD and produces a new RDD.
Using the map() transformation we pass in any function, and that function is applied to every element of the RDD.
With map, the input and the return type of the RDD may differ from each other. For example, the input RDD can be of type String and, after applying the map() function, the returned RDD can be Boolean.
For example, for the RDD {1, 2, 3, 4, 5}, applying rdd.map(x => x + 2) gives the result (3, 4, 5, 6, 7).
 filter(func)
Spark RDD filter() function returns a new RDD, containing only the elements that meet a
predicate. It is a narrow operation because it does not shuffle data from one partition to many
partitions.
For example, Suppose RDD contains first five natural numbers (1, 2, 3, 4, and 5) and the
predicate is check for an even number. The resulting RDD after the filter will contain only the
even numbers i.e., 2 and 4.
Filter() example:
val data = spark.read.textFile("spark_test.txt").rdd
val mapFile = data.flatMap(lines => lines.split(" ")).filter(value => value == "spark")
println(mapFile.count())
 groupByKey()
When we use groupByKey() on a dataset of (K, V) pairs, the data is shuffled according to the key
value K in another RDD. In this transformation, lots of unnecessary data get to transfer over the
network.
Spark provides the provision to save data to disk when there is more data shuffled onto a single
executor machine than can fit in memory.
groupByKey() example:
val data = spark.sparkContext.parallelize(Array(('k',5),('s',3),('s',4),('p',7),('p',5),('t',8),('k',6)), 3)
val group = data.groupByKey().collect()
group.foreach(println)
 reduceByKey(func, [numTasks])
When we use reduceByKey on a dataset (K, V), the pairs on the same machine with the same
key are combined, before the data is shuffled.
reduceByKey() example:
val words = Array("one","two","two","four","five","six","six","eight","nine","ten")
val data = spark.sparkContext.parallelize(words).map(w => (w,1)).reduceByKey(_+_)
data.foreach(println)
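A small comparative sketch (assuming, as in the examples above, that a SparkSession named spark is already available) of why reduceByKey() is usually preferred over groupByKey() for aggregations: it combines values within each partition before the shuffle, so less data crosses the network:

val sales = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// groupByKey shuffles every (key, value) pair, then sums on the reducer side.
val viaGroup = sales.groupByKey().mapValues(_.sum)

// reduceByKey sums locally within each partition first, then shuffles only the partial sums.
val viaReduce = sales.reduceByKey(_ + _)

viaGroup.collect().foreach(println)    // same result either way: (a,4), (b,6)
viaReduce.collect().foreach(println)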

Resilient Distributed Datasets (RDDs)


 RDDs are the main logical data units in Spark.
 They are a distributed collection of objects, which are stored in memory or on disks of
different machines of a cluster.
 A single RDD can be divided into multiple logical partitions so that these partitions can be
stored and processed on different machines of a cluster.
 RDDs are immutable (read-only) in nature. You cannot change an original RDD, but you can
create new RDDs by performing coarse-grain operations, like transformations, on an existing
RDD.
 An RDD in Spark can be cached and used again for future transformations, which is a huge
benefit for users.
 RDDs are said to be lazily evaluated, i.e., they delay the evaluation until it is really needed.
This saves a lot of time and improves efficiency.

RDDs vs DataFrames vs Datasets


| | RDDs | DataFrames | Datasets |
| --- | --- | --- | --- |
| Inception Year | RDDs came into existence in 2011. | DataFrames came into existence in 2013. | Datasets entered the market in 2015. |
| Data Representation | A distributed collection of data elements without any schema. | A distributed collection organized into named columns. | An extension of DataFrames with more features such as type-safety and an object-oriented interface. |
| Optimization | No in-built optimization engine; developers need to write the optimized code themselves. | Uses a catalyst optimizer for optimization. | Also uses a catalyst optimizer for optimization purposes. |
| Projection of Schema | The schema needs to be defined manually. | Automatically finds out the schema of the dataset. | Also automatically finds out the schema of the dataset by using the SQL Engine. |
| Aggregation Operation | Slower than both DataFrames and Datasets for simple operations like grouping the data. | Provides an easy API to perform aggregation operations and performs aggregation faster than both RDDs and Datasets. | Faster than RDDs but a bit slower than DataFrames. |

Features Of RDD
i. In-memory Computation
Spark RDDs have a provision of in-memory computation. It stores intermediate results in
distributed memory(RAM) instead of stable storage(disk).
ii. Lazy Evaluations
All transformations in Apache Spark are lazy, in that they do not compute their results right
away. Instead, they just remember the transformations applied to some base data set.
Spark computes transformations when an action requires a result for the driver program.
iii. Fault Tolerance
Spark RDDs are fault tolerant as they track data lineage information to rebuild lost data
automatically on failure. They rebuild lost data on failure using lineage, each RDD remembers
how it was created from other datasets (by transformations like a map, join or groupBy) to
recreate itself.
iv. Immutability
Data is safe to share across processes. It can also be created or retrieved anytime which makes
caching, sharing & replication easy. Thus, it is a way to reach consistency in computations.
v. Partitioning
Partitioning is the fundamental unit of parallelism in Spark RDD. Each partition is one logical division of the data; partitions themselves are immutable, and new partitions are created through transformations on existing ones (see the sketch below).
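A minimal sketch (assuming a SparkContext named sc, e.g. spark.sparkContext) that illustrates partitioning, lazy evaluation, and in-memory caching together:

import org.apache.spark.storage.StorageLevel

val nums = sc.parallelize(1 to 1000000, numSlices = 8)   // 8 partitions spread across the cluster
println(nums.getNumPartitions)                           // 8

// Nothing is computed yet: transformations only record the lineage.
val squares = nums.map(x => x.toLong * x)

// Persist in memory so later actions reuse the computed partitions.
squares.persist(StorageLevel.MEMORY_ONLY)

println(squares.count())          // first action: computes and caches the partitions
println(squares.reduce(_ + _))    // second action: served from the in-memory cache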

UNIT 4
With a neat diagram explain spark ecosystem
The Spark project consists of different types of tightly integrated components.
At its core, Spark is a computational engine that can schedule, distribute and monitor multiple
applications.
Let's understand each Spark component in detail.

Spark Core
 The Spark Core is the heart of Spark and performs the core functionality.
 It holds the components for task scheduling, fault recovery, interacting with storage systems
and memory management.
Spark SQL
 The Spark SQL is built on the top of Spark Core. It provides support for structured data.
 It allows to query the data via SQL (Structured Query Language) as well as the Apache Hive
variant of SQL called the HQL (Hive Query Language).
 It supports JDBC and ODBC connections that establish a relation between Java objects and
existing databases, data warehouses and business intelligence tools.
 It also supports various sources of data like Hive tables, Parquet, and JSON.
Spark Streaming
 Spark Streaming is a Spark component that supports scalable and fault-tolerant processing
of streaming data.
 It uses Spark Core's fast scheduling capability to perform streaming analytics.
 It accepts data in mini-batches and performs RDD transformations on that data.
 Its design ensures that the applications written for streaming data can be reused to analyse
batches of historical data with little modification.
 The log files generated by web servers can be considered as a real-time example of a data
stream.
MLlib
 The MLlib is a Machine Learning library that contains various machine learning algorithms.
 These include correlations and hypothesis testing, classification and regression, clustering,
and principal component analysis.
 It is nine times faster than the disk-based implementation used by Apache Mahout.
GraphX
 The GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
 It facilitates the creation of a directed graph with arbitrary properties attached to each vertex and edge.
 To manipulate graphs, it supports various fundamental operators such as subgraph, joinVertices, and aggregateMessages.

Master-slave architecture in Apache Spark


Apache Spark's architecture is based on a master-slave structure where a driver program (the
master node) operates with multiple executors or worker nodes(the slave nodes). The cluster
consists of a single master and multiple slaves, and Spark jobs are distributed across this cluster.

Apache Spark architecture in a nutshell.


The Driver Program is the "master" in the master-slave architecture that runs the main function
and creates a SparkContext, acting as the entry point and gateway to all Spark functionalities. It
communicates with the Cluster Manager to supervise jobs, partitions the job into tasks, and
assigns these tasks to worker nodes.

The Cluster Manager is responsible for allocating resources in the cluster. Apache Spark is
designed to be compatible with a range of options:
 Standalone Cluster Manager: A straightforward, pre-integrated option bundled with
Spark, suitable for managing smaller workloads.
 Hadoop YARN: Often the preferred choice due to its scalability and seamless integration
with Hadoop's data storage systems, ideal for larger, distributed workloads.
 Apache Mesos: A robust option that manages resources across entire data centres,
making it suitable for large-scale, diverse workloads.
 Kubernetes: A modern container orchestration platform that gained popularity as a
cluster manager for Spark applications owing to its robustness and compatibility with
containerized environments.
This flexible approach allows users to select the Cluster Manager that best fits their specific
needs, whether those pertain to workload scale, hardware type, or application requirements.
The Executors or Worker Nodes are the “slaves” responsible for task completion. They process tasks on the partitioned RDDs and return the results to the SparkContext (see the sketch below).
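A minimal sketch of the driver side, assuming a local master for illustration (on a real cluster the master URL would point at YARN, Mesos, Kubernetes, or a standalone master):

import org.apache.spark.sql.SparkSession

object DriverSketch {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkSession/SparkContext: the entry point to Spark.
    val spark = SparkSession.builder()
      .appName("driver-sketch")
      .master("local[*]")            // assumption; e.g. "yarn" when a cluster manager is used
      .getOrCreate()
    val sc = spark.sparkContext

    // The driver splits the job into tasks; executors run them and send the results back.
    val result = sc.parallelize(1 to 100, 4).map(_ * 2).reduce(_ + _)
    println(result)                  // 10100

    spark.stop()
  }
}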

Lazy Evaluation in Apache Spark


Lazy evaluation is a key feature of Apache Spark that improves its efficiency and performance. It refers to the strategy where transformations on distributed datasets are not immediately executed; instead, their execution is delayed until an ACTION is called.
For example:
 When we perform operations on an RDD/DataFrame/DataSet, such as filtering or mapping,
Spark doesn’t immediately process the data to produce the results.
 Instead, it builds a logical execution plan, called the DAG (Directed Acyclic Graph), which
represents the sequence of transformations to be applied on the RDD/DataFrame/DataSet.
This DAG is built incrementally as we apply more transformations.
 The evaluation of DAG starts when an ACTION is called. Some of the examples of an action
in Spark are collect, count, saveAsTextFile, first, foreach, countByKey, etc.
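A small illustrative sketch (assuming a SparkSession named spark): the filter and map calls below only extend the DAG, and work is done only when the action count() or collect() is invoked:

val logs = spark.sparkContext.parallelize(Seq(
  "INFO started", "ERROR disk full", "INFO done", "ERROR timeout"))

// Transformations: returned immediately, nothing is computed yet.
val errors   = logs.filter(_.startsWith("ERROR"))
val messages = errors.map(_.stripPrefix("ERROR").trim)

// Action: triggers evaluation of the whole lineage (parallelize -> filter -> map).
println(messages.count())            // 2
messages.collect().foreach(println)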

Advantages of Lazy Evaluation


 Increases Manageability :Using Apache Spark RDD lazy evaluation, users can freely organize their
Apache Spark program into smaller operations. It reduces the number of passes on data by grouping
operations.
 Saves computation and increases speed: Lazy evaluation plays a key role in saving calculation overhead. A value that is never used never needs to be computed; only the necessary values are computed. It also saves trips between the driver and the cluster, which speeds up the process.
 Reduces complexities: The two main complexities of any operation are time and space. Using Spark lazy evaluation we can reduce both: since we do not execute every operation eagerly, time is saved, and it even lets us work with an infinite data structure. The action triggers computation only when the data is required, which reduces overhead.
 Optimization : It provides optimization by reducing the number of queries.

What is RDD? Discuss the necessities for RDD abstraction


RDD stands for “Resilient Distributed Dataset”. It is the fundamental data structure of Apache Spark. RDD
in Apache Spark is an immutable collection of objects which computes on the different node of the
cluster.
Decomposing the name RDD:
 Resilient, i.e. fault-tolerant with the help of RDD lineage graph(DAG) and so able to recompute
missing or damaged partitions due to node failures.
 Distributed, since Data resides on multiple nodes.
 Dataset represents records of the data you work with. The user can load the data set externally
which can be either JSON file, CSV file, text file or database via JDBC with no specific data structure.

The key motivations behind the concept of RDD are-


 Iterative algorithms.
 Interactive data mining tools.
 DSM (Distributed Shared Memory) is a very general abstraction, but this generality makes it harder
to implement in an efficient and fault tolerant manner on commodity clusters. Here the need of RDD
comes into the picture.
 In distributed computing systems, data is stored in an intermediate stable distributed store such as HDFS or Amazon S3. This makes job computation slower since it involves many IO operations, replications, and serializations in the process.
In the first two cases, keeping data in memory can improve performance by an order of magnitude.
The main challenge in designing RDD is defining a program interface that provides fault tolerance
efficiently. To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based
on coarse-grained transformation rather than fine-grained updates to shared state.

Operations of RDD
Two kinds of operations can be applied to an RDD: transformations and actions.
Transformations
Transformations are the operations you perform on an RDD to get a result which is also an RDD. Examples include applying functions such as filter(), union(), map(), flatMap(), distinct(), reduceByKey(), mapPartitions(), and sortBy(), each of which creates another resultant RDD. Lazy evaluation is applied in the creation of RDDs.
Actions
Actions return results to the driver program or write them to storage, and they kick off a computation. Some examples are count(), first(), collect(), take(), countByKey(), collectAsMap(), and reduce().
Transformations always return an RDD, whereas actions return some other data type.

UNIT 5

Apache Kafka Components and Its Architectural Concepts


Topics
A stream of messages that are a part of a specific category or feed name is referred to as a Kafka topic. In
Kafka, data is stored in the form of topics. Producers write their data to topics, and consumers read the
data from these topics.
Brokers
A Kafka cluster comprises one or more servers that are known as brokers. In Kafka, a broker works as a
container that can hold multiple topics with different partitions. A unique integer ID is used to identify
brokers in the Kafka cluster. Connection with any one of the kafka brokers in the cluster implies a
connection with the whole cluster. If there is more than one broker in a cluster, the brokers need not
contain the complete data associated with a particular topic.
Consumers and Consumer Groups
Consumers read data from the Kafka cluster. The data to be read by the consumers has to be pulled from
the broker when the consumer is ready to receive the message. A consumer group in Kafka refers to a
number of consumers that pull data from the same topic or same set of topics.
Producers
Producers in Kafka publish messages to one or more topics. They send data to the Kafka cluster.
Whenever a Kafka producer publishes a message to Kafka, the broker receives the message and appends
it to a particular partition. Producers are given a choice to publish messages to a partition of their choice.
Partitions
Topics in Kafka are divided into a configurable number of parts, known as partitions. Partitions allow several consumers to read data from a particular topic in parallel, and records within a partition are kept in strict order. The number of partitions is specified when configuring a topic, but it can be changed later on.
The partitions comprising a topic are distributed across servers in the Kafka cluster. Each server in the
cluster handles the data and requests for its share of partitions. Messages are sent to the broker along
with a key. The key can be used to determine which partition that particular message will go to. All
messages which have the same key go to the same partition. If the key is not specified, then the partition
will be decided in a round-robin fashion.
Partition Offset
Messages or records in Kafka are assigned to a partition. To specify the position of the records within the
partition, each record is provided with an offset. A record can be uniquely identified within its partition
using the offset value associated with it. A partition offset carries meaning only within that particular
partition. Older records will have lower offset values since records are added to the ends of partitions.
Replicas
Replicas are like backups for partitions in Kafka. They are used to ensure that there is no data loss in the
event of a failure or a planned shutdown. Partitions of a topic are published across multiple servers in a
Kafka cluster. Copies of the partition are known as Replicas.
Leader and Follower
Every partition in Kafka will have one server that plays the role of a leader for that particular partition.
The leader is responsible for performing all the read and write tasks for the partition. Each partition can
have zero or more followers. The duty of the follower is to replicate the data of the leader. In the event
of a failure in the leader for a particular partition, one of the follower nodes can take on the role of the
leader.
Role of producer and consumer
Kafka Producers
 In Kafka, the producers send data directly to the broker that plays the role of leader for a given
partition.
 In order to help the producer send the messages directly, the nodes of the Kafka cluster answer
requests for metadata on which servers are alive and the current status of the leaders of partitions
of a topic so that the producer can direct its requests accordingly.
 The client decides which partition it publishes its messages to.
 This can either be done arbitrarily or by making use of a partitioning key, where all messages
containing the same partition key will be sent to the same partition.
 Messages in Kafka are sent in the form of batches, known as record batches.
 The producers accumulate messages in memory and send them in batches either after a fixed
number of messages are accumulated or before a fixed latency bound period of time has elapsed.
Kafka Consumers
 In Kafka, the consumer has to issue requests to the brokers indicating the partitions it wants to
consume.
 The consumer is required to specify its offset in the request and receives a chunk of log beginning
from the offset position from the broker.
 Since the consumer has control over this position, it can re-consume data if required.
 Records remain in the log for a configurable time period which is known as the retention period. The
consumer may re-consume the data as long as the data is present in the log.
 In Kafka, the consumers work on a pull-based approach. This means that data is not immediately
pushed onto the consumers from the brokers.
 The consumers have to send requests to the brokers to indicate that they are ready to consume the
data.
 A pull-based system ensures that the consumer does not get overwhelmed with messages and can
fall behind and catch up when it can.
 A pull-based system can also allow aggressive batching of data sent to the consumer since the
consumer will pull all available messages after its current position in the log. In this manner, batching
is performed without any unnecessary latency.
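A minimal consumer sketch using the standard Kafka Java client from Scala; the broker address, topic name, and consumer group are assumptions:

import java.time.Duration
import java.util.Properties
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.serialization.StringDeserializer
import scala.jdk.CollectionConverters._

object ConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "demo-group")              // assumed consumer group
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, classOf[StringDeserializer].getName)
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")       // start from the oldest offset

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Collections.singletonList("orders"))    // assumed topic

    // Pull-based loop: the consumer asks the broker for records starting at its current offset.
    for (_ <- 1 to 10) {
      val records = consumer.poll(Duration.ofMillis(500))
      for (r <- records.asScala)
        println(s"partition=${r.partition()} offset=${r.offset()} key=${r.key()} value=${r.value()}")
    }
    consumer.close()
  }
}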

Producers
Just like in the messaging world, Producers in Kafka are the ones who produce and send the messages to
the topics.
As said before, by default messages are sent to partitions in a round-robin way. For example, message 01 goes to partition 0 of Topic 1, and message 02 to partition 1 of the same topic. This means we cannot guarantee that messages produced by the same producer will always land in the same partition. To get that guarantee, we specify a key when sending the message: Kafka generates a hash of the key and uses it to decide which partition to deliver the message to.
That hash takes into consideration the number of partitions of the topic, which is why changing that number after the topic has been created also changes which partition a given key maps to.
When we are working with the concept of messages, there’s something called Acknowledgment
(ack). The ack is basically a confirmation that the message was delivered. In Kafka, we can configure this
ack when producing the messages. There are three different levels of configuration for that:
 ack = 0: When we configure the ack = 0, we’re saying that we don’t want to receive the ack from
Kafka. In case of broker failure, the message will be lost;
 ack = 1: This is the default configuration, with that we’re saying that we want to receive an ack
from the leader of the partition. The data will only be lost if the leader goes down (still there’s a
chance);
 ack = all: This is the most reliable configuration. We are saying that we want to receive a confirmation not only from the leader but from its replicas as well. This is the most secure configuration, since there is no data loss; remember that the replicas need to be in sync (ISR), and Kafka waits for the in-sync replicas before sending back the ack (a producer sketch follows).

Catalyst Optimizer
Catalyst Optimizer is a component of Apache Spark's SQL engine, known as Spark SQL. It is a powerful
tool used for optimizing query performance in Spark. The primary goal of the Catalyst Optimizer is to
transform and optimize the user's SQL or DataFrame operations into an efficient physical execution plan.
The Catalyst Optimizer applies a series of optimization techniques to improve query execution time.
Some of these optimizations include but are not limited to:
 Predicate Pushdown: This optimization pushes down filters and predicates closer to the data source,
reducing the amount of unnecessary data that needs to be processed.
 Column Pruning: It eliminates unnecessary columns from being read or loaded during query
execution, reducing I/O and memory usage.
 Constant Folding: This optimization identifies and evaluates constant expressions during query
analysis, reducing computational overhead during execution.
 Projection Pushdown: It projects only the required columns of a table or dataset, reducing the
amount of data that needs to be processed, thus improving query performance.
 Join Reordering: The Catalyst Optimizer reorders join operations based on statistical information
about the data, which can lead to more efficient join strategies.
 Cost-Based Optimization: It leverages statistics and cost models to estimate the cost of different
query plans and selects the most efficient plan based on these estimates.
By applying these and other optimization techniques, the Catalyst Optimizer aims to generate an
optimized physical plan that executes the query efficiently, thereby improving the overall performance of
Spark SQL queries.
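A tiny sketch (assuming a SparkSession named spark and a hypothetical Parquet file people.parquet) of inspecting the plan Catalyst produces: explain() prints the logical and physical plans, where effects such as predicate pushdown and column pruning become visible:

val people = spark.read.parquet("people.parquet")   // hypothetical columnar source

val adultsByCity = people
  .filter(people("age") > 18)        // candidate for predicate pushdown
  .select("city", "age")             // candidate for column pruning
  .groupBy("city")
  .count()

adultsByCity.explain(true)   // shows the parsed, analyzed, optimized logical plans and the physical plan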
Spark SQL
 Spark SQL is a Spark module for structured data processing. It provides a programming abstraction
called DataFrames and can also act as a distributed SQL query engine.
 It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and
data.
 It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL
query processing with machine learning).
 Spark SQL brings native support for SQL to Spark and streamlines the process of querying data stored
both in RDDs (Spark’s distributed datasets) and in external sources.
 Spark SQL conveniently blurs the lines between RDDs and relational tables.
 Spark SQL also includes a cost-based optimizer, columnar storage, and code generation to make
queries fast.
 At the same time, it scales to thousands of nodes and multi-hour queries using the Spark engine,
which provides full mid-query fault tolerance, without having to worry about using a different engine
for historical data
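A minimal sketch (assuming a SparkSession named spark) of using Spark SQL both through the DataFrame API and as a SQL query engine over a temporary view; the column names and values are assumptions:

import spark.implicits._

// A small DataFrame; in practice this would come from HDFS, Hive, Parquet, JSON, etc.
val employees = Seq(("Asha", "Sales", 50000), ("Ravi", "Sales", 60000), ("Meena", "HR", 45000))
  .toDF("name", "dept", "salary")

// Register the DataFrame as a temporary view so it can be queried with plain SQL.
employees.createOrReplaceTempView("employees")

val avgByDept = spark.sql(
  "SELECT dept, AVG(salary) AS avg_salary FROM employees GROUP BY dept")

avgByDept.show()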

| Parameters | RDBMS | Apache Spark |
| --- | --- | --- |
| Type of System | Database system | Cluster computing system |
| Data Structure | Structured data | Structured, semi-structured and unstructured data |
| Cost / Accessibility | It is licensed | Charges apply to avail the services |
| Components | Tables, rows, columns, keys | Resilient Distributed Dataset (RDD), Mesos master, Directed Acyclic Graph (DAG), DataFrames, Datasets |
| Data Objects | Relational tables | Resilient Distributed Dataset (RDD) |
| Querying | Uses Structured Query Language (SQL) | Uses Spark SQL, HiveQL, Scala |
| Processing power | Only reading is fast | Faster processing (100x) than Hadoop as it works in RAM |
| Scalability | Limited scalability | Lower scalability compared to Hadoop |
| Caching | Stores frequently queried data in a temporary memory | Can cache data in memory for further iterations |
| Application | Online Transaction Processing (OLTP) | Streaming data, machine learning, fog computing |
| Security | Highly secure | Less secure compared to Hadoop |
| Data Integrity | High | Low |
| Hardware Requirements | High-end servers | Mid- to high-level hardware |
