BigData Unit-4 Complete

UNIT-IV

BIG DATA
HADOOP ECOSYSTEM COMPONENTS
• The Hadoop ecosystem is a collection of open-source software utilities,
frameworks, and tools designed to facilitate big data processing and analytics.
• The Hadoop ecosystem can be broadly categorized into several layers, each
serving a specific purpose in the big data processing and analytics pipeline.
Here are the main layers:
1.Storage Layer:
Hadoop Distributed File System (HDFS): This is the primary storage layer in Hadoop.
HDFS is a distributed file system that provides high-throughput access to application data
by storing it across multiple nodes in a Hadoop cluster.
2.Resource Management Layer:
YARN (Yet Another Resource Negotiator): YARN is a resource management layer in
Hadoop that manages resources in the cluster and schedules tasks. It enables multiple
data processing frameworks to run on the same cluster by decoupling resource
management from job scheduling and monitoring.
HADOOP ECOSYSTEM COMPONENTS
3. Data Processing Layer:
a) MapReduce: Programming model for distributed computing on large datasets.
b) Apache Spark: Cluster computing framework supporting in-memory
processing and various data processing tasks.
4. Query and Analytics Layer:
a) Apache Hive: Data warehousing framework providing a SQL-like interface
for querying and analyzing data stored in HDFS.
b) Apache Pig: High-level platform for creating MapReduce programs.
c) Apache Impala: SQL query engine for interactive analytics on data stored in
HDFS and Apache HBase.
d) Presto: Distributed SQL query engine for interactive analytics on large
datasets.
HADOOP ECOSYSTEM COMPONENTS
5. Data Ingestion and Integration Layer:
a) Apache Kafka: Distributed streaming platform for building real-time data
pipelines.
b) Apache Flume: Service for efficiently collecting and aggregating log data
from various sources.
c) Apache Sqoop: Tool for efficiently transferring bulk data between Hadoop
and relational databases.

6. Workflow Orchestration and Coordination:
Apache Oozie: Workflow scheduler for managing Hadoop jobs and
coordinating data processing workflows.
HADOOP ECOSYSTEM COMPONENTS
7. Distributed Coordination Service:
Apache ZooKeeper: Distributed coordination service for maintaining
configuration information, providing distributed synchronization, and providing
group services.
8. Data Serialization and Exchange:
Apache Avro: Data serialization system providing rich data structures and a
compact binary format.
9. Machine Learning and Data Science:
Apache Mahout: Library of scalable machine learning algorithms for clustering,
classification, and collaborative filtering.
10. Log Collection and Analysis:
Apache Chukwa: System for collecting and analyzing large volumes of log data
from distributed systems.
SCHEDULERS
• In Hadoop, schedulers play a crucial role in managing and allocating
cluster resources efficiently to different jobs and tasks. They ensure
that computational resources are utilized optimally, leading to
improved performance and resource utilization. Here are the main
schedulers used in Hadoop:
1.FIFO Scheduler:
a) The FIFO (First-In-First-Out) scheduler is the simplest scheduling algorithm
in Hadoop.
b) It schedules jobs in the order they are submitted to the cluster without
considering job priority or resource availability.
c) While simple, it may not be suitable for multi-tenant environments or
scenarios where certain jobs require priority execution.
SCHEDULERS
2. Capacity Scheduler:
a) The Capacity Scheduler is a pluggable scheduler in Hadoop that supports
multiple queues, each with its own guaranteed capacity.
b) It allows the cluster to be divided into multiple queues, where each queue is
allocated a certain percentage of cluster resources.
c) Jobs submitted to a queue can only use resources allocated to that queue,
preventing one job from monopolizing the entire cluster.
d) The Capacity Scheduler is suitable for multi-tenant environments where
different users or departments share the same Hadoop cluster.
SCHEDULERS
3. Fair Scheduler:
a) The Fair Scheduler is another pluggable scheduler in Hadoop that aims to
provide fair sharing of cluster resources among different jobs.
b) It dynamically allocates resources to jobs based on their demands, ensuring
that all jobs get an equal share of resources over time.
c) The Fair Scheduler divides resources among active jobs in a round-robin
fashion, adjusting allocations based on job requirements and cluster load.
d) This scheduler is suitable for environments where fair sharing of resources is
more important than guaranteed capacity.
SCHEDULERS
4. YARN ResourceManager Scheduler:
a) YARN (Yet Another Resource Negotiator) ResourceManager includes its
own scheduler, which is responsible for allocating resources to applications
in a cluster managed by YARN.
b) The ResourceManager Scheduler works with pluggable scheduler
implementations, such as Capacity Scheduler and Fair Scheduler, to allocate
resources to applications based on their resource requests and the cluster's
resource availability.
c) It manages the allocation of containers, which are units of resources (CPU,
memory) used by applications running on the cluster.
HADOOP 2.0 NEW FEATURES
• Hadoop 2.0, also known as Apache Hadoop 2.x, introduced several
significant features and improvements over its predecessor, Hadoop 1.x.
Here are some key features of Hadoop 2.0:
1.YARN (Yet Another Resource Negotiator):
 YARN is one of the most significant additions to Hadoop 2.0. It decouples the
resource management and job scheduling/monitoring functions from the
MapReduce programming model.
 With YARN, Hadoop becomes a more generic data processing platform,
supporting not only MapReduce but also other processing frameworks such as
Apache Spark, Apache Flink, and Apache Tez.
 YARN provides a central ResourceManager for managing cluster resources and
a per-application ApplicationMaster for negotiating resources with the
ResourceManager and executing tasks.
HADOOP 2.0 NEW FEATURES
2. High Availability NameNode (HA):
 Hadoop 2.0 introduced High Availability for the NameNode, which is a critical
component in HDFS responsible for managing metadata and namespace operations.
 HA ensures that the NameNode is fault-tolerant and highly available by
introducing a standby NameNode that automatically takes over in case of a
failure in the active NameNode.
 This feature improves the overall reliability and availability of the Hadoop
cluster, reducing the risk of downtime due to NameNode failures.
HADOOP 2.0 NEW FEATURES
3. Support for HDFS Federation:
 Hadoop 2.0 introduced HDFS Federation, which allows multiple NameNodes
to manage separate portions of the filesystem namespace.
 With Federation, the namespace of HDFS is divided into multiple
independent namespaces, each managed by its own NameNode.
 This improves scalability and performance by distributing the namespace
metadata and load across multiple NameNodes, allowing for larger clusters
and higher throughput.
HADOOP 2.0 NEW FEATURES
4. Next-Generation MapReduce (MapReduce 2):
 Hadoop 2.0 includes enhancements to the MapReduce framework to make it
compatible with YARN and support non-MapReduce data processing models.
 MapReduce 2, or MRv2, allows MapReduce jobs to run as applications
within YARN, enabling better resource utilization and multi-framework
support.
 MRv2 also introduces performance improvements and optimizations to the
MapReduce framework, making it more efficient and scalable.
HADOOP 2.0 NEW FEATURES
5. Support for Other Data Processing Frameworks:
 Hadoop 2.0 opens up the Hadoop ecosystem to support other data processing
frameworks besides MapReduce.
 Frameworks like Apache Spark, Apache Flink, and Apache Tez can run on top
of YARN, leveraging its resource management capabilities and integrating
seamlessly with the Hadoop ecosystem.
6. Improved Security:
 Hadoop 2.0 includes enhancements to security features, such as better
authentication, authorization, and encryption mechanisms.
 It introduces support for Kerberos authentication, Access Control Lists (ACLs),
and integration with external security solutions like Apache Ranger and Apache
Sentry.
SQL vs NoSQL
 NoSQL (often interpreted as Not only SQL) database
 It provides a mechanism for storage and retrieval of data that is modeled in
means other than the tabular relations used in relational databases.

SQL vs NoSQL:
• SQL: Relational Database Management System (RDBMS). NoSQL: Non-relational or distributed database system.
• SQL databases have a fixed, static, predefined schema. NoSQL databases have a dynamic schema.
• SQL databases are best suited for complex queries. NoSQL databases are not so good for complex queries.
• SQL databases are vertically scalable. NoSQL databases are horizontally scalable.
• SQL follows the ACID properties. NoSQL follows the BASE properties.

SQL vs NoSQL
NoSQL Types

Graph database

Document-oriented

Column family
What is MongoDB?
 MongoDB is an open source, document-oriented database designed with both
scalability and developer agility in mind.
 Instead of storing your data in tables and rows as you would with a relational database,
in MongoDB you store JSON-like documents with dynamic schemas (schema-free,
schemaless).
{
  "_id" : ObjectId("5114e0bd42…"),
  "FirstName" : "John",
  "LastName" : "Doe",
  "Age" : 39,
  "Interests" : [ "Reading", "Mountain Biking" ],
  "Favorites" : {
    "color" : "Blue",
    "sport" : "Soccer"
  }
}
MongoDB is Easy to Use: Schema Free
• MongoDB does not need any pre-defined data schema.
• Every document could have different data, for example:
{ name: "will", eyes: "blue", birthplace: "NY", aliases: ["bill", "ben"], loc: [32.7, 63.4], boss: "ben" }
{ name: "jeff", eyes: "blue", loc: [40.7, 73.4], boss: "ben" }
{ name: "brendan", boss: "will" }
{ name: "matt", weight: 60, height: 72, loc: [44.6, 71.3] }
{ name: "ben", age: 25 }

RDBMS vs MongoDB
• Database  Database
• Table  Collection
• Row  Document (JSON, BSON)
• Column  Field
• Index  Index
• Join  Embedded Document
• Partition  Shard
Features Of MongoDB
• Document-Oriented storage
• Full Index Support
• Replication & High Availability
• Auto-Sharding
• Aggregation
• MongoDB Atlas
• Various APIs
• JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++,
Haskell, Erlang
• Community
MongoDB
MongoDB is an open-source NoSQL database system designed for handling large volumes
of data. Unlike traditional relational databases, MongoDB uses a flexible, document-
oriented data model, making it particularly suitable for applications with evolving schemas
and complex data structures.
• Here's a brief introduction to some key concepts in MongoDB:
• Document: In MongoDB, data is stored in flexible, JSON-like documents. A document is a set of key-
value pairs, where keys are strings and values can be various data types, including strings, numbers,
arrays, or even nested documents.
• Collection: Collections are analogous to tables in relational databases. They are groups of MongoDB
documents, and each document within a collection can have a different structure. Collections do
not enforce a schema, allowing for flexibility in data representation.
• Database: A MongoDB database is a container for collections. It holds one or more collections of
documents.
• Document ID: Each document in a collection has a unique identifier called the "_id" field. This field
is automatically indexed and ensures the uniqueness of each document within a collection.
MongoDB
• Query Language: MongoDB provides a powerful query language that allows you to retrieve, filter, and
manipulate data stored in the database. Queries are expressed using JSON-like syntax.
• Indexes: MongoDB supports indexing to improve query performance. Indexes can be created on any field
in a document, including nested fields, and can significantly speed up data retrieval operations.
• Replication: MongoDB supports replica sets, which are groups of MongoDB instances that maintain the
same data set. Replica sets provide high availability and fault tolerance by automatically electing a
primary node to serve read and write operations.
• Sharding: Sharding is a method for distributing data across multiple machines to support horizontal
scalability. MongoDB can automatically partition data across shards based on a shard key, allowing for
high throughput and storage capacity.
• Aggregation Framework: MongoDB provides a powerful aggregation framework for performing data
aggregation operations, such as grouping, sorting, and filtering, similar to SQL's GROUP BY and ORDER BY
clauses.
• GridFS: MongoDB includes a specification called GridFS for storing and retrieving large files, such as
images, videos, and audio files, as separate documents.
• MongoDB is widely used in modern web applications, big data, real-time analytics, and IoT (Internet of Things)
applications due to its flexibility, scalability, and performance. It's a valuable tool for developers seeking to
manage and analyze large volumes of diverse data.
Replication
• Replication provides redundancy and increases data availability.
• With multiple copies of data on different database servers,
replication provides a level of fault tolerance against the loss of a
single database server.

Sharding
• Sharding is a method for
distributing data across multiple
machines.
• MongoDB uses sharding to support
deployments with very large data
sets and high throughput
operations.
Sharding Architecture
• Shard: a Mongo instance that handles a subset of the original data.
• Mongos: a query router to the shards.
• Config Server: a Mongo instance that stores metadata information and
configuration details of the cluster.
Sharding/Replication
• Replication keeps redundant copies of the data set on multiple data nodes
for high availability.
• Sharding splits the data set across multiple nodes and scales horizontally
when high throughput is required.
MongoDB Datatypes
MongoDB supports various data types to accommodate different kinds of data within its document-
oriented model. Here's an overview of the common data types in MongoDB:
• String: This data type represents UTF-8 encoded strings. Strings are used to store text data.
• Integer: MongoDB supports 32-bit and 64-bit integer values. These are used to store numerical data.
• Double: Double data type is used to store floating-point numbers.
• Boolean: Boolean data type represents boolean values, i.e., true or false.
• Date: Date data type stores date and time information. Dates are stored as milliseconds since the
Unix epoch (January 1, 1970).
• Array: Arrays are used to store lists of values or nested documents. Arrays can contain values of
different data types.
• Object: Objects or embedded documents are used to store nested data structures within a
document.
• ObjectId: ObjectId is a 12-byte identifier typically used as the primary key in MongoDB documents.
ObjectId values are generated based on a timestamp, machine identifier, process identifier, and a
random counter.
MongoDB Datatypes
• Binary Data: MongoDB supports various binary data types, such as Binary Data and
UUID. These are used to store binary data, such as images or files.
• Null: Null data type represents a null value.
• Regular Expression: MongoDB supports regular expressions for pattern matching
operations.
• Timestamp: Timestamp data type is used to store the timestamp of document
modifications. Timestamp values consist of a timestamp and an incrementing ordinal
for operations within the same timestamp.
• Decimal128: Decimal128 data type represents a 128-bit decimal floating-point number.
These data types provide flexibility in representing different kinds of data within
MongoDB documents. Additionally, MongoDB allows for nested structures and arrays,
enabling complex data modeling.
CRUD OPERATIONS IN MONGODB
CRUD operations in MongoDB refer to the basic operations that can be
performed on documents within collections. CRUD stands for Create,
Read, Update, and Delete. Here's an overview of each operation:
• Create (Insert):
• To create a new document in a collection, you use the ‘insertOne()’ or
‘insertMany()’ methods.
• ‘insertOne()’ inserts a single document into the collection.
• ‘insertMany()’ inserts multiple documents into the collection in a single
operation.
• Example:
• db.collection.insertOne({ name: "John", age: 30 });
CRUD OPERATIONS IN MONGODB
• Read (Query):
• To retrieve documents from a collection, you use the ‘find()’ method.
• ‘find()’ returns a cursor that you can iterate over to access the documents.
• You can specify query conditions to filter the documents returned.
• Example:
• db.collection.find({ age: { $gt: 25 } }); $gt greater than
• Update:
• To modify existing documents in a collection, you use the ‘updateOne()’ or ‘updateMany()’ methods.
• ‘updateOne()’ updates a single document that matches the specified filter.
• ‘updateMany()’ updates multiple documents that match the specified filter.
• Example:
• db.collection.updateOne(
{ name: "John" },
{ $set: { age: 35 } }
); $set  replaces the value of a field with the specified value
CRUD OPERATIONS IN MONGODB
• Delete:
• To remove documents from a collection, you use the ‘deleteOne()’ or
‘deleteMany()’ methods.
• ‘deleteOne()’ removes a single document that matches the specified filter.
• ‘deleteMany()’ removes multiple documents that match the specified filter.
• Example:
• db.collection.deleteOne({ name: "John" });
INDEXING IN MONGODB
• Indexing in MongoDB is a crucial aspect of
optimizing database performance, especially when
dealing with large datasets. It enhances query
performance by reducing the number of
documents MongoDB must scan to satisfy a
query. Here's a brief introduction to indexing in
MongoDB:

1. What is an Index? An index is a data structure
that improves the speed of data retrieval
operations on a database table at the cost of
additional space and increased maintenance time.
In MongoDB, indexes are similar to indexes in
other database systems. They store a small portion
of the collection's data in an easy-to-traverse form.
INDEXING IN MONGODB
2. Types of Indexes in MongoDB: MongoDB supports various types of
indexes to address different query patterns and optimization needs.
Some common types include:
a) Single Field Index: Indexes a single field in a collection.
b) Compound Index: Indexes multiple fields together. It's useful when queries
involve multiple fields.
c) Multikey Index: Indexes the content of arrays. MongoDB can create
indexes on arrays, and it creates separate index entries for each element of
the array.
d) Text Index: Optimized for searching text content.
e) Geospatial Index: Optimized for querying geospatial coordinates.
INDEXING IN MONGODB
3. Creating Indexes:
• Indexes can be created using the ‘createIndex()’ method in MongoDB. For
example:
• db.collection.createIndex({ field: 1 })
• Here, { field: 1 } indicates an ascending index on the field named "field".

4. Query Optimization with Indexes:
• Once indexes are created, MongoDB can efficiently use them to perform
query operations. When executing a query, MongoDB's query optimizer
analyzes the query and chooses the most efficient index to use, if any.
INDEXING IN MONGODB
5. Indexing Best Practices:
• Index Selectivity: Ensure that your indexes are selective enough to reduce the number of
documents scanned.
• Indexing Frequently Queried Fields: Index fields that are frequently used in queries.
• Avoid Overindexing: While indexes can improve read performance, they also consume disk space
and add overhead to write operations.
• Monitor Index Usage: Regularly monitor the usage of indexes and remove any that are not being
used or are redundant.

6. Index Management:
• MongoDB provides commands to manage indexes, including creating, dropping, and listing
indexes. You can also view index usage statistics to optimize index performance.
• Indexes play a vital role in optimizing MongoDB performance, especially in scenarios with large
datasets and complex query requirements. Understanding how to create and use indexes
effectively can significantly improve the efficiency of MongoDB databases.
CAPPED COLLECTIONS IN MONGODB
• Capped collections in MongoDB are special types of
collections that have a fixed size and follow a FIFO (First
In, First Out) storage mechanism. They are designed for
use cases where data needs to be stored in a circular buffer
fashion, such as logging or caching systems. Here's an
overview of capped collections in MongoDB:
1.Fixed Size: Capped collections have a predefined size
limit specified during their creation. Once this limit is
reached, MongoDB automatically removes older
documents to accommodate new ones, ensuring that the
collection never exceeds its size limit.
2.Insertion Order: Documents in a capped collection are
stored in the order in which they were inserted. When the
collection reaches its size limit and needs to make space
for new documents, MongoDB removes the oldest
documents first.
CAPPED COLLECTIONS IN MONGODB
3. Automatic Overwrite: In a capped collection, when the collection is
full and a new document is inserted, MongoDB automatically removes
the oldest document to make space for the new one. This behavior is
similar to a circular buffer.

4. Non-Resizable: Once created, the size of a capped collection cannot
be changed. You cannot add or remove indexes from a capped
collection, nor can you delete documents individually. You can only
drop and recreate the collection with a different size if needed.
CAPPED COLLECTIONS IN MONGODB
5. Usage Scenarios: Capped collections are suitable for scenarios where
you want to store a fixed amount of data for a specific use case. Common
use cases include:
1. Logging: Storing log messages where old logs can be automatically removed
when the collection reaches its size limit.
2. Event Streaming: Capturing real-time events with a limited retention period.
3. Cache Stores: Storing frequently accessed data in memory with a fixed size.
6. Creating Capped Collections: You can create a capped collection using
the ‘createCollection()’ method in MongoDB, specifying the ‘capped: true’
option and optionally providing a size limit in bytes. For example:
db.createCollection("logs", { capped: true, size: 1048576 })
// creates a capped collection with a size limit of 1MB
CAPPED COLLECTIONS IN MONGODB
7. Limitations:
• Capped collections do not support updates that increase the size of
documents.
• You cannot shard capped collections.
• Capped collections do not support the ‘$near’ operator for geospatial
queries.
• You cannot remove documents individually; documents are removed
automatically based on insertion order.
APACHE SPARK
• Apache Spark is an open source analytics engine used for big data workloads.
• It can handle both batches as well as real-time analytics and data processing workloads.
• It is designed to deliver the computational speed, scalability, and programmability required for big data—
specifically for streaming data, graph data, analytics, machine learning, large-scale data processing,
and artificial intelligence (AI) applications.
• It extends the MapReduce model to efficiently support more types of computations, including
interactive queries and stream processing.
• Spark provides native bindings for the Java, Scala, Python, and R programming languages. In addition, it
includes several libraries to support building applications for machine learning [MLlib], stream processing
[Spark Streaming], and graph processing [GraphX].
• Apache Spark consists of Spark Core and a set of libraries.
• Spark Core is responsible for providing distributed task transmission, scheduling, and I/O functionality.
The Spark Core engine uses the concept of a Resilient Distributed Dataset (RDD) as its basic data type.
• The RDD is designed so it will hide most of the computational complexity from its users.
• Spark is intelligent in the way it operates on data: data and partitions are aggregated across a server
cluster, where they can then be computed and either moved to a different data store or run through an
analytic model. You are not asked to specify the destination of the files or the computational resources
needed to store or retrieve them.
RESILIENT DISTRIBUTED DATASETS (RDDs)
• Resilient Distributed Datasets (RDD) is a fundamental data structure of Spark. It is an
immutable distributed collection of objects. Each dataset in RDD is divided into logical
partitions, which may be computed on different nodes of the cluster.
• RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes.
• Formally, an RDD is a read-only, partitioned collection of records. RDDs can be created through
deterministic operations on either data on stable storage or other RDDs. RDD is a fault-tolerant
collection of elements that can be operated on in parallel.
• There are two ways to create RDDs − parallelizing an existing collection in your driver program,
or referencing a dataset in an external storage system, such as a shared file system, HDFS,
HBase, or any data source offering a Hadoop Input Format.
• Spark makes use of the concept of RDD to achieve faster and more efficient MapReduce operations.
Data sharing is slow in MapReduce due to replication, serialization, and disk IO.
• The key idea of Spark is the Resilient Distributed Dataset (RDD), which supports in-memory
computation: it stores intermediate state as objects in memory across jobs, and those objects are
shareable between jobs. Data sharing in memory is 10 to 100 times faster than sharing over the
network or from disk.
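As a concrete illustration, here is a minimal Scala sketch of both RDD creation routes described above (assuming a local Spark setup; the HDFS path is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

object RDDCreationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RDDCreationExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // 1. Parallelizing an existing collection in the driver program
    val numbersRDD = sc.parallelize(Seq(1, 2, 3, 4, 5))

    // 2. Referencing a dataset in an external storage system (placeholder path)
    val linesRDD = sc.textFile("hdfs:///data/input.txt")

    // Transformations are lazy; actions such as reduce() and count() trigger execution
    println(numbersRDD.map(_ * 2).reduce(_ + _))
    println(linesRDD.count())

    sc.stop()
  }
}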
SPARK INSTALLATION
• Installing Apache Spark can vary depending
on your operating system and requirements.
Here's a general guide:
• Prerequisites:
• Java: Apache Spark requires Java 8+ to be
installed on your system. You can download and
install Java from the official Oracle website or
use OpenJDK.
• Scala (Optional): If you plan to use the Scala API,
you may need to install Scala. However, Spark
also supports Python and R.
• Steps to Install Apache Spark:
• Download Apache Spark:
• Visit the Apache Spark downloads page and choose
the latest stable release.
• Select the package type: pre-built for Hadoop
version or source code.
• Download the tarball or ZIP file.
SPARK INSTALLATION
• Extract Spark:
• Navigate to the directory where you downloaded Spark.
• Extract the contents of the downloaded file using ‘tar -zxvf spark-<version>.tgz’ (for tarball) or
unzip it (for ZIP).
• Configuration (Optional):
• You might want to customize Spark's configuration by editing the ‘conf/spark-env.sh’ or other
configuration files.
• Environment Variables (Optional):
• Add Spark's bin directory to the PATH environment variable.
• Set ‘SPARK_HOME’ environment variable to the location where Spark is extracted.
• Verify Installation:
• Open a terminal and type ‘spark-shell’ (for Scala) or ‘pyspark’ (for Python) to launch the Spark shell.
• If Spark is installed correctly, you should see the Spark shell starting up.
SPARK INSTALLATION
Additional Tips:
• Hadoop: If you're using Hadoop, make sure you have it installed and
properly configured. Spark requires Hadoop libraries to run.
• Cluster Setup: For setting up Spark on a cluster, you'll need to configure
‘conf/spark-defaults.conf’ and ‘conf/slaves’ files.
• Dependencies: Depending on your use case, you might need additional
dependencies like Hadoop, Hive, or HBase. Make sure to install them if
required.
• Documentation: Refer to the official Apache Spark documentation for
detailed installation instructions and configuration options specific to your
setup.
SPARK APPLICATION
• A Spark application typically refers to a program written using Apache Spark,
an open-source distributed computing system. Spark is commonly used for
processing and analyzing large-scale data sets across a cluster of computers.
• Here's a basic outline of what a Spark application might entail:
1.Setup: This involves configuring your Spark application, including setting up
dependencies, defining cluster resources (like memory and cores), and
initializing SparkContext or SparkSession, which are the entry points to
Spark functionality.
2.Data Ingestion: Spark applications often start by loading data from various
sources such as HDFS, S3, databases, or even local files into Spark RDDs
(Resilient Distributed Datasets) or DataFrames.
SPARK APPLICATION
3. Data Transformation and Analysis: Once the data is loaded, you can perform
various transformations and analyses on it using Spark's rich set of APIs. This might
involve filtering, aggregating, joining, or otherwise manipulating the data to extract
insights or perform computations.
4. Machine Learning or Advanced Analytics (Optional): If your application
involves machine learning or advanced analytics, you might leverage Spark's MLlib
library for machine learning tasks or other libraries for specialized analytics.
5. Output: Finally, you typically write the results of your analysis back to some
storage system or output destination, which could be HDFS, a database, or even
back to a local file system.
6. Cleanup: It's good practice to clean up any resources used by your Spark
application once it's done running, like stopping the SparkContext or SparkSession.
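To make this outline concrete, here is a minimal word-count sketch in Scala (the HDFS paths and the local master setting are illustrative placeholders, not part of any particular deployment):

import org.apache.spark.sql.SparkSession

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // 1. Setup: initialize the SparkSession, the entry point to Spark functionality
    val spark = SparkSession.builder()
      .appName("WordCountApp")
      .master("local[*]") // placeholder; on a cluster this is normally supplied at submission time
      .getOrCreate()

    // 2. Data ingestion: load a text file into an RDD (placeholder path)
    val lines = spark.sparkContext.textFile("hdfs:///data/input.txt")

    // 3. Transformation and analysis: classic word count
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // 4. Output: write the results back to storage (placeholder path)
    counts.saveAsTextFile("hdfs:///data/wordcount-output")

    // 5. Cleanup: stop the SparkSession and release resources
    spark.stop()
  }
}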
APPLICATIONS OF SPARK
1.Data Processing and ETL (Extract, Transform, Load):
• Batch Processing: This involves processing large volumes of data in batches. Spark is used for
tasks like data cleansing, aggregation, filtering, and transformation before loading it into a data
warehouse or database for further analysis.
• Real-time Processing: Spark Streaming enables the processing of real-time data streams. It is
commonly used for tasks like processing streaming data from IoT devices, social media feeds,
server logs, and financial transactions in real-time.
2.Machine Learning and Data Mining:
• Machine Learning Pipelines: Spark's MLlib (Machine Learning Library) provides scalable
machine learning algorithms and tools. Spark is used for tasks such as training and evaluating
machine learning models on large datasets, performing feature extraction, and hyperparameter
tuning.
• Recommendation Systems: Spark is used to build recommendation engines based on
collaborative filtering algorithms like Alternating Least Squares (ALS). These systems are
employed in e-commerce platforms, streaming services, and content recommendation engines.
APPLICATIONS OF SPARK
3. Graph Analytics:
• Graph Processing: Spark's GraphX library enables graph analytics at scale. It's used
for tasks like analyzing social networks, detecting communities, finding shortest paths,
and identifying influential nodes in graphs.
4. Interactive Data Analysis:
• SQL and DataFrames: Spark SQL allows querying structured and semi-structured
data using SQL queries or DataFrame API. It's used for interactive data exploration, ad-
hoc analysis, and running complex analytics queries on large datasets.
5. Data Integration and Ecosystem Interaction:
• Integration with Big Data Ecosystem: Spark integrates with various big data
technologies like Hadoop, HBase, Cassandra, and Kafka. It allows seamless data
integration, interoperability, and processing across different data sources and storage
systems.
APPLICATIONS OF SPARK
6. Data Visualization:
• Integration with Visualization Tools:
While Spark itself doesn't provide
visualization capabilities, it integrates with
visualization libraries and tools like Apache
Zeppelin, Jupyter Notebooks, and Tableau
for visualizing analyzed data and insights.
7. Data Pipelines and Workflows:
• Workflow Orchestration: Spark enables
building end-to-end data pipelines and
workflows for data ingestion, processing,
transformation, and storage. It supports
workflow orchestration tools like Apache
Airflow and Apache Oozie for automating
and managing data processing tasks.
SPARK JOB
• A Spark job typically consists of multiple stages of computation, each containing
multiple tasks that are executed in parallel across the cluster's worker nodes.
• A Spark job represents a complete unit of work submitted to the Spark cluster by
the driver program. It typically corresponds to a single action triggered by the
driver program on an RDD (Resilient Distributed Dataset) or DataFrame.
• Spark jobs enable scalable and distributed data processing across clusters of
machines, allowing users to perform complex analytics and computations on
large datasets efficiently.
• Characteristics:
• A job encompasses all the computation required to produce the final result of the action.
• It may consist of multiple stages, depending on the RDD lineage and transformations
applied.
• Spark jobs are submitted to the SparkContext or SparkSession for execution.
SPARK JOB
• Here's a breakdown of the components and workflow of a Spark job:
1.Driver Program: The user's application, written using Spark's API (e.g., Spark
SQL, Spark Streaming, MLlib), runs as the driver program. The driver program
defines the SparkContext, which serves as the entry point to the Spark cluster.
2.SparkContext (or SparkSession): The SparkContext (or SparkSession, in newer
versions of Spark) is responsible for coordinating the execution of Spark jobs on
the cluster. It communicates with the cluster manager (e.g., Standalone, YARN,
Mesos) to acquire resources and schedule tasks.
3.RDDs or DataFrames/Datasets: Spark jobs operate on distributed datasets
represented as Resilient Distributed Datasets (RDDs) or higher-level abstractions
like DataFrames or Datasets. These distributed collections are partitioned across
the cluster and are operated upon in parallel.
SPARK JOB
4. Transformations and Actions:
• Transformations: Spark provides various transformations
(e.g., map, filter, join) to transform RDDs/DataFrames into new
RDDs/DataFrames. Transformations are lazily evaluated,
meaning they're not executed immediately but create a lineage
of transformations.
• Actions: Actions trigger the execution of the Spark job by
evaluating the lineage of transformations and producing results.
Examples of actions include count, collect, save, and foreach.
5. DAG (Directed Acyclic Graph): Spark constructs a
Directed Acyclic Graph (DAG) of the job's stages during its
execution. Each stage represents a set of transformations that
can be executed together without shuffling data across the
cluster.
6. Job Submission: When the driver program invokes an
action on an RDD/DataFrame, Spark breaks the computation
into stages and submits them to the cluster for execution.
SPARK JOB
7. Task Execution:
• Each stage is divided into tasks, with each task operating on
a subset of the data partitions.
• Tasks are scheduled to run on executor nodes in the cluster.
Executors launch task processes to perform the actual
computation.
• Executors cache intermediate data in memory or disk to
optimize performance and fault tolerance.
8. Result Collection: Once all tasks have completed, the
driver program collects the results of the action and may
perform further processing or output generation.
9. Job Monitoring and Optimization:
• Spark provides monitoring tools (e.g., Spark UI) to track
the progress and performance of Spark jobs.
• Job performance can be optimized by tuning Spark
configurations, partitioning data appropriately, and
optimizing transformations and actions.
SPARK STAGES
• A stage in Spark represents a set of parallel tasks – one task per
partition of an RDD – that can be executed together as part of a
Spark job. A new stage is created when data needs to be
redistributed across partitions by a shuffle operation.
SPARK STAGES
• Types:
• RDD Lineage: Each stage is determined by the dependencies between RDDs in the
computation DAG (Directed Acyclic Graph).
• Narrow vs. Wide Stages:
• Narrow Stages: Tasks within a narrow stage can be executed without shuffling data across the network,
as they only depend on data within the same partition.
• Wide Stages (Shuffle Stages): Tasks within a wide stage require data exchange (shuffle) between
partitions, typically involving a data shuffle operation such as groupByKey or join.
• Characteristics:
• Stages are created based on the RDD lineage and the boundaries imposed by
transformations that require data shuffling.
• Each stage has a unique ID and may contain multiple tasks.
• Stages are executed sequentially, with each stage depending on the output of the preceding
stage.
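For example, in the following Scala sketch (assuming an existing SparkContext named sc, e.g. in spark-shell, and a placeholder input path), flatMap and map are narrow transformations that get pipelined into one stage, while reduceByKey forces a shuffle and therefore a second stage:

// Narrow transformations: each output partition depends on a single input partition,
// so flatMap and map are pipelined together into one stage.
val pairs = sc.textFile("hdfs:///data/input.txt") // placeholder path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

// Wide transformation: reduceByKey needs data from many partitions,
// so Spark inserts a shuffle and starts a new (shuffle) stage here.
val counts = pairs.reduceByKey(_ + _)

// The action triggers the job; the DAG scheduler splits it into two stages.
counts.collect().foreach(println)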
SPARK TASKS
• A task is the smallest unit of work in Spark, representing a single
computation that needs to be performed on a single partition of data.
• Characteristics:
• Partition-based: Tasks operate on data partitions of RDDs or DataFrames.
• Parallel Execution: Tasks within a stage can execute in parallel across the
available worker nodes, processing different partitions of data concurrently.
• Managed by Executors: Tasks are executed by executor processes running on
worker nodes within the Spark cluster.
• Dynamic Scheduling: Spark dynamically schedules tasks based on available
resources, data locality, and task dependencies.
• Fault Tolerance: If a task fails, Spark can recompute it using lineage
information stored in RDDs, ensuring fault tolerance.
RELATIONSHIP BETWEEN STAGES
AND TASKS
• Stage Creation: When an action is triggered on an RDD or DataFrame, Spark
constructs a directed acyclic graph (DAG) of stages representing the
computation to be executed.
• Stage Boundaries:
• Narrow transformations within a stage are typically fused together, resulting in fewer
stages.
• Wide transformations introduce stage boundaries due to data shuffling requirements,
creating separate stages for map-side and reduce-side tasks.
• Task Assignment:
• Each stage is divided into tasks, with each task processing a partition of data from the
input RDD or DataFrame.
• Spark scheduler assigns tasks to available executor nodes based on data locality and
resource availability.
ANATOMY OF A SPARK JOB RUN
The highest level of a Spark job run is composed of two entities:
🔵Driver: Hosts the application (SparkContext).
🔵Executors: Execute the application's tasks.

🟩Job Submission: A Spark job is submitted automatically when an action is performed on an
RDD. This internally calls runJob() on the SparkContext, which passes the call to the scheduler
that runs as a part of the driver.

The scheduler is made up of two components:
🔵DAG Scheduler: Breaks down the job into stages.
🔵Task Scheduler: Submits the tasks from each stage to the cluster.

🟩DAG Construction: A Spark job is split into multiple stages. Each stage runs a specific set of tasks.
ANATOMY OF A SPARK JOB RUN
• There are mainly two types of tasks: shuffle map tasks and result tasks.
• 🔵Shuffle map tasks: Each shuffle map task runs a computation on one RDD
partition and writes its output to a new set of partitions, which are fetched in a
later stage. Shuffle map tasks run in all stages except the final stage.
• 🔵Result tasks: These run in the final stage and return the result to the user's
program. Each result task runs the computation on its RDD partition, then
sends the result back to the driver, which assembles the results from all partitions
into a final result. Note that each task is given a placement
preference by the DAG scheduler to allow the task scheduler to take advantage of
data locality. Once the DAG scheduler completes construction of the DAG of
stages, it submits each stage's set of tasks to the task scheduler. Child stages are
only submitted once their parents have completed successfully.
ANATOMY OF A SPARK JOB RUN
• 🟩Task Scheduling: When the task scheduler receives the set of tasks, it uses its list of
executors that are running for the application and constructs a mapping of tasks to
executors on the basis of placement preference. For a given executor, the scheduler will
first assign process-local tasks, then node-local task, then rack-local task, before assigning
any arbitrary task or random task. Executors send status update to the driver when a task is
completed or has failed. In case of task failure, task scheduler resubmits the task on
another executor. It also launches speculative tasks for tasks that are running slowly if this
feature is enabled. Speculative tasks are duplicates of existing tasks, which the scheduler
may run as a backup if a task is running more slowly than expected.

• 🟩Task Execution: The executor makes sure that the JAR and file dependencies are up to date.
It keeps a local cache of dependencies from previous tasks. It deserializes the task
code (which consists of the user's functions) from the serialized bytes that were sent as
part of the launch-task message. Finally, the task code is executed. A task can return a result
to the driver. The result is serialized and sent to the executor backend, and finally to the driver
as a status update message.
ANATOMY OF A SPARK JOB RUN
• The anatomy of a Spark job run typically involves several components and stages:
1. Job Submission: This is the initial stage where a user submits a Spark job. The submission
process can be done through various methods like the Spark shell, Spark-submit script, REST
APIs, or interactive notebooks like Jupyter or Zeppelin.
2. Job Scheduler: Once the job is submitted, it enters a queue managed by the Spark job
scheduler. The scheduler determines when and where the job will be executed based on
resource availability and scheduling policies. Popular schedulers include FIFO, Fair, and
Capacity schedulers.
3. Task Generation: When the job is scheduled to run, Spark's driver program translates the job
into a directed acyclic graph (DAG) of stages. Each stage consists of tasks, which are the
smallest unit of work in Spark. Tasks are created based on the transformations and actions
specified in the Spark application code (e.g., RDD transformations, DataFrame operations).
4. Stage Execution: The DAG scheduler divides the job into stages of tasks based on the
dependencies between RDDs (Resilient Distributed Datasets) or DataFrames. These stages are
then submitted to the task scheduler, which runs the tasks on executor nodes in the Spark
cluster. Executors are JVM processes that manage task execution and data storage.
ANATOMY OF A SPARK JOB RUN
5. Data Processing: During stage execution, tasks are performed on partitions of the input
data. Spark processes data in parallel across the executor nodes, leveraging the distributed
nature of the cluster. Intermediate results are cached in memory or spilled to disk if necessary.
6. Shuffle: If there are shuffle operations (e.g., groupByKey, reduceByKey), Spark
redistributes data across the cluster to ensure that records with the same key end up on the
same machine. This involves a data exchange phase between executor nodes.
7. Result Aggregation: After all tasks have completed, Spark aggregates the results of
individual tasks to produce the final output. Depending on the action performed in the Spark
job (e.g., collect, save), the result may be returned to the driver program, stored in external
storage, or displayed to the user.
8. Job Completion: Once the job has finished executing all its tasks and producing the final
output, the Spark application terminates, and resources are released. The driver program cleans
up any remaining resources and exits.
SPARK ON YARN
• Spark on YARN refers to running Apache Spark applications on a Hadoop cluster managed by Yet
Another Resource Negotiator (YARN). YARN is the resource management layer in the Hadoop
ecosystem responsible for managing and allocating resources (CPU, memory) across applications
running on a Hadoop cluster.
• Here's how Spark integrates with YARN:
1. Resource Management: YARN provides resource management capabilities, allowing multiple
applications to run concurrently on the same Hadoop cluster without resource contention. Spark
leverages YARN to request and allocate resources (containers) for its executor nodes.
2. Job Execution: When a Spark application is submitted to run on YARN, it first communicates with
the ResourceManager to request resources for its driver and executor processes. The
ResourceManager allocates containers on various nodes in the cluster based on availability and
resource requirements.
3. Container Execution: Once resources are allocated, Spark launches its driver program in one of
the containers on the cluster. The driver program is responsible for orchestrating the execution of
the Spark application. Additionally, Spark launches executor processes in other containers to
perform the actual data processing tasks.
SPARK ON YARN
4. Task Execution: Within each executor, Spark executes tasks in parallel across the
available cores. Tasks process data partitions independently and communicate with each
other as needed, leveraging YARN for resource isolation and management.
5. Dynamic Resource Allocation: Spark on YARN supports dynamic resource allocation,
allowing it to adjust the number of executor containers based on workload demand. This
helps optimize resource utilization and improves cluster efficiency.
6. Fault Tolerance: YARN provides fault tolerance mechanisms to recover from failures
such as node failures or container crashes. Spark integrates with these mechanisms to
ensure that failed tasks are automatically re-executed on different nodes.
7. Integration with Hadoop Ecosystem: Running Spark on YARN allows seamless
integration with other Hadoop ecosystem components such as HDFS (Hadoop Distributed
File System), Hive, HBase, and others. This enables Spark applications to access data
stored in Hadoop and interact with other Hadoop-based services.
SCALA
• Scala is a powerful programming language that seamlessly combines
object-oriented and functional programming paradigms. It was
created by Martin Odersky, a professor at École Polytechnique
Fédérale de Lausanne (EPFL) in Switzerland and first released in 2003.
SCALA FEATURES
• Concise Syntax: Scala boasts a concise and expressive syntax that enables developers
to write clean and readable code. It draws inspiration from several programming
languages, including Java, Haskell, and Ruby.
• Object-Oriented: Scala is a fully object-oriented language, which means that every
value is an object. It supports classes, inheritance, and traits, allowing developers to
create robust and reusable code.
• Functional Programming: Scala also embraces functional programming principles.
Functions are first-class citizens, meaning they can be assigned to variables, passed as
arguments, and returned from other functions. Scala provides powerful higher-order
functions, pattern matching, and immutable data structures.
• Static Typing: Scala is statically typed, which means that the type of every expression
is known at compile-time. However, Scala's type inference system can often infer
types, reducing the need for explicit type annotations.
SCALA FEATURES
• Concurrency: Scala provides excellent support for concurrent and parallel
programming through features like actors and the Akka framework. These
features make it easier to write scalable and responsive applications.
• Interoperability: Scala runs on the Java Virtual Machine (JVM), which means it
seamlessly interoperates with Java. Scala code can call Java libraries and vice
versa, making it easy to leverage existing Java code and libraries.
• Tooling: Scala has a rich ecosystem of tools and libraries that facilitate
development. Popular build tools like sbt (Simple Build Tool) and dependency
management tools like Maven and Ivy are commonly used in Scala projects.
• Community and Adoption: Scala has a vibrant community and is widely adopted
in industry for a variety of applications, including web development, data analysis,
and distributed systems. It's used by companies like Twitter, LinkedIn, and Airbnb.
Popular libraries include Akka, Play Framework, Apache Spark, and ScalaTest.
SCALA FEATURES
• Type Inference: Scala has a sophisticated type inference system that can
often deduce the types of variables and expressions without explicit type
annotations. This reduces boilerplate code while still providing strong
static typing and compile-time safety.
• Immutable Collections: Scala offers a rich set of immutable collections
(e.g., List, Set, Map) that encourage functional programming practices
and make it easier to reason about code correctness and concurrency.
• Pattern Matching: Scala's pattern matching feature allows developers to
match values against patterns and destructure complex data structures. It
is often used in conjunction with case classes and sealed traits to write
expressive and type-safe code.
SCALA FEATURES
• Lazy Evaluation: Scala supports lazy evaluation, allowing developers
to define values that are computed only when they are accessed for
the first time. This can be useful for optimizing performance and
handling infinite data structures.
• DSLs (Domain-Specific Languages): Scala's flexible syntax and
powerful features make it well-suited for creating internal DSLs.
Developers can use Scala to define custom syntax and abstractions
that closely match the problem domain, leading to more expressive
and maintainable code.
CLASSES IN SCALA
• Definition: Classes define the properties (fields) and behaviors
(methods) of objects of a particular type.
CLASSES IN SCALA
• Fields:
• Fields can be mutable (var) or immutable (val).
• They can have default values, which are specified after the field's type.
• Constructors:
• Primary constructors are defined in the class declaration itself.
• Auxiliary constructors can be defined using ‘this’ keyword.
• Methods:
• Methods define the behaviors of the class.
• They can take parameters and return values.
• Instantiation:
• val obj = new ClassName(arguments)
OBJECTS IN SCALA
• Definition: Objects are single instances of their own definitions. They
are similar to singletons in other languages.
• Syntax:
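For example, a minimal object definition might look like this (a sketch with illustrative names):

object MathUtils {
  val Pi = 3.14159                          // a constant held by the singleton
  def square(x: Double): Double = x * x     // a utility method
}

// Members are accessed directly on the object; no instantiation is needed
println(MathUtils.square(MathUtils.Pi))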
OBJECTS IN SCALA
• Fields and Methods:
• Objects can contain fields and methods, just like classes.
• They cannot take parameters like constructors, as they cannot be instantiated.
• Usage:
• Objects are typically used to encapsulate utility methods, hold constants, or serve as
entry points to an application.
• Companion Objects:
• When an object shares the same name with a class, it's called a companion object.
• Companion objects can access private members of the class, and vice versa.
• Singleton Pattern:
• Objects are often used to implement the singleton pattern, ensuring that only one
instance of the object exists.
EXAMPLE OF A CLASS AND OBJECT IN SCALA
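A minimal sketch of a class together with its companion object (illustrative names):

class Person(val name: String, var age: Int) {
  // method defining behavior
  def greet(): String = s"Hello, my name is $name and I am $age years old."
}

object Person {
  // factory method in the companion object
  def apply(name: String): Person = new Person(name, 0)
}

object PersonDemo {
  def main(args: Array[String]): Unit = {
    val alice = new Person("Alice", 30) // instantiation with the primary constructor
    val bob   = Person("Bob")           // uses the companion object's apply method
    println(alice.greet())
    println(bob.greet())
  }
}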
TYPES IN SCALA
• Integers:
• Byte: 8-bit signed integer (-128 to 127)
• Short: 16-bit signed integer (-32,768 to 32,767)
• Int: 32-bit signed integer (-2,147,483,648 to 2,147,483,647)
• Long: 64-bit signed integer (-9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
• Floating-Point Numbers:
• Float: 32-bit floating-point number
• Double: 64-bit floating-point number
• Boolean: Boolean type representing true or false.
• Characters: Char type representing a single 16-bit Unicode character.
• Strings: String type representing a sequence of characters.
• Unit: The Unit type corresponds to the void type in Java and represents absence of value. It's similar to
void in C and Java.
• Null and Nothing: Null is a subtype of all reference types. Nothing is a subtype of all types and has no
instances.
OPERATORS IN SCALA
1.Arithmetic Operators:
• +, -, *, /, % (Addition, Subtraction, Multiplication, Division, Modulus)
2.Relational Operators:
• ==, !=, <, >, <=, >= (Equality, Inequality, Less Than, Greater Than, Less Than or Equal To, Greater Than or Equal To)
3.Logical Operators:
• &&, ||, ! (Logical AND, Logical OR, Logical NOT)
4.Assignment Operators:
• =, +=, -=, *=, /=, %=, etc. (Assignment, Compound Assignment)
5.Bitwise Operators:
• &, |, ^, <<, >>, >>>, ~ (Bitwise AND, Bitwise OR, Bitwise XOR, Left Shift, Right Shift, Unsigned Right Shift, Bitwise NOT)
6.String Concatenation:
• + operator can be used to concatenate strings.
7.Type Casting:
• asInstanceOf is used for explicit type casting.
EXAMPLE
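A brief sketch exercising several of these operators (as it might be entered in the Scala REPL):

val a = 10
val b = 3
println(s"sum=${a + b}, diff=${a - b}, prod=${a * b}, quot=${a / b}, mod=${a % b}")
println(a > b && b != 0)            // relational and logical operators: true
println((a & b, a | b, a << 1))     // bitwise operators: (2,11,20)
var total = a
total += b                          // compound assignment: total is now 13
println("Total: " + total)          // string concatenation with +
val n: Any = 42
println(n.asInstanceOf[Int] + 1)    // explicit type cast: 43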
BUILT IN CONTROL STRUCTURES OF
SCALA
• Scala provides several built-in control structures for managing the
flow of execution in a program. Here are some of the key ones:
1.if-else Statements: Scala supports traditional if-else conditional
statements for branching logic.
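For example (note that in Scala, if-else is an expression that returns a value):

val temperature = 35
if (temperature > 30) {
  println("It's hot outside")
} else if (temperature > 15) {
  println("Pleasant weather")
} else {
  println("It's cold")
}

// if-else is an expression, so it can yield a value
val label = if (temperature > 30) "hot" else "mild"
println(label)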
BUILT IN CONTROL STRUCTURES OF
SCALA
• Pattern Matching: Pattern matching is a powerful feature in Scala
that allows you to match a value against a pattern and execute code
based on the match.
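A small sketch of pattern matching on both literal values and structure:

def describe(x: Any): String = x match {
  case 0               => "zero"
  case i: Int if i > 0 => s"positive int $i"
  case s: String       => s"a string of length ${s.length}"
  case (a, b)          => s"a pair of $a and $b"
  case _               => "something else"
}

println(describe(0))          // zero
println(describe(42))         // positive int 42
println(describe("Scala"))    // a string of length 5
println(describe((1, "two"))) // a pair of 1 and two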
BUILT IN CONTROL STRUCTURES OF
SCALA
• For Loops: Scala supports both traditional for loops as well as for-
comprehensions.
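For example, a traditional for loop and a for-comprehension that yields a new collection:

// Simple for loop over a range
for (i <- 1 to 3) {
  println(s"iteration $i")
}

// For-comprehension with a guard and yield: builds a new collection
val evensDoubled = for {
  n <- 1 to 10
  if n % 2 == 0
} yield n * 2

println(evensDoubled) // Vector(4, 8, 12, 16, 20)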
BUILT IN CONTROL STRUCTURES OF
SCALA
• While and Do-While Loops: Scala also supports while and do-while
loops for iterative execution.
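For example:

// while loop: the condition is checked before each iteration
var count = 0
while (count < 3) {
  println(s"count = $count")
  count += 1
}

// do-while loop: the body executes at least once
var attempts = 0
do {
  println(s"attempt $attempts")
  attempts += 1
} while (attempts < 2)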
BUILT IN CONTROL STRUCTURES OF
SCALA
• Try-Catch-Finally: Scala provides try-catch-finally blocks for
exception handling.
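For example, catching an arithmetic error with pattern-matching catch clauses:

def safeDivide(a: Int, b: Int): Int =
  try {
    a / b
  } catch {
    case e: ArithmeticException =>
      println(s"Cannot divide: ${e.getMessage}")
      0
  } finally {
    println("division attempted")
  }

println(safeDivide(10, 2)) // 5
println(safeDivide(10, 0)) // prints the error message, returns 0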
BUILT IN CONTROL STRUCTURES OF
SCALA
• Option and Either: Scala encourages the use of Options and Eithers
for handling potentially absent or exceptional values.
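A brief sketch of Option and Either for absent or exceptional values (illustrative functions):

// Option: Some(value) or None instead of null
def findUser(id: Int): Option[String] =
  if (id == 1) Some("Alice") else None

println(findUser(1).getOrElse("unknown")) // Alice
println(findUser(2).getOrElse("unknown")) // unknown

// Either: Right(result) on success, Left(error) on failure
def parseAge(s: String): Either[String, Int] =
  try Right(s.toInt)
  catch { case _: NumberFormatException => Left(s"'$s' is not a number") }

println(parseAge("39"))  // Right(39)
println(parseAge("old")) // Left('old' is not a number)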
FUNCTIONS IN SCALA
• In Scala, functions are first-class citizens, meaning you can treat
functions like any other value. You can assign functions to variables,
pass them as arguments to other functions, and return them from
functions.
• Defining Functions:
• Functions can be defined using the ‘def’ keyword.
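For example:

// A method defined with def: parameters and return type are annotated
def add(x: Int, y: Int): Int = x + y

// A function value assigned to a variable (functions are first-class values)
val multiply: (Int, Int) => Int = (x, y) => x * y

println(add(2, 3))      // 5
println(multiply(2, 3)) // 6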
FUNCTIONS IN SCALA
• Anonymous Functions:
• Anonymous functions, also known as function literals or lambda
expressions, can be defined using the ‘=>’ syntax.

• Higher-order Functions:
• Higher-order functions are functions that take other functions as
parameters or return functions.
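For example, an anonymous function passed to a higher-order function:

// Anonymous function (function literal) using the => syntax
val square = (x: Int) => x * x

// Higher-order function: takes another function as a parameter
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

println(square(4))             // 16
println(applyTwice(square, 2)) // 16

// Higher-order functions from the standard library
println(List(1, 2, 3).map(_ * 10)) // List(10, 20, 30)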
CLOSURES IN SCALA
• A closure is a function which captures the variables from its lexical
scope. In Scala, closures are created when you use variables from the
enclosing scope inside a function literal.
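For example, a minimal sketch of such a closure:

def makeMultiplier(factor: Int): Int => Int = {
  // the returned function closes over 'factor' from the enclosing scope
  (x: Int) => x * factor
}

val triple = makeMultiplier(3)
println(triple(10)) // 30
println(triple(7))  // 21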

• In this example, the ‘makeMultiplier’ function returns another function
that captures the ‘factor’ variable from its enclosing scope.
CURRYING IN SCALA
• Scala supports currying, a technique in which a function with multiple
parameters is transformed into a sequence of functions, each with a single
parameter. This allows for partial application of functions.
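For example (a minimal sketch):

// A curried method: one parameter list per argument
def add(x: Int)(y: Int): Int = x + y

// Partially applying the first argument yields a function of the remaining one
val addTwo: Int => Int = add(2) _

println(addTwo(5)) // 7
println(add(3)(4)) // 7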

• Here, ‘add’ is a curried function, and ‘add(2)’ returns another function
that takes a single parameter.
INHERITANCE IN SCALA
• Extending Classes: Scala supports single inheritance, meaning a
subclass can only inherit from a single superclass. You can extend a
class using the ‘extends’ keyword.
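For example:

class Animal(val name: String) {
  def speak(): String = s"$name makes a sound"
}

// Dog inherits from Animal using the 'extends' keyword
class Dog(name: String) extends Animal(name)

val d = new Dog("Rex")
println(d.speak()) // Rex makes a sound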
INHERITANCE IN SCALA
• Overriding Methods: Subclasses can override methods from their
superclass using the ‘override’ keyword.
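For example:

class Animal {
  def speak(): String = "some generic sound"
}

class Cat extends Animal {
  // 'override' is mandatory when redefining a concrete superclass method
  override def speak(): String = "meow"
}

println(new Cat().speak()) // meow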
INHERITANCE IN SCALA
• Calling Superclass Methods: You can call methods from the
superclass using the ‘super’ keyword.
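For example (illustrative names):

class Logger {
  def log(msg: String): Unit = println(s"LOG: $msg")
}

class TimestampLogger extends Logger {
  override def log(msg: String): Unit = {
    // call the superclass implementation with 'super'
    super.log(s"${java.time.Instant.now()} $msg")
  }
}

new TimestampLogger().log("application started")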
INHERITANCE IN SCALA
• Abstract Classes: Scala allows you to define abstract classes using the
‘abstract’ keyword. Abstract classes cannot be instantiated and can
contain both abstract and concrete methods.
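For example:

abstract class Shape {
  def area: Double                                          // abstract method: no implementation
  def describe(): String = f"a shape with area $area%.2f"   // concrete method
}

class Circle(radius: Double) extends Shape {
  def area: Double = math.Pi * radius * radius
}

// val s = new Shape()  // would not compile: abstract classes cannot be instantiated
println(new Circle(2.0).describe())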
INHERITANCE IN SCALA
• Traits:
• Traits are similar to interfaces in Java but can also include method
implementations. A class can extend multiple traits.
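For example, a class mixing in two traits, one of which provides a concrete implementation:

trait Greeter {
  def greet(name: String): String = s"Hello, $name" // concrete implementation in a trait
}

trait Logging {
  def log(msg: String): Unit = println(s"[log] $msg")
}

// a class can extend (mix in) multiple traits
class Service extends Greeter with Logging

val svc = new Service
svc.log(svc.greet("Scala")) // [log] Hello, Scala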
INHERITANCE IN SCALA
• Mixins:
• Mixins are a way to reuse code in Scala by mixing traits into classes.
This allows for flexible composition of behavior.
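For example, a sketch of stackable trait mixins composed at instantiation time (illustrative names):

abstract class Notifier {
  def send(msg: String): Unit
}

class ConsoleNotifier extends Notifier {
  def send(msg: String): Unit = println(s"Sending: $msg")
}

trait UppercaseFilter extends Notifier {
  // stackable modification: transform the message, then pass it along
  abstract override def send(msg: String): Unit = super.send(msg.toUpperCase)
}

trait Timestamped extends Notifier {
  abstract override def send(msg: String): Unit =
    super.send(s"${java.time.LocalDate.now()} $msg")
}

// mix traits into a class at instantiation time; the mixin order determines the call chain
val notifier = new ConsoleNotifier with UppercaseFilter with Timestamped
notifier.send("deployment finished")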
