
BD IMP QUES 1

UNIT 1

2M

1.DEFINE BIGDATA AND 3 TYPES AND FEATURES

Big Data is a collection of datasets so large and complex that it becomes difficult to process using traditional DBMS tools.

Big Data has to deal with large and complex datasets that can be structured, semi-structured, or unstructured and will typically not fit into memory to be processed.

Structured: Data organised in a fixed, predefined format (e.g., rows and columns in an RDBMS).

Unstructured: Data with no predefined organisation (e.g., free text, images, audio, video).

Semi-Structured: Data with some organisational markers but no rigid schema (e.g., JSON, XML); it combines aspects of structured and unstructured data.

Features

➢ New platforms and tools
➢ Rapid environment
➢ Linear scale-out
➢ Real-time data analysis

2.GIVE THE 5V’S OF BIGDATA

Volume

Big Data is a vast ‘volume’ of data generated from many sources daily, such as machines, social
media platforms, etc.

Velocity

Velocity refers to the speed at which data is created and moves, often in real time. It covers the rate at which data arrives, how quickly it changes, and bursts of activity. A key requirement of Big Data systems is to deliver and process this fast-arriving data promptly.

Variety

Big Data can be structured, unstructured, or semi-structured, and it is collected from many different sources. In the past, data was collected mainly from databases and spreadsheets, but today it arrives in a wide array of forms: PDFs, emails, audio, social media posts, photos, videos, etc.

Veracity

Veracity means how reliable and trustworthy the data is. Because data arrives from many sources in varying quality, it must be filtered, cleaned, and validated so that it can be handled and managed efficiently and analysis is based on accurate information.

Value

Value is an essential characteristic of Big Data. Raw data by itself is not useful; the value comes from storing, processing, and analysing reliable data to extract meaningful insights.
3.WHAT ARE THE SOURCES OF BIGDATA AND HOW TO STORE AND PROCESS IT

Sources of Big Data

➢ Social Media – Posts, comments, videos.


➢ IoT & Sensors – Smart devices, industrial machines.
➢ Transactions – Banking, e-commerce, stock trading.
➢ Healthcare – Patient records, medical imaging.
➢ Scientific & Government Data – Research, census, weather.

Storage & Processing

➢ Storage – HDFS, NoSQL, Cloud, Data Lakes.


➢ Processing – Hadoop, Spark, Flink, Kafka, BigQuery.

4.DIFF BW BIGDATA AND CONVENTIONAL DB

5.WHAT IS YARN AND GIVE ISSUES AND CHALLENGES IN TRADITIONAL SYSTEMS

YARN-YET ANOTHER RESOURCE NEGOTIATOR

➢ Manages resources across clusters
➢ YARN's main work is job scheduling and resource allocation
➢ Components
1. Resource Manager
2. Node Manager
3. Application Master

ISSUES AND CHALLENGES IN TRADITIONAL SYSTEMS

➢ Lack of Skilled Professionals – Shortage of data scientists, analysts, and engineers.


➢ Rapid Data Growth – Requires strong infrastructure for processing and storage.
➢ Data Quality Issues – Inaccurate, unorganized data leads to poor insights.
➢ Compliance Risks – Ensuring legal and regulatory data protection is challenging.
➢ Integration Complexity – Difficult to unify data from multiple sources.
12M

1.EXPLAIN THE EVOLUTION OF BIGDATA 6M

Big Data has evolved over the years due to advancements in technology, the increasing amount of
data generation, and the need for efficient data management and analysis. The evolution can be
divided into the following stages:

1. Traditional Data Management (Pre-2000s)

➢ Data was mostly structured and stored in relational databases (RDBMS) like MySQL, Oracle,
and SQL Server.
➢ Data processing was limited to small-scale datasets using manual queries and batch
processing.
➢ Businesses relied on data warehouses for analytics, but scalability was a challenge.

2. The Rise of Big Data (2000s)

➢ The internet, e-commerce, and social media led to a massive explosion of data.
➢ Traditional databases struggled to handle large, unstructured, and real-time data.
➢ Google introduced MapReduce (2004), and Apache Hadoop (2006) was developed for
distributed storage and parallel processing.

3. The NoSQL & Cloud Era (2010s)

➢ NoSQL databases (MongoDB, Cassandra, HBase) emerged to handle semi-structured and


unstructured data.
➢ Cloud computing (AWS, Google Cloud, Azure) enabled cost-effective, scalable storage and
processing.
➢ Apache Spark provided faster in-memory processing for big data analytics.

4. AI & Real-Time Big Data (2020s-Present)

➢ The rise of AI, Machine Learning (ML), and IoT led to even larger and more complex data
streams.
➢ Real-time processing (Apache Kafka, Flink, Storm) became essential for quick decision-
making.
➢ Edge computing and 5G enabled faster data processing closer to the source.
➢ Data privacy and security became critical due to strict regulations like GDPR and CCPA.
EXPLAIN THE USECASES OF BIGDATA 6M

Big Data is transforming various industries by enabling better decision-making, automation, and
innovation.

1. Healthcare & Medical Research

➢ Predictive Analytics – AI-driven models analyze patient data to predict diseases (e.g., early
cancer detection).
➢ Genomics & Drug Discovery – Big Data helps in sequencing genomes and accelerating drug
research.
➢ Remote Patient Monitoring – IoT devices collect real-time health data for better patient
care.

2. Finance & Banking

➢ Fraud Detection – Banks use Big Data to analyze transaction patterns and detect fraud in
real time.
➢ Risk Management – Predicting credit risks and market fluctuations using machine learning
models.
➢ Algorithmic Trading – Automated stock trading based on real-time market trends.

3. Retail & E-Commerce

➢ Personalized Recommendations – Platforms like Amazon and Flipkart use Big Data to suggest
products based on user behavior.
➢ Inventory & Supply Chain Management – Predicting demand and optimizing logistics using
data analytics.
➢ Customer Sentiment Analysis – Analyzing social media and reviews to understand consumer
preferences.

4. Smart Cities & Transportation

➢ Traffic Management – Analyzing GPS and IoT data to optimize traffic flow and reduce
congestion.
➢ Public Transport Optimization – Big Data helps in scheduling and managing buses/trains
efficiently.
➢ Energy Management – Smart grids analyze energy consumption patterns for better
distribution.

5. Entertainment & Media

➢ Content Recommendation – Netflix, YouTube, and Spotify use Big Data to suggest movies,
videos, and music.
➢ Audience Analytics – Understanding user engagement to improve content production.
➢ Ad Targeting – Big Data helps in showing personalized ads to users.

6. Manufacturing & Industry 4.0

➢ Predictive Maintenance – Sensors in machines detect issues before they fail, reducing
downtime.
➢ Quality Control – AI-driven analysis of production data to ensure quality products.
➢ Supply Chain Optimization – Using data analytics to enhance manufacturing and logistics.
2.EXPLAIN THE HADOOP ECOSYSTEM IN DETAIL

Hadoop Ecosystem

Apache Hadoop is an open-source framework designed for efficient storage and processing of Big
Data. It enables handling large datasets that traditional RDBMS cannot process effectively.

Core Components of Hadoop Ecosystem

➢ HDFS (Hadoop Distributed File System)


➢ YARN (Yet Another Resource Negotiator)
➢ MapReduce
➢ Spark

Supporting Components

➢ PIG & HIVE


➢ HBase
➢ Mahout & Spark MLlib
➢ Solr & Lucene
➢ Zookeeper
➢ Oozie

1.Data Storage Layer

Stores massive amounts of structured, semi-structured, and unstructured data.

HDFS (Hadoop Distributed File System)

➢ A distributed file system that splits data into blocks and stores it across multiple nodes.
➢ Provides high fault tolerance and scalability.
➢ Works efficiently with large datasets that do not fit into a single machine.
HBase (Hadoop Database)

➢ A NoSQL database designed for real-time read/write access.


➢ Stores data in key-value format and runs on top of HDFS.
➢ Optimized for fast random access and large-scale data retrieval.

2.Data Processing Layer

Handles distributed processing and resource management.

MapReduce

➢ A programming model for parallel data processing.


➢ Divides a task into Map (data filtering & sorting) and Reduce (aggregating results) phases.
➢ Works efficiently for batch processing.

YARN (Yet Another Resource Negotiator)

➢ Manages system resources for distributed applications.


➢ Schedules tasks across multiple nodes dynamically.
➢ Improves resource utilization by allowing multiple applications to run simultaneously.

3.Data Access Layer

Provides various tools for querying, analysing, and managing data.

Hive

➢ SQL-like interface for querying structured data in Hadoop.


➢ Converts SQL queries into MapReduce jobs.
➢ Used for analytics and reporting.

Pig

➢ A high-level scripting language for complex data transformations.


➢ Uses a procedural approach instead of SQL-based queries.
➢ Easier to write and execute than MapReduce.

Mahout

➢ A machine learning library for clustering, classification, and recommendation systems.


➢ Works efficiently on large-scale datasets.
➢ Integrated with Spark for fast processing.

Avro

➢ A data serialization framework for Hadoop.


➢ Ensures efficient data exchange across different programming languages.
➢ Supports schema evolution (changes in data structure over time).

Sqoop

➢ A tool for transferring structured data between Hadoop and relational databases (RDBMS).
➢ Used for bulk imports and exports of data.
➢ Helps integrate traditional databases with Hadoop.
4.Data Management Layer

Monitors, schedules, and coordinates Hadoop workflows.

Oozie

➢ A workflow scheduler for managing and automating Hadoop jobs.


➢ Supports dependency management between tasks.
➢ Useful for executing complex workflows in a structured manner.

Chukwa

➢ A log collection and monitoring system for Hadoop clusters.


➢ Aggregates and analyses performance and system logs.
➢ Helps in debugging and performance tuning.

Flume

➢ A tool for ingesting and collecting streaming data (logs, events, sensor data, etc.).
➢ Works well for real-time analytics.
➢ Can transfer data to HDFS, HBase, or other storage systems.

ZooKeeper

➢ Manages distributed system coordination.


➢ Provides features like leader election, configuration management, and synchronization.
➢ Ensures consistency in Hadoop clusters.

Advantages of Hadoop Ecosystem

➢ Scalability
➢ Cost-Effective
➢ Fault Tolerance

Disadvantages of Hadoop Ecosystem

➢ Complex Setup & Management


➢ High Hardware Requirements
➢ Slow for Small Data
3.EXPLAIN THE HADOOP ARCHITECTURE IN DETAIL

Hadoop is a framework written in Java that utilizes a large cluster of commodity hardware to store and maintain very large volumes of data. Hadoop works on the MapReduce programming algorithm that was introduced by Google.

Today many big-brand companies use Hadoop in their organizations to deal with big data, e.g., Facebook, Yahoo, Netflix, eBay, etc.

The Hadoop Architecture Mainly consists of 4 components.

➢ MapReduce
➢ HDFS (Hadoop Distributed File System)
➢ YARN (Yet Another Resource Negotiator)
➢ Common Utilities or Hadoop Common

1.MapReduce

MapReduce is a programming model (running on top of the YARN framework) that performs distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast. When dealing with Big Data, serial processing is no longer practical.

In the first phase the Map function is applied, and in the next phase the Reduce function is applied.


Input (Big Data) – Large datasets are stored in HDFS and split into smaller chunks for parallel
processing.

Map Phase – Data is processed in parallel by Map () functions, generating intermediate key-value
pairs.

Shuffle & Sort – Key-value pairs are grouped, shuffled, and sorted for processing.

Reduce Phase – Reduce () functions aggregate and process data to generate final results.

Output – The final processed data is stored in HDFS or other storage systems.
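
These phases can be illustrated with a short, purely local Python simulation (an illustrative sketch only, not actual Hadoop code; the sample records are made up):

from collections import defaultdict

# Input split: raw records, e.g. sales amounts per region (made-up data)
lines = ["east 5", "west 3", "east 2", "north 7", "west 1"]

def map_fn(line):
    # Map phase: parse a record and emit an intermediate key-value pair
    region, amount = line.split()
    return (region, int(amount))

mapped = [map_fn(line) for line in lines]

# Shuffle & Sort: group all values that share the same key
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate each group into the final result
output = {key: sum(values) for key, values in sorted(grouped.items())}
print(output)   # {'east': 7, 'north': 7, 'west': 4}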

Key Benefits of MapReduce

➢ Parallel Processing
➢ Scalability
➢ Fault Tolerance

2.HDFS

HDFS is the storage layer of the Hadoop ecosystem, designed for fault-tolerance, high availability,
and scalability on commodity hardware.

Key Components of HDFS

NameNode (Master)

➢ Manages metadata (file names, sizes, locations).


➢ Keeps track of all DataNodes and directs operations (create, delete, replicate).

DataNode (Slave)

➢ Stores actual data blocks in a distributed manner.


➢ Can scale up from 1 to 500+ nodes to increase storage capacity.

HDFS Operations

➢ Create – Adds new files to HDFS.


➢ Delete – Removes files or directories.
➢ Replicate – Duplicates data across multiple nodes for fault tolerance.
➢ Read/Write – Clients interact with HDFS via MapReduce, Spark, Hive, etc.

➢ File Block Division – HDFS splits files into 128 MB or 256 MB blocks.
➢ Replication – Default replication factor = 3, storing each block on three nodes for fault tolerance.
➢ Fault Tolerance – Ensures data availability even if nodes fail.
➢ Rack Awareness – Spreads replicas across racks to reduce network congestion and improve fault tolerance.
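
As a small worked example of block division and replication (the 500 MB file size is an assumed figure for illustration):

import math

file_size_mb = 500        # assumed example file size
block_size_mb = 128       # default HDFS block size
replication_factor = 3    # default HDFS replication factor

num_blocks = math.ceil(file_size_mb / block_size_mb)    # 4 blocks (3 full blocks + one 116 MB block)
total_block_replicas = num_blocks * replication_factor  # 12 block replicas spread across DataNodes
raw_storage_mb = file_size_mb * replication_factor      # 1500 MB of raw cluster storage consumed

print(num_blocks, total_block_replicas, raw_storage_mb) # 4 12 1500
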
3.YARN (Yet Another Resource Negotiator)

YARN is the framework on which MapReduce runs. YARN performs two operations: job scheduling and resource management. The purpose of the job scheduler is to divide a big task into smaller jobs so that each job can be assigned to various slaves in the Hadoop cluster and processing can be maximized.

The job scheduler also keeps track of which job is important, which job has higher priority, dependencies between jobs, and other information such as job timing. The resource manager manages all the resources made available for running the Hadoop cluster.

Features of YARN

Multi-Tenancy

Scalability

Cluster-Utilization

Compatibility

4.Hadoop Common (Common Utilities)

Hadoop Common, or the common utilities, is the collection of Java libraries and files required by all the other components present in a Hadoop cluster.

These utilities are used by HDFS, YARN, and MapReduce to run the cluster. Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so it needs to be handled automatically in software by the Hadoop framework.

Advantages of Hadoop

➢ Scalability
➢ Fault Tolerance
➢ Cost-Effective

Disadvantages of Hadoop

➢ Complexity
➢ High Latency
➢ Security Concerns
4.GIVE THE DIFFERENT HDFS OPERATION AND COMMANDS

HDFS OPERATIONS

HDFS (Hadoop Distributed File System) supports various operations for managing large datasets
efficiently. Key operations include:

Starting HDFS:

Format the NameNode: $ hadoop namenode -format

Start HDFS cluster: $ start-dfs.sh

Listing Files in HDFS:

Check directory or file status: $ hadoop fs -ls <path>

Inserting Data into HDFS:

Create directory: $ hadoop fs -mkdir /user/input

Upload file: $ hadoop fs -put /home/file.txt /user/input

Verify file: $ hadoop fs -ls /user/input

Retrieving Data from HDFS:

View file content: $ hadoop fs -cat /user/output/outfile

Download file: $ hadoop fs -get /user/output/ /home/hadoop_tp/

Shutting Down HDFS:

Stop the HDFS cluster: $ stop-dfs.sh


HDFS COMMANDS

HDFS (Hadoop Distributed File System) commands help in managing files, directories, and data
within the system.

ls – Lists files and directories in HDFS.

hdfs dfs -ls /

mkdir – Creates a new directory in HDFS.

hdfs dfs -mkdir /user/username

touchz – Creates an empty file in HDFS.

hdfs dfs -touchz /geeks/myfile.txt

copyFromLocal / put – Copies a file from local to HDFS.

hdfs dfs -copyFromLocal AI.txt /geeks

hdfs dfs -put AI.txt /geeks

cat – Prints the content of an HDFS file.

hdfs dfs -cat /geeks/AI.txt

copyToLocal / get – Copies a file from HDFS to the local system.

hdfs dfs -copyToLocal /geeks/myfile.txt ../Desktop/hero

hdfs dfs -get /geeks/myfile.txt ../Desktop/hero

moveFromLocal – Moves a file from local to HDFS.

hdfs dfs -moveFromLocal cutAndPaste.txt /geeks

cp – Copies a file within HDFS.

hdfs dfs -cp /geeks /geeks_copied

mv – Moves a file within HDFS.

hdfs dfs -mv /geeks/myfile.txt /geeks_copied

rmr – Deletes a file or directory recursively.

hdfs dfs -rmr /geeks_copied


UNIT 2

2M

1.WHY ARE MAP REDUCE FUNCTIONS USED IN BIGDATA

MapReduce is used in Big Data to process and analyse massive datasets efficiently across distributed
systems. It enables parallel computation, making it ideal for handling petabytes of data.

Map (): Processes data and generates intermediate key-value pairs.

Reduce (): Aggregates and processes the mapped data to generate the final output.

Key Reasons for Using MapReduce

➢ Parallel Processing
➢ Scalability
➢ Fault Tolerance

2.DEFINE DATA READ AND DATA WRITE IN HDFS

1.Data Write in HDFS

➢ Client Request
➢ Block Allocation
➢ Pipeline Setup
➢ Replication
➢ Acknowledgment
➢ Completion

2.Data Read in HDFS

➢ Client Request
➢ Block Location
➢ Nearest DataNode Selection
➢ Sequential Read
➢ Reconstruction
➢ Completion
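
From the client's point of view, these read and write steps can be sketched with the third-party Python hdfs package (a WebHDFS client); the NameNode URL, user, and file path below are assumptions for illustration:

from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (host, port and user assumed)
client = InsecureClient('http://namenode-host:9870', user='hadoop')

# Data write: the client asks the NameNode for block locations, then the
# data is streamed through a pipeline of DataNodes and replicated
with client.write('/user/hadoop/example.txt', encoding='utf-8') as writer:
    writer.write('hello hdfs\n')

# Data read: the client gets block locations from the NameNode and reads
# the blocks sequentially from the nearest available DataNodes
with client.read('/user/hadoop/example.txt', encoding='utf-8') as reader:
    print(reader.read())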

3.WHAT IS SPARK

Apache Spark is an open-source, distributed processing system used for big data analytics. It utilizes in-memory caching and optimized query execution for fast analytic queries against data of any size.

It provides development APIs in Java, Scala, Python and R, and supports code reuse across multiple workloads—batch processing, interactive queries, real-time analytics, machine learning, and graph processing.

Features of Spark:

➢ Swift Processing
➢ Reusability
➢ Real-Time Stream Processing
➢ Cost Efficient
4.WHY IS APACHE SPARK STREAMING USED

Apache Spark Streaming is a scalable, fault-tolerant real-time data processing system that extends
the core Spark API. It supports both batch and streaming workloads and processes real-time data
from sources like Kafka, Flume, and Amazon Kinesis. The processed data can be stored in file
systems, databases, or live dashboards.

Advantages of Spark Streaming

➢ Fast failure recovery


➢ Optimized resource usage
➢ Unified batch and streaming model
➢ Native integration

DStream

➢ A Discretized Stream (DStream) represents a continuous stream of data divided into small
batches.
➢ DStreams are built on RDDs (Resilient Distributed Datasets), enabling seamless integration
with Spark MLlib, Spark SQL, and other components.
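
A minimal DStream sketch in PySpark (assuming a local Spark installation and a text source on localhost port 9999, e.g. started with nc -lk 9999) could look like this:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StreamingWordCount")
ssc = StreamingContext(sc, 5)                      # 5-second micro-batches

# DStream: a continuous stream received as a series of small RDD batches
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                    # print each batch's word counts

ssc.start()
ssc.awaitTermination()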

5.WHAT IS APACHE MAHOUT

Apache Mahout is an open-source project for building scalable machine learning algorithms. The name "Mahout" means an elephant rider, reflecting its close association with Apache Hadoop, whose logo is an elephant.

ML Techniques in Mahout

➢ Recommendation – Used in recommender systems (e.g., personalized suggestions).


➢ Classification – Categorizing data into predefined groups.
➢ Clustering – Grouping similar data points together.

How Mahout Works

➢ Mahout follows the "bring the math to the data" approach, reducing data movement for
better performance.
➢ It runs on Hadoop (MapReduce) and also supports Apache Spark and Apache Flink for
distributed processing.

Features of Apache Mahout

➢ Fast & Efficient


➢ Clustering Algorithms
➢ Matrix & Vector Libraries

Applications of Mahout

➢ Used by: Adobe, Facebook, LinkedIn, Twitter, Yahoo, Foursquare.


➢ Foursquare – Recommender engine for places, food, and entertainment.
➢ Twitter – User interest modelling.
➢ Yahoo! – Pattern mining.
12M

1.EXPLAIN THE MAP REDUCE FRAMEWORK 4M

Hadoop MapReduce

Hadoop MapReduce is a parallel processing framework for handling large-scale data across
distributed clusters in a fault-tolerant manner. It processes multi-terabyte datasets efficiently by
dividing tasks into Map and Reduce stages.

How MapReduce Works

➢ Job Submission – The client submits a job to the MapReduce Master.


➢ Job Splitting – The Master divides the job into smaller job-parts.
➢ Map Phase – Each Map Task processes a portion of the input and produces key-value pairs.
➢ Shuffling & Sorting – Intermediate key-value pairs are grouped and sorted.
➢ Reduce Phase – The Reducer processes grouped data and generates the final output.
➢ Storage – The final output is stored in HDFS (Hadoop Distributed File System).

MapReduce Architecture Components

➢ Client – Submits jobs to the MapReduce Master.


➢ Job – The actual task that needs processing.
➢ MapReduce Master – Splits jobs into job-parts.
➢ Job Parts – Small independent tasks that contribute to the final output.
➢ Input Data – The raw data processed by MapReduce.
➢ Output Data – The final processed result stored in HDFS.

Key Benefits of MapReduce

➢ Parallel processing for large-scale data analysis.


➢ Fault tolerance via automatic task re-execution.
➢ Efficient scheduling using data locality (processing data where it resides).
➢ Support for multiple languages (Java, Python, Ruby, C++).
SOLVE WORD COUNT 4M

The MapReduce Word Count program calculates the frequency of each word in a text file stored in
HDFS.

Steps to Execute Word Count in Hadoop

Step 1: Prepare the Dataset

Download and extract the dataset as a text file (word_count.txt).

Step 2: Create a Directory in HDFS

hdfs dfs -mkdir /input_wordCount

Step 3: Upload the Text File to HDFS

hdfs dfs -put /local/path/word_count.txt /input_wordCount

Step 4: Verify the File in HDFS

hdfs dfs -ls /input_wordCount

1)Input Splits

➢ The input dataset is divided into smaller chunks called input splits.
➢ Each split is assigned to an individual Map task.

2)Mapping

➢ Each split is passed through the Mapper function.


➢ The Mapper extracts words and assigns an initial count of 1 to each word.

3)Shuffling

➢ The Shuffle & Sort phase groups similar words together.


➢ The goal is to combine occurrences of the same word.

4)Reducing

➢ The Reducer function takes the shuffled data and aggregates values.
➢ It sums up the occurrences of each word.
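
A minimal Python implementation of these steps for Hadoop Streaming might look like the sketch below; the script names and the streaming jar path in the run command are assumptions and may differ by Hadoop version:

#!/usr/bin/env python3
# mapper.py - Mapping: emit (word, 1) for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py - Reducing: sum the counts for each word
# (Hadoop's shuffle & sort delivers the pairs grouped and sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.strip().split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The job could then be submitted with Hadoop Streaming, for example:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
  -files mapper.py,reducer.py \
  -mapper "python3 mapper.py" -reducer "python3 reducer.py" \
  -input /input_wordCount -output /output_wordCount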
SOLVE MATRIX MULTIPLICATION 4M

Matrix multiplication can be efficiently implemented using MapReduce by distributing the computation across multiple nodes in a Hadoop cluster.

Consider the following matrices: A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]].

➢ Here matrix A is a 2×2 matrix which means the number of rows(i)=2 and the number of
columns(j)=2
➢ Matrix B is also a 2×2 matrix where number of rows(j)=2 and number of columns(k)=2.
➢ Each cell of the matrix is labelled as Aij and Bij.
➢ Now One step matrix multiplication has 1 mapper and 1 reducer.
➢ The Formula is:
➢ Mapper for Matrix A (k, v)=((i, k), (A, j, Aij)) for all k
➢ Mapper for Matrix B (k, v)=((i, k), (B, j, Bjk)) for all i

# Here all are 2, therefore when k=1, i can have 2 values 1 & 2, each case can have 2 further values
of j=1 and j=2.

Substituting all values

k=1

i=1 j=1 ((1, 1), (A, 1, 1))

j=2 ((1, 1), (A, 2, 2))

i=2 j=1 ((2, 1), (A, 1, 3))

j=2 ((2, 1), (A, 2, 4))

k=2

i=1 j=1 ((1, 2), (A, 1, 1))

j=2 ((1, 2), (A, 2, 2))

i=2 j=1 ((2, 2), (A, 1, 3))

j=2 ((2, 2), (A, 2, 4))

Mapper for Matrix B

i=1 j=1 k=1 ((1, 1), (B, 1, 5))

k=2 ((1, 2), (B, 1, 6))

j=2 k=1 ((1, 1), (B, 2, 7))

k=2 ((1, 2), (B, 2, 8))


i=2 j=1 k=1 ((2, 1), (B, 1, 5))

k=2 ((2, 2), (B, 1, 6))

j=2 k=1 ((2, 1), (B, 2, 7))

k=2 ((2, 2), (B, 2, 8))

The formula for the Reducer is:

Reducer(key = (i, k), values) => make a sorted Alist and Blist for the key (i, k), compute the summation of (Aij * Bjk) over j, and output ((i, k), sum).

From the Mapper computation we see that 4 keys are common: (1, 1), (1, 2), (2, 1) and (2, 2).

For each key, make separate lists for Matrix A and Matrix B using the adjoining values taken from the Mapper step above:

(1, 1) =>Alist ={(A, 1, 1), (A, 2, 2)}

Blist ={(B, 1, 5), (B, 2, 7)}

Now Aij x Bjk: [(1*5) + (2*7)] =19 -------(i)

(1, 2) =>Alist ={(A, 1, 1), (A, 2, 2)}

Blist ={(B, 1, 6), (B, 2, 8)}

Now Aij x Bjk: [(1*6) + (2*8)] =22 -------(ii)

(2, 1) =>Alist ={(A, 1, 3), (A, 2, 4)}

Blist ={(B, 1, 5), (B, 2, 7)}

Now Aij x Bjk: [(3*5) + (4*7)] =43 -------(iii)

(2, 2) =>Alist ={(A, 1, 3), (A, 2, 4)}

Blist ={(B, 1, 6), (B, 2, 8)}

Now Aij x Bjk: [(3*6) + (4*8)] =50 -------(iv)

From (i), (ii), (iii) and (iv) we conclude that

((1, 1), 19)

((1, 2), 22)

((2, 1), 43)

((2, 2), 50)

Final Matrix is: C = A × B = [[19, 22], [43, 50]]
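
The same one-step MapReduce matrix multiplication can be verified with a small, purely local Python simulation of the Mapper and Reducer logic (note it uses 0-based indices, whereas the worked example above uses 1-based):

from collections import defaultdict

A = [[1, 2], [3, 4]]     # Aij
B = [[5, 6], [7, 8]]     # Bjk
I, J, K = 2, 2, 2

# Mapper: emit ((i, k), (matrix, j, value)) for every element
intermediate = defaultdict(list)
for i in range(I):
    for j in range(J):
        for k in range(K):                      # Mapper for A: for all k
            intermediate[(i, k)].append(("A", j, A[i][j]))
for j in range(J):
    for k in range(K):
        for i in range(I):                      # Mapper for B: for all i
            intermediate[(i, k)].append(("B", j, B[j][k]))

# Reducer: for each key (i, k), sum Aij * Bjk over j
result = {}
for (i, k), values in intermediate.items():
    a_list = {j: v for m, j, v in values if m == "A"}
    b_list = {j: v for m, j, v in values if m == "B"}
    result[(i, k)] = sum(a_list[j] * b_list[j] for j in range(J))

print(result)   # {(0, 0): 19, (0, 1): 22, (1, 0): 43, (1, 1): 50}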


2.EXPLAIN THE YARN ARCHITECTURE IN DETAIL

YARN stands for "Yet Another Resource Negotiator". It was introduced in Hadoop 2.0 to remove the bottleneck on the Job Tracker which was present in Hadoop 1.0.

YARN Architecture Components

Client

➢ Submits MapReduce or other jobs to YARN.


➢ Requests resources from the Resource Manager.

Resource Manager (RM)

➢ Master daemon of YARN.


➢ Allocates cluster resources to applications.
➢ Has two key components:
1. Scheduler:
➢ Assigns resources but doesn’t track or restart failed tasks.
➢ Supports plugins like Capacity Scheduler and Fair Scheduler.
2. Application Manager:
➢ Accepts applications.
➢ Negotiates first Container for an application.
➢ Restarts the Application Master if a failure occurs.

Node Manager (NM)

➢ Manages individual nodes in the Hadoop cluster.


➢ Registers with the Resource Manager and sends heartbeats (health status).
➢ Monitors resource usage and manages logs.
➢ Starts and kills containers based on Resource Manager instructions.

Application Master (AM)

➢ Handles a single application (job).


➢ Negotiates resources with Resource Manager.
➢ Requests containers from Node Manager using a Container Launch Context (CLC).
➢ Monitors application progress and sends health reports to RM.
Container

➢ A logical unit of resources (CPU, RAM, Disk) on a node.


➢ Created when an Application Master requests it.
➢ Contains dependencies, environment variables, and security tokens via CLC.

How YARN Works (Job Execution Flow)

1. The client submits an application

2. The Resource Manager allocates a container to start the Application Master

3. The Application Master registers itself with the Resource Manager

4. The Application Master negotiates containers from the Resource Manager

5. The Application Master notifies the Node Manager to launch the containers

6. The application code is executed in the containers

7. The client contacts the Resource Manager / Application Master to monitor the application's status

8. Once processing is complete, the Application Master un-registers with the Resource Manager

YARN Features

➢ Scalability
➢ Multi-tenancy

Advantages

➢ Flexibility
➢ Scalability
➢ Improved Performance

Disadvantages

➢ Complexity
➢ Overhead
➢ Limited Support
3.EXPLAIN THE SPARK ARCHITECTURE 8M

Apache Spark follows a master-slave architecture.

I. Components of Spark Architecture

a) Driver Program (Master)

➢ The driver program is the central control unit of Spark.


➢ It runs the main () function of the application and creates a SparkContext.
➢ Responsible for task scheduling, job coordination, and result aggregation.
➢ Converts user-defined transformations into a DAG and optimizes execution.

b) Cluster Manager

➢ Responsible for allocating resources (CPU, memory) across Spark applications.


➢ Spark can work with different cluster managers:
1. Standalone Mode (default Spark cluster manager)
2. YARN (Hadoop) (common in big data environments)
3. Apache Mesos (supports multiple frameworks)
4. Kubernetes (containerized workloads)

c) Executors (Worker Nodes)

➢ Executors are distributed agents responsible for executing tasks on worker nodes.
➢ Every Spark application gets its own executors, which:
1. Execute tasks assigned by the driver.
2. Store intermediate computation results in memory (for caching).
3. Write output to HDFS, databases, or other storage systems.

II. Spark Execution Flow

➢ Step 1: Submitting a Spark Job


➢ Step 2: Job Breakdown
➢ Step 3: Task Execution
➢ Step 4: Data Processing
➢ Step 5: Completion & Clean-up
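
This flow can be seen in a minimal PySpark application (a sketch assuming a local Spark installation):

from pyspark.sql import SparkSession

# Driver program: creates the SparkSession/SparkContext and builds the DAG
spark = (SparkSession.builder
         .appName("MinimalSparkJob")
         .master("local[*]")
         .getOrCreate())
sc = spark.sparkContext

# Transformations are lazy: they only define the lineage/DAG
rdd = sc.parallelize(range(1, 1001), 4)        # a distributed dataset in 4 partitions
squares = rdd.map(lambda x: x * x)             # narrow transformation, no shuffle

# Action: triggers job submission; tasks run in parallel on the executors
total = squares.reduce(lambda a, b: a + b)
print(total)                                   # sum of squares of 1..1000

# Completion & clean-up: release executors and cluster resources
spark.stop()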
III. Key Abstractions in Spark

a) Resilient Distributed Dataset (RDD)

➢ RDDs are the fundamental data structure in Spark.


➢ They allow fault tolerance, parallel processing, and in-memory computation.

b) Directed Acyclic Graph (DAG)

➢ A DAG represents the logical execution plan of a Spark job.


➢ It is an optimized execution graph that ensures efficient execution of transformations.

c) Tasks & Stages

➢ A job consists of multiple stages (based on shuffle operations).


➢ Each stage is divided into tasks, which are executed in parallel across executors.

IV. Advantages of Spark Architecture

➢ Fast Processing
➢ Fault Tolerance
➢ Scalability
➢ Multiple Language Support

GIVE FEW MACHINE LEARNING ALGORITHMS IN SPARK 4M

MLlib is the machine learning (ML) library of Apache Spark. It helps make practical machine learning scalable and easy.

1.Basic statistics

➢ Provides descriptive statistics such as mean, variance, correlation, and standard deviation.
➢ Example: Finding correlation between product sales and advertisement spending.

2.Classification and Regression

➢ Classification: Assigns labels to input data (Supervised Learning).


➢ Example: Spam detection in emails using Logistic Regression.
➢ Regression: Predicts continuous values.
➢ Example: House price prediction using Linear Regression.

3.Clustering

➢ Groups data points into clusters (Unsupervised Learning).


➢ Example: Customer segmentation using K-Means clustering.

4.Collaborative filtering

➢ Used for recommendation systems.


➢ Example: Movie or product recommendation based on user preferences using ALS
(Alternating Least Squares).
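
As a small illustration, here is a sketch of K-Means clustering (customer segmentation) with PySpark MLlib; the column names and the tiny dataset are assumptions made up for the example:

from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("MLlibKMeansExample").getOrCreate()

# Tiny made-up dataset: (annual_spend, visits_per_month) per customer
df = spark.createDataFrame(
    [(100.0, 2.0), (110.0, 3.0), (900.0, 20.0), (950.0, 22.0)],
    ["annual_spend", "visits_per_month"])

# MLlib expects the input columns assembled into a single feature vector
features = VectorAssembler(
    inputCols=["annual_spend", "visits_per_month"],
    outputCol="features").transform(df)

# K-Means: group similar customers into k=2 clusters (unsupervised learning)
model = KMeans(k=2, seed=1, featuresCol="features").fit(features)
model.transform(features).select("annual_spend", "prediction").show()

spark.stop()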
