
BIG DATA ASSIGNMENT NOTES

ASSIGNMENT 3
1. How Does MapReduce Work in Hadoop?

MapReduce is a programming model used in Hadoop for processing large data sets in a distributed manner.

How It Works:

🟩 Step 1: Input Splitting

 Large files are split into chunks (blocks).

 Each chunk is assigned to a Map task.

🟩 Step 2: Mapping

 Each Mapper processes a data block and produces key-value pairs.

 Example: Processing logs → (IP address, 1)

🟩 Step 3: Shuffling and Sorting

 Hadoop groups all values by key across all Mappers.

 Intermediate data is sorted and sent to Reducers.

🟩 Step 4: Reducing

 Reducers process each group of key-value pairs to produce the final output.

 Example: Summing counts per IP → (IP address, total visits)

🟩 Step 5: Output

 The final output is written to HDFS.

Example Use Case: Word count, log analysis, clickstream processing.
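To see the model end to end, here is a minimal, single-machine Python sketch of the map, shuffle/sort, and reduce phases for word count. It only illustrates the programming model; the sample input lines are made up, and real Hadoop distributes these phases across a cluster.

from collections import defaultdict

# Minimal single-machine sketch of the MapReduce phases (not Hadoop itself).

def mapper(line):
    # Map phase: emit (word, 1) for every word in the input split
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Reduce phase: sum all counts for one key
    return key, sum(values)

lines = ["big data is big", "hadoop processes big data"]

# Shuffle and sort phase: group intermediate values by key
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

# Reduce phase produces the final (word, total) pairs
result = dict(reducer(k, v) for k, v in sorted(groups.items()))
print(result)  # {'big': 3, 'data': 2, 'hadoop': 1, 'is': 1, 'processes': 1}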

2. Difference Between HDFS and Traditional File Systems


Feature | HDFS (Hadoop Distributed File System) | Traditional File System (e.g., NTFS, ext4)
Architecture | Distributed across multiple nodes | Centralized or single-machine
Fault Tolerance | Built-in data replication | Manual or external backup required
Scalability | Scales horizontally (add more nodes) | Limited to hardware
Data Size Handling | Optimized for large files (GBs to TBs) | Not ideal for massive files
Write Support | Write-once, read-many | Supports frequent read-write operations
Data Locality | Computation moves to data | Data moves to computation
Block Size | Large (default 128 MB or 256 MB) | Smaller (typically 4 KB – 64 KB)

3. How Does Spark Compare to Hadoop for Big Data Processing?

Feature | Apache Spark | Hadoop (MapReduce)
Processing | In-memory | Disk-based
Speed | Up to 100x faster for some workloads | Slower due to frequent disk I/O
Ease of Use | High-level APIs (Python, Scala, Java, R) | Requires Java-based MapReduce code
Real-time Support | Yes (Spark Streaming) | No (batch only)
Machine Learning | Built-in MLlib | Limited support (needs external tools)
Fault Tolerance | DAG lineage and RDDs | Through task re-execution and replication
Data Processing Modes | Batch, Streaming, Interactive, Graph | Only batch processing

Summary:

 Use Hadoop MapReduce for batch jobs on extremely large datasets.

 Use Apache Spark for faster, in-memory, interactive, or real-time data processing.

UNIT 4
1. What is NoSQL, and How is it Used in Big Data Storage?

✅ Definition:

NoSQL (Not Only SQL) databases are non-relational databases designed to handle large volumes of unstructured, semi-structured, or structured data with high performance and scalability.

✅ Types of NoSQL Databases:

Type | Description | Examples
Document-based | Stores data as JSON-like documents | MongoDB, CouchDB
Key-Value | Stores key-value pairs for fast lookups | Redis, Amazon DynamoDB
Column-based | Stores data in columns instead of rows | Apache Cassandra, HBase
Graph-based | Optimized for relationships/networks | Neo4j, Amazon Neptune

✅ Use in Big Data:

 Handles high volume, velocity, and variety of data.

 Scales horizontally across distributed clusters.

 Useful in real-time analytics, IoT, recommendation systems, and social media platforms (see the example below).
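As a small illustration of document-based storage, here is a minimal sketch using MongoDB through the pymongo driver. The connection string, database name, and collection name are placeholders, not values from these notes.

from pymongo import MongoClient

# Placeholder connection; assumes a MongoDB instance running on localhost
client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["click_events"]

# Documents in the same collection can have different shapes (no fixed schema)
events.insert_one({"user": "u42", "page": "/home", "device": "mobile"})
events.insert_one({"user": "u43", "page": "/cart", "items": 3})

# Query by field, similar to filtering on a column
for doc in events.find({"page": "/home"}):
    print(doc)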
2. How Do You Handle Data Quality Issues in Big Data Sets?

Big data often contains noise, duplication, or missing values. Here's how you
can manage quality issues:

✅ Steps to Handle Data Quality:

Issue Type | Handling Techniques
Missing Data | Imputation (mean/median), data interpolation, deletion
Duplicate Records | Use hashing or unique IDs to remove duplicates
Inconsistent Formats | Standardize units (e.g., date formats, case normalization)
Outliers/Noise | Use statistical or ML techniques to detect and handle them
Incorrect Data | Cross-validation against reference datasets or rules

✅ Tools Commonly Used:

 Apache Spark, Talend, Trifacta, OpenRefine, Pandas (in Python); a small pandas example follows below
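The following is a minimal pandas sketch of a few of these techniques; the column names and sample values are invented purely for illustration.

import pandas as pd

# Illustrative sample with a duplicate row, missing values, and inconsistent case
df = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", None, "10.0.0.3"],
    "country": ["IN", "IN", "us", "US", "US"],
    "visits": [5, 5, None, 3, 5000],
})

# Duplicate records: drop exact duplicates
df = df.drop_duplicates()

# Missing data: impute numeric fields with the median, drop rows missing the key
df["visits"] = df["visits"].fillna(df["visits"].median())
df = df.dropna(subset=["ip"])

# Inconsistent formats: normalize case
df["country"] = df["country"].str.upper()

# Outliers/noise: keep values within 3 standard deviations of the mean
mean, std = df["visits"].mean(), df["visits"].std()
df = df[(df["visits"] - mean).abs() <= 3 * std]

print(df)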

3. Techniques for Data Preprocessing in Big Data

Data preprocessing prepares raw data for analytics or machine learning models.

✅ Common Techniques:

Technique | Purpose
Data Cleaning | Fix/remove incorrect, incomplete, or inconsistent data
Data Transformation | Normalize, scale, and encode data for algorithms
Data Integration | Combine data from multiple sources (ETL processes)
Data Reduction | Dimensionality reduction (e.g., PCA), sampling, aggregation
Data Discretization | Convert continuous data into categories or intervals
Tokenization & Parsing | For text data: splitting sentences into words
Streaming Preprocessing | Real-time data transformation using tools like Kafka and Spark

✅ Big Data Tools for Preprocessing:

 Apache Spark (with PySpark or Scala); see the sketch below

 Apache NiFi

 Hadoop MapReduce

 ETL pipelines (Airflow, Talend)
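The sketch below shows a few of these techniques in PySpark. The input file (events.csv) and the column name amount are placeholders; a real pipeline would adapt them to the actual schema.

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.appName("Preprocess").getOrCreate()

# Placeholder input; inferSchema lets Spark guess the column types
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Data cleaning: remove duplicates and rows missing the key numeric field
df = df.dropDuplicates().dropna(subset=["amount"])

# Data discretization: bucket the continuous column into categories
df = df.withColumn(
    "amount_band",
    F.when(F.col("amount") < 100, "low")
     .when(F.col("amount") < 1000, "medium")
     .otherwise("high"),
)

# Data transformation: scale the numeric feature for ML algorithms
assembler = VectorAssembler(inputCols=["amount"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
features = assembler.transform(df)
scaled = scaler.fit(features).transform(features)

scaled.show(5)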

UNIT 5
✅ 1. How Do You Implement Data Governance in a Big Data Environment?

Data governance ensures that data is accurate, secure, consistent, and used responsibly.

📌 Key Components of Data Governance in Big Data:

Component | Description
Data Catalog | Centralized metadata store (e.g., Apache Atlas, Alation)
Data Lineage | Tracks data flow from source to destination (e.g., OpenLineage, Talend)
Access Control | Role-Based Access Control (RBAC), policies for who can access what
Data Quality Rules | Define valid values, types, ranges, null handling
Data Stewardship | Assign responsible roles for maintaining data integrity
Policy Management | Compliance with GDPR, HIPAA, etc.

🔧 Tools for Data Governance:

 Apache Atlas (metadata management)

 Apache Ranger (fine-grained access control)

 Collibra, Informatica, AWS Glue Data Catalog

✅ 2. Common Big Data Security Threats & Mitigation Strategies

⚠️Common Security Threats:

Threat | Description | Mitigation Strategies
Data Breaches | Unauthorized access to sensitive data | Encryption (at rest/in transit), access controls
Unauthorized Access | Lack of strict access policies | Use Kerberos, LDAP, or OAuth authentication
Data Leakage in Pipelines | Leakage during processing or transfers | Secure APIs, TLS/SSL, audit trails
Malicious Code Injection | Attacks via open-source or shared scripts | Code scanning, sandboxing jobs
Lack of Audit Trails | No monitoring of data usage | Use logging systems like Apache Ranger, audit tools

🔐 Key Security Techniques:

 Kerberos: Secure authentication in Hadoop/Spark

 Apache Ranger: Role-based policies and audit logs

 Tokenization & Encryption: Protects PII data (see the sketch below)

 Network Layer Security: VPN, firewalls, VPCs
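As a small illustration of the encryption point, here is a hedged Python sketch that encrypts a PII field before a record is written to storage, using the cryptography package. The library choice, key handling, and record shape are assumptions for illustration only; in production the key would come from a key management service.

from cryptography.fernet import Fernet

# Illustration only: in practice the key comes from a KMS, not generated inline
key = Fernet.generate_key()
cipher = Fernet(key)

record = {"user_id": "u42", "email": "user@example.com"}

# Encrypt the PII field before persisting the record (encryption at rest)
record["email"] = cipher.encrypt(record["email"].encode())

# Decrypt only when an authorized consumer needs the plaintext
plaintext_email = cipher.decrypt(record["email"]).decode()
print(plaintext_email)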


✅ 3. How Do You Scale Big Data Processing for Real-Time Analytics?

Real-time analytics requires fast ingestion, low-latency processing, and a scalable architecture.

⚙️Architecture for Real-Time Analytics:


[Data Sources]
        ↓
[Ingestion Layer] — Kafka / Flume / Kinesis
        ↓
[Processing Layer] — Apache Spark Streaming / Flink / Storm
        ↓
[Storage Layer] — Cassandra / HBase / Elasticsearch
        ↓
[Visualization Layer] — Grafana / Kibana / Tableau

🧠 Key Techniques:

Technique | Purpose
Stream Processing | Real-time data computation (Spark Streaming, Flink)
Micro-Batching | Efficient processing in small time windows
Autoscaling Infrastructure | Dynamic resource allocation in the cloud (K8s, EMR)
Event-Driven Architecture | Process events instantly via Kafka or Pulsar
In-Memory Computing | Fast processing using RAM (Spark, Ignite)

🛠 Example Tools:

 Kafka + Spark Structured Streaming for low-latency pipelines (sketched below)

 AWS Kinesis + Lambda for serverless real-time processing

 Apache Flink for advanced stream processing with stateful operators
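To make the first tool combination concrete, here is a minimal sketch of a Kafka to Spark Structured Streaming pipeline. The broker address and topic name are placeholders, and running it also requires the spark-sql-kafka connector package, which is an assumption not stated in the notes.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("RealTimeCounts").getOrCreate()

# Ingestion layer: subscribe to a Kafka topic (placeholder broker and topic)
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Kafka delivers key/value as bytes; keep the message value and event timestamp
clicks = events.selectExpr("CAST(value AS STRING) AS page", "timestamp")

# Processing layer: micro-batch counts per page over 1-minute windows
counts = clicks.groupBy(window(col("timestamp"), "1 minute"), col("page")).count()

# Sink: print to the console here; a real pipeline would write to Cassandra, etc.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()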

🚀 What is Apache Spark?

Apache Spark is an open-source, distributed computing framework designed for fast processing of large-scale data. It supports batch processing, streaming, machine learning, and SQL-based analytics — all in one platform.

Feature | Description
In-Memory Computing | Keeps intermediate data in memory for faster processing than Hadoop MapReduce
Unified Engine | Supports SQL, MLlib (machine learning), GraphX (graph processing), and Spark Streaming
Language Support | APIs available in Python, Scala, Java, and R
Distributed Computing | Splits tasks across a cluster for parallel execution
Fault Tolerant | Automatically handles failures using RDD lineage

🔄 Spark vs Hadoop (MapReduce)

Feature | Apache Spark | Hadoop MapReduce
Speed | Faster (in-memory) | Slower (disk-based)
Ease of Use | Rich APIs in Python/Scala/Java | Low-level Java APIs
Data Processing | Batch + Streaming | Batch only
Machine Learning | Built-in MLlib | Needs integration with external tools

🔥 Core Components of Apache Spark

1. Spark Core – The execution engine (RDDs, memory management, fault tolerance)

2. Spark SQL – Query structured data using SQL or DataFrames

3. Spark Streaming – Real-time data processing from sources like Kafka

4. MLlib – Machine learning library (classification, clustering, etc.)

5. GraphX – Graph processing (e.g., PageRank, graph traversal)

💡 Example: PySpark Code for Word Count

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Load the text file as an RDD of lines
rdd = spark.sparkContext.textFile("sample.txt")

# Word count logic: split lines into words, pair each word with 1, sum per word
counts = (
    rdd.flatMap(lambda line: line.split())
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
)

print(counts.collect())

✅ Common Use Cases

 Real-time analytics (e.g., fraud detection, log monitoring)

 ETL (Extract, Transform, Load) pipelines

 Recommendation engines

 Social media data analysis

 IoT stream processing


🚀 Apache Spark Architecture: Overview

Apache Spark follows a master-slave architecture with the following key components:

🧱 Core Components of Apache Spark:

Component | Role
Driver Program | Controls the application, manages the SparkContext, and coordinates tasks
Cluster Manager | Allocates resources across Spark applications (e.g., YARN, Mesos, Kubernetes)
Executors | Run tasks and return results to the driver
Tasks | Individual units of work sent to executors

🔧 Detailed Components of Apache Spark

1. Spark Core (Foundation of everything)

 Manages memory, fault tolerance, and job scheduling.

 Provides the RDD (Resilient Distributed Dataset) abstraction for distributed data.
2. Spark SQL

 Allows querying structured and semi-structured data using SQL, DataFrames, and Datasets.

 Can read from Hive, Parquet, JSON, JDBC, etc.

3. Spark Streaming

 Enables real-time data processing.

 Processes live data streams using micro-batching.

4. MLlib (Machine Learning Library)

 Built-in library for scalable machine learning tasks:

  ◦ Classification, Regression, Clustering, Recommendation

5. GraphX

 API for graph processing (e.g., social networks, recommendation graphs).

 Includes graph algorithms like PageRank and connected components.

⚙️How Apache Spark Works (Step-by-Step Execution)

Let’s understand with an example:

💼 Suppose: You want to count words in a large text file using Spark.

🔁 Spark Job Workflow:

1. Driver Program Starts

   ◦ It creates a SparkContext (the entry point to the Spark cluster).

2. Cluster Manager Allocates Resources

   ◦ The driver asks for executors on cluster nodes.

3. RDD/DataFrame Created

   ◦ Data is loaded into an RDD (e.g., from a text file).

4. Transformations Applied

   ◦ Operations like .map(), .filter(), and .flatMap() define a DAG (Directed Acyclic Graph).

5. Actions Trigger Execution

   ◦ An action like .collect() or .saveAsTextFile() starts the actual processing (see the sketch after this list).

6. Task Scheduling

   ◦ Spark breaks the DAG into stages and tasks.

7. Tasks Sent to Executors

   ◦ Executors perform computations in parallel.

8. Results Returned

   ◦ Executors return the results to the driver or write them to storage.
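A tiny PySpark sketch of steps 4 and 5 above: the transformations only record lineage, and the action is what actually submits the job to the executors. The values used are invented for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyDemo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))

# Transformations: nothing runs yet, Spark only builds the DAG/lineage
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Action: the DAG is split into stages and tasks, executors run them in parallel
print(squares.collect())  # [0, 4, 16, 36, 64]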

🖼 Spark Architecture Diagram (Text-Based)


+------------------+
|  Driver Program  |   ← Controls the job
+------------------+
         |
         v
+------------------+          +------------------+
|  Cluster Manager |  ←────→  |   Executors (n)  |
+------------------+          +------------------+
         |                             |
         v                             v
  Distribute Tasks           Process Data, Store Cache

🧠 Summary
Component | Responsibility
Driver | Main controller, builds the job, sends tasks to workers
Executor | Workers that run tasks and store data
Cluster Manager | Manages resources and task scheduling
RDD/DataFrame | Data abstraction used for processing
