
BIG DATA ASSIGNMENT NOTES

ASSIGNMENT 3
1. How Does MapReduce Work in Hadoop?

MapReduce is a programming model used in Hadoop for processing large data sets in a distributed manner.

How It Works:

🟩 Step 1: Input Splitting

 Large files are split into chunks (blocks).

 Each chunk is assigned to a Map task.

🟩 Step 2: Mapping

 Each Mapper processes a data block and produces key-value pairs.

 Example: Processing logs → (IP address, 1)

🟩 Step 3: Shuffling and Sorting

 Hadoop groups all values by key across all Mappers.

 Intermediate data is sorted and sent to Reducers.

🟩 Step 4: Reducing

 Reducers process each group of key-value pairs to produce the final output.

 Example: Summing counts per IP → (IP address, total visits)

🟩 Step 5: Output

 The final output is written to HDFS.

Example Use Case: Word count, log analysis, clickstream processing.
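To see the model end to end, here is a minimal, single-machine Python sketch of the map, shuffle/sort, and reduce phases for word count. It only illustrates the programming model; the sample input lines are made up, and real Hadoop distributes these phases across a cluster.

from collections import defaultdict

# Minimal single-machine sketch of the MapReduce phases (not Hadoop itself).

def mapper(line):
    # Map phase: emit (word, 1) for every word in the input split
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    # Reduce phase: sum all counts for one key
    return key, sum(values)

lines = ["big data is big", "hadoop processes big data"]

# Shuffle and sort phase: group intermediate values by key
groups = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        groups[key].append(value)

# Reduce phase produces the final (word, total) pairs
result = dict(reducer(k, v) for k, v in sorted(groups.items()))
print(result)  # {'big': 3, 'data': 2, 'hadoop': 1, 'is': 1, 'processes': 1}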

2. Difference Between HDFS and Traditional File Systems


Feature | HDFS (Hadoop Distributed File System) | Traditional File System (e.g., NTFS, ext4)
Architecture | Distributed across multiple nodes | Centralized or single-machine
Fault Tolerance | Built-in data replication | Manual or external backup required
Scalability | Scales horizontally (add more nodes) | Limited to hardware
Data Size Handling | Optimized for large files (GBs to TBs) | Not ideal for massive files
Write Support | Write-once, read-many | Supports frequent read-write operations
Data Locality | Computation moves to data | Data moves to computation
Block Size | Large (default 128 MB or 256 MB) | Smaller (typically 4 KB – 64 KB)

3. How Does Spark Compare to Hadoop for Big Data Processing?

Feature | Apache Spark | Hadoop (MapReduce)
Processing | In-memory | Disk-based
Speed | Up to 100x faster for some workloads | Slower due to frequent disk I/O
Ease of Use | High-level APIs (Python, Scala, Java, R) | Requires Java-based MapReduce code
Real-time Support | Yes (Spark Streaming) | No (batch only)
Machine Learning | Built-in MLlib | Limited support (needs external tools)
Fault Tolerance | DAG lineage and RDDs | Through task re-execution and replication
Data Processing Modes | Batch, Streaming, Interactive, Graph | Only batch processing

Summary:

 Use Hadoop MapReduce for batch jobs on extremely large datasets.

 Use Apache Spark for faster, in-memory, interactive, or real-time data processing.

UNIT 4
1. What is NoSQL, and How is it Used in Big Data Storage?

✅ Definition:

NoSQL (Not Only SQL) databases are non-relational databases designed to handle large volumes of unstructured, semi-structured, or structured data with high performance and scalability.

✅ Types of NoSQL Databases:

Type | Description | Examples
Document-based | Stores data as JSON-like documents | MongoDB, CouchDB
Key-Value | Stores key-value pairs for fast lookups | Redis, Amazon DynamoDB
Column-based | Stores data in columns instead of rows | Apache Cassandra, HBase
Graph-based | Optimized for relationships/networks | Neo4j, Amazon Neptune

✅ Use in Big Data:

 Handles high volume, velocity, and variety of data.

 Scales horizontally across distributed clusters.

 Useful in real-time analytics, IoT, recommendation systems, and social media platforms (see the example below).
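As a small illustration of document-based storage, here is a minimal sketch using MongoDB through the pymongo driver. The connection string, database name, and collection name are placeholders, not values from these notes.

from pymongo import MongoClient

# Placeholder connection; assumes a MongoDB instance running on localhost
client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["click_events"]

# Documents in the same collection can have different shapes (no fixed schema)
events.insert_one({"user": "u42", "page": "/home", "device": "mobile"})
events.insert_one({"user": "u43", "page": "/cart", "items": 3})

# Query by field, similar to filtering on a column
for doc in events.find({"page": "/home"}):
    print(doc)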
2. How Do You Handle Data Quality Issues in Big Data Sets?

Big data often contains noise, duplication, or missing values. Here's how you
can manage quality issues:

✅ Steps to Handle Data Quality:

Issue Type | Handling Techniques
Missing Data | Imputation (mean/median), data interpolation, deletion
Duplicate Records | Use hashing or unique IDs to remove duplicates
Inconsistent Formats | Standardize units (e.g., date formats, case normalization)
Outliers/Noise | Use statistical or ML techniques to detect and handle them
Incorrect Data | Cross-validation against reference datasets or rules

✅ Tools Commonly Used:

 Apache Spark, Talend, Trifacta, OpenRefine, Pandas (in Python); a small pandas example follows below
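The following is a minimal pandas sketch of a few of these techniques; the column names and sample values are invented purely for illustration.

import pandas as pd

# Illustrative sample with a duplicate row, missing values, and inconsistent case
df = pd.DataFrame({
    "ip": ["10.0.0.1", "10.0.0.1", "10.0.0.2", None, "10.0.0.3"],
    "country": ["IN", "IN", "us", "US", "US"],
    "visits": [5, 5, None, 3, 5000],
})

# Duplicate records: drop exact duplicates
df = df.drop_duplicates()

# Missing data: impute numeric fields with the median, drop rows missing the key
df["visits"] = df["visits"].fillna(df["visits"].median())
df = df.dropna(subset=["ip"])

# Inconsistent formats: normalize case
df["country"] = df["country"].str.upper()

# Outliers/noise: keep values within 3 standard deviations of the mean
mean, std = df["visits"].mean(), df["visits"].std()
df = df[(df["visits"] - mean).abs() <= 3 * std]

print(df)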

3. Techniques for Data Preprocessing in Big Data

Data preprocessing prepares raw data for analytics or machine learning models.

✅ Common Techniques:

Technique | Purpose
Data Cleaning | Fix/remove incorrect, incomplete, or inconsistent data
Data Transformation | Normalize, scale, and encode data for algorithms
Data Integration | Combine data from multiple sources (ETL processes)
Data Reduction | Dimensionality reduction (e.g., PCA), sampling, aggregation
Data Discretization | Convert continuous data into categories or intervals
Tokenization & Parsing | For text data: splitting sentences into words
Streaming Preprocessing | Real-time data transformation using tools like Kafka and Spark

✅ Big Data Tools for Preprocessing:

 Apache Spark (with PySpark or Scala); see the sketch below

 Apache NiFi

 Hadoop MapReduce

 ETL pipelines (Airflow, Talend)
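The sketch below shows a few of these techniques in PySpark. The input file (events.csv) and the column name amount are placeholders; a real pipeline would adapt them to the actual schema.

from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.appName("Preprocess").getOrCreate()

# Placeholder input; inferSchema lets Spark guess the column types
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# Data cleaning: remove duplicates and rows missing the key numeric field
df = df.dropDuplicates().dropna(subset=["amount"])

# Data discretization: bucket the continuous column into categories
df = df.withColumn(
    "amount_band",
    F.when(F.col("amount") < 100, "low")
     .when(F.col("amount") < 1000, "medium")
     .otherwise("high"),
)

# Data transformation: scale the numeric feature for ML algorithms
assembler = VectorAssembler(inputCols=["amount"], outputCol="features")
scaler = StandardScaler(inputCol="features", outputCol="scaled_features")
features = assembler.transform(df)
scaled = scaler.fit(features).transform(features)

scaled.show(5)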

UNIT 5
✅ 1. How Do You Implement Data Governance in a Big Data Environment?

Data governance ensures that data is accurate, secure, consistent, and used responsibly.

📌 Key Components of Data Governance in Big Data:

Component | Description
Data Catalog | Centralized metadata store (e.g., Apache Atlas, Alation)
Data Lineage | Tracks data flow from source to destination (e.g., OpenLineage, Talend)
Access Control | Role-Based Access Control (RBAC), policies for who can access what
Data Quality Rules | Define valid values, types, ranges, null handling
Data Stewardship | Assign responsible roles for maintaining data integrity
Policy Management | Compliance with GDPR, HIPAA, etc.

🔧 Tools for Data Governance:

 Apache Atlas (metadata management)

 Apache Ranger (fine-grained access control)

 Collibra, Informatica, AWS Glue Data Catalog

✅ 2. Common Big Data Security Threats & Mitigation Strategies

⚠️Common Security Threats:

Threat | Description | Mitigation Strategies
Data Breaches | Unauthorized access to sensitive data | Encryption (at rest/in transit), access controls
Unauthorized Access | Lack of strict access policies | Use Kerberos, LDAP, or OAuth authentication
Data Leakage in Pipelines | Leakage during processing or transfers | Secure APIs, TLS/SSL, audit trails
Malicious Code Injection | Attacks via open-source or shared scripts | Code scanning, sandboxing jobs
Lack of Audit Trails | No monitoring of data usage | Use logging systems like Apache Ranger, audit tools

🔐 Key Security Techniques:

 Kerberos: Secure authentication in Hadoop/Spark

 Apache Ranger: Role-based policies and audit logs

 Tokenization & Encryption: Protects PII data (see the sketch below)

 Network Layer Security: VPN, firewalls, VPCs
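As a small illustration of the encryption point, here is a hedged Python sketch that encrypts a PII field before a record is written to storage, using the cryptography package. The library choice, key handling, and record shape are assumptions for illustration only; in production the key would come from a key management service.

from cryptography.fernet import Fernet

# Illustration only: in practice the key comes from a KMS, not generated inline
key = Fernet.generate_key()
cipher = Fernet(key)

record = {"user_id": "u42", "email": "user@example.com"}

# Encrypt the PII field before persisting the record (encryption at rest)
record["email"] = cipher.encrypt(record["email"].encode())

# Decrypt only when an authorized consumer needs the plaintext
plaintext_email = cipher.decrypt(record["email"]).decode()
print(plaintext_email)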


✅ 3. How Do You Scale Big Data Processing for Real-Time Analytics?

Real-time analytics requires fast ingestion, low-latency processing, and a scalable architecture.

⚙️Architecture for Real-Time Analytics:


[Data Sources]
        ↓
[Ingestion Layer] — Kafka / Flume / Kinesis
        ↓
[Processing Layer] — Apache Spark Streaming / Flink / Storm
        ↓
[Storage Layer] — Cassandra / HBase / Elasticsearch
        ↓
[Visualization Layer] — Grafana / Kibana / Tableau

🧠 Key Techniques:

Technique | Purpose
Stream Processing | Real-time data computation (Spark Streaming, Flink)
Micro-Batching | Efficient processing in small time windows
Autoscaling Infrastructure | Dynamic resource allocation in the cloud (K8s, EMR)
Event-Driven Architecture | Process events instantly via Kafka or Pulsar
In-Memory Computing | Fast processing using RAM (Spark, Ignite)

🛠 Example Tools:

 Kafka + Spark Structured Streaming for low-latency pipelines (sketched below)

 AWS Kinesis + Lambda for serverless real-time processing

 Apache Flink for advanced stream processing with stateful operators
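To make the first tool combination concrete, here is a minimal sketch of a Kafka to Spark Structured Streaming pipeline. The broker address and topic name are placeholders, and running it also requires the spark-sql-kafka connector package, which is an assumption not stated in the notes.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("RealTimeCounts").getOrCreate()

# Ingestion layer: subscribe to a Kafka topic (placeholder broker and topic)
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Kafka delivers key/value as bytes; keep the message value and event timestamp
clicks = events.selectExpr("CAST(value AS STRING) AS page", "timestamp")

# Processing layer: micro-batch counts per page over 1-minute windows
counts = clicks.groupBy(window(col("timestamp"), "1 minute"), col("page")).count()

# Sink: print to the console here; a real pipeline would write to Cassandra, etc.
query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()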

🚀 What is Apache Spark?

Apache Spark is an open-source, distributed computing framework designed for fast processing of large-scale data. It supports batch processing, streaming, machine learning, and SQL-based analytics — all in one platform.

Feature | Description
In-Memory Computing | Keeps intermediate data in memory for faster processing than Hadoop MapReduce
Unified Engine | Supports SQL, MLlib (machine learning), GraphX (graph processing), and Spark Streaming
Language Support | APIs available in Python, Scala, Java, and R
Distributed Computing | Splits tasks across a cluster for parallel execution
Fault Tolerant | Automatically handles failures using RDD lineage

🔄 Spark vs Hadoop (MapReduce)

Feature | Apache Spark | Hadoop MapReduce
Speed | Faster (in-memory) | Slower (disk-based)
Ease of Use | Rich APIs in Python/Scala/Java | Low-level Java APIs
Data Processing | Batch + Streaming | Batch only
Machine Learning | Built-in MLlib | Needs integration with external tools

🔥 Core Components of Apache Spark

1. Spark Core – The execution engine (RDDs, memory management, fault tolerance)

2. Spark SQL – Query structured data using SQL or DataFrames

3. Spark Streaming – Real-time data processing from sources like Kafka

4. MLlib – Machine learning library (classification, clustering, etc.)

5. GraphX – Graph processing (e.g., PageRank, graph traversal)

💡 Example: PySpark Code for Word Count

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Load the text file as an RDD of lines
rdd = spark.sparkContext.textFile("sample.txt")

# Word count logic: split lines into words, pair each word with 1, sum per word
counts = (
    rdd.flatMap(lambda line: line.split())
       .map(lambda word: (word, 1))
       .reduceByKey(lambda a, b: a + b)
)

print(counts.collect())

✅ Common Use Cases

 Real-time analytics (e.g., fraud detection, log monitoring)

 ETL (Extract, Transform, Load) pipelines

 Recommendation engines

 Social media data analysis

 IoT stream processing


🚀 Apache Spark Architecture: Overview

Apache Spark follows a master-slave architecture with the following key components:

🧱 Core Components of Apache Spark:

Component | Role
Driver Program | Controls the application, manages the SparkContext, and coordinates tasks
Cluster Manager | Allocates resources across Spark applications (e.g., YARN, Mesos, Kubernetes)
Executors | Run tasks and return results to the driver
Tasks | Individual units of work sent to executors

🔧 Detailed Components of Apache Spark

1. Spark Core (Foundation of everything)

 Manages memory, fault tolerance, and job scheduling.

 Provides the RDD (Resilient Distributed Dataset) abstraction for distributed data.
2. Spark SQL

 Allows querying structured and semi-structured data using SQL, DataFrames, and Datasets.

 Can read from Hive, Parquet, JSON, JDBC, etc.

3. Spark Streaming

 Enables real-time data processing.

 Processes live data streams using micro-batching.

4. MLlib (Machine Learning Library)

 Built-in library for scalable machine learning tasks:

  ◦ Classification, Regression, Clustering, Recommendation

5. GraphX

 API for graph processing (e.g., social networks, recommendation graphs).

 Includes graph algorithms like PageRank and connected components.

⚙️How Apache Spark Works (Step-by-Step Execution)

Let’s understand with an example:

💼 Suppose: You want to count words in a large text file using Spark.

🔁 Spark Job Workflow:

1. Driver Program Starts

   ◦ It creates a SparkContext (the entry point to the Spark cluster).

2. Cluster Manager Allocates Resources

   ◦ The driver asks for executors on cluster nodes.

3. RDD/DataFrame Created

   ◦ Data is loaded into an RDD (e.g., from a text file).

4. Transformations Applied

   ◦ Operations like .map(), .filter(), and .flatMap() define a DAG (Directed Acyclic Graph).

5. Actions Trigger Execution

   ◦ An action like .collect() or .saveAsTextFile() starts the actual processing (see the sketch after this list).

6. Task Scheduling

   ◦ Spark breaks the DAG into stages and tasks.

7. Tasks Sent to Executors

   ◦ Executors perform computations in parallel.

8. Results Returned

   ◦ Executors return the results to the driver or write them to storage.
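A tiny PySpark sketch of steps 4 and 5 above: the transformations only record lineage, and the action is what actually submits the job to the executors. The values used are invented for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyDemo").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))

# Transformations: nothing runs yet, Spark only builds the DAG/lineage
evens = rdd.filter(lambda x: x % 2 == 0)
squares = evens.map(lambda x: x * x)

# Action: the DAG is split into stages and tasks, executors run them in parallel
print(squares.collect())  # [0, 4, 16, 36, 64]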

🖼 Spark Architecture Diagram (Text-Based)


+------------------+
|  Driver Program  |   ← Controls the job
+------------------+
         |
         v
+------------------+          +------------------+
|  Cluster Manager |  ←────→  |   Executors (n)  |
+------------------+          +------------------+
         |                             |
         v                             v
  Distribute Tasks           Process Data, Store Cache

🧠 Summary
Component | Responsibility
Driver | Main controller, builds the job, sends tasks to workers
Executor | Workers that run tasks and store data
Cluster Manager | Manages resources and task scheduling
RDD/DataFrame | Data abstraction used for processing
