Big data assignment notes
ASSIGNMENT 3
1. How Does MapReduce Work in Hadoop?
How It Works:
🟩 Step 2: Mapping
🟩 Step 4: Reducing
🟩 Step 5: Output
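The mapping, reducing, and output steps above can be sketched in plain Python. This is only an in-memory illustration of the idea, not the Hadoop API: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_word_count`) and the sample lines are assumptions made for this example. The grouping step between map and reduce stands in for the shuffle that Hadoop performs across the cluster.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group emitted values by key (done by the framework in Hadoop)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def run_word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

# Hypothetical sample input standing in for the lines of a file split
sample_lines = ["big data is big", "data is everywhere"]
word_counts = run_word_count(sample_lines)
print(word_counts)
```

In a real cluster each mapper would receive its own input split and the reducers would run in parallel on different nodes; here everything runs in one process to keep the data flow visible.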
Real-time Support    Yes (Spark Streaming)    No (batch only)
Summary:
UNIT 4
1. What is NoSQL, and How is it Used in Big Data Storage?
✅ Definition:
Key-Value Stores    pairs for fast lookups    Redis, Amazon DynamoDB
Big data often contains noise, duplication, or missing values. Here's how you
can manage quality issues:
Duplicate Records    Use hashing or unique IDs to remove duplicates
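Hash-based duplicate removal can be sketched as follows. This is a minimal single-machine example, assuming records are small JSON-serializable dictionaries; the helper names (`record_fingerprint`, `drop_duplicates`) and the sample records are made up for illustration.

```python
import hashlib
import json

def record_fingerprint(record):
    """Hash a record's content; keys are sorted so equal records hash equally."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def drop_duplicates(records):
    """Keep the first occurrence of each distinct record, preserving order."""
    seen = set()
    unique = []
    for record in records:
        fingerprint = record_fingerprint(record)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique

# Hypothetical sample data
records = [
    {"id": 1, "name": "Asha"},
    {"name": "Asha", "id": 1},  # same content, different key order
    {"id": 2, "name": "Ravi"},
]
unique_records = drop_duplicates(records)
print(unique_records)
```

Storing only the fingerprints keeps memory use low when records are large. When a trustworthy unique ID already exists, the ID itself can replace the hash.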
✅ Common Techniques:
Technique Purpose
Data Transformation    Normalize, scale, encode data for algorithms
aggregation
Tokenization & Parsing    For text data: splitting sentences into words
Apache NiFi
Hadoop MapReduce
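Two of the preprocessing techniques listed in this unit, scaling numeric data and tokenizing text, can be illustrated with a short sketch. Min-max scaling is just one possible way to normalize, chosen here as an assumption, and the function names and sample values are hypothetical.

```python
import re

def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(value - low) / (high - low) for value in values]

def tokenize(sentence):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

# Hypothetical sample inputs
scaled = min_max_scale([10, 20, 40])
tokens = tokenize("Big data needs cleaning, always!")
print(scaled)
print(tokens)
```

On a cluster the same logic would usually be expressed through a framework's own transformation functions rather than plain Python lists.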
UNIT 5
✅ 1. How Do You Implement Data Governance in a Big Data Environment?
Component Description
Data Quality Rules    Define valid values, types, ranges, null handling
Stewardship
Policy Management    Compliance with GDPR, HIPAA, etc.
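Data quality rules of the kind described above (valid values, types, ranges, null handling) can be written down as a small rule table and checked per record. The sketch below is one possible shape for this, not a standard governance API: the `RULES` layout, the field names, and `check_record` are all assumptions for illustration.

```python
def check_record(record, rules):
    """Return a list of rule violations found in one record."""
    problems = []
    for field, rule in rules.items():
        value = record.get(field)
        if value is None:
            # Null handling: only fields marked nullable may be missing
            if not rule.get("nullable", False):
                problems.append(f"{field}: missing value")
            continue
        if not isinstance(value, rule["type"]):
            problems.append(f"{field}: expected {rule['type'].__name__}")
            continue
        if "allowed" in rule and value not in rule["allowed"]:
            problems.append(f"{field}: value not allowed")
        if "range" in rule:
            low, high = rule["range"]
            if not low <= value <= high:
                problems.append(f"{field}: out of range")
    return problems

# Hypothetical rule set: type, range, allowed values, and null handling
RULES = {
    "age": {"type": int, "range": (0, 120)},
    "country": {"type": str, "allowed": {"IN", "US", "DE"}},
    "email": {"type": str, "nullable": True},
}

good = check_record({"age": 34, "country": "IN", "email": None}, RULES)
bad = check_record({"age": 150, "country": "XX"}, RULES)
print(good)
print(bad)
```

Keeping the rules as data rather than code makes it easier for data stewards to review and update them without touching the validation logic.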
[Data Sources]
🧠 Key Techniques:
Technique Purpose
Event-Driven Architecture    Process events instantly via Kafka or Pulsar
In-Memory Computing    Fast processing using RAM (Spark, Ignite)
🛠 Example Tools:
Feature Description
Language Support    APIs available in Python, Scala, Java, and R
Distributed Computing    Splits tasks across a cluster for parallel execution
Ease of Use    Rich APIs in Python/Scala/Java    Low-level Java APIs
Data Processing    Batch + Streaming    Batch only
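from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WordCount").getOrCreate()
rdd = spark.sparkContext.textFile("sample.txt")
counts = (
    rdd.flatMap(lambda line: line.split())  # split each line into words
    .map(lambda word: (word, 1))            # emit (word, 1) pairs
    .reduceByKey(lambda a, b: a + b)        # sum the counts per word
)
counts.collect()  # action: returns a list of (word, count) tuples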
✅ Common Use Cases
Recommendation engines
Component Role
3. Spark Streaming
5. GraphX
💼 Suppose: You want to count words in a large text file using Spark.
3. RDD/DataFrame Created
4. Transformations Applied
6. Task Scheduling
8. Results Returned
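One point behind these steps is that Spark transformations are lazy: they only describe work, and nothing runs until an action such as `collect()` asks for results. The toy class below imitates that behaviour in plain Python so it can run without a cluster. `LazyDataset` and its methods are hypothetical stand-ins written for this example, not part of PySpark.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on an action."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: return a new dataset with one more recorded step
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def flat_map(self, fn):
        # Transformation: like map, but each input may yield several outputs
        return LazyDataset(self._source, self._steps + (("flat_map", fn),))

    def collect(self):
        # Action: only now are the recorded steps actually executed
        data = list(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                data = [fn(item) for item in data]
            else:
                data = [out for item in data for out in fn(item)]
        return data

calls = []

def split_words(line):
    calls.append(line)  # track when the function really runs
    return line.split()

lines = LazyDataset(["spark is fast", "spark is lazy"])
words = lines.flat_map(split_words).map(str.upper)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = words.collect()
print(calls_before_action, result)
```

In real Spark the recorded steps are turned into stages and tasks and scheduled across executors, which this single-process sketch does not attempt to model.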
🧠 Summary
Component Responsibility
Cluster Manager    Manages resources and task scheduling
RDD/DataFrame    Data abstraction used for processing