DS Writeup 11&12
Experiment No.11
“Design a distributed application using MapReduce which processes a log file of a system.”
🧪 Practical Write-up
✅ Aim
To design and implement a distributed application using the MapReduce programming model that
processes and analyzes a system log file to extract useful information such as IP access
frequency or error-code distribution.
📋 Requirements
Hardware Requirements:
• A computer or cluster with Hadoop installed
• Minimum 4 GB RAM (per node)
• Multi-core processor
Software Requirements:
• Hadoop framework (preferably Hadoop 2.x or 3.x)
• HDFS (Hadoop Distributed File System)
• Java or Python (for MapReduce logic)
• Sample log file (e.g., Apache access log or system log)
• Ubuntu/Linux OS (recommended)
📚 Theory
🔸 Distributed Systems and Big Data:
A distributed system allows data processing across multiple machines, offering speed and
scalability. In the era of big data, distributed processing is essential due to the massive volume of
data generated by web servers, applications, and networks.
🔸 MapReduce Model:
MapReduce processes data in two phases: a Map phase that transforms each input record into
intermediate <key, value> pairs, and a Reduce phase that aggregates all values sharing the same key.
🔸 Sample Use-Case (IP access frequency):
• The Mapper reads each log entry, extracts the client IP address, and emits <IP, 1>.
• The Reducer sums the counts for each IP, outputting the total accesses per IP (see the sketch below).
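A minimal sketch of this use-case in Scala against the Hadoop MapReduce API follows. The class names IpMapper, IpReducer, and LogAnalyzer are illustrative, and the mapper assumes the IP address is the first whitespace-separated field of each log line, as in an Apache access log:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Job, Mapper, Reducer}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

// Mapper: emits <IP, 1> for every log line (assumes the IP address is
// the first whitespace-separated field, as in an Apache access log)
class IpMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val ip = new Text()
  override def map(key: LongWritable, value: Text,
      context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val fields = value.toString.split("\\s+")
    if (fields.nonEmpty && fields(0).nonEmpty) {
      ip.set(fields(0))
      context.write(ip, one)
    }
  }
}

// Reducer: sums all counts received for the same IP
class IpReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

// Driver: wires the mapper and reducer into a Hadoop job
object LogAnalyzer {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "IP Access Count")
    job.setJarByClass(this.getClass)
    job.setMapperClass(classOf[IpMapper])
    job.setReducerClass(classOf[IpReducer])
    job.setOutputKeyClass(classOf[Text])
    job.setOutputValueClass(classOf[IntWritable])
    FileInputFormat.addInputPath(job, new Path(args(0)))
    FileOutputFormat.setOutputPath(job, new Path(args(1)))
    System.exit(if (job.waitForCompletion(true)) 0 else 1)
  }
}

After packaging these classes into a JAR, the job can be launched with, for example, hadoop jar loganalyzer.jar LogAnalyzer <input-path> <output-path> (the JAR name here is illustrative).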
🧾 Conclusion
In this practical, a distributed log file analyzer was successfully designed and implemented using
the MapReduce programming model. The application demonstrated how large system log files can
be processed efficiently by splitting the workload across multiple nodes using Hadoop. This practical
reinforced the concepts of distributed computing, parallel processing, and scalable data analytics,
which are essential for handling modern big data challenges.
Experiment No.12
"Write a simple program in SCALA using Apache Spark framework."
🧪 Practical Write-up
✅ Aim
To write and execute a simple program in Scala using the Apache Spark framework for processing
and analyzing data in a distributed environment.
📋 Requirements
Hardware Requirements:
• Computer with at least 4 GB RAM
• Multi-core processor
Software Requirements:
• Apache Spark (2.x or 3.x)
• Scala (2.11 or 2.12)
• SBT (Scala Build Tool) or Spark shell
• Java JDK (1.8 or above)
• Linux/Ubuntu (recommended)
📚 Theory
🔷 What is Apache Spark?
Apache Spark is an open-source, distributed computing system that provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance. It is widely used for
big data processing, analytics, and machine learning.
🔸 Sample Program (WordCount.scala):
import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Word Count").setMaster("local")
    val sc = new SparkContext(conf)
    // Read the input file, split each line into words, and count occurrences
    // ("input.txt" is an assumed sample path; replace with your own file)
    val wordCounts = sc.textFile("input.txt")
      .flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    wordCounts.collect().foreach(println)
    sc.stop()
  }
}
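Each element of wordCounts is a (word, count) pair; reduceByKey performs the per-word summation in parallel across the RDD's partitions before collect() brings the results back to the driver for printing.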
🧪 Execution Steps:
1. Install Spark and Scala.
2. Save the program as WordCount.scala.
3. Compile and package it into a JAR (e.g., using sbt package).
4. Submit the job with spark-submit --class WordCount <path-to-jar>.
5. Verify the (word, count) pairs printed on the console.
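For reference, a minimal build.sbt for step 3 might look like the following (the version numbers are assumptions; match them to your installed Spark and Scala versions):

// Minimal build.sbt (assumed versions; adjust to your installation)
name := "WordCount"
version := "0.1"
scalaVersion := "2.12.18"
libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0"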