DS Writeup 11&12

The document outlines two experiments focused on distributed application design using MapReduce and Apache Spark. Experiment No. 11 involves creating a MapReduce application to analyze system log files for insights like IP access frequency, while Experiment No. 12 demonstrates writing a simple word count program in Scala using Apache Spark. Both experiments emphasize the importance of distributed computing and big data processing capabilities.


Experiment No. 11
“Design a distributed application using MapReduce which processes a log file of a system.”

🧪 Practical Write-up
✅ Aim
To design and implement a distributed application using the MapReduce programming model that
processes and analyzes a system log file for extracting useful information such as IP access
frequency or error code analysis.

📋 Requirements
Hardware Requirements:
• A computer or cluster with Hadoop installed
• Minimum 4 GB RAM (per node)
• Multi-core processor

Software Requirements:
• Hadoop framework (preferably Hadoop 2.x or 3.x)
• HDFS (Hadoop Distributed File System)
• Java or Python (for MapReduce logic)
• Sample log file (e.g., Apache access log or system log)
• Ubuntu/Linux OS (recommended)

📚 Theory
🔸 Distributed Systems and Big Data:
A distributed system allows data processing across multiple machines, offering speed and
scalability. In the era of big data, distributed processing is essential due to the massive volume of
data generated by web servers, applications, and networks.

🔸 MapReduce Programming Model:


MapReduce is a model for processing large data sets in parallel across a distributed cluster. It
consists of two primary phases, Map and Reduce, linked by an intermediate Shuffle & Sort step:
• Map Phase: Processes the input data and produces intermediate key-value pairs.
• Shuffle & Sort: Groups the intermediate pairs by key and sorts them before they reach the reducers.
• Reduce Phase: Aggregates the values for each key and produces the final output.
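As a toy illustration (with made-up log lines), counting accesses per IP flows through the phases as follows:

Input:    192.168.0.1 GET /index.html 200
          192.168.0.5 GET /about.html 404
          192.168.0.1 POST /login 200

Map:      (192.168.0.1, 1), (192.168.0.5, 1), (192.168.0.1, 1)
Shuffle:  (192.168.0.1, [1, 1]), (192.168.0.5, [1])
Reduce:   (192.168.0.1, 2), (192.168.0.5, 1)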

🔸 Hadoop & HDFS:


Hadoop is an open-source framework that uses the MapReduce model and HDFS to store and
process big data. HDFS splits large files into blocks and distributes them across nodes in the cluster,
enabling parallel computation.
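In practice, the log file is copied into HDFS before the job is launched. A minimal sketch, assuming a local file named access.log and a target HDFS directory /logs (both placeholder names):

hdfs dfs -mkdir -p /logs
hdfs dfs -put access.log /logs/
hdfs dfs -ls /logs

Hadoop then splits the stored file into blocks automatically and schedules map tasks close to the nodes that hold those blocks.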

🔸 Log File Analysis:


System log files contain structured or semi-structured data including:
• IP addresses
• Timestamps
• HTTP methods (GET, POST)
• Requested URLs
• Status codes (200, 404, 500)
MapReduce can be used to:
• Count IP address frequency
• Identify the number of failed requests
• Monitor traffic patterns
• Detect anomalies and security issues

🔸 Sample Use-Case:
• Mapper reads each log entry, extracts the IP, and emits <IP, 1>.

• Reducer sums up counts for each IP, outputting total accesses per IP.
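The sketch below shows how this mapper and reducer might look in Scala against the Hadoop MapReduce API (the same logic can equally be written in Java or via Python streaming). It assumes the client IP is the first whitespace-separated field of each log line, as in an Apache access log; the class names IpCountMapper and IpCountReducer are illustrative.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Mapper: reads one log line at a time, extracts the first field (the client IP)
// and emits the intermediate pair <IP, 1>.
class IpCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val ip  = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val fields = value.toString.split("\\s+")
    if (fields.nonEmpty && fields(0).nonEmpty) {
      ip.set(fields(0))
      context.write(ip, one)
    }
  }
}

// Reducer: receives <IP, [1, 1, ...]> after the shuffle and sums the counts,
// producing the total number of accesses per IP.
class IpCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

A small driver class would then configure a Job with these classes, point it at the log file in HDFS and an output directory, and the packaged JAR would be launched with the hadoop jar command.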

🧾 Conclusion
In this practical, a distributed log file analyzer was successfully designed and implemented using
the MapReduce programming model. The application demonstrated how large system log files can
be processed efficiently by distributing the workload across multiple nodes using Hadoop. This practical
reinforced the concepts of distributed computing, parallel processing, and scalable data analytics,
which are essential for handling modern big data challenges.
Experiment No. 12
"Write a simple program in SCALA using Apache Spark framework."

🧪 Practical Write-up
✅ Aim
To write and execute a simple program in Scala using the Apache Spark framework for processing
and analyzing data in a distributed environment.

📋 Requirements
Hardware Requirements:
• Computer with at least 4 GB RAM
• Multi-core processor

Software Requirements:
• Apache Spark (2.x or 3.x)
• Scala (2.12 for Spark 3.x; 2.11 or 2.12 for Spark 2.x)
• SBT (Scala Build Tool) or the Spark shell
• Java JDK (1.8 or above)
• Linux/Ubuntu (recommended)

📚 Theory
🔷 What is Apache Spark?
Apache Spark is an open-source, distributed computing system that provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance. It is widely used for
big data processing, analytics, and machine learning.

🔷 Features of Apache Spark:


• In-memory computation
• Lazy evaluation
• Fault tolerance
• Rich APIs in Scala, Python, Java, and R
• Support for SQL, Streaming, MLlib, and GraphX
🔷 What is Scala?
Scala (Scalable Language) is a high-level programming language that combines object-oriented and
functional programming. Spark itself is written in Scala, so Scala has first-class support for writing
Spark applications.

🔷 Spark Program Structure in Scala:


A typical Spark application includes:
• SparkContext: The entry point for Spark functionality.
• RDD (Resilient Distributed Dataset): An immutable, distributed collection of data; DataFrames provide a higher-level, tabular API on top of the same engine.
• Transformations: Lazy operations that define new RDDs (e.g., map, filter).
• Actions: Operations that trigger execution and return results (e.g., count, collect).
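The distinction between transformations and actions, and Spark's lazy evaluation, can be seen in a quick sketch as it would look in the Spark shell, where sc is the pre-created SparkContext:

val nums  = sc.parallelize(1 to 10)     // an RDD is defined, but nothing runs yet
val evens = nums.filter(_ % 2 == 0)     // transformation: still lazy
println(evens.count())                  // action: triggers the job and prints 5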

✍️ Sample Scala Program in Apache Spark


Objective: Count the number of words in a text file.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the application and create the SparkContext (local mode for testing)
    val conf = new SparkConf().setAppName("Word Count").setMaster("local")
    val sc = new SparkContext(conf)

    // Read the input file as an RDD of lines
    val input = sc.textFile("input.txt")

    // Split each line into words, then count the occurrences of each word
    val words = input.flatMap(line => line.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // collect() is an action: it triggers the computation and returns the results to the driver
    wordCounts.collect().foreach(println)
    sc.stop()
  }
}

🧪 Execution Steps:
1. Install Spark and Scala.
2. Save the program as WordCount.scala.

3. Package the program into a JAR with sbt, or paste the code directly into the Spark shell (a minimal
build.sbt sketch follows these steps). Then submit the JAR to Spark:

spark-submit --class WordCount --master local WordCount.jar

4. The output displays the frequency of each word in the file.
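A minimal build.sbt sketch for packaging the program with sbt (the version numbers are examples; the Spark dependency is marked "provided" because spark-submit supplies Spark at runtime):

name := "word-count"
version := "1.0"
scalaVersion := "2.12.18"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0" % "provided"

Running sbt package then produces a JAR under target/scala-2.12/ that can be passed to spark-submit as in step 3.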


📝 Conclusion
In this practical, we successfully created a simple Scala program using Apache Spark to perform
word count. This introduced us to Spark's distributed data processing capabilities and Scala’s
concise syntax. Such programs are fundamental building blocks in big data processing and are
scalable across large datasets and clusters.
