DS Writeup 11&12

The document outlines two experiments focused on distributed application design using MapReduce and Apache Spark. Experiment No. 11 involves creating a MapReduce application to analyze system log files for insights like IP access frequency, while Experiment No. 12 demonstrates writing a simple word count program in Scala using Apache Spark. Both experiments emphasize the importance of distributed computing and big data processing capabilities.


Experiment No. 11
“Design a distributed application using MapReduce which processes a log file of a system.”

🧪 Practical Write-up
✅ Aim
To design and implement a distributed application using the MapReduce programming model that
processes and analyzes a system log file for extracting useful information such as IP access
frequency or error code analysis.

📋 Requirements
Hardware Requirements:
• A computer or cluster with Hadoop installed
• Minimum 4 GB RAM (per node)
• Multi-core processor

Software Requirements:
• Hadoop framework (preferably Hadoop 2.x or 3.x)
• HDFS (Hadoop Distributed File System)
• Java or Python (for MapReduce logic)
• Sample log file (e.g., Apache access log or system log)
• Ubuntu/Linux OS (recommended)

📚 Theory
🔸 Distributed Systems and Big Data:
A distributed system allows data processing across multiple machines, offering speed and
scalability. In the era of big data, distributed processing is essential due to the massive volume of
data generated by web servers, applications, and networks.

🔸 MapReduce Programming Model:


MapReduce is a model for processing large data sets in parallel across a distributed cluster. It
consists of two primary phases, Map and Reduce, linked by an intermediate Shuffle & Sort step:
• Map Phase: Processes the input data and produces intermediate key-value pairs.
• Shuffle & Sort: Groups the intermediate pairs by key and sorts them before they reach the reducers.
• Reduce Phase: Aggregates the values for each key and produces the final output.
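As a toy illustration (with made-up log lines), counting accesses per IP flows through the phases as follows:

Input:    192.168.0.1 GET /index.html 200
          192.168.0.5 GET /about.html 404
          192.168.0.1 POST /login 200

Map:      (192.168.0.1, 1), (192.168.0.5, 1), (192.168.0.1, 1)
Shuffle:  (192.168.0.1, [1, 1]), (192.168.0.5, [1])
Reduce:   (192.168.0.1, 2), (192.168.0.5, 1)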

🔸 Hadoop & HDFS:


Hadoop is an open-source framework that uses the MapReduce model and HDFS to store and
process big data. HDFS splits large files into blocks and distributes them across nodes in the cluster,
enabling parallel computation.
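In practice, the log file is copied into HDFS before the job is launched. A minimal sketch, assuming a local file named access.log and a target HDFS directory /logs (both placeholder names):

hdfs dfs -mkdir -p /logs
hdfs dfs -put access.log /logs/
hdfs dfs -ls /logs

Hadoop then splits the stored file into blocks automatically and schedules map tasks close to the nodes that hold those blocks.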

🔸 Log File Analysis:


System log files contain structured or semi-structured data including:
• IP addresses
• Timestamps
• HTTP methods (GET, POST)
• Requested URLs
• Status codes (200, 404, 500)
MapReduce can be used to:
• Count IP address frequency
• Identify the number of failed requests
• Monitor traffic patterns
• Detect anomalies and security issues

🔸 Sample Use-Case:
• Mapper reads each log entry, extracts the IP, and emits <IP, 1>.

• Reducer sums up counts for each IP, outputting total accesses per IP.
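The sketch below shows how this mapper and reducer might look in Scala against the Hadoop MapReduce API (the same logic can equally be written in Java or via Python streaming). It assumes the client IP is the first whitespace-separated field of each log line, as in an Apache access log; the class names IpCountMapper and IpCountReducer are illustrative.

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}

// Mapper: reads one log line at a time, extracts the first field (the client IP)
// and emits the intermediate pair <IP, 1>.
class IpCountMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  private val one = new IntWritable(1)
  private val ip  = new Text()

  override def map(key: LongWritable, value: Text,
                   context: Mapper[LongWritable, Text, Text, IntWritable]#Context): Unit = {
    val fields = value.toString.split("\\s+")
    if (fields.nonEmpty && fields(0).nonEmpty) {
      ip.set(fields(0))
      context.write(ip, one)
    }
  }
}

// Reducer: receives <IP, [1, 1, ...]> after the shuffle and sums the counts,
// producing the total number of accesses per IP.
class IpCountReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
                      context: Reducer[Text, IntWritable, Text, IntWritable]#Context): Unit = {
    var sum = 0
    val it = values.iterator()
    while (it.hasNext) sum += it.next().get()
    context.write(key, new IntWritable(sum))
  }
}

A small driver class would then configure a Job with these classes, point it at the log file in HDFS and an output directory, and the packaged JAR would be launched with the hadoop jar command.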

🧾 Conclusion
In this practical, a distributed log file analyzer was successfully designed and implemented using
the MapReduce programming model. The application demonstrated how large system log files can
be processed efficiently by distributing the workload across multiple nodes using Hadoop. This practical
reinforced the concepts of distributed computing, parallel processing, and scalable data analytics,
which are essential for handling modern big data challenges.
Experiment No. 12
"Write a simple program in SCALA using Apache Spark framework."

🧪 Practical Write-up
✅ Aim
To write and execute a simple program in Scala using the Apache Spark framework for processing
and analyzing data in a distributed environment.

📋 Requirements
Hardware Requirements:
• Computer with at least 4 GB RAM
• Multi-core processor

Software Requirements:
• Apache Spark (2.x or 3.x)
• Scala (2.12 for Spark 3.x; 2.11 or 2.12 for Spark 2.x)
• SBT (Scala Build Tool) or the Spark shell
• Java JDK (1.8 or above)
• Linux/Ubuntu (recommended)

📚 Theory
🔷 What is Apache Spark?
Apache Spark is an open-source, distributed computing system that provides an interface for
programming entire clusters with implicit data parallelism and fault tolerance. It is widely used for
big data processing, analytics, and machine learning.

🔷 Features of Apache Spark:


• In-memory computation
• Lazy evaluation
• Fault tolerance
• Rich APIs in Scala, Python, Java, and R
• Support for SQL, Streaming, MLlib, and GraphX
🔷 What is Scala?
Scala (Scalable Language) is a high-level programming language that combines object-oriented and
functional programming. Spark itself is written in Scala, so Scala has first-class support for writing
Spark applications.

🔷 Spark Program Structure in Scala:


A typical Spark application includes:
• SparkContext: The entry point for Spark functionality.
• RDD (Resilient Distributed Dataset): An immutable, distributed collection of data; DataFrames provide a higher-level, tabular API on top of the same engine.
• Transformations: Lazy operations that define new RDDs (e.g., map, filter).
• Actions: Operations that trigger execution and return results (e.g., count, collect).
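The distinction between transformations and actions, and Spark's lazy evaluation, can be seen in a quick sketch as it would look in the Spark shell, where sc is the pre-created SparkContext:

val nums  = sc.parallelize(1 to 10)     // an RDD is defined, but nothing runs yet
val evens = nums.filter(_ % 2 == 0)     // transformation: still lazy
println(evens.count())                  // action: triggers the job and prints 5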

✍️ Sample Scala Program in Apache Spark


Objective: Count the number of words in a text file.
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object WordCount {
  def main(args: Array[String]): Unit = {
    // Configure the application and create the SparkContext (local mode for testing)
    val conf = new SparkConf().setAppName("Word Count").setMaster("local")
    val sc = new SparkContext(conf)

    // Read the input file as an RDD of lines
    val input = sc.textFile("input.txt")

    // Split each line into words, then count the occurrences of each word
    val words = input.flatMap(line => line.split(" "))
    val wordCounts = words.map(word => (word, 1)).reduceByKey(_ + _)

    // collect() is an action: it triggers the computation and returns the results to the driver
    wordCounts.collect().foreach(println)
    sc.stop()
  }
}

🧪 Execution Steps:
1. Install Spark and Scala.
2. Save the program as WordCount.scala.

3. Package the program into a JAR with sbt, or paste the code directly into the Spark shell (a minimal
build.sbt sketch follows these steps). Then submit the JAR to Spark:

spark-submit --class WordCount --master local WordCount.jar

4. The output displays the frequency of each word in the file.
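A minimal build.sbt sketch for packaging the program with sbt (the version numbers are examples; the Spark dependency is marked "provided" because spark-submit supplies Spark at runtime):

name := "word-count"
version := "1.0"
scalaVersion := "2.12.18"

libraryDependencies += "org.apache.spark" %% "spark-core" % "3.5.0" % "provided"

Running sbt package then produces a JAR under target/scala-2.12/ that can be passed to spark-submit as in step 3.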


📝 Conclusion
In this practical, we successfully created a simple Scala program using Apache Spark to perform
word count. This introduced us to Spark's distributed data processing capabilities and Scala’s
concise syntax. Such programs are fundamental building blocks in big data processing and are
scalable across large datasets and clusters.
