Assignment-4a
Digital Forensics
Harshit Chodvadiya (21BCP122)
Div-2 (G4)
Aim: Design a distributed application using MapReduce that processes the log
file of a system. List the users who have been logged in to the system for the
maximum period.
Tools Used:
● Databricks
● MapReduce
● PySpark
Theory:
MapReduce is a programming model and an associated implementation for processing
and generating big data sets with a parallel, distributed algorithm on a cluster. A
MapReduce program is composed of a map procedure, which performs filtering and
sorting, and a reduce method, which performs a summary operation.
The term "MapReduce" refers to two separate and distinct tasks that Hadoop programs
perform. The first is the map job, which takes a set of data and converts it into another
set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce job takes the output from a map as input and combines those data tuples
into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce job is always performed after the map job.
MapReduce programming offers several benefits to help you gain valuable insights
from your big data:
● Scalability: Businesses can process petabytes of data stored in the Hadoop
Distributed File System (HDFS).
● Flexibility: Hadoop enables easier access to multiple sources of data and
multiple types of data.
● Speed: With parallel processing and minimal data movement, Hadoop offers
fast processing of massive amounts of data.
● Simplicity: Developers can write code in a choice of languages, including
Java, C++ and Python.
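As a small illustration of the two steps described above, the sketch below (a hypothetical example, not part of the assignment code) counts words with PySpark's RDD API: the map step emits (word, 1) key/value pairs and the reduce step sums the counts per key.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
# Map step: break each line into (word, 1) key/value pairs
lines = spark.sparkContext.parallelize(["to be or not to be", "to see or not to see"])
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
# Reduce step: combine the tuples by summing the counts for each word
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())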
Code:
PART-A
The code reads a log file, extracts IP addresses and timestamps, calculates the session
duration for each IP, and prints the top 5 sessions by duration. It also prints the total
session duration across all users.
import re
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, TimestampType

spark = SparkSession.builder.appName("LogParser").getOrCreate()
# Load the raw access log (one log line per row, in column _c0)
df1 = spark.read.format("csv").option("header", "false").load(
    "dbfs:/FileStore/shared_uploads/[email protected]/access_log_short_file1.csv")

# Parse an Apache-style log line into (ip_address, timestamp)
def parse_log_line(line):
    match = re.match(r'^(\S+) \S+ \S+ \[([^\]]+)\]', line or "")
    if match:
        ip_address = match.group(1)
        timestamp_str = match.group(2).split(" ")[0]
        return ip_address, datetime.strptime(timestamp_str, "%d/%b/%Y:%H:%M:%S")
    return None, None

parse_log_ip_udf = udf(lambda line: parse_log_line(line)[0], StringType())
parse_log_timestamp_udf = udf(lambda line: parse_log_line(line)[1], TimestampType())
parsed_df = df1.withColumn("ip_address", parse_log_ip_udf(col("_c0"))) \
    .withColumn("timestamp", parse_log_timestamp_udf(col("_c0"))) \
    .select("ip_address", "timestamp")
# Filter out rows where parsing failed (i.e., ip_address or timestamp is null)
parsed_df = parsed_df.filter(col("ip_address").isNotNull() & col("timestamp").isNotNull())

# Group timestamps per IP and compute each session's duration in seconds
logs_rdd = parsed_df.rdd.map(lambda row: (row["ip_address"], row["timestamp"]))
user_sessions = logs_rdd.groupByKey().mapValues(list).collectAsMap()
user_durations = {}
for ip, timestamps in user_sessions.items():
    session_duration = (max(timestamps) - min(timestamps)).total_seconds()
    user_durations[ip] = session_duration

# Print the top 5 users by session duration and the total across all users
for ip, duration in sorted(user_durations.items(), key=lambda x: x[1], reverse=True)[:5]:
    duration_minutes = duration / 60
    print(f"{ip}: {duration_minutes:.2f} minutes")
total_minutes = sum(user_durations.values()) / 60
print(f"Total session duration across all users: {total_minutes:.2f} minutes")
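For reference, the same per-IP session duration can also be written in explicit MapReduce style with reduceByKey; the sketch below assumes the parsed_df produced by the code above and keeps only the earliest and latest timestamp seen for each IP.

# Map each row to (ip, (timestamp, timestamp)), then reduce to (min, max) per IP
pairs = parsed_df.rdd.map(lambda row: (row["ip_address"], (row["timestamp"], row["timestamp"])))
min_max = pairs.reduceByKey(lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
durations = min_max.mapValues(lambda mm: (mm[1] - mm[0]).total_seconds())
# Top 5 IPs by session duration (in seconds)
print(durations.takeOrdered(5, key=lambda kv: -kv[1]))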
PART-B
The code reads a log file, extracts IP addresses and timestamps, counts the
occurrences of each IP address, sorts them by frequency, and displays the top IPs by
the number of occurrences. The result can also be saved to a CSV file if needed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
import re

spark = SparkSession.builder.appName("LogAnalyzer").getOrCreate()
# Load the raw access log (one log line per row, in column _c0)
log_df = spark.read.format("csv").option("header", "false").load(
    "dbfs:/FileStore/shared_uploads/[email protected]/access_log_short_file1.csv")

# Parse an Apache-style log line into (ip_address, timestamp string)
def parse_log_line(line):
    match = re.match(r'^(\S+) \S+ \S+ \[([^\]]+)\]', line or "")
    if match:
        ip_address = match.group(1)
        timestamp_str = match.group(2).split(" ")[0]
        return ip_address, timestamp_str
    return None, None

def extract_ip(line):
    ip, _ = parse_log_line(line)
    return ip

def extract_timestamp(line):
    _, ts = parse_log_line(line)
    return ts

extract_ip_udf = udf(extract_ip, StringType())
extract_timestamp_udf = udf(extract_timestamp, StringType())
parsed_df = log_df.withColumn("ip_address", extract_ip_udf(col("_c0"))) \
    .withColumn("timestamp", extract_timestamp_udf(col("_c0")))

# Keep only rows where an IP address was successfully parsed
parsed_df = parsed_df.filter(col("ip_address").isNotNull())

# Count occurrences of each IP address and sort by frequency
ip_frequency_df = parsed_df.groupBy("ip_address").count()
sorted_ip_frequency_df = ip_frequency_df.orderBy(col("count").desc())
sorted_ip_frequency_df.show(truncate=False)

# Save the result to a CSV file in DBFS (uncomment the line below to enable saving)
# sorted_ip_frequency_df.write.csv("dbfs:/path/to/output/ip_frequency.csv", header=True)
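The same frequency count can also be expressed as an explicit map/reduce over the RDD; the sketch below assumes the parsed_df produced by the code above.

# Map step: emit (ip, 1) for every parsed line; reduce step: sum the counts per IP
ip_counts = parsed_df.rdd.map(lambda row: (row["ip_address"], 1)) \
    .reduceByKey(lambda a, b: a + b)
# Top 10 IPs by number of occurrences
print(ip_counts.takeOrdered(10, key=lambda kv: -kv[1]))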
Output:
PART-A
PART-B