Assignment-4a
Digital Forensics
Harshit Chodvadiya (21BCP122)
Div-2 (G4)
Aim: Design a distributed application using MapReduce that processes the log
file of a system. List the users who have been logged in to the system for the
maximum period.
Tools Used:
● Databricks
● MapReduce
● PySpark
Theory:
MapReduce is a programming model and an associated implementation for processing
and generating big data sets with a parallel, distributed algorithm on a cluster. A
MapReduce program is composed of a map procedure, which performs filtering and
sorting, and a reduce method, which performs a summary operation.
The term "MapReduce" refers to two separate and distinct tasks that Hadoop programs
perform. The first is the map job, which takes a set of data and converts it into another
set of data, where individual elements are broken down into tuples (key/value pairs).
The reduce job takes the output from a map as input and combines those data tuples
into a smaller set of tuples. As the sequence of the name MapReduce implies, the
reduce job is always performed after the map job.
MapReduce programming offers several benefits to help you gain valuable insights
from your big data:
● Scalability: Businesses can process petabytes of data stored in the Hadoop
Distributed File System (HDFS).
● Flexibility: Hadoop enables easier access to multiple sources of data and
multiple types of data.
● Speed: With parallel processing and minimal data movement, Hadoop offers
fast processing of massive amounts of data.
● Simplicity: Developers can write code in a choice of languages, including
Java, C++ and Python.
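As a small illustration of the two steps described above, the sketch below (a hypothetical example, not part of the assignment code) counts words with PySpark's RDD API: the map step emits (word, 1) key/value pairs and the reduce step sums the counts per key.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
# Map step: break each line into (word, 1) key/value pairs
lines = spark.sparkContext.parallelize(["to be or not to be", "to see or not to see"])
pairs = lines.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1))
# Reduce step: combine the tuples by summing the counts for each word
counts = pairs.reduceByKey(lambda a, b: a + b)
print(counts.collect())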
Code:
PART-A
The code reads a log file, extracts IP addresses and timestamps, calculates the session
duration for each IP, and prints the top 5 sessions by duration. It also prints the total
session duration across all users.
import re
from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType, TimestampType

spark = SparkSession.builder.appName("LogParser").getOrCreate()
# Load the raw access log (one log line per row, in column _c0)
df1 = spark.read.format("csv").option("header", "false").load(
    "dbfs:/FileStore/shared_uploads/[email protected]/access_log_short_file1.csv")

# Parse an Apache-style log line into (ip_address, timestamp)
def parse_log_line(line):
    match = re.match(r'^(\S+) \S+ \S+ \[([^\]]+)\]', line or "")
    if match:
        ip_address = match.group(1)
        timestamp_str = match.group(2).split(" ")[0]
        return ip_address, datetime.strptime(timestamp_str, "%d/%b/%Y:%H:%M:%S")
    return None, None

parse_log_ip_udf = udf(lambda line: parse_log_line(line)[0], StringType())
parse_log_timestamp_udf = udf(lambda line: parse_log_line(line)[1], TimestampType())
parsed_df = df1.withColumn("ip_address", parse_log_ip_udf(col("_c0"))) \
    .withColumn("timestamp", parse_log_timestamp_udf(col("_c0"))) \
    .select("ip_address", "timestamp")
# Filter out rows where parsing failed (i.e., ip_address or timestamp is null)
parsed_df = parsed_df.filter(col("ip_address").isNotNull() & col("timestamp").isNotNull())

# Group timestamps per IP and compute each session's duration in seconds
logs_rdd = parsed_df.rdd.map(lambda row: (row["ip_address"], row["timestamp"]))
user_sessions = logs_rdd.groupByKey().mapValues(list).collectAsMap()
user_durations = {}
for ip, timestamps in user_sessions.items():
    session_duration = (max(timestamps) - min(timestamps)).total_seconds()
    user_durations[ip] = session_duration

# Print the top 5 users by session duration and the total across all users
for ip, duration in sorted(user_durations.items(), key=lambda x: x[1], reverse=True)[:5]:
    duration_minutes = duration / 60
    print(f"{ip}: {duration_minutes:.2f} minutes")
total_minutes = sum(user_durations.values()) / 60
print(f"Total session duration across all users: {total_minutes:.2f} minutes")
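For reference, the same per-IP session duration can also be written in explicit MapReduce style with reduceByKey; the sketch below assumes the parsed_df produced by the code above and keeps only the earliest and latest timestamp seen for each IP.

# Map each row to (ip, (timestamp, timestamp)), then reduce to (min, max) per IP
pairs = parsed_df.rdd.map(lambda row: (row["ip_address"], (row["timestamp"], row["timestamp"])))
min_max = pairs.reduceByKey(lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
durations = min_max.mapValues(lambda mm: (mm[1] - mm[0]).total_seconds())
# Top 5 IPs by session duration (in seconds)
print(durations.takeOrdered(5, key=lambda kv: -kv[1]))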
PART-B
The code reads a log file, extracts IP addresses and timestamps, counts the
occurrences of each IP address, sorts them by frequency, and displays the top IPs by
the number of occurrences. The result can also be saved to a CSV file if needed.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
import re

spark = SparkSession.builder.appName("LogAnalyzer").getOrCreate()
# Load the raw access log (one log line per row, in column _c0)
log_df = spark.read.format("csv").option("header", "false").load(
    "dbfs:/FileStore/shared_uploads/[email protected]/access_log_short_file1.csv")

# Parse an Apache-style log line into (ip_address, timestamp string)
def parse_log_line(line):
    match = re.match(r'^(\S+) \S+ \S+ \[([^\]]+)\]', line or "")
    if match:
        ip_address = match.group(1)
        timestamp_str = match.group(2).split(" ")[0]
        return ip_address, timestamp_str
    return None, None

def extract_ip(line):
    ip, _ = parse_log_line(line)
    return ip

def extract_timestamp(line):
    _, ts = parse_log_line(line)
    return ts

extract_ip_udf = udf(extract_ip, StringType())
extract_timestamp_udf = udf(extract_timestamp, StringType())
parsed_df = log_df.withColumn("ip_address", extract_ip_udf(col("_c0"))) \
    .withColumn("timestamp", extract_timestamp_udf(col("_c0")))

# Keep only rows where an IP address was successfully parsed
parsed_df = parsed_df.filter(col("ip_address").isNotNull())

# Count occurrences of each IP address and sort by frequency
ip_frequency_df = parsed_df.groupBy("ip_address").count()
sorted_ip_frequency_df = ip_frequency_df.orderBy(col("count").desc())
sorted_ip_frequency_df.show(truncate=False)

# Save the result to a CSV file in DBFS (uncomment the line below to enable saving)
# sorted_ip_frequency_df.write.csv("dbfs:/path/to/output/ip_frequency.csv", header=True)
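The same frequency count can also be expressed as an explicit map/reduce over the RDD; the sketch below assumes the parsed_df produced by the code above.

# Map step: emit (ip, 1) for every parsed line; reduce step: sum the counts per IP
ip_counts = parsed_df.rdd.map(lambda row: (row["ip_address"], 1)) \
    .reduceByKey(lambda a, b: a + b)
# Top 10 IPs by number of occurrences
print(ip_counts.takeOrdered(10, key=lambda kv: -kv[1]))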
Output:
PART-A
PART-B