Ass 3 - Best

The document outlines an assignment for the Software Re-engineering course at Sir Syed University, focusing on data cleaning and preprocessing using a cricket dataset. Students are required to implement a cleaning process, generate code to display the original and cleaned datasets, and summarize the changes made. The assignment emphasizes hands-on experience with real-world data and is due on January 14, 2025.

Uploaded by

Bushra Shahzad

Software Re-engineering (SWE-417T) SSUET/QR/114

Sir Syed University of Engineering & Technology

(SSUET)

Software Engineering Department

Course Name: Software Re-engineering (SWE-417T)

Semester: 7th

Batch: 2021F
Section: C

ASSIGNMENT#03

Submitted by:
Areeba Fatima (2021F-BSE-139)
Manal Nasir (2021F-BSE-153)
SM Umer Naqvi (2021F-BSE-362)
Abdul Moiz Hasan (2021F-BSE-149)


Department: Software Engineering Program: BS (SE)

Sr. No: CLO_3
Course Learning Outcome: Set to perform complex design re-engineering and reverse engineering problems
PLOs: PLO_4 (Design/Development of solution)
Bloom's Taxonomy: C6 (Create)

Assignment Guidelines
• This is a group-based assignment with a maximum of 4 members.
• You are required to answer all questions in detail with references. Consider books and the Internet as reference material.
• Submission will be on VLE / hard copy.
• Any answers that are copied from another group will automatically receive a zero mark.

Submission date 14-01-2025

Objective
This assignment aims to provide hands-on experience in data cleaning and preprocessing. You will
work with a real-world dataset to identify, clean, and prepare data for analysis.

Question# 1: Prepare practical implementation of Data cleaning & Preprocessing:

a) Select a dataset from a reliable resource (e.g., Kaggle, GitHub) and extract a subset of
50–100 instances to work with. Explain your choice of dataset and the problems you
expect to solve in the data.

Extracting an unorganized dataset containing 77 records of players' batting careers from a website and converting it into a CSV file for reformatting:

https://www.espncricinfo.com/records/highest-career-batting-average-282910

Why This Dataset Was Chosen:

This dataset captures information about players with the highest career batting averages in
cricket. It is an excellent choice for data analysis for several reasons:

1. Relevance and Interest: Cricket is a globally popular sport, and analyzing data from it
offers insights into performance trends.
2. Compact Yet Informative: The dataset contains concise records with meaningful
attributes like player name, batting average, career span, matches, and runs scored,
making it manageable for analysis while still being rich in information.
3. Potential for Insights: The data can reveal patterns in player performance, historical
trends, and the evolution of the game.

Problems Expected to Solve:

By analyzing this dataset, several questions and problems can be addressed, including:

1. Top Performers: Identify players with exceptional batting averages and understand
what factors contribute to their success.
2. Career Longevity vs. Performance: Examine the relationship between the length of a
player's career and their batting average.
3. Era Comparisons: Explore how batting averages have changed across different
cricketing eras.
4. Consistency Analysis: Identify players with a high number of matches and consistent
averages over time.
5. Outlier Detection: Detect players with unusually high or low averages and investigate
contributing factors.
6. Impact of Matches Played: Analyze whether a player's batting average tends to
decline with an increase in the number of matches played.
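
As a rough illustration of item 2 (career longevity vs. performance), the sketch below derives a career length from a span string and correlates it with the batting average. The 'YYYY-YYYY' span format, the helper name `career_length`, and the sample rows are assumptions made for illustration; they are not taken from the dataset.

```python
import pandas as pd


def career_length(span: str) -> int:
    """Return the number of years covered by a 'YYYY-YYYY' span string."""
    start, end = span.split("-")
    return int(end) - int(start)


# Hypothetical sample rows, not real player records
sample = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Span": ["1990-2000", "2001-2006", "1985-2005", "2010-2013"],
    "Ave": [45.0, 38.5, 52.0, 35.0],
})

# Derive the career length in years from the span string
sample["CareerYears"] = sample["Span"].apply(career_length)

# Pearson correlation between career length and batting average
correlation = sample["CareerYears"].corr(sample["Ave"])

print(sample[["Player", "CareerYears", "Ave"]])
print("Correlation:", round(correlation, 3))
```

A correlation computed this way only describes the sample at hand; with a list restricted to the highest career averages, it should not be read as a general statement about all players.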


b) Write code for the cleaning process (in Python, Java, etc.) that displays the cleaned data. Ensure your code performs the following:
• Implements all necessary cleaning steps.
• Displays the original dataset before cleaning and the cleaned dataset afterward.
• Outputs a summary of the changes made (e.g., number of missing values filled, rows removed).
# Import libraries
import pandas as pd
import numpy as np

# Import dataset
df = pd.read_csv('cricket_dataset.csv')

# Display the original dataset
print("Original Dataset:")
print(df)

# Summary of original data
original_shape = df.shape
print("\nOriginal Shape:", original_shape)

# Step 1: Identify missing values

missing_values = df.isnull().sum()
print("\nMissing Values Before Cleaning:")
print(missing_values)

# Step 2: Handle missing values

# Here we will drop rows with missing values for simplicity
df = df.dropna()

# Summary of changes (this variable holds the number of rows dropped)
missing_values_after = original_shape[0] - df.shape[0]
print("\nNumber of Rows Removed Due to Missing Values:", missing_values_after)

# Step 3: Remove duplicates

duplicates_before = df.duplicated().sum()
df = df.drop_duplicates()
duplicates_after = df.duplicated().sum()
print("\nDuplicates Removed: Before =", duplicates_before, ", After =", duplicates_after)

# Step 4: Handle outliers using the Interquartile Range (IQR)

numerical_columns = ['Mat', 'Inns', 'NO', 'Runs', 'HS', 'Ave', 'BF', 'SR', '100', '50', '0', '4s', '6s']
# Convert to numeric; entries that cannot be parsed become NaN
df[numerical_columns] = df[numerical_columns].apply(pd.to_numeric, errors='coerce')
q1 = df[numerical_columns].quantile(0.25)
q3 = df[numerical_columns].quantile(0.75)
iqr = q3 - q1
outliers_before = df.shape[0]
# Keep only rows where no numeric column falls outside [q1 - 1.5*IQR, q3 + 1.5*IQR]
df = df[~((df[numerical_columns] < (q1 - 1.5 * iqr)) | (df[numerical_columns] > (q3 + 1.5 * iqr))).any(axis=1)]
outliers_after = df.shape[0]
print("\nNumber of Outliers Removed:", outliers_before - outliers_after)

# Step 5: Check for inconsistencies (example: replace incorrect values)

# Assuming we want to standardize the 'HS' column (highest score) to remove any non-numeric values
df['HS'] = df['HS'].replace({'-': np.nan})  # Replace '-' with NaN
df['HS'] = pd.to_numeric(df['HS'], errors='coerce')  # Convert to numeric, coercing errors to NaN
df = df.dropna(subset=['HS'])  # Drop rows where 'HS' is NaN after conversion

# Step 6: Remove unnecessary columns (if any)

df = df.drop(['Span'], axis=1)

# Step 7: Check data type inconsistencies

print("\nData Types Before Conversion:")
print(df.dtypes)

# Convert columns to the correct data type if necessary
df['Runs'] = df['Runs'].astype(int)

# Display the cleaned dataset
print("\nCleaned Dataset:")
print(df)

# Summary of the cleaned data
print("\nCleaned Shape:", df.shape)
print("\nSummary of Changes Made:")
# Rows with missing values were dropped in Step 2, not filled
print("Rows Removed Due to Missing Values:", missing_values_after)
print("Duplicates Removed:", duplicates_before - duplicates_after)
print("Outliers Removed:", outliers_before - outliers_after)
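
The script above drops incomplete rows, whereas the assignment also mentions reporting the number of missing values filled. As one possible alternative, the sketch below fills gaps in numeric columns with the column median and counts how many cells were filled. The function name `fill_numeric_with_median` and the sample values are hypothetical, and the median is only one of several reasonable fill strategies.

```python
import numpy as np
import pandas as pd


def fill_numeric_with_median(frame: pd.DataFrame, columns: list[str]) -> tuple[pd.DataFrame, int]:
    """Fill NaN in the given numeric columns with each column's median.

    Returns the filled copy and the number of cells that were filled.
    """
    filled = frame.copy()
    missing_before = int(filled[columns].isnull().sum().sum())
    # fillna with a Series of medians fills each column with its own median
    filled[columns] = filled[columns].fillna(filled[columns].median())
    missing_after = int(filled[columns].isnull().sum().sum())
    return filled, missing_before - missing_after


# Hypothetical sample with one gap per column
sample = pd.DataFrame({
    "Runs": [5000.0, np.nan, 7000.0, 6000.0],
    "Ave": [50.0, 45.0, np.nan, 55.0],
})

filled, n_filled = fill_numeric_with_median(sample, ["Runs", "Ave"])
print(filled)
print("Missing values filled:", n_filled)
```

Filling keeps every record in a small subset of 50–100 rows, at the cost of introducing estimated values; dropping keeps only observed values but shrinks the dataset.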


c) Generate the output of the cleaning process using any tool (OpenRefine, Trifacta Wrangler, WinPure Clean & Match, etc.) or any online tool.


STEP 1: IDENTIFY MISSING VALUES

STEP 2: HANDLE MISSING VALUES


SUMMARY OF CHANGES


STEP 3: REMOVE DUPLICATES
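
For reference, a minimal pandas sketch of this step on hypothetical rows (the sample values are invented for illustration). `duplicated()` marks every repeat of an earlier identical row, and `drop_duplicates()` keeps the first occurrence by default.

```python
import pandas as pd

# Hypothetical sample: the third row repeats the first
sample = pd.DataFrame({
    "Player": ["A", "B", "A", "C"],
    "Runs": [5000, 4200, 5000, 3900],
})

# Count rows that repeat an earlier row exactly
duplicates_found = int(sample.duplicated().sum())

# Keep the first occurrence of each row and renumber the index
deduplicated = sample.drop_duplicates().reset_index(drop=True)

print("Duplicates found:", duplicates_found)
print(deduplicated)
```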

STEP 4: HANDLE OUTLIERS USING THE INTERQUARTILE RANGE (IQR)
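
The IQR rule used in this step can be sketched for a single column as follows. The helper name `iqr_bounds` and the sample averages are hypothetical; the multiplier 1.5 matches the value used in the cleaning script.

```python
import pandas as pd


def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical batting averages with one extreme value
averages = pd.Series([40.0, 42.0, 44.0, 46.0, 48.0, 99.0], name="Ave")

lower, upper = iqr_bounds(averages)
# A value is flagged when it lies outside the fences
is_outlier = (averages < lower) | (averages > upper)
kept = averages[~is_outlier]

print("Fences:", lower, upper)
print("Flagged:", averages[is_outlier].tolist())
```

In a dataset of the highest career averages, an extreme value may be a genuine record and not an error, so flagging such rows for review can be preferable to removing them outright.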

STEP 5: CHECK FOR INCONSISTENCIES (EXAMPLE: REPLACE INCORRECT VALUES)
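
If the highest-score column marks not-out innings with a trailing asterisk (for example `254*`), which is an assumption about the CSV and not something stated in this report, converting the column directly to numeric would turn those entries into NaN and the rows would then be dropped. The sketch below strips the marker before conversion; the function name `parse_highest_score` and the sample values are hypothetical.

```python
import pandas as pd


def parse_highest_score(values: pd.Series) -> pd.Series:
    """Convert highest-score strings to numbers, ignoring a trailing '*'.

    Entries that still cannot be parsed (such as '-') become NaN.
    """
    text = values.astype(str).str.strip()
    stripped = text.str.rstrip("*")
    return pd.to_numeric(stripped, errors="coerce")


# Hypothetical raw values, including a placeholder '-' and stray whitespace
raw_hs = pd.Series(["254*", "180", "-", " 99* "], name="HS")

hs_numeric = parse_highest_score(raw_hs)
# Optionally keep the not-out information as a separate boolean column
not_out = raw_hs.astype(str).str.strip().str.endswith("*")

print(hs_numeric)
print(not_out)
```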


STEP 6: REMOVE UNNECESSARY COLUMNS (IF ANY)

STEP 7: CHECK DATA TYPE INCONSISTENCIES


CONVERT COLUMNS TO THE CORRECT DATA TYPE IF NECESSARY


DISPLAY THE CLEANED DATASET


SUMMARY OF THE CLEANED DATA
