
Ass 3 - Best

The document outlines an assignment for the Software Re-engineering course at Sir Syed University, focusing on data cleaning and preprocessing using a cricket dataset. Students are required to implement a cleaning process, generate code to display the original and cleaned datasets, and summarize the changes made. The assignment emphasizes hands-on experience with real-world data and is due on January 14, 2025.

Uploaded by

Bushra Shahzad

Software Re-engineering (SWE-417T) SSUET/QR/114

Sir Syed University of Engineering & Technology


(SSUET)

Software Engineering Department

Course Name: Software Re-engineering (SWE-417T)


Semester: 7th

Batch: 2021F
Section: C

ASSIGNMENT#03

Submitted by:
Areeba Fatima (2021F-BSE-139)
Manal Nasir (2021F-BSE-153)
SM Umer Naqvi (2021F-BSE-362)
Abdul Moiz Hasan (2021F-BSE-149)


Department: Software Engineering Program: BS (SE)

Sr. No | Course Learning Outcomes | PLOs | Blooms Taxonomy
CLO_3 | Set to perform complex design re-engineering and reverse engineering problems | PLO_4 (Design/Development of solution) | C6 (Create)

Assignment Guidelines
• This is a group-based assignment with a maximum of 4 members.
• You are required to answer all questions in detail with references. Consider books and the Internet as reference material.
• Submission will be on VLE / hardcopy.
• Any answers that are copied from another group will automatically receive a zero mark.

Submission date 14-01-2025

Objective
This assignment aims to provide hands-on experience in data cleaning and preprocessing. You will
work with a real-world dataset to identify, clean, and prepare data for analysis.

Question# 1: Prepare practical implementation of Data cleaning & Preprocessing:


a) Select a dataset from a reliable resource (e.g., Kaggle, GitHub) and extract a subset of
50–100 instances to work with. Explain your choice of dataset and the problems you
expect to solve in the data.

We extracted an unorganized dataset of 77 records of players' batting careers from the website below and converted it into a CSV file for reformatting:


https://www.espncricinfo.com/records/highest-career-batting-average-282910

Why This Dataset Was Chosen:

This dataset captures information about players with the highest career batting averages in
cricket. It is an excellent choice for data analysis for several reasons:

1. Relevance and Interest: Cricket is a globally popular sport, and analyzing data from it
offers insights into performance trends.
2. Compact Yet Informative: The dataset contains concise records with meaningful
attributes like player name, batting average, career span, matches, and runs scored,
making it manageable for analysis while still being rich in information.
3. Potential for Insights: The data can reveal patterns in player performance, historical
trends, and the evolution of the game.

Problems Expected to Solve:

By analyzing this dataset, several questions and problems can be addressed, including:

1. Top Performers: Identify players with exceptional batting averages and understand
what factors contribute to their success.
2. Career Longevity vs. Performance: Examine the relationship between the length of a
player's career and their batting average.
3. Era Comparisons: Explore how batting averages have changed across different
cricketing eras.
4. Consistency Analysis: Identify players with a high number of matches and consistent
averages over time.
5. Outlier Detection: Detect players with unusually high or low averages and investigate
contributing factors.
6. Impact of Matches Played: Analyze whether a player's batting average tends to
decline with an increase in the number of matches played.
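As a minimal sketch of how question 2 (career longevity vs. performance) might be approached, the snippet below derives a career length from a `Span` value and correlates it with `Ave`. It assumes `Span` is formatted as `start-end` years (for example `1928-1948`); the helper name `career_length` and the sample rows are made up for illustration and are not taken from the dataset.

```python
import pandas as pd


def career_length(span: str) -> int:
    """Return career length in years from a 'YYYY-YYYY' string (assumed format)."""
    start, end = span.split("-")
    return int(end) - int(start)


# Hypothetical sample rows, NOT real records from the dataset
sample = pd.DataFrame({
    "Span": ["2000-2010", "1995-2000", "2005-2020", "1990-1992"],
    "Ave": [50.0, 45.0, 55.0, 40.0],
})
sample["CareerYears"] = sample["Span"].apply(career_length)

# Pearson correlation between career length and batting average
corr = sample["CareerYears"].corr(sample["Ave"])
print(sample)
print("Correlation between career length and average:", round(corr, 3))
```

A correlation computed this way only describes the rows it is given; on the real data it would need to be run before the `Span` column is dropped during cleaning.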


b) Generate a code of cleaning process which displays result of cleaned data in source code
using (python, java etc). Ensure your code performs the following:
• Implements all necessary cleaning steps.
• Displays the original dataset before cleaning and the cleaned dataset afterward.
• Outputs a summary of the changes made (e.g., number of missing values filled,
rows removed).
# Import libraries
import pandas as pd
import numpy as np

# Import dataset
df = pd.read_csv('cricket_dataset.csv')

# Display the original dataset
print("Original Dataset:")
print(df)

# Summary of original data
original_shape = df.shape
print("\nOriginal Shape:", original_shape)

# Step 1: Identify missing values
missing_values = df.isnull().sum()
print("\nMissing Values Before Cleaning:")
print(missing_values)

# Step 2: Handle missing values
# Here we will drop rows with missing values for simplicity
df = df.dropna()

# Summary of changes
missing_values_after = original_shape[0] - df.shape[0]
print("\nNumber of Rows Removed Due to Missing Values:", missing_values_after)

# Step 3: Remove duplicates
duplicates_before = df.duplicated().sum()
df = df.drop_duplicates()
duplicates_after = df.duplicated().sum()
print("\nDuplicates Removed: Before =", duplicates_before, ", After =", duplicates_after)

# Step 4: Handle outliers using the Interquartile Range (IQR)
numerical_columns = ['Mat', 'Inns', 'NO', 'Runs', 'HS', 'Ave', 'BF', 'SR', '100', '50', '0', '4s', '6s']
# Coerce non-numeric entries to NaN so that quantiles can be computed
df[numerical_columns] = df[numerical_columns].apply(pd.to_numeric, errors='coerce')
q1 = df[numerical_columns].quantile(0.25)
q3 = df[numerical_columns].quantile(0.75)
iqr = q3 - q1
outliers_before = df.shape[0]
# Drop any row with at least one value outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
df = df[~((df[numerical_columns] < (q1 - 1.5 * iqr)) | (df[numerical_columns] > (q3 + 1.5 * iqr))).any(axis=1)]
outliers_after = df.shape[0]
print("\nNumber of Outliers Removed:", outliers_before - outliers_after)

# Step 5: Check for inconsistencies (example: replace incorrect values)
# Assuming we want to standardize the 'HS' column (highest score) to remove any non-numeric values
df['HS'] = df['HS'].replace({'-': np.nan})  # Replace '-' with NaN
df['HS'] = pd.to_numeric(df['HS'], errors='coerce')  # Convert to numeric, coercing errors to NaN
df = df.dropna(subset=['HS'])  # Drop rows where 'HS' is NaN after conversion

# Step 6: Remove unnecessary columns (if any)
df = df.drop(['Span'], axis=1)

# Step 7: Check data type inconsistencies
print("\nData Types Before Conversion:")
print(df.dtypes)

# Convert columns to the correct data type if necessary
df['Runs'] = df['Runs'].astype(int)

# Display the cleaned dataset
print("\nCleaned Dataset:")
print(df)

# Summary of the cleaned data
print("\nCleaned Shape:", df.shape)
print("\nSummary of Changes Made:")
# Rows with missing values were dropped in Step 2, not filled
print("Rows Removed Due to Missing Values:", missing_values_after)
print("Duplicates Removed:", duplicates_before - duplicates_after)
print("Outliers Removed:", outliers_before - outliers_after)
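
To make Step 4 concrete, here is a small, self-contained sketch of the 1.5 × IQR rule applied to a single column. The helper name `iqr_bounds` and the toy values are illustrative assumptions; they are not part of the submitted script or the cricket dataset.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy values (made up); 100 is deliberately far from the rest
toy = pd.Series([10, 12, 14, 16, 18, 100])
lower, upper = iqr_bounds(toy)

# Keep only the values that lie inside the fences
kept = toy[(toy >= lower) & (toy <= upper)]
print("Fences:", lower, upper)
print("Kept:", kept.tolist())
print("Removed:", len(toy) - len(kept))
```

Because the script above applies this rule to all numeric columns at once and drops a row if any single column is flagged, it may remove more rows than a per-column check like this one would suggest.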


c) Generate output of cleaning process using any Tool OpenRefine, Trifacta Wrangler,
Winpure Clean & Match etc OR any Online Tool.


STEP 1: IDENTIFY MISSING VALUES

STEP 2: HANDLE MISSING VALUES


SUMMARY OF CHANGES


STEP 3: REMOVE DUPLICATES

STEP 4: HANDLE OUTLIERS USING THE INTERQUARTILE RANGE (IQR)

STEP 5: CHECK FOR INCONSISTENCIES (EXAMPLE: REPLACE INCORRECT VALUES)


STEP 6: REMOVE UNNECESSARY COLUMNS (IF ANY)

STEP 7: CHECK DATA TYPE INCONSISTENCIES


CONVERT COLUMNS TO THE CORRECT DATA TYPE IF NECESSARY


DISPLAY THE CLEANED DATASET


SUMMARY OF THE CLEANED DATA
