Ass 3 - Best
Batch: 2021F
Section: C
ASSIGNMENT#03
Submitted by:
Areeba Fatima (2021F-BSE-139)
Manal Nasir (2021F-BSE-153)
SM Umer Naqvi (2021F-BSE-362)
Abdul Moiz Hasan (2021F-BSE-149)
Software Re-engineering (SWE-417T) SSUET/QR/114
Objective
This assignment aims to provide hands-on experience in data cleaning and preprocessing. You will
work with a real-world dataset to identify, clean, and prepare data for analysis.
https://www.espncricinfo.com/records/highest-career-batting-average-282910
This dataset captures information about players with the highest career batting averages in
cricket. It is an excellent choice for data analysis for several reasons:
1. Relevance and Interest: Cricket is a globally popular sport, and analyzing data from it
offers insights into performance trends.
2. Compact Yet Informative: The dataset contains concise records with meaningful
attributes like player name, batting average, career span, matches, and runs scored,
making it manageable for analysis while still being rich in information.
3. Potential for Insights: The data can reveal patterns in player performance, historical
trends, and the evolution of the game.
By analyzing this dataset, several questions and problems can be addressed, including:
1. Top Performers: Identify players with exceptional batting averages and understand
what factors contribute to their success.
2. Career Longevity vs. Performance: Examine the relationship between the length of a
player's career and their batting average.
3. Era Comparisons: Explore how batting averages have changed across different
cricketing eras.
4. Consistency Analysis: Identify players with a high number of matches and consistent
averages over time.
5. Outlier Detection: Detect players with unusually high or low averages and investigate
contributing factors.
6. Impact of Matches Played: Analyze whether a player's batting average tends to
decline with an increase in the number of matches played.
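Several of these questions (career longevity vs. performance, era comparisons, impact of matches played) reduce to checking how batting average varies with another column. The following is a minimal sketch of that idea, not the analysis performed in this assignment. The column names `Player`, `Span`, `Mat`, and `Ave` are assumptions about the dataset's headers, and the rows are illustrative placeholders rather than real records.

```python
import pandas as pd

# Illustrative placeholder rows; NOT taken from the real dataset.
# Column names (Player, Span, Mat, Ave) are assumed, not confirmed.
df = pd.DataFrame({
    "Player": ["Player A", "Player B", "Player C", "Player D"],
    "Span": ["1950-1962", "1990-2005", "2010-2022", "1975-1980"],
    "Mat": [50, 120, 90, 20],
    "Ave": [61.0, 55.0, 58.5, 48.2],
})

# Split "start-end" into two integer year columns and derive career length.
years = df["Span"].str.split("-", expand=True).astype(int)
df["career_years"] = years[1] - years[0]

# Pearson correlation of batting average with career length and matches played.
corr_longevity = df["Ave"].corr(df["career_years"])
corr_matches = df["Ave"].corr(df["Mat"])

# Era comparison: mean batting average grouped by the decade a career started.
df["start_decade"] = (years[0] // 10) * 10
era_means = df.groupby("start_decade")["Ave"].mean()

print(df[["Player", "career_years", "Mat", "Ave"]])
print("Correlation (Ave vs. career length):", round(corr_longevity, 3))
print("Correlation (Ave vs. matches played):", round(corr_matches, 3))
print(era_means)
```

A correlation close to zero would suggest that average does not move much with career length or matches played; with so few rows, any value here is only a demonstration of the mechanics.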
b) Write code for the cleaning process, in a language such as Python or Java, that displays
the cleaned data. Ensure your code performs the following:
• Implements all necessary cleaning steps.
• Displays the original dataset before cleaning and the cleaned dataset afterward.
• Outputs a summary of the changes made (e.g., number of missing values filled,
rows removed).
# Import libraries
import pandas as pd
import numpy as np
# Import dataset
df = pd.read_csv('cricket_dataset.csv')
original_shape = df.shape  # record the dataset size before cleaning
# Remove rows with missing values
df = df.dropna()
# Summary of changes
rows_removed = original_shape[0] - df.shape[0]
print("\nNumber of Rows Removed Due to Missing Values:", rows_removed)
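The task also asks for the dataset to be shown before and after cleaning, together with a summary of changes. One way this could be sketched is shown below. It is an illustration under stated assumptions, not the code used in this assignment: the column names `Player`, `Mat`, and `Ave` are assumed, the rows are placeholders standing in for the CSV, and filling numeric gaps with the column median is one possible choice among several.

```python
import numpy as np
import pandas as pd

# Placeholder data standing in for the CSV; column names are assumptions.
raw = pd.DataFrame({
    "Player": ["Player A", "Player B", None, "Player D"],
    "Mat": [50, np.nan, 90, 20],
    "Ave": [61.0, 55.0, 58.5, np.nan],
})

print("Original dataset:")
print(raw)

rows_before = raw.shape[0]

# A row without a player name cannot be repaired, so it is dropped.
cleaned = raw.dropna(subset=["Player"]).copy()

# Count the numeric gaps that remain, then fill them with each column's median.
numeric_cols = ["Mat", "Ave"]
filled = int(cleaned[numeric_cols].isna().sum().sum())
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

summary = {
    "rows_removed": rows_before - cleaned.shape[0],
    "missing_values_filled": filled,
    "missing_values_remaining": int(cleaned.isna().sum().sum()),
}

print("\nCleaned dataset:")
print(cleaned)
print("\nSummary of changes:", summary)
```

Counting the gaps before filling them is what makes the summary possible; once `fillna` has run, the information about how many values were missing is gone.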
'6s']
df[numerical_columns] = df[numerical_columns].apply(pd.to_numeric, errors='coerce')
q1 = df[numerical_columns].quantile(0.25)
q3 = df[numerical_columns].quantile(0.75)
iqr = q3 - q1
outliers_before = df.shape[0]
df = df[~((df[numerical_columns] < (q1 - 1.5 * iqr)) | (df[numerical_columns] > (q3 + 1.5 * iqr))).any(axis=1)]
outliers_after = df.shape[0]
print("\nNumber of Outliers Removed:", outliers_before - outliers_after)
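The outlier step above applies the interquartile range (IQR) rule: a row is dropped if any numeric value falls below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The same rule can be sketched as a small reusable function. The function name `remove_iqr_outliers`, the column names, and the sample values are hypothetical and used only for illustration.

```python
import pandas as pd

def remove_iqr_outliers(frame, columns, k=1.5):
    """Return (filtered_frame, rows_removed) using the k * IQR fence rule."""
    q1 = frame[columns].quantile(0.25)
    q3 = frame[columns].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # True wherever a value lies outside its column's fences.
    outside = (frame[columns] < lower) | (frame[columns] > upper)
    # Keep only rows where no checked column is outside the fences.
    kept = frame[~outside.any(axis=1)]
    return kept, len(frame) - len(kept)

# Placeholder values; the last "Ave" entry is deliberately extreme.
data = pd.DataFrame({
    "Ave": [50.0, 52.0, 51.0, 53.0, 49.0, 120.0],
    "Mat": [40, 42, 41, 43, 39, 44],
})

kept, removed = remove_iqr_outliers(data, ["Ave", "Mat"])
print(kept)
print("Number of Outliers Removed:", removed)
```

Comparisons against `NaN` evaluate to false, so values that `pd.to_numeric(..., errors='coerce')` turned into `NaN` pass through this filter untouched and need to be handled separately.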
c) Generate the output of the cleaning process using a tool such as OpenRefine, Trifacta
Wrangler, or WinPure Clean & Match, or any online tool.
SUMMARY OF CHANGES