Individual Coursework (Replacing In-Class Test) : Big Data (6CS030)

The document summarizes issues found in sample employee and department data files, including missing values, outliers, and non-standard date formats. It suggests approaches to address the problems, such as replacing missing commission percentages with 0, setting Steven King's missing manager ID to NULL, identifying Kemberely Grant's correct department ID, and standardizing date formats. Evidence of the issues is demonstrated through data visualization and SQL queries showing missing fields and counts.

Uploaded by

Robin K.C.

Big Data

(6CS030)

A2: Individual Coursework


(Replacing In-Class Test)

Student Id :
Student Name : Robin KC
Cohort/Batch :4
Submitted to :
Submitted on : <dd-mm-yy>
1. Report
2. Sample data

2.1 Document the problems in the data


The given datasets are employees.csv and department.csv.
The following issues were found in these datasets.
2.1.1 Missing Value
Most employees have no value for COMMISION_PCT, as shown below.

The employee ‘Steven King’ has no MANAGER_ID, as shown in the following
figure.

The ‘DEPT_ID’ of Kemberely Grant is also missing.

2.1.2 Outliers
An outlier is a value that falls outside the expected range for a given field.
This type of data problem is found in the ‘SALARY’ field.

Here, Sigal Tobias has a salary of 128000, which is out of range for JOB_ID
‘PU_CLERK’.
The same problem can also be seen in the ‘COMMISION_PCT’ field.
In COMMISION_PCT, most of the values lie in the range 0 to 1, so
the values 8500 and 150 are treated as outliers.
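
The range check described above can be sketched in a few lines of Python. This is only an illustration: the list of values is made up for the example, apart from 8500 and 150, which are the two outliers reported above, and the helper name `out_of_range` is hypothetical.

```python
# Illustrative COMMISION_PCT values; only 8500 and 150 come from the dataset
# described in this report, the rest are placeholders.
commission_values = [0.1, 0.25, 8500, 0.3, 150, 0.15]


def out_of_range(values, low=0.0, high=1.0):
    """Return the values that fall outside the closed interval [low, high]."""
    return [v for v in values if not (low <= v <= high)]


commission_outliers = out_of_range(commission_values)
print(commission_outliers)
```

The 0 to 1 bounds follow the range stated above. A similar check for SALARY would need a sensible range for each JOB_ID.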

2.1.3 Non-standard date format

In the ‘HIREDATE’ field, the hire dates of different employees are stored in different formats,
as shown below. This is also treated as a problem, and a single
standard date format should be used instead.
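
One possible way to bring mixed date strings into a single format is to try a list of known formats and write every date as YYYY-MM-DD. The sketch below assumes this approach. The formats in `CANDIDATE_FORMATS` are assumptions, since the actual formats in HIREDATE are only visible in the figure, and the function name `to_iso_date` is hypothetical.

```python
from datetime import datetime

# Assumed candidate formats; the real list depends on what HIREDATE contains.
# Ambiguous pairs such as day/month versus month/day cannot be told apart
# automatically, so only one of them is listed here.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d-%b-%y", "%d-%b-%Y"]


def to_iso_date(text):
    """Try each candidate format in turn and return the date as YYYY-MM-DD."""
    cleaned = text.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue  # this format did not match, try the next one
    raise ValueError(f"Unrecognised date format: {text!r}")
```

For example, `to_iso_date("17/06/1987")` and `to_iso_date("17-Jun-87")` would both give the same standard string, while a value that matches none of the formats raises an error so that it can be checked by hand.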

2.2 Suggest what can be done to address the problems identified


• The empty values in ‘COMMISION_PCT’ can be replaced with 0
(zero).
• The missing ‘MANAGER_ID’ value for Steven King can be set to NULL
instead of being left empty.
• By comparing DEPT_ID between the employees.csv and department.csv files,
the actual DEPT_ID of Kemberely Grant can be identified, which is 80.
• Outliers can be handled by replacing them with the average value for the same JOB_ID.
• The different date formats can be converted to a single standard date format.
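
The missing-value and outlier suggestions above could be sketched as follows. This is a minimal illustration, not the cleaning procedure actually used. The rows are made up, except for the 128000 salary of Sigal Tobias, and the field and function names are hypothetical. One assumption is added here: the average for a JOB_ID is taken over the other employees with that job, so that the outlier does not distort its own replacement.

```python
from statistics import mean

# Illustrative rows; only Sigal Tobias's salary of 128000 is from the report.
employees = [
    {"name": "Employee A", "job_id": "PU_CLERK", "salary": 3000,
     "commission_pct": "", "manager_id": "114"},
    {"name": "Employee B", "job_id": "PU_CLERK", "salary": 2600,
     "commission_pct": "", "manager_id": "114"},
    {"name": "Sigal Tobias", "job_id": "PU_CLERK", "salary": 128000,
     "commission_pct": "", "manager_id": "114"},
    {"name": "Steven King", "job_id": "AD_PRES", "salary": 24000,
     "commission_pct": "", "manager_id": ""},
]


def fill_missing(rows):
    """Replace empty commission with 0 and empty manager ID with None (NULL)."""
    for row in rows:
        raw = row["commission_pct"]
        row["commission_pct"] = 0.0 if raw == "" else float(raw)
        if row["manager_id"] == "":
            row["manager_id"] = None


def replace_salary_outlier(rows, name):
    """Set the named employee's salary to the mean salary of same-job peers."""
    target = next(r for r in rows if r["name"] == name)
    peers = [r["salary"] for r in rows
             if r["job_id"] == target["job_id"] and r is not target]
    target["salary"] = mean(peers)


fill_missing(employees)
replace_salary_outlier(employees, "Sigal Tobias")
```

With these sample rows, the salary of Sigal Tobias becomes the average of the two other PU_CLERK salaries, and Steven King's MANAGER_ID is stored as `None`, which a database driver would write as NULL.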

3. Evidence

Evidence for the issues in the given datasets can be provided by using
data visualization techniques and SQL queries.

3.1 Data Visualization


The missing values in the ‘COMMISION_PCT’ field are shown using a line graph.
The outliers present in the ‘SALARY’ field are demonstrated using a column chart.

The missing ‘DEPT_ID’ of Kemberely Grant is shown below.


3.2 Some SQL queries and results
3.2.1 Missing value in ‘DEPT_ID’

Output:

3.2.2 Missing value in ‘MANAGER_ID’

Output:
3.2.3 Total number of missing values in ‘COMMISSION_PCT’

Output:
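
The three checks in this section can be reproduced with queries along the following lines. This is a sketch and not the queries originally run for this report. It uses Python's built-in `sqlite3` module with an in-memory table, the rows are illustrative, the columns EMPLOYEE_ID, FIRST_NAME and LAST_NAME are assumed, and the commission column is spelled COMMISION_PCT as in the earlier sections. It also assumes that empty CSV fields were loaded as NULL rather than as empty strings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE employees (
           EMPLOYEE_ID INTEGER, FIRST_NAME TEXT, LAST_NAME TEXT,
           COMMISION_PCT REAL, MANAGER_ID INTEGER, DEPT_ID INTEGER)"""
)

# Illustrative rows only; the IDs and numeric values are placeholders.
sample_rows = [
    (1, "Steven", "King", None, None, 90),
    (2, "Kemberely", "Grant", 0.15, 149, None),
    (3, "Sigal", "Tobias", None, 114, 30),
]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?)", sample_rows)

# 3.2.1 Employees with a missing DEPT_ID
missing_dept = conn.execute(
    "SELECT FIRST_NAME, LAST_NAME FROM employees WHERE DEPT_ID IS NULL"
).fetchall()

# 3.2.2 Employees with a missing MANAGER_ID
missing_manager = conn.execute(
    "SELECT FIRST_NAME, LAST_NAME FROM employees WHERE MANAGER_ID IS NULL"
).fetchall()

# 3.2.3 Total number of missing commission values
missing_commission_count = conn.execute(
    "SELECT COUNT(*) FROM employees WHERE COMMISION_PCT IS NULL"
).fetchone()[0]

print(missing_dept, missing_manager, missing_commission_count)
```

If the empty fields were loaded as empty strings instead, each condition would need an extra test such as `OR DEPT_ID = ''`.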
