Data Scientist

Uploaded by

Trần Nguyễn Anh Quân


1. Understand the problem

Supervised learning (SL): labelled rows and columns (descriptive values), possible features
Types of output: classification (discrete) or regression (continuous)

Topic:
Information technology dominates the job market and has become a cornerstone of the
international economy, driving innovation and expanding workforce opportunities.
Nonetheless, salaries within the technology sector vary widely owing to a diverse set of
factors, including level of experience, geographical location, remote ratio, and company
size.
Problem Statement:
This project analyzes the salaries of technology professionals, expressed in USD, with
the aim of identifying the factors that influence compensation beyond specific job titles.
Although job role is one of the most popular and evident determinants of salary in the
technology industry, other critical variables also affect what employees are paid. Those
variables, i.e., experience level, location, remote ratio, and company size, are examined
here to provide a comprehensive view of salaries in the technology industry, measured in
USD.
Problem Resolution Strategy
To understand salary distributions, a structured and meticulous approach was adopted
that makes the aforementioned variables the focal points of the study and excludes job
titles and their variants. The methodology has two primary stages. The first consists of
data collection, data cleaning and interpretation, visual inspection of the data, and
handling of outliers. Once the data are prepared, a model is trained to predict salary in
USD from the variables mentioned above.
This study uses supervised learning, a type of machine learning in which a model is
trained on a labeled dataset. Here, the label is the correct answer that the model is
expected to predict, so each training example consists of an input and its corresponding
label (Sarker, 2021).
Framed as a supervised learning problem, the goal of the trained model is to predict
salary in USD (the label) from experience level, company location, remote ratio, and
company size (the features).
Data Access and Collection
The dataset was provided on Canvas, a digital platform serving academic purposes, in the
Assessment section of the COSC2968|COSC3053 Foundations of Artificial Intelligence for
STEM course, under option B. The dataset, titled “FOAI-assignment.csv”, contains data
from 2020 to 2023 on salaries, employment, demographic characteristics, and company
details within the data science domain.
Two software tools were used to examine the downloaded file for data cleaning and
interpretation: Jamovi, a statistical package, and SageMaker, a machine learning service
for analyzing data. The two were used side by side (Amazon Sagemaker, 2024; The jamovi
project, 2024).
Data Description
Salaries in USD. This is the dependent variable of the study, i.e., the label. It
represents an individual's annual compensation expressed in USD.
Experience Level. This attribute categorizes employees by professional experience:
entry-level, mid-level, senior-level, and executive-level.
Company Location. This attribute represents the geographical location of employment,
i.e., the country where the employer’s main facility or contracting department is
located. The dataset covers a wide range of countries, from Asian countries to various
European countries, which were later grouped into seven regions (Table 3).
Company Size. The size of the employing organization, defined by the average number of
people working for the company: S (fewer than 50 employees), M (50 to 250 employees),
and L (more than 250 employees).
Remote Ratio. This feature quantifies the extent to which a job can be performed remotely
rather than at a physical facility. In the dataset it takes three values: 0 means no
remote work (less than 20% of the work can be done remotely), 50 means a hybrid
arrangement (approximately 50% of the workload can be done remotely), and 100 means the
job can be performed from a remote location (more than 80% of the tasks can be done
off-site).
Initial Review

Table 1
Descriptive statistics for salary in USD

                      salary_in_usd
N                     1494
Missing               6
Mean                  130934
Median                130000
Standard deviation    66668
Minimum               5409
Maximum               450000

Note. The table shows descriptive statistics for salary in USD (n = 1494).

Table 1 summarizes salary in USD: 1,494 valid observations and 6 missing values. The
average salary within the sample is $130,934 (M = 130934, SD = 66668) and the median is
$130,000, with the most commonly reported salary being $100,000 (Mode = 100000). Salaries
range from $5,409 (Min = 5409) to $450,000 (Max = 450000).

Table 2
Frequencies of experience_level

experience_level    Counts    % of Total
EN                  167       11.1 %
EX                  58        3.9 %
MI                  353       23.5 %
SE                  922       61.5 %

Note. The table shows the frequencies of employees' experience levels (n = 1500).

Table 2 shows the distribution of employees across four experience levels: entry level
(EN), executive level (EX), mid-level (MI), and senior level (SE). The vast majority of
individuals fall in the SE category, accounting for 61.5% of the sample (n = 922). MI and
EN employees make up 23.5% and 11.1% respectively (n = 353; n = 167), while the smallest
group is EX employees, at only 3.9% (n = 58).
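As a quick check on the percentages in Table 2, the sketch below rebuilds a frequency table from the reported counts. The function name `frequency_table` is illustrative only and is not taken from Jamovi or SageMaker.

```python
from collections import Counter

def frequency_table(values):
    """Return {category: (count, percent of total)}, percent rounded to 1 decimal."""
    counts = Counter(values)
    total = sum(counts.values())
    return {cat: (n, round(100 * n / total, 1)) for cat, n in sorted(counts.items())}

# Rebuild the experience_level column from the counts reported in Table 2.
levels = ["EN"] * 167 + ["EX"] * 58 + ["MI"] * 353 + ["SE"] * 922
table = frequency_table(levels)

for category, (count, percent) in table.items():
    print(f"{category}: {count} ({percent} %)")
```

The same function could be applied to region, company_size, or remote_ratio to reproduce the other frequency tables.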

Table 3
Frequencies of region

region           Counts    % of Total
Africa           1         0.1 %
Asia             10        0.7 %
Europe           77        5.1 %
North America    1157      77.1 %
Oceania          6         0.4 %
Other            248       16.5 %
South America    1         0.1 %

Note. The table shows the distribution of employees across seven regions (n = 1500).

Unlike the other tables, this one is based on a transformed variable: the original
company location, recorded as country abbreviations, was recoded into the continent where
each country is situated and relabeled as region, making the data easier to process.
Table 3 presents the distribution of employees by region and shows that North America
has by far the largest working population, accounting for more than three quarters of
the sample (n = 1157).
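The recoding from country abbreviation to region can be sketched as a simple lookup. The mapping below is hypothetical and partial: the report does not list the codes it used, and treating unlisted codes as "Other" is an assumption about how that category arose.

```python
# Hypothetical, partial lookup from country code to region (not from the report).
COUNTRY_TO_REGION = {
    "US": "North America", "CA": "North America",
    "GB": "Europe", "DE": "Europe", "FR": "Europe",
    "IN": "Asia", "JP": "Asia",
    "AU": "Oceania",
    "NG": "Africa",
    "BR": "South America",
}

def to_region(country_code):
    """Map a country code to a region; unlisted codes fall back to 'Other' (assumption)."""
    return COUNTRY_TO_REGION.get(country_code.strip().upper(), "Other")

regions = [to_region(code) for code in ["US", "de", "BR", "ZZ"]]
print(regions)
```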

Table 4
Frequencies of company_size

company_size    Counts    % of Total
L               320       21.3 %
M               1073      71.5 %
S               107       7.1 %

Note. The table shows the distribution of organizational size (n = 1500).

Table 4 shows the distribution of the three company sizes: small, medium, and large.
Medium-sized enterprises dominate the dataset, accounting for 71.5% of the total
(n = 1073).

Table 5
Frequencies of remote_ratio

remote_ratio    Counts    % of Total
0               579       38.6 %
50              130       8.7 %
100             791       52.7 %

Note. 0 = no remote tasks, 50 = hybrid (about half of the tasks can be done remotely),
100 = more than 80% of the tasks can be done remotely.

Table 5 indicates that the majority of positions can be performed remotely (remote
ratio of 100), accounting for 52.7% of the total sample (n = 791), while 38.6% (n = 579)
require employees to work on-site at the physical facilities.

Data Cleaning
All variables were complete except the dependent variable, salary in USD, which had 6 missing
values (Table 1). These were likely caused by errors during data collection, so they were replaced
with the mean of the observed salaries (mean imputation).
The histogram with density curve (Figure 1) was positively skewed, and the Shapiro-Wilk test was
significant (p < .001), indicating that salaries were not normally distributed. The skew could be
attributed to numerous potential outliers. However, the dataset is large (1,500 individuals) and
the extreme values were not isolated anomalies: they appeared as clusters of very high and very
low salaries rather than as single stray points. Nevertheless, to make the data more suitable for
the predictive model, salaries were standardized and the nine most extreme observations, those
with an absolute Z-score above 3.29, were removed. This brought the distribution closer to normal,
although visual inspection showed that salaries remained widely spread.
Salary in USD was also log-transformed to compress its scale and shorten the values. The transform
was set up as a base-10 logarithm; note, however, that the coefficients reported in the model
below (an intercept near 9.45) are on the scale of a natural logarithm.
Finally, to raise R (the correlation between the predictors and the response) and R² (the
proportion of variance explained), company size was recoded from a categorical to a numeric
attribute, so that the larger the company (from S to M to L), the higher the code representing
its size.
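The cleaning steps described above can be sketched in a few lines of standard-library Python. This is an illustration on made-up salaries, not the Jamovi workflow itself: the function names, the toy values, and the 1/2/3 size codes are assumptions. The natural logarithm is used because the coefficient scale in the model output matches it; `math.log10` can be swapped in to follow the base-10 description literally.

```python
import math
import statistics

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]

def drop_extreme(values, z_cutoff=3.29):
    """Keep only values whose absolute z-score is at most z_cutoff."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) / sd <= z_cutoff]

# Assumed ordinal coding: larger company, larger code.
SIZE_CODE = {"S": 1, "M": 2, "L": 3}

# Toy salaries: 20 ordinary values, one missing value, one extreme value.
salaries = [100_000 + 2_000 * i for i in range(20)] + [None, 450_000]

filled = impute_mean(salaries)        # step 1: mean imputation
cleaned = drop_extreme(filled)        # step 2: remove |z| > 3.29
log_salary = [math.log(v) for v in cleaned]  # step 3: log transform

print(len(salaries), len(cleaned))
```

Note that imputing the mean leaves the sample mean unchanged but slightly shrinks the standard deviation, which is one reason the z-scores are computed after imputation in this sketch.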
The model’s performance was evaluated using its coefficients and standard linear regression
metrics, namely R, R-squared (R²), Root Mean Squared Error (RMSE), and the intercept, to assess
the goodness of fit of the trained model and the accuracy of its predictions.
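For reference, the three fit metrics can be computed from observed and predicted values as in the sketch below. The toy numbers are invented for illustration, and the RMSE here divides by n; some tools divide by the residual degrees of freedom instead, so small differences from reported values are possible.

```python
import math
import statistics

def fit_metrics(y_true, y_pred):
    """Return (R, R-squared, RMSE) for observed and predicted values."""
    n = len(y_true)
    mean_true = statistics.fmean(y_true)
    mean_pred = statistics.fmean(y_pred)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    # R: Pearson correlation between observed and predicted values.
    cov = sum((y - mean_true) * (p - mean_pred) for y, p in zip(y_true, y_pred))
    ss_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    r = cov / math.sqrt(ss_tot * ss_pred)
    r_squared = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    return r, r_squared, rmse

# Toy log-salaries (invented) and predictions from a hypothetical model.
y_true = [11.0, 11.5, 12.0, 12.5]
y_pred = [11.2, 11.4, 12.1, 12.3]
r, r_squared, rmse = fit_metrics(y_true, y_pred)
print(round(r, 3), round(r_squared, 3), round(rmse, 3))
```

For an ordinary least squares fit with an intercept, R² equals the square of R, which is consistent with the reported pair R = 0.707 and R² = 0.500.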

Model Fit Measures

Model    R        R²       RMSE
1        0.707    0.500    0.475

Note. Models estimated using sample size of N = 1500.

Model Coefficients - LOG(salary_in_usd (MeanReplacement))

Predictor                      Estimate    SE        t          p
Intercept ᵃ                     9.44555    0.4806    19.6524    < .001
company_size - Transform 1:
  2 – 3                        -0.01759    0.0327    -0.5386    0.590
  1 – 3                        -0.25418    0.0542    -4.6908    < .001
region:
  Asia – Africa                 0.95377    0.5031     1.8957    0.058
  Europe – Africa               1.06511    0.4831     2.2045    0.028
  North America – Africa        1.88007    0.4807     3.9108    < .001
  Oceania – Africa              1.85206    0.5172     3.5811    < .001
  Other – Africa                1.16998    0.4807     2.4337    0.015
  South America – Africa       -0.86143    0.6800    -1.2668    0.205
remote_ratio:
  50 – 0                       -0.00405    0.0525    -0.0772    0.938
  100 – 0                      -0.05662    0.0264    -2.1436    0.032
experience_level:
  EX – EN                       0.92175    0.0740    12.4483    < .001
  MI – EN                       0.36018    0.0454     7.9359    < .001
  SE – EN                       0.64048    0.0437    14.6490    < .001

ᵃ Represents reference level

The model coefficients table reports the influence of each predictor on log-transformed salary in
USD in the data science field. The multiple correlation coefficient (R = 0.707) indicates a strong
positive linear relationship between the values predicted from the independent variables (company
size, region, remote ratio, and experience level) and the observed log-transformed salary. The R²
of 0.500 means that the predictors explain 50% of the variance in log-transformed salary; the
other half remains unexplained, so the fit is moderate but acceptable.
When all predictors are at their reference levels (large company, Africa, entry-level experience,
and no remote work), the intercept, with an estimate of 9.44555, is the predicted log-transformed
salary. Its p-value of less than 0.001 shows that this baseline differs significantly from zero.
Salary is also affected by working fully remotely, which is associated with a small but
statistically significant decrease in log-transformed salary compared with a non-remote setup
(estimate = -0.05662, p = 0.032).
There are large regional variations, with several regions showing significantly higher salaries
than Africa. Specifically, working in North America (estimate = 1.88007, p < 0.001) and Oceania
(estimate = 1.85206, p < 0.001) has a significant positive relationship with salary, while working
in South America is associated with a non-significant decrease compared with Africa
(estimate = -0.86143, p = 0.205). Because Africa, the reference region, and South America each
contain a single observation (Table 3), these regional contrasts should be read with caution.
More workplace experience is associated with substantially higher salary. Employees with
executive-level experience were paid significantly more than entry-level employees (EN)
(estimate = 0.92175, p < 0.001), followed by senior-level (estimate = 0.64048, p < 0.001) and
mid-level employees (estimate = 0.36018, p < 0.001), each compared with entry-level.
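Because the response is log-transformed, each coefficient acts multiplicatively on salary. The sketch below converts estimates from the coefficients table into approximate percentage differences and combines them into a predicted salary for one profile. It rests on two assumptions: that the transform is a natural logarithm (which the coefficient scale suggests), and that company_size codes 1, 2, 3 stand for S, M, L. The function names are illustrative.

```python
import math

# Estimates copied from the model coefficients table; reference levels are 0.
# Assumption: company_size codes 1, 2, 3 correspond to S, M, L.
INTERCEPT = 9.44555
COEF = {
    "company_size": {"L": 0.0, "M": -0.01759, "S": -0.25418},
    "region": {
        "Africa": 0.0, "Asia": 0.95377, "Europe": 1.06511,
        "North America": 1.88007, "Oceania": 1.85206,
        "Other": 1.16998, "South America": -0.86143,
    },
    "remote_ratio": {0: 0.0, 50: -0.00405, 100: -0.05662},
    "experience_level": {"EN": 0.0, "EX": 0.92175, "MI": 0.36018, "SE": 0.64048},
}

def percent_change(estimate):
    """Percentage difference in salary implied by a natural-log coefficient."""
    return 100 * (math.exp(estimate) - 1)

def predict_salary(company_size, region, remote_ratio, experience_level):
    """Back-transform the linear predictor to USD (assumes natural log)."""
    log_salary = (
        INTERCEPT
        + COEF["company_size"][company_size]
        + COEF["region"][region]
        + COEF["remote_ratio"][remote_ratio]
        + COEF["experience_level"][experience_level]
    )
    return math.exp(log_salary)

ex_vs_en = percent_change(COEF["experience_level"]["EX"])
remote_vs_onsite = percent_change(COEF["remote_ratio"][100])
example = predict_salary("M", "North America", 100, "SE")

print(f"EX vs EN: {ex_vs_en:+.1f} %")
print(f"Fully remote vs on-site: {remote_vs_onsite:+.1f} %")
print(f"Senior, medium company, North America, remote: {example:,.0f} USD")
```

Exponentiating a prediction made on the log scale estimates a typical (roughly median) salary rather than the arithmetic mean, so the back-transformed figure tends to sit below the sample average.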
Recommendations:
To improve compensation practices, organizations are encouraged to adjust their salary scales so
that pay is both fair and competitive. This can be done by benchmarking, i.e., comparing salaries
for similar roles within the industry's job market. The comparison should cover not only the
typical income for the same roles in information technology but also external factors (e.g.,
remote work arrangements, company size, and regional differences). More concretely, based on the
dataset, regional salary adjustments should be considered, in particular raising pay for
employees in African regions, which could reduce the sense of being undervalued or treated as
outsourced labor and thereby boost morale. Last but not least, structured pay tiers should be
established based on experience level, with the responsibilities, skills, experience, and
qualifications required for each tier clearly defined. For example, entry-level positions may
require basic skills and minimal experience, while executive roles require extensive experience
and leadership capabilities. These recommendations align compensation practices with the trained
model’s findings, thereby enhancing not only salary but also job satisfaction across the company.
