
Data Scientist


1. Understand the problem

Supervised learning (SL): labelled columns and rows (descriptive values), possible features
Types of output: classification (discrete) / regression (continuous)

Topic:
The information technology sector has come to dominate the job market and has become a
cornerstone of the international economic landscape, driving cumulative innovation and
expanding workforce opportunities. Nonetheless, salaries within the sector vary widely
due to a diverse set of factors, including level of experience, geographical location,
remote ratio, and company size.
Problem Statement:
This project analyzes the salaries of technology professionals, expressed in USD, with
the aim of identifying the factors that affect compensation beyond specific job titles
and industry sector. Although role classification is one of the most popular and evident
factors in salary determination in the technology industry, other critical variables
also need to be taken into consideration. Those variables, i.e., experience level,
location, remote ratio, and company size, are explored through data analysis to provide
a comprehensive picture of salaries in the technology industry, measured in USD.
Problem Resolution Strategy
To understand salary distributions, a structured and meticulous approach was adopted
that makes the aforementioned variables the focal points of the study and excludes job
titles and labels. The methodology can be outlined in two primary stages. The first
consists of data collection, data cleaning and interpretation, visual inspection of the
data, and handling of outliers. The second, once data preparation is complete, is
training a model that predicts salary in USD from the variables mentioned above.
In this study, supervised learning is used. Supervised learning is a type of machine
learning in which a model is trained on a labeled dataset. In this context, the ‘label’
is the correct answer or result that the model is expected to predict, so each training
example consists of an input and its corresponding label (Sarker, 2021).
Framed as a supervised learning problem, the goal of the trained model is to predict
salaries paid in USD (the label) based on experience level, company location, remote
ratio, and company size (the features).
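The split between features and label described above can be sketched in a few lines. This is only an illustration, not the workflow used in the study: the three rows are made up, and the column name company_location is an assumption (the other column names follow the variable names shown in the tables).

```python
import csv
import io

# Made-up sample rows for illustration; the real data is in "FOAI-assignment.csv".
# The column name "company_location" is an assumption.
SAMPLE_CSV = """experience_level,company_location,remote_ratio,company_size,salary_in_usd
SE,US,100,M,150000
EN,DE,0,S,48000
MI,US,50,L,110000
"""

FEATURE_COLUMNS = ["experience_level", "company_location", "remote_ratio", "company_size"]
LABEL_COLUMN = "salary_in_usd"


def split_features_and_label(rows):
    """Return (features, labels): one feature dict and one numeric label per row."""
    features, labels = [], []
    for row in rows:
        features.append({name: row[name] for name in FEATURE_COLUMNS})
        labels.append(float(row[LABEL_COLUMN]))
    return features, labels


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
X, y = split_features_and_label(rows)
print(X[0])  # the inputs the model sees for the first row
print(y)     # the values the model is trained to predict
```

The label never appears inside the feature dictionaries, which mirrors the rule that the model must predict salary without being given it.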
Data Access and Collection
The dataset was provided on Canvas, a digital platform serving academic purposes, in the
Assessment section of the COSC2968|COSC3053 Foundations of Artificial Intelligence for
STEM course, under option B. The dataset, titled “FOAI-assignment.csv”, contains data
from 2020 to 2023 on salaries, employment, demographic characteristics, and company
details within the data science domain.
Two software tools were used on the downloaded file for data cleaning and
interpretation: jamovi, a statistical package, and SageMaker, a machine learning service
for analyzing data. The two were used side by side (Amazon Sagemaker, 2024; The jamovi
project, 2024).
Data Description
Salaries in USD. This is the dependent variable in this study, also known as the label.
It represents an individual's annual monetary compensation in USD.
Experience Level. This attribute categorizes employees by professional experience:
entry-level, mid-level, senior-level, and executive-level.
Company Location. This attribute represents the geographical location of employment,
that is, the country where the employer’s main facility or contracting department is
located. The dataset covers a wide range of countries, from Asian to European
countries, which were later grouped into seven regions.
Company Size. The size of the employing organization, defined by its number of
employees and classified as S (fewer than 50 employees), M (50 to 250 employees), and
L (more than 250 employees).
Remote Ratio. This feature is a numerical representation of the extent to which a job
can be performed remotely, without attending physical facilities. In the dataset it
takes three values: 0 means there is no remote work (less than 20% of the work can be
done remotely), 50 means a partially remote arrangement (approximately 50% of the
workload can be done remotely), and 100 means the job can be performed from a remote
location (more than 80% of the tasks can be done off-site).
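As a rough sketch, the two coded attributes can be written as a lookup table and a small classifier. The names REMOTE_RATIO_LABELS and company_size_code are hypothetical, and placing head counts of exactly 50 and 250 in the M band is an assumption, since the description does not say where the boundaries belong.

```python
# Meaning of each remote_ratio code, following the description above.
REMOTE_RATIO_LABELS = {
    0: "on-site (less than 20% remote)",
    50: "partially remote (about 50% remote)",
    100: "fully remote (more than 80% remote)",
}


def company_size_code(employee_count):
    """Map a head count to the size code S, M, or L.

    Assumption: counts of exactly 50 and 250 are placed in M.
    """
    if employee_count < 50:
        return "S"
    if employee_count <= 250:
        return "M"
    return "L"


for count in (10, 120, 1000):
    print(count, company_size_code(count))
```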
Initial Review

Table 1
Descriptive statistics for salary in USD

                      salary_in_usd
N                     1494
Missing               6
Mean                  130934
Median                130000
Standard deviation    66668
Minimum               5409
Maximum               450000

Note. The table shows descriptive statistics for salary in USD (n = 1494).

Table 1 summarizes salary in USD, with 1,494 valid observations and 6 missing values.
The mean salary in the sample is $130,934 (M = 130934, SD = 66668) and the median is
$130,000, with the most commonly reported salary being $100,000 (Mode = 100000).
Salaries range from $5,409 (Min = 5409) to $450,000 (Max = 450000).

Table 2
Frequencies of experience_level

experience_level    Counts    % of Total
EN                  167       11.1 %
EX                  58        3.9 %
MI                  353       23.5 %
SE                  922       61.5 %

Note. The table shows the frequencies of employees' experience levels (n = 1500).

Table 2 depicts the distribution of employees across four experience levels: entry
level (EN), executive level (EX), mid-level (MI), and senior level (SE). The majority
fall in the SE category, accounting for 61.5% of the sample (n = 922). MI and EN
employees constitute 23.5% and 11.1% respectively (n = 353; n = 167), while the
smallest group is EX, at only 3.9% (n = 58).
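The percentages in Table 2 follow directly from the counts. A minimal check, using only the counts reported in the table (the function name percent_of_total is illustrative):

```python
# Counts copied from Table 2.
counts = {"EN": 167, "EX": 58, "MI": 353, "SE": 922}


def percent_of_total(counts):
    """Return each category's share of the total, rounded to one decimal place."""
    total = sum(counts.values())
    return {level: round(100 * n / total, 1) for level, n in counts.items()}


shares = percent_of_total(counts)
print(sum(counts.values()))  # 1500
print(shares)                # {'EN': 11.1, 'EX': 3.9, 'MI': 23.5, 'SE': 61.5}
```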

Table 3
Frequencies of region

region           Counts    % of Total
Africa           1         0.1 %
Asia             10        0.7 %
Europe           77        5.1 %
North America    1157      77.1 %
Oceania          6         0.4 %
Other            248       16.5 %
South America    1         0.1 %

Note. The table shows the distribution of employees across seven regions (n = 1500).

Unlike the other tables, this table is based on a transformed variable: the country
abbreviations were recoded into the continents where those countries are situated. In
other words, the original attribute was company location, which was relabeled as
region to make the data easier to process. Table 3 presents the distribution of
employees by region and shows that North America has the largest working population,
accounting for more than three quarters of the sample (n = 1157).
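The recoding from country abbreviation to region can be sketched as a dictionary lookup. The mapping below is a hypothetical, partial example and not the one used in the study. Sending unmapped codes to "Other" is also an assumption, since the text does not say how that category was formed.

```python
from collections import Counter

# Hypothetical, partial mapping from country codes to regions.
COUNTRY_TO_REGION = {
    "US": "North America",
    "CA": "North America",
    "DE": "Europe",
    "GB": "Europe",
    "IN": "Asia",
    "AU": "Oceania",
    "BR": "South America",
    "NG": "Africa",
}


def to_region(country_code):
    """Return the region for a country code, or "Other" when it is unmapped."""
    return COUNTRY_TO_REGION.get(country_code.strip().upper(), "Other")


locations = ["US", "us", "DE", "XX", "AU"]
region_counts = Counter(to_region(code) for code in locations)
print(region_counts)
```

Normalizing the code with strip and upper before the lookup avoids splitting one country into several categories because of inconsistent casing or stray spaces.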

Table 4
Frequencies of company_size

company_size    Counts    % of Total
L               320       21.3 %
M               1073      71.5 %
S               107       7.1 %

Note. The table shows the distribution of organizational size (n = 1500).

Table 4 shows the distribution of the three company sizes: small, medium, and large.
Medium-sized enterprises dominate the dataset, accounting for 71.5% of the total
(n = 1073).

Table 5
Frequencies of remote_ratio

remote_ratio    Counts    % of Total
0               579       38.6 %
50              130       8.7 %
100             791       52.7 %

Note. 0 = no remote work, 50 = about half of the work can be done remotely, 100 = more
than 80% of the work can be done remotely.

Table 5 indicates that the majority of jobs in the sample can be performed remotely
(remote ratio = 100), accounting for 52.7% of the total (n = 791), while 38.6% require
employees to work on-site at physical facilities (n = 579).

Data Cleaning
All variables were complete except the dependent variable, salary in USD, which had 6
missing values (Table 1). These could be due to errors during the data collection phase,
so the missing values were replaced with the mean, a form of data imputation. The
histogram with density curve (Figure 1), together with a Shapiro-Wilk p value below
.001, indicated a violation of the normality assumption, meaning that the data were not
normally distributed. Visual inspection showed that the histogram was positively skewed,
which could be attributed to numerous potential outliers.
However, the dataset is large (1500 individuals), and the outliers were not anomalous:
they did not appear as isolated low or high values but as clusters of extremely high
and extremely low values. Nevertheless, to make the data more suitable for the later
predictive algorithms, salaries were standardized to Z-scores and the nine most extreme
outliers, those with a standardized Z-score exceeding 3.29, were removed. This step
brought the distribution closer to normal; still, on visual inspection, the
distribution of salaries remained wide and uneven.
Salary in USD was also log-transformed, producing a new variable with a compressed
range and fewer digits. To raise R and R² (the correlation between the predictors and
the response variable, and the proportion of variance explained), company size was
converted from a categorical to a numeric attribute: the larger the company (from S to
M to L), the higher the number representing its size.
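The cleaning steps above (mean imputation, removal of values beyond a Z-score of 3.29, log transformation, and numeric coding of company size) can be sketched as follows. This is a stdlib-only illustration on made-up salaries, not the jamovi or SageMaker procedure used in the study. The function names are hypothetical, the use of the absolute Z-score and the sample standard deviation are assumptions, and the logarithm base is left as a parameter.

```python
import math
import statistics

Z_CUTOFF = 3.29                        # threshold stated in the text
SIZE_CODES = {"S": 1, "M": 2, "L": 3}  # larger company, higher number


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]


def drop_extreme(values, cutoff=Z_CUTOFF):
    """Keep values whose absolute Z-score does not exceed the cutoff."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return [v for v in values if abs((v - mean) / sd) <= cutoff]


def log_transform(values, base):
    """Log-transform positive values using the given base."""
    return [math.log(v, base) for v in values]


# Made-up salaries: 18 ordinary values, one missing value, one extreme value.
raw_salaries = [90000, 110000] * 9 + [None, 2000000]
imputed = impute_mean(raw_salaries)
cleaned = drop_extreme(imputed)
logged = log_transform(cleaned, base=10)  # base chosen only for illustration
size_codes = [SIZE_CODES[s] for s in ["S", "M", "L", "M"]]

print(len(raw_salaries), len(cleaned))  # the extreme value is dropped
print(size_codes)
```

In this toy sample the missing value is filled before outliers are removed, so the imputed mean is itself pulled upward by the extreme salary; the order of the two steps matters.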
The model’s performance was evaluated using the model coefficients, including the
intercept, and standard linear regression metrics, namely R, R-squared (R²), and Root
Mean Squared Error (RMSE), to assess the goodness of fit of the trained model and the
accuracy of its predictions.

Model Fit Measures

Model    R        R²       RMSE
1        0.707    0.500    0.475

Note. Models estimated using sample size of N=1500
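For reference, the three fit measures can be computed from observed and predicted values as sketched below. The function fit_metrics is illustrative and the four data points are made up. Taking R as the square root of R² holds for a least-squares fit with an intercept, and dividing by n in the RMSE is an assumption, since statistical packages may use a different denominator.

```python
import math
import statistics


def fit_metrics(actual, predicted):
    """Return (R, R squared, RMSE) for paired observations and predictions."""
    mean_actual = statistics.mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r_squared = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(actual))  # divides by n (an assumption)
    r = math.sqrt(r_squared) if r_squared >= 0 else float("nan")
    return r, r_squared, rmse


# Made-up example: the predictions are the means of two groups of observations.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]
r, r_squared, rmse = fit_metrics(actual, predicted)
print(round(r, 3), round(r_squared, 3), round(rmse, 3))

# Consistency check on the reported table: R = 0.707 implies R squared of about 0.500.
print(round(0.707 ** 2, 3))
```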


Model Coefficients - LOG(salary_in_usd (MeanReplacement))

Predictor                      Estimate    SE        t          p
Intercept ᵃ                    9.44555     0.4806    19.6524    < .001
company_size - Transform 1:
  2 – 3                        -0.01759    0.0327    -0.5386    0.590
  1 – 3                        -0.25418    0.0542    -4.6908    < .001
region:
  Asia – Africa                0.95377     0.5031    1.8957     0.058
  Europe – Africa              1.06511     0.4831    2.2045     0.028
  North America – Africa       1.88007     0.4807    3.9108     < .001
  Oceania – Africa             1.85206     0.5172    3.5811     < .001
  Other – Africa               1.16998     0.4807    2.4337     0.015
  South America – Africa       -0.86143    0.6800    -1.2668    0.205
remote_ratio:
  50 – 0                       -0.00405    0.0525    -0.0772    0.938
  100 – 0                      -0.05662    0.0264    -2.1436    0.032
experience_level:
  EX – EN                      0.92175     0.0740    12.4483    < .001
  MI – EN                      0.36018     0.0454    7.9359     < .001
  SE – EN                      0.64048     0.0437    14.6490    < .001

ᵃ Represents reference level


The model coefficients table presents the influence of the predictors on
log-transformed salary in USD in the data science field. The multiple correlation
coefficient (R = 0.707) indicates a strong positive linear relationship between the
independent variables (company size, region, remote ratio, and experience level) and
log-transformed salary. The R-squared of 0.500 means that the predictors explain only
50% of the variance in log-transformed salary; this is a moderate value, but still an
acceptable one.
When all predictors are set at their reference levels (large company size, working in
Africa, entry-level experience, and no remote work), the intercept, with an estimate of
9.44555, gives the predicted log-transformed salary. With a p-value of less than 0.001,
this intercept is statistically significant, that is, it differs reliably from zero.
Salary is also affected by working 100% remotely, which is associated with a small but
statistically significant decrease in log-transformed salary compared with a non-remote
setup (estimate = -0.05662, p = 0.032).
There are large regional variations, with certain regions showing significantly higher
salaries than Africa. Specifically, working in North America (estimate = 1.88007,
p < 0.001) and Oceania (estimate = 1.85206, p < 0.001) has a significant positive
relationship with salary, while working in South America is associated with a
non-significant decrease in salary compared with Africa (estimate = -0.86143,
p = 0.205).
More workplace experience is associated with substantially higher salary. Employees
with executive-level experience were paid significantly more than employees with the
least experience (EN) (estimate = 0.92175, p < 0.001), followed by senior-level
employees (estimate = 0.64048, p < 0.001) and mid-level employees (estimate = 0.36018,
p < 0.001), each compared with entry-level.
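To make the reading of the coefficients concrete, the sketch below adds the relevant estimates to the intercept for a given profile. The estimates are copied from the coefficients table. Reading company-size codes 1, 2, and 3 as S, M, and L is inferred from the recoding rule and the stated reference level, and the function name predict_log_salary is hypothetical. The result stays on the log scale; converting it back to USD requires inverting whichever logarithm was applied.

```python
INTERCEPT = 9.44555  # large company, Africa, no remote work, entry level

# Estimates from the coefficients table; reference levels contribute 0.
COEFFICIENTS = {
    "company_size": {"L": 0.0, "M": -0.01759, "S": -0.25418},
    "region": {
        "Africa": 0.0,
        "Asia": 0.95377,
        "Europe": 1.06511,
        "North America": 1.88007,
        "Oceania": 1.85206,
        "Other": 1.16998,
        "South America": -0.86143,
    },
    "remote_ratio": {0: 0.0, 50: -0.00405, 100: -0.05662},
    "experience_level": {"EN": 0.0, "EX": 0.92175, "MI": 0.36018, "SE": 0.64048},
}


def predict_log_salary(profile):
    """Return the predicted log-transformed salary for one employee profile."""
    return INTERCEPT + sum(
        COEFFICIENTS[name][level] for name, level in profile.items()
    )


reference = {"company_size": "L", "region": "Africa",
             "remote_ratio": 0, "experience_level": "EN"}
senior_remote = {"company_size": "M", "region": "North America",
                 "remote_ratio": 100, "experience_level": "SE"}

print(round(predict_log_salary(reference), 5))      # 9.44555
print(round(predict_log_salary(senior_remote), 5))  # 11.89189
```

For the reference profile every adjustment is zero, so the prediction equals the intercept, which is exactly what the intercept paragraph describes.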
Recommendations:
To improve compensation practices, organizations are encouraged to adjust their salary
scales with two aims in mind: fairness and competitive compensation. This can be done
by benchmarking, that is, comparing salaries for similar roles within the industry's
job market. The comparison should cover not only the typical income for the same roles
in information technology but also external factors (e.g., remote work arrangements,
company size, and regional differences). More concretely, based on the dataset, it is
important to implement regional salary adjustments, in particular increasing pay for
employees in the African region, to reduce the sense of being undervalued or treated
as outsourced labor and thereby boost employee morale. Last but not least, structured
pay tiers should be established based on experience level, with a clear definition of
the responsibilities, skills, experience, and qualifications required for each tier.
For example, entry-level positions may require basic skills and minimal experience,
while executive roles require extensive experience and leadership capabilities. These
recommendations align compensation practices with the trained model’s findings,
thereby improving not only salary but also job satisfaction across the company.
