
id5132-1

October 24, 2024

[1]: import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy.stats import chi2_contingency
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_curve, auc

[2]: data = pd.read_csv(r"/content/EasyVisa (5).csv")
data

[2]: case_id continent education_of_employee has_job_experience \


0 EZYV01 Asia High School N
1 EZYV02 Asia Master's Y
2 EZYV03 Asia Bachelor's N
3 EZYV04 Asia Bachelor's N
4 EZYV05 Africa Master's Y
… … … … …
25475 EZYV25476 Asia Bachelor's Y
25476 EZYV25477 Asia High School Y
25477 EZYV25478 Asia Master's Y
25478 EZYV25479 Asia Master's Y
25479 EZYV25480 Asia Bachelor's Y

requires_job_training no_of_employees yr_of_estab \


0 N 14513 2007
1 N 2412 2002
2 Y 44444 2008
3 N 98 1897
4 N 1082 2005
… … … …
25475 Y 2601 2008
25476 N 3274 2006
25477 N 1121 1910
25478 Y 1918 1887
25479 N 3195 1960

region_of_employment prevailing_wage unit_of_wage full_time_position \


0 West 592.2029 Hour Y
1 Northeast 83425.6500 Year Y
2 West 122996.8600 Year Y
3 West 83434.0300 Year Y
4 South 149907.3900 Year Y
… … … … …
25475 South 77092.5700 Year Y
25476 Northeast 279174.7900 Year Y
25477 South 146298.8500 Year N
25478 West 86154.7700 Year Y
25479 Midwest 70876.9100 Year Y

case_status
0 Denied
1 Certified
2 Denied
3 Denied
4 Certified
… …
25475 Certified
25476 Certified
25477 Certified
25478 Certified
25479 Certified

[25480 rows x 12 columns]

[3]: # Check for null values
null_values = data.isnull().sum()

# Display the null values for each column
print("Null values in each column:\n", null_values)

# If you want to check if any null values are present
total_null_values = data.isnull().values.any()
print("\nAre there any null values in the dataset?", total_null_values)

Null values in each column:
case_id                  0
continent                0
education_of_employee    0
has_job_experience       0
requires_job_training    0
no_of_employees          0
yr_of_estab              0
region_of_employment     0
prevailing_wage          0
unit_of_wage             0
full_time_position       0
case_status              0
dtype: int64

Are there any null values in the dataset? False

[4]: # Check for NaN values in the dataset
print("Missing values in dataset:\n", data.isnull().sum())

# Drop rows with NaN values in the target variable
data_clean = data.dropna(subset=['case_status', 'no_of_employees'])

Missing values in dataset:
case_id                  0
continent                0
education_of_employee    0
has_job_experience       0
requires_job_training    0
no_of_employees          0
yr_of_estab              0
region_of_employment     0
prevailing_wage          0
unit_of_wage             0
full_time_position       0
case_status              0
dtype: int64

[5]: # Display the original DataFrame
print("Original DataFrame:")
print(data)

# Remove rows with any NaN values
data = data.dropna()

# Display the cleaned DataFrame
print("\nDataFrame after removing NaN values:")
print(data)

Original DataFrame:
case_id continent education_of_employee has_job_experience \
0 EZYV01 Asia High School N
1 EZYV02 Asia Master's Y
2 EZYV03 Asia Bachelor's N
3 EZYV04 Asia Bachelor's N
4 EZYV05 Africa Master's Y
… … … … …
25475 EZYV25476 Asia Bachelor's Y
25476 EZYV25477 Asia High School Y
25477 EZYV25478 Asia Master's Y
25478 EZYV25479 Asia Master's Y
25479 EZYV25480 Asia Bachelor's Y

requires_job_training no_of_employees yr_of_estab \


0 N 14513 2007
1 N 2412 2002
2 Y 44444 2008
3 N 98 1897
4 N 1082 2005
… … … …
25475 Y 2601 2008
25476 N 3274 2006
25477 N 1121 1910
25478 Y 1918 1887
25479 N 3195 1960

region_of_employment prevailing_wage unit_of_wage full_time_position \


0 West 592.2029 Hour Y
1 Northeast 83425.6500 Year Y
2 West 122996.8600 Year Y
3 West 83434.0300 Year Y
4 South 149907.3900 Year Y
… … … … …
25475 South 77092.5700 Year Y
25476 Northeast 279174.7900 Year Y
25477 South 146298.8500 Year N
25478 West 86154.7700 Year Y
25479 Midwest 70876.9100 Year Y

case_status
0 Denied
1 Certified
2 Denied
3 Denied
4 Certified
… …
25475 Certified
25476 Certified
25477 Certified
25478 Certified
25479 Certified

[25480 rows x 12 columns]

DataFrame after removing NaN values:


case_id continent education_of_employee has_job_experience \
0 EZYV01 Asia High School N
1 EZYV02 Asia Master's Y
2 EZYV03 Asia Bachelor's N
3 EZYV04 Asia Bachelor's N
4 EZYV05 Africa Master's Y
… … … … …
25475 EZYV25476 Asia Bachelor's Y
25476 EZYV25477 Asia High School Y
25477 EZYV25478 Asia Master's Y
25478 EZYV25479 Asia Master's Y
25479 EZYV25480 Asia Bachelor's Y

requires_job_training no_of_employees yr_of_estab \


0 N 14513 2007
1 N 2412 2002
2 Y 44444 2008
3 N 98 1897
4 N 1082 2005
… … … …
25475 Y 2601 2008
25476 N 3274 2006
25477 N 1121 1910
25478 Y 1918 1887
25479 N 3195 1960

region_of_employment prevailing_wage unit_of_wage full_time_position \


0 West 592.2029 Hour Y
1 Northeast 83425.6500 Year Y
2 West 122996.8600 Year Y
3 West 83434.0300 Year Y
4 South 149907.3900 Year Y
… … … … …
25475 South 77092.5700 Year Y
25476 Northeast 279174.7900 Year Y
25477 South 146298.8500 Year N
25478 West 86154.7700 Year Y
25479 Midwest 70876.9100 Year Y

case_status
0 Denied
1 Certified
2 Denied
3 Denied
4 Certified
… …
25475 Certified
25476 Certified
25477 Certified
25478 Certified
25479 Certified

[25480 rows x 12 columns]

1 Problem Definition
The goal of this analysis is to identify patterns and insights that help explain the factors influencing visa certification rates. By analyzing the dataset, we aim to streamline the visa approval process and to recommend applicant profiles based on the significant drivers of case status.

1.1 EDA 1: Impact of Company Size on Visa Certification: How does the
number of employees in the employer’s company affect the likelihood of
visa certification? Is there a threshold for company size that influences
approval rates?
[6]: # Define the features and target variable
X = data_clean[['no_of_employees']]
y = data_clean['case_status'].map({'Certified': 1, 'Denied': 0})  # Convert to binary

# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42, stratify=y)

[7]: # Create a logistic regression model with class weight handling
model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.33      0.91      0.49      1692
           1       0.69      0.09      0.17      3404

    accuracy                           0.37      5096
   macro avg       0.51      0.50      0.33      5096
weighted avg       0.57      0.37      0.27      5096

[8]: # Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('Receiver Operating Characteristic', fontsize=16)
plt.legend(loc='lower right')
plt.grid()
plt.show()

[9]: # Analyze thresholds
threshold_results = pd.DataFrame({
'threshold': thresholds,
'tpr': tpr,
'fpr': fpr,
'fnr': 1 - tpr, # False Negative Rate
'precision': tpr / (tpr + fpr) # Precision calculation
})

<ipython-input-9-9c3953983ca3>:7: RuntimeWarning: invalid value encountered in divide
  'precision': tpr / (tpr + fpr)  # Precision calculation
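
The warning comes from the very first ROC point, where tpr and fpr are both zero, so tpr / (tpr + fpr) is 0/0. Note also that tpr / (tpr + fpr) equals precision only when both classes are equally frequent; with 1692 denials versus 3404 certifications in the test set it is better read as a rough proxy. A minimal sketch of one way to avoid the warning, assuming the tpr, fpr and thresholds arrays from the cell above:

[ ]: # Compute the precision proxy only where the denominator is positive,
# leaving NaN elsewhere (sketch; not part of the original run).
denom = tpr + fpr
precision_proxy = np.divide(tpr, denom, out=np.full_like(tpr, np.nan), where=denom > 0)
threshold_results = pd.DataFrame({
    'threshold': thresholds,
    'tpr': tpr,
    'fpr': fpr,
    'fnr': 1 - tpr,
    'precision': precision_proxy,
})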

[10]: # Determine the best threshold where TPR is maximized and FPR is minimized
best_threshold_idx = (tpr - fpr).argmax()
best_threshold = thresholds[best_threshold_idx]
best_tpr = tpr[best_threshold_idx]
best_fpr = fpr[best_threshold_idx]

[11]: # Print best threshold and its metrics
print(f'Best threshold for company size influencing approval rates: {best_threshold:.2f}')
print(f'True Positive Rate (TPR) at best threshold: {best_tpr:.2f}')
print(f'False Positive Rate (FPR) at best threshold: {best_fpr:.2f}')

Best threshold for company size influencing approval rates: 0.50
True Positive Rate (TPR) at best threshold: 0.46
False Positive Rate (FPR) at best threshold: 0.43

[12]: # Additional analysis to evaluate model performance at different thresholds
# Display TPR, FPR, Precision for a range of thresholds
print(threshold_results[['threshold', 'tpr', 'fpr', 'precision']])

      threshold       tpr       fpr  precision
0           inf  0.000000  0.000000        NaN
1      0.672882  0.000000  0.000591   0.000000
2      0.651916  0.000000  0.001182   0.000000
3      0.608998  0.000588  0.001182   0.332025
4      0.608817  0.000588  0.001773   0.248897
…             …         …         …          …
2394   0.498343  0.998825  0.998818   0.500002
2395   0.498335  0.999412  0.999409   0.500001
2396   0.498334  0.999706  0.999409   0.500074
2397   0.498333  0.999706  1.000000   0.499927
2398   0.498331  1.000000  1.000000   0.500000

[2399 rows x 4 columns]

1.1.1 Results with improvement

[13]: # Preprocess the data
# Convert categorical variables to numerical (e.g., 'case_status', 'has_job_experience', etc.)
data['case_status'] = data['case_status'].map({'Certified': 1, 'Denied': 0})

# Handle missing values (you may choose a strategy based on your data)
data.dropna(subset=['no_of_employees', 'case_status'], inplace=True)

[34]: # Create a function to visualize the approval rates based on company size
def plot_approval_rates(data):
    plt.figure(figsize=(20, 12))
    sns.countplot(data=data, x='no_of_employees', hue='case_status', palette='coolwarm')
    plt.title('Visa Certification by Company Size')
    plt.xlabel('Number of Employees')
    plt.ylabel('Number of Applications')
    plt.xticks(rotation=45)
    plt.legend(title='Case Status', labels=['Denied', 'Certified'])
    plt.show()

# Call the function to plot
plot_approval_rates(data)
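
With thousands of distinct employee counts, a countplot over raw no_of_employees is hard to read and does not reveal a size threshold directly. A hedged alternative sketch that bins company size before plotting; the bin edges and labels below are illustrative assumptions, not values taken from the original analysis:

[ ]: # Bin company size into a few bands so certification counts per band are
# readable (bin edges are illustrative assumptions).
bins = [0, 100, 1000, 10000, float('inf')]
labels = ['<=100', '101-1,000', '1,001-10,000', '>10,000']
data['company_size_band'] = pd.cut(data['no_of_employees'], bins=bins, labels=labels)

plt.figure(figsize=(10, 6))
sns.countplot(data=data, x='company_size_band', hue='case_status', palette='coolwarm')
plt.title('Visa Certification by Company Size Band')
plt.xlabel('Company Size (Number of Employees)')
plt.ylabel('Number of Applications')
plt.legend(title='Case Status')
plt.show()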

[15]: # Further analysis with logistic regression
# Define the features and target variable
X = data[['no_of_employees']]
y = data['case_status']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[16]: # Create a logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print the classification report
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00      1695
           1       0.67      1.00      0.80      3401

    accuracy                           0.67      5096
   macro avg       0.33      0.50      0.40      5096
weighted avg       0.45      0.67      0.53      5096

/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels
with no predicted samples. Use `zero_division` parameter to control this
behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels
with no predicted samples. Use `zero_division` parameter to control this
behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels
with no predicted samples. Use `zero_division` parameter to control this
behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))

[17]: # Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))
plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

[18]: # Analyze thresholds
threshold_results = pd.DataFrame({
    'threshold': thresholds,
    'tpr': tpr,
    'fpr': fpr
})

# Determine the best threshold where TPR is maximized and FPR is minimized
best_threshold = thresholds[(tpr - fpr).argmax()]
print(f'Best threshold for company size influencing approval rates: {best_threshold}')

Best threshold for company size influencing approval rates: 0.6678088542127053
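
The "best threshold" above is a cutoff on the model's predicted probability, not on company size itself. Because this model uses the single unscaled feature no_of_employees, the probability cutoff can be translated back into an approximate employee count by inverting the logistic function. A minimal sketch, assuming the model and best_threshold from the cells above (if the fitted coefficient is near zero, the implied cutoff is not meaningful):

[ ]: # Invert p = 1 / (1 + exp(-(b0 + b1 * x))) at p = best_threshold,
# giving x = (logit(p) - b0) / b1 (sketch only).
b0 = model.intercept_[0]
b1 = model.coef_[0][0]
logit_p = np.log(best_threshold / (1 - best_threshold))
employee_cutoff = (logit_p - b0) / b1
print(f'Approximate company-size cutoff implied by the model: {employee_cutoff:,.0f} employees')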

1.2 EDA 2: Does the age of the employer's company (years since establishment) correlate with visa certification rates? Are newer companies more or less likely to receive certifications compared to established ones?
[19]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame containing the dataset
current_year = 2024  # You can dynamically get the current year using datetime

# Step 1: Calculate the age of the company
data['company_age'] = current_year - data['yr_of_estab']

# Step 2: Group by 'case_status' and calculate the average company age for each status
avg_age_by_status = data.groupby('case_status')['company_age'].mean().reset_index()

# Step 3: Plotting the results
plt.figure(figsize=(8, 5))
sns.barplot(x='case_status', y='company_age', data=avg_age_by_status, palette='viridis')
plt.title('Average Company Age by Visa Certification Status')
plt.xlabel('Visa Certification Status')
plt.ylabel('Average Company Age (Years)')
plt.show()

# Step 4: Calculate Pearson correlation between company age and visa certification
# First, map the 'case_status' to binary values (Certified = 1, Denied = 0)
data['case_status_binary'] = data['case_status'].map({'Certified': 1, 'Denied': 0})

# Calculate correlation
correlation = data[['company_age', 'case_status_binary']].corr()
print("Correlation between company age and visa certification status:")
print(correlation)

<ipython-input-19-858b7bf4bf07>:16: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in
v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same
effect.

  sns.barplot(x='case_status', y='company_age', data=avg_age_by_status, palette='viridis')

Correlation between company age and visa certification status:
                    company_age  case_status_binary
company_age                 1.0                 NaN
case_status_binary          NaN                 NaN
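
The NaN correlation is a side effect of cell [13]: case_status was already converted to 1/0 there, so re-mapping it with the string keys 'Certified' and 'Denied' in the cell above fills case_status_binary entirely with NaN. A minimal sketch of the intended calculation, using the already-numeric column directly:

[ ]: # case_status is already 1/0 after cell [13], so correlate it directly
# with company_age instead of re-mapping the string labels (sketch).
correlation = data[['company_age', 'case_status']].corr()
print("Correlation between company age and visa certification status:")
print(correlation)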

1.3 EDA 3: Regional Trends in Visa Certification: How does the region of employment influence the likelihood of visa approval? Are certain regions more favorable for visa certifications than others, and what might explain these differences?
[20]: # Map case_status: 1 = Certified, 0 = Denied
data['case_status_label'] = data['case_status'].map({1: 'Certified', 0: 'Denied'})

# Data Cleaning (Handle missing values if any)
# Drop rows with missing values in key columns (if necessary)
df = data.dropna(subset=['region_of_employment', 'case_status_label'])

[21]: # Calculate visa approval rates by region
# Group by region and case status to count the occurrences
region_case_status = df.groupby(['region_of_employment', 'case_status_label']).size().unstack(fill_value=0)

# Add a column for the total number of applications in each region
region_case_status['total_applications'] = region_case_status['Certified'] + region_case_status['Denied']

# Calculate the certification rate for each region
region_case_status['approval_rate'] = (region_case_status['Certified'] / region_case_status['total_applications']) * 100

# Sort regions by approval rate in descending order
region_case_status_sorted = region_case_status.sort_values(by='approval_rate', ascending=False)

# Display the approval rate per region
print(region_case_status_sorted[['Certified', 'Denied', 'approval_rate']])

case_status_label     Certified  Denied  approval_rate
region_of_employment
Midwest                    3253    1054      75.528210
South                      4913    2104      70.015676
Northeast                  4526    2669      62.904795
West                       4100    2486      62.253265
Island                      226     149      60.266667

[22]: # Plot the regional trends in visa certification using a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=region_case_status_sorted.index, y=region_case_status_sorted['approval_rate'], palette="Blues_d")
plt.title('Visa Approval Rate by Region of Employment', fontsize=16)
plt.xlabel('Region of Employment', fontsize=12)
plt.ylabel('Approval Rate (%)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

<ipython-input-22-bb2add09d93a>:3: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in
v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same
effect.

  sns.barplot(x=region_case_status_sorted.index, y=region_case_status_sorted['approval_rate'], palette="Blues_d")
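
The FutureWarning can be addressed the way seaborn suggests: assign the x variable to hue and disable the redundant legend. A minimal sketch for the approval-rate bar chart above (assumes a seaborn version that accepts the legend keyword, as implied by the warning itself):

[ ]: # Assign the x variable to hue and set legend=False, as the seaborn
# deprecation message suggests (sketch).
sns.barplot(
    x=region_case_status_sorted.index,
    y=region_case_status_sorted['approval_rate'],
    hue=region_case_status_sorted.index,
    palette="Blues_d",
    legend=False,
)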

[23]: # Plot total applications by region to give additional context
plt.figure(figsize=(10, 6))
sns.barplot(x=region_case_status_sorted.index, y=region_case_status_sorted['total_applications'], palette="Greens_d")
plt.title('Total Visa Applications by Region of Employment', fontsize=16)
plt.xlabel('Region of Employment', fontsize=12)
plt.ylabel('Total Applications', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

<ipython-input-23-50e2a68b1904>:3: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in
v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same
effect.

  sns.barplot(x=region_case_status_sorted.index, y=region_case_status_sorted['total_applications'], palette="Greens_d")

1.4 EDA 4: Job Training Requirements and Approval Rates: Is there a significant relationship between the requirement for job training and visa certification rates? Do applicants needing training face higher denial rates compared to those who do not?
[24]: df = df.dropna(subset=['requires_job_training', 'case_status'])

# Step 3: Group the data by `requires_job_training` and analyze `case_status`
# Create a crosstab to view the relationship between job training and case status
job_training_vs_status = pd.crosstab(df['requires_job_training'], df['case_status'])

# Display the crosstab
print(job_training_vs_status)

# Step 4: Visualize the relationship using a bar plot
plt.figure(figsize=(8, 5))
sns.countplot(x='requires_job_training', hue='case_status', data=df, palette='Set2')
plt.title('Visa Certification Rates Based on Job Training Requirement')
plt.xlabel('Requires Job Training')
plt.ylabel('Count of Cases')
plt.legend(title='Visa Status', loc='upper right')
plt.show()

# Step 5: Perform a Chi-Square test to check for significance
# Creating a contingency table
contingency_table = job_training_vs_status.values

# Performing the chi-square test
chi2, p, dof, expected = chi2_contingency(contingency_table)

# Display the test results
print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")

# Step 6: Interpret the result
if p < 0.05:
    print("There is a significant relationship between job training requirement and visa certification rates (p < 0.05).")
else:
    print("There is no significant relationship between job training requirement and visa certification rates (p >= 0.05).")

case_status 0 1
requires_job_training
N 7513 15012
Y 949 2006

Chi-Square Statistic: 1.7524844405674074
P-value: 0.18556470819406773
There is no significant relationship between job training requirement and visa
certification rates (p >= 0.05).
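
The chi-square test addresses statistical significance, but the question also asks whether applicants needing training face higher denial rates. A minimal sketch that derives the certification rate per group from the crosstab above (column 1 corresponds to Certified after the earlier mapping):

[ ]: # Row-normalize the crosstab to get the share of certified cases within
# each requires_job_training group (sketch).
rates = job_training_vs_status.div(job_training_vs_status.sum(axis=1), axis=0)
print((rates[1] * 100).round(2).rename('certified_rate_%'))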

1.5 EDA 5: Wage Disparities Across Education Levels: How does the prevailing wage compare across different levels of education? Are higher-educated applicants more likely to receive higher wages, and how does this impact their visa approval status?
[25]: # Convert wage unit to yearly equivalent for easier comparison
def convert_to_yearly(row):
    if row['unit_of_wage'] == 'Hourly':
        return row['prevailing_wage'] * 2080  # Assuming 40 hours/week, 52 weeks/year
    elif row['unit_of_wage'] == 'Weekly':
        return row['prevailing_wage'] * 52  # 52 weeks/year
    elif row['unit_of_wage'] == 'Monthly':
        return row['prevailing_wage'] * 12  # 12 months/year
    else:
        return row['prevailing_wage']  # Already yearly

df['prevailing_wage_yearly'] = df.apply(convert_to_yearly, axis=1)

# Drop rows with missing or invalid wage or education data
df = df.dropna(subset=['prevailing_wage_yearly', 'education_of_employee'])
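
One caveat: in the sample rows printed earlier, unit_of_wage takes values such as 'Hour' and 'Year', so the branches keyed on 'Hourly', 'Weekly' and 'Monthly' may never match and hourly wages would pass through unconverted. A hedged sketch that keys the conversion on a mapping and accepts both spellings; only 'Hour' and 'Year' are visible in the printed sample, and the 'Week'/'Month' variants are assumptions:

[ ]: # Annualization factors; both short and long unit spellings are included
# because only 'Hour' and 'Year' appear in the printed sample
# ('Week'/'Month' variants are assumptions).
unit_multiplier = {
    'Hour': 2080, 'Hourly': 2080,   # 40 hours/week * 52 weeks/year
    'Week': 52, 'Weekly': 52,
    'Month': 12, 'Monthly': 12,
    'Year': 1, 'Yearly': 1,
}
df['prevailing_wage_yearly'] = df['prevailing_wage'] * df['unit_of_wage'].map(unit_multiplier).fillna(1)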

[26]: # Plot distribution of prevailing wage across different education levels
plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='education_of_employee', y='prevailing_wage_yearly', palette='Set3')
plt.title('Prevailing Wage Across Education Levels')
plt.xlabel('Education Level')
plt.ylabel('Prevailing Wage (Yearly)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

<ipython-input-26-239460dd383b>:3: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in
v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same
effect.

  sns.boxplot(data=df, x='education_of_employee', y='prevailing_wage_yearly', palette='Set3')

[27]: # Analyze the relationship between education level and visa approval
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x='education_of_employee', y='prevailing_wage_yearly', hue='case_status', ci=None, palette='Set2')
plt.title('Prevailing Wage Across Education Levels Based on Visa Status')
plt.xlabel('Education Level')
plt.ylabel('Average Prevailing Wage (Yearly)')
plt.xticks(rotation=45)
plt.legend(title='Visa Status')
plt.tight_layout()
plt.show()

<ipython-input-27-b857e9760eaa>:3: FutureWarning:

The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.

  sns.barplot(data=df, x='education_of_employee', y='prevailing_wage_yearly', hue='case_status', ci=None, palette='Set2')

[28]: # Calculate the average prevailing wage per education level and visa status
wage_education_visa = df.groupby(['education_of_employee', 'case_status'])['prevailing_wage_yearly'].mean().reset_index()

# Display the results
print(wage_education_visa)

# Further statistical analysis (optional)
# Compare wage differences across education levels
from scipy.stats import f_oneway

# Extract wages for each education level
education_levels = df['education_of_employee'].unique()
wage_data = [df[df['education_of_employee'] == edu]['prevailing_wage_yearly'] for edu in education_levels]

# Perform ANOVA to test if wage differences between education levels are statistically significant
anova_result = f_oneway(*wage_data)
print("ANOVA result:", anova_result)

  education_of_employee  case_status  prevailing_wage_yearly
0            Bachelor's            0            66828.620185
1            Bachelor's            1            77399.880153
2             Doctorate            0            64579.805167
3             Doctorate            1            64558.333989
4           High School            0            69856.515009
5           High School            1            74926.673079
6              Master's            0            71707.831941
7              Master's            1            80782.520567
ANOVA result: F_onewayResult(statistic=52.84780210274433, pvalue=4.817396648029027e-34)

[29]: # Data preprocessing: handling missing values in case_status and full_time_position columns
df = df[['case_status', 'full_time_position']].dropna()

# Count the number of full-time and part-time positions for certified vs denied cases
visa_status_counts = pd.crosstab(df['full_time_position'], df['case_status'])
print(visa_status_counts)

# Plot the results
plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='full_time_position', hue='case_status', palette='Set2')
plt.title('Visa Certification Outcomes by Employment Status')
plt.xlabel('Full-Time Position (Y=Full-Time, N=Part-Time)')
plt.ylabel('Count')
plt.show()

# Perform Chi-Square Test to check for significant association
chi2, p, dof, expected = chi2_contingency(visa_status_counts)

# Output the test result
print(f"Chi-Square Test Statistic: {chi2}")
print(f"P-value: {p}")

# Conclusion
if p < 0.05:
    print("There is a significant difference in visa certification rates between full-time and part-time positions.")
else:
    print("There is no significant difference in visa certification rates between full-time and part-time positions.")

case_status 0 1
full_time_position
N 852 1855
Y 7610 15163

Chi-Square Test Statistic: 4.029931866713214
P-value: 0.0446997461068597
There is a significant difference in visa certification rates between full-time
and part-time positions.
