Id5132 1
25476 N 3274 2006
25477 N 1121 1910
25478 Y 1918 1887
25479 N 3195 1960
case_status
0 Denied
1 Certified
2 Denied
3 Denied
4 Certified
… …
25475 Certified
25476 Certified
25477 Certified
25478 Certified
25479 Certified
has_job_experience 0
requires_job_training 0
no_of_employees 0
yr_of_estab 0
region_of_employment 0
prevailing_wage 0
unit_of_wage 0
full_time_position 0
case_status 0
dtype: int64
Original DataFrame:
         case_id  continent  education_of_employee  has_job_experience  \
0         EZYV01       Asia            High School                   N
1         EZYV02       Asia               Master's                   Y
2         EZYV03       Asia             Bachelor's                   N
3         EZYV04       Asia             Bachelor's                   N
4         EZYV05     Africa               Master's                   Y
…              …          …                      …                   …
25475  EZYV25476       Asia             Bachelor's                   Y
25476  EZYV25477       Asia            High School                   Y
25477  EZYV25478       Asia               Master's                   Y
25478  EZYV25479       Asia               Master's                   Y
25479  EZYV25480       Asia             Bachelor's                   Y
case_status
0 Denied
1 Certified
2 Denied
3 Denied
4 Certified
… …
25475 Certified
25476 Certified
25477 Certified
25478 Certified
25479 Certified
1 Problem Definition
The goal of this analysis is to identify the factors that influence visa certification rates.
By analyzing the dataset, we aim to support the visa approval process and to recommend
applicant profiles based on the significant drivers of case status.
1.1 EDA 1: Impact of Company Size on Visa Certification: How does the
number of employees in the employer’s company affect the likelihood of
visa certification? Is there a threshold for company size that influences
approval rates?
[6]: # Define the features and target variable
X = data_clean[['no_of_employees']]
y = data_clean['case_status'].map({'Certified': 1, 'Denied': 0}) # Convert to binary
# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]
precision recall f1-score support
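[9]: # Analyze thresholds
n_pos = (y_test == 1).sum() # Number of actual positives (Certified) in the test set
n_neg = (y_test == 0).sum() # Number of actual negatives (Denied) in the test set
tp = tpr * n_pos # True positives at each threshold
fp = fpr * n_neg # False positives at each threshold
threshold_results = pd.DataFrame({
    'threshold': thresholds,
    'tpr': tpr,
    'fpr': fpr,
    'fnr': 1 - tpr, # False Negative Rate
    # Precision = TP / (TP + FP); tpr / (tpr + fpr) is only correct for balanced classes
    'precision': tp / (tp + fp) # NaN where no sample is predicted positive
})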
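Precision cannot be recovered from TPR and FPR alone, because it also depends on how many positives and negatives the test set contains. The following is a minimal sketch in plain Python that illustrates the point. The helper name `precision_from_rates` and the rate values are hypothetical; the class counts 17018 (Certified) and 8462 (Denied) are the column totals of the crosstab outputs shown in this notebook and serve only as an example of an imbalanced split.

```python
def precision_from_rates(tpr, fpr, n_pos, n_neg):
    """Return precision = TP / (TP + FP) given ROC rates and class counts."""
    tp = tpr * n_pos  # expected true positives
    fp = fpr * n_neg  # expected false positives
    if tp + fp == 0:
        return float("nan")  # nothing predicted positive, precision undefined
    return tp / (tp + fp)


# Hypothetical operating point on a ROC curve
tpr, fpr = 0.60, 0.40
balanced = precision_from_rates(tpr, fpr, n_pos=1000, n_neg=1000)
imbalanced = precision_from_rates(tpr, fpr, n_pos=17018, n_neg=8462)
naive = tpr / (tpr + fpr)  # matches precision only when classes are balanced
print(balanced, imbalanced, naive)
```

With balanced classes the shortcut and the true precision agree; with roughly two certified cases per denied case, the true precision is noticeably higher than the shortcut suggests.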
[10]: # Determine the threshold that maximizes TPR - FPR (Youden's J statistic)
best_threshold_idx = (tpr - fpr).argmax()
best_threshold = thresholds[best_threshold_idx]
best_tpr = tpr[best_threshold_idx]
best_fpr = fpr[best_threshold_idx]
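Choosing the index that maximizes `tpr - fpr` is commonly known as Youden's J statistic. As a small illustration of the selection step, here is a plain-Python sketch on hypothetical ROC points; the function name `best_threshold_youden` and all numbers are made up for the example and are not results from this notebook.

```python
def best_threshold_youden(fpr, tpr, thresholds):
    """Return (threshold, J) at the index that maximizes J = TPR - FPR."""
    j_scores = [t - f for t, f in zip(tpr, fpr)]
    best_idx = max(range(len(j_scores)), key=j_scores.__getitem__)
    return thresholds[best_idx], j_scores[best_idx]


# Hypothetical ROC points, ordered by decreasing threshold
fpr = [0.0, 0.1, 0.3, 0.6, 1.0]
tpr = [0.0, 0.4, 0.8, 0.9, 1.0]
thresholds = [1.0, 0.8, 0.6, 0.4, 0.2]
threshold, j = best_threshold_youden(fpr, tpr, thresholds)
print(threshold, round(j, 2))
```

The threshold returned this way is a cut-off on the predicted probability, so it weighs false positives and false negatives equally; a different cost trade-off would call for a different criterion.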
[2399 rows x 4 columns]
# Handle missing values (you may choose a strategy based on your data)
data.dropna(subset=['no_of_employees', 'case_status'], inplace=True)
[34]: # Create a function to visualize the approval rates based on company size
def plot_approval_rates(data):
    plt.figure(figsize=(20, 12))
    sns.countplot(data=data, x='no_of_employees', hue='case_status', palette='coolwarm')
[15]: # Further analysis with logistic regression
# Define the features and target variable
X = data[['no_of_employees']]
y = data['case_status']
# Make predictions
y_pred = model.predict(X_test)
/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels
with no predicted samples. Use `zero_division` parameter to control this
behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
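This warning is raised when at least one label is never predicted, so its precision is 0/0. With a single weak feature such as `no_of_employees`, a classifier may plausibly end up predicting only the majority class, although the notebook output does not confirm which label was missing. The sketch below mimics the situation in plain Python; `precision_per_label` is a hypothetical helper, not scikit-learn's implementation, and the labels are made-up data.

```python
def precision_per_label(y_true, y_pred, zero_division=0.0):
    """Precision for each label; labels never predicted get `zero_division`."""
    result = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = sum(1 for p in y_pred if p == label)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == label)
        # With no predicted samples the ratio is 0/0, so fall back to the default
        result[label] = correct / predicted if predicted else zero_division
    return result


y_true = ["Certified", "Denied", "Certified", "Denied", "Certified"]
y_pred = ["Certified"] * 5  # a model that always predicts the majority class
print(precision_per_label(y_true, y_pred))
```

As the warning text itself suggests, scikit-learn's metric functions accept a `zero_division` parameter to control this behavior. Silencing the warning does not fix the underlying issue that one class is never predicted.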
[17]: # Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)
'tpr': tpr,
'fpr': fpr
})
# Determine the threshold that maximizes TPR - FPR (Youden's J statistic)
best_threshold = thresholds[(tpr - fpr).argmax()]
# Note: roc_curve thresholds are predicted probabilities, not employee counts
print(f'Best probability threshold for predicting certification: {best_threshold}')
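The thresholds returned by `roc_curve` are cut-offs on the predicted probability, not on the number of employees. To express such a cut-off as a company size, one option is to invert the fitted logistic function. The sketch below assumes a single-feature logistic regression; the function name `size_at_probability` and the intercept and coefficient values are hypothetical (in the notebook they would presumably be read from the fitted model).

```python
import math


def size_at_probability(p, intercept, coef):
    """Solve p = 1 / (1 + exp(-(intercept + coef * x))) for x."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if coef == 0:
        raise ValueError("coef must be non-zero to invert the model")
    return (math.log(p / (1.0 - p)) - intercept) / coef


# Hypothetical fitted parameters, not values from this notebook
intercept, coef = 0.5, 0.0001
p_at_3000 = 1.0 / (1.0 + math.exp(-(intercept + coef * 3000)))
print(size_at_probability(p_at_3000, intercept, coef))
```

If the fitted coefficient is close to zero, the predicted probability barely changes with company size and the inverted value becomes unstable, which would itself suggest that no meaningful size threshold exists.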
1.1.2 EDA 2: Does the age of the employer’s company (years since establishment)
correlate with visa certification rates? Are newer companies more or less likely
to receive certifications compared to established ones?
[19]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Step 2: Group by 'case_status' and calculate the average company age for each status
avg_age_by_status = data.groupby('case_status')['company_age'].mean().reset_index()
# Calculate correlation
correlation = data[['company_age', 'case_status_binary']].corr()
print("Correlation between company age and visa certification status:")
print(correlation)
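The step that derives `company_age` is not shown in this export. A minimal sketch of one way it could be computed, together with the correlation against a binary status, is given below in plain Python. The reference year 2016, the helper names, and the four records are assumptions made for illustration only. A Pearson correlation between a numeric variable and a 0/1 variable is the point-biserial correlation.

```python
import math


def company_age(yr_of_estab, reference_year):
    """Years between establishment and an assumed reference year."""
    return reference_year - yr_of_estab


def pearson(xs, ys):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical records: establishment year and binary case status
years = [2006, 1996, 1986, 1976]
status = [0, 0, 1, 1]
ages = [company_age(y, reference_year=2016) for y in years]
r = pearson(ages, status)
print(ages, round(r, 4))
```

Because very old companies can dominate a mean-based measure, it may be worth comparing the result with a rank-based correlation before drawing conclusions.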
<ipython-input-19-858b7bf4bf07>:16: FutureWarning:
1.1.3 EDA 3: Regional Trends in Visa Certification: How does the region of
employment influence the likelihood of visa approval? Are certain regions more
favorable for visa certifications than others, and what might explain these
differences?
[20]: # Map case_status: 1 = Certified, 0 = Denied
data['case_status_label'] = data['case_status'].map({1: 'Certified', 0: 'Denied'})
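[22]: # Plot the regional trends in visa certification using a bar chart
plt.figure(figsize=(10, 6))
# Assign x to hue and set legend=False so the palette applies without a FutureWarning
sns.barplot(x=region_case_status_sorted.index, y=region_case_status_sorted['approval_rate'],
            hue=region_case_status_sorted.index, palette="Blues_d", legend=False)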
plt.ylabel('Approval Rate (%)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
<ipython-input-22-bb2add09d93a>:3: FutureWarning:
sns.barplot(x=region_case_status_sorted.index,
y=region_case_status_sorted['approval_rate'], palette="Blues_d")
<ipython-input-23-50e2a68b1904>:3: FutureWarning:
Passing `palette` without assigning `hue` is deprecated and will be removed in
v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same
effect.
sns.barplot(x=region_case_status_sorted.index,
y=region_case_status_sorted['total_applications'], palette="Greens_d")
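The cell that builds `region_case_status_sorted` is not included in this export. As a rough sketch of the aggregation behind its `approval_rate` and `total_applications` columns, the plain-Python example below groups binary case statuses by region. The function name, the region labels, and the records are hypothetical.

```python
from collections import defaultdict


def approval_rate_by_region(records):
    """Return {region: (approval_rate_percent, total_applications)}."""
    totals = defaultdict(int)
    certified = defaultdict(int)
    for region, case_status in records:
        totals[region] += 1
        certified[region] += case_status  # case_status is 1 (Certified) or 0
    rates = {r: (100.0 * certified[r] / totals[r], totals[r]) for r in totals}
    # Sort regions by approval rate, highest first
    return dict(sorted(rates.items(), key=lambda kv: kv[1][0], reverse=True))


# Hypothetical (region, case_status) pairs
records = [("South", 1), ("South", 0), ("Midwest", 1),
           ("Midwest", 1), ("Midwest", 1), ("Midwest", 0)]
summary = approval_rate_by_region(records)
print(summary)
```

Keeping the application count next to the rate helps when reading the chart, since a high approval rate in a region with few applications is less reliable than the same rate in a large region.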
1.1.4 EDA 4: Job Training Requirements and Approval Rates: Is there a significant
relationship between the requirement for job training and visa certification
rates? Do applicants needing training face higher denial rates compared to
those who do not?
[24]: df = df.dropna(subset=['requires_job_training', 'case_status'])
job_training_vs_status = pd.crosstab(df['requires_job_training'], df['case_status'])
plt.figure(figsize=(8, 5))
sns.countplot(x='requires_job_training', hue='case_status', data=df, palette='Set2')
else:
    print("There is no significant relationship between job training requirement and visa certification rates (p >= 0.05).")
case_status               0      1
requires_job_training
N                      7513  15012
Y                       949   2006
Chi-Square Statistic: 1.7524844405674074
P-value: 0.18556470819406773
There is no significant relationship between job training requirement and visa
certification rates (p >= 0.05).
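The call that produced these numbers is not shown, but the reported statistic is consistent with a chi-square test of independence on the 2x2 crosstab with Yates' continuity correction. The sketch below recomputes it in plain Python from the counts printed above; the function name `chi2_2x2_yates` is hypothetical, and the use of the continuity correction is an inference from the numbers rather than something the notebook states.

```python
import math


def chi2_2x2_yates(table):
    """Chi-square test of independence for a 2x2 table with Yates' correction.

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            # Yates: shrink each |observed - expected| by 0.5, never below 0
            diff = max(abs(table[i][j] - expected) - 0.5, 0.0)
            statistic += diff ** 2 / expected
    # Survival function of chi-square with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Counts from the crosstab above (rows: N, Y; columns: 0 = Denied, 1 = Certified)
observed = [[7513, 15012], [949, 2006]]
stat, p = chi2_2x2_yates(observed)
print(stat, p)
```

The certification rates in the two groups are 15012/22525 and 2006/2955, which differ by about one percentage point, so the non-significant result matches what the raw counts suggest.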
1.1.5 EDA 5: Wage Disparities Across Education Levels: How does the prevailing
wage compare across different levels of education? Are higher-educated
applicants more likely to receive higher wages, and how does this impact their
visa approval status?
[25]: # Convert wage unit to yearly equivalent for easier comparison
def convert_to_yearly(row):
    if row['unit_of_wage'] == 'Hourly':
        return row['prevailing_wage'] * 2080 # Assuming 40 hours/week, 52 weeks/year
# Drop rows with missing or invalid wage or education data
df = df.dropna(subset=['prevailing_wage_yearly', 'education_of_employee'])
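Only the hourly branch of `convert_to_yearly` survives in this export. The sketch below shows one way the full conversion could look in plain Python. The hourly factor of 2080 comes from the notebook; the other unit labels (`'Weekly'`, `'Monthly'`, `'Yearly'`), their factors, and the function name `to_yearly` are assumptions and should be checked against the actual values in `unit_of_wage`.

```python
# Multipliers to a yearly figure. The hourly factor (40 hours x 52 weeks = 2080)
# comes from the notebook; the other labels and factors are assumptions.
YEARLY_MULTIPLIER = {"Hourly": 2080, "Weekly": 52, "Monthly": 12, "Yearly": 1}


def to_yearly(prevailing_wage, unit_of_wage):
    """Return the yearly-equivalent wage, or None for an unrecognized unit."""
    factor = YEARLY_MULTIPLIER.get(unit_of_wage)
    if factor is None:
        return None  # lets a later drop-missing step discard the row
    return prevailing_wage * factor


print(to_yearly(25.0, "Hourly"))
print(to_yearly(60000, "Yearly"))
print(to_yearly(10, "Daily"))
```

Returning `None` for unknown units keeps bad rows visible as missing values instead of silently treating them as yearly wages.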
<ipython-input-26-239460dd383b>:3: FutureWarning:
[27]: # Analyze the relationship between education level and visa approval
plt.figure(figsize=(12, 6))
# errorbar=None replaces the deprecated ci=None
sns.barplot(data=df, x='education_of_employee', y='prevailing_wage_yearly', hue='case_status', errorbar=None, palette='Set2')
plt.xlabel('Education Level')
plt.ylabel('Average Prevailing Wage (Yearly)')
plt.xticks(rotation=45)
plt.legend(title='Visa Status')
plt.tight_layout()
plt.show()
<ipython-input-27-b857e9760eaa>:3: FutureWarning:
The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.
[28]: # Calculate the average prevailing wage per education level and visa status
wage_education_visa = df.groupby(['education_of_employee', 'case_status'])['prevailing_wage_yearly'].mean().reset_index()
# Perform ANOVA to test if wage differences between education levels are statistically significant
anova_result = f_oneway(*wage_data)
print("ANOVA result:", anova_result)
df = df[['case_status', 'full_time_position']].dropna()
# Count the number of full-time and part-time positions for certified vs denied cases
# Conclusion
if p < 0.05:
    print("There is a significant difference in visa certification rates between full-time and part-time positions.")
else:
    print("There is no significant difference in visa certification rates between full-time and part-time positions.")
case_status            0      1
full_time_position
N                    852   1855
Y                   7610  15163
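The counts above are easier to compare as rates. The short sketch below converts each row of the crosstab into a certification rate, assuming that `case_status` 1 means Certified, as in the mapping used earlier in the notebook. The function name `certification_rates` is hypothetical.

```python
def certification_rates(crosstab):
    """Map each row label to certified / (denied + certified)."""
    return {label: certified / (denied + certified)
            for label, (denied, certified) in crosstab.items()}


# Counts from the crosstab above: (case_status 0, case_status 1)
full_time_counts = {"N": (852, 1855), "Y": (7610, 15163)}
rates = certification_rates(full_time_counts)
for label, rate in rates.items():
    print(f"full_time_position={label}: {rate:.1%} certified")
```

Whether the gap between the two rates is statistically meaningful is what the chi-square test in this section decides; the rates alone only describe the sample.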