Id5132 1

Uploaded by

sai.nidiginti

id5132-1

October 24, 2024

[1]: import pandas as pd

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from scipy.stats import chi2_contingency

from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_curve, auc

[2]: data = pd.read_csv(r"/content/EasyVisa (5).csv")

data

[2]: case_id continent education_of_employee has_job_experience \

0 EZYV01 Asia High School N
1 EZYV02 Asia Master's Y
2 EZYV03 Asia Bachelor's N
3 EZYV04 Asia Bachelor's N
4 EZYV05 Africa Master's Y
… … … … …
25475 EZYV25476 Asia Bachelor's Y
25476 EZYV25477 Asia High School Y
25477 EZYV25478 Asia Master's Y
25478 EZYV25479 Asia Master's Y
25479 EZYV25480 Asia Bachelor's Y

requires_job_training no_of_employees yr_of_estab \

0 N 14513 2007
1 N 2412 2002
2 Y 44444 2008
3 N 98 1897
4 N 1082 2005
… … … …
25475 Y 2601 2008

25476 N 3274 2006
25477 N 1121 1910
25478 Y 1918 1887
25479 N 3195 1960

region_of_employment prevailing_wage unit_of_wage full_time_position \

0 West 592.2029 Hour Y
1 Northeast 83425.6500 Year Y
2 West 122996.8600 Year Y
3 West 83434.0300 Year Y
4 South 149907.3900 Year Y
… … … … …
25475 South 77092.5700 Year Y
25476 Northeast 279174.7900 Year Y
25477 South 146298.8500 Year N
25478 West 86154.7700 Year Y
25479 Midwest 70876.9100 Year Y

case_status
0 Denied
1 Certified
2 Denied
3 Denied
4 Certified
… …
25475 Certified
25476 Certified
25477 Certified
25478 Certified
25479 Certified

[25480 rows x 12 columns]

[3]: # Check for null values

null_values = data.isnull().sum()

# Display the null values for each column

print("Null values in each column:\n", null_values)

# If you want to check if any null values are present

total_null_values = data.isnull().values.any()
print("\nAre there any null values in the dataset?", total_null_values)

Null values in each column:

case_id 0
continent 0
education_of_employee 0

has_job_experience 0
requires_job_training 0
no_of_employees 0
yr_of_estab 0
region_of_employment 0
prevailing_wage 0
unit_of_wage 0
full_time_position 0
case_status 0
dtype: int64

Are there any null values in the dataset? False

[4]: # Check for NaN values in the dataset

print("Missing values in dataset:\n", data.isnull().sum())

# Drop rows with NaN values in the target variable and in no_of_employees

data_clean = data.dropna(subset=['case_status', 'no_of_employees'])

Missing values in dataset:

case_id 0
continent 0
education_of_employee 0
has_job_experience 0
requires_job_training 0
no_of_employees 0
yr_of_estab 0
region_of_employment 0
prevailing_wage 0
unit_of_wage 0
full_time_position 0
case_status 0
dtype: int64

[5]: # Display the original DataFrame

print("Original DataFrame:")
print(data)

# Remove rows with any NaN values

data = data.dropna()

# Display the cleaned DataFrame

print("\nDataFrame after removing NaN values:")
print(data)

Original DataFrame:
case_id continent education_of_employee has_job_experience \
0 EZYV01 Asia High School N

1 EZYV02 Asia Master's Y
2 EZYV03 Asia Bachelor's N
3 EZYV04 Asia Bachelor's N
4 EZYV05 Africa Master's Y
… … … … …
25475 EZYV25476 Asia Bachelor's Y
25476 EZYV25477 Asia High School Y
25477 EZYV25478 Asia Master's Y
25478 EZYV25479 Asia Master's Y
25479 EZYV25480 Asia Bachelor's Y

requires_job_training no_of_employees yr_of_estab \

0 N 14513 2007
1 N 2412 2002
2 Y 44444 2008
3 N 98 1897
4 N 1082 2005
… … … …
25475 Y 2601 2008
25476 N 3274 2006
25477 N 1121 1910
25478 Y 1918 1887
25479 N 3195 1960

region_of_employment prevailing_wage unit_of_wage full_time_position \

case_status
0 Denied
1 Certified
2 Denied
3 Denied
4 Certified
… …
25475 Certified
25476 Certified
25477 Certified
25478 Certified

25479 Certified

[25480 rows x 12 columns]

DataFrame after removing NaN values:

case_id continent education_of_employee has_job_experience \
0 EZYV01 Asia High School N
1 EZYV02 Asia Master's Y
2 EZYV03 Asia Bachelor's N
3 EZYV04 Asia Bachelor's N
4 EZYV05 Africa Master's Y
… … … … …
25475 EZYV25476 Asia Bachelor's Y
25476 EZYV25477 Asia High School Y
25477 EZYV25478 Asia Master's Y
25478 EZYV25479 Asia Master's Y
25479 EZYV25480 Asia Bachelor's Y

requires_job_training no_of_employees yr_of_estab \

0 N 14513 2007
1 N 2412 2002
2 Y 44444 2008
3 N 98 1897
4 N 1082 2005
… … … …
25475 Y 2601 2008
25476 N 3274 2006
25477 N 1121 1910
25478 Y 1918 1887
25479 N 3195 1960

region_of_employment prevailing_wage unit_of_wage full_time_position \

case_status
0 Denied
1 Certified
2 Denied

3 Denied
4 Certified
… …
25475 Certified
25476 Certified
25477 Certified
25478 Certified
25479 Certified

[25480 rows x 12 columns]

1 Problem Definition
The goal of this analysis is to identify patterns and insights that can help in understanding the
factors influencing visa certification rates. By analyzing the dataset, we aim to facilitate the visa
approval process and recommend profiles for applicants based on significant drivers of case status.

1.1 EDA 1: Impact of Company Size on Visa Certification: How does the
number of employees in the employer’s company affect the likelihood of
visa certification? Is there a threshold for company size that influences
approval rates?
[6]: # Define the features and target variable
X = data_clean[['no_of_employees']]
y = data_clean['case_status'].map({'Certified': 1, 'Denied': 0})  # Convert to binary

# Scale the features

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)
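
Cell [6] fits the scaler on the full dataset before splitting, so statistics from the test rows feed into the scaling step. A minimal sketch of one alternative is shown here: split first, then let a scikit-learn pipeline fit the scaler on the training fold only. The data is a synthetic stand-in for `no_of_employees`, and all variable names are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): one numeric feature, binary target.
rng = np.random.default_rng(42)
X_demo = rng.lognormal(mean=7.0, sigma=1.0, size=(500, 1))
y_demo = rng.integers(0, 2, size=500)

# Split first so the test fold never influences the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits StandardScaler on X_tr only, then fits the classifier.
pipe = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
pipe.fit(X_tr, y_tr)

# predict_proba reuses the training-fold scaling for the test fold.
proba_te = pipe.predict_proba(X_te)[:, 1]
fitted_scaler = pipe.named_steps["standardscaler"]
```

With a single feature the practical effect is likely small, but the pattern keeps the evaluation honest once more columns are added.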

[7]: # Create a logistic regression model with class weight handling

model = LogisticRegression(class_weight='balanced')
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Print the classification report

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.33      0.91      0.49      1692
           1       0.69      0.09      0.17      3404

    accuracy                           0.37      5096
   macro avg       0.51      0.50      0.33      5096
weighted avg       0.57      0.37      0.27      5096

[8]: # Calculate ROC curve

fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)

# Plot ROC curve

plt.figure(figsize=(10, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))

plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=14)
plt.ylabel('True Positive Rate', fontsize=14)
plt.title('Receiver Operating Characteristic', fontsize=16)
plt.legend(loc='lower right')
plt.grid()
plt.show()

[9]: # Analyze thresholds (the 'precision' ratio below equals true precision only for balanced classes)
threshold_results = pd.DataFrame({
    'threshold': thresholds,
    'tpr': tpr,
    'fpr': fpr,
    'fnr': 1 - tpr,  # False Negative Rate
    'precision': tpr / (tpr + fpr)  # Precision calculation
})

<ipython-input-9-9c3953983ca3>:7: RuntimeWarning: invalid value encountered in divide
  'precision': tpr / (tpr + fpr) # Precision calculation

[10]: # Determine the best threshold where TPR is maximized and FPR is minimized
best_threshold_idx = (tpr - fpr).argmax()
best_threshold = thresholds[best_threshold_idx]
best_tpr = tpr[best_threshold_idx]
best_fpr = fpr[best_threshold_idx]

[11]: # Print best threshold and its metrics

print(f'Best threshold for company size influencing approval rates: {best_threshold:.2f}')

print(f'True Positive Rate (TPR) at best threshold: {best_tpr:.2f}')

print(f'False Positive Rate (FPR) at best threshold: {best_fpr:.2f}')

Best threshold for company size influencing approval rates: 0.50

True Positive Rate (TPR) at best threshold: 0.46
False Positive Rate (FPR) at best threshold: 0.43
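
The selection rule in cell [10], `(tpr - fpr).argmax()`, is Youden's J statistic. The value it returns is a cut-off on the predicted probability, not a number of employees. A small sketch on made-up scores (the labels and scores below are illustrative assumptions) shows the mechanics:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Toy labels and predicted probabilities (assumed values for illustration).
y_true = np.array([0, 0, 0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Youden's J = TPR - FPR; its maximum marks the cut-off farthest above
# the chance diagonal of the ROC plot.
j = tpr - fpr
best_idx = int(j.argmax())
best_cutoff = thresholds[best_idx]
print(best_cutoff, tpr[best_idx], fpr[best_idx])
```

A J value close to zero, as with TPR 0.46 and FPR 0.43 above, indicates that the chosen cut-off separates the classes barely better than chance.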

[12]: # Additional analysis to evaluate model performance at different thresholds

# Display TPR, FPR, Precision for a range of thresholds
print(threshold_results[['threshold', 'tpr', 'fpr', 'precision']])

threshold tpr fpr precision

0 inf 0.000000 0.000000 NaN
1 0.672882 0.000000 0.000591 0.000000
2 0.651916 0.000000 0.001182 0.000000
3 0.608998 0.000588 0.001182 0.332025
4 0.608817 0.000588 0.001773 0.248897
… … … … …
2394 0.498343 0.998825 0.998818 0.500002
2395 0.498335 0.999412 0.999409 0.500001
2396 0.498334 0.999706 0.999409 0.500074
2397 0.498333 0.999706 1.000000 0.499927
2398 0.498331 1.000000 1.000000 0.500000

[2399 rows x 4 columns]
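
The `precision` column in cell [9] is computed as `tpr / (tpr + fpr)`, which matches true precision only when both classes contain the same number of samples. A sketch of a count-based version follows, assuming the class sizes are taken from the test labels. The helper name `precision_from_roc` and the toy values are hypothetical.

```python
import numpy as np
from sklearn.metrics import precision_score, roc_curve

def precision_from_roc(tpr, fpr, n_pos, n_neg):
    """Precision = TP / (TP + FP), rebuilt from ROC rates and class sizes."""
    tp = tpr * n_pos
    fp = fpr * n_neg
    denom = tp + fp
    # Leave precision undefined (NaN) where nothing is predicted positive.
    return np.divide(tp, denom, out=np.full_like(denom, np.nan), where=denom > 0)

# Toy imbalanced data (assumed values): 2 negatives, 4 positives.
y_true = np.array([0, 0, 1, 1, 1, 1])
y_score = np.array([0.2, 0.6, 0.3, 0.7, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
prec = precision_from_roc(
    tpr, fpr, n_pos=(y_true == 1).sum(), n_neg=(y_true == 0).sum()
)

# Cross-check one cut-off against scikit-learn's own precision_score.
k = 4  # index of a finite threshold in this toy example
check = precision_score(y_true, (y_score >= thresholds[k]).astype(int))
naive = tpr[k] / (tpr[k] + fpr[k])  # the ratio used in cell [9]
```

In this toy case the count-based value agrees with `precision_score`, while the rate-only ratio does not, because the classes are unequal in size.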

1.1.1 Results with improvement

[13]: # Preprocess the data

# Convert categorical variables to numerical (e.g., 'case_status', 'has_job_experience', etc.)

data['case_status'] = data['case_status'].map({'Certified': 1, 'Denied': 0})

# Handle missing values (you may choose a strategy based on your data)
data.dropna(subset=['no_of_employees', 'case_status'], inplace=True)

[34]: # Create a function to visualize the approval rates based on company size
def plot_approval_rates(data):
    plt.figure(figsize=(20, 12))
    sns.countplot(data=data, x='no_of_employees', hue='case_status', palette='coolwarm')

    plt.title('Visa Certification by Company Size')

    plt.xlabel('Number of Employees')
    plt.ylabel('Number of Applications')
    plt.xticks(rotation=45)
    plt.legend(title='Case Status', labels=['Denied', 'Certified'])
    plt.show()

# Call the function to plot

plot_approval_rates(data)
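
A count plot with one bar per distinct employee count is hard to read for a near-continuous variable. One possible alternative, assuming `case_status` is already 0/1 as in cell [13], is to bin company size into quantiles and compare the certification rate per bin. The data below is synthetic, and the column name `size_bin` is a hypothetical choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the notebook's data (assumption).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "no_of_employees": rng.integers(10, 50_000, size=1_000),
    "case_status": rng.integers(0, 2, size=1_000),  # 1 = Certified, 0 = Denied
})

# Four equal-frequency bins of company size.
demo["size_bin"] = pd.qcut(demo["no_of_employees"], q=4, duplicates="drop")

# The mean of a 0/1 column is the certification rate within each bin.
rate_by_bin = (
    demo.groupby("size_bin", observed=True)["case_status"]
        .agg(applications="size", certification_rate="mean")
)
print(rate_by_bin)
```

If the rate changes sharply between two adjacent bins, the bin edge gives a rough candidate for a company-size threshold, which the probability cut-off from the ROC analysis cannot provide.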

[15]: # Further analysis with logistic regression
# Define the features and target variable
X = data[['no_of_employees']]
y = data['case_status']

# Split the dataset into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[16]: # Create a logistic regression model

model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Print the classification report

print(classification_report(y_test, y_pred))

precision recall f1-score support

0 0.00 0.00 0.00 1695

1 0.67 1.00 0.80 3401

accuracy 0.67 5096

macro avg 0.33 0.50 0.40 5096
weighted avg 0.45 0.67 0.53 5096

/usr/local/lib/python3.10/dist-packages/sklearn/metrics/_classification.py:1531:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels
with no predicted samples. Use `zero_division` parameter to control this
behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
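
The report above shows that class 0 is never predicted, which is why its precision is undefined and the warning appears. A short sketch on made-up labels shows how to confirm this with a confusion matrix and how to set the `zero_division` parameter named in the warning message:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels (assumption): the model predicts class 1 for every row.
y_true = np.array([0, 0, 1, 1, 1, 1])
y_pred = np.ones_like(y_true)

# Column 0 of the confusion matrix is all zeros: class 0 is never predicted.
cm = confusion_matrix(y_true, y_pred)

# zero_division=0 reports the undefined precision as 0.0 without a warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(cm)
print(report["0"]["precision"], report["1"]["recall"])
```

Silencing the warning does not change the underlying result: a model that always predicts "Certified" has learned nothing from company size.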

[17]: # Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
roc_auc = auc(fpr, tpr)

# Plot ROC curve

plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label='ROC curve (area = {:.2f})'.format(roc_auc))

plt.plot([0, 1], [0, 1], color='red', lw=2, linestyle='--')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc='lower right')
plt.show()

[18]: # Analyze thresholds

threshold_results = pd.DataFrame({
    'threshold': thresholds,
    'tpr': tpr,
    'fpr': fpr
})

# Determine the best threshold where TPR is maximized and FPR is minimized
best_threshold = thresholds[(tpr - fpr).argmax()]
print(f'Best threshold for company size influencing approval rates: {best_threshold}')

Best threshold for company size influencing approval rates: 0.6678088542127053

1.1.2 EDA 2: Does the age of the employer’s company (years since establishment)
correlate with visa certification rates? Are newer companies more or less likely
to receive certifications compared to established ones?
[19]: import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'data' is your DataFrame containing the dataset

current_year = 2024 # You can dynamically get the current year using datetime

# Step 1: Calculate the age of the company

data['company_age'] = current_year - data['yr_of_estab']

# Step 2: Group by 'case_status' and calculate the average company age for each status

avg_age_by_status = data.groupby('case_status')['company_age'].mean().reset_index()

# Step 3: Plotting the results

plt.figure(figsize=(8, 5))
sns.barplot(x='case_status', y='company_age', data=avg_age_by_status, palette='viridis')

plt.title('Average Company Age by Visa Certification Status')

plt.xlabel('Visa Certification Status')
plt.ylabel('Average Company Age (Years)')
plt.show()

# Step 4: Calculate Pearson correlation between company age and visa certification

# 'case_status' is already binary (Certified = 1, Denied = 0) after cell [13];
# mapping the string labels again would turn every value into NaN

data['case_status_binary'] = data['case_status']

# Calculate correlation
correlation = data[['company_age', 'case_status_binary']].corr()

print("Correlation between company age and visa certification status:")
print(correlation)

<ipython-input-19-858b7bf4bf07>:16: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x='case_status', y='company_age', data=avg_age_by_status, palette='viridis')

Correlation between company age and visa certification status:

company_age case_status_binary
company_age 1.0 NaN
case_status_binary NaN NaN
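
The all-NaN correlation above is what results when string labels are mapped onto a column that is already 0/1, as `case_status` is after cell [13]. A minimal sketch on a toy frame (all values are illustrative assumptions) shows the failure and a version that yields a usable coefficient:

```python
import pandas as pd

# Toy frame (assumed values); case_status is already numeric, as after cell [13].
toy = pd.DataFrame({
    "company_age": [5, 20, 35, 50, 80, 120],
    "case_status": [0, 1, 0, 1, 1, 1],
})

# Re-mapping string labels on a numeric column matches nothing -> all NaN.
remapped = toy["case_status"].map({"Certified": 1, "Denied": 0})

# Using the existing 0/1 column directly gives a valid Pearson coefficient
# (equivalent to the point-biserial correlation for a binary variable).
r = toy["company_age"].corr(toy["case_status"])
print(remapped.isna().all(), round(r, 3))
```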

1.1.3 EDA 3: Regional Trends in Visa Certification: How does the region of employment
influence the likelihood of visa approval? Are certain regions more favorable for visa
certifications than others, and what might explain these differences?
[20]: # Map case_status: 1 = Certified, 0 = Denied
data['case_status_label'] = data['case_status'].map({1: 'Certified', 0: 'Denied'})

# Data Cleaning (Handle missing values if any)

# Drop rows with missing values in key columns (if necessary)
df = data.dropna(subset=['region_of_employment', 'case_status_label'])

[21]: # Calculate visa approval rates by region

# Group by region and case status to count the occurrences
region_case_status = df.groupby(['region_of_employment', 'case_status_label']).size().unstack(fill_value=0)

# Add a column for the total number of applications in each region

region_case_status['total_applications'] = region_case_status['Certified'] + region_case_status['Denied']

# Calculate the certification rate for each region

region_case_status['approval_rate'] = (region_case_status['Certified'] / region_case_status['total_applications']) * 100

# Sort regions by approval rate in descending order

region_case_status_sorted = region_case_status.sort_values(by='approval_rate', ascending=False)

# Display the approval rate per region

print(region_case_status_sorted[['Certified', 'Denied', 'approval_rate']])

case_status_label Certified Denied approval_rate

region_of_employment
Midwest 3253 1054 75.528210
South 4913 2104 70.015676
Northeast 4526 2669 62.904795
West 4100 2486 62.253265
Island 226 149 60.266667
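
As a cross-check, the approval rates can be recomputed directly from the counts printed above by normalizing each row. The sketch below re-enters those counts into a small frame (the frame name `counts` is an assumption made for illustration). On the raw rows, `pd.crosstab(..., normalize='index')` would produce the same shares.

```python
import pandas as pd

# Counts copied from the table printed above.
counts = pd.DataFrame(
    {"Certified": [3253, 4913, 4526, 4100, 226],
     "Denied": [1054, 2104, 2669, 2486, 149]},
    index=pd.Index(["Midwest", "South", "Northeast", "West", "Island"],
                   name="region_of_employment"),
)

# Row-normalize so each region's shares sum to 100 percent.
shares = counts.div(counts.sum(axis=1), axis=0) * 100
approval_rate = shares["Certified"].sort_values(ascending=False)
print(approval_rate.round(2))
```

Island has only 375 applications in total, so its rate is estimated far less precisely than the rates of the four larger regions.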

[22]: # Plot the regional trends in visa certification using a bar chart
plt.figure(figsize=(10, 6))
sns.barplot(x=region_case_status_sorted.index, y=region_case_status_sorted['approval_rate'], palette="Blues_d")

plt.title('Visa Approval Rate by Region of Employment', fontsize=16)

plt.xlabel('Region of Employment', fontsize=12)
plt.ylabel('Approval Rate (%)', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

<ipython-input-22-bb2add09d93a>:3: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=region_case_status_sorted.index, y=region_case_status_sorted['approval_rate'], palette="Blues_d")

[23]: # Plot total applications by region to give additional context

plt.figure(figsize=(10, 6))
sns.barplot(x=region_case_status_sorted.index, y=region_case_status_sorted['total_applications'], palette="Greens_d")

plt.title('Total Visa Applications by Region of Employment', fontsize=16)

plt.xlabel('Region of Employment', fontsize=12)
plt.ylabel('Total Applications', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

<ipython-input-23-50e2a68b1904>:3: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.barplot(x=region_case_status_sorted.index, y=region_case_status_sorted['total_applications'], palette="Greens_d")

1.1.4 EDA 4: Job Training Requirements and Approval Rates: Is there a significant
relationship between the requirement for job training and visa certification
rates? Do applicants needing training face higher denial rates compared to
those who do not?
[24]: df = df.dropna(subset=['requires_job_training', 'case_status'])

# Step 3: Group the data by `requires_job_training` and analyze `case_status`

# Create a crosstab to view the relationship between job training and case status

job_training_vs_status = pd.crosstab(df['requires_job_training'], df['case_status'])

# Display the crosstab

print(job_training_vs_status)

# Step 4: Visualize the relationship using a bar plot

plt.figure(figsize=(8, 5))
sns.countplot(x='requires_job_training', hue='case_status', data=df, palette='Set2')

plt.title('Visa Certification Rates Based on Job Training Requirement')

plt.xlabel('Requires Job Training')
plt.ylabel('Count of Cases')
plt.legend(title='Visa Status', loc='upper right')
plt.show()

# Step 5: Perform a Chi-Square test to check for significance

# Creating a contingency table
contingency_table = job_training_vs_status.values

# Performing the chi-square test

chi2, p, dof, expected = chi2_contingency(contingency_table)

# Display the test results

print(f"Chi-Square Statistic: {chi2}")
print(f"P-value: {p}")

# Step 6: Interpret the result

if p < 0.05:
    print("There is a significant relationship between job training requirement and visa certification rates (p < 0.05).")
else:
    print("There is no significant relationship between job training requirement and visa certification rates (p >= 0.05).")

case_status 0 1
requires_job_training
N 7513 15012
Y 949 2006

Chi-Square Statistic: 1.7524844405674074
P-value: 0.18556470819406773
There is no significant relationship between job training requirement and visa
certification rates (p >= 0.05).
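
To make the non-significant result concrete, the sketch below re-enters the contingency table printed above and reports the certification rate per group alongside the chi-square test. Only the counts shown in the output are used, and the variable names are illustrative.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: requires_job_training N, Y.
# Columns: case_status 0 (Denied), 1 (Certified).
table = np.array([[7513, 15012],
                  [949, 2006]])

# Certification rate per row = certified / row total.
cert_rate = table[:, 1] / table.sum(axis=1)

# chi2_contingency applies Yates' continuity correction to 2x2 tables by default.
chi2, p, dof, expected = chi2_contingency(table)
print(cert_rate.round(4), round(chi2, 4), round(p, 4), dof)
```

The two groups differ by little more than one percentage point in certification rate, which is consistent with a p-value well above 0.05.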

1.1.5 EDA 5: Wage Disparities Across Education Levels: How does the prevailing
wage compare across different levels of education? Are higher-educated applicants
more likely to receive higher wages, and how does this impact their visa
approval status?
[25]: # Convert wage unit to yearly equivalent for easier comparison
def convert_to_yearly(row):
    unit = row['unit_of_wage']
    # The data preview shows units such as 'Hour' and 'Year', so both the
    # short and the '-ly' spellings are accepted here
    if unit in ('Hour', 'Hourly'):
        return row['prevailing_wage'] * 2080  # Assuming 40 hours/week, 52 weeks/year
    elif unit in ('Week', 'Weekly'):
        return row['prevailing_wage'] * 52  # 52 weeks/year
    elif unit in ('Month', 'Monthly'):
        return row['prevailing_wage'] * 12  # 12 months/year
    else:
        return row['prevailing_wage']  # Already yearly

df['prevailing_wage_yearly'] = df.apply(convert_to_yearly, axis=1)

# Drop rows with missing or invalid wage or education data
df = df.dropna(subset=['prevailing_wage_yearly', 'education_of_employee'])
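
The row-wise `apply` can also be written as a vectorized lookup. The sketch below runs on a toy frame and assumes that `unit_of_wage` takes the values 'Hour', 'Week', 'Month', and 'Year'. 'Hour' and 'Year' appear in the data preview earlier in the notebook, while the other two spellings and the name `YEARLY_FACTOR` are assumptions.

```python
import pandas as pd

# Multipliers to a yearly figure; 2080 = 40 hours/week * 52 weeks.
# The 'Week' and 'Month' keys are assumed spellings.
YEARLY_FACTOR = {"Hour": 2080, "Week": 52, "Month": 12, "Year": 1}

toy = pd.DataFrame({
    "prevailing_wage": [592.2029, 83425.65, 1500.0, 6000.0],
    "unit_of_wage": ["Hour", "Year", "Week", "Month"],
})

# map() looks up each unit; an unknown unit becomes NaN instead of being
# silently treated as yearly.
factor = toy["unit_of_wage"].map(YEARLY_FACTOR)
toy["prevailing_wage_yearly"] = toy["prevailing_wage"] * factor
print(toy)
```

Unlike the `else` branch in the function above, an unexpected unit surfaces here as a missing value, which the subsequent `dropna` would then remove.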

[26]: # Plot distribution of prevailing wage across different education levels

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x='education_of_employee', y='prevailing_wage_yearly', palette='Set3')

plt.title('Prevailing Wage Across Education Levels')

plt.xlabel('Education Level')
plt.ylabel('Prevailing Wage (Yearly)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

<ipython-input-26-239460dd383b>:3: FutureWarning:

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=df, x='education_of_employee', y='prevailing_wage_yearly', palette='Set3')

[27]: # Analyze the relationship between education level and visa approval
plt.figure(figsize=(12, 6))
sns.barplot(data=df, x='education_of_employee', y='prevailing_wage_yearly', hue='case_status', ci=None, palette='Set2')

plt.title('Prevailing Wage Across Education Levels Based on Visa Status')

plt.xlabel('Education Level')
plt.ylabel('Average Prevailing Wage (Yearly)')
plt.xticks(rotation=45)
plt.legend(title='Visa Status')
plt.tight_layout()
plt.show()

<ipython-input-27-b857e9760eaa>:3: FutureWarning:

The `ci` parameter is deprecated. Use `errorbar=None` for the same effect.

  sns.barplot(data=df, x='education_of_employee', y='prevailing_wage_yearly', hue='case_status', ci=None, palette='Set2')

[28]: # Calculate the average prevailing wage per education level and visa status
wage_education_visa = df.groupby(['education_of_employee', 'case_status'])['prevailing_wage_yearly'].mean().reset_index()

# Display the results

print(wage_education_visa)

# Further statistical analysis (optional)

# Compare wage differences across education levels
from scipy.stats import f_oneway

# Extract wages for each education level

education_levels = df['education_of_employee'].unique()
wage_data = [df[df['education_of_employee'] == edu]['prevailing_wage_yearly'] for edu in education_levels]

# Perform ANOVA to test if wage differences between education levels are statistically significant

anova_result = f_oneway(*wage_data)
print("ANOVA result:", anova_result)

education_of_employee case_status prevailing_wage_yearly

0 Bachelor's 0 66828.620185
1 Bachelor's 1 77399.880153
2 Doctorate 0 64579.805167
3 Doctorate 1 64558.333989
4 High School 0 69856.515009
5 High School 1 74926.673079
6 Master's 0 71707.831941
7 Master's 1 80782.520567
ANOVA result: F_onewayResult(statistic=52.84780210274433, pvalue=4.817396648029027e-34)

[29]: # Data preprocessing: handling missing values in case_status and full_time_position columns

df = df[['case_status', 'full_time_position']].dropna()

# Count the number of full-time and part-time positions for certified vs denied cases

visa_status_counts = pd.crosstab(df['full_time_position'], df['case_status'])

print(visa_status_counts)

# Plot the results

plt.figure(figsize=(8, 6))
sns.countplot(data=df, x='full_time_position', hue='case_status', palette='Set2')

plt.title('Visa Certification Outcomes by Employment Status')

plt.xlabel('Full-Time Position (Y=Full-Time, N=Part-Time)')
plt.ylabel('Count')
plt.show()

# Perform Chi-Square Test to check for significant association

chi2, p, dof, expected = chi2_contingency(visa_status_counts)

# Output the test result

print(f"Chi-Square Test Statistic: {chi2}")
print(f"P-value: {p}")

# Conclusion
if p < 0.05:
    print("There is a significant difference in visa certification rates between full-time and part-time positions.")
else:
    print("There is no significant difference in visa certification rates between full-time and part-time positions.")

case_status 0 1
full_time_position
N 852 1855
Y 7610 15163

Chi-Square Test Statistic: 4.029931866713214

P-value: 0.0446997461068597
There is a significant difference in visa certification rates between full-time
and part-time positions.
