Project 3 - Income Qualification - Source Code

Uploaded by

sneha fabey
Copyright
© © All Rights Reserved

Project2: Income Qualification

Import os and Warnings

Problem Statement Scenario:


Many social programs have a hard time making sure the right people are given enough aid. It’s tricky when a
program focuses on the poorest segment of the population. This segment of population can’t provide the
necessary income and expense records to prove that they qualify.

In Latin America, a popular method called Proxy Means Test (PMT) uses an algorithm to verify income
qualification. With PMT, agencies use a model that considers a family’s observable household attributes like
the material of their walls and ceiling or the assets found in their homes to classify them and predict their
level of need. While this is an improvement, accuracy remains a problem as the region’s population grows
and poverty declines.

The Inter-American Development Bank (IDB) believes that new methods beyond traditional econometrics,
based on a dataset of Costa Rican household characteristics, might help improve PMT’s performance.

Let us explore our dataset before moving further

Let us identify our target variable


Let's understand the types of data.

We have mixed data types, as specified below:

• float64: 8 variables
• int64: 130 variables
• object: 5 variables
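
The type summary above can be reproduced with a couple of pandas calls. The following is a minimal sketch on a toy frame; the values are made up and only the column names `v2a1` and `idhogar` come from this document (`rooms` is a hypothetical stand-in for an integer column).

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
df = pd.DataFrame({
    "v2a1": [190000.0, None, 135000.0],                     # float column
    "rooms": [3, 4, 8],                                     # integer column (hypothetical)
    "idhogar": pd.Series(["a1", "b2", "c3"], dtype=object), # object column
})

# Count how many columns fall under each dtype.
dtype_counts = df.dtypes.value_counts()

# List the object columns, which need conversion before modelling.
object_cols = df.select_dtypes(include="object").columns.tolist()

print(dtype_counts)
print(object_cols)
```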

Below is the data dictionary for the object variables above:

• ID: unique ID
• idhogar: household-level identifier
• dependency: dependency rate, calculated as (number of household members younger than 19
or older than 64) / (number of household members between 19 and 64)
• edjefe: years of education of the male head of household, based on the interaction of escolari (years of
education), head of household and gender; yes=1 and no=0
• edjefa: years of education of the female head of household, based on the interaction of escolari (years of
education), head of household and gender; yes=1 and no=0
Let's convert the object variables into numerical data.
Now all data is in numerical form.
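
One way to perform this conversion, sketched on made-up values: replace "yes" with 1 and "no" with 0, as the data dictionary specifies, then cast the column to a numeric type. How the identifier columns (ID, idhogar) are handled is not shown in this document, so they are left out here.

```python
import pandas as pd

# Toy values; the real columns mix numeric strings with "yes"/"no".
df = pd.DataFrame({
    "dependency": pd.Series(["no", "8", "yes", ".5"], dtype=object),
    "edjefe": pd.Series(["10", "no", "yes", "6"], dtype=object),
    "edjefa": pd.Series(["no", "11", "no", "yes"], dtype=object),
})

# Per the data dictionary: yes=1 and no=0.
mapping = {"yes": "1", "no": "0"}

for col in ["dependency", "edjefe", "edjefa"]:
    # Swap the labels for numeric strings, then parse the whole column.
    df[col] = pd.to_numeric(df[col].replace(mapping))

print(df.dtypes)
```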

Let's identify variables with zero variance.

elimbasu5: =1 if rubbish disposal is mainly by throwing in river, creek or sea.

Interpretation: The output above shows that all values of elimbasu5 are the same, so the variable has no
variability in the dataset. Therefore we will drop this variable.
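
A minimal sketch of the zero-variance check, assuming a column counts as constant when it has a single distinct value. The data is made up; `rooms` is a hypothetical column.

```python
import pandas as pd

# Toy data: elimbasu5 is constant, the other columns vary.
df = pd.DataFrame({
    "elimbasu5": [0, 0, 0, 0],
    "rooms": [3, 4, 8, 5],
    "v18q": [0, 1, 1, 0],
})

# A column with one distinct value carries no information for the model.
zero_var_cols = [c for c in df.columns if df[c].nunique(dropna=False) == 1]

# Drop the constant columns.
df = df.drop(columns=zero_var_cols)

print(zero_var_cols)
```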

Check if there are any biases in your dataset.


The variables ('r4t3', 'hogar_total') are strongly related to each other, so for a good result we need
only one of them.
The variables ('tipovivi3', 'v2a1') are strongly related to each other, so for a good result we need only
one of them.
The variables ('v18q', 'v18q1') are strongly related to each other, so for a good result we need only
one of them.

Conclusion: Because these pairs of variables carry the same information, there is bias in our dataset.
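
Related pairs like these can be found from the correlation matrix. The sketch below is one possible approach, not necessarily the one used in this project: the 0.95 cutoff is an assumed value and the data is made up.

```python
import numpy as np
import pandas as pd

# Toy data: r4t3 and hogar_total are identical, rooms is unrelated.
df = pd.DataFrame({
    "r4t3": [4, 2, 5, 3, 6],
    "hogar_total": [4, 2, 5, 3, 6],
    "rooms": [3, 5, 2, 5, 4],
})

# Absolute pairwise Pearson correlations.
corr = df.corr().abs()

# Keep only the upper triangle so each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.95  # assumed cutoff, not taken from the document
pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]

print(pairs)
```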

Check if there is a house without a family head.


"parentesco1" =1 if household head
Interpretation: The cross tab above shows households with 0 male heads and 0 female heads, which implies
that there are 435 families with no family head.
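
As an alternative to a cross tab, the same check can be sketched with a group-by: sum `parentesco1` per `idhogar` and look for households where the sum is 0. The household ids below are made up.

```python
import pandas as pd

# Toy data: household h2 has no member flagged as head.
df = pd.DataFrame({
    "idhogar": ["h1", "h1", "h2", "h2", "h3"],
    "parentesco1": [1, 0, 0, 0, 1],
})

# Number of heads recorded for each household.
heads_per_household = df.groupby("idhogar")["parentesco1"].sum()

# Households whose head count is zero.
no_head = heads_per_household[heads_per_household == 0].index.tolist()

print(len(no_head), no_head)
```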

Count how many null values exist in each column.


Interpretation: There are no null values in the Target variable. Now let's proceed and identify and fill
the null values of the other variables.
Interpretation and action: 'v2a1', 'v18q1' and 'rez_esc' have more than 50% null values. For v2a1,
families that own their house don't pay rent, so the missing value should be 0. Similarly, for
v18q1 there can be families with 0 tablets.

Instead, we can drop the columns tipovivi3 and v18q:

• tipovivi3: =1 if rented
• v18q: owns a tablet

v2a1 alone can show whether the household pays rent, and v18q1 alone can show whether the respondent owns a tablet.
Interpretation: Now there are no null values in our dataset.
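
The null-handling steps can be sketched as follows on made-up values: measure the share of missing values per column, then fill v2a1 and v18q1 with 0 under the reasoning given above (no rent paid, no tablets owned). The treatment of rez_esc is not spelled out in this document, so it is omitted here.

```python
import pandas as pd

# Toy data with missing rent and tablet counts.
df = pd.DataFrame({
    "v2a1": [190000.0, None, None, 135000.0],
    "v18q1": [None, 1.0, None, 2.0],
    "rooms": [3, 4, 8, 5],  # hypothetical fully populated column
})

# Fraction of missing values in each column.
null_share = df.isnull().mean()

# Missing rent or tablet count is interpreted as zero.
df[["v2a1", "v18q1"]] = df[["v2a1", "v18q1"]].fillna(0)

print(null_share)
print(df.isnull().sum())
```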

Set the poverty level of the members and the head of the house to be the same within a family.
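
A minimal sketch of this alignment, assuming the label column is named `Target` and that each household has at most one head: look up the head's level per household and assign it to every member. Households without a head keep their original values.

```python
import pandas as pd

# Toy data: one member of h1 disagrees with the head's level.
df = pd.DataFrame({
    "idhogar": ["h1", "h1", "h1", "h2", "h2"],
    "parentesco1": [1, 0, 0, 1, 0],
    "Target": [4, 3, 4, 2, 2],  # assumed name of the label column
})

# Poverty level of the head, indexed by household id.
head_target = df.loc[df["parentesco1"] == 1].set_index("idhogar")["Target"]

# Give every member the head's level; keep the old value if no head exists.
df["Target"] = df["idhogar"].map(head_target).fillna(df["Target"]).astype(int)

print(df)
```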
People below the poverty level can be taken to be those who pay low rent and don't own a
house. The threshold also depends on whether the house is in an urban or a rural area.

• Rural area: people paying rent of less than 8000 are below the poverty level.
• Urban area: people paying rent of less than 140000 are below the poverty level.
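
The two rent thresholds can be applied in one vectorised step. This is a sketch on made-up rents; `is_urban` is a hypothetical indicator column (1 = urban, 0 = rural) introduced for illustration, not a column named in this document.

```python
import numpy as np
import pandas as pd

RURAL_RENT_LIMIT = 8000     # threshold stated for rural areas
URBAN_RENT_LIMIT = 140000   # threshold stated for urban areas

# Toy data; is_urban is a hypothetical area indicator.
df = pd.DataFrame({
    "v2a1": [5000, 9000, 120000, 150000],
    "is_urban": [0, 0, 1, 1],
})

# Pick the limit that applies to each row's area.
limit = np.where(df["is_urban"] == 1, URBAN_RENT_LIMIT, RURAL_RENT_LIMIT)

# Rent strictly below the limit counts as below the poverty level.
df["below_poverty"] = df["v2a1"] < limit

print(df)
```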
Interpretation:

• A total of 1242 people are above the poverty level regardless of whether the area is rural or urban.
• The level of the remaining 1111 people depends on their area.

Rural :

Above poverty level= 445

Urban :

Above poverty level =1103

Below poverty level=1081


Applying standard scaling to the dataset
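
Standard scaling with scikit-learn might look like the sketch below. The arrays are made up; the point is that the scaler is fitted on the training features only and then reused unchanged on the test features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrices (values are made up).
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[2.0, 400.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data, then scale it.
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
```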

Now we will proceed to model fitting

Let's identify the best parameters for our model using GridSearchCV
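
A minimal sketch of a grid search over a random forest. The parameter grid, the 3-fold cross-validation and the synthetic four-class data are all assumptions for illustration; the grid actually used in this project is not shown in this document.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# Assumed grid: every combination is tried with cross-validation.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

best_params = search.best_params_
print(best_params, search.best_score_)
```

After fitting, `search.best_estimator_` holds a model refitted on all of the training data with the best parameters, which can then be used for prediction.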


Let's apply the same cleaning to the test data and then generate predictions for it.

Interpretation : Above is our prediction for test data.

Conclusion :
Using a RandomForestClassifier, we can predict test_data with an accuracy of 90%.
