
Assignment II

Prediction of Credit Card Defaulters


Hive:
I initially dropped the ID column in Hive, but my VM crashed and I lost all the screenshots for that step.

Understand and analyze the Dataset

First we load the data.


We then remove the ID column since it is not required.

We look at the schema.

We look at the summary statistics of the numerical columns.


We then look at the distribution of each feature.
Next we examine the distribution of the target variable.
As can be seen, the dataset is imbalanced: one class of the target is far more frequent than the other.

Next we check if there are any null values.

Next we find the correlation between different features.


We can see that the bill_amt columns are highly correlated with one another. Since logistic regression assumes the features are not strongly correlated (multicollinearity makes the coefficient estimates unstable), we remove bill_amt2 through bill_amt6.

We then change the target variable from 0/1 to No/Yes.


We transform the pay columns, since the one-hot encoder needs them to be category indices starting from 0.
Determine the features.

We ignore bill_amt2 through bill_amt6, as stated above.

We first transform the categorical columns to a one-hot representation.

Then we assemble all the required features into a single vector so that they can be fed as input to the logistic regression model.

We also scale the data to zero mean and unit variance.

We do all this by creating a pipeline of transformations, fitting it, and then passing the features through the fitted pipeline.
Divide dataset

We split the dataset into train and test sets in a 60:40 ratio.


Determine a Model and its measurement function

We define a logistic regression model and train the model on the train dataset.

Verify the Model accuracy.

We look at the area under the ROC curve, the accuracy, and the F1-score of our model.
Use the Spark Web UI to determine which tasks take most of your
execution time.

The fit command took the most time. It spawned 106 jobs with 126 stages. The maximum time spent in a single stage was 7 seconds, as shown above.
