6 XG Boost - Jupyter Notebook

The document shows code for loading and preparing a customer churn dataset from a CSV file and modeling it with XGBoost. It loads the data, label-encodes the categorical features (Geography and Gender), one-hot encodes Geography, splits the data into training and test sets, fits an XGBoost classifier on the training set, makes predictions on the test set, and evaluates the model with a confusion matrix and an accuracy score.

Uploaded by venkatesh m
© All Rights Reserved

In [1]: import numpy as np
        import matplotlib.pyplot as plt
        import pandas as pd

In [2]: dataset = pd.read_csv("D:\\Course\\Python\\Datasets\\Churn_Modelling.csv")


In [3]: dataset

Out[3]:       RowNumber  CustomerId  Surname    CreditScore  Geography  Gender  Age  Tenure   Balance  …

        0             1    15634602  Hargrave           619  France     Female   42       2       0.0
        1             2    15647311  Hill               608  Spain      Female   41       1   83807.8
        2             3    15619304  Onio               502  France     Female   42       8  159660.8
        3             4    15701354  Boni               699  France     Female   39       1       0.0
        4             5    15737888  Mitchell           850  Spain      Female   43       2  125510.8
        ...         ...         ...  ...                ...  ...        ...     ...     ...       ...
        9995       9996    15606229  Obijiaku           771  France     Male     39       5       0.0
        9996       9997    15569892  Johnstone          516  France     Male     35      10   57369.6
        9997       9998    15584532  Liu                709  France     Female   36       7       0.0
        9998       9999    15682355  Sabbatini          772  Germany    Male     42       3   75075.3
        9999      10000    15628319  Walker             792  France     Female   28       4  130142.7

        10000 rows × 14 columns

In [4]: X = dataset.iloc[:, 3:13].values
        X
        y = dataset.iloc[:, 13].values
        y

...

In [5]: X
        test1 = pd.DataFrame(X)
        test1

...
In [6]: from sklearn.preprocessing import LabelEncoder, OneHotEncoder

        # Convert the categorical columns (Geography, Gender) into numbers (0, 1, 2)
        labelencoder_X_1 = LabelEncoder()
        X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])

        labelencoder_X_2 = LabelEncoder()
        X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])

In [7]: X
        test2 = pd.DataFrame(X)
        test2

...

In [ ]: # Creating 3 dummy variables for country (a factor with 3 levels: Spain, France and Germany)

        #onehotencoder = OneHotEncoder()
        #X = onehotencoder.fit_transform(X).toarray()
        #X = X[:, 1:]
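The commented-out code above one-hot encodes the country and then drops the first dummy column. A minimal sketch of the same idea on toy data (assuming scikit-learn ≥ 0.21, where `OneHotEncoder` accepts string columns and a `drop` parameter that handles the dummy-variable trap in one step):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy Geography column with the three levels seen in the dataset.
countries = np.array([["France"], ["Spain"], ["Germany"], ["France"]])

# drop='first' removes one dummy column: 3 levels -> 2 columns.
encoder = OneHotEncoder(drop="first")
dummies = encoder.fit_transform(countries).toarray()
print(dummies.shape)  # (4, 2)
```

Dropping one level avoids perfectly collinear dummy columns; tree-based models like XGBoost tolerate the full set, but linear models generally should not see it.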

In [8]: from sklearn.compose import ColumnTransformer

        ct = ColumnTransformer([("Geography", OneHotEncoder(), [1])], remainder='passthrough')
        X = ct.fit_transform(X)

C:\Users\rgandyala\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\preprocessing\_encoders.py:415: FutureWarning: The handling of integer data will change in version 0.22. Currently, the categories are determined based on the range [0, max(values)], while in the future they will be determined based on the unique values.

If you want the future behaviour and silence this warning, you can specify "categories='auto'".

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.

  warnings.warn(msg, FutureWarning)
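As the warning itself suggests, newer scikit-learn versions let `OneHotEncoder` consume string categories directly, so the earlier `LabelEncoder` pass on Geography becomes unnecessary. A minimal sketch on made-up rows (the `[CreditScore, Geography, Gender]` column layout here is an illustrative assumption, not the full 10-column matrix from the notebook):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy feature rows shaped like [CreditScore, Geography, Gender].
rows = np.array([[619, "France", "Female"],
                 [608, "Spain", "Female"],
                 [772, "Germany", "Male"]], dtype=object)

# One-hot encode the raw Geography strings; sparse_threshold=0 forces a
# dense array so the result is easy to inspect.
ct = ColumnTransformer([("Geography", OneHotEncoder(), [1])],
                       remainder="passthrough", sparse_threshold=0)
out = ct.fit_transform(rows)
print(out.shape)  # (3, 5): 3 dummy columns + 2 passthrough columns
```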

In [9]: abc = pd.DataFrame(X)
        abc

...
In [10]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
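One caveat with the split above: no `random_state` is passed, so every run produces a different 80/20 split and the confusion matrix and accuracy further down will vary slightly between runs. A small sketch on toy data showing how fixing the seed makes the split reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# The same random_state yields identical splits on every call.
Xa, _, ya, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
Xb, _, yb, _ = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
print(np.array_equal(Xa, Xb))  # True
```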

In [12]: from xgboost.sklearn import XGBClassifier

         classifier = XGBClassifier()

In [13]: classifier.fit(X_train,y_train)

...

In [14]: y_pred = classifier.predict(X_test)

In [15]: y_pred

Out[15]: array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [16]: from sklearn.metrics import confusion_matrix

         cm = confusion_matrix(y_test, y_pred)

In [17]: cm

Out[17]: array([[1541,   69],
                [ 200,  190]], dtype=int64)
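As a sanity check, the accuracy reported in Out[19] below can be recomputed directly from this confusion matrix: the diagonal holds the correct predictions, and the test set has 2000 rows (20% of 10000).

```python
import numpy as np

# Confusion matrix from Out[17] (rows: true class, columns: predicted class).
cm = np.array([[1541,   69],
               [ 200,  190]])

# Accuracy = correct predictions / all predictions = trace / total.
accuracy = cm.trace() / cm.sum()
print(accuracy)  # 0.8655
```

Note the asymmetry the single accuracy number hides: the model misses about half of the actual churners (190 of 390), which is worth checking with per-class metrics on an imbalanced problem like churn.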

In [18]: from sklearn.metrics import accuracy_score

         Accuracy_Score = accuracy_score(y_test, y_pred)

In [19]: Accuracy_Score

Out[19]: 0.8655
