
Feature Engg Code

Feature engg for machine learning

Uploaded by

promodkumarsahu7
Copyright
© © All Rights Reserved
Scaler DSML Feature Engg Class Code - Jupyter Notebook

In [ ]: # Home Loan decision automation

In [8]: import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt
        import seaborn as sns

In [9]: local_data_path = r"E:\DATA_SCIENCE\Scaler\Data\loan-prediction\train.csv"

In [10]: # https://drive.google.com/drive/folders/1QFDGIHCZPqS5kD7_Uo8SCH9BSAZAVBQj?usp=st

In [12]: # Step 1: Data exploration (Basic)

In [14]: data = pd.read_csv(local_data_path)

In [15]: data.info()

         RangeIndex: 614 entries, 0 to 613
         Data columns (total 13 columns):
          #   Column             Non-Null Count  Dtype
          0   Loan_ID            614 non-null    object
          1   Gender             601 non-null    object
          2   Married            611 non-null    object
          3   Dependents         599 non-null    object
          4   Education          614 non-null    object
          5   Self_Employed      582 non-null    object
          6   ApplicantIncome    614 non-null    int64
          7   CoapplicantIncome  614 non-null    float64
          8   LoanAmount         592 non-null    float64
          9   Loan_Amount_Term   600 non-null    float64
          10  Credit_History     564 non-null    float64
          11  Property_Area      614 non-null    object
          12  Loan_Status        614 non-null    object
         dtypes: float64(4), int64(1), object(8)
         memory usage: 62.5+ KB

In [16]: data = data.drop("Loan_ID", axis=1)
In [17]: data.describe()

Out[17]:
                ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  Credit_History
         count       614.000000         614.000000  592.000000         600.00000      564.000000
         mean       5403.459283        1621.245798  146.412162         342.00000        0.842199
         std        6109.041673        2926.248369   85.587325          65.12041        0.364878
         min         150.000000           0.000000    9.000000          12.00000        0.000000
         25%        2877.500000           0.000000  100.000000         360.00000        1.000000
         50%        3812.500000        1188.500000  128.000000         360.00000        1.000000
         75%        5795.000000        2297.250000  168.000000         360.00000        1.000000
         max       81000.000000       41667.000000  700.000000         480.00000        1.000000

In [19]: data.describe(include=["object"]).transpose()

Out[19]:
                        count  unique  top        freq
         Gender         601    2       Male       489
         Married        611    2       Yes        398
         Dependents     599    4       0          345
         Education      614    2       Graduate   480
         Self_Employed  582    2       No         500
         Property_Area  614    3       Semiurban  233
         Loan_Status    614    2       Y          422

In [25]: data.Loan_Status

Out[25]:        ..
         609    Y
         610    Y
         611    Y
         612    Y
         613    N
         Name: Loan_Status, Length: 614, dtype: object

In [26]: # Step 2: Brainstorming
In [27]: data.info()

         RangeIndex: 614 entries, 0 to 613
         Data columns (total 12 columns):
          #   Column             Non-Null Count  Dtype
          0   Gender             601 non-null    object
          1   Married            611 non-null    object
          2   Dependents         599 non-null    object
          3   Education          614 non-null    object
          4   Self_Employed      582 non-null    object
          5   ApplicantIncome    614 non-null    int64
          6   CoapplicantIncome  614 non-null    float64
          7   LoanAmount         592 non-null    float64
          8   Loan_Amount_Term   600 non-null    float64
          9   Credit_History     564 non-null    float64
          10  Property_Area      614 non-null    object
          11  Loan_Status        614 non-null    object
         dtypes: float64(4), int64(1), object(7)
         memory usage: 57.7+ KB

In [28]: # Step 3: Look at basic distributions (univariates)
         # Step 4: Handle missing values

In [29]: data.isna().sum()

Out[29]: Gender               13
         Married               3
         Dependents           15
         Education             0
         Self_Employed        32
         ApplicantIncome       0
         CoapplicantIncome     0
         LoanAmount           22
         Loan_Amount_Term     14
         Credit_History       50
         Property_Area         0
         Loan_Status           0
         dtype: int64

In [33]: def missing_to_df(df):
             # Count and share of missing values per column, largest first
             total_missing_df = df.isnull().sum().sort_values(ascending=False)
             percent_missing_df = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
             missing_data_df = pd.concat(
                 [total_missing_df, percent_missing_df], axis=1, keys=["Total", "Percent"]
             )
             return missing_data_df

In [34]: missing_df = missing_to_df(data)
         missing_df[missing_df["Total"] > 0]

Out[34]:
                           Total  Percent
         Credit_History    50     0.081433
         Self_Employed     32     0.052117
         LoanAmount        22     0.035831
         Dependents        15     0.024430
         Loan_Amount_Term  14     0.022801
         Gender            13     0.021173
         Married           3      0.004886

In [36]: data["Credit_History"] = data["Credit_History"].fillna(2)

In [38]: data["Self_Employed"].unique()

Out[38]: array(['No', 'Yes', nan], dtype=object)

In [39]: data["Self_Employed"] = data["Self_Employed"].fillna("Other")

In [40]: from sklearn.impute import SimpleImputer

In [42]: num_missing = ["LoanAmount", "Loan_Amount_Term"]
         median_imputer = SimpleImputer(strategy="median")
         for col in num_missing:
             data[col] = pd.DataFrame(median_imputer.fit_transform(pd.DataFrame(data[col])))

In [46]: cat_missing = ["Gender", "Married", "Dependents"]
         freq_imputer = SimpleImputer(strategy="most_frequent")
         for col in cat_missing:
             data[col] = pd.DataFrame(freq_imputer.fit_transform(pd.DataFrame(data[col])))

In [47]: data.isnull().sum()

Out[47]: Gender               0
         Married              0
         Dependents           0
         Education            0
         Self_Employed        0
         ApplicantIncome      0
         CoapplicantIncome    0
         LoanAmount           0
         Loan_Amount_Term     0
         Credit_History       0
         Property_Area        0
         Loan_Status          0
         dtype: int64

In [48]: # Removing or replacing redundant or erroneous values
         # if income was negative
         # married had some number

In [49]: # detect and handle outliers

In [ ]: # more EDA, univariates and bivariates

In [50]: sns.countplot(data=data, x="Loan_Status")

Out[50]: [Figure: count plot of Loan_Status; y-axis "count"]

In [51]: sns.distplot(data["ApplicantIncome"])

         C:\Users\nitis\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
           warnings.warn(msg, FutureWarning)

Out[51]: [Figure: density plot of ApplicantIncome]
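The median and most-frequent imputation in cells In [42] and In [46] can also be applied to several columns at once. The following is a minimal sketch on a made-up toy frame (the values and the names `toy`, `num_cols`, `cat_cols` are illustrative assumptions, not the notebook's data); it assigns the imputer's array output straight back to the selected columns so the original index is kept.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame standing in for the loan data (made-up values).
toy = pd.DataFrame({
    "LoanAmount": [100.0, np.nan, 140.0, 200.0],
    "Loan_Amount_Term": [360.0, 360.0, np.nan, 180.0],
    "Gender": ["Male", np.nan, "Female", "Male"],
})

num_cols = ["LoanAmount", "Loan_Amount_Term"]
cat_cols = ["Gender"]

# fit_transform returns a 2-D array with one column per input column;
# assigning it back to the same column list preserves the frame's index.
toy[num_cols] = SimpleImputer(strategy="median").fit_transform(toy[num_cols])
toy[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(toy[cat_cols])

print(toy)
```

In a modelling workflow the imputers would normally be fitted on the training split only and then reused on validation or test data; the notebook fits on the full frame, which is fine for exploration.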
           warnings.warn(msg, FutureWarning)

Out[51]: [Figure: density plot of ApplicantIncome]

In [52]: data.boxplot(column="ApplicantIncome", by="Education")
         plt.show()

         [Figure: box plot of ApplicantIncome grouped by Education]

In [53]: sns.distplot(data["CoapplicantIncome"])

         C:\Users\nitis\anaconda3\lib\site-packages\seaborn\distributions.py:2619: FutureWarning: `distplot` is a deprecated function and will be removed in a future version. Please adapt your code to use either `displot` (a figure-level function with similar flexibility) or `histplot` (an axes-level function for histograms).
           warnings.warn(msg, FutureWarning)

Out[53]: [Figure: density plot of CoapplicantIncome; y-axis "Density"]

In [54]: # Select the column before aggregating so non-numeric columns are not averaged
         data.groupby("Loan_Status")["ApplicantIncome"].mean()

Out[54]: Loan_Status
         N    5446.078125
         Y    5384.068720
         Name: ApplicantIncome, dtype: float64

In [58]: bins = [0, 2500, 4000, 6000, 81000]
         group_name = ["Low", "Average", "High", "Very High"]
         data["Income_bin"] = pd.cut(data["ApplicantIncome"], bins=bins, labels=group_name)
         data["Income_bin"]

Out[58]: 0           High
         1           High
         2        Average
         3        Average
         4           High
                  ...
         609      Average
         610         High
         611    Very High
         612    Very High
         613         High
         Name: Income_bin, Length: 614, dtype: category
         Categories (4, object): [Low < Average < High < Very High]

In [61]: Income_bin = pd.crosstab(data["Income_bin"], data["Loan_Status"])
         Income_bin.div(Income_bin.sum(axis=1), axis=0).plot(kind="bar", figsize=(4,4))
         plt.xlabel("ApplicantIncome")
         plt.ylabel("Percentage")
         plt.show()

         [Figure: bar chart of the Loan_Status share within each ApplicantIncome bin (Low, Average, High, Very High); y-axis "Percentage"]
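The binning and per-bin approval share computed above can be checked numerically without plotting. This is a small sketch on made-up incomes (the frame `toy` and its values are assumptions for illustration); it uses the same bin edges and labels as the notebook and shows that `pd.crosstab(..., normalize="index")` gives the same row-wise shares as dividing the counts by their row totals.

```python
import pandas as pd

# Made-up incomes and outcomes, two applicants per bin.
toy = pd.DataFrame({
    "ApplicantIncome": [1500, 3000, 3500, 5000, 7000, 2000, 9000, 4500],
    "Loan_Status":     ["N",  "Y",  "Y",  "N",  "Y",  "Y",  "N",  "Y"],
})

bins = [0, 2500, 4000, 6000, 81000]
labels = ["Low", "Average", "High", "Very High"]
toy["Income_bin"] = pd.cut(toy["ApplicantIncome"], bins=bins, labels=labels)

# Raw counts per bin, then row-wise shares as in the notebook.
counts = pd.crosstab(toy["Income_bin"], toy["Loan_Status"])
manual = counts.div(counts.sum(axis=1), axis=0)

# normalize="index" divides each row by its own total in one step.
rates = pd.crosstab(toy["Income_bin"], toy["Loan_Status"], normalize="index")

print(rates)
```

Looking at the numbers in `rates` directly makes it easier to judge whether the approval share really differs between bins than reading bar heights off a chart.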
In [62]: # above is also not useful, because the approval rate across income bins is very

In [ ]: # Feature Engineering

In [63]: data["TotalIncome"] = data["ApplicantIncome"] + data["CoapplicantIncome"]

In [64]: data["TotalIncome_bin"] = pd.cut(data["TotalIncome"], bins=bins, labels=group_name)
         data["TotalIncome_bin"]

Out[64]: 0           High
         1      Very High
         2        Average
         3           High
         4           High
                  ...
         609      Average
         610         High
         611    Very High
         612    Very High
         613         High
         Name: TotalIncome_bin, Length: 614, dtype: category
         Categories (4, object): [Low < Average < High < Very High]

In [66]: TotalIncome_bin = pd.crosstab(data["TotalIncome_bin"], data["Loan_Status"])
         TotalIncome_bin.div(TotalIncome_bin.sum(axis=1), axis=0).plot(kind="bar", stacked=True)
         plt.xlabel("TotalIncome")
         plt.ylabel("Percentage")
         plt.show()

         [Figure: stacked bar chart of the Loan_Status share within each TotalIncome bin (Low, Average, High, Very High); y-axis "Percentage"]

In [67]: data = data.drop(["Income_bin", "TotalIncome_bin"], axis=1)

In [68]: data["Loan_Amount_Term"].nunique()

Out[68]: 10

In [69]: data["Loan_Amount_Term"].value_counts()

Out[69]: 360.0    526
         180.0     44
         480.0     15
         300.0     13
         84.0       4
         240.0      4
         120.0      3
         36.0       2
         60.0       2
         12.0       1
         Name: Loan_Amount_Term, dtype: int64

In [70]: # Convert the term from months to years
         data["Loan_Amount_Term"] = (data["Loan_Amount_Term"]/12).astype('float')
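One thing to watch when the ApplicantIncome bin edges are reused for TotalIncome: `pd.cut` returns NaN for any value above the last edge, so a combined income over 81000 would silently fall out of the crosstab. The sketch below illustrates this on made-up rows and shows one possible remedy, an open-ended last bin using `np.inf`. That remedy is an assumption of this sketch, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Made-up rows; the last one has a combined income above 81000.
toy = pd.DataFrame({
    "ApplicantIncome":   [2000, 5000, 81000],
    "CoapplicantIncome": [0.0, 1500.0, 3000.0],
})
toy["TotalIncome"] = toy["ApplicantIncome"] + toy["CoapplicantIncome"]

labels = ["Low", "Average", "High", "Very High"]

# Same edges as the notebook: values above 81000 become NaN.
capped = pd.cut(toy["TotalIncome"], bins=[0, 2500, 4000, 6000, 81000], labels=labels)

# Open-ended top bin: every positive income gets a label.
open_ended = pd.cut(toy["TotalIncome"], bins=[0, 2500, 4000, 6000, np.inf], labels=labels)

print(pd.DataFrame({"TotalIncome": toy["TotalIncome"], "capped": capped, "open_ended": open_ended}))
```

A quick `data["TotalIncome_bin"].isna().sum()` after binning would show whether this affects the actual data.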
In [71]: pd.crosstab(data["Loan_Amount_Term"], data["Loan_Status"])

Out[71]:
         Loan_Status         N    Y
         Loan_Amount_Term
         1.0                 0    1
         3.0                 2    0
         5.0                 0    2
         7.0                 1    3
         10.0                0    3
         15.0               15   29
         20.0                1    3
         25.0                5    8
         30.0              159  367
         40.0                9    6

In [72]: data["Loan_Amount_per_year"] = data["LoanAmount"]/data["Loan_Amount_Term"]

In [75]: data["EMI"] = data["Loan_Amount_per_year"]*1000/12

In [76]: data

Out[76]: [Table: first and last five rows of data, showing the columns Gender, Married, Dependents, Education, Self_Employed, ApplicantIncome, CoapplicantIncome, ...]

         614 rows x 15 columns

In [77]: data["able_to_pay_EMI"] = (data["TotalIncome"]/10 > data["EMI"]).astype('int')

In [78]: sns.countplot(x='able_to_pay_EMI', data=data, hue="Loan_Status")

Out[78]: [Figure: count plot of able_to_pay_EMI (0, 1) split by Loan_Status; y-axis "count"]

In [79]: data["Dependents"].value_counts()

Out[79]: 0     360
         1     102
         2     101
         3+     51
         Name: Dependents, dtype: int64

In [85]: # Assign the result back instead of using inplace=True on a column selection
         data["Dependents"] = data["Dependents"].replace('3+', 3)

In [87]: data["Dependents"] = data["Dependents"].astype("float")

In [88]: sns.countplot(data=data, x='Dependents', hue="Loan_Status")

Out[88]: [Figure: count plot of Dependents (0.0, 1.0, 2.0, 3.0) split by Loan_Status; y-axis "count"]

In [89]: # bivariate with credit_history, you will find that better credit history has bet
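The derived features above (Loan_Amount_per_year, EMI, able_to_pay_EMI) can be collected in one helper. This is a sketch under stated assumptions: the function name `add_emi_features` and the `income_share` parameter are hypothetical, LoanAmount is taken to be in thousands (which the notebook's `*1000` suggests), and Loan_Amount_Term is taken to be in months here, whereas the notebook converts it to years in an earlier cell. As in the notebook, the EMI is a plain principal split and ignores interest.

```python
import pandas as pd

def add_emi_features(df, income_share=0.1):
    """Return a copy of df with the three derived loan features added.

    Assumes LoanAmount is in thousands and Loan_Amount_Term is in months.
    """
    out = df.copy()
    term_years = out["Loan_Amount_Term"] / 12
    out["Loan_Amount_per_year"] = out["LoanAmount"] / term_years
    # Monthly instalment in currency units, principal only (no interest).
    out["EMI"] = out["Loan_Amount_per_year"] * 1000 / 12
    # Flag applicants whose instalment is below a share of total income.
    out["able_to_pay_EMI"] = (out["TotalIncome"] * income_share > out["EMI"]).astype(int)
    return out

# Made-up rows for illustration.
toy = pd.DataFrame({
    "LoanAmount": [120.0, 240.0],
    "Loan_Amount_Term": [360.0, 120.0],
    "TotalIncome": [5000.0, 6000.0],
})
features = add_emi_features(toy)
print(features[["EMI", "able_to_pay_EMI"]])
```

Keeping the threshold as a parameter makes it easy to try other income shares than the one tenth used in the notebook.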
In [90]: data.info()

         RangeIndex: 614 entries, 0 to 613
         Data columns (total 16 columns):
          #   Column                Non-Null Count  Dtype
          0   Gender                614 non-null    object
          1   Married               614 non-null    object
          2   Dependents            614 non-null    float64
          3   Education             614 non-null    object
          4   Self_Employed         614 non-null    object
          5   ApplicantIncome       614 non-null    int64
          6   CoapplicantIncome     614 non-null    float64
          7   LoanAmount            614 non-null    float64
          8   Loan_Amount_Term      614 non-null    float64
          9   Credit_History        614 non-null    float64
          10  Property_Area         614 non-null    object
          11  Loan_Status           614 non-null    object
          12  TotalIncome           614 non-null    float64
          13  Loan_Amount_per_year  614 non-null    float64
          14  EMI                   614 non-null    float64
          15  able_to_pay_EMI       614 non-null    int32
         dtypes: float64(8), int32(1), int64(1), object(6)
         memory usage: 74.5+ KB

In [91]: data["Self_Employed"].value_counts()

Out[91]: No       500
         Yes       82
         Other     32
         Name: Self_Employed, dtype: int64

In [95]: data = pd.get_dummies(data, drop_first=True)

In [96]: data.info()

         RangeIndex: 614 entries, 0 to 613
         Data columns (total 18 columns):
          #   Column                   Non-Null Count  Dtype
          0   Dependents               614 non-null    float64
          1   ApplicantIncome          614 non-null    int64
          2   CoapplicantIncome        614 non-null    float64
          3   LoanAmount               614 non-null    float64
          4   Loan_Amount_Term         614 non-null    float64
          5   Credit_History           614 non-null    float64
          6   TotalIncome              614 non-null    float64
          7   Loan_Amount_per_year     614 non-null    float64
          8   EMI                      614 non-null    float64
          9   able_to_pay_EMI          614 non-null    int32
          10  Gender_Male              614 non-null    uint8
          11  Married_Yes              614 non-null    uint8
          12  Education_Not Graduate   614 non-null    uint8
          13  Self_Employed_Other      614 non-null    uint8
          14  Self_Employed_Yes        614 non-null    uint8
          15  Property_Area_Semiurban  614 non-null    uint8
          16  Property_Area_Urban      614 non-null    uint8
          17  Loan_Status_Y            614 non-null    uint8
         dtypes: float64(8), int32(1), int64(1), uint8(8)
         memory usage: 50.5 KB

Feature engineering + feature transformation + new features + one hot encoding and so on.

In [ ]: # Dimensionality reduction
        # removing unwanted features
        # check corr and remove features

In [97]: plt.figure(figsize=(20,20))
         sns.heatmap(data.corr(), annot=True)
         plt.show()

In [98]: # spearmans ranking corr coeff

In [99]: plt.figure(figsize=(20,20))
         sns.heatmap(data.corr(method="spearman"), annot=True)
         plt.show()

In [100]: # feature scaling

In [101]: from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [102]: normalizer = MinMaxScaler()
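The "check corr and remove features" step is only described in a comment and two heatmaps. One possible way to turn it into code is sketched below; the helper name `highly_correlated`, the 0.9 threshold, and the toy values are all assumptions of this sketch rather than the notebook's method. It scans the upper triangle of the absolute correlation matrix and flags the later column of each highly correlated pair.

```python
import numpy as np
import pandas as pd

def highly_correlated(df, threshold=0.9, method="pearson"):
    """List columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.corr(method=method).abs()
    # Keep only entries above the diagonal so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Made-up values: EMI is nearly proportional to LoanAmount.
toy = pd.DataFrame({
    "LoanAmount": [100.0, 120.0, 150.0, 200.0, 260.0],
    "EMI":        [278.0, 333.0, 417.0, 556.0, 722.0],
    "Dependents": [0.0, 2.0, 1.0, 0.0, 3.0],
})

to_drop = highly_correlated(toy, threshold=0.9)
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```

Passing `method="spearman"` applies the same check to the rank correlation shown in the second heatmap. Which column of a correlated pair to keep is a modelling decision; this sketch simply keeps the one that appears first.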
In [104]: pd.DataFrame(normalizer.fit_transform(data), columns=data.columns)

Out[104]:
               Dependents  ApplicantIncome  CoapplicantIncome  LoanAmount  Loan_Amount_Term  ...
          0      0.000000         0.070489           0.000000    0.172214          0.743590  ...
          1      0.333333         0.054830           0.036192    0.172214          0.743590  ...
          2      0.000000         0.035250           0.000000    0.082489          0.743590  ...
          3      0.000000         0.030093           0.056592    0.160637          0.743590  ...
          4      0.000000         0.072356           0.000000    0.191027          0.743590  ...
          ...
          609    0.000000         0.034014           0.000000    0.089725          0.743590  ...
          610    1.000000         0.048930           0.000000    0.044863          0.358974  ...
          611    0.333333         0.097984           0.005760    0.353111          0.743590  ...
          612    0.666667         0.091936           0.000000    0.257598          0.743590  ...
          613    0.000000         0.054830           0.000000    0.179450          0.743590  ...

          614 rows x 18 columns
