Ass 1 Dsbda

The document discusses preparing a medical insurance dataset for analysis. It loads the dataset, checks for null values and handles them through imputation, converts data types for appropriate analysis and modeling, encodes categorical variables and combines encoded columns with non-categorical columns.

Uploaded by

adagalepayale023


ASS 1 DSBDA

January 22, 2024

[2]: # 1. Import all the required Python libraries (numpy, pandas, matplotlib, seaborn, ...)

[1]: import pandas as pd

import numpy as np

[111]: # 2. Locate an open source dataset from the web (e.g. https://www.kaggle.com)

# https://www.kaggle.com/datasets/rajgupta2019/medical-insurance-dataset - Medical Insurance Dataset

[112]: # 3. Loading the dataset into a pandas DataFrame

[2]: df = pd.read_csv(r'C:\Users\Aditi\Downloads\Test_Data.csv')

[114]: # 4.Checking for null or missing values in the dataset

[3]: df.head()

[3]: age gender bmi smoker region children

0 40.000000 male 29.900000 no southwest 2.0
1 47.000000 male NaN no southwest 1.0
2 54.000000 female 28.880000 no northeast 2.0
3 NaN male 30.568094 no northeast NaN
4 59.130049 male 33.132854 yes northeast 4.0

[4]: df.isna()   # Checking for null or missing values (True means there is a null value)

[4]: age gender bmi smoker region children

0 False False False False False False
1 False False True False False False
2 False False False False False False
3 True False False False False True
4 False False False False False False
.. … … … … … …
487 False False False False False False
488 False False False False False False

489 False False False False False False
490 False False False False False False
491 False False False False False False

[492 rows x 6 columns]

[5]: df.isnull()   # Another method for checking null values

[5]: age gender bmi smoker region children

[492 rows x 6 columns]

[6]: df.isna().sum()   # Gives the number of missing values in each column

# Here there are 3 missing values in total: one each in the age, bmi and children columns

[6]: age 1
gender 0
bmi 1
smoker 0
region 0
children 1
dtype: int64
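The per-column count above can be extended to a percentage and to the affected rows. This is a minimal sketch on a toy frame; the name `toy` and its values are illustrative assumptions, not the real dataset.

```python
import numpy as np
import pandas as pd

# Toy frame loosely mirroring the first rows shown above (illustrative values only)
toy = pd.DataFrame({
    "age": [40.0, 47.0, 54.0, np.nan],
    "bmi": [29.9, np.nan, 28.88, 30.57],
    "children": [2.0, 1.0, 2.0, np.nan],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})

# Rows that contain at least one missing value
rows_with_nan = toy[toy.isna().any(axis=1)]

print(missing)
print(rows_with_nan)
```

Looking at the affected rows before imputing helps decide whether filling or dropping is the better choice.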

[7]: # Fill missing values with the mode of the 'children' column
df['children'] = df['children'].fillna(df['children'].mode()[0])

[8]: # Fill missing values with the mean of the 'age' column
df['age'] = df['age'].fillna(df['age'].mean())

[9]: # Fill missing values with the median of the 'bmi' column
df['bmi'] = df['bmi'].fillna(df['bmi'].median())
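The three imputations can also be done in a single `fillna` call by passing one fill value per column. The sketch below assumes a small toy frame (`toy`, `fill_values` and `filled` are hypothetical names) and leaves the original frame untouched.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing value per column (illustrative values only)
toy = pd.DataFrame({
    "age": [40.0, 50.0, np.nan, 60.0],
    "bmi": [20.0, np.nan, 30.0, 40.0],
    "children": [1.0, 1.0, 2.0, np.nan],
})

# One fill value per column: mean for age, median for bmi, mode for children
fill_values = {
    "age": toy["age"].mean(),
    "bmi": toy["bmi"].median(),
    "children": toy["children"].mode()[0],
}

# fillna with a dict returns a new frame; toy itself is not modified
filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the fill values in a dictionary also makes it easy to reuse the same values on a separate test set.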

[10]: # Getting first 5 records

[11]: df.head() # Outcome after filling the missing values

[11]: age gender bmi smoker region children
0 40.000000 male 29.900000 no southwest 2.0
1 47.000000 male 29.959061 no southwest 1.0
2 54.000000 female 28.880000 no northeast 2.0
3 38.844276 male 30.568094 no northeast 1.0
4 59.130049 male 33.132854 yes northeast 4.0

[12]: # Getting last 5 records

[13]: df.tail()

[13]: age gender bmi smoker region children

487 51.000000 male 27.740000 no northeast 1.0
488 33.000000 male 42.400000 no southwest 5.0
489 47.769999 male 29.064615 no northeast 4.0
490 41.530738 female 24.260852 no southeast 5.0
491 36.000000 male 33.400000 yes southwest 2.0

[14]: # Getting information about dataset

[15]: df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 492 entries, 0 to 491
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 age 492 non-null float64
1 gender 492 non-null object
2 bmi 492 non-null float64
3 smoker 492 non-null object
4 region 492 non-null object
5 children 492 non-null float64
dtypes: float64(3), object(3)
memory usage: 23.2+ KB

[16]: # Getting Dimensions of dataset

[17]: df.shape # It gives dimensions as ( rows,columns)

[17]: (492, 6)

[18]: # describing the dataset

[19]: df.describe()   # The parentheses are required: bare df.describe returns the bound method (echoing the whole frame), not summary statistics
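As a hedged illustration of what `describe()` returns, the sketch below uses a three-row toy frame (an assumption, not the real data). By default only numeric columns are summarised; `include="all"` adds the categorical ones.

```python
import pandas as pd

# Toy frame with one numeric and one categorical column (illustrative values only)
toy = pd.DataFrame({
    "age": [40, 47, 54],
    "gender": ["male", "male", "female"],
})

# Numeric columns only: count, mean, std, min, quartiles, max
numeric_summary = toy.describe()

# All columns: adds unique, top and freq for the categorical column
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```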

[20]: # For Getting Size

[21]: df.size   # Total number of cells: 492 (rows) x 6 (columns) --> 2952

[21]: 2952
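The relation between `shape` and `size` can be checked directly. A small sketch on a toy frame (hypothetical name `toy`):

```python
import pandas as pd

# Toy frame with 3 rows and 2 columns (illustrative values only)
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

n_rows, n_cols = toy.shape        # (rows, columns)
total_cells = n_rows * n_cols     # size is always rows x columns

print(toy.shape, toy.size, total_cells)
```

For the dataset above the same arithmetic gives 492 x 6 = 2952.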

[22]: # 5. For getting Datatypes of the variables

[23]: df.dtypes

[23]: age float64

gender object
bmi float64
smoker object
region object
children float64
dtype: object

[24]: # Why conversion is needed:

''' We need to perform datatype conversions to ensure that our data is in
the appropriate format for analysis and modeling '''

[24]: ' We need to perform datatype conversions to ensure that our data is in \nthe
appropriate format for analysis and modeling '

[25]: # Age and number of children are normally whole numbers, so convert float to int
# (astype(int) truncates the fractional part, e.g. 59.130049 becomes 59)

df = df.astype({"age": int, "children": int})

# df['smoker'] = pd.to_numeric(df['smoker'], errors='coerce').fillna(0).astype(int)  ----> object to int
# (not used: 'yes'/'no' are not numeric strings, so every value would become 0)

[26]: df.dtypes

[26]: age int32

gender object
bmi float64
smoker object
region object
children int32
dtype: object

[27]: df.head()

[27]: age gender bmi smoker region children

0 40 male 29.900000 no southwest 2
1 47 male 29.959061 no southwest 1
2 54 female 28.880000 no northeast 2
3 38 male 30.568094 no northeast 1
4 59 male 33.132854 yes northeast 4
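Because `astype(int)` truncates, imputed values such as 38.844276 become 38 rather than 39. The sketch below compares truncation with rounding, and shows pandas' nullable `Int64` dtype as one option when missing values have not been filled yet. The series names are hypothetical; the ages are taken from rows shown above.

```python
import numpy as np
import pandas as pd

# Fractional ages as they appear in the dataset output above
ages = pd.Series([59.130049, 38.844276, 47.769999])

truncated = ages.astype(int)           # drops the fractional part
rounded = ages.round().astype(int)     # rounds to the nearest whole number first

# Nullable integer dtype keeps missing values instead of raising an error
with_na = pd.Series([2.0, np.nan, 4.0]).astype("Int64")

print(truncated.tolist(), rounded.tolist())
print(with_na)
```

Which behaviour is appropriate depends on the analysis; rounding is only a suggestion here, not what the original notebook does.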

[28]: # 6. Conversion of categorical data into quantitative(numerical) data

[29]: # getting columns under categorical data

df_cat=df.select_dtypes(object)

[32]: # Prints only columns under categorical data

df_cat

[32]: gender smoker region

0 male no southwest
1 male no southwest
2 female no northeast
3 male no northeast
4 male yes northeast
.. … … …
487 male no northeast
488 male no southwest
489 male no northeast
490 female no southeast
491 male yes southwest

[492 rows x 3 columns]

[34]: # Collecting the categorical column names in a list

categorical_columns = ['gender', 'smoker', 'region']

[35]: # Create a new DataFrame with non-categorical columns

df_non_categorical = df.drop(columns=categorical_columns)

[36]: # Use pandas get_dummies to perform one-hot encoding on categorical data
df_encoded = pd.get_dummies(df[categorical_columns])

[37]: print(df_encoded)

gender_female gender_male smoker_no smoker_yes region_northeast \

0 False True True False False
1 False True True False False
2 True False True False True
3 False True True False True
4 False True False True True
.. … … … … …
487 False True True False True
488 False True True False False
489 False True True False True
490 True False True False False
491 False True False True False

region_northwest region_southeast region_southwest

0 False False True
1 False False True
2 False False False
3 False False False
4 False False False
.. … … …
487 False False False
488 False False True
489 False False False
490 False True False
491 False False True

[492 rows x 8 columns]

[38]: # Convert boolean values to integers (0s and 1s)

df_encoded = df_encoded.astype(int)

[39]: # Concatenate the one-hot encoded categorical columns with the non-categorical columns

df_merg = pd.concat([df_non_categorical, df_encoded], axis=1)

[40]: # Display the resulting DataFrame

print(df_merg)

age bmi children gender_female gender_male smoker_no \

0 40 29.900000 2 0 1 1
1 47 29.959061 1 0 1 1
2 54 28.880000 2 1 0 1
3 38 30.568094 1 0 1 1

4 59 33.132854 4 0 1 0
.. … … … … … …
487 51 27.740000 1 0 1 1
488 33 42.400000 5 0 1 1
489 47 29.064615 4 0 1 1
490 41 24.260852 5 1 0 1
491 36 33.400000 2 0 1 0

smoker_yes region_northeast region_northwest region_southeast \

0 0 0 0 0
1 0 0 0 0
2 0 1 0 0
3 0 1 0 0
4 1 1 0 0
.. … … … …
487 0 1 0 0
488 0 0 0 0
489 0 1 0 0
490 0 0 0 1
491 1 0 0 0

region_southwest
0 1
1 1
2 0
3 0
4 0
.. …
487 0
488 1
489 0
490 0
491 1

[492 rows x 11 columns]
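The drop, encode, cast and concatenate steps above can be collapsed into one `get_dummies` call by passing the whole frame with `columns=` and `dtype=int`. This is a minimal sketch on a toy frame (`toy` and `encoded` are hypothetical names), not a replacement for the cells above.

```python
import pandas as pd

# Toy frame with one numeric and two categorical columns (illustrative values only)
toy = pd.DataFrame({
    "age": [40, 54, 59],
    "gender": ["male", "female", "male"],
    "smoker": ["no", "no", "yes"],
})

# Encode only the listed columns; the others are kept as they are.
# dtype=int produces 0/1 directly, so no separate astype(int) step is needed.
encoded = pd.get_dummies(toy, columns=["gender", "smoker"], dtype=int)

print(encoded)
```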

[ ]: # Alternative method when there are too many categorical columns to list by hand
# (drop_first=True drops the first category of each column, so fewer columns are produced than above)

'''cols = df_cat.columns

def cat_2_num(df, cols):
    for col in cols:
        dummies = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df = pd.concat([df, dummies], axis=1)
        df = df.drop(col, axis=1)
    return df

df_cat = cat_2_num(df_cat, cols)'''
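A runnable variant of the same idea, sketched under the assumption that every non-numeric column should be treated as categorical. The helper name `encode_categoricals` and the toy data are hypothetical.

```python
import pandas as pd


def encode_categoricals(frame: pd.DataFrame) -> pd.DataFrame:
    """One-hot encode every non-numeric column, dropping the first category of each."""
    # Assumption: all non-numeric columns are categorical (this would also pick up bool columns)
    cat_cols = frame.select_dtypes(exclude="number").columns.tolist()
    return pd.get_dummies(frame, columns=cat_cols, drop_first=True, dtype=int)


# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "age": [40, 54, 59],
    "smoker": ["no", "no", "yes"],
    "region": ["southwest", "northeast", "southeast"],
})

encoded = encode_categoricals(toy)
print(encoded)
```

With `drop_first=True` a row of all zeros in the `region_*` columns stands for the dropped category (here `northeast`), which avoids redundant columns in linear models.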
