CLEANING DATA SET - Jupyter Notebook

This document discusses cleaning a dataset that contains null values. It shows how to identify and handle null values when analyzing the data. First, it loads sample data and checks for null values using .isnull().sum(). It then demonstrates how null values can cause issues for calculations by trying to calculate the mean of an array that contains NaN. The document also loads a dataset from an Excel file, checks for null values in each column, and counts the number of non-null values in each row and column. This provides an overview of techniques for identifying and dealing with missing or null data when preparing a dataset for analysis.

Uploaded by

nitindibai3


30/10/2023, 10:31 CLEANING DATA SET - Jupyter Notebook

In [1]: import pandas as pd

import pymysql as sql
db=sql.connect(host='localhost',user='root',password='manish@sql0047',datab

data=pd.read_sql_query('select * from emp',db) # returns a DataFrame directly; no separate conversion is needed
data

C:\Users\Acer\AppData\Local\Temp\ipykernel_21408\2883241542.py:5: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  data=pd.read_sql_query('select * from emp',db) # returns a DataFrame directly; no separate conversion is needed

Out[1]: id name lastname age city salary

0 111 Rohit Verma 27 Meerut 2000

1 112 Monu Kasana 23 Ghaziabad 5000

2 113 Vinod Sharma 28 Noida 12000

3 114 Satish Bhati 25 Bulandsher 4000

4 115 Manish Dhama 23 Greater Noida 10000

5 116 Sachin Dedha 24 Mujaffarnagar 9000

6 117 Manoj Tyagi 22 New Delhi 14000
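The UserWarning above appears because a raw pymysql connection is not one of the connection types pandas lists as supported. As a minimal sketch, the snippet below runs the same kind of query through a `sqlite3` connection, which the warning text names as supported. The in-memory database and its two rows are stand-ins for the MySQL `emp` table, not the author's setup. For MySQL itself, the warning suggests SQLAlchemy; an engine is typically created from a URL of the form `mysql+pymysql://user:password@host/dbname` (placeholders, not real credentials).

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used as a stand-in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp (id INTEGER, name TEXT, lastname TEXT, "
    "age INTEGER, city TEXT, salary INTEGER)"
)
conn.executemany(
    "INSERT INTO emp VALUES (?, ?, ?, ?, ?, ?)",
    [
        (111, "Rohit", "Verma", 27, "Meerut", 2000),
        (112, "Monu", "Kasana", 23, "Ghaziabad", 5000),
    ],
)
conn.commit()

# read_sql_query already returns a DataFrame; no extra conversion step is needed.
emp = pd.read_sql_query("SELECT * FROM emp", conn)
conn.close()
print(emp)
```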

In [2]: # Dataset cleaning only applies where NaN values are present.

# The dataset above contains no null values.

In [3]: data.isnull().sum() # counts the null values in each column

Out[3]: id 0
name 0
lastname 0
age 0
city 0
salary 0
dtype: int64

In [4]: # example--------
import numpy as np
s=np.array([5,6,7,8,np.nan,44,55,np.nan])
s

Out[4]: array([ 5., 6., 7., 8., nan, 44., 55., nan])

In [5]: np.mean(s) # no calculation is possible here because of the NaN values

Out[5]: nan
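A short sketch of why the mean comes back as `nan` and two common ways around it, using the same array as above. `np.nanmean` and boolean masking with `np.isnan` are standard NumPy; the variable names are illustrative only.

```python
import numpy as np

s = np.array([5, 6, 7, 8, np.nan, 44, 55, np.nan])

# NaN propagates through the sum, so the plain mean is NaN.
plain_mean = np.mean(s)

# nanmean skips NaN entries and averages the remaining six values.
nan_aware_mean = np.nanmean(s)

# Equivalent approach: mask out the NaN entries first.
masked_mean = s[~np.isnan(s)].mean()

print(plain_mean, nan_aware_mean, masked_mean)
```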


In [6]: # ANOTHER EXAMPLE OF CLEANING A DATASET: FIRST WE EXTRACT THE DATA

In [7]: import pandas as pd

data=pd.read_excel("C:/Users/Acer/Desktop/DATA12.xlsx")
data

Out[7]:
    ID   Name             Industry            Inception  Revenue      Expenses           Profit      Growth  Salary
0   1.0  Lamtone          IT Services         2009.0     $11,757,018  6,482,465 Dollars  5274553.0   0.30    NaN
1   2.0  Stripfind        Financial Services  2010.0     $12,329,371  916,455 Dollars    11412916.0  NaN     NaN
2   3.0  Canecorporation  Health              2012.0     $10,597,009  NaN                3005820.0   NaN     NaN
3   4.0  NaN              IT Services         2013.0     NaN          7,429,377 Dollars  6597557.0   NaN     NaN
4   5.0  NaN              NaN                 NaN        NaN          7,435,363 Dollars  3138627.0   NaN     NaN
5   6.0  Techline         Health              2006.0     NaN          5,470,303 Dollars  8427816.0   0.23    NaN
6   7.0  Cityace          NaN                 2010.0     $9,254,614   NaN                3005116.0   0.06    NaN
7   8.0  Kayelectronics   NaN                 2009.0     $9,451,943   3,878,113 Dollars  5573830.0   0.04    NaN
8   9.0  Ganzlax          IT Services         2011.0     $14,001,180  NaN                11901180.0  0.18    NaN
9   NaN  NaN              NaN                 NaN        NaN          NaN                NaN         NaN     NaN

In [8]: data.isnull().sum() # counts the null values in each column

Out[8]: ID 1
Name 3
Industry 4
Inception 2
Revenue 4
Expenses 4
Profit 1
Growth 5
Salary 10
dtype: int64

In [9]: data.count(axis=1) # counts the non-null values in each row

Out[9]: 0 8
1 7
2 6
3 5
4 3
5 7
6 6
7 7
8 7
9 0
dtype: int64


In [10]: data.count(axis=0) # counts the non-null values in each column

Out[10]: ID 9
Name 7
Industry 6
Inception 8
Revenue 6
Expenses 6
Profit 9
Growth 5
Salary 0
dtype: int64
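The null counts and non-null counts above are complementary: for every column they add up to the number of rows. The sketch below checks that on a small hypothetical frame (not the Excel data) and uses the counts to list fully empty columns and rows.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row sample with one fully empty column and one fully empty row.
df = pd.DataFrame(
    {
        "ID": [1.0, 2.0, np.nan],
        "Name": ["Lamtone", None, None],
        "Salary": [np.nan, np.nan, np.nan],
    }
)

nulls = df.isnull().sum()   # null values per column
non_nulls = df.count()      # non-null values per column (axis=0 is the default)

# Columns with no data at all, and rows with no data at all.
empty_cols = non_nulls[non_nulls == 0].index.tolist()
empty_rows = df.index[df.count(axis=1) == 0].tolist()

print(nulls + non_nulls)    # every entry equals len(df)
print(empty_cols, empty_rows)
```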

In [11]: data

Out[11]:
    ID   Name             Industry            Inception  Revenue      Expenses           Profit      Growth  Salary
0   1.0  Lamtone          IT Services         2009.0     $11,757,018  6,482,465 Dollars  5274553.0   0.30    NaN
1   2.0  Stripfind        Financial Services  2010.0     $12,329,371  916,455 Dollars    11412916.0  NaN     NaN
2   3.0  Canecorporation  Health              2012.0     $10,597,009  NaN                3005820.0   NaN     NaN
3   4.0  NaN              IT Services         2013.0     NaN          7,429,377 Dollars  6597557.0   NaN     NaN
4   5.0  NaN              NaN                 NaN        NaN          7,435,363 Dollars  3138627.0   NaN     NaN
5   6.0  Techline         Health              2006.0     NaN          5,470,303 Dollars  8427816.0   0.23    NaN
6   7.0  Cityace          NaN                 2010.0     $9,254,614   NaN                3005116.0   0.06    NaN
7   8.0  Kayelectronics   NaN                 2009.0     $9,451,943   3,878,113 Dollars  5573830.0   0.04    NaN
8   9.0  Ganzlax          IT Services         2011.0     $14,001,180  NaN                11901180.0  0.18    NaN
9   NaN  NaN              NaN                 NaN        NaN          NaN                NaN         NaN     NaN

In [12]: # Every value in the Salary column is NaN, so we drop the whole column.
data.drop(['Salary'],axis=1,inplace=True) # permanently removes the Salary column (inplace=True)
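Dropping `Salary` by name works because the counts already showed it is empty. As a hedged alternative, `dropna(axis=1, how='all')` removes every all-NaN column without naming it. The two-column frame below is a made-up sample, and the call returns a new frame rather than modifying the original.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "Salary" holds no data at all.
df = pd.DataFrame({"Profit": [1.0, 2.0], "Salary": [np.nan, np.nan]})

# Drop every column whose values are all NaN; df itself is left untouched.
trimmed = df.dropna(axis=1, how="all")

print(list(trimmed.columns))
```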


In [13]: data

Out[13]:
    ID   Name             Industry            Inception  Revenue      Expenses           Profit      Growth
0   1.0  Lamtone          IT Services         2009.0     $11,757,018  6,482,465 Dollars  5274553.0   0.30
1   2.0  Stripfind        Financial Services  2010.0     $12,329,371  916,455 Dollars    11412916.0  NaN
2   3.0  Canecorporation  Health              2012.0     $10,597,009  NaN                3005820.0   NaN
3   4.0  NaN              IT Services         2013.0     NaN          7,429,377 Dollars  6597557.0   NaN
4   5.0  NaN              NaN                 NaN        NaN          7,435,363 Dollars  3138627.0   NaN
5   6.0  Techline         Health              2006.0     NaN          5,470,303 Dollars  8427816.0   0.23
6   7.0  Cityace          NaN                 2010.0     $9,254,614   NaN                3005116.0   0.06
7   8.0  Kayelectronics   NaN                 2009.0     $9,451,943   3,878,113 Dollars  5573830.0   0.04
8   9.0  Ganzlax          IT Services         2011.0     $14,001,180  NaN                11901180.0  0.18
9   NaN  NaN              NaN                 NaN        NaN          NaN                NaN         NaN

In [14]: # Now check the rows

data.count(axis=1)

Out[14]: 0 8
1 7
2 6
3 5
4 3
5 7
6 6
7 7
8 7
9 0
dtype: int64

In [15]: # Row 9 consists entirely of NaN values, so we drop the whole row.
data.dropna(how='all',inplace=True) # removes every row in which all values are NaN


In [16]: data

Out[16]:
    ID   Name             Industry            Inception  Revenue      Expenses           Profit      Growth
0   1.0  Lamtone          IT Services         2009.0     $11,757,018  6,482,465 Dollars  5274553.0   0.30
1   2.0  Stripfind        Financial Services  2010.0     $12,329,371  916,455 Dollars    11412916.0  NaN
2   3.0  Canecorporation  Health              2012.0     $10,597,009  NaN                3005820.0   NaN
3   4.0  NaN              IT Services         2013.0     NaN          7,429,377 Dollars  6597557.0   NaN
4   5.0  NaN              NaN                 NaN        NaN          7,435,363 Dollars  3138627.0   NaN
5   6.0  Techline         Health              2006.0     NaN          5,470,303 Dollars  8427816.0   0.23
6   7.0  Cityace          NaN                 2010.0     $9,254,614   NaN                3005116.0   0.06
7   8.0  Kayelectronics   NaN                 2009.0     $9,451,943   3,878,113 Dollars  5573830.0   0.04
8   9.0  Ganzlax          IT Services         2011.0     $14,001,180  NaN                11901180.0  0.18

In [17]: data.dropna(how='any') # drops every row that contains at least one NaN value

Out[17]: ID Name Industry Inception Revenue Expenses Profit Growth

0 1.0 Lamtone IT Services 2009.0 $11,757,018 6,482,465 Dollars 5274553.0 0.3
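`how='any'` leaves a single row here, which is usually too aggressive. The sketch below contrasts `how='all'`, `how='any'`, and the `thresh` parameter (minimum number of non-null values a row must have to be kept) on a small hypothetical frame; none of the calls modify `df` because `inplace` is not set.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: rows 0..3 hold 3, 2, 1 and 0 non-null values respectively.
df = pd.DataFrame(
    {
        "Name": ["Lamtone", "Stripfind", None, None],
        "Revenue": ["$11,757,018", "$12,329,371", "$10,597,009", None],
        "Growth": [0.30, np.nan, np.nan, np.nan],
    }
)

all_dropped = df.dropna(how="all")   # drops only the fully empty row
any_dropped = df.dropna(how="any")   # keeps only rows with no NaN at all
at_least_two = df.dropna(thresh=2)   # keeps rows with >= 2 non-null values

print(all_dropped.index.tolist())
print(any_dropped.index.tolist())
print(at_least_two.index.tolist())
```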

In [18]: data

Out[18]:
    ID   Name             Industry            Inception  Revenue      Expenses           Profit      Growth
0   1.0  Lamtone          IT Services         2009.0     $11,757,018  6,482,465 Dollars  5274553.0   0.30
1   2.0  Stripfind        Financial Services  2010.0     $12,329,371  916,455 Dollars    11412916.0  NaN
2   3.0  Canecorporation  Health              2012.0     $10,597,009  NaN                3005820.0   NaN
3   4.0  NaN              IT Services         2013.0     NaN          7,429,377 Dollars  6597557.0   NaN
4   5.0  NaN              NaN                 NaN        NaN          7,435,363 Dollars  3138627.0   NaN
5   6.0  Techline         Health              2006.0     NaN          5,470,303 Dollars  8427816.0   0.23
6   7.0  Cityace          NaN                 2010.0     $9,254,614   NaN                3005116.0   0.06
7   8.0  Kayelectronics   NaN                 2009.0     $9,451,943   3,878,113 Dollars  5573830.0   0.04
8   9.0  Ganzlax          IT Services         2011.0     $14,001,180  NaN                11901180.0  0.18


In [19]: # NOW WE WORK ON THE REVENUE & EXPENSES COLUMNS (cleaning the NaN values from them)

#STEP 1 =>> fill the NaN values in the column with '0'

data.fillna({'Revenue':'0'},inplace=True)
data

Out[19]:
    ID   Name             Industry            Inception  Revenue      Expenses           Profit      Growth
0   1.0  Lamtone          IT Services         2009.0     $11,757,018  6,482,465 Dollars  5274553.0   0.30
1   2.0  Stripfind        Financial Services  2010.0     $12,329,371  916,455 Dollars    11412916.0  NaN
2   3.0  Canecorporation  Health              2012.0     $10,597,009  NaN                3005820.0   NaN
3   4.0  NaN              IT Services         2013.0     0            7,429,377 Dollars  6597557.0   NaN
4   5.0  NaN              NaN                 NaN        0            7,435,363 Dollars  3138627.0   NaN
5   6.0  Techline         Health              2006.0     0            5,470,303 Dollars  8427816.0   0.23
6   7.0  Cityace          NaN                 2010.0     $9,254,614   NaN                3005116.0   0.06
7   8.0  Kayelectronics   NaN                 2009.0     $9,451,943   3,878,113 Dollars  5573830.0   0.04
8   9.0  Ganzlax          IT Services         2011.0     $14,001,180  NaN                11901180.0  0.18

In [20]: # STEP 2 =>> make list

n=data['Revenue']
n2=list(n)
n2

Out[20]: ['$11,757,018 ',

'$12,329,371 ',
'$10,597,009 ',
'0',
'0',
'0',
'$9,254,614 ',
'$9,451,943 ',
'$14,001,180 ']

In [21]: # STEP 3 =>> remove extra characters such as the dollar sign ($) and comma (,) from the list
u=[]
for i in n2:
    t=""
    for j in i:
        if(j!="$" and j!=","):
            t=t+j
    u.append(t)
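The character-by-character loop above can also be written with pandas string methods, which skips the detour through a Python list. This is a sketch on three sample strings taken from the Revenue output, not a replacement for the author's steps; `str.strip()` additionally removes the trailing space that the loop leaves in place.

```python
import pandas as pd

# Sample values in the same shape as the Revenue column after the '0' fill.
revenue = pd.Series(["$11,757,018 ", "0", "$9,254,614 "])

cleaned = (
    revenue.str.replace("$", "", regex=False)   # literal dollar sign, not a regex anchor
           .str.replace(",", "", regex=False)   # thousands separators
           .str.strip()                         # trailing whitespace
)

# An explicit 64-bit integer dtype avoids platform-dependent defaults.
as_int = cleaned.astype("int64")
print(as_int.tolist())
```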


In [22]: u

Out[22]: ['11757018 ',

'12329371 ',
'10597009 ',
'0',
'0',
'0',
'9254614 ',
'9451943 ',
'14001180 ']

In [23]: # STEP 4 =>> all the cleaned values are now in the variable u; assign them back to the Revenue column

data['Revenue']=u
data

Out[23]:
    ID   Name             Industry            Inception  Revenue   Expenses           Profit      Growth
0   1.0  Lamtone          IT Services         2009.0     11757018  6,482,465 Dollars  5274553.0   0.30
1   2.0  Stripfind        Financial Services  2010.0     12329371  916,455 Dollars    11412916.0  NaN
2   3.0  Canecorporation  Health              2012.0     10597009  NaN                3005820.0   NaN
3   4.0  NaN              IT Services         2013.0     0         7,429,377 Dollars  6597557.0   NaN
4   5.0  NaN              NaN                 NaN        0         7,435,363 Dollars  3138627.0   NaN
5   6.0  Techline         Health              2006.0     0         5,470,303 Dollars  8427816.0   0.23
6   7.0  Cityace          NaN                 2010.0     9254614   NaN                3005116.0   0.06
7   8.0  Kayelectronics   NaN                 2009.0     9451943   3,878,113 Dollars  5573830.0   0.04
8   9.0  Ganzlax          IT Services         2011.0     14001180  NaN                11901180.0  0.18

In [24]: data.info() # shows that Revenue is still dtype object, i.e. it holds string values

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 9 non-null float64
1 Name 7 non-null object
2 Industry 6 non-null object
3 Inception 8 non-null float64
4 Revenue 9 non-null object
5 Expenses 6 non-null object
6 Profit 9 non-null float64
7 Growth 5 non-null float64
dtypes: float64(4), object(4)
memory usage: 648.0+ bytes


In [25]: data['Revenue']=data['Revenue'].astype(int)

In [26]: data.info() # now Revenue has been changed into integer value------>>>>

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 9 non-null float64
1 Name 7 non-null object
2 Industry 6 non-null object
3 Inception 8 non-null float64
4 Revenue 9 non-null int32
5 Expenses 6 non-null object
6 Profit 9 non-null float64
7 Growth 5 non-null float64
dtypes: float64(4), int32(1), object(3)
memory usage: 612.0+ bytes
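Two side notes on the conversion above, offered as a sketch rather than a correction. First, `astype(int)` produced `int32` here, which is the platform default on Windows with NumPy versions before 2.0; passing `"int64"` makes the width explicit. Second, filling NaN with `'0'` before converting makes missing revenue indistinguishable from a real zero. The snippet shows one alternative that keeps the gap: `pd.to_numeric` followed by the nullable `"Int64"` dtype. The three sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: already-cleaned digit strings with one missing entry.
revenue = pd.Series(["11757018", np.nan, "9254614"], dtype=object)

# to_numeric parses the strings and leaves the missing entry as NaN (float64).
as_float = pd.to_numeric(revenue)

# The nullable "Int64" extension dtype stores integers alongside <NA>.
as_nullable_int = as_float.astype("Int64")

print(as_nullable_int)
print(as_nullable_int.mean())  # the missing entry is skipped, not counted as 0
```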

In [27]: # NOW WE PERFORM THE SAME STEPS FOR THE EXPENSE COLUMN----->>>>>>>>>

#STEP 1 =>> fill the NaN values in the column with '0'

data.fillna({'Expenses':'0'},inplace=True)
data

Out[27]:
    ID   Name             Industry            Inception  Revenue   Expenses           Profit      Growth
0   1.0  Lamtone          IT Services         2009.0     11757018  6,482,465 Dollars  5274553.0   0.30
1   2.0  Stripfind        Financial Services  2010.0     12329371  916,455 Dollars    11412916.0  NaN
2   3.0  Canecorporation  Health              2012.0     10597009  0                  3005820.0   NaN
3   4.0  NaN              IT Services         2013.0     0         7,429,377 Dollars  6597557.0   NaN
4   5.0  NaN              NaN                 NaN        0         7,435,363 Dollars  3138627.0   NaN
5   6.0  Techline         Health              2006.0     0         5,470,303 Dollars  8427816.0   0.23
6   7.0  Cityace          NaN                 2010.0     9254614   0                  3005116.0   0.06
7   8.0  Kayelectronics   NaN                 2009.0     9451943   3,878,113 Dollars  5573830.0   0.04
8   9.0  Ganzlax          IT Services         2011.0     14001180  0                  11901180.0  0.18


In [28]: # STEP 2 =>> make list

l=data['Expenses']
l2=list(l)
l2

Out[28]: ['6,482,465 Dollars',

'916,455 Dollars',
'0',
'7,429,377 Dollars',
'7,435,363 Dollars',
'5,470,303 Dollars',
'0',
'3,878,113 Dollars',
'0']

In [38]: # STEP 3 =>> remove extra text such as "Dollars" and the comma (,) from the list
m=[]
for i in l2:
    e=""
    for j in i:
        if(j.isdigit()):
            e=e+j
    m.append(e)

In [39]: m

Out[39]: ['6482465',
'916455',
'0',
'7429377',
'7435363',
'5470303',
'0',
'3878113',
'0']
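The `isdigit()` filter keeps only the digit characters of each string. A regular expression does the same in one vectorized call: `\D` matches any non-digit character. The sketch below uses three sample strings from the Expenses output. Like the loop, it would also strip a minus sign or decimal point, which is fine for this data but is an assumption worth checking on other datasets.

```python
import pandas as pd

# Sample values in the same shape as the Expenses column after the '0' fill.
expenses = pd.Series(["6,482,465 Dollars", "0", "916,455 Dollars"])

# Remove every non-digit character (commas, spaces, the word "Dollars").
digits_only = expenses.str.replace(r"\D", "", regex=True)

as_int = digits_only.astype("int64")
print(as_int.tolist())
```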

In [40]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 9 non-null float64
1 Name 7 non-null object
2 Industry 6 non-null object
3 Inception 8 non-null float64
4 Revenue 9 non-null int32
5 Expenses 9 non-null object
6 Profit 9 non-null float64
7 Growth 5 non-null float64
dtypes: float64(4), int32(1), object(3)
memory usage: 612.0+ bytes


In [41]: data['Expenses']=m
data

Out[41]:
    ID   Name             Industry            Inception  Revenue   Expenses  Profit      Growth
0   1.0  Lamtone          IT Services         2009.0     11757018  6482465   5274553.0   0.30
1   2.0  Stripfind        Financial Services  2010.0     12329371  916455    11412916.0  NaN
2   3.0  Canecorporation  Health              2012.0     10597009  0         3005820.0   NaN
3   4.0  NaN              IT Services         2013.0     0         7429377   6597557.0   NaN
4   5.0  NaN              NaN                 NaN        0         7435363   3138627.0   NaN
5   6.0  Techline         Health              2006.0     0         5470303   8427816.0   0.23
6   7.0  Cityace          NaN                 2010.0     9254614   0         3005116.0   0.06
7   8.0  Kayelectronics   NaN                 2009.0     9451943   3878113   5573830.0   0.04
8   9.0  Ganzlax          IT Services         2011.0     14001180  0         11901180.0  0.18

In [42]: data['Expenses']=data['Expenses'].astype(int)

In [43]: data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9 entries, 0 to 8
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 9 non-null float64
1 Name 7 non-null object
2 Industry 6 non-null object
3 Inception 8 non-null float64
4 Revenue 9 non-null int32
5 Expenses 9 non-null int32
6 Profit 9 non-null float64
7 Growth 5 non-null float64
dtypes: float64(4), int32(2), object(2)
memory usage: 576.0+ bytes

In [44]: gf=data.select_dtypes(['int','float']) # selects the integer and float columns

Out[44]: ID Inception Revenue Expenses Profit Growth

0 1.0 2009.0 11757018 6482465 5274553.0 0.30

1 2.0 2010.0 12329371 916455 11412916.0 NaN

2 3.0 2012.0 10597009 0 3005820.0 NaN

3 4.0 2013.0 0 7429377 6597557.0 NaN

4 5.0 NaN 0 7435363 3138627.0 NaN

5 6.0 2006.0 0 5470303 8427816.0 0.23

6 7.0 2010.0 9254614 0 3005116.0 0.06

7 8.0 2009.0 9451943 3878113 5573830.0 0.04

8 9.0 2011.0 14001180 0 11901180.0 0.18


In [46]: gf.isnull().sum()/len(gf)*100 # gives the share of NaN values in each column as a percentage

Out[46]: ID 0.000000
Inception 11.111111
Revenue 0.000000
Expenses 0.000000
Profit 0.000000
Growth 44.444444
dtype: float64
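The percentage above is the null count divided by the row count. Because `isnull()` yields booleans, taking the column mean gives the same fraction directly. A minimal sketch on a hypothetical four-row frame:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: one of four Inception values and two of four Growth values are missing.
sample = pd.DataFrame(
    {
        "Inception": [2009.0, np.nan, 2012.0, 2013.0],
        "Growth": [0.30, np.nan, np.nan, 0.23],
    }
)

pct_from_sum = sample.isnull().sum() / len(sample) * 100
pct_from_mean = sample.isnull().mean() * 100   # mean of booleans = fraction of True

print(pct_from_mean)
```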

In [47]: gf

Out[47]: ID Inception Revenue Expenses Profit Growth

0 1.0 2009.0 11757018 6482465 5274553.0 0.30

1 2.0 2010.0 12329371 916455 11412916.0 NaN

2 3.0 2012.0 10597009 0 3005820.0 NaN

3 4.0 2013.0 0 7429377 6597557.0 NaN

4 5.0 NaN 0 7435363 3138627.0 NaN

5 6.0 2006.0 0 5470303 8427816.0 0.23

6 7.0 2010.0 9254614 0 3005116.0 0.06

7 8.0 2009.0 9451943 3878113 5573830.0 0.04

8 9.0 2011.0 14001180 0 11901180.0 0.18

In [48]: # CHECK FOR OUTLIERS IN THE GROWTH COLUMN

t=gf['Growth']
t

Out[48]: 0 0.30
1 NaN
2 NaN
3 NaN
4 NaN
5 0.23
6 0.06
7 0.04
8 0.18
Name: Growth, dtype: float64

In [ ]:


In [49]: t.plot.box()

Out[49]: <Axes: >


In [52]: # CHECK FOR OUTLIERS IN THE INCEPTION COLUMN

t2=gf['Inception']
t2
t2.plot.box()

Out[52]: <Axes: >
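The box plots did not survive in this export, so here is a numeric version of the same check for the Growth column. It applies the 1.5 × IQR rule, which is what a box plot's whiskers use by default in Matplotlib. The values are copied from the Growth output above; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

growth = pd.Series([0.30, np.nan, np.nan, np.nan, np.nan, 0.23, 0.06, 0.04, 0.18])

# quantile() ignores NaN values by default.
q1 = growth.quantile(0.25)
q3 = growth.quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Comparisons with NaN are False, so missing values are never flagged.
outliers = growth[(growth < lower) | (growth > upper)]

print(q1, q3, lower, upper)
print(outliers)
```

With no values outside the bounds, both the mean and the median are defensible fill values for this column; the median is the more common choice when outliers are present.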

In [50]: # FILL THE MEAN VALUE IN PLACE OF THE NaN VALUES------------>>

gf.fillna(gf.mean())

Out[50]: ID Inception Revenue Expenses Profit Growth

0 1.0 2009.0 11757018 6482465 5274553.0 0.300

1 2.0 2010.0 12329371 916455 11412916.0 0.162

2 3.0 2012.0 10597009 0 3005820.0 0.162

3 4.0 2013.0 0 7429377 6597557.0 0.162

4 5.0 2010.0 0 7435363 3138627.0 0.162

5 6.0 2006.0 0 5470303 8427816.0 0.230

6 7.0 2010.0 9254614 0 3005116.0 0.060

7 8.0 2009.0 9451943 3878113 5573830.0 0.040

8 9.0 2011.0 14001180 0 11901180.0 0.180
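Note that `fillna` returns a new DataFrame unless `inplace=True` is passed or the result is assigned, which is why `gf` still shows NaN values in later cells. The sketch below demonstrates this on a small hypothetical frame, where each column's NaN is replaced by that column's own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one gap per column.
sample = pd.DataFrame(
    {
        "Inception": [2009.0, 2010.0, np.nan],
        "Growth": [0.30, np.nan, 0.06],
    }
)

# sample.mean() is a Series indexed by column name, so each column
# is filled with its own mean. The original frame is not modified.
filled = sample.fillna(sample.mean())

print(sample.isnull().sum().sum())   # gaps remain in the original
print(filled)
```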

In [57]: gf['Growth']=gf['Growth'].fillna(gf['Growth'].median()) # fill the NaN values with the median


In [58]: gf #(here median is 0.18)

Out[58]: ID Inception Revenue Expenses Profit Growth

0 1.0 2009.0 11757018 6482465 5274553.0 0.30

1 2.0 2010.0 12329371 916455 11412916.0 0.18

2 3.0 2012.0 10597009 0 3005820.0 0.18

3 4.0 2013.0 0 7429377 6597557.0 0.18

4 5.0 NaN 0 7435363 3138627.0 0.18

5 6.0 2006.0 0 5470303 8427816.0 0.23

6 7.0 2010.0 9254614 0 3005116.0 0.06

7 8.0 2009.0 9451943 3878113 5573830.0 0.04

8 9.0 2011.0 14001180 0 11901180.0 0.18

In [59]: data

Out[59]:
    ID   Name             Industry            Inception  Revenue   Expenses  Profit      Growth
0   1.0  Lamtone          IT Services         2009.0     11757018  6482465   5274553.0   0.30
1   2.0  Stripfind        Financial Services  2010.0     12329371  916455    11412916.0  NaN
2   3.0  Canecorporation  Health              2012.0     10597009  0         3005820.0   NaN
3   4.0  NaN              IT Services         2013.0     0         7429377   6597557.0   NaN
4   5.0  NaN              NaN                 NaN        0         7435363   3138627.0   NaN
5   6.0  Techline         Health              2006.0     0         5470303   8427816.0   0.23
6   7.0  Cityace          NaN                 2010.0     9254614   0         3005116.0   0.06
7   8.0  Kayelectronics   NaN                 2009.0     9451943   3878113   5573830.0   0.04
8   9.0  Ganzlax          IT Services         2011.0     14001180  0         11901180.0  0.18

In [60]: gf=data.select_dtypes(['int','float']) # selects the integer and float columns

Out[60]: ID Inception Revenue Expenses Profit Growth

0 1.0 2009.0 11757018 6482465 5274553.0 0.30

1 2.0 2010.0 12329371 916455 11412916.0 NaN

2 3.0 2012.0 10597009 0 3005820.0 NaN

3 4.0 2013.0 0 7429377 6597557.0 NaN

4 5.0 NaN 0 7435363 3138627.0 NaN

5 6.0 2006.0 0 5470303 8427816.0 0.23

6 7.0 2010.0 9254614 0 3005116.0 0.06

7 8.0 2009.0 9451943 3878113 5573830.0 0.04

8 9.0 2011.0 14001180 0 11901180.0 0.18


In [62]: gf.fillna(0) # fill zero in place of NaN values

Out[62]: ID Inception Revenue Expenses Profit Growth

0 1.0 2009.0 11757018 6482465 5274553.0 0.30

1 2.0 2010.0 12329371 916455 11412916.0 0.00

2 3.0 2012.0 10597009 0 3005820.0 0.00

3 4.0 2013.0 0 7429377 6597557.0 0.00

4 5.0 0.0 0 7435363 3138627.0 0.00

5 6.0 2006.0 0 5470303 8427816.0 0.23

6 7.0 2010.0 9254614 0 3005116.0 0.06

7 8.0 2009.0 9451943 3878113 5573830.0 0.04

8 9.0 2011.0 14001180 0 11901180.0 0.18

In [ ]: # FILL THE LINEAR VALUES INTO THE DATA--------->>>>>>>>>
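The final cell is empty in the source, so the following is only one possible reading of its heading: filling gaps by linear interpolation. It is an assumption, not the author's code. `Series.interpolate(method="linear")` replaces each run of NaN values with evenly spaced values between the neighbouring known points. The data is the Growth column as shown earlier.

```python
import numpy as np
import pandas as pd

growth = pd.Series([0.30, np.nan, np.nan, np.nan, np.nan, 0.23, 0.06, 0.04, 0.18])

# Rows 1..4 are filled with evenly spaced steps between 0.30 and 0.23.
interpolated = growth.interpolate(method="linear")

print(interpolated)
```

Linear interpolation assumes the row order is meaningful (for example, a time series). For unordered records such as this company list, mean or median filling is usually easier to justify.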
