EDA - Session-4 - Numerical Data Analysis
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
path=r"C:\Users\omkar\OneDrive\Documents\Data science\Naresh IT\Datafiles\Vi
visa_df=pd.read_csv(path)
visa_df.head(3)
In [2]: visa_df.columns
In [3]: visa_df.select_dtypes(exclude='object').columns
𝑝𝑟𝑒𝑣𝑎𝑖𝑙𝑖𝑛𝑔_𝑤𝑎𝑔𝑒
For a numerical column, the analysis covers these summary statistics:
count, mean, median, std, min, max, and the 25th, 50th and 75th percentiles
In [4]: visa_df['prevailing_wage']
Out[4]: 0 592.2029
1 83425.6500
2 122996.8600
3 83434.0300
4 149907.3900
...
25475 77092.5700
25476 279174.7900
25477 146298.8500
25478 86154.7700
25479 70876.9100
Name: prevailing_wage, Length: 25480, dtype: float64
𝑐𝑜𝑢𝑛𝑡
In [5]: len(visa_df['prevailing_wage'])
Out[5]: 25480
In [6]: visa_df['prevailing_wage'].count()
Out[6]: 25480
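Both calls return 25480 here, which suggests the column has no missing values. They are not interchangeable in general: len() counts every row, while Series.count() skips NaN. A minimal sketch on a made-up series (the values and the name `wages` are illustrative, not taken from the visa data):

```python
import numpy as np
import pandas as pd

# Hypothetical wage values with one missing entry
wages = pd.Series([592.20, np.nan, 83425.65, 122996.86])

n_rows = len(wages)             # counts every row, including NaN
n_valid = wages.count()         # counts only non-missing values
n_missing = wages.isna().sum()  # number of missing values

print(n_rows, n_valid, n_missing)
```

Comparing the two numbers is a quick way to spot missing values before computing any other statistic.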
𝑚𝑒𝑎𝑛
In [7]: visa_df['prevailing_wage'].mean() # pandas
Out[7]: 74455.81459209183
In [8]: np.mean(visa_df['prevailing_wage'])
Out[8]: 74455.81459209183
𝑚𝑒𝑑𝑖𝑎𝑛
In [9]: visa_df['prevailing_wage'].median()
Out[9]: 70308.20999999999
In [10]: np.median(visa_df['prevailing_wage'])
Out[10]: 70308.20999999999
𝑚𝑎𝑥
In [11]: visa_df['prevailing_wage'].max()
Out[11]: 319210.27
In [12]: np.max(visa_df['prevailing_wage'])
Out[12]: 319210.27
𝑚𝑖𝑛
In [13]: visa_df['prevailing_wage'].min()
Out[13]: 2.1367
In [14]: np.min(visa_df['prevailing_wage'])
Out[14]: 2.1367
𝑠𝑡𝑑
In [16]: visa_df['prevailing_wage'].std()
Out[16]: 52815.94232687357
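Unlike the other statistics, std has no NumPy cross-check in these cells. One point worth keeping in mind: pandas Series.std() defaults to the sample standard deviation (ddof=1), whereas np.std on a NumPy array defaults to the population version (ddof=0). A small sketch on made-up numbers, passing ddof explicitly so that both libraries agree:

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # illustrative data
arr = values.to_numpy()

sample_std_pd = values.std()         # pandas default: ddof=1 (divide by n-1)
sample_std_np = np.std(arr, ddof=1)  # match pandas by setting ddof=1
population_std = np.std(arr)         # NumPy default: ddof=0 (divide by n)

print(sample_std_pd, sample_std_np, population_std)
```

With 25480 rows the two versions differ only slightly, but on small samples the gap is noticeable.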
Out[22]: prevailing_wage
count 25480.00
max 319210.27
min 2.14
mean 74455.81
median 70308.21
std 52815.94
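The cell that produced this summary table is not part of the export. One possible way to build a table of this shape is Series.agg with a list of statistic names; this is a sketch under that assumption, not necessarily what the session used. The small DataFrame `df` is a stand-in for visa_df, and only the column name comes from the notebook.

```python
import pandas as pd

# Illustrative stand-in for visa_df; only the column name comes from the notebook
df = pd.DataFrame(
    {"prevailing_wage": [592.2029, 83425.65, 122996.86, 83434.03, 149907.39]}
)

col = "prevailing_wage"
summary = (
    df[col]
    .agg(["count", "max", "min", "mean", "median", "std"])  # one value per statistic
    .round(2)
    .to_frame()  # single-column DataFrame named after the column
)
print(summary)
```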
𝑝𝑒𝑟𝑐𝑒𝑛𝑡𝑖𝑙𝑒-𝑞𝑢𝑎𝑛𝑡𝑖𝑙𝑒
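Percentile and quantile functions are available in NumPy.
np.percentile()
first argument: the column (array of values)
second argument: percentile value between 0 and 100
np.quantile()
first argument: the column (array of values)
second argument: quantile value between 0 and 1
A quantile of 0.25 corresponds to the 25th percentile.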
In [23]: np.percentile(visa_df['prevailing_wage'],25)
Out[23]: 34015.479999999996
In [26]: np.quantile(visa_df['prevailing_wage'],0.25)
Out[26]: 34015.479999999996
Out[36]: 6370
Out[38]: 12740
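The outputs 6370 and 12740 above are 25% and 50% of the 25480 rows, which is consistent with counting the values that fall below the 25th and 50th percentiles, although the cells that produced them are missing from this export. A minimal sketch of that idea on made-up data (the series `wages` is hypothetical):

```python
import numpy as np
import pandas as pd

wages = pd.Series(np.arange(1.0, 101.0))  # made-up data: 1, 2, ..., 100

p25 = np.percentile(wages, 25)  # percentile scale: 0 to 100
q25 = np.quantile(wages, 0.25)  # quantile scale: 0 to 1
s25 = wages.quantile(0.25)      # pandas equivalent, also on the 0 to 1 scale

# Share of values strictly below the 25th percentile
share_below = (wages < p25).mean()
print(p25, q25, s25, share_below)
```

All three calls use linear interpolation by default, so they return the same value.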
Out[39]: prevailing_wage
count 25480.0000
max 319210.2700
min 2.1400
mean 74455.8100
median 70308.2100
std 52815.9400
25% 34015.4800
50% 70308.2100
75% 107735.5125
In [40]: visa_df.describe()
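# 3 numerical columns
# Assumed reconstruction: cols, l and d were defined in cells missing from this export;
# the helper name stats is hypothetical.
cols=visa_df.select_dtypes(exclude='object').columns
def stats(s):
    return [s.count(),s.max(),s.min(),s.mean(),s.median(),s.std(),
            s.quantile(0.25),s.quantile(0.50),s.quantile(0.75)]
l=[stats(visa_df[c]) for c in cols]   # one list of 9 statistics per column
print(l)
index=['count','max','min',
'mean','median','std',
'25%','50%','75%']
# zip(l[0],l[1],l[2]) regroups the per-column lists into one row per statistic
pd.DataFrame(zip(l[0],l[1],l[2]),columns=cols,index=index)
# Same table built from a dict that maps each column name to its list of statistics
d=dict(zip(cols,l))
pd.DataFrame(d,index=index)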
ℎ𝑖𝑠𝑡𝑜𝑔𝑟𝑎𝑚
In [5]: f,i,n=plt.hist(visa_df['prevailing_wage'],
bins=40)
# plt.hist returns (counts, bin edges, bar patches); f, i and n hold them in that order
In [8]: len(f),len(i),len(n)
In [9]: f
Out[9]: array([2992., 871., 1005., 1170., 1242., 1434., 1385., 1443., 1444.,
1445., 1457., 1335., 1268., 1217., 1088., 978., 807., 645.,
509., 373., 264., 144., 105., 111., 107., 99., 88.,
79., 65., 64., 58., 53., 33., 33., 29., 19.,
7., 3., 6., 5.])
In [10]: i
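With bins=40 there are 40 counts but 41 bin edges, because each bin needs a lower and an upper edge. np.histogram computes the same counts and edges as plt.hist without drawing anything, so it is convenient for checking this. A sketch on synthetic data (the array `wages` is a made-up stand-in for the prevailing_wage column):

```python
import numpy as np

rng = np.random.default_rng(0)
wages = rng.uniform(2.0, 320000.0, size=1000)  # synthetic stand-in for prevailing_wage

counts, edges = np.histogram(wages, bins=40)

# All bins have the same width: (max - min) / number of bins
width = (wages.max() - wages.min()) / 40
print(len(counts), len(edges), width)
```

The first edge is the column minimum and the last edge is the column maximum, which matches the lower bound 2.1367 used in the next cell.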
In [20]: l=2.13670000e+00
u=7.98234003e+03
c1=visa_df['prevailing_wage']>=l
c2=visa_df['prevailing_wage']<u
c=c1&c2
len(visa_df[c])
Out[20]: 2992
871
In [ ]: # Task-1
Create a DataFrame with one row per bin:
lower upper frequency
2.1367 7982.34 2992
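One possible approach to Task-1, sketched on synthetic data: consecutive bin edges give the lower and upper bound of each bin, and the counts give the frequency. The names `bins_df`, `counts` and `edges` are illustrative; in the notebook, `f` and `i` returned by plt.hist could be used in place of `counts` and `edges`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
wages = rng.uniform(2.0, 320000.0, size=500)  # synthetic stand-in for prevailing_wage

counts, edges = np.histogram(wages, bins=40)

bins_df = pd.DataFrame({
    "lower": edges[:-1],   # left edge of each bin
    "upper": edges[1:],    # right edge of each bin
    "frequency": counts,   # number of values falling in the bin
})
print(bins_df.head())
```

Each bin includes its lower edge and excludes its upper edge, except the last bin, which includes both.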
In [ ]: # task-2:
# How do you plot a histogram in seaborn?
𝐵𝑜𝑥𝑝𝑙𝑜𝑡
A boxplot is used to identify outliers.
A boxplot shows:
Q1: 25th percentile
Q2: 50th percentile (median)
Q3: 75th percentile
IQR: Q3 - Q1
Mild outliers: values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
Extreme outliers: values below Q1 - 3*IQR or above Q3 + 3*IQR
              |-----:-----|
  o  |--------|     :     |--------|  o  o
              |-----:-----|
flier         <----------->          fliers
                   IQR
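The fences described above can be computed directly from the quartiles. The following is a minimal sketch on made-up values (the array `wages` is illustrative); matplotlib's boxplot uses the same 1.5*IQR rule by default to decide which points are drawn as fliers.

```python
import numpy as np

# Made-up values: most lie close together, two are far from the rest
wages = np.array([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0, 28.0, 100.0])

q1, q3 = np.percentile(wages, [25, 75])
iqr = q3 - q1

# Mild fences at 1.5*IQR, extreme fences at 3*IQR
mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr

outliers = wages[(wages < mild_low) | (wages > mild_high)]       # beyond the mild fences
extreme = wages[(wages < extreme_low) | (wages > extreme_high)]  # beyond the extreme fences
print(q1, q3, iqr, outliers, extreme)
```

Here 28 lies beyond the mild fence only, while 100 lies beyond the extreme fence as well.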
In [28]: plt.boxplot(visa_df['prevailing_wage'],
vert=False)
plt.show()
# black dots are outliers
# orange line is median
In [ ]: CI : Part 2 statistics