DAV Practical File
1. Given below is a dictionary having two keys, 'Boys' and 'Girls', with two lists
of the heights of five boys and five girls respectively as the values associated
with these keys.
3. Create a dataframe having at least 3 columns and 50 rows to store numeric data
generated using a random function. Replace 10% of the values with null values, at
index positions generated using a random function. Do the following:
4. Consider two Excel files having the attendance of a workshop's participants for
two days. Each file has three fields: 'Name', 'Time of joining' and 'Duration (in
minutes)'; names are unique within a file. Note that the duration may take only one
of three values (30, 40, 50). Import the data into two dataframes and do the
following:
5. Taking Iris data, plot the following with proper legend and axis labels:
(Download the IRIS data from https://archive.ics.uci.edu/ml/datasets/iris or import
it from sklearn.datasets)
a. Plot a bar chart to show the frequency of each class label in the data.
b. Draw a scatter plot of petal width vs. sepal width.
c. Plot the density distribution of the feature petal length.
d. Use a pair plot to show the pairwise bivariate distributions in the Iris
dataset.
7. Consider a data frame containing data about students, i.e. name, birth month,
gender and passing division (a sample of the rows is shown):

   Name             Birth Month   Gender   Division
3  Aditya Narayan   October       M        I
4  Sanjeev Sahni    February      M        II
8  Meeta Kulkarni   July          F        II
9  Preeti Ahuja     November      F        II
a. Perform one-hot encoding of the last two columns of categorical data using the
get_dummies() function.
b. Sort this data frame on the "Birth Month" column (i.e. January to December).
Hint: convert the month to a categorical type.
8. Consider the following data frame, in which each record contains a family name,
the gender of the family member and her/his monthly income.
Name    Gender   Monthly Income (Rs.)
Shah    Male     114000.00
Vats    Male      65000.00
Vats    Female    43150.00
Kumar   Female    69500.00
Vats    Female   155000.00
Kumar   Male     103000.00
Shah    Male      55000.00
Shah    Female   112400.00
Kumar   Female    81030.00
Vats    Male      71900.00
The students are encouraged to work on a good dataset in consultation with their faculty and to apply the concepts learned in
the course. The datasets mentioned in Ref 2, chapter 2, pp. 37-38 may be consulted. The following is a sample of the kind of
work expected in the project.
Q1
import pandas as pd

DL = {'Boys': [72, 68, 70, 69, 74], 'Girls': [63, 65, 69, 62, 61]}
# convert the dictionary of lists into a list of row-wise dictionaries
pd.DataFrame(DL).to_dict(orient="records")
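Q1's task statement is truncated in the source; the line above converts the dictionary of lists into a list of row-wise records. As a small additional sketch of the kind of summary this data supports (an assumption, not part of the original answer):
df1 = pd.DataFrame(DL)   # heights as a two-column dataframe
print(df1.mean())        # mean height of the boys and of the girls
print(df1.std())         # sample standard deviation of each group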
Q2. a
import numpy as np

x = np.array([[10, 30], [20, 60], [40, 100]])
print("Mean of each row:")
print(x.mean(axis=1))
print("Standard deviation of each row:")
print(np.std(x, axis=1))
print("Variance of each row:")
print(np.var(x, axis=1))
Q2. b
import numpy as np
B = np.array([56, 48, 22, 41, 78, 91, 24, 46, 8, 33])
print("Original array:")
print(B)
i = np.argsort(B)
print("Indices of the sorted elements of a given array:")
print(i)
Original array:
[56 48 22 41 78 91 24 46 8 33]
Indices of the sorted elements of a given array:
[8 2 6 9 3 7 1 0 4 5]
Q2.c
import numpy as np

R = int(input("Enter the number of rows: "))
C = int(input("Enter the number of columns: "))
matrix = []
print("Enter the entries row-wise:")
for i in range(R):                        # read the matrix row by row
    row = [int(input()) for j in range(C)]
    matrix.append(row)
for i in range(R):                        # echo the matrix back
    for j in range(C):
        print(matrix[i][j], end=" ")
    print()
print(np.shape(matrix))
print(type(matrix))
newarray = np.transpose(matrix)           # transpose of the entered matrix
print(newarray)

# the source called an undefined find(); a non-zero index finder is the
# evident intent, given the printed output below
def find_nonzero(arr):
    return [i for i, x in enumerate(arr) if x != 0]

def find_zero(arr):
    return [i for i, x in enumerate(arr) if x == 0]

arr = np.ravel(matrix)                    # flatten the matrix for the index search
arr1 = find_nonzero(arr)
arr2 = find_zero(arr)
print(arr1)
print(arr2)
[0, 1, 2, 4, 5, 6, 8]
[3, 7]
Q3
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 100, size=(75, 4)),
                  columns=list('ABCD'))   # 75 rows (>= the required 50) x 4 columns
df
A B C D
0 85 67 26 95
1 12 23 23 17
2 11 18 18 72
3 92 4 33 80
4 78 55 20 56
.. .. .. .. ..
70 73 77 53 44
71 18 56 66 43
72 79 40 89 60
73 51 30 85 69
74 78 98 34 45
# num_null() in the source was an undefined helper; the assumed intent is to
# replace 10% of the values with NaN at randomly generated index positions
df = df.astype(float)
n = int(df.size * 0.10)
rows = np.random.randint(0, df.shape[0], n)
cols = np.random.randint(0, df.shape[1], n)
for r, c in zip(rows, cols):
    df.iat[r, c] = np.nan
df
A B C D
0 85.0 67.0 26.0 95.0
1 12.0 23.0 23.0 17.0
2 11.0 18.0 18.0 72.0
3 92.0 4.0 33.0 80.0
4 78.0 55.0 20.0 56.0
.. ... ... ... ...
70 73.0 77.0 53.0 44.0
71 18.0 56.0 66.0 43.0
72 NaN NaN NaN NaN
73 51.0 30.0 85.0 69.0
74 78.0 98.0 34.0 45.0
Q3. a
df.isnull().sum()
A 7
B 7
C 7
D 7
dtype: int64
df.isnull()
A B C D
0 False False False False
1 False False False False
2 False False False False
3 False False False False
4 False False False False
.. ... ... ... ...
70 False False False False
71 False False False False
72 True True True True
73 False False False False
74 False False False False
Q3. b
df['sum'] = df.sum(axis=1)   # row-wise total of the four columns
df.head()
A B C D sum
0 85.0 67.0 26.0 95.0 273.0
1 12.0 23.0 23.0 17.0 75.0
2 11.0 18.0 18.0 72.0 119.0
3 92.0 4.0 33.0 80.0 209.0
4 78.0 55.0 20.0 56.0 209.0
df.sort_values('sum',ascending=False)
     A     B     C     D    sum
18  77.0  97.0  64.0  47.0  285.0
39  75.0  46.0  72.0  92.0  285.0
44  79.0  93.0  85.0  24.0  281.0
0   85.0  67.0  26.0  95.0  273.0
33  94.0  54.0  48.0  69.0  265.0
..   ...   ...   ...   ...    ...
55  65.0  24.0   9.0  20.0  118.0
29  12.0  93.0   7.0   2.0  114.0
13  56.0  20.0  31.0   7.0  114.0
60   0.0  27.0  67.0   5.0   99.0
15   NaN   NaN   NaN   NaN    0.0

[55 rows x 5 columns]
df.drop(18, inplace=True)   # drop the row label with the maximum row total
df
     A     B     C     D    sum
0   85.0  67.0  26.0  95.0  273.0
2   11.0  18.0  18.0  72.0  119.0
3   92.0   4.0  33.0  80.0  209.0
4   78.0  55.0  20.0  56.0  209.0
5   20.0  85.0  72.0  78.0  255.0
..   ...   ...   ...   ...    ...
68  21.0  51.0  63.0  34.0  169.0
69  26.0  55.0  18.0  40.0  139.0
70  73.0  77.0  53.0  44.0  247.0
71  18.0  56.0  66.0  43.0  183.0
73  51.0  30.0  85.0  69.0  235.0

[54 rows x 5 columns]
Q3.c
mod_df = df.dropna(axis=0, thresh=5)   # keep only rows having at least 5 non-null values
mod_df
A B C D sum
0 85.0 67.0 26.0 95.0 273.0
2 11.0 18.0 18.0 72.0 119.0
3 92.0 4.0 33.0 80.0 209.0
4 78.0 55.0 20.0 56.0 209.0
5 20.0 85.0 72.0 78.0 255.0
.. ... ... ... ... ...
69 26.0 55.0 18.0 40.0 139.0
70 73.0 77.0 53.0 44.0 247.0
71 18.0 56.0 66.0 43.0 183.0
73 51.0 30.0 85.0 69.0 235.0
74 78.0 98.0 34.0 45.0 255.0
Q3. d
sort_col = df.sort_values(by='A', ascending=False)
sort_col
     A     B     C     D    sum
54  98.0  35.0  33.0  27.0  193.0
30  96.0  33.0  66.0  53.0  248.0
33  94.0  54.0  48.0  69.0  265.0
8   93.0   7.0  54.0  25.0  179.0
3   92.0   4.0  33.0  80.0  209.0
..   ...   ...   ...   ...    ...
45   4.0  62.0  21.0  85.0  172.0
19   3.0  87.0  35.0  75.0  200.0
10   1.0  19.0  64.0  86.0  170.0
60   0.0  27.0  67.0   5.0   99.0
15   NaN   NaN   NaN   NaN    0.0

[55 rows x 5 columns]
Q3. e
df.drop_duplicates(subset='A', keep='first', inplace=True)   # keep the first occurrence of each value of A
df
     A     B     C     D    sum
0   85.0  67.0  26.0  95.0  273.0
2   11.0  18.0  18.0  72.0  119.0
3   92.0   4.0  33.0  80.0  209.0
4   78.0  55.0  20.0  56.0  209.0
5   20.0  85.0  72.0  78.0  255.0
..   ...   ...   ...   ...    ...
68  21.0  51.0  63.0  34.0  169.0
69  26.0  55.0  18.0  40.0  139.0
70  73.0  77.0  53.0  44.0  247.0
71  18.0  56.0  66.0  43.0  183.0
73  51.0  30.0  85.0  69.0  235.0

[55 rows x 5 columns]
Q3. f
column_1 = df["A"]
column_2 = df["B"]
correlation = column_1.corr(column_2)   # Pearson correlation between A and B
correlation
-0.1914983464702535
print(df.B.cov(df.C))                   # covariance between B and C
-161.4713487071978
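For a view of all pairwise relationships at once, the full correlation matrix can also be computed; a small extra sketch, not part of the original answer:
df[['A', 'B', 'C', 'D']].corr()   # pairwise Pearson correlations of the data columns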
Q4
xls1 = pd.ExcelFile('/content/Attendance1.xlsx')
xls1.sheet_names
f1 = pd.read_excel(xls1, xls1.sheet_names[0])   # day-1 attendance
f1
xls2 = pd.ExcelFile('/content/Attendance2.xlsx')
xls2.sheet_names
f2 = pd.read_excel(xls2, xls2.sheet_names[0])   # day-2 attendance
f2
Q4. a
j = pd.merge(f1, f2, on=['Name'])   # inner join: participants present on both days
j['Name']
Q4. b
k = pd.merge(f1, f2, how='outer', on=['Name'])   # outer join: participants present on either day
k['Name']
Q4. c
frames = [f1, f2]
result = pd.concat(frames)   # result was undefined in the source; stacking the two days' records is the evident intent
result
Q4. d
f_new = pd.merge(f1, f2)
# hierarchical index on name and duration
df2 = f_new.set_index(keys=[f_new.columns[0], f_new.columns[2]])
df2
df2.describe()
Q5
import numpy as np
import pandas as pd
import seaborn as sns
sns.set_palette('husl')
import matplotlib.pyplot as plt
%matplotlib inline
data = pd.read_csv('Iris.csv')
data.head()
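The code cell for part (b) did not survive in the source (only a matplotlib warning remained); a minimal sketch of the intended scatter plot, assuming the same column names as above:
# part (b): scatter plot of petal width vs. sepal width
plt.scatter(data['SepalWidthCm'], data['PetalWidthCm'], color='blue')
plt.xlabel('Sepal width (cm)')
plt.ylabel('Petal width (cm)')
plt.title('Petal width vs. sepal width')
plt.show()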
sns.barplot(x='Species', y='SepalLengthCm', data=data)
sns.barplot(x='Species', y='SepalWidthCm', data=data)
sns.barplot(x='Species', y='PetalLengthCm', data=data)
sns.barplot(x='Species', y='PetalWidthCm', data=data)
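The bar plots above show the mean of each measurement per species; part (a) asks for the frequency of each class label, for which a count plot is the more direct answer (a small sketch, same column names assumed):
# part (a): frequency of each class label
sns.countplot(x='Species', data=data)
plt.title('Frequency of each class label')
plt.show()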
data.PetalLengthCm.plot.density(color='green')
plt.title('Density Plot for Petal Length')
plt.show()
sns.pairplot(data)
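To make the pairwise bivariate distributions separable by class, the species column can be passed as the hue; a small variation on the answer above:
sns.pairplot(data, hue='Species')   # colour each pairwise plot by class label
plt.show()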
Q6
data2 = pd.read_csv('https://raw.githubusercontent.com/codebasics/py/master/pandas/14_ts_datetimeindex/aapl.csv')
data2.head(10)
data2.groupby('Open')['Volume'].mean()
Open
96.75 23794945.0
96.82 56239822.0
97.17 24167463.0
97.39 38918997.0
97.41 25892171.0
...
155.02 21069647.0
155.19 64882657.0
155.25 21250798.0
155.94 20048478.0
156.01 26009719.0
Name: Volume, Length: 246, dtype: float64
data2.groupby('Open', as_index=False)['Volume'].mean()
Open Volume
0 96.75 23794945.0
1 96.82 56239822.0
2 97.17 24167463.0
3 97.39 38918997.0
4 97.41 25892171.0
.. ... ...
241 155.02 21069647.0
242 155.19 64882657.0
243 155.25 21250798.0
244 155.94 20048478.0
245 156.01 26009719.0
data2['Date'] = pd.to_datetime(data2['Date'])
data2.head(10)
(output: the first 10 rows of data2, with the Date column now parsed as datetime64)
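The file comes from a DatetimeIndex tutorial, so a natural next step once Date is parsed is to index by it and resample; a hedged sketch (the monthly aggregation is an assumption, and the usual Open/High/Low/Close/Volume columns of this file are assumed):
data2 = data2.set_index('Date')                       # index the records by date
monthly_close = data2['Close'].resample('M').mean()   # monthly mean closing price
print(monthly_close.head())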
Q7
df3 = pd.DataFrame({
    'Name': ['Mudit Chauhan', 'Seema Chopra', 'rani gupta', 'aditya narayan',
             'sanjeev sahani', 'prakash kumar', 'Ritu Agarwal', 'Akshay Goel',
             'Meeta Kulkarni', 'Preeti Ahuja', 'Sunil Das Gupta', 'Sonali Sapre',
             'Rashmi Talwar', 'Ashish Dubey', 'Kiran Sharma', 'Sameer Bansal'],
    'Birth_Month': ['December', 'January', 'March', 'October', 'February',
                    'December', 'September', 'August', 'July', 'November',
                    'April', 'January', 'May', 'June', 'February', 'October'],
    'Gender': ['M', 'F', 'F', 'M', 'M', 'M', 'F', 'M',
               'F', 'F', 'M', 'F', 'F', 'M', 'F', 'M'],
    'Pass_division': [3, 2, 1, 1, 2, 3, 1, 1, 2, 2, 3, 1, 3, 2, 2, 1]})
df3
pd.get_dummies(df3.Gender)
F M
0 0 1
1 1 0
2 1 0
3 0 1
4 0 1
5 0 1
6 1 0
7 0 1
8 1 0
9 1 0
10 0 1
11 1 0
12 1 0
13 0 1
14 1 0
15 0 1
pd.get_dummies(df3.Gender, drop_first=True)
M
0 1
1 0
2 0
3 1
4 1
5 1
6 0
7 1
8 0
9 0
10 1
11 0
12 0
13 1
14 0
15 1
pd.get_dummies(df3.Gender, prefix='Gender')   # the code line for this output was missing in the source
Gender_F Gender_M
0 0 1
1 1 0
2 1 0
3 0 1
4 0 1
5 0 1
6 1 0
7 0 1
8 1 0
9 1 0
10 0 1
11 1 0
12 1 0
13 0 1
14 1 0
15 0 1
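Part (a) asks for one-hot encoding of both of the last two columns. The cell that produced the 9-column frame shown below was not preserved; a sketch consistent with it (4 original columns plus 2 Gender and 3 Pass_division dummies) would be:
# one-hot encode the last two categorical columns and append the dummies
dummies = pd.get_dummies(df3[['Gender', 'Pass_division']],
                         columns=['Gender', 'Pass_division'])
df3 = pd.concat([df3, dummies], axis=1)   # 4 original + 5 dummy columns = 9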
df3.head()
[5 rows x 9 columns]
df3
df3.sort_values(by='Birth_Month')   # note: a plain sort is alphabetical, not calendar order
sort_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July',
              'August', 'September', 'October', 'November', 'December']
# convert Birth_Month into an ordered categorical index so the sort is calendar-wise
df3.index = pd.CategoricalIndex(df3['Birth_Month'], categories=sort_order, ordered=True)
df3 = df3.sort_index().reset_index(drop=True)
df3
Q8
df4 = pd.DataFrame({
    'Name': ['Shah', 'Vats', 'Vats', 'Kumar', 'Vats',
             'Kumar', 'Shah', 'Shah', 'Kumar', 'Vats'],
    'Gender': ['Male', 'Male', 'Female', 'Female', 'Female',
               'Male', 'Male', 'Female', 'Female', 'Male'],
    'Monthly_Income (Rs)': [114000, 65000, 43150, 69500, 155000,
                            103000, 55000, 112400, 81030, 71900]})
df4
# sumOfIncome was undefined in the source; a family-wise total via groupby is the assumed intent
sumOfIncome = df4.groupby('Name')['Monthly_Income (Rs)'].sum()
print(sumOfIncome)
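Q8's sub-parts are not reproduced in the source; as a hedged sketch of the kind of aggregations this frame supports (the specific statistics chosen are assumptions):
# highest and lowest income recorded under each family name
print(df4.groupby('Name')['Monthly_Income (Rs)'].agg(['max', 'min']))
# average monthly income of the female members of each family
females = df4[df4['Gender'] == 'Female']
print(females.groupby('Name')['Monthly_Income (Rs)'].mean())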