Data Visualization using Pyplot

Data visualization means representing data in a graphical format that is easier to understand.
For data visualization in Python, these notes use the Matplotlib library.

Matplotlib
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy
formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python
and IPython shells, the Jupyter notebook, web application servers, and different graphical user interface
toolkits.

Types of plots/charts (to be studied as per syllabus):


Some examples of charts are:
1. Line plots
2. Bar plot
3. Histograms

Working with matplotlib


For working with matplotlib usually we use the following import command:
import matplotlib.pyplot as plt

and frequently we need numpy for creating datasets, so numpy is also imported as follows:
import numpy as np

Matplotlib Object Hierarchy


A plot drawn in matplotlib is a hierarchy of nested Python objects as shown below:

• A Figure object is the outermost container for a matplotlib graphic. It is the overall window/page on
which everything is drawn.
• The Axes is the area on which the data is plotted with functions such as plot() and scatter(). A
Figure can contain multiple Axes, but an Axes object is a part of only one Figure.
• Below the Axes in the hierarchy are smaller objects such as tick marks, individual lines, legends, and
text boxes. Almost every “element” of a chart is a Python object which can be manipulated, all the way
down to the ticks and labels.
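
The hierarchy described above can be inspected from code. The following is a minimal sketch, not part of the original notes: the names fig, ax and line are illustrative, and it assumes the object-style call plt.subplots(), which is introduced later in these notes.

```python
import matplotlib.pyplot as plt

# One Figure containing a single Axes
fig, ax = plt.subplots()

# plot() returns a list of line objects; unpack the only one
(line,) = ax.plot([1, 2, 3], [2, 4, 6])

print(type(fig).__name__)                    # class name of the outermost container
print(ax.figure is fig)                      # the Axes belongs to this Figure
print(line in ax.lines)                      # the line is a child object of the Axes
print(len(ax.xaxis.get_major_ticks()) > 0)   # even the ticks are Python objects

# plt.show()  # uncomment to display the window
```

Each print statement looks at one level of the hierarchy: the Figure, the Axes inside it, and the smaller objects inside the Axes.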
Parts of a Figure/Plot

Basic Steps involved in drawing any plot

1. Identify the data you want to represent on the plot.


For plots such as a line graph, this means identifying the values to be represented on the X-axis as well as on the Y-axis. For pie charts, histograms etc. there will usually be only one dataset.
2. Identify the structure of the plot you want
The next step is identifying which plot will be suitable to represent the data accurately. It can be line plot,
bar plot, histogram etc. Also consider whether you want many sets of data to be represented in the same
plot or to show different plots for different sets of data.
3. Setup the different parameters of the plot
Each plot has different components such as the xticks, yticks, the shape/colour of markers/plots, legend
etc. Set the parameters of the plot.
4. Draw the plot.
Line plots:
A line chart or line plot or line graph or curve chart is a type of chart which displays information as a series of
data points called 'markers' connected by straight line segments.

A line chart is often used to visualize a trend in data over intervals of time – a time series – thus the line is often
drawn chronologically.
Plotting a Line Plot

#1 line plot - basic


import matplotlib.pyplot as plt

#1 setup the data


x=[1,2,3,4,5]
y=[2,4,6,8,10]

#2 setup the parameters for the plot


plt.plot(x,y)

#3 display the plot


plt.show()

Output:

1. For drawing any plot, matplotlib.pyplot is usually imported as plt.


2. The plot function plt.plot() draws a line graph by joining each pair of successive (x, y) points with a
straight line.
3. The plot function accepts two datasets: the first is a list of x-coordinates and the second a list of the
corresponding y-coordinates. Both lists must contain the same number of values.
4. plt.plot(x,y) draws the graph and plt.show() displays the plot on the screen.
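
As a small illustration of point 3, the sketch below passes lists of different lengths to plt.plot(). The names y_short and mismatch_detected are made up for this example.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]
y_short = [2, 4, 6]          # deliberately shorter than x

try:
    plt.plot(x, y_short)
    mismatch_detected = False
except ValueError as err:
    # matplotlib refuses to pair 5 x-values with 3 y-values
    mismatch_detected = True
    print("Cannot plot:", err)
```

The error is raised by plt.plot() itself, before anything is displayed, so a mismatch in the data is caught early.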

Variations to the above program:


1. The x- and y-coordinates can be numpy arrays. The advantage of using numpy is that linearly spaced
values of x-coordinates can be generated as follows.
import numpy as np
x=np.linspace(10,120,100)
The above code generates 100 values that are linearly spaced between the end values of 10 and 120.

2. After generating the x-values if the y-values can be represented as an equation in terms of x-variable
then the y values can be generated directly as follows:
y = x*x +2*x + 6
The above code will generate the corresponding y-values for all the x- values that are generated above.
For generating the sin values the following can be used:
y=np.sin(x)

3. To get a smooth curve, the x-values must be spaced closely enough over the range being displayed;
otherwise the curve will appear as a series of jagged line segments.
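
The effect of the spacing can be seen by drawing the same function with few and with many points. This is only a sketch; the point counts 6 and 100 and the names x_coarse and x_fine are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x_coarse = np.linspace(0, 2 * np.pi, 6)     # only 6 points: visibly jagged
x_fine = np.linspace(0, 2 * np.pi, 100)     # 100 points: looks smooth

plt.plot(x_coarse, np.sin(x_coarse), 'o--r', label='6 points')
plt.plot(x_fine, np.sin(x_fine), '-b', label='100 points')
plt.legend()

# plt.show()  # uncomment to display the window
```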
#2 line plot - using numpy arrays
import matplotlib.pyplot as plt
import numpy as np

#1 setup the data


x=np.linspace(10,120,100) #generate 100 linearly spaced values betwn 10 & 120
y = x*x +2*x + 6 #generate the corresponding y-values

plt.plot(x,y) #plot and show


plt.show()

x=np.linspace(0,np.pi*2,100) #generate 100 values between 0 and 2*pi


y=np.sin(x)
plt.plot(x,y)
plt.show()

Output:

Setting the parameters on the graph


The functions of pyplot (imported as plt) can be used to set the different parameters of the plot before
displaying it. The various parameters that can be set are:
1. plt.xlabel('time') - xlabel sets the x-axis label
2. plt.ylabel('speed') - ylabel sets the y-axis label
3. plt.yticks([5,7,10]) - yticks sets the tick marks that appear on y-axis
4. plt.xticks([1,3,4],['abc','def','ghi']) - xticks sets the ticks to appear on the x-axis at
points [1,3,4]; the second parameter replaces the corresponding labels with ['abc','def','ghi'].
5. plt.grid() - displays the gridlines
6. plt.legend() - displays the legend using the labels of the corresponding plots. Call legend()
only after plot(), since it takes the labels from the plot function.
7. In addition, the plot function can accept a third parameter, the format string -
plt.plot(x,y,'>--c', label='car 1')
The format string (fmt) has the following specification:
fmt = '[marker][line][color]'
All the values are optional and the possible values for marker, line and color are shown below:

marker                        line style                  color
char  description             char  description           char  color
'.'   point marker            '-'   solid line style      'b'   blue
','   pixel marker            '--'  dashed line style     'g'   green
'o'   circle marker           '-.'  dash-dot line style   'r'   red
'v'   triangle_down marker    ':'   dotted line style     'c'   cyan
'^'   triangle_up marker                                  'm'   magenta
'<'   triangle_left marker                                'y'   yellow
'>'   triangle_right marker                               'k'   black
'1'   tri_down marker                                     'w'   white
'2'   tri_up marker
'3'   tri_left marker
'4'   tri_right marker
's'   square marker
'p'   pentagon marker
'*'   star marker
'h'   hexagon1 marker
'H'   hexagon2 marker
'+'   plus marker
'x'   x marker
'D'   diamond marker
'd'   thin_diamond marker
'|'   vline marker
'_'   hline marker
Example of format strings:
'b' # blue markers with default shape
'or' # red circles
'-g' # green solid line
'--' # dashed line with default color
'^k:' # black triangle_up markers connected by a dotted line
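
To see how a format string is split into its three parts, the line object returned by plot() can be inspected. This is a sketch using the format string '>--c' from the text above; the names fig, ax and line are illustrative, and to_rgba is used only to compare colours safely.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_rgba

fig, ax = plt.subplots()
(line,) = ax.plot([1, 2, 3], [2, 4, 6], '>--c', label='car 1')

print(line.get_marker())                           # marker part of the format string
print(line.get_linestyle())                        # line-style part
print(to_rgba(line.get_color()) == to_rgba('c'))   # colour part is cyan
```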

#3 line plot - setting parameters


import matplotlib.pyplot as plt

#1 setup the data


x=[1,2,3,4,5]
y=[2,4,6,8,10]

#2 setup the parameters for the plot


plt.xlabel('time') #xlabel - x axis label
plt.ylabel('speed') #ylabel - y axis label
plt.title('speed vs time') # title - title of plot
plt.xticks([1,3,4],['abc','def','ghi']) #xticks - ticks on x-axis
plt.yticks([5,7,10]) #yticks - ticks on y-axis
plt.plot(x,y,'>--c', label='car 1') #using format strings
plt.legend() #display legend using label of plot
plt.grid() #display gridlines

#3 display the plot


plt.show()

Output:
Multiple plots in the same figure
Multiple plots can be drawn in the same figure by any one of the following methods:
1. By using the plt.plot() function multiple times with different data sets and parameters each time.
2. By using a single plot function with multiple parameters for x and y variables as shown below:
plt.plot(x1,y1,'formatstring1', x2,y2,'formatstring2')
When using this method, the labels for both the plots must be passed as a list while calling the legend() function

#3 line plot - multiple plots on same figure


import matplotlib.pyplot as plt

#1 setup the data


x1=[1,2,3,4,5] #dataset for first plot
y1=[2,4,6,8,10]

x2=[3,7,9,12,15,17] #dataset for second plot


y2=[3,9,12,18,23,27]

#2 setup the common parameters


plt.xlabel('time')
plt.ylabel('location')
plt.title('speed vs time')
plt.xticks([8,13,16],['loc1','loc2','loc3']) #ticks can be assigned names
plt.yticks([5,12,25])
plt.grid()

#3 parameters for first plot


plt.plot(x1,y1,'>--c', label='car 1') #car 1 dataset is x1,y1

#4 parameters for second plot


plt.plot(x2,y2,'o-.g', label='car 2') #car 2 dataset is x2,y2

plt.legend() #display legend only after plotting all graphs

#5 display the plot


plt.show()

#6 another way of plotting multiple plots


#after show() a new, empty figure is started, so the labels, ticks and grid set earlier are gone
#if needed, these parameters must be set again
plt.plot(x1,y1,'>--c', x2,y2,'o-.g') #two lines in one call; labels are given in legend() below
plt.legend(['car 1', 'car 2']) #multiple entries in legend written here
plt.show()

Output:
Plotting from a DataFrame
The plot() function can accept the source data for the x- and y- coordinates from an object having tabular data
such as from a DataFrame. The syntax used for plot() in this case is:
plt.plot('column_name_for_x_axis', 'column_name_for_y_axis',data=DataFrameName, label='labelname')

While using this method, we can also draw multiple plots from the same DataFrame by calling the plot() function
multiple times with different x- and y- columns.
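
The same idea can be tried without a CSV file. The sketch below types the first three rows of the reservoir table into a DataFrame by hand (an assumption made only so that the example is self-contained) and plots two columns against 'Date'.

```python
import matplotlib.pyplot as plt
import pandas as pd

# A few rows typed in by hand for illustration
df = pd.DataFrame({
    'Date': ['01-01-2018', '01-02-2018', '01-03-2018'],
    'POONDI': [1012, 1387, 2011],
    'CHOLAVARAM': [513, 451, 398],
})

# Column names are looked up in the object passed as data=
plt.plot('Date', 'POONDI', data=df, label='POONDI')
plt.plot('Date', 'CHOLAVARAM', data=df, label='CHOLAVARAM')
plt.xticks(rotation=90)      # rotate the date labels
plt.legend()

# plt.show()  # uncomment to display the window
```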
#4 line plot - plotting from a DataFrame
import matplotlib.pyplot as plt
import pandas as pd

df1=pd.read_csv('chennai_reservoir_levels.csv') #file is in the current folder
#print('df1=\n', df1)

#1 plotting a single plot


plt.xticks(rotation=90) #rotate the xticks
plt.plot('Date','POONDI',data=df1, label='POONDI')
plt.legend()
plt.show()

#2 multiple plots in same Figure from DataFrame


plt.xticks(rotation=90) #rotate the xticks
plt.plot('Date','POONDI', data=df1)
plt.plot('Date','CHOLAVARAM', data=df1)
plt.plot('Date','REDHILLS', data=df1)
plt.plot('Date','CHEMBARAMBAKKAM', data=df1)
plt.legend(['POONDI','CHOLAVARAM','REDHILLS','CHEMBARAMBAKKAM'])
plt.show()

Output:
The source data in file 'chennai_reservoir_levels.csv' is shown below:
Date POONDI CHOLAVARAM REDHILLS CHEMBARAMBAKKAM
01-01-2018 1012 513 1585 1842
01-02-2018 1387 451 1368 1693
01-03-2018 2011 398 1194 1507
01-04-2018 1611 100 1660 1215
01-05-2018 396 70 1779 1198
01-06-2018 184 68 1427 1214
01-07-2018 132 61 1120 906
01-08-2018 50 26 920 628
01-09-2018 13 1 713 445
01-10-2018 93 8 478 338
01-11-2018 695 20 809 232
01-12-2018 381 40 1102 185
01-01-2019 298 48 941 102
01-03-2019 477 48 520 22
01-04-2019 333 42 301 10
01-05-2019 193 11 125 2
Plotting multiple subplots
We can plot multiple subplots in the same Figure object by dividing the Figure object into subplots as shown
below:
plt.subplots(num_of_rows, num_of_columns, sharex=False, sharey=False)

where
num_of_rows - the number of rows of subplots in the Figure
num_of_columns - the number of columns of subplots in the Figure
sharex - set to True if the subplots should share the same x-axis ticks. If sharex is True, the x tick labels
are shown only on the bottom-most row of subplots. (Default value is False)
sharey - set to True if the subplots should share the same y-axis ticks. If sharey is True, the y tick labels
are shown only on the leftmost column of subplots. (Default value is False)

The subplots() function returns two values, the figure object and the axes object. The figure object refers to the
entire drawing area. The axes objects can be used in two ways:
Method 1:
f1, (ax1,ax2) = plt.subplots(1,2,sharey=True)
The figure object is divided into 1 row and 2 columns i.e. two subplots are created. The first subplot is
assigned to object ax1 and the second subplot is assigned to ax2
Method 2:
f1, ax = plt.subplots(2,2,sharex=True, sharey=True)
The figure object is divided into 2 rows and 2 columns i.e. four subplots are created. All four axes
objects are returned as a 2D array (matrix) in ax. An individual axes object is accessed using matrix notation,
i.e. the first subplot is ax[0,0], the second is ax[0,1], the third is ax[1,0] and the fourth is ax[1,1].

Layout of the subplots in the two methods:

f1, (ax1,ax2) = plt.subplots(1,2,sharey=True)
    ax1        ax2

f1, ax = plt.subplots(2,2,sharex=True, sharey=True)
    ax[0,0]    ax[0,1]
    ax[1,0]    ax[1,1]

After getting the individual axes objects we can use the plot() function with the individual axes objects and draw
independent plots in each of the axes object areas.

While using the individual axes objects, note the following:

1. For setting ticks on the x- and y-axes, instead of plt.xticks() and plt.yticks() use the functions
ax1.set_xticks() and ax1.set_yticks()
2. For rotating the labels, instead of plt.xticks(rotation=90) use:
ax[1,0].tick_params( axis='x', labelrotation =90)
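
The points above can be tried with a few made-up curves before reading data from a file. This sketch assumes nothing beyond numpy and matplotlib; the data values and labels are invented for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)

# 2 x 2 grid; ax is a 2D numpy array of Axes objects
fig, ax = plt.subplots(2, 2, sharex=True, sharey=True)
print(ax.shape)

ax[0, 0].plot(x, x, label='x')
ax[0, 1].plot(x, 2 * x, label='2x')
ax[1, 0].plot(x, 18 - x, label='18-x')
ax[1, 1].plot(x, 18 - 2 * x, label='18-2x')

# Axes methods take the place of plt.xticks() and plt.xticks(rotation=90)
ax[1, 0].set_xticks([0, 3, 6, 9])
ax[1, 0].tick_params(axis='x', labelrotation=90)

for a in ax.flat:            # one legend per subplot
    a.legend()

# plt.show()  # uncomment to display the window
```

Looping over ax.flat is a convenient way to apply the same setting to every subplot without writing out each index.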
#4 line plot - plotting multiple plots in different subplots
import matplotlib.pyplot as plt
import pandas as pd

df1=pd.read_csv('chennai_reservoir_levels.csv') #file is in the current folder
#print('df1=\n', df1)

#1 creating 1x2 grid subplots


f1, (ax1,ax2) = plt.subplots(1,2,sharey=True) #yticks shown only once per row

#2 set the parameters for the two subplots


ax1.set_xticks([]) #with axis objects use set_xticks not xticks
ax2.set_xticks([])
ax1.plot('Date','POONDI', data=df1,label='POONDI')
ax2.plot('Date','CHOLAVARAM', data=df1, label='CHOLAVARAM')
ax1.legend()
ax2.legend()

#3 Display the plot


plt.show()

Output:
#5 line plot - 2x2 subplots, sharing axes, rotating labels
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df1=pd.read_csv('chennai_reservoir_levels.csv') #file is in the current folder
#print('df1=\n', df1)

#1 creating 2x2 grid (four subplots)


f2, ax = plt.subplots(2,2,sharey=True,sharex=True) #using single axes object

#2 using array notation to access individual subplot


ax[0,0].plot('Date','POONDI', data=df1, label='POONDI' )
ax[0,1].plot('Date','CHOLAVARAM', data=df1, label='CHOLAVARAM')
ax[1,0].plot('Date','REDHILLS', data=df1, label='REDHILLS')
ax[1,1].plot('Date','CHEMBARAMBAKKAM', data=df1, label='CHEMBARAMBAKKAM')

ax[1,0].set_xticks(np.arange(0,18,5)) #show only 0,5,10,15 th reading dates

#rotate labels in x-axis by 90 degrees


ax[1,0].tick_params( axis='x', labelrotation =90)
ax[1,1].tick_params( axis='x', labelrotation =90)

#display all legends


ax[0,0].legend()
ax[0,1].legend()
ax[1,0].legend()
ax[1,1].legend()
plt.show()

Output:
Bar plot:
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or
lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally.

A bar graph shows comparisons among discrete categories. One axis of the chart shows the specific categories
being compared, and the other axis represents a measured value.
Plotting a Bar Graph

The syntax for plotting a Bar Graph is:


plt.bar(x, height, width=0.8, bottom=None, align='center', data=None)

where:
x : sequence of scalars which form the x coordinates of the bars
height: sequence of scalars which form the heights of the bars.
width: scalar or array-like, optional which are the width(s) of the bars (default: 0.8)
bottom: scalar or array-like, optional. The y coordinate(s) of the bars bases (default: 0)
align: {'center', 'edge'}, optional, default: 'center'. It shows the alignment of the bars to the x coordinates:
'center': Center the base on the x positions.
'edge': Align the left edges of the bars with the x positions.
data: If the source of the data is another matrix like structure such as a DataFrame then the name of the object
is mentioned here.

Explanation regarding width of bar graph


If the x-coordinates are numbers then width is in the same unit as the number on the x-axis.

If the x-coordinates are not numbers (e.g. strings), matplotlib places the categories at the positions 0, 1, 2, ...
so the distance between one xtick and the next is 1, and the width is a fraction of that distance.

By default width=0.8. This means that 80 percent of the space between the xticks 'a' and 'b' is occupied by the
bars and the remaining 20 percent is the gap between one bar and the next bar.

xticks:   a        b        c

The other parameters of the bar graph such as xlabel, ylabel, title, xticks, yticks, legend are same as the line
plots and can be set using the plt object.
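
The width rule can be checked by reading back the rectangles that bar() draws. This is a sketch with three made-up categories; it uses the Axes form of bar() and the patch methods get_x() and get_width().

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
bars = ax.bar(['a', 'b', 'c'], [35, 60, 75])   # default width=0.8

for bar in bars:
    left = bar.get_x()
    right = left + bar.get_width()
    print(round(left, 2), round(right, 2))

# The categories sit at positions 0, 1 and 2, so each bar extends
# 0.4 on either side of its tick and a gap of 0.2 separates neighbours.
```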
# 7 Simple bar plot
import matplotlib.pyplot as plt

x1=[10, 20, 30, 40, 50]


y1 = [35, 60, 75, 25, 90]
plt.bar(x1,y1) #each bar is 0.8 units wide on the numeric x-axis
plt.show()

x2=['a', 'b', 'c', 'd', 'e']


y2 = [35, 60, 75, 25, 90]
plt.bar(x2,y2) #each bar takes 80% of the spacing between neighbouring xticks
plt.show()

Output:
# 8 bar plot setting parameters
import matplotlib.pyplot as plt

x=['a', 'b', 'c', 'd', 'e']


y = [35, 60, 75, 25, 90]

plt.xlabel('city') #xlabel - x axis label


plt.ylabel('number of birds') #ylabel - y axis label
plt.title('Birds in Cities') # title - title of plot
plt.yticks(y) #yticks - ticks on y-axis
plt.bar(x,y,label='Birds')

plt.legend() #display legend using label of plot


plt.grid() #display gridlines

plt.show()

Output:
Displaying Bar plot from a DataFrame

Consider the excel file 'product_sales.xlsx' containing the following data and imported into DataFrame df:

Sales Area Chocolate Cake Biscuit


Area A 20 5 20
Area M 30 9 12
Area B 12 12 18
Area N 8 7 23

For displaying data from a DataFrame df, we use the appropriate column names for the x- and y-coordinates and
pass the parameter data=df when using the bar() function.

# 9 bar plot from DataFrame


import matplotlib.pyplot as plt
import pandas as pd

df=pd.read_excel('product_sales.xlsx')

plt.bar('Sales Area','Chocolate',data=df, label='Chocolate')

plt.legend() #display legend using label of plot


plt.grid() #display gridlines

plt.show()

Output:
Displaying grouped bar chart

For displaying grouped data, we change the x-position on the x-axis where the bar for each of the individual
plots should appear. Consider that we want to draw three bar plots in the same figure for 'Area A', 'Area B' and
'Area C' to be shown on the x-axis. On the y-axis we want the data for 'Chocolate', 'Cake' and 'Biscuit' i.e. three
groups to be shown. The steps are as follows:

1. First change the xticks to start from 1,2,3 and so on. The names that are displayed on the tick marks can
be 'Area A', 'Area B' and 'Area C'.

x-axis tick positions:   1        2        3
tick labels:             Area A   Area B   Area C

2. The distance between any two xticks is 1. This is the maximum width that is available for displaying all
the three bar plots. Select any one width such that the sum of all the widths of the three bar plots is less
than 1. For example if we select the width as 0.2, then since we display three bar plots, the combined
width becomes (0.2 x 3) = 0.6, which is less than 1. The remaining (1 - 0.6) =0.4 is the empty space
between one grouped bar plot and the next grouped bar plot.
3. Next, rearrange the x-positions of the three individual bar plots so that they are adjacent and do
not overlap. The following method is adopted.

width of individual bar plot, wd=0.2

bar centres around each xtick:   (x-wd)   (x)   (x+wd)   (each bar is wd wide)

x-axis tick positions:   1        2        3
tick labels:             Area A   Area B   Area C

a) The centre bar plot of blue colour is centered exactly at the xtick position, (x)
b) The first bar plot of green colour is centered at (x-wd)
c) The third bar plot of red colour is centered at (x+wd)
i.e. for displaying three grouped bar plots, the first bar plot will be plotted at (x-wd) position, second bar plot
will be plotted at (x) position and the third bar plot will be plotted at (x+wd) position.

Similar method is adopted for displaying any number of grouped data. For example for displaying two grouped
bar plots, the first bar plot will be plotted at (x-wd/2) position and the second bar plot will be plotted at
(x+wd/2) position.
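
The rule for two and three groups generalises to any number of groups. The helper below, group_offsets, is a hypothetical function written for these notes (it is not part of matplotlib); it returns the shift from the xtick for each of n adjacent bars of width wd.

```python
import numpy as np

def group_offsets(n_groups, wd):
    """Offsets from each xtick for n_groups adjacent bars of width wd."""
    # Centre the block of bars on the tick: indices 0..n-1 shifted by (n-1)/2
    return (np.arange(n_groups) - (n_groups - 1) / 2) * wd

print(group_offsets(3, 0.2))   # offsets -wd, 0, +wd
print(group_offsets(2, 0.2))   # offsets -wd/2, +wd/2
```

Each bar plot i is then drawn at x + group_offsets(n, wd)[i], which keeps the whole group centred on the tick.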
# 10 Displaying grouped bar charts
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

df=pd.read_excel('product_sales.xlsx')
x=np.arange(1, len(df)+1) #x values are 1,2,3,4…
wd=0.2 #width of bar plot=0.2
xlbl=df['Sales Area']

plt.xticks(x,xlbl)
plt.bar(x-wd,'Chocolate',data=df, label='Chocolate', width=wd)
plt.bar(x,'Cake',data=df, label='Cake', width=wd)
plt.bar(x+wd,'Biscuit',data=df, label='Biscuit', width=wd)

plt.legend() #display legend using label of plot


plt.grid() #display gridlines

plt.show()

Output:
Displaying Stacked bar plots

For displaying stacked bar plot we use the parameter bottom=[y values] while plotting the second bar plot. The
[y values] are the list of y-coordinate values of the first bar plot on which we want to stack the second bar plot.

Consider the following population data of Males and Females contained in the excel file 'population.xlsx'
for which we want to show the stacked bar plot.

city Male Female Total


Area A 20 17 37
Area M 30 19 49
Area B 12 15 27
Area N 8 7 15
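
The role of bottom can be checked without the Excel file. The sketch below types the table above into a DataFrame (an assumption made so that the example is self-contained), passes the columns directly instead of using data=, and confirms that the top of each stacked bar equals the Total column.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    'city': ['Area A', 'Area M', 'Area B', 'Area N'],
    'Male': [20, 30, 12, 8],
    'Female': [17, 19, 15, 7],
})

fig, ax = plt.subplots()
male = ax.bar(df['city'], df['Male'], label='Male')
# Each Female bar starts where the Male bar of the same city ends
female = ax.bar(df['city'], df['Female'], bottom=df['Male'], label='Female')
ax.legend()

tops = [float(bar.get_y() + bar.get_height()) for bar in female]
print(tops)   # equals the Total column: 37, 49, 27, 15

# plt.show()  # uncomment to display the window
```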

# 11 Displaying stacked bar charts


import matplotlib.pyplot as plt
import pandas as pd

df=pd.read_excel('population.xlsx')

plt.bar('city','Male',data=df, label='Male')
plt.bar('city','Female',data=df, label='Female', bottom='Male')

plt.legend() #display legend using label of plot


plt.grid() #display gridlines

plt.show()

Output:
Horizontal Bar plot

To plot a horizontal bar plot use the barh() with the following syntax:

plt.barh(y, width, height=0.8, align='center', data=None)

where:
y : sequence of scalars which form the y coordinates of the bars
width: scalar or array-like, the width(s) of the bars, i.e. their lengths along the x-axis
height: scalar or array-like, optional, the height(s) (thickness) of the bars along the y-axis (default: 0.8)
align: {'center', 'edge'}, optional, default: 'center'. It sets the alignment of the bars to the y coordinates:
'center': Center the bars on the y positions.
'edge': Align the bottom edges of the bars with the y positions.
data: If the source of the data is a table-like structure such as a DataFrame, then the object
is passed here.
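
Because the roles of width and height are swapped compared with bar(), it can help to read them back from a drawn bar. This is a sketch with the Male column typed in by hand; the names cities and males are illustrative.

```python
import matplotlib.pyplot as plt

cities = ['Area A', 'Area M', 'Area B', 'Area N']
males = [20, 30, 12, 8]

fig, ax = plt.subplots()
bars = ax.barh(cities, males, label='Male')   # default height=0.8

first = bars[0]
print(first.get_width())    # length along the x-axis: the data value
print(first.get_height())   # thickness along the y-axis

# plt.show()  # uncomment to display the window
```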

# 12 Displaying horizontal bar plots


import matplotlib.pyplot as plt
import pandas as pd

df=pd.read_excel('population.xlsx')

plt.barh('city','Male',data=df, label='Male')

plt.legend() #display legend using label of plot


plt.grid() #display gridlines

plt.show()

Output:
Histogram
A histogram is a graphical display of data that uses bars of different heights to group numbers into ranges. The
height of each bar shows how many of the data values fall in that particular range.

A histogram is an accurate representation of the distribution of numerical data. It differs from a bar graph, in
the sense that a bar graph relates two variables, but a histogram relates only one variable.
How to draw a Histogram
Example 1: For the dataset containing the CGPA of 15 students shown below, draw the histogram using 10 bins:

6.1, 4.12, 8.2, 6.4, 3.6, 9.2, 5.5, 8.4, 6.2, 9.8, 5.3, 3.9, 8.1, 6.1, 2.7

Step 1: Calculate the range of the data set


range = largest value - smallest value = 9.8 - 2.7 = 7.1

Step 2: Divide the range by the number of groups (bins) you want, rounding up if a convenient width is wanted.
For example, to divide the data set into 10 groups (in Python, if the bins parameter is not given, 10 bins are
used by default), the width of each group is found by
class-width = range / number of groups = 7.1 / 10 = 0.71
Therefore class width = 0.71 (the exact value is kept here)

Step 3: Use the class width to create your groups


The smallest value is 2.7 and class-width is 0.71, so first class or first bin is from 2.7 to (2.7 + 0.71) i.e. from 2.7
to 3.41.
The second class or second bin is from 3.41 to (3.41 +0.71) i.e. second bin is 3.41 to 4.12 and so on…

Draw the following table with the classes/bins:


Bin Classes Tally Frequency
First Bin [2.7 – 3.41)
Second Bin [3.41 – 4.12)
Third Bin [4.12 – 4.83)
Fourth Bin [4.83 – 5.54)
Fifth Bin [5.54 – 6.25)
Sixth Bin [6.25 – 6.96)
Seventh Bin [6.96-7.67)
Eighth Bin [7.67-8.38)
Ninth Bin [8.38-9.09)
Tenth Bin [9.09-9.8]

For any class/bin square brackets [ or ] means including that number and round brackets ) means excluding
that number. For example, [2.7 – 3.41) means the range from 2.7 (including 2.7) till 3.41(excluding 3.41).

Only the last bin has square brackets at both ends, which means that both 9.09 and 9.8 will be counted in the
last bin.

[Note: Because floating point numbers cannot always be stored exactly in a computer, a value lying exactly on a
bin boundary is sometimes counted in the adjacent bin. For example, if 5.54 occurs in the data, ideally it
should fall in bin [5.54 – 6.25), but due to inaccuracies in floating point calculation/storage it may be
counted in bin [4.83 – 5.54). Apart from this, Python follows the boundary rules explained above.]
Step 4: Fill the tally column
For each element in the dataset find the correct bin and place a tally mark ( | ) against that bin. The filled in
table is shown below:
Bin Classes Tally Frequency
First Bin [2.7 – 3.41) |
Second Bin [3.41 – 4.12) ||
Third Bin [4.12 – 4.83) |
Fourth Bin [4.83 – 5.54) ||
Fifth Bin [5.54 – 6.25) |||
Sixth Bin [6.25 – 6.96) |
Seventh Bin [6.96-7.67)
Eighth Bin [7.67-8.38) ||
Ninth Bin [8.38-9.09) |
Tenth Bin [9.09-9.8] ||

Step 5: Fill the Frequency column


Count the number of tally marks and fill the frequency column
Bin Classes Tally Frequency
First Bin [2.7 – 3.41) | 1
Second Bin [3.41 – 4.12) || 2
Third Bin [4.12 – 4.83) | 1
Fourth Bin [4.83 – 5.54) || 2
Fifth Bin [5.54 – 6.25) ||| 3
Sixth Bin [6.25 – 6.96) | 1
Seventh Bin [6.96-7.67) 0
Eighth Bin [7.67-8.38) || 2
Ninth Bin [8.38-9.09) | 1
Tenth Bin [9.09-9.8] || 2
Step 6: Draw the histogram
Take the classes in the X-axis and the Frequency on the Y-axis and draw the histogram.
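
Steps 1 to 5 can be reproduced with numpy as a check on the hand calculation. This is a sketch, not the method used by the original notes: it assumes np.linspace for the bin edges and np.histogram for the counting. As the note above explains, a value lying exactly on an interior edge (4.12 here) may be counted in either neighbouring bin.

```python
import numpy as np

cgpa = [6.1, 4.12, 8.2, 6.4, 3.6, 9.2, 5.5, 8.4,
        6.2, 9.8, 5.3, 3.9, 8.1, 6.1, 2.7]

n_bins = 10
data_range = max(cgpa) - min(cgpa)          # Step 1: 9.8 - 2.7
class_width = data_range / n_bins           # Step 2: about 0.71

# Step 3: 11 equally spaced edges from the smallest to the largest value
edges = np.linspace(min(cgpa), max(cgpa), n_bins + 1)

# Steps 4 and 5: count the values that fall in each bin
freq, _ = np.histogram(cgpa, bins=edges)

# Print one row per bin: lower edge, upper edge, tally marks, frequency
for lo, hi, f in zip(edges[:-1], edges[1:], freq):
    print(f"{lo:.2f} to {hi:.2f}  {'|' * int(f)}  {int(f)}")
```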
Drawing histogram using hist() function of pyplot
The hist() function can be used to draw a histogram. In its simplest use it takes a one-dimensional (1D) array or
a list of values. The other properties of the plot, such as xlabel, ylabel, xticks and yticks, are set in the same
way as for line/bar plots.

In its simplest form histogram is drawn using the command:


plt.hist(data, bins=10)
where-
data - is the list or 1D array containing the data on which histogram is to be created
bins - it can either be a number or a list. If it is a single number it denotes the number of intervals of the
histogram we want. If bins parameters is a list, then the elements of the list are the bin edges. The
number of bin edges must be one greater than the number of intervals needed for the histogram. If bins
parameter is not passed a default value of 10 is taken.
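
When bins is a list, the relation between edges and intervals can be verified from the values hist() returns. This sketch uses the CGPA data from Example 1 with the edges 2, 4, 6, 8, 10, which are chosen here only for illustration.

```python
import matplotlib.pyplot as plt

cgpa = [6.1, 4.12, 8.2, 6.4, 3.6, 9.2, 5.5, 8.4,
        6.2, 9.8, 5.3, 3.9, 8.1, 6.1, 2.7]
edges = [2, 4, 6, 8, 10]        # 5 edges describe 4 intervals

fig, ax = plt.subplots()
counts, used_edges, patches = ax.hist(cgpa, bins=edges)

print(counts)                               # one frequency per interval
print(len(used_edges) == len(counts) + 1)   # one more edge than intervals

# plt.show()  # uncomment to display the window
```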

Program : Drawing Histogram using pyplot

#18 Histogram using pyplot


import matplotlib.pyplot as plt
import pandas as pd

dict1={ 'student': ['s1','s2','s3','s4','s5','s6','s7','s8','s9','s10',


's11','s12','s13', 's14', 's15'],
'cgpa': [6.1,4.12,8.2,6.4,3.6,9.2,5.5,8.4,6.2,9.8,
5.3,3.9,8.1,6.1,2.7],
'numattempts':[1,2,1,3,1,1,2,1,1,2,1,1,3,1,2] }

df1=pd.DataFrame(dict1)
data = df1['cgpa'] #extract 1D data on which the histogram is to be drawn

plt.xlabel('cgpa range')
plt.ylabel('Number of Students')
plt.title('cgpa range vs Number of students')
plt.grid()
plt.hist(data,bins=10)
plt.show() #bin edges not shown automatically

Output:

When matplotlib draws a histogram, the xticks are chosen automatically and are usually not aligned with the bin edges.
Histogram - displaying bin edges correctly
There are two ways to display the bin edges correctly-
1. Create a list of bin edges. Use this list as the value for bins parameter of hist() function and set the xticks
to this list of bin edges.
2. Use the return value of the hist() function. The hist() function returns a tuple. The first
element of the tuple is the numpy array containing the frequencies of the intervals of the histogram.
The second element is the numpy array containing the bin edges of the histogram that was created. The
number of bin edges is one more than the number of frequencies. (The third element holds the drawn bars.)

#19 Histogram setting bin edges correctly


import matplotlib.pyplot as plt

data= [6.1,4.12,8.2,6.4,3.6,9.2,5.5,8.4,6.2,9.8,5.3,3.9,8.1,6.1,2.7]

#method 1 setting bin edges manually and setting the xticks to bin edges
b1=[3,5.5,7.5,10]
plt.hist(data, bins=b1)
plt.xticks(b1)
plt.grid()
plt.show()

#method2 accessing the bin edges returned by hist function


rtval = plt.hist(data)
print('rtval=', rtval)
print('rtval[0]=', rtval[0]) #contains frequencies
print('rtval[1]=', rtval[1]) #contains bin edges
plt.xticks(rtval[1]) #set xticks to bin edges
plt.grid()
plt.show()
Output:

rtval[0]= [1. 2. 1. 2. 3. 1. 0. 2. 1. 2.]


rtval[1]= [2.7 3.41 4.12 4.83 5.54 6.25 6.96 7.67 8.38 9.09 9.8 ]
