In - Gov.transport VHINSC
Data visualization means representing data in a graphical format that is easier to understand than the raw values.
For data visualization in Python we use the Matplotlib library.
Matplotlib
Matplotlib is a Python 2D plotting library which produces publication quality figures in a variety of hardcopy
formats and interactive environments across platforms. Matplotlib can be used in Python scripts, the Python
and IPython shells, the Jupyter notebook, web application servers, and different graphical user interface
toolkits.
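The pyplot module of Matplotlib is imported as plt (import matplotlib.pyplot as plt) in the examples that follow.
We also frequently need numpy for creating datasets, so numpy is imported as follows: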
import numpy as np
• A Figure object is the outermost container for a matplotlib graphic. It is the overall window/page on
which everything is drawn.
• The Axes is the area on which the data is plotted with functions such as plot() and scatter(). A
Figure can contain multiple Axes, but an Axes object is a part of only one Figure.
• Below the Axes in the hierarchy are smaller objects such as tick marks, individual lines, legends, and
text boxes. Almost every “element” of a chart is a Python object which can be manipulated, all the way
down to the ticks and labels.
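The hierarchy described above can be inspected directly. The following is a minimal sketch, not taken from the original notes: the variable names (fig, ax_left, ax_right, line) and the sample values are chosen here for illustration, and the Agg backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt

# One Figure (the outermost container) holding two Axes
fig, (ax_left, ax_right) = plt.subplots(1, 2)

# plot() returns a list of Line2D objects, which sit below the Axes in the hierarchy
(line,) = ax_left.plot([1, 2, 3], [2, 4, 6], label="demo")

print(len(fig.axes))                 # number of Axes contained in the Figure
print(ax_left.figure is fig)         # an Axes belongs to exactly one Figure
print(line in ax_left.get_lines())   # the line is owned by the Axes it was drawn on

# Even individual tick labels are Python objects that can be modified
ax_left.set_xticks([1, 2, 3])
for tick_label in ax_left.get_xticklabels():
    tick_label.set_rotation(45)
```

Calling plt.show() at the end (with an interactive backend) displays the figure.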
Parts of a Figure/Plot
A line chart is often used to visualize a trend in data over intervals of time – a time series – thus the line is often
drawn chronologically.
Plotting a Line Plot
Output:
2. After generating the x-values, if the y-values can be expressed as an equation in terms of x, then the
y-values can be generated directly as follows:
y = x*x +2*x + 6
The above code generates the corresponding y-value for every x-value generated earlier, because arithmetic on a
numpy array is applied element by element.
For generating the sine values the following can be used:
y=np.sin(x)
3. For smooth curves, the x-values must be closely spaced over the range being displayed; otherwise the curve
will appear as a jagged line, because plot() joins consecutive points with straight segments.
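The effect of the spacing can be seen in the following sketch. It assumes that the x-values are generated with np.linspace(), which the surviving text does not confirm; the range 0 to 2π and the point counts 8 and 200 are likewise chosen only for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Assumption: evenly spaced x-values produced with np.linspace
x_coarse = np.linspace(0, 2 * np.pi, 8)     # few points, widely spaced
x_fine = np.linspace(0, 2 * np.pi, 200)     # many points, closely spaced

# y-values computed element by element from the x array
y_poly = x_fine * x_fine + 2 * x_fine + 6
y_sin_coarse = np.sin(x_coarse)
y_sin_fine = np.sin(x_fine)

plt.figure()
plt.plot(x_coarse, y_sin_coarse, label='8 points (jagged)')
plt.plot(x_fine, y_sin_fine, label='200 points (smooth)')
plt.legend()
```

Both curves show the same function; only the number of x-values between 0 and 2π differs.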
#2 line plot - using numpy arrays
import matplotlib.pyplot as plt
import numpy as np
Output:
Output:
Multiple plots in the same figure
Multiple plots can be drawn in the same figure by any one of the following methods:
1. By using the plt.plot() function multiple times with different data sets and parameters each time.
2. By using a single plot function with multiple parameters for x and y variables as shown below:
plt.plot(x1,y1,'formatstring1', x2,y2,'formatstring2')
When using this method, the labels for the plots must be passed as a list when calling the legend() function.
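Both methods are sketched below for comparison. The sine and cosine data and the format strings 'r-' and 'b--' are illustrative choices, not values from the original notes.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# Method 1: call plot() once per data set; labels are given per call
plt.figure()
plt.plot(x, np.sin(x), 'r-', label='sin(x)')
plt.plot(x, np.cos(x), 'b--', label='cos(x)')
plt.legend()
n_lines_method1 = len(plt.gca().get_lines())

# Method 2: one plot() call with several x, y, format-string groups;
# the labels are passed to legend() as a list, in plotting order
plt.figure()
plt.plot(x, np.sin(x), 'r-', x, np.cos(x), 'b--')
plt.legend(['sin(x)', 'cos(x)'])
legend_texts = [t.get_text() for t in plt.gca().get_legend().get_texts()]
```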
Output:
Plotting from a DataFrame
The plot() function can accept the source data for the x- and y- coordinates from an object having tabular data
such as from a DataFrame. The syntax used for plot() in this case is:
plt.plot('column_name_for_x_axis', 'column_name_for_y_axis',data=DataFrameName, label='labelname')
Using this method, we can also draw multiple plots from the same DataFrame by calling the plot() function
multiple times with different columns for the x- and y-data.
#4 line plot - plotting from a DataFrame
import matplotlib.pyplot as plt
import pandas as pd
df1=pd.read_csv('./chennai_reservoir_levels.csv')
#print('df1=\n', df1)
Output:
The source data in file 'chennai_reservoir_levels.csv' is shown below:
Date        POONDI  CHOLAVARAM  REDHILLS  CHEMBARAMBAKKAM
01-01-2018    1012         513      1585             1842
01-02-2018    1387         451      1368             1693
01-03-2018    2011         398      1194             1507
01-04-2018    1611         100      1660             1215
01-05-2018     396          70      1779             1198
01-06-2018     184          68      1427             1214
01-07-2018     132          61      1120              906
01-08-2018      50          26       920              628
01-09-2018      13           1       713              445
01-10-2018      93           8       478              338
01-11-2018     695          20       809              232
01-12-2018     381          40      1102              185
01-01-2019     298          48       941              102
01-03-2019     477          48       520               22
01-04-2019     333          42       301               10
01-05-2019     193          11       125                2
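The DataFrame form of plot() can be tried without the CSV file. The sketch below builds a small DataFrame in memory from the first four rows of the data above; the choice of the POONDI and REDHILLS columns is only for illustration and is not necessarily what the original listing plotted.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# First four rows of the reservoir data, typed in by hand
df = pd.DataFrame({
    'Date': ['01-01-2018', '01-02-2018', '01-03-2018', '01-04-2018'],
    'POONDI': [1012, 1387, 2011, 1611],
    'REDHILLS': [1585, 1368, 1194, 1660],
})

plt.figure()
# Column names are given as strings; data=df tells plot() where to look them up
plt.plot('Date', 'POONDI', data=df, label='POONDI')
plt.plot('Date', 'REDHILLS', data=df, label='REDHILLS')
plt.xticks(rotation=90)   # the date strings are long, so rotate the tick labels
plt.legend()
```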
Plotting multiple subplots
We can plot multiple subplots in the same Figure object by dividing the Figure object into subplots as shown
below:
plt.subplots(num_of_rows, num_of_columns, sharex=False, sharey=False)
where
num_of_rows - is the number of rows in the figure
num_of_columns - is the number of columns in the Figure
sharex - set sharex to True if the subplots should share the x-axis ticks. If sharex is True, the x tick labels
are shown only on the bottom-most plots. (Default value is False)
sharey - set sharey to True if the subplots should share the y-axis ticks. If sharey is True, the y tick labels
are shown only on the leftmost plots. (Default value is False)
The subplots() function returns two values, the figure object and the axes object. The figure object refers to the
entire drawing area. The axes objects can be used in two ways:
Method 1:
f1, (ax1,ax2) = plt.subplots(1,2,sharey=True)
The figure object is divided into 1 row and 2 columns i.e. two subplots are created. The first subplot is
assigned to object ax1 and the second subplot is assigned to ax2
Method 2:
f1, ax = plt.subplots(2,2,sharex=True, sharey=True)
The figure object is divided into 2 rows and 2 columns i.e. four subplots are created. All the four objects
are passed as a matrix to the ax object. Individual axes object is accessed using the matrix notation, i.e.
the first subplot is ax[0,0], the second subplot is ax[0,1], third subplot is ax[1,0], fourth subplot is ax[1,1].
Layout for Method 1 (1 row, 2 columns):
ax1 ax2
Layout for Method 2 (2 rows, 2 columns):
ax[0,0] ax[0,1]
ax[1,0] ax[1,1]
After getting the individual axes objects we can use the plot() function with the individual axes objects and draw
independent plots in each of the axes object areas.
When using the individual axes objects, note the following differences from the plt functions:
1. For setting ticks on the x- and y-axes, instead of plt.xticks() and plt.yticks() use the methods
ax1.set_xticks() and ax1.set_yticks().
2. For rotating the tick labels, instead of plt.xticks(rotation=90) use:
ax[1,0].tick_params( axis='x', labelrotation =90)
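The two ways of receiving the axes objects, together with the axes-level tick functions, can be sketched as follows. The data (the integers 0 to 10 and simple functions of them) is hypothetical and stands in for the reservoir columns.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 11)          # hypothetical sample data: 0, 1, ..., 10

# Method 1: unpack the two Axes of a 1x2 grid into separate names
f1, (ax1, ax2) = plt.subplots(1, 2, sharey=True)
ax1.plot(x, x)
ax2.plot(x, 2 * x)
ax1.set_xticks([0, 5, 10])    # axes-level equivalent of plt.xticks()

# Method 2: keep the 2x2 grid as a numpy array and index it
f2, ax = plt.subplots(2, 2, sharex=True, sharey=True)
ax[0, 0].plot(x, x)
ax[0, 1].plot(x, x ** 2)
ax[1, 0].plot(x, 10 * x)
ax[1, 1].plot(x, 100 - x)
# Rotate the x tick labels of the bottom-left subplot
ax[1, 0].tick_params(axis='x', labelrotation=90)
```

With Method 2, ax is a numpy array of shape (2, 2), so loops such as "for a in ax.flat" can be used to apply the same setting to every subplot.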
#4 line plot - plotting multiple plots in different subplots
import matplotlib.pyplot as plt
import pandas as pd
df1=pd.read_csv('./chennai_reservoir_levels.csv')
#print('df1=\n', df1)
Output:
#5 line plot - 4x4 subplots, sharing axes, rotating labels
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df1=pd.read_csv('./chennai_reservoir_levels.csv')
#print('df1=\n', df1)
Output:
Bar plot:
A bar chart or bar graph is a chart or graph that presents categorical data with rectangular bars with heights or
lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally.
A bar graph shows comparisons among discrete categories. One axis of the chart shows the specific categories
being compared, and the other axis represents a measured value.
Plotting a Bar Graph
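The syntax of the bar() function is:
plt.bar(x, height, width=0.8, bottom=None, align='center', data=None)
where: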
x: sequence of scalars which form the x coordinates of the bars.
height: sequence of scalars which form the heights of the bars.
width: scalar or array-like, optional. The width(s) of the bars (default: 0.8).
bottom: scalar or array-like, optional. The y coordinate(s) of the bases of the bars (default: 0).
align: {'center', 'edge'}, optional, default: 'center'. The alignment of the bars to the x coordinates:
'center': Center the base on the x positions.
'edge': Align the left edges of the bars with the x positions.
data: If the source of the data is a table-like object such as a DataFrame, then that object is passed here and
x and height can be given as column names.
If the x-coordinates are not numbers (e.g. strings), the categories are placed one unit apart, so the width is the
fraction of the distance between adjacent xticks that the bar occupies. For example, consider the xticks line below:
a b c
By default width=0.8. This means that 80 percent of the spacing between the xticks 'a' and 'b' is occupied by
the bar and the remaining 20 percent is the gap between one bar and the next.
The other parameters of the bar graph such as xlabel, ylabel, title, xticks, yticks, legend are same as the line
plots and can be set using the plt object.
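A minimal sketch of a bar plot with string categories is given below. The categories 'a', 'b', 'c' follow the xticks example above; the heights and the axis labels are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt

categories = ['a', 'b', 'c']
values = [5, 3, 7]            # hypothetical heights

plt.figure()
# width=0.8 and align='center' are the defaults, written out here for clarity
bars = plt.bar(categories, values, width=0.8, align='center')
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Simple bar plot')

# Each bar is a rectangle object whose geometry can be inspected
for bar in bars:
    print(bar.get_x(), bar.get_width(), bar.get_height())
```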
# 7 Simple bar plot
import matplotlib.pyplot as plt
Output:
# 8 bar plot setting parameters
import matplotlib.pyplot as plt
plt.show()
Output:
Displaying Bar plot from a DataFrame
Consider the excel file 'product_sales.xlsx' containing the following data and imported into DataFrame df:
For displaying data from a DataFrame df, we use the appropriate column names for the x- and y-coordinates and
pass the parameter data=df when using the bar() function.
df=pd.read_excel('product_sales.xlsx')
plt.show()
Output:
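Since the contents of 'product_sales.xlsx' are not reproduced here, the following sketch uses a small in-memory DataFrame instead. The column names 'Sales Area' and 'Chocolate' are the ones used in the grouped bar chart listing later in these notes; the sales figures are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the contents of 'product_sales.xlsx'
df = pd.DataFrame({
    'Sales Area': ['Area A', 'Area B', 'Area C'],
    'Chocolate': [120, 95, 140],
})

plt.figure()
# x and height are given as column names; data=df supplies the columns
bars = plt.bar('Sales Area', 'Chocolate', data=df)
plt.xlabel('Sales Area')
plt.ylabel('Chocolate')
```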
Displaying grouped bar chart
For displaying grouped data, we change the x-position on the x-axis where the bar for each of the individual
plots should appear. Consider that we want to draw three bar plots in the same figure for 'Area A', 'Area B' and
'Area C' to be shown on the x-axis. On the y-axis we want the data for 'Chocolate', 'Cake' and 'Biscuit' i.e. three
groups to be shown. The steps are as follows:
1. First change the xticks to start from 1,2,3 and so on. The names that are displayed on the tick marks can
be 'Area A', 'Area B' and 'Area C'.
x-axis
1 2 3
2. The distance between any two xticks is 1. This is the maximum width that is available for displaying all
the three bar plots. Select any one width such that the sum of all the widths of the three bar plots is less
than 1. For example if we select the width as 0.2, then since we display three bar plots, the combined
width becomes (0.2 x 3) = 0.6, which is less than 1. The remaining (1 - 0.6) =0.4 is the empty space
between one grouped bar plot and the next grouped bar plot.
3. The next step is to rearrange the x-positions of the three individual bar plots so that they are adjacent and do
not overlap.
To do so, the following method is adopted, where wd is the width of each bar:
wd wd wd
x-axis
1 2 3
a) The centre bar plot (blue) is centered exactly at the xtick position (x).
b) The first bar plot (green) is centered at (x-wd).
c) The third bar plot (red) is centered at (x+wd).
i.e. for displaying three grouped bar plots, the first bar plot is plotted at position (x-wd), the second bar plot
at position (x) and the third bar plot at position (x+wd).
A similar method is adopted for displaying any number of grouped bar plots. For example, for displaying two
grouped bar plots, the first bar plot is plotted at position (x-wd/2) and the second bar plot at position
(x+wd/2).
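The positioning rule above can be written as one formula: for n bar plots of width wd, the i-th plot (counting from 0) is shifted by (i - (n-1)/2)*wd from the xtick. The sketch below checks this against the two cases described above. The helper group_offsets() is a hypothetical name introduced here, and the sales figures are made up.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt
import numpy as np

def group_offsets(n_groups, wd):
    """Offsets from the xtick so that n_groups bars of width wd sit side by side, centred on the tick."""
    return [(i - (n_groups - 1) / 2) * wd for i in range(n_groups)]

print(group_offsets(3, 0.2))   # three bar plots: x-wd, x, x+wd
print(group_offsets(2, 0.2))   # two bar plots: x-wd/2, x+wd/2

# Hypothetical data for three products in three areas
areas = ['Area A', 'Area B', 'Area C']
sales = {'Chocolate': [120, 95, 140], 'Cake': [80, 110, 70], 'Biscuit': [60, 75, 90]}

x = np.arange(1, len(areas) + 1)   # xticks at 1, 2, 3
wd = 0.2
plt.figure()
for offset, (product, values) in zip(group_offsets(len(sales), wd), sales.items()):
    plt.bar(x + offset, values, width=wd, label=product)
plt.xticks(x, areas)
plt.legend()
```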
# 10 Displaying grouped bar charts
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df=pd.read_excel('product_sales.xlsx')
x=np.arange(1, len(df)+1) #x values are 1,2,3,4…
wd=0.2 #width of bar plot=0.2
xlbl=df['Sales Area']
plt.xticks(x,xlbl)
plt.bar(x-wd,'Chocolate',data=df, label='Chocolate', width=wd)
plt.bar(x,'Cake',data=df, label='Cake', width=wd)
plt.bar(x+wd,'Biscuit',data=df, label='Biscuit', width=wd)
plt.show()
Output:
Displaying Stacked bar plots
For displaying stacked bar plot we use the parameter bottom=[y values] while plotting the second bar plot. The
[y values] are the list of y-coordinate values of the first bar plot on which we want to stack the second bar plot.
Consider the following population data of Males and Females contained in the excel file 'population.xlsx'
for which we want to show the stacked bar plot.
df=pd.read_excel('population.xlsx')
plt.bar('city','Male',data=df, label='Male')
plt.bar('city','Female',data=df, label='Female', bottom='Male')
plt.show()
Output:
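Because 'population.xlsx' is not reproduced here, the stacking can be checked with a small in-memory DataFrame. The column names 'city', 'Male' and 'Female' follow the listing above; the city names and population figures are hypothetical. The columns are passed explicitly (df['Male']) rather than through data=df, which is an equivalent way of supplying the same values.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the contents of 'population.xlsx'
df = pd.DataFrame({
    'city': ['City 1', 'City 2', 'City 3'],
    'Male': [50, 40, 65],
    'Female': [45, 42, 60],
})

plt.figure()
male_bars = plt.bar(df['city'], df['Male'], label='Male')
# bottom= makes each Female bar start where the Male bar of the same city ends
female_bars = plt.bar(df['city'], df['Female'], bottom=df['Male'], label='Female')
plt.legend()

# Top of each stacked bar = Male + Female
stack_tops = [bar.get_y() + bar.get_height() for bar in female_bars]
print(stack_tops)
```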
Horizontal Bar plot
To plot a horizontal bar plot use the barh() function with the following syntax:
plt.barh(y, width, height=0.8, align='center', data=None)
where:
y: sequence of scalars which form the y coordinates of the bars.
width: scalar or array-like. The width(s) of the bars, i.e. their lengths along the x-axis.
height: scalar or array-like, optional. The height(s) (thickness) of the bars along the y-axis (default: 0.8).
align: {'center', 'edge'}, optional, default: 'center'. The alignment of the bars to the y coordinates:
'center': Center the bars on the y positions.
'edge': Align the bottom edges of the bars with the y positions.
data: If the source of the data is a table-like object such as a DataFrame, then that object is passed here and
y and width can be given as column names.
df=pd.read_excel('population.xlsx')
plt.barh('city','Male',data=df, label='Male')
plt.show()
Output:
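In a horizontal bar plot the roles of width and height are swapped relative to bar(), which the following sketch makes visible. The city names and values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt

cities = ['City 1', 'City 2', 'City 3']   # hypothetical categories on the y-axis
male = [50, 40, 65]                       # hypothetical values along the x-axis

plt.figure()
bars = plt.barh(cities, male, height=0.8, align='center', label='Male')
plt.xlabel('Population')
plt.legend()

# For barh(), the data value is the bar's width and 0.8 is its thickness
bar_lengths = [bar.get_width() for bar in bars]
print(bar_lengths)
```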
Histogram
A histogram is a graphical display of data using bars of different heights, in which the numbers are grouped into
ranges (bins). The height of each bar shows how many of the data values fall in that particular range.
A histogram represents the distribution of numerical data. It differs from a bar graph in that a bar graph
relates two variables, whereas a histogram describes only one variable.
How to draw a Histogram
Example 1: For the dataset containing the CGPA of 15 students shown below, draw the histogram using 10 bins:
6.1, 4.12, 8.2, 6.4, 3.6, 9.2, 5.5, 8.4, 6.2, 9.8, 5.3, 3.9, 8.1, 6.1, 2.7
Step 2: Divide the range by the number of groups you want and then round up if necessary.
The range is the difference between the largest and the smallest value: range = 9.8 - 2.7 = 7.1.
For example, to divide the data set into 10 groups (in matplotlib, if the number of bins is not specified, 10 is
used by default), the width of each group is found by
class-width = range / number of groups = 7.1 / 10 = 0.71
Therefore class width = 0.71
For any class/bin, a square bracket [ or ] means the end value is included and a round bracket ( or ) means it is
excluded. For example, [2.7 – 3.41) means the range from 2.7 (including 2.7) up to 3.41 (excluding 3.41).
Only the last bin has square brackets at both ends, which means that both 9.09 and 9.8 are counted in the
last bin.
[Note: Because floating point numbers cannot always be stored exactly in a computer, a value lying exactly on a
bin boundary is sometimes counted in the adjacent bin. For example, if 5.54 occurs in the data, it should ideally
fall in bin [5.54 – 6.25), but due to inaccuracies in floating point calculation/storage it may be counted in
bin [4.83 – 5.54). Apart from this, Python follows the boundary rules explained above.]
Step 4: Fill the tally column
For each element in the dataset find the correct bin and place a tally mark ( | ) against that bin. The filled in
table is shown below:
Bin          Classes          Tally  Frequency
First Bin    [2.7 – 3.41)     |      1
Second Bin   [3.41 – 4.12)    ||     2
Third Bin    [4.12 – 4.83)    |      1
Fourth Bin   [4.83 – 5.54)    ||     2
Fifth Bin    [5.54 – 6.25)    |||    3
Sixth Bin    [6.25 – 6.96)    |      1
Seventh Bin  [6.96 – 7.67)           0
Eighth Bin   [7.67 – 8.38)    ||     2
Ninth Bin    [8.38 – 9.09)    |      1
Tenth Bin    [9.09 – 9.8]     ||     2
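The class width and the tally can be cross-checked in code. The sketch below uses np.histogram(), which is not mentioned in the original notes but computes bin edges and frequencies without drawing anything. The frequencies should agree with the table, except possibly for values that lie exactly on a bin boundary (such as 4.12), for the floating point reason given in the note above.

```python
import numpy as np

data = [6.1, 4.12, 8.2, 6.4, 3.6, 9.2, 5.5, 8.4, 6.2, 9.8, 5.3, 3.9, 8.1, 6.1, 2.7]

# Range and class width, as computed by hand in Step 2
data_range = max(data) - min(data)      # 9.8 - 2.7
class_width = data_range / 10
print(round(data_range, 2), round(class_width, 2))

# 10 equal-width bins between the minimum and the maximum
counts, edges = np.histogram(data, bins=10)

# Print each bin as [low – high) with its frequency; the last bin is closed on both ends
for i, count in enumerate(counts):
    closing = ']' if i == len(counts) - 1 else ')'
    print(f"[{edges[i]:.2f} – {edges[i + 1]:.2f}{closing}  {count}")
```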
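import matplotlib.pyplot as plt
import pandas as pd
#CGPA values of the 15 students from Example 1
dict1={'cgpa':[6.1,4.12,8.2,6.4,3.6,9.2,5.5,8.4,6.2,9.8,5.3,3.9,8.1,6.1,2.7]}
df1=pd.DataFrame(dict1)
data = df1['cgpa'] #extract 1D data on which the histogram is to be drawn
plt.xlabel('cgpa range')
plt.ylabel('Number of Students')
plt.title('cgpa range vs Number of students')
plt.grid()
plt.hist(data,bins=10)
plt.show() #bin edges not shown automatically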
Output:
When hist() draws a histogram, the xticks are calculated automatically and are generally not aligned with the bin edges.
Histogram - displaying bin edges correctly
There are two ways to display the bin edges correctly-
1. Create a list of bin edges. Use this list as the value for bins parameter of hist() function and set the xticks
to this list of bin edges.
2. Use the return value of the hist() function. The hist() function returns a tuple. The first
element of the tuple is the numpy array containing the frequencies of the bins of the histogram.
The second element is the numpy array containing the bin edges of the histogram that was created. The
number of bin edges is one more than the number of frequencies.
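import matplotlib.pyplot as plt
data= [6.1,4.12,8.2,6.4,3.6,9.2,5.5,8.4,6.2,9.8,5.3,3.9,8.1,6.1,2.7]
#method 1 setting bin edges manually and setting the xticks to bin edges
b1=[3,5.5,7.5,10] #three bins: [3-5.5), [5.5-7.5), [7.5-10]; the value 2.7 lies outside and is not counted
plt.hist(data, bins=b1)
plt.xticks(b1)
plt.grid()
plt.show()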
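The second method, using the return value of hist(), can be sketched as follows. The variable names counts, edges and patches are chosen here for illustration; hist() returns a third element, the drawn bar objects, after the frequencies and the bin edges.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend; omit for interactive use
import matplotlib.pyplot as plt

data = [6.1, 4.12, 8.2, 6.4, 3.6, 9.2, 5.5, 8.4, 6.2, 9.8, 5.3, 3.9, 8.1, 6.1, 2.7]

# method 2: let hist() compute the bin edges and reuse them as the xticks
plt.figure()
counts, edges, patches = plt.hist(data, bins=10)
plt.xticks(edges, rotation=90)   # one tick per bin edge; rotated so the labels do not overlap
plt.grid()

print(len(counts), len(edges))   # there is one more edge than there are bins
```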