UNIT-6(Data Analytics and Visualization With Python)
NumPy supports "nan", but Python's lack of cross-platform compatibility makes it challenging for users. As a result, we may run into issues while comparing values within the Python interpreter.
(2) When data is stored in contiguous memory addresses, insertion and deletion become expensive because the remaining elements must be shifted.
• SciPy is a scientific computation library that uses NumPy underneath. SciPy stands for Scientific Python.
• It provides more utility functions for optimization, stats and signal processing. Like NumPy, SciPy is open source so we can use it freely. SciPy was created by NumPy's creator, Travis Oliphant.
Why Use SciPy?
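As a rough illustration of the utility functions SciPy adds on top of NumPy, the sketch below uses one optimization routine and one statistics routine. The objective function and the sample values are made up for this example.

```python
import numpy as np
from scipy import optimize, stats

# Optimization: find the x that minimizes a simple quadratic.
# The objective function is a made-up example.
def objective(x):
    return (x - 2.0) ** 2 + 1.0

result = optimize.minimize_scalar(objective)
print("minimum at x =", round(result.x, 4))

# Statistics: summary statistics of a small hypothetical sample.
sample = np.array([2.1, 2.5, 1.9, 2.4, 2.2, 2.0])
summary = stats.describe(sample)
print("mean =", summary.mean, "variance =", summary.variance)
```

Both routines accept and return plain NumPy values, which is what "uses NumPy underneath" means in practice.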
• Data visualization is the process of presenting data in the form of graphs or charts. It makes large and complex amounts of data much easier to understand. It allows decision-makers to make decisions efficiently and helps them identify new trends and patterns.
• It is also used in high-level data analysis for Machine Learning and Exploratory Data Analysis (EDA). Data visualization can be done with various tools such as Tableau, Power BI and Python.
Matplotlib
• Matplotlib is a low-level Python library used for data visualization. It is easy to use and emulates MATLAB-like graphs and visualization. The library is built on top of NumPy arrays and offers several plot types such as line charts, bar charts and histograms. It provides a lot of flexibility, but at the cost of writing more code.
• Matplotlib is one of the most popular Python packages used for data visualization. It is a cross-platform library for making 2D plots from data in arrays. Matplotlib is written in Python and makes use of NumPy, the numerical mathematics extension of Python.
• It provides an object-oriented API that helps in embedding plots in applications using Python GUI toolkits such as PyQt, wxPython or Tkinter. It can also be used in Python and IPython shells, Jupyter notebooks and web application servers.
• Matplotlib has a procedural interface named Pylab, which is designed to resemble MATLAB, a proprietary programming language developed by MathWorks.
• Matplotlib along with NumPy can be considered the open-source equivalent of MATLAB.
• Matplotlib is an open-source drawing library that supports various drawing types.
• You can generate plots, histograms, bar charts, and other types of charts with just a few lines of code.
• It's often used in web application servers, shells, and Python scripts.
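The two interfaces mentioned above can be compared with a minimal sketch. It assumes the pyplot module as the procedural, MATLAB-style interface (the one generally used today for that style), and the sine-wave data is made up for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data: one period of a sine wave
x = np.linspace(0, 2 * np.pi, 50)
y = np.sin(x)

# Procedural (MATLAB-style) interface: pyplot tracks the current figure
plt.figure(figsize=(6, 3))
plt.plot(x, y)
plt.title("pyplot interface")

# Object-oriented interface: work with Figure and Axes objects explicitly
fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(x, y)
ax.set_title("object-oriented interface")
```

Calling plt.show() afterwards displays both figures. The object-oriented form is usually the easier one to embed in a GUI application, because the Figure and Axes objects can be passed around explicitly.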
Basic Plotting with Matplotlib
Fig. 6.4.2
x : array or sequence of arrays
ax.hist(a, bins = [0, 25, 50, 75, 100])
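A minimal sketch of the hist() call with explicit bin edges is given below. The array a is a hypothetical sample of marks out of 100, chosen only for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical sample of marks out of 100
a = np.array([22, 87, 5, 43, 56, 73, 55, 54, 11, 20, 51, 5, 79, 31, 27])

fig, ax = plt.subplots(figsize=(8, 5))
# Explicit bin edges: [0, 25), [25, 50), [50, 75), [75, 100]
counts, edges, patches = ax.hist(a, bins=[0, 25, 50, 75, 100])
ax.set_xlabel("marks")
ax.set_ylabel("frequency")

print(counts)  # number of values falling in each bin
```

hist() returns the counts per bin, the bin edges and the drawn bar patches, so the same call can be used both for plotting and for reading off the frequencies.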
Customization of Histogram
import matplotlib.pyplot as plt
from matplotlib import colors

# x (the data), n_bins and legend are assumed to be defined earlier

# Creating the figure and axes
fig, axs = plt.subplots(1, 1,
                        figsize =(10, 7),
                        tight_layout = True)

# Remove axes spines
for s in ['top', 'bottom', 'left', 'right']:
    axs.spines[s].set_visible(False)

# Remove x, y ticks
axs.xaxis.set_ticks_position('none')
axs.yaxis.set_ticks_position('none')

# Add padding between axes and labels
axs.xaxis.set_tick_params(pad = 5)
axs.yaxis.set_tick_params(pad = 10)

# Add x, y gridlines
axs.grid(True, color ='grey',
         linestyle ='-.', linewidth = 0.5,
         alpha = 0.6)

# Add Text watermark
fig.text(0.9, 0.15, 'Jeeteshgavande30',
         fontsize = 12,
         color ='red',
         ha ='right',
         va ='bottom',
         alpha = 0.7)

# Creating histogram
N, bins, patches = axs.hist(x, bins = n_bins)

# Setting color
fracs = ((N**(1 / 5)) / N.max())
norm = colors.Normalize(fracs.min(), fracs.max())

for thisfrac, thispatch in zip(fracs, patches):
    color = plt.cm.viridis(norm(thisfrac))
    thispatch.set_facecolor(color)

# Adding extra features
plt.xlabel("X-axis")
plt.ylabel("y-axis")
plt.legend(legend)
plt.title('Customized histogram')

# Show plot
plt.show()
Output
6.5.1 Bar Chart
GQ Explain Bar Chart in detail?
• A bar plot or bar chart is a graph that represents categories of data with rectangular bars whose lengths or heights are proportional to the values they represent. Bar plots can be drawn horizontally or vertically.
• A bar chart describes the comparisons between discrete categories. One axis of the plot represents the specific categories being compared, while the other axis represents the measured values corresponding to those categories.
• The matplotlib API in Python provides the bar() function, which can be used in MATLAB style or through the object-oriented API.
The syntax of the bar() function to be used with the axes is as follows:
• plt.bar(x, height, width, bottom, align)
• The function creates a bar plot bounded by rectangles that depend on the given parameters.
• Following is a simple example of a bar plot, which represents the number of students enrolled in different courses of an institute.
import matplotlib.pyplot as plt

# creating the dataset
data = {'C':20, 'C++':15, 'Java':30,
        'Python':35}
courses = list(data.keys())
values = list(data.values())

fig = plt.figure(figsize = (10, 5))

# creating the bar plot
plt.bar(courses, values, color ='maroon',
        width = 0.4)

plt.xlabel("Courses offered")
plt.ylabel("No. of students enrolled")

# Show Plot
plt.show()
data = pd.read_csv(r"cars.csv")
data.head()
df = pd.DataFrame(data)
Output
In the above bar graph, the X-axis ticks overlap each other and cannot be read properly. Rotating the X-axis ticks makes them clearly visible. That is why customization of bar graphs is required.
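The tick rotation mentioned above can be sketched as follows. The names and prices are hypothetical values standing in for the car and price columns of cars.csv, which is not available here.

```python
import matplotlib.pyplot as plt

# Hypothetical names and prices standing in for the cars.csv columns
names = ["Model Alpha Roadster", "Model Beta Coupe", "Model Gamma GT",
         "Model Delta Spider", "Model Epsilon Turbo", "Model Zeta Sport"]
prices = [5.0, 3.9, 4.6, 3.1, 3.3, 7.2]

fig, ax = plt.subplots(figsize=(10, 7))
ax.bar(names, prices)

# Rotate the x-axis tick labels so that long names do not overlap
ax.tick_params(axis="x", labelrotation=45)
ax.set_ylabel("price")
fig.tight_layout()
```

Switching to a horizontal bar plot with barh() is another way to make long category names readable.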
Python3
import pandas as pd
from matplotlib import pyplot as plt

# Read CSV into pandas
data = pd.read_csv(r"cars.csv")
data.head()
df = pd.DataFrame(data)

name = df['car'].head(12)
price = df['price'].head(12)

# Figure Size
fig, ax = plt.subplots(figsize =(16, 9))

# Horizontal Bar Plot
ax.barh(name, price)

# Remove axes spines
for s in ['top', 'bottom', 'left', 'right']:
    ax.spines[s].set_visible(False)

# Remove x, y Ticks
ax.xaxis.set_ticks_position('none')
ax.yaxis.set_ticks_position('none')

# Add padding between the y-axis and its labels
ax.yaxis.set_tick_params(pad = 10)

# Add x, y gridlines
ax.grid(True, color ='grey',
        linestyle ='-.', linewidth = 0.5,
        alpha = 0.2)

# Show top values
ax.invert_yaxis()

# Add annotation to bars
for i in ax.patches:
    plt.text(i.get_width()+0.2, i.get_y()+0.5,
             str(round((i.get_width()), 2)),
             fontsize = 10, fontweight ='bold',
             color ='grey')

# Add Plot Title
ax.set_title('Sports car and their price in crore',
             loc ='left', )

# Add Text watermark
fig.text(0.9, 0.15, 'Jeeteshgavande30', fontsize = 12,
         color ='grey', ha ='right', va ='bottom',
         alpha = 0.7)

# Show Plot
plt.show()
plt.legend()
plt.show()

import numpy as np
import matplotlib.pyplot as plt

plt.ylabel('Contribution')
plt.title('Contribution by the teams')
plt.xticks(ind, ('T1', 'T2', 'T3', 'T4', 'T5'))
plt.yticks(np.arange(0, 81, 10))
plt.legend((p1[0], p2[0]), ('boys', 'girls'))

plt.show()
Output
6.5.4 Pie Chart
GQ Explain Pie chart in detail?
The fractional area of each wedge is represented by data/sum(data). If sum(data) < 1 (and, in recent Matplotlib versions, normalize=False is passed), the data values give the fractional area directly, so the resulting pie will have an empty wedge of size 1-sum(data).
• labels is a list or sequence of strings which sets the label of each wedge.
• colors is used to provide colors for the wedges.
• autopct is a string used to label each wedge with its numerical value.
• shadow is used to draw a shadow beneath the wedges.
Let's create a simple pie chart using the pie() function:
Example
Python3
# Import libraries
from matplotlib import pyplot as plt
import numpy as np

# Wedge properties
wp = { 'linewidth' : 1, 'edgecolor' : "green" }

# Creating plot
fig, ax = plt.subplots(figsize =(10, 7),
                       subplot_kw = dict(polar = True))
Output
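A minimal sketch of a pie() call that uses the parameters described in this section (labels, colors, autopct, shadow) is shown below. The brand names and counts are hypothetical and are not taken from the original example.

```python
import matplotlib.pyplot as plt

# Hypothetical data: number of cars of each brand
cars = ["AUDI", "BMW", "FORD", "TESLA"]
data = [23, 17, 35, 25]

fig, ax = plt.subplots(figsize=(7, 7))
wedges, texts, autotexts = ax.pie(
    data,
    labels=cars,                      # label of each wedge
    colors=["tab:blue", "tab:orange", "tab:green", "tab:red"],
    autopct="%1.1f%%",                # print each share as a percentage
    shadow=True,                      # draw a shadow under the wedges
)
ax.set_title("Share of cars by brand")
```

When autopct is given, pie() returns a third list holding the percentage texts, which can be restyled individually if needed.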
➢ Example 6.5.5 : Let's try to modify the above plot with some customizations.
Python3
# Import libraries
import matplotlib.pyplot as plt
import numpy as np

# Creating dataset
np.random.seed(10)
data_1 = np.random.normal(100, 10, 200)
data_2 = np.random.normal(90, 20, 200)
data_3 = np.random.normal(80, 30, 200)
data_4 = np.random.normal(70, 40, 200)
data = [data_1, data_2, data_3, data_4]

fig = plt.figure(figsize =(10, 7))
ax = fig.add_subplot(111)

# Creating a horizontal, notched box plot
bp = ax.boxplot(data, patch_artist = True,
                notch = True, vert = False)

colors = ['#0000FF', '#00FF00',
          '#FFFF00', '#FF00FF']

for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)

# changing color and linewidth of
# whiskers
for whisker in bp['whiskers']:
    whisker.set(color ='#8B008B',
                linewidth = 1.5,
                linestyle =":")

# changing color and linewidth of
# caps
for cap in bp['caps']:
    cap.set(color ='#8B008B',
            linewidth = 2)

# changing color and linewidth of
# medians
for median in bp['medians']:
    median.set(color ='red',
               linewidth = 3)

# changing style of fliers
for flier in bp['fliers']:
    flier.set(marker ='D',
              color ='#e7298a',
              alpha = 0.5)

# y-axis labels (the boxes are drawn horizontally)
ax.set_yticklabels(['data_1', 'data_2',
                    'data_3', 'data_4'])

# Adding title
plt.title("Customized box plot")

# Removing top axes and right axes
# ticks
ax.get_xaxis().tick_bottom()
ax.get_yaxis().tick_left()

# show plot
plt.show()
Output
• Python 3.6+
• numpy (>= 1.13.3)
• scipy (>= 1.0.1)
• pandas (>= 0.22.0)
• matplotlib (>= 2.1.2)
• statsmodels (>= 0.8.0)
Line plot
The line plot is one of the most basic plots in the seaborn library. It is mainly used to visualize data in the form of a time series, i.e. in a continuous manner.
# Importing libraries
import numpy as np
import seaborn as sns

# Selecting style as white,
# dark, whitegrid, darkgrid
# or ticks
sns.set(style="white")

# Load seaborn's built-in fmri example dataset
fmri = sns.load_dataset("fmri")

# Plot the responses for different
# events and regions
sns.lineplot(x="timepoint",
             y="signal",
             hue="region",
             style="event",
             data=fmri)
Output
Lmplot
➢ Example 6.7.2 : This function will draw the figure and annotate the axes. To make a relational plot, you first initialize the grid, then pass the plotting function to the map method, which calls it on each subplot.
Python3
# Form a facetgrid using columns with a hue
sea = sns.FacetGrid(exercise, col = "time", hue = "kind")
# adding legend
sea.add_legend()
Output
➢ Example 6.7.3 : There are several options for controlling the look of the grid that can be passed to the class constructor.
Python3
sea = sns.FacetGrid(exercise, row = "diet",
col = "time", margin_titles = True)
➢ Example 6.7.5 : The default ordering of the facets is derived from the information in the DataFrame. If the variable used to
define facets has a categorical type, then the order of the categories is used. Otherwise, the facets will be in the order of
appearance of the category levels. It is possible, however, to specify an ordering of any facet dimension with the
appropriate *_order parameter:
Python3
exercise_kind = exercise.kind.value_counts().index
sea = sns.FacetGrid(exercise, row = "kind",
row_order = exercise_kind,
height = 1.7, aspect = 4)
sea.map(sns.kdeplot, "id")
Output
➢ Example 6.7.6 : If you have many levels of one variable, you can plot it along the columns but “wrap” them so that they
span multiple rows. When doing this, you cannot use a row variable.
Python3
g = sns.PairGrid(exercise)
g.map_diag(sns.histplot)
g.map_offdiag(sns.scatterplot)
Output
➢ Example 6.7.7 : In this example, we will see that we can also plot multiplot grid with the help of pairplot() function. This
shows the relationship for (n, 2) combination of variable in a DataFrame as a matrix of plots and the diagonal plots are the
univariate plots.
Python3
# importing packages
import seaborn
import matplotlib.pyplot as plt
In this section, we will learn how to create subplots using matplotlib and seaborn.
Import all Python libraries needed
Python3
➢ Example 6.7.9 : In this example we create a figure with 1 row and 2 columns (nrows and ncols), still without passing any data. If the values are given in this order, we don't need to type the argument names, just their values.
figsize sets the total dimensions of the figure.
sharex and sharey are used to share one or both axes between the charts.
Python3
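A minimal sketch of such a call is shown below. The figure size and the titles are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# nrows=1 and ncols=2 passed positionally; no data is plotted yet
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# axes is an array holding one Axes object per chart
axes[0].set_title("left chart")
axes[1].set_title("right chart")
```

Each element of axes can later be handed to a plotting function to fill the corresponding chart.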
➢ Example 6.7.11 : Here we initialize a matplotlib figure and axes and pass the required data to them using the exercise dataset, a well-known dataset built into seaborn. With this method you can draw a multi-plot grid with any number of plots and any style of graph, using the rows and columns created by matplotlib. We use sns.boxplot here, and for each plot we set its ax argument to the corresponding element of the axes variable.
Python3
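The idea can be sketched as follows. The small DataFrame is a hypothetical stand-in for the exercise dataset, and the choice of columns for the two box plots is an assumption made for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small hypothetical stand-in for seaborn's "exercise" dataset
exercise = pd.DataFrame({
    "time":  ["1 min", "15 min", "30 min"] * 4,
    "diet":  ["low fat"] * 6 + ["no fat"] * 6,
    "pulse": [85, 86, 88, 90, 92, 93, 98, 110, 125, 100, 115, 130],
})

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# The ax argument tells seaborn which subplot to draw on
sns.boxplot(data=exercise, x="time", y="pulse", ax=axes[0])
sns.boxplot(data=exercise, x="diet", y="pulse", ax=axes[1])
```

The same pattern works for other axes-level seaborn functions that accept an ax argument.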
import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
iris = sns.load_dataset("iris")
plt.subplot(Grid_plot[0, 0])
plt.subplot(Grid_plot[0, 1:])
plt.subplot(Grid_plot[1, :2])
plt.subplot(Grid_plot[1, 2])
Output
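The indexing pattern used above can be sketched with matplotlib's GridSpec. The 2 x 3 layout and the spacing values are assumptions, since the definition of Grid_plot is not shown; the name grid_plot below is likewise only illustrative.

```python
import matplotlib.pyplot as plt
from matplotlib import gridspec

fig = plt.figure(figsize=(9, 6))

# Assumed layout: a 2 x 3 grid with some spacing between cells
grid_plot = gridspec.GridSpec(2, 3, wspace=0.4, hspace=0.4)

ax_a = plt.subplot(grid_plot[0, 0])    # row 0, first column
ax_b = plt.subplot(grid_plot[0, 1:])   # row 0, columns 1 and 2
ax_c = plt.subplot(grid_plot[1, :2])   # row 1, columns 0 and 1
ax_d = plt.subplot(grid_plot[1, 2])    # row 1, last column
```

Slicing the grid lets a single subplot span several cells, which is how plots of different widths can share one figure.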
➢ Example 6.7.13 : Here we’ll create a 3×4 grid of subplot using subplots(), where all axes in the same row share their y-
axis scale, and all axes in the same column share their x-axis scale.
Python3
import matplotlib.pyplot as plt
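A minimal sketch of the 3×4 grid with shared axes is given below. The figure size and the text placed in each subplot are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt

# sharex="col": axes in the same column share the x-axis scale
# sharey="row": axes in the same row share the y-axis scale
fig, ax = plt.subplots(3, 4, sharex="col", sharey="row", figsize=(10, 6))

# Label each subplot with its (row, column) position in the grid
for i in range(3):
    for j in range(4):
        ax[i, j].text(0.5, 0.5, str((i, j)), fontsize=12, ha="center")
```

Because the axes are shared, the inner tick labels are hidden automatically and only the outer edges of the grid are labelled.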
The regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. Regression plots, as the name suggests, draw a regression line between 2 parameters and help to visualize their linear relationship. We consider these kinds of plots in seaborn and show how the size, aspect ratio, etc. of such plots can be changed.
Seaborn is not only a visualization library but also a provider of built-in datasets. Here, we will be working with one such dataset, named 'tips'. The tips dataset contains information about people who had food at a restaurant and the tip they left. It also provides information about the gender of the people, whether they smoke, the day, the time and so on.
Let us have a look at the dataset first before we start with the regression plots.
Load the dataset
Python3
Now let us begin with the regression plots in seaborn. Regression plots in seaborn can be easily implemented with the help of the lmplot() function. lmplot() can be understood as a function that creates a linear model plot. It makes a very simple linear regression plot: a scatter plot with a linear fit on top of it.
Simple linear plot
Python3
sns.set_style('whitegrid')
sns.lmplot(x ='total_bill', y ='tip', data = dataset)
Explanation
x and y parameters are specified to provide values for the x and y axes. sns.set_style() is used to have a grid in the
background instead of a default white background. The data parameter is used to specify the source of information
for drawing the plots.
Linear plot with additional parameters
Python3
sns.set_style('whitegrid')
sns.lmplot(x ='total_bill', y ='tip', data = dataset,
hue ='sex', markers =['o', 'v'])
Output
Explanation
In order to have a better analysis capability using these plots, we can specify hue to have a categorical separation
in our plot as well as use markers that come from the matplotlib marker symbols. Since we have two separate
categories we need to pass in a list of symbols while specifying the marker.
Setting the size and color of the plot
Python3
sns.set_style('whitegrid')
sns.lmplot(x ='total_bill', y ='tip', data = dataset, hue ='sex',
markers =['o', 'v'], scatter_kws ={'s':100},
palette ='plasma')
Output
Explanation
In this example, seaborn passes parameters through to matplotlib indirectly to affect the scatter plots. We specify a parameter called scatter_kws. Note that scatter_kws changes the size of the scatter points only and not the regression lines, which remain untouched. We also use the palette parameter to change the color of the plot. Everything else remains the same as explained in the first example.
Displaying multiple plots
Python3
Explanation
In the above code, we draw multiple plots by specifying a separation with the help of the rows and columns. Each
row contains the plots of tips vs the total bill for the different times specified in the dataset. Each column contains the
plots of tips vs the total bill for the different genders. A further separation is done by specifying the hue parameter on
the basis of whether the person smokes.
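The call described in this explanation can be sketched as follows. The DataFrame is a hypothetical stand-in for the tips dataset, built so that every combination of time, sex and smoker has a few points. ci=None is an extra assumption that skips the confidence band to keep the sketch fast.

```python
import itertools
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips dataset: five bills for every
# combination of time, sex and smoker
bills = [10.0, 15.0, 20.0, 30.0, 40.0]
rows = [
    {"total_bill": bill, "tip": 0.15 * bill + 0.5 * k,
     "time": time, "sex": sex, "smoker": smoker}
    for k, (time, sex, smoker) in enumerate(
        itertools.product(["Lunch", "Dinner"], ["Male", "Female"], ["Yes", "No"]))
    for bill in bills
]
dataset = pd.DataFrame(rows)

sns.set_style("whitegrid")
# One row of plots per time, one column per sex, colour by smoker
g = sns.lmplot(x="total_bill", y="tip", data=dataset,
               row="time", col="sex", hue="smoker", ci=None)
```

The result is a 2 x 2 grid of plots, each containing one regression line per smoker category.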
Size and aspect ratio of the plots
Python3
Explanation
When the output contains a large number of plots, we need to set their size and aspect ratio so that they can be visualized better. aspect (scalar, optional) specifies the aspect ratio of each facet, so that "aspect * height" gives the width of each facet in inches.
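A minimal sketch of the height and aspect parameters is given below. The DataFrame is a hypothetical stand-in for the tips dataset, and the values height=4 and aspect=0.75 are arbitrary choices for illustration.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the tips dataset
dataset = pd.DataFrame({
    "total_bill": [10.0, 15.0, 20.0, 30.0, 40.0, 12.0, 18.0, 25.0, 35.0, 45.0],
    "tip":        [1.5, 2.0, 3.0, 4.5, 6.0, 2.0, 2.5, 4.0, 5.0, 7.0],
    "sex":        ["Male"] * 5 + ["Female"] * 5,
})

# Each facet is 4 inches high and 4 * 0.75 = 3 inches wide
g = sns.lmplot(x="total_bill", y="tip", data=dataset,
               col="sex", height=4, aspect=0.75, ci=None)

# Overall figure size in inches (width, height)
fig_width, fig_height = g.axes[0, 0].get_figure().get_size_inches()
print(fig_width, fig_height)
```

With two facets side by side, the overall figure width grows with the number of columns while the height is set by the number of rows.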
6.7.4 Regplot
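# importing required packages
import seaborn as sns
import matplotlib.pyplot as plt

# loading dataset
data = sns.load_dataset("mpg")
# draw regplot
sns.regplot(x = "mpg",
            y = "acceleration",
            data = data)
# show the plot
plt.show()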
# loading dataset
data = sns.load_dataset("titanic")
# draw regplot
sns.regplot(x = "age",
y = "fare",
data = data,
dropna = True)
# show the plot
plt.show()
Python3
# loading dataset
data = sns.load_dataset("exercise")
# draw regplot
sns.regplot(x = "id",
y = "pulse",
data = data)
➢ Example 6.7.17
Python3
# loading dataset
data = sns.load_dataset("attention")
# draw regplot
sns.regplot(x = "solutions",
y = "score",
data = data)