UNIT - 1 EDA Continuation

The document provides an overview of Exploratory Data Analysis (EDA) and data visualization techniques using Python libraries such as Matplotlib and Seaborn. It covers various visual aids including line charts, bar charts, scatter plots, and histograms, along with practical examples of data generation and manipulation. Additionally, it discusses data transformation techniques and compares different visualization libraries based on syntax, plot types, interactivity, and customization.

Uploaded by

mk4997320

TOPIC - 7

VISUAL AIDS FOR EDA
AD3301 DATA EXPLORATION AND VISUALIZATION (L T P C: 3 0 2 4)

OBJECTIVES:

TO OUTLINE AN OVERVIEW OF EXPLORATORY DATA ANALYSIS.


TO IMPLEMENT DATA VISUALIZATION USING MATPLOTLIB.
TO PERFORM UNIVARIATE DATA EXPLORATION AND ANALYSIS.
TO APPLY BIVARIATE DATA EXPLORATION AND ANALYSIS.
TO USE DATA EXPLORATION AND VISUALIZATION TECHNIQUES FOR MULTIVARIATE
AND TIME SERIES DATA.

UNIT I EXPLORATORY DATA ANALYSIS 9

EDA FUNDAMENTALS – UNDERSTANDING DATA SCIENCE – SIGNIFICANCE OF EDA –
MAKING SENSE OF DATA – COMPARING EDA WITH CLASSICAL AND BAYESIAN
ANALYSIS – SOFTWARE TOOLS FOR EDA – VISUAL AIDS FOR EDA – DATA
TRANSFORMATION TECHNIQUES – MERGING DATABASE, RESHAPING AND PIVOTING,
• Line chart
• Bar chart
• Scatter plot
• Area plot & stacked plot
• Pie chart
• Table chart
• Polar chart
• Histogram
LINE CHART
• Line chart is used to illustrate
the relationship between two
or more continuous variables.
• Used to plot time series lines.
FAKER?
‘faker’ - Python library - We have
created a function using the faker
Python library to generate the
dataset.
Generate a simple dataset with just two columns. The first column is ‘Date’
and the second column is ‘Price’, indicating the stock price on that date.
# Import Necessary Libraries

import datetime
import random
import radar
import pandas as pd

datetime - Provides classes for manipulating dates and times
random - Allows you to generate random numbers
radar - A library for generating random dates and times
pandas - A powerful data analysis and manipulation library
# Function Definition

def generateData(n):

A function named “generateData” which takes an integer “n” as input

# Variable Initialization

listdata = []
start = datetime.datetime(2019, 8, 1)
end = datetime.datetime(2019, 8, 30)

• Initializes an empty list “listdata” to store generated data.
• Defines start and end variables representing the start and
end dates (August 1, 2019, to August 30, 2019).
# Data Generation Loop

for _ in range(n):
    date = radar.random_datetime(start='2019-08-01',
                                 stop='2019-08-30').strftime("%Y-%m-%d")
    price = round(random.uniform(900, 1000), 4)
    listdata.append([date, price])

• Iterates “n” times.
• Generates a random date between August 1, 2019, and
August 30, 2019, using the radar.random_datetime function.
It then formats the date to the "YYYY-MM-DD" format.
• Generates a random floating-point number between 900 and
1000 and rounds it to 4 decimal places.
• Appends the date and price as a list to listdata.
# Creating DataFrame
df = pd.DataFrame(listdata, columns=['Date', 'Price'])
Converts the “listdata” list of lists into a pandas DataFrame
with columns 'Date' and 'Price'

# Date Formatting
df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')

Converts the 'Date' column in the DataFrame to datetime
format for proper date handling

# Data Aggregation

df = df.groupby(by='Date').mean()
Groups the DataFrame by the 'Date' column and calculates
the mean (average) of the 'Price' for each unique date
import datetime
import random
import radar
import pandas as pd

def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    delta = end - start
    for _ in range(n):
        # Random date in August 2019, formatted as YYYY-MM-DD
        date = radar.random_datetime(start='2019-08-01', stop='2019-08-30').strftime("%Y-%m-%d")
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # Average the price for each unique date
    df = df.groupby(by='Date').mean()
    return df
Output
df = generateData(50)
df.head(10)
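The radar package is a third-party dependency. The same idea can be sketched with only the standard library and pandas, as below. The name generate_data_stdlib and the seed parameter are assumptions made for illustration and are not part of the original slides.

```python
import datetime
import random

import pandas as pd


def generate_data_stdlib(n, seed=None):
    """Return n random (Date, Price) rows for August 2019, averaged per date.

    Hypothetical variant of generateData that avoids the radar dependency.
    """
    rng = random.Random(seed)  # seed is optional, only for reproducibility
    start = datetime.date(2019, 8, 1)
    end = datetime.date(2019, 8, 30)
    span_days = (end - start).days  # 29 days between first and last date

    rows = []
    for _ in range(n):
        # Pick a random whole-day offset from the start date (both ends included)
        date = start + datetime.timedelta(days=rng.randint(0, span_days))
        price = round(rng.uniform(900, 1000), 4)
        rows.append([date, price])

    df = pd.DataFrame(rows, columns=['Date', 'Price'])
    df['Date'] = pd.to_datetime(df['Date'])
    # Several rows can share a date, so average the price per date
    return df.groupby(by='Date').mean()


df = generate_data_stdlib(50, seed=1)
```

The returned frame has the same layout as the one from generateData: a 'Date' index and a single 'Price' column, so it can be passed to plt.plot in the same way.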
CREATING THE LINE CHART
# Import the matplotlib library
import matplotlib.pyplot as plt
# Plot the graph
plt.plot(df)
# Display it on the screen
plt.show()
BAR CHART
• Bars can be drawn horizontally or vertically to
represent categorical variables.
• Bar charts are frequently used to distinguish objects
between distinct collections in order to track variations
over time.
• Bar charts are very convenient when the changes are
large.
REAL TIME EXAMPLE
• A pharmacy in Norway keeps track of the
amount of Zoloft sold every month using Bar
chart.
Note: Zoloft is a medicine prescribed to patients
suffering from depression.
CALENDAR?
‘Calendar’ Python library to keep track
of the months of the year (1 to 12)
corresponding to January to December.
# Import Necessary Libraries

import random
import numpy as np
import calendar
import matplotlib.pyplot as plt

random - Needed for random.uniform, which generates the sold quantities
Numpy - Library for numerical computations in Python
Matplotlib - Plotting library in Python for creating visualizations
# Creating Lists

months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

• ‘months’ is a list containing numbers from 1 to 12,
representing the months of the year.
• ‘sold_quantity’ is built with a list comprehension that generates
12 random floating-point numbers between 100 and
200, each rounded to the nearest integer.
import random
import numpy as np
import calendar
import matplotlib.pyplot as plt

months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]

figure, axis = plt.subplots()
plt.xticks(months, calendar.month_name[1:13], rotation=20)
plot = axis.bar(months, sold_quantity)

# Write the sold quantity just above each bar
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height,
              '%d' % int(height), ha='center', va='bottom')

plt.show()
SCATTER PLOT
• Scatter plots are also called scatter
graphs, scatter charts, scattergrams,
and scatter diagrams.
• They use a Cartesian coordinates
system to display values of typically
two variables for a set of data.
WHEN SHOULD WE USE A
SCATTER PLOT?
• Scatter plots can be constructed in the following two situations:
■ When one continuous variable is dependent on another
variable, which is under the control of the observer
■ When both continuous variables are independent
• Scatter plots are used when we need to show the relationship
between two variables, and hence are sometimes referred to as
correlation plots
REAL TIME EXAMPLE
1. The number of hours of sleep required by a
person depends on the age of the person.
2. The average income for adults is based on the
number of years of education.
• Display a scatter plot for sleep vs. age
dataset and Iris dataset
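A minimal sketch of a scatter plot for the sleep vs. age example follows. The age and sleep values are illustrative placeholders chosen for this sketch, not the dataset used in the slides.

```python
import matplotlib.pyplot as plt

# Illustrative placeholder values (assumed, not taken from a real dataset)
age_years = [1, 5, 10, 20, 30, 40, 50, 60, 70]
sleep_hours = [14, 12, 10, 9, 8, 7.5, 7, 7, 7]

fig, ax = plt.subplots()
# One marker per (age, sleep) pair on Cartesian axes
points = ax.scatter(age_years, sleep_hours)
ax.set_xlabel('Age (years)')
ax.set_ylabel('Sleep required (hours)')
ax.set_title('Sleep vs. age')
plt.show()
```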
BUBBLE CHART
• A bubble plot is a manifestation
of the scatter plot where each
data point on the graph is shown
as a bubble.
• Each bubble can be illustrated
with a different color, size, and
appearance.
Display a Bubble plot for Iris
dataset
SCATTER PLOT USING SEABORN

• A scatter plot can also be
generated using the seaborn
library.
• Seaborn makes the graph visually
better.
AREA & STACKED PLOT
• The stacked plot owes its name to the
fact that it represents the area under a
line plot and that several such plots
can be stacked on top of one another,
giving the feeling of a stack.
• The stacked plot can be useful when
we want to visualize the cumulative
effect of multiple variables
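The cumulative effect can be sketched with matplotlib's stackplot. The monthly spending figures below are placeholders assumed for this sketch.

```python
import matplotlib.pyplot as plt

# Illustrative placeholder values (assumed for this sketch)
months = [1, 2, 3, 4, 5, 6]
rent = [500, 500, 500, 520, 520, 520]
food = [200, 220, 210, 230, 240, 235]
travel = [50, 80, 40, 120, 60, 90]

fig, ax = plt.subplots()
# Each series is drawn on top of the previous one,
# so the top edge of the stack is the running total
layers = ax.stackplot(months, rent, food, travel,
                      labels=['Rent', 'Food', 'Travel'])
ax.set_xlabel('Month')
ax.set_ylabel('Amount spent')
ax.legend(loc='upper left')
plt.show()

# Cumulative total per month, i.e. the height of the top of the stack
totals = [r + f + t for r, f, t in zip(rent, food, travel)]
```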
PIE CHART
• This is one of the more interesting types of
data visualization graphs.
• The main reason is that people love circles.
• The purpose of the pie chart is to
communicate proportions.
• Use “Pokemon dataset” to visualize pie
chart.
TABLE CHART
• A table chart combines a bar chart
and a table.
• Consider standard LED bulbs that
come in different wattages.
• The standard Philips LED bulb can
be 4.5 Watts, 6 Watts, 7 Watts, 8.5
Watts, 9.5 Watts, 13.5 Watts, and
15 Watts.
• Year, Wattage and Units are the
attributes.
POLAR CHART
• A polar chart is a diagram that is plotted on
a polar axis.
• Its coordinates are angle and radius.
• It is also referred to as a spider web plot.
• Assume you have five courses in your
academic year & you planned to obtain the
following grades in each subject,
plannedGrade = [90, 95, 92, 68, 68, 90]
but after your final examination, these are the
grades you got:
actualGrade = [75, 89, 89, 80, 80, 75]
Create a polar plot for the above
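One possible sketch of this polar plot with matplotlib is given below. Both lists hold six values for five courses. The sketch assumes the sixth value repeats the first so that the outline closes. Course names are not given in the slides, so the axes are left unlabelled.

```python
import numpy as np
import matplotlib.pyplot as plt

# Grades from the text; the last value equals the first (assumed to close the outline)
plannedGrade = [90, 95, 92, 68, 68, 90]
actualGrade = [75, 89, 89, 80, 80, 75]

# One angle per value, spread evenly from 0 to 2*pi (both ends included)
theta = np.linspace(0, 2 * np.pi, len(plannedGrade))

fig = plt.figure()
ax = fig.add_subplot(projection='polar')
ax.plot(theta, plannedGrade, label='Planned grades')
ax.plot(theta, actualGrade, label='Actual grades')
ax.legend()
plt.show()
```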
HISTOGRAM
• A histogram is a type of frequency graph used to depict the
distribution of a continuous variable.
• These types of plots are very popular in statistical analysis.
• Consider the following use case: a survey conducted in vocational
training sessions of developers had 100 participants, whose Python
programming experience ranged from 0 to 20 years. Use a histogram to
plot the distribution of Python programming experience in the
vocational training.
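A minimal sketch of this use case is shown below. The survey answers are not given in the slides, so the sketch simulates them. The uniform distribution, the seed, and the choice of 10 bins are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated survey answers (assumption: whole years drawn uniformly from 0..20)
rng = np.random.default_rng(seed=0)
years_experience = rng.integers(low=0, high=21, size=100)  # high is exclusive

fig, ax = plt.subplots()
# hist returns the count per bin and the bin edges it used
counts, bin_edges, _ = ax.hist(years_experience, bins=10, edgecolor='black')
ax.set_xlabel('Years of Python experience')
ax.set_ylabel('Number of participants')
plt.show()
```

The bar heights add up to the 100 participants, since every answer falls into exactly one bin.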
DIFFERENCE BETWEEN BAR CHART & HISTOGRAM?
LOLLIPOP CHART
• A lollipop chart can be used to
display ranking in the data.
• It is similar to an ordered bar
chart.
• Create a lollipop chart for
Highway Mileage using the car
dataset.
CHOOSING THE BEST CHART?
OTHER LIBRARIES TO EXPLORE

• So far, we have seen
different types of 2D and 3D
visualization techniques
using matplotlib and
seaborn.
MATPLOTLIB
• Matplotlib is the most widely used data visualization
library in Python.
• It provides a low-level API for creating a wide range of
plots, from simple line graphs to complex 3D plots.
• Matplotlib is highly customizable and provides
complete control over every aspect of the plot.
SEABORN
• Seaborn is a high-level data visualization library built
on top of Matplotlib.
• It provides a wide range of statistical visualizations
and is particularly useful for exploring relationships
between variables.
• Seaborn has a clean and modern look and can
generate complex plots with minimal code.
PLOTLY
• Plotly is a web-based data visualization library that
provides highly interactive and customizable plots.
• It provides a wide range of visualizations, from basic
line and scatter plots to complex 3D plots and maps.
• Plotly is particularly useful for creating interactive
dashboards and reports.
COMPARE SEABORN VS MATPLOTLIB VS PLOTLY
• Seaborn, Matplotlib, and Plotly are compared based
on the following four factors:

1. Syntax and API
2. Types of Plots
3. Interactivity
4. Customization
SYNTAX & API
• Seaborn provides a high-level API that is easy to use
and requires minimal code to generate complex plots.
• Matplotlib, on the other hand, provides a low-level API
that provides complete control over every aspect of the
plot but can be challenging to use.
• Plotly provides an intermediate-level API that is easy to
use and provides a wide range of customization
options.
TYPES OF PLOTS
• Seaborn provides a wide range of statistical visualizations
that are particularly useful for exploring relationships
between variables.
• Matplotlib provides a broad range of plot types, from simple
line and scatter plots to complex 3D plots.
• Plotly provides a wide range of interactive visualizations
that are useful for creating interactive dashboards and
reports.
INTERACTIVITY

• Seaborn and Matplotlib provide limited interactivity, while
Plotly provides highly interactive and responsive plots that
can be zoomed, panned, and rotated.


CUSTOMIZATION
• Seaborn provides a limited range of customization options
but can generate visually appealing plots with minimal code.
• Matplotlib provides complete control over every aspect of the
plot, making it highly customizable.
• Plotly provides a wide range of customization options and
provides support for themes and color scales.
TRY IT!
• Consider we have a data set of dimension 300
(n) × 50 (p). n represents the number of
observations, and p represents the number of
predictors/ attributes. How many scatter plots
are possible to analyze the variable
relationship?
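One way to reason about this exercise: each scatter plot pairs two distinct predictors, and the number of observations n does not change the count. Assuming a plot of x against y counts the same as y against x, the answer is the number of unordered pairs, p(p - 1)/2. A quick check:

```python
import math

n_observations = 300  # does not affect the number of plots
p_predictors = 50

# Unordered pairs of distinct predictors: p * (p - 1) / 2
n_scatter_plots = math.comb(p_predictors, 2)
print(n_scatter_plots)  # 1225
```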
TOPIC - 8
DATA
TRANSFORMATION
DATA TRANSFORMATION
• Data transformation is a set of techniques used to convert data from
one format or structure to another format or structure.
• Data transformation is the process where you extract data, sift through
data, understand the data, and then transform it into something you
can analyze.
• Raw or source data is often:
• Inconsistent: It contains both relevant and irrelevant data.
• Imprecise: It contains incorrectly entered information or missing
values.
• Repetitive: It contains duplicate data.
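As a small illustration of the repetitive and imprecise cases, the sketch below counts duplicate rows and missing values in a tiny frame and then removes them. The records are hypothetical and made up for this sketch.

```python
import numpy as np
import pandas as pd

# Hypothetical raw records: one exact duplicate row and one missing score
raw = pd.DataFrame({
    'StudentID': [1, 2, 2, 3],
    'Score': [55.0, 70.0, 70.0, np.nan],
})

n_duplicates = int(raw.duplicated().sum())   # rows identical to an earlier row
n_missing = int(raw['Score'].isna().sum())   # missing values in the Score column

# Drop repeated rows, then rows with a missing score
clean = raw.drop_duplicates().dropna(subset=['Score'])
print(n_duplicates, n_missing, len(clean))  # 1 1 2
```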
Types of Data Transformation
1. Data Deduplication
2. Key Restructuring
3. Data Cleansing
4. Data Validation
5. Format Revisioning
6. Data Derivation
7. Data Aggregation
8. Data Integration
9. Data Filtering
10. Data Joining
11. Binning
12. Data Splitting
13. Data Summarization
14. Normalization &
TOPIC - 9
MERGING DATABASE,
RESHAPING AND PIVOTING,
TRANSFORMATION
TECHNIQUES
• Consider two courses, a Software Engineering course and an
Introduction to Machine Learning course, each with
enough students to split into two classes.
• The examination for each class and for each course was done
in two separate buildings and graded by four different
professors.
• Create a dataset using the above information.
DATASET
Methods used for merging
• pd.concat()
• df.merge()
• df.append() (removed in pandas 2.0; pd.concat() covers the same use)
• df.join()
Pandas concat() method:

dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)
dataframe
IGNORE_INDEX
• The ignore_index argument creates a new index; in its
absence, we'd keep the original indices.


AXIS
• To stack the dataframes on top of each other (row-wise),
axis=0 is used. This is the default.
• To combine the dataframes side by side (column-wise),
axis=1 is used.

pd.concat([dataFrame1, dataFrame2], axis=1)
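The effect of axis and ignore_index is easiest to see from the resulting shapes and index labels. The two small frames below are hypothetical and exist only for this comparison.

```python
import pandas as pd

# Hypothetical two-row frames, used only to compare the results
dataFrame1 = pd.DataFrame({'StudentID': [1, 2], 'Score': [50, 60]})
dataFrame2 = pd.DataFrame({'StudentID': [3, 4], 'Score': [70, 80]})

stacked_rows = pd.concat([dataFrame1, dataFrame2], axis=0)
renumbered = pd.concat([dataFrame1, dataFrame2], axis=0, ignore_index=True)
side_by_side = pd.concat([dataFrame1, dataFrame2], axis=1)

print(stacked_rows.shape)        # (4, 2)
print(list(stacked_rows.index))  # [0, 1, 0, 1]  original indices kept
print(list(renumbered.index))    # [0, 1, 2, 3]  new index created
print(side_by_side.shape)        # (2, 4)
```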


TRY IT!
• Assume your head of department walked up to your desk
and started bombarding you with a series of questions:
• How many students appeared for the exams in total?
• How many students only appeared for the Software
Engineering course?
• How many students only appeared for the Machine
Learning course?
Data frames for both subjects
import pandas as pd

df1SE = pd.DataFrame({'StudentID': [9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                      'ScoreSE': [22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
df2SE = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30],
                      'ScoreSE': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})
df1ML = pd.DataFrame({'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29],
                      'ScoreML': [39, 49, 55, 77, 52, 86, 41, 77, 73, 51, 86, 82, 92, 23, 49]})
df2ML = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20],
                      'ScoreML': [93, 44, 78, 97, 87, 89, 39, 43, 88, 78]})
1. Concatenate along an axis
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = pd.concat([dfML, dfSE], axis=1)
df
Pandas df.merge() method
The df.merge() method can be used with different types of joins.
Types of Joins

• Inner Join
• Outer Join
• Left Join
• Right Join
• The inner join takes the intersection from two or more
dataframes, which corresponds to the INNER JOIN in SQL.
• The outer join takes the union from two or more
dataframes, which corresponds to the FULL OUTER JOIN in
SQL.
• The left join uses the keys from the left-hand dataframe
only, which corresponds to the LEFT OUTER JOIN in SQL.
• The right join uses the keys from the right-hand dataframe
only, which corresponds to the RIGHT OUTER JOIN in SQL
2. Use df.merge with an inner join
df.merge() with an inner join is used to get a list of students who
appeared in both the courses.
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='inner')
21 students took both the courses.
3. Use df.merge with a left join
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='left')
df
How many students only appeared for the Software Engineering course?
The left join returns 26 rows, one for each student who appeared for the
Software Engineering exam.

Note that five of these students did not appear for the Machine Learning
exam and hence their ScoreML values are marked as NaN. These five are the
students who appeared only for the Software Engineering course.
4. Use df.merge with a right join
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='right')
df
The right join is used to get a list of all the
students who appeared in the Machine
Learning course.
5. Use df.merge with an outer join
dfSE = pd.concat([df1SE, df2SE], ignore_index=True)
dfML = pd.concat([df1ML, df2ML], ignore_index=True)
df = dfSE.merge(dfML, how='outer')
df
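The head of department's three questions can be answered in one pass with an outer join and the indicator option of merge. The sketch below rebuilds only the StudentID columns, with the same IDs as the data frames above, because the scores are not needed for counting. The variable names se_ids, ml_ids, and merged are introduced here for illustration.

```python
import pandas as pd

# Same StudentID values as the dataframes above; score columns are left out
# because only the IDs matter for counting
se_ids = list(range(9, 30, 2)) + list(range(2, 31, 2))   # Software Engineering
ml_ids = list(range(1, 30, 2)) + list(range(2, 21, 2))   # Machine Learning
dfSE = pd.DataFrame({'StudentID': se_ids})
dfML = pd.DataFrame({'StudentID': ml_ids})

# indicator=True adds a _merge column: 'left_only', 'right_only' or 'both'
merged = dfSE.merge(dfML, on='StudentID', how='outer', indicator=True)
counts = merged['_merge'].value_counts()

total_students = len(merged)          # 30 students appeared in total
only_se = int(counts['left_only'])    # 5 appeared only for Software Engineering
only_ml = int(counts['right_only'])   # 4 appeared only for Machine Learning
both = int(counts['both'])            # 21 appeared for both courses
```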
Merging on Index?
Reshaping and Pivoting
• Pivoting: Rearranging data in a dataframe can be
done with hierarchical indexing using two
actions:
1. Stacking: Stack rotates the columns
of the data into the rows.
2. Unstacking: Unstack rotates the
rows into the columns.
TRY IT!

Create a dataframe that records the rainfall, humidity,
and wind conditions of five different counties in Norway

data = np.arange(15).reshape((3, 5))
indexers = ['Rainfall', 'Humidity', 'Wind']
dframe1 = pd.DataFrame(data, index=indexers,
                       columns=['Bergen', 'Oslo', 'Trondheim',
                                'Stavanger', 'Kristiansand'])
dframe1
Method for Stacking?
Reshaping and Pivoting
• Using the stack() method on the preceding
dframe1, we can pivot the columns into rows to
produce a series:
stacked = dframe1.stack()
stacked
Method for Unstacking?
stacked.unstack()
• Unstacking will create missing
data if all the values are not
present in each of the sub-groups.
series1 = pd.Series([000, 111, 222, 333],
                    index=['zeros', 'ones', 'twos', 'threes'])
series2 = pd.Series([444, 555, 666],
                    index=['fours', 'fives', 'sixes'])
frame2 = pd.concat([series1, series2],
                   keys=['Number1', 'Number2'])
frame2.unstack()
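A short round trip shows that stack() and unstack() are inverse operations when no values are missing. The sketch rebuilds dframe1 from the earlier slide so that it runs on its own. The name restored is introduced here for illustration.

```python
import numpy as np
import pandas as pd

data = np.arange(15).reshape((3, 5))
dframe1 = pd.DataFrame(
    data,
    index=['Rainfall', 'Humidity', 'Wind'],
    columns=['Bergen', 'Oslo', 'Trondheim', 'Stavanger', 'Kristiansand'],
)

# stack(): the column labels become an inner index level, giving a Series
stacked = dframe1.stack()
print(stacked.loc[('Humidity', 'Oslo')])  # 6

# unstack(): the inner index level goes back to the columns
restored = stacked.unstack()
# Realign labels before comparing, in case the label order differs
restored = restored.loc[dframe1.index, dframe1.columns]
print(restored.equals(dframe1))  # True
```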
Reshaping
• Reshaping means changing the shape of an
array.
• The shape of an array is the number of
elements in each dimension.
• By reshaping we can add or remove dimensions
or change number of elements in each
dimension.
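A minimal NumPy sketch of these points follows. Note that the total number of elements has to stay the same across a reshape, so only their arrangement over the dimensions changes. The array names are chosen for illustration.

```python
import numpy as np

arr = np.arange(12)            # 1-D array with 12 elements, shape (12,)
matrix = arr.reshape(3, 4)     # 2-D: 3 rows, 4 columns
cube = arr.reshape(2, 3, 2)    # 3-D: adds a dimension, still 12 elements
flat = matrix.reshape(-1)      # -1 lets NumPy infer the length: back to 1-D

print(matrix.shape)  # (3, 4)
print(cube.shape)    # (2, 3, 2)
print(flat.shape)    # (12,)
```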
Transformation Techniques?
Data Deduplication?
Replacing Values?
Handling missing data?
