UNIT - 1 EDA Continuation
VISUAL AIDS FOR EDA
AD3301 DATA EXPLORATION AND VISUALIZATION    L T P C: 3 0 2 4
OBJECTIVES:
import datetime
import random
import radar
import pandas as pd

def generateData(n):
    # Variable initialization
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    for _ in range(n):
        # Pick a random date in August 2019 and a random price
        date = radar.random_datetime(start='2019-08-01',
                                     stop='2019-08-30').strftime("%Y-%m-%d")
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    # DataFrame creation
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    # Date formatting
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    # Data aggregation
    df = df.groupby(by='Date').mean()
    return df

The aggregation step groups the DataFrame by the 'Date' column and calculates
the mean (average) of the 'Price' for each unique date.
import datetime
import random
import radar
import pandas as pd

def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    delta = end - start
    for _ in range(n):
        date = radar.random_datetime(start='2019-08-01', stop='2019-08-30').strftime("%Y-%m-%d")
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    df = df.groupby(by='Date').mean()
    return df
Output
df = generateData(50)
df.head(10)
CREATING THE LINE CHART
# Import the matplotlib library
import matplotlib.pyplot as plt
# Plot the graph
plt.plot(df)
# Display it on the screen
plt.show()
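The line chart steps can also be tried without the radar package. The following is a minimal sketch, assuming a hypothetical helper generate_prices that stands in for generateData and draws its random dates with the standard library; the axis labels and title are illustrative choices, not part of the original slides.

```python
import random

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def generate_prices(n, seed=0):
    """Hypothetical stand-in for generateData: random prices on random August 2019 dates."""
    rng = random.Random(seed)
    days = pd.date_range("2019-08-01", "2019-08-30", freq="D")
    rows = []
    for _ in range(n):
        date = days[rng.randrange(len(days))]
        price = round(rng.uniform(900, 1000), 4)
        rows.append([date, price])
    df = pd.DataFrame(rows, columns=["Date", "Price"])
    # Average the prices that fall on the same date
    return df.groupby(by="Date").mean()


df = generate_prices(50)

# Plot the mean price per date as a line chart
fig, ax = plt.subplots()
ax.plot(df.index, df["Price"])
ax.set_xlabel("Date")
ax.set_ylabel("Price")
ax.set_title("Average price per day")
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
```

To view the chart interactively, drop the matplotlib.use("Agg") line and call plt.show() at the end.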
BAR CHART
• Bars can be drawn horizontally or vertically to
represent categorical variables.
• Bar charts are frequently used to compare items across distinct
groups or to track variations over time.
• Bar charts are very convenient when the changes are
large.
REAL TIME EXAMPLE
• A pharmacy in Norway keeps track of the
amount of Zoloft sold every month using a bar
chart.
Note: Zoloft is a medicine prescribed to patients
suffering from depression.
CALENDAR
The 'calendar' Python library is used to keep track
of the months of the year (1 to 12),
corresponding to January to December.
# Import Necessary Libraries
import numpy as np
import calendar
import matplotlib.pyplot as plt
plt.show()
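The plotting steps between the imports and plt.show() are not shown above. One way they might look is sketched below, assuming made-up monthly sales figures (the list sold_quantity is hypothetical, not real pharmacy data) and using calendar.month_name to turn the month numbers 1 to 12 into labels.

```python
import calendar

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Month numbers 1..12, as tracked with the calendar library
months = list(range(1, 13))

# Hypothetical units of Zoloft sold per month (illustrative numbers only)
sold_quantity = [120, 135, 150, 110, 160, 175, 140, 155, 130, 165, 180, 145]

fig, ax = plt.subplots()
bars = ax.bar(months, sold_quantity)

# Label each bar with the first three letters of the month name
labels = [calendar.month_name[m][:3] for m in months]
ax.set_xticks(months)
ax.set_xticklabels(labels)
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.set_title("Monthly Zoloft sales (hypothetical data)")
```

Swapping ax.bar for ax.barh draws the same data as horizontal bars; in that case the tick labels go on the y-axis instead.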
SCATTER PLOT
• Scatter plots are also called scatter
graphs, scatter charts, scattergrams,
and scatter diagrams.
• They use a Cartesian coordinate
system to display the values of, typically,
two variables for a set of data.
WHEN SHOULD WE USE A
SCATTER PLOT?
• Scatter plots can be constructed in the following two situations:
■ When one continuous variable is dependent on another
variable, which is under the control of the observer
■ When both continuous variables are independent
• Scatter plots are used when we need to show the relationship
between two variables, and hence are sometimes referred to as
correlation plots.
REAL TIME EXAMPLE
1. The number of hours of sleep required by a
person depends on the age of the person.
2. The average income for adults is based on the
number of years of education.
• Display a scatter plot for sleep vs. age
dataset and Iris dataset
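As a starting point for the sleep vs. age exercise, here is a minimal sketch. The slides do not include the dataset, so the values in age_years and sleep_hours are made up purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: age in years and hours of sleep needed per day
age_years = [1, 3, 5, 10, 15, 20, 30, 40, 50, 60]
sleep_hours = [14, 12, 11, 10, 9, 8, 8, 7, 7, 7]

fig, ax = plt.subplots()
# Age is the independent variable (x), sleep the dependent variable (y)
points = ax.scatter(age_years, sleep_hours)
ax.set_xlabel("Age (years)")
ax.set_ylabel("Sleep needed (hours)")
ax.set_title("Sleep vs. age (hypothetical data)")
```

Each point is one (age, sleep) pair, so a downward drift of the points from left to right suggests a negative relationship between the two variables.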
BUBBLE CHART
• A bubble plot is a manifestation
of the scatter plot where each
data point on the graph is shown
as a bubble.
• Each bubble can be illustrated
with a different color, size, and
appearance.
Display a Bubble plot for Iris
dataset
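A bubble plot can be sketched with matplotlib's scatter function by passing a size for each point. The example below assumes a handful of Iris-style measurements typed in by hand (illustrative, not the full dataset); the scaling factor of 40 for the bubble size is an arbitrary choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# A few Iris-style rows entered by hand (illustrative values, in cm)
sepal_length = [5.1, 4.9, 7.0, 6.4, 6.3, 5.8]
sepal_width = [3.5, 3.0, 3.2, 3.2, 3.3, 2.7]
petal_length = [1.4, 1.4, 4.7, 4.5, 6.0, 5.1]
species_code = [0, 0, 1, 1, 2, 2]  # one integer per species, used for colour

# Bubble area grows with petal length; 40 is an arbitrary scaling factor
sizes = [40 * p for p in petal_length]

fig, ax = plt.subplots()
bubbles = ax.scatter(sepal_length, sepal_width, s=sizes, c=species_code, alpha=0.6)
ax.set_xlabel("Sepal length (cm)")
ax.set_ylabel("Sepal width (cm)")
ax.set_title("Bubble plot: size shows petal length")
```

Here position encodes two variables, bubble size a third, and colour a fourth, which is what distinguishes a bubble plot from a plain scatter plot.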
SCATTER PLOT USING SEABORN
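A minimal sketch of a seaborn scatter plot, assuming a small hand-built DataFrame with Iris-style column names (the rows are illustrative, not the full dataset). It uses seaborn's scatterplot function, where the hue argument colours the points by category.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd
import seaborn as sns

# Small illustrative table with Iris-style columns
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# One call draws the points, colours them by species, and adds a legend
ax = sns.scatterplot(data=df, x="sepal_length", y="sepal_width", hue="species")
ax.set_title("Sepal length vs. sepal width")
```

Seaborn takes the axis labels from the column names, so no extra labelling code is needed for a basic plot.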
Python Libraries
• So far, we have seen different types of 2D and 3D
visualization techniques using matplotlib and
seaborn.
MATPLOTLIB
• Matplotlib is the most widely used data visualization
library in Python.
• It provides a low-level API for creating a wide range of
plots, from simple line graphs to complex 3D plots.
• Matplotlib is highly customizable and provides
complete control over every aspect of the plot.
SEABORN
• Seaborn is a high-level data visualization library built
on top of Matplotlib.
• It provides a wide range of statistical visualizations
and is particularly useful for exploring relationships
between variables.
• Seaborn has a clean and modern look and can
generate complex plots with minimal code.
PLOTLY
• Plotly is a web-based data visualization library that
provides highly interactive and customizable plots.
• It provides a wide range of visualizations, from basic
line and scatter plots to complex 3D plots and maps.
• Plotly is particularly useful for creating interactive
dashboards and reports.
COMPARE SEABORN VS MATPLOTLIB VS PLOTLY
• Seaborn, Matplotlib, and Plotly are compared based
on four factors:
1. Syntax & API
2. Types of plots
3. Interactivity
4. Customization
SYNTAX & API
• Seaborn provides a high-level API that is easy to use
and requires minimal code to generate complex plots.
• Matplotlib, on the other hand, provides a low-level API
that gives complete control over every aspect of the
plot but can be challenging to use.
• Plotly provides an intermediate-level API that is easy to
use and provides a wide range of customization
options.
TYPES OF PLOTS
• Seaborn provides a wide range of statistical visualizations
that are particularly useful for exploring relationships
between variables.
• Matplotlib provides a broad range of plot types, from simple
line and scatter plots to complex 3D plots.
• Plotly provides a wide range of interactive visualizations
that are useful for creating interactive dashboards and
reports.
INTERACTIVITY
dataframe = pd.concat([dataFrame1,
dataFrame2], ignore_index=True)
dataframe
IGNORE_INDEX
• The ignore_index argument creates a new index; in its
absence, the original indices are kept.
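The effect of ignore_index can be seen by concatenating the same two dataframes with and without it. This is a minimal sketch; the StudentID and Score values are hypothetical.

```python
import pandas as pd

# Two small hypothetical dataframes, each with its own 0..2 index
dataFrame1 = pd.DataFrame({"StudentID": [1, 3, 5], "Score": [89, 39, 50]})
dataFrame2 = pd.DataFrame({"StudentID": [2, 4, 6], "Score": [98, 93, 44]})

# Without ignore_index, each row keeps the index it had in its source frame
kept = pd.concat([dataFrame1, dataFrame2])
print(list(kept.index))   # [0, 1, 2, 0, 1, 2]

# With ignore_index=True, pandas builds a fresh 0..n-1 index
fresh = pd.concat([dataFrame1, dataFrame2], ignore_index=True)
print(list(fresh.index))  # [0, 1, 2, 3, 4, 5]
```

The duplicated labels in the first result can make label-based lookups return more than one row, which is why a fresh index is usually preferred after stacking frames.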
Pandas df.merge() method
The df.merge() method can be used
along with joins.
Types of Joins
• Inner Join
• Outer Join
• Left Join
• Right Join
• The inner join takes the intersection from two or more
dataframes, which corresponds to the INNER JOIN in SQL.
• The outer join takes the union from two or more
dataframes, which corresponds to the FULL OUTER JOIN in
SQL.
• The left join uses the keys from the left-hand dataframe
only, which corresponds to the LEFT OUTER JOIN in SQL.
• The right join uses the keys from the right-hand dataframe
only, which corresponds to the RIGHT OUTER JOIN in SQL.
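The four join types can be compared side by side on two tiny dataframes. This is a sketch with hypothetical student IDs and scores; the column names ScoreSE and ScoreML are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical scores: students 3 and 4 appear in both frames
left = pd.DataFrame({"StudentID": [1, 2, 3, 4], "ScoreSE": [22, 66, 31, 51]})
right = pd.DataFrame({"StudentID": [3, 4, 5], "ScoreML": [39, 49, 55]})

# Collect the StudentID keys that survive each kind of join
keys_by_join = {}
for how in ["inner", "outer", "left", "right"]:
    merged = left.merge(right, on="StudentID", how=how)
    keys_by_join[how] = sorted(merged["StudentID"])
    print(how, keys_by_join[how])

# inner [3, 4]            -> intersection of the keys
# outer [1, 2, 3, 4, 5]   -> union of the keys
# left  [1, 2, 3, 4]      -> keys from the left-hand frame only
# right [3, 4, 5]         -> keys from the right-hand frame only
```

In the outer, left, and right results, rows without a match on the other side are kept and their missing scores are filled with NaN.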
2. Use df.merge with an inner join
df.merge() is used to get a list of students who
appeared in both the courses.
dfSE = pd.concat([df1SE, df2SE],
ignore_index=True)
dfML = pd.concat([df1ML, df2ML],
ignore_index=True)
df = dfSE.merge(dfML, how='inner')
21 students took both the
courses
3. Use df.merge with a left join
dfSE = pd.concat([df1SE, df2SE],
ignore_index=True)
dfML = pd.concat([df1ML, df2ML],
ignore_index=True)
df = dfSE.merge(dfML, how='left')
df
How many students only appeared for the Software Engineering course?
The total number would be 26.
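One way to count the students who appear only in the left-hand dataframe is a left join with the indicator option of df.merge(). The sketch below uses small hypothetical frames, so its count differs from the 26 obtained on the course dataset; the column names are assumptions.

```python
import pandas as pd

# Hypothetical data: students 3 and 4 took both courses
dfSE = pd.DataFrame({"StudentID": [1, 2, 3, 4, 5],
                     "ScoreSE": [22, 66, 31, 51, 71]})
dfML = pd.DataFrame({"StudentID": [3, 4, 6],
                     "ScoreML": [39, 49, 55]})

# indicator=True adds a _merge column saying where each row came from
merged = dfSE.merge(dfML, on="StudentID", how="left", indicator=True)

# Rows marked left_only have no match in the Machine Learning frame
only_se = merged[merged["_merge"] == "left_only"]
print(len(only_se))                # 3
print(list(only_se["StudentID"]))  # [1, 2, 5]
```

Counting the rows where the right-hand score is missing, for example merged["ScoreML"].isna().sum(), gives the same number as long as the original scores contain no missing values.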