Notes - EDA-Unit1
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data
– Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids
for EDA- Data transformation techniques-merging database, reshaping and pivoting,
transformation techniques.
COURSE OBJECTIVE:
To outline an overview of exploratory data analysis.
COURSE OUTCOME:
CO1: Understand the fundamentals of exploratory data analysis.
Exploratory Data Analysis Fundamentals
Data is a collection of discrete objects, numbers, words, events, facts, measurements,
observations, or descriptions of things. It is collected and stored by every event or process
occurring in several disciplines, including biology, economics, engineering, marketing, and
others. Processing such data elicits useful information and processing such information generates
useful knowledge.
EDA is a process of examining the available dataset to discover patterns, spot anomalies, test
hypotheses, and check assumptions using statistical measures. The primary aim of EDA is to
examine what data can tell us before actually going through formal modeling or hypothesis
formulation. EDA also helps statisticians explore the data and form new hypotheses that can
guide new approaches to data collection and experimentation.
Understanding data science
Data scientists
• build performant models
• must be able to explain the results obtained
• use the results for business intelligence
Data science involves cross-disciplinary knowledge from computer science, data, statistics, and
mathematics.
The phases of data analysis
1. Data requirements:
• various sources of data for an organization
• type of data required by the organization to be collected, curated, and stored
• categorize the data, numerical or categorical, and the format of storage and
dissemination.
Eg: An application tracking the sleeping pattern of patients suffering from dementia requires
several types of sensor data, such as sleep data, the patient's heart rate, electrodermal
activity, and user activity patterns. All of these data points are required to correctly
diagnose the mental state of the person.
2. Data collection:
Data collected from several sources must be stored in the correct format and transferred
to the right information technology personnel within a company.
3. Data processing:
Preprocessing involves curating the dataset before actual analysis.
Common tasks involve
• correctly exporting the dataset
• placing them under the right tables
• structuring them
• exporting them in the correct format
4. Data cleaning:
Preprocessed data may still not be ready for detailed analysis. It must be correctly
transformed and checked for an
• incompleteness check
• duplicates check
• error check
• missing value check
These tasks are performed in the data cleaning stage, which involves responsibilities such as
• matching the correct record
• finding inaccuracies in the dataset
• understanding the overall data quality
• removing duplicate items and
• filling in the missing values
Eg: outlier detection methods for quantitative data cleaning
5. EDA:
It is the stage where the actual message contained in the data is understood. Several types
of data transformation techniques might be required during the process of exploration.
6. Modeling and algorithm:
Generalized models or mathematical formulas can represent or exhibit relationships
among different variables, such as correlation or causation. These models or equations involve
one or more variables that depend on other variables to cause an event.
Eg: when buying pens, the total price of pens (Total) = price of one pen (UnitPrice) * the
number of pens bought (Quantity).
The model would be Total = UnitPrice * Quantity. The total price depends on the unit price,
so the total price is the dependent variable and the unit price is the independent variable.
In general, a model describes the relationship between independent and dependent variables.
Inferential statistics deals with quantifying relationships between particular variables.
The Judd model for describing the relationship between data, model, and error
Data = Model + Error
Eg: inferential statistics - regression analysis
7. Data Product:
It is the computer software that uses data as inputs, produces outputs, and provides
feedback based on the output to control the environment. It is based on a model developed
during data analysis.
Eg: a recommendation model that inputs user purchase history and recommends a related
item that the user is highly likely to buy.
8. Communication:
This stage deals with disseminating the results to end stakeholders to use the result for
business intelligence. It involves data visualization using techniques such as tables, charts,
summary diagrams, and bar charts to show the analyzed result.
Different fields of science, economics, engineering, and marketing accumulate and store data
primarily in electronic databases. Data collected should be used to make appropriate and well-
established decisions. This requires computer programs that gather insights from the collected
data, a process known as data mining, in which exploratory data analysis plays the key role.
EDA allows visualization of the data to understand it, as well as to create
hypotheses for further analysis. EDA reveals the ground truth about the content without making any
underlying assumptions. Data scientists use EDA to understand the type of modeling and
hypotheses that can be created.
Key components of exploratory data analysis include
• summarizing data
• statistical analysis
• visualization of data
Python provides tools for exploratory analysis
• pandas for summarizing
• scipy for statistical analysis
• matplotlib, plotly for visualizations
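A minimal sketch showing these three kinds of tools together, on a made-up list of exam scores
(all values are illustrative, not from the notes):
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

# Illustrative dataset: exam scores
scores = pd.Series([56, 67, 72, 78, 81, 84, 88, 90, 93, 95])

# pandas: summarizing data
print(scores.describe())   # count, mean, std, min, quartiles, max

# scipy: statistical analysis
print(stats.skew(scores))  # skewness of the distribution

# matplotlib: visualization of data
plt.hist(scores, bins=5, edgecolor='black')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()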
Steps in EDA
It involves four different steps
1.Problem definition:
Before extracting useful insight from the data, it is essential to define the business
problem to be solved. The problem definition works as the driving force for a data analysis plan
execution. The main tasks involved in problem definition are
• defining the main objective of the analysis
• defining the main deliverables
• outlining the main roles and responsibilities
• obtaining the current status of the data
• defining the timetable
• performing cost/benefit analysis
Based on such a problem definition, an execution plan can be created.
2. Data preparation:
This step involves methods for preparing the dataset before actual analysis. It involves
• define the sources of data,
• define data schemas and tables,
• understand the main characteristics of the data,
• clean the dataset,
• delete non-relevant datasets,
• transform the data, and
• divide the data into required chunks for analysis.
3.Data analysis:
It deals with descriptive statistics and analysis of the data. The main tasks involve
• summarizing the data
• finding the hidden correlation and relationships among the data
• developing predictive models
• evaluating the models
• calculating the accuracies
Techniques used for data summarization
• summary tables
• graphs
• descriptive statistics
• inferential statistics
• correlation statistics
• searching
• grouping
• mathematical models
4. Development and representation of the results:
It involves presenting the dataset to the target audience. It can be in the form of
• graphs,
• summary tables
• maps
• diagrams
Making sense of data
This deals with identifying the type of data under analysis. Different disciplines store
different kinds of data for different purposes.
Eg: Medical researchers store patients' data, universities store students' and teachers' data, and
real estate industries store house and building datasets.
A dataset contains many observations about a particular object. Each observation can have a
specific value for each of the variables.
Eg: A table of patient records with observations 001 to 005 (table not reproduced here).
Summarization: There are five observations (001, 002, 003, 004, 005). Each observation
describes the variables (PatientID, name, address, dob, email, gender, and weight).
Numerical data
This data has a sense of measurement involved in it.
Eg: a person's age, height, weight, blood pressure, heart rate, temperature, number of teeth,
number of bones, and the number of family members.
➢ Discrete data
Represents data that is countable and its values can be listed out. A variable that
represents a discrete dataset is referred to as a discrete variable. The discrete variable
takes a fixed number of distinct values.
Eg: Flipping a coin, the number of heads in 200 coin flips can take values from 0 to 200
(finite) cases.
Country variable can have values such as Nepal, India, Norway, and Japan. It is fixed.
Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
➢ Continuous data
A variable that can have an infinite number of numerical values within a specific
range is classified as continuous data. A variable describing continuous data is a continuous
variable. Continuous data can follow an interval measure of scale or ratio measure of scale.
Eg: Temperature of the city today
Exercise: identify the discrete and continuous variables from the table (table not reproduced here).
Measurement scales
The four scales of measurement are nominal, ordinal, interval, and ratio.
3. Interval
In interval scales, both the order and exact differences between the values are significant.
Interval scales are widely used in statistics, for example, in measures of central tendency:
mean, median, mode, and standard deviation.
Eg: Location in Cartesian coordinates and direction measured in degrees from magnetic north.
The mean, median, and mode are allowed on interval data.
4.Ratio
Ratio scales contain order, exact values, and an absolute zero, which makes them usable in
descriptive and inferential statistics. These scales provide numerous possibilities for
statistical analysis. Mathematical operations, the measure of central tendencies, and the measure
of dispersion and coefficient of variation can also be computed from such scales.
Eg: measure of energy, mass, length, duration, electrical energy, plane angle, and volume.
The following table gives a summary of the data types and scale measures:
Data type     Scale of measurement   Permissible statistics
Categorical   Nominal                mode, frequency counts
Categorical   Ordinal                mode, median, percentiles
Numerical     Interval               mode, median, mean, standard deviation
Numerical     Ratio                  all of the above, plus ratios and coefficient of variation
Software tools for EDA
There are several software tools available to facilitate EDA.
1. Python: an open source programming language widely used in data analysis, data mining,
and data science.
2. R programming language: an open source programming language widely utilized in
statistical computation and graphical data analysis.
3. Weka: an open source data mining package that includes several EDA tools and algorithms.
4. KNIME: an open source tool for data analysis, based on Eclipse.
Visual aids for EDA
1. Line chart
A line chart is used to illustrate the relationship between two or more continuous variables.
df = generateData(50)
df.head(10)
Steps involved
1.Load and prepare the dataset
2.Import the matplotlib library: import matplotlib.pyplot as plt
3.Plot the graph: plt.plot(df)
4.Display it on the screen: plt.show()
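generateData() is not defined in these notes; a minimal, runnable sketch of the whole example,
assuming it returns a small random time series:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical helper: assumed to return a random walk indexed by date
def generateData(n):
    dates = pd.date_range(start="2024-01-01", periods=n)
    values = np.random.randn(n).cumsum()
    return pd.DataFrame({"value": values}, index=dates)

df = generateData(50)
print(df.head(10))

plt.plot(df)   # plot the series as a line chart
plt.xlabel("Date")
plt.ylabel("Value")
plt.show()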
2. Bar charts
Bar charts are among the most common types of visualization. Bars can be drawn horizontally or
vertically to represent categorical variables. Bar charts are frequently used to compare values
across distinct categories and to track variations over time.
Eg: Assume a pharmacy in Norway keeps track of the amount of Zoloft sold every month. Use
the calendar Python library to keep track of the months of the year (1 to 12) corresponding to
January to December:
1. Import the required libraries
import random
import numpy as np
import calendar
import matplotlib.pyplot as plt
2. Set up the data
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
3. Specify the layout of the figure and allocate space
figure, axis = plt.subplots()
4. Display the names of the months:
plt.xticks(months, calendar.month_name[1:13], rotation=20)
5. Plot the graph:
plot = axis.bar(months, sold_quantity)
6. Display the data value at the head of each bar (optional, but visually more meaningful):
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height,
              '%d' % int(height), ha='center', va='bottom')
7. Display the graph on the screen: plt.show()
Horizontal bar chart - the code remains the same, with a few changes:
• plt.xticks changed to plt.yticks() and
• plt.bar() changed to plt.barh()
3.Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and scatter
diagrams. They use a Cartesian coordinates system to display values of typically two variables
for a set of data. Scatter plots can be constructed in the following two situations:
• When one continuous variable is dependent on another variable, which is under the
control of the observer
• When both continuous variables are independent
Scatter plots are used when we need to show the relationship between two variables, and are also
referred to as correlation plots.
Eg: Number of hours of sleep required by a person depends on the age of the person.
The average income for adults is based on the number of years of education
The dataset:
https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%202/sleep_vs_age.csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
# Load the sleep_vs_age dataset from the URL above
sleepDf = pd.read_csv("https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%202/sleep_vs_age.csv")
# A regular scatter plot
plt.scatter(x=sleepDf["age"]/12., y=sleepDf["min_recommended"])
plt.scatter(x=sleepDf["age"]/12., y=sleepDf["max_recommended"])
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()
Interpretation of the graph: the total number of hours of sleep required by a person is high
initially and gradually decreases as age increases. Due to the lack of a continuous line, the
results are not self-explanatory, so fit a line to the points:
# Line plot
plt.plot(sleepDf['age']/12., sleepDf['min_recommended'], 'g--')
plt.plot(sleepDf['age']/12., sleepDf['max_recommended'], 'r--')
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()
Interpretation: Both lines decline as age increases: newborns between 0 and 3 months require
14-17 hours of sleep every day, while adults and the elderly require 7-9 hours of sleep
every day.
Generate scatter plot for Iris dataset
Bubble plot
A bubble plot is a manifestation of the scatter plot where each data point on the graph is shown
as a bubble. Each bubble can be illustrated with a different color, size, and appearance.
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
df = sns.load_dataset('iris')
df['species'] = df['species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})

# Create bubble plot - the bubble size scales with the petal area
plt.scatter(df.petal_length, df.petal_width,
            s=50*df.petal_length*df.petal_width,
            c=df.species, alpha=0.3)

# Create labels for axes (the plot shows petal length versus petal width)
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()
A scatter plot can also be generated using the seaborn library which makes the graph visually
better.
4.Area plot and stacked plot
The stacked plot represents the area under a line plot and several such plots can be
stacked on top of one another, giving the feeling of a stack. It can be useful when we want to
visualize the cumulative effect of multiple variables being plotted on the y axis. An area plot
can be thought of as a line plot that shows the area covered by filling it with a color.
Define dataset:
# House loan Mortgage cost per month for a year
houseLoanMortgage = [9000, 9000, 8000, 9000, 8000, 9000, 9000, 9000, 9000, 8000, 9000, 9000]
# Utilities Bills for a year
utilitiesBills = [4218, 4218, 4218, 4218, 4218, 4218, 4219, 2218, 3218, 4233, 3000, 3000]
# Transportation bill for a year
transportation = [782, 900, 732, 892, 334, 222, 300, 800, 900, 582, 596, 222]
# Car mortgage cost for one year
carMortgage = [700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711]
Import the required libraries and plot stacked charts:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
months= [x for x in range(1,13)]
# Create placeholders for plot and add required color
plt.plot([],[], color='sandybrown', label='houseLoanMortgage')
plt.plot([],[], color='tan', label='utilitiesBills')
plt.plot([],[], color='bisque', label='transportation')
plt.plot([],[], color='darkcyan', label='carMortgage')
# Add stacks to the plot
plt.stackplot(months, houseLoanMortgage, utilitiesBills, transportation, carMortgage,
colors=['sandybrown', 'tan', 'bisque', 'darkcyan'])
plt.legend()
# Add Labels
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')
# Display on the screen
plt.show()
Interpretation: The house mortgage loan is the largest expense, since the area under its curve
is the largest. The utility bills stack covers the second-largest area, and so on.
The graph clearly disseminates meaningful information to the targeted audience. Labels, legends,
and colors are important aspects of creating a meaningful visualization.
5. Pie chart
The pie chart fails to appeal to most visualization experts; its purpose is to
communicate proportions.
Dataset: Pokemon dataset to draw a pie chart.
# Create URL to the CSV file (alternatively this can be a filepath)
url = 'https://raw.githubusercontent.com/hmcuesta/PDA_Book/master/Chapter3/pokemonByType.csv'
# Load the CSV file into a data frame
import pandas as pd
pokemon = pd.read_csv(url, index_col='type')
pokemon
Plot the pie chart:
import matplotlib.pyplot as plt
plt.pie(pokemon['amount'], labels=pokemon.index, shadow=False, startangle=90,
autopct='%1.1f%%',)
plt.axis('equal')
plt.show()
Pandas code : pokemon.plot.pie(y="amount", figsize=(20, 10))
6.Table chart
A table chart combines a bar chart and a table.
Dataset: Consider standard LED bulbs that come in different wattages. The standard Philips
LED bulb can be 4.5 Watts, 6 Watts, 7 Watts, 8.5 Watts, 9.5 Watts, 13.5 Watts, and 15 Watts.
Let's assume there are two categorical variables, the year and the wattage, and a numeric
variable, which is the number of units sold in a particular year.
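The notes do not include the code for this example; a minimal sketch follows, assuming
hypothetical unit-sales figures for three years (all numbers are made up for illustration):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: units sold per wattage for three years
wattages = ['4.5 W', '6 W', '7 W', '8.5 W', '9.5 W', '13.5 W', '15 W']
years = ['2020', '2021', '2022']
unitsSold = np.array([[65, 78, 80, 95, 52, 60, 33],
                      [85, 72, 95, 105, 58, 72, 41],
                      [75, 86, 99, 110, 64, 70, 49]])

fig, axis = plt.subplots()
x = np.arange(len(wattages))

# Stack one set of bars per year
bottom = np.zeros(len(wattages))
for row, year in zip(unitsSold, years):
    axis.bar(x, row, bottom=bottom, label=year)
    bottom += row

# Attach the table underneath the bar chart; the table supplies the x labels
axis.set_xticks([])
plt.table(cellText=unitsSold, rowLabels=years, colLabels=wattages, loc='bottom')
plt.subplots_adjust(bottom=0.25)  # make room for the table
axis.set_ylabel('Units sold')
axis.legend()
plt.show()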
7. Polar chart
A polar chart (also known as a radar or spider chart) is drawn on a polar projection. An extra
entry makes the plot circular: connect the first and the last point together to form a
circular flow.
Eg: plotting the grades obtained in several subjects. The steps are:
1. Import the required libraries.
2. Prepare the dataset: the subject names and the grades obtained in each subject.
3. Initialize the plot with the figure size and polar projection.
4. Get the grid lines to align with each of the subject names.
5. Use the plt.plot method to plot the graph and fill the area under it.
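A minimal sketch of these steps, assuming hypothetical subject names and grades (the original
data is not in the notes):
import numpy as np
import matplotlib.pyplot as plt

# 1.-2. Hypothetical dataset: five subjects and the grades obtained in each
subjects = ['Math', 'Physics', 'Chemistry', 'English', 'Biology']
grades = [90, 75, 80, 85, 70]

# Extra entry: repeat the first grade so the first and last points connect
grades += grades[:1]
theta = np.linspace(0, 2 * np.pi, len(subjects), endpoint=False)
theta = np.concatenate((theta, [theta[0]]))

# 3. Initialize the plot with the figure size and polar projection
plt.figure(figsize=(10, 6))
axis = plt.subplot(111, polar=True)

# 4. Align the grid lines with each of the subject names
plt.xticks(theta[:-1], subjects)

# 5. Plot the graph and fill the area under it
axis.plot(theta, grades)
axis.fill(theta, grades, alpha=0.3)
plt.show()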
8.Histogram
Histogram plots are used to depict the distribution of any continuous variable. These
types of plots are very popular in statistical analysis.
Use case: A survey taken during vocational training sessions for developers had 100
participants, whose Python programming experience ranged from 0 to 20 years.
Create the dataset and plot the histogram, as sketched below.
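A minimal sketch, generating a random stand-in for the survey data (the actual responses are
not reproduced in the notes):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dataset: years of Python experience for 100 participants,
# drawn uniformly at random between 0 and 20
yearsOfExperience = np.random.randint(0, 21, 100)

plt.figure(figsize=(10, 5))
plt.hist(yearsOfExperience, bins=10, edgecolor='black')  # 10 equal-width bins
plt.xlabel('Years of experience with Python programming')
plt.ylabel('Frequency')
plt.title('Distribution of Python programming experience')
plt.show()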
9.Lollipop chart
A lollipop chart can be used to display ranking in the data. It is similar to an ordered bar
chart.
1. Load the dataset (the carDF dataset); the remaining steps are sketched below.
7. Write the actual mean values in the plot, and display the plot.
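Only steps 1 and 7 survive in the notes; a minimal sketch of the whole procedure, assuming a
hypothetical carDF with manufacturer and cty (city mileage) columns:
import pandas as pd
import matplotlib.pyplot as plt

# 1. Hypothetical carDF: manufacturer and city mileage (cty) columns are assumed
carDF = pd.DataFrame({
    'manufacturer': ['audi', 'audi', 'ford', 'ford', 'honda', 'honda', 'toyota', 'toyota'],
    'cty': [18, 20, 14, 16, 24, 26, 21, 23],
})

# Rank manufacturers by mean city mileage
processedDF = carDF.groupby('manufacturer')['cty'].mean().sort_values().reset_index()

fig, axis = plt.subplots(figsize=(10, 5))
axis.vlines(x=processedDF.index, ymin=0, ymax=processedDF.cty, linewidth=1)  # stems
axis.scatter(x=processedDF.index, y=processedDF.cty, s=50)                   # heads

# 7. Write the actual mean values in the plot, and display the plot
for row in processedDF.itertuples():
    axis.text(row.Index, row.cty + 0.5, str(round(row.cty, 1)), ha='center')
axis.set_xticks(processedDF.index)
axis.set_xticklabels(processedDF.manufacturer, rotation=45)
axis.set_ylabel('Average city mileage')
plt.show()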
Data transformation techniques
Data transformation is a set of techniques used to convert data from one format or structure
to another. The main reason for transforming the data is to get a better representation such
that the transformed data is compatible with other data. The following are some examples of
transformation activities:
• Data deduplication -involves the identification of duplicates and their removal.
• Key restructuring -involves transforming any keys with built-in meanings to the generic
keys.
• Data cleansing - involves identifying and removing out-of-date, inaccurate, and incomplete
information from data sources, without distorting the meaning of the data, to enhance the
accuracy of the source data.
• Data validation -is a process of formulating rules or algorithms that help in validating
different types of data against some known issues.
• Format revisioning -involves converting from one format to another.
• Data derivation -consists of creating a set of rules to generate more information from the
data source.
• Data aggregation -involves searching, extracting, summarizing, and preserving important
information in different types of reporting systems.
• Data integration- involves converting different data types and merging them into a
common structure or schema.
• Data filtering- involves identifying information relevant to any particular user.
• Data joining- involves establishing a relationship between two or more tables.
Merging database
Assume:
o There are some students who are not taking the software engineering exam.
o There are some students who are not taking the machine learning exam.
o There are students who appeared in both courses.
Analyze using the EDA technique:
• How many students appeared for the exams in total?
• How many students only appeared for the Software Engineering course?
• How many students only appeared for the Machine Learning course?
1. Concatenating along an axis
Provides the list of all the students who appeared in the Machine Learning course.
5. Using pd.merge() with an outer join
Gives the total number of students appearing for at least one course.
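A minimal sketch of both operations, assuming hypothetical dataframes keyed by StudentID (the
notes' original dataframes are not reproduced):
import pandas as pd

# Hypothetical rosters: students who appeared for each exam
softwareEngineeringDF = pd.DataFrame({'StudentID': [1, 2, 3, 4]})
machineLearningDF = pd.DataFrame({'StudentID': [3, 4, 5, 6]})

# Concatenating along an axis stacks the two rosters
allAppearancesDF = pd.concat([softwareEngineeringDF, machineLearningDF], axis=0)
print(allAppearancesDF)

# An outer join keeps every StudentID from both tables, so the row count
# is the number of students who appeared for at least one course
atLeastOneDF = pd.merge(softwareEngineeringDF, machineLearningDF,
                        on='StudentID', how='outer')
print(len(atLeastOneDF))  # 6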
Merging on index
The index acts as the key for merging dataframes - pass left_index=True or right_index=True to
indicate that the index should be used as the merge key.
1. Consider the following two dataframes:
2. Merge using an inner join (the default type of merge) - merges based on the intersection of
the keys.
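A minimal sketch with made-up frames: left carries a key column, while right is indexed by the
same key values:
import pandas as pd

left = pd.DataFrame({'key': ['apple', 'ball', 'apple', 'apple', 'ball'],
                     'value': range(5)})
right = pd.DataFrame({'rightValue': [10, 20]}, index=['apple', 'ball'])

# right_index=True tells merge to use right's index as its join key
print(pd.merge(left, right, left_on='key', right_index=True, how='inner'))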
Reshaping and pivoting
Reshaping helps to arrange the data in a dataframe in some consistent manner. This can be done
with hierarchical indexing using two actions:
➢ Stacking: Stack rotates from any particular column in the data to the rows.
➢ Unstacking: Unstack rotates from the rows into the columns.
1. Create a dataframe that records the rainfall, humidity, and wind conditions of five
different counties in Norway.
2. Use the stack() method on dframe1 to pivot the columns into rows and produce a series.
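A minimal sketch with illustrative values (the county names and numbers are made up):
import pandas as pd

# 1. Weather conditions for five Norwegian counties
dframe1 = pd.DataFrame({'rainfall': [100, 50, 80, 90, 60],
                        'humidity': [80, 70, 90, 85, 75],
                        'wind': [12, 14, 10, 9, 16]},
                       index=['Oslo', 'Viken', 'Innlandet', 'Agder', 'Rogaland'])

# 2. stack() pivots the columns into rows, producing a Series
stacked = dframe1.stack()
print(stacked.head())

# unstack() pivots the rows back into columns
print(stacked.unstack())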
Transformation techniques
• Create two series, series1 and series2, and then concatenate them.
• Add a new column and try to find duplicated items based on the second column.
Handling missing data
The two functions, notnull() and isnull(), are complements of each other.
2. Count the number of NaN values in each store.
3. Find the total number of missing values.
Dropping by columns
To drop columns, pass axis=1 to dropna(). The thresh argument specifies the minimum number of
non-NaN values a column must contain in order to be kept; columns with fewer are dropped (see
the sketch below).
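A minimal sketch of these steps, assuming a hypothetical dataframe of store sales with missing
entries:
import numpy as np
import pandas as pd

storeDF = pd.DataFrame({'Store A': [100, np.nan, 200, np.nan],
                        'Store B': [np.nan, 150, 250, 300],
                        'Store C': [50, 60, np.nan, 80]})

print(storeDF.isnull().sum())        # number of NaN values in each store
print(storeDF.isnull().sum().sum())  # total number of missing values

# Dropping by columns: keep only columns with at least 3 non-NaN values
print(storeDF.dropna(axis=1, thresh=3))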
Mathematical operations with NaN
3. Cumulative summing
Interpolation: In ser3, the first and the last values are 100 and 292, respectively, with
missing values in between. Linear interpolation computes the step as (292-100)/(5-1) = 48, so
the next value after 100 is 100 + 48 = 148.
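A sketch assuming ser3 is the five-element series implied by the arithmetic above:
import numpy as np
import pandas as pd

ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
# Linear interpolation fills the gaps in equal steps of (292-100)/(5-1) = 48
print(ser3.interpolate())  # 100, 148, 196, 244, 292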
Renaming axis indexes
By default, the rename method returns a modified copy and does not change the original
dataframe; pass inplace=True to modify the dataframe in place.
Discretization and binning
2. Convert the dataset into intervals of 118 to 125, 126 to 135, 136 to 160, and finally 160
and higher.
A parenthesis indicates that the side is open; a square bracket means that it is closed or
inclusive. (118, 125] means the left-hand side is open and the right-hand side is closed.
3. Set the right=False argument to change the form of the intervals.
6. Pass an integer for the bins - it will compute equal-length bins based on the minimum and
maximum values in the data.
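A minimal sketch, assuming hypothetical height values and capping the open-ended last interval
at 200:
import pandas as pd

# Hypothetical data; 200 is an assumed upper cap for '160 and higher'
heights = [120, 122, 125, 127, 121, 123, 137, 148, 160, 170, 185]
bins = [118, 125, 135, 160, 200]

# Default intervals are open on the left and closed on the right, e.g. (118, 125]
category = pd.cut(heights, bins)
print(pd.Series(category).value_counts())

# right=False closes the left side instead: [118, 125)
print(pd.cut(heights, bins, right=False))

# Passing an integer computes equal-length bins from the data's min and max
print(pd.cut(heights, 4))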
Outlier detection and filtering
2. Calculate the total price based on the quantity sold and the unit price, and add it as a
new column.
3. Find the transactions that exceeded 3,000,000.
4. Display all the columns and rows where TotalPrice is greater than 6741112.
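A minimal sketch of steps 2-4, assuming a hypothetical transactions dataframe with Quantity and
UnitPrice columns:
import pandas as pd

transactions = pd.DataFrame({'Quantity': [150, 2000, 900, 1200],
                             'UnitPrice': [2500, 1800, 4200, 5700]})

# 2. Calculate the total price and add it as a new column
transactions['TotalPrice'] = transactions['Quantity'] * transactions['UnitPrice']

# 3. Find the transactions that exceeded 3,000,000
print(transactions[transactions['TotalPrice'] > 3000000])

# 4. Display all the columns and rows where TotalPrice is greater than 6741112
print(transactions[transactions['TotalPrice'] > 6741112])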
Permutation and random sampling
The np.random.permutation() method takes the length of the axis to be permuted and returns an
array of integers indicating the new ordering.
2. The output array can then be passed to the take() method (or used with iloc-based indexing)
to reorder the rows.
Random sampling without replacement
To compute random sampling without replacement, follow these steps:
1. Create a permutation array.
2. Slice off the first n elements of the array, where n is the desired size of the subset.
3. Use the df.take() method to obtain the actual samples.
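A minimal sketch on a small made-up dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4))

# Permutation: an array of integers giving the new row ordering
sampler = np.random.permutation(len(df))
print(df.take(sampler))  # rows reordered according to sampler

# Random sampling without replacement: slice the first n permuted positions
n = 3
print(df.take(np.random.permutation(len(df))[:n]))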