Notes - EDA-Unit1
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data
– Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids
for EDA- Data transformation techniques-merging database, reshaping and pivoting,
transformation techniques.
COURSE OBJECTIVE:
To outline an overview of exploratory data analysis.
COURSE OUTCOME:
CO1: Understand the fundamentals of exploratory data analysis.
Exploratory Data Analysis Fundamentals
Data is a collection of discrete objects, numbers, words, events, facts, measurements,
observations, or descriptions of things. It is collected and stored by every event or process
occurring in several disciplines, including biology, economics, engineering, marketing, and
others. Processing such data elicits useful information and processing such information generates
useful knowledge.
EDA is a process of examining the available dataset to discover patterns, spot anomalies, test
hypotheses, and check assumptions using statistical measures. The primary aim of EDA is to
examine what data can tell us before actually going through formal modeling or hypothesis
formulation. EDA also helps statisticians explore the data and form new hypotheses that can
guide new approaches to data collection and experimentation.
Understanding data science
Data scientists
• build performant models
• must be able to explain the results obtained
• use the results for business intelligence
Data science involves cross-disciplinary knowledge from computer science, data, statistics, and
mathematics.
The phases of data analysis
1. Data requirements:
• various sources of data for an organization
• type of data required by the organization to be collected, curated, and stored
• categorize the data, numerical or categorical, and the format of storage and
dissemination.
Eg: An application tracking the sleeping pattern of patients suffering from dementia requires
several types of sensor data, such as sleep data, the patient's heart rate, electrodermal
activity, and user activity patterns. All of these data points are required to correctly
diagnose the mental state of the person.
2. Data collection:
Data collected from several sources must be stored in the correct format and transferred
to the right information technology personnel within a company.
3. Data processing:
Preprocessing involves curating the dataset before actual analysis.
Common tasks involve
• correctly exporting the dataset
• placing them under the right tables
• structuring them
• exporting them in the correct format
4. Data cleaning:
Preprocessed data may still not be ready for detailed analysis. It must be correctly
transformed and checked for an
• incompleteness check
• duplicates check
• error check
• missing value check
These tasks are performed in the data cleaning stage, which involves responsibilities such as
• matching the correct record
• finding inaccuracies in the dataset
• understanding the overall data quality
• removing duplicate items and
• filling in the missing values
Eg: outlier detection methods for quantitative data cleaning
5. EDA:
It is the stage where the actual message contained in the data is understood. Several types
of data transformation techniques might be required during the process of exploration.
6. Modeling and algorithm:
Generalized models or mathematical formulas can represent or exhibit relationships
among different variables, such as correlation or causation. These models or equations involve
one or more variables that depend on other variables to cause an event.
Eg: when buying pens, the total price of pens (Total) = price of one pen (UnitPrice) * the
number of pens bought (Quantity).
The model would be Total = UnitPrice * Quantity. The total price depends on the unit price,
so the total price is the dependent variable and the unit price is the independent variable.
In general, a model describes the relationship between independent and dependent variables.
Inferential statistics deals with quantifying relationships between particular variables.
The Judd model for describing the relationship between data, model, and error
Data = Model + Error
Eg: inferential statistics - regression analysis
7. Data Product:
It is the computer software that uses data as inputs, produces outputs, and provides
feedback based on the output to control the environment. It is based on a model developed
during data analysis.
Eg: a recommendation model that inputs user purchase history and recommends a related
item that the user is highly likely to buy.
8. Communication:
This stage deals with disseminating the results to end stakeholders to use the result for
business intelligence. It involves data visualization using techniques such as tables, charts,
summary diagrams, and bar charts to show the analyzed result.
Different fields of science, economics, engineering, and marketing accumulate and store data
primarily in electronic databases. Data collected should be used to make appropriate and well-
established decisions. This requires computer programs that gather insights from the collected
data, a process known as data mining, in which exploratory data analysis plays the key role.
EDA allows visualization of the data to understand it, as well as to create
hypotheses for further analysis. EDA reveals the ground truth about the content without making any
underlying assumptions. Data scientists use EDA to understand the type of modeling and
hypotheses that can be created.
Key components of exploratory data analysis include
• summarizing data
• statistical analysis
• visualization of data
Python provides tools for exploratory analysis
• pandas for summarizing
• scipy for statistical analysis
• matplotlib, plotly for visualizations
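A minimal sketch showing these three kinds of tools together, on a made-up list of exam scores
(all values are illustrative, not from the notes):
import matplotlib.pyplot as plt
import pandas as pd
from scipy import stats

# Illustrative dataset: exam scores
scores = pd.Series([56, 67, 72, 78, 81, 84, 88, 90, 93, 95])

# pandas: summarizing data
print(scores.describe())   # count, mean, std, min, quartiles, max

# scipy: statistical analysis
print(stats.skew(scores))  # skewness of the distribution

# matplotlib: visualization of data
plt.hist(scores, bins=5, edgecolor='black')
plt.xlabel('Score')
plt.ylabel('Frequency')
plt.show()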
Steps in EDA
It involves four different steps
1.Problem definition:
Before extracting useful insight from the data, it is essential to define the business
problem to be solved. The problem definition works as the driving force for a data analysis plan
execution. The main tasks involved in problem definition are
• defining the main objective of the analysis
• defining the main deliverables
• outlining the main roles and responsibilities
• obtaining the current status of the data
• defining the timetable
• performing cost/benefit analysis
Based on such a problem definition, an execution plan can be created.
2. Data preparation:
This step involves methods for preparing the dataset before actual analysis. It involves
• define the sources of data,
• define data schemas and tables,
• understand the main characteristics of the data,
• clean the dataset,
• delete non-relevant datasets,
• transform the data, and
• divide the data into required chunks for analysis.
3.Data analysis:
It deals with descriptive statistics and analysis of the data. The main tasks involve
• summarizing the data
• finding the hidden correlation and relationships among the data
• developing predictive models
• evaluating the models
• calculating the accuracies
Techniques used for data summarization
• summary tables
• graphs
• descriptive statistics
• inferential statistics
• correlation statistics
• searching
• grouping
• mathematical models
4. Development and representation of the results:
It involves presenting the dataset to the target audience. It can be in the form of
• graphs,
• summary tables
• maps
• diagrams
Making sense of data
This deals with identifying the type of data under analysis. Different disciplines store
different kinds of data for different purposes.
Eg: Medical researchers store patients' data, universities store students' and teachers' data, and
real estate industries store house and building datasets.
A dataset contains many observations about a particular object. Each observation can have a
specific value for each of the variables.
Eg: A table of patient records with observations 001 to 005 (table not reproduced here).
Summarization: There are five observations (001, 002, 003, 004, 005). Each observation
describes the variables (PatientID, name, address, dob, email, gender, and weight).
Numerical data
This data has a sense of measurement involved in it.
Eg: a person's age, height, weight, blood pressure, heart rate, temperature, number of teeth,
number of bones, and the number of family members.
➢ Discrete data
Represents data that is countable and its values can be listed out. A variable that
represents a discrete dataset is referred to as a discrete variable. The discrete variable
takes a fixed number of distinct values.
Eg: Flipping a coin, the number of heads in 200 coin flips can take values from 0 to 200
(finite) cases.
Country variable can have values such as Nepal, India, Norway, and Japan. It is fixed.
Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
➢ Continuous data
A variable that can have an infinite number of numerical values within a specific
range is classified as continuous data. A variable describing continuous data is a continuous
variable. Continuous data can follow an interval measure of scale or ratio measure of scale.
Eg: Temperature of the city today
Exercise: identify the discrete and continuous variables from the table (table not reproduced here).
Measurement scales
The four scales of measurement are nominal, ordinal, interval, and ratio.
3. Interval
In interval scales, both the order and exact differences between the values are significant.
Interval scales are widely used in statistics, for example, in measures of central tendency:
mean, median, mode, and standard deviation.
Eg: Location in Cartesian coordinates and direction measured in degrees from magnetic north.
The mean, median, and mode are allowed on interval data.
4.Ratio
Ratio scales contain order, exact values, and an absolute zero, which makes them usable in
descriptive and inferential statistics. These scales provide numerous possibilities for
statistical analysis. Mathematical operations, the measure of central tendencies, and the measure
of dispersion and coefficient of variation can also be computed from such scales.
Eg: measure of energy, mass, length, duration, electrical energy, plane angle, and volume.
The following table gives a summary of the data types and scale measures:
Data type     Scale of measurement   Permissible statistics
Categorical   Nominal                mode, frequency counts
Categorical   Ordinal                mode, median, percentiles
Numerical     Interval               mode, median, mean, standard deviation
Numerical     Ratio                  all of the above, plus ratios and coefficient of variation
Software tools for EDA
There are several software tools available to facilitate EDA.
1. Python: an open source programming language widely used in data analysis, data mining,
and data science.
2. R programming language: an open source programming language widely utilized in
statistical computation and graphical data analysis.
3. Weka: an open source data mining package that includes several EDA tools and algorithms.
4. KNIME: an open source tool for data analysis, based on Eclipse.
Visual aids for EDA
1. Line chart
A line chart is used to illustrate the relationship between two or more continuous variables.
df = generateData(50)
df.head(10)
Steps involved
1.Load and prepare the dataset
2.Import the matplotlib library: import matplotlib.pyplot as plt
3.Plot the graph: plt.plot(df)
4.Display it on the screen: plt.show()
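generateData() is not defined in these notes; a minimal, runnable sketch of the whole example,
assuming it returns a small random time series:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical helper: assumed to return a random walk indexed by date
def generateData(n):
    dates = pd.date_range(start="2024-01-01", periods=n)
    values = np.random.randn(n).cumsum()
    return pd.DataFrame({"value": values}, index=dates)

df = generateData(50)
print(df.head(10))

plt.plot(df)   # plot the series as a line chart
plt.xlabel("Date")
plt.ylabel("Value")
plt.show()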
2. Bar charts
Bar charts are among the most common types of visualization. Bars can be drawn horizontally or
vertically to represent categorical variables. Bar charts are frequently used to compare values
across distinct categories and to track variations over time.
Eg: Assume a pharmacy in Norway keeps track of the amount of Zoloft sold every month. Use
the calendar Python library to keep track of the months of the year (1 to 12) corresponding to
January to December:
1. Import the required libraries
import random
import numpy as np
import calendar
import matplotlib.pyplot as plt
2. Set up the data
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
3. Specify the layout of the figure and allocate space
figure, axis = plt.subplots()
4. Display the names of the months:
plt.xticks(months, calendar.month_name[1:13], rotation=20)
5. Plot the graph:
plot = axis.bar(months, sold_quantity)
6. Display the data value at the head of each bar (optional, but visually more meaningful):
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height,
              '%d' % int(height), ha='center', va='bottom')
7. Display the graph on the screen: plt.show()
Horizontal bar chart - the code remains the same, with a few changes:
• plt.xticks changed to plt.yticks() and
• plt.bar() changed to plt.barh()
3.Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and scatter
diagrams. They use a Cartesian coordinates system to display values of typically two variables
for a set of data. Scatter plots can be constructed in the following two situations:
• When one continuous variable is dependent on another variable, which is under the
control of the observer
• When both continuous variables are independent
Scatter plots are used when we need to show the relationship between two variables, and are also
referred to as correlation plots.
Eg: Number of hours of sleep required by a person depends on the age of the person.
The average income for adults is based on the number of years of education
The dataset:
https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%202/sleep_vs_age.csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
sns.set()
# Load the sleep_vs_age dataset from the URL above
sleepDf = pd.read_csv("https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%202/sleep_vs_age.csv")
# A regular scatter plot
plt.scatter(x=sleepDf["age"]/12., y=sleepDf["min_recommended"])
plt.scatter(x=sleepDf["age"]/12., y=sleepDf["max_recommended"])
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()
Interpretation of the graph: the total number of hours of sleep required by a person is high
initially and gradually decreases as age increases. Due to the lack of a continuous line, the
results are not self-explanatory, so fit a line to the points:
# Line plot
plt.plot(sleepDf['age']/12., sleepDf['min_recommended'], 'g--')
plt.plot(sleepDf['age']/12., sleepDf['max_recommended'], 'r--')
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()
Interpretation: Both lines decline as age increases: newborns between 0 and 3 months require
14-17 hours of sleep every day, while adults and the elderly require 7-9 hours of sleep
every day.
Generate scatter plot for Iris dataset
Bubble plot
A bubble plot is a manifestation of the scatter plot where each data point on the graph is shown
as a bubble. Each bubble can be illustrated with a different color, size, and appearance.
import seaborn as sns
import matplotlib.pyplot as plt

# Load the Iris dataset
df = sns.load_dataset('iris')
df['species'] = df['species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})

# Create bubble plot - the bubble size scales with the petal area
plt.scatter(df.petal_length, df.petal_width,
            s=50*df.petal_length*df.petal_width,
            c=df.species, alpha=0.3)

# Create labels for axes (the plot shows petal length versus petal width)
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()
A scatter plot can also be generated using the seaborn library which makes the graph visually
better.
4.Area plot and stacked plot
The stacked plot represents the area under a line plot and several such plots can be
stacked on top of one another, giving the feeling of a stack. It can be useful when we want to
visualize the cumulative effect of multiple variables being plotted on the y axis. An area plot
can be thought of as a line plot that shows the area covered by filling it with a color.
Define dataset:
# House loan Mortgage cost per month for a year
houseLoanMortgage = [9000, 9000, 8000, 9000, 8000, 9000, 9000, 9000, 9000, 8000, 9000, 9000]
# Utilities Bills for a year
utilitiesBills = [4218, 4218, 4218, 4218, 4218, 4218, 4219, 2218, 3218, 4233, 3000, 3000]
# Transportation bill for a year
transportation = [782, 900, 732, 892, 334, 222, 300, 800, 900, 582, 596, 222]
# Car mortgage cost for one year
carMortgage = [700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711]
Import the required libraries and plot stacked charts:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
months= [x for x in range(1,13)]
# Create placeholders for plot and add required color
plt.plot([],[], color='sandybrown', label='houseLoanMortgage')
plt.plot([],[], color='tan', label='utilitiesBills')
plt.plot([],[], color='bisque', label='transportation')
plt.plot([],[], color='darkcyan', label='carMortgage')
# Add stacks to the plot
plt.stackplot(months, houseLoanMortgage, utilitiesBills, transportation, carMortgage,
colors=['sandybrown', 'tan', 'bisque', 'darkcyan'])
plt.legend()
# Add Labels
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')
# Display on the screen
plt.show()
Interpretation: The house mortgage loan is the largest expense, since the area under its curve
is the largest. The utility bills stack covers the second-largest area, and so on.
The graph clearly disseminates meaningful information to the targeted audience. Labels, legends,
and colors are important aspects of creating a meaningful visualization.
5. Pie chart
The pie chart fails to appeal to most visualization experts; its purpose is to
communicate proportions.
Dataset: Pokemon dataset to draw a pie chart.
# Create URL to the CSV file (alternatively this can be a filepath)
url = 'https://raw.githubusercontent.com/hmcuesta/PDA_Book/master/Chapter3/pokemonByType.csv'
# Load the CSV file into a data frame
import pandas as pd
pokemon = pd.read_csv(url, index_col='type')
pokemon
Plot the pie chart:
import matplotlib.pyplot as plt
plt.pie(pokemon['amount'], labels=pokemon.index, shadow=False, startangle=90,
autopct='%1.1f%%',)
plt.axis('equal')
plt.show()
Pandas code : pokemon.plot.pie(y="amount", figsize=(20, 10))
6.Table chart
A table chart combines a bar chart and a table.
Dataset: Consider standard LED bulbs that come in different wattages. The standard Philips
LED bulb can be 4.5 Watts, 6 Watts, 7 Watts, 8.5 Watts, 9.5 Watts, 13.5 Watts, and 15 Watts.
Let's assume there are two categorical variables, the year and the wattage, and a numeric
variable, which is the number of units sold in a particular year.
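The notes do not include the code for this example; a minimal sketch follows, assuming
hypothetical unit-sales figures for three years (all numbers are made up for illustration):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data: units sold per wattage for three years
wattages = ['4.5 W', '6 W', '7 W', '8.5 W', '9.5 W', '13.5 W', '15 W']
years = ['2020', '2021', '2022']
unitsSold = np.array([[65, 78, 80, 95, 52, 60, 33],
                      [85, 72, 95, 105, 58, 72, 41],
                      [75, 86, 99, 110, 64, 70, 49]])

fig, axis = plt.subplots()
x = np.arange(len(wattages))

# Stack one set of bars per year
bottom = np.zeros(len(wattages))
for row, year in zip(unitsSold, years):
    axis.bar(x, row, bottom=bottom, label=year)
    bottom += row

# Attach the table underneath the bar chart; the table supplies the x labels
axis.set_xticks([])
plt.table(cellText=unitsSold, rowLabels=years, colLabels=wattages, loc='bottom')
plt.subplots_adjust(bottom=0.25)  # make room for the table
axis.set_ylabel('Units sold')
axis.legend()
plt.show()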
7. Polar chart
A polar chart (also known as a radar or spider chart) is drawn on a polar projection. An extra
entry makes the plot circular: connect the first and the last point together to form a
circular flow.
Eg: plotting the grades obtained in several subjects. The steps are:
1. Import the required libraries.
2. Prepare the dataset: the subject names and the grades obtained in each subject.
3. Initialize the plot with the figure size and polar projection.
4. Get the grid lines to align with each of the subject names.
5. Use the plt.plot method to plot the graph and fill the area under it.
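A minimal sketch of these steps, assuming hypothetical subject names and grades (the original
data is not in the notes):
import numpy as np
import matplotlib.pyplot as plt

# 1.-2. Hypothetical dataset: five subjects and the grades obtained in each
subjects = ['Math', 'Physics', 'Chemistry', 'English', 'Biology']
grades = [90, 75, 80, 85, 70]

# Extra entry: repeat the first grade so the first and last points connect
grades += grades[:1]
theta = np.linspace(0, 2 * np.pi, len(subjects), endpoint=False)
theta = np.concatenate((theta, [theta[0]]))

# 3. Initialize the plot with the figure size and polar projection
plt.figure(figsize=(10, 6))
axis = plt.subplot(111, polar=True)

# 4. Align the grid lines with each of the subject names
plt.xticks(theta[:-1], subjects)

# 5. Plot the graph and fill the area under it
axis.plot(theta, grades)
axis.fill(theta, grades, alpha=0.3)
plt.show()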
8.Histogram
Histogram plots are used to depict the distribution of any continuous variable. These
types of plots are very popular in statistical analysis.
Use case: A survey taken during vocational training sessions for developers had 100
participants, whose Python programming experience ranged from 0 to 20 years.
Create the dataset and plot the histogram, as sketched below.
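A minimal sketch, generating a random stand-in for the survey data (the actual responses are
not reproduced in the notes):
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical dataset: years of Python experience for 100 participants,
# drawn uniformly at random between 0 and 20
yearsOfExperience = np.random.randint(0, 21, 100)

plt.figure(figsize=(10, 5))
plt.hist(yearsOfExperience, bins=10, edgecolor='black')  # 10 equal-width bins
plt.xlabel('Years of experience with Python programming')
plt.ylabel('Frequency')
plt.title('Distribution of Python programming experience')
plt.show()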
9.Lollipop chart
A lollipop chart can be used to display ranking in the data. It is similar to an ordered bar
chart.
1. Load the dataset (the carDF dataset); the remaining steps are sketched below.
7. Write the actual mean values in the plot, and display the plot.
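Only steps 1 and 7 survive in the notes; a minimal sketch of the whole procedure, assuming a
hypothetical carDF with manufacturer and cty (city mileage) columns:
import pandas as pd
import matplotlib.pyplot as plt

# 1. Hypothetical carDF: manufacturer and city mileage (cty) columns are assumed
carDF = pd.DataFrame({
    'manufacturer': ['audi', 'audi', 'ford', 'ford', 'honda', 'honda', 'toyota', 'toyota'],
    'cty': [18, 20, 14, 16, 24, 26, 21, 23],
})

# Rank manufacturers by mean city mileage
processedDF = carDF.groupby('manufacturer')['cty'].mean().sort_values().reset_index()

fig, axis = plt.subplots(figsize=(10, 5))
axis.vlines(x=processedDF.index, ymin=0, ymax=processedDF.cty, linewidth=1)  # stems
axis.scatter(x=processedDF.index, y=processedDF.cty, s=50)                   # heads

# 7. Write the actual mean values in the plot, and display the plot
for row in processedDF.itertuples():
    axis.text(row.Index, row.cty + 0.5, str(round(row.cty, 1)), ha='center')
axis.set_xticks(processedDF.index)
axis.set_xticklabels(processedDF.manufacturer, rotation=45)
axis.set_ylabel('Average city mileage')
plt.show()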
Data transformation techniques
Data transformation is a set of techniques used to convert data from one format or structure
to another. The main reason for transforming the data is to get a better representation such
that the transformed data is compatible with other data. The following are some examples of
transformation activities:
• Data deduplication -involves the identification of duplicates and their removal.
• Key restructuring -involves transforming any keys with built-in meanings to the generic
keys.
• Data cleansing - involves identifying and removing out-of-date, inaccurate, and incomplete
information from data sources, without distorting the meaning of the data, to enhance the
accuracy of the source data.
• Data validation -is a process of formulating rules or algorithms that help in validating
different types of data against some known issues.
• Format revisioning -involves converting from one format to another.
• Data derivation -consists of creating a set of rules to generate more information from the
data source.
• Data aggregation -involves searching, extracting, summarizing, and preserving important
information in different types of reporting systems.
• Data integration- involves converting different data types and merging them into a
common structure or schema.
• Data filtering- involves identifying information relevant to any particular user.
• Data joining- involves establishing a relationship between two or more tables.
Merging database
Assume:
o There are some students who are not taking the software engineering exam.
o There are some students who are not taking the machine learning exam.
o There are students who appeared in both courses.
Analyze using the EDA technique:
• How many students appeared for the exams in total?
• How many students only appeared for the Software Engineering course?
• How many students only appeared for the Machine Learning course?
1. Concatenating along an axis
Provides the list of all the students who appeared in the Machine Learning course.
5. Using pd.merge() with an outer join
Gives the total number of students appearing for at least one course.
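A minimal sketch of both operations, assuming hypothetical dataframes keyed by StudentID (the
notes' original dataframes are not reproduced):
import pandas as pd

# Hypothetical rosters: students who appeared for each exam
softwareEngineeringDF = pd.DataFrame({'StudentID': [1, 2, 3, 4]})
machineLearningDF = pd.DataFrame({'StudentID': [3, 4, 5, 6]})

# Concatenating along an axis stacks the two rosters
allAppearancesDF = pd.concat([softwareEngineeringDF, machineLearningDF], axis=0)
print(allAppearancesDF)

# An outer join keeps every StudentID from both tables, so the row count
# is the number of students who appeared for at least one course
atLeastOneDF = pd.merge(softwareEngineeringDF, machineLearningDF,
                        on='StudentID', how='outer')
print(len(atLeastOneDF))  # 6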
Merging on index
The index acts as the key for merging dataframes - pass left_index=True or right_index=True to
indicate that the index should be used as the merge key.
1. Consider the following two dataframes:
2. Merge using an inner join (the default type of merge) - merges based on the intersection of
the keys.
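A minimal sketch with made-up frames: left carries a key column, while right is indexed by the
same key values:
import pandas as pd

left = pd.DataFrame({'key': ['apple', 'ball', 'apple', 'apple', 'ball'],
                     'value': range(5)})
right = pd.DataFrame({'rightValue': [10, 20]}, index=['apple', 'ball'])

# right_index=True tells merge to use right's index as its join key
print(pd.merge(left, right, left_on='key', right_index=True, how='inner'))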
Reshaping and pivoting
Reshaping helps to arrange the data in a dataframe in some consistent manner. This can be done
with hierarchical indexing using two actions:
➢ Stacking: Stack rotates from any particular column in the data to the rows.
➢ Unstacking: Unstack rotates from the rows into the columns.
1. Create a dataframe that records the rainfall, humidity, and wind conditions of five
different counties in Norway.
2. Use the stack() method on dframe1 to pivot the columns into rows and produce a series.
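A minimal sketch with illustrative values (the county names and numbers are made up):
import pandas as pd

# 1. Weather conditions for five Norwegian counties
dframe1 = pd.DataFrame({'rainfall': [100, 50, 80, 90, 60],
                        'humidity': [80, 70, 90, 85, 75],
                        'wind': [12, 14, 10, 9, 16]},
                       index=['Oslo', 'Viken', 'Innlandet', 'Agder', 'Rogaland'])

# 2. stack() pivots the columns into rows, producing a Series
stacked = dframe1.stack()
print(stacked.head())

# unstack() pivots the rows back into columns
print(stacked.unstack())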
Transformation techniques
• Create two series, series1 and series2, and then concatenate them.
• Add a new column and try to find duplicated items based on the second column.
Handling missing data
The two functions, notnull() and isnull(), are complements of each other.
2. Count the number of NaN values in each store.
3. Find the total number of missing values.
Dropping by columns
To drop columns, pass axis=1 to dropna(). The thresh argument specifies the minimum number of
non-NaN values a column must contain in order to be kept; columns with fewer are dropped (see
the sketch below).
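A minimal sketch of these steps, assuming a hypothetical dataframe of store sales with missing
entries:
import numpy as np
import pandas as pd

storeDF = pd.DataFrame({'Store A': [100, np.nan, 200, np.nan],
                        'Store B': [np.nan, 150, 250, 300],
                        'Store C': [50, 60, np.nan, 80]})

print(storeDF.isnull().sum())        # number of NaN values in each store
print(storeDF.isnull().sum().sum())  # total number of missing values

# Dropping by columns: keep only columns with at least 3 non-NaN values
print(storeDF.dropna(axis=1, thresh=3))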
Mathematical operations with NaN
3. Cumulative summing
Interpolation: In ser3, the first and the last values are 100 and 292, respectively, with
missing values in between. Linear interpolation computes the step as (292-100)/(5-1) = 48, so
the next value after 100 is 100 + 48 = 148.
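A sketch assuming ser3 is the five-element series implied by the arithmetic above:
import numpy as np
import pandas as pd

ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
# Linear interpolation fills the gaps in equal steps of (292-100)/(5-1) = 48
print(ser3.interpolate())  # 100, 148, 196, 244, 292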
Renaming axis indexes
By default, the rename method returns a modified copy and does not change the original
dataframe; pass inplace=True to modify the dataframe in place.
Discretization and binning
2. Convert the dataset into intervals of 118 to 125, 126 to 135, 136 to 160, and finally 160
and higher.
A parenthesis indicates that the side is open; a square bracket means that it is closed or
inclusive. (118, 125] means the left-hand side is open and the right-hand side is closed.
3. Set the right=False argument to change the form of the intervals.
6. Pass an integer for the bins - it will compute equal-length bins based on the minimum and
maximum values in the data.
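A minimal sketch, assuming hypothetical height values and capping the open-ended last interval
at 200:
import pandas as pd

# Hypothetical data; 200 is an assumed upper cap for '160 and higher'
heights = [120, 122, 125, 127, 121, 123, 137, 148, 160, 170, 185]
bins = [118, 125, 135, 160, 200]

# Default intervals are open on the left and closed on the right, e.g. (118, 125]
category = pd.cut(heights, bins)
print(pd.Series(category).value_counts())

# right=False closes the left side instead: [118, 125)
print(pd.cut(heights, bins, right=False))

# Passing an integer computes equal-length bins from the data's min and max
print(pd.cut(heights, 4))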
Outlier detection and filtering
2. Calculate the total price based on the quantity sold and the unit price, and add it as a
new column.
3. Find the transactions that exceeded 3,000,000.
4. Display all the columns and rows where TotalPrice is greater than 6741112.
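A minimal sketch of steps 2-4, assuming a hypothetical transactions dataframe with Quantity and
UnitPrice columns:
import pandas as pd

transactions = pd.DataFrame({'Quantity': [150, 2000, 900, 1200],
                             'UnitPrice': [2500, 1800, 4200, 5700]})

# 2. Calculate the total price and add it as a new column
transactions['TotalPrice'] = transactions['Quantity'] * transactions['UnitPrice']

# 3. Find the transactions that exceeded 3,000,000
print(transactions[transactions['TotalPrice'] > 3000000])

# 4. Display all the columns and rows where TotalPrice is greater than 6741112
print(transactions[transactions['TotalPrice'] > 6741112])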
Permutation and random sampling
The np.random.permutation() method takes the length of the axis to be permuted and returns an
array of integers indicating the new ordering.
2. The output array can then be passed to the take() method (or used with iloc-based indexing)
to reorder the rows.
Random sampling without replacement
To compute random sampling without replacement, follow these steps:
1. Create a permutation array.
2. Slice off the first n elements of the array, where n is the desired size of the subset.
3. Use the df.take() method to obtain the actual samples.
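A minimal sketch on a small made-up dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4))

# Permutation: an array of integers giving the new row ordering
sampler = np.random.permutation(len(df))
print(df.take(sampler))  # rows reordered according to sampler

# Random sampling without replacement: slice the first n permuted positions
n = 3
print(df.take(np.random.permutation(len(df))[:n]))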