
CCS346 - EXPLORATORY DATA ANALYSIS

UNIT I EXPLORATORY DATA ANALYSIS

EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data
– Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids
for EDA- Data transformation techniques-merging database, reshaping and pivoting,
transformation techniques.

COURSE OBJECTIVE:
To outline an overview of exploratory data analysis.

COURSE OUTCOME:
CO1: Understand the fundamentals of exploratory data analysis.
Exploratory Data Analysis Fundamentals
Data is a collection of discrete objects, numbers, words, events, facts, measurements,
observations, or descriptions of things. Data is collected and stored for every event or process
occurring in several disciplines, including biology, economics, engineering, marketing, and
others. Processing such data elicits useful information and processing such information generates
useful knowledge.
EDA is a process of examining the available dataset to discover patterns, spot anomalies, test
hypotheses, and check assumptions using statistical measures. The primary aim of EDA is to
examine what data can tell us before actually going through formal modeling or hypothesis
formulation. EDA can also help statisticians to examine and discover the data and create newer
hypotheses that could be used for the development of a newer approach in data collection and
experimentations.
Understanding data science
Data scientists:
• build a performant model
• must be able to explain the results obtained
• use the results for business intelligence
Data science involves cross-disciplinary knowledge from computer science, data, statistics, and
mathematics.
The phases of data analysis
1. Data requirements:
• identify the various sources of data for an organization
• decide the type of data required by the organization, and how it is to be collected, curated, and stored
• categorize the data as numerical or categorical, and define the format of storage and
dissemination.
Eg: An application tracking the sleeping pattern of patients suffering from dementia requires
several types of sensors' data storage, such as sleep data, heart rate from the patient, electro-
dermal activities, and user activities pattern. All of these data points are required to correctly
diagnose the mental state of the person.
2. Data collection:
Data collected from several sources must be stored in the correct format and transferred
to the right information technology personnel within a company.
3. Data processing:
Data processing involves pre-curating the dataset before the actual analysis.
Common tasks involve
• correctly exporting the dataset
• placing them under the right tables
• structuring them
• exporting them in the correct format
4. Data cleaning:
Preprocessed data may still not be ready for detailed analysis. It must be correctly
transformed for an
• incompleteness check
• duplicates check
• error check
• missing value check
These tasks are performed in the data cleaning stage, which involves responsibilities such as
• matching the correct record
• finding inaccuracies in the dataset
• understanding the overall data quality
• removing duplicate items and
• filling in the missing values
Eg: outlier detection methods for quantitative data cleaning
5. EDA:
It is the stage where the actual message contained in the data is understood. Several types
of data transformation techniques might be required during the process of exploration.
6. Modeling and algorithm:
Generalized models or mathematical formulas can represent or exhibit relationships
among different variables, such as correlation or causation. These models or equations involve
one or more variables that depend on other variables to cause an event.
Eg: When buying pens, the total price of pens (Total) = price of one pen (UnitPrice) * the
number of pens bought (Quantity).
The model would be Total = UnitPrice * Quantity. The total price depends on the unit price,
so the total price is the dependent variable and the unit price is the independent variable. In general, a
model describes the relationship between independent and dependent variables.
Inferential statistics deals with quantifying relationships between particular variables.
The Judd model for describing the relationship between data, model, and error
Data = Model + Error
Eg: inferential statistics - regression analysis
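As a minimal illustration of the Data = Model + Error idea, the following sketch (with hypothetical pen-price data) fits a simple linear regression using scipy and separates the fitted model from the residual error:

import numpy as np
from scipy import stats

# Hypothetical data: number of pens bought and the total price paid
quantity = np.array([1, 2, 3, 4, 5, 6])
total = np.array([10.2, 19.8, 30.5, 39.9, 50.1, 60.3])

# Model: Total = slope * Quantity + intercept
slope, intercept, r_value, p_value, std_err = stats.linregress(quantity, total)

predicted = slope * quantity + intercept   # the Model part
error = total - predicted                  # Data = Model + Error

print(f"Estimated unit price: {slope:.2f}")
print("Residuals (Error):", np.round(error, 2))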
7. Data Product:
It is the computer software that uses data as inputs, produces outputs, and provides
feedback based on the output to control the environment. It is based on a model developed
during data analysis
Eg: a recommendation model that inputs user purchase history and recommends a related
item that the user is highly likely to buy.
8. Communication:
This stage deals with disseminating the results to end stakeholders to use the result for
business intelligence. It involves data visualization using techniques such as tables, charts,
summary diagrams, and bar charts to show the analyzed result.

The significance of EDA

Different fields of science, economics, engineering, and marketing accumulate and store data
primarily in electronic databases. The collected data should be used to make appropriate and well-
established decisions. This requires computer programs that gather insights from the collected data,
which is achieved through the process of data mining, of which exploratory data analysis
is the key step. EDA allows the data to be visualized so that it can be understood and so that
hypotheses can be created for further analysis. EDA reveals the ground truth about the content without making any
underlying assumptions. Data scientists use EDA to understand the types of modeling and
hypotheses that can be created.
Key components of exploratory data analysis include
• summarizing data
• statistical analysis
• visualization of data
Python provides tools for exploratory analysis
• pandas for summarizing
• scipy for statistical analysis
• matplotlib, plotly for visualizations
Steps in EDA
It involves four different steps
1.Problem definition:
Before extracting useful insight from the data, it is essential to define the business
problem to be solved. The problem definition works as the driving force for a data analysis plan
execution. The main tasks involved in problem definition are
• defining the main objective of the analysis
• defining the main deliverables
• outlining the main roles and responsibilities
• obtaining the current status of the data
• defining the timetable
• performing cost/benefit analysis
Based on such a problem definition, an execution plan can be created.
2. Data preparation:
This step involves methods for preparing the dataset before actual analysis. It involves
• define the sources of data,
• define data schemas and tables,
• understand the main characteristics of the data,
• clean the dataset,
• delete non-relevant datasets,
• transform the data, and
• divide the data into required chunks for analysis.
3.Data analysis:
It deals with descriptive statistics and analysis of the data. The main tasks involve
• summarizing the data
• finding the hidden correlation and relationships among the data
• developing predictive models
• evaluating the models
• calculating the accuracies
Techniques used for data summarization
• summary tables
• graphs
• descriptive statistics
• inferential statistics
• correlation statistics
• searching
• grouping
• mathematical models
4. Development and representation of the results:
It involves presenting the dataset to the target audience. It can be in the form of
• graphs,
• summary tables
• maps
• diagrams

Making sense of data

It deals with identifying the type of data under analysis. Different disciplines store
different kinds of data for different purposes.
Eg: Medical researchers store patients' data, universities store students' and teachers' data, and
real estate companies store house and building datasets.
A dataset contains many observations about a particular object. Each observation can have a
specific value for each of the variables.
Eg: Observation:

Table Representation:

Summarization:
There are five observations (001, 002, 003, 004, 005). Each observation describes the variables
(PatientID, name, address, dob, email, gender, and weight).

Dataset falls into two groups—numerical data and categorical data


1. Numerical data or quantitative data
This data has a sense of measurement involved in it. It can be of either a discrete or a continuous type.

Eg: a person's age, height, weight, blood pressure, heart rate, temperature, number of teeth,
number of bones, and the number of family members.
➢ Discrete data
Represents data that is countable and its values can be listed out. A variable that
represents a discrete dataset is referred to as a discrete variable. The discrete variable
takes a fixed number of distinct values.
Eg: Flipping a coin, the number of heads in 200 coin flips can take values from 0 to 200
(finite) cases.
Country variable can have values such as Nepal, India, Norway, and Japan. It is fixed.
Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
➢ Continuous data
A variable that can have an infinite number of numerical values within a specific
range is classified as continuous data. A variable describing continuous data is a continuous
variable. Continuous data can follow an interval measure of scale or ratio measure of scale.
Eg: Temperature of the city today
Identify discrete and Continuous variables from the table below

2. Categorical data or qualitative data


This type of data represents the characteristics of an object;
Eg: Gender - Male, Female, Other, or Unknown
Marital status - Annulled, Divorced, Interlocutory, Legally Separated, Married,
Polygamous, Never Married, Domestic Partner, Unmarried, Widowed, or Unknown,
Type of address – temporary, permanent, present
Categories of the movies - Action, Adventure, Comedy, Crime, Drama, Fantasy,
Historical, Horror, Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction,
Social, Thriller, Urban, or Western
Blood type - A, B, AB, or O
Types of drugs - Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids,
Inhalants, or Cannabis
A variable describing categorical data is referred to as a categorical variable. These types
of variables can have one of a limited number of values. Categorical values can be thought of as
enumerated types or enumerations of variables. Most categorical datasets follow either
nominal or ordinal measurement scales.
There are different types of categorical variables:
➢ Binary categorical variable - can take exactly two values and is also referred to as a
dichotomous variable.
Eg: Result of an experiment - success or failure.
➢ Polytomous variables are categorical variables that can take more than two possible values.
Eg: Marital status can have several values, such as annulled, divorced, interlocutory,
legally separated, married, polygamous, never married, domestic partner, unmarried, widowed,
and unknown.
Measurement scales
There are four different types of measurement scales described in statistics: nominal,
ordinal, interval, and ratio. These scales are used mostly in academia and survey-based research.
1.Nominal
These are practiced for labeling variables without any quantitative value. The scales are
generally referred to as labels. And these scales are mutually exclusive and do not carry any
numerical importance.
• Gender - Male, Female, Third gender/Non-binary, I prefer not to answer, Other
• The languages spoken in a particular country
• Biological species
• Parts of speech in grammar (noun, pronoun, adjective, and so on)
• Taxonomic ranks in biology (Archaea, Bacteria, and Eukarya)
No form of arithmetic calculation can be made on nominal measures.
2.Ordinal
The main difference between the ordinal and nominal scales is the order. In ordinal scales, the
order of the values is a significant factor.
Eg: WordPress is making content managers' lives easier. How do you feel about this statement?
The following diagram shows the Likert scale

3.Interval
In interval scales, both the order and the exact differences between the values are significant.
Interval scales are widely used in statistics, for example in measures of central tendency such as the mean,
median, and mode, and in the standard deviation.
Eg: Location in Cartesian coordinates and direction measured in degrees from magnetic north.
The mean, median, and mode are allowed on interval data.
4.Ratio
Ratio scales contain order, exact values, and absolute zero, which makes it possible to be
used in descriptive and inferential statistics. These scales provide numerous possibilities for
statistical analysis. Mathematical operations, the measure of central tendencies, and the measure
of dispersion and coefficient of variation can also be computed from such scales.
Eg: measure of energy, mass, length, duration, electrical energy, plan angle, and volume.
The following table gives a summary of the data types and scale measures:

Comparing EDA with classical and Bayesian analysis

Approaches to data analysis.


1.Classical data analysis: - The problem definition and data collection step are followed
by model development, which is followed by analysis and result communication.
2.Exploratory data analysis approach:- Follows the same approach as classical data
analysis except the model imposition and the data analysis steps are swapped. The main focus is
on the data, its structure, outliers, models, and visualizations. Generally, in EDA, we do not
impose any deterministic or probabilistic models on the data.
3.Bayesian data analysis approach:- It incorporates prior probability distribution
knowledge into the analysis steps.
Different approaches for data analysis illustrating the difference in their execution steps:
Software tools available for EDA

There are several software tools that are available to facilitate EDA.
1. Python: This is an open source programming language widely used in data analysis, data
mining, and data science.
2. R programming language: R is an open source programming language that is widely
utilized in statistical computation and graphical data analysis
3. Weka: This is an open source data mining package that involves several EDA tools and
algorithms
4. KNIME: This is an open source tool for data analysis and is based on Eclipse

Visual Aids for EDA

Two important goals of data scientists are to
• extract knowledge from the data, and
• present the data to stakeholders.
Visual aids are very useful tools for the second goal.
1. Line chart
A line chart is used to illustrate the relationship between two or more continuous variables.
Eg: Plot time series lines using the matplotlib library and stock price data.
Generate the dataset using the faker and radar Python libraries, with the columns Date and Price indicating the
stock price on a given date, or load a CSV file using the pandas read_csv function.
The generateData function:
import datetime
import random

import pandas as pd
import radar
from faker import Faker

fake = Faker()

def generateData(n):
    listdata = []
    start = datetime.datetime(2019, 8, 1)
    end = datetime.datetime(2019, 8, 30)
    for _ in range(n):
        # Pick a random date between start and end, and a random price
        date = radar.random_datetime(start=start, stop=end).strftime("%Y-%m-%d")
        price = round(random.uniform(900, 1000), 4)
        listdata.append([date, price])
    df = pd.DataFrame(listdata, columns=['Date', 'Price'])
    df['Date'] = pd.to_datetime(df['Date'], format='%Y-%m-%d')
    df = df.groupby(by='Date').mean()
    return df

df = generateData(50)
df.head(10)
Steps involved
1.Load and prepare the dataset
2.Import the matplotlib library: import matplotlib.pyplot as plt
3.Plot the graph: plt.plot(df)
4.Display it on the screen: plt.show()

2. Bar charts
It is one of the most common types of visualization. Bars can be drawn horizontally or vertically to
represent categorical variables. Bar charts are frequently used to compare objects across
distinct categories and to track variations over time.
Eg: Assume a pharmacy in Norway keeps track of the amount of Zoloft sold every month. Use
the calendar Python library to keep track of the months of the year (1 to 12) corresponding to
January to December:
1. Import the required libraries
import random
import numpy as np
import calendar
import matplotlib.pyplot as plt
2. Set up the data
months = list(range(1, 13))
sold_quantity = [round(random.uniform(100, 200)) for x in range(1, 13)]
3. Specify the layout of the figure and allocate space
figure, axis = plt.subplots()
4. Display the names of the months:
plt.xticks(months, calendar.month_name[1:13], rotation=20)
5. Plot the graph:
plot = axis.bar(months, sold_quantity)
6. Display the data value on the head of the bar- optional- visually gives more meaning
for rectangle in plot:
    height = rectangle.get_height()
    axis.text(rectangle.get_x() + rectangle.get_width() / 2., 1.002 * height, '%d' % int(height), ha='center', va='bottom')
7. Display the graph on the screen: plt.show()
Horizontal bar chart - the code remains the same, with a few changes (a minimal sketch follows below):
• plt.xticks() changed to plt.yticks() and
• plt.bar() changed to plt.barh()
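A minimal sketch of the horizontal variant, reusing the months and sold_quantity data from above (the bar annotation step is omitted for brevity):

figure, axis = plt.subplots()
plt.yticks(months, calendar.month_name[1:13], rotation=20)
axis.barh(months, sold_quantity)
plt.show()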

3.Scatter plot
Scatter plots are also called scatter graphs, scatter charts, scattergrams, and scatter
diagrams. They use a Cartesian coordinates system to display values of typically two variables
for a set of data. Scatter plots can be constructed in the following two situations:
• When one continuous variable is dependent on another variable, which is under the
control of the observer
• When both continuous variables are independent
Scatter plots are used when we need to show the relationship between two variables, and are also
referred to as correlation plots.
Eg: Number of hours of sleep required by a person depends on the age of the person.
The average income for adults is based on the number of years of education
The dataset:
https://raw.githubusercontent.com/PacktPublishing/hands-on-exploratory-data-analysis-with-python/master/Chapter%202/sleep_vs_age.csv
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set()

# Load the sleep vs age dataset (downloaded from the link above)
sleepDf = pd.read_csv("sleep_vs_age.csv")

# A regular scatter plot
plt.scatter(x=sleepDf["age"]/12., y=sleepDf["min_recommended"])
plt.scatter(x=sleepDf["age"]/12., y=sleepDf["max_recommended"])
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()
Interpretation of the graph: the total number of hours of sleep required by a person is high initially
and gradually decreases as age increases. Due to the lack of a continuous line, the results are not
self-explanatory, so we fit lines to the points:
# Line plot
plt.plot(sleepDf['age']/12., sleepDf['min_recommended'], 'g--')
plt.plot(sleepDf['age']/12., sleepDf['max_recommended'], 'r--')
plt.xlabel('Age of person in Years')
plt.ylabel('Total hours of sleep required')
plt.show()
Interpretation: The two lines decline as age increases. Newborns between 0 and 3 months
require 14-17 hours of sleep every day, while adults and the elderly require 7-9 hours of sleep
every day.
Generate scatter plot for Iris dataset
Bubble plot
A bubble plot is a manifestation of the scatter plot where each data point on the graph is shown
as a bubble. Each bubble can be illustrated with a different color, size, and appearance.
# Load the Iris dataset
df = sns.load_dataset('iris')
df['species'] = df['species'].map({'setosa': 0, 'versicolor': 1, 'virginica': 2})
# Create bubble plot: bubble size scales with petal area
plt.scatter(df.petal_length, df.petal_width, s=50*df.petal_length*df.petal_width, c=df.species, alpha=0.3)
# Create labels for the axes
plt.xlabel('Petal Length')
plt.ylabel('Petal Width')
plt.show()

A scatter plot can also be generated using the seaborn library which makes the graph visually
better.
4.Area plot and stacked plot
The stacked plot represents the area under a line plot and several such plots can be
stacked on top of one another, giving the feeling of a stack. It can be useful when we want to
visualize the cumulative effect of multiple variables being plotted on the y axis. An area plot can be
thought of as a line plot that shows the area covered by filling it with a color.
Define dataset:
# House loan Mortgage cost per month for a year
houseLoanMortgage = [9000, 9000, 8000, 9000, 8000, 9000, 9000, 9000, 9000, 8000, 9000, 9000]
# Utilities Bills for a year
utilitiesBills = [4218, 4218, 4218, 4218, 4218, 4218, 4219, 2218, 3218, 4233, 3000, 3000]
# Transportation bill for a year
transportation = [782, 900, 732, 892, 334, 222, 300, 800, 900, 582, 596, 222]
# Car mortgage cost for one year
carMortgage = [700, 701, 702, 703, 704, 705, 706, 707, 708, 709, 710, 711]
Import the required libraries and plot stacked charts:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
months= [x for x in range(1,13)]
# Create placeholders for plot and add required color
plt.plot([],[], color='sandybrown', label='houseLoanMortgage')
plt.plot([],[], color='tan', label='utilitiesBills')
plt.plot([],[], color='bisque', label='transportation')
plt.plot([],[], color='darkcyan', label='carMortgage')
# Add stacks to the plot
plt.stackplot(months, houseLoanMortgage, utilitiesBills, transportation, carMortgage,
              colors=['sandybrown', 'tan', 'bisque', 'darkcyan'])
plt.legend()
# Add Labels
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')
# Display on the screen
plt.show()

Interpretation: The house mortgage loan is the largest expense, since the area under its curve
is the largest. The utility bills stack covers the second-largest area, and so on.
The graph clearly disseminates meaningful information to the targeted audience. Labels, legends,
and colors are important aspects of creating a meaningful visualization.
5.Pie chart
Many visualization experts consider the pie chart less effective than other charts. Its purpose is to
communicate proportions.
Dataset: Pokemon dataset to draw a pie chart.
import pandas as pd

# Create URL to the CSV file (alternatively this can be a filepath)
url = 'https://raw.githubusercontent.com/hmcuesta/PDA_Book/master/Chapter3/pokemonByType.csv'

# Load the CSV file into a data frame, indexed by Pokemon type
pokemon = pd.read_csv(url, index_col='type')
pokemon
Plot the pie chart:
import matplotlib.pyplot as plt
plt.pie(pokemon['amount'], labels=pokemon.index, shadow=False, startangle=90,
autopct='%1.1f%%',)
plt.axis('equal')
plt.show()
Pandas code : pokemon.plot.pie(y="amount", figsize=(20, 10))
6.Table chart
A table chart combines a bar chart and a table.
Dataset: Consider standard LED bulbs that come in different wattages. The standard Philips
LED bulb can be 4.5 Watts, 6 Watts, 7 Watts, 8.5 Watts, 9.5 Watts, 13.5 Watts, and 15 Watts.
Let's assume there are two categorical variables, the year and the wattage, and a numeric
variable, which is the number of units sold in a particular year.

Draw a table chart


Add the table to the bottom of the chart
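The original listing is not reproduced here. The following is a minimal sketch, assuming hypothetical unit-sales figures for three years; it draws grouped bars and attaches the data table to the bottom of the chart with plt.table():

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical units sold per wattage for three years
years = ["2017", "2018", "2019"]
wattages = ["4.5W", "6W", "7W", "8.5W", "9.5W", "13.5W", "15W"]
unitsSold = np.array([
    [65, 141, 88, 111, 104, 71, 99],
    [85, 142, 89, 112, 103, 73, 90],
    [75, 143, 90, 113, 89, 75, 93],
])

index = np.arange(len(wattages))
bar_width = 0.25
plt.figure(figsize=(10, 6))

# Draw one group of bars per year
for i, year in enumerate(years):
    plt.bar(index + i * bar_width, unitsSold[i], bar_width, label=year)

# Attach the data table to the bottom of the chart
plt.table(cellText=unitsSold, rowLabels=years, colLabels=wattages, loc='bottom')
plt.subplots_adjust(bottom=0.3)   # make room for the table
plt.xticks([])                    # the table supplies the column labels
plt.ylabel('Units sold')
plt.legend()
plt.show()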

7.Polar chart or spider web plot


Polar chart is a diagram that is plotted on a polar axis. Its coordinates are angle and
radius.
Create the dataset:
1. Assume five courses in the academic year:

2. Planned grades in each subject:

3. Grades obtained:

Add an extra entry to make the plot circular: connect the first and the last points together to form a circular flow.
1. Import the required libraries:

2. Prepare the dataset and set up theta:

3. Initialize the plot with the figure size and polar projection:
4. Get the grid lines to align with each of the subject names:

5. Use the plt.plot method to plot the graph and fill the area under it:

6. Now, we plot the actual grades obtained:

7. We add a legend and a nice comprehensible title to the plot:

8. Finally, we show the plot on the screen:
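A minimal end-to-end sketch of steps 1-8, with hypothetical course names and grades:

import numpy as np
import matplotlib.pyplot as plt

# 1-2. Five hypothetical courses with planned and obtained grades
subjects = ["C programming", "Numerical methods", "Operating systems", "DBMS", "Computer Networks"]
plannedGrade = [90, 95, 92, 68, 68]
actualGrade = [75, 89, 89, 80, 80]

# Extra entry: repeat the first value so the polygon closes into a circular flow
plannedGrade += plannedGrade[:1]
actualGrade += actualGrade[:1]
theta = np.linspace(0, 2 * np.pi, len(plannedGrade))

# 3. Initialize the plot with the figure size and polar projection
plt.figure(figsize=(8, 8))
ax = plt.subplot(polar=True)

# 4. Align the grid lines with each of the subject names
plt.xticks(theta[:-1], subjects)

# 5. Plot the planned grades and fill the area under the line
ax.plot(theta, plannedGrade, label='Planned Grades')
ax.fill(theta, plannedGrade, alpha=0.25)

# 6. Plot the actual grades obtained
ax.plot(theta, actualGrade, label='Actual Grades')
ax.fill(theta, actualGrade, alpha=0.25)

# 7. Add a legend and a comprehensible title
plt.legend(loc='upper right')
plt.title("Planned vs actual grades by subject")

# 8. Show the plot on the screen
plt.show()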

The generated polar chart

8.Histogram
Histogram plots are used to depict the distribution of any continuous variable. These
types of plots are very popular in statistical analysis.
Use case: A survey conducted in vocational training sessions for developers had 100 participants.
Their Python programming experience ranged from 0 to 20 years.
Create the dataset

Plotting the histogram chart:


1. Plot the distribution of group experience
2. Add labels to the axes and a title

3. Draw a green vertical line in the graph at the average experience

4. Display the plot:
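A minimal sketch of the steps above, assuming the 100 experience values are generated randomly between 0 and 20 years:

import numpy as np
import matplotlib.pyplot as plt

# Create the dataset: years of Python experience for 100 participants (hypothetical)
np.random.seed(42)
yearsOfExperience = np.random.randint(0, 21, 100)

plt.figure(figsize=(10, 5))

# 1. Plot the distribution of group experience
plt.hist(yearsOfExperience, bins=20)

# 2. Add labels to the axes and a title
plt.xlabel("Years of experience with Python programming")
plt.ylabel("Frequency")
plt.title("Distribution of Python programming experience of the participants")

# 3. Draw a green vertical line in the graph at the average experience
plt.axvline(x=yearsOfExperience.mean(), linewidth=3, color='g')

# 4. Display the plot
plt.show()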

Generated histogram

9.Lollipop chart
A lollipop chart can be used to display ranking in the data. It is similar to an ordered bar
chart.
1.Load the dataset- carDF dataset:

2.Group the dataset by manufacturer

3.Sort the values by cty and reset the index

4.Plot the graph


5.Annotate the title

6.Annotate labels, xticks, and ylims:

7.Write the actual mean values in the plot, and display the plot:
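A minimal sketch of the steps above, assuming carDF is a car dataset (such as a local copy of the mpg dataset) with manufacturer and cty (city mileage) columns:

import pandas as pd
import matplotlib.pyplot as plt

# 1. Load the car dataset (path to a local copy of the mpg dataset is assumed)
carDF = pd.read_csv("mpg_ggplot2.csv")

# 2. Group the dataset by manufacturer and take the mean city mileage
df = carDF[['cty', 'manufacturer']].groupby('manufacturer').mean()

# 3. Sort the values by cty and reset the index
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)

# 4. Plot the graph: a vertical stem plus a marker for each manufacturer
fig, ax = plt.subplots(figsize=(12, 6))
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)

# 5. Annotate the title
ax.set_title('Lollipop chart for city mileage by manufacturer')

# 6. Annotate labels, xticks, and ylims
ax.set_ylabel('Miles per gallon')
ax.set_xticks(df.index)
ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60)
ax.set_ylim(0, 30)

# 7. Write the actual mean values in the plot, and display the plot
for row in df.itertuples():
    ax.text(row.Index, row.cty + 0.5, s=str(round(row.cty, 2)), horizontalalignment='center')
plt.show()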

Generated lollipop chart


Choosing the best chart
Data Transformation

Data transformation is a set of techniques used to convert data from one format or structure
to another format or structure. The main reason for transforming the data is to get a better
representation such that the transformed data is compatible with other data. The following are
some examples of transformation activities:
• Data deduplication -involves the identification of duplicates and their removal.
• Key restructuring -involves transforming any keys with built-in meanings to the generic
keys.
• Data cleansing -involves deleting out-of-date, inaccurate, and
incomplete information from the data source without losing its meaning, in order to enhance the
accuracy of the source data.
• Data validation -is a process of formulating rules or algorithms that help in validating
different types of data against some known issues.
• Format revisioning -involves converting from one format to another.
• Data derivation -consists of creating a set of rules to generate more information from the
data source.
• Data aggregation -involves searching, extracting, summarizing, and preserving important
information in different types of reporting systems.
• Data integration- involves converting different data types and merging them into a
common structure or schema.
• Data filtering- involves identifying information relevant to any particular user.
• Data joining- involves establishing a relationship between two or more tables.

Merging database-style dataframes

➢ use append, concat, merge, or join


The scores of students in two courses Software Engineering and Machine Learning:
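The score tables themselves are not reproduced here; as a minimal sketch, assume two hypothetical dataframes with the same columns:

import pandas as pd

# Hypothetical scores in the Software Engineering course
dataFrame1 = pd.DataFrame({'StudentID': [1, 2, 3, 4, 5],
                           'Score': [22, 66, 31, 51, 71]})

# Hypothetical scores in the Machine Learning course
dataFrame2 = pd.DataFrame({'StudentID': [6, 7, 8, 9, 10],
                           'Score': [98, 93, 44, 77, 69]})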

Concatenate both data frames:


dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)
dataframe
pd.concat([dataFrame1, dataFrame2], axis=1)
Consider two dataframes for each subject:
• Two for the Software Engineering course
• Another two for the Introduction to Machine Learning course

Assume:
o There are some students who are not taking the software engineering exam.
o There are some students who are not taking the machine learning exam.
o There are students who appeared in both courses.
Analyze using the EDA technique:
• How many students appeared for the exams in total?
• How many students only appeared for the Software Engineering course?
• How many students only appeared for the Machine Learning course?
1. Concatenating along with an axis

2. Using df.merge with an inner join

An inner join includes only the items that exist in both dataframes.


Provides a list of students who appeared in both the courses - 21 students took both the
courses
3. Using the pd.merge() method with a left join

26 students only appeared for the Software Engineering course.


4. Using the pd.merge() method with a right join

Provides the list of all the students who appeared in the Machine Learning course
5. Using the pd.merge() method with an outer join
Provides the total number of students appearing for at least one course
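A minimal sketch of the four join types, using hypothetical score dataframes (the student counts quoted above refer to the original, larger dataset):

import pandas as pd

# Hypothetical dataframes: some students appear in only one of the two courses
dfSE = pd.DataFrame({'StudentID': [1, 2, 3, 4], 'ScoreSE': [22, 66, 31, 51]})
dfML = pd.DataFrame({'StudentID': [3, 4, 5, 6], 'ScoreML': [98, 93, 44, 77]})

# Inner join: students who appeared in both courses
dfSE.merge(dfML, how='inner', on='StudentID')

# Left join: every student who appeared in the Software Engineering course
pd.merge(dfSE, dfML, how='left', on='StudentID')

# Right join: every student who appeared in the Machine Learning course
pd.merge(dfSE, dfML, how='right', on='StudentID')

# Outer join: students who appeared in at least one course
pd.merge(dfSE, dfML, how='outer', on='StudentID')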
Merging on index
Index acts as the keys for merging dataframes - pass left_index=True or right_index=True to
indicate that the index should be accepted as the merge key.
1. Consider the following two dataframes:

2. Merge using an inner join - default type of merge - merge based on intersection of the
keys

3. Merge using an outer join


Reshaping and pivoting

Helps to arrange data in a dataframe in some consistent manner. This can be done with
hierarchical indexing using two actions:
➢ Stacking: Stack rotates from any particular column in the data to the rows.
➢ Unstacking: Unstack rotates from the rows into the column.
1. Create a dataframe that records the rainfall, humidity, and wind conditions of
five different counties in Norway

2. Use the stack() method on dframe1 to pivot the columns into rows and produce a series

3. Unstack the series into a dataframe using the unstack() method

4. Create two series, series1 and series2, and then concatenate them

5. Unstack the dataframe
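A minimal sketch of the five steps above, with hypothetical weather values for five Norwegian counties:

import numpy as np
import pandas as pd

# 1. Dataframe recording rainfall, humidity, and wind for five counties
dframe1 = pd.DataFrame(np.arange(15).reshape((5, 3)),
                       index=['Oslo', 'Viken', 'Rogaland', 'Troms', 'Innlandet'],
                       columns=['Rainfall', 'Humidity', 'Wind'])

# 2. stack() pivots the columns into the rows, producing a Series
stacked = dframe1.stack()

# 3. unstack() rotates the rows back into columns, recovering the dataframe
stacked.unstack()

# 4. Create two series and concatenate them with keys
series1 = pd.Series([100, 200, 300], index=['one', 'two', 'three'])
series2 = pd.Series([400, 500, 600], index=['four', 'five', 'six'])
combined = pd.concat([series1, series2], keys=['Number1', 'Number2'])

# 5. Unstack the combined series into a dataframe (missing entries become NaN)
combined.unstack()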


Transformation techniques

Includes data transformations like cleaning, filtering and deduplication


I.Performing data deduplication - Removing duplicate rows to enhance the quality of the
dataset
1. Consider a simple dataframe

2. Returns a Boolean series stating the rows that are duplicates

3. Drop the duplicates

4. Add a new column and try to find duplicated items based on the second column
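A minimal sketch of the four deduplication steps above, with a hypothetical dataframe:

import pandas as pd

# 1. A simple dataframe containing duplicate rows
frame3 = pd.DataFrame({'column 1': ['Looping'] * 3 + ['Functions'] * 4,
                       'column 2': [10, 10, 22, 23, 23, 24, 24]})

# 2. Boolean series marking the rows that duplicate an earlier row
frame3.duplicated()

# 3. Drop the duplicate rows
frame3.drop_duplicates()

# 4. Add a new column and find duplicated items based on the second column only
frame3['column 3'] = range(7)
frame3.drop_duplicates(['column 2'])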

II.Replacing values - find and replace values inside a dataframe


1. Replace one value with the other value

2. Replace multiple values at once
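A minimal sketch of replacing values, assuming -786 is a sentinel for a bad reading in a hypothetical dataframe:

import numpy as np
import pandas as pd

replaceFrame = pd.DataFrame({'column 1': [200., 3000., -786., 3000., 234., 444., -786., 332., 3332.],
                             'column 2': range(9)})

# 1. Replace one value with another (treat -786 as missing)
replaceFrame.replace(to_replace=-786, value=np.nan)

# 2. Replace multiple values at once
replaceFrame.replace(to_replace=[-786, 0], value=[np.nan, 2])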


III.Handling missing data
NaN - indicates that there is no value specified for the particular index. Reasons for NaN:
• When data is retrieved from an external source and there are some incomplete values
in the dataset
• When joining two different datasets and some values do not match.
• Missing values due to data collection errors
• When the shape of the data changes and there are new additional rows or columns that
are not determined.
• Reindexing of data can result in incomplete data
1. Create a dataframe

The dataframe shows the sales of different fruits from different stores.


None of the stores are reporting missing values
2. Add some missing values to the dataframe

NaN values in pandas objects


1. Check if null

The two functions, notnull() and isnull(), are complements of each other
2. Count the number of NaN values in each store
3. Find the total number of missing values

4. Count the number of reported values
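A minimal sketch of the steps above, with hypothetical fruit-sales data (the values and counts differ from the original tables, so the specific numbers quoted in the surrounding text refer to the original dataset):

import numpy as np
import pandas as pd

# 1. Fruit sales per store; np.nan marks a missing report
dfx = pd.DataFrame({
    'store1': [15, 16, 17, 18, np.nan],
    'store2': [20, 21, np.nan, 23, 24],
    'store3': [26, np.nan, 28, 29, 30],
    'store4': [np.nan, np.nan, 32, np.nan, 33],
}, index=['apple', 'banana', 'kiwi', 'grapes', 'mango'])

# Check which entries are null / not null (the two are complements)
dfx.isnull()
dfx.notnull()

# 2. Count the number of NaN values in each store
dfx.isnull().sum()

# 3. Find the total number of missing values
dfx.isnull().sum().sum()

# 4. Count the number of reported (non-missing) values
dfx.count()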

Dropping missing values


One of the ways to handle missing values is to remove them from our dataset.

determine null values

store4 only reported two items of data.


1. Remove the rows with missing values

Returns a copy of the dataframe by dropping the rows with NaN.


The original dataframe is not changed.
2. Apply to the entire dataframe

The output is an empty dataframe, because every row of our dataframe contains at least one NaN value.
Dropping by rows
Use the how='all' argument to drop only those rows whose values are entirely NaN.

Dropping by columns

Use the thresh argument with axis=1 to keep only the columns that contain at least the given number of
non-NaN values; columns below that threshold are dropped.
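A minimal sketch of the dropping options, reusing the hypothetical dfx dataframe from above:

# Drop the missing values from a single store (returns a copy; dfx is unchanged)
dfx.store4.dropna()

# Applied to the whole dataframe: every row here contains at least one NaN, so the result is empty
dfx.dropna()

# Dropping by rows: drop only the rows whose values are entirely NaN
dfx.dropna(how='all')

# Dropping by columns: keep only the columns with at least three non-NaN values
dfx.dropna(thresh=3, axis=1)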
Mathematical operations with NaN

1.Compute the total quantity of fruits sold by store4


store4 has five NaN values. However, during the summing process, these values are treated as 0
and the result is 38.0.
2.Compute averages

3.Cumulative summing
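A minimal sketch of these operations on the hypothetical dfx dataframe from above:

# 1. Total quantity sold by store4: NaN values are treated as 0 during the sum
dfx.store4.sum()

# 2. Averages skip NaN values instead of counting them as 0
dfx.store4.mean()
dfx.mean(axis=1)        # mean per fruit across the stores

# 3. Cumulative summing, leaving NaN where values are missing
dfx.store4.cumsum()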

Filling missing values


Replace NaN values with a particular value using the fillna() method.
Backward and forward filling
store4 can be filled using the forward-filling (ffill) or backward-filling (bfill) technique.

Interpolating missing values


The interpolate() method performs a linear interpolation of the missing values.

In ser3, the first and the last values are 100 and 292 respectively. It calculates the next value as
(292-100)/(5-1) = 48. So, the next value after 100 is 100 + 48 = 148.
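A minimal sketch of filling and interpolating, reusing the hypothetical dfx dataframe and the ser3 series described above:

import numpy as np
import pandas as pd

# Replace every NaN with a fixed value
dfx.fillna(0)

# Forward filling: propagate the last reported value downwards
dfx.store4.fillna(method='ffill')

# Backward filling: propagate the next reported value upwards
dfx.store4.fillna(method='bfill')

# Linear interpolation of missing values
ser3 = pd.Series([100, np.nan, np.nan, np.nan, 292])
ser3.interpolate()      # fills 148, 196, 244 between 100 and 292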
Renaming axis indexes
By default, the rename method returns a renamed copy and leaves the original dataframe unchanged; pass inplace=True to modify the dataframe in place.
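A minimal sketch of renaming axis indexes, reusing the hypothetical dfx dataframe:

# Returns a renamed copy; dfx itself is unchanged
dfx.rename(index=str.title, columns=str.upper)

# Rename specific labels with a mapping and modify dfx in place
dfx.rename(index={'apple': 'green apple'}, columns={'store4': 'store_four'}, inplace=True)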

Discretization and binning


When working with continuous datasets, we need to convert them into discrete or interval forms.
Each interval is referred to as a bin.
1. Data on the heights of a group of students

2. Convert that dataset into intervals of 118 to 125, 126 to 135, 136 to 160, and finally 160
and higher
A parenthesis indicates that the side is open. A square bracket means that it is closed or
inclusive. (118, 125] means the left-hand side is open and the right-hand side is closed
3. Set a right=False argument to change the form of interval

4. Check the number of values in each bin

5. Indicate the bin names by passing a list of labels

6. Pass an integer for the bins - it will compute equal-length bins based on the
minimum and maximum values in the data

7. Form the bins based on sample quantiles- qcut method

8. Count the number of values in each category - get equal-sized bins


9. Pass our own bins
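A minimal sketch of the binning steps above, with hypothetical student heights:

import numpy as np
import pandas as pd

# 1. Heights of a group of students (hypothetical values, in cm)
height = [120, 122, 125, 127, 121, 123, 137, 131, 161, 145, 141, 132]

# 2. Convert the data into the intervals (118, 125], (125, 135], (135, 160], (160, 200]
bins = [118, 125, 135, 160, 200]
category = pd.cut(height, bins)

# 3. right=False makes the intervals closed on the left instead: [118, 125), ...
pd.cut(height, bins, right=False)

# 4. Check the number of values in each bin
pd.value_counts(category)

# 5. Indicate the bin names by passing a list of labels
bin_names = ['Short Height', 'Average Height', 'Good Height', 'Taller']
pd.cut(height, bins, labels=bin_names)

# 6. Pass an integer for the bins: equal-length bins from the data's min and max
pd.cut(height, 4, precision=1)

# 7-8. qcut forms the bins from sample quantiles, giving equal-sized bins
randomNumbers = np.random.rand(2000)
quantileCategory = pd.qcut(randomNumbers, 4)
pd.value_counts(quantileCategory)

# 9. Pass our own quantile edges to qcut
pd.qcut(randomNumbers, [0, 0.3, 0.5, 0.7, 1.0])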

Outlier detection and filtering


Outliers are data points that diverge markedly from other observations. The
main reason for detecting and filtering outliers is that their presence can
cause serious issues in statistical analysis.
1.Load the dataset that is available from the GitHub link as follows:

2. Calculate the total price based on the quantity sold and the unit price - add a new
column
3.Find the transaction that exceeded 3,000,000

4.Display all the columns and rows if TotalPrice is greater than 6741112
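The dataset link and listings are not shown above; a minimal sketch, assuming a hypothetical sales dataframe with Quantity and UnitPrice columns:

import pandas as pd

# 1. Hypothetical transaction data standing in for the linked dataset
salesDf = pd.DataFrame({'Quantity': [5, 12000, 3, 800000, 7],
                        'UnitPrice': [2.5, 300.0, 10.0, 9.0, 1.2]})

# 2. Add a new column: total price = quantity sold * unit price
salesDf['TotalPrice'] = salesDf['Quantity'] * salesDf['UnitPrice']

# 3. Find the transactions whose total price exceeded 3,000,000
salesDf[salesDf['TotalPrice'] > 3000000]

# 4. Display all the columns and rows where TotalPrice is greater than 6,741,112
salesDf[salesDf['TotalPrice'] > 6741112]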

Permutation and random sampling


1. To select or permute a series of rows in a dataframe, we can use the numpy.random.permutation() function.
The np.random.permutation() method takes the length of the axis to be permuted and returns an array of integers indicating the new ordering.
2. The output array can then be used with positional indexing or the take() function to reorder the rows.
Random sampling without replacement
To compute random sampling without replacement, follow these steps:
1. Create a permutation array
2. Slice off the first n elements of the array where n is the desired size of
the subset
1.Use the df.take() method to obtain actual samples

2.Generate a random sample with replacement - numpy.random.randint() method

3.And now, we can draw the required samples:
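A minimal sketch of permutation and sampling on a hypothetical dataframe:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['a', 'b', 'c'])

# Permute the rows: an array of integers indicating the new ordering
sampler = np.random.permutation(4)
df.take(sampler)

# Random sampling without replacement: slice the first n permuted positions
n = 2
df.take(np.random.permutation(len(df))[:n])

# Random sampling with replacement: draw row positions with randint
sampler_with_replacement = np.random.randint(0, len(df), size=5)
df.take(sampler_with_replacement)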

Computing indicators/dummy variables


Often, we need to convert a categorical variable into a dummy matrix. For statistical modeling
or machine learning model development, it is essential to create dummy variables.
1. Create dataframe with data on gender and votes

2. Create dummy variables using the pd.get_dummies() function

3. Add a prefix to the columns
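A minimal sketch of the steps above, with a hypothetical gender/votes dataframe:

import pandas as pd

# 1. Dataframe with data on gender and votes
df = pd.DataFrame({'gender': ['female', 'female', 'male', 'unknown', 'male', 'female'],
                   'votes': range(6, 12)})

# 2. Create dummy variables using pd.get_dummies()
pd.get_dummies(df['gender'])

# 3. Add a prefix to the dummy columns and join them back to the votes
dummies = pd.get_dummies(df['gender'], prefix='gender')
df_with_dummy = df[['votes']].join(dummies)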
