EDA Lecture notes
EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data –
Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids for
EDA- Data transformation techniques-merging database, reshaping and pivoting,
Transformation techniques.
Data science
Data science is an interdisciplinary field that combines scientific methods, processes,
algorithms, and systems to extract insights and knowledge from structured and
unstructured data. It involves applying techniques from various fields such as statistics,
mathematics, computer science, and domain expertise to analyze and interpret data in
order to solve complex problems, make predictions, and drive decision-making.
Data Collection: Data collection is the process of gathering or acquiring data from
various sources. This may include surveys, experiments, observations, web scraping,
sensors, or accessing existing datasets. The data collected should align with the defined
data requirements.
Data Processing: Data processing involves transforming raw data into a more usable and
structured format. This step may include data integration, data aggregation, data
transformation, and data normalization to prepare the data for further analysis.
Data Cleaning: Data cleaning, also known as data cleansing or data scrubbing, is the
process of identifying and rectifying errors, inconsistencies, missing values, and outliers
in the dataset. It aims to improve data quality and ensure that the data is accurate and
reliable for analysis.
Exploratory Data Analysis (EDA) is of great significance in data science and analysis.
Here are some key reasons why EDA is crucial:
Understanding the Data: EDA helps in gaining a deep understanding of the dataset at
hand. It allows data scientists to become familiar with the structure, contents, and
characteristics of the data, including variables, their distributions, and relationships.
This understanding is essential for making informed decisions throughout the data
analysis process.
Data Quality Assessment: EDA helps identify and assess the quality of the data. It allows
for the detection of missing values, outliers, inconsistencies, or errors in the dataset. By
addressing data quality issues, EDA helps ensure that subsequent analyses and models
are built on reliable and accurate data.
Feature Selection and Engineering: EDA aids in selecting relevant features or variables
for analysis. By examining relationships and correlations between variables, EDA can
guide the identification of important predictors or features that significantly contribute
to the desired outcome. EDA can also inspire the creation of new derived features or
transformations that improve model performance.
Uncovering Patterns and Insights: EDA enables the discovery of patterns, trends, and
relationships within the data. By using visualization techniques and summary statistics,
EDA helps uncover valuable insights and potential associations between variables.
These insights can drive further analysis, hypothesis generation, or the formulation of
research questions.
Hypothesis Generation and Testing: EDA plays a crucial role in generating hypotheses for
further investigation. By exploring the data, researchers can identify potential
relationships or patterns and formulate hypotheses to test formally. EDA can also
provide evidence or insights to support or refute existing hypotheses.
Steps in EDA
Understand the Data: Start by getting familiar with the dataset, its structure, and the
variables it contains. Understand the data types (e.g., numerical, categorical) and the
meaning of each variable.
Data Cleaning: Clean the dataset by handling missing values, outliers, and
inconsistencies. Identify and handle missing data appropriately (e.g., imputation,
deletion) based on the context and data quality requirements. Treat outliers and
inconsistent values by either correcting or removing them if necessary.
Handle Data Transformations: Explore and apply necessary data transformations such
as scaling, normalization, or logarithmic transformations to make the data suitable for
analysis. This step may be required to meet assumptions of certain statistical methods
or to improve the interpretability of the data.
Summary Statistics: Compute and analyze summary statistics for each variable. This
includes measures such as mean, median, mode, standard deviation, range, quartiles,
and other descriptive statistics. Summary statistics provide an initial understanding of
the data distribution and basic insights.
Data Visualization: Utilize various visualization techniques to explore the data visually.
Create histograms, scatter plots, box plots, bar charts, heatmaps, or other relevant
visualizations to understand the patterns, distributions, and relationships between
variables. Visualizations can reveal insights that may not be apparent from summary
statistics alone.
Exploring Time Series Data: If the dataset involves time series data, analyze trends,
seasonality, and other temporal patterns. Use line plots, time series decomposition,
autocorrelation plots, or other relevant techniques to explore the temporal behavior of
the data.
Feature Engineering: Based on the insights gained from EDA, consider creating new
derived features or transformations that may enhance the predictive power or
interpretability of the data. This can involve mathematical operations, combinations of
variables, or domain-specific transformations.
Iterative Analysis: EDA is often an iterative process. Repeat the above steps as needed,
diving deeper into specific variables or subsets of the data based on emerging patterns
or research questions. Refine the analysis based on new insights or feedback from
stakeholders.
Remember, the steps and techniques used in EDA can be flexible and iterative, tailored
to the specific dataset and research objectives. The goal is to gain a comprehensive
understanding of the data, identify patterns, and generate hypotheses for further
analysis.
Define Data
In data science, "data" refers to the raw, unprocessed, and often vast quantities of
information that is collected or generated from various sources. It can exist in different
formats, such as structured data (organized and well-defined), semi-structured data
(partially organized), or unstructured data (lacks a predefined structure).
Data serves as the foundation for data science activities and analysis. It can include
numbers, text, images, audio, video, sensor readings, transaction records, social media
posts, and much more. Data can be generated from diverse sources, including databases,
spreadsheets, web scraping, sensors, surveys, or online platforms.
Categorization of Data
In data science, data is typically categorized into two main types:
Quantitative Data: Also known as numerical or structured data, quantitative data
represents measurable quantities or variables. It includes attributes such as age,
temperature, sales figures, or stock prices. Quantitative data is typically analyzed using
statistical techniques and mathematical models.
Examples of quantitative data:
• Scores of tests and exams e.g. 74, 67, 98, etc.
• The weight of a person.
• The temperature in a room.
Ordinal Data
Ordinal data is qualitative data whose values have some kind of relative position. These
kinds of data can be considered “in-between” qualitative and quantitative data. Ordinal
data only shows the sequence and cannot be used for statistical analysis. Compared to
nominal data, ordinal data have some kind of order that is not present in nominal data.
Continuous Data
Continuous data are in the form of fractional numbers. It can be the version of an
android phone, the height of a person, the length of an object, etc. Continuous data
represents information that can be divided into smaller levels. The continuous variable
can take any value within a range.
Examples of continuous data:
• Height of a person - 62.04762 inches, 79.948376 inches
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price
Interval Data
The interval level is a numerical level of measurement which, like the ordinal scale,
places variables in order. The interval scale has a known and equal distance between
each value on the scale (imagine the points on a thermometer). Unlike the ratio scale
(the fourth level of measurement), interval data has no true zero; in other words, a
value of zero on an interval scale does not mean the variable is absent.
A temperature of zero degrees Fahrenheit doesn’t mean there is “no temperature” to be
measured—rather, it signifies a very low or cold temperature.
Ratio Data
The fourth and final level of measurement is the ratio level. Just like the interval scale,
the ratio scale is a quantitative level of measurement with equal intervals between each
point. The difference between the interval scale and the ratio scale is that the ratio scale
has a true zero; that is, a value of zero on a ratio scale means that the variable you’re
measuring is absent.
Population is a good example of ratio data. If you have a population count of zero
people, this means there are no people!
Measurement scales
Measurement scales, also known as data scales or levels of measurement, define
the properties and characteristics of the data collected or measured. There are four
commonly recognized measurement scales:
Nominal Scale:
• The nominal scale is the lowest level of measurement. It represents data that can
be categorized into distinct and mutually exclusive groups or categories. The
categories in a nominal scale have no inherent order or ranking.
• Examples of nominal scale data include gender (male/female), eye color
(blue/green/brown), or types of cars (sedan/SUV/hatchback).
• Nominal data can be represented using labels or codes.
Ordinal Scale:
• The ordinal scale represents data with categories that have a natural order or
ranking. In addition to the properties of the nominal scale, ordinal data allows
for the relative positioning or hierarchy between the categories. However, the
intervals between the categories may not be equal.
• Examples of ordinal scale data include rating scales (e.g., 1-5 scale indicating
satisfaction levels), education levels (e.g., high school, bachelor's, master's), or
performance rankings (first, second, third place). Ordinal data can be
represented using labels, codes, or numerical rankings.
Interval Scale:
• The interval scale represents data with categories that have equal intervals
between the values. In addition to the properties of the ordinal scale, interval
data allows for meaningful comparisons of the intervals between the categories.
However, it does not have a true zero point.
• Examples of interval scale data include calendar dates, temperature measured in
Celsius or Fahrenheit, or years. Interval data allows for mathematical operations
such as addition and subtraction but does not support meaningful multiplication
or division.
Ratio Scale:
• The ratio scale is the highest level of measurement. It represents data with
categories that have equal intervals between the values and possess a true zero
point. In addition to the properties of the interval scale, ratio data allows for all
mathematical operations and meaningful ratios.
• Examples of ratio scale data include weight, length, time duration, or count. Ratio
scale data provides a complete and meaningful representation of the data.
Understanding the measurement scale of the data is important for selecting appropriate
statistical techniques, visualization methods, and modeling approaches. Different scales
require different levels of analysis and interpretation.
Line chart
Line charts are used to represent the relationship between two variables, X and Y,
plotted on the two axes. Here we will see some examples of a line chart in Python:
Year Unemployment_Rate
1920 9.8
1930 12
1940 8
1950 7.2
1960 6.9
1970 7
1980 6.5
1990 6.2
2000 5.5
2010 3.3
The ultimate goal is to depict the above data using a Line chart.
# General syntax for a line chart (x_axis and y_axis hold the data to be plotted)
import matplotlib.pyplot as plt

plt.plot(x_axis, y_axis)
plt.title('title name')
plt.xlabel('x_axis name')
plt.ylabel('y_axis name')
plt.show()
import matplotlib.pyplot as plt

year = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010]
unemployment_rate = [9.8, 12, 8, 7.2, 6.9, 7, 6.5, 6.2, 5.5, 3.3]

plt.plot(year, unemployment_rate)
plt.title('Unemployment rate vs Year')
plt.xlabel('Year')
plt.ylabel('Unemployment rate')
plt.show()
Output:
Bar charts
A bar plot or bar chart is a graph that represents categories of data with rectangular
bars whose lengths and heights are proportional to the values they represent. Bar plots
can be plotted horizontally or vertically. A bar chart describes comparisons between
discrete categories.
import matplotlib.pyplot as plt
import numpy as np
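The plotting code for the bar chart example is incomplete in these notes; below is a minimal sketch, reusing the imports above, with hypothetical categories and values (the course names and counts are made up for illustration):

# hypothetical categories and values
courses = ['C', 'C++', 'Java', 'Python']
students = [23, 17, 35, 29]

plt.bar(courses, students, color='maroon', width=0.4)
plt.xlabel('Courses offered')
plt.ylabel('Number of students enrolled')
plt.title('Students enrolled in different courses')
plt.show()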
Scatter plots
Scatter plots are used to observe relationships between variables, using dots to
represent the relationship between them. The scatter() method in the matplotlib library
is used to draw a scatter plot. Scatter plots are widely used to represent relationships
among variables and how a change in one affects the other.
# Python program to create scatter plots
import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
# the y values were not preserved in these notes; the list below is illustrative
y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78, 77, 85, 86]

plt.scatter(x, y, c="blue")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.show()

Output:
Bubble chart
Bubble plots are an improved version of the scatter plot. In a scatter plot, there are two
dimensions x, and y. In a bubble plot, there are three dimensions x, y, and z, where the
third dimension z denotes weight. Here, each data point on the graph is shown as a
bubble. Each bubble can be illustrated with a different color, size, and appearance.
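No code for the bubble chart survives in these notes; a minimal sketch using matplotlib's scatter(), where the s (size) argument carries the third dimension z, with made-up values:

import matplotlib.pyplot as plt

# illustrative x and y values, and bubble sizes for the third dimension z
x = [1, 2, 3, 4, 5]
y = [10, 20, 15, 25, 30]
z = [100, 300, 500, 200, 400]   # marker areas in points squared

plt.scatter(x, y, s=z, c='skyblue', alpha=0.5, edgecolors='black')
plt.xlabel('X-axis')
plt.ylabel('Y-axis')
plt.title('Bubble chart')
plt.show()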
Area Chart
An area chart is very similar to a line chart, except that the area between the x-axis and
the line is filled in with colour or shading. It represents the evolution of a numeric
variable.
import numpy as np
import matplotlib.pyplot as plt

# Create data
x = range(1, 6)
y = [1, 4, 6, 8, 4]

# Area plot
plt.fill_between(x, y)
plt.show()

Output:
# Add Labels
plt.legend(loc='upper left')
plt.title('World Population')
plt.xlabel('Number of people (millions)')
plt.ylabel('Year')
Output:
# Add Labels
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')
Output:
Pie Chart
A Pie Chart is a circular statistical plot that can display only one series of data. The area
of the chart is the total percentage of the given data. The area of slices of the pie
represents the percentage of the parts of the data. The slices of pie are called wedges.
The area of the wedge is determined by the length of the arc of the wedge. The area of a
wedge represents the relative percentage of that part with respect to whole data. Pie
charts are commonly used in business presentations like sales, operations, survey
results, resources, etc as they provide a quick summary.
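The pie chart code itself is missing from these notes; a minimal sketch with hypothetical category labels and values:

import matplotlib.pyplot as plt

# hypothetical categories and their shares
cars = ['AUDI', 'BMW', 'FORD', 'TESLA', 'JAGUAR']
data = [23, 17, 35, 29, 12]

plt.pie(data, labels=cars, autopct='%1.1f%%')
plt.title('Car sales')
plt.show()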
Output:
Table Chart
matplotlib.pyplot.table() is a sub-part of the matplotlib library that generates a table
alongside the plotted graph for analysis. This method makes analysis easier and more
efficient, as tables give more precise detail than graphs. matplotlib.pyplot.table creates
tables that often hang beneath stacked bar charts to provide readers insight into the
data generated by the above graph.
import numpy as np
import matplotlib.pyplot as plt

plt.figure()
ax = plt.gca()
a = np.random.randn(5)
# show the plotted values in a small table beneath the axes (this table call is an assumed completion)
ax.table(cellText=[[f"{v:.2f}" for v in a]], loc='bottom')
plt.plot(a)
plt.show()
Output
Polar Chart
A polar chart plots values on a circular grid using a radius and an angle instead of x and y axes.

import numpy as np
import matplotlib.pyplot as plt

plt.axes(projection = 'polar')
# setting the radius
r = 2
rads = np.arange(0, (2 * np.pi), 0.01)
# plotting the circle
for rad in rads:
    plt.polar(rad, r, 'g.')
plt.show()

Output
import numpy as np
import matplotlib.pyplot as plot

plot.axes(projection='polar')
plot.title('Circle in polar format')
rads = np.arange(0, (2*np.pi), 0.01)
for radian in rads:
    plot.polar(radian, 2, 'o')
plot.show()
Output
Histogram
A histogram is a graph showing frequency distributions. It is a graph showing the
number of observations within each given interval.
Example: Say you ask for the height of 250 people, you might end up with a histogram
like this:
You can read from the histogram that there are approximately:
2 people from 140 to 145cm, 5 people from 145 to 150cm, 15 people from 151 to
156cm, 31 people from 157 to 162cm, 46 people from 163 to 168cm, 53 people from
168 to 173cm, 45 people from 173 to 178cm, 28 people from 179 to 184cm, 21 people
from 185 to 190cm, 4 people from 190 to 195cm
import matplotlib.pyplot as plt
# x holds the observed height values (e.g., the 250 measurements above)
plt.hist(x)
plt.show()
Lollipop plot
A basic lollipop plot can be created using the stem() function of matplotlib. This function
takes x axis and y axis values as an argument. x values are optional; if you do not
provide x values, it will automatically assign x positions.
import numpy as np
import matplotlib.pyplot as plt

# create data
x = range(1, 41)
values = np.random.uniform(size=40)

# stem function
plt.stem(x, values)
plt.ylim(0, 1.2)
plt.show()
Data transformation
Data transformation is a set of techniques used to convert data from one format or
structure to another format or structure. The following are some examples of
transformation activities:
• Data deduplication involves the identification of duplicates and their removal.
• Key restructuring involves transforming any keys with built-in meanings to the
generic keys.
• Data cleansing involves extracting words and deleting out-of-date, inaccurate,
and incomplete information from the source language without extracting the
meaning or information to enhance the accuracy of the source data.
• Data validation is a process of formulating rules or algorithms that help in
validating different types of data against some known issues.
• Format revisioning involves converting from one format to another.
• Data derivation consists of creating a set of rules to generate more information
from the data source.
• Data aggregation involves searching, extracting, summarizing, and preserving
important information in different types of reporting systems.
• Data integration involves converting different data types and merging them into
a common structure or schema.
• Data filtering involves identifying information relevant to any particular user.
• Data joining involves establishing a relationship between two or more tables.
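As a small illustration of a few of these activities (deduplication, filtering, and joining) in pandas; the data below is made up and not taken from the notes:

import pandas as pd

# Data deduplication: identify duplicates and remove them
orders = pd.DataFrame({'order_id': [1, 1, 2, 3],
                       'amount': [250, 250, 100, 400]})
orders = orders.drop_duplicates()

# Data filtering: keep only the rows relevant to a condition
large_orders = orders[orders['amount'] > 150]

# Data joining: establish a relationship between two tables via a common key
customers = pd.DataFrame({'order_id': [1, 2, 3],
                          'customer': ['Asha', 'Ravi', 'Meena']})
print(large_orders.merge(customers, on='order_id'))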
#Importing libraries
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
{
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"],
"C": ["C4", "C5", "C6", "C7"],
"D": ["D4", "D5", "D6", "D7"],
},
index=[4, 5, 6, 7],
)
df3 = pd.DataFrame(
{
"A": ["A8", "A9", "A10", "A11"],
"B": ["B8", "B9", "B10", "B11"],
"C": ["C8", "C9", "C10", "C11"],
"D": ["D8", "D9", "D10", "D11"],
},
index=[8, 9, 10, 11],
)
frames = [df1, df2, df3]
result = pd.concat(frames)
print("\n", df1)
print("\n", df2)
print("\n", df3)
print("\n", result)
Visually, a concatenation with no parameters along rows would look like this:
In Pandas for a horizontal combination we have merge() and join(), whereas for vertical
combination we can use concat() and append(). Merge and join perform similar tasks
but internally they have some differences, similar to concat and append.
pandas merge():
Pandas provides various built-in functions for easily combining datasets. Among them,
merge() is a high-performance in-memory operation very similar to relational
databases like SQL. You can use merge() any time when you want to do database-like
join operations.
• The simplest call without any key column
• Specifying key columns using on
• Merging using left_on and right_on
• Various forms of joins: inner, left, right and outer
29 | CCS346 Unit - I
APEC
Syntax:
# This join brings together the entire DataFrame
pd.merge(df1, df2)
(or)
df1.merge(df2)
(or)
df1.merge(df2, on='Name')
Output:
Code 2#: Merge two DataFrames via ‘id’ column.
df1.merge(df2, left_on='id',
right_on='customer_id')
Output:
Code 3#: Merge with different column names -
specify a left_on and right_on
pandas append():
To append the rows of one dataframe with the rows of another, we can use the Pandas append()
function. With the help of append(), we can append columns too.
Steps
• Create a two-dimensional, size-mutable, potentially heterogeneous tabular data, df1.
• Print the input DataFrame, df1.
• Create another DataFrame, df2, with the same column names and print it.
• Use the append method, df1.append(df2, ignore_index=True), to append the rows of df2
to df1.
• Print the resultant DataFrame.
df3 = df1.append(df2, ignore_index=True)

Output:
import numpy as np
import pandas as pd
index = pd.MultiIndex.from_tuples([('one', 'x'), ('one', 'y'),
('two', 'x'), ('two','y')])
s = pd.Series(np.arange(2.0, 6.0), index=index)
print(s)
Output
df = s.unstack(level=0)
df.unstack()
output
Transformation.
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine. For such a transformation, the output is the
same shape as the input.
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])

df.sum()

Output:
key     ABCABC
data        15
dtype: object

df.mean()

Output:
data    2.5
dtype: float64

df.groupby('key').transform(lambda x: x - x.mean())

Output:
   data
0  -1.5
1  -1.5
2  -1.5
3   1.5
4   1.5
5   1.5
PROFESSIONAL ELECTIVE COURSES: VERTICALS
VERTICAL 1: DATA SCIENCE
CCS346 EXPLORATORY DATA ANALYSIS
UNIT II - EDA USING PYTHON
Data Manipulation With Pandas – Pandas Objects - Data Indexing And Selection –
Operating On Data – Handling Missing Data – Hierarchical Indexing – Combining
Datasets –Concat, Append, Merge And Join - Aggregation And Grouping – Pivot Tables –
Vectorized String Operations.
What is Pandas?
Pandas is a Python library used for working with data sets. It has functions for
analysing, cleaning, exploring, and manipulating data. The name "Pandas" has a
reference to both "Panel Data", and "Python Data Analysis" and was created by Wes
McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyse big data and make conclusions based on statistical
theories. Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
Pandas is an open-source Python Library providing high-performance data
manipulation and analysis tool using its powerful data structures. The name Pandas is
derived from the word Panel Data – an econometrics term for multidimensional data. In
2008, developer Wes McKinney started developing pandas when in need of a high-
performance, flexible tool for analysis of data.
Prior to Pandas, Python was majorly used for data mining and preparation. It had
very little contribution towards data analysis. Pandas solved this problem. Using
Pandas, we can accomplish five typical steps in the processing and analysis of data,
regardless of the origin of data -load, prepare, manipulate, model, and analyse.
Python with Pandas is used in a wide range of fields including academic and commercial
domains including finance, economics, Statistics, analytics, etc.
Key Features of Pandas
• Fast and efficient Data Frame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and sub-setting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
Standard Python distribution doesn't come bundled with the Pandas module. A lightweight
alternative is to install Pandas using the popular Python package installer, pip.
pip install pandas
If you install Anaconda Python package, Pandas will be installed by default with the following −
Windows
• Anaconda (from https://www.continuum.io) is a free Python distribution for SciPy
stack. It is also available for Linux and Mac.
• Canopy (https://www.enthought.com/products/canopy/) is available as free as well as
commercial distribution with full SciPy stack for Windows, Linux and Mac.
• Python (x,y) is a free Python distribution with SciPy stack and Spyder IDE for Windows
OS. (Downloadable from http://python-xy.github.io/)
Pandas deals with three main data structures – Series, DataFrame, and Panel. These data
structures are built on top of the NumPy array, which means they are fast.
Series
• Series is a one-dimensional array like structure with homogeneous data. For
example, the following series is a collection of integers 10, 23, 56, …
Key Points
o Homogeneous data
o Size Immutable
o Values of Data Mutable
DataFrame
• DataFrame is a two-dimensional array with heterogeneous data. For example,
the table represents the data of a sales team of an organization with their overall
performance rating. The data is represented in rows and columns. Each column
represents an attribute and each row represents a person.
Panel
• Panel is a three-dimensional data structure with heterogeneous data. It is hard to
represent the panel in graphical representation. But a panel can be illustrated as a
container of DataFrame.
Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable
Creating Series
1. Create an Empty Series
o A basic series, which can be created is an Empty Series.
#import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series()
print (s)

Output:
Series([], dtype: float64)
2. Create a Series from ndarray
If data is an ndarray, then index passed must be of the same length. If no index is passed,
then by default index will be range(n) where n is array length, i.e.,
[0,1,2,3…. range(len(array))-1].
import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print (s)

Output:
0    a
1    b
2    c
3    d
dtype: object

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data, index=[100,101,102,103])
print (s)

Output:
100    a
101    b
102    c
103    d
dtype: object
3. Create a Series from dict
A dict can be passed as input and if no index is specified, then the dictionary keys are taken
in a sorted order to construct index. If index is passed, the values in data corresponding to
the labels in the index will be pulled out.
import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print (s)

Output:
a    0.0
b    1.0
c    2.0
dtype: float64

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data, index=['b','c','d','a'])
print (s)

Output:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
4. Create a Series from Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the
length of index
import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print (s)

Output:
0    5
1    5
2    5
3    5
dtype: int64
Accessing Data from Series with Position
1. Data in the series can be accessed similar to that in an ndarray.
import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])
# retrieve the first element (by position)
print (s[0])

Output:
1
pandas.DataFrame
o A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)
Create DataFrame
o A pandas DataFrame can be created using various inputs like –
o Lists
o Dict
o Series
o Numpy ndarrays
o Another DataFrame
import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print (df)

Output:
        a   b     c
first   1   2   NaN
second  5  10  20.0

import pandas as pd
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

#With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'], columns=['a', 'b'])

#With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'], columns=['x', 'y'])

print (df1)
print (df2)

Output:
        a   b
first   1   2
second  5  10

          x    y
first   NaN  NaN
second  NaN  NaN
Column Selection
We will understand this by selecting a column from the DataFrame.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print (df['one'])

Output:
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
6. Column Addition is performed by adding a new column to an existing data frame.
import pandas as pd

d = {'one' : pd.Series([11, 22, 33, 44], index=['a', 'b', 'c', 'd']),
     'two' : pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print ("Given Data Frame:")
print (df)

Output:
Given Data Frame:
   one  two
a   11  100
b   22  200
c   33  300
d   44  400
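The statements that actually add new columns are not preserved in these notes; the sketch below shows what column addition typically looks like, continuing from the DataFrame above (the column names 'three' and 'four' are purely illustrative):

# Adding a new column from a Series (values align on the index labels)
df['three'] = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Adding a new column derived from existing columns
df['four'] = df['one'] + df['three']
print (df)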
Row Selection
Rows can be selected by passing a row label to the loc function:

print (df.loc['b'])

Output:
two    2
Name: b, dtype: int64

Rows can also be selected by passing an integer position to the iloc function:

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([100, 200, 300], index=['a', 'b', 'c'])}
df = pd.DataFrame(d)
print (df.iloc[2])

Output:
one      3
two    300
Name: c, dtype: int64
9. Slice Rows
Multiple rows can be selected using ‘ : ’ operator.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'two' : pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print (df[2:4])

Output:
   one  two
c    3  300
d    4  400
10. Addition of Rows
Add new rows to a DataFrame using the append function.
import pandas as pd
df = pd.DataFrame([[55, 66], [77, 66]], columns = ['a','b'])
df2 = pd.DataFrame([[700, 600], [800, 900]], columns = ['a','b'])

df = df.append(df2)
print (df)

Output:
     a    b
0   55   66
1   77   66
0  700  600
1  800  900
11. Deletion of Rows
Drop a label and see how many rows will get dropped.
import pandas as pd

df = pd.DataFrame([[11, 22], [33, 44]], columns = ['a','b'])
df2 = pd.DataFrame([[55, 66], [77, 88]], columns = ['a','b'])

df = df.append(df2)
print("Original Data frame..........")
print(df)
print("Drop rows with label 0......")
df = df.drop(0)
print (df)

Output:
Original Data frame..........
    a   b
0  11  22
1  33  44
0  55  66
1  77  88
Drop rows with label 0......
    a   b
1  33  44
1  77  88
Descriptive Statistics
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
        'Lee','David','Gasper','Betina','Andres']),
     'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
    }

#Create a DataFrame
df = pd.DataFrame(d)
print (df)

Output:
      Name  Age  Rating
0      Tom   25    4.23
1    James   26    3.24
2    Ricky   25    3.98
3      Vin   23    2.56
4    Steve   30    3.20
5    Smith   29    4.60
6     Jack   23    3.80
7      Lee   34    3.78
8    David   40    2.98
9   Gasper   30    4.80
10  Betina   51    4.10
11  Andres   46    3.65
sum()
Returns the sum of the values for the requested axis. By default, axis is index (axis=0).
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
        'Lee','David','Gasper','Betina','Andres']),
     'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
    }

#Create a DataFrame
df = pd.DataFrame(d)
print (df.sum())

Output:
Name      TomJamesRickyVinSteveSmithJackLeeDavidGasperBe...
Age                                                     382
Rating                                                44.92
dtype: object

Note: Each individual column is summed individually (strings are appended/concatenated).
axis=1
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
        'Lee','David','Gasper','Betina','Andres']),
     'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
     'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
    }

#Create a DataFrame
df = pd.DataFrame(d)
print (df.sum(1))

Output:
0     29.23
1     29.24
2     28.98
3     25.56
4     33.20
5     33.60
6     26.80
7     37.78
8     42.98
9     34.80
10    55.10
11    49.65
dtype: float64
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Mean ..............")
print (df.mean())

Output:
Mean ..............
Age    31.833333
dtype: float64
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Standard Deviation ..............")
print (df.std())

Output:
Standard Deviation ..............
Age    9.232682
dtype: float64
import pandas as pd
import numpy as np

#Create a Dictionary of series
d = {'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Describing various datas...........")
print (df.describe())

Output:
Describing various datas...........
             Age
count  12.000000
mean   31.833333
std     9.232682
min    23.000000
25%    25.000000
50%    29.500000
75%    35.500000
max    51.000000
Drop duplicates

# import the library as pd
import pandas as pd
df = pd.DataFrame(
    {
        'Name': ['Srivignesh', 'Srivignesh', 'Hari'],
        'Age': [22, 22, 11],
        'Country': ['India', 'India', 'India']
    }
)
print(df)
newdf = df.drop_duplicates()
print(newdf)

Output:
         Name  Age Country
0  Srivignesh   22   India
1  Srivignesh   22   India
2        Hari   11   India

         Name  Age Country
0  Srivignesh   22   India
2        Hari   11   India
⎯ Transposed Data Frame
o The transpose() function is used to transpose index and columns
⎯ axes
o Returns the list of row axis labels and column axis labels.
⎯ empty
o Returns the Boolean value saying whether the Object is empty or not; True
indicates that the object is empty.
⎯ ndim
o Returns the number of dimensions of the object. By definition, DataFrame is a 2D
object.
⎯ Shape
o Returns a tuple representing the dimensionality of the DataFrame. Tuple (a,b),
where a represents the number of rows and b represents the number of columns.
⎯ Size
o Returns the number of elements in the DataFrame.
⎯ values
o Returns the actual data in the DataFrame as an NDarray
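The example that demonstrates these attributes is missing from these notes; below is a short sketch on a small, made-up DataFrame showing what each attribute or method returns:

import pandas as pd

df = pd.DataFrame({'one': [1, 2, 3], 'two': [4, 5, 6]})

print(df.transpose())   # transposed DataFrame (rows and columns swapped)
print(df.axes)          # [row index labels, column labels]
print(df.empty)         # False, because the DataFrame contains data
print(df.ndim)          # 2 (a DataFrame is always two-dimensional)
print(df.shape)         # (3, 2) -> 3 rows, 2 columns
print(df.size)          # 6 elements in total
print(df.values)        # the underlying data as a NumPy ndarray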
In a DataFrame, many datasets simply arrive with missing data, either because it
exists and was not collected or because it never existed. To make matters even more complicated,
different data sources may indicate missing data in different ways. For example, some
users being surveyed may choose not to share their income, and others may choose
not to share their address; in this way, values end up missing in many datasets.
Trade-Offs in Missing Data Conventions
A number of schemes have been developed to indicate the presence of missing data in a table or
DataFrame. Generally, they revolve around one of two strategies: using a mask that globally
indicates missing values, or choosing a sentinel value that indicates a missing entry.
• In the masking approach, the mask might be a Boolean array.
• In the sentinel approach, the sentinel value could be some data-specific convention, such
as indicating a missing integer value with –9999 or some rare bit pattern, or it could be a
more global convention, such as indicating a missing floating-point value with NaN (Not
a Number)
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful functions for detecting, removing, and
replacing null values in Pandas DataFrame :
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()
import pandas as pd
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}

# create the DataFrame and flag the missing values
df = pd.DataFrame(dict)
print(df.isnull())
Drop rows with at least one NaN value:

# importing pandas and numpy
import pandas as pd
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, np.nan, 45, 56],
        'Third Score':[52, 40, 80, 98],
        'Fourth Score':[np.nan, np.nan, np.nan, 65]}

df = pd.DataFrame(dict)

# using dropna() function
print(df.dropna())

Output:
After dropping rows with at least 1 NaN value
Code #2: Drop rows whose data is entirely missing (all NaN)

# importing pandas as pd
import pandas as pd
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}

df = pd.DataFrame(dict)
print(df)

# using dropna() function
print(df.dropna(how = 'all'))

Output:
Code #2: Filling null values with the previous ones

import pandas as pd
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, np.nan]}

df = pd.DataFrame(dict)
# fill each missing value with the previous (forward) value
print(df.fillna(method = 'pad'))

Output:
Code #3: Filling null values with the next ones

import pandas as pd
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score':[np.nan, 40, 80, 98]}

df = pd.DataFrame(dict)
# fill each missing value with the next (backward) value
print(df.fillna(method = 'bfill'))

Output:
Now we are going to fill all the null values in the "Gender" column with "No Gender":

# filling null values using fillna()
data["Gender"].fillna("No Gender", inplace=True)
print(data)

Output:
Now we are going to replace all the NaN values in the data frame with the value -99:

# replacing NaN values using replace() (this call is an assumed completion; the original code is not preserved)
print(data.replace(to_replace=np.nan, value=-99))

Output:
Interpolation
Interpolation in Python is a technique used to estimate unknown data points between two
known data points. Interpolation is mostly used to impute missing values in the data frame or
series while pre-processing data. Interpolation is also used in Image Processing when
expanding an image you can estimate the pixel value with help of neighbouring pixels.
When to use Interpolation?
We can use interpolation to find a missing value with the help of its neighbours. When imputing
missing values, if filling with the average does not fit best, we have to move to a different
technique, and the technique most people reach for is interpolation. Interpolation is mostly used
while working with time-series data, because in time-series data we like to fill missing values with
the previous one or two values. For example, for temperature, we would usually prefer to fill
today's missing temperature with the mean of the last 2 days, not with the mean of the month.
We can also use interpolation for calculating moving averages.
Using Interpolation to fill Missing Values in Series Data
A pandas Series is a one-dimensional array that is capable of storing elements of various data
types, like a list. We can easily create a Series with the help of a list, tuple, or dictionary. To
perform all the interpolation methods we will create a pandas Series with some NaN values and
try to fill the missing values with different methods of interpolation.
import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
              index=[100,101,102,103,104,105,106])
print(a)

Output:
100    0.0
101    1.0
102    NaN
103    3.0
104    4.0
105    5.0
106    7.0
dtype: float64
1) Linear Interpolation
Linear Interpolation simply means to estimate a missing value by connecting dots in a straight
line in increasing order. In short, it estimates the unknown value in the same increasing order
from previous values. The default method used by Interpolation is Linear so while applying it
we did not need to specify it.
Code #6.1: Linear Interpolation

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
              index=[100,101,102,103,104,105,106])
print(a.interpolate())

Output:
100    0.0
101    1.0
102    2.0
103    3.0
104    4.0
105    5.0
106    7.0
dtype: float64
Hence, linear interpolation works in the same increasing order. Remember that it does not
interpolate using the index; it interpolates values by connecting the points in a straight line.
2) Polynomial Interpolation
In polynomial interpolation you need to specify an order. Polynomial interpolation fills missing
values using the lowest-degree polynomial that passes through the available data points. The
polynomial interpolation curve looks like a trigonometric sine curve or a parabola shape.
Code #6.2: Polynomial Interpolation

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
              index=[100,101,102,103,104,105,106])
a.interpolate(method="polynomial", order=2)

Output:

If you pass an order of 1, the output will be similar to linear interpolation, because a polynomial of order 1 is linear.

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
              index=[100,101,102,103,104,105,106])
a.interpolate(method="polynomial", order=1)

Output:
import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
index=[100,101,102,103,104,105,106])
a.interpolate(method="pad", limit=2)
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 7, None, 2],
"B":[None, 3, 57, 3, None],
"C":[20, 16, None, 3, 8],
"D":[14, 3, None, None, 6]})
print(df)
Displaying the Data Frame & Performing Linear Interpolation in forwarding Direction
The linear method ignores the index and treats missing values as equally spaced, finding the best
point to fit the missing value after the previous points. If the missing value is at the first index, it is
left as NaN. Let's apply it to our dataframe.
Code #7.1: Performing Linear Interpolation in the forward direction

import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 7, None, 2],
                   "B":[None, 3, 57, 3, None],
                   "C":[20, 16, None, 3, 8],
                   "D":[14, 3, None, None, 6]})
df.interpolate(method ='linear', limit_direction ='forward')

Output:
What is MultiIndex?
MultiIndex allows you to select more than one row and column in your index. To understand
MultiIndex, let’s see the indexes of the data.
#Importing libraries
import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(8),
                 index=[["a","a","a","b","b","b","c","c"],
                        [1,2,3,1,2,3,1,2]])
print(data.index)

Output:
MultiIndex is an advanced indexing technique for DataFrames that shows the multiple levels of
the indexes. Our dataset has two levels. You can obtain subsets of the data using the indexes. For
example, let’s take a look at the values with index a.
You can select values from the inner index. Let's take a look at the values at position 1 of the inner index:

#We can also look across more than one outer index
data.loc[:, 1]

Output:
To restore the dataset, you can use the stack method:

data.unstack().stack()

Output:
data2 = data.set_index(["a","b"])
data2

Output:

data3 = data.set_index(["a","b"]).sort_index()
data3

Output:
Combining Data in Pandas with append(), merge(), join(), and concat()
df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
{
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"],
"C": ["C4", "C5", "C6", "C7"],
"D": ["D4", "D5", "D6", "D7"],
},
index=[4, 5, 6, 7],
)
df3 = pd.DataFrame(
{
"A": ["A8", "A9", "A10", "A11"],
"B": ["B8", "B9", "B10", "B11"],
"C": ["C8", "C9", "C10", "C11"],
"D": ["D8", "D9", "D10", "D11"],
},
index=[8, 9, 10, 11],
)
frames = [df1, df2, df3]
result = pd.concat(frames)
print("\n", df1)
print("\n", df2)
print("\n", df3)
print("\n", result)
Visually, a concatenation with no parameters along rows would look like this:
In Pandas for a horizontal combination we have merge() and join(), whereas for vertical
combination we can use concat() and append(). Merge and join perform similar tasks but
internally they have some differences, similar to concat and append.
pandas merge():
Pandas provides various built-in functions for easily combining datasets. Among them, merge()
is a high-performance in-memory operation very similar to relational databases like SQL. You
can use merge() any time when you want to do database-like join operations.
• The simplest call without any key column
• Specifying key columns using on
• Merging using left_on and right_on
• Various forms of joins: inner, left, right and outer
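The worked examples for these different join types did not survive in these notes; below is a small sketch, using made-up DataFrames, of how the how parameter of merge() selects the join type:

import pandas as pd

left = pd.DataFrame({'Name': ['Tom', 'Ricky', 'Vin'],
                     'Age': [25, 25, 23]})
right = pd.DataFrame({'Name': ['Tom', 'Vin', 'Steve'],
                      'Rating': [4.23, 2.56, 3.20]})

print(left.merge(right, on='Name', how='inner'))  # only names present in both
print(left.merge(right, on='Name', how='left'))   # all rows from left
print(left.merge(right, on='Name', how='right'))  # all rows from right
print(left.merge(right, on='Name', how='outer'))  # union of both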
Syntax:
# This join brings together the entire DataFrame
pd.merge(df1, df2)
(or)
df1.merge(df2)
(or)
df1.merge(df2, on='Name')
Output:
Code 2#: Merge two DataFrames via ‘id’ column.
df1.merge(df2, left_on='id',
right_on='customer_id')
Output:
Code 3#: Merge with different column names - specify a
left_on and right_on
pandas append():
To append the rows of one dataframe with the rows of another, we can use the Pandas append()
function. With the help of append(), we can append columns too.
Steps
• Create a two-dimensional, size-mutable, potentially heterogeneous tabular data, df1.
• Print the input DataFrame, df1.
• Create another DataFrame, df2, with the same column names and print it.
• Use the append method, df1.append(df2, ignore_index=True), to append the rows of df2
to df1.
• Print the resultant DataFrame.
Code 5#: Append Function to join DataFrames

import pandas as pd
df1 = pd.DataFrame({"x": [5, 2],
                    "y": [4, 7],
                    "z": [1, 3]})
df2 = pd.DataFrame({"x": [1, 3],
                    "y": [1, 9],
                    "z": [1, 3]})
print ("\n", df1)
print ("\n", df2)
df3 = df1.append(df2)
print ("\n", df3)

Output:
df3 = df1.append(df2, ignore_index=True)
print ("\n", df3)

Output:
Aggregation in Pandas
Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and returns a summary of that function. Aggregation can be used to get
a summary of columns in our dataset like getting sum, minimum, maximum, etc. from a
particular column of our dataset.
Some functions used in the aggregation are:
• sum() Compute sum of column values
• min() Compute min of column values
• max() Compute max of column values
• mean() Compute mean of column
• size() Compute column sizes
• describe() Generates descriptive statistics
• first() Compute first of group values
• last() Compute last of group values
• count() Compute count of column values
• std() Standard deviation of column
• var() Compute variance of column
• sem() Standard error of the mean of column
df = pd.DataFrame([[9, 4, 8, 9],
[8, 10, 7, 6],
[7, 6, 8, 5]],
columns=['Maths', 'English',
'Science', 'History'])
# display dataset
print(df)
Output
agg() Calculate the sum, min, and max of each column in our dataset.
df.agg(['sum', 'min', 'max'])
Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It is used as a
split-apply-combine strategy.
• Splitting the data into groups based on some criteria.
• Applying a function to each group independently.
• Combining the results into a data structure.
Examples:
• We use groupby() function to group the data on “Maths” value. It returns
the object as result.
df.groupby(by=['Maths'])
Applying groupby() function to group the data on “Maths” value. To view result of formed
groups use first() function.
a = df.groupby('Maths')
a.first()
First grouping based on “Maths” within each team we are grouping based on “Science”
b = df.groupby(['Maths', 'Science'])
b.first()
Multiple aggregations:

df.groupby('A').agg(['min', 'max'])

Output:
    B         C
  min max   min       max
A
1   1   2   0.227877  0.362838
2   3   4  -0.562860  1.267767

Select a column for aggregation:

df.groupby('A').B.agg(['min', 'max'])

Output:
   min  max
A
1    1    2
2    3    4

Different aggregations per column:

df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})

Output:
    B        C
  min max   sum
A
1   1   2   0.590715
2   3   4   0.704907
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6)}, columns=['key', 'data'])
print(df)

Output:
  key  data
0   A     0
1   B     1
2   C     2
3   A     3
4   B     4
5   C     5

df.groupby('key').sum()

Output:
     data
key
A       3
B       5
C       7
Transformation.
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine. For such a transformation, the output is the
same shape as the input.
df.sum()

Output:
key     ABCABC
data        15
dtype: object

df.mean()

Output:
data    2.5
dtype: float64

df.groupby('key').transform(lambda x: x - x.mean())

Output:
   data
0  -1.5
1  -1.5
2  -1.5
3   1.5
4   1.5
5   1.5
Pivot Tables
We have seen how the GroupBy abstraction lets us explore relationships within a dataset. A
pivot table is a similar operation that is commonly seen in spread sheets and other programs
that operate on tabular data. The pivot table takes simple column wise data as input, and groups
the entries into a two-dimensional table that provides a multidimensional summarization of the
data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps
me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation.
That is, you split-apply-combine, but both the split and the combine happen not across a one-
dimensional index, but across a two-dimensional grid.
Purpose:
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table of pandas
will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the
result DataFrame.
How to make a pivot table?
Use the pd.pivot_table() function and specify what feature should go in the rows and columns
using the index and columns parameters respectively. The feature that should be used to fill in
the cell values should be specified in the values parameter.
import pandas as pd
import numpy as np
df = pd.DataFrame({'First Name': ['Aryan', 'Rohan', 'Riya', 'Yash', 'Siddhant'],
                   'Last Name': ['Singh', 'Agarwal', 'Shah', 'Bhatia', 'Khanna'],
                   'Type': ['Full-time Employee', 'Intern', 'Full-time Employee',
                            'Part-time Employee', 'Full-time Employee'],
                   'Department': ['Administration', 'Technical', 'Administration',
                                  'Technical', 'Management'],
                   'YoE': [2, 3, 5, 7, 6],
                   'Salary': [20000, 5000, 10000, 10000, 20000]})
print(df)
Output:
output = pd.pivot_table(data=df,
index=['Type'],
columns=['Department'],
values='Salary',
aggfunc='mean')
print(output)
Here, we have made a basic pivot table in pandas which shows the average salary of each type
of employee for each department. As there are no user-defined parameters passed, the
remaining arguments have assumed their default values.
Pivot table with multiple aggregation functions
df1 = pd.pivot_table(data=df, index=['Type'],
                     values='Salary',
                     aggfunc=['sum', 'mean', 'count'])
print(df1)
Output
df = pd.DataFrame({"P": ["f1","f1","f1","f1","f1","b1","b1","b1","b1"],
"Q": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"R": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
Unit IV – Pandas
38
APEC
"S": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"T": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
print(df)
Output
Output:
df.sort_values(by=['R'])
Vectorized Strings
Vectorized string operations refer to performing operations on strings in a vectorized
manner, meaning that the operations are applied simultaneously to multiple strings rather than
individually. This approach leverages the power of optimized low-level operations provided by
modern programming languages and libraries. In many programming languages, vectorized
string operations are supported through libraries or modules that provide efficient string
handling functions.
For example, in Python, the NumPy and pandas libraries offer vectorized string operations. Here
are a few examples of vectorized string operations commonly used in programming languages:
• Concatenation: Joining multiple strings together. For instance, given two arrays of
strings, a vectorized operation can concatenate the corresponding strings in each array
to create a new array.
• Substring extraction: Extracting a portion of a string based on a specified start and end
index. Vectorized substring extraction can be performed on multiple strings
simultaneously.
• Case conversion: Changing the case of strings, such as converting all characters to
uppercase or lowercase. Vectorized operations can be used to apply case conversion to
multiple strings at once.
• String matching: Finding strings that match a specific pattern or regular expression.
Vectorized string matching allows searching for matches across multiple strings
efficiently.
• Replacement: Replacing substrings or patterns within strings. Vectorized replacement
operations can replace substrings across multiple strings simultaneously.
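As a compact preview of these operations (which are covered in more detail below), here is a small sketch using the pandas .str accessor on made-up strings; str.cat(), str.contains(), str.replace(), and str.upper() are standard pandas string methods:

import pandas as pd

first = pd.Series(['peter', 'paul', 'mary'])
last = pd.Series(['charles', 'roudridge', 'siva'])

print(first.str.cat(last, sep=' '))   # concatenation, element by element
print(first.str[0:3])                 # substring extraction (first three characters)
print(first.str.upper())              # case conversion
print(first.str.contains('a'))        # string matching
print(first.str.replace('p', 'P'))    # replacement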
For arrays of strings, NumPy does not provide such simple vectorized access, so you are stuck with a more verbose loop syntax:
data = ['peter', 'Paul', 'MARY', 'gUIDO']
print(data)

Output:
['peter', 'Paul', 'MARY', 'gUIDO']

data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]

Output:
['Peter', 'Paul', 'Mary', 'Guido']
This is perhaps sufficient to work with some data, but it will break if there are any missing values. For
example:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
Pandas includes features to address both this need for vectorized string operations and for
correctly handling missing data via the str attribute of Pandas Series and Index objects
containing strings. So, for example, suppose we create a Pandas Series with this data:
We can now call a single method that will capitalize all the entries, while skipping over any
missing values:
import pandas as pd
names = pd.Series(data)
names.str.capitalize()

Output:
0    Peter
1     Paul
2     None
3     Mary
4    Guido
dtype: object
Nearly all Python’s built-in string methods are mirrored by a Pandas vectorized string method.
Here is a list of Pandas str methods that mirror Python string methods:
len() lower() translate() islower()
ljust() upper() startswith() isupper()
rjust() find() endswith() isnumeric()
center() rfind() isalnum() isdecimal()
zfill() index() isalpha() split()
strip() rindex() isdigit() rsplit()
rstrip() capitalize() isspace() partition()
lstrip() swapcase() istitle() rpartition()
names.str.lower()

Output:
0    peter
1     paul
2     none
3     mary
4    guido
dtype: object

names.str.len()

Output:
0    5
1    4
2    4
3    4
4    5
dtype: int64

names.str.startswith('M')

Output:
0    False
1    False
2    False
3     True
4    False
dtype: bool
import pandas as pd
data = ['peter Charles', 'Paul Roudridge', 'MARY Siva', 'gUIDO']
names = pd.Series(data)
names.str.split()

Output:
0     [peter, Charles]
1    [Paul, Roudridge]
2         [MARY, Siva]
3              [gUIDO]
dtype: object
Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations.
Vectorized item access and slicing. The get() and slice() operations enable vectorized element
access from each array. For example, we can get a slice of the first three characters of each array
using str.slice(0, 3).
names.str[0:3]

Output:
0    pet
1    Pau
2    MAR
3    gUI
dtype: object

names.str.split().str.get(-1)

Output:
0      Charles
1    Roudridge
2         Siva
3        gUIDO
dtype: object
Example 2:
Consider the following household dataset:
Summary Statistics
There are two popular types of summary statistics:
• Measures of central tendency: these numbers describe where the centre of a
dataset is located.
o Examples include the mean and the median.
• Measures of Dispersion: These numbers describe how evenly distributed the
values are in a dataset.
o Examples are range, standard deviation, interquartile range, and variance.
▪ Range -the difference between the max value and min value in a
dataset
▪ Standard Deviation- an average measure of the spread
▪ Interquartile Range- the spread of the middle 50% of values
Frequency Distributions
• Frequency means how often something takes place. The frequency observation
tells the number of times for the occurrence of an event.
o The frequency distribution table shows both qualitative and quantitative
variables.
Charts
• Another way to perform univariate analysis is to create charts to visualize the
distribution of values for a certain variable.
• Examples include Boxplots, Histograms, Density Curves, Pie Charts
Bar chart
• The bar chart is represented in the form of rectangular bars. The graph will
compare various categories.
• The graph could be plotted vertically or horizontally. The horizontal or the x-axis
will represent the category and the vertical y-axis represents the category’s
value. The bar graph looks at the data set and makes comparisons.
Histogram
• The histogram is similar to a bar chart, but it analyses counts of the data. A bar
graph counts against categories, whereas a histogram displays the categories
grouped into bins. A bin can show the number of data positions, the range,
or the interval.
Frequency Polygon
• The frequency polygon is similar to the histogram. However, these can be used to
compare the data sets or in order to display the cumulative frequency
distribution. The frequency polygon will be represented as a line graph.
Pie Chart
• The pie chart displays the data in a circular format. The graph is divided into
pieces where each piece is proportional to the fraction of the complete category.
So each slice of the pie in the pie chart is relative to categories size. The entire pie
is 100 percent and when you add up each of the pie slices then it should also add
up to 100.
Example 3:
Performing Univariate analysis using the Household Size variable from our dataset
mentioned earlier:
Summary Statistics
Measures of central tendency
• Mean (the average value): 3.8
• Median (the middle value): 4
Measures of Dispersion:
• Range (the difference between the max and min): 6
• Interquartile Range (the spread of the middle 50% of values): 2.5
• Standard Deviation (an average measure of spread): 1.87
Frequency Distributions
We can also create the following frequency distribution table to summarize how
often different values occur:
Charts
We can create the following charts to help us visualize the distribution of values for
Household Size:
Boxplot
• A boxplot is a plot that shows the five-number summary of a dataset. The five-
number summary includes:
o Minimum value
o First quartile
o Median value
o Third quartile
o Maximum value
Here’s what a boxplot would look like for the variable Household Size:
Histogram
• A histogram is a type of chart that uses vertical bars to display frequencies. This
type of chart is a useful way to visualize the distribution of values in a dataset.
Density Curve
• A density curve is a curve on a graph that represents the distribution of values in
a dataset. It’s particularly useful for visualizing the “shape” of a distribution,
including whether or not a distribution has one or more “peaks” of frequently
occurring values and whether or not the distribution is skewed to the left or the
right.
Pie Chart
• A pie chart is a type of chart that is shaped like a circle and uses slices to
represent proportions of a whole.
Depending on the type of data, one of these charts may be more useful for visualizing
the distribution of values than the others.
Numerical Summaries
In data exploration, numerical summaries are essential for understanding the
level (central tendency) and spread (variability) of a variable. These summaries provide
a concise description of the distribution of data points and help in identifying patterns,
outliers, and potential relationships with other variables. Here are some common
numerical summaries used in data exploration:
• Measures of Central Tendency
o Mean: The arithmetic average of all the data points in a variable. It is calculated
by summing all the values and dividing by the total number of observations.
o Median: The middle value in an ordered list of data points. It divides the data
into two equal halves, with 50% of the observations above and 50% below it.
o Mode: The most frequently occurring value(s) in a variable. It represents the
peak(s) of the distribution.
• Measures of Variability
o Range: The difference between the maximum and minimum values in a
variable. It provides an idea of the spread of the data but is sensitive to outliers.
o Standard Deviation: A measure of how much the data points deviate from the
mean. It quantifies the average distance between each data point and the mean.
o Variance: The average squared deviation from the mean. It measures the
variability of data points around the mean.
• Additional Measures
o Quartiles: Values that divide the data into four equal parts. The first quartile
(Q1) represents the 25th percentile, the median represents the 50th
percentile, and the third quartile (Q3) represents the 75th percentile.
o Interquartile Range (IQR): The range between the first and third quartiles. It
provides a measure of the spread of the central half of the data and is less
affected by extreme values.
o Skewness: A measure of the asymmetry of the distribution. Positive skewness
indicates a longer tail on the right side, while negative skewness indicates a
longer tail on the left side.
o Kurtosis: A measure of the peakedness or flatness of the distribution. It
compares the tails and central peak to a normal distribution. Positive kurtosis
indicates a more peaked distribution, while negative kurtosis indicates a
flatter distribution.
These numerical summaries provide insights into the characteristics of a variable,
allowing for a better understanding of its distribution and variability. They serve as the
foundation for further analysis and can guide decision-making processes in data
exploration.
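As a minimal illustration, most of these summaries can be obtained directly with pandas (here using the retirement-age data from Example 1 below):

import pandas as pd

values = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])

print(values.describe())                                      # count, mean, std, min, quartiles, max
print("Mode:", values.mode().tolist())
print("Variance:", values.var())                              # sample variance (divides by n - 1)
print("IQR:", values.quantile(0.75) - values.quantile(0.25))
print("Skewness:", values.skew())
print("Kurtosis:", values.kurt())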
Example 1
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
The most commonly occurring value is 54, therefore the mode of this distribution is 54
years.
Mean
• The mean is the sum of the value of each observation in a dataset divided by the
number of observations. This is also known as the arithmetic average.
Mean = (54 + 54 + 54 + 55 + 56 + 57 + 57 + 58 + 58 + 60 + 60) / 11 = 56.6 years
Median
• The median is the middle value in distribution when the values are arranged in
ascending or descending order.
• The median divides the distribution in half (there are 50% of observations on
either side of the median value). In a distribution with an odd number of
observations, the median value is the middle value.
Looking at the retirement age distribution (which has 11 observations), the median is
the middle value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the
mean of the two middle values. In the following distribution, the two middle values are
56 and 57, therefore the median equals 56.5 years.
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
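These calculations can be checked with Python's built-in statistics module:

import statistics

ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(statistics.mean(ages))    # 56.6 (to one decimal place)
print(statistics.median(ages))  # 57
print(statistics.mode(ages))    # 54

even_ages = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
print(statistics.median(even_ages))  # 56.5, the mean of the two middle values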
Important
• The mean is strongly affected by (not resistant to) outliers and skewness,
whereas the median is not affected by (resistant to) outliers and skewness.
o Outliers – mean is pulled towards the outliers
o Skewness – mean is pulled towards the longer tail.
▪ Symmetric : Mean = Median = Mode
▪ Left-skewed(Negatively skewed): Mode > Median > Mean
▪ Right-skewed(positively skewed): Mode < Median < Mean
Skewed distributions
• When a distribution is skewed, the mode remains the most commonly occurring
value, the median remains the middle value in the distribution, but the mean is
generally ‘pulled’ in the direction of the tails. In a skewed distribution, the
median is often a preferred measure of central tendency, as the mean is not
usually in the middle of the distribution.
Positively or Right-skewed
• A distribution is said to be positively or right skewed when the tail on the right
side of the distribution is longer than the left side. In a positively skewed
distribution, the mean tends to be ‘pulled’ toward the right tail of the distribution.
• The following graph shows a larger retirement age data set with a distribution
which is right skewed. The data has been grouped into classes, as the variable
being measured (retirement age) is continuous. The mode is 54 years, the modal
class is 54-56 years, the median is 56 years, and the mean is 57.2 years.
Negatively or left-skewed
• A distribution is said to be negatively or left skewed when the tail on the left side
of the distribution is longer than the right side. In a negatively skewed
distribution, the mean tends to be ‘pulled’ toward the left tail of the distribution.
• The following graph shows a larger retirement age dataset with a distribution which is
left skewed. The mode is 65 years, the modal class is 63-65 years, the
median is 63 years and the mean is 61.8 years.
Example:
Consider the following example:
Analysis:
• Mode (most frequent value), Median (middle value*) and Mean (arithmetic
average) of both datasets is 6.
• If we just look at the measures of central tendency, we assume that the datasets
are the same. However, if we look at the spread of the values in the following
graph, we can see that Dataset B is more dispersed than Dataset A.
o The measures of central tendency and measures of spread help us to better
understand the data.
• Range
Range = Difference between the smallest value and the largest value in a dataset.
Range =4 [High value (8) and Low value (4)]
• Quartiles
Quartiles divide an ordered dataset into four equal parts, and refer to the values
of the point between the quarters. A dataset may also be divided into quintiles
(five equal parts) or deciles (ten equal parts)
o 25th percentile
Lower quartile (Q1) is the point between the lowest 25% of values and
the highest 75% of values.
o 50th percentile
Second quartile (Q2) is the middle of the data set. It is also called the
median.
o 75th percentile
Upper quartile (Q3) is the point between the lowest 75% and highest
25% of values.
o Calculating quartiles
o As the quartile point falls between two values, the mean (average) of
those values is the quartile value:
▪ Q1 = (5+5) / 2 = 5
▪ Q2 = (6+6) / 2 = 6
▪ Q3 = (7+7) / 2 = 7
o As the quartile point falls between two values, the mean (average) of
those values is the quartile value:
▪ Q1 = (3+4) / 2 = 3.5
▪ Q2 = (6+6) / 2 = 6
▪ Q3 = (8+9) / 2 = 8.5
• Interquartile
The interquartile range (IQR) is the difference between the upper (Q3) and lower
(Q1) quartiles. IQR is often seen as a better measure of spread than the range as
it is not affected by outliers.
Example 2:
Data is arranged in the ordered array as follows: 11, 12, 13, 16, 16, 17, 18, 21, 22.
Solution:
Number of items = 9
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
Q1 = (12+13)/2 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data,
Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
Q3 = (18+21)/2 = 19.5
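This (n + 1)-based quartile rule matches the default 'exclusive' method of Python's statistics.quantiles, so the result can be checked as follows:

import statistics

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
q1, q2, q3 = statistics.quantiles(data, n=4)   # exclusive method by default
print(q1, q2, q3)   # 12.5, 16.0, 19.5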
The five numbers that help describe the center, spread and shape of data are
• Xsmallest
• First Quartile (Q1)
• Median (Q2)
• Third Quartile (Q3)
• Xlargest
Population variance: σ² = Σ(Xᵢ − Μ)² / N (the population standard deviation σ is its square root)
where:
• Xᵢ → the i-th unit, starting from the first observation to the last
• Μ → Population mean
• N → Number of units in the population

Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1) (the sample standard deviation s is its square root)
where:
• xᵢ → the i-th unit, starting from the first observation to the last
• x̄ → Sample mean
• n → Number of units in the sample
Example 1
Find the variance and standard deviation of the following scores on an exam:
92, 95, 85, 80, 75, 50
Solution
Step 1: Mean of the data
Mean = (92 + 95 + 85 + 80 + 75 + 50) / 6 = 79.5
Step 2: Find the difference between each score and the mean (deviation).
Example 2
Find the standard deviation of the average temperatures recorded over a five-day
period last winter:
18, 22, 19, 25, 12
Temp     Temp − Mean     Deviation     Squared deviation
18       18 − 19.2        −1.2              1.44
22       22 − 19.2         2.8              7.84
19       19 − 19.2        −0.2              0.04
25       25 − 19.2         5.8             33.64
12       12 − 19.2        −7.2             51.84

Mean = 96 / 5 = 19.2
Sum of squared deviations = 94.80
Variance = 94.8 / (5 − 1) = 94.8 / 4 = 23.7
Standard Deviation = √23.7 ≈ 4.9
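The same result can be obtained with numpy, using ddof=1 so that the sum of squared deviations is divided by n − 1:

import numpy as np

temps = [18, 22, 19, 25, 12]
print(np.mean(temps))           # 19.2
print(np.var(temps, ddof=1))    # 23.7 (sample variance)
print(np.std(temps, ddof=1))    # about 4.87, i.e. roughly 4.9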
Scaling
Scaling refers to the process of transforming variables to have a similar scale. It
helps to ensure that all variables are on a comparable magnitude, preventing certain
variables from dominating the analysis due to their larger values. Common scaling
methods include:
Example 1
Let's calculate the Min-Max scaling for a dataset step by step. Consider the following
dataset: [12, 15, 18, 20, 25].
Resulting Min-Max scaled dataset is approximately [0.000, 0.231, 0.462, 0.615, 1.000].
Each value represents the scaled value for the corresponding data point in the original
dataset, where 0 corresponds to the minimum value and 1 corresponds to the maximum
value.
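A minimal numpy sketch of this Min-Max calculation:

import numpy as np

data = np.array([12, 15, 18, 20, 25])
scaled = (data - data.min()) / (data.max() - data.min())
print(scaled.round(3))   # [0.    0.231 0.462 0.615 1.   ]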
Example 2
Let's say we have a variable representing the income of individuals in a dataset. The
original income values range from $20,000 to $100,000. We want to scale these values
to a range between 0 and 1 using Min-Max Scaling.
Solution
Calculate the minimum and maximum values of the income variable:
min(x) = $20,000
max(x) = $100,000
Choose a data point, for example, $40,000, and apply the Min-Max Scaling formula:
x_scaled = (40000 − 20000) / (100000 − 20000) = 20000 / 80000 = 0.25
Therefore, $40,000 would be scaled to 0.25.
Repeat the scaling process for all other data points. The resulting scaled values will fall
within the range of 0 to 1, with the minimum value transformed to 0 and the maximum
value transformed to 1. When we are dealing with image processing, the pixel values, which
lie between 0 and 255, often need to be normalized in a similar way.
Analysis
Employee Number   Age   Salary
Emp1              44    73000
Emp2              27    47000
Emp3              30    53000
Emp4              38    62000
Emp5              40    57000
Emp6              35    53000
Emp7              48    78000

Age Range: 27 – 48
Salary Range: 47000 – 78000
Distance between Emp2 and Emp1 = √((27 − 44)² + (47 − 73)²) = 31.06 (salary taken in thousands)
Distance between Emp2 and Emp3 = √((30 − 27)² + (53 − 47)²) = 6.70 (salary taken in thousands)
Normalization process:
Standardization process
Standardization is also known as z-score Normalization. In standardization,
features are scaled to have zero-mean and one-standard-deviation. It means after
standardization features will have mean = 0 and standard deviation = 1.
Distance between Emp2 and Emp1 = √((−1.51 − 0.95)² + (−1.27 − 1.19)²) = 3.47
Distance between Emp2 and Emp3 = √((−1.07 + 1.51)² + (−0.70 + 1.27)²) = 0.71
After standardization, the comparison is more significant.
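A minimal numpy sketch of this standardization, using the Age and Salary values from the table above (small differences from the hand calculation come from the rounded z-scores used there):

import numpy as np

ages = np.array([44, 27, 30, 38, 40, 35, 48], dtype=float)
salaries = np.array([73000, 47000, 53000, 62000, 57000, 53000, 78000], dtype=float)

# z-score standardization (population standard deviation, ddof=0)
z_age = (ages - ages.mean()) / ages.std()
z_salary = (salaries - salaries.mean()) / salaries.std()

# Distances after standardization (index 0 = Emp1, 1 = Emp2, 2 = Emp3)
d_21 = np.hypot(z_age[1] - z_age[0], z_salary[1] - z_salary[0])
d_23 = np.hypot(z_age[1] - z_age[2], z_salary[1] - z_salary[2])
print(round(d_21, 2), round(d_23, 2))   # roughly 3.5 and 0.72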
array([[0. , 1. ],
[0.2, 0.8],
[0.4, 0.6],
[0.6, 0.4],
[0.8, 0.2],
[1. , 0. ]])
Example 3 - Scaling students marks using min-max scaling
import numpy as np
Output
0 1 2 3 4 5 \
count 208.000000 208.000000 208.000000 208.000000 208.000000 208.000000
mean 0.204011 0.162180 0.139068 0.114342 0.173732 0.253615
std 0.169550 0.141277 0.126242 0.110623 0.140888 0.158843
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.087389 0.067938 0.057326 0.044163 0.079508 0.152714
50% 0.157080 0.129447 0.107753 0.090942 0.141517 0.220236
75% 0.251106 0.202958 0.185447 0.139563 0.237319 0.333042
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
6 7 8 9 ... 50 \
count 208.000000 208.000000 208.000000 208.000000 ... 208.000000
mean 0.320472 0.285114 0.252485 0.281652 ... 0.160047
std 0.167175 0.187767 0.175311 0.192215 ... 0.119607
min 0.000000 0.000000 0.000000 0.000000 ... 0.000000
25% 0.209957 0.165215 0.132571 0.142964 ... 0.083914
50% 0.280438 0.235061 0.214349 0.244673 ... 0.138446
75% 0.407738 0.361852 0.334555 0.368082 ... 0.207420
max 1.000000 1.000000 1.000000 1.000000 ... 1.000000
51 52 53 54 55 56 \
count 208.000000 208.000000 208.000000 208.000000 208.000000 208.000000
mean 0.180031 0.265172 0.290669 0.197061 0.200555 0.213642
std 0.137432 0.183385 0.213474 0.160717 0.147080 0.164361
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.092368 0.118831 0.127924 0.080499 0.102564 0.096591
50% 0.151213 0.235065 0.242690 0.156463 0.165385 0.160511
75% 0.227175 0.374026 0.394737 0.260771 0.260897 0.287642
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
57 58 59
count 208.000000 208.000000 208.000000
mean 0.175035 0.216015 0.136425
std 0.148051 0.170286 0.116190
min 0.000000 0.000000 0.000000
25% 0.075515 0.098485 0.057737
50% 0.125858 0.173554 0.108545
75% 0.229977 0.281680 0.183025
max 1.000000 1.000000 1.000000
Step 1: Calculate the mean (μ) of the dataset [75, 80, 85, 90, 95].
mean = (75 + 80 + 85 + 90 + 95) / 5 = 425 / 5 = 85
Step 2: Calculate the standard deviation (σ) of the dataset.
std_dev = sqrt(((75-85)^2 + (80-85)^2 + (85-85)^2 + (90-85)^2 + (95-85)^2) / 5)
= sqrt((100 + 25 + 0 + 25 + 100) / 5)
= sqrt(250 / 5)
= sqrt(50)
= 7.071
Example 2
John is a researcher for the infant clothes company BABYCLOTHES Inc. To assist in
determining sizes for their baby outfits, they have hired him. The business wants to
launch a new line of extra-small baby garments, but it's unsure of what size to produce.
To estimate the potential size of the market for garments designed for infants weighing
less than 6 pounds, scientists want to know how many premature babies are born
weighing less than 6 pounds.
John can compute the Z-score, which aids in determining the deviation from the mean, if
he discovers that the mean weight for a premature newborn infant is 5 pounds and the
standard deviation is 1.25 pounds.
Answer
Z = (x - μ) / σ,
where x is the data point, μ is the mean, and σ is the standard deviation
Z –score = (6 - 5) / 1.25 = 0.80
# Example dataset
import numpy as np
data = np.array([10, 5, 7, 12, 8])
mean = np.mean(data)
std_dev = np.std(data)

# Z-score normalization
normalized_data = (data - mean) / std_dev
print(normalized_data)
Output
[ 0.66208471 -1.40693001 -0.57932412 1.4896906 -0.16552118]
# Calculate the z-score from with scipy
import scipy.stats as stats
values = [10, 5, 7, 12, 8]
zscores = stats.zscore(values)
print(zscores)
Output
[ 0.66208471 -1.40693001 -0.57932412 1.4896906 -0.16552118]
# stats.zscore() method
import numpy as np
from scipy import stats

arr1 = [[20, 2, 7, 1, 34],
        [50, 12, 12, 34, 4]]
print("arr1 :", arr1)
print("z-scores :", stats.zscore(arr1, axis=1))   # axis chosen for illustration
Output
arr1 : [[20, 2, 7, 1, 34], [50, 12, 12, 34, 4]]
import pandas as pd
df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Joe', 'Mitch', 'Alana'],
    'Age': [32, 30, 67, 34, 20],
    'Income': [80000, 90000, 45000, 23000, 12000],
    'Education': [5, 7, 3, 4, 4]
})
print(df.head())
Output
Name Age Income Education
0 Nik 32 80000 5
1 Kate 30 90000 7
2 Joe 67 45000 3
3 Mitch 34 23000 4
4 Alana 20 12000 4
Output
[[1.         0.        ]
 [0.27272727 0.625     ]
 [0.         1.        ]
 [1.         0.75      ]]

Output
[[ 0.97596444 -1.61155897]
 [-0.66776515  0.08481889]
 [-1.28416374  1.10264561]
 [ 0.97596444  0.42409446]]
zscores = stats.zscore(values)
print(zscores)
Output
[[ 0.97596444 -1.61155897]
[-0.66776515 0.08481889]
[-1.28416374 1.10264561]
[ 0.97596444 0.42409446]]
Adding Constant
Scaling refers to the process of transforming the values of a dataset to fit within a
specific range. It is commonly used when the features of the dataset have different
scales. Two common scaling techniques are adding or subtracting a constant and
multiplying or dividing by a constant.
import numpy as np
import matplotlib.pyplot as plt

# Original data
original_data = np.array([1, 2, 3, 4, 5])

# Add a constant and multiply by a constant (example constants chosen for illustration)
shifted_data = original_data + 10
multiplied_data = original_data * 2

plt.plot(original_data, label='Original')
plt.plot(shifted_data, label='Original + 10')
plt.plot(multiplied_data, label='Original x 2')
plt.legend()
plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Comparison of Data Scaling')
plt.grid(True)
plt.show()
Gaussian distribution
In real life, many datasets can be modeled by Gaussian Distribution (Univariate
or Multivariate). So it is quite natural and intuitive to assume that the clusters come
from different Gaussian Distributions. Or in other words, it tried to model the dataset as
a mixture of several Gaussian Distributions.
A Gaussian distribution, also known as a normal distribution or bell curve, is a
probability distribution that is symmetric and characterized by its mean and standard
deviation. It is named after the German mathematician, Carl Friedrich Gauss. It is one of
the most commonly encountered distributions in statistics and probability theory. Some
common example datasets that follow Gaussian distribution are:
o Body temperature
o People’s Heights
o Car mileage
o IQ scores
The Gaussian distribution has the following properties:
• Symmetry: The distribution is symmetric around its mean, which means that the
probability density function (PDF) is symmetric.
• Mean: The mean (μ) represents the central value or the average of the
distribution.
f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))
where:
o f(x) is the probability density function at a specific value x.
o μ is the mean of the distribution.
o σ is the standard deviation of the distribution.
o π is a mathematical constant (approximately 3.14159).
o exp() is the exponential function.
## plot data: x_data covers the range of interest, y_data is the Gaussian PDF
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
x_data = np.linspace(-5, 5, 200)
y_data = norm.pdf(x_data, 0, 1)   # example: mean 0, standard deviation 1
plt.plot(x_data, y_data, 'green')
plt.show()
Output
Output
Inequality
Income inequality using quantiles and quantile-shares
Inequality analysis involves examining the distribution of a variable and exploring the
disparities or differences in values across different groups or segments of the data.
Quantiles and quantile shares are useful measures for analyzing inequality. Here's how
they can be used:
• Quantiles:
Quantiles divide a dataset into equal-sized groups, representing specific
percentiles of the data. For example, the median represents the 50th percentile,
dividing the data into two equal halves. Other quantiles, such as quartiles (25th,
50th, and 75th percentiles), quintiles (20th, 40th, 60th, and 80th percentiles), or
deciles (10th, 20th, ..., 90th percentiles), divide the data into smaller segments.
• Quantile Shares:
Quantile shares represent the cumulative proportion of a variable's distribution
held by each quantile group. They help in understanding the concentration or
dispersion of values across different parts of the distribution. For example, if the
top 10% of earners in a population hold 50% of the total income, it indicates a
higher level of income concentration. A minimal pandas sketch of these calculations is
shown below.
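A minimal pandas sketch, using a hypothetical income series, of how quantiles and quantile shares can be computed:

import pandas as pd

income = pd.Series([12000, 15000, 18000, 22000, 30000,
                    35000, 42000, 60000, 85000, 150000])   # hypothetical incomes

# Quartiles of the income distribution
print(income.quantile([0.25, 0.50, 0.75]))

# Quantile shares: fraction of total income held by the bottom 25%, 50% and 75% of earners
sorted_income = income.sort_values().reset_index(drop=True)
total = sorted_income.sum()
n = len(sorted_income)
for q in (0.25, 0.50, 0.75):
    share = sorted_income.iloc[:int(n * q)].sum() / total
    print(f"Bottom {int(q * 100)}% hold {share:.1%} of total income")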
Quantile Shares:
0.25 0.067757
0.50 0.088785
0.75 0.126168
Name: Income, dtype: float64
Analysis
• The further the Lorenz curve bows away from the equality line, the greater the income
inequality. The Lorenz curve always lies on or below the equality line: the deeper it sags
below the line, the more income is concentrated among a smaller portion of the
population, whereas a curve close to the equality line indicates income spread fairly
evenly across the population.
• The area between the Lorenz curve and the equality line represents income
inequality. A larger area indicates higher income inequality, while a smaller area
indicates lower inequality.
By plotting the Lorenz curve and comparing it to the equality line, we can visually assess
the level of income inequality in a population. The Lorenz curve provides a
comprehensive overview of income distribution, allowing us to analyze disparities and
evaluate the fairness and equity of income allocation.
Output
Gini coefficient
The Gini coefficient is a widely used measure of income or wealth inequality. It
quantifies the extent of income inequality in a population, ranging from 0 to 1, where 0
represents perfect equality (all individuals have the same income) and 1 represents
maximum inequality (one individual has all the income, while others have none).
The Gini coefficient is derived from the Lorenz curve, which plots the cumulative
income shares against the cumulative population shares. To calculate the Gini
coefficient, you can follow these steps:
• Obtain the cumulative population shares and cumulative income shares from the
Lorenz curve. These represent the x-axis and y-axis values, respectively.
• Calculate the area between the Lorenz curve and the equality line (the diagonal
line). This area is known as the "area between the Lorenz curve and the line of
perfect equality."
• Calculate the area under the equality line (the area of the triangle formed by the
equality line and the two axes).
• Divide the "area between the Lorenz curve and the line of perfect equality" by
the "area under the equality line" to get the Gini coefficient.
• Mathematically, the formula for the Gini coefficient can be expressed as:
G = (A) / (A + B)
where:
G: Gini coefficient
A: Area between the Lorenz curve and the line of perfect equality
B: Area under the Lorenz curve (A + B together make up the area under the equality line)
• The Gini coefficient ranges from 0 to 1, with higher values indicating higher
income inequality.
import numpy as np

def calculate_gini_coefficient(income):
    income = np.asarray(income, dtype=float)
    # Sort the income values in ascending order
    sorted_income = np.sort(income)
    n = len(sorted_income)
    # Cumulative population and income shares, starting from 0 at the origin
    cumulative_population_shares = np.arange(n + 1) / n
    cumulative_income_shares = np.insert(np.cumsum(sorted_income), 0, 0) / sorted_income.sum()
    # Area under the Lorenz curve (trapezoidal rule)
    area_under_lorenz = np.trapz(cumulative_income_shares, cumulative_population_shares)
    # Area between the Lorenz curve and the line of perfect equality (A), where A + B = 0.5
    area_between_curve_and_equality = 0.5 - area_under_lorenz
    # Gini coefficient G = A / (A + B)
    gini_coefficient = area_between_curve_and_equality / 0.5
    return gini_coefficient
Output
Gini coefficient: 0.4666666666666667
Introduction
Data in statistics is sometimes classified according to the number of variables. For example,
“height” might be one variable and “weight” might be another variable. Depending on the
number of variables, the data is classified as univariate, Bivariate, Multivariate.
• Univariate analysis -- Analysis of one (“uni”) variable.
• Bivariate analysis -- Analysis of exactly two variables.
• Multivariate analysis -- Analysis of more than two variables.
In bivariate analysis, the relationship between the explanatory variable and the response
variable is typically analyzed to understand how changes in the explanatory variable affect
the response variable. Statistical
techniques like regression analysis, ANOVA, or correlation analysis are commonly used to
quantify and analyze this relationship.
Scatter plots
A scatter plot is a graphical representation of data points on a Cartesian plane. It
shows the values of two variables as points on the graph, with one variable represented on
the x-axis and the other on the y-axis. Scatter plots helps to visualize the relationship
between the variables. The pattern formed by the data points can provide insights into the
type and strength of the relationship.
Suppose you collect data from a group of individuals, recording the number of hours
they spend watching TV (x-axis) and the amount of time they spend on physical exercise
(y-axis) in a week. By plotting these data points on a scatter plot, you can visually analyze
the relationship between the two variables. If the data points show a cluster towards the
lower end of physical exercise time, it suggests a potential negative relationship. This
implies that individuals who spend more time watching TV tend to engage in less physical
exercise.
# Data collection
tv_hours = [2, 3, 1, 4, 5, 2, 1, 3, 4, 2]                  # Example TV viewing hours
exercise_time = [30, 45, 20, 60, 75, 40, 25, 50, 55, 35]   # Example exercise time (minutes)

# Scatter plot of the two variables
import matplotlib.pyplot as plt
plt.scatter(tv_hours, exercise_time)
plt.show()
Output
Correlation analysis
Correlation analysis measures the strength and direction of the linear relationship
between two variables. It provides a numerical value, called a correlation coefficient, which
quantifies the relationship. The correlation coefficient ranges from -1 to +1. A positive
correlation coefficient indicates a positive relationship, a negative correlation coefficient
indicates a negative relationship, and a correlation coefficient of zero indicates no linear
relationship.
Let's say the correlation coefficient between TV viewing hours and physical exercise
time is -0.6. This negative value indicates a moderate negative correlation. It suggests that
as TV viewing hours increase, there tends to be a decrease in the amount of time spent on
physical exercise. However, it's important to remember that correlation does not imply
causation. The negative correlation does not necessarily mean that watching TV causes a
decrease in physical exercise, but rather that there is an association between the two
variables.
Simple Python program to explain correlation analysis
import numpy as np
from scipy.stats import pearsonr, linregress
import matplotlib.pyplot as plt
# Data collection
tv_hours = [2, 3, 1, 4, 5, 2, 1, 3, 4, 2] # Example TV viewing hours
exercise_time = [30, 45, 20, 60, 75, 40, 25, 50, 55, 35]   # Example exercise time (minutes)

# Bivariate analysis
correlation_coefficient, _ = pearsonr(tv_hours, exercise_time)
slope, intercept, r_value, p_value, std_err = linregress(tv_hours, exercise_time)
plt.scatter(tv_hours, exercise_time)
plt.xlabel("TV Viewing Hours")
plt.ylabel("Exercise Time (minutes)")
plt.title("TV Viewing vs. Exercise Time")
# Regression line
x = np.array(tv_hours)
y = slope * x + intercept
plt.plot(x, y, color='red')
# Results
print("Correlation coefficient:", correlation_coefficient)
print("Slope:", slope)
print("Intercept:", intercept)
print("R-value:", r_value)
print("P-value:", p_value)
print("Standard error:", std_err)
Output
Regression analysis
Regression analysis is used to model the relationship between two variables. It
helps predict the value of one variable based on the known value of another variable. In
bivariate analysis, simple linear regression is commonly used. It assumes a linear
relationship between the variables and estimates the best-fit line that minimizes the
distance between the observed data points and the predicted values.
Suppose the regression analysis suggests a simple linear equation like "Physical
Exercise Time = 150 - 10 * TV Viewing Hours." This equation implies that for each
additional hour spent watching TV, the expected decrease in physical exercise time is 10
minutes. Using this equation, you can estimate the physical exercise time for an individual
based on the number of hours they spend watching TV.
Simple Python program to explain regression analysis
import numpy as np
from scipy.stats import linregress
# Data collection
tv_hours = [2, 3, 1, 4, 5, 2, 1, 3, 4, 2] # Example TV viewing hours
exercise_time = [30, 45, 20, 60, 75, 40, 25, 50, 55, 35]   # Example exercise time (minutes)

# Regression analysis
slope, intercept, r_value, p_value, std_err = linregress(tv_hours, exercise_time)
# Results
print("Slope:", slope)
print("Intercept:", intercept)
print("R-value:", r_value)
print("P-value:", p_value)
print("Standard error:", std_err)
/READNAMES=on
/ASSUMEDSTRWIDTH=32767.
EXECUTE.
DATASET NAME DataSet1 WINDOW=FRONT.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT AGE
/METHOD=ENTER WEIGHIN
/RESIDUALS NORMPROB(ZRESID).
Charts
Proportions
Proportions represent the relative size or share of one category within a total. When
examining the relationship between two variables, you can calculate proportions based on
the occurrence or absence of specific events within each variable.
For example, consider a dataset of students and their preferred subjects: Math and
Science. Calculate the proportion of students who prefer Math and the proportion of
students who prefer Science. These proportions reflect the distribution of preferences
within the dataset.
Percentages
Percentages represent proportions expressed as a fraction of 100. They provide a
way to compare proportions on a standardized scale.
In the previous example, if 60 out of 100 students prefer Math, the proportion of
students preferring Math is 60/100 or 0.6. The percentage of students preferring Math is
0.6 multiplied by 100, which equals 60%. Similarly, you can calculate the percentage of
students preferring Science.
Probabilities
Probabilities quantify the likelihood of an event occurring. In the context of two variables,
probabilities can be used to understand the likelihood of events happening simultaneously
or independently based on the relationship between the variables. Probabilities can help
determine the chances of specific outcomes or events occurring, considering the
relationship between the two variables.
Example 1
To demonstrate how proportions and percentages can be calculated from a students'
database, let's consider a simple example. Assume we have a database containing
information about students' favorite subjects: Math, Science, and English. We want to
calculate the proportion and percentage of students who prefer each subject. Here's an
example database:
Student ID Favorite Subject
1 Math
2 Science
3 Math
4 English
5 Math
6 Science
7 Science
8 English
9 English
10 Math
Solution
To calculate the proportions and percentages, we need to determine the number of
students who prefer each subject and divide it by the total number of students.
Step 1: Count the number of students who prefer each subject:
Math: 4 students
Science: 3 students
English: 3 students
Step 2: Calculate the proportions:
Proportion of students who prefer Math: 4/10 = 0.4
Proportion of students who prefer Science: 3/10 = 0.3
Proportion of students who prefer English: 3/10 = 0.3
Step 3: Calculate the percentages:
Percentage of students who prefer Math: 0.4 * 100 = 40%
Percentage of students who prefer Science: 0.3 * 100 = 30%
Percentage of students who prefer English: 0.3 * 100 = 30%
So, in this example, 40% of the students prefer Math, 30% prefer Science, and 30% prefer
English.
By calculating proportions and percentages, we can gain insights into the distribution of
favorite subjects among students in the database. This information can help in
understanding the preferences and patterns within the student population.
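A minimal pandas sketch of these proportion and percentage calculations:

import pandas as pd

subjects = ['Math', 'Science', 'Math', 'English', 'Math',
            'Science', 'Science', 'English', 'English', 'Math']

counts = pd.Series(subjects).value_counts()
proportions = counts / counts.sum()      # equivalently value_counts(normalize=True)
print(counts)                            # Math 4, Science 3, English 3
print(proportions)                       # 0.4, 0.3, 0.3
print(proportions * 100)                 # 40%, 30%, 30%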
Example 2
Let's work through a solved example of analyzing a frequency table related to gender
(male/female) and job satisfaction (satisfied/dissatisfied) data. We collected data on job
satisfaction from a sample of 200 employees.
Frequency table

                    Job Satisfaction
                    Satisfied   Dissatisfied
Gender   Male          80            30
         Female        70            20
Step 2: Calculate the totals
Analyzing the table, we find 80 males are satisfied with their job, 30 males are dissatisfied,
70 females are satisfied, and 20 females are dissatisfied.
• Calculate the row totals. The row totals represent the total number of employees for
each gender.
Row total for Male = 80 + 30 = 110
Row total for Female = 70 + 20 = 90
• Calculate column totals: The column totals represent the total number of employees
for each job satisfaction level.
Column total for Satisfied = 80 + 70 = 150
Column total for Dissatisfied = 30 + 20 = 50
Analysis
By calculating the probabilities, the probability of being satisfied is higher for males
(80/200 = 40%) than for females (70/200 = 35%), and the probability of being dissatisfied is
also higher for males (30/200 = 15%) than for females (20/200 = 10%).
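A minimal pandas sketch of these probability calculations from the frequency table:

import pandas as pd

# Frequency table from the example (rows: gender, columns: job satisfaction)
freq = pd.DataFrame({'Satisfied': [80, 70], 'Dissatisfied': [30, 20]},
                    index=['Male', 'Female'])

# Joint probabilities: each cell divided by the grand total of 200
print(freq / freq.values.sum())

# Conditional probabilities of satisfaction within each gender
print(freq.div(freq.sum(axis=1), axis=0))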
Output
Syntax
FREQUENCIES VARIABLES=Age Treatment
/STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM SEMEAN
/HISTOGRAM
/ORDER=ANALYSIS.
Frequencies
Frequency Table
Histogram
Step 2: Create the Contingency Table: Create a table with rows representing one variable
and columns representing the other variable. Each cell in the table will contain the
frequency or count of observations falling into that particular combination of categories.
• You collect data from a sample of 200 employees and create a contingency table to
analyze the relationship.
• Construct the table with gender as rows and job satisfaction as columns.
Contingency table

                    Job Satisfaction
                    Satisfied   Dissatisfied
Gender   Male          60            40
         Female        70            30
Step 3: Calculate Row and Column Totals: Add row and column totals to the contingency
table
Contingency table

                    Job Satisfaction
                    Satisfied   Dissatisfied   TOTAL
Gender   Male          60            40         100
         Female        70            30         100
TOTAL                 130            70         200
Step 4: Calculate Expected Frequencies (Optional): If you want to assess whether the
observed frequencies deviate significantly from what would be expected, you can calculate
the expected frequencies.
• Expected frequency for the "Male-Satisfied" cell = (100 × 130) / 200 = 65.
• Expected Frequency for "Male-Dissatisfied" cell = (100 × 70) / 200 = 35.
• Expected frequency for the "Female-Satisfied" cell = (100 × 130) / 200 = 65.
• Expected Frequency for "Female-Dissatisfied" cell = (100 × 70) / 200 = 35.
Step 5: Interpret the Observed Frequencies: Examine the observed frequencies in the
contingency table. Look for any patterns or differences between the levels of the variables,
as they may indicate a significant relationship.
• In our example, we can observe that there are 60 males who are satisfied with their
job and 70 females who are satisfied. Similarly, there are 40 males and 30 females
who are dissatisfied.
• The chi-square test is commonly used to analyze contingency tables and determine
if there is a significant association between two categorical variables. The test
compares the observed frequencies in the contingency table with the expected
frequencies.
• The chi-square test assesses whether the observed frequencies differ significantly
from the expected frequencies. It provides a statistical measure of the association
between the variables.
• Here's how you can conduct a chi-square test using a contingency table (a Python sketch is shown after Step 9 below):
• If the chi-square statistic is greater than the critical value or the p-value is less than
the significance level (p < α), reject the null hypothesis. This indicates that there is
evidence of an association between gender and job satisfaction.
• If the chi-square statistic is smaller than the critical value or the p-value is greater
than the significance level (p > α), fail to reject the null hypothesis.
Step 9: Visualize the Contingency Table: To further explore the relationship between the
variables, you can create visual representations of the contingency table, such as stacked
bar charts or heatmaps.
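A minimal Python sketch of the chi-square test described above, using the contingency table from Step 2 (note that scipy applies Yates' continuity correction to 2x2 tables by default):

from scipy.stats import chi2_contingency

observed = [[60, 40],
            [70, 30]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print("Chi-square statistic:", chi2)
print("p-value:", p_value)
print("Degrees of freedom:", dof)
print("Expected frequencies:", expected)   # matches the expected counts in Step 4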
• 0 cells (0.0%) have expected count less than 5. The minimum expected count is
10.56.
• Computed only for a 2x2 table
• DO IF syntax
Input Variable -> Output Variable: The center text box lists the variable(s) you have
selected to recode, as well as the name your new variable(s) will have after the recode
Output Variable: Define the name and label for your recoded variable(s) by typing them in
the text fields. Once you are finished, click Change. Here, we have renamed “Age” to
“Patient_Age”.
Syntax
Syntax
When applying these techniques, it is important to consider the context and objectives of
the analysis. Be mindful of preserving the integrity of the data and ensure that the recoded,
reordered, or collapsed categories accurately represent the underlying information.
Construction of a boxplot
• The median (50th percentile) of the dataset is represented by a horizontal line
inside a box.
• The box extends from the lower quartile (25th percentile) to the upper quartile
(75th percentile), indicating the interquartile range (IQR).
• Whiskers, represented by vertical lines, extend from the box's edges to the furthest
data points within a certain range. The exact range depends on the specific rules
used for whisker calculation.
• Data points outside the whiskers are considered outliers and are typically plotted
individually.
Interpretation of a boxplot
• Center: The median (line inside the box) represents the dataset's central tendency. If
the median is closer to the bottom of the box, the data is skewed towards lower
values, while if it's closer to the top, the data is skewed towards higher values.
• Spread: The length of the box (IQR) gives an indication of the data's dispersion. A
longer box implies greater variability, while a shorter box indicates less variability.
• Skewness: The symmetry of the boxplot helps identify skewness in the data. If the
median line is not in the center of the box, the data may be skewed.
• Outliers: Data points plotted individually outside the whiskers are considered
outliers and may indicate unusual or extreme observations.
• Boxplots are particularly useful when comparing distributions between different
groups or variables. By placing multiple boxplots side by side, you can visually
compare their central tendencies, spreads, and skewness.
In summary, boxplots offer a visual summary of key statistical characteristics of a dataset,
making them a valuable tool for data exploration and initial analysis.
# Example dataset
import matplotlib.pyplot as plt
scores = [55, 58, 60, 62, 65, 66, 68,
          70, 72, 73, 74, 75, 76, 77,
          78, 79, 80, 81, 82, 83, 85,
          86, 87, 88, 89, 90, 91, 92, 93, 95]

# Boxplot of the exam scores
plt.boxplot(scores)
plt.show()
Output
To create a box plot to visualize the distribution of these data values, we can click the
Analyze tab, then Descriptive Statistics, then Explore:
To create a box plot, drag the variable points into the box labelled Dependent List. Then
make sure Plots is selected under the option that says Display near the bottom of the box.
Outliers
In statistics, outliers are data points that deviate significantly from the rest of the
dataset. These observations are considered to be unusual, extreme, or inconsistent with the
overall pattern or distribution of the data.
Outliers can occur due to various reasons, including measurement errors, data entry
mistakes, natural variability, or genuinely unusual observations. They can have a
significant impact on statistical analysis, as they can skew results and affect the validity of
assumptions made about the data.
Identifying outliers is important because they can distort statistical measures such
as the mean and standard deviation, as well as affect data modeling and analysis. Outliers
can be detected through various methods, such as graphical exploration, statistical tests, or
using domain knowledge.
Outliers can be treated by removing them, transforming them, or treating them as missing
values, depending on the circumstances and impact on the analysis. In summary, outliers
are data points that are significantly different from the rest of the dataset and can have an
impact on statistical analysis. Detecting and managing outliers is crucial to ensure accurate
and reliable results.
# Example dataset
import matplotlib.pyplot as plt
data = [10, 12, 15, 16, 17, 18, 20, 21, 22, 100]

# Boxplot with customized outlier markers (see the Note below on flierprops)
plt.boxplot(data, flierprops=dict(marker='o', markerfacecolor='red', markersize=8))
plt.show()
Output
Note:
The flierprops parameter is used to customize the appearance of the outliers. In this example, we set the
marker style to a red circle ('o'), the marker face color to red, and the marker size to 8 .
Dependents
“Dependents together” means that all dependent variables are shown together in each
boxplot. If you enter a factor -say, sex- you'll get a separate boxplot for each factor level -
female and male respondents. “Factor levels together” creates a separate boxplot for each
dependent variable, showing all factor levels together in each boxplot.
“Exclude cases pairwise” means that the results for each variable are based on all cases
that don't have a missing value for that variable. “Exclude cases listwise” uses only cases
without any missing values on all variables.
Syntax
• The first column tells how many cases were used for each variable.
• Note that trial 5 has N = 205 or 86.1% missing values.
T - Test
A T-test is a statistical method of comparing the means or proportions of two samples
gathered from either the same group or different categories. It is aimed at hypothesis
testing, which is used to test a hypothesis pertaining to a given population. It evaluates
whether the difference between group means (or between a sample mean and a hypothesized
value) is statistically significant.
There are several types of t-tests, but the most commonly used ones are:
• Independent samples t-test: This test compares the means of two independent
groups to determine if there is a significant difference between them. It assumes
that the two groups are independent of each other.
• Paired samples t-test: This test compares the means of two related groups, where
each observation in one group is paired with an observation in the other group. It is
used when the same individuals or objects are measured before and after an
intervention or treatment.
• One-sample t-test: This test compares the mean of a single group to a known
population mean or a hypothesized value. It is used to determine if the observed
mean significantly differs from a specific value.
One-sample, two-sample, paired, equal, and unequal variance are the types of T-tests users
can use for mean comparisons.
Solution
Step 1: Sum the two groups:
A: 1 + 2 + 2 + 3 + 3 + 4 + 4 + 5 + 5 + 6 = 35
B: 1 + 2 + 4 + 5 + 5 + 5 + 6 + 6 + 7 + 9 = 50
Step 4: Square the individual scores and then add them up:
A: 1² + 2² + 2² + 3² + 3² + 4² + 4² + 5² + 5² + 6² = 145
B: 1² + 2² + 4² + 5² + 5² + 5² + 6² + 6² + 7² + 9² = 298
Step 5: Insert your numbers into the following formula and solve:
t = -1.69
Step 7: Look up your degrees of freedom (Step 6) in the t-table. If you don’t know what
your alpha level is, use 5% (0.05).
18 degrees of freedom at an alpha level of 0.05 = 2.10.
Step 8: Compare your calculated value (Step 5) to your table value (Step 7). The absolute
value of the calculated statistic, |−1.69| = 1.69, is less than the critical value of 2.10 from the
table, so p > .05. Because the p-value is greater than the alpha level, we fail to reject the null
hypothesis that there is no difference between the means.
Click on Analyze -> Compare Means -> Independent-Samples T Test. This will bring up the
following dialog box.
To perform the t test, we’ve got to get our dependent variable (Frisbee Throwing Distance)
into the Test Variable(s) box, and our grouping variable (Dog Owner) into the Grouping
Variable box. To move the variables over, you can either drag and drop, or use the arrows,
as above.
You’ll notice that the Grouping Variable, DogOwner, has two question marks in brackets
after it. This indicates that you need to define the groups that make up the grouping
variable. Click on the Define Groups button.
We’re using 0 and 1 to specify each group, 0 is No Dog; and 1 is Owns Dog.
The first thing to note is the mean values in the Group Statistics table. Here you can see that
on average people who own dogs throw frisbees further than people who don’t own dogs
(54.92 metres as against only 40.12 metres).
SPSS is reporting a t value of -3.320 and a 2-tailed p-value of .003. This would almost
always be considered a significant result (standard alpha levels are .05 and .01). Therefore,
we can be confident in rejecting the null hypothesis that holds that there is no difference
between the frisbee throwing abilities of dog owners and non-owners.
The One Sample t Test examines whether the mean of a population is statistically different
from a known or hypothesized value. The One Sample t Test is a parametric test.
In a One Sample t Test, the test variable's mean is compared against a "test value", which is
a known or hypothesized value of the mean in the population.
Example:
A particular factory's machines are supposed to fill bottles with 150 milliliters of product. A
plant manager wants to test a random sample of bottles to ensure that the machines are
not under- or over-filling the bottles.
Note: The One Sample t Test can only compare a single sample mean to a specified
constant. It cannot compare sample means between two or more groups. If you wish to
compare the means of multiple groups to each other, you will likely want to run an
Independent Samples t Test (to compare the means of two groups) or a One-Way ANOVA
(to compare the means of two or more groups).
Data Requirements
• Test variable that is continuous (i.e., interval or ratio level)
• Scores on the test variable
• Random sample of data from the population
• No outliers
Hypotheses
The null hypothesis (H0) and (two-tailed) alternative hypothesis (H1) of the one sample T
test can be expressed as:
• H0: µ = µ0 ("the population mean is equal to the [proposed] population mean")
• H1: µ ≠ µ0 ("the population mean is not equal to the [proposed] population mean")
where µ is the "true" population mean and µ0 is the proposed value of the population
mean.
Test statistic
The test statistic for a One Sample t Test is denoted t, which is calculated using the
following formula:
t = (x̄ − μ0) / (s / √n)
where:
μ0 = The test value -- the proposed constant for the population mean
𝑥̅ = Sample mean
n = Sample size (i.e., number of observations)
s = Sample standard deviation
The calculated t value is then compared to the critical t value from the t distribution table
with degrees of freedom df = n - 1 and chosen confidence level. If the calculated t value >
critical t value, then we reject the null hypothesis.
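A minimal Python sketch of a one-sample t test (the sample heights below are hypothetical; the actual dataset described next has 435 observations):

from scipy import stats

heights = [64.2, 67.5, 66.1, 68.3, 65.0, 66.8, 67.9, 63.7, 66.4, 68.8]   # hypothetical sample

t_stat, p_value = stats.ttest_1samp(heights, popmean=66.5)
print("t =", t_stat, "p =", p_value)
# If p < 0.05, the sample mean differs significantly from the test value of 66.5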
Problem Statement
The mean height of adults ages 20 and older is about 66.5 inches (69.3 inches for males, 63.8
inches for females).
In our sample data, we have a sample of 435 college students from a single college. Let's
test if the mean height of students at this college is significantly different than 66.5 inches
using a one-sample t test. The null and alternative hypotheses of this test will be:
To add vertical reference lines at the mean (or another location), double-click on the plot to
open the Chart Editor, then click Options > X Axis Reference Line. In the Properties
window, you can enter a specific location on the x-axis for the vertical line, or you can
choose to have the reference line at the mean or median of the sample data (using the
sample data). Click Apply to make sure your new line is added to the chart.
Here, we have added two reference lines: one at the sample mean (the solid black
line), and the other at 66.5 (the dashed red line).
output
Tables
Here are a few key uses and interpretations of the regression line in a scatter plot:
• Trend Identification: The regression line helps identify the general trend or
direction of the relationship between the variables. If the line slopes upwards from
left to right, it suggests a positive relationship, indicating that as the independent
variable increases, the dependent variable tends to increase as well. Conversely, a
downward-sloping line indicates a negative relationship.
• Prediction: The regression line can be used for predicting the dependent variable
(y) based on a given value of the independent variable (x). By plugging in an x-value
into the equation of the line, you can estimate the corresponding y-value. However,
it's important to note that predictions become less reliable as you move further
away from the observed data points.
• Strength of Relationship: The steepness or slope of the regression line provides
information about the strength of the relationship between the variables. A steeper
line indicates a stronger association between x and y, while a shallower line
suggests a weaker relationship.
• Outlier Detection: The regression line can help identify outliers or data points that
deviate significantly from the overall trend. Points that fall far away from the
regression line may represent unusual or exceptional observations that warrant
further investigation.
• Model Evaluation: The regression line also serves as a benchmark for evaluating the
goodness-of-fit of a regression model. Various statistical measures, such as the
coefficient of determination (R-squared), can be used to assess how well the
regression line represents the data and the proportion of the variation in the
dependent variable that is explained by the independent variable.
It's important to note that the regression line represents the average relationship between
the variables and may not perfectly capture the behavior of individual data points.
Additionally, other types of regression models, such as polynomial regression or multiple
regression, can be used to capture more complex relationships between variables.
Overall, the regression line in a scatter plot provides a visual representation and
estimation of the relationship between two variables, helping to understand patterns, make
predictions, and evaluate the strength of the association.
Syntax
Output
For adding a regression line, first double click the chart to open it in a Chart Editor window.
Next, click the “Add Fit Line at Total” icon as shown below.
Salary′=9310+449⋅Hours
# Example data
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]

# Fit and plot the regression line over the scatter plot
slope, intercept, r_value, p_value, std_err = linregress(x, y)
plt.scatter(x, y, label='Data Points')
plt.plot(x, intercept + slope*np.array(x), color='red', label='Regression Line')

# Add legend
plt.legend()
plt.show()
Note:
In this example, we have two arrays, x and y, representing the data points for the scatter
plot. We then use the linregress() function from SciPy to perform linear regression on the x
and y data.
Next, we add the regression line using plt.plot(x, intercept + slope*np.array(x), color='red',
label='Regression Line'). This line is created by evaluating the equation y = intercept +
slope*x for each x value.
Inferential Statistics
Inferential statistics is a branch of statistics that makes the use of various analytical tools to
draw inferences about the population data from sample data. The purpose of descriptive
and inferential statistics is to analyze different types of data using different tools.
Descriptive statistics helps to describe and organize known data using charts, bar graphs,
etc., while inferential statistics aims at making inferences and generalizations about the
population data.
Descriptive statistics allow you to describe a data set, while inferential statistics allow you
to make inferences based on a data set. The samples chosen in inferential statistics need to
be representative of the entire population.
There are two main types of inferential statistics - hypothesis testing and regression
analysis.
• Hypothesis Testing - This technique involves the use of hypothesis tests such as the
z test, f test, t test, etc. to make inferences about the population data. It requires
setting up the null hypothesis, alternative hypothesis, and testing the decision
criteria.
• Regression Analysis - Such a technique is used to check the relationship between
dependent and independent variables. The most commonly used type of regression
is linear regression.
T-test
• Sample size is less than 30 and the data set follows a t-distribution.
• The population variance is not known to the researcher.
o 𝑁𝑢𝑙𝑙 𝐻𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠: 𝐻0: 𝜇 = 𝜇0
o 𝐴𝑙𝑡𝑒𝑟𝑛𝑎𝑡𝑒 𝐻𝑦𝑝𝑜𝑡ℎ𝑒𝑠𝑖𝑠: 𝐻1: 𝜇 > 𝜇0
F-test
• Checks whether a difference between the variances of two samples or populations
exists or not.
Multivariate Analysis:
Multivariate analysis refers to statistical techniques used to analyze and understand
relationships between multiple variables simultaneously. It involves exploring patterns,
dependencies, and associations among variables in a dataset. Some commonly used
multivariate analysis techniques include the following (a short PCA sketch is given after the list):
• Multivariate Regression Analysis: Extends simple linear regression to analyze the
relationship between multiple independent variables and a dependent variable.
• Principal Component Analysis (PCA): Reduces the dimensionality of a dataset by
transforming variables into a smaller set of uncorrelated variables called principal
components.
• Factor Analysis: Examines the underlying factors or latent variables that explain the
correlations among a set of observed variables.
• Cluster Analysis: Identifies groups or clusters of similar observations based on the
similarity of their attributes.
• Discriminant Analysis: Differentiates between two or more predefined groups based
on a set of predictor variables.
• Canonical Correlation Analysis: Analyzes the relationship between two sets of
variables to identify the underlying dimensions that are shared between them.
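A minimal sketch of one of these techniques, principal component analysis, using scikit-learn on a small made-up dataset (the variable values are purely illustrative):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical observations of three correlated variables (age, income, education years)
X = np.array([[25, 30000, 12],
              [32, 45000, 14],
              [40, 60000, 16],
              [28, 38000, 13],
              [51, 80000, 18],
              [45, 70000, 16]], dtype=float)

# Standardize each column first so that no single variable dominates
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

pca = PCA(n_components=2)
components = pca.fit_transform(X_std)
print(components)                        # observations projected onto two principal components
print(pca.explained_variance_ratio_)     # share of variance captured by each component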
Approach:
Descriptive Analysis:
• Calculate descriptive statistics for each variable, including measures like mean,
median, standard deviation, and frequency distributions.
• Examine the distributions of age and income using histograms or boxplots to
identify any outliers or unusual patterns.
• Create cross-tabulations or contingency tables to explore the distribution of
education level across different age groups or income brackets.
Correlation Analysis:
• Calculate the correlation coefficients (e.g., Pearson correlation, Spearman
correlation) between age, income, and education level.
• Interpret the correlation coefficients to determine the strength and direction of the
relationships between variables.
• Visualize the relationships using a correlation matrix or a heatmap to identify any
significant associations.
Regression Analysis:
• Perform multivariate regression analysis to assess the impact of age and education
level on income.
• Set income as the dependent variable and age and education level as independent
variables.
• Interpret the regression coefficients to understand how each independent variable
influences the dependent variable.
• Assess the overall model fit and statistical significance of the regression model.
Multivariate Visualization:
• Create scatter plots or bubble plots to visualize the relationship between age,
income, and education level.
• Use different colors or symbols to represent different education levels and examine
if there are distinct patterns or trends.
Further Analysis:
• Consider additional multivariate techniques such as factor analysis or cluster
analysis to explore underlying dimensions or groups within the data.
• Conduct subgroup analyses or interaction analyses to investigate if the relationships
differ across different demographic groups or educational backgrounds.
Causal explanations
Causal explanations aim to understand the cause-and-effect relationships between
variables and explain why certain outcomes occur. They involve identifying the factors or
conditions that influence a particular outcome and determining the mechanisms through
which they operate.
Causal explanations are important in various fields, including social sciences,
economics, psychology, and epidemiology, among others. They help researchers
understand the fundamental drivers of phenomena and develop interventions or policies to
bring about desired outcomes.
Some key aspects and approaches to consider when seeking causal explanations:
Association vs. Causation:
It's crucial to differentiate between mere associations or correlations between variables
and actual causal relationships. Correlation does not imply causation, and establishing
causality requires rigorous evidence, such as experimental designs or well-designed
observational studies that account for potential confounding factors.
Establishing Causality:
Several criteria need to be considered when establishing causality, such as temporal
precedence (the cause precedes the effect in time), covariation (the cause and effect vary
together), and ruling out alternative explanations.
Simple Scenario:
We want to investigate whether exercise has a causal effect on weight loss. We hypothesize
that regular exercise leads to a reduction in weight.
Explanation:
To establish a causal explanation, we would need to conduct a study that meets the criteria
for establishing causality, such as a randomized controlled trial (RCT). In this hypothetical
RCT, we randomly assign participants to two groups:
• Experimental Group: Participants in this group are instructed to engage in a
structured exercise program, such as 30 minutes of moderate-intensity aerobic
exercise five times a week.
• Control Group: Participants in this group do not receive any specific exercise
instructions and maintain their usual daily activities.
The study is conducted over a period of three months, during which the weight of each
participant is measured at the beginning and end of the study. The data collected are as
follows:
[Table of individual weight measurements for the Experimental and Control groups at the start and end of the study; not shown.]
Analysis:
We compare the average weight loss between the experimental and control groups.
The results show that the experimental group had an average weight loss of 4 kg, while the
control group had an average weight loss of only 1 kg. The difference in average weight loss
between the groups suggests that regular exercise has a causal effect on weight loss.
Additionally, we can use statistical tests, such as t-tests or analysis of variance
(ANOVA), to determine if the observed difference in weight loss between the groups is
statistically significant. If the p-value is below a predetermined significance level (e.g., p <
0.05), we can conclude that the difference is unlikely due to chance alone and provides
further evidence for a causal relationship.
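A minimal sketch of such a test using scipy (the individual weight-loss values are hypothetical and chosen only so that the group means match the averages above):
from scipy import stats
# Hypothetical weight loss (kg) per participant in each group
experimental = [5, 4, 3, 5, 4, 4, 3, 4]   # mean = 4 kg
control = [1, 0, 2, 1, 1, 2, 0, 1]        # mean = 1 kg
# Independent-samples t-test comparing the two group means
t_stat, p_value = stats.ttest_ind(experimental, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, the difference in weight loss is unlikely to be due to chance alone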
Example
Let's consider the variables "Gender" (Male/Female), "Education Level" (High
school/College/Graduate), and "Income Level" (Low/Medium/High). We want to explore if
there is an association between gender, education level, and income level.
A contingency table for this example (here showing Education Level by Income Level; in the full three-variable table, a separate panel would be produced for each gender) might look like:
                         Income Level
Education Level      Low   Medium   High
High School           20       40     30
College               30       50     40
Graduate              10       20     30
From this contingency table, we can analyze the relationship between these variables. For
example:
• Conditional Relationships: We can examine the relationship between gender and
income level, conditional on education level. This can be done by comparing the
income level distribution for males and females within each education level
category.
• Marginal Relationships: We can examine the relationship between gender and
education level, and between education level and income level separately by looking
at the marginal distributions of the variables.
• Assessing Dependency: We can perform statistical tests, such as the chi-square test,
to determine if there is a statistically significant association between the variables.
This helps assess the dependency and provides insights into potential causal
explanations.
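A short sketch of such a chi-square test on the table above, using scipy:
from scipy.stats import chi2_contingency
# Observed counts from the Education Level x Income Level table above
observed = [[20, 40, 30],   # High School
            [30, 50, 40],   # College
            [10, 20, 30]]   # Graduate
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p-value = {p:.4f}, degrees of freedom = {dof}")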
Crosstabs is just another name for contingency tables, which summarize the relationship
between different categorical variables. Crosstabs in SPSS can help you visualize the
proportion of cases in subgroups.
• To describe a single categorical variable, we use frequency tables.
• To describe the relationship between two categorical variables, we use a special
type of table called a cross-tabulation (or "crosstab")
o Categories of one variable determine the rows of the table
o Categories of the other variable determine the columns
o The cells of the table contain the number of times that a particular
combination of categories occurred.
A "square" crosstab is one in which the row and column variables have the same number of
categories. Tables of dimensions 2x2, 3x3, 4x4, etc. are all square crosstabs.
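For readers working in Python rather than SPSS, pandas provides an equivalent crosstab function; a minimal sketch with hypothetical raw data:
import pandas as pd
# Hypothetical raw data: one row per respondent
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
                   'Education': ['College', 'Graduate', 'High School',
                                 'College', 'College', 'Graduate'],
                   'Income': ['Medium', 'High', 'Low', 'High', 'Medium', 'High']})
# Two-way crosstab: Education in the rows, Income in the columns
print(pd.crosstab(df['Education'], df['Income']))
# Layered ("three-way") crosstab: a separate Education x Income breakdown per Gender
print(pd.crosstab([df['Gender'], df['Education']], df['Income']))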
Example 1, Example 2, Example 3: [figures not shown]
The lettered items (A–F) below refer to elements of the SPSS Crosstabs dialog window:
• A → Row(s): One or more variables to use in the rows of the crosstab(s). You must
enter at least one Row variable.
• B →Column(s): One or more variables to use in the columns of the crosstab(s). You
must enter at least one Column variable.
• C → Layer: An optional "stratification" variable. When a layer variable is specified,
the crosstab between the Row and Column variable(s) will be created at each level
of the layer variable. You can have multiple layers of variables by specifying the first
layer variable and then clicking Next to specify the second layer variable.
• D → Statistics: Opens the Crosstabs: Statistics window, which contains fifteen
different inferential statistics for comparing categorical variables.
• E → Cells: Opens the Crosstabs: Cell Display window, which controls which output is
displayed in each cell of the crosstab.
• F → Format: Opens the Crosstabs: Table Format window, which specifies how the
rows of the table are sorted.
Syntax
CROSSTABS
/TABLES=RankUpperUnder BY LiveOnCampus BY State_Residency
/FORMAT=AVALUE TABLES
/CELLS=COUNT
/COUNT ROUND CELL.
Output
Again, the Crosstabs output includes the Case Processing Summary box and the crosstabulation itself.
Notice that after including the layer variable State Residency, the number of valid cases we
have to work with has dropped from 388 to 367. This is because the crosstab requires
nonmissing values for all three variables: row, column, and layer.
The layered crosstab shows the individual Rank by Campus tables within each level of State
Residency. Some observations we can draw from this table include:
• A slightly higher proportion of out-of-state underclassmen live on campus (30/43)
than do in-state underclassmen (110/168).
• There were about equal numbers of out-of-state upper and underclassmen; for in-
state students, the underclassmen outnumbered the upperclassmen.
• Of the nine upperclassmen living on-campus, only two were from out of state.
Temporal Dependencies:
• Time series data often exhibits temporal dependencies, where each observation is
influenced by previous observations.
• Understanding these dependencies is crucial for analyzing and forecasting time
series data accurately.
• Seasonality: The recurring patterns or cycles that occur at fixed time intervals.
o Example: Monthly Sales of Ice Cream
o Sales of ice cream are higher during the summer months compared to the
rest of the year, showing a seasonal pattern.
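A short sketch of how such a seasonal pattern can be examined with statsmodels (the monthly sales values below are simulated purely for illustration):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
# Simulated monthly ice-cream sales for four years, peaking in the summer months
months = pd.date_range('2019-01-01', periods=48, freq='M')
sales = 100 + 30 * np.sin(2 * np.pi * (months.month - 4) / 12) + np.random.normal(0, 5, 48)
series = pd.Series(sales, index=months)
# Split the series into trend, seasonal, and residual components
result = seasonal_decompose(series, model='additive', period=12)
result.plot()
plt.show()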
Data cleaning
Data cleaning is the process of identifying and correcting inaccurate records from a dataset
along with recognizing unreliable or irrelevant parts of the data.
Handling Missing Values:
• Identify missing values in the time series data.
• Decide on an appropriate method to handle missing values, such as interpolation,
forward filling, or backward filling.
• Use pandas or other libraries to fill or interpolate missing values.
Outlier Detection and Treatment:
• Identify outliers in the time series data that may be caused by measurement errors
or anomalies.
• Use statistical techniques, such as z-score or modified z-score, to detect outliers.
• Decide on the treatment of outliers, such as removing them, imputing them with a
reasonable value, or replacing them using smoothing techniques.
Handling Duplicates:
• Check for duplicate entries in the time series data.
• Remove or handle duplicate values appropriately based on the specific
requirements of the analysis.
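A minimal sketch of the missing-value handling step, which reproduces the output shown below (the sample values are assumed):
import numpy as np
import pandas as pd
# Sample daily series with two missing values
dates = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Value': [10, np.nan, 12, 18, np.nan]}, index=dates)
df.index.name = 'Date'
print("Original Data:")
print(df)
# Fill missing values by linear interpolation
# (the trailing NaN is filled with the last valid value)
interpolated = df.interpolate()
print("\nInterpolated Data:")
print(interpolated)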
Output
Original Data:
Value
Date
2021-01-01 10.0
2021-01-02 NaN
2021-01-03 12.0
2021-01-04 18.0
2021-01-05 NaN
Interpolated Data:
Value
Date
2021-01-01 10.0
2021-01-02 11.0
2021-01-03 12.0
2021-01-04 18.0
2021-01-05 18.0
import pandas as pd
import matplotlib.pyplot as plt
# Sample daily data with one extreme value (values assumed to match the output below)
dates = pd.date_range('2021-01-01', periods=5, freq='D')
df = pd.DataFrame({'Value': [10, 15, 12, 100, 20]}, index=dates)
df.index.name = 'Date'
# Flag values above an illustrative threshold as outliers
# (in practice the threshold would come from a z-score or similar rule)
threshold = 50
outliers = df[df['Value'] > threshold]
# Remove outliers
df_cleaned = df[df['Value'] <= threshold]
# Plot the original and cleaned time series data with outliers highlighted
plt.figure(figsize=(8, 4))
plt.plot(df.index, df['Value'], label='Original', color='blue')
plt.scatter(outliers.index, outliers['Value'], color='red', label='Outliers')
plt.plot(df_cleaned.index, df_cleaned['Value'], label='Cleaned', color='green')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Outlier Detection and Treatment')
plt.legend()
plt.show()
print("Original Data:")
print(df)
print("\nCleaned Data:")
print(df_cleaned)
Output
Original Data:
Value
Date
2021-01-01 10
2021-01-02 15
2021-01-03 12
2021-01-04 100
2021-01-05 20
Cleaned Data:
Value
Date
2021-01-01 10
2021-01-02 15
2021-01-03 12
2021-01-05 20
Time-based indexing
Time-based indexing refers to the process of organizing and accessing data based on
timestamps or time intervals. It involves assigning timestamps to data records or events
and utilizing these timestamps to efficiently retrieve and manipulate the data.
In time-based indexing, each data record or event is associated with a timestamp
indicating when it occurred. The timestamps can be precise points in time or time intervals,
depending on the granularity required for the application. The data is then organized and
indexed based on these timestamps, enabling quick and efficient access to specific time
ranges or individual timestamps.
Time-based indexing is commonly used in various domains that involve time-series
data or events, such as financial markets, scientific research, IoT (Internet of Things)
applications, system monitoring, and social media analysis.
In the context of TSA (Time Series Analysis), time-based indexing refers to the practice of
organizing and accessing time-series data based on the timestamps associated with each
observation. TSA involves analyzing and modeling data that is collected over time, and
time-based indexing plays a crucial role in effectively working with such data.
Time-based indexing allows for efficient retrieval and manipulation of time-series data,
enabling various operations such as subsetting, filtering, and aggregation based on specific
time periods or intervals.
• Pandas: Pandas provides the DateTimeIndex object, which allows for indexing and
manipulation of time-series data. It offers a wide range of time-based operations,
such as slicing by specific time periods, resampling at different frequencies, and
handling missing or irregular timestamps.
Time-based indexing in TSA is essential for conducting exploratory data analysis, fitting
time-series models, forecasting future values, and evaluating model performance.
Slicing: Slicing involves retrieving a subset of data within a specific time range. With time-
based indexing, you can easily slice the time-series data based on specific dates, times, or
time intervals.
Example:
import pandas as pd
# Sample data (values assumed; they match the outputs used throughout this section)
data = {'timestamp': pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04']),
        'value': [10, 15, 12, 18]}
df = pd.DataFrame(data).set_index('timestamp')
# Retrieve data between two specific dates (slice chosen to match the output below)
subset = df['2023-01-02':'2023-01-03']
print(subset)
Output
value
timestamp
2023-01-02 15
2023-01-03 12
Resampling: Resampling involves changing the frequency of the time-series data. You can
upsample (increase frequency) or downsample (decrease frequency) the data to different
time intervals, such as aggregating hourly data to daily data or converting daily data to
monthly data.
Example:
# Resample the same df to monthly frequency (mean of the values within each month)
monthly_data = df.resample('M').mean()
print(monthly_data)
Output
value
timestamp
2023-01-31 13.75
Shifting: Shifting involves moving the timestamps of the data forwards or backwards by a
specified number of time units. This operation is useful for calculating time differences or
creating lagged variables.
Example:
# Shift the same df one day forward (the timestamps move forward by one day)
shifted_data = df.shift(1, freq='D')
print(shifted_data)
Output
value
timestamp
2023-01-02 10
2023-01-03 15
2023-01-04 12
2023-01-05 18
Rolling Windows: Rolling windows involve calculating statistics over a moving window of
data. It allows for analyzing trends or patterns in a time-series by considering a fixed-size
window of observations.
Example:
# Calculate the rolling average over a 2-observation window on the same df
# (a small window is used here so that the 4-row sample produces non-NaN values)
rolling_avg = df['value'].rolling(window=2).mean()
print(rolling_avg)
Output
timestamp
2023-01-01 NaN
2023-01-02 12.5
2023-01-03 13.5
2023-01-04 15.0
Name: value, dtype: float64
Grouping and Aggregation: Grouping and aggregation operations involve grouping the
time-series data based on specific time periods (e.g., days, weeks, months) and performing
calculations on each group, such as calculating the sum, mean, or maximum value.
Example:
# Calculate the sum of values for each month on the same df
monthly_sum = df.groupby(pd.Grouper(freq='M')).sum()
print(monthly_sum)
Output
value
timestamp
2023-01-31 55
Bar Charts
Bar charts, also known as bar graphs or column charts, are a type of graph that uses
rectangular bars to represent data. They are widely used for visualizing categorical or
discrete data, where each category is represented by a separate bar. Bar charts are effective
in displaying comparisons between different categories or showing the distribution of a
single variable across different groups. The length of each bar is proportional to the value it represents.
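A minimal matplotlib sketch of a bar chart (the categories and values are hypothetical):
import matplotlib.pyplot as plt
# Hypothetical category counts
categories = ['A', 'B', 'C', 'D']
values = [23, 45, 12, 36]
plt.bar(categories, values, color='steelblue')
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Bar Chart Example')
plt.show()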
Gantt chart
A Gantt chart is a type of bar chart that is commonly used in project management to
visually represent project schedules and tasks over time. It provides a graphical
representation of the project timeline, showing the start and end dates of tasks, as well as
their duration and dependencies.
The key features of a Gantt chart are as follows:
• Task Bars: Each task is represented by a horizontal bar on the chart. The length of
the bar indicates the duration of the task, and its position on the chart indicates the
start and end dates.
• Timeline: The horizontal axis of the chart represents the project timeline, typically
displayed in increments of days, weeks, or months. It allows for easy visualization of
the project duration and scheduling.
• Dependencies: Gantt charts often include arrows or lines between tasks to represent
dependencies or relationships between them. This helps to visualize the order in
which tasks need to be completed and identify any critical paths or potential
bottlenecks.
• Milestones: Milestones are significant events or achievements within a project. They
are typically represented by diamond-shaped markers on the chart to indicate
important deadlines or deliverables.
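A short matplotlib sketch of a Gantt chart built from horizontal bars (the tasks, start dates, and durations are hypothetical):
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime
# Hypothetical tasks with start dates and durations (in days)
tasks = ['Requirements', 'Design', 'Implementation', 'Testing']
starts = [datetime(2023, 1, 1), datetime(2023, 1, 8),
          datetime(2023, 1, 15), datetime(2023, 2, 5)]
durations = [7, 7, 21, 10]
fig, ax = plt.subplots(figsize=(8, 3))
# One horizontal bar per task: position = start date, length = duration
ax.barh(tasks, durations, left=mdates.date2num(starts), color='skyblue')
ax.xaxis_date()                                  # interpret x-axis values as dates
ax.xaxis.set_major_formatter(mdates.DateFormatter('%d %b'))
ax.invert_yaxis()                                # first task at the top
ax.set_xlabel('Date')
ax.set_title('Simple Gantt Chart')
plt.tight_layout()
plt.show()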
Output
Stream graph
A stream graph is a variation of a stacked area chart that displays how the values of
different categories change over time, using flowing, organic shapes that resemble a
river or stream. Unlike a stacked area chart, which stacks values on a fixed, straight
baseline, a stream graph displaces the values around a varying central baseline.
Each individual stream shape in the stream graph is proportional to the values of its
category. Color can be used either to distinguish each category or to visualize each
category's additional quantitative values through varying the color shade.
Making a Stream Graph with Python
For this example we will use Altair, a declarative statistical visualization library for
Python based on Vega and Vega-Lite. The source code is available on GitHub.
To begin creating our stream graph, we will need to first install Altair and vega_datasets.
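Both packages can be installed with pip:
pip install altair vega_datasets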
Now, let's use Altair and the vega datasets to create an interactive stream graph of
unemployment across multiple industries over a ten-year period.
import altair as alt
from vega_datasets import data

source = data.unemployment_across_industries.url
alt.Chart(source).mark_area().encode(
    alt.X('yearmonth(date):T',
          axis=alt.Axis(format='%Y', domain=False, tickSize=0)
    ),
    alt.Y('sum(count):Q', stack='center', axis=None),
    alt.Color('series:N',
              scale=alt.Scale(scheme='category20b')
    )
).interactive()
Output
Heat map
A heat map is a graphical representation of data where individual values are
represented as colors. It is typically used to visualize the density or intensity of a particular
phenomenon over a geographic area or a grid of cells.
In a heat map, each data point is assigned a color based on its value or frequency.
Typically, a gradient of colors is used, ranging from cooler colors (such as blue or green) to
warmer colors (such as yellow or red). The colors indicate the magnitude of the data, with
darker or more intense colors representing higher values and lighter or less intense colors
representing lower values.
Heat maps are commonly used in various fields, including data analysis, statistics,
finance, marketing, and geographic information systems (GIS). They can provide insights
into patterns, trends, or anomalies in the data by visually highlighting areas of higher or
lower concentration.
import numpy as np
import matplotlib.pyplot as plt
# Generate random data
data = np.random.rand(10, 10)
# Create heatmap
plt.imshow(data, cmap='hot', interpolation='nearest')
# Add color bar
plt.colorbar()
# Show the plot
plt.show()
Output
Grouping
Grouping time series data involves dividing it into distinct groups based on certain criteria.
This grouping can be useful for performing calculations, aggregations, or analyses on
specific subsets of the data. In Python, you can use the pandas library to perform grouping
operations on time series data. Here's an example of how to group time series data using
pandas:
Simple Python program to explain Grouping
import pandas as pd
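# Sketch of the rest of the program; the values and alternating categories are
# assumed so that they reproduce the output shown below
dates = pd.date_range('2022-01-01', periods=100, freq='D')
df = pd.DataFrame({'value': range(100),
                   'category': ['A', 'B'] * 50}, index=dates)
# Group the rows by category and sum the values within each group
grouped = df.groupby('category').sum()
print("Grouped DataFrame (by category):")
print(grouped)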
Output
Grouped DataFrame (by category):
value
category
A 2450
B 2500
In this example, we first create a sample DataFrame df with a datetime index. The
DataFrame contains a 'value' column ranging from 0 to 99 and a 'category' column with
two distinct categories 'A' and 'B'.
Resampling
Resampling time series data involves grouping the data into different time intervals
and aggregating or summarizing the values within each interval. This process is useful
when you want to change the frequency or granularity of the data or when you need to
perform calculations over specific time intervals. There are two common methods for
resampling time series data: upsampling and downsampling.
Downsampling: Downsampling involves decreasing the frequency of the data by aggregating
values over larger time intervals. In the example at the end of this section, the daily
data is downsampled to monthly frequency, and the mean value within each month is
calculated.
Upsampling: Upsampling involves increasing the frequency of the data by grouping it into
smaller time intervals. This may require filling in missing values or interpolating to
estimate values within the new intervals. Some common upsampling methods include:
• Forward/Backward Filling: Propagate the last known value forward or backward to
fill missing values within each interval.
• Interpolation: Use interpolation methods like linear, polynomial, or spline
interpolation to estimate values within each interval.
• Resample Method: Utilize specialized resampling methods to estimate values within
each interval.
Here's an example of downsampling and upsampling time series data with the resample()
function in pandas. In the sketch below, the daily data is first downsampled to monthly
frequency, and then upsampled to hourly frequency with the newly created timestamps
filled using the interpolate() function.
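A minimal sketch, assuming the DataFrame described at the end of this section (100 daily values starting on 2022-01-01), that reproduces the output shown below:
import pandas as pd
# Daily data: 100 values (0-99) starting on 2022-01-01
dates = pd.date_range('2022-01-01', periods=100, freq='D')
df = pd.DataFrame({'value': range(100)}, index=dates)
df.index.name = 'date'
# Downsample to monthly frequency, taking the mean within each month
downsampled_df = df.resample('M').mean()
print("Downsampled DataFrame:")
print(downsampled_df.head())
# Upsample to hourly frequency and fill the new timestamps by linear interpolation
upsampled_df = df.resample('H').asfreq().interpolate()
print("\nUpsampled DataFrame:")
print(upsampled_df.head())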
Output
Downsampled DataFrame:
value
date
2022-01-31 15.0
2022-02-28 44.5
2022-03-31 74.0
2022-04-30 94.5
Upsampled DataFrame:
value
date
2022-01-01 00:00:00 0.000000
2022-01-01 01:00:00 0.041667
2022-01-01 02:00:00 0.083333
2022-01-01 03:00:00 0.125000
2022-01-01 04:00:00 0.166667
In this example, we first create a sample DataFrame df with a datetime index. The
DataFrame contains a 'value' column ranging from 0 to 99, with a daily frequency for the
'date' index.