
APEC

PROFESSIONAL ELECTIVE COURSES: VERTICALS


VERTICAL 1: DATA SCIENCE
CCS346 EXPLORATORY DATA ANALYSIS
UNIT I - EXPLORATORY DATA ANALYSIS

EDA fundamentals – Understanding data science – Significance of EDA – Making sense of data –
Comparing EDA with classical and Bayesian analysis – Software tools for EDA - Visual Aids for
EDA- Data transformation techniques-merging database, reshaping and pivoting,
Transformation techniques.

Data science
Data science is an interdisciplinary field that combines scientific methods, processes,
algorithms, and systems to extract insights and knowledge from structured and
unstructured data. It involves applying techniques from various fields such as statistics,
mathematics, computer science, and domain expertise to analyze and interpret data in
order to solve complex problems, make predictions, and drive decision-making.

Fig. Data Science

Data science encompasses a range of activities, including data collection, data
cleaning and preprocessing, exploratory data analysis, feature engineering, modeling,
and evaluation. It often involves working with large and diverse datasets, utilizing
statistical and machine learning techniques, and leveraging computational tools and
technologies to extract meaningful patterns, correlations, and insights from data.
Data scientists use a combination of analytical, programming, and problem-solving
skills to formulate research questions, design experiments, develop models and
algorithms, and interpret the results. They may work with structured data from
databases, spreadsheets, or relational data sources, as well as unstructured data from
text, images, videos, or social media.
The ultimate goal of data science is to uncover valuable insights and knowledge
that can drive data-driven decision-making, optimize processes, improve efficiency, and
enable businesses or organizations to gain a competitive edge. Data science has
applications in a wide range of domains, including business, finance, healthcare,
marketing, social sciences, and more.
Understanding Data Science
Data science is an interdisciplinary field that combines various techniques, tools, and
methodologies to extract insights and knowledge from data. It involves applying
statistical analysis, machine learning, and computational methods to solve complex
problems, make predictions, and drive decision-making.

Fig. Need for Data Science


Data Requirements: Data requirements refer to the specific needs to address a particular
problem or achieve specific objectives. It involves identifying the types of data required,
their sources, formats, quality, and quantity.

Data Collection: Data collection is the process of gathering or acquiring data from
various sources. This may include surveys, experiments, observations, web scraping,
sensors, or accessing existing datasets. The data collected should align with the defined
data requirements.

Data Processing: Data processing involves transforming raw data into a more usable and
structured format. This step may include data integration, data aggregation, data
transformation, and data normalization to prepare the data for further analysis.

Data Cleaning: Data cleaning, also known as data cleansing or data scrubbing, is the
process of identifying and rectifying errors, inconsistencies, missing values, and outliers
in the dataset. It aims to improve data quality and ensure that the data is accurate and
reliable for analysis.
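A minimal data-cleaning sketch in pandas is shown below; the file name (sales.csv) and the column names (price, city) are hypothetical placeholders, not part of any particular dataset.

import pandas as pd

df = pd.read_csv("sales.csv")                                  # hypothetical input file
df = df.drop_duplicates()                                      # remove exact duplicate rows
df["price"] = df["price"].fillna(df["price"].median())         # impute missing prices
df["city"] = df["city"].str.strip().str.title()                # fix inconsistent text labels
df = df[df["price"] <= df["price"].quantile(0.99)]             # trim extreme outliers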

EDA (Exploratory Data Analysis)


Exploratory Data Analysis is an approach in data analysis that focuses on summarizing,
visualizing, and understanding the main characteristics of a dataset. EDA involves using
statistical and visualization techniques to examine the data, identify patterns, uncover
relationships between variables, detect anomalies, and gain insights that can inform
further analysis or decision-making.
The main objectives of EDA are to:
• Understand the structure and nature of the data.
• Identify any missing values, outliers, or inconsistencies in the data.
• Discover patterns, trends, and relationships between variables.
• Extract meaningful insights and generate hypotheses for further analysis.
• Validate assumptions and check data quality.
EDA typically involves various techniques, such as data visualization (e.g., plots, charts,
graphs), summary statistics, descriptive statistics, correlation analysis, distribution
analysis, and data preprocessing. These techniques help data analysts and scientists
gain a deeper understanding of the dataset and guide them in making informed
decisions regarding data cleaning, feature engineering, modeling, or hypothesis testing.
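As a rough illustration (not part of the original notes), a typical first pass over a dataset in pandas might look like the following; df is assumed to be any DataFrame that has already been loaded.

import pandas as pd

def first_look(df: pd.DataFrame) -> None:
    print(df.shape)             # number of rows and columns
    print(df.dtypes)            # data type of each variable
    print(df.head())            # first few records
    print(df.isnull().sum())    # missing values per column
    print(df.describe())        # summary statistics for numeric columns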

Exploratory Data Analysis (EDA) is of great significance in data science and analysis.
Here are some key reasons why EDA is crucial:

Understanding the Data: EDA helps in gaining a deep understanding of the dataset at
hand. It allows data scientists to become familiar with the structure, contents, and
characteristics of the data, including variables, their distributions, and relationships.
This understanding is essential for making informed decisions throughout the data
analysis process.

Data Quality Assessment: EDA helps identify and assess the quality of the data. It allows
for the detection of missing values, outliers, inconsistencies, or errors in the dataset. By
addressing data quality issues, EDA helps ensure that subsequent analyses and models
are built on reliable and accurate data.

Feature Selection and Engineering: EDA aids in selecting relevant features or variables
for analysis. By examining relationships and correlations between variables, EDA can
guide the identification of important predictors or features that significantly contribute
to the desired outcome. EDA can also inspire the creation of new derived features or
transformations that improve model performance.


Uncovering Patterns and Insights: EDA enables the discovery of patterns, trends, and
relationships within the data. By using visualization techniques and summary statistics,
EDA helps uncover valuable insights and potential associations between variables.
These insights can drive further analysis, hypothesis generation, or the formulation of
research questions.

Hypothesis Generation and Testing: EDA plays a crucial role in generating hypotheses for
further investigation. By exploring the data, researchers can identify potential
relationships or patterns and formulate hypotheses to test formally. EDA can also
provide evidence or insights to support or refute existing hypotheses.

Decision-Making Support: EDA assists in making data-driven decisions. By visualizing
and summarizing the data, EDA provides insights that can inform strategic and tactical
decisions. It helps stakeholders understand the implications of the data and facilitates
evidence-based decision-making.

Data Visualization and Communication: EDA utilizes various data visualization
techniques to present the findings in a clear and understandable manner. Visualizations
enable effective communication of complex information, making it easier for
stakeholders to grasp the insights derived from the data.

Steps involved in Exploratory Data Analysis (EDA)

The exact steps can vary depending on the specific dataset and objectives. However,
here is a general outline of the key steps in EDA:

Understand the Data: Start by getting familiar with the dataset, its structure, and the
variables it contains. Understand the data types (e.g., numerical, categorical) and the
meaning of each variable.

Data Cleaning: Clean the dataset by handling missing values, outliers, and
inconsistencies. Identify and handle missing data appropriately (e.g., imputation,
deletion) based on the context and data quality requirements. Treat outliers and
inconsistent values by either correcting or removing them if necessary.

Handle Data Transformations: Explore and apply necessary data transformations such
as scaling, normalization, or logarithmic transformations to make the data suitable for
analysis. This step may be required to meet assumptions of certain statistical methods
or to improve the interpretability of the data.
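A small hedged sketch of such transformations with pandas and NumPy; 'income' is an assumed numeric column used only for illustration.

import numpy as np
import pandas as pd

def transform_income(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    col = out["income"]
    out["income_minmax"] = (col - col.min()) / (col.max() - col.min())   # scale to [0, 1]
    out["income_zscore"] = (col - col.mean()) / col.std()                # standardize
    out["income_log"] = np.log1p(col)                                    # compress a right-skewed distribution
    return out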

Summary Statistics: Compute and analyze summary statistics for each variable. This
includes measures such as mean, median, mode, standard deviation, range, quartiles,
and other descriptive statistics. Summary statistics provide an initial understanding of
the data distribution and basic insights.
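For example, the statistics listed above can be computed for one assumed numeric column ('age') as in the following illustrative sketch, which is not part of the original notes.

import pandas as pd

def summarize(df: pd.DataFrame) -> pd.Series:
    s = df["age"]
    return pd.Series({
        "mean": s.mean(),
        "median": s.median(),
        "mode": s.mode().iloc[0],          # most frequent value
        "std": s.std(),
        "range": s.max() - s.min(),
        "q1": s.quantile(0.25),
        "q3": s.quantile(0.75),
    })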


Data Visualization: Utilize various visualization techniques to explore the data visually.
Create histograms, scatter plots, box plots, bar charts, heatmaps, or other relevant
visualizations to understand the patterns, distributions, and relationships between
variables. Visualizations can reveal insights that may not be apparent from summary
statistics alone.

Identify Relationships and Correlations: Analyze the relationships between variables
using correlation analysis or other statistical techniques. Identify variables that are
highly correlated or exhibit strong associations. Correlation matrices, scatter plots, or
heatmaps can be useful for visualizing relationships between numerical variables.
Cross-tabulations or stacked bar charts can be used for categorical variables.
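A hedged sketch of such a correlation check with pandas and matplotlib (assuming a reasonably recent pandas version that supports the numeric_only argument of corr()):

import matplotlib.pyplot as plt
import pandas as pd

def plot_correlations(df: pd.DataFrame) -> None:
    corr = df.corr(numeric_only=True)                    # pairwise Pearson correlations
    print(corr)
    plt.imshow(corr, cmap="coolwarm", vmin=-1, vmax=1)   # heatmap-style view of the matrix
    plt.colorbar()
    plt.xticks(range(len(corr)), corr.columns, rotation=90)
    plt.yticks(range(len(corr)), corr.columns)
    plt.show()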

Exploring Time Series Data: If the dataset involves time series data, analyze trends,
seasonality, and other temporal patterns. Use line plots, time series decomposition,
autocorrelation plots, or other relevant techniques to explore the temporal behavior of
the data.
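A minimal sketch of such a check, assuming (for illustration only) that df has a DatetimeIndex and a numeric column named 'value':

import matplotlib.pyplot as plt
import pandas as pd

def explore_series(df: pd.DataFrame) -> None:
    df["value"].plot(label="raw")                                        # overall trend
    df["value"].rolling(window=12).mean().plot(label="rolling mean")     # smoothed trend / seasonality
    plt.legend()
    plt.show()

    pd.plotting.autocorrelation_plot(df["value"])                        # temporal dependence
    plt.show()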

Feature Engineering: Based on the insights gained from EDA, consider creating new
derived features or transformations that may enhance the predictive power or
interpretability of the data. This can involve mathematical operations, combinations of
variables, or domain-specific transformations.

Iterative Analysis: EDA is often an iterative process. Repeat the above steps as needed,
diving deeper into specific variables or subsets of the data based on emerging patterns
or research questions. Refine the analysis based on new insights or feedback from
stakeholders.


Documentation and Reporting: Document the findings, insights, and visualizations
generated during the EDA process. Prepare a report or presentation that effectively
communicates the key findings, patterns, and relationships discovered. Include
visualizations, summary statistics, and any relevant observations to support the
conclusions.

Remember, the steps and techniques used in EDA can be flexible and iterative, tailored
to the specific dataset and research objectives. The goal is to gain a comprehensive
understanding of the data, identify patterns, and generate hypotheses for further
analysis.

Define Data
In data science, "data" refers to the raw, unprocessed, and often vast quantities of
information that is collected or generated from various sources. It can exist in different
formats, such as structured data (organized and well-defined), semi-structured data
(partially organized), or unstructured data (lacks a predefined structure).

Data serves as the foundation for data science activities and analysis. It can include
numbers, text, images, audio, video, sensor readings, transaction records, social media
posts, and much more. Data can be generated from diverse sources, including databases,
spreadsheets, web scraping, sensors, surveys, or online platforms.

Categorization of Data
In data science, data is typically categorized into two main types:
Quantitative Data: Also known as numerical or structured data, quantitative data
represents measurable quantities or variables. It includes attributes such as age,
temperature, sales figures, or stock prices. Quantitative data is typically analyzed using
statistical techniques and mathematical models.
Examples of quantitative data:
• Scores of tests and exams e.g. 74, 67, 98, etc.
• The weight of a person.
• The temperature in a room.


Qualitative Data: Also known as categorical or unstructured data, qualitative data
represents non-numeric attributes or characteristics. It includes categories such as
gender, color, occupation, or sentiment. Qualitative data is often analyzed using
techniques such as text mining, sentiment analysis, or topic modeling.
Examples of qualitative data:
• Colors e.g. the color of the sea
• Popular holiday destinations such as Switzerland, New Zealand, South Africa, etc.
• Ethnicity such as American Indian, Asian, etc.

Qualitative Data Types


Nominal Data
This data type is used just for labeling variables, without having any quantitative value.
It simply names a thing without implying any particular order.


Examples of Nominal Data:


• Gender (Women, Men)
• Hair color (Blonde, Brown, Brunette, Red, etc.)
• Marital status (Married, Single, Widowed)

Ordinal Data
Ordinal data is qualitative data whose values have some kind of relative position. These
kinds of data can be considered "in-between" qualitative and quantitative data. Ordinal
data only shows the sequence and cannot be used for arithmetic calculations. Compared
to nominal data, ordinal data have some kind of order that is not present in nominal
data.

Examples of Ordinal Data:


• When companies ask for feedback, experience, or satisfaction on a scale of 1 to
10
• Letter grades in the exam (A, B, C, D, etc.)
• Ranking of people in a competition (First, Second, Third, etc.)
• Economic Status (High, Medium, and Low)

• Education Level (Higher, Secondary, Primary)

Quantitative Data Types


Discrete Data
The term discrete means distinct or separate. Discrete data contain values that fall
under integers or whole numbers. These data can't be broken into decimal or fraction
values. Discrete data are countable and have finite values; their subdivision is not
possible. These data are represented mainly by a bar graph, number line, or frequency
table.
Examples of discrete data:
• Number of students in a class.
• Number of workers in a company.
• Number of test questions you answered correctly.

Continuous Data
Continuous data are in the form of fractional numbers. It can be the version of an
android phone, the height of a person, the length of an object, etc. Continuous data
represents information that can be divided into smaller levels. The continuous variable
can take any value within a range.
Examples of continuous data:
• Height of a person - 62.04762 inches, 79.948376 inches
• Speed of a vehicle
• “Time-taken” to finish the work
• Wi-Fi Frequency
• Market share price


Interval Data
The interval level is a numerical level of measurement which, like the ordinal scale,
places variables in order. The interval scale has a known and equal distance between
each value on the scale (imagine the points on a thermometer). Unlike the ratio scale
(the fourth level of measurement), interval data has no true zero; in other words, a
value of zero on an interval scale does not mean the variable is absent.
A temperature of zero degrees Fahrenheit doesn’t mean there is “no temperature” to be
measured—rather, it signifies a very low or cold temperature.

Examples of Interval data:


• Temperature (°C or F, but not Kelvin)
• Dates (1055, 1297, 1976, etc.)
• Time Gap on a 12-hour clock (6 am, 6 pm)
• IQ score
• Income categorized as ranges ($30-39k, $40-49k, $50-59k, and so on)

Ratio Data
The fourth and final level of measurement is the ratio level. Just like the interval scale,
the ratio scale is a quantitative level of measurement with equal intervals between each
point. The difference between the interval scale and the ratio scale is that the ratio scale
has a true zero: a value of zero on a ratio scale means that the variable you're measuring
is absent.


Population is a good example of ratio data. If you have a population count of zero
people, this means there are no people!

Example of Ratio data:


• Weight in grams (continuous)
• Number of employees at a company (discrete)
• Speed in miles per hour (continuous)
• Length in centimeters (continuous)
• Age in years (continuous)
• Income in dollars (continuous)

Measurement scales
Measurement scales, also known as data scales or levels of measurement, define
the properties and characteristics of the data collected or measured. There are four
commonly recognized measurement scales:

Nominal Scale:


• The nominal scale is the lowest level of measurement. It represents data that can
be categorized into distinct and mutually exclusive groups or categories. The
categories in a nominal scale have no inherent order or ranking.
• Examples of nominal scale data include gender (male/female), eye color
(blue/green/brown), or types of cars (sedan/SUV/hatchback).
• Nominal data can be represented using labels or codes.

Ordinal Scale:
• The ordinal scale represents data with categories that have a natural order or
ranking. In addition to the properties of the nominal scale, ordinal data allows
for the relative positioning or hierarchy between the categories. However, the
intervals between the categories may not be equal.
• Examples of ordinal scale data include rating scales (e.g., 1-5 scale indicating
satisfaction levels), education levels (e.g., high school, bachelor's, master's), or
performance rankings (first, second, third place). Ordinal data can be
represented using labels, codes, or numerical rankings.

Interval Scale:
• The interval scale represents data with categories that have equal intervals
between the values. In addition to the properties of the ordinal scale, interval
data allows for meaningful comparisons of the intervals between the categories.
However, it does not have a true zero point.
• Examples of interval scale data include calendar dates, temperature measured in
Celsius or Fahrenheit, or years. Interval data allows for mathematical operations
such as addition and subtraction but does not support meaningful multiplication
or division.

Ratio Scale:
• The ratio scale is the highest level of measurement. It represents data with
categories that have equal intervals between the values and possess a true zero
point. In addition to the properties of the interval scale, ratio data allows for all
mathematical operations and meaningful ratios.
• Examples of ratio scale data include weight, length, time duration, or count. Ratio
scale data provides a complete and meaningful representation of the data.

Understanding the measurement scale of the data is important for selecting appropriate
statistical techniques, visualization methods, and modeling approaches. Different scales
require different levels of analysis and interpretation.
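As an illustrative (not authoritative) example, pandas can record the difference between nominal and ordinal variables with plain and ordered categoricals; the column values below are made up.

import pandas as pd

df = pd.DataFrame({
    "eye_color": ["blue", "brown", "green"],           # nominal: no inherent order
    "education": ["primary", "secondary", "higher"],   # ordinal: ordered categories
})

df["eye_color"] = df["eye_color"].astype("category")
df["education"] = pd.Categorical(df["education"],
                                 categories=["primary", "secondary", "higher"],
                                 ordered=True)
print(df["education"] >= "secondary")    # order-aware comparison is meaningful only for ordinal data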


Comparing EDA with classical and Bayesian analysis


There are several approaches to data analysis.
For classical analysis, the sequence is
Problem => Data => Model => Analysis => Conclusions
For EDA, the sequence is
Problem => Data => Analysis => Model => Conclusions
For Bayesian, the sequence is
Problem => Data => Model => Prior Distribution => Analysis => Conclusions
• Classical data analysis:
o For the classical data analysis approach, the problem definition and data
collection step are followed by model development, which is followed by
analysis and result communication.
• Exploratory data analysis approach:
o For the EDA approach, it follows the same approach as classical data
analysis except the model imposition and the data analysis steps are
swapped. The main focus is on the data, its structure, outliers, models,
and visualizations. Generally, in EDA, we do not impose any deterministic
or probabilistic models on the data.
• Bayesian data analysis approach:
o The Bayesian approach incorporates prior probability distribution
knowledge into the analysis steps, as reflected in the sequence above.


Software tools available for EDA


There are several software tools available for performing Exploratory Data
Analysis (EDA). Here are some commonly used ones:

• Python: This is an open source programming language widely used in data
analysis, data mining, and data science.
• R programming language: R is an open source programming language that is
widely utilized in statistical computation and graphical data analysis.
• Weka: This is an open source data mining package that includes several EDA
tools and algorithms.
• Jupyter Notebook / JupyterLab: Jupyter is an open-source web-based platform
that supports multiple programming languages, including Python and R.
• SPSS: IBM SPSS Statistics is a comprehensive software package for statistical
analysis. It provides a range of tools for data exploration, descriptive statistics,
hypothesis testing, and advanced modeling techniques.


• KNIME: KNIME (Konstanz Information Miner) is an open-source data analytics
platform that allows users to visually design data workflows.

Visual Aids for EDA


Two important goals of data scientists are:
• Extract knowledge from the data.
• Present the data to stakeholders.
Presenting results to stakeholders is challenging because the audience may not have
enough technical knowledge. Hence, visual aids are very useful tools. We are going to
learn different types of techniques that can be used in the visualization of data.
• Line chart
• Bar chart
• Scatter plot
• Area plot and stacked plot
• Pie chart
• Table chart
• Polar chart
• Histogram
• Lollipop chart
• Choosing the best chart
• Other libraries to explore

Matplotlib is a data visualization library in Python. The pyplot module, a sublibrary of
matplotlib, is a collection of functions that helps in creating a variety of charts.

Line chart
Line charts are used to represent the relation between two data series X and Y plotted
on different axes. Here we will see an example of a line chart in Python:

Steps to Plot a Line Chart in Python using Matplotlib


Step 1: Install the Matplotlib package

pip install matplotlib

Step 2: Gather the data for the Line chart


Next, gather the data for your Line chart. For example, let’s use the following data about
two variables:
• Year
• Unemployment_rate
Here is the complete data:


Year Unemployment_Rate
1920 9.8
1930 12
1940 8
1950 7.2
1960 6.9
1970 7
1980 6.5
1990 6.2
2000 5.5
2010 3.3

The ultimate goal is to depict the above data using a Line chart.

Step 3: Capture the data in Python


You can capture the above data in Python using the following two Lists:

year = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010]
unemployment_rate = [9.8, 12, 8, 7.2, 6.9, 7, 6.5, 6.2, 5.5, 3.3]

Step 4: Plot a Line chart in Python using Matplotlib


For the final step, you may use the template below in order to plot the Line chart in
Python:

import matplotlib.pyplot as plt

x_axis = ['value_1', 'value_2', 'value_3', ...]


y_axis = ['value_1', 'value_2', 'value_3', ...]

plt.plot(x_axis, y_axis)
plt.title('title name')
plt.xlabel('x_axis name')
plt.ylabel('y_axis name')
plt.show()

Here is the code for our example:

# Python program to create a Line chart
import matplotlib.pyplot as plt

year = [1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010]
unemployment_rate = [9.8, 12, 8, 7.2, 6.9, 7, 6.5, 6.2, 5.5, 3.3]

plt.plot(year, unemployment_rate)
plt.title('Unemployment rate vs Year')
plt.xlabel('Year')
plt.ylabel('Unemployment rate')
plt.show()

Output:

Bar charts
A bar plot or bar chart is a graph that represents the category of data with rectangular
bars with lengths and heights that is proportional to the values which they represent.
The bar plots can be plotted horizontally or vertically. A bar chart describes the
comparisons between the discrete categories.
import matplotlib.pyplot as plt
import numpy as np

x = np.array(["A", "B", "C", "D"])
y = np.array([3, 8, 1, 10])

plt.bar(x, y, color = 'maroon')
plt.show()

Output:

Scatter plots
Scatter plots are used to observe relationship between variables and uses dots to
represent the relationship between them. The scatter() method in the matplotlib library
is used to draw a scatter plot. Scatter plots are widely used to represent relation among
variables and how change in one affects the other.
# Python program to create a scatter plot
import matplotlib.pyplot as plt

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 100, 86, 103, 87, 94, 78, 77, 85, 86]

plt.scatter(x, y, c ="blue")

plt.xlabel("X-axis")
plt.ylabel("Y-axis")

# To show the plot
plt.show()

Output:

Bubble chart
Bubble plots are an improved version of the scatter plot. In a scatter plot, there are two
dimensions x, and y. In a bubble plot, there are three dimensions x, y, and z, where the
third dimension z denotes weight. Here, each data point on the graph is shown as a
bubble. Each bubble can be illustrated with a different color, size, and appearance.

# Python program to create a bubble plot
import matplotlib.pyplot as plt
import numpy as np

x = np.random.rand(40)
y = np.random.rand(40)
z = np.random.rand(40)
colors = np.random.rand(40)

# use the scatter function; the third dimension z controls the bubble size
plt.scatter(x, y, s=z*1000, c=colors)
plt.show()

Output:

Area Chart
An area chart is really similar to a line chart, except that the area between the x axis and
the line is filled in with color or shading. It represents the evolution of a numeric
variable.
import numpy as np
import matplotlib.pyplot as plt

# Create data
x = range(1, 6)
y = [1, 4, 6, 8, 4]

# Area plot
plt.fill_between(x, y)
plt.show()

Output:

Stacked Area Chart


A stacked area chart is the extension of a basic area chart. It displays the evolution of
the values of several groups on the same graphic. The values of each group are displayed
on top of each other, which allows checking on the same figure both the evolution of the
total of a numeric variable and the contribution of each group.
The stacked plot owes its name to the fact that it represents the area under a line plot
and that several such plots can be stacked on top of one another, giving the feeling of a
stack. The stacked plot can be useful when we want to visualize the cumulative effect of
multiple variables being plotted on the y axis.

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

# Population of different regions in millions
Africa = [228, 284, 365, 477, 631, 814, 1044, 1275]
America = [340, 425, 519, 619, 727, 840, 943, 1006]
California = [1394, 1686, 2120, 2625, 3202, 3714, 4169, 4560]
Australia = [220, 253, 276, 295, 310, 303, 294, 293]
Denmark = [120, 150, 190, 220, 260, 310, 360, 390]

year = [1950, 1960, 1970, 1980, 1990, 2000, 2010, 2018]

# Create placeholders for the legend and add the required colors
plt.plot([], [], color='brown', label='Africa')
plt.plot([], [], color='green', label='America')
plt.plot([], [], color='orange', label='California')
plt.plot([], [], color='blue', label='Australia')
plt.plot([], [], color='pink', label='Denmark')

# Add stacks to the plot
plt.stackplot(year, Africa, America, California, Australia, Denmark,
              colors=['brown', 'green', 'orange', 'blue', 'pink'])

# Add labels
plt.legend(loc='upper left')
plt.title('World Population')
plt.xlabel('Year')
plt.ylabel('Number of people (millions)')

# Display on the screen
plt.show()

Output:

Python program to display area and stacked chart:

# House loan mortgage cost per month for a year
houseLoanMortgage = [9000, 9000, 8000, 9000, 8000,
                     9000, 9000, 9000, 9000, 8000, 9000, 9000]
# Utilities bills for a year
utilitiesBills = [4218, 4218, 4218, 4218, 4218,
                  4218, 4219, 2218, 3218, 4233, 3000, 3000]
# Transportation bill for a year
transportation = [782, 900, 732, 892, 334, 222,
                  300, 800, 900, 582, 596, 222]
# Car mortgage cost for one year
carMortgage = [700, 701, 702, 703, 704,
               705, 706, 707, 708, 709, 710, 711]

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

months = [x for x in range(1, 13)]

# Create placeholders for the legend and add the required colors
plt.plot([], [], color='brown', label='houseLoanMortgage')
plt.plot([], [], color='green', label='utilitiesBills')
plt.plot([], [], color='orange', label='transportation')
plt.plot([], [], color='blue', label='carMortgage')

# Add stacks to the plot
plt.stackplot(months, houseLoanMortgage, utilitiesBills,
              transportation, carMortgage,
              colors=['brown', 'green', 'orange', 'blue'])
plt.legend()

# Add labels
plt.title('Household Expenses')
plt.xlabel('Months of the year')
plt.ylabel('Cost')

# Display on the screen
plt.show()

Output:

Pie Chart
A Pie Chart is a circular statistical plot that can display only one series of data. The area
of the chart is the total percentage of the given data. The area of slices of the pie
represents the percentage of the parts of the data. The slices of pie are called wedges.
The area of the wedge is determined by the length of the arc of the wedge. The area of a
wedge represents the relative percentage of that part with respect to the whole data. Pie
charts are commonly used in business presentations like sales, operations, survey
results, resources, etc., as they provide a quick summary.

import matplotlib.pyplot as plt

langs = ['C', 'C++', 'Java', 'Python', 'PHP']


students = [23,17,35,29,12]
plt.pie(students, labels=langs)
plt.title('Students taking different programming languages')
plt.axis('equal')
plt.show()

Output:

Table Chart
Matplotlib.pyplot.table() is a subpart of the matplotlib library in which a table is generated
alongside the plotted graph for analysis. This method makes analysis easier and more
efficient, as tables give more precise detail than graphs. The matplotlib.pyplot.table creates
tables that often hang beneath stacked bar charts to provide readers insight into the
data generated by the above graph.

import matplotlib.pyplot as plt
import numpy as np

plt.figure()
ax = plt.gca()
a = np.random.randn(5)

# defining the attributes
col_labels = ['Col1', 'Col2', 'Col3']
row_labels = ['Row1', 'Row2', 'Row3']
table_vals = [[10, 9, 8], [20, 19, 18], [30, 29, 28]]
row_colors = ['red', 'blue', 'yellow']

# plotting
my_table = plt.table(cellText=table_vals,
                     colWidths=[0.1] * 3,
                     rowLabels=row_labels,
                     colLabels=col_labels,
                     rowColours=row_colors,
                     loc='upper right')

plt.plot(a)
plt.show()

Output

Polar chart or spider web plot


Matplotlib provides the module and functions to plot the coordinates on polar axes. A
point in polar co-ordinates is represented as (r, theta). Here, r is its distance from the
origin and theta is the angle at which r has to be measured from origin. Any
mathematical function in the Cartesian coordinate system can also be plotted using the
polar coordinates.

import numpy as np
import matplotlib.pyplot as plt

plt.axes(projection = 'polar')
# setting the radius
r = 2
rads = np.arange(0, (2 * np.pi), 0.01)
# plotting the circle
for rad in rads:
    plt.polar(rad, r, 'g.')
plt.show()

Output

import numpy as np
import matplotlib.pyplot as plot

plot.axes(projection='polar')
plot.title('Circle in polar format')
rads = np.arange(0, (2*np.pi), 0.01)
for radian in rads:
    plot.polar(radian, 2, 'o')
plot.show()

Output

import matplotlib.pyplot as plt
import numpy as np

subjects = ["C programming", "Numerical methods", "Operating system",
            "DBMS", "Computer Networks"]

# The first grade is repeated at the end to close the polygon
actual_grades = [75, 89, 89, 80, 80, 75]
expected_grades = [90, 95, 92, 68, 68, 90]

# Initializing the spider plot by setting the figure size and polar projection
plt.figure(figsize =(10, 6))
plt.subplot(polar = True)

theta = np.linspace(0, 2 * np.pi, len(actual_grades))

# Arranging the grid into a number of equal parts in degrees
lines, labels = plt.thetagrids(range(0, 360, int(360/len(subjects))),
                               (subjects))

# Plot actual grades
plt.plot(theta, actual_grades)
plt.fill(theta, actual_grades, 'b', alpha = 0.1)

# Plot expected grades
plt.plot(theta, expected_grades)

# Add legend and title for the plot
plt.legend(labels =('Actual_grades', 'expected_grades'),
           loc = 1)
plt.title("Actual vs Expected Grades by Students")

# Display the plot on the screen
plt.show()

Output

Histogram
A histogram is a graph showing frequency distributions. It is a graph showing the
number of observations within each given interval.
Example: Say you ask for the height of 250 people, you might end up with a histogram
like this:


You can read from the histogram that there are approximately:

2 people from 140 to 145cm, 5 people from 145 to 150cm, 15 people from 151 to
156cm, 31 people from 157 to 162cm, 46 people from 163 to 168cm, 53 people from
168 to 173cm, 45 people from 173 to 178cm, 28 people from 179 to 184cm, 21 people
from 185 to 190cm, 4 people from 190 to 195cm

import matplotlib.pyplot as plt
import numpy as np

x = np.random.normal(170, 10, 250)

plt.hist(x)
plt.show()

Output

Lollipop plot
A basic lollipop plot can be created using the stem() function of matplotlib. This function
takes x axis and y axis values as an argument. x values are optional; if you do not
provide x values, it will automatically assign x positions.

import matplotlib.pyplot as plt
import numpy as np

# create data
x = range(1, 41)
values = np.random.uniform(size=40)

# stem function
plt.stem(x, values)
plt.ylim(0, 1.2)
plt.show()

# stem function: if x is not provided, a sequence of
# numbers is created by Python
plt.stem(values)
plt.show()

Output


Data transformation
Data transformation is a set of techniques used to convert data from one format or
structure to another format or structure. The following are some examples of
transformation activities (a brief pandas sketch of a few of them follows the list):
• Data deduplication involves the identification of duplicates and their removal.
• Key restructuring involves transforming any keys with built-in meanings to the
generic keys.
• Data cleansing involves extracting words and deleting out-of-date, inaccurate,
and incomplete information from the source language without extracting the
meaning or information to enhance the accuracy of the source data.
• Data validation is a process of formulating rules or algorithms that help in
validating different types of data against some known issues.
• Format revisioning involves converting from one format to another.
• Data derivation consists of creating a set of rules to generate more information
from the data source.
• Data aggregation involves searching, extracting, summarizing, and preserving
important information in different types of reporting systems.
• Data integration involves converting different data types and merging them into
a common structure or schema.
• Data filtering involves identifying information relevant to any particular user.
• Data joining involves establishing a relationship between two or more tables.
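A brief pandas sketch of a few of these activities (deduplication, filtering, aggregation, and joining); the orders and customers tables are hypothetical examples.

import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "customer": ["Asha", "Asha", "Ravi", "Mina"],
    "amount": [250, 250, 120, 600],
})
customers = pd.DataFrame({"customer": ["Asha", "Ravi", "Mina"],
                          "city": ["Chennai", "Delhi", "Pune"]})

deduped = orders.drop_duplicates()                         # data deduplication
filtered = deduped[deduped["amount"] > 200]                # data filtering
totals = deduped.groupby("customer")["amount"].sum()       # data aggregation
joined = deduped.merge(customers, on="customer")           # data joining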

Merging database-style dataframes


Combining Data in Pandas with append(), merge(), join(), and concat()
pandas concat(): Combining Data Across Rows or Columns
Concatenation is a bit different from the merging techniques that you saw above. With
merging, you can expect the resulting dataset to have rows from the parent datasets
mixed in together, often based on some commonality.
With concatenation, your datasets are just stitched together along an axis — either the
row axis or column axis.

# Importing libraries
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)
df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)
frames = [df1, df2, df3]
result = pd.concat(frames)
print("\n", df1)
print("\n", df2)
print("\n", df3)
print("\n", result)

Output:

Visually, a concatenation with no parameters along rows would look like this:

df1 = pd.DataFrame(
    {
        "A": ["A0", "A1"],
        "B": ["B0", "B1"],
        "C": ["C0", "C1"],
        "D": ["D0", "D1"],
    }, index=[0, 1])
df2 = pd.DataFrame(
    {
        "A": ["A4", "A5"],
        "B": ["B4", "B5"],
        "C": ["C4", "C5"],
        "D": ["D4", "D5"],
    }, index=[0, 1])
frames = [df1, df2]

# Concatenate along rows (the default, axis=0)
result = pd.concat(frames)
print("\n", df1)
print("\n", df2)
print("\n", result)

Output:

# Concatenate along columns
result = pd.concat(frames, axis="columns")
print("\n", result)

Output:

# Passing keys produces a hierarchical (MultiIndex) row index
result = pd.concat(frames, keys=["x", "y"])

As you can see, the resulting object's index has a hierarchical index.

Output:

In Pandas for a horizontal combination we have merge() and join(), whereas for vertical
combination we can use concat() and append(). Merge and join perform similar tasks
but internally they have some differences, similar to concat and append.
pandas merge():
Pandas provides various built-in functions for easily combining datasets. Among them,
merge() is a high-performance in-memory operation very similar to relational
databases like SQL. You can use merge() any time when you want to do database-like
join operations.
• The simplest call without any key column
• Specifying key columns using on
• Merging using left_on and right_on
• Various forms of joins: inner, left, right and outer


Syntax:

# This join brings together the entire DataFrame
df.merge(df2)

# This join only brings together a subset of columns
# 'col1' is my key in this case
df[['col1', 'col2']].merge(df2[['col1', 'col3']], on='col1')
Code 1#: Merging two DataFrames

df1 = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df2 = pd.DataFrame({
    'id': [2, 3, 4, 5],
    'age': [31, 20, 40, 70],
    'sex': ['F', 'M', 'M', 'F']
})
print("\n", df1)
print("\n", df2)

final = pd.merge(df1, df2)
print("\n", final)

The same merge can also be written as:
pd.merge(df1, df2)
(or)
df1.merge(df2)
(or)
df1.merge(df2, on='id')

Output:
Code 2#: Merge two DataFrames via the 'id' column.

final = df1.merge(df2, on='id')
print("\n", final)

Output:

Code 3#: Merge with different column names - specify a left_on and right_on.

final = df1.merge(df2, left_on='id', right_on='customer_id')

Output:

Various types of joins: inner, left, right and outer

There are 4 types of joins available to the Pandas merge() function. The logic behind these joins is
very much the same as the one you have in SQL when you join tables. You can perform a type of join
by specifying the how argument with the following values:
• inner: Default join is inner in the Pandas merge() function, and it produces records that
have matching values in both DataFrames
• left: produces all records from the left DataFrame and the matched records from the
right DataFrame
• right: produces all records from the right DataFrame and the matched records from the
left DataFrame
• outer: produces all records when there is a match in either left or right DataFrame


pd.merge(df_customer, df_info, on='id', how=?)

Code 4#: Merge using inner join

By default, Pandas merge() performs an inner join and produces only the set of records
that match in both DataFrames.

final = pd.merge(df1, df2, on='id', how='inner')

Output:
   id   name  age sex
0   2  Jenny   31   F
1   3  James   20   M
2   4    Dan   40   M

Code 4#: Merge using left join

The left join produces all records from the left DataFrame, and the matched records from
the right DataFrame. If there is no match, the left side will contain NaN.

final = pd.merge(df1, df2, on='id', how='left')

Output:
   id   name   age  sex
0   1    Tom   NaN  NaN
1   2  Jenny  31.0    F
2   3  James  20.0    M
3   4    Dan  40.0    M

Code 4#: Merge using right join

The right join produces all records from the right DataFrame, and the matched records from
the left DataFrame. If there is no match, the right side will contain NaN.

final = pd.merge(df1, df2, on='id', how='right')

Output:
   id   name  age sex
0   2  Jenny   31   F
1   3  James   20   M
2   4    Dan   40   M
3   5    NaN   70   F

Code 4#: Merge using outer join

The outer join produces all records when there is a match in either the left or right
DataFrame. NaN will be filled in where there is no match on either side.

final = pd.merge(df1, df2, on='id', how='outer')

Output:
   id   name   age  sex
0   1    Tom   NaN  NaN
1   2  Jenny  31.0    F
2   3  James  20.0    M
3   4    Dan  40.0    M
4   5    NaN  70.0    F

pandas append():

To append the rows of one dataframe to the rows of another, we can use the Pandas append()
function. With the help of append(), we can append columns too.

Steps
• Create a two-dimensional, size-mutable, potentially heterogeneous tabular data, df1.
• Print the input DataFrame, df1.
• Create another DataFrame, df2, with the same column names and print it.
• Use the append method, df1.append(df2, ignore_index=True), to append the rows of df2
to df1.
• Print the resultant DataFrame.

Code 5#: Append Function to join DataFrames

import pandas as pd
df1 = pd.DataFrame({"x": [5, 2],
                    "y": [4, 7],
                    "z": [1, 3]})
df2 = pd.DataFrame({"x": [1, 3],
                    "y": [1, 9],
                    "z": [1, 3]})
print("\n", df1)
print("\n", df2)
df3 = df1.append(df2)
print("\n", df3)

Output:

df3 = df1.append(df2, ignore_index=True)

Output:

import pandas as pd

df1 = pd.DataFrame({"x": [5, 2],
                    "y": [4, 7],
                    "z": [1, 3]})
df2 = pd.DataFrame({"a": [1, 3],
                    "b": [1, 9],
                    "c": [1, 3]})
print("\n", df1)
print("\n", df2)
df3 = df1.append(df2, ignore_index=True)
print("\n", df3)

Output:

Reshaping and pivoting


pivot() function
• Return reshaped DataFrame organized by given index / column values.
• Reshape data (produce a “pivot” table) based on column values.
• Uses unique values from specified index / columns to form axes of the resulting
DataFrame. This function does not support data aggregation, multiple values will
result in a MultiIndex in the columns.

import numpy as np
import pandas as pd
df = pd.DataFrame({'s1': ['one', 'one', 'one', 'two', 'two','two'],
's2': ['P', 'Q', 'R', 'P', 'Q', 'R'],
's3': [2, 3, 4, 5, 6, 7],
's4': ['x', 'y', 'z', 'q', 'w', 't']})
print(df)

Output

df.pivot(index='s1', columns='s2', values='s3')

Output


df.pivot(index='s1', columns='s2', values=['s3', 's4'])

Output

Stack() and unstack()


During EDA, we often need to rearrange data in a dataframe in some consistent manner.
This can be done with hierarchical indexing using two actions:
• Stacking: Stack rotates from any particular column in the data to the rows.
• Unstacking: Unstack rotates from the rows into the column


import numpy as np
import pandas as pd
index = pd.MultiIndex.from_tuples([('one', 'x'), ('one', 'y'),
('two', 'x'), ('two','y')])
s = pd.Series(np.arange(2.0, 6.0), index=index)
print(s)

Output

df = s.unstack(level=0)
df.unstack()

Output


Transformation
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine. For such a transformation, the output is the
same shape as the input.

Given a DataFrame df with a column 'key' holding the values A, B, C, A, B, C and a
column 'data' holding the values 0 to 5:

df.sum()
Output:
key     ABCABC
data        15
dtype: object

df.mean()
Output:
data    2.5

df.groupby('key').transform(lambda x: x - x.mean())
Output:
   data
0  -1.5
1  -1.5
2  -1.5
3   1.5
4   1.5
5   1.5

APEC
PROFESSIONAL ELECTIVE COURSES: VERTICALS
VERTICAL 1: DATA SCIENCE
CCS346 EXPLORATORY DATA ANALYSIS
UNIT II - EDA USING PYTHON
Data Manipulation With Pandas – Pandas Objects - Data Indexing And Selection –
Operating On Data – Handling Missing Data – Hierarchical Indexing – Combining
Datasets –Concat, Append, Merge And Join - Aggregation And Grouping – Pivot Tables –
Vectorized String Operations.

What is Pandas?
Pandas is a Python library used for working with data sets. It has functions for
analysing, cleaning, exploring, and manipulating data. The name "Pandas" has a
reference to both "Panel Data", and "Python Data Analysis" and was created by Wes
McKinney in 2008.
Why Use Pandas?
Pandas allows us to analyse big data and make conclusions based on statistical
theories. Pandas can clean messy data sets, and make them readable and relevant.
Relevant data is very important in data science.
Pandas is an open-source Python Library providing high-performance data
manipulation and analysis tools using its powerful data structures. The name Pandas is
derived from the term Panel Data, an econometrics term for multidimensional data. In
2008, developer Wes McKinney started developing pandas when in need of a high
performance, flexible tool for analysis of data.
Prior to Pandas, Python was majorly used for data mining and preparation. It had
very little contribution towards data analysis. Pandas solved this problem. Using
Pandas, we can accomplish five typical steps in the processing and analysis of data,
regardless of the origin of data -load, prepare, manipulate, model, and analyse.
Python with Pandas is used in a wide range of fields including academic and commercial
domains including finance, economics, Statistics, analytics, etc.
Key Features of Pandas
• Fast and efficient Data Frame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Reshaping and pivoting of data sets.
• Label-based slicing, indexing and sub-setting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.
• High performance merging and joining of data.
• Time Series functionality.
Standard Python distribution doesn't come bundled with the Pandas module. A lightweight
alternative is to install Pandas using the popular Python package installer, pip.
pip install pandas
If you install the Anaconda Python package, Pandas will be installed by default. The following
distributions are available for Windows:
• Anaconda (from https://www.continuum.io) is a free Python distribution for the SciPy
stack. It is also available for Linux and Mac.
• Canopy (https://www.enthought.com/products/canopy/) is available as a free as well as a
commercial distribution with the full SciPy stack for Windows, Linux and Mac.
• Python (x,y) is a free Python distribution with the SciPy stack and Spyder IDE for Windows
OS. (Downloadable from https://python-xy.github.io/)

Pandas deals with the following three data structures −


• Series
• DataFrame
• Panel

These data structures are built on top of Numpy array, which means they are fast.

Series
• Series is a one-dimensional array like structure with homogeneous data. For
example, the following series is a collection of integers 10, 23, 56, …

Key Points
o Homogeneous data
o Size Immutable
o Values of Data Mutable

DataFrame
• DataFrame is a two-dimensional array with heterogeneous data. For example,
the table represents the data of a sales team of an organization with their overall
performance rating. The data is represented in rows and columns. Each column
represents an attribute and each row represents a person.

Data Type of Columns


The data types of the four columns are as follows –

Panel
• Panel is a three-dimensional data structure with heterogeneous data. It is hard to
represent the panel in graphical representation. But a panel can be illustrated as a
container of DataFrame.

Key Points
• Heterogeneous data
• Size Mutable
• Data Mutable

WORKING WITH SERIES


A series can be created using various inputs like −
• Array
• Dict
• Scalar value or constant

Creating Series
1. Create an Empty Series
o A basic series, which can be created, is an Empty Series.

# import the pandas library and aliasing as pd
import pandas as pd
s = pd.Series()
print(s)

Output:
Series([], dtype: float64)

2. Create a Series from ndarray
If data is an ndarray, then the index passed must be of the same length. If no index is passed,
then by default the index will be range(n) where n is the array length, i.e.,
[0, 1, 2, 3, …, range(len(array))-1].

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data)
print(s)

Output:
0    a
1    b
2    c
3    d
dtype: object

import pandas as pd
import numpy as np
data = np.array(['a','b','c','d'])
s = pd.Series(data, index=[100,101,102,103])
print(s)

Output:
100    a
101    b
102    c
103    d
dtype: object
3. Create a Series from dict
A dict can be passed as input and if no index is specified, then the dictionary keys are taken
in a sorted order to construct the index. If an index is passed, the values in data corresponding
to the labels in the index will be pulled out.

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data)
print(s)

Output:
a    0.0
b    1.0
c    2.0
dtype: float64

import pandas as pd
import numpy as np
data = {'a' : 0., 'b' : 1., 'c' : 2.}
s = pd.Series(data, index=['b','c','d','a'])
print(s)

Output:
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
4. Create a Series from Scalar
If data is a scalar value, an index must be provided. The value will be repeated to match the
length of the index.

import pandas as pd
import numpy as np
s = pd.Series(5, index=[0, 1, 2, 3])
print(s)

Output:
0    5
1    5
2    5
3    5
dtype: int64
Accessing Data from Series with Position
1. Data in the series can be accessed similar to that in an ndarray.

import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])

# retrieve the first element
print(s[0])

Output:
1

2. Retrieve the first three elements in the Series.

import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])

# retrieve the first three elements
print(s[0:3])

Output:
a    1
b    2
c    3
dtype: int64

3. Retrieve the last three elements.

import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])

# retrieve the last three elements
print(s[-3:])

Output:
c    3
d    4
e    5
dtype: int64

4. Retrieve multiple elements using a list of index label values.

import pandas as pd
s = pd.Series([1,2,3,4,5], index = ['a','b','c','d','e'])

# retrieve multiple elements
print(s[['a','c','d']])

Output:
a    1
c    3
d    4
dtype: int64

WORKING WITH DATA FRAMES

A Data frame is a two-dimensional data structure, i.e., data is aligned in a tabular
fashion in rows and columns.
Features of DataFrame
o Potentially columns are of different types
o Size – Mutable
o Labeled axes (rows and columns)
o Can Perform Arithmetic operations on rows and columns
Structure
o Let us assume that we are creating a data frame with student’s data.


pandas.DataFrame
o A pandas DataFrame can be created using the following constructor −
pandas.DataFrame( data, index, columns, dtype, copy)

Create DataFrame
o A pandas DataFrame can be created using various inputs like –
o Lists
o Dict
o Series
o Numpy ndarrays
o Another DataFrame

1. Create an Empty DataFrame

# import the pandas library and aliasing as pd
import pandas as pd
df = pd.DataFrame()
print(df)

Output:
Empty DataFrame
Columns: []
Index: []

2. Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.

import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

Output:
   0
0  1
1  2
2  3
3  4
4  5

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data, columns=['Name','Age'])
print(df)

Output:
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13

import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data, columns=['Name','Age'], dtype=float)
print(df)

Output:
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0

3. Create a DataFrame from Dict of ndarrays / Lists

All the ndarrays must be of the same length. If an index is passed, then the length of the index
should equal the length of the arrays. If no index is passed, then by default, the index will be
range(n), where n is the array length.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Output:
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

4. Create an indexed DataFrame using arrays.

import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],
        'Age':[28,34,29,42]}
df = pd.DataFrame(data,
                  index=['rank1','rank2','rank3','rank4'])
print(df)

Output:
        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42
5. Create a DataFrame from List of Dicts

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Output:
   a   b     c
0  1   2   NaN
1  5  10  20.0

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

Output:
        a   b     c
first   1   2   NaN
second  5  10  20.0

import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]

# With two column indices, values same as dictionary keys
df1 = pd.DataFrame(data, index=['first', 'second'],
                   columns=['a', 'b'])

# With two column indices with one index with other name
df2 = pd.DataFrame(data, index=['first', 'second'],
                   columns=['x', 'y'])
print(df1)
print(df2)

Output:
        a   b
first   1   2
second  5  10

          x    y
first   NaN  NaN
second  NaN  NaN
Column Selection
We will understand this by selecting a column from the DataFrame.

import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])

Output:
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64
6. Column Addition is performed by adding a new column to an existing data frame.

import pandas as pd

d = {'one' : pd.Series([11, 22, 33, 44], index=['a', 'b', 'c', 'd']),
     'two' : pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)
print("Given Data Frame:")
print(df)

# Adding a new column to an existing DataFrame object
# with column label by passing a new series
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([7,8,9], index=['a','b','c'])
print(df)

print("Adding a new column using the existing columns in DataFrame:")
df['four'] = df['one'] + df['three']
print(df)

Output:
Given Data Frame:
   one  two
a   11  100
b   22  200
c   33  300
d   44  400

Adding a new column by passing as Series:
   one  two  three
a   11  100    7.0
b   22  200    8.0
c   33  300    9.0
d   44  400    NaN

Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a   11  100    7.0  18.0
b   22  200    8.0  30.0
c   33  300    9.0  42.0
d   44  400    NaN   NaN
7. Columns can be deleted or popped

import pandas as pd

d = {'one' : pd.Series([11, 22, 33], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our dataframe is:")
print(df)

# using del function
print("Deleting using DEL function:")
del df['two']
print(df)

# using pop function
print("Deleting using POP function:")
df.pop('one')
print(df)

Output:
Our dataframe is:
   one  two  three
a   11    1     10
b   22    2     20
c   33    3     30

Deleting using DEL function:
   one  three
a   11     10
b   22     20
c   33     30

Deleting using POP function:
   three
a     10
b     20
c     30
Row Selection, Addition, and Deletion
8. Rows can be selected by passing row label to a loc function.
import pandas as pd one two
d = {'one' : pd.Series([11, 22, 33], index=['a', 'b', 'c']), a 11 1
'two' : pd.Series([1, 2, 3], index=['a', 'b', 'c'])} b 22 2
c 33 3
df = pd.DataFrame(d)
print (df) one 22

two 2
print (df.loc['b']) Name: b, dtype: int64
import pandas as pd one 3
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']), two 300
'two' : pd.Series([100, 200, 300], index=['a', 'b', 'c'])} Name: c, dtype: int64

df = pd.DataFrame(d)
print (df.iloc[2])
9. Slice Rows
Multiple rows can be selected using ‘ : ’ operator.
import pandas as pd one two
d = {'one' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']), c 3 300
'two' : pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', d 4 400
'd'])}

df = pd.DataFrame(d)
print (df[2:4])
10. Addition of Rows
Add new rows to a DataFrame using the append function.
import pandas as pd a b
df = pd.DataFrame([[55, 66], [77, 66]], columns = ['a','b']) 0 55 66
df2 = pd.DataFrame([[700, 600], [800, 900]], columns = 1 77 66
['a','b'])
0 700 600
df = df.append(df2) 1 800 900
print (df)
11. Deletion of Rows
Drop a label and see how many rows will get dropped.
import pandas as pd Original Data frame..........
a b
df = pd.DataFrame([[11, 22], [33, 44]], columns = ['a','b']) 0 11 22
df2 = pd.DataFrame([[55, 66], [77, 88]], columns = 1 33 44
['a','b']) 0 55 66
1 77 88
df = df.append(df2) Drop rows with label 0......
print("Original Data frame..........") a b
print(df) 1 33 44
print("Drop rows with label 0......") 1 77 88
df = df.drop(0)

print (df)
Descriptive Statistics
import pandas as pd Name Age Rating
import numpy as np 0 Tom 25 4.23
1 James 26 3.24
#Create a Dictionary of series 2 Ricky 25 3.98
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 3 Vin 23 2.56
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
4 Steve 30 3.20
5 Smith 29 4.60
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.6 6 Jack 23 3.80
5]) 7 Lee 34 3.78
}
8 David 40 2.98
9 Gasper 30 4.80
#Create a DataFrame 10 Betina 51 4.10
df = pd.DataFrame(d) 11 Andres 46 3.65
print (df)
sum()
Returns the sum of the values for the requested axis. By default, axis is index (axis=0).
import pandas as pd Name
TomJamesRickyVinSteveSmithJackLeeDavidGasperB
import numpy as np e...
Age 382
Rating 44.92
#Create a Dictionary of series dtype: object
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack',
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]),
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65])
}

Note:
#Create a DataFrame Each individual column is added individually
df = pd.DataFrame(d) (Strings are appended).
print (df.sum())
axis=1
import pandas as pd 0 29.23
import numpy as np 1 29.24
2 28.98
#Create a Dictionary of series 3 25.56
d = {'Name':pd.Series(['Tom','James','Ricky','Vin','Steve','Smith','Jack', 4 33.20
'Lee','David','Gasper','Betina','Andres']),
'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46]), 5 33.60
'Rating':pd.Series([4.23,3.24,3.98,2.56,3.20,4.6,3.8,3.78,2.98,4.80,4.10,3.65]) 6 26.80
}
7 37.78
8 42.98
#Create a DataFrame
9 34.80
df = pd.DataFrame(d)
10 55.10
print (df.sum(1))
11 49.65
dtype: float64
import pandas as pd Mean ..............
import numpy as np Age 31.833333
dtype: float64
#Create a Dictionary of series
d=
{'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46])}

#Create a DataFrame
df = pd.DataFrame(d)
print("Mean ..............")
print (df.mean())
import pandas as pd Standard Deviation ..............
import numpy as np Age 9.232682
dtype: float64
#Create a Dictionary of series
d=
{'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46])}

#Create a DataFrame
df = pd.DataFrame(d)
print("standard Devaiation ..............")
print (df.std())
import pandas as pd Describing various datas...........
import numpy as np Age
count 12.000000
#Create a Dictionary of series mean 31.833333
d= std 9.232682
{'Age':pd.Series([25,26,25,23,30,29,23,34,40,30,51,46])} min 23.000000
25% 25.000000
#Create a DataFrame 50% 29.500000
df = pd.DataFrame(d) 75% 35.500000
print("Describing various datas...........") max 51.000000
print (df.describe())
Drop duplicates Name Age Country
# import the library as pd 0 Srivignesh 22 India
import pandas as pd 1 Srivignesh 22 India
df = pd.DataFrame( 2 Hari 11 India
{
'Name': ['Srivignesh', 'Srivignesh','Hari'], Name Age Country
'Age': [22,22, 11], 0 Srivignesh 22 India
'Country': ['India','India', 'India'] 2 Hari 11 India
}
)
print(df)
newdf = df.drop_duplicates()
print(newdf)
⎯ Transposed Data Frame
o The transpose() function is used to transpose index and columns

⎯ axes
o Returns the list of row axis labels and column axis labels.

⎯ empty
o Returns the Boolean value saying whether the Object is empty or not; True
indicates that the object is empty.

⎯ ndim
o Returns the number of dimensions of the object. By definition, DataFrame is a 2D
object.

⎯ Shape
o Returns a tuple representing the dimensionality of the DataFrame. Tuple (a,b),
where a represents the number of rows and b represents the number of columns.

⎯ Size
o Returns the number of elements in the DataFrame.

⎯ values
o Returns the actual data in the DataFrame as an NDarray

⎯ Head & Tail


o To view a small sample of a DataFrame object, use the head() and tail() methods.
o head() returns the first n rows (observe the index values). The default number of
elements to display is five, but you may pass a custom number.
o tail() returns the last n rows (observe the index values). The default number of
elements to display is five, but you may pass a custom number.

import pandas as pd Our Data Frame:


import numpy as np Name Age Rating
0 Tom 25 4.23
#Create a Dictionary 1 James 26 3.24
df = pd.DataFrame( 2 Ricky 25 3.98
{'Name':['Tom','James','Ricky','Vin','Steve','Smith','Jack'], 3 Vin 23 2.56
'Age':[25,26,25,23,30,29,23], 4 Steve 30 3.20
'Rating':[4.23,3.24,3.98,2.56,3.20,4.6,3.8]}) 5 Smith 29 4.60
6 Jack 23 3.80
#print the DataFrame
Transposed Data Frame:
0 1 2 3 4 5 6
print ("Our Data Frame:") Name Tom James Ricky Vin Steve Smith
print (df) Jack
Age 25 26 25 23 30 29 23
Rating 4.23 3.24 3.98 2.56 3.2 4.6 3.8
print ("Transposed Data Frame:")
print (df.T) Row axis labels and column axis labels
are:
print ("Row axis labels and column axis labels are:") [RangeIndex(start=0, stop=7, step=1),
print (df.axes) Index(['Name', 'Age', 'Rating'],
dtype='object')]
print ("The data types of each column are:")
The data types of each column are:
print (df.dtypes) Name object
Age int64
print ("Is the object empty?") Rating float64
print (df.empty) dtype: object

Is the object empty?


False
print ("The dimension of the object is:") The dimension of the object is:
print (df.ndim) 2
The shape of the object is:
print ("The shape of the object is:") (7, 3)
print (df.shape)
The total number of elements in our
object is:
print ("The total number of elements in our object 21
is:")
print (df.size) The actual data in our data frame is:
[['Tom' 25 4.23]
print ("The actual data in our data frame is:") ['James' 26 3.24]
['Ricky' 25 3.98]
print (df.values) ['Vin' 23 2.56]
['Steve' 30 3.2]
['Smith' 29 4.6]
['Jack' 23 3.8]]

print ("The first two rows of the data frame is:")


print (df.head(2)) The first two rows of the data frame is:
Name Age Rating
0 Tom 25 4.23
1 James 26 3.24

The first five rows of the data frame is:


# if value not specified, default is taken as 5 Name Age Rating
0 Tom 25 4.23
print ("The first five rows of the data frame is:") 1 James 26 3.24
print (df.head()) 2 Ricky 25 3.98
3 Vin 23 2.56
4 Steve 30 3.20

The last two rows of the data frame is:


print ("The last two rows of the data frame is:")
Name Age Rating
print (df.tail(2)) 5 Smith 29 4.6
6 Jack 23 3.8

Read CSV Files


• A simple way to store big data sets is to use CSV files (comma separated values files).
• CSV files contain plain text and are a well known format that can be read by everyone, including Pandas.
• In our examples we will be using a CSV file called 'data.csv'.
/* Load the CSV into a DataFrame */ Tip: use to_string() to print the
import pandas as pd entire DataFrame.
df = pd.read_csv('data.csv')
print(df.to_string())
/*Print Data Frame without the to_string() method */ Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
import pandas as pd
2 60 103 135 340.0
df = pd.read_csv('data.csv') 3 45 109 175 282.4
print(df) 4 45 117 148 406.0
.. ... ... ... ...
164 60 105 140 290.8
165 60 110 145 300.4
166 60 115 145 310.2
167 75 120 150 320.4
168 75 125 150 330.4

[169 rows x 4 columns]


max_rows
The number of rows returned is defined in the Pandas option settings. We can check our system's
maximum rows with the pd.options.display.max_rows statement.

/* Python program to check maximum rows */ 60

import pandas as pd In my system the number is 60, which means


print(pd.options.display.max_rows) that if the DataFrame contains more than 60
rows, the print(df) statement will return only
the headers and the first and last 5 rows.

You can change the maximum rows number


with the same statement.
/* Increase the maximum number of rows to display the
entire DataFrame */
import pandas as pd
pd.options.display.max_rows = 9999
df = pd.read_csv('data.csv')

print(df)

Missing Data in Pandas


The difference between data found in many tutorials and data in the real world is that real-
world data is rarely clean and homogeneous. In particular, many interesting datasets will have
some amount of data missing. Missing Data can occur when no information is provided for one
or more items or for a whole unit. Missing Data is a very big problem in real-life scenarios.
What is a Missing Value?
Missing data is defined as the values or data that are not stored (or not present) for some
variable(s) in the given dataset. For example, in the Titanic dataset the columns ‘Age’ and ‘Cabin’
contain missing values.

Many datasets simply arrive with missing data, either because it exists but was not collected, or
because it never existed. To make matters even more complicated, different data sources may
indicate missing data in different ways. For example, some users being surveyed may choose not
to share their income, while others may choose not to share their address; in this way values go
missing from many datasets.
Trade-Offs in Missing Data Conventions
A number of schemes have been developed to indicate the presence of missing data in a table or
DataFrame. Generally, they revolve around one of two strategies: using a mask that globally
indicates missing values, or choosing a sentinel value that indicates a missing entry.
• In the masking approach, the mask might be a separate Boolean array (see the sketch below).
• In the sentinel approach, the sentinel value could be some data-specific convention, such
as indicating a missing integer value with -9999 or some rare bit pattern, or it could be a
more global convention, such as indicating a missing floating-point value with NaN (Not
a Number).
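As a small illustration of the two conventions, here is a minimal sketch using NumPy and Pandas (the array values are made up for the example):

import numpy as np
import pandas as pd

values = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])   # NaN acts as the sentinel value

# Masking approach: a separate Boolean array marks which entries are missing
mask = values.isnull()
print(mask.values)     # [False  True False  True False]

# Sentinel approach: the missing entries are encoded inside the data itself,
# here with the floating-point sentinel NaN
print(values.values)   # [ 1. nan  3. nan  5.]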
Pandas treats None and NaN as essentially interchangeable for indicating missing or null values.
To facilitate this convention, there are several useful functions for detecting, removing, and
replacing null values in a Pandas DataFrame:
• isnull()
• notnull()
• dropna()
• fillna()
• replace()
• interpolate()

Checking for missing values using isnull() and notnull()


In order to check for missing values in a Pandas DataFrame, we use the functions isnull() and notnull().
Both functions help in checking whether a value is NaN or not. These functions can also be used
on a Pandas Series in order to find null values in a series.

/* Python program to check missing values */ Output

import pandas as pd
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from list


df = pd.DataFrame(dict)
print (df) When df.notnull()is applied

# using isnull() function


df.isnull()

Dropping missing values using dropna()


In order to drop null values from a dataframe, we use the dropna() function. This function drops
rows/columns of the dataset with null values in different ways.

Code #1: Dropping rows with at least 1 null value. Output


Data Frame:
# importing pandas as pd
import pandas as pd

# importing numpy as np
import numpy as np
After dropping rows with
# dictionary of lists atleast 1 NaN value
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, 40, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan,65]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)

# using dropna() function
df.dropna()
Output
Code #2: Drop rows where all the data is missing or Data Frame:
contains null values (NaN)

# importing pandas as pd
import pandas as pd

# importing numpy as np After dropping rows whose all


import numpy as np data is missing or contain null
values(NaN)
# dictionary of lists
dict = {'First Score':[100, np.nan, np.nan, 95],
'Second Score': [30, np.nan, 45, 56],
'Third Score':[52, np.nan, 80, 98],
'Fourth Score':[np.nan, np.nan, np.nan, 65]}

df = pd.DataFrame(dict)
print(df)
# using dropna() function
df.dropna(how = 'all')

Filling missing values using fillna(), replace() and interpolate()


In order to fill null values in a dataset, we use the fillna(), replace() and interpolate() functions.
These functions replace NaN values with some value of their own. All of these functions help in
filling null values in a DataFrame. The interpolate() function is also used to fill NA values in the
dataframe, but it uses various interpolation techniques to fill the missing values rather than
hard-coding a single value.
Code #1: Filling null values with a single value Output
import pandas as pd
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90,np.nan,95],
'Second Score': [30, 45,56,np.nan],
'Third Score':[np.nan, 40,80,98]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)

# filling missing value using fillna()


df.fillna(0)

Code #2: Filling null values with the previous ones Output
import pandas as pd
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan,40,80,np.nan]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)

# filling a missing value with


# previous ones
df.fillna(method ='pad')

Code #3: Filling null value with the next ones Output
import pandas as pd
import numpy as np

# dictionary of lists
dict = {'First Score':[100, 90, np.nan, 95],
'Second Score': [30, 45, 56, np.nan],
'Third Score':[np.nan, 40, 80, 98]}

# creating a dataframe from dictionary


df = pd.DataFrame(dict)

# filling null value using fillna() function


df.fillna(method ='bfill')

Code #4: Filling null values in CSV File Output:


# importing pandas package
import pandas as pd

# making data frame from csv file


data = pd.read_csv("employees.csv")

# Printing the first 10 to 24 rows of


# the data frame for visualization
print(data[10:25])

This program prints the employee database
starting from row 10 up to row 24. In the output, we
can see that for Row 20 & Row 22 the value for Gender is
NaN

Now we are going to fill all the null values in Gender column Output:
with “No Gender”

# importing pandas package


import pandas as pd

# making data frame from csv file


data = pd.read_csv("employees.csv")

# filling a null values using fillna()
data["Gender"].fillna("No Gender",inplace=True)

print(data)

Code #5: Filling null values using the replace() method Output:


# importing pandas package
import pandas as pd

# making data frame from csv file


data = pd.read_csv("employees.csv")

# Printing the first 10 to 24 rows of


# the data frame for visualization
data[10:25]

Now we are going to replace all the NaN values in the data Output:
frame with the value -99.

# importing pandas package


import pandas as pd
import numpy as np

# making data frame from csv file


data = pd.read_csv("employees.csv")

# Replace Nan value in dataframe with value -99


data.replace(to_replace = np.nan, value = -99)

Interpolation
Interpolation in Python is a technique used to estimate unknown data points between two
known data points. Interpolation is mostly used to impute missing values in the data frame or
series while pre-processing data. Interpolation is also used in Image Processing when
expanding an image you can estimate the pixel value with help of neighbouring pixels.
When to use Interpolation?
We can use interpolation to estimate a missing value with the help of its neighbours. When
imputing missing values with the overall average does not fit well, we have to move to a different
technique, and the technique most people reach for is interpolation. Interpolation is mostly used
while working with time-series data, because in time-series data we prefer to fill missing values
using the previous one or two values. For example, for a missing temperature reading we would
usually prefer the mean of the last two days rather than the mean of the whole month. We can
also use interpolation when calculating moving averages.

Using Interpolation to fill Missing Values in Series Data
A Pandas Series is a one-dimensional array which is capable of storing elements of various data
types, like a list. We can easily create a Series with the help of a list, tuple, or dictionary. To
demonstrate the interpolation methods, we will create a Pandas Series with some NaN values and
try to fill the missing values with different methods of interpolation.

Output:
import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
index=[100,101,102,103,104,105,106])
print(a)

1) Linear Interpolation
Linear interpolation simply means estimating a missing value by connecting the surrounding points
with a straight line, in increasing order. In short, it estimates the unknown value in the same
increasing order from the previous values. The default method used by interpolate() is linear, so
we do not need to specify it when applying it.
Code #6.1: Linear Interpolation Output:

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
index=[100,101,102,103,104,105,106])
print(a.interpolate())

Hence, linear interpolation works in the same increasing order. Remember that it does not
interpolate using the index; it estimates values by connecting points in a straight line.

2) Polynomial Interpolation
In polynomial interpolation you need to specify an order. Polynomial interpolation fills missing
values using a polynomial of the specified degree fitted through the available data points. The
resulting interpolation curve can look like a trigonometric sine curve or take a parabola-like
shape, depending on the data and the order.
Code #6.2: Polynomial Interpolation Output:

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
index=[100,101,102,103,104,105,106])
a.interpolate(method="polynomial", order=2)

If you pass the order as 1 then the output will be similar to linear Output:
interpolation, because a polynomial of order 1 is linear.

import pandas as pd
import numpy as np

a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
index=[100,101,102,103,104,105,106])
a.interpolate(method="polynomial", order=1)

3) Interpolation through Padding


Interpolation with the help of padding simply means filling missing values with the same value
present above them in the dataset. If the missing value is in the first row, then this method will
not work. While using this technique you also need to specify the limit, which means how many
consecutive NaN values to fill. So, if you are working on a real-world project and want to fill all
missing values with previous values, you can set the limit to the number of rows in the dataset.

Code #6.3: Padding Interpolation Output:

import pandas as pd
import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
index=[100,101,102,103,104,105,106])
a.interpolate(method="pad", limit=2)

import pandas as pd Output:


import numpy as np
a = pd.Series([0, 1, np.nan, 3, 4, 5, 7],
index=[100,101,102,103,104,105,106])
a.interpolate(method="bfill", limit=2)

Using Interpolation to fill Missing Values in Pandas DataFrame


A DataFrame is a widely used Python data structure that stores data in the form of rows and
columns. When performing data analysis we usually store the data in a table, which is known as
a dataframe. A dataframe can contain a large number of missing values across many columns, so
let us understand how we can use interpolation to fill missing values in a dataframe.

Code #7: Displaying the Data Frame Output:

import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 7, None, 2],
"B":[None, 3, 57, 3, None],
"C":[20, 16, None, 3, 8],
"D":[14, 3, None, None, 6]})
print(df)

Displaying the Data Frame & Performing Linear Interpolation in forwarding Direction
The linear method ignores the index and treats missing values as equally spaced and finds the best
point to fit the missing value after previous points. If the missing value is at first index then it will leave
it as Nan. Let’s apply it to our dataframe.
Code #7.1: Performing Linear Interpolation in forward Direction Output:
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 7, None, 2],
"B":[None, 3, 57, 3, None],
"C":[20, 16, None, 3, 8],
"D":[14, 3, None, None, 6]})
df.interpolate(method ='linear', limit_direction ='forward')

If you only want to perform interpolation on a single column, that is
also simple and follows the code below.
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 7, None, 2],
"B":[None, 3, 57, 3, None],
"C":[20, 16, None, 3, 8],
"D":[14, 3, None, None, 6]})
df['C'].interpolate(method="linear")
Interpolation with Padding
We have already seen that to use padding we have to specify the limit of NaN values to be filled. We
have a maximum of 2 consecutive NaN values in the dataframe, so our limit will be 2.
Code #7.2: Performing Interpolation with padding Output:
import pandas as pd
# Creating the dataframe
df = pd.DataFrame({"A":[12, 4, 7, None, 2],
"B":[None, 3, 57, 3, None],
"C":[20, 16, None, 3, 8],
"D":[14, 3, None, None, 6]})
df.interpolate(method="pad", limit=2)

Hierarchical Indexing with Pandas


Till now we have focused primarily on one-dimensional and two-dimensional data, stored in
Pandas Series and DataFrame objects, respectively. Often it is useful to go beyond this and store
higher-dimensional data, that is, data indexed by more than one or two keys. While older versions
of Pandas provided Panel and Panel4D objects that natively handled three-dimensional and
four-dimensional data (these objects have since been removed), a far more common pattern in
practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate
multiple index levels within a single index.

#Importing libraries Output


import pandas as pd
import numpy as np
data=pd.Series(np.random.randn(8),
index=[["a","a","a","b","b","b","c","c"],
[1,2,3,1,2,3,1,2]])
print(data)

What is MultiIndex?
A MultiIndex allows a Series or DataFrame to carry more than one level of row or column labels.
To understand MultiIndex, let’s look at the indexes of the data.
#Importing libraries Output
import pandas as pd
import numpy as np
data=pd.Series(np.random.randn(8),
index=[["a","a","a","b","b","b","c","c"],
[1,2,3,1,2,3,1,2]])
print(data.index)

MultiIndex is an advanced indexing technique for DataFrames that shows the multiple levels of
the indexes. Our dataset has two levels. You can obtain subsets of the data using the indexes. For
example, let’s take a look at the values with index a.

#Importing libraries Output


import pandas as pd
import numpy as np
data=pd.Series(np.random.randn(8),
index=[["a","a","a","b","b","b","c","c"],
[1,2,3,1,2,3,1,2]])
print(data["a"])

#slicing can also be done on multiindexes Output


data["b":"c"]

#We can also look more than one index Output


data.loc[["a","c"]]

You can select values from the inner index. Let’s take a look at the Output
first values of the inner index.
#We can also look more than one index
data.loc[:,1]

What is the unstack?


The stack method turns column names into index values, and the unstack method turns index
values into column names. You can see the data as a table with the unstack method.
#Importing libraries Output
import pandas as pd
import numpy as np
data = pd.Series(np.random.randn(8),
index=[["a","a","a","b","b","b","c","c"]
,
[1,2,3,1,2,3,1,2]])
data.unstack()

To restore the dataset, you can use the stack method. Output

data.unstack().stack()

Hierarchical Indexing in The Data Frame


You can move the DataFrame’s columns to the row index. To show this, let’s create a dataset.
#Importing libraries
import pandas as pd
import numpy as np
data=pd.DataFrame({"x":range(8),
"y":range(8,0,-1),
"a":["one","one","one","one","two","two","two","two"],
"b":[30,41,22,33,5,133,21,31]})
print(data)

Output:

data2=data.set_index(["a","b"]) data3=data.set_index(["a","b"]).sort_index()
data2 data3

Output: Output:

Combining Data in Pandas with append(), merge(), join(), and concat()

pandas concat(): Combining Data Across Rows or Columns


Concatenation is a bit different from the merging techniques described later in this section. With merging,
you can expect the resulting dataset to have rows from the parent datasets mixed in together,
often based on some commonality.
With concatenation, your datasets are just stitched together along an axis — either the row axis
or column axis.
#Importing libraries
Output:
import pandas as pd
import numpy as np

df1 = pd.DataFrame(
{
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
},
index=[0, 1, 2, 3],
)
df2 = pd.DataFrame(
{
"A": ["A4", "A5", "A6", "A7"],
"B": ["B4", "B5", "B6", "B7"],
"C": ["C4", "C5", "C6", "C7"],
"D": ["D4", "D5", "D6", "D7"],
},
index=[4, 5, 6, 7],
)
df3 = pd.DataFrame(
{
"A": ["A8", "A9", "A10", "A11"],
"B": ["B8", "B9", "B10", "B11"],
"C": ["C8", "C9", "C10", "C11"],
"D": ["D8", "D9", "D10", "D11"],
},
index=[8, 9, 10, 11],
)
frames = [df1, df2, df3]
result = pd.concat(frames)
print("\n", df1)
print("\n", df2)
print("\n", df3)
print("\n", result)


Visually, a concatenation with no parameters along rows would look like this:

df1 = pd.DataFrame( df1 = pd.DataFrame(


{ {
"A": ["A0", "A1"], "A": ["A0", "A1"],
"B": ["B0", "B1"], "B": ["B0", "B1"],
"C": ["C0", "C1"], "C": ["C0", "C1"],
"D": ["D0", "D1"], "D": ["D0", "D1"],
},index=[0, 1]) },index=[0, 1])
df2 = pd.DataFrame( df2 = pd.DataFrame(
{ {
"A": ["A4", "A5"], "A": ["A4", "A5"],
"B": ["B4", "B5"], "B": ["B4", "B5"],
"C": ["C4", "C5"], "C": ["C4", "C5"],
"D": ["D4", "D5"], "D": ["D4", "D5"],
},index=[0, 1]) },index=[0, 1])
frames = [df1, df2] frames = [df1, df2]
result = pd.concat(frames) result = pd.concat((frames),
print("\n", df1) axis = "columns")
print("\n", df2) print("\n", df1)
print("\n", result) print("\n", df2)
print("\n", result)
Output:
Output:

result = pd.concat((frames), axis = "rows")


Output:
result = pd.concat(frames, keys=["x", "y"])

As you can see, the resulting object has a hierarchical index.

In Pandas for a horizontal combination we have merge() and join(), whereas for vertical
combination we can use concat() and append(). Merge and join perform similar tasks but
internally they have some differences, similar to concat and append.
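The join() method mentioned above is not demonstrated elsewhere in these notes, so here is a minimal sketch of an index-based join (the DataFrames and labels below are made up for the example):

import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['k0', 'k1', 'k2'])
right = pd.DataFrame({'B': ['B0', 'B2']}, index=['k0', 'k2'])

# join() aligns the two DataFrames on their index labels
# (a left join by default), whereas merge() joins on columns by default
result = left.join(right)
print(result)
#      A    B
# k0  A0   B0
# k1  A1  NaN
# k2  A2   B2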
pandas merge():
Pandas provides various built-in functions for easily combining datasets. Among them, merge()
is a high-performance in-memory operation very similar to relational databases like SQL. You
can use merge() any time when you want to do database-like join operations.
• The simplest call without any key column
• Specifying key columns using on
• Merging using left_on and right_on
• Various forms of joins: inner, left, right and outer
Syntax:
• # This join brings together the entire DataFame
df.merge(df2)

• # This join only brings together a subset of columns


• # 'col1' is my key in this case
df[['col1', 'col2']].merge(df2[['col1', 'col3']], on='col1')
Code 1#: Merging two DataFrames Output:
df1 = pd.DataFrame({
'id': [1, 2, 3, 4],
'name': ['Tom', 'Jenny', 'James', 'Dan'],
})
df2 = pd.DataFrame({
'id': [2, 3, 4, 5],
'age': [31, 20, 40, 70],
'sex': ['F', 'M', 'M', 'F']
})
print("\n",df1)
print("\n",df2)

final = pd.merge(df1, df2)


print("\n",final)

pd.merge(df1, df2)
(or)
df1.merge(df2)
(or)
df1.merge(df2, on='Name')

Output:
Code 2#: Merge two DataFrames via ‘id’ column.

final = df1.merge(df2, on='id')


print("\n",final)

df1.merge(df2, left_on='id',
right_on='customer_id')

Output:
Code 3#: Merge with different column names - specify a
left_on and right_on

final = df1.merge(df2, left_on='id',


right_on='customer_id')

Various type of joins: inner, left, right and outer


They are 4 types of joins available to Pandas merge() function. The logic behind these joins is very much
the same that you have in SQL when you join tables. You can perform a type of join by specifying the how
argument with the following values:
• inner: Default join is inner in Pandas merge() function, and it produces records that have
matching values in both DataFrames
• left: produces all records from the left DataFrame and the matched records from the right
DataFrame
• right: produces all records from the right DataFrame and the matched records from the left
DataFrame
• outer: produces all records when there is a match in either left or right DataFrame


pd.merge(df_customer, df_info, on='id', how=?)

Code 4#: Merge using inner join Output:


Pandas merge() is performing the inner join and it
produces only the set of records that match in both id name age sex
DataFrames. 0 2 Jenny 31 F
1 3 James 20 M
2 4 Dan 40 M
final = pd.merge(df1, df2, on='id',
how = 'inner')

pd.merge(df1, df2, on='id', how='inner')

Code 4#: Merge using Left join Output:


The left join produces all records from the left id name age sex
DataFrame, and the matched records from the right 0 1 Tom NaN NaN
1 2 Jenny 31.0 F
DataFrame. If there is no match, the left side will 2 3 James 20.0 M
contain NaN. 3 4 Dan 40.0 M

final = pd.merge(df1, df2, on='id',


how = 'left')


pd.merge(df1, df2, on='id', how='left')

Code 4#: Merge using Right join Output:


The right join produces all records from the right id name age sex
DataFrame, and the matched records from the left 0 2 Jenny 31 F
1 3 James 20 M
DataFrame. If there is no match, the right side will 2 4 Dan 40 M
contain NaN. 3 5 NaN 70 F

final = pd.merge(df1, df2, on='id',


how = 'right')

pd.merge(df1, df2, on='id', how='right')

Code 4#: Merge using Outer join Output:


The outer join produces all records when there is a id name age sex
match in either left or right DataFrame. NaN will be 0 1 Tom NaN NaN
1 2 Jenny 31.0 F
filled for no match on either sides. 2 3 James 20.0 M
final = pd.merge(df1, df2, on='id', 3 4 Dan 40.0 M
how = 'outer') 4 5 NaN 70.0 F


pd.merge(df1, df2, on='id', how='outer')

pandas append():
To append the rows of one dataframe to the rows of another, we can use the Pandas append()
function. With the help of append(), we can append columns too. (Note: in recent versions of
Pandas, DataFrame.append() has been deprecated in favour of pd.concat().)

Steps
• Create a two-dimensional, size-mutable, potentially heterogeneous tabular data, df1.
• Print the input DataFrame, df1.
• Create another DataFrame, df2, with the same column names and print it.
• Use the append method, df1.append(df2, ignore_index=True), to append the rows of df2
to df1.
• Print the resultant DataFrame.
Code 5#: Append Function to join DataFrames Output:
import pandas as pd
df1 = pd.DataFrame({"x": [5, 2],
"y": [4, 7],
"z": [1, 3]})
df2 = pd.DataFrame({"x": [1, 3],
"y": [1, 9],
"z": [1, 3]})
print ("\n", df1)
print ("\n", df2)
df3 = df1.append(df2)
print ("\n ", df3)
Output:

df3 = df1.append(df2, ignore_index=True)

import pandas as pd Output:


df1 = pd.DataFrame({"x": [5, 2],
"y": [4, 7],
"z": [1, 3]})
df2 = pd.DataFrame({"a": [1, 3],
"b": [1, 9],
"c": [1, 3]})
print ("\n", df1)
print ("\n", df2)
df3 = df1.append(df2, ignore_index=True)

print ("\n ", df3)

Aggregation in Pandas
Aggregation in pandas provides various functions that perform a mathematical or logical
operation on our dataset and returns a summary of that function. Aggregation can be used to get
a summary of columns in our dataset like getting sum, minimum, maximum, etc. from a
particular column of our dataset.
Some functions used in the aggregation are:
• sum() Compute sum of column values
• min() Compute min of column values
• max() Compute max of column values
• mean() Compute mean of column
• size() Compute column sizes
• describe() Generates descriptive statistics
• first() Compute first of group values
• last() Compute last of group values
• count() Compute count of column values
• std() Standard deviation of column
• var() Compute variance of column
• sem() Standard error of the mean of column

df = pd.DataFrame([[9, 4, 8, 9],
[8, 10, 7, 6],
[7, 6, 8, 5]],
columns=['Maths', 'English',
'Science', 'History'])

# display dataset
print(df)

Output

agg() Calculate the sum, min, and max of each column in our dataset.
df.agg(['sum', 'min', 'max'])


Grouping in Pandas
Grouping is used to group data using some criteria from our dataset. It is used as split-apply-
combine strategy.
• Splitting the data into groups based on some criteria.
• Applying a function to each group independently.
• Combining the results into a data structure.
Examples:
• We use the groupby() function to group the data on the “Maths” column. It returns
a GroupBy object as the result.
df.groupby(by=['Maths'])
Applying the groupby() function to group the data on the “Maths” column. To view the result of the
formed groups, use the first() function.
a = df.groupby('Maths')
a.first()

First we group based on “Maths”; then, within each group, we group based on “Science”.
b = df.groupby(['Maths', 'Science'])
b.first()

Simple DataFrame Output:


import pandas as pd
df = pd.DataFrame( A B C
0 1 1 0.362838
{
1 1 2 0.227877
"A": [1, 1, 2, 2], 2 2 3 1.267767
"B": [1, 2, 3, 4], 3 2 4 -0.562860
"C": [0.362838,0.227877,1.267767,-0.562860]
}
)
print(df)
The aggregation is for each column. B C
df.groupby('A').agg('min') A

1 1 0.227877
2 3 -0.562860
Multiple aggregations B C
min max min
max
df.groupby('A').agg(['min', 'max']) A
1 1 2 0.227877
0.362838
2 3 4 -0.562860
1.267767
Select a column for aggregation min max
df.groupby('A').B.agg(['min', 'max']) A
1 1 2
2 3 4
Different aggregations per column B C
min max sum
df.groupby('A').agg({'B': ['min', 'max'], 'C': 'sum'})
A
1 1 2 0.590715
2 3 4 0.704907
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], key data
'data': range(6)}, columns=['key', 'data']) 0 A 0
1 B 1
print(df) 2 C 2
3 A 3
4 B 4
5 C 5
df.groupby('key').sum()) data
key
A 3
B 5
C 7
Transformation.
While aggregation must return a reduced version of the data, transformation can return some
transformed version of the full data to recombine. For such a transformation, the output is the
same shape as the input.
df.sum()                                           key     ABCABC
                                                   data    15
                                                   dtype: object
df.mean()                                          data    2.5
df.groupby('key').transform(lambda x: x - x.mean())
                                                      data
                                                   0  -1.5
                                                   1  -1.5
                                                   2  -1.5
                                                   3   1.5
                                                   4   1.5
                                                   5   1.5

Pivot Tables
We have seen how the GroupBy abstraction lets us explore relationships within a dataset. A
pivot table is a similar operation that is commonly seen in spread sheets and other programs
that operate on tabular data. The pivot table takes simple column wise data as input, and groups
the entries into a two-dimensional table that provides a multidimensional summarization of the
data. The difference between pivot tables and GroupBy can sometimes cause confusion; it helps
me to think of pivot tables as essentially a multidimensional version of GroupBy aggregation.
That is, you split, apply, and combine, but both the split and the combine happen not across a
one-dimensional index, but across a two-dimensional grid.

Purpose:
Create a spreadsheet-style pivot table as a DataFrame. The levels in the pivot table of pandas
will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the
result DataFrame.
How to make a pivot table?
Use the pd.pivot_table() function and specify what feature should go in the rows and columns
using the index and columns parameters respectively. The feature that should be used to fill in
the cell values should be specified in the values parameter.
import pandas as pd
import numpy as np
df = pd.DataFrame({'First Name':['Aryan','Rohan','Riya','Yash','Siddhant',],
'Last Name':['Singh','Agarwal','Shah','Bhatia','Khanna'],
'Type':['Full-time Employee','Intern','Full-time Employee',
'Part-time Employee', 'Full-time Employee'],
'Department': ['Administration','Technical','Administration',
'Technical','Management'],
'YoE': [2, 3, 5, 7, 6],
'Salary': [20000, 5000, 10000, 10000, 20000]})

print(df)

Output:

output = pd.pivot_table(data=df,
index=['Type'],
columns=['Department'],
values='Salary',
aggfunc='mean')
print(output)


Here, we have made a basic pivot table in pandas which shows the average salary of each type
of employee for each department. As there are no user-defined parameters passed, the
remaining arguments have assumed their default values.
Pivot table with multiple aggregation functions
df1 = pd.pivot_table(data=df, index=['Type'],
values='Salary',
aggfunc=['sum', 'mean', 'count'])
print(df1)

Output

Calculating row and column grand totals in pivot_table


Now, let’s take a look at the grand total of the salary of each type of employee. For this, we will
use the margins and the margins_name parameter.
# Calculate row and column totals (margins)
df1 = pd.pivot_table(data=df, index=['Type'],
values='Salary',
aggfunc=['sum', 'mean', 'count'],
margins=True,
margins_name='Grand Total')
print(df1)

df = pd.DataFrame({"P": ["f1","f1","f1","f1","f1","b1","b1","b1","b1"],
"Q": ["one", "one", "one", "two", "two",
"one", "one", "two", "two"],
"R": ["small", "large", "large", "small",
"small", "large", "small", "small",
"large"],
"S": [1, 2, 2, 3, 3, 4, 5, 6, 7],
"T": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
print(df)
Output

table = pd.pivot_table(df, values='S', index=['P', 'Q'],


columns=['R'], aggfunc=np.sum)
print(table)

Output:


table = pd.pivot_table(df, values='S', index=['P', 'Q'],


columns=['R'], aggfunc=np.sum, fill_value=9999)
print(table)

df.sort_values(by=['R'])

Vectorized Strings
Vectorized string operations refer to performing operations on strings in a vectorized
manner, meaning that the operations are applied simultaneously to multiple strings rather than
individually. This approach leverages the power of optimized low-level operations provided by
modern programming languages and libraries. In many programming languages, vectorized
string operations are supported through libraries or modules that provide efficient string
handling functions.

For example, in Python, the NumPy and pandas libraries offer vectorized string operations. Here
are a few examples of vectorized string operations commonly used in programming languages:
• Concatenation: Joining multiple strings together. For instance, given two arrays of
strings, a vectorized operation can concatenate the corresponding strings in each array
to create a new array.
• Substring extraction: Extracting a portion of a string based on a specified start and end
index. Vectorized substring extraction can be performed on multiple strings
simultaneously.
• Case conversion: Changing the case of strings, such as converting all characters to
uppercase or lowercase. Vectorized operations can be used to apply case conversion to
multiple strings at once.
• String matching: Finding strings that match a specific pattern or regular expression.
Vectorized string matching allows searching for matches across multiple strings
efficiently.
• Replacement: Replacing substrings or patterns within strings. Vectorized replacement
operations can replace substrings across multiple strings simultaneously.

This vectorization of operations simplifies the syntax of operating on arrays of data


import numpy as np Output
x = np.array([2, 3, 5, 7, 11, 13]) [ 2 3 5 7 11 13]
print(x)
print(x*2) [ 4 6 10 14 22 26]

For arrays of strings, NumPy does not provide such simple access
data=['peter','Paul','MARY','gUIDO'] Output
print(data) ['peter','Paul','MARY','gUIDO']
data=['peter','Paul','MARY','gUIDO'] ['Peter', 'Paul', 'Mary', 'Guido']
[s.capitalize() for s in data]
This is perhaps sufficient to work with some data, but it will break if there are any missing values. For
example:
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]

AttributeError                            Traceback (most recent call last)
<ipython-input-...> in <cell line: 2>()
      1 data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
----> 2 [s.capitalize() for s in data]

AttributeError: 'NoneType' object has no attribute 'capitalize'

Pandas includes features to address both this need for vectorized string operations and for
correctly handling missing data via the str attribute of Pandas Series and Index objects
containing strings. So, for example, suppose we create a Pandas Series with this data:

import pandas as pd Output


names = pd.Series(data) 0 peter
print(names) 1 Paul
2 None
3 MARY
4 gUIDO

dtype: object
We can now call a single method that will capitalize all the entries, while skipping over any
missing values:
import pandas as pd Output
names = pd.Series(data) 0 Peter
names.str.capitalize() 1 Paul
2 None
3 Mary
4 Guido
dtype: object

Methods similar to Python string methods

Nearly all Python’s built-in string methods are mirrored by a Pandas vectorized string method.
Here is a list of Pandas str methods that mirror Python string methods:
len() lower() translate() islower()
ljust() upper() startswith() isupper()
rjust() find() endswith() isnumeric()
center() rfind() isalnum() isdecimal()
zfill() index() isalpha() split()
strip() rindex() isdigit() rsplit()
rstrip() capitalize() isspace() partition()
lstrip() swapcase() istitle() rpartition()

names.str.lower() 0 peter
1 paul
2 none
3 mary
4 guido
dtype: object
names.str.len() 0 5
1 4
2 4
3 4
4 5
dtype: int64
names.str.startswith('M') 0 False
1 False
2 False
3 True
4 False
dtype: bool
import pandas as pd 0 [peter, Charles]
data = ['peter Charles', 'Paul 1 [Paul, Roudridge]
Roudridge', 'MARY Siva', 'gUIDO'] 2 [MARY, Siva]
names = pd.Series(data) 3 [gUIDO]
dtype: object
names.str.split()


Methods using regular expressions


In addition, there are several methods that accept regular expressions to examine the content of
each string element, and follow some of the API conventions of Python’s built-in re module

Mapping between Pandas methods and functions in Python’s re module


Method Description
match() Call re.match() on each element, returning a Boolean.
extract() Call re.match() on each element, returning matched groups as strings.
findall() Call re.findall() on each element.
replace() Replace occurrences of pattern with some other string.
contains() Call re.search() on each element, returning a Boolean.
count() Count occurrences of pattern.
split() Equivalent to str.split(), but accepts regexps.
rsplit() Equivalent to str.rsplit(), but accepts regexps.

Extract The First Name


names.str.extract('([A-Za-z]+)') 0 peter
1 Paul
2 MARY
3 gUIDO
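The contains() and replace() methods from the table above are not shown elsewhere, so here is a small sketch applied to the same names Series (the patterns used are illustrative only):

# Boolean mask of names containing a lowercase vowel
names.str.contains('[aeiou]')
# 0     True
# 1     True
# 2     True
# 3    False

# Replace runs of whitespace with a single underscore
names.str.replace(r'\s+', '_', regex=True)
# 0     peter_Charles
# 1    Paul_Roudridge
# 2         MARY_Siva
# 3             gUIDO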

Miscellaneous methods
Finally, there are some miscellaneous methods that enable other convenient operations.

Other Pandas string methods


Method Description
get() Index each element
slice() Slice each element
slice_replace() Replace slice in each element with passed value
cat() Concatenate strings
repeat() Repeat values
normalize() Return Unicode form of string
pad() Add whitespace to left, right, or both sides of strings
wrap() Split long strings into lines with length less than a given width
join() Join strings in each element of the Series with passed separator
get_dummies() Extract dummy variables as a DataFrame

Vectorized item access and slicing. The get() and slice() operations enable vectorized element
access from each array. For example, we can get a slice of the first three characters of each array
using str.slice(0, 3).

df.str.slice(0, 3) is equivalent to df.str[0:3]:

names.str[0:3] 0 pet
1 Pau
2 MAR
3 gUI
dtype: object
names.str.split().str.get(-1) 0 Charles
1 Roudridge
2 Siva
3 gUIDO
dtype: object


PROFESSIONAL ELECTIVE COURSES: VERTICALS


VERTICAL 1: DATA SCIENCE
CCS346 EXPLORATORY DATA ANALYSIS
UNIT III - UNIVARIATE ANALYSIS
Introduction to Single variable: Distributions and Variables - Numerical Summaries of
Level and Spread - Scaling and Standardizing – Inequality.
Introduction
In statistics, there are three kinds of techniques that are used in the data
analysis. These are univariate analysis, bivariate analysis, and multivariate analysis.
Univariate analysis refers to the analysis of one variable. You can remember this
because the prefix “uni” means “one”.
Univariate analysis is a basic kind of analysis technique for statistical data. Here
the data contains just one variable and does not have to deal with a cause-and-effect
relationship.
The analysis of two variables and their relationship is termed bivariate analysis.
If three or more variables are considered simultaneously, it is multivariate analysis.
Example 1:
• Counting the number of boys and girls in the classroom

Example 2:
Consider the following household dataset:

Now perform univariate analysis on the variable Household Size:


Three common ways to perform univariate analysis:


• Summary Statistics
• Frequency Distributions
• Charts

Summary Statistics
There are two popular types of summary statistics:
• Measures of central tendency: these numbers describe where the centre of a
dataset is located.
o Examples include the mean and the median.
• Measures of Dispersion: These numbers describe how evenly distributed the
values are in a dataset.
o Examples are range, standard deviation, interquartile range, and variance.
▪ Range -the difference between the max value and min value in a
dataset
▪ Standard Deviation- an average measure of the spread
▪ Interquartile Range- the spread of the middle 50% of values

Frequency Distributions
• Frequency means how often something takes place. A frequency observation
gives the number of times an event occurs.
o A frequency distribution table can be used for both qualitative and quantitative
variables.

Charts
• Another way to perform univariate analysis is to create charts to visualize the
distribution of values for a certain variable.
• Examples include Boxplots, Histograms, Density Curves, Pie Charts

Bar chart
• The bar chart is represented in the form of rectangular bars. The graph will
compare various categories.
• The graph could be plotted vertically or horizontally. The horizontal or the x-axis
will represent the category and the vertical y-axis represents the category’s
value. The bar graph looks at the data set and makes comparisons.

Histogram
• The histogram is similar to a bar chart, but it analyses counts of numerical data. The bar
graph counts against categories, while the histogram groups the values
into bins. A bin shows the number of data points that fall within a particular range
or interval.

Frequency Polygon


• The frequency polygon is similar to the histogram. However, these can be used to
compare the data sets or in order to display the cumulative frequency
distribution. The frequency polygon will be represented as a line graph.

Pie Chart
• The pie chart displays the data in a circular format. The graph is divided into
pieces where each piece is proportional to the fraction of the complete category.
So each slice of the pie is proportional to the size of its category. The entire pie
is 100 percent, and when you add up all of the pie slices they should also add
up to 100.

Example 3:
Performing Univariate analysis using the Household Size variable from our dataset
mentioned earlier:

Summary Statistics
Measures of central tendency
• Mean (the average value): 3.8
• Median (the middle value): 4
Measures of Dispersion:
• Range (the difference between the max and min): 6
• Interquartile Range (the spread of the middle 50% of values): 2.5
• Standard Deviation (an average measure of spread): 1.87
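These summaries can be computed directly with pandas. The sketch below uses a hypothetical list of household sizes, since the full dataset from the earlier example is not reproduced here, so the printed numbers will differ from the values quoted above:

import pandas as pd

# Hypothetical household sizes (illustrative only)
household_size = pd.Series([1, 2, 3, 4, 4, 4, 5, 6, 4, 5])

print(household_size.mean())                                            # mean
print(household_size.median())                                          # median
print(household_size.max() - household_size.min())                      # range
print(household_size.quantile(0.75) - household_size.quantile(0.25))    # interquartile range
print(household_size.std())                                             # standard deviation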

Frequency Distributions
We can also create the following frequency distribution table to summarize how
often different values occur:


• Most frequent household size = 4.

Charts
We can create the following charts to help us visualize the distribution of values for
Household Size:
Boxplot
• A boxplot is a plot that shows the five-number summary of a dataset. The five-
number summary includes:
o Minimum value
o First quartile
o Median value
o Third quartile
o Maximum value
Here’s what a boxplot would look like for the variable Household Size:

Histogram
• A histogram is a type of chart that uses vertical bars to display frequencies. This
type of chart is a useful way to visualize the distribution of values in a dataset.


Density Curve
• A density curve is a curve on a graph that represents the distribution of values in
a dataset. It’s particularly useful for visualizing the “shape” of a distribution,
including whether or not a distribution has one or more “peaks” of frequently
occurring values and whether or not the distribution is skewed to the left or the
right.

Pie Chart
• A pie chart is a type of chart that is shaped like a circle and uses slices to
represent proportions of a whole.

Depending on the type of data, one of these charts may be more useful for visualizing
the distribution of values than the others.
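As a rough sketch of how these charts could be drawn in Python, assuming matplotlib is installed and reusing the hypothetical household_size Series from the earlier sketch:

import matplotlib.pyplot as plt

household_size.plot(kind='box')                  # boxplot
plt.show()

household_size.plot(kind='hist')                 # histogram
plt.show()

household_size.plot(kind='density')              # density curve (requires scipy)
plt.show()

household_size.value_counts().plot(kind='pie')   # pie chart of the frequency distribution
plt.show()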


Numerical Summaries
In data exploration, numerical summaries are essential for understanding the
level (central tendency) and spread (variability) of a variable. These summaries provide
a concise description of the distribution of data points and help in identifying patterns,
outliers, and potential relationships with other variables. Here are some common
numerical summaries used in data exploration:
• Measures of Central Tendency
o Mean: The arithmetic average of all the data points in a variable. It is calculated
by summing all the values and dividing by the total number of observations.
o Median: The middle value in an ordered list of data points. It divides the data
into two equal halves, with 50% of the observations above and 50% below it.
o Mode: The most frequently occurring value(s) in a variable. It represents the
peak(s) of the distribution.
• Measures of Variability
o Range: The difference between the maximum and minimum values in a
variable. It provides an idea of the spread of the data but is sensitive to outliers.
o Standard Deviation: A measure of how much the data points deviate from the
mean. It quantifies the average distance between each data point and the mean.
o Variance: The average squared deviation from the mean. It measures the
variability of data points around the mean.
• Additional Measures
o Quartiles: Values that divide the data into four equal parts. The first quartile
(Q1) represents the 25th percentile, the median represents the 50th
percentile, and the third quartile (Q3) represents the 75th percentile.
o Interquartile Range (IQR): The range between the first and third quartiles. It
provides a measure of the spread of the central half of the data and is less
affected by extreme values.
o Skewness: A measure of the asymmetry of the distribution. Positive skewness
indicates a longer tail on the right side, while negative skewness indicates a
longer tail on the left side.
o Kurtosis: A measure of the peakedness or flatness of the distribution. It
compares the tails and central peak to a normal distribution. Positive kurtosis
indicates a more peaked distribution, while negative kurtosis indicates a
flatter distribution.
These numerical summaries provide insights into the characteristics of a variable,
allowing for a better understanding of its distribution and variability. They serve as the
foundation for further analysis and can guide decision-making processes in data
exploration.
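Most of these summaries are available as pandas Series methods. In particular, skewness and kurtosis, which are described above but not computed elsewhere in these notes, can be obtained with skew() and kurt(). A minimal sketch with made-up values:

import pandas as pd

s = pd.Series([2, 3, 3, 4, 4, 4, 5, 5, 9])   # illustrative values with a longer right tail

print(s.skew())   # positive value indicates right (positive) skew
print(s.kurt())   # excess kurtosis relative to a normal distribution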


Measures of Central Tendency


Mode
• The mode is the most commonly occurring value in a distribution.

Example 1
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.

The most commonly occurring value is 54, therefore the mode of this distribution is 54
years.

Mean
• The mean is the sum of the value of each observation in a dataset divided by the
number of observations. This is also known as the arithmetic average.


Looking at the retirement age distribution again:


54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
• The mean is calculated by adding together all the values
(54+54+54+55+56+57+57+58+58+60+60 = 623) and dividing by the number of
observations (11) which equals 56.6 years.

Mean = (54 + 54 + 54 + 55 + 56 + 57 + 57 + 58 + 58 + 60 + 60) / 11 = 56.6 years

Median
• The median is the middle value in distribution when the values are arranged in
ascending or descending order.
• The median divides the distribution in half (there are 50% of observations on
either side of the median value). In a distribution with an odd number of
observations, the median value is the middle value.

For an odd number of observations:

Median = 19

For an even number of observations:

Median is the average of the middle two values


Median = (19 + 26) / 2 = 22.5

Looking at the retirement age distribution (which has 11 observations), the median is
the middle value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
When the distribution has an even number of observations, the median value is the
mean of the two middle values. In the following distribution, the two middle values are
56 and 57, therefore the median equals 56.5 years.
52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
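These three measures can be checked quickly in pandas using the eleven retirement-age values listed earlier:

import pandas as pd

ages = pd.Series([54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60])

print(ages.mode()[0])          # 54
print(round(ages.mean(), 1))   # 56.6
print(ages.median())           # 57.0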


Important
• The mean is strongly affected by (not resistant to) outliers and skewness,
whereas the median is not affected by (resistant to) outliers and skewness.
o Outliers – mean is pulled towards the outliers
o Skewness – mean is pulled towards the longer tail.
▪ Symmetric : Mean = Median = Mode
▪ Left-skewed(Negatively skewed): Mode > Median > Mean
▪ Right-skewed(positively skewed): Mode < Median < Mean

Skewed distributions
• When a distribution is skewed, the mode remains the most commonly occurring
value, the median remains the middle value in the distribution, but the mean is
generally ‘pulled’ in the direction of the tails. In a skewed distribution, the
median is often a preferred measure of central tendency, as the mean is not
usually in the middle of the distribution.
Positively or Right-skewed
• A distribution is said to be positively or right skewed when the tail on the right
side of the distribution is longer than the left side. In a positively skewed
distribution, the mean tends to be ‘pulled’ toward the right tail of the distribution.

• The following graph shows a larger retirement age data set with a distribution
which is right skewed. The data has been grouped into classes, as the variable
being measured (retirement age) is continuous. The mode is 54 years, the modal
class is 54-56 years, the median is 56 years, and the mean is 57.2 years.


Negatively or left-skewed
• A distribution is said to be negatively or left skewed when the tail on the left side
of the distribution is longer than the right side. In a negatively skewed
distribution, the mean tends to be ‘pulled’ toward the left tail of the distribution.
• The following graph shows a larger retirement age dataset with a distribution
which left skewed. The mode is 65 years, the modal class is 63-65 years, the
median is 63 years and the mean is 61.8 years.
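
To see this 'pulling' effect numerically, the sketch below uses a small hypothetical
right-skewed dataset (the value 20 forms the long right tail); the mean is dragged well
above the median and mode:

import statistics

# Hypothetical right-skewed data: one large value forms the long right tail
data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]

print(statistics.mode(data))    # 3
print(statistics.median(data))  # 3.0
print(statistics.mean(data))    # 4.7 -> pulled toward the long right tail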

Relation among mean, mode and median


Example:
Consider two datasets, A and B, that have the same measures of central tendency but
different spreads.

Analysis:
• The Mode (most frequent value), Median (middle value) and Mean (arithmetic
average) of both datasets is 6.
• If we just look at the measures of central tendency, we assume that the datasets
are the same. However, if we look at the spread of the values in the following
graph, we can see that Dataset B is more dispersed than Dataset A.
o The measures of central tendency and measures of spread help us to better
understand the data.

• Range
Range = difference between the largest value and the smallest value in a dataset.
Dataset A: Range = 4 [High value (8) − Low value (4)]

Dataset B: Range = 10 [difference between its highest and lowest values]

• Number line view of datasets

Range of values for Dataset B is larger than Dataset A.

• Quartiles


Quartiles divide an ordered dataset into four equal parts, and refer to the values
of the point between the quarters. A dataset may also be divided into quintiles
(five equal parts) or deciles (ten equal parts)

o 25th percentile
Lower quartile (Q1) is the point between the lowest 25% of values and
the highest 75% of values.

o 50th percentile
Second quartile (Q2) is the middle of the data set. It is also called the
median.

o 75th percentile
Upper quartile (Q3) is the point between the lowest 75% and highest
25% of values.

o Calculating quartiles

o For Dataset A, each quartile point falls between two values, so the mean
(average) of those values is the quartile value:
▪ Q1 = (5+5) / 2 = 5
▪ Q2 = (6+6) / 2 = 6
▪ Q3 = (7+7) / 2 = 7

o For Dataset B, the quartile points again fall between two values:
▪ Q1 = (3+4) / 2 = 3.5
▪ Q2 = (6+6) / 2 = 6
▪ Q3 = (8+9) / 2 = 8.5

• Interquartile range (IQR)


The interquartile range (IQR) is the difference between the upper (Q3) and lower
(Q1) quartiles. IQR is often seen as a better measure of spread than the range as
it is not affected by outliers.

The interquartile range for Dataset A is = 2


Interquartile range = Q3 - Q1 => 7 – 5 => 2
The interquartile range for Dataset B is = 5
Interquartile range = Q3 - Q1 => 8.5 - 3.5 => 5

Example 2:
Data is arranged in the ordered array as follows: 11, 12, 13, 16, 16, 17, 18, 21, 22.
Solution:
Number of items = 9
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
 Q1 = (12+13)/2 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data,
 Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data,
 Q3 = (18+21)/2 = 19.5
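
The (n + 1)/4 positions used above can also be computed in Python. The helper below is a
sketch of this textbook rule; note that library defaults such as numpy.percentile use a
slightly different interpolation and may return slightly different quartile values:

def quartile(sorted_data, q):
    # Position based on the (n + 1) rule; q = 0.25, 0.5 or 0.75
    pos = (len(sorted_data) + 1) * q
    lower = int(pos)          # 1-based index of the value just below the position
    frac = pos - lower
    if frac == 0:
        return sorted_data[lower - 1]
    # Interpolate between the two neighbouring values
    return sorted_data[lower - 1] + frac * (sorted_data[lower] - sorted_data[lower - 1])

data = sorted([11, 12, 13, 16, 16, 17, 18, 21, 22])
print(quartile(data, 0.25))  # 12.5
print(quartile(data, 0.50))  # 16
print(quartile(data, 0.75))  # 19.5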

Five Number Summary


The five number summary consists of the minimum, lower quartile (Q1), median (Q2),
upper quartile (Q3) and the maximum. The minimum is the smallest number, the
maximum is the largest, the median is in the middle, Q1 is the median of the first half of
the data and Q3 is the median of the second half.

The five numbers that help describe the center, spread and shape of data are
• Xsmallest
• First Quartile (Q1)
• Median (Q2)
• Third Quartile (Q3)
• Xlargest
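
A minimal NumPy sketch of the five number summary (np.percentile's default linear
interpolation is assumed here, which can differ slightly from the (n + 1)/4 hand method):

import numpy as np

data = np.array([11, 12, 13, 16, 16, 17, 18, 21, 22])

print("Xsmallest:", data.min())            # 11
print("Q1:", np.percentile(data, 25))      # 13.0 (default linear interpolation)
print("Median:", np.percentile(data, 50))  # 16.0
print("Q3:", np.percentile(data, 75))      # 18.0
print("Xlargest:", data.max())             # 22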


Variance and standard deviation


The variance and the standard deviation are measures of the spread of the data around
the mean.
• Datasets with a small spread, all values are very close to the mean, resulting in a
small variance and standard deviation.
• Datasets with a larger spread, values are spread further away from the mean,
leading to a larger variance and standard deviation.

Population Variance Formula

σ² = Σ (Xᵢ − Μ)² / N

where:
• 𝑋𝑖 → Refers the 𝑖 𝑡ℎ unit, starting from the first observation to the last
• 𝛭→Population mean
• 𝑁 →Number of units in the population

Sample Variance Formula

s² = Σ (xᵢ − x̄)² / (n − 1)

where:
• 𝑥𝑖 →Refers the 𝑖 𝑡ℎ unit, starting from the first observation to the last
• 𝑥̅ → Sample mean
• 𝑛→ Number of units in the sample

Example 1
Find the variance and standard deviation of the following scores on an exam:
92, 95, 85, 80, 75, 50
Solution
Step 1: Mean of the data
Mean = (92 + 95 + 85 + 80 + 75 + 50) / 6 = 79.5

Step 2: Find the difference between each score and the mean (deviation).


Score    Score − Mean    Difference from mean    Squared difference
92       92 − 79.5       12.5                    (12.5)²
95       95 − 79.5       15.5                    (15.5)²
85       85 − 79.5       5.5                     (5.5)²
80       80 − 79.5       0.5                     (0.5)²
75       75 − 79.5       −4.5                    (−4.5)²
50       50 − 79.5       −29.5                   (−29.5)²
Mean = 79.5                          Sum of squares = 1317.50

Next we square each of these differences and then sum them.


Variance = 1317.50 / (6 − 1) = 1317.50 / 5 = 263.5
Standard Deviation = √263.5 ≈ 16.2

Example 2
Find the standard deviation of the average temperatures recorded over a five-day
period last winter:
18, 22, 19, 25, 12
Temp    Temp − Mean    Difference from mean    Squared difference
18      18 − 19.2      −1.2                    1.44
22      22 − 19.2      2.8                     7.84
19      19 − 19.2      −0.2                    0.04
25      25 − 19.2      5.8                     33.64
12      12 − 19.2      −7.2                    51.84
Mean = 96 / 5 = 19.2                Sum of squares = 94.80

Variance = 94.8 / (5 − 1) = 94.8 / 4 = 23.7
Standard Deviation = √23.7 ≈ 4.9
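
Both worked examples use the sample formulas, i.e. division by n − 1, which corresponds to
ddof=1 in NumPy. A minimal sketch using the temperature data above:

import numpy as np

temps = [18, 22, 19, 25, 12]

print(np.mean(temps))         # 19.2
print(np.var(temps, ddof=1))  # 23.7  (sample variance, divides by n - 1)
print(np.std(temps, ddof=1))  # ≈ 4.87 (sample standard deviation)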


Scaling and Standardization


Scaling and standardization are techniques used in data exploration to
preprocess and transform numerical variables to a common scale. These techniques are
particularly useful when working with variables that have different units of
measurement or different ranges.

Scaling
Scaling refers to the process of transforming variables to have a similar scale. It
helps to ensure that all variables are on a comparable magnitude, preventing certain
variables from dominating the analysis due to their larger values. Common scaling
methods include:

• Min-Max Scaling (Normalization): This technique rescales the data to a fixed


range, typically between 0 and 1. It is achieved by subtracting the minimum
value of the variable and dividing it by the range (maximum minus minimum
value).


• Z-Score Standardization: Z-score standardization transforms the data to have a


mean of 0 and a standard deviation of 1. It is calculated by subtracting the mean
of the variable and dividing it by the standard deviation.

Example 1
Let's calculate the Min-Max scaling for a dataset step by step. Consider the following
dataset: [12, 15, 18, 20, 25].

Step 1: Identify the minimum and maximum values in the dataset.


min_val = 12
max_val = 25

Step 2: Calculate the range of the dataset.


Range = max_val - min_val
= 25 - 12
= 13

Step 3: Subtract the minimum value from each data point.


12 - min_val = 12 - 12 =0
15 - min_val = 15 – 12 =3
18 - min_val = 18 – 12 =6
20 - min_val = 20 – 12 =8
25 - min_val = 25 - 12 = 13

Step 4: Divide each result by the range


0 / range = 0 / 13 ≈ 0.000
3 / range = 3 / 13 ≈ 0.231
6 / range = 6 / 13 ≈ 0.462
8 / range = 8 / 13 ≈ 0.615
13 / range = 13 / 13 = 1.000

Resulting Min-Max scaled dataset is approximately [0.000, 0.231, 0.462, 0.615, 1.000].
Each value represents the scaled value for the corresponding data point in the original
dataset, where 0 corresponds to the minimum value and 1 corresponds to the maximum
value.


Example 2
Let's say we have a variable representing the income of individuals in a dataset. The
original income values range from $20,000 to $100,000. We want to scale these values
to a range between 0 and 1 using Min-Max Scaling.
Solution
Calculate the minimum and maximum values of the income variable:
min(x) = $20,000
max(x) = $100,000
Choose a data point, for example, $40,000, and apply the Min-Max Scaling formula:
x_scaled = (40000 − 20000) / (100000 − 20000) = 20000 / 80000 = 0.25
Therefore, $40,000 would be scaled to 0.25.
Repeat the scaling process for all other data points. The resulting scaled values will fall
within the range of 0 to 1, with the minimum value transformed to 0 and the maximum
value transformed to 1. In image processing, pixel values (which originally range from 0 to
255) are commonly normalized in the same way so that they fall between 0 and 1.
Analysis
Employee Number    Age    Salary
Emp1               44     73000
Emp2               27     47000
Emp3               30     53000
Emp4               38     62000
Emp5               40     57000
Emp6               35     53000
Emp7               48     78000

Age range: 27 – 48        Salary range: 47000 – 78000

In machine learning everything is measured in terms of numbers, and we often want to
identify nearest neighbours or to measure the similarity or dissimilarity of features.
Euclidean distance is commonly used as a similarity or dissimilarity metric between data
points. Working with salary expressed in thousands:

Distance between Emp2 and Emp1 = √((27 − 44)² + (47 − 73)²) = 31.06


Distance between Emp2 and Emp3 = √((30 − 27)² + (53 − 47)²) = 6.70

Normalization process:

Distance between Emp2 and Emp1 = √((0 − 0.80)² + (0 − 0.83)²) = 1.15
Distance between Emp2 and Emp3 = √((0.14 − 0)² + (0.19 − 0)²) = 0.23

After normalization the comparison is more significant, because both features contribute
on the same 0–1 scale.

Standardization process
Standardization is also known as z-score normalization. In standardization,
features are scaled to have zero mean and unit standard deviation; after
standardization each feature has mean = 0 and standard deviation = 1.

Distance between Emp2 and Emp1 = √((−1.51 − 0.95)² + (−1.27 − 1.19)²) = 3.47
Distance between Emp2 and Emp3 = √((−1.07 + 1.51)² + (−0.70 + 1.27)²) = 0.71

Again, the comparison is more significant because both features are on a comparable scale.
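
The effect of scaling on these distance comparisons can be reproduced with the short
sketch below. It assumes scikit-learn's MinMaxScaler and keeps salary in thousands,
matching the hand calculations; the scaled results differ slightly from 1.15 and 0.23
because those used rounded values:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# [age, salary in thousands] for Emp1..Emp7, taken from the table above
emp = np.array([[44, 73], [27, 47], [30, 53], [38, 62],
                [40, 57], [35, 53], [48, 78]], dtype=float)

def dist(a, b):
    # Euclidean distance between two feature vectors
    return np.sqrt(np.sum((a - b) ** 2))

# Raw features: the salary difference dominates the distance
print(dist(emp[1], emp[0]))   # ≈ 31.06 (Emp2 vs Emp1)
print(dist(emp[1], emp[2]))   # ≈ 6.71  (Emp2 vs Emp3)

# After min-max scaling both features contribute on the same [0, 1] scale
scaled = MinMaxScaler().fit_transform(emp)
print(dist(scaled[1], scaled[0]))   # ≈ 1.17
print(dist(scaled[1], scaled[2]))   # ≈ 0.24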


Example 1 - Scaling a Pandas DataFrame using min-max scaling

import pandas as pd
from mlxtend.preprocessing import minmax_scaling

s1 = pd.Series([1, 2, 3, 4, 5, 6], index=(range(6)))
s2 = pd.Series([10, 9, 8, 7, 6, 5], index=(range(6)))
df = pd.DataFrame(s1, columns=['s1'])
df['s2'] = s2
print(df)
print("Scaled DataFrame")
minmax_scaling(df, columns=['s1', 's2'])

Output
   s1  s2
0   1  10
1   2   9
2   3   8
3   4   7
4   5   6
5   6   5
Scaled DataFrame
    s1   s2
0  0.0  1.0
1  0.2  0.8
2  0.4  0.6
3  0.6  0.4
4  0.8  0.2
5  1.0  0.0
Example 2 - Scaling arrays using min-max scaling

import numpy as np
from mlxtend.preprocessing import minmax_scaling

X = np.array([[1, 10], [2, 9], [3, 8],
              [4, 7], [5, 6], [6, 5]])
print(X)
minmax_scaling(X, columns=[0, 1])

Output
[[ 1 10]
 [ 2  9]
 [ 3  8]
 [ 4  7]
 [ 5  6]
 [ 6  5]]

array([[0. , 1. ],
       [0.2, 0.8],
       [0.4, 0.6],
       [0.6, 0.4],
       [0.8, 0.2],
       [1. , 0. ]])
Example 3 - Scaling students marks using min-max scaling

import numpy as np

# Sample student dataset (marks in 3 subjects)
student_data = [
    [80, 70, 90],
    [60, 50, 70],
    [90, 85, 95],
    [75, 80, 70],
    [95, 60, 80]
]

# Find the minimum and maximum values for each subject
min_values = np.min(student_data, axis=0)
max_values = np.max(student_data, axis=0)

# Perform Min-Max Scaling on each subject's marks
scaled_data = (student_data - min_values) / (max_values - min_values)

# Create x-axis labels for subjects
subjects = ['Subject 1', 'Subject 2', 'Subject 3']

# Print the scaled data
for scaled_marks in scaled_data:
    print(scaled_marks)

Output
[0.57142857 0.57142857 0.8       ]
[0. 0. 0.]
[0.85714286 1.         1.        ]
[0.42857143 0.85714286 0.        ]
[1.         0.28571429 0.4       ]
Example 4 - Scaling Sonar dataset using min-max scaling
Apart from supporting library functions other functions used are:
• fit(data) -- computes the statistics needed for scaling (mean and standard deviation for
StandardScaler, minimum and maximum for MinMaxScaler) so that they can be used for scaling.
• transform(data) -- performs the scaling using the values calculated by .fit().
• fit_transform(data) -- does both fit and transform.
# visualize a minmax scaler transform of the sonar dataset
from pandas import read_csv
from pandas import DataFrame
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot
# load dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv"
dataset = read_csv(url, header=None)
# retrieve just the numeric input values
data = dataset.values[:, :-1]
# perform a robust scaler transform of the dataset
trans = MinMaxScaler()
data = trans.fit_transform(data)
# convert the array back to a dataframe
dataset = DataFrame(data)
# summarize
print(dataset.describe())
# histograms of the variables
dataset.hist()
pyplot.show()

Output

0 1 2 3 4 5 \
count 208.000000 208.000000 208.000000 208.000000 208.000000 208.000000
mean 0.204011 0.162180 0.139068 0.114342 0.173732 0.253615
std 0.169550 0.141277 0.126242 0.110623 0.140888 0.158843
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.087389 0.067938 0.057326 0.044163 0.079508 0.152714
50% 0.157080 0.129447 0.107753 0.090942 0.141517 0.220236
75% 0.251106 0.202958 0.185447 0.139563 0.237319 0.333042
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

6 7 8 9 ... 50 \
count 208.000000 208.000000 208.000000 208.000000 ... 208.000000
mean 0.320472 0.285114 0.252485 0.281652 ... 0.160047
std 0.167175 0.187767 0.175311 0.192215 ... 0.119607
min 0.000000 0.000000 0.000000 0.000000 ... 0.000000
25% 0.209957 0.165215 0.132571 0.142964 ... 0.083914
50% 0.280438 0.235061 0.214349 0.244673 ... 0.138446
75% 0.407738 0.361852 0.334555 0.368082 ... 0.207420
max 1.000000 1.000000 1.000000 1.000000 ... 1.000000


51 52 53 54 55 56 \
count 208.000000 208.000000 208.000000 208.000000 208.000000 208.000000
mean 0.180031 0.265172 0.290669 0.197061 0.200555 0.213642
std 0.137432 0.183385 0.213474 0.160717 0.147080 0.164361
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 0.092368 0.118831 0.127924 0.080499 0.102564 0.096591
50% 0.151213 0.235065 0.242690 0.156463 0.165385 0.160511
75% 0.227175 0.374026 0.394737 0.260771 0.260897 0.287642
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000

57 58 59
count 208.000000 208.000000 208.000000
mean 0.175035 0.216015 0.136425
std 0.148051 0.170286 0.116190
min 0.000000 0.000000 0.000000
25% 0.075515 0.098485 0.057737
50% 0.125858 0.173554 0.108545
75% 0.229977 0.281680 0.183025
max 1.000000 1.000000 1.000000

Example for Z-score Normalization


Example 1
Let's calculate the Z-scores for each data point in a dataset. Consider the following
dataset: [75, 80, 85, 90, 95].
Solution
We will calculate the Z-score for each data point using the formula:
Z = (x - μ) / σ,
where x is the data point, μ is the mean, and σ is the standard deviation
Step 1: Calculate the mean (μ) of the dataset.
mean = (75 + 80 + 85 + 90 + 95) / 5 = 85
Step 2: Calculate the standard deviation (σ) of the dataset.
std_dev = sqrt(((75-85)^2 + (80-85)^2 + (85-85)^2 + (90-85)^2 + (95-85)^2) / 5)
= sqrt((100 + 25 + 0 + 25 + 100) / 5)
= sqrt(250 / 5)
= sqrt(50)
= 7.071

Now, let's calculate the Z-score for each data point:

For data point 75:  Z = (75 − 85) / 7.071 = −1.414
For data point 80:  Z = (80 − 85) / 7.071 = −0.707
For data point 85:  Z = (85 − 85) / 7.071 = 0
For data point 90:  Z = (90 − 85) / 7.071 = 0.707
For data point 95:  Z = (95 − 85) / 7.071 = 1.414
Answer
The Z-scores for the dataset [75, 80, 85, 90, 95] are approximately
[-1.414, -0.707, 0, 0.707, 1.414].

Example 2
John is a researcher for the infant clothes company BABYCLOTHES Inc., which has hired
him to assist in determining sizes for its baby outfits. The business wants to launch a new
line of extra-small baby garments, but it is unsure of what size to produce. To estimate the
potential size of the market for garments designed for infants weighing less than 6 pounds,
the researchers want to know how many premature babies are born weighing less than 6
pounds.

John can compute the Z-score, which measures the deviation from the mean, given that the
mean weight of a premature newborn infant is 5 pounds and the standard deviation is 1.25
pounds.
Answer
Z = (x - μ) / σ,
where x is the data point, μ is the mean, and σ is the standard deviation
 Z –score = (6 - 5) / 1.25 = 0.80

Simple Python Program Z-score Normalization


import numpy as np

# Example dataset


data = np.array([10, 5, 7, 12, 8])

# Calculate mean and standard deviation


mean = np.mean(data)
std_dev = np.std(data)

# Z-score normalization
normalized_data = (data - mean) / std_dev

print(normalized_data)

Output
[ 0.66208471 -1.40693001 -0.57932412 1.4896906 -0.16552118]
# Calculate the z-score from with scipy
import scipy.stats as stats
values = [10, 5, 7, 12, 8]

zscores = stats.zscore(values)
print(zscores)

Output
[ 0.66208471 -1.40693001 -0.57932412 1.4896906 -0.16552118]

# stats.zscore() method
import numpy as np
from scipy import stats

arr1 = [[20, 2, 7, 1, 34],
        [50, 12, 12, 34, 4]]

print ("\narr1 : ", arr1)


print ("\nZ-score for arr1 : \n", stats.zscore(arr1))
print ("\nZ-score for arr1 : \n", stats.zscore(arr1, axis= 1))

Output
arr1 : [[20, 2, 7, 1, 34], [50, 12, 12, 34, 4]]

Z-score for arr1 :


[[-1. -1. -1. -1. 1.]
[ 1. 1. 1. 1. -1.]]

Z-score for arr1 :


[[ 0.57251144 -0.85876716 -0.46118977 -0.93828264 1.68572813]
[ 1.62005758 -0.61045648 -0.61045648 0.68089376 -1.08003838]]
Z-scores calculation using Pandas Dataframe
# Loading a Sample Pandas Dataframe
import pandas as pd
from scipy import stats

df = pd.DataFrame.from_dict({
    'Name': ['Nik', 'Kate', 'Joe', 'Mitch', 'Alana'],
    'Age': [32, 30, 67, 34, 20],
    'Income': [80000, 90000, 45000, 23000, 12000],
    'Education': [5, 7, 3, 4, 4]
})

print(df.head())

df['Income zscore'] = stats.zscore(df['Income'])


print(df.head())

Output
Name Age Income Education
0 Nik 32 80000 5
1 Kate 30 90000 7
2 Joe 67 45000 3
3 Mitch 34 23000 4
4 Alana 20 12000 4

Name Age Income Education Income zscore


0 Nik 32 80000 5 0.978700
1 Kate 30 90000 7 1.304934
2 Joe 67 45000 3 -0.163117
3 Mitch 34 23000 4 -0.880830
4 Alana 20 12000 4 -1.239687

MinMax Scaler vs Standard Scaler

# import module
from sklearn.preprocessing import MinMaxScaler

# create data
data = [[11, 2], [3, 7], [0, 10], [11, 8]]

# compute required values
scaler = MinMaxScaler()
model = scaler.fit(data)
scaled_data = model.transform(data)

# print scaled data
print(scaled_data)

Output
[[1.         0.        ]
 [0.27272727 0.625     ]
 [0.         1.        ]
 [1.         0.75      ]]

# import module
from sklearn.preprocessing import StandardScaler

# create data
data = [[11, 2], [3, 7], [0, 10], [11, 8]]

# compute required values
scaler = StandardScaler()
model = scaler.fit(data)
scaled_data = model.transform(data)

# print scaled data
print(scaled_data)

Output
[[ 0.97596444 -1.61155897]
 [-0.66776515  0.08481889]
 [-1.28416374  1.10264561]
 [ 0.97596444  0.42409446]]

# The same standardized values can be obtained with scipy
import scipy.stats as stats

data = [[11, 2], [3, 7], [0, 10], [11, 8]]
zscores = stats.zscore(data)
print(zscores)

Output
[[ 0.97596444 -1.61155897]
 [-0.66776515  0.08481889]
 [-1.28416374  1.10264561]
 [ 0.97596444  0.42409446]]

Adding Constant
Scaling refers to the process of transforming the values of a dataset to fit within a
specific range. It is commonly used when the features of the dataset have different
scales. Two common scaling techniques are adding or subtracting a constant and
multiplying or dividing by a constant.

Python program for adding a constant in Scaling


import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

# Original data
original_data = np.array([1, 2, 3, 4, 5])

# Scaling using Min-Max scaling


scaler = MinMaxScaler(feature_range=(0, 1))
scaled_data = scaler.fit_transform(original_data.reshape(-1,
1)).flatten()

# Scaling by adding a constant


constant = 10
scaled_data_with_constant = scaled_data + constant

# Plotting the data


plt.figure(figsize=(8, 6))
plt.plot(original_data, label='Original Data')
plt.plot(scaled_data, label='Scaled Data')
plt.plot(scaled_data_with_constant, label='Scaled Data with
Constant')
plt.legend()


plt.xlabel('Index')
plt.ylabel('Value')
plt.title('Comparison of Data Scaling')
plt.grid(True)
plt.show()

Gaussian distribution
In real life, many datasets can be modeled by Gaussian Distribution (Univariate
or Multivariate). So it is quite natural and intuitive to assume that the clusters come
from different Gaussian Distributions. Or in other words, it tried to model the dataset as
a mixture of several Gaussian Distributions.
A Gaussian distribution, also known as a normal distribution or bell curve, is a
probability distribution that is symmetric and characterized by its mean and standard
deviation. It is named after the German mathematician, Carl Friedrich Gauss. It is one of
the most commonly encountered distributions in statistics and probability theory. Some
common example datasets that follow Gaussian distribution are:
o Body temperature
o People’s Heights
o Car mileage
o IQ scores
The Gaussian distribution has the following properties:
• Symmetry: The distribution is symmetric around its mean, which means that the
probability density function (PDF) is symmetric.
• Mean: The mean (μ) represents the central value or the average of the
distribution.


• Standard Deviation: The standard deviation (σ) represents the spread or


dispersion of the distribution. It determines how much the values deviate from
the mean.
• Shape: The Gaussian distribution has a characteristic bell-shaped curve, with the
highest point at the mean and the tails extending towards infinity.

The probability density function is

f(x) = (1 / (σ √(2π))) · exp(−(x − μ)² / (2σ²))

where:
o f(x) is the probability density function at a specific value x.
o μ is the mean of the distribution.
o σ is the standard deviation of the distribution.
o π is a mathematical constant (approximately 3.14159).
o exp() is the exponential function.

Simple python program to explain Gaussian distribution


import numpy as np
import scipy as sp
from scipy import stats
import matplotlib.pyplot as plt

## generate the data and plot it for an ideal normal curve

## x-axis for the plot


x_data = np.arange(-5, 5, 0.001)

## y-axis as the gaussian


y_data = stats.norm.pdf(x_data, 0, 1)

## plot data
plt.plot(x_data, y_data,'green')
plt.show()

Output


Simple python program to explain Histogram of Gaussian distribution


import numpy as np
import matplotlib.pyplot as plt

# Generate random numbers from a Gaussian distribution


mean = 0 # Mean of the distribution
std_dev = 1 # Standard deviation of the distribution
sample_size = 1000 # Number of samples to generate

samples = np.random.normal(mean, std_dev, sample_size)

# Plot a histogram of the generated samples


plt.hist(samples, bins=30, density=True, alpha=0.7)
plt.xlabel('Value')
plt.ylabel('Probability Density')
plt.title('Histogram of Gaussian Distribution')
plt.grid(True)
plt.show()

Output


Inequality
Income inequality using quantiles and quantile-shares
Inequality analysis involves examining the distribution of a variable and exploring the
disparities or differences in values across different groups or segments of the data.
Quantiles and quantile shares are useful measures for analyzing inequality. Here's how
they can be used:
• Quantiles:
Quantiles divide a dataset into equal-sized groups, representing specific
percentiles of the data. For example, the median represents the 50th percentile,
dividing the data into two equal halves. Other quantiles, such as quartiles (25th,
50th, and 75th percentiles), quintiles (20th, 40th, 60th, and 80th percentiles), or
deciles (10th, 20th, ..., 90th percentiles), divide the data into smaller segments.

• Quantile Shares:
Quantile shares represent the cumulative proportion of a variable's distribution
held by each quantile group. They help us in understanding the concentration or
dispersion of values across different parts of the distribution. For example, if the
top 10% of earners in a population hold 50% of the total income, it indicates a
higher level of income concentration.

To illustrate these concepts, let's consider an example of analyzing income inequality


using quantiles and quantile shares:
Simple python program to explain Income inequality using quantiles and
quantile-shares
import pandas as pd
import matplotlib.pyplot as plt

# Example dataset with income values


data = pd.DataFrame({'Income': [25000, 30000, 35000, 40000,
45000, 50000, 60000, 70000, 80000, 100000]})

# Calculate quantiles and quantile shares


quantiles = [0.25, 0.5, 0.75]
quantile_values = data['Income'].quantile(quantiles)
quantile_shares = data['Income'].quantile(quantiles) /
data['Income'].sum()

# Print quantiles and quantile shares


print("Quantiles:")
print(quantile_values)
print("\nQuantile Shares:")
print(quantile_shares)


# Plotting quantile values


plt.figure(figsize=(8, 6))
plt.plot(quantiles, quantile_values, marker='o', linestyle='-
', color='red')
plt.xlabel('Quantiles')
plt.ylabel('Income')
plt.title('Income Quantiles')
plt.xticks(quantiles)
plt.grid(True)
plt.show()
Output
Quantiles:
0.25 36250.0
0.50 47500.0
0.75 67500.0
Name: Income, dtype: float64

Quantile Shares:
0.25 0.067757
0.50 0.088785
0.75 0.126168
Name: Income, dtype: float64

Income inequality using Cumulative income shares and Lorenz curves


Cumulative income shares and Lorenz curves are concepts used to analyze income
inequality. Here's an explanation of each:
• Cumulative Income Shares:


Cumulative income shares represent the cumulative proportion of total income


held by different segments or groups of the population. It helps us understand
the distribution of income across various portions of the population.
o To calculate cumulative income shares, we sort the income values in
ascending order and calculate the cumulative sum of sorted incomes
divided by the sum of all incomes. This gives us the proportion of total
income accumulated by each portion of the population as we move from
the lowest to the highest incomes.
• Lorenz Curve:
The Lorenz curve is a graphical representation of income inequality. It plots the
cumulative income shares on the y-axis against the cumulative population shares
on the x-axis. The Lorenz curve helps visualize how income is distributed in
relation to the population.
o The equality line is also plotted on the same graph, which represents
perfect income equality. It is a diagonal line starting from the origin (0,0)
and ending at (1,1). The Lorenz curve shows how the actual distribution
of income deviates from the equality line.

Analysis
• The further the Lorenz curve is from the equality line, the greater the income
inequality. If the Lorenz curve lies below the equality line, it indicates income
concentration among a smaller portion of the population. If it lies above the
equality line, it indicates income dispersion among a larger portion of the
population.
• The area between the Lorenz curve and the equality line represents income
inequality. A larger area indicates higher income inequality, while a smaller area
indicates lower inequality.

By plotting the Lorenz curve and comparing it to the equality line, we can visually assess
the level of income inequality in a population. The Lorenz curve provides a
comprehensive overview of income distribution, allowing us to analyze disparities and
evaluate the fairness and equity of income allocation.

Simple python program to explain Income inequality using Cumulative income
shares and Lorenz curves

import numpy as np
import matplotlib.pyplot as plt

# Example dataset with income values
income = np.array([25000, 30000, 35000, 40000, 45000, 50000, 60000, 70000,
80000, 100000])

# Sort the income values in ascending order


sorted_income = np.sort(income)

# Calculate cumulative income shares


cumulative_income_shares = np.cumsum(sorted_income) / np.sum(sorted_income)

# Calculate cumulative population shares


cumulative_population_shares = np.linspace(0,1, len(sorted_income))

# Calculate diagonal line (equality line)


equality_line = np.linspace(0, 1, len(sorted_income))

# Plotting Lorenz curve


plt.figure(figsize=(8, 6))
plt.plot(cumulative_population_shares,cumulative_income_shares,
label='Lorenz Curve')
plt.plot(cumulative_population_shares, equality_line, label='Equality
Line', linestyle='--', color='red')
plt.fill_between(cumulative_population_shares, cumulative_income_shares,
equality_line, alpha=0.5)
plt.xlabel('Cumulative Population Share')
plt.ylabel('Cumulative Income Share')
plt.title('Lorenz Curve - Income Inequality')
plt.legend()
plt.grid(True)
plt.show()

Output

Gini coefficient
The Gini coefficient is a widely used measure of income or wealth inequality. It
quantifies the extent of income inequality in a population, ranging from 0 to 1, where 0
represents perfect equality (all individuals have the same income) and 1 represents
maximum inequality (one individual has all the income, while others have none).


The Gini coefficient is derived from the Lorenz curve, which plots the cumulative
income shares against the cumulative population shares. To calculate the Gini
coefficient, you can follow these steps:

• Obtain the cumulative population shares and cumulative income shares from the
Lorenz curve. These represent the x-axis and y-axis values, respectively.
• Calculate the area between the Lorenz curve and the equality line (the diagonal
line). This area is known as the "area between the Lorenz curve and the line of
perfect equality."
• Calculate the area under the equality line (the area of the triangle formed by the
equality line and the two axes).
• Divide the "area between the Lorenz curve and the line of perfect equality" by
the "area under the equality line" to get the Gini coefficient.
• Mathematically, the formula for the Gini coefficient can be expressed as:
G = (A) / (A + B)
where:
G: Gini coefficient
A: Area between the Lorenz curve and the line of perfect equality
B: Area under the equality line
• The Gini coefficient ranges from 0 to 1, with higher values indicating higher
income inequality.

Simple python program to explain Income inequality using Gini coefficient


import numpy as np

def calculate_gini_coefficient(income):
    # Sort the income values in ascending order
    sorted_income = np.sort(income)

    # Get the cumulative population shares
    cumulative_population_shares = np.arange(1, len(sorted_income) + 1) / len(sorted_income)

    # Calculate the cumulative income shares
    cumulative_income_shares = np.cumsum(sorted_income) / np.sum(sorted_income)

    # Approximate the area under the Lorenz curve (left Riemann sum)
    area_under_lorenz = np.sum(cumulative_income_shares[:-1] *
        (cumulative_population_shares[1:] - cumulative_population_shares[:-1]))

    # Total width of the population-share axis (this equals 1)
    total_width = cumulative_population_shares[-1]

    # Gini coefficient: G = 1 - 2 * (area under the Lorenz curve)
    gini_coefficient = 1 - (2 * area_under_lorenz / total_width)

    return gini_coefficient

# Example income distribution


income = np.array([1000, 2000, 3000, 4000, 5000])


# Calculate the Gini coefficient


gini = calculate_gini_coefficient(income)

# Print the result


print("Gini coefficient:", gini)

Output
Gini coefficient: 0.4666666666666667

Summary Measure of Inequality


When evaluating a summary measure of inequality, there are several desirable
properties to consider. Here are some of the key properties that are often desired:
• Scale invariance: The measure should not change when the units of
measurement are transformed or scaled. It should provide consistent results
regardless of the scale of the variable being measured.
• Anonymity: The measure should not be affected by changes in the identity or
characteristics of individuals within the population. It should only depend on the
income or wealth distribution itself.
• Population size independence: The measure should not be sensitive to changes
in the total population size. It should provide meaningful comparisons even
when the population size varies.
• Transfer principle: The measure should reflect changes in inequality when
income or wealth is redistributed among individuals. If there is a transfer of
resources from richer to poorer individuals, the measure should indicate a
decrease in inequality.
• Decomposability: The measure should allow for the decomposition of inequality
into different components, such as within-group and between-group inequality.
This property helps in understanding the sources and drivers of inequality.
• Mathematical properties: The measure should possess well-defined
mathematical properties, such as continuity, differentiability, and boundedness.
These properties ensure that the measure is mathematically tractable and
interpretable.
• Intuitiveness and interpretability: The measure should be easy to understand
and interpret intuitively. It should provide meaningful insights into the
distribution of income or wealth and the level of inequality.
It's important to note that no single measure can fully satisfy all of these properties
simultaneously. Different measures of inequality emphasize different aspects and trade-
offs. Therefore, it is often recommended to consider multiple measures and interpret
the results collectively to gain a comprehensive understanding of inequality.


PROFESSIONAL ELECTIVE COURSES: VERTICALS


VERTICAL 1: DATA SCIENCE
CCS346 EXPLORATORY DATA ANALYSIS
UNIT IV - BIVARIATE ANALYSIS
Relationships between Two Variables – Percentage Tables – Analyzing Contingency Tables
– Handling Several Batches – Scatterplots and Resistant Lines.

Introduction
Data in statistics is sometimes classified according to the number of variables. For example,
“height” might be one variable and “weight” might be another variable. Depending on the
number of variables, the data is classified as univariate, Bivariate, Multivariate.
• Univariate analysis -- Analysis of one (“uni”) variable.
• Bivariate analysis -- Analysis of exactly two variables.
• Multivariate analysis -- Analysis of more than two variables.

Relationships between Two Variables


Bivariate Analysis – Definition:
• Bivariate Analysis is a type of statistical analysis where two variables are observed.
In this analysis, one variable is dependent and the other is independent. These two
variables are mostly denoted by X and Y. Hence it is denoted as pair (X, Y).
• It is mainly used to examine the relationship between two variables. It helps in
understanding how the variables are related and if there is any association or
correlation between them.
Example: Hours spent watching TV Vs time spent in doing physical exercise.
• Techniques used in bivariate analysis are:
o Correlation analysis
o Scatter plots
o Regression analysis


Relationship between two variables


In statistical analysis, the terms "explanatory variable" and "response variable" are
commonly used to describe the relationship between two variables in a study.

Explanatory Variable (Independent Variable)


The explanatory variable, also known as the independent variable or predictor
variable, is the variable that is manipulated or controlled by the researcher. It is believed to
have an influence on the response variable.
For example, in a study investigating the effect of studying time on exam scores, the
amount of time spent studying would be the explanatory variable. Students study for
different durations (e.g., 1 hour, 2 hours, or 3 hours).

Response Variable (Dependent Variable)


The response variable, also known as the dependent variable or outcome variable, is
the variable that is observed or measured in response to changes in the explanatory
variable. Its value depends on the values of the explanatory variable.
In the same example of studying time and exam scores, the exam scores would be
the response variable.

The relationship between the explanatory variable and the response variable is typically
analyzed to see how changes in the explanatory variable affect the response variable.
Statistical techniques like regression analysis, ANOVA, or correlation analysis are
commonly used to quantify and analyze this relationship.

Techniques used in Bivariate Analysis


Scatter plots
A scatter plot is a graphical representation of data points on a Cartesian plane. It
shows the values of two variables as points on the graph, with one variable represented on
the x-axis and the other on the y-axis. Scatter plots helps to visualize the relationship
between the variables. The pattern formed by the data points can provide insights into the
type and strength of the relationship.
Suppose you collect data from a group of individuals, recording the number of hours
they spend watching TV (x-axis) and the amount of time they spend on physical exercise
(y-axis) in a week. By plotting these data points on a scatter plot, you can visually analyze
the relationship between the two variables. If the data points show a cluster towards the
lower end of physical exercise time, it suggests a potential negative relationship. This
implies that individuals who spend more time watching TV tend to engage in less physical
exercise.

Simple Python program for scatter plot


import matplotlib.pyplot as plt

# Data collection
# Example TV viewing hours
tv_hours = [2, 3, 1, 4, 5, 2, 1, 3, 4, 2]

# Example exercise time


exercise_time = [30, 45, 20, 60, 75, 40, 25, 50, 55, 35]

# Creating a scatter plot


plt.scatter(tv_hours, exercise_time)
plt.xlabel("TV Viewing Hours")
plt.ylabel("Exercise Time (minutes)")
plt.title("TV Viewing vs. Exercise Time")
plt.show()

Output


Correlation analysis
Correlation analysis measures the strength and direction of the linear relationship
between two variables. It provides a numerical value, called a correlation coefficient, which
quantifies the relationship. The correlation coefficient ranges from -1 to +1. A positive
correlation coefficient indicates a positive relationship, a negative correlation coefficient
indicates a negative relationship, and a correlation coefficient of zero indicates no linear
relationship.
Let's say the correlation coefficient between TV viewing hours and physical exercise
time is -0.6. This negative value indicates a moderate negative correlation. It suggests that
as TV viewing hours increase, there tends to be a decrease in the amount of time spent on
physical exercise. However, it's important to remember that correlation does not imply
causation. The negative correlation does not necessarily mean that watching TV causes a
decrease in physical exercise, but rather that there is an association between the two
variables.
Simple Python program to explain correlation analysis
import numpy as np
from scipy.stats import pearsonr, linregress
import matplotlib.pyplot as plt

# Data collection
tv_hours = [2, 3, 1, 4, 5, 2, 1, 3, 4, 2] # Example TV viewing hours
exercise_time = [30, 45, 20, 60, 75, 40, 25, 50, 55, 35] # Example exercise
time

# Bivariate analysis
correlation_coefficient, _ = pearsonr(tv_hours, exercise_time)
slope, intercept, r_value, p_value, std_err = linregress(tv_hours,
exercise_time)

# Creating a scatter plot


plt.scatter(tv_hours, exercise_time)
plt.xlabel("TV Viewing Hours")
plt.ylabel("Exercise Time (minutes)")
plt.title("TV Viewing vs. Exercise Time")

# Regression line
x = np.array(tv_hours)
y = slope * x + intercept
plt.plot(x, y, color='red')

# Display the plot with regression line


plt.show()

# Results
print("Correlation coefficient:", correlation_coefficient)
print("Slope:", slope)
print("Intercept:", intercept)
print("R-value:", r_value)
print("P-value:", p_value)
print("Standard error:", std_err)
Output

Correlation coefficient: 0.9795036751931497


Slope: 12.45341614906832
Intercept: 9.875776397515537
R-value: 0.9795036751931494
P-value: 7.532861648764562e-07
Standard error: 0.9054273128641811

Regression analysis
Regression analysis is used to model the relationship between two variables. It
helps predict the value of one variable based on the known value of another variable. In
bivariate analysis, simple linear regression is commonly used. It assumes a linear
relationship between the variables and estimates the best-fit line that minimizes the
distance between the observed data points and the predicted values.
Suppose the regression analysis suggests a simple linear equation like "Physical
Exercise Time = 150 - 10 * TV Viewing Hours." This equation implies that for each


additional hour spent watching TV, the expected decrease in physical exercise time is 10
minutes. Using this equation, you can estimate the physical exercise time for an individual
based on the number of hours they spend watching TV.
Simple Python program to explain regression analysis
import numpy as np
from scipy.stats import linregress

# Data collection
tv_hours = [2, 3, 1, 4, 5, 2, 1, 3, 4, 2] # Example TV viewing hours
exercise_time = [30, 45, 20, 60, 75, 40, 25, 50, 55, 35] # Example exercise
time

# Regression analysis
slope, intercept, r_value, p_value, std_err = linregress(tv_hours,
exercise_time)

# Predicting exercise time for new TV viewing hours


new_tv_hours = [2.5, 3.5, 4.5] # Example new TV viewing hours
predicted_exercise_time = [slope * x + intercept for x in new_tv_hours]

# Results
print("Slope:", slope)
print("Intercept:", intercept)
print("R-value:", r_value)
print("P-value:", p_value)
print("Standard error:", std_err)

# Predicted exercise time for new TV viewing hours


print("Predicted Exercise Time for New TV Viewing Hours:")
for i in range(len(new_tv_hours)):
print(f"TV Viewing Hours: {new_tv_hours[i]}, Predicted Exercise Time:
{predicted_exercise_time[i]}")
Output
Slope: 12.45341614906832
Intercept: 9.875776397515537
R-value: 0.9795036751931494
P-value: 7.532861648764562e-07
Standard error: 0.9054273128641811
Predicted Exercise Time for New TV Viewing Hours:
TV Viewing Hours: 2.5, Predicted Exercise Time: 41.00931677018633
TV Viewing Hours: 3.5, Predicted Exercise Time: 53.462732919254655
TV Viewing Hours: 4.5, Predicted Exercise Time: 65.91614906832297

Working with SPSS


Syntax
GET DATA
/TYPE=XLS
/FILE='C:\Users\lenovo\Desktop\cancer.xls'
/SHEET=name 'Cancer'
/CELLRANGE=full


/READNAMES=on
/ASSUMEDSTRWIDTH=32767.
EXECUTE.
DATASET NAME DataSet1 WINDOW=FRONT.
REGRESSION
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT AGE
/METHOD=ENTER WEIGHIN
/RESIDUALS NORMPROB(ZRESID).


Charts


Proportions, percentages and probabilities


The relationship between two variables can be examined in terms of proportions,
percentages, and probabilities. These measures provide insights into the distribution and
likelihood of events occurring. Let's explore how these concepts relate to the relationship
between two variables.

Proportions
Proportions represent the relative size or share of one category within a total. When
examining the relationship between two variables, you can calculate proportions based on
the occurrence or absence of specific events within each variable.
For example, consider a dataset of students and their preferred subjects: Math and
Science. Calculate the proportion of students who prefer Math and the proportion of
students who prefer Science. These proportions reflect the distribution of preferences
within the dataset.

Percentages
Percentages represent proportions expressed as a fraction of 100. They provide a
way to compare proportions on a standardized scale.
In the previous example, if 60 out of 100 students prefer Math, the proportion of
students preferring Math is 60/100 or 0.6. The percentage of students preferring Math is
0.6 multiplied by 100, which equals 60%. Similarly, you can calculate the percentage of
students preferring Science.

Probabilities
Probabilities quantify the likelihood of an event occurring. In the context of two variables,
probabilities can be used to understand the likelihood of events happening simultaneously
or independently based on the relationship between the variables. Probabilities can help
determine the chances of specific outcomes or events occurring, considering the
relationship between the two variables.

Example 1
To demonstrate how proportions and percentages can be calculated from a students'
database, let's consider a simple example. Assume we have a database containing
information about students' favorite subjects: Math, Science, and English. We want to
calculate the proportion and percentage of students who prefer each subject. Here's an
example database:
Student ID    Favorite Subject
1             Math
2             Science
3             Math
4             English
5             Math
6             Science
7             Science
8             English
9             English
10            Math

Solution
To calculate the proportions and percentages, we need to determine the number of
students who prefer each subject and divide it by the total number of students.
Step 1: Count the number of students who prefer each subject:
Math: 4 students
Science: 3 students
English: 3 students
Step 2: Calculate the proportions:
Proportion of students who prefer Math: 4/10 = 0.4
Proportion of students who prefer Science: 3/10 = 0.3
Proportion of students who prefer English: 3/10 = 0.3
Step 3: Calculate the percentages:
Percentage of students who prefer Math: 0.4 * 100 = 40%
Percentage of students who prefer Science: 0.3 * 100 = 30%
Percentage of students who prefer English: 0.3 * 100 = 30%
So, in this example, 40% of the students prefer Math, 30% prefer Science, and 30% prefer
English.

By calculating proportions and percentages, we can gain insights into the distribution of
favorite subjects among students in the database. This information can help in
understanding the preferences and patterns within the student population.
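
The same proportions and percentages can be obtained directly with pandas; the sketch
below assumes the favourite-subject data from the table above:

import pandas as pd

subjects = pd.Series(['Math', 'Science', 'Math', 'English', 'Math',
                      'Science', 'Science', 'English', 'English', 'Math'])

counts = subjects.value_counts()                      # Math 4, Science 3, English 3
proportions = subjects.value_counts(normalize=True)   # 0.4, 0.3, 0.3
percentages = proportions * 100                       # 40%, 30%, 30%

print(counts)
print(proportions)
print(percentages)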

Example 2
Let's work through a solved example of analyzing a frequency table related to gender
(male/female) and job satisfaction (satisfied/dissatisfied) data. We collected data on job
satisfaction from a sample of 200 employees.

Calculating the Frequency table


Step 1:
Here is the frequency table:

Frequency table
                        Job Satisfaction
                        Satisfied    Dissatisfied
Gender    Male          80           30
          Female        70           20
Step 2: Calculate the totals
Analyzing the table, we find 80 males are satisfied with their job, 30 males are dissatisfied,
70 females are satisfied, and 20 females are dissatisfied.
• Calculate the row totals. The row totals represent the total number of employees for
each gender.
Row total for Male = 80 + 30 = 110
Row total for Female = 70 + 20 = 90
• Calculate column totals: The column totals represent the total number of employees
for each job satisfaction level.
Column total for Satisfied = 80 + 70 = 150
Column total for Dissatisfied = 30 + 20 = 50

Step 3: Calculate percentages or proportions


Convert the frequencies to percentages or proportions to analyze the relative distribution
of job satisfaction levels among male and female employees.
Proportion of males who are satisfied is 80/110 = 0.73 or 73%,
Proportion of females who are satisfied is 70/90 = 0.78 or 78%.

Step 4: Interpret the results


Analyze the frequencies, row totals, column totals, and percentages to gain insights into the
distribution of job satisfaction levels among male and female employees.
• In this example, it appears that a higher proportion of females (78%) are satisfied
compared to males (73%).

Calculate the probability


Calculate the probability of job satisfaction for males
Probability of Male being Satisfied = Frequency of Male Satisfied / Total number of observations
= 80 / (80 + 30 + 70 + 20)
= 80 / 200
= 0.4 (or 40%)

Calculate the probability of job dissatisfaction for males


Probability of Male being dissatisfied = Frequency of Male dissatisfied / Total number of observations
= 30 / (80 + 30 + 70 + 20)
= 30 / 200
= 0.15 (or 15%)

Calculate the probability of job satisfaction for females


Probability of Female being Satisfied = Frequency of Female Satisfied / Total number of observations
= 70 / (80 + 30 + 70 + 20)
= 70 / 200
= 0.35 (or 35%)

Calculate the probability of job dissatisfaction for females

Probability of Female being dissatisfied = Frequency of Female dissatisfied / Total number of observations
= 20 / (80 + 30 + 70 + 20)
= 20 / 200
= 0.1 (or 10%)

Analysis
These are joint probabilities: the probability that a randomly selected employee is a
satisfied male (40%) is higher than that of being a satisfied female (35%), and the
probability of being a dissatisfied male (15%) is higher than that of being a dissatisfied
female (10%). Conditional on gender, however, satisfaction is higher among females
(70/90 ≈ 78%) than among males (80/110 ≈ 73%), as shown in Step 3.

Working with SPSS


Frequency, percentage & proportions
Syntax
GET DATA /TYPE=XLSX
/FILE='C:\Users\lenovo\Desktop\newdrug(excel2007).xlsx'
/SHEET=name 'Sheet1'
/CELLRANGE=full
/READNAMES=on
/ASSUMEDSTRWIDTH=32767.
EXECUTE.
DATASET NAME DataSet5 WINDOW=FRONT.
FREQUENCIES VARIABLES=Treatment
/BARCHART FREQ
/ORDER=ANALYSIS.


Output

Syntax
FREQUENCIES VARIABLES=Age Treatment
/STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM SEMEAN
/HISTOGRAM
/ORDER=ANALYSIS.

Frequencies


Frequency Table

Histogram


Analyzing contingency tables


Contingency tables, also known as cross-tabulation tables or two-way frequency tables, are
used to analyze the relationship between two categorical variables. They helps us to
summarize and visualize the data, and helps in finding any associations or dependencies
between the variables.
Step-by-step guide to analyze contingency tables
Step 1: Understand the Variables: Determine the two categorical variables involved in the
contingency table. Identify the categories or levels of each variable.
• Suppose you are studying the relationship between gender (male/female) and job
satisfaction (satisfied/dissatisfied) among employees in a company.
o Here, the two categorical variables are gender (with two levels: male and
female) and job satisfaction (with two levels: satisfied and dissatisfied).

Step 2: Create the Contingency Table: Create a table with rows representing one variable
and columns representing the other variable. Each cell in the table will contain the
frequency or count of observations falling into that particular combination of categories.
• You collect data from a sample of 200 employees and create a contingency table to
analyze the relationship.
• Construct the table with gender as rows and job satisfaction as columns.

Contingency table
                        Job Satisfaction
                        Satisfied    Dissatisfied
Gender    Male          60           40
          Female        70           30


Step 3: Calculate Row and Column Totals: Add row and column totals to the contingency
table
Contingency table
                        Job Satisfaction
                        Satisfied    Dissatisfied    TOTAL
Gender    Male          60           40              100
          Female        70           30              100
TOTAL                   130          70              200

Step 4: Calculate Expected Frequencies (Optional): If you want to assess whether the
observed frequencies deviate significantly from what would be expected, you can calculate
the expected frequencies.

Expected Frequency = (Row total × Column total) / Total number of observations

• Expected frequency for the "Male-Satisfied" cell = (100 × 130) / 200 = 65.
• Expected Frequency for "Male-Dissatisfied" cell = (100 × 70) / 200 = 35.
• Expected frequency for the "Female-Satisfied" cell = (100 × 130) / 200 = 65.
• Expected Frequency for "Female-Dissatisfied" cell = (100 × 70) / 200 = 35.

Step 5: Interpret the Observed Frequencies: Examine the observed frequencies in the
contingency table. Look for any patterns or differences between the levels of the variables,
as they may indicate a significant relationship.
• In our example, we can observe that there are 60 males who are satisfied with their
job and 70 females who are satisfied. Similarly, there are 40 males and 30 females
who are dissatisfied.

Step 6: Calculate Percentages or Proportions: Convert the frequencies in the contingency


table to percentages or proportions. This helps in understanding the relative distribution of
observations across the categories and facilitates comparison between the levels of the
variables.
• Here, the proportion of males satisfied with their job is 60/100 = 0.6 or 60%, while
the proportion of females satisfied is 70/100 = 0.7 or 70%.

Step 7: Conduct Chi-Square Test (Optional):


Chi-square test


• The chi-square test is commonly used to analyze contingency tables and determine
if there is a significant association between two categorical variables. The test
compares the observed frequencies in the contingency table with the expected
frequencies.
• The chi-square test assesses whether the observed frequencies differ significantly
from the expected frequencies. It provides a statistical measure of the association
between the variables.

• Here's how you can conduct a chi-square test using a contingency table:

Gender    Outcome         Observed (O)    Expected (E)    (O − E)    (O − E)²    (O − E)²/E
Male      Satisfied       60              65              −5         25          0.38
Male      Dissatisfied    40              35              5          25          0.71
Female    Satisfied       70              65              5          25          0.38
Female    Dissatisfied    30              35              −5         25          0.71
                                                          TOTAL – X²             2.20

• Calculate the degrees of freedom:


Since there are two gender groups (male, female) and two outcome groups
(satisfied, unsatisfied), there are (2 − 1) × (2 − 1) = 1 degree of freedom.
DF = (Number of rows - 1) × (Number of columns - 1) = (2 - 1) × (2 - 1) = 1

Step 8: Interpret the Results


For a test of significance at α = .05 and df = 1, the X2 critical value is 3.841


• Compare the chi-square statistic with the critical value or p-value:


X2 = 2.20
Critical value = 3.841
The Χ2 value is lesser than the critical value.

• If the chi-square statistic is greater than the critical value or the p-value is less than
the significance level (p < α), reject the null hypothesis. This indicates that there is
evidence of an association between gender and job satisfaction.
• If the chi-square statistic is smaller than the critical value or the p-value is greater
than the significance level (p > α), fail to reject the null hypothesis.
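
The same test can be run with SciPy's chi2_contingency function. The sketch below uses the
2x2 table from Step 2 and switches off Yates' continuity correction so that the statistic
matches the hand calculation (X² ≈ 2.20):

from scipy.stats import chi2_contingency

# Rows: Male, Female; columns: Satisfied, Dissatisfied
observed = [[60, 40],
            [70, 30]]

chi2, p_value, dof, expected = chi2_contingency(observed, correction=False)

print(chi2)      # ≈ 2.198
print(p_value)   # ≈ 0.138 -> greater than 0.05, so we fail to reject the null hypothesis
print(dof)       # 1
print(expected)  # [[65. 35.] [65. 35.]]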

Step 9: Visualize the Contingency Table: To further explore the relationship between the
variables, you can create visual representations of the contingency table, such as stacked
bar charts or heatmaps.

Contingency Table using SPSS


Syntax
CROSSTABS
/TABLES=Treatment BY Gender
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT
/COUNT ROUND CELL.


• 0 cells (0.0%) have expected count less than 5. The minimum expected count is
10.56.
• Computed only for a 2x2 table

Recoding, reordering, and collapsing categorical variables


Recoding, reordering, and collapsing categorical variables are common data manipulation
techniques used to transform and organize categorical variables in a dataset.
Recoding Categorical Variables:
Recoding involves changing the values of a categorical variable to new values based
on certain criteria or rules. This technique is useful when you want to group similar
categories together or create new categories. For example, suppose you have a categorical
variable "Education" with categories "High School," "Bachelor's Degree," and "Master's
Degree." You may want to recode these categories as "High School or Less," "Bachelor's
Degree," and "Advanced Degree" to create broader categories. Recoding can be performed
manually or using software like SPSS, R, or Python.
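
In Python, this kind of recoding can be sketched with pandas, for example with map() on a
hypothetical Education column:

import pandas as pd

df = pd.DataFrame({'Education': ['High School', "Bachelor's Degree", "Master's Degree",
                                 'High School', "Master's Degree"]})

recode_map = {
    'High School': 'High School or Less',
    "Bachelor's Degree": "Bachelor's Degree",
    "Master's Degree": 'Advanced Degree',
}

# Store the result in a new column, leaving the original variable unchanged
df['Education_recoded'] = df['Education'].map(recode_map)
print(df)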

In SPSS, there are three basic options for recoding variables:


• Recode into Different Variables
• Recode into Same Variables


• DO IF syntax

Recode into Different Variables and DO IF syntax


- Create a new variable without modifying the original variable
Recode into Same Variables
- Permanently overwrite the original variable.
Recode into Different Variables
Recoding into a different variable transforms an original variable into a new
variable. That is, the changes do not overwrite the original variable; instead changes are
applied to a copy of the original variable under a new name.
To recode into different variables, click Transform > Recode into Different Variables.

The Recode into Different Variables window will appear.

Input Variable -> Output Variable: The center text box lists the variable(s) you have
selected to recode, as well as the name your new variable(s) will have after the recode


Output Variable: Define the name and label for your recoded variable(s) by typing them in
the text fields. Once you are finished, click Change. (Here, we have renamed "Age" to
"Patient_Age".)

Old And New Values


Once you click Old and New Values, a new window where you will specify how to
transform the values will appear.
• Old Value: Specify the type of value you wish to recode
• New Value: Specify the new value for your variable


Syntax

Syntax


Reordering Categorical Variables:


Reordering categorical variables involves changing the order of categories to align with a
specific criterion or logical sequence. This technique is useful when you want to present the
categories in a specific order for analysis or visualization purposes. For instance, if you
have a categorical variable "Income Level" with categories "Low," "Medium," and "High,"
you may want to reorder them as "Low," "Medium," and "High" for consistency or to reflect
a natural order. Reordering can be done manually or using data analysis software.
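
A pandas sketch of imposing an explicit category order on a hypothetical Income Level
column (an ordered Categorical keeps Low < Medium < High in summaries and plots):

import pandas as pd

income_level = pd.Series(['Medium', 'High', 'Low', 'Medium', 'Low'])

# Declare the logical order of the categories
ordered = pd.Categorical(income_level,
                         categories=['Low', 'Medium', 'High'],
                         ordered=True)

print(ordered)
print(pd.Series(ordered).value_counts(sort=False))  # counts reported in the declared order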

Collapsing Categorical Variables:


Collapsing involves combining multiple categories of a categorical variable into a single
category. This technique is used when you want to simplify the analysis or reduce the
number of categories. For example, if you have a categorical variable "Age Group" with
categories "18-24," "25-34," "35-44," and "45-54," you may want to collapse them into
broader categories like "18-34" and "35-54" to have fewer groups for analysis. Collapsing
can be done manually or programmatically in statistical software.

When applying these techniques, it is important to consider the context and objectives of
the analysis. Be mindful of preserving the integrity of the data and ensure that the recoded,
reordered, or collapsed categories accurately represent the underlying information.

Handling Several Batches


Boxplots are widely used in data exploration to visualize the distribution and statistical
properties of a dataset. They provide a concise summary of the data's central tendency,
dispersion, and skewness. Here's how boxplots are constructed and interpreted:


Construction of a boxplot
• The median (50th percentile) of the dataset is represented by a horizontal line
inside a box.
• The box extends from the lower quartile (25th percentile) to the upper quartile
(75th percentile), indicating the interquartile range (IQR).
• Whiskers, represented by vertical lines, extend from the box's edges to the furthest
data points within a certain range. The exact range depends on the specific rules
used for whisker calculation.
• Data points outside the whiskers are considered outliers and are typically plotted
individually.

Interpretation of a boxplot
• Center: The median (line inside the box) represents the dataset's central tendency. If
the median is closer to the bottom of the box, the data is skewed towards lower
values, while if it's closer to the top, the data is skewed towards higher values.
• Spread: The length of the box (IQR) gives an indication of the data's dispersion. A
longer box implies greater variability, while a shorter box indicates less variability.
• Skewness: The symmetry of the boxplot helps identify skewness in the data. If the
median line is not in the center of the box, the data may be skewed.
• Outliers: Data points plotted individually outside the whiskers are considered
outliers and may indicate unusual or extreme observations.
• Boxplots are particularly useful when comparing distributions between different
groups or variables. By placing multiple boxplots side by side, you can visually
compare their central tendencies, spreads, and skewness.
In summary, boxplots offer a visual summary of key statistical characteristics of a dataset,
making them a valuable tool for data exploration and initial analysis.

Simple Python Program to display Boxplot


import matplotlib.pyplot as plt

# Example dataset
scores = [55, 58, 60, 62, 65, 66, 68,
70, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 85,
86, 87, 88, 89, 90, 91, 92, 93, 95]

# Creating the boxplot


plt.boxplot(scores)


# Adding labels and title


plt.xlabel('Scores')
plt.ylabel('Values')
plt.title('Boxplot of Exam Scores')

# Displaying the plot


plt.show()

Output

How to Create and Interpret Box Plots in SPSS


A box plot is used to visualize the five number summary of a dataset, which includes:
• The minimum
• The first quartile
• The median
• The third quartile
• The maximum

Creating a Single Box Plot in SPSS


Suppose we have the following dataset that shows the exam marks scored by 10 students:


To create a box plot to visualize the distribution of these data values, we can click the
Analyze tab, then Descriptive Statistics, then Explore:

To create a box plot, drag the variable points into the box labelled Dependent List. Then
make sure Plots is selected under the option that says Display near the bottom of the box.


The resulting boxplot shows the distribution of the Exam Score variable.

Outliers
In statistics, outliers are data points that deviate significantly from the rest of the
dataset. These observations are considered to be unusual, extreme, or inconsistent with the
overall pattern or distribution of the data.


Outliers can occur due to various reasons, including measurement errors, data entry
mistakes, natural variability, or genuinely unusual observations. They can have a
significant impact on statistical analysis, as they can skew results and affect the validity of
assumptions made about the data.
Identifying outliers is important because they can distort statistical measures such
as the mean and standard deviation, as well as affect data modeling and analysis. Outliers
can be detected through various methods, such as graphical exploration, statistical tests, or
using domain knowledge.

Some commonly used techniques to identify outliers include:


• Visual inspection: Plotting the data using graphs like scatter plots, boxplots, or
histograms can reveal any data points that appear significantly different from the
majority of the data.
• Statistical tests: Methods like the z-score or modified z-score can be used to identify
observations that fall beyond a certain threshold of standard deviations from the
mean. Values that exceed a specific cutoff (e.g., z-score greater than 3 or 3.5) can be
considered outliers.
• Interquartile Range (IQR) method: This method uses the quartiles (Q1 and Q3) and
the IQR to determine outliers. Data points below Q1 - 1.5 * IQR or above Q3 + 1.5 *
IQR are considered outliers (see the sketch after this list).
• Domain knowledge: Sometimes, outliers are known to exist based on the context or
subject matter expertise. In such cases, domain knowledge can be used to identify
and handle outliers appropriately.
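
Simple Python sketch: IQR-based outlier detection

(A minimal illustration of the 1.5 * IQR rule using NumPy; the data are the same small example used in the boxplot program later in this section.)

import numpy as np

# Example dataset containing one extreme value
data = np.array([10, 12, 15, 16, 17, 18, 20, 21, 22, 100])

# Quartiles and interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Conventional 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are flagged as outliers
outliers = data[(data < lower_fence) | (data > upper_fence)]
print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("Outliers:", outliers)  # the value 100 falls above the upper fence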

Outliers can be treated by removing them, transforming them, or treating them as missing
values, depending on the circumstances and impact on the analysis. In summary, outliers
are data points that are significantly different from the rest of the dataset and can have an
impact on statistical analysis. Detecting and managing outliers is crucial to ensure accurate
and reliable results.


Simple Python Program to show outlier in Boxplot


import matplotlib.pyplot as plt

# Example dataset
data = [10, 12, 15, 16, 17, 18, 20, 21, 22, 100]

# Creating the boxplot


plt.boxplot(data, flierprops={'marker': 'o', 'markerfacecolor':
'red', 'markersize': 8})

# Adding labels and title


plt.xlabel('Data')
plt.ylabel('Values')
plt.title('Boxplot with Outliers Highlighted')

# Displaying the plot


plt.show()

Output

Note:
The flierprops parameter is used to customize the appearance of the outliers. In this example, we set the
marker style to a circle ('o'), the marker face color to red, and the marker size to 8.

Dependents
“Dependents together” means that all dependent variables are shown together in each
boxplot. If you enter a factor (say, sex), you'll get a separate boxplot for each factor level:
female and male respondents. “Factor levels together” creates a separate boxplot for each
dependent variable, showing all factor levels together in each boxplot.
“Exclude cases pairwise” means that the results for each variable are based on all cases
that don't have a missing value for that variable. “Exclude cases listwise” uses only cases
without any missing values on all variables.

Syntax


• The first column tells how many cases were used for each variable.
• Note that trial 5 has N = 205 or 86.1% missing values.

Boxplot for 1 Variable - Multiple Groups of Cases


GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=agegroup r03 MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: agegroup=col(source(s), name("agegroup"), unit.category())
DATA: r03=col(source(s), name("r03"))
GUIDE: axis(dim(1), label("Age Group"))
GUIDE: axis(dim(2), label("Reaction time trial 3"))
GUIDE: text.title(label("I CAN TYPE MY AMAZING TITLE RIGHT HERE!"))
SCALE: cat(dim(1), include("1", "2", "3"))
SCALE: linear(dim(2), include(0))
ELEMENT: schema(position(bin.quantile.letter(agegroup*r03)), label(r03))
END GPL.


T - Test
A T-test is a statistical method for comparing the means of two samples gathered from
either the same group or from different categories. It is used for hypothesis testing about a
population: it assesses whether the difference between two means, or between a sample
mean and a hypothesized value, is statistically significant.
There are several types of t-tests, but the most commonly used ones are:
• Independent samples t-test: This test compares the means of two independent
groups to determine if there is a significant difference between them. It assumes
that the two groups are independent of each other.

• Paired samples t-test: This test compares the means of two related groups, where
each observation in one group is paired with an observation in the other group. It is
used when the same individuals or objects are measured before and after an
intervention or treatment.

• One-sample t-test: This test compares the mean of a single group to a known
population mean or a hypothesized value. It is used to determine if the observed
mean significantly differs from a specific value.

One-sample, two-sample, paired, equal, and unequal variance are the types of T-tests users
can use for mean comparisons.


Independent samples t tests


Independent samples t tests have the following hypotheses:
• Null hypothesis: The means for the two populations are equal.
• Alternative hypothesis: The means for the two populations are not equal.
o If the p-value is less than your significance level (e.g., 0.05), you can reject the
null hypothesis. The difference between the two means is statistically
significant. Your sample provides strong enough evidence to conclude that
the two population means are not equal.

Simple Example to calculating an Independent Samples T Test


Calculate an independent samples t test for the following data sets:
Data set A: 1,2,2,3,3,4,4,5,5,6
Data set B: 1,2,4,5,5,5,6,6,7,9

Solution
Step 1: Sum the two groups:
A: 1 + 2 + 2 + 3 + 3 + 4 + 4 + 5 + 5 + 6 = 35
B: 1 + 2 + 4 + 5 + 5 + 5 + 6 + 6 + 7 + 9 = 50

Step 2: Square the sums from Step 1:

Set A total: (ΣA)² = 35² = 1225
Set B total: (ΣB)² = 50² = 2500

Step 3: Calculate the means for the two groups:

A: (1 + 2 + 2 + 3 + 3 + 4 + 4 + 5 + 5 + 6)/10 = 35/10 = 3.5
B: (1 + 2 + 4 + 5 + 5 + 5 + 6 + 6 + 7 + 9)/10 = 50/10 = 5

Step 4: Square the individual scores and then add them up:
ΣA²: 1² + 2² + 2² + 3² + 3² + 4² + 4² + 5² + 5² + 6² = 145
ΣB²: 1² + 2² + 4² + 5² + 5² + 5² + 6² + 6² + 7² + 9² = 298


Step 5: Insert your numbers into the following formula and solve:

t = (μA − μB) / √{ [ (ΣA² − (ΣA)²/nA) + (ΣB² − (ΣB)²/nB) ] / (nA + nB − 2) × (1/nA + 1/nB) }

where
(ΣA)²: Sum of data set A, squared (Step 2)
(ΣB)²: Sum of data set B, squared (Step 2)
μA: Mean of data set A (Step 3)
μB: Mean of data set B (Step 3)
ΣA²: Sum of the squares of data set A (Step 4)
ΣB²: Sum of the squares of data set B (Step 4)
nA: Number of items in data set A
nB: Number of items in data set B

Applying all the values,

t = -1.69

Step 6: Find the Degrees of freedom (nA-1 + nB-1) = 18

Step 7: Look up your degrees of freedom (Step 6) in the t-table. If you don’t know what
your alpha level is, use 5% (0.05).
18 degrees of freedom at an alpha level of 0.05 = 2.10.


Step 8: Compare your calculated value (Step 5) to your table value (Step 7). The absolute
value of the calculated statistic, |−1.69|, is less than the cutoff of 2.10 from the table, so
p > .05. Because the p-value is greater than the alpha level, we fail to reject the null
hypothesis that there is no difference between the means.
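
Checking the hand calculation with SciPy

(The same independent samples t test can be run on the two data sets above with scipy.stats.ttest_ind, assuming equal variances as in the pooled formula.)

from scipy import stats

# Data sets from the worked example
A = [1, 2, 2, 3, 3, 4, 4, 5, 5, 6]
B = [1, 2, 4, 5, 5, 5, 6, 6, 7, 9]

# Independent samples t test with pooled (equal) variances
t_stat, p_value = stats.ttest_ind(A, B, equal_var=True)
print("t =", round(t_stat, 2))   # about -1.69, matching the manual result
print("p =", round(p_value, 3))  # greater than 0.05, so we fail to reject H0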

Independent Samples t Test in SPSS (example: do dog owners throw frisbees further than non-owners?)

Click on Analyze -> Compare Means -> Independent-Samples T Test. This will bring up the
following dialog box.


To perform the t test, we’ve got to get our dependent variable (Frisbee Throwing Distance)
into the Test Variable(s) box, and our grouping variable (Dog Owner) into the Grouping
Variable box. To move the variables over, you can either drag and drop, or use the arrows,
as above.

The dialog box should now look like this.

You’ll notice that the Grouping Variable, DogOwner, has two question marks in brackets
after it. This indicates that you need to define the groups that make up the grouping
variable. Click on the Define Groups button.


We’re using 0 and 1 to specify each group: 0 is No Dog and 1 is Owns Dog.

The first thing to note is the mean values in the Group Statistics table. Here you can see that
on average people who own dogs throw frisbees further than people who don’t own dogs
(54.92 metres as against only 40.12 metres).

SPSS is reporting a t value of -3.320 and a 2-tailed p-value of .003. This would almost
always be considered a significant result (standard alpha levels are .05 and .01). Therefore,
we can be confident in rejecting the null hypothesis that holds that there is no difference
between the frisbee throwing abilities of dog owners and non-owners.

One Sample t Test


The One Sample t Test examines whether the mean of a population is statistically different
from a known or hypothesized value. The One Sample t Test is a parametric test.
In a One Sample t Test, the test variable's mean is compared against a "test value", which is
a known or hypothesized value of the mean in the population.
Example:
A particular factory's machines are supposed to fill bottles with 150 milliliters of product. A
plant manager wants to test a random sample of bottles to ensure that the machines are
not under- or over-filling the bottles.

Note: The One Sample t Test can only compare a single sample mean to a specified
constant. It cannot compare sample means between two or more groups. If you wish to
compare the means of multiple groups to each other, you will likely want to run an
Independent Samples t Test (to compare the means of two groups) or a One-Way ANOVA
(to compare the means of two or more groups).
Data Requirements
• Test variable that is continuous (i.e., interval or ratio level)
• Scores on the test variable
• Random sample of data from the population
• No outliers

Hypotheses
The null hypothesis (H0) and (two-tailed) alternative hypothesis (H1) of the one sample T
test can be expressed as:
• H0: µ = µ0 ("the population mean is equal to the [proposed] population mean")
• H1: µ ≠ µ0 ("the population mean is not equal to the [proposed] population mean")

where µ is the "true" population mean and µ0 is the proposed value of the population
mean.
Test statistic
The test statistic for a One Sample t Test is denoted t, which is calculated using the
following formula:

t = (x̄ − μ0) / 𝑠𝑥̅ , where 𝑠𝑥̅ = s / √n

μ0 = The test value -- the proposed constant for the population mean
𝑥̅ = Sample mean
n = Sample size (i.e., number of observations)
s = Sample standard deviation


𝑠𝑥̅ = Estimated standard error of the mean (s/sqrt(n))

The calculated t value is then compared to the critical t value from the t distribution table
with degrees of freedom df = n - 1 and chosen confidence level. If the calculated t value >
critical t value, then we reject the null hypothesis.
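
Simple Python sketch: one sample t statistic

(The heights below are made-up values used only to illustrate the formula; scipy.stats.ttest_1samp performs the same test directly.)

import numpy as np
from scipy import stats

# Hypothetical sample of heights (inches) and the hypothesized mean
heights = np.array([64.2, 65.8, 66.1, 67.4, 68.0, 69.3, 70.1, 66.7, 65.5, 71.2])
mu0 = 66.5

# Manual computation: t = (x_bar - mu0) / (s / sqrt(n))
x_bar = heights.mean()
s = heights.std(ddof=1)              # sample standard deviation
n = len(heights)
t_manual = (x_bar - mu0) / (s / np.sqrt(n))

# The same test using SciPy
t_scipy, p_value = stats.ttest_1samp(heights, mu0)
print("manual t =", round(t_manual, 3))
print("scipy  t =", round(t_scipy, 3), " p =", round(p_value, 3))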

Run a One Sample t Test


To run a One Sample t Test in SPSS, click Analyze > Compare Means > One-Sample T Test.

Problem Statement
The mean height of adults ages 20 and older is about 66.5 inches (69.3 inches for males, 63.8
inches for females).


In our sample data, we have a sample of 435 college students from a single college. Let's
test if the mean height of students at this college is significantly different than 66.5 inches
using a one-sample t test. The null and alternative hypotheses of this test will be:

H0: µHeight = 66.5 ("the mean height is equal to 66.5")


H1: µHeight ≠ 66.5 ("the mean height is not equal to 66.5")

Before the Test


In the sample data, we will use the variable Height, which is a continuous variable
representing each respondent’s height in inches. The heights exhibit a range of values from
55.00 to 88.41 (Analyze > Descriptive Statistics > Descriptives).
Let's create a histogram of the data to get an idea of the distribution, and to see if our
hypothesized mean is near our sample mean. Click Graphs > Legacy Dialogs > Histogram.
Move variable Height to the Variable box, then click OK.

To add vertical reference lines at the mean (or another location), double-click on the plot to
open the Chart Editor, then click Options > X Axis Reference Line. In the Properties
window, you can enter a specific location on the x-axis for the vertical line, or you can
choose to have the reference line at the mean or median of the sample data (using the
sample data). Click Apply to make sure your new line is added to the chart.
Here, we have added two reference lines: one at the sample mean (the solid black
line), and the other at 66.5 (the dashed red line).


Running the Test


To run the One Sample t Test, click Analyze > Compare Means > One-Sample T Test. Move
the variable Height to the Test Variable(s) area. In the Test Value field, enter 66.5.

Output
Tables

A → Test Value: the value that we entered in the One-Sample T Test window.


B → t Statistic: In this example, t = 5.810. Note that t is calculated by dividing the mean
difference (E) by the standard error mean (from the One-Sample Statistics box).
C →df: The degrees of freedom for the test. For a one-sample t test, df = n - 1; so here, df =
408 - 1 = 407.
D → Significance (One-Sided p and Two-Sided p): The p-values corresponding to one of the
possible one-sided alternative hypotheses (in this case, µHeight > 66.5) and two-sided
alternative hypothesis (µHeight ≠ 66.5), respectively.
E → Mean Difference: The difference between the "observed" sample mean (from the One
Sample Statistics box) and the "expected" mean (the specified test value (A)). The positive t
value in this example indicates that the mean height of the sample is greater than the
hypothesized value (66.5).
F →Confidence Interval for the Difference: The confidence interval for the difference
between the specified test value and the sample mean.

Decision and Conclusions


Since p < 0.001, we reject the null hypothesis that the mean height of students at this
college is equal to the hypothesized population mean of 66.5 inches and conclude that the
mean height is significantly different than 66.5 inches.

Based on the results, we can state the following:


• There is a significant difference in the mean height of the students at this college and
the overall adult population. (p < .001).


Regression line in a scatter plot


The regression line in a scatter plot is used to visualize the relationship between two
variables and to estimate the trend or pattern in the data. It represents the best-fit line
through the scatter plot points, indicating the average or expected value of the dependent
variable (y) for a given value of the independent variable (x).

Here are a few key uses and interpretations of the regression line in a scatter plot:
• Trend Identification: The regression line helps identify the general trend or
direction of the relationship between the variables. If the line slopes upwards from
left to right, it suggests a positive relationship, indicating that as the independent
variable increases, the dependent variable tends to increase as well. Conversely, a
downward-sloping line indicates a negative relationship.
• Prediction: The regression line can be used for predicting the dependent variable
(y) based on a given value of the independent variable (x). By plugging in an x-value
into the equation of the line, you can estimate the corresponding y-value. However,
it's important to note that predictions become less reliable as you move further
away from the observed data points.
• Strength of Relationship: The steepness or slope of the regression line provides
information about the strength of the relationship between the variables. A steeper
line indicates a stronger association between x and y, while a shallower line
suggests a weaker relationship.
• Outlier Detection: The regression line can help identify outliers or data points that
deviate significantly from the overall trend. Points that fall far away from the
regression line may represent unusual or exceptional observations that warrant
further investigation.
• Model Evaluation: The regression line also serves as a benchmark for evaluating the
goodness-of-fit of a regression model. Various statistical measures, such as the
coefficient of determination (R-squared), can be used to assess how well the
regression line represents the data and the proportion of the variation in the
dependent variable that is explained by the independent variable.

It's important to note that the regression line represents the average relationship between
the variables and may not perfectly capture the behavior of individual data points.
Additionally, other types of regression models, such as polynomial regression or multiple
regression, can be used to capture more complex relationships between variables.
Overall, the regression line in a scatter plot provides a visual representation and
estimation of the relationship between two variables, helping to understand patterns, make
predictions, and evaluate the strength of the association.


How to make a scatter plot chart in SPSS


A simple option for drawing linear regression lines is found under Graphs > Legacy
Dialogs > Scatter/Dot, as shown below.

Syntax

Output


For adding a regression line, first double click the chart to open it in a Chart Editor window.
Next, click the “Add Fit Line at Total” icon as shown below.

Adding Regression line to Scatter Plot


The linear regression equation is shown in the label on our line:


y = 9.31E3 + 4.49E2*x
which means that

Salary′=9310+449⋅Hours

Simple Python program to add regression line in Scatter Plot


import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

# Example data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]

# Perform linear regression


slope, intercept, r_value, p_value, std_err = linregress(x, y)

# Create scatter plot


plt.scatter(x, y, color='blue', label='Data')

# Add regression line


plt.plot(x, intercept + slope*np.array(x), color='red',
label='Regression Line')

# Add labels and title


plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot with Regression Line')

# Add legend
plt.legend()

# Display the plot


plt.show()


Note:
In this example, we have two arrays, x and y, representing the data points for the scatter
plot. We then use the linregress() function from SciPy to perform linear regression on the x
and y data.

After that, we create a scatter plot using plt.scatter(x, y, color='blue', label='Data'),


specifying the data points, color, and label for the scatter plot.

Next, we add the regression line using plt.plot(x, intercept + slope*np.array(x), color='red',
label='Regression Line'). This line is created by evaluating the equation y = intercept +
slope*x for each x value.


PROFESSIONAL ELECTIVE COURSES: VERTICALS


VERTICAL 1: DATA SCIENCE
CCS346 EXPLORATORY DATA ANALYSIS
UNIT V - MULTIVARIATE AND TIME SERIES ANALYSIS
Introducing a Third Variable – Causal Explanations – Three-Variable Contingency Tables
and Beyond – Fundamentals of TSA – Characteristics of time series data – Data Cleaning –
Time-based indexing – Visualizing – Grouping – Resampling.

Inferential Statistics
Inferential statistics is a branch of statistics that makes the use of various analytical tools to
draw inferences about the population data from sample data. The purpose of descriptive
and inferential statistics is to analyze different types of data using different tools.
Descriptive statistics helps to describe and organize known data using charts, bar graphs,
etc., while inferential statistics aims at making inferences and generalizations about the
population data.

Descriptive statistics allow you to describe a data set, while inferential statistics allow you
to make inferences based on a data set. The samples chosen in inferential statistics need to
be representative of the entire population.

Descriptive statistics: Collecting, summarizing, and describing data.
Inferential statistics: Drawing conclusions and/or making decisions from a sample of the population.

There are two main types of inferential statistics - hypothesis testing and regression
analysis.
• Hypothesis Testing - This technique involves the use of hypothesis tests such as the
z test, f test, t test, etc. to make inferences about the population data. It requires
setting up the null hypothesis, alternative hypothesis, and testing the decision
criteria.
• Regression Analysis - Such a technique is used to check the relationship between
dependent and independent variables. The most commonly used type of regression
is linear regression.

Brief about all tests


Z-test
• Sample size is greater than or equal to 30 and the data set follows a normal
distribution.
• The population variance is known to the researcher.
o Null hypothesis: H0: μ = μ0
o Alternate hypothesis: H1: μ > μ0

T-test
• Sample size is less than 30 and the data set follows a t-distribution.
• The population variance is not known to the researcher.
o Null hypothesis: H0: μ = μ0
o Alternate hypothesis: H1: μ > μ0

F-test
• Checks whether a difference between the variances of two samples or populations
exists or not.

Choosing a statistical test:


The following flowchart allows you to choose the correct statistical test for your research
easily.

Multivariate Analysis:
Multivariate analysis refers to statistical techniques used to analyze and understand
relationships between multiple variables simultaneously. It involves exploring patterns,
dependencies, and associations among variables in a dataset. Some commonly used
multivariate analysis techniques include:
• Multivariate Regression Analysis: Extends simple linear regression to analyze the
relationship between multiple independent variables and a dependent variable.
• Principal Component Analysis (PCA): Reduces the dimensionality of a dataset by
transforming variables into a smaller set of uncorrelated variables called principal
components.
• Factor Analysis: Examines the underlying factors or latent variables that explain the
correlations among a set of observed variables.
• Cluster Analysis: Identifies groups or clusters of similar observations based on the
similarity of their attributes.
• Discriminant Analysis: Differentiates between two or more predefined groups based
on a set of predictor variables.
• Canonical Correlation Analysis: Analyzes the relationship between two sets of
variables to identify the underlying dimensions that are shared between them.

Simple Example of Multivariate Analysis


Let's consider a simple example of multivariate analysis using a dataset that includes three
variables: "age" (continuous), "income" (continuous), and "education level" (categorical:
high school, bachelor's degree, master's degree).
Objective: Determine the relationship between age, income, and education level.

Approach (a Python sketch of the correlation and regression steps appears after this outline):
Descriptive Analysis:
• Calculate descriptive statistics for each variable, including measures like mean,
median, standard deviation, and frequency distributions.
• Examine the distributions of age and income using histograms or boxplots to
identify any outliers or unusual patterns.
• Create cross-tabulations or contingency tables to explore the distribution of
education level across different age groups or income brackets.
Correlation Analysis:
• Calculate the correlation coefficients (e.g., Pearson correlation, Spearman
correlation) between age, income, and education level.
• Interpret the correlation coefficients to determine the strength and direction of the
relationships between variables.
• Visualize the relationships using a correlation matrix or a heatmap to identify any
significant associations.
Regression Analysis:
• Perform multivariate regression analysis to assess the impact of age and education
level on income.
• Set income as the dependent variable and age and education level as independent
variables.
• Interpret the regression coefficients to understand how each independent variable
influences the dependent variable.
• Assess the overall model fit and statistical significance of the regression model.
Multivariate Visualization:
• Create scatter plots or bubble plots to visualize the relationship between age,
income, and education level.
• Use different colors or symbols to represent different education levels and examine
if there are distinct patterns or trends.
Further Analysis:
• Consider additional multivariate techniques such as factor analysis or cluster
analysis to explore underlying dimensions or groups within the data.
• Conduct subgroup analyses or interaction analyses to investigate if the relationships
differ across different demographic groups or educational backgrounds.
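
Simple Python sketch: correlation and multivariate regression steps

(The dataset below is invented for illustration; the correlation step uses pandas, and the regression step fits income on age plus education dummies by least squares with NumPy.)

import numpy as np
import pandas as pd

# Hypothetical data: age, income, and education level
df = pd.DataFrame({
    "age":       [25, 32, 41, 28, 55, 47, 36, 60, 30, 44],
    "income":    [30000, 42000, 58000, 35000, 80000,
                  67000, 50000, 90000, 40000, 62000],
    "education": ["high school", "bachelor", "master", "high school", "master",
                  "bachelor", "bachelor", "master", "high school", "master"],
})

# Correlation between the continuous variables
print(df[["age", "income"]].corr())

# Multivariate regression: income ~ age + education (dummy coded)
X = pd.get_dummies(df[["age", "education"]], drop_first=True, dtype=float)
X.insert(0, "intercept", 1.0)
y = df["income"].to_numpy(dtype=float)
coefs, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y, rcond=None)
print(dict(zip(X.columns, np.round(coefs, 2))))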

Causal explanations
Causal explanations aim to understand the cause-and-effect relationships between
variables and explain why certain outcomes occur. They involve identifying the factors or

conditions that influence a particular outcome and determining the mechanisms through
which they operate.
Causal explanations are important in various fields, including social sciences,
economics, psychology, and epidemiology, among others. They help researchers
understand the fundamental drivers of phenomena and develop interventions or policies to
bring about desired outcomes.

Some key aspects and approaches to consider when seeking causal explanations:
Association vs. Causation:
It's crucial to differentiate between mere associations or correlations between variables
and actual causal relationships. Correlation does not imply causation, and establishing
causality requires rigorous evidence, such as experimental designs or well-designed
observational studies that account for potential confounding factors.

Establishing Causality:
Several criteria need to be considered when establishing causality, such as temporal
precedence (the cause precedes the effect in time), covariation (the cause and effect vary
together), and ruling out alternative explanations.

Simple Scenario:
We want to investigate whether exercise has a causal effect on weight loss. We hypothesize
that regular exercise leads to a reduction in weight.

Explanation:
To establish a causal explanation, we would need to conduct a study that meets the criteria
for establishing causality, such as a randomized controlled trial (RCT). In this hypothetical
RCT, we randomly assign participants to two groups:
• Experimental Group: Participants in this group are instructed to engage in a
structured exercise program, such as 30 minutes of moderate-intensity aerobic
exercise five times a week.
• Control Group: Participants in this group do not receive any specific exercise
instructions and maintain their usual daily activities.
The study is conducted over a period of three months, during which the weight of each
participant is measured at the beginning and end of the study. The data collected are as
follows:

Experimental Group:

• Participant 1: Weight at the beginning = 80 kg, Weight at the end = 75 kg


• Participant 2: Weight at the beginning = 75 kg, Weight at the end = 72 kg
• Participant 3: Weight at the beginning = 85 kg, Weight at the end = 80 kg
Control Group:
• Participant 4: Weight at the beginning = 82 kg, Weight at the end = 81 kg
• Participant 5: Weight at the beginning = 78 kg, Weight at the end = 78 kg
• Participant 6: Weight at the beginning = 79 kg, Weight at the end = 80 kg

Analysis:
We compare the average weight loss between the experimental and control groups.
The results show that the experimental group had an average weight loss of about 4.3 kg, while
the control group showed essentially no average change in weight. The difference in average
weight loss between the groups suggests that regular exercise has a causal effect on weight loss.
Additionally, we can use statistical tests, such as t-tests or analysis of variance
(ANOVA), to determine if the observed difference in weight loss between the groups is
statistically significant. If the p-value is below a predetermined significance level (e.g., p <
0.05), we can conclude that the difference is unlikely due to chance alone and provides
further evidence for a causal relationship.
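
Simple Python sketch: comparing the two groups with a t test

(The weight losses are computed from the six participants listed above; with such a tiny sample the numbers are purely illustrative.)

from scipy import stats

# Weight loss (kg) = weight at the beginning - weight at the end
experimental = [80 - 75, 75 - 72, 85 - 80]   # [5, 3, 5]
control      = [82 - 81, 78 - 78, 79 - 80]   # [1, 0, -1]

# Compare mean weight loss between the two groups
t_stat, p_value = stats.ttest_ind(experimental, control, equal_var=True)
print("Mean loss (experimental):", sum(experimental) / len(experimental))
print("Mean loss (control):", sum(control) / len(control))
print("t =", round(t_stat, 2), " p =", round(p_value, 4))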

Three-variable contingency tables


When exploring causal explanations, particularly in the context of three variables,
three-variable contingency tables can be utilized. These tables provide a way to examine
the relationship between three categorical variables and investigate potential causal
explanations by introducing a third variable.
A three-variable contingency table consists of rows and columns representing the
categories of three variables, and the cells of the table contain frequency counts or
proportions. It allows for the analysis of the joint distribution of three variables and helps
identify any associations or dependencies between them.

Example
Let's consider the variables "Gender" (Male/Female), "Education Level" (High
school/College/Graduate), and "Income Level" (Low/Medium/High). We want to explore if
there is an association between gender, education level, and income level.
A three-variable contingency table for this example might look like:

Education Level        Income Level
                       Low     Medium     High
High School            20      40         30
College                30      50         40
Graduate               10      20         30

From this contingency table, we can analyze the relationship between these variables. For
example:
• Conditional Relationships: We can examine the relationship between gender and
income level, conditional on education level. This can be done by comparing the
income level distribution for males and females within each education level
category.
• Marginal Relationships: We can examine the relationship between gender and
education level, and between education level and income level separately by looking
at the marginal distributions of the variables.
• Assessing Dependency: We can perform statistical tests, such as the chi-square test,
to determine if there is a statistically significant association between the variables.
This helps assess the dependency and provides insights into potential causal
explanations.
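
Simple Python sketch: chi-square test on the contingency table

(This applies scipy.stats.chi2_contingency to the Education Level by Income Level counts shown above.)

from scipy.stats import chi2_contingency

# Observed counts from the table above
observed = [
    [20, 40, 30],   # High School
    [30, 50, 40],   # College
    [10, 20, 30],   # Graduate
]

chi2, p, dof, expected = chi2_contingency(observed)
print("chi-square =", round(chi2, 2), " dof =", dof, " p =", round(p, 4))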

By analyzing the three-variable contingency table, we can gain a deeper understanding of


the relationships between the variables and explore potential causal explanations by
considering the influence of the third variable.

Crosstabs is just another name for contingency tables, which summarize the relationship
between different categorical variables. Crosstabs in SPSS can help you visualize the
proportion of cases in subgroups.
• To describe a single categorical variable, we use frequency tables.
• To describe the relationship between two categorical variables, we use a special
type of table called a cross-tabulation (or "crosstab")
o Categories of one variable determine the rows of the table
o Categories of the other variable determine the columns
o The cells of the table contain the number of times that a particular
combination of categories occurred.
A "square" crosstab is one in which the row and column variables have the same number of
categories. Tables of dimensions 2x2, 3x3, 4x4, etc. are all square crosstabs.

Example 1

Row variable: Gender (2 categories: male, female)


Column variable: Alcohol (2 categories: no, yes)
Table dimension: 2x2

Example 2

Row variable: Class Rank (4 categories: freshman, sophomore, junior, senior)


Column variable: Gender (2 categories: male, female)
Table dimension: 4x2

Example 3

Row variable: Gender (2 categories: male, female)


Column variable: Smoking (3 categories: never smoked, past smoker, current smoker)
Table dimension: 2x3

Crosstabs with Layer Variable (Third categorical variable)


To create a crosstab, click Analyze > Descriptive Statistics > Crosstabs.

• A → Row(s): One or more variables to use in the rows of the crosstab(s). You must
enter at least one Row variable.
• B →Column(s): One or more variables to use in the columns of the crosstab(s). You
must enter at least one Column variable.
• C → Layer: An optional "stratification" variable. When a layer variable is specified,
the crosstab between the Row and Column variable(s) will be created at each level
of the layer variable. You can have multiple layers of variables by specifying the first
layer variable and then clicking Next to specify the second layer variable.
• D → Statistics: Opens the Crosstabs: Statistics window, which contains fifteen
different inferential statistics for comparing categorical variables.

• E → Cells: Opens the Crosstabs: Cell Display window, which controls which output is
displayed in each cell of the crosstab.

• F → Format: Opens the Crosstabs: Table Format window, which specifies how the
rows of the table are sorted.

Working with third categorical variable


Now we work with three categorical variables: RankUpperUnder, LiveOnCampus, and
State_Residency of the student’s database.
Description
Create a crosstab of RankUpperUnder by LiveOnCampus, with variable State_Residency
acting as a strata, or layer variable.

Running the Procedure


• Using the Crosstabs Dialog Window
• Open the Crosstabs dialog (Analyze > Descriptive Statistics > Crosstabs).
• Select RankUpperUnder as the row variable, and LiveOnCampus as the column
variable.
• Select State_Residency as the layer variable.
• You may want to go back to the Cells options and turn off the row, column, and total
percentages if you have just run the previous example.
• Click OK.

Syntax
CROSSTABS
/TABLES=RankUpperUnder BY LiveOnCampus BY State_Residency
/FORMAT=AVALUE TABLES
/CELLS=COUNT
/COUNT ROUND CELL.

Output
Again, the Crosstabs output includes the boxes Case Processing Summary and the
crosstabulation itself.

Notice that after including the layer variable State Residency, the number of valid cases we
have to work with has dropped from 388 to 367. This is because the crosstab requires
nonmissing values for all three variables: row, column, and layer.

The layered crosstab shows the individual Rank by Campus tables within each level of State
Residency. Some observations we can draw from this table include:
• A slightly higher proportion of out-of-state underclassmen live on campus (30/43)
than do in-state underclassmen (110/168).
• There were about equal numbers of out-of-state upper and underclassmen; for in-
state students, the underclassmen outnumbered the upperclassmen.
• Of the nine upperclassmen living on-campus, only two were from out of state.

Time Series Analysis (TSA)


Time Series Analysis (TSA) is a statistical technique used to analyze and understand data
that is collected over time. It involves studying the patterns, trends, and characteristics of
the data to make forecasts, identify underlying factors, and make informed decisions. Here
are the key fundamentals of TSA:

Time Series Data:


• Time series data is a sequence of observations collected at regular time intervals. It
can be in the form of numerical values, counts, percentages, or categorical data.
• The data points are ordered chronologically, and each observation is associated
with a specific time stamp.

Temporal Dependencies:
• Time series data often exhibits temporal dependencies, where each observation is
influenced by previous observations.
• Understanding these dependencies is crucial for analyzing and forecasting time
series data accurately.

Components of Time Series:


Time series data can be decomposed into several components (a decomposition sketch in Python follows the list):
• Trend: The long-term movement or direction of the data.
o Example: Monthly Average Home Prices in a City
o The average home prices have been steadily increasing over the past few
years, indicating a positive trend.

• Seasonality: The recurring patterns or cycles that occur at fixed time intervals.
o Example: Monthly Sales of Ice Cream
o Sales of ice cream are higher during the summer months compared to the
rest of the year, showing a seasonal pattern.

• Cyclical Variation: Longer-term patterns that are not necessarily fixed.


o Example: Quarterly GDP Growth Rate
o The GDP growth rate exhibits alternating periods of expansion and
contraction, representing broader cyclical patterns influenced by economic
factors.

• Residuals: The random fluctuations or noise in the data.


o Example: Daily Stock Market Returns
o Sudden spikes or drops in stock market returns that cannot be attributed to
any specific trend or seasonality are considered irregular components.

• Stationarity: It refers to the statistical properties of a time series remaining constant


over time.
o A stationary time series exhibits a constant mean, variance, and
autocovariance structure.
o Stationarity is important because many time series models assume
stationarity to make reliable forecasts.
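
Simple Python sketch: decomposing a series into components

(A synthetic monthly series with a trend, a yearly seasonal pattern, and noise is decomposed with statsmodels' seasonal_decompose; the statsmodels library must be installed.)

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

np.random.seed(0)

# Synthetic monthly series: trend + seasonality + residual noise
idx = pd.date_range("2018-01-01", periods=48, freq="MS")
values = (np.linspace(100, 160, 48)                      # trend
          + 10 * np.sin(2 * np.pi * np.arange(48) / 12)  # seasonality
          + np.random.normal(0, 2, 48))                  # residuals
series = pd.Series(values, index=idx)

# Additive decomposition into trend, seasonal, and residual parts
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))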

Time Series Visualization:


• Visualizing time series data helps in understanding patterns, trends, and anomalies.
• Common visualizations include line plots, scatter plots, seasonal decomposition
plots, autocorrelation plots, and heatmaps.
o Visualizations provide insights into the data's behavior and guide further
analysis.

Time Series Analysis Techniques:


TSA employs a range of statistical techniques, including:
• Descriptive Analysis: Summarizing the data using measures such as mean, median,
standard deviation, and percentiles.
• Autocorrelation Analysis: Examining the correlation between a time series and its
lagged values to identify dependencies.
• Forecasting: Using past observations to predict future values of the time series.
• Time Series Modeling: Building mathematical models to capture the underlying
patterns and relationships in the data.
• Seasonal Adjustment: Removing the seasonal component from the data to focus on
the underlying trend and irregular components.
• TSA is applied in various domains, including finance, economics, marketing, weather
forecasting, and resource allocation. It helps in understanding historical patterns,
making future predictions, and guiding decision-making processes.

Working of Time Series Analysis


Sometimes data changes over time. This data is called time-dependent data. Given
time-dependent data, you can analyze the past to predict the future. The future prediction
will also include time as a variable, and the output will vary with time. Using time-
dependent data, you can find patterns that repeat over time.
A Time Series is a set of observations that are collected after regular intervals of
time. If plotted, the Time series would always have one of its axes as time.

Time Series Analysis collects data over time.

Data cleaning
Data cleaning is the process of identifying and correcting inaccurate records from a dataset
along with recognizing unreliable or irrelevant parts of the data.
Handling Missing Values:
• Identify missing values in the time series data.
• Decide on an appropriate method to handle missing values, such as interpolation,
forward filling, or backward filling.
• Use pandas or other libraries to fill or interpolate missing values.
Outlier Detection and Treatment:
• Identify outliers in the time series data that may be caused by measurement errors
or anomalies.
• Use statistical techniques, such as z-score or modified z-score, to detect outliers.
• Decide on the treatment of outliers, such as removing them, imputing them with a
reasonable value, or replacing them using smoothing techniques.
Handling Duplicates:
• Check for duplicate entries in the time series data.
• Remove or handle duplicate values appropriately based on the specific
requirements of the analysis.

Resampling and Frequency Conversion:


• Adjust the frequency of the time series data if needed.
• Convert the data to a lower frequency (e.g., from daily to monthly) or a higher
frequency (e.g., from monthly to daily) based on the analysis requirements.
• Use functions like resample() in pandas to perform resampling.
Addressing Non-Stationarity:
• Check for non-stationarity in the time series data, which can affect the analysis
results.
• Apply techniques like differencing to make the data stationary, which involves
computing the differences between consecutive observations.
• Use statistical tests like the Augmented Dickey-Fuller (ADF) test to test for
stationarity (a sketch appears after this list).
Handling Time Zones and Daylight Saving Time:
• Ensure that the time series data is in the correct time zone and accounts for daylight
saving time if applicable.
• Adjust the timestamps accordingly to maintain consistency in the data.
Consistent Time Intervals:
• Verify that the time series data has consistent and regular time intervals.
• Fill any gaps or irregularities in the time intervals, if necessary.
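
Simple Python sketch: testing stationarity with the ADF test

(A synthetic trending series is tested with statsmodels' adfuller before and after first differencing; the statsmodels library must be installed.)

import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller

np.random.seed(0)

# Synthetic daily series with an upward trend (non-stationary)
idx = pd.date_range("2023-01-01", periods=100, freq="D")
series = pd.Series(np.arange(100) + np.random.normal(0, 1, 100), index=idx)

# ADF test on the raw series and on the first-differenced series
for name, s in [("original", series), ("differenced", series.diff().dropna())]:
    stat, p_value, *_ = adfuller(s)
    print(name, ": ADF statistic =", round(stat, 2), " p-value =", round(p_value, 3))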

Simple python program to explain missing values in time series data


import pandas as pd
import numpy as np

# Create a sample time series dataset with missing values


data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05'],
'Value': [10, np.nan, 12, 18, np.nan]}

# Convert the dictionary to a pandas DataFrame


df = pd.DataFrame(data)

# Convert the 'Date' column to datetime format


df['Date'] = pd.to_datetime(df['Date'])

# Set the 'Date' column as the index


df.set_index('Date', inplace=True)

# Handling missing values


df_filled = df.fillna(method='ffill') # Forward fill missing values
df_interpolated = df.interpolate() # Interpolate missing values

# Print the original and cleaned time series data


print("Original Data:\n", df)

print("\nForward Filled Data:\n", df_filled)


print("\nInterpolated Data:\n", df_interpolated)

Output
Original Data:
Value
Date
2021-01-01 10.0
2021-01-02 NaN
2021-01-03 12.0
2021-01-04 18.0
2021-01-05 NaN

Forward Filled Data:


Value
Date
2021-01-01 10.0
2021-01-02 10.0
2021-01-03 12.0
2021-01-04 18.0
2021-01-05 18.0

Interpolated Data:
Value
Date
2021-01-01 10.0
2021-01-02 11.0
2021-01-03 12.0
2021-01-04 18.0
2021-01-05 18.0

Simple python program to detect outliers in time series data


import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Create a sample time series dataset with outliers


data = {'Date': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04',
'2021-01-05'],
'Value': [10, 15, 12, 100, 20]}

# Convert the dictionary to a pandas DataFrame


df = pd.DataFrame(data)

# Convert the 'Date' column to datetime format


df['Date'] = pd.to_datetime(df['Date'])

# Set the 'Date' column as the index


df.set_index('Date', inplace=True)

# Detect outliers using a threshold


threshold = 30 # Set the threshold for outlier detection
outliers = df[df['Value'] > threshold]

# Remove outliers
df_cleaned = df[df['Value'] <= threshold]

# Plot the original and cleaned time series data with outliers highlighted
plt.figure(figsize=(8, 4))
plt.plot(df.index, df['Value'], label='Original', color='blue')
plt.scatter(outliers.index, outliers['Value'], color='red', label='Outliers')
plt.plot(df_cleaned.index, df_cleaned['Value'], label='Cleaned',
color='green')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Outlier Detection and Treatment')
plt.legend()
plt.show()

# Print the original and cleaned time series data


print("Original Data:\n", df)
print("\nCleaned Data:\n", df_cleaned)

Output

Original Data:
Value
Date

2021-01-01 10
2021-01-02 15
2021-01-03 12
2021-01-04 100
2021-01-05 20

Cleaned Data:
Value
Date
2021-01-01 10
2021-01-02 15
2021-01-03 12
2021-01-05 20

Time-based indexing
Time-based indexing refers to the process of organizing and accessing data based on
timestamps or time intervals. It involves assigning timestamps to data records or events
and utilizing these timestamps to efficiently retrieve and manipulate the data.
In time-based indexing, each data record or event is associated with a timestamp
indicating when it occurred. The timestamps can be precise points in time or time intervals,
depending on the granularity required for the application. The data is then organized and
indexed based on these timestamps, enabling quick and efficient access to specific time
ranges or individual timestamps.
Time-based indexing is commonly used in various domains that involve time-series
data or events, such as financial markets, scientific research, IoT (Internet of Things)
applications, system monitoring, and social media analysis.

In the context of TSA (Time Series Analysis), time-based indexing refers to the practice of
organizing and accessing time-series data based on the timestamps associated with each
observation. TSA involves analyzing and modeling data that is collected over time, and
time-based indexing plays a crucial role in effectively working with such data.

Time-based indexing allows for efficient retrieval and manipulation of time-series data,
enabling various operations such as subsetting, filtering, and aggregation based on specific
time periods or intervals.

In TSA, time-based indexing is typically implemented using specialized data structures or


libraries that provide functionality for working with time-series data. Some popular
libraries for time-based indexing and analysis in Python include:

• Pandas: Pandas provides the DateTimeIndex object, which allows for indexing and
manipulation of time-series data. It offers a wide range of time-based operations,
such as slicing by specific time periods, resampling at different frequencies, and
handling missing or irregular timestamps.

• Statsmodels: Statsmodels is a Python library that includes extensive functionality for


time-series analysis. It provides time-series models and statistical tools for various
types of time-based data analysis. It works well with Pandas' DateTimeIndex and
provides methods for model estimation, forecasting, and diagnostics.

• NumPy: Although not specifically designed for time-series analysis, NumPy, a


fundamental library for numerical computations in Python, can be used for time-
based indexing. NumPy's array indexing and slicing capabilities, combined with
timestamps represented as numerical values, allow for efficient retrieval and
manipulation of time-series data.

Time-based indexing in TSA is essential for conducting exploratory data analysis, fitting
time-series models, forecasting future values, and evaluating model performance.

Time-based indexing operations


Here are some common time-based indexing operations:

Slicing: Slicing involves retrieving a subset of data within a specific time range. With time-
based indexing, you can easily slice the time-series data based on specific dates, times, or
time intervals.
Example:
# Retrieve data between two specific dates
subset = df['2023-01-01':'2023-03-31']

Simple Python Program to Explain Time Series Slicing


import pandas as pd

# Create a sample time-series dataset


data = {
'timestamp': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
'value': [10, 15, 12, 18]
}

df = pd.DataFrame(data)

# Convert 'timestamp' column to datetime



df['timestamp'] = pd.to_datetime(df['timestamp'])

# Set 'timestamp' as the index


df.set_index('timestamp', inplace=True)

# Retrieve data between two specific dates


subset = df['2023-01-02':'2023-01-03']
print(subset)

Output
value
timestamp
2023-01-02 15
2023-01-03 12

Resampling: Resampling involves changing the frequency of the time-series data. You can
upsample (increase frequency) or downsample (decrease frequency) the data to different
time intervals, such as aggregating hourly data to daily data or converting daily data to
monthly data.
Example:
# Resample data to monthly frequency
monthly_data = df.resample('M').mean()

Simple Python Program to Explain Resampling


import pandas as pd

# Create a sample time-series dataset


data = {
'timestamp': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
'value': [10, 15, 12, 18]
}

df = pd.DataFrame(data)

# Convert 'timestamp' column to datetime


df['timestamp'] = pd.to_datetime(df['timestamp'])

# Set 'timestamp' as the index


df.set_index('timestamp', inplace=True)

# Resample data to monthly frequency


monthly_data = df.resample('M').mean()
print(monthly_data)

Output
value
timestamp
2023-01-31 13.75

Shifting: Shifting involves moving the timestamps of the data forwards or backwards by a
specified number of time units. This operation is useful for calculating time differences or
creating lagged variables.

Example:
# Shift the data one day forward
shifted_data = df.shift(1, freq='D')

Simple Python Program to Explain Shifting


import pandas as pd

# Create a sample time-series dataset


data = {
'timestamp': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
'value': [10, 15, 12, 18]
}

df = pd.DataFrame(data)

# Convert 'timestamp' column to datetime


df['timestamp'] = pd.to_datetime(df['timestamp'])

# Set 'timestamp' as the index


df.set_index('timestamp', inplace=True)

# Shift the data one day forward


shifted_data = df.shift(1, freq='D')
print(shifted_data)

Output
value
timestamp
2023-01-02 10
2023-01-03 15
2023-01-04 12
2023-01-05 18

Rolling Windows: Rolling windows involve calculating statistics over a moving window of
data. It allows for analyzing trends or patterns in a time-series by considering a fixed-size
window of observations.
Example:
# Calculate the rolling average over a 7-day window
rolling_avg = df['value'].rolling(window=7).mean()

Simple Python Program to Explain Rolling Windows

import pandas as pd

# Create a sample time-series dataset


data = {
'timestamp': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
'value': [10, 15, 12, 18]
}

df = pd.DataFrame(data)

# Convert 'timestamp' column to datetime


df['timestamp'] = pd.to_datetime(df['timestamp'])

# Set 'timestamp' as the index


df.set_index('timestamp', inplace=True)

# Calculate the rolling average over a 2-day window


rolling_avg = df['value'].rolling(window=2).mean()
print(rolling_avg)

Output
timestamp
2023-01-01 NaN
2023-01-02 12.5
2023-01-03 13.5
2023-01-04 15.0
Name: value, dtype: float64

Grouping and Aggregation: Grouping and aggregation operations involve grouping the
time-series data based on specific time periods (e.g., days, weeks, months) and performing
calculations on each group, such as calculating the sum, mean, or maximum value.

Example:
# Calculate the sum of values for each month
monthly_sum = df.groupby(pd.Grouper(freq='M')).sum()

Simple Python Program to Explain Grouping and Aggregation


import pandas as pd

# Create a sample time-series dataset


data = {
'timestamp': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
'value': [10, 15, 12, 18]
}

df = pd.DataFrame(data)

# Convert 'timestamp' column to datetime


df['timestamp'] = pd.to_datetime(df['timestamp'])

# Set 'timestamp' as the index


df.set_index('timestamp', inplace=True)

# Calculate the sum of values for each month


monthly_sum = df.groupby(pd.Grouper(freq='M')).sum()
print(monthly_sum)

Output
value
timestamp
2023-01-31 55

Time based indexing using pandas


A pandas DataFrame or Series with a time-based index is called a time series. The values
stored in the series can be anything that fits inside the container; the date or time values in
the index are merely used to retrieve them. In pandas, a time series container may be
altered in a variety of ways.

Simple Example for time-based indexing


import pandas as pd

# Create a sample time-series dataset


data = {
'timestamp': ['2023-01-01', '2023-02-01', '2023-03-01', '2023-04-01'],
'value': [10, 15, 12, 18]
}

df = pd.DataFrame(data)

# Convert 'timestamp' column to datetime


df['timestamp'] = pd.to_datetime(df['timestamp'])

# Set 'timestamp' as the index


df.set_index('timestamp', inplace=True)

# Access data for a specific time period


subset = df['2023-02-01':'2023-03-01']
print(subset)

# Resample the data to a different frequency


resampled_data = df.resample('1M').mean()
print(resampled_data)

Output

value
timestamp
2023-02-01 15
2023-03-01 12
value
timestamp
2023-01-31 10.0
2023-02-28 15.0
2023-03-31 12.0
2023-04-30 18.0

In the example above, we start by creating a DataFrame with a 'timestamp' column


and a corresponding 'value' column. We convert the 'timestamp' column to a datetime data
type using pd.to_datetime(). Next, we set the 'timestamp' column as the index using
set_index().
We then demonstrate two common time-based indexing operations. First, we access
a subset of the data for a specific time period using time-based slicing with
df['start_date':'end_date']. In this case, we retrieve the data between February 1, 2023, and
March 1, 2023.
Next, we showcase resampling the data to a different frequency using resample().
We specify '1M' as the frequency, indicating monthly resampling. The data is resampled by
taking the mean value for each month.

Visualizing Time Series Data


Line graph
A line graph is a common and effective way to visualize time series data. It displays data points connected by straight lines, allowing you to observe the trend and changes in values over time.
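
Below is a minimal line-graph sketch using matplotlib; the dates and values are made-up illustration data in the same style as the earlier examples.

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative time-series data (assumed values, not from a real dataset)
data = {
    'timestamp': ['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04'],
    'value': [10, 15, 12, 18]
}
df = pd.DataFrame(data)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.set_index('timestamp', inplace=True)

# Plot the values against the datetime index as a connected line
plt.plot(df.index, df['value'], marker='o')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Line graph of a time series')
plt.show()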

Stacked Area Chart


A stacked area chart, also known as a stacked area graph, is a type of graph used to
visualize the cumulative contribution or proportion of multiple variables over time. It
displays multiple series as stacked areas, where the height of each area represents the
value of a particular variable at a given time. The areas are stacked on top of each other,
illustrating how the variables contribute to the total value or the overall trend.
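
Below is a minimal stacked area chart sketch using matplotlib's stackplot(); the two series 'product_A' and 'product_B' are hypothetical values used only for illustration.

import pandas as pd
import matplotlib.pyplot as plt

# Illustrative data: two series observed over the same dates (assumed values)
dates = pd.date_range(start='2023-01-01', periods=4, freq='D')
product_A = [10, 15, 12, 18]
product_B = [5, 7, 9, 6]

# stackplot() draws each series as an area stacked on top of the previous one
plt.stackplot(dates, product_A, product_B, labels=['Product A', 'Product B'])
plt.legend(loc='upper left')
plt.xlabel('Date')
plt.ylabel('Value')
plt.title('Stacked area chart')
plt.show()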

Bar Charts
Bar charts, also known as bar graphs or column charts, are a type of graph that uses
rectangular bars to represent data. They are widely used for visualizing categorical or
discrete data, where each category is represented by a separate bar. Bar charts are effective
in displaying comparisons between different categories or showing the distribution of a
single variable across different groups. When used with time series data, each bar typically represents a time period, and the length of the bar is proportional to the value of the variable in that period.
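
Below is a minimal bar chart sketch using matplotlib; the monthly totals are assumed values used only for illustration.

import matplotlib.pyplot as plt

# Illustrative monthly totals (assumed values)
months = ['Jan', 'Feb', 'Mar', 'Apr']
totals = [55, 48, 62, 70]

# One bar per month; the bar height is proportional to the total for that month
plt.bar(months, totals, color='steelblue')
plt.xlabel('Month')
plt.ylabel('Total value')
plt.title('Bar chart of monthly totals')
plt.show()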

Gantt chart
A Gantt chart is a type of bar chart that is commonly used in project management to
visually represent project schedules and tasks over time. It provides a graphical
representation of the project timeline, showing the start and end dates of tasks, as well as
their duration and dependencies.
The key features of a Gantt chart are as follows:
• Task Bars: Each task is represented by a horizontal bar on the chart. The length of
the bar indicates the duration of the task, and its position on the chart indicates the
start and end dates.
• Timeline: The horizontal axis of the chart represents the project timeline, typically
displayed in increments of days, weeks, or months. It allows for easy visualization of
the project duration and scheduling.
• Dependencies: Gantt charts often include arrows or lines between tasks to represent
dependencies or relationships between them. This helps to visualize the order in
which tasks need to be completed and identify any critical paths or potential
bottlenecks.
• Milestones: Milestones are significant events or achievements within a project. They
are typically represented by diamond-shaped markers on the chart to indicate
important deadlines or deliverables.

Simple Python Program to Explain Gantt Chart


import matplotlib.pyplot as plt

def create_gantt_chart(tasks):
    fig, ax = plt.subplots()

    # Set y-axis limits
    ax.set_ylim(0, 10)

    # Set x-axis limits and labels
    ax.set_xlim(0, 30)
    ax.set_xlabel('Time')
    ax.set_ylabel('Tasks')

    # Plot the tasks as horizontal bars
    for task in tasks:
        start = task['start']
        end = task['end']
        y = task['task_id']
        ax.barh(y, end - start, left=start, height=0.5,
                align='center', color='red')

    # Set the y-ticks and labels
    y_ticks = [task['task_id'] for task in tasks]
    y_labels = [task['name'] for task in tasks]
    ax.set_yticks(y_ticks)
    ax.set_yticklabels(y_labels)

    # Display the Gantt chart
    plt.show()

# Example tasks data
tasks = [
    {'name': 'Task 1', 'start': 5, 'end': 15, 'task_id': 1},
    {'name': 'Task 2', 'start': 10, 'end': 20, 'task_id': 2},
    {'name': 'Task 3', 'start': 15, 'end': 25, 'task_id': 3},
    {'name': 'Task 4', 'start': 20, 'end': 30, 'task_id': 4}
]

# Create the Gantt chart
create_gantt_chart(tasks)

Output: a Gantt chart figure is displayed, with one horizontal bar for each of the four tasks.

Stream graph
A stream graph is a variation of a stacked area chart that displays changes in data over time for different categories through the use of flowing, organic shapes that create an aesthetic river/stream appearance. Unlike the stacked area chart, which plots data over a fixed, straight baseline, the stream graph displaces values around a varying central baseline.

Each individual stream shape in the stream graph is proportional to the values of its category. Color can be used either to distinguish each category or to visualize each category's additional quantitative values by varying the color shade.
Making a Stream Graph with Python
For this example we will use Altair, a graphing library for Python. Altair is a declarative statistical visualization library based on Vega and Vega-Lite, and its source code is available on GitHub.
To begin creating our stream graph, we first need to install Altair and vega_datasets.

!pip install altair
!pip install vega_datasets

Now, let's use Altair and the vega_datasets package to create an interactive stream graph of unemployment data across multiple industries over a span of about ten years.

import altair as alt


from vega_datasets import data

source = data.unemployment_across_industries.url

alt.Chart(source).mark_area().encode(
alt.X('yearmonth(date):T',
axis=alt.Axis(format='%Y', domain=False, tickSize=0)
),
alt.Y('sum(count):Q', stack='center', axis=None),
alt.Color('series:N',
scale=alt.Scale(scheme='category20b')
)
).interactive()

Output: an interactive stream graph is rendered, with one colored stream per industry series.

Heat map
A heat map is a graphical representation of data where individual values are
represented as colors. It is typically used to visualize the density or intensity of a particular
phenomenon over a geographic area or a grid of cells.
In a heat map, each data point is assigned a color based on its value or frequency.
Typically, a gradient of colors is used, ranging from cooler colors (such as blue or green) to
warmer colors (such as yellow or red). The colors indicate the magnitude of the data, with
darker or more intense colors representing higher values and lighter or less intense colors
representing lower values.
Heat maps are commonly used in various fields, including data analysis, statistics,
finance, marketing, and geographic information systems (GIS). They can provide insights
into patterns, trends, or anomalies in the data by visually highlighting areas of higher or
lower concentration.

import numpy as np
import matplotlib.pyplot as plt
# Generate random data
data = np.random.rand(10, 10)
# Create heatmap
plt.imshow(data, cmap='hot', interpolation='nearest')
# Add color bar
plt.colorbar()
# Show the plot
plt.show()
Output: a 10x10 heat map of the random values is displayed as a matplotlib figure, with a color bar.

Grouping
Grouping time series data involves dividing it into distinct groups based on certain criteria.
This grouping can be useful for performing calculations, aggregations, or analyses on
specific subsets of the data. In Python, you can use the pandas library to perform grouping
operations on time series data. Here's an example of how to group time series data using
pandas:
Simple Python program to explain Grouping
import pandas as pd

# Create a sample DataFrame with a datetime index


data = {'date': pd.date_range(start='2022-01-01', periods=100,
freq='D'),
'category': ['A', 'B', 'A', 'B'] * 25,
'value': range(100)}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)

# Grouping by category and calculating the sum


grouped_df = df.groupby('category').sum()

# Grouping by month and calculating the mean


monthly_df = df.groupby(pd.Grouper(freq='M')).mean()

# Print the grouped DataFrames


print("Grouped DataFrame (by category):")
print(grouped_df)

print("\nMonthly DataFrame (mean per month):")


print(monthly_df)

Output
Grouped DataFrame (by category):
value
category
A 2450
B 2500

Monthly DataFrame (mean per month):
value
date
2022-01-31 15.0
2022-02-28 44.5
2022-03-31 74.0
2022-04-30 94.5

In this example, we first create a sample DataFrame df with a datetime index. The
DataFrame contains a 'value' column ranging from 0 to 99 and a 'category' column with
two distinct categories 'A' and 'B'.

We then proceed with the grouping operations:


• To group by a specific column, such as 'category', we use the groupby() function,
specifying the column name as the argument. In this case, we group the data by the
'category' column and calculate the sum for each group using .sum().
• To group by a specific time period, such as month, we can use the groupby()
function with pd.Grouper(freq='M'). This creates a grouper object that can be used
to group the data by the desired frequency. In this case, we group the data by month
and calculate the mean value for each month using .mean().

Resampling
Resampling time series data involves grouping the data into different time intervals
and aggregating or summarizing the values within each interval. This process is useful
when you want to change the frequency or granularity of the data or when you need to
perform calculations over specific time intervals. There are two common methods for
resampling time series data: upsampling and downsampling.

Downsampling: Downsampling involves reducing the frequency of the data by grouping it into larger time intervals. This is typically done by aggregating or summarizing the data within each interval. Some common downsampling methods include:
• Mean/Median: Calculate the mean or median value within each interval.
• Sum: Calculate the sum of values within each interval.
• Min/Max: Determine the minimum or maximum value within each interval.
• Resample Method: Use specialized resampling methods like interpolation or
forward/backward filling to estimate values within each interval.
Here's an example of downsampling time series data using the resample() function in
pandas:
import pandas as pd

# Assuming 'df' is your DataFrame with a datetime index


downsampled_df = df.resample('D').mean()

In this example, the data is being downsampled to daily frequency, and the mean value
within each day is calculated.

Upsampling: Upsampling involves increasing the frequency of the data by grouping it into
smaller time intervals. This may require filling in missing values or interpolating to
estimate values within the new intervals. Some common upsampling methods include:
• Forward/Backward Filling: Propagate the last known value forward or backward to
fill missing values within each interval.
• Interpolation: Use interpolation methods like linear, polynomial, or spline
interpolation to estimate values within each interval.
• Resample Method: Utilize specialized resampling methods to estimate values within
each interval.
Here's an example of upsampling time series data using the resample() function in pandas:

import pandas as pd

# Assuming 'df' is your DataFrame with a datetime index


upsampled_df = df.resample('H').interpolate()

In this example, the data is being upsampled to hourly frequency, and missing values are
interpolated using the interpolate() function.

Simple python program to explain upsampling and Downsampling


import pandas as pd

# Create a sample DataFrame with a datetime index


data = {'date': pd.date_range(start='2022-01-01', periods=100, freq='D'),
'value': range(100)}
df = pd.DataFrame(data)
df.set_index('date', inplace=True)

# Downsampling to monthly frequency


downsampled_df = df.resample('M').mean()

# Upsampling to hourly frequency


upsampled_df = df.resample('H').interpolate()

# Print the downsampled and upsampled DataFrames


print("Downsampled DataFrame:")
print(downsampled_df.head())

print("\nUpsampled DataFrame:")
print(upsampled_df.head())

Output
Downsampled DataFrame:
value
date
2022-01-31 15.0
2022-02-28 44.5
2022-03-31 74.0
2022-04-30 94.5

Upsampled DataFrame:
value
date
2022-01-01 00:00:00 0.000000
2022-01-01 01:00:00 0.041667
2022-01-01 02:00:00 0.083333
2022-01-01 03:00:00 0.125000
2022-01-01 04:00:00 0.166667

In this example, we first create a sample DataFrame df with a datetime index. The
DataFrame contains a 'value' column ranging from 0 to 99, with a daily frequency for the
'date' index.

We then proceed with the resampling:


• For downsampling, we use the resample() function with the argument 'M' to
downsample the data to monthly frequency. In this case, we calculate the mean
value for each month using .mean().
• For upsampling, we use the resample() function with the argument 'H' to upsample
the data to hourly frequency. We use .interpolate() to fill in the missing values
within each hour using interpolation.
