Topic_3_Basic_statistics
ISE291
Table of Contents
1 Data and Variables
1.1 Learning Outcomes
1.2 Tabular Data
1.3 Field Types
2 Field Segregation
2.0.1 Example: Variables/Fields
3 Frequency Distribution
3.1 Histogram
3.1.1 Histograms of Non-numerical Data
3.1.2 Histograms of Numerical Data
3.2 Pie-Charts
3.2.1 Example: Hands on Histogram and Pie-Chart
3.3 Normal Distribution
4 Measures of Centrality
4.1 Mean
4.2 Median
4.3 Mode
4.3.1 Example: Calculate the mean, median and mode
5 Dispersion of a Distribution
5.1 Range
5.2 Interquartile Range
5.3 Variance
5.4 Standard Deviation
5.4.1 Example: Measures of Dispersion
5.5 Box Plots
5.6 Comparing Box Plots
5.6.1 Example: Box Plots
6 Comparing Distributions
6.1 Hypothesis Testing
6.2 P-Value
6.3 Typical Tests
6.3.1 Hands on Hypothesis Testing
7 References:
7.1 Theory:
Learning Outcomes
1. Outline the typical terminology used in data science.
2. Identify data distributions.
3. Compare data distributions.
Tabular Data
Before we process or analyze data, we have to capture and represent it using variables.
A variable is a label (field, header, or title) that we give to our data.
Consider the following data in tabular form:
The column headings Faculty Name, Area of Specialization, Experience in yrs, International Faculty and # Assignments are the labels or fields.
Note: There are many formats in which data can be represented; in this course we will only consider the tabular format (i.e., the data will have rows corresponding to records and columns corresponding to fields).
Field Types
Some of the common field types that we often come across in tabular data are:
1. Numerical Data: these are numerical values of the field, where the value has a meaning. For example, in the previous table, Experience in yrs is numerical data. A value of 10 in the second row implies the corresponding faculty has 10 years of experience.
2. Categorical Data: these are non-numerical field values, where the value represents a category without any order. For example, in the previous table, International Faculty is categorical data. A value of 'Yes' in the first row implies the corresponding faculty is a foreign faculty. This column has two categories. Similarly, in the previous table, the column Area of Specialization is categorical data, with 3 different categories.
A. Nominal Data: sometimes categorical data is represented using numbers for convenience. Such a representation is called nominal data. For example, in the International Faculty column, one could replace 'Yes' with 1 and 'No' with 0, or vice-versa. These numbers do not carry meaningful mathematical or statistical insights; they are just a representation for convenience (a small sketch of such an encoding is given after the note below).
B. Ordinal Data: these are non-numerical field values, where the value represents a category with an order. For example, in the previous table, # Assignments is ordinal data. A value of 'High' in the second row implies the corresponding faculty assigns many homework assignments.
Note: Nominal data should not be considered as numeric data. Generally, categorical and nominal data are synonyms, but in this course we restrict the definition of nominal data to numerical representation
for convenience.
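A small sketch of such a nominal encoding, applied to a hypothetical International Faculty column using plain Python (the column values are assumed for illustration):

# Hypothetical categorical column (values assumed for illustration)
international_faculty = ['Yes', 'No', 'Yes', 'Yes', 'No']

# Nominal representation: replace each category with a number purely for convenience
encoding = {'Yes': 1, 'No': 0}
international_nominal = [encoding[v] for v in international_faculty]

print(international_nominal)  # [1, 0, 1, 1, 0]
# These 0/1 codes carry no mathematical meaning; e.g., their mean is not a useful statistic.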
Field Segregation
Not all the fields in a given data table are the same. They can be classified or segregated as:
1. Independent Variables: A variable that is thought to be controlled or not affected by other variables is called an independent variable. Typically, it is also known as an attribute, control, explanatory, regressor, input, predictor, observed, feature, or field variable.
2. Dependent Variables: A variable that depends on other variables (and may be of interest in the analysis). It is also known as a predicted, explained, target, response, output/outcome, or label variable.
3. Auxiliary Variables: These are the variables that provide meta information about the data, and may not be helpful (or may not be meant) to be used in the analysis.
Example: Variables/Fields
Question-A: Consider the following table:
2. Identify the variable/field types. The following table gives the required information:
Field    Type
Stu ID   Nominal
Score    Numeric
Subj     Categorical
GPA      Numeric
Honor    Nominal
Grade    Ordinal
Frequency Distribution
A graph showing how many times each value occurs in a variable.
Histogram
1. Plot the values of the observations on one of the axes (typically the x-axis). The values can be ordered (if applicable).
2. Plot bars perpendicular to that axis.
3. The height of each bar shows how many times the corresponding value occurred in the dataset (a minimal code sketch follows this list).
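A minimal sketch of these steps for numerical data, using the Score values that appear later in this topic; the number of bins is an illustrative assumption:

import matplotlib.pyplot as plt

# Numerical data: scores of 25 students (the same values used later in this topic)
Score = [75, 81, 95, 85, 81, 82, 97, 89, 76, 100, 77, 79, 87,
         87, 79, 75, 100, 82, 83, 98, 86, 87, 71, 78, 81]

plt.figure(figsize=(5, 5))
plt.hist(Score, bins=6, edgecolor='black')  # bins=6 is an arbitrary illustrative choice
plt.xlabel('Score')      # values of the observations on the x-axis
plt.ylabel('Frequency')  # bar height = number of values falling in each bin
plt.show()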
Pie-Charts
1. Pie-charts extend the visualization offered by histograms.
2. Typically, they are useful for categorical data.
3. Unlike histograms, pie-charts display the proportion/percentage of occurrence of each category.
4. The idea is to represent each proportion as a sector of a circle, where the length of the arc represents the proportion.
5. When we depict all the categories, the pie chart will be a full circle. Otherwise, it could be an incomplete circle.
Let us build the pie chart for the non-numerical (Subj) data given in the following table; a code sketch for drawing it follows the counting code below.
Stu ID 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Subj M P M P P P P P M M P P P M P M M C C M C P P P M
Subj = ['M', 'P', 'M', 'P', 'P', 'P', 'P', 'P', 'M', 'M', 'P', 'P',
'P', 'M', 'P', 'M', 'M', 'C', 'C', 'M', 'C', 'P', 'P', 'P', 'M']
Score = [ 75, 81, 95, 85, 81, 82, 97, 89, 76, 100, 77, 79, 87,
87, 79, 75, 100, 82, 83, 98, 86, 87, 71, 78, 81]
import matplotlib.pyplot as plt

plt.figure(figsize=(5,5))
plt.hist(Subj)   # bar height = count of each subject category
plt.show()
import numpy as np

subj_arr = np.array(Subj)
# Category labels and how many times each occurs
subj_labels, subj_counts = np.unique(subj_arr, return_counts=True)
print(subj_labels, subj_counts)

## To get the count of a single category
# print(Subj.count('M'))
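A minimal sketch that draws the pie chart from the labels and counts computed above with np.unique (the percentage format string is an illustrative choice):

# Draw a pie chart of the subject proportions; each arc length reflects a category's proportion
plt.figure(figsize=(5, 5))
plt.pie(subj_counts, labels=subj_labels, autopct='%1.1f%%')
plt.title('Proportion of students per subject')
plt.show()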
Normal Distribution
1. A histogram depicts the distribution of the variable.
2. A variable's distribution may not be easy to estimate.
3. However, it can be easy to check if it is closer to a particular distribution.
4. Among standard distributions, Normal Distribution is a prominent distribution.
5. Many statistical theories are based on the assumption of normal distribution.
So, let us see what normal distribution is, and how to know if a variable's distribution is closer to normal distribution.
Normal Distribution:
1. In an ideal world, data would be distributed symmetrically around the center of all the data.
2. If we draw a vertical line through the center of a distribution, both sides should look the same.
3. The so-called normal distribution is characterized by a bell-shaped curve.
Note: the actual process to measure kurtosis and skew will be skipped for now.
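As a rough visual check (a sketch, not a formal test), one can overlay a normal curve that uses the sample's mean and standard deviation on a density-scaled histogram of the Score data; a close match suggests the distribution is near-normal. A formal check (the Shapiro-Wilk test) appears in the Typical Tests section below.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

Score = np.array([75, 81, 95, 85, 81, 82, 97, 89, 76, 100, 77, 79, 87,
                  87, 79, 75, 100, 82, 83, 98, 86, 87, 71, 78, 81])

plt.hist(Score, bins=6, density=True, alpha=0.6)           # histogram scaled to a density
x = np.linspace(Score.min(), Score.max(), 200)
plt.plot(x, stats.norm.pdf(x, Score.mean(), Score.std()))  # bell curve with sample mean/std
plt.xlabel('Score')
plt.show()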
Measures of Centrality
How to estimate the center of a distribution.
Mean
1. Mean is commonly referred to as the average, though the two are not exactly synonyms.
2. Mean is often used to measure the central tendency of continuous data as well as discrete data.
3. If $x_1, x_2, \ldots, x_n$ are the values, then the mean $\bar{x}$ is calculated as:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Median
1. Median is the middle value of the data after the values have been sorted.
2. When the data has an even number of values, the median is calculated as the average of the two middle values.
3. Typically, the median is less susceptible to the presence of outliers (compared to the mean).
Mode
1. Mode is the most frequently occurring value in a dataset.
2. Typically, mode is used for non-numerical data.
3. On a histogram, the category of the highest bar denotes the mode of the data.
Example: Calculate the mean, median and mode
Consider the scores of 25 students (Stu ID 1 to 25); the Score values are listed in the Python cell below.
1. Calculate the mean, median and mode of the data (by hand).
2. Calculate the above measures using Python.
1. Calculate the mean, median and mode of the data (by hand).

Mean:
$$\bar{x} = \frac{75 + 81 + 95 + 85 + 81 + 82 + 97 + 89 + 76 + 100 + 77 + 79 + 87 + 87 + 79 + 75 + 100 + 82 + 83 + 98 + 86 + 87 + 71 + 78 + 81}{25} = 84.44$$
Median: (sorted in ascending order)
Score_sorted = [71, 75, 75, 76, 77, 78, 79, 79, 81, 81, 81, 82, 82, 83, 85, 86, 87, 87, 87, 89, 95, 97, 98, 100, 100]
Since there are 25 values, the median is the 13th value, which is 82.
Mode: (occurrence of each value)
Value       71  75  76  77  78  79  81  82  83  85  86  87  89  95  97  98  100
Occurrence   1   2   1   1   1   2   3   2   1   1   1   3   1   1   1   1    2
From the above table, we can see that the mode is not unique; the values are 81 and 87.
import numpy as np
from scipy import stats

Score = [75, 81, 95, 85, 81, 82, 97, 89, 76, 100, 77, 79, 87, 87, 79, 75, 100, 82, 83, 98, 86, 87, 71, 78, 81]
Score = np.array(Score)
# Note: scipy.stats.mode reports only one mode (the smallest) when there are ties
print(f'The mean is {Score.mean()}, the median is {np.median(Score)}, and the mode is {stats.mode(Score)[0]}.')
The mean is 84.44, the median is 82.0, and the mode is [81].
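Since scipy.stats.mode reports only a single value when there are ties, a small sketch using the standard-library statistics.multimode (Python 3.8+) recovers both modes found by hand:

import statistics

# multimode returns every value that occurs with the highest frequency
print(statistics.multimode(Score.tolist()))  # [81, 87]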
Dispersion of a Distribution
How to measure the spread of a distribution.
1. A measure of 'center' may not be enough to understand the actual shape of a distribution.
2. The spread or dispersion of a distribution can be measured using the following typical measures: Range, Interquartile Range, Variance and Standard Deviation.
Range
1. It is the difference between the largest value and the smallest value.
2. It is susceptible to outliers.
Interquartile Range
1. It is the range obtained after removing extreme values.
2. One convention is to cut off the top and bottom one-quarter of the data and calculate the range of the remaining middle 50% of the values.
3. The cut-off value below which the bottom one-quarter of the data lies is known as the lower or first quartile (25th percentile). It is denoted as Q1.
4. The cut-off value above which the top one-quarter of the data lies is known as the upper or third quartile (75th percentile). It is denoted as Q3.
5. The interquartile range (IQR) is then calculated as:
IQR = Q3 − Q1
Variance
1. It is a measure used to indicate how spread out the data points are.
2. If the individual observations vary greatly from the group mean, then the variance is big; and vice versa.
3. The variance of the population is defined by the following formula:
$$\sigma^2 = \frac{\sum (X_i - \bar{X})^2}{N}$$
where $\sigma^2$ is the population variance, $\bar{X}$ is the population mean, $X_i$ is the $i$th element from the population, and $N$ is the number of elements in the population.
4. The variance of the sample is defined by the following formula:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$
Standard Deviation
1. It is the square root of the variance.
2. It is computed as:
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$
1. The advantage: the units of the standard deviation are the same as the units of the data.
2. The above is not true for the variance.
Example: Measures of Dispersion
Consider the same Score data of the 25 students (Stu ID 1 to 25) used in the previous example.
1. Calculate the Range, Variance, and Standard deviation of the data (by hand).
2. Calculate the above measures using Python.
3. Calculate the interquartile range.
1. Calculate the Range, Sample Variance, and corresponding Standard deviation of the data (by hand).
Range: The max value in the data is 100. The min value in the data is 71.
Thus, the range is 100-71 = 29.
Stu ID: 1 to 25 (same order as the Score list above)
$(x_i - \bar{x})$: -9.44, -3.44, 10.56, 0.56, -3.44, -2.44, 12.56, 4.56, -8.44, 15.56, -7.44, -5.44, 2.56, 2.56, -5.44, -9.44, 15.56, -2.44, -1.44, 13.56, 1.56, 2.56, -13.44, -6.44, -3.44
$(x_i - \bar{x})^2$: 89.1136, 11.8336, 111.5136, 0.3136, 11.8336, 5.9536, 157.7536, 20.7936, 71.2336, 242.1136, 55.3536, 29.5936, 6.5536, 6.5536, 29.5936, 89.1136, 242.1136, 5.9536, 2.0736, 183.8736, 2.4336, 6.5536, 180.6336, 41.4736, 11.8336
The sum of the elements of the last row is 1616.16; dividing it by n − 1 = 24 gives the sample variance, which is equal to 67.34.
Standard deviation: It is the square root of variance, and for the given data it is equal to 8.21.
Score = [75, 81, 95, 85, 81, 82, 97, 89, 76, 100, 77, 79, 87, 87, 79, 75, 100, 82, 83, 98, 86, 87, 71, 78, 81]
Score = np.array(Score)

Q1, Q3 = np.percentile(Score, [25, 75])
## Or individually
# Q3 = np.percentile(Score, 75)
# Q1 = np.percentile(Score, 25)
IQR = Q3 - Q1
print(f'The interquartile range is {IQR}')
# print(Q3, Q1)
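A minimal sketch that computes the range, sample variance, and standard deviation of Score with NumPy, using ddof=1 so the results match the hand calculation above:

# Range, sample variance, and sample standard deviation of Score
data_range = Score.max() - Score.min()  # 100 - 71 = 29
sample_var = Score.var(ddof=1)          # divide by n - 1, i.e., the sample variance
sample_std = Score.std(ddof=1)          # square root of the sample variance
print(f'Range: {data_range}, variance: {sample_var:.2f}, standard deviation: {sample_std:.2f}')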
Box Plots
To build a box plot, the following things are required from the data: the minimum value, the lower quartile (Q1), the median, the upper quartile (Q3), and the maximum value.
Consider again the Score data of the 25 students (Stu ID 1 to 25):
Score = [ 75, 81, 95, 85, 81, 82, 97, 89, 76, 100, 77, 79, 87, 87, 79, 75, 100, 82, 83, 98, 86, 87, 71, 78, 81]
Score = np.array(Score)
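A minimal sketch that draws the box plot of Score with matplotlib (the figure size is an arbitrary choice):

import matplotlib.pyplot as plt

# Box plot of Score: the box spans Q1 to Q3, the line inside the box is the median,
# and the whiskers/points show the overall spread and potential outliers.
plt.figure(figsize=(4, 6))
plt.boxplot(Score)
plt.title('Box Plot of Score')
plt.show()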
Comparing Distributions
Are the two distributions similar or dissimilar?
1. Often there is a need to compare different distributions to derive some important insights or make decisions.
2. Visual inspection by comparing histograms (or any graphical depiction of the distributions) is neither practical nor easy.
3. Visual inspection by comparing box plots is a more compact alternative.
4. Hypothesis testing and confidence/probability values (p-values) are used for formally comparing distributions.
Hypothesis Testing
1. A hypothesis is a way to state our assumption or belief that could be tested.
2. The default knowledge or assumption is stated as the null hypothesis.
3. The opposite of the default knowledge is stated as the alternative hypothesis.
4. Example:
Null hypothesis: no difference between the two distributions.
Alternative hypothesis: there is a difference between the two distributions.
P-Value
1. The p-value is the probability of observing data at least as extreme as the given samples, assuming the null hypothesis is true.
2. It indicates how consistent the data are with the belief that the two distributions are the same.
3. If p-value is very small, then we can reject the null hypothesis, and accept the alternative hypothesis.
4. If p-value is not very small, then we fail to reject the null hypothesis.
5. Typically, a p-value less than 0.05 or 5% is considered as very small.
Typical Tests
1. Shapiro-Wilk Test: Tests if a data sample follows a normal distribution.
Null hypothesis: the sample has a normal distribution.
2. Student's t-test: Tests if the means of two independent samples are significantly different. Assumes the data follow normal distributions.
Null hypothesis: the means of the samples are equal.
3. Mann-Whitney U Test: Tests if the distributions of two independent samples are significantly different; it is a non-parametric alternative that does not assume normality.
Null hypothesis: the distributions of the samples are equal.
Hands on Hypothesis Testing
Consider the scores of 25 students (Stu ID 1 to 25) on two score variables, Score_1 and Score_2:
# 1. Check if Score_1 and Score_2 follow a normal distribution using the Shapiro-Wilk test.
from scipy.stats import shapiro

Score_1 = [75, 81, 85, 85, 81, 82, 87, 89, 73, 100, 72, 79, 87, 87, 79, 75, 100, 82, 83, 88, 86, 87, 66, 78, 81]
Score_2 = [70, 99, 70, 100, 78, 66, 76, 79, 65, 77, 100, 85, 71, 99, 66, 84, 87, 66, 66, 74, 81, 83, 85, 75, 66]
# They can be of different lengths

_, pval = shapiro(Score_1)
print('p-value', pval)
if pval < 0.05:
    print("We reject null hypothesis for Score_1")
else:
    print("We fail to reject null hypothesis for Score_1")

print("\n"*3)

_, pval = shapiro(Score_2)
print('p-value', pval)
if pval < 0.05:
    print("We reject null hypothesis for Score_2")
else:
    print("We fail to reject null hypothesis for Score_2")
p-value 0.29114970564842224
We fail to reject null hypothesis for Score_1
p-value 0.01274434570223093
We reject null hypothesis for Score_2
# 2. Draw the box plots of Score_1 and Score_2, and comment on their similarity/dissimilarity.
import matplotlib.pyplot as plt
fig =plt.figure(figsize=(6,8))
ax = fig.add_subplot(111) #fig.add_subplot(ROW,COLUMN,POSITION)
plt.boxplot([Score_1,Score_2])
plt.title('Box Plot')
ax.set_xticklabels(['Score_1', 'Score_2'])
plt.show()
# 3. Check if the means of Score_1 and Score_2 are equal using Student's t-test.
from scipy.stats import ttest_ind
import numpy as np

# Student's t-test is appropriate when the data follow a normal distribution;
# equal_var=False uses Welch's variant, which does not assume equal variances.
_, pval = ttest_ind(Score_1, Score_2, equal_var=False)
print('p-value',pval)
if pval < 0.05:
    print("We reject null hypothesis")
else:
    print("We fail to reject null hypothesis")
p-value 0.15643810807436723
We fail to reject null hypothesis
# 4. Check if the means of Score_1 and Score_2 are equal using Mann-Whitney U Test.
from scipy.stats import mannwhitneyu

_, pval = mannwhitneyu(Score_1, Score_2)
print('p-value', pval)
if pval < 0.05:
    print("We reject null hypothesis")
else:
    print("We fail to reject null hypothesis")
p-value 0.02961242046613917
We reject null hypothesis
References:
Theory:
1. Chirag Shah, "A Hands-On Introduction to Data Science," Cambridge University Press, 2020, Section 3.3.1, 3.3.2, 3.3.3, 3.3.4.
2. Plots: https://matplotlib.org/
3. Numpy: https://numpy.org/doc/stable/
4. Scipy: https://docs.scipy.org/doc/scipy/