Data Preprocessing I
Structured Data
  Numerical Data: Ratio Scale, Interval Scale
  Categorical Data: Ordinal Scale, Nominal Scale
Attribute Types
Discrete vs Continuous
Statistical Operations on Data
Describing Categorical Data
[Figure: three example charts over the categories Very Bad, Bad, Good and Very Good]
First chart: representation of relative frequencies for each category.
Bar Chart: place the distinct categories on the horizontal axis and the frequencies or
relative frequencies of each category on the vertical axis, or vice versa.
Pareto Chart: a bar chart sorted by frequency or relative frequency.
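The frequencies behind a bar chart and the ordering behind a Pareto chart can be sketched in a few lines. This is a minimal illustration only: the `ratings` list is hypothetical data, and the helper names `relative_frequencies` and `pareto_order` are made up for this example.

```python
from collections import Counter

# Hypothetical survey responses; the data behind the slide's charts are not available.
ratings = ["Good", "Bad", "Very Good", "Good", "Very Bad", "Good", "Bad", "Very Good"]


def relative_frequencies(values):
    """Return {category: share of all observations} for a list of labels."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def pareto_order(values):
    """Return (category, count) pairs sorted from most to least frequent."""
    return Counter(values).most_common()


rel = relative_frequencies(ratings)   # heights of a relative-frequency bar chart
ordered = pareto_order(ratings)       # bar order of a Pareto chart
```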
Weighted mean:
$$x' = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_N x_N}{w_1 + w_2 + \dots + w_N}$$
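As a small sketch of the weighted mean formula above, assuming the values and weights arrive as two equal-length lists (the function name `weighted_mean` and the sample numbers are illustrative):

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights w_i."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical marks with weights 1, 2 and 3.
result = weighted_mean([70, 80, 90], [1, 2, 3])
```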
Trimmed mean: after sorting the data values, cut off 2% to 20% of the values at the lower
and upper extremes and evaluate the mean of the remaining values.
The mean can also be evaluated from an assumed mean (discrete series and continuous series).
For continuous attributes, find the mid-point of each class interval and multiply it by the
frequency of the data points belonging to that interval to find the mean. The rest of the
calculation is the same.
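The trimmed mean and the mid-point mean for class intervals can be sketched as follows. Two assumptions are made here that the slide does not settle: the trimming proportion is applied to each end separately, and class intervals are given as `(lower, upper)` pairs. The function names and sample data are illustrative.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the sorted values from EACH end (assumption)."""
    data = sorted(values)
    k = int(len(data) * proportion)          # number of values cut from each end
    kept = data[k:len(data) - k]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)


def grouped_mean(intervals, frequencies):
    """Mean of a continuous series: class mid-points weighted by class frequency."""
    total = sum(frequencies)
    weighted = sum(((lo + hi) / 2) * f for (lo, hi), f in zip(intervals, frequencies))
    return weighted / total


# Hypothetical data: one extreme value (100) is removed by a 10% trim.
trimmed = trimmed_mean([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], 0.10)
grouped = grouped_mean([(0, 10), (10, 20), (20, 30)], [2, 5, 3])
```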
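Combined Mean -
Median evaluation is similar to that for categorical data when the number of observations is small.
For a high number of observations:
$$\text{Median} = L + \frac{\frac{N}{2} - CF}{f_{median}} \times w$$
where L is the lower boundary of the median class, N the total number of observations, CF the
cumulative frequency of the classes before the median class, f_median the frequency of the
median class and w the class width.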
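A minimal sketch of the grouped median formula, assuming equal-width classes described by their lower boundaries. The median class is taken as the first class whose cumulative frequency reaches N/2. The name `grouped_median` and the sample frequencies are illustrative.

```python
def grouped_median(lower_bounds, width, frequencies):
    """Median = L + ((N/2 - CF) / f_median) * w for equal-width class intervals."""
    half = sum(frequencies) / 2
    cumulative = 0                      # CF: frequency of all classes before this one
    for lower, f in zip(lower_bounds, frequencies):
        if f > 0 and cumulative + f >= half:
            return lower + (half - cumulative) / f * width
        cumulative += f
    raise ValueError("no observations")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with N = 30.
median = grouped_median([0, 10, 20, 30], 10, [5, 8, 12, 5])
```

Here the median class is 20-30 (CF = 13, f_median = 12), so the result is 20 + (15 - 13) / 12 * 10.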
Statistical Operations on Data
Describing Numerical Data
Measure of Central Tendency
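Mode for a continuous series, when the data values are provided in class intervals
and the frequency of each interval is known, is defined as
$$\text{Mode} = L + \frac{f_1 - f_0}{2 f_1 - f_0 - f_2} \times w$$
where f_1 is the frequency of the modal class, f_0 and f_2 are the frequencies of the classes
before and after it, L is the lower boundary of the modal class and w is the class width.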
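The grouped mode formula can be sketched in the same way. As an assumption of this sketch, a missing neighbour (when the modal class is the first or last class) is treated as having frequency 0. The name `grouped_mode` and the sample frequencies are illustrative.

```python
def grouped_mode(lower_bounds, width, frequencies):
    """Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * w for equal-width class intervals."""
    i = max(range(len(frequencies)), key=lambda k: frequencies[k])  # modal class index
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0                      # class before (assumed 0 if none)
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0   # class after (assumed 0 if none)
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("mode is not defined by this formula for these frequencies")
    return lower_bounds[i] + (f1 - f0) / denominator * width


# Hypothetical classes 0-10, 10-20, 20-30, 30-40; the modal class is 20-30.
mode = grouped_mode([0, 10, 20, 30], 10, [5, 8, 12, 5])
```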
Relationship among Mean, Median and Mode.
[Figure: frequency curves plotted against Number of Class Intervals for three cases:
Symmetric, Positively Skewed (Asymmetric) and Negatively Skewed (Asymmetric)]
mean > median > mode if positively skewed
mode > median > mean if negatively skewed
Measure of Dispersion
Dispersion measures the spread of numerical data.
Techniques - Range, Variance, Standard Deviation, Percentile and Interquartile
Range.
Range is the difference between the largest and smallest values of the
observations. It gives a quick sense of how widely the data set is spread, but it is
sensitive to outliers.
Variance takes into consideration all the data values or observations. It
evaluates the deviations of the data values from the mean, squares them and
averages them to provide a single numeric value.
Variance formula: $$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - x')^2$$
where x' is the mean of the N data values.
The square root of the variance is the Standard Deviation.
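A short sketch of the variance and standard deviation as defined above, dividing by N (the population form shown in the formula). The function names and sample values are illustrative.

```python
import math


def population_variance(values):
    """Average of the squared deviations from the mean (divides by N)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Square root of the population variance."""
    return math.sqrt(population_variance(values))


# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
variance = population_variance(data)
std_dev = standard_deviation(data)
```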
Percentile –
o The 100p-th percentile of a data set containing n records is found as follows.
o p is the percentile expressed as a fraction, with values in [0,1].
o Sort the data in ascending order and determine n*p.
o If n*p is not an integer, take the smallest integer greater than
n*p. The value at that position is the 100p-th percentile.
o If n*p is an integer, the mean of the values at positions n*p and n*p + 1
is the 100p-th percentile.
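The n*p rule above can be sketched as a small function. Positions are 1-based as in the text. Clamping the positions at p = 0 and p = 1 is an assumption of this sketch, since the rule itself does not cover those edge cases. The name `percentile` and the sample data are illustrative.

```python
import math


def percentile(values, p):
    """100p-th percentile using the n*p position rule (positions are 1-based)."""
    if not values:
        raise ValueError("need at least one value")
    if not 0 <= p <= 1:
        raise ValueError("p must lie in [0, 1]")
    data = sorted(values)
    n = len(data)
    position = n * p
    nearest = round(position)
    if math.isclose(position, nearest, abs_tol=1e-9):
        # n*p is an integer: average the values at positions n*p and n*p + 1.
        lower = max(nearest, 1)          # assumption: clamp when p = 0
        upper = min(nearest + 1, n)      # assumption: clamp when p = 1
        return (data[lower - 1] + data[upper - 1]) / 2
    # n*p is not an integer: take the next position up.
    return data[math.ceil(position) - 1]


# Hypothetical data set with n = 5.
sample = [15, 20, 35, 40, 50]
p40 = percentile(sample, 0.4)   # n*p = 2, so average positions 2 and 3
p50 = percentile(sample, 0.5)   # n*p = 2.5, so take position 3
```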
Quartile –
o The 25th percentile is called the first quartile (Q1), the 50th
percentile is called the second quartile or median (Q2) and the 75th
percentile is called the third quartile (Q3). That is, the quartiles break the
data set into four parts.
o Interquartile Range: IQR = Q3 – Q1.
o Values of data below Q1 – 1.5 IQR or above Q3 + 1.5 IQR can be classified as
outliers.
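A sketch of the quartile and 1.5 IQR rule, reusing the n*p position rule for percentiles. The helper names, the end-point clamping and the sample data are assumptions made for illustration.

```python
import math


def percentile(values, p):
    """100p-th percentile using the n*p position rule (positions are 1-based)."""
    data = sorted(values)
    n = len(data)
    position = n * p
    nearest = round(position)
    if math.isclose(position, nearest, abs_tol=1e-9):
        lower = max(nearest, 1)          # assumption: clamp at the ends
        upper = min(nearest + 1, n)
        return (data[lower - 1] + data[upper - 1]) / 2
    return data[math.ceil(position) - 1]


def iqr_outliers(values):
    """Return Q1, Q3, IQR and the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = percentile(values, 0.25)
    q3 = percentile(values, 0.75)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < low_fence or v > high_fence]
    return q1, q3, iqr, outliers


# Hypothetical data with one extreme value.
q1, q3, iqr, outliers = iqr_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 50])
```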
Association between Categorical Variables.
Use of Contingency Table
E.g. 1. Association between gender and owning a smartphone.
100 samples are collected: 44 female and 56 male students. 76
owned smartphones and 24 did not.
Of the owners, 34 were female and 42 were male.
E.g. 2. Income level (ordinal variable with values low, medium and high) and
smartphone ownership (nominal variable with values yes or no).
Because income level is an ordinal variable, its values low, medium and high can be
coded as 1, 2 and 3 respectively in the contingency table to maintain the ordering.
For E.g. 1, it can be observed that 24% of the population do not own a smartphone
whereas 76% own one. This distribution is consistent for males
and females: around 23% of females do not own a smartphone and 77%
own one, while 25% of males do not own a smartphone and
75% own one. Hence gender and owning a smartphone are not
associated.
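The percentages quoted for E.g. 1 follow directly from the counts given earlier (44 female, 56 male, 34 female owners, 42 male owners). A minimal sketch of that arithmetic, with illustrative variable names:

```python
# Counts stated for E.g. 1; the "Not own" cells are derived by subtraction.
total = 100
female, male = 44, 56
female_own, male_own = 34, 42

table = {
    "Female": {"Own": female_own, "Not own": female - female_own},
    "Male": {"Own": male_own, "Not own": male - male_own},
}


def share_not_owning(row):
    """Fraction of a gender group that does not own a smartphone."""
    return row["Not own"] / (row["Own"] + row["Not own"])


overall_not_owning = sum(row["Not own"] for row in table.values()) / total
female_not_owning = share_not_owning(table["Female"])   # about 0.23
male_not_owning = share_not_owning(table["Male"])       # 0.25
```

The two group shares are close to the overall share of 0.24, which is the pattern the text reads as no association.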
On the contrary, for E.g. 2, the ownership distribution is 38% and 62%. However, this
distribution is not consistent across income levels. Only 10% of the high-income
group do not own a smartphone, whereas 41% of the medium-income group and 64% of the
low-income group do not own one. Hence income level and smartphone ownership are associated.
Use of Stacked Bar Chart
E.g. 1. The proportion of smartphone ownership is the same for male and female.
E.g. 2. The proportion of ownership is not the same for high, medium and low-income groups.
Row Relative Frequency and Column Relative Frequency
Row relative frequency: each cell of the contingency table divided by its row total.
Column relative frequency: each cell of the contingency table divided by its column total.
[Tables: Row Relative Frequency for E.g. 1 and E.g. 2]
[Tables: Column Relative Frequency for E.g. 1 and E.g. 2]
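Row and column relative frequencies can be sketched for any contingency table stored as a list of rows. The table below uses the E.g. 1 counts (rows: female, male; columns: own, not own). The function names are illustrative.

```python
def row_relative(table):
    """Divide each cell by its row total."""
    return [[cell / sum(row) for cell in row] for row in table]


def column_relative(table):
    """Divide each cell by its column total."""
    column_totals = [sum(column) for column in zip(*table)]
    return [[cell / column_totals[j] for j, cell in enumerate(row)] for row in table]


# E.g. 1 counts: rows are female and male, columns are own and not own.
observed = [[34, 10], [42, 14]]
by_row = row_relative(observed)        # each row sums to 1
by_column = column_relative(observed)  # each column sums to 1
```

Each row of `by_row` gives the ownership split within one gender, and each column of `by_column` gives the gender split within one ownership group.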
Utility of Row & Column Relative Frequencies for Finding
Association:
Association of two variables: knowing information about one variable provides
information about the other variable.
If the row relative frequencies (or column relative frequencies) have the same pattern
for all rows (or columns), the two variables are not associated.
If the row relative frequencies (or column relative frequencies) have different patterns
for some rows (or some columns), the two variables are associated.
Association between Numerical Variables.
How to interpret association in scatter plot.
Quantification of numeric association.
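One common way to quantify a numeric association is the Pearson correlation coefficient, which ranges from -1 to +1. The sketch below computes it directly from its definition. The function name `pearson_r` and the sample values are hypothetical, not data from the session.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical measurements: a perfectly increasing and a perfectly decreasing pair.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```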
Decision on H0 | H0 is true | H0 is false
Reject | Type 1 Error | Ok
Not Reject | Ok | Type 2 Error
Association between Variables.
Types of Hypothesis Testing:
Gender | Age Group | Weight (Kg) | Height (cm)
M | Elderly | 70 | 1.4
F | Adult | 6.5 | 1.2
…… | …… | …… | ……
…… | …… | …… | ……
Gender | Is there a difference between the male and female proportions? | H1: Yes, H0: No | Test:
One-Sample Proportion Test, since there is only one categorical variable. | If P ≤ 0.05, reject H0.
Gender & Age Group | Is there any difference in the male and female proportions across age
groups? | H1: Yes, H0: No | Test: Chi-squared Test, since there are two categorical variables. |
If P ≤ 0.05, reject H0.
Numeric feature such as Height | Test: T-test | One numeric variable
Two numerical variables | Test: Correlation (-1 to +1)
One numerical and one categorical | T-test if the categorical variable has two categories,
otherwise ANOVA test
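To illustrate the chi-squared test for two categorical variables, the sketch below applies it to the E.g. 1 gender and smartphone counts (34 and 10 for females, 42 and 14 for males). It is a hand-rolled version for a 2x2 table only, without continuity correction. The p-value shortcut via `math.erfc` holds only for one degree of freedom, and the function name `chi_squared_2x2` is illustrative.

```python
import math


def chi_squared_2x2(table):
    """Chi-squared statistic and p-value for a 2x2 contingency table (1 degree of freedom)."""
    row_totals = [sum(row) for row in table]
    column_totals = [sum(column) for column in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * column_totals[j] / n   # count expected under H0
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi2 > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: female, male. Columns: own, not own.
statistic, p_value = chi_squared_2x2([[34, 10], [42, 14]])
reject_h0 = p_value <= 0.05
```

For these counts the p-value is well above 0.05, so H0 is not rejected, which agrees with the earlier reading that gender and ownership are not associated.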
Question & Answer
Measuring the Strength of Association (Examples)
Session Outcomes
In this session you learned about:
1. Data & Attributes
2. Statistical operations for observing data and
attributes.
3. Measuring association between Categorical
Variables.
4. Measuring association between Numerical
Variables.
Thank You