Statistics Unit1 Notes.docx

Msc ds stats

Uploaded by

Shubham Wagh

Unit 1 : Descriptive Statistics

Measures of Central Tendency

A measure of central tendency is a single value that attempts to describe a set of data by identifying
the central position within that set of data.

The mean (often called the average) is probably the measure of central tendency you are most
familiar with, but there are others, such as the median and the mode.

The mean, median and mode are all measures of central tendency, but under different conditions,
some measures of central tendency become more appropriate to use than others.

MEAN

The mean is the average of a data set.

The mean is equal to the sum of all the values in the data set divided by the number of values in the
data set.

If x1, x2, …, xn are the n values in the data, the sample mean, denoted by $\bar{x}$, is given by

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum x_i}{n}$$

Mean is not the best measure of central tendency when

i. The dataset contains outliers

ii. The distribution is skewed.

Properties of mean:

1. If a fixed value d is added to each of the observations in the data, then

Mean of the new data = d + mean of the old data.

e.g. Suppose the mean monthly salary of the employees working for a company is Rs. 15000/-. If
each employee gets a monthly raise of Rs.1500/-, then the mean monthly salary after the raise will
be 15000+1500= Rs. 16500

2. If each observation in the data is multiplied by a fixed constant c, then

Mean of the new data = c × mean of the old data.

e.g. mean height of students in a class is 4.8 feet. If the height is expressed in inches, the mean
height would be 12(4.8) = 57.6 inches.
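
The two properties can be checked numerically. The sketch below uses small hypothetical salary and height vectors (made up for illustration, chosen so that their means match the Rs. 15000 and 4.8 feet of the examples above).

```r
# Hypothetical data, chosen only so the means match the examples in the notes
salary    <- c(12000, 14000, 15000, 16000, 18000)   # monthly salary in Rs.
height_ft <- c(4.5, 4.8, 5.1)                       # height in feet

mean(salary)           # mean of the original salaries
mean(salary + 1500)    # property 1: same as mean(salary) + 1500

mean(height_ft)        # mean height in feet
mean(height_ft * 12)   # property 2: same as 12 * mean(height_ft)

# Both identities hold up to floating-point rounding
shift_ok <- isTRUE(all.equal(mean(salary + 1500), mean(salary) + 1500))
scale_ok <- isTRUE(all.equal(mean(height_ft * 12), 12 * mean(height_ft)))
```

Comparing with all.equal() rather than == avoids false mismatches caused by floating-point rounding.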

Median

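The median is the middle score for a set of data that has been arranged in order of magnitude.

If the number of observations in the data set is n, then

Median = value of the $\left(\frac{n+1}{2}\right)$th observation in the arranged dataset.

When n is even, (n+1)/2 is not a whole number, and the median is taken as the average of the two middle observations.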
e.g. Consider the data set 65,55,89,56,35,14,56,55,87,44,92. Median=56

For the dataset 65,55,89,56,35,14,56,55,87,44 (n = 10), Median = (55 + 56)/2 = 55.5

The median is less affected by outliers and is the preferred measure of central tendency for skewed
data.

MODE

The mode is the most frequent score in the data set.

e.g. If the data set is 45,42,56,42,56,67,43,42,56,45,42,40,42. Mode= 42.

Normally, the mode is used for categorical data where we wish to know which is the most common
category.

The main problem with the mode is that it may not be unique.

It will not be an appropriate measure of central tendency if the most common value is far away from the
rest of the data set.

We can have datasets with more than one mode or datasets with no mode.

Mean, Median and Mode using R

Mean

Syntax

The basic syntax for calculating the mean in R is:

mean(x, trim = 0, na.rm = FALSE, ...)

The parameters are:

● x is the input vector.

● trim is the fraction (0 to 0.5) of observations dropped from each end of the sorted vector before the mean is computed.

● na.rm indicates whether missing values (NA) should be removed from the input vector before the calculation.

Example:

x <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

mean(x)

Applying Trim Option

When the trim parameter is supplied, the values in the vector are sorted and the required number
of observations is dropped from each end before calculating the mean.

When trim = 0.3, the 0.3n smallest and the 0.3n largest values (0.3n rounded down to a whole number)
are dropped, where n is the total number of observations.

x <- c(-21, -5, 2, 3, 4.2, 7, 8, 12, 18, 54)

In this case the sorted vector is (-21, -5, 2, 3, 4.2, 7, 8, 12, 18, 54) and the values removed before
calculating the mean are (-21, -5, 2) from the left and (12, 18, 54) from the right. The trimmed mean is
therefore the mean of (3, 4.2, 7, 8), which is 5.55.

mean(x, trim = 0.3)

Applying NA Option

If there are missing values, the mean function returns NA.

To drop the missing values from the calculation, use na.rm = TRUE, which means remove the NA
values.

x <- c(-21, -5, 2, 3, NA, 4.2, 7, 8, 12, NA, 18, 54)

mean(x, na.rm = TRUE)

Median using R

Syntax

The basic syntax for calculating the median in R is:

median(x, na.rm = FALSE)

The parameters are:

● x is the input vector.

● na.rm indicates whether missing values (NA) should be removed from the input vector before the calculation.

Example:

x <- c(2,1,2,3,1,2,3,4,1,5,5,3,2)

med <- median(x); med

Mode using R

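R does not have a built-in function to calculate the statistical mode (the built-in mode() function
reports the storage type of an object instead). So we create a user-defined function. This function
takes a vector as input and returns the mode value, or all tied values if the mode is not unique.

get_mode <- function(x) {
  ux <- unique(x)                    # distinct values, in order of first appearance
  counts <- tabulate(match(x, ux))   # frequency of each distinct value
  ux[counts == max(counts)]          # value(s) with the highest frequency
}

Example:

1. x <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3); get_mode(x)

[1] 2 3

Here 2 and 3 both occur four times, so this data set has two modes.

2. x <- c("o","i","a","i","i","u","e"); get_mode(x)

[1] "i"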

Partition Values

Partition Values divide the set of observations into several equal parts.

• Quartiles

• Deciles

• Percentiles

Quartiles

Quartiles are the values that divide a given data set into four equal parts. There are three quartiles:
Q1, Q2 and Q3.

Q1: Lower or first quartile: the value for which 1/4th of the observations are less than or
equal to it.

Q2: Middle or second quartile: the value for which 1/2 of the observations are less than or
equal to it. It is the median of the data set.

Q3: Upper or third quartile: the value for which 3/4th of the observations are less than or
equal to it.

To compute the quartiles, first arrange all the observations in the data set in ascending order. Then

Qi = value of the $\frac{i(N+1)}{4}$th observation, i = 1, 2, 3.

Deciles

Deciles are the values that divide a dataset into ten equal parts. Therefore, there are a total of
nine deciles:

D1, D2, D3, D4, …, D9.

Di is the value for which i/10th of the total observations are less than or equal to Di,
i = 1, 2, …, 9, after all the observations are arranged in ascending order.

Di = value of the $\frac{i(N+1)}{10}$th observation, i = 1, 2, …, 9.

Percentiles

Percentiles divide a dataset into 100 equal parts. There are a total of 99 percentiles:

P1, P2, P3, P4, …, P99.

Pi is the value for which i/100th of the total observations are less than or equal to Pi,
i = 1, 2, …, 99, after all the observations are arranged in ascending order.

Pi = value of the $\frac{i(N+1)}{100}$th observation, i = 1, 2, …, 99.

Example:

Calculate Q1, D3 and P55 for the dataset given below: 42,45,41,48,50,52,43,44,42,50,56,52,55,43.

Arranged data: 41,42,42,43,43,44,45,48,50,50,52,52,55,56.

Q1= Value of 3.75th observation = 42.75

D3 = Value of 4.5th observation = 43

P55 = Value of 8.25th observation = 48.5
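
The fractional positions above are resolved by linear interpolation between neighbouring observations. The notes do not spell this step out, so the following is a sketch of the arithmetic, where $x_{(k)}$ (notation introduced here) denotes the kth value of the arranged data.

```latex
\begin{aligned}
Q_1    &= x_{(3)} + 0.75\,\bigl(x_{(4)} - x_{(3)}\bigr) = 42 + 0.75\,(43 - 42) = 42.75 \\
D_3    &= x_{(4)} + 0.5\,\bigl(x_{(5)} - x_{(4)}\bigr)  = 43 + 0.5\,(43 - 43)  = 43 \\
P_{55} &= x_{(8)} + 0.25\,\bigl(x_{(9)} - x_{(8)}\bigr) = 48 + 0.25\,(50 - 48) = 48.5
\end{aligned}
```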

Using R

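A single command, quantile(), calculates all partition values.

If the vector of observations is x and we need to compute the kth percentile, pass k/100 as the probability:

quantile(x, k/100)

For example, Q1 is quantile(x, 0.25), D3 is quantile(x, 0.30) and P55 is quantile(x, 0.55).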

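Note:

Manual calculations and the values from R may not match exactly, since quantile() offers 9 different
algorithms for the calculation of percentiles, selected with the type argument, and uses type = 7 by
default. The (N+1) formulas above correspond to type = 6, the method used by SPSS and Minitab.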
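
As a quick check of the note above, the sketch below recomputes Q1, D3 and P55 for the example data with type = 6 and, for contrast, with the default algorithm. The variable names manual and default are illustrative only.

```r
x <- c(42, 45, 41, 48, 50, 52, 43, 44, 42, 50, 56, 52, 55, 43)

# type = 6 places the p-th quantile at position p * (N + 1) in the sorted data
manual <- quantile(x, probs = c(0.25, 0.30, 0.55), type = 6)
manual    # Q1, D3 and P55 as in the worked example: 42.75, 43, 48.5

# The default (type = 7) uses a different position rule, so values can differ
default <- quantile(x, probs = c(0.25, 0.30, 0.55))
default
```

Passing type = 6 explicitly is the simplest way to make R agree with hand calculations based on the (N+1) formulas.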

Box Plot (Box and Whisker Plot)

The box and whisker chart shows how your data is spread out. Five pieces of information are
generally included in the chart:

• The minimum (the smallest number in the data set), shown at the far left of the chart, at the end of the left “whisker”.

• The first quartile, Q1, shown at the far left of the box (the far right of the left whisker).

• The median, shown as a line in the center of the box.

• The third quartile, Q3, shown at the far right of the box (the far left of the right
whisker).

• The maximum (the largest number in the data set), shown at the far right of the chart, at the end of
the right “whisker”.

Outliers

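A box plot can be used to identify outliers in the data.

• Q3 - Q1 is the interquartile range (IQR).

• Upper limit (the largest value the right whisker may reach) = Q3 + 1.5 IQR

• Lower limit (the smallest value the left whisker may reach) = Q1 - 1.5 IQR

• Points outside these limits are outliers.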

Example

Draw a box-and-whisker plot for the data set {3, 7, 8, 5, 12, 14, 21, 13,18,10,11}.

Arranged data: 3,5,7,8,10,11,12,13,14,18,21

Min = 3, Max = 21, Q1 = 7, Q2 = 11, Q3 = 14
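
The five values above, and the outlier limits for the same data, can be reproduced as follows. This is a sketch that assumes the (N+1) quartile rule of these notes, which corresponds to type = 6 in quantile(). The names five_num, lower, upper and outliers are illustrative.

```r
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18, 10, 11)

# Quartiles by the (N + 1) rule, i.e. quantile type 6
q <- quantile(x, probs = c(0.25, 0.50, 0.75), type = 6)
five_num <- c(min = min(x), Q1 = q[[1]], Q2 = q[[2]], Q3 = q[[3]], max = max(x))
five_num   # 3, 7, 11, 14, 21

# Outlier limits: 1.5 * IQR beyond the quartiles
iqr   <- q[[3]] - q[[1]]
lower <- q[[1]] - 1.5 * iqr
upper <- q[[3]] + 1.5 * iqr
outliers <- x[x < lower | x > upper]   # empty here: every value lies within the limits
```

With IQR = 7 the limits are -3.5 and 24.5, so this data set has no outliers.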

Box Plot using R

• A box plot drawn in R shows the outliers as separate points.

• Command: boxplot(x)

boxplot(x, main = "Box Plot")   # adds a title

boxplot(x, y, z)                # several data sets side by side, for comparison
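
To see which points R would draw as outliers without producing the plot, boxplot.stats() can be used. The sketch below takes the example data and appends one extreme value (60), which is hypothetical and added only so that there is an outlier to find.

```r
# Hypothetical data: the box-plot example values plus one extreme value (60)
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18, 10, 11, 60)

# boxplot.stats() returns the numbers behind the plot without drawing it
stats <- boxplot.stats(x)
stats$stats   # lower whisker end, lower hinge, median, upper hinge, upper whisker end
stats$out     # values that would be drawn as individual outlier points
```

Note that boxplot() works with hinges, which can differ slightly from quartiles computed by the (N+1) formula, so the box edges in R may not match hand calculations exactly.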

Uses of a box plot

It is used to know

• the outliers and their values

• the symmetry of the data

• how tightly the data is grouped

• whether the data is skewed and, if so, in which direction and by how much

Measures of Dispersion

Statistical dispersion is the extent to which numerical data is likely to vary about an
average value. It helps to understand the distribution of the data.

Variance

Variance measures variability from the average or mean.

It is the average of the squares of the differences between each number in the data set and its mean.

$$\text{Variance} = \frac{\sum (x_i - \mu)^2}{n}$$

where µ is the population mean. If it is unknown, it is replaced with the sample mean $\bar{x}$:

$$\text{Variance} = \frac{\sum (x_i - \bar{x})^2}{n}$$

It is expressed in the squares of the units in which the original data is given.

For calculations we use the simplified version

$$\text{Variance} = \frac{\sum x_i^2 - n\bar{x}^2}{n}$$

It is denoted by σ².
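
The simplified version follows from expanding the square in the definition and using $\sum x_i = n\bar{x}$. A short derivation, written out here as a check:

```latex
\begin{aligned}
\sum (x_i - \bar{x})^2
  &= \sum x_i^2 - 2\bar{x}\sum x_i + n\bar{x}^2 \\
  &= \sum x_i^2 - 2n\bar{x}^2 + n\bar{x}^2 \\
  &= \sum x_i^2 - n\bar{x}^2
\end{aligned}
```

Dividing both sides by n gives the simplified formula for the variance.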

Standard Deviation

It is defined as the square root of the variance and is given by

$$\sigma = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n}}$$

It is expressed in the same units in which the original data is given.

Coefficient of Variation

The coefficient of variation is the ratio of the standard deviation to the mean, multiplied by 100.

$$\text{C.V.} = \frac{\sigma}{\bar{x}} \times 100$$

It is expressed as a percentage.

It is a useful statistic for comparing the degree of variation from one data set to another, even if the
means are drastically different from one another or units of measurement are different.

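Variance, SD and CV using R

Let the vector of observations be x.

variance <- var(x)

std_dev <- sd(x)

cv <- sd(x) / mean(x) * 100

Note that var() and sd() divide by n - 1 rather than n, so their results are slightly larger than the values given by the formulas above.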
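
Because R's var() and sd() use the divisor n - 1, the divisor-n formulas of these notes have to be computed explicitly when an exact match with hand calculation is needed. A minimal sketch, reusing the box-plot example data (the variable names are illustrative):

```r
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18, 10, 11)
n <- length(x)

# Variance with divisor n, following the definition in these notes
pop_var <- sum((x - mean(x))^2) / n

# Simplified computational form: (sum of squares - n * mean^2) / n
pop_var_short <- (sum(x^2) - n * mean(x)^2) / n

pop_sd <- sqrt(pop_var)
cv     <- pop_sd / mean(x) * 100   # coefficient of variation, in percent

# R's var() uses divisor n - 1; rescaling recovers the divisor-n value
rescaled <- var(x) * (n - 1) / n
```

For large n the two divisors give nearly the same result; the difference matters mainly for small data sets.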

Measures of Skewness and Kurtosis

Skewness

Skewness refers to asymmetry in the distribution curve of a dataset.

If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified to
measure the extent to which a distribution departs from a symmetric distribution such as the normal distribution.

A distribution can be

Symmetric

Positively skewed

Negatively skewed

Symmetric distribution

For a symmetric distribution, mean = mode = median.

In the box plot of a symmetric distribution, Q3 - Q2 = Q2 - Q1, and the lengths of the two whiskers are equal.

Positively Skewed distribution

A positively skewed distribution is a distribution with the tail on its right side. For these
distributions, mean > median > mode.

In the box plot of a positively skewed distribution, Q3 - Q2 > Q2 - Q1.

Negatively Skewed distribution

A negatively skewed distribution is a distribution with the tail on its left side. For these distributions,
mean < median < mode.

In the box plot of a negatively skewed distribution, Q3 - Q2 < Q2 - Q1.

Measures of Skewness

1. Karl Pearson’s Coefficient of Skewness

It is denoted by SKp and is defined by

$$SK_p = \frac{\text{Mean} - \text{Mode}}{\text{Standard Deviation}}$$

If the mode is indeterminate, use the empirical relationship between mean, median and mode:

$$SK_p = \frac{3(\text{Mean} - \text{Median})}{\text{Standard Deviation}}$$

SKp = 0: Distribution is symmetric

SKp > 0: Distribution is positively skewed

SKp < 0: Distribution is negatively skewed

2. Bowley’s Coefficient of Skewness

It is denoted by SKb and is defined by

$$SK_b = \frac{(Q_3 - Q_2) - (Q_2 - Q_1)}{Q_3 - Q_1} = \frac{Q_3 + Q_1 - 2Q_2}{Q_3 - Q_1}$$

SKb = 0: Distribution is symmetric

SKb > 0: Distribution is positively skewed

SKb < 0: Distribution is negatively skewed

-1 ≤ SKb ≤ +1
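
Both coefficients can be computed in a few lines. The sketch below uses the data set from the partition-values example and assumes the conventions of these notes: standard deviation with divisor n and quartiles by the (N+1) rule (type = 6). The names sigma, sk_p and sk_b are illustrative.

```r
x <- c(42, 45, 41, 48, 50, 52, 43, 44, 42, 50, 56, 52, 55, 43)
n <- length(x)

# Standard deviation with divisor n, as defined earlier in these notes
sigma <- sqrt(sum((x - mean(x))^2) / n)

# Karl Pearson's coefficient, median-based form
sk_p <- 3 * (mean(x) - median(x)) / sigma

# Bowley's coefficient from quartiles computed by the (N + 1) rule (type = 6)
q <- quantile(x, probs = c(0.25, 0.50, 0.75), type = 6)
sk_b <- (q[[3]] + q[[1]] - 2 * q[[2]]) / (q[[3]] - q[[1]])

c(SKp = sk_p, SKb = sk_b)   # both positive for this data set
```

Both values come out positive here, so the two measures agree that this data set is positively skewed.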

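3. Pearson’s Measure of Skewness (Based on moments)

It is denoted by γ1 and is defined by

$$\gamma_1 = \pm\sqrt{\beta_1}, \qquad \beta_1 = \frac{\mu_3^2}{\mu_2^3}$$

where µr is the rth central moment, given by

$$\mu_r = \frac{\sum (x_i - \bar{x})^r}{n}, \qquad r = 1, 2, 3, \ldots$$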

Interpretation

γ1 is the positive square root of β1 if µ3 is positive, and the negative square root of β1 if µ3 is negative.

The distribution is symmetric if γ1 = 0

It is positively skewed if γ1 > 0

It is negatively skewed if γ1 < 0

Kurtosis

Kurtosis provides information about the peakedness of a distribution. Peakedness is the
degree to which data values are concentrated around the mean.

Datasets with high kurtosis tend to have a distinct peak near the mean, decline rapidly, and have heavy tails.

Datasets with low kurtosis tend to have a flat top near the mean rather than a sharp peak.

Types of Kurtosis

There are 3 types of kurtosis:

• Mesokurtic: Distributions that are moderate in breadth, with curves of medium peak height.

• Leptokurtic: Distributions whose curve is sharply peaked with heavy tails.

• Platykurtic: Distributions whose curve has a flat peak and has more dispersed scores with
lighter tails.

Measure of Kurtosis

Pearson’s measure of Kurtosis (Based on moments)

It is denoted by γ2 and is defined by

$$\gamma_2 = \beta_2 - 3, \qquad \beta_2 = \frac{\mu_4}{\mu_2^2}$$

Interpretation

The distribution is mesokurtic if γ2 = 0 (β2 = 3)

It is leptokurtic if γ2 > 0 (β2 > 3)

It is platykurtic if γ2 < 0 (β2 < 3)

Skewness and Kurtosis using R

Draw a histogram and box plot to get an idea about skewness. Base R has no function that calculates
these measures directly.

Use the summary function to get the mean, Q1, Q2 and Q3. Find sd(x).

Compute SKp and SKb using the formulas.

Alternatively, use library(moments):

skewness(x) gives the value of γ1

kurtosis(x) gives the value of β2
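
If the moments package is not installed, the moment-based measures can also be computed directly from their definitions in base R. A minimal sketch, where central_moment is a helper written for this illustration and not a built-in function:

```r
# r-th central moment with divisor n, matching the definition of mu_r in these notes
central_moment <- function(x, r) mean((x - mean(x))^r)

x <- c(42, 45, 41, 48, 50, 52, 43, 44, 42, 50, 56, 52, 55, 43)

m2 <- central_moment(x, 2)
m3 <- central_moment(x, 3)
m4 <- central_moment(x, 4)

gamma1 <- m3 / m2^1.5   # signed square root of beta1 = m3^2 / m2^3
beta2  <- m4 / m2^2
gamma2 <- beta2 - 3     # excess kurtosis

c(gamma1 = gamma1, beta2 = beta2, gamma2 = gamma2)
```

Writing gamma1 as m3 / m2^1.5 gives it the sign of µ3 automatically, which is the sign rule stated in the interpretation of γ1.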
