
Basic Statistics and Probability

The document provides an introduction to statistics and probability. It discusses descriptive statistics such as mean, median, mode, measures of central tendency, and measures of spread like range, variance, standard deviation, and quartiles. It demonstrates calculating these measures using a sample speed data set of vehicles. Finally, it discusses different ways of visually representing data, including histograms, stem-and-leaf plots, scatter plots, and box plots.

Uploaded by

Vikram SK
Copyright: © All Rights Reserved

Introduction to statistics and probability
Prepared by Ranju Mohan, PhD student, CE dept, IITM
For CE302 class by Gitakrishnan Ramadurai, AP, CE dept, IITM
Statistics:

Descriptive statistics: working with data from a sample.
  Measures of central tendency: mean, median, mode, etc.
  Measures of dispersion: standard deviation, variance, range, quartiles, etc.

Inferential statistics: a sample is drawn from the POPULATION; a statistical estimate from the sample, together with concepts of probability, supports INFERENCE about the population.
Descriptive statistics – Measures of central tendency
Speed data collected for 10 vehicles:

Speed (km/hr.): 32.4, 27.8, 43.5, 52.3, 36.9, 29.5, 32.4, 49.8, 58.9, 32.4

Mean = (32.4 + 27.8 + 43.5 + 52.3 + 36.9 + 29.5 + 32.4 + 49.8 + 58.9 + 32.4) / 10 = 39.59

Arranging in increasing order:
27.8, 29.5, 32.4, 32.4, 32.4, 36.9, 43.5, 49.8, 52.3, 58.9

Median = (32.4 + 36.9) / 2 = 34.65

Mode = 32.4
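The three measures of central tendency can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names are illustrative and not part of the slides.

```python
import statistics

# Speed data (km/hr.) for the 10 vehicles
speeds = [32.4, 27.8, 43.5, 52.3, 36.9, 29.5, 32.4, 49.8, 58.9, 32.4]

mean_speed = statistics.mean(speeds)      # arithmetic mean
median_speed = statistics.median(speeds)  # average of the 5th and 6th sorted values
mode_speed = statistics.mode(speeds)      # most frequent value

print(round(mean_speed, 2), round(median_speed, 2), mode_speed)
```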
Descriptive statistics – Measures of spread

Speed (km/hr.): 32.4, 27.8, 43.5, 52.3, 36.9, 29.5, 32.4, 49.8, 58.9, 32.4

Range = Max value – Min value = 58.9 – 27.8 = 31.1

Absolute deviation: |value – mean|
Ex: Absolute deviation = |52.3 – 39.59| = 12.71

Arranging in increasing order:
27.8, 29.5, 32.4, 32.4, 32.4, 36.9, 43.5, 49.8, 52.3, 58.9

Quartiles (each quarter holds 25% of the data):
Quartile   Position                              Value
Q1         [1(10)+2]/4 = 3rd value               32.4
Q2         [2(10)+2]/4 = 5.5th value (median)    34.65
Q3         [3(10)+2]/4 = 8th value               49.8

Interquartile range = Q3 – Q1 = 49.8 – 32.4 = 17.4
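The range and quartiles can be reproduced with the position rule used on the slide, [k(n)+2]/4. The sketch below is illustrative only: the helper `value_at` is a hypothetical name, and it handles just whole positions and half positions (averaging the two neighbours), which is all this data set needs.

```python
def value_at(sorted_data, position):
    """Return the value at a 1-based position; a fractional position
    is treated as the average of the two neighbouring values."""
    lower = sorted_data[int(position) - 1]
    if position == int(position):
        return lower
    upper = sorted_data[int(position)]
    return (lower + upper) / 2

speeds = sorted([32.4, 27.8, 43.5, 52.3, 36.9, 29.5, 32.4, 49.8, 58.9, 32.4])
n = len(speeds)

data_range = speeds[-1] - speeds[0]
q1 = value_at(speeds, (1 * n + 2) / 4)  # position 3
q2 = value_at(speeds, (2 * n + 2) / 4)  # position 5.5
q3 = value_at(speeds, (3 * n + 2) / 4)  # position 8
iqr = q3 - q1

print(round(data_range, 2), q1, round(q2, 2), q3, round(iqr, 2))
```

Other quartile conventions (for example the ones in `statistics.quantiles`) can give slightly different values for the same data.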
Descriptive statistics – Measures of spread

Variance:
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

Speed (km/hr.)   Squared deviation
32.4             (32.4 – 39.59)²
27.8             (27.8 – 39.59)²
43.5             (43.5 – 39.59)²
52.3             (52.3 – 39.59)²
36.9             (36.9 – 39.59)²
29.5             (29.5 – 39.59)²
32.4             (32.4 – 39.59)²
49.8             (49.8 – 39.59)²
58.9             (58.9 – 39.59)²
32.4             (32.4 – 39.59)²
                 ∑ = 1057.09

$s^2$ = 1057.09/9 = 117.45

Standard deviation:
$s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$ = √(1057.09/9) = √117.45 = 10.84
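The sum of squared deviations, the sample variance and the sample standard deviation can be recomputed directly from the ten speeds. This is a sketch using the (n – 1) divisor from the formula above; the variable names are illustrative.

```python
import statistics

speeds = [32.4, 27.8, 43.5, 52.3, 36.9, 29.5, 32.4, 49.8, 58.9, 32.4]

mean_speed = statistics.mean(speeds)

# Sum of squared deviations about the mean, taken over all ten observations
sum_sq_dev = sum((x - mean_speed) ** 2 for x in speeds)

# statistics.variance and statistics.stdev both divide by n - 1
sample_variance = statistics.variance(speeds)
sample_std = statistics.stdev(speeds)

print(round(sum_sq_dev, 2), round(sample_variance, 2), round(sample_std, 2))
```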
Data:
• Groups of information that represent the qualitative or quantitative attributes of a variable or a set of variables.

• Visual/Graphical representation:
  • Frequency distributions
  • Graphs
  • Box plot
  • Scatter plot
  • Stem and leaf
Data representation: Examples
Bar chart

Frequency distribution table:
Year   No. of accidents
       Fatal   Non-fatal   Total
2002   4       7           11
2003   7       6           13
2004   4       6           10
2005   4       4           8
2006   2       4           6
2007   3       3           6

[Figure: bar chart, graphical representation of total no. of accidents by year]


Data representation: Examples
Frequency polygon

Previous data:
Year   No. of accidents
2002   11
2003   13
2004   10
2005   8
2006   6
2007   6

[Figure: frequency polygon of no. of accidents by year]
Data representation: Examples
Pie Diagram

Daily volume of vehicles observed:

Vehicle type   Frequency
Bus            35
Car/Jeep       160
Truck          22
Bicycle        40
Others         25
TOTAL          282

[Figure: pie diagram of vehicle composition]
Data representation: Examples
Given speed data:
63.2 49.9 36.9 44.2 54.8 49 42.9 32.4
54.3 37.5 45.1 51.7 47.5 43.8 55.9 48.8
41.1 47.5 52.3 39.2 57.3 36.3 42.8 58.7
52.9 42.5 46.4 53.3 46.5 43.2 56.9 47.7
47.8 35.6 50.3 44.7 46.2 38.4 62.4 39.4
56.4 55.1 64.8 52.8

Stem and Leaf plot
Stem | Leaves
3    | 2.4 5.6 6.3 6.9 7.5 8.4 9.2 9.4
4    | 1.1 2.5 2.8 2.9 3.2 3.8 4.2 4.7 5.1 6.2 6.4 6.5 7.5 7.5 7.7 7.8 8.8 9 9.9
5    | 0.3 1.7 2.3 2.8 2.9 3.3 4.3 4.8 5.1 5.9 6.4 6.9 7.3 8.7
6    | 2.4 3.2 4.8

Key: stem 5 with leaf 7.3 represents 57.3
Data representation: Examples
Given speed data (km/hr.),
63.2 49.9 36.9 44.2 54.8 49 42.9 32.4
54.3 37.5 45.1 51.7 47.5 43.8 55.9 48.8
41.1 47.5 52.3 39.2 57.3 36.3 42.8 58.7
52.9 42.5 46.4 53.3 46.5 43.2 56.9 47.7
47.8 35.6 50.3 44.7 46.2 38.4 62.4 49.4
56.4 55.1 64.8 52.8

Group into different speed classes:

$\text{Class interval} = \dfrac{\text{max value} - \text{min value}}{1 + 3.322 \log_{10}(\text{No. of veh.})} = \dfrac{64.8 - 32.4}{1 + 3.322 \log_{10}(44)} \approx 5.02$, say 5

Speed class   No. of vehicles
30-35         1
35-40         6
40-45         8
45-50         12
50-55         8
55-60         6
60-65         3
Data representation: Examples
Histogram

Speed class   No. of vehicles
30-35         1
35-40         6
40-45         8
45-50         12
50-55         8
55-60         6
60-65         3

[Figure: histogram of no. of vehicles against speed (km/hr.), class limits 30 to 65 in steps of 5]
Data representation: Examples
Ogive

Speed class   No. of vehicles   Cumulative no. of veh.
30-35         1                 1
35-40         6                 7
40-45         8                 15
45-50         12                27
50-55         8                 35
55-60         6                 41
60-65         3                 44

[Figure: ogive, cumulative no. of vehicles against speed (km/hr.)]

Q: Number of vehicles with speed less than 50 km/hr.?
Ans: 27 (the cumulative frequency at the upper limit of the 45-50 class)
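The grouping into speed classes and the cumulative counts behind the ogive can be sketched as below. Two assumptions are made here: the class width uses Sturges' constant 3.322 with the 44 speeds actually listed, and the class limits start at 30 km/hr. as in the table. The data are the 44 speeds from the grouped-data slide.

```python
import math
from itertools import accumulate

speeds = [
    63.2, 49.9, 36.9, 44.2, 54.8, 49.0, 42.9, 32.4,
    54.3, 37.5, 45.1, 51.7, 47.5, 43.8, 55.9, 48.8,
    41.1, 47.5, 52.3, 39.2, 57.3, 36.3, 42.8, 58.7,
    52.9, 42.5, 46.4, 53.3, 46.5, 43.2, 56.9, 47.7,
    47.8, 35.6, 50.3, 44.7, 46.2, 38.4, 62.4, 49.4,
    56.4, 55.1, 64.8, 52.8,
]

# Class interval from the range and the number of observations
n = len(speeds)
class_interval = (max(speeds) - min(speeds)) / (1 + 3.322 * math.log10(n))
width = round(class_interval)

# Count the speeds falling in each class [lower, lower + width)
lower_limits = list(range(30, 65, width))
counts = [sum(lower <= v < lower + width for v in speeds) for lower in lower_limits]

# Running totals give the ogive ordinates at each upper class limit
cumulative = list(accumulate(counts))
below_50 = cumulative[lower_limits.index(45)]

for lower, count, total in zip(lower_limits, counts, cumulative):
    print(f"{lower}-{lower + width}: {count} (cumulative {total})")
```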
Data representation: Examples
Scatter plot

Paired data set: (Number of accidents, Vehicle speed). How to relate?

Speed (km/hr.)   No. of accidents
22               1
45               3
32               4
75               8
66               10
58               6

[Figure: scatter plot of no. of accidents against speed (km/hr.)]
Data representation: Examples
Box Plot

Used when several comparisons are made simultaneously, e.g. 11 speed observations on each of 3 different days (n = 11).

[Table: speed (km/hr.) observations for Mon, Wed and Fri, as recorded and arranged in increasing order]

Q1 = [1(n)+2]/4th value = 3rd value
Q2 = median
Q3 = [3(n)+2]/4th value = 9th value

Points to be plotted, speed (km/hr.):
              Mon   Wed   Fri
Lowest (L)    25    28    28
Q1            33    30    31
Q2            42    42    38
Q3            59    64    45
Highest (H)   65    72    58

[Figure: box plots of speed (km/hr.) by day (Mon, Wed, Fri), marking L, Q1, Q2, Q3 and H]
Inferential statistics – basic concepts
• Use sample statistics and probability concepts to make inferences about the population.

• Probability (P): the likelihood of something happening or being true.

• Based on the assumption that sampling is random.


Inferential statistics:
Probability concepts – Random variables

Random experiment → Outcomes (x1, x2, x3, x4, x5)

Random variable (a function): assigns unique values to the outcomes. It may be:
• Discrete
• Continuous
Probability concepts – Random variables

Discrete random variable
Probability mass function, pmf = p(x):
$p(x) = P(X = x)$
$0 \le p(x) \le 1$
$\sum p(x) = 1$

Continuous random variable
Probability density function, pdf = f(x):
$\int_a^b f(x)\,dx = P(a \le x \le b)$
$f(x) \ge 0$
$\int_{-\infty}^{\infty} f(x)\,dx = 1$

Cumulative distribution function: $F(x) = P(X \le x)$

Examples:
Discrete:
x   P(X=x)
1   0.13
2   0.27
3   0.25
4   0.15
5   0.20
F(3) = 0.13 + 0.27 + 0.25 = 0.65

Continuous:
f(x) = x, 0 ≤ x ≤ √2; 0 otherwise
F(x) = x²/2 for 0 ≤ x ≤ √2
Probability concepts – Random variables

Expectation of a random variable, E(X): weighted average of the possible values.

Discrete random variable (given the $x_i$'s and $p(x_i)$'s):
$E(X) = \sum_i x_i\,p(x_i)$
$E(X^2) = \sum_i x_i^2\,p(x_i)$

Continuous random variable (given f(x)):
$E(X) = \int x\,f(x)\,dx$
$E(X^2) = \int x^2 f(x)\,dx$

Mean = E(X) = first moment about the origin

Variance = V(X) = second moment about the mean:
$V(X) = E[(X - \mu)^2]$
or
$V(X) = E(X^2) - [E(X)]^2$
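As an illustration, the expectation and variance formulas can be applied to the discrete pmf from the earlier example table (x = 1 to 5). The sketch below computes the variance both ways to show that the two expressions agree; the names are illustrative.

```python
# pmf from the discrete example: P(X = x) for x = 1..5
pmf = {1: 0.13, 2: 0.27, 3: 0.25, 4: 0.15, 5: 0.20}

expected = sum(x * p for x, p in pmf.items())          # E(X)
expected_sq = sum(x ** 2 * p for x, p in pmf.items())  # E(X^2)

# Variance as E(X^2) - [E(X)]^2
variance = expected_sq - expected ** 2

# Variance as the second moment about the mean, E[(X - mu)^2]
variance_about_mean = sum((x - expected) ** 2 * p for x, p in pmf.items())

print(round(expected, 2), round(expected_sq, 2), round(variance, 4))
```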
Some common probability distributions used in traffic engineering

Discrete data:
• Bernoulli distribution
• Binomial distribution
• Multinomial distribution
• Poisson distribution

Continuous data:
• Exponential distribution
• Normal distribution & distributions arising from normal:
  • Chi-square distribution
  • t-distribution
  • F-distribution
Special Random variables and probability distributions
Discrete Random variables
Bernoulli:
Two possible outcomes for one trial: 'success' (X=1) or 'failure' (X=0)

pmf: $P(X = 0) = 1 - p$, $P(X = 1) = p$, with $0 \le p \le 1$

Mean = p; Variance = p(1-p)

Binomial:
'n' independent trials, each having two outcomes

pmf: $P(X = x) = \dfrac{n!}{x!\,(n - x)!}\,p^x q^{n-x}$; $x = 0, 1, \ldots, n$, where $q = 1 - p$

Mean = np; Variance = np(1-p)


Special Random variables and probability distributions
Discrete Random variables
Poisson:
When 'n' is large and p is small

pmf: $P(X = x) = \dfrac{e^{-\nu}\,\nu^x}{x!}$; $x = 0, 1, \ldots$
$\nu$ = mean number of successes = np
x = actual number of successes

Mean = ν; Variance = ν

Continuous Random variables
Normal:

pdf: $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$; $-\infty < x < \infty$

Mean = µ; Variance = σ²
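Special Random variables and probability distributions
Continuous Random variables
Standard normal random variable, z:

$z = \dfrac{x - \mu}{\sigma}$

When µ = 0 and σ = 1,
pdf: $f(x) = \dfrac{1}{\sqrt{2\pi}}\,e^{-\frac{x^2}{2}}$; $-\infty < x < \infty$

[Figure: standard normal curve centred at 0 with -z and z marked]

Exponential:

pdf: $P(x) = \lambda e^{-\lambda x}$ if $x \ge 0$; $P(x) = 0$ if $x < 0$

Mean = 1/λ; Variance = 1/λ²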


Examples
• Que: At a particular junction there are two routes to a destination, and the probability of choosing the 1st route is twice that of the 2nd route. How many vehicles will turn onto Route 1 when a total of 5 vehicles reach the junction at a specified time?

With the Bernoulli distribution,
$p(x) = p^x (1 - p)^{1 - x}$, where p = P(route 1) = 2/3 ≈ 0.67
x = 0 (route 2) with probability 0.33
x = 1 (route 1) with probability 0.67

$p(1) = 0.67^1\,(1 - 0.67)^0 = 0.67$

Ans: Number of vehicles choosing Route 1 = (2/3) × 5 ≈ 3.3, say 3
Examples
• Que: The probability of choosing a particular route is 1/5. Find the probabilities that, out of 5 vehicles reaching that location, exactly 0, 1, 2, 3, 4, 5 vehicles will choose that particular route.

With the Binomial distribution,
$P(x) = \dfrac{n!}{x!\,(n - x)!}\,p^x q^{n-x}$, with n = 5 and p = 1/5

Ans:
x   P(x)
0   0.33
1   0.41
2   0.20
3   0.05
4   0.006
5   0.0003
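The table of binomial probabilities can be regenerated from the pmf. A minimal sketch, assuming Python 3.8+ for `math.comb`; the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 5, 1 / 5
probabilities = [binomial_pmf(x, n, p) for x in range(n + 1)]

for x, prob in enumerate(probabilities):
    print(x, round(prob, 4))
```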
Examples
• Ques: For 3 different routes at a particular location, the probabilities of choice are 0.35, 0.40 and 0.25 respectively. What is the probability that, out of 5 vehicles reaching the location, one, three and two vehicles will choose routes 1, 2 and 3 respectively?

By the multinomial distribution,
$p(x_1, x_2, \ldots, x_k) = \dfrac{n!}{x_1!\,x_2!\cdots x_k!}\,p_1^{x_1}\,p_2^{x_2}\cdots p_k^{x_k}$

$p(1, 3, 2) = \dfrac{5!}{1!\,3!\,2!}\,(0.35)^1\,(0.4)^3\,(0.25)^2$

Ans: 0.014
Examples
• On a motorway, the number of vehicles arriving from one direction in successive 10 sec intervals was counted and is given below. Find the probabilities P(0), P(X > 3), P(3 < X < 6), etc.

No. of veh. in 10 sec, x (i)   Frequency (ii)   Total no. of veh. (i*ii)   Total time, sec (ii*10)   P(x)
0                              11               0                          110                       0.135
1                              28               28                         280                       0.271
2                              30               60                         300                       0.271
3                              18               54                         180                       0.180
4                              8                32                         80                        0.090
5                              4                20                         40                        0.036
6                              1                6                          10                        0.012
7                              0                0                          0                         0.0034
                                                ∑ = 200                    ∑ = 1000

By the Poisson distribution,
$p(x) = \dfrac{e^{-\nu}\,\nu^x}{x!}$

ν = (200/1000)*10 = 2 vehicles per 10 sec interval

Ans:
P(0) = 0.135
P(X > 3) = 1 - [P(0) + P(1) + P(2) + P(3)] = 0.143
P(3 < X < 6) = P(4) + P(5) = 0.126
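The mean arrival rate and the Poisson probabilities can be recomputed from the observed frequencies. A minimal sketch with illustrative names; values computed from unrounded probabilities can differ in the third decimal from sums of rounded table entries.

```python
from math import exp, factorial

# Observed frequency of each vehicle count per 10 sec interval
frequency = {0: 11, 1: 28, 2: 30, 3: 18, 4: 8, 5: 4, 6: 1, 7: 0}

total_vehicles = sum(x * f for x, f in frequency.items())
total_intervals = sum(frequency.values())
nu = total_vehicles / total_intervals  # mean vehicles per 10 sec interval

def poisson_pmf(x, mean):
    """Probability of exactly x arrivals when the mean count is `mean`."""
    return exp(-mean) * mean ** x / factorial(x)

p_zero = poisson_pmf(0, nu)
p_more_than_3 = 1 - sum(poisson_pmf(x, nu) for x in range(4))
p_between_3_and_6 = poisson_pmf(4, nu) + poisson_pmf(5, nu)

print(nu, round(p_zero, 3), round(p_more_than_3, 3), round(p_between_3_and_6, 3))
```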
Example:
• Ques: If an average of 3 trucks arrive per hour to be unloaded at a warehouse, what is the probability that the time between the arrivals of successive trucks will be (i) less than 5 min., (ii) at least 45 min.?

Using the exponential distribution,

$P(x) = \lambda e^{-\lambda x}$
$P(X \le x) = \int_0^x \lambda e^{-\lambda t}\,dt = 1 - e^{-\lambda x}$

λ = 3/60 = 0.05 trucks/min.

Ans:
P(X < 5) = $1 - e^{-0.05(5)}$ = 0.2212
P(X ≥ 45) = $e^{-0.05(45)}$ = 0.105
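The two exponential probabilities follow from the cumulative distribution function. A short sketch, with `exponential_cdf` as an illustrative helper name:

```python
from math import exp

def exponential_cdf(t, rate):
    """P(X <= t) for an exponential random variable with the given rate."""
    return 1 - exp(-rate * t)

rate = 3 / 60  # 3 trucks per hour expressed per minute

p_less_than_5 = exponential_cdf(5, rate)
p_at_least_45 = 1 - exponential_cdf(45, rate)

print(round(p_less_than_5, 4), round(p_at_least_45, 4))
```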


Example:
• The spot speeds at a particular location are normally distributed with a mean of 51.7 km/hr. and std. deviation of 8.3 km/hr. What is (i) the probability that the speed exceeds 65 km/hr., (ii) the 85th percentile speed?

$z = \dfrac{x - \mu}{\sigma}$

(i) z = (65 - 51.7)/8.3 = 1.6
From the standard normal distribution table,
F(1.6) = P(X ≤ 65) = 0.9452
Ans: P(X > 65) = 1 - 0.9452 = 0.0548

[Figure: standard normal curve with P(X ≤ 65) to the left of z = 1.6 and P(X > 65) to the right]

(ii) P(X ≤ x) = F(z) = 0.85
From the standard normal distribution table,
z = 1.04
x = 1.04(8.3) + 51.7 = 60.33
Ans: x = 60.33 km/hr.

[Figure: standard normal curve with P(X ≤ x) = 0.85 to the left of z = 1.04]
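The same two quantities can be obtained without a printed table using `statistics.NormalDist` (Python 3.8+). This is a sketch only; because it does not round z to two decimals, its results differ slightly from the table-based answers above.

```python
from statistics import NormalDist

# Spot speed model: mean 51.7 km/hr., std. deviation 8.3 km/hr.
spot_speed = NormalDist(mu=51.7, sigma=8.3)

p_exceeds_65 = 1 - spot_speed.cdf(65)   # P(X > 65)
speed_85th = spot_speed.inv_cdf(0.85)   # 85th percentile speed

print(round(p_exceeds_65, 4), round(speed_85th, 2))
```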
Inferential statistics: Sampling distributions
• Sampling theory:
If a random sample of size n is taken from a population of mean $\mu$ and variance $\sigma^2$, then the sample mean $\bar{X}$ follows a normal distribution with mean $\mu$ and variance $\sigma^2/n$ (exactly for a normal population, approximately for large n).

The standard error of the mean is given by $\sigma/\sqrt{n}$.

• Central limit theorem:
If $\bar{x}$ is the mean of a sample of size n taken from a population of mean $\mu$ and variance $\sigma^2$, then the variate $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$ approaches the standard normal distribution as $n \to \infty$.
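Central limit theorem – Error estimate

With probability $1 - \alpha$,
$-z_{\alpha/2} \le \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_{\alpha/2}$

[Figure: standard normal curve with central area 1 - α between -z_{α/2} and z_{α/2}, and area α/2 in each tail]

so the error of the estimate satisfies
$E = |\bar{X} - \mu| \le z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$, where $\alpha$ is the level of significance.

• Level of significance: the probability that the computed estimate will lie outside the indicated range. The range itself corresponds to the confidence level, $1 - \alpha$.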
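Example
• While determining the mean speed of vehicles on a section of road, an engineer wants to be able to assert with 95% confidence that the estimated mean speed is off by at most 2.5 km/hr. If the std. deviation is 8.2 km/hr., how large a sample is needed?

$E = |\bar{X} - \mu| \le z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$

$1 - \alpha = 0.95$; $\alpha = 0.05$; $\alpha/2 = 0.025$
$z_{\alpha/2} = 1.96$

$2.5 = \dfrac{1.96 \times 8.2}{\sqrt{n}}$, so $n = \left(\dfrac{1.96 \times 8.2}{2.5}\right)^2 \approx 41.3$

Sample size, n = 41 (rounding up to 42 guarantees the stated error bound)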
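The sample-size calculation can be sketched as below. The code solves E = z·σ/√n for n; it reports both the unrounded value and the value rounded up, since rounding up is the conservative choice when the error bound must hold. Variable names are illustrative.

```python
from math import ceil
from statistics import NormalDist

confidence = 0.95
sigma = 8.2       # std. deviation, km/hr.
max_error = 2.5   # largest acceptable error in the mean, km/hr.

# z value leaving alpha/2 in the upper tail (about 1.96 for 95%)
z_half_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

n_exact = (z_half_alpha * sigma / max_error) ** 2
n_rounded_up = ceil(n_exact)

print(round(z_half_alpha, 2), round(n_exact, 1), n_rounded_up)
```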
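Central Limit theorem
• Confidence interval (C.I.) for the population mean µ

$\text{C.I.} = \left( \bar{X} - \dfrac{z_{\alpha/2}\,\sigma}{\sqrt{n}},\ \bar{X} + \dfrac{z_{\alpha/2}\,\sigma}{\sqrt{n}} \right)$

[Figure: standard normal curve with central area 1 - α = 0.95 between -z_{α/2} = -1.96 and z_{α/2} = 1.96, and area α/2 = 0.025 in each tail]

• Example:
A random sample of size 100 is taken from a population with std. deviation 5.1. Given that the sample mean is 21.6, construct a 95% confidence interval.

$\text{C.I.} = \left( 21.6 - \dfrac{1.96 \times 5.1}{\sqrt{100}},\ 21.6 + \dfrac{1.96 \times 5.1}{\sqrt{100}} \right)$

Ans: (20.6, 22.6)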
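Distributions from Normal distribution
Chi-square distribution with 'n' degrees of freedom

For independent standard normal random variables $z_1, z_2, \ldots, z_n$:
$\chi^2_n = z_1^2 + z_2^2 + \cdots + z_n^2$

$P(X \ge \chi^2_{\alpha,n}) = \alpha$

$\text{pdf} = \dfrac{x^{\frac{n}{2}-1}\,e^{-\frac{x}{2}}}{2^{\frac{n}{2}}\left(\frac{n}{2}-1\right)!},\ x > 0$

[Figure: standard normal density (marking z) alongside the chi-square density, with upper-tail area α beyond $\chi^2_{\alpha,n}$]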
Distributions from Normal distribution
t-distribution with 'n' degrees of freedom

z: random variable with standard normal distribution
$\chi^2_n$: random variable with chi-square distribution, independent of z

$t_n = \dfrac{z}{\sqrt{\chi^2_n / n}}$

$P(t_n \ge t_{\alpha,n}) = \alpha$

As n becomes large, $\chi^2_n / n \to 1$, so $t_n \to z$.

[Figure: t density with upper-tail area α beyond $t_{\alpha,n}$]
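Distributions from Normal distribution
F-distribution with degrees of freedom 'n' and 'm'

For independent chi-square random variables $\chi^2_n$ and $\chi^2_m$:
$F_{n,m} = \dfrac{\chi^2_n / n}{\chi^2_m / m}$

$P(F_{n,m} \ge F_{\alpha,n,m}) = \alpha$

$\dfrac{1}{F_{\alpha,n,m}} = F_{1-\alpha,m,n}$

[Figure: F density starting at 0, with upper-tail area α beyond $F_{\alpha,n,m}$]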
How to use these sampling distributions to draw conclusions?
• Hypothesis testing
  • Concerned with two distinct choices:
    • Null hypothesis (H0)
    • Alternate hypothesis (H1)
  • Test whether to accept or reject H0 using various test statistics.
  • Two types of errors:

Two possibilities   Decision
                    Accept H0       Reject H0
H0 true             Correct         Type I error
H1 true             Type II error   Correct
Testing the hypothesis
• One tail or two tail?

[Figure: one-tailed test, acceptance region of area 1-α with a single rejection region of area α; two-tailed test, acceptance region of area 1-α with a rejection region of area α/2 in each tail]

• Confidence level, 1-α: probability that the computed estimate will lie in the acceptance region
• Level of significance, α: probability that the computed estimate will lie in the rejection region
Distribution statistics in hypothesis testing
• Que. No. 1: The spot speeds at a particular location on an expressway are known to be normally distributed with a mean of 80 km/hr. and std. dev. of 15 km/hr. A new radar speed meter was bought by the traffic dept. and a set of 100 observations was taken. The mean speed observed was 77.3 km/hr. Is there any evidence that:
(i) the new speed meter might be faulty
(ii) the new speed meter is showing a lower speed than actual?
Assume a 5% level of significance.
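The worked solution is not included in this text. One possible approach to Que. No. 1, sketched here under the assumption that the large-sample z statistic z = (X̄ – µ0)/(σ/√n) applies, treats part (i) as a two-tailed test and part (ii) as a lower-tailed test. The variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu0 = 80.0          # known mean speed, km/hr.
sigma = 15.0        # known std. dev., km/hr.
n = 100             # number of observations with the new meter
sample_mean = 77.3  # observed mean speed, km/hr.
alpha = 0.05        # level of significance

z = (sample_mean - mu0) / (sigma / sqrt(n))

standard_normal = NormalDist()
z_half_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-tailed critical value
z_alpha = standard_normal.inv_cdf(1 - alpha)           # one-tailed critical value

# (i) H1: mu != mu0, reject H0 if |z| exceeds the two-tailed critical value
reject_two_tailed = abs(z) > z_half_alpha

# (ii) H1: mu < mu0, reject H0 if z falls below the negative one-tailed critical value
reject_lower_tailed = z < -z_alpha

print(round(z, 2), reject_two_tailed, reject_lower_tailed)
```

Under these assumptions z is about -1.8, so H0 is not rejected in the two-tailed test but is rejected in the lower-tailed test.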
Distribution statistics in hypothesis testing
• Que. No. 2: The mean spot speed of 15 vehicles observed on a Sunday at a particular roadway was 81.2 km/hr. The mean speed of all vehicles at this location as per previous records was 75.5 km/hr., with std. dev. 10.2 km/hr. Is there sufficient evidence to show that the speeds of vehicles on that Sunday were higher than the average speed? Take the level of significance as 5%.
Distribution statistics in hypothesis testing
• Ques. No. 3: Two samples of speed data are collected as follows:
For sample 1, mean speed is 74.3 km/hr. and std. dev. is 7 km/hr. (n1 = 120)
For sample 2, mean speed is 72.5 km/hr. and std. dev. is 8 km/hr. (n2 = 120)
Is there any evidence that the mean speed reduced by more than 0.5 km/hr. when using these samples? Assume the level of significance as 10%.
Distribution statistics in hypothesis testing
• Que. No. 4: For a given vehicle speed data sample of size 20, the standard deviation observed was 12.5 km/hr. The data can be used only if the standard deviation is approximately equal to 10 km/hr. Check whether the data can be accepted at the 5% level of significance.
Distribution statistics in hypothesis testing
• Que. No. 5: It is desired to determine whether there is less variability in the speed data collected for day 1 than for day 2. Independent random samples are taken for these two days as below:
For day 1: std. dev. = 12 km/hr.; sample size = 12
For day 2: std. dev. = 10 km/hr.; sample size = 14
Test the given hypothesis with a level of significance of 5%.
Distribution statistics – Hypothesis testing
• Que. No. 6: Vehicle count data was collected every minute for a period of 65 minutes. Determine, at the 95% confidence level, whether the data follows a Poisson distribution.

No. of arrivals   Observed frequency
0                 2
1                 6
2                 7
3                 12
4                 13
5                 9
6                 9
7                 4
8                 2
9                 1

To test the fit of data to a particular distribution, use the 'GOODNESS OF FIT' test.
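Summary of test statistics for Hypothesis testing
Hint: µ0 = population mean, σ0 = population std. dev.

Large sample – comparing mean
H0: $\mu = \mu_0$
Test statistic: $z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$

H1                 Reject H0 if
$\mu < \mu_0$      $z < -z_\alpha$
$\mu > \mu_0$      $z > z_\alpha$
$\mu \ne \mu_0$    $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$

Small sample / unknown variance – comparing mean
H0: $\mu = \mu_0$
Test statistic: $t = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$

H1                 Reject H0 if
$\mu < \mu_0$      $t < -t_\alpha$
$\mu > \mu_0$      $t > t_\alpha$
$\mu \ne \mu_0$    $t < -t_{\alpha/2}$ or $t > t_{\alpha/2}$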
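Summary of test statistics for Hypothesis testing
Hint: µ0 = population mean, σ0 = population std. dev.

Comparison of sample means
H0: $\mu_1 - \mu_2 = \delta$
Test statistic: $z = \dfrac{(\bar{X}_1 - \bar{X}_2) - \delta}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$

H1                          Reject H0 if
$\mu_1 - \mu_2 < \delta$    $z < -z_\alpha$
$\mu_1 - \mu_2 > \delta$    $z > z_\alpha$
$\mu_1 - \mu_2 \ne \delta$  $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$

One variance
H0: $\sigma^2 = \sigma_0^2$
Test statistic: $\chi^2 = \dfrac{(n - 1)\,s^2}{\sigma_0^2}$

H1                          Reject H0 if
$\sigma^2 < \sigma_0^2$     $\chi^2 < \chi^2_{1-\alpha}$
$\sigma^2 > \sigma_0^2$     $\chi^2 > \chi^2_{\alpha}$
$\sigma^2 \ne \sigma_0^2$   $\chi^2 < \chi^2_{1-\alpha/2}$ or $\chi^2 > \chi^2_{\alpha/2}$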
Summary of test statistics for Hypothesis testing
Hint: µ0 = population mean, σ0 = population std. dev.

Two variances
H0: $\sigma_1^2 = \sigma_2^2$

H1                            Test statistic                                      Reject H0 if
$\sigma_1^2 > \sigma_2^2$     $F = s_1^2 / s_2^2$                                 $F > F_{\alpha,\,n_1-1,\,n_2-1}$
$\sigma_1^2 < \sigma_2^2$     $F = s_2^2 / s_1^2$                                 $F > F_{\alpha,\,n_2-1,\,n_1-1}$
$\sigma_1^2 \ne \sigma_2^2$   $F = s_{\text{large}}^2 / s_{\text{small}}^2$       $F > F_{\alpha/2,\,n_{\text{large}}-1,\,n_{\text{small}}-1}$

Underlying distribution
H0: Data follows the given distribution
H1: Data does not follow the given distribution
Test statistic: $\chi^2 = \sum_i \dfrac{(O_i - E_i)^2}{E_i}$
Reject H0 if $\chi^2 > \chi^2_\alpha$
Thank You
