Module I. Basic Calculations. Average, Standard Deviation by Excel (5)

Uploaded by

celeanahib
Copyright
© All Rights Reserved
Module I

Descriptive Statistics
Summarizing and Describing a Single Set
of Observations
R. Lopez
MAC 2205
Statistics Applied Education
MRU
Module 1. Objectives. Review
• Objectives
• 1.- Basic concepts.
• 1.1.-Population, Variable types, sample, distribution.
• 1.2.-Normal distribution, Binomial, Poisson and Chi Square.
• 2 Describing Data from a Research in Education.
• 2.1.- Different kinds of data:
• 2.1.1.-Categorical Data
• 2.1.2.- Interval Data
• 3.- Statistics for:
• 3.1.-Central tendency: Mean, Mode and Median.
• 3.2- Dispersion: Standard Deviation, Variance, Standard Error, Confidence interval, Quartiles.
• 3.3.-Form of the distribution: Skewness and Kurtosis
• 4.- Graphs to represent Distributions:
• 4.1.-Histograms and Box Plots
• 4.2- Relative Position
• 5.- Stratification: Data Comparison
Descriptive Statistics and Inferential
Statistics
• Descriptive statistics definition:
• Descriptive statistics serve only to describe a set of data, either by
calculating summary statistics or by representing the data graphically.
• Statistics for Central Tendency: Mean, Median, Mode
• Statistics for Spread or Variability: Variance, Standard deviation,
confidence interval, error, range, quartiles, percentiles.
• Statistics for symmetry: Skewness and Kurtosis.
• Graphics: Histograms, Boxplots, etc.
Inferential statistics
• With inferential statistics, we are trying to reach conclusions that
extend beyond the immediate data alone. For instance, we use
inferential statistics to try to infer from the sample data what the
population might think.
• We use inferential statistics to make inferences from our data to
more general conditions; we use descriptive statistics simply to
describe what's going on in our data.
• Inferential statistics are useful in experimental and quasi-experimental
research design or in program outcome evaluation. The simplest
inferential test is used when you want to compare the average
performance of two groups on a single measure to see if there is a
significant difference.
• An example: We might want to know whether eighth-grade boys and
girls differ in math test scores or whether a program group differs on
the outcome measure from a control group. We use the t-test to
determine significant differences between the averages of two
groups.
• Most of the major inferential statistics come from a general family of
statistical models known as the General Linear Model.
• This includes the t-test, Analysis of Variance (ANOVA), Analysis of
Covariance (ANCOVA), regression analysis, and many of the
multivariate methods like factor analysis, multidimensional scaling,
cluster analysis, discriminant function analysis, and so on.
Introduction: The first step in selecting the proper
statistical test is to identify the variable
type of the data we have:
Categorical Data
• Categorical data (nominal and ordinal):
• Nominal data consist of names or identifiers.
• For example, gender is a nominal variable that can be identified by
labels: male or female
• or by a number associated with each label: 0 (male), 1 (female).
• Ordinal data are used to compare subjects that are organized
following an order or ranking scheme.
• For example, if we want to categorize the degree of mental illness of
patients, we cannot attach a value to each of them, but only an order
with respect to "less than", "equal to" or "greater than".
Categorical Data
• Nominal and ordinal data are often summarized with bar charts.
Interval
• Interval data are represented with
numbers, and the differences between
values have a meaning.
• Ratio data are also represented with
numbers and, as with interval data, the
differences between values have a
meaning and can be measured, with a
precision that depends on the
instrument or the technique we use
to measure them.
• The difference from interval data is
that ratio data have a clearly
interpretable zero.
Sources for data
• When searching for information on a topic, it is important to
understand the value of primary, secondary, and tertiary sources.
• Primary sources allow researchers to get as close as possible to
original ideas, events, and empirical research. Such
sources may include creative works, first-hand or contemporary
accounts of events, and the publication of the results of empirical
observations or research.
Introduction. Important definitions
• Descriptive statistics (summarization):
• A data set is a collection of facts and values.
• The first purpose is to represent the data set, usually by
means of histograms, bar charts, pie charts, distributions, etc.
• The second purpose is to calculate the statistical parameters
that can represent the data in a summarized way:
• Central Tendency
• Dispersion
• Shape of the distribution
Statistics for interval and ratio data.
Central Tendency.
Central Tendency:
Mean: For a data set, the mean is the central
value obtained as the sum of the values divided by
the number of values.

Median: The median is the value such that the same
number of values lie above (larger than) and below
(less than) it. With an odd number of observations it is
the middle value; with an even number it is usually taken
as the average of the two middle values.
50% of the observations are below the median and 50%
are above.

Mode: The mode is the value that appears most often
in a set of data; it is the value that is most likely to be
sampled.
Mean value or average: $\bar{X}$
• The mean value or average is calculated by adding all the values of a
particular variable in a data set and dividing the sum by the total number
of values n:

• $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{x_1 + x_2 + x_3 + x_4 + \dots + x_n}{n}$
• For the following set of individual outcomes, calculate the average
or mean value (using Excel):

Subject  1    2    3    4    5    6    7    8    9
Values   103  108  95   110  109  103  92   98   105

Result: Sum = 923

Average = 923 / 9 = 102.56
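The same arithmetic can be cross-checked outside Excel. The following is a minimal Python sketch, not part of the original slides; the variable names are illustrative, and the values are the nine outcomes listed in the table.

```python
from statistics import mean

# Values for subjects 1-9, copied from the table on this slide.
values = [103, 108, 95, 110, 109, 103, 92, 98, 105]

total = sum(values)      # sum of all observations
n = len(values)          # number of observations
average = total / n      # definition of the arithmetic mean

print(f"Sum = {total}, n = {n}, Average = {average:.2f}")

# The standard-library helper should agree with the manual formula.
print(f"statistics.mean gives {mean(values):.4f}")
```

Computing the mean both ways is only a sanity check that the manual formula and the library function agree.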
Using Excel for Statistical
calculations
• Open Excel:
• Open a blank workbook:
The Excel sheet will open
Type the data in any column
Selecting the average
• On some computers you may first need to click on HOME
• so that the ribbon with icons appears at the top of the page
• Click on ∑ AutoSum and
• click on the small arrow next to it to open the drop-down list:
• Click on Average: the formula =AVERAGE(B2:B10) will appear in the cell,
• where B2 is the cell with the first value and B10 the cell
with the last value. The same range can be selected by highlighting the
column with the data. Then press Enter and the average 102.5556 will appear.
• To obtain the median, type =MEDIAN(B2:B10), press Enter, and 103 will
appear.
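For readers who want to confirm the median without Excel, here is a small Python sketch under the assumption that the same nine values are used. It also reports the mode, which the slides define but do not compute for this data.

```python
from statistics import median, mode

# Same nine values that were typed into cells B2:B10.
values = [103, 108, 95, 110, 109, 103, 92, 98, 105]

ordered = sorted(values)          # the median is the middle value of the sorted data
middle = ordered[len(ordered) // 2]

print("Sorted values:", ordered)
print("Middle value :", middle)
print("Median       :", median(values))
print("Mode         :", mode(values))   # most frequent value
```

With nine observations the middle (fifth) sorted value is the median, which is why no averaging of two values is needed here.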
Spread or variability
• Spread refers to the variability in a data set and can be measured by different
parameters: variance, standard deviation, range, interquartile
range, standard error, confidence interval, etc.
• The range is the simplest measure of variability, but it should not be used
when outliers are present, that is, values that are significantly higher or lower than
the rest of the data set. Excel and SPSS can be used to calculate these values.
• The range equals the largest observation minus the smallest observation.
The interquartile range (IQR) measures the range of the middle fifty percent
of the data. IQR = Q3 – Q1, where Q3 is the third quartile (75% of the
observations are below Q3 and 25% are above) and Q1 is the first quartile
(25% of the observations are below Q1 and 75% are above).
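The range and the IQR can be illustrated with the nine values used for the mean. This is a sketch only: several quartile conventions exist, and the "inclusive" interpolation chosen here is an assumption (it interpolates between data points, which I believe corresponds to Excel's QUARTILE.INC). Other conventions can give slightly different quartiles.

```python
from statistics import quantiles

values = [103, 108, 95, 110, 109, 103, 92, 98, 105]

# Range: largest observation minus smallest observation.
data_range = max(values) - min(values)

# Quartiles with the "inclusive" method (an assumed convention, see lead-in).
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

print(f"Range = {data_range}")
print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
print(f"IQR = Q3 - Q1 = {iqr}")
```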
The Standard Deviation and the Variance.
• The standard deviation is the square root of the variance.
• The variance is the average squared distance from the mean (for a sample, the sum of
squared distances is divided by n − 1). Hence the units of
measurement for the variance are squared units.
• The popularity of the standard deviation as a measure of variability is due to the fact that
the Normal distribution is defined by two parameters: the mean and the standard
deviation.
• An important empirical rule regarding the Normal distribution is that about 68%
of observations are within one standard deviation of the mean and about 95% are
within two standard deviations of the mean.
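The coverage fractions in the empirical rule can be checked numerically. The sketch below uses Python's standard-library NormalDist for a standard Normal distribution (mean 0, standard deviation 1); it is an illustration, not part of the original slides.

```python
from statistics import NormalDist

# Standard Normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

# Probability of falling within k standard deviations of the mean.
within_1 = std_normal.cdf(1) - std_normal.cdf(-1)
within_2 = std_normal.cdf(2) - std_normal.cdf(-2)

print(f"Within 1 standard deviation : {within_1:.4f}")
print(f"Within 2 standard deviations: {within_2:.4f}")
```

The first fraction is roughly 0.68 and the second roughly 0.95, which is where the rule of thumb comes from.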
Variance, Standard Deviation,
Standard error

[Figure: formulas for the standard error with proportions and the standard error with the standard deviation]
Normal distribution of data. Gauss’s distribution.
• The Normal Distribution is a very important statistical data
distribution pattern occurring in many natural phenomena, such as
height, human population, blood pressure, etc.
Determining the Standard deviation
by Excel
• Continuing with the same data:
• Click on ∑ AutoSum, scroll down to More Functions and there
click on STDEV, or type =STDEV(B2:B10), and 6.346478 will appear
• Remember that the variance is the square of the standard deviation
• = 40.27778
• The standard error is St. error = St. Dev / (square root of n)
• St. error = 6.346478 / square root of 9 = 6.346478 / 3 = 2.115493
• The importance of the standard error is that if we multiply it by ±1.96,
or approximately ±2.0, we obtain the 95% confidence interval
• 95% confidence interval = ±2.0 * 2.115493 = ±4.2310
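The chain from standard deviation to variance, standard error and the 95% margin can be reproduced in a few lines. This is a minimal Python sketch assuming the same nine values; it uses the slide's rounded factor of 2.0 rather than an exact critical value, and the variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

values = [103, 108, 95, 110, 109, 103, 92, 98, 105]
n = len(values)

s = stdev(values)          # sample standard deviation (divides by n - 1), like Excel's STDEV
var = variance(values)     # sample variance, the square of s
se = s / sqrt(n)           # standard error of the mean
margin = 2.0 * se          # approximate 95% margin, using the slide's factor of 2.0

print(f"Standard deviation = {s:.6f}")
print(f"Variance           = {var:.5f}")
print(f"Standard error     = {se:.6f}")
print(f"95% margin         = +/- {margin:.4f}")
```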
Departure from the Normal
Distribution
• Skewness
• In probability theory and statistics, skewness is a measure of the
asymmetry of the probability distribution of a real-valued random
variable about its mean. The skewness value can be positive or
negative, or undefined.
• For a unimodal distribution, negative skew commonly indicates that
the tail is on the left side of the distribution, and positive skew
indicates that the tail is on the right. In cases where one tail is long
but the other tail is fat, skewness does not obey a simple rule. For
example, a zero value means that the tails on both sides of the mean
balance out overall; this is the case for a symmetric distribution, but
can also be true for an asymmetric distribution where one tail is long and thin, and the other is short but fat.
• The Pearson mode skewness, or first skewness coefficient, is defined
as:

• (mean − mode) / standard deviation


• But Excel can calculate the skewness directly from the data:
• typing =SKEW(B2:B10) in a cell and pressing Enter gives -0.53163
• If the absolute value of the skewness is smaller than 2.0, a common rule of
thumb treats the data as not departing substantially from the Normal
distribution, as is the case for this data.
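To see what Excel's SKEW does with these values, the sketch below implements the adjusted sample skewness formula documented for that function, n / ((n − 1)(n − 2)) times the sum of cubed standardized deviations. Note that this is a different quantity from the Pearson mode skewness; the function name sample_skewness is illustrative.

```python
from statistics import mean, stdev

def sample_skewness(data):
    """Adjusted sample skewness, following the formula documented for Excel's SKEW."""
    n = len(data)
    m = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    cubed = sum(((x - m) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed

values = [103, 108, 95, 110, 109, 103, 92, 98, 105]
skew = sample_skewness(values)

print(f"Skewness = {skew:.4f}")
print("Within the +/- 2.0 rule of thumb:", abs(skew) < 2.0)
```

The negative sign indicates a slightly longer tail on the left, driven mainly by the low values 92 and 95.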
Parameters that represent the
Normal Distribution and the 95%
confidence interval
In the Normal distribution the mean, the
median and the mode are equal.
That distribution is symmetrical and bell
shaped (skewness = 0 and kurtosis = 3).
The so-called 95% confidence interval is the range
that should contain the population mean with 95%
confidence. For a sample taken from the Normal
distribution it can be calculated by:

$\bar{X} \pm 2 \cdot SE_{\bar{X}}$
$SE_{\bar{X}} = S / \sqrt{n}$
The 95% confidence interval:
• For the following data, calculate the mean value, the variance, the
standard deviation, the standard error and the 95% interval, assuming a Normal
distribution, for the lower blood pressure value of females and
males.

Females:
Subject  1   2   3   4   5   6   7   8   9
x        81  82  84  88  84  90  88  86  87
Total sum = 770   Average = 85.56

Males:
Subject  1   2   3   4   5   6   7   8   9
x        87  90  93  91  90  89  92  90  88
Total sum = 810   Average = 90
The variance and the Standard
Deviation

Variance calculation for females:

$S^2 = [(81-85.56)^2 + (82-85.56)^2 + (84-85.56)^2 + (88-85.56)^2 + (84-85.56)^2 + (90-85.56)^2 + (88-85.56)^2 + (86-85.56)^2 + (87-85.56)^2] / (9-1)$

$S^2 = 9.02778$, $S = 3.004626$

For males:
$S^2 = [(87-90)^2 + (90-90)^2 + (93-90)^2 + (91-90)^2 + (90-90)^2 + (89-90)^2 + (92-90)^2 + (90-90)^2 + (88-90)^2] / (9-1) = 3.5$
$S = 1.870829$
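The two hand calculations can be verified with a short Python sketch. It spells out the sum of squared deviations divided by n − 1 and compares it with the standard-library function; the helper name sample_variance is illustrative and not from the slides.

```python
from math import sqrt
from statistics import mean, variance

females = [81, 82, 84, 88, 84, 90, 88, 86, 87]
males = [87, 90, 93, 91, 90, 89, 92, 90, 88]

def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    m = mean(data)
    return sum((x - m) ** 2 for x in data) / (len(data) - 1)

for label, data in (("Females", females), ("Males", males)):
    s2 = sample_variance(data)
    print(f"{label}: S^2 = {s2:.5f}, S = {sqrt(s2):.6f}, "
          f"library variance = {variance(data):.5f}")
```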
Calculations by Excel
            Females    Males
            81         87
            82         90
            84         93
            88         91
            84         90
            90         89
            88         92
            86         90
            87         88
Average     85.55556   90
St.Dev      3.004626   1.870829
Median      86         90
Skewness    -0.15946   0

Average: click on Home, click on AutoSum, select Average.
St.Dev: click on Home, click on AutoSum, select StDev, then select the interval of the values D3 to D11.
Median: click on Home, click on AutoSum, select Median, then select the interval of the values D3 to D11.
Skewness: click on Home, click on AutoSum, type =SKEW, then select the interval of the values D3 to D11.
Calculations by Excel: summary
          Females    Males
          81         87
          82         90
          84         93
          88         91
          84         90
          90         89
          88         92
          86         90
          87         88
Sum       770        810
Average   85.55556   90
St.Dev.   3.004626   1.870829

Confidence interval at 95%: $\bar{X} \pm 2 \cdot SE_{\bar{X}}$

For females:
$SE_{\bar{X}} = S / \sqrt{n} = 3.00 / 3 = 1.00$
85.56 ± 2 * 1.00 = 85.56 ± 2.00
The 95% interval: 83.56 to 87.56

For males:
$SE_{\bar{X}} = S / \sqrt{n} = 1.871 / 3 = 0.624$
90 ± 2 * 0.624 = 90 ± 1.25
The 95% interval: 88.75 to 91.25

Can you conclude that the two groups differ at the
95% confidence level?
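One way to approach the question is to compute both intervals and check whether they overlap. The sketch below assumes the slide's factor of 2 and the two data sets listed earlier; the helper name ci95 is illustrative. Comparing intervals for overlap is an informal check, not a substitute for a formal test.

```python
from math import sqrt
from statistics import mean, stdev

females = [81, 82, 84, 88, 84, 90, 88, 86, 87]
males = [87, 90, 93, 91, 90, 89, 92, 90, 88]

def ci95(data, factor=2.0):
    """Approximate 95% interval for the mean: mean +/- factor * standard error."""
    m = mean(data)
    se = stdev(data) / sqrt(len(data))
    return m - factor * se, m + factor * se

f_low, f_high = ci95(females)
m_low, m_high = ci95(males)

print(f"Females: {f_low:.2f} to {f_high:.2f}")
print(f"Males  : {m_low:.2f} to {m_high:.2f}")

# The intervals overlap if each one starts before the other one ends.
overlap = f_low <= m_high and m_low <= f_high
print("Intervals overlap:", overlap)
```

The small differences from the slide's figures come from the slide rounding the standard deviation before dividing.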
The 95% confidence interval
• As we saw in the previous slide, the 95% confidence interval can
be used to determine whether we obtained a significant difference (at the
95% level) between the lower blood pressures of males and females. The same
kind of statistical analysis can be performed to compare results or
outcomes from a statistical trial.
• Even though we did that manually, there are different software packages that
help us do all those calculations; the easiest is perhaps Excel.
SPSS Statistics GradPack
• IBM offers an affordable version of SPSS version 24 for only 40 dollars.
• Students may purchase SPSS GradPack from their college or university, or
they may purchase it from the official distributors of IBM SPSS analytics
software for academic customers.
• SPSS Statistics Desktop
• Trial download for version 25.0 or newer, for example 26.0
• Get your IBM SPSS 6-month trial for students in three minutes or
less! Start leveraging your data today to identify your best customers,
forecast future trends, improve supplier performance, and more. This trial
software expires 6 months from the installation date. When purchasing, it
may take two business days to receive your authorization keys from IBM
support.
Skewness
• In probability theory and statistics, skewness is a measure of the
asymmetry of the probability distribution of a real-valued random
variable about its mean. The skewness value can be positive or
negative, or even undefined. The qualitative interpretation of the
skew is complicated.
Kurtosis
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather
rapidly, and have heavy tails (leptokurtic).
Ogive
• An ogive is the roundly tapered end of a two-dimensional or three-
dimensional object. Villard de Honnecourt, a 13th-century itinerant
master-builder from the Picardy in the north of France, was the first
writer to use the word ogive.
Fractal
• A fractal is a natural phenomenon or a mathematical set that exhibits
a repeating pattern that displays at every scale. It is also known as
expanding symmetry or evolving symmetry. If the replication is exactly
the same at every scale, it is called a self-similar pattern. An example
of this is the Menger Sponge. Fractals can also be nearly the same at
different levels.
Statistical calculations
• During the next presentation, we'll learn the statistical calculations
that can be applied to the cases that we'll discuss, beginning with:
• 1.- Comparing the mean value of the results.
• 2.- Comparing the average of one group with the average
of a second group.
• How?
• We'll use either the Z score from the Normal distribution, when the
sample size is > 30:
• Z = (average1 − average2) / standard error
• or the t test, for sample size < 30:
• t = (average1 − average2) / standard error
• We'll learn how to do that later.
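As a preview, the ratio above can be sketched in Python for the female and male blood pressure data used earlier. The slides do not say how the standard error of the difference is obtained; the sketch assumes the unpooled form sqrt(s1²/n1 + s2²/n2), and the function name two_sample_statistic is illustrative. Deciding significance would still require comparing the result with a critical value, which is left for the later presentation.

```python
from math import sqrt
from statistics import mean, variance

females = [81, 82, 84, 88, 84, 90, 88, 86, 87]
males = [87, 90, 93, 91, 90, 89, 92, 90, 88]

def two_sample_statistic(group1, group2):
    """(average1 - average2) / standard error of the difference.

    Assumption: the standard error is the unpooled form
    sqrt(s1^2 / n1 + s2^2 / n2); the slides do not specify it.
    """
    se_diff = sqrt(variance(group1) / len(group1) + variance(group2) / len(group2))
    return (mean(group1) - mean(group2)) / se_diff

t = two_sample_statistic(females, males)
print(f"t = {t:.3f}")
```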
Constructing graphs by Excel
• First you should have a column with several values of a variable:
• Please open Excel and in any column type:
Grades of a previous course
82
85
75
87
81
83
77
84
80
90
76
88
79
79
78
Then into Excel :
• Highlight the whole column.
• Then click on Insert (at the top).
• Different options should appear; notice that one group offers
different charts. You can click on Recommended Charts to see whether
one of them is convenient (in this case, none is).
• Look for the icon that looks like three rectangles, named
Histogram. Copy the chart and paste it into the PowerPoint presentation or any
Word report that you like.
histogram
• The histogram reports the same information as your data, but divided
into 3 classes. Those classes divide the so-called
• range = highest value − lowest value = 90 − 75 = 15
• into 3 classes. The number of classes depends on how many values you
have: in this case we have n = 14 values and take approximately the
square root of n, here 3 classes. Dividing the range by 3 gives
the so-called interval of each class. The value in the middle
of each rectangle or class is called the mark of that class.
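The class arithmetic can be sketched from the extreme values alone. The following Python example assumes the lowest and highest grades stated on the slide (75 and 90) and three classes; Excel's automatic histogram may choose its bins differently, so treat this as an illustration of the manual rule only.

```python
# Extreme values and number of classes as stated on the slide.
lowest, highest = 75, 90
num_classes = 3

data_range = highest - lowest            # range = highest value - lowest value
width = data_range / num_classes         # interval (width) of each class

# Class boundaries, then the mark (midpoint) of each class.
edges = [lowest + i * width for i in range(num_classes + 1)]
marks = [(left + right) / 2 for left, right in zip(edges, edges[1:])]

print(f"Range = {data_range}, class interval = {width}")
for left, right, mark in zip(edges, edges[1:], marks):
    print(f"Class {left} to {right}: mark = {mark}")
```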
Box and whisker graph
• Go to Excel again, but now select the chart for Box and Whisker:
Analyzing the Box and Whisker graph
• Let's explain the information that it provides:
• Determine the average and the standard deviation as we did before:
• Average = 81.6. Notice that this is the value of the X sign in the box and
whisker graph, which lies close to the line in the
middle of the box (the median).
• St.Dev. = 4.627319733, rounded to 4.6
• With the standard deviation we can calculate the interval of the mean ± 2
standard deviations, which covers about 95% of the values for Normally
distributed data and is close to the values at the ends of the
whiskers:
• 81.6 ± 2 * 4.6 = 72.4 to 90.8.
• The upper line of the box is the upper quartile, the value that has 25% of
the values above it and 75% of the values below it. The line at the bottom of
the box is the lower quartile, the value that has 75% of the values above it
and 25% of the values below it.
As a summary, during the next presentation we'll
study different types of statistical tests.
• To compare two averages:
• Student t test or the Z score (if the data are in accordance with the Normal
distribution)
• To compare variances or more than two averages:
• The ANOVA test or Fisher test, or the Kruskal-Wallis test if the data don't follow
the Normal distribution (non-parametric test)
• To compare medians or ranks (for non-parametric data):
• Mann-Whitney test (if the data don't follow the Normal distribution)
• To compare proportions:
• Binomial distribution.
• To compare frequencies:
• Chi square test.

The scatter plot. Go to Insert, then click
on Recommended Charts and select the scatter
plot. All the values will appear as dots
connected by a broken line.
[Chart: scatter plot titled "Chart Title"; vertical axis from 65 to 95, horizontal axis from 0 to 16]
Conclusion
• We only presented the typical statistical tests that we are going to use
for the different types of cases.
• We won't request that students calculate their results using SPSS
or manually, because we'll use Excel for those calculations.
