Lecture_1_2_notes_BA

notes

Uploaded by

bhaskkar

Lecture Notes: 1 & 2

What is the Mean and How to Find It: Definition & Formula
What is the Mean?

The mean in math and statistics summarizes an entire dataset
with a single number representing the data’s center point or
typical value. It is also known as the arithmetic mean, and it is
the most common measure of central tendency. It is frequently
called the “average.”

Learn how to find the mean and know when it is and is not a
good statistic to use!

How to Find the Mean

Finding the mean is very simple. Just add all the values and
divide by the number of observations. The mean formula is
below:

mean = (sum of all values) / (number of observations)

For example, suppose the heights of five people are 48, 51, 52,
54, and 56 inches. Here’s how to find the mean:

(48 + 51 + 52 + 54 + 56) / 5 = 261 / 5 = 52.2

Their average height is 52.2 inches.
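
The same arithmetic can be checked with a few lines of Python. This is only a minimal sketch; the variable names are chosen here for illustration and are not part of the lecture.

```python
# Heights (in inches) of the five people from the example above.
heights = [48, 51, 52, 54, 56]

# Step 1: add all the values.
total = sum(heights)

# Step 2: divide by the number of observations.
count = len(heights)
mean_height = total / count

print(f"Sum = {total}, n = {count}, mean = {mean_height}")
```

Note that the whole sum is divided by 5, not just the last value, which is why the parentheses matter when the calculation is written on one line.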

Mean Formula

There are two versions of the mean formula in math: the sample
and population formulas. In each case, the process for how to
find the mean mathematically does not change. Add the values
and divide by the number of values. However, the formula
notation differs between the two types.

Module-1 Lectures:1 &2 Bhaskar


Sample Mean Formula

The sample mean formula is the following:

x̄ = (∑x) / n

Where:

o x̄ is the sample average of variable x.
o ∑x = sum of the n values.
o n = number of values in the sample.

Typically, the sample formula notation uses lowercase letters.

Population Mean Formula

The population mean formula is the following:

µ = (∑X) / N

Where:

o µ is the population average.
o ∑X = sum of the N values.
o N = number of values in the population.

Typically, the population mean formula notation uses Greek
and uppercase letters.

When Do You Use the Average?

Ideally, the mean in math (or the average) indicates the region
where most values in a distribution fall. Statisticians refer to it
as the central location of a distribution. You can think of it as
the tendency of data to cluster around a middle value.
The histogram below illustrates the average accurately finding
the center of the data’s distribution.

However, the average does not always find the center of the
data. It is sensitive to skewed data and extreme values. For
example, when the data are skewed, it can miss the mark. In
the histogram below, the average is outside the area with the
most common values.

This problem occurs because outliers have a substantial
impact on the mean. Extreme values in an extended tail pull it
away from the center. As the distribution becomes more
skewed, the average is drawn further away from the center.

In these cases, the average can be misleading because it might
not be near the most common values. Consequently, it’s best
to use the average to measure the central tendency when you
have a symmetric distribution.

For skewed distributions, it’s often better to use the median
or trimmed mean, which use different methods to find the
central location. Note that the average provides no information
about the variability present in a distribution. To evaluate
that characteristic, assess the standard deviation.
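
To see how an extreme value pulls the mean while the median and trimmed mean stay put, here is a small Python sketch. The data are made up for illustration, and `trimmed_mean` is a hypothetical helper written for this note (it simply drops a fixed share of values from each end before averaging), not a standard library function.

```python
import statistics

def trimmed_mean(values, proportion):
    """Hypothetical helper: drop `proportion` of the values from each
    end of the sorted data, then average what is left."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # values to cut from each tail
    trimmed = ordered[k:len(ordered) - k]
    return sum(trimmed) / len(trimmed)

# Made-up, right-skewed data: nine typical values and one extreme value.
data = [20, 21, 22, 22, 23, 24, 24, 25, 26, 95]

mean_value = statistics.mean(data)        # pulled toward the outlier
median_value = statistics.median(data)    # middle of the sorted data
trimmed_value = trimmed_mean(data, 0.10)  # drops the 20 and the 95

print(f"mean = {mean_value}, median = {median_value}, "
      f"10% trimmed mean = {trimmed_value}")
```

In this made-up dataset the mean (30.2) is larger than nine of the ten values, while the median (23.5) and the trimmed mean (23.375) stay where most of the values are.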

Standard Deviation: Interpretations and Calculations


The standard deviation (SD) is a single number that
summarizes the variability in a dataset. It represents the
typical distance between each data point and the mean.
Smaller values indicate that the data points cluster closer to
the mean, so the values in the dataset are relatively
consistent. Conversely, higher values signify that the values
spread out further from the mean. Data values become more
dissimilar, and extreme values become more likely.

The standard deviation uses the original data units, simplifying
the interpretation. For this reason, it is the most widely used
measure of variability. Suppose a pizza restaurant measures
its delivery time in minutes and has an SD of 5. In that case,
the interpretation is that the typical delivery occurs 5 minutes
before or after the mean time. Statisticians often report the
standard deviation with the mean: 20 minutes (StDev 5). If
another pizza restaurant has a standard deviation of 10
minutes, we know that its delivery service is more
inconsistent. We’ll assess this example more closely later on in
lecture 3.

In this note, learn why the standard deviation is essential,
work through an interpretation example, and learn how to
calculate it by hand.

Why is the Standard Deviation Important?

Understanding the standard deviation is crucial. While the
mean identifies a central value in the distribution, it does not
indicate how far the data points fall from the center. Higher
SD values signify that more data points are further away from
the mean.


Objective: To learn why the standard deviation is
essential, work through an interpretation example, and
learn how to calculate it by hand.

Variability is everywhere. When you order a favourite meal at
a restaurant, it isn’t exactly the same each time. Your drive
time to work varies every day. Parts from an assembly line
might seem identical, but they have subtly different lengths
and widths.

When variability is high, you can expect to experience extreme
values more frequently, which can cause problems! If the
restaurant meal differs noticeably from the usual, you might
not like it at all. When your morning commute takes much
longer than the average travel time, you will be late. And
manufactured parts that are too far out of specification won’t
perform correctly.

Frequently, we feel distressed at the extremes more than the
mean. The standard deviation helps you understand the
variability and provides vital information about the
consistency of outcomes or lack thereof.

Example of Using the Standard Deviation

Suppose two pizza restaurants advertise a 20-minute average
delivery time. We’re starving and both look equally good!
However, we know the mean does not tell the entire story!

Let’s assess their standard deviations to choose the restaurant.
Imagine we obtain their delivery time data. One restaurant has
an SD of 10 minutes while the other has an SD of 5. How does
this affect deliveries?

The graphs below incorporate the SDs to answer this question.
The restaurant with the larger standard deviation (10 minutes)
has more variable delivery times and a broader distribution
curve.

NOTE: The area under the curve always equals one, irrespective
of the shape of the distribution (normal, rectangular, or any
other).

In these charts, we’ll consider a 30-minute wait or longer to be
unacceptable. We’re hungry! The shaded areas represent the
percentage of delivery times exceeding 30 minutes. Almost
16% of deliveries for the high variability pizza joint exceed 30
minutes compared to only 2% for the low variability
restaurant. They both have a mean delivery time of 20
minutes, but I know where I’d place my order when I’m
hungry!
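
The 16% and 2% figures can be reproduced with Python's standard library, assuming (as the charts do) that delivery times follow a normal distribution. This is only a sketch; the function name `share_late` and the constants are illustrative choices made here.

```python
from statistics import NormalDist

MEAN_MINUTES = 20    # both restaurants advertise this average
CUTOFF_MINUTES = 30  # waits at or beyond this are "unacceptable"

def share_late(sd_minutes):
    """Share of deliveries that take longer than the cutoff,
    assuming normally distributed delivery times."""
    delivery_times = NormalDist(mu=MEAN_MINUTES, sigma=sd_minutes)
    # cdf gives the area to the left of the cutoff; the total area is 1.
    return 1 - delivery_times.cdf(CUTOFF_MINUTES)

late_high_sd = share_late(10)  # restaurant with SD = 10 minutes
late_low_sd = share_late(5)    # restaurant with SD = 5 minutes

print(f"SD 10: {late_high_sd:.1%} of deliveries exceed 30 minutes")
print(f"SD 5:  {late_low_sd:.1%} of deliveries exceed 30 minutes")
```

With an SD of 10, the 30-minute cutoff is one standard deviation above the mean; with an SD of 5, it is two standard deviations above, which is why the shaded area shrinks so sharply.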

After calculating the standard deviation, you can use various
methods to evaluate it. The graphs above incorporate the SD
into the normal probability distribution. Alternatively, you can
use the Empirical Rule or Chebyshev’s Theorem to assess
how the standard deviation relates to the distribution of
values (see note 1). You can also calculate the coefficient of
variation (CV), which uses both the SD and the mean.
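
As a rough illustration of two of these tools, the sketch below computes the coefficient of variation (SD divided by the mean) for the two pizza restaurants and the Chebyshev lower bound (at least 1 − 1/k² of values lie within k standard deviations of the mean, for any distribution, with k > 1). The function names are hypothetical, chosen for this note.

```python
def coefficient_of_variation(sd, mean):
    """CV: the standard deviation expressed as a share of the mean."""
    return sd / mean

def chebyshev_min_share(k):
    """Chebyshev's Theorem: minimum share of values within k SDs
    of the mean, valid for any distribution (requires k > 1)."""
    return 1 - 1 / k**2

mean_minutes = 20
cv_low = coefficient_of_variation(5, mean_minutes)    # SD = 5 restaurant
cv_high = coefficient_of_variation(10, mean_minutes)  # SD = 10 restaurant

print(f"CV (SD 5):  {cv_low:.0%}")
print(f"CV (SD 10): {cv_high:.0%}")
print(f"At least {chebyshev_min_share(2):.0%} of values lie within 2 SDs")
```

Because the CV has no units, it lets you compare variability between datasets whose means or measurement units differ.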

Standard Deviation Formula

The formula for the standard deviation is below.

s = √( ∑(Xi − x̄)² / (N − 1) )

o s = the sample StDev
o N = number of observations
o Xi = value of each observation
o x̄ = the sample mean

Statisticians refer to the numerator portion of the standard
deviation formula as the sum of squares. (Remember why we
squared the deviations in class? Simply adding deviations of
+1 and -1 gives zero, which would wrongly suggest no
deviation in our steps at all!)

Technically, this formula is for the sample standard deviation.
The population version uses N in the denominator. Think
about the difference between the population and sample
versions, i.e., why we use (N − 1) in the case of samples.

Step-by-Step Example of Calculating the Standard Deviation

Calculating the standard deviation involves the following
steps. The numbers correspond to the column numbers.

Note 1: It is not important for us to remember the names of
these rules; they are given just for information.

The calculations take each observation (1), subtract the
sample mean (2) to calculate the difference (3), and square
that difference (4).

Then, at the bottom, sum the column of squared differences
and divide it by 16 (17 – 1 = 16), which equals 201.
Statisticians call this value the variance.

Calculate the square root of the variance to derive the SD.
(Question: why not just leave at Variance? Why Square root?)
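
The same steps can be traced in Python. The 17-observation table is not reproduced in these notes, so this sketch reuses the five heights from the mean example instead; that substitution is an assumption made here, and the result is checked against the standard library's `statistics.stdev`.

```python
import math
import statistics

# Assumption: the five heights (inches) from the mean example stand in
# for the 17-observation table, which is not reproduced in these notes.
observations = [48, 51, 52, 54, 56]

# Column 2: the sample mean.
sample_mean = sum(observations) / len(observations)

# Column 3: difference between each observation and the mean.
differences = [x - sample_mean for x in observations]

# Column 4: squared differences (squaring stops +/- values cancelling).
squared_differences = [d ** 2 for d in differences]

# Sum of squares divided by (N - 1) gives the sample variance,
# which is in squared units (square inches here).
sum_of_squares = sum(squared_differences)
variance = sum_of_squares / (len(observations) - 1)

# The square root returns to the original units (inches).
sd = math.sqrt(variance)

print(f"mean = {sample_mean:.1f}, variance = {variance:.2f}, SD = {sd:.2f}")
print(f"statistics.stdev gives {statistics.stdev(observations):.2f}")
```

Note that the raw differences in column 3 sum to (approximately) zero, which is the reason for squaring them before adding.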

Learn how you can use the range of a dataset to estimate the
standard deviation using the range rule of thumb.

The standard deviation is similar to the mean absolute
deviation. Both statistics use the original data units and they
compare the data points to the mean to assess variability.
However, there are differences. To learn more, read about
the mean absolute deviation (MAD).

Mean, Median, and Mode: Measures of Central Tendency

What is Central Tendency?

Measures of central tendency are summary statistics that
represent the center point or typical value of a dataset.
Examples of these measures include the mean, median,
and mode. These statistics indicate where most values in a
distribution fall and are also referred to as the central location
of a distribution. You can think of central tendency as the
propensity for data points to cluster around a middle value.

In statistics, the mean, median, and mode are the three most
common measures of central tendency. Each one calculates
the central point using a different method. Choosing the best
measure of central tendency depends on the type of data you
have. In this post, I explore the mean, median, and mode as
measures of central tendency, show you how to calculate
them, and explain how to determine which one is best for your
data.

Locating the Measures of Central Tendency

Most articles about the mean, median, and mode focus on how
you calculate these measures of central tendency. I’m going to
start by illustrating the central point of several datasets
graphically, so that you understand the goal. Then, we’ll move
on to choosing the best measure of central tendency for your
data and the calculations.

The three distributions below represent different data
conditions. In each distribution, look for the region where the
most common values fall. Even though the shapes and types of
data are different, you can find that central tendency. That’s
the area in the distribution where the most common values are
located. These examples cover the mean, median, and mode, or
the 3Ms.

As the graphs highlight, you can see where most values tend to
occur. That’s the concept. Measures of central tendency
represent this idea with a value. Coming up, you’ll learn that
as the distribution and kind of data change, so does the best
measure of central tendency. Consequently, you need to know
the type of data you have, and graph it, before choosing
between the mean, median, and mode!

Median

The median is the middle value. It is the value that splits the
dataset in half, making it a natural measure of central
tendency.

To find the median, order your data from smallest to largest,
and then find the data point that has an equal number of
values above it and below it. The method for locating the
median varies slightly depending on whether your dataset has
an even or odd number of values. I’ll show you how to find the
median for both cases. In the examples below, I use whole
numbers for simplicity, but you can have decimal places.

In the dataset with the odd number of observations, notice
how the number 12 has six values above it and six below it.
Therefore, 12 is the median of this dataset.

When there is an even number of values, you count in to the
two innermost values and then take the average. The average
of 27 and 29 is 28. Consequently, 28 is the median of this
dataset.
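
Both cases can be sketched in Python. The original example tables are not reproduced in these notes, so the two datasets below are hypothetical, built only to match the stated results (a median of 12 with six values on each side, and innermost values of 27 and 29). `find_median` is an illustrative helper; Python's own `statistics.median` does the same job.

```python
import statistics

def find_median(values):
    """Illustrative helper: order the data, then take the middle value
    (odd count) or the average of the two innermost values (even count)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

# Hypothetical data with an odd number of values (13).
odd_data = [25, 3, 12, 1, 17, 8, 22, 5, 14, 10, 20, 7, 15]

# Hypothetical data with an even number of values (8).
even_data = [33, 21, 29, 25, 35, 27, 23, 31]

odd_median = find_median(odd_data)
even_median = find_median(even_data)

print(f"odd case: median = {odd_median}")
print(f"even case: median = {even_median}")
```

The sorting step is essential: taking the middle value of unsorted data does not give the median.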

Outliers and skewed data have a smaller effect on the median
than on the mean. To understand why, imagine we have the
Median dataset below and find that the median is 46. However,
we discover data entry errors and need to change four values,
which are shaded in the Median Fixed dataset. We’ll make them
all significantly higher so that we now have a skewed
distribution with large outliers.

As you can see, the median doesn’t change at all. It is still 46.
When comparing the mean vs median, the mean depends on all
values in the dataset while the median does not. Consequently,
when some of the values are more extreme, the effect on the
median is smaller. Of course, with other types of changes, the
median can change. When you have a skewed distribution, the
median is a better measure of central tendency than the mean.
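
The idea can be tried out in Python. The Median and Median Fixed tables are not reproduced in these notes, so the numbers below are hypothetical: a dataset whose median is 46, and a "fixed" copy in which the four largest values are replaced with much higher ones.

```python
import statistics

# Hypothetical stand-in for the "Median" dataset (median = 46).
median_data = [38, 40, 41, 43, 44, 45, 46, 47, 49, 50, 52, 53, 55]

# Hypothetical "Median Fixed" dataset: the four largest values are
# replaced with much higher ones, creating a long right tail.
median_fixed = [38, 40, 41, 43, 44, 45, 46, 47, 49, 150, 210, 340, 500]

median_before = statistics.median(median_data)
median_after = statistics.median(median_fixed)
mean_before = statistics.mean(median_data)
mean_after = statistics.mean(median_fixed)

print(f"median: {median_before} -> {median_after}")
print(f"mean:   {mean_before:.1f} -> {mean_after:.1f}")
```

The median only depends on which value sits in the middle position, so making already-high values higher leaves it unchanged, while the mean uses every value and moves substantially.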

Mean vs Median as Measures of Central Tendency

Now, let’s compare the mean vs median as measures of central
tendency on symmetrical and skewed distributions to see how
they perform. The histograms below allow us to compare these
two statistics directly.

In a symmetric distribution, the mean and median both find
the center accurately. They are approximately equal, and both
are valid measures of central tendency.

In a skewed distribution, the outliers in the tail pull the mean
away from the center towards the longer tail. For this
example, the mean and median differ by over 9000. The median
better represents the central tendency for the skewed
distribution.

These data are based on the U.S. household income for 2006.
Income is the classic example of when to use the median
instead of the mean because its distribution tends to be
skewed. The median indicates that half of all incomes fall
below 27581, and half are above it. For these data, the mean
overestimates where most household incomes fall.

NOTE: The median is a robust statistic, while the mean is
sensitive to outliers and skewed distributions.
