_ Unit 2 _ Descriptive Analytics
UNIT 2
Frequency Distribution and Data: Types, Tables, and Graphs
Raw data:
Raw data is an initial collection of information that has not yet been organized. After the very first step of data collection, you will get raw data. For example, a group of five friends are asked their favourite colour. The answers are Blue, Green, Blue, Red, and Red. This collection of information is the raw data.
Discrete data: Discrete data is recorded in whole numbers, like the number of children in a school or the number of tigers in a zoo. It cannot be in decimals or fractions.
Continuous data: Continuous data need not be in whole numbers; it can be in decimals. Examples are the temperature in a city for a week, your percentage of marks in the last exam, etc.
The frequency of any value is the number of times that value appears in a data set. From the above example of colours, two children like the colour blue, so its frequency is two. To make meaning of the raw data, we must organize it, and finding the frequency of the data values is how this organisation is done.
Frequency Distribution
Many times it is not easy or feasible to find the frequency of data from a very large dataset. So to
make sense of the data we make a frequency table and graphs. Let us take the example of the
heights of ten students in cms.
Frequency Distribution Table
139, 145, 150, 145, 136, 150, 152, 144, 138, 138
This frequency table will help us make better sense of the data given. Also when the data set is too
big (say if we were dealing with 100 students) we use tally marks for counting. It makes the task
more organized and easy. Below is an example of how we use tally marks.
Frequency Distribution Graph
Using the same above example we can make the following graph:
Height (cm)   Frequency
130-140       4
140-150       3
150-160       3
From the above table, you can see that the value of 150 is put in the class interval of 150-160 and
not 140-150. This is the convention we must follow.
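The grouping above can be sketched in code. Here is a minimal Python sketch, using the ten heights and the 10 cm class width from the example; note how the boundary value 150 falls into the 150-160 interval, following the convention:

```python
from collections import OrderedDict

heights = [139, 145, 150, 145, 136, 150, 152, 144, 138, 138]

# Build class intervals of width 10; by convention a boundary value
# such as 150 goes into the interval that starts at it (150-160).
table = OrderedDict(((low, low + 10), 0) for low in range(130, 160, 10))
for h in heights:
    for (low, high) in table:
        if low <= h < high:
            table[(low, high)] += 1

for (low, high), freq in table.items():
    print(f"{low}-{high}: {'|' * freq}  ({freq})")
```

The printed tally marks reproduce the frequency table: 4, 3, and 3.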
The table gives the number of snacks ordered and the number of days as a tally. Find the frequency of snacks ordered.
Answer: From the frequency table, the number of snacks ordered ranging between
2-4 is 4 days
4-6 is 3 days
6-8 is 9 days
8-10 is 9 days
10-12 is 7 days.
So the frequencies for all snacks ordered are 4, 3, 9, 9, 7.
To choose the class intervals, divide the range by the number of groups you want your data in and then round up; this gives the class width.
What is a frequency distribution?
Answer: A frequency distribution is an overview of all distinct values of a variable and the number of times they occur. That is, it tells us how frequencies are distributed over the values. Frequency distributions are mostly used to summarize categorical variables.
What are the various components of a frequency distribution?
Answer: The various components of a frequency distribution are: class interval, types of class interval, class boundaries, midpoint or class mark, width or size of class interval, class frequency, etc.
Descriptive Statistics
A population is the group to be studied, and population data is a collection of all elements in the population. A sample is a subset of data drawn from the population of interest. Sample statistics are used to estimate population parameters: the population mean (µ) is estimated by the sample mean (x̄), and the population variance (σ²) is estimated by the sample variance (s²).
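As a sketch of these estimators, here is how the sample mean and sample variance can be computed with Python's standard library; the data values are made up for illustration:

```python
import statistics

sample = [4.1, 5.0, 3.8, 4.6, 5.2]  # hypothetical sample from a larger population

x_bar = statistics.mean(sample)          # estimates the population mean µ
s_squared = statistics.variance(sample)  # estimates σ²; divides by n - 1, not n

print(x_bar, s_squared)
```

Note that `statistics.variance` is the sample variance (it divides by n − 1); the population variance would be `statistics.pvariance`.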
Variables are divided into two major groups: qualitative and quantitative.
1. Qualitative variables take values that are names or labels, such as eye colour or blood type.
2. Quantitative variables have values that are typically numeric, such as measurements.
Frequency Polygon
A line graph for quantitative data that also emphasizes the continuity of continuous variables
An important variation on a histogram is the frequency polygon, or line graph. Frequency polygons may be constructed directly from frequency distributions. However, we will follow the step-by-step transformation of a histogram into a frequency polygon, as described in panels A, B, C, and D of Figure 2.2.
A. This panel shows the histogram for the weight distribution.
B. Place dots at the midpoints of each bar top or, in the absence of bar tops, at midpoints for classes on the horizontal axis, and connect them with straight lines. [To find the midpoint of any class, such as 160–169, simply add the two tabled boundaries (160 + 169 = 329) and divide this sum by 2 (329/2 = 164.5).]
C. Anchor the frequency polygon to the horizontal axis. First, extend the upper tail to the midpoint of the first unoccupied class (250–259) on the upper flank of the histogram. Then extend the lower tail to the midpoint of the first unoccupied class (120–129) on the lower flank of the histogram. Now all of the area under the frequency polygon is enclosed completely.
D. Finally, erase all of the histogram bars, leaving only the frequency polygon.
Frequency polygons are particularly useful when two or more frequency distributions or relative frequency distributions are to be included in the same graph.
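The midpoint rule in panel B is easy to express in code. A small sketch (the class boundaries follow the weight example in the passage; the midpoints are what the dots of the polygon are placed over):

```python
# Class intervals from the weight example in the text.
classes = [(160, 169), (170, 179), (180, 189)]

# Midpoint of a class: add the two tabled boundaries and divide by 2.
midpoints = [(low + high) / 2 for (low, high) in classes]
print(midpoints)  # [164.5, 174.5, 184.5]
```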
Stem and Leaf Displays:
A device for sorting quantitative data on the basis of leading and trailing digits.
Still another technique for summarizing quantitative data is a stem and leaf display. Stem and
leaf displays are ideal for summarizing distributions, such as that for weight data, without
destroying the identities of individual observations.
Constructing a Display
The stemplot (also called a stem and leaf plot) is another graphical display of the distribution of a quantitative variable.
To create a stemplot, the idea is to separate each data point into a stem and a leaf, as follows: the leaf is the rightmost digit, and the stem is everything to its left.
• Note: For this to work, ALL data points should be rounded to the same number of decimal places.
EXAMPLE: Best Actress Oscar Winners
When some of the stems hold a large number of leaves, we can split each stem into two: one holding the leaves 0-4, and the other holding the leaves 5-9. A statistical software package will often do the splitting for you, when appropriate. Note that when rotated 90 degrees counter-clockwise, the stemplot visually resembles a histogram.
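A stemplot can be sketched in a few lines of Python. Here the stems are the tens digits and the leaves are the units digits; the ages are hypothetical values of the kind used in the Best Actress example:

```python
from collections import defaultdict

ages = [26, 28, 31, 33, 35, 35, 41, 42, 49, 61]  # hypothetical ages

stems = defaultdict(list)
for age in ages:
    stems[age // 10].append(age % 10)  # stem = tens digit, leaf = units digit

for stem in sorted(stems):
    print(f"{stem} | {' '.join(str(leaf) for leaf in sorted(stems[stem]))}")
```

Each row is one stem followed by its sorted leaves, so the identities of the individual observations are preserved.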
Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important
characteristic of a frequency distribution is its shape. Figure 2.3 shows some of the more typical
shapes for smoothed frequency polygons (which ignore the inevitable irregularities of real
data).
Normal
Any distribution that approximates the normal shape in panel A of Figure 2.3 can be analyzed, as
we will see in Chapter 5, with the aid of the well-documented normal curve. The familiar bell-
shaped silhouette of the normal curve can be superimposed on many frequency distributions,
including those for uninterrupted gestation periods of human fetuses, scores on standardized
tests, and even the popping times of individual kernels in a batch of popcorn.
Bimodal
Any distribution that approximates the bimodal shape in panel B of Figure 2.3 might, as suggested
previously, reflect the coexistence of two different types of observations in the same
distribution. For instance, the distribution of the ages of residents in a neighborhood consisting
largely of either new parents or their infants has a bimodal shape.
Positively Skewed
The two remaining shapes in Figure 2.3 are lopsided. A lopsided distribution caused by a few extreme observations in the positive direction (to the right of the majority of observations), as in panel C of Figure 2.3, is a positively skewed distribution.
The distribution of incomes among U.S. families has a pronounced positive skew, with most
family incomes under $200,000 and relatively few family incomes spanning a wide range
of values above $200,000. The distribution of weights in Figure 2.1 also is positively
skewed.
Negatively Skewed
A lopsided distribution caused by a few extreme observations in the negative direction (to the left of the majority of observations), as in panel D of Figure 2.3, is a negatively skewed distribution. The distribution of ages at retirement among U.S. job holders has a
pronounced negative skew, with most retirement ages at 60 years or older and relatively few
retirement ages spanning the wide range of ages younger than 60.
A GRAPH FOR QUALITATIVE (NOMINAL) DATA:
The distribution in Table 2.7, based on replies to the question “Do you have a Facebook profile?”
appears as a bar graph in Figure 2.4. A glance at this graph confirms that Yes replies occur
approximately twice as often as No replies. As with histograms, equal segments along the
horizontal axis are allocated to the different words or classes that appear in the frequency
distribution for qualitative data. Likewise, equal segments along the vertical axis reflect
increases in frequency. The body of the bar graph consists of a series of bars whose heights
reflect the frequencies for the various words or classes. A person’s answer to the question “Do
you have a Facebook profile?” is either Yes or No, not some impossible intermediate value, such
as 40 percent Yes and 60 percent No. Gaps are placed between adjacent bars of bar graphs to
emphasize the discontinuous nature of qualitative data. A bar graph also can be used with
quantitative data to emphasize the discontinuous nature of a discrete variable, such as the
number of children in a family.
Misleading Graphs:
Graphs can be constructed in an unscrupulous manner to support a particular point of view.
Indeed, this type of statistical fraud gives credibility to popular sayings, including “Numbers
don’t lie, but statisticians do” and “There are three kinds of lies—lies, damned lies, and
statistics.” For example, to imply that comparatively many students responded Yes to the
Facebook profile question, an unscrupulous person might resort to the various tricks shown in
Figure 2.5:
■ The width of the Yes bar is more than three times that of the No bar, thus violating the
custom that bars be equal in width.
■ The lower end of the frequency scale is omitted, thus violating the custom that the entire
scale be reproduced, beginning with zero. (Otherwise, a broken scale should be highlighted by
crossover lines, as in Figures 2.1 and 2.2.)
■ The height of the vertical axis is several times the width of the horizontal axis, thus violating
the custom, heretofore unmentioned, that the vertical axis be approximately as tall as the
horizontal axis is wide. Beware of graphs in which, because the vertical axis is many times
larger than the horizontal axis (as in Figure 2.5), frequency differences are exaggerated, or in
which, because the vertical axis is many times smaller than the horizontal axis, frequency
differences are suppressed.
AVERAGES
The center of a data set is a way of describing its location. We can measure the center of data in three different ways: the mean (average), the median, and the mode.
The two main numerical measures for the center of a distribution are the mean and
the median. Each one of these measures is based on a completely different idea of
describing the center of a distribution. Let us first present each one of the measures,
and then compare their properties.
MEAN
The mean is the average of a set of observations (i.e., the sum of the observations
divided by the number of observations).
If the n observations are written as x1, x2, …, xn, their mean is:
x̄ = (x1 + x2 + ⋯ + xn) / n
For example, for a set of 32 ages, we add all of the ages to get 1233 and divide by the number of ages, 32, to get approximately 38.5. We denote this result as x̄ (x-bar) and call it the sample mean.
Often we have large sets of data and use a frequency table to display the data more
efficiently. Data were collected from the last three World Cup soccer tournaments. A
total of 192 games were played. The table below lists the number of goals scored per
game (not including any goals scored in shootouts).
Total # Goals/Game   Frequency
0                    17
1                    45
2                    51
3                    37
4                    25
5                    11
6                     3
7                     2
8                     1
To find the mean number of goals scored per game, we would need to find the sum
of all 192 numbers, and then divide that sum by 192.
Rather than add 192 numbers, we use the fact that the same numbers appear many times. For example, the number 0 appears 17 times, the number 1 appears 45 times, the number 2 appears 51 times, etc.
Sum = 0(17) + 1(45) + 2(51) + 3(37) + 4(25) + 5(11) + 6(3) + 7(2) + 8(1) = 453.
So the mean is x̄ = 453/192 ≈ 2.36 goals per game.
Note that, in this example, the values of 1, 2, and 3 are the most common, and our average falls in this range, representing the bulk of the data.
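The shortcut just described (multiply each value by its frequency, sum, and divide by the total number of games) looks like this in Python, using the goals table above:

```python
goals = {0: 17, 1: 45, 2: 51, 3: 37, 4: 25, 5: 11, 6: 3, 7: 2, 8: 1}

n = sum(goals.values())                                     # 192 games in total
total = sum(value * freq for value, freq in goals.items())  # 453 goals in total
mean = total / n
print(n, total, round(mean, 2))  # 192 453 2.36
```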
MEDIAN
If n is odd, the median M is the center observation in the ordered list, located at the (n + 1) / 2 spot. If n is even, the median M is the mean of the two center observations in the ordered list. These two observations are the ones “sitting” in the (n / 2) and (n / 2) + 1 spots in the ordered list.
EXAMPLE: Median (1)
For a simple visualization of the location of the median, consider the following two simple cases of n = 7 and n = 8 ordered observations, with each observation represented by a solid circle:
Comments:
Although the dots in the images above are equally spaced, this need not mean the data values are equally spaced; we are only interested in listing them in order. In fact, two consecutive dots could have exactly the same value. The value of the median will be in the same position regardless of the distance between data values.
Counting from the top, we find that the 16th ranked observation is 35, and the 17th ranked observation also happens to be 35. Therefore, the median M = (35 + 35) / 2 = 35.
The mean and the median, the most common measures of center, each describe the center of a distribution of values in a different way. The mean describes the center as an average value, in which the actual values of the data points play an important role. The median, on the other hand, locates the middle value as the center, and the order of the data is the key.
Data set A → 64 65 66 68 70 71 73
Data set B → 64 65 66 68 70 71 730
For dataset A, the mean is 68.1, and the median is 68.
Looking at dataset B, notice that all of the observations except the last one are
close together. The observation 730 is very large, and is certainly an outlier. In this
case, the median is still 68, but the mean will be influenced by the high outlier, and
shifted up to 162.
The message that we should take from this example is:
The mean is very sensitive to outliers (because it factors in their magnitude), while
the median is resistant (or robust) to outliers.
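Datasets A and B above make this easy to verify with Python's statistics module:

```python
import statistics

data_a = [64, 65, 66, 68, 70, 71, 73]
data_b = [64, 65, 66, 68, 70, 71, 730]  # same data with one high outlier

# Dataset A: mean and median nearly agree.
print(round(statistics.mean(data_a), 1), statistics.median(data_a))  # 68.1 68

# Dataset B: the median is unchanged, but the outlier drags the mean up.
print(statistics.mean(data_b), statistics.median(data_b))
```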
The mode of a data set is the value that occurs most frequently in the set.
• If no value appears more than once in the data set, the data set has no mode.
• If there are two values that appear in the data set an equal (and highest) number of times, they both will be modes, and so on.
For skewed-left distributions and/or datasets with low outliers, the mean is less than the median. Likewise, for skewed-right distributions and/or datasets with high outliers, the mean is greater than the median.
Let’s Summarize
The two main numerical measures for the center of a distribution are the
mean and the median. The mean is the average value, while the median is the middle
value.
The mean is very sensitive to outliers (as it factors in their magnitude),
while the median is resistant to outliers.
The mean is an appropriate measure of center for symmetric distributions
with no outliers. In all other cases, the median is often a better measure of the
center of the distribution.
Describing Variability
Intuitive Approach
In Figure 4.1, each of the three frequency distributions consists of seven scores with the same mean (10) but with different variabilities. (Ignore the numbers in boxes; their significance will be explained later.) Before reading on, rank the three distributions from least to most variable. Your intuition was correct if you concluded that distribution A has the least variability, distribution B has intermediate variability, and distribution C has the most variability. If this conclusion is not obvious, look at each of the three distributions, one at a time, and note any differences among the values of individual scores. For distribution A with the least (zero) variability, all seven scores have the same value (10). For distribution B with intermediate variability, the values of scores vary slightly (one 9 and one 11), and for distribution C with most variability, they vary even more (one 7, two 9s, two 11s, and one 13).

Importance of Variability

Variability assumes a key role in an analysis of research results. For example, a researcher might ask: Does fitness training improve, on average, the scores of depressed patients on a mental-wellness test? To answer this question, depressed patients are randomly assigned to two groups, fitness training is given to one group, and wellness scores are obtained for both groups. Let’s assume that the mean wellness score is larger for the group with fitness training. Is the observed mean difference between the two groups real or merely transitory? This decision depends not only on the size of the mean difference between the two groups but also on the inevitable variabilities of individual scores within each group.

To illustrate the importance of variability, Figure 4.2 shows the outcomes for two fictitious experiments, each with the same mean difference of 2, but with the two groups in experiment B having less variability than the two groups in experiment C. Notice that groups B and C in Figure 4.2 are the same as their counterparts in Figure 4.1. Although the new group B* retains exactly the same (intermediate) variability as group B, each of its seven scores and its mean have been shifted 2 units to the right. Likewise, although the new group C* retains exactly the same (most) variability as group C, each of its seven scores and its mean have been shifted 2 units to the right. Consequently, the crucial mean difference of 2 (from 12 − 10 = 2) is the same for both experiments. Before reading on, decide which mean difference of 2 in Figure 4.2 is more apparent. The mean difference for experiment B should seem more apparent because of the smaller variabilities within both groups B and B*. Just as it’s easier to hear a phone message when static is reduced, it’s easier to see a difference between group means when variabilities within groups are reduced.
Range
The range measures the spread of data inside the limits of a data set; it is calculated as the difference between the highest and lowest values in the data set. The larger the range, the greater the spread of the data. The range covered by the data is the most intuitive measure of variability: it is exactly the distance between the smallest data point (min) and the largest one (Max).
Note: When we first looked at the histogram and tried to get a first feel for the spread of the data, we were actually approximating the range, rather than calculating the exact range.
Standard Deviation
Standard deviation is the measure of the overall spread (variability) of a data set's values from the mean. The more spread out a data set is, the greater are the distances from the mean and the standard deviation.
There are many notations for the standard deviation: SD, s, Sd, StDev. Here, we’ll use SD as an abbreviation for standard deviation, and use s as the symbol for the sample standard deviation.
The following are the numbers of customers who entered a video store in 8 consecutive hours: 7, 9, 5, 13, 3, 11, 15, 9.
1. Find the mean: (7 + 9 + 5 + 13 + 3 + 11 + 15 + 9) / 8 = 72 / 8 = 9.
2. Find the deviation of each value from the mean: −2, 0, −4, 4, −6, 2, 6, 0.
3. Square each deviation and sum: 4 + 0 + 16 + 16 + 36 + 4 + 36 + 0 = 112.
4. Divide by n − 1: (112)/(7) = 16.
• This value, the sum of the squared deviations divided by n − 1, is called the variance. However, the variance is not used as a measure of spread directly, as its units are the square of the units of the original data.
5. The standard deviation of the data is the square root of the variance calculated in step 4.
In this case, we have the square root of 16, which is 4. We will use the lower case letter s to represent the standard deviation: s = 4.
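The steps above can be sketched as Python code, using the video-store data:

```python
import math

data = [7, 9, 5, 13, 3, 11, 15, 9]

mean = sum(data) / len(data)            # step 1: the mean is 9
deviations = [x - mean for x in data]   # step 2: deviations from the mean
ss = sum(d ** 2 for d in deviations)    # step 3: sum of squared deviations = 112
variance = ss / (len(data) - 1)         # step 4: divide by n - 1 -> 16
s = math.sqrt(variance)                 # step 5: square root -> 4
print(variance, s)  # 16.0 4.0
```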
Example 7
Compute the standard deviation of the sample data 3, 5, 7, with a sample mean of 5.
Answer: The squared deviations are 4, 0, and 4, so the variance is (4 + 0 + 4) / 2 = 4 and s = √4 = 2.
DEGREES OF FREEDOM ( d f)
Degrees of freedom (df) refers to the number of values that are free to vary, given one or
more mathematical restrictions, in a sample being used to estimate a population
characteristic.
Interquartile Range (IQR)
IQR = Q3 – Q1
The following picture illustrates this idea: (Think about the horizontal line as the data
ranging from the min to the Max). IMPORTANT NOTE: The “lines” in the following
illustrations are not to scale. The equal distances indicate equal amounts of data NOT
equal distance between the numeric values.
1. Arrange the data in increasing order, and find the median M. Recall that
the median divides the data, so that 50% of the data points are below the
median, and 50% of the data points are above the median.
2. Find the median of the lower 50% of the data. This is called the first quartile of the distribution, and the point is denoted by Q1. Note from the picture that Q1 divides the lower 50% of the data into two halves, containing 25% of the data points in each half. Q1 is called the first quartile, since one quarter of the data points fall below it.
3. Repeat this again for the top 50% of the data. Find the median of the top 50% of the data. This point is called the third quartile of the distribution, and is denoted by Q3. Note from the picture that Q3 divides the top 50% of the data into two halves, with 25% of the data points in each. Q3 is called the third quartile, since three quarters of the data points fall below it.
4. The middle 50% of the data falls between Q1 and Q3, and the IQR measures the range covered by this middle 50%: IQR = Q3 – Q1.
Comments:
1. The last picture shows that Q1, M, and Q3 divide the data into four quarters with 25% of the data points in each, where the median is essentially the second quartile. The use of IQR = Q3 – Q1 as a measure of spread is therefore particularly appropriate when the median M is used as a measure of center.
To find the IQR of the Best Actress Oscar winners’ distribution, it will be convenient to use the stemplot.
Q1 is the median of the bottom half of the data. Since there are 16 observations
in that half, Q1 is the mean of the 8th and 9th ranked observations in that half:
Q1 = (31 + 33) / 2 = 32
Similarly, Q3 is the median of the top half of the data, and since there are 16
observations in that half, Q3 is the mean of the 8th and 9th ranked observations
in that half:
Q3 = (41 + 42) / 2 = 41.5
IQR = 41.5 – 32 = 9.5
Note that in this example, the range covered by all the ages is 59 years, while the range covered by the middle 50% of the ages is only 9.5 years. While the whole dataset is spread over a range of 59 years, the middle 50% of the data is packed into only 9.5 years.
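The "median of each half" procedure can be sketched in Python. The ages below are hypothetical; note that library routines such as statistics.quantiles use slightly different interpolation rules, so on small datasets their answers can differ a little from this hand method:

```python
import statistics

def iqr(data):
    """Q1, Q3, and IQR by the 'median of each half' rule used in the text."""
    values = sorted(data)
    n = len(values)
    lower = values[: n // 2]        # bottom half (median excluded when n is odd)
    upper = values[(n + 1) // 2 :]  # top half
    q1 = statistics.median(lower)
    q3 = statistics.median(upper)
    return q1, q3, q3 - q1

# Hypothetical ages:
q1, q3, spread = iqr([28, 30, 31, 33, 35, 38, 41, 42, 45, 61])
print(q1, q3, spread)  # 31 42 11
```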
The Normal Distribution
The mean is the center of this distribution and the highest point.
The curve is symmetric about the mean. (The area to the left of the mean equals the area
to the right of the mean.)
The total area under the curve is equal to one.
As x increases or decreases, the curve approaches zero but never touches the horizontal axis.
The PDF of a normal curve is f(x) = (1 / (σ√(2π))) e^(−(x − µ)² / (2σ²)).
A normal curve can be used to estimate probabilities.
A normal curve can be used to estimate proportions of a population that have certain
x- values.
There are millions of possible combinations of means and standard deviations for continuous
random variables.
Finding probabilities associated with these variables would require us to integrate the PDF
over the range of values we are interested in.
The standard normal table gives probabilities associated with specific Z-scores.
The table we use is cumulative from the left.
The negative side is for all Z-scores less than zero (all values less than the mean).
The positive side is for all Z-scores greater than zero (all values greater than the mean).
Not all standard normal tables work the same way.
Example 10
Find the area under the standard normal curve to the left of Z = 1.62.
Read down the Z-column to get the first part of the Z-score (1.6).
Read across the top row to get the second decimal place in the Z-score (0.02).
The intersection of this row and column gives the area under the curve to the left of the Z-score: 0.9474.
What if we have an area and we want to find the Z-score associated with that area?
Instead of Z-score → area, we want area → Z-score.
We can use the standard normal table to find the area in the body of values and
read backwards to find the associated Z-score.
Using the table, search the probabilities to find an area that is closest to the probability
you are interested in.
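Both directions of the lookup can be checked against Python's statistics.NormalDist (Python 3.8+), which computes the same cumulative areas as the printed table:

```python
from statistics import NormalDist

std = NormalDist()  # standard normal: mean 0, sd 1

# Z-score -> area: the area to the left of Z = 1.62
print(round(std.cdf(1.62), 4))      # 0.9474, the table entry at row 1.6, column 0.02

# Area -> Z-score: the Z-score with 95% of the area to its left
print(round(std.inv_cdf(0.95), 2))  # 1.64
```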
Example 11
Since the table is cumulative from the left, you must use the complement of 5%.
1.000 – 0.05 = 0.9500
Figure 13. The standard normal table.
The Z-score for the 95th percentile is 1.64.
Area in between Two Z-scores
Example 12
The middle 95% has 2.5% on the right and 2.5% on the left.
Use the symmetry of the curve.
Figure 14. The middle 95% of the area under a normal curve.
Look at your standard normal table. Since the table is cumulative from the left, it is
easier to find the area to the left first.
Find the area of 0.025 on the negative side of the table.
The Z-score for the area to the left is -1.96.
Since the curve is symmetric, the Z-score for the area to the right is 1.96.
Common Z-scores
Z.05 = 1.645 and the area between -1.645 and 1.645 is 90%
Z.025 = 1.96 and the area between -1.96 and 1.96 is 95%
Z.005 = 2.575 and the area between -2.575 and 2.575 is 99%
Typically, our normally distributed data do not have μ = 0 and σ = 1, but we can relate any normal distribution to the standard normal distribution using the Z-score. We can transform values of x to values of z using
z = (x − μ) / σ
For example, if a normally distributed random variable has μ = 6 and σ = 2, then a value of x = 7 corresponds to a Z-score of z = (7 − 6) / 2 = 0.5.
This tells you that 7 is one-half a standard deviation above its mean. We can use this
relationship to find probabilities for any normal random variable.
To find the area for values of X, a normal random variable, draw a picture of the area of
interest, convert the x-values to Z-scores using the Z-score and then use the standard
normal table to find areas to the left, to the right, or in between.
Example 13
Adult deer population weights are normally distributed with µ = 110 lb. and σ = 29.7 lb.
As a biologist you determine that a weight less than 82 lb. is unhealthy and you want to
know what proportion of your population is unhealthy.
P(x<82)
Figure 16. The area under a normal curve for P(x<82).
Convert 82 to a Z-score: z = (82 − 110) / 29.7 ≈ −0.94.
This is an “area to the left” problem so you can read directly from the table to get the
probability.
P(x<82) = 0.1736
Approximately 17.36% of the population of adult deer is underweight, OR one deer chosen
at random will have a 17.36% chance of weighing less than 82 lb.
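This calculation can be checked with a short sketch; NormalDist handles the Z-score conversion internally, so it gives about 0.173 (the printed table, which rounds z to −0.94, gives 0.1736):

```python
from statistics import NormalDist

deer = NormalDist(mu=110, sigma=29.7)  # adult deer weights in lb.
p = deer.cdf(82)                       # P(x < 82), an "area to the left" problem
print(round(p, 4))
```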
Example 14
Statistics from the Midwest Regional Climate Center indicate that Jones City, which has a
large wildlife refuge, gets an average of 36.7 in. of rain each year with a standard deviation
of 5.1 in. The amount of rain is normally distributed. During what percent of the years does
Jones City get more than 40 in. of rain?
z = (40 − 36.7) / 5.1 ≈ 0.65, and P(x > 40) = 1 − P(z < 0.65) = 1 − 0.7422 = 0.2578.
For approximately 25.78% of the years, Jones City will get more than 40 in. of rain.
Assessing Normality
If the distribution is unknown and the sample size is not greater than 30
(Central Limit Theorem), we have to assess the assumption of normality.
Our primary method is the normal probability plot. This plot graphs the observed
data, ranked in ascending order, against the “expected” Z-score of that rank.
If the sample data were taken from a normally distributed random variable, then the
plot would be approximately linear.
The center line is the relationship we would expect to see if the data were
drawn from a perfectly normal distribution.
Notice how the observed data (red dots) loosely follow this linear
relationship. Minitab also computes an Anderson-Darling test to assess
normality.
The null hypothesis for this test is that the sample data have been drawn from
a normally distributed population. A p-value greater than 0.05 supports the
assumption of normality.
Compare the histogram and the normal probability plot in this next example. The histogram
indicates a skewed right distribution.
Figure 20. Histogram and normal probability plot for skewed right data.
The observed data do not follow a linear pattern and the p-value for the A-D test is less
than 0.005 indicating a non-normal population distribution.
Normality cannot be assumed. You must always verify this assumption. Remember, the
probabilities we are finding come from the standard NORMAL table. If our data are NOT
normally distributed, then these probabilities DO NOT APPLY.