FDS Unit II Notes
Types of Data:
Most datasets broadly fall into two groups—numerical data and categorical data.
Numerical data
➢ For example, a person's age, height, weight, blood pressure, heart rate, and temperature.
Discrete data
➢ For example, if we flip a coin 200 times, the number of heads can take any whole-number value
from 0 to 200, which is a finite set of cases.
CS3352-Foundations of Data Science Information Technology 2022-2023
➢ Example:
The Country variable can have values such as Nepal, India, Norway, and Japan.
The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.
Continuous data
➢ A variable that can have an infinite number of numerical values within a specific range is
classified as continuous data.
➢ Continuous data can follow an interval measure of scale or ratio measure of scale
➢ Example:
Example table:
Categorical data
the movies.
Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror,
Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller,
Urban, or Western)
Polytomous variables
➢ Example: marital status can have several values, such as divorced, legally separated, married,
never married, unmarried, widowed, etc.
➢ Most categorical datasets follow either nominal or ordinal measurement scales.
Nominal
➢ These are used for labeling variables without any quantitative value. The scales are generally
referred to as labels.
➢ These scales are mutually exclusive and do not carry any numerical importance.
➢ Examples:
Male
Female
Third gender/Non-binary
I prefer not to answer
Other
3. Biological species
➢ Nominal scales are considered qualitative scales and the measurements that are taken using
qualitative scales are considered qualitative data.
Frequency is the rate at which a label occurs over a period of time within the dataset.
Proportion can be calculated by dividing the frequency by the total number of events.
Then, the percentage of each proportion is computed.
To visualize the nominal dataset, either a pie chart or a bar chart can be used.
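The frequency, proportion, and percentage described above can be sketched in a few lines of Python. The list of responses is hypothetical and is not taken from these notes.

```python
from collections import Counter

# Hypothetical nominal observations (labels only, no numerical meaning).
responses = ["Male", "Female", "Female", "Other", "Male", "Female"]

frequency = Counter(responses)        # label -> number of occurrences
total = sum(frequency.values())       # total number of observations

# Proportion = frequency / total; percentage = proportion * 100.
proportion = {label: count / total for label, count in frequency.items()}
percentage = {label: round(100 * p, 1) for label, p in proportion.items()}

for label in frequency:
    print(f"{label:<8}{frequency[label]:>3}{proportion[label]:>8.3f}{percentage[label]:>7.1f}%")
```

The same three columns are what a pie chart or bar chart of a nominal variable would display.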
Ordinal
➢ The main difference between the ordinal and nominal scales is that ordinal values have a meaningful order.
WordPress is making content managers' lives easier. How do you feel about this statement?
Likert scale:
➢ The answer to the question is scaled down to five different ordinal values, Strongly Agree,
Agree, Neutral, Disagree, and Strongly Disagree.
➢ To make it easier, consider ordinal scales as an order of ranking (1st, 2nd, 3rd, 4th, and so on).
➢ The median item is allowed as the measure of central tendency; however, the average is not
permitted.
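As a minimal sketch of why the median works for ordinal data, the Likert labels can be mapped to their rank positions and the middle rank mapped back to a label. The answers below are hypothetical, and `median_low` is used so that the result is always one of the actual categories.

```python
from statistics import median_low

# The five ordinal values, ordered from lowest to highest agreement.
SCALE = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]
rank = {label: position for position, label in enumerate(SCALE)}

# Hypothetical survey answers.
answers = ["Agree", "Neutral", "Strongly Agree", "Agree", "Disagree", "Agree", "Neutral"]

# Only the order of the ranks is used; the distances between them are not meaningful.
median_rank = median_low(rank[a] for a in answers)
median_label = SCALE[median_rank]
print(median_label)
```

Averaging the ranks would instead assume equal spacing between categories, which an ordinal scale does not guarantee.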
Interval
➢ Both the order and exact differences between the values are significant.
➢ Examples:
Ratio
➢ Mathematical operations, the measure of central tendencies, and the measure of dispersion and
coefficient of variation can also be computed from such scales.
➢ Examples: the measure of energy, mass, length, duration, electrical energy, plane angle, and
volume.
TYPES OF VARIABLES
The weights can be described not only as quantitative data but also as observations for a
quantitative variable, since the various weights take on different numerical values.
Described as observations for a qualitative variable, since the replies to the Facebook
profile question take on different values of either Yes or No.
Examples include most counts, such as the number of children in a family (1, 2, 3, etc., but never
1 1/2, in spite of how you might occasionally feel about a sibling).
Examples include amounts, such as weights of male statistics students; durations, such as the
reaction times of grade school children to a fire alarm; and standardized test scores, such as those
on the Scholastic Aptitude Test (SAT).
Approximate Numbers: When numbers are rounded off, as is always the case with values for
continuous variables, the resulting numbers are approximate, never exact.
For example, the weights of the male statistics students. A student whose weight is listed as 150
lbs could actually weigh between 149.5 and 150.5 lbs.
Because of rounding-off procedures, gaps appear among values for continuous variables. For
example, because weights are rounded to the nearest pound, no male statistics student has a listed
weight between 150 and 151 lbs.
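The interval hidden behind a rounded value can be computed directly. This is a small sketch; the function name `real_limits` and its `unit` parameter are illustrative choices, not names used in the notes.

```python
def real_limits(listed_value, unit=1.0):
    """Return the (lower, upper) limits of a value rounded to the nearest `unit`."""
    half = unit / 2
    return listed_value - half, listed_value + half

# A weight listed as 150 lbs, rounded to the nearest pound.
print(real_limits(150))
```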
Exercise: Indicate whether the following quantitative observations are discrete or continuous.
Answer:
A frequency distribution helps us to detect any pattern in the data (assuming a pattern exists) by
superimposing some order on the inevitable variability among observations. For example, the
appearance of a familiar bell-shaped pattern in the frequency distribution of reaction times of
airline pilots to a cockpit alarm suggests the presence of many small chance factors whose
collective effect must be considered in pilot retraining or cockpit redesign. A frequency distribution
is an organized tabulation or graphical representation of the number of individuals in each category on
the scale of measurement.
A frequency distribution describes the number of observations for each possible value of a
variable. Frequency distributions are depicted using graphs and frequency tables.
Example: Frequency distribution. In the 2022 Winter Olympics, Team USA won 25 medals. This
frequency table gives the medals’ values (gold, silver, and bronze) and frequencies:
The method for making a frequency table differs between the four types of frequency distributions.
You can follow the guides below or use software such as Excel, SPSS, or R to make a frequency
table.
The ungrouped frequency distribution is a type of frequency distribution that displays the
frequency of each individual data value instead of groups of data values. In this type of frequency
distribution, we can directly see how often different values occurred in the table.
When observations are sorted into classes of single values the result is referred to as a frequency
distribution for ungrouped data.
Example:
1. Create a table with two columns and as many rows as there are values of the variable.
Label the first column using the variable name and label the second column “Frequency.”
Enter the values in the first column.
o For ordinal variables, the values should be ordered from smallest to largest in the
table rows.
o For nominal variables, the values can be in any order in the table. You may wish to
order them alphabetically or in some other logical order.
2. Count the frequencies. The frequencies are the number of times each value occurs. Enter
the frequencies in the second column of the table beside their corresponding values.
o Especially if your dataset is large, it may help to count the frequencies by tallying.
Add a third column called “Tally.” As you read the observations, make a tick mark
in the appropriate row of the tally column for each observation. Count the tally
marks to determine the frequency.
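The two steps above, including the optional tally column, might look like this in Python. The observations are hypothetical placeholders.

```python
from collections import Counter

# Hypothetical observations of a nominal variable.
visits = ["finch", "chickadee", "sparrow", "finch",
          "cardinal", "finch", "sparrow", "chickadee"]

# Step 2: count how many times each value occurs.
counts = Counter(visits)

# Step 1: one row per value; for a nominal variable any order works,
# so the rows are sorted alphabetically here.
rows = [(species, "|" * counts[species], counts[species]) for species in sorted(counts)]

print(f"{'Species':<12}{'Tally':<8}{'Frequency':>9}")
for species, tally, freq in rows:
    print(f"{species:<12}{tally:<8}{freq:>9}")
```

For an ordinal variable, the sort key would be the rank order of the values rather than the alphabet.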
Example: Making an ungrouped frequency table. A gardener set up a bird feeder in their backyard.
To help them decide how much and what type of birdseed to buy, they decide to record the bird
species that visit their feeder. Over the course of one morning, the following birds visit their feeder:
When observations are sorted into classes of more than one value, the result is referred to as a
frequency distribution for grouped data.
Example:
Essential
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use 130–140, 140–150, 150–
160, etc., in which, because the boundaries of classes overlap, an observation of 140 (or 150) could
be assigned to either of two classes.
Example: The class 210–219 and its frequency of zero. It would be incorrect to skip this class
because of its zero frequency.
Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use 130–139, 140–159, etc.,
in which the second class interval (140–159) is twice as wide as the first class interval (130–139).
Optional
4. All classes should have both an upper boundary and a lower boundary.
Example: 240–249. Less preferred would be 240–above, in which no maximum value can be
assigned to the upper boundary.
5. Select the class interval from convenient numbers, such as 1, 2, 3, . . . 10, particularly 5 and 10
or multiples of 5 and 10.
Example: 130–139, 140–149, in which the class interval of 10 is a convenient number. Less
preferred would be 130–142, 143–155, etc., in which the class interval of 13 is not a convenient
number.
6. The lower boundary of each class interval should be a multiple of the class interval.
Example: 130–139, 140–149, in which the lower boundaries of 130, 140, are multiples of 10, the
class interval. Less preferred would be 135–144, 145–154, etc., in which the lower boundaries of
135 and 145 are not multiples of 10, the class interval.
The size of the gap should always equal one unit of measurement; that is, it should always equal
the smallest possible difference between scores within a particular set of data.
1. Divide the variable into class intervals. Below is one method to divide a variable into
class intervals. Different methods will give different answers, but there’s no agreement on
the best method to calculate class intervals.
o Calculate the range. Subtract the lowest value in the dataset from the highest.
o Decide the class interval width. There are no firm rules on how to choose the
width, but the following formula is a rule of thumb:
You can round this value to a whole number or a number that’s convenient to add
(such as a multiple of 10).
o Calculate the class intervals. Each interval is defined by a lower limit and upper
limit. Observations in a class interval are greater than or equal to the lower limit and
less than the upper limit:
The lower limit of the first interval is the lowest value in the dataset. Add the class
interval width to find the upper limit of the first interval and the lower limit of the
second interval. Keep adding the interval width to calculate more class intervals
until you exceed the highest value.
2. Create a table with two columns and as many rows as there are class intervals. Label the
first column using the variable name and label the second column “Frequency.” Enter the
class intervals in the first column.
3. Count the frequencies. The frequencies are the number of observations in each class
interval. You can count by tallying if you find it helpful. Enter the frequencies in the second
column of the table beside their corresponding class intervals.
Example: Grouped frequency distribution. A sociologist conducted a survey of 20 adults. She wants
to report the frequency distribution of the ages of the survey respondents. The respondents were the
following ages in years:
52, 34, 32, 29, 63, 40, 46, 54, 36, 36, 24, 19, 45, 20, 28, 29, 38, 33, 49, 37
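The procedure can be sketched for these 20 ages. The width formula is not reproduced in these notes, so the sketch assumes one common rule of thumb, width = range / sqrt(sample size), rounded to a convenient 10. Intervals include the lower limit and exclude the upper limit, as described above.

```python
import math

ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]

# Range: highest value minus lowest value.
data_range = max(ages) - min(ages)

# Assumed rule of thumb (not shown in the notes): range / sqrt(n).
raw_width = data_range / math.sqrt(len(ages))
width = 10  # raw_width rounded to a convenient number

# Build intervals [lower, upper) from the lowest value until the highest is covered.
intervals = []
lower = min(ages)
while lower <= max(ages):
    intervals.append((lower, lower + width))
    lower += width

# Count the observations with lower <= age < upper in each interval.
frequencies = [sum(lo <= a < hi for a in ages) for lo, hi in intervals]

for (lo, hi), f in zip(intervals, frequencies):
    print(f"{lo} <= age < {hi}: {f}")
```

A different width rule or starting point would give different, equally valid, class intervals.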
From this table, the gardener can make observations, such as that 19% of the bird feeder visits were
from chickadees and 25% were from finches.
Cumulative frequency distributions show the total number of observations in each class and in all
lower-ranked classes.
This type of distribution can be used effectively with sets of scores, such as test scores for
intellectual or academic aptitude, when relative standing within the distribution assumes primary
importance. Cumulative frequencies are usually converted, in turn, to cumulative percentages.
Cumulative percentages are often referred to as percentile ranks.
To convert a frequency distribution into a cumulative frequency distribution, add to the frequency
of each class the sum of the frequencies of all classes ranked below it.
Percentile Ranks: When used to describe the relative position of any score within its parent
distribution, cumulative percentages are referred to as percentile ranks. The percentile rank of a
score indicates the percentage of scores in the entire distribution with similar or smaller values
than that score. Thus a weight has a percentile rank of 80 if equal or lighter weights constitute 80
percent of the entire distribution.
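A percentile rank under this definition is the percentage of scores at or below a given score. The sketch below uses hypothetical weights, and the function name `percentile_rank` is illustrative.

```python
def percentile_rank(scores, score):
    """Percentage of scores in the distribution that are equal to or smaller than `score`."""
    at_or_below = sum(s <= score for s in scores)
    return 100 * at_or_below / len(scores)

# Hypothetical weights in pounds.
weights = [130, 145, 150, 150, 160, 165, 170, 180, 195, 210]

print(percentile_rank(weights, 180))
```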
1. Add a third column to the table for the cumulative frequencies. The cumulative
frequency is the number of observations less than or equal to a certain value or class
interval. To calculate the cumulative frequencies, add each frequency to the frequencies in the
previous rows.
2. Optional: If you want to calculate the cumulative relative frequency, add another column
and divide each cumulative frequency by the sample size.
Example: Cumulative frequency distribution
From this table, the sociologist can make observations such as 13 respondents (65%) were under 39
years old, and 16 respondents (80%) were under 49 years old.
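The running totals behind these figures can be reproduced with `itertools.accumulate`. The class frequencies below assume 10-year classes starting at age 19 for the sociologist's 20 respondents; the table itself is not reproduced in these notes.

```python
from itertools import accumulate

# Assumed classes: 19-28, 29-38, 39-48, 49-58, 59-68 (upper limits excluded below).
upper_limits = [29, 39, 49, 59, 69]
frequencies = [4, 9, 3, 3, 1]

# Step 1: each cumulative frequency adds the current frequency to all previous ones.
cumulative = list(accumulate(frequencies))

# Step 2 (optional): divide by the sample size for cumulative relative frequencies.
n = cumulative[-1]
cumulative_relative = [c / n for c in cumulative]

for limit, c, r in zip(upper_limits, cumulative, cumulative_relative):
    print(f"under {limit}: {c} respondents ({r:.0%})")
```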
Example:
(b) Specify the real limits for the lowest class interval in this frequency distribution.
Answer:
Relative frequency distributions show the frequency of each class as a part or fraction of the total
frequency for the entire distribution.
To convert a frequency distribution into a relative frequency distribution, divide the frequency
for each class by the total frequency for the entire distribution.
The conversion to proportions is straightforward. For instance, to obtain the proportion of .06 for
the class 130–139, divide the frequency of 3 for that class by the total frequency of 53. Repeat
this process until a proportion has been calculated for each class.
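The conversion is a single division per class. In this sketch only the 130–139 figures (3 out of 53) come from the text; the remaining 50 observations are lumped into one placeholder row because the full table is not reproduced here.

```python
def relative_frequencies(frequencies):
    """Divide each class frequency by the total frequency."""
    total = sum(frequencies.values())
    return {cls: f / total for cls, f in frequencies.items()}

# "130-139": 3 is from the text; the second row is a placeholder for all other classes.
table = {"130-139": 3, "all other classes": 50}
proportions = relative_frequencies(table)

print(round(proportions["130-139"], 2))
```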
OUTLIERS
Very extreme scores are called outliers. A GPA of 0.06, an IQ of 170, or summer wages of
$62,000: each requires special attention because of its potential impact on a summary of the data.
Might Exclude from Summaries: We might choose to segregate an outlier from any summary of the
data.
Example: Identify any outliers in each of the following sets of data collected from nine college
students
Answer: Outliers are summer income of $25,700; an age of 61; and a family size of 18. No outliers
for GPA
When, among a set of observations, any single observation is a word, letter, or numerical code, the
data are qualitative. Frequency distributions for qualitative data are easy to construct.
When qualitative data have an ordinal level of measurement because observations can be ordered
from least to most, that order should be preserved in the frequency table.
Frequency distributions for qualitative variables can always be converted into relative frequency
distributions. If measurement is ordinal because observations can be ordered from least to most,
cumulative frequencies and cumulative percentages can also be used.
Answers:
INTERPRETING DISTRIBUTIONS:
GRAPHS:
Data can be described clearly and concisely with the aid of a well-constructed frequency
distribution. Data can often be described even more vividly by converting frequency
distributions into graphs.
GRAPHS FOR QUANTITATIVE DATA
Histograms
A histogram is the most commonly used graph to show frequency distributions. It is a bar-type
graph for quantitative data. The common boundaries between adjacent bars emphasize the
continuity of the data, as with continuous variables.
Features of histograms:
Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class
intervals of the frequency distribution
Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency.
(The units along the vertical axis do not have to be the same width as those along the
horizontal axis.)
The intersection of the two axes defines the origin at which both numerical scales equal 0.
Numerical scales always increase from left to right along the horizontal axis and from
bottom to top along the vertical axis. It is considered good practice to use wiggly lines to
highlight breaks in scale.
The body of the histogram consists of a series of bars whose heights reflect the frequencies
for the various classes
Figure: Histogram
Frequency Polygon
A frequency polygon is a line graph for quantitative data that also emphasizes the continuity of
continuous variables. Frequency polygons may be constructed directly from frequency
distributions.
The following frequency distribution shows the annual incomes in dollars for a group of college
graduates.
B. Place dots at the midpoints of each bar top or, in the absence of bar tops, at midpoints for classes
on the horizontal axis, and connect them with straight lines. [To find the midpoint of any class,
such as 160–169, simply add the two tabled boundaries (160 + 169 = 329) and divide this sum by
2 (329/2 = 164.5).]
C. Anchor the frequency polygon to the horizontal axis. First, extend the upper tail to the midpoint
of the first unoccupied class (250–259) on the upper flank of the histogram. Then extend the lower
tail to the midpoint of the first unoccupied class (120–129) on the lower flank of the histogram.
Now all of the area under the frequency polygon is enclosed completely.
D. Finally, erase all of the histogram bars, leaving only the frequency polygon. Frequency
polygons are particularly useful when two or more frequency distributions or relative frequency
distributions are to be included in the same graph.
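The midpoint and anchoring steps can be sketched as a function that returns the points of the polygon. The classes and frequencies are hypothetical, and `polygon_points` is an illustrative name.

```python
def midpoint(lower, upper):
    """Midpoint of a class: add the two tabled boundaries and divide by 2."""
    return (lower + upper) / 2

def polygon_points(classes, frequencies, width):
    """Return (midpoint, frequency) points, anchored to the axis at both ends."""
    points = [(midpoint(lo, hi), f) for (lo, hi), f in zip(classes, frequencies)]
    first_lo, first_hi = classes[0]
    last_lo, last_hi = classes[-1]
    # Anchor each tail at the midpoint of the first unoccupied class, with frequency 0.
    lower_anchor = (midpoint(first_lo - width, first_hi - width), 0)
    upper_anchor = (midpoint(last_lo + width, last_hi + width), 0)
    return [lower_anchor] + points + [upper_anchor]

# Hypothetical classes and frequencies.
classes = [(130, 139), (140, 149), (150, 159)]
frequencies = [3, 1, 17]
points = polygon_points(classes, frequencies, width=10)
print(points)
```

Connecting these points with straight lines, in order, gives the frequency polygon.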
Constructing a Display
The leftmost panel of Table 2.9 re-creates the weights of the 53 male statistics students listed in
Table 1.1. To construct the stem and leaf display for these data, first note that, when counting by
tens, the weights range from the 130s to the 240s. Arrange a column of numbers, the stems,
beginning with 13 (representing the 130s) and ending with 24 (representing the 240s). Draw a
vertical line to separate the stems, which represent multiples of 10, from the space to be occupied
by the leaves, which represent multiples of 1.
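A stem and leaf display along these lines can be built by splitting each value into its tens (stem) and units (leaf). The weights below are hypothetical and are not the 53 weights from Table 1.1.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (multiple of 10) to its sorted list of leaves (multiples of 1)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    # Keep every stem between the smallest and largest, even if it has no leaves.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}

# Hypothetical weights in pounds.
weights = [160, 193, 226, 152, 133, 245, 158, 165, 172, 135]
display = stem_and_leaf(weights)

for stem, leaves in display.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```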
TYPICAL SHAPES
Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important
characteristic of a frequency distribution is its shape.
1. Normal
The familiar bell-shaped silhouette of the normal curve can be superimposed on many frequency
distributions, including those for uninterrupted gestation periods of human fetuses, scores on
standardized tests, and even the popping times of individual kernels in a batch of popcorn.
2. Bimodal
Any distribution that approximates the bimodal shape might reflect the coexistence of two
different types of observations in the same distribution. For instance, the distribution of the ages
of residents in a neighborhood consisting largely of either new parents or their infants has a
bimodal shape.
3. Positively Skewed
A lopsided distribution caused by a few extreme observations in the positive direction (to the right
of the majority of observations) is a positively skewed distribution. The distribution of incomes
among U.S. families has a pronounced positive skew, with most family incomes under $200,000
and relatively few family incomes spanning a wide range of values above $200,000.
4. Negatively Skewed
A lopsided distribution caused by a few extreme observations in the negative direction (to the left
of the majority of observations) is a negatively skewed distribution. The distribution of ages at
retirement among U.S. job holders has a pronounced negative skew, with most retirement ages at
60 years or older and relatively few retirement ages spanning the wide range of ages younger than
60.
Bar Graph: A bar-type graph for qualitative data. Gaps between adjacent bars emphasize the
discontinuous nature of the data.
A person’s answer to the question “Do you have a Facebook profile?” is either Yes or No, not
some impossible intermediate value, such as 40 percent Yes and 60 percent No. Gaps are placed
between adjacent bars of bar graphs to emphasize the discontinuous nature of qualitative data. A
bar graph also can be used with quantitative data to emphasize the discontinuous nature of a
discrete variable, such as the number of children in a family.
CONSTRUCTING GRAPHS
1. Decide on the appropriate type of graph, recalling that histograms and frequency polygons are
appropriate for quantitative data, while bar graphs are appropriate for qualitative data and also are
sometimes used with discrete quantitative data.
2. Draw the horizontal axis, then the vertical axis, remembering that the vertical axis should be
about as tall as the horizontal axis is wide.
3. Identify the string of class intervals that eventually will be superimposed on the horizontal axis.
For qualitative data or ungrouped quantitative data, this is easy—just use the classes suggested by
the data. For grouped quantitative data, proceed as if you were creating a set of class intervals for
a frequency distribution.
4. Superimpose the string of class intervals (with gaps for bar graphs) along the entire length of
the horizontal axis. Do not use a string of empty class intervals to bridge a sizable gap between the
origin of 0 and the smallest class interval. Instead, use wiggly lines to signal a break in scale, then
begin with the smallest class interval. Also, do not clutter the horizontal scale with excessive
numbers—use just a few convenient numbers.
5. Along the entire length of the vertical axis, superimpose a progression of convenient numbers,
beginning at the bottom with 0 and ending at the top with a number as large as or slightly larger
than the maximum observed frequency. If there is a considerable gap between the origin of 0 and
the smallest observed frequency, use wiggly lines to signal a break in scale.
6. Using the scaled axes, construct bars (or dots and lines) to reflect the frequency of observations
within each class interval. For frequency polygons, dots should be located above the midpoints of
class intervals, and both tails of the graph should be anchored to the horizontal axis.
7. Supply labels for both axes and a title (or even an explanatory sentence) for the graph.
Tables and graphs of frequency distributions are important points of departure when attempting to
describe data. More precise summaries, such as averages, provide additional valuable information.
Averages consist of numbers (or words) about which the data are, in some sense, centered, so they
are often referred to as measures of central tendency. The several types of average yield numbers
or words that attempt to describe, most generally, the middle or typical value for a distribution.
MODE
The mode reflects the value of the most frequently occurring score. It is easy to assign a value to
the mode if the data are organized.
More Than One Mode
Distributions can have more than one mode (or no mode at all). Distributions with two obvious
peaks, even though they are not exactly the same height, are referred to as bimodal. Distributions
with more than two peaks are referred to as multimodal. The presence of more than one mode
might reflect important differences among subsets of data.
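Python's standard library can report one or several modes. The data below are hypothetical. Note that `multimode` only returns values tied for the highest frequency, so it would not flag a distribution whose two peaks differ in height, which the text above still calls bimodal.

```python
from statistics import multimode

# Hypothetical scores with a single most frequent value.
terms = [4, 4, 8, 8, 8, 4, 4, 2]
print(multimode(terms))

# Hypothetical qualitative data with two equally frequent values.
replies = ["a", "b", "a", "b", "c"]
print(multimode(replies))
```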
Example:
Notice that even the distribution of presidential terms in Figure 3.1 tends to be bimodal, with a
major peak at 4 years and a minor peak at 8 years, reflecting the two most typical terms of office.
MEDIAN
The median reflects the middle value when observations are ordered from least to most.
The median splits a set of ordered observations into two equal parts, the upper and lower halves.
In other words, the median has a percentile rank of 50, since observations with equal or smaller
values constitute 50 percent of the entire distribution.
Finding the Median
The following table shows how to find the median for two different sets of scores. The numbers
in shaded squares cross-reference instructions in the top panel with examples in the bottom panel.
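As a rough sketch of the usual procedure: order the scores, take the middle one when the count is odd, and average the two middle ones when it is even. The two score sets below are hypothetical.

```python
def median(scores):
    """Middle value of the ordered scores, or the average of the two middle values."""
    ordered = sorted(scores)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: single middle score
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle scores

print(median([2, 8, 4, 3, 7]))       # odd number of scores
print(median([3, 5, 8, 10, 1, 15]))  # even number of scores
```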
MEAN
The mean is found by adding all scores and then dividing by the number of scores.
That is, mean = (sum of all scores) / (number of scores).
To find the mean term for the 20 presidents, add all 20 terms in Table 3.1 (4 + . . .+ 4 + 8) to obtain
a sum of 112 years, and then divide this sum by 20, the number of presidents, to obtain a mean of
5.60 years.
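The same arithmetic in Python. The sum of 112 years and the count of 20 presidents are from the text; the short list passed to the function is hypothetical, since Table 3.1 is not reproduced here.

```python
def mean(scores):
    """Add all scores, then divide by the number of scores."""
    return sum(scores) / len(scores)

# From the text: 20 terms summing to 112 years.
mean_term = 112 / 20
print(mean_term)

# Hypothetical short list, only to exercise the function.
example = mean([4, 8, 4, 4])
print(example)
```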
Sample or Population?
Statisticians distinguish between two types of means—the population mean and the sample
mean—depending on whether the data are viewed as a population (a complete set of scores) or as
a sample (a subset of scores).
For example, if the terms of the 20 U.S. presidents are viewed as a population, then 5.60 years
qualifies as a population mean. On the other hand, if the terms of the 20 U.S. presidents are viewed
as a sample from the terms of all U.S. presidents, then 5.60 years qualifies as a sample mean. Not
only is the present distinction entirely a matter of perspective, but it also produces exactly the same
numerical value of 5.60 for both means.
For instance, Yes qualifies as the modal or most typical response for the Facebook profile question.
Inappropriate Averages
It would not be appropriate to report a median for unordered qualitative data with nominal
measurement, such as the ancestries of Americans. Nor would it be appropriate to report a mean
for any qualitative data, such as the ranks of officers in the U.S. Army. After all, words cannot be
added and then divided, as required by the formula for the mean.
When the data consist of a series of ranks, with its ordinal level of measurement, the median rank
always can be obtained. It’s simply the middlemost or average of the two middlemost ranks.
Example: College students were surveyed about where they would most like to spend their spring
break: Daytona Beach (DB), Cancun, Mexico (C), South Padre Island (SP), Lake Havasu (LH), or
other (O). The results were as follows:
Describing Variability
Averages are important, but they tell only part of the story. Statistics flourishes because we live in
a world of variability; no two people are identical, and a few are really far out. When summarizing
a set of data, we specify not only measures of central tendency, such as the mean, but also measures
of variability, that is, measures of the amount by which scores are dispersed or scattered in a
distribution.
This chapter describes several measures of variability, including the range, the interquartile range,
the variance, and most important, the standard deviation.
RANGE
Exact measures of variability not only aid communication but also are essential tools in statistics.
One such measure is the range. The range is the difference between the largest and smallest scores.
In Figure 4.1, distribution A, the least variable, has the smallest range of 0 (from 10 to 10);
distribution B, the moderately variable, has an intermediate range of 2 (from 11 to 9); and
distribution C, the most variable, has the largest range of 6 (from 13 to 7), in agreement with our
intuitive judgments about differences in variability. The range is a handy measure of variability
that can readily be calculated and understood.
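The range is a one-line computation. The scores for distribution C are the seven values described in these notes (one 7, two 9s, one 10, two 11s, one 13); distribution A is represented here by a hypothetical list of identical 10s.

```python
def value_range(scores):
    """Difference between the largest and smallest scores."""
    return max(scores) - min(scores)

distribution_a = [10, 10, 10]                # hypothetical: every score equals 10
distribution_c = [7, 9, 9, 10, 11, 11, 13]   # scores described in the notes

print(value_range(distribution_a), value_range(distribution_c))
```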
VARIANCE
For example, in distribution C, one score coincides with the mean of 10, four scores (two 9s and
two 11s) deviate 1 unit from the mean, and two scores (one 7 and one 13) deviate 3 units from the
mean, yielding a set of seven deviation scores: one 0, two –1s, two 1s, one –3, and one 3. (Deviation
scores above the mean are assigned positive signs; those below the mean are assigned negative
signs.)
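The deviation scores for distribution C can be checked numerically. The sketch also shows why they are squared before averaging: the raw deviations always sum to zero. The variance here divides by the number of scores, treating C as a population.

```python
distribution_c = [7, 9, 9, 10, 11, 11, 13]

mean = sum(distribution_c) / len(distribution_c)

# Deviation score: each score minus the mean (negative below, positive above).
deviations = [x - mean for x in distribution_c]

# Squaring removes the signs so the deviations no longer cancel out.
squared = [d ** 2 for d in deviations]

# Population variance: mean of the squared deviations.
variance = sum(squared) / len(distribution_c)

print(deviations, sum(deviations), variance)
```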
STANDARD DEVIATION
Standard deviation is the square root of the mean of all squared deviations from the mean, that is,
standard deviation = √variance.
As with the mean, statisticians distinguish between population and sample for both the variance
and the standard deviation, depending on whether the data are viewed as a complete set
(population) or as a subset (sample). This distinction is introduced here, and it will be very
important in inferential statistics.
Calculating the standard deviation requires that we obtain first a value for the variance. However,
calculating the variance requires, in turn, that we obtain the sum of the squared deviation scores.
The sum of squared deviation scores, or more simply the sum of squares, symbolized by SS, merits
special attention because it’s a major component in calculations for the variance, as well as many
other statistical measures. There are two formulas for the sum of squares: the definition formula,
which is easier to understand and remember, and the computation formula, which usually is more
efficient. In addition, we’ll introduce versions of these two formulas for populations and for
samples.
Sum of Squares Formulas for Population
The definition formula provides the most accessible version of the population sum of squares:
SS = Σ(X − μ)²    (4.1)
where SS represents the sum of squares, Σ directs us to sum over the expression to its right, and
(X − μ)² denotes each of the squared deviation scores. Formula 4.1 should be read as “The sum of
squares equals the sum of all squared deviation scores.” You can reconstruct this formula by
remembering the following three steps:
1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score,
X − μ.
2. Square each deviation score, (X − μ)², to eliminate negative signs.
3. Sum all squared deviation scores, Σ(X − μ)².
Computation formula
Example:
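The computation formula is not reproduced in these notes. The sketch below assumes its standard textbook form, SS = ΣX² − (ΣX)²/N, and checks that it agrees with the definition formula on the seven scores of distribution C.

```python
scores = [7, 9, 9, 10, 11, 11, 13]
N = len(scores)
mu = sum(scores) / N

# Definition formula: sum of squared deviations from the mean.
ss_definition = sum((x - mu) ** 2 for x in scores)

# Assumed computation formula: sum of squared scores minus (sum of scores)^2 / N.
sum_x = sum(scores)
sum_x_squared = sum(x ** 2 for x in scores)
ss_computation = sum_x_squared - sum_x ** 2 / N

print(ss_definition, ss_computation)
```

The computation form avoids calculating each deviation, which is why it tends to be more efficient by hand.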
Sample notation can be substituted for population notation in the above two formulas without
causing any essential changes:
Example:
or, in symbols:
Although the sum of squares term remains essentially the same for both populations and samples,
there is a small but important change in the formulas for the variance and standard deviation for
samples. This change appears in the denominator of each formula where N, the population size, is
replaced not by n, the sample size, but by n − 1, as shown:
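The effect of dividing by n − 1 instead of N can be seen by treating the same seven scores first as a sample and then as a population. The formulas themselves are not reproduced in these notes, so the sketch writes them out from the description above and cross-checks them against Python's `statistics` module.

```python
import math
import statistics

scores = [7, 9, 9, 10, 11, 11, 13]  # the scores of distribution C
n = len(scores)
mean = sum(scores) / n
ss = sum((x - mean) ** 2 for x in scores)  # sum of squares

# Sample formulas: divide by n - 1 (the degrees of freedom).
df = n - 1
sample_variance = ss / df
sample_sd = math.sqrt(sample_variance)

# Population formula: divide by the number of scores.
population_variance = ss / n

print(sample_variance, sample_sd, population_variance)
```

The sample variance is always slightly larger than the population variance computed from the same scores.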
Example:
Degrees of freedom (df) refers to the number of values that are free to vary, given one or more
mathematical restrictions, in a sample being used to estimate a population characteristic.
We can use degrees of freedom to rewrite the formulas for the sample variance and standard
deviation:
The most important spinoff of the range, the interquartile range (IQR), is simply the range for the
middle 50 percent of the scores. More specifically, the IQR equals the distance between the third
quartile (or 75th percentile) and the first quartile (or 25th percentile), that is, after the highest
quarter (or top 25 percent) and the lowest quarter (or bottom 25 percent) have been trimmed from
the original set of scores. Since most distributions are spread more widely in their extremities than
their middle, the IQR tends to be less than half the size of the range.
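A short sketch of the IQR using the seven scores of distribution C. Several conventions exist for locating quartiles; this example assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, which may differ slightly from a hand calculation.

```python
from statistics import quantiles

scores = [7, 9, 9, 10, 11, 11, 13]

# Quartiles under the assumed "inclusive" convention.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1                          # range of the middle 50 percent
full_range = max(scores) - min(scores)

print(q1, q3, iqr, full_range)
```

Here the IQR is less than half of the full range, in line with the observation above.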
Qualitative Data
Measures of variability are virtually nonexistent for qualitative or nominal data. It is probably
adequate to note merely whether scores are evenly divided among the various classes
(maximum variability), unevenly divided among the various classes (intermediate
variability), or concentrated mostly in one class (minimum variability).
If qualitative data can be ordered because measurement is ordinal (or if the data are ranked),
then it’s appropriate to describe variability by identifying extreme scores (or ranks).