FDS Unit II Notes

DATA SCIENCE NOTES

Uploaded by

maryjoseph

CS3352-Foundations of Data Science Information Technology 2022-2023

UNIT II DESCRIBING DATA


Types of Data - Types of Variables - Describing Data with Tables and Graphs - Describing
Data with Averages - Describing Variability - Normal Distributions and Standard (z) Scores.

Types of Data:
Most datasets broadly fall into two groups—numerical data and categorical data.

Numerical data

➢ This data involves some form of measurement.

➢ Examples: a person's age, height, weight, blood pressure, heart rate, temperature, number of teeth, number of bones, and number of family members.

➢ This data is often referred to as quantitative data in statistics.

➢ Numerical data can be either discrete or continuous.

Discrete data

➢ This is data that is countable and its values can be listed.

➢ For example, the number of heads in 200 coin flips can take any whole-number value from 0 to
200, which is a finite set of cases.

➢ A variable that represents a discrete dataset is referred to as a discrete variable.

➢ The discrete variable takes a fixed number of distinct values.

➢ Example:

 The Country variable can have values such as Nepal, India, Norway, and Japan.
 The Rank variable of a student in a classroom can take values from 1, 2, 3, 4, 5, and so on.

Continuous data

➢ A variable that can have an infinite number of numerical values within a specific range is
classified as continuous data.

➢ A variable describing continuous data is a continuous variable.

➢ Continuous data can follow an interval measure of scale or ratio measure of scale

➢ Example:

 The temperature of a city


 The weight variable is a continuous variable

Example table:

Categorical data

➢ This type of data represents the characteristics of an object

➢ Examples: gender, marital status, type of address, or movie category.

➢ This data is often referred to as qualitative data in statistics.

➢ Examples of categorical data

 Gender (Male, Female, Other, or Unknown)

 Marital Status (Annulled, Divorced, Interlocutory, Legally Separated, Married,
Polygamous, Never Married, Domestic Partner, Unmarried, Widowed, or Unknown)


 Movie genres (Action, Adventure, Comedy, Crime, Drama, Fantasy, Historical, Horror,
Mystery, Philosophical, Political, Romance, Saga, Satire, Science Fiction, Social, Thriller,
Urban, or Western)

 Blood type (A, B, AB, or O)


 Types of drugs (Stimulants, Depressants, Hallucinogens, Dissociatives, Opioids, Inhalants,
or Cannabis)

➢ A variable describing categorical data is referred to as a categorical variable.

➢ These types of variables can have a limited number of values.

Types of categorical variables

Binary categorical variable

➢ This type of variable can take exactly two values

➢ Also referred to as a dichotomous variable.

➢ Example: the result of an experiment is either success or failure.

Polytomous variables

➢ This type can take more than two possible values.

➢ Example: marital status can have several values, such as divorced, legally separated, married,
never married, unmarried, widowed, etc.

➢ Most categorical data follow either a nominal or an ordinal scale of measurement.

Nominal

➢ These are used for labeling variables without any quantitative value. The scales are generally
referred to as labels.

➢ These scales are mutually exclusive and do not carry any numerical importance.

➢ Examples:

1. What is your gender?


 Male
 Female
 Third gender/Non-binary
 I prefer not to answer
 Other

2. The languages that are spoken in a particular country

3. Biological species

4. Parts of speech in grammar (noun, pronoun, adjective, and so on)

5. Domains in biological taxonomy (Archaea, Bacteria, and Eukarya)

➢ Nominal scales are considered qualitative scales and the measurements that are taken using
qualitative scales are considered qualitative data.

➢ Numbers used as labels have no concrete numerical value or meaning.

➢ No form of arithmetic calculation can be made on nominal measures.

➢ The following can be computed for a nominal dataset:

 Frequency is the number of times a label occurs within the dataset.
 Proportion is calculated by dividing the frequency by the total number of observations.
 Percentage is calculated by multiplying each proportion by 100.
 To visualize a nominal dataset, either a pie chart or a bar chart can be used.
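
The frequency, proportion, and percentage of each label can be sketched in a few lines of Python. This is a minimal illustration only: the list of blood-type labels and all variable names are made-up assumptions, not data from these notes.

```python
from collections import Counter

# Hypothetical nominal observations (illustrative data only).
blood_types = ["A", "O", "B", "O", "AB", "O", "A", "O", "A", "B"]

# Frequency: how many times each label occurs.
frequency = Counter(blood_types)

# Proportion: frequency divided by the total number of observations.
total = len(blood_types)
proportion = {label: count / total for label, count in frequency.items()}

# Percentage: proportion multiplied by 100.
percentage = {label: 100 * p for label, p in proportion.items()}

for label in sorted(frequency):
    print(f"{label:>2}  f={frequency[label]}  "
          f"p={proportion[label]:.2f}  {percentage[label]:.0f}%")
```

Note that only counting is applied to the labels themselves; no arithmetic is performed on the categories.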

Ordinal

➢ The main difference between the ordinal and nominal scales is the order.

➢ In ordinal scales, the order of the values is a significant factor.

➢ The Likert scale uses a variation of an ordinal scale.

➢ Example of ordinal scale using the Likert scale:

WordPress is making content managers' lives easier. How do you feel about this statement?

Likert scale:


➢ The answer to the question is scaled to five ordinal values: Strongly Agree,
Agree, Neutral, Disagree, and Strongly Disagree.

➢ Such a scale is referred to as a Likert scale.

More examples of the Likert scale:

➢ To make it easier, consider ordinal scales as an order of ranking (1st, 2nd, 3rd, 4th, and so on).

➢ The median is permitted as a measure of central tendency for ordinal data; the mean is not,
because the distances between ranks are not defined.
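
As a rough sketch of how a median can be found for ordinal data without doing arithmetic on the categories, the snippet below sorts responses by rank and picks the middle one. The responses, the `LEVELS` list, and the `ordinal_median` helper are illustrative assumptions; for an even number of responses this sketch returns the lower of the two middle items.

```python
# Likert categories in rank order (lowest to highest agreement).
LEVELS = ["Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Agree"]

def ordinal_median(responses, levels=LEVELS):
    """Return the middle response after sorting by rank (lower middle for even n)."""
    ranked = sorted(responses, key=levels.index)
    return ranked[(len(ranked) - 1) // 2]

# Hypothetical survey responses (illustrative data only).
responses = ["Agree", "Neutral", "Strongly Agree", "Agree",
             "Disagree", "Agree", "Neutral"]
print(ordinal_median(responses))
```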

Interval

➢ Both the order and exact differences between the values are significant.

➢ Interval scales are widely used in statistics.

➢ Statistics such as the mean, median, mode, and standard deviation can be computed on
interval data.

➢ Examples: location in Cartesian coordinates and direction measured in degrees from magnetic north.

Ratio

➢ Ratio scales contain order, exact values, and absolute zero.

➢ They are used in descriptive and inferential statistics.

➢ These scales provide numerous possibilities for statistical analysis.

➢ Mathematical operations, measures of central tendency, measures of dispersion, and the
coefficient of variation can all be computed on such scales.

➢ Examples: measures of energy, mass, length, duration, electrical energy, plane angle, and
volume.

TYPES OF VARIABLES

A variable is a characteristic or property that can take on different values.


 Weights can be described not only as quantitative data but also as observations for a
quantitative variable, since the various weights take on different numerical values.
 Replies to the question "Do you have a Facebook profile?" can be described as observations
for a qualitative variable, since they take on different values of either Yes or No.

Discrete and Continuous Variables

 A discrete variable consists of isolated numbers separated by gaps.

Examples include most counts, such as the number of children in a family (1, 2, 3, etc., but never
1 1/2, in spite of how you might occasionally feel about a sibling).

 A continuous variable consists of numbers whose values, at least in theory, have no
restrictions.

Examples include amounts, such as weights of male statistics students; durations, such as the
reaction times of grade school children to a fire alarm; and standardized test scores, such as those
on the Scholastic Aptitude Test (SAT).

 Approximate numbers: when numbers are rounded off, as is always the case with values for
continuous variables, the resulting numbers are approximate, never exact.

For example, consider the weights of the male statistics students: a student whose weight is listed
as 150 lbs could actually weigh anywhere between 149.5 and 150.5 lbs.

Because of rounding-off procedures, gaps appear among values for continuous variables. For
example, because weights are rounded to the nearest pound, no male statistics student has a listed
weight between 150 and 151 lbs.
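
The interval of true values behind a rounded number can be sketched as the listed value plus or minus half a unit of measurement. The helper name `real_limits` and its `unit` parameter are illustrative assumptions.

```python
def real_limits(listed_value, unit=1.0):
    """Return the (lower, upper) limits of the true values that round to listed_value."""
    half = unit / 2
    return listed_value - half, listed_value + half

# A weight listed to the nearest pound as 150 lbs.
print(real_limits(150))
```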

Exercise: Indicate whether each of the following quantitative observations is discrete or
continuous.

(a) litter of mice
(b) cooking time for pasta
(c) parole violations by convicted felons
(d) IQ
(e) age
(f) population of your hometown
(g) speed of a jetliner

Answer:


Independent and Dependent Variables

 Independent variable: the treatment manipulated by the investigator in an experiment.
 Dependent variable: a variable that is believed to have been influenced by the independent
variable. The dependent variable is measured, counted, or recorded by the investigator.



DESCRIBING DATA WITH TABLES AND GRAPHS

A frequency distribution helps us to detect any pattern in the data (assuming a pattern exists) by
superimposing some order on the inevitable variability among observations. For example, the
appearance of a familiar bell-shaped pattern in the frequency distribution of reaction times of
airline pilots to a cockpit alarm suggests the presence of many small chance factors whose
collective effect must be considered in pilot retraining or cockpit redesign. A frequency distribution
is an organized tabulation or graphical representation of the number of individuals in each category on
the scale of measurement.
A frequency distribution describes the number of observations for each possible value of a
variable. Frequency distributions are depicted using graphs and frequency tables.
Example: Frequency distribution. In the 2022 Winter Olympics, Team USA won 25 medals. This
frequency table gives the medals’ values (gold, silver, and bronze) and frequencies:


What is a frequency distribution?


The frequency of a value is the number of times it occurs in a dataset. A frequency distribution is
the pattern of frequencies of a variable. It’s the number of times each possible value of a variable
occurs in a dataset.

Types of frequency distributions


There are four types of frequency distributions:

 Ungrouped frequency distributions: The number of observations of each value of a variable.
o You can use this type of frequency distribution for categorical variables.
 Grouped frequency distributions: The number of observations of each class interval of a
variable. Class intervals are ordered groupings of a variable’s values.
o You can use this type of frequency distribution for quantitative variables.
 Relative frequency distributions: The proportion of observations of each value or class
interval of a variable.
o You can use this type of frequency distribution for any type of variable when
you’re more interested in comparing frequencies than the actual number of
observations.
 Cumulative frequency distributions: The sum of the frequencies less than or equal to
each value or class interval of a variable.
o You can use this type of frequency distribution for ordinal or quantitative
variables when you want to understand how often observations fall below certain
values.

How to make a frequency table


Frequency distributions are often displayed using frequency tables. A frequency table is an
effective way to summarize or organize a dataset. It’s usually composed of two columns:

 The values or class intervals


 Their frequencies


The method for making a frequency table differs between the four types of frequency distributions.
You can follow the guides below or use software such as Excel, SPSS, or R to make a frequency
table.

How to make an ungrouped frequency table

The ungrouped frequency distribution is a type of frequency distribution that displays the
frequency of each individual data value instead of groups of data values. In this type of frequency
distribution, we can directly see how often different values occurred in the table.

 Ungrouped frequency distribution

When observations are sorted into classes of single values the result is referred to as a frequency
distribution for ungrouped data.

Example:


1. Create a table with two columns and as many rows as there are values of the variable.
Label the first column using the variable name and label the second column “Frequency.”
Enter the values in the first column.
o For ordinal variables, the values should be ordered from smallest to largest in the
table rows.
o For nominal variables, the values can be in any order in the table. You may wish to
order them alphabetically or in some other logical order.
2. Count the frequencies. The frequencies are the number of times each value occurs. Enter
the frequencies in the second column of the table beside their corresponding values.
o Especially if your dataset is large, it may help to count the frequencies by tallying.
Add a third column called “Tally.” As you read the observations, make a tick mark
in the appropriate row of the tally column for each observation. Count the tally
marks to determine the frequency.
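
The two steps above can be sketched in Python with `collections.Counter`. The list of bird sightings below is made-up illustrative data (not the observations from the example in these notes), and the column layout is just one reasonable choice.

```python
from collections import Counter

# Hypothetical observations of a nominal variable (illustrative data only).
birds = ["finch", "sparrow", "finch", "chickadee",
         "sparrow", "finch", "cardinal", "chickadee"]

# Step 2: count how many times each value occurs.
counts = Counter(birds)

# Step 1: one row per value; nominal values are listed alphabetically here.
print(f"{'Species':<10} {'Tally':<6} Frequency")
for species in sorted(counts):
    tally = "|" * counts[species]  # one tick mark per observation
    print(f"{species:<10} {tally:<6} {counts[species]}")
```

For an ordinal variable, the rows would instead be sorted from smallest to largest rank rather than alphabetically.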


Example: Making an ungrouped frequency table. A gardener set up a bird feeder in their backyard.
To help them decide how much and what type of birdseed to buy, they decide to record the bird
species that visit their feeder. Over the course of one morning, the following birds visit their feeder:

GROUPED FREQUENCY DISTRIBUTION:

When observations are sorted into classes of more than one value, the result is referred to as a
frequency distribution for grouped data.

Example:

Essential

1. Each observation should be included in one, and only one, class.

Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use 130–140, 140–150, 150–
160, etc., in which, because the boundaries of classes overlap, an observation of 140 (or 150) could
be assigned to either of two classes.


2. List all classes, even those with zero frequencies.

Example: The class 210–219 and its frequency of zero. It would be incorrect to skip this class
because of its zero frequency.

3. All classes should have equal intervals.

Example: 130–139, 140–149, 150–159, etc. It would be incorrect to use 130–139, 140–159, etc.,
in which the second class interval (140–159) is twice as wide as the first class interval (130–139).

Optional

4. All classes should have both an upper boundary and a lower boundary.

Example: 240–249. Less preferred would be 240–above, in which no maximum value can be
assigned to observations in this class.

5. Select the class interval from convenient numbers, such as 1, 2, 3, . . . 10, particularly 5 and 10
or multiples of 5 and 10.

Example: 130–139, 140–149, in which the class interval of 10 is a convenient number. Less
preferred would be 130–142, 143–155, etc., in which the class interval of 13 is not a convenient
number.

6. The lower boundary of each class interval should be a multiple of the class interval.

Example: 130–139, 140–149, in which the lower boundaries of 130, 140, are multiples of 10, the
class interval. Less preferred would be 135–144, 145–154, etc., in which the lower boundaries of
135 and 145 are not multiples of 10, the class interval.

7. Aim for a total of approximately 10 classes

Gaps between Classes:

The size of the gap should always equal one unit of measurement; that is, it should always equal
the smallest possible difference between scores within a particular set of data.
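
Several of these guidelines (equal widths, lower boundaries that are multiples of the class interval, and a gap of one unit between classes) can be sketched together. The function name `class_intervals` and the extreme values 133 and 245 are illustrative assumptions, not values taken from the notes.

```python
def class_intervals(lowest, highest, width=10, unit=1):
    """Return (lower, upper) tabled boundaries covering lowest..highest.

    Lower boundaries are multiples of `width`; adjacent classes are
    separated by a gap of one `unit` of measurement.
    """
    lower = (lowest // width) * width
    classes = []
    while lower <= highest:
        classes.append((lower, lower + width - unit))
        lower += width
    return classes

# Hypothetical extremes for a set of weights in pounds (illustrative only).
classes = class_intervals(133, 245)
print(classes[0], classes[-1], len(classes))
```

Every class is generated whether or not any observation falls in it, which matches the guideline to list classes with zero frequencies.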

Constructing Frequency Distributions


How to make a grouped frequency table


1. Divide the variable into class intervals. Below is one method to divide a variable into
class intervals. Different methods will give different answers, but there’s no agreement on
the best method to calculate class intervals.
o Calculate the range. Subtract the lowest value in the dataset from the highest.
o Decide the class interval width. There are no firm rules on how to choose the
width, but the following formula is a rule of thumb:

$$\text{width} = \frac{\text{range}}{\sqrt{\text{sample size}}}$$

You can round this value to a whole number or a number that’s convenient to add
(such as a multiple of 10).

o Calculate the class intervals. Each interval is defined by a lower limit and upper
limit. Observations in a class interval are greater than or equal to the lower limit and
less than the upper limit:

$$\text{lower limit} \le x < \text{upper limit}$$

The lower limit of the first interval is the lowest value in the dataset. Add the class
interval width to find the upper limit of the first interval and the lower limit of the
second interval. Keep adding the interval width to calculate more class intervals
until you exceed the highest value.

2. Create a table with two columns and as many rows as there are class intervals. Label the
first column using the variable name and label the second column “Frequency.” Enter the
class intervals in the first column.
3. Count the frequencies. The frequencies are the number of observations in each class
interval. You can count by tallying if you find it helpful. Enter the frequencies in the second
column of the table beside their corresponding class intervals.
Example: Grouped frequency distribution. A sociologist conducted a survey of 20 adults. She wants
to report the frequency distribution of the ages of the survey respondents. The respondents were the
following ages in years:
52, 34, 32, 29, 63, 40, 46, 54, 36, 36, 24, 19, 45, 20, 28, 29, 38, 33, 49, 37

The range is 63 − 19 = 44, and a width of 44 / √20 ≈ 9.84 follows from the rule of thumb.
Round the class interval width to 10.


The class intervals are 19 ≤ a < 29, 29 ≤ a < 39, 39 ≤ a < 49, 49 ≤ a < 59, and 59 ≤ a < 69.
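
The grouped frequency table for these ages can be sketched as follows. The ages and the class intervals come from the example; the width rule of range divided by the square root of the sample size is an assumption about the rule of thumb, and the variable names are illustrative.

```python
import math

ages = [52, 34, 32, 29, 63, 40, 46, 54, 36, 36,
        24, 19, 45, 20, 28, 29, 38, 33, 49, 37]

# Range: highest value minus lowest value.
value_range = max(ages) - min(ages)

# Assumed rule of thumb: width = range / sqrt(sample size).
raw_width = value_range / math.sqrt(len(ages))
width = 10  # raw_width rounded to a convenient number

# Build intervals lower <= a < upper, starting at the lowest value,
# until the highest value is covered.
intervals = []
lower = min(ages)
while lower <= max(ages):
    intervals.append((lower, lower + width))
    lower += width

# Count the observations that fall in each interval.
table = {(lo, hi): sum(lo <= a < hi for a in ages) for lo, hi in intervals}

for (lo, hi), freq in table.items():
    print(f"{lo} <= a < {hi}: {freq}")
```

Because each interval includes its lower limit and excludes its upper limit, an age of 29 is counted only in 29 ≤ a < 39, so every observation lands in exactly one class.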


RELATIVE FREQUENCY TABLE:

1. Create an ungrouped or grouped frequency table.


2. Add a third column to the table for the relative frequencies. To calculate the relative
frequencies, divide each frequency by the sample size. The sample size is the sum of the
frequencies.
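
As a minimal sketch of step 2, the snippet below starts from a grouped frequency table and adds the relative frequencies. The counts are the ones obtained by sorting the 20 survey ages listed earlier into their class intervals; the dictionary layout and names are illustrative assumptions.

```python
# Grouped frequency table for the 20 survey ages (counts per class interval).
frequencies = {
    "19 <= a < 29": 4,
    "29 <= a < 39": 9,
    "39 <= a < 49": 3,
    "49 <= a < 59": 3,
    "59 <= a < 69": 1,
}

# The sample size is the sum of the frequencies.
sample_size = sum(frequencies.values())

# Relative frequency: each frequency divided by the sample size.
relative = {interval: f / sample_size for interval, f in frequencies.items()}

for interval, f in frequencies.items():
    print(f"{interval}: frequency={f}, relative={relative[interval]:.2f}")
```

The relative frequencies should always sum to 1 (or 100 percent), which is a quick check on the arithmetic.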


Example: Relative frequency distribution

From this table, the gardener can make observations, such as that 19% of the bird feeder visits were
from chickadees and 25% were from finches.

CUMULATIVE FREQUENCY DISTRIBUTIONS

Cumulative frequency distributions show the total number of observations in each class and in all
lower-ranked classes.

This type of distribution can be used effectively with sets of scores, such as test scores for
intellectual or academic aptitude, when relative standing within the distribution assumes primary
importance. Cumulative frequencies are usually converted, in turn, to cumulative percentages.
Cumulative percentages are often referred to as percentile ranks.

Constructing Cumulative Frequency Distributions:

To convert a frequency distribution into a cumulative frequency distribution, add to the frequency
of each class the sum of the frequencies of all classes ranked below it.


Percentile Ranks: When used to describe the relative position of any score within its parent
distribution, cumulative percentages are referred to as percentile ranks. The percentile rank of a
score indicates the percentage of scores in the entire distribution with similar or smaller values
than that score. Thus a weight has a percentile rank of 80 if equal or lighter weights constitute 80
percent of the entire distribution.
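
The definition of a percentile rank can be sketched directly from raw scores. The helper `percentile_rank` and the list of weights are illustrative assumptions, not data from the notes.

```python
def percentile_rank(score, scores):
    """Percentage of scores in the distribution that are <= the given score."""
    at_or_below = sum(1 for s in scores if s <= score)
    return 100 * at_or_below / len(scores)

# Hypothetical weights in pounds (illustrative data only).
weights = [135, 142, 150, 150, 158, 160, 165, 172, 180, 205]
print(percentile_rank(172, weights))
```

Here eight of the ten weights are equal to or lighter than 172 lbs, so that weight has a percentile rank of 80.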

CUMULATIVE FREQUENCY TABLE

1. Create an ungrouped or grouped frequency table for an ordinal or quantitative variable.
Cumulative frequencies don’t make sense for nominal variables because the values have no
order: one value isn’t more than or less than another value.
2. Add a third column to the table for the cumulative frequencies. The cumulative
frequency is the number of observations less than or equal to a certain value or class
interval. To calculate the cumulative frequencies, add each frequency to the frequencies in the
previous rows.
3. Optional: If you want to calculate the cumulative relative frequency, add another column
and divide each cumulative frequency by the sample size.
Example: Cumulative frequency distribution

From this table, the sociologist can make observations such as 13 respondents (65%) were under 39
years old, and 16 respondents (80%) were under 49 years old.
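
The running totals behind these observations can be sketched with `itertools.accumulate`. The frequencies are the counts of the 20 survey ages in each class interval; the list layout and names are illustrative assumptions.

```python
from itertools import accumulate

intervals = ["19 <= a < 29", "29 <= a < 39", "39 <= a < 49",
             "49 <= a < 59", "59 <= a < 69"]
frequencies = [4, 9, 3, 3, 1]

# Cumulative frequency: each frequency plus all frequencies in earlier rows.
cumulative = list(accumulate(frequencies))

# Cumulative percentage: cumulative frequency divided by the sample size.
sample_size = cumulative[-1]
cumulative_percent = [100 * c / sample_size for c in cumulative]

for row in zip(intervals, frequencies, cumulative, cumulative_percent):
    print("{:<14} f={:<2} cum={:<2} cum%={:.0f}".format(*row))
```

The second and third rows reproduce the figures quoted above: 13 respondents (65%) under 39 and 16 respondents (80%) under 49.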

Example:

The IQ scores for a group of 35 high school dropouts are as follows:

(a) Construct a frequency distribution for grouped data.

(b) Specify the real limits for the lowest class interval in this frequency distribution.


Answer:

RELATIVE FREQUENCY DISTRIBUTIONS

Relative frequency distributions show the frequency of each class as a part or fraction of the total
frequency for the entire distribution.

Constructing Relative Frequency Distributions

To convert a frequency distribution into a relative frequency distribution, divide the frequency
for each class by the total frequency for the entire distribution.


The conversion to proportions is straightforward. For instance, to obtain the proportion of .06 for
the class 130–139, divide the frequency of 3 for that class by the total frequency of 53. Repeat
this process until a proportion has been calculated for each class.

OUTLIERS

Very extreme scores are called outliers. A GPA of 0.06, an IQ of 170, summer wages of
$62,000: each requires special attention because of its potential impact on a summary of the
data.

Check for Accuracy


Whenever you encounter an outrageously extreme value, such as a GPA of 0.06, attempt to verify
its accuracy. For instance, was a respectable GPA of 3.06 recorded erroneously as 0.06? If the
outlier survives an accuracy check, it should be treated as a legitimate score.

Might Exclude from Summaries: We might choose to segregate an outlier from any summary of the
data.

Example: Identify any outliers in each of the following sets of data collected from nine college
students


Answer: The outliers are a summer income of $25,700, an age of 61, and a family size of 18. There
are no outliers for GPA.

FREQUENCY DISTRIBUTIONS FOR QUALITATIVE (NOMINAL) DATA

When, among a set of observations, any single observation is a word, letter, or numerical code, the
data are qualitative. Frequency distributions for qualitative data are easy to construct.

Ordered Qualitative Data

When qualitative data have an ordinal level of measurement because observations can be ordered
from least to most, that order should be preserved in the frequency table.

Relative and Cumulative Distributions for Qualitative Data:

Frequency distributions for qualitative variables can always be converted into relative frequency
distributions. If measurement is ordinal because observations can be ordered from least to most,
cumulative frequencies (and cumulative percentages) can be used.

Answers:


INTERPRETING DISTRIBUTIONS:

1. Read the title, column headings, and any footnotes.
2. Find out where the data come from, and check whether a source is cited.
3. Focus on the form of the frequency distribution.
4. Inspect the content of the frequency distribution.

GRAPHS:

Data can be described clearly and concisely with the aid of a well-constructed frequency
distribution. Data can often be described even more vividly by converting frequency
distributions into graphs.
GRAPHS FOR QUANTITATIVE DATA

Histograms

A histogram is the most commonly used graph to show frequency distributions. It is a bar-type
graph for quantitative data. The common boundaries between adjacent bars emphasize the
continuity of the data, as with continuous variables.

Features of histograms:

 Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class
intervals of the frequency distribution.
 Equal units along the vertical axis (the Y axis, or ordinate) reflect increases in frequency.
(The units along the vertical axis do not have to be the same width as those along the
horizontal axis.)
 The intersection of the two axes defines the origin at which both numerical scales equal 0.
 Numerical scales always increase from left to right along the horizontal axis and from
bottom to top along the vertical axis. It is considered good practice to use wiggly lines to
highlight breaks in scale.
 The body of the histogram consists of a series of bars whose heights reflect the frequencies
for the various classes.

Figure: Histogram

Frequency Polygon

A frequency polygon is a line graph for quantitative data that also emphasizes the continuity of
continuous variables. Frequency polygons may be constructed directly from frequency
distributions.

The following frequency distribution shows the annual incomes in dollars for a group of college
graduates.


The step-by-step transformation of a histogram into a frequency polygon is described in panels
A, B, C, and D of Figure 2.2:

A. This panel shows the histogram for the weight distribution.

B. Place dots at the midpoints of each bar top or, in the absence of bar tops, at midpoints for classes
on the horizontal axis, and connect them with straight lines. [To find the midpoint of any class,
such as 160–169, simply add the two tabled boundaries (160 + 169 = 329) and divide this sum by
2 (329/2 = 164.5).]

C. Anchor the frequency polygon to the horizontal axis. First, extend the upper tail to the midpoint
of the first unoccupied class (250–259) on the upper flank of the histogram. Then extend the lower
tail to the midpoint of the first unoccupied class (120–129) on the lower flank of the histogram.
Now all of the area under the frequency polygon is enclosed completely.

D. Finally, erase all of the histogram bars, leaving only the frequency polygon. Frequency
polygons are particularly useful when two or more frequency distributions or relative frequency
distributions are to be included in the same graph.
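
The midpoint calculation in panel B and the anchoring in panel C can be sketched as follows. Only the horizontal positions are computed, since the class frequencies are not reproduced here; the helper `midpoint` and the variable names are illustrative assumptions.

```python
def midpoint(lower, upper):
    """Midpoint of a class from its two tabled boundaries."""
    return (lower + upper) / 2

# Classes 130-139 through 240-249 of the weight distribution.
classes = [(lo, lo + 9) for lo in range(130, 250, 10)]

# Anchor classes: the first unoccupied class on each flank (frequency 0).
lower_anchor = (120, 129)
upper_anchor = (250, 259)

# Horizontal positions of the polygon's dots, from left to right.
xs = [midpoint(*c) for c in [lower_anchor, *classes, upper_anchor]]

print(midpoint(160, 169))  # midpoint used in panel B
print(xs[0], xs[-1])       # positions where the polygon meets the axis
```

Plotting each class frequency above its midpoint, with zero at the two anchor positions, and joining the dots with straight lines gives the closed polygon described in panel C.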


Stem and Leaf Displays

 It is a technique for summarizing quantitative data.


 Stem and leaf displays are ideal for summarizing distributions, such as that for weight data,
without destroying the identities of individual observations.

Constructing a Display
The leftmost panel of Table 2.9 re-creates the weights of the 53 male statistics students listed in
Table 1.1. To construct the stem and leaf display for these data, first note that, when counting by
tens, the weights range from the 130s to the 240s. Arrange a column of numbers, the stems,
beginning with 13 (representing the 130s) and ending with 24 (representing the 240s). Draw a
vertical line to separate the stems, which represent multiples of 10, from the space to be occupied
by the leaves, which represent multiples of 1.
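
A stem and leaf display of this kind can be sketched by splitting each value into its tens (the stem) and its units digit (the leaf). The weights below are made-up illustrative values, not the 53 weights from Table 1.1, and the function name is an assumption.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens and above) to its sorted leaves (units digit)."""
    display = defaultdict(list)
    for v in sorted(values):
        display[v // 10].append(v % 10)
    return dict(display)

# Hypothetical weights in pounds (illustrative data only).
weights = [133, 135, 142, 145, 145, 150, 158, 160, 165, 172, 205, 245]
display = stem_and_leaf(weights)

# List every stem from the lowest to the highest, even those with no leaves.
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

Each original observation can be read back from the display (stem 14 with leaf 2 is 142), which is what preserves the identities of individual observations.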

TYPICAL SHAPES

Whether expressed as a histogram, a frequency polygon, or a stem and leaf display, an important
characteristic of a frequency distribution is its shape.

1. Normal

The familiar bell-shaped silhouette of the normal curve can be superimposed on many frequency
distributions, including those for uninterrupted gestation periods of human fetuses, scores on
standardized tests, and even the popping times of individual kernels in a batch of popcorn.
2. Bimodal


Any distribution that approximates the bimodal shape might reflect the coexistence of two
different types of observations in the same distribution. For instance, the distribution of the ages
of residents in a neighborhood consisting largely of either new parents or their infants has a
bimodal shape.

3. Positively Skewed

A lopsided distribution caused by a few extreme observations in the positive direction (to the right
of the majority of observations) is a positively skewed distribution. The distribution of incomes
among U.S. families has a pronounced positive skew, with most family incomes under $200,000
and relatively few family incomes spanning a wide range of values above $200,000.

4. Negatively Skewed

A lopsided distribution caused by a few extreme observations in the negative direction (to the left
of the majority of observations) is a negatively skewed distribution. The distribution of ages at
retirement among U.S. job holders has a pronounced negative skew, with most retirement ages at
60 years or older and relatively few retirement ages spanning the wide range of ages younger than
60.

A GRAPH FOR QUALITATIVE (NOMINAL) DATA

Bar Graph: A bar-type graph for qualitative data. Gaps between adjacent bars emphasize the
discontinuous nature of the data.


A person’s answer to the question “Do you have a Facebook profile?” is either Yes or No, not
some impossible intermediate value, such as 40 percent Yes and 60 percent No. Gaps are placed
between adjacent bars of bar graphs to emphasize the discontinuous nature of qualitative data. A
bar graph also can be used with quantitative data to emphasize the discontinuous nature of a
discrete variable, such as the number of children in a family.

CONSTRUCTING GRAPHS

1. Decide on the appropriate type of graph, recalling that histograms and frequency polygons are
appropriate for quantitative data, while bar graphs are appropriate for qualitative data and also are
sometimes used with discrete quantitative data.

2. Draw the horizontal axis, then the vertical axis, remembering that the vertical axis should be
about as tall as the horizontal axis is wide.
3. Identify the string of class intervals that eventually will be superimposed on the horizontal axis.
For qualitative data or ungrouped quantitative data, this is easy—just use the classes suggested by
the data. For grouped quantitative data, proceed as if you were creating a set of class intervals for
a frequency distribution.

4. Superimpose the string of class intervals (with gaps for bar graphs) along the entire length of
the horizontal axis. Do not use a string of empty class intervals to bridge a sizable gap between the
origin of 0 and the smallest class interval. Instead, use wiggly lines to signal a break in scale, then
begin with the smallest class interval. Also, do not clutter the horizontal scale with excessive
numbers—use just a few convenient numbers.

5. Along the entire length of the vertical axis, superimpose a progression of convenient numbers,
beginning at the bottom with 0 and ending at the top with a number as large as or slightly larger
than the maximum observed frequency. If there is a considerable gap between the origin of 0 and
the smallest observed frequency, use wiggly lines to signal a break in scale.

6. Using the scaled axes, construct bars (or dots and lines) to reflect the frequency of observations
within each class interval. For frequency polygons, dots should be located above the midpoints of
class intervals, and both tails of the graph should be anchored to the horizontal axis.

7. Supply labels for both axes and a title (or even an explanatory sentence) for the graph.
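Steps 3 and 6 can be sketched in code without any plotting library. The example below groups scores into class intervals, counts the frequency in each, and lists the points of a frequency polygon (midpoint, frequency) with both tails anchored at zero. The scores, the class width of 10, and the lowest boundary of 50 are all assumptions chosen for illustration.

```python
# Hypothetical whole-number scores (assumed data, not from these notes).
scores = [52, 57, 61, 63, 64, 68, 70, 71, 73, 75, 78, 79, 82, 85, 91]

WIDTH = 10   # assumed class width
LOWEST = 50  # assumed lower boundary of the smallest class interval


def class_frequencies(data, lowest, width):
    """Return (lower, upper, frequency) for each class interval."""
    n_classes = (max(data) - lowest) // width + 1
    counts = [0] * n_classes
    for x in data:
        counts[(x - lowest) // width] += 1
    return [(lowest + i * width, lowest + (i + 1) * width - 1, c)
            for i, c in enumerate(counts)]


def polygon_points(classes, width):
    """Midpoint/frequency pairs, with both tails anchored at frequency 0."""
    points = [((lo + hi) / 2, f) for lo, hi, f in classes]
    first_mid, last_mid = points[0][0], points[-1][0]
    return [(first_mid - width, 0)] + points + [(last_mid + width, 0)]


classes = class_frequencies(scores, LOWEST, WIDTH)
for lo, hi, f in classes:
    print(f"{lo}-{hi}: {'#' * f}")
print(polygon_points(classes, WIDTH))
```

The extra zero-frequency points at each end are what "anchoring both tails to the horizontal axis" amounts to when the polygon is drawn.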

DESCRIBING DATA WITH AVERAGES

Tables and graphs of frequency distributions are important points of departure when attempting to
describe data. More precise summaries, such as averages, provide additional valuable information.

Averages consist of numbers (or words) about which the data are, in some sense, centered. They
are often referred to as measures of central tendency. The several types of average yield numbers
or words that attempt to describe, most generally, the middle or typical value for a distribution.

MODE

The mode reflects the value of the most frequently occurring score. It is easy to assign a value to
the mode if the data are organized.

More Than One Mode

Distributions can have more than one mode (or no mode at all). Distributions with two obvious
peaks, even though they are not exactly the same height, are referred to as bimodal. Distributions
with more than two peaks are referred to as multimodal. The presence of more than one mode
might reflect important differences among subsets of data.
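A minimal sketch of finding modes with Python's standard library follows. The two score lists are hypothetical. Note that `statistics.multimode` reports only values tied for the highest frequency, so it will not flag a distribution as bimodal when its two peaks differ in height, as in the looser sense described above.

```python
from statistics import multimode

# Hypothetical scores (assumed data, not from these notes).
one_mode = [2, 3, 3, 4, 5]
two_modes = [1, 2, 2, 3, 4, 4, 5]

# multimode returns every value tied for the highest frequency.
print(multimode(one_mode))   # [3]
print(multimode(two_modes))  # [2, 4]
```

When every score occurs equally often, `multimode` returns all distinct values, which corresponds to the case of a distribution with no mode at all.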

Example:

Notice that even the distribution of presidential terms in Figure 3.1 tends to be bimodal, with a
major peak at 4 years and a minor peak at 8 years, reflecting the two most typical terms of office.

MEDIAN

The median reflects the middle value when observations are ordered from least to most.

The median splits a set of ordered observations into two equal parts, the upper and lower halves.
In other words, the median has a percentile rank of 50, since observations with equal or smaller
values constitute 50 percent of the entire distribution.

Finding the Median

The following table shows how to find the median for two different sets of scores. The numbers
in shaded squares cross-reference instructions in the top panel with examples in the bottom panel.
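The usual procedure can be sketched as follows: order the scores, then take the middle score when the number of scores is odd, or the average of the two middle scores when it is even. The two score sets below are hypothetical and are not the ones in the table.

```python
def median(scores):
    """Return the middle value of the ordered scores."""
    ordered = sorted(scores)          # order from least to most
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: single middle score
        return ordered[middle]
    # even count: average of the two middlemost scores
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_set = [2, 8, 2, 7, 6]        # hypothetical scores, odd count
even_set = [3, 5, 8, 10, 7, 11]  # hypothetical scores, even count

print(median(odd_set))   # 6
print(median(even_set))  # 7.5
```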

MEAN

The mean is found by adding all scores and then dividing by the number of scores.
That is,

Mean = (sum of all scores) / (number of scores)

To find the mean term for the 20 presidents, add all 20 terms in Table 3.1 (4 + . . .+ 4 + 8) to obtain
a sum of 112 years, and then divide this sum by 20, the number of presidents, to obtain a mean of
5.60 years.
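The arithmetic above can be checked with a short sketch. Only the sum (112) and the count (20) come from the text; the individual terms of Table 3.1 are not reproduced here, and the short list at the end is a hypothetical illustration.

```python
def mean(scores):
    """Add all scores, then divide by the number of scores."""
    return sum(scores) / len(scores)


# From the text: the 20 presidential terms sum to 112 years.
total_years = 112
number_of_presidents = 20
print(total_years / number_of_presidents)  # 5.6

# Hypothetical short list of terms (assumed data).
print(mean([4, 8, 4, 4]))  # 5.0
```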

Sample or Population?

Statisticians distinguish between two types of means—the population mean and the sample
mean—depending on whether the data are viewed as a population (a complete set of scores) or as
a sample (a subset of scores).

For example, if the terms of the 20 U.S. presidents are viewed as a population, then 5.60 years
qualifies as a population mean. On the other hand, if the terms of the 20 U.S. presidents are viewed
as a sample from the terms of all U.S. presidents, then 5.60 years qualifies as a sample mean. Not
only is the present distinction entirely a matter of perspective, but it also produces exactly the same
numerical value of 5.60 for both means.

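Formula for Sample Mean

X̄ = ΣX / n

where X̄ is the sample mean, ΣX is the sum of all scores in the sample, and n is the sample size.

Formula for Population Mean

μ = ΣX / N

where μ is the population mean, ΣX is the sum of all scores in the population, and N is the
population size.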

AVERAGES FOR QUALITATIVE AND RANKED DATA

Mode Always Appropriate for Qualitative Data

For quantitative data, all three averages (mean, median, and mode) can be used. But when
the data are qualitative, the choice among averages is restricted. The mode always can be used with
qualitative data.

For instance, Yes qualifies as the modal or most typical response for the Facebook profile question.

Median Sometimes Appropriate

The median can be used whenever it is possible to order qualitative data from least to most because
the level of measurement is ordinal. It’s easiest to determine the median class for ordered
qualitative data by using relative frequencies.
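One way to read "using relative frequencies" is to accumulate them from the lowest class upward and pick the first class at which the cumulative relative frequency reaches 50 percent. The sketch below assumes that reading; the rating classes and their frequencies are hypothetical.

```python
# Hypothetical ordinal classes, listed from least to most, with frequencies.
ratings = [("Poor", 5), ("Fair", 10), ("Good", 20), ("Excellent", 15)]


def median_class(ordered_classes):
    """Return the first class whose cumulative relative frequency is >= 0.5."""
    total = sum(f for _, f in ordered_classes)
    cumulative_count = 0
    for label, f in ordered_classes:
        cumulative_count += f
        if cumulative_count / total >= 0.5:
            return label
    return None  # reached only if there are no observations


print(median_class(ratings))  # Good
```

Here the cumulative relative frequencies are 0.10, 0.30, 0.70, and 1.00, so the 50 percent mark falls inside the class Good.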

Inappropriate Averages

It would not be appropriate to report a median for unordered qualitative data with nominal
measurement, such as the ancestries of Americans. Nor would it be appropriate to report a mean
for any qualitative data, such as the ranks of officers in the U.S. Army. After all, words cannot be
added and then divided, as required by the formula for the mean.

Averages for Ranked Data

When the data consist of a series of ranks, with its ordinal level of measurement, the median rank
always can be obtained. It’s simply the middlemost or average of the two middlemost ranks.

Example: College students were surveyed about where they would most like to spend their spring
break: Daytona Beach (DB), Cancun, Mexico (C), South Padre Island (SP), Lake Havasu (LH), or
other (O). The results were as follows:

Find the mode and, if possible, the median.

Answer: mode = DB (Daytona Beach)
It is impossible to find the median when qualitative data are unordered, with only nominal
measurement.

DESCRIBING VARIABILITY

Averages are important, but they tell only part of the story. Statistics flourishes because we live in
a world of variability; no two people are identical, and a few are really far out. When summarizing
a set of data, we specify not only measures of central tendency, such as the mean, but also measures
of variability, that is, measures of the amount by which scores are dispersed or scattered in a
distribution.

This chapter describes several measures of variability, including the range, the interquartile range,
the variance, and most important, the standard deviation.

RANGE

Exact measures of variability not only aid communication but also are essential tools in statistics.
One such measure is the range. The range is the difference between the largest and smallest scores.
In Figure 4.1, distribution A, the least variable, has the smallest range of 0 (from 10 to 10);
distribution B, the moderately variable, has an intermediate range of 2 (from 11 to 9); and
distribution C, the most variable, has the largest range of 6 (from 13 to 7), in agreement with our
intuitive judgments about differences in variability. The range is a handy measure of variability
that can readily be calculated and understood.
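A minimal sketch of the calculation follows. The seven scores for distribution C are the ones these notes list in the discussion of variance (one 7, two 9s, one 10, two 11s, one 13). Treating distribution A as seven scores of 10 is an assumption; the text states only that its scores run from 10 to 10.

```python
def value_range(scores):
    """Difference between the largest and smallest scores."""
    return max(scores) - min(scores)


distribution_a = [10] * 7                    # assumed: seven identical scores
distribution_c = [7, 9, 9, 10, 11, 11, 13]   # scores described in the text

print(value_range(distribution_a))  # 0
print(value_range(distribution_c))  # 6
```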

VARIANCE

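The variance is a measure of variability defined as the mean of all squared deviation scores, where
a deviation score is the difference between a score and the mean.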

For example, in distribution C, one score coincides with the mean of 10, four scores (two 9s and
two 11s) deviate 1 unit from the mean, and two scores (one 7 and one 13) deviate 3 units from the
mean, yielding a set of seven deviation scores: one 0, two –1s, two 1s, one –3, and one 3. (Deviation
scores above the mean are assigned positive signs; those below the mean are assigned negative
signs.)
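The deviation scores described above can be reproduced with a few lines of code. The sketch uses the seven scores of distribution C as listed in the text and then takes the mean of the squared deviations as the variance, treating the seven scores as a complete population (an assumption made for this illustration).

```python
distribution_c = [7, 9, 9, 10, 11, 11, 13]  # scores described in the text

mean_c = sum(distribution_c) / len(distribution_c)

# Deviation score = score minus mean (negative below the mean, positive above).
deviations = [x - mean_c for x in distribution_c]
print(deviations)

# Deviation scores always sum to zero, so they are squared before averaging.
squared = [d ** 2 for d in deviations]
variance = sum(squared) / len(distribution_c)
print(round(variance, 2))  # 3.14
```

Because the positive and negative deviations cancel, their plain average would always be zero; squaring is what makes the average informative.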

STANDARD DEVIATION

Standard deviation is the square root of the mean of all squared deviations from the mean, that is,

standard deviation = √variance

DETAILS: STANDARD DEVIATION

As with the mean, statisticians distinguish between population and sample for both the variance
and the standard deviation, depending on whether the data are viewed as a complete set
(population) or as a subset (sample). This distinction is introduced here, and it will be very
important in inferential statistics.

Sum of Squares (SS)

Calculating the standard deviation requires that we obtain first a value for the variance. However,
calculating the variance requires, in turn, that we obtain the sum of the squared deviation scores.
The sum of squared deviation scores, or more simply the sum of squares, symbolized by SS, merits
special attention because it’s a major component in calculations for the variance, as well as many
other statistical measures. There are two formulas for the sum of squares: the definition formula,
which is easier to understand and remember, and the computation formula, which usually is more
efficient. In addition, we’ll introduce versions of these two formulas for populations and for
samples.

Sum of Squares Formulas for Population

The definition formula provides the most accessible version of the population sum of squares:

SS = Σ(X − μ)²     (4.1)

where SS represents the sum of squares, Σ directs us to sum over the expression to its right, and
(X − μ)² denotes each of the squared deviation scores. Formula 4.1 should be read as “The sum of
squares equals the sum of all squared deviation scores.” You can reconstruct this formula by
remembering the following three steps:

1. Subtract the population mean, μ, from each original score, X, to obtain a deviation score,
X − μ.

2. Square each deviation score, (X − μ)², to eliminate negative signs.

3. Sum all squared deviation scores, Σ(X − μ)².

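Computation formula

SS = ΣX² − (ΣX)² / N

where ΣX² is the sum of the squared scores, (ΣX)² is the square of the sum of all scores, and N is
the population size.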

Example:
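The worked example that belongs here is not reproduced in these notes. As a substitute sketch, the code below applies both formulas to the seven scores of distribution C from the variance discussion. It assumes the standard computation form SS = ΣX² − (ΣX)²/N, and the function names are illustrative only.

```python
def ss_definition(scores):
    """Definition formula: sum of squared deviations from the mean."""
    mu = sum(scores) / len(scores)
    return sum((x - mu) ** 2 for x in scores)


def ss_computation(scores):
    """Computation formula: sum of X squared minus (sum of X) squared over N."""
    n = len(scores)
    return sum(x * x for x in scores) - sum(scores) ** 2 / n


distribution_c = [7, 9, 9, 10, 11, 11, 13]  # scores described in the text

print(ss_definition(distribution_c))   # 22.0
print(ss_computation(distribution_c))  # 22.0
```

For these scores ΣX² = 722 and (ΣX)² / N = 4900 / 7 = 700, so both routes give SS = 22. The computation formula avoids finding each deviation score, which is why it is usually the more efficient of the two by hand.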

Sum of Squares Formulas for Sample

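Sample notation can be substituted for population notation in the above two formulas without
causing any essential changes:

SS = Σ(X − X̄)²     (definition formula)

SS = ΣX² − (ΣX)² / n     (computation formula)

where X̄ is the sample mean and n is the sample size.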

Example:

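Standard Deviation for Population (σ)

Recall that, most generally, a mean is defined as the sum of all scores divided by the number of
scores. Since the variance is the mean of all squared deviation scores, it can be defined as the sum
of all squared deviation scores divided by the number of scores:

variance = (sum of all squared deviation scores) / (number of scores)

or, in symbols:

σ² = SS / N

The population standard deviation, σ, is the square root of the population variance:

σ = √(SS / N)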

Standard Deviation for Sample (s)

Although the sum of squares term remains essentially the same for both populations and samples,
there is a small but important change in the formulas for the variance and standard deviation for
samples. This change appears in the denominator of each formula where N, the population size, is
replaced not by n, the sample size, but by n − 1, as shown:

s² = SS / (n − 1)

s = √(SS / (n − 1))
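To see the effect of the changed denominator, the sketch below computes both standard deviations for the same seven scores (distribution C from the variance discussion), first viewed as a population and then as a sample. The helper names are illustrative; `statistics.pstdev` and `statistics.stdev` are the standard-library counterparts.

```python
import math
import statistics


def sum_of_squares(scores):
    """Sum of squared deviations from the mean (SS)."""
    m = sum(scores) / len(scores)
    return sum((x - m) ** 2 for x in scores)


def population_sd(scores):
    """Square root of SS / N."""
    return math.sqrt(sum_of_squares(scores) / len(scores))


def sample_sd(scores):
    """Square root of SS / (n - 1)."""
    return math.sqrt(sum_of_squares(scores) / (len(scores) - 1))


scores = [7, 9, 9, 10, 11, 11, 13]  # scores described in the text

print(round(population_sd(scores), 2))  # 1.77
print(round(sample_sd(scores), 2))      # 1.91
```

Dividing by n − 1 instead of n always gives the larger value, and the difference shrinks as the sample grows.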

Example:

DEGREES OF FREEDOM (df)

Degrees of freedom (df) refers to the number of values that are free to vary, given one or more
mathematical restrictions, in a sample being used to estimate a population characteristic.

We can use degrees of freedom to rewrite the formulas for the sample variance and standard
deviation:

s² = SS / df

s = √(SS / df)

where df = n − 1.
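The restriction behind df = n − 1 is that deviations from the sample mean must sum to zero. The sketch below illustrates this with the seven scores of distribution C, treated here as a sample (an assumption for this illustration): once any n − 1 deviations are known, the last one is no longer free to vary.

```python
sample = [7, 9, 9, 10, 11, 11, 13]  # scores described in the text
n = len(sample)
sample_mean = sum(sample) / n

deviations = [x - sample_mean for x in sample]

# Take the first n - 1 deviations as "free" and recover the last one
# from the restriction that all deviations sum to zero.
free_deviations = deviations[:-1]
forced_last = -sum(free_deviations)

df = n - 1
print(df)                           # 6
print(forced_last, deviations[-1])  # 3.0 3.0
```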

INTERQUARTILE RANGE (IQR)

The most important spinoff of the range, the interquartile range (IQR), is simply the range for the
middle 50 percent of the scores. More specifically, the IQR equals the distance between the third
quartile (or 75th percentile) and the first quartile (or 25th percentile), that is, after the highest
quarter (or top 25 percent) and the lowest quarter (or bottom 25 percent) have been trimmed from
the original set of scores. Since most distributions are spread more widely in their extremities than
their middle, the IQR tends to be less than half the size of the range.
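A minimal sketch of the calculation follows, using the seven scores of distribution C from the variance discussion. Quartile conventions differ between textbooks and software; the choice of `method="inclusive"` in `statistics.quantiles` is an assumption, and other conventions can give slightly different quartiles.

```python
from statistics import quantiles

scores = [7, 9, 9, 10, 11, 11, 13]  # scores described in the text

# Cut points that divide the ordered scores into four equal parts.
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

iqr = q3 - q1
score_range = max(scores) - min(scores)

print(q1, q3)            # 9.0 11.0
print(iqr, score_range)  # 2.0 6
```

Here the IQR of 2 is less than half the range of 6, in line with the tendency noted above.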

MEASURES OF VARIABILITY FOR QUALITATIVE AND RANKED DATA

Qualitative Data

Measures of variability are virtually nonexistent for qualitative or nominal data. It is probably
adequate to note merely whether scores are evenly divided among the various classes
(maximum variability), unevenly divided among the various classes (intermediate
variability), or concentrated mostly in one class (minimum variability).

Ordered Qualitative and Ranked Data

If qualitative data can be ordered because measurement is ordinal (or if the data are ranked),
then it’s appropriate to describe variability by identifying extreme scores (or ranks).
