Topic 3 - Data Analytics
Contents
3.1 Descriptive Analysis
3.2 Measure of Central Tendency
3.3 Measure of Dispersion
3.4 Correlation and Causation
Learning Outcomes
1. Define descriptive analysis
2. Describe measure of central tendency
3. Describe measure of dispersion
4. Distinguish between correlation and causation
3.1
Descriptive Analysis
Population and Sample Data
•Data can be categorised in several ways based on how they are collected and the type
of data collected.
•In many cases, it is not feasible to collect data from the population of all elements of
interest.
•In such instances, we collect data from a subset of the population known as a sample.
•It is very important to collect sample data that are representative of the population so
that generalisations can be made from them. A representative sample can be
gathered by random sampling from the population.
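As a rough illustration of random sampling, the sketch below draws a simple random sample of 20 from a population. The population of 500 student ID numbers and the fixed seed are assumptions made only for this example.

```python
import random

# Hypothetical population: ID numbers for 500 students (illustrative only).
population = list(range(1, 501))

# A fixed seed makes the illustration repeatable; real sampling would not fix it.
rng = random.Random(42)

# Simple random sample of 20 students, drawn without replacement,
# so every student has the same chance of being selected.
sample = rng.sample(population, 20)

print(sorted(sample))
```

Because every element has the same chance of selection, a sample drawn this way tends to be representative of the population.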
[SOURCE: CAMM, J. D., COCHRAN, J. J., FRY, M. J., OHLMANN, J. W. & ANDERSON, D. R. (2018). BUSINESS ANALYTICS (3RD ED.). CENGAGE LEARNING.]
Population vs Sample
•A population is the entire group that you want to draw conclusions about.
• Example: All Foundation students in UTAR
•A sample is the specific group that you will collect data from. The size of the
sample is always less than the total size of the population.
• Example: 20 Foundation students in UTAR
Creating Distributions from Data
•A distribution helps summarise many characteristics of a data set by describing how often certain
values of a variable appear in that data set. The number of times a value appears is known as its frequency.
•A frequency table organises and summarises the data in a tabular format, which makes it easier to
interpret the data and to detect extreme values. Sometimes it is desirable to express the
distribution in terms of percentages, which can also be displayed in the frequency table.
•Bins (classes) are the categories or groupings into which the data are distributed.
•Distributions can be created for both categorical and quantitative data.
◦ Frequency Distributions for Categorical Data
◦ Frequency Distributions for Quantitative Data
◦ Relative Frequency and Percent Frequency Distributions
Frequency Distributions for Categorical Data
Frequency Distributions for Quantitative Data
•Quantitative data is used to answer questions such as “How many?”, “How often?”, “How much?”.
•When creating frequency distributions for quantitative data, we must be more careful in defining the
non-overlapping bins to be used in the frequency distribution.
•The three steps necessary to define the classes for a frequency distribution with quantitative data are
as follows:
1. Determine the number of non-overlapping bins.
2. Determine the width of each bin.
3. Determine the bin limits.
Number of Bins
•Bins are formed by specifying the ranges used to group the data.
Example:
•Because the data set in Table 2.6 is relatively small (n = 20), we chose to develop a
frequency distribution with five bins.
Width of the Bins
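•To determine an approximate bin width, we begin by identifying the largest and
smallest data values, then use the following formula:
$$\text{Approximate bin width} = \frac{\text{largest data value} - \text{smallest data value}}{\text{number of bins}}$$
•For the audit times data, approximate bin width = (33 – 12)/5 = 4.2. Round up and use a bin width of five days.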
Bin Limits
•We selected 10 days as the lower bin limit and 14 days as the upper bin limit for the first bin.
•Defining the lower and upper bin limits in this way gives a total of five bins: 10 – 14, 15 – 19, 20 – 24,
25 – 29 and 30 – 34.
•Using the first two upper bin limits, 14 and 19, we see that the bin width is 19 – 14 = 5
(equivalently, the upper boundary of a bin minus its lower boundary, where the boundaries lie
halfway between adjacent bin limits, e.g. 14.5 – 9.5 = 5).
•Table 2.6 shows that four values (12, 14, 14 and 13) belong to the 10 – 14 bin.
Thus, the frequency for the 10 – 14 bin is 4.
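The three steps for defining bins can be sketched in Python. This is a minimal sketch, assuming integer-valued data and that the width is rounded up to a whole number. The function name `make_bins` and its parameters are hypothetical, while the inputs (smallest 12, largest 33, five bins, first lower limit 10) come from the audit times example.

```python
import math

def make_bins(smallest, largest, n_bins, first_lower):
    """Return the bin width and a list of (lower, upper) integer bin limits."""
    # Approximate width, rounded up to a whole number (here: whole days).
    width = math.ceil((largest - smallest) / n_bins)
    bins = []
    lower = first_lower
    for _ in range(n_bins):
        # For integer data, a bin of width w starting at `lower` ends at lower + w - 1.
        bins.append((lower, lower + width - 1))
        lower += width
    return width, bins

width, bins = make_bins(smallest=12, largest=33, n_bins=5, first_lower=10)

# The four values that the slide places in the first bin.
first_bin_values = [12, 14, 14, 13]
lower, upper = bins[0]
first_bin_frequency = sum(lower <= v <= upper for v in first_bin_values)

print(width)                # 5
print(bins)                 # [(10, 14), (15, 19), (20, 24), (25, 29), (30, 34)]
print(first_bin_frequency)  # 4
```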
Bin Limits
• The following table shows the frequency distributions for the audit times data
(Table 2.6) using five bins.
Relative Frequency and
Percent Frequency Distributions
•A relative frequency distribution shows the proportion of the total number of
observations associated with each value or class of values.
•For a data set with n observations, the relative frequency of each bin can be determined
as follows:
$$\text{Relative frequency of a bin} = \frac{\text{frequency of the bin}}{n}$$
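As a small worked example, the relative frequency of a bin is its frequency divided by n, and the percent frequency is that proportion multiplied by 100. The sketch below uses the 10 – 14 bin of the audit times data (frequency 4, n = 20). The function names are hypothetical.

```python
def relative_frequency(frequency, n):
    """Proportion of the n observations that fall in a bin."""
    return frequency / n

def percent_frequency(frequency, n):
    """Relative frequency expressed as a percentage."""
    return 100 * frequency / n

# The 10-14 bin of the audit times data holds 4 of the 20 observations.
rel = relative_frequency(4, 20)
pct = percent_frequency(4, 20)

print(rel)  # 0.2
print(pct)  # 20.0
```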
• Table 2.5 shows a relative frequency distribution and a percent frequency distribution
for the soft drink data.
• From Table 2.5, we can identify the percentage of occurrence for each bin.
• Table 2.7 shows the frequency, relative frequency, and percent frequency distributions
for the audit times data.
Steps to obtain the frequency in Excel
•We can use the FREQUENCY function in Excel to count the number of observations in each bin.
•Figure 2.11 shows the data from Table 2.6 entered into an Excel worksheet. The sample of
20 audit times is contained in cells A2:D6.
•The upper limits of the bins are listed in cells A10:A14.
•Steps to use the FREQUENCY function:
Step 1: Select cells B10:B14.
Step 2: Type the formula =FREQUENCY(A2:D6,A10:A14). The range A2:D6 defines the
data set, and the range A10:A14 defines the bins.
Step 3: Press CTRL + SHIFT + ENTER after typing the formula in Step 2.
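For readers working outside Excel, the counting done by FREQUENCY can be approximated in Python. This is a sketch, not Excel's implementation: it counts, for each upper limit, the values that are at most that limit and greater than the previous one, plus a final slot for values above every limit. Apart from the four values the slides assign to the 10 – 14 bin, the data are made up for illustration.

```python
from bisect import bisect_left

def frequency(data, upper_limits):
    """Count values per bin, where each bin is (previous limit, limit]."""
    # One extra slot at the end counts values above every upper limit.
    counts = [0] * (len(upper_limits) + 1)
    for value in data:
        # Index of the first upper limit that is >= value.
        counts[bisect_left(upper_limits, value)] += 1
    return counts

upper_limits = [14, 19, 24, 29, 34]

# Illustrative values: the first four are the ones the slides place in the
# 10-14 bin; the remaining three are hypothetical.
data = [12, 14, 14, 13, 18, 22, 33]

counts = frequency(data, upper_limits)
print(counts)  # [4, 1, 1, 0, 1, 0]
```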
Histogram
•A common graphical presentation of quantitative data is a histogram.
•Figure 2.12 is a histogram for the audit time data.
3.2
Measure of Central Tendency
Measure of Central Tendency
•There are three measures of central tendency: mean, median, and mode.
•Each measure of central tendency represents a single value identifying the
central position within a data set or, more technically, the middle or center in a
statistical distribution.
Mean
•The most commonly used measure of central tendency is the mean or average.
•It is calculated by adding up a group of numbers and then dividing the sum by the
count of those numbers.
•The mean can be found in Excel using the AVERAGE function.
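In symbols, the verbal definition of the mean can be written as follows. The notation (n observations labelled x_1 to x_n, with x-bar for the sample mean) is an assumption introduced here; it does not appear on the slides.

```latex
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}
```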
Median
•Another measure of central tendency is the median, which is the value in the middle when the
data are arranged in ascending order (smallest to largest value).
•For an odd number of observations, the median is the middle value.
median = 3
•For an even number of observations, we define the median as the average of the values for
the middle two observations.
median = 2.5
•The median of a data set can be found in Excel using the MEDIAN function.
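The two median rules can be checked with Python's `statistics.median`. The data sets below are assumptions, chosen only so that the results match the medians of 3 and 2.5 quoted above.

```python
from statistics import median

# Hypothetical data sets (not from the slides).
odd_count = [5, 1, 3, 2, 4]   # 5 observations: the median is the middle value
even_count = [4, 1, 3, 2]     # 4 observations: average of the two middle values

# median() sorts the data internally before locating the middle.
median_odd = median(odd_count)
median_even = median(even_count)

print(median_odd)   # 3
print(median_even)  # 2.5
```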
Mode
•A third measure of central tendency, the mode, is the value that occurs most
frequently in a data set.
•To illustrate the identification of the mode, consider the following five data values.
32 42 46 46 54
•The only value that occurs more than once is 46. This value has a frequency of 2
(the greatest frequency), hence, it is the mode.
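As a quick check, the three measures of central tendency can be computed for the five values above with Python's standard `statistics` module. This is only a sketch of the calculations; the variable names are hypothetical.

```python
from statistics import mean, median, mode

values = [32, 42, 46, 46, 54]

value_mean = mean(values)      # (32 + 42 + 46 + 46 + 54) / 5
value_median = median(values)  # middle value of the five sorted observations
value_mode = mode(values)      # the value that occurs most often

print(value_mean)    # 44
print(value_median)  # 46
print(value_mode)    # 46
```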
Mode
•The mode can be found in Excel using the MODE.SNGL or MODE.MULT functions.
•MODE.SNGL – To find the mode for a data set with only one most frequently occurring value.
•MODE.MULT – To find more than one mode if the greatest frequency occurs at two or more
different values in a data set.
•For example, in Table 2.9, two selling prices each occur twice ($138,000 and
$254,000). Hence, this data set has more than one mode. Both modes can be found in Excel
using MODE.MULT (refer to Figure 2.16).
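A data set with more than one mode can also be handled with Python's `statistics.multimode` (Python 3.8 or later), which returns every value tied for the greatest frequency. The price list below is hypothetical apart from the two repeated selling prices mentioned above; it is not Table 2.9.

```python
from statistics import multimode

# Hypothetical selling prices: only the two repeated values come from the slides.
prices = [138000, 254000, 186000, 138000, 254000, 199500]

# multimode() lists all values that share the highest frequency,
# in the order they first appear in the data.
modes = multimode(prices)

print(modes)  # [138000, 254000]
```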
Example
• Table 2.9 shows the collected data on home sales.
• Figure 2.16 shows the home sales data from Table 2.9 in an Excel Worksheet.
•The value for the mean in cell E2 is calculated using the formula =AVERAGE(B2:B13).
3.3
Measure of Dispersion
Measure of Dispersion
•The terms variability, spread, and dispersion are synonyms, and refer to how
spread out a distribution is.
•A measure of dispersion indicates how data are dispersed around the mean of
the distribution.
•There are four frequently used measures of dispersion: range, interquartile
range, variance, and standard deviation.
Range
•The range can be found by subtracting the smallest value from the largest value in a
data set.
How to calculate range in Excel?
•The range can be calculated in Excel using the MAX and MIN functions.
•Example: Find the range of scores for a sample consisting of 6 students (B2:B7).
• Use this formula:
=MAX(B2:B7)-MIN(B2:B7)
•Answer:
Range = 20
Variance
•The variance is a measure of dispersion that utilises all the data. The variance is
based on the deviation about the mean, which is the difference between the
value of each observation and the mean. The variance defines how close the
values in the distribution are to the middle of the distribution.
•Mathematically, using the mean as the measure of the middle of the distribution,
the variance is defined as the average squared difference of the values from the
mean.
•Variance gives only a general idea of the dispersion of a data set. A variance
of 0 indicates that there is no variability.
•The bigger the variance is, the more spread out the data are.
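Written as a formula, the "average squared difference from the mean" takes the form below for a sample. The notation is an assumption introduced here. Note that the sample variance divides by n − 1 rather than n, which is what Excel's VAR.S does.

```latex
s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
```

Here x_i is the i-th observation, x-bar is the sample mean and n is the number of observations.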
How to calculate variance in Excel?
•To calculate the variance in Excel, use the VAR.S function (or the older VAR function).
•Example: Find the variance of scores for a sample consisting of 6 students (B2:B7).
• Use this formula:
=VAR.S(B2:B7)
•Answer:
Sample variance = 58.3
Standard Deviation
•The standard deviation is a measure that indicates how much the values in a set of
data deviate (spread out) from the mean. It shows whether the data are close to the
mean or fluctuate a lot.
•Mathematically, the standard deviation is defined to be the positive square root of the
variance.
•Recall that the units associated with the variance are squared and that it is difficult to
interpret the meaning of squared units. Because the standard deviation is the square
root of the variance, it is measured in the same units as the original data. For this
reason, the standard deviation is more easily compared to the mean and other
statistics that are measured in the same units as the original data.
Standard Deviation
•A standard deviation of 0 indicates that every value in the data set is exactly
equal to the mean.
•The closer the standard deviation is to zero, the lower the data variability (the data are
more consistent) and the better the mean represents the individual values.
•The higher the standard deviation, the more variation there is in the data and the less
well the mean represents the individual values.
How to calculate standard deviation in Excel?
•To calculate the standard deviation in Excel, use the STDEV or STDEV.S function.
•Example: Find the standard deviation of scores for a sample consisting of 6 students
(B2:B7).
• Use this formula:
=STDEV(B2:B7)
• Answer:
Standard Deviation = 7.64
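The range, sample variance and sample standard deviation can also be sketched in Python. The six student scores in B2:B7 are not reproduced in these slides, so this sketch reuses the five values from the mode example (32, 42, 46, 46, 54) instead; that substitution is an assumption made purely for illustration.

```python
from statistics import stdev, variance

# The five values from the mode example, not the six-student score sample.
values = [32, 42, 46, 46, 54]

data_range = max(values) - min(values)  # like =MAX(...)-MIN(...)
sample_variance = variance(values)      # divides by n - 1, like VAR.S
sample_stdev = stdev(values)            # square root of the sample variance, like STDEV.S

print(data_range)       # 22
print(sample_variance)  # 64 (squared deviations 144 + 4 + 4 + 4 + 100 = 256, divided by 4)
print(sample_stdev)     # 8.0
```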
How to calculate standard deviation in Excel?
•In combination with the mean, the standard deviation can indicate the range within
which most of the students' scores fall.
•The mean of the data set is 81.5 and the standard deviation is 7.64, so most of the
students achieved a score between about 74 (81.5 – 7.64) and 89 (81.5 + 7.64).
Coefficient of variation
•The coefficient of variation indicates how large the standard deviation is relative to the mean.
This measure is usually expressed as a percentage.
$$\text{Coefficient of variation} = \left(\frac{\text{standard deviation}}{\text{mean}} \times 100\right)\%$$
•Example: For the score data set, we found a sample mean of 81.5 and a sample standard
deviation of 7.64. The coefficient of variation is (7.64/81.5 × 100)% = 9.37%.
•In words, the coefficient of variation tells us that the sample standard deviation is 9.37%
of the value of the sample mean.
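The coefficient of variation for the score data can be reproduced in a few lines of Python, using the sample mean (81.5) and sample standard deviation (7.64) quoted above. The function name is hypothetical.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100

cv = coefficient_of_variation(std_dev=7.64, mean=81.5)

print(f"{cv:.2f}%")  # 9.37%
```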
Percentiles
•A percentile is the value of a variable at which a specified (approximate) percentage of
observations are below that value.
• The pth percentile tells us the point in the data where approximately p% of the
observations have values less than the pth percentile; hence, approximately (100 − p)%
of the observations have values greater than the pth percentile.
•The pth percentile can also be calculated in Excel using the function PERCENTILE.EXC.
How to calculate percentile in Excel?
•Example: Compute the 70th percentile for the following sample data (B2:B11).
• Use this formula:
=PERCENTILE.EXC(B2:B11,0.7)
• Answer:
70th Percentile = 87.1
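To show how an exclusive percentile can be located by hand, the sketch below uses the position rule p × (n + 1) with linear interpolation between neighbouring sorted values, which is the rule generally described for PERCENTILE.EXC. The sample in B2:B11 is not reproduced in these slides, so the five values from the mode example are reused as an assumption. The function name is hypothetical.

```python
def percentile_exc(values, p):
    """Percentile at proportion p (0 < p < 1) using the p * (n + 1) position rule."""
    data = sorted(values)
    n = len(data)
    position = p * (n + 1)  # 1-based position in the sorted data
    if position < 1 or position > n:
        raise ValueError("p is outside the range this rule supports for n values")
    lower = int(position)        # whole part of the position
    fraction = position - lower  # how far to move towards the next value
    if lower == n:
        return data[-1]
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

# Five values from the mode example (not the B2:B11 sample).
values = [32, 42, 46, 46, 54]

p70 = percentile_exc(values, 0.7)  # position 4.2: 46 + 0.2 * (54 - 46)
p50 = percentile_exc(values, 0.5)  # position 3: the middle value

print(round(p70, 1))  # 47.6
print(p50)            # 46.0
```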
How to calculate quartiles in Excel?
•Example: Compute the quartiles for the following sample data (B2:B11).
• Answer:
1st Quartile = 71.5 (25th percentile)
2nd Quartile = 78 (median)
3rd Quartile = 88 (75th percentile)
Interquartile Range
•The difference between the third and first quartiles is often referred to as the
interquartile range, or IQR.
$$\text{IQR} = Q_3 - Q_1 = 88 - 71.5 = 16.5$$
•Because it excludes the smallest and largest 25% of values in the data, the IQR
is a useful measure of variation for data that have extreme values or are highly
skewed.
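Quartiles and the IQR can be sketched with `statistics.quantiles` (Python 3.8 or later). Its `method="exclusive"` option uses the same (n + 1)-based positions as the exclusive percentile rule. The data are the five values from the mode example, used here as an assumption because the B2:B11 sample is not reproduced in the slides.

```python
from statistics import quantiles

# Five values from the mode example (not the B2:B11 sample).
values = [32, 42, 46, 46, 54]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = quantiles(values, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 37.0 46.0 50.0
print(iqr)         # 13.0
```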
• Figure 2.19 shows the calculation of measures of dispersion for the home sales data in Excel.
Figure 2.19
• The range in cell E7 is calculated using the formula =MAX(B2:B13)-MIN(B2:B13).
• The sample variance in cell E8 is calculated using the formula =VAR.S(B2:B13).
• The sample standard deviation in cell E9 is calculated using the formula =STDEV.S(B2:B13).
• The coefficient of variation in cell E11 is calculated using the formula =E9/E2, which divides
the standard deviation by the mean.
• The 85th percentile of the home sales data in cell E13 is calculated using the
formula =PERCENTILE.EXC(B2:B13,0.85).
• The first quartile of the home sales data in cell E15 is calculated using the
formula =QUARTILE.EXC(B2:B13,1).
• Cells E16 and E17 use similar formulas to compute the second and third quartiles.
Empirical Rule
•When the distribution of data exhibits a symmetric bell-shaped distribution, as shown
in Figure 2.21, the empirical rule can be used to determine the percentage of data
values that are within a specified number of standard deviations of the mean.
Empirical Rule
Interpretation:
Empirical Rule
Example: The height of adult males in the United States has a bell-shaped distribution
with a mean of approximately 69.5 inches and standard deviation of approximately
3 inches. Using the empirical rule, we can draw the following conclusions:
• Approximately 68% of adult males in the United States have
heights between 69.5 - 3 = 66.5 and 69.5 + 3 = 72.5 inches.
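The same reasoning extends to two and three standard deviations. The sketch below applies the usual statement of the empirical rule (about 68%, about 95%, and almost all of the data within one, two and three standard deviations of the mean) to the height example. The 95% and 99.7% figures come from the standard rule rather than from the text above.

```python
mean_height = 69.5  # inches
sd_height = 3.0     # inches

# Share of a bell-shaped distribution within k standard deviations of the mean.
coverage = {
    1: "approximately 68%",
    2: "approximately 95%",
    3: "almost all (about 99.7%)",
}

# Interval for each k: mean - k * sd up to mean + k * sd.
intervals = {
    k: (mean_height - k * sd_height, mean_height + k * sd_height)
    for k in coverage
}

for k, (low, high) in intervals.items():
    print(f"{coverage[k]} of heights lie between {low} and {high} inches")
```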
Identifying Outliers
•A data set can have unusually large or unusually small values. These extreme values are
called outliers.
•An outlier may be a data value that has been incorrectly recorded; if so, it can be
corrected before the data are analysed further.
•An outlier may also be from an observation that does not belong to the population we
are studying and was incorrectly included in the data set; if so, it can be removed.
•Finally, an outlier may be an unusual data value that has been recorded correctly and is
a member of the population we are studying. In such cases, the observation should
remain.
Box Plots
•Also known as box-and-whisker plots.
•A box plot is a graphical summary of the distribution of data.
•A box plot is developed from the quartiles for a data set.
•Box plots are also very useful for comparing different data sets.
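Box plots commonly flag as outliers any values lying more than 1.5 × IQR below the first quartile or above the third quartile. That convention is an assumption here, since the slides do not state the rule, and the data set is hypothetical. A minimal sketch:

```python
from statistics import quantiles

# Hypothetical data with one unusually large value.
data = [10, 12, 12, 13, 13, 14, 14, 15, 40]

q1, _, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# Fences: values beyond these limits are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence)  # 8.25 18.25
print(outliers)                  # [40]
```

A flagged value still needs the judgement described under Identifying Outliers: it may be a recording error, may not belong to the population, or may be a genuine extreme value that should remain.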
•Figure 2.24 shows the box plot in Excel for a single variable.
•Figure 2.25 shows box plots in Excel for multiple variables, comparing home sales from
several different communities.
What can we learn from Figure 2.25?
•The most expensive houses appear to be in Shadyside and the cheapest houses in
Hamilton.
•The median home sales price in Groton is about the same as the median home sales
price in Irving.
•However, home sales prices in Irving have much greater variability. Homes appear to be
selling in Irving for many different prices, from very low to very high.
•Home sales prices have the least variation in Groton and Hamilton.
•The only outlier that appears in these box plots is for home sales in Groton. However,
note that most homes sell for very similar prices in Groton, so the selling price does not
have to be too far from the median to be considered an outlier.
3.4
Correlation and Causation
Correlation Coefficient
•Thus far, we have examined numerical methods used to summarise the data
for one variable at a time.
•However, managers or decision makers are often interested in the relationship
between two variables.
•To describe the relationship between two variables, the correlation coefficient is
used.
Correlation Coefficient
•The correlation coefficient measures the linear relationship between two variables.
•The correlation coefficient can take only values between -1 and 1. It measures both the
strength and direction of the linear relationship between the variables.
•We can compute the correlation coefficient using the Excel function CORREL.
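The correlation coefficient can also be computed directly from its definition. The sketch below implements the Pearson correlation coefficient, which is the quantity CORREL returns. The temperature and sales figures are hypothetical and are not the data in Table 2.14.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations, and sums of squared deviations.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical daily high temperatures and bottled water sales (illustrative only).
temperature = [24, 26, 27, 29, 30, 32, 33, 35]
sales = [18, 20, 19, 23, 25, 24, 28, 30]

r = pearson_r(temperature, sales)
print(round(r, 2))  # positive: sales tend to rise with temperature
```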
Correlation Coefficient
• Correlation coefficient value near 0 indicates no linear relationship between the x and y variables.
• Correlation coefficient greater than 0 indicates a positive linear relationship between the x and y
variables. The closer the correlation coefficient is to +1, the closer the x and y values are to forming a
straight line that trends upward to the right (positive slope).
• Correlation coefficient less than 0 indicates a negative linear relationship between the x and y
variables. The closer the correlation coefficient is to −1, the closer the x and y values are to forming a
straight line that trends downward to the right (negative slope).
• As correlation coefficient gets closer to -1 or 1, the strength of the relationship increases.

Correlation Coefficient (r)    Strength           Direction
1                              Perfect            Positive relationship (Positive correlation)
0.7 ≤ r < 1                    Strong             Positive relationship (Positive correlation)
0.5 ≤ r < 0.7                  Moderate           Positive relationship (Positive correlation)
0.3 ≤ r < 0.5                  Weak               Positive relationship (Positive correlation)
0 < r < 0.3                    Very weak          Positive relationship (Positive correlation)
0                              No relationship    No correlation
−0.3 < r < 0                   Very weak          Negative relationship (Negative correlation)
−0.5 < r ≤ −0.3                Weak               Negative relationship (Negative correlation)
−0.7 < r ≤ −0.5                Moderate           Negative relationship (Negative correlation)
−1 < r ≤ −0.7                  Strong             Negative relationship (Negative correlation)
−1                             Perfect            Negative relationship (Negative correlation)
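The strength and direction bands can be expressed as a small lookup function. This is a sketch of the cut-offs listed on this slide (0.3, 0.5 and 0.7). Other references use different cut-offs, and the function name is hypothetical.

```python
def describe_correlation(r):
    """Return (strength, direction) for a correlation coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    if r == 0:
        return ("No relationship", "No correlation")
    direction = "Positive" if r > 0 else "Negative"
    size = abs(r)  # strength depends only on the magnitude of r
    if size == 1:
        strength = "Perfect"
    elif size >= 0.7:
        strength = "Strong"
    elif size >= 0.5:
        strength = "Moderate"
    elif size >= 0.3:
        strength = "Weak"
    else:
        strength = "Very weak"
    return (strength, direction)

print(describe_correlation(0.8))   # ('Strong', 'Positive')
print(describe_correlation(-0.4))  # ('Weak', 'Negative')
```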
Correlation Coefficient Graphs
[Figure: three scatter plots showing positive correlation, negative correlation, and no correlation]
• Table 2.14 shows data for bottled water sales at Queensland
Amusement Park for a sample of 14 summer days.
• Figure 2.26 shows a scatter plot chart of the positive linear
relation between sales of bottled water and high temperatures.
• Figure 2.27 shows the data from Table 2.14 entered into an Excel Worksheet.
Causation
•An essential thing to understand about correlation is that it only shows how closely
related two variables are. Correlation, however, does not imply causation.
•Even if there is a correlation between two variables, we cannot conclude that one
variable causes a change in the other. This relationship could be coincidental, or a third
factor may be causing both variables to change.
•Causation means that one event causes another event to occur. Causation can only be
determined from an appropriately designed experiment. In such experiments, similar
groups receive different treatments, and the outcomes of each group are studied. We
can only conclude that a treatment causes an effect if the groups have noticeably
different outcomes.
[SOURCE: HTTPS://WWW.KHANACADEMY.ORG/TEST-PREP/PRAXIS-MATH/PRAXIS-MATH-LESSONS/GTP--PRAXIS-MATH--LESSONS--STATISTICS-AND-PROBABILITY/A/GTP--PRAXIS-MATH--ARTICLE--
CORRELATION-AND-CAUSATION--LESSON]
Example 1
Liam collected data on the sales of ice cream cones and air conditioners in his hometown.
He found that when ice cream sales were low, air conditioner sales tended to be low and
that when ice cream sales were high, air conditioner sales tended to be high.
•Liam can conclude that sales of ice cream cones and air conditioners are positively
correlated.
•Liam can't conclude that selling more ice cream cones causes more air conditioners to be
sold. It is likely that the increases in the sales of both ice cream cones and air
conditioners are caused by a third factor, an increase in temperature!
•So, sales of ice cream cones and air conditioners are positively correlated, but they do not
cause one another.
Example 2
Jane notices that students in her class with higher grades in college also had higher grades
in high school.
•Based on this observation, there is a positive correlation between grades in
college and grades in high school.
•Jane can’t conclude that success in high school causes success in college. It is
usually a person's hard work in college courses that causes that person to succeed
in college.
•So again, the two events, high school success and college success, are positively
correlated, but they do not cause one another.
Exercise
How do you describe this dataset?
Suppose a total of 60 students are taking INTRODUCTION OF DATA ANALYTICS this
trimester. Their lab test scores are as follows.
57 66 61 61 72 75
63 66 78 45 26 40
53 56 48 51 38 22
48 80 45 82 32 87
56 31 53 76 46 78
45 72 52 65 49 83
67 66 47 82 43 89
89 58 59 51 91 51
65 60 58 62 64 63
47 62 83 51 94 60
Describe Data
Histogram (Score Distribution)
Score Count
21 – 30 2
31 – 40 4
41 – 50 10
51 – 60 15
61 – 70 13
71 – 80 7
81 – 90 7
91 – 100 2
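The score distribution above can be reproduced from the 60 raw scores. The sketch below counts the scores in each bin and also reports relative and percent frequencies; the variable names are hypothetical.

```python
# Lab test scores for the 60 students, typed row by row from the exercise table.
scores = [
    57, 66, 61, 61, 72, 75,
    63, 66, 78, 45, 26, 40,
    53, 56, 48, 51, 38, 22,
    48, 80, 45, 82, 32, 87,
    56, 31, 53, 76, 46, 78,
    45, 72, 52, 65, 49, 83,
    67, 66, 47, 82, 43, 89,
    89, 58, 59, 51, 91, 51,
    65, 60, 58, 62, 64, 63,
    47, 62, 83, 51, 94, 60,
]

# Bins used in the histogram: 21-30, 31-40, ..., 91-100 (width 10).
bins = [(lower, lower + 9) for lower in range(21, 100, 10)]

# Frequency: number of scores falling inside each bin (limits inclusive).
counts = [sum(lo <= s <= hi for s in scores) for lo, hi in bins]

# Relative frequency: frequency divided by the number of observations.
relative = [c / len(scores) for c in counts]

for (lo, hi), c, rel in zip(bins, counts, relative):
    print(f"{lo:>2} - {hi:<3}  count = {c:>2}  relative = {rel:.3f}  percent = {100 * rel:.1f}%")
```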
You may ENRICH the data
No Score Gender Class No Score Gender Class No Score Gender Class
1 57 Male T1 21 61 Female T2 41 72 Female T3
2 63 Female T1 22 78 Female T2 42 26 Female T3
3 53 Female T1 23 48 Female T2 43 38 Male T3
4 48 Female T1 24 45 Female T2 44 32 Female T3
5 56 Female T1 25 53 Female T2 45 46 Male T3
6 45 Male T1 26 52 Male T2 46 49 Female T3
7 67 Male T1 27 47 Male T2 47 43 Female T3
8 89 Male T1 28 59 Female T2 48 91 Female T3
9 65 Female T1 29 58 Female T2 49 64 Female T3
10 47 Female T1 30 83 Female T2 50 94 Male T3
11 66 Female T1 31 61 Female T2 51 75 Male T3
12 66 Female T1 32 45 Male T2 52 40 Male T3
13 56 Female T1 33 51 Female T2 53 22 Female T3
14 80 Female T1 34 82 Female T2 54 87 Female T3
15 31 Female T1 35 76 Male T2 55 78 Female T3
16 72 Female T1 36 65 Female T2 56 83 Female T3
17 66 Female T1 37 82 Female T2 57 89 Female T3
18 58 Female T1 38 51 Female T2 58 51 Female T3
19 60 Female T1 39 62 Female T2 59 63 Male T3
20 62 Male T1 40 51 Male T2 60 60 Female T3
Data Visualisation (some Column Charts)
Is there any difference in student performance by gender or class?
Measures of Central Tendency and Dispersion
Average    Standard deviation
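One way to compare the classes is to compute the average and standard deviation of the scores for each class from the enriched table. The sketch below uses the sample standard deviation (the STDEV.S convention), which is an assumption; if the 60 students are treated as the whole population, `statistics.pstdev` could be used instead.

```python
from statistics import mean, stdev

# Scores grouped by class, copied from the enriched table (No 1-20, 21-40, 41-60).
scores_by_class = {
    "T1": [57, 63, 53, 48, 56, 45, 67, 89, 65, 47,
           66, 66, 56, 80, 31, 72, 66, 58, 60, 62],
    "T2": [61, 78, 48, 45, 53, 52, 47, 59, 58, 83,
           61, 45, 51, 82, 76, 65, 82, 51, 62, 51],
    "T3": [72, 26, 38, 32, 46, 49, 43, 91, 64, 94,
           75, 40, 22, 87, 78, 83, 89, 51, 63, 60],
}

# (average, sample standard deviation) for each class.
summary = {
    name: (mean(values), stdev(values))
    for name, values in scores_by_class.items()
}

for name, (average, sd) in summary.items():
    print(f"{name}: average = {average:.2f}, standard deviation = {sd:.2f}")
```

Similar averages paired with clearly different standard deviations would suggest that the classes perform alike on average but differ in how consistent their scores are.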