Topic 3 - Data Analytics

This document provides an introduction to descriptive analysis and measures of central tendency in data analytics. It defines key concepts such as population and sample, frequency distributions, measures of central tendency including mean, median and mode. Examples and steps are given to calculate these measures and create frequency distributions using Excel functions. Histograms are introduced as a way to graphically present quantitative data distributions.

Uploaded by

Tajendra Kathiravan


FHCT1014

Introduction to Data Analytics

TOPIC 3
DATA ANALYTICS
Contents
3.1 Descriptive Analysis
3.2 Measure of Central Tendency
3.3 Measure of Dispersion
3.4 Correlation and Causation
Learning Outcomes
1. Define descriptive analysis
2. Describe measure of central tendency
3. Describe measure of dispersion
4. Distinguish between correlation and causation
3.1 Descriptive Analysis
Population and Sample Data
•Data can be categorised in several ways, based on how they are collected and on the type of data collected.
•In many cases, it is not feasible to collect data from the population of all elements of interest.
•In such instances, we collect data from a subset of the population known as a sample.
•It is very important to collect sample data that are representative of the population so that generalisations can be made from them. A representative sample can be gathered by random sampling from the population.
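As a minimal sketch of random sampling, the Python snippet below draws 20 members from a population without replacement. The population of 500 student ID numbers and the seed are hypothetical values chosen only for illustration.

```python
import random

# Hypothetical population: ID numbers for 500 Foundation students.
population = list(range(1, 501))

# The seed is fixed only so that the illustration is repeatable.
rng = random.Random(42)

# Simple random sampling without replacement: every student is equally
# likely to be chosen, and no student is chosen twice.
sample = rng.sample(population, k=20)

print(sorted(sample))
```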

[SOURCE: CAMM, J. D., COCHRAN, J. J., FRY, M. J., OHLMANN, J. W. & ANDERSON, D. R. (2018). BUSINESS ANALYTICS (3RD ED.). CENGAGE LEARNING.]
Population vs Sample
•A population is the entire group that you want to draw conclusions about.
• Example: all Foundation students in UTAR

•A sample is the specific group that you will collect data from. The size of the sample is always less than the total size of the population.
• Example: 20 Foundation students in UTAR

Creating Distributions from Data
•A distribution helps summarise many characteristics of a data set by describing how often certain values of a variable appear in that data set. The number of times a value appears is known as its frequency.
•Sometimes it is desirable to express the distribution in terms of percentages. These can also be displayed in the frequency table, which helps to organise and summarise the data in tabular form, interpret the data, and detect extreme values in the data set.
•Bins (classes) are the categories or groupings used to distribute the data.
•Distributions can be created for both categorical and quantitative data:
◦ Frequency Distributions for Categorical Data
◦ Frequency Distributions for Quantitative Data
◦ Relative Frequency and Percent Frequency Distributions

Frequency Distributions for Categorical Data

Frequency Distributions for Quantitative Data
•Quantitative data is used to answer questions such as “How many?”, “How often?”, “How much?”.
•When creating frequency distributions for quantitative data, we must be more careful in defining the non-overlapping bins to be used in the frequency distribution.
•The three steps necessary to define the classes for a frequency distribution with quantitative data are as follows:
1. Determine the number of non-overlapping bins.
2. Determine the width of each bin.
3. Determine the bin limits.

Number of Bins
•Bins are formed by specifying the ranges used to group the data.
Example:
•Because the data set in Table 2.6 is relatively small (n = 20), we chose to develop a frequency distribution with five bins.

Width of the Bins
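•To determine an approximate bin width, we begin by identifying the largest and smallest data values. Use the following formula:

Approximate bin width = (Largest data value – Smallest data value) / Number of bins

•For the audit times data, approximate bin width = (33 – 12)/5 = 4.2. Round up and use a bin width of five days.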

Bin Limits
•We selected 10 days as the lower bin limit and 14 days as the upper bin limit for the first class.
•Defining the lower and upper bin limits in this way gives a total of five bins: 10 – 14, 15 – 19, 20 – 24, 25 – 29 and 30 – 34.
•Using the first two upper bin limits, 14 and 19, we see that the bin width is 19 – 14 = 5 (or: upper boundary of a bin – lower boundary of that bin).
•Table 2.6 shows that four values (12, 14, 14, and 13) belong to the 10 – 14 bin. Thus, the frequency for the 10 – 14 bin is 4.
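The three steps can be sketched in Python as follows. This is only an illustration: the function name make_bins and its parameters are hypothetical, and the upper limit is computed as lower + width – 1 on the assumption that the data are whole numbers (days). The inputs 12, 33, five bins and a first lower limit of 10 come from the audit times example.

```python
import math

def make_bins(smallest, largest, n_bins, first_lower):
    """Return (approximate width, rounded-up width, list of (lower, upper) limits)."""
    approx_width = (largest - smallest) / n_bins
    width = math.ceil(approx_width)  # round up so the bins cover all the data
    limits = []
    for i in range(n_bins):
        lower = first_lower + i * width
        upper = lower + width - 1    # whole-number data: next bin starts at upper + 1
        limits.append((lower, upper))
    return approx_width, width, limits

approx_width, width, limits = make_bins(smallest=12, largest=33, n_bins=5, first_lower=10)
print(approx_width, width)
print(limits)
```

Starting the first bin at 10 rather than at the smallest value (12) is a presentation choice; any starting point works as long as the bins do not overlap and together cover every data value.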

Bin Limits
• The following table shows the frequency distributions for the audit times data
(Table 2.6) using five bins.

Relative Frequency and Percent Frequency Distributions
•A relative frequency distribution shows the proportion of the total number of
observations associated with each value or class of values.
•For a data set with n observations, the relative frequency of each bin can be determined as follows:

Relative frequency of a bin = Frequency of the bin / n

•The percent frequency of each bin can be determined as follows:

Percent frequency of a bin = Relative frequency of a bin × 100
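A short Python sketch of both calculations is given below. It uses the score counts from the exercise at the end of this topic (n = 60); the variable names are illustrative only.

```python
# Score counts per bin, taken from the exercise at the end of this topic.
frequencies = {
    "21-30": 2, "31-40": 4, "41-50": 10, "51-60": 15,
    "61-70": 13, "71-80": 7, "81-90": 7, "91-100": 2,
}

n = sum(frequencies.values())  # total number of observations

# Relative frequency = frequency of the bin / n
relative = {b: f / n for b, f in frequencies.items()}

# Percent frequency = relative frequency x 100
percent = {b: r * 100 for b, r in relative.items()}

for b in frequencies:
    print(f"{b:>7}  {frequencies[b]:>3}  {relative[b]:.3f}  {percent[b]:.1f}%")
```

The relative frequencies always sum to 1 and the percent frequencies to 100, which is a quick check on the table.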

• Table 2.5 shows a relative frequency distribution and a percent frequency distribution
for the soft drink data.

• From the table above, we can identify the percentage of occurrence for each bin.

• Table 2.7 shows the frequency, relative frequency, and percent frequency distributions
for the audit times data.

Steps to obtaining the frequency in Excel
•We can use the FREQUENCY function in Excel to count the number of observations in each bin.
•Figure 2.11 shows the data from Table 2.6 entered into an Excel worksheet. The sample of 20 audit times is contained in cells A2:D6.
•The upper limits of the bins are listed in cells A10:A14.
•Steps to use the FREQUENCY function:
Step 1: Select cells B10:B14.
Step 2: Type the formula =FREQUENCY(A2:D6,A10:A14). The range A2:D6 defines the data set, and the range A10:A14 defines the bins.
Step 3: Press CTRL + SHIFT + ENTER after typing the formula in Step 2.

Histogram
•A common graphical presentation of quantitative data is a histogram.
•Figure 2.12 is a histogram for the audit time data.

• Note that the class of 15 – 19 days has the highest frequency, which is 8.

• Meanwhile, the class of 30 – 34 days has the lowest frequency, which is 1.

3.2 Measure of Central Tendency
Measure of Central Tendency
•There are three measures of central tendency: mean, median, and mode.
•Each measure of central tendency represents a single value identifying the
central position within a data set or, more technically, the middle or center in a
statistical distribution.

Mean
•The most commonly used measure of central tendency is the mean, or average.
•It is calculated by adding up a group of numbers and then dividing the sum by the count of those numbers.
•The mean can be found in Excel using the AVERAGE function.

Median
•Another measure of central tendency is the median, the value in the middle when the data are arranged in ascending order (smallest to largest value).
•For an odd number of observations, the median is the middle value.

median = 3
•For an even number of observations, we define the median as the average of the values for the middle two observations.

median = 2.5
•The median of a data set can be found in Excel using the function MEDIAN.
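The odd and even cases can be checked with Python's statistics module. The two small lists below are hypothetical; they were chosen only because they give the medians 3 and 2.5 quoted above.

```python
import statistics

# Hypothetical data sets, already in ascending order.
odd_count = [1, 2, 3, 4, 5]    # five observations: the middle value is the median
even_count = [1, 2, 3, 4]      # four observations: average the middle two, (2 + 3) / 2

median_odd = statistics.median(odd_count)
median_even = statistics.median(even_count)

print(median_odd, median_even)
```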

Mode
•A third measure of central tendency, the mode, is the value that occurs most
frequently in a data set.

•To illustrate the identification of the mode, consider the following five data values.
32 42 46 46 54

•The only value that occurs more than once is 46. This value has a frequency of 2
(the greatest frequency), hence, it is the mode.
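For comparison, the three measures of central tendency for these five values can be computed with Python's statistics module. This is a sketch alongside the Excel functions, not a replacement for them.

```python
import statistics

data = [32, 42, 46, 46, 54]

mean = statistics.mean(data)      # (32 + 42 + 46 + 46 + 54) / 5
median = statistics.median(data)  # middle value of the five sorted values
mode = statistics.mode(data)      # most frequently occurring value

print(mean, median, mode)
```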

Mode
•The mode can be found in Excel using the MODE.SNGL or MODE.MULT functions.
•MODE.SNGL finds the mode of a data set with only one most frequently occurring value.
•MODE.MULT finds more than one mode when the greatest frequency occurs at two or more different values in a data set.
•For example, in Table 2.9, two selling prices each occur twice ($138,000 and $254,000). Hence, this data set has more than one mode. To find both modes in Excel, refer to Figure 2.16.

Example
• Table 2.9 shows the collected data on home sales.

• Figure 2.16 shows the home sales data from Table 2.9 in an Excel Worksheet.
•The value for the mean in cell E2 is calculated using the formula =AVERAGE(B2:B13).

•The value for the median in cell E3 is found using the formula =MEDIAN(B2:B13).

•Excel enters the values for both modes of this data set in cells E4 and E5: $138,000 and $254,000.
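A rough Python counterpart of MODE.MULT is statistics.multimode, which returns every value tied for the greatest frequency. The price list below is hypothetical: only the two repeated values, $138,000 and $254,000, come from the text, and the other prices are placeholders.

```python
from statistics import multimode

# Hypothetical selling prices. Only 138000 and 254000 (each appearing twice)
# are taken from the text; the remaining values are made up for illustration.
prices = [138000, 254000, 186000, 138000, 254000, 397000]

# multimode returns all most-frequent values, in the order first encountered.
modes = multimode(prices)

print(modes)
```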

3.3 Measure of Dispersion
Measure of Dispersion
•The terms variability, spread, and dispersion are synonyms, and refer to how
spread out a distribution is.
•A measure of dispersion indicates how data are dispersed around the mean of
the distribution.
•There are four frequently used measures of dispersion: range, interquartile
range, variance, and standard deviation.

Range
•The range can be found by subtracting the smallest value from the largest value in a
data set.

•Example: What is the range of the following group of numbers?

32, 30, 45, 40, 99, 78, 82
Answer: 99 – 30 = 69. The range is 69.
•A large range indicates that the data are more dispersed, that is, spread over a wider interval.
•Because the range is based on only the largest and smallest values, it is highly influenced by extreme values.
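The same calculation in Python, as a small sketch, also shows how a single extreme value drives the result: dropping the largest value, 99, changes the range considerably.

```python
data = [32, 30, 45, 40, 99, 78, 82]

# Range = largest value - smallest value
data_range = max(data) - min(data)

# Sensitivity to extreme values: remove the single largest value and recompute.
without_max = [x for x in data if x != max(data)]
range_without_max = max(without_max) - min(without_max)

print(data_range, range_without_max)
```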

How to calculate range in Excel?
•The range can be calculated in Excel using the MAX and MIN functions.
•Example: Find the range of scores for a sample of 6 students (B2:B7).

•Use this formula:

=MAX(B2:B7) - MIN(B2:B7)

•Answer:
Range = 20
Variance
•The variance is a measure of dispersion that utilises all the data. It is based on the deviation about the mean, which is the difference between the value of each observation and the mean. The variance describes how close the values in the distribution are to the middle of the distribution.
•Mathematically, using the mean as the measure of the middle of the distribution, the variance is defined as the average squared difference of the values from the mean. (For a sample, the sum of squared differences is divided by n – 1 rather than n; this is what Excel's VAR.S computes.)
•The variance gives only a general idea of the dispersion of a data set, because it is expressed in squared units. A variance equal to 0 indicates that there is no variability.
•The bigger the variance, the more spread out the data.
How to calculate variance in Excel?
•To calculate the sample variance in Excel, use the VAR.S function (or VAR, its older equivalent).
•Example: Find the variance of scores for a sample of 6 students (B2:B7).

•Use this formula:

=VAR(B2:B7)

•Answer:
Sample variance = 58.3
Standard Deviation
•The standard deviation is a measure that indicates how much the values in a data set deviate (spread out) from the mean. It shows whether the data are close to the mean or fluctuate a lot.
•Mathematically, the standard deviation is defined as the positive square root of the variance.
•Recall that the units associated with the variance are squared, which makes the variance difficult to interpret. Because the standard deviation is the square root of the variance, it is measured in the same units as the original data. For this reason, the standard deviation is more easily compared to the mean and other statistics that are measured in the same units as the original data.
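The link between deviations, variance and standard deviation can be traced step by step in Python. The sketch below reuses the seven values from the range example and treats them as a sample; the variable names are illustrative.

```python
import math
import statistics

# Values reused from the range example, treated here as a sample.
data = [32, 30, 45, 40, 99, 78, 82]
n = len(data)
mean = statistics.mean(data)

# Sum of squared deviations about the mean.
sum_sq = sum((x - mean) ** 2 for x in data)

population_variance = sum_sq / n        # the "average squared difference"
sample_variance = sum_sq / (n - 1)      # divides by n - 1, as a sample variance does
sample_sd = math.sqrt(sample_variance)  # positive square root of the sample variance

print(round(sample_variance, 2), round(sample_sd, 2))
```

Dividing by n – 1 gives a slightly larger value than dividing by n; the difference shrinks as the sample grows.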

Standard Deviation
•A standard deviation equal to 0 indicates that every value in the data set is exactly equal to the mean.
•The closer the standard deviation is to zero, the lower the data variability and the better the mean represents the individual values (the data are more consistent).
•The higher the standard deviation, the more variation there is in the data and the less well the mean represents the individual values.

How to calculate standard deviation in Excel?
•To calculate the sample standard deviation in Excel, use the STDEV.S function (or STDEV, its older equivalent).
•Example: Find the standard deviation of scores for a sample of 6 students (B2:B7).
• Use this formula:
=STDEV(B2:B7)

• Answer:
Standard Deviation = 7.64
How to calculate standard deviation in Excel?
•In combination with the mean, the standard deviation indicates the interval within which most students' scores fall.
•The mean of the data set is 81.5 and the standard deviation is 7.64, so most of the students achieved a score between about 74 (81.5 – 7.64) and 89 (81.5 + 7.64).

Coefficient of variation
•The coefficient of variation indicates how large the standard deviation is relative to the mean. This measure is usually expressed as a percentage.

Coefficient of variation = (Standard deviation / Mean) × 100%

•Example: for the score data set, we found a sample mean of 81.5 and a sample standard deviation of 7.64. The coefficient of variation is (7.64/81.5) × 100 = 9.37%.
•In words, the coefficient of variation tells us that the sample standard deviation is 9.37% of the value of the sample mean.
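The arithmetic can be confirmed with a few lines of Python, using the sample mean and sample standard deviation quoted above.

```python
mean = 81.5                # sample mean of the score data set
standard_deviation = 7.64  # sample standard deviation of the score data set

# Coefficient of variation = (standard deviation / mean) x 100
cv_percent = standard_deviation / mean * 100

print(f"{cv_percent:.2f}%")
```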

Percentiles
•A percentile is the value of a variable at which a specified (approximate) percentage of
observations are below that value.

• The pth percentile tells us the point in the data where approximately p% of the
observations have values less than the pth percentile; hence, approximately (100 − p)%
of the observations have values greater than the pth percentile.

•The pth percentile can also be calculated in Excel using the function PERCENTILE.EXC.

How to calculate percentile in Excel?
•Example: Compute the 70th percentile for the following sample data (B2:B11).
• Use this formula:
=PERCENTILE.EXC(B2:B11,0.7)

• B2:B11 defines the data set for which we are calculating a percentile, and 0.7 defines the percentile of interest.

• Answer:
70th Percentile = 87.1

• About 70% of the students (around 7 students) achieved a score below 87.1.
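To show what such a calculation involves, here is a minimal Python sketch of one common percentile method: locate position p × (n + 1) in the sorted data and interpolate linearly between neighbouring values. To the best of our understanding this is the method behind PERCENTILE.EXC, but treat that as an assumption. The function name percentile_exc and the ten scores are hypothetical; the worksheet data behind the 87.1 answer are not reproduced here.

```python
def percentile_exc(values, p):
    """Return the p-th percentile (0 < p < 1) using position p * (n + 1)."""
    data = sorted(values)
    n = len(data)
    position = p * (n + 1)          # 1-based position in the sorted data
    if position < 1 or position > n:
        raise ValueError("p is outside the range this method supports")
    lower = int(position)           # whole part: rank of the value just below
    fraction = position - lower     # fractional part: how far to interpolate
    if lower == n:
        return float(data[-1])
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])

# Hypothetical scores for ten students.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

p70 = percentile_exc(scores, 0.7)   # position 7.7: between the 7th and 8th values
p50 = percentile_exc(scores, 0.5)   # position 5.5: halfway between 50 and 60

print(p70, p50)
```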
Quartiles
•Quartiles divide data into four parts, with each part containing approximately one-fourth, or 25 percent, of the observations. They are defined as follows:

Q1 = first quartile, or 25th percentile
Q2 = second quartile, or 50th percentile (median)
Q3 = third quartile, or 75th percentile

•A quartile can be computed in Excel using the function QUARTILE.EXC.

How to calculate quartiles in Excel?
•Example: Compute the quartiles for the following sample data (B2:B11).

• Use these formulas:

1st Quartile = QUARTILE.EXC(B2:B11,1)
2nd Quartile = QUARTILE.EXC(B2:B11,2)
3rd Quartile = QUARTILE.EXC(B2:B11,3)

• The range B2:B11 defines the data set, and the second argument (1, 2 or 3) indicates which quartile we want to compute.

• Answer:
1st Quartile = 71.5 (25th percentile)
2nd Quartile = 78 (median)
3rd Quartile = 88 (75th percentile)
Interquartile Range
•The difference between the third and first quartiles is often referred to as the
interquartile range, or IQR.

• For the scores of the 10 students, the IQR is:

IQR = Q3 − Q1 = 88 − 71.5 = 16.5

•Because it excludes the smallest and largest 25% of values in the data, the IQR
is a useful measure of variation for data that have extreme values or are highly
skewed.
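Quartiles and the IQR can also be sketched in Python with statistics.quantiles. Its "exclusive" method positions each quartile at p × (n + 1), which we assume corresponds to QUARTILE.EXC. The ten scores are hypothetical, so the results differ from the worksheet answers above.

```python
import statistics

# Hypothetical scores for ten students.
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(scores, n=4, method="exclusive")

# Interquartile range: spread of the middle 50% of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```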

• Figure 2.19 shows the calculation of measures of dispersion for the home sales data in Excel.

Figure 2.19
• The range in cell E7 is calculated using the formula =MAX(B2:B13)-MIN(B2:B13).
• The sample variance in cell E8 is calculated using the formula =VAR.S(B2:B13).
• The sample standard deviation in cell E9 is calculated using the formula =STDEV.S(B2:B13).
• The coefficient of variation in cell E11 is calculated using the formula =E9/E2, which divides the standard deviation by the mean.
• The 85th percentile of the home sales data in cell E13 is calculated using the formula =PERCENTILE.EXC(B2:B13,0.85).
• The first quartile of the home sales data in cell E15 is calculated using the formula =QUARTILE.EXC(B2:B13,1).
• Cells E16 and E17 use similar formulas to compute the second and third quartiles.

Empirical Rule
•When the distribution of data exhibits a symmetric bell-shaped distribution, as shown
in Figure 2.21, the empirical rule can be used to determine the percentage of data
values that are within a specified number of standard deviations of the mean.

Empirical Rule
Interpretation:

Empirical Rule
Example: The height of adult males in the United States has a bell-shaped distribution with a mean of approximately 69.5 inches and a standard deviation of approximately 3 inches. Using the empirical rule, we can draw the following conclusions:
• Approximately 68% of adult males in the United States have heights between 69.5 – 3 = 66.5 and 69.5 + 3 = 72.5 inches.

• Approximately 95% of adult males in the United States have heights between 69.5 – 6 = 63.5 and 69.5 + 6 = 75.5 inches.

• Almost all adult males in the United States have heights between 69.5 – 9 = 60.5 and 69.5 + 9 = 78.5 inches.
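The three intervals follow one pattern, mean ± k standard deviations for k = 1, 2, 3, which the following Python sketch makes explicit using the figures from the example.

```python
mean = 69.5               # mean height in inches
standard_deviation = 3.0  # standard deviation in inches

# Share of values expected within k standard deviations of the mean
# for a symmetric bell-shaped distribution, as stated in the example.
coverage = {1: "approximately 68%", 2: "approximately 95%", 3: "almost all"}

intervals = {
    k: (mean - k * standard_deviation, mean + k * standard_deviation)
    for k in coverage
}

for k, (low, high) in intervals.items():
    print(f"within {k} standard deviation(s): {low} to {high} inches ({coverage[k]})")
```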

Identifying Outliers
•A data set can have unusually large or unusually small values. These extreme values are called outliers.
•An outlier may be a data value that has been incorrectly recorded; if so, it can be
corrected before the data are analysed further.
•An outlier may also be from an observation that does not belong to the population we
are studying and was incorrectly included in the data set; if so, it can be removed.
•Finally, an outlier may be an unusual data value that has been recorded correctly and is
a member of the population we are studying. In such cases, the observation should
remain.

Box Plots
•Also known as box-and-whisker plots.
•A box plot is a graphical summary of the distribution of data.
•A box plot is developed from the quartiles for a data set.
•Box plots are also very useful for comparing different data sets.

•Figure 2.24 shows the box plot in Excel for a single variable.

•Figure 2.25 shows box plots in Excel for multiple variables, comparing home sales from several different communities.

What can we learn from Figure 2.25?
•The most expensive houses appear to be in Shadyside and the cheapest houses in
Hamilton.
•The median home sales price in Groton is about the same as the median home sales
price in Irving.
•However, home sales prices in Irving have much greater variability. Homes appear to be
selling in Irving for many different prices, from very low to very high.
•Home sales prices have the least variation in Groton and Hamilton.
•The only outlier that appears in these box plots is for home sales in Groton. However,
note that most homes sell for very similar prices in Groton, so the selling price does not
have to be too far from the median to be considered an outlier.

3.4 Correlation and Causation
Correlation Coefficient
•Thus far, we have examined numerical methods used to summarise the data
for one variable at a time.
•However, managers or decision makers are often interested in the relationship
between two variables.
•To describe the relationship between two variables, the correlation coefficient is used.

Correlation Coefficient
•The correlation coefficient measures the linear relationship between two variables.
•The correlation coefficient can take only values between -1 and 1. It measures both the strength and direction of the linear relationship between the variables.
•We can compute the correlation coefficient using the Excel function CORREL.

Correlation Coefficient
• A correlation coefficient value near 0 indicates no linear relationship between the x and y variables.
• A correlation coefficient greater than 0 indicates a positive linear relationship between the x and y variables. The closer the correlation coefficient is to +1, the closer the x and y values are to forming a straight line that trends upward to the right (positive slope).
• A correlation coefficient less than 0 indicates a negative linear relationship between the x and y variables. The closer the correlation coefficient is to −1, the closer the x and y values are to forming a straight line that trends downward to the right (negative slope).
• As the correlation coefficient gets closer to -1 or 1, the strength of the relationship increases.

Correlation Coefficient (r)    Strength     Direction
r = 1                          Perfect      Positive relationship (Positive correlation)
0.7 ≤ r < 1                    Strong       Positive relationship (Positive correlation)
0.5 ≤ r < 0.7                  Moderate     Positive relationship (Positive correlation)
0.3 ≤ r < 0.5                  Weak         Positive relationship (Positive correlation)
0 < r < 0.3                    Very weak    Positive relationship (Positive correlation)
r = 0                          None         No relationship (No correlation)
−0.3 < r < 0                   Very weak    Negative relationship (Negative correlation)
−0.5 < r ≤ −0.3                Weak         Negative relationship (Negative correlation)
−0.7 < r ≤ −0.5                Moderate     Negative relationship (Negative correlation)
−1 < r ≤ −0.7                  Strong       Negative relationship (Negative correlation)
r = −1                         Perfect      Negative relationship (Negative correlation)
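The strength and direction bands in the table can be expressed as a small lookup function. The function name describe_correlation is hypothetical, and the cut-offs are simply those of the table, which are a convention rather than a universal standard.

```python
def describe_correlation(r):
    """Return (strength, direction) for a correlation coefficient r in [-1, 1]."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # strength depends only on the magnitude of r
    if size == 0:
        return ("No relationship", "No correlation")
    if size == 1:
        strength = "Perfect"
    elif size >= 0.7:
        strength = "Strong"
    elif size >= 0.5:
        strength = "Moderate"
    elif size >= 0.3:
        strength = "Weak"
    else:
        strength = "Very weak"
    direction = "Positive" if r > 0 else "Negative"  # direction depends only on the sign
    return (strength, direction)

print(describe_correlation(0.93))
print(describe_correlation(-0.4))
```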

Correlation Coefficient Graphs
[Figure: three scatter plots, labelled positive correlation, negative correlation, and no correlation]

• Positive correlation – As x increases, y tends to increase.

• Negative correlation – As x increases, y tends to decrease.
• No correlation - As x increases, y tends to stay about the same or have no clear pattern.

• Table 2.14 shows data for bottled water sales at Queensland Amusement Park for a sample of 14 summer days.

• Figure 2.26 shows a scatter plot chart of the positive linear
relation between sales of bottled water and high temperatures.

• Figure 2.27 shows the data from Table 2.14 entered into an Excel Worksheet.

• The correlation coefficient is computed in cell B18 using the formula =CORREL(A2:A15, B2:B15).

• A correlation coefficient of 0.93 indicates a strong positive linear relationship between the high temperature and sales of bottled water.

• This is consistent with the observation that as the high temperature for a day increases, sales of bottled water generally increase.

Causation
•An essential thing to understand about correlation is that it only shows how closely
related two variables are. Correlation, however, does not imply causation.
•Even if there is a correlation between two variables, we cannot conclude that one
variable causes a change in the other. This relationship could be coincidental, or a third
factor may be causing both variables to change.
•Causation means that one event causes another event to occur. Causation can only be
determined from an appropriately designed experiment. In such experiments, similar
groups receive different treatments, and the outcomes of each group are studied. We
can only conclude that a treatment causes an effect if the groups have noticeably
different outcomes.

[SOURCE: HTTPS://WWW.KHANACADEMY.ORG/TEST-PREP/PRAXIS-MATH/PRAXIS-MATH-LESSONS/GTP--PRAXIS-MATH--LESSONS--STATISTICS-AND-PROBABILITY/A/GTP--PRAXIS-MATH--ARTICLE--
CORRELATION-AND-CAUSATION--LESSON]
Example 1
Liam collected data on the sales of ice cream cones and air conditioners in his hometown. He found that when ice cream sales were low, air conditioner sales tended to be low, and that when ice cream sales were high, air conditioner sales tended to be high.
•Liam can conclude that sales of ice cream cones and air conditioners are positively correlated.
•Liam can't conclude that selling more ice cream cones causes more air conditioners to be sold. It is likely that the increases in the sales of both ice cream cones and air conditioners are caused by a third factor: an increase in temperature.
•So, sales of ice cream cones and air conditioners are positively correlated, but they do not cause one another.
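A small simulation can make the third-factor idea concrete. In the Python sketch below, every number is invented: temperature is generated at random, and both sales figures are computed from temperature plus independent noise, so neither sales figure influences the other. The helper pearson_r is a hand-written implementation of the usual correlation formula.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the illustration is repeatable

# Invented data: 200 days of temperature, the hidden third factor.
temperature = [rng.uniform(20, 35) for _ in range(200)]

# Each sales series depends on temperature only, never on the other series.
ice_cream = [10 * t + rng.gauss(0, 20) for t in temperature]
air_conditioners = [3 * t + rng.gauss(0, 6) for t in temperature]

r = pearson_r(ice_cream, air_conditioners)
print(round(r, 2))
```

The two sales series come out clearly positively correlated even though, by construction, one cannot cause the other.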

Example 2
Jane notices that students in her class with higher grades in college also had higher grades in high school.
•Based on this observation, there is a positive correlation between grades in college and grades in high school.
•Jane can't conclude that success in high school causes success in college. It is usually a person's hard work in college courses that causes that person to succeed in college.
•So again, the two events, high school success and college success, are positively correlated, but they do not cause one another.

Exercise
How do you describe this dataset?
Suppose a total of 60 students are taking Introduction to Data Analytics this trimester. Their lab test scores are as follows.
57 66 61 61 72 75
63 66 78 45 26 40
53 56 48 51 38 22
48 80 45 82 32 87
56 31 53 76 46 78
45 72 52 65 49 83
67 66 47 82 43 89
89 58 59 51 91 51
65 60 58 62 64 63
47 62 83 51 94 60
Describe Data
Histogram (Score Distribution)
Score Count
21 – 30 2
31 – 40 4
41 – 50 10
51 – 60 15
61 – 70 13
71 – 80 7
81 – 90 7
91 – 100 2
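The counts in this table can be reproduced from the 60 raw scores. The Python sketch below mimics the idea of counting against a list of upper bin limits; it assumes every score lies between 21 and 100, which holds for this data set.

```python
from bisect import bisect_left

# The 60 lab test scores from the exercise, row by row.
scores = [
    57, 66, 61, 61, 72, 75,
    63, 66, 78, 45, 26, 40,
    53, 56, 48, 51, 38, 22,
    48, 80, 45, 82, 32, 87,
    56, 31, 53, 76, 46, 78,
    45, 72, 52, 65, 49, 83,
    67, 66, 47, 82, 43, 89,
    89, 58, 59, 51, 91, 51,
    65, 60, 58, 62, 64, 63,
    47, 62, 83, 51, 94, 60,
]

# Upper limit of each bin: 21-30, 31-40, ..., 91-100.
upper_limits = [30, 40, 50, 60, 70, 80, 90, 100]

counts = [0] * len(upper_limits)
for s in scores:
    # bisect_left finds the first bin whose upper limit is >= the score.
    counts[bisect_left(upper_limits, s)] += 1

for upper, count in zip(upper_limits, counts):
    print(f"{upper - 9} - {upper}: {count}")
```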
You may ENRICH the data
No Score Gender Class No Score Gender Class No Score Gender Class
1 57 Male T1 21 61 Female T2 41 72 Female T3
2 63 Female T1 22 78 Female T2 42 26 Female T3
3 53 Female T1 23 48 Female T2 43 38 Male T3
4 48 Female T1 24 45 Female T2 44 32 Female T3
5 56 Female T1 25 53 Female T2 45 46 Male T3
6 45 Male T1 26 52 Male T2 46 49 Female T3
7 67 Male T1 27 47 Male T2 47 43 Female T3
8 89 Male T1 28 59 Female T2 48 91 Female T3
9 65 Female T1 29 58 Female T2 49 64 Female T3
10 47 Female T1 30 83 Female T2 50 94 Male T3
11 66 Female T1 31 61 Female T2 51 75 Male T3
12 66 Female T1 32 45 Male T2 52 40 Male T3
13 56 Female T1 33 51 Female T2 53 22 Female T3
14 80 Female T1 34 82 Female T2 54 87 Female T3
15 31 Female T1 35 76 Male T2 55 78 Female T3
16 72 Female T1 36 65 Female T2 56 83 Female T3
17 66 Female T1 37 82 Female T2 57 89 Female T3
18 58 Female T1 38 51 Female T2 58 51 Female T3
19 60 Female T1 39 62 Female T2 59 63 Male T3
20 62 Male T1 40 51 Male T2 60 60 Female T3
Data Visualisation (some Column Charts)
Is there any difference in student performance by gender or class?
Measures of Central Tendency and Dispersion
Gender Average Standard deviation
Female 60.8 16.5
Male 59.2 17.1
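As a rough check of the figures above, the average and standard deviation for each gender can be computed from the enriched table. This is a sketch only. The names `scores`, `male_ids`, `male` and `female` are illustrative, and the use of the sample standard deviation (dividing by n − 1) is an assumption about how the slide's values were obtained.

```python
from statistics import mean, stdev

# Scores in student-number order (No 1 to 60), transcribed from the
# enriched table above.
scores = [
    57, 63, 53, 48, 56, 45, 67, 89, 65, 47,  # No 1-10  (T1)
    66, 66, 56, 80, 31, 72, 66, 58, 60, 62,  # No 11-20 (T1)
    61, 78, 48, 45, 53, 52, 47, 59, 58, 83,  # No 21-30 (T2)
    61, 45, 51, 82, 76, 65, 82, 51, 62, 51,  # No 31-40 (T2)
    72, 26, 38, 32, 46, 49, 43, 91, 64, 94,  # No 41-50 (T3)
    75, 40, 22, 87, 78, 83, 89, 51, 63, 60,  # No 51-60 (T3)
]

# Student numbers marked "Male" in the table; all others are "Female".
male_ids = {1, 6, 7, 8, 20, 26, 27, 32, 35, 40, 43, 45, 50, 51, 52, 59}

male = [s for no, s in enumerate(scores, start=1) if no in male_ids]
female = [s for no, s in enumerate(scores, start=1) if no not in male_ids]

# stdev() is the sample standard deviation (divides by n - 1); this choice
# is an assumption about how the slide's figures were computed.
for label, group in (("Female", female), ("Male", male)):
    print(f"{label:<6} n={len(group):<2} average={mean(group):.1f} "
          f"standard deviation={stdev(group):.1f}")
```

The two averages differ by less than two marks while each group's standard deviation is about 17 marks, so this summary alone does not suggest a large difference in performance between genders.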

Scatter Plot and Correlation
Suppose the T1 students recorded their weekly study hours.
You would like to know the relationship between score and study hours.
Study hour Score
3 31
4 45
7 47
9 48
6 53
7.5 56
6.5 56
8.5 57
10 58
6 60
7.5 62
8 63
8.5 65
9 66
9.5 66
7.5 66
10 67
9 72
12.5 80
12 89
Correlation = 0.83
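The correlation between study hours and score can be checked with a few lines of code. The sketch below computes the Pearson correlation coefficient directly from its definition. The names `pairs` and `pearson_r` are illustrative, and the data are the study-hour and score pairs listed above.

```python
from math import sqrt

# (study hours, score) pairs for the T1 students, transcribed from the
# table above.
pairs = [
    (3, 31), (4, 45), (7, 47), (9, 48), (6, 53),
    (7.5, 56), (6.5, 56), (8.5, 57), (10, 58), (6, 60),
    (7.5, 62), (8, 63), (8.5, 65), (9, 66), (9.5, 66),
    (7.5, 66), (10, 67), (9, 72), (12.5, 80), (12, 89),
]


def pearson_r(data):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    # Sums of squared deviations and of cross-products about the means.
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    syy = sum((y - mean_y) ** 2 for _, y in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    return sxy / sqrt(sxx * syy)


r = pearson_r(pairs)
print(f"Correlation = {r:.2f}")
```

A value close to +1 indicates a strong positive correlation: students who studied more hours tended to score higher. On its own this does not show that extra study hours cause higher scores.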

