Marketing Analytics (Unit 2)
Unit-2
Descriptive Analytics
By: Deependra Singh, Assistant Professor, School of Management, The NorthCap University, Gurugram
Descriptive Analytics
• Descriptive analytics summarizes data into meaningful charts and
reports, for example, about budgets, sales, revenues, or costs.
• Descriptive analytics is a set of techniques used to explain or
quantify the past.
• Examples of descriptive analytics include data queries,
visual reports, and descriptive statistics.
Understanding Data
• Data can be defined as a systematic record of a particular quantity. It is the
different values of that quantity represented together in a set.
• It is a collection of facts and figures to be used for a specific purpose such as
a survey or analysis.
• When arranged in an organized form, data can be called information. The source
of data (primary data, secondary data) is also an important factor.
Types of Data
➢Qualitative VS Quantitative Data
Qualitative Data: They represent some characteristics or attributes. They
depict descriptions that may be observed but cannot be computed or
calculated. They are more exploratory than conclusive in nature.
Quantitative Data: These can be measured, not simply observed. They
can be numerically represented, and calculations can be performed on them.
➢Discrete VS Continuous Data
Discrete Data: Discrete data can take on only integer values, such as
counts.
Continuous Data: Continuous data can take on any value in an interval,
that is, any value between the lowest and highest values of a range.
➢Primary VS Secondary Data
Primary Data: Primary data are data that an investigator collects for the first
time for a particular purpose.
Secondary Data: Secondary data are data sourced from someplace that
originally collected them.
➢Categorical VS Binary VS Ordinal Data
• Categorical Data: Data that can take on only a specific set of values
representing a set of possible categories.
• Binary/Dichotomous/Boolean Data: A special case of categorical data
with just two categories of values (0/1, true/false).
• Ordinal Data: Categorical data that has an explicit ordering.
Rectangular Data
• The typical frame of reference for an analysis in data science is a
rectangular data object, like a spreadsheet or database table.
• Rectangular data (like a spreadsheet) is the basic data structure for
statistical and machine learning models.
• Rectangular data is essentially a two-dimensional matrix with rows
indicating records (cases) and columns indicating features
(variables).
Sales volume generated by salesmen

Region   S1   S2   S3   S4
East     24   30   26   23
West     22   32   27   25
North    23   28   25   22
South    32   31   32   34
Non-rectangular data structures
• Time series data records successive measurements of the same variable. It
is the raw material for statistical forecasting methods.
• Spatial data are used in mapping and location analytics. They are relatively more
complex and varied than rectangular data structures.
Data Preparation and handling
• Data cleaning is the process of detecting and correcting or removing
corrupt or incomplete records from a record set; it refers to identifying
incomplete, incorrect, inaccurate, or irrelevant parts of the data and then
replacing, modifying, or deleting the unsuitable data.
• Data screening is the process of ensuring that the researcher’s data are clean
and ready for statistical analyses.
• Data editing is the inspection and correction of the data received from each
element of the sample.
Data Cleaning
• Under data cleaning, a researcher generally focuses on these three aspects:
❑Missing Data: Information not available for a case about whom other
information is available. It occurs when a respondent fails to answer some
questions in a survey.
❑Outliers: Outliers are observations with a unique combination of
characteristics identifiable as distinctly different from the other observations.
❑Normality: Normality is the degree to which the distribution of the sample
data corresponds to a normal distribution.
Four-step process for identifying missing
data and applying remedies
• The researcher must ascertain whether the missing data process occurs in a
completely random manner. When the data set is small, the researcher may
be able to see patterns visually or perform a set of simple calculations.
• However, as sample size increases, so does the need for empirical diagnostic
tests. Some statistical programs add techniques specifically designed for
missing data analysis (e.g., Missing Value Analysis in SPSS), which generally
include one or both of the following diagnostic tests.
• The first approach assesses the missing data process of a single variable Y by forming two
groups: observations with missing data for Y and those with valid values of Y. Statistical
tests are then performed to determine whether significant differences exist between the two
groups on other variables of interest. Significant differences indicate the possibility of a
nonrandom missing data process.
• The second approach is an overall test of randomness that determines whether the
missing data can be classified as MCAR. This test analyses the pattern of missing data on all
available variables and compares it with the pattern expected for a random missing data
process. If no significant differences are found, the missing data can be classified as MCAR.
If significant differences are found, the researcher must use the approaches to identify the
specific missing data processes that are nonrandom.
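A minimal sketch of the first diagnostic approach, in Python with hypothetical data and column names ('income' plays the role of Y): cases are split by whether Y is missing, and the two groups are compared on another variable of interest.

import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical survey data: 'income' has missing values, 'age' is fully observed.
df = pd.DataFrame({
    "age":    [23, 31, 45, 27, 52, 38, 29, 41, 36, 48],
    "income": [40, np.nan, 75, 38, np.nan, 60, 42, np.nan, 55, 80],
})

# Group 1: cases with valid income; Group 2: cases with missing income.
observed = df.loc[df["income"].notna(), "age"]
missing = df.loc[df["income"].isna(), "age"]

# A significant difference between the groups on 'age' indicates the
# possibility of a nonrandom missing data process.
t_stat, p_value = stats.ttest_ind(observed, missing, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")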
Step 4: Select the imputation method
• At this step of the process, the researcher must select the approach used for
accommodating missing data in the analysis. This decision is based on
whether the missing data are missing at random (MAR) or missing
completely at random (MCAR).
• Imputation is the process of estimating the missing value based on valid
values of other variables and/or cases in the sample. The researcher has
several options for imputation.
• Imputation is generally applied to metric data and avoided for nonmetric
data.
Imputation of missing data
In this approach, the researcher substitutes a value from another source for the
missing values.
• In the “hot deck” method, the value comes from another observation in the
sample that is deemed similar. Each observation with missing data is paired with
another case that is similar on a variable specified by the researcher. Then missing
data are replaced with valid values from the similar observation.
• In the “cold deck” method, the replacement value is derived from an external
source (e.g. prior studies, other sample, etc.). Here, the researcher must be sure that
the replacement value from an external source is more valid than an internally
generated value.
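A minimal hot-deck sketch in Python, assuming hypothetical data where 'age' is the matching variable specified by the researcher and 'income' holds the missing values:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [23, 31, 45, 27, 52],
    "income": [40, np.nan, 75, 38, np.nan],
})

# Donors are observations with valid values of the variable being imputed.
donors = df[df["income"].notna()]
for idx in df.index[df["income"].isna()]:
    # Pair each incomplete case with the donor closest on the matching
    # variable, then replace the missing value with the donor's valid value.
    nearest = (donors["age"] - df.at[idx, "age"]).abs().idxmin()
    df.at[idx, "income"] = donors.at[nearest, "income"]

print(df)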
Multivariate Outlier Detection
• Because most multivariate analyses involve more than two variables, the
bivariate methods quickly become inadequate because:
i) they require a large number of graphs, and
ii) they are limited to two dimensions (variables) at a time.
• This issue is addressed by the Mahalanobis D² measure, as in the sketch
below. Higher D² values represent observations farther removed from the
general distribution of observations in this multidimensional space.
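A minimal sketch of the Mahalanobis D² computation in NumPy, with hypothetical data (the last observation is deliberately placed far from the rest):

import numpy as np

# Rows are observations (cases), columns are variables.
X = np.array([[2.0, 3.0], [2.5, 3.2], [3.0, 2.8], [2.2, 3.1], [8.0, 9.0]])

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# D^2 for each observation; higher values are farther from the centroid
# of the multidimensional distribution.
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
print(d2)  # the last row stands out as a multivariate outlier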
Normality
• The most fundamental assumption in multivariate analysis is normality,
referring to the shape of the data distribution for an individual metric
variable and its correspondence to the normal distribution.
• Normal distribution: Purely theoretical continuous probability distribution in
which the horizontal axis represents all possible values of a variable and the
vertical axis represents the probability of those values occurring. The scores
on the variable are clustered around the mean in a symmetrical, unimodal pattern
known as the bell-shaped, or normal, curve. It is also called the Gaussian
distribution.
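A minimal sketch of checking normality in Python, on a hypothetical metric variable; the Shapiro-Wilk test is one common choice:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=200)  # hypothetical metric variable

# Shapiro-Wilk: a small p-value suggests departure from normality.
stat, p = stats.shapiro(x)
print(f"W = {stat:.3f}, p = {p:.3f}")

# Skewness and excess kurtosis near 0 are consistent with normality.
print("skew:", stats.skew(x), "kurtosis:", stats.kurtosis(x))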
Slicing and Dicing of Data
• In many marketing situations, a researcher needs to slice and dice the
data.
• Software such as Excel (through pivot tables) and SPSS (through
cross-tabulation) enables the researcher to quickly summarize and describe
the data in many different ways, as in the sketch below.
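A minimal pivot-table sketch in Python with pandas, using a subset of the sales-volume data shown earlier, held in long format:

import pandas as pd

sales = pd.DataFrame({
    "Region":   ["East", "East", "West", "West", "North", "South"],
    "Salesman": ["S1", "S2", "S1", "S2", "S1", "S1"],
    "Volume":   [24, 30, 22, 32, 23, 32],
})

# Slice and dice: summarize volume by region and salesman.
pivot = sales.pivot_table(values="Volume", index="Region",
                          columns="Salesman", aggfunc="sum")
print(pivot)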
Data Visualisation
• Data visualization is the process of translating large data sets and metrics
into charts, graphs and other visuals.
• The resulting visual representation of data makes it easier to identify and
share real-time trends, outliers, and new insights about the information
represented in the data.
• In the world of Big Data, data visualization tools and technologies are
essential to analyze massive amounts of information and make data-driven
decisions.
Common general types of data
visualization
• Charts
• Tables
• Graphs
• Maps
• Infographics
• Dashboards
Descriptive Statistics
• Descriptive statistics is the process of describing data and trying to reach a
conclusion based on it.
• Descriptive statistics includes two concepts: measures of central tendency
and measures of dispersion.
Measures of Central Tendency
1. Mathematical averages
(a) Arithmetic mean or mean
▪ Simple
▪ Weighted
(b) Geometric mean
(c) Harmonic mean
2. Positional averages
(a) Median
(b) Mode
(c) Quartiles
(d) Deciles
(e) Percentiles
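A minimal sketch computing these measures in Python, on a hypothetical series (SciPy 1.9+ assumed for the keepdims argument; the weights are illustrative only):

import numpy as np
from scipy import stats

x = np.array([12, 15, 15, 18, 20, 22, 22, 22, 25, 30])

print("mean:", np.mean(x))
print("weighted mean:", np.average(x, weights=np.arange(1, 11)))
print("geometric mean:", stats.gmean(x))
print("harmonic mean:", stats.hmean(x))
print("median:", np.median(x))
print("mode:", stats.mode(x, keepdims=False).mode)
print("quartiles:", np.percentile(x, [25, 50, 75]))
print("9th decile (= 90th percentile):", np.percentile(x, 90))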
Arithmetic Mean
▪ The arithmetic mean (AM) of a set of observations is their sum, divided by the
number of observations.
▪ It is generally denoted by x̄ (x-bar) or AM. The population mean is denoted by μ.
▪ Arithmetic mean is of two types:
Simple arithmetic mean
Weighted arithmetic mean
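The standard formulas, for observations x1, x2, …, xn and weights w1, w2, …, wn, are:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \qquad \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$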
• Computation of Arithmetic Mean for Discrete Frequency Distribution
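If value $x_i$ occurs with frequency $f_i$ (for $i = 1, \ldots, k$ distinct values), the standard formula is:

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}$$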
Geometric Mean
▪ The geometric mean of n observations is the nth root of their product.
▪ If there are three items 4, 6, and 9, then their geometric mean, which is
generally denoted by G, can be computed as:
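$$G = \sqrt[3]{4 \times 6 \times 9} = \sqrt[3]{216} = 6$$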
Computation of Geometric Mean for
Individual Series
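For an individual series $x_1, x_2, \ldots, x_n$, the standard formula is:

$$G = (x_1 x_2 \cdots x_n)^{1/n}$$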
Harmonic Mean
• The harmonic mean of a series is the reciprocal of the arithmetic mean
of the reciprocals of the variate values; that is, the harmonic mean is by
definition given by:
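$$HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$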
Computation of Harmonic Mean for
Individual Series
Relationship between Arithmetic mean (AM),
Geometric Mean (GM) and Harmonic Mean (HM)
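For any series of positive observations, the standard relationships are:

$$AM \ge GM \ge HM$$

with equality only when all observations are equal; for two observations, $GM^2 = AM \times HM$.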
Positional Averages
▪ Arithmetic mean, geometric mean, and harmonic mean are all mathematical
in nature and are measures of quantitative characteristics of data.
▪ To measure the qualitative characteristics of data, other measures of central
tendency, namely median and mode are used.
▪ Positional averages, as the name indicates, mainly focus on the position of
the value of an observation in the data set.
Median
▪ The median may be defined as the middle or central value of the variable
when values are arranged in the order of magnitude.
▪ In other words, median is defined as that value of the variable that divides
the group into two equal parts, one part comprising all values greater and the
other all values lesser than the median.
Computation of Median for the Individual
Series
• In this type of distribution, data can be arranged in ascending or descending
order. If there are n terms (observations) in the data, there can be two cases:
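Case 1: If n is odd, the median is the value of the $\left(\frac{n+1}{2}\right)$th observation.
Case 2: If n is even, the median is the arithmetic mean of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th observations.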
Mode
▪ Mode is the variate having the maximum frequency in a data series.
▪ In the case of an individual series, data is arranged in order and mode can be determined by
inspection only.
▪ The value of the variable (in data series) which occurs the most or the value of the data
series with maximum frequency is the mode of the data series.
▪ For example, in the series 1, 1, 3, 3, 3, 3, 4, 5, 8, 8, 16, 16 (arranged in the order of
magnitude), observation 3 has the maximum frequency 4. Therefore, the mode of the series is 3.
Empirical Relationship between Mean, Median
and Mode
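For a moderately skewed distribution, the standard empirical relationship is:

$$\text{Mode} \approx 3\,\text{Median} - 2\,\text{Mean}$$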
Partition Values: Quartiles, Deciles, and
Percentiles
▪ Partition values are measures that divide the data into several equal parts.
Quartiles divide data into 4 equal parts, deciles divide data into 10 equal
parts, and percentiles divide data into 100 equal parts.
▪ For an individual series, the first and third quartiles can be computed using
the following formula:
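$$Q_1 = \text{value of the } \left(\tfrac{n+1}{4}\right)\text{th observation}, \qquad Q_3 = \text{value of the } \left(\tfrac{3(n+1)}{4}\right)\text{th observation}$$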
• In a data series, when the observations are arranged in an ordered sequence,
deciles divide the data into 10 equal parts. In the case of individual series
and discrete frequency distribution, the generalized formula for computing
deciles is given as:
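$$D_k = \text{value of the } \left(\tfrac{k(n+1)}{10}\right)\text{th observation}, \qquad k = 1, 2, \ldots, 9$$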
• In a data series, when observations are arranged in an ordered sequence,
percentiles divide the data into 100 equal parts. For an individual series and
a discrete frequency distribution, the generalized formula for computing
percentiles is given as:
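$$P_k = \text{value of the } \left(\tfrac{k(n+1)}{100}\right)\text{th observation}, \qquad k = 1, 2, \ldots, 99$$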
Measures of Dispersion
▪ The meaning of dispersion is “scatteredness.” The degree to which
numerical data tends to spread around an average value is called variation or
dispersion of data.
Types of Measures of Dispersion
▪ There are two types of measures of dispersion:
1. Absolute measures of dispersion: Absolute measures of dispersion are
presented in the same unit as the unit of distribution.
2. Relative measures of dispersion: Relative measures of dispersion are
useful in comparing two sets of data which have different units of
measurement.
▪ Relative measures of dispersion are pure unitless numbers and are generally
called coefficient of dispersion.
Methods of Measuring Dispersion
The following are some of the important and widely used methods
of measuring dispersion:
▪ Range
▪ Interquartile range and quartile deviation
▪ Average absolute deviation
▪ Standard deviation
• Range
▪ Range is defined as the difference between the smallest and the greatest
values in a distribution.
▪ Range is an absolute measure of dispersion. The relative measure of
dispersion for range is called the coefficient of range and is calculated by the
following formula:
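$$\text{Coefficient of range} = \frac{L - S}{L + S}$$

where L is the largest and S is the smallest value in the distribution.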
• Interquartile range and quartile deviation
▪ Interquartile range is the difference between the third quartile and the first quartile.
▪ Quartile deviation or semi-interquartile range can be obtained by dividing the
interquartile range by 2.
▪ Quartile deviation is an absolute measure of dispersion. Relative measure is called
the coefficient of quartile deviation. Coefficient of quartile deviation can be used to
measure the degree of variation in two different distributions when both have
different units of measurement.
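The standard formulas are:

$$QD = \frac{Q_3 - Q_1}{2}, \qquad \text{Coefficient of } QD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$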
• Average absolute deviation
Average absolute deviation is the average amount of scatter of the items in a
distribution, from either the mean or the median or the mode, ignoring the
signs of deviations.
• Average absolute deviation is an absolute measure of dispersion.
In this context, a relative measure, also known as coefficient of
average absolute deviation, is obtained by the following formula:
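$$\text{Coefficient of average absolute deviation} = \frac{\text{average absolute deviation}}{\text{mean (or median) about which it was computed}}$$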
Standard Deviation and Variance
• Standard deviation is the square root of the sum of squared deviations of the
values from their arithmetic mean, divided by the sample size minus one.
• Variance is the square of the standard deviation. Sample variance is the sum of
squared deviations of the values from their arithmetic mean, divided by the
sample size minus one.
• For the population standard deviation, N replaces n − 1 in the formula for the
sample standard deviation, with deviations taken from the population mean μ:
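$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}, \qquad \sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$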
Coefficient of Variation
▪ To compare the dispersion of two distributions, the relative measure of
standard deviation is used and is referred to as the coefficient of variation.
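$$CV = \frac{s}{\bar{x}} \times 100$$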
▪ A distribution with lesser CV shows greater consistency, homogeneity, and
uniformity, whereas a distribution with greater CV is considered more
variable than others.
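A minimal sketch computing the dispersion measures of this unit in Python, on a hypothetical series:

import numpy as np

x = np.array([12, 15, 15, 18, 20, 22, 22, 22, 25, 30])

data_range = x.max() - x.min()
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                          # interquartile range
qd = iqr / 2                           # quartile deviation
aad = np.mean(np.abs(x - x.mean()))    # average absolute deviation about the mean
s = np.std(x, ddof=1)                  # sample standard deviation (n - 1 in denominator)
cv = s / x.mean() * 100                # coefficient of variation, in percent

print(data_range, iqr, qd, aad, round(s, 2), round(cv, 2))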
Measures of Association
▪ Measures of association are statistics for measuring the strength of relationship
between two variables.
▪ Correlation measures the degree of association between two variables.
▪ Karl Pearson’s coefficient of correlation is a quantitative measure of the degree
of relationship between two variables. Suppose these variables are x and y, then
Karl Pearson’s coefficient of correlation is calculated as:
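$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$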
• The coefficient of correlation lies between –1 and +1.
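A minimal sketch in Python with hypothetical data, computing r both with SciPy and from the definitional formula:

import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 4, 5, 4, 6], dtype=float)

r, p = stats.pearsonr(x, y)            # Karl Pearson's coefficient of correlation
print(f"r = {r:.3f}, p = {p:.3f}")

# Equivalent computation from the definitional formula:
r_manual = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))
print(f"r (manual) = {r_manual:.3f}")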
Empirical rule
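• For a normal distribution, approximately 68% of observations fall within one
standard deviation of the mean, about 95% within two standard deviations, and
about 99.7% within three standard deviations.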