Descriptive Analytics
Cross-sectional, Time Series, and Panel Data
1. Cross-Sectional Data: Data collected on many variables of interest at the same time or over the same
duration of time is called cross-sectional data. For example, consider data on movies released during
the year 2017, such as budget, box-office collection, actors, directors, and genre.
2. Time Series Data: Data collected for a single variable, such as demand for smartphones,
over several time intervals (weekly, monthly, etc.) is called time series data.
3. Panel Data: Data collected on several variables (multiple dimensions) over several time intervals
is called panel data (also known as longitudinal data). An example of panel data is data collected on
variables such as gross domestic product (GDP), Gini index, and unemployment rate for several
countries over several years.
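The relationship between the three data types can be sketched in a few lines of Python. This is only an illustration: the country labels, the variable names, the helper names `cross_section` and `time_series`, and every number below are made up. A panel is stored as records keyed by (entity, time); a cross-section is a slice at one time, and a time series is one variable for one entity across time.

```python
# Hypothetical panel: (country, year) -> {variable: value}. All numbers are invented.
panel = {
    ("A", 2015): {"gdp": 1.0, "unemployment": 5.0},
    ("A", 2016): {"gdp": 1.1, "unemployment": 4.8},
    ("B", 2015): {"gdp": 2.0, "unemployment": 7.0},
    ("B", 2016): {"gdp": 2.1, "unemployment": 6.5},
}


def cross_section(panel, year):
    """All entities and all variables at a single point in time."""
    return {entity: values for (entity, t), values in panel.items() if t == year}


def time_series(panel, entity, variable):
    """One variable for one entity, ordered by time."""
    return [
        (t, values[variable])
        for (e, t), values in sorted(panel.items())
        if e == entity
    ]


snapshot_2015 = cross_section(panel, 2015)   # cross-sectional data
gdp_of_a = time_series(panel, "A", "gdp")    # time series data
```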
TYPES OF DATA MEASUREMENT SCALES
Nominal Scale (Qualitative Data)
Nominal scale refers to variables that are basically names (qualitative data); they are also known as categorical variables. For example,
variables such as marital status (single, married, divorced) and industry type (manufacturing, healthcare, banking and finance) fall
under the nominal scale.
Ordinal Scale
An ordinal scale variable takes its value from an ordered set, so the values can be ranked in order of magnitude.
For example, many surveys use a Likert scale, where the response options are ordered.
Interval Scale
Interval scale corresponds to a variable whose value is chosen from an interval set. Variables such as temperature measured in
centigrade (°C) and intelligence quotient (IQ) score are examples of the interval scale.
Ratio Scale
Any variable for which ratios can be computed and are meaningful is called a ratio scale variable.
POPULATION AND SAMPLE
• Population (also known as the universal set) is the set of all possible data for a given
context, whereas a sample is a subset taken from the population.
• In many analytical problems, we make inferences about the population based on the
sample data. There are many challenges in sampling (the process of selecting
observations from the population).
• An incorrect sample may result in bias and incorrect inferences about the population.
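The effect of an incorrect sample can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is synthetic (drawn from an exponential distribution chosen only for illustration), and the "biased" sample is deliberately built from the largest values to make the distortion obvious.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 values from a right-skewed distribution.
population = [rng.expovariate(1 / 50) for _ in range(10_000)]

# Simple random sample: every observation is equally likely to be selected.
random_sample = rng.sample(population, 200)

# Deliberately poor design: keep only the 200 largest observations.
biased_sample = sorted(population)[-200:]

pop_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(f"population mean:    {pop_mean:.1f}")
print(f"random sample mean: {random_mean:.1f}")
print(f"biased sample mean: {biased_mean:.1f}")
```

The random sample mean should land close to the population mean, while the biased sample overstates it by a wide margin.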
MEASURES OF CENTRAL TENDENCY
Mean (or Average) Value
• Mean is the arithmetic average of the data and is one of the most frequently used measures of central tendency. Associated with the mean is a
phenomenon often called the “wisdom of crowds”, according to which the collective wisdom of people is better than any individual person’s knowledge.
• Making decisions solely based on the mean value is not advisable. In capital asset procurement, such as procurement of fighter aircraft and weapons, defense
services across the world use mean time between failures (MTBF) as one of the measures of system reliability (performance).
Median
• Median is the value that divides the data into two equal parts; that is, 50% of the observations lie below the median and 50% lie above it.
Mode
• Mode is the most frequently occurring value in the data set.
Percentile, Decile, and Quartile
• Deciles are special percentile values that divide the data into 10 equal parts. The first decile is the value below which the first 10% of the data lie, the
second decile is the value below which the first 20% of the data lie, and so on.
• Quartiles divide the data into 4 equal parts. The first quartile (Q1) is the value below which the first 25% of the data lie; Q2 covers the first 50% of the data and is also the median.
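The measures above can be computed with Python's standard `statistics` module. The sketch below uses a small made-up data set containing one extreme value. Note that `statistics.quantiles` uses its own default interpolation rule, so cut points may differ slightly from a hand calculation that follows another convention.

```python
import statistics

# Hypothetical data with one extreme value (40).
data = [2, 4, 4, 5, 7, 9, 10, 12, 15, 40]

mean = statistics.mean(data)      # arithmetic average
median = statistics.median(data)  # middle value of the sorted data
mode = statistics.mode(data)      # most frequently occurring value

quartiles = statistics.quantiles(data, n=4)   # three cut points: Q1, Q2, Q3
deciles = statistics.quantiles(data, n=10)    # nine cut points: D1 ... D9

print("mean:", mean, "median:", median, "mode:", mode)
print("quartiles:", quartiles)
print("deciles:", deciles)
```

Here the mean is pulled above the median by the single extreme value, which is one reason not to rely on the mean alone.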
Problem
Time between failures (in hours) of a wire cutter used in a cookie manufacturing oven is given in Table
2.4. The function of the wire-cut is to cut the dough into cookies of desired size.
(a) Calculate the mean, median, and mode of time between failures of wire-cuts.
(b) The company would like to know by what time 10% (10th percentile or P10) and 90% (90th percentile
or P90) of the wire-cuts will fail.
(c) Calculate the values of P25 and P75.
Solution
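The values of Table 2.4 are not reproduced here, so the following is only a sketch of how parts (b) and (c) could be computed, using made-up failure times. It assumes one common convention for percentiles: the position of the p-th percentile in the sorted data is p(n + 1)/100, with linear interpolation between neighbouring observations. The function name `percentile` and the list `tbf_hours` are illustrative.

```python
def percentile(values, p):
    """Value below which roughly p percent of the observations fall.

    Assumed convention: position = p * (n + 1) / 100 in the sorted data
    (1-based), with linear interpolation between neighbouring observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    position = p * (n + 1) / 100
    lower = int(position)        # 1-based index at or just below the position
    fraction = position - lower  # how far to move toward the next observation
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


# Hypothetical times between failures in hours (NOT the values from Table 2.4).
tbf_hours = [12, 18, 25, 31, 40, 47, 55, 62, 78, 90, 104]

p10, p25, p75, p90 = (percentile(tbf_hours, p) for p in (10, 25, 75, 90))
print(f"P10={p10:.1f}  P25={p25:.1f}  P75={p75:.1f}  P90={p90:.1f}")
```

With the actual table values substituted for `tbf_hours`, P10 would be read as the time by which about 10% of the wire-cuts have failed, and P90 as the time by which about 90% have failed.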
MEASURES OF VARIATION
1. Range - Range is the difference between maximum and minimum value of the data. It captures the data spread. In
2. Inter-Quartile Distance (IQD) - Inter-quartile distance (IQD), also called inter-quartile range (IQR), is a measure of
3. Variance - Variance is a measure of variability in the data from the mean value. Variance for a population, σ², is
calculated using σ² = Σ (Xᵢ − μ)² / n, where μ is the population mean and n is the number of observations.
4. Standard Deviation
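The four measures of variation can be computed as follows. This is a minimal sketch on made-up data. The population variance divides by n, while the sample variance (shown for comparison) divides by n − 1. The quartiles come from `statistics.quantiles`, whose interpolation rule may differ from other conventions.

```python
import math
import statistics

data = [4, 8, 15, 16, 23, 42]  # hypothetical observations

# 1. Range: difference between the maximum and minimum values.
data_range = max(data) - min(data)

# 2. Inter-quartile distance: distance between the third and first quartiles.
q1, _, q3 = statistics.quantiles(data, n=4)
iqd = q3 - q1

# 3. Variance: average squared deviation from the mean.
mu = statistics.mean(data)
squared_deviations = sum((x - mu) ** 2 for x in data)
population_variance = squared_deviations / len(data)
sample_variance = squared_deviations / (len(data) - 1)

# 4. Standard deviation: square root of the variance.
population_std = math.sqrt(population_variance)

print(data_range, iqd, population_variance, sample_variance, population_std)
```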
MEASURES OF SHAPE - SKEWNESS AND KURTOSIS
• Skewness is a measure of symmetry or lack of symmetry. A data set is symmetrical when the proportion of data
at equal distance (measured in terms of standard deviation) on either side of the mean (or median) is equal.
• Kurtosis
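Both shape measures can be computed from central moments. The sketch below assumes the plain moment-based (population) definitions: skewness is the third central moment divided by the variance raised to the power 1.5, and kurtosis is the fourth central moment divided by the squared variance (about 3 for a normal distribution). Other conventions exist, such as sample-adjusted estimators and excess kurtosis, and may give different numbers.

```python
def central_moment(values, k):
    """k-th central moment: average of (x - mean) ** k."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** k for x in values) / n


def skewness(values):
    """Moment-based skewness: zero for perfectly symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def kurtosis(values):
    """Moment-based kurtosis (not excess kurtosis)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2


symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 1, 2, 2, 3, 10]   # hypothetical data with a long right tail

print("skewness (symmetric):   ", skewness(symmetric))
print("skewness (right-skewed):", skewness(right_skewed))
print("kurtosis (symmetric):   ", kurtosis(symmetric))
```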
DATA VISUALIZATION
Histogram
A histogram is a visual representation of the data that can be used to assess the probability distribution
(frequency distribution) of the data.
A histogram is very useful since it helps data scientists identify the following:
1. The shape of the distribution, which helps in assessing the probability distribution of the data.
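What a histogram draws is a frequency count per interval. The sketch below computes those counts for equal-width bins and prints a rough text histogram. The function name `histogram_counts`, the data, and the choice of four bins are illustrative assumptions; a plotting library would normally do this step internally.

```python
def histogram_counts(values, bins):
    """Count observations in `bins` equal-width intervals spanning the data."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: everything falls in the first bin.
        return [len(values)] + [0] * (bins - 1)
    width = (high - low) / bins
    counts = [0] * bins
    for x in values:
        # The maximum value would land just past the last bin, so clamp it.
        index = min(int((x - low) / width), bins - 1)
        counts[index] += 1
    return counts


data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # hypothetical observations
counts = histogram_counts(data, bins=4)

for i, count in enumerate(counts):
    print(f"bin {i}: {'#' * count}")
```

The long, thin tail of bars toward the higher bins is the visual signature of a right-skewed distribution.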
Bar Chart
Pie Chart
Scatter Plot
A scatter plot is a plot of two variables that helps data scientists
understand whether there is any relationship between the two variables. The relationship
could be linear or non-linear.
Box Plot
A box plot is used to understand the variability of the data and the existence of outliers. A box plot is
designed by identifying the following descriptive statistics:
1. Lower quartile (1st quartile), median, and upper quartile (3rd quartile).
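The statistics behind a box plot can be sketched as below. The quartiles follow the text; the whisker and outlier rule (1.5 times the inter-quartile range beyond the quartiles) is a widely used convention that is assumed here, not taken from the text above. The function name `box_plot_summary` and the data are illustrative.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, whisker ends, and outliers using the assumed 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),    # smallest observation inside the fences
        "whisker_high": max(inside),   # largest observation inside the fences
        "outliers": [x for x in values if x < lower_fence or x > upper_fence],
    }


# Hypothetical data with one extreme value.
summary = box_plot_summary([2, 4, 4, 5, 7, 9, 10, 12, 15, 40])
print(summary)
```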
The process of identifying a subset from a population of elements (also known as observations or cases) is called
the sampling process, or simply sampling.
Identification of target population that is important for a given problem under study.
Sampling method