L1-D3 Concepts of Data Analysis
www.infocepts.com
Data analysis
“Data analysis is a process of inspecting, cleansing, transforming, and modeling data
with the goal of discovering useful information, informing conclusions,
and supporting decision-making” -Wikipedia
Attributes: Quantitative vs. Qualitative

Usage
  Quantitative: Helpful in "answering questions of who, where, how many, how much, and what is the relationship between specific variables".
  Qualitative: Provides naturally occurring information and assists in answering why and how questions.

Type of Data
  Quantitative: Hard data are collected, in the form of numbers, counts and other statistical formulae.
  Qualitative: Soft data are collected, in the form of words (texts, images, artefacts, narratives) and everything else.

Process
  Quantitative: Conventions for data analysis are clear and formulated, and the process is predictable.
  Qualitative: Methods of data analysis are not clearly formulated and the process is not predetermined.
  Quantitative: Data analysis is usually done at the end, in a linear fashion, when all data has been collected.
  Qualitative: Data is analyzed as it is collected, because data collection and analysis are interactive and occur in overlapping cycles.
  Quantitative: Not flexible; it is usually difficult to follow up on promising hunches.
  Qualitative: Flexible; allows adjustments during data collection through supplementary questions to gather additional data.

Approach
  Quantitative: Standardized data is collected by measuring either qualitative or quantitative variables.
  Qualitative: Huge amounts of data need to be summarized and interpreted.
  Quantitative: The analyst seeks to verify or test a theory, and the approach tends to be confirmatory.
  Qualitative: The analyst lets the data and its interpretation guide the analysis without any assumption, and the approach tends to be exploratory.

Focus
  Quantitative: Relationships between independent and dependent variables are of major concern (tends to be variable-centric).
  Qualitative: Focus on the meaning of events and actions as expressed by the participants (case-centric).

Tools & Techniques
  Quantitative: Statistical and probability techniques, mostly driven by mathematical and numerical methods.
  Qualitative: Non-mathematical or non-numerical methods such as content analysis, grounded theory, conversation analysis, etc.
Visual analysis of data - Scatter Plot

Day           1   2   3   4   5   6   7   8   9  10
Order Count  33  26  24  21  19  20  18  18  52  56

Day          11  12  13  14  15  16  17  18  19  20
Order Count  27  22  18  49  22  20  23  32  20  18

Day          21  22  23  24  25  26  27  28  29  30
Order Count  19  20  30  22  23  32  25  53  57  50
Visual analysis helps in identifying repeated patterns, high-level linear trends, and outliers in data.
Analysts often use a scatter plot as the first visualization to see patterns in the data.
Scatter plots are very helpful for visually understanding the relationship between two quantitative variables.
They make it easy to identify clusters, patterns, and outliers for further analysis.
[Figure: scatter plot of order count by day, with an outlier and a cluster marked]
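A minimal sketch of how the scatter plot for the day and order-count table could be drawn. It assumes matplotlib is available; the names `ORDER_COUNTS`, `POINTS` and `plot_orders` are illustrative and not part of the original material.

```python
# Daily order counts for the 30 days listed in the Day / Order Count table.
ORDER_COUNTS = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
                27, 22, 18, 49, 22, 20, 23, 32, 20, 18,
                19, 20, 30, 22, 23, 32, 25, 53, 57, 50]

# Pair each count with its 1-based day number to get the scatter points.
POINTS = list(enumerate(ORDER_COUNTS, start=1))


def plot_orders(points):
    """Draw order count against day of month as a scatter plot."""
    # Imported here so the data above can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    days = [day for day, _ in points]
    counts = [count for _, count in points]
    plt.scatter(days, counts)
    plt.xlabel("Day")
    plt.ylabel("Order count")
    plt.title("Orders per day")
    plt.show()
```

Calling `plot_orders(POINTS)` opens the chart. The handful of days with counts near or above 50 should stand apart from the main band of points between roughly 18 and 33.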
Introduction to Statistics
Descriptive Statistics
Population and Samples
Descriptive Statistics
Descriptive statistics refers to statistical methods of describing and summarizing data using tabular, visual, and quantitative techniques.
It provides a summary of numerical statistical measures that describe the location, dispersion, and shape of sample data.
A variable is a single characteristic of the data.
Descriptive statistics provide the following information about data:
Central tendency is an average value of a distribution of data that best represents its middle. It is also called centrality.
Median is the value at the centre of the ordered values of a variable in a sample.
Mode is the most repeated value in the sample. There can be more than one mode.
Dispersion or variability describes the spread or scattering of data around its central location.
Interquartile range is the difference between the third and first quartiles (Q3 - Q1).
Variance is the average of the squared deviations from the mean.
Measures of Central Tendency

Mean
Population mean: $\mu = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$
Sample mean: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
N = population size
n = sample size

Median
When the count of values (n or N) is even, the median is computed as the average of the two middle values.
The median is useful when data are highly skewed.

Mode
The mode is used for numerical or categorical data where the mean or median is not useful or has no meaning.
There may be no mode or several modes in a data set, so it is seldom used unless absolutely required.
Example: knowing the mean screen size of the television sets sold will not help a seller, because there is no television with a 32.53-inch screen. Knowing that the mode is 30 inches tells the seller which screen size sells most.
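The contrast between mean, median and mode in the television example can be sketched with Python's standard `statistics` module. The screen sizes below are hypothetical values chosen for illustration; they are not the data behind the 32.53-inch figure.

```python
from statistics import mean, median, multimode

# Hypothetical screen sizes (inches) of television sets sold; not from the source.
screen_sizes = [24, 30, 30, 30, 32, 40, 43]

average_size = mean(screen_sizes)    # may be a size that no television actually has
middle_size = median(screen_sizes)   # middle value of the ordered sizes
most_sold = multimode(screen_sizes)  # list of every most frequent value

print(round(average_size, 2), middle_size, most_sold)

# With an even count of values the median is the average of the two middle values.
print(median([18, 20, 22, 24]))
```

`multimode` returns a list because a data set can have several modes, which matches the caution above about relying on the mode.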
Sample Measure of Central Tendency
Illustration I
Mean = sum of the 30 daily order counts / 30 = 869 / 30 ≈ 29
The average order receipt per day for September 2018 is 29 orders/day.
Median for an even sample size = (15th value + 16th value) / 2 = (23 + 23) / 2 = 23
Minimum value = 18
Share of the month at or near the minimum (days with 18 or 19 orders) = 6/30 = 0.2
We can say that about 20% of September had minimum or near-minimum business.
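The figures in Illustration I can be rechecked against the day and order-count table with a short script. This is only a sketch: the variable names are illustrative, and the cut-off of 19 orders for "near minimum" is an assumption made here to reproduce the 6/30 share.

```python
from statistics import mean, median

# Daily order counts for the 30 days in the Day / Order Count table.
ORDER_COUNTS = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
                27, 22, 18, 49, 22, 20, 23, 32, 20, 18,
                19, 20, 30, 22, 23, 32, 25, 53, 57, 50]

n = len(ORDER_COUNTS)
ordered = sorted(ORDER_COUNTS)

sample_mean = mean(ORDER_COUNTS)                       # 869 / 30
middle_pair = (ordered[n // 2 - 1], ordered[n // 2])   # 15th and 16th ordered values
sample_median = median(ORDER_COUNTS)                   # average of the middle pair
lowest = min(ORDER_COUNTS)

# Assumption: "near minimum" means 19 orders or fewer.
low_days = sum(1 for count in ORDER_COUNTS if count <= 19)

print(round(sample_mean), middle_pair, sample_median, lowest, low_days / n)
```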
It is calculated as
Percentiles
For any number r between 0 and 100, the r-th percentile is a value such that r percent of the observations in the data set fall at or below that value. For example, the 95th percentile is the value that 95% of the data fall at or below.
The most common way to compute the r-th percentile is to:
1. Order the data values from smallest to largest.
2. Calculate the rank I of the r-th percentile using the formula I = n * r / 100, where n is the number of values.
3. Round I to the nearest integer.
4. Locate the value at position I from the smallest end.
5. That value is the r-th percentile.

Quartiles
The 25th, 50th and 75th percentiles are called quartiles.
Quartiles divide the data set into 4 equal parts, each holding 25% of the records.
The 25th and 75th percentiles are called the first and third quartiles respectively.
The 0th quartile is the minimum value and the 4th quartile is the maximum value.
The 50th percentile, or 2nd quartile, is the median.
The difference between the third and first quartiles is called the interquartile range (IQR). It contains the middle 50% of the data, hence it is not sensitive to outliers.
The IQR is used to find outliers, defined as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.
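The five steps for computing a percentile can be sketched as a small function. Two details are assumptions made here rather than part of the procedure: ranks ending in .5 are rounded up, and the position is clamped to the range 1..n so that r = 0 returns the minimum. The sample values are hypothetical.

```python
def percentile(values, r):
    """Return the r-th percentile of values using the rank I = n * r / 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= r <= 100:
        raise ValueError("r must be between 0 and 100")
    ordered = sorted(values)              # step 1: smallest to largest
    n = len(ordered)
    rank = n * r / 100                    # step 2: rank I of the r-th percentile
    position = int(rank + 0.5)            # step 3: nearest integer, halves round up (assumption)
    position = min(max(position, 1), n)   # keep the position within 1..n (assumption)
    return ordered[position - 1]          # steps 4-5: value at position I from the smallest end


# Hypothetical sample, not from the source.
scores = [50, 15, 40, 20, 35]
print(percentile(scores, 40))  # rank 5 * 40 / 100 = 2, so the second smallest value
```

Statistical packages often interpolate between neighbouring values instead of rounding the rank, so their quartiles can differ slightly from this rank-based result.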
Sample Measure of Dispersion - Range & Quartile
Illustration II
[Figure: order count by day of month]
1st quartile location = 30 * 25/100 = 30 * 0.25 = 7.5 ~ 8th location
Q1 = value at the 8th location = 20
Q3 = value at the 22nd location = 32
Interquartile range = IQR = 32 - 20 = 12
Outlier limits: Q1 - 1.5*IQR = 20 - 18 = 2 and Q3 + 1.5*IQR = 32 + 18 = 50
Any order count less than 2 or greater than 50 is a probable outlier and needs further detailed analysis.
On 50% of the days, the order receipt was between 20 and 32. So we can say that for half of the month, when business was above the minimum, order receipts were in the range of 20 to 32 per day.
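As a rough check of Illustration II, the sketch below takes Q1 and Q3 from the 8th and 22nd ordered values, as the illustration does, and lists the days that fall outside the outlier limits. With Q1 = 20 and Q3 = 32 the upper limit evaluates to 50. Variable names are illustrative.

```python
# Daily order counts for the 30 days in the Day / Order Count table.
ORDER_COUNTS = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
                27, 22, 18, 49, 22, 20, 23, 32, 20, 18,
                19, 20, 30, 22, 23, 32, 25, 53, 57, 50]

ordered = sorted(ORDER_COUNTS)

q1 = ordered[8 - 1]    # value at the 8th location
q3 = ordered[22 - 1]   # value at the 22nd location
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# (day, count) pairs whose order count falls outside the limits; days are 1-based.
outlier_days = [
    (day, count)
    for day, count in enumerate(ORDER_COUNTS, start=1)
    if count < lower_limit or count > upper_limit
]

print(q1, q3, iqr, lower_limit, upper_limit, outlier_days)
```

Counts of exactly 50 sit on the upper limit and are not flagged, since the rule only marks values strictly beyond the limits.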
Variance & Standard Deviation
The spread of a data value from the mean is called its deviation from the mean.
Deviation is calculated as x_i - μ for a population, or x_i - x̄ for a sample.
The sum of all deviations in a population or sample is zero, so the deviations cannot simply be averaged to give a single measure of spread.
Instead, each deviation is squared and the squares are averaged like a mean. This gives a single value for the population or sample called the variance, denoted by sigma squared (σ²) for a population and s² for a sample.
Variance is in squared units, unlike deviation. To describe the spread in the data's own units, the square root of the variance is defined as the standard deviation, denoted by sigma (σ) for a population and s for a sample.
If the data are normally distributed, the standard deviation is used to state the approximate percentage of values that lie within k standard deviations of the mean (μ ± kσ), where k is generally 1, 2 or 3.
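The formula images for this slide did not survive extraction. The conventional definitions are written out here as a reference, using the symbols from the text. The n - 1 divisor for the sample variance is the usual convention and is an assumption about what the original slide showed.

```latex
% Deviation of one observation from the population mean
d_i = x_i - \mu

% The deviations always sum to zero, so they cannot be averaged directly
\sum_{i=1}^{N} (x_i - \mu) = \sum_{i=1}^{N} x_i - N\mu = N\mu - N\mu = 0

% Population variance and standard deviation
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2, \qquad \sigma = \sqrt{\sigma^2}

% Sample variance and standard deviation (assumes the n - 1 divisor)
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}
```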
Sample Measure of Dispersion - Variance & Standard Deviation
Illustration III
μ = mean = 29
N = 30
σ = standard deviation ≈ 13, so μ ± σ spans roughly 16 to 42
On 80% of the days in the month, the order receipts per day were in the range from 16 to 42 orders.
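The 16 to 42 range in Illustration III can be reproduced from the day and order-count table with the standard `statistics` module. This sketch treats the 30 days as a population (`pstdev`); the sample version (`stdev`) is shown alongside for comparison. Variable names are illustrative.

```python
from statistics import mean, pstdev, stdev

# Daily order counts for the 30 days in the Day / Order Count table.
ORDER_COUNTS = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
                27, 22, 18, 49, 22, 20, 23, 32, 20, 18,
                19, 20, 30, 22, 23, 32, 25, 53, 57, 50]

mu = mean(ORDER_COUNTS)
sigma = pstdev(ORDER_COUNTS)   # population standard deviation (divides by N)
s = stdev(ORDER_COUNTS)        # sample standard deviation (divides by n - 1)

# Range covered by one standard deviation either side of the mean.
low, high = round(mu - sigma), round(mu + sigma)
inside = [count for count in ORDER_COUNTS if low <= count <= high]
share = len(inside) / len(ORDER_COUNTS)

print(round(mu), round(sigma, 1), round(s, 1), (low, high), share)
```

Both versions of the standard deviation round to 13 for this data, so the choice of divisor does not change the 16 to 42 range.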
Statistical Distribution
Characteristics of Normal Distribution
Experimental findings show that the normal distribution is the most commonly observed probability distribution of random variables, both in nature and in man-made things. It is therefore the most commonly used distribution in statistics.
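For a normal distribution, the share of values within k standard deviations of the mean (μ ± kσ, with k = 1, 2 or 3) can be computed with `statistics.NormalDist` from the Python standard library. The helper name `share_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Probability that a normal value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(share_within(k), 4))
```

The three results are approximately 68%, 95% and 99.7%. These shares hold only when the data really are close to normally distributed.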
Frequency Distribution
Frequency is the number of occurrences of a value, a range of values, or a category in a data set.
Categorical data are grouped into categories, and non-categorical numeric data can be grouped into ranges. Each category or range is called a class.
A frequency table shows the frequency of each class along with the limits of each class.
A frequency distribution curve helps analysts study how the data are distributed across the classes and verify hypotheses.
Skewness
Skewness is a measure of the degree of asymmetry of a frequency distribution.
The coefficient of skewness is 0 for a symmetric distribution. Quartile-based coefficients of skewness lie between -1 and 1.
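A frequency table for numeric data can be sketched by grouping values into equal-width classes. The class width of 10 is an assumption chosen for illustration, since no width is given here, and the function name is hypothetical.

```python
from collections import Counter

# Daily order counts for the 30 days in the Day / Order Count table.
ORDER_COUNTS = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56,
                27, 22, 18, 49, 22, 20, 23, 32, 20, 18,
                19, 20, 30, 22, 23, 32, 25, 53, 57, 50]

CLASS_WIDTH = 10  # assumed class width; not specified in the source


def frequency_table(values, width):
    """Return (lower limit, upper limit, frequency) for each non-empty class."""
    counts = Counter((value // width) * width for value in values)
    return [(low, low + width - 1, counts[low]) for low in sorted(counts)]


for low, high, freq in frequency_table(ORDER_COUNTS, CLASS_WIDTH):
    print(f"{low}-{high}: {freq}")
```

Classes with no values are omitted from the result. Most days fall in the 20-29 class, with a smaller second group in the 50-59 class, which points to a right-skewed distribution.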
Central Limit Theorem
When sampling from a population with mean μ and finite standard deviation σ, the sampling distribution of the sample mean will tend to a normal distribution with mean μ and standard deviation σ/√n as the sample size becomes large (n > 30).
[Figure: sampling distribution of the sample mean for n = 5, n = 20, and large n]
In case it is not possible to have 30 or more observations, the sample must be tested for normal distribution.
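The theorem can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is exponential with μ = 1 and σ = 1 (a deliberately skewed choice), the sample size is 30, and the function name, number of repeats and seed are arbitrary.

```python
import random
from statistics import mean, pstdev


def sample_means(sample_size, repeats, seed=0):
    """Means of `repeats` samples drawn from an exponential population (mu = 1, sigma = 1)."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
        for _ in range(repeats)
    ]


means = sample_means(sample_size=30, repeats=2000)

# Compare the simulated centre and spread with mu and sigma / sqrt(n).
print(round(mean(means), 3), round(pstdev(means), 3), round(1 / 30 ** 0.5, 3))
```

Although the population is far from normal, the sample means should cluster around 1 with a spread close to 1/√30 ≈ 0.183, as the theorem predicts.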
Training &
Development
Thank You !