L1-D3 Concepts of Data Analysis

Uploaded by

Simar

Concepts of Data Analysis

www.infocepts.com
Data analysis
“Data analysis is a process of inspecting, cleansing, transforming, and modeling data
with the goal of discovering useful information, informing conclusions,
and supporting decision-making” -Wikipedia
Attributes: Quantitative vs. Qualitative

Usage
  Quantitative: Helpful in “answering questions of who, where, how many, how much, and what is the relationship between specific variables”.
  Qualitative: Provides naturally occurring information and assists in answering why and how questions.

Type of Data
  Quantitative: Hard data are collected, in the form of numbers, counts and other statistical formulae.
  Qualitative: Soft data are collected, in the form of words (texts, images, artefacts, narratives) and everything else.

Process
  Quantitative: Conventions for data analysis are clear and formulated, and the process is predictable. Analysis is usually done at the end, in a linear fashion, once all data has been collected. The process is not flexible, and it is usually difficult to follow up on promising hunches.
  Qualitative: Methods of data analysis are not clearly formulated and the process is not predetermined. Data is analyzed as it is collected, because data collection and analysis are interactive and occur in overlapping cycles. The process is flexible and allows adjustments during data collection through supplementary questions to gather additional data.

Approach
  Quantitative: Standardized data is collected through measuring either qualitative or quantitative variables. The analyst seeks to verify or test a theory, and the approach tends to be confirmatory.
  Qualitative: Huge amounts of data need to be summarized and interpreted. The analyst lets the data and its interpretation guide the analysis without any assumption, and the approach tends to be exploratory.

Focus
  Quantitative: Relationships between independent and dependent variables are of major concern (variable-centric).
  Qualitative: Focus on the meaning of events and actions as expressed by the participants (case-centric).

Tools & Techniques
  Quantitative: Statistical and probability techniques, mostly driven by mathematical and numerical methods.
  Qualitative: Non-mathematical or non-numerical methods such as content analysis, grounded theory, conversation analysis, etc.

Visual analysis of data - Scatter Plot

Day:          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Order Count: 33 26 24 21 19 20 18 18 52 56 27 22 18 49 22 20 23 32 20 18 19 20 30 22 23 32 25 53 57 50

• Visual analysis helps in determining repeated patterns, high-level linear trends and outliers in data.
• Analysts use a scatter plot as the first visualization to see data patterns.
• Scatter plots are very helpful in visually understanding the relationship between two quantitative variables.
• They make it easy to identify clusters, patterns and outliers for further analysis.

Patterns and outliers in Scatter Plots

[Figure: example scatter plots showing a negative relationship, a positive relationship, a linear relationship, a non-linear relationship, an outlier, and a cluster]
Introduction to Statistics
Descriptive Statistics

Population and Samples

In statistical analysis, the full set of data to be analysed is termed the population.

The population size (N) may be known or unknown. E.g. for swine-flu infection analysis, the population can be everyone in the world infected by swine flu, which is difficult to quantify.

A sample is always a portion of the population, and its size (n) is always known.

• A population consists of the set of all entities for which the analysis is performed.
• A sample is a subset of measured entities selected from the population.
• When population data is unrealistic to collect, analysts use a random sample of data and infer results for the population (e.g. the spread of swine flu in the world).
• Sampling from the population is often done randomly, such that every possible sample of equal size (n) has an equal chance of being selected. Such a sample is called a Simple Random Sample.
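A minimal Python sketch of drawing a simple random sample. The population of 1,000 numbered entities, the seed, and the sample size of 30 are illustrative assumptions, not values from the slides.

```python
import random

# Hypothetical population: 1,000 numbered entities (not data from the slides).
population = list(range(1, 1001))

rng = random.Random(42)  # fixed seed so the draw is repeatable

# random.sample draws k distinct items without replacement, so every
# possible subset of size k is equally likely: a simple random sample.
sample = rng.sample(population, k=30)

population_mean = sum(population) / len(population)  # N is known here
sample_mean = sum(sample) / len(sample)              # n is always known

print(f"N = {len(population)}, n = {len(sample)}")
print(f"population mean = {population_mean:.1f}, sample mean = {sample_mean:.1f}")
```

The sample mean is used as an estimate of the population mean; a different seed gives a different sample and a slightly different estimate.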

Descriptive Statistics

• Descriptive statistics refers to statistical methods of describing and summarizing data using tabular, visual, and quantitative techniques.
• It provides a summary of numerical statistical measures that describe location, dispersion, and shape for sample data.
• A variable is a single characteristic of the data.

Descriptive statistics provide the following information about data:
• Central tendency is an average value of a distribution of data that best represents the middle. It is also called centrality.
• Dispersion or variability describes the spread or scattering of data from its central location.

Measures of Central Tendency
• Mean is the average of all values of a variable in a sample.
• Median is the value at the centre of the ordered values of a variable in a sample.
• Mode is the most repeated value in the sample. There can be more than one mode for a sample.

Measures of Dispersion
• Range is the difference between the maximum and minimum values.
• Interquartile range is the difference between the third and first quartiles (Q3 - Q1).
• Variance is the average of the squared deviations from the mean.
• Standard deviation is the square root of the variance.

Measures of Central Tendency

Mean
• The mean is the arithmetic average of the data values.
• Mean = sum of values divided by the count of values.
• The mean is affected by extreme values (outliers).

Population mean (N = population size):
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \]

Sample mean (n = sample size):
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n} \]

Mode
• The mode is the value that is repeated most often, i.e. the value with the highest frequency or count of records.
• The mode is not affected by extreme values (outliers).
• It can be used for either numerical or categorical data, including cases where the mean or median has no meaning.
• There may be no mode or several modes in a data set. Hence, it is seldom used unless absolutely required.

Median
• In an ordered (sorted) data set, the median is the “middle” value, i.e. the number that splits the data set in half. When the count of values (n or N) is odd, it is the value at location (n + 1)/2.
• When the count of values (n or N) is even, the median is computed as the average of the two middle values.
• The median is not affected by extreme values (outliers).
• It is useful when data are highly skewed.

Choosing Right Measure
• The mean is generally used, unless extreme values (outliers) exist.
• The median is often used when the data is highly skewed.
• The mode is used when the mean or median is not useful, or when the data is categorical.

For example, when a television retailer decides how many of each screen size to stock, the mean screen size of the television sets sold (32.53 inches) will not help, because there is no television with a 32.53-inch screen. Knowing that the mode is 30 inches tells the retailer which screen size is sold most.
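As a rough illustration of how an extreme value moves the three measures, the sketch below uses Python's standard `statistics` module. The screen sizes are hypothetical values made up for this example, not data from the slides.

```python
import statistics

# Hypothetical screen sizes (inches) of televisions sold; illustrative only.
sizes = [24, 30, 30, 30, 32, 40, 43]
sizes_with_outlier = sizes + [98]  # one unusually large set added

for label, data in [("without outlier", sizes), ("with outlier", sizes_with_outlier)]:
    mean = statistics.mean(data)
    median = statistics.median(data)    # averages the two middle values when n is even
    modes = statistics.multimode(data)  # list of all most-frequent values
    print(f"{label}: mean={mean:.2f}, median={median}, modes={modes}")
```

The single outlier pulls the mean up by several inches, while the median shifts only slightly and the mode does not change.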

Sample Measure of Central Tendency
Illustration I

What is the average order receipt rate for September 2018?

Mean = sum of the 30 daily order counts / 30 = 869 / 30 = 28.97 orders/day ~ 29 orders per day

Median for even sample size = (sum of 15th and 16th ordered values) / 2 = (23 + 23) / 2 = 23

Modes = 18 and 20 (each value is repeated on four days)

Average order receipt per day for September 2018 is 29 orders/day.

What percent of days was the minimum sale recorded?

Minimum value = 18 (recorded on four days)

Percent of month that had minimum business = 4/30 = 0.13

We can say that about 13% of September 2018 had minimum business.


Measures of Dispersion

Range
• Range is the span of values of a variable that appear in the data.
• It helps understand the boundaries of the spread of values of a variable in a sample.
• It is calculated as: Range = maximum value - minimum value
• As it depends on the maximum and minimum values, it is sensitive to outliers.

Percentile
For any particular number r between 0 and 100, the r-th percentile is a value such that r percent of the observations in the data set fall at or below that value. E.g. the 95th percentile is the value at or below which 95% of the data falls.

A common way to compute the r-th percentile is to:
1. Order the data values from smallest to largest.
2. Calculate the rank of the r-th percentile using the formula I = n * r / 100, where n is the number of values.
3. Round I to the nearest integer.
4. Locate the value at position I from the smallest end.
5. That value is the r-th percentile value.

Quartile & interquartile range
• The 25th, 50th and 75th percentiles are called quartiles.
• Quartiles divide the data set into 4 equal sets with 25% of the records each.
• The 25th and 75th percentiles are called the first and third quartiles respectively.
• The 0th quartile is the minimum value and the 4th quartile is the maximum value.
• The 50th percentile or 2nd quartile is called the median.
• The difference between the values at the first and third quartiles is called the interquartile range (IQR). It contains the middle 50% of the data. Hence it is not sensitive to outliers.
• IQR is used to find outliers, which are defined as values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR.

Sample Measure of Dispersion – Range & Quartile
Illustration II

[Figure: scatter plot of Order Count against Day of Month, with Q1, Q3 and the outliers marked]

Find the outliers in the data.

1st quartile location = 30 * 25/100 = 30 * 0.25 = 7.5 ~ 8th location
Q1 = value at 8th location = 20

3rd quartile location = 30 * 75/100 = 30 * 0.75 = 22.5 ~ 22nd location
Q3 = value at 22nd location = 32

Interquartile range = IQR = 32 - 20 = 12

Outlier limit on minimum side = 20 - (1.5*12) = 20 - 18 = 2
Outlier limit on maximum side = 32 + (1.5*12) = 32 + 18 = 50

So any order count less than 2 or greater than 50 is a probable outlier and needs further detailed analysis.

What is the range of receipt of orders per day on days when business is higher than minimum?

On 50% of the days, the order receipt was between 20 and 32. So we can say that for half of the month, when business was more than minimum, the order receipts per day were in the range from 20 to 32 orders.
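The quartile and outlier steps can be sketched in Python as follows, using the 30 daily order counts from the Day / Order Count table. The function name `percentile_by_rank` and the round-half-up rule for ranks such as 22.5 are assumptions made for this sketch; the slides only say to round to the nearest integer.

```python
import math

# Daily order counts for the 30 days, copied from the Day / Order Count table.
orders = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56, 27, 22, 18, 49, 22,
          20, 23, 32, 20, 18, 19, 20, 30, 22, 23, 32, 25, 53, 57, 50]

def percentile_by_rank(values, r):
    """Return the r-th percentile using the rank I = n * r / 100."""
    ordered = sorted(values)
    rank = len(ordered) * r / 100
    # Assumption: round half up, then clamp to a valid 1-based position.
    position = min(max(math.floor(rank + 0.5), 1), len(ordered))
    return ordered[position - 1]

q1 = percentile_by_rank(orders, 25)
q3 = percentile_by_rank(orders, 75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = sorted(v for v in orders if v < lower_limit or v > upper_limit)

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"limits: {lower_limit} to {upper_limit}; probable outliers: {outliers}")
```

Rounding 22.5 half up selects the 23rd position rather than the 22nd; in this data both positions hold the same value, so Q3 is unaffected.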
Variance & Standard Deviation

• The spread of data from its mean is called deviation from the mean.
• Deviation is calculated as: deviation = x_i - mean
• The sum of all deviations in a population or sample is zero. Hence the deviations cannot simply be averaged to give a single deviation for the entire sample or population.
• Hence, each deviation is squared and the squares are then averaged like the mean, which gives a single value for the population or sample called variance, denoted by sigma squared (σ²) for a population and s² for a sample.
• Variance is in squared units, unlike deviation. To describe data in its own units, the square root of the variance is defined as the standard deviation, denoted by sigma (σ) for a population and s for a sample.
• Standard deviation is used to state the approximate percentage of values that may lie within k standard deviations of the mean of a data set (μ ± kσ), if the data are normally distributed. Generally k is 1, 2 or 3.

Population variance and standard deviation:
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2} \]

Sample variance and standard deviation:
\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2} \]

Sample Measure of Dispersion – Variance & standard deviation
Illustration III

What is the range of receipt of orders per day for 80% of September 2018?

μ = mean = 29
N = 30
Variance = σ² = sum of squared deviations from the mean / N = 4839 / 30 = 161.3
Standard deviation = σ = √161.3 = 12.7 ~ 13
Range within one standard deviation of the mean = 29 ± 13 = 16 to 42

From our plot, around 80% of the values lie between 16 and 42. So for 80% of the month, the order receipts per day were in the range from 16 to 42 orders.
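The same calculation can be checked with Python's standard `statistics` module, as a sketch on the 30 daily order counts from the Day / Order Count table. Note that the code uses the unrounded mean, whereas the illustration rounds it to 29, so the figures differ slightly in the later decimals.

```python
import statistics

# Daily order counts for the 30 days, copied from the Day / Order Count table.
orders = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56, 27, 22, 18, 49, 22,
          20, 23, 32, 20, 18, 19, 20, 30, 22, 23, 32, 25, 53, 57, 50]

mean = statistics.mean(orders)
pop_var = statistics.pvariance(orders)  # population variance: divides by N
pop_sd = statistics.pstdev(orders)      # population standard deviation
sample_sd = statistics.stdev(orders)    # sample standard deviation: divides by n - 1

# Share of days whose order count lies within one standard deviation of the mean.
low, high = mean - pop_sd, mean + pop_sd
share_within = sum(low <= v <= high for v in orders) / len(orders)

print(f"mean={mean:.2f}, variance={pop_var:.1f}, sd={pop_sd:.1f}, sample sd={sample_sd:.1f}")
print(f"{share_within:.0%} of days fall between {low:.1f} and {high:.1f} orders")
```

Treating the month as a whole population (dividing by N) matches the illustration; `statistics.stdev` is shown only to contrast the n - 1 divisor used for samples.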


Statistical Distribution
Normal Distribution, Skewness, Kurtosis and
central limit theorem

Statistical Distribution

• A statistical distribution is a graphical depiction of frequency counts or probabilities for the various values of a variable that can occur.
• Distributions are important because most of the analyses done in business statistics are based on the characteristics of a particular distribution.
• In statistical experiments involving chance, outcomes occur randomly. Hence, the probability of occurrence of values is used to study the distribution.
• A random variable is a variable that contains the outcomes of a chance experiment.
• Experimental findings show that the most commonly seen distribution of random variable probabilities, in nature and in man-made things, is the normal distribution. Hence the normal distribution is the most commonly used distribution in statistics.

Characteristics of Normal Distribution

• The distribution is symmetric, so its measure of skewness is zero.
• The mean, median, and mode are all equal. Thus, half the area falls above the mean and half falls below it.
• The empirical rules apply exactly for the normal distribution:
  - Around 68.3% of observations fall within 1 standard deviation of the mean
  - Around 95.4% of observations fall within 2 standard deviations of the mean
  - Around 99.7% of observations fall within 3 standard deviations of the mean
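The empirical-rule percentages can be reproduced from the normal cumulative distribution function. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later) with a standard normal distribution, which is an arbitrary choice since the percentages are the same for any mean and standard deviation.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

coverage = {}
for k in (1, 2, 3):
    # Probability that a value lies within k standard deviations of the mean.
    coverage[k] = standard_normal.cdf(k) - standard_normal.cdf(-k)
    print(f"within {k} standard deviation(s): {coverage[k]:.1%}")
```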

Frequency Distribution

• Frequency is the number of occurrences of a value, a range of values, or a category in a data set.
• Categorical data is grouped into categories, and non-categorical numeric data can be grouped into ranges. Each category or group is called a class.
• A frequency table shows the frequency for each class, with the limits of each class.
• A histogram is the graph that plots the data from a frequency table.
• A frequency distribution curve helps analysts study the distribution of data in various classes and verify hypotheses about the data, to infer conclusions from it using inferential statistics.

[Figure: Frequency table and histogram of our sample data, with Frequency plotted against Class]

Skewness
• Skewness is a measure of the degree of asymmetry of a frequency distribution.
• The coefficient of skewness usually lies between -1 and 1; for a symmetric distribution it is 0.

Kurtosis
• Kurtosis is a measure of the flatness or peakedness of a frequency distribution.
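A possible way to build a frequency table and a text histogram in Python, using the 30 daily order counts from the Day / Order Count table. The class width of 10 is an assumption, since the class limits used on the slide are not recoverable. The skewness and kurtosis lines use the moment-based definitions, which is one of several conventions and may not be the one the slides intend.

```python
from collections import Counter
import statistics

# Daily order counts for the 30 days, copied from the Day / Order Count table.
orders = [33, 26, 24, 21, 19, 20, 18, 18, 52, 56, 27, 22, 18, 49, 22,
          20, 23, 32, 20, 18, 19, 20, 30, 22, 23, 32, 25, 53, 57, 50]

CLASS_WIDTH = 10  # assumption: classes 10-19, 20-29, ...

# Map each value to the lower limit of its class and count the occurrences.
counts = Counter((v // CLASS_WIDTH) * CLASS_WIDTH for v in orders)
frequency_table = {
    f"{lower}-{lower + CLASS_WIDTH - 1}": counts[lower] for lower in sorted(counts)
}

# Text histogram: one '#' per observation in the class.
for label, freq in frequency_table.items():
    print(f"{label:>6} | {'#' * freq} ({freq})")

# Moment-based shape measures (assumed convention): average of z^3 and z^4.
mean = statistics.mean(orders)
sd = statistics.pstdev(orders)
skewness = sum(((v - mean) / sd) ** 3 for v in orders) / len(orders)
kurtosis = sum(((v - mean) / sd) ** 4 for v in orders) / len(orders)
print(f"skewness={skewness:.2f}, kurtosis={kurtosis:.2f}")
```

A positive skewness indicates a longer tail on the high side, which is consistent with the handful of high-order days in this data.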

Central Limit Theorem

When sampling from a population with mean μ and finite standard deviation σ, the sampling distribution of the sample mean will tend to a normal distribution with mean μ and standard deviation σ/√n as the sample size becomes large (n > 30).

[Figure: sampling distribution of the sample mean X for n = 5, n = 20, and large n]

• The central limit theorem applies when sampling from a population of any distribution with finite standard deviation.
• Hence it is customary for a sample to consist of 30 or more observations.
• If it is not possible to have 30 or more observations, the sample must be tested for normal distribution.
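A small simulation can make the theorem concrete. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1), chosen only because it is clearly non-normal; the seed, the number of trials and the sample sizes are likewise illustrative choices.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the simulation is repeatable

MU = 1.0     # mean of an exponential population with rate 1
SIGMA = 1.0  # its standard deviation

def sample_means(n, trials=5000):
    """Draw `trials` samples of size n and return the mean of each one."""
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]

results = {}
for n in (5, 20, 30):
    means = sample_means(n)
    results[n] = (statistics.fmean(means), statistics.pstdev(means))
    expected_sd = SIGMA / n ** 0.5  # sigma / sqrt(n) from the theorem
    print(f"n={n:>2}: mean of sample means={results[n][0]:.3f}, "
          f"sd of sample means={results[n][1]:.3f}, sigma/sqrt(n)={expected_sd:.3f}")
```

As n grows, the spread of the sample means shrinks towards σ/√n and their average stays close to μ, even though the individual observations are far from normally distributed.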

Training &

Development

Thank You !
