Module 1 - 5
STATISTICAL SOFTWARE
PACKAGE
MODULE 1
Research – Introduction – tools – software – Statistical Package
for Social Science (SPSS) – STATA – SAS – R – NVivo – MATLAB –
EVIEWS – JMP (introduction to each software only)
RESEARCH- INTRODUCTION
• ‘Re-search’ means repeatedly searching for knowledge.
• Fact-finding enquiries based on scientific investigation.
• In simple words, it is an organized and systematic way of finding answers to the questions that a researcher asks.
DEFINITIONS
Redman and Mory define research as a
“systematized effort to gain new
knowledge”.
Applied Research
• Applied research is a type of research design that seeks to solve a specific problem or provide innovative
solutions to issues affecting an individual, group or society.
• It is often referred to as a scientific method of inquiry or contractual research because it involves the practical
application of scientific methods to everyday problems.
• There are three types of applied research: evaluation research, research and development, and
action research.
BASED ON OBJECTIVE
Exploratory Research
• Exploratory research is the process of investigating a problem that has not been studied or thoroughly
investigated in the past.
• Exploratory research is usually conducted to gain a better understanding of an existing problem,
but it usually does not lead to a conclusive result.
Descriptive Research
• Descriptive research is a type of research that describes a population, situation, or phenomenon that is
being studied. It focuses on answering the how, what, when, and where questions of a research problem,
rather than the why.
• It is based on fact-finding enquiries.
Correlation Research
• Correlational research is a type of non-experimental method in which a researcher measures two variables
and assesses the statistical relationship between them.
Explanatory Research
• Explanatory research is a method developed to investigate a phenomenon that has not been studied
before or has not previously been explained well.
• Explanatory research is responsible for finding the why of events through the establishment of
cause-effect relationships.
BASED ON DATA COLLECTION TECHNIQUES
• Quantitative research is defined as a systematic investigation of phenomena
by gathering quantifiable data and performing statistical, mathematical, or
computational techniques.
• Quantitative research templates are objective, elaborate, and many times, even
investigational.
• Qualitative research is defined as a market research method that focuses on
obtaining data through open-ended and conversational communication.
• Qualitative research methods are designed in a manner that helps reveal the
behavior and perception of a target audience with reference to a particular topic.
• Commonly used qualitative research methods include in-depth
interviews, focus groups, ethnographic research, content analysis, and case study
research.
PROCESS INVOLVED IN RESEARCH
Define research problem → Review the literature → Formulate hypotheses → Design research → Collect data
MEASUREMENT SCALES
• Measurement is the foundation of any
scientific investigation. Scales of measurement
refer to ways in which variables are defined
and categorized.
NOMINAL/CATEGORICAL, ORDINAL, INTERVAL, RATIO
METRIC MEASUREMENT SCALES
• Interval scale - Interval scales allow us not only to rank order the items that are
measured, but also to quantify and compare the sizes of differences between them.
• With interval measurement we can determine not only that a person ranks
higher but also by how much.
• Example:
• Ratio - The ratio scale of measurement is similar to the interval scale in
that it also represents quantity and has equality of units, but it additionally has a
true zero along with equal intervals between neighbouring points.
Example:
Time spent studying causes a change in test score.
DATA ANALYSIS
• Data analysis is the process of collecting, modeling,
and analysing data to extract insights that support
decision-making.
Microsoft Excel has the basic features of all spreadsheets, using a grid
of cells arranged in numbered rows and letter-named columns to organize data
manipulations like arithmetic operations. It has a battery of supplied functions to
answer statistical, engineering, and financial needs. In addition, it can display data
as line graphs, histograms and charts, and with a very limited three-dimensional
graphical display. It allows sectioning of data to view its dependencies on various
factors for different perspectives (using pivot tables and the scenario manager).
SPSS - Statistical Package for the Social
Sciences
• SPSS Statistics is a powerful statistical software
platform.
• It offers a user-friendly interface and a robust set of
features that lets your organization quickly extract
actionable insights from your data.
• SPSS (Statistical Package for the Social Sciences) is
a versatile and responsive program designed to
undertake a range of statistical procedures.
• SPSS software is widely used in a range of
disciplines
Advantages
• Very easy to learn and use
• Can use either with menus or syntax files
• Quite good graphics
• Excels at descriptive statistics, basic regression analysis, analysis of
variance, and some newer techniques such as Classification and
Regression Trees (CART)
• Has its own structural equation modelling software AMOS, that dovetails
with SPSS
Disadvantages
• Focus is on statistical methods mainly used in the social sciences, market
research and psychology
• Has advanced regression modelling procedures such as LMM and GEE, but
they are awful to use with very obscure syntax
• Has few of the more powerful techniques required in epidemiological
analysis, such as competing risk analysis or standardised rates.
STATA
• According to StataCorp (2016), Stata is “a complete,
integrated statistical software package that provides
everything needed for data analysis, data management,
and graphics”.
• Basically, Stata is software that allows us to store and
manage data (large and small data sets), undertake
statistical analysis on our data, and create some really nice
graphs.
• This software is commonly used among health researchers,
particularly those working with very large data sets,
because it is powerful software that allows us to do almost
anything we like with our data.
Advantages
• Can use either with menus or syntax files
• Much more powerful than SPSS – probably equivalent to SAS
• Excels at advanced regression modelling
• Has its own in-built structural equation modelling
• Has a good suite of epidemiological procedures
• Researchers around the world write their own procedures in Stata,
which are then available to all users
Disadvantages
• Harder to learn and use than SPSS
• Does not yet have some specialised techniques such as CART or Partial
Least squares regression
SAS- R
SAS stands for Statistical Analysis System. It was developed at
the North Carolina State University in 1966, so is
contemporary with SPSS.
Advantages
• Can use either with menus or syntax files
• Much more powerful than SPSS
• Commonly used for data management in clinical trials
Disadvantages
• Harder to learn and use than SPSS
NVivo
• NVivo is a software program used for qualitative and mixed-methods research.
• Specifically, it is used for the analysis of unstructured text, audio, video, and image data,
including (but not limited to) interviews, focus groups, surveys, social media, and journal
articles.
• It is produced by QSR International. As of July 2014, it is available for both Windows and
Macintosh operating systems.
ADVANTAGES OF NVIVO
• Analyze and organize unstructured text, audio, video, or image data.
• Playback ability for audio and video files, so that interviews can be transcribed in NVivo.
• Ability to capture social media data from Facebook and Twitter using the NCapture browser
plug-in.
• Import notes and captures from Evernote - great for field research.
• Import citations from EndNote, Mendeley, Zotero, or other bibliographic management
software - great for literature reviews.
• Perform simple text analysis queries (such as text search or word frequencies) for text data in
English, French, German, Spanish, Portuguese, Japanese, and Simplified Chinese.
FILE TYPES ASSOCIATED WITH NVIVO
MATLAB
• MATLAB is used to
• Analyze data
• Develop algorithms
• Create models and applications
EVIEWS
• EViews is an easy-to-use statistical, econometric, and economic modeling
package.
EViews Desktop Details
[Figure: EViews desktop showing the Main Menu, the Object Window/Work Area, and the Path/directory, Database, and Workfile fields]
Note: Path/Database/Workfile can be changed by double-clicking on each.
EViews Workfile and Objects
• EViews does NOT open up with a “blank” generic document (unlike
Word®, Excel®, etc.).
ADVANTAGES OF JMP
• Streamlined menu interface arranged by "context" (e.g. univariate analysis, bivariate
analysis) instead of statistical tests.
• Dynamic output: after running a procedure, you can add or remove additional statistics and
graphs in the results window without having to re-run the procedure.
• Support for design of experiments and design generation.
• New in JMP Pro 13: Analyze unstructured text (e.g. open-ended survey questions) with
text mining techniques like cluster analysis and topic modeling.
• Extensive array of algorithms, especially for factor analysis (factor extraction and axes
rotation).
• Interactively build and refine graphs and tables with the Graph Builder and Tabulate
tools, respectively.
• Ability to interface with SAS: Import/export SAS data, write and execute SAS code, etc.
FILE TYPES ASSOCIATED WITH JMP
• For example: Suppose we are interested in the exam marks of all the
students in India. It is not feasible to measure the exam marks
of all the students in India, so we instead measure the marks of a
smaller sample of students, for example 1000 students. This sample
then represents the large population of Indian students. We
use this sample in our statistical study to draw conclusions about the
population from which it was drawn.
TYPES OF ANALYSIS
Analysis Strategy
TECHNIQUES USED IN
DATA ANALYSIS
• Univariate Analysis: Uni means one, which means that the
data has only one kind of variable. The major purpose of
univariate analysis is to describe and summarise the data,
and then find patterns in it.
• For example, the heights of ten students in a class can be recorded;
this is univariate data. There is only one variable, which is the height. The
pattern found in this type of data is described by
drawing conclusions from measures of central tendency and
dispersion (spread), using histograms, frequency
distribution tables, bar charts, etc.
UNIVARIATE ANALYSIS
Univariate analysis explores each variable in a data set,
separately. It describes the pattern of response to the
variable. It describes each variable on its own.
CATEGORICAL
• Frequency
• E.g.: Male, Female
METRIC
• Mean, S.D.
• E.g.: Customer satisfaction
UNIVARIATE (one variable)
• Metric variable: Mean, Median, Variance, Standard deviation, Percentiles; Histogram, Box plot; Normal, Uniform
• Categorical variable: Percentages; Bar Graph, Pie Graph
BIVARIATE ANALYSIS
Bivariate analysis is the simultaneous analysis of two variables (attributes). It explores the concept of relationship between two variables, whether there exists an association and the strength of this association, or whether there are differences between two variables and the significance of these differences.
BIVARIATE: 2 CATEGORICAL, 2 METRIC, 1 Category/1 Metric
• There are three types of bivariate analysis. They are:
2 CATEGORICAL (Association between job cadre & income)
2 METRIC (Quality of food served & customer satisfaction)
1 CATEGORICAL & 1 METRIC (Gender & job satisfaction)
• The variables involved are X and Y, i.e. the independent and
dependent variables.
• For example: To study the relationship between two
variables, i.e. online teaching and the marks scored by the
students.
(1) In the bottom left corner we find tabs for switching between Variable View and Data View. For now,
select Variable View.
(2) In Variable View, variables are shown as rows of cells.
(3) The first column shows the variable name for each variable.
(4) The fifth column may or may not contain a variable label. This describes the exact meaning of each
variable.
(5) The sixth column shows value labels: descriptions of the meaning of one, many or all values that a
variable may contain.
In short, Variable View does not show the
data itself but, rather, information about the
data.
Nominal (Categorical)
* People or objects with the same scale value are the same on some attribute.
* The values of the scale have no 'numeric' meaning in the way that you usually think about numbers.
Ordinal
* People or objects with a higher scale value have more of some attribute.
* The intervals between adjacent scale values are indeterminate.
* Scale assignment is by the property of "greater than," "equal to," or "less than" (e.g. “satisfied”, “dissatisfied”).
Interval
* Intervals between adjacent scale values are equal with respect to the attribute being measured.
* Measure the difference between 2 values.
* E.g., the difference between 8 and 9 is the same as the difference between 76 and 77.
Ratio
* There is a rational zero point for the scale.
* Ratios are equivalent, e.g., the ratio of 2 to 1 is the same as the ratio of 8 to 4.
Nominal
* Classification data: e.g. Male / Female
* No ordering: e.g. it makes no sense to state that M > F
* Arbitrary labels: e.g., M/F, 0/1, etc.
* Tendulkar can be identified by number 10 in the Indian cricket team
Ordinal
* Ordered, but differences between values are not important
* E.g., Likert scales: rank your degree of satisfaction on a scale of 1 to 5
* E.g., Restaurant ratings (10 – Highly Satisfied, 5 – Moderate, 1 – Highly Dissatisfied)
Interval
* Ordered, constant scale, but no natural zero
* Differences make sense, but ratios do not
* E.g. Temperature (C, F), Dates (the difference between temperatures of 100 and 90 degrees is the same as between 90 and 80 degrees)
Ratio
* Ordered, constant scale, natural zero
* E.g., Height, Weight, Age, Length
Choosing the Statistical Test
1. Data
2. Samples
3. Purpose
Appropriate Statistics
Nominal
* [Cross tabs]: Chi square, Phi, Cramer's Contingency
* [Nonparametric]: Chi-square, Runs, Binomial, McNemar, Cochran
Ordinal
* [Frequencies]: Median, Interquartile range
* [Nonparametric]: Kolmogorov-Smirnov, Sign, Wilcoxon, Kendall coefficient of concordance, Friedman two-way anova, Mann-Whitney U, Wald-Wolfowitz, Kruskal-Wallis
Interval
* Mean, Standard Deviation, Pearson's product-moment correlation, t test, Analysis of variance, Multivariate analysis of variance (MANOVA), Factor analysis, Regression, Multiple correlation (R)
Ratio
* Coefficient of Variation (CV = SD / M)
Simple Random Sampling
STRATIFIED RANDOM SAMPLING
Cluster Sampling
In stratified random sampling, all the strata of the population are sampled,
while in cluster sampling, the researcher randomly selects only a number of
clusters from the collection of clusters of the entire population.
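The contrast between the two designs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population (five classes of 20 students), the function names, and the choice to keep every unit of a selected cluster (one-stage cluster sampling) are hypothetical and do not come from the slides.

```python
import random

# Hypothetical population: 5 classes (used as strata or clusters) of 20 students each.
population = {
    f"class_{c}": [f"class_{c}_student_{i}" for i in range(20)]
    for c in "ABCDE"
}

def stratified_sample(groups, per_group, rng):
    """Draw `per_group` units at random from EVERY group (stratum)."""
    return {name: rng.sample(units, per_group) for name, units in groups.items()}

def cluster_sample(groups, n_clusters, rng):
    """Randomly pick `n_clusters` whole groups (clusters) and keep all their units."""
    chosen = rng.sample(sorted(groups), n_clusters)
    return {name: list(groups[name]) for name in chosen}

rng = random.Random(42)  # fixed seed so the sketch is repeatable
strat = stratified_sample(population, per_group=4, rng=rng)
clust = cluster_sample(population, n_clusters=2, rng=rng)

print("Stratified sample covers:", sorted(strat))
print("Cluster sample covers:   ", sorted(clust))
```

Every class appears in the stratified sample, whereas the cluster sample contains only the two classes that happened to be drawn.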
Systematic Sampling
MULTISTAGE SAMPLING
JUDGMENT SAMPLING
CONVENIENCE SAMPLING
QUOTA SAMPLING
What is SPSS?
• Statistical Package for Social Sciences
• General Purpose Statistical Software
• Statistical analysis for the input data
• Basic concepts of research in SPSS are
- Variable, Scale, Hypothesis, Significance level & Data
• The outcome of the data analysis indicates the significance of the
data collected by the researcher
VARIABLES
Variables:
“A variable is a characteristic of an event, object, idea, or type of
category that we are trying to measure”.
(E.g.) [Satisfaction level of (Working Environment)]
[Item] (Variable)
* An item may consist of dimensions (i.e.) satisfaction level of working
environment can be measured by two dimensions, Internal working
environment & External working environment, by asking the
respondents to rate the different categories in each
dimension.
* Items & variables have attributes (categories, measuring items)
(E.g.) Highly Satisfied, Satisfied, Moderate, Dissatisfied, Highly
Dissatisfied.
Types of Variable:
1. Dependent variable (Criterion variable) – depends on other
factors.
(E.g.) Your CIAT mark is a dependent variable, because it depends
on many factors like how much you studied, how much you
slept, how much you ate before your test, how many relevant
concepts you studied, etc.
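Measure | Population parameter | Sample statistic
Size | $N$ | $n$
Mean | $\mu$ | $\bar{x}$
SD | $\sigma = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n}}$ | $S = \sqrt{\dfrac{\sum (x - \bar{x})^2}{n - 1}}$
Proportion | $P$ | $p = \dfrac{x}{n}$
Correlation coefficient | $\rho$ | $r = \dfrac{\mathrm{COV}(x, y)}{s_x s_y}$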
HYPOTHESIS
Testing of Hypothesis:
Given the observed data set, the P value is the smallest level of significance at
which the null hypothesis is rejected (and the alternative is
accepted).
• If the P value lies between 0.01 and 0.05 (i.e. 0.01 < P value ≤ 0.05),
then reject H0 at the 5% level of significance.
Procedure
Analyze → Descriptive Statistics → Frequencies
Interpretation
Descriptive Statistics
Measures of central tendency and measures of
dispersion are collectively known as descriptive
statistics.
They describe what the set of numbers looks like.
For a set of heights, the mean tells us the average height, and the
standard deviation tells us how spread out the
heights are around that average.
Procedure
Analyze → Descriptive Statistics → Frequencies,
then click Statistics and select Mean, Median, Mode, Maximum,
Range, etc.
Descriptive Statistics
Purpose:
To describe each variable – What is the current level of the
variable of interest?
Frequency:
Mean, Minimum, Maximum, Standard Deviation, Skewness &
Kurtosis.
Analyze → Descriptive Statistics → Frequencies, then click
Statistics and select Mean, Median, Mode, Maximum, Range, etc.
Frequencies for two or more nominal variables
Analyze → Summarize → Cross tabulation
Means of variables by subgroups defined by one or more nominal
variables
Analyze → Compare Means → Means (Use of Levels)
Interpretation:
Measure of Central Tendency: (Quantitative Data)
Mean – average of all the observations (i.e.) sum of observations /
no. of observations.
Standard Error of Mean – how much the sample mean is expected to deviate from the actual
mean of the population.
Median – value of the middle item when the observations are arranged in ascending
or descending order (i.e.) the value that divides the observations into
two equal parts.
Mode – value which occurs most frequently in the set of
observations.
Measure of Dispersion:
“Degree to which numerical data tend to spread about average
value.”
Range – difference between greatest and least value.
Interpretation:
Standard Deviation – positive square root of arithmetic mean of
squares of deviation of observations from their arithmetic mean.
Skewness – a measure of asymmetry; it shows that the distribution
curve deviates from symmetry.
When the mean and median fall at different points, the
curve is a skewed distribution (i.e.) the mean, median & mode values will
not coincide.
The mean & mode values will differ widely & the median value will lie
between the mean and mode values.
Left & Right Skew (i.e.) (-) & (+) Skew
Normal Curve (bell shaped): No Skewness
Interpretation:
Kurtosis – the degree of kurtosis of a distribution is measured relative to
the peakedness of a normal curve
(i.e.) it describes another aspect of the shape of the distribution.
• Leptokurtic: more peaked than the normal curve
• Mesokurtic: normal curve (bell-shaped)
• Platykurtic: flatter than the normal curve
Kurtosis is measured by the coefficient β (or its deviation from 3).
If the β value is 3, the curve is normal (Mesokurtic).
If the β value is < 3, the curve is flat (Platykurtic).
If the β value is > 3, the curve is peaked (Leptokurtic).
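The rules above can be sketched in Python. This is a minimal illustration that assumes the moment-based definitions (skewness = m3 / m2^1.5 and β = m4 / m2^2, where m_k is the k-th central moment). The function names, the tolerance, and the sample data are hypothetical, and the value SPSS prints uses a small-sample adjustment, so it will not match exactly.

```python
def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Moment coefficient of skewness; negative = left skew, positive = right skew."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis_beta(values):
    """Kurtosis coefficient beta = m4 / m2**2 (3 for a normal curve)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def kurtosis_label(beta, tol=0.05):
    """Classify a beta value; `tol` is an arbitrary illustrative tolerance."""
    if abs(beta - 3) <= tol:
        return "mesokurtic"
    return "leptokurtic" if beta > 3 else "platykurtic"

sample = [1, 2, 3, 4, 5]  # hypothetical data: symmetric and flat
beta = kurtosis_beta(sample)
print("skewness:", skewness(sample))
print("beta:", beta, "->", kurtosis_label(beta))
```

Many packages, SPSS included, report kurtosis with 3 already subtracted (excess kurtosis), so a normal curve shows a value near 0 rather than 3.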
Exercise : Descriptive Statistics
Problem:
The following figures represent the ages of a sample of 60
employees of a firm Orange plc.
35 44 54 33 46 20 32 25 50 39 33 37
42 40 20 25 34 52 27 22 18 40 23 17
41 45 21 34 49 27 60 46 32 58 23 52
24 64 41 47 54 37 40 41 40 36 46 29
34 39 39 40 37 50 41 34 47 34 45 36
Find out Mean, Median, Mode, Standard deviation, Standard
error, Range, Maximum & Minimum.
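Outside SPSS, the same quantities can be checked with Python's standard library. The sketch below is one possible cross-check, not the prescribed procedure. It assumes the sample standard deviation (divisor n - 1) and reports every mode if there is a tie.

```python
import math
import statistics

# Ages of the 60 employees, copied row by row from the exercise.
ages = [
    35, 44, 54, 33, 46, 20, 32, 25, 50, 39, 33, 37,
    42, 40, 20, 25, 34, 52, 27, 22, 18, 40, 23, 17,
    41, 45, 21, 34, 49, 27, 60, 46, 32, 58, 23, 52,
    24, 64, 41, 47, 54, 37, 40, 41, 40, 36, 46, 29,
    34, 39, 39, 40, 37, 50, 41, 34, 47, 34, 45, 36,
]

n = len(ages)
mean = statistics.mean(ages)
median = statistics.median(ages)
modes = statistics.multimode(ages)   # list of all most-frequent values
sd = statistics.stdev(ages)          # sample SD (divisor n - 1)
se = sd / math.sqrt(n)               # standard error of the mean
minimum, maximum = min(ages), max(ages)
value_range = maximum - minimum

print(f"n={n} mean={mean:.2f} median={median} modes={modes}")
print(f"SD={sd:.2f} SE={se:.2f} range={value_range} min={minimum} max={maximum}")
```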
CROSS TABS
A crosstab is a table showing the relationship between two or more variables. Where the table only shows
the relationship between two categorical variables, a crosstab is also known as a contingency table.
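As a small illustration of what a crosstab contains, the Python sketch below counts how often each pair of categories occurs. The survey records, the variable names, and the category labels are invented for this example and are not from the slides.

```python
from collections import Counter

# Hypothetical survey records: (gender, preferred software).
records = [
    ("Male", "SPSS"), ("Female", "R"), ("Female", "SPSS"),
    ("Male", "R"), ("Female", "SPSS"), ("Male", "SPSS"),
]

counts = Counter(records)                     # frequency of each (row, column) pair
rows = sorted({gender for gender, _ in records})
cols = sorted({software for _, software in records})

# Contingency table as nested dictionaries: table[row][column] -> count.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

print("Gender".ljust(8) + "".join(c.rjust(6) for c in cols))
for r in rows:
    print(r.ljust(8) + "".join(str(table[r][c]).rjust(6) for c in cols))
```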
The rows and columns will be switched. In our example, the data is now
grouped by genre, with columns for each year.
Keeping charts up to date
Results discussion
Exercise
• Is there any difference between male and female
students in GPA score?
Paired ‘t’ Test for difference of Two Means
The two variables should be scale (metric) and dependent (paired).
Problem:
A Company arranged an intensive training course for its
team of salesmen. A random sample of 10 salesmen was
selected and the value (in ‘000) of their sales made in the
weeks immediately before and after the course are shown in
the following table:
Salesmen 1 2 3 4 5 6 7 8 9 10
Sales Before 12 23 5 18 10 21 19 15 8 14
Sales After 18 22 15 21 13 22 17 19 12 16
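The paired t statistic for this table can be worked out by hand or with a short Python sketch such as the one below. It assumes the usual formula t = mean(d) / (SD(d) / sqrt(n)) on the differences d = after - before, with n - 1 degrees of freedom. The variable names are illustrative.

```python
import math
import statistics

before = [12, 23, 5, 18, 10, 21, 19, 15, 8, 14]
after = [18, 22, 15, 21, 13, 22, 17, 19, 12, 16]

# Paired design: analyse the per-salesman differences, not the two columns separately.
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)            # sample SD of the differences
t_stat = mean_diff / (sd_diff / math.sqrt(n))
df = n - 1

print(f"mean difference = {mean_diff}, t = {t_stat:.3f}, df = {df}")
```

The resulting t can be compared with the two-tailed 5% critical value for 9 degrees of freedom (about 2.262 in standard t tables) to decide whether to reject H0.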
Procedure
Discussion
Exercise
Is there any difference between the pre-test score
and the post-test score, before and after the
training program?
ANOVA (One Way)
Why ANOVA?
This is an extension of the two-sample t-test.
What if we wish to compare the means of more than two
populations?
What does it do? It compares three or more population means to
see whether there is any difference among them.
Variable selection
Exercise
The institution ABC decides to examine the stress level
of the students across departments, as the institution
faces lower academic results. The responses of the
students are collected on a five-point Likert scale from a
total of 30 students from the commerce, management and
economics departments. Responses range
from 1 (strongly agree) to 5 (strongly
disagree).
Research question:
Do the students of different disciplines (commerce, management and
economics) differ in their stress level?
Objectives:
To study the difference among student groups (commerce,
management and economics) in their stress level.
Hypothesis
H0 - There is no significant difference among student groups with
respect to the stress level of students.
H1 - There is a significant difference among student groups with respect
to the stress level of students.
Procedure
Hypothesis
A random sample of the students in each row was taken. The score for those
students on the second exam was recorded
• Front: 82, 83, 97, 93, 55, 67, 53
• Middle: 83, 78, 68, 61, 77, 54, 69, 51, 63
• Back: 38, 59, 55, 66, 45, 52, 52, 61
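For the row data above, the one-way ANOVA F ratio can be sketched in Python as follows. This assumes the standard decomposition into between-group and within-group sums of squares. The variable names are illustrative, and SPSS remains the tool the slides actually use.

```python
# Second-exam scores by seating row, copied from the exercise.
groups = {
    "Front": [82, 83, 97, 93, 55, 67, 53],
    "Middle": [83, 78, 68, 61, 77, 54, 69, 51, 63],
    "Back": [38, 59, 55, 66, 45, 52, 52, 61],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = sum(all_scores) / len(all_scores)

# Between-group sum of squares: group size times squared distance of group mean from grand mean.
ss_between = sum(
    len(s) * (sum(s) / len(s) - grand_mean) ** 2 for s in groups.values()
)
# Within-group sum of squares: squared distance of each score from its own group mean.
ss_within = sum(
    (x - sum(s) / len(s)) ** 2 for s in groups.values() for x in s
)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

The F ratio is then compared with the 5% critical value for (2, 21) degrees of freedom, roughly 3.47 in standard F tables.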
Two way ANOVA
Two-way ANOVA is an extension of one-way ANOVA.
In two-way ANOVA, we have two independent variables
and one dependent variable.
Objective
To test whether there is any difference in students' stress level based
on the interaction of demographic characteristics.
Hypothesis
Discussion
NON PARAMETRIC TESTS
NORMALITY TESTS
• Normality refers to a specific statistical distribution called the normal
distribution, sometimes also called the Gaussian distribution or bell-shaped curve.
The normal distribution is a symmetrical continuous distribution defined
by the mean and standard deviation of the data.
• The two well-known tests of normality, namely the Kolmogorov–Smirnov test and the
Shapiro–Wilk test, are the most widely used methods to test the normality of the data. Normality tests can be
conducted in the statistical software “SPSS” (Analyze → Descriptive Statistics → Explore → Plots →
Normality plots with tests).
CORRELATION
• The degree of relationship between the quantitative variables under
consideration is measured through correlation analysis.
• The measure of correlation is called the correlation coefficient.
• The degree of relationship is expressed by a coefficient which ranges from
-1 to +1 (-1 ≤ r ≤ +1). The direction of change is indicated by the sign.
Types of correlation
• -1 ≤ r ≤ +1
• The degree of correlation is expressed by the value of the coefficient
• Suppose r = 0.9; then r² = 0.81, which would mean that 81% of the variation
in the dependent variable has been explained by the independent
variable.
Coefficient of Determination
Problem:
Find the correlation coefficient between income and expenditure
of the family to the following data. Also test whether correlation
coefficient is significant.
Income ( in hundreds) 60 58 45 65 56 38 70
Expenditure (in hundreds) 55 50 40 60 62 45 63
Solution:
First find the coefficient of correlation by using the formula
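The formula itself is not reproduced here. As a hedged sketch, the Python code below uses the usual Pearson product-moment form, r = Sxy / sqrt(Sxx * Syy), and then the common significance test t = r * sqrt(n - 2) / sqrt(1 - r²) with n - 2 degrees of freedom. Both are assumed to be what the exercise intends.

```python
import math

income = [60, 58, 45, 65, 56, 38, 70]        # in hundreds
expenditure = [55, 50, 40, 60, 62, 45, 63]   # in hundreds

n = len(income)
mean_x = sum(income) / n
mean_y = sum(expenditure) / n

# Sums of squares and cross-products of deviations from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, expenditure))
sxx = sum((x - mean_x) ** 2 for x in income)
syy = sum((y - mean_y) ** 2 for y in expenditure)

r = sxy / math.sqrt(sxx * syy)                       # Pearson correlation coefficient
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)  # test statistic for H0: no correlation
df = n - 2

print(f"r = {r:.3f}, t = {t_stat:.3f}, df = {df}")
```

The t value is compared with the two-tailed 5% critical value for 5 degrees of freedom (about 2.571 in standard t tables).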
Discussion
Regression Analysis
When is it used?
To establish cause and effect between a dependent variable and a
number of independent variables.
Analyze → Regression → Linear (use the Statistics and Save
options).
Variables – Enter, Backward, Forward and Stepwise method.
Options – residual analysis, influence statistics, collinearity
diagnostics, normality plots.
Interpretation:
Goodness of Model: R², F-statistic, Adjusted R², Standard Error.
Strength of Influence of Independent variable: beta (standardized
coefficients).
Regression Analysis
Regression Analysis is a very powerful tool in the field of
statistical analysis for predicting the value of one variable, given
the value of another variable, when those variables are related to
each other.
Real-life data may contain noise, which hides the pattern that is
present in the data.
We call this pattern the Model. A simple representation of data is
Data = Model + Noise
The term Model is also called the pattern; it is also an expected
value.
The term Noise is also called error, deviation, or residual.
One has to find the model in order to explain the data.
What is error in mathematical form?
[Figure: plot of y against x showing the actual y, the model line, and the error as the gap between them]
The relationship between y and x is
y = alpha + beta x + error
Here alpha + beta x is the model. Note that the error is
different for different x. The set of equations used to find
alpha and beta are called the normal equations.
Example (Simple Regression)
• Age of a car (in years): 1 2 3 4
• Maintenance cost (in thousands): 2 3 5 8
y = a + bX
Maintenance cost = -0.5 + 2(Age of car)
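The fitted line above can be reproduced with the least-squares formulas b = Sxy / Sxx and a = mean(y) - b * mean(x), which are the solution of the normal equations for one predictor. The Python sketch below is illustrative; the variable names are not from the slides.

```python
age = [1, 2, 3, 4]    # age of car in years (X)
cost = [2, 3, 5, 8]   # maintenance cost in thousands (y)

n = len(age)
mean_x = sum(age) / n
mean_y = sum(cost) / n

# Least-squares slope and intercept for y = a + bX.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, cost))
sxx = sum((x - mean_x) ** 2 for x in age)
b = sxy / sxx
a = mean_y - b * mean_x

# Residual (error) for each car: actual cost minus the cost predicted by the model.
residuals = [y - (a + b * x) for x, y in zip(age, cost)]

print(f"Maintenance cost = {a} + {b} * (Age of car)")
print("residuals:", residuals)
```

The residuals differ from car to car, which is the sense in which the error is different for different x.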
The beta value tells us to what degree each predictor affects the
outcome if the effects of all other predictors are held constant.
Multicollinearity