
Module 1 - 5

Copyright
© © All Rights Reserved

RESEARCH-RELATED

STATISTICAL SOFTWARE
PACKAGE
MODULE 1
Research – Introduction – Tools – Software – Statistical Package
for the Social Sciences (SPSS) – STATA – SAS – R – NVivo – MATLAB –
EViews – JMP (introduction to each software package only)
RESEARCH- INTRODUCTION
• ‘Re-Search’ means repeatedly searching for knowledge.
• Fact-finding enquiries based on scientific investigation.
• In simple words, it is an organized and systematic way of finding answers to the questions that a researcher asks.
DEFINITIONS
Redman and Mory define research as a “systematized effort to gain new
knowledge”.

According to Clifford Woody, “research comprises defining and redefining
problems, formulating hypothesis or suggested solutions; collecting, organizing
and evaluating data; making deductions and reaching conclusions; and at last
carefully testing the conclusions to determine whether they fit the
formulating hypothesis”.
TYPES OF RESEARCH – BASED ON APPLICATION
Basic Research
• Basic research is a type of research approach that is aimed at gaining a better understanding of a subject, phenomenon or basic law of nature.
• This type of research is primarily focused on the advancement of knowledge rather than solving a specific problem.
• Basic research is also referred to as pure research or fundamental research.
• The primary aim of this research approach is to gather information in order to improve one's understanding, and this information can then be useful in proffering solutions to a problem.

Applied Research
• Applied research is a type of research design that seeks to solve a specific problem or provide innovative solutions to issues affecting an individual, group or society.
• It is often referred to as a scientific method of inquiry or contractual research because it involves the practical application of scientific methods to everyday problems.
• There are three types of applied research: evaluation research, research and development, and action research.
BASED ON OBJECTIVE
Exploratory Research
• Exploratory research is the process of investigating a problem that has not been studied or thoroughly investigated in the past.
• Exploratory research is usually conducted to gain a better understanding of the existing problem, but usually doesn't lead to a conclusive result.
Descriptive Research
• Descriptive research is a type of research that describes a population, situation, or phenomenon that is being studied. It focuses on answering the how, what, when, and where questions of a research problem, rather than the why.
• Based on fact-finding enquiries.
Correlational Research
• Correlational research is a type of non-experimental method in which a researcher measures two variables, and understands and assesses the statistical relationship between them.
Explanatory Research
• Explanatory research is a method developed to investigate a phenomenon that had not been studied before or had not been well explained previously.
• Explanatory research is responsible for finding the why of events through the establishment of cause-effect relationships.
BASED ON DATA COLLECTION TECHNIQUES
• Quantitative research is defined as a systematic investigation of phenomena by gathering quantifiable data and applying statistical, mathematical, or computational techniques.
• Quantitative research templates are objective, elaborate, and many times, even investigational.
• Qualitative research is defined as a market research method that focuses on obtaining data through open-ended and conversational communication.
• Qualitative research methods are designed in a manner that helps reveal the behavior and perception of a target audience with reference to a particular topic.
• Commonly used qualitative research methods include in-depth interviews, focus groups, ethnographic research, content analysis, and case study research.
PROCESS INVOLVED IN RESEARCH
Define research problem → Review the literature → Formulate hypotheses → Design research → Collect data → Analyse data → Interpret and report
[Flow diagram: the stages are linked by arrows marked f and ff]

Where f = feedback (helps in controlling the subsystem)
ff = feed forward (serves the vital function of providing criteria for evaluation)
INTRODUCTION
• Most social science research involves an investigator gathering data and performing analyses to determine what the data mean.
• Data consist of quantitative information like price, income, sales, etc., or qualitative information like knowledge, performance, character, etc.
• A common feature of survey-based research is to capture respondents' feelings, attitudes, opinions, etc., in some measurable form.
• This information must be converted into numerical form through measurement techniques before further analysis is possible.
MEASUREMENT SCALES
• Measurement is the foundation of any scientific investigation. Scales of measurement refer to ways in which variables are defined and categorized.
• Each scale of measurement has certain properties, which in turn determine which statistical analyses are appropriate.
MEASUREMENT SCALES
• Non-metric data: Nominal/Categorical, Ordinal
• Metric data: Interval, Ratio
NON METRIC MEASUREMENT SCALES

• Non-metric data: describe differences in type or kind by indicating the presence or absence of a characteristic or property. They can be measured on a nominal or an ordinal scale.
• Nominal scales assign numbers as labels to identify subjects or objects.
  - Commonly used examples of nominally scaled data include many demographic attributes (religion, occupation, family type, etc.)
  - Many forms of behaviour (voting behaviour, purchase activity) or any other action that is discrete (happens or not)
  - Example: Gender: 1-Male; 2-Female
• Ordinal scales: an ordinal scale of measurement represents an ordered series of relationships or rank order.
  - Example: Rank: 1st place, 2nd place, ... last place
  - Example: Level of agreement: no, maybe, yes
  - Example: Preferences for different soft drinks
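The difference between the two non-metric scales can be sketched in a few lines of Python. This is only an illustration: the responses below are made-up values, and the coding 1 = Male, 2 = Female follows the example above. For nominal data only counts and the mode are meaningful; for ordinal data the order also allows a median category.

```python
from collections import Counter
from statistics import median_low, mode

# Hypothetical survey responses, coded 1 = Male, 2 = Female (codes are only labels).
gender_codes = [1, 2, 2, 1, 2, 2, 1, 2]
gender_labels = {1: "Male", 2: "Female"}
genders = [gender_labels[code] for code in gender_codes]

# Nominal scale: counting and the mode make sense; a mean of the codes would not.
gender_counts = Counter(genders)
most_common_gender = mode(genders)

# Ordinal scale: categories have an order, so a median category is meaningful.
agreement_order = ["no", "maybe", "yes"]  # lowest to highest
answers = ["yes", "no", "maybe", "yes", "maybe", "yes", "no"]
ranks = [agreement_order.index(answer) for answer in answers]
median_answer = agreement_order[median_low(ranks)]

print(gender_counts)
print(most_common_gender, median_answer)
```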


METRIC MEASUREMENT SCALES
• Metric data: used when subjects differ in amount or degree on a particular attribute.
• Interval scale: interval scales allow us not only to rank order the items that are measured, but also to quantify and compare the sizes of differences between them.
• With interval measurement we can determine not only that a person ranks higher but also by how much.
• Example: temperature in degrees Celsius or Fahrenheit, where a 10-degree difference means the same anywhere on the scale.
• Ratio scale: the ratio scale of measurement is similar to the interval scale in that it also represents quantity and has equal intervals between neighbouring points, but it additionally has a true zero.
• For example, the number of children in a household or years of work experience are ratio variables: a respondent can have no children in their household or zero years of work experience.
• Money is a good example of an everyday ratio scale of measurement. If
we have ₹100 we have twice as much purchasing power as ₹50.
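A small sketch of why ratios are meaningful on a ratio scale but not on an interval scale. Temperature in Celsius and Fahrenheit is used here as a standard interval-scale illustration; the specific values are assumptions chosen for the example.

```python
# Ratio scale: money has a true zero, so "twice as much" is meaningful.
rupees_a, rupees_b = 100, 50
money_ratio = rupees_a / rupees_b


def to_fahrenheit(celsius):
    """Convert a Celsius temperature to Fahrenheit."""
    return celsius * 9 / 5 + 32


# Interval scale: the zero point is arbitrary, so the ratio changes with the unit.
celsius_a, celsius_b = 40.0, 20.0
celsius_ratio = celsius_a / celsius_b
fahrenheit_ratio = to_fahrenheit(celsius_a) / to_fahrenheit(celsius_b)

print(money_ratio)       # the same in any currency unit
print(celsius_ratio)     # looks like "twice as hot" ...
print(fahrenheit_ratio)  # ... but not once the unit changes
```

The same two temperatures give a different ratio in each unit, which is why only differences, not ratios, are interpreted on an interval scale.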
VARIABLES
• Anything in the research that could be changed and affect the results of the investigation. Anything that varies or changes in value. Variables take on two or more values.
• Specifically, variables represent persons or objects that can be manipulated, controlled or merely measured for the sake of research.
• Example: gender, marital status, an individual's attitude towards women's empowerment (highly favourable to unfavourable), etc.
Variables can be classified as Independent or Dependent.
Independent Variable: The independent variable is the cause. It is the variable that you believe will influence your outcome measure. Also called the predictor variable.
Dependent Variable: The dependent variable is the effect. It is the variable that depends on or is influenced by the independent variable(s). Also called the outcome variable.

Example:
Time spent studying (independent variable) causes a change in test score (dependent variable).
DATA ANALYSIS
• Data analysis is the process of collecting, modeling, and analysing data to extract insights that support decision-making.
• In simple words, data analysis focuses on the process of turning raw data into useful statistics, information, and explanations.
TOOLS USED FOR ANALYSIS
• Excel
• SPSS
• STATA
• SAS
• NVivo
• MATLAB
• EViews
• JMP
• R
• Python
EXCEL
Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS,
Android and iOS. It features calculation, graphing tools, pivot tables, and
a macro programming language called Visual Basic for Applications. It
originally competed with the dominant Lotus 1-2-3 and eventually
outsold it. Microsoft released the first version of Excel for the Mac OS in 1985 and
the first Windows version (numbered 2.05 to line up with the Mac) in November
1987.

Microsoft Excel has the basic features of all spreadsheets, using a grid
of cells arranged in numbered rows and letter-named columns to organize data
manipulations like arithmetic operations. It has a battery of supplied functions to
answer statistical, engineering, and financial needs. In addition, it can display data
as line graphs, histograms and charts, and with a very limited three-dimensional
graphical display. It allows sectioning of data to view its dependencies on various
factors for different perspectives (using pivot tables and the scenario manager).
SPSS - Statistical Package for the Social Sciences
• SPSS Statistics is a powerful statistical software platform.
• It offers a user-friendly interface and a robust set of features that lets your organization quickly extract actionable insights from your data.
• SPSS (Statistical Package for the Social Sciences) is a versatile and responsive program designed to undertake a range of statistical procedures.
• SPSS software is widely used in a range of disciplines.
Advantages
• Very easy to learn and use
• Can use either with menus or syntax files
• Quite good graphics
• Excels at descriptive statistics, basic regression analysis, analysis of
variance, and some newer techniques such as Classification and
Regression Trees (CART)
• Has its own structural equation modelling software AMOS, that dovetails
with SPSS
Disadvantages
• Focus is on statistical methods mainly used in the social sciences, market
research and psychology
• Has advanced regression modelling procedures such as LMM and GEE, but
they are difficult to use and have very obscure syntax
• Has few of the more powerful techniques required in epidemiological
analysis, such as competing risk analysis or standardised rates.
STATA
• According to StataCorp (2016), Stata is “a complete,
integrated statistical software package that provides
everything needed for data analysis, data management,
and graphics”.
• Basically, Stata is software that lets us store and manage data (large and small data sets), undertake statistical analysis on our data, and create some really nice graphs.
• This software is commonly used among health researchers, particularly those working with very large data sets, because it is powerful software that lets us do almost anything we like with our data.
Advantages
• Can use either with menus or syntax files
• Much more powerful than SPSS – probably equivalent to SAS
• Excels at advanced regression modelling
• Has its own in-built structural equation modelling
• Has a good suite of epidemiological procedures
• Researchers around the world write their own procedures in Stata,
which are then available to all users
Disadvantages
• Harder to learn and use than SPSS
• Does not yet have some specialised techniques such as CART or partial
least squares regression
SAS
SAS stands for Statistical Analysis System. It was developed at
North Carolina State University in 1966, so it is
contemporary with SPSS.
Advantages
• Can use either with menus or syntax files
• Much more powerful than SPSS
• Commonly used for data management in clinical trials
Disadvantages
• Harder to learn and use than SPSS
NVivo
• NVivo is a software program used for qualitative and mixed-methods research.
• Specifically, it is used for the analysis of unstructured text, audio, video, and image data,
including (but not limited to) interviews, focus groups, surveys, social media, and journal
articles.
• It is produced by QSR International. As of July 2014, it is available for both Windows and
Macintosh operating systems.

ADVANTAGES OF NVIVO
• Analyze and organize unstructured text, audio, video, or image data.
• Playback ability for audio and video files, so that interviews can be transcribed in NVivo.
• Ability to capture social media data from Facebook and Twitter using the NCapture browser
plug-in.
• Import notes and captures from Evernote - great for field research.
• Import citations from EndNote, Mendeley, Zotero, or other bibliographic management
software - great for literature reviews.
• Perform simple text analysis queries (such as text search or word frequencies) for text data in
English, French, German, Spanish, Portuguese, Japanese, and Simplified Chinese.
FILE TYPES ASSOCIATED WITH NVIVO
• *.nvp - An NVivo for Windows project file.
• *.nvpx - An NVivo for Mac project file.
MATLAB
• MATLAB is a programming and numeric computing platform used by millions of
engineers and scientists to analyze data, develop algorithms, and create models.
• MATLAB is a proprietary multi-paradigm programming language and numeric
computing environment developed by MathWorks.
• MATLAB allows matrix manipulations, plotting of functions and data,
implementation of algorithms, creation of user interfaces, and interfacing with
programs written in other languages.
• MATLAB helps take ideas from research to production by deploying to enterprise
applications and embedded devices, as well as integrating with Simulink and
Model-Based Design.

• MATLAB is used to
• Analyze data
• Develop algorithms
• Create models and applications
EVIEWS
• EViews is an easy-to-use statistical, econometric, and economic modeling package.
• There are three ways to work in EViews:
  - Graphical user interface (using mouse and menus/dialogs).
  - Single commands (using the command window).
  - Program files (commands assembled in a script executed in batch mode).
• EViews offers financial institutions, corporations, government agencies, and academics access to powerful statistical, time series, forecasting, and modeling tools through an innovative, easy-to-use object-oriented interface.
EViews Desktop
[Figure: EViews desktop, showing the Main Menu, the Command Window, the Object Window/Work Area, and the Path/directory, Database and Workfile indicators]
Note: Path/Database/Workfile can be changed by double-clicking in each.
EViews Workfile and Objects
• EViews does NOT open up with a “blank” generic document (unlike Word®, Excel®, etc.).
• EViews documents (aka “workfiles”) need to be created and are not generic (they will contain information about your data, etc.).
• EViews is an “object”-oriented program. Objects are collections of information related to a particular analysis (series, groups, equations, graphs, tables).
• Workfiles are holders of these “objects”.
JMP
JMP is a software program used for statistical analysis. It is created by SAS Institute Inc.
Unlike SAS (which is command-driven), JMP has a graphical user interface, and is
compatible with both Windows and Macintosh operating systems.

ADVANTAGES OF JMP
• Streamlined menu interface arranged by "context" (e.g. univariate analysis, bivariate
analysis) instead of statistical tests.
• Dynamic output after running a procedure, can add or remove additional statistics and
graphs in the results window without having to re-run the procedure.
• Support for design of experiments and design generation.
• New in JMP Pro 13: Analyze unstructured text (e.g. open-ended survey questions) with
text mining techniques like cluster analysis and topic modeling.
• Extensive array of algorithms, especially for factor analysis (factor extraction and axes
rotation).
• Interactively build and refine graphs and tables with the Graph Builder and Tabulate
tools, respectively.
• Ability to interface with SAS: Import/export SAS data, write and execute SAS code, etc.
FILE TYPES ASSOCIATED WITH JMP
• *.jmp - A JMP data table.
• *.jsl - A JMP script file.
• *.jrn - A JMP report file.
UNIT II
SPSS - INTRODUCTION - INTRODUCTION
TO SPSS - DATA ANALYSIS WITH SPSS:
GENERAL ASPECTS, WORKFLOW,
CRITICAL ISSUES - SPSS: GENERAL
DESCRIPTION, FUNCTIONS, MENUS,
COMMANDS - SPSS FILE MANAGEMENT -
EXERCISE
DATA ANALYSIS
• Data analysis is the process of collecting, modeling, and analysing data to extract insights that support decision-making.
• In simple words, data analysis focuses on the process of turning raw data into useful statistics, information, and explanations.
DESCRIPTIVE STATISTICS
• Refers to describing the characteristics of a sample
or population.
• Summarizes the sample in either quantitative or visual format, using charts, tables and graphs.
• It describes the important characteristics/properties of the data using measures of central tendency like mean/median/mode and measures of dispersion like range, standard deviation, variance, etc.
• It assigns numerical values to describe the trend of the
samples collected.
• It converts large volumes of data and presents it in a
simpler, more meaningful format that is easier to
understand and interpret.
• The primary purpose of descriptive statistics is to describe and interpret the data at hand, while inferential statistics use the descriptive values obtained to make predictions about a larger set of data.
• Hence, descriptive statistics form the first step and the
basis of quantitative data analysis.
TYPES OF DESCRIPTIVE STATISTICS
• A) Measures of Frequency
• This measures how often a particular value of a variable occurs in the distribution. It can
be measured in numbers or percentages and shows how frequently a response
or value occurs.
• B) Measures of Central Tendency
• Measures of central tendency indicate the average or the most common
value in the data set. They identify central points by computing the mean,
median, and mode.
• C) Measures of Variation or Dispersion
• This shows how spread out the responses in the data set are. It helps identify
the gap between the highest and lowest values and how far apart individual
values are from the mean or the average. Measures of variation are calculated
using the range, standard deviation, and variance.
• D) Measure of Position
• This measures how individual values are positioned with one another. This
method of calculation relies on a standardized value. Percentiles and quartile
ranks indicate the measures of position.
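The four families of measures above can be computed with Python's standard `statistics` module. The sketch below uses a small made-up list of test scores (an assumption for illustration only) and treats it as a complete population, so it uses the population variance and standard deviation.

```python
from collections import Counter
from statistics import mean, median, mode, pstdev, pvariance, quantiles

# Hypothetical test scores for ten respondents (not real data).
scores = [4, 8, 6, 5, 3, 8, 9, 5, 8, 4]

# A) Measures of frequency: how often each value occurs.
frequency = Counter(scores)

# B) Measures of central tendency.
central = {"mean": mean(scores), "median": median(scores), "mode": mode(scores)}

# C) Measures of variation or dispersion (population versions).
dispersion = {
    "range": max(scores) - min(scores),
    "variance": pvariance(scores),
    "std_dev": pstdev(scores),
}

# D) Measures of position: the three quartile cut points.
quartiles = quantiles(scores, n=4)

print(frequency)
print(central)
print(dispersion)
print(quartiles)
```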
EXAMPLES
• A batting average indicates the overall performance of a sportsman in a
tournament, such as in baseball: it gives the average number of hits by
the batter per time at bat.
• A GPA or grade point average indicates the overall performance of
a student at school across multiple tests and courses throughout
the year.
• Identify the distribution of college students using different
variables like year of study, gender, course, etc.
• Determine the demographics of a certain population in a city, state,
or country. Descriptive statistics can identify the distribution of
the population in terms of gender or occupation, the variance in
income levels, etc.
CATEGORIES OF TOOLS
• There are two categories of tools in descriptive statistics:
• Numerical Tools: These include the various methods of calculation:
• Mean
• Median
• Mode
• Standard deviation
• Variance
• Range
• Coefficient of variation
• Skewness and kurtosis coefficients
• Quartiles
• Percentiles
• Contingency tables
• Frequency tables
• Correlation
• RV coefficient
• Graphic Tools: These allow the representation of various data
points as graphs or tables:
• Box plots
• Scatter plots
• Whisker plots
• Bar chart
• Pie chart
• Histogram
• Ternary diagram
• Correlation map
• Probability plot
• Strip plot
INFERENTIAL STATISTICS
• It is about using data from sample and then making inferences about the larger
population from which the sample is drawn.
• The goal of the inferential statistics is to draw conclusions from a sample and
generalize them to the population.
• It uses probability theory to judge how likely it is that the characteristics of the sample also hold for the population. The most common methodologies used are hypothesis tests.

• For example: suppose we are interested in the exam marks of all the
students in India. It is not feasible to measure the exam marks
of all the students in India, so we measure the marks of a
smaller sample of students, for example 1000 students. This sample
now represents the large population of Indian students, and we
use it in our statistical study to draw conclusions about the
population from which it is drawn.
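The sampling idea above can be sketched as a small simulation. Everything here is an assumption for illustration: the "population" is synthetic (normally distributed marks with mean 60 and standard deviation 15, clipped to 0 to 100), and the interval uses the usual large-sample approximation of 1.96 standard errors.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is reproducible

# Synthetic population of exam marks (assumed distribution, not real data).
population = [min(100.0, max(0.0, random.gauss(60, 15))) for _ in range(100_000)]

# Draw a simple random sample of 1000 students.
sample = random.sample(population, 1000)

# Use the sample to estimate the population mean.
sample_mean = mean(sample)
standard_error = stdev(sample) / len(sample) ** 0.5

# Approximate 95% confidence interval for the population mean.
ci_low = sample_mean - 1.96 * standard_error
ci_high = sample_mean + 1.96 * standard_error

print(f"population mean: {mean(population):.2f}")
print(f"sample mean: {sample_mean:.2f} (95% CI {ci_low:.2f} to {ci_high:.2f})")
```

In real research the population mean is unknown; it is printed here only to show how close the sample estimate comes.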
TYPES OF ANALYSIS
Analysis strategy:
• Univariate (one variable)
• Bivariate (two variables)
• Multivariate (more than two variables)
TECHNIQUES USED IN
DATA ANALYSIS
• Univariate Analysis: "uni" means one, so the data has only one kind of variable. The main purpose of univariate analysis is to describe and summarise the data and then find patterns in it.

• For example, the heights of ten students in a class can be recorded, and
this is univariate data. There is only one variable, which is height. The
pattern found in this type of data is described using measures of central
tendency and measures of dispersion (spread), and is presented through
histograms, frequency distribution tables, bar charts, etc.
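A minimal sketch of the height example, assuming ten made-up heights in centimetres and a class width of 5 cm (both are illustrative choices, not values from the slides). It builds a frequency distribution table and prints it as a simple text histogram.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical heights (cm) of ten students: one variable only.
heights_cm = [150, 152, 155, 155, 158, 160, 160, 160, 165, 170]


def class_label(height, width=5):
    """Return the class interval (e.g. '155-159') that a height falls into."""
    lower = (height // width) * width
    return f"{lower}-{lower + width - 1}"


# Frequency distribution table by class interval.
frequency_table = Counter(class_label(h) for h in heights_cm)

# Text histogram: one '#' per student in each class.
for label in sorted(frequency_table):
    print(label, "#" * frequency_table[label])

print("mean:", mean(heights_cm), "median:", median(heights_cm))
print("range:", max(heights_cm) - min(heights_cm))
```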
UNIVARIATE ANALYSIS
Univariate analysis explores each variable in a data set
separately. It describes the pattern of response to the
variable. It describes each variable on its own.
• Categorical variable (e.g., Male, Female): Frequency
• Metric variable (e.g., customer satisfaction): Mean, S.D.
UNIVARIATE (one variable)
• Metric variable
  - Central tendency: Mean, Median, Mode
  - Variation: Variance, Standard deviation, Range, Percentiles
  - Distribution: Normal, Uniform, Exponential
  - Plots: Histogram, Box plot, Stem plot
• Categorical variable
  - Frequencies: Percentages
  - Plots: Bar graph, Pie graph
BIVARIATE ANALYSIS
Bivariate analysis is the simultaneous analysis of two variables (attributes). It
explores the concept of relationship between two variables, whether there
exists an association and the strength of this association, or
whether there are differences between two variables and
the significance of these differences.
[Diagram: BIVARIATE branches into 2 Categorical, 2 Metric, and 1 Category/1 Metric]
• There are three types of bivariate analysis:
  - 2 categorical (association between job cadre and income)
  - 2 metric (quality of food served and customer satisfaction)
  - 1 categorical and 1 metric (gender and job satisfaction)

• The two variables involved are X and Y, i.e. the independent and the dependent variable.
• For example: studying the relationship between two variables, online teaching and the marks scored by the students.

Can you suggest some data analysis tools?

Simple Correlation: To what extent are two variables related? Measures the strength of association between two metric variables.

Simple Linear Regression: A single independent variable is used to predict the value of a dependent variable.

One-Way ANOVA: Used to test the difference in a single
dependent variable among two or more groups.
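Simple correlation and simple linear regression can be sketched from their textbook formulas. The data are hypothetical (hours studied as the independent variable, test score as the dependent variable), and the helper names `pearson_r` and `fit_line` are made up for this example. One-way ANOVA is not covered by this sketch.

```python
from statistics import mean

# Hypothetical data: hours studied (X) and test score (Y) for five students.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]


def sums_of_squares(x, y):
    """Return Sxx, Syy and Sxy, the sums of squared and cross deviations."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxx, syy, sxy


def pearson_r(x, y):
    """Pearson correlation: strength of linear association between x and y."""
    sxx, syy, sxy = sums_of_squares(x, y)
    return sxy / (sxx * syy) ** 0.5


def fit_line(x, y):
    """Least-squares line y = intercept + slope * x."""
    sxx, _, sxy = sums_of_squares(x, y)
    slope = sxy / sxx
    intercept = mean(y) - slope * mean(x)
    return slope, intercept


r = pearson_r(hours, scores)
slope, intercept = fit_line(hours, scores)
predicted_for_6_hours = intercept + slope * 6

print(f"r = {r:.3f}")
print(f"score = {intercept:.1f} + {slope:.1f} * hours")
print(f"predicted score for 6 hours: {predicted_for_6_hours:.1f}")
```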
MULTIVARIATE ANALYSIS
Multivariate analysis is required when more than two variables have
to be analysed simultaneously.
All statistical techniques which simultaneously analyse more than two
variables on a sample of observations can be categorized as multivariate
techniques.
Broadly, multivariate techniques can be classified either as dependence
methods or interdependence methods.
In the analysis of dependence, we are attempting to explain or predict the
dependent variable(s) on the basis of several independent variables.
In contrast, an interdependence technique is one in which no single
variable or group of variables is defined as being independent or
dependent.
• For example: to understand the marks scored by students, it is not enough to relate them to online teaching alone; the methodology used, time allotted, practice sessions, etc. may also matter.

• Common tools used are multiple regression, MANOVA, factor analysis, cluster analysis, and path analysis.
SPSS - Statistical Package for the Social Sciences
• SPSS Statistics is a powerful statistical software platform.
• It offers a user-friendly interface and a robust set of features that lets your organization quickly extract actionable insights from your data.
• SPSS (Statistical Package for the Social Sciences) is a versatile and responsive program designed to undertake a range of statistical procedures.
• SPSS software is widely used in a range of disciplines.
• SPSS is software for editing and analyzing all sorts
of data.
• These data may come from basically any source:
scientific research, a customer database, Google
Analytics or even the server log files of a website.
• SPSS can open all file formats that are commonly used for structured data, such as
  - spreadsheets from MS Excel or OpenOffice;
  - plain text files (.txt or .csv);
  - relational (SQL) databases;
  - Stata and SAS files.
SPSS DATA VIEW
After opening data, SPSS displays them in a spreadsheet-like fashion.
(1) The data editor has tabs for switching between Data View and Variable View.
(2) Columns of cells are called variables. Each variable has a unique name (“gender”) which is shown in the column header.
(3) Rows of cells are called cases. Oftentimes, each respondent in a study is represented as a single case.
(4) In SPSS, values refer to cell contents.
(5) The status bar may give useful information on the data.
SPSS VARIABLE VIEW

(1) In the bottom-left corner we find tabs for switching between Variable View and Data View. For now, select Variable View.
(2) In Variable View, variables are shown as rows of cells.
(3) The first column shows the variable name for each variable.
(4) The fifth column may or may not contain a variable label. This describes the exact meaning of each variable.
(5) The sixth column shows value labels: descriptions of the meaning of one, many or all values that a variable may contain.
In short, Variable View does not show the data itself but, rather, information about the data.

This is sometimes called “metadata” or “the codebook”. In SPSS, however, it's called the dictionary.
UNIT III

SPSS - Input and data cleaning - Defining
variables - Manual input of data - Automated
input of data and file import - Data
manipulation - Data Transformation - Syntax
files and scripts - Output management -
Exercise
Analysis and Interpretation
• Analysis and interpretation are central steps in the research process. The aim of the analysis is to organize, classify and summarize the collected data so that they can be better comprehended and interpreted to give answers to the questions that triggered the research.

• Interpretation is the search for the broader meaning of findings. Analysis is not fulfilled without interpretation, and interpretation cannot proceed without analysis. So, both are interdependent.
Statistics
Statistics: The science of collecting, describing, and
interpreting data.
“The science of collecting, organizing, presenting,
analyzing, and interpreting data to assist in making
more effective decisions”
For instance: a survey was conducted to find the
favorite fruit of 100 people, with the results shown in a circle graph.
Types of Statistics
Basic Concepts
Population
Collection of all individuals or objects or items
under study and denoted by N
Sample
A part of a population and denoted by n
Variable
Characteristic of an individual or object.
Parameter
Characteristic of the population
Statistic
Characteristic of the sample
Example
A college dean is interested in learning about the
average age of faculty. Identify the basic terms in this
situation.
The population is the age of all faculty members at the
college.
A sample is any subset of that population. For example,
we might select 10 faculty members and determine their
age.
The variable is the “age” of each faculty member.
One data value would be the age of a specific faculty member.
The data would be the set of values in the sample.
The parameter of interest is the “average” age of all
faculty at the college.
The statistic is the “average” age for all faculty in the
sample.
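The terms in this example map directly onto a few lines of Python. The ages below are invented for illustration (20 faculty members), and a fixed random seed is used so that the same 10 members are selected each time the sketch is run.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the same sample is drawn each run

# Population: hypothetical ages of ALL faculty members at the college.
faculty_ages = [28, 31, 33, 35, 36, 38, 40, 41, 43, 44,
                46, 47, 49, 51, 52, 54, 57, 59, 62, 66]
N = len(faculty_ages)  # population size

# Sample: 10 faculty members selected at random.
sample_ages = random.sample(faculty_ages, 10)
n = len(sample_ages)  # sample size

# Parameter: characteristic of the population (average age of all faculty).
parameter = mean(faculty_ages)

# Statistic: characteristic of the sample (average age in the sample).
statistic = mean(sample_ages)

print(f"N = {N}, n = {n}")
print(f"parameter (population mean age) = {parameter}")
print(f"statistic (sample mean age) = {statistic}")
```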
Measurement of Scale
Measurement is the process of assigning numbers to
objects or observations, the level of measurement
being a function of the rules under which the
numbers are assigned.
It is easy to assign numbers in respect of properties of
some objects like weight, height, etc., but it is
relatively difficult in respect of others like motivation
to succeed, ability to withstand stress, etc.
Types of Scale
1. Nominal
2. Ordinal
3. Interval
4. Ratio
Nominal (Categorical)
• People or objects with the same scale value are the same on some attribute.
• The values of the scale have no 'numeric' meaning in the way that you usually think about numbers.
Ordinal
• People or objects with a higher scale value have more of some attribute.
• The intervals between adjacent scale values are indeterminate.
• Scale assignment is by the property of "greater than," "equal to," or "less than" (e.g., “satisfied”, “dissatisfied”).
Interval
• Intervals between adjacent scale values are equal with respect to the attribute being measured.
• Measures the difference between 2 values.
• E.g., the difference between 8 and 9 is the same as the difference between 76 and 77.
Ratio
• There is a rational zero point for the scale.
• Ratios are equivalent, e.g., the ratio of 2 to 1 is the same as the ratio of 8 to 4.
Nominal
• Classification data: e.g. Male / Female
• No ordering: e.g. it makes no sense to state that M > F
• Arbitrary labels: e.g., M/F, 0/1, etc.
• Tendulkar can be identified by number 10 in the Indian cricket team
Ordinal
• Ordered, but differences between values are not important
• E.g., Likert scales: rank your degree of satisfaction on a scale of 1 to 5
• E.g., restaurant ratings (10 – Highly satisfied, 5 – Moderate, 1 – Highly dissatisfied)
Interval
• Ordered, constant scale, but no natural zero
• Differences make sense, but ratios do not
• E.g., temperature (C, F), dates (the difference between temperatures of 100 and 90 degrees is the same as between 90 and 80 degrees)
Ratio
• Ordered, constant scale, natural zero
• E.g., height, weight, age, length
Choosing the Statistical Test
1. Data
2. Samples
3. Purpose
Appropriate Statistics
Nominal
• [Cross tabs] Chi square, Phi, Cramer's V, Contingency coefficient
• [Nonparametric] Chi-square, Runs, Binomial, McNemar, Cochran
Ordinal
• [Frequencies] Median, Interquartile range
• [Nonparametric] Kolmogorov-Smirnov, Sign, Wilcoxon, Kendall coefficient of concordance, Friedman two-way ANOVA, Mann-Whitney U, Wald-Wolfowitz, Kruskal-Wallis
Interval
• Mean, Standard deviation, Pearson's product-moment correlation, t test, Analysis of variance, Multivariate analysis of variance (MANOVA), Factor analysis, Regression, Multiple correlation (R)
Ratio
• Coefficient of variation (CV = SD / M)
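The coefficient of variation listed for ratio data, CV = SD / M, can be sketched as follows. The function name and the two data sets are assumptions made for illustration; the sample standard deviation is used, and the statistic is only meaningful for ratio-scaled data with a non-zero mean.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return CV = SD / M for ratio-scaled data (sample standard deviation)."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return stdev(values) / m


# Hypothetical ratio-scaled variables measured in different units.
monthly_income = [20000, 25000, 30000, 35000, 40000]  # rupees
body_weight = [60, 62, 64, 66, 68]                     # kilograms

cv_income = coefficient_of_variation(monthly_income)
cv_weight = coefficient_of_variation(body_weight)

# CV is unit-free, so relative spread can be compared across variables.
print(f"CV of income: {cv_income:.3f}")
print(f"CV of weight: {cv_weight:.3f}")
```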
Simple Random Sampling
STRATIFIED RANDOM SAMPLING
Cluster Sampling
In stratified random sampling, all the strata of the population are sampled,
while in cluster sampling, the researcher randomly selects only a number of
clusters from the collection of clusters of the entire population.
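The contrast between the two designs can be sketched in Python. The population here is hypothetical (four departments of 25 students each), the function names are made up for this example, and the cluster version shown takes every unit in each selected cluster, which is one common variant.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Hypothetical population: four departments with 25 students each.
departments = ["Arts", "Commerce", "Science", "Engineering"]
population = {
    dept: [f"{dept}-{i:02d}" for i in range(1, 26)] for dept in departments
}


def stratified_sample(groups, per_stratum):
    """Draw a random sample from EVERY stratum."""
    chosen = []
    for members in groups.values():
        chosen.extend(random.sample(members, per_stratum))
    return chosen


def cluster_sample(groups, n_clusters):
    """Randomly pick a few whole clusters and take every unit in them."""
    picked = random.sample(sorted(groups), n_clusters)
    return [unit for name in picked for unit in groups[name]]


stratified = stratified_sample(population, per_stratum=5)
clustered = cluster_sample(population, n_clusters=2)

print(len(stratified), sorted({unit.split("-")[0] for unit in stratified}))
print(len(clustered), sorted({unit.split("-")[0] for unit in clustered}))
```

The stratified sample contains students from all four departments, while the cluster sample contains every student from only the two selected departments.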
Systematic Sampling
MULTISTAGE SAMPLING
JUDGMENT SAMPLING
CONVENIENCE SAMPLING
QUOTA SAMPLING
What is SPSS?
• Statistical Package for Social Sciences
• General Purpose Statistical Software
• Statistical analysis for the input data
• Basic concepts of research in SPSS are
- Variable, Scale, Hypothesis, Significance level & Data
• The outcome of the data analysis indicates the significance of the
data collected by the researcher
VARIABLES
Variables:
“A variable is a characteristic of an event, object, idea or type of
category that we are trying to measure”.
(E.g.) [Satisfaction level of (Working Environment)]
Here the text in [ ] is the item and the text in ( ) is the variable.
* An item may consist of dimensions (i.e.) satisfaction level of working
environment can be measured by two dimensions, such as internal working
environment and external working environment, by asking the
respondents to rate the different categories in each
dimension.
* Items & variables have attributes (categories, measuring items)
(E.g.) Highly Satisfied, Satisfied, Moderate, Dissatisfied, Highly
Dissatisfied.
Types of Variable:
1. Dependent variable (Criterion variable) – depends on other
factors.
(E.g.) Your CIAT mark is a dependent variable, because it depends
on many factors such as how much you studied, how much you
slept, how much you ate before the test, how many relevant
concepts you studied, etc.
2. Independent variable (Predictor variable) – it stands alone and
does not depend on the other variables in the study.
(E.g.) Age does not depend on other factors (i.e.) your age is not
determined by your colour, physical appearance, etc.
3. Discrete variable – takes only exact, countable values.
(E.g.) No. of children or students in a classroom, school or
college (i.e.) exactly 44 students, not 44.5 students.
4. Continuous variable – can take any value within a range, to any
degree of precision; it is typical of measurements made in experiments,
testing, etc.
(E.g.) The time to complete a 100 metre race could be recorded as 10.42 seconds,
but more precisely it could be 10.4241254723… seconds.
5. Qualitative variables (categorical variables) – variables that
express a qualitative attribute and have no numerical
ordering.
(E.g.) expressions, emotions, feelings of love.
6. Quantitative variables – variables measured in terms of numbers
/ quantity / amount.
(E.g.) your height, weight, shoe size, etc.
NOTATIONS OF POPULATION AND SAMPLE
Characteristics | Population | Sample
Size | N | n
Mean | µ | $\bar{x}$
SD | σ | $s = \sqrt{\sum (x - \bar{x})^2 / n}$, $S = \sqrt{\sum (x - \bar{x})^2 / (n - 1)}$
Proportion | P | $p = x / n$
Correlation Coefficient | ρ | $r = \mathrm{COV}(x, y) / (s_x s_y)$
HYPOTHESIS
Testing of Hypothesis:
Parametric tests
• Two Sample t – test
• Independent Two Sample t – test
• Paired Sample t – test
• One & Two Way ANOVA
Non Parametric tests
• Chi – Square test
• Mann – Whitney ‘U’ test
• Kruskal Wallis test (‘H’ test)
• Sign test
• Kolmogorov-Smirnov Test
Types of Error
Type of decision | H0 true | H0 false
Reject H0 | Type I error | Correct decision
Accept H0 | Correct decision | Type II error
Type I Error – H0 is true, but it is rejected.
Type II Error – H0 is false, but it is accepted.
SIGNIFICANCE (P VALUE)
Given the observed data set, the P value is the smallest level of
significance at which the null hypothesis is rejected (and the alternative is
accepted)
• If the P value ≤ α then reject H0; otherwise accept H0
• If the P value ≤ 0.01 then reject H0 at 1% level of significance
• If the P value lies between 0.01 and 0.05 (i.e. 0.01 < P value ≤ 0.05)
then reject H0 at 5% level of significance
• If the P value > 0.05 then accept H0 at 5% level of significance


Concept of P value
P value | Notation | Conclusion | Level of Significance
0.000 to 0.010 | ** | Reject Null Hypothesis at 1 % level | Highly Significant
0.011 to 0.050 | * | Reject Null Hypothesis at 5 % level | Significant
0.051 to 1.000 | No star | Accept Null Hypothesis at 5 % level | Not Significant
A P value displayed as 0.000 is reported as < 0.001**
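The decision rule in the table above can be sketched as a small Python function. The function name `classify_p_value` and the three sample P values are illustrative assumptions, not part of the slides; the cut-offs are the ones listed in the table.

```python
def classify_p_value(p: float) -> tuple[str, str]:
    """Return (star notation, conclusion) using the cut-offs from the table."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a P value must lie between 0 and 1")
    if p <= 0.01:
        return "**", "Reject H0 at the 1% level (highly significant)"
    if p <= 0.05:
        return "*", "Reject H0 at the 5% level (significant)"
    return "", "Accept H0 at the 5% level (not significant)"


# Hypothetical P values, chosen only to exercise the three rows of the table.
for p in (0.003, 0.032, 0.210):
    stars, conclusion = classify_p_value(p)
    print(f"P = {p:.3f}{stars}: {conclusion}")
```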
Determination of Sample Size
$n = N / [1 + N\alpha^2]$
where n – required sample size, N – total population size,
α – significance level (i.e.) 0.01, 0.05 or 0.10 (99%, 95% or 90% confidence).
(E.g.) If the population size is N = 500 and α = 0.05, then the sample size is
$n = 500 / [1 + 500(0.05)^2] = 500 / 2.25 \approx 222$
The significance level is selected based on the accuracy
required by the researcher (i.e.) 99% means the researcher needs a more
accurate result.
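The formula can be checked with a short Python sketch. The function name `sample_size` is an assumption made for illustration; the result is rounded to the nearest whole number so that it reproduces the 222 of the worked example (some texts round up instead).

```python
def sample_size(population: int, alpha: float = 0.05) -> int:
    """Required sample size n = N / (1 + N * alpha^2), rounded to the nearest integer."""
    if population <= 0:
        raise ValueError("population size must be positive")
    n = population / (1 + population * alpha ** 2)
    return round(n)


# Worked example from the slide: N = 500, alpha = 0.05.
print(sample_size(500, 0.05))
```

A smaller α gives a larger required sample, which matches the remark that a researcher who wants a more accurate result chooses the 99% level.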
SPSS User Manual
Two Types of View:
Variable View:
Data View:
1. Variable view – define the variables.
• Name – a short name for the variable, with no spaces. The name must
begin with a letter; letters, digits and a few symbols such as the
underscore are allowed after that.
• Type – type of variable (i.e.) string, numeric, date, scientific
notation, etc. (E.g.) words – string; numbers, marks – numeric.
• Width – the number of characters or digits that can be entered for a
value of the variable in Data View.
• Decimals – the number of digits displayed after the decimal point
for values in Data View.
• Label – the complete, descriptive variable name, displayed in the
output so that the variables are not confused.
• Values – value labels for coded categories (i.e.)
assigning numbers to categorical data. (E.g.) 1 – Male, 2 – Female; or 1
– Yes, 2 – No, 3 – Partly.
• Missing – the codes that identify missing values that occurred during the data
collection process.
• Columns – the display width of the variable's column in Data View.
• Align – the alignment style (left, right or center).
• Measure – the level of measurement of the variable (i.e.) Nominal,
Ordinal or Scale (Scale covers both interval and ratio data).
• Role – the role the variable will play in the analysis (i.e.) Input
(predictor), Target (outcome), Both, None, Partition or Split.
2. Data View – enter the collected data in the respective
variable columns (the variable names have already been defined in Variable View).
3. SPSS files are saved with the following extensions
• For data file – filename.sav (E.g.) mba.sav, projectreport.sav
• For output file – filename.spv (filename.spo in older versions).
4. Frequently used menu commands:
• To insert variables or cases, click Edit, then Insert Cases or Insert Variable.
• To replace any values, click Edit, then Replace.
• To go to a particular variable or case, click Edit, then Go to Case or Go to
Variable.
• To sort data or variables, click Data, then Sort Cases or Sort Variables.
• To copy the data of the current file into a new dataset, click Data,
then Copy Dataset.
• To select particular cases in Data View, click Data, then Select Cases.
• To split a file into groups for analysis, click Data, then Split File.
• To draw a graph for the given data, click Graphs, then Legacy
Dialogs, then select a chart such as Bar, Pie, Area, Line, Histogram, etc.
• To display value labels in place of the coded values in Data View, click View, then
select Value Labels.
How to enter the data in SPSS after collecting data
from the respondents
Reliability
Coefficient alpha, otherwise known as Cronbach’s
alpha, is one way to estimate the reliability of an
instrument.
When we use coefficient alpha, we are estimating
the internal consistency reliability of the instrument.
Acceptance level
For developed instruments, Cronbach’s alpha
should be greater than 0.80.
For developing instruments, Cronbach’s alpha
should be greater than 0.70.
Procedure
Analyze → Scale → Reliability Analysis
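For readers who want to see what the Reliability Analysis procedure computes, here is a minimal Python sketch of Cronbach's alpha using the usual formula alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name and the ratings (5 respondents, 4 Likert items) are made up for illustration and do not come from the slides.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    if k < 2:
        raise ValueError("at least two items are needed")
    # Sample variance of each item (column) and of each respondent's total score.
    item_variances = [variance(item) for item in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: 5 respondents x 4 items on a 1-5 scale.
ratings = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 2, 3, 2],
    [4, 4, 4, 5],
]
alpha = cronbach_alpha(ratings)
print(f"Cronbach's alpha = {alpha:.3f}")
print("Meets the 0.70 threshold" if alpha > 0.70 else "Below the 0.70 threshold")
```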
Frequencies
Frequency refers to the number of times an
observation occurs or appears in a data set. Ex: 23, 26,
11, 18, 09, 21, 23, 30, 22, 11, 21, 20, 11, 13, 23, 11, 29, 25, 26, 26.

Procedure
Analyze → Descriptive Statistics → Frequencies
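As a rough illustration of what the Frequencies output contains, the example data above can be tallied in Python. This is a sketch with the standard library, not SPSS output; the variable names are assumptions.

```python
from collections import Counter

# Example data from the slide (09 is written as 9).
values = [23, 26, 11, 18, 9, 21, 23, 30, 22, 11,
          21, 20, 11, 13, 23, 11, 29, 25, 26, 26]

counts = Counter(values)
total = len(values)

# Frequency table: value, frequency and percentage, sorted by value.
print("Value  Frequency  Percent")
for value in sorted(counts):
    freq = counts[value]
    print(f"{value:>5}  {freq:>9}  {100 * freq / total:>6.1f}")
```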

Interpretation
Descriptive Statistics
• Measures of central tendency and measures of
dispersion are collectively known as descriptive
statistics.
• They describe what a set of numbers looks like.
For heights, the mean tells us the average height, and
the standard deviation tells us how spread out the
heights are around that average.
Procedure
Analyze → Descriptive Statistics → Frequencies,
then click Statistics and select Mean, Median, Mode, Maximum,
Range, etc.
Descriptive Statistics
Purpose:
• To describe each variable – What is the current level of the
variable of interest?
Frequency:
• Mean, Minimum, Maximum, Standard Deviation, Skewness &
Kurtosis.
• Analyze → Descriptive Statistics → Frequencies, then click
Statistics and select Mean, Median, Mode, Maximum, Range, etc.
Frequencies for two or more nominal variables
• Analyze → Descriptive Statistics → Crosstabs
Means of variables by subgroups defined by one or more nominal
variables
• Analyze → Compare Means → Means (use of layers)
Interpretation:
Measure of Central Tendency: (Quantitative Data)
• Mean – average of all the observations (i.e.) sum of observations /
no. of observations.
• Standard Error of Mean – an estimate of how far the sample mean is
likely to deviate from the actual mean of the population.
• Median – value of the middle item when the data are arranged in ascending
or descending order (i.e.) the value which divides the data into
two equal parts.
• Mode – value which occurs most frequently in a set of
observations.
Measure of Dispersion:
• “Degree to which numerical data tend to spread about an average
value.”
• Range – difference between the greatest and least values.
Interpretation:
• Standard Deviation – positive square root of the arithmetic mean of
the squares of the deviations of the observations from their arithmetic mean.
• Skewness – lack of symmetry, i.e. the distribution curve deviates
from a symmetrical shape.
• When the mean, median and mode fall at different points, the
distribution is skewed (i.e.) the mean, median & mode values will
not coincide.
• The mean and mode will differ widely, and the median will lie
between the mean and the mode.
• Left skew (−), right skew (+); the normal curve (bell shaped) has no
skewness.
Interpretation:
• Kurtosis – the degree of peakedness of a distribution, measured relative to
the peakedness of a normal curve.
• (i.e.) it describes another aspect of the shape of the distribution:
- Leptokurtic: more peaked than the normal curve
- Mesokurtic: normal curve (bell shaped)
- Platykurtic: flatter than the normal curve
• Kurtosis is measured by the coefficient β (beta) or by its deviation from 3
(the excess kurtosis, which is what SPSS reports, so a normal curve shows 0 there).
If β value is 3, then the curve is normal (Mesokurtic).
If β value is < 3, then the curve is flat (Platykurtic).
If β value is > 3, then the curve is peaked (Leptokurtic).
Exercise : Descriptive Statistics
Problem:
The following figures represent the ages of a sample of 60
employees of a firm Orange plc.

35 44 54 33 46 20 32 25 50 39 33 37
42 40 20 25 34 52 27 22 18 40 23 17
41 45 21 34 49 27 60 46 32 58 23 52
24 64 41 47 54 37 40 41 40 36 46 29
34 39 39 40 37 50 41 34 47 34 45 36
Find out Mean, Median, Mode, Standard deviation, Standard
error, Range, Maximum & Minimum.
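One way to check the SPSS answer for this exercise is to compute the same statistics with Python's `statistics` module. This is only a cross-check sketch: the dictionary keys are illustrative names, and the standard deviation uses the n − 1 (sample) formula.

```python
from math import sqrt
from statistics import mean, median, multimode, stdev

# Ages of the 60 employees, row by row as given in the problem.
ages = [
    35, 44, 54, 33, 46, 20, 32, 25, 50, 39, 33, 37,
    42, 40, 20, 25, 34, 52, 27, 22, 18, 40, 23, 17,
    41, 45, 21, 34, 49, 27, 60, 46, 32, 58, 23, 52,
    24, 64, 41, 47, 54, 37, 40, 41, 40, 36, 46, 29,
    34, 39, 39, 40, 37, 50, 41, 34, 47, 34, 45, 36,
]

n = len(ages)
summary = {
    "n": n,
    "mean": mean(ages),
    "median": median(ages),
    "mode(s)": multimode(ages),           # every value tied for most frequent
    "std_deviation": stdev(ages),         # sample SD (divides by n - 1)
    "std_error": stdev(ages) / sqrt(n),   # SD / sqrt(n)
    "minimum": min(ages),
    "maximum": max(ages),
    "range": max(ages) - min(ages),
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

If the data have more than one mode, SPSS shows only the smallest and adds a footnote, whereas `multimode` lists them all.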
CROSS TABS
A crosstab is a table showing the relationship between two or more variables. Where the table only shows
the relationship between two categorical variables, a crosstab is also known as a contingency table.

It is used to test the association between the two variables.

The variables selected should be nominal or ordinal.
Objective
To examine the association between smoking and lung cancer.
Hypothesis
H0 – There is no association between smoking and lung cancer
H1 – There is an association between smoking and lung cancer
Smoke – 1. Smoker 2. Non smoker
Cancer – 1. Lung cancer 2. No lung cancer
Procedure
Analyze → Descriptive Statistics → Crosstabs (tick Chi-square under Statistics)
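The chi-square test behind a 2 × 2 crosstab can be sketched in plain Python. The counts below are invented purely for illustration (the slides give no data for this example), and the P value shortcut via `erfc` is valid only for 1 degree of freedom, i.e. a 2 × 2 table, with no continuity correction.

```python
from math import erfc, sqrt

# Hypothetical counts: rows = smoker / non smoker,
# columns = lung cancer / no lung cancer.
observed = [
    [30, 70],
    [10, 90],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi_square += (count - expected) ** 2 / expected

df = (len(observed) - 1) * (len(observed[0]) - 1)
p_value = erfc(sqrt(chi_square / 2))  # upper tail of chi-square with 1 df only

print(f"chi-square = {chi_square:.3f}, df = {df}, P value = {p_value:.4f}")
print("Reject H0" if p_value <= 0.05 else "Accept H0")
```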
Exercise 2
Is there any association between family income and
total savings.
CHART- CREATE AND ADD CHART
• Charts allow you to illustrate your workbook data graphically, which
makes it easy to visualize comparisons and trends.
• Excel has several different types of charts, allowing you to choose the
one that best fits your data.
TO INSERT A CHART
• Select the cells you want to chart, including the column
titles and row labels. These cells will be the source data for the
chart. In our example, we'll select cells A1:F6.
• From the Insert tab, click the desired Chart command. In our
example, we'll select Column.

• Choose the desired chart type from the drop-down menu.


• The selected chart will be inserted in the worksheet.
CHART LAYOUT AND STYLE
• After inserting a chart, there are several things you may want to change
about the way your data is displayed. It's easy to edit a
chart's layout and style from the Design tab.
• Excel allows you to add chart elements—such as chart titles, legends,
and data labels—to make your chart easier to read. To add a chart element,
click the Add Chart Element command on the Design tab, then choose
the desired element from the drop-down menu.
• To edit a chart element, like a chart title, simply
double-click the placeholder and begin typing.
• If you don't want to add chart elements individually,
you can use one of Excel's predefined layouts. Simply
click the Quick Layout command, then choose
the desired layout from the drop-down menu.
OTHER CHART OPTIONS
• There are many other ways to customize and organize your charts. For
example, Excel allows you to rearrange a chart's data, change the chart
type, and even move the chart to a different location in the workbook.
• To switch row and column data:
• Sometimes you may want to change the way charts group your data. For
example, in the chart below, the Book Sales data are grouped by year, with
columns for each genre. However, we could switch the rows and columns
so the chart will group the data by genre, with columns for each year. In
both cases, the chart contains the same data—it's just organized differently.
• Select the chart you want to modify.
• From the Design tab, select the Switch
Row/Column command.

The rows and columns will be switched. In our example, the data is now
grouped by genre, with columns for each year.
Keeping charts up to date

• By default, when you add more data to your spreadsheet, the
chart may not include the new data.
• To fix this, you can adjust the data range. Simply click the chart,
and it will highlight the data range in your spreadsheet. You can
then click and drag the handle in the lower-right corner to
change the data range.
• If you frequently add more data to your spreadsheet, it may
become tedious to update the data range. Luckily, there is an
easier way. Simply format your source data as a table, then
create a chart based on that table. When you add more data
below the table, it will automatically be included in both the
table and the chart, keeping everything consistent and up to
date.
UNIT V
SPSS - Statistical tests - Means - T-test - One-way
ANOVA - Non parametric tests - Normality tests -
Correlation and regression - Linear correlation and
regression - Multiple regression (linear) -
Multivariate analysis - Factor analysis - Cluster
analysis - Exercise
T – test (One sample t test)
Purpose:
To test whether the mean of a single sample differs from a known or
hypothesised population mean, using a procedure called the one sample t test.
Assumptions
• The variable is measured at the interval or ratio level (ordinal with
caution)
• The sample is randomly selected
• The variable is normally distributed, which means either
- an acceptable degree of skewness and kurtosis, or
- a large enough sample for the Central Limit Theorem to apply (n > 30)
• If neither condition is satisfied, do not use the t test
Applications of One sample t test
Aim of the study
The main aim of our study is to assess the stress level of
M.Com students who are pursuing their degree during 2018-19 at
Pondicherry University. Here the normal stress level is taken as
30 points (population mean).
Step-1: Research Question
Is the stress level of M.Com students pursuing their degree
at Pondicherry University during 2018-19 on par with the
predetermined level of 30 points, or is it higher or
lower than the normal level?
Step-2: Objectives
To study the stress level of M.Com students who are pursuing their degree during
2018-19 at Pondicherry University.
Step-3: Hypothesis
H0: The average stress level of M.Com students pursuing their degree at Pondicherry
University is the same as the predetermined stress level (30 points).
H1: The average stress level of M.Com students pursuing their degree at Pondicherry
University is not the same as the predetermined stress level (30 points).
Step-4: Selection of the variable
Only one variable is needed: stress level (scores), identified as a scale
variable. How do you know it is a scale variable?
Answer: The data are open ended, not closed ended. Therefore it is a
continuous variable.
Step-5 : Name of the test
One sample t-test
Step-6: Set the Alpha level / Significance level
Alpha (α) level = 0.05 (Type I error), that is, the probability of rejecting
a true null hypothesis.
Alpha is often set to 0.05 for two-sided tests and to 0.025 for one-
sided tests.
Step-7: Rejection region: reject the null hypothesis if P value < 0.05
Procedure:
Analyze → Compare Means → One-Sample T Test
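The calculation behind the one sample t test can be sketched as follows. The ten stress scores are hypothetical (the slides do not list the raw data); only the test value of 30 points comes from the example. The sketch stops at the t statistic and degrees of freedom, which can be compared with a t table or with the SPSS output.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical stress scores for 10 students (illustration only).
scores = [32, 28, 35, 30, 27, 33, 31, 29, 36, 34]
test_value = 30  # predetermined stress level from the example

n = len(scores)
sample_mean = mean(scores)
std_error = stdev(scores) / sqrt(n)          # sample SD / sqrt(n)
t_stat = (sample_mean - test_value) / std_error
df = n - 1

print(f"mean = {sample_mean:.2f}, t = {t_stat:.3f}, df = {df}")
```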
Discussion
Independent sample t test
It is used to compare the means of two separate (independent)
groups.
Variables (the DV is Scale and the IV is Nominal with two categories)
Aim of the study
Having seen the overall stress level of M.Com students, we go a
little deeper and compare the stress
level of male and female M.Com students who
are pursuing their degree during 2018-19 at Pondicherry
University.
Objective
To study the stress level, by gender, of students pursuing M.Com in
the Department of Commerce during 2018-19 at
Pondicherry University.
Hypothesis
H0: There is no difference in mean stress level between male and female students
H1: There is a difference in mean stress level between male and female students
Procedure
Analyze → Compare Means → Independent-Samples T Test
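A minimal sketch of the independent sample t test with pooled variance (the "equal variances assumed" row of the SPSS output) is given below. The function name `pooled_t` and both lists of scores are assumptions made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) assuming equal variances."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_error, df


# Hypothetical stress scores (illustration only).
male = [32, 35, 30, 38, 34, 31]
female = [28, 30, 27, 33, 29, 31]

t_stat, df = pooled_t(male, female)
print(f"t = {t_stat:.3f}, df = {df}")
```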

Results discussion
Exercise
• Is there any difference between male and female
students in GPA score?
Paired ‘t’ Test for difference of Two Means
Both variables should be scale and measured on the same (dependent) subjects.
Problem:
A Company arranged an intensive training course for its
team of salesmen. A random sample of 10 salesmen was
selected and the value (in ‘000) of their sales made in the
weeks immediately before and after the course are shown in
the following table:
Salesmen 1 2 3 4 5 6 7 8 9 10
Sales Before 12 23 5 18 10 21 19 15 8 14
Sales After 18 22 15 21 13 22 17 19 12 16

Test whether there is evidence of an increase in mean sales.

1. Null Hypothesis: There is no significant difference in
sales before and after the training course.

2. Alternate Hypothesis: There is a significant difference
in sales before and after the training course.

Procedure

Analyze → Compare Means → Paired-Samples T Test
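The paired t statistic for the salesmen data can be reproduced by hand in Python as a check on the SPSS output. The data are those of the table above; the variable names are illustrative. Since the problem asks about an increase, the t value would be read against a one-sided critical value, although the hypotheses as written are two-sided.

```python
from math import sqrt
from statistics import mean, stdev

sales_before = [12, 23, 5, 18, 10, 21, 19, 15, 8, 14]
sales_after = [18, 22, 15, 21, 13, 22, 17, 19, 12, 16]

# Work with the per-salesman differences (after minus before).
differences = [after - before for before, after in zip(sales_before, sales_after)]

n = len(differences)
mean_diff = mean(differences)
std_error = stdev(differences) / sqrt(n)
t_stat = mean_diff / std_error
df = n - 1

print(f"mean difference = {mean_diff}, t = {t_stat:.3f}, df = {df}")
```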

Discussion
Exercise
Is there any difference between the pre-test score
and the post-test score, i.e. before and after the
training program?
ANOVA (One Way)
Why ANOVA?
It is an extension of the two sample t-test.
What if we wish to compare the means of more than two
populations?
What does it do? It compares three or more population means to
see whether there is any difference between them.
Variable selection
Dependent Variable: Ratio or Interval (Metric data)
Independent Variable (Factor): Categorical (Non metric data)
Terminology used in ANOVA
Factors
These are the independent variables.
Levels of Factors
1. Single factor, only one level: use One Sample “t” test.
2. Single factor, two levels: use Two Sample “t” test.
3. Single factor, more than two levels: use One Way ANOVA.
4. Two factors, more than two levels: use Two Way ANOVA.

Exercise

Exercise
The institution ABC decides to examine the stress level
of its students across departments because the institution
faces lower academic results. The responses are
collected on a five point Likert scale from a
total of 30 students from the commerce, management and
economics departments. Responses range
from 1 (strongly agree) to 5 (strongly
disagree).
Research question:
Do the students of different disciplines (commerce, management and
economics) differ in their stress level?

Objectives:
To study the difference among student groups (commerce,
management and economics) in their stress level.
Hypothesis
H0 - There is no significant difference among student groups with
respect to stress level.
H1 - There is a significant difference among student groups with respect
to stress level.

Procedure

Analyze → Compare Means → One-Way ANOVA


One way ANOVA (Exercise)
Objective
To test whether there is any difference between genders in stress level.

Hypothesis

H0 - There is no significant difference between genders with respect to
stress level of students.
H1 - There is a significant difference between genders with respect to
stress level of students.
Exercise

A random sample of the students in each row was taken. The score for those
students on the second exam was recorded
• Front: 82, 83, 97, 93, 55, 67, 53
• Middle: 83, 78, 68, 61, 77, 54, 69, 51, 63
• Back: 38, 59, 55, 66, 45, 52, 52, 61
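The F statistic for this exercise can be worked out in a few lines of Python, as a cross-check on the One-Way ANOVA table that SPSS produces. The scores are those listed above; the variable names are illustrative.

```python
from statistics import mean

groups = {
    "Front": [82, 83, 97, 93, 55, 67, 53],
    "Middle": [83, 78, 68, 61, 77, 54, 69, 51, 63],
    "Back": [38, 59, 55, 66, 45, 52, 52, 61],
}

all_scores = [x for scores in groups.values() for x in scores]
grand_mean = mean(all_scores)

# Between-groups and within-groups sums of squares.
ss_between = sum(len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values())
ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F({df_between}, {df_within}) = {f_stat:.3f}")
```

The F value is then compared with the critical value for (2, 21) degrees of freedom, or the Sig. column of the SPSS output is read directly.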
Two way ANOVA
Two way ANOVA is an extension of one way ANOVA.
In two way ANOVA, we have two independent variables
and one dependent variable.
Objective
To test whether students' stress level differs by gender, by student
group, and by the interaction of these demographic characteristics.

Hypothesis

H0 - There is no significant difference in stress level by gender and
student group.
H1 - There is a significant difference in stress level by gender and
student group.
Procedure
Analyze → General Linear Model → Univariate
DV: Stress level (Scale)
IV: Gender, Student group (Nominal/Ordinal)

Discussion
NON PARAMETRIC TESTS
NORMALITY TESTS
• Normality refers to a specific statistical distribution called the normal
distribution, also known as the Gaussian distribution or bell-shaped curve.
The normal distribution is a symmetrical continuous distribution defined
by the mean and standard deviation of the data.

• Two well-known tests of normality, the Kolmogorov–Smirnov test and the Shapiro–Wilk
test, are the most widely used methods for testing the normality of data. Normality tests can be
conducted in the statistical software “SPSS” (Analyze → Descriptive Statistics → Explore → Plots →
Normality plots with tests).
CORRELATION
• The degree of relationship between the quantitative variables under
consideration is measured through correlation analysis.
• The measure of correlation is called the correlation coefficient.
• The degree of relationship is expressed by a coefficient which ranges from
−1 to +1 (−1 ≤ r ≤ +1). The direction of change is indicated by the sign.

Types of correlation

Methods of studying correlation

a) Scatter diagram
b) Karl Pearson’s coefficient of correlation
c) Spearman’s rank correlation coefficient
d) Method of least squares
TYPES OF CORRELATION

• Positive Correlation: The correlation is said to be positive
if the values of the two variables change in the same
direction.
Ex. Publicity expenses & sales, Height & weight.

• Negative Correlation: The correlation is said to be negative
when the values of the variables change in opposite
directions.
Ex. Price & quantity demanded.
Direction of the Correlation

• Positive relationship – Variables change in the same direction.


• As X is increasing, Y is increasing
• As X is decreasing, Y is decreasing
• E.g., As height increases, so does weight.

• Negative relationship – Variables change in opposite directions.


• As X is increasing, Y is decreasing
• As X is decreasing, Y is increasing
• E.g., As TV time increases, grades decrease
Methods of Studying Correlation

• Scatter Diagram Method – A scatter diagram is a graph of observed plotted points,
where each point represents the values of X & Y as a coordinate. It portrays the
relationship between these two variables graphically.
• Karl Pearson’s Coefficient of Correlation – denoted by ‘r’, it
measures the degree of linear relationship between two variables,
say x & y.
- The coefficient lies in the range −1 ≤ r ≤ +1
- The degree of correlation is expressed by the value of the coefficient
- The direction of change is indicated by the sign (−ve) or (+ve)


Interpretation of Coefficient of Correlation (r):
— The value of correlation coefficient ‘r’ ranges from
-1 to +1
— If r = +1, then the correlation between the two variables is said to be
perfect and positive.
— If r = -1, then the correlation between the two variables is said to be
perfect and negative.
— If r = 0, then there exists no correlation between the variables.
Coefficient of Determination

• A convenient way of interpreting the value of the correlation coefficient
is to use the square of the coefficient of correlation, which is called the
Coefficient of Determination.

• The Coefficient of Determination = r².

• Suppose r = 0.9; then r² = 0.81, which would mean that 81% of the variation
in the dependent variable is explained by the independent
variable.
Coefficient of Determination

• The maximum value of r² is 1, because it is possible to explain all of the
variation in y but it is not possible to explain more than all of it.

• Coefficient of Determination = Explained variation / Total variation


Test for Significance of Correlation Coefficient

Problem:
Find the correlation coefficient between income and expenditure
of the families from the following data. Also test whether the correlation
coefficient is significant.

Income (in hundreds): 60 58 45 65 56 38 70
Expenditure (in hundreds): 55 50 40 60 62 45 63
Solution:
First find the coefficient of correlation by using the formula
$r = \sum (x - \bar{x})(y - \bar{y}) / \sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}$

1. Null Hypothesis: There is no relationship between income and expenditure of
the family

2. Alternate Hypothesis: There is a relationship between income and expenditure
of the family
3. Test Statistic: the t test for the coefficient of correlation is
$t = r\sqrt{n - 2} / \sqrt{1 - r^2}$, with n − 2 degrees of freedom
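The correlation problem can be worked through in Python as a check. The sketch computes Pearson's r from deviations about the means and then the usual t statistic t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom; the variable names are illustrative.

```python
from math import sqrt

income = [60, 58, 45, 65, 56, 38, 70]
expenditure = [55, 50, 40, 60, 62, 45, 63]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(expenditure) / n

# Sums of squares and cross-products of deviations from the means.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, expenditure))
sxx = sum((x - mean_x) ** 2 for x in income)
syy = sum((y - mean_y) ** 2 for y in expenditure)

r = sxy / sqrt(sxx * syy)
t_stat = r * sqrt(n - 2) / sqrt(1 - r ** 2)
df = n - 2

print(f"r = {r:.4f}, r squared = {r ** 2:.4f}, t = {t_stat:.3f}, df = {df}")
```

With 5 degrees of freedom the two-sided 5% critical value of t is 2.571, so a computed t above that value leads to rejecting the null hypothesis of no relationship.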

Discussion
Regression Analysis
When is it used?
• To examine the cause-and-effect relationship assumed between a dependent
variable and a number of independent variables.
• Analyze → Regression → Linear (use the Statistics and Save
options).
• Method – Enter, Backward, Forward and Stepwise.
• Options – residual analysis, influence statistics, collinearity
diagnostics, normality plots.
Interpretation:
• Goodness of model: R², F-statistic, Adjusted R², Standard Error.
• Strength of influence of each independent variable: unstandardized (B)
and standardized (beta) coefficients.
Regression Analysis
Regression Analysis is a very powerful tool in the field of
statistical analysis for predicting the value of one variable, given
the value of another variable, when those variables are related to
each other.
Real life data may contain noise which hides the pattern that is
present in the data.
We call this pattern the Model. A simple representation of data is
Data = Model + Noise
• The term Model is also called pattern; it is also the expected
value.
• The term Noise is also called error, deviation, or residual.
• One has to find the model in order to explain the data.
What is error in mathematical form?
[Figure: for a given x, the error is the vertical distance between the actual y and the model line]
The relationship between y and x is
$y = \alpha + \beta x + e$
Here α + βx is the model and e is the error. Note that the error is
different for different x. The set of equations used to find
α and β are called the normal equations.
Example (Simple Regression)
• Age of a car (in years): 1 2 3 4
• Maintenance cost (in thousands): 2 3 5 8
y = a + bX
Maintenance cost = -0.5 + 2 (Age of car)

X | Y (observed) | Predicted | Error = observed − predicted
1 | 2 | 1.5 | 0.5
2 | 3 | 3.5 | -0.5
3 | 5 | 5.5 | -0.5
4 | 8 | 7.5 | 0.5
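The fitted line in this example can be reproduced with the least squares formulas b = Sxy / Sxx and a = ȳ − b·x̄. The Python sketch below uses only the four data points from the example; the variable names are illustrative.

```python
age = [1, 2, 3, 4]    # age of car in years (X)
cost = [2, 3, 5, 8]   # maintenance cost in thousands (Y)

n = len(age)
mean_x = sum(age) / n
mean_y = sum(cost) / n

# Least squares slope and intercept.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(age, cost))
sxx = sum((x - mean_x) ** 2 for x in age)
b = sxy / sxx
a = mean_y - b * mean_x

predicted = [a + b * x for x in age]
errors = [y - p for y, p in zip(cost, predicted)]

print(f"Maintenance cost = {a} + {b} (Age of car)")
for x, y, p, e in zip(age, cost, predicted, errors):
    print(f"X = {x}, observed = {y}, predicted = {p}, error = {e}")
```

The errors sum to zero, which is a property of any least squares line that includes an intercept.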
Prof B.Ravikumar
Multiple regression
A multiple regression is a model in which a response variable Y is
linked to p (≥ 1) explanatory (or regressor) variables X1, X2, …, Xp
and a random departure (error) term.

The model is linear if it is linear in the parameters and the link
between the explanatory variables and the departure term is
additive.
Thus
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + e$
where the departure terms e are independent, each having
mean 0 and variance $\sigma^2$.

The beta value tells us to what degree each predictor affects the
outcome if the effects of all other predictors are held constant.
Multicollinearity

• Multicollinearity is the term reserved for the case when the intercorrelation of the
predictor variables is high. The variance of the
estimated regression coefficients depends on the intercorrelation of the
predictors. When there are moderate to high intercorrelations (e.g., r >
+0.90) among the predictors, the problem is referred to as multicollinearity.
Detection of Multicollinearity

Variance Inflation Factors (VIF)

• Less than 3 – No multicollinearity problem
• Greater than 3 – There is a multicollinearity problem
• Greater than 5 – There is high
multicollinearity
• Greater than 10 – Serious multicollinearity
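The VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the other predictors. The sketch below assumes that R² value is already known and applies the cut-offs listed above; the function names are illustrative, and a value of exactly 3 is treated as "no problem" because the slide does not say which side it falls on.

```python
def vif(r_squared: float) -> float:
    """Variance inflation factor from the R squared of one predictor on the others."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R squared must be in the range [0, 1)")
    return 1.0 / (1.0 - r_squared)


def interpret_vif(value: float) -> str:
    """Apply the cut-offs from the slide (assumption: exactly 3 counts as no problem)."""
    if value > 10:
        return "serious multicollinearity"
    if value > 5:
        return "high multicollinearity"
    if value > 3:
        return "multicollinearity problem"
    return "no multicollinearity problem"


# Hypothetical R squared values, for illustration only.
for r2 in (0.20, 0.75, 0.85, 0.95):
    value = vif(r2)
    print(f"R squared = {r2:.2f}: VIF = {value:.2f} ({interpret_vif(value)})")
```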
The regression model is
MULTIVARIATE ANALYSIS
• Multivariate analyses are used principally for four
reasons, i.e. to see patterns in data, to make clear
comparisons, to discard unwanted information and to
study multiple factors at once.
• There are two main factor analysis methods: common
factor analysis, which extracts factors based on the
variance shared by the variables, and principal component
analysis, which extracts factors based on the total
variance of the variables.
FACTOR ANALYSIS
Factor analysis is a technique that is used to reduce a large number of variables into a smaller number of
factors. This technique extracts the maximum common variance from all variables and puts it into a
common score.
Types of factoring:
There are different methods used to extract factors from the data set:
1. Principal component analysis: This is the most common method used by researchers. PCA starts
by extracting the maximum variance and putting it into the first factor. It then removes the variance
explained by the first factor and extracts the maximum remaining variance for the second factor. This
process continues to the last factor.
2. Common factor analysis: The second most preferred method by researchers, it extracts the common
variance and puts it into factors. This method does not include the unique variance of the
variables. This method is used in SEM.
3. Image factoring: This method is based on the correlation matrix. OLS regression is used to predict
the factor in image factoring.
4. Maximum likelihood method: This method also works on the correlation matrix, but it uses the maximum
likelihood method to factor.
5. Other methods of factor analysis: Alpha factoring, unweighted least squares and weighted least squares
are other methods used for factoring.
Assumptions:
1. No outliers: It is assumed that there are no outliers in the data.
2. Adequate sample size: The number of cases must be greater than the number of factors.
3. No perfect multicollinearity: Factor analysis is an interdependency technique. There should not be perfect multicollinearity
between the variables.
4. Homoscedasticity: Since factor analysis is a linear function of the measured variables, it does not require homoscedasticity
between the variables.
5. Linearity: Factor analysis is also based on the linearity assumption. Non-linear variables can also be used; after transformation, however,
they become linear variables.
6. Interval data: Interval data are assumed.

Key concepts and terms:


Exploratory factor analysis: Assumes that any indicator or variable may be associated with any factor. This is the most common
factor analysis used by researchers and it is not based on any prior theory.
Confirmatory factor analysis (CFA): Used to determine the factors and factor loadings of measured variables, and to confirm
what is expected on the basis of pre-established theory. CFA assumes that each factor is associated with a specified subset of
measured variables. It commonly uses two approaches:
The traditional method: The traditional factor method is based on the principal factor analysis method rather than common factor
analysis. The traditional method allows the researcher to gain more insight into the factor loadings.
The SEM approach: CFA is an alternative approach to factor analysis which can be done in SEM. In SEM, we remove all
straight arrows from the latent variables, and add only the arrows that represent the covariance
between every pair of latent variables. We also keep the straight arrows from the error and disturbance terms to their respective
variables. If the standardized error term in SEM is less than two in absolute value, it is assumed to be good for that factor; if it
is more than two, it means that there is still some unexplained variance which can be explained by a factor. Chi-square and a
number of other goodness-of-fit indexes are used to test how well the model fits.
CLUSTER ANALYSIS
• Cluster analysis is an exploratory analysis that tries to identify structures
within the data.
• Cluster analysis is also called segmentation analysis or taxonomy analysis.
• More specifically, it tries to identify homogeneous groups of cases when the
grouping is not previously known.
• Because it is exploratory, it does not make any distinction between dependent
and independent variables.
• The different cluster analysis methods that SPSS offers can handle binary,
nominal, ordinal, and scale (interval or ratio) data.
• Cluster analysis is often used in conjunction with other analyses (such as
discriminant analysis).
• The researcher must be able to interpret the cluster analysis based on their
understanding of the data to determine whether the results produced by the analysis
are actually meaningful.
In SPSS, cluster analyses can be found in Analyze → Classify.
SPSS offers three methods for cluster analysis: K-Means
Cluster, Hierarchical Cluster, and Two-Step Cluster.
• K-means cluster is a method to quickly cluster large data sets. The
researcher defines the number of clusters in advance. This is useful to
test different models with a different assumed number of clusters.
• Hierarchical cluster is the most common method. It generates a
series of models with cluster solutions from 1 (all cases in one cluster)
to n (each case is an individual cluster). Hierarchical cluster also
works with variables as opposed to cases; it can cluster variables
together in a manner somewhat similar to factor analysis. In
addition, hierarchical cluster analysis can handle nominal, ordinal,
and scale data; however, it is not recommended to mix different levels
of measurement.
• Two-step cluster analysis identifies groupings by running pre-
