RMSA Digital Notes
RESEARCH METHODOLOGY & STATISTICAL ANALYSIS
Digital Notes
Compiled by
Dr. G. Naveen Kumar
Dr. G. Venkat Reddy
Dr. I. J. Raghavendra, Associate Professor, MBA-MRCET
Prof. G. Naveen Kumar, HOD, MBA-MRCET 2022-23
Hold on to the ideal. March on! Do not look back upon little mistakes and things. In this
battlefield of ours the dust of mistakes must be raised. Those who are so thin-skinned that
they cannot bear the dust, let them get out of the ranks.
- Swami Vivekananda
Life is not just a series of calculations and a sum total of statistics, it's about experience, it's
about participation, it is something more complex and more interesting than what is obvious.
- Daniel Libeskind
TIPS FOR STUDYING STATISTICS
1. Use distributive practice rather than massed practice. That is, set aside one to two hours at the
same time each day for six days out of the week (Take the seventh day off) for studying statistics.
Do not cram your study for four or five hours into one or two sittings each week. This is a
cardinal principle.
2. Study in triads or quads of students at least once every week. Verbal interchange and
interpretation of concepts and skills with other students really cements a greater depth of
understanding.
3. Don't try to memorize formulas (A good instructor will never ask you to do this). Study
CONCEPTS CONCEPTS CONCEPTS. Remember, later in life when you need to use a
statistical technique you can always look the formula up in a textbook.
4. Work as many and varied problems and exercises as you possibly can. Hopefully your textbook
is accompanied by a workbook. You cannot learn statistics by just reading about it. You must
push the pencil and practice your skills repeatedly.
5. Look for recurring themes in statistics. There are probably only a handful of important skills
that keep popping up over and over again. Ask your instructor to emphasize these if need be.
6. Always carry a calculator and statistical tables.
Learning Outcome/s:
Appreciate that the collection and statistical analysis of data improves business decisions
and reduces the risk of implementing solutions that waste resources and effort.
Select and deploy the correct statistical method for a given data analysis requirement.
Achieve a practical level of competence in building statistical models that suit business
applications.
Recognize, develop and distinguish between models for cross-sectional analysis at a
single point in time and models for time series analysis at multiple points in time.
Trend analysis - Free Hand Curve - Moving Averages. Time Series Analysis and Report writing
Sample Test: t-Distribution - Properties and Applications - Testing for One and Two Means -
Paired t-test.
Chi-Square distribution: Test for a specified Population variance - Test for Independence of
Attributes.
REFERENCES:
Levin R.I., Rubin D.S., "Statistics for Management", Pearson.
Beri, "Business Statistics", TMH.
Gupta S.C., "Fundamentals of Statistics", HPH.
Amir D. Aczel and Jayavel Sounderpandian, "Complete Business Statistics", TMH.
Levine, Stephan, Krehbiel, Berenson, "Statistics for Managers using Microsoft Excel", PHI.
J.K. Sharma, "Business Statistics", Pearson.
INTRODUCTION TO RESEARCH
o Meaning and Scope
o Types of Research
o Research Process in Management
OBJECTIVE
RESEARCH DESIGN
o Research Problem
o Purpose of Research Design
o Characteristics of Good Research Design
o Sampling and its Applications
MEANING OF RESEARCH
According to Redman and Mory (1923), research is a "systematized effort to gain new
knowledge". It is an academic activity (mini projects, major projects and report
writing) and therefore the term should be used in a technical sense (t-test, z-test,
ANOVA, chi-square test, correlation and regression analysis).
OBJECTIVES OF RESEARCH
Although every research study has its own specific objectives, the research
objectives may be broadly grouped as follows:
To gain familiarity with a phenomenon or to gain new insights into it (i.e.,
formulative or exploratory research studies);
To portray accurately the characteristics of a particular individual, group or situation
(i.e., descriptive research studies);
To analyze the frequency with which something occurs (i.e., diagnostic research studies)
To examine the hypothesis of a causal relationship between two variables
(i.e., hypothesis-testing research studies).
SCOPE OF RESEARCH
Scope of the study refers to the elements that will be covered in a research project. It defines
the boundaries of the research. The main purpose of the scope of the study is that it explains the
extent to which the research area will be explored and thus specifies the parameters that will be
observed within the study.
TYPES OF RESEARCH
Fundamental research.
Applied research.
Qualitative research.
Quantitative research
Mixed research.
Exploratory research.
Field research.
Cross-sectional research.
Laboratory research.
Fixed research
Fundamental Research:
Fundamental, or basic, research is designed to help researchers better understand certain
phenomena in the world; it attempts to broaden our understanding and
expand scientific theories and explanations. For example, fundamental research could
include a company's study of how sales of different products behave over a financial year
(e.g., Tata Motors' or Reliance's sales for 2020-21, i.e., April 1st to March 31st). Such a
study provides information and is knowledge-based.
Applied Research:
Applied research is designed to solve a specific, practical problem. It applies the findings of
fundamental research to real situations, for example, a study undertaken to find out how a
company can improve the sales of a particular product.
Qualitative Research:
Qualitative research involves non-numerical data, such as opinions and literature.
Examples of qualitative data may include:
Focus groups(Team)
Surveys (consumers, customers, employees and employers)
Participant comments.
Observations
Interviews
Quantitative Research:
Quantitative research depends on numerical data, such as statistics and measurements. For
example, a car manufacturer may compare the number of sales of red cars with
sales of white cars. The research uses objective data, the sales figures for red and white cars, to
draw conclusions.
Mixed Research:
Mixed research includes both qualitative and quantitative data. Consider the car manufacturer
comparing red and white car sales. The company could also ask car buyers to complete a survey after
buying a red or white car that asks how much the colour impacted their decision, together with other
opinion-based questions.
Exploratory Research:
Exploratory research is designed to examine what is already known about a topic and what
additional information may be relevant. It rarely answers a specific question.
Longitudinal Research:
Longitudinal research focuses on how certain measurements change over time without
manipulating any variables. For instance, a researcher may examine if and how employee
satisfaction changes in the same employees after one year, three years and five years with the
same company.
Cross-sectional Research:
Cross-sectional research studies a group or subgroup at one point in time. Participants are
generally chosen based on certain shared characteristics, such as age, gender or income, and
researchers examine the similarities and differences within groups and between groups.
Field Research:
Field research takes place wherever the participants or subjects are, or "on location." This
type of research requires onsite observation and data collection.
RESEARCH PROCESS
Research process consists of a series of steps or actions required for effectively conducting
research. The following are the steps that provide useful procedural guidelines regarding the
conduct of research:
Formulating the research problem;
Extensive literature survey;
Developing hypothesis;
Preparing the research design;
Determining sample design;
Collecting data;
Execution of the project;
Analysis of data;
Hypothesis testing;
Generalization and interpretation, and Preparation of the report.
In other words, it involves the formal write-up of conclusions.
DATA:
Research data is any information that has been collected, observed, generated or created to
validate original research findings.
Meaning:
Primary data refers to first-hand data gathered by the researcher himself.
Sources: surveys, observations, experiments, questionnaires, personal interviews, etc.
Secondary data means data collected by someone else earlier.
Sources: government publications, websites, books, journal articles, internal records, etc.
Schedule of Questions:
A schedule is a structured set of questions on a given topic which are asked by the
interviewer or investigator personally. The order of the questions, their language
and the arrangement of the parts of the schedule are not changed.
Computer questionnaire
Respondents are asked to answer a questionnaire that is delivered electronically,
for example by e-mail or a web link.
Telephone questionnaire
The researcher asks the questions over the telephone. Surveying in this way collects
information directly from project stakeholders, participants or beneficiaries in a
systematic, standardised way, and relies on questionnaires administered to respondents.
In-house survey
The interviewer visits respondents at their home or workplace and administers the
questionnaire face to face. This allows longer questionnaires and clarification of
questions, but is costlier and slower than the other methods.
Mail Questionnaire
A mail questionnaire is a form of questionnaire which is mailed to targeted
individuals and which contains a collection of questions on a particular topic, asked
as part of an interview or survey used for conducting research on that topic.
Dichotomous Questions.
A dichotomous question is a question that can have only two possible answers.
Dichotomous questions are usually used in surveys that ask for Yes/No,
True/False, Fair/Unfair or Agree/Disagree answers. They are used for a clear
distinction of qualities, experiences, or respondents' opinions.
Scaling Questions
Scaling questions ask respondents to consider their position on a scale (usually
from 1 to 10, with 1 being the least desirable situation and 10 being the most
desirable). Scaling questions can be a helpful way to track respondents' progress toward
goals and monitor incremental change.
SAMPLING METHODS
METHODS OF SAMPLING:
Systematic Sampling
Systematic sampling is a type of probability sampling method in which sample
members from a larger population are selected according to a random starting point but
with a fixed, periodic interval. This interval, called the sampling interval, is calculated
by dividing the population size by the desired sample size.
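A minimal Python sketch of this procedure (the population and sample size below are illustrative assumptions, not figures from the notes):

```python
import random

def systematic_sample(population, sample_size):
    k = len(population) // sample_size          # sampling interval = N / n
    start = random.randint(0, k - 1)            # random starting point within the first interval
    return population[start::k][:sample_size]   # every k-th member thereafter

population = list(range(1, 101))                # population of size N = 100
print(systematic_sample(population, 10))        # k = 10, so 10 members are selected
```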
Stratified Sampling
Stratified random sampling involves dividing the entire population into homogeneous
groups called strata (plural of stratum). A random sample from each stratum is taken
in a number proportional to the stratum's size when compared to the population. These
subsets of the strata are then pooled to form a random sample.
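The proportional allocation just described can be sketched as follows (the strata names and sizes are illustrative assumptions):

```python
import random

def stratified_sample(strata, total_sample_size):
    # Each stratum contributes in proportion to its share of the population.
    population_size = sum(len(members) for members in strata.values())
    sample = []
    for members in strata.values():
        n = round(total_sample_size * len(members) / population_size)
        sample.extend(random.sample(members, n))
    return sample

strata = {"urban": list(range(600)), "rural": list(range(400))}
print(len(stratified_sample(strata, 50)))   # about 30 urban + 20 rural members
```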
Clustered Sampling
Cluster sampling is a probability sampling technique where researchers divide the
population into multiple groups (clusters) for research. Researchers then select random
groups with a simple random or systematic random sampling technique for data
collection and data analysis.
Quota sampling:
Quota sampling is defined as a non-probability sampling method in which
researchers create a sample involving individuals that represent a population.
Researchers decide and create quotas so that the market research samples can be useful in
collecting data. These samples can be generalized to the entire population.
OBJECTIVE
MEASURES OF CENTRAL TENDENCY
o Mean
o Median
o Mode
o Calculation Methods
MEASURES OF DISPERSION
o Range
o Quartile Deviation
o Mean Deviation
o Standard Deviation
o Coefficient of Variation
SKEWNESS
o Karl Pearson's
o Bowley's
o Kelly's Coefficients
One of the important objectives of statistics is to find out various numerical values which explain the
inherent characteristics of a frequency distribution. The first of such measures is averages. The
averages are the measures which condense a huge unwieldy set of numerical data into single
numerical values which represent the entire distribution. The inherent inability of the human mind to
remember a large body of numerical data compels us to seek a few constants that will describe the data.
Averages provide us the gist and give a bird's eye view of the huge mass of unwieldy numerical
data. Averages are the typical values around which other items of the distribution congregate. This
value lies between the two extreme observations of the distribution and gives us an idea about the
concentration of the values in the central part of the distribution. They are called the measures of
central tendency.
Averages are also called measures of location since they enable us to locate the position or place
of the distribution in question. Averages are statistical constants which enable us to comprehend in a
single value the significance of the whole group. According to Croxton and Cowden, an average
value is a single value within the range of the data that is used to represent all the values in that
series. Since an average is somewhere within the range of data, it is sometimes called a measure of
central value. An average is the most typical representative item of the group to which it belongs and
which is capable of revealing all important characteristics of that group or distribution.
Measures of central tendency, Mean, Median, Mode, etc., indicate the central position of a
series. They indicate the general magnitude of the data but fail to reveal all the peculiarities and
characteristics of the series. In other words, they fail to reveal the degree of the spread out or the
extent of the variability in individual items of the distribution. This can be explained by certain other
measures, known as 'Measures of Dispersion' or Variation.
The study of statistics does not show much interest in things which are constant. The total area of the
Earth may not be very important to a research-minded person, but the area covered by different
crops, forests, residential and commercial buildings are figures of great importance, because these
figures keep on changing from time to time and from place to place. Many experts are engaged in the
study of changing phenomena.
Experts working in different countries keep a watch on forces which are responsible for bringing
changes in the fields of human interest. Agricultural, industrial and mineral production and their
transportation from one area to other parts of the world are of great interest to economists,
statisticians, and other experts. Changes in human populations, changes in standards of living,
changes in literacy rates and changes in prices attract experts to perform detailed studies and then
correlate these changes to human life. Thus variability or variation is connected with human life and
its study is very important for mankind.
Different methods of measuring “Central Tendency” provide us with different kinds of averages. The
following are the main types of averages that are commonly used:
1. Mean
2. Median
3. Mode
Arithmetic Mean
Arithmetic mean is the most commonly used average or measure of the central tendency applicable
only in case of quantitative data; it is also simply called the “mean”. Arithmetic mean is defined as:
“Arithmetic mean is a quotient of sum of the given values and number of the given values”.
Arithmetic mean can be computed for both ungrouped data (raw data: data without any statistical
treatment) and grouped data (data arranged in tabular form containing different groups).
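In standard notation (not spelled out in the original notes; here x̄ denotes the mean and f_i the class frequencies):

```latex
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \quad \text{(ungrouped data)},
\qquad
\bar{x} = \frac{\sum f_i x_i}{\sum f_i} \quad \text{(grouped data)}
```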
Median
The median is that value of the variable which divides the group into two equal parts, one part
comprising all values greater than the median and the other comprising all values less than it. The
median of a distribution may be defined as that value of the variable which exceeds and is exceeded
by the same number of observations, i.e., the value such that the number of observations above it is
equal to the number of observations below it. Whereas the arithmetic mean is based on all items of the
distribution, the median is a positional average, i.e., it depends upon the position occupied by a value
in the frequency distribution. When the items of a series are arranged in ascending or descending
order of magnitude the value of the middle item in the series is known as median in the case of
individual observation.
Symbolically, Median = size of the ((n+1)/2)th item.
If the number of items is even, then there is no value exactly in the middle of the series. In such a
situation the median is arbitrarily taken to be halfway between the two middle items.
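As a small illustration of the rule just described, the following Python sketch (illustrative, not part of the original notes) returns the middle item for an odd number of items and the value halfway between the two middle items for an even number:

```python
def median(values):
    s = sorted(values)                 # items must be arranged in order of magnitude
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                  # single middle item
    return (s[mid - 1] + s[mid]) / 2   # halfway between the two middle items

print(median([7, 3, 9, 5, 1]))        # 5   (the 3rd of 5 ordered items)
print(median([7, 3, 9, 5, 1, 11]))    # 6.0 (midway between 5 and 7)
```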
Advantages of Median:
(1) It is very simple to understand and easy to calculate. In some cases it is obtained simply by
inspection.
(2) Median lies at the middle part of the series and hence it is not affected by the extreme values.
(3) It is a special average used in qualitative phenomena like intelligence or beauty which are not
quantified but ranks are given. Thus we can locate the person whose intelligence or beauty is
the average.
(4) In a grouped frequency distribution it can be graphically located by drawing ogives.
(5) It is especially useful in open-ended distributions, since it is the position rather than the value
of the item that matters in the median.
Disadvantages of Median:
(1) In a simple series, the item values have to be arranged. If the series contains a large number of
items, then the process becomes tedious.
(2) It is a less representative average because it does not depend on all the items in the series.
(3) It is not capable of further algebraic treatment. For example, we cannot find a combined
median of two or more groups if the median of different groups are given.
(4) It is affected more by sampling fluctuations than the mean, as it is concerned with only one
item, i.e. the middle item.
(5) It is not rigidly defined. In a simple series having an even number of items, the median cannot
be exactly found. Moreover, the interpolation formula applied in the continuous series is based
on the unrealistic assumption that the frequency of the median class is evenly spread over the
magnitude of the class interval of the median group.
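The interpolation formula referred to in point (5) is the standard one for a continuous series, where L is the lower limit of the median class, cf the cumulative frequency of the class preceding it, f its frequency and h its class width:

```latex
\text{Median} = L + \frac{\frac{n}{2} - cf}{f} \times h
```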
Mode
Mode is that value of the variable which occurs or repeats itself the maximum number of times. The
mode is the most "fashionable" size in the sense that it is the most common and typical, and is defined by
Zizek as "the value occurring most frequently in a series of items and around which the other items are
distributed most densely." In the words of Croxton and Cowden, the mode of a distribution is the
value at the point where the items tend to be most heavily concentrated. According to A.M. Tuttle,
mode is the value which has the greatest frequency density in its immediate neighbourhood. In the
case of individual observations, the mode is that value which is repeated the maximum number of
times in the series. The value of the mode can also be denoted by the letter Z.
In a grouped frequency distribution, the mode can be located graphically from a histogram. The top
left corner of the highest rectangle is joined with the top left corner of the following rectangle, and
the top right corner of the highest rectangle is joined with the top right corner of the preceding
rectangle. From the point of intersection of the two lines a perpendicular is drawn to the X-axis;
the point where it meets the X-axis is the required value of the mode.
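The algebraic counterpart of this graphical method is the standard interpolation formula for the modal class, where L is the lower limit of the modal class, f_1 its frequency, f_0 and f_2 the frequencies of the preceding and succeeding classes, and h the class width:

```latex
\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h
```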
Advantages and Disadvantages of Mode:
Advantages:
It is easy to understand and simple to calculate.
It is not affected by extremely large or small values.
It can be located just by inspection in ungrouped data and discrete frequency distribution.
It can be useful for qualitative data.
It can be computed in an open-end frequency table.
It can be located graphically.
Disadvantages:
It is not well defined.
It is not based on all the values.
It is stable only for a large number of values, so it is not well defined when the data consist of a
small number of values.
It is not capable of further mathematical treatment.
Sometimes the data has one or more than one mode and sometimes the data has no mode at all.
MEASURES OF DISPERSION
Dispersion:
The word dispersion has a technical meaning in statistics. The average measures the center of the
data, and it is one aspect of observation. Another feature of the observation is how the observations
are spread about the center. The observations may be close to the center or they may be spread away
from the center. If the observations are close to the center (usually the arithmetic mean or median),
we say that dispersion, scatter or variation is small. If the observations are spread away from the
center, we say dispersion is large.
The study of dispersion is very important in statistical data. If in a certain factory there is
consistency in the wages of workers, the workers will be satisfied. But if some workers have high
wages and some have low wages, there will be unrest among the low paid workers and they might go
on strike and arrange demonstrations. If in a certain country some people are very poor and some are
very rich, we say there is economic disparity. This means that dispersion is large.
The idea of dispersion is important in the study of workers' wages, prices of commodities, standards
of living of different people, distribution of wealth, distribution of land among farmers, and many
other fields of life. Some brief definitions of dispersion are:
The degree to which numerical data tend to spread about an average value is called the dispersion
or variation of the data.
Dispersion or variation may be defined as a statistic signifying the extent of the scatteredness of
items around a measure of central tendency.
Dispersion or variation is the measurement of the size of the scatter of items in a series about the
average.
For the study of dispersion, we need some measures which show whether the dispersion is small or
large. There are two types of measure of dispersion, which are:
(a) Absolute Measures of Dispersion
(b) Relative Measures of Dispersion
Range:
Range is the simplest method of studying dispersion, and it takes less time to compute the 'absolute'
and 'relative' range. Range does not take into account all the values of a series, i.e. it considers only
the extreme items; the middle items are not given any importance. Therefore, the range cannot tell us
anything about the character of the distribution. Range cannot be computed in the case of 'open-end'
distributions, i.e., distributions where the lower limit of the first group and the upper limit of the
last group are not given. The concept of range is useful in the field of quality control and for studying
the variations in the prices of shares etc.
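In standard notation (L = largest value, S = smallest value), the absolute and relative measures mentioned above are:

```latex
\text{Range} = L - S,
\qquad
\text{Coefficient of Range} = \frac{L - S}{L + S}
```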
Quartile Deviation:
The quartile deviation is a slightly better measure of absolute dispersion than the range, but it ignores
the observations on the tails. If we take different samples from a population and calculate their
quartile deviations, their values are quite likely to be sufficiently different. This is called sampling
fluctuation, and for this reason the quartile deviation is not a popular measure of dispersion. The
quartile deviation calculated from sample data does not help us to draw any conclusion (inference)
about the quartile deviation of the population.
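For reference, the usual formulas (standard notation, with Q_1 and Q_3 the first and third quartiles) are:

```latex
Q.D. = \frac{Q_3 - Q_1}{2},
\qquad
\text{Coefficient of } Q.D. = \frac{Q_3 - Q_1}{Q_3 + Q_1}
```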
Disadvantages:
It is completely dependent on the central items. If these values are irregular and abnormal, the
result is bound to be affected.
All the items of the frequency distribution are not given equal importance in finding the values of
Q1 and Q3.
Because it does not take into account all the items of the series, it is considered to be inaccurate.
Average Deviation:
Average deviation is defined as a value which is obtained by taking the average of the deviations of
various items from a measure of central tendency (mean, median or mode), ignoring negative signs.
Generally, the measure of central tendency from which the deviations are taken is specified in the
problem. If nothing is mentioned regarding the measure of central tendency, deviations are taken
from the median, because the sum of the deviations (after ignoring negative signs) is then minimum.
This method is more effective in reports presented to the general public or to groups who are
not familiar with statistical methods.
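In symbols, taking deviations from a chosen average A (the median, unless otherwise specified) and ignoring signs:

```latex
\text{M.D.} = \frac{\sum_{i=1}^{n} |x_i - A|}{n}
```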
Standard Deviation:
The standard deviation, which is denoted by the Greek letter σ (read as sigma), is extremely useful in
judging the representativeness of the mean. The concept of standard deviation, which was introduced
by Karl Pearson, has a practical significance because it is free from the defects that exist in the
range, quartile deviation and average deviation.
Standard deviation is calculated as the square root of the average of the squared deviations taken from
the actual mean. It is also called the root mean square deviation. The square of the standard deviation,
i.e., σ², is called the 'variance'.
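In symbols, for n observations with mean x̄:

```latex
\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}},
\qquad
\text{Variance} = \sigma^2
```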
Disadvantages:
It is difficult to compute.
It assigns more weight to extreme items and less weight to items that are nearer to the mean,
because the squares of the deviations which are large in size are proportionately greater than the
squares of those deviations which are comparatively small.
Coefficient of Variation
The most important of all the relative measures of dispersion is the coefficient of variation (note:
variation, not variance; there is no such thing as a coefficient of variance). It is defined as
C.V. = (standard deviation / mean) × 100.
Thus the C.V. is the value of the SD when the mean is assumed equal to 100. It is a pure number and the unit of
observation is not mentioned with its value. It is written in percentage form like 20% or 25%. When
its value is 20%, it means that when the mean of the observations is assumed equal to 100, their
standard deviation will be 20. The C.V is used to compare the dispersion in different sets of data
particularly the data which differ in their means or differ in their units of measurement. The wages of
workers may be in dollars and the consumption of meat in families may be in kilograms. The
standard deviation of wages in dollars cannot be compared with the standard deviation of amount of
meat in kilograms. Both the standard deviations need to be converted into a coefficient of variation
for comparison. Suppose the value of C.V. for wages is 10% and the value of C.V. for kilograms of
meat is 25%. This means that the wages of workers are consistent. Their wages are close to the
overall average of their wages. But the families consume meat in quite different quantities. Some
families consume very small quantities of meat and some others consume large quantities of meat.
We say that there is greater variation in their consumption of meat. The observations about the
quantity of meat are more dispersed or more variant.
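The comparison described above can be sketched in a few lines of Python; the wage and meat figures below are made up purely for illustration:

```python
# C.V. = (standard deviation / mean) x 100, a unit-free percentage,
# so series measured in different units can be compared directly.
from statistics import mean, pstdev

def coefficient_of_variation(values):
    return pstdev(values) / mean(values) * 100

wages = [480, 500, 520, 510, 490]    # dollars; tightly clustered
meat = [1.0, 4.0, 2.5, 6.0, 0.5]     # kilograms; widely spread
print(round(coefficient_of_variation(wages), 1))  # ~2.8  (consistent wages)
print(round(coefficient_of_variation(meat), 1))   # ~72.0 (variable consumption)
```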
SKEWNESS
Measures of Skewness and Kurtosis, like measures of central tendency and dispersion, study
the characteristics of a frequency distribution. Averages tell us about the central value of the
distribution and measures of dispersion tell us about the concentration of the items around a central
value. These measures do not reveal whether the dispersal of value on either side of an average is
symmetrical or not. If observations are arranged in a symmetrical manner around a measure of
central tendency, we get a symmetrical distribution; otherwise, it may be arranged in an
asymmetrical order which gives asymmetrical distribution. Thus, skewness is a measure that studies
the degree and direction of departure from symmetry.
A symmetrical distribution, when presented on graph paper, gives a 'symmetrical curve',
where the values of mean, median and mode are exactly equal. On the other hand, in an asymmetrical
distribution, the values of mean, median and mode are not equal. When two or more symmetrical
distributions are compared, the difference in them is studied with 'kurtosis'. On the other hand,
when two or more asymmetrical distributions are compared, they will give different degrees of
skewness. These measures are mutually exclusive, i.e. the presence of skewness implies absence of
kurtosis and vice-versa.
Tests of Skewness:
There are certain tests to know whether skewness does or does not exist in a frequency distribution.
They are:
1. In a skewed distribution, the values of mean, median and mode do not coincide. The values
of mean and mode are pulled apart, and the value of the median lies in between. In such a
distribution, Median - Mode = 2/3 (Mean - Mode).
2. Quartiles will not be equidistant from median.
3. When the asymmetrical distribution is drawn on the graph paper, it will not give a bell shaped
curve.
4. Sum of the positive deviations from the median is not equal to sum of negative deviations.
5. Frequencies are not equal at points of equal deviations from the mode.
Nature of Skewness:
Skewness can be positive or negative or zero.
1. When the values of mean, median and mode are equal, there is no skewness.
2. When mean > median > mode, skewness will be positive.
3. When mean < median < mode, skewness will be negative.
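The coefficients listed in the unit outline measure this departure numerically; in standard notation, Karl Pearson's and Bowley's coefficients are:

```latex
Sk_P = \frac{\text{Mean} - \text{Mode}}{\sigma}
= \frac{3(\text{Mean} - \text{Median})}{\sigma},
\qquad
Sk_B = \frac{Q_3 + Q_1 - 2\,\text{Median}}{Q_3 - Q_1}
```

A positive value indicates positive skewness, a negative value negative skewness, and zero a symmetrical distribution.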
The level of measurement refers to the relationship among the values that are assigned to the
attributes for a variable. Each scale of measurement has certain properties which in turn determine
the appropriateness for use of certain statistical analyses. It is important for the researcher to
understand the different levels of measurement, as these levels of measurement, together with how
the research question is phrased, dictate what statistical analysis is appropriate.
The first level of measurement is NOMINAL Level of Measurement. In this level of measurement,
the numbers in the variable are used only to classify the data. In this level of measurement, words,
letters, and alpha-numeric symbols can be used. Suppose there are data about people belonging to
three different gender categories. In this case, a person belonging to the female gender could be
classified as F, a person belonging to the male gender could be classified as M, and a transgender
person classified as T. This type of assigned classification is the nominal level of measurement.
The second level of measurement is the ORDINAL Level of Measurement. This level of
measurement depicts some ordered relationship among the variable's observations. Suppose a
student scores the highest grade of 100 in the class. In this case, he would be assigned the first
rank. Then, another classmate scores the second highest grade of a 92; she would be assigned the
second rank. A third student scores an 81 and he would be assigned the third rank, and so on. The
ordinal level of measurement indicates an ordering of the measurements.
The third level of measurement is the INTERVAL Level of Measurement. The interval level of
measurement not only classifies and orders the measurements, but it also specifies that the distances
between each interval on the scale are equivalent along the scale from low interval to high
interval. For example, an interval level of measurement could be the measurement of anxiety in a
student between the score of 10 and 11; this interval is the same as that of a student who scores
between 40 and 41. A popular example of this level of measurement is temperature in centigrade,
where, for example, the distance between 94°C and 96°C is the same as the distance between 100°C
and 102°C.
The fourth level of measurement is the RATIO Level of Measurement. In this level of
measurement, the observations, in addition to having equal intervals, can have a value of zero as
well. The zero in the scale makes this type of measurement unlike the other types of measurement,
although the properties are similar to that of the interval level of measurement. In the ratio level of
measurement, the divisions between the points on the scale have an equivalent distance between
them.
Stevens (1946, 1951) proposed that measurements can be classified into four different types of
scales:

Scale Type | Permissible Statistics | Admissible Scale Transformation | Mathematical Structure
Nominal    | Mode, Chi-Square | One to One (Equality (=)) | Standard Set Structure (Unordered)
Ordinal    | Median, Percentile | Monotonic Increasing (Order (<)) | Totally Ordered Set
Interval   | Mean, SD, Correlation, Regression, ANOVA | Positive Linear (Affine) | Affine Line
Ratio      | All statistics permitted for interval scales plus the following: GM, HM, Coefficient of Variation, Logarithms | Positive Similarities (Multiplication) | Field
CLASSIFICATION OF DATA
Classification:
The collected data, also known as raw data or ungrouped data, are always in an unorganised form
and need to be organised and presented in a meaningful and readily comprehensible form in order to
facilitate further statistical analysis. It is, therefore, essential for an investigator to condense a mass
of data into a more comprehensible and assimilable form. The process of grouping data into
different classes or sub-classes according to some characteristics is known as classification;
tabulation is concerned with the systematic arrangement and presentation of classified data. Thus
classification is the first step in tabulation. For example, letters in the post office are classified
according to their destinations, viz., Delhi, Madurai, Bangalore, Mumbai etc.
Objects of Classification:
The following are main objectives of classifying the data.
1. It condenses the mass of data in an easily assimilable form.
2. It eliminates unnecessary details.
3. It facilitates comparison and highlights the significant aspect of data.
4. It enables one to get a mental picture of the information and helps in drawing inferences.
5. It helps in the statistical treatment of the information collected.
Types of classification:
Statistical data are classified in respect of their characteristics. Broadly there are four basic types of
classification namely
a) Chronological Classification;
b) Geographical Classification;
c) Qualitative Classification;
d) Quantitative Classification.
a) Chronological Classification: In this type of classification the data are classified on the basis of
time, for instance, the population of India in different census years or the sales of a firm over the
last ten years.
b) Geographical Classification: In this type of classification the data are classified according to
geographical region or place. For instance, the production of paddy in different states in India,
production of wheat in different countries etc.
c) Qualitative Classification: In this type of classification data are classified on the basis of some
attribute or quality like sex, literacy, religion, employment etc.; such attributes cannot be measured
along a scale. For example, if the population is to be classified in respect of one attribute, say sex,
then we can classify it into two classes, namely males and females. Similarly, it can also be
classified into 'employed' or 'unemployed' on the basis of another attribute, 'employment'. Thus
when the classification is done with respect to one attribute, which is dichotomous in nature, two
classes are formed, one possessing the attribute and the other not possessing the attribute. This type
of classification is called simple or dichotomous classification.
The classification, where two or more attributes are considered and several classes are
formed, is called a manifold classification. For example, if we classify the population simultaneously
with respect to two attributes, e.g. sex and residence, then the population is first classified with
respect to 'sex' into 'males' and 'females'. Each of these classes may then be further classified into
'Urban', 'Semi-Urban' and 'Rural' on the basis of the attribute 'residence', and as such the population
is classified into six classes, namely:
(i) Male in Urban Area
(ii) Male in Semi-Urban Area
(iii) Male in Rural Area
(iv) Female in Urban Area
(v) Female in Semi-Urban Area
(vi) Female in Rural Area
Still, the classification may be further extended by considering other attributes like marital status etc.
d) Quantitative Classification: In this type of classification the data are classified according to some
characteristic that can be measured, such as height, weight, income etc. For example, the students of
a college may be classified according to weight as follows:

Weight (in lbs)  | 90-100 | 100-110 | 110-120 | 120-130 | 130-140 | 140-150 | Total
No. of Students  |   50   |   200   |   260   |   360   |    90   |    40   | 1000
TABULATION OF DATA
Tabulation is the process of summarizing classified or grouped data in the form of a table so that it is
easily understood and an investigator is quickly able to locate the desired information. A table is a
systematic arrangement of classified data in columns and rows. Thus, a statistical table makes it
possible for the investigator to present a huge mass of data in a detailed and orderly form. It
facilitates comparison and often reveals certain patterns in data which are otherwise not obvious.
Classification and 'tabulation', as a matter of fact, are not two distinct processes; actually they go
together. Before tabulation, data are classified and then displayed under different columns and rows
of a table.
Advantages of Tabulation:
Statistical data arranged in a tabular form serve following objectives:
1. It simplifies complex data and the data presented are easily understood.
2. It facilitates comparison of related facts.
3. It facilitates computation of various statistical measures like averages, dispersion, correlation etc.
4. It presents facts in minimum possible space and unnecessary repetitions and explanations are
avoided. Moreover, the needed information can be easily located.
5. Tabulated data are good for references and they make it easier to present the information in the
form of graphs and diagrams.
Preparing a Table:
The making of a compact table is itself an art. A table should contain all the information needed within
the smallest possible space. The purpose of the tabulation and how the tabulated information is to be
used are the main points to be kept in mind while preparing a statistical table. An ideal table
should consist of the following main parts:
1. Table number
2. Title of the table
3. Captions or column headings
4. Stubs or row designation
5. Body of the table
6. Footnotes
7. Sources of data
Table Number: A table should be numbered for easy reference and identification. This number, if
possible, should be written in the centre at the top of the table. Sometimes it is also written just
before the title of the table.
Title: A good table should have a clearly worded, brief but unambiguous title explaining the nature
of data contained in the table. It should also state arrangement of data and the period covered. The
title should be placed centrally on the top of a table just below the table number (or just after table
number in the same line).
Captions or Column Headings: These stand for brief and self-explanatory headings of vertical columns.
Captions may involve headings and sub-headings as well. The unit of the data contained should also be
given for each column. Usually, a relatively less important and shorter classification should be
tabulated in the columns.
Stubs or Row Designations: Stubs stand for brief and self-explanatory headings of horizontal rows.
Normally, a relatively more important classification is given in rows. Also, a variable with a large
number of classes is usually represented in rows. For example, rows may stand for score classes
and columns for data related to the sex of students. In that case, there will be many rows for score
classes but only two columns, for male and female students.
Body: It contains the numerical information of frequency of observations in the different cells. This
arrangement of data is according to the description of captions and stubs.
Footnotes: They are given at the foot of the table for explanation of any fact or information included
in the table which needs some explanation. Thus, they are meant for explaining or providing further
details about the data, which have not been covered in title, captions and stubs.
Sources of data: Lastly one should also mention the source of information from which data are
taken. This may preferably include the name of the author, volume, page and the year of publication.
This should also state whether the data contained in the table are of a 'primary or secondary' nature.
Though there is no hard and fast rule for forming a table, a few general points should be kept in
mind:
1. A table should be formed in keeping with the objects of statistical enquiry.
2. A table should be carefully prepared so that it is easily understandable.
3. A table should be formed so as to suit the size of the paper. But such an adjustment should not be
at the cost of legibility.
4. If the figures in the table are large, they should be suitably rounded or approximated. The method
of approximation and units of measurements too should be specified.
5. Rows and columns in a table should be numbered, and certain figures to be stressed may be put in
a 'box' or 'circle' or in bold letters.
6. The arrangements of rows and columns should be in a logical and systematic order. This
arrangement may be alphabetical, chronological or according to size.
7. The rows and columns are separated by single, double or thick lines to represent various classes
and sub-classes used. The corresponding proportions or percentages should be given in adjoining
rows and columns to enable comparison. A vertical expansion of the table is generally more
convenient than the horizontal one.
8. The averages or totals of different rows should be given at the right of the table and that of
columns at the bottom of the table. Totals for every sub-class too should be mentioned.
9. In case it is not possible to accommodate all the information in a single table, it is better to have
two or more related tables.
Type of Tables:
Tables can be classified according to their purpose, stage of enquiry, nature of data or number of
characteristics used. On the basis of the number of characteristics, tables may be classified as
follows:
1. Simple or one-way table
2. Two way table
3. Manifold table
Simple or One-way Table:
A simple or one-way table supplies information about one characteristic only, for example, the
number of adults in a locality classified according to occupation. The two-way and manifold tables
below extend such an occupation table.
Two-way Table:
A table which contains data on two characteristics is called a two-way table. In such a case, therefore,
either the stub or the caption is divided into two co-ordinate parts. In the table given below, as an
example, the caption is further divided in respect of 'sex'. This subdivision is shown in the two-way
table, which now contains two characteristics, namely occupation and sex.
Occupations |   No. of Adults    | Total
            |  Male  |  Female   |
Total       |        |           |
Manifold Table:
Thus, more and more complex tables can be formed by including other characteristics. For example,
we may further classify the caption sub-headings in the above table in respect of “marital status”,
“religion” and “socio-economic status” etc. A table, which has more than two characteristics of data,
is considered as a manifold table. For instance, table shown below shows three characteristics
namely, occupation, sex and marital status. Manifold tables, though complex, are good in practice as
they enable full information to be incorporated and facilitate analysis of all related facts. Still, as a
normal practice, not more than four characteristics should be represented in one table, to avoid
confusion. Other related tables may be formed to show the remaining characteristics.
                             No. of Adults
Occupations |           Male            |           Female          | Total
            | Married  Unmarried  Total | Married  Unmarried  Total |
Total       |                           |                           |
REGRESSION ANALYSIS
o Least Square Method
o Two lines of Regression
o Properties of regression Coefficients
CORRELATION
Introduction:
The term correlation is used by the common man without knowing that he is making use of the term
correlation. For example, when parents advise their children to work hard so that they may get good
marks, they are correlating good marks with hard work.
The study related to the characteristics of only one variable, such as height, weight, age, marks or
wages, is known as univariate analysis. The statistical analysis related to the study of the
relationship between two variables is known as bivariate analysis. Sometimes the variables may be
inter-related. In health sciences we study the relationship between blood pressure and age,
consumption level of some nutrient and weight gain, total income and medical expenditure, etc., the
nature and strength of the relationship may be examined by correlation and regression analysis. Thus
correlation refers to the relationship between two or more variables, e.g., the relation between the
height of father and son, yield and rainfall, wage and price index, shares and debentures etc.
Correlation is a statistical analysis which measures and analyses the degree or extent to which
two variables fluctuate with reference to each other. The word relationship is important. It
indicates that there is some connection between the variables. Correlation measures the closeness of
the relationship, but it does not indicate a cause and effect relationship. Price and supply, and income
and expenditure, are examples of correlated variables.
Definitions:
1. Croxton and Cowden, “When the relationship is of a quantitative nature, the
appropriate statistical tool for discovering and measuring the relationship and expressing it in a
brief formula is known as correlation”.
2. A.M. Tuttle, “Correlation is an analysis of the co-variation between two or more variables.”
3. W.A. Neiswanger, “Correlation analysis contributes to the understanding of economic
behavior, aids in locating the critically important variables on which others depend, may reveal to
the economist the connections by which disturbances spread and suggest to him the paths through
which stabilizing forces may become effective.”
4. L. R. Conner, "If two or more quantities vary in sympathy so that the movement in one tends to
be accompanied by corresponding movements in the others, then they are said to be correlated."
Uses of correlation:
1. It is used in physical and social sciences.
2. It is useful for economists to study the relationship between variables like price, quantity etc.
Businessmen estimate costs, sales, prices etc. using correlation.
3. It is helpful in measuring the degree of relationship between the variables like income and
expenditure, price and supply, supply and demand etc.
4. Sampling error can be calculated.
5. It is the basis for the concept of regression.
Types of Correlation
Correlation can be categorised as one of the following:
1. Positive and Negative.
2. Simple and Multiple.
3. Partial and Total.
4. Linear and Non-Linear (Curvilinear).
1. Positive and Negative Correlation: Positive or direct Correlation refers to the movement of
variables in the same direction. The correlation is said to be positive when the increase (decrease)
in the value of one variable is accompanied by an increase (decrease) in the value of other
variable also. Negative or inverse correlation refers to the movement of the variables in opposite
direction. Correlation is said to be negative, if an increase (decrease) in the value of one variable
is accompanied by a decrease (increase) in the value of other.
2. Simple and Multiple Correlation: Under simple correlation, we study the relationship between
two variables only i.e., between the yield of wheat and the amount of rainfall or between demand
and supply of a commodity. In case of multiple correlations, the relationship is studied among
three or more variables. For example, the relationship of yield of wheat may be studied with both
chemical fertilizers and the pesticides.
3. Partial and Total Correlation: There are two categories of multiple correlation analysis. Under
partial correlation, the relationship of two or more variables is studied in such a way that only
one dependent variable and one independent variable is considered and all others are kept
constant. For example, coefficient of correlation between yield of wheat and chemical fertilizers
excluding the effects of pesticides and manures is called partial correlation. Total correlation is
based upon all the variables.
4. Linear and Non-Linear Correlation: When the amount of change in one variable tends to keep
a constant ratio to the amount of change in the other variable, then the correlation is said to be
linear. But if the amount of change in one variable does not bear a constant ratio to the amount of
change in the other variable then the correlation is said to be non-linear. The distinction between
linear and non-linear is based upon the consistency of the ratio of change between the variables.
Scatter Diagram:
It is the simplest method of studying the relationship between two variables diagrammatically. One
variable is represented along the horizontal axis and the second variable along the vertical axis. For
each pair of observations of two variables, we put a dot in the plane. There are as many dots in the
plane as the number of paired observations of two variables. The direction of dots shows the scatter
or concentration of various points. This will show the type of correlation.
1. If all the plotted points form a straight line rising from the lower left hand corner to the upper
right hand corner, then there is perfect positive correlation. We denote this as r = +1.
2. If all the plotted dots lie on a straight line falling from upper left hand corner to lower right hand
corner, there is a perfect negative correlation between the two variables. In this case the
coefficient of correlation takes the value r = -1.
3. If the plotted points in the plane form a band and they show a rising trend from the lower left
hand corner to the upper right hand corner the two variables are highly positively correlated.
4. If the points fall in a narrow band from the upper left hand corner to the lower right hand corner,
there will be a high degree of negative correlation.
5. If the plotted points in the plane are spread all over the diagram there is no correlation between
the two variables.
Merits:
1. It is the simplest and most attractive method of finding the nature of correlation between two
variables.
2. It is a non-mathematical method of studying correlation. It is easy to understand.
3. It is not affected by extreme items.
4. It is the first step in finding out the relation between the two variables.
5. We can have a rough idea at a glance whether it is a positive correlation or negative correlation.
Demerits:
1. By this method we cannot get the exact degree of correlation between the two variables.
The calculation of Karl Pearson's coefficient can be simplified by dividing the given data by a
common factor. In such a case, the final result is not multiplied back by the common factor, because
the coefficient of correlation is independent of change of scale and origin.
Assumptions:
Karl Pearson based his formula on the following basic assumptions:
1. The two variables are affected by many independent causes and form a normal distribution.
2. A cause and effect relationship exists between the two variables.
3. The relationship between the two variables is linear. The coefficient is usually denoted by r.
Merits and Demerits of Pearson's method of studying correlation:
Merits
1. This method indicates the presence or absence of correlation between two variables and gives the
exact degree of their correlation.
2. In this method, we can also ascertain the direction of the correlation; positive, or negative.
3. This method has many algebraic properties for which the calculation of co-efficient of
correlation, and other related factors, are made easy.
Demerits:
1. It is more difficult to calculate than other methods of calculation.
2. It is much affected by the values of the extreme items.
3. It is based on many assumptions, such as a linear relationship and a cause and effect
relationship, which may not always hold good.
4. It is very much likely to be misinterpreted in case of homogeneous data.
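As a rough sketch of what the method computes, the following Python function implements the defining formula r = cov(x, y) / (σx σy) on made-up paired data (the rainfall and yield figures are illustrative only, not from the notes):

```python
def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Covariance of the two series about their means.
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = (sum((a - mx) ** 2 for a in x) / n) ** 0.5
    sy = (sum((b - my) ** 2 for b in y) / n) ** 0.5
    return cov / (sx * sy)

rainfall = [10, 20, 30, 40, 50]
yield_q = [12, 18, 33, 41, 46]
print(round(pearson_r(rainfall, yield_q), 3))  # about 0.985: high positive correlation
```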
TIME SERIES ANALYSIS
A time series is a sequence of well-defined data points measured at consistent time intervals over a
period of time. Data collected on an ad-hoc basis or irregularly does not form a time series. Time
series analysis is the use of statistical methods to analyze time series data and extract meaningful
statistics and characteristics about the data. Time series analysis helps us understand the
underlying forces leading to a particular trend in the time series data points and helps us in
forecasting and monitoring the data points by fitting appropriate models to them.
Statistical data which is recorded with its time of occurrence is called a time series. The
yearly output of wheat recorded for the last twenty five years, the weekly average price of eggs
recorded for the last 52 weeks, the monthly average sales of a firm recorded for the last 48 months or
the quarterly average profits recorded for the last 40 quarters etc., are examples of time series data. It
may be observed that this data undergoes changes with the passage of time. A number of factors can
be isolated which contribute to changes occurring over time in such a series. In the fields of
economics and business, data such as income, imports, exports, production, consumption, and prices
depend on time. All of these data are dependent on seasonal changes as well as regular cyclical
changes over the time period. To evaluate the changes in business and economics, the analysis of
time series plays an important role in this regard. It is necessary to associate time with time series,
because time is one basic variable in time series analysis.
1. Secular Trend or Long-Term Movement:
The secular trend is the smooth, long-term movement of a series. For example, climate variables
sometimes exhibit cyclic variation over a very long time period, such as 50 years. If one had just
20 years of data, this long-term oscillation would appear to be a trend, but if several hundred years
of data were available, the long-term oscillations would be visible. These movements are systematic
in nature: they are broad and steady, showing a slow rise or fall in the same direction. The trend
may be linear or non-linear
(curvilinear). Some examples of secular trend are: Increase in prices, Increase in pollution,
increase in the need of wheat, increase in literacy rate, and decrease in deaths due to advances in
science. Taking averages over a certain period is a simple way of detecting a trend in seasonal data.
A change in the averages over time is evidence of a trend in the given series, though there are more
formal tests for detecting a trend in a time series.
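A simple way to see this averaging idea in code is a short moving-average sketch (the yearly sales figures are illustrative):

```python
def moving_average(series, period=3):
    # Average each run of `period` consecutive observations.
    return [sum(series[i:i + period]) / period
            for i in range(len(series) - period + 1)]

sales = [120, 135, 128, 150, 162, 158, 175]
print(moving_average(sales))   # steadily rising averages indicate an upward trend
```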
2. Seasonal Variation or Seasonal Fluctuations:
Many time series exhibit a seasonal variation with an annual period, such as sales
and temperature readings. This type of variation is easy to understand and can be easily
measured or removed from the data to give de-seasonalized data. Seasonal
fluctuations describe any regular variation (fluctuation) with a period of less than one year; for
example, the costs of various types of fruits and vegetables, clothes, unemployment figures, average
daily rainfall, increase in the sale of tea in winter, increase in the sale of ice cream in summer etc.,
all show seasonal variations. The changes which repeat themselves within a fixed period are also
called seasonal variations, for example, traffic on roads in morning and evening hours, sales at
festivals like Eid, increase in the number of passengers at weekends etc. Seasonal
variations are caused by climate, social customs, religious activities etc.
3. Cyclical Variation or Cyclic Fluctuations:
Some time series exhibit periodic variation at a fixed period due to a physical cause, such as the
daily variation in temperature. More generally, cyclical variation is a non-seasonal component which
varies in a recognizable cycle. Some time series exhibit oscillations which do not have a fixed period
but are predictable to some extent, for example, economic data affected by business cycles with a
period varying between about 5 and 7 years. In weekly or monthly data, the cyclical component may
describe any such regular variation in the time series. The cyclical variations are periodic in nature
and repeat themselves like the business cycle, which has four phases:
(i) Peak (ii) Recession (iii) Trough/Depression (iv) Expansion.
4. Irregular Fluctuations:
When trend and cyclical variations are removed from a set of time series data, a residual is left,
which may or may not be random. Various techniques for analyzing series of this type examine
whether the irregular variation may be explained in terms of probability models such as moving
average or autoregressive models, i.e. we can check whether any cyclical variation is still left in
the residuals. The variations that occur due to sudden causes are called residual variations (irregular,
accidental, or erratic fluctuations) and are unpredictable, for example, a rise in the price of steel due
to a strike in the factory, an accident due to brake failure, floods, earthquakes, war, etc.
There is another model, called the Additive model, in which a particular observation in a time series
is the sum of these four components:
O = T + S + C + I
To prevent confusion between the two models, it should be made clear that in the Multiplicative
model S, C, and I are indices expressed as decimal ratios, whereas in the Additive model S, C, and I
are quantitative deviations about the trend that can be described as seasonal, cyclical, and irregular in
nature.
If in a multiplicative model T = 500, S = 1.4, C = 1.20, and I = 0.7, then
O = T × S × C × I
By substituting the values we get
O = 500 × 1.4 × 1.20 × 0.7 = 588
In the additive model, if T = 500, S = 100, C = 25, and I = –50, then
O = 500 + 100 + 25 – 50 = 575
The assumption underlying the two schemes of analysis is that, whereas there is no interaction among
the different constituents or components under the additive scheme, such interaction is very much
present in the multiplicative scheme. Time series analysis generally proceeds on the assumption of
the multiplicative formulation.
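The arithmetic of the two schemes can be verified in a few lines of Python. This is a minimal sketch
using the figures from the worked example above; only the standard library is needed.

```python
# Combining time-series components under the two schemes.
# Values are taken from the worked example above.
T = 500                                   # trend
S_mult, C_mult, I_mult = 1.4, 1.20, 0.7   # indices (decimal ratios)
S_add, C_add, I_add = 100, 25, -50        # deviations about the trend

O_mult = T * S_mult * C_mult * I_mult     # multiplicative: O = T x S x C x I
O_add = T + S_add + C_add + I_add         # additive: O = T + S + C + I

print(O_mult)   # 588.0
print(O_add)    # 575
```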
While fitting a trend line by the freehand method, an attempt should be made to ensure that the fitted
curve conforms to the following conditions:
The curve should be smooth, either a straight line or a combination of long gradual curves.
The trend line or curve should be drawn through the graph of the data in such a way that the
areas below and above the trend line are equal to each other.
The vertical deviations of the data above the trend line should equal the deviations below the
line.
The sum of the squares of the vertical deviations of the observations from the trend should be a
minimum.
Merits:
It is the simplest method for studying trend values, and the trend line is easy to draw.
A trend line drawn by a statistician experienced in computing trends may sometimes be
considered better than a trend line fitted by the use of a mathematical formula.
Although the freehand curve method is not recommended for beginners, it has considerable
merit in the hands of experienced statisticians and is widely used in applied situations.
Demerits:
This method is highly subjective, and the curve varies from person to person.
The work must be handled by skilled and experienced people.
Since the method is subjective, the predictions may not be reliable.
Drawing a trend line by this method requires great care.
Moving Average Method
Merits
1. This is a very simple method.
2. The element of flexibility is always present in this method, as the existing calculations do not have
to be altered if more data are added; the new data simply provide additional trend values.
3. If the period of the moving average coincides with the period of the cyclical fluctuations, the
fluctuations automatically disappear.
4. The moving average follows the general pattern of the data and is not affected by the analyst's
personal choice of curve.
5. It is especially useful for series having a strikingly irregular trend.
Limitations:
1. It is not possible to have a trend value for each and every year. As the period of the moving
average increases, the number of years for which trend values cannot be calculated also increases.
For example, in a five-yearly moving average, trend values cannot be obtained for the first two
and last two years; in a seven-yearly moving average, for the first three and last three years; and
so on. But the values of the extreme years are usually of great interest (illustrated in the sketch
below).
2. There is no hard and fast rule for the selection of the period of the moving average.
3. Forecasting is one of the leading objectives of trend analysis, but this objective remains
unfulfilled because the moving average is not represented by a mathematical function.
4. Theoretically it is claimed that cyclical fluctuations are ironed out if the period of the moving
average coincides with the period of the cycle, but in practice cycles are not perfectly periodic.
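The end-year problem noted in the first limitation is easy to see in code. The following is a minimal
sketch, assuming NumPy is available; the yearly figures are invented for illustration.

```python
import numpy as np

# Hypothetical yearly production figures (invented for illustration).
y = np.array([120, 132, 128, 140, 152, 149, 160, 171, 168, 180], dtype=float)

period = 5
weights = np.ones(period) / period
trend = np.convolve(y, weights, mode="valid")   # centred 5-yearly moving average

# With n = 10 observations and a 5-year period, only n - period + 1 = 6 trend
# values exist: none for the first two and the last two years, as noted above.
print(len(y), len(trend))   # 10 6
print(trend.round(1))
```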
Semi-Average Method
Merits:
This method is simple to understand compared with the moving average method and the method
of least squares.
It is an objective method of measuring trend, as everyone who applies it is bound to get the same
result.
Demerits:
The method assumes a straight-line relationship between the plotted points, regardless of
whether that relationship actually exists.
If more data are added to the original data, the whole calculation has to be done again to obtain
the new trend values, and the trend line also changes.
As the arithmetic mean of each half is calculated (the calculation is sketched below), an extreme
value in either half will greatly affect the two points, and hence the trend calculated through
these points may not be precise enough for forecasting the future.
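A minimal sketch of the semi-average calculation itself, with invented sales figures and NumPy
assumed: the series is split into two halves, the arithmetic mean of each half is placed at the
mid-point of that half, and the straight line through the two points gives the trend.

```python
import numpy as np

# Hypothetical sales for 8 years (invented for illustration).
years = np.arange(2011, 2019)
sales = np.array([40, 44, 42, 48, 52, 55, 58, 60], dtype=float)

half = len(sales) // 2
m1, m2 = sales[:half].mean(), sales[half:].mean()   # A.M. of each half

# Place the two means at the mid-points of each half and join them:
x1, x2 = years[:half].mean(), years[half:].mean()
slope = (m2 - m1) / (x2 - x1)
trend = m1 + slope * (years - x1)
print(trend.round(2))
```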
is known as the method of least squares. It may be mentioned that a line fitted to satisfy the second
condition will automatically satisfy the first condition.
The formula for a straight-line trend can most simply be expressed as
Yc = a + bX
where X represents the time variable, Yc is the dependent variable for which trend values are to be
calculated, and a and b are the constants of the straight line to be found by the method of least
squares.
The constant a is the Y-intercept, i.e. the distance between the origin (O) and the point where the
trend line intersects the Y-axis; it shows the value of Y when X = 0. The constant b indicates the
slope, i.e. the change in Y for each unit change in X. Suppose we are given observations of Y for n
years and wish to find the values of the constants a and b such that the two conditions laid down
above are satisfied by the fitted equation; a and b are then obtained by solving the normal equations
∑Y = na + b∑X and ∑XY = a∑X + b∑X².
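The normal equations above can be solved directly. A minimal sketch in Python, assuming NumPy
is available; the yearly values are invented for illustration.

```python
import numpy as np

# Hypothetical yearly values (invented). X is coded time: 0, 1, 2, ...
X = np.arange(7, dtype=float)
Y = np.array([83, 92, 71, 90, 169, 200, 222], dtype=float)

n = len(X)
# Solve the normal equations:  sum(Y)  = n*a      + b*sum(X)
#                              sum(XY) = a*sum(X) + b*sum(X^2)
A = np.array([[n, X.sum()], [X.sum(), (X**2).sum()]])
rhs = np.array([Y.sum(), (X * Y).sum()])
a, b = np.linalg.solve(A, rhs)

trend = a + b * X          # Yc = a + bX, one trend value per period
print(round(a, 2), round(b, 2))
```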
Merits:
This is a mathematical method of measuring trend, and as such there is no possibility of
subjectivity, i.e. everyone who uses this method will get the same trend line.
The line obtained by this method is called the line of best fit.
Trend values can be obtained for all the given time periods in the series.
Demerits:
Great care must be exercised in selecting the type of trend curve to be fitted, i.e. linear,
parabolic, or some other type; carelessness in this respect may lead to wrong results.
The method is more tedious and time-consuming.
Predictions are based only on long-term variations, i.e. the trend; the impact of cyclical, seasonal,
and irregular variations is ignored.
REGRESSION ANALYSIS
Regression analysis is a statistical tool used to determine the probable change in one variable
for a given amount of change in another. This means that the value of an unknown variable can be
estimated from the known value of another variable. The degree to which the variables are related is
summarized by the regression line: the single line that best fits the data, i.e. the line drawn through
the plotted points in such a manner that its overall distance from the points is the smallest.
Since regression also describes the relationship between two or more variables, what is the
difference between regression and correlation? There are two important points of difference between
correlation and regression:
The correlation coefficient measures the "degree of relationship" between variables, say X and
Y, whereas regression analysis studies the "nature of the relationship" between the variables.
The correlation coefficient does not clearly indicate the cause-and-effect relationship between
the variables, i.e. it cannot be said with certainty which variable is the cause and which is the
effect, whereas regression analysis clearly indicates the cause-and-effect relationship between
the variables.
Regression analysis is widely used in all scientific disciplines. In economics, it plays a
significant role in measuring or estimating the relationship among economic variables. For example,
the two variables price (X) and demand (Y) are closely related to each other, so we can find the
probable value of X from a given value of Y, and similarly the probable value of Y from a given
value of X.
Given a relationship between two variables, we may be interested in estimating (predicting)
the value of one variable from the value of the other. The variable predicted on the basis of the other
variable is called the "dependent" or "explained" variable, and the other the "independent" or
"predicting" variable. The prediction is based on the average relationship derived statistically by
regression analysis. The equation, linear or otherwise, is called the regression equation or the
explaining equation.
For example, if we know that advertising and sales are correlated, we may find the expected
amount of sales for a given advertising expenditure, or the required amount of expenditure for
attaining a given amount of sales.
Similar relationships can be considered between, say, rainfall and agricultural production, the
price of an input and the overall cost of a product, or consumer expenditure and disposable income.
Thus, regression analysis reveals the average relationship between two variables, and this makes
estimation or prediction possible.
Types of Regression:
Regression analysis can be classified into:
a) Simple and Multiple
b) Linear and Non –Linear
c) Total and Partial
Differences between Correlation and Regression:
Correlation measures the degree of relationship between the variables, whereas regression
measures the nature of the relationship between the variables.
Correlation has limited application, because it is confined only to the linear relationship between
the variables, whereas regression has wider application, as it studies both linear and nonlinear
relationships between the variables.
Correlation is not very useful for further mathematical treatment, whereas regression is widely
used for further mathematical treatment.
If the coefficient of correlation is positive, the two variables are positively correlated, and
vice-versa; the regression coefficient likewise indicates whether an increase in one variable is
associated with an increase or a decrease in the other variable.
The regression equation is the algebraic expression of a regression line. It is used to predict the
values of the dependent variable from the given values of the independent variable. If we take two
regression lines, say Y on X and X on Y, then there will be two regression equations:
Regression Equation of Y on X: This is used to describe the variations in the value of Y for the
given changes in the values of X. It can be expressed as follows:
Yc = a + bX
where Yc is the dependent variable, X is the independent variable, and a and b are the two unknown
constants that determine the position of the line. The parameter "a" tells about the level of the fitted
line, i.e. the distance of the line above or below the origin, and the parameter "b" tells about the slope
of the line, i.e. the change in the value of Y for one unit of change in X.
The values of "a" and "b" can be obtained by the method of least squares, according to which
the line should be drawn through the plotted points in such a manner that the sum of the squares of
the vertical deviations of the actual Y from the estimated values of Y is the least, i.e. a best-fitted line
is obtained when ∑(Y − Yc)² is a minimum.
The following normal equations can be solved simultaneously to obtain the values of the
parameters "a" and "b":
∑Y = na + b∑X
∑XY = a∑X + b∑X²
Regression Equation of X on Y: This is used to describe the variations in the value of X for the
given changes in the value of Y. It can be expressed as follows:
Xc = a + bY
where Xc is the dependent variable and Y is the independent variable. The parameters "a" and "b"
are the two unknown constants. Again, "a" tells about the level of the fitted line and "b" tells about
the slope, i.e. the change in the value of X for a unit change in the value of Y.
The following normal equations can be solved simultaneously to obtain the values of both
parameters:
∑X = na + b∑Y
∑XY = a∑Y + b∑Y²
Note: The line can be completely determined only if the values of the constant parameters "a" and
"b" are obtained.
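A short sketch, with invented data and NumPy assumed, that fits both regression equations and
verifies two properties discussed below: the two lines intersect at the means of X and Y, and the
product of the two regression coefficients equals r².

```python
import numpy as np

# Invented paired observations.
X = np.array([2, 4, 6, 8, 10], dtype=float)
Y = np.array([5, 7, 6, 9, 11], dtype=float)

# Regression of Y on X:  Yc = a + bX, with b = cov(X, Y) / var(X)
b_yx = np.cov(X, Y, bias=True)[0, 1] / X.var()
a_yx = Y.mean() - b_yx * X.mean()

# Regression of X on Y:  Xc = a' + b'Y
b_xy = np.cov(X, Y, bias=True)[0, 1] / Y.var()
a_xy = X.mean() - b_xy * Y.mean()

# Both lines pass through (mean of X, mean of Y):
print(a_yx + b_yx * X.mean(), Y.mean())   # equal
print(a_xy + b_xy * Y.mean(), X.mean())   # equal

# The product of the two regression coefficients equals r squared:
r = np.corrcoef(X, Y)[0, 1]
print(round(b_yx * b_xy, 4), round(r**2, 4))
```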
Regression Lines:
The regression line is the line that best fits the data, such that the overall distance from the line to
the points (variable values) plotted on a graph is the smallest. In other words, a line used to minimize
the squared deviations of predictions is called the regression line.
There are as many regression lines as there are variables. Suppose we take two variables,
say X and Y; then there will be two regression lines:
Regression line of Y on X: This gives the most probable values of Y for the given values of X.
Regression line of X on Y: This gives the most probable values of X for the given values of Y.
The algebraic expressions of these regression lines are called regression equations; there will be
two regression equations for the two regression lines.
The correlation between the variables depends on the distance between these two regression
lines: the nearer the regression lines are to each other, the higher the degree of correlation, and
the farther apart they are, the lower the degree of correlation.
The correlation is said to be perfect, positive or negative, when the two regression lines
coincide, i.e. only one line exists. If the variables are independent, the correlation is zero, and the
lines of regression are at right angles, i.e. parallel to the X-axis and the Y-axis.
Note: The regression lines cut each other at the point of the averages of X and Y. That is, if a
perpendicular is drawn from the point where the lines intersect to the X-axis, it will meet the axis
at the mean value of X; similarly, a perpendicular drawn from the point of intersection to the Y-axis
will meet it at the mean value of Y.
For a regression analysis of two variables there are two regression lines, namely Y on X and X
on Y, and the two lines show the average relationship between the variables. For perfect correlation,
positive or negative, i.e. r = ±1, the two lines coincide, i.e. we find only one straight line. If r = 0,
i.e. the variables are independent, the two lines cut each other at a right angle; in this case the two
lines are parallel to the X- and Y-axes.
Lastly, the two lines intersect at the point of the means of X and Y. From this point of
intersection, a perpendicular drawn to the X-axis meets it at the mean value of X. Similarly, a
perpendicular drawn from the point of intersection of the two regression lines to the Y-axis meets it
at the mean value of Y.
ANALYSIS OF VARIANCE
o One way Analysis
o Two way Analysis
CHI-SQUARE DISTRIBUTION
o Test for specified Variance
o Test for Independence of Attributes
Normal Distribution:
A normal distribution is an arrangement of a data set in which most values cluster in the middle of
the range and the rest taper off symmetrically toward either extreme. Height is one simple example
of something that follows a normal distribution pattern: most people are of average height, the
numbers of people that are taller and shorter than average are fairly equal, and a very small (and still
roughly equivalent) number of people are either extremely tall or extremely short.
[Figure: example of a normal distribution curve]
A graphical representation of a normal distribution is sometimes called a bell curve because of its
characteristic flared, bell-like shape. The precise shape can vary according to the distribution of the
population, but the peak is always in the middle and the curve is always symmetrical. In a normal
distribution, the mean, mode, and median are all the same.
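A quick simulation, assuming NumPy is available, illustrates these properties on invented data:

```python
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170, scale=8, size=100_000)  # simulated heights (cm)

# For a normal distribution, the mean and the median coincide and the
# distribution is symmetric about them.
print(round(heights.mean(), 2), round(np.median(heights), 2))

# About 68% of values fall within one standard deviation of the mean:
within_1sd = np.mean(np.abs(heights - 170) <= 8)
print(round(within_1sd, 3))   # approximately 0.683
```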
One-Tailed Test:
A one-tailed test is a statistical test in which the critical area of a distribution is one-sided, so that it
is either greater than or less than a certain value, but not both. If the sample being tested falls into
the one-sided critical area, the alternative hypothesis is accepted instead of the null hypothesis. A
one-tailed test is also known as a directional hypothesis or directional test.
If you are using a significance level of 0.05, a one-tailed test allots your entire alpha to testing
the statistical significance in the one direction of interest. This means that 0.05 is in one tail of the
distribution of your test statistic. When using a one-tailed test, you are testing for the possibility of
the relationship in one direction and completely disregarding the possibility of a relationship in the
other direction. Let's return to our example comparing the mean of a sample to a given value x using
a t-test. Our null hypothesis is that the mean is equal to x. A one-tailed test will test either whether
the mean is significantly greater than x or whether the mean is significantly less than x, but not both.
Then, depending on the chosen tail, the mean is significantly greater than or less than x if the test
statistic is in the top 5% or the bottom 5% of its probability distribution, resulting in a p-value less
than 0.05. The one-tailed test provides more power to detect an effect in one direction by not testing
the effect in the other direction.
Because the one-tailed test provides more power to detect an effect, you may be tempted to use a
one-tailed test whenever you have a hypothesis about the direction of an effect. Before doing so,
consider the consequences of missing an effect in the other direction. Imagine you have developed a
new drug that you believe is an improvement over an existing drug. You wish to maximize your
ability to detect the improvement, so you opt for a one-tailed test. In doing so, you fail to test for the
possibility that the new drug is less effective than the existing drug. The consequences in this
example are extreme, but they illustrate a danger of inappropriate use of a one-tailed test.
Choosing a one-tailed test for the sole purpose of attaining significance is not
appropriate. Choosing a one-tailed test after running a two-tailed test that failed to reject the null
hypothesis is not appropriate, no matter how "close" to significant the two-tailed test was. Using
statistical tests inappropriately can lead to invalid results that are not replicable and highly
questionable: a steep price to pay for a significance star in a results table.
Two-Tailed Test: A two-tailed test is a statistical test in which the critical area of a distribution is
two-sided and tests whether a sample is greater than or less than a certain range of values. If the
sample being tested falls into either of the critical areas, the alternative hypothesis is accepted instead
of the null hypothesis. The two-tailed test gets its name from testing the area under both of the tails
of a normal distribution, although the test can be used in other non-normal distributions. If you are
using a significance level of 0.05, a two-tailed test allots half of your alpha to testing the statistical
significance in one direction and half of your alpha to testing statistical significance in the other
direction. This means that .025 is in each tail of the distribution of your test statistic. When using a
two-tailed test, regardless of the direction of the relationship you hypothesize, you are testing for the
possibility of the relationship in both directions. For example, we may wish to compare the mean of
a sample to a given value x using a t-test. Our null hypothesis is that the mean is equal to x. A two-
tailed test will test both whether the mean is significantly greater than x and whether the mean is
significantly less than x. The mean is considered significantly different from x if the test statistic is
in the top 2.5% or bottom 2.5% of its probability distribution, resulting in a p-value less than 0.05.
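A minimal sketch contrasting the two p-values, assuming SciPy is available. The sample is
simulated, and the one-tailed p-value is obtained by halving the two-tailed one in the hypothesized
direction.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=52, scale=10, size=30)   # invented sample
x = 50                                           # hypothesized mean

# Two-tailed test: H1 is mean != x, with alpha split 0.025 per tail.
t_stat, p_two = stats.ttest_1samp(sample, popmean=x)

# One-tailed test (H1: mean > x): all of alpha (0.05) in the upper tail.
p_one = p_two / 2 if t_stat > 0 else 1 - p_two / 2

print(round(t_stat, 3), round(p_two, 4), round(p_one, 4))
```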
Dependent and independent samples: measure the blood pressure of a single group of people before
and after they take a drug, then compare the two sets of readings. These two samples are dependent,
because the same people are measured twice: the person with the highest blood pressure in the first
sample will likely have the highest blood pressure in the second sample.
Give one group of people an active drug and give a different group of people an inactive placebo,
then compare the blood pressures between the groups. These two samples would likely be
independent because the measurements are from different people. Knowing something about the
distribution of values in the first sample doesn't inform you about the distribution of values in the
second.
Properties of the t-Distribution:
1. Like the standard normal distribution, the Student's t-distribution is bell-shaped and symmetrical,
with mean zero.
2. The t-distribution ranges from –∞ to +∞.
3. The shape of the t-distribution changes with the degrees of freedom.
4. The variance is always greater than one and is defined only for degrees of freedom ν ≥ 3; it is
given by Var(t) = ν / (ν − 2).
5. It is less peaked at the center and heavier in the tails than the normal distribution.
6. The t-distribution has a greater dispersion than the standard normal distribution, and as the
sample size n increases it approaches the normal distribution. The sample size is said to be
large when n ≥ 30.
Degrees of Freedom
It refers to the number of values involved in the calculations that have the freedom to vary. In other
words, the degrees of freedom, in general, can be defined as the total number of observations minus
the number of independent constraints imposed on the observations.
The degrees of freedom are calculated for the following statistical tests to check their validity:
1. t-Distribution
2. F- Distribution
3. Chi-Square Distribution
These tests are usually done to compare the observed data with the data that is expected to be
obtained with a specific hypothesis.
It is usually denoted by the Greek symbol ν (nu) and is commonly abbreviated as df. The statistical
formula to compute the degrees of freedom is quite simple: it is equal to the number of values in the
data set minus one. Symbolically:
df = n − 1
Where n is the number of values in the data set or the sample size. The concept of df can be further
understood through an illustration given below:
Suppose there is a data set X that includes the values: 10, 20, 30, 40. First of all, we will calculate the
mean of these values, which is equal to:
(10+20+30+40) /4 = 25.
Once the mean is calculated, apply the formula for degrees of freedom. As the number of values in
the data set (the sample size) is 4,
df = 4 − 1 = 3.
Thus, this shows that there are three values in the data set that have the freedom to vary as long as
the mean is 25.
APPLICATION-1
TEST OF HYPOTHESIS OF THE POPULATION MEAN
When the population is normally distributed and the population standard deviation σ is unknown, the
t statistic is calculated as:
t = (X̄ − μ) √n / S,   where   S = √[ ∑(X − X̄)² / (n − 1) ]
Here X̄ = sample mean, μ = population mean, n = sample size, and S = standard deviation of the
sample.
The null hypothesis is tested to check whether there is a significant difference between X̄ and µ. If
the calculated value of t exceeds the table value of t at the specified significance level, the null
hypothesis is rejected and the difference between X̄ and µ is considered significant. On the other
hand, if the calculated value of t is less than the table value of t, the null hypothesis is accepted. It is
to be noted that this test is based on n − 1 degrees of freedom.
Confidence limits for the population mean:
95% limits = X̄ ± t(0.05) S/√n        99% limits = X̄ ± t(0.01) S/√n
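A minimal sketch of this test and the confidence limits, with an invented sample and SciPy assumed
for the table value of t:

```python
import numpy as np
from scipy import stats

sample = np.array([48, 52, 55, 47, 53, 51, 50, 54], dtype=float)  # invented
mu = 50                                    # hypothesized population mean

n = len(sample)
x_bar = sample.mean()
S = sample.std(ddof=1)                     # S = sqrt(sum((X - X_bar)^2)/(n - 1))
t_calc = (x_bar - mu) * np.sqrt(n) / S     # as in the formula above

t_table = stats.t.ppf(0.975, df=n - 1)     # two-tailed 5% critical value
print(round(t_calc, 3), round(t_table, 3), abs(t_calc) > t_table)

# 95% confidence limits for the population mean:
lo = x_bar - t_table * S / np.sqrt(n)
hi = x_bar + t_table * S / np.sqrt(n)
print(round(lo, 2), round(hi, 2))
```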
APPLICATION-2
TEST OF HYPOTHESIS OF THE DIFFERENCE BETWEEN TWO MEANS
In testing a hypothesis about the difference between two means drawn from two normal populations
whose variances are unknown but assumed equal, the t statistic is calculated as:
t = [ (X̄1 − X̄2) / S ] √[ n1 n2 / (n1 + n2) ]
where
S = √[ ( ∑(X1 − X̄1)² + ∑(X2 − X̄2)² ) / (n1 + n2 − 2) ]
Here X̄1 and X̄2 are the sample means of sample 1 of size n1 and sample 2 of size n2, and S is the
common (pooled) standard deviation obtained by pooling the data from both samples.
The null hypothesis, that there is no difference between the two means, is accepted when the
calculated value of t at the specified significance level is less than the table value of t, and rejected
when the calculated value exceeds the table value.
When the population variances cannot be assumed equal, the following statistic is used instead:
t = [ (X̄1 − X̄2) − (µ1 − µ2) ] / √( s1²/n1 + s2²/n2 )
with modified degrees of freedom
d.f. = ( s1²/n1 + s2²/n2 )² / [ (s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1) ]
where µ1 and µ2 are the two population means.
This statistic may not strictly follow the t-distribution, but it can be approximated by a
t-distribution with the modified value of the degrees of freedom given above.
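Both forms of the test are available in SciPy. A minimal sketch with invented samples:
equal_var=True gives the pooled statistic of the first formula, and equal_var=False gives the second
(Welch) statistic with the modified degrees of freedom.

```python
import numpy as np
from scipy import stats

x1 = np.array([23, 25, 28, 30, 26, 27], dtype=float)     # invented samples
x2 = np.array([20, 22, 25, 24, 21, 23, 26], dtype=float)

# Pooled test (assumes equal population variances), as in the first formula:
t_pooled, p_pooled = stats.ttest_ind(x1, x2, equal_var=True)

# Welch's test (unequal variances), matching the second statistic with the
# modified degrees of freedom:
t_welch, p_welch = stats.ttest_ind(x1, x2, equal_var=False)

print(round(t_pooled, 3), round(p_pooled, 4))
print(round(t_welch, 3), round(p_welch, 4))
```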
APPLICATION 3
TEST OF HYPOTHESIS OF THE DIFFERENCE BETWEEN TWO MEANS WITH
DEPENDENT SAMPLES
In several situations the samples are drawn from two populations that are dependent on each other.
The samples are then said to be dependent, as each observation in the first sample is associated with
a particular observation in the second sample. Because of this pairing, the t-test used here is called
the paired t-test.
This test is applied when before-and-after experiments are to be compared, or when two related
methods are applied to the same subjects. The following statistic is used to test the null hypothesis
that the two means are equal, i.e. µ1 = µ2:
t = d̄ √n / S
This statistic follows the t-distribution with (n − 1) degrees of freedom, where d̄ is the mean of the
differences, calculated as:
d̄ = ∑d / n
S is the standard deviation of differences and is calculated by applying the following formula:
S = √[ ∑d²/(n − 1) − (∑d)² / (n(n − 1)) ]
n = Number of paired observations.
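A minimal sketch of the paired t-test, with invented before-and-after scores, computing the statistic
both from the formula above and with SciPy's ttest_rel:

```python
import numpy as np
from scipy import stats

# Invented before-and-after scores for the same 8 subjects.
before = np.array([72, 68, 75, 71, 69, 74, 70, 73], dtype=float)
after  = np.array([75, 70, 78, 72, 74, 76, 71, 77], dtype=float)

d = after - before
n = len(d)
d_bar = d.mean()                        # d_bar = sum(d) / n
S = d.std(ddof=1)                       # standard deviation of the differences
t_manual = d_bar * np.sqrt(n) / S       # t = d_bar * sqrt(n) / S

t_scipy, p = stats.ttest_rel(after, before)
print(round(t_manual, 3), round(t_scipy, 3), round(p, 4))  # t values agree
```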
APPLICATION 4
TEST OF HYPOTHESIS ABOUT THE COEFFICIENT OF CORRELATION
There are three cases of testing a hypothesis about the coefficient of correlation. These are:
Case 1:
When the population coefficient of correlation is zero, i.e. ρ = 0. The coefficient of correlation
measures the degree of relationship between the variables; when ρ = 0, there is no statistical
relationship between the variables. To test the null hypothesis that there is no correlation in the
population, the sample coefficient of correlation r must be known. The test statistic to be used is:
t = r √(n − 2) / √(1 − r²)
which follows the t-distribution with n − 2 degrees of freedom.
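A minimal sketch of this test, assuming SciPy for the p-value; r and n are invented:

```python
import numpy as np
from scipy import stats

r, n = 0.6, 27          # sample correlation and sample size (invented)

t_calc = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)   # t = r*sqrt(n-2)/sqrt(1-r^2)
p = 2 * stats.t.sf(abs(t_calc), df=n - 2)         # two-tailed p-value

print(round(t_calc, 3), round(p, 4))
```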
Case 2:
When the population coefficient of correlation is equal to some value other than zero, i.e. ρ ≠ 0. In
this case a test based on the t-distribution will not be correct, and the hypothesis is tested using
Fisher's z-transformation, in which r is transformed into z by:
z = ½ logₑ [(1 + r) / (1 − r)]
Here logₑ is the natural logarithm. A common logarithm can be converted to a natural logarithm by
multiplying by the factor 2.3026, i.e. logₑ X = 2.3026 log₁₀ X for positive X. Since ½ × 2.3026 =
1.1513, the transformation can also be written as:
z = 1.1513 log₁₀ [(1 + r) / (1 − r)]
The statistic z is approximately normally distributed, and the test is more appropriate as long as the
sample size is large.
Case 3:
When the hypothesis is tested for the difference between two independent correlation coefficients.
To test the hypothesis that two correlations derived from two separate samples are equal, the
difference of the two corresponding values of z is compared with the standard error of the difference.
The following statistic is used:
Z = (z1 − z2) / √[ 1/(n1 − 3) + 1/(n2 − 3) ]
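A minimal sketch of Cases 2 and 3, assuming NumPy and SciPy; all correlations and sample sizes
are invented. Note that np.arctanh(r) computes exactly Fisher's z = ½ logₑ[(1 + r)/(1 − r)].

```python
import numpy as np
from scipy import stats

# Case 2: test H0: rho = 0.5 against a sample r = 0.7 with n = 39 (invented).
r, rho0, n = 0.7, 0.5, 39
z_r = np.arctanh(r)                   # Fisher's z-transformation of r
z_rho = np.arctanh(rho0)
Z = (z_r - z_rho) * np.sqrt(n - 3)    # standard error of z is 1/sqrt(n - 3)
print(round(Z, 3), round(2 * stats.norm.sf(abs(Z)), 4))

# Case 3: difference between two independent correlations (invented values).
r1, n1 = 0.8, 50
r2, n2 = 0.6, 40
Z = (np.arctanh(r1) - np.arctanh(r2)) / np.sqrt(1/(n1 - 3) + 1/(n2 - 3))
print(round(Z, 3), round(2 * stats.norm.sf(abs(Z)), 4))
```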
An ANOVA test is a way to find out whether survey or experiment results are significant: in other
words, it helps you decide whether to reject the null hypothesis or accept the alternative hypothesis.
Basically, you are testing groups to see if there is a difference between them.
Assumptions of ANOVA:
1. Normality of Errors
We assume in the ANOVA model that the error terms are normally distributed with zero mean. If
the data are not normally distributed, but instead come from some other distribution (exponential
or binomial, for example), then we may not be able to trust our p-values, which were built by
assuming normality. Informally: "If I were to repeat my sampling and calculate the means, those
means would be normally distributed."
2. Equal Error Variance Across Treatments
The next assumption is that all error terms have the same variance σ². It is common to see that
different treatments seem to result in different means and also different variances.
3. Independence of Errors
We assume that each trial is independent of all other trials, except for the effect τi of the
treatment on the mean. Statistical independence of two trials means that knowing the result of
one trial does not change the distribution of the other trial. The most common cause of
dependence in experimental data is confounding factors: something measured or unmeasured
that affects the experiment. Randomization is a critical technique in experimental design because
it can minimize the effect of any confounder. Informally: "Your samples have to come from a
randomized or randomly sampled design."
Types of Tests
There are two main types: one-way and two-way. Two-way tests can be with or without replication.
1. One-way ANOVA between groups: used when you want to test two or more groups to see if
there is a difference between them.
2. Two-way ANOVA without replication: used when you have one group and you are double-
testing that same group, for example, testing one set of individuals before and after they take a
medication to see whether it works.
3. Two-way ANOVA with replication: two or more groups whose members are doing more than
one thing, for example, two groups of patients from different hospitals trying two different
therapies.
ONE-WAY ANOVA
The one-way analysis of variance (ANOVA) is used to determine whether there are any statistically
significant differences between the means of two or more independent (unrelated) groups (although
you tend to only see it used when there are a minimum of three, rather than two groups). For
example, you could use a one-way ANOVA to understand whether exam performance differed based
on test anxiety levels amongst students, dividing students into three independent groups (e.g., low,
medium and high-stressed students). Also, it is important to realize that the one-way ANOVA is
an omnibus test statistic and cannot tell you which specific groups were statistically significantly
different from each other; it only tells you that at least two groups were different. Since you may
have three, four, five or more groups in your study design, determining which of these groups differ
from each other is important.
Common Uses
The One-Way ANOVA is often used to analyze data from the following types of studies:
Field studies
Experiments
Quasi-experiments
Note: Both the One-Way ANOVA and the Independent Samples t Test can compare the means for
two groups. However, only the One-Way ANOVA can compare the means across three or
more groups.
Note: If the grouping variable has only two groups, the results of a one-way ANOVA and the
independent samples t test will be equivalent. In fact, if you run both an independent samples
t test and a one-way ANOVA in this situation, you should be able to confirm that t² = F.
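A minimal sketch with invented scores for the three anxiety groups, assuming SciPy; it also
confirms the t² = F relationship for the two-group case noted above.

```python
import numpy as np
from scipy import stats

# Invented exam scores for low-, medium-, and high-anxiety groups.
low    = np.array([78, 82, 85, 80, 79], dtype=float)
medium = np.array([72, 75, 70, 74, 73], dtype=float)
high   = np.array([65, 68, 62, 66, 64], dtype=float)

F, p = stats.f_oneway(low, medium, high)
print(round(F, 3), round(p, 5))   # omnibus test across the three groups

# With only two groups, one-way ANOVA and the pooled t-test agree: t^2 = F.
t, _ = stats.ttest_ind(low, medium, equal_var=True)
F2, _ = stats.f_oneway(low, medium)
print(round(t**2, 4), round(F2, 4))   # equal
```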
For example, you might want to find out whether there is an interaction between income and gender
for anxiety level at job interviews. The anxiety level is the outcome, i.e. the variable that is
measured. Gender and income are the two categorical variables; these are the independent variables,
which are called factors in a two-way ANOVA.
The factors can be split into levels. In the above example, income could be split into three
levels: low, middle, and high income. Gender could be split into three levels: male, female, and
transgender. Treatment groups are formed from all possible combinations of the factors; in this
example there would be 3 × 3 = 9 treatment groups.
Main Effect: The main effect involves the independent variables one at a time; the interaction is
ignored for this part. Just the rows or just the columns are used, not mixed. This is the part which is
similar to the one-way analysis of variance, and each of the variances calculated to analyze the main
effects is like the between-groups variance.
Interaction Effect: The interaction effect is the effect that one factor has on the other factor. The
degrees of freedom here is the product of the two degrees of freedom for each factor.
Within Variation: The within variation is the sum of squares within each treatment group. Each
treatment group contributes one less than its sample size (remember, all treatment groups must have
the same sample size for a two-way ANOVA). The total number of treatment groups is the product
of the number of levels for each factor. The within variance is the within variation divided by its
degrees of freedom. The within group is also called the error.
F-Tests: There is an F-test for each of the hypotheses: the test statistic is the mean square for each
main effect and the interaction effect divided by the within variance. The numerator degrees of
freedom come from each effect, and the denominator degrees of freedom is the degrees of freedom
for the within variance in each case.
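For computation, a two-way ANOVA table can be produced with the statsmodels package (an
assumption; it is not part of the standard library). A minimal sketch with invented, balanced data for
the gender × income example above:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Hypothetical balanced design: 2 observations per gender-by-income cell.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "gender": np.repeat(["male", "female", "transgender"], 6),
    "income": np.tile(np.repeat(["low", "middle", "high"], 2), 3),
})
df["anxiety"] = rng.normal(loc=50, scale=5, size=len(df))

# Main effects and the interaction, each tested with its own F-test:
model = ols("anxiety ~ C(gender) * C(income)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```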
CHI-SQUARE DISTRIBUTION
A chi-square distribution is the distribution of the sum of squares of k independent standard normal
random variables, with k degrees of freedom. A chi-square test is a statistical hypothesis test in
which the sampling distribution of the test statistic is a chi-square distribution when the null
hypothesis is true. The distribution was first introduced by the German statistician Friedrich Robert
Helmert and was later used by Karl Pearson in 1900. The most popular chi-square test is Pearson's
chi-square test, also called the "chi-squared" test and denoted by χ². A classical example of a
chi-square test is the test for fairness of a die, where we test the hypothesis that all six possible
outcomes are equally likely.
Definitions
Chi-square distribution: A distribution obtained by multiplying the ratio of the sample variance to
the population variance by the degrees of freedom, when random samples are selected from a
normally distributed population.
Contingency Table: Data arranged in table form for the chi-square independence test.
Expected Frequency: The frequencies obtained by calculation.
Goodness-of-fit Test: A test to see if a sample comes from a population with the given distribution.
Independence Test: A test to see if the row and column variables are independent.
Observed Frequency: The frequencies obtained by observation; these are the sample frequencies.
Uses
The chi-square distribution has many uses in statistics, including:
Confidence interval estimation for a population standard deviation of a normal distribution from
a sample standard deviation.
Testing the independence of two criteria of classification of qualitative variables.
Studying relationships between categorical variables (contingency tables; see the sketch below).
Studying the sample variance when the underlying distribution is normal.
Testing deviations of differences between expected and observed frequencies (one-way tables).
The chi-square goodness-of-fit test.
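A minimal sketch of the independence test, assuming SciPy; the contingency table is invented.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Invented contingency table: rows = gender, columns = product preference.
observed = np.array([[30, 20, 10],
                     [25, 25, 10]])

chi2, p, dof, expected = chi2_contingency(observed)
print(round(chi2, 3), round(p, 4), dof)   # dof = (rows-1)*(cols-1) = 2
print(expected.round(1))                  # expected frequencies under H0
```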
The test statistic has a chi-square distribution when the following assumptions are met:
The data are obtained from a random sample.
The expected frequency of each category must be at least 5.
The chi-square statistic, χ² = ∑(O − E)²/E, divides each squared deviation by the expected
frequency, so the expected frequency is used to weight the frequencies. A difference of 10 may be
very significant if 12 was the expected frequency, but a difference of 10 is not very significant at all
if the expected frequency was 1200. If the sum of these weighted squared deviations is small, the
observed frequencies are close to the expected frequencies and there is no reason to reject the claim
that the sample came from the hypothesized distribution. Only when the sum is large is there reason
to question the distribution. Therefore, the chi-square goodness-of-fit test is always a right-tailed
test.
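A minimal sketch of the goodness-of-fit test for the die-fairness example mentioned earlier,
assuming SciPy; the observed counts are invented.

```python
import numpy as np
from scipy.stats import chisquare

# Classical example: test a die for fairness over 120 rolls.
observed = np.array([18, 22, 16, 25, 19, 20])
expected = np.full(6, 120 / 6)            # 20 per face under H0

chi2, p = chisquare(f_obs=observed, f_exp=expected)
print(round(chi2, 3), round(p, 4))        # right-tailed test, df = 5
```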
The test statistic has a chi-square distribution when the following assumptions are met:
The data are obtained from a random sample.
The expected frequency of each category must be at least 5. This goes back to the requirement
that the data be approximately normally distributed: the goodness-of-fit test approximates a
multinomial experiment (a discrete distribution) with the chi-square distribution (a continuous
distribution), and if each expected frequency is at least five, the normal approximation can be
used (much like the normal approximation to the binomial). If the expected