
Statistical Analysis in Research Module

This document provides an overview of survey and experimental study concepts for statistical analysis in research. It discusses the key differences between survey and designed experimental studies, including their objectives and how the researcher's role differs. The document outlines important considerations for designing survey and experimental studies, such as developing questionnaires, determining sample size, selecting sampling techniques, and identifying key components of an experimental design like treatments, experimental units, randomization, and replication.

Uploaded by

Eric Cabrera
Copyright
© Attribution Non-Commercial (BY-NC)

EDEL812 STATISTICAL ANALYSIS IN RESEARCH

MODULE

UNIVERSITY OF KWAZULU-NATAL PIETERMARITZBURG CAMPUS

2002

Compiled by:
Peter M Njuho, Ph.D. Senior Lecturer School of Statistics and Actuarial Science University of KwaZulu-Natal Private Bag X01 P O Box X01 Scottsville 3209 South Africa

Statistical Analysis in Research Module E-mail: [email protected]

PMNjuho

EDEL 812

STATISTICAL ANALYSIS IN RESEARCH

1. SURVEY AND DESIGN OF EXPERIMENT STUDY CONCEPTS

1.1 Survey versus design of experimental study

The difference between a survey study and a designed experimental study lies mainly in the study objectives. The researcher should understand the difference before undertaking the study. Failure to distinguish between the two forms of study leads to complicated data analyses whose results may fail to tie in with the study objectives.

The primary objective in a survey study is to observe the characteristics of the population of interest. For instance: Is the disease common across the different communities? Is the level of education distributed the same across races? Is the distribution of land even across different communities? What is the opinion of residents regarding new rules on rubbish disposal? In such situations we are concerned with the level of distribution rather than the actual difference. "Where is the variability high?" would be a question of great interest.

In a designed experimental study the primary interest is to investigate the relative performance of certain factors. The key questions to be answered are generally expressed as a statement of hypothesis that has to be verified or disproved through experimentation. The interest would be in answering questions such as: Are the three methods for treating the disease different? And if so, by how much? Is the new teaching method significantly different from the old method?

In a survey study the researcher has no control over the responses. He/she acts as an observer. The outcomes are mainly considered as random. Survey studies can be classified into two types, namely the exploratory (or informal) survey and the formal survey. The exploratory survey is mainly used to obtain information about the population of interest, for example farmer circumstances. The approach places the interviewer in direct contact with the subject and allows the interviewer to observe the characteristics of the population.

An exploratory survey allows for quick gathering of information through informal interviews with many people. The information from the exploratory survey is used to design a well-focused formal survey by:

- identifying important topics bearing on research planning that should be the focus of the formal survey;
- ensuring that written questions in the formal survey are asked in a way that can be understood;
- designing and testing a sampling scheme.

Other important features of an exploratory survey

Towards the end of the exploratory survey, it should also be possible to give approximate frequencies of use for a given practice among the target population (e.g. 0-10%, 10-25%, 25-50%, 50-75%, 75-100% of farmers). The exploratory survey narrows down the data to be collected in the formal survey to those that are essential for understanding present practices and prescreening technologies. An important part of the exploratory survey is to formulate hypotheses.

Examples of such hypotheses:

- A larger area can be planted as labour is a limiting factor at the planting period;
- There is a dry period three months after the start of the rains and late plantings may survive this period better than plantings that flower at that time;
- Early plantings give an early supply of new food and are particularly important when the previous harvest has been poor.

1.2 Formal survey study concept

The purpose of a formal survey study is to verify and quantify information and to test hypotheses formulated in the exploratory survey. A formal survey involves the use of a well-designed questionnaire. The first step is to define the population of interest. Note that we interview a sample of respondents and use the information obtained from this sample to make statements or inferences about the population of interest.

The following general rules should be followed when developing a questionnaire:

Organizing the questionnaire: The questionnaire should be divided into sections based on the study themes. Section one should always be designed to collect the bio-data (for example gender, age, education, marital status, etc.).

Language of the questionnaire: The questions should be constructed using clear and friendly language. The respondent must be given an opportunity to express himself or herself in a language of choice. Leading questions should be avoided. The question should be put in such a way that the respondent will provide more information. For example, ask "Did you apply fertilizer to wheat this year?" rather than "Do you use fertilizer on wheat?"

Length of the questionnaire: Lengthy questionnaires should be avoided, because they may introduce fatigue. The construction of the questions should be compact and comprehensive.

The role of the questionnaire is to obtain estimates of how widespread those problems and opinions are and whether there are differences between groups of respondents. A questionnaire for use in a formal survey study should be pre-tested before the final version is produced.

Subjective data: Consider a survey study where interest is in obtaining farmers' opinions regarding a certain technology. Note that information on what farmers do is objective and quantifiable, whereas farmers' opinions and perceptions about problems and technologies are subjective.

Sampling procedures for a formal survey study


The aim is to select, at a reasonable cost, a group of respondents that is roughly representative of all subjects in the population of interest. A representative sample must be selected at random; that is, each unit in the population or subgroup of the population has an equal chance of selection. Such an assumption requires all the sampling units to be homogeneous and non-overlapping. The nature of these non-overlapping units dictates the sampling technique to use. The following are some of the sampling procedures:

- Simple random sampling
- Stratified random sampling
- Multi-stage random sampling
- Systematic random sampling
- Cluster sampling

Stratification of the population is the process of dividing the population into relatively homogeneous subgroups called strata, and then taking separate samples from each stratum.

Sample size: Depends on the variability within the population and not on the size of the population. It should conform to the time and cost constraints of the survey.

Major costs: Cost of developing the questionnaire, training enumerators, and establishing a suitable sampling method.

Form of analysis: Either to estimate population means, variance components or population size, or to establish causal relationships, predictive models, frequencies, etc.

Commonly used in analysis: Chi-square tests, mean estimation, non-parametric methods, regression (e.g. logit, probit, logistic, etc.).
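Three of the sampling procedures listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling frame of 100 farms, the two strata ("lowland" and "upland"), the 10% sampling fraction and the function names are all hypothetical and do not come from the module.

```python
import random


def simple_random_sample(units, n, rng):
    """Draw n units without replacement; every unit has the same chance."""
    return rng.sample(units, n)


def stratified_sample(strata, fraction, rng):
    """Draw the same fraction, independently, from each stratum."""
    chosen = {}
    for name, members in strata.items():
        k = max(1, round(fraction * len(members)))
        chosen[name] = rng.sample(members, k)
    return chosen


def systematic_sample(units, step, rng):
    """Pick a random start in the first `step` units, then every step-th unit."""
    start = rng.randrange(step)
    return units[start::step]


rng = random.Random(2002)  # fixed seed so the illustration is repeatable

# Hypothetical sampling frame and strata (assumptions, not from the module).
farms = [f"farm_{i:03d}" for i in range(1, 101)]
strata = {"lowland": farms[:60], "upland": farms[60:]}

srs = simple_random_sample(farms, 10, rng)
strat = stratified_sample(strata, 0.10, rng)
syst = systematic_sample(farms, 10, rng)

print("Simple random:", srs)
print("Stratified:   ", strat)
print("Systematic:   ", syst)
```

With a common sampling fraction, the stratified sample takes 6 farms from the larger stratum and 4 from the smaller one, so each stratum is represented in proportion to its size.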

1.3 Designed experimental study concepts

The researcher has control over the factors to be tested and the form of data to be collected. He/she sets up the experiment and observes the outcome. An experiment is a planned inquiry set up to obtain new facts, to confirm or disprove results from a previous experiment, or to verify certain biological phenomena.

Objectives: The objectives must be clearly stated as questions to be answered, hypotheses to be tested, and effects to be estimated. It is necessary to classify the objectives as major or minor, since certain experimental designs give greater precision for some treatment comparisons than others.

Precision: Precision, sensitivity, or amount of information is measured as the reciprocal of the variance of a mean. That is,

$$\text{Information} = \frac{1}{\mathrm{var}(\bar{y})}$$


where $\mathrm{var}(\bar{y})$ denotes the variance of the sample mean $\bar{y}$. Since $\mathrm{var}(\bar{y}) = \sigma^2 / n$ for a sample of $n$ independent observations, the amount of information decreases as the variance of $y$, denoted by $\sigma^2$, increases. Similarly, as $n$ increases, the amount of information increases.

Components of experimental design: The following are components that any researcher must clearly state when conducting a designed experimental study.

- Treatment structure
- Design structure
- Experimental unit
- Randomization
- Replications
- Assumptions

Treatment structure: A treatment is a procedure whose effect is to be measured and compared with other treatments, e.g. a standard ration, a spraying schedule, a temperature-humidity combination, etc. Examples of treatment structures are:

- A set of treatments, e.g. sources of fertilizer such as DAP, CAN, TSP, manure, etc.
- One-way treatment structure, e.g. nitrogen levels, dairy meal levels, etc.
- Two-way treatment structure, e.g. plant population and different hybrids.
- Higher order treatment structure, etc.

The interest is to estimate effects, compare effects, predict, etc.

Experimental unit: This is the smallest unit of material to which the treatment is applied, e.g. an animal, 5 pigs in a pen, a half-leaf.

Sampling unit: This is often referred to as the observational unit. The treatment effect is measured on a sampling unit, which is basically a part of an experimental unit. Sometimes a sampling unit is a complete experimental unit.

Experimental error: This is a measure of the variation which exists among observations on experimental units treated alike. Aim at reducing experimental error in order to improve the power of the test.

Replication: When a treatment appears more than once in an experiment, it is said to be replicated. Replication is necessary to provide an estimate of experimental error, which is required for tests of significance. Without replication there is no basis for comparison. Valid replication requires that, among units that differ in similar ways, there are at least some sets of units treated identically.

There are many situations in which there can be different levels of replication, providing different degrees of variation. It is necessary to identify the different levels of replication, the correct level of replication, and situations where false replication is used. There are also many situations where multiple levels of replication are necessary and relevant to the analysis. Replication provides a means of computing experimental error.


The amount of replication is determined by the extent to which the standard error must be reduced, which is in turn determined by the size of the treatment difference that the experiment should detect. The necessary amount of replication gives the total number of units in the experiment. In an analysis of variance, the variation between these units can be modelled by dividing the total degrees of freedom (sample size minus 1) into control (design structure), questions (treatment structure) and error (random structure). We should design experiments so that the error degrees of freedom are between 10 and 20. Experiments not satisfying this requirement are, to some degree, inefficient and should be avoided.

Randomization: Done to ensure that we have a valid or unbiased estimate of experimental error and of treatment means and the differences among them. In other words, the procedure provides insurance against the possibility that the model for analysis is not valid. It also provides a basis for randomisation test arguments to support coincidence arguments in terms of significance. Randomization provides a valid measure of experimental error.

Design structure: Involves techniques for controlling known variation among the experimental units. Thus, experimental units are grouped into homogeneous groups referred to as blocks, such that variation within the groups is a minimum and variation between them is a maximum. The following are examples of design structures:

- Completely randomized design (CRD)
- Randomized complete block design (RCBD)
- Latin square design
- Cross-over design
- Incomplete block design
- Experiments with more than one experimental unit, such as the split plot design, the strip plot design and the repeated measures design.

Assumptions: The design structure and treatment structure do not interact. The observed values are independently and identically normally distributed with a constant variance.
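The randomization step for the first two design structures can be sketched as follows. This is a minimal illustration, assuming four treatments (the fertilizer sources mentioned earlier as examples) and four replicates or blocks; the function names and the layout sizes are hypothetical choices, not part of the module.

```python
import random


def randomize_crd(treatments, reps, rng):
    """CRD: shuffle all treatment labels over all experimental units."""
    layout = [t for t in treatments for _ in range(reps)]
    rng.shuffle(layout)
    return layout  # layout[i] is the treatment given to unit i


def randomize_rcbd(treatments, n_blocks, rng):
    """RCBD: every block receives each treatment once, in its own random order."""
    blocks = []
    for _ in range(n_blocks):
        order = list(treatments)
        rng.shuffle(order)
        blocks.append(order)
    return blocks


rng = random.Random(812)  # fixed seed so the illustration is repeatable
treatments = ["DAP", "CAN", "TSP", "Manure"]
n_reps = 4

crd = randomize_crd(treatments, n_reps, rng)
rcbd = randomize_rcbd(treatments, n_reps, rng)

# Division of the total degrees of freedom (sample size minus 1).
t = len(treatments)
total_df = len(crd) - 1
crd_error_df = total_df - (t - 1)                  # total minus treatments
rcbd_error_df = total_df - (t - 1) - (n_reps - 1)  # also minus blocks

print("CRD layout: ", crd)
print("RCBD layout:", rcbd)
print("Error df (CRD, RCBD):", crd_error_df, rcbd_error_df)
```

For this hypothetical layout of 16 units the CRD leaves 12 error degrees of freedom and the RCBD leaves 9, which can be compared with the guideline of 10 to 20 stated above.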

1.4 Conceptual models in scientific research

Conceptual models serve to organise research approaches and direct data presentation. Many inexperienced scientists do not make full use of conceptual models. Conceptual models assume many different forms that are not mutually exclusive. Different conceptual models may be dynamic and interactive. Working hypotheses are an essential component of all scientific approaches and must be elucidated in advance of more detailed research activities. Most working hypotheses may then be captured as either mathematical or statistical models. Simple diagrams should be based on one or more working hypotheses and constructed in advance of detailed research efforts to serve as a framework; they may often evolve into more complex forms during the course of many experiments and much thought.


Working hypotheses: Working hypotheses are reductionist word models based on logic and are an essential component of all research. All of scientific progress may be viewed as a long chain of working hypotheses that were framed, tested and either accepted or rejected, with conclusions that led to a more advanced hypothesis. A well-stated working hypothesis is specific and directs treatment selection and measurements. Global hypotheses are more general, are often a restatement of overall objectives, and generate one to many working hypotheses. A null hypothesis is stated in the negative context and is no longer considered essential, provided that a working hypothesis is stated in a manner that may be either accepted or rejected. Note that hypothesis testing is a formal procedure by which we investigate research questions using inferential statistics to reach decisions about the validity of the null and alternative hypotheses. It is most reasonable for one scientist to ask another, "How do your findings reflect on your working hypothesis?" Working hypotheses should be stated as simply as possible but must be complete statements such as "X regulates Y under Z conditions".

Example 1.1

- Phosphorus availability is limiting maize production in nutrient-depleted, smallholder farms in the highly weathered, sandy soils of KZN.
- Maize streak virus infects a greater proportion of the maize stand and reduces crop yields to a greater extent under continuous maize cultivation than in maize-legume rotation.

Both of the statements in Example 1.1 may be summarised as "If A and B then C (and D, etc.)". Always remember that working hypotheses are intended to be either accepted or rejected as a result of successful research, and as such must be able to withstand various tests of logic. Do not be defensive when a particular hypothesis is challenged, but rather feel complimented that another scientist considers it worthy of discussion.

Beware of incomplete statements such as "Use of fertiliser is better for farmers" or "Maize streak virus is a serious problem". Also, tautological statements such as "Sustainable agriculture results in long-term food security" are unsatisfactory working hypotheses; rather, these sorts of statements should be included within introductions, justification sections or overall objectives.

Mathematical models: Mathematical equations may also serve as conceptual models. Equations attempt to quantify cause and effect relationships. The cause is referred to as the independent variable, which directs an effect in the dependent variable. The mathematical relationship may also be


stated in more general terms as a working hypothesis. The relationship may be either linear or non-linear. A general expression for a simple linear relationship is of the form

$$Y = \alpha + \beta X + \varepsilon$$

where
$\alpha$ is the intercept,
$\beta$ is the slope,
$\varepsilon$ is the random error,
$Y$ is the dependent variable,
$X$ is the independent variable.

Many different equations define non-linear relationships. Examples of such equations are:

- Power functions: $y = a x^b$, where $a$ and $b$ are constants
- Exponential growth curves: $y = a b^x$, where $a$ and $b$ are constants ($b$ can also be the exponential constant $e$)
- Hyperbolic functions, where $y$ is the reciprocal of $x$: $y = 1/x$
- Asymptotic decay curves, where $y$ approaches 0 as $x$ increases without limit: $y = a e^{-kx}$
- Polynomial curves: $y = a_0 + a_1 x + a_2 x^2 + \dots + a_p x^p$

In general, the researcher should identify a conceptual basis for selecting a given curvilinear relationship based upon the properties of the phenomenon under study.

Statistical models: Statistical models, particularly those that examine two or more factors simultaneously, are also useful as conceptual models. Differential effects of one factor on another result in interaction. For instance, in a study set up to investigate the effect of two factors, a statistical model would take the following form, assuming a completely randomised design. Suppose $y$ is the response variable. Then

$$y_{ijk} = \mu + A_i + B_j + AB_{ij} + \varepsilon_{ijk}$$

where
$\mu$ denotes the overall mean,
$A_i$ is the $i$th effect for factor A,
$B_j$ is the $j$th effect for factor B,
$AB_{ij}$ is the $ij$th interaction effect,
$\varepsilon_{ijk}$ is the random error.

A statistical model can also be considered as a process of partitioning the response value into components due to inherent and random variation. Setting up the model before the analysis enables the researcher to focus on the issues of interest. Interest would be in estimating the main effects of factors A and B, and their interaction effect.
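The partitioning of a response into an overall mean, two main effects, an interaction and a random error can be illustrated with a small balanced data set. The numbers below are made up purely for illustration (two levels of each factor, two replicates per cell), and the estimates are the usual ones for balanced data based on cell, row and column means.

```python
from statistics import mean

# Hypothetical balanced data: data[i][j] holds the replicate responses
# for level i of factor A combined with level j of factor B.
data = [
    [[10.0, 12.0], [14.0, 16.0]],
    [[20.0, 22.0], [30.0, 32.0]],
]

a_levels = range(len(data))
b_levels = range(len(data[0]))

cell = [[mean(data[i][j]) for j in b_levels] for i in a_levels]
grand = mean(cell[i][j] for i in a_levels for j in b_levels)   # overall mean
row = [mean(cell[i]) for i in a_levels]                        # factor A means
col = [mean(cell[i][j] for i in a_levels) for j in b_levels]   # factor B means

A = [row[i] - grand for i in a_levels]                         # main effects of A
B = [col[j] - grand for j in b_levels]                         # main effects of B
AB = [[cell[i][j] - row[i] - col[j] + grand for j in b_levels]
      for i in a_levels]                                       # interaction effects
resid = [[[y - cell[i][j] for y in data[i][j]] for j in b_levels]
         for i in a_levels]                                    # random error terms

print("overall mean:", grand)
print("A effects:", A)
print("B effects:", B)
print("AB effects:", AB)
```

Adding the overall mean, the two main effects, the interaction effect and the residual for any observation returns the original response, which is the partition the model describes.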

The random error would be used to establish statistical tests to ascertain whether these effects differ significantly from zero or from any specified value.

Exercise 1.1

1.1 Suppose you were approached to design a questionnaire to be used in a feasibility study regarding the settlement of a group of urban dwellers on new land within KZN.
a) What components would you include in your questionnaire?
b) What could be the sampling unit?
c) What could be the population of interest?
d) What could be the sample size?

1.2 Consider the newly constructed Casino in Pietermaritzburg. Suppose the manager wishes to collect the views of the residents of Pietermaritzburg regarding the business. Indicate how such information could be collected.

1.3 The Checkers in Scottsville, Pietermaritzburg underwent some renovation recently. Suppose the manager wishes to collect the customers' views regarding this change. Indicate how such information would be collected.

1.4 An experiment was conducted to determine the best way to manage citrus insect pests and diseases on smallholder farms. Six farms were selected. On each farm, 10 trees infested with white flies were selected. The investigator was interested in finding which treatment had the best effect in controlling the disease among the four, namely 1) pruning, 2) fertiliser application, 3) pesticide application, and 4) farm activities (intercropping practice).
a) Indicate how the treatments were applied.
b) What could have been the experimental unit?
c) How independent are these treatments?
d) What name could you give the experimental design used?
e) What possible questions could be answered in this study?


2. INFERENCES ABOUT ONE AND TWO POPULATION MEANS

2.1 Introduction to hypothesis testing

Hypothesis testing is an area of statistical inference in which we evaluate a conjecture, which we call a hypothesis, about some characteristic of the parent population. The hypothesis usually concerns the unknown parameters of the population.

The null hypothesis: This is the statement being tested and is denoted by $H_0$. It is usually stated as an equality, implying no difference.

The alternative hypothesis: This is what is believed to be true if $H_0$ is rejected, and is denoted by $H_a$. Usually, the investigator wishes to establish that there is a difference between the parameter and the value being tested. Thus, the alternative is also called the research hypothesis.

Consider, for example, the hypothesis that the mean per capita income in a certain town is R800 per year. Suppose we denote the population mean by $\mu$, and suppose the investigator believes the mean per capita income of the town is greater than R800. The two hypotheses are stated as

$H_0: \mu = 800$ against $H_a: \mu > 800$

If the investigator believes the mean per capita income is less than R800, then the alternative hypothesis is stated as

$H_a: \mu < 800$

The alternative is stated in support of what the investigator wishes to believe.

The significance level: This is the probability, denoted by $\alpha$, with which we are willing to reject the null hypothesis when it is correct.

A Type I error is committed if we reject the null hypothesis when it is in fact true. A Type II error is committed if we fail to reject the null hypothesis when it is false.
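The meaning of the significance level as a Type I error rate can be checked by simulation. The sketch below repeatedly draws samples from a population in which the null hypothesis of the income example (mean R800) is true and counts how often a one-sided z-test rejects it. The standard deviation of 100, the sample size of 25 and the number of simulated samples are assumptions made only for this illustration.

```python
import math
import random


def z_statistic(sample, mu0, sigma):
    """z = (sample mean - mu0) / (sigma / sqrt(n)) for a known sigma."""
    n = len(sample)
    xbar = sum(sample) / n
    return (xbar - mu0) / (sigma / math.sqrt(n))


rng = random.Random(800)   # fixed seed so the illustration is repeatable
mu0 = 800.0                # hypothesised mean from the income example
sigma = 100.0              # assumed known standard deviation (hypothetical)
n = 25                     # assumed sample size (hypothetical)
z_crit = 1.645             # upper 5% point of the standard normal distribution
n_sim = 2000

rejections = 0
for _ in range(n_sim):
    # Samples are drawn with the true mean equal to mu0, so H0 is true
    # and every rejection is a Type I error.
    sample = [rng.gauss(mu0, sigma) for _ in range(n)]
    if z_statistic(sample, mu0, sigma) > z_crit:
        rejections += 1

type_i_rate = rejections / n_sim
print("Proportion of false rejections:", type_i_rate)
```

The proportion of false rejections should settle close to the chosen significance level of 0.05 as the number of simulated samples grows.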

2.2 Inferences about a population mean

In practice, we encounter situations where interest is in confirming a known hypothesis. This relates to questions such as: has the average increased, decreased or remained static over time? Sometimes, an investigator would like to compare characteristics of two populations. Handling such investigations involves one-sample or two-sample situations.

Consider a single population that is normally distributed with mean $\mu$ and variance $\sigma^2$. Suppose we want to test a hypothesis about $\mu$. Let $\mu_0$ denote a known mean.

Hypothesis: $H_0: \mu = \mu_0$ against $H_a: \mu \neq \mu_0$

Example 2.1

The scores on a college placement exam in mathematics are assumed to be normally distributed with a mean of 70 and a standard deviation of 18. The exam is given to a random sample of 50 high school students who have been admitted to college. Their average score on the exam was 67. If this is a true random sample, is the evidence sufficient to suggest that the population mean score is lower than 70?

Solution: Let $\mu$ denote the true population mean of the placement exam. We wish to see whether there is evidence that $\mu < 70$. This is the research hypothesis. Thus, $\mu_0 = 70$, $\sigma^2 = 324$, and $n = 50$.

Hypothesis: $H_0: \mu = 70$ against $H_a: \mu < 70$

Significance level: $\alpha = 0.05$

Critical region: Reject $H_0$ if the p-value is less than 0.05, where p-value = the probability that $\bar{X} \le 67$.

Test statistic: The sample mean is $\bar{X} = 67$, so

$$z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} = \frac{67 - 70}{18 / \sqrt{50}} = -1.18$$

Thus, $P(\bar{X} \le 67) = P(Z \le -1.18) = 0.119$.

Conclusion: We fail to reject $H_0$ since the p-value of 0.119 is not less than 0.05. Based on the results of a random sample of 50 high school students, there is insufficient evidence to say that the mean score on the college placement exam is lower than 70.
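The arithmetic of Example 2.1 can be reproduced with the Python standard library. This is a sketch only: the helper `normal_cdf` is a name introduced here, and it evaluates the standard normal distribution function through `math.erf` instead of a printed table.

```python
import math


def normal_cdf(z):
    """Standard normal distribution function P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


mu0 = 70.0      # hypothesised mean
sigma = 18.0    # known population standard deviation
n = 50          # sample size
xbar = 67.0     # observed sample mean
alpha = 0.05    # significance level

z = (xbar - mu0) / (sigma / math.sqrt(n))
p_value = normal_cdf(z)          # lower-tail test, since Ha is mu < 70
reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

Because the alternative is one-sided (lower tail), the p-value is the area to the left of the calculated z. For a two-sided alternative the tail area would be doubled.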

Exercise 2.1

2.1 For the future planning and control of automatic sorting machines, a member of the General Post Office is instructed to take a random sample of the letters posted with a 10c stamp during a specific period of the year. The weights of these letters were recorded as follows (in grams):

25.7 23.2 25.8 25.8 29.1 23.1 17.2 26.4
31.9 18.3 19.2 20.7 23.6 21.6 21.9 21.8

a) Test a claim that the average weight of such letters is 19.6 g.
b) Test a claim that the average weight is greater than 21.6 g.

c) Find 95% and 99% confidence intervals for the average weight of the letters.
d) Use your results in (c) to test whether the mean is 27.9 g.

2.2 A certain department store conducts monthly checks amongst its branches to test whether the mean balance outstanding on 30-day charge accounts complies with the company policy of R100. For a particular branch store a sample of 100 accounts gave the following results:

$\bar{x}$ = R104.19, $s$ = R22.13

a) Test the claim, at the 5% significance level, that the branch was complying with company policy.
b) The department store financial controller claims that the mean balance is greater than R100. Test this claim. (Use $\alpha = 0.05$.)

2.3 State the null and alternative hypotheses for the following research questions.
a) Are children who have strict parents more disciplined than children who do not have strict parents?
b) Do babies with birth weights of 2.8 kg and more have a greater mortality rate than those with birth weights lower than 2.8 kg?

2.4 The records of the National Road Traffic department reveal that the scores on the learner driver test are normally distributed with a mean of 62% and a standard deviation of 16. The traffic department is aware that people in KwaZulu-Natal tend to be better at obeying traffic rules than people from other provinces around the country. They administered the learner driver test to a random sample of 200 adults from KZN, and noted that their mean score was 69.9%.
a) Do people from KZN perform better than the national mean? ($\alpha = 0.05$)
b) Conduct an analysis on the same data to test the research question that people from KZN perform differently from expectation. ($\alpha = 0.05$)
c) Although the above tests have yielded similar conclusions, in what way do they differ?
d) How would the chance of making Type I and Type II errors change if we changed the significance level of the tests to $\alpha = 0.01$?

2.5 The weight of humans is normally distributed with a mean of 73 kg and a variance of 144. To investigate whether the weight of rural South Africans is different from this international mean, we draw a random sample of 100 rural South Africans and calculate their mean weight to be 69 kg.
a) Determine whether the weight of rural South Africans is different from the international mean ($\alpha = 0.01$).

2.3 Paired t-test problem

Consider a situation where a researcher is interested in the effect of a treatment given to randomly selected subjects. Measurements are made before and after application of the

treatment. The data are paired, and the interest is in finding out whether the effect is negative, absent or positive. Differencing eliminates the effect of the subject, leaving the effect due to the treatment and random variation. Suppose we have two treatments A and B applied to $n$ samples randomly selected from a normally distributed population with mean $\mu$ and variance $\sigma^2$.

Subject   Treatment A   Treatment B   Difference $d = X - Y$
1         $X_1$         $Y_1$         $d_1 = X_1 - Y_1$
2         $X_2$         $Y_2$         $d_2 = X_2 - Y_2$
3         $X_3$         $Y_3$         $d_3 = X_3 - Y_3$
.         .             .             .
.         .             .             .
n         $X_n$         $Y_n$         $d_n = X_n - Y_n$

We treat the new information on differences as a single-sample problem, and compute estimates of the mean and the variance using the usual estimation formulae.

Assumption: The $d_i$ ($i = 1, 2, \dots, n$) are a random sample from a normal population with mean $\mu_d$ and variance $\sigma_d^2$.

Hypothesis: $H_0: \mu_d = 0$ against $H_1: \mu_d \neq 0$ or $H_1: \mu_d < 0$ or $H_1: \mu_d > 0$, depending on the available information. That is, the effect may be suspected to differ, to decrease or to increase.

Assuming $H_0: \mu_d = 0$ is true, we estimate the variance $\sigma_d^2$ by $s_d^2$, computed as follows:

$$s_d^2 = \frac{\sum d_i^2 - \left(\sum d_i\right)^2 / n}{n - 1}$$

The calculated t-value, denoted by $t_{calc}$, is

$$t_{calc} = \frac{\bar{d} - 0}{s_d / \sqrt{n}}$$

where
$\bar{d}$ = average of the paired differences,
$s_d$ = standard deviation of the paired differences,
$n$ = number of pairs.

Reject $H_0$ if $|t_{calc}| > t^{n-1}_{\alpha/2}$ and conclude that there is enough evidence that the treatment had a significant effect at level $\alpha$, where $\alpha$ measures the strength of evidence against $H_0$. The values of the t distribution are given in Table B.


The following example illustrates the computation procedures and the type of inference one can draw.

Example 2.2
A market research study was conducted in which a family was asked to record its total monthly purchases at Pick n Pay and its total monthly purchases at Checkers. The study wishes to estimate the difference in average monthly expenditure by families at the two shopping centres. The data from 10 families selected at random are presented below. The data are in rands.

Family   Pick n Pay   Checkers   Difference, d   $(d - \bar{d})^2$
1        140          100         40             $(40 - 5)^2 = 1\,225$
2        120          150        -30             $(-30 - 5)^2 = 1\,225$
3        230          220         10             $(10 - 5)^2 = 25$
4         50           80        -30             $(-30 - 5)^2 = 1\,225$
5         70          110        -40             $(-40 - 5)^2 = 2\,025$
6        240          180         60             $(60 - 5)^2 = 3\,025$
7        190          190          0             $(0 - 5)^2 = 25$
8        120          140        -20             $(-20 - 5)^2 = 625$
9        250          190         60             $(60 - 5)^2 = 3\,025$
10       100          100          0             $(0 - 5)^2 = 25$
Sum                               50             12 450

$$\bar{d} = \frac{50}{10} = 5$$

$$s_d = \sqrt{\frac{\sum (d - \bar{d})^2}{n - 1}} = \sqrt{\frac{12\,450}{10 - 1}} = 37.193$$

Critical region: The t-table value with 9 degrees of freedom at the 5% significance level is $t = 2.262$ (see Table B). We would reject $H_0$ if $|t_{calc}| > t^{n-1}_{\alpha/2} = 2.262$.

Test statistic:

$$t_{calc} = \frac{\bar{d} - 0}{s_d / \sqrt{n}} = \frac{5 - 0}{37.193 / \sqrt{10}} = 0.425$$

Conclusion:


We fail to reject H0 since the |tcalc| is not greater than 2.262. We conclude that there is no significant difference in average spending by families at the two shopping centres, based on the available data.
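The computations in Example 2.2 can be reproduced as follows. This is a sketch using only the Python standard library; the critical value 2.262 is taken from the text (Table B) and not computed, and the variable names are chosen here for illustration.

```python
import math
from statistics import mean, stdev

# Monthly purchases (rands) for the 10 families in Example 2.2.
pick_n_pay = [140, 120, 230, 50, 70, 240, 190, 120, 250, 100]
checkers = [100, 150, 220, 80, 110, 180, 190, 140, 190, 100]

d = [x - y for x, y in zip(pick_n_pay, checkers)]  # paired differences
n = len(d)
d_bar = mean(d)    # average of the paired differences
s_d = stdev(d)     # sample standard deviation (divisor n - 1)

t_calc = (d_bar - 0) / (s_d / math.sqrt(n))
t_crit = 2.262     # t-table value, 9 degrees of freedom, two-sided 5% level
reject_h0 = abs(t_calc) > t_crit

print(f"d_bar = {d_bar}, s_d = {s_d:.3f}, t_calc = {t_calc:.3f}, reject H0: {reject_h0}")
```

Working from the differences turns the paired problem into a single-sample problem, which is why only one mean and one standard deviation are needed.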

Exercises 2.2

2.6 Given two independent samples with the following information

Item       1     2     3     4     5     6     7     8     9     10
Sample 1   19.6  22.1  19.5  20.0  21.5  20.2  17.9  23.0  12.5  19.0
Sample 2   21.3  17.4  19.0  21.2  20.1  23.5  18.9  22.4  14.3  17.8

a) State the null hypothesis.
b) What assumption would you make?
c) Based on these paired samples, test at the $\alpha = 0.10$ level whether the true average paired difference is 0.
d) State your conclusions.

2.7 A random sample of 15 cars passed through an urban speed trap. The following speeds in km per hour were recorded.

Car     1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
Speed   71  49  68  65  64  57  80  63  62  69  45  61  66  66  55

a) Estimate $\mu$, the true mean speed of cars passing that point.


b) Given that the speed limit is 60 km/h, test $H_0: \mu = 60$ to check if it is reasonable.
c) Set 95% confidence limits about the true mean.
d) What assumption did you have to make?

2.8 A consumer organisation has sampled 20 owners of TV sets and recorded the time in years to each set's first repair. The data are:

1.97  2.87  3.01  2.75  2.09  1.34  1.62  1.10  2.24  1.79
2.81  0.57  3.17  3.89  3.10  2.05  1.01  4.16  2.59  1.67

a) Estimate the mean time to first repair for the population sampled.
b) Set 99 % confidence limits for the true mean, $\mu$.
c) Use the results obtained in part (b) to test $H_0: \mu = 0$.

2.9 The following data arise from a survey of aged people in Durban. The variable recorded per person is the monthly expenditure on medicines, recorded in rands.

34.42   9.66  40.40  31.00   6.30
52.82   2.20  20.00   6.50  48.24
57.13  24.64  37.80  36.00  58.16

a) The claim recently made in a local newspaper was that the mean monthly expenditure on medicines for elderly people exceeds R 30.00 per month. Test this claim at the 5 % significance level.
b) Use the sample to estimate the mean annual expenditure on medicines for this population and set 95 % confidence limits to this quantity.

2.10 The repainting of lines on freeways represents a large proportion of the expenditure of a roads department. It is decided that a new, cheaper paint should be tested. Twenty-five randomly chosen 1-km stretches are painted with the new paint. After a month an assessment is done at each site. The durability of the paint is measured by an instrument using a scale on which the current paint registers 39.2. For the sample of 25 sites, the following calculations have been done.

$\bar{x} = 39.65$, $s = 3.02$

The department wishes to test (using $\alpha = 0.01$) whether the new paint is better than the current paint.

a) State the appropriate null and alternative hypotheses.
b) Test your null hypothesis at the required level.
c) State your conclusions.

2.11 A random sample of nine local school children yielded the following sample statistics for the random variable X = IQ.

$\bar{x} = 107$, $s = 3.88$

a) Find a 99 % confidence interval for $\mu$, the mean IQ, and use this interval to test $H_0: \mu = 100$ at the 1 % significance level.

2.12 A random sample of 16 pharmacies was selected in the Witwatersrand area. The price in rands charged for 100 tablets of a particular drug by each pharmacy in the sample was:

3.75  4.10  10.40  8.10  2.95  5.75  7.50  8.90
5.85  7.65   7.50  6.50  7.50  5.50  8.00  4.50

a) Estimate the mean price of 100 tablets of this drug for pharmacies in the area.
b) Set 95 % confidence limits to your estimate.
c) Carry out a test of significance to assist you in deciding whether the mean price of this drug in the Witwatersrand area is lower than R 7.95 (which is known to be the mean selling price in Cape Town).

2.13 Suppose that, after sampling 20 records at random, a sociologist finds the following durations (to the nearest tenth of a year) of marriage ending in divorce.

10.1  21.2  4.31  4.9   13.8  5.4   11.1  8.7   10.9  4.81
9.2   9.42  6.6   6.3   12.3  24.5  7.8   21.6  15.1  2.61

a) Set up an appropriate null hypothesis and alternative hypothesis.
b) Determine whether these data provide proof, at the 5 % significance level, that the mean duration of marriage ending in divorce in the population has decreased from an earlier value of 14.9.
c) What distribution assumption is made in applying the hypothesis test?

2.14 A designer claims that by smoothing out parts of a particular automobile body to reduce air resistance, the average fuel consumption can be reduced below 8.0 litres per 100 km. In an attempt to support the claim, the designer has obtained a sample of fuel consumption for 15 modified automobiles. The sample mean was 7.4 litres per 100 km and the standard error of the mean was 0.8 litres per 100 km.
a) Do these results provide sufficient evidence to support the claim?

2.15 To test the durability of a new paint for white centre lines, a highway department painted test strips across heavily travelled roads in eight different locations, and automatic counters showed that they deteriorated after having been crossed by the following number of cars (in thousands).

142.6  167.8  136.5  108.3  126.4  133.7  162.0  149.0

a) Find 95 % confidence limits for $\mu$, the average number of crossings that this paint can withstand before it deteriorates.
b) Find 99 % confidence limits for $\mu$, the average number of crossings that this paint can withstand before it deteriorates.
c) Test the paint manufacturer's claim that $\mu = 160.0$.

2.4 Inferences about two population means

Investigators are often interested in assessing the performance of one population compared to another, for instance comparing the performance of a new technology with an old one, or a new variety against an old variety. The two populations are assumed to be independent. Suppose we have two random samples, one of size $n_1$, $X_1, X_2, X_3, \ldots, X_{n_1}$, drawn from a normally distributed population with mean $\mu_1$ and variance $\sigma^2$, and the other of size $n_2$, $Y_1, Y_2, Y_3, \ldots, Y_{n_2}$, drawn from a normally distributed population with mean $\mu_2$ and variance $\sigma^2$. Consider the sample means $\bar{x}$ and $\bar{y}$ as unbiased estimators of the population means $\mu_1$ and $\mu_2$, respectively. Also, let the sample variances $s_1^2$ and $s_2^2$ both be unbiased estimators of the population variance, $\sigma^2$.

Assumption: The two populations of Xs and Ys are independent and normally distributed with possibly different means and a common variance.

The setting up of hypotheses depends on the study objectives. The following are possible hypotheses:

Hypothesis: $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \neq \mu_2$ or $H_a: \mu_1 < \mu_2$ or $H_a: \mu_1 > \mu_2$

Test statistic: $\bar{X} - \bar{Y}$ is normally distributed with mean $\mu_1 - \mu_2$ and variance $\sigma^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)$. We estimate $\sigma^2$ by a pooled variance, $s_p^2$, where

\[ s_p^2 = \frac{\text{Total Sum of Squares}}{\text{Total Degrees of Freedom}} = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \]

Testing the equality of the two means against the alternative that they are not equal requires that the standard error of the difference between the means be computed first. For a combined sample size $n_1 + n_2 < 30$ we use the t-distribution; otherwise the normal distribution applies. The appropriate test statistic, assuming a common variance estimated by the pooled variance, is computed as

\[ t_{cal} = \frac{(\bar{x} - \bar{y}) - (\mu_1 - \mu_2)}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]

Conclusion: Reject $H_0: \mu_1 = \mu_2$ in support of $H_a: \mu_1 \neq \mu_2$ if $|t_{cal}| > t_{Table}$, where $t_{Table}$ is obtained with $n_1 + n_2 - 2$ degrees of freedom at the $\alpha$ level of significance.

If a common population variance cannot be assumed, say the variances are $\sigma_1^2$ and $\sigma_2^2$, then the test statistic follows an approximate t-distribution with degrees of freedom, df, computed as

\[ df = \frac{\left( \frac{s_1^2}{n_1} + \frac{s_2^2}{n_2} \right)^2}{\frac{(s_1^2 / n_1)^2}{n_1 - 1} + \frac{(s_2^2 / n_2)^2}{n_2 - 1}} \]

The computed df is rounded down to the nearest integer and the t-test is then used, noting that the fewer the degrees of freedom, the lower the power. The variance of the difference between the two sample means is estimated as $\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}$.
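The approximate degrees of freedom for the unequal-variance case can be computed as in the sketch below. It is written in Python; the function name `welch_df` and the sample variances and sizes passed to it are hypothetical values chosen only to illustrate the arithmetic, not data from this module.

```python
import math


def welch_df(s1_sq, n1, s2_sq, n2):
    """Approximate degrees of freedom when the two variances are not assumed equal.

    s1_sq, s2_sq are the sample variances; n1, n2 are the sample sizes.
    The result is rounded down to the nearest integer.
    """
    v1 = s1_sq / n1
    v2 = s2_sq / n2
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return math.floor(df)


def se_difference(s1_sq, n1, s2_sq, n2):
    """Standard error of the difference between two sample means (no pooling)."""
    return math.sqrt(s1_sq / n1 + s2_sq / n2)


# Hypothetical summary statistics, for illustration only.
df_example = welch_df(4.0, 10, 9.0, 12)
se_example = se_difference(4.0, 10, 9.0, 12)
print(df_example, round(se_example, 3))
```

When the two sample variances and sample sizes are equal, the formula returns $2(n - 1)$, the same degrees of freedom as the pooled test.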

Example 2.3

Consider data collected to study heat-producing capacity. The heat-producing capacity (in millions of calories per ton) was measured on random samples of five specimens each of coal from two mines. The following are the data and the test statistics.

Mine 1   Mine 2
8380     7540
8210     7720
8360     7750
7840     8100
7910     7690

Suppose we assume the sample from Mine 1 to be normally distributed with mean $\mu_1$ and variance $\sigma^2$, and the sample from Mine 2 to be also normally distributed with mean $\mu_2$ and variance $\sigma^2$.

Hypothesis: Test $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \neq \mu_2$

Significance level: $\alpha = 0.05$

Critical region: Reject $H_0$ if $|t_{cal}| > t^*$, where $t^*$ is the t-table value corresponding to $2(n - 1) = 8$ degrees of freedom at the 5 % significance level.

Test statistics: $n_1 = n_2 = n = 5$, and $t^* = 2.306$ with 8 degrees of freedom at the 5 % significance level.

$\bar{x}_1 = 8\,140$ and $SS(x_1) = \sum (x - \bar{x}_1)^2 = 253\,800$
$\bar{x}_2 = 7\,760$ and $SS(x_2) = 170\,600$

Thus, the pooled estimate of the variance $\sigma^2$ is

\[ s^2_{pooled} = \frac{SS(x_1) + SS(x_2)}{2(n - 1)} = \frac{253\,800 + 170\,600}{2(5 - 1)} = 53\,050 \]

The estimated standard error of the mean difference is

\[ S.E.(\bar{x}_1 - \bar{x}_2) = \sqrt{s_p^2 \left( \frac{1}{n} + \frac{1}{n} \right)} = \sqrt{53\,050 \left( \frac{2}{5} \right)} = 145.67 \]

The calculated value of t is

\[ t_{calc} = \frac{\bar{x}_1 - \bar{x}_2 - 0}{S.E.(\bar{x}_1 - \bar{x}_2)} = \frac{8\,140 - 7\,760}{145.67} = \frac{380}{145.67} = 2.61 \]

Conclusion: We reject $H_0$ since $|t_{calc}| = 2.61$ is greater than $t^* = 2.306$ and conclude that the heat-producing capacity of the coal from the two mines is not the same, the coal from Mine 1 being superior by $380 \pm 145.67$ millions of calories per ton.
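The pooled-variance calculation for the two mines can be reproduced with the following sketch. It is plain Python; the helper names `sum_sq` and `pooled_t` are illustrative, and the critical value 2.306 is taken from the t table rather than computed.

```python
import math
from statistics import mean

# Heat-producing capacity (millions of calories per ton), five specimens per mine.
mine_1 = [8380, 8210, 8360, 7840, 7910]
mine_2 = [7540, 7720, 7750, 8100, 7690]


def sum_sq(x):
    """Sum of squared deviations of the observations from their mean, SS(x)."""
    m = mean(x)
    return sum((v - m) ** 2 for v in x)


def pooled_t(x, y):
    """Return (pooled variance, standard error, t statistic) for H0: mu1 = mu2."""
    n1, n2 = len(x), len(y)
    s2_pooled = (sum_sq(x) + sum_sq(y)) / (n1 + n2 - 2)
    se = math.sqrt(s2_pooled * (1 / n1 + 1 / n2))
    t_calc = (mean(x) - mean(y) - 0) / se
    return s2_pooled, se, t_calc


s2_pooled, se, t_calc = pooled_t(mine_1, mine_2)

T_CRIT = 2.306  # t-table value, 8 degrees of freedom, two-sided 5 % level

print(f"pooled variance = {s2_pooled:.0f}, S.E. = {se:.2f}, t = {t_calc:.2f}")
print("reject H0" if abs(t_calc) > T_CRIT else "fail to reject H0")
```

The same function works for unequal sample sizes, since the pooled variance divides by $n_1 + n_2 - 2$.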

Example 2.4
A researcher wants to determine whether a given drug has any effect on the scores of human subjects performing a task of psychomotor co-ordination. Nineteen subjects were randomly selected from a subject pool and then randomly assigned to two groups. The nine subjects in group 1 received an oral administration of the drug prior to being tested. The ten subjects in group 2 received a placebo at the same time. The scored results were as follows:

Group 1 ($n_1 = 9$):    12  14  10   8  16   5   3   9  11
Group 2 ($n_2 = 10$):   21  18  14  20  11  19   8  12  13  15

Total score for
Group 1: $\sum X_1 = 88$
Group 2: $\sum X_2 = 151$

On the assumption that the scores are normally distributed, we wish to test whether the two groups are significantly different at the 5 % significance level.

Hypothesis: $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \neq \mu_2$

Significance level: $\alpha = 0.05$

Critical region: Reject $H_0$ if $|t_{cal}| > t^*$, where $t^*$ is the t-table value corresponding to $n_1 + n_2 - 2 = 9 + 10 - 2 = 17$ degrees of freedom at the 5 % significance level. Thus $t^* = 2.110$.

Test statistics: The means and sums of squares, SS(x), are
Group 1: $\bar{x}_1 = 9.778$ and $SS(x_1) = 135.56$
Group 2: $\bar{x}_2 = 15.100$ and $SS(x_2) = 164.90$

\[ s^2_{pooled} = \frac{SS(x_1) + SS(x_2)}{n_1 + n_2 - 2} = \frac{135.56 + 164.90}{9 + 10 - 2} = 17.6742 \]

Hence

\[ S.E.(\bar{x}_1 - \bar{x}_2) = \sqrt{s_p^2 \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = \sqrt{17.6742 \left( \frac{1}{9} + \frac{1}{10} \right)} = 1.93 \]

Thus,

\[ t_{calc} = \frac{\bar{x}_1 - \bar{x}_2 - 0}{S.E.(\bar{x}_1 - \bar{x}_2)} = \frac{9.778 - 15.10}{1.93} = -2.758 \]

Conclusion: We reject $H_0$ since $|t_{calc}| = 2.758$ is greater than $t^* = 2.110$ and conclude that the scores of the experimental group are significantly lower than those of the control group, by $5.32 \pm 1.93$ units.

2.5 The process of setting hypotheses and testing

The investigator sets up the study objectives, which are translated into questions that need to be answered by the data collected. These questions are formulated in the form of hypotheses. The null hypothesis always has the equality sign, whereas the alternative hypothesis is stated as either unequal, decrease or increase based on the available information on the direction of the reaction. The basic forms of the null and alternative hypotheses for a two-sample test are as follows.

Null hypothesis

$H_0: \mu_1 = \mu_2$   or   $\mu_1 - \mu_2 = 0$

Possible alternative hypotheses

i)   $H_a: \mu_1 \neq \mu_2$   or   $\mu_1 - \mu_2 \neq 0$   (two-tailed test)
ii)  $H_a: \mu_1 < \mu_2$   or   $\mu_1 - \mu_2 < 0$   (one-tailed test)
iii) $H_a: \mu_1 > \mu_2$   or   $\mu_1 - \mu_2 > 0$   (one-tailed test)

A conventional rule is to state the null hypothesis with an equality sign and the alternative hypothesis with a strict inequality. The following are the necessary steps to follow when performing a hypothesis test.

Step 1: State the assumptions associated with the random variable(s) related to the population(s) under investigation. Often, the random effect is assumed to be independently and identically distributed normal with a fixed mean and a constant variance.
Step 2: State both the null and alternative hypotheses. Normally, the alternative hypothesis is the statement we wish to prove.
Step 3: State the significance level. This is the probability of a type I error, that is, the probability of rejecting the null hypothesis when it is true. It is commonly referred to as an experimental error rate. The conventional levels are 10 %, 5 % and 1 %.
Step 4: Set the critical rule or decision rule. It is becoming traditional to use the p-value, which is the probability, computed under the assumption that the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed. The smaller the p-value, the stronger the evidence against the null hypothesis. Reject the null hypothesis when the p-value is less than the significance level. The rejection region consists of those values of the test statistic that will lead to the rejection of the null hypothesis.
Step 5: Compute the test statistics. These are the sample mean(s), variance of the mean(s), standard error of the mean or mean difference, and the degrees of freedom. In general, the test statistic is calculated from the sample data and is used to test the null hypothesis.
Step 6: Draw the conclusions based on these statistics when compared to the critical value(s). If the null hypothesis is rejected, declare that there is sufficient evidence in support of the alternative hypothesis. Otherwise, there is not enough evidence.

Two probability distributions, namely the normal and the t-distribution, are used. The t-distribution is used when the variance(s) is/are unknown and the sample size is less than 30. The normal distribution is used when the sample size is greater than or equal to 30. The variance is still estimated if it is unknown.

2.6 Inferences about a population proportion.

Suppose P denotes the proportion in the population with the attribute. This is referred to as the probability of success. Suppose that a random sample of size n is drawn from a binomial population, and let y be the number in the sample with the attribute. We estimate P using the statistic

\[ p = \frac{y}{n} \]

that is, the total number with the attribute divided by the sample size. The sampling distribution of p is approximately normal with mean P and variance $\frac{P(1 - P)}{n}$.

Suppose the hypothesis is stated as $H_0: P = P_0$ against $H_a: P \neq P_0$.

Under the assumption that $H_0$ is true, the variance of p becomes $\frac{P_0 (1 - P_0)}{n}$. Thus,

\[ z_{calc} = \frac{p - P_0}{\sqrt{\frac{P_0 (1 - P_0)}{n}}} \]

Example 2.5

A recent report claimed that 20% of all college graduates find a job in their chosen area of study. A survey of a random sample of 500 graduates found that 110 obtained work in their area. Is there statistical evidence to refute the claim?
Solution: If P denotes the proportion of college graduates who find a job in their area of study, then

$H_0: P = 0.20$   against   $H_a: P \neq 0.20$

We denote the test statistic by p, the proportion of successes in the sample. Thus,

\[ p = \frac{110}{500} = 0.22 \]

The standardised test statistic is

\[ z_{calc} = \frac{p - 0.20}{\sqrt{\frac{(0.20)(0.80)}{500}}} = \frac{0.22 - 0.20}{0.018} = 1.11 \]

In the case of a one-sided test, p-value = $P(z \geq 1.11)$ = 0.1335. For a two-sided test, p-value = 2(0.1335) = 0.267.

Conclusion: There is not enough evidence to reject the null hypothesis at the 5 % significance level. Thus, we cannot refute the claim.
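The test for a single proportion can be sketched in Python as follows. The function name `proportion_z_test` is an illustrative choice, and the p-values come from the standard normal distribution in Python's `statistics` module rather than from a printed table. Because the script does not round the standard error to 0.018, it gives z of about 1.12 and a one-sided p-value of about 0.132, slightly different from the rounded hand calculation, with the same conclusion.

```python
import math
from statistics import NormalDist


def proportion_z_test(y, n, p0):
    """z test of H0: P = p0 given y successes in a sample of size n.

    Returns (sample proportion, z statistic, one-sided p-value, two-sided p-value).
    """
    p_hat = y / n
    se0 = math.sqrt(p0 * (1 - p0) / n)  # standard error computed under H0
    z = (p_hat - p0) / se0
    p_one_sided = 1 - NormalDist().cdf(abs(z))
    return p_hat, z, p_one_sided, 2 * p_one_sided


# Graduate employment example: 110 of 500 graduates, claimed proportion 0.20.
p_hat, z, p_one, p_two = proportion_z_test(110, 500, 0.20)

ALPHA = 0.05
print(f"p = {p_hat:.2f}, z = {z:.3f}, two-sided p-value = {p_two:.3f}")
print("reject H0" if p_two < ALPHA else "fail to reject H0")
```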

Testing with confidence intervals

The null hypothesis $H_0: P = P_0$ against $H_a: P \neq P_0$ is rejected at the $\alpha$ level of significance if and only if the hypothesised value $P_0$ falls outside a $(1 - \alpha)100\%$ confidence interval for P.
Example 2.6

A news report in a major city stated that 80% of all violent crimes in that city involves firearms. A survey of all violent crimes in the city for the past 2 years revealed that of 283 violent crimes, 240 involved firearms. Determine with a confidence interval whether the news report is correct.
Solution:

$H_0: P = 0.80$   against   $H_a: P \neq 0.80$

Given $P_0 = 0.80$, $n = 283$ and $y = 240$.

A 95% confidence interval for P is

\[ p \pm 1.96 \sqrt{\frac{p(1 - p)}{n}} \]

But,

\[ p = \frac{240}{283} = 0.848 \]


Thus,

\[ 0.848 \pm 1.96 \sqrt{\frac{0.848(0.152)}{283}} = 0.848 \pm 0.042 \]

The 95% confidence interval for P is (0.806, 0.890). We reject $H_0$ at the 5% significance level, because the hypothesised value, $P_0 = 0.80$, does not fall in the interval.
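The confidence-interval approach can be checked with a few lines of Python. This is a sketch only; the function name `proportion_ci` is illustrative and the multiplier 1.96 is the usual normal value for 95 % confidence.

```python
import math


def proportion_ci(y, n, z=1.96):
    """Large-sample confidence interval for a proportion: p +/- z * sqrt(p(1 - p)/n)."""
    p_hat = y / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Violent-crime example: 240 of 283 crimes involved firearms; hypothesised P0 = 0.80.
lower, upper = proportion_ci(240, 283)
P0 = 0.80

print(f"95% CI for P: ({lower:.3f}, {upper:.3f})")
print("reject H0" if not (lower <= P0 <= upper) else "fail to reject H0")
```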

Exercises 2.3
2.16 A claim was made that 60% of the adult population thinks that there is too much violence on television. A random sample of 200 adults found that 110 thought that there is too much violence on television. Is this enough evidence to reject the claim?

2.17 The government believes that no more than 25% of all college students would favour reducing the penalties for the use of marijuana. A sample of 2 400 college students revealed that 750 favour reducing the penalties.

a) Set up null and alternative hypotheses to evaluate the government's claim.
b) Give the form of the standardised test statistic and calculate its value.
c) Compute the p-value and determine whether there is sufficient statistical evidence to reject the government's claim.
d) State your conclusion.

2.18 A psychologist has developed a new aptitude test and believes that 80% of the public should score above 50 on the test. From a sample of 200 people, 164 scored above 50.

a) Is there statistical evidence that the claim made by the psychologist is not valid?
b) For the results to be significant at the 5% level of significance, how many out of 200 will have to score above 50 on the aptitude test?

3. ANALYSIS OF VARIANCE

3.1 Introduction to completely randomised design

Testing the equality of two population means is achieved through t-distribution procedures. The experimenter is at liberty to select the type I error rate (the probability of rejecting the null hypothesis when it is true) when setting the critical region or rejection rule. We draw inference on the population means based on the sample data. Using the t-test becomes complicated when there are more than two population means. For instance, with four treatments we require $\binom{4}{2} = 6$ (read "4 choose 2") pair-wise comparisons, namely {(1,2), (1,3), (1,4), (2,3), (2,4), and (3,4)}. Each comparison carries a type I error of, say, $\alpha$, and the probability of making at least one type I error increases with the number of pair-wise comparisons; for $m$ independent comparisons it is $1 - (1 - \alpha)^m$. The analysis of variance is used as an alternative procedure for testing the equality of the population means simultaneously, while using a single type I error, $\alpha$.

The design of an experiment is the process of planning a study, and conclusions are drawn from such experiments. The analysis of variance is concerned with the comparison of t population (treatment) means $\mu_1, \mu_2, \ldots, \mu_t$. We would like to use sample results to draw inference on the means.
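To see how quickly the overall type I error grows when every pair of treatments is compared with a separate t-test, consider the sketch below. It assumes, purely for illustration, that the pair-wise comparisons are independent, which is not exactly true when they share the same samples, so the figure is an approximation. The function name `familywise_error` is hypothetical.

```python
from math import comb


def familywise_error(t, alpha):
    """Number of pair-wise comparisons among t treatments and the probability of
    at least one type I error, assuming the comparisons are independent."""
    m = comb(t, 2)  # "t choose 2" pair-wise comparisons
    return m, 1 - (1 - alpha) ** m


for t in (2, 3, 4, 6, 10):
    m, fwer = familywise_error(t, 0.05)
    print(f"t = {t:2d}: {m:2d} comparisons, P(at least one type I error) = {fwer:.3f}")
```

With four treatments and $\alpha = 0.05$ per comparison, the chance of at least one false rejection is roughly 0.26 under this assumption, which is the motivation for a single overall test.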

The model

A statistical model for an observation made on subject j receiving treatment i, denoted $y_{ij}$, is expressed as

\[ y_{ij} = \mu_i + \varepsilon_{ij}, \quad \text{where } \mu_i = \mu + \tau_i, \quad i = 1, 2, \ldots, t, \quad j = 1, 2, \ldots, n_i \]

$\mu$ = Overall mean.
$\mu_i$ = Mean of the ith population or treatment.
$\tau_i$ = ith treatment effect.
$\varepsilon_{ij}$ = Random effect due to the jth replication receiving the ith treatment.

The statistical model can be written in two forms, namely the means effects model and the fixed effects model. That is,

Means effects model:   $y_{ij} = \mu_i + \varepsilon_{ij}$
Fixed effects model:   $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$

The null hypothesis related to the fixed effect model is

$H_0: \tau_1 = \tau_2 = \ldots = \tau_t = 0$, where $\tau_i = \mu_i - \mu$, $i = 1, 2, \ldots, t$. The null hypothesis for the means model is stated as $H_0: \mu_1 = \mu_2 = \ldots = \mu_t = \mu$, and the alternative hypothesis as $H_a$: Not all treatment means are equal (i.e. $\mu_i \neq \mu_{i'}$ for some $i \neq i'$).
Assumptions

The statistical model for a completely randomised design is based on the following assumptions.

- Each population is normally distributed. That is, $y_{ij} \sim$ i.i.d. $N(\mu_i, \sigma^2)$, $i = 1, 2, \ldots, t$.
- The variance, denoted $\sigma^2$, is the same for each population.
- The observations must be independent.

Usually, the above assumptions are summarised by the following mathematical expression: $\varepsilon_{ij} \sim$ i.i.d. $N(0, \sigma^2)$, $i = 1, 2, \ldots, t$, $j = 1, 2, \ldots, n_i$, where i.i.d. stands for independent and identically distributed, and N denotes the normal distribution, here with mean zero and constant variance $\sigma^2$.

The design layout

Suppose an investigator intends to carry out an experiment to investigate the performance of four varieties. Suppose the available experimental material allows for 12 homogeneous experimental units. Thus, each variety occupies three units, say plots. Denote the 4 varieties by V1, V2, V3, and V4. A simple randomisation approach is to write each variety number on three pieces of paper, giving 12 pieces in all, fold each of them and shuffle them. Pick the pieces one at a time at random and allocate the varieties to the units sequentially. For this example, the layout as a completely randomised design would be

V4 V2 V4

V1 V3 V2

V3 V4 V1

V2 V1 V3
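The paper-slip randomisation can also be done by computer. The following Python sketch shuffles three copies of each variety label and assigns them to the 12 units in order. The function name `crd_layout` and the seed value are illustrative; a seed is used only so that a layout can be reproduced, and a different seed gives a different, equally valid layout.

```python
import random


def crd_layout(treatments, reps, seed=None):
    """Randomly allocate each treatment to `reps` experimental units.

    Returns a list whose k-th entry is the treatment assigned to unit k.
    """
    rng = random.Random(seed)
    labels = [trt for trt in treatments for _ in range(reps)]
    rng.shuffle(labels)  # equivalent to shuffling the folded pieces of paper
    return labels


layout = crd_layout(["V1", "V2", "V3", "V4"], reps=3, seed=2002)

# Print the 12 units as a grid of 4 rows by 3 plots.
for start in range(0, len(layout), 3):
    print(" ".join(layout[start:start + 3]))
```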

The estimation

Under $H_0: \mu_1 = \mu_2 = \ldots = \mu_t = \mu$, each sample observation would have been drawn from the same normal probability distribution with mean $\mu$ and variance $\sigma^2$. Recall that the sampling distribution of the sample mean, $\bar{y}$, for a simple random sample of size n from a normal population is normal with mean $\mu$ and standard deviation $\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}$. The best estimate of the mean of the sampling distribution of $\bar{y}$ is the mean of the individual sample means. That is,

\[ \bar{y}_{..} = \frac{\bar{y}_{1.} + \bar{y}_{2.} + \ldots + \bar{y}_{t.}}{t} \]

The between-samples variation provides a good estimate of $\sigma^2$ only if the null hypothesis is true. If the null hypothesis is false, the between-samples variation overestimates $\sigma^2$. The within-samples variation provides a good estimate of $\sigma^2$ in either case. If the null hypothesis is true, both estimates will be similar and their ratio will be close to 1. If the null hypothesis is false, the between-samples estimate will be larger than the within-samples estimate, and their ratio will be large.

The analysis of variance is a statistical technique for testing the hypothesis that the means of three or more populations (treatments) are equal. It can also be used to test the hypothesis that the means of two populations are equal.

Pooling is the process of combining the results of two or more independent simple random samples to provide an estimate of $\sigma^2$. When a simple random sample is selected from each population, each of the sample variances provides an unbiased estimate of the population variance $\sigma^2$. The estimate of $\sigma^2$ obtained by combining the individual estimates into an overall estimate is called the within-samples estimate.

The sample mean for the ith treatment:

\[ \bar{y}_{i.} = \frac{1}{n_i} \sum_{j=1}^{n_i} y_{ij}, \quad i = 1, 2, \ldots, t \]

The sample variance for the ith treatment:

\[ s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2 \]

The total sample size:

\[ n_T = n_1 + n_2 + \ldots + n_t \]

Recall that variance is a measure of the dispersion in a set of responses and is calculated as the average squared distance of the responses from their mean (using the divisor $n - 1$ for a sample).

3.2 Between samples estimate of population variance

Consider the sample means, each estimating the population mean of one of the treatments under investigation. These sample means are statistics with a sampling distribution that is normal. The sampling distribution of the sample mean for treatment i is $\bar{y}_{i.} \sim N\left(\mu_i, \frac{\sigma^2}{n_i}\right)$, $i = 1, 2, \ldots, t$. The investigator wishes to assess how each of these sample means differs from $\bar{y}_{..}$, which estimates the overall population mean, $\mu$. The component that measures the deviation of the sample means from the overall sample mean is called the mean square between, denoted MSB. This is defined as

\[ MSB = \frac{SSB}{t - 1} = \frac{1}{t - 1} \sum_{i=1}^{t} n_i (\bar{y}_{i.} - \bar{y}_{..})^2 \]

where SSB = sum of squares between treatment means. The deviations are squared to remove the negative signs, and the divisor $t - 1$ is the corresponding degrees of freedom. Each deviation is weighted by the corresponding number of replications $n_i$. The MSB is sometimes referred to as systematic variance and can be explained in terms of the independent variables or independent groups or treatments. For instance, suppose we wish to test the performance of a pressure cooker at three different pressure settings. We run the pressure cooker at 20, 40 and 60 kilopascals and record the temperature at which the water boils. We take five temperature readings at each pressure. The average deviation from the overall mean of the means of the five readings at each pressure provides the measure of systematic variance. The component MSB is an unbiased estimator of $\sigma^2$ under $H_0$. If the means of the t populations are not equal, the MSB is not an unbiased estimator of $\sigma^2$ and overestimates it.

3.3 Within samples estimate of population variance

The component that measures the deviation of each observation from its own treatment mean is called the mean square within, denoted MSW. It is also an estimate of $\sigma^2$ and is defined as

\[ MSW = \frac{SSW}{n_T - t} = \frac{1}{n_T - t} \sum_{i=1}^{t} (n_i - 1) s_i^2 \]

where SSW = sum of squares within the treatments, and $s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i.})^2$ is the sample variance for treatment i.

The estimate MSW is not influenced by whether or not the null hypothesis is true, unlike the MSB. It always provides an unbiased estimate of $\sigma^2$. The MSW is referred to as the error variance or random error. This refers to the random variation among observations within samples, which we find when we select random samples from a population.
3.4 Comparing the variance estimates

The sampling distribution of the ratio $\frac{MSB}{MSW}$ of the two independent estimates of $\sigma^2$ follows an F distribution under $H_0$. Thus, under $H_0$, and when the assumptions are valid, the sampling distribution of $\frac{MSB}{MSW}$ is an F distribution with numerator degrees of freedom equal to $t - 1$ and denominator degrees of freedom equal to $n_T - t$. In general, an F distribution is the distribution of the ratio of two independent chi-square random variables, each divided by its degrees of freedom. Thus, the range of F is from zero to positive infinity.

The value of $\frac{MSB}{MSW}$ is inflated because MSB overestimates $\sigma^2$ when the means of the t populations are not equal. Hence, we will reject $H_0$ if the resulting value of $\frac{MSB}{MSW}$ appears to be too large to have been selected at random from an F distribution with $t - 1$ degrees of freedom in the numerator and $n_T - t$ in the denominator. The value of $\frac{MSB}{MSW}$ that will cause us to reject $H_0$ depends on $\alpha$, the level of significance. Table D provides the sampling distribution of $\frac{MSB}{MSW}$ and the rejection region associated with a level of significance equal to $\alpha$, where $F_\alpha$ denotes the critical value.

To read the value of F from the table, you need the numerator and denominator degrees of freedom and the level of significance, $\alpha$. Locate the F value that corresponds to the within degrees of freedom in the first column and the between degrees of freedom in the first row for the given $\alpha$ level. Often, the F table is provided for $\alpha$ = 0.05 or 0.01. You will note later that most statistical software produces all of these statistics. An important statistic among these is the p-value, which is the probability, computed under $H_0$, of obtaining an F value at least as large as the calculated F value. The decision to reject or not to reject the null hypothesis is based on the comparison made between the p-value and the $\alpha$ level. We reject the null hypothesis if the p-value < $\alpha$ and otherwise fail to reject.

3.5 Computation formulae

The formulae previously discussed are difficult to apply. Equivalent formulae that are easy to use are presented below.

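\[ \text{Sum of squares total (SST)} = \sum_{i=1}^{t} \sum_{j=1}^{n_i} y_{ij}^2 - C.F. \]

where C.F. is the correction factor calculated as

\[ C.F. = \frac{1}{n_T} \left( \sum_{i=1}^{t} \sum_{j=1}^{n_i} y_{ij} \right)^2 \]

\[ \text{Sum of squares treatment (SSTrt)} = \sum_{i=1}^{t} \frac{y_{i.}^2}{n_i} - C.F. \]

where $y_{i.} = \sum_{j=1}^{n_i} y_{ij}$ is the total of the observations for treatment i.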

Sum of squares error (SSE) = SST - SSTrt.
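The computational formulae can be sketched in Python as below. The three small groups of observations are hypothetical values invented for illustration, not data from this module, and the function name `anova_sums` is likewise an assumption. The treatment term is computed from treatment totals, each squared and divided by its own number of observations.

```python
def anova_sums(groups):
    """Return (SST, SSTrt, SSE) using the correction-factor formulae.

    `groups` is a list of lists, one inner list of observations per treatment.
    """
    all_obs = [y for g in groups for y in g]
    n_total = len(all_obs)
    cf = sum(all_obs) ** 2 / n_total                        # correction factor
    sst = sum(y * y for y in all_obs) - cf                  # total sum of squares
    sstrt = sum(sum(g) ** 2 / len(g) for g in groups) - cf  # treatment sum of squares
    sse = sst - sstrt                                       # error sum of squares
    return sst, sstrt, sse


# Hypothetical observations for three treatments with unequal replication.
example_groups = [[12, 15, 11], [18, 20, 17, 21], [9, 8, 10]]
sst, sstrt, sse = anova_sums(example_groups)
print(f"SST = {sst:.4f}, SSTrt = {sstrt:.4f}, SSE = {sse:.4f}")

# Cross-check SSE against the definition: squared deviations from each group mean.
sse_direct = sum(
    sum((y - sum(g) / len(g)) ** 2 for y in g) for g in example_groups
)
print(f"SSE from the definition = {sse_direct:.4f}")
```

Both routes give the same error sum of squares, which is a useful check when the sums are done by hand.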

The mean squares are computed as the ratio of the sum of squares to the degrees of freedom. The analysis of variance table, denoted ANOVA, is a convenient display of the calculations of the between, within and total sums of squares, the associated degrees of freedom and the mean squares. It is composed of non-negative values. A general ANOVA table follows:

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   Fcalc     Ftable
Between               t - 1                SSB              MSB           MSB/MSW
Within                nT - t               SSW              MSW
Total                 nT - 1               SST

The analysis of variance can be viewed as the process of partitioning the total sum of squares and degrees of freedom into two sources, between and within. Dividing the sum of squares by the appropriate degrees of freedom provides the variance estimates and the F value used to test the hypothesis of equal population means. The degrees of freedom and the sum of squares are the only additive columns. Thus, we need to compute only two of the entries in each of these columns, and the third can be obtained by subtraction.
Example 3.1

To test if the mean time needed to mix a batch of material is the same for machines produced by three manufacturers, the following data on the time (in minutes) needed to mix the material were obtained.

                               Manufacturer
                               1       2       3
                               20      28      20
                               26      26      19
                               24      31      23
                               22      27      22
Sample mean $\bar{y}_{i.}$:    23      28      21
Sample variance $s_i^2$:       6.67    4.67    3.33

Test if the population mean times needed to mix a batch of material differ for the three manufacturers at the 5% significance level.

Solution

Treatments, t = 3, and sample size per treatment, $n_1 = n_2 = n_3 = n = 4$.

\[ \bar{y}_{..} = \frac{\bar{y}_{1.} + \bar{y}_{2.} + \bar{y}_{3.}}{3} = \frac{23 + 28 + 21}{3} = 24 \]

Or

\[ \bar{y}_{..} = \frac{1}{n_T} \sum_{i=1}^{t} \sum_{j=1}^{n_i} y_{ij} = \frac{288}{12} = 24 \]

\[ SSB = \sum_{i=1}^{t} n_i (\bar{y}_{i.} - \bar{y}_{..})^2 = 4(23 - 24)^2 + 4(28 - 24)^2 + 4(21 - 24)^2 = 104 \]

\[ MSB = \frac{SSB}{t - 1} = \frac{104}{2} = 52 \]

\[ SSW = \sum_{i=1}^{t} (n_i - 1) s_i^2 = 3(6.67) + 3(4.67) + 3(3.33) = 44.01 \]

\[ MSW = \frac{SSW}{n_T - t} = \frac{44.01}{12 - 3} = 4.89 \]

\[ F_{calc} = \frac{MSB}{MSW} = \frac{52}{4.89} = 10.63 \]

$F_{table,\,0.05}(2, 9) = 4.26$

The ANOVA Table

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square   Fcalc    Ftable, 0.05
Between               2                    104.00           52.00         10.63    4.26
Within                9                    44.01            4.89
Total                 11                   148.01

Conclusion: Since Fcalc= 10.63 > Ftable, 0.05(2, 9) = 4.26, we reject the null hypothesis that the mean time needed to mix a batch of material is the same for each manufacturer at 5% significance level. This means that there is at least one significant difference between the means.
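The ANOVA quantities in this example depend on the data only through the treatment means, variances and sample sizes, so they can be verified from those summary statistics alone. The Python sketch below does this; the function name `anova_from_summary` is illustrative, and the critical value 4.26 is copied from the F table rather than computed.

```python
def anova_from_summary(means, variances, sizes):
    """One-way ANOVA quantities from per-treatment means, variances and sizes.

    Returns (SSB, SSW, MSB, MSW, F).
    """
    t = len(means)
    n_total = sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total
    ssb = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    ssw = sum((n - 1) * v for n, v in zip(sizes, variances))
    msb = ssb / (t - 1)
    msw = ssw / (n_total - t)
    return ssb, ssw, msb, msw, msb / msw


# Summary statistics for the three manufacturers in the example above.
ssb, ssw, msb, msw, f_calc = anova_from_summary(
    means=[23, 28, 21], variances=[6.67, 4.67, 3.33], sizes=[4, 4, 4]
)

F_TABLE = 4.26  # F table value at the 5 % level with (2, 9) degrees of freedom

print(f"SSB = {ssb:.2f}, SSW = {ssw:.2f}, MSB = {msb:.2f}, MSW = {msw:.2f}")
print(f"F = {f_calc:.2f}:", "reject H0" if f_calc > F_TABLE else "fail to reject H0")
```

The grand mean is computed as a weighted average of the treatment means, which reduces to the simple average used in the worked solution when all sample sizes are equal.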

The rejection of the null hypothesis using F does not pinpoint where the specific differences are. Further analysis is therefore required to investigate which treatment means are different. Multiple comparison tests (some are more conservative than others) are used to achieve this. If the structure of the treatment means is known prior to the experiment, contrast or regression techniques could be used. For instance, if the treatments have a qualitative structure, then reasonable contrasts can be constructed. If the structure is quantitative, then regression techniques can be applied. If the treatment structure is not known at all, which is unusual, multiple comparison test techniques can be used.

It should be noted that a nonsignificant F test in the analysis of variance indicates the failure of the experiment to detect any difference among treatments. A nonsignificant F test does not in any way prove that all treatments are the same, because the failure to detect a treatment difference could be the result of either a very small or nil treatment difference, or a very large experimental error, or both. Thus, one needs to examine the size of the experimental error and the numerical differences among treatment means whenever the F test is nonsignificant.

Steps in testing hypothesis

Below are useful steps to follow when conducting a test of hypothesis.

Step 1: State the statistical model and the associated assumptions based on the design of experiment used and the treatment structure.
Step 2: State the null and alternative hypotheses, based on the interest of the investigator.
Step 3: Choose the level of significance $\alpha$, which depends on the desired confidence to be attached to the results.
Step 4: Develop the critical region (rejection region), which depends on the alternative hypothesis.
Step 5: Compute the test statistics, say, sums of squares, mean squares, F-calculated and p-values.
Step 6: Draw conclusions based on the analysis of variance results. Further statistical analyses are directed by the outcome of the ANOVA results.

3.6 Advantages and disadvantages

A completely randomised design has the following advantages over other designs:

a) It is easy to set up and analyse.
b) It provides the maximum number of degrees of freedom for the estimation of error variation.
c) Missing values cause no difficulty.

Disadvantages

a) The approach is insensitive when the experimental units are heterogeneous, because it assumes the units to be homogeneous.
b) It is difficult to maintain homogeneity among units when the number of treatments is large. Thus, the approach is suitable only for small numbers of treatments.

Exercise 3.1

3.1 Use the F table to decide whether the following calculated F values would exceed the table value at the 0.01 significance level:

i) F at df1 = 14 and df2 = 100
ii) F at df1 = 2 and df2 = 40



iii) F at df1 = 9 and df2 = 30


3.2 Consider the following set of data on scores.

Group A   Group B   Group C
62        60        59
60        60        49
50        58        49
48        53        47
47        49        42

i) Find the total sum of squares, the within-groups sum of squares, and the between-groups sum of squares.
ii) Present your results in an analysis of variance table.
iii) Is there enough evidence at the 5 % significance level to suggest that the three treatment means are significantly different?

3.3 A researcher investigates emotional stability in three groups of children: a control group who come from a stable background, children who have been physically abused, and children who have been sexually abused. Higher scores indicate greater stability. The researcher wants to test the hypothesis that any abused child shows less emotional stability. The following are the data.

Control 8 9 7 8 9

Physically abused 3 4 3 2 4

Sexually abused 4 2 2 3 3

a) State both the null and alternative hypotheses.
b) Set up an ANOVA table and test whether there is a significant difference between the groups at the 5 % significance level.

3.4 A researcher wants to know what type of humour appeals most to students. She looks at three different types: slapstick, puns and stand-up comedy. Three different groups laughed as follows.

Slapstick 5 3 5 4 6
Puns Stand-up comedy 3 8 6 6 4 4 9 3 3 3


Conduct a one-way analysis of variance to test whether the three types differ significantly at the 5 % significance level.

3.5 A study investigated the perception of corporate ethical values among individuals specialising in marketing. The following data on scores were recorded, where higher scores indicate higher ethical values.

                  Marketing Managers   Marketing Research   Advertising
                  6                    5                    6
                  5                    5                    7
                  4                    4                    6
                  5                    4                    5
                  6                    5                    6
                  4                    4                    6
Sample mean       5                    4.5                  6
Sample variance   0.8                  0.3                  0.4

Using the 5 % significance level, test whether there are significant differences in perception among the groups of specialists.

3.6 As a result of the recent revisions to the tax law, investment in equity instruments has become increasingly attractive. The accompanying table lists the annual internal rates of return for several different investment portfolios managed by three separate investment firms.

Firm A   Firm B   Firm C
16.9     15.1     10.0
15.0     12.5     13.1
16.2     13.0     12.3
15.8     11.8     10.2
17.1              8.9

Carry out the analysis of the above data to test the equality of the three investment firms with respect to the mean annual internal rate of return earned on portfolios. Use the 5 % significance level.

3.7 Samples of peanut butter produced by three different manufacturers are tested for aflatoxin content, with the following results:

Brand A   Brand B   Brand C
2.5       2.5       2.3
6.3       1.8       1.5
3.1       3.6       0.4
2.7       4.1       3.8
5.5       1.2       2.2
4.3       0.7       1.0


a) Determine whether there is a significant difference between the brand means at the 5 % significance level.
b) Outline the assumptions for a valid analysis of variance.

3.8 The following are the litres per 100 kilometres which a test driver obtained with measured quantities of five brands of petrol containing various additives:

Brand S T C 8.71 8.11 8.71 11.20 8.71 8.71 10.69 7.35 9.80

M 9.80 11.20 10.23

E 9.40 11.76 11.20

Test the hypothesis that the five brands of petrol give the same results. Use the 1 % significance level.

3.9 A postgraduate student in the Department of Dietetics studied the effect of diet on blood sugar. Originally 32 subjects were selected for their uniformity and assigned randomly to four diet groups, eight individuals per diet group. A mishap resulted in the loss of the records for six subjects. The following are the results for the remaining cases:

Diet I   Diet II   Diet III   Diet IV
24       26        30         30
18       21        32         28
25       23        29         27
23       25        25         23
22       20        31         31
         24        33         25
         20        29
                   28

Determine whether the four diets have different effects on blood sugar levels. Use the 5 % significance level.

3.10 In an assessment of five different reading programmes, a number of children judged to be equivalent in abilities on the basis of pre-testing were assigned at random to the five programmes. Assessments of the reading capacities of the children completing the programmes produced the following scores:

Programme I   Programme II   Programme III   Programme IV   Programme V
63            81             72              59             62
67            71             77              65             71
59            74             79              70             73
60            70             83              71             67
72            73             70              67             68
58            83             82              60             61
65            79             71              62             68
64            80             77              66             66

Determine whether there are any differences among the five programmes. Use the 5 % significance level.

3.11 A random sample of 16 observations was selected from each of four populations. A portion of the ANOVA table is given below:

Source of variation   Degrees of freedom   Sum of squares   Mean square   F
Between                                                     400
Within
Total                                      1 500

a) Complete the missing entries in the ANOVA table.
b) Test whether the treatment means of the four populations are equal, using the 5 % significance level.

3.12 Random samples of 25 observations were selected from each of three populations. For these data, the sum of squares between (SSB) = 120 and the sum of squares within (SSW) = 216.

a) Set up the ANOVA table for this problem.
b) What is the critical value of F? Use the 5 % significance level.
c) Are the three population means equal at the 5 % significance level?


4. BEYOND ANALYSIS OF VARIANCE

4.1 Introduction

A common first step is to subject the data to an analysis of variance to determine whether significant differences exist among the treatment means. The overall F test provides statistical evidence that some significant difference exists between the treatments under investigation. For instance, with three treatments, a rejection of the null hypothesis indicates that the treatment means are not all equal: that is, $\mu_1 \ne \mu_2$, or $\mu_1 \ne \mu_3$, or $\mu_2 \ne \mu_3$, or all three means differ. It cannot tell us where the differences between the means lie. While the t test applies only to two treatments, the F test applies to two or more treatments. Interest then lies in finding the source of the difference that contributed to the significant overall F test. Various procedures are in use in such circumstances. A more recent approach suggests the use of regression techniques if the treatments are quantitative. If they are qualitative and a priori information on the treatment structure is available, appropriate contrasts can be formulated and tested through the ANOVA. If no structure is known and the treatments are qualitative, multiple comparison procedures can be applied.

4.2 Multiple comparison procedures

After the analysis of variance, the data are further analysed in an attempt to explain the nature of the response in more detail. A number of statistical procedures may be used for this purpose. Among these are:

a) Fitting response functions using regression techniques.
b) Planned sets of contrasts among means, or groups of means.
c) Pairwise multiple comparison procedures.

Some of these procedures are appropriate with some kinds of treatments and entirely inappropriate with others. These statistical test procedures are used under different circumstances. Most commonly used are the post hoc tests which are modified t tests known to control for familywise error rates.

Fisher's least significant difference (LSD)

This is the most widely used method for making pairwise comparisons of treatment means. Suppose the overall F test led to a rejection of $H_0: \mu_1 = \mu_2 = \mu_3$. The possible sources of the difference are:

i) $H_0: \mu_1 = \mu_2$ against $H_a: \mu_1 \ne \mu_2$
ii) $H_0: \mu_1 = \mu_3$ against $H_a: \mu_1 \ne \mu_3$
iii) $H_0: \mu_2 = \mu_3$ against $H_a: \mu_2 \ne \mu_3$

To test any of the above possibilities, t test procedures can be applied. The test statistic for Fisher's LSD at the 5 % significance level is computed as follows:

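$$LSD_{0.05} = t_{0.025} \times s.e.(\bar{y}_{i.} - \bar{y}_{i'.})$$

where $t_{0.025}$ is taken at the degrees of freedom of MSW, and

$$s.e.(\bar{y}_{i.} - \bar{y}_{i'.}) = \sqrt{MSW\left(\frac{1}{n_i} + \frac{1}{n_{i'}}\right)}$$

Reject $H_0: \mu_i = \mu_{i'}$ in favour of the alternative at the 5 % significance level if $|\bar{y}_{i.} - \bar{y}_{i'.}| > LSD_{0.05}$. Fisher's LSD test is commonly referred to as the protected or restricted LSD, because it is only applied when the overall F test is significant.
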
Example 4.1

Consider the information obtained in Example 3.1.

Treatment, i:                     1    2    3
Sample size, $n_i$:               4    4    4
Sample mean, $\bar{y}_{i.}$:      23   28   21

MSW = 4.89

$$s.e.(\bar{y}_{i.} - \bar{y}_{i'.}) = \sqrt{MSW\left(\frac{1}{n_i} + \frac{1}{n_{i'}}\right)} = \sqrt{4.89\left(\frac{1}{4} + \frac{1}{4}\right)} = 1.564$$

The table value with 9 degrees of freedom at the 5 % significance level is $t_{0.025} = 2.262$. Thus,

$$LSD_{0.05} = 2.262 \times 1.564 = 3.538$$

Reject $H_0: \mu_i = \mu_{i'}$ if $|\bar{y}_{i.} - \bar{y}_{i'.}| > LSD_{0.05}$.

Treatment difference                        Difference   Status
$\bar{y}_{1.} - \bar{y}_{2.}$ = 23 - 28     -5           Significant
$\bar{y}_{1.} - \bar{y}_{3.}$ = 23 - 21     2            Not significant
$\bar{y}_{2.} - \bar{y}_{3.}$ = 28 - 21     7            Significant

The overall F test assured us that at least two of the treatment means are significantly different at the 5 % significance level. Further analysis using Fisher's LSD indicates that the differences lie between treatment means 1 and 2 and between treatment means 2 and 3. Treatment mean 1 is not significantly different from treatment mean 3 at the 5 % significance level.

A confidence interval estimate of the form $(\bar{y}_{i.} - \bar{y}_{i'.}) \pm LSD_{0.05}$ can also be used for the same test. If the interval includes the value 0, we fail to reject the hypothesis that the treatment means are equal. However, if the confidence interval does not include the value 0, we conclude that there is a difference between the treatment means. The 95 % confidence intervals for the treatment differences are:

Trt 1 vs Trt 2:   -5 ± 3.538 = (-8.538, -1.462)
Trt 1 vs Trt 3:    2 ± 3.538 = (-1.538, 5.538)
Trt 2 vs Trt 3:    7 ± 3.538 = (3.462, 10.538)
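The LSD comparisons and the matching confidence intervals can be sketched in Python as follows. The sketch assumes the summary values quoted in Example 4.1 (sample means, sample sizes, MSW = 4.89) and a t value of 2.262 typed in from the t table; the function name `lsd_comparisons` is introduced here for illustration only. Results are computed without intermediate rounding.

```python
from itertools import combinations
from math import sqrt

# Summary values quoted in Example 4.1.
means = {1: 23.0, 2: 28.0, 3: 21.0}
sizes = {1: 4, 2: 4, 3: 4}
msw = 4.89
t_crit = 2.262  # assumed table value: t(0.025) with 9 degrees of freedom

def lsd_comparisons(means, sizes, msw, t_crit):
    """Return one record per pair of treatments."""
    records = []
    for i, j in combinations(sorted(means), 2):
        se = sqrt(msw * (1 / sizes[i] + 1 / sizes[j]))
        lsd = t_crit * se
        diff = means[i] - means[j]
        records.append({
            "pair": (i, j),
            "difference": diff,
            "lsd": lsd,
            "interval": (diff - lsd, diff + lsd),
            "significant": abs(diff) > lsd,
        })
    return records

comparisons = lsd_comparisons(means, sizes, msw, t_crit)
for rec in comparisons:
    low, high = rec["interval"]
    print(rec["pair"], rec["difference"],
          f"({low:.3f}, {high:.3f})", rec["significant"])
```

A pair is flagged as significant exactly when its interval excludes zero, so the two ways of reading the result always agree.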

Comparison-wise Type I error rate: This is the error rate associated with a single statistical test, that is, its level of significance. Thus, the comparison-wise Type I error rate remains $\alpha$, say $\alpha = 0.05$.

Experiment-wise Type I error rate: Suppose we conduct pairwise tests and set $\alpha = 0.05$ for each single t test. The probability that we will not make a Type I error is 1 - 0.05 = 0.95 for each test. Assuming the tests are independent, the probability that we will not make a Type I error in two consecutive t tests is (0.95)(0.95) = 0.9025. Thus, the probability of making at least one Type I error is 1 - 0.9025 = 0.0975. When we sequentially test two sets of hypotheses, the associated Type I error rate is therefore not 0.05 but 0.0975. This Type I error rate is called the experiment-wise Type I error rate.

In general, suppose we consider k treatments. The number of possible pairwise comparisons, C, is

$$C = \binom{k}{2} = \frac{k!}{(k-2)!\,2!} = \frac{k(k-1)}{2}$$

The probability of making at least one Type I error is the experiment-wise Type I error rate,

$$\alpha_{EW} = 1 - (1 - \alpha)^C$$

Fisher's LSD procedure leads to an experiment-wise Type I error rate that depends on the comparison-wise Type I error rate, $\alpha$, and on the number of comparisons, C.

Bonferroni adjustment: $\alpha_{EW} = 1 - (1 - \alpha)^C \le C\alpha$.

Thus, the maximum probability of making a Type I error for the overall experiment, $\alpha_{EW}$, can be maintained if we use a comparison-wise Type I error rate of size $\alpha_{EW}/C$.

Example 4.2

Refer to the information in Example 4.1, using $\alpha = 0.05$. The number of treatments is k = 3, thus the number of possible pairwise comparisons is

$$C = \frac{k(k-1)}{2} = \frac{3(3-1)}{2} = 3$$

$$\alpha_{EW} = 1 - (1 - \alpha)^C = 1 - (1 - 0.05)^3 = 1 - (0.95)^3 = 0.143$$

$$C\alpha = 3(0.05) = 0.15$$

We use a comparison-wise Type I error rate of size $\alpha_{EW}/C = 0.143/3 = 0.048$, since $\alpha_{EW} = 0.143 < C\alpha = 0.15$.
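How quickly the experiment-wise error rate grows with the number of treatments can be seen from a short Python sketch. The function name `experimentwise_rate` is an assumption made for illustration, and the formula treats the comparisons as independent, as in the text.

```python
from math import comb

def experimentwise_rate(alpha, k):
    """Return (C, experiment-wise rate, Bonferroni upper bound C * alpha)."""
    c = comb(k, 2)                  # number of pairwise comparisons
    alpha_ew = 1 - (1 - alpha) ** c
    return c, alpha_ew, c * alpha

# Reproduce Example 4.2 (k = 3) and look at a few larger experiments.
for k in (3, 4, 5, 10):
    c, alpha_ew, bound = experimentwise_rate(0.05, k)
    print(f"k={k:2d}  C={c:2d}  alpha_EW={alpha_ew:.3f}  bound={bound:.2f}")
```

For k = 10 there are already 45 pairwise comparisons, which is why an unadjusted comparison-wise rate of 0.05 gives little protection in large experiments.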

Tukey's procedure: This allows one to perform tests of all possible pairwise comparisons while still maintaining an overall experiment-wise Type I error rate, such as $\alpha_{EW} = 0.05$. It uses the studentised range probability distribution and assumes that all treatment means are based on the same sample size, n, and have equal variances. However, a generalised Tukey test can be used for the unequal sample size case. The statistic

$$q = \frac{\bar{y}_{max} - \bar{y}_{min}}{\sqrt{MSW/n}}$$

where

$\bar{y}_{max}$ = largest sample mean,
$\bar{y}_{min}$ = smallest sample mean,
MSW = mean square within treatments,

follows a studentised range distribution. Tukey's significant difference is denoted

$$TSD = q\sqrt{\frac{MSW}{n}}$$

where q is now the tabulated critical value of the studentised range.

Tukey's procedure is an unprotected testing approach; that is, it does not require a significant overall F test first. Thus, Tukey's procedure provides an alternative to the analysis of variance for testing whether the treatment means of k populations are equal. However, to use Tukey's procedure we still need to estimate the population variance using MSW.
Example 4.3

Consider the information given in Example 4.1.

Treatment, i:                     1    2    3
Sample size, $n_i$:               4    4    4
Sample mean, $\bar{y}_{i.}$:      23   28   21

MSW = 4.89
Error degrees of freedom = 9


The critical value of the studentised range, $q(\alpha, k, v)$, for k = 3 treatment means and v = 9 error degrees of freedom at the 5 % significance level is obtained from Table E. Thus, $q = q(\alpha, k, v) = q(0.05, 3, 9) = 3.95$. Hence,

$$TSD = q\sqrt{\frac{MSW}{n}} = 3.95\sqrt{\frac{4.89}{4}} = 4.367$$

with $\bar{y}_{max} = 28$ and $\bar{y}_{min} = 21$.

We reject $H_0: \mu_2 = \mu_3$ since $|\bar{y}_{2.} - \bar{y}_{3.}| = 7 > TSD = 4.367$, and conclude that the two treatment means are significantly different at the 5 % significance level. Similarly, we reject $H_0: \mu_2 = \mu_1$ since $|\bar{y}_{2.} - \bar{y}_{1.}| = 5 > TSD = 4.367$, and conclude that the two treatment means are significantly different at the 5 % significance level. But we fail to reject $H_0: \mu_1 = \mu_3$ since $|\bar{y}_{1.} - \bar{y}_{3.}| = 2 < TSD = 4.367$, and conclude that the two treatment means are not significantly different at the 5 % significance level. These conclusions can be summarised as follows:

Ordered treatment means:

Trt 3   Trt 1   Trt 2
21      23      28
-------------

Any two treatment means sharing the same line are not significantly different at the 5 % significance level.
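The Tukey comparisons of Example 4.3 can be checked with a small Python sketch. It assumes the summary values quoted in the example and the tabulated value q(0.05, 3, 9) = 3.95; the helper name `tukey_tsd` is hypothetical.

```python
from itertools import combinations
from math import sqrt

def tukey_tsd(msw, n, q_crit):
    """Tukey's significant difference for equal sample sizes."""
    return q_crit * sqrt(msw / n)

# Summary values quoted in Example 4.3.
means = {1: 23.0, 2: 28.0, 3: 21.0}
msw, n = 4.89, 4
q_crit = 3.95  # assumed table value: q(0.05, 3, 9)

tsd = tukey_tsd(msw, n, q_crit)
significant = {
    (i, j): abs(means[i] - means[j]) > tsd
    for i, j in combinations(sorted(means), 2)
}
print(f"TSD = {tsd:.3f}")
for pair, flag in significant.items():
    print(pair, "significant" if flag else "not significant")
```

Unlike the LSD, a single yardstick is used for every pair, which is what keeps the experiment-wise error rate at the chosen level.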

Remark: Multiple comparison tests are among the most often used, and most often misused, procedures. Their purpose is to detect possible groups among a set of unstructured treatments. They are not meant for quantitative treatments, for which response methodology is more appropriate. Nor are they intended to substitute for meaningful orthogonal comparisons, which can be formulated in advance based on the treatments used. The following points should be noted:

a) Care should be taken to select a statistical procedure which is appropriate for the data being analysed.
b) For experiments involving factorial sets of treatments or graded levels of quantitative factors, there is almost always a statistical procedure which can be specified in advance and which is more appropriate than a multiple comparison test.
c) For experiments involving qualitative treatments, it is often possible to form planned sets of comparisons to answer the objectives of the experiment.
d) Multiple comparison tests may be useful for grouping means from experiments involving unstructured qualitative treatments.
e) Indiscriminate use of multiple comparison tests can result in loss of information and reduced efficiency when more appropriate procedures are available.

Exercise 4.1

Refer to Exercise 3.1 to answer the following questions.

4.1 Refer to the results from question 3.2 to compute the LSD at the 5 % significance level and determine which treatment means are different.

4.2 Refer to the results from question 3.3 to compute Tukey's (TSD) critical value at the 5 % significance level and determine which treatment means are different.

4.3 Refer to the results from question 3.6 to construct a 95 % confidence interval for each of the pairwise differences between treatment means. Use these intervals to test the equality of these means.

4.4 Refer to the results from question 3.8 to compute Tukey's (TSD) critical value at the 5 % significance level and determine which treatment means are different.

4.5 Refer to the results from question 3.7 to compute the LSD at the 5 % significance level and determine which treatment means are different.


5. RANDOMISED COMPLETE BLOCK DESIGN

5.1 Introduction

Extraneous factors not considered in the experiment can inflate the mean square error (MSE), that is, the mean square within. This makes the F value small and so signals no significant difference among the treatment means when in fact such a difference exists. We wish to compare the treatment means with all known variation controlled, or rather eliminated from the experimental error. One way of eliminating known variation from the experimental error is to group the experimental units into homogeneous groups, commonly known as blocks.

For instance, if an experiment is to be carried out in KZN and race or gender is known to have an effect, the set-up of the experiment should take this known variation into consideration: race could be used as the blocking factor. The treatments under study should be applied within each block, with an independent randomisation in each block. Similarly, if a study involves the assessment of different types of fuel in a given city, the cars should be considered as blocks because they are known to differ in fuel consumption. If different management practices are to be compared within the farming community in KZN, the size of farm (small, medium or large) should be considered as a blocking factor, since farms of different sizes are known to differ and this information should be incorporated into the experiment. If an experiment involves comparing different animal feeds and the breed is known to have an effect, then breed should be used as a blocking factor. If an agronomist wants to conduct an experiment on a field known to have different levels of soil fertility, then soil fertility should be used as a blocking factor. And so on.

The randomised complete block design (RCBD) draws its name from the fact that the treatments are allocated at random within each block, with an independent randomisation in each block. Complete implies that each block contains a complete set of treatments.

The design is an extension of the completely randomised design (CRD) to the situation where the experimental units are no longer homogeneous. The principle behind this design is to divide all experimental units into homogeneous groups before applying the treatments. Each group is referred to as a block, or as a replication in the case of a balanced design. The design is balanced because each treatment occurs equally often in each block, so differences between blocks cancel out in any comparison of treatments. The criteria applied in grouping should ensure that there is minimum variation within the blocks and maximum variation between them. Differences between blocks are then removed from the random or unexplained variation.

The following should be noted about an RCBD:

a) Blocks should be laid perpendicular to the gradient in the case of directional variation.
b) Blocks need not be contiguous.

c) It is possible to replicate within a block; that is to say, a treatment may appear more than once in a block.
d) A block should represent a known source of variation that needs to be controlled by the experiment.
e) All the treatments should be randomised within each block, with an independent randomisation in each block.

Even when no obvious natural blocks exist, it is still sensible to define blocks representing major patterns of variation. Consider an experiment involving different varieties. If it is impossible to harvest all plots on a single day, harvesting may be carried out one block per day. Such blocking controls the variability that may be introduced on a given day (for example, due to rain).

Missing data can also occur in an RCBD. An advantage of the design is that the analysis can still be performed in the event of losing a complete block. A major restriction in the use of this design is the requirement that all treatments must appear in each block.

5.2 Aspect of blocking

The analysis of a completely randomised design assumes that the experimental units are homogeneous. Under this assumption, any difference between the treatment groups is expected to be due to the treatments only, and the within-treatment variation is assumed to be purely random. The experimental error is overestimated if the assumption is not true. The blocking technique is meant to utilise a priori information concerning the nature of the experimental units. Blocking is therefore defined as the process of grouping the experimental units into homogeneous groups such that the variation within the blocks is minimised and the variation between blocks is maximised. The approach aims at obtaining an unbiased estimate of the experimental error.

The field layout

Consider an experiment set to investigate the effect of 5 nitrogen levels on the growth of a new variety. Three types of soils are used as the blocking factor. Thus, 5 x 3 experimental units were used. We denote the nitrogen levels by N0, N1, N2, N3, and N4. Suppose the soil types were clay, loam and sand. The five nitrogen levels are randomly assigned to each block. At each stage, a new randomisation scheme is used. The layout is presented below.

Block 1:   N1   N3   N0   N2   N4
Block 2:   N2   N4   N1   N0   N3
Block 3:   N4   N0   N1   N3   N2
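A layout of this kind, with an independent randomisation in each block, can be generated with a short Python sketch. The function name `rcbd_layout`, the block labels and the seed are assumptions made for illustration; a different seed gives a different, equally valid layout.

```python
import random

def rcbd_layout(treatments, blocks, seed=None):
    """Assign every treatment once to each block, in an independent random order."""
    rng = random.Random(seed)
    layout = {}
    for block in blocks:
        order = list(treatments)
        rng.shuffle(order)  # fresh randomisation for this block
        layout[block] = order
    return layout

nitrogen_levels = ["N0", "N1", "N2", "N3", "N4"]
soil_blocks = ["Block 1", "Block 2", "Block 3"]

layout = rcbd_layout(nitrogen_levels, soil_blocks, seed=2002)
for block, order in layout.items():
    print(block, " ".join(order))
```

Fixing the seed makes the layout reproducible for record keeping; omitting it draws a new randomisation on each run.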

5.3 The model

Suppose the experimental material is grouped into b homogeneous groups (referred to as blocks) and the t treatments under investigation are randomly assigned within each block, with an independent randomisation in each block. Suppose $y_{ij}$ is the response variable corresponding to treatment i measured in block j, where i = 1, 2, ..., t and j = 1, 2, ..., b. We assume one measurement on each treatment in each block, and no treatment by block interaction. The response variable $y_{ij}$ is partitioned into components due to the overall mean, the treatment, the block and the random error. The mathematical expression is

$$y_{ij} = \mu + \tau_i + \beta_j + \varepsilon_{ij}$$

where

$\mu$ = the overall mean
$\tau_i$ = the ith treatment effect
$\beta_j$ = the jth block effect
$\varepsilon_{ij}$ = the random effect

The random effect $\varepsilon_{ij}$ is assumed to be independently and identically distributed normal with zero mean and constant variance, i.e. $\varepsilon_{ij} \sim \text{i.i.d. } N(0, \sigma^2)$. The model is also assumed to be additive.

The data collected from b blocks involving t treatments are usually summarised in a two-way table of totals as follows:

                          Block
Treatment       1      2      3      ...    b      Treatment totals
1               y11    y12    y13    ...    y1b    y1.
2               y21    y22    y23    ...    y2b    y2.
.               .      .      .             .      .
.               .      .      .             .      .
.               .      .      .             .      .
t               yt1    yt2    yt3    ...    ytb    yt.
Block totals    y.1    y.2    y.3    ...    y.b    y..


Notations

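The marginal treatment and block totals are denoted

$$y_{i.} = \sum_{j=1}^{b} y_{ij} \quad \text{and} \quad y_{.j} = \sum_{i=1}^{t} y_{ij},$$

respectively. The overall total is denoted

$$y_{..} = \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}$$

Similarly, the marginal means for the treatments and blocks are

$$\bar{y}_{i.} = \frac{1}{b} \sum_{j=1}^{b} y_{ij} \quad \text{and} \quad \bar{y}_{.j} = \frac{1}{t} \sum_{i=1}^{t} y_{ij},$$

respectively. The grand mean is

$$\bar{y}_{..} = \frac{1}{bt} \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}$$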

Definition formulae

A sum of squares is a sum of squared deviations taken over the levels of a factor. Thus, the total sum of squares measures the overall deviation of each observation from the overall mean. These deviations are summed over the levels of treatments and blocks.

Sum of squares total,

$$SSTotal = \sum_{i=1}^{t} \sum_{j=1}^{b} (y_{ij} - \bar{y}_{..})^2$$

The total sum of squares is partitioned into three components: those due to blocks, treatments and random effects. The sum of squares for blocks measures the deviation of the block means from the overall mean.

Sum of squares block,

$$SSBlk = t \sum_{j=1}^{b} (\bar{y}_{.j} - \bar{y}_{..})^2$$

Similarly, the sum of squares for treatments measures the deviation of the treatment means from the overall mean.

Sum of squares treatment,

$$SSTrt = b \sum_{i=1}^{t} (\bar{y}_{i.} - \bar{y}_{..})^2$$

The sum of squares error measures the variation among experimental units treated alike, that is, the random variation. It is also referred to as a measure of uncontrollable variation within the experimental units.

Sum of squares error,

$$SSE = \sum_{i=1}^{t} \sum_{j=1}^{b} (y_{ij} - \bar{y}_{i.} - \bar{y}_{.j} + \bar{y}_{..})^2$$

Computation formulae

Analysis using the definition formulae is tedious. Algebraically equivalent formulae are therefore often used instead. We refer to these as the computation formulae for the sums of squares.


Usually the first item to compute is the correction factor, which is the sum of squares for the mean. This requires adding all the bt observations, squaring the result and dividing it by the total number of observations, bt. Thus,

Correction factor,

$$C.F. = \frac{\left(\sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}\right)^2}{bt}$$

The total sum of squares requires each of the bt observations to be squared and summed, and the correction factor then subtracted. Thus,

$$SSTotal = \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}^2 - C.F.$$

An easier way to compute the sum of squares for blocks and for treatments is to construct a two-way table of totals, both body and marginal. To compute the sum of squares for blocks, square each block total, add the squares, divide the sum by the number of treatments, t, and then subtract the correction factor. Thus,

$$SSBlk = \frac{1}{t} \sum_{j=1}^{b} y_{.j}^2 - C.F.$$

Similarly, the sum of squares for treatments is obtained by squaring each treatment total, adding the squares, dividing the sum by the number of blocks, b, and then subtracting the correction factor. Thus,

$$SSTrt = \frac{1}{b} \sum_{i=1}^{t} y_{i.}^2 - C.F.$$

The additivity of the model allows the sum of squares error to be computed by subtracting both SSBlk and SSTrt from SSTotal. Thus,

$$SSE = SSTotal - SSBlk - SSTrt$$

The above sums of squares are called corrected or adjusted sums of squares. The unadjusted or uncorrected sums of squares are obtained when the correction factor is not subtracted during the computation.

The total degrees of freedom (df), computed by subtracting one from the total number of observations, are bt - 1. These are partitioned into degrees of freedom due to treatments, blocks and error: (t - 1) df due to treatments, (b - 1) df due to blocks and (b - 1)(t - 1) df due to error.

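That the definition and computation formulae give the same sums of squares can be checked numerically. The Python sketch below does this for a small data set; the data in `y` are hypothetical (four treatments in three blocks) and the function names are introduced here for illustration only.

```python
def ss_by_definition(y):
    """Sums of squares from deviations about the means; y[i][j] = treatment i, block j."""
    t, b = len(y), len(y[0])
    grand = sum(sum(row) for row in y) / (b * t)
    trt_means = [sum(row) / b for row in y]
    blk_means = [sum(row[j] for row in y) / t for j in range(b)]
    cells = [(i, j) for i in range(t) for j in range(b)]
    return {
        "SSTotal": sum((y[i][j] - grand) ** 2 for i, j in cells),
        "SSBlk": t * sum((m - grand) ** 2 for m in blk_means),
        "SSTrt": b * sum((m - grand) ** 2 for m in trt_means),
        "SSE": sum((y[i][j] - trt_means[i] - blk_means[j] + grand) ** 2
                   for i, j in cells),
    }

def ss_by_computation(y):
    """Sums of squares from totals and the correction factor."""
    t, b = len(y), len(y[0])
    cf = sum(sum(row) for row in y) ** 2 / (b * t)
    ss_total = sum(v ** 2 for row in y for v in row) - cf
    ss_blk = sum(sum(row[j] for row in y) ** 2 for j in range(b)) / t - cf
    ss_trt = sum(sum(row) ** 2 for row in y) / b - cf
    return {"SSTotal": ss_total, "SSBlk": ss_blk, "SSTrt": ss_trt,
            "SSE": ss_total - ss_blk - ss_trt}

# Hypothetical responses: 4 treatments (rows) in 3 blocks (columns).
y = [[10, 12, 14], [11, 15, 16], [9, 13, 12], [14, 16, 18]]
by_def = ss_by_definition(y)
by_comp = ss_by_computation(y)
for name in by_def:
    print(name, round(by_def[name], 6), round(by_comp[name], 6))
```

Note that SSE is computed directly from the residuals in the first function and by subtraction in the second, so their agreement also illustrates the additive partition of the total sum of squares.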
Computation of mean squares

The mean squares are computed by dividing each sum of squares by its degrees of freedom. Under the normality assumption, each sum of squares divided by the common variance $\sigma^2$ follows a chi-square distribution with the corresponding degrees of freedom (for blocks and treatments, when the respective effects are zero).

Mean square blocks,

$$MSBlk = \frac{SSBlk}{b - 1}$$

The quantity $SSBlk/\sigma^2$ is distributed chi-square with b - 1 degrees of freedom.

Mean square treatment,

$$MSTrt = \frac{SSTrt}{t - 1}$$

Similarly, $SSTrt/\sigma^2$ is distributed chi-square with t - 1 degrees of freedom.

Mean square error,

$$MSE = \frac{SSE}{(b - 1)(t - 1)}$$

Also, $SSE/\sigma^2$ is distributed chi-square with (b - 1)(t - 1) degrees of freedom.

Computation of the F-value

The ratio of MSTrt to MSE has an F distribution with (t - 1) numerator degrees of freedom and (b - 1)(t - 1) denominator degrees of freedom. Both MSTrt and MSE are unbiased estimators of the common variance, $\sigma^2$, when the null hypothesis of equality of treatment means is true, that is, $H_0: \tau_1 = \tau_2 = \dots = \tau_t = 0$. If the treatment effects are not equal, MSTrt tends to be larger than MSE. The larger the ratio, the more likely we are to reject the null hypothesis in favour of the alternative. Therefore, the calculated F value for testing the null hypothesis at a specified significance level is

$$F_{calc} = \frac{MSTrt}{MSE}$$

The calculated F value is compared against the table value, $F_{Table}$, obtained with (t - 1) numerator df and (b - 1)(t - 1) denominator df at the $\alpha$ significance level. The null hypothesis is rejected if $F_{calc}$ is greater than $F_{Table}$.

Similarly, the ratio of MSBlk to MSE is distributed F with (b - 1) numerator degrees of freedom and (b - 1)(t - 1) denominator degrees of freedom. Often this test is not performed, simply because the information about the blocks is known a priori. The hypothesis tested by this quantity depends on whether the blocks are considered random or fixed effects. When the blocks are considered fixed effects, the quantity

$$F_{calc} = \frac{MSBlk}{MSE}$$

tests $H_0: \beta_1 = \beta_2 = \dots = \beta_b = 0$ against $H_a$: at least two blocks are different. When the blocks are considered random effects, the interest lies in assessing the block variability, $\sigma_\beta^2$. This provides an indication of how effective the blocking was. The hypothesis tested by the calculated F under this condition is $H_0: \sigma_\beta^2 = 0$ against $H_a: \sigma_\beta^2 > 0$.


Reject the null hypothesis in both cases (fixed or random effects) if $F_{calc} > F_{Table}$, obtained using (b - 1) numerator df and (b - 1)(t - 1) denominator df at the $\alpha$ significance level.

The above computations are summarised in a table called the analysis of variance (ANOVA) table. The format of the ANOVA table is as follows:

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F calculated
Block                 b - 1                SSBlk            MSBlk          F = MSBlk / MSE
Treatment             t - 1                SSTrt            MSTrt          F = MSTrt / MSE
Error                 (b - 1)(t - 1)       SSE              MSE
Total                 bt - 1               SSTotal

Example 5.1

An automobile dealer conducted a test to determine if the time needed to complete a minor engine tune-up depends on whether a computerised engine analyser or an electronic analyser is used. Because tune-up time varies among compact, intermediate, and full-size cars, the three types of cars were used as blocks in the experiment. The data obtained are presented below.
                       Analyser
Car                Computerised   Electronic   Block totals
Compact            50             42           92
Intermediate       55             44           99
Full-size          63             46           109
Treatment totals   168            132          300

We consider the car types to be our blocking factor and the analysers to be the treatments under investigation. Thus we have three blocks and two treatments. We wish to test the equality of the two analyser methods at the 5 % significance level. Note: this is a very insensitive experiment because of the very few degrees of freedom for error.

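Hypothesis: $H_0: \tau_1 = \tau_2 = 0$ against $H_a: \tau_1 \ne \tau_2$

Critical region: Reject $H_0: \tau_1 = \tau_2 = 0$ in favour of $H_a: \tau_1 \ne \tau_2$ if $F_{calc} > F_{Table}(0.05, 1, 2)$.

Computation of the sums of squares:

$$C.F. = \frac{\left(\sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}\right)^2}{bt} = \frac{(300)^2}{(2)(3)} = 15\,000$$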


$$SSTotal = \sum_{i=1}^{t} \sum_{j=1}^{b} y_{ij}^2 - C.F. = 50^2 + 42^2 + \dots + 46^2 - C.F. = 15\,310 - 15\,000 = 310$$

$$SSBlk = \frac{1}{t} \sum_{j=1}^{b} y_{.j}^2 - C.F. = \frac{1}{2}(92^2 + 99^2 + 109^2) - C.F. = \frac{1}{2}(30\,146) - 15\,000 = 73$$

$$SSTrt = \frac{1}{b} \sum_{i=1}^{t} y_{i.}^2 - C.F. = \frac{1}{3}(168^2 + 132^2) - C.F. = \frac{1}{3}(45\,648) - 15\,000 = 216$$

$$SSE = SSTotal - SSBlk - SSTrt = 310 - 73 - 216 = 21$$

Computation of the mean squares:

$$MSBlk = \frac{SSBlk}{b - 1} = \frac{73}{3 - 1} = 36.5$$

$$MSTrt = \frac{SSTrt}{t - 1} = \frac{216}{2 - 1} = 216$$

$$MSE = \frac{SSE}{(b - 1)(t - 1)} = \frac{21}{(3 - 1)(2 - 1)} = 10.5$$

Computation of $F_{calc}$:

For treatments, $F_{calc} = \frac{MSTrt}{MSE} = \frac{216}{10.5} = 20.571$

For blocks, $F_{calc} = \frac{MSBlk}{MSE} = \frac{36.5}{10.5} = 3.476$

F table values:

$F_T(0.05, 1, 2) = 18.5$; $F_T(0.05, 2, 2) = 19.0$

The ANOVA table

Source of variation   Degrees of freedom   Sum of squares   Mean squares   F calculated   F table, 0.05
Cars                  2                    73               36.5           3.476          19.0
Analyser              1                    216              216            20.571         18.5
Error                 2                    21               10.5
Total                 5                    310

Conclusions: Reject $H_0: \tau_1 = \tau_2 = 0$ in favour of $H_a: \tau_1 \ne \tau_2$, since $F_{calc} = 20.571 > F_T(0.05, 1, 2) = 18.5$. Thus, we have enough evidence that the two analyser methods are significantly different at the 5 % significance level.

If we assume the type of car to be a random effect, then we fail to reject $H_0: \sigma_\beta^2 = 0$ in favour of $H_a: \sigma_\beta^2 > 0$, since $F_{calc} = 3.476 < F_T(0.05, 2, 2) = 19.0$. Thus, the variability among the car types was not significantly different from zero.

Remark: It should be noted that the multiple comparison tests discussed in Section 4 also apply to the randomised complete block design.
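The hand computations of Example 5.1 can be reproduced with the following Python sketch, which builds the RCBD analysis of variance quantities from the tune-up data. The function name `rcbd_anova` and the dictionary keys are assumptions made for illustration, and the F table values are typed in from the example rather than computed.

```python
def rcbd_anova(y):
    """ANOVA quantities for an RCBD; y[i][j] = treatment i in block j, one value per cell."""
    t, b = len(y), len(y[0])
    cf = sum(sum(row) for row in y) ** 2 / (b * t)  # correction factor
    ss_total = sum(v ** 2 for row in y for v in row) - cf
    ss_trt = sum(sum(row) ** 2 for row in y) / b - cf
    ss_blk = sum(sum(row[j] for row in y) ** 2 for j in range(b)) / t - cf
    ss_err = ss_total - ss_trt - ss_blk
    ms_trt = ss_trt / (t - 1)
    ms_blk = ss_blk / (b - 1)
    ms_err = ss_err / ((b - 1) * (t - 1))
    return {
        "SSTotal": ss_total, "SSBlk": ss_blk, "SSTrt": ss_trt, "SSE": ss_err,
        "MSBlk": ms_blk, "MSTrt": ms_trt, "MSE": ms_err,
        "F_trt": ms_trt / ms_err, "F_blk": ms_blk / ms_err,
    }

# Rows: computerised and electronic analysers (treatments).
# Columns: compact, intermediate and full-size cars (blocks).
tune_up = [[50, 55, 63], [42, 44, 46]]
table = rcbd_anova(tune_up)

# Table values quoted in Example 5.1.
f_table_trt, f_table_blk = 18.5, 19.0
print(table)
print("analysers differ:", table["F_trt"] > f_table_trt)
print("block variability detected:", table["F_blk"] > f_table_blk)
```

With only two error degrees of freedom the critical values are very large, which is the insensitivity noted in the example.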

Exercise 5.1

5.1 A nation-wide real estate chain is comparing townhouse prices in four cities across the country. The floor area of a townhouse is also known to be a determining factor in price fixing and should be controlled by using blocks. Therefore, in each city, the selling prices of randomly selected 90-, 120-, 150-, 180- and 210-square-metre townhouses are recorded. The results, to the nearest thousand Rand, are shown below.

Townhouse size (m2)   Bloemfontein   Durban   Port Elizabeth   Joburg
90                    165            185      173              200
120                   198            193      181              196
150                   251            215      197              278
180                   312            268      229              332
210                   405            381      294              446

Test whether the townhouse prices in the four cities are significantly different at the 5 % significance level. (Hint: The cities are the treatments and the townhouse sizes are the blocks.)


5.2 Five different auditing procedures were compared with respect to total audit time. To control for possible variation due to the person conducting the audit, four accountants were selected randomly and treated as blocks in the experiment. The following values were obtained using the ANOVA procedures: SSTotal = 100; SSTrt = 45; SSBlk = 36.
a) Set up an ANOVA Table, filling in the missing information.
b) Test whether there is any significant difference in total audit time stemming from the auditing procedure used. Use $\alpha$ = 0.05.
c) Determine which treatments could be significantly different, using Tukey's procedure.

5.3 An important factor in selecting software for word-processing and database management systems is the time required to learn how to use a particular system. To evaluate three file management systems, a firm designed a test involving five different word-processing operators. Since operator variability was believed to be a significant factor, each of the five operators was trained on each of the three file management systems. The data obtained are presented below:

Operator   System A   System B   System C
1          16         16         24
2          19         17         22
3          14         13         19
4          13         12         18
5          18         17         22

a) Carry out an analysis of variance and present your results in an ANOVA Table.
b) Using $\alpha$ = 0.05, test whether there is any significant difference in mean training times for the three systems.
c) Compute LSD at $\alpha$ = 0.05 and indicate which treatments could be significantly different.
d) Compute TSD at $\alpha$ = 0.05 and indicate which treatments could be significantly different.
e) Comment on the results obtained in parts (c) and (d).

5.4 Three groups of students are to be tested for the percentage of high-level questions asked by each group. As questions can be on various types of material, six lessons are taught to each group and a record is made of the percentage of high-level questions asked by each group on all six lessons.
a) Show a data layout for this situation.
b) Provide an ANOVA Table outline giving only the source of variation and degrees of freedom.

5.5 Suppose the data from question 5.4 are as follows:

Lesson   Group A   Group B   Group C
1        13        18        7
2        16        25        17
3        28        24        14
4        26        13        15
5        27        16        12
6        23        19        9

Carry out an analysis of variance on these data, treating each lesson as a block, and state your conclusions.

5.6 The effects of four types of graphite coaters on light box readings are to be studied. As these readings might differ from day to day, observations are to be taken on each of the four types every day for three days. The order of testing of the four types on any given day can be randomised. The results are:

        Graphite Coater Type
Day     M      A      K      L
1       4.0    4.8    5.0    4.6
2       4.8    5.0    5.2    4.6
3       4.0    4.8    5.6    5.0

a) State the null and alternative hypotheses to test equality of the four graphite coater types.
b) Analyse the data as a randomised complete block design and present your results in an ANOVA Table.
c) Determine whether the four types are significantly different at the 1 % significance level.
d) Determine which types are different at the 1 % significance level using Tukey's test procedure.
e) State your overall conclusions.

5.7 A study on a physical strength measurement, in kilogrammes, on seven subjects before and after a specified training period gave the results shown below.

Subject   Pretest   Posttest
1         45.36     52.16
2         49.90     56.70
3         40.82     47.63
4         49.90     58.97
5         56.70     63.50
6         58.97     63.75
7         47.63     56.70

a) Carry out the analysis as a paired t-test, stating the hypothesis. Use $\alpha$ = 0.05.
b) Carry out the analysis as a randomised complete block design, using subjects as blocks. Use $\alpha$ = 0.05.
c) Using the results from parts (a) and (b), verify that $t^2 = F$.


6. SPLIT-PLOT DESIGN

6.1 Introduction

A factor is a kind of treatment, and any factor can supply several treatments. For example, if diet is a factor under consideration, then several diets can be used. If baking temperature is a factor, then baking can be done at several temperatures. Such a factor provides a one-way treatment structure. A researcher may be interested in determining the combined effect of two or more factors. For instance, the interest may be in investigating the effect of humidity on seed germination in the presence of temperature. Such a joint effect is referred to as interaction. Forming all possible combinations of the levels of these factors produces treatment combinations, which are then randomly applied to the experimental units. This process is called factorial arrangement.
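The idea of a factorial arrangement can be sketched in a few lines of code. The example below is illustrative only: the humidity and temperature level labels, the number of replications and the function name `randomise_crd` are all assumptions made for the sketch, and the random allocation shown is that of a completely randomised design.

```python
from itertools import product
import random

# Hypothetical factor levels, chosen only to illustrate the idea
humidity = ["h1", "h2", "h3"]
temperature = ["t1", "t2"]

# Factorial arrangement: every humidity level with every temperature level
combinations = [h + t for h, t in product(humidity, temperature)]
print(combinations)


def randomise_crd(treatments, replications, seed=None):
    """Randomly order replicated treatments over the experimental units (CRD)."""
    rng = random.Random(seed)
    plan = list(treatments) * replications
    rng.shuffle(plan)
    return plan


# Unit 1 receives plan[0], unit 2 receives plan[1], and so on
plan = randomise_crd(combinations, replications=2, seed=1)
print(plan)
```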
6.2 The field layout

Consider the case of an agronomist who wishes to investigate the effect of spacing on maize yield in the presence of nitrogen. Suppose 3 spacings (s1, s2, and s3) and 4 nitrogen levels (n0, n1, n2, and n3) are considered. This is a two-way treatment structure with the two factors being spacing (at 3 levels) and nitrogen (at 4 levels). We formulate all possible combinations as

s1n0, s1n1, s1n2, s1n3, s2n0, s2n1, s2n2, s2n3, s3n0, s3n1, s3n2, s3n3

The 12 treatment combinations are randomly assigned to the experimental units according to the experimental design used, say CRD or RCBD. These treatments should be replicated in order to have an estimate of the experimental error needed for drawing inference.

Sometimes it is not practical to assign these treatments completely at random according to these designs. Suppose the study involves mechanisation (say m1, m2, m3, etc.) as one factor and variety (v1, v2, v3, etc.) as another, where mechanisation may refer to the method of land preparation. It is impractical to formulate these combinations and then randomise them according to CRD or RCBD, especially when mechanisation involves the use of farm machinery. An alternative approach is to randomise the mechanisation factor first and then randomise the variety over each level of the first factor. We illustrate this point using 3 levels of one factor and 4 levels of the other factor.
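[Figure: field layout of Block I. The three mechanisation levels (M1, M2, M3) are randomised over the whole plots, and the four varieties (V1 to V4) are randomised independently over the sub-plots within each whole plot.]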

The process is repeated for the other replications, ensuring independent randomisation at each stage. The process involves two stages of randomisation. In the case of an RCBD, we first randomise the three levels of mechanisation in each block and then the levels of variety over each level of mechanisation. The design discussed above is called a split-plot design. The words treatment and factor are used interchangeably in this case since they mean the same thing.

The split-plot design involves a two-way or higher-order treatment structure with an incomplete block design structure and at least two different sizes of experimental units. The bigger size is associated with the whole-plot treatment and the smaller size with the sub-plot treatment. The decision on which treatment to apply to a whole-plot or to a sub-plot is based on practicability and on the precision required for each treatment. The treatment of greater interest is placed on the smaller experimental unit and that of less interest on the larger unit. The interaction is also measured with higher precision. Since in split-plot experiments variation among sub-plots is expected to be less than among whole-plots, the factors which require smaller amounts of experimental material, or which are of major importance, or which are expected to exhibit smaller differences, or for which greater precision is desired for any reason, are assigned to the sub-plots. The selection of such a design depends on the practicability of the treatments, for example applying fertiliser to a whole plot and varieties to a sub-plot.

The fact that there are two sizes of experimental unit implies that there are two experimental errors, here referred to as error (a) and error (b). The plot layout requires the whole-plot treatments to be randomly applied to the whole-plots, and then the sub-plot treatments to be applied randomly within each whole-plot. Each application requires an independent randomisation. Split-plot designs are frequently used for factorial experiments. Such designs may incorporate one or more of the completely random, randomised complete block, or Latin square designs.
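The two stages of randomisation described above can be sketched as follows. This is a hedged illustration rather than a prescribed procedure: the function name `split_plot_layout`, the number of blocks and the seed are assumptions, while the labels M1 to M3 and V1 to V4 follow the mechanisation and variety example.

```python
import random


def split_plot_layout(whole_levels, sub_levels, n_blocks, seed=None):
    """Two-stage randomisation for a split-plot design laid out in blocks.

    Stage 1: randomise the whole-plot levels within each block.
    Stage 2: randomise the sub-plot levels independently within each whole plot.
    """
    rng = random.Random(seed)
    layout = []
    for block in range(1, n_blocks + 1):
        whole_order = rng.sample(whole_levels, k=len(whole_levels))
        plots = []
        for whole in whole_order:
            # A fresh, independent randomisation for every whole plot
            sub_order = rng.sample(sub_levels, k=len(sub_levels))
            plots.append((whole, sub_order))
        layout.append((block, plots))
    return layout


layout = split_plot_layout(["M1", "M2", "M3"], ["V1", "V2", "V3", "V4"],
                           n_blocks=3, seed=42)
for block, plots in layout:
    print(f"Block {block}")
    for whole, subs in plots:
        print(f"  {whole}: {' '.join(subs)}")
```

Every whole plot contains each variety exactly once, and every block contains each mechanisation level exactly once; only the order is random.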

6.3 The model

Suppose we wish to investigate the joint effect of two factors, A and B, on the yield of maize. Let r be the number of blocks, a the number of levels of A (whole-plots per block), and b the number of levels of B (sub-plots per whole-plot). Thus, we have ab treatment combinations replicated r times, giving abr experimental units in total. Let $y_{ijk}$ be the observation associated with the ith block, the jth level of factor A, and the kth level of factor B. The observation $y_{ijk}$ is expressed in mathematical form as

$$y_{ijk} = \mu + \rho_i + \alpha_j + \delta_{ij} + \beta_k + (\alpha\beta)_{jk} + \varepsilon_{ijk}$$

i = 1, 2, . . ., r; j = 1, 2, . . ., a; k = 1, 2, . . ., b

where
$\mu$ = overall mean
$\rho_i$ = ith block effect
$\alpha_j$ = jth factor A effect
$\delta_{ij}$ = ijth random effect associated with the whole-plot factor
$\beta_k$ = kth factor B effect
$(\alpha\beta)_{jk}$ = jkth interaction effect
$\varepsilon_{ijk}$ = random effect associated with the sub-plot factor

The effects $\delta_{ij}$ and $\varepsilon_{ijk}$ are assumed to be normally and independently distributed about zero means, with $\sigma_\delta^2$ as the common variance of the $\delta$'s, the whole-plot random components, and with $\sigma_\varepsilon^2$ as the common variance of the $\varepsilon$'s, the sub-plot random components.

The form of the analysis of variance for a two-factor split-plot experiment in a randomised complete block design is presented below.

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Squares   F-Calculated
Block                 r-1
Factor A              a-1
Error (a)             (r-1)(a-1)
Factor B              b-1
A*B                   (a-1)(b-1)
Error (b)             a(r-1)(b-1)
Total                 abr - 1

Error (a) is composed of the interaction between the whole-plot factor and the blocks (i.e. Error (a) = A*Block). As was mentioned earlier, the factor A by block interaction is assumed not to exist, so this term estimates the whole-plot error. Thus, error (a) is used to test the equality of the level means of factor A. Error (b) is composed of the factor B by block and the factor A by factor B by block interactions (Error (b) = B*Block + A*B*Block). The effects of factor B and those of the interaction between factors A and B are tested using error (b).
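The degrees of freedom in the ANOVA outline above can be tabulated for any r, a and b. The sketch below uses a hypothetical helper, `split_plot_df`, and illustrative values r = 4, a = 4 and b = 2; it also checks that the component degrees of freedom add up to the total.

```python
def split_plot_df(r, a, b):
    """Degrees of freedom for a two-factor split-plot in randomised blocks."""
    df = {
        "Block": r - 1,
        "Factor A": a - 1,
        "Error (a)": (r - 1) * (a - 1),
        "Factor B": b - 1,
        "A*B": (a - 1) * (b - 1),
        "Error (b)": a * (r - 1) * (b - 1),
    }
    total = a * b * r - 1
    # The six components must partition the total degrees of freedom
    assert sum(df.values()) == total
    df["Total"] = total
    return df


df = split_plot_df(r=4, a=4, b=2)
for source, value in df.items():
    print(f"{source:10s} {value:3d}")
```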
6.4 The analysis

The analysis of variance is illustrated through the following example. Four strains of perennial ryegrass were grown as swards at each of two fertiliser levels. The 4 strains were S23, New Zealand, Kent and X. The fertiliser levels were denoted by H, heavy, and A, average. The experiment was laid out as four blocks of four whole plots for the strains, each whole plot split in two for the application of fertiliser. The midsummer dry matter yields, in units of 10 lb/acre, were as follows:

                         Block
Strain        Manure   1     2     3     4
S23           H        299   318   284   279
              A        247   202   171   183
New Zealand   H        315   247   289   307
              A        257   175   188   174
X             H        403   439   355   324
              A        222   170   192   176
Kent          H        382   353   383   310
              A        233   216   200   143


The whole-plot factor A is Strain and the sub-plot factor B is Manure (fertiliser). With respect to our example, r = 4, a = 4, and b = 2.
Computation of whole-plot analysis

This requires setting up a two-way table of block by factor A (whole-plot) totals. Thus

                      Blocks
Strains        1      2      3      4      Strain Totals
S23            546    520    455    462    1983
New Zealand    572    422    477    481    1952
X              625    609    547    500    2281
Kent           615    569    583    453    2220
Block Totals   2358   2120   2062   1896   8436

Correction factor (C.F.)

$$\text{C.F.} = \frac{\left(\sum_{i,j,k} y_{ijk}\right)^2}{rab} = \frac{(8436)^2}{32} = 2223940.5$$

Sum of squares for the whole-plots

$$\text{SS(Whole-plot)} = \frac{\sum_{i,j} y_{ij.}^2}{b} - \text{C.F.} = \frac{1}{2}(546^2 + 520^2 + \dots + 583^2 + 453^2) - \text{C.F.} = \frac{1}{2}(4510942) - 2223940.5 = 31530.5$$

Sum of squares due to blocks

$$\text{SSBlk} = \frac{\sum_{i} y_{i..}^2}{ab} - \text{C.F.} = \frac{1}{8}(2358^2 + 2120^2 + 2062^2 + 1896^2) - \text{C.F.} = \frac{1}{8}(17901224) - 2223940.5 = 13712.5$$

Sum of squares due to strains

$$\text{SS(Strains)} = \text{SS(A)} = \frac{\sum_{j} y_{.j.}^2}{rb} - \text{C.F.} = \frac{1}{8}(1983^2 + 1952^2 + 2281^2 + 2220^2) - \text{C.F.} = \frac{1}{8}(17873954) - 2223940.5 = 10303.7$$

Sum of squares for whole-plot error

SSE(a) = SS(Whole-plot) - SSBlk - SS(A) = 31530.5 - 13712.5 - 10303.7 = 7514.3

Computation of sub-plot analysis

This section requires a two-way table of factor A by factor B totals.

                      Manure (Factor B)
Strains (Factor A)    H       A       Strain Totals
S23                   1180    803     1983
New Zealand           1158    794     1952
X                     1521    760     2281
Kent                  1428    792     2220
Manure Totals         5287    3149    8436

Sum of squares due to factor B

$$\text{SS(B)} = \frac{\sum_{k} y_{..k}^2}{ra} - \text{C.F.} = \frac{1}{16}(5287^2 + 3149^2) - \text{C.F.} = \frac{1}{16}(37868570) - 2223940.5 = 142845.1$$

Sum of squares due to factor A and B interaction

$$\text{SS(AB)} = \frac{\sum_{j,k} y_{.jk}^2}{r} - \text{C.F.} - \text{SS(A)} - \text{SS(B)} = \frac{1}{4}(1180^2 + 803^2 + \dots + 792^2) - \text{C.F.} - \text{SS(A)} - \text{SS(B)}$$

$$= \frac{1}{4}(9566098) - 2223940.5 - 10303.75 - 142845.13 = 14435.1$$

Sum of squares total

$$\text{SSTotal} = \sum_{i,j,k} y_{ijk}^2 - \text{C.F.} = 299^2 + 318^2 + \dots + 143^2 - 2223940.5 = 2420734.0 - 2223940.5 = 196793.5$$

Sum of squares for sub-plot error

SSE(b) = SSTotal - SS(Whole-plot) - SS(B) - SS(AB) = 196793.5 - 31530.5 - 142845.1 - 14435.1 = 7982.8

The above calculations are summarised in the ANOVA Table as follows:

The ANOVA Table

Source of variation   D.F.   SS         MS         F-Calculated   F-Table, 0.05
Block                 3      13712.5    4570.8     5.47
Strains               3      10303.7    3434.6     4.11           FT(3, 9) = 3.86
Error (a)             9      7514.3     834.9
Manure                1      142845.1   142845.1   214.73         FT(1, 12) = 4.75
Strain*Manure         3      14435.1    4811.7     7.23           FT(3, 12) = 3.49
Error (b)             12     7982.8     665.2
Total                 31     196793.5
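The hand calculations above can be reproduced directly from the raw yields. The following is a minimal sketch in plain Python; the variable names and the data layout (a nested dictionary of yields for blocks 1 to 4) are choices made for this illustration and are not part of the module. Small differences in the last digit from the table arise because the table rounds intermediate results.

```python
# yields[strain][manure] lists the yields for blocks 1 to 4
yields = {
    "S23":         {"H": [299, 318, 284, 279], "A": [247, 202, 171, 183]},
    "New Zealand": {"H": [315, 247, 289, 307], "A": [257, 175, 188, 174]},
    "X":           {"H": [403, 439, 355, 324], "A": [222, 170, 192, 176]},
    "Kent":        {"H": [382, 353, 383, 310], "A": [233, 216, 200, 143]},
}
strains, manures = list(yields), ["H", "A"]
r, a, b = 4, len(strains), len(manures)

obs = [yields[s][m][i] for s in strains for m in manures for i in range(r)]
cf = sum(obs) ** 2 / (r * a * b)                       # correction factor
ss_total = sum(y * y for y in obs) - cf

# Totals needed for each sum of squares
block_tot = [sum(yields[s][m][i] for s in strains for m in manures) for i in range(r)]
strain_tot = [sum(sum(yields[s][m]) for m in manures) for s in strains]
manure_tot = [sum(sum(yields[s][m]) for s in strains) for m in manures]
whole_tot = [sum(yields[s][m][i] for m in manures) for s in strains for i in range(r)]
cell_tot = [sum(yields[s][m]) for s in strains for m in manures]

# Whole-plot analysis
ss_whole = sum(t * t for t in whole_tot) / b - cf
ss_blk = sum(t * t for t in block_tot) / (a * b) - cf
ss_a = sum(t * t for t in strain_tot) / (r * b) - cf
ss_ea = ss_whole - ss_blk - ss_a

# Sub-plot analysis
ss_b = sum(t * t for t in manure_tot) / (r * a) - cf
ss_ab = sum(t * t for t in cell_tot) / r - cf - ss_a - ss_b
ss_eb = ss_total - ss_whole - ss_b - ss_ab

ms_ea = ss_ea / ((r - 1) * (a - 1))
ms_eb = ss_eb / (a * (r - 1) * (b - 1))
f_a = (ss_a / (a - 1)) / ms_ea                 # strains tested against error (a)
f_b = (ss_b / (b - 1)) / ms_eb                 # manure tested against error (b)
f_ab = (ss_ab / ((a - 1) * (b - 1))) / ms_eb   # interaction tested against error (b)

print(f"SSE(a) = {ss_ea:.2f}, SSE(b) = {ss_eb:.2f}")
print(f"F(strains) = {f_a:.2f}, F(manure) = {f_b:.2f}, F(strain*manure) = {f_ab:.2f}")
```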

Critical region:

Testing the four strains: Reject $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$ if F-Calculated > FT(3, 9) = 3.86 and conclude that the strains are significantly different at the 5 % significance level.

Testing the effect of the two types of manure: Reject $H_0: \beta_1 = \beta_2 = 0$ if F-Calculated > FT(1, 12) = 4.75 and conclude that the two types of manure are significantly different at the 5 % significance level.

Testing for the strain by manure interaction: Reject $H_0: (\alpha\beta)_{11} = (\alpha\beta)_{12} = \dots = (\alpha\beta)_{42} = 0$ if F-Calculated > FT(3, 12) = 3.49 and conclude that the interaction between the strains and manure types is significant at the 5 % significance level.

Conclusions:

We reject $H_0: \alpha_1 = \alpha_2 = \alpha_3 = \alpha_4 = 0$ since F-Calculated = 4.11 > FT(3, 9) = 3.86 and conclude that the strains are significantly different at the 5 % significance level. Based on the means, strain X shows the highest performance, followed by Kent.

We reject $H_0: \beta_1 = \beta_2 = 0$ since F-Calculated = 214.73 > FT(1, 12) = 4.75 and conclude that the two types of manure are significantly different at the 5 % significance level. Based on the means, type H has a higher effect than type A.

We reject $H_0: (\alpha\beta)_{11} = (\alpha\beta)_{12} = \dots = (\alpha\beta)_{42} = 0$ since F-Calculated = 7.23 > FT(3, 12) = 3.49 and conclude that the interaction between the strains and manure types is significant at the 5 % significance level. The following is the graphical presentation of the interaction. In the graph, the heavy level H (plotted as ManureA) consistently performed better than the average level A (plotted as ManureB), and the average level appears to have a nearly constant effect across the strains. It is hard to identify the source of the interaction from the graph.

[Figure: Strain by manure interaction. Mean yield (vertical axis, 0 to 400) plotted against strain (S23, NZ, X, Kent), with one line for each manure level (ManureA, ManureB).]
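The interaction can also be inspected numerically by tabulating the cell means and the difference between the two manure levels for each strain. The sketch below uses the strain by manure totals from the two-way table (each total is the sum over r = 4 blocks); the variable names are assumptions made for the illustration.

```python
r = 4  # number of blocks contributing to each cell total

# (H total, A total) for each strain, from the two-way table of totals
cell_totals = {
    "S23": (1180, 803),
    "New Zealand": (1158, 794),
    "X": (1521, 760),
    "Kent": (1428, 792),
}

means = {s: (h / r, a / r) for s, (h, a) in cell_totals.items()}
differences = {s: h - a for s, (h, a) in means.items()}

print(f"{'Strain':12s} {'H mean':>8s} {'A mean':>8s} {'H - A':>8s}")
for s in cell_totals:
    h, a = means[s]
    print(f"{s:12s} {h:8.2f} {a:8.2f} {differences[s]:8.2f}")
```

If there were no interaction, the H minus A difference would be roughly the same for every strain; unequal differences are what the interaction sum of squares measures.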

Exercises 6.1
6.1 A researcher is interested in the effects of moisture and nitrogen on the growth of wheat plants. In the experiment, a particular variety of wheat is planted in 10 tubs of soil in the greenhouse. Each tub is divided into 3 parts, and different levels of nitrogen (0, 10, 20) are applied randomly, one to each part. Five of the tubs are selected randomly and given high moisture while the other 5 are given normal moisture.
a) Identify both the whole plot and subplot experimental units. Explain.
b) Make a sketch of the field layout and explain the randomisation process.
c) Give an outline of the ANOVA Table (source of variation and degrees of freedom only).


6.2 An experiment was conducted using a split-plot design. The experiment consisted of 3 pairs of identical steers, each pair used as a block, 2 rations (A, B) as whole-plot treatments, and 2 cooking methods (1, 2) as sub-plot treatments. Within each pair of steers, one is assigned at random to ration A and one to ration B. After slaughter, two identical roasts are obtained from each steer and randomly assigned to the two cooking methods. The recorded data are weight losses due to cooking. (Assume methods and rations to be fixed effects.)

                     Block
Method   Ration   Pair 1   Pair 2   Pair 3
1        A        11.0     17.0     11.0
2        A        2.5      9.0      6.5
1        B        5.0      8.0      8.0
2        B        3.5      4.0      4.5

a) Write down a mathematical model, stating the necessary assumptions.
b) State the null hypotheses for testing methods, rations and their interaction.
c) Analyse the data and present your results in an ANOVA Table.
d) State the critical regions for testing the hypotheses stated in part (b).
e) Present a two-way table of treatment means.
f) Compute the standard errors for testing the mean differences for methods, rations, and method by ration interactions.


7. NESTED DESIGNS

7.1 Introduction

Consider an experiment involving two fertiliser levels and three varieties. Thus, we have 2 x 3 = 6 treatment combinations. In such a case, the factors are said to be crossed. This means that every level of every factor could be used in combination with every level of every other factor. The intersections of these factor levels are the subclasses or cells of the situation, wherein data arise. Absence of data from a cell does not imply non-existence of that cell, only that it has no data. The total number of cells in a crossed classification is the product of the numbers of levels of the various factors, noting that not all of them may have observations in them.

Nesting in the design structure occurs when we have sub-units within larger experimental units. Examples: pigs within pens; plants within pots; pies within an oven; farms within a region; technicians within a method; progeny within sires; insecticides within source, etc. In general, levels of B are nested within levels of A. Thus, we do not have an A*B interaction effect, but have an A effect and a B within A (denoted B(A)) effect. More often, in the treatment structure, levels of A are crossed with levels of factor B. The following example illustrates the concept of nested classification:
Example 7.1

Suppose that at a university a student survey is carried out to ascertain the reaction to instructors' use of a new computing facility. Suppose that all first years have to take English or Geology or Chemistry in their first semester. All three courses in the first semester are large and are divided into sections, each section with a different instructor, and not all sections have the same number of students. Each student provided his or her opinion, measured on a scale of 1-10, of his or her instructor's use of the computer. The investigator's interest is whether the instructors differ in their use of the computers. A schematic representation of this nested classification follows, where $(n_{ij})$ denotes the number of students in section j of course i (i = 1, 2, 3; j = 1, . . ., 4).

Course      Sections (number of students)
English     Sec.1 (28)   Sec.2 (27)   Sec.3 (30)
Geology     Sec.1 (31)   Sec.2 (29)
Chemistry   Sec.1 (27)   Sec.2 (32)   Sec.3 (29)   Sec.4 (30)

A measure of the effect due to section j alone would be meaningless: for j = 1, say, it would combine section 1 of the English course, of the Geology course and of the Chemistry course. These three sections, composed of different groups of students, have nothing in common other than that they are all numbered 1 within their respective courses. The number is only for identification purposes. Section 1 of English is in no way related to section 1 of Geology. These are not like the variety by fertiliser treatment combinations discussed earlier, where fertiliser 1 on variety 1 was the same as fertiliser 1 on variety 2 and on variety 3. The sections are not related in this way, and are identities within their own courses. They are considered as sections within courses. Thus, sections are nested within courses. Similarly, the students are nested within sections. An ANOVA outline would be:

Source of variation                      Degrees of freedom
Courses                                  2
Sections within Course                   2 + 1 + 3 = 6
Students within Sections within Course   By subtraction (254)
Total                                    262
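The degrees of freedom in this outline follow directly from the section sizes in Example 7.1. A minimal sketch of the bookkeeping, with variable names chosen only for this illustration:

```python
# Number of students in each section of each course (Example 7.1)
sections = {
    "English": [28, 27, 30],
    "Geology": [31, 29],
    "Chemistry": [27, 32, 29, 30],
}

n_students = sum(sum(sizes) for sizes in sections.values())
n_sections = sum(len(sizes) for sizes in sections.values())

df_courses = len(sections) - 1
# Within each course, (number of sections - 1) degrees of freedom
df_sections_in_course = sum(len(sizes) - 1 for sizes in sections.values())
# Within each section, (number of students - 1) degrees of freedom
df_students_in_section = n_students - n_sections
df_total = n_students - 1

print(df_courses, df_sections_in_course, df_students_in_section, df_total)
```

The three components add up to the total, which is why the last component can be obtained by subtraction.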

The main use of the design is in assessing the degree of variation due to each component. An interesting question would be: is the variation greatest between plants within pots or between pots within treatments? Nested designs have the characteristic that interaction does not occur, but nesting does. For instance, when we say A is nested in B, we cannot then say A interacts with B. Nesting is often denoted by A(B), A:B or A/B, meaning A is nested in B, and the degrees of freedom are expressed as b(a-1), where a is the number of levels of A and b is the number of levels of B. We say levels of one factor are nested within, or are subsamples of, levels of another factor. Such experiments are also sometimes called hierarchical experiments. For instance, in an on-farm experiment you may have farm types, farms nested within types and replications nested within farms.

[Figure: hierarchy diagram. Farm types at the top level; farms 1, 2, 3 nested within each farm type; replications 1, 2, 3 nested within each farm.]

In general there is no limit to the degree of nesting that can be handled. The extent of its use depends entirely on the data and the environment from which they came.
Example 7.2

Consider an experiment involving the product of three manufacturing plants in each of two areas, A and B, and of two plants in area C. The observations on the quality of a product made in the eight manufacturing plants in the three areas are presented below.

Area           A                       B                    C
Plants         I    II     III         I      II     III    I    II
Observations   6    6, 8   6, 7, 8     5, 7   6, 7   6      7    7, 9

Two-way table of totals

                     Area
Plants        A     B     C     Plants Totals
I             6     12    7     25
II            14    13    16    43
III           21    6     0     27
Area Totals   41    31    23    95

Grand total: $\sum_{i,j,k} y_{ijk} = 95$

Total observations = 14

Raw sum of squares: $\sum_{i,j,k} y_{ijk}^2 = 6^2 + 6^2 + \dots + 9^2 = 659$

Correction factor: $\text{C.F.} = \frac{(95)^2}{14} = 644.64$

Total sum of squares

$$\text{SSTotal} = \sum_{i,j,k} y_{ijk}^2 - \text{C.F.} = 659 - 644.64 = 14.36$$

with 13 degrees of freedom.

Area sum of squares

$$\text{SSArea} = \frac{41^2}{6} + \frac{31^2}{5} + \frac{23^2}{3} - \text{C.F.} = 648.7 - 644.64 = 4.06$$

with (3 areas - 1 = 2) degrees of freedom.

Plants sum of squares ignoring areas

$$\text{SS plants} = 6^2 + \frac{(6+8)^2}{2} + \dots + \frac{(7+9)^2}{2} - \text{C.F.} = 5.86$$

with (8 plants - 1 = 7) degrees of freedom.

Plants within area sum of squares

SSPlants(Area) = SS plants (ignoring areas) - SSArea = 5.86 - 4.06 = 1.80 with (7 - 2 = 5) degrees of freedom.

Error sum of squares

SSE = SSTotal - SS plants (ignoring areas) = 14.36 - 5.86 = 8.50 with 13 - 7 = 6 degrees of freedom.


The ANOVA table

Source of Variation         Degrees of Freedom   Sum of Squares   Mean Squares   F-calc   F-table, 0.05
Area                        2                    4.06             2.03           5.639    5.79
Plants within areas         5                    1.80             0.36
Observation within plants   6                    8.50             1.42
Total                       13                   14.36

Often, nested designs are meant to provide information about variability, in which case it makes little sense to compute an F value. If the areas are regarded as fixed, however, the equality of their means can be tested using an F-test. Estimation of an experimental error is only possible if the replications are independent. In this case, plants within areas are independent but observations within plants are not. Therefore, we estimate the experimental error using plants within areas. The F-value for testing the equality of the areas is obtained as

$$F = \frac{\text{MSArea}}{\text{MSP(A)}} = \frac{2.03}{0.36} = 5.639$$

which is compared against F-table = 5.79, obtained using 2 df for the numerator and 5 df for the denominator at the 5 % significance level. We fail to reject $H_0: \mu_1 = \mu_2 = \mu_3$ since F-calc = 5.639 is not greater than F-table = 5.79 at the 5 % significance level.

Suppose we assume the observations within plants to be normally distributed with zero mean and constant variance $\sigma^2$, and the plants within areas to be normally distributed with zero mean and variance $\sigma_p^2$. Some techniques, which are beyond the scope of this manual, are available for estimating these variance components. The following estimates are obtained through such techniques. The observations within plants variance component is estimated as

$$\hat{\sigma}^2 = \text{MSE} = 1.42$$

Similarly, an estimate of the plants within areas variance component is approximated as

$$\hat{\sigma}_p^2 = \frac{\text{MSP(A)} - \text{MSE}}{1.64} = \frac{0.36 - 1.42}{1.64} = -0.64$$

Since a variance can never be negative, we consider the estimate not to be significantly different from zero. Thus $\hat{\sigma}_p^2 \approx 0$, indicating no variation between plants within areas. The size of the estimates suggests that essentially all of the variance is due to observations within plants.
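The nested analysis of Example 7.2 can be reproduced from the raw observations. The sketch below is illustrative: the data layout and variable names are assumptions, and the divisor 1.64 is recomputed with the usual coefficient for unbalanced nesting, (N - sum over areas of (sum of squared plant sizes / area size)) / df of plants within areas, which the text does not state and which should be read as an assumption of this sketch. Working from unrounded mean squares gives F close to 5.635 rather than the 5.639 obtained from the rounded values.

```python
# Observations by area and plant (Example 7.2)
data = {
    "A": {"I": [6], "II": [6, 8], "III": [6, 7, 8]},
    "B": {"I": [5, 7], "II": [6, 7], "III": [6]},
    "C": {"I": [7], "II": [7, 9]},
}

obs = [y for plants in data.values() for ys in plants.values() for y in ys]
n = len(obs)
cf = sum(obs) ** 2 / n                                   # correction factor
ss_total = sum(y * y for y in obs) - cf

area_size = {area: sum(len(ys) for ys in plants.values()) for area, plants in data.items()}
area_total = {area: sum(sum(ys) for ys in plants.values()) for area, plants in data.items()}

ss_area = sum(area_total[k] ** 2 / area_size[k] for k in data) - cf
ss_plants = sum(sum(ys) ** 2 / len(ys)
                for plants in data.values() for ys in plants.values()) - cf
ss_plants_in_area = ss_plants - ss_area
ss_error = ss_total - ss_plants

n_areas = len(data)
n_plants = sum(len(plants) for plants in data.values())
df_area, df_pa, df_err = n_areas - 1, n_plants - n_areas, n - n_plants

ms_area = ss_area / df_area
ms_pa = ss_plants_in_area / df_pa
ms_err = ss_error / df_err
f_area = ms_area / ms_pa          # areas tested against plants within areas

# Coefficient of the plants-within-areas variance component (assumed formula)
k = (n - sum(sum(len(ys) ** 2 for ys in plants.values()) / area_size[area]
             for area, plants in data.items())) / df_pa

var_within_plants = ms_err
var_plants_in_area = (ms_pa - ms_err) / k   # may be negative; then treated as zero

print(f"F = {f_area:.3f}, k = {k:.2f}")
print(f"variance components: {var_within_plants:.2f}, {var_plants_in_area:.2f}")
```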


Exercise 7.1
7.1 An educator proposes a new teaching method and wishes to compare the achievement of students using his method with that of students using a traditional method. Twenty students are randomly placed into two groups with ten students per group. Tests are given to all 20 students at the beginning of a semester, at the end of the semester, and ten weeks after the end of the semester. The educator wishes to see whether there is a difference in the average achievement between the two methods at each of the three time periods.
a) Write a mathematical model for this situation.
b) Set up an ANOVA table and show the F tests that can be made.

7.2 In a study of the characteristics associated with guidance competence versus counselling competence, 144 students were divided into 9 groups of 16 each. These nine groups represented all combinations of three levels of guidance ranking (high, medium, low) and three levels of counselling ranking (high, medium, low). All subjects were then given nine subtests. Assume the rankings as two fixed factors, the subtests as fixed, and the subjects within the nine groups as random.
a) Present a schematic diagram for this information.
b) Give an outline of the ANOVA table with source of variation and degrees of freedom only.

7.3 Three days of sampling, where each sample was subjected to two types of size graders, gave the following results, coded by subtracting 4 percent moisture and multiplying by 10.

          Day 1       Day 2       Day 3
Sample    A     B     A     B     A     B
1         4     11    5     11    0     6
2         6     7     17    13    -1    -2
3         6     10    8     15    2     5
4         13    11    3     14    8     6
5         7     10    14    20    8     10
6         7     11    11    19    4     10
7         14    16    6     11    5     18
8         12    10    11    17    10    13
9         9     12    16    4     16    17
10        6     9     -1    9     8     15
11        8     13    3     14    7     11

Assume graders fixed, days random, and samples within days random.
a) State the necessary hypotheses.
b) Give an outline of the ANOVA table with source of variation and degrees of freedom only.
c) Complete the ANOVA table by working out the calculations.


8. NONPARAMETRIC STATISTICS

8.1 Introduction


Nonparametric methods are often applicable in situations where parametric methods are not. They require less restrictive assumptions concerning the data and the form of the probability distributions generating the data. The scale of measurement of the data partly determines whether to use parametric or nonparametric methods. Most parametric methods use interval or ratio-scaled data, for which means, medians, variances, standard deviations, interquartile ranges, etc., can be computed and interpreted. Parametric methods cannot be applied to nominal or ordinal-scaled data. Nonparametric methods are the only way nominal or ordinal-scaled data can be statistically analysed and sound conclusions drawn.

The form of the assumptions made about how the data are generated also determines whether to use a parametric or a nonparametric method. Many parametric methods require distributional assumptions. For instance, in the small-sample case, a normal distribution with a constant variance is required in order to apply the t-distribution. Nonparametric methods do not require assumptions about the population probability distribution, and can be used when one is not prepared to make distributional assumptions. This property has led to nonparametric methods being referred to as distribution-free methods. The sign test, the Wilcoxon signed-rank test, the Mann-Whitney-Wilcoxon test, the Kruskal-Wallis test, and Spearman rank correlation are the nonparametric methods discussed.
8.2 Sign test

This section is better introduced through an example. Consider a study of consumer preference for two brands of orange juice, where 12 people were given unmarked samples of the two brands. The brand each individual tasted was selected randomly. Each individual stated a preference for one of the two brands. The question of interest is to determine whether the preferences for the two products are equal.
Hypothesis

$H_0: P = 0.5$ (no difference in preference for one brand over the other exists)
$H_1: P \neq 0.5$ (a difference in preference for one brand over the other exists)

where P = population proportion of consumers favouring one brand. Suppose we denote a preference for brand A by "+" and a preference for brand B by "-". The data are recorded in the form of + and - signs, hence the name sign test. Under $H_0$, the numbers of + and - signs are expected to be equal. If we consider a + sign to denote success, then with n = 12 and P = 0.5 we have a binomial probability distribution. We can compute the probabilities of 0, 1, . . ., 12 plus signs among the 12 people, giving a symmetric binomial distribution. This sampling distribution is used to determine a rejection rule.


[Figure: Binomial Probabilities (P = 0.5, n = 12). Bar chart of probability (vertical axis, 0 to 0.25) against the number of + signs (horizontal axis, 0 to 12).]

The rejection rule is established as follows. Suppose $\alpha$ = 0.05. For a two-tailed test, we have 0.025 in one tail and 0.025 in the other. Thus, starting at the lower end of the distribution, 0.0002 + 0.0029 + 0.0161 = 0.0192 is the probability of obtaining 0, 1 or 2 + signs. Adding the probability of 3 would give 0.0729, which exceeds the set probability of 0.025 for the lower tail. So we stop at 2 + signs. At the upper tail, we get a probability of 0.0192 corresponding to 10, 11 or 12 + signs. The closest we get to 0.05 without exceeding it is 0.0192 + 0.0192 = 0.0384. Thus, the rejection rule is: reject $H_0$ if the number of + signs is less than 3 or greater than 9.

The binomial probability distribution can be used for n up to 20 (the small-sample case). For sample sizes n greater than 20, the large-sample normal approximation of the binomial probabilities can be used to determine the rejection rule for the sign test. The normal approximation of the sampling distribution of the number of + signs when no preference exists requires

Mean: $\mu = 0.5n$

Standard deviation: $\sigma = \sqrt{0.25n}$

Thus, with x denoting the observed number of + signs,

$$Z = \frac{x - \mu}{\sigma} = \frac{x - 0.5n}{\sqrt{0.25n}}$$
Example 8.1

The following data show the preferences indicated by 10 individuals in taste tests involving two brands of a product.
Individual Brand A versus Brand B

1 2 3 4 5 6 7 8 9 10

+ + + + + + +

We test for a significant difference in the preferences for the two brands at the 5% significance level. A + indicates a preference for brand A over brand B.
Hypothesis

Ho: $P = 0.5$
H1: $P \neq 0.5$

where P = population proportion of consumers favouring brand A. The binomial probabilities for P = 0.5 and n = 10 are:

Number of + signs    Binomial probability
0                    0.0010
1                    0.0098
2                    0.0439
3                    0.1172
4                    0.2051
5                    0.2461
6                    0.2051
7                    0.1172
8                    0.0439
9                    0.0098
10                   0.0010

Starting at the lower end of the distribution: 0.0010 + 0.0098 = 0.0108 for 0 and 1 + signs. If we include 2, we get 0.0547, which exceeds the 0.025 allowed for the lower tail. Thus, we stop at 1. Similarly, from the upper end of the distribution we get 0.0010 + 0.0098 = 0.0108 for 9 and 10 + signs. Therefore, we reject Ho if the number of + signs is less than 2 or greater than 8. We fail to reject Ho because we have 7 + signs. There is no evidence from these data that individuals' preferences differ significantly for the two brands at the 5% significance level.
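The rejection rules derived in this section can be checked numerically. The following is a minimal sketch in Python, not part of the original module; the function names `binomial_pmf` and `sign_test_cutoffs` are illustrative assumptions. It accumulates binomial probabilities from the lower tail until adding one more would exceed $\alpha/2$, and uses the symmetry of the distribution at P = 0.5 for the upper tail.

```python
from math import comb

def binomial_pmf(k, n, p=0.5):
    """Probability of exactly k '+' signs among n when P(+) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def sign_test_cutoffs(n, alpha=0.05):
    """Return (low, high, tail): reject Ho if the number of '+' signs
    is <= low or >= high; tail is the probability in each tail."""
    cumulative = 0.0
    low = -1  # -1 means that even 0 '+' signs would not lead to rejection
    for k in range(n + 1):
        if cumulative + binomial_pmf(k, n) > alpha / 2:
            break
        cumulative += binomial_pmf(k, n)
        low = k
    return low, n - low, cumulative  # the distribution is symmetric when p = 0.5

# Orange juice study (n = 12) and Example 8.1 (n = 10, 7 '+' signs)
low12, high12, tail12 = sign_test_cutoffs(12)
low10, high10, tail10 = sign_test_cutoffs(10)
plus_signs = 7
reject = plus_signs <= low10 or plus_signs >= high10
print(low12, high12, low10, high10, reject)
```

The exact tail probabilities differ slightly in the fourth decimal place from the sums of the rounded table values used in the text.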

8.3 Wilcoxon Signed-Rank Test

The Wilcoxon Signed-Rank Test is the nonparametric alternative to the parametric paired sample test. In the parametric case, the population of differences between pairs of observations is assumed normally distributed. The nonparametric Wilcoxon Signed-Rank Test can be used when the appropriateness of the assumption of normality is in question. The procedure is illustrated by the following example.
Example 8.2

A manufacturing firm is attempting to determine whether a difference between task-completion times exists for two production methods. A sample of 11 workers was selected and each worker completed a production task using both production methods. The production method that each worker used first was selected randomly. A positive difference in task-completion times indicates that method 1 required more time and a negative difference indicates that method 2 required more time.

Production task-completion times (minutes)

Worker   Method 1   Method 2   Difference   Absolute difference   Rank   Signed rank
1        10.2       9.5        0.7          0.7                   8      +8
2        9.6        9.8        -0.2         0.2                   2      -2
3        9.2        8.8        0.4          0.4                   3.5    +3.5
4        10.6       10.1       0.5          0.5                   5.5    +5.5
5        9.9        10.3       -0.4         0.4                   3.5    -3.5
6        10.2       9.3        0.9          0.9                   10     +10
7        10.6       10.5       0.1          0.1                   1      +1
8        10.0       10.0       0.0          0.0
9        11.2       10.6       0.6          0.6                   7      +7
10       10.7       10.2       0.5          0.5                   5.5    +5.5
11       10.6       9.8        0.8          0.8                   9      +9
Sum of signed ranks                                                      +44

Hypothesis

Ho: The populations are identical
H1: The populations are not identical

The first step is to rank the absolute differences between the two methods, from lowest to highest, where any differences of zero are discarded. Tied differences are assigned average rank values. The ranks are then given the sign of the original difference in the data. Finally, the sum of the signed ranks is obtained. For our example, we have +44. If the populations representing task-completion times for each of the two methods are identical, we would expect the positive ranks and the negative ranks to cancel out. Thus, we wish to test if the sum of signed ranks is significantly different from zero. Let T denote the sum of the signed-rank values in a Wilcoxon signed-rank test. The distribution of T is approximately normal when the number of pairs of data is 10 or more and the populations are identical. Thus, the sampling distribution of T for identical populations has mean $\mu_T = 0$ and standard deviation

$\sigma_T = \sqrt{\dfrac{n(n+1)(2n+1)}{6}}$

Referring to the above example, with n = 10 pairs remaining after the zero difference is discarded,

$\sigma_T = \sqrt{\dfrac{10(11)(21)}{6}} = 19.62$

$Z = \dfrac{T - \mu_T}{\sigma_T} = \dfrac{44 - 0}{19.62} = 2.24$

Conclusion: Reject Ho since Zcal = 2.24 is greater than Ztable = 1.96 at the 5 % significance level, and conclude that the two populations are not identical in terms of task-completion times. It is worth noting that the Wilcoxon signed-rank test does not enable us to conclude in what ways the populations differ.
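The ranking steps of Example 8.2 can be reproduced with a short script. This is a sketch under stated assumptions, not the module's own procedure: the helper `average_ranks` is a hypothetical name, and the differences are rounded to one decimal place so that floating-point noise does not break the ties at 0.4 and 0.5.

```python
from math import sqrt

method1 = [10.2, 9.6, 9.2, 10.6, 9.9, 10.2, 10.6, 10.0, 11.2, 10.7, 10.6]
method2 = [9.5, 9.8, 8.8, 10.1, 10.3, 9.3, 10.5, 10.0, 10.6, 10.2, 9.8]

def average_ranks(values):
    """Rank from smallest (1) to largest; tied values get their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

# Differences, rounded to the precision of the data (assumption: 1 decimal)
diffs = [round(a - b, 1) for a, b in zip(method1, method2)]
nonzero = [d for d in diffs if d != 0]            # zero differences are discarded
ranks = average_ranks([abs(d) for d in nonzero])  # rank the absolute differences

# Give each rank the sign of its difference and add them up
T = sum(r if d > 0 else -r for d, r in zip(nonzero, ranks))

n = len(nonzero)
sigma_T = sqrt(n * (n + 1) * (2 * n + 1) / 6)
z = (T - 0) / sigma_T
print(T, round(sigma_T, 2), round(z, 2))
```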

Exercise 8.1
8.1 A test was conducted of two overnight mail-delivery services. Two samples of identical deliveries were set up such that both delivery services were notified of the need for a delivery at the same time. The number of hours required to make the delivery is shown below for each service.

Delivery   Service 1   Service 2
1          24.5        28.0
2          26.0        25.5
3          28.0        32.0
4          21.0        20.0
5          18.0        19.5
6          36.0        28.0
7          25.0        29.0
8          21.0        22.0
9          24.0        23.5
10         26.0        29.5
11         31.0        30.0

Test at the 5% significance level whether the data suggest a difference in the delivery times for the two services.

8.4 Mann-Whitney-Wilcoxon Test

This is a nonparametric test used to determine whether there is a difference between two populations. Unlike the Wilcoxon signed-rank test, it is not based on paired samples; it uses two independent random samples, one from each population. The corresponding parametric test assumes normality and equality of variances. The Mann-Whitney-Wilcoxon (MWW) test requires neither of these assumptions. However, it does require that the measurement scale for the data generated by the two independent random samples be at least ordinal.
Small-Sample Case: Appropriate when both sample sizes are less than or equal to 10. The following steps are taken in carrying out the test.

Combine the data from both samples and rank them, with the smallest value ranked 1 and the largest value given the highest rank. Sum the ranks for each sample separately. Let T denote the sum of ranks for the sample used in the test, and let TL and TU denote the lower and upper critical values of T. Under Ho, the value of T is expected to be near the midpoint of the two critical values, (TL+TU)/2. Critical values of the MWW T statistic are tabulated for cases where both sample sizes are less than or equal to 10. Here n1 is the size of the sample whose rank sum is being used in the test, TL is read from the table, and TU = n1(n1+n2+1) - TL. Reject Ho if T is strictly less than TL or strictly greater than TU.
Large-Sample Case: Appropriate when both sample sizes are greater than or equal to 10. In this case, the sampling distribution of the MWW T statistic can be approximated by a normal distribution with

Mean: $\mu_T = \dfrac{1}{2} n_1 (n_1 + n_2 + 1)$

Standard deviation: $\sigma_T = \sqrt{\dfrac{1}{12} n_1 n_2 (n_1 + n_2 + 1)}$

General steps for the MWW T test.
1. Rank the combined sample observations from lowest to highest, with tied values being assigned the average of the tied rankings.
2. Compute T, the sum of the ranks for the first sample.

When we reject the hypothesis that the populations are identical using the MWW test, we cannot state how they differ. The populations could have different means, different variances, and/or different forms. The MWW test has the advantage that it does not require any probability distribution assumptions and can be used on ordinal data.
Example 8.3

Two fuel additives are being tested to determine their effect on fuel consumption. Seven cars were tested using additive 1 and another independent sample of nine cars was tested using additive 2. The data below show the kilometres per litre obtained using the additives. Use the MWW test to see if there is a significant difference in fuel consumption at the 5 % significance level.

Additive 1   Rank      Additive 2   Rank
17.3         2         18.7         8.5
18.4         6         17.8         4
19.1         10        21.3         15
16.7         1         21.0         14
18.2         5         22.1         16
18.6         7         18.7         8.5
17.5         3         19.8         11
                       20.7         13
                       20.2         12
Sum          34        Sum          102

The combined samples are ranked and the rank sum for each sample obtained. This is a small-sample test since n1 = 7 and n2 = 9. T = 34. With $\alpha = 0.05$, n1 = 7 and n2 = 9, TL = 41 and TU = 7(7+9+1) - 41 = 78.

Conclusion: Since T=34 < 41, we reject Ho and conclude that there is a significant difference in fuel consumption.
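As an illustration of the small-sample MWW steps, the sketch below ranks the combined fuel-additive data and applies the decision rule of Example 8.3. The helper `average_ranks` is a hypothetical name; the lower critical value TL = 41 is taken from the text (it comes from a table and is not computed here).

```python
additive1 = [17.3, 18.4, 19.1, 16.7, 18.2, 18.6, 17.5]
additive2 = [18.7, 17.8, 21.3, 21.0, 22.1, 18.7, 19.8, 20.7, 20.2]

def average_ranks(values):
    """Rank from smallest (1) to largest; tied values get their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

n1, n2 = len(additive1), len(additive2)
ranks = average_ranks(additive1 + additive2)  # rank the combined samples

T = sum(ranks[:n1])        # rank sum for additive 1 (the sample used in the test)
T_other = sum(ranks[n1:])  # rank sum for additive 2, shown only as a check

T_L = 41                          # tabulated lower critical value quoted in Example 8.3
T_U = n1 * (n1 + n2 + 1) - T_L    # upper critical value
reject = T < T_L or T > T_U
print(T, T_other, T_U, reject)
```

As a consistency check, the two rank sums should always add up to N(N+1)/2, where N = n1 + n2.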

8.5 Kruskal-Wallis Test

The Kruskal-Wallis test is an extension of the Mann-Whitney-Wilcoxon test to three or more populations. The hypotheses are stated as follows:

Ho: All k populations are identical
H1: Not all populations are identical

Recall that a parametric test, such as the one used for a completely randomised design, requires interval or ratio data. The Kruskal-Wallis test, which does not require the assumptions of normality and equal variance, can be used with ordinal data as well as with interval or ratio data. The Kruskal-Wallis test statistic, which is based on the sum of ranks for each of the samples, is computed as follows:

$W = \left[ \dfrac{12}{n_T (n_T + 1)} \sum_{i=1}^{k} \dfrac{R_i^2}{n_i} \right] - 3(n_T + 1)$

where
k = the number of populations
ni = the number of items in sample i
nT = total number of items in all samples
Ri = sum of the ranks for sample i.

Under Ho, the populations are identical and the sampling distribution of W can be approximated by a $\chi^2$ distribution with k-1 degrees of freedom. The approximation works well if each sample size is greater than or equal to 5. See Table C. The following example illustrates the computation procedure.

Example 8.4

Three products received the following performance ratings by a panel of 15 consumers. We wish to use the Kruskal-Wallis test to determine if there is a significant difference in the performance ratings for the products, at the 5% significance level.

A     Rank      B     Rank      C     Rank
50    4         80    11        60    7
62    8         95    14        45    2
75    10        98    15        30    1
48    3         87    12        58    6
65    9         90    13        57    5
Sum = 34        Sum = 65        Sum = 21

The first step is to rank all the 15 data values, with the lowest ranked 1 and the largest ranked 15. The average rank is assigned to tied data.

Sum of ranks: RA = 34, RB = 65, RC = 21
Sample sizes: nA = 5, nB = 5, nC = 5
Total number of items in all samples: nT = 15
k = 3, thus degrees of freedom = 2

$W = \dfrac{12}{15(16)} \left[ \dfrac{34^2}{5} + \dfrac{65^2}{5} + \dfrac{21^2}{5} \right] - 3(16) = 10.22$

$\chi^2_{(2,\ 0.05)} = 5.99$


Conclusion: Reject Ho and conclude the ratings for the products differ at 5% significance level. Note that the procedure would also have been applied directly to the original data if the data had been the ordinal rankings of the 15 consumers. The step of constructing the rank orderings from the performance evaluation ratings would have been omitted.
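The computation of W in Example 8.4 can be sketched as follows. The code is illustrative only: `average_ranks` is a hypothetical helper, and the critical value 5.99 is the table value quoted in the example rather than something the script derives.

```python
samples = {
    "A": [50, 62, 75, 48, 65],
    "B": [80, 95, 98, 87, 90],
    "C": [60, 45, 30, 58, 57],
}

def average_ranks(values):
    """Rank from smallest (1) to largest; tied values get their average rank."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]

# Rank all observations together, then look up each value's rank
pooled = [v for group in samples.values() for v in group]
rank_of = dict(zip(pooled, average_ranks(pooled)))  # tied values share one rank

n_T = len(pooled)
rank_sums = {name: sum(rank_of[v] for v in group) for name, group in samples.items()}

W = (12 / (n_T * (n_T + 1))
     * sum(rank_sums[name] ** 2 / len(group) for name, group in samples.items())
     - 3 * (n_T + 1))

chi2_critical = 5.99  # chi-square table value for 2 degrees of freedom, alpha = 0.05
print(rank_sums, round(W, 2), W > chi2_critical)
```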

8.6 Spearman Rank Correlation


Spearman's rank correlation is used to find a measure of association between two random variables when only ordinal data are available. The Spearman rank-correlation coefficient is computed using the following formula:

$r_s = 1 - \dfrac{6 \sum d_i^2}{n(n^2 - 1)}$

where
n = the number of items or individuals being ranked
xi = the rank of item i with respect to one variable
yi = the rank of item i with respect to a second variable
di = xi - yi
and 6 is a constant.

While r is a measure of linear correlation between X and Y, rs is a measure of an increasing or decreasing relationship. The value of rs ranges from -1 to 1. Positive values near 1 indicate a strong positive association between the rankings. That is, as one rank increases the other rank also increases. Similarly, negative values near -1 indicate a strong negative association in the ranks. For $n \geq 10$, the sampling distribution of rs has

Mean: $\mu_{r_s} = 0$

Standard deviation: $\sigma_{r_s} = \sqrt{\dfrac{1}{n-1}}$

and $Z = \dfrac{r_s - \mu_{r_s}}{\sigma_{r_s}}$ has a standard normal distribution with mean zero and unit variance.

Consider the following example to illustrate the computation procedures.

Example 8.5

At a wine tasting function, two judges were asked to independently rank the 10 wines on exhibit from most desirable (rank=1) to least desirable (rank=10). The preferences were as follows:

Judge A rank   Judge B rank   Difference di   di^2
6              5              1               1
2              2              0               0
8              7              1               1
10             9              1               1
7              6              1               1
3              3              0               0
1              1              0               0
4              8              -4              16
9              10             -1              1
5              4              1               1

So, $\sum d_i^2 = 22$ and n = 10, which gives

$r_s = 1 - \dfrac{6 \sum d_i^2}{n(n^2 - 1)} = 1 - \dfrac{6(22)}{10(10^2 - 1)} = 0.867$

Conclusion: The high value of rs = 0.867 suggests that the two judges' preferences coincide very closely.
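A minimal sketch of the Example 8.5 computation is given below. The final step, which evaluates the Z statistic from the sampling distribution of rs stated earlier in this section, is an extension that the example itself does not carry out; variable names are illustrative.

```python
from math import sqrt

judge_a = [6, 2, 8, 10, 7, 3, 1, 4, 9, 5]
judge_b = [5, 2, 7, 9, 6, 3, 1, 8, 10, 4]

n = len(judge_a)
# Sum of squared rank differences d_i = x_i - y_i
sum_d2 = sum((a - b) ** 2 for a, b in zip(judge_a, judge_b))

r_s = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Large-sample test of no association (n >= 10): mean 0, sd sqrt(1/(n-1))
sigma_rs = sqrt(1 / (n - 1))
z = (r_s - 0) / sigma_rs
print(sum_d2, round(r_s, 3), round(z, 2))
```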

9. REGRESSION ANALYSIS

9.1 Introduction


Regression analysis is a statistical procedure used to develop a mathematical equation showing how variables are related. The variable that is predicted using this mathematical equation is called the dependent variable, while the variable used to predict it is called the independent variable. Regression analysis involving only one independent and one dependent variable is called simple linear regression. Regression analysis involving two or more independent variables is called multiple regression analysis.

Consider the following examples of pairs of random variables where X is an independent variable and Y a dependent variable.

X                      Y
Advertising            Company turnover
Training               Labour productivity
Speed                  Fuel consumption
Hours worked           Machine output
Daily temperature      Electricity demand
Hours studied          Statistics results
Product X's price      Product X's sales level
Bond interest rate     Number of bond defaulters
Cost of living         Poverty

Several objectives exist for carrying out regression analysis, among them are to:

See if Xi affects Y. The objective is to investigate whether there is a change in Y when the level of X is changed, thus establishing a functional relationship between the two variables. In this case, X is assumed to be a continuous variable. A scatter plot would show whether a relationship exists between the two variables.
See how Xi affects Y. We would be interested in knowing by how much the value of Y changes per unit change in X.
Predict Y given Xi. The objective in this case is to provide a mathematical function that can be used to predict values of Y for a given X.

Consider for example, an experiment to estimate the mean weight gain per month for steers fed on a particular variety of feeds. The dependent variable, weight gain could be affected by many factors such as initial weight of the steer, amount of feed offered per day, protein content of the feed, water content of the feed, and so on.

9.2 Simple Linear Regression


Simple linear regression involves an independent variable denoted by X and a dependent variable denoted by Y. The Xs are selected levels of the treatment under investigation, and the response corresponding to each level is measured. In simple linear regression we want to explain the behaviour of the dependent variable Y in terms of X by establishing a linear function of the independent variable X. The procedure involves fitting a simple linear regression model to the data, estimating its parameters, and then assessing the suitability of the model. The first step should be to plot the raw data in order to have an indication of the relation between Y and X. If no such relationship is noticeable, then other reasons should be given for proceeding to fit the regression line. The simplest type of model relating a response variable y to a single independent variable x is given by the following equation of a straight line:

$y = \beta_0 + \beta_1 x + \varepsilon$

where,

$\beta_0$ is the intercept (value of y when x = 0)
$\beta_1$ is the slope of the straight line (change in y for a unit change in x)
$\varepsilon$ is a random error term.

Note that the random error term takes into account all unpredictable and unknown factors that are not included in the model. The interest is mainly in estimating the two unknown parameters $\beta_0$ and $\beta_1$, whose estimates are denoted by a and b, respectively. The statistics a and b are computed from the data using a technique called the least squares estimation procedure. The least squares method is a procedure used to find a straight line that provides the best approximation for the relationship between the independent and dependent variables. This line is called the estimated regression line or the estimated regression equation. The following equations have been shown, using calculus, to provide the minimum sum of squared deviations between the observed values of the dependent variable $y_i$ and the estimated values of the dependent variable $\hat{y}_i$:

$b = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} = \dfrac{S_{xy}}{S_{xx}}$ <By definition>

$b = \dfrac{\sum x_i y_i - (\sum x_i \sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}$ <Used in computation>

$a = \bar{y} - b\bar{x}$

Example 9.1a

A property analyst is examining the relationship between the City Council's valuation of residential property and the market value (selling price) of the properties. A random sample of eight recent property transactions was examined. The data are as follows:

City council valuation (R1 000)   Market value (R1 000)
x                                 y                       x^2      xy       y^2
12                                65                      144      780      4225
45                                220                     2025     9900     48400
32                                142                     1024     4544     20164
50                                310                     2500     15500    96100
28                                196                     784      5488     38416
56                                364                     3136     20384    132496
18                                116                     324      2088     13456
40                                260                     1600     10400    67600
Total: 281                        1673                    11537    69084    420857

The scatter diagram of the above data is presented below.

[Figure: City council values against market values. Horizontal axis: city council values (R1 000); vertical axis: market values (R1 000); fitted line y = 6.1912x - 8.3392.]

For the above example:

$\sum x_i = 281$, $\sum y_i = 1\,673$, $\sum x_i y_i = 69\,084$, $\sum x_i^2 = 11\,537$, $\sum y_i^2 = 420\,857$.

$b = \dfrac{\sum x_i y_i - (\sum x_i \sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n} = \dfrac{69084 - (281)(1673)/8}{11537 - (281)^2/8} = 6.1912$

$a = \bar{y} - b\bar{x} = 209.125 - (6.1912)(35.125) = -8.3392$

Thus, the estimated regression line is $\hat{y} = -8.3392 + 6.1912x$
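The least squares computation for Example 9.1a can be sketched in a few lines, using the computational form of the slope formula. This is an illustration rather than part of the original module; the function name `predict` is an assumption introduced here.

```python
x = [12, 45, 32, 50, 28, 56, 18, 40]          # city council valuation (R1 000)
y = [65, 220, 142, 310, 196, 364, 116, 260]   # market value (R1 000)

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# Computational forms of S_xy and S_xx
S_xy = sum_xy - sum_x * sum_y / n
S_xx = sum_x2 - sum_x ** 2 / n

b = S_xy / S_xx                  # slope estimate
a = sum_y / n - b * sum_x / n    # intercept estimate: y_bar - b * x_bar

def predict(x_new):
    """Estimated market value for a valuation inside the observed x range."""
    return a + b * x_new

print(round(b, 4), round(a, 4), round(predict(30), 2))
```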


The estimates of the intercept and slope, namely a and b, are unbiased estimators of the population parameters $\beta_0$ and $\beta_1$, respectively.
Caution: Extrapolation outside the range of x may lead to meaningless results. For instance, at x = 0, we get $\hat{y}$ = -8.3392. That is, at a zero city council valuation, the predicted market value is negative (-8.3392 in units of R1 000), which is meaningless.

The above regression line is meaningful only when x values fall within the observed interval $12 \leq x \leq 56$.
Note: A regression line obtained using the standardised values of X and Y passes through the origin, and thus has zero intercept. The correlation coefficient r between standardised X and Y equals the slope b obtained using the same standardised values.

Example 9.1b

A substance used in biological and medical research is shipped by airfreight to users in cartons of 1,000 ampules. The data below, involving 10 shipments, were collected on the number of times the carton was transferred from one aircraft to another over the shipment route (X) and the number of ampules found to be broken upon arrival (Y). Assume a linear regression model.

i:    1    2    3    4    5    6    7    8    9    10
Xi:   1    0    2    0    3    1    0    1    2    0
Yi:   16   9    17   12   22   13   8    15   19   11

[Figure: Scatter plot of Y against X.]

In the table below, $\bar{X}$ and $\bar{Y}$ are the sample means, $\hat{Y}_i$ is the fitted value and $e_i = Y_i - \hat{Y}_i$ is the residual.

i      Xi   Yi    Xi-Xbar   Yi-Ybar   (Xi-Xbar)(Yi-Ybar)   (Xi-Xbar)^2   (Yi-Ybar)^2   Yhat_i   e_i    e_i^2
1      1    16    0         1.8       0                    0             3.24          14.2     1.8    3.24
2      0    9     -1        -5.2      5.2                  1             27.04         10.2     -1.2   1.44
3      2    17    1         2.8       2.8                  1             7.84          18.2     -1.2   1.44
4      0    12    -1        -2.2      2.2                  1             4.84          10.2     1.8    3.24
5      3    22    2         7.8       15.6                 4             60.84         22.2     -0.2   0.04
6      1    13    0         -1.2      0                    0             1.44          14.2     -1.2   1.44
7      0    8     -1        -6.2      6.2                  1             38.44         10.2     -2.2   4.84
8      1    15    0         0.8       0                    0             0.64          14.2     0.8    0.64
9      2    19    1         4.8       4.8                  1             23.04         18.2     0.8    0.64
10     0    11    -1        -3.2      3.2                  1             10.24         10.2     0.8    0.64
Total  10   142                       40                   10            177.6                  0      17.6

Information required for computation: n = 10, $\sum X = 10$, $\sum Y = 142$, $S_{XY} = \sum (X - \bar{X})(Y - \bar{Y}) = 40$, $S_X = \sum (X - \bar{X})^2 = 10$, $S_Y = \sum (Y - \bar{Y})^2 = 177.6$

Computation

The estimate of the slope: $b = \dfrac{S_{XY}}{S_X} = \dfrac{40}{10} = 4$

The estimate of the intercept: $a = \bar{Y} - b\bar{X} = 14.2 - 4(1) = 10.2$

The estimated linear regression line is $\hat{Y}_i = a + bX_i = 10.2 + 4X_i$

$MSE = \dfrac{SSE}{n-2} = \dfrac{17.6}{8} = 2.2$

Regression analysis

Estimator   Coef   Std Error   t-value   P-value
a           10.2   0.663       15.38     <0.001
b           4      0.469       8.53      <0.001
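The entries of the regression table for Example 9.1b can be reproduced from the raw data, as sketched below. The standard error of the intercept uses the standard result $s_a^2 = MSE\,(1/n + \bar{X}^2/S_X)$, which is an assumption here in the sense that the example itself does not show how the standard errors were obtained.

```python
from math import sqrt

X = [1, 0, 2, 0, 3, 1, 0, 1, 2, 0]             # number of transfers
Y = [16, 9, 17, 12, 22, 13, 8, 15, 19, 11]     # broken ampules

n = len(X)
x_bar, y_bar = sum(X) / n, sum(Y) / n

# Definitional forms of the sums of squares and cross-products
S_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
S_xx = sum((x - x_bar) ** 2 for x in X)

b = S_xy / S_xx
a = y_bar - b * x_bar

residuals = [y - (a + b * x) for x, y in zip(X, Y)]
SSE = sum(e ** 2 for e in residuals)
MSE = SSE / (n - 2)

se_b = sqrt(MSE / S_xx)
se_a = sqrt(MSE * (1 / n + x_bar ** 2 / S_xx))
t_a, t_b = a / se_a, b / se_b
print(round(a, 1), round(b, 1), round(se_a, 3), round(se_b, 3),
      round(t_a, 2), round(t_b, 2))
```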

[Figure: Fitted regression line. Horizontal axis: X; vertical axis: Y.]

9.3 Model and assumptions

It is important to distinguish between a deterministic model and a probabilistic model when testing for significance in regression analysis. In a deterministic model, the relationship between X and Y is such that if the value of the independent variable is specified, the value of the dependent variable is determined exactly. A model is probabilistic if we are unable to guarantee a single value of Y for each value of X. Thus, mathematically,

Deterministic model: $y = \beta_0 + \beta_1 x$ <A model with no error>
Probabilistic model: $y = \beta_0 + \beta_1 x + \varepsilon$ <A model that allows for uncontrollable components>

The difference between the two models is in $\varepsilon$, which measures how far the actual y value is above or below the regression line. The following are the assumptions about $\varepsilon$, the error term in the regression model.

The error term $\varepsilon$ is a random variable with mean zero.
The variance of $\varepsilon$, denoted $\sigma^2$, is the same for all values of x.
The values of $\varepsilon$ are independent.
The error term $\varepsilon$ is a normally distributed random variable.

We are mostly concerned with assessing how well the fitted model explains the real-life situation. That is, the question of interest is: how close are the fitted values to the observed values? Thus, we aim at minimising the error term. The model stated above assumes a straight-line relationship, which is often not the case. A non-linear model may turn out to explain the data better than the straight line. The reliability of the final model depends on the validity of the underlying assumptions and the adequacy of the fitted model in explaining the variation in the data.

The coefficient of determination, denoted by $r^2$, is the ratio of the regression sum of squares to the total sum of squares and is often used as a measure of the goodness of fit of the estimated regression line. A higher $r^2$ value is associated with a better fit; however, it does not allow us to conclude whether a regression relationship is statistically significant, because the computation of $r^2$ does not take the sample size into consideration.

9.4 Partitioning the total sum of squares

The total sum of squares can be partitioned into the regression sum of squares and the residual sum of squares. That is:
Sum of squares about the mean = Sum of squares due to regression + Sum of squares due to residual.

Sum of squares about the sample mean: $\sum (y - \bar{y})^2$

Sum of squares due to regression (the portion of the overall distance that can be attributed to the independent variable x): $\sum (\hat{y} - \bar{y})^2$

Sum of squares due to residual (that portion of the distance between y and $\bar{y}$ that cannot be accounted for by the independent variable x): $\sum (y - \hat{y})^2$

In summary,

$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$

<Total variability in y-values> = <Variability explained by model> + <Unexplained variability>

The following computations, obtained using the information given in Example 9.1a, illustrate the point.

x     y     yhat      (y-yhat)   (y-ybar)   (ybar-yhat)   (y-yhat)^2   (yhat-ybar)^2   (y-ybar)^2
12    65    65.96     -0.96      -144.12    143.16        0.912        20496.16        20770.57
45    220   270.26    -50.26     10.88      -61.14        2526.550     3738.687        118.3744
32    142   189.78    -47.78     -67.12     19.34         2282.852     374.0665        4505.094
50    310   301.22    8.78       100.88     -92.10        77.074       8482.557        10176.77
28    196   165.01    30.99      -13.12     44.11         960.107      1945.304        172.1344
56    364   338.37    25.63      154.88     -129.25       656.999      16705.05        23987.81
18    116   103.10    12.90      -93.12     106.02        166.348      11239.73        8671.334
40    260   239.31    20.69      50.88      -30.19        428.126      911.3636        2588.774
Total                                                     7098.970     63892.92        70990.88

where

$\sum (y - \bar{y})^2 = 70990.88$
$\sum (y - \hat{y})^2 = 7098.97$
$\sum (\hat{y} - \bar{y})^2 = 63892.92$

The results agree, except for the rounding errors.
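The partition can be verified directly from the Example 9.1a data. The sketch below is illustrative only. Because it uses unrounded fitted values, the regression sum of squares comes out slightly different from the tabulated figure, which was built from rounded fitted values, and the identity SST = SSR + SSE holds up to floating-point precision.

```python
x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Least squares estimates
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
S_xx = sum((xi - x_bar) ** 2 for xi in x)
b = S_xy / S_xx
a = y_bar - b * x_bar

fitted = [a + b * xi for xi in x]

SST = sum((yi - y_bar) ** 2 for yi in y)                 # total
SSR = sum((fi - y_bar) ** 2 for fi in fitted)            # explained by the model
SSE = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # unexplained

r_squared = SSR / SST   # coefficient of determination
print(round(SST, 2), round(SSR, 2), round(SSE, 2), round(r_squared, 4))
```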

9.5 An estimate of the variance of residual term

The variance of $\varepsilon$, denoted by $\sigma^2$, is estimated using the sum of squares due to residual, SSE:

$SSE = \sum (y - \hat{y})^2 = S_{yy} - bS_{xy}$

The degrees of freedom indicate how many independent pieces of information, out of the n independent values, are used to compute the sum of squares. SSE is associated with n-2 degrees of freedom because two parameters ($\beta_0$ and $\beta_1$) have to be estimated. A mean square is a number computed by dividing a sum of squares by its degrees of freedom. It has been shown that the mean square error, MSE or $s^2$, provides an estimate of $\sigma^2$. Thus,

$MSE = \dfrac{SSE}{n-2}$

From the above example,

$MSE = \dfrac{SSE}{n-2} = \dfrac{7098.97}{8-2} = 1183.16$

9.6 Inference about the $\beta_0$ and $\beta_1$ parameters

The main interest would be to test if the slope is significantly different from zero, indicating a change in y per unit change in x. An appropriate hypothesis is

Ho: $\beta_1 = 0$
Ha: $\beta_1 \neq 0$

The above hypothesis can be tested using a t-test, an F-test or a confidence interval. We need to obtain b, the estimate of $\beta_1$, and the associated variance in order to conduct the appropriate test. The sampling distribution of the estimate b is normal with mean $\beta_1$ and variance $\sigma_b^2$, where

$\sigma_b^2 = \dfrac{\sigma^2}{S_{xx}}$, with $S_{xx} = \sum x_i^2 - (\sum x_i)^2/n$

Since $\sigma_b^2$ is hardly ever known, it is estimated by $s_b^2$, where $s^2$ replaces $\sigma^2$ in the above equation. Thus,

$s_b^2 = \dfrac{s^2}{S_{xx}}$

The test statistic is

$t_{calc} = \dfrac{b - \beta_1}{s_b}$

which follows a t distribution with n-2 degrees of freedom. The decision rule is to reject Ho if the absolute value of tcalc, denoted by |tcalc|, is greater than $t_{\alpha/2}$. For the above example, b = 6.1912 and the standard error of b, denoted s.e.(b), is $\sqrt{s_b^2} = s_b$.

$s = \sqrt{1183.16} = 34.4$

$S_{xx} = \sum x_i^2 - (\sum x_i)^2/n = 11537 - (281)^2/8 = 1666.875$

$s.e.(b) = \sqrt{\dfrac{1183.16}{1666.875}} = 0.8425$

Thus,

$t_{calc} = \dfrac{6.1912 - 0}{0.8425} = 7.349$

From the t table, the value of t corresponding to 6 degrees of freedom and $\alpha = 0.05$ for a two-tailed test is t0.025 = 2.447.
Conclusion: We reject Ho: $\beta_1 = 0$ since |tcalc| is greater than t0.025 = 2.447 and conclude that the slope is significantly different from zero at the 5 % significance level.
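The t-test for the slope can be traced step by step from the Example 9.1a data, as in the sketch below. The critical value 2.447 is the t-table value quoted in the text; everything else is computed from the data, so small differences from the rounded figures in the text are to be expected.

```python
from math import sqrt

x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
S_xx = sum((xi - x_bar) ** 2 for xi in x)
b = S_xy / S_xx
a = y_bar - b * x_bar

# Residual mean square: the estimate s^2 of the error variance
SSE = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
MSE = SSE / (n - 2)

se_b = sqrt(MSE / S_xx)      # standard error of the slope
t_calc = (b - 0) / se_b      # test statistic under Ho: beta_1 = 0

t_critical = 2.447           # t table, 6 degrees of freedom, two-tailed alpha = 0.05
reject = abs(t_calc) > t_critical
print(round(MSE, 2), round(se_b, 4), round(t_calc, 3), reject)
```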

An F-test also exists for testing the above hypothesis. The t-test and F-test give the same results for a regression model with only one independent variable. This is due to the relation between the two distributions for one independent variable (the $F = t^2$ relationship). The following computations are necessary in order to test the above hypothesis concerning the slope parameter.

Sum of squares due to regression, denoted by $SSR = \sum (\hat{y} - \bar{y})^2$, associated with 1 degree of freedom (number of parameters - 1).

Sum of squares due to residual, denoted by $SSE = \sum (y - \hat{y})^2$, associated with n-p degrees of freedom (n is the sample size and p is the number of regression parameters).

Sum of squares due to total, denoted by $SST = \sum (y - \bar{y})^2$, associated with n-1 degrees of freedom.

The following are the corresponding mean squares:

$MSR = \dfrac{SSR}{1}$ and $MSE = \dfrac{SSE}{n-p}$

Under Ho: $\beta_1 = 0$, MSR and MSE are two independent estimates of $\sigma^2$. The ratio of MSR to MSE is known to have a sampling distribution that is F with 1 and n-p degrees of freedom. (In this case p = 2.)

For the above example, we get MSR = 63892.92 and MSE = 1183.16, which implies that

$F_{calc} = \dfrac{MSR}{MSE} = \dfrac{63892.92}{1183.16} = 54.0$

Note: $F = t^2$ (i.e. $54.0 = 7.349^2$)

From the F-distribution table we get F(1,6; 0.05) = 5.99. We reject Ho: $\beta_1 = 0$ at the 5 % significance level since Fcalc > F(1,6; 0.05) = 5.99 and conclude that there is a statistically significant relationship between x and y.
Caution: Rejection of the null hypothesis does not imply that the relationship between the x and y is linear. A proper way to phrase the statement is that, a linear relationship explains a significant amount of the variability in y over the range of x values observed in the sample.
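The equivalence of the F-test and the t-test for a single independent variable can be checked numerically. The following sketch, offered as an illustration only, computes both statistics from the Example 9.1a data; the critical value 5.99 is the F-table value quoted in the text.

```python
from math import sqrt

x = [12, 45, 32, 50, 28, 56, 18, 40]
y = [65, 220, 142, 310, 196, 364, 116, 260]

n, p = len(x), 2             # p = number of regression parameters
x_bar, y_bar = sum(x) / n, sum(y) / n
S_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
S_xx = sum((xi - x_bar) ** 2 for xi in x)
b = S_xy / S_xx
a = y_bar - b * x_bar
fitted = [a + b * xi for xi in x]

SSR = sum((fi - y_bar) ** 2 for fi in fitted)
SSE = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

MSR = SSR / 1                # 1 degree of freedom
MSE = SSE / (n - p)          # n - p degrees of freedom
F_calc = MSR / MSE

t_calc = b / sqrt(MSE / S_xx)   # t statistic for Ho: beta_1 = 0

F_critical = 5.99            # F table, (1, 6) degrees of freedom, alpha = 0.05
print(round(F_calc, 1), round(t_calc ** 2, 1), F_calc > F_critical)
```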

Confidence Interval for $\beta_1$

A confidence interval provides an alternative to testing the hypothesis Ho: $\beta_1 = 0$ against Ha: $\beta_1 \neq 0$. The following is a 95 % confidence interval for $\beta_1$:

$b \pm t_{0.025}\, s.e.(b)$

In reference to the above example, a 95 % confidence interval for $\beta_1$ is

$6.1912 \pm 2.447(0.8425)$

Thus, (4.1296, 8.2528) is a 95% confidence interval for $\beta_1$. We reject Ho because the interval does not contain zero. Similarly, the variance of the intercept estimate is given by the following formula:

$s_a^2 = \dfrac{MSE \sum x^2}{n S_{xx}} = MSE \left( \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right)$
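A short sketch of the interval computation follows, using the rounded values quoted in the text (b, s.e.(b), the t-table value, MSE and the sums from Example 9.1a). It also checks that the two standard algebraic forms of the intercept variance, $MSE \sum x^2/(nS_{xx})$ and $MSE(1/n + \bar{x}^2/S_{xx})$, give the same number; variable names are illustrative.

```python
# Values quoted in the text for Example 9.1a
b, se_b, t_critical = 6.1912, 0.8425, 2.447

# 95 % confidence interval for the slope
ci_low = b - t_critical * se_b
ci_high = b + t_critical * se_b
contains_zero = ci_low <= 0 <= ci_high   # if False, Ho: beta_1 = 0 is rejected

# Variance of the intercept estimate, written in two equivalent forms
MSE, n, sum_x, sum_x2 = 1183.16, 8, 281, 11537
S_xx = sum_x2 - sum_x ** 2 / n
x_bar = sum_x / n
var_a_form1 = MSE * sum_x2 / (n * S_xx)
var_a_form2 = MSE * (1 / n + x_bar ** 2 / S_xx)

print(round(ci_low, 4), round(ci_high, 4), contains_zero)
```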

9.7 Confidence interval estimate of the mean value of y


There are two types of interval estimates, namely the confidence interval estimate and the prediction interval estimate. The former is an estimate of the mean value of y for a particular value of x, while the latter concerns the prediction of an individual value of y corresponding to a given value of x. The point estimates computed using the equation $\hat{y} = a + bx$ are the same in both cases. The difference is only in the computation of the standard error.

Suppose we denote the estimate of the mean value by $\hat{y}_m$ and the individual value estimate by $\hat{y}_{ind}$. The corresponding values and their associated variances are computed using the following formulae:

Mean value: $\hat{y}_m = a + bx_m$

Estimated variance of $\hat{y}_m$: $s_m^2 = s^2 \left[ \dfrac{1}{n} + \dfrac{(x_m - \bar{x})^2}{\sum x^2 - (\sum x)^2/n} \right]$

Individual value: $\hat{y}_{ind} = a + bx_{ind}$

Estimated variance of $\hat{y}_{ind}$: $s_{ind}^2 = s^2 \left[ 1 + \dfrac{1}{n} + \dfrac{(x_{ind} - \bar{x})^2}{\sum x^2 - (\sum x)^2/n} \right]$

For our example, suppose we wish to estimate the mean value for a given value of $x_m = 30$.

$\hat{y}_m = a + bx_m = -8.3392 + 6.1912(30) = 177.3968$

and

$s_m^2 = 1183.16 \left[ \dfrac{1}{8} + \dfrac{(30 - 35.125)^2}{1666.875} \right] = 166.5385$

Suppose we wish to estimate the individual value for a given value of $x_{ind} = 30$.

$\hat{y}_{ind} = a + bx_{ind} = -8.3392 + 6.1912(30) = 177.3968$

and

$s_{ind}^2 = 1183.16 \left[ 1 + \dfrac{1}{8} + \dfrac{(30 - 35.125)^2}{1666.875} \right] = 1349.6980$

Note: The variance associated with the individual value prediction is greater than that associated with the mean value. Consequently, the interval for the individual value is wider than that for the mean value.
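The two variance formulae differ only by the extra term $s^2$, which a small function makes explicit. The sketch below uses the rounded quantities quoted in the text for Example 9.1a; the function name `estimate` and its `individual` flag are assumptions introduced for illustration.

```python
# Quantities quoted in the text for Example 9.1a
a, b = -8.3392, 6.1912
s2, n, x_bar, S_xx = 1183.16, 8, 35.125, 1666.875

def estimate(x0, individual=False):
    """Return (point estimate, estimated variance) at x0.

    individual=False: variance of the estimated mean value of y.
    individual=True:  variance for predicting a single new value of y,
                      which adds one extra s^2 term.
    """
    y_hat = a + b * x0
    extra = 1 if individual else 0
    variance = s2 * (extra + 1 / n + (x0 - x_bar) ** 2 / S_xx)
    return y_hat, variance

y_m, var_m = estimate(30)
y_ind, var_ind = estimate(30, individual=True)
print(round(y_m, 4), round(var_m, 4), round(var_ind, 3))
```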

Exercise 9.1

9.1 A restaurant operating on a reservations-only basis would like to use the number of advance reservations x to predict the number of dinners y to be prepared. Data on reservations and number of dinners served for one day chosen at random from each week in a 100-week period gave the following results:

$\bar{x} = 150$, $\bar{y} = 120$

$\sum (x - \bar{x})^2 = 90\,000$

$\sum (y - \bar{y})^2 = 70\,000$

$\sum (x - \bar{x})(y - \bar{y}) = 60\,000$

a) Find the least squares estimates a and b for the linear regression line $\hat{y} = a + bx$.
b) Predict the number of meals to be prepared if the number of reservations is 135.
c) Construct a 90 % confidence interval for the slope. Does information on x (number of advance reservations) help in predicting y (number of dinners prepared)?

9.2 Interest rates charged for home mortgages have, in general, declined over recent months, with an apparently favourable influence on new home building. The data shown below are the prevailing mortgage interest rates and the number of housing starts in a city over a period of 18 months.

Month   Interest rate x   Number of housing starts y
1       10.5              360
2       10.3              340
3       10.6              370
4       11.4              360
5       11.8              330
6       11.3              300
7       11.0              290
8       10.5              340
9       10.2              360
10      10.0              370
11      9.8               380
12      9.8               390
13      9.9               375
14      10.0              350
15      10.0              345
16      9.9               360
17      9.8               380
18      9.7               395

a) Plot the data.
b) Use these data to obtain a linear regression equation.
c) Is the slope significantly different from zero?
d) Predict the number of housing starts for interest rates of 10.2% and 9.5%.
e) Do you predict that the prevailing interest rate will increase or decrease next month (month 19)?

9.8 Testing model assumptions

A residual is the difference between the actual value of the dependent variable $y_i$ and the value predicted by the regression equation $\hat{y}_i$. The analysis of residuals plays an important role in validating the assumptions made in regression analysis. The hypothesis tests discussed above are valid only when the assumptions made about the regression equation are satisfied. Residual plots are graphical presentations of the residuals that help reveal patterns and thus help determine whether the assumptions concerning the error component and the form of the regression model are satisfied. The following are the common residual plots:

A plot of residuals against the independent variable x.
A plot of residuals against the predicted value of the dependent variable.
A standardised residual plot in which each residual is standardised by dividing the residual by its standard deviation.

9.9 Diagnostic procedures

Residual plot against x


A residual plot against the independent variable x is constructed by placing x on the horizontal axis and the residuals on the vertical axis. The residual plot should give an overall impression of a horizontal band of points if the assumptions are valid and a linear relationship between x and y is appropriate.
[Figure: Residual plot against x, with x on the horizontal axis and the residuals on the vertical axis]

Using the residual plot

a) An overall impression of a horizontal band of points in a residual plot implies that the model is valid and that a linear relationship between x and the expected value of y exists.
b) A cone-shaped pattern in the residual plot suggests that the variance is not constant; that is, the variability of y about the regression line is greater for larger values of x.
c) A quadratic pattern in the residual plot suggests that the linear model is not adequate and that a quadratic model should be fitted.

Note that for simple linear regression, the residual plot against x and the residual plot against the predicted value $\hat{y}$ provide the same information. With multiple regression models, the residual plot against $\hat{y}$ is the one generally used, since there is more than one independent variable. Standardised residual plots are provided by most computer software. A random variable is standardised by subtracting its mean and dividing the result by its standard deviation. The standard deviation of the ith residual is

$$s_{y_i - \hat{y}_i} = \sqrt{s^2 (1 - h_i)}$$

where

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2} \quad \text{and} \quad s^2 = \text{MSE}.$$

If the normality assumption is satisfied, about 95 % of the standardised residuals should lie between -2 and 2.
Outliers


Outliers are observations that are suspect and warrant careful examination. Sometimes they arise from erroneous data recording. They may also signal a violation of the model assumptions, or they may simply be unusual values that occurred by chance.

Example 9.2

Consider the following data set, which illustrates the effect of an outlier.

x    1   1   2   3   3   3   4   4   5   6
y   45  55  50  75  40  45  30  35  25  15

[Figure: The effect of an outlier; scatter plot of Y against X]

A negative linear relationship exists between X and Y, except for the observation at x = 3 and y = 75, which is out of the pattern. Most statistical software classifies an observation as an outlier if its standardised residual is less than -2 or greater than 2.
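
The outlier rule can be checked by hand for the data of Example 9.2. The following is a minimal sketch in plain Python, assuming the standardised residual is the residual divided by $\sqrt{s^2(1 - h_i)}$ as defined in Section 9.9; the function name `standardised_residuals` is illustrative and not part of any package.

```python
import math

# Data from Example 9.2
x = [1, 1, 2, 3, 3, 3, 4, 4, 5, 6]
y = [45, 55, 50, 75, 40, 45, 30, 35, 25, 15]


def standardised_residuals(x, y):
    """Fit y = a + b*x by least squares and return the standardised residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                      # slope
    a = y_bar - b * x_bar              # intercept
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    mse = sum(e ** 2 for e in residuals) / (n - 2)   # s^2 = MSE
    result = []
    for xi, e in zip(x, residuals):
        h = 1 / n + (xi - x_bar) ** 2 / sxx          # leverage h_i
        result.append(e / math.sqrt(mse * (1 - h)))
    return result


std_res = standardised_residuals(x, y)
# Flag observations whose standardised residual lies outside (-2, 2)
flagged = [(xi, yi) for xi, yi, r in zip(x, y, std_res) if abs(r) > 2]
print(flagged)
```

For these data only the observation (3, 75) is flagged, with a standardised residual of about 2.67, which agrees with the visual impression from the scatter plot.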
Influential observations

An influential observation, which may also be an outlier, is a value that lies far from the mean. Consider the following data, which illustrate an influential observation.

Example 9.3

x    10   10   15   20   20   25   70
y   125  130  120  115  120  110  100

[Figure: A high leverage observation; scatter plot of Y against X]

The observation at x = 70 and y = 100 has an extreme value of x and therefore has high leverage. The leverage of the ith observation is computed as

$$h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum (x_i - \bar{x})^2}$$

An observation is declared to be influential if $h_i > 6/n$. The appropriate approach to handling data with influential observations is to run the regression analysis with and without the observation. Although time-consuming, this approach reveals the influence of the observation on the results.
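
As an illustration, the leverage formula and the 6/n rule can be applied to the x values of Example 9.3. This is a small sketch in plain Python; the function name `leverages` is an assumption made for the example.

```python
# x values from Example 9.3
x = [10, 10, 15, 20, 20, 25, 70]


def leverages(x):
    """Return h_i = 1/n + (x_i - x_bar)^2 / sum((x_i - x_bar)^2) for each x_i."""
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]


h = leverages(x)
cutoff = 6 / len(x)                       # rule used in the text: h_i > 6/n
high_leverage = [xi for xi, hi in zip(x, h) if hi > cutoff]
print(high_leverage)
```

Only x = 70 exceeds the cutoff: its leverage is about 0.94, against 6/n = 0.86 for n = 7.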

Exercise 9.2

9.3 Consider the following data for two variables X and Y.

X   135  110  130  145  175  160  120
Y   145  100  120  120  130  130  110

a) Compute the standardised residuals for these data. Do there appear to be any outliers in the data? Explain.
b) Plot the standardised residuals against $\hat{y}$. Does this plot reveal any outliers?
c) Develop a scatter plot for these data. Does the scatter diagram indicate any outliers in the data? In general, what implications does this have for the simple linear regression?

9.10 Polynomial models

The response in the dependent variable y will not always be linear when the independent variable x is quantitative. Sometimes the response may be quadratic, cubic, or of a degree higher than three. For instance, a linear equation may not adequately represent the relationship between yield and the amount of fertiliser applied to a plot. The following data give the yield of tomatoes from plots receiving different amounts of fertiliser.

Plot   Amount of fertiliser x   Yield y
  1            12                 24
  2             5                 18
  3            15                 31
  4            17                 33
  5            20                 26
  6            14                 30
  7             6                 20
  8            23                 25
  9            11                 25
 10            13                 27
 11             8                 21
 12            18                 29
 13            22                 29
 14            25                 26

[Figure: Scatterplot of yield versus fertiliser; yield Y on the vertical axis, amount of fertiliser X on the horizontal axis]

A model describing the quadratic form shown in the above figure is

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon$$

A general polynomial regression model relating a dependent variable y to a single quantitative independent variable x is given by

$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_p x^p + \varepsilon$$

The choice of p and hence the choice of an appropriate regression model will depend on the experimental situation.
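
To make the quadratic model concrete, the sketch below fits it to the fertiliser data by least squares. It is only one possible approach: it forms the normal equations and solves them by Gaussian elimination in plain Python, and the helper names `solve` and `fit_polynomial` are assumptions made for this example. Statistical software would normally be used instead.

```python
# Fertiliser (x) and yield (y) data from the table in Section 9.10
x = [12, 5, 15, 17, 20, 14, 6, 23, 11, 13, 8, 18, 22, 25]
y = [24, 18, 31, 33, 26, 30, 20, 25, 25, 27, 21, 29, 29, 26]


def solve(a, b):
    """Solve the linear system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        z[i] = (m[i][n] - sum(m[i][j] * z[j] for j in range(i + 1, n))) / m[i][i]
    return z


def fit_polynomial(x, y, p):
    """Least squares estimates b0..bp for y = b0 + b1*x + ... + bp*x^p."""
    # Normal equations (X'X) b = X'y, where column j of X holds x**j
    xtx = [[sum(xi ** (j + k) for xi in x) for k in range(p + 1)] for j in range(p + 1)]
    xty = [sum(yi * xi ** j for xi, yi in zip(x, y)) for j in range(p + 1)]
    return solve(xtx, xty)


b0, b1, b2 = fit_polynomial(x, y, 2)
residuals = [yi - (b0 + b1 * xi + b2 * xi ** 2) for xi, yi in zip(x, y)]
print(b0, b1, b2)
```

A negative estimate of the quadratic coefficient corresponds to the rise and subsequent fall in yield seen in the scatterplot. Calling `fit_polynomial(x, y, 3)` would fit a cubic in the same way.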

9.11 Multiple regression

The probabilistic model for multiple regression analysis is a direct extension of that for simple linear regression. For p independent variables, we have

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$

The estimated regression equation is

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$$

This is referred to as a multiple regression model because it involves more than one independent variable. For example, consider an experiment set up to study the yield of a tomato crop. Several independent variables, say amount of fertiliser (X1), amount of water (X2), and hours of sunlight on clear days (X3), could all have an effect on the yield. The multiple regression model that relates a dependent variable y to a set of quantitative independent variables is a direct extension of a polynomial regression model in one independent variable. Any independent variable may be a power of another independent variable or a product of others; for example, $x_2$ might be $x_1^2$, or $x_3$ might be the cross-product term $x_1 x_2$. A point to note is that no x may be a perfect linear function of the other xs.

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$

In general, $\beta_j$ ($j \neq 0$) represents the expected change in y for a unit increase in $x_j$ while holding all other xs constant. The simplest model that allows for interaction between $x_1$ and $x_2$ is

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon$$

For a given value, say $x_2 = 2$, the expected value of y, denoted E(y), is

$$E(y) = \beta_0 + \beta_1 x_1 + \beta_2 (2) + \beta_3 x_1 (2) = (\beta_0 + 2\beta_2) + (\beta_1 + 2\beta_3) x_1$$

Here the intercept and slope are $(\beta_0 + 2\beta_2)$ and $(\beta_1 + 2\beta_3)$, respectively.
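
The effect of the interaction term can be seen numerically. The short sketch below evaluates the intercept and slope of E(y) as a function of x1 for several fixed values of x2. The coefficient values are hypothetical and chosen only for illustration; they do not come from any data set in this module.

```python
def line_in_x1(beta, x2):
    """Return (intercept, slope) of E(y) as a function of x1 when x2 is held fixed.

    beta = (beta0, beta1, beta2, beta3) for the model
    E(y) = beta0 + beta1*x1 + beta2*x2 + beta3*x1*x2.
    """
    b0, b1, b2, b3 = beta
    return b0 + b2 * x2, b1 + b3 * x2


# Hypothetical coefficients, for illustration only
beta = (10.0, 2.0, 3.0, 0.5)

for x2 in (0, 2, 4):
    intercept, slope = line_in_x1(beta, x2)
    print(x2, intercept, slope)
```

With these values, x2 = 2 gives an intercept of 10 + 2(3) = 16 and a slope of 2 + 2(0.5) = 3. Because the slope changes with x2, the effect of x1 on y depends on the level of x2, which is what interaction means.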


10. INTRODUCTION TO MULTIVARIATE ANALYSIS

10.1 An overview

Multivariate data occur in all branches of science. Almost all data collected by today's researchers can be classified as multivariate data. For example, a marketing researcher might be interested in identifying characteristics of individuals that would enable the researcher to determine whether a certain individual is likely to purchase a specific product. A wheat breeder might be interested in more than just the yields of some new varieties of wheat. The wheat breeder may also be interested in these varieties' resistance to insect damage and drought. A social scientist might be interested in studying relationships between teenage girls' dating behaviours and their fathers' attitudes. The objectives of scientific investigations to which multivariate techniques most naturally lend themselves include the following:

- Data reduction or structural simplification.
- Sorting and grouping.
- Investigation of the dependence among variables.
- Prediction.
- Hypothesis construction and testing.

Multivariate techniques are applicable when more than one variable is measured on an experimental unit. Such variables could be correlated, and univariate analysis would not be helpful in extracting relevant information. Multivariate techniques are classified into two categories, namely variable-directed and individual (or experimental unit) directed. Some of these techniques are:

Variable directed

- Principal component analysis (PCA)
- Factor analysis (FA)
- Canonical correlation analysis (CCA)
- Multiple regression analysis (MRA)

Individual directed

- Discriminant analysis (DA)
- Cluster analysis (CA)
- Multivariate analysis of variance (MANOVA)

The above techniques will be discussed with examples later in Section 9.3.

10.2 Possible areas of applications



Medicine and health

Example 10.1a

A study was conducted to investigate the reactions of cancer patients to radiotherapy. Measurements were made on 6 reaction variables for 98 patients. Interest: data reduction.

Example 10.1b

Research was conducted on the genetic basis for alcoholism. One group has found that the activity of the two enzymes (monoamine oxidase and adenylate cyclase) produced by platelets was significantly reduced in alcoholics. The results of this study hold promise for the development of a simple screening test for the early detection of alcoholism. Interest: to identify and measure physiological variables that could be used effectively to discriminate alcoholics from non-alcoholics.
Sociology

Example 10.2a

Competing current theories suggest that one strong socioeconomic dimension and a few minor unexplored dimensions determine the structure of American occupations. Measurements on 25 variables for 583 occupations were analysed using multivariate methods in order to provide support for one or two of the positions. Interest: hypothesis verification.

Example 10.2b

In a study of mobility, counts of the number of foreign-born and second-generation U.S. residents in 1970 were tabulated by country of origin and state of residence. Interest: to find natural homogeneous groupings.
Business and economics

Example 10.3a

Measurements of 6 accounting and financial variables were used in developing a multivariate model to help insurance regulators identify potentially insolvent property-liability insurers. Using the model, an insurance company could be classified as solvent or distressed, and remedial steps could then be taken to prevent bankruptcy of the distressed firm. Interest: to obtain a classification rule for distinguishing solvent firms from distressed firms.

Example 10.3b

Knowledge of the relationships among policy instruments and goals for underdeveloped countries can aid the process of national development and modernisation. Data from 74 non-communist underdeveloped countries allowed an investigator to find the subsets of goals and instruments most closely associated with each other and to estimate the nature of the simultaneous relationships between the two subsets. Interest: to determine the dependence between two sets of variables corresponding to goals and instruments.

Education

Example 10.4

Scholastic Aptitude Test (SAT) scores and high school academic performance are often used as indicators of academic success in college. Measurements on 5 precollege predictor variables and 4 college performance criterion variables were used to determine the association between the predictor and criterion scores. The study was concerned with substantiating the usefulness of test scores and high school achievement as predictors of college performance. Interest: prediction of college performance variables based on the set of predictor variables.

Biology

Example 10.5a

Two species of chickweed have proved difficult to identify. Measurements on 4 variables for chickweed plants known to belong to the two species were used to construct a function whose values allowed one to separate the two groups. Consequently, the function could be used to classify a new candidate plant as belonging to one species or the other. Interest: sorting or classification.

Example 10.5b

In plant breeding it is necessary, after the end of one generation, to select those plants that will be the parents of the next generation. The selection is to be done in such a way that the succeeding generation will be improved in a number of characteristics over the previous generation. Many characteristics are often measured and evaluated. The plant breeder's goal is to maximise the genetic gain in the minimum amount of time. Multivariate techniques were used in a bean-breeding programme to convert measurements on several variables relating to yield and protein content into a selection index. Scores on this index were then used to determine parents of the subsequent families of beans. Interest: construction of an index to replace measurements on many variables and the development of a sorting rule.
Environmental studies

Example 10.6

The atmospheric concentrations of air pollutants in the Los Angeles area have been extensively studied. In one study, daily measurements on seven pollution-related variables were recorded over an extended period of time. Of immediate interest was whether the levels of air pollutants were roughly constant throughout the week or whether there was a noticeable difference between weekdays and weekends. Interest: hypothesis testing and data reduction.

Other areas where multivariate techniques apply are in meteorology, geology, psychology and sports.

10.2 Principal component analysis

Principal component analysis is useful for discovering the dimensionality of the data, for data screening, and for checking for clusters and abnormalities. It groups together variables that are highly correlated: the variables within a group are highly correlated, and the groups are uncorrelated with one another. The new variables are expressed as linear combinations of the p original variables. Principal component scores are used as inputs to other analyses. Multiple regression analysis often suffers from the multicollinearity problem, which comes about when the predictor variables are correlated. In such a situation, the selected PC scores are used as regressors. Plots of the first few PC scores help to identify outliers and clusters that may be present in the data.
10.3 Factor analysis

Factor analysis follows the same principle as PCA. The main difference is that the former has distributional properties whereas the latter does not. A few factors can explain the original variables with little loss of information. When the new factors cannot be interpreted, rotation techniques, some of which are orthogonal, are applied. The PCs selected using PCA can be used as the new factors.
10.4 Discriminant analysis

Discriminant analysis is a multivariate procedure used to develop a rule that separates two or more groups of individuals, given measurements for these individuals on several variables. Discriminant analysis is similar to regression analysis except that the dependent variable is categorical rather than continuous. In regression analysis the interest is in predicting the value of a variable based on a set of predictor variables. In discriminant analysis, the interest is in predicting the class membership of an individual observation based on a set of predictor variables. Several rules exist: a likelihood rule, the linear discriminant function rule, a Mahalanobis distance rule, a posterior probability rule, etc. The groups are known beforehand.

10.5 Cluster Analysis

Suppose a study of the farming system in a given area has been conducted. Variables measured on each farm in the data set might include the period the farm had been farmed, number of animals, fertiliser used, type of trees, average income, soil types, crops grown, size of the family, labour, etc. The researcher wants to use this information to partition farmers into subgroups, so that farmers within a subgroup have similar characteristics with respect to the measured variables. The partition would allow for efficient use of the resources by the farmers. In more general terms, suppose a researcher has data collected on a large number of experimental units. The basic question posed for cluster analysis is whether it is possible to devise a classification or grouping scheme that partitions the experimental units into classes or groups, called clusters, so that the units within a group are similar to one another while units in distinct groups are not. Cluster analysis involves techniques that produce classifications from data that are initially unclassified, and must not be confused with discriminant analysis, where one initially knows how many distinct groups exist and has data known to come from each of these groups.

11. CATEGORICAL DATA ANALYTIC METHODS

11.1 Introduction

In many studies measurements are made on binary rather than numerical scales. An example is a study of attitudes or opinions in which the two categories of the response variable are agree or disagree. Other forms of response are exposed or not exposed, yes or no, present or absent, improved or unimproved. The data collected relate to questions such as: how many have the attribute? How many said yes? We therefore end up with frequency counts. Analysis of such data uses the chi-square distribution, denoted $\chi^2$. This distribution is defined as that of the sum of squares of independent, normally distributed variables with zero mean and unit variance. The table values are found at the intersection of the $\alpha$ value and the respective degrees of freedom. For example, for $\alpha$ = 0.05 and 6 degrees of freedom, the table value from Table cc is 12.6. There are three areas of inferential statistics in which the chi-square test for significance is commonly applied. They are:

- Tests for independence of association;
- Tests for equality of proportions in more than two populations; and
- Tests for goodness of fit.

The chi-square statistic tests the null hypothesis by comparing a set of observed frequencies, which are based on sample findings, with a set of expected frequencies, which describe the null hypothesis. It measures the extent to which the observed and expected frequencies differ. Large differences will result in the null hypothesis being rejected. The chi-square statistic is computed as

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}$$

where $O_i$ is the ith observed count, $E_i$ is the ith expected count, and k is the number of categories. The calculated $\chi^2$ is compared against a table value obtained using k-1 degrees of freedom and a specified $\alpha$ level. In the case of a contingency table, the cells constitute the k categories over which the sum is taken, and the degrees of freedom are (r-1)(c-1) for a table with r rows and c columns, as described in Section 11.2.

11.2 Test for independence of association.

This test is applied when an investigator wishes to determine the independence of two random variables. Independence implies that the outcomes of one random variable in no way influence the outcomes of a second random variable. The null and alternative hypotheses are stated as follows:

H0: The two variables are independent.
Ha: The two variables are dependent.

The procedure is illustrated through the following example.
Example 11.1

A certain brewery company manufactures and distributes three types of beer, which are categorised as 1) a low-calorie light beer, 2) a regular beer and 3) a dark beer. In an analysis of the market segments for the three beers, the firm's market research group has raised the question of whether preferences for the three beers differ between male and female beer drinkers. If beer preference is independent of the sex of the beer drinker, one advertising campaign will be initiated for all the beers. However, if beer preference depends on the sex of the beer drinker, the company will tailor its promotions towards different target markets. The hypotheses of this test are stated as

H0: Beer preference is independent of the sex of the beer drinker.
Ha: Beer preference is not independent of the sex of the beer drinker (i.e., males and females differ in their preference).

A sample is selected and each individual is asked to state his or her preference among the company's three beers. Every individual in the sample will be placed in one of the six cells (3 x 2 = 6). The table generated by the 3 x 2 cells is called a contingency table. The test of independence makes use of the contingency table format and for this reason is sometimes referred to as a contingency table test.


Suppose that a simple random sample of 150 beer drinkers has been selected. After taste-testing the three beers, the individuals in the sample are asked to state their preference, or first choice. The responses are presented in the contingency table below.

Observed frequencies (Oij)

                 Beer Preference
Sex        Light   Regular   Dark   Totals
Male         20       40      20       80
Female       30       30      10       70
Totals       50       70      30      150

Expected frequencies for the cells of the contingency table are based on the following rationale. Assume the null hypothesis. Under this assumption, 50/150 = 1/3 of the beer drinkers prefer light beer, 70/150 = 7/15 prefer regular beer, and 30/150 = 1/5 prefer dark beer. If the independence assumption is valid, these same fractions must apply to both male and female beer drinkers. Thus, under the assumption of independence, we would expect of the 80 male drinkers that (1/3)(80) = 26.67 prefer light beer, (7/15)(80) = 37.33 prefer regular beer, and (1/5)(80) = 16 prefer dark beer. A similar argument follows for female beer drinkers.

Expected frequencies if beer preference is independent of the sex of the beer drinker (Eij)

                 Beer Preference
Sex        Light   Regular   Dark    Totals
Male       26.67    37.33    16.00      80
Female     23.33    32.67    14.00      70
Totals        50       70       30     150

The general formula for computing expected frequencies for a contingency table in the test for independence is

$$E_{ij} = \frac{(\text{Row } i \text{ total})(\text{Column } j \text{ total})}{\text{sample size}}$$

In general, the contingency table test statistic is computed as

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_{ij}$ is the observed frequency for the contingency table category in row i and column j, and $E_{ij}$ is the expected frequency for the category in row i and column j based on the assumption of independence. With r rows and c columns in the contingency table, the test statistic has a chi-square distribution with (r-1)(c-1) degrees of freedom, provided the expected frequencies are 5 or more for all categories. Referring back to our example, we note that all expected frequencies are at least 5. Thus, the sample size is adequate and we can proceed to calculate the chi-square statistic.

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(20 - 26.67)^2}{26.67} + \frac{(40 - 37.33)^2}{37.33} + \dots + \frac{(10 - 14.00)^2}{14.00}$$

= 1.67 + 0.19 + . . . + 1.14 = 6.13

Degrees of freedom = (r-1)(c-1) = (2-1)(3-1) = 2. Using $\alpha$ = 0.05, the table value is $\chi^2$ = 5.99. We reject H0, since the calculated $\chi^2$ = 6.13 is greater than the table value 5.99, and conclude that the preference for the beers is not independent of the sex of the beer drinkers.
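
The calculation for Example 11.1 can be reproduced with a short script. The sketch below, in plain Python, computes the expected frequencies from the row and column totals and then the test statistic; the function name `chi_square_independence` is an assumption for this illustration, and the critical value 5.99 is simply copied from the chi-square table used above.

```python
import math

# Observed frequencies from Example 11.1
# rows: male, female; columns: light, regular, dark
observed = [[20, 40, 20],
            [30, 30, 10]]


def chi_square_independence(observed):
    """Return (statistic, degrees of freedom, expected frequencies)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E_ij = (row i total)(column j total) / sample size
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    statistic = sum((o - e) ** 2 / e
                    for obs_row, exp_row in zip(observed, expected)
                    for o, e in zip(obs_row, exp_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected


statistic, df, expected = chi_square_independence(observed)
critical_value = 5.99                  # table value for alpha = 0.05, 2 degrees of freedom
p_value = math.exp(-statistic / 2)     # exact upper-tail area, valid only when df == 2
print(round(statistic, 2), df, statistic > critical_value)
```

Carrying full precision gives a statistic of about 6.12 rather than 6.13; the small difference comes from rounding the individual terms in the hand calculation, and the conclusion is unchanged. For the special case of 2 degrees of freedom the upper-tail probability has the closed form $e^{-\chi^2/2}$, which here is roughly 0.047, just below 0.05.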

Exercise 11.1

11.1 The following data is on the distribution of employment status in five areas denoted by polygon codes, from KZN. The study involved a random sample of 2942 persons.

                              Polygons
Employment Status   5010012   5010013   5010014   5010015   5010016
EMPLOYED               344       435       291        30       257
UNEMPLOYED              13        25        14         1        15
NOT_WORKIN             211       276       189        72       130
UNSPECIFIE             178       218       125        14       104
TOTAL                  746       954       619       117       506

Using $\alpha$ = 0.05, test whether there is an association between employment status and the five areas.

11.2 The Abacus Media Company publishes 4 magazines for the teenage market (between 13 and 17 years of age). The executive editor of Abacus would like to know whether readership preference for the four magazines is independent of gender. A survey of 200 teenagers in stationery stores was carried out. Randomly selected teenagers who bought at least one of the four magazines were asked to indicate which of the four magazines they preferred. Their responses are presented below.

              Magazine Preference
Gender    Beat   Youth   Grow   Live
Girls      18      12     20     28
Boys       38      26     34     24

Using $\alpha$ = 0.05, test whether there is an association between gender and magazine preference.

11.3 A motor vehicle distributor wishes to find out if the size of car bought is in any way related to the age of the buyer. From sales invoices over the past two years, a sample of 300 buyers was classified by size of the car bought and buyer's age. The following contingency table was constructed.

                   Car size bought
Buyer's Age    Small   Medium   Large
Under 30         10      22       34
30 - 45          24      42       48
Over 45          52      32       36

Using $\alpha$ = 0.05, test whether car size bought and buyer's age are independent. Interpret your results.

11.4 A sample of parts provided the following contingency table data concerning part quality and production shift.

Shift     Number Defective   Good
First           32            368
Second          15            285
Third           24            176

Use $\alpha$ = 0.05 and test whether part quality is independent of the production shift. What is your conclusion?

11.3 Tests for equality of proportions in more than two populations

Earlier sections discussed the case of comparing two population proportions using either normal or t-distributions. The situation is different when more than two population proportions are to be compared. The Chi-square distribution is used in such a situation. The test for equality of proportions in more than two populations is equivalent to the test for independence of association. The null hypothesis is stated as no differences exist between the proportions of a given category of one random variable examined across all categories of a second random variable. The following example illustrates the procedures used to test for the equality of proportions in more than two populations.


Example 11.2

A local air carrier would like to know if there is any difference between the proportions of travellers, classified as business or non-business, making reservations for each of its four classes. A survey of 300 reservations over the past week shows the following use of each class of travel by passengers.

The observed frequencies

                             Class of Travel
Type of traveller   Emerald   Amethyst   Diamond   Ruby   Row Totals
Business               32        22         42      32       128
Non-Business           48        26         68      30       172
Column Totals          80        48        110      62       300

Let
P1 = proportion of business travellers in Emerald class.
P2 = proportion of business travellers in Amethyst class.
P3 = proportion of business travellers in Diamond class.
P4 = proportion of business travellers in Ruby class.

Hypothesis
H0 : P1 = P2 = P3 = P4
H1 : At least one population proportion is different.

Note that the null hypothesis could also be stated as: type of traveller is independent of the class of travel used.

The expected frequencies

                             Class of Travel
Type of traveller   Emerald   Amethyst   Diamond   Ruby   Row Totals
Business              34.1      20.5       46.9    26.5      128
Non-Business          45.9      27.5       63.1    35.5      172
Column Totals          80        48        110      62       300

Test statistic

$$\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = \frac{(32 - 34.1)^2}{34.1} + \frac{(22 - 20.5)^2}{20.5} + \dots + \frac{(30 - 35.5)^2}{35.5}$$

= 0.1293 + 0.1096 + . . . + 0.8521 = 3.3028

Using $\alpha$ = 0.05 and (2-1)(4-1) = 3 degrees of freedom, the table value is $\chi^2$ = 7.815. We fail to reject H0, since the calculated $\chi^2$ = 3.3028 is not greater than the table value 7.815. We conclude that the data give no evidence of a difference in the proportion of business travellers between the classes of travel. This finding is equivalent to concluding, in a test for independence of association, that type of traveller and class of travel are independent.
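
The same example can be viewed directly in terms of proportions. The sketch below, in plain Python, computes the sample proportion of business travellers in each class and the pooled proportion under H0, and then writes the chi-square statistic in terms of the pooled proportion. This form is algebraically the same as summing $(O - E)^2/E$ over all eight cells; the variable names are chosen only for this illustration.

```python
# Observed counts from Example 11.2, by class of travel
classes = ["Emerald", "Amethyst", "Diamond", "Ruby"]
business = [32, 22, 42, 32]
non_business = [48, 26, 68, 30]

totals = [b + nb for b, nb in zip(business, non_business)]
pooled = sum(business) / sum(totals)      # estimate of the common proportion under H0

# Sample proportion of business travellers in each class
proportions = [b / t for b, t in zip(business, totals)]

# Chi-square statistic expressed through the pooled proportion
statistic = sum((b - t * pooled) ** 2 / (t * pooled * (1 - pooled))
                for b, t in zip(business, totals))

critical_value = 7.815                    # table value for alpha = 0.05, 3 degrees of freedom
for name, p in zip(classes, proportions):
    print(name, round(p, 3))
print(round(statistic, 2), statistic > critical_value)
```

With unrounded expected frequencies the statistic is about 3.36, slightly larger than the 3.3028 obtained by hand from expected values rounded to one decimal place. Either way it is well below 7.815, so the decision not to reject H0 is the same.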
Exercise 11.2

11.5 An insurance organisation sampled its field sales force in the four provinces concerning their attitudes towards compensation. Respondents were given the choice between the present method (fixed salary plus year-end bonus) and a proposed new method (straight commission).

                               Province
Response preference    Cape   Transvaal   OFS   Natal
Present method          68       135       47     79
New method              32        50       23     31

a) Test, at the 5 % level of significance, whether there is any difference between the four provinces in the proportion of sales staff who prefer the present method.
b) Interpret your findings.

11.4 Tests for goodness of fit

The following are the general steps used to conduct a goodness of fit test for any hypothesised probability distribution:

- Formulate a null hypothesis indicating a hypothesised distribution for k classes or categories of a population.
- Select a simple random sample of n items, and record the observed frequencies for each of the k classes or categories.
- Based on the assumption that the null hypothesis is true, determine the expected frequencies for each category.
- Use the observed and expected frequencies to compute a value of $\chi^2$ for the test.
- Reject H0 if the calculated $\chi^2$ value is greater than the table $\chi^2$ value obtained with k-1 degrees of freedom at the $\alpha$ level of significance.

We illustrate the computation through the following example.

Example 11.3

Patients who arrive for treatment at the emergency room of a large metropolitan hospital are assigned to one of the following three categories based on the seriousness of their condition.

Category 1: Patient condition is stable; immediate treatment by a physician is not required.
Category 2: Patient condition is serious; immediate treatment is not required, but the patient should be monitored for vital signs until a physician is available.
Category 3: Patient condition is critical; the patient's life will be endangered without immediate treatment.

The population of interest is a multinomial population, since the condition of each patient is classified into one and only one of the three categories stable, serious, and critical. The available information over the last year indicates that 50 % of the patients who arrived for treatment were classified as stable, 30 % were classified as serious, and 20 % were classified as critical. There has been an increased volume for the emergency room due to recent improvement. The director of the hospital is concerned that the percentages of patients classified as having stable, serious, or critical conditions may also have changed. Validation of this claim is required.

Let
P1 = fraction of patients classified as stable.
P2 = fraction of patients classified as serious.
P3 = fraction of patients classified as critical.

Hypothesis
H0 : P1 = 0.50, P2 = 0.30, P3 = 0.20
H1 : The population proportions are not P1 = 0.50, P2 = 0.30, P3 = 0.20

Suppose the hospital selected a sample of 200 patients who have been tested since the volume increased in the emergency room. The following are the observed frequencies.

Stable   Serious   Critical
  98        48        54

The expected frequencies for each category under H0 are

Stable             Serious            Critical
200(0.50) = 100    200(0.30) = 60     200(0.20) = 40

The goodness of fit test focuses on the differences between the observed frequencies and the expected frequencies. With the expected frequencies greater than 5 for all three categories, the sample size requirement is satisfied and we proceed to compute the test statistic.

Test statistic

$$\chi^2 = \sum_{i} \frac{(O_i - E_i)^2}{E_i} = \frac{(98 - 100)^2}{100} + \frac{(48 - 60)^2}{60} + \frac{(54 - 40)^2}{40}$$

= 0.04 + 2.40 + 4.90 = 7.34



Using α = 0.05 and k - 1 = 3 - 1 = 2 degrees of freedom, where k is the number of categories, the tabulated value of χ² is 5.99 (Table C). We reject H0 since the computed χ² = 7.34 is larger than the critical value 5.99. In rejecting H0 we conclude that the increase in volume for the emergency room has altered the percentages of patients whose conditions are stable, serious, or critical.

The goodness of fit test uses the chi-square distribution to determine whether a hypothesised probability distribution for a population provides a good fit. Whether the hypothesised probability distribution is rejected depends on the differences between the observed frequencies in a sample and the expected frequencies based on the assumed probability distribution.
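The decision step can also be checked numerically. The sketch below relies on a fact that holds only for 2 degrees of freedom: the upper-tail probability of the chi-square distribution is exp(-x/2), so the critical value is -2 ln(α). The function names are hypothetical; for other degrees of freedom a table or a statistics library would be needed.

```python
import math


def upper_tail_prob_2df(x):
    """Pr(chi-square >= x) for exactly 2 degrees of freedom."""
    return math.exp(-x / 2.0)


def critical_value_2df(alpha):
    """Value exceeded with probability alpha, for exactly 2 degrees of freedom."""
    return -2.0 * math.log(alpha)


alpha = 0.05
statistic = 7.34  # test statistic from the emergency-room example

critical = critical_value_2df(alpha)      # about 5.99
p_value = upper_tail_prob_2df(statistic)  # about 0.025
reject_h0 = statistic > critical

print(f"critical value = {critical:.2f}")
print(f"p-value        = {p_value:.4f}")
print("reject H0" if reject_h0 else "do not reject H0")
```

The p-value is smaller than α = 0.05, which agrees with the comparison of 7.34 against the critical value 5.99.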
Exercise 11.3

11.6 Conduct a test of the following hypotheses using the chi-square goodness of fit test.
H0 : PA = 0.40, PB = 0.40, PC = 0.20
H1 : The population proportions are not PA = 0.40, PB = 0.40, PC = 0.20

11.7 A sample of size 200 yielded 60 in category A, 120 in category B, and 20 in category C. Using α = 0.01, test to see if the proportions are as stated in H0 of Exercise 11.6.

11.8 A manufacturer has adopted a new container design. Colour preferences indicated in a sample of 150 individuals are as follows.

Colour    Number of individuals
Red       40
Blue      64
Green     46

Test using α = 0.10 to see if the colour preferences are different. (Hint: Formulate the null hypothesis as H0 : P1 = P2 = P3 = 1/3.)

11.9 Grade distribution guidelines for a statistics course at a major university are as follows: 10 % A, 30 % B, 40 % C, 15 % D, and 5 % F. A sample of 120 statistics grades at the end of a semester showed 18 As, 30 Bs, 40 Cs, 22 Ds, and 10 Fs. Test using α = 0.05 to see if the actual grades deviate significantly from the grade distribution guidelines.

11.10 An accountant for a department store knows from past experience that 23 % of the store's customers pay cash for their purchases, 35 % write cheques, and the remaining 42 % use credit cards. The accountant examines a random sample of 200 sales receipts for the week before Christmas and makes the following sales summary.


Payment method    Number of customers
Cash              37
Cheque            47
Credit cards      116

Use the chi-square goodness of fit test to see if the preceding percentages fit these observations. Use α = 0.05.

11.11 Consider the following data on the age distribution in two polygons.

Age group    Polygon 5010001 frequency    Polygon 5090061 frequency
0-10         177                          75
11-20        231                          54
21-30        240                          34
31-40        141                          14
41-50        169                          18
51-60        124                          10
61-70        38                           9
71-80        8                            1
81-90        1                            0
91-100       0                            0
Over 101     0                            0
UN           2                            0
TOTAL        1131                         215

Use the chi-square goodness of fit test to see if the age group distribution for polygon 5090061 follows a Poisson distribution. Use α = 0.05.


APPENDIXES

TABLE A The Normal Distribution

$$\Pr(Z \le z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-w^{2}/2}\, dw$$

$$\Phi(-z) = 1 - \Phi(z)$$
z        Φ(z)       z        Φ(z)       z        Φ(z)
0.00     0.500      1.10     0.864      2.05     0.980
0.05     0.520      1.15     0.875      2.10     0.982
0.10     0.540      1.20     0.885      2.15     0.984
0.15     0.560      1.25     0.894      2.20     0.986
0.20     0.579      1.282    0.900      2.25     0.988
0.25     0.599      1.30     0.903      2.30     0.989
0.30     0.618      1.35     0.911      2.326    0.990
0.35     0.637      1.40     0.919      2.35     0.991
0.40     0.655      1.45     0.926      2.40     0.992
0.45     0.674      1.50     0.933      2.45     0.993
0.50     0.691      1.55     0.939      2.50     0.994
0.55     0.709      1.60     0.945      2.55     0.995
0.60     0.726      1.645    0.950      2.576    0.995
0.65     0.742      1.65     0.951      2.60     0.995
0.70     0.758      1.70     0.955      2.65     0.996
0.75     0.773      1.75     0.960      2.70     0.997
0.80     0.788      1.80     0.964      2.75     0.997
0.85     0.802      1.85     0.968      2.80     0.997
0.90     0.816      1.90     0.971      2.85     0.998
0.95     0.829      1.95     0.974      2.90     0.998
1.00     0.841      1.960    0.975      2.95     0.998
1.05     0.853      2.00     0.977      3.00     0.999
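Entries of Table A can be reproduced with the error function from the Python standard library, using the identity Φ(z) = (1 + erf(z/√2))/2. The sketch below is only a cross-check on the table; the function name phi is an assumption made for the example.

```python
import math


def phi(z):
    """Standard normal cumulative probability Pr(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Print a few entries in the same three-decimal format as Table A.
for z in (0.00, 1.00, 1.645, 1.960, 2.576):
    print(f"{z:5.3f}  {phi(z):.3f}")

# The symmetry relation phi(-z) = 1 - phi(z) gives probabilities for negative z.
print(f"{phi(-1.960):.3f}")
```

Values of z that are not tabulated, including negative values, can be obtained in the same way.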


TABLE B The t-Distribution

$$\Pr(T \le t) = \int_{-\infty}^{t} \frac{\Gamma[(r + 1)/2]}{\sqrt{\pi r}\,\Gamma(r/2)\,(1 + w^{2}/r)^{(r+1)/2}}\, dw$$

$$[\Pr(T \le -t) = 1 - \Pr(T \le t)]$$

The columns give the value t for the stated Pr(T ≤ t); r is the number of degrees of freedom.

        Pr(T ≤ t)
r       0.90     0.95     0.975     0.99      0.995
1       3.078    6.314    12.706    31.821    63.657
2       1.886    2.920    4.303     6.965     9.925
3       1.638    2.353    3.182     4.541     5.841
4       1.533    2.132    2.776     3.747     4.604
5       1.476    2.015    2.571     3.365     4.032
6       1.440    1.943    2.447     3.143     3.707
7       1.415    1.895    2.365     2.998     3.499
8       1.397    1.860    2.306     2.896     3.355
9       1.383    1.833    2.262     2.821     3.250
10      1.372    1.812    2.228     2.764     3.169
11      1.363    1.796    2.201     2.718     3.106
12      1.356    1.782    2.179     2.681     3.055
13      1.350    1.771    2.160     2.650     3.012
14      1.345    1.761    2.145     2.624     2.977
15      1.341    1.753    2.131     2.602     2.947
16      1.337    1.746    2.120     2.583     2.921
17      1.333    1.740    2.110     2.567     2.898
18      1.330    1.734    2.101     2.552     2.878
19      1.328    1.729    2.093     2.539     2.861
20      1.325    1.725    2.086     2.528     2.845
21      1.323    1.721    2.080     2.518     2.831
22      1.321    1.717    2.074     2.508     2.819
23      1.319    1.714    2.069     2.500     2.807
24      1.318    1.711    2.064     2.492     2.797
25      1.316    1.708    2.060     2.485     2.787
26      1.315    1.706    2.056     2.479     2.779
27      1.314    1.703    2.052     2.473     2.771
28      1.313    1.701    2.048     2.467     2.763
29      1.311    1.699    2.045     2.462     2.756
30      1.310    1.697    2.042     2.457     2.750
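The probabilities behind Table B can be checked by integrating the t density numerically. The sketch below uses Simpson's rule, which is a choice made for this illustration and not a method described in the module; the function names and the number of steps are assumptions.

```python
import math


def t_density(w, r):
    """Density of the t-distribution with r degrees of freedom."""
    c = math.gamma((r + 1) / 2) / (math.sqrt(math.pi * r) * math.gamma(r / 2))
    return c * (1 + w * w / r) ** (-(r + 1) / 2)


def t_cdf(t, r, steps=2000):
    """Pr(T <= t) by Simpson's rule on [0, |t|] plus symmetry about zero.

    steps must be an even number.
    """
    a = abs(t)
    h = a / steps
    total = t_density(0.0, r) + t_density(a, r)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2  # Simpson weights 4, 2, 4, ..., 4
        total += weight * t_density(i * h, r)
    area = total * h / 3  # Pr(0 <= T <= |t|)
    return 0.5 + area if t >= 0 else 0.5 - area


# Table B gives t = 2.228 for r = 10 and Pr(T <= t) = 0.975.
print(f"{t_cdf(2.228, 10):.3f}")
# For negative t the relation Pr(T <= -t) = 1 - Pr(T <= t) applies.
print(f"{t_cdf(-2.228, 10):.3f}")
```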


TABLE C The Chi-square Distribution Upper Probability Points


$$P = \Pr\left(\chi^{2} \ge \chi^{2}_{v,P}\right)$$

Entries in the table are the values χ²(v, P) of the χ²-distribution for various degrees of freedom v and one-tailed (upper) probabilities P.

        P
v       0.99      0.975     0.95      0.90      0.50      0.10      0.05      0.025     0.01      0.005
1       0.000     0.001     0.004     0.016     0.455     2.706     3.841     5.024     6.635     7.879
2       0.020     0.051     0.103     0.211     1.386     4.605     5.991     7.378     9.210     10.597
3       0.115     0.216     0.352     0.584     2.366     6.251     7.815     9.348     11.345    12.838
4       0.297     0.484     0.711     1.064     3.357     7.779     9.488     11.143    13.277    14.860
5       0.554     0.831     1.145     1.610     4.351     9.236     11.070    12.833    15.086    16.750
6       0.872     1.237     1.635     2.204     5.348     10.645    12.592    14.449    16.812    18.548
7       1.239     1.690     2.167     2.833     6.346     12.017    14.067    16.013    18.475    20.278
8       1.646     2.180     2.733     3.490     7.344     13.362    15.507    17.535    20.090    21.955
9       2.088     2.700     3.325     4.168     8.343     14.684    16.919    19.023    21.666    23.589
10      2.558     3.247     3.940     4.865     9.342     15.987    18.307    20.483    23.209    25.188
11      3.053     3.816     4.575     5.578     10.341    17.275    19.675    21.920    24.725    26.757
12      3.571     4.404     5.226     6.304     11.340    18.549    21.026    23.337    26.217    28.300
13      4.107     5.009     5.892     7.042     12.340    19.812    22.362    24.736    27.688    29.819
14      4.660     5.629     6.571     7.790     13.339    21.064    23.685    26.119    29.141    31.319
15      5.229     6.262     7.261     8.547     14.339    22.307    24.996    27.488    30.578    32.801
16      5.812     6.908     7.962     9.312     15.338    23.542    26.296    28.845    32.000    34.267
17      6.408     7.564     8.672     10.085    16.338    24.769    27.587    30.191    33.409    35.718
18      7.015     8.231     9.390     10.865    17.338    25.989    28.869    31.526    34.805    37.156
19      7.633     8.907     10.117    11.651    18.338    27.204    30.144    32.852    36.191    38.582
20      8.260     9.591     10.851    12.443    19.337    28.412    31.410    34.170    37.566    39.997
21      8.897     10.283    11.591    13.240    20.337    29.615    32.671    35.479    38.932    41.401
22      9.542     10.982    12.338    14.041    21.337    30.813    33.924    36.781    40.289    42.796
23      10.196    11.689    13.091    14.848    22.337    32.007    35.172    38.076    41.638    44.181
24      10.856    12.401    13.848    15.659    23.337    33.196    36.415    39.364    42.980    45.559
25      11.524    13.120    14.611    16.473    24.337    34.382    37.652    40.646    44.314    46.928
26      12.198    13.844    15.379    17.292    25.336    35.563    38.885    41.923    45.642    48.290
27      12.879    14.573    16.151    18.114    26.336    36.741    40.113    43.195    46.963    49.645
28      13.565    15.308    16.928    18.939    27.336    37.916    41.337    44.461    48.278    50.993
29      14.256    16.047    17.708    19.768    28.336    39.087    42.557    45.722    49.588    52.336
30      14.953    16.791    18.493    20.599    29.336    40.256    43.773    46.979    50.892    53.672

For v > 30, $\sqrt{2\chi^{2}} - \sqrt{2v - 1}$ is approximately distributed as normal (0, 1).
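The normal approximation for large v can be turned into an approximate critical value by solving for χ²: if z is the standard normal value with upper-tail probability P, then the χ² value with upper-tail probability P is roughly (z + √(2v - 1))²/2. The sketch below illustrates this; the function name is hypothetical and the result is only an approximation whose accuracy improves as v grows.

```python
import math
from statistics import NormalDist


def chi_square_upper_point(v, p):
    """Approximate chi-square value exceeded with probability p, for large v.

    Uses sqrt(2 * chi2) - sqrt(2v - 1) ~ normal(0, 1), solved for chi2.
    """
    z = NormalDist().inv_cdf(1.0 - p)  # normal value with upper-tail probability p
    return 0.5 * (z + math.sqrt(2.0 * v - 1.0)) ** 2


# Comparison at the edge of the table: v = 30, P = 0.05 is tabulated as 43.773.
print(f"{chi_square_upper_point(30, 0.05):.2f}")

# Beyond the table, for example v = 50 degrees of freedom.
print(f"{chi_square_upper_point(50, 0.05):.2f}")
```

At v = 30 the approximation differs from the tabulated value by a few tenths, which gives a feel for the size of the error just beyond the range of Table C.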


TABLE D The F-Distribution

$$\Pr(F \le b) = \int_{0}^{b} \frac{\Gamma[(r_1 + r_2)/2]\,(r_1/r_2)^{r_1/2}\, w^{r_1/2 - 1}}{\Gamma(r_1/2)\,\Gamma(r_2/2)\,(1 + r_1 w/r_2)^{(r_1 + r_2)/2}}\, dw$$


Entries are the values b for numerator degrees of freedom r1 (columns) and denominator degrees of freedom r2 (rows).

                     r1
r2    Pr(F ≤ b)      1       2       3       4       5       6       7       8       9       10      12      15
1     0.95           161     200     216     225     230     234     237     239     241     242     244     246
      0.975          648     800     864     900     922     937     948     957     963     969     977     985
      0.99           4052    4999    5403    5625    5764    5859    5928    5982    6023    6056    6106    6157
2     0.95           18.5    19.0    19.2    19.2    19.3    19.3    19.4    19.4    19.4    19.4    19.4    19.4
      0.975          38.5    39.0    39.2    39.2    39.3    39.3    39.4    39.4    39.4    39.4    39.4    39.4
      0.99           98.5    99.0    99.2    99.2    99.3    99.3    99.4    99.4    99.4    99.4    99.4    99.4
3     0.95           10.1    9.55    9.28    9.12    9.01    8.94    8.89    8.85    8.81    8.79    8.74    8.70
      0.975          17.4    16.0    15.4    15.1    14.9    14.7    14.6    14.5    14.5    14.4    14.3    14.3
      0.99           34.1    30.8    29.5    28.7    28.2    27.9    27.7    27.5    27.3    27.2    27.1    26.9
4     0.95           7.71    6.94    6.59    6.39    6.26    6.16    6.09    6.04    6.00    5.96    5.91    5.86
      0.975          12.2    10.6    9.98    9.60    9.36    9.20    9.07    8.98    8.90    8.84    8.75    8.66
      0.99           21.2    18.0    16.7    16.0    15.5    15.2    15.0    14.8    14.7    14.5    14.4    14.2
5     0.95           6.61    5.79    5.41    5.19    5.05    4.95    4.88    4.82    4.77    4.74    4.68    4.62
      0.975          10.0    8.43    7.76    7.39    7.15    6.98    6.85    6.76    6.68    6.62    6.52    6.43
      0.99           16.3    13.3    12.1    11.4    11.0    10.7    10.5    10.3    10.2    10.1    9.89    9.72
6     0.95           5.99    5.14    4.76    4.53    4.39    4.28    4.21    4.15    4.10    4.06    4.00    3.94
      0.975          8.81    7.26    6.60    6.23    5.99    5.82    5.70    5.60    5.52    5.46    5.37    5.27
      0.99           13.7    10.9    9.78    9.15    8.75    8.47    8.26    8.10    7.98    7.87    7.72    7.56
7     0.95           5.59    4.74    4.35    4.12    3.97    3.87    3.79    3.73    3.68    3.64    3.57    3.51
      0.975          8.07    6.54    5.89    5.52    5.29    5.12    4.99    4.90    4.82    4.76    4.67    4.57
      0.99           12.2    9.55    8.45    7.85    7.46    7.19    6.99    6.84    6.72    6.62    6.47    6.31
8     0.95           5.32    4.46    4.07    3.84    3.69    3.58    3.50    3.44    3.39    3.35    3.28    3.22
      0.975          7.57    6.06    5.42    5.05    4.82    4.65    4.53    4.43    4.36    4.30    4.20    4.10
      0.99           11.3    8.65    7.59    7.01    6.63    6.37    6.18    6.03    5.91    5.81    5.67    5.52
9     0.95           5.12    4.26    3.86    3.63    3.48    3.37    3.29    3.23    3.18    3.14    3.07    3.01
      0.975          7.21    5.71    5.08    4.72    4.48    4.32    4.20    4.10    4.03    3.96    3.87    3.77
      0.99           10.6    8.02    6.99    6.42    6.06    5.80    5.61    5.47    5.35    5.26    5.11    4.96
10    0.95           4.96    4.10    3.71    3.48    3.33    3.22    3.14    3.07    3.02    2.98    2.91    2.85
      0.975          6.94    5.46    4.83    4.47    4.24    4.07    3.95    3.85    3.78    3.72    3.62    3.52
      0.99           10.0    7.56    6.55    5.99    5.64    5.39    5.20    5.06    4.94    4.85    4.71    4.56
12    0.95           4.75    3.89    3.49    3.26    3.11    3.00    2.91    2.85    2.80    2.75    2.69    2.62
      0.975          6.55    5.10    4.47    4.12    3.89    3.73    3.61    3.51    3.44    3.37    3.28    3.18
      0.99           9.33    6.93    5.95    5.41    5.06    4.82    4.64    4.50    4.39    4.30    4.16    4.01
15    0.95           4.54    3.68    3.29    3.06    2.90    2.79    2.71    2.64    2.59    2.54    2.48    2.40
      0.975          6.20    4.77    4.15    3.80    3.58    3.41    3.29    3.20    3.12    3.06    2.96    2.86
      0.99           8.68    6.36    5.42    4.89    4.56    4.32    4.14    4.00    3.89    3.80    3.67    3.52
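As with the t-distribution, entries of Table D can be checked by integrating the F density numerically. The sketch below uses Simpson's rule and is restricted to r1 of at least 2, because for r1 = 1 the density is unbounded at zero and this simple rule is not suitable; both the method and the function names are assumptions made for the illustration.

```python
import math


def f_density(w, r1, r2):
    """Density of the F-distribution with r1 and r2 degrees of freedom."""
    c = (math.gamma((r1 + r2) / 2) * (r1 / r2) ** (r1 / 2)
         / (math.gamma(r1 / 2) * math.gamma(r2 / 2)))
    return c * w ** (r1 / 2 - 1) / (1 + r1 * w / r2) ** ((r1 + r2) / 2)


def f_cdf(b, r1, r2, steps=2000):
    """Pr(F <= b) by Simpson's rule on [0, b]; steps must be even, r1 >= 2."""
    if r1 < 2:
        raise ValueError("this sketch only handles r1 >= 2")
    h = b / steps
    total = f_density(0.0, r1, r2) + f_density(b, r1, r2)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2  # Simpson weights 4, 2, 4, ..., 4
        total += weight * f_density(i * h, r1, r2)
    return total * h / 3


# Table D gives b = 4.10 for r1 = 2, r2 = 10 and Pr(F <= b) = 0.95.
print(f"{f_cdf(4.10, 2, 10):.3f}")
# Table D gives b = 3.48 for r1 = 4, r2 = 10 and Pr(F <= b) = 0.95.
print(f"{f_cdf(3.48, 4, 10):.3f}")
```

For r1 = 2 the integral has the closed form 1 - (1 + 2b/r2)^(-r2/2), which offers an independent check on the numerical result.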


