RM File
INDEX
S.No.  Particulars                                    Page No.
1.     Worksheet – 1: Frequency Distribution          7-11
2.     Worksheet – 2: Measures of Central Tendency    12-15
3.     Worksheet – 3: Outlier Testing                 16-20
4.     Worksheet – 4: One Sample t-test               21-25
5.     Worksheet – 5: Paired Sample t-test            26-29
6.     Worksheet – 6: Independent Sample t-test       30-33
7.     Worksheet – 7: One-way ANOVA                   34-41
8.     Worksheet – 8: Chi-Square test                 42-48
LIST OF TABLES
S.No.  Table No.  Particulars                                                                Page No.
1.     1.1        Table representing data of workers in small- and medium-scale enterprises  7-8
14.    6.2        Table representing SPSS output of independent sample t-test                33
24.    8.3        Table representing SPSS output of Chi-Square test                          46
LIST OF FIGURES
S.No.  Figure No.  Particulars                                                                  Page No.
1.     1.1         Figure representing screenshot of frequency distribution (step 1)            9
2.     1.2         Figure representing screenshot of frequency distribution (step 2)            9
15.    4.4         Figure representing screenshot of one sample t-test (step 4)                 24
16.    5.1         Figure representing screenshot of distribution of before and after training  27
28.    8.3         Figure representing screenshot of chi-square test (step 3)                   45
WORKSHEET – 1 Frequency Distribution
Frequency distribution is a method of displaying the frequency (the number of times a particular value of a variable repeats in the data) of the different values of a variable in a dataset. It represents the counts of all outcomes of a variable in a sample. The frequency distribution of a variable can be represented in tabular as well as graphical forms.
S.NO. Gender Age Religion Education S.NO. Gender Age Religion Education
Group Group
1 1 1 3 2 17 1 1 1 5
2 1 4 2 1 18 1 5 1 5
3 1 3 3 4 19 1 5 2 2
4 1 3 1 3 20 1 2 2 5
5 2 4 1 1 21 2 5 2 2
6 1 4 1 1 22 1 2 2 1
7 2 2 1 1 23 1 2 3 1
8 1 2 3 1 24 1 2 1 5
9 1 2 2 1 25 2 5 2 5
10 2 2 2 2 26 1 3 3 2
11 1 3 1 2 27 1 1 1 2
12 1 3 1 3 28 1 2 2 2
13 1 4 1 4 29 1 2 2 4
14 2 1 2 3 30 1 2 2 2
15 1 5 2 2 31 1 3 3 5
16 2 2 2 2 32 1 2 2 1
S.NO. Gender Age Religion Education S.NO. Gender Age Religion Education
Group Group
33 2 2 2 2 42 1 2 3 1
34 1 5 2 1 43 1 3 2 1
35 2 5 1 2 44 1 5 2 5
36 2 5 2 3 45 1 2 1 2
37 2 2 3 4 46 2 5 2 3
38 2 5 2 32 47 2 2 1 2
39 1 3 3 3 48 1 3 3 4
40 1 5 2 2 49 1 4 2 4
41 1 2 1 1 50 2 1 1 1
The coding details of the different variables in the dataset are shown below in Table 1.2.
Gender      1 = Male
            2 = Female
Age Group   2 = 26-35 yrs.
            3 = 36-45 yrs.
            4 = 46-55 yrs.
            5 = 56 yrs. and above
Religion    1 = Hindu
            2 = Muslim
            3 = Other Religion
Education   2 = High School
            3 = Intermediate
            4 = Technical Diploma
            5 = Degree Level
SPSS Commands
Step 1: Click Analyze → Descriptive Statistics → Frequencies
Figure 1.2: Screenshot of frequency distribution (step 2)
Step 3: Select the type of chart.
Example: Bar chart as shown in the Figure 1.3.
The final SPSS output in tabular form is shown below in Table 1.3.

Education of Workers
               Frequency   Percent   Valid Percent   Cumulative Percent
Diploma
Degree Level   7           14.0      14.0            100.0
Total          50          100.0     100.0
Figure 1.4: Bar Chart of education of workers
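The frequency table SPSS produces above can be sketched in a few lines of Python. This is only an illustration: the ten education codes below are a hypothetical list (codes 2-5 follow the coding in Table 1.2), not the actual 50-worker dataset.

```python
from collections import Counter

# Hypothetical education codes for ten workers (illustrative only).
education = [5, 2, 1, 5, 3, 2, 2, 5, 4, 2]

counts = Counter(education)
n = len(education)

cumulative = 0.0
for code in sorted(counts):
    freq = counts[code]
    percent = 100.0 * freq / n
    cumulative += percent
    # Columns mirror the SPSS output: value, frequency, percent, cumulative percent.
    print(f"{code}\t{freq}\t{percent:.1f}\t{cumulative:.1f}")
```

The `Counter` does the same counting that the Frequencies procedure performs before computing percentages.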
WORKSHEET – 2 Measures of Central Tendency
There are three main measures of central tendency. These are as follows:
• Arithmetic mean
• Median
• Mode
Let us discuss these three in detail.
Arithmetic Mean
The mean of a variable represents its average value. For a frequency distribution, it can be calculated by using the following formula:

x̄ = (Σ fᵢxᵢ) / (Σ fᵢ)

where x̄ represents the mean, xᵢ the value of the i-th observation of the variable, and fᵢ the frequency of the i-th observation.
One of the problems with the arithmetic mean is that it is highly sensitive to the presence of outliers in the data of the related variable. To avoid this problem, the trimmed mean of the variable can be estimated. The trimmed mean is the value of the mean of a variable after removing some extreme observations (for example, 2.5 percent from each tail of the distribution) from the frequency distribution.
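The idea of the trimmed mean can be sketched as follows. This is a minimal illustration with made-up numbers; the 10% trim proportion is chosen only so that one value is dropped from each tail (the text above mentions 2.5 percent).

```python
import statistics

def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of observations from each tail."""
    data = sorted(values)
    k = int(len(data) * proportion)          # observations dropped per tail
    trimmed = data[k:len(data) - k] if k else data
    return statistics.mean(trimmed)

# One extreme value (500) pulls the ordinary mean far above the typical
# observation; the trimmed mean stays close to the bulk of the data.
sample = [10, 12, 11, 13, 12, 11, 10, 12, 11, 500]
ordinary = statistics.mean(sample)           # 60.2
trimmed = trimmed_mean(sample, 0.1)          # 11.5
```

Dropping just the smallest and largest values brings the mean back in line with the rest of the observations, which is exactly why the trimmed mean is preferred when outliers are present.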
Median
Median is known as the ‘positional average’ of a variable. If we arrange
the observations of a variable in an ascending or descending order, the
value of the observation that lies in the middle of the series is known as
median. The value of the median divides the observations of a variable
into two equal halves. Half of the observations of the variable are higher than the median value and the other half are lower than the median value. Extensions of the median are quartiles, deciles, and percentiles.
Mode
The mode of a variable is the observation with the highest frequency or
highest concentration of frequencies.
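The three measures above can be computed directly with Python's statistics module. As a sketch, the ten sales figures below are the first ten months from Table 2.1.

```python
import statistics

# First ten months of sales from Table 2.1.
sales = [60, 70, 45, 90, 110, 40, 90, 50, 70, 65]

mean = statistics.mean(sales)                    # arithmetic mean
median = statistics.median(sales)                # positional average
mode = statistics.mode(sales)                    # most frequent value
q1, q2, q3 = statistics.quantiles(sales, n=4)    # quartile cut points

print(mean, median, mode, (q1, q2, q3))
```

Note that this small sample has two modes (70 and 90 each occur twice); `statistics.mode` returns the first one encountered, much as SPSS reports the smallest mode with a footnote.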
Objective: To calculate the mean, median, mode and quartiles of the monthly sales of a company.
Table 2.1: Monthly sales Figures of 50 consecutive months of an
enterprise
Month  Sales   Month  Sales   Month  Sales   Month  Sales   Month  Sales   Month  Sales   Month  Sales
1      60      9      70      17     15      25     120     33     68      41     34      49     89
2      70      10     65      18     40      26     130     34     65      42     56      50     100
3      45      11     54      19     54      27     23      35     70      43     97
4      90      12     72      20     56      28     32      36     60      44     34
5      110     13     45      21     25      29     54      37     30      45     54
6      40      14     24      22     43      30     34      38     40      46     70
7      90      15     12      23     56      31     45      39     110     47     98
8      50      16     8       24     120     32     49      40     150     48     45
SPSS Commands
Step 1: Click Analyze → Descriptive Statistics → Frequencies
Step 2: Transfer the variable to variable window and click ‘statistics’ as shown
in the Figure 2.2.
Table 2.2: SPSS output of measures of central tendency

Statistics
Sales of an enterprise
N            Valid     50
             Missing   0
Mean                   61.42
Median                 55.00
Mode                   45a
Percentiles  25        40.00
             50        55.00
             75        76.25
a. Multiple modes exist. The smallest value is shown
Conclusion
Table 2.2 shows that the average monthly sales of the enterprise are 61.42, the median is 55.00 and the mode is 45 (multiple modes exist; the smallest is reported). The quartiles show that 25% of the months had sales below 40.00 and 75% of the months had sales below 76.25.
WORKSHEET – 3 Outlier Testing
Outliers are observations that lie unusually far from the other values of a variable in a dataset. In SPSS, outliers can be examined in two main ways:
1. Extreme values table
2. Box plot
S.NO Gender Age Sports Hours S.NO Gender Age Sports Hours S.NO. Gender Age Sports Hours
7 1 2 2 2.5 18 1 3 5 2.5 29 1 3 3 5.0
8 1 3 3 5 19 2 1 2 3.5 30 1 1 1 1.0
9 2 2 5 2.0 20 1 1 4 3.0
10 2 2 5 1.0 21 2 2 3 3.0
11 1 2 5 1.0 22 1 2 5 13.0
The coding details of the different variables in the dataset are shown below in Table 3.2.

Gender   1 = Male
         2 = Female
Age      2 = 20-24 yrs.
         3 = 25-29 yrs.
Sports   1 = Badminton
         2 = Hockey
         3 = Cricket
         4 = Rugby
         5 = Football
SPSS Commands
Step 1: Click Analyze → Descriptive Statistics → Explore
Figure 3.1: Screenshot of outlier testing (step 1)
Step 2: Send the 'hours spent' variable to the dependent list and then click 'Statistics'.
Step 3: Select ‘outliers’ and click ‘continue’ as shown in Figure 3.3.
The required output is shown in Table 3.3 and box plot diagram is shown in
Figure 3.4.
Extreme Values

Number of hours played by the player
                 Case Number   Value
Highest   1      22            13.00
          2      8             5.00
          3      29            5.00
          4      4             4.50
          5      26            4.50
Lowest    1      30            1.00
          2      11            1.00
          3      10            1.00
          4      15            1.50
          5      13            1.50a
a. Only a partial list of cases with the value 1.50 is shown in the table of lower extremes.
Conclusion
Table 3.3 represents the SPSS output of outlier testing. It shows the extreme high and extreme low values in the sportsmen dataset. Case numbers 22, 8, 29, 4 and 26 have extreme high values, and case numbers 30, 11, 10, 15 and 13 have extreme low values. Figure 3.4 shows that case number 22 is an outlier.
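The box-plot flagging above can be sketched with the usual 1.5 × IQR rule: any value more than 1.5 times the interquartile range beyond the quartiles is marked as an outlier. The hours below are a hypothetical sample that includes the 13.0-hour observation flagged in the output.

```python
import statistics

# Hypothetical weekly hours for 14 players (illustrative, not Table 3.1).
hours = [2.5, 5.0, 2.0, 1.0, 1.0, 2.5, 3.5, 3.0, 3.0, 13.0, 5.0, 1.0, 4.5, 1.5]

q1, _, q3 = statistics.quantiles(hours, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside the [lower, upper] fences is flagged, as in a box plot.
outliers = [x for x in hours if x < lower or x > upper]
print(outliers)
```

For this sample only the 13.0-hour value falls outside the fences, mirroring the single outlier (case 22) seen in Figure 3.4.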
WORKSHEET – 4 One Sample t-test
For example, a manufacturer may claim that the average mileage of a car is, say, 199.9 kmpl, or a business school may claim that the average package offered to its students is Rs. 12 lakh per annum. A researcher may be interested in analyzing the truthfulness of these claims. For this analysis, the researcher needs to randomly pick a small sample from the population and compare its mean with the claimed population mean. The sample mean and the population mean may be different from each other. In order to test whether this difference is statistically significant, we should apply the one sample t-test.
Objective: To find out the difference between population mean and sample
mean.
The dataset of weight lost (in kgs) by 50 customers a month after joining the weight loss program is shown in Table 4.1.
Customer  Kg lost   Customer  Kg lost   Customer  Kg lost
1         2         18        4         35        4
2         3         19        3         36        4
3         2         20        4         37        3
4         4         21        5         38        4
5         5         22        6         39        3
6         3         23        4         40        4
7         3         24        5         41        5
8         2         25        6         42        4
9         3         26        5         43        3
10        4         27        4         44        4
11        2         28        4         45        5
12        3         29        5         46        5
13        3         30        5         47        4
14        4         31        6         48        5
15        3         32        2         49        6
16        4         33        5         50        5
17        5         34        5
Table 4.1: Data of weight lost by 50 customers a month after joining the weight
loss program
Step 1: Click Analyze → Compare Means → One-Sample T Test
The final SPSS (Statistical Package for the Social Sciences) output in tabular form is shown below in Table 4.2 and Table 4.3 respectively.
Table 4.2
One-Sample Statistics
Table 4.3
One-Sample Test
Test Value = 0
Conclusion: The sample mean is 4.02 kgs, which is less than the claimed population mean of 5 kgs. The t statistic is found to be 25.481 with a p value of .000. Since the p value of the t statistic is less than the 5% level of significance, at the 95% confidence level the null hypothesis of no difference between the sample mean and the population mean cannot be accepted, and it can be concluded that the sample mean is significantly different from the population mean. Therefore, the company is making a wrong statement about the weight loss of its customers.
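The one sample t statistic that SPSS reports can be computed by hand as t = (x̄ − μ₀) / (s / √n). The sketch below uses a small hypothetical sample of weight lost (not the full Table 4.1) tested against the claimed mean of 5 kg.

```python
import math
import statistics

# Hypothetical subset of weight lost (kg); mu0 is the claimed population mean.
sample = [2, 3, 2, 4, 5, 3, 3, 2, 3, 4]
mu0 = 5

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)               # sample standard deviation
t = (mean - mu0) / (s / math.sqrt(n))      # one sample t statistic
print(t)
```

A large negative t here reflects a sample mean well below the claimed 5 kg; SPSS would additionally report the p value from the t distribution with n − 1 degrees of freedom.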
WORKSHEET – 5 Paired Sample t-test
A paired sample t-test is also known as a repeated sample t-test because data (responses) are collected from the same respondents but at different time periods. A paired sample t-test should be used when we want to test the impact of an event or experiment on the variable under study. In this case, the data is collected from the same respondents before and after the event. After this, the means are compared. The null hypothesis of the paired sample t-test is that the means of the pre-sample and post-sample are equal. Some of the instances where the paired sample t-test can be applied are as follows:
Table 5.1: Pre-training and post-training scores

Pre   Post   Pre   Post   Pre   Post
56    82     38    67     65    68
45    76     44    56     53    56
56    78     76    91     49    53
34    64     34    48     42    56
56    62     38    68     53    76
42    60     42    67     58    82
43    68     83    90     34    45
56    69     72    87     43    76
70    78     47    64     45    67
56    87     48    53     65    72
Fig 5.1: Screenshot of distribution of before and after training.
Step 2: Click on the variable 'pre training score'. Then click on the 'post training score' variable. Now, move the paired variables into the Paired Variables box by clicking on the right arrow button. Finally, click 'OK' as shown in Fig 5.2.
Paired Samples Test
Paired Differences: Mean, Std. Deviation, Std. Error Mean, 95% Confidence Interval of the Difference (Lower, Upper); t; df; Sig. (2-tailed)
SPSS Output
Paired Samples Correlations
N Correlation Sig.
Since the significance value is .002, which is less than the 5% significance level, we can state with 95% confidence that the null hypothesis is rejected. Hence, there is a significant difference between the scores before and after training.
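The paired t statistic works on the per-respondent differences: t = d̄ / (s_d / √n). The sketch below assumes the first column pair of Table 5.1 holds the pre- and post-training scores for six respondents; this pairing of columns is an assumption about the table layout.

```python
import math
import statistics

# First six pre/post pairs from Table 5.1 (column pairing assumed).
pre = [56, 45, 56, 34, 56, 42]
post = [82, 76, 78, 64, 62, 60]

d = [a - b for a, b in zip(post, pre)]     # post-minus-pre differences
n = len(d)
d_mean = statistics.mean(d)
s_d = statistics.stdev(d)                  # std. deviation of the differences
t = d_mean / (s_d / math.sqrt(n))          # paired sample t statistic
print(t)
```

Because every difference is positive, d̄ is large relative to its standard error and t comes out strongly positive, consistent with the significant improvement found above.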
WORKSHEET – 6 Independent Sample t-test
When we want to test the difference between two independent sample means, we use the independent sample t-test. The independent samples may belong to the same population or to different populations. Some of the instances in which the independent samples t-test can be used are as follows:
45 Male 60 34 Female 54 65 Male 26
58 Female 32 56 Male 34 63 Female 30
32 Female 25 76 Female 45 68 Male 34
65 Male 23 87 Male 40 87 Female 45
34 Male 26 54 Female 60
Step 2: Send the test variable 'performance score' to the Test Variable(s) window. Then send the 'age' variable to the Grouping Variable box and click 'Define Groups' as shown in Figure 6.2.
Figure 6.2: Screenshot of independent sample T-Test (step 2)
Step 3: Now define the cut point as 40. Next click ‘continue’ as shown in Figure
6.3.
The final SPSS (Statistical Package for the Social Sciences) output in tabular form is shown below in Table 6.2 and Table 6.3 respectively.
Table 6.2: SPSS Output of independent sample T-Test
Group Statistics

Table 6.3: Independent Samples Test (performance score)
Equal variances assumed:     Levene's F = 1.408, Sig. = .241; t = 2.717, df = 48, Sig. (2-tailed) = .009, Mean Difference = 13.79221, Std. Error Difference = 5.07660, 95% CI [3.58502, 23.99940]
Equal variances not assumed: t = 2.675, df = 42.170, Sig. (2-tailed) = .011, Mean Difference = 13.79221, Std. Error Difference = 5.15663, 95% CI [3.38696, 24.19746]
Conclusion: The average performance score for employees less than 40 years of age is 68.86 with standard deviation 19.075, and the average performance score for employees more than 40 years of age is 55.07 with standard deviation 16.77. Table 6.3 shows the result of Levene's test, which assumes the null hypothesis that all sample variances are the same; the significance value of 0.241 indicates that at the 95% level of confidence the null hypothesis of equal variances is accepted. Table 6.3 also shows that the t statistic is 2.717 with a p value of .009, which is less than the 5% level of significance. Hence, with 95% confidence, the null hypothesis of no significant difference in the average performance of employees below and above 40 years of age is not accepted.
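The "equal variances assumed" t statistic above uses a pooled variance. As a sketch, the two groups below are hypothetical performance scores (not the worksheet's 50-employee dataset), split at the 40-year cut point.

```python
import math
import statistics

# Hypothetical performance scores for the two age groups.
below_40 = [60, 75, 82, 68, 71]
above_40 = [54, 60, 45, 58, 50]

n1, n2 = len(below_40), len(above_40)
m1, m2 = statistics.mean(below_40), statistics.mean(above_40)
v1, v2 = statistics.variance(below_40), statistics.variance(above_40)

# Pooled variance weights each group's variance by its degrees of freedom.
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(t, df)
```

When Levene's test rejects equal variances, SPSS's second row uses Welch's version instead, with separate group variances and adjusted degrees of freedom.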
WORKSHEET – 7 One-way ANOVA
Independent-samples t-test can be applied to situations where there are only two
independent samples. In other words, we can use independent-samples t-tests
for comparing the means of two populations (such as males and females). When
we have more than two independent samples, t-test is inappropriate. The
Analysis of Variance (ANOVA) has an advantage over t-test when the
researcher wants to compare the means of a large number of populations (i.e.,
three or more). ANOVA is a parametric test that is used to study the difference
among more than two groups in the datasets. It helps in explaining the amount
of variation in the dataset. In a dataset, two main types of variation can occur: one type occurs due to specific, assignable causes, while the other occurs due to chance. These variations are studied separately in ANOVA to identify the actual cause of variation and help the researcher in taking effective decisions.
In case of more than two independent samples, the ANOVA test explains three
types of variance. These are as follows:
• Total variance
• Between group variance
• Within group variance
The ANOVA test is based on the logic that if the between group variance is
significantly greater than the within group variance, it indicates that the means
of different samples are significantly different.
There are two main types of ANOVA, namely, one-way ANOVA and two-way
ANOVA. One-way ANOVA determines whether all the independent samples
(groups) have the same group means or not. On the other hand, two-way
ANOVA
is used when you need to study the impact of two categorical variables on a
scale variable.
H0: There is no difference between the salaries of graduates, postgraduates and PhDs. H1: There is a difference between the salaries of graduates, postgraduates and PhDs.
Salary Qualification
65000 Postgraduate
60000 Postgraduate
45000 Graduate
40000 PhD
35000 Graduate
56000 Postgraduate
36000 PhD
45000 PhD
40000 Postgraduate
35000 Graduate
56000 PhD
36000 PhD
25000 Graduate
23000 Graduate
40000 Graduate
35000 Postgraduate
56000 PhD
36000 Postgraduate
45000 PhD
40000 Graduate
35000 Postgraduate
56000 PhD
37000 Postgraduate
25000 Graduate
85000 PhD
32000 Postgraduate
29000 Graduate
25000 Graduate
Table 7.2: Dataset of coding of different variables
Qualification 1 = Graduate
2 = Post Graduate
3 = PhD
SPSS Commands
Step 1: Click Analyze → Compare Means → One-Way ANOVA
Figure 7.1: Screenshot of one-way ANOVA (step 1)
Step 2: Transfer the variable ‘salary’ to dependent list window and variable
‘qualification’ to factor window.
Step 3: Select ‘Post hoc’ and then click ‘Tukey’ as shown below in Figure 7.3.
Figure 7.3: Screenshot of one-way ANOVA (step 3)
Step 4: Click ‘options’ and select ‘Homogeneity of variance test’ and ‘Means
plot’ as shown in Figure 7.4.
The final SPSS (Statistical Package for the Social Sciences) output in tabular form is shown in Table 7.3, Table 7.4, Table 7.5, Table 7.6 and Figure 7.5 respectively.
Test of Homogeneity of Variances
Levene Statistic   df1   df2   Sig.
1.900              2     25    .171

ANOVA
Salary of different qualifications
        Sum of Squares   df
        5354678571.4     27
Total                    29
Multiple Comparisons (Tukey HSD)
(I) Qualification   (J) Qualification   Mean Difference (I-J)   Std. Error   Sig.   95% CI Lower Bound   95% CI Upper Bound
Graduate            Postgraduate        -13000.000              5510.771     .066   -26726.40            726.40
Graduate            PhD                 -17675.000*             5845.056     .015   -32234.04            -3115.96
Postgraduate        Graduate            13000.000               5510.771     .066   -726.40              26726.40
Postgraduate        PhD                 -4675.000               5845.056     .707   -19234.04            9884.04
PhD                 Graduate            17675.000*              5845.056     .015   3115.96              32234.04
PhD                 Postgraduate        4675.000                5845.056     .707   -9884.04             19234.04
*. The mean difference is significant at the 0.05 level.
Figure 7.5: Screenshot of SPSS Output-Graphical form.
Conclusion: Table 7.3 presents the Levene test, which assumes the null hypothesis that all sample variances are the same. The significance value of 0.171 indicates that at the 95% level of confidence the null hypothesis can be accepted. Homogeneity of variance is one of the desired conditions of the one-way ANOVA test. Table 7.4 presents the results of the F test in the one-way ANOVA. The p value of the F statistic (F = 5.591) is less than the 5% level of significance. Hence, with 95% confidence, the null hypothesis of equal group means cannot be accepted. Thus it can be concluded that the average salaries of graduates, postgraduates and PhDs are not the same.
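The F statistic behind this conclusion is the ratio of between-group to within-group variance. The sketch below uses three small hypothetical salary groups (in thousands), not the worksheet's 28-observation dataset.

```python
import statistics

# Hypothetical salary groups, in thousands (illustrative only).
groups = {
    "Graduate":     [30, 40, 50],
    "Postgraduate": [45, 55, 65],
    "PhD":          [55, 65, 75],
}

all_values = [x for g in groups.values() for x in g]
grand_mean = statistics.mean(all_values)
k, n = len(groups), len(all_values)

# Between-group sum of squares: spread of the group means around the grand mean.
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups.values())
# Within-group sum of squares: spread of observations around their own group mean.
ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g)
                for g in groups.values())

f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
print(f_stat)
```

The resulting F is compared with the F distribution on (k − 1, n − k) degrees of freedom; a large F means the group means differ by more than the within-group noise can explain, which is exactly what the worksheet's F = 5.591 indicates.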
WORKSHEET – 8 Chi-Square Test
Chi-square test is one of the most popular non-parametric tests. It is used in two
cases which are as follows:
• To test the association between nominal variables in research.
• To test the difference between the expected and observed frequencies of
an event.
The chi-square test compares the actual observed frequencies with the calculated expected frequencies of different combinations of nominal variables. The difference between the observed and expected frequencies gives an indication of a possible association between the categorical variables. The chi-square statistic compares the observed count in each table cell to the count that would be expected for that row and column classification under the assumption of no association. A negligible difference between observed and expected frequencies may indicate no association, whereas a big difference may indicate the possibility of an association.
Table 8.2 has the data collected from 100 internet users. The data consists of
two nominal variables ‘Level of familiarity with the internet’ and ‘Education
Background.’ The details of the codes provided to different sub-categories
of these nominal variables are shown in Table 8.1.
Table 8.1 Codes provided to sub-categories
Level of familiarity with the internet:   3 = High
Education background:                     3 = Technology
                                          4 = IT
SPSS Commands
Step 1: Click Analyze → Descriptive Statistics → Crosstabs
Figure 8.1: Screenshot of chi-square test (step 1)
Step 3: Select the ‘chi-square’ and ‘Phi and Cramer’s V’ and click ‘continue’ as
shown in Figure 8.3.
Figure 8.3: Screenshot of chi-square test (step 3)
Step 4: Click on ‘Cells’ and select ‘Observed’ and ‘Expected’ and click
‘Continue’ as shown in Figure 8.4.
Cases
Chi-Square Tests
a. 2 cells (16.7%) have expected count less than 5. The minimum expected
count is 4.80.
Symmetric Measures
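The observed-versus-expected comparison described above can be sketched directly. The 2x2 contingency table below is hypothetical (imagine familiarity level by education background); each expected count is (row total × column total) / grand total.

```python
# Hypothetical 2x2 contingency table of observed counts.
observed = [
    [10, 20],
    [20, 10],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        # Expected count under no association between row and column variables.
        e = row_totals[i] * col_totals[j] / total
        chi2 += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(chi2, df)
```

The chi-square statistic is then compared with the chi-square distribution on (rows − 1) × (columns − 1) degrees of freedom; SPSS's footnote about cells with expected counts below 5 is a warning that this approximation becomes unreliable for sparse tables.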