
STATISTICS

Course Code: MPC-006 Assignment Code: MPC-006/AST/TMA/2022-23 Marks: 100

SECTION A
Q1 Explain the meaning of descriptive statistics and describe organisation of data.
Ans: -
Statistics is nothing but the science of collecting, classifying, analysing, presenting
and interpreting data.
In behavioural science, statistics is used to study data related to human behaviour,
cognition, and social interactions. Statistics play a fundamental role in the study of
behaviour by providing quantitative methods for organizing, summarizing, and
drawing meaningful conclusions from data.
The science of statistics may be broadly studied under two headings:
(i) Descriptive Statistics, and (ii) Inferential Statistics
Descriptive Statistics - Observations of human behaviour show many facets of variation.
Every person has a different personality type, a different attitude and a different level
of intelligence. It is therefore necessary to have precise characteristics that segregate
and differentiate groups into various categories. For this purpose, observations need to
be expressed as a single estimate which summarises them.
Descriptive statistics provide numerical measures, tabulation, diagrammatic and
graphical representations, and summary tables that help to understand the
distribution, central tendency, variability, and other properties of the data.
These measures enable researchers to understand the tendency of the data or scores, which
makes it easier to describe the phenomena. Such a single estimate that summarises a series
of data is known as a parameter of the distribution, and these parameters define the
distribution completely.
ORGANISATION OF DATA
There are four major statistical techniques for organising the data. These are:
i) Classification
ii) Tabulation
iii) Graphical Presentation
iv) Diagrammatical Presentation
CLASSIFICATION - The segregation of data into groups is known as classification.
Conclusions are drawn after the data have been classified into categories, and the data
become more meaningful when organised into a frequency distribution. A frequency
distribution shows the number of cases falling within a given class interval or range of scores.
Frequency distribution can be broadly divided into grouped data and ungrouped
data.
An ungrouped frequency distribution means representation of data where individual
values are listed along with their corresponding frequencies. In this type of
distribution, each unique value in a dataset is recorded separately, without
combining or grouping similar values into intervals or ranges.
A grouped frequency distribution means a representation of data where values are
grouped into intervals or ranges, and their corresponding frequencies are recorded.
It is often used when dealing with a large dataset or when the data spans a wide
range of values.
Construction of frequency distribution
To prepare a frequency distribution it is essential to determine the following:
1) The range of the given data, which is the difference between the highest and lowest
scores.
2) The number of class intervals. There is no fixed rule for the number of classes into
which the data should be grouped; if there are very few scores, it is pointless to have a
large number of class intervals.
3) The limits of each class interval.
Another factor used in determining the number of classes is the size/ width or range
of the class which is known as ‘class interval’ and is denoted by ‘i’.
Class interval should be of uniform width resulting in the same-size classes of
frequency distribution.
There are three methods for describing the class limits for distribution:
(i) Exclusive method, (ii) Inclusive method and (iii) True or actual class method
Exclusive method
In this method, the class limits are defined in such a way that the upper class limit of
one class is the lower class limit of the following class, for example 1-5, 5-10, 10-15,
15-20. A score equal to the upper limit of a class is excluded from that class, which
means a score of 15 will be included in the class 15-20, not 10-15.
Inclusive method: In this method, the class limits are defined in such a way that
both the lower and upper class limits are included in the class. This classification
includes scores that are equal to the upper limit of the class, for example 1-4, 5-9, 10-14, 15-19.
True or actual class method: Mathematically, a score is an interval when it extends
from 0.5 units below to 0.5 units above the face value of the score on a continuum.
These class limits are known as true or actual class limits, e.g., 0.5-4.5, 4.5-9.5,
9.5-14.5, 14.5-19.5.
Types of Frequency Distribution - Frequencies of data can be arranged in several ways;
a few of them are:
Relative frequency distribution It is a concept that provides information about the
proportion or percentage of observations in a dataset that fall within each category or
range of values. It is used to analyse and summarize categorical or numerical data
Cumulative frequency distribution, is a concept that shows the accumulation of
frequencies up to a given value or category in a dataset. It provides information
about the total number of observations that fall below or equal to a particular value or
category.
Cumulative relative frequency distribution: A cumulative relative frequency
distribution is one in which the entry of any score of class interval expresses that
score’s cumulative frequency as a proportion of the total number of cases.
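To make the construction steps above concrete, here is a minimal Python sketch; the scores and the class width i = 5 are hypothetical illustration data. It tallies a grouped frequency distribution with the exclusive method and also prints the relative and cumulative frequencies described above.

```python
# A minimal sketch (hypothetical scores, class width i = 5) of a grouped
# frequency distribution built with the exclusive method: a score equal
# to an upper limit falls into the next class.
scores = [12, 7, 19, 3, 15, 11, 8, 17, 4, 14, 9, 16, 2, 13, 6]
i = 5                                  # class interval width
start = (min(scores) // i) * i         # lower limit of the first class

freq = {}
for s in scores:
    lower = start + ((s - start) // i) * i
    freq[(lower, lower + i)] = freq.get((lower, lower + i), 0) + 1

n, cum = len(scores), 0
for (lo, hi), f in sorted(freq.items()):
    cum += f
    # class interval, frequency, relative frequency, cumulative frequency
    print(f"{lo}-{hi}: f={f}, rel={f / n:.2f}, cum={cum}")
```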

Tabulation
Tabulation refers to the process of organizing and presenting data in a systematic
and structured manner. It involves creating tables that summarize and display the
distribution of variables or data values, making it easier to analyze and interpret the
information.
The main components of a table are:
i) Table number: When there is more than one table in a particular analysis, each table
should be marked with a number for reference and identification. The number should be
written in the centre at the top of the table.
ii) Title of the table: Every table should have an appropriate title, which describes
the content of the table. The title should be clear, brief, and self-explanatory.
Title of the table should be placed either centrally on the top of the table or just
below or after the table number.
iii) Caption: Captions are brief and self-explanatory headings for columns. Captions
may involve headings and sub-headings. The captions should be placed in the
middle of the columns.
iv) Stub: Stubs stand for brief and self-explanatory headings for rows. A stub of a
table is the left part of the table describing the rows. It is a placeholder for the row
labels that are used to identify the data in each row of the table.
v) Body of the table: This is the real table and contains numerical information or
data in different cells. This arrangement of data remains according to the
description of captions and stubs.
vi) Head note: This is written at the extreme right hand below the title and explains
the unit of the measurements used in the body of the tables.
vii) Footnote: This is a qualifying statement which is to be written below the table
explaining certain points related to the data which have not been covered in title,
caption, and stubs.
viii) Source of data: The source from which data have been taken is to be
mentioned at the end of the table.

Graphical Presentation of Data


Graphical presentation of data refers to the visual representation of data using
various types of graphs, charts, and diagrams. It involves using visual elements such
as lines, bars, pie slices, points, and shapes to present data in a clear and concise
manner. Graphical presentation helps to communicate complex information,
patterns, and relationships in a more accessible and intuitive way. It enables quick
and effective data analysis, interpretation, and communication. Some common types
of graphical presentations of data include:
Histogram: A histogram is a graphical representation of a grouped frequency
distribution with continuous classes. It is an area diagram and can be defined as a
set of rectangles whose bases lie along the intervals between class boundaries and
whose areas are proportional to the frequencies in the corresponding classes. In this
type of distribution the upper limit of a class is the lower limit of the following class.
Frequency polygon
A frequency polygon is a graphical representation of data that displays the
distribution of a dataset by connecting points with line segments across the different
classes or intervals. It is created by plotting the frequencies or counts of the data
values on a graph, typically with the data values on the x-axis and the corresponding
frequencies on the y-axis. It is made up of straight lines that connect the midpoints
of the upper edges of the rectangles in a histogram, and it helps in interpreting the
trend and shape of the data set.
Frequency curve: A frequency curve is a smooth free hand curve drawn through
frequency polygon. The objective of smoothing of the frequency polygon is to
eliminate as far as possible the random or erratic fluctuations that are present in the
data

Ogive
An ogive is a graphical representation of a cumulative frequency distribution. It is
commonly used in statistics to depict cumulative frequencies and analyse data sets.
The ogive graph displays cumulative frequencies as a line graph, where the
horizontal axis represents the data values, and the vertical axis represents the
cumulative frequencies.
Diagrammatic Presentation of Data
Diagrammatic and graphic presentation of data means visual representation of the
data. It shows a comparison between two or more sets of data and helps in the
presentation of highly complex data in its simplest form. There are different forms of
diagram e.g., Bar diagram, Sub-divided bar diagram, Multiple bar diagram, Pie
diagram and Pictogram
a) Bar diagram:
A bar diagram, also known as a bar chart or bar graph, is a visual representation of
data using rectangular bars of different lengths or heights. It is commonly used to
display and compare categorical data or discrete values. Bar diagrams are
particularly useful for presenting data that can be divided into distinct categories or
groups.
b) A sub-divided bar diagram, also known as a stacked bar chart or stacked bar
graph, is a variation of a bar diagram where the bars are divided into segments to
represent different sub-categories or components within each category. The
subdivided bar diagram allows for a visual comparison of both the overall
composition of the main categories and the relative contributions of the
subcategories within each category. It is particularly useful when analysing data that
can be divided into multiple dimensions or attributes.
c) Multiple Bar diagram: Multiple bar diagrams, also known as grouped bar charts
or clustered bar graphs, are used to compare data across multiple categories while
showing subcategories within each category. They are particularly useful when
comparing values between different groups and subgroups simultaneously. A set of
bars for persons, places or related phenomena is drawn side by side without any gap.
To distinguish between the different bars in a set, different colours or shades are used.
d) Pie diagram: A pie diagram, also known as a pie chart, is a circular graphical
representation of data that is divided into sectors or slices. It is commonly used to
display the composition or distribution of a whole in terms of its parts or categories.
In a pie diagram, each sector or slice represents a specific category or subgroup,
and its size or angle is proportional to the relative frequency, percentage, or
magnitude of that category compared to the whole. The entire pie represents 100%
or the total value of the data being represented.

Above discussed are some of the ways in which we can organise the data for
studying human behaviour and their patterns.
Q2 Explain the concept of normal curve with help of a diagram. Explain the
characteristics of normal probability curve.
ANS:
The Normal Distribution curve, also called the Gaussian Distribution, is the most
significant continuous probability distribution. It is also known as a bell curve and
is commonly used in statistics and probability theory to represent the distribution of
a continuous random variable. In the bell curve the mean, median and mode are
approximately equal. Many variables, from performance in a corporate setup to
socio-economic status to intelligence and physical attributes, can be described by
the bell curve.

The normal distribution is always centred at the average value. On the X axis we have
scores and on the Y axis we have frequencies; frequency is simply how often a score
occurs.

[Diagram: a symmetrical bell-shaped normal curve centred at the mean]

The above example of a bell curve could denote human height. People can be short, of
average height, tall, or anywhere in between. The Y axis represents the relative
probability of observing someone who is relatively short, really tall, or of average
height. It is rare to see someone who is extremely short, so the bell-shaped curve is
relatively low in that region; it is very likely to see people of average height, so
the curve is at its peak in that region; and it is very rare to see extremely tall
humans, so the curve is relatively low there as well.

Characteristics or Properties of the Normal Probability Curve (NPC)

1) The Normal Curve is Symmetrical:


The normal probability curve is symmetric, meaning that it is centered around its
mean (average) value. The curve is perfectly balanced on both sides of the mean,
and the left and right halves of the curve are mirror images of each other. The
symmetry of the Normal Curve is a result of its mathematical properties. The curve is
defined by a mathematical function called the probability density function (PDF),
which is a symmetric bell-shaped curve. The PDF gives the probability of observing
a particular value within a range of the distribution.

2) Unimodal: The normal curve is unimodal, which means it has a single peak. The
peak occurs at the mean value of the distribution. If a distribution is unimodal, it
means that there is only one peak in the curve. This implies that the data is
concentrated around a single value, which is the mean.
The peak represents the most probable value or the center of the distribution. It is
the value around which the data is most likely to cluster. As you move away from the
peak in either direction, the probability of observing a particular value decreases.
The unimodal characteristic of the Normal Curve is closely related to its symmetry.
Since the curve is symmetric, the peak occurs at the mean, and the probabilities on
either side of the peak decrease symmetrically. This creates a smooth, bell-shaped
curve with a single mode.

3) The Maximum Ordinate occurs at the Center:


The mean of a normal distribution represents the central tendency or the average
value around which the data is distributed. Since the curve is symmetric, the highest
point on the curve, known as the maximum ordinate, naturally occurs at the mean.
This is because the mean is the point of balance or the peak of the distribution.

4) The Normal Curve is Asymptotic to the X Axis:


When we say that the normal curve is asymptotic to the x-axis, it means that as the
values on the x-axis move away from the center of the curve (mean), the curve
approaches but never touches the x-axis. In other words, the tails of the normal
curve extend indefinitely in both directions along the x-axis.
The shape of the normal curve is such that it is highest at the mean and gradually
decreases as you move away from the mean in either direction.

The asymptotic nature of the normal curve to the x-axis means that the probability of
observing extreme values (far from the mean) becomes increasingly smaller but
never reaches zero. This implies that, in theory, there is always a possibility of
observing values that are extremely far from the mean, although the likelihood
becomes incredibly low.

5) The Height of the Curve declines Symmetrically


When we say that the height of the normal curve declines symmetrically, we mean
that the curve is perfectly symmetric around its mean. This symmetry indicates that
the probabilities of observing values on one side of the mean are equal to the
probabilities of observing the corresponding values on the other side.
In a normal distribution, the peak of the curve corresponds to the mean of the
distribution. As you move away from the mean in either direction, the height of the
curve gradually decreases. However, the rate at which it decreases is symmetrical
on both sides.
This symmetry is a fundamental characteristic of the normal distribution and is
mathematically defined by its probability density function. It implies that the
probabilities of observing values equidistant from the mean are the same. This
property has important implications in statistical analysis and allows us to make
inferences and estimate probabilities based on the known characteristics of the
normal distribution.

6) The Points of Influx occur at point ±1 Standard Deviation (± 1 σ)


The normal curve changes its direction from convex to concave at a point recognised
as point of influx.
By stating that the Points of Influx occur at ±1 standard deviation, it suggests that the
values falling within this range are more likely to occur compared to those farther
away from the mean. Specifically, it implies that around 68% of the data points in the
distribution are expected to fall within this range.
EG Suppose we have a dataset of exam scores, and the mean score is 70 with a
standard deviation of 5. In this case, the Points of Influx would occur at values
between 65 and 75. This means that approximately 68% of the exam scores are
expected to fall within this range.
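A short sketch (scipy assumed available) can confirm the ~68% figure for this exam-score example, with the mean of 70 and standard deviation of 5 taken from the text.

```python
# A minimal sketch (scipy assumed available) checking the ~68% figure for
# the exam-score example above: mean = 70, SD = 5, points of influx at 65 and 75.
from scipy.stats import norm

mean, sd = 70, 5
p = norm.cdf(75, loc=mean, scale=sd) - norm.cdf(65, loc=mean, scale=sd)
print(round(p, 4))   # ~0.6827, i.e. roughly 68% of scores lie between 65 and 75
```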

7) The Total Percentage of Area of the Normal Curve within Two Points of
Influxation is Fixed

Since the points of influxation are defined as ±1 standard deviation from the mean,
they encompass approximately 68% of the total area under the normal curve. This
means that the total percentage of area within these two points is fixed at
approximately 68%.
This fixed percentage holds true regardless of the specific values of the mean and
standard deviation, as long as the data follows a normal distribution.

8) The Total Area under Normal Curve may be also considered 100 Percent
Probability:
The area under a probability distribution curve represents the probability of an event
or range of values occurring. In the case of the normal distribution, the total area
under the curve sums up to 100 percent. This means that the sum of probabilities for
all possible outcomes within the distribution is equal to 1 or 100 percent.

9) The Normal Curve is Bilateral


The bilateral nature of the normal curve means that the left and right sides of the
curve are mirror images of each other. In other words, if you were to fold the curve in
half at its mean, the left and right halves would perfectly overlap.

10) The Normal Curve is a Mathematical Model in Behavioural Sciences, Especially in
Mental Measurement:
The Normal Curve is characterized by its symmetric bell-shaped curve, which
represents the distribution of scores or measurements in a population. The curve is
defined by its mean (µ) and standard deviation (σ). The mean represents the
average or central tendency of the distribution, while the standard deviation indicates
the variability or spread of the scores around the mean

Q3 The scores obtained by four groups of employees on occupational stress are given
below. Compute ANOVA for the same.

Group 1: 34 22 21 22 34 32 44 55 12 12
Group 2: 12 15 12 23 45 43 33 44 54 37
Group 3: 45 56 65 67 67 76 54 23 21 34
Group 4: 34 55 66 76 54 34 23 22 11 23

ANS:
STEP 1 ORIGINAL TABLE COMPUTATION
Group 1 (X1): 34 22 21 22 34 32 44 55 12 12   ΣX1 = 288
Group 2 (X2): 12 15 12 23 45 43 33 44 54 37   ΣX2 = 318
Group 3 (X3): 45 56 65 67 67 76 54 23 21 34   ΣX3 = 508
Group 4 (X4): 34 55 66 76 54 34 23 22 11 23   ΣX4 = 398
Column totals: 125 148 164 188 200 185 154 144 98 106   Grand total ΣX = 1512

N = total number of observations. Here n1 = n2 = n3 = n4 = 10, so N = 10 + 10 + 10 + 10 = 40.

GROUP MEAN
Ʃ X1/n1 = 288/10= 28.8
Ʃ X2/n2 = 318/10=31.8
Ʃ X3/n3 = 508/10=50.8
Ʃ X4/n4 = 398/10=39.8

CORRECTION TERM C = (ΣX)²/N
= (1512 × 1512) / 40
= 2286144 / 40
= 57153.6
STEP 2
SQUARED VALUES OF THE ORIGINAL DATA
Total sum of squares SST = ΣX² − C

Group 1 (X1²): 1156 484 441 484 1156 1024 1936 3025 144 144   ΣX1² = 9994
Group 2 (X2²): 144 225 144 529 2025 1849 1089 1936 2916 1369   ΣX2² = 12226
Group 3 (X3²): 2025 3136 4225 4489 4489 5776 2916 529 441 1156   ΣX3² = 29182
Group 4 (X4²): 1156 3025 4356 5776 2916 1156 529 484 121 529   ΣX4² = 20048
Total ΣX² = 71450

STEP 3
TOTAL SUM OF SQUARES (st²) = ΣX² − C = 71450 − 57153.6 = 14296.4
STEP 4
BETWEEN GROUPS SUM OF SQUARES (sb²)
sb² = (ΣX1)²/n1 + (ΣX2)²/n2 + (ΣX3)²/n3 + (ΣX4)²/n4 − C
= (288 × 288)/10 + (318 × 318)/10 + (508 × 508)/10 + (398 × 398)/10 − 57153.6
= (82944 + 101124 + 258064 + 158404) / 10 − 57153.6
= 600536 / 10 − 57153.6
= 60053.6 − 57153.6 = 2900
STEP 5
WITHIN GROUPS SUM OF SQUARES (sw²)
sw² = st² − sb² = 14296.4 − 2900 = 11396.4
STEP 6
DEGREES OF FREEDOM
For the total sum of squares: df = N − 1 = 40 − 1 = 39 (N = number of observations)
For the between groups sum of squares: df = k − 1 = 4 − 1 = 3 (k = number of groups)
For the within groups sum of squares: df = N − k = 40 − 4 = 36

STEP 7
CALCULATION OF F RATIO
Computation of the mean square variances:

Source of variation   Sum of squares    df    Mean square variance
Between groups        sb² = 2900         3    2900 / 3 = 966.67
Within groups         sw² = 11396.4     36    11396.4 / 36 = 316.57

F = mean square variance between groups / mean square variance within groups
  = 966.67 / 316.57 = 3.05

STEP 8
INTERPRETATION OF THE F RATIO
The F table is entered with 3 degrees of freedom for the greater mean square variance
(between groups) across the top and 36 degrees of freedom for the smaller mean square
variance (within groups) down the side.
Critical value (table value): F = 2.86 at the 0.05 level of significance and
F = 4.37 at the 0.01 level of significance.
The calculated (computed) value of F = 3.05 is greater than the table value at the 0.05
level (3.05 > 2.86) but less than the table value at the 0.01 level (3.05 < 4.37).
Hence the F ratio is significant at the 0.05 level but not at the 0.01 level: the null
hypothesis of equal group means can be rejected at the 0.05 level, and follow-up
comparisons (e.g., t tests) would be needed to locate which groups differ; at the 0.01
level it cannot be rejected.
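As a cross-check of the hand computation, a small sketch using scipy's one-way ANOVA on the same four groups should reproduce an F ratio of about 3.05 (scipy assumed available).

```python
# A minimal cross-check (scipy assumed available) of the hand computation:
# scipy's one-way ANOVA on the same four groups gives F of about 3.05.
from scipy.stats import f_oneway

g1 = [34, 22, 21, 22, 34, 32, 44, 55, 12, 12]
g2 = [12, 15, 12, 23, 45, 43, 33, 44, 54, 37]
g3 = [45, 56, 65, 67, 67, 76, 54, 23, 21, 34]
g4 = [34, 55, 66, 76, 54, 34, 23, 22, 11, 23]

f_stat, p_value = f_oneway(g1, g2, g3, g4)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")   # F close to 3.05
```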

SECTION B
Q4 Discuss the assumptions of parametric and nonparametric statistics
Ans: Parametric statistics rely on a set of assumptions to ensure the validity
and accuracy of the statistical tests and estimations. The specific assumptions
can vary depending on the type of parametric analysis being conducted, but the
general assumptions commonly associated with parametric statistics are:
1) Independence: The observations or data points should be independent of
each other. This assumption implies that the value of one observation does not
affect or influence the value of another observation. Independence is crucial
for the accuracy of parametric tests. Independence is important to ensure that
the statistical tests are not biased or distorted.
2) Random sampling: The data should be collected through a random sampling
method, where each observation has an equal chance of being selected.
Random sampling helps ensure that the sample is representative of the
population and allows for generalization of the findings.
3) Normality: Many parametric tests assume that the data follow a normal
distribution, or at least approximate normality. This assumption is particularly
important for tests such as t-tests, analysis of variance (ANOVA), hypothesis
testing, confidence intervals, and regression analysis. Departures from
normality can affect the accuracy of these tests, especially when sample sizes
are small.
4) Homogeneity of variance: Some parametric tests, like t-tests and ANOVA,
assume that the variance of the dependent variable is equal across different
groups or conditions being compared. This assumption is referred to as
homogeneity of variance or homoscedasticity. Violations of this assumption
can lead to biased results in the analysis.
5) Linearity: Parametric regression models, such as linear regression, assume a
linear relationship between the independent and dependent variables. This
assumption implies that the change in the dependent variable is proportional
to the change in the independent variable(s). Non-linear relationships may
require alternative modelling techniques.
6) Scale of measurement: Parametric statistics assume that the data is
measured on an interval or ratio scale. Interval scale means that the
differences between values are meaningful, while ratio scale includes a true
zero point. This assumption allows for meaningful calculations and
interpretations of statistical measures.
7) Equality of variances (homoscedasticity): In regression analysis, the
assumption of homoscedasticity means that the variance of the errors or
residuals is constant across all levels of the independent variables. Violations of
this assumption can affect the accuracy and interpretability of regression
results.

It's important to note that not all parametric tests have the same assumptions,
and there may be additional assumptions specific to certain analyses.
Additionally, there are alternative non-parametric methods available for
situations where the assumptions of parametric statistics are violated.
Violations of these assumptions can lead to biased or misleading results.
Nonparametric statistics, also known as distribution-free statistics, are
statistical methods that do not rely on specific assumptions about the
underlying probability distribution of the data. Instead, they make fewer
assumptions, or no assumptions, about the shape, scale, or other characteristics
of the population distribution. Non-parametric methods provide a flexible
framework for data analysis that can be applied in various situations. However,
non-parametric statistics still involve certain assumptions to ensure valid
results. Here are some key assumptions or characteristics of nonparametric
statistics:

1) Data Distribution: Nonparametric methods do not assume a specific
probability distribution for the data. They are suitable for analysing data with
unknown or non-normal distributions. This flexibility makes them robust in
situations where the data deviate from normality.
2) Measurement Scale: Nonparametric statistics can be used for data
measured on nominal, ordinal, interval, or ratio scales. They are not restricted
to continuous variables and can handle categorical or rank-ordered data.
3) Independence: Nonparametric tests typically assume that the observations
are independent of each other. This assumption is crucial for accurate
statistical inference. This means that the value of one observation should not
be influenced by or related to the value of another observation. Violation of
this assumption may lead to biased results.
4) Sample Size: Nonparametric methods can be used with both small and large
sample sizes. They do not rely on assumptions about the sample size or
asymptotic properties.
5) Hypothesis Testing: Nonparametric tests are generally used for hypothesis
testing when the assumptions of parametric tests are not met. They provide a
distribution-free approach to testing hypotheses, such as comparing medians,
ranks, or distributions between groups.
6) Homogeneity of variance: While non-parametric tests do not assume
homogeneity of variance as parametric tests do, they assume that the
variances of the different groups or populations being compared are not
drastically different. Extreme heterogeneity of variances can impact the results
of non-parametric tests.
7) Symmetry: Non-parametric tests often assume symmetry of the distribution
or shape of the data.
It's important to note that while nonparametric methods are robust against
certain assumptions, they also have their own assumptions and limitations. For
example, some nonparametric tests assume the data are exchangeable,
homogeneous, or that the samples are randomly drawn. Careful consideration
of these assumptions is necessary for appropriate and valid statistical analysis.

5. Using Spearman’s rank order correlation for the following data:


DATA 1 21 12 32 34 23 34 21 22 12 29
DATA 2 6 8 21 23 33 22 43 34 21 11

Subject X Y Rx Ry D=RX-RY D2
A 21 6 7.5 10 -2.5 6.25
B 12 8 9.5 9 0.5 0.25
C 32 21 3 6.5 -3.5 12.25
D 34 23 1.5 4 -2.5 6.25
E 23 33 5 3 2 4
F 34 22 1.5 5 -3.5 12.25
G 21 43 7.5 1 6.5 42.25
H 22 34 6 2 4 16
I 12 21 9.5 6.5 3 9
J 29 11 4 8 -4 16
N=10 Ʃd2 = 124.5
N = NO OF OBSERVATION
X = DATA 1
Y= DATA 2
RX = RANK OF X
RY = RANK OF Y
D = DIFFERENCE IN RANK
D2 = DIFFERENCE SQUARED

ρ = 1 − (6 Σd²) / (N(N² − 1))
= 1 − (6 × 124.5) / (10 × (100 − 1))
= 1 − 747 / 990
= 1 − 0.754 = 0.246
RANK CORRELATION COEFFICIENT = 0.246
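A quick check with scipy (assumed available) is sketched below; note that with tied scores scipy computes Pearson's r on the ranks, so the value can differ slightly from the 1 − 6Σd²/N(N² − 1) shortcut used above.

```python
# A minimal check (scipy assumed available). With tied scores scipy computes
# Pearson's r on the ranks, so the value can differ slightly from the
# 1 - 6*sum(d^2)/(N(N^2 - 1)) shortcut used above.
from scipy.stats import spearmanr

data1 = [21, 12, 32, 34, 23, 34, 21, 22, 12, 29]
data2 = [6, 8, 21, 23, 33, 22, 43, 34, 21, 11]

rho, p_value = spearmanr(data1, data2)
print(round(rho, 3))   # close to the hand-computed value of about 0.25
```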

Q6 Describe various levels of measurement with suitable examples

ANS: Level of Measurement

Since a large number of tests are already available, it is important to know which test
can be used where, and the implications of these tests for the inferences drawn from
the data set.
There are two types of data that we take for sampling:
Qualitative (CATEGORICAL DATA) and Quantitative (NUMERICAL DATA).
Qualitative data can be Nominal or Ordinal.
Quantitative data is split into Interval and Ratio.

NOMINAL DATA
Nominal data assign numbers to objects, where different numbers indicate different
objects. Given that no ordering or meaningful numerical distances between numbers exist
in nominal measurement, we cannot obtain the coveted 'normal distribution' of the
dependent variable. It is essentially a way of assigning number values to inherently
qualitative data. It is the most basic level of measurement, reflecting categories with
no rank or order involved.
Gender: men and women.
Favourite car: Mercedes, Bugatti, Pagani, BMW, Maserati.
What season is it now: winter, spring, summer and autumn (4 seasons).
These numbers cannot be put in any meaningful order. For example, Cristiano Ronaldo
wears the number 7; the number 7 does not have any specific meaning, it is just used to
differentiate him from the other players.

ORDINAL DATA
The second level of measurement, which is also frequently associated with
non-parametric statistics, is the ordinal scale (also known as rank-order).
Ordinal data are also numbered and consist of groups and categories, but they follow a
strict and meaningful order, for example rating from smaller to higher or from negative
to positive.
For instance, rate the lunch as disgusting, unappetizing, neutral, tasty or delicious;
this is rated from negative to positive.
Or consider the place finished in a race: 1, 2, 3 and so on. We know for sure that the
person coming first did better than everyone else; the numbers 1, 2, 3 show how they
finished in the race.
The number indicates placement or order. We are asked to rate the experience of a
restaurant or a shopping app from 1 to 10, where 1 is bad and 10 is the best, or to rate
in stars, where 5 stars is the best and 1 star is unbearable or bad.

INTERVAL
At the interval level, data are numeric and have equal intervals between values, but
there is no true zero point. The numbers have order, as with ordinal data, but there are
also equal intervals between adjacent categories. Because there is no true zero, ratios
and proportions cannot be meaningfully calculated.
Examples: CAT scores, credit scores and temperature all cover a range of scores; you
cannot have a zero credit score, since a credit score cannot go below 300.
For example, with temperature in degrees Fahrenheit, the difference between 75 and 76
is 1 degree and is the same as the difference between 45 and 46; anywhere along the
Fahrenheit scale that one-degree difference means the same thing. Interval scales,
however, do not start at a true zero. Temperature is an interval variable when measured
in Celsius or Fahrenheit: absolute zero is actually −273.15 degrees Celsius or −459.67
degrees Fahrenheit. We can safely say that 80 degrees F is lower than 100 degrees F,
so the comparison is meaningful, but 0 is meaningless.
However, if temperature is stated in Kelvin it becomes a ratio scale, as absolute zero
in Kelvin is 0 Kelvin.

RATIO
Differences are meaningful, and unlike interval data, ratios are also meaningful because
there is a true zero point, which interval scales do not have. The true zero represents
the absence of the property: when a ratio value is zero, the quantity really is zero.
This is what makes ratio the most precise and sophisticated level of measurement. You
may add, subtract, multiply and divide, and the values will be real and meaningful.
0 seconds literally means no duration, and 0 Kelvin literally means no heat; it is not
some arbitrary number.
For example, the distance from point A to point B is 20 km, and the starting point is
always zero. Zero pounds means no weight, or the absence of weight, and 10 kg is twice
as much as 5 kg, which shows that ratios are meaningful.

COMPARING SCALES
INTERVAL VS ORDINAL
Temperature: a 1 degree difference is the same at all points of the scale (interval).
Place in a race (1, 2, 3): the difference between finishing 1st and 2nd is not
necessarily (and probably not) the same as the difference between 2nd and 3rd place
(ordinal). The 1st place finisher may have completed the race in 3 min 30 sec, the 2nd
in 3 min 45 sec and the 3rd in 5 min 02 sec; hence these are not equal intervals between
adjacent categories.

INTERVAL VS RATIO
Temperature: zero degrees does not mean the absence of coldness or heat; it just means
that it is really cold. 0 degrees F is 32 degrees below the freezing point, so it is
indeed very cold, but there is no true zero point, and 80 degrees is not twice as hot as
40 degrees (interval).
In terms of ratio, if I have a 5 kg object and add another 5 kg to it, it becomes 10 kg.
However, if the temperature on day 1 is 30 degrees and on day 2 is 30 degrees, it does
not add up to 60 degrees; the temperature on both days is still 30 degrees. The ratio is
therefore not meaningful for temperature, hence temperature is interval and weight is
ratio.

Understanding the level of measurement is important as it determines the type of
statistical analyses that can be applied to the data and the appropriate measures of
central tendency and variability to use.

7. Explain the Kruskal-Wallis ANOVA test and compare it with ANOVA

The Kruskal-Wallis ANOVA test is a non-parametric statistical test used to determine
whether there are any significant differences between the medians of three or more
independent groups. It is an extension of the Mann-Whitney U test, which is used to
compare the medians of two independent groups.
The Kruskal-Wallis test is commonly used when the assumptions of parametric
analysis of variance (ANOVA) tests, such as the normality of data or equal
variances, are violated. Instead, it makes fewer assumptions and is appropriate for
data that are ordinal or skewed.
The Kruskal-Wallis test compares the medians of several (more than two) populations to
see whether they are all the same or not. It can be viewed as an ANOVA based on
rank-transformed data.
When the data are not normally distributed and the assumptions for analysis of variance
are not met, the Kruskal-Wallis test can be used. This test is the non-parametric
counterpart of the single-factor analysis of variance.
In ANOVA each group has a mean and we check whether the means of all the groups are
equal; in the Kruskal-Wallis test we do not check whether the means are equal. Instead
we rank every observation in the dataset and check whether there is a difference in the
rank totals, i.e. whether the rank sums of the groups are equal. The observations are
ranked from the smallest to the highest. The special thing about this test is that the
data do not have to be normally distributed; in fact, the data do not have to follow any
particular distribution.

When do we use the test, and how must the variables be scaled?
We use this test when we have a nominal or ordinal variable with more than two
categories, together with a metric or ordinal variable. For example:
Preferred newspaper: 1) Times of India 2) Indian Express 3) Hindustan Times 4) Deccan Chronicle
How often do you eat out: 1) Daily 2) Weekends 3) Rarely 4) Never
The above are independent samples.
The metric or ordinal variable, which is the dependent variable, will typically be
something like salary, weight or height.
ASSUMPTIONS:
Several independent random samples with at least ordinally scaled characteristics are
required. The variables do not have to follow any particular distribution curve.
NULL HYPOTHESIS
The independent samples all have the same central tendency and therefore come from the
same population; in other words, there is no difference in the rank sums.
ALTERNATIVE HYPOTHESIS
At least one of the independent samples does not have the same central tendency as the
other samples and therefore comes from a different population.
It's worth noting that the Kruskal-Wallis test does not provide information about the
direction or magnitude of the differences between groups. It simply tells us whether
there are significant differences among the medians of the groups under
investigation.
The Kruskal-Wallis ANOVA test and the traditional ANOVA (Analysis of Variance)
are both statistical tests used to determine if there are significant differences
between groups. However, they have some key differences in terms of their
assumptions and application.

ANOVA is a parametric test that assumes the data are normally distributed and have
equal variances across groups. It is used to compare means between two or more
groups. ANOVA assesses whether there is a statistically significant difference in the
means of the groups by examining the variance between groups and within groups.

On the other hand, the Kruskal-Wallis test is a non-parametric test that does not
assume normality or equal variances. It is used when the assumptions for ANOVA
are violated, such as when the data are skewed or have unequal variances. The
Kruskal-Wallis test compares the medians of the groups instead of the means.

Here are some key points of comparison between the Kruskal-Wallis test and
ANOVA:
Assumptions: ANOVA assumes normality and equal variances, while the Kruskal-
Wallis test is a non-parametric test and makes no assumptions about the underlying
distribution of the data or the variances.
Data Types: ANOVA can be applied to both numerical and categorical data,
provided the assumptions are met. The Kruskal-Wallis test is primarily used for
numerical data, but it can also be applied to ordinal or ranked categorical data.
Test Statistic: ANOVA calculates an F-statistic to determine the significance of
differences between groups based on variances. The Kruskal-Wallis test calculates a
chi-square statistic based on the ranks of the data to determine if there are
differences in medians between groups.
Post hoc Tests: In ANOVA, post hoc tests (e.g., Tukey's HSD, Bonferroni, etc.) can
be performed to identify which specific groups differ significantly. In the Kruskal-
Wallis test, if the overall test is significant, follow-up pairwise comparisons are
typically conducted using appropriate non-parametric tests (e.g., Dunn's test, Mann-
Whitney U test).
Interpretation: ANOVA provides information about the differences in means
between groups. The Kruskal-Wallis test provides information about differences in
medians between groups.
In summary, ANOVA is a parametric test that makes assumptions about the data,
while the Kruskal-Wallis test is a non-parametric alternative that does not make such
assumptions. The choice between these tests depends on the nature of the data and
whether the assumptions of ANOVA are met. If the assumptions are violated, the
Kruskal-Wallis test can be used as a robust alternative.
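The contrast between the two calls can be sketched in a few lines (scipy assumed available; the three groups below are hypothetical illustration data): f_oneway compares means under parametric assumptions, while kruskal compares rank sums.

```python
# A minimal sketch (hypothetical data, scipy assumed available) contrasting
# the two tests on the same three groups: f_oneway compares means, kruskal
# compares rank sums.
from scipy.stats import f_oneway, kruskal

a = [12, 15, 14, 10, 13]
b = [22, 25, 19, 24, 21]
c = [15, 17, 16, 14, 18]

print(f_oneway(a, b, c))   # parametric: F statistic and p-value
print(kruskal(a, b, c))    # non-parametric: H statistic and p-value
```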

Q 8 Compute Chi-square for the following data:

STUDENTS    EMOTIONAL INTELLIGENCE SCORE HIGH    EMOTIONAL INTELLIGENCE SCORE LOW
SCHOOL A    23                                   22
SCHOOL B    12                                   18

OBSERVED VALUES
STUDENTS    EMOTIONAL INTELLIGENCE SCORE HIGH    EMOTIONAL INTELLIGENCE SCORE LOW    TOTAL
SCHOOL A    23                                   22                                  45
SCHOOL B    12                                   18                                  30
TOTAL       35                                   40                                  75

EXPECTED VALUES
STUDENTS    EMOTIONAL INTELLIGENCE SCORE HIGH    EMOTIONAL INTELLIGENCE SCORE LOW
SCHOOL A    21                                   24
SCHOOL B    14                                   16

E(School A, High) = 45 × 35 / 75 = 21
E(School A, Low) = 45 × 40 / 75 = 24
E(School B, High) = 30 × 35 / 75 = 14
E(School B, Low) = 30 × 40 / 75 = 16
CALCULATION OF χ² (CHI-SQUARE)
Observed value [O]   Expected value [E]   O − E   (O − E)²   (O − E)²/E
23                   21                    2       4          0.190
22                   24                   −2       4          0.167
12                   14                   −2       4          0.286
18                   16                    2       4          0.250
Total 75             75                                       χ² = 0.89

DEGREE OF FREEDOM
Df = (r-1)(c-1)
Df=(Row-1)(Column – 1)
(2-1)(2-1) = 1x1=1
Table value(Critical value) of X2= 3.841 AT 0.05 SIGNIFICANCE LEVEL
Table value of X2= 6.635 AT 0.01 SIGNIFICANCE LEVEL
TABLE VALUE (CRITICAL VALUE) > CALCULATED VALUE (COMPUTED VALUE)
THE CALCULATED VALUE OF 0.89 IS MUCH LESS THAN THE TABLE VALUE,
THEREFORE THE NULL HYPOTHESIS CANNOT BE REJECTED.
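A short scipy sketch (assumed available) can reproduce this computation; correction=False switches off the Yates continuity correction that scipy otherwise applies to 2x2 tables, so the result matches the hand-computed value.

```python
# A minimal cross-check (scipy assumed available). correction=False switches
# off the Yates continuity correction that scipy applies to 2x2 tables by
# default, so the result matches the hand computation.
from scipy.stats import chi2_contingency

observed = [[23, 22],
            [12, 18]]

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), dof)   # chi-square close to 0.89 with df = 1
print(expected)              # [[21. 24.], [14. 16.]]
```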

SECTION C
Answer the following in about 50 words each 10x3=30 Marks

9) Type I and type II errors.


Ans: Type 1 and Type 2 errors are concepts used in statistical
hypothesis testing to describe the potential errors that can occur when
making decisions based on sample data.
Type 1 Error: A Type 1 error, also known as a false positive, occurs
when the null hypothesis is wrongly rejected. In hypothesis testing, the
null hypothesis represents the assumption of no effect or no difference
between groups or variables. A Type 1 error means that the researcher
concludes there is a significant effect or difference when, in reality, there
isn't one. This error is denoted by the symbol α (alpha) and represents
the level of significance chosen for the test. In simpler terms, it is the
probability of rejecting the null hypothesis when it is true.
Type 2 Error: A Type 2 error, also known as a false negative, occurs
when the null hypothesis is wrongly accepted instead of rejecting. It
happens when the researcher fails to reject the null hypothesis even
though there is a true effect or difference in the population. In other
words, a Type 2 error occurs when the test fails to detect a significant
effect that actually exists.
This error is denoted by the symbol β (beta) and represents the
probability of failing to reject the null hypothesis when it is false. The
complement of β is called the statistical power (1 - β), which represents
the probability of correctly rejecting the null hypothesis.

10) Skewness and kurtosis.

Skewness and kurtosis are two statistical measures used to describe the shape and
distribution of a dataset
Skewness:
Skewness measures the asymmetry of a distribution. It indicates whether the data is
skewed to the left (negative skewness) or to the right (positive skewness), or if it is
approximately symmetric (zero skewness). Skewness is calculated based on the
third standardized moment of the data.
Skewness is useful in understanding the shape and behavior of the data distribution,
particularly when it deviates from a normal distribution. It can impact statistical
analysis and inference, as skewed data may require special treatment or
transformation.
Kurtosis:
Kurtosis measures the peakedness or flatness of a distribution. It indicates the
presence of outliers or extreme values and compares the distribution to a normal
distribution. Kurtosis is calculated based on the fourth standardized moment of the
data.
Leptokurtic (positive excess kurtosis): The distribution has heavy tails and a sharp
peak, indicating a higher probability of extreme values or outliers.
Mesokurtic (zero excess kurtosis): The distribution has a similar shape to a normal
distribution, with moderate tails and a moderate peak.
Platykurtic (negative excess kurtosis): The distribution has lighter tails and a flatter
peak, indicating fewer extreme values and a more spread-out distribution.
Kurtosis provides information about the presence of extreme values or outliers in the
data, which can impact the interpretation of statistical tests and the selection of
appropriate models for analysis.
It's important to note that skewness and kurtosis are descriptive statistics and do not
imply any specific distribution or make conclusions about the underlying data. They
are tools for understanding and summarizing the shape and characteristics of the
dataset.
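As a brief illustration (simulated data, numpy and scipy assumed available), a roughly normal sample yields skewness and excess kurtosis close to zero, matching the mesokurtic case described above.

```python
# A minimal sketch (simulated data, numpy and scipy assumed available):
# a roughly normal sample gives skewness and excess kurtosis near zero.
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=10_000)

print(round(skew(sample), 3))       # near 0 for a symmetric distribution
print(round(kurtosis(sample), 3))   # excess kurtosis near 0 (mesokurtic)
```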
Q 11. Point and interval estimations.

Point estimation and interval estimation are two approaches used in statistical
inference to estimate population parameters based on sample data.
Point Estimation:
Point estimation involves estimating a population parameter using a single value,
which is often derived from a sample statistic. The goal is to find the "best guess" or
the most likely value for the parameter. For example, the sample mean is commonly
used as a point estimate for the population mean, or the sample proportion is used
as a point estimate for the population proportion.
Point estimates provide a single value that serves as an estimate of the parameter of
interest. However, they do not provide information about the variability or uncertainty
associated with the estimate. To address this limitation, interval estimation is used.
Interval Estimation:
Interval estimation involves constructing a range or an interval of values within which
the population parameter is believed to lie. This range is called a confidence interval.
The confidence interval provides an estimate of the parameter along with an
associated level of confidence or probability.
For example, a 95% confidence interval for the population mean would provide a
range of values within which we are 95% confident that the true population mean
lies. The interval estimation takes into account both the point estimate and the
variability of the estimate, providing a measure of precision and uncertainty.
The width of the confidence interval depends on factors such as the sample size,
variability of the data, and the chosen confidence level. A wider interval indicates
greater uncertainty, while a narrower interval indicates greater precision in the
estimate.
Interval estimation provides a more informative measure than point estimation alone,
as it incorporates the uncertainty associated with the estimate. It allows researchers
and decision-makers to assess the precision and reliability of the estimate and make
informed inferences about the population parameter.
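A minimal sketch (hypothetical sample data, scipy assumed available) shows the two ideas side by side: the sample mean as the point estimate and a 95% t-based confidence interval as the interval estimate.

```python
# A minimal sketch (hypothetical sample, scipy assumed available): the sample
# mean is the point estimate and a 95% t-based interval is the interval estimate.
import numpy as np
from scipy import stats

sample = np.array([68, 72, 75, 70, 66, 74, 71, 69, 73, 70])

point_estimate = sample.mean()                 # point estimate of the mean
sem = stats.sem(sample)                        # standard error of the mean
ci = stats.t.interval(0.95, len(sample) - 1, loc=point_estimate, scale=sem)

print(point_estimate)
print(ci)                                      # 95% confidence interval (lower, upper)
```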

12. Null hypothesis


ANS: The null hypothesis, H0, represents a theory that has been put
forward, either because it is believed to be true or because it is to be
used as a basis for argument, but has not been proved.
The null hypothesis, often denoted as H₀, is a statement or assumption
that represents a default or initial belief about a population parameter or
the absence of an effect or relationship between variables. It is a
fundamental concept in statistical hypothesis testing.
In hypothesis testing, the null hypothesis is typically the hypothesis that
researchers want to challenge, investigate, or test against an alternative
hypothesis (H₁). The null hypothesis assumes that there is no significant
difference, effect, or relationship in the population or that any observed
difference is due to random chance.
For example, consider a study comparing the means of two groups. The
null hypothesis would state that there is no difference between the
means of the two groups. If the null hypothesis is rejected, it suggests
that there is evidence to support an alternative hypothesis, which asserts
the presence of a significant difference or effect.
The null hypothesis is crucial because it sets up the framework for
hypothesis testing. By assuming the null hypothesis is true, researchers
can evaluate the evidence against it based on sample data and
statistical tests. The goal is to either reject the null hypothesis in favor of
the alternative hypothesis or fail to reject the null hypothesis due to
insufficient evidence.
It's important to note that failing to reject the null hypothesis does not
prove it to be true. It simply means that there is not enough evidence to
support the alternative hypothesis and suggests that the observed
results are likely due to random variation.

13. Scatter diagram


A scatter diagram (also called a scatterplot, scattergram, or scatter) is
one way to study the relationship between two variables.
It involves plotting individual data points on a Cartesian coordinate
system, with one variable represented on the x-axis and the other
variable on the y-axis.
In a scatter diagram, each data point is represented by a dot or a symbol
at the intersection of its corresponding x and y values. The pattern
formed by the dots on the graph can provide insights into the relationship
between the variables. It helps to determine if there is a correlation, and
if so, the nature and strength of the correlation.
The scatter diagram is particularly useful for visualizing data and
identifying any patterns, trends, or outliers present. By examining the
scatter plot, one can assess whether there is a positive correlation
(variables increase together), a negative correlation (variables move in
opposite directions), or no correlation (no apparent relationship).
Scatter diagrams are often accompanied by a line of best fit or
regression line, which is a straight line that best represents the overall
trend of the data points. This line helps to quantify and estimate the
relationship between the variables, making it easier to predict or model
future values based on the given data.
Overall, scatter diagrams provide a visual representation of the
relationship between two variables and are a valuable tool in exploring
and analysing data.
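A minimal matplotlib sketch (with hypothetical paired observations) shows a scatter diagram together with a least-squares line of best fit, as described above.

```python
# A minimal matplotlib sketch (hypothetical paired observations) of a scatter
# diagram with a least-squares line of best fit.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([2, 4, 5, 7, 8, 10, 11, 13])
y = np.array([3, 5, 6, 8, 8, 11, 12, 14])

slope, intercept = np.polyfit(x, y, 1)        # line of best fit

plt.scatter(x, y, label="data points")
plt.plot(x, slope * x + intercept, label="line of best fit")
plt.xlabel("X variable")
plt.ylabel("Y variable")
plt.legend()
plt.show()
```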

14) Outliers
Outliers are extreme scores on one or both of the variables. The presence of outliers
has a distorting impact on the correlation value: the strength and degree of the
correlation are affected by their presence. They are basically observations that lie an
abnormal distance away from the other values. In other words, outliers are data points
that are either extremely high or extremely low compared to the majority of the data.

Outliers can arise due to various reasons, such as measurement errors, data entry
errors, natural variations in the data, or rare events. They can have a significant
impact on the statistical analysis and modelling of a dataset, as they can skew the
results and distort the interpretation of the data.

Identifying and handling outliers is important in data analysis to ensure accurate and
reliable results. Outliers can be detected using various statistical techniques, such as
graphical methods like box plots or scatter plots, or through statistical tests based on
measures like standard deviation or the interquartile range.
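One common detection rule mentioned above is the interquartile-range method; a minimal sketch with hypothetical data flags any value lying more than 1.5 × IQR beyond the quartiles.

```python
# A minimal sketch (hypothetical data) of the interquartile-range rule:
# values more than 1.5 * IQR beyond the quartiles are flagged as outliers.
import numpy as np

data = np.array([12, 14, 15, 15, 16, 17, 18, 19, 20, 55])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print(data[(data < lower) | (data > upper)])   # 55 falls outside the fences
```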

15. Biserial correlation

The biserial correlation coefficient (rb) is a measure of correlation. It is similar to
the point-biserial correlation, but the point-biserial correlation is computed when one
of the variables is dichotomous and has no underlying continuity. If a variable has
underlying continuity but is measured dichotomously, then the biserial correlation can
be calculated.
Biserial correlation is a statistical measure that quantifies the relationship between a
continuous variable and a binary variable. It is an extension of the point-biserial
correlation coefficient, which measures the correlation between a continuous
variable and a dichotomous variable.
The biserial correlation coefficient, denoted as r_biserial, ranges between -1 and 1,
where -1 indicates a perfect negative relationship, 1 indicates a perfect positive
relationship, and 0 indicates no relationship. The sign of the coefficient indicates the
direction of the relationship.
The biserial correlation coefficient is based on the assumption that the continuous
variable follows a normal distribution within each group defined by the binary
variable. It calculates the correlation by estimating the probability of observing the
continuous variable's value based on the binary variable's grouping.
Biserial correlation is commonly used in fields of psychology and educational
research to examine the relationship between continuous variables and binary
variables. For example, it can be used to determine the association between a
person's height (continuous variable) and their gender (binary variable) or between
exam scores (continuous variable) and pass/fail status (binary variable).

16. Variance
In the terminology of statistics, the distance of a score from a central point, i.e. the
mean, is called a deviation, and the index of variability is known as the mean deviation
or the standard deviation (σ).
Variance is a statistical measure that quantifies the dispersion or spread
of a set of data points around their mean (average) value. It provides a
numerical representation of how much the data points deviate from the
mean.
Mathematically, variance is calculated by taking the average of the
squared differences between each data point and the mean. The
formula for variance, denoted as σ² or Var(X), for a dataset X with n data
points is:

σ² = Σ(xᵢ - μ)² / n
where xᵢ represents each data point, μ represents the mean of the
dataset, and Σ denotes summation across all data points.

In the study of sampling theory, some of the results may be somewhat more simply
interpreted if the variance of a sample is defined as the sum of the squared deviations
divided by the degrees of freedom (N − 1) rather than as the mean of the squared
deviations.
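A small numpy sketch illustrates the two denominators just mentioned: ddof=0 divides by N (population variance) while ddof=1 divides by N − 1 (sample variance), and the square root gives the standard deviation in the original units. The data values are hypothetical.

```python
# A minimal numpy sketch of the two denominators mentioned above:
# ddof=0 divides by N (population variance), ddof=1 by N - 1 (sample variance).
import numpy as np

x = np.array([4, 8, 6, 5, 3, 7])

print(x.var(ddof=0))            # population variance (divide by N)
print(x.var(ddof=1))            # sample variance (divide by N - 1)
print(np.sqrt(x.var(ddof=1)))   # standard deviation, back in the original units
```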
Variance has several important properties:
Variance is always a non-negative value. Since it involves squaring the
differences, it eliminates the effects of positive and negative deviations,
resulting in a positive value.

A smaller variance indicates that the data points are closer to the mean,
suggesting less dispersion or spread. Conversely, a larger variance
implies greater dispersion or spread of the data points.
Variance is influenced by outliers, as they can significantly increase the
squared differences and hence inflate the variance.
Variance is not expressed in the original units of the data points but in
squared units. To obtain a measure in the original units, the square root
of the variance, known as the standard deviation, is commonly used.
Variance is a fundamental concept in statistics and plays a crucial role in
various statistical analyses, such as hypothesis testing, regression
analysis, and the calculation of confidence intervals. It provides valuable
information about the variability within a dataset and helps to assess the
reliability and predictability of the data.

17. Interactional effect


In the two-way analysis of variance, the consideration and interpretation of the
interaction of variables or factors become important. Without considering the
interaction between the different variables in a study, a two-way or three-way analysis
of variance serves little purpose.
The interactions may be between two or more than two independent variables and
its effect is measured on the dependent variable or the criterion variable. The need
to know interaction effect on criterion variable or dependent variable is to know the
combined effect of two or more than two independent variables on the criterion
variable. An interaction effect occurs when the relationship between two variables
changes depending on the level or presence of another variable. It suggests that the
effect of one variable on the outcome is not constant across different levels or
conditions of the other variable.
For example, Education and Parental Involvement: in the field of education, the
interaction between educational practices and parental involvement can have a
substantial impact on a child's academic achievement. While effective teaching methods
and parental involvement each contribute to educational success, when combined their
interaction can lead to even better outcomes, or it might burden the child with
excessive expectations and over-involvement of the parents through continuous
micro-monitoring, putting a lot of performance pressure on the child and leading to a
disastrous outcome in the child's exam results.

18. Wilcoxon matched pair signed rank test.


The Wilcoxon matched-pairs signed-rank test, often referred to as the
Wilcoxon signed-rank test, is a non-parametric statistical test used to
determine whether the median difference between two related or paired
samples is statistically significant. It can be used for two repeated (or
correlated) measures when measurement is at least ordinal, and unlike the
sign test it takes into account the magnitude of the difference. It is
designed to compare two related samples or repeated measures on the same
subjects when the data do not meet the assumptions of parametric tests,
such as the paired t-test.

The Wilcoxon signed-rank test follows these steps:
1) The differences between the paired observations are calculated.
2) The absolute differences are ranked, ignoring the signs (+ or −).
3) The signed ranks are determined by assigning the original sign (+ or −) back to the
ranked differences.
4) The sum of the positive ranks and the sum of the negative ranks are calculated.
5) Difference scores of 0 are eliminated, since a rank cannot be assigned to them.
If the null hypothesis of no difference between the groups of scores is true, the sum
of the positive ranks should not differ from the sum of the negative ranks beyond what
would be expected by chance.
The Wilcoxon signed-rank test is robust against non-normality and
outliers, making it suitable for analyzing data that violate the
assumptions of parametric tests. It is commonly used in various fields,
including medicine, psychology, and social sciences, when dealing with
paired data or dependent samples.

It's worth noting that the Wilcoxon signed-rank test assumes that the
differences between paired observations are independent and identically
distributed, and the distribution of the differences is symmetric around
the median.
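A minimal sketch (hypothetical before/after scores, scipy assumed available) runs the test; scipy's default handling drops zero differences, matching the step described above.

```python
# A minimal sketch (hypothetical before/after scores, scipy assumed available).
# scipy's default zero_method drops zero differences, as described above.
from scipy.stats import wilcoxon

before = [72, 65, 80, 74, 69, 77, 71, 68]
after = [75, 70, 82, 74, 73, 80, 74, 70]

stat, p_value = wilcoxon(before, after)
print(stat, p_value)   # a small p-value suggests the median difference is non-zero
```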

THE END
