Statistics MPC-006
SECTION A
Q1 Explain the meaning of descriptive statistics and describe organisation of data.
Ans: -
Statistics is the science of collecting, classifying, analysing, presenting, and
interpreting data.
In behavioural science, statistics is used to study data related to human behaviour,
cognition, and social interactions. It plays a fundamental role in the study of
behaviour by providing quantitative methods for organizing, summarizing, and
drawing meaningful conclusions from data.
The science of statistics may be broadly studied under two headings:
(i) Descriptive Statistics, and (ii) Inferential Statistics
Descriptive Statistics - Observations of human behaviour show many facets of
variation. Every person has a different personality type, a different attitude, and a
different level of intelligence. It is therefore necessary to have precise
characteristics to segregate and differentiate groups into various categories. For
this purpose, observations need to be expressed as a single estimate which
summarises them.
Descriptive statistics provide numerical measures, tabulation, diagrammatic and
graphical representations, and summary tables that help to understand the
distribution, central tendency, variability, and other properties of the data.
These measures enable researchers to know the tendency of the data or scores,
which makes it easier to describe the phenomena. Such single estimates, which
summarise a distribution of data, are known as the parameters of the distribution.
These parameters define the distribution completely.
ORGANISATION OF DATA
There are four major statistical techniques for organising the data. These are:
i) Classification
ii) Tabulation
iii) Graphical Presentation
iv) Diagrammatical Presentation
CLASSIFICATION - The segregation of data into groups is known as classification.
Conclusions are drawn after the data have been classified into categories, and the
data become more meaningful when organised into a frequency distribution. A
frequency distribution shows the number of cases falling within a given class
interval or range of scores.
Frequency distribution can be broadly divided into grouped data and ungrouped
data.
An ungrouped frequency distribution means representation of data where individual
values are listed along with their corresponding frequencies. In this type of
distribution, each unique value in a dataset is recorded separately, without
combining or grouping similar values into intervals or ranges.
A grouped frequency distribution means a representation of data where values are
grouped into intervals or ranges, and their corresponding frequencies are recorded.
It is often used when dealing with a large dataset or when the data spans a wide
range of values.
Construction of a frequency distribution
To prepare a frequency distribution it is essential to determine the following:
1) The range of the given data, i.e. the difference between the highest and lowest
scores.
2) The number of class intervals. There is no fixed rule for the number of classes
into which data should be grouped; if there are very few scores, it is useless to
have a large number of class intervals.
3) The limits of each class interval.
Another factor used in determining the number of classes is the size or width of
each class, which is known as the 'class interval' and is denoted by 'i'.
Class intervals should be of uniform width, resulting in same-sized classes in the
frequency distribution.
There are three methods for describing the class limits for distribution:
(i) Exclusive method, (ii) Inclusive method and (iii) True or actual class method
Exclusive method
In this method, the class limits are defined in such a way that the upper class limit
of one class is the lower class limit of the next class, e.g. 1-5, 5-10, 10-15, 15-20.
A score equal to the upper limit of a class is excluded from that class, which means
a score of 15 will be included in the class 15-20, not in 10-15.
Inclusive method: In this method, the class limits are defined in such a way that
both the lower and upper class limits are included in the class. This classification
includes scores that are equal to the upper limit of the class, e.g. 1-4, 5-9, 10-14,
15-19.
True or actual class method: Mathematically, a score is an interval that extends
from 0.5 units below to 0.5 units above the face value of the score on a continuum.
These class limits are known as true or actual class limits, e.g. 0.5-4.5, 4.5-9.5,
9.5-14.5, 14.5-19.5, etc.
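As a rough illustration (not part of the original answer), the exclusive method can be sketched in Python; the scores and the class width i = 5 below are invented for the example:

```python
# Hypothetical sketch: grouped frequency distribution via the exclusive method.
# The scores and the class width i = 5 are made up for illustration.
scores = [12, 3, 7, 15, 9, 1, 14, 5, 18, 10, 4, 16]
i = 5  # class interval width

high = max(scores)
for lower in range(0, high + 1, i):
    upper = lower + i
    # Exclusive method: a score equal to the upper limit goes to the next class.
    freq = sum(lower <= s < upper for s in scores)
    print(f"{lower}-{upper}: {freq}")
```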
Types of Frequency Distribution - Frequencies of data can be arranged in several
ways, a few of which are:
Relative frequency distribution: It provides the proportion or percentage of
observations in a dataset that fall within each category or range of values. It is
used to analyse and summarize categorical or numerical data.
Cumulative frequency distribution: It shows the accumulation of frequencies up to
a given value or category in a dataset, i.e. the total number of observations that
fall below or are equal to a particular value or category.
Cumulative relative frequency distribution: A cumulative relative frequency
distribution is one in which the entry of any score of class interval expresses that
score’s cumulative frequency as a proportion of the total number of cases.
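To make the three variants concrete, here is a small hedged sketch; the class frequencies are invented for the example:

```python
# Hypothetical class frequencies, listed in class order.
freqs = [3, 7, 10, 6, 4]
n = sum(freqs)  # total number of cases

cum = 0
for f in freqs:
    cum += f  # running (cumulative) frequency up to this class
    print(f"relative: {f / n:.2f}   "
          f"cumulative: {cum}   "
          f"cumulative relative: {cum / n:.2f}")
```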
Tabulation
Tabulation refers to the process of organizing and presenting data in a systematic
and structured manner. It involves creating tables that summarize and display the
distribution of variables or data values, making it easier to analyze and interpret the
information.
The main components of a table are:
i) Table number: When there is more than one table in a particular analysis, each
table should be marked with a number for reference and identification. The
number should be written in the centre at the top of the table.
ii) Title of the table: Every table should have an appropriate title, which describes
the content of the table. The title should be clear, brief, and self-explanatory.
Title of the table should be placed either centrally on the top of the table or just
below or after the table number.
iii) Caption: Captions are brief and self-explanatory headings for columns. Captions
may involve headings and sub-headings. The captions should be placed in the
middle of the columns.
iv) Stub: Stubs stand for brief and self-explanatory headings for rows. A stub of a
table is the left part of the table describing the rows. It is a placeholder for the row
labels that are used to identify the data in each row of the table.
v) Body of the table: This is the real table and contains numerical information or
data in different cells. This arrangement of data remains according to the
description of captions and stubs.
vi) Head note: This is written at the extreme right hand below the title and explains
the unit of the measurements used in the body of the tables.
vii) Footnote: This is a qualifying statement which is to be written below the table
explaining certain points related to the data which have not been covered in title,
caption, and stubs.
viii) Source of data: The source from which data have been taken is to be
mentioned at the end of the table.
Ogive
An ogive is a graphical representation of a cumulative frequency distribution. It is
commonly used in statistics to depict cumulative frequencies and analyse data sets.
The ogive graph displays cumulative frequencies as a line graph, where the
horizontal axis represents the data values, and the vertical axis represents the
cumulative frequencies.
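A minimal plotting sketch, assuming matplotlib is available; the class limits and frequencies are invented for the example:

```python
import matplotlib.pyplot as plt
from itertools import accumulate

upper_limits = [5, 10, 15, 20, 25]              # upper class limits (made up)
cum_freqs = list(accumulate([3, 7, 10, 6, 4]))  # cumulative frequencies

plt.plot(upper_limits, cum_freqs, marker="o")   # the ogive is a line graph
plt.xlabel("Score (upper class limit)")
plt.ylabel("Cumulative frequency")
plt.title("Ogive")
plt.show()
```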
Diagrammatic Presentation of Data
Diagrammatic and graphic presentation of data means visual representation of the
data. It shows a comparison between two or more sets of data and helps in the
presentation of highly complex data in its simplest form. There are different forms
of diagram, e.g., bar diagram, sub-divided bar diagram, multiple bar diagram, pie
diagram, and pictogram.
a) Bar diagram:
A bar diagram, also known as a bar chart or bar graph, is a visual representation of
data using rectangular bars of different lengths or heights. It is commonly used to
display and compare categorical data or discrete values. Bar diagrams are
particularly useful for presenting data that can be divided into distinct categories or
groups.
b) A sub-divided bar diagram, also known as a stacked bar chart or stacked bar
graph, is a variation of a bar diagram where the bars are divided into segments to
represent different sub-categories or components within each category. The
subdivided bar diagram allows for a visual comparison of both the overall
composition of the main categories and the relative contributions of the
subcategories within each category. It is particularly useful when analysing data that
can be divided into multiple dimensions or attributes.
c) Multiple bar diagram: Multiple bar diagrams, also known as grouped bar charts
or clustered bar graphs, are used to compare data across multiple categories while
showing subcategories within each category. They are particularly useful when
comparing values between different groups and subgroups simultaneously. A set of
bars for a person, place, or related phenomenon is drawn side by side without any
gap. To distinguish between the different bars in a set, different colours or shades
are used.
d) Pie diagram: A pie diagram, also known as a pie chart, is a circular graphical
representation of data that is divided into sectors or slices. It is commonly used to
display the composition or distribution of a whole in terms of its parts or categories.
In a pie diagram, each sector or slice represents a specific category or subgroup,
and its size or angle is proportional to the relative frequency, percentage, or
magnitude of that category compared to the whole. The entire pie represents 100%
or the total value of the data being represented.
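As an optional illustration (not from the source), a bar diagram and a pie diagram for one made-up categorical dataset, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

categories = ["Winter", "Spring", "Summer", "Autumn"]  # invented data
counts = [12, 18, 25, 15]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.bar(categories, counts)          # bar heights compare the categories
ax1.set_ylabel("Frequency")
ax2.pie(counts, labels=categories, autopct="%1.0f%%")  # slices sum to 100%
plt.show()
```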
These are some of the ways in which we can organise data for studying human
behaviour and its patterns.
Q2 Explain the concept of normal curve with help of a diagram. Explain the
characteristics of normal probability curve.
ANS:
The normal distribution curve, also called the Gaussian distribution, is the most
significant continuous probability distribution. It is also known as a bell curve and
is commonly used in statistics and probability theory to represent the distribution
of a continuous random variable. The mean, median, and mode have approximately
the same value in the bell curve. Many variables, from performance in a corporate
setup to socio-economic status to intelligence and physical attributes, can be
modelled by the bell curve.
A normal distribution is always centred at the average value. On the X axis we
have scores; on the Y axis we have frequencies, where frequency is how often a
score occurs.
[Diagram: a bell-shaped normal curve, with scores on the X axis and frequency on
the Y axis]
The above example of a bell curve could denote human height. People can be
short, of average height, tall, or anywhere in between. The Y axis represents the
relative probability of observing someone who is relatively short, really tall, or of
average height.
It is rare to see someone who is extremely short, so the bell-shaped curve is
relatively low in that region; it is very likely to see people of average height, so the
curve is at its peak in that region; and it is very rare to see extremely tall humans,
so the curve is relatively low there as well.
2) Unimodal: The normal curve is unimodal, which means it has a single peak. The
peak occurs at the mean value of the distribution. If a distribution is unimodal, it
means that there is only one peak in the curve. This implies that the data is
concentrated around a single value, which is the mean.
The peak represents the most probable value or the center of the distribution. It is
the value around which the data is most likely to cluster. As you move away from the
peak in either direction, the probability of observing a particular value decreases.
The unimodal characteristic of the Normal Curve is closely related to its symmetry.
Since the curve is symmetric, the peak occurs at the mean, and the probabilities on
either side of the peak decrease symmetrically. This creates a smooth, bell-shaped
curve with a single mode.
The asymptotic nature of the normal curve to the x-axis means that the probability of
observing extreme values (far from the mean) becomes increasingly smaller but
never reaches zero. This implies that, in theory, there is always a possibility of
observing values that are extremely far from the mean, although the likelihood
becomes incredibly low.
7) The Total Percentage of Area of the Normal Curve within the Two Points of
Inflection is Fixed:
Since the points of inflection lie at ±1 standard deviation from the mean, they
encompass approximately 68% of the total area under the normal curve. This
means that the total percentage of area within these two points is fixed at
approximately 68%.
This fixed percentage holds true regardless of the specific values of the mean and
standard deviation, as long as the data follows a normal distribution.
8) The Total Area under the Normal Curve may also be considered 100 Percent
Probability:
The area under a probability distribution curve represents the probability of an event
or range of values occurring. In the case of the normal distribution, the total area
under the curve sums to 100 percent. This means that the sum of probabilities for
all possible outcomes within the distribution is equal to 1, or 100 percent.
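These two properties can be verified numerically; a minimal cross-check, assuming SciPy is available:

```python
from scipy.stats import norm  # standard normal: mean 0, SD 1

within_1_sd = norm.cdf(1) - norm.cdf(-1)  # area between the inflection points
print(round(within_1_sd, 4))              # ~0.6827, i.e. about 68%
print(norm.cdf(float("inf")))             # 1.0: the total area is 100%
```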
Group 1: 34 22 21 22 34 32 44 55 12 12
Group 2: 12 15 12 23 45 43 33 44 54 37
Group 3: 45 56 65 67 67 76 54 23 21 34
Group 4: 34 55 66 76 54 34 23 22 11 23
ANS:
STEP 1: ORIGINAL TABLE COMPUTATION

Group 1 (X1): 34 22 21 22 34 32 44 55 12 12    ƩX1 = 288
Group 2 (X2): 12 15 12 23 45 43 33 44 54 37    ƩX2 = 318
Group 3 (X3): 45 56 65 67 67 76 54 23 21 34    ƩX3 = 508
Group 4 (X4): 34 55 66 76 54 34 23 22 11 23    ƩX4 = 398
Column totals: 125 148 164 188 200 185 154 144 98 106    ƩX = 1512
N = number of observations. Here n1 = n2 = n3 = n4 = 10, so
N = 10 + 10 + 10 + 10 = 40.
STEP 2: GROUP MEANS
ƩX1/n1 = 288/10 = 28.8
ƩX2/n2 = 318/10 = 31.8
ƩX3/n3 = 508/10 = 50.8
ƩX4/n4 = 398/10 = 39.8
STEP 3: TOTAL SUM OF SQUARES
Correction term C = (ƩX)²/N = (1512)²/40 = 57153.6
Total sum of squares (st²) = ƩX² - C
= 71450 - 57153.6
= 14296.4
STEP 4: BETWEEN-GROUPS SUM OF SQUARES
sb² = (ƩX1)²/n1 + (ƩX2)²/n2 + (ƩX3)²/n3 + (ƩX4)²/n4 - C
= 288²/10 + 318²/10 + 508²/10 + 398²/10 - 57153.6
= 8294.4 + 10112.4 + 25806.4 + 15840.4 - 57153.6
= 2900

STEP 5: WITHIN-GROUPS SUM OF SQUARES
sw² = st² - sb² = 14296.4 - 2900 = 11396.4

STEP 6: DEGREES OF FREEDOM
Between groups: df = K - 1 = 4 - 1 = 3; within groups: df = N - K = 40 - 4 = 36
STEP 7: CALCULATION OF F RATIO
Computation of the mean square variances:

Source of variation    Sum of squares    df    Mean square variance
Between groups         sb² = 2900        3     2900/3 = 966.67
Within groups          sw² = 11396.4     36    11396.4/36 = 316.57

F = greater mean square variance / smaller mean square variance
= 966.67/316.57 = 3.05
STEP 8: INTERPRETATION OF THE F RATIO
The F-ratio table is consulted with 3 degrees of freedom for the greater mean
square variance (read across the top) and 36 degrees of freedom for the smaller
mean square variance (read down the side).
Critical value (table value): F = 2.86 at the 0.05 level of significance, and 4.37 at
the 0.01 level of significance.
Since the computed F = 3.05 exceeds 2.86 but not 4.37, the difference between
the group means is significant at the 0.05 level but not at the 0.01 level.
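The whole computation can be cross-checked with SciPy (assuming it is installed); f_oneway reproduces the F ratio obtained by hand:

```python
from scipy.stats import f_oneway

g1 = [34, 22, 21, 22, 34, 32, 44, 55, 12, 12]
g2 = [12, 15, 12, 23, 45, 43, 33, 44, 54, 37]
g3 = [45, 56, 65, 67, 67, 76, 54, 23, 21, 34]
g4 = [34, 55, 66, 76, 54, 34, 23, 22, 11, 23]

F, p = f_oneway(g1, g2, g3, g4)  # one-way ANOVA across the four groups
print(round(F, 2), round(p, 3))  # F ≈ 3.05 with df (3, 36)
```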
SECTION B
Q4. Discuss the assumptions of parametric and nonparametric statistics
Ans: Parametric statistics rely on a set of assumptions to ensure the validity
and accuracy of statistical tests and estimations. The specific assumptions can
vary depending on the type of parametric analysis being conducted, but the
following general assumptions are commonly associated with parametric
statistics:
1) Independence: The observations or data points should be independent of
each other. This assumption implies that the value of one observation does not
affect or influence the value of another observation. Independence is crucial
for the accuracy of parametric tests. Independence is important to ensure that
the statistical tests are not biased or distorted.
2) Random sampling: The data should be collected through a random sampling
method, where each observation has an equal chance of being selected.
Random sampling helps ensure that the sample is representative of the
population and allows for generalization of the findings.
3) Normality: Many parametric tests assume that the data follows a normal
distribution or at least approximate normality. This assumption is particularly
important for tests such as t-tests, analysis of variance (ANOVA), hypothesis
testing, confidence intervals, and regression analysis. Departures from
normality can affect the accuracy of these tests, especially when sample sizes
are small.
4) Homogeneity of variance: Some parametric tests, like t-tests and ANOVA,
assume that the variance of the dependent variable is equal across different
groups or conditions being compared. This assumption is referred to as
homogeneity of variance or homoscedasticity. Violations of this assumption
can lead to biased results in the analysis.
5) Linearity: Parametric regression models, such as linear regression, assume a
linear relationship between the independent and dependent variables. This
assumption implies that the change in the dependent variable is proportional
to the change in the independent variable(s). Non-linear relationships may
require alternative modelling techniques.
6) Scale of measurement: Parametric statistics assume that the data is
measured on an interval or ratio scale. Interval scale means that the
differences between values are meaningful, while ratio scale includes a true
zero point. This assumption allows for meaningful calculations and
interpretations of statistical measures.
7) Equality of variances (homoscedasticity): In regression analysis, the
assumption of homoscedasticity means that the variance of the errors or
residuals is constant across all levels of the independent variables. Violations of
this assumption can affect the accuracy and interpretability of regression
results.
It's important to note that not all parametric tests have the same assumptions,
and there may be additional assumptions specific to certain analyses.
Additionally, there are alternative non-parametric methods available for
situations where the assumptions of parametric statistics are violated.
Violations of these assumptions can lead to biased or misleading results.
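Two of these assumptions, normality and homogeneity of variance, are commonly checked before running a parametric test; a hedged sketch with SciPy (the samples are invented):

```python
from scipy.stats import shapiro, levene

g1 = [12, 15, 14, 10, 13, 17, 11, 16]  # made-up groups
g2 = [22, 25, 24, 20, 23, 27, 21, 26]

print(shapiro(g1))     # Shapiro-Wilk: tests the normality assumption
print(levene(g1, g2))  # Levene: tests homogeneity of variance
```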
Nonparametric statistics, also known as distribution-free statistics, are
statistical methods that do not rely on specific assumptions about the
underlying probability distribution of the data. Instead, they make fewer
assumptions, or none at all, about the shape, scale, or other characteristics
of the population distribution. Non-parametric methods provide a flexible
framework for data analysis that can be applied in various situations. However,
non-parametric statistics still involve certain basic assumptions to ensure valid
results, such as independent observations, random sampling, and at least an
ordinal level of measurement.
Subject    X     Y     Rx     Ry     D = Rx - Ry    D²
A          21    6     7.5    10     -2.5           6.25
B          12    8     9.5    9      0.5            0.25
C          32    21    3      6.5    -3.5           12.25
D          34    23    1.5    4      -2.5           6.25
E          23    33    5      3      2              4
F          34    22    1.5    5      -3.5           12.25
G          21    43    7.5    1      6.5            42.25
H          22    34    6      2      4              16
I          12    21    9.5    6.5    3              9
J          29    11    4      8      -4             16
N = 10                                              Ʃd² = 124.5
N = NO OF OBSERVATION
X = DATA 1
Y= DATA 2
RX = RANK OF X
RY = RANK OF Y
D = DIFFERENCE IN RANK
D2 = DIFFERENCE SQUARED
ρ = 1 - (6 Ʃd²) / (N(N² - 1))
= 1 - (6 × 124.5) / (10 × (100 - 1))
= 1 - 747/990
= 1 - 0.755
= 0.245
RANK CORRELATION COEFFICIENT ρ = 0.245
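A hedged cross-check with SciPy (assuming it is installed). Note that scipy.stats.spearmanr uses the tie-corrected formula, so with the tied ranks above its value differs slightly from the simplified 1 - 6Ʃd²/(N(N² - 1)) result:

```python
from scipy.stats import spearmanr

x = [21, 12, 32, 34, 23, 34, 21, 22, 12, 29]
y = [6, 8, 21, 23, 33, 22, 43, 34, 21, 11]

rho, p = spearmanr(x, y)  # Spearman rank correlation with tie correction
print(round(rho, 3))      # close to the hand-computed 0.245
```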
Since a large number of tests are already available, it is important to know which
test can be used where, and what these tests imply for the inferences drawn from
a dataset.
There are two types of data that we take for sampling: qualitative (categorical
data) and quantitative (numerical data). Qualitative data can be nominal or
ordinal; quantitative data is split into interval and ratio.
NOMINAL DATA
Nominal data assign numbers to objects, where different numbers indicate
different objects. Given that no ordering or meaningful numerical distances
between numbers exist in nominal measurement, we cannot obtain the coveted
'normal distribution' of the dependent variable. It is essentially a way of assigning
number values to inherently qualitative data, and it is the most basic level of
measurement, reflecting categories with no rank or order involved. Examples:
Gender: men and women
Favourite car: Mercedes, Bugatti, Pagani, BMW, Maserati
Current season: winter, spring, summer, autumn (4 seasons)
These are labels and cannot be put in any order. For example, Cristiano Ronaldo is
number 7; the number 7 does not have any specific meaning, it is just used to
differentiate him from the other players.
ORDINAL DATA
The second level of measurement, which is also frequently associated with non-
parametric statistics, is the ordinal scale (also known as rank order).
Ordinal data are also numbered and consist of groups and categories, but follow a
strict and meaningful order, such as rating from smaller to higher or from negative
to positive. For example, rating lunch as disgusting, unappetising, neutral, tasty,
or delicious is ordered from negative to positive.
Or consider the places finished in a race: 1, 2, 3, and so on. We know for sure that
the person coming first did better than everyone else; the numbers 1, 2, 3 show
the order in which they finished.
The number indicates placement or order. We are asked to rate the experience of
a restaurant or a shopping app from 1 to 10, where 1 is bad and 10 is the best, or
to rate in stars, where 5 stars is the best and 1 star is unbearable or bad.
INTERVALS
At the interval level, data are numeric and have equal intervals between values,
but there is no true zero point. The numbers have order, as in ordinal data, but
there are also equal intervals between adjacent categories. Because there is no
true zero, ratios and proportions cannot be calculated.
Examples include CAT scores, credit scores, and temperature, each of which covers
a range: you cannot have a credit score of zero, since a credit score cannot go
below 300.
For example, with temperature in degrees Fahrenheit, the difference between 75
and 76 is 1 degree and is the same as between 45 and 46; anywhere along the
Fahrenheit scale, a one-degree difference means the same thing. An interval scale,
however, does not start at a true zero. Temperature is an interval variable when
measured in Celsius or Fahrenheit: absolute zero is actually -273.15 degrees
Celsius or -459.67 degrees Fahrenheit. We can safely say that 80 degrees F is
lower than 100 degrees F, so the comparison is meaningful, but 0 is not a true
zero.
If, however, the temperature is stated in kelvin, it becomes a ratio scale, since
absolute zero is 0 kelvin.
RATIO
On a ratio scale, differences are meaningful, as on an interval scale, but ratios are
also meaningful because there is a true zero point, which intervals lack. Zero
represents the absence of the property: when a ratio value is zero, the property is
truly absent. This is what makes ratio the most precise and sophisticated level of
measurement. You may add, subtract, multiply, and divide, and the values will be
real and meaningful: 0 seconds literally means no duration, and 0 kelvin literally
means no heat. It is not some arbitrary number.
For example, the distance from point A to point B is 20 km; the starting point is
always zero. Zero kilograms means no weight, or the absence of weight, and 10 kg
is twice as much as 5 kg, which indicates that ratios are meaningful.
COMPARING SCALES
INTERVAL VS ORDINAL
Temperature: a 1-degree difference is the same at all points of the scale (interval).
Place in a race (1, 2, 3): the difference between finishing 1st and 2nd is not
necessarily, and probably not, the same as the difference between 2nd and 3rd
place (ordinal). The 1st-place finisher may have completed the race in 3 min 30 s,
the 2nd in 3 min 45 s, and the 3rd in 5 min 02 s; hence the intervals between
adjacent categories are not equal.
INTERVAL VS RATIO
Temperature: zero degrees does not mean the absence of heat; it just says that it
is really cold. 0 degrees F is 32 degrees below the freezing point, so it is indeed
very cold, but there is no true zero point, and 80 degrees is not twice as hot as 40
degrees (interval). In terms of ratio, if I have a 5 kg object and add another 5 kg
to it, it becomes 10 kg. However, if the temperature on day 1 is 30 degrees and on
day 2 is 30 degrees, the combined temperature is not 60 degrees; the temperature
on both days is still 30 degrees. The ratio is therefore not meaningful for
temperature, hence temperature is interval and weight is ratio.
ANOVA is a parametric test that assumes the data are normally distributed and have
equal variances across groups. It is used to compare means between two or more
groups. ANOVA assesses whether there is a statistically significant difference in the
means of the groups by examining the variance between groups and within groups.
On the other hand, the Kruskal-Wallis test is a non-parametric test that does not
assume normality or equal variances. It is used when the assumptions for ANOVA
are violated, such as when the data are skewed or have unequal variances. The
Kruskal-Wallis test compares the medians of the groups instead of the means.
Here are some key points of comparison between the Kruskal-Wallis test and
ANOVA:
Assumptions: ANOVA assumes normality and equal variances, while the Kruskal-
Wallis test is a non-parametric test and makes no assumptions about the underlying
distribution of the data or the variances.
Data Types: ANOVA can be applied to both numerical and categorical data,
provided the assumptions are met. The Kruskal-Wallis test is primarily used for
numerical data, but it can also be applied to ordinal or ranked categorical data.
Test Statistic: ANOVA calculates an F-statistic to determine the significance of
differences between groups based on variances. The Kruskal-Wallis test calculates a
chi-square statistic based on the ranks of the data to determine if there are
differences in medians between groups.
Post hoc Tests: In ANOVA, post hoc tests (e.g., Tukey's HSD, Bonferroni, etc.) can
be performed to identify which specific groups differ significantly. In the Kruskal-
Wallis test, if the overall test is significant, follow-up pairwise comparisons are
typically conducted using appropriate non-parametric tests (e.g., Dunn's test, Mann-
Whitney U test).
Interpretation: ANOVA provides information about the differences in means
between groups. The Kruskal-Wallis test provides information about differences in
medians between groups.
In summary, ANOVA is a parametric test that makes assumptions about the data,
while the Kruskal-Wallis test is a non-parametric alternative that does not make such
assumptions. The choice between these tests depends on the nature of the data and
whether the assumptions of ANOVA are met. If the assumptions are violated, the
Kruskal-Wallis test can be used as a robust alternative.
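As a small illustration of the choice (assuming SciPy; the samples are invented), both tests can be run on the same data:

```python
from scipy.stats import f_oneway, kruskal

a = [12, 15, 14, 10, 13]  # made-up samples
b = [22, 25, 24, 20, 23]
c = [32, 35, 34, 30, 33]

print(f_oneway(a, b, c))  # parametric: F statistic and p-value
print(kruskal(a, b, c))   # non-parametric: H statistic and p-value
```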
OBSERVED VALUES

Students    High EI score    Low EI score    Total
School A    23               22              45
School B    12               18              30
Total       35               40              75
EXPECTED VALUES

Students    High EI score       Low EI score
School A    45 x 35/75 = 21     45 x 40/75 = 24
School B    30 x 35/75 = 14     30 x 40/75 = 16

Each expected value = (row total x column total) / grand total.
CALCULATION OF X² (CHI-SQUARE)

Observed value (O)    Expected value (E)    O - E    (O - E)²    (O - E)²/E
23                    21                    2        4           0.19
22                    24                    -2       4           0.17
12                    14                    -2       4           0.29
18                    16                    2        4           0.25
Total: 75             75                                         X² = 0.89
DEGREES OF FREEDOM
df = (rows - 1)(columns - 1) = (2 - 1)(2 - 1) = 1
Table value (critical value) of X² = 3.841 at the 0.05 significance level
Table value of X² = 6.635 at the 0.01 significance level
TABLE VALUE (CRITICAL VALUE) > CALCULATED VALUE (COMPUTED VALUE)
THE CALCULATED VALUE 0.89 IS MUCH LESS THAN THE TABULAR VALUE;
THEREFORE THE NULL HYPOTHESIS CANNOT BE REJECTED.
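The hand computation can be cross-checked with SciPy (assuming it is installed). For a 2x2 table, chi2_contingency applies Yates' continuity correction by default, so correction=False is needed to reproduce the value above:

```python
from scipy.stats import chi2_contingency

observed = [[23, 22],
            [12, 18]]
chi2, p, df, expected = chi2_contingency(observed, correction=False)
print(round(chi2, 2), df)  # ~0.89 with df = 1
print(expected)            # [[21. 24.] [14. 16.]]
```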
SECTION C
Answer the following in about 50 words each 10x3=30 Marks
Skewness and kurtosis are two statistical measures used to describe the shape and
distribution of a dataset.
Skewness:
Skewness measures the asymmetry of a distribution. It indicates whether the data is
skewed to the left (negative skewness) or to the right (positive skewness), or if it is
approximately symmetric (zero skewness). Skewness is calculated based on the
third standardized moment of the data.
Skewness is useful in understanding the shape and behavior of the data distribution,
particularly when it deviates from a normal distribution. It can impact statistical
analysis and inference, as skewed data may require special treatment or
transformation.
Kurtosis:
Kurtosis measures the peakedness or flatness of a distribution. It indicates the
presence of outliers or extreme values and compares the distribution to a normal
distribution. Kurtosis is calculated based on the fourth standardized moment of the
data.
Leptokurtic (positive excess kurtosis): The distribution has heavy tails and a sharp
peak, indicating a higher probability of extreme values or outliers.
Mesokurtic (zero excess kurtosis): The distribution has a similar shape to a normal
distribution, with moderate tails and a moderate peak.
Platykurtic (negative excess kurtosis): The distribution has lighter tails and a flatter
peak, indicating fewer extreme values and a more spread-out distribution.
Kurtosis provides information about the presence of extreme values or outliers in the
data, which can impact the interpretation of statistical tests and the selection of
appropriate models for analysis.
It's important to note that skewness and kurtosis are descriptive statistics and do not
imply any specific distribution or make conclusions about the underlying data. They
are tools for understanding and summarizing the shape and characteristics of the
dataset.
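A minimal sketch, assuming SciPy, of how both measures are obtained in practice; the dataset is invented and deliberately right-skewed:

```python
from scipy.stats import skew, kurtosis

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]  # the 12 drags the right tail out

print(round(skew(data), 2))      # positive: right-skewed
print(round(kurtosis(data), 2))  # excess kurtosis: > 0 is leptokurtic
```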
Q 11. Point and interval estimations.
Point estimation and interval estimation are two approaches used in statistical
inference to estimate population parameters based on sample data.
Point Estimation:
Point estimation involves estimating a population parameter using a single value,
which is often derived from a sample statistic. The goal is to find the "best guess" or
the most likely value for the parameter. For example, the sample mean is commonly
used as a point estimate for the population mean, or the sample proportion is used
as a point estimate for the population proportion.
Point estimates provide a single value that serves as an estimate of the parameter of
interest. However, they do not provide information about the variability or uncertainty
associated with the estimate. To address this limitation, interval estimation is used.
Interval Estimation:
Interval estimation involves constructing a range or an interval of values within which
the population parameter is believed to lie. This range is called a confidence interval.
The confidence interval provides an estimate of the parameter along with an
associated level of confidence or probability.
For example, a 95% confidence interval for the population mean would provide a
range of values within which we are 95% confident that the true population mean
lies. The interval estimation takes into account both the point estimate and the
variability of the estimate, providing a measure of precision and uncertainty.
The width of the confidence interval depends on factors such as the sample size,
variability of the data, and the chosen confidence level. A wider interval indicates
greater uncertainty, while a narrower interval indicates greater precision in the
estimate.
Interval estimation provides a more informative measure than point estimation alone,
as it incorporates the uncertainty associated with the estimate. It allows researchers
and decision-makers to assess the precision and reliability of the estimate and make
informed inferences about the population parameter.
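A minimal sketch, assuming SciPy, of a point estimate and a 95% confidence interval for a mean using the t distribution; the sample is invented:

```python
import statistics
from scipy import stats

sample = [12, 15, 14, 10, 13, 17, 11, 16, 14, 13]
mean = statistics.mean(sample)                       # point estimate
sem = statistics.stdev(sample) / len(sample) ** 0.5  # standard error

low, high = stats.t.interval(0.95, df=len(sample) - 1, loc=mean, scale=sem)
print(mean, (round(low, 2), round(high, 2)))  # estimate and 95% CI
```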
14) Outliers
Outliers are extreme score on one of the variables or both the variables. The
presence
of outliers has deterring impact on the correlation value. The strength and degree of
the correlation are affected by the presence of outlier. They are basically
observations that lie an abnormal distance away from other values. In other words,
outliers are data points that are either extremely high or extremely low compared to
the majority of the data
Outliers can arise due to various reasons, such as measurement errors, data entry
errors, natural variations in the data, or rare events. They can have a significant
impact on the statistical analysis and modelling of a dataset, as they can skew the
results and distort the interpretation of the data.
Identifying and handling outliers is important in data analysis to ensure accurate and
reliable results. Outliers can be detected using various statistical techniques, such as
graphical methods like box plots or scatter plots, or through statistical tests based on
measures like standard deviation or the interquartile range.
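One common convention, the 1.5 x IQR rule, can be sketched with the Python standard library alone; the data are invented:

```python
import statistics

data = [10, 12, 12, 13, 13, 14, 15, 15, 16, 98]  # 98 is an obvious outlier

q1, _, q3 = statistics.quantiles(data, n=4)  # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

print([x for x in data if x < lower or x > upper])  # -> [98]
```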
6. Variance
In the terminology of statistics, the distance of a score from a central point,
i.e. the mean, is called its deviation, and the index of variability is known as
the mean deviation or the standard deviation (σ).
Variance is a statistical measure that quantifies the dispersion or spread
of a set of data points around their mean (average) value. It provides a
numerical representation of how much the data points deviate from the
mean.
Mathematically, variance is calculated by taking the average of the
squared differences between each data point and the mean. The
formula for variance, denoted as σ² or Var(X), for a dataset X with n data
points is:
σ² = Σ(xᵢ - μ)² / n
where xᵢ represents each data point, μ represents the mean of the
dataset, and Σ denotes summation across all data points.
In the study of sampling theory, some results are somewhat more simply
interpreted if the variance of a sample is defined as the sum of the squared
deviations divided by its degrees of freedom (N - 1), rather than as the mean
of the squared deviations: s² = Σ(xᵢ - x̄)² / (N - 1).
Variance has several important properties:
Variance is always a non-negative value. Since it involves squaring the
differences, it eliminates the effects of positive and negative deviations,
resulting in a positive value.
A smaller variance indicates that the data points are closer to the mean,
suggesting less dispersion or spread. Conversely, a larger variance
implies greater dispersion or spread of the data points.
Variance is influenced by outliers, as they can significantly increase the
squared differences and hence inflate the variance.
Variance is not expressed in the original units of the data points but in
squared units. To obtain a measure in the original units, the square root
of the variance, known as the standard deviation, is commonly used.
Variance is a fundamental concept in statistics and plays a crucial role in
various statistical analyses, such as hypothesis testing, regression
analysis, and the calculation of confidence intervals. It provides valuable
information about the variability within a dataset and helps to assess the
reliability and predictability of the data.
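The N versus N - 1 point above is easy to see numerically, assuming NumPy is available:

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up data with mean 5

print(np.var(x, ddof=0))  # population formula: SS / N        -> 4.0
print(np.var(x, ddof=1))  # sample formula:     SS / (N - 1)  -> ~4.57
print(np.std(x, ddof=0))  # standard deviation = sqrt(variance) -> 2.0
```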
It's worth noting that the Wilcoxon signed-rank test assumes that the
differences between paired observations are independent and identically
distributed, and the distribution of the differences is symmetric around
the median.
THE END