Notes In-Statistics
Hadjiali, RSW
TYPES OF DATA
Quantitative data are those data that can be expressed in numbers. These are the things that can be measured, like a person’s age, height, and weight, or a family’s annual income and a merchant’s profit. These data can also be counted, like the number of pupils enrolled in elementary public schools and the number of female senators.
Qualitative data are those data for which no numerical measures exist and which are usually expressed in categories or kind. Examples of qualitative data are the color of the eyes, which can be brown, black, gray, or blue; a person’s gender, which is male or female; and a person’s educational level, which can be elementary, secondary, college, masters, or doctorate.
TYPES OF VARIABLES
Variables are the characteristics or properties measured from objects, persons, or things. These variables can be discrete or continuous. The former can be counted and thus assume values that are whole numbers, while the latter are measured using some unit of measurement and may take decimal values. For example, the numbers of passers and failures in a Nursing Board Examination are discrete variables, while the weights, heights, and ages of the students are continuous variables.
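As a rough illustration (the variable names and values below are hypothetical), these distinctions can be mirrored in how values are represented in Python:

# Hypothetical example values; only the data-type distinctions matter here.
age_years = 34.5          # quantitative, continuous (measured, may take decimals)
num_children = 3          # quantitative, discrete (counted, whole number)
eye_color = "brown"       # qualitative, nominal (category with no order)
educ_level = "college"    # qualitative, ordinal (categories with a natural order)

print(type(age_years), type(num_children), type(eye_color), type(educ_level))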
MEASUREMENT AND THE PROPERTIES OF NUMBERS
Measurement is the process of assigning a number or a numerical value to a characteristic of the object that is being measured. The properties of numbers are:
Identity is the property of a number that enables a person to distinguish one number from another and is used for classification purposes only.
Order refers to the way the numbers are arranged in sequence.
Additivity is the property that allows us to add numbers.
SCALES OF MEASUREMENT
Nominal measurements possess only the property of identity and do not possess the properties of order and equality of scale.
Ordinal measurements possess the properties of both identity and order but not the equality-of-scale property.
Interval measurements possess the properties of identity, order, and equality of scale but do not have the property of absolute zero.
Ratio measurements possess all the properties of identity, order, equality of scale, and absolute zero.
Nonmetric data are further classified as nominal and ordinal. Nonmetric data are categorical measurements and are expressed by means of a natural language description.
STATISTICAL TOOLS FOR THE NUMERICAL DESCRIPTION OF DATA
Measures of central location
The most common way of describing data is by identifying its measure of central location. It is a numerical value that summarizes a set of observations into a single value, and that value may be used to represent the entire population. It is a single value about which the set of observations tends to cluster.
Median
The median is the middle value of a set of observations arranged in an increasing or decreasing order of magnitude. Hence, it is the value such that half of the observations fall above it and half below it.
It is a positional value and, unlike the arithmetic mean, it is not affected by the presence of extreme values. When abnormal values or outliers are present, it is preferable to use the median rather than the mean as a measure of central location.
It is an appropriate measure for data which are at least in the ordinal scale.
Mode
The mode of a set of observations is the value or values which occur the most number of times, that is, the value or values with the greatest frequency.
The mode is determined by the frequency and not by the values of the observations.
The mode may be defined for qualitative or quantitative variables.
It is an appropriate measure even for a nominal type of data.
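A minimal Python sketch of these measures, using a made-up set of observations:

import statistics

# Hypothetical set of observations (e.g., exam scores).
scores = [70, 75, 75, 80, 82, 88, 95]

print("mean:", statistics.mean(scores))      # arithmetic mean
print("median:", statistics.median(scores))  # middle value of the ordered data
print("mode:", statistics.mode(scores))      # most frequently occurring value

Because the median is positional, replacing 95 with 950 would change the mean but leave the median at 80, which mirrors the robustness to outliers noted above.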
MEASURES OF VARIABILITY
A measure of variability is a numerical value computed from the given observations that measures how the data spread from the central location. It is often used in comparing two sets of data: the smaller the measure, the closer the values of the observations are to the central value.
Properties
The range of a set of observations is the difference between the largest and the smallest values in the set.
The variance can never be a negative number.
A large variance corresponds to a highly dispersed set of values.
The variance makes use of all observations.
In performing statistical inference, the variance is manipulated for further mathematical computations.
The standard deviation has the same properties as the variance except the last one.
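A minimal sketch of the range, variance, and standard deviation with NumPy (the observations are hypothetical; ddof=1 gives the sample versions):

import numpy as np

data = np.array([70, 75, 75, 80, 82, 88, 95])  # hypothetical observations

data_range = data.max() - data.min()   # range: largest minus smallest value
variance = data.var(ddof=1)            # sample variance (never negative)
std_dev = data.std(ddof=1)             # standard deviation, in the same units as the data

print(data_range, variance, std_dev)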
STATISTICAL DECISIONS IN HYPOTHESIS TESTING
In hypothesis testing experiments, since a population parameter is tested for some of its characteristics on the basis of a sample obtained from the population of interest, some errors are bound to happen. These errors are known as statistical errors.
In hypothesis testing experiments, the research hypothesis is tested by negating the null hypothesis. The focus of the researcher is to test whether the null hypothesis can be rejected on the basis of the given sample data.
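The decision itself reduces to comparing a test’s p-value against a chosen significance level; a minimal sketch with made-up numbers:

# Hypothetical p-value from some hypothesis test, and a chosen significance level.
p_value = 0.032
alpha = 0.05

if p_value < alpha:
    # Rejecting a true null hypothesis would be a Type I error.
    decision = "reject the null hypothesis"
else:
    # Failing to reject a false null hypothesis would be a Type II error.
    decision = "fail to reject the null hypothesis"

print(decision)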
ASSUMPTIONS IN PARAMETRIC TESTS
Parametric and Nonparametric Tests
The choice between parametric and nonparametric tests does not solely depend on the assumption of normality. The choice also depends on the level of measurement of the variables under consideration. According to Sheskin (2003), when the data are measured on an interval or ratio scale, the parametric tests should be tried first. However, if there is any violation of one or more of the parametric assumptions, it is recommended to transform the data into a format that makes it compatible with the appropriate nonparametric test.
Normality
Plotting a histogram or QQ plot of the variable of interest will give an indication of the shape of the distribution. Histograms should peak in the middle and be approximately symmetrical about the mean. If the data are normally distributed, the points in a QQ plot will be close to the line.
The normality assumption can be tested using the Shapiro-Wilk test or the Kolmogorov-Smirnov test. The Shapiro-Wilk test is more suitable for small samples (n≤50), but it can be used for sample sizes of up to 2000 observations, whereas the Kolmogorov-Smirnov test is used for large samples. One weakness of these tests is that they give significant results for large samples even for slight deviations from normality.
Other Assumptions
The randomness assumption is mostly related to the requirement that the data have been obtained from a random sample.
The homogeneity of variances assumption ensures that the samples are drawn from populations having equal variance with respect to some criterion.
Various parametric tests require the independence assumption. However, for certain types of data, the observations may not be independent.
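A minimal sketch of these normality checks with SciPy (the sample here is randomly generated purely for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=10, size=40)   # hypothetical sample of 40 observations

# Shapiro-Wilk test (well suited to small samples): H0 is that x is normally distributed.
sw_stat, sw_p = stats.shapiro(x)

# Kolmogorov-Smirnov test against a normal distribution with the sample's own mean and sd.
ks_stat, ks_p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))

print(sw_p, ks_p)   # small p-values (e.g., < 0.05) suggest departure from normality

Strictly speaking, estimating the mean and standard deviation from the same sample makes this Kolmogorov-Smirnov p-value approximate (the Lilliefors variant addresses this), but the sketch shows the basic calls.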
STATISTICAL TOOLS FOR COMPARING MEANS
One Sample t-Test
The one-sample t-test is typically used to compare a sample mean to a hypothetical population mean. This test statistic follows the t-distribution.
We use the t-statistic only when the sample is small (n<30) and the population variance is unknown. As per the central limit theorem, the distribution of t becomes normal if the sample is large (n≥30).
Assumptions:
The dependent variable must be continuous (interval/ratio).
The observations are independent of one another.
The dependent variable should be approximately normally distributed.
The dependent variable should not contain any outliers.
When normality is severely violated, a nonparametric test should be applied. For a one-sample test, the sign test is the appropriate nonparametric alternative.
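A sketch of the one-sample t-test in SciPy, with a sign test (an exact binomial test on the signs) as the nonparametric fallback; the data and the hypothesized mean of 100 are assumptions made for illustration:

from scipy import stats

sample = [102, 98, 105, 110, 99, 101, 97, 108]   # hypothetical observations
mu0 = 100                                        # hypothesized population mean

# Parametric: one-sample t-test.
t_stat, p_value = stats.ttest_1samp(sample, popmean=mu0)
print("t-test p-value:", p_value)

# Nonparametric fallback: sign test, i.e. a binomial test on how many values exceed mu0.
n_above = sum(x > mu0 for x in sample)
n_nonzero = sum(x != mu0 for x in sample)
sign_p = stats.binomtest(n_above, n_nonzero, p=0.5).pvalue
print("sign test p-value:", sign_p)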
Paired t-Test
The paired t-test is used to compare two related means, which mostly come from a repeated-measures design. In other words, data are collected by two measurements from each observation, e.g., before and after.
Assumptions:
The dependent variable must be continuous (interval/ratio).
The observations are independent of one another.
The dependent variable should be approximately normally distributed.
The t-test is robust to violations of the assumption of normality, particularly for moderate (n≥30) and larger sample sizes. When normality is severely violated, a nonparametric test should be applied. For paired samples, the appropriate nonparametric test is the Wilcoxon signed-rank test.
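A sketch of the paired t-test and the Wilcoxon signed-rank fallback in SciPy; the before/after values are made up:

from scipy import stats

before = [85, 90, 78, 92, 88, 76, 95, 80]   # hypothetical measurements before treatment
after  = [88, 94, 80, 95, 86, 81, 99, 84]   # the same subjects measured after treatment

# Parametric: paired t-test on the within-subject differences.
t_stat, p_paired = stats.ttest_rel(before, after)

# Nonparametric fallback: Wilcoxon signed-rank test on the same pairs.
w_stat, p_wilcoxon = stats.wilcoxon(before, after)

print(p_paired, p_wilcoxon)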
Independent Two-sample t-Test
One of the most widely and commonly used parametric tests is the independent two-sample t-test. It is utilized for comparing differences between two separate groups/populations.
Assumptions:
Data values must be independent; measurements for one observation do not affect measurements for any other observation.
Data in each group must be obtained via a random sample from the population.
Data in each group are normally distributed.
The appropriate test for comparing the means of two groups when the assumptions of normality and homogeneity are violated is the Mann-Whitney test.
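A sketch of the independent two-sample t-test and the Mann-Whitney fallback in SciPy; the two groups are hypothetical:

from scipy import stats

group_a = [23, 25, 28, 30, 27, 24, 26]   # hypothetical scores for group A
group_b = [31, 29, 35, 33, 30, 34, 32]   # hypothetical scores for group B

# Parametric: independent two-sample t-test (equal_var=True is the classical pooled test).
t_stat, p_t = stats.ttest_ind(group_a, group_b, equal_var=True)

# Nonparametric fallback when normality/homogeneity are doubtful: Mann-Whitney U test.
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(p_t, p_u)

Setting equal_var=False would give Welch’s t-test, which is often preferred when the homogeneity-of-variance assumption is in doubt.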
STATISTICAL TOOLS FOR COMPARING VARIABILITY
F-test for Comparing Variability
The F-test is different from the t-tests in that researchers/practitioners utilize it for testing whether there are any differences in the variances within the samples.
The F-test is used for testing the equality of variances before using the t-test. In cases where there are only two means to compare, the t-test and the F-test are equivalent and generate the same results.
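A sketch of the variance-ratio F-test on two hypothetical samples; the F statistic is the ratio of the sample variances, and a two-sided p-value is obtained by doubling the smaller tail of the F distribution:

import numpy as np
from scipy import stats

x = np.array([12.1, 11.8, 12.5, 13.0, 12.3, 11.9, 12.7])   # hypothetical sample 1
y = np.array([11.2, 13.5, 10.9, 14.1, 12.8, 10.5, 13.9])   # hypothetical sample 2

f_stat = x.var(ddof=1) / y.var(ddof=1)   # ratio of the sample variances
dfn, dfd = len(x) - 1, len(y) - 1        # numerator and denominator degrees of freedom

# Two-sided p-value: twice the smaller tail probability of the F distribution.
p_value = 2 * min(stats.f.cdf(f_stat, dfn, dfd), stats.f.sf(f_stat, dfn, dfd))

print(f_stat, p_value)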
Analysis of Variance (ANOVA)
Analysis of variance (ANOVA) is the appropriate statistical technique for testing differences among three or more means (treatments or groups).
There are several types of ANOVA, depending on the number of categorical variables; each value is classified in exactly one way. If we have two categorical variables, then it is named two-way ANOVA, and so on.
In ANOVA, the factor is the categorical variable being used to define the groups, while the response variable is the quantitative variable compared across them.
Assumptions:
Independence
Normality
Homogeneity of variance (i.e., homoscedasticity)
In general, if some of the assumptions for parametric tests are not met or the data are ordinal or nominal, the alternative nonparametric test for one-way ANOVA is the Kruskal-Wallis test.
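A sketch of a one-way ANOVA and the Kruskal-Wallis alternative in SciPy, using three hypothetical treatment groups:

from scipy import stats

treatment_1 = [65, 70, 68, 72, 66]    # hypothetical responses under treatment 1
treatment_2 = [74, 78, 75, 80, 77]    # hypothetical responses under treatment 2
treatment_3 = [82, 85, 88, 84, 86]    # hypothetical responses under treatment 3

# Parametric: one-way ANOVA tests whether the three group means are all equal.
f_stat, p_anova = stats.f_oneway(treatment_1, treatment_2, treatment_3)

# Nonparametric alternative: Kruskal-Wallis test on the same groups.
h_stat, p_kw = stats.kruskal(treatment_1, treatment_2, treatment_3)

print(p_anova, p_kw)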
CORRELATION ANALYSIS
Correlation is one of the statistical measures that identify two or more variables that change together. Correlation measures the direction and magnitude or strength of the relationship between each pair of variables. In other words, correlation is a measure of association that tests whether a relationship exists between two variables.
A positive correlation shows that the variables are moving in the same direction, increasing or decreasing together, while a negative correlation means that the variables are moving in opposite directions, one increasing while the other is decreasing.
Karl Pearson’s Coefficient of Correlation
Pearson’s correlation coefficient is the statistical measure of association for quantitative data. The values of Pearson’s correlation coefficient are always between -1 and 1. A value of r = 1 indicates that the two variables are perfectly related in a positive linear sense; r = -1 means that the two variables are perfectly related in a negative linear sense; and a correlation coefficient of 0 indicates that there is no linear relationship between the two variables.
Assumptions:
The variables x and y are linearly related.
There is a cause-and-effect relationship between factors affecting the values of the variables x and y.
The random variables x and y are normally distributed.
If the data violate any of the above assumptions, then use an alternative nonparametric test such as the Spearman correlation coefficient.
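A sketch of both coefficients in SciPy, on made-up paired measurements:

from scipy import stats

x = [1.2, 2.4, 3.1, 4.8, 5.5, 6.9, 7.3]   # hypothetical values of variable x
y = [2.0, 2.9, 3.8, 5.1, 5.9, 7.2, 7.8]   # hypothetical values of variable y

# Pearson's r: strength and direction of the linear relationship.
r, p_pearson = stats.pearsonr(x, y)

# Spearman's rho: rank-based alternative when the Pearson assumptions are doubtful.
rho, p_spearman = stats.spearmanr(x, y)

print(r, rho)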
Testing for Independence
The chi-square test for independence (also known as a test of association) is used for testing the relationship between two categorical variables in a cross-classification contingency table. The cross-classification is used to test whether the observed patterns of the variables are dependent on each other.
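A sketch of the chi-square test of independence on a hypothetical 2x2 contingency table:

from scipy import stats

# Hypothetical cross-classification: rows = gender, columns = preference.
observed = [[30, 10],
            [20, 40]]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

print(chi2, p_value)   # a small p-value suggests the two variables are associated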