Q1. In statistics, measures of central tendency identify the central location of a distribution of data observations. The arithmetic mean, the sum of all values divided by their number, is very significant, for instance when tracking changes in datasets over time, even though it can be heavily influenced by extreme observations. The median is the value in the center after the data has been arranged in ascending or descending order; it provides a more representative figure in the presence of skewness since it is not influenced by tail values. The mode is the value that appears most often and can be applied to both quantitative and qualitative data; however, a dataset may have no mode, one mode, or numerous modes. Every measure offers advantages and disadvantages based on the characteristics of the data.
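As a rough illustration, the following Python sketch (with made-up numbers) shows how an outlier pulls the mean while leaving the median and mode untouched:

# Mean, median, and mode with Python's standard library
from statistics import mean, median, multimode

data = [2, 3, 3, 5, 7, 9, 40]  # 40 is an extreme observation

print(mean(data))       # ~9.86 -- dragged upward by the outlier
print(median(data))     # 5     -- unaffected by the outlier
print(multimode(data))  # [3]   -- the most frequent value(s)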
Q2. The geometric mean and the harmonic mean serve as alternative measures of central tendency, effective in particular situations where the arithmetic mean falls short.
The geometric mean is the nth root of the product of n numbers (geometrically, the side of an n-dimensional cube whose volume equals the product of those values). It is suited to data that involves growth rates or compounding, such as investment returns. For instance, if an investment grows by 10%, 20% and 30% over three years, the geometric mean gives the applicable average growth rate because it accounts for compounding. It is often used in finance and economics, for instance to average investment returns over a period in which different years grow at different rates. It cannot be applied, however, where zeros or negative values exist in the dataset.
The harmonic mean is the number of values divided by the sum of the reciprocals of those values. It is preferable for rates and ratios, and it gives less weight to large values. The simple average would be inappropriate where sections of a journey are covered at different speeds, for example half the distance at 30 km/h and the other half at 60 km/h; the harmonic mean yields the correct overall speed of 40 km/h. Its use is restricted, however, where zero values occur, which matters in fields such as physics and transport.
In other words, the geometric mean applies better to growth rates while the harmonic mean applies better to rates and ratios; each therefore gives a measure that is truer to certain situations than the arithmetic mean.
Q3. Quartiles, deciles, and percentiles are positional measures that divide an ordered dataset into equal parts.
- Quartiles are three values that divide the dataset into four equal parts.
- Q1 marks the point below which the first 25% of the data falls.
- Q2 is the value that divides the ordered dataset into halves, and is thus the median.
- Q3 is the value below which 75% of the data falls.
- A decile is a statistical measure that subdivides a dataset into ten equal segments. Thus, the first decile D1 is the value below which 10% of the data lies, the second decile D2 the value below which 20% lies, and so forth.
- Percentiles divide the data into 100 equal parts. The nth percentile is the value below which n percent of the data falls (for example, the 90th percentile is the value below which 90 percent of the data lies).
Such measures are useful for describing the dispersion or spread of the data and for highlighting salient features such as anomalies and fluctuations within it. For instance, quartiles assist in determining the IQR, while percentiles come in handy when scoring candidates in standardized exams.
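For instance, quartiles, deciles, and percentiles can all be computed with a single NumPy call; the dataset below is hypothetical:

import numpy as np

data = np.arange(1, 101)  # the values 1..100

q1, q2, q3 = np.percentile(data, [25, 50, 75])  # quartiles
d1 = np.percentile(data, 10)                    # first decile
p90 = np.percentile(data, 90)                   # 90th percentile

print(q1, q2, q3)  # 25.75 50.5 75.25 (linear interpolation)
print(d1, p90)     # 10.9 90.1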
Q4. The statistical tools that show the extent or range of a dataset are referred to as measures
of dispersion. The three most frequently used measures of dispersion include mean deviation,
quartile deviation, and standard deviation, each of which has different definitions, calculations,
uses, advantages and disadvantages.
1. Mean Deviation
● Definition: The mean deviation is the average of the absolute deviations from a central measure (mean or median).
● Calculation: MD = Σ|xᵢ − x̄| / n
● Demerits: The direction of deviations is not accounted for; it is still affected by extreme values; it is rarely employed in higher-level statistics.
2. Quartile Deviation
● Definition: The quartile deviation measures the spread of the middle 50% of the dataset, calculated as half the interquartile range (IQR).
● Calculation: QD = (Q3 − Q1) / 2
3. Standard Deviation
● Definition: The standard deviation is the square root of the average squared deviation from the mean.
● Calculation: σ = √( Σ(xᵢ − µ)² / n )
● Application: Widely used in various fields, including finance and research, to assess
risk and variability
● Merits: Takes every data point into account; provides a comprehensive
measure of variability
● Demerits: Can be influenced significantly by extreme values; may be difficult to interpret
without context
In summary, while all three measures provide insights into data variability, their
applications and interpretations differ significantly depending on the characteristics of the
dataset.
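A short sketch computing all three measures on hypothetical data:

import numpy as np

data = np.array([4, 8, 15, 16, 23, 42])

# Mean deviation: average absolute deviation from the mean
mean_dev = np.mean(np.abs(data - data.mean()))

# Quartile deviation: half the interquartile range
q1, q3 = np.percentile(data, [25, 75])
quartile_dev = (q3 - q1) / 2

# Standard deviation (population form, dividing by n)
std_dev = data.std()

print(mean_dev, quartile_dev, std_dev)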
Q5. The CV is a measure of variability defined as the ratio of the standard deviation to the mean, most commonly expressed as a percentage. It measures how much a dataset varies relative to its mean. Because this measure of dispersion is relative, it is useful for comparing the dispersions of different datasets even when their means vary a great deal.
Interpretation
● Relative Measure: The Coefficient of Variation (CV) is useful for comparing the variability of two or more datasets, even when they are on different scales or have vastly different means. For instance, consider two datasets, one with a mean of 100 and a standard deviation of 20 (CV = 20%) and the other with a mean of 10 and a standard deviation of 2 (CV = 20%). Though the means differ greatly, the degree of relative variability is the same.
● Insight into Risk and Consistency: In finance, the coefficient of variation is useful for evaluating the level of risk associated with a specific investment. A higher CV signifies greater risk or variability in returns relative to the mean, and a lower one the opposite. For example, if two investments produce the same mean return, the one with the lower CV is considered the less risky investment.
Advantages
● Standardization: The CV offers a unit-free gauge of spread, which aids in comparing the dispersion of data measured on different scales.
● Versatility: It is relevant in a number of sectors, such as finance, biology, and quality control, for measuring consistency and risk.
Limitations
● The CV is undefined when the mean is zero and becomes unstable when the mean is close to zero.
● It is meaningful only for data measured on a ratio scale; where the zero point is arbitrary (e.g., temperature in °C), the percentage has no useful interpretation.
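The two-dataset comparison described above can be reproduced with a few lines of Python (the numbers are illustrative):

import numpy as np

a = np.array([80, 90, 100, 110, 120])  # mean 100, sd ~14.1
b = np.array([8, 9, 10, 11, 12])       # mean 10,  sd ~1.41

def cv(x):
    return x.std() / x.mean() * 100    # CV as a percentage

print(cv(a), cv(b))  # both ~14.1% -- identical relative variability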
Q6. A moment in statistics is a measure which permits the description of some properties of a
distribution in quantitative terms. Moments provide information about the shape, dispersion, and
general characteristics of the data. The nth central moment of a distribution is defined as the average of the nth power of the deviations from the mean. There are several moments, but the first four are of great
importance in defining the main features of the distribution such as the central tendency,
dispersion, skewness, and kurtosis.
1. First Moment (Mean)
● Definition: The first moment is the arithmetic mean of the data, representing the
central location or average value.
● Calculation: µ₁ = Σxᵢ / n
● Significance: It indicates the central value of the dataset and is used as the reference
point for calculating higher moments.
2. Second Moment (Variance)
● Definition: The second moment is the variance, which measures the spread or
dispersion of the data around the mean. It is the average of the squared deviations from
the mean.
● Calculation: µ₂ = σ² = Σ(xᵢ − µ₁)² / n
● Significance: Variance tells us how spread out the data points are from the mean. A
higher variance indicates greater dispersion.
3. Third Moment (Skewness)
● Definition: The third moment is related to skewness, which measures the asymmetry of
the distribution. Skewness tells whether the data is skewed to the left or right of the
mean.
● Calculation:
µ₃ = Σ(xᵢ − µ₁)³ / n
Skewness = µ₃ / σ³
where σ is the standard deviation.
● Significance:
- A positive skew indicates a longer right tail (data concentrated on the left)
- A negative skew indicates a longer left tail (data concentrated on the right)
- A skewness value closer to zero suggests a symmetrical distribution
4. Fourth Moment (Kurtosis)
● Definition: The fourth moment is related to kurtosis, which measures the "tailedness"
or peakedness of a distribution. It describes how much of the data is in the tails
compared to the center.
● Calculation:
µ₄ = Σ(xᵢ − µ₁)⁴ / n
Kurtosis = µ₄ / σ⁴
● Significance:
- Leptokurtic distributions have high kurtosis, meaning more data is concentrated in the tails (sharp peak).
- Platykurtic distributions have low kurtosis, indicating less concentration in the tails (flatter peak).
- Mesokurtic distributions (e.g., the normal distribution) have a kurtosis value around 3, representing moderate tail behavior.
Understanding these moments provides comprehensive insights into the distribution’s shape,
especially when analyzing how data behaves compared to a normal distribution.
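The following sketch computes the four moments and the derived skewness and kurtosis exactly as defined above, on a small hypothetical dataset:

import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = data.mean()                # first moment: the mean
m2 = np.mean((data - mu) ** 2)  # second central moment: variance
m3 = np.mean((data - mu) ** 3)  # third central moment
m4 = np.mean((data - mu) ** 4)  # fourth central moment

sigma = np.sqrt(m2)
skewness = m3 / sigma ** 3      # ~0.66: slight right skew
kurtosis = m4 / sigma ** 4      # ~2.78: close to the normal value of 3

print(mu, m2, skewness, kurtosis)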
Q7. Kurtosis is a statistical characteristic that defines the shape of a distribution and
most importantly, how fat or thin the tails of the distribution are, i.e. how many points of
the data are in the tails of the distribution and how many are close to the peak of the
distribution. In other words, we can look at kurtosis to analyze how much of the data is
contained in the extremities of the distribution in comparison to its core and compare it to
a bell shaped curve.
Kurtosis is significant because it indicates how likely (or unlikely) extreme values (outliers) are in a given distribution compared with a normal one. The kurtosis of a perfect normal distribution is 3; in most cases, however, the excess kurtosis (kurtosis − 3) is used, meaning:
- Excess kurtosis of 0: The distribution exhibits the same tail behavior as a normal distribution.
- Positive excess kurtosis: The distribution has fatter tails than the normal distribution.
- Negative excess kurtosis: The distribution has thinner tails than the normal distribution.
Types of Kurtosis:
1. Leptokurtic
- Definition: A leptokurtic distribution has positive kurtosis (excess kurtosis > 0). It
exhibits a sharper peak and fatter tails than the normal distribution.
- Characteristics:
- High concentration of values near the mean and more data in the tails.
- Example: Stock market returns often display leptokurtic behavior, where extreme
positive or negative returns are more common.
2. Platykurtic
- Definition: A platykurtic distribution has negative kurtosis (excess kurtosis < 0). It
has a flatter peak and thinner tails than the normal distribution.
- Characteristics:
- Data is more evenly spread across the range, with fewer extreme values.
- Example: Uniform distributions are typically platykurtic because they spread values
evenly across the range, with very few extreme observations.
3. Mesokurtic
- Definition: A mesokurtic distribution has zero excess kurtosis, matching the tail behavior of the normal distribution.
- Characteristics:
- The shape of the distribution is moderate, with a typical peak and tails.
Kurtosis plays an important role in gauging the risk posed by extreme values and rare events within a population. In finance, for instance, leptokurtic distributions help gauge the chances of extreme swings in the markets, since such distributions indicate a greater presence of outlying values. In quality control, on the other hand, a platykurtic distribution may be preferred because it suggests fewer extreme variations.
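To make the three types concrete, this sketch samples from distributions known to be mesokurtic, leptokurtic, and platykurtic and reports their excess kurtosis (scipy.stats.kurtosis returns excess kurtosis by default):

import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

normal_data = rng.normal(size=100_000)            # mesokurtic baseline
heavy_tails = rng.standard_t(df=5, size=100_000)  # leptokurtic
uniform_data = rng.uniform(size=100_000)          # platykurtic

print(kurtosis(normal_data))   # ~0
print(kurtosis(heavy_tails))   # clearly positive (fatter tails)
print(kurtosis(uniform_data))  # ~-1.2 (thinner tails)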
In summary, kurtosis quantifies how heavy a distribution's tails are relative to the normal distribution and is central to assessing the likelihood of extreme values.
Q8. Calculation of Skewness:
Skewness is calculated by taking the third moment of the distribution, which involves raising
deviations from the mean to the power of three and normalizing the result by the cube of the
standard deviation. The formula is:
Skewness = [n / ((n − 1)(n − 2))] × Σ((xᵢ − x̄) / s)³
Where: n is the number of observations, xᵢ are the individual data values, x̄ is the sample mean, and s is the sample standard deviation.
Interpretation of Skewness:
Understanding skewness is important in data analysis because it reveals the direction of outliers
and how they affect the central tendency measures (like the mean). It helps to choose the
appropriate summary statistics and models, especially when the distribution is not symmetrical.
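The adjusted formula above can be checked against scipy, which applies the same correction when bias=False (the dataset is hypothetical):

import numpy as np
from scipy.stats import skew

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

n = len(data)
s = data.std(ddof=1)  # sample standard deviation
manual = n / ((n - 1) * (n - 2)) * np.sum(((data - data.mean()) / s) ** 3)

print(manual, skew(data, bias=False))  # the two values agree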
Q9. Quartiles are a fundamental statistical concept used in exploratory data analysis (EDA)
to divide a dataset into four equal parts, helping to understand the distribution and spread of
data. Quartiles help identify the central tendency, the spread, and potential outliers within the
dataset, providing a clearer picture of the data's characteristics.
Understanding Quartiles:
● First Quartile (Q1): Also called the lower quartile, it represents the 25th percentile. It is
the value below which 25% of the data points lie.
● Second Quartile (Q2): Also called the median, it represents the 50th percentile. It is the
middle value of the dataset.
● Third Quartile (Q3): Also called the upper quartile, it represents the 75th percentile. It
is the value below which 75% of the data points lie.
● Interquartile Range (IQR): The difference between the third and first quartiles (IQR = Q3 − Q1) provides a measure of the spread of the middle 50% of the data.
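As a sketch of the outlier detection mentioned above, the common 1.5 × IQR fence can be applied as follows (made-up data):

import numpy as np

data = np.array([12, 15, 17, 19, 20, 22, 24, 25, 28, 95])

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(data[(data < lower) | (data > upper)])  # [95]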
Conclusion:
Quartiles are a versatile tool in exploratory data analysis, offering insight into data distribution,
variability, and outliers. They provide a robust summary of the data's shape and spread, making
them valuable for decision-making across various fields, such as education, finance, and
business analytics.
Q10. Within the framework of probability theory, an event refers to any possible outcome or a
collection of possible outcomes obtained from performing a random experiment. It is an element
of a sample space which constitutes all the possible outcomes of that experiment. For example,
attempting to roll a die and successfully obtaining a "3" qualifies as an event, while all the
numbers that can possibly appear when one rolls a die, such as “1, 2, 3, 4, 5, 6”, is the sample
space of rolling a die. A sample point is any particular value found in the sample space such as
the value "4" when a die is rolled.
A random experiment may be defined as a process or an act, the outcome of which is not
certain, like throwing a coin or casting a die, where the result is not known prior to the action.
Sometimes there are mutually exclusive events, which are events that cannot happen at the
same time. For example, when one tosses a coin, the events “heads” and “tails” are mutually
exclusive because it is impossible to have both outcomes at the same time.
Events are described as equally likely when each can occur with the same probability, such as in a fair coin toss where the probability of heads equals the probability of tails: P(heads) = 0.5 and P(tails) = 0.5. These concepts are fundamental to making probability judgements, especially in random experiments with numerous possible outcomes to evaluate.
Q11.
● Fundamental (Classical) Probability: Fundamental probability rests on the assumption that every outcome of an experiment is equally likely to happen. It is typically applied when there is a finite set of equally likely outcomes. The probability of an event is then the number of favorable outcomes divided by the total number of possible outcomes.
● Empirical Probability: Also referred to as experimental or statistical probability, it is based on an experiment that has already been performed and repeated many times. It is computed as the frequency of occurrence of an event divided by the total number of trials conducted.
● Axiomatic Probability: The foundation of axiomatic probability is attributed to the group
of principles, or axioms, suggested by the Russian mathematician Andrey Kolmogorov.
This makes it possible to lay the basic assumptions of probability theory in a completely
rigorous way. The three axioms are:
- For any event, the probability assigned to it is not less than zero
- The probability assigned to the sample space is one
- For mutually exclusive (non-overlapping) events, the probability of their union equals the sum of their individual probabilities.
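A quick simulation shows empirical probability converging toward the classical value for a fair coin:

import random

random.seed(42)
trials = 100_000
heads = sum(random.random() < 0.5 for _ in range(trials))

# Relative frequency of heads approaches the classical probability of 0.5
print(heads / trials)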
Q12. The Addition Theorem of Probability is used to calculate the probability of the union of two
events. The theorem states that if A and B are two events in a sample space S, then the
probability that either event A or event B (or both) will occur is given by:
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Where: P(A) and P(B) are the probabilities of events A and B, and P(A ∩ B) is the probability that both A and B occur together.
Case 1: A and B are not mutually exclusive (they may occur together).
1. Understanding the Union of Events: The union of two events A and B (denoted as A ∪ B) represents the event that either A occurs, B occurs, or both occur. However, when summing P(A) and P(B), the probability of the intersection of A and B (i.e., P(A ∩ B)) gets counted twice, once in P(A) and once in P(B).
2. Step-by-Step Explanation:
● P(A) represents the probability that event A occurs, including the cases where A
and B occur together.
● P(B) represents the probability that event B occurs, likewise including the cases where A and B occur together.
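A small sketch verifying the theorem by enumeration, using a fair die with A = "even number" and B = "greater than 3":

from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}
B = {4, 5, 6}

def p(event):
    return Fraction(len(event), len(sample_space))

# Addition theorem: P(A or B) = P(A) + P(B) - P(A and B)
print(p(A | B))                # 2/3 by direct enumeration
print(p(A) + p(B) - p(A & B))  # 2/3 -- the same value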