
Q1.

In statistics, the measures of central tendency locate the center of a distribution of data observations. The arithmetic mean, which is the sum of all the values divided by their number, is the most widely used measure, even though it can be heavily influenced by extreme observations. The median is the value in the center after the data have been arranged in ascending or descending order; it provides a more representative figure in the presence of skewness since it is not influenced by tail values. The mode is the value that appears most often and can be applied to both quantitative and qualitative data; however, a dataset may have no mode, one mode, or several modes. Every measure offers advantages and disadvantages based on the characteristics of the data.
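As a brief illustration (a hypothetical sketch, not part of the original answer), the three measures can be computed with Python's standard-library statistics module; note how the outlier pulls the mean away from the median:

```python
# Hypothetical data with one extreme observation (95).
import statistics

data = [2, 3, 3, 4, 5, 6, 95]

print(statistics.mean(data))    # ~16.86 - dragged upward by the outlier
print(statistics.median(data))  # 4 - unaffected by the extreme value
print(statistics.mode(data))    # 3 - the most frequent value
```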

Q2. The geometric mean and the harmonic mean serve as alternative measures of central tendency that are effective in particular situations where the arithmetic mean falls short.

The geometric mean is the nth root of the product of n numbers. This mean is suited to data that involves growth rates or compounding, such as investment returns. For instance, if an individual invests money and over three years the value increases by 10%, 20%, and 30%, the geometric mean provides the appropriate average growth rate because it accounts for compounding. It is often used in finance and economics, for instance to average investment returns earned at different rates over a period of years. Its use is, however, not appropriate where zeros or negative values are present in the dataset.

The harmonic mean is the number of values divided by the sum of the reciprocals of the values. It is preferable for rates and ratios, and it gives less weight to large values. It would be inappropriate to use the simple average when different sections of a journey are travelled at different speeds, for example half the distance at 30 km/h and the other half at 60 km/h; the harmonic mean gives the correct average speed. Its use is, however, restricted when handling zero values, for example in physics and the transport industry.

In other words, the geometric mean applies better to growth rates while the harmonic mean applies better to rates and ratios; both therefore give measures that are truer to certain situations than the arithmetic mean.
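A minimal sketch of the two worked examples above, using hypothetical figures (three annual growth rates and a two-leg journey):

```python
from statistics import geometric_mean, harmonic_mean

# Growth of 10%, 20%, 30% -> growth factors 1.10, 1.20, 1.30.
g = geometric_mean([1.10, 1.20, 1.30])
print(f"average growth: {(g - 1) * 100:.2f}% per year")  # ~19.72%, not 20%

# Half the distance at 30 km/h, half at 60 km/h.
print(harmonic_mean([30, 60]))  # 40.0 km/h, not the naive 45
```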

Q3. Quartiles, deciles, and percentiles are positional measures that divide an ordered dataset into equal parts.

- Quartiles are three values that divide the dataset into four equal parts.
- Q1 marks the point below which 25% of the data falls.
- Q2 is the value that divides the ordered dataset into halves, and is thus the median.
- Q3 is the value below which 75% of the data falls.
- A decile is a statistical measurement that subdivides a dataset into ten equal segments. Thus,
the first decile D1 indicates the level beneath where 10% of the data exists, the second decile
D2 indicates the level beneath where 20% of the data exists, and so forth.

- Percentiles divide the data into 100 equal parts. The nth percentile is the value below which n percent of the data falls (for example, the 90th percentile is the value below which 90 percent of the data lies).

Such measures are useful for describing the spread of data and for highlighting salient features such as anomalies and fluctuations. For instance, quartiles assist in determining the IQR, while percentiles come in handy for scoring candidates in standardized exams.
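As a quick sketch (hypothetical exam scores), NumPy's percentile function can locate quartiles, deciles, and percentiles directly:

```python
import numpy as np

scores = np.array([42, 55, 58, 61, 64, 67, 70, 73, 78, 85, 91, 96])

q1, q2, q3 = np.percentile(scores, [25, 50, 75])  # quartiles; q2 is the median
print(q1, q2, q3)
print(np.percentile(scores, 10))  # first decile, D1
print(np.percentile(scores, 90))  # 90th percentile
```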

Q4. The statistical tools that show the extent or range of a dataset are referred to as measures
of dispersion. The three most frequently used measures of dispersion include mean deviation,
quartile deviation, and standard deviation, each of which has different definitions, calculations,
uses, advantages and disadvantages.

1. Mean Deviation

Definition: The mean deviation is the average of the absolute deviations of the observations from a central value (the mean or the median).

Calculation:

Mean Deviation = Σ|xᵢ − A| / n

Where A is the mean or median.

Application: Helpful in giving an overall picture of variability.

Merits: Easy to calculate and comprehend; mean or median can be applied.

Demerits: Direction of deviations is not accounted for; less robust to outliers; rarely employed in
higher level statistics.
2. Quartile Deviation

● Definition: The quartile deviation measures the spread of the middle 50% of the dataset,
calculated as half the interquartile range (IQR).
● Calculation:

Quartile Deviation = (Q3 − Q1) / 2

Where Q1 is the first quartile and Q3 is the third quartile.

● Application: Effective for skewed distributions and datasets with outliers.


● Merits: Not affected by extreme values; focuses on central data.
● Demerits: Only considers the middle 50% of data; may overlook important variability in
the entire dataset.

3. Standard Deviation

● Definition: The standard deviation quantifies the amount of variation or dispersion in a dataset relative to the mean.
● Calculation:

σ = √( Σ(xᵢ − X̄)² / n )

Where X̄ is the mean of the dataset.

● Application: Widely used in various fields, including finance and research, to assess risk and variability.
● Merits: Sensitive to all data points, including outliers; provides a comprehensive measure of variability.
● Demerits: Can be influenced significantly by extreme values; may be difficult to interpret without context.

In summary, while all three measures provide insights into data variability, their
applications and interpretations differ significantly depending on the characteristics of the
dataset.
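A short sketch computing all three dispersion measures on hypothetical data (NumPy's default percentile interpolation is assumed for the quartiles):

```python
import numpy as np

x = np.array([4, 7, 9, 10, 12, 15, 21])

mean_dev = np.mean(np.abs(x - x.mean()))   # mean deviation about the mean
q1, q3 = np.percentile(x, [25, 75])
quartile_dev = (q3 - q1) / 2               # half the interquartile range
std_dev = x.std()                          # population standard deviation

print(mean_dev, quartile_dev, std_dev)
```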

Q5. The CV is a measure of variability that is defined as the ratio of the standard deviation to
the mean, most commonly expressed as a percentage. It measures how much a data set varies
with respect to the mean as a percentage or standard metric. This measure of dispersion is
relative and can therefore be useful for comparing the dispersions of different data sets even
when the means of the data sets vary a great deal.

Definition and Calculation

The coefficient of variation is defined mathematically as:

CV = (σ / X̄) × 100%

Where σ is the standard deviation and X̄ is the mean.

Interpretation

● Relative Measure: The Coefficient of Variation (CV) is useful for comparing the variability of two or more datasets, even when they are on different scales or have vastly different means. For instance, consider two datasets: one with a mean of 100 and a standard deviation of 20 (CV = 20%), and another with a mean of 10 and a standard deviation of 2 (CV = 20%). Though the means vary to a great extent, the degree of relative variability is the same.
● Insight into Risk and Consistency: In finance, the coefficient of variation is useful for evaluating the level of risk associated with a specific investment. A higher CV signifies higher risk or variability in returns relative to the mean, and a lower CV means the opposite. For example, if two different investments produce the same mean return, the one with the lower CV is considered to be the less risky investment.

Advantages

● Standardization: The CV is a unitless measure of spread, which makes it possible to compare the dispersion of data measured on different scales.
● Versatility: It is relevant in a number of sectors, such as finance, biology, and quality control, for measuring consistency and risk.

Limitations

● Mean Sensitivity: The CV is highly sensitive to the mean; when the mean is zero or nearly zero, the coefficient of variation becomes unstable or undefined.
● Not Always Suitable: The CV is most meaningful for datasets with a nonzero (positive) mean. When the distribution is severely skewed or contains outliers, the CV can become inappropriate.

In conclusion, the coefficient of variation helps compare the relative variability of different data sets, which is important for managing, analyzing, and assessing risk and performance for analysts and researchers working in different areas.
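A minimal sketch of the comparison described above, using two hypothetical datasets with very different means but the same relative spread:

```python
import numpy as np

def cv(data):
    """Coefficient of variation as a percentage: (std / mean) * 100."""
    data = np.asarray(data, dtype=float)
    return data.std() / data.mean() * 100

a = [80, 90, 100, 110, 120]  # mean 100
b = [8, 9, 10, 11, 12]       # mean 10

print(f"CV(a) = {cv(a):.1f}%")  # both datasets show the same
print(f"CV(b) = {cv(b):.1f}%")  # relative variability (~14.1%)
```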

Q6. A moment in statistics is a measure that permits the description of certain properties of a distribution in quantitative terms. Moments provide information about the shape, dispersion, and general characteristics of the data. The nth moment of a distribution is defined as the average of the nth powers of the distances of the observations from a reference point (usually the mean). There are several moments, but the first four are of great importance in defining the main features of the distribution: the central tendency, dispersion, skewness, and kurtosis.

First Four Moments:

1. First Moment (Mean):

● Definition: The first moment is the arithmetic mean of the data, representing the
central location or average value.
● Calculation:

µ1 = Σxᵢ / n

● Significance: It indicates the central value of the dataset and is used as the reference
point for calculating higher moments.

2. Second Moment (Variance):

● Definition: The second moment is the variance, which measures the spread or
dispersion of the data around the mean. It is the average of the squared deviations from
the mean.
● Calculation:

σ² = Σ(xᵢ − µ1)² / n

● Significance: Variance tells us how spread out the data points are from the mean. A
higher variance indicates greater dispersion.

3. Third Moment (Skewness):

● Definition: The third moment is related to skewness, which measures the asymmetry of
the distribution. Skewness tells whether the data is skewed to the left or right of the
mean.
● Calculation:

µ3 = Σ(xᵢ − µ1)³ / n

Skewness is often standardized as:

Skewness = µ3 / σ³

Where σ is the standard deviation.

● Significance:
- A positive skew indicates a longer right tail (data concentrated on the left)
- A negative skew indicates a longer left tail (data concentrated on the right)
- A skewness value closer to zero suggests a symmetrical distribution

4. Fourth Moment (Kurtosis):

● Definition: The fourth moment is related to kurtosis, which measures the "tailedness"
or peakedness of a distribution. It describes how much of the data is in the tails
compared to the center.
● Calculation:

µ4 = Σ(xᵢ − µ1)⁴ / n

Kurtosis is often standardized as:

Kurtosis = µ4 / σ⁴

○ Significance:
■ Leptokurtic distributions have high kurtosis, meaning more data is
concentrated in the tails (sharp peak).
■ Platykurtic distributions have low kurtosis, indicating less concentration
in the tails (flatter peak).
■ Mesokurtic distributions (e.g., normal distribution) have a kurtosis value
around 3, representing moderate tail behavior.

Summary of Moments' Significance:

● The first moment (mean) locates the center of the distribution.
● The second moment (variance) tells us about the spread.
● The third moment (skewness) shows the symmetry or asymmetry.
● The fourth moment (kurtosis) indicates the shape of the tails and the peak of the
distribution.

Understanding these moments provides comprehensive insights into the distribution’s shape,
especially when analyzing how data behaves compared to a normal distribution.
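The four moments can be computed from first principles, as in this hypothetical sketch:

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()                    # first moment (mean)
var = np.mean((x - mu) ** 2)     # second central moment (variance)
sigma = np.sqrt(var)

mu3 = np.mean((x - mu) ** 3)     # third central moment
mu4 = np.mean((x - mu) ** 4)     # fourth central moment

skewness = mu3 / sigma ** 3      # standardized third moment
kurtosis = mu4 / sigma ** 4      # standardized fourth moment (~3 if normal)

print(mu, var, skewness, kurtosis)
```
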
Q7. Kurtosis is a statistical characteristic that describes the shape of a distribution and, most importantly, how fat or thin its tails are, i.e. how many data points lie in the tails of the distribution and how many lie close to its peak. In other words, kurtosis lets us analyze how much of the data is contained in the extremities of the distribution compared to its core, relative to a bell-shaped curve.

Kurtosis is significant because it indicates how likely (or unlikely) extreme values (outliers) are in a given distribution compared to a normal one. The kurtosis of a perfect normal distribution is 3; in most cases, however, the excess kurtosis (kurtosis minus 3) is used, meaning:

- Excess kurtosis of 0: The distribution exhibits the same tail behavior as a normal distribution.

- Positive excess kurtosis: The distribution has fatter tails than normal.

- Negative excess kurtosis: The distribution has thinner tails than normal.

Types of Kurtosis:

1. Leptokurtic (High Kurtosis):

- Definition: A leptokurtic distribution has positive kurtosis (excess kurtosis > 0). It
exhibits a sharper peak and fatter tails than the normal distribution.

- Characteristics:

- High concentration of values near the mean and more data in the tails.

- More likely to produce outliers compared to a normal distribution.

- Example: Stock market returns often display leptokurtic behavior, where extreme
positive or negative returns are more common.

2. Platykurtic (Low Kurtosis):

- Definition: A platykurtic distribution has negative kurtosis (excess kurtosis < 0). It
has a flatter peak and thinner tails than the normal distribution.
- Characteristics:

- Data is more evenly spread across the range, with fewer extreme values.

- Less likely to produce outliers.

- Example: Uniform distributions are typically platykurtic because they spread values
evenly across the range, with very few extreme observations.

3. Mesokurtic (Normal Kurtosis):

- Definition: A mesokurtic distribution has kurtosis equal to 3 (excess kurtosis of 0), meaning it resembles the normal distribution in terms of peak and tails.

- Characteristics:

- The shape of the distribution is moderate, with a typical peak and tails.

- Outliers occur at the same rate as in a normal distribution.

- Example: The standard normal distribution is mesokurtic.

How Kurtosis Helps in Understanding Data Distribution:

Kurtosis plays an important role in gauging the risk posed by extreme values and rare events. For instance, in finance, leptokurtic distributions help signal the chances of extreme swings in the markets, since such distributions indicate a greater presence of outlying values. On the other hand, in quality control, a platykurtic distribution may be preferred, as it suggests fewer extreme variations.

In summary:

- Leptokurtic: High probability of extreme values (fat tails).

- Platykurtic: Low probability of extreme values (thin tails).

- Mesokurtic: Similar to normal distribution, moderate tail behavior.


Thus, kurtosis helps assess the shape of the data's distribution, particularly its peak and
tails, aiding in decisions where the likelihood of outliers or extreme deviations is
important.
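As an illustrative sketch (assuming SciPy is available; scipy.stats.kurtosis reports excess kurtosis by default), the three types can be seen on simulated samples:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

normal = rng.normal(size=100_000)            # mesokurtic: excess ~ 0
uniform = rng.uniform(size=100_000)          # platykurtic: excess ~ -1.2
heavy = rng.standard_t(df=5, size=100_000)   # leptokurtic: excess > 0

for name, sample in [("normal", normal), ("uniform", uniform), ("t(5)", heavy)]:
    print(name, kurtosis(sample))  # Fisher (excess) kurtosis
```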

Q8. Skewness is a statistical measure that describes the asymmetry or lack of symmetry in a distribution. It helps us understand how data is distributed around the mean, i.e. whether it is skewed to the left or right. A perfectly symmetrical distribution has a skewness of zero. When a distribution is skewed, one tail is longer or heavier than the other, and skewness can be either positive or negative.

Calculation of Skewness:

Skewness is calculated by taking the third moment of the distribution, which involves raising
deviations from the mean to the power of three and normalizing the result by the cube of the
standard deviation. The formula is:

Skewness = [n / ((n − 1)(n − 2))] × Σ(xᵢ − x̄)³ / σ³

Where:

- 𝑛 = number of data points
- 𝑥ᵢ = each data value
- x̄ = mean of the data set
- σ = standard deviation of the dataset
In simpler terms, skewness compares how evenly data is distributed on both sides of the mean.
The cube of deviations amplifies the influence of larger deviations from the mean, making it
sensitive to the direction of skew.

Interpretation of Skewness:

● Zero Skewness: A skewness value of 0 indicates a symmetrical distribution, where the left and right tails are mirror images of each other. A normal distribution is an example of zero skewness.
● Positive Skewness:
○ A positive skewness value indicates that the right tail of the distribution is longer
or fatter than the left tail.
○ The bulk of the data lies to the left of the mean, with a small number of high
values extending the right tail.
○ Interpretation: The distribution is right-skewed (also called positively skewed),
meaning that there are more lower values and fewer extreme higher values.
○ Example: Income distribution in many countries is typically right-skewed, with a
large number of people earning below the mean and a small number earning
very high incomes.
● Negative Skewness:
○ A negative skewness value indicates that the left tail of the distribution is longer
or fatter than the right tail.
○ The bulk of the data is concentrated on the right of the mean, with a few lower
values extending the left tail.
○ Interpretation: The distribution is left-skewed (or negatively skewed), meaning
that there are more higher values and fewer extreme lower values.
○ Example: The distribution of exam scores may be left-skewed if most students
perform well, with a small number of students scoring very low.
Summary of Interpretation:

● Positive Skewness (right-skewed):
○ Mean > Median
○ Long tail on the right (higher values).
○ Indicates more lower values with some extreme higher values.
● Negative Skewness (left-skewed):
○ Mean < Median
○ Long tail on the left (lower values).
○ Indicates more higher values with some extreme lower values.

Understanding skewness is important in data analysis because it reveals the direction of outliers
and how they affect the central tendency measures (like the mean). It helps to choose the
appropriate summary statistics and models, especially when the distribution is not symmetrical.
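A sketch of the adjusted sample-skewness formula above, applied to a hypothetical right-skewed dataset (most values low, one extreme high value):

```python
import numpy as np

def sample_skewness(x):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - mean)/s)^3)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = x.std(ddof=1)  # sample standard deviation
    return n / ((n - 1) * (n - 2)) * np.sum(((x - x.mean()) / s) ** 3)

incomes = [30, 32, 35, 36, 38, 40, 41, 44, 120]  # one extreme earner
print(sample_skewness(incomes))  # positive -> right-skewed
```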

Q9. Quartiles are a fundamental statistical concept used in exploratory data analysis (EDA)
to divide a dataset into four equal parts, helping to understand the distribution and spread of
data. Quartiles help identify the central tendency, the spread, and potential outliers within the
dataset, providing a clearer picture of the data's characteristics.

Understanding Quartiles:

● First Quartile (Q1): Also called the lower quartile, it represents the 25th percentile. It is
the value below which 25% of the data points lie.
● Second Quartile (Q2): Also called the median, it represents the 50th percentile. It is the
middle value of the dataset.
● Third Quartile (Q3): Also called the upper quartile, it represents the 75th percentile. It
is the value below which 75% of the data points lie.
● Interquartile Range (IQR): The difference between the third and first quartiles (IQR = Q3 − Q1) provides a measure of the spread of the middle 50% of the data.

Role of Quartiles in Exploratory Data Analysis:

1. Summarizing Data Distribution: Quartiles provide a simple summary of the distribution of data by dividing it into four parts, each containing 25% of the observations. This helps in quickly understanding how data is distributed, especially when the mean alone may not offer a complete picture.
○ Example: Suppose a company analyzes the salaries of employees. Quartiles
could show that the bottom 25% of employees earn below $40,000, the median
salary is $60,000, and the top 25% earn over $80,000. This can provide insights
into salary distribution and inequality within the company.
2. Detecting Skewness: Comparing the differences between quartiles helps detect
skewness in the data. For example, if the difference between Q3 and Q2 (the upper half
of the data) is larger than the difference between Q2 and Q1, the data may be
right-skewed.
○ Example: In house prices, if Q3 is significantly higher than Q2 and Q1, it
indicates that a few extremely high-priced houses skew the distribution.
3. Identifying Outliers: Quartiles are used to detect outliers through the interquartile range (IQR). Outliers are typically defined as points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, as in the sketch shown after this list.
○ Example: If a dataset of test scores shows several points that lie beyond the
lower and upper bounds, those scores can be flagged as outliers, indicating
students who performed exceptionally poorly or exceptionally well compared to
their peers.
4. Assessing Spread and Variability: The IQR (Q3 - Q1) is a measure of the spread of
the middle 50% of the data, providing an understanding of how tightly or loosely data is
clustered around the median. The smaller the IQR, the less variability there is in the
middle portion of the dataset.
○ Example: In comparing the test scores of two different classes, one might find
that Class A has a narrow IQR, suggesting that most students' scores are similar,
whereas Class B has a wide IQR, suggesting more variability in student
performance.
5. Boxplots for Visual Representation: Quartiles are used in boxplots, a graphical tool in
EDA, to visually display the data distribution. A boxplot shows the minimum, Q1, median
(Q2), Q3, and maximum values, giving a clear picture of the central tendency, spread,
and potential outliers in the data.
○ Example: A boxplot of daily temperatures in two cities could reveal that one city
has more variability (wider box), while the other has more consistent
temperatures (narrower box).
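A minimal sketch of the IQR outlier rule from point 3, applied to hypothetical test scores:

```python
import numpy as np

scores = np.array([12, 55, 58, 60, 62, 63, 65, 67, 70, 72, 98])

q1, q3 = np.percentile(scores, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = scores[(scores < lower) | (scores > upper)]
print(f"IQR = {iqr}, bounds = ({lower}, {upper}), outliers = {outliers}")  # 12 and 98
```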

Decisions Based on Quartiles:

1. Performance Evaluation: Quartiles can be used to categorize performance or identify groups. For example, in educational testing, students can be grouped into quartiles based on their scores, helping educators identify the top performers (Q3 to Q4), average performers (Q2), and those who need extra help (Q1).
○ Example: A school may use quartiles to determine which students qualify for
honors (Q4) or which students need remedial support (Q1).
2. Income Distribution Analysis: In economics, quartiles help assess income inequality.
Comparing the first and third quartiles of income distribution can indicate whether
income disparity is large or small.
○ Example: If the gap between Q1 and Q3 is large, it may suggest significant
income inequality, prompting discussions on policies to address the disparity.
3. Financial Risk Assessment: Quartiles can be used to evaluate the risk and return of
investment portfolios. A portfolio with returns in the highest quartile (Q3-Q4) might offer
higher rewards but may also carry higher risk, whereas portfolios in the middle quartiles
may offer more consistent returns.
○ Example: Investors might choose investments based on quartiles, opting for
those in the upper quartile for higher gains or in the lower quartile for lower risk.

Conclusion:

Quartiles are a versatile tool in exploratory data analysis, offering insight into data distribution,
variability, and outliers. They provide a robust summary of the data's shape and spread, making
them valuable for decision-making across various fields, such as education, finance, and
business analytics.

Q10. Within the framework of probability theory, an event refers to any possible outcome or a
collection of possible outcomes obtained from performing a random experiment. It is an element
of a sample space which constitutes all the possible outcomes of that experiment. For example,
attempting to roll a die and successfully obtaining a "3" qualifies as an event, while all the
numbers that can possibly appear when one rolls a die, such as “1, 2, 3, 4, 5, 6”, is the sample
space of rolling a die. A sample point is any particular value found in the sample space such as
the value "4" when a die is rolled.

A random experiment may be defined as a process or an act, the outcome of which is not
certain, like throwing a coin or casting a die, where the result is not known prior to the action.
Sometimes there are mutually exclusive events, which are events that cannot happen at the
same time. For example, when one tosses a coin, the events “heads” and “tails” are mutually
exclusive because it is impossible to have both outcomes at the same time.

Events are described as equally likely when every event can occur with the same probability, such as in a fair coin toss where the probability of getting heads equals the probability of getting tails. In tossing a fair coin, both heads and tails have equal chances, hence P(heads) = 0.5 and P(tails) = 0.5. These concepts are fundamental to making probability judgements, especially in random experiments with numerous outcomes to evaluate.

Q11.
● Fundamental Probability: Fundamental (classical) probability rests on the assumption that every outcome of an experiment is equally likely to happen. It is most often applied when there is a finite number of equally likely outcomes. The probability of an event is then determined as the number of favourable outcomes divided by the total number of possible outcomes.
● Empirical Probability: Also referred to as experimental or statistical probability, this is based on an experiment that has already been performed and repeated many times. It is computed as the frequency of occurrence of an event divided by the total number of trials conducted.
● Axiomatic Probability: The foundation of axiomatic probability is the group of principles, or axioms, proposed by the Russian mathematician Andrey Kolmogorov. This makes it possible to lay out the basic assumptions of probability theory in a completely rigorous way. The three axioms are:
- For any event, the probability assigned to it is not less than zero.
- The probability assigned to the sample space is one.
- For mutually exclusive (non-overlapping) events, the probability of their union is equal to the sum of their individual probabilities.
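A small sketch contrasting the first two approaches for a fair coin (a hypothetical simulation): the fundamental (classical) probability of heads is 0.5, and the empirical estimate approaches it as the number of trials grows:

```python
import random

random.seed(42)
trials = 100_000
heads = sum(random.random() < 0.5 for _ in range(trials))

print("classical P(heads) = 0.5")
print("empirical P(heads) =", heads / trials)  # close to 0.5 for large trials
```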

Q12. The Addition Theorem of Probability is used to calculate the probability of the union of two events. The theorem states that if A and B are two events in a sample space S, then the probability that either event A or event B (or both) will occur is given by:

𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵)

Where:

- 𝑃(𝐴) is the probability of event A
- 𝑃(𝐵) is the probability of event B
- 𝑃(𝐴 ∩ 𝐵) is the probability that both events A and B occur simultaneously (the intersection of A and B)

Proof of the addition theorem of probability

Case 1:

1. Understanding the Union of Events: The union of two events A and B (denoted as 𝐴 ∪ 𝐵) represents the event that either A occurs, B occurs, or both occur. However, when summing P(A) and P(B), the probability of the intersection of A and B (i.e., 𝑃(𝐴 ∩ 𝐵)) gets counted twice, once in P(A) and once in P(B).

2. Step-by-Step Explanation:
● P(A) represents the probability that event A occurs, including the cases where A and B occur together.
● P(B) represents the probability that event B occurs, likewise including the cases where A and B occur together.
● Since the overlap 𝑃(𝐴 ∩ 𝐵) is therefore included in both P(A) and P(B), it must be subtracted once, giving 𝑃(𝐴 ∪ 𝐵) = 𝑃(𝐴) + 𝑃(𝐵) − 𝑃(𝐴 ∩ 𝐵).
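As a worked illustration (a hypothetical example, not from the original notes): roll a fair die and let A = "an even number" = {2, 4, 6} and B = "a number greater than 3" = {4, 5, 6}. Then 𝑃(𝐴) = 3/6, 𝑃(𝐵) = 3/6, and 𝑃(𝐴 ∩ 𝐵) = 𝑃({4, 6}) = 2/6, so 𝑃(𝐴 ∪ 𝐵) = 3/6 + 3/6 − 2/6 = 4/6 = 2/3, which matches counting 𝐴 ∪ 𝐵 = {2, 4, 5, 6} directly.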
