Statistical Method Using R
Statistics is the study of the collection, analysis, interpretation, presentation, and organization
of data.
In other words, it is a mathematical discipline for collecting and summarizing data, and it can be
regarded as a branch of applied mathematics. Two basic ideas run through all of statistics:
uncertainty and variation. The uncertainty and variation found in different fields can be
assessed only through statistical analysis. Uncertainty is quantified by probability, which
plays an important role in statistics.
Gottfried Achenwall (20 October 1719 – 1 May 1772) was a German philosopher, historian,
economist, jurist and statistician. He is counted among the inventors of statistics.
Statistics Examples
Some of the real-life examples of statistics are:
● Finding the mean of the marks obtained by the students in a class of 50. The average
value here is a statistic of the marks obtained.
● Suppose you need to find how many people are employed in a city. Since the city has a
population of 15 lakh, a survey of 1,000 people (a sample) is taken instead. The figure
computed from that sample is the statistic.
Types of Statistics
Basically, there are two types of statistics.
● Descriptive Statistics
● Inferential Statistics
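Descriptive Statistics – Summarizes, organizes, and presents the data that has been collected,
without generalizing beyond it.
Inferential Statistics – Based on a sample of data taken from the population, inferential
statistics makes predictions and inferences about that population.
Both types of statistics are widely employed in the field of statistical analysis.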
Characteristics of Statistics
The important characteristics of Statistics are as follows:
Importance of Statistics
The important functions of statistics are:
Scope of Statistics
Statistics is used in many fields, such as psychology, geology, sociology, weather
forecasting, and probability. The goal of statistics is to gain understanding from data.
Because it focuses on applications, it is considered a distinct mathematical science.
Methods in Statistics
The methods involve collecting, summarizing, analyzing, and interpreting variable
numerical data. Some of the methods are listed below.
● Data collection
● Data summarization
● Statistical analysis
Types of Data
Representation of Data
Data can be represented in different ways, such as graphs, charts, or tables.
The common representations of statistical data are:
● Bar Graph
● Pie Chart
● Line Graph
● Pictograph
● Histogram
● Frequency Distribution
Bar Graph
Line graph
Pictograph
A pictograph uses pictorial symbols in place of words or numbers, i.e. it shows data with
the help of pictures. For example, pictures of apples, bananas, and cherries can be repeated
to show how many of each fruit there are; it is just another representation of the data.
Histogram
Basics of Statistics
The basics of statistics include measures of central tendency and measures of
dispersion. The measures of central tendency are the mean, median, and mode; the
measures of dispersion include the variance and standard deviation.
The mean is the average of the observations. The median is the central value when the
observations are arranged in order. The mode is the most frequent observation in a data
set.
Variance is a measure of how spread out a collection of data is. Standard deviation
measures the dispersion of the data about the mean. The square of the standard deviation
is equal to the variance.
Mathematical Statistics
Mathematical statistics is the application of Mathematics to Statistics, which was initially
conceived as the science of the state: the collection and analysis of facts about a
country, such as its economy, land, military, population, and so forth.
Descriptive Statistics
Inferential Statistics
Summary Statistics
In statistics, summary statistics are a part of descriptive statistics (one of the two
types of statistics) and give a condensed account of sample data. Statistics deals with
presenting data both visually and quantitatively, and summary statistics condense the
data into a simpler form so that the observer can understand the information at a glance.
Generally, statisticians try to describe the observations by finding:
● A measure of location or central tendency, such as the arithmetic mean.
● A measure of the shape of the distribution, such as skewness or kurtosis.
● A measure of dispersion, such as the standard deviation or mean absolute deviation.
● A measure of statistical dependence, such as a correlation coefficient.
For example, the blood groups of 20 students in a class are O, A, B, AB, B, B, AB, O,
A, B, B, AB, AB, O, O, B, A, AB, B, A.
Blood group   Number of students
O             4
A             4
B             7
AB            5
Total         20
Thus, the summary statistics table shows that 4 students in the class have blood group O,
4 have blood group A, 7 have blood group B, and 5 have blood group AB. Summary tables of
this kind are generally used to condense large data sets, such as those on population,
unemployment, and the economy, so that they can be interpreted systematically and accurately.
Measures of central tendency are numerical values that are used to represent the mid-value
or central value of a large collection of data. The three common measures are:
● Mean
● Median
● Mode
All three measures of central tendency are used to find the central value of the set of
data.
The mean represents the average value of the dataset. It is calculated as the sum
of all the values in the dataset divided by the number of values, and in general it
refers to the arithmetic mean. Some other kinds of mean used to measure central
tendency are as follows:
● Geometric Mean
● Harmonic Mean
● Weighted Mean
If all the values in the dataset are the same, then the geometric, arithmetic, and
harmonic means are all equal. If there is variability in the data, the three means
differ. The formula to calculate the mean value is:
Mean = (Sum of all the values) ÷ (Number of values)
Because the mean includes every value in the distribution, it is influenced by outliers.
Arithmetic mean (X̅) is defined as the sum of the individual observations (xi)
divided by the total number of observations N. In other words, the mean is given
by X̅ = (Σ xi) ÷ N.
Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the
mean is
X̅ = (27 + 11 + 17 + 19 + 21) ÷ 5
⇒ X̅ = 95 ÷ 5
⇒ X̅ = 19
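The statements about the different kinds of mean and about outliers can be checked with a short Python sketch that uses only the standard `statistics` module. The lists `values` and `same` and the outlier 210 are made-up illustration values, not data from these notes; only the five observations come from the example.

```python
import statistics

# Hypothetical positive values, chosen only for illustration.
values = [2.0, 4.0, 8.0]
arithmetic = statistics.mean(values)
geometric = statistics.geometric_mean(values)
harmonic = statistics.harmonic_mean(values)
print("arithmetic:", arithmetic, "geometric:", geometric, "harmonic:", harmonic)

# When every value is identical, the three means coincide.
same = [5.0, 5.0, 5.0]
print(statistics.mean(same),
      statistics.geometric_mean(same),
      statistics.harmonic_mean(same))

# The five observations from the example, and a copy in which the
# last value (21) is replaced by a made-up outlier (210).
observations = [27, 11, 17, 19, 21]
with_outlier = [27, 11, 17, 19, 210]
mean_plain = statistics.mean(observations)
mean_outlier = statistics.mean(with_outlier)
median_plain = statistics.median(observations)
median_outlier = statistics.median(with_outlier)
print("means:", mean_plain, mean_outlier)
print("medians:", median_plain, median_outlier)
```

In this sketch the outlier moves the mean a long way while the median stays where it was, which is the sense in which the mean is sensitive to outliers.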
Disadvantage of Mean as Measure of Central Tendency
Although the mean is the most common way to measure the central tendency of a
dataset, it does not always give the right picture, especially when the data contain
extreme values (outliers).
The basic difference between grouped data and ungrouped data is that ungrouped data is
unorganized and in raw form (it is also known as raw data), whereas grouped data is
organized into groups according to a frequency distribution. These groups are known as
class intervals.
Find the mean deviation about the mean for the given data.
Question 1. Calculate the arithmetic mean for the following data set using the
direct method:
Class interval   Frequency
0 – 10           5
10 – 20          12
20 – 30          14
30 – 40          10
40 – 50          9
Solution:
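A minimal Python sketch of the direct method for the Question 1 data, assuming the two columns are class intervals and their frequencies. Each class is represented by its midpoint x, and the mean is Σfx ÷ Σf.

```python
# Class intervals (lower, upper) and frequencies from Question 1.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [5, 12, 14, 10, 9]

# Direct method: mean = sum(f * x) / sum(f), where x is the class midpoint.
midpoints = [(lower + upper) / 2 for lower, upper in classes]
sum_f = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
mean_direct = sum_fx / sum_f

print("sum of f:", sum_f)
print("sum of f*x:", sum_fx)
print("mean:", mean_direct)
```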
Question 2. Calculate the arithmetic mean for the following data set using
0–2 2
2–4 4
4–6 6
6–8 8
8 – 10 10
Solution:
Question 3. Calculate the arithmetic mean for the following data set using
10 – 20 5
20 – 30 3
30 – 40 4
40 – 50 7
50 – 60 2
60 – 70 6
70 – 80 13
Solution:
Question 4. Calculate the arithmetic mean for the following data set using
100 – 120 4
120 – 140 6
140 – 160 10
160 – 180 8
180 – 200 5
Solution:
Question 2: Find the mean value by the assumed mean method.
Class interval   0 – 10   10 – 20   20 – 30   30 – 40   40 – 50
Frequency        12       28        32        25        13
Solution
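The assumed mean method for this table can be sketched in Python as follows. The choice A = 25 (the midpoint of the middle class) is an assumption made here for illustration; any other midpoint gives the same final mean.

```python
# Class intervals and frequencies from the question.
classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
freqs = [12, 28, 32, 25, 13]

midpoints = [(lower + upper) / 2 for lower, upper in classes]
assumed_mean = 25  # assumption: midpoint of the middle class taken as A

# Deviations d = x - A, then mean = A + sum(f * d) / sum(f).
deviations = [x - assumed_mean for x in midpoints]
sum_f = sum(freqs)
sum_fd = sum(f * d for f, d in zip(freqs, deviations))
mean_assumed = assumed_mean + sum_fd / sum_f

print("sum of f:", sum_f)
print("sum of f*d:", sum_fd)
print("mean:", mean_assumed)
```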
Example: Calculate the arithmetic mean for the following data set using the
Marks    Number of Students
0 – 10 5
10 – 20 12
20 – 30 14
30 – 40 10
40 – 50 5
Solution:
Question 1. Calculate the mean using the step deviation method:
Marks    Number of students
10 – 20 5
20 – 30 3
30 – 40 4
40 – 50 7
50 – 60 2
60 – 70 6
70 – 80 13
Solution:
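A Python sketch of the step deviation method for this table. The assumed mean A = 45 (midpoint of the middle class) is an illustrative choice, and the class width h = 10 is read from the intervals; the mean is A + h × Σfu ÷ Σf with u = (x − A) ÷ h.

```python
# Class intervals (marks) and frequencies (number of students).
classes = [(10, 20), (20, 30), (30, 40), (40, 50), (50, 60), (60, 70), (70, 80)]
freqs = [5, 3, 4, 7, 2, 6, 13]

midpoints = [(lower + upper) / 2 for lower, upper in classes]
assumed_mean = 45   # assumption: midpoint of the middle class taken as A
class_width = 10    # h, the common width of the class intervals

# Step deviations u = (x - A) / h, then mean = A + h * sum(f * u) / sum(f).
steps = [(x - assumed_mean) / class_width for x in midpoints]
sum_f = sum(freqs)
sum_fu = sum(f * u for f, u in zip(freqs, steps))
mean_step = assumed_mean + class_width * sum_fu / sum_f

print("sum of f:", sum_f)
print("sum of f*u:", sum_fu)
print("mean:", mean_step)
```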
Measures of Dispersion
In statistics, measures of dispersion help interpret the variability of data, i.e. to
understand how homogeneous or heterogeneous the data is. In simple words, they indicate
how squeezed or scattered the values of a variable are. There are two types of dispersion
measures: absolute and relative.
Skewness in Statistics
Skewness, in statistics, is a measure of the asymmetry of a probability distribution. It
measures how far the distribution of a given set of data departs from the symmetric bell
curve of the normal distribution. The skewness can be positive, negative, or zero; the
bell curve of the normal distribution has zero skewness.
ANOVA Statistics
ANOVA stands for Analysis of Variance. It is a collection of statistical models used to
analyze the differences among the means of groups in a given set of data.
Degrees of freedom
In statistical analysis, the degrees of freedom are the number of values that are free to
vary. In other words, the number of independent pieces of information that can vary while
a parameter is being estimated is the degrees of freedom.
Applications of Statistics
Statistics has many applications across various fields of Mathematics as well as in real
life. Some of the applications of statistics are given below:
Descriptive statistics refers to a branch of statistics that involves summarizing, organizing, and
presenting data meaningfully and concisely. It focuses on describing and analyzing a dataset's
main features and characteristics without making any generalizations or inferences to a larger
population.
The primary goal of descriptive statistics is to provide a clear and concise summary of the data,
enabling researchers or analysts to gain insights and understand patterns, trends, and
distributions within the dataset. This summary typically includes measures such as central
tendency (e.g., mean, median, mode), dispersion (e.g., range, variance, standard deviation), and
shape of the distribution (e.g., skewness, kurtosis).
Descriptive statistics also involves a graphical representation of data through charts, graphs,
and tables, which can further aid in visualizing and interpreting the information. Common
graphical techniques include histograms, bar charts, pie charts, scatter plots, and box plots.
Example 1: Exam Scores
Suppose you have the following scores of 20 students on an exam:
85, 90, 75, 92, 88, 79, 83, 95, 87, 91, 78, 86, 89, 94, 82, 80, 84, 93, 88, 81
To calculate descriptive statistics:
● Mean: Add up all the scores and divide by the number of scores. Mean = (85 + 90 +
75 + 92 + 88 + 79 + 83 + 95 + 87 + 91 + 78 + 86 + 89 + 94 + 82 + 80 + 84 + 93 + 88 + 81)
/ 20 = 1720 / 20 = 86
● Median: Arrange the scores in ascending order and find the middle value. With 20
scores, the median is the average of the 10th and 11th values. Median = (86 + 87) / 2 = 86.5
● Mode: Identify the score(s) that appear(s) most frequently. Mode = 88
● Range: Calculate the difference between the highest and lowest scores. Range = 95 -
75 = 20
● Variance: Calculate the average of the squared differences from the mean. Variance =
[(85-86)^2 + (90-86)^2 + ... + (81-86)^2] / 20 = 614 / 20 = 30.7
● Standard Deviation: Take the square root of the variance. Standard Deviation =
√30.7 ≈ 5.54
Example 2: Incomes
Suppose you have the following incomes of 50 individuals:
$2,500, $3,000, $3,200, $4,000, $2,800, $3,500, $4,500, $3,200, $3,800, $3,500, $2,800, $4,200,
$3,900, $3,600, $3,000, $2,700, $2,900, $3,700, $3,500, $3,200, $3,600, $4,300, $4,100, $3,800,
$3,600, $2,500, $4,200, $4,200, $3,400, $3,300, $3,800, $3,900, $3,500, $2,800, $4,100, $3,200,
$3,600, $4,000, $3,700, $3,000, $3,100, $2,900, $3,400, $3,800, $4,000, $3,300, $3,100, $3,200,
$4,200, $3,400.
● Mean: Add up all the incomes and divide by the number of incomes. Mean = ($2,500
+ $3,000 + ... + $3,400) / 50 = $174,500 / 50 = $3,490
● Median: Arrange the incomes in ascending order and find the middle value. With 50
incomes, the median is the average of the 25th and 26th values, which are both $3,500.
Median = $3,500
● Range: Calculate the difference between the highest and lowest incomes. Range =
$4,500 - $2,500 = $2,000
● Variance: Calculate the average of the squared differences from the mean. Variance =
[($2,500-$3,490)^2 + ($3,000-$3,490)^2 + ... + ($3,400-$3,490)^2] / 50 =
12,325,000 / 50 = 246,500 (in squared dollars)
● Standard Deviation: Take the square root of the variance. Standard Deviation =
√246,500 ≈ $496.49
These calculations provide descriptive statistics that summarize the central tendency
and dispersion of the data in these examples.
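The figures for both examples can be reproduced with Python's standard `statistics` module. This is a sketch: the helper `describe` is a name introduced here for illustration, and it uses the population formulas (dividing by N), which is what "the average of the squared differences" describes.

```python
import statistics

def describe(data):
    """Return descriptive statistics using population (divide-by-N) formulas."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "modes": statistics.multimode(data),
        "range": max(data) - min(data),
        "variance": statistics.pvariance(data),
        "std_dev": statistics.pstdev(data),
    }

# Example 1: exam scores of 20 students.
scores = [85, 90, 75, 92, 88, 79, 83, 95, 87, 91,
          78, 86, 89, 94, 82, 80, 84, 93, 88, 81]

# Example 2: the 50 incomes, in dollars.
incomes = [2500, 3000, 3200, 4000, 2800, 3500, 4500, 3200, 3800, 3500,
           2800, 4200, 3900, 3600, 3000, 2700, 2900, 3700, 3500, 3200,
           3600, 4300, 4100, 3800, 3600, 2500, 4200, 4200, 3400, 3300,
           3800, 3900, 3500, 2800, 4100, 3200, 3600, 4000, 3700, 3000,
           3100, 2900, 3400, 3800, 4000, 3300, 3100, 3200, 4200, 3400]

score_stats = describe(scores)
income_stats = describe(incomes)
for name, stats in (("scores", score_stats), ("incomes", income_stats)):
    print(name, stats)
```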
Measures of central tendency describe the typical value in the dataset and include
mean, median, and mode.
Measures of variability represent the spread or dispersion of the data and include
range, variance, and standard deviation.
Measures of relative position describe the location of a specific value within the
dataset, such as percentiles.
Graphical methods use charts, histograms, and other visual representations to display
data.
Descriptive statistics break down into several types, characteristics, or measures. Some
authors say that there are two types. Others say three or even four.
Distribution (Also Called Frequency Distribution)
Measures of central tendency estimate a dataset's average or center, finding the result
using three methods: mean, mode, and median.
Mean: The mean is also known as “M” and is the most common method for finding
averages. You get the mean by adding all the response values together and dividing the
sum by the number of responses, or “N.” For instance, say someone is trying to figure
out how many hours a day they sleep in a week. The data set would be the hour
entries (e.g., 6, 8, 7, 10, 8, 4, 9), and the sum of those values is 52. There are seven
responses, so N = 7. You divide the sum of 52 by N, or 7, to find M, which in this
instance is about 7.43.
Mode: The mode is just the most frequent response value. Datasets may have any
number of modes, including “zero.” You can find the mode by arranging your dataset
from the lowest to the highest value and then looking for the most common response.
Using our sleep study from the last part: 4, 6, 7, 8, 8, 9, 10. As you can see, the mode
is eight.
Median: Finally, we have the median, defined as the value in the precise center of the
dataset. Arrange the values in ascending order (like we did for the mode) and look for
the number in the set’s middle. In this case, the median is eight.
The measure of variability gives the statistician an idea of how spread out the
responses are. The spread has three aspects — range, standard deviation, and variance.
Range: Use range to determine how far apart the most extreme values are. Start by
subtracting the dataset’s lowest value from its highest value. Once again, we turn to our
sleep study: 4,6,7,8,8,9,10. We subtract four (the lowest) from ten (the highest) and get
six. There’s your range.
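Standard Deviation: This aspect takes a little more work. The standard deviation (s) is
your dataset’s average amount of variability, showing you how far each score typically
lies from the mean. The larger your standard deviation, the more variable your dataset.
Follow these six steps:
1. List each score and find the mean (for the sleep study, 52 ÷ 7 ≈ 7.43).
2. Subtract the mean from each score to get its deviation (for the score 9: 9 − 7.43 = 1.57).
3. Square each deviation (1.57^2 ≈ 2.47).
4. Add up all the squared deviations (about 23.71).
5. Divide that sum by N − 1 (23.71 ÷ 6 ≈ 3.95).
6. Take the square root of the result (√3.95 ≈ 1.99).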
Variance: Variance reflects the dataset’s degree of spread. The greater the spread of the
data, the larger the variance relative to the mean. You can get the variance by just
squaring the standard deviation. For the sleep study, the standard deviation is about
1.988, and squaring it gives a variance of about 3.95.
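The range, standard deviation, and variance of the sleep data can be worked through in Python as a sketch. It assumes the sample formulas, which divide by N − 1, and it uses the unrounded mean 52 ÷ 7, so the results (about 1.99 and 3.95) differ slightly from figures obtained with a rounded mean.

```python
import math
import statistics

hours = [6, 8, 7, 10, 8, 4, 9]  # hours of sleep on seven days

n = len(hours)
mean = sum(hours) / n                        # mean of the scores
deviations = [h - mean for h in hours]       # each score minus the mean
squared = [d ** 2 for d in deviations]       # squared deviations
sum_squares = sum(squared)                   # total of the squared deviations
sample_variance = sum_squares / (n - 1)      # divide by N - 1
sample_sd = math.sqrt(sample_variance)       # square root gives s
value_range = max(hours) - min(hours)        # highest minus lowest

print("range:", value_range)
print("mean:", round(mean, 2))
print("standard deviation:", round(sample_sd, 3))
print("variance:", round(sample_variance, 3))

# Cross-check against the standard library's sample formulas.
print(statistics.stdev(hours), statistics.variance(hours))
```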
Univariate descriptive statistics examine only one variable at a time and do not compare
variables. Rather, they allow the researcher to describe individual variables. For this
reason, statistics of this sort are purely descriptive. The patterns identified in
this sort of data may be explained using the following:
When using bivariate descriptive statistics, two variables are analyzed (compared)
concurrently to see whether they are correlated. By convention, the columns represent
the independent variable and the rows represent the dependent variable.
There are numerous real-world applications for bivariate data. For example, estimating
when a natural event will occur is quite valuable. Bivariate data analysis is a tool in
the statistician's toolbox. Sometimes, something as simple as plotting one variable
against the other on a two-dimensional plane can give a better understanding of what the
data is trying to tell you. For example, a scatterplot of the waiting time between
eruptions at Old Faithful against the eruption's duration demonstrates the link between
the two.
Descriptive statistics can be useful for two things: 1) providing basic information about
variables in a dataset and 2) highlighting potential relationships between variables.
Graphical/pictorial methods display the most common descriptive statistics graphically
or pictorially. They are used to summarise data.
Descriptive statistics only make statements about the data set used to calculate them;
they never go beyond your data.
Scatter Plots
A scatter plot employs dots to indicate values for two separate numeric variables. Each
dot's location on the horizontal and vertical axes represents a data point's values.
Scatter plots are used to observe relationships between variables.
The main purpose of a scatter plot is to examine and display the relationship between
two numerical variables. The points in a scatter plot show the values of individual
data points and, taken as a whole, reveal trends in the data. Scatter plots are commonly
used to identify correlational relationships. In these situations, we want to know what
a good prediction of the vertical value would be, given a specific horizontal value.
When there are many data points to plot, scatter plots can suffer from overplotting: the
data points overlap to the point where it is difficult to see the relationships between
the points and the variables. It can also be difficult to discern how densely packed the
data points are when many of them fall in a small area.
There are a couple of simple methods to relieve this issue. One approach is to plot only
a subset of the data points: a random sample of points should still offer the basic sense
of the patterns in the whole data. Additionally, we can alter the appearance of the dots,
increasing transparency to make overlaps visible or decreasing point size to minimise
overlaps.
What’s the Difference Between Descriptive Statistics and
Inferential Statistics?
So, what’s the difference between the two statistical forms? We’ve already touched upon
this when we mentioned that descriptive statistics doesn’t infer any conclusions or
predictions, which implies that inferential statistics do so.
Inferential statistics takes a random sample of data from a portion of the population
and describes and makes inferences about the entire population. For instance, in asking
50 people if they liked the movie they had just seen, inferential statistics would build on
that and assume that those results would hold for the rest of the moviegoing population
in general.
Therefore, if you stood outside that movie theater and surveyed 50 people who had just
seen Rocky 20: Enough Already! and 38 of them disliked it (76 percent), you could
extrapolate that about 76% of the rest of the movie-watching world will dislike it too,
even though you haven’t the means, time, and opportunity to ask all those people.
Simply put: Descriptive statistics give you a clear picture of what your current data
shows. Inferential statistics makes projections based on that data.
Unit-2 Discrete and continuous distribution, Regression
Correlation analysis deals with the association between two or more variables. It
is used to find the degree of correlation.
The term "regression" literally means "stepping back towards the average". It was
first used by the British biometrician Sir Francis Galton (1822-1911) in connection
with the inheritance of stature.
Lines of Regression:
• If the variables in a bivariate distribution are related, the points in the scatter
diagram will cluster around some curve, called the "curve of regression".
• If the curve is a straight line, it is called the line of regression and there is said to
be linear regression between the variables; otherwise the regression is said to be curvilinear.
• The line of regression is the line which gives the best estimate of the value of one
variable for any specific value of the other variable. Thus the line of regression is the
line of "best fit" and is obtained by the principle of least squares.
• Let us suppose that in the bivariate distribution (Xi, Yi); i = 1, 2, ..., n; Y is the
dependent variable and X is the independent variable. Let the line of regression of Y on X
be Y = a + bX.
If X = a + b and Y = X + c, then clearly when the value of X changes, the value of Y
changes accordingly.
• There are always two lines of regression, one of Y on X and the other of X on Y. The line
of regression of Y on X is used to estimate or predict the value of Y for any given value
of X, i.e., when Y is the dependent variable and X is the independent variable. Similarly,
to estimate or predict X for any given value of Y, we use the regression equation of X on Y;
here X is the dependent variable and Y is the independent variable.
• The two regression equations are not reversible or interchangeable, for the simple reason
that the basis and assumptions for deriving them are quite different. The regression
equation of Y on X is obtained by minimising the sum of squares of the errors parallel to
the Y-axis, while the regression equation of X on Y is obtained by minimising the sum of
squares of the errors parallel to the X-axis.
• If one of the regression coefficients is greater than unity, the other must be less than unity.
• The arithmetic mean of the regression coefficients is greater than or equal to the
correlation coefficient r, provided r > 0.
• Regression coefficients are independent of the change of origin but not of scale.
Regression Formula
The regression formula assesses the relationship between the
dependent and independent variables and shows how the dependent
variable changes when the independent variable changes. It is
represented by the equation Y = a + bX, where Y is the
dependent variable, X is the independent variable, a is the
constant (intercept), and b is the slope of the regression line.
X on Y ⇒ X = a + bY
Y on X ⇒ Y = a + bX
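Regression Equation using Normal Equation
X   1  2  3  4  5
Y   2  5  3  8  7
Regression of X on Y (X = a + bY). The normal equations are:
ΣX = Na + bΣY          (1)
ΣXY = aΣY + bΣY²       (2)
X    Y    XY   Y²
1    2    2    4
2    5    10   25
3    3    9    9
4    8    32   64
5    7    35   49
ΣX = 15   ΣY = 25   ΣXY = 88   ΣY² = 151
Substituting these totals with N = 5:
15 = 5a + 25b
88 = 25a + 151b
Multiplying the first equation by 5 gives 75 = 25a + 125b. Subtracting this from the
second equation gives 13 = 26b, so b = 0.5 and a = 0.5. The regression equation of X on Y
is therefore X = 0.5 + 0.5Y.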
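Regression of Y on X (Y = a + bX). The normal equations are:
ΣY = Na + bΣX          (1)
ΣXY = aΣX + bΣX²       (2)
X    Y    XY   X²
1    9    9    1
2    9    18   4
3    10   30   9
4    12   48   16
5    11   55   25
ΣX = 15   ΣY = 51   ΣXY = 160   ΣX² = 55
Substituting these totals with N = 5:
51 = 5a + 15b
160 = 15a + 55b
Multiplying the first equation by 3 gives 153 = 15a + 45b. Subtracting this from the
second equation gives 7 = 10b, so b = 0.7 and a = 8.1. If X = 10 is given, then
Y = 8.1 + 0.7 × 10 = 15.1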
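The two normal equations for the line of Y on X can also be solved in a few lines of Python. This is a sketch for the second data set (Y = 9, 9, 10, 12, 11); the variable names are illustrative, and the closed-form solution below is simply the elimination step written as a formula.

```python
xs = [1, 2, 3, 4, 5]
ys = [9, 9, 10, 12, 11]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_x2 = sum(x * x for x in xs)

# Normal equations for Y = a + bX:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_x2
# Eliminating a gives the slope b; substituting back gives the intercept a.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

predicted = a + b * 10  # estimate of Y when X = 10

print("a =", round(a, 4), "b =", round(b, 4))
print("Y at X = 10:", round(predicted, 4))
```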
X   1  2  3  4  5
Y   2  5  3  8  7
X    Y    XY   Y²   X²
1    2    2    4    1
2    5    10   25   4
3    3    9    9    9
4    8    32   64   16
5    7    35   49   25
ΣX = 15   ΣY = 25   ΣXY = 88   ΣY² = 151   ΣX² = 55   (ΣY)² = (25)² = 625
bxy = [N ΣXY − ΣX ΣY] ÷ [N ΣY² − (ΣY)²]
    = [5(88) − 15 × 25] ÷ [5(151) − 625]
    = 65 ÷ 130 = 0.5
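The regression lines pass through the means:
(Y − Y̅) = byx (X − X̅)   and   (X − X̅) = bxy (Y − Y̅)
X̅ = ΣX ÷ N = 42 ÷ 6 = 7
Y̅ = ΣY ÷ N = 30 ÷ 6 = 5
X    Y    x = X − X̅   y = Y − Y̅   xy   x²   y²
2    4    −5           −1           5    25   1
4    2    −3           −3           9    9    9
6    5    −1           0            0    1    0
8    10   1            5            5    1    25
10   3    3            −2           −6   9    4
12   6    5            1            5    25   1
ΣX = 42   ΣY = 30   Σxy = 18   Σx² = 70   Σy² = 40
X on Y calculated as
bxy = Σxy ÷ Σy² = 18 ÷ 40 = 9/20 = 0.45
Y on X calculated as
byx = Σxy ÷ Σx² = 18 ÷ 70 = 9/35 ≈ 0.257
X on Y and Y on X in terms of the correlation coefficient r:
bxy = r × σx ÷ σy        byx = r × σy ÷ σx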
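The regression coefficients for the six-point data set (X = 2, 4, ..., 12 and Y = 4, 2, 5, 10, 3, 6) can be checked with a Python sketch based on deviations from the means. The last step uses bxy × byx = r², which follows from multiplying the two formulas bxy = r σx ÷ σy and byx = r σy ÷ σx; all variable names are illustrative.

```python
import math

xs = [2, 4, 6, 8, 10, 12]
ys = [4, 2, 5, 10, 3, 6]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Deviations from the means: x = X - X_bar, y = Y - Y_bar.
dx = [x - x_bar for x in xs]
dy = [y - y_bar for y in ys]

sum_dxdy = sum(u * v for u, v in zip(dx, dy))
sum_dx2 = sum(u * u for u in dx)
sum_dy2 = sum(v * v for v in dy)

b_xy = sum_dxdy / sum_dy2   # regression coefficient of X on Y
b_yx = sum_dxdy / sum_dx2   # regression coefficient of Y on X

# r has the same sign as the coefficients and r^2 = b_xy * b_yx.
r = math.copysign(math.sqrt(b_xy * b_yx), sum_dxdy)

print("b_xy =", b_xy)
print("b_yx =", round(b_yx, 4))
print("r =", round(r, 4))
```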
Obtain the two regression equations and
i) Estimate Y when X = 9
ii) Estimate X when Y = 12
X on Y
(X − X̅) = bxy (Y − Y̅) ⇒ (X − 5) = 0.50 (Y − 12)
⇒ X − 5 = 0.50Y − 0.50 × 12 ⇒ X − 5 = 0.50Y − 6 ⇒ X = 0.50Y − 6 + 5 ⇒ X = 0.50Y − 1
Y on X
Y = 0.96X + 7.2
i) When X = 9, Y = 0.96 × 9 + 7.2 = 15.84
ii) When Y = 12, X = 0.50 × 12 − 1 = 5
Probability Distribution
A probability distribution is a function that gives the probability of each possible
value that a random variable can take.
A die is tossed once. If the random variable X records whether the number shown is even (E)
or odd (O), find the probability distribution of X.
P(E) = 3/6 = 1/2
P(O) = 3/6 = 1/2
X   E     O
P   1/2   1/2
X = Number of heads (two fair coins tossed)
X   0     1     2
P   1/4   1/2   1/4
Distribution
The Probability Density Function (PDF) describes the density of a continuous random
variable; the probability that the variable lies within a specific range of values is the
area under the PDF over that range.
Say we have a continuous random variable whose probability density function is given
by f(x) = (x + 2)/6, when 0 < x ≤ 2; the factor 1/6 is needed so that the total area
under f(x) equals 1. We want to find P(0.5 < X < 1). Then we integrate (x + 2)/6
within the limits 0.5 and 1. This gives us 1.375/6 ≈ 0.229. Thus, the probability that
the continuous random variable lies between 0.5 and 1 is about 0.229.
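A density must integrate to 1 over its range, and x + 2 on 0 < x ≤ 2 integrates to 6, so this Python sketch assumes the normalised density f(x) = (x + 2)/6. It approximates the integrals with a simple midpoint rule; the helper name `integrate` and the number of steps are illustrative choices.

```python
def f(x):
    """Assumed normalised density on 0 < x <= 2."""
    return (x + 2) / 6

def integrate(func, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of func from a to b."""
    width = (b - a) / steps
    return sum(func(a + (i + 0.5) * width) for i in range(steps)) * width

total_area = integrate(f, 0, 2)      # should be 1 for a valid density
probability = integrate(f, 0.5, 1)   # P(0.5 < X < 1)

print("total area:", round(total_area, 6))
print("P(0.5 < X < 1):", round(probability, 6))
```

Without the factor 1/6 the same integral gives 1.375, which cannot be a probability because it exceeds 1.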
The Cumulative Distribution Function (CDF) of a real-valued random variable X,
evaluated at x, is the probability that X will take a value less than or equal to x.
In other words, the CDF gives the cumulative probability up to the given value. It is
used to find the probability that a random variable falls in a given range and to compare
probabilities between values under certain conditions. For discrete distributions, the CDF
sums the probabilities of all values up to the one specified; for continuous distributions,
it gives the area under the probability density function up to the given value.
Consider a simple example of a CDF given by rolling a fair six-sided die, where
X is the random variable for the number shown.
The probability of getting any one outcome by rolling a six-sided die is 1/6, so
F(x) = P(X ≤ x) = x/6 for x = 1, 2, ..., 6.
From this, it is noted that the value of the CDF always lies between 0 and 1 and that the
CDF is non-decreasing and right-continuous.
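The CDF of a fair six-sided die can be tabulated with a small Python sketch. Exact fractions are used so the values are easy to read; the names `pmf` and `cdf` are illustrative.

```python
from fractions import Fraction

# Each face of a fair die has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def cdf(x):
    """P(X <= x): sum the probabilities of all faces not exceeding x."""
    return sum((p for face, p in pmf.items() if face <= x), Fraction(0))

for x in range(0, 8):
    print(x, cdf(x))

# Between faces the CDF stays flat, e.g. cdf(3.5) equals cdf(3).
print(cdf(3.5))
```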
The binomial distribution gives the probability of 'x' successes in 'n' trials of an
experiment, given a success probability 'p' for each trial.
Probability of success = p
Probability of failure = q
p = 1 − q
q = 1 − p
P(x; n, p) = nCx × p^x × q^(n − x)
where,
● x = 0, 1, 2, 3, ..., n
The binomial distribution formula describes n Bernoulli trials, where nCx = n! ÷ [x! (n − x)!].
Example 1: If a coin is tossed 5 times, using binomial distribution find the probability of:
(a) Exactly 2 heads
Solution:
(a) The repeated tossing of the coin is an example of Bernoulli trials. According to the
problem, n = 5, p = 1/2, q = 1/2, and x = 2.
P(x = 2) = 5C2 × (1/2)^2 × (1/2)^3 = 10/32
Hence, P(x = 2) = 5/16
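The binomial formula can be written as a short Python function and checked against the coin example. This is a sketch; `binomial_pmf` is a name introduced here, and exact fractions are used so that the result appears as 5/16.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for n independent trials with success probability p."""
    q = 1 - p
    return comb(n, x) * p ** x * q ** (n - x)

half = Fraction(1, 2)

# Exactly 2 heads in 5 tosses of a fair coin.
p_two_heads = binomial_pmf(2, 5, half)
print("P(x = 2) =", p_two_heads)

# The probabilities for x = 0, 1, ..., 5 add up to 1.
distribution = [binomial_pmf(x, 5, half) for x in range(6)]
print(distribution, sum(distribution))
```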
● In discrete distributions, graph consists of bars lined up one after the other
1. Bernoulli Distribution
2. Binomial Distribution
3. Uniform Distribution
4. Poisson Distribution
1.1 Bernoulli Distribution
Bernoulli Distribution can be used to describe events that can have only two outcomes, that is, success or failure. In other words, the random
variable can be 1 with probability p or 0 with probability (1 - p). Such an experiment is called a Bernoulli trial; a pass-or-fail test is one example.
In a Bernoulli distribution there is only one trial and only two possible outcomes, i.e. success or failure. It is denoted by Y ~ Bern(p).
The binomial probability distribution describes the number of successes in a sequence of n independent experiments, each of which asks a yes-no
question and has a boolean-valued outcome: success/yes/true/one (with probability p) or failure/no/false/zero (with probability q = 1 − p). A single
success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a sequence of outcomes is called a Bernoulli process. For
n = 1, i.e. a single experiment, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the binomial test
of statistical significance.
A discrete probability distribution in which the random variable can have only 2 possible outcomes is known as a Bernoulli Distribution. If in a
Bernoulli trial the random variable takes the value 1, this means success, which has probability p. Similarly, if the value of the random variable
is 0, it indicates failure, which has probability q = 1 - p. The Bernoulli distribution can be used to derive the binomial distribution.
● E(Y) = p
● Var(Y) = p × (1 – p)
● It is mostly used when trying to find out what we expect to obtain from a single trial of an experiment.
1.2 Binomial Distribution
A sequence of identical, independent Bernoulli trials follows a Binomial distribution. It is denoted by Y ~ B(n, p).
● Over the n trials, it measures how often one of the possible results occurs.
● E(Y) = n × p
● Var(Y) = n × p × (1 – p)
● For example, it tells us how many heads we can expect if we flip a coin 10 times.
● It is mostly used when we try to predict how likely an event is to occur over a series of trials.
In uniform distribution all the outcomes are equally likely. It is denoted by Y ~U(a, b). If the values are categorical, we simply indicate the
Poisson distribution is used to determine how likely a certain event is to occur over a given interval of time or distance. It is denoted by
Y ~ Po(λ).
● It is mostly used in marketing analysis to find out whether more-than-average visits are out of the ordinary or not.
2. CONTINUOUS DISTRIBUTIONS:
● We cannot add up individual values to find the probability of an interval, because there are infinitely many of them.
● P(Y<y) = P(Y ≤ y)
1. Normal Distribution
2. Chi-Squared Distribution
3. Exponential Distribution
4. Logistic Distribution
5. Student’s t Distribution
2.1 Normal Distribution
It describes a distribution that many natural events follow. It is denoted by Y ~ N(µ, σ²). The main characteristics of normal distribution are:
● The graph of a normal distribution is a bell-shaped curve that is symmetric and has thin tails.
● About 68% of all its values fall in the interval (µ – σ, µ + σ).
● E(Y) = µ
● Var(Y) = σ²
● Normal distributions are often observed in nature, for example in the size of animals in the desert.
● Any normal distribution can be standardized, i.e. converted into a standard normal distribution, so that the Z-table
can be used.
The Chi-Squared distribution is frequently used, mostly to test goodness of fit. It is denoted by Y ~ χ²(k).
● The graph obtained from Chi-Squared distribution is asymmetric and skewed to the right.
● E(Y) = k
● Var(Y) = 2k
● It has a table of known values for its CDF, called the χ² table.
It is usually observed in events whose values change considerably early on. It is denoted by Y ~ Exp(λ).
● We do not have a table of known values as for the Normal or Chi-Squared distributions; therefore, we mostly use natural
logarithms to transform its values.
● It is mostly used with dynamically changing variables, such as online website traffic.
It is used to observe how continuous variable inputs can affect the probability of a binary result. It is denoted by Y ~ Logistic(µ, s).
● The Cumulative Distribution Function picks up when we reach values near the mean.
● The smaller the scale parameter, the faster it reaches values close to 1.
Binomial Distribution
P(r; n, p) = nCr × p^r × q^(n − r)
p = 1 − q, q = 1 − p
r = 0, 1, 2, 3, ..., n
Mean, μ = np
Variance, σ² = npq
Mean, μ = np = 9 …………………………………….. 1
The probability of a man hitting a target is ¼. He fires 7 times. What is the probability of hitting the target at least twice?
Solution: p = ¼, q = 1 − ¼ = ¾, n = 7
P(r; n, p) = nCr × p^r × q^(n − r)
P(X ≥ 2) = P(x=2) + P(x=3) + P(x=4) + P(x=5) + P(x=6) + P(x=7)
= 1 − [P(x=0) + P(x=1)]
= 1 − [7C0 × p^0 × q^7 + 7C1 × p^1 × q^6]
= 1 − [(¾)^7 + 7 × ¼ × (¾)^6]
= 1 − 7290/16384 = 4547/8192
On average, every one out of 10 telephones is found busy. Six telephone numbers are
selected at random. Find the probability that four of them will be busy.
Solution: Here n = 6, p = 1/10, q = 9/10, and r = 4.
P(X = 4) = 6C4 × (1/10)^4 × (9/10)^2
= 15 × (1/10)^4 × (9/10)^2
= 15 × 81/10^6
= 0.001215.
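There are four fused bulbs in a lot that also contains 10 good bulbs. If three bulbs are drawn at random with replacement, find the probability
distribution of the number of fused bulbs drawn.
Solution:
This is a problem of binomial distribution, as each draw (with replacement) is independent. Here n = 3, p = 4/14 = 2/7 is the probability of
drawing a fused bulb, and q = 10/14 = 5/7. Let X be the number of fused bulbs drawn, with P(X = r) = nCr p^r q^(n – r).
P(X = 0) = 3C0 × (2/7)^0 × (5/7)^3 = 1 × 1 × (125/343) = 125/343
P(X = 1) = 3C1 × (2/7)^1 × (5/7)^2 = 3 × (2/7) × (25/49) = 150/343
P(X = 2) = 3C2 × (2/7)^2 × (5/7)^1 = 3 × (4/49) × (5/7) = 60/343
P(X = 3) = 3C3 × (2/7)^3 × (5/7)^0 = 1 × 8/343 × 1 = 8/343
X      0         1         2        3
P(X)   125/343   150/343   60/343   8/343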
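The probability of a boy guessing a correct answer is ¼. How many questions must he answer so that the probability of guessing at least one
correct answer is greater than ⅔?
Solution: p = ¼, q = ¾
P(X ≥ 1) = P(guessing at least one correct answer out of n questions) = 1 – P(no success) = 1 – q^n
We need 1 – (¾)^n > ⅔
⇒ (¾)^n < ⅓
When n = 1: ¾ ≮ ⅓
When n = 2: (¾)^2 = 9/16 ≮ ⅓
When n = 3: (¾)^3 = 27/64 ≮ ⅓
When n = 4: (¾)^4 = 81/256 < ⅓
Hence he must answer at least 4 questions.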
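The four binomial problems above can be checked with one Python sketch. It rests on readings inferred from the solution lines: "at least two hits" for the target problem, p = 4/14 for drawing a fused bulb, and a required probability greater than 2/3 for the guessing problem. The function and variable names are illustrative.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) = nCr * p^r * q^(n - r), with q = 1 - p."""
    return comb(n, r) * p ** r * (1 - p) ** (n - r)

# Target: p = 1/4, n = 7, probability of at least two hits (assumed wording).
p_hit = Fraction(1, 4)
at_least_two = 1 - binomial_pmf(0, 7, p_hit) - binomial_pmf(1, 7, p_hit)
print("at least two hits:", at_least_two)

# Telephones: p = 1/10, n = 6, exactly four busy.
four_busy = binomial_pmf(4, 6, Fraction(1, 10))
print("four busy:", four_busy, "=", float(four_busy))

# Bulbs: 4 fused among 14, three draws with replacement.
p_fused = Fraction(4, 14)
fused_distribution = [binomial_pmf(r, 3, p_fused) for r in range(4)]
print("fused bulbs:", fused_distribution)

# Guessing: smallest n with 1 - (3/4)^n > 2/3 (assumed threshold).
q_wrong = Fraction(3, 4)
n_questions = 1
while 1 - q_wrong ** n_questions <= Fraction(2, 3):
    n_questions += 1
print("questions needed:", n_questions)
```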
1.1 Bernoulli Distribution
In a Bernoulli distribution there is only one trial and only two possible outcomes, i.e. success or failure. It is denoted by Y ~ Bern(p).
The binomial distribution describes the number of successes in a sequence of n experiments, each of which asks a yes-no question. The
Boolean-valued outcome of each experiment is either success/yes/true/one (probability p) or failure/no/false/zero (probability q = 1 − p). A single
success/failure experiment is also called a Bernoulli trial or Bernoulli experiment, and a series of outcomes is called a Bernoulli process. For n = 1, i.e. a
single experiment, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the binomial test of
statistical significance.
● E(Y) = p
● Var(Y) = p × (1 – p)
● It is mostly used to find out what we expect to obtain from a single trial of an experiment.
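The formulas E(Y) = p and Var(Y) = p × (1 − p) follow directly from the two-point distribution. The Python sketch below computes both from the definition of expectation; the function name and the value p = 0.25 are illustrative assumptions.

```python
def bernoulli_mean_var(p):
    """Mean and variance of Y ~ Bern(p), computed from the definition."""
    pmf = {0: 1 - p, 1: p}  # failure has probability 1 - p, success has p
    mean = sum(y * prob for y, prob in pmf.items())
    var = sum((y - mean) ** 2 * prob for y, prob in pmf.items())
    return mean, var

mean, var = bernoulli_mean_var(0.25)
print(mean, var)  # should match p and p * (1 - p)
```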
Poisson distribution probability
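1. The Poisson distribution applies when the number of trials n is very large and the probability p of the event in any one trial is very small.
2. λ = np
The probability of x occurrences is
P(X = x) = (e^(−λ) × λ^x) / x!
Where
● x = 0, 1, 2, 3, ...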
For a Poisson distribution with λ as the average rate over a fixed interval of
time, the mean and the variance are the same. So for X following a Poisson
distribution, λ is the mean as well as the variance of the distribution:
E(X) = V(X) = λ
where
● E(X) is the expected value (mean)
● V(X) is the variance
● λ > 0
The Poisson distribution is applicable to situations that involve a large number of
rare and independent events. The following are the properties of the
Poisson distribution. In the Poisson distribution,
● The events are independent.
● The average number of successes in the given period of time is known and constant. No two events can occur at exactly the same time.
● The Poisson distribution is the limiting case of the binomial distribution when the number of trials n is indefinitely large.
● mean = variance = λ
● np = λ is finite, where λ is constant.
● The standard deviation is always equal to the square root of the mean μ.
● The exact probability that the random variable X with mean μ takes the value a is given by P(X = a) = (μ^a / a!) × e^(−μ)
● If the mean is large, then the Poisson distribution is approximately a normal distribution.
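The property mean = variance = λ can be checked numerically. This Python sketch assumes the standard Poisson probability P(X = x) = e^(−λ) λ^x / x! and truncates the infinite sum at x = 59, which is more than enough for the illustrative value λ = 3.

```python
import math

def poisson_pmf(x, lam):
    """P(X = x) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**x / math.factorial(x)

lam = 3.0
xs = range(0, 60)  # the tail beyond 59 is negligible for lam = 3

mean = sum(x * poisson_pmf(x, lam) for x in xs)
variance = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)
print(mean, variance)  # both should be very close to lam
```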
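Five coins are tossed 3200 times. What is the probability of getting 5 heads two
times?
Solution
n = 3200 and p = probability of 5 heads in one toss of five coins = (½)^5 = 1/32
λ = np = 3200 × 1/32 = 100
P(X = 2) = e^(−100) × 100^2 / 2! = 5000 × e^(−100)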
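The Poisson calculation for this example can be evaluated in Python. This is a sketch assuming λ = np with n = 3200 and p = (½)^5, and the Poisson formula P(X = x) = e^(−λ) λ^x / x!; the variable names are illustrative.

```python
import math

n = 3200
p = 0.5 ** 5   # probability that all five coins show heads in one toss
lam = n * p    # Poisson parameter, lambda = np

# Poisson probability of exactly two occurrences
prob_two = math.exp(-lam) * lam**2 / math.factorial(2)
print(lam, prob_two)
```

Because λ = 100 is far from 2, the resulting probability is extremely small.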
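Normal Distribution
In probability theory and statistics, the Normal Distribution, also called the
Gaussian distribution, is a continuous probability distribution. Its probability
density function f(x) satisfies
f(x) ≥ 0 ∀ x ∈ (−∞, +∞)
and is given by:
f(x) = (1 / (σ√(2π))) × e^(−(x − μ)² / (2σ²))
Where,
● x is the variable
● μ is the mean
● σ is the standard deviation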
Normal Distribution Curve
The random variables following the normal distribution are those whose
values can take any value in a given range. For example, consider
the heights of the students in a school. Here, the variable can
take any value, but it will be bounded in a range, say, 0 to 6 ft. This
bound comes from the physical situation, whereas the normal distribution
itself places no limit on the range. The range can extend from −∞ to +∞
and still we can find a smooth curve. These random variables are called
continuous variables, and the normal distribution gives the probability
of the value lying in a particular range for a given experiment.
The mean determines the line of symmetry of the
graph, whereas the standard deviation helps to know how far the data
are spread out. If the standard deviation is smaller, the data are
somewhat close to each other and the graph becomes narrower. If the
standard deviation is larger, the data are dispersed more, and the graph
becomes wider. The standard deviations are used to subdivide the area
under the normal curve. Each subdivided section defines the percentage
of the data that falls in that region. For example, approximately 68% of
the data falls within one standard deviation of the mean (i.e., between
mean − one standard deviation and mean + one standard deviation).
Some important properties of the normal distribution are listed below:
● Exactly half of the values lie to the right of the centre and
exactly half of the values lie to the left of the centre.
● The normal distribution is defined by its mean and
standard deviation.
● The normal distribution curve must have only one peak (i.e.,
unimodal).
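The way standard deviations subdivide the area under the normal curve can be checked numerically. This Python sketch uses statistics.NormalDist from the standard library (Python 3.8 or later); the helper name area_within and the choice of a standard normal (μ = 0, σ = 1) are illustrative assumptions.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under the normal curve between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    print(k, round(area_within(k), 4))

# Symmetry: exactly half of the area lies to the left of the centre.
half = NormalDist(mu=5.0, sigma=2.0).cdf(5.0)
print(half)
```

The area depends only on k, not on the particular mean and standard deviation chosen.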
The normal distributions are closely associated with many things such as:
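Answer:
Mean, μ = 9 and standard deviation, σ = 3
i) X ≥ 15
Z = (X − μ)/σ = (15 − 9)/3 = 2
P(X ≥ 15) = P(Z ≥ 2)
Take the value from the table for Z = 2, which is 0.4772. This is the area between Z = 0 and Z = 2, so
P(X ≥ 15) = 0.5 − 0.4772 = 0.0228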
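The table lookup can be cross-checked in Python. This sketch assumes μ = 9 and σ = 3 (σ is inferred from the division by 3 in the answer, since the problem statement is not shown) and uses statistics.NormalDist from the standard library.

```python
from statistics import NormalDist

mu, sigma = 9, 3     # sigma = 3 is an assumption inferred from the answer
x = 15

z = (x - mu) / sigma            # standard score
standard = NormalDist()         # standard normal: mean 0, sigma 1

area_0_to_z = standard.cdf(z) - 0.5   # the value a Z-table reports for Z = 2
upper_tail = 1 - standard.cdf(z)      # P(X >= 15) = P(Z >= 2)
print(z, round(area_0_to_z, 4), round(upper_tail, 4))
```

The table value is the area between 0 and Z, so the upper-tail probability is 0.5 minus that value.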
Rectangular or Uniform Distribution
A uniform distribution is a continuous probability distribution that describes
events which are equally likely to occur. A uniform distribution is defined by two
parameters, a and b, where a is the minimum value and b is the maximum value.
It is generally denoted by U(a, b), where a and b are constants such that a < x < b.
Its probability density function is written as:
f(x) = 1 / (b − a) for a ≤ x ≤ b, and f(x) = 0 otherwise
where a is the minimum value and b is the maximum value.
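A uniform distribution on (a, b) can be sketched in Python as follows, assuming the standard density 1 / (b − a) on the interval and 0 elsewhere; the function names and the interval (2, 10) are illustrative.

```python
def uniform_pdf(x, a, b):
    """Density of U(a, b): constant on [a, b], zero outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def uniform_cdf(x, a, b):
    """P(X <= x) for X ~ U(a, b)."""
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    return (x - a) / (b - a)

a, b = 2.0, 10.0
print(uniform_pdf(5.0, a, b))                          # same density everywhere inside
print(uniform_cdf(6.0, a, b) - uniform_cdf(4.0, a, b))  # P(4 <= X <= 6)
```

Equal-width sub-intervals inside (a, b) always receive the same probability, which is what "equally likely" means for a continuous variable.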
Exponential Distribution
The exponential distribution describes the time between events in a process in which
events happen independently and continuously at a constant average rate. The exponential
distribution is often described as memoryless, because past information has no effect
on future probabilities.
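The memoryless property can be illustrated numerically. This Python sketch assumes the standard exponential survival function P(T > t) = e^(−rate × t); the rate and the times s and t are arbitrary illustrative values.

```python
import math

def survival(t, rate):
    """P(T > t) for an exponential waiting time with the given rate."""
    return math.exp(-rate * t)

rate, s, t = 0.5, 2.0, 3.0

# P(T > s + t | T > s) = P(T > s + t) / P(T > s)
conditional = survival(s + t, rate) / survival(s, rate)
unconditional = survival(t, rate)
print(conditional, unconditional)  # the two agree up to rounding
```

Having already waited s units of time does not change the probability of waiting a further t units.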
The exponential distribution is commonly used to model time: the time between arrivals, the
time until a component fails, the time until a patient dies. We have already encountered
several examples of exponential random variables—the time of the first arrival in a Poisson