
Statistical Method Using R-20 SMT-460

Unit-1 Descriptive Statistics

Central tendency:-> Measures of Central Tendency: Arithmetic Mean, Median, Mode

Measures of Dispersion:-> Range, quartile deviation, standard deviation, variance, coefficient of variation

Probability:-> Probability – mathematical and statistical definition, addition and multiplication theorems, conditional probability, Bayes' theorem

Introduction to R:-> Installing R and RStudio, Basic R syntax and data types, R data structures: vectors, matrices, data frames.

Unit-2 Discrete and continuous distributions, Regression

Regression:-> Regression analysis – its properties, various methods to perform regression analysis, numericals based on it

Discrete distributions:-> Bernoulli distribution, Binomial distribution, Poisson distribution, Uniform distribution

Continuous distributions:-> Uniform distribution, exponential distribution, normal distribution, properties of the normal distribution, area under the normal curve.

Unit-3 Large sample theory and hypothesis testing, theory of estimation

Sampling:-> Introduction to sampling, difference between parameter and statistic, tests of significance, goodness of fit test

Testing of hypothesis:-> Procedure for testing of hypothesis, tests of significance for large samples, t-test, Z-test, Chi-square test

Theory of estimation:-> Introduction, characteristics of estimators: unbiasedness, consistency, efficiency, sufficiency and their properties
************************************************************************************
Unit-1 Descriptive Statistics

Central tendency:-> Measures of Central Tendency: Arithmetic Mean, Median, Mode

Measures of Dispersion:-> Range, quartile deviation, standard deviation, variance, coefficient of variation

Statistics is the science, or a branch of mathematics, that involves collecting, classifying, analyzing, interpreting, and presenting numerical facts and data.

Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data.
In other words, it is a mathematical discipline concerned with collecting and summarizing data. Also, we can say that statistics is a branch of applied mathematics. However, there are two important and basic ideas involved in statistics: uncertainty and variation. The uncertainty and variation in different fields can be determined only through statistical analysis. These uncertainties are quantified using probability, which plays an important role in statistics.

Gottfried Achenwall (20 October 1719 – 1 May 1772) was a German philosopher, historian,
economist, jurist and statistician. He is counted among the inventors of statistics

Statistics Examples
Some of the real-life examples of statistics are:

● Finding the mean of the marks obtained by the students in a class of strength 50; the average value here is a statistic of the marks obtained.
● Suppose you need to find how many people are employed in a city. Since the city has a population of 15 lakh, we take a survey of 1,000 people (a sample). The figure computed from that sample is a statistic.
Types of Statistics
Basically, there are two types of statistics.

● Descriptive Statistics
● Inferential Statistics

Descriptive Statistics – Through graphs or tables, or numerical calculations, descriptive statistics uses the data to provide descriptions of the population.

Inferential Statistics – Based on the data sample taken from the population, inferential
statistics makes the predictions and inferences.

Both types of statistics are equally employed in the field of statistical analysis.

Characteristics of Statistics
The important characteristics of Statistics are as follows:

● Statistics are numerically expressed
● Statistics is an aggregate of facts
● Data are collected in a systematic order
● Data should be comparable to each other
● Data are collected for a planned purpose

Importance of Statistics
The important functions of statistics are:

● Statistics helps in gathering information about the appropriate quantitative data
● It depicts complex data in graphical, tabular and diagrammatic form so that it can be understood easily
● It provides the exact description and a better understanding
● It helps in designing the effective and proper planning of the statistical inquiry in any field
● It gives valid inferences with the reliability measures about the population parameters from
the sample data
● It helps to understand the variability pattern through the quantitative observations

Scope of Statistics
Statistics is used in many sectors such as psychology, geology, sociology, weather
forecasting, probability and much more. The goal of statistics is to gain understanding
from the data, it focuses on applications, and hence, it is distinctively considered as a
mathematical science.

Methods in Statistics
The methods involve collecting, summarizing, analyzing, and interpreting variable
numerical data. Here some of the methods are provided below.

● Data collection
● Data summarization
● Statistical analysis

What is Data in Statistics?


Data is a collection of facts, such as numbers, words, measurements, observations etc.

Types of Data

1. Qualitative data - descriptive information.
● Example: she can run fast; he is thin.
2. Quantitative data - numerical information.
● Example: an octopus is an eight-legged creature.

Types of quantitative data

1. Discrete data - takes particular fixed values and can be counted.

2. Continuous data - is not fixed to particular values but lies within a range; it can be measured.

Representation of Data
There are different ways to represent data such as through graphs, charts or tables.
The general representation of statistical data are:

● Bar Graph
● Pie Chart
● Line Graph
● Pictograph
● Histogram
● Frequency Distribution

Bar Graph

A Bar Graph represents grouped data with rectangular bars with lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally.
Pie Chart

A type of graph in which a circle is divided into sectors. Each of these sectors represents a proportion of the whole.

Line graph

The line chart is represented by a series of data points connected by straight line segments.

The individual data points are called ‘markers.’


Pictograph

A pictorial symbol for a word or phrase, i.e. showing data with the help of pictures. For example, apples, bananas and cherries can be shown with different numbers of pictures; it is simply a pictorial representation of the data.

Histogram

A diagram consisting of rectangles whose area is proportional to the frequency of a variable and whose width is equal to the class interval.
Frequency Distribution

The frequency of a data value is often represented by “f.” A frequency table is constructed by arranging collected data values in ascending order of magnitude with their corresponding frequencies.

Basics of Statistics
The basics of statistics include the measure of central tendency and the measure of
dispersion. The central tendencies are mean, median and mode and dispersions
comprise variance and standard deviation.

Mean is the average of the observations. Median is the central value when observations
are arranged in order. The mode determines the most frequent observations in a data
set.

Variation is the measure of spread in a collection of data. Standard deviation is the measure of the dispersion of data from the mean. The square of the standard deviation is equal to the variance.

Mathematical Statistics
Mathematical statistics is the application of mathematics to statistics, which was initially conceived as the science of the state: the collection and analysis of facts about a country, such as its economy, military and population.

Mathematical techniques used for different analyses include mathematical analysis, linear algebra, stochastic analysis, differential equations and measure-theoretic probability theory.

Descriptive Statistics

The data is summarised and explained in descriptive statistics. The summarization is done from a population sample using summary measures such as the mean and standard deviation. Descriptive statistics is a way of organising, representing, and explaining a set of data using charts, graphs, and summary measures. Histograms, pie charts, bar charts, and scatter plots are common ways to summarise data and present it in tables or graphs. Descriptive statistics are just that: descriptive. They do not generalise beyond the data at hand.

Inferential Statistics

We attempt to interpret the meaning of descriptive statistics using inferential statistics. We utilise inferential statistics to convey the meaning of the collected data after it has been collected, evaluated, and summarised. The probability principle is used in
inferential statistics to determine if patterns found in a study sample may be
extrapolated to the wider population from which the sample was drawn. Inferential
statistics are used to test hypotheses and study correlations between variables, and
they can also be used to predict population sizes. Inferential statistics are used to derive
conclusions and inferences from samples, i.e. to create accurate generalisations.

Summary Statistics
In Statistics, summary statistics are a part of descriptive statistics (Which is one of the
types of statistics), which gives the list of information about sample data. We know that
statistics deals with the presentation of data visually and quantitatively. Thus, summary
statistics deals with summarizing the statistical information. Summary statistics
generally deal with condensing the data in a simpler form, so that the observer can
understand the information at a glance. Generally, statisticians try to describe the
observations by finding:

● A measure of central tendency or location, such as the arithmetic mean.
● A measure of distribution shape, such as skewness or kurtosis.
● A measure of dispersion, such as the standard deviation or mean absolute deviation.
● A measure of statistical dependence, such as the correlation coefficient.

Summary Statistics Table

The summary statistics table is the visual representation of summarized statistical information about the data in tabular form.

For example, the blood group of 20 students in the class are O, A, B, AB, B, B, AB, O,
A, B, B, AB, AB, O, O, B, A, AB, B, A.

Blood Group    No. of Students
O              4
A              4
B              7
AB             5
Total          20

Thus, the summary statistics table shows that 4 students in the class have the O blood group, 4 students have the A blood group, 7 students have the B blood group and 5 students have the AB blood group. The summary statistics table is generally used to summarize large datasets, such as those on population, unemployment, or the economy, systematically so that accurate interpretations can be drawn.
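In R (which this course introduces in Unit 1), such a frequency table can be produced directly with table(). The sketch below is illustrative, using the 20 blood groups listed above; the object name blood is my own.

blood <- c("O","A","B","AB","B","B","AB","O","A","B",
           "B","AB","AB","O","O","B","A","AB","B","A")
table(blood)        # frequency of each blood group: A 4, AB 5, B 7, O 4
sum(table(blood))   # total number of students: 20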

Measures of Central Tendency


In Mathematics, statistics are used to describe the central tendencies of the grouped
and ungrouped data. A measure of central tendency is a single value that attempts to
describe a set of data by identifying the central position within that set of data. As such,
measures of central tendency are sometimes called measures of central location. They
are also classed as summary statistics. The mean (often called the average) is most
likely the measure of central tendency that you are most familiar with, but there are
others, such as the median and the mode. Central tendencies in statistics are the numerical values that are used to represent the mid-value or central value of a large collection of numerical data.


The three measures of central tendency are:

● Mean
● Median
● Mode

All three measures of central tendency are used to find the central value of the set of
data.

The mean represents the average value of the dataset. It can be calculated as the sum
of all the values in the dataset divided by the number of values. In general, it is
considered as the arithmetic mean. Some other measures of mean used to find the
central tendency are as follows:

● Geometric Mean
● Harmonic Mean
● Weighted Mean

It is observed that if all the values in the dataset are the same, then the geometric, arithmetic and harmonic means are all equal. If there is variability in the data, the mean values differ. Calculating the mean value is straightforward. The formula to calculate the mean value is:

Mean (x̄) = Sum of all observations ÷ Total number of observations

Advantage of the mean


The mean can be used for both continuous and discrete numeric data.

Limitations of the mean

The mean cannot be calculated for categorical data, as the values cannot be summed.

As the mean includes every value in the distribution, the mean is influenced by outliers and skewed distributions.


Mean for Ungrouped Data

Arithmetic mean (x̄) is defined as the sum of the individual observations (xᵢ) divided by the total number of observations N. In other words, the mean is given by the sum of all observations divided by the total number of observations.

Mean = Sum of all Observations ÷ Total number of Observations

Example: If there are 5 observations, which are 27, 11, 17, 19, and 21, then the mean (x̄) is given by

x̄ = (27 + 11 + 17 + 19 + 21) ÷ 5

⇒ x̄ = 95 ÷ 5

⇒ x̄ = 19
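In R, the same calculation is a one-liner; a minimal sketch for the five observations above:

x <- c(27, 11, 17, 19, 21)
sum(x) / length(x)   # 95 / 5 = 19
mean(x)              # built-in function, also 19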
Disadvantage of Mean as Measure of Central Tendency

Although the mean is the most common way to summarise the central tendency of a dataset, it cannot always give a correct picture, especially when there are large gaps (outliers) between values in the dataset.

The basic difference between grouped data and ungrouped data is that in the case of the latter, the data is unorganized and in raw, random form. In the case of grouped data, it is organized into groups, i.e. categorized in terms of a frequency distribution. These groups are known as class intervals.

For example, Marks of 10 students (out of 100) are given as:


45, 60, 65, 78, 91, 38, 67, 81, 12, 55

This form of data is ungrouped in nature.

This can be represented in grouped form as:


Example 1: In a mango eating competition the number of mangoes eaten by six
contestants in an hour is as follows:

12, 18, 21, 26, 17, 20

Find the mean deviation about the mean for the given data.
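The worked solution is not shown in the document; as an illustrative sketch, the mean and the mean deviation about the mean can be computed in R as follows (the object name mangoes is my own):

mangoes <- c(12, 18, 21, 26, 17, 20)
m <- mean(mangoes)        # 114 / 6 = 19
mean(abs(mangoes - m))    # mean deviation about the mean = 20/6 ≈ 3.33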
Question 1. Calculate the arithmetic mean for the following data set using

direct method:

Marks Number of Students

0 – 10 5

10 – 20 12

20 – 30 14

30 – 40 10

40 – 50 9

Solution:
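The worked solution itself is not reproduced here; the following is a sketch of the direct method in R, using the class midpoints of the table above:

mid <- c(5, 15, 25, 35, 45)     # midpoints of the classes 0-10, ..., 40-50
f   <- c(5, 12, 14, 10, 9)      # frequencies
sum(f * mid) / sum(f)           # arithmetic mean = 1310 / 50 = 26.2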
Question 2. Calculate the arithmetic mean for the following data set using

the direct method:

Class Intervals Frequency

0–2 2
2–4 4

4–6 6

6–8 8

8 – 10 10

Solution:
Question 3. Calculate the arithmetic mean for the following data set using

the direct method:

Class Intervals Frequency

10 – 20 5
20 – 30 3

30 – 40 4

40 – 50 7

50 – 60 2

60 – 70 6

70 – 80 13

Solution:
Question 4. Calculate the arithmetic mean for the following data set using

the direct method:


Class Intervals Frequency

100 – 120 4

120 – 140 6

140 – 160 10

160 – 180 8

180 – 200 5

Solution:
Question 2: Find the mean value by the assumed mean method.

Class interval    0 – 10    10 – 20    20 – 30    30 – 40    40 – 50
Frequency         12        28         32         25         13

Solution
Example: Calculate the arithmetic mean for the following data set using the

step deviation method:

Marks      Number of Students
0 – 10     5
10 – 20    12
20 – 30    14
30 – 40    10
40 – 50    5

Solution:
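The worked solution is not shown; the sketch below illustrates the step deviation method in R for this table. The assumed mean A = 25 and class width h = 10 are my own choices for illustration, not values stated in the document.

mid <- c(5, 15, 25, 35, 45)     # class midpoints
f   <- c(5, 12, 14, 10, 5)      # frequencies
A   <- 25                       # assumed mean (midpoint of the middle class)
h   <- 10                       # class width
u   <- (mid - A) / h            # step deviations: -2, -1, 0, 1, 2
A + h * sum(f * u) / sum(f)     # mean = 25 + 10 * (-2/46) ≈ 24.57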
Question 1. Calculate the mean using the step deviation method:
Marks      Number of students
10 – 20    5
20 – 30    3
30 – 40    4
40 – 50    7
50 – 60    2
60 – 70    6
70 – 80    13

Solution:
Measures of Dispersion
In statistics, the dispersion measures help interpret data variability, i.e. to understand
how homogenous or heterogeneous the data is. In simple words, it indicates how
squeezed or scattered the variable is. However, there are two types of dispersion
measures, absolute and relative. They are tabulated as below:

Absolute measures of dispersion          Relative measures of dispersion
1. Range                                 1. Coefficient of range
2. Variance                              2. Coefficient of variation
3. Standard deviation                    3. Coefficient of standard deviation
4. Quartiles and quartile deviation      4. Coefficient of quartile deviation
5. Mean and mean deviation               5. Coefficient of mean deviation
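In R, the common dispersion measures can be computed as in the sketch below. The data vector is illustrative, not taken from the document, and R's var() and sd() use the n - 1 (sample) denominator.

x <- c(27, 11, 17, 19, 21)      # illustrative data
diff(range(x))                  # range = max - min
var(x); sd(x)                   # sample variance and standard deviation
IQR(x) / 2                      # quartile deviation (semi-interquartile range)
100 * sd(x) / mean(x)           # coefficient of variation (%)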

Skewness in Statistics
Skewness, in statistics, is a measure of the asymmetry in a probability distribution. It measures how far the distribution of a given set of data deviates from the symmetric normal curve.

The value of skewed distribution could be positive or negative or zero. Usually, the bell
curve of normal distribution has zero skewness.

ANOVA Statistics
ANOVA stands for Analysis of Variance. It is a collection of statistical models used to test for differences among the means of groups in a given set of data.

Degrees of freedom
In statistical analysis, degrees of freedom refer to the number of values that are free to vary. The number of independent pieces of information that can vary while estimating a parameter is the degrees of freedom of that information.
Applications of Statistics
Statistics have huge applications across various fields in Mathematics as well as in real
life. Some of the applications of statistics are given below:

● Applied statistics, theoretical statistics and mathematical statistics


● Machine learning and data mining
● Statistics in society
● Statistical computing
● Statistics applied to the mathematics of the arts

Descriptive statistics refers to a branch of statistics that involves summarizing, organizing, and
presenting data meaningfully and concisely. It focuses on describing and analyzing a dataset's
main features and characteristics without making any generalizations or inferences to a larger
population.

The primary goal of descriptive statistics is to provide a clear and concise summary of the data,
enabling researchers or analysts to gain insights and understand patterns, trends, and
distributions within the dataset. This summary typically includes measures such as central
tendency (e.g., mean, median, mode), dispersion (e.g., range, variance, standard deviation), and
shape of the distribution (e.g., skewness, kurtosis).

Descriptive statistics also involves a graphical representation of data through charts, graphs,
and tables, which can further aid in visualizing and interpreting the information. Common
graphical techniques include histograms, bar charts, pie charts, scatter plots, and box plots.

Descriptive Statistics Examples

Example 1:

Exam Scores Suppose you have the following scores of 20 students on an exam:

85, 90, 75, 92, 88, 79, 83, 95, 87, 91, 78, 86, 89, 94, 82, 80, 84, 93, 88, 81
To calculate descriptive statistics:

● Mean: Add up all the scores and divide by the number of scores. Mean = (85 + 90 + 75 + 92 + 88 + 79 + 83 + 95 + 87 + 91 + 78 + 86 + 89 + 94 + 82 + 80 + 84 + 93 + 88 + 81) / 20 = 1720 / 20 = 86
● Median: Arrange the scores in ascending order and find the middle value. With 20 scores, the median is the average of the 10th and 11th values: Median = (86 + 87) / 2 = 86.5
● Mode: Identify the score(s) that appear(s) most frequently. Mode = 88
● Range: Calculate the difference between the highest and lowest scores. Range = 95 - 75 = 20
● Variance: Calculate the average of the squared differences from the mean. Variance = [(85-86)^2 + (90-86)^2 + ... + (81-86)^2] / 20 = 614 / 20 = 30.7
● Standard Deviation: Take the square root of the variance. Standard Deviation = √30.7 ≈ 5.54
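The figures above can be checked in R; a minimal sketch (note that R's var() and sd() divide by n - 1, so the population versions here divide by n explicitly):

scores <- c(85, 90, 75, 92, 88, 79, 83, 95, 87, 91,
            78, 86, 89, 94, 82, 80, 84, 93, 88, 81)
mean(scores)                                # 86
median(scores)                              # 86.5
names(which.max(table(scores)))             # mode = "88"
diff(range(scores))                         # range = 20
pop_var <- mean((scores - mean(scores))^2)  # population variance = 30.7
sqrt(pop_var)                               # population standard deviation ≈ 5.54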

Example 2:

Monthly Income Consider a sample of 50 individuals and their monthly incomes:

$2,500, $3,000, $3,200, $4,000, $2,800, $3,500, $4,500, $3,200, $3,800, $3,500, $2,800, $4,200,
$3,900, $3,600, $3,000, $2,700, $2,900, $3,700, $3,500, $3,200, $3,600, $4,300, $4,100, $3,800,
$3,600, $2,500, $4,200, $4,200, $3,400, $3,300, $3,800, $3,900, $3,500, $2,800, $4,100, $3,200,
$3,600, $4,000, $3,700, $3,000, $3,100, $2,900, $3,400, $3,800, $4,000, $3,300, $3,100, $3,200,
$4,200, $3,400.

To calculate descriptive statistics:

● Mean: Add up all the incomes and divide by the number of incomes. Mean = ($2,500 + $3,000 + ... + $3,400) / 50 = $174,500 / 50 = $3,490
● Median: Arrange the incomes in ascending order and find the middle value. With 50 incomes, the median is the average of the 25th and 26th values, both $3,500, so Median = $3,500
● Range: Calculate the difference between the highest and lowest incomes. Range = $4,500 - $2,500 = $2,000
● Variance: Calculate the average of the squared differences from the mean. Variance = [($2,500-$3,490)^2 + ($3,000-$3,490)^2 + ... + ($3,400-$3,490)^2] / 50 = 12,325,000 / 50 = 246,500
● Standard Deviation: Take the square root of the variance. Standard Deviation = √246,500 ≈ $496.49

These calculations provide descriptive statistics that summarize the central tendency,
dispersion, and shape of the data in these examples.

The four types of descriptive statistics are:

● Measures of central tendency


● Measures of variability
● Standards of relative position
● Graphical methods

Measures of central tendency describe the typical value in the dataset and include
mean, median, and mode.

Measures of variability represent the spread or dispersion of the data and include
range, variance, and standard deviation.

Measures of relative position describe the location of a specific value within the
dataset, such as percentiles.

Graphical methods use charts, histograms, and other visual representations to display
data.
Descriptive statistics break down into several types, characteristics, or measures. Some
authors say that there are two types. Others say three or even four.
Distribution (Also Called Frequency Distribution)

Datasets consist of a distribution of scores or values. Statisticians use graphs and


tables to summarize the frequency of every possible value of a variable, rendered in
percentages or numbers. For instance, if you held a poll to determine people’s favorite
Beatle, you’d set up one column with all possible variables (John, Paul, George, and
Ringo), and another with the number of votes.

Statisticians depict frequency distributions as either a graph or as a table.

Measures of Central Tendency

Measures of central tendency estimate a dataset's average or center, finding the result
using three methods: mean, mode, and median.

Mean: The mean is also known as “M” and is the most common method for finding averages. You get the mean by adding all the response values together and dividing the sum by the number of responses, or “N.” For instance, say someone is trying to figure out how many hours a day they sleep in a week. So, the data set would be the hour entries (e.g., 6, 8, 7, 10, 8, 4, 9), and the sum of those values is 52. There are seven responses, so N = 7. You divide the value sum of 52 by N, or 7, to find M, which in this instance is 52 / 7 ≈ 7.43.

Mode: The mode is just the most frequent response value. Datasets may have any
number of modes, including “zero.” You can find the mode by arranging your dataset's
order from the lowest to highest value and then looking for the most common response.
So, in using our sleep study from the last part: 4,6,7,8,8,9,10. As you can see, the mode
is eight.
Median: Finally, we have the median, defined as the value in the precise center of the
dataset. Arrange the values in ascending order (like we did for the mode) and look for
the number in the set’s middle. In this case, the median is eight.

Variability (Also Called Dispersion)

The measure of variability gives the statistician an idea of how spread out the
responses are. The spread has three aspects — range, standard deviation, and variance.

Range: Use range to determine how far apart the most extreme values are. Start by
subtracting the dataset’s lowest value from its highest value. Once again, we turn to our
sleep study: 4,6,7,8,8,9,10. We subtract four (the lowest) from ten (the highest) and get
six. There’s your range.

Standard Deviation: This aspect takes a little more work. The standard deviation (s) is your dataset’s average amount of variability, showing you how far each score lies from the mean. The larger your standard deviation, the more variable your dataset. Follow these six steps:

1. List the scores and their means.


2. Find the deviation by subtracting the mean from each score.
3. Square each deviation.
4. Total up all the squared deviations.
5. Divide the sum of the squared deviations by N-1.
6. Find the result’s square root.
Raw Number/Data    Deviation from Mean    Deviation Squared

4                  4 - 7.43 = -3.43       11.76

6                  6 - 7.43 = -1.43       2.04

7                  7 - 7.43 = -0.43       0.18

8                  8 - 7.43 = 0.57        0.32

8                  8 - 7.43 = 0.57        0.32

9                  9 - 7.43 = 1.57        2.46

10                 10 - 7.43 = 2.57       6.60

M = 7.43           Sum ≈ 0                Sum of squares ≈ 23.68


When you divide the sum of the squared deviations by 6 (N-1), 23.68/6, you get about 3.95, and the square root of that result is about 1.99. As a result, we now know that each score deviates from the mean by an average of roughly 1.99 points.

Variance: Variance reflects the dataset’s degree of spread. The greater the degree of data spread, the larger the variance relative to the mean. You can get the variance by simply squaring the standard deviation. Using the above example, we square 1.99 and arrive at approximately 3.95.
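A sketch of the same six-step calculation in R, using the sleep data from the example:

hours <- c(6, 8, 7, 10, 8, 4, 9)
m <- mean(hours)                    # 52 / 7 ≈ 7.43
dev <- hours - m                    # deviations from the mean
sum(dev^2)                          # sum of squared deviations ≈ 23.71
sum(dev^2) / (length(hours) - 1)    # variance (divide by N - 1) ≈ 3.95
sd(hours)                           # standard deviation ≈ 1.99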

Univariate Descriptive Statistics

Univariate descriptive statistics examine only one variable at a time and do not compare variables. Rather, they allow the researcher to describe individual variables; this is the descriptive use of statistics. The patterns identified in this sort of data may be described using the following:

● Measures of central tendency (mean, mode, and median)
● Measures of dispersion (standard deviation, variance, range, minimum, maximum, and quartiles)
● Frequency distribution tables
● Pie charts
● Histograms and frequency polygons
● Bar graphs

Bivariate Descriptive Statistics

When using bivariate descriptive statistics, two variables are analyzed (compared) simultaneously to see whether they are correlated. Generally, by convention, the independent variable is represented by the columns and the rows represent the dependent variable.

There are numerous real-world applications for bivariate data. For example, estimating when a natural occurrence will happen is quite valuable. Bivariate data analysis is a tool in the statistician's toolbox. Sometimes, something as simple as plotting one variable against the other on a two-dimensional plane can clarify what the data is trying to tell you. For example, a scatter plot of the waiting time between eruptions at Old Faithful against the eruption's duration demonstrates the link between the two.
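R ships with the Old Faithful data mentioned here as the built-in faithful data frame, so the scatter plot can be reproduced directly; a sketch:

data(faithful)                      # built-in Old Faithful dataset
plot(faithful$waiting, faithful$eruptions,
     xlab = "Waiting time between eruptions (min)",
     ylab = "Eruption duration (min)",
     main = "Old Faithful: waiting time vs eruption duration")
cor(faithful$waiting, faithful$eruptions)   # strength of the linear relationship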

Univariate vs. Bivariate Statistics

Univariate

● Involves only one variable
● Doesn't deal with relationships or causes
● The prime purpose of univariate analysis is describing:
  ● Central tendency: mean, median, and mode
  ● Dispersion: variance, range, standard deviation, quartiles, maximum, minimum
  ● Bar graph, pie chart, histogram, box-and-whisker plot, line graph

Bivariate

● Involves two variables
● Deals with causes or relationships
● The prime purpose of bivariate analysis is explaining:
  ● Correlations: comparisons, explanations, causes, relationships
  ● Dependent and independent variables
  ● Tables where just one variable is dependent on other variables' values
  ● Simultaneous analysis of two variables

What is the Main Purpose of Descriptive Statistics?

Descriptive statistics can be useful for two things: 1) providing basic information about
variables in a dataset and 2) highlighting potential relationships between variables.
Graphical/pictorial methods display the most common descriptive statistics graphically or pictorially; they are used to summarise data.
Descriptive statistics only make statements about the data set used to calculate them;
they never go beyond your data.

Scatter Plots

A scatter plot employs dots to indicate values for two separate numeric variables. Each dot's location on the horizontal and vertical axes represents a data point's values. Scatter plots are used to examine relationships between variables.

The main purposes of scatter plots are to examine and display relationships between
two numerical variables. The points in a scatter plot document the values of individual
points and trends when the data is obtained as a whole. Identification of correlational
links is prevalent with scatter plots. In these situations, we want to know what a good
vertical value prediction would be given a specific horizontal value.

This can lead to overplotting when there are many data points to plot. When data points
are overlaid to the point where it is difficult to see the connections between them and
the variables, this is known as overplotting. It might be difficult to discern how
densely-packed data points are when lots of them are in a tiny space.

There are a couple simple methods to relieve this issue. One approach is to choose only
a subset of data points: a random sample of points should still offer the basic sense of
the patterns in the whole data. Additionally, we can alter the shape of the dots by
increasing transparency to make overlaps visible or decreasing point size to minimise
overlaps.
What’s the Difference Between Descriptive Statistics and
Inferential Statistics?

So, what’s the difference between the two statistical forms? We’ve already touched upon
this when we mentioned that descriptive statistics doesn’t infer any conclusions or
predictions, which implies that inferential statistics do so.

Inferential statistics takes a random sample of data from a portion of the population
and describes and makes inferences about the entire population. For instance, in asking
50 people if they liked the movie they had just seen, inferential statistics would build on
that and assume that those results would hold for the rest of the moviegoing population
in general.

Therefore, if you stood outside that movie theater and surveyed 50 people who had just
seen Rocky 20: Enough Already! and 38 of them disliked it (about 76 percent), you could
extrapolate that 76% of the rest of the movie-watching world will dislike it too, even
though you haven’t the means, time, and opportunity to ask all those people.

Simply put: Descriptive statistics give you a clear picture of what your current data
shows. Inferential statistics makes projections based on that data.
Unit-2 Discrete and continuous distributions, Regression

Regression:-> Regression analysis – its properties, various methods to perform regression analysis, numericals based on it

Discrete distributions:-> Bernoulli distribution, Binomial distribution, Poisson distribution, Uniform distribution

Continuous distributions:-> Uniform distribution, exponential distribution, normal distribution, properties of the normal distribution, area under the normal curve.

Correlation (connection) refers to a process for establishing the relationship between two variables.

As sunlight increases, temperature goes up.

If the price is higher, demand will be lower.

Correlation analysis deals with the association between two or more variables. It is used to find the degree of correlation.

1. Positive correlation:-> An increase in one variable is accompanied by an increase in the other. For example, there is a positive correlation between smoking and alcohol use: as alcohol use increases, so does smoking.
2. Negative correlation:-> As one variable increases, the other decreases, and vice versa. If the price is higher, demand will be lower.
3. Linear correlation:-> The ratio of change between the two variables remains the same, e.g. marks relative to the topper of the class.
4. Curvilinear correlation:-> The ratio of change between the two variables changes, e.g. student strength in a class.
5. Simple correlation:-> Relation between two variables only, e.g. sunlight and temperature.
6. Partial correlation:-> Relation involving three variables (e.g. temperature, rainfall and yield), studied two at a time with the third held constant.
7. Multiple correlation:-> Relation between one variable and two or more other variables taken together.

Regression:-> Regression analysis measures the nature and extent of the relationship between two or more variables, which enables us to make predictions. Basically, it is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data. It refers to assessing the relationship between the outcome variable and one or more other variables. The outcome variable is known as the dependent or response variable, and the risk factors and confounders are known as predictors or independent variables.

How much the variables are related is correlation; how one variable is related to another is regression.

The term "regression" literally means "stepping back towards the average. It was
first used by a British biometrician Sir Francis Galton (1822-1911), in connection
with the inheritance of stature

Describes how an independent variable is associated with the dependent variable.

Both variables are different.


To fit the best line and estimate one variable based on another variable.

To estimate values of a random variable based on the values of a fixed variable.

•In regression analysis there are two types of variables.

● Dependent variable (regressed or explained variable):-> the variable whose value is influenced or is to be predicted. The dependent variable is denoted by “y”.
● Independent variable (regressor, predictor or explanatory variable):-> the variable which influences the values or is used for prediction. Independent variables are denoted by “x” in regression analysis.

Lines of Regression:
•If the variables in a bivariate distribution are related, we will find that the points in the
scatter diagram will cluster round some curve called the "curve of regression".

•If the curve is a straight line, it is called the line of regression and there is said to be linear
regression' between the variables, otherwise regression is said to be curvilinear.

•The line of regression is the line which gives the best estimate to the value of one variable
for any specific value of the other variable. Thus the line of regression is the line of "best
fit" and is obtained by the principles of least squares.
•Let us suppose that in the bivariate distribution (Xi, Yi); i = 1, 2, ..., n; Y is a dependent
variable and X is an independent variable. Let the line of regression of Y on X be Y = a +
bX.

If Y = a + bX, then clearly when the value of X changes, the value of Y will change accordingly.

•There are always two lines of regression one of Y on X and the other of X on Y. The line of
regression of Y on X is used to estimate or predict the value of Y for any given value of X
i.e., when Y is a dependent variable and X is an independent variable. Hence to estimate
or predict X for any given value of Y, we use the regression equation of X on Y. Here X is a
dependent variable and Y is an independent variable.

•The two regression equations are not reversible or interchangeable because of the simple reason that the basis and assumptions for deriving these equations are quite different. The regression equation of Y on X is obtained by minimising the sum of squared errors in Y, and that of X on Y by minimising the sum of squared errors in X (the principle of least squares).

Properties of Regression Coefficients

•Correlation coefficient is the geometric mean between the regression coefficients.

•If one of the regression coefficients is greater than unity, the other must be less than unity.

•Arithmetic mean of the regression coefficients is greater than the correlation coefficient r,

provided r > 0.

•Regression coefficients are independent of the change of origin but not of scale.
Regression Formula
The regression formula assesses the relationship between the dependent and independent variables and finds out how the dependent variable changes when the independent variable changes. It is represented by the equation Y = a + bX, where Y is the dependent variable, X is the independent variable, a is the intercept (constant) and b is the slope (regression coefficient).

X on Y => X = a + bY

Y on X => Y = a + bX
Regression Equation using Normal Equation

Calculate the regression equation of X on Y of the following data by the least squares method:

X: 1 2 3 4 5
Y: 2 5 3 8 7

Solution: X on Y => X = a + bY

The normal equations are:

ΣX = Na + bΣY ………………..(1)
ΣXY = aΣY + bΣY² ………………..(2)

X        Y        XY       Y²
1        2        2        4
2        5        10       25
3        3        9        9
4        8        32       64
5        7        35       49
ΣX=15    ΣY=25    ΣXY=88   ΣY²=151

Putting all the values in equations (1) and (2):

15 = 5a + 25b
88 = 25a + 151b

Multiplying the first equation by 5 gives 75 = 25a + 125b. Subtracting it from the second equation gives 13 = 26b, so b = 0.5, and then 15 = 5a + 25(0.5) gives a = 0.5.

Therefore X = 0.5 + 0.5Y.
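The same least squares fit of X on Y can be checked in R, either by solving the two normal equations directly or with lm(); a sketch:

X <- c(1, 2, 3, 4, 5)
Y <- c(2, 5, 3, 8, 7)

# Solve the normal equations  15 = 5a + 25b  and  88 = 25a + 151b
A <- matrix(c(5, sum(Y), sum(Y), sum(Y^2)), nrow = 2)
b <- c(sum(X), sum(X * Y))
solve(A, b)     # a = 0.5, b = 0.5

lm(X ~ Y)       # same coefficients: intercept 0.5, slope 0.5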

Question 2: Calculate the regression equation of Y on X of the following data by the least squares method, and estimate Y when X = 10.

X: 1 2 3 4 5
Y: 9 9 10 12 11

Solution: Y on X => Y = a + bX

The normal equations are:

ΣY = Na + bΣX ………………..(1)
ΣXY = aΣX + bΣX² ………………..(2)

X        Y        XY       X²
1        9        9        1
2        9        18       4
3        10       30       9
4        12       48       16
5        11       55       25
ΣX=15    ΣY=51    ΣXY=160  ΣX²=55

Putting all the values in equations (1) and (2):

51 = 5a + 15b
160 = 15a + 55b

Multiplying the first equation by 3 gives 153 = 15a + 45b. Subtracting it from the second equation gives 7 = 10b, so b = 0.7, and then 51 = 5a + 15(0.7) gives a = 8.1.

Therefore Y = 8.1 + 0.7X, and when X = 10, Y = 8.1 + 0.7(10) = 15.1.

Regression equation of X on Y is X=6+Y

Regression equation of Y on X is Y=-4.18+.87X

Regression Equation using Normal coefficients


Regression coefficients using actual values of the X and Y series.

Calculate the regression equations of X on Y and Y on X for the following data:

X: 1 2 3 4 5
Y: 2 5 3 8 7

Solution: X on Y is calculated as (X - X̄) = bxy (Y - Ȳ)

X        Y        XY       Y²       X²
1        2        2        4        1
2        5        10       25       4
3        3        9        9        9
4        8        32       64       16
5        7        35       49       25
ΣX=15    ΣY=25    ΣXY=88   ΣY²=151  ΣX²=55

Step 1: Calculate the means of X and Y:

Mean of X (X̄) = (1+2+3+4+5) / 5 = 15 / 5 = 3
Mean of Y (Ȳ) = (2+5+3+8+7) / 5 = 25 / 5 = 5

Step 2: bxy = [NΣXY - ΣX·ΣY] / [NΣY² - (ΣY)²]

With ΣX = 15, ΣY = 25, ΣXY = 88, ΣY² = 151 and (ΣY)² = 625:

bxy = [5(88) - 15(25)] / [5(151) - 625] = (440 - 375) / (755 - 625) = 65 / 130 = 0.5

Step 3: (X - X̄) = bxy (Y - Ȳ)

(X - 3) = 0.5 (Y - 5) = 0.5Y - 2.5
X = 0.5Y - 2.5 + 3
X = 0.5Y + 0.5

Y on X is calculated as (Y - Ȳ) = byx (X - X̄)

byx = [NΣXY - ΣX·ΣY] / [NΣX² - (ΣX)²]

With ΣX = 15, ΣY = 25, ΣXY = 88, ΣX² = 55, (ΣX)² = 225, N = 5, X̄ = 3, Ȳ = 5:

byx = [5(88) - 15(25)] / [5(55) - 225] = 65 / 50 = 1.3

(Y - Ȳ) = byx (X - X̄)

(Y - 5) = 1.3 (X - 3) = 1.3X - 3.9
Y = 1.3X - 3.9 + 5
Y = 1.3X + 1.1
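Both regression coefficients can be verified in R from the covariance and the variances; a sketch for the data above:

X <- c(1, 2, 3, 4, 5)
Y <- c(2, 5, 3, 8, 7)

bxy <- cov(X, Y) / var(Y)   # regression coefficient of X on Y = 0.5
byx <- cov(X, Y) / var(X)   # regression coefficient of Y on X = 1.3
c(bxy, byx)

coef(lm(X ~ Y))             # X = 0.5 + 0.5 Y
coef(lm(Y ~ X))             # Y = 1.1 + 1.3 X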

Method 2: Using Deviations from the Actual Means

X on Y is calculated as (X - X̄) = bxy (Y - Ȳ), where bxy = Σxy / Σy², with x = X - X̄ and y = Y - Ȳ.

Calculate the regression equations of X on Y and Y on X for the following data:

X: 2 4 6 8 10 12
Y: 4 2 5 10 3 6

X̄ = ΣX / N = 42 / 6 = 7
Ȳ = ΣY / N = 30 / 6 = 5

X      Y      x = X-X̄    y = Y-Ȳ    xy     x²     y²
2      4      2-7 = -5    4-5 = -1    5      25     1
4      2      -3          -3          9      9      9
6      5      -1          0           0      1      0
8      10     1           5           5      1      25
10     3      3           -2          -6     9      4
12     6      5           1           5      25     1
ΣX=42  ΣY=30                          Σxy=18 Σx²=70 Σy²=40

X on Y:

bxy = Σxy / Σy² = 18 / 40 = 0.45

(X - X̄) = bxy (Y - Ȳ)
(X - 7) = 0.45 (Y - 5) = 0.45Y - 2.25
X = 0.45Y - 2.25 + 7
X = 0.45Y + 4.75

Y on X:

byx = Σxy / Σx² = 18 / 70 ≈ 0.257

(Y - Ȳ) = byx (X - X̄)
(Y - 5) = 0.257 (X - 7) = 0.257X - 1.799
Y = 0.257X - 1.799 + 5
Y = 0.257X + 3.201
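The deviation-method coefficients for this data can also be checked in R; a minimal sketch:

X <- c(2, 4, 6, 8, 10, 12)
Y <- c(4, 2, 5, 10, 3, 6)

x <- X - mean(X)        # deviations from X-bar = 7
y <- Y - mean(Y)        # deviations from Y-bar = 5

sum(x * y) / sum(y^2)   # bxy = 18 / 40 = 0.45
sum(x * y) / sum(x^2)   # byx = 18 / 70 ≈ 0.257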

Regression equations from the coefficient of correlation

X on Y: (X - X̄) = bxy (Y - Ȳ), where bxy = r·(σx / σy)
Y on X: (Y - Ȳ) = byx (X - X̄), where byx = r·(σy / σx)

Obtain the two regression equations and
i) estimate Y when X = 9
ii) estimate X when Y = 12

given arithmetic means x̄ = 5 and ȳ = 12, standard deviations σx = 2.6 and σy = 3.6, and correlation coefficient r = 0.7.

Solution: x̄ = 5, ȳ = 12, σx = 2.6, σy = 3.6, r = 0.7

X on Y:

bxy = r·σx / σy = 0.7 × 2.6 / 3.6 ≈ 0.50

(X - X̄) = bxy (Y - Ȳ) => (X - 5) = 0.50 (Y - 12) = 0.50Y - 6
X = 0.50Y - 6 + 5
X = 0.50Y - 1

Y on X:

byx = r·σy / σx = 0.7 × 3.6 / 2.6 ≈ 0.97

(Y - Ȳ) = byx (X - X̄) => (Y - 12) = 0.97 (X - 5) = 0.97X - 4.85
Y = 0.97X - 4.85 + 12
Y = 0.97X + 7.15

i) When X = 9: Y = 0.97(9) + 7.15 = 8.73 + 7.15 = 15.88
ii) When Y = 12: X = 0.50(12) - 1 = 6 - 1 = 5
Probability Distribution

Probability distribution is a function that is used to give the probability of all the possible
values that a random variable can take

Probability distribution is a function that gives the relative likelihood of occurrence of all possible outcomes of an experiment. There are two important functions that are used to describe a probability distribution: the probability density function (or probability mass function) and the cumulative distribution function.

In statistics, there can be two types of data, namely discrete and continuous. Based on this, a probability distribution can be classified into a discrete probability distribution and a continuous probability distribution. In this section, we will learn more about probability distributions and the various aspects associated with them.

Suppose two coins are tossed.

Sample space = {HH, HT, TH, TT}

Random variable X = number of heads:

a) 0 heads: ¼
b) 1 head: ½
c) 2 heads: ¼

The table or rule that describes a random variable in this way is called a probability distribution.

Random variables are used to quantify outcomes of a random occurrence, and therefore can take on many values. Random variables are required to be measurable and are typically real numbers.

Types of Random Variable


As discussed in the introduction, there are two random variables, such as:

● Discrete Random Variable


● Continuous Random Variable

Discrete Random Variable


A discrete random variable can take only a countable number of distinct values such as 0, 1, 2, 3, 4, … and so on. The probability distribution of a discrete random variable is a list of probabilities associated with each of its possible values, known as the probability mass function.

In an analysis, let a person be chosen at random, and the person’s height is


demonstrated by a random variable. Logically the random variable is described
as a function which relates the person to the person’s height. Now in relation with
the random variable, it is a probability distribution that enables the calculation of
the probability that the height is in any subset of likely values, such as the
likelihood that the height is between 175 and 185 cm, or the possibility that the
height is either less than 145 or more than 180 cm. Now another random variable
could be the person’s age which could be either between 45 years to 50 years or
less than 40 or more than 50.
Continuous Random Variable
A numerically valued variable is said to be continuous if, in any unit of measurement, whenever it can take on the values a and b it can also take on any value between a and b. If the random variable X can assume an infinite and uncountable set of values, it is said to be a continuous random variable. When X takes any value in a given interval (a, b), it is said to be a continuous random variable in that interval.

Formally, a continuous random variable is one whose cumulative distribution function is continuous everywhere. There are no “gaps” that would correspond to numbers having a positive probability of occurring. Alternately, these variables almost never take an exactly prescribed value c, but there is a positive probability that the value will lie in particular intervals, which can be arbitrarily small.

A die is tossed once. If the random variable X indicates whether the number obtained is even (E) or odd (O), find the probability distribution of X.

Sample space = {1, 2, 3, 4, 5, 6}

P(E) = 3/6 = ½

P(O) = 3/6 = ½

X    E    O
P    ½    ½

Find the probability distribution of the random variable ‘number of heads’ when two coins are tossed.

X = number of heads

Sample space = {HH, HT, TH, TT}

a) 0 heads: ¼
b) 1 head: ½
c) 2 heads: ¼

Discrete Probability Distribution

A discrete probability distribution counts occurrences that have countable or


finite outcomes.These values often represent outcomes of events, and the
probability of each outcome is assigned a specific value. Discrete distributions
contrast with continuous distributions, where outcomes can fall anywhere on a
continuum. Common examples of discrete distribution include the binomial,
Poisson, and Bernoulli distributions.

A probability function includes the methods or formulas used to find the probability of any outcome.

● The probability mass function is used for a discrete probability distribution.
● The probability density function is used for a continuous probability distribution.
The Probability Density Function (PDF) defines the probability function representing the density of a continuous random variable lying within a specific range of values.

Say we have a continuous random variable whose probability density function is given by f(x) = (x + 2)/6 for 0 < x ≤ 2 (scaled so that the total area under f is 1). We want to find P(0.5 < X < 1). Then we integrate (x + 2)/6 within the limits 0.5 and 1. This gives 1.375/6 ≈ 0.229. Thus, the probability that the continuous random variable lies between 0.5 and 1 is about 0.229.
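This integral can be checked numerically in R with integrate(); a small sketch using the density defined above:

f <- function(x) (x + 2) / 6            # density used in the example, valid on (0, 2]
integrate(f, lower = 0, upper = 2)      # total probability: 1
integrate(f, lower = 0.5, upper = 1)    # P(0.5 < X < 1) ≈ 0.229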
The Cumulative Distribution Function (CDF) of a real-valued random variable X, evaluated at x, is the probability that X will take a value less than or equal to x. In other words, the CDF gives the cumulative probability for the given value. It is used to determine the probability of a random variable and to compare probabilities between values under certain conditions. For discrete distribution functions, the CDF gives the probability of values up to the one we specify, and for continuous distribution functions, it gives the area under the probability density function up to the given value specified.

Consider a simple example for CDF which is given by rolling a fair six-sided die, where
X is the random variable

We know that the probability of getting an outcome by rolling a six-sided die is given as:

Probability of getting 1 = P(X≤ 1 ) = 1 / 6

Probability of getting 2 = P(X≤ 2 ) = 2 / 6

Probability of getting 3 = P(X≤ 3 ) = 3 / 6

Probability of getting 4 = P(X≤ 4 ) = 4 / 6

Probability of getting 5 = P(X≤ 5 ) = 5 / 6

Probability of getting 6 = P(X≤ 6 ) = 6 / 6 = 1

From this, it is noted that the probability value always lies between 0 and 1 and it is
non-decreasing and right continuous in nature.
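In R, the die's CDF is just the cumulative sum of the equal probabilities; a sketch:

p <- rep(1/6, 6)    # P(X = 1), ..., P(X = 6) for a fair die
cdf <- cumsum(p)    # P(X <= 1), ..., P(X <= 6)
names(cdf) <- 1:6
cdf                 # 1/6, 2/6, ..., 6/6 = 1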

Binomial Distribution Formula


The binomial distribution is a commonly used discrete distribution in statistics. The

normal distribution as opposed to a binomial distribution is a continuous distribution.

The binomial distribution represents the probability for 'x' successes of an experiment in

'n' trials, given a success probability 'p' for each trial at the experiment.

Success= P

Failure = q

p=1-q

q=1-p

P(x; n, p) = nCx p^x q^(n-x)

where,

● n = the number of experiments

● x = 0, 1, 2, 3, 4, …

● p = Probability of success in a single experiment

● q = Probability of failure in a single experiment (= 1 – p)

The binomial distribution formula is also written in terms of n Bernoulli trials, where nCx = n!/[x!(n-x)!]. Hence, P(x; n, p) = n!/[x!(n-x)!] · p^x · q^(n-x)

Example 1: If a coin is tossed 5 times, using binomial distribution find the probability of:
(a) Exactly 2 heads

(b) At least 4 heads.

Solution:

(a) The repeated tossing of the coin is an example of a Bernoulli trial. According to the problem:

Number of trials: n=5

Probability of head: p= 1/2 and hence the probability of tail, q =1/2

For exactly two heads:

x = 2

P(x = 2) = 5C2 p^2 q^(5-2) = [5! / (2! 3!)] × (½)^2 × (½)^3

P(x = 2) = 10/32 = 5/16

(b) For at least four heads,

x ≥ 4, so P(x ≥ 4) = P(x = 4) + P(x = 5)

Hence,

P(x = 4) = 5C4 p^4 q^(5-4) = [5! / (4! 1!)] × (½)^4 × (½)^1 = 5/32

P(x = 5) = 5C5 p^5 q^(5-5) = (½)^5 = 1/32

Answer: Therefore, P(x ≥ 4) = 5/32 + 1/32 = 6/32 = 3/16
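These binomial probabilities correspond directly to R's dbinom() and pbinom(); a sketch:

dbinom(2, size = 5, prob = 0.5)           # P(exactly 2 heads) = 5/16 = 0.3125
1 - pbinom(3, size = 5, prob = 0.5)       # P(at least 4 heads) = 3/16 = 0.1875
sum(dbinom(4:5, size = 5, prob = 0.5))    # same result another way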


1. DISCRETE DISTRIBUTIONS:

Discrete distributions have a finite number of different possible outcomes.

Characteristics of Discrete Distribution

● We can add up individual values to find out the probability of an interval

● Discrete distributions can be expressed with a graph, piece-wise function or table

● In discrete distributions, graph consists of bars lined up one after the other

● Expected values might not be achievable

● P(Y≤y) = P(Y < y + 1)



Examples of Discrete Distributions:

1. Bernoulli Distribution

2. Binomial Distribution

3. Uniform Distribution

4. Poisson Distribution

1.1 Bernoulli Distribution

Bernoulli Distribution can be used to describe events that can only have two outcomes, that is, success or failure. In other words, the random

variable can be 1 with a probability p or it can be 0 with a probability (1 - p). Such an experiment is called a Bernoulli trial. A pass or fail

exam can be modeled by a Bernoulli Distribution.

If we have a Binomial Distribution where n = 1 then it becomes a Bernoulli Distribution

In Bernoulli distribution there is only one trial and only two possible outcomes i.e. success or failure. It is denoted by y ~Bern(p).

In a binomial probability distribution, we count the number of 'successes' in a sequence of n experiments, where each experiment asks a yes-no question and the Boolean-valued outcome is represented either as success/yes/true/one (probability p) or failure/no/false/zero (probability q = 1 − p). A single success/failure test is also called a Bernoulli trial or Bernoulli experiment, and a series of outcomes is called a Bernoulli process. For n = 1, i.e. a single experiment, the binomial distribution is a Bernoulli distribution. The binomial distribution is the basis for the well-known binomial test of statistical significance.
A discrete probability distribution wherein the random variable can only have 2 possible outcomes is known as a Bernoulli Distribution. If in a

Bernoulli trial the random variable takes on the value of 1, it means that this is a success. The probability of success is given by p. Similarly, if the

value of the random variable is 0, it indicates failure. The probability of failure is q or 1 - p. Bernoulli distribution can be used to derive a binomial

distribution, geometric distribution, and negative binomial distribution.

Characteristics of Bernoulli distributions

● It consists of a single trial

● Two possible outcomes

● E(Y) = p

● Var(Y) = p × (1 – p)

Examples and Uses:

● Guessing a single True/False question.

● It is mostly used when trying to find out what we expect to obtain a single trial of an experiment.
1.2 Binomial Distribution

A sequence of identical Bernoulli events is called Binomial and follows a Binomial distribution. It is denoted by Y ~B(n, p).

Characteristics of Binomial distribution

● Over the n trials, it measures the frequency of occurrence of one of the possible results.

● E(Y) = n × p

● P(Y = y) = C(n, y) × p^y × (1 – p)^(n-y)

● Var(Y) = n × p × (1 – p)

Examples and Uses:

● Simply determining how many heads we obtain if we flip a coin 10 times.

● It is mostly used when we try to predict how likely an event is to occur over a series of trials.

1.3 Uniform Distribution

In uniform distribution all the outcomes are equally likely. It is denoted by Y ~U(a, b). If the values are categorical, we simply indicate the

number of categories, like Y ~U(a).

Characteristics of Uniform Distribution

● In uniform distribution all the outcomes are equally likely.

● In graph, all the bars are equally tall

● The expected value and variance have no predictive power

Examples and Uses:

● Result obtained after rolling a die


● Due to its equality, it is mostly used in shuffling algorithms

1.4 Poisson Distribution

The Poisson distribution is used to determine how likely a certain event is to occur over a given interval of time or distance. It is denoted by Y ~ Po(λ).

Characteristics of the Poisson distribution

● It measures the frequency of events over an interval of time or distance.

Examples and Uses

● It is used to determine how likely a certain event is to occur over a given interval of time or distance.

● It is often used in marketing analysis to find out whether more than the average number of visits is out of the ordinary or otherwise.
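A sketch of the Poisson distribution in R; the rate λ = 4 is an illustrative value of my own, not one given in the document:

lambda <- 4              # illustrative average number of events per interval
dpois(0:8, lambda)       # P(Y = 0), ..., P(Y = 8)
ppois(6, lambda)         # P(Y <= 6)
1 - ppois(6, lambda)     # probability of seeing more than 6 events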
2. CONTINUOUS DISTRIBUTIONS:

Continuous distributions have infinitely many possible values on a continuum.

Characteristics of Continuous Distributions

● We cannot add up individual values to find the probability of an interval because there are infinitely many of them.

● Continuous distributions can be expressed with a continuous function or graph.

● In continuous distributions, the graph is a smooth curve.

● To calculate the probability of an interval, we require integrals.

● P(Y = y) = 0 for any distinct value y.

● P(Y < y) = P(Y ≤ y)

Examples of Continuous Distributions

1. Normal Distribution

2. Chi-Squared Distribution

3. Exponential Distribution

4. Logistic Distribution

5. Students’ T Distribution
2.1 Normal Distribution

It shows a distribution that many natural phenomena follow. It is denoted by Y ~ N(µ, σ²). The main characteristics of the normal distribution are:

Characteristics of the normal distribution

● The graph of a normal distribution is a bell-shaped curve, symmetric with thin tails.

● About 68% of its values fall in the interval (µ – σ, µ + σ).

● E(Y) = µ

● Var(Y) = σ²

Examples and Uses

● Normal distributions are often observed in nature, for example in the sizes of animals.

● We can convert any normal distribution into a standard normal distribution, which can then be used with the Z-table. Standardizing with z = (Y – µ)/σ gives a variable whose mean is 0 and whose standard deviation is 1.
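A sketch in R of standardizing and using the normal distribution; µ = 100 and σ = 15 are illustrative values of my own, not from the document:

mu <- 100; sigma <- 15                     # illustrative parameters
pnorm(mu + sigma, mu, sigma) -
  pnorm(mu - sigma, mu, sigma)             # about 0.68 within one sigma of the mean
z <- (110 - mu) / sigma                    # standardize a value of 110
pnorm(z)                                   # same as pnorm(110, mu, sigma)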


2.2 Chi-Squared Distribution

The Chi-squared distribution is frequently used. It is mostly used to test goodness of fit. It is denoted by Y ~ χ²(k).

Characteristics of the Chi-squared distribution

● The graph obtained from the Chi-squared distribution is asymmetric and skewed to the right.

● The sum of squares of k independent standard normal variables follows a Chi-squared distribution with k degrees of freedom.

● E(Y) = k

● Var(Y) = 2k

Examples and Uses:

● It is mostly used to test goodness of fit.

● It comprises a table of known values for its CDF called the χ²-table.

2.3 Exponential Distribution

It is usually observed in events whose probability changes considerably early on. It is denoted by Y ~ Exp(λ).

Characteristics of the exponential distribution

● The probability density and cumulative distribution functions (PDF and CDF) level off after a certain point.

● We do not have a table of known values like the Normal or Chi-squared distributions; therefore, we mostly use the natural logarithm to transform values of exponential distributions.

Examples and Uses

● It is mostly used with dynamically changing variables, such as online website traffic.

2.4 Logistic Distribution

It is used to observe how continuous variable inputs can affect the probability of a binary result. It is denoted by Y ~ Logistic(µ, s).

Characteristics of the logistic distribution

● The cumulative distribution function rises most steeply for values near the mean.

● The smaller the scale parameter, the faster the CDF reaches values close to 1.

Examples and Uses

It is mostly used in sports to predict how a player's or team's performance can determine the result of the match.

Binomial Distribution

A sequence of identical Bernoulli events is called Binomial and follows a Binomial distribution. It is denoted by Y

~B(n, p).

Step 1:-> Success = p and Failure = q, with p = 1 - q and q = 1 - p

Step 2:-> P(r; n, p) = nCr p^r q^(n-r)

Where, n = the number of experiments

r = 0, 1, 2, 3, 4, …

p = probability of success in a single experiment

q = probability of failure in a single experiment (= 1 – p)

Step 3:-> Binomial distribution = (q + p)^n

Step 4:-> Mean, μ = np

Variance, σ² = npq

Standard deviation, σ = √(npq)

Determine the binomial distribution whose mean is 9 and standard deviation is 3/2.

Solution:

Mean, μ = np = 9 …………………………………….. (1)

Standard deviation, σ = √(npq) = 3/2, so npq = 9/4 …………………………………….. (2)

Dividing (2) by (1): npq / np = (9/4) / 9, so q = ¼

p = 1 - q = 1 - ¼ = ¾

np = 9 => n × ¾ = 9 => n = 9 × 4/3 = 12

Binomial distribution = (q + p)^n = (¼ + ¾)^12

The probability of a man hitting a target is ¼. He fires 7 times . What is the probability of hitting

atleast target twice?

Solution:-> p = ¼, q = 1 − ¼ = ¾, n = 7

P(X ≥ 2) = P(X=2) + P(X=3) + P(X=4) + P(X=5) + P(X=6) + P(X=7)

It is easier to use the complement: P(X ≥ 2) = 1 − [P(X=0) + P(X=1)]

Using P(r; n, p) = nCr × p^r × q^(n−r):

P(X=0) = 7C0 × p^0 × q^7 = q^7, since 7C0 = 1

P(X=1) = 7C1 × p^1 × q^6 = 7pq^6, since 7C1 = 7

P(X ≥ 2) = 1 − [q^7 + 7pq^6] = 1 − q^6(q + 7p) = 1 − (¾)^6 × (¾ + 7/4)

= 1 − 7290/16384 = 4547/8192 ≈ 0.555
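The same answer can be verified in R with the complement of the binomial CDF (a verification sketch only):

1 - pbinom(1, size = 7, prob = 1/4)   # P(X >= 2), about 0.5551
4547 / 8192                           # the exact fraction, same value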

Example 1: If a coin is tossed 5 times, using binomial distribution

find the probability of:

(a) Exactly 2 heads

(b) At least 4 heads


Solution:

(a) The repeated tossing of the coin is an example of a Bernoulli trial.

According to the problem:

Number of trials: n=5

Probability of head: p= 1/2 and hence the probability of tail, q =1/2

For exactly two heads:

x=2

P(x=2) = 5C2 × p^2 × q^(5−2) = 5! / (2! 3!) × (½)^2 × (½)^3

P(x=2) = 5/16

(b) For at least four heads,

x ≥ 4, P(x ≥ 4) = P(x = 4) + P(x=5)


Hence,

P(x = 4) = 5C4 × p^4 × q^(5−4) = 5!/(4! 1!) × (½)^4 × (½)^1 = 5/32

P(x = 5) = 5C5 × p^5 × q^(5−5) = (½)^5 = 1/32

Answer: Therefore, P(x ≥ 4) = 5/32 + 1/32 = 6/32 = 3/16
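Both parts of this example can be verified in R with a short sketch:

dbinom(2, size = 5, prob = 1/2)          # (a) exactly 2 heads = 0.3125 = 5/16
sum(dbinom(4:5, size = 5, prob = 1/2))   # (b) at least 4 heads = 0.1875 = 3/16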

On average, every one out of 10 telephones is found busy. Six telephone numbers are

selected at random. Find the probability that four of them will be busy.

Solution:

Let X: event of getting a busy phone number

p = P(probability of getting a phone number busy) = 1/10

q = P(probability of not getting a phone number busy) = 9/10

The required probability = P(X = 4) = 6C4 × p^4 × q^(6 − 4)

= 15 × (1/10)^4 × (9/10)^2

= 15 × 81/10^6

= 0.001215

There are four fused bulbs in a lot of 10 good bulbs. If three bulbs are drawn at random with replacement, find the probability of

distribution of the number of fused bulbs drawn.

Solution:

This is a problem of binomial distribution as the event of drawing a fused bulb is independent.

p = P(drawing a fused bulb) = 4/(10 + 4) = 2/7

q = P(drawing a bulb which is not fused) = 1 – 2/7 = 5/7

X = event of drawing a fused bulb

X can take up the values 0, 1, 2, 3

P(X = 0) = P(getting zero fused bulbs in all draws)

= nCr × p^r × q^(n − r)

= 3C0 × (2/7)^0 × (5/7)^(3 − 0)

= 1 × 1 × (125/343) = 125/343

P(X = 1) = P (getting one time fused bulb)


= nCr × p^r × q^(n − r)

= 3C1 × (2/7)^1 × (5/7)^(3 − 1)

= 3 × (2/7) × (25/49) = 150/343

P(X = 2) = P(getting two times fused bulbs)

= nCr × p^r × q^(n − r)

= 3C2 × (2/7)^2 × (5/7)^(3 − 2)

= 3 × (4/49) × (5/7) = 60/343

P(X = 3) = P(getting three times fused bulb)

= nCr × p^r × q^(n − r)

= 3C3 × (2/7)^3 × (5/7)^(3 − 3)

= 1 × (8/343) × 1 = 8/343

The required probability distribution:

X        0          1          2         3

P(X)     125/343    150/343    60/343    8/343
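The whole distribution can be produced in one call in R; a small verification sketch of the table above:

probs <- dbinom(0:3, size = 3, prob = 2/7)   # P(X = 0), ..., P(X = 3)
probs                                        # 0.3644 0.4373 0.1749 0.0233
probs * 343                                  # 125 150 60 8
sum(probs)                                   # the probabilities sum to 1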

The probability of a boy guessing a correct answer is ¼. How many questions must he answer so that the probability of guessing the

correct answer at least once is greater than ⅔?


Solution:

p = P(guessing a correct answer) = ¼

q = P(not guessing a correct answer) = ¾

Let him answers n number of questions, then

P(X ≥ 1) = P(guessing at least one correct answer out of n questions) = 1 − P(no success) = 1 − q^n

Given, 1 − q^n > ⅔ ⇒ 1 − (¾)^n > ⅔

⇒ (¾)^n < ⅓

Now, let us check the above inequality for different values of n = 1, 2, 3, 4, …

When n = 1

¾≮⅓

When n = 2

(¾)^2 ≮ ⅓

When n = 3

(¾)^3 ≮ ⅓

When n = 4

(¾)^4 < ⅓

Hence, the boy must answer at least 4 questions.
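The same search over n can be done in R; a small sketch confirming that n = 4 is the smallest value satisfying the inequality:

n <- 1:10
(3/4)^n < 1/3            # FALSE FALSE FALSE TRUE TRUE ...
min(n[(3/4)^n < 1/3])    # 4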
1.1 Bernoulli Distribution

In the Bernoulli distribution there is only one trial and only two possible outcomes, i.e. success or failure. It is denoted by Y ~ Bern(p).

The binomial distribution counts the number of successes in a sequence of n yes-no experiments, where each Boolean-valued outcome is recorded as success/yes/true/one (with probability p) or failure/no/false/zero (with probability q = 1 − p). A single success/failure test is also called a Bernoulli trial or Bernoulli experiment, and a series of such outcomes is called a Bernoulli process. For n = 1, i.e. a single experiment, the binomial distribution reduces to the Bernoulli distribution. The binomial distribution is also the basis for the well-known binomial test of statistical significance.

Characteristics of Bernoulli distributions

● It consists of a single trial

● Two possible outcomes

● E(Y) = p

● Var(Y) = p × (1 – p)

Examples and Uses:

● Guessing a single True/False question.

● It is mostly used when trying to find out what we expect to obtain from a single trial of an experiment.
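In R a Bernoulli trial is simply a binomial trial with size = 1; a minimal simulation sketch (the probability 0.3 and the sample size are illustrative assumptions):

set.seed(1)                              # for reproducibility
p <- 0.3
y <- rbinom(10000, size = 1, prob = p)   # 10000 Bernoulli(p) trials
mean(y)                                  # close to E(Y) = p = 0.3
var(y)                                   # close to Var(Y) = p * (1 - p) = 0.21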
Poisson distribution probability

1. The Poisson distribution is used when n is very large and p, the probability of the event, is very small.

2. λ = np

3. f(x) = P(X = x) = (e^(−λ) × λ^x) / x!

Where

● x = 0, 1, 2, 3...

● e is Euler's number (e ≈ 2.718)

● λ is the average rate (the expected value), which also equals the variance; λ > 0

Poisson Distribution Mean and Variance

For Poisson distribution, which has λ as the average rate, for a fixed interval of
time, then the mean of the Poisson distribution and the value of variance will be
the same. So for X following Poisson distribution, we can say that λ is the mean
as well as the variance of the distribution.

Hence: E(X) = V(X) = λ

where
● E(X) is the expected mean
● V(X) is the variance
● λ>0

Properties of Poisson Distribution

The Poisson distribution is applicable in events that have a large number of rare
and independent possible events. The following are the properties of the
Poisson Distribution. In the Poisson distribution,
● The events are independent.
● The average number of successes in the given period of time is known; no two events can occur at exactly the same instant.
● The Poisson distribution is the limiting case of the binomial distribution when the number of trials n is indefinitely large and p is small.
● mean = variance = λ
● np = λ is finite, where λ is constant.
● The standard deviation is always equal to the square root of the mean μ.

● The exact probability that the random variable X with mean μ equals a is given by P(X = a) = (μ^a × e^(−μ)) / a!

● If the mean is large, then the Poisson distribution is approximately a normal distribution.

Poisson Distribution Table

Similar to the binomial distribution, we can have a Poisson distribution table, which helps us quickly find the probability mass function of an event that follows the Poisson distribution. The table shows values of the Poisson distribution for various values of λ, where λ > 0. For example, for P(X = 0) and λ = 0.5, the value of the probability mass function is e^(−0.5) ≈ 0.6065, or 60.65%.
Example: Suppose X follows a Poisson distribution. If P(X = 2) = (2/3) × P(X = 1), find P(X = 0).
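A hedged sketch of this exercise: equating the two Poisson probabilities, e^(−λ)λ^2/2! = (2/3) × e^(−λ)λ gives λ/2 = 2/3, so λ = 4/3 and P(X = 0) = e^(−4/3) ≈ 0.2636. The R lines below check this, and also the table value quoted above for λ = 0.5:

dpois(0, lambda = 0.5)                # 0.6065, i.e. 60.65%, the table value above
lambda <- 4/3
dpois(2, lambda) / dpois(1, lambda)   # equals 2/3, confirming lambda = 4/3
dpois(0, lambda)                      # P(X = 0) = exp(-4/3), about 0.2636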

Five coins are tossed 3200 times. What is the probability of getting 5 heads exactly two times?

Solution
4. n = 3200 and p = P(getting 5 heads in one toss of 5 coins) = (½)^5 = 1/32

5. λ = np = 3200 × 1/32 = 100

6. f(x) = P(X = x) = (e^(−λ) × λ^x) / x!

P(X = 2) = (e^(−100) × 100^2) / 2! = 5000 × e^(−100)
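A one-line verification sketch of this Poisson calculation in R:

dpois(2, lambda = 100)   # P(X = 2), about 1.86e-40
5000 * exp(-100)         # same value, matching 5000 × e^(−100)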
Normal Distribution

In probability theory and statistics, the Normal Distribution, also called the

Gaussian Distribution, is the most significant continuous probability

distribution. Sometimes it is also called a bell curve. A large number of random

variables are either nearly or exactly represented by the normal distribution, in

every physical science and economics.

The Normal Distribution is defined by the probability density function for a continuous random variable in a system. Let us say f(x) is the probability density function and X is the random variable. Then f(x), integrated over the interval (x to x + dx), gives the probability of the random variable X taking a value between x and x + dx.

f(x) ≥ 0 ∀ x ∈ (−∞, +∞)

And ∫ f(x) dx = 1, where the integral is taken from −∞ to +∞.

Normal Distribution Formula

The probability density function of the normal or Gaussian distribution is given by:

f(x) = (1 / (σ√(2π))) × e^(−(x − μ)² / (2σ²))

Where,

● x is the variable
● μ is the mean
● σ is the standard deviation
Normal Distribution Curve

The random variables following the normal distribution are those whose values can take any value in a given range. For example, consider the heights of students in a school. Here, the measured values are bounded in a range, say 0 to 6 ft, but this limitation is imposed physically by the nature of the query.

The normal distribution itself, however, places no restriction on the range. The range can extend from −∞ to +∞ and we still obtain a smooth curve. Such random variables are called continuous variables, and the normal distribution then provides the probability of the value lying in a particular range for a given experiment.

Normal Distribution Standard Deviation


Generally, the normal distribution has any positive standard deviation.

We know that the mean helps to determine the line of symmetry of a

graph, whereas the standard deviation helps to know how far the data

are spread out. If the standard deviation is smaller, the data are

somewhat close to each other and the graph becomes narrower. If the

standard deviation is larger, the data are dispersed more, and the graph

becomes wider. The standard deviations are used to subdivide the area

under the normal curve. Each subdivided section defines the percentage

of data, which falls into the specific region of a graph.


Using 1 standard deviation, the Empirical Rule states that,

● Approximately 68% of the data falls within one standard deviation of

the mean. (i.e., Between Mean- one Standard Deviation and Mean

+ one standard deviation)


● Approximately 95% of the data falls within two standard deviations

of the mean. (i.e., Between Mean- two Standard Deviation and

Mean + two standard deviations)

● Approximately 99.7% of the data fall within three standard

deviations of the mean. (i.e., Between Mean- three Standard

Deviation and Mean + three standard deviations)
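The three percentages of the Empirical Rule can be reproduced with the standard normal CDF in R; a short sketch:

pnorm(1) - pnorm(-1)   # about 0.6827, i.e. roughly 68%
pnorm(2) - pnorm(-2)   # about 0.9545, i.e. roughly 95%
pnorm(3) - pnorm(-3)   # about 0.9973, i.e. roughly 99.7%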

Normal Distribution Properties

Some of the important properties of the normal distribution are listed

below:

● In a normal distribution, the mean, median and mode are

equal.(i.e., Mean = Median= Mode).

● The total area under the curve should be equal to 1.


● The normally distributed curve should be symmetric at the centre.

● Exactly half of the values are to the right of the centre and exactly half are to the left of the centre.

● The normal distribution is completely defined by its mean and standard deviation.

● The normal distribution curve must have only one peak. (i.e.,

Unimodal)

● The curve approaches the x-axis but never touches it, and it extends indefinitely as it moves farther away from the mean.


Applications

The normal distributions are closely associated with many things such as:

● Marks scored on the test

● Heights of different persons

● Size of objects produced by the machine

● Blood pressure and so on.


The random variable X is normally distributed with mean 9 and standard deviation 3. Find P(X ≥ 15), P(X < 15) and P(0 < X < 9).

Answer:

Mean µ = 9

Standard Deviation σ = 3

i) X ≥ 15

Z = (X − µ)/σ = (15 − 9)/3 = 2

Z = 2

P(X ≥ 15) = P(Z ≥ 2)

From the Z-table, the area between Z = 0 and Z = 2 is 0.4772, so P(X ≥ 15) = 0.5 − 0.4772 = 0.0228
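All three requested probabilities can be computed directly with pnorm() in R, with no table lookup; a sketch of the full solution (the remaining answers are P(X < 15) ≈ 0.9772 and P(0 < X < 9) ≈ 0.4987):

mu <- 9; sigma <- 3
1 - pnorm(15, mean = mu, sd = sigma)                                # i)   P(X >= 15) = 0.0228
pnorm(15, mean = mu, sd = sigma)                                    # ii)  P(X < 15)  = 0.9772
pnorm(9, mean = mu, sd = sigma) - pnorm(0, mean = mu, sd = sigma)   # iii) P(0 < X < 9) = 0.4987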
Rectangular or Uniform Distribution
A uniform distribution is a continuous probability distribution and relates to the
events which are likely to occur equally. A uniform distribution is defined by two
parameters, a and b, where a is the minimum value and b is the maximum value.
It is generally denoted as u(a, b).

When the probability density function of a uniform distribution with a continuous random variable X is f(x) = 1/(b − a), it can be denoted by U(a, b), where a and b are constants with a < b. It is written as:

f(x) = 1/(b − a) for a ≤ x ≤ b.

where,

● a is the minimum value

● b is the maximum value
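A minimal R sketch of U(a, b) with illustrative endpoints a = 2 and b = 10 (assumed values): the density is flat at 1/(b − a) everywhere inside the interval.

a <- 2; b <- 10
dunif(5, min = a, max = b)   # density = 1/(b - a) = 0.125 anywhere in [2, 10]
punif(6, min = a, max = b)   # P(X <= 6) = (6 - 2)/(10 - 2) = 0.5
runif(3, min = a, max = b)   # three random draws from U(2, 10)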


Exponential Distribution
The exponential distribution formula is used to find the exponential distribution of a function.

Exponential distribution refers to the process in which the event happens at a constant

average rate independently and continuously. The exponential distribution is most often

known as the memoryless distribution because it means that past information has no effect

on future probabilities.

The exponential distribution is commonly used to model time: the time between arrivals, the

time until a component fails, the time until a patient dies. We have already encountered

several examples of exponential random variables—the time of the first arrival in a Poisson

process follows an exponential distribution.
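As a rough illustration in R (the rate of 2 arrivals per minute is an assumed value, not from the text), dexp()/pexp() model the waiting time until the first arrival, and the last two lines show the memoryless property P(X > s + t | X > s) = P(X > t):

rate <- 2                                       # assumed: 2 arrivals per minute on average
pexp(1, rate = rate)                            # P(wait <= 1 minute), about 0.8647
1 / rate                                        # mean waiting time = 0.5 minutes
s <- 0.5; t <- 1
(1 - pexp(s + t, rate)) / (1 - pexp(s, rate))   # P(X > s + t | X > s)
1 - pexp(t, rate)                               # equals P(X > t): memoryless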
