Engr. Data Analysis

Statistics may be defined as the science that deals with the collection, organization, presentation, analysis, and interpretation of data in order to draw judgments or conclusions that help in the decision-making process. The
two parts of this definition correspond to the two main divisions of Statistics. These are
Descriptive Statistics and Inferential Statistics. Descriptive Statistics, which is referred to
in the first part of the definition, deals with the procedures that organize, summarize and
describe quantitative data. It seeks merely to describe data. Inferential Statistics,
implied in the second part of the definition, deals with making a judgment or a
conclusion about a population based on the findings from a sample that is taken from
the population.

 
Learning Objectives

 At the end of this chapter, it is expected that the students will be able to:
1. Demonstrate an understanding of the different methods of obtaining data.
2. Explain the procedures in planning and conducting surveys and experiments.


 

Statistical Terms

Before proceeding to the discussion of the different methods of obtaining data, let us first define some statistical terms:

Population or Universe refers to the totality of objects, persons, places, or things used in a particular study; that is, all members of a particular group of objects (items) or people (individuals) who are the subjects or respondents of a study.
Sample is any subset of a population, or a few members of a population.
Data are facts, figures and information collected on some characteristics of a population
or sample. These can be classified as qualitative or quantitative data.
Ungrouped (or raw) data are data which are not organized in any specific way. They
are simply the collection of data as they are gathered.
Grouped Data are raw data organized into groups or categories with corresponding frequencies. Organized in this manner, the data are referred to as a frequency distribution.
Parameter is a descriptive measure of a characteristic of a population.
Statistic is a measure of a characteristic of a sample.
Constant is a characteristic or property of a population or sample which is common to
all members of the group.
Variable is a measure, characteristic, or property of a population or sample that may take a number of different values. It differentiates a particular member from the rest of the group. It is the characteristic or property that is measured, controlled, or manipulated in research. Variables differ in many respects, most notably in the role they are given in the research and in the type of measures that can be applied to them.

Methods of Data Collection


          Information you gather can come from a range of sources. Likewise, there are a
variety of techniques to use when gathering primary data. Listed below are some of the
most common data collection techniques.
 Direct or Interview Method
o One of the most effective methods of data collection
o Can give the complete information needed
o Can yield inaccurate information, since the interviewer can influence the respondent's answers
o Time-consuming and can be expensive for a large number of respondents
 Indirect or Questionnaire Method
o One of the easiest methods of data collection, though it takes time to prepare the questionnaire
o Can give confidential responses
o Obtained answers are free from any influence by the interviewer
o Cannot be accomplished by respondents who cannot read or write
o Has a high proportion of questionnaires that cannot be retrieved
o Tends to yield wrong or incomplete information
 Registration Method
o Information provided is in compliance with certain laws, policies, rules, regulations, or standard practices.
 Examples - marriage contracts, birth certificates, licenses, etc.
 
In an engineering environment,
 Retrospective study
o Uses the population or a sample of historical data which has been archived over a period of time
o May involve a large amount of data but with little information about the problem
o Some relevant data may be missing, may have recording errors, or the important data
may not have been gathered
 Observational study
o The engineer observes the process or population, disturbing it as little as possible, and
records the quantities of interest
o Usually conducted for a relatively short time period
o Sometimes, variables that are not routinely measured can be included
 Designed Experiment
o The engineer makes deliberate or purposeful changes in the controllable variables of
the system or process, observes the resulting system output data, and then makes an
inference or decision about which variables are responsible for the observed changes in
output performance
o Plays an important role in engineering design and development in the improvement of
manufacturing processes
o Plays a crucial role in reducing the lead time for engineering design and development
activities

Examples
Retrospective Study
            Montgomery, Peck, and Vining (2001) describe an acetone-butyl alcohol
distillation column for which concentration of acetone in the distillate or output product
stream is an important variable. Factors that may affect the distillate are the reboil
temperature, the condensate temperature, and the reflux rate. Production personnel
obtain and archive the following records:
 The concentration of acetone in an hourly test sample of output product
 The reboil temperature log, which is a plot of the reboil temperature over time
 The condenser temperature controller log
 The nominal reflux rate each hour
The reflux rate should be held constant for this process. Consequently, production
personnel change this very infrequently. A retrospective study would use either all or a
sample of the historical process data archived over some period of time. The study
objective might be to discover the relationships among the two temperatures and the
reflux rate on the acetone concentration in the output product stream. However, this
type of study presents some problems:        
1. We may not be able to see the relationship between the reflux rate and acetone
concentration, because the reflux rate didn’t change much over the historical period.
2. The archived data on the two temperatures (which are recorded almost
continuously) do not correspond perfectly to the acetone concentration measurements
(which are made hourly). It may not be obvious how to construct an approximate
correspondence.
3. Production maintains the two temperatures as closely as possible to desired
targets or set points. Because the temperatures change so little, it may be difficult to
assess their real impact on acetone concentration.
4. Within the narrow ranges that they do vary, the condensate temperature tends to
increase with the reboil temperature. Consequently, the effects of these two process
variables on acetone concentration may be difficult to separate.
            As you can see, a retrospective study may involve a lot of data, but that data may contain relatively little useful information about the problem. Furthermore, some of the relevant data may be missing, there may be transcription or recording errors resulting in outliers (or unusual values), or data on other important factors may not have been collected and archived. In the distillation column, for example, the specific concentrations of butyl alcohol and acetone in the input feed stream are a very important factor, but they are not archived because the concentrations are too hard to obtain on a routine basis. As a result of these types of issues, statistical analysis of historical data sometimes identifies interesting phenomena, but solid and reliable explanations of these phenomena are often difficult to obtain.
Observational Study
            In the distillation column, the engineer would design a form to record the two
temperatures and the reflux rate when acetone concentration measurements are made.
It may even be possible to measure the input feed stream concentrations so that the
impact of this factor could be studied. Generally, an observational study tends to solve
problems 1 and 2 above and goes a long way toward obtaining accurate and reliable
data. However, observational studies may not help resolve problems 3 and 4.
Designed Experiments
            In a designed experiment the engineer makes deliberate or purposeful changes
in the controllable variables of the system or process, observes the resulting system
output data, and then makes an inference or decision about which variables are
responsible for the observed changes in output performance. Designed experiments
play a very important role in engineering design and development and in the
improvement of manufacturing processes. Generally, when products and processes are
designed and developed with designed experiments, they enjoy better performance,
higher reliability, and lower overall costs. Designed experiments also play a crucial role
in reducing the lead time for engineering design and development activities.
Planning and Conducting Surveys
            A survey is a way to ask a lot of people a few well-constructed questions. The
survey is a series of unbiased questions that the subject must answer. Some
advantages of surveys are that they are efficient ways of collecting information from a
large number of people, they are relatively easy to administer, a wide variety of
information can be collected and they can be focused (researchers can stick to just the
questions that interest them.) Some disadvantages of surveys arise from the fact that
they depend on the subjects’ motivation, honesty, memory and ability to respond.
Moreover, answer choices to survey questions could lead to vague data. For example,
the choice “moderately agree” may mean different things to different people or to
whoever ends up interpreting the data
Conducting a Survey
            There are various methods of administering a survey. It can be done as a face-to-face interview or a phone interview where the researcher is questioning the subject. A different option is to have a self-administered survey where the subject can complete a survey on paper and mail it back, or complete the survey online. There are advantages and disadvantages to each of these methods.
            Face to face interview
                        The advantages of face-to-face interviews include fewer misunderstood
questions, fewer incomplete responses, higher response rates, and greater control      
over the environment in which the survey is administered; also, the researcher can
collect additional information if any of the respondents’ answers need clarifying.       
The disadvantages of face-to-face interviews are that they can be expensive and     
time-consuming and may require a large staff of trained interviewers. In addition, the
response can be biased by the appearance or attitude of the interviewer.
            Self-administered survey
                        The advantages of self-administered surveys are that they are
less expensive than interviews, do not require a large staff of experienced interviewers
and can be administered in large numbers. In addition, anonymity and
privacy encourage more candid and honest responses, and there is less pressure
on respondents. The disadvantages of self-administered surveys are that responders are more likely to stop participating mid-way through the survey, and the researcher cannot ask respondents to clarify their answers. In addition, there are lower response rates than in personal interviews, and often the respondents who bother to return surveys represent extremes of the population – those people who care about the issue strongly, whichever way their opinion leans.

Designing a Survey
Surveys can take different forms. They can be used to ask only one question or they can ask a
series of questions. We can use surveys to test out people’s opinions or to test a hypothesis.
When designing a survey, the following steps are useful:

1. Determine the goal of your survey: What question do you want to answer?
2. Identify the sample population: Whom will you interview?
3. Choose an interviewing method: face-to-face interview, phone interview, self-
administered paper survey, or internet survey.
4. Decide what questions you will ask in what order, and how to phrase them. (This is
important if there is more than one piece of information you are looking for.)
5. Conduct the interview and collect the information.
6. Analyze the results by making graphs and drawing conclusions.

Planning and Conducting Experiments: Introduction to Design of


Experiments
The products and processes in the engineering and scientific disciplines are mostly
derived from experimentation. An experiment is a series of tests conducted in a
systematic manner to increase the understanding of an existing process or to explore a
new product or process.
Design of Experiments, or DOE
 is a tool to develop an experimentation strategy that maximizes learning using minimum resources
 is widely and extensively used by engineers and scientists in improving existing processes
 is a technique needed to identify the "vital few" factors in the most efficient manner, and then directs the process to its best setting to meet the ever-increasing demand for improved quality and increased productivity
Stages of Design of Experiments (from www.weibull.com)

1. Planning
 At this stage, keep in mind these considerations:

o thorough and precise objective identifying the need to conduct the investigation
o assessment of time and resources available to achieve the objective
o integration of prior knowledge to the experimentation procedure
 identifies possible factors to investigate and the most appropriate response(s) to measure
o Factors. We usually talk about "treatment" factors, which are the factors of primary interest to you. In addition to treatment factors, there are nuisance factors which are not your primary focus, but you have to deal with them. Sometimes these are called blocking factors, mainly because we will try to block on these factors to prevent them from influencing the results.
 Experimental Factors - these are factors that you can specify (and set the levels of) and then assign at random as the treatment to the experimental units. Examples would be temperature, the level of an additive, or the fertilizer amount per acre.
 Classification Factors - these cannot be changed or assigned; they come as labels on the experimental units. The age and sex of the participants are classification factors which cannot be changed or randomly assigned, but you can select individuals from these groups randomly.
 Quantitative Factors - you can assign any specified level of a quantitative factor. Examples include the percent or pH level of a chemical.
 Qualitative Factors - these have categories which are of different types. Examples might be species of a plant or animal, or a brand in the marketing field; these are not ordered or continuous but are arranged perhaps in sets.
 Carefully planned experiments lead to increased understanding of the product or process and are easy to execute and analyse using available statistical software.
2. Screening
 used to identify the important factors that affect the process under
investigation out of the large pool of potential factors
 are carried out in conjunction with prior knowledge of the process to
eliminate unimportant factors and focus attention on the key factors that require further
detailed analyses
 are usually efficient designs requiring few executions, where the focus is
not on interactions but on identifying the vital few factors
3. Optimization
 determine the best setting of the important factors affecting the process to
achieve the desired objective
 this objective may be to either increase yield or decrease variability or to
find settings that achieve both at the same time
4. Robustness Testing
 make the product or process insensitive to variations
 these variations result from changes in factors that affect the process but
are beyond the control of the analyst
o noise or uncontrollable factors (e.g. humidity, ambient temperature,
variation in material, etc.)
 identify such sources of variation and take measures to
ensure that the product or process is made insensitive (or robust) to these factors
5. Verification
 validation of the best settings by conducting a few follow-up experimental
runs to confirm that the process functions as desired and all objectives are met
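The planning and screening stages above usually culminate in an experimental design. As a rough sketch (the factor names and levels below are hypothetical illustrations, not values from the text), a two-level full factorial design can be enumerated in Python:

```python
from itertools import product

# Hypothetical factors and two assumed levels each -- illustration only
factors = {
    "reboil_temp": [150, 160],
    "condensate_temp": [30, 40],
    "reflux_rate": [2.0, 2.5],
}

# A 2^3 full factorial design enumerates every combination of levels,
# so the effect of each factor can later be separated in the analysis.
names = list(factors)
runs = [dict(zip(names, levels)) for levels in product(*factors.values())]

for i, run in enumerate(runs, start=1):
    print(f"run {i}: {run}")
```

With three factors at two levels each, this produces 2 x 2 x 2 = 8 experimental runs; a screening design would typically use a fraction of these when the number of factors is large.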

CHAPTER II
Probability is simply how likely an event is to happen. "The chance of rain today is 50%" is a statement that expresses our belief about the possibility of rain. The likelihood of an outcome is measured by assigning a number from the interval [0, 1], or as a percentage from 0 to 100%. The higher the number, the more likely the event is to happen. A probability of zero (0) indicates that the outcome is impossible, while a probability of one (1) indicates that the outcome will inevitably occur.
 
Learning Objectives
At the end of this module, it is expected that the students will be able to:
1. Understand and describe sample spaces and events for random experiments
2. Explain the concept of probability and its application to different situations
3. Define and illustrate the different probability rules
4. Solve for the probability of different statistical data.

Probability
Probability is the likelihood or chance of an event occurring.

            For example, the probability of flipping a coin and it being heads is ½, because
there is 1 way of getting a head and the total number of possible outcomes is 2 (a head
or tail). We write P(heads) = ½ .
 
Properties of Probability

a. The probability of an event must be a number between 0 and 1 (inclusive): 0 ≤ P(E) ≤ 1.
b. The probability of the entire sample space is 1: P(S) = 1.
c. The probability of an impossible event is 0: P(Ø) = 0.
d. The probability of something not happening is 1 minus the probability that it will happen: P(E') = 1 − P(E).
 
Experiment – is used to describe any process that generates a set of data.
Event – consists of a set of possible outcomes of a probability experiment; it can be one outcome or more than one outcome.
Simple event – an event with one outcome.
Compound event – an event with more than one outcome.

A. Sample Space and Relationships Among Events


Sample space is the set of all possible outcomes or results of a random experiment. The sample space is represented by the letter S. Each outcome in the sample space is called an element of that set. An event is a subset of this sample space and is represented by the letter E.

This can be illustrated in a Venn Diagram. In Figure 2.1, the sample space is
represented by the rectangle and the events by the circles inside the rectangle. The
events A and B (in a to c) and A, B and C (in d and e) are all subsets of the sample
space S.
Figure 2.1 Venn diagrams of sample space with events (adapted from Montgomery et
al., 2003)
 
For example, if a die is rolled, we have {1, 2, 3, 4, 5, 6} as the sample space. The event can be {1, 3, 5}, which is the set of odd numbers. Similarly, when a coin is tossed twice, the sample space is {HH, HT, TH, TT}.
            Sample space and events play important roles in probability. Once we have
sample space and event, we can easily find the probability of that event. We have
following formula to find the probability of an event.
 
              The probability of an event E is defined as the number of outcomes
favourable to E divided by the total number of equally likely outcomes in the
sample space S of the experiment.
            That is,

P(E) = n(E) / n(S)

where
 n(E) is the number of outcomes favourable to E, and
 n(S) is the total number of equally likely outcomes in the sample space S of the experiment.
            Let us try to understand this with the help of an example. If a die is tossed, the sample space is {1, 2, 3, 4, 5, 6}. In this set, we have a number of elements equal to 6. Now, if the event is the set of odd numbers on a die, then we have {1, 3, 5} as the event. In this set, we have 3 elements. So, the probability of getting an odd number in a single throw of a die is given by

P(E) = n(E) / n(S) = 3/6 = 1/2
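The counting definition of probability translates directly into code. A minimal sketch using the die example (exact fractions avoid floating-point rounding):

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}                  # rolling one die
event = {x for x in sample_space if x % 2 == 1}    # odd numbers: {1, 3, 5}

# P(E) = n(E) / n(S): favourable outcomes over equally likely outcomes
p_odd = Fraction(len(event), len(sample_space))
print(p_odd)  # 1/2
```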

Difference Between Sample Space and Events


 
            As discussed in the beginning sample space is set of all possible outcomes of
an experiment and event is the subset of sample space. Let us try to understand this
with few examples. What happens when we toss a coin thrice? If a coin is tossed three
times we get following combinations,
HHH, HHT, HTH, THH, TTH, THT, HTT and TTT
            All these are the outcomes of the experiment of tossing a coin three times.
Hence, we can say the sample space is the set given by,
S = {HHH, HHT, HTH, THH, TTH, THT, HTT, TTT}
            Now, suppose the event be the set of outcomes in which there are only two
heads. The outcomes in which we have only two heads are HHT, HTH and THH hence
the event is given by,
E = {HHT, HTH, THH}
            We can clearly see that each element of set E is in set S, so E is a subset of S. There can be more than one event. In this case, we can have an event such as getting only one tail, or an event of getting only one head. If we have more than one event, we can represent these events by E1, E2, E3, etc. We can have more than one event for a sample space, but there will be one and only one sample space for an event. If we have events E1, E2, E3, ..., En as all the possible subsets of the sample space, then we have,
S = E1 ∪ E2 ∪ E3 ∪ ... ∪ En
We can understand this with the help of a simple example. Consider an experiment of
rolling a dice. We have sample space,
S = {1, 2, 3, 4, 5, 6}
            Now, if we have event E1 as getting an odd number as the outcome and E2 as getting an even number as the outcome for this experiment, then we can represent E1 and E2 as the following sets,
E1 = {1,3,5}
E2={2,4,6}
So we have
{1, 3, 5} ∪ {2, 4, 6} = {1,2,3,4,5,6}
Or S = E1 ∪ E2
Hence, we can say union of Events E1 and E2 is S.

Null space – is a subset of the sample space that contains no elements and is denoted
by the symbol Ø. It is also called empty space. 

Operations with Events


Intersection of events
            The intersection of two events A and B is denoted by the symbol A ∩B. It is the
event containing all elements that are common to A and B. This is illustrated as the
shaded region in Figure 2.1 (c).

For example,
            Let A = {3,6,9,12,15} and B = {1,3,5,8,12,15,17}; then A ∩ B = {3,12,15}
            Let X = {q, w, e, r, t,} and Y = {a, s, d, f}; then X ∩ Y = Ø, since X and Y have no
elements in common.
 

Mutually Exclusive Events


            We say that two events are mutually exclusive if they have no elements in common. This is illustrated in Figure 2.1 (b), where we can see that A ∩ B = Ø.

Union of Events
            The union of events A and B is the event containing all the elements that belong to A, to B, or to both, and is denoted by the symbol A ∪ B. The elements of A ∪ B may be listed or defined by the rule A ∪ B = { x | x ∈ A or x ∈ B}.
            For example,
            Let A = {a, e, i, o, u} and B = {b, c, d, e, f}; then A ∪B = {a, b, c, d, e, f, i, o, u}
            Let X = {1,2,3,4} and Y = {3,4,5,6}; then X ∪ Y = {1,2,3,4,5,6}

Complement of an Event
            The complement of an event A with respect to S is the set of all elements of S
that are not in A and is denoted by A’. The shaded region in Figure 2.1 (e) shows
(A ∩ C)’.
            For example,
            Consider the sample space S = {dog, cow, bird, snake, pig}
            Let A = {dog, bird, pig}; then A’ = {cow, snake}
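Intersection, union, and complement map directly onto Python's built-in set operations. The sketch below reuses the example sets from this section:

```python
# Intersection: elements common to A and B
A = {3, 6, 9, 12, 15}
B = {1, 3, 5, 8, 12, 15, 17}
print(sorted(A & B))       # [3, 12, 15]

# Mutually exclusive events have an empty intersection
X = {"q", "w", "e", "r", "t"}
Y = {"a", "s", "d", "f"}
print(X & Y == set())      # True

# Union: elements in U, in V, or in both
U = {"a", "e", "i", "o", "u"}
V = {"b", "c", "d", "e", "f"}
print(sorted(U | V))       # ['a', 'b', 'c', 'd', 'e', 'f', 'i', 'o', 'u']

# Complement: elements of the sample space S that are not in E
S = {"dog", "cow", "bird", "snake", "pig"}
E = {"dog", "bird", "pig"}
print(sorted(S - E))       # ['cow', 'snake']
```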

B. Counting Rules Useful in Probability


Multiplicative Rule
            Suppose you have j sets of elements, n1 in the first set, n2 in the second set, ...
and nj in the jth set. Suppose you wish to form a sample of j elements by taking one
element from each of the j sets. The number of possible sets is then defined by:
n1 x n2 x n3 x ... x nj 
 
Permutation Rule
            The arrangement of elements in a distinct order is called a permutation. Given a single set of n distinctly different elements, you wish to select k elements from the n and arrange them within k positions. The number of different permutations of the n elements taken k at a time is denoted nPk and is equal to:

nPk = n! / (n − k)!
Partitions rule
            Suppose a single set of n distinctly different elements exists. You wish to partition them into k sets, with the first set containing n1 elements, the second containing n2 elements, ..., and the kth set containing nk elements. The number of different partitions is:

n! / (n1! n2! ... nk!)

where
                                    n1 + n2 + ... + nk = n
The numerator gives the permutations of the n elements. The terms in the denominator remove the duplicates due to the same assignments within the k sets (multinomial coefficients).
 
Combinations Rule
            A sample of k elements is to be chosen from a set of n elements. The number of different samples of k elements that can be selected from n is equal to:

nCk = n! / (k! (n − k)!)
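All four counting rules are available in (or easily built from) Python's math module, which is a convenient way to check hand calculations:

```python
from math import comb, factorial, perm, prod

# Multiplicative rule: one element from each of j sets of sizes n1, ..., nj
print(prod([3, 4, 2]))   # 3 * 4 * 2 = 24 possible samples

# Permutation rule: arrange k of n distinct elements in order, n!/(n-k)!
print(perm(5, 3))        # 60

# Partitions rule: split n elements into groups of sizes n1, ..., nk
def partitions(*group_sizes):
    n = sum(group_sizes)
    return factorial(n) // prod(factorial(g) for g in group_sizes)

print(partitions(2, 3))  # 5!/(2! * 3!) = 10

# Combinations rule: choose k of n without regard to order, n!/(k!(n-k)!)
print(comb(5, 3))        # 10
```

Note that math.perm, math.comb, and math.prod require Python 3.8 or later.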

C. Rules of Probability
Before discussing the rules of probability, we state the following definitions:
 Two events are mutually exclusive or disjoint if they cannot occur at the same time.
 The probability that Event A occurs, given that Event B has occurred, is called
a conditional probability. The conditional probability of Event A, given Event B, is
denoted by the symbol P(A|B).
 The complement of an event is the event not occurring. The probability that Event A will
not occur is denoted by P(A').
 The probability that Events A and B both occur is the probability of the intersection of A
and B. The probability of the intersection of Events A and B is denoted by P(A ∩ B). If
Events A and B are mutually exclusive, P(A ∩ B) = 0.
 The probability that Events A or B occur is the probability of the union of A and B. The
probability of the union of Events A and B is denoted by P(A ∪ B) .
 If the occurrence of Event A changes the probability of Event B, then Events A and B
are dependent. On the other hand, if the occurrence of Event A does not change the
probability of Event B, then Events A and B are independent.
Rule of Addition
            Rule 1: If two events A and B are mutually exclusive, then:

P(A ∪ B) = P(A) + P(B)

            Rule 2: For any two outcomes A and B,

P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Example
            A student goes to the library. The probability that she checks out (a) a work of
fiction is 0.40, (b) a work of non-fiction is 0.30, and (c) both fiction and non-fiction is
0.20. What is the probability that the student checks out a work of fiction, non-fiction, or
both?
                                    Solution:
                        Let F = the event that the student checks out fiction;
                        let N = the event that the student checks out non-fiction.
Then, based on the rule of addition:

P(F ∪ N) = P(F) + P(N) − P(F ∩ N) = 0.40 + 0.30 − 0.20 = 0.50
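The library example can be checked numerically. A minimal sketch of the general addition rule:

```python
# General rule of addition: P(A or B) = P(A) + P(B) - P(A and B)
def p_union(p_a, p_b, p_both=0.0):
    # p_both = 0.0 corresponds to mutually exclusive events (Rule 1)
    return p_a + p_b - p_both

# Library example: fiction (0.40), non-fiction (0.30), both (0.20)
p = p_union(0.40, 0.30, 0.20)
print(round(p, 2))  # 0.5
```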
Rule of Multiplication
            Rule 1: When two events A and B are independent, then:

P(A ∩ B) = P(A) × P(B)

            Dependent - Two outcomes are said to be dependent if knowing that one of the outcomes has occurred affects the probability that the other occurs.
            Conditional Probability - the conditional probability of an event B in relationship to an event A is the probability that event B occurs after event A has already occurred. This probability is denoted by P(B|A).
            Rule 2: When two events are dependent, the probability of both occurring is:

P(A ∩ B) = P(A) × P(B|A)

                                    where

P(B|A) = P(A ∩ B) / P(A)
Example
            An urn contains 6 red marbles and 4 black marbles. Two marbles are drawn
without replacement from the urn. What is the probability that both of the marbles are
black?
                                    Solution:
                        Let A = the event that the first marble is black;
                        and let B = the event that the second marble is black.
We know the following:
 In the beginning, there are 10 marbles in the urn, 4 of which are black. Therefore, P(A)
= 4/10.
 After the first selection, there are 9 marbles in the urn, 3 of which are black. Therefore, P(B|A) = 3/9.
Therefore, P(A ∩ B) = P(A) × P(B|A) = (4/10) × (3/9) = 12/90 = 2/15.
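The urn example can be worked with exact fractions, following the dependent-events rule P(A and B) = P(A) · P(B|A):

```python
from fractions import Fraction

# 6 red and 4 black marbles; two draws without replacement
p_a = Fraction(4, 10)         # P(A): first marble is black
p_b_given_a = Fraction(3, 9)  # P(B|A): second is black, given the first was

p_both_black = p_a * p_b_given_a
print(p_both_black)  # 2/15
```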

Rule of Subtraction
            The probability that event A will occur is equal to 1 minus the probability that event A will not occur:

P(A) = 1 − P(A')

Example
            The probability of Bill not graduating from college is 0.80. What is the probability that Bill will graduate from college?
                                    Solution:

P(graduate) = 1 − P(not graduate) = 1 − 0.80 = 0.20
REFERENCES:
Applied Statistics and Probability for Engineers, 3rd Edition (Douglas C. Montgomery)
https://math.tutorvista.com/statistics/sample-space-and-events.html
https://stattrek.com/probability/probability-rules.aspx
https://www.ck12.org/book/CK-12-Probability-and-Statistics-Advanced-Second-Edition/section/3.6/

Probability examples

What is the probability of...

1. Getting an ace if I choose a card at random from a standard pack of 52 playing cards?
2. Getting a 5 if I roll a die?
3. Getting an even number if I roll a die?
4. Having one Tuesday in this week?
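The four exercises above can be answered by counting outcomes. The answers below are our worked solutions, not given in the source:

```python
from fractions import Fraction

# 1. An ace from a standard pack of 52 cards (4 aces)
print(Fraction(4, 52))   # 1/13

# 2. A 5 on one roll of a fair die
print(Fraction(1, 6))    # 1/6

# 3. An even number on one roll of a fair die
die = range(1, 7)
evens = [x for x in die if x % 2 == 0]
print(Fraction(len(evens), len(die)))  # 1/2

# 4. Having one Tuesday in this week: every week contains a Tuesday,
#    so this is a certain event
print(Fraction(1, 1))    # 1
```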
 
CHAPTER III

DISCRETE PROBABILITY DISTRIBUTION


Discrete Probability Distribution
A discrete distribution describes the probability of occurrence of each value of a discrete
random variable. A discrete random variable is a random variable that has countable values,
such as a list of non-negative integers.

With a discrete probability distribution, each possible value of the discrete random variable can
be associated with a non-zero probability. Thus, a discrete probability distribution is often
presented in tabular form.

3.1 Random Variables and Their Probability Distributions

Random Variables

In probability and statistics, a random variable is a variable whose value is subject to variations
due to chance (i.e. randomness, in a mathematical sense). As opposed to other mathematical
variables, a random variable conceptually does not have a single, fixed value (even if unknown);
rather, it can take on a set of possible different values, each with an associated probability.
A random variable’s possible values might represent the possible outcomes of a yet-to-be-
performed experiment, or the possible outcomes of a past experiment whose already-existing
value is uncertain (for example, as a result of incomplete information or imprecise
measurements). They may also conceptually represent either the results of an “objectively”
random process (such as rolling a die), or the “subjective” randomness that results from
incomplete knowledge of a quantity.
Random variables can be classified as either discrete (that is, taking any of a specified list of
exact values) or as continuous (taking any numerical value in an interval or collection of
intervals). The mathematical function describing the possible values of a random variable and
their associated probabilities is known as a probability distribution.

Discrete Random Variables

Discrete random variables can take on either a finite or at most a countably infinite set of
discrete values (for example, the integers). Their probability distribution is given by a probability
mass function which directly maps each value of the random variable to a probability: the
value x1 occurs with probability p1, the value x2 with probability p2, and so on. The
probabilities pi must satisfy two requirements: every probability pi is a number
between 0 and 1, and the sum of all the probabilities is 1. (p1+p2+⋯+pk=1)
 

Discrete Probability Distribution: This shows the probability mass function of a discrete
probability distribution. The probabilities of the singletons {1}, {3}, and {7} are respectively 0.2,
0.5, 0.3. A set not containing any of these points has probability zero.
Examples of discrete random variables include the values obtained from rolling a die and the
grades received on a test out of 100.

Probability Distributions for Discrete Random Variables


Probability distributions for discrete random variables can be displayed as a formula, in a table,
or in a graph.
A discrete random variable x has a countable number of possible values. The probability
distribution of a discrete random variable x lists the values and their probabilities, where
value x1 has probability p1, value x2 has probability p2, and so on. Every probability pi is a number
between 0 and 1, and the sum of all the probabilities is equal to 1.
Examples of discrete random variables include:
 The number of eggs that a hen lays in a given day (it can’t be 2.3)
 The number of people going to a given soccer match
 The number of students that come to class on a given day
 The number of people in line at McDonald’s on a given day and time
A discrete probability distribution can be described by a table, by a formula, or by a graph. For
example, suppose that x is a random variable that represents the number of people waiting in
line at a fast-food restaurant and that it only takes the values 2, 3, or 5 with
probabilities 2/10, 3/10, and 5/10 respectively. This can be expressed through the
function f(x)=x/10, x=2,3,5 or through the table below. Notice that these two
representations are equivalent, and that this can be represented graphically as in the probability
histogram below.

Probability Histogram: This histogram displays the probabilities of each of the three discrete
random variables

The formula, table, and probability histogram satisfy the following necessary conditions of
discrete probability distributions:

1. 0 ≤ f(x) ≤ 1, i.e., the values of f(x) are probabilities, hence between 0 and 1.
2. ∑f(x) = 1, i.e., adding the probabilities of all disjoint cases, we obtain the probability of the
sample space, 1.

Sometimes, the discrete probability distribution is referred to as the probability mass
function (pmf). The probability mass function has the same purpose as the probability
histogram, and displays specific probabilities for each discrete random variable. The only
difference is how it looks graphically.

Probability Mass Function: This shows the graph of a probability mass function. All the
values of this function must be non-negative and sum up to 1.

x f(x)
2 0.2
3 0.3
5 0.5

Discrete Probability Distribution: This table shows the values the discrete random
variable can take on and their corresponding probabilities.
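The two conditions above can be checked directly for the f(x) = x/10 example; a minimal sketch in plain Python (the function and variable names are illustrative, not from the references):

```python
# Sketch: verify that f(x) = x/10 on {2, 3, 5} satisfies the two
# conditions of a discrete probability distribution.

def f(x):
    """Probability mass function for the waiting-line example."""
    return x / 10

support = [2, 3, 5]
probs = [f(x) for x in support]

# Condition 1: every probability lies between 0 and 1.
assert all(0 <= p <= 1 for p in probs)

# Condition 2: the probabilities sum to 1.
assert abs(sum(probs) - 1.0) < 1e-12

print(dict(zip(support, probs)))  # {2: 0.2, 3: 0.3, 5: 0.5}
```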

3.2 Cumulative Distribution Functions


You might recall that the cumulative distribution function is defined for discrete random variables
as:

F(x) = P(X ≤ x) = ∑t≤x f(t)

Again, F(x) accumulates all of the probability less than or equal to x. The cumulative distribution
function for continuous random variables is just a straightforward extension of that of the
discrete case. All we need to do is replace the summation with an integral.
The cumulative distribution function ("c.d.f.") of a continuous random variable X is defined
as:

F(x) = P(X ≤ x) = ∫−∞x f(t) dt

for −∞ < x < ∞.
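For the discrete case, the c.d.f. can be sketched by accumulating the pmf over all values at or below x. A minimal Python sketch, reusing the waiting-line pmf f(x) = x/10 on {2, 3, 5} for illustration:

```python
# Sketch: F(x) = P(X <= x) for a discrete random variable, computed by
# summing the pmf over all support points at or below x.

pmf = {2: 0.2, 3: 0.3, 5: 0.5}

def cdf(x):
    """Accumulate all probability mass at or below x."""
    return sum(p for value, p in pmf.items() if value <= x)

print(cdf(1))    # 0 (no mass at or below 1)
print(cdf(3))    # 0.5 (mass at 2 and 3)
print(cdf(4.9))  # 0.5 (F is flat between support points)
print(cdf(5))    # 1.0 (all mass accumulated)
```

Note that F is a step function: it jumps at each support point and is constant in between.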
3.3 Expected Values of Random Variables
The expected value of a random variable is the weighted average of all possible values that this
random variable can take on.

Discrete Random Variable

A discrete random variable X has a countable number of possible values. The probability


distribution of a discrete random variable X lists the values and their probabilities, such
that xi has a probability of pi. The probabilities pi must satisfy two requirements:
1. Every probability pi is a number between 0 and 1.
2. The sum of the probabilities is 1: p1+p2+⋯+pi = 1.

Expected Value Definition

In probability theory, the expected value (or expectation, mathematical expectation, EV, mean,
or first moment) of a random variable is the weighted average of all possible values that this
random variable can take on. The weights used in computing this average are probabilities in
the case of a discrete random variable.
The expected value may be intuitively understood by the law of large numbers: the expected
value, when it exists, is almost surely the limit of the sample mean as sample size grows to
infinity. More informally, it can be interpreted as the long-run average of the results of many
independent repetitions of an experiment (e.g. a dice roll). The value may not be expected in the
ordinary sense—the “expected value” itself may be unlikely or even impossible (such as having
2.5 children), as is also the case with the sample mean.

How To Calculate Expected Value

Suppose random variable X can take value x1 with probability p1, value x2 with probability p2, and
so on, up to value xi with probability pi. Then the expectation of a random variable X is
defined as E[X]=x1p1+ x2p2+⋯+xipi, which can also be written as: E[X]=∑xipi.
If all outcomes xi are equally likely (that is, p1 = p2 =⋯= pi), then the weighted average turns into
the simple average. This is intuitive: the expected value of a random variable is the average of
all values it can take; thus, the expected value is what one expects to happen on average. If the
outcomes xi are not equally probable, then the simple average must be replaced with the
weighted average, which takes into account the fact that some outcomes are more likely than
the others. The intuition, however, remains the same: the expected value of X is what one
expects to happen on average.
For example, let X represent the outcome of a roll of a six-sided die. The possible values
for X are 1, 2, 3, 4, 5, and 6, all equally likely (each having the probability of 1/6). The
expectation of X is: 
E[X] = (1 × 1/6) + (2 × 1/6) + (3 × 1/6) + (4 × 1/6) + (5 × 1/6) + (6 × 1/6) = 3.5.
In this case, since all outcomes are equally likely, we could have simply averaged the numbers
together:
(1 + 2 + 3 + 4 + 5 + 6)/6 = 3.5.

Average Dice Value Against Number of Rolls: An illustration of the convergence of sequence
averages of rolls of a die to the expected value of 3.5 as the number of rolls (trials) grows.
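The convergence illustrated above can be checked numerically; a minimal simulation sketch in Python (the seed value is arbitrary, chosen only so the run is repeatable):

```python
import random

# Sketch: the expectation of a fair die, computed exactly from the
# definition and estimated as the long-run average of simulated rolls.

values = [1, 2, 3, 4, 5, 6]

# Exact expected value: each face has probability 1/6.
exact = sum(x * (1 / 6) for x in values)
print(exact)  # ≈ 3.5

# Long-run average of many independent simulated rolls.
random.seed(42)
rolls = [random.choice(values) for _ in range(100_000)]
print(sum(rolls) / len(rolls))  # close to 3.5
```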

3.4 The Binomial Distribution

Binomial Experiment
A binomial experiment is a statistical experiment that has the following properties:
 The experiment consists of n repeated trials.
 Each trial can result in just two possible outcomes. We call one of these outcomes a success
and the other, a failure.
 The probability of success, denoted by P, is the same on every trial.
 The trials are independent; that is, the outcome on one trial does not affect the outcome on
other trials.
Consider the following statistical experiment. You flip a coin 2 times and count the number of
times the coin lands on heads. This is a binomial experiment because:
 The experiment consists of repeated trials. We flip a coin 2 times.
 Each trial can result in just two possible outcomes - heads or tails.
 The probability of success is constant - 0.5 on every trial.
 The trials are independent; that is, getting heads on one trial does not affect whether we get
heads on other trials.
The following notation is helpful, when we talk about binomial probability.
 x: The number of successes that result from the binomial experiment.
 n: The number of trials in the binomial experiment.
 P: The probability of success on an individual trial.
 Q: The probability of failure on an individual trial. (This is equal to 1 - P.)
 n!: The factorial of n (also known as n factorial).
 b (x; n, P): Binomial probability - the probability that an n-trial binomial experiment results
in exactly x successes, when the probability of success on an individual trial is P.
 nCr: The number of combinations of n things, taken r at a time.

Binomial Distribution
A binomial random variable is the number of successes x in n repeated trials of a binomial
experiment. The probability distribution of a binomial random variable is called a binomial
distribution.
Suppose we flip a coin two times and count the number of heads (successes). The binomial
random variable is the number of heads, which can take on values of 0, 1, or 2. The binomial
distribution is presented below.

Number of
Probability
Heads
0 0.25
1 0.50
2 0.25
 

The binomial distribution has the following properties:


 The mean of the distribution (μx) is equal to n * P.
 The variance (σ²x) is n * P * (1 - P).
 The standard deviation (σx) is sqrt[ n * P * (1 - P) ].

Binomial Formula and Binomial Probability


The binomial probability refers to the probability that a binomial experiment results
in exactly x successes. For example, in the above table, we see that the binomial probability of
getting exactly one head in two-coin flips is 0.50.
Given x, n, and P, we can compute the binomial probability based on the binomial formula.
Binomial Formula. Suppose a binomial experiment consists of n trials and results
in x successes. If the probability of success on an individual trial is P, then the binomial
probability is:
b (x; n, P) = nCx * P^x * (1 - P)^(n - x)
or 
b (x; n, P) = {n! / [ x! (n - x)! ]} * P^x * (1 - P)^(n - x)
Example 1

Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
Solution: This is a binomial experiment in which the number of trials is equal to 5, the number of
successes is equal to 2, and the probability of success on a single trial is 1/6 or about 0.167.
Therefore, the binomial probability is:
b (2; 5, 0.167) = 5C2 * (0.167)^2 * (0.833)^3 
b (2; 5, 0.167) = 0.161
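The calculation in Example 1 can be reproduced directly from the binomial formula; a minimal sketch using Python's math.comb (the helper name binomial_prob is illustrative):

```python
from math import comb

# Sketch of the binomial formula b(x; n, P) = nCx * P^x * (1 - P)^(n - x).

def binomial_prob(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Example 1: probability of exactly 2 fours in 5 rolls of a die (P = 1/6).
print(round(binomial_prob(2, 5, 1/6), 3))  # 0.161
```

Using the exact 1/6 rather than the rounded 0.167 gives the same answer to three decimal places.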

Cumulative Binomial Probability


A cumulative binomial probability refers to the probability that the binomial random variable
falls within a specified range (e.g., is greater than or equal to a stated lower limit and less than
or equal to a stated upper limit).
For example, we might be interested in the cumulative binomial probability of obtaining 45 or
fewer heads in 100 tosses of a coin. This would be the sum of all these individual binomial
probabilities.
b (x ≤ 45; 100, 0.5) = 
b (x = 0; 100, 0.5) + b (x = 1; 100, 0.5) + ... + b (x = 44; 100, 0.5) + b (x = 45; 100, 0.5)
Example 2

What is the probability of obtaining 45 or fewer heads in 100 tosses of a coin?


Solution: To solve this problem, we compute 46 individual probabilities, using the binomial
formula. The sum of all these probabilities is the answer we seek. Thus,
b (x ≤ 45; 100, 0.5) = b (x = 0; 100, 0.5) + b (x = 1; 100, 0.5) + . . . + b (x = 45; 100, 0.5) 
b (x ≤ 45; 100, 0.5) = 0.184

Example 3

The probability that a student is accepted to a prestigious college is 0.3. If 5 students from the
same school apply, what is the probability that at most 2 are accepted?
Solution: To solve this problem, we compute 3 individual probabilities, using the binomial
formula. The sum of all these probabilities is the answer we seek. Thus,
b(x ≤ 2; 5, 0.3) = b(x = 0; 5, 0.3) + b(x = 1; 5, 0.3) + b(x = 2; 5, 0.3)
b(x ≤ 2; 5, 0.3) = 0.1681 + 0.3601 + 0.3087 
b(x ≤ 2; 5, 0.3) = 0.8369
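The cumulative sum in Example 3 can be sketched with Python's math.comb (helper names are illustrative):

```python
from math import comb

# Sketch: a cumulative binomial probability as a sum of individual terms.

def binomial_prob(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def binomial_cdf(x, n, p):
    """P(X <= x): sum the binomial probabilities from 0 up to x."""
    return sum(binomial_prob(k, n, p) for k in range(x + 1))

# Example 3: at most 2 of 5 applicants accepted when P = 0.3.
print(round(binomial_cdf(2, 5, 0.3), 4))  # 0.8369
```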

3.5 The Poisson Distribution


A Poisson distribution is the probability distribution that results from a Poisson experiment.

Attributes of a Poisson Experiment


A Poisson experiment is a statistical experiment that has the following properties:
 The experiment results in outcomes that can be classified as successes or failures.
 The average number of successes (μ) that occurs in a specified region is known.
 The probability that a success will occur is proportional to the size of the region.
 The probability that a success will occur in an extremely small region is virtually zero.
Note that the specified region could take many forms. For instance, it could be a length, an
area, a volume, a period of time, etc.

Notation
The following notation is helpful, when we talk about the Poisson distribution.
 e: A constant equal to approximately 2.71828. (Actually, e is the base of the natural logarithm
system.)
 μ: The mean number of successes that occur in a specified region.
 x: The actual number of successes that occur in a specified region.
 P (x; μ): The Poisson probability that exactly x successes occur in a Poisson experiment, when
the mean number of successes is μ.

Poisson Distribution
A Poisson random variable is the number of successes that result from a Poisson experiment.
The probability distribution of a Poisson random variable is called a Poisson distribution.
Given the mean number of successes (μ) that occur in a specified region, we can compute the
Poisson probability based on the following Poisson formula.
Poisson Formula. Suppose we conduct a Poisson experiment, in which the average number of
successes within a given region is μ. Then, the Poisson probability is:
P (x; μ) = (e^−μ) (μ^x) / x!
where x is the actual number of successes that result from the experiment, and e is
approximately equal to 2.71828.
The Poisson distribution has the following properties:
 The mean of the distribution is equal to μ.
 The variance is also equal to μ .
 
Poisson Distribution Example

The average number of homes sold by the Acme Realty company is 2 homes per day. What is
the probability that exactly 3 homes will be sold tomorrow?
Solution: This is a Poisson experiment in which we know the following:
 μ = 2; since 2 homes are sold per day, on average.
 x = 3; since we want to find the likelihood that 3 homes will be sold tomorrow.
 e = 2.71828; since e is a constant equal to approximately 2.71828.
We plug these values into the Poisson formula as follows:
P (x; μ) = (e^−μ) (μ^x) / x!
P (3; 2) = (2.71828^−2) (2^3) / 3!
P (3; 2) = (0.13534) (8) / 6
P (3; 2) = 0.180
Thus, the probability of selling 3 homes tomorrow is 0.180.
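The same calculation can be done with a minimal sketch of the Poisson formula in Python (the helper name is illustrative):

```python
from math import exp, factorial

# Sketch of the Poisson formula P(x; mu) = e^(-mu) * mu^x / x!.

def poisson_prob(x, mu):
    """Probability of exactly x successes when the mean count is mu."""
    return exp(-mu) * mu**x / factorial(x)

# Acme Realty example: mean of 2 homes sold per day, exactly 3 sold tomorrow.
print(round(poisson_prob(3, 2), 3))  # 0.18
```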

Cumulative Poisson Probability


A cumulative Poisson probability refers to the probability that the Poisson random variable is
greater than some specified lower limit and less than some specified upper limit.
Cumulative Poisson Example

Suppose the average number of lions seen on a 1-day safari is 5. What is the probability that
tourists will see fewer than four lions on the next 1-day safari?
Solution: This is a Poisson experiment in which we know the following:
 μ = 5; since 5 lions are seen per safari, on average.
 x = 0, 1, 2, or 3; since we want to find the likelihood that tourists will see fewer than 4 lions; that
is, we want the probability that they will see 0, 1, 2, or 3 lions.
 e = 2.71828; since e is a constant equal to approximately 2.71828.
To solve this problem, we need to find the probability that tourists will see 0, 1, 2, or 3 lions.
Thus, we need to calculate the sum of four probabilities: P (0; 5) + P (1; 5) + P (2; 5) + P (3; 5).
To compute this sum, we use the Poisson formula:
P (x ≤ 3; 5) = P (0; 5) + P (1; 5) + P (2; 5) + P (3; 5)
P (x ≤ 3; 5) = [ (e^−5) (5^0) / 0! ] + [ (e^−5) (5^1) / 1! ] + [ (e^−5) (5^2) / 2! ] + [ (e^−5) (5^3) / 3! ]
P (x ≤ 3; 5) = [ (0.006738) (1) / 1 ] + [ (0.006738) (5) / 1 ] + [ (0.006738) (25) / 2 ] + [ (0.006738) (125) / 6 ]
P (x ≤ 3; 5) = [ 0.0067 ] + [ 0.03369 ] + [ 0.084224 ] + [ 0.140375 ]
P (x ≤ 3; 5) = 0.2650
Thus, the probability of seeing no more than 3 lions is 0.2650.
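The four-term sum above can be sketched in Python (helper names are illustrative):

```python
from math import exp, factorial

# Sketch: a cumulative Poisson probability P(X <= x) as a sum of terms.

def poisson_prob(x, mu):
    """Probability of exactly x successes when the mean count is mu."""
    return exp(-mu) * mu**x / factorial(x)

def poisson_cdf(x, mu):
    """Sum the Poisson probabilities from 0 up to x."""
    return sum(poisson_prob(k, mu) for k in range(x + 1))

# Safari example: mean of 5 lions per safari, probability of seeing 3 or fewer.
print(round(poisson_cdf(3, 5), 4))  # 0.265
```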
CHAPTER IV

4.1 Continuous Random Variables and their Probability Distribution


A continuous random variable has a probability of zero of assuming exactly any
of its values. Consequently, its probability distribution cannot be given in tabular form. At
first this may seem startling, but it becomes more plausible when we consider a
particular example. Let us discuss a random variable whose values are the heights of all
people over 21 years of age. Between any two values, say 163.5 and 164.5
centimeters, or even 163.99 and 164.01 centimeters, there are an infinite number of
heights, one of which is 164 centimeters. The probability of selecting a person at
random who is exactly 164 centimeters tall and not one of the infinitely large set of
heights so close to 164 centimeters that you cannot humanly measure the difference is
remote, and thus we assign a probability of zero to the event. This is not the case,
however, if we talk about the probability of selecting a person who is at least 163
centimeters but not more than 165 centimeters tall. Now we are dealing with an interval
rather than a point value of our random variable.
 
We shall concern ourselves with computing probabilities for various intervals of
continuous random variables such as P(a < X < b), P(W > c), and so forth. Note that
when X is continuous,
 
P(a < X ≤ b) = P(a < X < b) + P(X = b) = P(a < X < b).
 
That is, it does not matter whether we include an endpoint of the interval or not.
This is not true, though, when X is discrete. Although the probability distribution of a
continuous random variable cannot be presented in tabular form, it can be stated as a
formula. Such a formula would necessarily be a function of the numerical values of the
continuous random variable X and as such will be represented by the functional notation
f(x). In dealing with continuous variables, f(x) is usually called the probability density
function, or

Figure
4.1 Typical Density Functions
 
simply the density function of X. Since X is defined over a continuous sample
space, it is possible for f(x) to have a finite number of discontinuities. However, most
density functions that have practical applications in the analysis of statistical data are
continuous and their graphs may take any of several forms, some of which are shown in
Figure 4.1. Because areas will be used to represent probabilities and probabilities are
positive numerical values, the density function must lie entirely above the x axis. A
probability density function is constructed so that the area under its curve bounded by
the x axis is equal to 1 when computed over the range of X for which f(x) is defined.
Should this range of X be a finite interval, it is always possible to extend the interval to
include the entire set of real numbers by defining f(x) to be zero at all points in the
extended portions of the interval. In Figure 4.2, the probability that X assumes a value
between a and b is equal to the shaded area under the density function between the
ordinates at x = a and x = b, and from integral calculus is given by

P(a < X < b) = ∫ab f(x) dx
 

Figure 4.2 P(a < X < b)


 
Example 4.1
 
            For the density function shown, find F(x), and use it to
evaluate P(0 < X ≤ 1).
 
SOLUTION: for –1 < x < 2,
 

                        

 
 
Therefore,
 

 
The cumulative distribution function F(x) is expressed graphically in Figure 4.3.
Now,
4.2 Expected Values of Continuous Random Variables
Let X be a continuous random variable with range [a, b] and probability density function
f(x). The expected value of X is defined by

 
Let’s see how this compares with the formula for a discrete random variable:

The discrete formula says to take a weighted sum of the values xi of X, where the
weights are the probabilities p(xi). Recall that f(x) is a probability density. Its units are
prob/(unit of X).
 
So f(x) dx represents the probability that X is in an infinitesimal range of width dx around
x. Thus we can interpret the formula for E(X) as a weighted integral of the values x of X,
where the weights are the probabilities f(x) dx.
 
As before, the expected value is also called the mean or average.
 
Example 4.2
 
Let X ∼ uniform(0, 1). Find E(X).
 
SOLUTION:
 
Since X has a range of [0, 1] and a density of f(x) = 1:
 

 
Not surprisingly, the mean is at the midpoint of the range.
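The integral in Example 4.2 can also be checked numerically; a rough midpoint-rule sketch in plain Python (the function names are illustrative, not from the references):

```python
# Sketch: a numerical check of E(X) = integral of x * f(x) dx using a
# midpoint rule, applied to X ~ uniform(0, 1) where f(x) = 1.

def expected_value(f, a, b, n=100_000):
    """Approximate the integral of x * f(x) over [a, b] with midpoints."""
    dx = (b - a) / n
    midpoints = (a + (i + 0.5) * dx for i in range(n))
    return sum(x * f(x) * dx for x in midpoints)

def uniform_density(x):
    """Density of the uniform(0, 1) distribution."""
    return 1.0

print(round(expected_value(uniform_density, 0, 1), 4))  # 0.5
```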
 
Example 4.3
 
Let X have range [0, 2] and density . Find E(X).

Does it make sense that this X has a mean in the right half of its range?
 
Yes. Since the probability density increases as x increases over the range, the average
value of x should be in the right half of the range.
 
µ is “pulled” to the right of the midpoint 1 because there is more mass to the right.
 
 
Properties of E(X)
 
The properties of E(X) for continuous random variables are the same as for discrete
ones:
 

1. If X and Y are random variables on a sample space Ω then


E(X + Y) = E(X) + E(Y)
 

2. If a and b are constants then


 E(aX + b) = aE(X) + b
 
 
Expectation of Functions of X
 
This works exactly the same as the discrete case. If h(x) is a function, then Y = h(X) is a
random variable and
 

Example 4.4
 
Let X ∼ exp(λ). Find E(X²).
 

4.3 Normal Distribution


The Normal Distribution is the most important and most widely used continuous
probability distribution. It is the cornerstone of the application of statistical inference in
analysis of data because the distributions of several important sample statistics tend
towards a Normal distribution as the sample size increases.
 
Empirical studies have indicated that the Normal distribution provides an adequate
approximation to the distributions of many physical variables. Specific examples include
meteorological data, such as temperature and rainfall, measurements on living
organisms, scores on aptitude tests, physical measurements of manufactured parts,
weights of contents of food packages, volumes of liquids in bottles/cans,
instrumentation errors and other deviations from established norms, and so on.
 
The graphical appearance of the Normal distribution is a symmetrical bell-shaped curve
that extends without bound in both positive and negative directions.
 
The probability density function is given by

 
where μ and σ are parameters. These turn out to be the mean and standard deviation,
respectively, of the distribution. As a shorthand notation, we write X ~ N(μ,σ2).
 
The curve never actually reaches the horizontal axis but gets close to it beyond about
3 standard deviations on each side of the mean.
 
For any Normally distributed variable:
 
            68.3% of all values will lie between μ −σ and μ + σ (i.e. μ ± σ )
            95.45% of all values will lie within μ ± 2 σ
            99.73% of all values will lie within μ ± 3 σ
 
The graphs below illustrate the effect of changing the values of μ and σ on the shape of
the probability density function. Low variability (σ = 0.71) with respect to the mean gives
a pointed bell-shaped curve with little spread. Variability of σ = 1.41 produces a flatter
bell-shaped curve with a greater spread.
 
Example 4.5
 
The volume of water in commercially supplied fresh drinking water containers is
approximately Normally distributed with mean 70 litres and standard deviation 0.75
litres. Estimate the proportion of containers likely to contain
 
(i) in excess of 70.9 litres, (ii) at most 68.2 litres, (iii) less than 70.5 litres.
 
SOLUTION
 
Let X denote the volume of water in a container, in litres. Then X ~ N(70, 0.75²), i.e. μ =
70, σ = 0.75 and Z = (X − 70)/0.75
 
 (i) X = 70.9 ; Z = (70.9 − 70)/0.75 = 1.20
P(X > 70.9) = P(Z > 1.20) = 0.1151 or 11.51%
 
 (ii) X = 68.2 ; Z = (68.2 − 70)/0.75 = −2.40
P(X < 68.2) = P(Z < −2.40) = 0.0082 or 0.82%
 
 (iii) X = 70.5 ; Z = (70.5 − 70)/0.75 = 0.67
P(X > 70.5) = 0.2514 ; P(X < 70.5) = 0.7486 or 74.86%
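Example 4.5 can be reproduced with the Python standard library's statistics.NormalDist; the last value differs slightly from the table-based answer because the text rounds z to 0.67:

```python
from statistics import NormalDist

# Sketch of Example 4.5 using the standard library's normal distribution.
X = NormalDist(mu=70, sigma=0.75)

# (i) proportion in excess of 70.9 litres
print(round(1 - X.cdf(70.9), 4))  # 0.1151

# (ii) proportion at most 68.2 litres
print(round(X.cdf(68.2), 4))      # 0.0082

# (iii) proportion less than 70.5 litres
print(round(X.cdf(70.5), 4))      # ≈ 0.7475 (text's z-table value: 0.7486)
```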

4.4 Normal Approximation to Binomial and Poisson Distribution


Binomial Approximation
The normal distribution can be used as an approximation to the binomial distribution,
under certain circumstances, namely:
 
If X ~ B(n, p) and if n is large and/or p is close to ½, then X is approximately N(np, npq)
 
(where q = 1 - p)
 
In some cases, working out a problem using the Normal distribution may be easier than
using a Binomial.
 
Poisson Approximation
 
The normal distribution can also be used to approximate the Poisson distribution for
large values of λ (the mean of the Poisson distribution).
 
If X ~ Po(λ) then for large values of λ, X ~ N(λ, λ) approximately.
 
Continuity Correction
 
The binomial and Poisson distributions are discrete random variables, whereas the
normal distribution is continuous. We need to take this into account when we are using
the normal distribution to approximate a binomial or Poisson using
a continuity correction.
 
In the discrete distribution, each probability is represented by a rectangle (right hand
diagram):

When working out probabilities, we want to include whole rectangles, which is what
continuity correction is all about.
 
Example 4.6
 
Suppose we toss a fair coin 20 times. What is the probability of getting between 9 and
11 heads?
 
SOLUTION
 
Let X be the random variable representing the number of heads thrown.
X ~ Bin(20, ½)
 
Since p is close to ½ (it equals ½!), we can use the normal approximation to the
binomial. X ~ N(20 × ½, 20 × ½ × ½) so X ~ N(10, 5) .
 
In this diagram, the rectangles represent the binomial distribution and the curve is the
normal distribution:
 

We want P(9 ≤ X ≤ 11), which is the red shaded area. Notice that the first rectangle
starts at 8.5 and the last rectangle ends at 11.5 . Using a continuity correction,
therefore, our probability becomes P(8.5 < X < 11.5) in the normal distribution.
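The comparison in Example 4.6 can be sketched in Python, computing both the exact binomial sum and the continuity-corrected normal approximation (variable names are illustrative):

```python
from math import comb, sqrt
from statistics import NormalDist

# Sketch of Example 4.6: exact binomial probability P(9 <= X <= 11)
# versus the normal approximation N(np, npq) with continuity correction.
n, p = 20, 0.5

# Exact binomial: sum b(x; 20, 0.5) over x = 9, 10, 11.
exact = sum(comb(n, x) * p**x * (1 - p)**(n - x) for x in (9, 10, 11))

# Continuity correction widens the interval to (8.5, 11.5).
approx_dist = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
approx = approx_dist.cdf(11.5) - approx_dist.cdf(8.5)

print(round(exact, 4), round(approx, 4))  # the two values agree closely
```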

4.5 Exponential Distribution


The exponential distribution obtains its name from the exponential function in the
probability density function. Plots of the exponential distribution for selected values of λ
are shown in Fig. 4.4. For any value of λ, the exponential distribution is quite skewed.
Figure 4.4 Probability density function of exponential random variables for selected
values of λ
 
If the random variable X has an exponential distribution with parameter λ, its mean and
variance are
 
μ = E(X) = 1/λ  and  σ² = V(X) = 1/λ²
 
It is important to use consistent units in the calculation of probabilities, means, and
variances involving exponential random variables. The following example illustrates unit
conversions.
 
Example 4.7
In a large corporate computer network, user log-ons to the system can be modeled as a
Poisson process with a mean of 25 log-ons per hour. What is the probability that there
are no logons in an interval of 6 minutes?
SOLUTION
Let X denote the time in hours from the start of the interval until the first log-on. Then, X
has an exponential distribution with λ = 25 log-ons per hour. We are interested in the
probability that X exceeds 6 minutes. Because λ is given in log-ons per hour, we express
all time units in hours. That is, 6 minutes = 0.1 hour. The probability requested is shown
as the shaded area under the probability density function in Fig. 4.5. Therefore,
 
P(X > 0.1) = e^(−25 × 0.1) = e^(−2.5) = 0.082
Figure 4.5 Probability for the exponential distribution


 
In the previous example, the probability that there are no log-ons in a 6-minute interval
is 0.082 regardless of the starting time of the interval. A Poisson process assumes that
events occur uniformly throughout the interval of observation; that is, there is no
clustering of events. If the log-ons are well modeled by a Poisson process, the
probability that the first log-on after noon occurs after 12:06 P.M. is the same as the
probability that the first log-on after 3:00 P.M. occurs after 3:06 P.M. And if someone
logs on at 2:22 P.M., the probability the next log-on occurs after 2:28 P.M. is still 0.082.
Our starting point for observing the system does not matter. However, if there are high-
use periods during the day, such as right after 8:00 A.M., followed by a period of low
use, a Poisson process is not an appropriate model for log-ons and the distribution is
not appropriate for computing probabilities. It might be reasonable to model each of the
high- and low-use periods by a separate Poisson process, employing a larger value for λ
during the high-use periods and a smaller value otherwise. Then, an exponential
distribution with the corresponding value of λ can be used to calculate log-on probabilities
for the high- and low-use periods.
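The 0.082 figure from Example 4.7 follows from the survival function of the exponential distribution; a minimal Python sketch (the helper name is illustrative):

```python
from math import exp

# Sketch of Example 4.7: for an exponential random variable,
# P(X > x) = exp(-lambda * x). Here lambda = 25 log-ons per hour
# and x = 0.1 hour (6 minutes).

def exponential_survival(lam, x):
    """P(X > x) when X ~ exponential(lam)."""
    return exp(-lam * x)

print(round(exponential_survival(25, 0.1), 3))  # 0.082
```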

 
REFERENCES
Walpole, R., Myers, R., Myers, S., & Ye, K. (2007) Probability & Statistics for Engineers
& Scientists, 8th Ed. [Available online] Retrieved from
https://www.csie.ntu.edu.tw/~sdlin/download/Probability%20&%20Statistics.pdf
Orloff, J. & Bloom, J. (2014) Introduction to Probability and Statistics [Available online]
Retrieved from https://ocw.mit.edu/courses/mathematics/18-05-introduction-to-probability-and-statistics-spring-2014/readings/MIT18_05S14_Reading6a.pdf
Montgomery, D. & Runger, G. (2003) Applied Statistics and Probability for Engineers, 3rd
Ed. [Available online] Retrieved from http://www.um.edu.ar/math/montgomery.pdf

5.1A Two Random Variables


Probabilities may be either marginal, joint or conditional. Understanding their
differences and how to manipulate among them is key to success in understanding the
foundation of statistics.
 
A. TWO RANDOM VARIABLES
In real life, we are often interested in several random variables that are related to
each other. For example, suppose that we choose a random family, and we would like
to study the number of people in the family, the household income, the ages of the
family members, etc. Each of these is a random variable, and we suspect that they are
dependent. In this chapter, we develop tools to study joint distributions of random
variables. The concepts are similar to what we have seen so far. The only difference is
that instead of one random variable, we consider two or more. In this chapter, we will
focus on two random variables, but once you understand the theory for two random
variables, the extension to n random variables is straightforward. We will first discuss
joint distributions of discrete random variables and then extend the results to continuous
random variables.
Joint Probability Distribution
Given random variables X, Y, … that are defined on a probability space, the joint
probability distribution for X, Y, … is a probability distribution that gives the probability that each
of X, Y, … falls in any particular range or discrete set of values specified for that variable. In the
case of only two random variables, this is called a bivariate distribution, but the concept
generalizes to any number of random variables, giving a multivariate distribution.
The joint probability distribution can be expressed either in terms of a joint
cumulative distribution function or in terms of a joint probability density function (in the
case of continuous variables) or joint probability mass function (in the case of discrete
variables). These in turn can be used to find two other types of distributions: the
marginal distribution giving the probabilities for any one of the variables with no
reference to any specific ranges of values for the other variables, and the conditional
probability distribution giving the probabilities for any subset of the variables conditional
on particular values of the remaining variables.
Example 5.1
Consider the flip of two fair coins; let A and B be discrete random variables associated with the
outcomes of the first- and second-coin flips respectively. Each coin flip is a Bernoulli trial and
has a Bernoulli distribution. If a coin displays "heads" then the associated random variable takes
the value 1, and it takes the value 0 otherwise. The probability of each of these outcomes is 1/2,
so the marginal (unconditional) probability mass functions are

P(A = 0) = P(A = 1) = 1/2 and P(B = 0) = P(B = 1) = 1/2.

The joint probability mass function of A and B defines probabilities for each pair of
outcomes. All possible outcomes are

(A, B) ∈ {(0, 0), (0, 1), (1, 0), (1, 1)}.

Since each outcome is equally likely, the joint probability mass function becomes

P(A = a, B = b) = 1/4 for a, b ∈ {0, 1}.

Since the coin flips are independent, the joint probability mass function is the product
of the marginals:

P(A = a, B = b) = P(A = a) · P(B = b) = (1/2)(1/2) = 1/4.
 
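The two-coin joint PMF above can be checked by brute-force enumeration. A minimal Python sketch (the variable names are mine):

```python
from itertools import product

# Joint PMF of two independent fair coin flips (1 = heads, 0 = tails).
joint = {(a, b): 0.25 for a, b in product([0, 1], repeat=2)}

# Marginals are obtained by summing the joint PMF over the other variable.
p_A = {a: sum(joint[(a, b)] for b in [0, 1]) for a in [0, 1]}
p_B = {b: sum(joint[(a, b)] for a in [0, 1]) for b in [0, 1]}

# Independence: the joint PMF factors into the product of the marginals.
factors = all(abs(joint[(a, b)] - p_A[a] * p_B[b]) < 1e-12 for (a, b) in joint)
print(p_A[0], p_B[1], factors)   # 0.5 0.5 True
```

The same marginalization-by-summing pattern works for any discrete joint PMF stored as a dictionary.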
Example 5.2
            The National Highway Traffic Safety Administration is interested in the effect of
seat belt use on saving lives. One study reported statistics on children under the age of
5 who were involved in motor vehicle accidents in which at least one fatality occurred.
For 7,060 such accidents between 1985 and 1989, the results are shown in Table 5.1.
 
Table 5.1
Children Involved in Motor Vehicle Accidents

             Survivors    Fatalities    Total
No belt          1,129           509    1,638
Adult belt         432            73      505
Child seat         733           139      872
Total            2,294           721    3,015
 
For each child, two facts were recorded: whether or not he or she survived, and what
the seat belt situation was. Define two random variables as follows:

X1 will keep track of whether the child survived (X1 = 0 for a survivor, X1 = 1 for a
fatality), and X2 will keep track of the type of restraining device used for the child
(X2 = 0 for no belt, X2 = 1 for adult belt, X2 = 2 for child seat).
The frequencies from Table 5.1 are turned into the relative frequencies of Table 5.2
to produce the joint probability distribution of X1 and X2. In general, we write

p(x1, x2) = P(X1 = x1, X2 = x2)

and call p(x1, x2) the joint probability function of (X1, X2). For example (see Table 5.2),

p(0, 2) = 0.24

represents the approximate probability that a child will both survive and be in a child
seat when involved in a fatal accident.
Table 5.2
Joint Probability Distribution

                 X1 = 0    X1 = 1    Total
X2 = 0            0.37      0.17     0.54
X2 = 1            0.14      0.02     0.16
X2 = 2            0.24      0.05     0.29
Total             0.76      0.24     1.00
 
The probability that a child will be in a child seat is

P(X2 = 2) = p(0, 2) + p(1, 2) = 0.24 + 0.05 = 0.29.

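The passage from Table 5.1 to Table 5.2 is just division by the grand total. A small sketch of that computation (the dictionary layout is mine):

```python
# Frequencies from Table 5.1, keyed by (x1, x2):
# x1 = 0 survivor / 1 fatality; x2 = 0 no belt / 1 adult belt / 2 child seat.
counts = {(0, 0): 1129, (1, 0): 509,
          (0, 1): 432,  (1, 1): 73,
          (0, 2): 733,  (1, 2): 139}
n = sum(counts.values())                        # 3,015 children

joint = {k: v / n for k, v in counts.items()}   # relative frequencies (Table 5.2)
p_child_seat = joint[(0, 2)] + joint[(1, 2)]    # marginal P(X2 = 2)
print(round(joint[(0, 2)], 2), round(p_child_seat, 2))   # 0.24 0.29
```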
5.1B Joint Probability Mass Function


Remember that for a discrete random variable X, we define the PMF as
PX(x) = P(X = x). Now, if we have two random variables X and Y, and we would like
to study them jointly, we define the joint probability mass function as follows:

PXY(x, y) = P(X = x, Y = y)

Note that as usual, the comma means "and," so we can write

PXY(x, y) = P(X = x, Y = y) = P((X = x) and (Y = y)).

We can define the joint range for X and Y as

RXY = {(x, y) | PXY(x, y) > 0}.

In particular, if RX = {x1, x2, ...} and RY = {y1, y2, ...}, then we can always write

RXY ⊂ RX × RY = {(xi, yj) | xi ∈ RX, yj ∈ RY}.

In fact, sometimes we define RXY = RX × RY to simplify the analysis. In this case, for some
pairs (xi, yj) in RX × RY, PXY(xi, yj) might be zero. For two discrete random variables X and Y, we
have

Σ over (xi, yj) ∈ RXY of PXY(xi, yj) = 1.

We can use the joint PMF to find P((X, Y) ∈ A) for any set A ⊂ R². Specifically, we have

P((X, Y) ∈ A) = Σ over (xi, yj) ∈ A ∩ RXY of PXY(xi, yj).
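As a concrete illustration of these formulas, here is a small joint PMF stored as a Python dict (the probability values are illustrative, not from the text), with a check that it sums to 1 and a computation of P((X, Y) ∈ A):

```python
from fractions import Fraction as F

# An illustrative joint PMF P_XY over R_XY = {0, 1} x {0, 1}.
pmf = {(0, 0): F(1, 8), (0, 1): F(1, 4),
       (1, 0): F(1, 4), (1, 1): F(3, 8)}

total = sum(pmf.values())     # the PMF must sum to 1 over the joint range
# P((X, Y) in A) for A = {(x, y) : x + y = 1}: sum the PMF over pairs in A.
p_A = sum(p for (x, y), p in pmf.items() if x + y == 1)
print(total, p_A)             # 1 1/2
```

Using `Fraction` keeps the arithmetic exact, which mirrors the hand calculation.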
Marginal Probability Distribution


            The marginal distribution of a subset of a collection of random variables is the
probability distribution of the variables contained in the subset. It gives the probabilities
of various values of the variables in the subset without reference to the values of the
other variables. This contrasts with a conditional distribution, which gives the
probabilities contingent upon the values of the other variables.
Marginal variables are those variables in the subset of variables being retained.
These concepts are "marginal" because they can be found by summing values in a
table along rows or columns and writing the sum in the margins of the table. The
distribution of the marginal variables (the marginal distribution) is obtained by
marginalizing – that is, focusing on the sums in the margin – over the distribution of the
variables being discarded, and the discarded variables are said to have been
marginalized out.
The context here is that the theoretical studies being undertaken, or the data
analysis being done, involves a wider set of random variables but that attention is being
limited to a reduced number of those variables. In many applications, an analysis may
start with a given collection of random variables, then first extend the set by defining
new ones (such as the sum of the original random variables) and finally reduce the
number by placing interest in the marginal distribution of a subset (such as the sum).
Several different analyses may be done, each treating a different subset of variables as
the marginal variables.
 
Conditional Probability Distribution
Conditional probability is the probability of one thing being true given that another
thing is true and is the key concept in Bayes' theorem. This is distinct from joint
probability, which is the probability that both things are true without knowing that one of
them must be true.
For example, one joint probability is "the probability that your left and right socks
are both black," whereas a conditional probability is "the probability that your left sock is
black if you know that your right sock is black," since adding information alters
probability. This can be high or low depending on how frequently your socks are paired
correctly. An Euler diagram, in which area is proportional to probability, can demonstrate
this difference.
 
[Euler diagram not reproduced: on the left, the overlap of A and B shows the joint probability; on the right, the same overlap is rescaled within B after conditioning.]
Let A be the event that your left sock is black and let B be the event that
your right sock is black. On the left side of the diagram, the yellow area represents the
probability that both of your socks are black. This is the joint probability, P(A and B). If B is definitely
true (e.g., given that your right sock is definitely black), then the space of everything
not B is dropped and everything in B is rescaled to the size of the original space. The
rescaled yellow area is now the conditional probability of A given B, expressed as P(A | B). In
other words, this is the probability that your left sock is black if you know that your right
sock is black. Note that the conditional probability of A given B is not in general equal to
the conditional probability of B given A. The latter would be the fraction of A that is yellow, which
in this picture is slightly smaller than the fraction of B that is yellow.
Philosophically, all probabilities are conditional probabilities. In the Euler
diagram, A and B are conditional on the box that they are in, in the same way that the
overlap of A and B is conditional on the box B that it is in. Treating probabilities in this way makes chaining
together different types of reasoning using Bayes' theorem easier, allowing for the
combination of uncertainties about outcomes ("given that the coin is fair, how likely am I
to get a head") with uncertainties about hypotheses ("given that Frank gave me this
coin, how likely is it to be fair?"). Historically, conditional probability has often been
misinterpreted, giving rise to the famous Monty Hall problem and Bayesian mistakes in
science. There is only one main formula for conditional probability:

P(A | B) = P(A and B) / P(B).

Any other formula regarding conditional probability can be derived from the above
formula. Specifically, if you have two random variables X and Y, you can write

P(X ∈ C | Y ∈ D) = P(X ∈ C, Y ∈ D) / P(Y ∈ D)

where C and D are subsets of the real line.
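The sock story reduces to one division. A tiny sketch (the numbers below are made up for illustration):

```python
# Conditional probability from the one main formula: P(A | B) = P(A and B) / P(B).
# Illustrative sock probabilities (assumed values, not from the text):
p_both_black = 0.3    # joint probability P(A and B)
p_right_black = 0.4   # P(B)
p_left_black = 0.5    # P(A)

p_A_given_B = p_both_black / p_right_black   # 0.75
p_B_given_A = p_both_black / p_left_black    # 0.6
print(p_A_given_B, p_B_given_A)
# P(A | B) != P(B | A) in general, exactly as the Euler-diagram argument shows.
```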
More Than Two Random Variables
            For two or more random variables, the joint probability distribution is
defined in a similar way to what we have already seen for the case of two random
variables. Let X1, X2, ..., Xn be n discrete random variables. The joint PMF of X1, X2, ..., Xn is
defined as

P(X1 = x1, X2 = x2, ..., Xn = xn).

For n jointly continuous random variables X1, X2, ..., Xn, the joint PDF is defined to be the
function fX1,...,Xn(x1, x2, ..., xn) such that, for any set A ⊂ Rⁿ,

P((X1, X2, ..., Xn) ∈ A) = ∫ ... ∫ over A of fX1,...,Xn(x1, x2, ..., xn) dx1 dx2 ... dxn.

The marginal PDF of X1 can be obtained by integrating out all the other Xj's. For example,

fX1(x1) = ∫ ... ∫ fX1,...,Xn(x1, x2, ..., xn) dx2 ... dxn.
Example 5.3
Consider two random variables X and Y with joint PMF given in Table 5.3.
Table 5.3
Joint PMF of X and Y

             Y = 0    Y = 1    Y = 2
X = 0         1/6      1/4      1/8
X = 1         1/8      1/6      1/6

Figure 5.1 shows PXY(x, y) as a three-dimensional bar plot (not reproduced here).

a. Find P(X = 0, Y ≤ 1).
b. Find the marginal PMFs of X and Y.
c. Find P(Y = 1 | X = 0).
d. Are X and Y independent?
 
Solution:

a. To find P(X = 0, Y ≤ 1), we can write

P(X = 0, Y ≤ 1) = PXY(0, 0) + PXY(0, 1) = 1/6 + 1/4 = 5/12.

b. Note that from the table, the marginal PMF of X is obtained by summing across each
row, and the marginal PMF of Y by summing down each column. To find PX(0), we can write

PX(0) = PXY(0, 0) + PXY(0, 1) + PXY(0, 2) = 1/6 + 1/4 + 1/8 = 13/24.

We obtain

PX(1) = 11/24, PY(0) = 7/24, PY(1) = 5/12, PY(2) = 7/24.

c. Using the formula for conditional probability, we have

P(Y = 1 | X = 0) = PXY(0, 1) / PX(0) = (1/4) / (13/24) = 6/13.

d. X and Y are not independent, because as we just found out,

P(Y = 1 | X = 0) = 6/13 ≠ PY(1) = 5/12.
 
Example 5.4
            A soft-drink machine has a random amount Y2 in supply at the beginning of a
given day and dispenses a random amount Y1 during the day (with measurements in
gallons). It is not resupplied during the day, hence Y1 ≤ Y2. It has been observed that
Y1 and Y2 have joint density

f(y1, y2) = 1/2 for 0 ≤ y1 ≤ y2 ≤ 2, and 0 elsewhere.

That is, the points (y1, y2) are uniformly distributed over the triangle with the given boundaries.
Find the conditional probability density of Y1 given Y2 = y2. Evaluate the probability that less than
1/2 gallon is sold, given that the machine contains 1 gallon at the start of the day.
 
Solution:
The marginal density of Y2 is given by

f2(y2) = ∫ from 0 to y2 of (1/2) dy1 = y2/2 for 0 ≤ y2 ≤ 2.

By definition, the conditional density of Y1 given Y2 = y2 is

f(y1 | y2) = f(y1, y2) / f2(y2) = (1/2) / (y2/2) = 1/y2 for 0 ≤ y1 ≤ y2.

The probability of interest is

P(Y1 ≤ 1/2 | Y2 = 1) = ∫ from 0 to 1/2 of 1 dy1 = 1/2.

Note that if the machine had contained 2 gallons at the start of the day, then

P(Y1 ≤ 1/2 | Y2 = 2) = ∫ from 0 to 1/2 of (1/2) dy1 = 1/4.

Thus, the amount sold is highly dependent upon the amount in supply.
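The two conditional probabilities can be sanity-checked by simulation, assuming the uniform density f(y1, y2) = 1/2 on the triangle 0 ≤ y1 ≤ y2 ≤ 2. This is a rough Monte Carlo sketch, not part of the original solution:

```python
import random

random.seed(0)

def p_half_given_supply(y2, n=200_000, eps=0.05):
    """Estimate P(Y1 < 1/2 | Y2 near y2) for (Y1, Y2) uniform on the triangle."""
    hits = total = 0
    for _ in range(n):
        a, b = 2 * random.random(), 2 * random.random()
        y1, ysup = min(a, b), max(a, b)   # (min, max) is uniform on the triangle
        if abs(ysup - y2) < eps:          # condition on the supply being near y2
            total += 1
            hits += y1 < 0.5
    return hits / total

print(round(p_half_given_supply(1.0), 2))  # close to 1/2
print(round(p_half_given_supply(2.0), 2))  # close to 1/4
```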
 

5.2 Linear Functions of Random Variables


It often happens that a random variable is the driver behind some cost function.
 The random occurrence of defects results in cost of returned items.
 The random variation of stock prices determines the performance of a portfolio.
 The random arrival of patients affects the length of the waiting line in a doctor’s office.
Sometimes the relationship between the random variable and the quantity of interest is
linear, and when it is, the computation of mean and standard deviation is greatly
simplified by the formulas in this note. We will use the following conventions, both in this
note and in class:
 
Types of Numbers                            Symbols Used               Examples
Fixed numbers                               Lower-case letters         a, b, c
Random variables                            Upper-case letters         X, Y, Z
Population parameters of random variables   Lower-case Greek letters   μ, σ, ρ
A linear relationship exists between X and Y when a one-unit increase in X causes Y to
change by a fixed amount, regardless of how large or small X is. For example, suppose
we change X from 10 to 11, and find that Y decreases by $5. If the relationship is
linear, Y will also drop by $5 when we change X from 15 to 16, or 99 to 100, or any
other one-unit increase.
 
Rules for linear functions of random variables:

(1a)  E(aX + b) = a·E(X) + b
(1b)  SD(aX + b) = |a|·SD(X)
(2)   Corr(aX + b, cY + d) = Corr(X, Y)  (for a, c > 0)

These equations say the following:

(1a&b) If you multiply a random variable X by a number a, multiply the expected value
by a and the standard deviation by |a|.
(1a&b) If you add a constant b, add the same amount to the expected value, but do not change
the standard deviation.
(2) Linear functions do not change the correlation (if a and c have opposite signs, only
the sign of the correlation flips).
Illustrations:
(1a)      Expected value is like an average. Suppose X varies between 1 and 3. If you double X,
2X varies from 2 to 6. Moreover, every value is twice as large, so when you compute the
average, it is also twice as large. If you then add 7, every value increases by 7 so the average
does likewise.
(1b) The standard deviation is a measure of how much something varies. However, (1b) is
easier to illustrate by considering the range of a variable. Suppose X varies between 1 and 3, a
range of 2. If you double X, 2X varies from 2 to 6, so its range is twice as large. However, if you
then add 7, 2X + 7 varies from 9 to 13, a range of 4, the same as the range of 2X. Adding 7 did
not increase the range, and for the same reason, adding a constant does not affect the standard
deviation.
(2) If X and Y have correlation 0.9, and if both have linear cost functions, then the correlation
between their costs is also 0.9.
Rules for adding random variables:

(3a)  E(aX + bY) = a·E(X) + b·E(Y)
(3b)  SD(aX + bY) = sqrt[ a²·Var(X) + b²·Var(Y) + 2ab·Cov(X, Y) ]

* If X and Y are independent, then Cov(X, Y) is zero.

If you use the value 1.0 for a and b, then these equations say the following:
(3a) If you add two random variables, you add their expected values.
(3b) If you add two random variables, to get the standard deviation you add their variances, add
twice their covariance, then take the square root.
(3b) Special Case: If the variables are independent, the covariance is zero so you can just add
the variances and take the square root.
Since these equations involve the covariance, whereas we are mostly familiar with correlation,
the relationship between covariance and correlation is given here for convenience, in two
versions.
 If you have the Covariance and need the Correlation, divide by both standard deviations.
 If you have the Correlation and need the Covariance, multiply by both standard deviations.
Corr(X, Y) = Cov(X, Y) / [ SD(X)·SD(Y) ]        Cov(X, Y) = Corr(X, Y)·SD(X)·SD(Y)
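Rules (1a) and (1b) are easy to check numerically. A sketch using simulated data (the distribution and the coefficients a = 0.75, b = −1500 are arbitrary choices of mine):

```python
import random
import statistics as st

random.seed(1)
X = [random.gauss(5000, 1000) for _ in range(50_000)]
Y = [0.75 * x - 1500 for x in X]   # a linear function aX + b

# (1a): the mean transforms linearly; (1b): the stdev scales by |a| and ignores b.
print(abs(st.mean(Y) - (0.75 * st.mean(X) - 1500)) < 1e-6)   # True
print(abs(st.stdev(Y) - 0.75 * st.stdev(X)) < 1e-6)          # True
```

Note that both checks hold exactly (up to floating-point error) for any sample, not just in expectation, because the sample mean and sample standard deviation are themselves linear/scale-equivariant statistics.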
 
Example 5.5 Mean and Standard Deviation of Sales Commission
You pay your sales personnel a commission of 75% of the amount they sell over
$2000. X = Sales has mean $5000 and standard deviation $1000. What are the mean
and standard deviation of pay?
Solution:
X − 2000 represents the basis for the commission, and "Pay" is 75% of that, so

Pay = 0.75(X − 2000) = 0.75X − 1500

(1a)  E(Pay) = 0.75·E(X) − 1500 = 0.75·5000 − 1500 = $2,250

(1b)  SD(Pay) = 0.75·SD(X) = 0.75·$1,000 = $750
 
Example 5.6 The Portfolio Effect.
You are considering purchase of stock in two different companies, X and Y. Return after one
year for stock X is a random variable with μX = $112 and σX = $10. Return for stock Y (a different
company) has the same μ and σ. Assuming that X and Y are independent, which portfolio has
less variability, 2 shares of X or one each of X and Y?
 
Solution:
The returns from 2 shares of X will be exactly twice the returns from one share, or 2X.
By rule (1b), SD(2X) = 2·σX = $20.
The returns from one each of X and Y is the sum of the two returns, X + Y. By rule (3b)
with independent returns, SD(X + Y) = sqrt(σX² + σY²) = sqrt(100 + 100) ≈ $14.14.
Both portfolios have the same expected return, 2·$112 = $224, but the X + Y portfolio
has less variability. This is the portfolio effect of diversification.
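The arithmetic behind the portfolio comparison, using the standard deviation of $10 from the example:

```python
import math

sd = 10.0                                # sigma for each stock's return
sd_two_x = 2 * sd                        # rule (1b): SD(2X) = 2 * sigma
sd_x_plus_y = math.sqrt(sd**2 + sd**2)   # rule (3b), independent returns
print(sd_two_x, round(sd_x_plus_y, 2))   # 20.0 14.14
```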
 
5.3 General Function of Random Variables


Let X be a random variable with known distribution. Let another random variable Y be a function
of X:
 
Y = g(X)
 
where g:R→R . How do we derive the distribution of Y from the distribution of X?
There is no general answer to this question. However, there are several special cases in which
it is easy to derive the distribution of Y. We discuss these cases below.
 
Strictly Increasing Functions
When the function g is strictly increasing on the support of X (i.e., x1 > x2 implies
g(x1) > g(x2) for any x1, x2 in the support of X), then g admits an inverse defined on the
support of Y = g(X), i.e. a function g⁻¹ such that

g⁻¹(g(x)) = x for all x in the support of X.

Furthermore, g⁻¹ is itself strictly increasing.
The distribution function of a strictly increasing function of a random variable can be computed
as follows.
 
Proposition (distribution of an increasing function). Let X be a random variable with
support RX and distribution function FX(x). Let g be strictly increasing on the support of X. Then, the
support of Y = g(X) is

RY = {y : y = g(x) for some x ∈ RX}

and the distribution function of Y is

FY(y) = FX(g⁻¹(y)).

Therefore, in the case of an increasing function, knowledge of g⁻¹ and of the upper and
lower bounds of the support of Y is all we need to derive the distribution function of Y from the
distribution function of X.
 
Example. Let X be a random variable with support RX = [1, 2] and distribution function FX(x).
Let

Y = X²

The function g(x) = x² is strictly increasing on RX and it admits an inverse on the support of X:

g⁻¹(y) = √y

The support of Y is RY = [1, 4]. The distribution function of Y is

FY(y) = FX(√y) for y ∈ [1, 4].
In the cases in which  is either discrete or continuous there are specialized formulae for the
probability mass and probability density functions, which are reported below.
Strictly Increasing Functions of a Discrete Random Variable
When X is a discrete random variable, the probability mass function of Y = g(X) can be computed
as follows.
 
Proposition (probability mass of an increasing function). Let X be a discrete random
variable with support RX and probability mass function pX(x). Let g: R → R be strictly increasing
on the support of X. Then, the support of Y = g(X) is

RY = {y : y = g(x) for some x ∈ RX}

and its probability mass function is

pY(y) = pX(g⁻¹(y)).

Example. Let X be a discrete random variable with support RX = {1, 2, 3}
and probability mass function

pX(x) = 1/3 for each x ∈ RX.

Let

Y = g(X) = X² + 3.

The support of Y is

RY = {4, 7, 12}.

The function g is strictly increasing and its inverse is

g⁻¹(y) = √(y − 3).

The probability mass function of Y is

pY(y) = pX(√(y − 3)) = 1/3 for each y ∈ RY.

Strictly Increasing Functions of a Continuous Random Variable


When X is a continuous random variable and g is differentiable, then Y is also continuous and
its probability density function can be computed as follows.
 
Proposition (density of an increasing function). Let X be a continuous random variable with
support RX and probability density function fX(x). Let g: R → R be strictly increasing and
differentiable on the support of X. Then, the support of Y = g(X) is

RY = {y : y = g(x) for some x ∈ RX}

and its probability density function is

fY(y) = fX(g⁻¹(y)) · d g⁻¹(y)/dy.
Example. Let X be a continuous random variable with support RX = (0, 1]
and probability density function

fX(x) = 1 for x ∈ (0, 1].

Let

Y = g(X) = ln(X).

The support of Y is

RY = (−∞, 0].

The function g is strictly increasing and its inverse is

g⁻¹(y) = exp(y),

with derivative

d g⁻¹(y)/dy = exp(y).

The probability density function of Y is

fY(y) = fX(exp(y)) · exp(y) = exp(y) for y ∈ (−∞, 0].
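The change-of-variable result can be sanity-checked by simulation, assuming X is uniform on (0, 1] and Y = ln(X), in which case FY(y) = e^y for y ≤ 0. A rough sketch:

```python
import bisect
import math
import random

random.seed(2)
# Empirical check of F_Y(y) = F_X(g^{-1}(y)) for Y = ln(X), X uniform on (0, 1].
n = 200_000
ys = sorted(math.log(1 - random.random()) for _ in range(n))  # 1 - U lies in (0, 1]

y0 = -1.0
empirical = bisect.bisect_right(ys, y0) / n   # empirical P(Y <= -1)
analytic = math.exp(y0)                       # F_Y(-1) = e^{-1}
print(round(empirical, 2), round(analytic, 2))   # both close to 0.37
```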
References
 
En.wikipedia.org. (2019). Joint probability distribution. [online] Available at:
https://en.wikipedia.org/wiki/Joint_probability_distribution [Accessed 3 Jun. 2019].
 
Probabilitycourse.com. (2019). Joint Probability Mass Function | Marginal PMF | PMF. [online]
Available at: https://www.probabilitycourse.com/chapter5/5_1_1_joint_pmf.php [Accessed 3
Jun. 2019].
 
En.wikipedia.org. (2019). Marginal distribution. [online] Available at:
https://en.wikipedia.org/wiki/Marginal_distribution [Accessed 3 Jun. 2019].
 
Brilliant.org. (2019). Conditional Probability Distribution | Brilliant Math & Science Wiki. [online]
Available at: https://brilliant.org/wiki/conditional-probability-distribution/ [Accessed 3 Jun. 2019].
 
Scheaffer, R., MuleKar, M. and McClave, J. (2011). Probability and Statistics for Engineering Students.
Philippines: C&E Publishing, Inc., pp.355-366.
6.1. Point Estimation


Point Estimation
In statistics, point estimation involves the use of sample data to calculate a single
value (known as a point estimate since it identifies a point in some parameter
space) which is to serve as a "best guess" or "best estimate" of an unknown
population parameter (for example, the population mean). More formally, it is the
application of a point estimator to the data to obtain a point estimate.
Point estimation can be contrasted with interval estimation: such interval
estimates are typically either confidence intervals, in the case of frequentist
inference, or credible intervals, in the case of Bayesian inference.
 A point estimate is a reasonable value of a population parameter.
 Data collected, X1, X2,…, Xn are random variables.
 Functions of these random variables, X̄ and s2, are also random variables called statistics.
 Statistics have their unique distributions that are called sampling distributions.
6.2.1 Sampling Distribution and the Central Limit Theorem


6.2.1. Sampling Distribution
Suppose that we draw all possible samples of size n from a given population.
Suppose further that we compute a statistic (e.g., a mean, proportion, standard
deviation) for each sample. The probability distribution of this statistic is called
a sampling distribution. And the standard deviation of this statistic is called
the standard error.
Variability of a Sampling Distribution
The variability of a sampling distribution is measured by its variance or its standard
deviation. The variability of a sampling distribution depends on three factors:

o N: The number of observations in the population.
o n: The number of observations in the sample.
o The way that the random sample is chosen.
If the population size is much larger than the sample size, then the sampling
distribution has roughly the same standard error, whether we
sample with or without replacement. On the other hand, if the sample represents
a significant fraction (say, 1/20 or more) of the population size, the standard error will be
meaningfully smaller when we sample without replacement.
 
Sampling Distribution of the Mean
Suppose we draw all possible samples of size n from a population of size N.
Suppose further that we compute a mean score for each sample. In this way, we
create a sampling distribution of the mean.
We know the following about the sampling distribution of the mean. The mean of
the sampling distribution (μx̄) is equal to the mean of the population (μ). And
the standard error of the sampling distribution (σx̄) is determined by the
standard deviation of the population (σ), the population size (N), and the sample
size (n). These relationships are shown in the equations below:

μx̄ = μ

σx̄ = [ σ / sqrt(n) ] · sqrt[ (N − n) / (N − 1) ]

In the standard error formula, the factor sqrt[ (N − n) / (N − 1) ] is called the finite
population correction or fpc. When the population size is very large relative to the
sample size, the fpc is approximately equal to one; and the standard error
formula can be approximated by:

σx̄ = σ / sqrt(n)
You often see this "approximate" formula in introductory statistics texts. As a


general rule, it is safe to use the approximate formula when the sample size is no
bigger than 1/20 of the population size.
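The fpc is a small multiplicative adjustment. A sketch with illustrative numbers (σ = 20, n = 50, N = 10,000 are my choices):

```python
import math

# Standard error of the mean, with and without the finite population correction.
sigma, n, N = 20.0, 50, 10_000
fpc = math.sqrt((N - n) / (N - 1))
se_exact = sigma / math.sqrt(n) * fpc    # exact formula
se_approx = sigma / math.sqrt(n)         # approximate formula (fpc dropped)
print(round(se_exact, 2), round(se_approx, 2))   # 2.82 2.83
```

With n only 1/200 of N, the two answers agree to about one part in four hundred, which is why the approximate formula is considered safe here.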
 
Sampling Distribution of the Proportion
In a population of size N, suppose that the probability of the occurrence of an
event (dubbed a "success") is P; and the probability of the event's non-
occurrence (dubbed a "failure") is Q. From this population, suppose that we draw
all possible samples of size n. And finally, within each sample, suppose that we
determine the proportion of successes p and failures q. In this way, we create a
sampling distribution of the proportion.
We find that the mean of the sampling distribution of the proportion (μ p) is equal
to the probability of success in the population (P). And the standard error of the
sampling distribution (σp) is determined by the standard deviation of the
population (σ), the population size, and the sample size. These relationships are
shown in the equations below:
μp = P

σp = sqrt[ P·Q / n ] · sqrt[ (N − n) / (N − 1) ]

where Q = 1 − P.
Like the formula for the standard error of the mean, the formula for the standard
error of the proportion uses the finite population correction, sqrt[ (N - n ) / (N -
1) ]. When the population size is very large relative to the sample size, the fpc is
approximately equal to one; and the standard error formula can be approximated
by:
σp = sqrt[ P·Q / n ]
You often see this "approximate" formula in introductory statistics texts. As a


general rule, it is safe to use the approximate formula when the sample size is no
bigger than 1/20 of the population size.
6.2.2 Central Limit Theorem


Definition
The central limit theorem states that the sampling distribution of the mean of
independent, identically distributed random variables will be normal or nearly
normal, if the sample size is large enough.
How large is "large enough"? The answer depends on two factors.

o Requirements for accuracy. The more closely the sampling distribution needs to resemble a
normal distribution, the more sample points will be required.
o The shape of the underlying population. The more closely the original population resembles a
normal distribution, the fewer sample points will be required.
In practice, some statisticians say that a sample size of 30 is large enough when
the population distribution is roughly bell-shaped. Others recommend a sample
size of at least 40. But if the original population is distinctly not normal (e.g., is
badly skewed, has multiple peaks, and/or has outliers), researchers like the
sample size to be even larger.
 
How to Choose Between T-Distribution and Normal Distribution
The t distribution and the normal distribution can both be used with statistics that
have a bell-shaped distribution. This suggests that we might use either the t-
distribution or the normal distribution to analyze sampling distributions. Which
should we choose?
Guidelines exist to help you make that choice. Some focus on the population
standard deviation.

o If the population standard deviation is known, use the normal distribution.
o If the population standard deviation is unknown, use the t-distribution.
Other guidelines focus on sample size.

o If the sample size is large, use the normal distribution. (See the discussion above in the section
on the Central Limit Theorem to understand what is meant by a "large" sample.)
o If the sample size is small, use the t-distribution.
In practice, researchers employ a mix of the above guidelines.
Here, we use the normal distribution when the population standard
deviation is known and the sample size is large. We might use either distribution
when the standard deviation is unknown and the sample size is very large. We use
the t-distribution when the sample size is small, unless the underlying distribution
is not normal. The t-distribution should not be used with small samples from
populations that are not approximately normal.

Example No. 1: Assume that a school district has 10,000 6th graders. In this district, the average
weight of a 6th grader is 80 pounds, with a standard deviation of 20 pounds.
Suppose you draw a random sample of 50 students. What is the probability that
the average weight of a sampled student will be less than 75 pounds?
Solution: To solve this problem, we need to define the sampling distribution of the
mean. Because our sample size is greater than 30, the Central Limit Theorem
tells us that the sampling distribution will approximate a normal distribution.
To define our normal distribution, we need to know both the mean of the
sampling distribution and the standard deviation. Finding the mean of the
sampling distribution is easy, since it is equal to the mean of the population.
Thus, the mean of the sampling distribution is equal to 80.
The standard deviation of the sampling distribution can be computed using the
following formula:

σx̄ = [ σ / sqrt(n) ] · sqrt[ (N − n) / (N − 1) ] = [ 20 / sqrt(50) ] · sqrt[ 9,950 / 9,999 ] = 2.82

Let's review what we know and what we want to know. We know that the
sampling distribution of the mean is normally distributed with a mean of 80 and a
standard deviation of 2.82. We want to know the probability that a sample mean
is less than or equal to 75 pounds.
Because we know the population standard deviation and the sample size is
large, we'll use the normal distribution to find the probability. To solve the problem,
we plug these inputs into a normal probability calculator: mean = 80, standard
deviation = 2.82, and normal random variable = 75. The calculator tells us that
the probability that the average weight of a sampled student is less than 75
pounds is equal to 0.038.
Note: Since the population size is more than 20 times greater than the sample
size, we could have used the "approximate" formula σx̄ = σ / sqrt(n) to compute
the standard error. Had we done that, we would have found a standard error
equal to 20 / sqrt(50), or 2.83.
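The normal-calculator step can be reproduced with the error function; `normal_cdf` below is my helper, not a standard library call:

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of the normal distribution, computed via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# Standard error with the finite population correction, then P(sample mean < 75).
se = 20 / math.sqrt(50) * math.sqrt((10_000 - 50) / (10_000 - 1))
print(round(normal_cdf(75, 80, se), 3))   # 0.038
```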

Example No. 2: Find the probability that, of the next 120 births, no more than 40%
will be boys. Assume equal probabilities for the births of boys and girls. Assume
also that the number of births in the population (N) is very large, essentially
infinite.
Solution: The Central Limit Theorem tells us that the proportion of boys in 120
births will be approximately normally distributed.
The mean of the sampling distribution will be equal to the mean of the population
distribution. In the population, half of the births result in boys; and half, in girls.
Therefore, the probability of boy births in the population is 0.50. Thus, the mean
proportion in the sampling distribution should also be 0.50.
The standard deviation of the sampling distribution (i.e., the standard error) can
be computed using the following formula:

σp = sqrt[ P·Q / n ] · sqrt[ (N − n) / (N − 1) ]

Here, the finite population correction is equal to 1.0, since the population size (N)
was assumed to be infinite. Therefore, the standard error formula reduces to:

σp = sqrt[ P·Q / n ] = sqrt[ (0.5)(0.5) / 120 ] = 0.04564
Let's review what we know and what we want to know. We know that the
sampling distribution of the proportion is normally distributed with a mean of 0.50
and a standard deviation of 0.04564. We want to know the probability that no
more than 40% of the sampled births are boys.
Because we know the population standard deviation and the sample size is
large, we'll use the normal distribution to find probability. To solve the problem,
we plug these inputs into the Normal Probability Calculator: mean = .5, standard
deviation = 0.04564, and the normal random variable = .4. The Calculator tells us
that the probability that no more than 40% of the sampled births are boys is equal
to 0.014.
Note: This problem can also be treated as a binomial experiment. Elsewhere, we
showed how to analyze a binomial experiment. The binomial experiment is
actually the more exact analysis. It produces a probability of 0.018 (versus the
probability of 0.014 that we found using the normal distribution). Without a
computer, the binomial approach is computationally demanding. Therefore, many
statistics texts emphasize the approach presented above, which uses the normal
distribution to approximate the binomial.
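Both answers can be computed directly; the exact binomial tail is a short sum. A sketch:

```python
import math

# "No more than 40% boys in 120 births" two ways: exact binomial tail
# versus the normal approximation used above.
n, p = 120, 0.5
k = int(0.40 * n)                                   # 48 boys

exact = sum(math.comb(n, i) for i in range(k + 1)) * p**n
se = math.sqrt(p * (1 - p) / n)                     # 0.04564
z = (0.40 - 0.50) / se
approx = 0.5 * (1 + math.erf(z / math.sqrt(2)))
print(round(exact, 3), round(approx, 3))
```

The exact sum lands near 0.018 and the normal approximation near 0.014, matching the note above; the gap would shrink if a continuity correction were applied to the normal calculation.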
 
6.3. General Concept of Point Estimation


We'll start the lesson with some formal definitions. In doing so, recall that we
denote the n random variables arising from a random sample as subscripted
uppercase letters:
 X1, X2, ..., Xn
The corresponding observed values of a specific random sample are then
denoted as subscripted lowercase letters:

 x1, x2, ..., xn
           Definition. The range of possible values of the parameter θ is called the parameter
space Ω (the Greek letter "omega").
For example, if μ denotes the mean grade point average of all college students,
then the parameter space (assuming a 4-point grading scale) is:
Ω = {μ: 0 ≤ μ ≤ 4}
And, if p denotes the proportion of students who smoke cigarettes, then the
parameter space is:
Ω = {p: 0 ≤ p ≤ 1}
           Definition. The function of X1, X2, ..., Xn, that is, the statistic u(X1, X2, ..., Xn), used to
estimate θ is called a point estimator of θ.
For example, the function:

X̄ = (1/n) · (X1 + X2 + ... + Xn)

is a point estimator of the population mean μ. The function:

p̂ = (1/n) · (X1 + X2 + ... + Xn)

(where Xi = 0 or 1) is a point estimator of the population proportion p. And, the
function:

S² = [1/(n − 1)] · Σ (Xi − X̄)²

is a point estimator of the population variance σ².
 
Definition. The function u(x1, x2, ..., xn) computed from a set of data is an observed point
estimate of θ. 
 
For example, if xi are the observed grade point averages of a sample of 88
students, then:

x̄ = (1/88) · (x1 + x2 + ... + x88)

is a point estimate of μ, the mean grade point average of all the students in the
population.
And, if xi = 0 when a student has no tattoo, and xi = 1 when a student has a tattoo, then:

p̂ = 0.11

is a point estimate of p, the proportion of all students in the population who have
a tattoo.
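Observed point estimates are plain arithmetic on the data. A sketch using a small made-up 0/1 sample (the data are mine, for illustration only):

```python
# xi = 1 if the i-th student has a tattoo, 0 otherwise (made-up sample).
xs = [1, 0, 0, 1, 0, 0, 0, 0, 0, 0]
n = len(xs)

p_hat = sum(xs) / n                                # point estimate of p
xbar = sum(xs) / n                                 # sample mean (same formula here)
s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)    # point estimate of sigma^2
print(p_hat, round(s2, 3))   # 0.2 0.178
```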
 
 
6.3.1. Unbiased Estimator and Variance of a Point Estimator


Definition. If the following holds:

E[u(X1, X2, …, Xn)] = θ

then the statistic u(X1, X2, …, Xn) is an unbiased estimator of the parameter θ.
Otherwise, u(X1, X2, …, Xn) is a biased estimator of θ.

On the previous page, we showed that if the Xi are Bernoulli random variables with
parameter p, then:

p̂ = (1/n) · (X1 + X2 + ... + Xn)

is the maximum likelihood estimator of p. And, if the Xi are normally distributed random
variables with mean μ and variance σ², then:

μ̂ = X̄ = (1/n) Σ Xi  and  σ̂² = (1/n) Σ (Xi − X̄)²

are the maximum likelihood estimators of μ and σ², respectively. A natural question then
is whether or not these estimators are "good" in any sense. One measure of "good"
is "unbiasedness."
Example No. 1: If Xi is a Bernoulli random variable with parameter p, then:

p̂ = (1/n) · (X1 + X2 + ... + Xn)

is the maximum likelihood estimator (MLE) of p. Is the MLE of p an unbiased estimator
of p?
Solution. Recall that if Xi is a Bernoulli random variable with parameter p, then E(Xi)
= p. Therefore:

E(p̂) = E[(1/n) Σ Xi] = (1/n) Σ E(Xi) = (1/n) Σ p = (1/n)(np) = p

The first equality holds because we've merely replaced p̂ with its definition. The
second equality holds by the rules of expectation for a linear combination. The third
equality holds because E(Xi) = p. The fourth equality holds because when you add the
value p up n times, you get np. And, of course, the last equality is simple algebra. In
summary, we have shown that:

E(p̂) = p

Therefore, the maximum likelihood estimator is an unbiased estimator of p.


Example No. 2: If the Xi are normally distributed random variables with mean μ and
variance σ², what is an unbiased estimator of σ²? Is S² unbiased?
Solution. Recall that if Xi is a normally distributed random variable with mean μ and
variance σ², then:

(n − 1)S² / σ² follows a chi-square distribution with n − 1 degrees of freedom.

Also, recall that the expected value of a chi-square random variable is its degrees of
freedom. That is, if:

X follows a chi-square distribution with r degrees of freedom,

then E(X) = r. Therefore:

E(S²) = E[ (σ²/(n − 1)) · ((n − 1)S²/σ²) ] = (σ²/(n − 1)) · E[ (n − 1)S²/σ² ]
      = (σ²/(n − 1)) · (n − 1) = σ²

The first equality holds because we effectively multiplied the sample variance by 1. The
second equality holds by the law of expectation that tells us we can pull a constant
through the expectation. The third equality holds because of the two facts we recalled
above. That is:

E[ (n − 1)S²/σ² ] = n − 1

And, the last equality is again simple algebra.

In summary, we have shown that, if Xi is a normally distributed random variable with
mean μ and variance σ², then S² is an unbiased estimator of σ². It turns out, however,
that S² is always an unbiased estimator of σ², that is, for any model, not just the normal
model. (You'll be asked to show this in the homework.) And, although S² is always an
unbiased estimator of σ², S is not an unbiased estimator of σ.
 
Example No. 3: Let T be the time that is needed for a specific task in a factory to be completed. In order to estimate the mean and variance of T, we observe a random sample T1, T2, …, T6. Thus, the Ti's are i.i.d. and have the same distribution as T. We obtain the following values (in minutes):

18, 21, 17, 16, 24, 20

Find the values of the sample mean, the sample variance, and the sample standard deviation for the observed sample.

The sample mean is X̄ = (18 + 21 + 17 + 16 + 24 + 20)/6 ≈ 19.33 minutes. The sample variance is given by S² = (1/5) Σ (Ti − X̄)² ≈ 8.67, and the sample standard deviation is S = √S² ≈ 2.94 minutes.
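As a check on Example No. 3, the three sample statistics can be computed directly from their definitions:

```python
# Check of Example No. 3: sample mean, sample variance (n - 1 divisor),
# and sample standard deviation for the six observed task times.
import math

times = [18, 21, 17, 16, 24, 20]
n = len(times)
xbar = sum(times) / n                                 # sample mean, about 19.33
s2 = sum((t - xbar) ** 2 for t in times) / (n - 1)    # sample variance, about 8.67
s = math.sqrt(s2)                                     # sample standard deviation, about 2.94
```

Note the n − 1 divisor in the sample variance, which is what makes S² unbiased.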
6.3.2. Standard Error


The standard error of an estimator is its standard deviation.

Let's calculate the standard error of the sample mean estimator:

SE(X̄) = σ/√n

where σ is the standard deviation std(X) being estimated. We don't know the standard deviation σ of X, but we can approximate the standard error based upon some estimated value s for σ. Irrespective of the value of σ, the standard error decreases with the square root of the sample size n. Quadrupling the sample size halves the standard error.
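A minimal sketch of this formula; the values s = 1.2 and n = 6 are purely illustrative:

```python
# Estimated standard error of the sample mean: s / sqrt(n).
import math

def standard_error(s, n):
    """Estimated standard error of the sample mean."""
    return s / math.sqrt(n)

se1 = standard_error(1.2, 6)    # illustrative: s = 1.2, n = 6
se2 = standard_error(1.2, 24)   # same s, quadruple the sample size
```

Here se2 is exactly half of se1, illustrating the square-root law stated above.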
6.3.3. Mean Squared Error


We seek estimators that are unbiased and have minimal standard error.
Sometimes these goals are incompatible. Consider Exhibit 4.2, which indicates
PDFs for two estimators of a parameter θ. One is unbiased. The other is biased
but has a lower standard error. Which estimator should we use?
Exhibit 4.2: PDFs are indicated for two estimators of a parameter θ. One is unbiased.
The other is biased but has lower standard error.
Mean squared error (MSE) combines the notions of bias and standard error. It is defined as:

MSE(θ̂) = E[(θ̂ − θ)²]

Since we have already determined the bias and standard error of an estimator, calculating its mean squared error is easy:

MSE(θ̂) = bias(θ̂)² + SE(θ̂)²

Faced with alternative estimators for a given parameter, it is generally reasonable to use the one with the smallest MSE.
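The trade-off can be illustrated with the decomposition MSE = bias² + SE². The numbers below are hypothetical, chosen only to mirror the two estimators in the exhibit:

```python
# MSE combines bias and standard error: MSE = bias^2 + SE^2.
def mse(bias, se):
    return bias ** 2 + se ** 2

unbiased = mse(bias=0.0, se=0.5)   # unbiased but noisier estimator
biased = mse(bias=0.2, se=0.3)     # biased but with lower standard error
```

With these illustrative numbers the biased estimator has the smaller MSE (0.13 vs 0.25), so it would be the reasonable choice despite its bias.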

References:
Holton, G. (n.d.). Value-at-Risk (2nd ed.). Retrieved from https://www.value-at-risk.net/bias/
Pishro-Nik, H. (n.d.). Introduction to Probability, Statistics, and Random Processes. Retrieved from https://www.probabilitycourse.com/chapter8/8_2_2_point_estimators_for_mean_and_var.php
The Pennsylvania State University (2018). Probability Theory and Mathematical Statistics. Retrieved from https://newonlinecourses.science.psu.edu/stat414/node/192/
Stat Trek (n.d.). Sampling Distribution. Retrieved from https://stattrek.com/sampling/sampling-distribution.aspx
7.1 Confidence Intervals


In statistical inference, one wishes to estimate population parameters using observed sample data.

A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data. A confidence interval can be constructed at any confidence level, the most common being 95% or 99%.

The common notation for the parameter in question is θ. Often, this parameter is the population mean μ, which is estimated through the sample mean x̄.

The level C of a confidence interval gives the probability that the interval produced by the method employed includes the true value of the parameter θ.
Example No. 1
Suppose a student measuring the boiling temperature of a certain liquid observes the
readings (in degrees Celsius) 102.5, 101.7, 103.1, 100.9, 100.5, and 102.2 on 6
different samples of the liquid. He calculates the sample mean to be 101.82. If he knows
that the standard deviation for this procedure is 1.2 degrees, what is the confidence
interval for the population mean at a 95% confidence level?
In other words, the student wishes to estimate the true mean boiling temperature of the liquid using the results of his measurements. If the measurements follow a normal distribution, then the sample mean will have the distribution N(μ, σ/√n). Since the sample size is 6, the standard deviation of the sample mean is equal to 1.2/√6 = 0.49.
The selection of a confidence level for an interval determines the probability that the
confidence interval produced will contain the true parameter value. Common choices for
the confidence level C are 0.90, 0.95, and 0.99. These levels correspond to
percentages of the area of the normal density curve. For example, a 95% confidence
interval covers 95% of the normal curve -- the probability of observing a value outside of
this area is less than 0.05. Because the normal curve is symmetric, half of the area is in
the left tail of the curve, and the other half of the area is in the right tail of the curve. For a confidence interval with level C, the area in each
tail of the curve is equal to (1-C)/2. For a 95% confidence interval, the area in each tail
is equal to 0.05/2 = 0.025.

The value z* representing the point on the standard normal density curve such that the
probability of observing a value greater than z* is equal to p is known as the
upper p critical value of the standard normal distribution. For example, if p = 0.025, the
value z* such that P(Z > z*) = 0.025, or P(Z < z*) = 0.975, is equal to 1.96. For a
confidence interval with level C, the value p is equal to (1-C)/2. A 95% confidence
interval for the standard normal distribution, then, is the interval (-1.96, 1.96), since 95%
of the area under the curve falls within this interval.
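Putting the pieces of Example No. 1 together, the interval can be verified directly from the readings:

```python
# Check of Example No. 1: 95% CI for the mean boiling temperature, sigma known.
import math

readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]
n = len(readings)
xbar = sum(readings) / n          # about 101.82
sigma = 1.2                       # known standard deviation of the procedure
z_star = 1.96                     # upper 0.025 critical value of N(0, 1)
se = sigma / math.sqrt(n)         # about 0.49
lower, upper = xbar - z_star * se, xbar + z_star * se
```

The resulting interval is roughly (100.86, 102.78) degrees Celsius.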
7.1.1 Confidence Intervals for Unknown Mean and Known Standard Deviation
For a population with unknown mean μ and known standard deviation σ, a confidence interval for the population mean, based on a simple random sample (SRS) of size n, is x̄ ± z*(σ/√n), where z* is the upper (1 − C)/2 critical value for the standard normal distribution.

An increase in sample size will decrease the length of the confidence interval without reducing the level of confidence. This is because the standard deviation of the sample mean decreases as n increases. The margin of error m of a confidence interval is defined to be the value added or subtracted from the sample mean which determines the length of the interval: m = z*(σ/√n).
7.1.2 Confidence Intervals for Unknown Mean and Unknown Standard Deviation
In most practical research, the standard deviation for the population of interest is not known. In this case, the standard deviation σ is replaced by the estimated standard deviation s, and the resulting estimate s/√n of the standard deviation of the sample mean is known as the standard error. Since the standard error is only an estimate of the true value, the standardized sample mean no longer follows the normal distribution; it follows the t distribution instead. The t distribution is described by its degrees of freedom: for a sample of size n, the t distribution will have n − 1 degrees of freedom. The notation for a t distribution with k degrees of freedom is t(k). As the sample size n increases, the t distribution becomes closer to the normal distribution, since the standard error approaches the true standard deviation for large n.

For a population with unknown mean and unknown standard deviation, a confidence interval for the population mean, based on a simple random sample (SRS) of size n, is x̄ ± t*(s/√n), where t* is the upper (1 − C)/2 critical value for the t distribution with n − 1 degrees of freedom, t(n − 1).
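If the student in the earlier boiling-point example had not known σ, the interval would use s and a t critical value instead. A sketch, taking t* = 2.571 (the upper 0.025 critical value of t(5), from a t table):

```python
# 95% t-based CI for the boiling-point readings, treating sigma as unknown.
import math

readings = [102.5, 101.7, 103.1, 100.9, 100.5, 102.2]
n = len(readings)
xbar = sum(readings) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in readings) / (n - 1))  # about 0.985
t_star = 2.571                  # upper 0.025 critical value of t(5), from tables
margin = t_star * s / math.sqrt(n)
lower, upper = xbar - margin, xbar + margin
```

The t interval is wider than the z interval of the previous section, reflecting the extra uncertainty from estimating σ.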
7.2 Prediction Intervals


Predicting the next future observation with a 100(1 − α)% prediction interval

Suppose that X1, X2, …, Xn is a random sample from a normal population. We wish to predict the value Xn+1, a single future observation. A point prediction of Xn+1 is the sample mean X̄. The prediction error is Xn+1 − X̄. The expected value of the prediction error is

E(Xn+1 − X̄) = 0

and the variance of the prediction error is

V(Xn+1 − X̄) = σ²(1 + 1/n)

because the future observation Xn+1 is independent of the mean of the current sample X̄. The prediction error is normally distributed. Therefore,

Z = (Xn+1 − X̄) / (σ√(1 + 1/n))

has a standard normal distribution. Replacing σ with S results in

T = (Xn+1 − X̄) / (S√(1 + 1/n))

which has a t distribution with n − 1 degrees of freedom. Manipulating T as we have done previously in the development of a CI leads to a prediction interval on the future observation Xn+1.
Definition:

A 100(1 − α)% prediction interval on a single future observation from a normal distribution is given by

x̄ − t(α/2, n−1) · s√(1 + 1/n)  ≤  Xn+1  ≤  x̄ + t(α/2, n−1) · s√(1 + 1/n)
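A sketch of this definition as a small function. The inputs below reuse the sample statistics from the task-time example of Chapter VI purely as illustration, with t* = 2.571 taken from a t table:

```python
# 100(1 - alpha)% prediction interval for a single future observation
# from a normal population: xbar +/- t* * s * sqrt(1 + 1/n).
import math

def prediction_interval(xbar, s, n, t_star):
    """t_star is the upper alpha/2 critical value of t with n - 1 df."""
    half = t_star * s * math.sqrt(1 + 1 / n)
    return xbar - half, xbar + half

lo, hi = prediction_interval(xbar=19.33, s=2.94, n=6, t_star=2.571)
```

The √(1 + 1/n) factor, compared with the 1/√n of a CI, is why prediction intervals are always wider than confidence intervals on the mean.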
Example No. 1:
Reconsider the tensile adhesion tests on specimens of U-700 alloy described in
Example 8-4. The load at failure for specimens was observed, and we found that and .
The 95% confidence interval on was . We plan to test a twenty-third specimen. A 95%
prediction interval on the load at failure for this specimen is,

Notice that the prediction interval is considerably longer than the CI.
7.3 Tolerance Interval


Consider a population of semiconductor processors. Suppose that the speed of these processors has a normal distribution with mean 600 megahertz and standard deviation 30 megahertz. Then the interval from 600 − 1.96(30) = 541.2 to 600 + 1.96(30) = 658.8 megahertz captures the speed of 95% of the processors in this population, because the interval from −1.96 to 1.96 captures 95% of the area under the standard normal curve. The interval from 541.2 to 658.8 megahertz is called a tolerance interval.


If μ and σ are unknown, an interval of the form x̄ ± 1.96s may capture less than 95% of the values in the population because of sampling variability in x̄ and s.

A tolerance interval for capturing at least γ% of the values in a normal distribution with confidence level 100(1 − α)% is

x̄ − ks,  x̄ + ks

where k is a tolerance interval factor found in Appendix Table XI. Values are given for γ = 90%, 95%, and 99%, and for 95% and 99% confidence.

One-sided tolerance bounds can also be computed. The tolerance factors for these bounds are also given in Appendix Table XI.
Example No. 1:
Let’s reconsider the tensile adhesion tests. The load at failure for specimens was
observed, and we found that and . We want to find a tolerance interval for the load at
failure that includes 90% of the values in the population with 95% confidence. From
Appendix Table XI the tolerance factor k for , , and 95% confidence is The desired
tolerance interval is

which reduces to (23.67, 39.75). We can be 95% confident that at least 90% of the
values of load at failure for this particular alloy lie between 23.67 and 39.75
megapascals.
Reference:
Valerie J. Easton and John H. McColl's Statistics Glossary v1.1
Douglas C. Montgomery and George C. Runger:  Applied Statistics and Probability for
Engineers (Third Edition)
8.1. Hypothesis Testing


What is Hypothesis Testing?
Hypothesis testing was introduced by Ronald Fisher, Jerzy Neyman, Karl Pearson and Pearson's son, Egon Pearson. Hypothesis testing is a statistical method that is used in making statistical decisions using experimental data. A hypothesis itself is basically an assumption that we make about a population parameter.

A statistical hypothesis is an assumption about a population parameter. This assumption may or may not be true. Hypothesis testing refers to the formal procedures used by statisticians to accept or reject statistical hypotheses.

Statistical Hypotheses
The best way to determine whether a statistical hypothesis is true would be to
examine the entire population. Since that is often impractical, researchers typically
examine a random sample from the population. If sample data are not consistent with
the statistical hypothesis, the hypothesis is rejected.
There are two types of statistical hypotheses.

o Null hypothesis. The null hypothesis, denoted by Ho, is usually the hypothesis that sample
observations result purely from chance.
o Alternative hypothesis. The alternative hypothesis, denoted by H1 or Ha, is the hypothesis that
sample observations are influenced by some non-random cause.
Hypothesis testing is an act in statistics whereby an analyst tests an assumption
regarding a population parameter. The methodology employed by the analyst depends
on the nature of the data used and the reason for the analysis. Hypothesis testing is
used to infer the result of a hypothesis performed on sample data from a larger
population.

Hypothesis testing in statistics is a way for you to test the results of a survey or
experiment to see if you have meaningful results. You’re basically testing whether your
results are valid by figuring out the odds that your results have happened by chance. If
your results may have happened by chance, the experiment won’t be repeatable and so
has little use.
Hypothesis testing can be one of the most confusing aspects for students, mostly
because before you can even perform a test, you have to know what your null
hypothesis is. Often, those tricky word problems that you are faced with can be difficult
to decipher. But it’s easier than you think; all you need to do is:

1. Figure out your null hypothesis,
2. State your null hypothesis,
3. Choose what kind of test you need to perform, and
4. Either support or reject the null hypothesis.
What is a Hypothesis Statement?
           If you are going to propose a hypothesis, it’s customary to write a statement.
Your statement will look like this:
             “If I…(do this to an independent variable)….then (this will happen to the
dependent variable).”
A good hypothesis statement should:

o Include an “if” and “then” statement (according to the University of California).
o Include both the independent and dependent variables.
o Be testable by experiment, survey or other scientifically sound technique.
o Be based on information in prior research (either yours or someone else’s).
o Have design criteria (for engineering or programming projects).

What is the Null Hypothesis?


If you trace back the history of science, the null hypothesis is always the accepted
fact. Simple examples of null hypotheses that are generally accepted as being true are:

1. DNA is shaped like a double helix.
2. There are 8 planets in the solar system (excluding Pluto).
3. Taking Vioxx can increase your risk of heart problems (a drug now taken off the market).
     How do I State the Null Hypothesis?
You won’t be required to actually perform a real experiment or survey in
elementary statistics (or even disprove a fact like “Pluto is a planet”!), so you’ll be given
word problems from real-life situations. You’ll need to figure out what your hypothesis is
from the problem. This can be a little trickier than just figuring out what the accepted fact
is. With word problems, you are looking to find a fact that is nullifiable (i.e. something
you can reject).
8.1.1 One-tailed and Two-tailed Hypothesis


Critical Regions in a Hypothesis Test
In hypothesis tests, critical regions are ranges of the distributions where the
values represent statistically significant results. Analysts define the size and location of
the critical regions by specifying both the significance level (alpha) and whether the test
is one-tailed or two-tailed.
Consider the following two facts:

1. The significance level is the probability of rejecting a null hypothesis that is correct.
2. The sampling distribution for a test statistic assumes that the null hypothesis is correct.
Consequently, to represent the critical regions on the distribution for a test
statistic, you merely shade the appropriate percentage of the distribution. For the
common significance level of 0.05, you shade 5% of the distribution.

Two-Tailed Hypothesis Tests


Two-tailed hypothesis tests are also known as nondirectional and two-sided tests
because you can test for effects in both directions. When you perform a two-tailed test,
you split the significance level percentage between both tails of the distribution. In the
example below, I use an alpha of 5% and the distribution has two shaded regions of
2.5% (2 * 2.5% = 5%).

When a test statistic falls in either critical region, your sample data are sufficiently
incompatible with the null hypothesis that you can reject it for the population.
In a two-tailed test, the generic null and alternative hypotheses are the following:
Null: The effect equals zero.
Alternative:  The effect does not equal zero.
The specifics of the hypotheses depend on the type of test you perform because
you might be assessing means, proportions, or rates.
Advantages of two-tailed hypothesis tests

You can detect both positive and negative effects. Two-tailed tests are standard
in scientific research where discovering any type of effect is usually of interest to
researchers.

One-Tailed Hypothesis Tests


One-tailed hypothesis tests are also known as directional and one-sided tests
because you can test for effects in only one direction. When you perform a one-tailed
test, the entire significance level percentage goes into the extreme end of one tail of the
distribution.
In the examples below, I use an alpha of 5%. Each distribution has one shaded
region of 5%. When you perform a one-tailed test, you must determine whether the
critical region is in the left tail or the right tail. The test can detect an effect only in the
direction that has the critical region. It has absolutely no capacity to detect an effect in
the other direction.
In a one-tailed test, you have two options for the null and alternative hypotheses, which correspond to where you place the critical region.
You can choose either of the following sets of generic hypotheses:
Null: The effect is less than or equal to zero.
Alternative: The effect is greater than zero.
Or:
Null: The effect is greater than or equal to zero.
Alternative: The effect is less than zero.

Again, the specifics of the hypotheses depend on the type of test you perform.
Notice how for both possible null hypotheses the tests can’t distinguish between
zero and an effect in a particular direction. For example, in the example directly above,
the null combines “the effect is greater than or equal to zero” into a single category.
That test can’t differentiate between zero and greater than zero.

Advantages and disadvantages of one-tailed hypothesis tests


One-tailed tests have more statistical power to detect an effect in one direction
than a two-tailed test with the same design and significance level. One-tailed tests occur
most frequently for studies where one of the following is true:

1. Effects can exist in only one direction.
2. Effects can exist in both directions but the researchers only care about an effect in one direction. There is no drawback to failing to detect an effect in the other direction. (Not recommended.)
The disadvantage of one-tailed tests is that they have no statistical power to detect an effect in the other direction.
8.1.2 P-value in Hypothesis Test
The P value, or calculated probability, is the probability of finding the observed,
or more extreme, results when the null hypothesis (H0) of a study question is true – the
definition of ‘extreme’ depends on how the hypothesis is being tested. P is also
described in terms of rejecting H0 when it is actually true, however, it is not a direct
probability of this state.
The term significance level (alpha) is used to refer to a pre-chosen probability and
the term "P value" is used to indicate a probability that you calculate after a given study.
If your P value is less than the chosen significance level then you reject the null
hypothesis i.e. accept that your sample gives reasonable evidence to support the
alternative hypothesis. It does NOT imply a "meaningful" or "important" difference; that
is for you to decide when considering the real-world relevance of your result.
Type I error is the false rejection of the null hypothesis and type II error is the false acceptance of the null hypothesis. As an aide-mémoire: think that our cynical society rejects before it accepts.
The significance level (alpha) is the probability of type I error. The power of a test
is one minus the probability of type II error (beta). Power should be maximized when
selecting statistical methods.
The following table shows the relationship between power and error in hypothesis testing:

                        H0 is true                H0 is false
Reject H0               Type I error (alpha)      Correct decision (power = 1 - beta)
Fail to reject H0       Correct decision          Type II error (beta)
Notes about Type I error:

o is the incorrect rejection of the null hypothesis
o maximum probability is set in advance as alpha
o is not affected by sample size as it is set in advance
o increases with the number of tests or end points (i.e. do 20 tests of H0 and 1 is likely to be wrongly significant for alpha = 0.05)

Notes about Type II error:

o is the incorrect acceptance of the null hypothesis
o probability is beta
o beta depends upon sample size and alpha
o can't be estimated except as a function of the true population effect
o beta gets smaller as the sample size gets larger
o beta gets smaller as the number of tests or end points increases
8.1.3 General Procedure for Test of Hypothesis


1. From the problem context, identify the parameter of interest.
2. State the null hypothesis, H0.
3. Specify an appropriate alternative hypothesis, H1.
4. Choose a significance level, α.
5. Determine an appropriate test statistic.
6. State the rejection region for the statistic.
7. Compute any necessary sample quantities, substitute these into the equation for the test statistic, and compute that value.
8. Decide whether or not H0 should be rejected and report that in the problem context.
8.2 Test on the Mean of a Normal Distribution, Variance Known


Example No. 1:
Aircrew escape systems are powered by a solid propellant. The burning rate of this propellant is an important product characteristic. Specifications require that the mean burning rate must be 50 centimeters per second. We know that the standard deviation of burning rate is σ = 2 centimeters per second. The experimenter decides to specify a type I error probability or significance level of α = 0.05 and selects a random sample of n = 25 and obtains a sample average burning rate of x̄ = 51.3 centimeters per second. What conclusions should be drawn?
Solving the problem by following the eight-step procedure yields:

1. The parameter of interest is µ, the mean burning rate.
2. H0: µ = 50 centimeters per second
3. H1: µ ≠ 50 centimeters per second
4. α = 0.05
5. The test statistic is z0 = (x̄ − µ0)/(σ/√n).
6. Reject H0 if z0 > 1.96 or if z0 < −1.96. Note that this results from step 4, where we specified α = 0.05, and so the boundaries of the critical region are at z0.025 = 1.96 and −z0.025 = −1.96.
7. Computations: Since x̄ = 51.3 and σ = 2, z0 = (51.3 − 50)/(2/√25) = 3.25.
8. Conclusion: Since z0 = 3.25 > 1.96, we reject H0: µ = 50 at the 0.05 level of significance. Stated more completely, we conclude that the mean burning rate differs from 50 centimeters per second, based on a sample of 25 measurements. In fact, there is strong evidence that the mean burning rate exceeds 50 centimeters per second.
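Steps 5 through 8 can be verified in a few lines:

```python
# Check of Example No. 1: two-sided z test on the mean, sigma known.
import math

xbar, mu0, sigma, n = 51.3, 50.0, 2.0, 25
z0 = (xbar - mu0) / (sigma / math.sqrt(n))   # (51.3 - 50) / 0.4 = 3.25
reject = abs(z0) > 1.96                      # two-sided test at alpha = 0.05
```

Since 3.25 falls beyond the critical value 1.96, H0 is rejected, matching the conclusion above.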
8.3 Test on the Mean of a Normal Distribution, Variance Unknown


Example No. 2:
The increased availability of light materials with high strength has revolutionized
the design and manufacture of golf clubs, particularly drivers. Clubs with hollow heads
and very thin faces can result in much longer tee shots, especially for players of modest
skills. This is due partly to the “spring-like effect” that the thin face imparts to the ball.
Firing a golf ball at the head of the club and measuring the ratio of the outgoing velocity
of the ball to the incoming velocity can quantify this spring-like effect. The ratio of
velocities is called the coefficient of restitution of the club. An experiment was performed
in which 15 drivers produced by a particular club maker were selected at random and
their coefficients of restitution measured. In the experiment the golf balls were fired from
an air cannon so that the incoming velocity and spin rate of the ball could be precisely
controlled. It is of interest to determine if there is evidence (with α = 0.05) to support a
claim that the mean coefficient of restitution exceeds 0.82. The observations follow:
0.8411             0.8191             0.8182             0.8125             0.8750
0.8580             0.8532             0.8483             0.8276             0.7983
0.8042             0.8730             0.8282             0.8359             0.8660
The sample mean and sample standard deviation are x̄ = 0.83725 and s = 0.02456. The normal probability plot of the data supports the assumption that the
coefficient of restitution is normally distributed. Since the objective of the experiment is
to demonstrate that the mean coefficient of restitution exceeds 0.82, a one-sided
alternative hypothesis testing is appropriate.
Using the eight-step procedure for hypothesis testing:

1. The parameter of interest is the mean coefficient of restitution, µ.
2. H0: µ = 0.82
3. H1: µ > 0.82. We want to reject H0 if the mean coefficient of restitution exceeds 0.82.
4. α = 0.05
5. The test statistic is t0 = (x̄ − µ0)/(s/√n).
6. Reject H0 if t0 > t0.05,14 = 1.761.
7. Computations: Since x̄ = 0.83725, s = 0.02456, µ0 = 0.82, and n = 15, we have t0 = (0.83725 − 0.82)/(0.02456/√15) = 2.72.
8. Conclusions: Since t0 = 2.72 > 1.761, we reject H0 and conclude at the 0.05 level of significance that the mean coefficient of restitution exceeds 0.82.
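Since the raw observations are given, the whole computation can be reproduced from the data:

```python
# Check of Example No. 2: one-sided t test on the mean coefficient of restitution.
import math

cor = [0.8411, 0.8191, 0.8182, 0.8125, 0.8750,
       0.8580, 0.8532, 0.8483, 0.8276, 0.7983,
       0.8042, 0.8730, 0.8282, 0.8359, 0.8660]
n = len(cor)
xbar = sum(cor) / n                                        # about 0.8372
s = math.sqrt(sum((x - xbar) ** 2 for x in cor) / (n - 1)) # about 0.0246
t0 = (xbar - 0.82) / (s / math.sqrt(n))                    # about 2.72
reject = t0 > 1.761                                        # upper 0.05 critical value of t(14)
```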
8.4 Test on the Variance and Standard Deviation of a Normal Distribution
Example No. 3:
An automatic filling machine is used to fill bottles with liquid detergent. A random
sample of 20 bottles results in a sample variance of fill volume of s 2 = 0.0153 (fluid
ounces)2. If the variance of fill volume exceeds 0.01 (fluid ounces) 2, an unacceptable
proportion of bottles will be underfilled or overfilled. Is there evidence in the sample data
to suggest that the manufacturer has a problem with underfilled or overfilled bottles?
Use α = 0.05, and assume that fill volume has a normal distribution.
Using the eight-step procedure:

1. The parameter of interest is the population variance σ².
2. H0: σ² = 0.01
3. H1: σ² > 0.01
4. α = 0.05
5. The test statistic is χ0² = (n − 1)S²/σ0².
6. Reject H0 if χ0² > χ²0.05,19 = 30.14.
7. Computations: χ0² = 19(0.0153)/0.01 = 29.07.
8. Conclusions: Since χ0² = 29.07 < χ²0.05,19 = 30.14, we conclude that there is no strong evidence that the variance of fill volume exceeds 0.01 (fluid ounces)².
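A quick check of steps 5 through 8; the critical value 30.14 comes from a chi-square table:

```python
# Check of Example No. 3: chi-square test on a normal variance.
s2, sigma0_sq, n = 0.0153, 0.01, 20
chi2_0 = (n - 1) * s2 / sigma0_sq   # 19 * 0.0153 / 0.01 = 29.07
chi2_crit = 30.14                   # upper 0.05 critical value of chi-square(19)
reject = chi2_0 > chi2_crit         # False: fail to reject H0
```

Because 29.07 does not exceed 30.14, the data do not provide strong evidence against H0, matching the conclusion above.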

8.5 Test on a Population Proportion

Example No. 4:
A semiconductor manufacturer produces controllers used in automobile engine applications. The customer requires that the process fallout or fraction defective at a critical manufacturing step not exceed 0.05 and that the manufacturer demonstrate process capability at this level of quality using α = 0.05. The semiconductor manufacturer takes a random sample of 200 devices and finds that four of them are defective. Can the manufacturer demonstrate process capability for the customer?

Using the eight-step procedure:

1. The parameter of interest is the process fraction defective p.
2. H0: p = 0.05
3. H1: p < 0.05
This formulation of the problem will allow the manufacturer to make a strong claim about process capability if the null hypothesis H0: p = 0.05 is rejected.
4. α = 0.05
5. The test statistic is z0 = (x − np0)/√(np0(1 − p0)), where x = 4, n = 200, and p0 = 0.05.
6. Reject H0 if z0 < −z0.05 = −1.645.
7. Computations: The test statistic is z0 = (4 − 200(0.05))/√(200(0.05)(0.95)) = −1.95.
8. Conclusions: Since z0 = −1.95 < −z0.05 = −1.645, we reject H0 and conclude that the process fraction defective p is less than 0.05. The P-value for this value of the test statistic z0 is P = 0.0256, which is less than α = 0.05. We conclude that the process is capable.
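A verification sketch of the computation, including the P-value via the standard normal CDF (expressed with the standard library's erfc):

```python
# Check of Example No. 4: lower-tail z test on a proportion.
import math

x, n, p0 = 4, 200, 0.05
z0 = (x - n * p0) / math.sqrt(n * p0 * (1 - p0))  # about -1.95
p_value = 0.5 * math.erfc(-z0 / math.sqrt(2))     # Phi(z0), the lower-tail P-value
reject = z0 < -1.645
```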
References:
Majaski, C. (2019). Hypothesis Testing. Retrieved from https://www.investopedia.com/terms/h/hypothesistesting.asp
Statistics Solutions (2013). Hypothesis Testing. Retrieved from http://www.statisticssolutions.com/academic-solutions/resources/directory-of-statistical-analyses/hypothesis-testing/
Glen, S. (2019). Hypothesis Testing. Retrieved from https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/hypothesis-testing/
Frost, J. (2019). One-Tailed and Two-Tailed Hypothesis Tests Explained. Retrieved from https://statisticsbyjim.com/hypothesis-testing/one-tailed-two-tailed-hypothesis-tests/
9.1 Inference for a Difference in Means of Two Normal Distributions, Variances Known

Two Independent Populations
Assumptions

1. X11, X12, …, X1n1 is a random sample from population 1.
2. X21, X22, …, X2n2 is a random sample from population 2.
3. The two populations represented by X1 and X2 are independent.
4. Both populations are normal.

The quantity

Z = (X̄1 − X̄2 − (µ1 − µ2)) / √(σ1²/n1 + σ2²/n2)

has a N(0, 1) distribution.
 
Hypothesis Tests for a Difference in Means, Variances Known

Null hypothesis: H0: µ1 − µ2 = Δ0
Test statistic:

z0 = (x̄1 − x̄2 − Δ0) / √(σ1²/n1 + σ2²/n2)
Sample Problem:
A product developer is interested in reducing the drying time of a primer paint. Two formulations of the paint are tested; formulation 1 is the standard chemistry, and formulation 2 has a new drying ingredient that should reduce drying time. From experience, it is known that the standard deviation of drying time is 8 minutes, and this inherent variability should be unaffected by the addition of the new ingredient. Ten specimens are painted with formulation 1, and another 10 specimens are painted with formulation 2; the 20 specimens are painted in random order. The two sample average drying times are x̄1 = 121 minutes and x̄2 = 112 minutes, respectively. What conclusions can the product developer draw about the effectiveness of the new ingredient, using α = 0.05?

We apply the eight-step procedure to this problem as follows:

1. The quantity of interest is the difference in the mean drying times, µ1 − µ2.
2. H0: µ1 − µ2 = 0, that is, H0: µ1 = µ2.
3. H1: µ1 > µ2. We want to reject H0 if the new ingredient reduces mean drying time.
4. α = 0.05
5. The test statistic is z0 = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2), where σ1² = σ2² = (8)² = 64 and n1 = n2 = 10.
6. Reject H0 if z0 > z0.05 = 1.645.
7. Computations: Since x̄1 = 121 minutes and x̄2 = 112 minutes, the test statistic is z0 = (121 − 112) / √(64/10 + 64/10) = 2.52.
8. Conclusion: Since z0 = 2.52 > 1.645, we reject H0: µ1 = µ2 at the α = 0.05 level and conclude that adding the new ingredient to the paint significantly reduces the drying time. Alternatively, we can find the P-value for this test as P-value = 1 − Φ(2.52) = 0.0059. Therefore, H0: µ1 = µ2 would be rejected at any level of significance α ≥ 0.0059.
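A check of the computation and P-value for the paint example:

```python
# Two-sample z test with known, equal variances (paint drying example).
import math

x1bar, x2bar, sigma, n1, n2 = 121.0, 112.0, 8.0, 10, 10
z0 = (x1bar - x2bar) / math.sqrt(sigma**2 / n1 + sigma**2 / n2)  # about 2.52
p_value = 0.5 * math.erfc(z0 / math.sqrt(2))   # upper tail, 1 - Phi(z0)
reject = z0 > 1.645
```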
Type II Error and Choice of Sample Size

Sample size formula, one-sided alternative:

n = (zα + zβ)² (σ1² + σ2²) / (Δ − Δ0)²
9.2 Inference for a Difference in Means of Two Normal Distributions, Variances Unknown

Hypothesis Tests for a Difference in Means, Variances Unknown
Case 1: 
We test:

Combine   and  to form an estimator of 


 

The pooled estimator of  :

                                                            
 
Given the assumptions of the section, the quantity

                                                              
has a t distribution of n1 + n2 - 2 degrees of freedom.
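A minimal numerical sketch of the pooled procedure, using two made-up three-observation samples (not data from the text):

```python
from math import sqrt
from statistics import mean, variance  # variance() is the sample variance S^2

# Two small illustrative samples (made-up numbers, not from the text)
x1 = [2.0, 4.0, 6.0]
x2 = [1.0, 3.0, 5.0]
n1, n2 = len(x1), len(x2)

# Pooled estimator of the common variance sigma^2
sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)

# Pooled t statistic for H0: mu1 - mu2 = 0, with n1 + n2 - 2 df
t0 = (mean(x1) - mean(x2)) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))

print(round(sp2, 3))  # 4.0
print(round(t0, 3))   # 0.612
```

The resulting t0 would be compared against the t table with n1 + n2 − 2 = 4 degrees of freedom.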
Definition: The Two-Sample or Pooled t-Test

Null Hypothesis: H0: μ1 − μ2 = Δ0
Test statistic:

T0 = (X̄1 − X̄2 − Δ0) / (Sp √(1/n1 + 1/n2))

Alternative Hypothesis                          Rejection Criterion
H1: μ1 − μ2 ≠ Δ0                               t0 > tα/2, n1+n2−2 or t0 < −tα/2, n1+n2−2
H1: μ1 − μ2 > Δ0                               t0 > tα, n1+n2−2
H1: μ1 − μ2 < Δ0                               t0 < −tα, n1+n2−2

Case 2: σ1² ≠ σ2²

T0* = (X̄1 − X̄2 − Δ0) / √(S1²/n1 + S2²/n2)

is distributed approximately as a t distribution with degrees of freedom v given by

v = (S1²/n1 + S2²/n2)² / [ (S1²/n1)²/(n1 − 1) + (S2²/n2)²/(n2 − 1) ]
 
Type II Error and Choice of Sample Size

Case 1: σ1² = σ2² = σ². Operating characteristic curves are used with the parameter

d = |μ1 − μ2 − Δ0| / (2σ)

to read β for a given sample size, or the sample size needed to achieve a specified β.

Case 2: σ1² ≠ σ2². The sample size is chosen analogously, working from the approximate t procedure above.
9.3 Inferences on the Variances of Two Normal Distributions


Inferences on the Variances of Two Normal Distributions
The F Distribution
We wish to test the hypotheses:

H0: σ1² = σ2²        H1: σ1² ≠ σ2²

The development of a test procedure for these hypotheses requires a new probability
distribution, the F distribution.

Let W and Y be independent chi-square random variables with u and v degrees of freedom,
respectively. Then, the ratio

F = (W/u) / (Y/v)
has the probability density function

f(x) = [Γ((u + v)/2) (u/v)^(u/2) x^(u/2 − 1)] / [Γ(u/2) Γ(v/2) ((u/v)x + 1)^((u+v)/2)],    0 < x < ∞

and is said to follow the F distribution with u degrees of freedom in the numerator and v
degrees of freedom in the denominator. It is usually abbreviated as Fu,v.


 
 
9.4 Inference on Two Population Proportions


Inference on Two Population Proportions
Large-Sample Test on the Difference in Population Proportions
We wish to test the hypotheses:

H0: p1 = p2        H1: p1 ≠ p2

The following test statistic is distributed approximately as standard normal and is the
basis of the test:

Z = (P̂1 − P̂2) / √( P̂(1 − P̂)(1/n1 + 1/n2) ),    where P̂ = (X1 + X2)/(n1 + n2)
 
Large-Sample Test on the Difference in Population Proportions

Null Hypothesis: H0: p1 = p2
Test Statistic: Z0 = (P̂1 − P̂2) / √( P̂(1 − P̂)(1/n1 + 1/n2) )

Alternative Hypothesis                          Rejection Criterion
H1: p1 ≠ p2                                    z0 > zα/2 or z0 < −zα/2
H1: p1 > p2                                    z0 > zα
H1: p1 < p2                                    z0 < −zα
 
Type II Error and Choice of Sample Size

If the alternative hypothesis is two sided (H1: p1 ≠ p2),

β = Φ[ (zα/2 √(p̄q̄(1/n1 + 1/n2)) − (p1 − p2)) / σ_P̂1−P̂2 ] − Φ[ (−zα/2 √(p̄q̄(1/n1 + 1/n2)) − (p1 − p2)) / σ_P̂1−P̂2 ]

If the alternative hypothesis is one sided (H1: p1 > p2),

β = Φ[ (zα √(p̄q̄(1/n1 + 1/n2)) − (p1 − p2)) / σ_P̂1−P̂2 ]

For the two-sided alternative, the common sample size n = n1 = n2 is

n = [ zα/2 √((p1 + p2)(q1 + q2)/2) + zβ √(p1q1 + p2q2) ]² / (p1 − p2)²

where q1 = 1 − p1, q2 = 1 − p2, p̄ = (n1p1 + n2p2)/(n1 + n2), q̄ = 1 − p̄, and
σ_P̂1−P̂2 = √( p1(1 − p1)/n1 + p2(1 − p2)/n2 )
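A sketch of the large-sample test with made-up counts (x successes in n trials per group):

```python
from math import sqrt
from statistics import NormalDist

# Made-up counts: x successes out of n trials in each group
x1, n1 = 40, 100
x2, n2 = 25, 100

p1_hat, p2_hat = x1 / n1, x2 / n2
p_hat = (x1 + x2) / (n1 + n2)  # pooled estimate of p under H0: p1 = p2

z0 = (p1_hat - p2_hat) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
p_value = 2 * (1 - NormalDist().cdf(abs(z0)))  # two-sided

print(round(z0, 2))  # 2.26
```

With these counts the two-sided P-value falls below 0.05, so H0: p1 = p2 would be rejected at that level.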


 

10.1 Simple Linear Regression


Simple Linear Regression
When you plot two variables against each other in a scatter plot, the values usually
don’t fall exactly in a perfectly straight line. When you perform a linear regression
analysis, you attempt to find the line that best estimates the relationship between two
variables (the y, or dependent, variable, and the x, or independent, variable). The line
you find is called the fitted regression line, and the equation that specifies the line is
called the regression equation.
 
The Regression Equation
If the data in a scatter plot fall approximately in a straight line, you can use linear
regression to find an equation for the regression line drawn over the data. Usually, you
will not be able to fit the data perfectly, so some points will lie above and some below
the fitted regression line.
The regression line will have an equation of the form y = a + bx. Here y is the dependent
variable, the one you are trying to predict, and x is the independent, or predictor,
variable, the one that is doing the predicting. Finally, a and b are called coefficients.
Figure 10-1 shows a line with a = 10 and b = 2. The short vertical line segments
represent the errors, also called residuals, which are the gaps between the line and the
points. The residuals are the differences between the observed dependent values and
the predicted values. Because a is where the line intercepts the vertical axis, a is
sometimes called the intercept or constant term in the model. Because b tells how steep
the line is, b is called the slope. It gives the ratio between the vertical change and the
horizontal change along the line. Here y increases from 10 to 30 when x increases from
0 to 10, so the slope is

b = (30 − 10) / (10 − 0) = 2
Suppose that x is years on the job and y is salary. Then the y-intercept (x = 0) is the salary
for a person with zero years' experience, the starting salary. The slope is the change in
salary per year of service. A person with a salary above the line would have a positive
residual, and a person with a salary below the line would have a negative residual.
If the line trends downward so that y decreases when x increases, then the slope is
negative. For example, if x is age and y is price for used cars, then the slope gives the
drop in price per year of age. In this example, the intercept is the price when new, and
the residuals represent the difference between the actual price and the predicted price.
All other things being equal, if the straight line is the correct model, a positive residual
means a car costs more than it should, and a negative residual means a car costs less
than it should (that is, it’s a bargain).
 
Example 10-1: Determining If There is a Relationship
Is there a relationship between the alcohol content and the number of calories in 12-ounce beer?
To determine if there is one, a random sample was taken of beers' alcohol content and calories
("Calories in beer," 2011), and the data are in Table 10-1.
Solution: To aid in figuring out if there is a relationship, it helps to draw a scatter plot of
the data. It is helpful to state the random variables, and since in an algebra class the
variables are represented as x and y, those labels will be used here. State random variables:
x = alcohol content in the beer; y = calories in a 12-ounce beer.

This scatter plot looks fairly linear. However, notice that there is one beer in the list that
is actually considered a non-alcoholic beer. That value is probably an outlier since it is a
non-alcoholic beer. The rest of the analysis will not include O’Doul’s. You cannot just
remove data points, but in this case it makes more sense to, since all the other beers
have a fairly large alcohol content.
 A scatter plot is a graphical representation of the relation between two or more variables. In the
scatter plot of two variables x and y, each point on the plot is an x-y pair.
To find the equation for the linear relationship, the process of regression is used to find
the line that best fits the data (sometimes called the best fitting line). The process is to
draw the line through the data and then find the distances from a point to the line, which
are called the residuals. The regression line is the line that makes the square of the
residuals as small as possible, so the regression line is also sometimes called the least
squares line. The regression line and the residuals are displayed in figure 10-3.
                            

The independent variable, also called the explanatory variable or predictor variable, is


the x-value in the equation. The independent variable is the one that you use to predict what the
other variable is. The dependent variable depends on what independent value you pick. It also
responds to the explanatory variable and is sometimes called the response variable. In the
alcohol content and calorie example, it makes slightly more sense to say that you would use the
alcohol content on a beer to predict the number of calories in the beer.
 

10.2 Empirical Models


A.1 Empirical Models
 
FITTING THE REGRESSION LINE
Regression is the analysis of the relation between one variable and some other variable(s),
assuming a linear relation. It is also referred to as least squares regression and ordinary least
squares (OLS). In other books, Linear Regression refers to a group of techniques
for fitting and studying the straight-line relationship between two variables.
When fitting a line to data, you assume that the data follow the linear model (also known as the best
fitting line or least squares line):

y = α + βx + ε

where α is the "true" intercept, β is the "true" slope, and ε is an error term. When you fit the line,
you'll try to estimate α and β, but you can never know them exactly. The estimates of α and β,
we'll label a and b. The predicted values of y using these estimates, we'll label ŷ, so that

ŷ = a + bx

The residuals are the difference between the actual values and the estimated values:

e = y − ŷ

10.3 Regression: Modelling Linear Relationships - The Least Squares


Approach
B. Regression: Modelling Linear Relationships – The Least Squares Approach
To get estimates for α and β, we use the values of a and b that result in a minimum value for the
sum of squared residuals. In other words, if yi is an observed value of y, we want values
of a and b such that

Σ (yi − a − bxi)²

is as small as possible. This procedure is called the least-squares method. SS stands for "sum of
squares"; with the subscript xy you aren't literally summing squares, but the notation is analogous:

Sxx = Σ (xi − x̄)²        Sxy = Σ (xi − x̄)(yi − ȳ)

The values of a and b that result in the smallest possible sum for the squared residuals can be
calculated from the following formulas:

b = Sxy / Sxx        a = ȳ − b x̄

Note: the easiest way to find the regression equation is to use technology.
These are called the least-squares estimates.
 
LEAST SQUARES PRINCIPLE
The least squares principle is that the regression line is determined by minimizing the sum of
the squares of the vertical distances between the actual Y values and the predicted values of Y.

                                        
A line is fit through the XY points such that the sum of the squared residuals (that is, the sum of
the squared vertical distances between the observations and the line) is minimized.
 
For example, say our data set contains the values listed in Table 10-2:

                                        
The sample averages for x and y are 1.8 and 3.4, and the estimates for a and b are

b = Sxy / Sxx = 0.5        a = ȳ − b x̄ = 3.4 − (0.5)(1.8) = 2.5

Thus the least-squares estimate of the regression equation is y = 2.5 + 0.5x.
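Table 10-2 itself did not survive extraction, so the sketch below uses hypothetical data chosen to reproduce the stated summary values (x̄ = 1.8, ȳ = 3.4) and the estimates a = 2.5, b = 0.5:

```python
from statistics import mean

# Hypothetical data matching the text's summary values; not the actual Table 10-2
x = [1, 1, 2, 2, 3]
y = [3, 3, 3, 4, 4]

xbar, ybar = mean(x), mean(y)
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = Sxy / Sxx          # slope
a = ybar - b * xbar    # intercept

print(round(a, 3), round(b, 3))  # 2.5 0.5
```

Any data set with the same Sxx, Sxy, and sample means produces the same fitted line.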
 
Possible Uses of Linear Regression Analysis
 Montgomery (1982) outlines the following four purposes for running a regression analysis.
 Description
          The analyst is seeking to find an equation that describes or summarizes the relationship
between two variables. This purpose makes the fewest assumptions.
 Coefficient Estimation
          This is a popular reason for doing regression analysis. The analyst may have a theoretical
relationship in mind, and the regression analysis will confirm this theory. Most likely, there is
specific interest in the magnitudes and signs of the coefficients. Frequently, this purpose for
regression overlaps with others.
 Prediction
          The prime concern here is to predict the response variable, such as sales, delivery time,
efficiency, occupancy rate in a hospital, reaction yield in some chemical process, or strength of
some metal. These predictions may be very crucial in planning, monitoring, or evaluating some
process or system. There are many assumptions and qualifications that must be made in this
case. For instance, you must not extrapolate beyond the range of the data. Also, interval
estimates require that the normality assumption hold.
 Control
          Regression models may be used for monitoring and controlling a system. For example,
you might want to calibrate a measurement system or keep a response variable within certain
guidelines. When a regression model is used for control purposes, the independent variable
must be related to the dependent variable in a causal way. Furthermore, this functional
relationship must continue over time. If it does not, continual modification of the model must
occur.
 
Assumptions 
The following assumptions must be considered when using linear regression analysis.
 Linearity
          Linear regression models the straight-line relationship between Y and X. Any curvilinear
relationship is ignored. This assumption is most easily evaluated by using a scatter plot. This
should be done early on in your analysis. Nonlinear patterns can also show up in the residual plot.
A lack-of-fit test is also provided.
 Constant Variance
          The variance of the residuals is assumed to be constant for all values of X. This
assumption can be detected by plotting the residuals versus the independent variable. If these
residual plots show a rectangular shape, we can assume constant variance. On the other hand,
if a residual plot shows an increasing or decreasing wedge or bowtie shape, nonconstant
variance (heteroscedasticity) exists and must be corrected.
          The corrective action for nonconstant variance is to use weighted linear regression or to
transform either Y or X in such a way that variance is more nearly constant. The most
popular variance stabilizing transformation is to take the logarithm of Y.
 Special Causes
          It is assumed that all special causes, outliers due to one-time situations, have been
removed from the data. If not, they may cause nonconstant variance, nonnormality, or other
problems with the regression model. The existence of outliers is detected by considering scatter
plots of Y and X as well as the residuals versus X. Outliers show up as points that do not follow
the general pattern.
 Normality
          When hypothesis tests and confidence limits are to be used, the residuals are assumed to
follow the normal distribution.
 Independence
          The residuals are assumed to be uncorrelated with one another, which implies that
the Y’s are also uncorrelated. This assumption can be violated in two ways: model
misspecification or time-sequenced data.

1. Model misspecification. If an important independent variable is omitted or if an incorrect


functional form is used, the residuals may not be independent. The solution to this
dilemma is to find the proper functional form or to include the proper independent
variables and use multiple regression.
2. Time-sequenced data. Whenever regression analysis is performed on data taken over
time, the residuals may be correlated. This correlation among residuals is called serial
correlation. Positive serial correlation means that the residual in time period j tends to
have the same sign as the residual in time period (j - k), where k is the lag in time
periods. On the other hand, negative serial correlation means that the residual in time
period j tends to have the opposite sign as the residual in time period (j - k).
The presence of serial correlation among the residuals has several negative impacts.

1. The regression coefficients remain unbiased, but they are no longer efficient, i.e.,
minimum variance estimates.
2. With positive serial correlation, the mean square error may be seriously underestimated.
The impact of this is that the standard errors are underestimated, the t-tests are inflated
(show significance when there is none), and the confidence intervals are shorter than
they should be.
3. Any hypothesis tests or confidence limits that require the use of the t or F distribution are
invalid. You could try to identify these serial correlation patterns informally, with the
residual plots versus time. A better analytical way would be to use the Durbin-Watson
test to assess the amount of serial correlation.
 
Example 10-2: Calculating the Regression Equation with the Formula
Is there a relationship between the alcohol content and the number of calories in 12-ounce
beer? To determine if there is one, a random sample was taken of beers' alcohol content and
calories ("Calories in beer," 2011). Find the regression equation from the formula.
                            

Solution:  State random variables


                          x = alcohol content in the beer          y = calories in 12 ounce beer
          

10.4 Correlation: Estimating the Strength of Linear Relation


C. Correlation: Estimating the Strength of Linear Relation
CORRELATION
 The value of the slope in our regression equation is a product of the scale in which we measure
our data. If, for example, we had chosen to express the temperature values in degrees
Centigrade, we would naturally have a different value for the slope (though, of course, the
statistical significance of the regression would not change). Sometimes, it’s an advantage to
express the strength of the relationship between one variable and another in a dimensionless
number, one that does not depend on scale. One such value is the correlation coefficient, or
simply the correlation. The correlation expresses the strength of the relationship on a scale
ranging from -1 to 1. A positive correlation indicates a positive relationship, in which an
increase in the value of one variable implies an increase in the value of the second variable.
This might occur in the relationship between height and weight. A negative correlation indicates
that an increase in the first variable signals a decrease in the second variable. An increase in
price for an object could be negatively correlated with sales. See Figure 10-4. A correlation of
zero does not imply there is no relationship between the two variables. One can construct a
nonlinear relationship that produces a correlation of zero.
        

A correlation exists between two variables when the values of one variable are somehow
associated with the values of the other variable. When you see a pattern in the data you say
there is a correlation in the data. Though this book is only dealing with linear patterns, patterns
can be exponential, logarithmic, or periodic. To see this pattern, you can draw a scatter plot of
the data. Remember to read graphs from left to right, the same as you read words. If the graph
goes up the correlation is positive and if the graph goes down the correlation is
negative. The words “weak”, “moderate”, and “strong” are used to describe the strength of the
relationship between the two variables.
              
 
The correlation is a parameter of the bivariate normal distribution. This distribution is used to
describe the association between two variables. This association does not include a cause and
effect statement. That is, the variables are not labeled as dependent and independent. One
does not depend on the other. Rather, they are considered as two random variables that seem
to vary together. The important point is that in linear regression, Y is assumed to be a
random variable and X is assumed to be a fixed variable. In correlation analysis, both Y
and X are assumed to be random variables.
The linear correlation coefficient is a number that describes the strength of the linear
relationship between the two variables. It is also called the Pearson correlation
coefficient after Karl Pearson who developed it. It is the most often used measure of
correlation. The symbol for the sample linear correlation coefficient is r. The symbol for the
population correlation coefficient is ρ (Greek letter rho).
 
THE CORRELATION COEFFICIENT, r 
The correlation coefficient can be interpreted in several ways. Here are some of the
interpretations.

1. If both Y and X are standardized by subtracting their means and dividing by their standard
deviations, the correlation is the slope of the regression of the standardized Y on the
standardized X.
2. The correlation is the standardized covariance between Y and X.
3. The correlation is the geometric average of the slopes of the regressions of Y on X and of X on
Y.
4. The correlation is the square root of R-squared, using the sign from the slope of the regression
of Y on X.
 
The corresponding formulas for the calculation of the correlation coefficient, r, are

r = Sxy / √(Sxx Syy) = sXY / (sX sY) = sign(bYX) √(bXY bYX)

where sXY is the covariance between X and Y, bXY is the slope from the regression of X on Y,
and bYX is the slope from the regression of Y on X. sXY is calculated using the formula

sXY = Σ (xi − x̄)(yi − ȳ) / (n − 1)

The population correlation coefficient, ρ, is defined for two random variables, U and W, as
follows:

ρ = Cov(U, W) / (σU σW)

Example 10-3: the correlation of the data in Table 10-2 is


Table 10-2 Data for Least Squares Estimates

                                      
            
 
Example 10-4: Calculating the correlation coefficient

Calculations:
Assumptions of linear correlation are the same as the assumptions for the regression
line:
a. The set (x, y) of ordered pairs is a random sample from the population of all such possible (x,
y) pairs. b. For each fixed value of x, the y-values have a normal distribution. All of the y
distributions have the same variance, and for a given x-value, the distribution of y-values has a
mean that lies on the least squares line. You also assume that for a fixed y, each x has its own
normal distribution. This is difficult to figure out, so you can use the following to determine if you
have a normal distribution.

1. Look to see if the scatter plot has a linear pattern.


2. Examine the residuals to see if there is randomness in the residuals. If there is a pattern
to the residuals, then there is an issue in the data.
Facts about the Correlation Coefficient
The correlation coefficient has the following characteristics.

1. The range of r is between -1 and 1, inclusive.

2. If r = 1, the observations fall on a straight line with positive slope.
3. If r = -1, the observations fall on a straight line with negative slope.
4. If r = 0, there is no linear relationship between the two variables.
5. r is a measure of the linear (straight-line) association between two variables.
6. The value of r is unchanged if either X or Y is multiplied by a constant or if a constant is added.
7. The physical meaning of r is mathematically abstract and may not be very helpful. However, we
provide it for completeness. The correlation is the cosine of the angle formed by the intersection
of two vectors in N-dimensional space. The components of the first vector are the values
of X while the components of the second vector are the corresponding values of Y. These
components are arranged so that the first dimension corresponds to the first observation, the
second dimension corresponds to the second observation, and so on.
 

10.5 Hypothesis Tests in Simple Linear Regression


D. Hypothesis Tests in Simple Linear Regression
D.1 Use of t-tests
T-Value
These are the t-test values for testing the hypotheses that the intercept and the slope are zero
versus the alternative that they are nonzero. These t-values have N - 2 degrees of freedom.
To test that the slope is equal to a hypothesized value other than zero, inspect the confidence
limits. If the hypothesized value is outside the confidence limits, the hypothesis is rejected.
Otherwise, it is not rejected.
In order to calculate hypothesis tests, it is assumed that the errors are independent and
normally distributed with mean zero and variance σ2.
The hypotheses of interest regarding the population correlation, ρ, are:
Null hypothesis   H0: ρ = 0
In other words, there is no correlation between the two variables
Alternative hypothesis Ha: ρ ≠ 0
In other words, there is a correlation between the two variables
The test statistic is t-distributed with N - 2 degrees of freedom:

t = r √(N − 2) / √(1 − r²)
To make a decision, compare the calculated t-statistic with the critical t-statistic for the
appropriate degrees of freedom and level of significance.
 
Example 10-5: In the previous example,
r = 0.475
N = 10

t = 0.475 √8 / √(1 − 0.475²) ≈ 1.53, with 8 degrees of freedom.

Example 10-6: Suppose the correlation coefficient is 0.2 and the number of observations is 32.
What is the calculated test statistic? Is this significant correlation using a 5% level of
significance?
 
Solution
Hypotheses:
H0: ρ = 0
            Ha: ρ ≠ 0  
Calculated t-statistic:

t = r √(N − 2) / √(1 − r²) = 0.2 √30 / √(1 − 0.2²) = 1.11803

Degrees of freedom = 32 - 2 = 30


The critical t-value for a 5% level of significance and 30 degrees of freedom is 2.042. Therefore,
we fail to reject H0 and conclude that there is no significant correlation (1.11803 falls between
the two critical values of -2.042 and +2.042).
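The computation in Example 10-6 can be reproduced directly in Python; the critical value 2.042 is taken from the text:

```python
from math import sqrt

r, N = 0.2, 32
df = N - 2                      # 30 degrees of freedom

# Test statistic for H0: rho = 0 vs Ha: rho != 0
t0 = r * sqrt(df) / sqrt(1 - r ** 2)

t_crit = 2.042                  # t_{0.025, 30}, as quoted in the text

print(round(t0, 5))             # 1.11803
print(abs(t0) > t_crit)         # False: fail to reject H0
```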
 
D.2 Analysis of Variance Approach to Test Significance of Regression
Estimated Variances
An estimate of the variance of the residuals is computed using

s² = Σ (yj − ŷj)² / (N − 2)

An estimate of the variance of the regression coefficients is calculated using

ῡ(b) = s² / Sxx        ῡ(a) = s² (1/N + x̄²/Sxx)

An estimate of the variance of the predicted mean of Y at a specific value of X, say X0, is given
by

ῡ(Ŷ0) = s² (1/N + (X0 − x̄)²/Sxx)

An estimate of the variance of the predicted value of Y for an individual for a specific value of X,
say X0, is given by

ῡ(Ŷ0(new)) = s² (1 + 1/N + (X0 − x̄)²/Sxx)
 
Example 10-4:  Inference for Regression and Correlation
How do you really say you have a correlation? Can you test to see if there really is a
correlation? Of course, the answer is yes. The hypothesis test for correlation is as follows:
Hypothesis Test for Correlation:
1. State the random variables in words.
           x = independent variable
           y = dependent variable
2. State the null and alternative hypotheses and the level of significance
           H0 : ρ = 0 (There is no correlation)
           HA  : ρ ≠ 0 (There is a correlation)
           Or
           HA  : ρ < 0 (There is a negative correlation)
          Or
          HA  : ρ > 0 (There is a positive correlation)
          Also, state your α level here.
3. State and check the assumptions for the hypothesis test
          The assumptions for the hypothesis test are the same assumptions for regression and
correlation.
4. Find the test statistic

t = r √(n − 2) / √(1 − r²)    with degrees of freedom = df = n - 2

5. Find the p-value:
Using the TI-83/84: tcdf(lower limit, upper limit, df )
(Note: if HA : ρ < 0 , then lower limit is −1E99 and upper limit is your test statistic.
If HA : ρ > 0 , then lower limit is your test statistic and the upper limit is 1E99 .
If HA : ρ ≠ 0 , then find the p-value for HA : ρ < 0 , and multiply by 2.)
Using R: pt(t,df )
(Note: if HA : ρ < 0 , then use pt(t,df ), If HA : ρ > 0 , then use 1− pt(t, df ).
If HA : ρ ≠ 0 , then find the p-value for HA : ρ < 0 , and multiply by 2.)
6. Conclusion
           This is where you write reject Ho or fail to reject Ho. The rule is: if the p-value < α , then
reject Ho. If the p-value ≥α , then fail to reject Ho
7. Interpretation
           This is where you interpret in real world terms the conclusion to the test. The conclusion
for a hypothesis test is that you either have enough evidence to show HA is true, or you do not
have enough evidence to show HA is true.
           Note: the TI-83/84 calculator results give you the test statistic and the p-value. In R, the
command for getting the test statistic and p-value is cor.test(independent variable, dependent
variable, alternative = "less" or "greater"). Use less for HA:ρ < 0 , use greater for HA :ρ > 0 , and
leave off this command for HA :ρ ≠ 0 .
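In Python, the analog of R's cor.test is scipy.stats.pearsonr, which returns r and the two-sided p-value for H0: ρ = 0; the data below are hypothetical, not the beer data from the text:

```python
from scipy import stats

# Hypothetical data (the text's beer data did not survive extraction)
alcohol = [4.2, 4.7, 5.0, 5.5, 5.9, 6.2, 6.9, 7.2]
calories = [140, 150, 155, 160, 170, 175, 190, 200]

# Returns the sample correlation r and the two-sided p-value
r, p_two_sided = stats.pearsonr(alcohol, calories)

print(round(r, 3), round(p_two_sided, 4))
```

For a one-sided alternative, halve the two-sided p-value when the sign of r matches the direction of Ha, mirroring the TI-83/84 note above.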
 

10.6 Prediction of New Observations


E. Prediction of New Observations
A common objective in regression analysis is to estimate the mean for one or more
distributions of Y.
Let Xh denote the level of X for which we wish to estimate the mean response.
The mean response when X = Xh is denoted by E(Yh), and its point estimate Ŷh is:
Ŷh = bo + b1 Xh
The prediction of a new observation Y corresponding to a given level of X of the
predicted variable is viewed as the result of a new trial, independent of the trials on which the
regression analysis is based.
In the estimation of the mean response, we estimate the mean of the distribution Y. In
the prediction of new observation, we predict an individual outcome drawn from the distribution
of Y.
The prediction of a new observation when X = Xh is denoted by Ŷh(new), and its point estimate is:
Ŷh(new) = bo + b1 Xh
 

The sampling distribution of Ŷh is normal, with mean E(Yh) and estimated variance

ῡ(Ŷh) = s² (1/n + (Xh − x̄)²/Sxx)

Two sources of variations in the standard deviation of prediction:

1. Variation in possible location of the distribution of Y


2. Variation within the probability distribution of Y
Note that the first source is the only source of variation when estimating the
mean response.
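The two sources of variation explain why a prediction interval for Yh(new) is wider than a confidence interval for the mean response E(Yh). A sketch with made-up data, using the variance formulas from D.2 (the critical value t0.025,3 = 3.182 is a standard table value):

```python
from math import sqrt
from statistics import mean

# Made-up data for illustration
x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 3.6, 4.4, 5.1]
n = len(x)

xbar, ybar = mean(x), mean(y)
Sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
b0 = ybar - b1 * xbar

# Residual variance estimate s^2 with n - 2 degrees of freedom
s2 = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)

xh = 3.5
yh_hat = b0 + b1 * xh
t = 3.182  # t_{0.025, 3} from tables

# Half-widths: CI for the mean response vs PI for a new observation.
# The PI adds the "+1" term for variation within the distribution of Y.
half_ci = t * sqrt(s2 * (1 / n + (xh - xbar) ** 2 / Sxx))
half_pi = t * sqrt(s2 * (1 + 1 / n + (xh - xbar) ** 2 / Sxx))

print(half_ci < half_pi)  # True: a single new observation is harder to pin down
```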

10.7 Adequacy of the Regression Model


F. Adequacy of the Regression Model
F.1 Residual Analysis
Residual
The residual is the difference between the actual Y value and the Y value predicted by the
estimated regression model. It is also called the error, the deviate, or the discrepancy.
ej = yj − Ŷj
Although the true errors, εj , are assumed to be independent, the computed residuals,
ej , are not. Although the lack of independence among the residuals is a concern in developing
theoretical tests, it is not a concern on the plots and graphs.
The variance of the εj is σ². However, the variance of the ej is not σ². In vector notation,
the covariance matrix of e is given by
 
V(e) = σ2{I-W1/2X(X’WX)-1X’W1/2}
= σ2 (I-H)
 
The matrix H is called the hat matrix since it puts the 'hat' on y, as is shown in the
unweighted case.
Ŷ = XB
= X(X’X)-1X’Y
= HY
 
Hence, the variance of ej is given by
 
ῡ(e) = σ2 (1 – hjj)
 
where hjj is the jth diagonal element of H. This variance is estimated using
 
ῡ(ej) = s2 (1 – hjj)
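The hat-matrix algebra above can be checked numerically in the unweighted case; the data here are made-up:

```python
import numpy as np

# Made-up data; unweighted case, so H = X (X'X)^{-1} X'
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 2.5, 3.9, 4.1, 5.6])
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept

H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix
y_hat = H @ y                              # "puts the hat on y"
e = y - y_hat                              # residuals

n, p = X.shape
s2 = (e @ e) / (n - p)                     # residual variance estimate
h = np.diag(H)                             # the h_jj diagonal elements

# Standardized residuals: rescaled to constant variance for comparison
r = e / np.sqrt(s2 * (1 - h))

print(np.allclose(H @ H, H))  # True: H is idempotent
```

The diagonal elements h_jj are exactly the quantities appearing in ῡ(ej) = s²(1 − hjj) above.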
Standardized Residual
As shown above, the variance of the observed residuals is not constant. This makes
comparisons among the residuals difficult. One solution is to standardize the residuals by
dividing by their standard deviations. This will give a set of residuals with constant variance.
                      
The formula for this residual is

rj = ej / √( s² (1 − hjj) )
Modified Residuals
Davison and Hinkley (1999), page 279, recommend the use of a special rescaling of the
residuals when bootstrapping to keep results unbiased. These modified residuals are calculated
using

mj = ej / √(1 − hjj)
 
F.2 Coefficient of Determination
There are times when the degree of linear association is of interest in its own right. Here we discuss
descriptive measures of the degree of linear association between Y and X.

Partitioning of the Total Sum of Squares:
 Total Sum of Squares: SST = Σ (Yi − Ȳ)²
          - Error Sum of Squares: SSE = Σ (Yi − Ŷi)²
          - Regression Sum of Squares: SSR = Σ (Ŷi − Ȳ)²
The coefficient of determination r² is defined as

r² = SSR / SST
   = 1 − (SSE / SST)

Since 0 ≤ SSE ≤ SST, it follows that

0 ≤ r² ≤ 1

Interpret r² as the proportionate reduction of total variation associated with the use of the
predictor variable X.

The limiting values of r² occur as follows:

1. When all observations fall on the fitted regression line, then SSE = 0 and r² = 1.
2. When the fitted regression line is horizontal so that b = 0 and Ŷ ≡ Ȳ, then SSE = SST and r² = 0.
 
Sample Problem
Find the coefficient of determination for the variation in calories that is explained by the linear
relationship between alcohol content and calories, and interpret the value.
Solution:
From the calculator results,
r2 = 0.8344
Using R, you can compute (cor(independent variable, dependent variable))^2. Here that would
be (cor(alcohol, calories))^2, and the output would be
[1] 0.8343751
Or you can just use a calculator and square the correlation value. Thus, 83.44% of the
variation in calories is explained by the linear relationship between alcohol content and calories.
The other 16.56% of the variation is due to other factors. A really good fit leaves only a very
small unexplained part.
 
Chapter X
10.8 Correlation
G. Correlation
A correlation exists between two variables when the values of one variable are somehow
associated with the values of the other variable. When you see a pattern in the data you say
there is a correlation in the data. Though this book is only dealing with linear patterns, patterns
can be exponential, logarithmic, or periodic. To see this pattern, you can draw a scatter plot of
the data.
Remember to read graphs from left to right, the same as you read words. If the graph goes up,
the correlation is positive, and if the graph goes down, the correlation is negative. The words
“weak”, “moderate”, and “strong” are used to describe the strength of the relationship between
the two variables.
Correlation does not imply causation. We may say that two variables X and Y are correlated, but
that does not mean that X causes Y or that Y causes X – they simply are related or associated
with one another.
The linear correlation coefficient is a number that describes the strength of the linear
relationship between the two variables. It is also called the Pearson correlation coefficient after
Karl Pearson who developed it. The symbol for the sample linear correlation coefficient is r. The
symbol for the population correlation coefficient is ρ (Greek letter rho).
Interpretation: the correlation coefficient r is always between −1 and 1. r = −1 means
there is a perfect negative linear correlation and r = 1 means there is a perfect positive linear
correlation. The closer r is to 1 or −1, the stronger the correlation.
Careful: r = 0 does not mean there is no correlation. It just means there is no linear
correlation. There might be a very strong curved pattern.
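This caution can be demonstrated directly: for a perfect quadratic pattern, the linear correlation coefficient is exactly zero even though y is completely determined by x. A minimal Python sketch of the Pearson formula:

```python
import math

# A perfect quadratic relationship: y is fully determined by x, yet r = 0
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]  # 4, 1, 0, 1, 4

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# Pearson correlation: r = Sxy / sqrt(Sxx * Syy)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x))
sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y))

r = sxy / (sx * sy)  # → 0.0: no linear correlation, but a strong curved pattern
```

So r = 0 rules out a linear pattern only; always look at a scatter plot before concluding there is no relationship.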
 
Sample Problem. Calculating the Linear Correlation Coefficient, r
How strong is the positive relationship between the alcohol content and the number of calories
in 12-ounce beer? To determine if there is a positive linear correlation, a random sample was
taken of beer’s alcohol content and calories for several different beers ("Calories in beer,"
2011), and the data are in table #10.2.1. Find the correlation coefficient and interpret that value.
[Table #10.2.1: alcohol content and calories for the sampled beers]
Solution:
 
State random variables
x= alcohol content in the beer
y= calories in 12 ounce beer
 
Assumptions check:
From the earlier example problem, the assumptions have been met.
 
To compute the correlation coefficient using the TI-83/84 calculator, use the LinRegTTest in the
STAT menu. The setup is in figure 10.2.2. The reason that >0 was chosen is that the
question asked whether there was a positive correlation. If you are asked whether there is a negative
correlation, pick <0. If you are simply asked whether there is a correlation, pick ≠ 0. Right now
the choice will not make a difference, but it will be important later.
 
The correlation coefficient is r= 0.913. This is close to 1, so it looks like there is
a strong, positive correlation.
Sample Problem. Using the Formula to Calculate r and r²
 
How strong is the relationship between the alcohol content and the number of
calories in 12-ounce beer? To determine if there is a positive linear correlation, a random
sample was taken of beer’s alcohol content and calories for several different beers ("Calories in
beer," 2011), and the data are in table #10.7.1. Find the correlation coefficient and the
coefficient of determination using the formula.
 
Solution:
From the earlier computations,
SSx = 12.45, SSy = 10335.5556, SSxy = 327.6667

Correlation coefficient:
r = SSxy / √(SSx · SSy)
= 327.6667 / √(12.45 × 10335.5556)
≈ 0.913

Coefficient of determination:
r² = (r)² = (0.913)² ≈ 0.834
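These hand calculations can be checked with a short Python sketch using the summary statistics given above:

```python
import math

# Summary statistics from the text
SSx = 12.45
SSy = 10335.5556
SSxy = 327.6667

# r = SSxy / sqrt(SSx * SSy)
r = SSxy / math.sqrt(SSx * SSy)
r2 = r ** 2

print(round(r, 3))   # 0.913
print(round(r2, 3))  # 0.834
```

The results agree with the calculator values to three decimal places.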
 
References:
- https://ncss-wpengine.netdna-ssl.com/wp-content/themes/ncss/pdf/Procedures/NCSS/Linear_Regression_and_Correlation.pdf
- https://www.coconino.edu/resources/files/pdfs/academics/sabbatical-reports/kate-kozak/chapter_10.pdf
- http://educ.jmu.edu/~drakepp/FIN360/readings/Regression_notes.pdf
- http://www.engineeringbookspdf.com/data-analysis-with-microsoft-excel/
The lower-tail percentage points of the F distribution can be found from the upper-tail points using

f1−α, u, v = 1 / fα, v, u
Hypothesis Tests on the Ratio of Two Variances

Null Hypothesis: H0: σ1² = σ2²
Test statistic: F0 = S1² / S2², which under H0 follows an F distribution with n1 − 1 and n2 − 1 degrees of freedom.

Alternative Hypotheses:        Rejection Criterion:
H1: σ1² ≠ σ2²                  f0 > fα/2, n1−1, n2−1  or  f0 < f1−α/2, n1−1, n2−1
H1: σ1² > σ2²                  f0 > fα, n1−1, n2−1
H1: σ1² < σ2²                  f0 < f1−α, n1−1, n2−1
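As a sketch of the test statistic only (finding the F critical values requires tables or a statistics package, which the Python standard library does not provide), the ratio of two sample variances can be computed with the stdlib `statistics` module. The two samples below are hypothetical illustrative data, not from the text:

```python
import statistics

# Hypothetical measurements from two processes (illustrative data only)
sample1 = [2.1, 2.4, 1.9, 2.6, 2.3, 2.2]
sample2 = [2.0, 2.1, 2.2, 2.0, 2.1, 2.1]

# Sample variances (n - 1 in the denominator)
s1_sq = statistics.variance(sample1)
s2_sq = statistics.variance(sample2)

# Test statistic for H0: sigma1^2 = sigma2^2,
# with n1 - 1 and n2 - 1 degrees of freedom
f0 = s1_sq / s2_sq
```

The computed f0 would then be compared against the appropriate F percentage point from the rejection criteria above.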
 
 
 