Engineering Data Analysis: Instructional Materials in STAT 20023
For the sole noncommercial use of the
Faculty of the Department of Mathematics and Statistics
Polytechnic University of the Philippines
2020
Contributors:
Elizon, Katrina
Usona, Laurence
Aranas, Peter John
Bautista, Lincoln
Baccay, Edcon
Republic of the Philippines
POLYTECHNIC UNIVERSITY OF THE PHILIPPINES
COLLEGE OF SCIENCE
Department of Mathematics and Statistics
The final grade will be based on the weighted average of the student’s scores on each test
assigned at the end of each lesson. The final SIS grade equivalent will be based on the
following table according to the approved University Student Handbook.
Midterm and/or Final Exam (MFE) = ((Weighted Average of the Midterm and/or Final Tests) x 50) + 50
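As a small worked illustration of the MFE formula, the sketch below assumes that the weighted average is expressed as a proportion between 0 and 1 (the module does not state the scale explicitly). The function name `mfe_grade` is hypothetical.

```r
# Assumption: weighted_average is a proportion in [0, 1].
mfe_grade <- function(weighted_average) {
  stopifnot(weighted_average >= 0, weighted_average <= 1)
  weighted_average * 50 + 50
}

# A weighted average of 0.80 gives 0.80 * 50 + 50 = 90.
print(mfe_grade(0.80))

# The formula maps the range 0..1 onto 50..100.
print(mfe_grade(c(0, 0.5, 1)))
```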
Final Grade
SIS Grade Description
Equivalent
INC Incomplete
W Withdrawn
Prepared by:
Katrina D. Elizon
Faculty Member, Department of Mathematics and Statistics
College of Science
Contents
Exercises:
A research objective is presented. For each research objective, identify the population and sample in the study.
1. The Philippine Mental Health Association contacts 1,028 teenagers who are 13 to 17 years of age and live in Antipolo City and asks whether or not they have been prescribed medications for any mental disorders, such as depression or anxiety.
Answer:
Population: Teenagers 13 to 17 years of age who live in Antipolo City.
Sample: The 1,028 teenagers 13 to 17 years of age who live in Antipolo City and were contacted.
2. A farmer wanted to learn about the weight of his soybean crop. He randomly sampled 100 plants and weighed the soybeans on each plant.
Answer:
Population: Entire soybean crop.
Sample: The 100 soybean plants selected.
Understand the Process of Statistics
3. Organize and summarize the information.
This step in the process is referred to as descriptive statistics. Descriptive statistics describe the information collected through numerical measurements, charts, graphs, and tables. The main purpose of descriptive statistics is to provide an overview of the information collected.
4. Draw conclusions from the information.
In this step the information collected from the sample is generalized to the population. This process is referred to as inferential statistics. Inferential statistics uses methods that take results obtained from a sample, extend them to the population, and measure the reliability of the result.
Exercises:
For each of the following statements, decide whether it belongs to the field of descriptive statistics or inferential statistics.
1. A badminton player wants to know his average score for the past 10 games.
Answer: Descriptive Statistics
2. A car manufacturer wishes to estimate the average lifetime of batteries by testing a sample of 50 batteries.
Answer: Inferential Statistics
Distinction between Discrete and Continuous Variables
Quantitative variables may be further classified into:
A discrete variable is a quantitative variable that has either a finite number of possible values or a countable number of possible values. The term countable means that the values result from counting, such as 0, 1, 2, 3, and so on.
A continuous variable is a quantitative variable that has an infinite number of possible values that are not countable.
Exercises:
Determine whether the following quantitative variables are discrete or continuous.
1. The number of heads obtained after flipping a coin five times. Answer: Discrete
2. The number of cars that arrive at a McDonald's drive-through between 12:00 P.M. and 1:00 P.M. Answer: Discrete
3. The distance a 2005 Toyota Prius can travel in city conditions with a full tank of gas. Answer: Continuous
4. Number of words correctly spelled. Answer: Discrete
5. Time of a runner to finish one lap. Answer: Continuous
Exercises:
Categorize each of the following as nominal, ordinal, interval or ratio measurement.
1. Ranking of college athletic teams Ans. Ordinal
2. Employee number Ans. Nominal
3. Number of vehicles registered Ans. Ratio
Data Collection
Data collection is the process of gathering and measuring information on variables of interest, in an established systematic fashion that enables one to answer stated research questions, test hypotheses, and evaluate outcomes.
Importance of Data Collection
✦ Data empowers you to make informed decisions.
Methods of Collecting Secondary Data
The secondary data can be collected by the following five methods:
1. Published reports in newspapers and periodicals
2. Financial data reported in annual reports
3. Records maintained by the institution
4. Internal reports of the government departments
5. Information from official publications
TAKE NOTE!
✦ Always investigate the validity and reliability of the data by examining the collection method employed by your source.
✦ Do not use inappropriate data for your research.
III. Identify the qualitative and quantitative variables and indicate the highest level of measurement required in each. If quantitative, classify whether discrete or continuous.
_________________1. Occupation
_________________2. Number of government officials
_________________3. Favorite color
_________________4. Temperature in Celsius degrees
_________________5. Type of school
_________________6. Volume of mineral water sold daily
_________________7. Employee number
_________________8. Civil status
_________________9. Zip code numbers
_________________10. Brands of soft drinks
_________________11. Socioeconomic status
_________________12. Employment status
_________________13. Number of vehicles registered
_________________14. Jersey number
_________________15. Number of employees collecting retirement benefits from GSIS
IV. Determine if the source would be a primary or a secondary source.
_________1. Government records
_________2. Dictionary
_________3. Artifact
_________4. A TV show explaining what happened in the Philippines.
_________5. Autobiography about Rodrigo Duterte.
_________6. Enrile diary describing what he thought about World War II.
_________7. Audio and video recordings.
_________8. Speeches
_________9. Newspaper
_________10. Review articles
References
Statistics: Informed Decisions Using Data by Michael Sullivan, III. Fifth Edition
Sampling: Design and Analysis by Sharon L. Lohr. Second Edition
Introduction to R
OBJECTIVES:
After successful completion of this module, you should be able to:
Identify the Console, Script, Environment, and Files panes.
Install and load add-in 'packages' and import data from .xlsx (Excel) file format for data processing and statistical analysis.
Understand the different data types in R.
Understand the different data structures in R.
Introduction to R
• R is a language and environment for statistical computing and graphics.
• R is trusted not only in academia; many large companies also use the R programming language. Data analysis with R is done in a series of steps: programming, transforming, discovering, modeling, and communicating the results.
Why use R for data analysis?
1. R is free to download and use.
2. R is open-source.
3. Data processing in R is very easy.
4. Data visualization tools in R are very extensive.
5. Advanced functionality often used in practice by scientists is available in R.
6. Using R will improve one's understanding of statistics.
7. It is very easy to share your output from R.
8. R provides reproducibility for your analyses.
The Difference between R and RStudio
RStudio is actually an add-on to R: it takes the R software and adds to it a very user-friendly graphical interface. Thus, when one uses RStudio, they are still using the full version of R while also getting the benefit of greater functionality and usability due to an improved user interface. As a result, when using R, one should always use RStudio; working with R itself is very cumbersome.
Since RStudio is an add-on to R, you must first download and install R as well as RStudio. On your computer, you will see R and RStudio as separate installed programs. When using R for data analysis, you will always open and work in RStudio; you must leave R installed on the computer for RStudio to work, even though you will likely never open R itself.
Install R:
Go to https://cran.r-project.org, select the version of R software applicable to your computer and install the software.
Install RStudio:
Once you are done, download the RStudio installer. Go to https://www.rstudio.com/products/rstudio/download/, select the applicable version of RStudio and install the software. Make sure to read and follow the instructions that appear while installing the program.
Four Panes of RStudio
Console pane: where output and error messages are displayed.
Environment and History pane: where you will see the different objects you create or the different datasets you import.
If we want to create a vector of consecutive numbers, the : operator is very helpful.
More complex sequences can be created using the seq( ) function, like defining the number of points in an interval, or the step size.
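The `:` operator and `seq()` can be compared side by side. The values below are illustrative choices, not taken from the module.

```r
# Consecutive integers with the : operator.
x <- 1:10
print(x)

# seq() with an explicit step size.
by_step <- seq(from = 0, to = 1, by = 0.25)
print(by_step)

# seq() with a fixed number of equally spaced points.
by_count <- seq(from = 0, to = 1, length.out = 5)
print(by_count)
```

Both `seq()` calls produce the same five values; `by` fixes the spacing, while `length.out` fixes how many points are generated.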
EXAMPLE:
Out of 60 respondents considered in the survey, thirty of them indicate that they are single, twenty of them are married and ten of them are separated. Create a factor, based on the information given.
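One possible way to build this factor is to repeat each category by its frequency with `rep()`. This is a sketch; the object name `civil_status` and the level order are assumptions.

```r
# Repeat each civil status by its reported frequency (30, 20, 10).
civil_status <- factor(
  rep(c("Single", "Married", "Separated"), times = c(30, 20, 10)),
  levels = c("Single", "Married", "Separated")
)

# Frequency table to confirm the counts.
print(table(civil_status))
```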
EXAMPLE:
I asked 70 students if they like watching television. Fifty of them answered "Most of the Time", fifteen of them answered "Sometimes" and only five students answered "Hardly Ever". Create a factor, based on the information given.
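Because these responses have a natural order, one reasonable approach is an ordered factor. Treating the factor as ordered, and the level order chosen here, are assumptions; the example only asks for a factor.

```r
# Repeat each response by its reported frequency (50, 15, 5).
tv_habit <- factor(
  rep(c("Most of the Time", "Sometimes", "Hardly Ever"), times = c(50, 15, 5)),
  levels = c("Hardly Ever", "Sometimes", "Most of the Time"),
  ordered = TRUE  # assumption: responses are ranked from least to most frequent
)

print(table(tv_habit))

# Ordered factors support comparisons between levels.
print(tv_habit[1] > tv_habit[70])
```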
EXAMPLE:
Create a 6 x 5 matrix that contains the numbers from 1 to 30. Fill the matrix by rows. Make sure that the names of the rows are A, B, C, D, E, and F, and the names of the columns are Blue, Red, White, Green, and Yellow.
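A minimal sketch of this matrix example, using `byrow = TRUE` to fill by rows and `dimnames` to label the rows and columns (the object name `m` is arbitrary):

```r
m <- matrix(
  1:30,
  nrow = 6, ncol = 5,
  byrow = TRUE,  # fill across each row before moving to the next
  dimnames = list(
    c("A", "B", "C", "D", "E", "F"),
    c("Blue", "Red", "White", "Green", "Yellow")
  )
)
print(m)
```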
To extract more than one row or column at a time:
Multiple rows:
matrixname[c( ),]
Multiple columns:
matrixname[,c( )]
How to create a list in R?
A list is created using the list( ) function in R.
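To illustrate the multiple-row and multiple-column extraction syntax, the sketch below uses a small hypothetical matrix; the row and column names are invented for the illustration.

```r
# Hypothetical 3 x 4 matrix, filled by columns (the default).
m <- matrix(
  1:12,
  nrow = 3,
  dimnames = list(c("r1", "r2", "r3"), c("a", "b", "c", "d"))
)

# Multiple rows by position: rows 1 and 3, all columns.
rows_1_3 <- m[c(1, 3), ]
print(rows_1_3)

# Multiple columns by name: columns b and d, all rows.
cols_b_d <- m[, c("b", "d")]
print(cols_b_d)
```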
EXAMPLE:
Create a list that contains a numeric vector (1 to 30), a sequence of numbers from 1 to 10 with a step size of 0.5, and a character vector that contains your top 3 favourite subjects.
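A sketch of how such a list could be built with `list()`. The element names and the three subjects are placeholders to be replaced with your own.

```r
my_list <- list(
  numbers = 1:30,
  half_steps = seq(from = 1, to = 10, by = 0.5),
  subjects = c("Statistics", "Calculus", "Physics")  # placeholder subjects
)

# Inspect the structure of the list.
str(my_list)

# Elements can be accessed by name with $ or [[ ]].
print(my_list$half_steps[1:3])
print(my_list[["subjects"]])
```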
EXAMPLE:
Create a data frame for the profile
of the respondents. Include only
the male respondents that are
single, at most 40 years old, and
high school graduate.
CONDITIONS
If/else statements
In R, we can write a vectorized conditional if/else statement as follows:
ifelse(condition on data, value returned if true, value returned if false)
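For instance, `ifelse()` evaluates the condition for every element of a vector at once. The scores and the passing mark of 75 below are hypothetical.

```r
# Hypothetical scores and passing mark.
scores <- c(95, 74, 80, 60)

# Each element is tested; the matching label is returned element by element.
remark <- ifelse(scores >= 75, "Passed", "Failed")
print(remark)
```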
PIPES OPERATOR
Filter
Use filter to keep only the rows of the data that satisfy a required condition.
Take the first 8 observations from SAMPLE_DATA.
Take the last 8 observations from SAMPLE_DATA.
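These operations might look as follows with the dplyr package, assuming it is installed. The data frame below is a hypothetical stand-in for SAMPLE_DATA; the column names `id` and `sex`, and all of the values, are invented for the illustration.

```r
library(dplyr)

# Hypothetical stand-in for SAMPLE_DATA; the real columns may differ.
SAMPLE_DATA <- data.frame(
  id = 1:20,
  sex = rep(c("Female", "Male"), each = 10),
  average = seq(from = 70, to = 89, by = 1)
)

# The pipe passes the data frame on as the first argument of the next function.
first8 <- SAMPLE_DATA %>% head(8)
last8 <- SAMPLE_DATA %>% tail(8)

# filter() keeps only the rows that satisfy every condition.
female_high <- SAMPLE_DATA %>% filter(sex == "Female", average >= 75)

print(first8)
print(last8)
print(female_high)
```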
EXAMPLE:
From SAMPLE_DATA:
• Take the first 30% of the observations and name it "NEWDATA".
• Remove the variables Language, GenInfo, Science, Numerical, and NonVerbal.
• Rename the variable "schooltype" as "SchoolType" and "average" as "Average".
• Include only female students whose school type is suc or private.
• Create an additional column named "If_FChoice" to determine if her course now is her first choice.
• Arrange "NEWDATA" in descending order based on the Score.
• Summarize your data to find out who has the highest average grades based on first choice and not first choice.
Visualization, Graphics in R
R graphic systems: base and ggplot2
ACTIVITIES/ASSESSMENTS:
1. Use the "faithful" data in RStudio. Extract the waiting variable and compute the average of only the waiting times to the next eruption that are less than or equal to 50 minutes.
2. Create an 8 x 5 matrix that contains the numbers from 1 to 40. Fill the matrix by columns. Make sure that the names of the rows are Apple, Banana, Orange, Grapes, Mango, Limes, Watermelons and Apricots, and the names of the columns are Blue, Red, White, Green, and Yellow.
A. Extract the rows of Apple, Grapes and Apricots.
3. Create a factor, based on the information given.
I asked 150 residents of Barangay Dela Paz to rate their level of agreement about mass testing. Most of the respondents answered "Strongly Agree" with frequency of 60. Fifty of them answered "Agree", followed by "Disagree" and "Strongly Disagree" with frequency of 30 and 10, respectively.
ACTIVITIES/ASSESSMENTS:
4. Create a scatter plot using both the base and the enhanced graphics in RStudio. Use the "faithful" data in RStudio.
Take Note:
xlab="Eruption time (min)",
ylab="Waiting time to next eruption (min)",
main="Eruptions of Old Faithful"
5. Let's say you're a student taking seven classes. Here's a table containing your final grades for each class:
Class Exams
Mathematics 88.00
Chemistry 87.67
Writing 86.00
Art 91.33
History 84.00
Music 91.00
Physical Education 89.33
A. Create a vector containing the final grades for each class using the variable name "final scores".
B. Create a vector of character data called "class names" containing your classes.
C. Assign the class name to each grade in your final scores vector.
D. Extract elements from the final scores vector to create two new vectors:
"liberal arts": Containing your writing and history final grades.
"fine arts": Containing your art and music final grades.
E. Calculate the average of each new vector.
F. Calculate your grade point average from the "final scores" vector that we created earlier. Store the result of your calculation in the variable "GPA".
G. Compare "final scores" to "GPA" to see whether the grade in each class is higher than the "GPA". Store the logical output in a vector named "above average".
References
https://uoftcoders.github.io/rcourse/lec02-basic-r.html
https://monashbioinformaticsplatform.github.io/r-intro/start.html
https://bookdown.org/kdonovan125/ibis_data_analysis_r4/introducing-r-and-rstudio.html
https://www.datacamp.com/
Engineering Data Analysis
Module 3: Parametric Statistics with R Software
K.Elizon
P.Aranas
L.Usona
L.Bautista
Objectives
After successful completion of this module, you should be able to:
1 Differentiate the null and alternative hypotheses.
2 Formulate the appropriate null and alternative hypotheses.
3 Explain the logic of hypothesis testing.
distributed data.
7 Solve real-life problems involving parametric tests.
K.Elizon P.Aranas L.Usona L.Bautista E.Baccay Engineering Data Analysis K.Elizon P.Aranas L.Usona L.Bautista E.Baccay Engineering Data Analysis
Hypothesis Testing
Set the level of significance or alpha level (α)
Parametric Test
1. One-Sample t-Test
2. Dependent Sample t-Test
3. Independent Sample t-Test
4. One-Way Analysis of Variance
5. Pearson Product Moment Correlation
Performing statistical analysis using statistical software
Using R software, calculate the p-value.
Make sure to verify that the assumptions of every statistical test are satisfied.
In stating your decision you can use:
1 Fail to reject the null hypothesis / Do not reject the null hypothesis / Retain the null hypothesis
2 Reject the null hypothesis.
Take Note!
It is important to recognize that we never accept the null hypothesis. We are merely saying that the sample evidence is not strong enough to warrant rejection of the null hypothesis.
Decision Rule:
If the p-value is less than or equal to the 0.05 level of significance, reject H0; otherwise, fail to reject H0.
Example:
If the level of significance is α = 0.05:
1. and if the computed p-value is 0.01, then the decision is reject H0.
2. and if the computed p-value is 0.05, then the decision is reject H0.
3. and if the computed p-value is 0.10, then the decision is fail to reject H0.
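The decision rule can be written as a small helper function. This is only a sketch; the function name `decide` is hypothetical, and the three p-values are the ones used in the example.

```r
# Reject H0 when the p-value is less than or equal to alpha.
decide <- function(p_value, alpha = 0.05) {
  ifelse(p_value <= alpha, "Reject H0", "Fail to reject H0")
}

# The three cases from the example, at alpha = 0.05.
print(decide(c(0.01, 0.05, 0.10)))
```

Note that a p-value exactly equal to α leads to rejection, because the rule uses "less than or equal to".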
Draw Conclusion
Record conclusions and recommendations in a report, and associate interpretations to justify your conclusion or recommendations.
Assumptions of Parametric Statistics
Common Assumptions
1 Approximately Normally Distributed
2 Homogeneity of Variances
3 Samples must be independent of each other
Testing the Assumptions
Testing the Homogeneity of Variances
H0: µ = µ0
Ha: µ ≠ µ0 two-tailed: two.sided
H0: µ ≤ µ0
Ha: µ > µ0 one-tailed: greater
H0: µ ≥ µ0
Ha: µ < µ0 one-tailed: less
Assumptions
Samples must be independent of each other.
Approximately Normally Distributed.
One Sample t-Test
Example 1:
A psychologist makes use of a test instrument for measuring depression. The instrument is known to have a mean score of 70 for normal individuals. Twenty individuals who have been described as severely depressed by a therapist take the test. The psychologist is interested to know if the average test score made by severely depressed individuals is greater than 70. The data are reflected in the table below.
Scores Made by Severely Depressed Individuals
No. Score No. Score
1 75 11 75
2 77 12 74
3 73 13 63
4 80 14 76
5 65 15 67
6 74 16 70
7 75 17 69
8 69 18 76
9 71 19 65
10 72 20 68
Testing the Normality of the Data
Since the p-value is greater than 0.05, we fail to reject H0. Therefore, we may treat the sample data as approximately normally distributed.
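The analysis for this example could be sketched in R as follows, using the twenty scores from the table. The use of the Shapiro-Wilk test (`shapiro.test`) for the normality check is an assumption, since the module does not name the normality test it uses. The `alternative = "greater"` option corresponds to Ha: µ > 70.

```r
# Scores of the 20 severely depressed individuals (from the table).
scores <- c(75, 77, 73, 80, 65, 74, 75, 69, 71, 72,
            75, 74, 63, 76, 67, 70, 69, 76, 65, 68)

# Normality check (assumption: Shapiro-Wilk test).
normality <- shapiro.test(scores)
print(normality)

# One-sample t-test of H0: mu <= 70 against Ha: mu > 70.
result <- t.test(scores, mu = 70, alternative = "greater")
print(result)
```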
Dependent Sample t-Test
Assumptions
Your dependent variable should be measured at the interval or ratio level (i.e., it is continuous).
Your independent variable should consist of two categorical, "related groups" or "matched pairs".
There should be no significant outliers in the differences between the two related groups.
The distribution of the differences in the dependent variable between the two related groups should be approximately normally distributed.
Example 1:
In the show "The Pickup Artist", "Mystery" (the host) wants the artists in training to change from being plain old Average Frustrated Chumps into master pickup artists. He insists that one way to help increase your confidence and your ability to demonstrate higher value is simply to smell the part. He has each contestant go out to a club and try to get as many digits (telephone numbers) as they can without any cologne of any kind (pre-test). Then he has them go out on the next night to the same club after dousing themselves in "Mystery's Freaky Funk" cologne, to see if the number of digits they receive increases (post-test). The results are shown below. Although Mystery may be fashion challenged, is he correct in his assertion that cologne helps when picking up women?
Assumptions
Your dependent variable should be measured on a continuous scale (i.e., it is measured at the interval or ratio level).
Your independent variable should consist of two categorical, independent groups.
You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves.
There should be no significant outliers.
Your dependent variable should be approximately normally distributed for each group of the independent variable.
There needs to be homogeneity of variances. If the variances of the two independent groups are not equal, R calculates the Welch two-sample t-test instead of the pooled independent sample t-test.
Independent Sample t-Test
Example 1:
Twenty participants were given a list of 20 words to process. The 20 participants were randomly assigned to one of two treatment conditions. Half were instructed to count the number of vowels in each word (shallow processing). Half were instructed to judge whether the object described by each word would be useful if one were stranded on a desert island (deep processing). After a brief distractor task, all subjects were given a surprise free recall task. The number of words correctly recalled was recorded for each subject. Here are the data:
Shallow Processing Deep Processing
13 12
12 15
11 14
9 14
11 13
13 12
14 15
14 14
14 16
15 17
Did the instructions given to the participants significantly affect their level of recall at 10% level of significance? Discuss your answer.
Testing the Normality of the Data
Testing the Homogeneity of Variances
The p-value for Levene's test is 0.691, which is greater than 0.05; therefore, we fail to reject H0. It means that we will assume equal variances for this example.
We will use var.equal=TRUE in the command of the independent sample t-test.
Step 1: H0: µshallow = µdeep and Ha: µshallow ≠ µdeep
Step 2: α = 0.10
Step 3: Since we are comparing the means of two independent groups, we will use the independent sample t-test.
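The test for this example might be run as sketched below, using the recall data for the two processing conditions and `var.equal = TRUE` as stated above. The object names are arbitrary, and the final line simply applies the decision rule at α = 0.10.

```r
# Number of words recalled in each condition (from the data table).
shallow <- c(13, 12, 11, 9, 11, 13, 14, 14, 14, 15)
deep <- c(12, 15, 14, 14, 13, 12, 15, 14, 16, 17)
alpha <- 0.10

# Two-sided independent sample t-test assuming equal variances.
result <- t.test(shallow, deep, var.equal = TRUE, alternative = "two.sided")
print(result)

# Apply the decision rule at the chosen level of significance.
decision <- if (result$p.value <= alpha) "Reject H0" else "Fail to reject H0"
print(decision)
```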
Independent Sample t-Test
Example 2:
Researchers wanted to know whether there was a difference in comprehension among students learning a computer program based on the style of the text. They randomly divided 18 students into two groups of 9 each. The researchers verified that the 18 students were similar in terms of educational level, age, and so on. Group 1 individuals learned the software using a visual manual (multimodal instruction), while Group 2 individuals learned the software using a textual manual (unimodal instruction). The following data represent the scores the students received on an exam given to them after they studied from the manuals.
Visual Textual
51.08 64.55
57.03 57.60
44.85 68.59
75.21 50.75
56.87 49.63
75.28 43.58
57.07 57.40
80.30 49.48
52.20 49.57
Testing the Normality of the Data
Testing the Homogeneity of Variances
Step 1: H0: µvisual = µtextual and Ha: µvisual ≠ µtextual
Step 2: α = 0.05
Step 3: Since we are comparing the means of two independent groups, we will use the independent sample t-test.
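The computation for this example could be sketched with the formula interface of `t.test`, which takes the data in long format (one row per student). The data frame and column names are assumptions, and `var.equal = TRUE` assumes that the homogeneity check supported equal variances.

```r
# Exam scores in long format: 9 visual-manual and 9 textual-manual students.
manuals <- data.frame(
  score = c(51.08, 57.03, 44.85, 75.21, 56.87, 75.28, 57.07, 80.30, 52.20,
            64.55, 57.60, 68.59, 50.75, 49.63, 43.58, 57.40, 49.48, 49.57),
  manual = factor(rep(c("Visual", "Textual"), each = 9))
)

# Independent sample t-test, assuming equal variances.
result <- t.test(score ~ manual, data = manuals, var.equal = TRUE)
print(result)
print(round(result$p.value, 3))
```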
Step 5: Since the p-value (0.209) is greater than the 0.05 level of significance, we fail to reject H0.
Step 6: There is no significant difference in comprehension among students learning a computer program based on the style of the text.
One-way analysis of variance (ANOVA) is a method of testing the equality of two or more population means by analyzing sample variances.
It is actually a more general form of the t-test that is appropriate to use with three or more data groups.
One-way Analysis of Variance
The default is equal variances not assumed; to change this, set the "var.equal=" option to TRUE.
"x": a numeric vector of data values.
"group": a factor of the data.
summary(aov(x ~ group, data = data_frame))
"x": a numeric vector of data values.
"group": a factor of the data.
Null and Alternative Hypothesis
H0: µ1 = µ2 = ... = µk
Ha: At least one of the population means is different from the others.
Assumptions
Your dependent variable should be measured at the interval or ratio level (i.e., it is continuous).
Your independent variable should consist of two or more categorical, independent groups.
You should have independence of observations, which means that there is no relationship between the observations in each group or between the groups themselves.
There needs to be homogeneity of variances. If the variances of the independent groups are not equal, R can calculate Welch's ANOVA for unequal variances instead.
Compact Cars Midsize Cars Full-size Cars
643 469 484
655 427 456
702 525 402
451 532 623
678 562 711
509 571 488
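A sketch of the one-way ANOVA for the car data follows. The column names `value` and `size` are placeholders, because the measured variable is not named in the text. Using `oneway.test` for Welch's ANOVA is an assumption about which command the module intends; its default is `var.equal = FALSE`.

```r
# Car data in long format: 6 observations per car size.
car_data <- data.frame(
  value = c(643, 655, 702, 451, 678, 509,
            469, 427, 525, 532, 562, 571,
            484, 456, 402, 623, 711, 488),
  size = factor(
    rep(c("Compact", "Midsize", "Full-size"), each = 6),
    levels = c("Compact", "Midsize", "Full-size")
  )
)

# Classical one-way ANOVA (assumes homogeneity of variances).
anova_table <- summary(aov(value ~ size, data = car_data))
print(anova_table)

# Welch's ANOVA, for when equal variances cannot be assumed.
welch <- oneway.test(value ~ size, data = car_data, var.equal = FALSE)
print(welch)
```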
Testing the Normality of the Data
Testing the Homogeneity of Variances
Procedures for Testing Hypothesis
Example 2:
A firm wishes to compare four programs for training workers to perform a
certain manual task. Twenty new employees are randomly assigned to
the training programs, with 5 in each program. At the end of the training
period, a test is conducted to see how quickly trainees can perform the
task. The number of times the task is performed per minute is recorded
for each trainee, with the following results:
Testing the Normality of the Data
Testing the Homogeneity of Variances
Procedures for Testing Hypothesis
Pearson Product Moment Correlation
Assumptions
Your two variables should be measured at the interval or ratio level (i.e., they are continuous).
There is a linear relationship between your two variables.
There should be no significant outliers.
Your variables should be approximately normally distributed.
Example 1:
The Rip-off Vending Machine Company operates coffee vending machines in office buildings. The company wants to study the relationship, if any, between the number of cups sold per day and the number of persons working in each building. Sample data for the study were collected by the company and are presented below. Test the significance at the 0.05 level.
No. of Persons Working at Location   No. of Cups
5                                    10
6                                    20
14                                   30
19                                   40
15                                   30
11                                   20
18                                   40
22                                   40
26                                   50
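A minimal R sketch of this analysis, using the nine data pairs from the table above. The variable names `persons` and `cups` are assumptions made for illustration, and using `shapiro.test()` for the normality check is an assumption about the module's procedure.

```r
# Data from the table above
persons <- c(5, 6, 14, 19, 15, 11, 18, 22, 26)
cups    <- c(10, 20, 30, 40, 30, 20, 40, 40, 50)

# Normality check for each variable (Shapiro-Wilk test)
p_normal <- c(persons = shapiro.test(persons)$p.value,
              cups    = shapiro.test(cups)$p.value)

# Pearson correlation with its two-sided significance test
result <- cor.test(persons, cups, method = "pearson")
r <- unname(result$estimate)

print(p_normal)
print(round(r, 3))
print(result$p.value)
```

`cor.test()` reports both the correlation coefficient and the p-value, so Steps 4 to 6 of the testing procedure can be read from a single call.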
Testing the Normality of the Data
Step 1:
H0 : There is no significant relationship between the number of cups sold per day and the number of persons working in each building.
Ha : There is a significant relationship between the number of cups sold per day and the number of persons working in each building.
Step 2: α = 0.05
Step 3: Since we are testing the significance of the relationship between two variables, we will use Pearson r.
Procedures for Testing Hypothesis
Step 5: Since the p-value (< 0.001) is less than the 0.05 level of significance, we reject H0.
Step 6: Therefore, there is a significant relationship between the number of cups sold per day and the number of persons working in each building, and the relationship is very strong based on the correlation coefficient (0.968).
Pearson Product Moment Correlation
Example 2:
A golf pro wants to investigate the relation between the club-head speed of a golf club (measured in miles per hour) and the distance (in yards) that the ball will travel. He realizes that other variables besides club-head speed determine the distance a ball will travel (such as club type, ball type, golfer, and weather conditions). To eliminate the variability due to these variables, the pro uses a single model of club and ball, one golfer, and a clear, 70-degree day with no wind. The pro records the club-head speed, measures the distance the ball travels, and collects the data in Table.
It is required that both variables are individually normally distributed. Since the calculated p-values (0.356 and 0.341) are greater than 0.05, we fail to reject H0. Therefore, there is no evidence that either variable departs from normality.
Step 1:
H0 : There is no significant relationship between the club-head speed of a golf club and the distance that the ball will travel.
Ha : There is a significant relationship between the club-head speed of a golf club and the distance that the ball will travel.
Step 2: α = 0.05
Step 3: Since we are testing the significance of the relationship between two variables, we will use Pearson r.
Procedures for Testing Hypothesis
Step 4:
Activities/Assessments
Directions: Determine whether the sampling is dependent or independent.
(1.) A researcher wishes to compare academic aptitudes of married mathematicians and their spouses. She obtains a random sample of 287 such couples who take an academic aptitude test and determines each spouse's academic aptitude.
(2.) A political scientist wants to know how a random sample of 18- to 25-year-olds feel about Democrats and Republicans in Congress. She obtains a random sample of 1030 registered voters 18 to 25 years of age and asks, "Do you have a favorable/unfavorable opinion of the Democratic/Republican party?" Each individual was asked to disclose his or her opinion about each party.
(3.) An educator wants to determine whether a new curriculum significantly improves standardized test scores for third grade students. She randomly divides 80 third-graders into two groups. Group 1 is taught using the new curriculum, while group 2 is taught using the traditional curriculum. At the end of the school year, both groups are given the standardized test and the mean scores are compared.
(4.) A stock analyst wants to know if there is a difference between the mean rate of return from energy stocks and that from financial stocks. He randomly selects 13 energy stocks and computes the rate of return for the past year. He randomly selects 13 financial stocks and computes the rate of return for the past year.
(5.) An urban economist believes that commute times to work in the South are less than commute times to work in the Midwest. He randomly selects 40 employed individuals in the South and 45 employed individuals in the Midwest and determines their commute times.
3. Professor Katrina measured the time (in seconds) required to catch a falling meter stick for 10 randomly selected students' dominant hand and non-dominant hand. Professor Katrina claims that the reaction time in an individual's dominant hand is less than the reaction time in their non-dominant hand. Test the claim at the α = 0.10 level of significance. The data obtained are presented:
Student             1      2      3      4      5
Dominant Hand       0.177  0.210  0.186  0.189  0.198
Non-Dominant Hand   0.193  0.194  0.160  0.209  0.164
Student             6      7      8      9      10
Dominant Hand       0.194  0.160  0.163  0.166  0.152
Non-Dominant Hand   0.179  0.202  0.208  0.184  0.215
4. Researchers wanted to determine if carpeted rooms contained more bacteria than uncarpeted rooms. To determine the amount of bacteria in a room, researchers pumped the air from the room over a Petri dish at the rate of 1 cubic foot per minute for eight carpeted rooms and eight uncarpeted rooms. Colonies of bacteria were allowed to form in the 16 Petri dishes. The results are presented in the table. A normal probability plot and boxplot indicate that the data are approximately normally distributed with no outliers. Do carpeted rooms have more bacteria than uncarpeted rooms at the α = 0.05 level of significance?
Carpeted (bacteria per cubic foot)   Uncarpeted (bacteria per cubic foot)
11.8                                 12.1
8.2                                  8.3
10.8                                 1.0
10.1                                 11.1
7.1                                  3.8
13.0                                 7.2
14.6                                 10.1
14.0                                 13.7
5. A researcher is interested in whether a training course increases the teaching performance of the teachers who attended the training courses. Test at the 10% level of significance. The data are shown below:
Case  Before  After    Case  Before  After
1     85      95       11    89      97
2     84      98       12    87      98
3     86      97       13    82      95
4     87      92       14    81      95
5     89      96       15    86      92
6     82      93       16    89      91
7     80      94       17    89      94
8     84      95       18    84      95
9     86      90       19    85      96
10    82      82       20    88      97
6. A pediatrician wants to determine the relation that may exist between a child's height and head circumference. She randomly selects eleven 3-year-old children from her practice, measures their heights and head circumferences, and obtains the data shown in the table below.
Height (inches)   Head Circumference (inches)
27.75             17.5
24.50             17.1
25.50             17.1
26.00             17.3
25.00             16.9
27.75             17.6
26.50             17.3
27.00             17.5
26.75             17.3
26.75             17.5
27.50             17.5
7. A stock analyst wondered whether the mean rate of return of financial, energy, and utility stocks differed over the past five years. He obtained a simple random sample of eight companies from each of the three sectors and obtained the five-year rates of return shown in the following table (in percent):
Financial   Energy   Utilities
10.76       12.72    11.88
15.05       13.91    5.86
17.01       6.43     13.46
5.07        11.19    9.90
19.50       18.79    3.95
8.16        20.73    3.44
10.38       9.60     7.11
6.75        17.40    15.70
Source: Morningstar.com
Are the mean rates of return different at the α = 0.05 level of significance?
8. At a community college, the mathematics department has been experimenting with four different delivery mechanisms for content in their Intermediate Algebra courses. One method is the traditional lecture (method I), the second is a hybrid format in which half the class time is online and the other half is face-to-face (method II), the third is online (method III), and the fourth is an emporium model from which students obtain their lectures and do their work in a lab with an instructor available for assistance (method IV). To assess the effectiveness of the four methods, students in each approach are given a final exam with the results shown next. Do the data suggest that any method has a different mean score from the others?
Engineering Data Analysis
Module 4: Nonparametric Statistics with R Software
K. Elizon, P. Aranas, L. Usona, L. Bautista, E. Baccay
Polytechnic University of the Philippines
Sta. Mesa, Manila
Objectives
After successful completion of this module, you should be able to:
1 Distinguish between parametric and nonparametric statistical procedures.
2 Conduct a one-sample sign test.
3 Test a hypothesis about the difference between the medians of two dependent samples.
4 Test a hypothesis about the difference between the medians of two independent samples.
5 Conduct a Spearman rank correlation.
6 Conduct a test for two categorical variables.
1 Nonparametric procedures are less efficient than parametric procedures.
2 Nonparametric procedures often discard useful information. For example, the sign test uses only the sign of the data and rank tests merely preserve order, so the magnitudes of the actual data values are lost. As a result, nonparametric procedures are typically less powerful.
3 Because fewer requirements must be satisfied to conduct these tests, researchers sometimes incorrectly use these procedures when parametric procedures can be used. Make sure to verify that the assumptions of every statistical test are satisfied.
Nonparametric Test
1. One-Sample Sign Test
2. Wilcoxon Signed Rank Test
3. Mann Whitney U-Test
4. Kruskal Wallis H-Test
5. Spearman Rank Correlation Test
6. Chi-square Test
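The remark that sign and rank procedures discard magnitudes can be illustrated with a few lines of R. The data and the hypothesized median `m0` below are made up for illustration and do not come from the module.

```r
# Hypothetical data: the last value is an extreme outlier
x  <- c(3.1, 4.7, 5.2, 6.0, 250.0)
m0 <- 5                       # hypothesized median (assumption)

signs <- sign(x - m0)         # all that a sign test uses
ranks <- rank(x)              # all that a rank test uses

# Replace the outlier by a value just above 6.0
y <- c(3.1, 4.7, 5.2, 6.0, 6.1)

print(signs)
print(ranks)
print(identical(sign(y - m0), signs))   # signs are unchanged
print(identical(rank(y), ranks))        # ranks are unchanged
```

Both data sets look the same to a sign test or a rank test, although their means differ greatly. This is the information loss that makes these tests less powerful when the parametric assumptions do hold.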
One Sample Sign Test
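One way to carry out a one-sample sign test in base R is through the exact binomial test, since under H0 each observation falls above the hypothesized median with probability 0.5. The sketch below is an assumption about the mechanics, not the module's own command, which may come from an add-on package. The data vector `x` and the hypothesized median `m0` are hypothetical.

```r
# Hypothetical sample and hypothesized median (not from the module)
x  <- c(18, 25, 20, 15, 21, 30, 19, 17, 23, 16, 20, 14)
m0 <- 22

above <- sum(x > m0)          # number of plus signs
below <- sum(x < m0)          # number of minus signs
n     <- above + below        # observations equal to m0 are dropped

# Ha: median < m0, so few observations are expected above m0
sign_test <- binom.test(above, n, p = 0.5, alternative = "less")

print(c(above = above, below = below))
print(sign_test$p.value)
```

For Ha: median > m0 the call would use `alternative = "greater"`, and for a two-tailed test `alternative = "two.sided"`.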
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.180) is greater than the 0.05 level of significance, we fail to reject H0.
Step 6: There is not sufficient evidence to support the claim of the website administrator.
One Sample Sign Test
Example 2:
Recent studies of the private practices of physicians who saw no Medicaid patients suggested that the median length of each patient visit was 22 minutes. It is believed that the median visit length in practices with a large Medicaid load is shorter than 22 minutes. A random sample of 20 visits in practices with a large Medicaid load yielded, in order, the following visit lengths:
Wilcoxon Signed Rank Test
Step 1:
H0 : M_before ≤ M_after and Ha : M_before > M_after
Step 2: α = 0.01
Step 3: Since we are comparing the medians of two related groups, we will use the Wilcoxon signed rank test.
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.945) is greater than the 0.01 level of significance, we fail to reject H0.
Step 6: There is not sufficient evidence to conclude that the program helps to decrease the median waistline.
Wilcoxon Signed Rank Test
Example 2:
An analyst might want to determine whether there is a difference in the cost per mile of airfares in the United States between 1979 and 2009 for various cities. The data in the table represent the costs per mile of airline tickets for a sample of 17 cities for both 1979 and 2009.
City  1979  2009    City  1979  2009
1     20.3  22.8    10    20.3  20.9
2     19.5  12.7    11    19.2  22.6
3     18.6  14.1    12    19.5  16.9
4     20.9  16.1    13    18.7  20.6
5     19.9  25.2    14    17.7  18.5
6     18.6  20.2    15    21.6  23.4
7     19.6  14.9    16    22.4  21.3
8     23.2  21.3    17    20.8  17.4
9     21.8  18.7
Step 1:
H0 : M_1979 = M_2009 and Ha : M_1979 ≠ M_2009
Step 2: α = 0.05
Step 3: Since we are comparing the medians of two related groups, we will use the Wilcoxon signed rank test.
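The computation for this example might look as follows in R, using the 17 pairs of airfares from the table. The vector names are assumptions. `exact = FALSE` is set here because some absolute differences are tied, in which case R cannot compute an exact p-value and uses the normal approximation.

```r
# Cost per mile for cities 1 to 17, from the table above
fare_1979 <- c(20.3, 19.5, 18.6, 20.9, 19.9, 18.6, 19.6, 23.2, 21.8,
               20.3, 19.2, 19.5, 18.7, 17.7, 21.6, 22.4, 20.8)
fare_2009 <- c(22.8, 12.7, 14.1, 16.1, 25.2, 20.2, 14.9, 21.3, 18.7,
               20.9, 22.6, 16.9, 20.6, 18.5, 23.4, 21.3, 17.4)

# Paired (dependent) samples, two-tailed alternative
signed_rank <- wilcox.test(fare_1979, fare_2009, paired = TRUE,
                           alternative = "two.sided", exact = FALSE)

print(signed_rank)
```

The `paired = TRUE` argument is what makes this a signed rank test on the city-by-city differences rather than a two-sample rank sum test.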
Mann Whitney U-Test
H0 : M1 = M2
Ha : M1 ≠ M2 (two-tailed: two.sided)
H0 : M1 ≤ M2
Ha : M1 > M2 (one-tailed: greater)
H0 : M1 ≥ M2
Ha : M1 < M2 (one-tailed: less)
Assumptions
Your dependent variable should be measured at the ordinal or continuous level.
Your independent variable should consist of two categorical, "independent groups".
Example 1:
When exposed to an infection, a person typically develops antibodies. The extent to which the antibodies respond can be measured by looking at a person's titer, which is a measure of the number of antibodies present. The higher the titer is, the more antibodies that are present. The data in the table represent the titers of 11 ill people and 11 healthy people exposed to the tularemia virus in Vermont. Is the level of titer in the ill group greater than the level of titer in the healthy group? Use the α = 0.10 level of significance.
Ill    Healthy
640    10
80     320
1280   320
160    320
640    80
640    160
1280   10
640    640
160    160
320    320
160    320
Step 1:
H0 : M_ill ≤ M_healthy and Ha : M_ill > M_healthy
Step 2: α = 0.10
Step 3: Since we are comparing the medians of two independent groups, we will use the Mann Whitney U-Test.
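A sketch of the computation in R with the titer data from the table. In R the Mann Whitney U-Test is run with `wilcox.test()` on two unpaired samples. The vector names are assumptions, and `exact = FALSE` is used because the data contain ties. With a one-tailed alternative the order of the samples matters: `alternative = "greater"` tests whether the first sample tends to be larger than the second.

```r
# Titers from the table above
ill     <- c(640, 80, 1280, 160, 640, 640, 1280, 640, 160, 320, 160)
healthy <- c(10, 320, 320, 320, 80, 160, 10, 640, 160, 320, 320)

# Ha: M_ill > M_healthy, so `ill` must be the first sample
mw <- wilcox.test(ill, healthy, alternative = "greater", exact = FALSE)

# Swapping the samples tests the opposite direction (healthy > ill)
mw_swapped <- wilcox.test(healthy, ill, alternative = "greater", exact = FALSE)

print(mw)
print(mw_swapped$p.value)
```

When the formula interface is used instead (response ~ group), the first sample is the first factor level, which by default is the alphabetically first group name. It is therefore worth checking which group R treats as the first sample before reading a one-tailed p-value.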
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.047) is less than the 0.10 level of significance, we reject H0.
Step 6: There is sufficient evidence to conclude that the level of titer in the ill group is greater than the level of titer in the healthy group.
Mann Whitney U-Test
Example 2:
An engineer is comparing the time to failure (in flight hours) of two different air conditioners for airplanes and wants to determine if the median time to failure for model Y is longer than the median time to failure for model X. She obtains a random sample of 26 failure times for model X and an independent random sample of 17 failure times for model Y. Do the data in Table suggest that the time to failure for model Y is longer? Use the α = 0.05 level of significance.
Kruskal Wallis H-Test
Step 1:
H0 : The distribution of exam scores is the same for each city.
Ha : The distribution of exam scores differs for at least one city.
Step 2: α = 0.01
Step 3: Since we are comparing the medians of more than two independent groups, we will use the Kruskal Wallis H-Test.
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.001) is less than the 0.01 level of significance, we reject H0.
Step 6: This means that the distribution of exam scores is not the same for every city.
Kruskal Wallis H-Test
Example 2:
A family doctor claims that the distributions of HDL cholesterol in males for the age groups 20 to 29 years old, 40 to 49 years old, and 60 to 69 years old are different. He obtains a simple random sample of 12 individuals from each age group and determines their HDL cholesterol. The results are presented in the table. Test the doctor's claim at the α = 0.05 level of significance.
No.  20-29 yrs. old  40-49 yrs. old  60-69 yrs. old
1    54              61              44
2    43              41              65
3    38              44              62
4    30              47              53
5    61              33              51
6    53              29              49
7    35              59              49
8    34              35              42
9    39              34              35
10   46              74              44
11   50              50              37
12   35              65              38
Step 1:
H0 : The distributions of HDL cholesterol in males for the age groups 20 to 29 years old, 40 to 49 years old, and 60 to 69 years old are the same.
Ha : The distributions of HDL cholesterol in males for the age groups 20 to 29 years old, 40 to 49 years old, and 60 to 69 years old are different.
Step 2: α = 0.05
Step 3: Since we are comparing the medians of more than two independent groups, we will use the Kruskal Wallis H-Test.
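The test for this example could be run in R along the following lines, with the HDL values taken column by column from the table. The vector names are assumptions made for illustration.

```r
# HDL cholesterol by age group, from the table above
age_20_29 <- c(54, 43, 38, 30, 61, 53, 35, 34, 39, 46, 50, 35)
age_40_49 <- c(61, 41, 44, 47, 33, 29, 59, 35, 34, 74, 50, 65)
age_60_69 <- c(44, 65, 62, 53, 51, 49, 49, 42, 35, 44, 37, 38)

# kruskal.test() accepts a list with one numeric vector per group
kw <- kruskal.test(list(age_20_29, age_40_49, age_60_69))

print(kw)
```

The printed output gives the H statistic (labelled "Kruskal-Wallis chi-squared"), its degrees of freedom (number of groups minus one) and the p-value to compare with α = 0.05.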
Spearman Rank Correlation
Step 1:
H0 : There is no significant relationship between the individual ranks obtained in swimming and cycling.
Ha : There is a significant relationship between the individual ranks obtained in swimming and cycling.
Step 2: α = 0.01
Step 3: Since we are testing the significance of the relationship between two ordinal variables, we will use Spearman rho.
Procedures for Testing Hypothesis
Step 5: Since the p-value (0.001) is less than the 0.01 level of significance, we reject H0.
Step 6: Therefore, there is a significant relationship between the individual ranks obtained in swimming and cycling, and the relationship is very strong based on the correlation coefficient (0.903).
Spearman Rank Correlation
Example 2:
The following are the ranks in statistics and the ranks in mathematics of 10 students in an examination. Determine if there is a relationship between the ranks of students in the two subjects. Use the 0.05 level of significance.
Subject  Statistics  Mathematics
1        56          66
2        75          70
3        45          40
4        71          60
5        62          65
6        64          56
7        58          59
8        80          77
9        76          67
10       61          67
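A sketch of the computation for this example in R, using the ten pairs of values from the table. The vector names are assumptions. Two students share the same mathematics value (67), so `exact = FALSE` is set and R uses an approximation for the p-value.

```r
# Values for students 1 to 10, from the table above
statistics  <- c(56, 75, 45, 71, 62, 64, 58, 80, 76, 61)
mathematics <- c(66, 70, 40, 60, 65, 56, 59, 77, 67, 67)

# Spearman rank correlation; R ranks the values internally
sp  <- cor.test(statistics, mathematics, method = "spearman", exact = FALSE)
rho <- unname(sp$estimate)

print(round(rho, 3))
print(sp$p.value)
```

Because `cor.test()` converts the values to ranks itself, the same call works whether the columns hold raw scores or ranks.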
Step 5: Since the p-value (0.039) is less than the 0.05 level of significance, we reject H0.
Step 6: Therefore, there is a significant relationship between the ranks of students in statistics and mathematics, and the relationship is moderately strong based on the correlation coefficient (0.657).
Chi-Square: The test for independence is used to discover if there is an association between two categorical variables.
Command for Chi-Square Test
chisq.test(x,y)
"x" a numeric vector or matrix.
"y" a numeric vector; ignored if x is a matrix.
Chi-Square Test
Null and Alternative Hypothesis
H0 : The two categorical variables are independent.
Ha : The two categorical variables are dependent.
Assumptions
There are 2 variables, and both are measured as categories, usually at the nominal level. However, categories may be ordinal. Interval or ratio data that have been collapsed into ordinal categories may also be used.
The two variables should consist of two or more categorical, independent groups.
Assumptions
The data in the cells should be frequencies, or counts of cases, rather than percentages or some other transformation of the data.
For a 2 by 2 table, all expected frequencies > 5.
For a larger table, all expected frequencies > 1 and no more than 20% of all cells may have expected frequencies < 5.
Example 1:
Educators are always looking for novel ways in which to teach statistics to undergraduates as part of a non-statistics degree course (e.g., psychology). With current technology, it is possible to present how-to guides for statistical programs online instead of in a book. However, different people learn in different ways. An educator would like to know whether gender (male/female) is associated with the preferred type of learning medium (online vs. books). Import the Excel file "chi-square data", sheet name "example 1".
Contingency Table
To Construct Contingency Table
table(a,b)
"a" is a vector of categories (for example a factor or character vector) that will represent the rows of the contingency table.
"b" is a vector of categories (for example a factor or character vector) that will represent the columns of the contingency table.
This is a 2 by 2 contingency table. All expected frequencies are greater than 5, which means that the assumption is satisfied.
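How the contingency table is built and how the expected frequencies are checked can be sketched as follows. The Excel data are not reproduced in these slides, so the vectors `gender` and `medium` below are hypothetical stand-ins, and the counts they produce are not the module's results.

```r
# Hypothetical raw data: one entry per respondent (not the module's file)
gender <- c(rep("female", 30), rep("male", 30))
medium <- c(rep("online", 20), rep("books", 10),
            rep("online", 11), rep("books", 19))

# Rows come from the first argument, columns from the second
tab <- table(gender, medium)
print(tab)

# chisq.test() also returns the expected frequencies under independence
chi <- chisq.test(tab)
print(chi$expected)
print(all(chi$expected > 5))   # 2 by 2 rule: all expected frequencies > 5
```

For a larger table the same `chi$expected` matrix can be used to check that all expected frequencies exceed 1 and that no more than 20% of them fall below 5.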
Procedures for Testing Hypothesis
Step 1:
H0 : Gender is not associated with the preferred type of learning medium.
Ha : Gender is associated with the preferred type of learning medium.
Step 2: α = 0.05
Step 3: Since we are testing the significance of the relationship between two categorical variables, we will use the Chi-square test.
Step 4:
Step 5: Since the p-value (0.026) is less than the 0.05 level of significance, we reject H0.
Step 6: Therefore, there is sufficient evidence based on the sample data that the gender of students is associated with the preferred type of learning medium.
Example 2:
The Gallup Organization conducted a survey in 2014 asking individuals questions pertaining to social well-being such as strength of relationship with spouse, partner, or closest friend, making time for trips or vacations, and having someone who encourages them to be healthy. Social well-being scores were determined based on answers to these questions and used to categorize individuals as thriving, struggling, or suffering in their social well-being. In addition, body mass index (BMI) was determined based on height and weight of the individual. This allowed for classification as obese, overweight, normal weight, or underweight.
The data in the following contingency table are based on the results of this survey.
                Thriving  Struggling  Suffering
Obese           202       250         102
Overweight      294       302         110
Normal Weight   300       295         103
Underweight     17        17          8
Researchers wanted to determine whether the sample data suggest there is an association between weight classification and social well-being.
Step 1:
H0 : There is no association between weight classification and social well-being.
Ha : There is an association between weight classification and social well-being.
Step 2: α = 0.05
Step 3: Since we are testing the significance of the relationship between two categorical variables, we will use the Chi-square test.
Procedures for Testing Hypothesis
The data are presented in a contingency table; the raw data are not given. To solve this problem, we need to construct a matrix.
Step 4:
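Constructing the matrix and running the test might look like this in R. The counts are those of the contingency table in this example; the object names `counts` and `chi` and the dimension labels are assumptions made for readability.

```r
# Counts from the contingency table, entered row by row
counts <- matrix(c(202, 250, 102,
                   294, 302, 110,
                   300, 295, 103,
                    17,  17,   8),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(
                   weight    = c("Obese", "Overweight",
                                 "Normal Weight", "Underweight"),
                   wellbeing = c("Thriving", "Struggling", "Suffering")))

# When x is a matrix, chisq.test() treats it as a contingency table
chi <- chisq.test(counts)

print(chi$expected)   # check the expected-frequency assumption
print(chi)            # statistic, degrees of freedom, p-value
```

`byrow = TRUE` is needed because `matrix()` fills column by column by default, which would scramble the table.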
Step 5: Since the p-value (0.306) is greater than the 0.05 level of significance, we fail to reject H0.
Step 6: Therefore, there is not sufficient evidence to conclude that there is an association between weight classification and social well-being.
Example 3:
A survey was conducted at a community college of 102 randomly selected students who dropped a course in the current semester to learn why students drop courses. Personal drop reasons include financial, transportation, family issues, health issues, and lack of child care. Course drop reasons include reducing one's load, being unprepared for the course, the course was not what was expected, dissatisfaction with teaching, and not getting the desired grade. Work drop reasons include an increase in hours, a change in shift, and obtaining full-time employment. Test whether gender is independent of drop reason at the α = 0.1 level of significance. Import the Excel file "chi-square data", sheet name "example 3".
This is a 2 by 5 contingency table. All expected frequencies are greater than 1, and no more than 20% of cells have expected frequencies less than 5, which means that the assumption is satisfied.
Step 1:
H0 : The gender of the students is independent of their drop reason.
Ha : The gender of the students is dependent on their drop reason.
Step 2: α = 0.10
Step 3: Since we are testing the significance of the relationship between two categorical variables, we will use the Chi-square test.
Procedures for Testing Hypothesis
Step 4:
Activities/Assessments
1. Patients are instructed to do the exercise program 3 times per week for 6 weeks. After 6 weeks, systolic blood pressures are again measured. The data are shown.
Systolic Blood Pressure of Patient
Patient  Before  After
1        125     118
2        132     134
3        138     130
4        120     124
5        125     105
6        127     130
7        136     130
8        139     132
9        131     123
10       132     128
Is there a difference in systolic blood pressures after participating in the exercise program as compared to before? Use the α = 0.01 level of significance.
2. An economist believes that the median income of lawyers who recently graduated from law school is more than $64,000. He queries a random sample of 12 lawyers and obtains the accompanying data. Do the data support the economist's belief at the α = 0.05 level of significance?
No.  Income
1    85,000
2    63,000
3    62,000
4    70,000
5    91,000
6    67,000
7    68,500
8    86,000
9    70,500
10   71,000
11   69,000
12   60,500
Activities/Assessments Activities/Assessments
K.Elizon P.Aranas L.Usona L.Bautista E.Baccay Engineering Data Analysis K.Elizon P.Aranas L.Usona L.Bautista E.Baccay Engineering Data Analysis
Activities/Assessments Activities/Assessments
5. Agribusiness researchers are interested in determining the conditions
under which Christmas trees grow fastest. A random sample of
equivalent-size seedlings is divided into four groups. The trees are all
grown in the same field. One group is left to grow naturally (Group 1),
one group is given extra water (Group 2), one group is given fertilizer
spikes (Group 3), and one group is given fertilizer spikes and extra water
(Group 4). At the end of one year, the seedlings are measured for growth
(in height). These measurements are shown for each group.

Group 1   Group 2   Group 3   Group 4
5         12        14        20
7         11        10        16
11        9         16        15
9         13        17        14
6         12        12        22

Determine whether there is a significant difference in the growth of trees
in these groups. Use α = 0.01.

6. A random sample of 395 people were surveyed and each person was
asked to report the highest education level they obtained. The data that
resulted from the survey are summarized in the following table:

          Highschool   Bachelor   Masters   Ph.D.
Female    60           54         46        41
Male      40           44         53        57

7. The following are the ranks in population and the ranks in crime rate
of 5 cities. Determine if there is a relationship between the ranks of
the cities in the two measures. Use 0.10 level of significance.

City          1    2    3    4    5
Crime Rate    13   34   5    12   17
Population    9    41   10   2    20
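For Activity 5, a minimal sketch of the one-way analysis of variance arithmetic is given below in Python. It assumes that one-way ANOVA is the intended method and that its assumptions (normality, equal variances) hold; neither is verified here, and the names used are hypothetical.

```python
# Growth measurements from Activity 5, one list per treatment group
groups = {
    "Group 1": [5, 7, 11, 9, 6],
    "Group 2": [12, 11, 9, 13, 12],
    "Group 3": [14, 10, 16, 17, 12],
    "Group 4": [20, 16, 15, 14, 22],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares: group size times the squared distance
# of each group mean from the grand mean
ss_between = sum(
    len(v) * (sum(v) / len(v) - grand_mean) ** 2 for v in groups.values()
)

# Within-group sum of squares: squared distance of each observation
# from its own group mean
ss_within = sum(
    (x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v
)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F = mean square between / mean square within
f_stat = (ss_between / df_between) / (ss_within / df_within)

print(f"SSB = {ss_between:.2f}, SSW = {ss_within:.2f}")
print(f"F = {f_stat:.3f} with ({df_between}, {df_within}) df")
```

The F statistic would then be compared with the critical F value for these degrees of freedom at α = 0.01.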
References
https://wolfweb.unr.edu/homepage/ania/stat352f12lectures/352lecture21f12.pdf
Sullivan, Michael, III. Statistics: Informed Decisions Using Data. Fifth Edition.
Walpole. Probability and Statistics for Engineers and Scientists. Ninth Edition.
1. A quality-control manager randomly selects 70 bottles of ketchup that were filled on July 17 to assess the
calibration of the filling machine.
(a)
(b)
2. Researchers want to determine whether or not higher folate intake is associated with a lower risk of hypertension
(high blood pressure) in women (27 to 44 years of age). To make this determination, they look at 7373 cases
of hypertension in these women and find that those who consume at least 1000 micrograms per day (µg /d) of
total folate had a decreased risk of hypertension compared with those who consume less than 200 µg /d.
(a)
(b)
Directions: Read each item carefully. Write the commands/code that produce the information requested in each item.
1. Out of the 27 respondents considered in the survey, 12, 5, and 10 named mathematics, statistics, and
science, respectively, as their favorite subject. Create a factor based on the information given.
2. Create a list that contains a numeric vector (1 to 30, 34, 45, 47, and 50 to 70), a character vector that
repeats the elements A, B, and C 30 times, and a 2 x 3 matrix that contains the numbers from 101 to 106 filled by
rows.
Directions: Read each item carefully. For each item, write the letter corresponding to the best answer on yellow
paper. Write NONE if no correct choice is given.
4. It is an assumption of the ________ test that the distribution of the differences in the dependent variable
between the two related groups should be approximately normally distributed.
5. It is an assumption of the ________ test that the two variables should be measured at the interval or ratio level
(i.e., they are continuous) given that the data is not normal.
6. It is a general form of the independent sample t-test that is appropriate to use with three or more data groups.
7. It is a method of testing the equality of two or more population means by analyzing sample variances. (Assume
normality and unequal variances.)
(a) Dependent Sample t-Test
(b) Welch Analysis of Variance
(c) Kruskal-Wallis H-Test
(d) One-Way Analysis of Variance
8. It is an assumption of the ________ test that the independent variable should consist of two or more categorical,
independent groups.
9. It is an assumption of the ________ test that the dependent variable should be measured at the interval or
ratio level (i.e., they are continuous).
11. It refers to a statistical method in which the data is not required to fit a normal distribution. Due to such
reason, they are sometimes referred to as distribution-free tests.
12. It refers to statistical methods that apply to data on a ratio scale, some of which also apply to data on an interval scale.
14. It is a statement about the population parameter that is contradictory to the null hypothesis, and is accepted
as true only if there is convincing evidence in favor of it.
15. When the value of the x variable increases, the value of the y variable also increases. This is known as ________.
16. If the computed correlation coefficient of two continuous variables is 0.967, then describe the relationship.
(a) Weak Negative and Inverse Relationship
(b) Strong Negative and Inverse Relationship
(c) Strong Positive and Direct Relationship
(d) Weak Positive and Direct Relationship
17. A company believes that it controls more than 30% of the total market share for one of its products. To test
this belief, a random sample of 144 purchasers of this product is contacted. It is found that 50 of the 144
purchased this company’s brand of the product. If a researcher wants to conduct a statistical test for this
problem, the alternative hypothesis would be ________.
(a) the population proportion is less than 0.30
(b) the population proportion is greater than 0.30
(c) the population proportion is not equal to 0.30
(d) the population mean is less than 40
18. If the computed value for Pearson r is negative, this implies that there is a/an ________ relationship between
variables x and y.
19. It is a test for a single mean when the population mean and standard deviation are known and the data follow
a normal distribution.
Is the gender of the students associated with pet preference? Use a 1% level of significance.
(a) Step 1:
(b) Step 2:
(c) Step 3:
Check the assumptions.
(d) Step 4:
(e) Step 5:
(f) Step 6:
2. Some studies have shown that in the Philippines, men spend more than women buying gifts and cards on
Valentine's Day. Suppose a researcher wants to test this hypothesis by randomly sampling 9 men and 10
women with comparable demographic characteristics from various large cities across the Philippines to be in a
study. Each study participant is asked to keep a log beginning one month before Valentine's Day and record all
purchases made for Valentine's Day during that one-month period. The resulting data are shown below. Use
these data and a 1% level of significance to determine whether, on average, men actually do spend significantly
more than women on Valentine's Day.
(a) Step 1:
(b) Step 2:
(c) Step 3:
Check the assumptions.
(d) Step 4:
(e) Step 5:
(f) Step 6: