Cereal Test
Nguyen Tra My
Truong Hong Hanh 20223482
Nguyen Hien Chi 20223467
Trinh Thao Nguyen 20223452
Ta Tien Dung 20223435
Motivation:
Cereal's popularity as a convenient and accessible breakfast option is challenged by the growing
variety of products and the confusion surrounding their nutritional value.
Goals:
This study aims to unravel this confusion by:
• Analyzing the calorie, protein, fiber, sugar, and vitamin content of popular cereals.
• Evaluating the accuracy of manufacturers' health claims.
This study ultimately seeks to equip consumers with the knowledge to navigate the cereal
landscape and make healthy breakfast choices.
1. VARIABLES EXPLANATION
Name: Name of cereal
mfr: Manufacturer of cereal
A = American Home Food Products
G = General Mills
K = Kelloggs
N = Nabisco
P = Post
Q = Quaker Oats
R = Ralston Purina
type:
C = cold
H = hot
calories: calories per serving
protein: grams of protein
fat: grams of fat
sodium: milligrams of sodium
fiber: grams of dietary fiber
carbo: grams of complex carbohydrates
sugars: grams of sugars
potass: milligrams of potassium
vitamins: vitamins and minerals - 0, 25, or 100, indicating the typical percentage of FDA
recommended
shelf: display shelf (1, 2, or 3, counting from the floor)
weight: weight in ounces of one serving
cups: number of cups in one serving
rating: a rating of the cereals (Possibly from Consumer Reports?)
3.DESCRIPTION OF THE DATA
library(readxl)
head(Cereal)
To consider the dimensions of the data set Cereal, we use the dim() function as below.
> dim(Cereal)
[1] 77 16
The data set has 77 rows (cereals) and 16 columns (variables).
We use the function names() to list the names of the variables of the data set Cereal.
names(Cereal)
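As a minimal sketch of what dim() and names() return, the block below builds a tiny stand-in data frame. The object `toy`, its values, and the file name in the comment are assumptions made for illustration; they are not taken from the Cereal data.

```r
# With the real data, something like the following would come first
# (hypothetical file name, not given in this report):
# Cereal <- readxl::read_excel("cereal.xlsx")

# Toy stand-in for the Cereal data; the values are made up
toy <- data.frame(
  name     = c("Cereal A", "Cereal B", "Cereal C"),
  mfr      = c("K", "G", "K"),
  calories = c(70, 110, 120)
)

# dim() returns the number of rows, then the number of columns
toy_dim <- dim(toy)
toy_dim

# names() returns the column (variable) names in order
toy_names <- names(toy)
toy_names
```

On the real data the same two calls report the 77 rows and 16 variables described in this section.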
4.CHARACTERISTICS OF VARIABLES
table(mfr)
mfr
A G K N P Q R
1 22 23 6 9 8 8
mfr1<-table(mfr)
barplot(mfr1,col="blue")
1.Bar plot of mfr
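The counts can also be read as shares of the whole data set. The sketch below re-enters the counts from the table(mfr) output by hand; the names `mfr_counts`, `total`, and `mfr_share` are introduced here for illustration only.

```r
# Counts copied from the table(mfr) output above
mfr_counts <- c(A = 1, G = 22, K = 23, N = 6, P = 9, Q = 8, R = 8)

# The counts should add up to the 77 rows reported by dim(Cereal)
total <- sum(mfr_counts)
total

# Share of each manufacturer in percent, largest first
mfr_share <- sort(round(100 * mfr_counts / total, 1), decreasing = TRUE)
mfr_share
```

On these counts, General Mills (G) and Kelloggs (K) together account for 45 of the 77 cereals.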
Type
table(type)
type
C H
74 3
type1<-table(type)
barplot(type1,col="blue")
2.Bar plot of type
Calories
summary(calories)
Min. 1st Qu. Median Mean 3rd Qu. Max.
50.0 100.0 110.0 106.9 110.0 160.0
hist(calories,col='blue')
Protein
table(protein)
protein
1 2 3 4 5 6
13 25 28 8 1 2
hist(protein,col = "blue")
Fat
table(fat)
fat
0 1 2 3 5
27 30 14 5 1
Sodium
> summary(sodium)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 130.0 180.0 159.7 210.0 320.0
hist(sodium,col='blue')
Fiber
table(fiber)
fiber
0 1 1.5 2 2.5 2.7 3 4 5 6 9 10 14
19 16 3 10 1 1 15 4 4 1 1 1 1
hist(fiber,col='blue')
Carbo
> summary(carbo)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.0 12.0 14.0 14.6 17.0 23.0
hist(carbo,col='blue')
Sugars
summary(sugars)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.000 3.000 7.000 6.922 11.000 15.000
hist(sugars,col='blue')
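The minimum of -1 reported for carbo and sugars cannot be a real quantity in grams. A plausible reading, which this report does not confirm, is that -1 marks a missing value. Under that assumption, a minimal sketch of recoding it before summarising could look like this; `toy` and its values are made up for illustration.

```r
# Toy columns standing in for carbo and sugars; values are illustrative only
toy <- data.frame(
  carbo  = c(12, -1, 17, 14),
  sugars = c(6, 3, -1, 11)
)

# Assumption: -1 is a missing-value code, so replace it with NA
toy[toy == -1] <- NA

# Summaries now skip the missing entries instead of treating -1 as a measurement
summary(toy$carbo)
mean_sugars <- mean(toy$sugars, na.rm = TRUE)
mean_sugars
```

If the assumption holds for the real data, the same recoding would slightly raise the reported means and remove the negative minimums.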
5.SIMPLE REGRESSION
In this part of the research, we consider the pairwise relationships between the variables and
fit simple regressions where they make sense.
We use pairs() and cor() to get a general view of the relationships between the variables.
# Inspect the column types, then keep only the numeric columns.
# The original Cereal data frame is left unchanged.
str(Cereal)
Cereal_numeric <- Cereal[, sapply(Cereal, is.numeric)]
pairs(Cereal_numeric, col = "blue")
cor(Cereal_numeric)
From the charts and correlations given by pairs() and cor(), we can see that there are
some promising simple linear relationships, such as the one between rating and calories, with a
correlation of -0.689:
rating = β0 + β1 * calories
> cor(rating,calories)
[1] -0.689376
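One way to spot promising predictors in a correlation matrix is to take the column for rating and sort it by absolute value. The sketch below does this on a small made-up data frame; `toy` and its numbers are assumptions for illustration and are not the Cereal data.

```r
# Toy data standing in for a few Cereal columns; values are illustrative only
toy <- data.frame(
  rating   = c(68, 34, 59, 29, 40, 53),
  calories = c(70, 120, 90, 140, 110, 100),
  sugars   = c(5, 12, 6, 14, 10, 3),
  fiber    = c(10, 1, 4, 0, 2, 3)
)

# Correlation of every variable with rating, dropping rating itself
r <- cor(toy)[, "rating"]
r <- r[names(r) != "rating"]

# Order by absolute strength so the strongest linear relationships come first
r_sorted <- r[order(abs(r), decreasing = TRUE)]
r_sorted
```

The sign shows the direction of each relationship, and the absolute value shows how close it is to a straight line.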
From the chart and the correlation above, the relationship between rating and calories is
negative. The authors use lm() and summary() to fit a simple linear regression model; the model
reported below regresses rating on sugars.
summary(lm(rating~sugars))
- The intercept of the model is estimated to be 59.2844, with a standard error of 1.9485. The
t-value for the intercept is 30.43, and the p-value is less than 2e-16, indicating that the
intercept is significantly different from zero.
- The coefficient for the "sugars" variable is estimated to be -2.4008, with a standard error of
0.2373. The t-value for the "sugars" variable is -10.12, and the p-value is 1.15e-15, suggesting
that the "sugars" variable is significantly associated with the "rating" variable.
- The multiple R-squared value is 0.5771, indicating that approximately 57.71% of the
variability in the "rating" variable can be explained by the linear relationship with the "sugars"
variable.
- The adjusted R-squared value is 0.5715, which takes into account the number of
predictors in the model. It is slightly lower than the multiple R-squared value because of the
penalty applied for the predictor.
- The F-statistic is 102.3 with 1 and 75 degrees of freedom, resulting in a very low p-value of
1.153e-15. This suggests that the overall model is statistically significant.
Overall, the model suggests that as the "sugars" variable increases by one unit, the "rating"
variable is expected to decrease by approximately 2.4008 units. However, it's important to note
that these conclusions are based solely on the regression output, and further analysis and
interpretation may be required.
According to our regression plot analysis, sugar content and rating are negatively correlated:
cereals with less sugar tend to have higher ratings.
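The figures quoted from the regression output can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers reported above (coefficients, standard error, R-squared, and the 77 observations); the helper `predict_rating` is a hypothetical name introduced here.

```r
# Values copied from the regression output reported above
b0    <- 59.2844   # intercept
b1    <- -2.4008   # slope for sugars
se_b1 <- 0.2373    # standard error of the slope
r2    <- 0.5771    # multiple R-squared
n     <- 77        # number of cereals
p     <- 1         # number of predictors

# Hypothetical helper: fitted rating for a given sugar content in grams
predict_rating <- function(sugars) b0 + b1 * sugars
predict_rating(c(0, 5, 10))

# The t-value of the slope is the estimate divided by its standard error
t_b1 <- b1 / se_b1

# With a single predictor, the F-statistic equals the squared t-value
f_stat <- t_b1^2

# Adjusted R-squared computed from R-squared, n, and p
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(t_value = t_b1, f_statistic = f_stat, adjusted_r2 = adj_r2)
```

Small differences from the reported values are expected, because the inputs here are already rounded.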
6.MULTIPLE REGRESSION
ingredients <- protein + fat + (sodium/100) + fiber + carbo + sugars + (potass/100)
summary(lm(rating~ingredients))
plot(ingredients,rating)
1. Coefficients: The estimated coefficients for the model are as follows:
- Intercept: 68.6192
- ingredients: -0.8713
These coefficients represent the intercept and the slope of the linear relationship between the
"rating" and "ingredients" variables.
2. Significance: The significance of the coefficients is indicated by the p-values. In this case, the
intercept is highly significant (p-value < 0.001), and the ingredients coefficient is
also significant (p-value = 0.0037). This suggests that "ingredients" has a
statistically significant association with the "rating" variable.
3. Residuals: The residuals represent the differences between the observed "rating" values and
the predicted values from the model. The summary provides information about the minimum,
1st quartile, median, 3rd quartile, and maximum values of the residuals.
4. Model Fit: The multiple R-squared value of 0.1069 indicates that approximately 10.69% of
the variability in the "rating" variable can be explained by the linear relationship with the
"ingredients" variable. The adjusted R-squared value of 0.09501 takes into account the number
of predictors in the model.
5. F-statistic: The F-statistic tests the overall significance of the model. In this case, the
F-statistic value is 8.979, with a p-value of 0.003701, suggesting that the model as a whole is
statistically significant.
Overall, the model suggests that there is a statistically significant linear relationship between the
"rating" and "ingredients" variables, with each unit increase in "ingredients" associated with a
decrease of approximately 0.8713 units in the "rating". However, the model explains only a
small portion of the variability in the "rating" variable.
Here we can see that as ingredients go up, rating tends to go down.
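Because "ingredients" is a single sum, the model above has one predictor and one slope. A multiple regression in the usual sense gives each nutrient its own coefficient. The sketch below contrasts the two approaches on simulated data; the data frame `toy`, the coefficients used to generate it, and the choice of sugars and fiber as predictors are all assumptions for illustration, not results from the Cereal data.

```r
set.seed(1)  # reproducible toy data; this is not the Cereal data set
n <- 50
toy <- data.frame(
  sugars = runif(n, 0, 15),
  fiber  = runif(n, 0, 10)
)
# Made-up relationship, used only so that lm() has something to recover
toy$rating <- 50 - 2 * toy$sugars + 3 * toy$fiber + rnorm(n, sd = 2)

# Multiple regression: one coefficient per predictor
fit <- lm(rating ~ sugars + fiber, data = toy)
coef(fit)

# Summing the predictors first forces them to share a single slope
toy$ingredients <- toy$sugars + toy$fiber
fit_sum <- lm(rating ~ ingredients, data = toy)

# The summed model is a restricted version of the first, so its R-squared
# can never be higher
r2_multi <- summary(fit)$r.squared
r2_sum   <- summary(fit_sum)$r.squared
c(multiple = r2_multi, summed = r2_sum)
```

When predictors pull in opposite directions, as sugars and fiber do in this toy example, adding them together hides most of the signal, which is one possible reason for a low R-squared in a summed model.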
7.CONCLUSION
After analysis of the given data, the authors come to the following conclusions about the
relationship between rating and the other variables.
About Simple Regression
Overall, the model suggests that as the "sugars" variable increases by one unit, the "rating"
variable is expected to decrease by approximately 2.4008 units. However, it's important to note
that these conclusions are based solely on the provided output, and further analysis and
interpretation may be required.
About Multiple Regression
Overall, the model suggests that there is a statistically significant linear relationship between the
"rating" and "ingredients" variables, with each unit increase in "ingredients" leading to a
decrease of approximately 0.8713 units in the "rating". However, the model explains only a
small portion of the variability in the "rating" variable.