CS446 Tool: Summarizing and Visualizing Numerical Variables in Bivariate and Multivariate Analyses
Contents: Bivariate Questions, Multivariate Questions
When you do an analysis that includes a numerical variable, it is best to visualize your data first, then
summarize it. This is because numerical variables can take many distinct values, which means that
the data can be distributed in many different ways. Visualizing a numerical variable shows you how
the data are distributed. Once you’ve examined the distribution of the data, you can decide which
summary statistics to calculate for your data.
Several of the examples in this tool use data sets that are already loaded into R. Other examples use
the titanic data set.
# In this tool you’ll analyze age, but age data is not available for every
# passenger. Remove all rows that contain NA in the age column. NA stands for
# not available:
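The step described in the comments above can be sketched as follows. This is a minimal sketch rather than the original code: titanic_toy is a small made-up data frame standing in for the titanic data set, and the column names Age and PClass are taken from the examples later in this tool.

```r
# Hypothetical stand-in for the titanic data set (made-up values)
titanic_toy <- data.frame(
  PClass = c("1st", "2nd", "3rd", "1st", "3rd"),
  Age    = c(29, NA, 24, 47, NA)
)

# Keep only the rows where Age is not NA
titanic_clean <- titanic_toy[!is.na(titanic_toy$Age), ]

nrow(titanic_toy)    # number of rows before removal
nrow(titanic_clean)  # number of rows after removal
```

Indexing with !is.na() on a single column drops only the rows that are missing Age, so rows that are missing values in other columns are kept.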
Bivariate Questions
This section will help you use summarization and visualization tools to answer questions that involve
one categorical and one numerical variable or two numerical variables.
In the image below, you can see the boxplots of passenger Age by PClass. Boxplots provide
information on the median, 25th percentile, and 75th percentile, as well as outliers in the data set.
Creating boxplots side by side allows for easy comparison among categories.
boxplot(Age ~ PClass,
        data = titanic)
[Figure: side-by-side boxplots of Age by PClass, with annotations marking the outliers and the 75th percentile.]
These boxplots give you more information than comparing the means or medians alone. For
example, you can see that all three summary measures (median, 25th percentile, and 75th percentile)
are higher for first class. They also show that despite this overall disparity, there were some older
passengers in the second and third classes (marked by the outliers) whose ages were comparable to
the older passengers in first class.
Some characteristics of the distributions you might want to compare include means, standard
deviations, medians, percentiles, and outliers. To easily make these comparisons in R, you can use
the aggregate() function to break your data into the groups in which you’re interested. In the
following example, we group the passengers by class, then calculate the mean for each group.
Calculating the means by group with the aggregate() function shows that the average age
of passengers that traveled first class (39.67 years) was much higher than the average age of
passengers that traveled third class (25.21 years), suggesting an association between the variables
Age and PClass.
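The grouping step described above can be sketched with aggregate() as follows. This is a minimal sketch under stated assumptions: titanic_toy is a made-up data frame standing in for the titanic data set, so its group means will not match the values reported above.

```r
# Hypothetical stand-in for the titanic data set (made-up values)
titanic_toy <- data.frame(
  PClass = c("1st", "1st", "2nd", "2nd", "3rd", "3rd"),
  Age    = c(38, 42, 30, 26, 22, 28)
)

# Mean age within each passenger class
mean_by_class <- aggregate(Age ~ PClass, data = titanic_toy, FUN = mean)
mean_by_class

# The same pattern works for other summaries, such as the median
aggregate(Age ~ PClass, data = titanic_toy, FUN = median)
```

The formula Age ~ PClass reads as "Age grouped by PClass", which mirrors the formula used in the boxplot() call earlier in this section.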
When we look for associations between two numerical variables, we often ask the following
question: If one variable increases, does the other tend to increase, decrease, or show no consistent
change? If the answer is “increase,” it suggests a positive association between the two variables.
Similarly, if the answer is “decrease,” it suggests a negative association. Finally, when the answer is
“show no consistent change,” it suggests no association between the two variables.
Here, we’ll look at examples from the mtcars data set in R, which contains information on gas
mileage, gross horsepower, weight, acceleration, and many other features of 32 different cars.
Type help(mtcars) from the R console to learn more about this data set.
In this data set, horsepower (hp) and weight (wt) are positively associated: as weight
increases, horsepower also tends to increase:
Finally, it doesn’t appear as though acceleration (qsec) and weight are associated because as
weight increases, acceleration does not change consistently in any direction:
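Scatterplots like the ones discussed above could be drawn as in the sketch below. The axis labels follow the descriptions in help(mtcars), where qsec is the quarter-mile time in seconds. The middle plot (mpg against wt) is an assumption about which pair illustrates a negative association; it is not taken from the text above.

```r
# Horsepower against weight: points trend upward
plot(mtcars$wt, mtcars$hp,
     xlab = "Weight (1000 lbs)", ylab = "Gross horsepower")

# Assumed example of a negative association: miles per gallon against weight
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Quarter-mile time against weight: no consistent trend
plot(mtcars$wt, mtcars$qsec,
     xlab = "Weight (1000 lbs)", ylab = "Quarter-mile time (s)")
```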
It is important, however, to keep in mind that correlation is not the same as causation. A low
correlation coefficient does not necessarily imply that one variable has no causal effect on the other.
In the same way, a large correlation does not necessarily imply the presence of a causal effect.
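One way to see why a low correlation coefficient does not rule out a real dependence is a relationship that is not linear. The sketch below uses made-up variables x and y, where y is computed directly from x, yet the correlation coefficient is essentially zero.

```r
# y is completely determined by x, but the relationship is not linear
x <- -10:10
y <- x^2

# The correlation coefficient measures linear association only,
# so it is essentially 0 for this symmetric, U-shaped relationship
r <- cor(x, y)
r

# The scatterplot makes the dependence obvious
plot(x, y)
```

This is one reason to visualize the data before relying on a summary statistic.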
The following code calculates r for each of the plots you made in the previous section:
cor(mtcars$wt, mtcars$hp)
## 0.6587479
cor(mtcars$wt, mtcars$mpg)
## -0.8676594
cor(mtcars$wt, mtcars$qsec)
## -0.1747159
Multivariate Questions
In this course, you have measured association between two categorical variables, one categorical
and one numerical variable, and two numerical variables. For each of these types of association,
we’ll focus on the case where a third categorical variable may be a confounding factor. The methods
you should use to evaluate this question depend on the combination of types of variables you are
studying.
But what if the relationship between Age and PClass is influenced by another categorical variable,
such as Sex? Next, you will check whether this association is equally strong among male and female
passengers.
You can use a side-by-side boxplot to examine all six groups of your data at once.
Function in R: boxplot()
boxplot(…,
        … "red"),
        las = 2, xlab = NA)
[Figure: side-by-side boxplots of Age for the six groups 1st.female, 2nd.female, 3rd.female, 1st.male, 2nd.male, and 3rd.male.]
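A six-group boxplot of this kind could be produced as in the sketch below. It is not the original call: titanic_toy is a made-up data frame standing in for the titanic data set, and the formula Age ~ PClass + Sex and the grey and red colors are assumptions. Only las = 2 and xlab = NA come from the surviving code.

```r
# Hypothetical stand-in for the titanic data set (made-up values)
titanic_toy <- data.frame(
  PClass = factor(rep(c("1st", "2nd", "3rd"), times = 4)),
  Sex    = factor(rep(c("female", "male"), each = 6)),
  Age    = c(35, 28, 22, 42, 30, 19, 45, 33, 26, 50, 36, 24)
)

# One box per combination of class and sex.
# las = 2 draws the group labels perpendicular to the axis so they fit;
# xlab = NA suppresses the x-axis title.
bp <- boxplot(Age ~ PClass + Sex,
              data = titanic_toy,
              col = c("grey", "grey", "grey", "red", "red", "red"),
              las = 2, xlab = NA)

# Group labels in plotting order
bp$names
```

Comparing the three female boxes with the three male boxes shows whether the age difference across classes looks similar for both sexes.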
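# Correlation between eruption duration and waiting time in R's built-in
# faithful data set (eruptions of the Old Faithful geyser):
cor(faithful$eruptions, faithful$waiting)
## 0.9008112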
A correlation coefficient of 0.9 suggests a very strong association. Yet Old Faithful may have different
kinds of eruptions, which could be a confounding factor that influences the relationship between
eruptions and waiting.
# Add a vertical red line between the two clusters:
[Figure: plot of eruption durations (x-axis from 1.5 to 5.0 minutes) with a vertical red line between the two clusters.]
To do this, create a color-coded scatterplot to mark the short and long eruptions differently. First,
create a new variable to code eruption types (short or long).
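The coding step could look like the sketch below. The cutoff of 3 minutes is an assumption chosen by eye from the gap between the two clusters, not a value from this tool, so numbers computed from it may differ slightly from those reported later. The names Type, colorcode, and pchcode match the ones used in the code that follows.

```r
# Look at the distribution of eruption durations
hist(faithful$eruptions, breaks = 20,
     xlab = "Eruption Duration (mins)", main = "")

# Assumed cutoff between the two clusters (hypothetical value)
cutoff <- 3
abline(v = cutoff, col = "red")

# Code each eruption as short or long
faithful$Type <- ifelse(faithful$eruptions < cutoff, "short", "long")

# Point color and point type for each eruption:
# grey circles (pch 19) for short, black triangles (pch 17) for long
colorcode <- ifelse(faithful$Type == "short", "grey", "black")
pchcode   <- ifelse(faithful$Type == "short", 19, 17)

# Number of eruptions of each type
table(faithful$Type)
```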
plot(faithful$eruptions, faithful$waiting,
     col = colorcode,  # use the colorcode vector to
                       # determine point color
     pch = pchcode,    # use the pchcode vector to
                       # determine point type
     xlab = "Eruption Duration (mins)",
     ylab = "Waiting Times (mins)")
# Create a legend
legend("topleft", c("short", "long"),
       pch = c(19, 17), col = c("grey", "black"))
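[Figure: scatterplot of Waiting Times (mins) against Eruption Duration (mins), with short and long eruptions marked by different colors and point types.]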
Based on the scatterplot, you can clearly see two different clusters, indicating that the two types of
eruptions interact with waiting times differently.
# Calculate the correlation between eruption and waiting time for short
# eruptions:
cor(faithful$eruptions[faithful$Type == "short"],
    faithful$waiting[faithful$Type == "short"])
## 0.3511609
# Calculate the correlation between eruption and waiting time for long
# eruptions:
cor(faithful$eruptions[faithful$Type == "long"],
    faithful$waiting[faithful$Type == "long"])
## 0.3537682
Here, following up a bivariate association with a multivariate analysis revealed a more
nuanced relationship between the variables. The bivariate analysis of waiting time and eruption
duration gave a correlation coefficient of 0.90, which suggests that eruptions with longer
durations are very likely to have longer waiting times. Among eruptions of the same
type (long or short), however, a slightly longer duration is much less reliably associated with a
longer waiting time, as reflected by the modest correlation of about r = 0.35.
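The comparison between the overall and the within-group correlations can also be computed in one pass, as in the sketch below. The cutoff of 3 minutes used to define the eruption types is an assumption, so the within-group values may not match the ones reported above exactly.

```r
# Assumed cutoff between short and long eruptions (hypothetical value)
cutoff <- 3
type <- ifelse(faithful$eruptions < cutoff, "short", "long")

# Correlation using all eruptions together
overall_r <- cor(faithful$eruptions, faithful$waiting)

# Correlation computed separately within each eruption type
within_r <- sapply(split(faithful, type),
                   function(d) cor(d$eruptions, d$waiting))

overall_r
within_r
```

split() breaks the data frame into one piece per eruption type, and sapply() applies the same correlation calculation to each piece, which avoids repeating the subsetting code for every group.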