cs446_tool-summarizing-and-visualizing-numerical-variables-in-bbivariate-and-multivariate-analyses

This document serves as a guide for summarizing and visualizing numerical variables in bivariate and multivariate analyses using R. It covers methods for measuring associations between categorical and numerical variables, as well as between two numerical variables, and includes practical examples using the Titanic and mtcars datasets. The document also discusses the importance of visualizing data before summarizing it and provides code snippets for various R functions to facilitate these analyses.

Uploaded by

hasiba

TOOL

Summarizing and Visualizing Numerical Variables in Bivariate and Multivariate Analyses
Table of Contents

Using R With This Tool
Data Set Information
Bivariate Questions
Measuring Association Between One Categorical and One Numerical Variable
Visualizing Bivariate Data With Boxplots
Summarizing Bivariate Data With the aggregate() Function
Measuring Association Between Two Numerical Variables
Visualizing Bivariate Data With Scatterplots
Summarizing Bivariate Data by Measuring Associations
Multivariate Questions
Check for a Confounding Effect on a Numerical-Categorical Association
Visualizing Multivariate Data With Side-by-Side Boxplots
Summarizing Multivariate Data With the aggregate() Function
Check for a Confounding Effect on a Numerical-Numerical Association
Visualizing Multivariate Data With Histograms and Color-Coded Scatterplots
Summarizing Multivariate Data by Measuring Associations

Summarizing and Visualizing Data


© 2021 Cornell University 1
Cornell Bowers College of Computing and Information Science
When you want to understand how one numerical variable relates to another numerical variable,
a categorical variable, or both types of variables, you can use this tool to help you summarize and
visualize your data. In particular, this tool will help you aggregate data into summaries, create
boxplots, and test for associations among variables.

When you do an analysis that includes a numerical variable, it is best to visualize your data first, then
summarize it. This is because numerical variables can take many distinct values, which means that
the data can be distributed in many different ways. Visualizing a numerical variable shows you how
the data are distributed. Once you’ve examined the distribution of the data, you can decide which
summary statistics to calculate for your data.
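
As a small illustration of the visualize-first workflow, the sketch below uses the hp column of mtcars, a data set that ships with R. The choice of variable is an assumption made only for this example.

```r
# Look at the distribution first
hist(mtcars$hp, main = "Gross horsepower", xlab = "hp")

# The histogram has a longer right tail, so the mean is pulled above the
# median; report the median and quartiles alongside the mean
summary(mtcars$hp)
```

Because the shape of the histogram guides which summaries are informative, the same two steps can be repeated for any numerical column before choosing statistics.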

Several of the examples in this tool use data sets that are already loaded into R. Other examples use
the titanic data set.

Using R With This Tool


The portions of this tool with a gray background are code text that you can use to do the examples
included in this tool or modify to work with your own data. To use these examples, type the lines
of code that don’t begin with a pound sign (#) into R to carry out the command. Commented text
begins with one pound sign (#) and explains the lines of code. The code output begins with two
pound signs (##).

Data Set Information


The titanic data set contains demographic information about passengers on the RMS Titanic,
which sank in the Atlantic Ocean in 1912. The titanic data set has data on each passenger in
the rows and on passenger characteristics in the columns. To use the titanic data set with this
tool, download the data set, set your working directory to the location of the data set, and run the
following code:

titanic <- read.table("titanic.txt", header = TRUE) # Read in the data

# Change the Survived variable to a factor with the factor() function by
# telling R to replace 0 with No and 1 with Yes, then replacing
# titanic$Survived with SurvivedFactor:

SurvivedFactor <- factor(titanic$Survived, levels = c("0", "1"),
                         labels = c("No", "Yes"))
titanic$Survived <- SurvivedFactor

# In this tool you’ll analyze age, but age data is not available for every
# passenger. Remove all rows that contain NA in the Age column. NA stands for
# not available:

titanic <- titanic[!is.na(titanic$Age), ] # This line of code tells R to keep
# only the rows where Age is not NA. In this expression, is.na() returns TRUE
# for every value of titanic$Age that is NA, and the ! (logical NOT) flips
# TRUE and FALSE so that those rows are dropped.

head(titanic) # Display the first 6 rows of data

Bivariate Questions
This section will help you use summarization and visualization tools to answer questions that involve
one categorical and one numerical variable or two numerical variables.

Measuring Association Between One Categorical and One Numerical Variable


When you’re answering a bivariate question that involves one categorical and one numerical
variable, you usually want to understand how the distribution of the numerical variable changes
depending on the categorical variable.

Visualizing Bivariate Data With Boxplots

Functions in R: boxplot()

One way to visualize the distribution of a numerical variable is by making a boxplot. Boxplots of
different variables are easier to compare than histograms because you can plot them together
and compare their key features.

In the image below, you can see the boxplots of passenger Age by PClass. Boxplots provide
information on the median, 25th percentile, and 75th percentile, as well as outliers in the data set.
Creating box plots side by side allows for easy comparison among categories.

boxplot(Age ~ PClass,
        data = titanic,
        col = c("black", "gray", "red"))

[Figure: boxplots of Age by PClass (1st, 2nd, 3rd), annotated to show the outliers, 75th percentile, median, and 25th percentile.]

These boxplots give you more information than comparing the means or medians alone. For
example, you can see that all three summary measures (median, 25th percentile, and 75th percentile)
are higher for first class. The plot also shows that despite this overall disparity, there were some older
passengers in the second and third classes (marked by the outliers) whose ages were comparable to
those of the older passengers in first class.
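
The numbers that a boxplot draws can also be read off directly. The following is a minimal sketch that assumes mtcars (which ships with R) in place of the titanic file, grouping gas mileage by number of cylinders; the names b and stats are introduced only for this example.

```r
# plot = FALSE returns the boxplot numbers without drawing anything
b <- boxplot(mpg ~ cyl, data = mtcars, plot = FALSE)

# Rows of b$stats: lower whisker, lower hinge (about the 25th percentile),
# median, upper hinge (about the 75th percentile), upper whisker
stats <- b$stats
colnames(stats) <- b$names
rownames(stats) <- c("lower whisker", "lower hinge", "median",
                     "upper hinge", "upper whisker")
stats

# Points drawn as outliers, with the group each one belongs to
data.frame(group = b$names[b$group], value = b$out)
```

Replacing the formula and data arguments with Age ~ PClass and titanic should give the values behind the Titanic boxplots, assuming the data set has been loaded as described earlier.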

Summarizing Bivariate Data With the aggregate() Function

Functions in R: aggregate()

To see whether the distribution of a numerical variable changes based on different levels of a
related categorical variable, you can compare the distribution of the numerical variable across
subsets of the data that have different values of the categorical variable.

Some characteristics of the distributions you might want to compare include means, standard
deviations, medians, percentiles, and outliers. To easily make these comparisons in R, you can use
the aggregate() function to break your data into the groups in which you’re interested. In the
following example, we group the passengers by class, then calculate the mean age for each group.

aggregate(Age ~ PClass, FUN = mean, data = titanic)


## PClass Age
## 1 1st 39.66779
## 2 2nd 28.30014
## 3 3rd 25.20858

Calculating the means by group with the aggregate() function shows that the average age
of passengers that traveled first class (39.67 years) was much higher than the average age of
passengers that traveled third class (25.21 years), suggesting an association between the variables
Age and PClass.
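
aggregate() is not limited to a single statistic. As a hedged sketch, the block below passes a small anonymous function as FUN so that the mean, median, and standard deviation come back together. It uses mtcars (bundled with R) rather than the titanic file, and the name summary_by_cyl is hypothetical.

```r
# Several summaries per group in one call
summary_by_cyl <- aggregate(
  mpg ~ cyl, data = mtcars,
  FUN = function(x) c(mean = mean(x), median = median(x), sd = sd(x))
)

summary_by_cyl # One row per cylinder count; mpg holds the three statistics
```

The same pattern with Age ~ PClass and data = titanic should produce the corresponding table for the Titanic passengers.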

Measuring Association Between Two Numerical Variables
When you ask a bivariate question about two numerical variables, you usually want to know if, and
how, a change in one variable is associated with a change in the other variable.

When we look for associations between two numerical variables, we often ask the following
question: If one variable increases, does the other tend to increase, decrease, or show no consistent
change? If the answer is “increase,” it suggests a positive association between the two variables.
Similarly, if the answer is “decrease,” it suggests a negative association. Finally, if the answer is
“show no consistent change,” it suggests no association between the two variables.

Visualizing Bivariate Data With Scatterplots

Function in R: plot()

You can use a scatterplot to visualize the association between two numerical variables.

Here, we’ll look at examples from the mtcars data set in R, which contains information on gas
mileage, gross horsepower, weight, acceleration, and many other features of 32 different cars.
Type help(mtcars) from the R console to learn more about this data set.

In this data set, horsepower (hp) and weight (wt) are positively associated because as weight
increases, horsepower also tends to increase:

# Plot the data
plot(hp ~ wt, data = mtcars, pch = 19)

In contrast, gas mileage (mpg) and weight are negatively associated because as weight increases,
gas mileage tends to decrease:

plot(mpg ~ wt, data = mtcars, pch = 19)

Finally, it doesn’t appear as though acceleration (qsec, recorded as the quarter-mile time) and
weight are associated because as weight increases, qsec does not change consistently in either
direction:

plot(qsec ~ wt, data = mtcars, pch = 19)

Summarizing Bivariate Data by Measuring Associations

Function in R: cor()

You can summarize the strength and nature of a linear association numerically with Pearson’s
correlation coefficient (denoted using the symbol r).

The value of r can be any value between –1 and 1, and the sign of r reflects whether the association
is positive or negative in nature. Its magnitude indicates the strength.

An r close to ±1 indicates very strong association, whereas r ≈ 0 indicates no or very weak
association. Yet whether an association between two variables should be called strong, moderate,
or weak varies across different fields of study. For instance, in some social sciences, the following is
taken as a rule of thumb:

• Strong association: r ≤ −0.5 or r ≥ 0.5
• Moderate association: −0.5 < r ≤ −0.3 or 0.3 ≤ r < 0.5
• Weak association: −0.3 < r < 0.3

Because r describes the linear association between two variables, a small r should be interpreted
with care because there could be a potential nonlinear association between the two variables even
when r ≈ 0.
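
To make both points concrete, here is a minimal sketch. The helper strength() is hypothetical (it is not a built-in R function) and simply encodes the rule of thumb listed above; the x and y vectors are made-up values chosen to form a U-shaped pattern.

```r
# Hypothetical helper that applies the rule of thumb to one or more r values
strength <- function(r) {
  a <- abs(r)
  ifelse(a >= 0.5, "strong", ifelse(a >= 0.3, "moderate", "weak"))
}

strength(c(0.66, -0.87, -0.17))

# A perfect but nonlinear (U-shaped) relationship: y is fully determined by x
x <- seq(-3, 3, by = 0.5)
y <- x^2

cor(x, y) # Essentially 0, because r only measures linear association
plot(x, y, pch = 19) # The plot shows the pattern that r misses
```

This is one reason to draw the scatterplot before relying on r: the rule of thumb would label the second example as weak even though the two variables are tightly linked.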

It is important, however, to keep in mind that correlation is not the same as causation. A low
correlation coefficient does not necessarily imply the lack of a causal effect of one variable on the
other. In the same way, a large correlation does not necessarily imply the presence of a causal effect.

The following code calculates r for each of the plots you made in the previous section:

cor(mtcars$wt, mtcars$hp) # Calculate the correlation coefficient (r) between
                          # weight and horsepower

## 0.6587479

cor(mtcars$wt, mtcars$mpg) # Calculate the correlation coefficient between
                           # weight and gas mileage

## -0.8676594

cor(mtcars$wt, mtcars$qsec) # Calculate the correlation coefficient between
                            # weight and acceleration

## -0.1747159
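
When several pairs are of interest, cor() also accepts a data frame of numerical columns and returns every pairwise coefficient at once. The sketch below assumes the same four mtcars columns used above; vars and r_matrix are names introduced only for this example.

```r
# Columns to compare
vars <- c("wt", "hp", "mpg", "qsec")

# Matrix of pairwise Pearson correlations
r_matrix <- cor(mtcars[, vars])

round(r_matrix, 2) # The wt row repeats the three coefficients computed above
```

The matrix is symmetric with 1s on the diagonal, so each pair only needs to be read once.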

Multivariate Questions
When you look for a relationship between two variables, you might want to check whether that
relationship is influenced by a third variable. Asking the right multivariate question can help you gain
a deeper understanding of your data by uncovering patterns you cannot find in a bivariate analysis.

In this course, you have measured association between two categorical variables, one categorical
and one numerical variable, and two numerical variables. For each of these types of association,
we’ll focus on the case where a third categorical variable may be a confounding factor. The methods
you should use to evaluate this question depend on the combination of types of variables you are
studying.

Check for a Confounding Effect on a Numerical-Categorical Association


When you examined the association between the ages of Titanic passengers (Age) and the three
passenger classes (PClass), you saw that the numerical variable Age and the categorical variable
PClass were associated. In particular, the passengers in first class were older, on
average, than passengers in the other two classes.

But what if this relationship is influenced by another categorical variable, such as Sex? Next, you will
check whether this association looks the same among male and female passengers.

Visualizing Multivariate Data With Side-by-Side Boxplots

Function in R: boxplot()

You can use a side-by-side boxplot to examine all six groups of your data at once.

boxplot(Age ~ PClass + Sex,
        data = titanic,
        col = c("black", "gray", "red"),
        las = 2, xlab = NA)

[Figure: side-by-side boxplots of Age for the six groups 1st.female, 2nd.female, 3rd.female, 1st.male, 2nd.male, and 3rd.male.]

Based on this plot, it looks like the same pattern persists for both male and female passengers,
suggesting that the relationship between Age and PClass is not confounded by Sex.

Summarizing Multivariate Data With the aggregate() Function

Function in R: aggregate()

You can also summarize this information by calculating the average age of male and female
passengers in different passenger classes.

# Create a table that shows the average age of each group.

Mean.Age.by.PClass.Sex <- aggregate(Age ~ PClass + Sex,
                                    FUN = mean, data = titanic)

Mean.Age.by.PClass.Sex # Display the table


## PClass Sex Age
## 1st female 37.77228
## 2nd female 27.38824
## 3rd female 22.77618
## 1st male 41.19936
## 2nd male 28.91047
## 3rd male 26.35722
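
One way to read this table is to compare the age gap between classes within each sex. The sketch below does not need the titanic file: it retypes the six group means from the output above, so the names means, two_way, and gap are introduced only for this illustration.

```r
# Group means copied from the aggregate() output above
means <- data.frame(
  PClass = rep(c("1st", "2nd", "3rd"), times = 2),
  Sex    = rep(c("female", "male"), each = 3),
  Age    = c(37.77228, 27.38824, 22.77618, 41.19936, 28.91047, 26.35722)
)

# Arrange the means as a PClass-by-Sex table (one value per cell)
two_way <- xtabs(Age ~ PClass + Sex, data = means)
two_way

# Difference in mean age between 1st and 3rd class, within each sex
gap <- two_way["1st", ] - two_way["3rd", ]
gap
```

The gap comes out at roughly 15 years for both female and male passengers, which is consistent with the pattern seen in the side-by-side boxplots.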

Check for a Confounding Effect on a Numerical-Numerical Association
Consider the faithful data set in R, which contains information on the eruptions of Old Faithful,
a geyser in Yellowstone National Park. In particular, the data set has two variables:

• eruptions, which shows the duration of a single eruption of the geyser
• waiting, which shows the waiting time between two consecutive eruptions

If you want to understand how waiting time relates to eruption duration, you might first find the
correlation coefficient between eruptions and waiting, which is about 0.9:

cor(faithful$eruptions, faithful$waiting)
## 0.9008112

A correlation coefficient of 0.9 suggests a very strong association. Yet Old Faithful may have different
kinds of eruptions, which could be a confounding factor that influences the association between
eruptions and waiting.

Visualizing Multivariate Data With Histograms and Color-Coded Scatterplots


Functions in R: hist(), abline(), ifelse(), plot(), legend()

First, you should make a histogram to determine whether there are any clusters in the data. When
you do this for the faithful data, you can see that there are two distinct clusters of eruption
durations.

# Plot a histogram of eruption durations:

hist(faithful$eruptions, breaks = 40,
     col = "grey",
     main = "Histogram of Eruption Durations",
     xlab = "Eruption Duration (in mins)")

# Add a vertical red line between the two clusters:

abline(v = 3.2, col = "red", lwd = 2)

[Figure: "Histogram of Eruption Durations", with Frequency on the y-axis and Eruption Duration (in mins) on the x-axis; a vertical red line at 3.2 separates the two clusters.]

As you can see from the histogram, there are two types of eruptions: “short” and “long”. To
determine whether type of eruption is a confounding factor in the relationship between eruptions
and waiting, you can examine the correlation within each of the two eruption categories separately.

To do this, start by creating a color-coded scatterplot that marks the short and long eruptions
differently. First, create a new variable to code eruption types (short or long).

faithful$Type <- NA # Create the variable
faithful$Type[faithful$eruptions < 3.2] <- "short" # Designate short eruptions
faithful$Type[faithful$eruptions >= 3.2] <- "long" # Designate long eruptions;
                                                   # >= ensures a duration of
                                                   # exactly 3.2 is not left NA
head(faithful) # Show faithful data set with new Type variable
## eruptions waiting Type
## 3.600 79 long
## 1.800 54 short
## 3.333 74 long
## 2.283 62 short
## 4.533 85 long
## 2.883 55 short
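
As an alternative sketch, the same labels can be built in one step with cut(). The name type_cut is hypothetical, and right = FALSE is an assumption made here so that a duration of exactly 3.2 would fall in the long group.

```r
# Bin durations at the 3.2-minute split:
# [-Inf, 3.2) is labeled short and [3.2, Inf) is labeled long
type_cut <- cut(faithful$eruptions,
                breaks = c(-Inf, 3.2, Inf),
                labels = c("short", "long"),
                right = FALSE)

# Count how many eruptions fall into each type
table(type_cut)
```

cut() returns a factor rather than a character vector, which can be convenient for grouping functions such as aggregate() or boxplot().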

Then, make a color-coded scatterplot:

# Use the ifelse() function to color code and create distinct
# point types for long vs. short observations:

colorcode <- ifelse(faithful$Type == "short", "grey", "black")
pchcode <- ifelse(faithful$Type == "short", 19, 17)

# Create a plot with the color- and point-type-coded points

plot(faithful$eruptions, faithful$waiting,
     col = colorcode, # use the colorcode vector to
                      # determine point color
     pch = pchcode,   # use the pchcode vector to
                      # determine point type
     xlab = "Eruption Duration (mins)",
     ylab = "Waiting Times (mins)")

# Create a legend
legend("topleft", c("short", "long"),
       pch = c(19, 17), col = c("grey", "black"))

[Figure: scatterplot of Waiting Times (mins) against Eruption Duration (mins), with short and long eruptions drawn in different colors and point types and a legend in the top left.]

Based on the scatterplot, you can clearly see two different clusters, indicating that short and long
eruptions go with different ranges of waiting times.

Summarizing Multivariate Data by Measuring Associations

Function in R: cor()

Next, calculate the correlation within each group you discovered by visualizing the data in terms of
the confounding variable Z; in this case, the type of eruption.

# Calculate the correlation between eruption and waiting time for short
# eruptions:
cor(faithful$eruptions[faithful$Type == "short"],
    faithful$waiting[faithful$Type == "short"])
## 0.3511609

# Calculate the correlation between eruption and waiting time for long
# eruptions:
cor(faithful$eruptions[faithful$Type == "long"],
    faithful$waiting[faithful$Type == "long"])
## 0.3537682
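
The two calls above can also be written as a single grouped calculation. This is only a sketch of one possible approach: it rebuilds the short/long label locally (as type) so that it runs on its own, and the names groups, r_by_type, and r_overall are hypothetical.

```r
# Label each eruption using the 3.2-minute split from the histogram
type <- ifelse(faithful$eruptions < 3.2, "short", "long")

# Split the two numerical columns into one data frame per eruption type
groups <- split(faithful[, c("eruptions", "waiting")], type)

# Correlation between duration and waiting time within each type
r_by_type <- sapply(groups, function(g) cor(g$eruptions, g$waiting))
r_by_type

# Correlation over all eruptions, for comparison
r_overall <- cor(faithful$eruptions, faithful$waiting)
r_overall
```

Keeping the within-group and overall coefficients side by side makes the drop in r easy to see, and the same split() and sapply() pattern extends to a confounder with more than two levels.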

Here, taking a closer look at a bivariate association with a multivariate analysis revealed a more
nuanced relationship between the variables. The bivariate analysis of waiting time and eruption
duration resulted in a correlation coefficient of 0.90, which suggests that eruptions with longer
durations are very likely to have longer waiting times. Between two eruptions of the same
type (long or short), however, an eruption with a slightly longer duration is less likely to have a
longer waiting time, as reflected by a moderate correlation of r = 0.35.
