Data Exploration and Visualization Unit 2
Introduction
Understanding the relationship between multiple variables is crucial in data exploration and
visualization. This chapter focuses on handling two and three variables effectively through various
techniques, including scatterplots, transformations, and contingency tables.
Definition
The relationship between two variables is a fundamental concept in statistics and data analysis.
Understanding how one variable influences or is associated with another helps in predicting
outcomes, drawing insights, and making informed decisions.
Types of Relationships
a. Positive Relationship
In a positive relationship, as one variable increases, the other variable also tends to increase.
Example: Height and weight. Taller people tend to weigh more.
Visualization
A scatterplot of height (X-axis) versus weight (Y-axis) would show an upward trend.
b. Negative Relationship
In a negative relationship, as one variable increases, the other variable tends to decrease.
Example: The distance driven and the amount of gas left in the tank. As the distance driven
increases, the gas remaining decreases.
Visualization
A scatterplot would show a downward trend.
c. No Relationship
When two variables do not show any consistent pattern or association, they are said to have no
relationship.
Example: The amount of ice cream sold and the height of a person. These two variables are likely
unrelated.
Visualization
A scatterplot would show points scattered with no clear upward or downward trend.
2. Correlation Coefficient
The Pearson correlation coefficient (r) quantifies the strength and direction of a linear
relationship between two variables.
$$r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
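Where:
$n$ is the number of paired observations, $x$ and $y$ are the values of the two variables, and each
$\sum$ denotes a sum over all $n$ pairs.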
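Example
Let's calculate the correlation coefficient for the data on hours studied ($x$) and exam scores ($y$),
using the same five pairs plotted in the Scatterplots section: (1, 50), (2, 55), (3, 70), (4, 80),
(5, 90), so $n = 5$.
Sums: $\sum x = 15$, $\sum y = 345$, $\sum xy = 1140$, $\sum x^2 = 55$, $\sum y^2 = 24925$.
Numerator: $n\sum xy - (\sum x)(\sum y) = 5(1140) - (15)(345) = 5700 - 5175 = 525$
For x: $n\sum x^2 - (\sum x)^2 = 5(55) - 15^2 = 275 - 225 = 50$
For y: $n\sum y^2 - (\sum y)^2 = 5(24925) - 345^2 = 124625 - 119025 = 5600$
Finally: $r = \dfrac{525}{\sqrt{50 \times 5600}} = \dfrac{525}{\sqrt{280000}} \approx \dfrac{525}{529.15} \approx 0.992$
Interpretation
r ≈ 0.992 indicates a very strong positive correlation between hours studied and exam scores,
suggesting that as study hours increase, exam scores also tend to increase.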
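The calculation can also be checked in code. The sketch below is a minimal illustration, assuming the example data are the five (hours, score) pairs listed under Creating a Scatterplot; the function name `pearson_r` is chosen here for illustration and is not from the source. It applies the sums formula for r directly and should give a value close to 0.99 for these points.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed with the sums formula for r."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Assumed example data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [50, 55, 70, 80, 90]
r = pearson_r(hours, scores)
print(f"r = {r:.3f}")
```

The function would fail with a division by zero if either variable were constant, since the denominator is then zero and r is undefined.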
3. Scatterplots
Creating a Scatterplot
Scatterplots visually represent the relationship between two quantitative variables. Here’s how to
create one based on our example:
1. Data Points: (1, 50), (2, 55), (3, 70), (4, 80), (5, 90)
2. Axes: Hours Studied on the X-axis, Exam Scores on the Y-axis
3. Plot the Points: Mark each pair of values on the graph.
Visual Representation:
As mentioned earlier, a simple scatterplot would show points rising from the lower left to the upper
right, confirming the strong positive correlation.
Diagram: Scatterplot of Hours Studied vs. Exam Scores
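In practice a plotting library would normally draw this diagram. As a rough stand-in that needs nothing beyond the standard library, the sketch below places the five points on a small text grid; the function `ascii_scatter` and the grid size are assumptions made for illustration only.

```python
def ascii_scatter(points, width=21, height=9):
    """Render (x, y) points as rows of text, with larger y values higher up."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in points:
        # Scale each coordinate to a column/row index (assumes x and y both vary).
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # flip rows so the plot reads bottom-up
    return ["".join(cells).rstrip() for cells in grid]

points = [(1, 50), (2, 55), (3, 70), (4, 80), (5, 90)]
lines = ascii_scatter(points)
print("Exam Scores (vertical) vs. Hours Studied (horizontal)")
for line in lines:
    print("|" + line)
print("+" + "-" * 21)
```

The marks climb from the bottom-left corner to the top-right corner, which is the upward trend described above.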
4. Percentage Tables
Definition
Percentage tables summarize categorical data, showing the proportion of each category relative to
the total.
Construction
Example
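As a minimal sketch of how a percentage table can be built, the code below counts each category and divides by the total. The survey responses are hypothetical values invented for illustration, and `percentage_table` is an illustrative name, not something defined in the source.

```python
from collections import Counter

def percentage_table(values):
    """Return {category: (count, percent of total)} for categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: (n, 100 * n / total) for category, n in counts.items()}

# Hypothetical commuting-mode responses, invented purely for illustration.
responses = ["Bus"] * 8 + ["Car"] * 10 + ["Bike"] * 2
table = percentage_table(responses)
for category, (count, pct) in table.items():
    print(f"{category:<6}{count:>4}{pct:>8.1f}%")
```

The percentages always sum to 100, which is a quick check that no category was dropped.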
5. Contingency Tables
Definition
Contingency tables, also known as cross-tabulations or crosstabs, are powerful tools for analyzing
the relationship between two or more categorical variables. They summarize the frequency
distribution of the variables and allow for the examination of potential associations and
interactions.
a. Definition
A contingency table displays the frequency counts of the occurrences of combinations of the values
of two categorical variables. Each cell in the table represents the count of observations for a
specific combination of categories.
b. Example
Consider a study examining the relationship between smoking status (smoker vs. non-smoker) and
lung disease (presence vs. absence):
a. Marginal Totals
The totals at the end of each row and column (the margins) give insight into the distribution of
each variable independently.
b. Joint Frequency
The count in each cell represents the joint frequency of the combination of categories. For example,
there are 30 smokers with lung disease.
c. Conditional Frequency
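Conditional frequencies show the proportion of one variable within a given category of the other.
For instance, the proportion of smokers who have lung disease is the joint count divided by the
total number of smokers:
$$P(\text{lung disease} \mid \text{smoker}) = \frac{\text{smokers with lung disease}}{\text{total smokers}} = \frac{30}{\text{total smokers}}$$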
d. Relative Frequencies
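Relative frequencies express counts as proportions of the grand total, providing insight into the
distribution of categories. For example, the relative frequency of smokers with lung disease is:
$$\frac{\text{smokers with lung disease}}{\text{total observations}} = \frac{30}{N}$$
where $N$ is the total number of observations in the table.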
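The four quantities above can be computed from a table of counts as sketched below. Only the count of 30 smokers with lung disease comes from the text; the other three cells are hypothetical values chosen for illustration, so the printed proportions are not results from the study.

```python
# Hypothetical 2x2 counts: only the 30 smokers with lung disease is from the
# text; the remaining cells are invented for illustration.
table = {
    ("smoker", "disease"): 30,
    ("smoker", "no disease"): 70,
    ("non-smoker", "disease"): 10,
    ("non-smoker", "no disease"): 90,
}

total = sum(table.values())

# Marginal totals: sum each variable over the categories of the other.
row_totals, col_totals = {}, {}
for (smoking, disease), count in table.items():
    row_totals[smoking] = row_totals.get(smoking, 0) + count
    col_totals[disease] = col_totals.get(disease, 0) + count

joint = table[("smoker", "disease")]        # joint frequency
conditional = joint / row_totals["smoker"]  # P(disease | smoker)
relative = joint / total                    # share of all observations

print(row_totals, col_totals)
print(f"P(disease | smoker) = {conditional:.2f}")
print(f"Relative frequency  = {relative:.2f}")
```

Note that the conditional frequency divides by a row total while the relative frequency divides by the grand total, so the two answer different questions.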
a. Chi-Squared Test
The Chi-squared test is commonly used to determine if there is a significant association between
two categorical variables.
Null Hypothesis
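The null hypothesis ($H_0$) states that there is no association between the variables (they are
independent).
1. Calculate Expected Frequencies: The expected frequency for each cell can be calculated
using:
$$E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}$$
where $N$ is the total number of observations.
2. Calculate the Chi-Squared Statistic:
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
Where $O_{ij}$ is the observed frequency and $E_{ij}$ is the expected frequency.
3. Determine Degrees of Freedom: Degrees of freedom (df) for a contingency table is
given by:
$$df = (r - 1)(c - 1)$$
where $r$ is the number of rows and $c$ is the number of columns.
4. Compare to Critical Value: Compare the Chi-squared statistic to the critical value from
the Chi-squared distribution table at a given significance level (e.g., α = 0.05). If $\chi^2$ is
greater than the critical value, reject $H_0$.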
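The four steps can be sketched in plain Python as follows. This is an illustration under stated assumptions: the observed counts are hypothetical, the expected frequency is the standard (row total × column total) / grand total, no continuity correction is applied, and the helper name `chi_squared` is invented here. The critical value 3.841 is the tabulated chi-squared value for df = 1 at α = 0.05.

```python
def chi_squared(observed):
    """Return (statistic, df, expected) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # Step 1: expected count under independence for every cell.
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Step 2: sum of (O - E)^2 / E over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    # Step 3: degrees of freedom = (rows - 1) * (columns - 1).
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df, expected

# Hypothetical counts (rows: smoker, non-smoker; columns: disease, no disease).
observed = [[30, 70], [10, 90]]
statistic, df, expected = chi_squared(observed)

# Step 4: compare with the tabulated critical value for df = 1, alpha = 0.05.
CRITICAL_05_DF1 = 3.841
reject_h0 = statistic > CRITICAL_05_DF1
print(f"chi2 = {statistic:.2f}, df = {df}, reject H0: {reject_h0}")
```

For tables with other dimensions the critical value has to be looked up for the corresponding df; the constant above only applies to a 2x2 table.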
a. Mosaic Plots
Mosaic plots visually represent the proportions of categories in a contingency table. Each rectangle
represents a category, with the area proportional to the frequency.
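b. Stacked Bar Charts
Stacked bar charts can show the distribution of one categorical variable across the levels of another
variable, facilitating easy comparison.
c. Heatmaps
Heatmaps use color intensity to represent the frequency in each cell of a contingency table, making
high and low counts easy to spot.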
a. Market Research
Analyzing consumer preferences based on demographics (e.g., age and purchasing habits).
b. Healthcare
Studying the association between lifestyle factors (e.g., diet and exercise) and health outcomes.
c. Social Sciences
d. Environmental Studies
6. Handling Several Batches
In data analysis, it's common to have multiple groups or batches of data. Each batch may represent
different conditions or categories.
Example:
Suppose we are analyzing the effects of different fertilizers on plant growth. We have three
batches:
Techniques:
Descriptive Statistics: Calculate mean, median, and standard deviation for each batch.
Visualization: Use box plots to compare distributions across batches.
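Both techniques can be sketched with Python's `statistics` module. The plant heights below are hypothetical, since the batch data itself is not shown in these notes, and the five-number summary stands in for the values a box plot would draw.

```python
import statistics

# Hypothetical plant heights in cm for three fertilizer batches.
batches = {
    "Fertilizer A": [20, 22, 19, 24, 25],
    "Fertilizer B": [28, 30, 27, 35, 30],
    "Fertilizer C": [15, 18, 17, 16, 40],
}

summary = {}
for name, heights in batches.items():
    q1, q2, q3 = statistics.quantiles(heights, n=4)  # quartiles used by a box plot
    summary[name] = {
        "mean": statistics.mean(heights),
        "median": statistics.median(heights),
        "stdev": statistics.stdev(heights),  # sample standard deviation
        "box": (min(heights), q1, q2, q3, max(heights)),
    }

for name, stats in summary.items():
    print(f"{name}: mean={stats['mean']:.1f} median={stats['median']} "
          f"sd={stats['stdev']:.2f} box={stats['box']}")
```

In the hypothetical Fertilizer C batch, the single tall plant raises the mean well above the median, which is the kind of difference a side-by-side box plot makes visible.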
7. Scatterplots and Resistant Lines
Scatterplot
Example:
Consider the relationship between hours studied (X) and exam scores (Y).
Creating a Scatterplot:
Resistant Lines
A resistant line (like the median line) minimizes the influence of outliers. This can be more
informative than traditional least-squares regression.
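Robust Regression Formula: The resistant line is determined using medians instead of
means.
Example Calculation:
1. Data: (2, 50), (3, 60), (4, 80), (5, 90), (10, 40)
2. Medians: Median X = 4, Median Y = 60.
3. Draw a line through the median point (4, 60). A point alone does not fix the line, so the
slope must also be estimated resistantly, for example as the median of the slopes between
pairs of points.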
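The example can be reproduced in code. The notes specify only the median point, so the slope in this sketch is an assumption: it uses the median of all pairwise slopes (the Theil-Sen estimator), which is one common resistant choice and not necessarily the method intended here. The least-squares slope is computed alongside for comparison.

```python
import statistics
from itertools import combinations

points = [(2, 50), (3, 60), (4, 80), (5, 90), (10, 40)]
xs = [x for x, _ in points]
ys = [y for _, y in points]

# Median point from the worked example.
median_point = (statistics.median(xs), statistics.median(ys))

# Assumption: slope = median of all pairwise slopes (Theil-Sen estimator).
pairwise_slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in combinations(points, 2)
    if x2 != x1
]
resistant_slope = statistics.median(pairwise_slopes)

# Ordinary least-squares slope, for comparison.
n = len(points)
ols_slope = (n * sum(x * y for x, y in points) - sum(xs) * sum(ys)) / (
    n * sum(x * x for x in xs) - sum(xs) ** 2
)

print(f"median point    = {median_point}")
print(f"resistant slope = {resistant_slope:.2f}")
print(f"OLS slope       = {ols_slope:.2f}")
```

With these five points the single outlier (10, 40) is enough to make the least-squares slope negative, while the median-of-slopes estimate still follows the rising trend of the other four points.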
8. Transformations
Introduction
Data transformations are mathematical operations applied to datasets to modify their distributions,
stabilize variance, and improve the relationship between variables. Transformations can help meet
the assumptions of statistical tests, enhance interpretability, and facilitate better modeling of
relationships.
Log Transformation: Log transformations are particularly useful when data are positively
skewed. This transformation can help reduce the impact of large values (outliers) and make the
data more normally distributed.
Formula: Y′ = log(Y)
When to Use:
Positively skewed data whose values are all greater than zero (e.g., incomes).
Example
Consider a dataset of incomes where values range widely. An example income dataset might look
like this:
The income distribution is highly skewed due to the presence of a few high-income outliers.
Transformation Steps:
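A minimal sketch of the transformation is given below. The incomes are hypothetical values invented for illustration, base 10 is an arbitrary choice of logarithm, and the moment-based `skewness` helper is an assumption introduced here as one simple way to check that the transformed data are less skewed.

```python
import math

def skewness(values):
    """Moment-based sample skewness: 0 is symmetric, above 0 is right-skewed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical incomes, invented for illustration.
incomes = [20_000, 30_000, 45_000, 70_000, 110_000, 250_000, 600_000]

# Y' = log(Y); only valid when every value is greater than zero.
log_incomes = [math.log10(y) for y in incomes]

skew_before = skewness(incomes)
skew_after = skewness(log_incomes)
print(f"skewness before: {skew_before:.2f}")
print(f"skewness after:  {skew_after:.2f}")
```

The choice of base does not affect the comparison, because changing the base only rescales the transformed values and skewness is unaffected by rescaling.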
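Square Root Transformation: The square root transformation compresses large values less
strongly than the logarithm and can be applied to zero values.
Formula: Y′ = √Y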
When to Use:
Count data where the variance increases with the mean (e.g., number of events in a time
period).
Example
Imagine a dataset representing the number of calls received at a call center per hour:
Using the square root transformation helps reduce the variance by diminishing the effect of larger
counts.
Transformation Steps:
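The effect on variance can be sketched as follows. The hourly call counts are hypothetical, split into quiet and busy hours so that the busy hours have both a larger mean and a larger variance; `variance_ratio` is an illustrative helper, not part of the source.

```python
import math
import statistics

# Hypothetical calls per hour, invented for illustration.
quiet_hours = [2, 5, 3, 6, 4, 4]
busy_hours = [90, 110, 95, 120, 100, 85]

def variance_ratio(low, high):
    """How many times larger the sample variance of `high` is than that of `low`."""
    return statistics.variance(high) / statistics.variance(low)

# Y' = sqrt(Y); defined for counts of zero or more.
sqrt_quiet = [math.sqrt(y) for y in quiet_hours]
sqrt_busy = [math.sqrt(y) for y in busy_hours]

ratio_before = variance_ratio(quiet_hours, busy_hours)
ratio_after = variance_ratio(sqrt_quiet, sqrt_busy)
print(f"variance ratio before: {ratio_before:.1f}")
print(f"variance ratio after:  {ratio_after:.1f}")
```

After the transformation the two groups have much more similar variances, which is what stabilizing the variance means in practice.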
Example:
Original data is highly skewed; applying a log transformation can linearize the relationship.
9. Introducing a Third Variable
When examining relationships between two variables, introducing a third variable can uncover
additional insights. This third variable may interact with or confound the relationship, providing a
more comprehensive understanding.
Let’s consider the relationship between hours studied (X) and exam scores (Y), and ask how
age (Z) influences this relationship.
Steps to Analyze:
Visualization Techniques:
3D Scatterplot: The x-axis represents hours studied, the y-axis represents exam scores, and the
z-axis represents age.
Color Coding: Use different colors to represent age groups (e.g., 18-25, 26-35).
Example of a 3D Scatterplot:
Data points are plotted based on hours studied and exam scores.
Points for different age groups are colored distinctly.
Understanding Interaction: Age might change the effect of hours studied on exam scores.
For instance, younger students may benefit more from studying than older students.
Controlling for Confounding: By examining the third variable, researchers can control
for confounding effects that might mislead the analysis.
Enhanced Interpretability: Adding dimensions helps capture the complexity of real-
world relationships, allowing for more nuanced conclusions.
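One simple way to look for such an interaction is to split the data by the third variable and fit the X-Y relationship within each group. The sketch below does this with hypothetical records; the data, the two age groups, and the helper `ols_slope` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records: (hours studied, exam score, age group).
records = [
    (1, 50, "18-25"), (2, 60, "18-25"), (3, 70, "18-25"), (4, 80, "18-25"),
    (1, 55, "26-35"), (2, 60, "26-35"), (3, 65, "26-35"), (4, 70, "26-35"),
]

def ols_slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxy = sum(x * y for x, y in pairs)
    sxx = sum(x * x for x, _ in pairs)
    return (n * sxy - sx * sy) / (n * sxx - sx ** 2)

# Group the (hours, score) pairs by the third variable.
by_group = defaultdict(list)
for hours, score, age_group in records:
    by_group[age_group].append((hours, score))

slopes = {group: ols_slope(pairs) for group, pairs in by_group.items()}
for group, slope in slopes.items():
    print(f"age {group}: about {slope:.1f} points per extra hour")
```

Different slopes across groups, as in this made-up data, would suggest an interaction; similar slopes would suggest that age does not change the effect of study time.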
10. Three Variable Contingency Tables and Beyond
Introduction
Three-variable contingency tables extend two-variable contingency tables by adding a third
categorical variable. This lets researchers examine interactions and associations in
multi-dimensional categorical data that a single two-way table cannot show.
a. Definition
A three-variable contingency table organizes data to display the frequency counts of combinations
of three categorical variables. Each cell in the table represents a count for a specific combination
of categories from all three variables.
b. Example
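As an illustration, the sketch below builds a three-variable table from individual observations. The data are hypothetical: smoking status and lung disease follow the earlier two-variable example, while sex is introduced here as an assumed third variable. The helpers `partial_table` and `collapse` are illustrative names.

```python
from collections import Counter

# Hypothetical observations: (smoking status, lung disease, sex).
observations = (
    [("smoker", "disease", "male")] * 18
    + [("smoker", "disease", "female")] * 12
    + [("smoker", "no disease", "male")] * 30
    + [("smoker", "no disease", "female")] * 40
    + [("non-smoker", "disease", "male")] * 4
    + [("non-smoker", "disease", "female")] * 6
    + [("non-smoker", "no disease", "male")] * 48
    + [("non-smoker", "no disease", "female")] * 42
)

cells = Counter(observations)  # one count per combination of the three variables

def partial_table(cells, sex):
    """Two-way smoking x disease table for one level of the third variable."""
    return {(s, d): n for (s, d, k), n in cells.items() if k == sex}

def collapse(cells):
    """Two-way table obtained by summing over the third variable."""
    two_way = Counter()
    for (s, d, _), n in cells.items():
        two_way[(s, d)] += n
    return two_way

print(partial_table(cells, "male"))
print(collapse(cells))
```

Comparing the partial tables with the collapsed table shows whether the association between the first two variables changes across levels of the third.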
Log-linear models are often used for three or more categorical variables. These models can
examine the relationships and interactions among variables while accounting for the structure of
the contingency table.
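Model Structure
For three variables A, B, and C with category indices i, j, and k, the saturated model is:
$$\log(E_{ijk}) = \mu + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC}$$
Where $E_{ijk}$ is the expected frequency, $\mu$ is the overall mean, and the $\lambda$ terms
represent the main effects and interactions.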
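The simplest log-linear model keeps only μ and one λ term per variable, which corresponds to all three variables being mutually independent. Its expected frequencies have a closed form, the product of the three one-way totals divided by N², so they can be sketched without a fitting routine. The counts below are hypothetical, and the likelihood-ratio statistic G² is included as one common way to compare observed and expected counts.

```python
import math
from itertools import product

# Hypothetical 2 x 2 x 2 counts, indexed as counts[i][j][k].
counts = [[[18, 12], [30, 40]], [[4, 6], [48, 42]]]
I, J, K = 2, 2, 2
N = sum(counts[i][j][k] for i, j, k in product(range(I), range(J), range(K)))

# One-way marginal totals for each of the three variables.
a = [sum(counts[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
b = [sum(counts[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
c = [sum(counts[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]

# Main-effects-only model: log E_ijk = mu + lambda_i + lambda_j + lambda_k,
# whose fitted values are E_ijk = a_i * b_j * c_k / N**2.
expected = {
    (i, j, k): a[i] * b[j] * c[k] / N ** 2
    for i, j, k in product(range(I), range(J), range(K))
}

# Likelihood-ratio statistic G^2 = 2 * sum of O * ln(O / E) over non-empty cells.
g2 = 2 * sum(
    counts[i][j][k] * math.log(counts[i][j][k] / e)
    for (i, j, k), e in expected.items()
    if counts[i][j][k] > 0
)
print(f"G^2 = {g2:.2f}")
```

A large G² relative to its degrees of freedom indicates that the main-effects-only model fits poorly and that interaction terms are needed; models with interaction terms generally require iterative fitting rather than a closed form.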