Rcourse_partViz
1. Motivation
Visualizations enable us to extract meaningful insights, identify patterns, and communicate complex
information in a more understandable and compelling manner. Among the many tools available, the
R programming language stands out as a powerful platform for data visualization, offering a wide
array of libraries and packages tailored for creating various types of plots and charts.
Today we learn how to create:
• Bar Plots, Box-Whisker Plots and Pie Charts
• PCA-Plots and Heatmaps
• Venn diagrams and Upset Plots
2.1 Manuals
A useful free manual by Venables and Smith is available online. You'll find it via
http://www.r-project.org/
=> Manuals => An Introduction to R (by Venables and Smith)
2.2 Links
➢ The R Graph gallery:
https://r-graph-gallery.com/
➢ Tutorial explaining ggplot2 in more detail:
http://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html
3 Introduction to the representation of RNA-Seq data
The dataset we are using for visualization consists of RNA-sequencing data from Caco-2 cells
infected with SARS-CoV-2 at four time points: 3, 6, 12, and 24 hours post-infection. We have already
discussed how to handle RNA-Seq data to identify differentially expressed genes and perform a
gene set enrichment analysis. Feel free to revisit the exercises from the day 2 afternoon session
to review how this works in R. Now you will learn how to present the results to co-workers,
an audience, …
Please load the required packages. If any of them are not installed yet, install them with
install.packages() or BiocManager::install().
library(DESeq2)
library(pheatmap)
library(ggplot2)
library(dplyr)
library(RColorBrewer)
library(VennDiagram)
library(ComplexUpset)
Read in the data from “sarsCov2_rawCounts_smybols.txt” using the function read.table(). Keep in
mind that the file contains column and row names, so set the respective parameters. As you’ll see,
the column names carry an “X” as a prefix. To get rid of the prefix, we split the string and keep only
the part after the X. This can be done using the following command:
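The command itself is not reproduced in this handout. A minimal sketch of both steps, assuming a tab-separated file with the gene symbols in the first column; the toy file written to a temporary path and the object name counts are illustrative stand-ins, not the course data:

```r
# Toy stand-in for the course file (assumption: tab-separated, gene symbols in column 1).
toy <- data.frame("3h_1" = c(10, 0), "6h_1" = c(12, 3),
                  row.names = c("GeneA", "GeneB"), check.names = FALSE)
tmp <- tempfile(fileext = ".txt")
write.table(toy, tmp, sep = "\t", quote = FALSE, col.names = NA)

# header = TRUE: the first line holds the column names.
# row.names = 1: the first column holds the row names.
counts <- read.table(tmp, header = TRUE, row.names = 1, sep = "\t")

# read.table() puts an "X" in front of column names that start with a digit.
# Split each name at the "X" and keep the second piece, i.e. the part after it.
colnames(counts) <- sapply(strsplit(colnames(counts), "X"), `[`, 2)
```

Splitting at every "X" only works as long as the sample names contain no further "X"; sub("^X", "", colnames(counts)) would be a more defensive alternative.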
As described in the day 2 afternoon session, we’ll use those tables to create a DESeqDataSet
object.
Perform a PCA and store the results in “data”. Use “data” to calculate the percentage of
variance explained by each principal component.
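How the percentages are obtained can be sketched with base R's prcomp() on a simulated matrix. This is a swapped-in technique for illustration, not the DESeq2 workflow of the course: the matrix, its size, and the sample names are made up. With DESeq2, plotPCA(..., returnData = TRUE) returns a comparable table and carries the explained variance in its "percentVar" attribute.

```r
set.seed(1)
# Simulated stand-in for transformed expression values: 100 genes x 8 samples.
samples <- paste0(rep(c("3h", "6h", "12h", "24h"), each = 2), "_", 1:2)
mat <- matrix(rnorm(800), nrow = 100,
              dimnames = list(paste0("gene", 1:100), samples))

# prcomp() expects samples in rows, so the matrix is transposed.
pca <- prcomp(t(mat))

# Coordinates of each sample on the first two principal components.
data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2],
                   timepoint = sub("_.*", "", colnames(mat)))

# Share of the total variance carried by each component, in percent.
percentVar <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
```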
You can create the respective PCA plot by using the following command:
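The plotting command is not reproduced here. A minimal ggplot2 sketch, assuming that “data” holds the columns PC1, PC2 and timepoint and that percentVar holds the percentages for the first two components; all numbers below are invented for illustration:

```r
library(ggplot2)

# Hypothetical PCA coordinates; in practice they come from the PCA step.
timeLevels <- c("3h", "6h", "12h", "24h")
data <- data.frame(PC1 = c(-8, -7, -2, -1, 3, 4, 9, 10),
                   PC2 = c(2, 3, -4, -5, 4, 5, -2, -3),
                   timepoint = factor(rep(timeLevels, each = 2), levels = timeLevels))
percentVar <- c(62, 21)  # made-up percentages, used only for the axis labels

# One point per sample, colored by time point.
pcaPlot <- ggplot(data, aes(x = PC1, y = PC2, color = timepoint)) +
  geom_point(size = 3) +
  xlab(paste0("PC1: ", percentVar[1], "% variance")) +
  ylab(paste0("PC2: ", percentVar[2], "% variance")) +
  theme_bw()
pcaPlot
```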
Advanced: find out how to customize the colors used to highlight the different timepoints.
The Heatmap
Use the DESeq2 package again to normalize the data and identify differentially expressed genes.
We now want to select the 50 genes with the highest variance across all samples and represent them in a
heatmap. In a heatmap, each value of a table is represented by a color. It thus helps to visualize the
overall behavior, but it offers little detailed insight.
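The selection code is not reproduced in this handout. A sketch of how it could look in base R, using a simulated matrix in place of assay(rld). The object names select, df and annColors are taken from the pheatmap call further down; their contents here are assumptions.

```r
set.seed(1)
# Simulated stand-in for assay(rld): 200 genes x 8 samples.
timeLevels <- c("3h", "6h", "12h", "24h")
samples <- paste0(rep(timeLevels, each = 2), "_", 1:2)
mat <- matrix(rnorm(1600), nrow = 200,
              dimnames = list(paste0("gene", 1:200), samples))

# Variance of each gene across all samples.
geneVar <- apply(mat, 1, var)

# Row indices of the 50 genes with the highest variance.
select <- order(geneVar, decreasing = TRUE)[1:50]

# Annotation table: one row per sample, row names matching the matrix columns.
df <- data.frame(timepoint = factor(sub("_.*", "", colnames(mat)), levels = timeLevels),
                 row.names = colnames(mat))

# Named list that assigns a color to every level of the annotation.
annColors <- list(timepoint = c("3h" = "#66C2A5", "6h" = "#FC8D62",
                                "12h" = "#8DA0CB", "24h" = "#E78AC3"))
```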
With the above code we have not only selected the 50 genes with the highest variance, but also
created two additional tables that will help us to annotate the heatmap clearly.
A heatmap can be plotted as follows:
pheatmap(assay(rld)[select, ], cluster_rows = TRUE,
         show_rownames = FALSE, cluster_cols = TRUE, annotation_col = df,
         annotation_colors = annColors)
Save the plot as a PNG file by setting the “filename” parameter within the pheatmap function. Have a
look at the heatmap again. Do you see any outliers?
This is probably the most common representation for RNA-seq data, but there are many more
useful plots out there. Let's have a look!
Bar plots
The provided file “sarsCov2_DEG_smybols.txt” contains the differentially expressed genes of the
current experiment. Read in the file and keep all genes with a log2 fold change (LFC) below -1 or above 1.
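The filtering step could look like the following sketch. The toy table stands in for the file contents; its column names Gene, log2FoldChange and condition follow those used later in this section, and the values are invented.

```r
library(dplyr)

# Toy result table in place of the file contents.
res <- data.frame(Gene = c("A", "B", "C", "D"),
                  log2FoldChange = c(-2.3, 0.4, 1.7, -0.8),
                  condition = "timepoint_24h_vs_0h")

# Keep only genes whose log2 fold change is below -1 or above 1.
res <- filter(res, log2FoldChange < -1 | log2FoldChange > 1)
```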
res$direction <- NA
res$direction[res$log2FoldChange < 0] <- "down"
res$direction[res$log2FoldChange > 0] <- "up"
Remark: the dollar sign can be used to address a specific column of a data frame, independent of its
position.
When it comes to visualization, it's also often about making sure the order is logical. The legend of a
picture is easier to understand if the values shown are sorted in ascending or descending order. In R
you can specify an order with factor(). In our example, the time points should be ordered with the
help of the factor function.
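A short sketch of the ordering step, assuming the time points are stored in res$condition under the comparison names that appear later in this section; the toy table only contains that one column:

```r
# Toy table with the time points in an arbitrary order.
res <- data.frame(condition = c("timepoint_24h_vs_0h", "timepoint_3h_vs_0h",
                                "timepoint_12h_vs_0h", "timepoint_6h_vs_0h"))

# The levels argument fixes the order that plots and legends will follow.
res$condition <- factor(res$condition,
                        levels = c("timepoint_3h_vs_0h", "timepoint_6h_vs_0h",
                                   "timepoint_12h_vs_0h", "timepoint_24h_vs_0h"))
levels(res$condition)
```

Without the levels argument, factor() sorts the values alphabetically, which would place 12h before 3h.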
As a barplot is used to display the relationship between a numeric and a categorical variable, let’s
calculate the total number of differentially expressed genes per timepoint.
res$TotalNumDEGs <- NA
for (i in seq_along(unique(res$condition))) {
  # every row of a time point receives the number of DEGs found at that time point
  res$TotalNumDEGs[res$condition == unique(res$condition)[i]] <-
    length(res$Gene[res$condition == unique(res$condition)[i]])
}
Additionally, it can be very useful to know how many genes are up- and downregulated at each
timepoint.
res$NumDEGs <- NA
for (i in seq_along(unique(res$condition))) {
  # count up- and downregulated genes separately for each time point
  res$NumDEGs[res$condition == unique(res$condition)[i] & res$direction == "up"] <-
    length(res$Gene[res$condition == unique(res$condition)[i] & res$direction == "up"])
  res$NumDEGs[res$condition == unique(res$condition)[i] & res$direction == "down"] <-
    length(res$Gene[res$condition == unique(res$condition)[i] & res$direction == "down"])
}
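The same two counts can also be obtained without explicit loops. The following dplyr sketch is an alternative formulation, not the course's method; the toy table is invented and only carries the columns the counting needs.

```r
library(dplyr)

# Toy table: 2 DEGs at 3h and 4 DEGs at 24h.
res <- data.frame(Gene = paste0("g", 1:6),
                  condition = c(rep("timepoint_3h_vs_0h", 2), rep("timepoint_24h_vs_0h", 4)),
                  direction = c("up", "down", "up", "up", "up", "down"))

res <- res %>%
  group_by(condition) %>%
  mutate(TotalNumDEGs = n()) %>%      # DEGs per time point
  group_by(condition, direction) %>%
  mutate(NumDEGs = n()) %>%           # DEGs per time point and direction
  ungroup()
```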
Now all the necessary preparatory work is done and we can visualize the number of DEGs as a bar
plot.
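The plotting command is not reproduced in this handout. A minimal sketch with ggplot2, assuming one row per time point with its total number of DEGs; the table degCounts and its numbers are invented for illustration:

```r
library(ggplot2)

# Invented summary table: one row per time point.
timeLevels <- c("3h", "6h", "12h", "24h")
degCounts <- data.frame(timepoint = factor(timeLevels, levels = timeLevels),
                        TotalNumDEGs = c(15, 40, 120, 310))

# geom_col() draws bars whose height is taken directly from the y column.
barPlot <- ggplot(degCounts, aes(x = timepoint, y = TotalNumDEGs, fill = timepoint)) +
  geom_col() +
  scale_fill_brewer(palette = "Set2") +
  labs(x = "Time post infection", y = "Number of DEGs") +
  theme_minimal()
barPlot
```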
Here, the total number of DEGs per timepoint is represented. Using the “scale_fill_brewer”
command, we can color the bars with one of the RColorBrewer palettes. “Set2”
is only one of the available palettes. Check which other palettes can be used at:
https://r-graph-gallery.com/38-rcolorbrewers-palettes.html
It’s also possible to draw a bar plot with error bars. Visit the R graph gallery and find out which
function to use for this.
Now, we used “scale_fill_manual” to select each color manually. You can select this option if you
need to color only a few objects, if the colors are not grouped together on one palette, or if you
wish to ensure consistent coloring. The latter, in particular, aids in recognizing an overarching
context across multiple images.
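A use of scale_fill_manual along these lines might look as follows. The sketch assumes a table with one row per time point and direction; the table, its numbers and the two color names are placeholders.

```r
library(ggplot2)

# Invented counts of up- and downregulated genes per time point.
timeLevels <- c("3h", "6h", "12h", "24h")
degByDirection <- data.frame(
  timepoint = factor(rep(timeLevels, each = 2), levels = timeLevels),
  direction = rep(c("up", "down"), times = 4),
  NumDEGs = c(10, 5, 25, 15, 70, 50, 180, 130))

# One color per group, assigned by name so the mapping stays the same in every plot.
dirPlot <- ggplot(degByDirection, aes(x = timepoint, y = NumDEGs, fill = direction)) +
  geom_col(position = "dodge") +
  scale_fill_manual(values = c(up = "firebrick", down = "steelblue"))
dirPlot
```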
Change the coloring of the groups to two colors of your choice.
Pie charts
We have seen that the samples collected 24 hours post infection have the highest number of DEGs.
Therefore, we’ll select all genes that are differentially expressed after 24 hours.
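The selection could be written as in the sketch below, assuming the time points are stored in res$condition under the comparison names used later in this section; the toy table and the object name deg24h are hypothetical.

```r
library(dplyr)

# Toy table with DEGs from two time points.
res <- data.frame(Gene = c("A", "B", "C", "D"),
                  condition = c("timepoint_12h_vs_0h", "timepoint_24h_vs_0h",
                                "timepoint_24h_vs_0h", "timepoint_12h_vs_0h"),
                  direction = c("up", "up", "down", "down"))

# Keep only the rows that belong to the 24 h comparison.
deg24h <- filter(res, condition == "timepoint_24h_vs_0h")
```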
Think about how you would have to change the command above to select DEGs after 12 h
instead of 24 h.
Using this table, it’s possible to draw the proportion of up and down regulated genes in a pie chart.
A pie chart is a commonly used graphical representation, and you've likely encountered it multiple
times before today. However, it's important to be aware of some potential issues associated with
pie charts. Please find out why pie charts can pose problems.
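ggplot2 has no dedicated pie geometry; a common approach is a single stacked bar drawn in polar coordinates. A sketch under that assumption, with invented counts of up- and downregulated genes:

```r
library(ggplot2)

# Invented numbers of up- and downregulated genes at one time point.
pieData <- data.frame(direction = c("up", "down"), n = c(180, 130))

# One stacked bar (x = "") that coord_polar() bends into a circle.
pieChart <- ggplot(pieData, aes(x = "", y = n, fill = direction)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()
pieChart
```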
Box-(Whisker)-Plot
In the bar plot, we observed that there are more upregulated genes than downregulated ones. To
assess the change in log2FoldChange over time, we can employ a Box-Whisker Plot, often
abbreviated as a Box Plot.
To begin, we must first identify all the upregulated genes. Select those genes using the filter
function for the table called “res”.
Using this data frame, it’s possible to create a box plot using the following command:
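The original command is not reproduced here. The sketch below shows the filter step and a plain box plot on an invented table; the object name upGenes is hypothetical, and the point overlay discussed in the questions that follow is left out on purpose.

```r
library(dplyr)
library(ggplot2)

# Invented table: 40 genes spread evenly over the four comparisons.
timeLevels <- c("timepoint_3h_vs_0h", "timepoint_6h_vs_0h",
                "timepoint_12h_vs_0h", "timepoint_24h_vs_0h")
res <- data.frame(Gene = paste0("gene", 1:40),
                  log2FoldChange = c(seq(1.1, 3, length.out = 20),
                                     seq(-3, -1.1, length.out = 20)),
                  condition = factor(rep(timeLevels, times = 10), levels = timeLevels))
res$direction <- ifelse(res$log2FoldChange > 0, "up", "down")

# Keep the upregulated genes only.
upGenes <- filter(res, direction == "up")

# One box per time point, showing the distribution of the log2 fold changes.
boxPlot <- ggplot(upGenes, aes(x = condition, y = log2FoldChange, fill = condition)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "Set2")
boxPlot
```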
The name "boxplot" is derived from the way it represents the distribution of values or observations
using boxes. Find out what the lines above, below, and inside the box mean.
In this specific boxplot, individual observations are depicted as clusters of points. Can you identify
which function is responsible for displaying these point clusters on the plot?
To be honest, this visualization is not very useful in this case. But at least you have learned how to draw a
box plot.
Venn Diagram
More important for the analysis of RNA-seq data is to see how the time points overlap in their
differentially expressed genes. One way to solve this task is to create a Venn diagram. The (so far)
best package to draw a Venn diagram in R is the VennDiagram package.
A previous section mentioned the importance of maintaining consistent group coloring for
enhanced comprehension. In our earlier plots, we utilized the "Set2" palette from the RColorBrewer
package. For the Venn diagram, it's beneficial to ensure that the groups are assigned identical
colors. Consequently, we need to determine the hexadecimal codes for these colors.
brewer.pal(n=4,"Set2")
This command prints the hex codes of the first four colors to the console. We can copy and paste
this output into the respective section of the venn.diagram function.
venn.diagram(
x = list(
unname(unlist(select(filter(res, condition == "timepoint_3h_vs_0h"), Gene))),
unname(unlist(select(filter(res, condition == "timepoint_6h_vs_0h"), Gene))),
unname(unlist(select(filter(res, condition == "timepoint_12h_vs_0h"), Gene))),
unname(unlist(select(filter(res, condition == "timepoint_24h_vs_0h"), Gene)))),
category.names = c("3h" , "6h" , "12h", "24h"),
filename = 'timepoint_venn.png', # output file name
output = TRUE ,
imagetype="png" , # creates output as PNG file
height = 480 , # height of the picture
width = 480 , # width of the picture
resolution = 300, # resolution
compression = "lzw",
lwd = 1,
col=brewer.pal(n=4,"Set2"),
fill = c(alpha("#66C2A5",0.3), alpha('#FC8D62',0.3), alpha('#8DA0CB',0.3),alpha('#E78AC3',0.3)),
cex = 0.5, # label size
fontfamily = "sans",
cat.cex = 0.6,
cat.default.pos = "outer",
cat.fontfamily = "sans",
cat.col = brewer.pal(n=4,"Set2")
)
This is the most complex function we used so far. Experiment with the parameters to gain a better
understanding of how they function.
Upset Chart
As you can see, answering the second question is much more complicated than answering the first.
Furthermore, it is possible to draw a Venn diagram for more than 4 groups, but the interpretation
gets even more complicated. An upset plot (or upset chart) provides a solution to this problem.
First, we need to reshape the data so that it can be presented using the ComplexUpset
package in R.
upsetList <- list(
  timepoint_3h = unname(unlist(select(filter(res, condition == "timepoint_3h_vs_0h"), Gene))),
  timepoint_6h = unname(unlist(select(filter(res, condition == "timepoint_6h_vs_0h"), Gene))),
  timepoint_12h = unname(unlist(select(filter(res, condition == "timepoint_12h_vs_0h"), Gene))),
  timepoint_24h = unname(unlist(select(filter(res, condition == "timepoint_24h_vs_0h"), Gene))))
For each timepoint, check whether a gene is differentially expressed and set the corresponding entry to
TRUE.
This creates a list in which each entry is a vector of TRUE and FALSE values, depending on whether the gene
was differentially expressed at the specific time point. To draw an upset plot, this list must be
transposed and converted into a data frame.
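The commands for these two steps are not reproduced in this handout. One possible way to write them is sketched below with base R on invented gene lists. The object names allGenes and membership are hypothetical; upsetData is the name used in the upset() call.

```r
# Invented gene lists, shaped like upsetList.
upsetList <- list(timepoint_3h = c("GeneA", "GeneB"),
                  timepoint_6h = c("GeneB", "GeneC"),
                  timepoint_12h = c("GeneC"),
                  timepoint_24h = c("GeneA", "GeneC", "GeneD"))

# Every gene that is differentially expressed at one time point or more.
allGenes <- unique(unlist(upsetList))

# For each gene, test membership in each time point's list.
# The result is a logical matrix with time points in rows and genes in columns.
membership <- sapply(allGenes, function(g) sapply(upsetList, function(s) g %in% s))

# Transpose so that each gene is a row and each time point a column.
upsetData <- as.data.frame(t(membership))
upsetData
```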
Check what the data frame looks like to understand what the command above has done.
upset(data = upsetData, intersect = colnames(upsetData), name = "", min_size = 0,
      base_annotations = list('Intersection size' = intersection_size(counts = TRUE)),
      matrix = intersection_matrix(geom = geom_point(shape = 'circle filled', size = 3),
                                   segment = geom_segment(size = .8)) +
        scale_color_manual(values = c('3h' = '#66C2A5', '6h' = '#FC8D62',
                                      '12h' = '#8DA0CB', '24h' = '#E78AC3'),
                           guide = guide_legend(override.aes = list(shape = 'circle'))),
      queries = list(upset_query(set = "timepoint_3h", fill = "#66C2A5"),
                     upset_query(set = "timepoint_6h", fill = "#FC8D62"),
                     upset_query(set = "timepoint_12h", fill = "#8DA0CB"),
                     upset_query(set = "timepoint_24h", fill = "#E78AC3")))
Last but not least, ggplot2 offers its own function to save graphics as a PNG file. We want to use it
to save the upset plot we just created.
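The function in question is presumably ggsave(). A minimal sketch, in which a small stand-in plot replaces the upset plot; the file name, the temporary output folder, and the size settings are placeholders:

```r
library(ggplot2)

# Small stand-in plot; in the exercise this would be the upset plot object.
p <- ggplot(data.frame(x = 1:3, y = c(2, 5, 3)), aes(x = x, y = y)) +
  geom_col()

# ggsave() derives the file type from the extension; width and height are in inches.
outFile <- file.path(tempdir(), "upset_plot.png")
ggsave(filename = outFile, plot = p, width = 8, height = 5, dpi = 300)
```

If the plot argument is omitted, ggsave() saves the last plot that was displayed.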
There is a wide array of graphics that can be generated in R, and each field has its own established
standards for data visualization. In today's exercises, we focused on several commonly used plot
types. If you look into publications on transcriptome data analysis, you'll frequently
encounter the Volcano plot, among others. Of course, we couldn't cover all available plot types in
this exercise. It is highly recommended that you refer to publications that have worked with
similar or identical data to gain insights into effective ways of visualizing your results.
Well done!