Aplicaciones y Discusiones en Desarrollo Animal - Taller 2
A good first thing to do when running any R script, is to set your working environment and load any libraries you may need. A handy tool for loading packages is pacman. This package conveniently loads libraries, installs libraries that are not installed, including bioconductor packages!
# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")
# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")
# use pacman to load libraries
pacman::p_load(tidyverse, DESeq2, pheatmap)
Now that we are all ready to go, let's import some data. As a prequel to this exercise, I have already explained the biological question(s) we are trying to answer (in brief, how is pigmentation plasticity controlled in Spadefoot toad tadpoles?). Here, we will jump in around about the middle of a typical RNAseq workflow. We have already done the following:
Our starting point will therefore be the count data for our RNAseq experiment prepared using tximport. In this exercise we will do the following:
Lets load the count data:
txi<-readRDS("./data/salmon_gene_counts.rds")
For those of you unfamiliar, .rds
files is a single R
data object, created with saveRDS()
. It is particularly
useful for saving R objects that are not tabular, or more than
2-dimensional (e.g. lists, functions etc.).
Question: What is the class and structure of this data object?
class(txi)
str(txi)
lapply(X=txi, FUN=head)
abundamce
, counts
, length
) and 1
character vector (countsFromAbundance
).Question: What is the content of this object and how is it arranged?
apply(X=txi$abundance, FUN=sum, MARGIN = 2)
In this case, they don't quite add up to 1 million because I have already filtered this matrix, to remove non-coding and mitochondrial DNA.
tximport
. Default is
no (counts are from salmon).It is always a good idea to look at the distribution of data:
# Log10 transform and plot with base R:
hist(log10(txi$abundance[,1]), breaks=50)
# a more detailed plot with the tidyverse:
txi$abundance %>%
as_tibble(rownames = "transcript") %>%
pivot_longer(-transcript, names_to="sample", values_to="TPM") %>%
ggplot(aes(x=TPM)) +
geom_histogram() +
ylab("number of transcripts") +
scale_x_log10() +
facet_wrap(~sample)
We can see that the log abundance is fairly normally distributed. This is a good sign. Many 0 counts would indicate that you have targets that are heavily depleted. For example, if I would have included non-coding target sequences, we would probably see a peak of low counts, because these tend to be very depleted in RNAseq data.
Next, we may be curious as to whether there are some basic patterns in our data. For instance, we would expect biological replicates to cluster together.
Let's load our design matrix. This is a .csv file where we have included some information on what these samples actually are:
samples<-read.csv("./data/design_matrix.csv")
Take a closer look at this table.
samples
Question: How many tissues and how many treatments are there?
samples %>%
select(treatment, tissue) %>%
ftable()
Before we go any further, it makes sense to check if our biological
replicates are grouping nicely. We could do this by performing a PCA on
the normalized counts (the TPM/abundance) and distinguish them using the
information we have from our samples
file that lays out the
experimental design. However, a PCA works best on homoskedastic data;
this means that the variance of an observed quantity (here, the
expression strength of a gene) does not depend on the mean. This is
definitely an assumption we are violating with RNAseq data, so if we
don't transform our data, then the PCA is going to be strongly
influenced by a few highly expressed genes. All of this requires a few
steps:
# perform VS transformation (the vst() function only likes integers!)
vst_counts<- txi$counts %>%
as.data.frame() %>%
mutate_all(as.integer) %>%
as.matrix() %>%
vst()
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts<-vst_counts[apply(vst_counts, 1, var) != 0,]
# perform PCA on TRANSPOSED scaled, centered data
pca<- prcomp(t(vst_counts),scale.=T, center=T)
## add metadata and plot
pca$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(samples) %>% # add the experimental design information
ggplot(aes(x=PC1, y=PC2, color=treatment, shape=tissue)) +
labs(x=paste0("PC1 (", summary(pca)$importance["Proportion of Variance",1]*100, "%)"),
y=paste("PC2 (", summary(pca)$importance["Proportion of Variance",2]*100, "%)")) +
geom_point(size=3)
Question: Can we see any patterns in the PCA biplot?
Question: What about patterns within tissues? Try running this same code, but only for the liver samples.
Notice that sample with "Li" are liver samples. perhaps drop the remaining columns first?
# perform VS transformation (the vst() function only likes integers!)
vst_counts_liver<- txi$counts %>%
as.data.frame() %>%
select(contains("Li")) %>%
mutate_all(as.integer) %>%
as.matrix() %>%
vst()
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts_liver<-vst_counts_liver[apply(vst_counts_liver, 1, var) != 0,]
# perfomr PCA on TRANSPOSED scaled, centred data
pca_liver<- prcomp(t(vst_counts_liver),scale.=T, center=T)
## add metadata and plot
pca_liver$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(samples) %>%
ggplot(aes(x=PC1, y=PC2, color=treatment)) +
geom_point(size=3)
Computational Genomics with R
book, especially the
chapter on RNA-Seq
by Bora Uyar. Galaxy also has plenty of interesting documentation such
as this
one.Another common exploratory visualization technique are heatmaps. Surely you have seen them in many publications. The PCA is useful for visualizing the clustering of samples. Heatmaps can do the same, but they can add one extra layer of information: the clustering of genes. Let's look at two examples:
Like the PCA, we might want to compare the distance between samples. To do this, we have to manually calculate this distances first, and then use those distances to inform the heatmap.
# get sample-to-sample distance
sample_dist <- dist(t(vst_counts))
# convert to matrix
sample_dist_matrix <- as.matrix(sample_dist)
# plot
pheatmap(sample_dist_matrix,
annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")],
row.names = "sample_id"),
annotation_row=data.frame(samples[,c("sample_id","tissue","treatment","side")],
row.names = "sample_id"))
This shows a similar picture. There is strong clustering with tissues, and then a little bit of clustering for treatments within tissues, but the tissue-effect is consuming much of the signal.
Above we are just showing the distance between samples, but you may be interested whether there are specific genes/transcripts who's expression is clustering. because there are thousands of genes, we may want to only focus on genes that are particularly variable in their expression across the samples.
# Sort the rows in vst matrix by the row-wise variance, and keep only the 500 most variable genes
var_genes <- apply(vst_counts,MAR=1, FUN=var) %>%
enframe() %>%
arrange(desc(value)) %>%
slice_head(n=500) %>%
pull(name)
head(var_genes) # list of genes we want to keep
vst_subset<-vst_counts[var_genes, ]
# check to see we really only have 500 genes
dim(vst_subset)
# plot heatmap (this time scaled by row)
pheatmap(vst_subset,
scale = "row",
cluster_rows=T, show_rownames=FALSE,
cluster_cols=T, show_colnames = T,
annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")], row.names = "sample_id"))
Question: Do we learn anything new from this heatmap?