Aplicaciones y Discusiones en Desarrollo Animal - Taller 2
Set your working directory and load the libraries we will need.
# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")
# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")
# use pacman to load libraries
pacman::p_load(tidyverse,DESeq2,viridis,scales)
In the 1st exercise, we explored the RNA-Seq count data from salmon. We saw that this data:
The aim of this exercise is to go through a workflow for identifying genes that are significantly deferentially expressed when comparing two experimental treatments. We saw that the liver and skin samples were very different in their expression profiles. For this exercise, we will therefore only focus on the skin samples. This workflow entails the following general steps:
DESeq2
.We will start by loading the quantification data from
salmon
and the design matrix of the experiment, just like
we did for the previous exercise
txi<-readRDS("./data/salmon_gene_counts.rds")
samples<-read_csv("./data/design_matrix.csv")
Because we only want to be working with one tissue (in this case
skin), we want to filter out all of the remaining data from the design
matrix and the expression data. To make things a little more convenient,
we will also create a new variable (condition
) which
combines the treatment
and side
variable.
samples <- samples %>%
filter(tissue=="skin") %>%
mutate(condition=as.factor(paste(treatment, side, sep="_")))
## filter txi matrices
txi$abundance<-txi$abundance[,samples$sample_id]
txi$counts<-txi$counts[,samples$sample_id]
txi$length<-txi$length[,samples$sample_id]
We can now combine the design matrix and the expression data into a
format that DESeq2 likes. tximport
plays very well with
DESeq2, so this is easy!
dds <- DESeqDataSetFromTximport(txi,
colData = samples,
design = ~ condition)
The dataset should have successfully been stored as a
DESeqDataSet
object. You should see that it has imported
the counts
and the length
data from the
txi
object and the sample information from the design
matrix. lets take a look at this object:
dds
You may have seen this type of object before, but it most likely is not as common as other types.
Question: What is the structure of this object? Is there anything you are unfamiliar with?
It is an S4 object system. R has three object oriented systems: S3,
S4 and R5. You can read up about them here. For now, you
just need to know that most often you will work with S3 objects, which
can be conveniently accessed with $
for example. S4 objects
are a little different. They usually have data stored in "slots" which
are indicated with @
.
More specifically, it is a subclass of
RangedSummarizedExperiment
.
This is a format widely used by packages in the Bioconductor ecosystem. The core of the object are assays (a single or multiple matrices) with samples as columns and freatures as rows. Features in this case are our genes or transcripts, but they may also be chromosomal coordinates etc.
Working with this type of object sometimes requires specific
"accessor" functions (or methods) to get at data stored in 'slots'. For
example, the dds
object we created, has two assays:
counts
and avgTxLength
. We can access them
with assays()
function (note: different from
assay()
), or specific functions for specific assays (if
these are available)
# see what assays are stored in the object:
assayNames(dds)
# call a specific assay by name
assay(dds, "counts") %>% head()
assay(dds, "avgTxLength") %>% head()
# call a specific assay with a specific function
counts(dds) %>% head()
We can also access information about the columns:
colData(dds)
In this case, this is the experimental information we provided with the design matrix!
And we can also access the row data.
rowData(dds)
In this case it is empty, but you could imagine that you may have gene annotations here, or loci position information etc.
Feel free to read up more on this kind of
SummarizedExperiment
data object here.
Question: What is different about count data stored in the dds object compared to the raw count data we imported?
The decimal counts have been rounded off to make them integers.
head(counts(dds))
head(txi$counts)
You may have seen that some workflows filter out genes that have very low counts, orwhose counts don't vary very much across samples. In theory, this is not necessary with DESeq2 as it will handle this on the fly (read more here), but reducing the size of the matrix can greatly speeding up downstream calculations and reduce the memory requirements.
As we saw in the previous exercise, there are many liver-specific genes that most likely are not deferentially expressed in skin. There are therefore many rows of genes with 0 or very low counts. We can at least remove those:
# keep only rows with with counts summing up to 10 or more
dds <- dds[rowSums(counts(dds)) >= 10,]
The ultimate goal of running such an analysis is to see if a given gene is expressed to a different degree in one condition vs. another condition. This sounds almost like a T-test, but things are not that simple. In this exercise, I don't want to focus too much on theory, but it is worth talking about what some of the "problematic" features are of the data:
The differential expression analysis in DESeq2
takes
several steps to try to deal with these problems. At its core it fits
generalized linear models with a negative binomial distribution to each
gene. It takes into account the estimated counts per gene, normalized by
library size, but also takes into account the size of the target
genes/transcripts and calculates a dispersion parameter, which tells you
about how far the observed count is from the mean counts estimated by
the model.
It then performs a pairwise comparison of the level of expression per gene in condition A vs. condition B. It uses the models per gene to estimate coefficients and standard error for each sample group (i.e. for condition A and condition B). These coefficients are the estimates of log2 fold change.
To test whether this log2 fold change for a given gene is significant, it then applies a statistical test (Wald test), to test the hypothesis that the parameter (log 2 fold change) is significantly different from 0. A multiple test correction is then applied, by default a version of the Benjamini-Hochberg False Discovery Rate (FDR). This is done by ranking the genes by p-value, then:
\(adjusted\ p\ value = p\ value * \frac{total\ number\ of\ tests}{rank}\)
In short, DESeq2
does the following:
These are all done with a single convenient function:
dds <- DESeq(dds)
dds
Note: it looks like we are "overwriting" the
dds
object, but really, we are just adding more inforamtion to it
This incredibly convenient function has done all of the heavy lifting
for us, and it has even run some pair-wise tests already, based on the
~condtion
variable we set.
We can access the names of the pairwise tests like so:
resultsNames(dds)
Question: Is there anything unexpected about the list of pair-wise tests?
We had 4 treatment levels, which should result in a total of 6 pairwise comparisons. But here, there are only 3.
This is because DESeq2 compares everything to a baseline condition. If this is not specified, it will use the first alphabetic factor level, which in this case is "black_dorsal". As a result only xxx_vs_black_dorsal comparisons are made.
This is an important detail to remember: the second factor is always considered to be the "base level" or "control" variable.
We can extract the results for the pair-wise tests from the
dds
object like so:
res<-results(dds)
res
If we don't specify anything else, it will extract just the first comparison. In this case "white ventral vs black dorsal". This table has the following information:
Question: Is there anything unexpected about the adjusted p-values?
Some of the adjusted p-values are NA
. This is what the
DESeq2
handbook tells us:
"If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below".
Let's check some of those counts with NA
padj:
counts(dds) %>%
as.data.frame() %>%
filter(is.na(res$padj)) %>%
head()
Indeed! all have very low counts!! they are therefore unreliable, and should rightfully not be included in the results.
We could also extract results of specific comparisons. there are two ways to do this:
# 1. using the result names:
results(dds, name="condition_white_dorsal_vs_black_dorsal")
# 2. using contrasts:
results(dds, contrast=c("condition","white_dorsal","black_dorsal"))
These two approaches should be largely the same (though baselines are treated slightly differently), but the second has some advantages. Namely, you can change the factor order (in our case, there is not really a clear control/treatment, right?), and, you can calculate results using different baseline than was included in the original, automatize run. lets make four results objects, comparing the dorsal and ventral skins within each treatment, and the dorsal skins across treatments:
# white dorsal vs white ventral
res_wD_wV<-results(dds, contrast=c("condition","white_dorsal","white_ventral"))
# black dorsal vs black ventral
res_bD_bV<-results(dds, contrast=c("condition","black_dorsal","black_ventral"))
# black dorsal vs white dorsal
res_bD_wD<-results(dds, contrast=c("condition","black_dorsal","white_dorsal"))
# black ventral vs white ventral
res_bV_wV<-results(dds, contrast=c("condition","black_ventral","white_ventral"))
We can now summarize these results tables:
summary(res_wD_wV, alpha=0.05)
summary(res_bD_bV, alpha=0.05)
summary(res_bD_wD, alpha=0.05)
summary(res_bV_wV, alpha=0.05)
This returns the number of up-regulated genes (LFC>0) and down-regulated genes (LFC<0) with respect to the control variable. Other information is also given, such as the number of outliers and low counts.
Question: What do you notice about these results?
It seems like the dorsal vs. ventral comparisons are producing many more up and down regulated genes than the black vs. white comparison. This indicates that not all skin is the same! At the level of gene expression, the change in pigmentation on the dorsum is minimal compared to the difference in gene expression between dorsal and ventral skin of much similar colour!
The typical plot to show the distribution of differentially expressed
genes, is a "MA-plot". Here, we plot the mean of normalized counts
against the log fold change, with a canned plotting function from the
DESeq2
package.
par(mfrow=c(2,2))
# single plot
DESeq2::plotMA(res_wD_wV, main="White Dorsals vs. Ventral")
DESeq2::plotMA(res_bD_bV, main="Black Dorsals vs. Ventral")
DESeq2::plotMA(res_bD_wD, main="Black Dorsals vs. White Dorsal")
DESeq2::plotMA(res_bV_wV, main="Black Ventral vs. White Ventral")
par(mfrow=c(1,1))
Question: Have you seen this kind of plot before? What can we say from these plots?
We can squeeze even more information out of these kinds of plots by getting creative with ggplot.
res_bD_bV %>%
as.data.frame() %>%
ggplot(aes(baseMean, log2FoldChange, colour=padj)) +
geom_point(size=1) +
scale_y_continuous(limits=c(-5, 5), oob=squish) + # oob from the scales package is needed to "squish" points falling outside the axis limits
scale_x_log10() +
geom_hline(yintercept = 0, colour="red", size=1, linetype="longdash") +
labs(x="mean of normalized counts", y="log fold change") +
scale_colour_viridis(direction=-1, trans='sqrt') +
geom_density_2d(colour="blue", size=0.5) +
theme_bw()
Here, we can see more clearly that genes with low counts do not have p-adjusted values (points in grey). The density contours are also centered around a y-intercept of 0 (most genes are not differentially expressed)
We have now finished the differential gene expression analysis. As the final step, we should export our results. We can export the individual results tables as .csv files, and we can export the various objects, in case we want to process them down the road
# make a results folder if it does not yet exist
dir.create("results", showWarnings = FALSE)
# export individual results tables like so:
write.csv(as.data.frame(res_wD_wV), "./results/deseq2_res_wD_wV.csv")
# but for convenience, we can export them all as a list object and save it as an .rds file
saveRDS(list(
bD_bV = res_bD_bV,
wD_wV = res_wD_wV,
bV_wV = res_bV_wV,
bD_wD = res_bD_wD),
"./results/deseq2_results.rds")
# export the DESeq2 object as an .rds files
saveRDS(dds, "./results/deseq2_dds.rds")
At this point, we can do some quality checks, to make sure that all
of the steps that DESeq2
has done, worked as expected. For
example:
The over-dispersion correction should result that there is an even distribution of p-values, with an "enrichment" of genes with low p-values (i.e. a rectangular plot with a peak near 0).
# for all comparisons:
par(mfrow=c(2,2))
par(mar=c(4,4,1,1))
hist(res_wD_wV$pvalue, breaks=50, col="grey", main="wD_wV")
hist(res_bD_bV$pvalue, breaks=50, col="grey", main="_bD_bV")
hist(res_bD_wD$pvalue, breaks=50, col="grey", main="bD_wD")
hist(res_bV_wV$pvalue, breaks=50, col="grey", main="bV_wV")
par(mfrow=c(1,1))
All-in-all, not bad. In all cases, we see the expected "enrichment" of genes with low p-values and then a relatively uniform/rectangular distribution for all other p-values. It is not perfect, and there is a little bit of a hill-shape with higher p-values. IF this were more extreme (both hill-shaped or U-shaped distributions) you may want to try to remove outliers or fit models with more than one dispersion parameter (see here for example, or look up local dispersion, in the DESeq2 handbook).
To accurately model sequencing counts, we need to generate accurate estimates of within-group variation (variation between biological replicates of the same treatment/condition group) for each gene. With only a few (usually 3) replicates per group, the estimates of variation for each gene are often unreliable (due to the large differences in dispersion for genes with similar means).
To address this problem, DESeq2 shares information across genes to generate more accurate estimates of variation based on the mean expression level of the gene using a method called ‘shrinkage’. DESeq2 assumes that genes with similar expression levels have similar dispersion.
Estimating the dispersion for each gene separately:
To model the dispersion based on expression level (mean counts of replicates), the dispersion for each gene is estimated using maximum likelihood estimation. In other words, given the count values of the replicates, the most likely estimate of dispersion is calculated.
We can easily plot these dispersion estimates.
plotDispEsts(dds)
Question: What are some interesting things to note about this plot?
All-in-all it looks good!