Aplicaciones y Discusiones en Desarrollo Animal - Taller 2
Set your working directory and load the libraries we will need.
# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")
# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")
# use pacman to load libraries
pacman::p_load(tidyverse,DESeq2,viridis,scales)
In the first exercise, we explored the RNA-Seq count data quantified with salmon and looked at some of the general properties of these data.
The aim of this exercise is to go through a workflow for identifying genes that are significantly differentially expressed when comparing two experimental treatments, using DESeq2. We saw that the liver and skin samples were very different in their expression profiles. For this exercise, we will therefore only focus on the skin samples. This workflow entails several general steps, which we will work through below.
We will start by loading the quantification data from salmon and the design matrix of the experiment, just like we did for the previous exercise.
txi<-readRDS("./data/salmon_gene_counts.rds")
samples<-read_csv("./data/design_matrix.csv")
Because we only want to work with one tissue (in this case skin), we will filter the design matrix and the expression data down to the skin samples. To make things a little more convenient, we will also create a new variable (condition) which combines the treatment and side variables.
samples <- samples %>%
filter(tissue=="skin") %>%
mutate(condition=as.factor(paste(treatment, side, sep="_")))
## filter txi matrices
txi$abundance<-txi$abundance[,samples$sample_id]
txi$counts<-txi$counts[,samples$sample_id]
txi$length<-txi$length[,samples$sample_id]
We can now combine the design matrix and the expression data into a format that DESeq2 likes. tximport plays very well with DESeq2, so this is easy!
dds <- DESeqDataSetFromTximport(txi,
colData = samples,
design = ~ condition)
The dataset should have successfully been stored as a DESeqDataSet object. You should see that it has imported the counts and the length data from the txi object, and the sample information from the design matrix. Let's take a look at this object:
dds
## class: DESeqDataSet
## dim: 32531 16
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(32531): PECUL23A000004 PECUL23A000005 ... PECUL23A062646
## PECUL23A062647
## rowData names(0):
## colnames(16): 10D 11D ... 8V 9V
## colData names(5): sample_id treatment tissue side condition
You may have seen this type of object before, but it most likely is not as common as other types.
Question: What is the structure of this object? Is there anything you are unfamiliar with?
It is an S4 object. R has three object-oriented systems: S3, S4 and R5. You can read up about them here. For now, you just need to know that most often you will work with S3 objects, which can be conveniently accessed with $, for example. S4 objects are a little different. They usually have data stored in "slots", which are indicated with @.
More specifically, it is a subclass of RangedSummarizedExperiment. This is a format widely used by packages in the Bioconductor ecosystem. The core of the object are the assays (a single matrix or multiple matrices) with samples as columns and features as rows. Features in this case are our genes or transcripts, but they may also be chromosomal coordinates etc.
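To get a feel for the S4 structure, we can peek at the object itself. This is just a minimal illustration using generic R functions (is() and slotNames()), not part of the DESeq2 workflow:
# class ancestry: DESeqDataSet extends RangedSummarizedExperiment
is(dds)
# the slots of the S4 object
slotNames(dds)
# slots can be accessed directly with @ (e.g. dds@colData), but in practice
# we use the dedicated accessor functions shown below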
Working with this type of object sometimes requires specific "accessor" functions (or methods) to get at the data stored in 'slots'. For example, the dds object we created has two assays: counts and avgTxLength. We can access them with the assays() function (note: different from assay()), or with specific functions for specific assays (if these are available).
# see what assays are stored in the object:
assayNames(dds)
## [1] "counts" "avgTxLength"
# call a specific assay by name
assay(dds, "counts") %>% head()
## 10D 11D 12D 22D 10V 11V 12V 22V 21D 7D 8D 9D 21V 7V 8V
## PECUL23A000004 799 209 588 675 951 314 427 989 842 299 172 682 503 222 303
## PECUL23A000005 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## PECUL23A000006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## PECUL23A000008 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## PECUL23A000011 1239 898 883 1125 1268 933 845 1044 1005 879 725 970 858 729 714
## PECUL23A000012 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 9V
## PECUL23A000004 945
## PECUL23A000005 0
## PECUL23A000006 0
## PECUL23A000008 0
## PECUL23A000011 978
## PECUL23A000012 0
assay(dds, "avgTxLength") %>% head()
## 10D 11D 12D 22D 10V
## PECUL23A000004 6201.16983 6438.33500 6168.57244 6360.54714 6199.02702
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4246.70282 4244.47366 4248.68448 4138.22686 4252.16543
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650 447.71650
## 11V 12V 22V 21D 7D
## PECUL23A000004 6315.82807 6251.34080 6205.29626 6149.42485 6209.34168
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4247.39914 4242.69120 4240.83801 4241.13665 4242.39630
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650 447.71650
## 8D 9D 21V 7V 8V
## PECUL23A000004 6379.89021 6555.93984 6100.23764 6537.26970 6485.19773
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872 1021.78100 1019.39872 1019.39872
## PECUL23A000011 4250.63982 4170.79175 4247.13092 4250.94094 4163.96652
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650 447.71650
## 9V
## PECUL23A000004 6148.17008
## PECUL23A000005 71.39554
## PECUL23A000006 547.63317
## PECUL23A000008 1019.39872
## PECUL23A000011 4243.53463
## PECUL23A000012 447.71650
# call a specific assay with a specific function
counts(dds) %>% head()
## 10D 11D 12D 22D 10V 11V 12V 22V 21D 7D 8D 9D 21V 7V 8V
## PECUL23A000004 799 209 588 675 951 314 427 989 842 299 172 682 503 222 303
## PECUL23A000005 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## PECUL23A000006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## PECUL23A000008 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## PECUL23A000011 1239 898 883 1125 1268 933 845 1044 1005 879 725 970 858 729 714
## PECUL23A000012 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 9V
## PECUL23A000004 945
## PECUL23A000005 0
## PECUL23A000006 0
## PECUL23A000008 0
## PECUL23A000011 978
## PECUL23A000012 0
We can also access information about the columns:
colData(dds)
## DataFrame with 16 rows and 5 columns
## sample_id treatment tissue side condition
## <character> <character> <character> <character> <factor>
## 10D 10D black skin dorsal black_dorsal
## 11D 11D black skin dorsal black_dorsal
## 12D 12D black skin dorsal black_dorsal
## 22D 22D black skin dorsal black_dorsal
## 10V 10V black skin ventral black_ventral
## ... ... ... ... ... ...
## 9D 9D white skin dorsal white_dorsal
## 21V 21V white skin ventral white_ventral
## 7V 7V white skin ventral white_ventral
## 8V 8V white skin ventral white_ventral
## 9V 9V white skin ventral white_ventral
In this case, this is the experimental information we provided with the design matrix!
And we can also access the row data.
rowData(dds)
## DataFrame with 32531 rows and 0 columns
In this case it is empty, but you could imagine that you may have gene annotations here, or loci position information etc.
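As a minimal sketch of how that could look, we can store the gene IDs themselves as a rowData column (done here on a copy of dds, so we leave the object we are working with untouched; real annotations could be joined in the same way):
# rowData can hold per-gene annotation; as a trivial example, add the gene IDs
dds_annot <- dds
rowData(dds_annot)$gene_id <- rownames(dds_annot)
rowData(dds_annot)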
Feel free to read up more on this kind of
SummarizedExperiment
data object here.
Question: What is different about count data stored in the dds object compared to the raw count data we imported?
The decimal counts have been rounded off to make them integers.
head(counts(dds))
## 10D 11D 12D 22D 10V 11V 12V 22V 21D 7D 8D 9D 21V 7V 8V
## PECUL23A000004 799 209 588 675 951 314 427 989 842 299 172 682 503 222 303
## PECUL23A000005 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## PECUL23A000006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## PECUL23A000008 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## PECUL23A000011 1239 898 883 1125 1268 933 845 1044 1005 879 725 970 858 729 714
## PECUL23A000012 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 9V
## PECUL23A000004 945
## PECUL23A000005 0
## PECUL23A000006 0
## PECUL23A000008 0
## PECUL23A000011 978
## PECUL23A000012 0
head(txi$counts)
## 10D 11D 12D 22D 10V 11V 12V
## PECUL23A000004 798.556 209.388 588.464 675.112 950.925 313.572 427.419
## PECUL23A000005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000006 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000008 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000011 1238.869 898.400 883.169 1125.317 1267.971 933.099 844.625
## PECUL23A000012 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 22V 21D 7D 8D 9D 21V 7V
## PECUL23A000004 988.542 842.375 299.200 172.249 681.824 503.027 222.206
## PECUL23A000005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000006 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000008 0.000 0.000 0.000 0.000 0.000 1.000 0.000
## PECUL23A000011 1044.139 1005.042 879.316 724.642 970.492 858.488 728.819
## PECUL23A000012 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 8V 9V
## PECUL23A000004 303.135 945.262
## PECUL23A000005 0.000 0.000
## PECUL23A000006 0.000 0.000
## PECUL23A000008 0.000 0.000
## PECUL23A000011 713.794 978.380
## PECUL23A000012 0.000 0.000
You may have seen that some workflows filter out genes that have very low counts, or whose counts don't vary very much across samples. In theory, this is not necessary with DESeq2, as it will handle this on the fly (read more here), but reducing the size of the matrix can greatly speed up downstream calculations and reduce the memory requirements.
As we saw in the previous exercise, there are many liver-specific genes that are most likely not expressed (or only barely expressed) in skin. There are therefore many rows of genes with 0 or very low counts. We can at least remove those:
# keep only rows with with counts summing up to 10 or more
dds <- dds[rowSums(counts(dds)) >= 10,]
The ultimate goal of running such an analysis is to see whether a given gene is expressed to a different degree in one condition vs. another condition. This sounds almost like a t-test, but things are not that simple. In this exercise, I don't want to focus too much on theory, but it is worth mentioning some of the "problematic" features of the data: library sizes differ between samples, genes/transcripts differ in length, the variance of the counts depends strongly on the mean (over-dispersion), and we are testing thousands of genes at once.
The differential expression analysis in DESeq2 takes several steps to deal with these problems. At its core, it fits a generalized linear model with a negative binomial distribution to each gene. It takes into account the estimated counts per gene, normalized by library size, as well as the size of the target genes/transcripts, and it estimates a dispersion parameter, which tells you how far the observed counts tend to be from the mean counts estimated by the model.
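To build some intuition for what the dispersion parameter means, here is a toy simulation (not part of the DESeq2 workflow): under the negative binomial, the variance of the counts is mean + dispersion * mean^2, so a larger dispersion means counts that are noisier than a Poisson with the same mean.
# toy example: 8 replicate counts for a gene with mean 500
set.seed(1)
alpha <- 0.1                              # dispersion parameter
rnbinom(n = 8, mu = 500, size = 1/alpha)  # negative binomial: var = 500 + 0.1*500^2
rpois(n = 8, lambda = 500)                # Poisson for comparison: var = 500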
It then performs a pairwise comparison of the level of expression of each gene in condition A vs. condition B. It uses the per-gene models to estimate coefficients and standard errors for each sample group (i.e. for condition A and condition B). These coefficients are the estimates of the log2 fold change.
To test whether this log2 fold change is significant for a given gene, it then applies a statistical test (a Wald test) of the hypothesis that the parameter (the log2 fold change) is different from 0. A multiple-testing correction is then applied, by default a version of the Benjamini-Hochberg False Discovery Rate (FDR). This is done by ranking the genes by p-value, then:
\(adjusted\ p\ value = p\ value \times \frac{total\ number\ of\ tests}{rank}\)
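As a small illustration (with made-up p-values), base R's p.adjust() applies this correction; it additionally makes sure the adjusted p-values never decrease with rank and are capped at 1:
# made-up p-values for illustration
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205)
# the formula above, applied directly
p * length(p) / rank(p)
# the Benjamini-Hochberg adjustment as implemented in base R
p.adjust(p, method = "BH")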
In short, DESeq2 does the following: it estimates size factors (library-size normalization), estimates a dispersion parameter for each gene, fits a negative binomial GLM per gene, and performs Wald tests on the estimated log2 fold changes. These are all done with a single convenient function:
dds <- DESeq(dds)
dds
## class: DESeqDataSet
## dim: 18110 16
## metadata(1): version
## assays(6): counts avgTxLength ... H cooks
## rownames(18110): PECUL23A000004 PECUL23A000011 ... PECUL23A062646
## PECUL23A062647
## rowData names(30): baseMean baseVar ... deviance maxCooks
## colnames(16): 10D 11D ... 8V 9V
## colData names(5): sample_id treatment tissue side condition
Note: it looks like we are "overwriting" the dds object, but really, we are just adding more information to it.
This incredibly convenient function has done all of the heavy lifting for us, and it has even run some pair-wise tests already, based on the ~condition variable we set.
We can access the names of the pairwise tests like so:
resultsNames(dds)
## [1] "Intercept"
## [2] "condition_black_ventral_vs_black_dorsal"
## [3] "condition_white_dorsal_vs_black_dorsal"
## [4] "condition_white_ventral_vs_black_dorsal"
Question: Is there anything unexpected about the list of pair-wise tests?
We had 4 levels of our condition variable, which should result in a total of 6 pairwise comparisons. But here, there are only 3.
This is because DESeq2 compares everything to a baseline condition. If this is not specified, it will use the first factor level in alphabetical order, which in this case is "black_dorsal". As a result, only xxx_vs_black_dorsal comparisons are made.
This is an important detail to remember: in a comparison named "A_vs_B", the second level (B) is treated as the "base level" or "control".
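If you wanted a different baseline, you could set the reference level of the condition factor yourself before running the analysis. A quick sketch (on a copy of the object; "white_ventral" is just an arbitrary choice for illustration):
# set "white_ventral" as the reference level of the condition factor
dds_releveled <- dds
dds_releveled$condition <- relevel(dds_releveled$condition, ref = "white_ventral")
# dds_releveled <- DESeq(dds_releveled)  # re-run so the coefficients use the new baseline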
We can extract the results for the pair-wise tests from the dds object like so:
res<-results(dds)
res
## log2 fold change (MLE): condition white ventral vs black dorsal
## Wald test p-value: condition white ventral vs black dorsal
## DataFrame with 18110 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## PECUL23A000004 557.4102 -0.100434 0.552488 -0.181786 0.855751
## PECUL23A000011 943.9019 -0.274858 0.186806 -1.471356 0.141195
## PECUL23A000017 308.9902 -0.089849 0.488532 -0.183916 0.854079
## PECUL23A000021 389.3509 0.161886 0.128565 1.259179 0.207966
## PECUL23A000024 61.2082 -0.697616 0.797256 -0.875021 0.381562
## ... ... ... ... ... ...
## PECUL23A062638 7640.25635 -0.331002 0.285494 -1.159400 0.2462931
## PECUL23A062639 14.84163 -0.269193 0.576309 -0.467098 0.6404298
## PECUL23A062644 1.19938 -0.600717 1.455699 -0.412665 0.6798517
## PECUL23A062646 7.21864 -1.430253 1.053540 -1.357569 0.1746006
## PECUL23A062647 109.72323 0.558908 0.272927 2.047831 0.0405766
## padj
## <numeric>
## PECUL23A000004 0.974429
## PECUL23A000011 0.611279
## PECUL23A000017 0.973855
## PECUL23A000021 0.692827
## PECUL23A000024 0.815316
## ... ...
## PECUL23A062638 0.728480
## PECUL23A062639 0.924804
## PECUL23A062644 NA
## PECUL23A062646 0.656003
## PECUL23A062647 0.361351
If we don't specify anything else, results() extracts the comparison of the last factor level against the reference level (the last entry in resultsNames(dds)); in this case "white ventral vs black dorsal". This table has the following information: baseMean (the mean of normalized counts across all samples), log2FoldChange (the estimated log2 fold change relative to the baseline), lfcSE (the standard error of that estimate), stat (the Wald test statistic), pvalue (the raw p-value) and padj (the adjusted p-value).
Question: Is there anything unexpected about the adjusted p-values?
Some of the adjusted p-values are NA. This is what the DESeq2 handbook tells us:
"If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below".
Let's check some of those counts with an NA padj:
counts(dds) %>%
as.data.frame() %>%
filter(is.na(res$padj)) %>%
head()
## 10D 11D 12D 22D 10V 11V 12V 22V 21D 7D 8D 9D 21V 7V 8V 9V
## PECUL23A000050 3 1 2 0 4 1 1 3 5 2 0 0 1 3 0 1
## PECUL23A000099 0 0 0 1 0 3 2 2 1 1 0 2 0 0 1 1
## PECUL23A000197 1 0 1 2 2 2 1 0 6 1 0 4 2 2 0 1
## PECUL23A000271 2 1 0 1 1 1 0 1 1 0 1 0 2 1 8 0
## PECUL23A000302 0 0 2 0 1 0 1 1 1 0 0 2 1 0 0 1
## PECUL23A000417 0 1 1 0 2 1 4 1 2 1 0 0 2 4 0 2
Indeed, they all have very low counts! They are therefore unreliable and should rightfully not be included in the results.
We can also extract the results of specific comparisons. There are two ways to do this:
# 1. using the result names:
results(dds, name="condition_white_dorsal_vs_black_dorsal")
## log2 fold change (MLE): condition white dorsal vs black dorsal
## Wald test p-value: condition white dorsal vs black dorsal
## DataFrame with 18110 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## PECUL23A000004 557.4102 -0.1551640 0.552481 -0.280850 0.778826
## PECUL23A000011 943.9019 -0.2105533 0.186661 -1.128000 0.259320
## PECUL23A000017 308.9902 0.0524451 0.488234 0.107418 0.914457
## PECUL23A000021 389.3509 0.0862399 0.128620 0.670503 0.502537
## PECUL23A000024 61.2082 -0.1648459 0.794355 -0.207522 0.835602
## ... ... ... ... ... ...
## PECUL23A062638 7640.25635 -0.3589307 0.285491 -1.257239 0.2086671
## PECUL23A062639 14.84163 0.1675091 0.565750 0.296083 0.7671664
## PECUL23A062644 1.19938 -0.0569031 1.393045 -0.040848 0.9674171
## PECUL23A062646 7.21864 -0.6777258 1.026960 -0.659934 0.5092961
## PECUL23A062647 109.72323 0.5468155 0.272676 2.005367 0.0449238
## padj
## <numeric>
## PECUL23A000004 0.994905
## PECUL23A000011 0.911978
## PECUL23A000017 0.998538
## PECUL23A000021 0.978891
## PECUL23A000024 0.998538
## ... ...
## PECUL23A062638 0.887490
## PECUL23A062639 0.994905
## PECUL23A062644 NA
## PECUL23A062646 0.981014
## PECUL23A062647 0.682353
# 2. using contrasts:
results(dds, contrast=c("condition","white_dorsal","black_dorsal"))
## log2 fold change (MLE): condition white_dorsal vs black_dorsal
## Wald test p-value: condition white dorsal vs black dorsal
## DataFrame with 18110 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## PECUL23A000004 557.4102 -0.1551640 0.552481 -0.280850 0.778826
## PECUL23A000011 943.9019 -0.2105533 0.186661 -1.128000 0.259320
## PECUL23A000017 308.9902 0.0524451 0.488234 0.107418 0.914457
## PECUL23A000021 389.3509 0.0862399 0.128620 0.670503 0.502537
## PECUL23A000024 61.2082 -0.1648459 0.794355 -0.207522 0.835602
## ... ... ... ... ... ...
## PECUL23A062638 7640.25635 -0.3589307 0.285491 -1.257239 0.2086671
## PECUL23A062639 14.84163 0.1675091 0.565750 0.296083 0.7671664
## PECUL23A062644 1.19938 -0.0569031 1.393045 -0.040848 0.9674171
## PECUL23A062646 7.21864 -0.6777258 1.026960 -0.659934 0.5092961
## PECUL23A062647 109.72323 0.5468155 0.272676 2.005367 0.0449238
## padj
## <numeric>
## PECUL23A000004 0.994905
## PECUL23A000011 0.911978
## PECUL23A000017 0.998538
## PECUL23A000021 0.978891
## PECUL23A000024 0.998538
## ... ...
## PECUL23A062638 0.887490
## PECUL23A062639 0.994905
## PECUL23A062644 NA
## PECUL23A062646 0.981014
## PECUL23A062647 0.682353
These two approaches should give largely the same results (though baselines are treated slightly differently), but the second has some advantages. Namely, you can change the factor order (in our case, there is not really a clear control/treatment, right?), and you can calculate results using a different baseline than the one used in the original, automated run. Let's make four results objects, comparing the dorsal and ventral skin within each treatment, and the dorsal and ventral skin between treatments:
# white dorsal vs white ventral
res_wD_wV<-results(dds, contrast=c("condition","white_dorsal","white_ventral"))
# black dorsal vs black ventral
res_bD_bV<-results(dds, contrast=c("condition","black_dorsal","black_ventral"))
# black dorsal vs white dorsal
res_bD_wD<-results(dds, contrast=c("condition","black_dorsal","white_dorsal"))
# black ventral vs white ventral
res_bV_wV<-results(dds, contrast=c("condition","black_ventral","white_ventral"))
We can now summarize these results tables:
summary(res_wD_wV, alpha=0.05)
##
## out of 18110 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 198, 1.1%
## LFC < 0 (down) : 172, 0.95%
## outliers [1] : 103, 0.57%
## low counts [2] : 1054, 5.8%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
summary(res_bD_bV, alpha=0.05)
##
## out of 18110 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 214, 1.2%
## LFC < 0 (down) : 181, 1%
## outliers [1] : 103, 0.57%
## low counts [2] : 1756, 9.7%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
summary(res_bD_wD, alpha=0.05)
##
## out of 18110 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 24, 0.13%
## LFC < 0 (down) : 49, 0.27%
## outliers [1] : 103, 0.57%
## low counts [2] : 2104, 12%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
summary(res_bV_wV, alpha=0.05)
##
## out of 18110 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 29, 0.16%
## LFC < 0 (down) : 69, 0.38%
## outliers [1] : 103, 0.57%
## low counts [2] : 2104, 12%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
This returns the number of up-regulated genes (LFC > 0) and down-regulated genes (LFC < 0) with respect to the baseline (second) condition of the contrast. Other information is also given, such as the number of outliers and genes with low counts.
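Note that summary() only counts the significant genes; to actually pull them out, you can filter the results table yourself. A quick sketch using the same 0.05 cutoff on the adjusted p-value:
# e.g. significantly up-regulated genes (padj < 0.05, LFC > 0) in black dorsal vs black ventral
res_bD_bV %>%
  as.data.frame() %>%
  rownames_to_column("gene_id") %>%
  filter(!is.na(padj), padj < 0.05, log2FoldChange > 0) %>%
  arrange(padj) %>%
  head()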
Question: What do you notice about these results?
It seems like the dorsal vs. ventral comparisons produce many more up- and down-regulated genes than the black vs. white comparisons. This indicates that not all skin is the same! At the level of gene expression, the change in pigmentation on the dorsum is minimal compared to the difference in gene expression between dorsal and ventral skin of much more similar colour!
The typical plot to show the distribution of differentially expressed genes is an "MA-plot". Here, we plot the mean of normalized counts against the log fold change, using the canned plotting function from the DESeq2 package.
par(mfrow=c(2,2))
# one MA-plot per comparison
DESeq2::plotMA(res_wD_wV, main="White Dorsal vs. White Ventral")
DESeq2::plotMA(res_bD_bV, main="Black Dorsal vs. Black Ventral")
DESeq2::plotMA(res_bD_wD, main="Black Dorsal vs. White Dorsal")
DESeq2::plotMA(res_bV_wV, main="Black Ventral vs. White Ventral")
par(mfrow=c(1,1))
Question: Have you seen this kind of plot before? What can we say from these plots?
We can squeeze even more information out of these kinds of plots by getting creative with ggplot.
res_bD_bV %>%
as.data.frame() %>%
ggplot(aes(baseMean, log2FoldChange, colour=padj)) +
geom_point(size=1) +
scale_y_continuous(limits=c(-5, 5), oob=squish) + # oob from the scales package is needed to "squish" points falling outside the axis limits
scale_x_log10() +
geom_hline(yintercept = 0, colour="red", size=1, linetype="longdash") +
labs(x="mean of normalized counts", y="log fold change") +
scale_colour_viridis(direction=-1, trans='sqrt') +
geom_density_2d(colour="blue", size=0.5) +
theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Here, we can see more clearly that genes with low counts do not have adjusted p-values (the points in grey). The density contours are also centered around a y-intercept of 0 (most genes are not differentially expressed).
We have now finished the differential gene expression analysis. As the final step, we should export our results. We can export the individual results tables as .csv files, and we can export the various objects in case we want to process them further down the road.
# make a results folder if it does not yet exist
dir.create("results", showWarnings = FALSE)
# export individual results tables like so:
write.csv(as.data.frame(res_wD_wV), "./results/deseq2_res_wD_wV.csv")
# but for convenience, we can export them all as a list object and save it as an .rds file
saveRDS(list(
bD_bV = res_bD_bV,
wD_wV = res_wD_wV,
bV_wV = res_bV_wV,
bD_wD = res_bD_wD),
"./results/deseq2_results.rds")
# export the DESeq2 object as an .rds files
saveRDS(dds, "./results/deseq2_dds.rds")
At this point, we can do some quality checks to make sure that all of the steps that DESeq2 has performed worked as expected. For example:
The over-dispersion correction should result in an even distribution of p-values, with an "enrichment" of genes with low p-values (i.e. a roughly rectangular histogram with a peak near 0).
# for all comparisons:
par(mfrow=c(2,2))
par(mar=c(4,4,1,1))
hist(res_wD_wV$pvalue, breaks=50, col="grey", main="wD_wV")
hist(res_bD_bV$pvalue, breaks=50, col="grey", main="bD_bV")
hist(res_bD_wD$pvalue, breaks=50, col="grey", main="bD_wD")
hist(res_bV_wV$pvalue, breaks=50, col="grey", main="bV_wV")
par(mfrow=c(1,1))
All-in-all, not bad. In all cases, we see the expected "enrichment" of genes with low p-values and then a relatively uniform/rectangular distribution for the remaining p-values. It is not perfect, and there is a little bit of a hill shape at higher p-values. If this were more extreme (either a hill-shaped or a U-shaped distribution), you may want to try to remove outliers or fit models with more than one dispersion parameter (see here for example, or look up local dispersion fitting in the DESeq2 handbook).
To accurately model sequencing counts, we need to generate accurate estimates of within-group variation (variation between biological replicates of the same treatment/condition group) for each gene. With only a few (usually 3) replicates per group, the estimates of variation for each gene are often unreliable (due to the large differences in dispersion for genes with similar means).
To address this problem, DESeq2 shares information across genes to generate more accurate estimates of variation based on the mean expression level of the gene using a method called ‘shrinkage’. DESeq2 assumes that genes with similar expression levels have similar dispersion.
Estimating the dispersion for each gene separately:
To model the dispersion based on expression level (mean counts of replicates), the dispersion for each gene is estimated using maximum likelihood estimation. In other words, given the count values of the replicates, the most likely estimate of dispersion is calculated.
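If you want to look at the per-gene estimates directly, DESeq() stores them in the row metadata. A quick peek (the column names below are DESeq2's defaults; check names(mcols(dds)) if they look different on your version):
# gene-wise MLE (dispGeneEst), fitted trend (dispFit) and final shrunken value (dispersion)
mcols(dds)[, c("baseMean", "dispGeneEst", "dispFit", "dispersion")] %>%
  as.data.frame() %>%
  head()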
We can easily plot these dispersion estimates.
plotDispEsts(dds)
Question: What are some interesting things to note about this plot?
All-in-all it looks good!