Differential Gene Expression

Prepare environment

Set your working directory and load the libraries we will need.

# set working directory
#setwd("~/Documents/git_projects/ADDA_taller2/")

# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")
  
# use pacman to load libraries
pacman::p_load(tidyverse,DESeq2,viridis,scales)

Background

In the 1st exercise, we explored the RNA-Seq count data from salmon. We saw that this data:

Has counts, abundances, lengths.
Comes from different experimental treatments (dark vs. light backgrounds)
Comes from different tissues (liver, dorsal skin, ventral skin)

The aim of this exercise is to go through a workflow for identifying genes that are significantly deferentially expressed when comparing two experimental treatments. We saw that the liver and skin samples were very different in their expression profiles. For this exercise, we will therefore only focus on the skin samples. This workflow entails the following general steps:

Import quantification data.
Perform a differential gene expression analysis using DESeq2.
Evaluate the quality of the results.
Summarize the results.

Load data

We will start by loading the quantification data from salmon and the design matrix of the experiment, just like we did for the previous exercise

txi<-readRDS("../data/salmon_gene_counts.rds")
samples<-read_csv("../data/design_matrix.csv")

Subset data

Because we only want to be working with one tissue (in this case skin), we want to filter out all of the remaining data from the design matrix and the expression data. To make things a little more convenient, we will also create a new variable (condition) which combines the treatment and side variable.

samples <- samples %>%
  filter(tissue=="skin") %>%
  mutate(condition=as.factor(paste(treatment, side, sep="_")))

## filter txi matrices
txi$abundance<-txi$abundance[,samples$sample_id]
txi$counts<-txi$counts[,samples$sample_id]
txi$length<-txi$length[,samples$sample_id]

Make a DESeqDataSet object

We can now combine the design matrix and the expression data into a format that DESeq2 likes. tximport plays very well with DESeq2, so this is easy!

dds <- DESeqDataSetFromTximport(txi,
                         colData = samples,
                         design = ~ condition)

The dataset should have successfully been stored as a DESeqDataSet object. You should see that it has imported the counts and the length data from the txi object and the sample information from the design matrix. lets take a look at this object:

dds

## class: DESeqDataSet 
## dim: 32531 16 
## metadata(1): version
## assays(2): counts avgTxLength
## rownames(32531): PECUL23A000004 PECUL23A000005 ... PECUL23A062646
##   PECUL23A062647
## rowData names(0):
## colnames(16): 10D 11D ... 8V 9V
## colData names(5): sample_id treatment tissue side condition

You may have seen this type of object before, but it most likely is not as common as other types.

Question: What is the structure of this object? Is there anything you are unfamiliar with?

It is an S4 object system. R has three object oriented systems: S3, S4 and R5. You can read up about them here. For now, you just need to know that most often you will work with S3 objects, which can be conveniently accessed with $ for example. S4 objects are a little different. They usually have data stored in "slots" which are indicated with @.

More specifically, it is a subclass of RangedSummarizedExperiment.

This is a format widely used by packages in the Bioconductor ecosystem. The core of the object are assays (a single or multiple matrices) with samples as columns and features as rows. Features in this case are our genes or transcripts, but they may also be chromosomal coordinates etc.

Working with this type of object sometimes requires specific "accessor" functions (or methods) to get at data stored in 'slots'. For example, the dds object we created, has two assays: counts and avgTxLength. We can access them with assays() function (note: different from assay()), or specific functions for specific assays (if these are available)

# see what assays are stored in the object:
assayNames(dds)

## [1] "counts"      "avgTxLength"

# call a specific assay by name
assay(dds, "counts") %>% head()

##                 10D 11D 12D  22D  10V 11V 12V  22V  21D  7D  8D  9D 21V  7V  8V
## PECUL23A000004  799 209 588  675  951 314 427  989  842 299 172 682 503 222 303
## PECUL23A000005    0   0   0    0    0   0   0    0    0   0   0   0   0   0   0
## PECUL23A000006    0   0   0    0    0   0   0    0    0   0   0   0   0   0   0
## PECUL23A000008    0   0   0    0    0   0   0    0    0   0   0   0   1   0   0
## PECUL23A000011 1239 898 883 1125 1268 933 845 1044 1005 879 725 970 858 729 714
## PECUL23A000012    0   0   0    0    0   0   0    0    0   0   0   0   0   0   0
##                 9V
## PECUL23A000004 945
## PECUL23A000005   0
## PECUL23A000006   0
## PECUL23A000008   0
## PECUL23A000011 978
## PECUL23A000012   0

assay(dds, "avgTxLength") %>% head()

##                       10D        11D        12D        22D        10V
## PECUL23A000004 6201.16983 6438.33500 6168.57244 6360.54714 6199.02702
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4246.70282 4244.47366 4248.68448 4138.22686 4252.16543
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650  447.71650
##                       11V        12V        22V        21D         7D
## PECUL23A000004 6315.82807 6251.34080 6205.29626 6149.42485 6209.34168
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4247.39914 4242.69120 4240.83801 4241.13665 4242.39630
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650  447.71650
##                        8D         9D        21V         7V         8V
## PECUL23A000004 6379.89021 6555.93984 6100.23764 6537.26970 6485.19773
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872 1021.78100 1019.39872 1019.39872
## PECUL23A000011 4250.63982 4170.79175 4247.13092 4250.94094 4163.96652
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650  447.71650
##                        9V
## PECUL23A000004 6148.17008
## PECUL23A000005   71.39554
## PECUL23A000006  547.63317
## PECUL23A000008 1019.39872
## PECUL23A000011 4243.53463
## PECUL23A000012  447.71650

# call a specific assay with a specific function
counts(dds) %>% head()

##                 10D 11D 12D  22D  10V 11V 12V  22V  21D  7D  8D  9D 21V  7V  8V
## PECUL23A000004  799 209 588  675  951 314 427  989  842 299 172 682 503 222 303
## PECUL23A000005    0   0   0    0    0   0   0    0    0   0   0   0   0   0   0
## PECUL23A000006    0   0   0    0    0   0   0    0    0   0   0   0   0   0   0
## PECUL23A000008    0   0   0    0    0   0   0    0    0   0   0   0   1   0   0
## PECUL23A000011 1239 898 883 1125 1268 933 845 1044 1005 879 725 970 858 729 714
## PECUL23A000012    0   0   0    0    0   0   0    0    0   0   0   0   0   0   0
##                 9V
## PECUL23A000004 945
## PECUL23A000005   0
## PECUL23A000006   0
## PECUL23A000008   0
## PECUL23A000011 978
## PECUL23A000012   0

We can also access information about the columns:

colData(dds)

## DataFrame with 16 rows and 5 columns
##       sample_id   treatment      tissue        side     condition
##     <character> <character> <character> <character>      <factor>
## 10D         10D       black        skin      dorsal black_dorsal 
## 11D         11D       black        skin      dorsal black_dorsal 
## 12D         12D       black        skin      dorsal black_dorsal 
## 22D         22D       black        skin      dorsal black_dorsal 
## 10V         10V       black        skin     ventral black_ventral
## ...         ...         ...         ...         ...           ...
## 9D           9D       white        skin      dorsal white_dorsal 
## 21V         21V       white        skin     ventral white_ventral
## 7V           7V       white        skin     ventral white_ventral
## 8V           8V       white        skin     ventral white_ventral
## 9V           9V       white        skin     ventral white_ventral

In this case, this is the experimental information we provided with the design matrix!

And we can also access the row data.

rowData(dds)

## DataFrame with 32531 rows and 0 columns

In this case it is empty, but you could imagine that you may have gene annotations here, or loci position information etc.

Feel free to read up more on this kind of SummarizedExperiment data object here.

Question: What is different about count data stored in the dds object compared to the raw count data we imported?

The decimal counts have been rounded off to make them integers.

head(counts(dds))

##                 10D 11D 12D  22D  10V 11V 12V  22V  21D  7D  8D  9D 21V  7V  8V
## PECUL23A000004  799 209 588  675  951 314 427  989  842 299 172 682 503 222 303
## PECUL23A000005    0   0   0    0    0   0   0    0    0   0   0   0   0   0   0
## PECUL23A000006    0   0   0    0    0   0   0    0    0   0   0   0   0   0   0
## PECUL23A000008    0   0   0    0    0   0   0    0    0   0   0   0   1   0   0
## PECUL23A000011 1239 898 883 1125 1268 933 845 1044 1005 879 725 970 858 729 714
## PECUL23A000012    0   0   0    0    0   0   0    0    0   0   0   0   0   0   0
##                 9V
## PECUL23A000004 945
## PECUL23A000005   0
## PECUL23A000006   0
## PECUL23A000008   0
## PECUL23A000011 978
## PECUL23A000012   0

head(txi$counts)

##                     10D     11D     12D      22D      10V     11V     12V
## PECUL23A000004  798.556 209.388 588.464  675.112  950.925 313.572 427.419
## PECUL23A000005    0.000   0.000   0.000    0.000    0.000   0.000   0.000
## PECUL23A000006    0.000   0.000   0.000    0.000    0.000   0.000   0.000
## PECUL23A000008    0.000   0.000   0.000    0.000    0.000   0.000   0.000
## PECUL23A000011 1238.869 898.400 883.169 1125.317 1267.971 933.099 844.625
## PECUL23A000012    0.000   0.000   0.000    0.000    0.000   0.000   0.000
##                     22V      21D      7D      8D      9D     21V      7V
## PECUL23A000004  988.542  842.375 299.200 172.249 681.824 503.027 222.206
## PECUL23A000005    0.000    0.000   0.000   0.000   0.000   0.000   0.000
## PECUL23A000006    0.000    0.000   0.000   0.000   0.000   0.000   0.000
## PECUL23A000008    0.000    0.000   0.000   0.000   0.000   1.000   0.000
## PECUL23A000011 1044.139 1005.042 879.316 724.642 970.492 858.488 728.819
## PECUL23A000012    0.000    0.000   0.000   0.000   0.000   0.000   0.000
##                     8V      9V
## PECUL23A000004 303.135 945.262
## PECUL23A000005   0.000   0.000
## PECUL23A000006   0.000   0.000
## PECUL23A000008   0.000   0.000
## PECUL23A000011 713.794 978.380
## PECUL23A000012   0.000   0.000

Pre-filtering

You may have seen that some workflows filter out genes that have very low counts, or whose counts don't vary very much across samples. In theory, this is not necessary with DESeq2 as it will handle this on the fly (read more here), but reducing the size of the matrix can greatly speeding up downstream calculations and reduce the memory requirements.

As we saw in the previous exercise, there are many liver-specific genes that most likely are not deferentially expressed in skin. There are therefore many rows of genes with 0 or very low counts. We can at least remove those:

# keep only rows with with counts summing up to 10 or more
dds <- dds[rowSums(counts(dds)) >= 10,]

Differential Expression Analysis

The ultimate goal of running such an analysis is to see if a given gene is expressed to a different degree in one condition vs. another condition. This sounds almost like a T-test, but things are not that simple. In this exercise, I don't want to focus too much on theory, but it is worth talking about what some of the "problematic" features are of the data:

Many counts are low, or even zero.
RNAseq data tends to exhibit over dispersion: the variance is much greater than the mean.
The target contigs (e.g. genes) are of different lengths. You would expect longer contigs to also have more mapped reads.
Each library (sample) consists of a different total number of reads.
RNAseq experiments tend to have few (usually three) replicates per condition.
There are typically thousands of genes, which means you are repeating the same pairwise test thousands of times.
Highly expressed genes tend to 'eat up' more reads during sequencing, meaning that genes that are less expressed tend to be underrepresented. This means changes in expression in highly expressed genes are easier to detect than changes in genes that are expressed at lower levels.

The differential expression analysis in DESeq2 takes several steps to try to deal with these problems. At its core it fits generalized linear models with a negative binomial distribution to each gene. It takes into account the estimated counts per gene, normalized by library size, but also takes into account the size of the target genes/transcripts and calculates a dispersion parameter, which tells you about how far the observed count is from the mean counts estimated by the model.

It then performs a pairwise comparison of the level of expression per gene in condition A vs. condition B. It uses the models per gene to estimate coefficients and standard error for each sample group (i.e. for condition A and condition B). These coefficients are the estimates of log2 fold change.

To test whether this log2 fold change for a given gene is significant, it then applies a statistical test (Wald test), to test the hypothesis that the parameter (log 2 fold change) is significantly different from 0. A multiple test correction is then applied, by default a version of the Benjamini-Hochberg False Discovery Rate (FDR). This is done by ranking the genes by p-value, then:

$adjusted\ p\ value = p\ value * \frac{total\ number\ of\ tests}{rank}$

In short, DESeq2 does the following:

Estimation of size factors (control for differences in the library size of the sequencing experiments)
Estimation of dispersion (e.g. for problem of large within-group variability and low sample sizes)
Negative binomial GLM fitting (good for modeling discrete counts with over-dispersion)
Wald statistics calculation and multiple testing correction (statistical significance testing)

These are all done with a single convenient function:

dds <- DESeq(dds)
dds

## class: DESeqDataSet 
## dim: 18110 16 
## metadata(1): version
## assays(6): counts avgTxLength ... H cooks
## rownames(18110): PECUL23A000004 PECUL23A000011 ... PECUL23A062646
##   PECUL23A062647
## rowData names(30): baseMean baseVar ... deviance maxCooks
## colnames(16): 10D 11D ... 8V 9V
## colData names(5): sample_id treatment tissue side condition

Note: it looks like we are "overwriting" the dds object, but really, we are just adding more inforamtion to it

This incredibly convenient function has done all of the heavy lifting for us, and it has even run some pair-wise tests already, based on the ~condtion variable we set.

We can access the names of the pairwise tests like so:

resultsNames(dds)

## [1] "Intercept"                              
## [2] "condition_black_ventral_vs_black_dorsal"
## [3] "condition_white_dorsal_vs_black_dorsal" 
## [4] "condition_white_ventral_vs_black_dorsal"

Question: Is there anything unexpected about the list of pair-wise tests?

We had 4 treatment levels, which should result in a total of 6 pairwise comparisons. But here, there are only 3.

This is because DESeq2 compares everything to a baseline condition. If this is not specified, it will use the first alphabetic factor level, which in this case is "black_dorsal". As a result only xxx_vs_black_dorsal comparisons are made.

This is an important detail to remember: the second factor is always considered to be the "base level" or "control" variable.

Examining differential expression analysis results

We can extract the results for the pair-wise tests from the dds object like so:

res<-results(dds)
res

## log2 fold change (MLE): condition white ventral vs black dorsal 
## Wald test p-value: condition white ventral vs black dorsal 
## DataFrame with 18110 rows and 6 columns
##                  baseMean log2FoldChange     lfcSE      stat    pvalue
##                 <numeric>      <numeric> <numeric> <numeric> <numeric>
## PECUL23A000004   557.4102      -0.100434  0.552488 -0.181786  0.855751
## PECUL23A000011   943.9019      -0.274858  0.186806 -1.471356  0.141195
## PECUL23A000017   308.9902      -0.089849  0.488532 -0.183916  0.854079
## PECUL23A000021   389.3509       0.161886  0.128565  1.259179  0.207966
## PECUL23A000024    61.2082      -0.697616  0.797256 -0.875021  0.381562
## ...                   ...            ...       ...       ...       ...
## PECUL23A062638 7640.25635      -0.331002  0.285494 -1.159400 0.2462931
## PECUL23A062639   14.84163      -0.269193  0.576309 -0.467098 0.6404298
## PECUL23A062644    1.19938      -0.600717  1.455699 -0.412665 0.6798517
## PECUL23A062646    7.21864      -1.430253  1.053540 -1.357569 0.1746006
## PECUL23A062647  109.72323       0.558908  0.272927  2.047831 0.0405766
##                     padj
##                <numeric>
## PECUL23A000004  0.974429
## PECUL23A000011  0.611279
## PECUL23A000017  0.973855
## PECUL23A000021  0.692827
## PECUL23A000024  0.815316
## ...                  ...
## PECUL23A062638  0.728480
## PECUL23A062639  0.924804
## PECUL23A062644        NA
## PECUL23A062646  0.656003
## PECUL23A062647  0.361351

If we don't specify anything else, it will extract just the first comparison. In this case "white ventral vs black dorsal". This table has the following information:

Rows are genes.
Per gene, there are 6 metrics:

baseMean: mean of normalized counts for all samples divided by size factor (note: it does not take into account gene length).
log2FoldChange: The effect size estimate. This value indicates how much the gene or transcript's expression seems to have changed between the comparison and control groups. This value is reported on a logarithmic scale to base 2.
lfcSE: The standard error estimate for the log2 fold change estimate
stat: The value of the test statistic for the gene or transcript.
pvalue: P-value of the test for the gene or transcript.
padj: Adjusted P-value for multiple testing for the gene or transcript.

Question: Is there anything unexpected about the adjusted p-values?

Some of the adjusted p-values are NA. This is what the DESeq2 handbook tells us:

"If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p value will be set to NA. Description and customization of independent filtering is described below".

Let's check some of those counts with NA padj:

counts(dds) %>%
  as.data.frame() %>%
  filter(is.na(res$padj)) %>%
  head()

##                10D 11D 12D 22D 10V 11V 12V 22V 21D 7D 8D 9D 21V 7V 8V 9V
## PECUL23A000050   3   1   2   0   4   1   1   3   5  2  0  0   1  3  0  1
## PECUL23A000099   0   0   0   1   0   3   2   2   1  1  0  2   0  0  1  1
## PECUL23A000197   1   0   1   2   2   2   1   0   6  1  0  4   2  2  0  1
## PECUL23A000271   2   1   0   1   1   1   0   1   1  0  1  0   2  1  8  0
## PECUL23A000302   0   0   2   0   1   0   1   1   1  0  0  2   1  0  0  1
## PECUL23A000417   0   1   1   0   2   1   4   1   2  1  0  0   2  4  0  2

Indeed! all have very low counts!! they are therefore unreliable, and should rightfully not be included in the results.

We could also extract results of specific comparisons. there are two ways to do this:

# 1. using the result names:
results(dds, name="condition_white_dorsal_vs_black_dorsal")

## log2 fold change (MLE): condition white dorsal vs black dorsal 
## Wald test p-value: condition white dorsal vs black dorsal 
## DataFrame with 18110 rows and 6 columns
##                  baseMean log2FoldChange     lfcSE      stat    pvalue
##                 <numeric>      <numeric> <numeric> <numeric> <numeric>
## PECUL23A000004   557.4102     -0.1551640  0.552481 -0.280850  0.778826
## PECUL23A000011   943.9019     -0.2105533  0.186661 -1.128000  0.259320
## PECUL23A000017   308.9902      0.0524451  0.488234  0.107418  0.914457
## PECUL23A000021   389.3509      0.0862399  0.128620  0.670503  0.502537
## PECUL23A000024    61.2082     -0.1648459  0.794355 -0.207522  0.835602
## ...                   ...            ...       ...       ...       ...
## PECUL23A062638 7640.25635     -0.3589307  0.285491 -1.257239 0.2086671
## PECUL23A062639   14.84163      0.1675091  0.565750  0.296083 0.7671664
## PECUL23A062644    1.19938     -0.0569031  1.393045 -0.040848 0.9674171
## PECUL23A062646    7.21864     -0.6777258  1.026960 -0.659934 0.5092961
## PECUL23A062647  109.72323      0.5468155  0.272676  2.005367 0.0449238
##                     padj
##                <numeric>
## PECUL23A000004  0.994905
## PECUL23A000011  0.911978
## PECUL23A000017  0.998538
## PECUL23A000021  0.978891
## PECUL23A000024  0.998538
## ...                  ...
## PECUL23A062638  0.887490
## PECUL23A062639  0.994905
## PECUL23A062644        NA
## PECUL23A062646  0.981014
## PECUL23A062647  0.682353

# 2. using contrasts:
results(dds, contrast=c("condition","white_dorsal","black_dorsal"))

## log2 fold change (MLE): condition white_dorsal vs black_dorsal 
## Wald test p-value: condition white dorsal vs black dorsal 
## DataFrame with 18110 rows and 6 columns
##                  baseMean log2FoldChange     lfcSE      stat    pvalue
##                 <numeric>      <numeric> <numeric> <numeric> <numeric>
## PECUL23A000004   557.4102     -0.1551640  0.552481 -0.280850  0.778826
## PECUL23A000011   943.9019     -0.2105533  0.186661 -1.128000  0.259320
## PECUL23A000017   308.9902      0.0524451  0.488234  0.107418  0.914457
## PECUL23A000021   389.3509      0.0862399  0.128620  0.670503  0.502537
## PECUL23A000024    61.2082     -0.1648459  0.794355 -0.207522  0.835602
## ...                   ...            ...       ...       ...       ...
## PECUL23A062638 7640.25635     -0.3589307  0.285491 -1.257239 0.2086671
## PECUL23A062639   14.84163      0.1675091  0.565750  0.296083 0.7671664
## PECUL23A062644    1.19938     -0.0569031  1.393045 -0.040848 0.9674171
## PECUL23A062646    7.21864     -0.6777258  1.026960 -0.659934 0.5092961
## PECUL23A062647  109.72323      0.5468155  0.272676  2.005367 0.0449238
##                     padj
##                <numeric>
## PECUL23A000004  0.994905
## PECUL23A000011  0.911978
## PECUL23A000017  0.998538
## PECUL23A000021  0.978891
## PECUL23A000024  0.998538
## ...                  ...
## PECUL23A062638  0.887490
## PECUL23A062639  0.994905
## PECUL23A062644        NA
## PECUL23A062646  0.981014
## PECUL23A062647  0.682353

These two approaches should be largely the same (though baselines are treated slightly differently), but the second has some advantages. Namely, you can change the factor order (in our case, there is not really a clear control/treatment, right?), and, you can calculate results using different baseline than was included in the original, automatize run. lets make four results objects, comparing the dorsal and ventral skins within each treatment, and the dorsal skins across treatments:

# white dorsal vs white ventral
res_wD_wV<-results(dds, contrast=c("condition","white_dorsal","white_ventral"))
# black dorsal vs black ventral
res_bD_bV<-results(dds, contrast=c("condition","black_dorsal","black_ventral"))
# black dorsal vs white dorsal
res_bD_wD<-results(dds, contrast=c("condition","black_dorsal","white_dorsal"))
# black ventral vs white ventral
res_bV_wV<-results(dds, contrast=c("condition","black_ventral","white_ventral"))

We can now summarize these results tables:

summary(res_wD_wV, alpha=0.05)

## 
## out of 18110 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 198, 1.1%
## LFC < 0 (down)     : 172, 0.95%
## outliers [1]       : 103, 0.57%
## low counts [2]     : 1054, 5.8%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

summary(res_bD_bV, alpha=0.05)

## 
## out of 18110 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 214, 1.2%
## LFC < 0 (down)     : 181, 1%
## outliers [1]       : 103, 0.57%
## low counts [2]     : 1756, 9.7%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

summary(res_bD_wD, alpha=0.05)

## 
## out of 18110 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 24, 0.13%
## LFC < 0 (down)     : 49, 0.27%
## outliers [1]       : 103, 0.57%
## low counts [2]     : 2104, 12%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

summary(res_bV_wV, alpha=0.05)

## 
## out of 18110 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up)       : 29, 0.16%
## LFC < 0 (down)     : 69, 0.38%
## outliers [1]       : 103, 0.57%
## low counts [2]     : 2104, 12%
## (mean count < 3)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

This returns the number of up-regulated genes (LFC>0) and down-regulated genes (LFC<0) with respect to the control variable. Other information is also given, such as the number of outliers and low counts.

Question: What do you notice about these results?

It seems like the dorsal vs. ventral comparisons are producing many more up and down regulated genes than the black vs. white comparison. This indicates that not all skin is the same! At the level of gene expression, the change in pigmentation on the dorsum is minimal compared to the difference in gene expression between dorsal and ventral skin of much similar colour!

MA plots

The typical plot to show the distribution of differentially expressed genes, is a "MA-plot". Here, we plot the mean of normalized counts against the log fold change, with a canned plotting function from the DESeq2 package.

par(mfrow=c(2,2))
# single plot
DESeq2::plotMA(res_wD_wV, main="White Dorsals vs. Ventral")
DESeq2::plotMA(res_bD_bV, main="Black Dorsals vs. Ventral")
DESeq2::plotMA(res_bD_wD, main="Black Dorsals vs. White Dorsal")
DESeq2::plotMA(res_bV_wV, main="Black Ventral vs. White Ventral")

par(mfrow=c(1,1))

Question: Have you seen this kind of plot before? What can we say from these plots?

* It demonstrates that only genes with a large average normalized count contain sufficient information to be significant * The same as from the summary() function: * Dorsal-ventral comparisons result in more differential expressed genes (DEGs), than black-white comparisons? * DEGs are more or less evenly up (positive fold changes) and down (negative fold changes) regulated for each comparison * There is a lot of noise for genes with genes with low normalized counts.

We can squeeze even more information out of these kinds of plots by getting creative with ggplot.

res_bD_bV %>%
  as.data.frame() %>%
  ggplot(aes(baseMean, log2FoldChange, colour=padj)) +
  geom_point(size=1) + 
  scale_y_continuous(limits=c(-5, 5), oob=squish) + # oob from the scales package is needed to "squish" points falling outside the axis limits
  scale_x_log10() +
  geom_hline(yintercept = 0, colour="red", size=1, linetype="longdash") +
  labs(x="mean of normalized counts", y="log fold change") +
  scale_colour_viridis(direction=-1, trans='sqrt') +
  geom_density_2d(colour="blue", size=0.5) +
  theme_bw()

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Here, we can see more clearly that genes with low counts do not have p-adjusted values (points in grey). The density contours are also centered around a y-intercept of 0 (most genes are not differentially expressed)

Exporting data

We have now finished the differential gene expression analysis. As the final step, we should export our results. We can export the individual results tables as .csv files, and we can export the various objects, in case we want to process them down the road

# make a results folder if it does not yet exist
dir.create("results", showWarnings = FALSE)

# export individual results tables like so:
write.csv(as.data.frame(res_wD_wV), "./results/deseq2_res_wD_wV.csv")

# but for convenience, we can export them all as a list object and save it as an .rds file
saveRDS(list(
          bD_bV = res_bD_bV,
          wD_wV = res_wD_wV,
          bV_wV = res_bV_wV,
          bD_wD = res_bD_wD),
        "./results/deseq2_results.rds")


# export the DESeq2 object as an .rds files
saveRDS(dds, "./results/deseq2_dds.rds")

Extras - Diagnostics

At this point, we can do some quality checks, to make sure that all of the steps that DESeq2has done, worked as expected. For example:

P-value distributions

The over-dispersion correction should result that there is an even distribution of p-values, with an "enrichment" of genes with low p-values (i.e. a rectangular plot with a peak near 0).

# for all comparisons:
par(mfrow=c(2,2))
par(mar=c(4,4,1,1))
hist(res_wD_wV$pvalue, breaks=50, col="grey", main="wD_wV")
hist(res_bD_bV$pvalue, breaks=50, col="grey", main="_bD_bV")
hist(res_bD_wD$pvalue, breaks=50, col="grey", main="bD_wD")
hist(res_bV_wV$pvalue, breaks=50, col="grey", main="bV_wV")

par(mfrow=c(1,1))

This is good example of the different kinds of distributions that you may encounter. In all cases, we see the expected "enrichment" of genes with low p-values. This is good, but what we want to see ideally, is a relatively uniform/rectangular distribution for all other p-values. We can see this especially for the bV_wV plot. For some of the other comparisons we can see a little bit of a hill-shape sloping up as we get to higher p-values, and even a little but of a U-shape. If this were more extreme (both hill-shaped or U-shaped distributions) you may want to try to remove outliers or fit models with more than one dispersion parameter (see here for example, or look up local dispersion, in the DESeq2 handbook).

Dispersion plots

To accurately model sequencing counts, we need to generate accurate estimates of within-group variation (variation between biological replicates of the same treatment/condition group) for each gene. With only a few (usually 3) replicates per group, the estimates of variation for each gene are often unreliable (due to the large differences in dispersion for genes with similar means).

To address this problem, DESeq2 shares information across genes to generate more accurate estimates of variation based on the mean expression level of the gene using a method called ‘shrinkage’. DESeq2 assumes that genes with similar expression levels have similar dispersion.

To model the dispersion based on expression level (mean counts of replicates), the dispersion for each gene is estimated using maximum likelihood estimation. In other words, given the count values of the replicates, the most likely estimate of dispersion is calculated.

We can plot these dispersion estimates to get an idea of what this looks like.

plotDispEsts(dds)

The red dots are the expected dispersion of each gene, given the size of the count, that is estimated by the model.
The black dots are the normalized expression levels (counts) and estimated dispersion for each gene.
The blue points are the same genes, but 'shrunk' towards the values predicted by the model (the red line). Genes that were not shrunk are outlined in blue

Question: What are some interesting things to note about this plot?

Genes with low counts have higher dispersion, but this curve flattens out.
Genes that have much higher dispersion than expected are identified as outliers (black dot with blue circles). These are NOT "shrunk". This is due to the likelihood that the gene does not follow the modeling assumptions and has higher variability than others for biological or technical reasons. Shrinking these could result in the detection of false positives.

All-in-all it looks good!