Prepare environment

Set your working directory and load the libraries we will need.

# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")

# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")

# use pacman to load libraries
pacman::p_load(tidyverse,DESeq2, ageglm,ggVennDiagram,UpSetR,plotly,ggrepel,scico)
## Warning: package 'ageglm' is not available for this version of R
## 
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages
## Warning in p_install(package, character.only = TRUE, ...):
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'ageglm'
## Warning in pacman::p_load(tidyverse, DESeq2, ageglm, ggVennDiagram, UpSetR, : Failed to install/load:
## ageglm

Background

In the last exercise, we performed a differential gene expression analysis with DESeq2. We contrasted dorsal and ventral skin tissues from tadpoles reared on light and dark backgrounds. We were finished up with DESeqResults objects for pair-wise comparisons. These we saved as a list.

The aim of this exercise is to explore these results in greater detail. To do that, it would be good to know the functions of the genes in the Pelobates cultripes genome. One important extra file that you will use for this, is therefore an annotation file generated through BLASTing the genes against the Ensembl Xenopus tropicalis proteome.

Load data

Let's load the results files from the previous exercise, plus a file with annotations.

# DEG object
dds<-readRDS("./results/deseq2_dds.rds")

# the list of DEG results
res<-readRDS("./results/deseq2_results.rds")

# Load BLAST results
xtrop<-read_csv("data/PCU23_annotations_xtr105_genes.csv")

A comment on the annotation file

If we take a closer look at the annotation file, we should notice a few things:

head(xtrop, 20)

The annotation table has merged a couple of sources of information. The gene_id, transcript_id and pep_id are Pelobates-specific identifiers. These are reference IDs from the Pelobates genome. The pep_description are annotations from that genome. The rest (starting with xtr_) are Xenopus tropicalis peptide ids, names and descriptions. These have been derived from either a BLASTx or BLASTp search (using diamond). We used the Xenopus tropicalis Ensembl v105 proteome as a target for this search. These are therefore homolog annotations.

Comparing sets

As a reminder, calling summary() on our DESeqResults objects, returns the number (and percentage) of significant up and down regulated genes for a specific pair-wise comparison.

summary(res$bD_bV, alpha=0.05)

However, we have multiple sets of pair-wise comparisons, and we might be interested to see how many genes which are differentially expressed are shared among sets. In other words, how many genes that are differentially expressed when comparing dorsal and ventral skin white tadpoles are also differentially expressed in the same comparison for dark tadpoles?

More often than not, you will come across comparisons of sets of raw numbers like this displayed as Venn diagrams.

The first think we have to do, is to make lists of genes per set. I.e. which are these 395 genes that are differentially expressed in in the dorsal vs. ventral comparison in black tadpoles?

# For a single comparison
res$bD_bV %>%
  as_tibble(rownames = "gene") %>%
  filter(padj<0.05) %>%
  pull(gene)

We are going to do this repeatedly, so we can just turn this code into a function:

# turn it into a function so we can apply it to a list!
extract_degs<-function(x) {
  return(
    x %>%
      as_tibble(rownames = "gene") %>%
      filter(padj<0.05) %>%
      pull(gene)
  )
}

We can now apply this to our list of DESeqResults.

#  extract all
sig_deg<-lapply(res, FUN=extract_degs)
str(sig_deg)

We can now plot this as a Venn diagram. Let's do this just with two sets, for the dorsal-ventral comparisons for the black and the white tadpoles

# comparing just the dorsal vs. ventral
sig_deg[c("bD_bV","wD_wV")] %>%
  ggVennDiagram(edge_size = 0) +
  scale_fill_scico(palette = "batlow")

Question: What does this tell us?

  • Almost half of all the DEGs (the largest set) are shared! I.e. most of the difference in expression between dorsal and ventral skin, is probably not related to pigmentation.

Question: What would this look like if we compared all four sets? i.e. go ahead and make a Venn diagram comparing all four sets.

ggVennDiagram(sig_deg,edge_size = 0) +
  scale_fill_scico(palette = "batlow")
  • The largest shared set is still the Dorsal-Ventral comparisons of black and white tadpoles.
  • There are 8 genes that are differentially expressed in all comparisons.

Personally, once you compare more than three sets, I find Venn diagrams to become difficult to interpret. An interesting alternative is an "upset" diagram. This shows exactly the same information.

upset(fromList(sig_deg),
      number.angles = 0, point.size = 3, line.size = 1,
      sets.x.label = "Number of DEGs",
      set_size.show = TRUE,
      set_size.scale_max = max(sapply(sig_deg, length))+50, # needed only to expand the axis a bit
      text.scale = c(1.2, 1.2, 1.2, 1.2, 1.5, 1.5),
      order.by=c("degree","freq"))

Question: What do we think? is this easier to interpret? Is there any additional information we can gain from this?

  • The information shown is pretty much the same. The only additional information shown is the set size (the horizontal barplot).

Plotting fold changes

Volcano plot

A plot you may also have seen is a "Volcano" plot, where you are showing the log fold change plotted against the adjusted p-values per gene. To make small p-values very large, the p-values are usually -log10() transformed. The most basic plot would look like this:

res$bD_bV %>%
  as_tibble(rownames = "gene_id") %>%
  ggplot(aes(x=log2FoldChange, y=-log10(padj))) +
  geom_point(alpha=0.75, shape=16)

However, we want to squeeze as much information out of it as possible, so we will add the following:

  • Remove all points that are far from being significant.
  • Colour-code points that fall within the bounds of our significance thresholds (padj<0.05 and absolute log fold change > 2)
  • add annotations to the points (Gene IDs and the gene descriptions)
  • plot all comparisons together
  • make the plot interactive!
gg_res <- res %>%
  lapply(as_tibble,rownames = "gene_id") %>%
  bind_rows(.id="comparison") %>%
  drop_na(padj) %>% # drop all genes with NAs
  filter(padj<0.5) %>% # reduce the number of points that need to be plotted
  mutate(sig= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
  left_join(xtrop, by=c("gene_id")) %>% # add annotations
  ggplot(aes(x=log2FoldChange, y=-log10(padj), color=sig,
            text=paste0("</br>Pcu23 gene: ", gene_id,
                        "</br>Pcu23 peptide: ", pep_description,
                       "</br>X.tr peptide: ", xtr_pep_name_x
                       )
         )) +
  geom_point(alpha=0.75, shape=16) +
  facet_wrap(~comparison, ncol = 2, scales = "free") +
  xlim(-15,15) +
  theme_minimal() +
  theme(legend.position = "none")

gg_res

# we can now turn this into an interactive plot:
ggplotly(gg_res, tooltip="text")

Question: What gene are positively differentially expressed when comparing the black vs. the white dorsal tissue?

  • Seleno W
  • OX-2 membrane glyco-like
  • oca2
  • PMEL

Some other comments:

  • Remember, the second contrast is always the reference, or baseline. This means, that wD (white dorsal) is the reference, and so a positive fold change means genes are over-expressed in the treatment, which is the black dorsal tissue.
  • Notice also that some of them are missing annotations... is this a problem?
  • If we wanted to know more about the function of these genes, we need to turn to online references, such as UNIPROT, e.g. PMEL

We could also make publication-ready, static plots with annotations of specific genes. For example, all the significant DEGs for the black dorsal vs. white dorsal.

res$bD_wD %>%
  as_tibble(rownames = "gene_id") %>%
  drop_na(padj) %>% # drop all genes with NAs
  #filter(padj<0.99) %>% # reduce the number of points that need to be plotted
  mutate(sig= padj<0.05 & abs(log2FoldChange)>=2) %>% # make a variable to indicate if a gene is significant based on a specific thresholds
  left_join(xtrop, by=c("gene_id")) %>% # add annotations
  filter(padj<0.5) %>%
  ggplot(aes(x=log2FoldChange, y=-log10(padj), color=sig)) +
  geom_point(alpha=0.75, shape=16) +
  geom_text_repel(data=. %>% filter(sig),
                  aes(label=xtr_pep_name_x),
                  max.overlaps = 50,
                  size=2) +
  xlim(-10,10) +
  ggtitle("DEGs in Black Dorsal Skin in Comparison to White Dorsal Skin") +
  theme_bw() +
  theme(legend.position = "none")

Question: What have we learned about the gene expression that controls pigemnation plasticity in dorsal skin of these tadpoles?

Just some of the many answers:

  • Pigmentation changes involve melanin biosynthesis (pmel, oca2, mlana).
  • It also involves carotenoids! (bco2l)
  • There are many differentially expressed genes which may be important, but they are not annotated.

Extras - Shrinking the MA plot

In the previous exercise, we looked at MA-plots. We used them mostly as a diagnostic plot, but very often, you will also see them in publications, to display a ratio of gene expression (the fold change) against the mean expression of that gene (base mean). Let's quickly remind us what this looks like:

DESeq2::plotMA(res$bD_bV)

Here, the "noise" of genes with low counts is over-powering the plot.

The DESeq2 package incorporates a prior on log2 fold changes, resulting in moderated estimates from genes with low counts and highly variable counts, as can be seen by the narrowing of spread of points on the left side of the MA plot. This is called "shrinkage" and is done to avoid that these values, which otherwise would frequently be unrealistically large, dominate the top-ranked log fold changes.

A useful exercise to reduce the noise of the low-count genes is to apply a more aggressive shrinkage method. The most widely used type is the apglm method.

# apeglm shrinkage can only be done on already calculated coefficients:
resultsNames(dds)

bD_bV_res <- results(dds,
                     contrast= c("condition","white_ventral", "black_dorsal"))

bD_bV_res
# with shrinkage
bD_bV_shr<-lfcShrink(dds,
                   coef="condition_white_ventral_vs_black_dorsal",
                   type="apeglm")
bD_bV_shr

This produces the same type of "results" object, but you should notice that the first object shows the maximum likelihood estimates (MLE) of the fold change, whereas the second shows the shrunk (MAP) fold changes.

The differences become very clear when repeating the MA-plot

par(mfrow=c(1,2))
par(mar=c(4,4,4,1))
DESeq2::plotMA(bD_bV_res, main="MLE")
DESeq2::plotMA(bD_bV_shr, main="MAP")

Very often you will see the shrinkage versions of MA-plots, simply because they allow for a more pleasing visualization of the significant genes.

Question: How do the fold changes and adjusted p-values compare?

par(mfrow=c(1,2))
par(mar=c(4,4,1,1))
# p values
plot(bD_bV_res$padj~bD_bV_shr$padj,
     xlab="adjusted p-value (shrunk)",
     ylab="adjusted p-value (MLE)")
# fold changes
plot(bD_bV_res$log2FoldChange~bD_bV_shr$log2FoldChange,
     xlab="Fold Change (shrunk)",
     ylab="Fold Change (MLE)")
par(mfrow=c(1,1))
  • The p-values do not change! The number of significantly DEGs is therefore not affected by shrinkage
  • intermediate absolute fold changes are similar, but genes whose fold changes are very large, are pulled towards 0 with the shrinkage.