Functional Enrichment

Prepare environment

Set your working directory and load the libraries we need.

# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")

# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")

# use pacman to load libraries
pacman::p_load(tidyverse,gprofiler2, scico)

Background

From the previous exercises on differential gene expression, we looked at some comparisons that resulted in >300 genes that were over or under expressed. It is not trivial to make sense of the biological significance of that many genes. One way to try to distill this large list of genes into biological functions is to perform a functional enrichment analysis. Essentially, we want to use curated lists of genes that belong to a specific functional category and test whether some of these are over represented in our case. Perhaps the most widely used knowledge base of this kind, is the Gene Ontology, but there are others, such as the Kyoto Encyclopedia of Genes and Genomes and the Reactome. This is also called a Functional Enrichment or Over Representation Analysis

Let's say we are interested in a single biological process: melanin biosynthesis. Conveniently, a GO term for this exists: GO:0042438, as well as a Xenopus-specific Reactome pathway R-XTR-5662702. Essentially, what we want to do is see how many of our differentially expressed genes are either part of this term or not. This would result in a contingency table like this:

	Melanin biosynthesis	Not melanin biosynthesis
Differentially expressed	a	b
Not differentially expressed	c	d

We can now perform a Fisher's exact test to get a p-value that would tell us whether genes associated with melanin biosynthesis are over represented in our list of differentially expressed genes. You can imagine however that we don't just want to test this single functional group, but many. We are therefore going to repeat this with all of the terms in the Gene Ontology for example. That many tests therefore requires a multiple-test correction, to get an adjusted p-value.

One final thing that we have to keep in mind is the background or universe. This refers to the total list of genes that we will compare your set of genes to. Imagine we compare our set of genes that we obtained from skin samples to a full list of genes from the whole genome. In this case, it would be highly likely that we will get a lot of enriched terms or functions related to processes occurring in the skin. While this is not wrong, it is also not informative, because what we are interested in, is in differences in enriched skin-specific functions between two conditions. We therefore ideally want to restrict our background set to only those genes that we realistically expect to find in our samples.

To perform such as test, we therefore need:

A list of genes/transcripts that are of biological interest (e.g. list of experimentally derived DEGs). These need to be annotated! That is, we need to know if these loci are genes with a known function (i.e. annotated).
A 'background' list of genes, that includes any gene that could have been differentially expressed in the analysis.
A curated database of biologically relevant categories (GO terms, gene pathways etc.), with associated genes.

Lets get to it!

Load data

Let's load the results files we need.

# the list of DEG results from the previous exercises
res<-readRDS("./results/deseq2_results.rds")

# the annotations
xtrop<-read_csv("data/PCU23_annotations_xtr105_genes.csv")

Prepare gene sets and background

Ideally, the annotations of your genes should come from experimental evidence from your organism. If you work with mice, Drosophila or humans for example, many functional enrichment analysis tools will be very easy to implement because they will automatically connect your lists of genes to annotations, background lists etc. This is unlikely going to exist if you work with non-model systems. We are therefore dependent on making our own annotations and lists.

For our annotations, we are using BLAST results from querying our genes against the proteome of Xenopus tropicalis. This is a well studied frog species. It is still only distantly related to our focal species, but this is the best we can do. As we saw in the previous exercise, this results in many genes not having any annotations. These will be excluded unfortunately.

Make gene sets

First we will use the same bit of code from the previous exercise to make a list of DEGs per comparison.

extract_degs<-function(x) {
  return(
    x %>%
      as_tibble(rownames = "gene") %>%
      filter(padj<0.05) %>%
      pull(gene)
  )
}

# now extract all
sig_deg<-lapply(res, FUN=extract_degs)

## Warning: package 'S4Vectors' was built under R version 4.3.2

## Warning: package 'GenomeInfoDb' was built under R version 4.3.2

## Warning: package 'SummarizedExperiment' was built under R version 4.3.2

str(sig_deg)

## List of 4
##  $ bD_bV: chr [1:395] "PECUL23A000242" "PECUL23A000289" "PECUL23A000301" "PECUL23A000483" ...
##  $ wD_wV: chr [1:370] "PECUL23A000301" "PECUL23A000532" "PECUL23A000547" "PECUL23A000601" ...
##  $ bV_wV: chr [1:98] "PECUL23A000301" "PECUL23A000532" "PECUL23A002321" "PECUL23A004402" ...
##  $ bD_wD: chr [1:73] "PECUL23A000532" "PECUL23A000872" "PECUL23A002033" "PECUL23A002231" ...

We now have to convert those to Xenopus tropicalis IDs, by extracting the matching annotations. Because we will do this multiple times over list itmes, it is cleaner to write a function first, then apply it to the list.

# make a function that extracts matching X. tropicalis IDs
extract_xtr<-function(x) {
  return(
      xtrop %>%
        filter(gene_id %in% x) %>%
        pull(xtr_pep_id_x) %>%
        unique()
  )
}

# apply function to list of Pelobates IDs
xtr_deg<-lapply(sig_deg, FUN=extract_xtr)
str(xtr_deg)

## List of 4
##  $ bD_bV: chr [1:377] "ENSXETP00000055200" "ENSXETP00000087506" "ENSXETP00000025683" "ENSXETP00000088560" ...
##  $ wD_wV: chr [1:355] "ENSXETP00000025683" "ENSXETP00000021288" "ENSXETP00000001532" "ENSXETP00000016336" ...
##  $ bV_wV: chr [1:92] "ENSXETP00000025683" "ENSXETP00000021288" "ENSXETP00000005849" "ENSXETP00000056779" ...
##  $ bD_wD: chr [1:71] "ENSXETP00000021288" "ENSXETP00000094013" "ENSXETP00000070786" "ENSXETP00000009333" ...

Question: Are the gene sets in the list of Xenopus genes the same length as the original DEG list? How and why are they different?

sapply(sig_deg, length)

## bD_bV wD_wV bV_wV bD_wD 
##   395   370    98    73

sapply(xtr_deg, length)

## bD_bV wD_wV bV_wV bD_wD 
##   377   355    92    71

The gene sets are smaller. This is because not every gene for our organism could be annotated.

Make background

There is some discussion about what makes a good background. Ideally, it should be the complete list of genes that could be differentially expressed. But what is this?

Question: A good background is:

All genes in the Xenopus tropicalis proteome

All genes that could be annotated in the Pelobates genome

All genes in the tissue-specific transcriptome

It is not always clear, but in this case, I would argue that the tissue-specific transcriptome is the closest right answer.

We will use the full set of genes that were returned by DESeq2. This set should have filtered out genes that have low counts (i.e. unlikely to be expressed across any of our tissues/conditions).

We can use the same function from earlier to convert our list of Pelobates IDs to Xenopus peptide IDs.

xtr_bg<-extract_xtr(rownames(res$bD_bV))
str(xtr_bg)

##  chr [1:14775] "ENSXETP00000061180" "ENSXETP00000042154;ENSXETP00000095923" ...

Functional Enrichment Analysis

We are now ready to go! There are a number of software and R packages that let you perform functional enrichment analysis. Here, we will use [g:Profiler])(https://biit.cs.ut.ee/gprofiler/), because it plays particularly well with R and with Ensembl gene/peptide annotations.

The analysis can be performed with a single command, even if our query is a list of multiple gene sets!

An important thing to remember is that the associated R package gprofiler2 is just an API, and the actual analysis will be performed on the g:Profiler server. This version is continuously updated, to match the updates Because of this, it is important to tell gprofiler which version of Ensembl you would like to use. Our annotations came from version 105, which is a slightly older version, so we have to make sure we set this correctly. We can do that by looking up the archives on the g:Profiler homepage.

# set base url:
set_base_url("https://biit.cs.ut.ee/gprofiler_archive3/e105_eg52_p16/")


# run the analysis
res_ora<-gost(multi_query = FALSE, # returns separate results tables for multiquery
              custom_bg = xtr_bg, # our background
              query=xtr_deg, # our list of gene sets
              organism="xtropicalis", # the organism our annotations belong to
              exclude_iea = FALSE, # include GO terms that were electronically assigned
              correction_method = "gSCS", # the recommended multiple testing correction.
              sources=c("GO:BP","GO:CC","GO:MF", "KEGG","REAC"), # the functional sets we are interested in 
              evcodes=FALSE, ## evcodes TRUE needed for downstream analysis like enrichment maps in Cytoscape, but this takes longer.
              significant= FALSE) # return all terms, not just the significant ones


# the results are stored as a "results" dataframe 
head(res_ora$result)

##   query significant      p_value term_size query_size intersection_size
## 1 bD_bV        TRUE 1.133556e-05      1336        326                75
## 2 bD_bV        TRUE 2.479204e-03      1256        326                66
## 3 bD_bV        TRUE 1.377473e-02       932        326                52
## 4 bD_bV        TRUE 4.277467e-02      1021        326                54
## 5 bD_bV       FALSE 7.381090e-02      1150        326                58
## 6 bD_bV       FALSE 1.121277e-01       656        326                39
##   precision     recall    term_id source                          term_name
## 1 0.2300613 0.05613772 GO:0032501  GO:BP   multicellular organismal process
## 2 0.2024540 0.05254777 GO:0032502  GO:BP              developmental process
## 3 0.1595092 0.05579399 GO:0048731  GO:BP                 system development
## 4 0.1656442 0.05288932 GO:0007275  GO:BP multicellular organism development
## 5 0.1779141 0.05043478 GO:0048856  GO:BP   anatomical structure development
## 6 0.1196319 0.05945122 GO:0030154  GO:BP               cell differentiation
##   effective_domain_size source_order      parents
## 1                 12184         8813   GO:0008150
## 2                 12184         8814   GO:0008150
## 3                 12184        15045 GO:00072....
## 4                 12184         3168 GO:00325....
## 5                 12184        15160   GO:0032502
## 6                 12184         7722   GO:0048869

Question: What do these resuls look like and how many functionally enriched terms are there for each gene set?:

The results returned tell us a few things:

colnames(res_ora$result)

##  [1] "query"                 "significant"           "p_value"              
##  [4] "term_size"             "query_size"            "intersection_size"    
##  [7] "precision"             "recall"                "term_id"              
## [10] "source"                "term_name"             "effective_domain_size"
## [13] "source_order"          "parents"

most importantly:

the term_id and associated term_name which are the functional terms from the sources we requested (GO, KEGG, Reactome).
the p_value, which is actually the adjusted p-value (misleading!), that the given term is enriched (over represented) in our query (gene set).
the term_size, query_size and intersect_size which tell you how many genes make up the given term, how many genes were in your query and how many genes from both are overlapping.

Question: How many terms have been significantly enriched for each of the comparisons?

We can use a p_value cutoff of 0.05 to see how many terms have been functionally enriched in each term.

res_ora$result %>%
  filter(p_value<0.05) %>%
  group_by(query) %>%
  dplyr::count(query, sort=TRUE)

## # A tibble: 4 × 2
## # Groups:   query [4]
##   query     n
##   <chr> <int>
## 1 bD_bV    17
## 2 wD_wV    14
## 3 bV_wV    12
## 4 bD_wD     7

We see that the dorsal-ventral comparisons have the most enriched terms (26 and 24) and the dorsal-dorsal comparison has the fewest (8) This is not surprising given the small number of genes in that last set.

gprofiler also has a few visualization tools as well. For example an interactive Manhattan-style plot:

gostplot(res_ora)

You should see that the Dorsal-Ventral comparisons look fairly similar. Let's take some time to explore this plot.

Question: What REACTOME pathway is enriched for all of the comparisons?

Melanin biosynthesis

Another useful plot to show enrichment results are dot plots. These are slightly more informative and make comparing across multiple sets more intuitive.

We can easily make our custom dot plot using the gprofiler results tables and ggplot.

res_ora$result %>%
  select(query,term_name, p_value, intersection_size, query_size,source) %>%
  filter(p_value<0.05) %>%
  mutate(GeneRatio=intersection_size/query_size) %>%
  arrange(GeneRatio) %>%
  mutate(term_name = factor(term_name, levels=unique(term_name))) %>%
  ggplot(aes(x=GeneRatio, y=term_name)) +
  geom_point(aes(color=p_value, size=intersection_size)) +
  ylab("") +
  scale_color_scico(palette = "batlow", direction = 1) +
  facet_grid(source~query,scales = "free_y",space = "free") +
  theme_bw()

Question: After having done the differential gene expression analysis and performed a functional enrichment analysis, what have we learned about the pigmentation plasticity of tadpoles?

We have learned for example that:

Melanin differential biosynthesis is relevant for all comparisons (even white ventral and black ventral?).
Dorsal-ventral differentiation involves other, non-pigment related processes such as "system development" and "extracellular matrix".

Concluding remarks

A functional enrichment analysis is useful when you have many differentially expressed genes and you would like to make biological sense of this.
These are 100% dependent on a) accurate annotations and b) curated functional databases. Both of these can be very problematic when working with non-model organisms.
The analysis also tends to be very sensitive to a number of factors, such as what the background set is, whether electronically annotations are included and what significance thresholds to use.