Aplicaciones y Discusiones en Desarrollo Animal - Taller 2
Set your working directory and load the libraries we need.
# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")
# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")
# use pacman to load libraries
pacman::p_load(tidyverse,gprofiler2, scico)
From the previous exercises on differential gene expression, we
looked at some comparisons that resulted in >300 genes that were over
or under expressed. It is not trivial to make sense of the biological
significance of that many genes. One way to try to distill this large
list of genes into biological functions is to perform a functional
enrichment analysis. Essentially, we want to use curated lists of genes
that belong to a specific functional category and test whether some of
these are over represented in our case. Perhaps the most widely used
knowledge base of this kind, is the Gene Ontology, but there are others,
such as the Kyoto Encyclopedia of
Genes and Genomes and the Reactome. This is also called a
Functional Enrichment
or
Over Representation Analysis
Let's say we are interested in a single biological process: melanin biosynthesis. Conveniently, a GO term for this exists: GO:0042438, as well as a Xenopus-specific Reactome pathway R-XTR-5662702. Essentially, what we want to do is see how many of our differentially expressed genes are either part of this term or not. This would result in a contingency table like this:
Melanin biosynthesis | Not melanin biosynthesis | |
---|---|---|
Differentially expressed | a | b |
Not differentially expressed | c | d |
We can now perform a Fisher's exact test to get a p-value that would tell us whether genes associated with melanin biosynthesis are over represented in our list of differentially expressed genes. You can imagine however that we don't just want to test this single functional group, but many. We are therefore going to repeat this with all of the terms in the Gene Ontology for example. That many tests therefore requires a multiple-test correction, to get an adjusted p-value.
One final thing that we have to keep in mind is the background or universe. This refers to the total list of genes that we will compare your set of genes to. Imagine we compare our set of genes that we obtained from skin samples to a full list of genes from the whole genome. In this case, it would be highly likely that we will get a lot of enriched terms or functions related to processes occurring in the skin. While this is not wrong, it is also not informative, because what we are interested in, is in differences in enriched skin-specific functions between two conditions. We therefore ideally want to restrict our background set to only those genes that we realistically expect to find in our samples.
To perform such as test, we therefore need:
Lets get to it!
Let's load the results files we need.
# the list of DEG results from the previous exercises
res<-readRDS("./results/deseq2_results.rds")
# the annotations
xtrop<-read_csv("data/PCU23_annotations_xtr105_genes.csv")
Ideally, the annotations of your genes should come from experimental evidence from your organism. If you work with mice, Drosophila or humans for example, many functional enrichment analysis tools will be very easy to implement because they will automatically connect your lists of genes to annotations, background lists etc. This is unlikely going to exist if you work with non-model systems. We are therefore dependent on making our own annotations and lists.
For our annotations, we are using BLAST results from querying our genes against the proteome of Xenopus tropicalis. This is a well studied frog species. It is still only distantly related to our focal species, but this is the best we can do. As we saw in the previous exercise, this results in many genes not having any annotations. These will be excluded unfortunately.
First we will use the same bit of code from the previous exercise to make a list of DEGs per comparison.
extract_degs<-function(x) {
return(
x %>%
as_tibble(rownames = "gene") %>%
filter(padj<0.05) %>%
pull(gene)
)
}
# now extract all
sig_deg<-lapply(res, FUN=extract_degs)
## Warning: package 'S4Vectors' was built under R version 4.3.2
## Warning: package 'GenomeInfoDb' was built under R version 4.3.2
## Warning: package 'SummarizedExperiment' was built under R version 4.3.2
str(sig_deg)
## List of 4
## $ bD_bV: chr [1:395] "PECUL23A000242" "PECUL23A000289" "PECUL23A000301" "PECUL23A000483" ...
## $ wD_wV: chr [1:370] "PECUL23A000301" "PECUL23A000532" "PECUL23A000547" "PECUL23A000601" ...
## $ bV_wV: chr [1:98] "PECUL23A000301" "PECUL23A000532" "PECUL23A002321" "PECUL23A004402" ...
## $ bD_wD: chr [1:73] "PECUL23A000532" "PECUL23A000872" "PECUL23A002033" "PECUL23A002231" ...
We now have to convert
those to Xenopus
tropicalis IDs, by extracting the matching annotations. Because we
will do this multiple times over list itmes, it is cleaner to write a
function first, then apply it to the list.
# make a function that extracts matching X. tropicalis IDs
extract_xtr<-function(x) {
return(
xtrop %>%
filter(gene_id %in% x) %>%
pull(xtr_pep_id_x) %>%
unique()
)
}
# apply function to list of Pelobates IDs
xtr_deg<-lapply(sig_deg, FUN=extract_xtr)
str(xtr_deg)
## List of 4
## $ bD_bV: chr [1:377] "ENSXETP00000055200" "ENSXETP00000087506" "ENSXETP00000025683" "ENSXETP00000088560" ...
## $ wD_wV: chr [1:355] "ENSXETP00000025683" "ENSXETP00000021288" "ENSXETP00000001532" "ENSXETP00000016336" ...
## $ bV_wV: chr [1:92] "ENSXETP00000025683" "ENSXETP00000021288" "ENSXETP00000005849" "ENSXETP00000056779" ...
## $ bD_wD: chr [1:71] "ENSXETP00000021288" "ENSXETP00000094013" "ENSXETP00000070786" "ENSXETP00000009333" ...
Question: Are the gene sets in the list of Xenopus genes the same length as the original DEG list? How and why are they different?
sapply(sig_deg, length)
## bD_bV wD_wV bV_wV bD_wD
## 395 370 98 73
sapply(xtr_deg, length)
## bD_bV wD_wV bV_wV bD_wD
## 377 355 92 71
The gene sets are smaller. This is because not every gene for our organism could be annotated.
There is some discussion about what makes a good background. Ideally, it should be the complete list of genes that could be differentially expressed. But what is this?
Question: A good background is:
- All genes in the Xenopus tropicalis proteome
- All genes that could be annotated in the Pelobates genome
- All genes in the tissue-specific transcriptome
It is not always clear, but in this case, I would argue that the tissue-specific transcriptome is the closest right answer.
We will use the full set of genes that were returned by
DESeq2
. This set should have filtered out genes that have
low counts (i.e. unlikely to be expressed across any of our
tissues/conditions).
We can use the same function from earlier to convert our list of Pelobates IDs to Xenopus peptide IDs.
xtr_bg<-extract_xtr(rownames(res$bD_bV))
str(xtr_bg)
## chr [1:14775] "ENSXETP00000061180" "ENSXETP00000042154;ENSXETP00000095923" ...
We are now ready to go! There are a number of software and R packages that let you perform functional enrichment analysis. Here, we will use [g:Profiler])(https://biit.cs.ut.ee/gprofiler/), because it plays particularly well with R and with Ensembl gene/peptide annotations.
The analysis can be performed with a single command, even if our query is a list of multiple gene sets!
An important thing to remember is that the associated R package
gprofiler2
is just an API, and the actual analysis will be
performed on the g:Profiler server. This version is continuously
updated, to match the updates Because of this, it is important to tell
gprofiler which version of Ensembl you would like to use. Our
annotations came from version 105, which is a slightly older version, so
we have to make sure we set this correctly. We can do that by looking up
the archives
on the g:Profiler homepage.
# set base url:
set_base_url("https://biit.cs.ut.ee/gprofiler_archive3/e105_eg52_p16/")
# run the analysis
res_ora<-gost(multi_query = FALSE, # returns separate results tables for multiquery
custom_bg = xtr_bg, # our background
query=xtr_deg, # our list of gene sets
organism="xtropicalis", # the organism our annotations belong to
exclude_iea = FALSE, # include GO terms that were electronically assigned
correction_method = "gSCS", # the recommended multiple testing correction.
sources=c("GO:BP","GO:CC","GO:MF", "KEGG","REAC"), # the functional sets we are interested in
evcodes=FALSE, ## evcodes TRUE needed for downstream analysis like enrichment maps in Cytoscape, but this takes longer.
significant= FALSE) # return all terms, not just the significant ones
# the results are stored as a "results" dataframe
head(res_ora$result)
## query significant p_value term_size query_size intersection_size
## 1 bD_bV TRUE 1.133556e-05 1336 326 75
## 2 bD_bV TRUE 2.479204e-03 1256 326 66
## 3 bD_bV TRUE 1.377473e-02 932 326 52
## 4 bD_bV TRUE 4.277467e-02 1021 326 54
## 5 bD_bV FALSE 7.381090e-02 1150 326 58
## 6 bD_bV FALSE 1.121277e-01 656 326 39
## precision recall term_id source term_name
## 1 0.2300613 0.05613772 GO:0032501 GO:BP multicellular organismal process
## 2 0.2024540 0.05254777 GO:0032502 GO:BP developmental process
## 3 0.1595092 0.05579399 GO:0048731 GO:BP system development
## 4 0.1656442 0.05288932 GO:0007275 GO:BP multicellular organism development
## 5 0.1779141 0.05043478 GO:0048856 GO:BP anatomical structure development
## 6 0.1196319 0.05945122 GO:0030154 GO:BP cell differentiation
## effective_domain_size source_order parents
## 1 12184 8813 GO:0008150
## 2 12184 8814 GO:0008150
## 3 12184 15045 GO:00072....
## 4 12184 3168 GO:00325....
## 5 12184 15160 GO:0032502
## 6 12184 7722 GO:0048869
Question: What do these resuls look like and how many functionally enriched terms are there for each gene set?:
The results returned tell us a few things:
colnames(res_ora$result)
## [1] "query" "significant" "p_value"
## [4] "term_size" "query_size" "intersection_size"
## [7] "precision" "recall" "term_id"
## [10] "source" "term_name" "effective_domain_size"
## [13] "source_order" "parents"
most importantly:
term_id
and associated term_name
which
are the functional terms from the sources we requested (GO, KEGG,
Reactome).p_value
, which is actually the adjusted p-value
(misleading!), that the given term is enriched (over represented) in our
query (gene set).term_size
, query_size
and
intersect_size
which tell you how many genes make up the
given term, how many genes were in your query and how many genes from
both are overlapping.Question: How many terms have been significantly enriched for each of the comparisons?
We can use a p_value cutoff of 0.05 to see how many terms have been functionally enriched in each term.
res_ora$result %>%
filter(p_value<0.05) %>%
group_by(query) %>%
dplyr::count(query, sort=TRUE)
## # A tibble: 4 × 2
## # Groups: query [4]
## query n
## <chr> <int>
## 1 bD_bV 17
## 2 wD_wV 14
## 3 bV_wV 12
## 4 bD_wD 7
We see that the dorsal-ventral comparisons have the most enriched
terms (26
and 24
) and the dorsal-dorsal
comparison has the fewest (8
) This is not surprising given
the small number of genes in that last set.
gprofiler also has a few visualization tools as well. For example an interactive Manhattan-style plot:
gostplot(res_ora)
You should see that the Dorsal-Ventral comparisons look fairly similar. Let's take some time to explore this plot.
Question: What REACTOME pathway is enriched for all of the comparisons?
Another useful plot to show enrichment results are dot plots. These are slightly more informative and make comparing across multiple sets more intuitive.
We can easily make our custom dot plot using the gprofiler results tables and ggplot.
res_ora$result %>%
select(query,term_name, p_value, intersection_size, query_size,source) %>%
filter(p_value<0.05) %>%
mutate(GeneRatio=intersection_size/query_size) %>%
arrange(GeneRatio) %>%
mutate(term_name = factor(term_name, levels=unique(term_name))) %>%
ggplot(aes(x=GeneRatio, y=term_name)) +
geom_point(aes(color=p_value, size=intersection_size)) +
ylab("") +
scale_color_scico(palette = "batlow", direction = 1) +
facet_grid(source~query,scales = "free_y",space = "free") +
theme_bw()
Question: After having done the differential gene expression analysis and performed a functional enrichment analysis, what have we learned about the pigmentation plasticity of tadpoles?
We have learned for example that: