Aplicaciones y Discusiones en Desarrollo Animal - Taller 2
A good first thing to do when running any R script, is to set your working environment and load any libraries you may need. A handy tool for loading packages is pacman. This package conveniently loads libraries, installs libraries that are not installed, including bioconductor packages!
# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")
# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")
# use pacman to load libraries
pacman::p_load(tidyverse, DESeq2, pheatmap)
Now that we are all ready to go, let's import some data. As a prequel to this exercise, I have already explained the biological question(s) we are trying to answer (in brief, how is pigmentation plasticity controlled in Spadefoot toad tadpoles?). Here, we will jump in around about the middle of a typical RNAseq workflow. We have already done the following:
Our starting point will therefore be the count data for our RNAseq experiment prepared using tximport. In this exercise we will do the following:
Lets load the count data:
txi<-readRDS("./data/salmon_gene_counts.rds")
For those of you unfamiliar, .rds
files is a single R
data object, created with saveRDS()
. It is particularly
useful for saving R objects that are not tabular, or more than
2-dimensional (e.g. lists, functions etc.).
Question: What is the class and structure of this data object?
class(txi)
## [1] "list"
str(txi)
## List of 4
## $ abundance : num [1:32531, 1:24] 9.26 0 0 0 20.97 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
## .. ..$ : chr [1:24] "10D" "10Li" "10V" "11D" ...
## $ counts : num [1:32531, 1:24] 799 0 0 0 1239 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
## .. ..$ : chr [1:24] "10D" "10Li" "10V" "11D" ...
## $ length : num [1:32531, 1:24] 6201.2 71.4 547.6 1019.4 4246.7 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
## .. ..$ : chr [1:24] "10D" "10Li" "10V" "11D" ...
## $ countsFromAbundance: chr "no"
lapply(X=txi, FUN=head)
## $abundance
## 10D 10Li 10V 11D 11Li 11V
## PECUL23A000004 9.258514 10.21374 10.84399 2.008093 8.833516 2.847134
## PECUL23A000005 0.000000 0.00000 0.00000 0.000000 0.000000 0.000000
## PECUL23A000006 0.000000 0.00000 0.00000 0.000000 0.000000 0.000000
## PECUL23A000008 0.000000 0.00000 0.00000 0.000000 0.000000 0.000000
## PECUL23A000011 20.974082 11.27196 21.07977 13.069257 10.902891 12.598137
## PECUL23A000012 0.000000 0.00000 0.00000 0.000000 0.000000 0.000000
## 12D 12Li 12V 21D 21Li 21V
## PECUL23A000004 6.285256 11.15112 4.600014 10.00756 5.647655 6.463992
## PECUL23A000005 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000
## PECUL23A000006 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000
## PECUL23A000008 0.000000 0.00000 0.000000 0.00000 1.272369 0.076718
## PECUL23A000011 13.695497 11.25386 13.393722 17.31247 11.046205 15.845097
## PECUL23A000012 0.000000 0.00000 0.000000 0.00000 0.000000 0.000000
## 22D 22Li 22V 7D 7Li 7V
## PECUL23A000004 7.079812 12.87613 11.03426 3.172516 1.457532 2.49415
## PECUL23A000005 0.000000 0.00000 0.00000 0.000000 0.000000 0.00000
## PECUL23A000006 0.000000 0.00000 0.00000 0.000000 0.000000 0.00000
## PECUL23A000008 0.000000 0.00000 0.00000 0.000000 0.000000 0.00000
## PECUL23A000011 18.138504 12.56544 17.05364 13.646505 10.907281 12.58050
## PECUL23A000012 0.000000 0.00000 0.00000 0.000000 0.000000 0.00000
## 8D 8Li 8V 9D 9Li 9V
## PECUL23A000004 1.846965 3.208272 3.275358 6.754852 11.33774 10.01473
## PECUL23A000005 0.000000 0.000000 0.000000 0.000000 0.00000 0.00000
## PECUL23A000006 0.000000 0.000000 0.000000 0.000000 0.00000 0.00000
## PECUL23A000008 0.000000 0.000000 0.000000 0.000000 0.00000 0.00000
## PECUL23A000011 11.662266 9.852510 12.011888 15.113053 10.92103 15.01804
## PECUL23A000012 0.000000 0.000000 0.000000 0.000000 0.00000 0.00000
##
## $counts
## 10D 10Li 10V 11D 11Li 11V 12D
## PECUL23A000004 798.556 890.812 950.925 209.388 840.261 313.572 588.464
## PECUL23A000005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000006 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000008 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000011 1238.869 642.998 1267.971 898.400 672.733 933.099 883.169
## PECUL23A000012 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 12Li 12V 21D 21Li 21V 22D 22Li
## PECUL23A000004 981.773 427.419 842.375 549.566 503.027 675.112 1104.126
## PECUL23A000005 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000006 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000008 0.000 0.000 0.000 19.000 1.000 0.000 0.000
## PECUL23A000011 659.733 844.625 1005.042 688.409 858.488 1125.317 721.587
## PECUL23A000012 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 22V 7D 7Li 7V 8D 8Li 8V 9D
## PECUL23A000004 988.542 299.200 144.040 222.206 172.249 302.444 303.135 681.824
## PECUL23A000005 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000006 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000008 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## PECUL23A000011 1044.139 879.316 694.686 728.819 724.642 605.159 713.794 970.492
## PECUL23A000012 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## 9Li 9V
## PECUL23A000004 1006.008 945.262
## PECUL23A000005 0.000 0.000
## PECUL23A000006 0.000 0.000
## PECUL23A000008 0.000 0.000
## PECUL23A000011 640.087 978.380
## PECUL23A000012 0.000 0.000
##
## $length
## 10D 10Li 10V 11D 11Li
## PECUL23A000004 6201.16983 6364.53488 6199.02702 6438.33500 6471.52674
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4246.70282 4162.70479 4252.16543 4244.47366 4197.84742
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650 447.71650
## 11V 12D 12Li 12V 21D
## PECUL23A000004 6315.82807 6168.57244 6370.40945 6251.34080 6149.42485
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4247.39914 4248.68448 4241.71429 4242.69120 4241.13665
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650 447.71650
## 21Li 21V 22D 22Li 22V
## PECUL23A000004 6627.37203 6100.23764 6360.54714 6359.47494 6205.29626
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1017.02200 1021.78100 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4244.46540 4247.13092 4138.22686 4258.91322 4240.83801
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650 447.71650
## 7D 7Li 7V 8D 8Li
## PECUL23A000004 6209.34168 6592.86848 6537.26970 6379.89021 6451.48933
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4242.39630 4248.94036 4250.94094 4250.63982 4203.48454
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650 447.71650
## 8V 9D 9Li 9V
## PECUL23A000004 6485.19773 6555.93984 6423.42596 6148.17008
## PECUL23A000005 71.39554 71.39554 71.39554 71.39554
## PECUL23A000006 547.63317 547.63317 547.63317 547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4163.96652 4170.79175 4242.94916 4243.53463
## PECUL23A000012 447.71650 447.71650 447.71650 447.71650
##
## $countsFromAbundance
## [1] "no"
abundamce
, counts
, length
) and 1
character vector (countsFromAbundance
).Question: What is the content of this object and how is it arranged?
apply(X=txi$abundance, FUN=sum, MARGIN = 2)
## 10D 10Li 10V 11D 11Li 11V 12D 12Li
## 932355.7 932241.7 938520.5 945823.7 934723.7 940997.4 936842.9 930242.2
## 12V 21D 21Li 21V 22D 22Li 22V 7D
## 935530.5 930301.0 935730.8 934942.5 930828.3 916143.5 931353.4 933611.3
## 7Li 7V 8D 8Li 8V 9D 9Li 9V
## 937777.1 925431.3 931110.8 936225.3 939398.7 939854.7 918715.2 935852.6
In this case, they don't quite add up to 1 million because I have already filtered this matrix, to remove non-coding and mitochondrial DNA.
tximport
. Default is
no (counts are from salmon).It is always a good idea to look at the distribution of data:
# Log10 transform and plot with base R:
hist(log10(txi$abundance[,1]), breaks=50)
# a more detailed plot with the tidyverse:
txi$abundance %>%
as_tibble(rownames = "transcript") %>%
pivot_longer(-transcript, names_to="sample", values_to="TPM") %>%
ggplot(aes(x=TPM)) +
geom_histogram() +
ylab("number of transcripts") +
scale_x_log10() +
facet_wrap(~sample)
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 366214 rows containing non-finite values (`stat_bin()`).
We can see that the log abundance is fairly normally distributed. This is a good sign. Many 0 counts would indicate that you have targets that are heavily depleted. For example, if I would have included non-coding target sequences, we would probably see a peak of low counts, because these tend to be very depleted in RNAseq data.
Next, we may be curious as to whether there are some basic patterns in our data. For instance, we would expect biological replicates to cluster together.
Let's load our design matrix. This is a .csv file where we have included some information on what these samples actually are:
samples<-read.csv("./data/design_matrix.csv")
Take a closer look at this table.
samples
## sample_id treatment tissue side
## 1 10Li black liver
## 2 11Li black liver
## 3 12Li black liver
## 4 22Li black liver
## 5 21Li white liver
## 6 7Li white liver
## 7 8Li white liver
## 8 9Li white liver
## 9 10D black skin dorsal
## 10 11D black skin dorsal
## 11 12D black skin dorsal
## 12 22D black skin dorsal
## 13 10V black skin ventral
## 14 11V black skin ventral
## 15 12V black skin ventral
## 16 22V black skin ventral
## 17 21D white skin dorsal
## 18 7D white skin dorsal
## 19 8D white skin dorsal
## 20 9D white skin dorsal
## 21 21V white skin ventral
## 22 7V white skin ventral
## 23 8V white skin ventral
## 24 9V white skin ventral
Question: How many tissues and how many treatments are there?
samples %>%
select(treatment, tissue) %>%
ftable()
## tissue liver skin
## treatment
## black 4 8
## white 4 8
Before we go any further, it makes sense to check if our biological
replicates are grouping nicely. We could do this by performing a PCA on
the normalized counts (the TPM/abundance) and distinguish them using the
information we have from our samples
file that lays out the
experimental design. However, a PCA works best on homoskedastic data;
this means that the variance of an observed quantity (here, the
expression strength of a gene) does not depend on the mean. This is
definitely an assumption we are violating with RNAseq data, so if we
don't transform our data, then the PCA is going to be strongly
influenced by a few highly expressed genes. All of this requires a few
steps:
# perform VS transformation (the vst() function only likes integers!)
vst_counts<- txi$counts %>%
as.data.frame() %>%
mutate_all(as.integer) %>%
as.matrix() %>%
vst()
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts<-vst_counts[apply(vst_counts, 1, var) != 0,]
# perform PCA on TRANSPOSED scaled, centered data
pca<- prcomp(t(vst_counts),scale.=T, center=T)
## add metadata and plot
pca$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(samples) %>% # add the experimental design information
ggplot(aes(x=PC1, y=PC2, color=treatment, shape=tissue)) +
labs(x=paste0("PC1 (", summary(pca)$importance["Proportion of Variance",1]*100, "%)"),
y=paste("PC2 (", summary(pca)$importance["Proportion of Variance",2]*100, "%)")) +
geom_point(size=3)
Question: Can we see any patterns in the PCA biplot?
Question: What about patterns within tissues? Try running this same code, but only for the liver samples.
Notice that sample with "Li" are liver samples. perhaps drop the remaining columns first?
# perform VS transformation (the vst() function only likes integers!)
vst_counts_liver<- txi$counts %>%
as.data.frame() %>%
select(contains("Li")) %>%
mutate_all(as.integer) %>%
as.matrix() %>%
vst()
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts_liver<-vst_counts_liver[apply(vst_counts_liver, 1, var) != 0,]
# perfomr PCA on TRANSPOSED scaled, centred data
pca_liver<- prcomp(t(vst_counts_liver),scale.=T, center=T)
## add metadata and plot
pca_liver$x %>%
as_tibble(rownames = "sample_id") %>%
left_join(samples) %>%
ggplot(aes(x=PC1, y=PC2, color=treatment)) +
geom_point(size=3)
Computational Genomics with R
book, especially the
chapter on RNA-Seq
by Bora Uyar. Galaxy also has plenty of interesting documentation such
as this
one.Another common exploratory visualization technique are heatmaps. Surely you have seen them in many publications. The PCA is useful for visualizing the clustering of samples. Heatmaps can do the same, but they can add one extra layer of information: the clustering of genes. Let's look at two examples:
Like the PCA, we might want to compare the distance between samples. To do this, we have to manually calculate this distances first, and then use those distances to inform the heatmap.
# get sample-to-sample distance
sample_dist <- dist(t(vst_counts))
# convert to matrix
sample_dist_matrix <- as.matrix(sample_dist)
# plot
pheatmap(sample_dist_matrix,
annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")],
row.names = "sample_id"),
annotation_row=data.frame(samples[,c("sample_id","tissue","treatment","side")],
row.names = "sample_id"))
This shows a similar picture. There is strong clustering with tissues, and then a little bit of clustering for treatments within tissues, but the tissue-effect is consuming much of the signal.
Above we are just showing the distance between samples, but you may be interested whether there are specific genes/transcripts who's expression is clustering. because there are thousands of genes, we may want to only focus on genes that are particularly variable in their expression across the samples.
# Sort the rows in vst matrix by the row-wise variance, and keep only the 500 most variable genes
var_genes <- apply(vst_counts,MAR=1, FUN=var) %>%
enframe() %>%
arrange(desc(value)) %>%
slice_head(n=500) %>%
pull(name)
head(var_genes) # list of genes we want to keep
## [1] "PECUL23A041600" "PECUL23A024292" "PECUL23A039371" "PECUL23A023059"
## [5] "PECUL23A000586" "PECUL23A017537"
vst_subset<-vst_counts[var_genes, ]
# check to see we really only have 500 genes
dim(vst_subset)
## [1] 500 24
# plot heatmap (this time scaled by row)
pheatmap(vst_subset,
scale = "row",
cluster_rows=T, show_rownames=FALSE,
cluster_cols=T, show_colnames = T,
annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")], row.names = "sample_id"))
Question: Do we learn anything new from this heatmap?