Prepare environment

A good first thing to do when running any R script, is to set your working environment and load any libraries you may need. A handy tool for loading packages is pacman. This package conveniently loads libraries, installs libraries that are not installed, including bioconductor packages!

# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")

# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")

# use pacman to load libraries
pacman::p_load(tidyverse, DESeq2, pheatmap)

Background

Now that we are all ready to go, let's import some data. As a prequel to this exercise, I have already explained the biological question(s) we are trying to answer (in brief, how is pigmentation plasticity controlled in Spadefoot toad tadpoles?). Here, we will jump in around about the middle of a typical RNAseq workflow. We have already done the following:

  1. Performed the experiment
  2. Extracted the RNA
  3. Sequenced the RNA
  4. Cleaned/trimmed sequences
  5. Quantified sequences by mapping reads from each library onto our reference genome with salmon
  6. Combined counts from all libraries into a single table using tximport. During this step, we have also calculated 'per-gene' counts. I.e. counts of multiple transcripts per gene have been summarized so that we are left with single estimates per gene.
  7. Minor data filtering to remove mtDNA, non-coding DNA and other irrelevant target sequences.

Our starting point will therefore be the count data for our RNAseq experiment prepared using tximport. In this exercise we will do the following:

  1. Load the RNAseq count data quantified with salmon and explore it.
  2. Visualize count data of biological replicates
  3. Visualize count data of most variable genes

Load data

Lets load the count data:

txi<-readRDS("./data/salmon_gene_counts.rds")

For those of you unfamiliar, .rds files is a single R data object, created with saveRDS(). It is particularly useful for saving R objects that are not tabular, or more than 2-dimensional (e.g. lists, functions etc.).

Exploring the loaded data

Question: What is the class and structure of this data object?

class(txi)
str(txi)
lapply(X=txi, FUN=head)
  • It is a list with 4 entries, three arrays/matrices (abundamce, counts, length) and 1 character vector (countsFromAbundance).

Question: What is the content of this object and how is it arranged?

  • All three arrays have target transcripts as rows and biological samples per column.
  • Counts: estimate of the number of reads mapping to each transcript.
  • Abundance: Raw counts cannot be compared across samples because each library may vary slightly in terms of the total number of reads, differences in sequencing bias and difference in transcript lengths. Salmon (the program we used to quantify reads) also produces an array of "Abundances" which are normalized counts. According to the salmon documention, this is Transcripts Per Million (TPM). This bastically means that per sample, the total counts add up to 1 million. We could check this:
apply(X=txi$abundance, FUN=sum, MARGIN = 2)

In this case, they don't quite add up to 1 million because I have already filtered this matrix, to remove non-coding and mitochondrial DNA.

  • length: effective length of the target transcript.
  • countsFromAbundance: a character vector indicating whether counts were taken directly from the quantifier (salmon) or whether they have been calculated from the abundances by tximport. Default is no (counts are from salmon).

It is always a good idea to look at the distribution of data:

# Log10 transform and plot with base R:
hist(log10(txi$abundance[,1]), breaks=50)

# a more detailed plot with the tidyverse:
txi$abundance %>%
  as_tibble(rownames = "transcript") %>%
  pivot_longer(-transcript, names_to="sample", values_to="TPM") %>%
  ggplot(aes(x=TPM)) +
  geom_histogram() +
  ylab("number of transcripts") +
  scale_x_log10() +
  facet_wrap(~sample)

We can see that the log abundance is fairly normally distributed. This is a good sign. Many 0 counts would indicate that you have targets that are heavily depleted. For example, if I would have included non-coding target sequences, we would probably see a peak of low counts, because these tend to be very depleted in RNAseq data.

Visualize your count data: PCAs

Next, we may be curious as to whether there are some basic patterns in our data. For instance, we would expect biological replicates to cluster together.

Let's load our design matrix. This is a .csv file where we have included some information on what these samples actually are:

samples<-read.csv("./data/design_matrix.csv")

Take a closer look at this table.

samples

Question: How many tissues and how many treatments are there?

samples %>%
  select(treatment, tissue) %>%
  ftable()
  • There are two tissues (liver and skin) and two treatments (black and white). The treatments refer to the background colour that we raised the tadpoles in.
  • We also have a "side" variable for each of the skin samples to tell us if it is from the dorsum or ventrum. Technically we therefore have 3 samples per individual.

Before we go any further, it makes sense to check if our biological replicates are grouping nicely. We could do this by performing a PCA on the normalized counts (the TPM/abundance) and distinguish them using the information we have from our samples file that lays out the experimental design. However, a PCA works best on homoskedastic data; this means that the variance of an observed quantity (here, the expression strength of a gene) does not depend on the mean. This is definitely an assumption we are violating with RNAseq data, so if we don't transform our data, then the PCA is going to be strongly influenced by a few highly expressed genes. All of this requires a few steps:

  1. Filter out any genes that are non-variable. i.e. genes whose expression is the same across all biological samples.
  2. Transform the counts variable to imrpove heteroskedasticity. Here we will use a variance stabilizing transformation
  3. Transpose the matrix so that we have samples as the rows and genes as the columns (we want to treat each gene as if it is a variable in our multivariate analysis).
  4. Perform PCA on scaled and centered data.
  5. Plot the first two components, labeling points with relevant biological information.
# perform VS transformation (the vst() function only likes integers!)
vst_counts<- txi$counts %>%
  as.data.frame() %>%
  mutate_all(as.integer) %>%
  as.matrix() %>%
  vst()
  
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts<-vst_counts[apply(vst_counts, 1, var) != 0,]

# perform PCA on TRANSPOSED scaled, centered data
pca<- prcomp(t(vst_counts),scale.=T, center=T)

## add metadata and plot
pca$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(samples) %>% # add the experimental design information
  ggplot(aes(x=PC1, y=PC2, color=treatment, shape=tissue)) +
  labs(x=paste0("PC1 (", summary(pca)$importance["Proportion of Variance",1]*100, "%)"),
       y=paste("PC2 (", summary(pca)$importance["Proportion of Variance",2]*100, "%)")) + 
  geom_point(size=3)

Question: Can we see any patterns in the PCA biplot?

  • Most of the variance is explained by tissue type (separation along PC1)
  • Treatment effect is not so clear, but some separation along PC2.

Question: What about patterns within tissues? Try running this same code, but only for the liver samples.

Notice that sample with "Li" are liver samples. perhaps drop the remaining columns first?

# perform VS transformation (the vst() function only likes integers!)
vst_counts_liver<- txi$counts %>%
  as.data.frame() %>%
  select(contains("Li")) %>%
  mutate_all(as.integer) %>%
  as.matrix() %>%
  vst()
  
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts_liver<-vst_counts_liver[apply(vst_counts_liver, 1, var) != 0,]

# perfomr PCA on TRANSPOSED scaled, centred data
pca_liver<- prcomp(t(vst_counts_liver),scale.=T, center=T)

## add metadata and plot
pca_liver$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(samples) %>%
  ggplot(aes(x=PC1, y=PC2, color=treatment)) +
  geom_point(size=3)
  • Although this is not a textbook example (welcome to the world of real-life data!), we do see that there is a bit of separation in the treatments via a diagonal between PC1 and PC2.

Final comments:

Extras - Visualize your count data with heatmaps

Another common exploratory visualization technique are heatmaps. Surely you have seen them in many publications. The PCA is useful for visualizing the clustering of samples. Heatmaps can do the same, but they can add one extra layer of information: the clustering of genes. Let's look at two examples:

Heatmaps by samples

Like the PCA, we might want to compare the distance between samples. To do this, we have to manually calculate this distances first, and then use those distances to inform the heatmap.

# get sample-to-sample distance
sample_dist <- dist(t(vst_counts))

# convert to matrix
sample_dist_matrix <- as.matrix(sample_dist)

# plot
pheatmap(sample_dist_matrix,
         annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")],
                                   row.names = "sample_id"),
         annotation_row=data.frame(samples[,c("sample_id","tissue","treatment","side")],
                                   row.names = "sample_id"))

This shows a similar picture. There is strong clustering with tissues, and then a little bit of clustering for treatments within tissues, but the tissue-effect is consuming much of the signal.

Heatmpas by genes

Above we are just showing the distance between samples, but you may be interested whether there are specific genes/transcripts who's expression is clustering. because there are thousands of genes, we may want to only focus on genes that are particularly variable in their expression across the samples.

# Sort the rows in vst matrix by the row-wise variance, and keep only the 500 most variable genes

var_genes <- apply(vst_counts,MAR=1, FUN=var) %>%
  enframe() %>%
  arrange(desc(value)) %>%
  slice_head(n=500) %>%
  pull(name)

head(var_genes) # list of genes we want to keep

vst_subset<-vst_counts[var_genes, ]

# check to see we really only have 500 genes
dim(vst_subset)

# plot heatmap (this time scaled by row)
pheatmap(vst_subset,
         scale = "row",
         cluster_rows=T, show_rownames=FALSE,
         cluster_cols=T, show_colnames = T,
         annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")], row.names = "sample_id"))

Question: Do we learn anything new from this heatmap?

  • We again see that most of the difference in gene expression is between liver and skin tissue.
  • The 8Li sample seems to have a very unique set of genes that are only over-expressed in that sample. It might be worth checking that sample more closely.