Exploring RNA-Seq Count Data

Prepare environment

A good first thing to do when running any R script, is to set your working environment and load any libraries you may need. A handy tool for loading packages is pacman. This package conveniently loads libraries, installs libraries that are not installed, including bioconductor packages!

# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")

# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")

# use pacman to load libraries
pacman::p_load(tidyverse, DESeq2, pheatmap)

Background

Now that we are all ready to go, let's import some data. As a prequel to this exercise, I have already explained the biological question(s) we are trying to answer (in brief, how is pigmentation plasticity controlled in Spadefoot toad tadpoles?). Here, we will jump in around about the middle of a typical RNAseq workflow. We have already done the following:

Performed the experiment
Extracted the RNA
Sequenced the RNA
Cleaned/trimmed sequences
Quantified sequences by mapping reads from each library onto our reference genome with salmon
Combined counts from all libraries into a single table using tximport. During this step, we have also calculated 'per-gene' counts. I.e. counts of multiple transcripts per gene have been summarized so that we are left with single estimates per gene.
Minor data filtering to remove mtDNA, non-coding DNA and other irrelevant target sequences.

Our starting point will therefore be the count data for our RNAseq experiment prepared using tximport. In this exercise we will do the following:

Load the RNAseq count data quantified with salmon and explore it.
Visualize count data of biological replicates
Visualize count data of most variable genes

Load data

Lets load the count data:

txi<-readRDS("./data/salmon_gene_counts.rds")

For those of you unfamiliar, .rds files is a single R data object, created with saveRDS(). It is particularly useful for saving R objects that are not tabular, or more than 2-dimensional (e.g. lists, functions etc.).

Exploring the loaded data

Question: What is the class and structure of this data object?

class(txi)
str(txi)
lapply(X=txi, FUN=head)

It is a list with 4 entries, three arrays/matrices (abundamce, counts, length) and 1 character vector (countsFromAbundance).

Question: What is the content of this object and how is it arranged?

All three arrays have target transcripts as rows and biological samples per column.
Counts: estimate of the number of reads mapping to each transcript.
Abundance: Raw counts cannot be compared across samples because each library may vary slightly in terms of the total number of reads, differences in sequencing bias and difference in transcript lengths. Salmon (the program we used to quantify reads) also produces an array of "Abundances" which are normalized counts. According to the salmon documention, this is Transcripts Per Million (TPM). This bastically means that per sample, the total counts add up to 1 million. We could check this:

apply(X=txi$abundance, FUN=sum, MARGIN = 2)

In this case, they don't quite add up to 1 million because I have already filtered this matrix, to remove non-coding and mitochondrial DNA.

length: effective length of the target transcript.
countsFromAbundance: a character vector indicating whether counts were taken directly from the quantifier (salmon) or whether they have been calculated from the abundances by tximport. Default is no (counts are from salmon).

It is always a good idea to look at the distribution of data:

# Log10 transform and plot with base R:
hist(log10(txi$abundance[,1]), breaks=50)

# a more detailed plot with the tidyverse:
txi$abundance %>%
  as_tibble(rownames = "transcript") %>%
  pivot_longer(-transcript, names_to="sample", values_to="TPM") %>%
  ggplot(aes(x=TPM)) +
  geom_histogram() +
  ylab("number of transcripts") +
  scale_x_log10() +
  facet_wrap(~sample)

We can see that the log abundance is fairly normally distributed. This is a good sign. Many 0 counts would indicate that you have targets that are heavily depleted. For example, if I would have included non-coding target sequences, we would probably see a peak of low counts, because these tend to be very depleted in RNAseq data.

Visualize your count data: PCAs

Next, we may be curious as to whether there are some basic patterns in our data. For instance, we would expect biological replicates to cluster together.

Let's load our design matrix. This is a .csv file where we have included some information on what these samples actually are:

samples<-read.csv("./data/design_matrix.csv")

Take a closer look at this table.

samples

Question: How many tissues and how many treatments are there?

samples %>%
  select(treatment, tissue) %>%
  ftable()

There are two tissues (liver and skin) and two treatments (black and white). The treatments refer to the background colour that we raised the tadpoles in.
We also have a "side" variable for each of the skin samples to tell us if it is from the dorsum or ventrum. Technically we therefore have 3 samples per individual.

Before we go any further, it makes sense to check if our biological replicates are grouping nicely. We could do this by performing a PCA on the normalized counts (the TPM/abundance) and distinguish them using the information we have from our samples file that lays out the experimental design. However, a PCA works best on homoskedastic data; this means that the variance of an observed quantity (here, the expression strength of a gene) does not depend on the mean. This is definitely an assumption we are violating with RNAseq data, so if we don't transform our data, then the PCA is going to be strongly influenced by a few highly expressed genes. All of this requires a few steps:

Filter out any genes that are non-variable. i.e. genes whose expression is the same across all biological samples.
Transform the counts variable to imrpove heteroskedasticity. Here we will use a variance stabilizing transformation
Transpose the matrix so that we have samples as the rows and genes as the columns (we want to treat each gene as if it is a variable in our multivariate analysis).
Perform PCA on scaled and centered data.
Plot the first two components, labeling points with relevant biological information.

# perform VS transformation (the vst() function only likes integers!)
vst_counts<- txi$counts %>%
  as.data.frame() %>%
  mutate_all(as.integer) %>%
  as.matrix() %>%
  vst()
  
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts<-vst_counts[apply(vst_counts, 1, var) != 0,]

# perform PCA on TRANSPOSED scaled, centered data
pca<- prcomp(t(vst_counts),scale.=T, center=T)

## add metadata and plot
pca$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(samples) %>% # add the experimental design information
  ggplot(aes(x=PC1, y=PC2, color=treatment, shape=tissue)) +
  labs(x=paste0("PC1 (", summary(pca)$importance["Proportion of Variance",1]*100, "%)"),
       y=paste("PC2 (", summary(pca)$importance["Proportion of Variance",2]*100, "%)")) + 
  geom_point(size=3)

Question: Can we see any patterns in the PCA biplot?

Most of the variance is explained by tissue type (separation along PC1)
Treatment effect is not so clear, but some separation along PC2.

Question: What about patterns within tissues? Try running this same code, but only for the liver samples.

Notice that sample with "Li" are liver samples. perhaps drop the remaining columns first?

# perform VS transformation (the vst() function only likes integers!)
vst_counts_liver<- txi$counts %>%
  as.data.frame() %>%
  select(contains("Li")) %>%
  mutate_all(as.integer) %>%
  as.matrix() %>%
  vst()
  
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts_liver<-vst_counts_liver[apply(vst_counts_liver, 1, var) != 0,]

# perfomr PCA on TRANSPOSED scaled, centred data
pca_liver<- prcomp(t(vst_counts_liver),scale.=T, center=T)

## add metadata and plot
pca_liver$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(samples) %>%
  ggplot(aes(x=PC1, y=PC2, color=treatment)) +
  geom_point(size=3)

Although this is not a textbook example (welcome to the world of real-life data!), we do see that there is a bit of separation in the treatments via a diagonal between PC1 and PC2.

Final comments:

It is important to spend some time looking at your data! Nowadays there are many pipelines that will process the whole RNAseq workflow from start to finish, but don't forget to actually look at your results in detail.
It is important to think about what comparisons you are trying to make. Here, we saw that the tissue effect is much stronger than the treatment effect. Unless it is a specific aim of yours to compare across tissues, you may want to run your analysis on each tissue independently.
Here, we explore patterns in gene expression, but we do not apply any specific tests. More importantly, we are using counts that have been normalized within samples (TPM). It is generally adviced that these are not reliable when comparing across samples! (We will come back to this in the next exercise).
There are many sources of information online. A practical guide that may be useful is the Computational Genomics with R book, especially the chapter on RNA-Seq by Bora Uyar. Galaxy also has plenty of interesting documentation such as this one.

Extras - Visualize your count data with heatmaps

Another common exploratory visualization technique are heatmaps. Surely you have seen them in many publications. The PCA is useful for visualizing the clustering of samples. Heatmaps can do the same, but they can add one extra layer of information: the clustering of genes. Let's look at two examples:

Heatmaps by samples

Like the PCA, we might want to compare the distance between samples. To do this, we have to manually calculate this distances first, and then use those distances to inform the heatmap.

# get sample-to-sample distance
sample_dist <- dist(t(vst_counts))

# convert to matrix
sample_dist_matrix <- as.matrix(sample_dist)

# plot
pheatmap(sample_dist_matrix,
         annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")],
                                   row.names = "sample_id"),
         annotation_row=data.frame(samples[,c("sample_id","tissue","treatment","side")],
                                   row.names = "sample_id"))

This shows a similar picture. There is strong clustering with tissues, and then a little bit of clustering for treatments within tissues, but the tissue-effect is consuming much of the signal.

Heatmpas by genes

Above we are just showing the distance between samples, but you may be interested whether there are specific genes/transcripts who's expression is clustering. because there are thousands of genes, we may want to only focus on genes that are particularly variable in their expression across the samples.

# Sort the rows in vst matrix by the row-wise variance, and keep only the 500 most variable genes

var_genes <- apply(vst_counts,MAR=1, FUN=var) %>%
  enframe() %>%
  arrange(desc(value)) %>%
  slice_head(n=500) %>%
  pull(name)

head(var_genes) # list of genes we want to keep

vst_subset<-vst_counts[var_genes, ]

# check to see we really only have 500 genes
dim(vst_subset)

# plot heatmap (this time scaled by row)
pheatmap(vst_subset,
         scale = "row",
         cluster_rows=T, show_rownames=FALSE,
         cluster_cols=T, show_colnames = T,
         annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")], row.names = "sample_id"))

Question: Do we learn anything new from this heatmap?

We again see that most of the difference in gene expression is between liver and skin tissue.
The 8Li sample seems to have a very unique set of genes that are only over-expressed in that sample. It might be worth checking that sample more closely.