Exploring RNA-Seq Count Data

Prepare environment

A good first thing to do when running any R script, is to set your working environment and load any libraries you may need. A handy tool for loading packages is pacman. This package conveniently loads libraries, installs libraries that are not installed, including bioconductor packages!

# set working directory
setwd("~/Documents/git_projects/ADDA_taller2/")

# install pacman if not already installed
if (!require("pacman")) install.packages("pacman")

# use pacman to load libraries
pacman::p_load(tidyverse, DESeq2, pheatmap)

Background

Now that we are all ready to go, let's import some data. As a prequel to this exercise, I have already explained the biological question(s) we are trying to answer (in brief, how is pigmentation plasticity controlled in Spadefoot toad tadpoles?). Here, we will jump in around about the middle of a typical RNAseq workflow. We have already done the following:

Performed the experiment
Extracted the RNA
Sequenced the RNA
Cleaned/trimmed sequences
Quantified sequences by mapping reads from each library onto our reference genome with salmon
Combined counts from all libraries into a single table using tximport. During this step, we have also calculated 'per-gene' counts. I.e. counts of multiple transcripts per gene have been summarized so that we are left with single estimates per gene.
Minor data filtering to remove mtDNA, non-coding DNA and other irrelevant target sequences.

Our starting point will therefore be the count data for our RNAseq experiment prepared using tximport. In this exercise we will do the following:

Load the RNAseq count data quantified with salmon and explore it.
Visualize count data of biological replicates
Visualize count data of most variable genes

Load data

Lets load the count data:

txi<-readRDS("./data/salmon_gene_counts.rds")

For those of you unfamiliar, .rds files is a single R data object, created with saveRDS(). It is particularly useful for saving R objects that are not tabular, or more than 2-dimensional (e.g. lists, functions etc.).

Exploring the loaded data

Question: What is the class and structure of this data object?

class(txi)

## [1] "list"

str(txi)

## List of 4
##  $ abundance          : num [1:32531, 1:24] 9.26 0 0 0 20.97 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
##   .. ..$ : chr [1:24] "10D" "10Li" "10V" "11D" ...
##  $ counts             : num [1:32531, 1:24] 799 0 0 0 1239 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
##   .. ..$ : chr [1:24] "10D" "10Li" "10V" "11D" ...
##  $ length             : num [1:32531, 1:24] 6201.2 71.4 547.6 1019.4 4246.7 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:32531] "PECUL23A000004" "PECUL23A000005" "PECUL23A000006" "PECUL23A000008" ...
##   .. ..$ : chr [1:24] "10D" "10Li" "10V" "11D" ...
##  $ countsFromAbundance: chr "no"

lapply(X=txi, FUN=head)

## $abundance
##                      10D     10Li      10V       11D      11Li       11V
## PECUL23A000004  9.258514 10.21374 10.84399  2.008093  8.833516  2.847134
## PECUL23A000005  0.000000  0.00000  0.00000  0.000000  0.000000  0.000000
## PECUL23A000006  0.000000  0.00000  0.00000  0.000000  0.000000  0.000000
## PECUL23A000008  0.000000  0.00000  0.00000  0.000000  0.000000  0.000000
## PECUL23A000011 20.974082 11.27196 21.07977 13.069257 10.902891 12.598137
## PECUL23A000012  0.000000  0.00000  0.00000  0.000000  0.000000  0.000000
##                      12D     12Li       12V      21D      21Li       21V
## PECUL23A000004  6.285256 11.15112  4.600014 10.00756  5.647655  6.463992
## PECUL23A000005  0.000000  0.00000  0.000000  0.00000  0.000000  0.000000
## PECUL23A000006  0.000000  0.00000  0.000000  0.00000  0.000000  0.000000
## PECUL23A000008  0.000000  0.00000  0.000000  0.00000  1.272369  0.076718
## PECUL23A000011 13.695497 11.25386 13.393722 17.31247 11.046205 15.845097
## PECUL23A000012  0.000000  0.00000  0.000000  0.00000  0.000000  0.000000
##                      22D     22Li      22V        7D       7Li       7V
## PECUL23A000004  7.079812 12.87613 11.03426  3.172516  1.457532  2.49415
## PECUL23A000005  0.000000  0.00000  0.00000  0.000000  0.000000  0.00000
## PECUL23A000006  0.000000  0.00000  0.00000  0.000000  0.000000  0.00000
## PECUL23A000008  0.000000  0.00000  0.00000  0.000000  0.000000  0.00000
## PECUL23A000011 18.138504 12.56544 17.05364 13.646505 10.907281 12.58050
## PECUL23A000012  0.000000  0.00000  0.00000  0.000000  0.000000  0.00000
##                       8D      8Li        8V        9D      9Li       9V
## PECUL23A000004  1.846965 3.208272  3.275358  6.754852 11.33774 10.01473
## PECUL23A000005  0.000000 0.000000  0.000000  0.000000  0.00000  0.00000
## PECUL23A000006  0.000000 0.000000  0.000000  0.000000  0.00000  0.00000
## PECUL23A000008  0.000000 0.000000  0.000000  0.000000  0.00000  0.00000
## PECUL23A000011 11.662266 9.852510 12.011888 15.113053 10.92103 15.01804
## PECUL23A000012  0.000000 0.000000  0.000000  0.000000  0.00000  0.00000
## 
## $counts
##                     10D    10Li      10V     11D    11Li     11V     12D
## PECUL23A000004  798.556 890.812  950.925 209.388 840.261 313.572 588.464
## PECUL23A000005    0.000   0.000    0.000   0.000   0.000   0.000   0.000
## PECUL23A000006    0.000   0.000    0.000   0.000   0.000   0.000   0.000
## PECUL23A000008    0.000   0.000    0.000   0.000   0.000   0.000   0.000
## PECUL23A000011 1238.869 642.998 1267.971 898.400 672.733 933.099 883.169
## PECUL23A000012    0.000   0.000    0.000   0.000   0.000   0.000   0.000
##                   12Li     12V      21D    21Li     21V      22D     22Li
## PECUL23A000004 981.773 427.419  842.375 549.566 503.027  675.112 1104.126
## PECUL23A000005   0.000   0.000    0.000   0.000   0.000    0.000    0.000
## PECUL23A000006   0.000   0.000    0.000   0.000   0.000    0.000    0.000
## PECUL23A000008   0.000   0.000    0.000  19.000   1.000    0.000    0.000
## PECUL23A000011 659.733 844.625 1005.042 688.409 858.488 1125.317  721.587
## PECUL23A000012   0.000   0.000    0.000   0.000   0.000    0.000    0.000
##                     22V      7D     7Li      7V      8D     8Li      8V      9D
## PECUL23A000004  988.542 299.200 144.040 222.206 172.249 302.444 303.135 681.824
## PECUL23A000005    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
## PECUL23A000006    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
## PECUL23A000008    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
## PECUL23A000011 1044.139 879.316 694.686 728.819 724.642 605.159 713.794 970.492
## PECUL23A000012    0.000   0.000   0.000   0.000   0.000   0.000   0.000   0.000
##                     9Li      9V
## PECUL23A000004 1006.008 945.262
## PECUL23A000005    0.000   0.000
## PECUL23A000006    0.000   0.000
## PECUL23A000008    0.000   0.000
## PECUL23A000011  640.087 978.380
## PECUL23A000012    0.000   0.000
## 
## $length
##                       10D       10Li        10V        11D       11Li
## PECUL23A000004 6201.16983 6364.53488 6199.02702 6438.33500 6471.52674
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4246.70282 4162.70479 4252.16543 4244.47366 4197.84742
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650  447.71650
##                       11V        12D       12Li        12V        21D
## PECUL23A000004 6315.82807 6168.57244 6370.40945 6251.34080 6149.42485
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4247.39914 4248.68448 4241.71429 4242.69120 4241.13665
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650  447.71650
##                      21Li        21V        22D       22Li        22V
## PECUL23A000004 6627.37203 6100.23764 6360.54714 6359.47494 6205.29626
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1017.02200 1021.78100 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4244.46540 4247.13092 4138.22686 4258.91322 4240.83801
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650  447.71650
##                        7D        7Li         7V         8D        8Li
## PECUL23A000004 6209.34168 6592.86848 6537.26970 6379.89021 6451.48933
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4242.39630 4248.94036 4250.94094 4250.63982 4203.48454
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650  447.71650
##                        8V         9D        9Li         9V
## PECUL23A000004 6485.19773 6555.93984 6423.42596 6148.17008
## PECUL23A000005   71.39554   71.39554   71.39554   71.39554
## PECUL23A000006  547.63317  547.63317  547.63317  547.63317
## PECUL23A000008 1019.39872 1019.39872 1019.39872 1019.39872
## PECUL23A000011 4163.96652 4170.79175 4242.94916 4243.53463
## PECUL23A000012  447.71650  447.71650  447.71650  447.71650
## 
## $countsFromAbundance
## [1] "no"

It is a list with 4 entries, three arrays/matrices (abundamce, counts, length) and 1 character vector (countsFromAbundance).

Question: What is the content of this object and how is it arranged?

All three arrays have target transcripts as rows and biological samples per column.
Counts: estimate of the number of reads mapping to each transcript.
Abundance: Raw counts cannot be compared across samples because each library may vary slightly in terms of the total number of reads, differences in sequencing bias and difference in transcript lengths. Salmon (the program we used to quantify reads) also produces an array of "Abundances" which are normalized counts. According to the salmon documention, this is Transcripts Per Million (TPM). This bastically means that per sample, the total counts add up to 1 million. We could check this:

apply(X=txi$abundance, FUN=sum, MARGIN = 2)

##      10D     10Li      10V      11D     11Li      11V      12D     12Li 
## 932355.7 932241.7 938520.5 945823.7 934723.7 940997.4 936842.9 930242.2 
##      12V      21D     21Li      21V      22D     22Li      22V       7D 
## 935530.5 930301.0 935730.8 934942.5 930828.3 916143.5 931353.4 933611.3 
##      7Li       7V       8D      8Li       8V       9D      9Li       9V 
## 937777.1 925431.3 931110.8 936225.3 939398.7 939854.7 918715.2 935852.6

In this case, they don't quite add up to 1 million because I have already filtered this matrix, to remove non-coding and mitochondrial DNA.

length: effective length of the target transcript.
countsFromAbundance: a character vector indicating whether counts were taken directly from the quantifier (salmon) or whether they have been calculated from the abundances by tximport. Default is no (counts are from salmon).

It is always a good idea to look at the distribution of data:

# Log10 transform and plot with base R:
hist(log10(txi$abundance[,1]), breaks=50)

# a more detailed plot with the tidyverse:
txi$abundance %>%
  as_tibble(rownames = "transcript") %>%
  pivot_longer(-transcript, names_to="sample", values_to="TPM") %>%
  ggplot(aes(x=TPM)) +
  geom_histogram() +
  ylab("number of transcripts") +
  scale_x_log10() +
  facet_wrap(~sample)

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Removed 366214 rows containing non-finite values (`stat_bin()`).

We can see that the log abundance is fairly normally distributed. This is a good sign. Many 0 counts would indicate that you have targets that are heavily depleted. For example, if I would have included non-coding target sequences, we would probably see a peak of low counts, because these tend to be very depleted in RNAseq data.

Visualize your count data: PCAs

Next, we may be curious as to whether there are some basic patterns in our data. For instance, we would expect biological replicates to cluster together.

Let's load our design matrix. This is a .csv file where we have included some information on what these samples actually are:

samples<-read.csv("./data/design_matrix.csv")

Take a closer look at this table.

samples

##    sample_id treatment tissue    side
## 1       10Li     black  liver        
## 2       11Li     black  liver        
## 3       12Li     black  liver        
## 4       22Li     black  liver        
## 5       21Li     white  liver        
## 6        7Li     white  liver        
## 7        8Li     white  liver        
## 8        9Li     white  liver        
## 9        10D     black   skin  dorsal
## 10       11D     black   skin  dorsal
## 11       12D     black   skin  dorsal
## 12       22D     black   skin  dorsal
## 13       10V     black   skin ventral
## 14       11V     black   skin ventral
## 15       12V     black   skin ventral
## 16       22V     black   skin ventral
## 17       21D     white   skin  dorsal
## 18        7D     white   skin  dorsal
## 19        8D     white   skin  dorsal
## 20        9D     white   skin  dorsal
## 21       21V     white   skin ventral
## 22        7V     white   skin ventral
## 23        8V     white   skin ventral
## 24        9V     white   skin ventral

Question: How many tissues and how many treatments are there?

samples %>%
  select(treatment, tissue) %>%
  ftable()

##           tissue liver skin
## treatment                  
## black                4    8
## white                4    8

There are two tissues (liver and skin) and two treatments (black and white). The treatments refer to the background colour that we raised the tadpoles in.
We also have a "side" variable for each of the skin samples to tell us if it is from the dorsum or ventrum. Technically we therefore have 3 samples per individual.

Before we go any further, it makes sense to check if our biological replicates are grouping nicely. We could do this by performing a PCA on the normalized counts (the TPM/abundance) and distinguish them using the information we have from our samples file that lays out the experimental design. However, a PCA works best on homoskedastic data; this means that the variance of an observed quantity (here, the expression strength of a gene) does not depend on the mean. This is definitely an assumption we are violating with RNAseq data, so if we don't transform our data, then the PCA is going to be strongly influenced by a few highly expressed genes. All of this requires a few steps:

Filter out any genes that are non-variable. i.e. genes whose expression is the same across all biological samples.
Transform the counts variable to imrpove heteroskedasticity. Here we will use a variance stabilizing transformation
Transpose the matrix so that we have samples as the rows and genes as the columns (we want to treat each gene as if it is a variable in our multivariate analysis).
Perform PCA on scaled and centered data.
Plot the first two components, labeling points with relevant biological information.

# perform VS transformation (the vst() function only likes integers!)
vst_counts<- txi$counts %>%
  as.data.frame() %>%
  mutate_all(as.integer) %>%
  as.matrix() %>%
  vst()
  
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts<-vst_counts[apply(vst_counts, 1, var) != 0,]

# perform PCA on TRANSPOSED scaled, centered data
pca<- prcomp(t(vst_counts),scale.=T, center=T)

## add metadata and plot
pca$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(samples) %>% # add the experimental design information
  ggplot(aes(x=PC1, y=PC2, color=treatment, shape=tissue)) +
  labs(x=paste0("PC1 (", summary(pca)$importance["Proportion of Variance",1]*100, "%)"),
       y=paste("PC2 (", summary(pca)$importance["Proportion of Variance",2]*100, "%)")) + 
  geom_point(size=3)

Question: Can we see any patterns in the PCA biplot?

Most of the variance is explained by tissue type (separation along PC1)
Treatment effect is not so clear, but some separation along PC2.

Question: What about patterns within tissues? Try running this same code, but only for the liver samples.

Notice that sample with "Li" are liver samples. perhaps drop the remaining columns first?

# perform VS transformation (the vst() function only likes integers!)
vst_counts_liver<- txi$counts %>%
  as.data.frame() %>%
  select(contains("Li")) %>%
  mutate_all(as.integer) %>%
  as.matrix() %>%
  vst()
  
# remove 0-variance genes (remember to also convert counts to integers)
vst_counts_liver<-vst_counts_liver[apply(vst_counts_liver, 1, var) != 0,]

# perfomr PCA on TRANSPOSED scaled, centred data
pca_liver<- prcomp(t(vst_counts_liver),scale.=T, center=T)

## add metadata and plot
pca_liver$x %>%
  as_tibble(rownames = "sample_id") %>%
  left_join(samples) %>%
  ggplot(aes(x=PC1, y=PC2, color=treatment)) +
  geom_point(size=3)

Although this is not a textbook example (welcome to the world of real-life data!), we do see that there is a bit of separation in the treatments via a diagonal between PC1 and PC2.

Final comments:

It is important to spend some time looking at your data! Nowadays there are many pipelines that will process the whole RNAseq workflow from start to finish, but don't forget to actually look at your results in detail.
It is important to think about what comparisons you are trying to make. Here, we saw that the tissue effect is much stronger than the treatment effect. Unless it is a specific aim of yours to compare across tissues, you may want to run your analysis on each tissue independently.
Here, we explore patterns in gene expression, but we do not apply any specific tests. More importantly, we are using counts that have been normalized within samples (TPM). It is generally adviced that these are not reliable when comparing across samples! (We will come back to this in the next exercise).
There are many sources of information online. A practical guide that may be useful is the Computational Genomics with R book, especially the chapter on RNA-Seq by Bora Uyar. Galaxy also has plenty of interesting documentation such as this one.

Extras - Visualize your count data with heatmaps

Another common exploratory visualization technique are heatmaps. Surely you have seen them in many publications. The PCA is useful for visualizing the clustering of samples. Heatmaps can do the same, but they can add one extra layer of information: the clustering of genes. Let's look at two examples:

Heatmaps by samples

Like the PCA, we might want to compare the distance between samples. To do this, we have to manually calculate this distances first, and then use those distances to inform the heatmap.

# get sample-to-sample distance
sample_dist <- dist(t(vst_counts))

# convert to matrix
sample_dist_matrix <- as.matrix(sample_dist)

# plot
pheatmap(sample_dist_matrix,
         annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")],
                                   row.names = "sample_id"),
         annotation_row=data.frame(samples[,c("sample_id","tissue","treatment","side")],
                                   row.names = "sample_id"))

This shows a similar picture. There is strong clustering with tissues, and then a little bit of clustering for treatments within tissues, but the tissue-effect is consuming much of the signal.

Heatmpas by genes

Above we are just showing the distance between samples, but you may be interested whether there are specific genes/transcripts who's expression is clustering. because there are thousands of genes, we may want to only focus on genes that are particularly variable in their expression across the samples.

# Sort the rows in vst matrix by the row-wise variance, and keep only the 500 most variable genes

var_genes <- apply(vst_counts,MAR=1, FUN=var) %>%
  enframe() %>%
  arrange(desc(value)) %>%
  slice_head(n=500) %>%
  pull(name)

head(var_genes) # list of genes we want to keep

## [1] "PECUL23A041600" "PECUL23A024292" "PECUL23A039371" "PECUL23A023059"
## [5] "PECUL23A000586" "PECUL23A017537"

vst_subset<-vst_counts[var_genes, ]

# check to see we really only have 500 genes
dim(vst_subset)

## [1] 500  24

# plot heatmap (this time scaled by row)
pheatmap(vst_subset,
         scale = "row",
         cluster_rows=T, show_rownames=FALSE,
         cluster_cols=T, show_colnames = T,
         annotation_col=data.frame(samples[,c("sample_id","tissue","treatment","side")], row.names = "sample_id"))

Question: Do we learn anything new from this heatmap?

We again see that most of the difference in gene expression is between liver and skin tissue.
The 8Li sample seems to have a very unique set of genes that are only over-expressed in that sample. It might be worth checking that sample more closely.