Tidyverse

The Tidyverse

Efficient data manipulation and visualization become increasingly important when working with large datasets. In genomics, we often work with tens or hundreds of thousands of lines of data, and often with more than one related table or data frame, all of which need to be manipulated in R.

What is the Tidyverse?

A collection of R packages
The packages are focused on data science and they share an underlying grammar and design. In other words, they play well together!

Installing and using the tidyverse

install.packages("tidyverse", repos = "https://cran.rstudio.com/")

## 
## The downloaded binary packages are in
##  /var/folders/gs/tbjgd5dd67338ggy5zkz8lbw0000gn/T//Rtmp2fqo9D/downloaded_packages

library(tidyverse)

You should see that loading the tidyverse library essentially just loads a number of "core" packages. It also tells you if there are any conflicting functions. For example, if you call filter(), R will use the filter function from the dplyr package, unless you explicitly ask for the one from the base stats package by writing stats::filter().
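As a minimal sketch of what such a conflict means in practice, the two filter() functions can be called side by side by prefixing each with its package name. The small inputs here (the built-in mtcars data and the vector 1:5) are chosen only for illustration.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
four_cyl <- dplyr::filter(mtcars, cyl == 4)
nrow(four_cyl)

# stats::filter() is an unrelated function: it applies a linear filter
# (here a 3-point moving average) to a numeric series
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
smoothed
```

Writing package::function() is also a handy habit in scripts where it is not obvious which packages are loaded.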

My favorite functions

Syntax	What it does
%>%	this "pipe" passes the output of one function into another
select()	keeps specified columns
filter()	keeps rows that match a condition
arrange()	sorts rows
mutate()	creates a new variable (column)
group_by()	groups the data so that later operations run on each group separately
join	combines data tables based on shared columns
pivot	transforms table structure (wide vs. long)

1. The pipe!

Essentially, pipes (%>% or |>) are operators that take the output of one operation and use it as the input of another operation. Here is a very simple example:

# take a look at this built-in dataset on car fuel economy (it ships with ggplot2)
mpg

## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows

# we could use `nrow()` to count how many rows are in this table like so:
nrow(mpg)

## [1] 234

# we can do the same operation using a pipe, like this:
mpg %>% nrow()

## [1] 234

With this basic example, it is hard to justify using the pipe, but let's crank it up a notch.

# say we want to count how many rows in this dataset have "audi" as the manufacturer. We could do a series of individual operations, always saving the output like so:
#1. subset all rows with audi
audi<-subset(mpg, manufacturer=="audi")

#2. count rows
nrow(audi)

## [1] 18

# we can of course put it all together as a series of nested functions like this:
nrow(subset(mpg, manufacturer=="audi"))

## [1] 18

# you can see however how this would get quite confusing with every level of operation you add. Pipes make this much cleaner:
mpg %>%
  subset(manufacturer=="audi") %>%
  nrow()

## [1] 18

Note that a function receiving piped input no longer needs its data argument spelled out: the pipe supplies it as the first argument. Throughout this course we will be using pipes extensively. Hopefully I will convince you of their utility for keeping code tidy and removing redundancy.
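To make the "first argument" idea concrete, here is a small sketch that runs the same two steps three ways and checks that the results agree. It uses the built-in mtcars data rather than mpg purely so that it runs without ggplot2. The native pipe |> is part of base R from version 4.1 onwards.

```r
library(magrittr)  # provides %>%; it is also available after library(tidyverse)

# nested call, magrittr pipe and native pipe: the same two steps each time
nested <- nrow(subset(mtcars, cyl == 4))
piped  <- mtcars %>% subset(cyl == 4) %>% nrow()
native <- mtcars |> subset(cyl == 4) |> nrow()   # needs R >= 4.1

identical(nested, piped)
identical(nested, native)
```

In each piped version, the object on the left of the pipe is handed to the function on the right as its first argument.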

2. Select, filter and arrange!

If you mostly work with base R, you probably rely heavily on subset(), logical (TRUE/FALSE) vectors and row and column indices [row-number, column-number] to filter, select and sort your data. The tidyverse can do much the same, but with a more intuitive set of functions.

## let's subset the same dataset to only include audi (`filter()` rows), and only the manufacturer, model and year columns (`select()` columns). Then we can sort (`arrange()`) it by the year.

mpg %>%
  filter(manufacturer=="audi") %>% # specify rows to keep
  select(manufacturer, model, year) %>% # specify columns to keep
  arrange(year) # specify row order based on column values

## # A tibble: 18 × 3
##    manufacturer model       year
##    <chr>        <chr>      <int>
##  1 audi         a4          1999
##  2 audi         a4          1999
##  3 audi         a4          1999
##  4 audi         a4          1999
##  5 audi         a4 quattro  1999
##  6 audi         a4 quattro  1999
##  7 audi         a4 quattro  1999
##  8 audi         a4 quattro  1999
##  9 audi         a6 quattro  1999
## 10 audi         a4          2008
## 11 audi         a4          2008
## 12 audi         a4          2008
## 13 audi         a4 quattro  2008
## 14 audi         a4 quattro  2008
## 15 audi         a4 quattro  2008
## 16 audi         a4 quattro  2008
## 17 audi         a6 quattro  2008
## 18 audi         a6 quattro  2008
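For comparison, here is a sketch of the same three steps written with base R indexing next to the tidyverse version. The object names tidy_audi and base_audi are made up for this illustration.

```r
library(dplyr)
mpg <- ggplot2::mpg  # fuel economy data shipped with ggplot2

# tidyverse: one verb per step
tidy_audi <- mpg %>%
  filter(manufacturer == "audi") %>%
  select(manufacturer, model, year) %>%
  arrange(year)

# base R: logical row index, character column index, then order()
base_audi <- mpg[mpg$manufacturer == "audi", c("manufacturer", "model", "year")]
base_audi <- base_audi[order(base_audi$year), ]

# both approaches keep the same rows in the same order
identical(tidy_audi$model, base_audi$model)
```

Both routes give the same table; the piped version simply reads top to bottom in the order the steps happen.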

3. Mutate!

Creating a new variable in the tidyverse uses the mutate() function.

# in base R, you would create a new variable/column like so:
mpg$avg_consumption<-(mpg$cty+mpg$hwy)/2

# there is a fair amount of redundancy here (specifying the data object for every variable)
# let's calculate an average fuel consumption for city and highway driving
mpg %>%
  mutate(avg_consumption=(cty+hwy)/2)

## # A tibble: 234 × 12
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: avg_consumption <dbl>
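One thing that is easy to miss: unlike the base R assignment with $, mutate() does not change the table it is given. It returns a new tibble, which has to be assigned if you want to keep the new column. A minimal sketch, where mpg_small and mpg_avg are names invented for this example:

```r
library(dplyr)
mpg_small <- ggplot2::mpg %>% select(model, cty, hwy)

# mutate() returns a new tibble; mpg_small itself is left unchanged
mpg_small %>% mutate(avg_consumption = (cty + hwy) / 2)
names(mpg_small)

# assign the result to keep the new column
mpg_avg <- mpg_small %>% mutate(avg_consumption = (cty + hwy) / 2)
names(mpg_avg)
```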

4. Grouping and summarizing!

Summarizing data in columns is achieved like so:

# mean manufacturing year:
mpg %>%
  summarise(mean=mean(year))

## # A tibble: 1 × 1
##    mean
##   <dbl>
## 1 2004.

# it is much more powerful however... we could for example:

# get the means of all numeric variables
mpg %>%
  summarise_if(is.numeric, mean)

## # A tibble: 1 × 6
##   displ  year   cyl   cty   hwy avg_consumption
##   <dbl> <dbl> <dbl> <dbl> <dbl>           <dbl>
## 1  3.47 2004.  5.89  16.9  23.4            20.1

# or get multiple summary statistics at once:

mpg %>%
  summarise(mean_cty=mean(cty),
            sd_cty=sd(cty))

## # A tibble: 1 × 2
##   mean_cty sd_cty
##      <dbl>  <dbl>
## 1     16.9   4.26

This function is really useful when you combine it with group_by(). This does exactly what it says on the box: it groups data by a specified variable:

# look what happens when we group by manufacturer:
mpg %>%
  group_by(manufacturer)

## # A tibble: 234 × 12
## # Groups:   manufacturer [15]
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: avg_consumption <dbl>

## you should see that the table header now contains the grouping information (manufacturer[15])

# and now, we can get summary statistics per group:
mpg %>%
  group_by(manufacturer) %>%
  summarise_if(is.numeric, .funs = c(mean=mean, sd=sd))

## # A tibble: 15 × 13
##    manufacturer displ_mean year_mean cyl_mean cty_mean hwy_mean
##    <chr>             <dbl>     <dbl>    <dbl>    <dbl>    <dbl>
##  1 audi               2.54     2004.     5.22     17.6     26.4
##  2 chevrolet          5.06     2005.     7.26     15       21.9
##  3 dodge              4.38     2004.     7.08     13.1     17.9
##  4 ford               4.54     2003.     7.2      14       19.4
##  5 honda              1.71     2003      4        24.4     32.6
##  6 hyundai            2.43     2004.     4.86     18.6     26.9
##  7 jeep               4.58     2006.     7.25     13.5     17.6
##  8 land rover         4.3      2004.     8        11.5     16.5
##  9 lincoln            5.4      2002      8        11.3     17  
## 10 mercury            4.4      2004.     7        13.2     18  
## 11 nissan             3.27     2004.     5.54     18.1     24.6
## 12 pontiac            3.96     2003.     6.4      17       26.4
## 13 subaru             2.46     2004.     4        19.3     25.6
## 14 toyota             2.95     2003.     5.12     18.5     24.9
## 15 volkswagen         2.26     2003.     4.59     20.9     29.2
## # ℹ 7 more variables: avg_consumption_mean <dbl>, displ_sd <dbl>,
## #   year_sd <dbl>, cyl_sd <dbl>, cty_sd <dbl>, hwy_sd <dbl>,
## #   avg_consumption_sd <dbl>
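Recent dplyr documentation marks the _if variants as superseded in favour of across(). As a sketch under that assumption, the grouped summary can be written as follows. It starts from a fresh copy of mpg, so there is no avg_consumption column here, and per_manufacturer is a name made up for the example.

```r
library(dplyr)
mpg <- ggplot2::mpg

# across() picks the columns (all numeric ones) and applies each named function;
# result columns are named <column>_<function>, e.g. cty_mean and cty_sd
per_manufacturer <- mpg %>%
  group_by(manufacturer) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))

per_manufacturer
```

summarise_if() still works, so this is a matter of style for new code rather than a required change.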

Left join, right join, inner join, full join!

Any SQL people out there? Often, we have more than one dataset or table and we want to join them based on a reference variable. These "join" operations can go in different directions, depending on which table you want to complete:

You will probably use left_join() most often, where you wish to pull additional data from a second table into your first/primary table.
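The four directions are easiest to see on a toy example. The two tables below (samples and measurements, with a shared sample_id column) are invented for illustration only.

```r
library(dplyr)

# two small tables that share the key column sample_id
samples      <- tibble(sample_id = c("s1", "s2", "s3"), species = c("A", "B", "A"))
measurements <- tibble(sample_id = c("s1", "s2", "s4"), length_mm = c(10.2, 8.7, 9.9))

left  <- left_join(samples, measurements, by = "sample_id")   # every row of samples; s3 has no measurement (NA)
right <- right_join(samples, measurements, by = "sample_id")  # every row of measurements; s4 has no species (NA)
inner <- inner_join(samples, measurements, by = "sample_id")  # only ids present in both tables: s1, s2
full  <- full_join(samples, measurements, by = "sample_id")   # every id from either table: s1 to s4
```

Naming the key with by = makes the join explicit; without it, dplyr joins on every column name the two tables share.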

# imagine the mpg data set was our most complete data set, but we were working with a list of only manual cars and their manufacturing details, and we wanted to find out what mileage these cars have:

# our manual data subset
df1<-mpg %>%
  select(-c(cty, hwy)) %>%
  filter(str_detect(trans, "manual"))
df1

## # A tibble: 77 × 10
##    manufacturer model  displ  year   cyl trans drv   fl    class avg_consumption
##    <chr>        <chr>  <dbl> <int> <int> <chr> <chr> <chr> <chr>           <dbl>
##  1 audi         a4       1.8  1999     4 manu… f     p     comp…            25  
##  2 audi         a4       2    2008     4 manu… f     p     comp…            25.5
##  3 audi         a4       2.8  1999     6 manu… f     p     comp…            22  
##  4 audi         a4 qu…   1.8  1999     4 manu… 4     p     comp…            22  
##  5 audi         a4 qu…   2    2008     4 manu… 4     p     comp…            24  
##  6 audi         a4 qu…   2.8  1999     6 manu… 4     p     comp…            21  
##  7 audi         a4 qu…   3.1  2008     6 manu… 4     p     comp…            20  
##  8 chevrolet    corve…   5.7  1999     8 manu… r     p     2sea…            21  
##  9 chevrolet    corve…   6.2  2008     8 manu… r     p     2sea…            21  
## 10 chevrolet    corve…   7    2008     8 manu… r     p     2sea…            19.5
## # ℹ 67 more rows

# now we can join the two, to get the mileage information for just these manual cars
df1 %>%
  left_join(mpg)

## Warning in left_join(., mpg): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 14 of `x` matches multiple rows in `y`.
## ℹ Row 65 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
##   "many-to-many"` to silence this warning.

## # A tibble: 79 × 12
##    manufacturer model  displ  year   cyl trans drv   fl    class avg_consumption
##    <chr>        <chr>  <dbl> <int> <int> <chr> <chr> <chr> <chr>           <dbl>
##  1 audi         a4       1.8  1999     4 manu… f     p     comp…            25  
##  2 audi         a4       2    2008     4 manu… f     p     comp…            25.5
##  3 audi         a4       2.8  1999     6 manu… f     p     comp…            22  
##  4 audi         a4 qu…   1.8  1999     4 manu… 4     p     comp…            22  
##  5 audi         a4 qu…   2    2008     4 manu… 4     p     comp…            24  
##  6 audi         a4 qu…   2.8  1999     6 manu… 4     p     comp…            21  
##  7 audi         a4 qu…   3.1  2008     6 manu… 4     p     comp…            20  
##  8 chevrolet    corve…   5.7  1999     8 manu… r     p     2sea…            21  
##  9 chevrolet    corve…   6.2  2008     8 manu… r     p     2sea…            21  
## 10 chevrolet    corve…   7    2008     8 manu… r     p     2sea…            19.5
## # ℹ 69 more rows
## # ℹ 2 more variables: cty <int>, hwy <int>
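The warning, and the jump from 77 to 79 rows, arise because no key was given: the join matched on every shared column, and a few cars are identical across all of those columns, so some rows matched more than once. One possible way around this, sketched here with a hypothetical row_id column added before splitting the data, is to join on a key that is unique per row:

```r
library(dplyr)

# add a unique identifier to every row (row_id is a made-up name for this sketch)
mpg_id <- ggplot2::mpg %>% mutate(row_id = row_number())

# the manual-only subset, without the mileage columns
manual <- mpg_id %>%
  select(-c(cty, hwy)) %>%
  filter(grepl("manual", trans))

# the lookup table: just the key plus the columns we want to pull in
mileage <- mpg_id %>% select(row_id, cty, hwy)

# each row now matches exactly one row, so no rows are duplicated
manual_mileage <- manual %>% left_join(mileage, by = "row_id")
nrow(manual_mileage) == nrow(manual)
```

In real data the unique key is usually something like a sample or gene identifier rather than a row number.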

Pivot longer and wider

When one variable is nested within another variable, this information can be stored as either a "wide" table, or a "long" table. (Think about multiple species in a genus, or multiple morphological measurements taken from a single animal).

One way of thinking about it is that wide tables have a single ID column and then many value columns, whereas a long table has many ID columns and only a single value column. Coming from Excel, and base R, we are probably more familiar with wide tables, but the tidyverse really likes long tables. To switch between them, we use pivot_longer() and pivot_wider().

Let's take a look at another very popular dataset, the iris dataset.

iris %>% as_tibble()

## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

Is the iris dataset long or wide?

The iris dataset is a typical wide dataset, where each trait is its own variable.

## let's reshape it!
iris %>%
  pivot_longer(-Species, names_to = "trait", values_to = "length")

## # A tibble: 600 × 3
##    Species trait        length
##    <fct>   <chr>         <dbl>
##  1 setosa  Sepal.Length    5.1
##  2 setosa  Sepal.Width     3.5
##  3 setosa  Petal.Length    1.4
##  4 setosa  Petal.Width     0.2
##  5 setosa  Sepal.Length    4.9
##  6 setosa  Sepal.Width     3  
##  7 setosa  Petal.Length    1.4
##  8 setosa  Petal.Width     0.2
##  9 setosa  Sepal.Length    4.7
## 10 setosa  Sepal.Width     3.2
## # ℹ 590 more rows

Although this may seem trivial or even unnecessary at first glance, it is a hugely important data transformation technique, especially in combination with group_by() and faceting plots (more on that later).
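As a sketch of why the long format pays off, the following summarises every trait of every species in a single grouped call and then pivots back to the wide layout. The flower_id column and the column name value are assumptions made for this example. The identifier is needed so that pivot_wider() knows which measurements belong to the same flower.

```r
library(dplyr)
library(tidyr)

# long format: one row per flower and trait
iris_long <- iris %>%
  mutate(flower_id = row_number()) %>%   # hypothetical per-flower identifier
  pivot_longer(-c(Species, flower_id), names_to = "trait", values_to = "value")

# one grouped summary covers all four traits at once
trait_means <- iris_long %>%
  group_by(Species, trait) %>%
  summarise(mean_value = mean(value), .groups = "drop")

# and back to wide: one column per trait again
iris_wide <- iris_long %>%
  pivot_wider(names_from = trait, values_from = value)
```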

Visualizing data with ggplot2

A tidyverse package that many of you may already know is ggplot2. Building a plot with ggplot2 takes a few general steps.

1. Create a new ggplot object

# to build a plot we have to define two basic aspects
# 1. what is our dataset? - defined by  "data="
# 2. what variables do we want to plot? - defined by mapping the aesthetics, or "mapping=aes()"
ggplot(data=mpg,
       mapping=aes(x=displ, y=hwy))

2. Add plot layers

Once the plot has been created, you can add any plot layer you like, using geoms. For example, the x and y data as points:

ggplot(data=mpg,
       mapping=aes(x=displ, y=hwy)) +
  geom_point()
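Because a plot is an ordinary R object, it can also be stored in a variable and built up one layer at a time. A minimal sketch follows; the names p, p_points and p_smooth are arbitrary, and the trend line is just one example of a second layer.

```r
library(ggplot2)

# step 1: an empty plot that only knows the data and the aesthetic mapping
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy))
length(p$layers)   # no layers yet

# step 2: each `+` returns a new plot object with one more layer
p_points <- p + geom_point()
p_smooth <- p_points + geom_smooth(method = "lm")   # adds a linear trend line
length(p_smooth$layers)

# typing p_smooth at the console (or calling print(p_smooth)) draws the plot
```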

Different geoms allow for different kinds of data visualisation:

# line graph
ggplot(data=mpg,
       mapping=aes(x=displ, y=hwy)) +
  geom_line()

# boxplot (categorical x axis)
ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy)) +
  geom_boxplot()

# histogram
ggplot(data=mpg,
       mapping=aes(x=hwy)) +
  geom_histogram()

3. Styling visualizations

Different styling can be added at different parts of the build.

Adding a fixed colour is done outside the aes()

ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy)) +
  geom_boxplot(fill="blue")

Adding a conditional colour is done inside the aes()

ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy, fill=manufacturer)) +
  geom_boxplot()
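A common stumbling block is putting a fixed colour inside aes(). The sketch below builds the three variants so they can be compared; the object names are made up for illustration.

```r
library(ggplot2)

# fixed colour: a constant passed to the geom, outside aes()
p_fixed <- ggplot(mpg, aes(x = manufacturer, y = hwy)) +
  geom_boxplot(fill = "blue")

# conditional colour: a column of the data mapped inside aes()
p_mapped <- ggplot(mpg, aes(x = manufacturer, y = hwy, fill = manufacturer)) +
  geom_boxplot()

# pitfall: a constant inside aes() is treated as data (a single category called "blue"),
# so ggplot2 picks a colour from its default palette and adds a legend
p_pitfall <- ggplot(mpg, aes(x = manufacturer, y = hwy, fill = "blue")) +
  geom_boxplot()
```

Printing p_pitfall next to p_fixed makes the difference obvious: only p_fixed is guaranteed to come out blue.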

General theme elements can be manipulated either with canned theme functions or manually

ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy, fill=manufacturer)) +
  geom_boxplot() +
  ## apply a canned theme
  theme_classic() +
  ## edit the theme by e.g. removing the legend
  theme(legend.position = "none")

4. Faceting

ggplot is great for organizing multiple plots for groups of data that share one or both axes. This is done with facet_wrap() or facet_grid()

# two plots that share the same x axis:
ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy)) +
  geom_bar(stat="identity") +
  facet_wrap(~year, ncol=1)
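A word of caution on the bar plot above: with stat="identity", every row of the data becomes its own bar segment and the segments are stacked, so the bar height is the sum of hwy over all cars of a manufacturer, not a typical value. If the mean is what you are after, one option (sketched here, with mean_hwy as a made-up name) is to summarise first and then plot with geom_col():

```r
library(dplyr)
library(ggplot2)

# one row per manufacturer and year, holding the mean highway mileage
mean_hwy <- mpg %>%
  group_by(manufacturer, year) %>%
  summarise(mean_hwy = mean(hwy), .groups = "drop")

# geom_col() draws bars whose height is the y value itself
p_mean <- ggplot(mean_hwy, aes(x = manufacturer, y = mean_hwy)) +
  geom_col() +
  facet_wrap(~year, ncol = 1)
```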

# multiple plots that are grouped into two groups
ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy)) +
  geom_bar(stat="identity") +
  facet_grid(class~year)
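Faceting is also where long tables come into their own. As a sketch, the iris data can be pivoted to long format and then drawn with one panel per trait. The column names trait and value are choices made for this example.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# long format: the trait name becomes a variable we can facet by
iris_long <- iris %>%
  pivot_longer(-Species, names_to = "trait", values_to = "value")

# one boxplot panel per trait, each with its own y axis
p_traits <- ggplot(iris_long, aes(x = Species, y = value, fill = Species)) +
  geom_boxplot() +
  facet_wrap(~trait, scales = "free_y") +
  theme_classic() +
  theme(legend.position = "none")
```

With the wide table, the same figure would need four separate plots, one per trait column.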

Final comments:

Switching to the tidyverse can be a little daunting at first and may seem redundant. Many things can be done in base R. However, it is a powerful tool for complex data organization and manipulation.
Learn by doing! As with any programming language, the best way to learn is to just get your hands dirty. Your regular Google search should be "how do I ________ using the tidyverse?".