Applications and Discussions in Animal Development - Workshop 2
In this brief practical, we want to make sure we are all on the same page when it comes to the tidyverse.
Efficient data manipulation and visualization become increasingly important when working with large datasets. In genomics, we are often working with tens or hundreds of thousands of lines of data, and often with more than one related table or dataframe, all of which need to be manipulated in R.
What is the Tidyverse?
install.packages("tidyverse",repos ="https://cran.rstudio.com/")
library(tidyverse)
You should see that loading the tidyverse with `library(tidyverse)` essentially just loads a number of "core" packages. It also tells you if there are any conflicting functions. For example, if you call `filter()`, it will use the filter function from the `dplyr` package, unless you explicitly ask for the one from the base `stats` package by writing `stats::filter()`.
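To make the conflict concrete, here is a minimal sketch of calling each `filter()` explicitly with the `package::function()` syntax. The object names (`audi_rows`, `moving_avg`) are only illustrative.

```r
library(tidyverse)

# dplyr::filter() keeps the rows of a table that match a condition
audi_rows <- dplyr::filter(ggplot2::mpg, manufacturer == "audi")
nrow(audi_rows)

# stats::filter() is unrelated: it applies a linear filter to a numeric series,
# here a 3-point moving average (the first and last values are NA)
moving_avg <- stats::filter(1:10, rep(1 / 3, 3))
moving_avg
```

Writing the package name in front of the function is also a simple way to make a script robust against whichever package happened to be loaded last.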
| Syntax | What it does |
|---|---|
| `%>%` | this "pipe" passes the output of one function into another |
| `select()` | keeps the specified columns |
| `filter()` | keeps the rows that match a condition |
| `arrange()` | sorts rows |
| `mutate()` | creates a new variable (column) |
| `group_by()` | groups rows so that later operations run on each group separately |
| join (e.g. `left_join()`) | combines data tables based on shared columns |
| pivot (`pivot_longer()`, `pivot_wider()`) | transforms table structure between wide and long |
Essentially, pipes (`%>%` or `|>`) are special operators that allow you to take the output of one operation and use it as the input of another operation. Here is a very simple example:
# take a look at this built-in dataset on the fuel economy of car models (it ships with ggplot2)
mpg
## # A tibble: 234 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
# we could use `nrow()` to count how many rows are in this table like so:
nrow(mpg)
## [1] 234
# we can do the same operation using a pipe, like this:
mpg %>% nrow()
## [1] 234
With this basic example, it is hard to justify using the pipe, but let's crank it up a notch.
# say we want to count how many rows in this dataset have "audi" as the manufacturer. We could do a series of individual operations, always saving the output, like so:
#1. subset all rows with audi
audi<-subset(mpg, manufacturer=="audi")
#2. count rows
nrow(audi)
## [1] 18
# we can of course put it all together as a series of nested functions like this:
nrow(subset(mpg, manufacturer=="audi"))
## [1] 18
# you can see however how this would get quite confusing with every level of operation you add. Pipes make this much cleaner:
mpg %>%
  subset(manufacturer=="audi") %>%
  nrow()
## [1] 18
Note that for functions that receive their input through a pipe, you no longer have to specify the `data` argument. Throughout this course we will be using pipes extensively. Hopefully I will convince you of their utility for keeping code tidy and removing redundancy.
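The reason the data argument can be dropped is that the pipe places its left-hand side in the first argument of the function on its right. A minimal sketch comparing the nested call, the magrittr pipe and the native pipe (the native `|>` assumes R 4.1 or newer; the `n_*` names are only illustrative):

```r
library(tidyverse)

# x %>% f(y) is evaluated as f(x, y): the left-hand side becomes the first argument
n_nested   <- nrow(filter(mpg, manufacturer == "audi"))
n_magrittr <- mpg %>% filter(manufacturer == "audi") %>% nrow()

# the native pipe |> behaves the same way in this simple case
n_native <- mpg |> filter(manufacturer == "audi") |> nrow()

# all three give 18
c(n_nested, n_magrittr, n_native)
```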
If you are mostly working with base R, then you will most likely rely heavily on `subset()`, Boolean operators (`TRUE/FALSE`) and row and column indices (`[row-number, column-number]`) to filter, select and sort your data. The tidyverse can do much of the same, but with a more intuitive set of functions.
# let's subset the same dataset to only include audi (`filter()` rows), and only the manufacturer, model and year columns (`select()` columns). Then we can sort (`arrange()`) it by the year.
mpg %>%
  filter(manufacturer=="audi") %>% # specify rows to keep
  select(manufacturer, model, year) %>% # specify columns to keep
  arrange(year) # specify row order based on column values
## # A tibble: 18 × 3
## manufacturer model year
## <chr> <chr> <int>
## 1 audi a4 1999
## 2 audi a4 1999
## 3 audi a4 1999
## 4 audi a4 1999
## 5 audi a4 quattro 1999
## 6 audi a4 quattro 1999
## 7 audi a4 quattro 1999
## 8 audi a4 quattro 1999
## 9 audi a6 quattro 1999
## 10 audi a4 2008
## 11 audi a4 2008
## 12 audi a4 2008
## 13 audi a4 quattro 2008
## 14 audi a4 quattro 2008
## 15 audi a4 quattro 2008
## 16 audi a4 quattro 2008
## 17 audi a6 quattro 2008
## 18 audi a6 quattro 2008
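For comparison, the base R approach mentioned earlier (logical conditions plus `[row, column]` indexing) might look like the sketch below. The object names `audi_base` and `audi_tidy` are only illustrative.

```r
library(tidyverse)

# base R: logical row index + column names, then order() for sorting
audi_base <- mpg[mpg$manufacturer == "audi", c("manufacturer", "model", "year")]
audi_base <- audi_base[order(audi_base$year), ]

# tidyverse version of the same three steps
audi_tidy <- mpg %>%
  filter(manufacturer == "audi") %>%
  select(manufacturer, model, year) %>%
  arrange(year)

# both approaches return the same rows in the same order
identical(audi_base$model, audi_tidy$model)
identical(audi_base$year, audi_tidy$year)
```

Both versions work; the piped one reads top to bottom in the order the operations happen, and never repeats the name of the data object.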
Creating a new variable in the tidyverse uses the `mutate()` function.
# in base R, you would create a new variable/column like so:
mpg$avg_consumption<-(mpg$cty+mpg$hwy)/2
# there is a fair amount of redundancy here (specifying the data object for every variable), and w
# let's calculate an average fuel consumption for city and highway driving
mpg %>%
  mutate(avg_consumption=(cty+hwy)/2)
## # A tibble: 234 × 12
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: avg_consumption <dbl>
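One detail worth noting is that `mutate()` returns a new table and leaves its input untouched, so the result has to be assigned if you want to keep it. A small sketch, in which `mpg_extra` and the `hwy_minus_cty` column are hypothetical names chosen for illustration:

```r
library(tidyverse)

mpg_extra <- mpg %>%
  mutate(
    avg_consumption = (cty + hwy) / 2,  # same calculation as in the text
    hwy_minus_cty   = hwy - cty         # a second, hypothetical column
  )

# the piped-in table is unchanged; only the assigned result has the new column
"hwy_minus_cty" %in% names(mpg)
"hwy_minus_cty" %in% names(mpg_extra)
```

Several columns can be created in a single `mutate()` call, separated by commas.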
Summarizing data in columns is achieved like so:
# mean manufacturing year:
mpg %>%
  summarise(mean=mean(year))
## # A tibble: 1 × 1
## mean
## <dbl>
## 1 2004.
# it is much more powerful however... we could for example:
# get the means of all numeric variables
mpg %>%
  summarise_if(is.numeric, mean)
## # A tibble: 1 × 6
## displ year cyl cty hwy avg_consumption
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.47 2004. 5.89 16.9 23.4 20.1
# or get multiple summary statistics at once:
mpg %>%
  summarise(mean_cty=mean(cty),
            sd_cty=sd(cty))
## # A tibble: 1 × 2
## mean_cty sd_cty
## <dbl> <dbl>
## 1 16.9 4.26
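In recent dplyr versions (1.0 and later), `summarise_if()` is marked as superseded in favour of `across()` combined with `where()`. As a sketch, the "means of all numeric variables" example could also be written as follows. It uses `ggplot2::mpg`, the unmodified dataset, so the result does not depend on whether `avg_consumption` was added earlier.

```r
library(tidyverse)

# across() + where() selects the columns, then applies the function to each
means_across <- ggplot2::mpg %>%
  summarise(across(where(is.numeric), mean))

# the older spelling used in the text
means_if <- ggplot2::mpg %>%
  summarise_if(is.numeric, mean)

means_across
```

Both spellings still work; `across()` is simply the form you are more likely to meet in newer code.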
This function is really useful when you combine it with `group_by()`. This does exactly what it says on the box: it groups data by a specified variable:
# look what happens when we group by manufacturer:
mpg %>%
  group_by(manufacturer)
## # A tibble: 234 × 12
## # Groups: manufacturer [15]
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp…
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp…
## 3 audi a4 2 2008 4 manu… f 20 31 p comp…
## 4 audi a4 2 2008 4 auto… f 21 30 p comp…
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp…
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp…
## 7 audi a4 3.1 2008 6 auto… f 18 27 p comp…
## 8 audi a4 quattro 1.8 1999 4 manu… 4 18 26 p comp…
## 9 audi a4 quattro 1.8 1999 4 auto… 4 16 25 p comp…
## 10 audi a4 quattro 2 2008 4 manu… 4 20 28 p comp…
## # ℹ 224 more rows
## # ℹ 1 more variable: avg_consumption <dbl>
## you should see that the table header now contains the grouping information (manufacturer[15])
# and now, we can get summary statistics per group:
mpg %>%
  group_by(manufacturer) %>%
  summarise_if(is.numeric, .funs = c(mean=mean, sd=sd))
## # A tibble: 15 × 13
## manufacturer displ_mean year_mean cyl_mean cty_mean hwy_mean
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 audi 2.54 2004. 5.22 17.6 26.4
## 2 chevrolet 5.06 2005. 7.26 15 21.9
## 3 dodge 4.38 2004. 7.08 13.1 17.9
## 4 ford 4.54 2003. 7.2 14 19.4
## 5 honda 1.71 2003 4 24.4 32.6
## 6 hyundai 2.43 2004. 4.86 18.6 26.9
## 7 jeep 4.58 2006. 7.25 13.5 17.6
## 8 land rover 4.3 2004. 8 11.5 16.5
## 9 lincoln 5.4 2002 8 11.3 17
## 10 mercury 4.4 2004. 7 13.2 18
## 11 nissan 3.27 2004. 5.54 18.1 24.6
## 12 pontiac 3.96 2003. 6.4 17 26.4
## 13 subaru 2.46 2004. 4 19.3 25.6
## 14 toyota 2.95 2003. 5.12 18.5 24.9
## 15 volkswagen 2.26 2003. 4.59 20.9 29.2
## # ℹ 7 more variables: avg_consumption_mean <dbl>, displ_sd <dbl>,
## # year_sd <dbl>, cyl_sd <dbl>, cty_sd <dbl>, hwy_sd <dbl>,
## # avg_consumption_sd <dbl>
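Grouped summaries do not have to cover every numeric column. A smaller sketch that names its own summary columns, counts the rows per group with `n()`, and sorts the result (the names `per_manufacturer`, `n_rows` and `mean_hwy` are illustrative):

```r
library(tidyverse)

per_manufacturer <- mpg %>%
  group_by(manufacturer) %>%
  summarise(
    n_rows   = n(),        # number of rows in each group
    mean_hwy = mean(hwy)   # mean highway value per manufacturer
  ) %>%
  arrange(desc(mean_hwy))  # highest mean first

per_manufacturer
```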
Any SQL people out there? Often, we have more than one dataset or table and we want to join them based on a reference variable. These "join" operations can go in different directions, depending on which table you want to complete (for example `left_join()`, `right_join()` or `full_join()`).
You will most likely use `left_join()`, where you wish to pull additional data from a second table into your first/primary table.
# imagine the mpg data set was our most complete data set, but we were working with a list of only manual cars and their manufacturing details, and we wanted to find out what mileage these cars have:
# our manual data subset
df1<-mpg %>%
  select(-c(cty, hwy)) %>%
  filter(str_detect(trans, "manual"))
df1
## # A tibble: 77 × 10
## manufacturer model displ year cyl trans drv fl class avg_consumption
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <chr> <chr> <dbl>
## 1 audi a4 1.8 1999 4 manu… f p comp… 25
## 2 audi a4 2 2008 4 manu… f p comp… 25.5
## 3 audi a4 2.8 1999 6 manu… f p comp… 22
## 4 audi a4 qu… 1.8 1999 4 manu… 4 p comp… 22
## 5 audi a4 qu… 2 2008 4 manu… 4 p comp… 24
## 6 audi a4 qu… 2.8 1999 6 manu… 4 p comp… 21
## 7 audi a4 qu… 3.1 2008 6 manu… 4 p comp… 20
## 8 chevrolet corve… 5.7 1999 8 manu… r p 2sea… 21
## 9 chevrolet corve… 6.2 2008 8 manu… r p 2sea… 21
## 10 chevrolet corve… 7 2008 8 manu… r p 2sea… 19.5
## # ℹ 67 more rows
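# now we can join the two, to get the mileage information for just these manual cars
# (no `by` is given, so left_join() matches on every column the two tables share;
# some rows are identical across all of those columns, hence the many-to-many warning below)
df1 %>%
  left_join(mpg)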
## Warning in left_join(., mpg): Detected an unexpected many-to-many relationship between `x` and `y`.
## ℹ Row 14 of `x` matches multiple rows in `y`.
## ℹ Row 65 of `y` matches multiple rows in `x`.
## ℹ If a many-to-many relationship is expected, set `relationship =
## "many-to-many"` to silence this warning.
## # A tibble: 79 × 12
## manufacturer model displ year cyl trans drv fl class avg_consumption
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <chr> <chr> <dbl>
## 1 audi a4 1.8 1999 4 manu… f p comp… 25
## 2 audi a4 2 2008 4 manu… f p comp… 25.5
## 3 audi a4 2.8 1999 6 manu… f p comp… 22
## 4 audi a4 qu… 1.8 1999 4 manu… 4 p comp… 22
## 5 audi a4 qu… 2 2008 4 manu… 4 p comp… 24
## 6 audi a4 qu… 2.8 1999 6 manu… 4 p comp… 21
## 7 audi a4 qu… 3.1 2008 6 manu… 4 p comp… 20
## 8 chevrolet corve… 5.7 1999 8 manu… r p 2sea… 21
## 9 chevrolet corve… 6.2 2008 8 manu… r p 2sea… 21
## 10 chevrolet corve… 7 2008 8 manu… r p 2sea… 19.5
## # ℹ 69 more rows
## # ℹ 2 more variables: cty <int>, hwy <int>
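Because `mpg` has no unique ID column, the join above is a slightly awkward illustration. A cleaner sketch uses two small made-up tables that share a single key column, with the key stated explicitly through `by`. The tables `expression` and `annotation` and all their values are hypothetical.

```r
library(tidyverse)

# two small hypothetical tables sharing the key column "gene"
expression <- tibble(gene = c("a", "b", "c"), count = c(10, 0, 25))
annotation <- tibble(gene = c("a", "b", "d"), chromosome = c("chr1", "chr2", "chr4"))

# keep every row of `expression`; pull in `chromosome` where the key matches
joined <- expression %>%
  left_join(annotation, by = "gene")

# gene "c" has no match in `annotation`, so its chromosome is NA;
# gene "d" is not in `expression`, so it is dropped
joined
```

Naming the key with `by` makes the intent explicit and avoids accidental matches on other columns that happen to share a name.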
When one variable is nested within another variable, this information can be stored as either a "wide" table, or a "long" table. (Think about multiple species in a genus, or multiple morphological measurements taken from a single animal).
One way of thinking about it is that wide tables have a single ID column and then many value columns, whereas a long table has many ID columns and only a single value column. Coming from Excel and base R, we are probably more familiar with wide tables, but the tidyverse really likes long tables. To switch between them, we use `pivot_longer()` and `pivot_wider()`.
Let's take a look at another very popular dataset, the iris dataset.
iris %>% as_tibble()
## # A tibble: 150 × 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <fct>
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## 7 4.6 3.4 1.4 0.3 setosa
## 8 5 3.4 1.5 0.2 setosa
## 9 4.4 2.9 1.4 0.2 setosa
## 10 4.9 3.1 1.5 0.1 setosa
## # ℹ 140 more rows
Is the iris dataset long or wide?
# let's reshape it!
iris %>%
  pivot_longer(-Species, names_to = "trait", values_to = "length")
## # A tibble: 600 × 3
## Species trait length
## <fct> <chr> <dbl>
## 1 setosa Sepal.Length 5.1
## 2 setosa Sepal.Width 3.5
## 3 setosa Petal.Length 1.4
## 4 setosa Petal.Width 0.2
## 5 setosa Sepal.Length 4.9
## 6 setosa Sepal.Width 3
## 7 setosa Petal.Length 1.4
## 8 setosa Petal.Width 0.2
## 9 setosa Sepal.Length 4.7
## 10 setosa Sepal.Width 3.2
## # ℹ 590 more rows
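Going back from long to wide uses `pivot_wider()`. One caveat is that the long table needs something that identifies which values belong to the same flower, which iris does not have. The sketch below therefore adds a hypothetical `flower_id` column before reshaping.

```r
library(tidyverse)

iris_long <- iris %>%
  as_tibble() %>%
  mutate(flower_id = row_number()) %>%  # hypothetical ID, one per flower
  pivot_longer(-c(Species, flower_id), names_to = "trait", values_to = "length")

# spread the trait names back into columns, one row per flower
iris_wide <- iris_long %>%
  pivot_wider(names_from = trait, values_from = length)

dim(iris_long)  # 600 rows, 4 columns
dim(iris_wide)  # 150 rows, 6 columns
```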
Although this may seem trivial or even unnecessary at first glance, it is a hugely important data transformation technique, especially in combination with `group_by()` and faceting plots (more on that later).
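As a sketch of why the long format pairs well with `group_by()`: once the trait names sit in a single column, one grouped summary covers every trait at once. The names `trait_means` and `mean_length` are illustrative.

```r
library(tidyverse)

trait_means <- iris %>%
  pivot_longer(-Species, names_to = "trait", values_to = "length") %>%
  group_by(Species, trait) %>%
  summarise(mean_length = mean(length), .groups = "drop")

# one row per species and trait: 3 species x 4 traits = 12 rows
trait_means
```

In the wide format the same result would need one summary expression per measurement column.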
A package of the tidyverse that many of you may know already is `ggplot2`. Building a plot with ggplot takes three general steps: defining the dataset, mapping variables to aesthetics, and adding one or more geoms.
# to build a plot we have to define two basic aspects
# 1. what is our dataset? - defined by "data="
# 2. what variables do we want to plot? - defined by mapping the aesthetics, or "mapping=aes()"
ggplot(data=mpg,
       mapping=aes(x=displ, y=hwy))
Once the plot has been created, you can add any plot layer you like, using `geoms`. For example, the x and y data as points:
ggplot(data=mpg,
       mapping=aes(x=displ, y=hwy)) +
  geom_point()
Different geoms allow for different data visualisations:
# line graph
ggplot(data=mpg,
       mapping=aes(x=displ, y=hwy)) +
  geom_line()
# boxplot (categorical x axis)
ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy)) +
  geom_boxplot()
# histogram
ggplot(data=mpg,
       mapping=aes(x=hwy)) +
  geom_histogram()
Different styling can be added at different parts of the build.
Adding a fixed colour is done outside the `aes()`:
ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy)) +
  geom_boxplot(fill="blue")
Adding a conditional colour is done inside the `aes()`:
ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy, fill=manufacturer)) +
  geom_boxplot()
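The same inside/outside distinction applies to every aesthetic, not just `fill`. A minimal sketch with points and the `colour` aesthetic (the object names `p_fixed` and `p_mapped` are illustrative):

```r
library(tidyverse)

# fixed colour: set outside aes(), so every point is blue and there is no legend
p_fixed <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(colour = "blue")

# mapped colour: set inside aes(), so each value of `class` gets its own colour
p_mapped <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, colour = class)) +
  geom_point()

# a common mistake: aes(colour = "blue") treats the string "blue" as a category,
# so ggplot picks its own default colour and adds a legend

p_fixed
p_mapped
```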
General theme elements can be manipulated either with canned theme functions or manually:
ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy, fill=manufacturer)) +
  geom_boxplot() +
  # apply a canned theme
  theme_classic() +
  # edit the theme by e.g. removing the legend
  theme(legend.position = "none")
ggplot is great for organizing multiple plots for groups of data that share one or both axes. This is done with `facet_wrap()` or `facet_grid()`:
# two plots that share the same x axis:
ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy)) +
  geom_bar(stat="identity") +
  facet_wrap(~year, ncol=1)
# multiple plots arranged in a grid by two grouping variables (class in rows, year in columns)
ggplot(data=mpg,
       mapping=aes(x=manufacturer, y=hwy)) +
  geom_bar(stat="identity") +
  facet_grid(class~year)
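Faceting is also where the long table format pays off. As a sketch, reshaping iris to long format lets a single `facet_wrap()` call draw one panel per measured trait. The `scales = "free_y"` setting is a choice made here so that each panel gets its own y range; `iris_long` and `p_facets` are illustrative names.

```r
library(tidyverse)

iris_long <- iris %>%
  pivot_longer(-Species, names_to = "trait", values_to = "length")

p_facets <- ggplot(data = iris_long,
                   mapping = aes(x = Species, y = length, fill = Species)) +
  geom_boxplot() +
  facet_wrap(~trait, scales = "free_y") +  # one panel per trait
  theme_classic() +
  theme(legend.position = "none")

p_facets
```

With the original wide table, the same figure would need four separate plots, one per measurement column.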