Statistics using R

R is first and foremost a programming language that was developed for statistics. As such, the possibilities for different types of data analysis are massive and continue to grow. It would be impossible to provide a comprehensive tutorial on statistics in R in the limited time we have. The purpose of this section is therefore to introduce you to some common analyses and the typical syntax of running statistical models in R.

1. Formulating Generalized Linear Models

Perhaps the most frequently used types of statistical models in biology are generalized linear models (GLMs). As a reminder, special cases include linear regression models, ANOVAs, and Student's t-tests. The typical syntax for such a model is:

glm(y ~ x, data, family)
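
The family argument specifies the error distribution (and its default link function) for the response. For instance, a binary response calls for a binomial family and count data for a poisson family; a quick sketch using two of R's built-in datasets:

# logistic regression: transmission type (0/1) as a function of car weight
glm(am ~ wt, data=mtcars, family="binomial")

# Poisson regression: insect counts as a function of spray type
glm(count ~ spray, data=InsectSprays, family="poisson")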

Let's try this with the iris dataset:

# load the built-in iris dataset into our environment
iris<-iris

# run a generalized linear model with a gaussian family

mod1<-glm(Sepal.Length~Sepal.Width, data=iris, family="gaussian")

Let's look at the structure of this model's output. This can be done in the "Environment" pane, or like so:

str(mod1)
## List of 30
##  $ coefficients     : Named num [1:2] 6.526 -0.223
##   ..- attr(*, "names")= chr [1:2] "(Intercept)" "Sepal.Width"
##  $ residuals        : Named num [1:150] -0.644 -0.956 -1.111 -1.234 -0.722 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ fitted.values    : Named num [1:150] 5.74 5.86 5.81 5.83 5.72 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ effects          : Named num [1:150] -71.566 -1.188 -1.081 -1.187 -0.759 ...
##   ..- attr(*, "names")= chr [1:150] "(Intercept)" "Sepal.Width" "" "" ...
##  $ R                : num [1:2, 1:2] -12.25 0 -37.44 5.32
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "(Intercept)" "Sepal.Width"
##   .. ..$ : chr [1:2] "(Intercept)" "Sepal.Width"
##  $ rank             : int 2
##  $ qr               :List of 5
##   ..$ qr   : num [1:150, 1:2] -12.2474 0.0816 0.0816 0.0816 0.0816 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:150] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:2] "(Intercept)" "Sepal.Width"
##   ..$ rank : int 2
##   ..$ qraux: num [1:2] 1.08 1.02
##   ..$ pivot: int [1:2] 1 2
##   ..$ tol  : num 1e-11
##   ..- attr(*, "class")= chr "qr"
##  $ family           :List of 12
##   ..$ family    : chr "gaussian"
##   ..$ link      : chr "identity"
##   ..$ linkfun   :function (mu)  
##   ..$ linkinv   :function (eta)  
##   ..$ variance  :function (mu)  
##   ..$ dev.resids:function (y, mu, wt)  
##   ..$ aic       :function (y, n, mu, wt, dev)  
##   ..$ mu.eta    :function (eta)  
##   ..$ initialize:  expression({  n <- rep.int(1, nobs)  if (is.null(etastart) && is.null(start) && is.null(mustart) && ((family$link| __truncated__
##   ..$ validmu   :function (mu)  
##   ..$ valideta  :function (eta)  
##   ..$ dispersion: num NA
##   ..- attr(*, "class")= chr "family"
##  $ linear.predictors: Named num [1:150] 5.74 5.86 5.81 5.83 5.72 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ deviance         : num 101
##  $ aic              : num 372
##  $ null.deviance    : num 102
##  $ iter             : int 2
##  $ weights          : Named num [1:150] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ prior.weights    : Named num [1:150] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ df.residual      : int 148
##  $ df.null          : int 149
##  $ y                : Named num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##   ..- attr(*, "names")= chr [1:150] "1" "2" "3" "4" ...
##  $ converged        : logi TRUE
##  $ boundary         : logi FALSE
##  $ model            :'data.frame':   150 obs. of  2 variables:
##   ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##   ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language Sepal.Length ~ Sepal.Width
##   .. .. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width)
##   .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:2] "Sepal.Length" "Sepal.Width"
##   .. .. .. .. ..$ : chr "Sepal.Width"
##   .. .. ..- attr(*, "term.labels")= chr "Sepal.Width"
##   .. .. ..- attr(*, "order")= int 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Sepal.Width"
##  $ call             : language glm(formula = Sepal.Length ~ Sepal.Width, family = "gaussian", data = iris)
##  $ formula          :Class 'formula'  language Sepal.Length ~ Sepal.Width
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##  $ terms            :Classes 'terms', 'formula'  language Sepal.Length ~ Sepal.Width
##   .. ..- attr(*, "variables")= language list(Sepal.Length, Sepal.Width)
##   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:2] "Sepal.Length" "Sepal.Width"
##   .. .. .. ..$ : chr "Sepal.Width"
##   .. ..- attr(*, "term.labels")= chr "Sepal.Width"
##   .. ..- attr(*, "order")= int 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(Sepal.Length, Sepal.Width)
##   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
##   .. .. ..- attr(*, "names")= chr [1:2] "Sepal.Length" "Sepal.Width"
##  $ data             :'data.frame':   150 obs. of  5 variables:
##   ..$ Sepal.Length: num [1:150] 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##   ..$ Sepal.Width : num [1:150] 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##   ..$ Petal.Length: num [1:150] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##   ..$ Petal.Width : num [1:150] 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##   ..$ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ offset           : NULL
##  $ control          :List of 3
##   ..$ epsilon: num 1e-08
##   ..$ maxit  : num 25
##   ..$ trace  : logi FALSE
##  $ method           : chr "glm.fit"
##  $ contrasts        : NULL
##  $ xlevels          : Named list()
##  - attr(*, "class")= chr [1:2] "glm" "lm"

As this is generally a little overwhelming, R prints just the most important information when you call the model object, or the output can be summarized:

# call just the model
mod1
## 
## Call:  glm(formula = Sepal.Length ~ Sepal.Width, family = "gaussian", 
##     data = iris)
## 
## Coefficients:
## (Intercept)  Sepal.Width  
##      6.5262      -0.2234  
## 
## Degrees of Freedom: 149 Total (i.e. Null);  148 Residual
## Null Deviance:       102.2 
## Residual Deviance: 100.8     AIC: 372
# summarize model:
summary(mod1)
## 
## Call:
## glm(formula = Sepal.Length ~ Sepal.Width, family = "gaussian", 
##     data = iris)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.5262     0.4789   13.63   <2e-16 ***
## Sepal.Width  -0.2234     0.1551   -1.44    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.6807844)
## 
##     Null deviance: 102.17  on 149  degrees of freedom
## Residual deviance: 100.76  on 148  degrees of freedom
## AIC: 371.99
## 
## Number of Fisher Scoring iterations: 2

This coefficient table should look more familiar to us. It is also worth pointing out that a GLM with a gaussian family is simply a linear regression model, and indeed, we can use a shortcut to run such a model:

mod2<-lm(Sepal.Length~Sepal.Width, data=iris)
summary(mod2)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width, data = iris)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5561 -0.6333 -0.1120  0.5579  2.2226 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.5262     0.4789   13.63   <2e-16 ***
## Sepal.Width  -0.2234     0.1551   -1.44    0.152    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8251 on 148 degrees of freedom
## Multiple R-squared:  0.01382,    Adjusted R-squared:  0.007159 
## F-statistic: 2.074 on 1 and 148 DF,  p-value: 0.1519

The results should be identical. What can we conclude from this model? Is it odd that we get a negative regression coefficient? (Be sure to also look at the R-squared!)

It is always a good idea to visualize this relationship as well. For this we will need to load the tidyverse again.

library(tidyverse)

iris %>%
  ggplot(aes(x=Sepal.Width, y=Sepal.Length)) +
  geom_point()

To fit the regression model, we could use the canned functionality of geom_smooth.

iris %>%
  ggplot(aes(x=Sepal.Width, y=Sepal.Length)) +
  geom_point() +
  geom_smooth(method="lm")

If we run more complicated models, we may have to plot our own regression slopes. We can do this because we know the y-intercept and the slope.

# the coefficients we need:
coef(mod1)
## (Intercept) Sepal.Width 
##   6.5262226  -0.2233611
# the abline geom:
iris %>%
  ggplot(aes(x=Sepal.Width, y=Sepal.Length)) +
  geom_point() +
  geom_abline(intercept=coef(mod1)[1], slope=coef(mod1)[2])

We already saw that the R-squared was very low, meaning the model is a poor fit. Why is this misleading? Because there is a strong species effect. What would the slopes look like if we ran models for individual species?
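
One quick way to check numerically, as a sketch using base R: fit a separate regression per species and compare the coefficients.

# fit one model per species and extract the intercept and slope
lapply(split(iris, iris$Species), function(d) coef(lm(Sepal.Length~Sepal.Width, data=d)))

Plotting the data colored by species makes the pattern clear: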

iris %>%
  ggplot(aes(x=Sepal.Width, y=Sepal.Length, color=Species)) +
  geom_point()

We should therefore probably include species as another variable in our model, to build a multiple regression model.

mod3<-lm(Sepal.Length~Sepal.Width+Species, data=iris)
summary(mod3)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.30711 -0.25713 -0.05325  0.19542  1.41253 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.2514     0.3698   6.089 9.57e-09 ***
## Sepal.Width         0.8036     0.1063   7.557 4.19e-12 ***
## Speciesversicolor   1.4587     0.1121  13.012  < 2e-16 ***
## Speciesvirginica    1.9468     0.1000  19.465  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.438 on 146 degrees of freedom
## Multiple R-squared:  0.7259, Adjusted R-squared:  0.7203 
## F-statistic: 128.9 on 3 and 146 DF,  p-value: < 2.2e-16

We now see that the R-squared is greatly improved, that sepal width is positively correlated with sepal length, and that the intercepts for the species are significantly different from each other.

iris %>%
  ggplot(aes(x=Sepal.Width, y=Sepal.Length, color=Species)) +
  geom_point() +
  geom_smooth(method="lm")

This means that sepal length likely differs significantly between the species (the y-intercepts), but judging from the plot, the relationship between sepal width and length is the same across species. We could test this by including an interaction effect:

mod4<-lm(Sepal.Length~Sepal.Width*Species, data=iris)
summary(mod4)
## 
## Call:
## lm(formula = Sepal.Length ~ Sepal.Width * Species, data = iris)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.26067 -0.25861 -0.03305  0.18929  1.44917 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.6390     0.5715   4.618 8.53e-06 ***
## Sepal.Width                     0.6905     0.1657   4.166 5.31e-05 ***
## Speciesversicolor               0.9007     0.7988   1.128    0.261    
## Speciesvirginica                1.2678     0.8162   1.553    0.123    
## Sepal.Width:Speciesversicolor   0.1746     0.2599   0.672    0.503    
## Sepal.Width:Speciesvirginica    0.2110     0.2558   0.825    0.411    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4397 on 144 degrees of freedom
## Multiple R-squared:  0.7274, Adjusted R-squared:  0.718 
## F-statistic: 76.87 on 5 and 144 DF,  p-value: < 2.2e-16

We see that there is no significant interaction term, which we can interpret as all slopes being the same.
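
We could also compare the additive and interaction models formally with an F-test; a sketch using R's standard anova() model comparison:

# does the interaction significantly improve the fit?
anova(mod3, mod4)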

2. Choose the right test

[Figure: choosing the right statistical test (image source)]

But there's more:

  • machine learning (from LDA to random forest)

  • Bayesian statistics

3. Categorical predictors: t-test and ANOVA

A very common problem in biology is to test whether the difference in response between two or more groups is statistically significant. Essentially, we are comparing within-group and between-group variances.

[Figure: within-group vs. between-group variance (image source)]

For two groups, this is usually achieved with a t-test, and for more than two groups, with an ANOVA. These are special cases of a GLM and could be run just as we did above. However, because they are quite specific hypothesis tests, R provides functions to execute them directly.
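
To illustrate the equivalence, the multi-group comparison we run below as an ANOVA could equally be expressed as a GLM; a sketch:

# an ANOVA expressed as a GLM with a categorical predictor
summary(glm(Sepal.Length~Species, data=iris, family="gaussian"))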

t-test

The basic idea behind a t-test is to use statistics to evaluate two competing hypotheses:

  • H0 (null hypothesis): the group means are equal
  • H1 (alternative hypothesis): the group means are different

The t-test is commonly used with small sample sizes and assumes normally distributed data (something we will address later). You can compare a sample mean against a fixed value (a one-sample t-test) or compare the means of two groups (a two-sample t-test).
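
For reference, a one-sample test against a fixed value would look like this (a sketch; the reference value of 5.8 is arbitrary):

# does the overall mean sepal length differ from 5.8?
t.test(iris$Sepal.Length, mu = 5.8)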

Now let's try a two-sample t-test: is sepal length different between I. versicolor and I. virginica?

# filter data to only two groups. We have to also remember to drop the factor levels!
iris2<-iris %>%
  filter(Species!="setosa") %>%
  mutate(Species=fct_drop(Species))

# run the two-sample t-test
t.test(iris2$Sepal.Length~iris2$Species)
## 
##  Welch Two Sample t-test
## 
## data:  iris2$Sepal.Length by iris2$Species
## t = -5.6292, df = 94.025, p-value = 1.866e-07
## alternative hypothesis: true difference in means between group versicolor and group virginica is not equal to 0
## 95 percent confidence interval:
##  -0.8819731 -0.4220269
## sample estimates:
## mean in group versicolor  mean in group virginica 
##                    5.936                    6.588

To be specific, this is a two-tailed t-test, which is the default.

Looking at the function's documentation, could you figure out how to run a one-tailed test?
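
As a hint, a sketch using the alternative argument (here testing whether the first group's mean, versicolor, is smaller than virginica's):

# one-tailed test: alternative = "less" or "greater" selects the tail
t.test(Sepal.Length ~ Species, data = iris2, alternative = "less")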

ANOVA

In an ANOVA, the hypotheses are essentially the same, but the independent variable has more than two factor levels.

aov1<-aov(iris$Sepal.Length~iris$Species)
aov1
## Call:
##    aov(formula = iris$Sepal.Length ~ iris$Species)
## 
## Terms:
##                 iris$Species Residuals
## Sum of Squares      63.21213  38.95620
## Deg. of Freedom            2       147
## 
## Residual standard error: 0.5147894
## Estimated effects may be unbalanced
# to get the test statistics, we have to summarize this model first:
summary(aov1)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## iris$Species   2  63.21  31.606   119.3 <2e-16 ***
## Residuals    147  38.96   0.265                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

You may also be interested in testing exactly which groups differ from each other. This is achieved with a post-hoc test.

TukeyHSD(aov1)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = iris$Sepal.Length ~ iris$Species)
## 
## $`iris$Species`
##                       diff       lwr       upr p adj
## versicolor-setosa    0.930 0.6862273 1.1737727     0
## virginica-setosa     1.582 1.3382273 1.8257727     0
## virginica-versicolor 0.652 0.4082273 0.8957727     0

An appropriate way to visualize these kinds of problems is with a boxplot.

iris %>%
  ggplot(aes(x=Species, y=Sepal.Length)) +
  geom_boxplot()

4. Check assumptions

It is important to understand that as long as the syntax is correct and the data is formatted correctly, R will run any statistical test you ask it to. It is therefore up to you to understand the assumptions of the method you are applying. These will differ between tests, but for linear-regression type problems, like the ANOVA, there are generally three assumptions we should be aware of:

Normality

The data for each factor level should be normally distributed. We can check this visually with a histogram or a density plot.

# histogram
iris %>%
  ggplot(aes(x=Sepal.Length)) +
  geom_histogram(bins = 20) # refine the number of bins

# how do we show this per factor level? or as densities?
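
One possible answer, as a sketch: facet the histogram by species, or overlay per-species density curves.

# histograms per factor level
iris %>%
  ggplot(aes(x=Sepal.Length)) +
  geom_histogram(bins = 20) +
  facet_wrap(~Species)

# overlaid density plots
iris %>%
  ggplot(aes(x=Sepal.Length, fill=Species)) +
  geom_density(alpha = 0.5)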

Are we convinced that the data is normal? This is a fairly subjective approach and may not be very accurate if sample sizes are small. It may be more appropriate to look at the model residuals instead. If the data is normally distributed, then the residuals (the distances between the points and the regression line) should closely follow the quantiles of a theoretical normal distribution. This can be visually inspected with a Q-Q plot.

library(car)
qqPlot(aov1$residuals,
  id = FALSE 
)

If the data is normally distributed, the points should fall close to the line. In this case, we can be fairly happy with the distribution. To confirm this, we can run a hypothesis test, like a Shapiro-Wilk test. Here, as with the previous null hypotheses we have seen, we are testing whether the sample deviates significantly from a normal distribution, i.e. if the data is normally distributed, we expect no significant difference.

shapiro.test(residuals(aov1))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(aov1)
## W = 0.9879, p-value = 0.2189

Homogeneity of variance

A second assumption of an ANOVA is that the variances are the same across groups. We can test this again by looking at the residuals.

plot(aov1, which=3)

We should see no relationship; in other words, the points should be scattered randomly along the y-axis and the red line should be horizontal. We can see that there is some deviation. As with normality, we can apply a formal test, in this case Levene's test.

leveneTest(Sepal.Length ~ Species, data = iris)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value   Pr(>F)   
## group   2  6.3527 0.002259 **
##       147                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this case, it is significant, meaning that our data does not conform to the homogeneity of variance assumption.

This means we should probably not rely on an ANOVA. We could instead try a test with more relaxed assumptions, e.g. a Welch one-way test:

oneway.test(Sepal.Length ~ Species, data = iris)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  Sepal.Length and Species
## F = 138.91, num df = 2.000, denom df = 92.211, p-value < 2.2e-16

Independence of observations

Each data point is assumed to be completely independent of the others. This assumption may be violated if we take repeated measures on the same individuals, take climate measurements from locations that are geographically close to each other (spatial autocorrelation), or compare species, which are inherently evolutionarily related. We will not address these issues in more detail here, but they should be kept in mind. Specific solutions exist for non-independent data (e.g. phylogenetic linear models).

What to do when assumptions are not met?

  • Transform data (e.g. log transform)

  • Choose a different test (e.g. a non-parametric test such as Kruskal-Wallis; see the sketch below)
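
For example, the Kruskal-Wallis test is the non-parametric counterpart of the one-way ANOVA; a sketch:

# non-parametric alternative to the one-way ANOVA
kruskal.test(Sepal.Length ~ Species, data = iris)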

5. Data Exploration

Although we are getting to this last, a good first step in any data analysis is to explore your data. This is especially true for datasets with many variables. Exploration will help you find patterns, but also screen for problematic data (e.g. outliers, non-normal distributions, inter-correlations).

Pair-plots

A useful tool for this is the pair-plot, i.e. plotting each variable against every other. We can do this with the library GGally.

library(GGally)

iris %>%
  ggpairs()

These plots are really useful and can be customized quite extensively. For example, instead of getting correlation coefficients for the entire dataset, we can get them per species:

iris %>%
  ggpairs(aes(color=Species)) + # note that ggplot layers can be added just like before. E.g. a theme!
  theme_bw()

Principal Component Analysis

Very often in biological data, we are working with many inter-correlated variables.

Why are these problematic?

A PCA is a good way to visualize "high-dimensional" data and can also be a useful pre-processing step for further statistical analyses. Let's try this with a morphological dataset on birds.

birds<-read_csv("../data/bird_sizes.csv")

birds
## # A tibble: 3,770 × 41
##    Order  Family Species English_name Subspecies M_mass M_mass_N F_mass F_mass_N
##    <chr>  <chr>  <chr>   <chr>        <chr>       <dbl>    <dbl>  <dbl>    <dbl>
##  1 Strut… Strut… Struth… Ostrich      <NA>       115000       NA 100000       NA
##  2 Casua… Droma… Dromai… Emu          <NA>        32700       NA  38300       NA
##  3 Apter… Apter… Aptery… Brown Kiwi   <NA>         2208       11   2535       26
##  4 Apter… Apter… Aptery… Little Spot… <NA>         1135       51   1351       41
##  5 Apter… Apter… Aptery… Great Spott… <NA>         2610       12   3270        7
##  6 Tinam… Tinam… Tinamu… Great Tinam… <NA>         1059       NA    991       NA
##  7 Tinam… Tinam… Cryptu… Little Tina… <NA>          198        6    235        7
##  8 Tinam… Tinam… Cryptu… Slaty-breas… <NA>          418       22    468       18
##  9 Tinam… Tinam… Nothop… Brushland T… <NA>          475       NA    577       NA
## 10 Galli… Craci… Ortali… Plain Chach… <NA>          584      106    542      102
## # ℹ 3,760 more rows
## # ℹ 32 more variables: unsexed_mass <dbl>, unsexed_mass_N <dbl>, M_wing <dbl>,
## #   M_wing_N <dbl>, F_wing <dbl>, F_wing_N <dbl>, Unsexed_wing <dbl>,
## #   Unsexed_wing_N <dbl>, M_tarsus <dbl>, M_tarsus_N <dbl>, F_tarsus <dbl>,
## #   F_tarsus_N <dbl>, Unsexed_tarsus <dbl>, Unsexed_tarsus_N <dbl>,
## #   M_bill <dbl>, M_bill_N <dbl>, F_bill <dbl>, F_bill_N <dbl>,
## #   Unsexed_bill <dbl>, Unsexed_bill_N <dbl>, M_tail <dbl>, M_tail_N <dbl>, …

As you can see, this dataset has mean measurements for birds, but also a lot of missing data. In general, PCAs do not accept NA values. So let's select only the morphological measurements of female birds, and filter out any row that contains an NA.

birds_clean<-birds %>%
  select(Order, Family, Species, F_mass, F_wing, F_tarsus,F_bill, F_tail) %>%
  drop_na()

birds_clean
## # A tibble: 1,401 × 8
##    Order       Family       Species         F_mass F_wing F_tarsus F_bill F_tail
##    <chr>       <chr>        <chr>            <dbl>  <dbl>    <dbl>  <dbl>  <dbl>
##  1 Galliformes Cracidae     Ortalis vetula    542    200      56.6   21.7  223. 
##  2 Galliformes Megapodiidae Alectura latha…  2210    307     111.    36.2  262. 
##  3 Galliformes Phasianidae  Ammoperdix gri…   184.   130      31.6   11.1   56  
##  4 Galliformes Phasianidae  Alectoris chuk…   501    154      44.1   14.3   78.9
##  5 Galliformes Phasianidae  Alectoris rufa    439    157      41.4   13.3   84.4
##  6 Galliformes Phasianidae  Francolinus fr…   425    167      52.4   26.6   88.3
##  7 Galliformes Phasianidae  Perdix perdix     386    154      41.4   12.4   72.3
##  8 Galliformes Phasianidae  Coturnix cotur…   103    113      26.6    9     39  
##  9 Galliformes Phasianidae  Coturnix pecto…   106    104.     24.6   12.2   34.3
## 10 Galliformes Phasianidae  Coturnix ypsil…   103    100.     22.6   13.2   47  
## # ℹ 1,391 more rows

We are now left with 1401 entries and five measurement variables. Let's explore these:

birds_clean %>%
  select(-c(Order, Family, Species)) %>% # remove the grouping variables we don't want to plot (how would you use select_if() instead?)
  ggpairs()

You can see that these variables are all quite strongly inter-correlated (i.e. bigger birds also have bigger wings, etc.), and they seem to follow a log-normal distribution. Both would be problematic for linear models. So let's log-transform the data and run a PCA. Let's also reduce the dataset to fewer orders to make this exercise a bit easier, keeping only owls, pigeons, shorebirds, woodpeckers and parrots.

birds_clean_log<- birds_clean %>%
  mutate_if(is.numeric, log10)  %>% # use a clever version of mutate to alter multiple columns of the same type, and log10 transform them
  filter(Order %in% c("Strigiformes","Columbiformes","Charadriiformes","Piciformes","Psittaciformes")) 

Let's quickly plot that again.

birds_clean_log %>%
  select_if(is.numeric) %>% # using select_if() here
  ggpairs()

Looks much better already!

Let's run a PCA:

birds_pca<-prcomp(birds_clean_log %>%
                    select_if(is.numeric),
                  scale. = T, center = T)

Let's explore the loadings:

birds_pca
## Standard deviations (1, .., p=5):
## [1] 1.9262871 0.9132549 0.4963923 0.3784397 0.2564403
## 
## Rotation (n x k) = (5 x 5):
##                PC1         PC2         PC3         PC4           PC5
## F_mass   0.4937513  0.09180214 -0.27457831 -0.57459305  0.5850056385
## F_wing   0.4977187  0.17261233 -0.06495641 -0.31481212 -0.7868640705
## F_tarsus 0.4405448 -0.44106832 -0.53953117  0.56593741  0.0000198325
## F_bill   0.4212378 -0.48883957  0.75836699  0.01409182  0.0909700969
## F_tail   0.3700875  0.72682320  0.23275102  0.50025236  0.1741774341

And the components:

summary(birds_pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5
## Standard deviation     1.9263 0.9133 0.49639 0.37844 0.25644
## Proportion of Variance 0.7421 0.1668 0.04928 0.02864 0.01315
## Cumulative Proportion  0.7421 0.9089 0.95820 0.98685 1.00000
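
The drop-off in variance explained can also be visualized with a scree plot; a sketch using base R:

# scree plot of the variance explained by each component
screeplot(birds_pca, type = "lines")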

And let's visualize this. We could plot the first two components using our knowledge of ggplot:

str(birds_pca) # see where the components are stored
## List of 5
##  $ sdev    : num [1:5] 1.926 0.913 0.496 0.378 0.256
##  $ rotation: num [1:5, 1:5] 0.494 0.498 0.441 0.421 0.37 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:5] "F_mass" "F_wing" "F_tarsus" "F_bill" ...
##   .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
##  $ center  : Named num [1:5] 2.15 2.23 1.47 1.41 1.95
##   ..- attr(*, "names")= chr [1:5] "F_mass" "F_wing" "F_tarsus" "F_bill" ...
##  $ scale   : Named num [1:5] 0.513 0.219 0.224 0.269 0.245
##   ..- attr(*, "names")= chr [1:5] "F_mass" "F_wing" "F_tarsus" "F_bill" ...
##  $ x       : num [1:301, 1:5] -1.81 -2.65 -2.37 -2.16 -2.13 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:5] "PC1" "PC2" "PC3" "PC4" ...
##  - attr(*, "class")= chr "prcomp"
birds_pca$x %>%
  as_tibble() %>% # required, because ggplot doesn't accept matrices
  ggplot(aes(x=PC1, y=PC2)) +
  geom_point()

On its own, we can glean little information from this. PCAs become especially useful when we can assign groups to the data. Let's leverage a new library to make this a little easier.

library(ggfortify)

autoplot(birds_pca,
         loadings = TRUE, loadings.label = TRUE, # include loading arrows
         data = birds_clean_log, colour = 'Order', frame=T) # colour points by a grouping column in our original data and add convex hulls

What can we glean from this? Is this clearer when we plot other components?

autoplot(birds_pca, x=2,y=3,
         loadings = TRUE,loadings.label = TRUE,
         data = birds_clean_log, colour = 'Order', frame=T)