R for Biology Data Science - Session 4 - Introduction to dplyr part 1

Workshop Series

Introduction

Welcome to the fourth session of our workshop series. The purpose of this page is to serve as a guide for the material we will cover during the session. We will learn to summarize and transform data using the dplyr package in our session today.

library(tidyverse)

For more information about dplyr, check out the help file

?dplyr #note you need to index the tidyverse package for this to work

Download dplyr Cheatsheet

Lets start by downloading the cheatsheet for dplyr.

RStudio has produced many ‘cheatsheets’ for various packages that are super helpful. Each package has many functions and it is impossible to remember them all.

dplyr cheatsheet - click to download.

Index our data

data <- read.csv('Silver Tree Study.csv')

factor variables

Just to make sure we are on the same page before moving on, lets set our check that R reads our data how we want it to.

str(data)

## 'data.frame':    1722 obs. of  14 variables:
##  $ Unique.Sample.Number  : Factor w/ 102 levels "1DroughtBoth Pathogens9",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Days.after.inoculation: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Date                  : Factor w/ 10 levels "6/12/18","6/16/18",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Licor                 : int  6400 6400 6400 6400 6400 6400 6400 6400 6400 6400 ...
##  $ Trial                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment             : Factor w/ 2 levels "Drought","Wet": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Species               : Factor w/ 4 levels "Both Pathogens",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Plant.Number          : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ Isolate.Number        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Unique.Sample.Number.1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Obs                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Photosynthesis        : num  -1.257 -0.456 -0.675 0.196 1.132 ...
##  $ Conductance           : num  -0.008756 -0.001057 -0.000624 0.004572 0.008387 ...
##  $ Ci                    : num  170 -286 -1315 328 182 ...

Ok, currently R has read columns ‘Days.after.inoculation’ and ‘Trial’ as intergers, but for this session, we want R to read them as factors.

Lets change them to factors for now.

data$Days.after.inoculation <- as.factor(data$Days.after.inoculation)
data$Trial <- as.factor(data$Trial)

str(data)

## 'data.frame':    1722 obs. of  14 variables:
##  $ Unique.Sample.Number  : Factor w/ 102 levels "1DroughtBoth Pathogens9",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Days.after.inoculation: Factor w/ 10 levels "3","5","6","9",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Date                  : Factor w/ 10 levels "6/12/18","6/16/18",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Licor                 : int  6400 6400 6400 6400 6400 6400 6400 6400 6400 6400 ...
##  $ Trial                 : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment             : Factor w/ 2 levels "Drought","Wet": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Species               : Factor w/ 4 levels "Both Pathogens",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Plant.Number          : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ Isolate.Number        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Unique.Sample.Number.1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Obs                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Photosynthesis        : num  -1.257 -0.456 -0.675 0.196 1.132 ...
##  $ Conductance           : num  -0.008756 -0.001057 -0.000624 0.004572 0.008387 ...
##  $ Ci                    : num  170 -286 -1315 328 182 ...

Ok, now R recognizes those columns as factors. Yay.

dplyr

Ok now lets get into the real content of this session. Dplyr is a package that will empower you to quickly and safely (low risk of mistakes) summarize your data.

Lets again look at our data.

head(data) #as_tibble is another way to view a table of our data

##   Unique.Sample.Number Days.after.inoculation   Date Licor Trial Treatment
## 1     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 2     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 3     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 4     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 5     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 6     1DroughtControl9                      3 6/2/18  6400     1   Drought
##   Species Plant.Number Isolate.Number Unique.Sample.Number.1 Obs
## 1 Control            9             NA                      1   1
## 2 Control            9             NA                      1   2
## 3 Control            9             NA                      1   3
## 4 Control            9             NA                      1   4
## 5 Control            9             NA                      1   5
## 6 Control            9             NA                      1   6
##   Photosynthesis  Conductance         Ci
## 1     -1.2574071 -0.008756114   169.6851
## 2     -0.4559930 -0.001057429  -286.0282
## 3     -0.6746335 -0.000623729 -1315.1230
## 4      0.1958272  0.004572455   328.0515
## 5      1.1315593  0.008387323   181.6788
## 6      1.4892968  0.010670317   173.9482

Ok so we can see that the first 10 or so rows are from the same plant (see below) and measured on the same day (day 3). We took at least 10 measurements per plant. Therefore, we need to calculate the mean value per plant (to avoid psuedoreplication). The unique sample number is just a combination of 3 columns using the stringr package in the tidyverse (not going into that now).

The first sample plant is:
- Trial = 1
- Treatment = Drought
- Species = Control
- Plant number = 9.

Pipes

In general, the dyplr package uses ‘pipes’ (%>%), which eventually the entire tidyverse will use (including ggvis, the next version of ggplot2).

You can read the %>% as ‘THEN’.

For example the below code can be interpreted as

Data THEN Filter THEN Group_by THEN Summarise

summary <- data %>% filter(Photosynthesis<200) %>% group_by(Species) %>% summarize(mean=mean(Photosynthesis)) #more about this later

The pipes essentially let us do everything in one go, without having to create a named variable for every step..

We will build up to the command above.

Filter Data

We have already used the filter command in the first two sessions, but lets take another look.

Remember that we included a few plants from species=“Both Pathogens” in the first few weeks of the experiment.

levels(data$Species) #we can remind ourselves about the levels of a factor using the levels command

## [1] "Both Pathogens"       "Control"              "Exotic Pathogen "    
## [4] "Indigenous Pathogen "

Lets remove the “Both Pathogens” from our dataset for now, just to simplify our stats going forward

data.filtered <- data %>% filter (Photosynthesis<200 & Species!="Both Pathogens") #note the '!=' means 'does not equal' 
#Here we used & to filter the data by two variables at the same time

Lets check that that worked

levels(data.filtered$Species)

## [1] "Both Pathogens"       "Control"              "Exotic Pathogen "    
## [4] "Indigenous Pathogen "

Oops.. the level is still there

summary(data.filtered$Species)

##       Both Pathogens              Control     Exotic Pathogen  
##                    0                  563                  508 
## Indigenous Pathogen  
##                  569

Ok, notice that all of the obesrvations for “Both Pathogens” were filtered out, but the level still remains in our factor. (its a silly thing R does).

Even though there are no observations, the level of the factor will still appear in our legend if we made a plot now

Drop Levels

We can completely drop that level of the factor by including the ‘droplevels’ command.

data.filtered <- data %>% filter (Photosynthesis<200 & Species!="Both Pathogens") %>% droplevels()

levels(data.filtered$Species)

## [1] "Control"              "Exotic Pathogen "     "Indigenous Pathogen "

Blam.

Note that the indigenous pathogen and exotic pathogen levels actually have spaces behind them.. Whoops.

Recode

Suppose we want to change the names of some of our levels in our factors. Sometimes it is easier to do in excel from the start, but if you only want to make one change, it is easier to just use one line of code

data.filtered <- data.filtered %>% mutate(Species = recode(Species, "Indigenous Pathogen "="P. multivora", "Exotic Pathogen "="P. cinnamomi")) #notice that the original dataset had spaces after the column names..

levels(data.filtered$Species)

## [1] "Control"      "P. cinnamomi" "P. multivora"

Here we used the ‘mutate’ command. This command is used to create a new column. However, in the above code, we called the column the same name (Species), thereby just replacing the other column.

Check out ?mutate for more information.

Summary Stats

The dplyr package is most helpful for summarizing data. Below are a few common statistics we will use (mean, sd, se, ci).

Group_by()

Before we calculate the means we need to introduce the group_by() command.

We can tell R which groups we want to calculate the means for using the group_by() command. For example, if we want to group the plants by treatment we will use:

data.filtered.summary <- data.filtered %>% group_by(treatment) #note this code probably doesn't do anything

Summarize()

Now we can tell R we want to calculate means for the groups (summarize).

data.filtered.summary <- data.filtered %>% group_by(Treatment) %>% summarize(mean=mean(Photosynthesis))

Treatment	mean
Drought	1.232575
Wet	5.918684

Sample Sizes

Lets rather start by summarizing the number of measurements made for each plant on each day.

Recall that there are at least 10 measurements per plant, per day of measurement (day after inoc).

Plant.Measurements <- data.filtered %>% group_by(Unique.Sample.Number,Days.after.inoculation) %>% summarize(n=n()) #the n=n() is a super nice function in dplyr to calculate the number of rows (measurements in our case).

#We could also do n=length() I think

as.tibble(Plant.Measurements)

## # A tibble: 154 x 3
##    Unique.Sample.Number Days.after.inoculation     n
##    <fct>                <fct>                  <int>
##  1 1DroughtControl1     9                         10
##  2 1DroughtControl1     22                        13
##  3 1DroughtControl10    13                        10
##  4 1DroughtControl10    35                        10
##  5 1DroughtControl2     35                        11
##  6 1DroughtControl3     22                        11
##  7 1DroughtControl5     9                         10
##  8 1DroughtControl6     22                        11
##  9 1DroughtControl6     35                        10
## 10 1DroughtControl7     37                        11
## # ... with 144 more rows

Ok, here we can see there are about 10 (10-13) observations (measurements/rows) per plant per measurement day.

Suppose we want to see how many plants there were per treatment and species.

Plant.Count <- data.filtered %>% group_by(Treatment,Species) %>% summarize(n=n_distinct(Unique.Sample.Number))

ggplot(Plant.Count,aes(Species,n,fill=Treatment))+geom_col(position="dodge") +scale_fill_manual(values=c("Orange","Dark Blue")) #note you can pick specific colors you're interested in using the scale_fill_manual. However, anytime you're listing more than one object, you need to use c().

Mean per plant

Ok, next we can look at means. Lets calculate the mean photosynthesis value per plant (but lets keep all of our other variables of interest by including them in group_by()

Plant.Mean.Photo <- data.filtered %>% group_by(Unique.Sample.Number,Trial,Treatment,Species,Days.after.inoculation) %>% summarize(plant.average=mean(Photosynthesis,na.rm=TRUE))

Here we are cacluating the a ‘plant average’ (mean) for each plant.

as.tibble(Plant.Mean.Photo)

## # A tibble: 154 x 6
##    Unique.Sample.N… Trial Treatment Species Days.after.inoc… plant.average
##    <fct>            <fct> <fct>     <fct>   <fct>                    <dbl>
##  1 1DroughtControl1 1     Drought   Control 9                       1.66  
##  2 1DroughtControl1 1     Drought   Control 22                      2.13  
##  3 1DroughtControl… 1     Drought   Control 13                      0.492 
##  4 1DroughtControl… 1     Drought   Control 35                      0.528 
##  5 1DroughtControl2 1     Drought   Control 35                      0.0851
##  6 1DroughtControl3 1     Drought   Control 22                      3.90  
##  7 1DroughtControl5 1     Drought   Control 9                       4.76  
##  8 1DroughtControl6 1     Drought   Control 22                      1.04  
##  9 1DroughtControl6 1     Drought   Control 35                      1.09  
## 10 1DroughtControl7 1     Drought   Control 37                      0.0491
## # ... with 144 more rows

Note that we did not use the same plants every time we took measurements (1DroughtControl1 was only measured on days 9 and 22). We could only measure <30 plants per day, so we randomly selected them from the treatments each time we went to measure.

Ok lets plot the mean values for each plant that was measured.

ggplot(Plant.Mean.Photo,aes(Species,plant.average,shape=Treatment,color=Species))+geom_point(position="jitter")

Each plant is a sampling unit, so we want to conduct our analysis with a single value for each plant. If we include each measurement as a seperate observation, we are incorporating pseudoreplication because each measurement was not independant (10 measurements were made on a single plant–therefore, each measurement is dependant on the plant).

Mean per group

Ok now that we have the means per plant, we can look at the means per trial, treatment, etc.

Group.Mean.Photo <- Plant.Mean.Photo %>% group_by(Treatment,Species,Days.after.inoculation) %>% summarize(n=n(), mean=mean(plant.average),sd=sd(plant.average),se = sd / sqrt(n),lowse = (mean-se),highse = (mean+se)) #notice here we also calculate sd, se, and the low and high values)

This time we also calculated a few other variables (e.g. sample size, standard deviation, and standard error)

as.tibble(Group.Mean.Photo)

## # A tibble: 48 x 9
##    Treatment Species Days.after.inoc…     n  mean     sd     se  lowse
##    <fct>     <fct>   <fct>            <int> <dbl>  <dbl>  <dbl>  <dbl>
##  1 Drought   Control 3                    1 1.13  NA     NA     NA    
##  2 Drought   Control 6                    2 0.656  0.662  0.468  0.188
##  3 Drought   Control 9                    5 2.59   2.81   1.26   1.34 
##  4 Drought   Control 13                   1 0.492 NA     NA     NA    
##  5 Drought   Control 17                   2 1.47   1.45   1.03   0.438
##  6 Drought   Control 22                   3 2.36   1.44   0.831  1.53 
##  7 Drought   Control 35                   5 1.71   1.75   0.782  0.925
##  8 Drought   Control 37                   4 0.520  0.498  0.249  0.271
##  9 Drought   P. cin… 6                    1 0.194 NA     NA     NA    
## 10 Drought   P. cin… 9                    4 2.88   3.55   1.78   1.10 
## # ... with 38 more rows, and 1 more variable: highse <dbl>

Note that if there was only one plant in a group, the sd, se, etc. could not be calculated because there is no variation (this is concerning, some of the data might be missing. we would need to address this in real analysis).

Ok, so we just changed the plot on the top to the plot on the bottom.

left <- ggplot(Plant.Mean.Photo,aes(Species,plant.average,shape=Treatment,color=Species))+geom_point(position="jitter")

right <- ggplot(Group.Mean.Photo,aes(Species,mean,shape=Treatment,color=Species))+geom_point(position="jitter")

gridExtra::grid.arrange(left,right) #note this is calling a the command grid.arrange from the package gridExtra. You will need to install gridExtra if you want to use this package/command

Below we can plot the mean values of all the plants from each species and each treatment on each day measured.

ggplot(Group.Mean.Photo,aes(as.factor(Days.after.inoculation),mean,color=Species))+geom_point()+facet_wrap(~Treatment,ncol=1) #notice we had to factor days after inoculation

This plot reveals some concerns, only control plants were measured on day 3 and no control plants were measured on day 5. Only one group of plants measured on day 36? Anyway, lots of errors in data, demonstrating the importance of exploring your data graphically.

Error bars

We can use the geom_errorbar to add error bars around our means. Note that we previously calculated low standard error (lowse) and high standard error (highse) values in the summary calculations.

Now we can tell r to use the lowse and highse as ymin and ymax in the geom_errorbar(), respectively.

ggplot(Group.Mean.Photo,aes(as.factor(Days.after.inoculation),mean,color=Species))+geom_point()+facet_wrap(~Treatment,nrow=2) +geom_errorbar(aes(ymin=lowse,ymax=highse))

## Warning: Removed 10 rows containing missing values (geom_errorbar).

Here we see a warning that R removed 10 values because they didn’t have standard error values (because they only had one plant.. hmm somethings off)

For simplicity, lets just look at the last two days of measurements in this last plot.

Last.Week <- Group.Mean.Photo %>% filter(Days.after.inoculation=="35" | Days.after.inoculation=="37") #note that the | in this means 'OR', thus we have effectively asked R to to filter our data to those two days only.

Now we have filtered the data from the last week of collections into a new dataframe.

Below is a overly complicated plot where we use position_dodge(width=1) to spread out our species groups. However, we have to do the same thing for the error bars as well.

ggplot(Last.Week,aes(Days.after.inoculation,mean,color=Species))+geom_point(position = position_dodge(width = 1))+facet_wrap(~Treatment,nrow=2)+geom_errorbar(aes(ymin=lowse,ymax=highse),position = position_dodge(width = 1)) +labs(x= "Days after inoculation",y="Mean Photosynthesis") +theme_bw()+ theme(legend.text = element_text(face = "italic"))

Add text

Perhaps we want to add text to see how many plants were used to calculate the mean values. Again, we can do this with geom_text.

note that we have to do the same position_dodge command for the geomtext, otherwise it shows up in middle of group column.

ggplot(Last.Week,aes(Days.after.inoculation,mean,color=Species))+geom_point(position = position_dodge(width = 1))+facet_wrap(~Treatment,nrow=2)+geom_errorbar(aes(ymin=lowse,ymax=highse),position = position_dodge(width = 1)) +labs(x= "Days after inoculation",y="Mean Photosynthesis") +theme_bw()+ theme(legend.text = element_text(face = "italic"))+geom_text(aes(label=n),position=position_dodge(width=1))

Note that we would need to play with this a bit more to get the text to sit in the right place. Ultimately we probably would rather make this plot below

ggplot(Last.Week,aes(Species,mean,color=Treatment))+geom_point()+facet_wrap(~Days.after.inoculation,nrow=2)+geom_errorbar(aes(ymin=lowse,ymax=highse)) +labs(x= "Days after inoculation",y="Mean Photosynthesis") +theme_bw()+ theme(legend.text = element_text(face = "italic"))+geom_text(aes(label=n),nudge_x=0.1) #note we used the nudge_x command to nudge the text a small distance from the point

Neat, looks like some interesting things are going on. Not quite as clear as we hoped, but it will make for some fun analyses with ‘repeated anova’.