FABI Workshop Series >> Home FABI Workshop Series >> Mock Data
Welcome to the third session of our workshop series. The purpose of this page is to serve as a guide for the material we will cover during the session. We will learn to summarize and transform data using the dplyr package in our session today.
library(tidyverse)
For more information about dyplr, check out the help file
?dplyr #note you need to index the tidyverse package for this to work
Lets start by downloading the cheatsheet for dplyr.
RStudio has produced many ‘cheatsheets’ for various packages that are super helpful. Each package has many functions and it is impossible to remember them all.
dplyr cheatsheet - click to download.
data <- read.csv('Silver Tree Study.csv')
Just to make sure we are on the same page before moving on, lets set our check that R reads our data how we want it to.
str(data)
## 'data.frame': 1722 obs. of 14 variables:
## $ Unique.Sample.Number : Factor w/ 102 levels "1DroughtBoth Pathogens9",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Days.after.inoculation: int 3 3 3 3 3 3 3 3 3 3 ...
## $ Date : Factor w/ 10 levels "6/12/18","6/16/18",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Licor : int 6400 6400 6400 6400 6400 6400 6400 6400 6400 6400 ...
## $ Trial : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment : Factor w/ 2 levels "Drought","Wet": 1 1 1 1 1 1 1 1 1 1 ...
## $ Species : Factor w/ 4 levels "Both Pathogens",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Plant.Number : int 9 9 9 9 9 9 9 9 9 9 ...
## $ Isolate.Number : int NA NA NA NA NA NA NA NA NA NA ...
## $ Unique.Sample.Number.1: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Photosynthesis : num -1.257 -0.456 -0.675 0.196 1.132 ...
## $ Conductance : num -0.008756 -0.001057 -0.000624 0.004572 0.008387 ...
## $ Ci : num 170 -286 -1315 328 182 ...
Ok, currently R has read columns ‘Days.after.inoculation’ and ‘Trial’ as intergers, but for this session, we want R to read them as factors.
Lets change them to factors for now.
data$Days.after.inoculation <- as.factor(data$Days.after.inoculation)
data$Trial <- as.factor(data$Trial)
str(data)
## 'data.frame': 1722 obs. of 14 variables:
## $ Unique.Sample.Number : Factor w/ 102 levels "1DroughtBoth Pathogens9",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Days.after.inoculation: Factor w/ 10 levels "3","5","6","9",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Date : Factor w/ 10 levels "6/12/18","6/16/18",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Licor : int 6400 6400 6400 6400 6400 6400 6400 6400 6400 6400 ...
## $ Trial : Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment : Factor w/ 2 levels "Drought","Wet": 1 1 1 1 1 1 1 1 1 1 ...
## $ Species : Factor w/ 4 levels "Both Pathogens",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Plant.Number : int 9 9 9 9 9 9 9 9 9 9 ...
## $ Isolate.Number : int NA NA NA NA NA NA NA NA NA NA ...
## $ Unique.Sample.Number.1: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Photosynthesis : num -1.257 -0.456 -0.675 0.196 1.132 ...
## $ Conductance : num -0.008756 -0.001057 -0.000624 0.004572 0.008387 ...
## $ Ci : num 170 -286 -1315 328 182 ...
Ok, now R recognizes those columns as factors. Yay.
Ok now lets get into the real content of this session. Dplyr is a package that will empower you to quickly and safely (low risk of mistakes) summarize your data.
Lets again look at our data.
head(data) #as_tibble is another way to view a table of our data
## Unique.Sample.Number Days.after.inoculation Date Licor Trial Treatment
## 1 1DroughtControl9 3 6/2/18 6400 1 Drought
## 2 1DroughtControl9 3 6/2/18 6400 1 Drought
## 3 1DroughtControl9 3 6/2/18 6400 1 Drought
## 4 1DroughtControl9 3 6/2/18 6400 1 Drought
## 5 1DroughtControl9 3 6/2/18 6400 1 Drought
## 6 1DroughtControl9 3 6/2/18 6400 1 Drought
## Species Plant.Number Isolate.Number Unique.Sample.Number.1 Obs
## 1 Control 9 NA 1 1
## 2 Control 9 NA 1 2
## 3 Control 9 NA 1 3
## 4 Control 9 NA 1 4
## 5 Control 9 NA 1 5
## 6 Control 9 NA 1 6
## Photosynthesis Conductance Ci
## 1 -1.2574071 -0.008756114 169.6851
## 2 -0.4559930 -0.001057429 -286.0282
## 3 -0.6746335 -0.000623729 -1315.1230
## 4 0.1958272 0.004572455 328.0515
## 5 1.1315593 0.008387323 181.6788
## 6 1.4892968 0.010670317 173.9482
Ok so we can see that the first 10 or so rows are from the same plant (see below) and measured on the same day (day 3). We took at least 10 measurements per plant. Therefore, we need to calculate the mean value per plant (to avoid psuedoreplication). The unique sample number is just a combination of 3 columns using the stringr package in the tidyverse (not going into that now).
In general, the dyplr package uses ‘pipes’ (%>%), which eventually the entire tidyverse will use (including ggvis, the next version of ggplot2).
You can read the %>% as ‘THEN’.
For example the below code can be interpreted as
Data THEN Filter THEN Group_by THEN Summarise
summary <- data %>% filter(Photosynthesis<200) %>% group_by(Species) %>% summarize(mean=mean(Photosynthesis)) #more about this later
The pipes essentially let us do everything in one go, without having to create a named variable for every step..
We will build up to the command above.
We have already used the filter command in the first two sessions, but lets take another look.
Remember that we included a few plants from species=“Both Pathogens” in the first few weeks of the experiment.
levels(data$Species) #we can remind ourselves about the levels of a factor using the levels command
## [1] "Both Pathogens" "Control" "Exotic Pathogen "
## [4] "Indigenous Pathogen "
Lets remove the “Both Pathogens” from our dataset for now, just to simplify our stats going forward
data.filtered <- data %>% filter (Photosynthesis<200 & Species!="Both Pathogens") #note the '!=' means 'does not equal'
#Here we used & to filter the data by two variables at the same time
Lets check that that worked
levels(data.filtered$Species)
## [1] "Both Pathogens" "Control" "Exotic Pathogen "
## [4] "Indigenous Pathogen "
Oops.. the level is still there
summary(data.filtered$Species)
## Both Pathogens Control Exotic Pathogen
## 0 563 508
## Indigenous Pathogen
## 569
Ok, notice that all of the obesrvations for “Both Pathogens” were filtered out, but the level still remains in our factor. (its a silly thing R does).
Even though there are no observations, the level of the factor will still appear in our legend if we made a plot now
We can completely drop that level of the factor by including the ‘droplevels’ command.
data.filtered <- data %>% filter (Photosynthesis<200 & Species!="Both Pathogens") %>% droplevels()
levels(data.filtered$Species)
## [1] "Control" "Exotic Pathogen " "Indigenous Pathogen "
Blam.
Note that the indigenous pathogen and exotic pathogen levels actually have spaces behind them.. Whoops.
Suppose we want to change the names of some of our levels in our factors. Sometimes it is easier to do in excel from the start, but if you only want to make one change, it is easier to just use one line of code
data.filtered <- data.filtered %>% mutate(Species = recode(Species, "Indigenous Pathogen "="P. multivora", "Exotic Pathogen "="P. cinnamomi")) #notice that the original dataset had spaces after the column names..
levels(data.filtered$Species)
## [1] "Control" "P. cinnamomi" "P. multivora"
Here we used the ‘mutate’ command. This command is used to create a new column. However, in the above code, we called the column the same name (Species), thereby just replacing the other column.
Check out ?mutate for more information.
The dplyr package is most helpful for summarizing data. Below are a few common statistics we will use (mean, sd, se, ci).
Before we calculate the means we need to introduce the group_by() command.
We can tell R which groups we want to calculate the means for using the group_by() command. For example, if we want to group the plants by treatment we will use:
data.filtered.summary <- data.filtered %>% group_by(treatment) #note this code probably doesn't do anything
Now we can tell R we want to calculate means for the groups (summarize).
data.filtered.summary <- data.filtered %>% group_by(Treatment) %>% summarize(mean=mean(Photosynthesis))
Treatment | mean |
---|---|
Drought | 1.232575 |
Wet | 5.918684 |
Lets rather start by summarizing the number of measurements made for each plant on each day.
Recall that there are at least 10 measurements per plant, per day of measurement (day after inoc).
Plant.Measurements <- data.filtered %>% group_by(Unique.Sample.Number,Days.after.inoculation) %>% summarize(n=n()) #the n=n() is a super nice function in dplyr to calculate the number of rows (measurements in our case).
#We could also do n=length() I think
as.tibble(Plant.Measurements)
## # A tibble: 154 x 3
## Unique.Sample.Number Days.after.inoculation n
## <fct> <fct> <int>
## 1 1DroughtControl1 9 10
## 2 1DroughtControl1 22 13
## 3 1DroughtControl10 13 10
## 4 1DroughtControl10 35 10
## 5 1DroughtControl2 35 11
## 6 1DroughtControl3 22 11
## 7 1DroughtControl5 9 10
## 8 1DroughtControl6 22 11
## 9 1DroughtControl6 35 10
## 10 1DroughtControl7 37 11
## # ... with 144 more rows
Ok, here we can see there are about 10 (10-13) observations (measurements/rows) per plant per measurement day.
Suppose we want to see how many plants there were per treatment and species.
Plant.Count <- data.filtered %>% group_by(Treatment,Species) %>% summarize(n=n_distinct(Unique.Sample.Number))
ggplot(Plant.Count,aes(Species,n,fill=Treatment))+geom_col(position="dodge") +scale_fill_manual(values=c("Orange","Dark Blue")) #note you can pick specific colors you're interested in using the scale_fill_manual. However, anytime you're listing more than one object, you need to use c().
Ok, next we can look at means. Lets calculate the mean photosynthesis value per plant (but lets keep all of our other variables of interest by including them in group_by()
Plant.Mean.Photo <- data.filtered %>% group_by(Unique.Sample.Number,Trial,Treatment,Species,Days.after.inoculation) %>% summarize(plant.average=mean(Photosynthesis,na.rm=TRUE))
Here we are cacluating the a ‘plant average’ (mean) for each plant.
as.tibble(Plant.Mean.Photo)
## # A tibble: 154 x 6
## Unique.Sample.N… Trial Treatment Species Days.after.inoc… plant.average
## <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 1DroughtControl1 1 Drought Control 9 1.66
## 2 1DroughtControl1 1 Drought Control 22 2.13
## 3 1DroughtControl… 1 Drought Control 13 0.492
## 4 1DroughtControl… 1 Drought Control 35 0.528
## 5 1DroughtControl2 1 Drought Control 35 0.0851
## 6 1DroughtControl3 1 Drought Control 22 3.90
## 7 1DroughtControl5 1 Drought Control 9 4.76
## 8 1DroughtControl6 1 Drought Control 22 1.04
## 9 1DroughtControl6 1 Drought Control 35 1.09
## 10 1DroughtControl7 1 Drought Control 37 0.0491
## # ... with 144 more rows
Note that we did not use the same plants every time we took measurements (1DroughtControl1 was only measured on days 9 and 22). We could only measure <30 plants per day, so we randomly selected them from the treatments each time we went to measure.
Ok lets plot the mean values for each plant that was measured.
ggplot(Plant.Mean.Photo,aes(Species,plant.average,shape=Treatment,color=Species))+geom_point(position="jitter")
Each plant is a sampling unit, so we want to conduct our analysis with a single value for each plant. If we include each measurement as a seperate observation, we are incorporating pseudoreplication because each measurement was not independant (10 measurements were made on a single plant–therefore, each measurement is dependant on the plant).
Ok now that we have the means per plant, we can look at the means per trial, treatment, etc.
Group.Mean.Photo <- Plant.Mean.Photo %>% group_by(Treatment,Species,Days.after.inoculation) %>% summarize(n=n(), mean=mean(plant.average),sd=sd(plant.average),se = sd / sqrt(n),lowse = (mean-se),highse = (mean+se)) #notice here we also calculate sd, se, and the low and high values)
This time we also calculated a few other variables (e.g. sample size, standard deviation, and standard error)
as.tibble(Group.Mean.Photo)
## # A tibble: 48 x 9
## Treatment Species Days.after.inoc… n mean sd se lowse
## <fct> <fct> <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Drought Control 3 1 1.13 NA NA NA
## 2 Drought Control 6 2 0.656 0.662 0.468 0.188
## 3 Drought Control 9 5 2.59 2.81 1.26 1.34
## 4 Drought Control 13 1 0.492 NA NA NA
## 5 Drought Control 17 2 1.47 1.45 1.03 0.438
## 6 Drought Control 22 3 2.36 1.44 0.831 1.53
## 7 Drought Control 35 5 1.71 1.75 0.782 0.925
## 8 Drought Control 37 4 0.520 0.498 0.249 0.271
## 9 Drought P. cin… 6 1 0.194 NA NA NA
## 10 Drought P. cin… 9 4 2.88 3.55 1.78 1.10
## # ... with 38 more rows, and 1 more variable: highse <dbl>
Note that if there was only one plant in a group, the sd, se, etc. could not be calculated because there is no variation (this is concerning, some of the data might be missing. we would need to address this in real analysis).
Ok, so we just changed the plot on the top to the plot on the bottom.
left <- ggplot(Plant.Mean.Photo,aes(Species,plant.average,shape=Treatment,color=Species))+geom_point(position="jitter")
right <- ggplot(Group.Mean.Photo,aes(Species,mean,shape=Treatment,color=Species))+geom_point(position="jitter")
gridExtra::grid.arrange(left,right) #note this is calling a the command grid.arrange from the package gridExtra. You will need to install gridExtra if you want to use this package/command
Below we can plot the mean values of all the plants from each species and each treatment on each day measured.
ggplot(Group.Mean.Photo,aes(as.factor(Days.after.inoculation),mean,color=Species))+geom_point()+facet_wrap(~Treatment,ncol=1) #notice we had to factor days after inoculation
This plot reveals some concerns, only control plants were measured on day 3 and no control plants were measured on day 5. Only one group of plants measured on day 36? Anyway, lots of errors in data, demonstrating the importance of exploring your data graphically.
We can use the geom_errorbar to add error bars around our means. Note that we previously calculated low standard error (lowse) and high standard error (highse) values in the summary calculations.
Now we can tell r to use the lowse and highse as ymin and ymax in the geom_errorbar(), respectively.
ggplot(Group.Mean.Photo,aes(as.factor(Days.after.inoculation),mean,color=Species))+geom_point()+facet_wrap(~Treatment,nrow=2) +geom_errorbar(aes(ymin=lowse,ymax=highse))
## Warning: Removed 10 rows containing missing values (geom_errorbar).
Here we see a warning that R removed 10 values because they didn’t have standard error values (because they only had one plant.. hmm somethings off)
For simplicity, lets just look at the last two days of measurements in this last plot.
Last.Week <- Group.Mean.Photo %>% filter(Days.after.inoculation=="35" | Days.after.inoculation=="37") #note that the | in this means 'OR', thus we have effectively asked R to to filter our data to those two days only.
Now we have filtered the data from the last week of collections into a new dataframe.
Below is a overly complicated plot where we use position_dodge(width=1) to spread out our species groups. However, we have to do the same thing for the error bars as well.
ggplot(Last.Week,aes(Days.after.inoculation,mean,color=Species))+geom_point(position = position_dodge(width = 1))+facet_wrap(~Treatment,nrow=2)+geom_errorbar(aes(ymin=lowse,ymax=highse),position = position_dodge(width = 1)) +labs(x= "Days after inoculation",y="Mean Photosynthesis") +theme_bw()+ theme(legend.text = element_text(face = "italic"))
Perhaps we want to add text to see how many plants were used to calculate the mean values. Again, we can do this with geom_text.
note that we have to do the same position_dodge command for the geomtext, otherwise it shows up in middle of group column.
ggplot(Last.Week,aes(Days.after.inoculation,mean,color=Species))+geom_point(position = position_dodge(width = 1))+facet_wrap(~Treatment,nrow=2)+geom_errorbar(aes(ymin=lowse,ymax=highse),position = position_dodge(width = 1)) +labs(x= "Days after inoculation",y="Mean Photosynthesis") +theme_bw()+ theme(legend.text = element_text(face = "italic"))+geom_text(aes(label=n),position=position_dodge(width=1))
Note that we would need to play with this a bit more to get the text to sit in the right place. Ultimately we probably would rather make this plot below
ggplot(Last.Week,aes(Species,mean,color=Treatment))+geom_point()+facet_wrap(~Days.after.inoculation,nrow=2)+geom_errorbar(aes(ymin=lowse,ymax=highse)) +labs(x= "Days after inoculation",y="Mean Photosynthesis") +theme_bw()+ theme(legend.text = element_text(face = "italic"))+geom_text(aes(label=n),nudge_x=0.1) #note we used the nudge_x command to nudge the text a small distance from the point
Neat, looks like some interesting things are going on. Not quite as clear as we hoped, but it will make for some fun analyses with ‘repeated anova’.