FABI Workshop Series >> Home

Introduction

Welcome to the second session of our workshop series. The purpose of this page is to provide a guide through the subjects we will cover. In this session we will introduce a powerful, highly customizable, and widely used package for visulizing data, ggplot2.

Pre-Session Instructions

You will need to have RStudio (and R) installed on your computer and have the ‘tidyverse’ package installed. More information about this can be found in the lesson for Session 1.

Install Packages

Remember you can install packages using the below code.

install.packages("tidyverse")

It can take a few minutes for RStudio to unpack/install the packages. The Tidyverse package is a collection (package) of packages. It includes several individual packages that work fairly well together.

Today we are going to learn to use ggplot, but we will learn to use dplyr next week in Session 3. Knowing how to use both of these packages provides a strong foundation for analyses. Today we will learn to plot raw data with ggplot, but next week we will be able to summarize (calculate means, standard errors, add variables, etc.), then plot the summarized data.

Once we have the Tidyverse installed. We will need to index it in our project.

Working Diretory

Remember to set your working directory if you create a new project so that R knows where to look for and save your data.

Download ggplot2 Cheatsheet

RStudio has produced many ‘cheatsheets’ for various packages that are super helpful. Each package has many functions and it is impossible to remember them all.

ggplot cheatsheet - click to download.

Download Data

Lets download some data that we can use for this workshop series:

FABI Workshop Series >> Mock Data

  • Now move the data to your working directory
    • Note that I prefer to keep the data in a seperate folder ‘data’.

Session 1 Re-Cap

Create R Script

I recommend creating a new R Script for this session. If you’re working in a project directory for the workshop, you can save each session as a seperate R Script.

File (menu) >> New File >> R Script

Load Packages

Next we need to load our packages into our session.

library(tidyverse)
## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Remember we need to load the packages everytime we start a new session (after closing and opening RStudio) or when we start working from a new R script.

Index Data

Next we can index our data to our RStudio environment using the below code.

#name <- read.csv('./data/'FILENAME.csv') 
data <- read.csv('./data/Silver Tree Study.csv')

You should now see a data.frame with the name ‘data’ in the environment pane.

Note that you may need to add ‘sep=“;”’ to the command if windows uses ‘;’ for your .csv files rather than ‘,’. Basically, if there is no data.frame called ‘data’ in your environment, you may need to include it as below.

data <-read.csv('Silver Tree Study.csv',sep=";")

Check and Adjust Data

Lets run a few commands to refresh ourselves about the structure and organization of our data

str(data)
## 'data.frame':    1722 obs. of  14 variables:
##  $ Unique.Sample.Number  : Factor w/ 102 levels "1DroughtBoth Pathogens9",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Days.after.inoculation: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Date                  : Factor w/ 10 levels "6/12/18","6/16/18",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Licor                 : int  6400 6400 6400 6400 6400 6400 6400 6400 6400 6400 ...
##  $ Trial                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment             : Factor w/ 2 levels "Drought","Wet": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Species               : Factor w/ 4 levels "Both Pathogens",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Plant.Number          : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ Isolate.Number        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Unique.Sample.Number.1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Obs                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Photosynthesis        : num  -1.257 -0.456 -0.675 0.196 1.132 ...
##  $ Conductance           : num  -0.008756 -0.001057 -0.000624 0.004572 0.008387 ...
##  $ Ci                    : num  170 -286 -1315 328 182 ...

Here we can see how R reads some of the columns. Note that R reads Trial as an integer.

Lets change it to a factor (grouping variable).

data$Trial <- as.factor(data$Trial)

Ok, now we can see if R recognizes the trial as a factor.

str(data$Trial) #note that this time we can specify the column using $ to keep things shorter. 
##  Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...

Good, ok we will probably have to do that again with another variable later once we start visualizing the data with ggplot2.

Another way to remind ourselves about the different columns in our data is names().

names(data)
##  [1] "Unique.Sample.Number"   "Days.after.inoculation"
##  [3] "Date"                   "Licor"                 
##  [5] "Trial"                  "Treatment"             
##  [7] "Species"                "Plant.Number"          
##  [9] "Isolate.Number"         "Unique.Sample.Number.1"
## [11] "Obs"                    "Photosynthesis"        
## [13] "Conductance"            "Ci"

Also, summary() is another neat function to summarize the data.

summary(data$Photosynthesis)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -7.027   0.394   2.226   5.903   6.704 397.893

Although I really only find it useful for telling us the number of observations in a group.

summary(data$Treatment)
## Drought     Wet 
##     821     901

Ok, we had 821 observations of plants in the drought treatment, and 901 observations of plants in the wet treatment.

Session 2 Information

Part 1

Introduction to ggplot2

ggplot is a package that is included in the tidyverse. It was developed specifically for visualizing data so it is pretty powerful and highly customizable.

Remember that you can access the help file for ggplot or any of the ggplot commands/functions (geom_bar, aes, etc) by adding a questionmark

?ggplot #note that you need to have loaded the package for that to work, otherwise you need to include two questionmarks (??ggplot)

General setup

“ggplot() is used to construct the initial plot object, and is almost always followed by + to add component to the plot.”

The general format for the ggplot command is:

Long command version
ggplot(data=df, mapping=aes(x-variable, y-variable, other aesthetics such as color))
Short command version

We can also run the same command without specifying data= or mapping =

ggplot(df,aes(x-variable, y-variable, other aesthetics))

Then we can add + geom_col() or +geom_boxplot() depending on which type of plot we want.

Data Visualization

One Variable Plots

Continuous Data

Continuous data refers to our data that ranges from 0 to infinity. For example our measurments for Photosynthsis is continous

summary(data$Photosynthesis)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -7.027   0.394   2.226   5.903   6.704 397.893

In contrast discrete data (more below) is categorized into discrete groups.

Histograms

I generally recommend visualizing the distribution of your response variables (results) as a start because then you can decide which statistical tests are appropriate based on the distributions.

First lets look at the distribution of the Photosyntehsis data. To do this we can use geom_histogram()

names(data) #We can check the names of the columns to remind us which variables we can plot, and how to correctly spell/capitlize them.
##  [1] "Unique.Sample.Number"   "Days.after.inoculation"
##  [3] "Date"                   "Licor"                 
##  [5] "Trial"                  "Treatment"             
##  [7] "Species"                "Plant.Number"          
##  [9] "Isolate.Number"         "Unique.Sample.Number.1"
## [11] "Obs"                    "Photosynthesis"        
## [13] "Conductance"            "Ci"
ggplot(data,aes(Photosynthesis))+geom_histogram() #note that this is only a one variable plot (only x, not x and y). 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Shoot I forgot about that super outlier.. Looks like we will need a quick glimpse into dplyr ( Session 3).

data <- data %>% filter(Photosynthesis<300) # lots more about this next week :)

Now we can run the same plot and see if the outlier is removed.

ggplot(data,aes(Photosynthesis))+geom_histogram() #note that this is only a one variable plot (only x, not x and y). 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This plot is showing us the number of observations (y-axis) for every value of photosynthesis (x-axis).

We can also do the same thing for the other response variables.

ggplot(data,aes(Conductance))+geom_histogram()

Pretty much the same distribution as Photosynthesis, which makes sense, becuase they are related right?

Lets stick to plotting Photosynthesis for now.

Density Plots

There are other plots to look at the same data. For example, below we can look at a density plot.

ggplot(data,aes(Photosynthesis))+geom_density()

Group Comparisons

Now we can add some complexity to compare the two groups, still only looking at a single respones variable.

Below we indicate that we want each Trial to have a different color using the ‘fill=’ command.

+ Remember that we designated Trial as a factor. The fill command only works with factors.
ggplot(data,aes(Photosynthesis,fill=Trial))+geom_histogram()

Here it looks like Trial 1 had higher level, but it is really because the Trial 1 data is stacked on top of Trial 2 data.

See what happens if you try a density plot instead..

ggplot(data,aes(Photosynthesis,fill=Trial))+geom_density()

To indicate that we want our groups to appear side by side, rather than on top of eachother we have to specify a position.

ggplot(data,aes(Photosynthesis,fill=Trial))+geom_histogram(position="dodge")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Discrete Data

Discrete data referes to our data that are categorized into discrete groups. For example, our variables that are factors, are discrete variables

summary(data$Trial)
##   1   2 
## 925 787
summary(data$Treatment)
## Drought     Wet 
##     821     891
summary(data$Species)
##       Both Pathogens              Control     Exotic Pathogen  
##                   72                  563                  508 
## Indigenous Pathogen  
##                  569

We can visualize discrete data using geom_bar()

ggplot(data,aes(Species))+geom_bar()

This plot is telling us how many observations were made for each (discrete) group.

Remember that at least 10 measurements were made for each plant, so these observations do not represent each plant. We will learn to summarize (average) the observations per plant in Session 3.

Note that we did not take many measurements from plants that were infected with both pathogens. We ultimately did not have enough time to measure the physiological response for that group.

As a side note, I often prefer to flip the coordinates so that the labels read better.

We can flip the axises using +coord_flip()

ggplot(data,aes(Species))+geom_bar() +coord_flip()

Looks like we measured less plants infected with the exotic pathogen than plants infected with indigenous pathogen or the controls.

Lets include one more discrete variable by adding a fill command and chaning the position.

ggplot(data,aes(Species,fill=Treatment))+geom_bar(position="dodge") +coord_flip()

Looks like we measured a few more plants from the Wet treatment than from the drought treatment.

Two Variable Plots

Now we can look at a few plots with two variables where we define x and y.

Continuous x, Continous y

We can see if there is a relationship between Phyotosyntehsis and Conductance. Then we can see if that relationship changes if plants are infected with pathogens.

Point Plots
ggplot(data,aes(Photosynthesis,Conductance))+geom_point()

Pretty messy, but it looks like there is a linear trend as we would expect.

This time if we want to compare different treatments or other categories, we have to use the ‘color=’ command rather than the ‘fill=’.

ggplot(data,aes(Photosynthesis,Conductance,color=Trial))+geom_point()

Line Plots

The below plot essentially draws lines between points

ggplot(data,aes(Photosynthesis,Conductance))+geom_line()

We can use geom_smooth() command to essentially average our data.

+ we may want to check the helpfile for geom_smooth to see exactly what it is doing. 
?geom_smooth()
ggplot(data,aes(Photosynthesis,Conductance))+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Not suprising, there is a linear relationship between the response variables.

We may also want to look at it linearly by defining the method in the geom_smooth() command.

ggplot(data,aes(Photosynthesis,Conductance))+geom_smooth(method=lm)

Now lets see if they differ between treatments or species.

ggplot(data,aes(Photosynthesis,Conductance,linetype=Treatment))+geom_smooth(method=lm) #note that we define a linetype in this.

Interesting, the relationship has a steeper slope for plants in the wet trial.

ggplot(data,aes(Photosynthesis,Conductance,linetype=Treatment,color=Species))+geom_smooth(method=lm) #note that we add color and linetype commands

Ok, clearly the plot is getting quite complicated once we start adding multiple grouping variables. This is a good time to introduce ‘facet wrapping’.

Facet Wrapping

ggplot has a neat way to create plots side by side using a command called facet_wrap.

Here we can split the above plot into two seperate plots for each treatment.

ggplot(data,aes(Photosynthesis,Conductance,color=Species))+geom_smooth(method=lm) +facet_wrap(~Treatment)#note that add Treatment as a facet_wrap argument rather than as a linetype. 

Well the plot is still pretty messy, but the differences between treatments are clearer.

We can also arrange facets differently

ggplot(data,aes(Photosynthesis,Conductance,color=Treatment))+geom_smooth(method=lm) +facet_wrap(~Species) #here we switched treatment and species

And we can define how many columns or rows we want to present the facets in.

ggplot(data,aes(Photosynthesis,Conductance,color=Treatment))+geom_smooth(method=lm) +facet_wrap(~Species,nrow=1) #here we define nrow=1

Discrete x, Continous y

The final group of plots that we will cover in this session involve one (discrete) grouping variable and one continous variable.

Boxplots

Boxplots are commonly used to visualize differences in distributions between groups.

ggplot(data,aes(Species,Photosynthesis))+geom_boxplot()

Note that boxplots show the median, not the mean. We will learn to calculate and visualize means and standard errors in Session 3.

ggplot(data,aes(Species,Photosynthesis))+geom_boxplot() +coord_flip()

It looks like the plants infected with pathogens had lower median Photosynthesis values overall.

ggplot(data,aes(Species,Photosynthesis,fill=Treatment))+geom_boxplot(position="dodge") +coord_flip()

Here we can see that drought substantially lowered the levels of Photosynthesis.

So there is one major limitation to the plots above. There is something that we are not accounting for. As plant pathologists, plant scientists, and mycologists can you guess what is?

Lets look at our columns again

names(data)
##  [1] "Unique.Sample.Number"   "Days.after.inoculation"
##  [3] "Date"                   "Licor"                 
##  [5] "Trial"                  "Treatment"             
##  [7] "Species"                "Plant.Number"          
##  [9] "Isolate.Number"         "Unique.Sample.Number.1"
## [11] "Obs"                    "Photosynthesis"        
## [13] "Conductance"            "Ci"

Do you see any variables that might be important to include to see a difference between our Species groups?

str(data$Days.after.inoculation)
##  int [1:1712] 3 3 3 3 3 3 3 3 3 3 ...

Here R recognizes days after inoculation as a integer. Lets see what happens if we try to plot that on the x axis.

ggplot(data,aes(Days.after.inoculation,Photosynthesis,fill=Species))+geom_boxplot(position="dodge")+facet_wrap(~Treatment)
## Warning: position_dodge requires non-overlapping x intervals

## Warning: position_dodge requires non-overlapping x intervals

Not quite what we looking for. We want to see if the photosynthesis levels change over time since infection. Lets try if we set Days.after.inoculation as a factor.

data$Days.after.inoculation <- as.factor(data$Days.after.inoculation)
ggplot(data,aes(Days.after.inoculation,Photosynthesis,fill=Species))+geom_boxplot(position="dodge")+facet_wrap(~Treatment,ncol=1)

Ok, things are getting very complex.

Part 2

Part 1 Recap

Remember that we went through examples of plots following the structure of the ggplot cheatsheet, first plotting one continuous variable, then moving to two variables (one discrete and one continuous)

One Variable Plots
Continuous Data

Remember that we identified a mistake in the data. The plot below has an enormous outlier.

ggplot(data,aes(Photosynthesis))+geom_histogram() #note that this is only a one variable plot (only x, not x and y). 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Then remember we filtered out the outlier from our data using dplyr

data <- data %>% filter(Photosynthesis<300) # we will go over the structure of this after our Session 2 recap. 

Here we resaved ‘data’ as a new data.frame that was filtered. Alternatively we could have saved it as a new variable, but I try not to rename things as much as I can (we will see why later when we have a bunch of dataframes in our environment).

data.filtered <-data %>% filter(Photosynthesis<300) #same as above, just demonstrating that could rename it something else.
Two Variable Plots
Discrete x, Continous y

Boxplots

Boxplots are commonly used to show data distributions between groups.

ggplot(data,aes(Species,Photosynthesis,fill=Treatment))+geom_boxplot(position="dodge") +coord_flip()

Remember that we are interested in how photosynthesis changed over time (measured by days after inoculation) because the inoculated plant should be the exact same as controls at start of experiment.

data$Days.after.inoculation.factored <- as.factor(data$Days.after.inoculation) #this time I created a new column rather than save over the other column.
ggplot(data,aes(Days.after.inoculation.factored,Photosynthesis,fill=Species))+geom_boxplot(position="dodge")+facet_wrap(~Treatment,ncol=1)

Continuous x, Continuous y

Lets look at the data if we keep ‘Days.after.inoculation’ as an integer rather than a factor. Thus, we are plotting a continuous variable.

data$Days.after.inoculation <- as.numeric(data$Days.after.inoculation)
ggplot(data,aes(Days.after.inoculation,Photosynthesis,linetype=Treatment,color=Species))+geom_smooth(method=lm)

ggplot2 Extras

Labels

There were a few examples that we didn’t get to last week that we can go through now quickly.

First, lets customize the labels of our plots. Suppose we want our x and y labels to be cleaner and more accurate.

We can add custom labels to our plots using the ‘labs’ command

ggplot(data,aes(Days.after.inoculation,Photosynthesis,linetype=Treatment,color=Species))+geom_smooth(method=lm) +labs(x="Days after inoculation",y=expression(Phyotosnytheic~Response~(µmol~m^{-2}~s^{-1}))) #note the crazy notation for adding superscripts

Titles

We can also add a title using the ‘labs’ command

ggplot(data,aes(Days.after.inoculation,Photosynthesis,linetype=Treatment,color=Species))+geom_smooth(method=lm) +labs(x="Days after inoculation",y="Photosnthesis",title="Photosnythetic Response")

Note that ggplot left adjust titles (don’t ask me why). To fix this we add a theme command

ggplot(data,aes(Days.after.inoculation,Photosynthesis,linetype=Treatment,color=Species))+geom_smooth(method=lm) +labs(x="Days after inoculation",y="Photosynthesis",title="Photosnythetic Response")+theme(plot.title = element_text(hjust = 0.5))

Italics

Similarly, if we want to add italics to an element, we can add the ’element_text(face=“italic”) command.

ggplot(data,aes(Days.after.inoculation,Photosynthesis,linetype=Treatment,color=Species))+geom_smooth(method=lm) +labs(x="Days after inoculation",y="Photosnthesis",title="Photosnythetic Response") +theme(axis.text.y=element_text(face="italic"), plot.title = element_text(hjust = 0.5,face="italic")) #notice we added italics to the y axis text and the title in this example

Themes

Themes can also be used to change the basic layout (e.g. background color) of the plot.

Here we can remove the grey background by adding another theme component.

ggplot(data,aes(Days.after.inoculation,Photosynthesis,linetype=Treatment,color=Species))+geom_smooth(method=lm) +labs(x="Days after inoculation",y="Photosynthesis",title="Photosnythetic Response")+theme(plot.title = element_text(hjust = 0.5)) +theme_bw()

There are more theme options listed in the ggplot cheatsheet.

Colors

The commands for changing the colors vary depending on the type of plot you’re creating. For the best results, check out the ggplot2 cheatsheet. In general the command can be summarized as ‘scale_aes_colortype()’.

For our plot above we use the color and linetype aesthetics, therefore we can use scale_color_discrete()

ggplot(data,aes(Days.after.inoculation,Photosynthesis,linetype=Treatment,color=Species))+geom_smooth(method=lm) +labs(x="Days after inoculation",y="Photosynthesis",title="Photosnythetic Response")+theme(plot.title = element_text(hjust = 0.5)) +theme_bw() +scale_color_brewer(palette="Reds")

We can also indicate that we want to create a black and white/ grayscale plot with ‘scale_x_grey()’

ggplot(data,aes(Species,Photosynthesis,fill=Treatment))+geom_boxplot(position="dodge") +coord_flip() +theme_bw()+scale_fill_grey()

Add Text

Warning: below is a super messy example. For a better example, check the geom_text section of Session 3

Suppose we want to add text to a plot. Perhaps we want to see which plant a given point our plot is. For this we can use the geom_text command. ›

ggplot(data,aes(Species,Photosynthesis,color=as.factor(Trial)))+geom_point(position="jitter") +coord_flip()+facet_wrap(~Treatment) +theme_bw()+guides(color=FALSE)

In the below example, we tell R to indicate the plant number at each point.

ggplot(data,aes(Species,Photosynthesis,color=as.factor(Trial)))+geom_point(position="jitter") +coord_flip()+facet_wrap(~Treatment) +theme_bw()+guides(color=FALSE)+geom_text(aes(label=Plant.Number),position="jitter")

Ok that is super messy, but we will come back to a better example of this after we summarize some of the data in Session 3.

ggsave

Exporting plots from the plot pane of R studio is certainly the easiest, but the journal you’re submitting to may want higher resolution, or you may want to control the size of the text and the spacing.

Therefore, the ggsave command is helpful, especially when we want to save multiple plots with the same themes, font, font size, etc.).

First save the plot as an object.

plot <- ggplot(data,aes(Species,Photosynthesis,fill=Treatment))+geom_boxplot(position="dodge") +coord_flip()+scale_fill_grey()

Then we can save our themes criteria as objects.

Space <- theme(axis.title.x=element_text(margin=margin(15,0,0,15)),axis.title.y=element_text(margin=margin(0,15,0,0))) # adds space between x axis text and x axis title

Theme <-  theme(axis.title=element_text(size=12),axis.text = element_text(size=12),strip.text.x=element_text(size=12),axis.text.y = element_text(face="italic"),plot.title = element_text(hjust = 0.5)) +theme_bw()

Then we can add them together

Plot.for.print <- plot + Space + Theme

And then save them into our working directory

ggsave(filename='Pretty Plot.png', plot = Plot.for.print, width = 225, height = 110,unit="mm",dpi = 300, limitsize = TRUE) # note that you can specify any directory or folder you want to save it by typing the whole path (e.g. './figures/pretty plot.png').