R for Biology Data Science - Session 2 - Introduction to ggplot2 part 1

Workshop Series

Introduction

Welcome to the second session of our workshop series. This page is a guide to the subjects we will cover. In this session we will introduce ggplot2, a powerful, highly customizable, and widely used package for visualizing data.

Pre-Session Instructions

You will need to have RStudio (and R) installed on your computer and have the ‘tidyverse’ package installed. More information about this can be found in the lesson for Session 1.

Session 1 Recap

Create a Project

Projects provide a powerful means to keep your data organized. You can think of projects as a folder or directory.

I generally create a new project for each chapter or manuscript.

Download Data

Let's download some data that we can use for this workshop series: Mock Data

Import Data to Project

Now move the data to your working directory and add it to our environment using the below code.

Set Working Directory

Files (tab) >> More >> Set as Working Directory

Defining a working directory is a critical step for making the most of projects and simplifying our workflow. A working directory essentially tells R where to look for and save your files.

Remember to set your working directory if you create a new project so that R knows where to look for and save your data.

Create R Script

I recommend you start this session with a new script.

File (menu) >> New File >> R Script

We can save this script so that we can come back to our analysis later. Perhaps call your new script Session2.

Install Packages

Remember you can install packages using the below code.

install.packages("tidyverse")

It can take a few minutes for RStudio to unpack/install the packages.

The Tidyverse package is a collection (package) of packages. It includes several individual packages that work fairly well together.

Load Packages

Once a package is installed, we can load it into our R session as below.

library(tidyverse)

## ── Attaching packages ─────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Note that we have to load the package every time we open R.

Index Data

Ok, now we need to add the mock data to our environment.

Note that you will need to reload the data any time you make changes to the Excel files.

Also note that the file has to be a ‘.csv’ file for the below code to work.

If you prefer to work with ‘.xls’ files, you can use the readxl package. It is installed along with the tidyverse, but it is not attached by library(tidyverse), so you need to load it separately with library(readxl).

data <- read.csv("Silver Tree Study.csv")

Ok, now you should see a data.frame in the ‘Environment’ pane.

Note that every time you reopen R you will have to rerun each line of code for the data to show up in the environment again.

Note that you may need to add sep=";" to the command if Windows uses ‘;’ rather than ‘,’ as the separator in your .csv files.

Basically, if ‘data’ does not appear in your environment, or appears with only one column, you may need to specify the separator as below.

data <-read.csv('Silver Tree Study.csv',sep=";")
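To see why the separator matters, here is a minimal sketch that writes a small, made-up table to a temporary file using ‘;’ and reads it back both ways. The toy table and the temporary file are assumptions for illustration only; they are not the workshop data.

```r
# Hypothetical toy table written to a temporary file (not the workshop data)
toy <- data.frame(Treatment = c("Drought", "Wet"),
                  Photosynthesis = c(1.5, 6.2))
tmp <- tempfile(fileext = ".csv")
write.table(toy, tmp, sep = ";", row.names = FALSE, quote = FALSE)

# Default separator (","): each line is read as one single field
wrong <- read.csv(tmp)

# Matching separator (";"): the columns are split correctly
right <- read.csv(tmp, sep = ";")

ncol(wrong)
ncol(right)
```

If the number of columns is not what you expect after importing, the separator is a good first thing to check.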

Explore and Adjust

Let's run a few commands to refresh ourselves about the structure and organization of our data.

str(data)

## 'data.frame':    1722 obs. of  14 variables:
##  $ Unique.Sample.Number  : Factor w/ 102 levels "1DroughtBoth Pathogens9",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Days.after.inoculation: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Date                  : Factor w/ 10 levels "6/12/18","6/16/18",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Licor                 : int  6400 6400 6400 6400 6400 6400 6400 6400 6400 6400 ...
##  $ Trial                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment             : Factor w/ 2 levels "Drought","Wet": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Species               : Factor w/ 4 levels "Both Pathogens",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Plant.Number          : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ Isolate.Number        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Unique.Sample.Number.1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Obs                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Photosynthesis        : num  -1.257 -0.456 -0.675 0.196 1.132 ...
##  $ Conductance           : num  -0.008756 -0.001057 -0.000624 0.004572 0.008387 ...
##  $ Ci                    : num  170 -286 -1315 328 182 ...

Here we can see how R reads some of the columns. Note that R reads Trial as an integer.

Let's change it to a factor (grouping variable).

data$Trial <- as.factor(data$Trial)

Ok, now we can see if R recognizes the trial as a factor.

str(data$Trial) #note that this time we can specify the column using $ to keep things shorter.

##  Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
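As a small sketch of what the conversion does, the example below uses a made-up vector of days (an assumption for illustration, not the workshop data). A factor stores each value as a level label plus an internal integer code, so converting back to numbers has to go through the labels.

```r
# Hypothetical values, not the workshop data
days <- c(3, 7, 14, 7, 3)

days_f <- as.factor(days)
levels(days_f)   # the distinct values, stored as text labels

# Pitfall: as.integer() on a factor returns the internal codes (1, 2, 3, ...)
codes <- as.integer(days_f)

# To recover the original numbers, convert the labels instead
recovered <- as.numeric(as.character(days_f))

codes
recovered
```

This is worth keeping in mind if you ever need to turn a factor such as Trial back into a number.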

Good. We will probably have to do that again with another variable later, once we start visualizing the data with ggplot2.

Another way to remind ourselves about the different columns in our data is names().

names(data)

##  [1] "Unique.Sample.Number"   "Days.after.inoculation"
##  [3] "Date"                   "Licor"                 
##  [5] "Trial"                  "Treatment"             
##  [7] "Species"                "Plant.Number"          
##  [9] "Isolate.Number"         "Unique.Sample.Number.1"
## [11] "Obs"                    "Photosynthesis"        
## [13] "Conductance"            "Ci"

Also, summary() is another neat function to summarize the data.

summary(data$Photosynthesis)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -7.027   0.394   2.226   5.903   6.704 397.893

Although I really only find it useful for telling us the number of observations in a group.

summary(data$Treatment)

## Drought     Wet 
##     821     901

Ok, we had 821 observations of plants in the drought treatment, and 901 observations of plants in the wet treatment.

ggplot2

Today we are going to learn to use ggplot, but we will learn to use dplyr in Sessions 4 and 5.

Knowing how to use both of these packages provides a strong foundation for analyses.

Today we will learn to plot raw data with ggplot, but once we know dplyr, we will be able to summarize data (calculate means, standard errors, add variables, etc.), then plot the summarized data.

Introduction to ggplot2

ggplot is a package that is included in the tidyverse. It was developed specifically for visualizing data so it is pretty powerful and highly customizable.

Remember that you can access the help file for ggplot or any of the ggplot commands/functions (geom_bar, aes, etc.) by adding a question mark in front of the name.

?ggplot #note that you need to have loaded the package for that to work, otherwise you need to include two question marks (??ggplot)

Download ggplot2 Cheatsheet

RStudio has produced many ‘cheatsheets’ for various packages that are super helpful. Each package has many functions and it is impossible to remember them all.

ggplot cheatsheet - click to download.

General setup

“ggplot() is used to construct the initial plot object, and is almost always followed by + to add component to the plot.”

The general format for the ggplot command is:

Long command version

#ggplot(data=df, mapping=aes(x-variable, y-variable, other aesthetics such as color))

Short command version

We can also run the same command without specifying data= or mapping =

#ggplot(df,aes(x-variable, y-variable, other aesthetics))

Then we can add + geom_col() or +geom_boxplot() depending on which type of plot we want.
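The two versions of the command can be sketched with a tiny made-up data frame. The object ‘toy’ and its columns ‘x’ and ‘y’ are placeholders chosen for illustration, not part of the workshop data.

```r
library(ggplot2)

# Hypothetical toy data, not the workshop data set
toy <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8))

# Long version: every argument is named
p_long <- ggplot(data = toy, mapping = aes(x = x, y = y)) + geom_point()

# Short version: the same arguments, matched by position
p_short <- ggplot(toy, aes(x, y)) + geom_point()

# Both objects draw the same scatter plot
print(p_long)
print(p_short)
```

The short version is what we use for the rest of this session; the long version can be easier to read when you come back to a script later.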

Data Visualization

One Variable Plots

Continuous Data

Continuous data can take any numeric value within a range, including decimals and, as here, negative values. For example, our measurements for Photosynthesis are continuous.

summary(data$Photosynthesis)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -7.027   0.394   2.226   5.903   6.704 397.893

In contrast, discrete data (more below) are categorized into discrete groups.

Histograms

I generally recommend visualizing the distribution of your response variables (results) as a start because then you can decide which statistical tests are appropriate based on the distributions.

First let's look at the distribution of the Photosynthesis data. To do this we can use geom_histogram().

names(data) #We can check the names of the columns to remind us which variables we can plot, and how to correctly spell/capitalize them.

##  [1] "Unique.Sample.Number"   "Days.after.inoculation"
##  [3] "Date"                   "Licor"                 
##  [5] "Trial"                  "Treatment"             
##  [7] "Species"                "Plant.Number"          
##  [9] "Isolate.Number"         "Unique.Sample.Number.1"
## [11] "Obs"                    "Photosynthesis"        
## [13] "Conductance"            "Ci"

ggplot(data,aes(Photosynthesis))+geom_histogram() #note that this is only a one variable plot (only x, not x and y).

Shoot, I forgot about that super outlier. Looks like we will need a quick glimpse into dplyr (Session 4).

data <- data %>% filter(Photosynthesis<300) # lots more about this in 3 weeks :)
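One detail of filter() worth knowing is that it keeps only rows where the condition is TRUE, so rows with a missing value (NA) in that column are dropped as well. Here is a minimal sketch on a made-up data frame (the values are assumptions for illustration, not the workshop data).

```r
library(dplyr)

# Hypothetical toy data with one extreme value and one missing value
toy <- data.frame(Obs = 1:5,
                  Photosynthesis = c(1.2, 397.9, NA, -0.5, 6.7))

# Keeps rows where the condition is TRUE; the outlier AND the NA row are dropped
kept <- toy %>% filter(Photosynthesis < 300)

# To drop only the outlier and keep missing values, say so explicitly
kept_with_na <- toy %>% filter(Photosynthesis < 300 | is.na(Photosynthesis))

nrow(kept)
nrow(kept_with_na)
```

It is a good habit to compare the number of rows before and after filtering so you know how many observations were removed.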

Now we can run the same plot and see if the outlier is removed.

ggplot(data,aes(Photosynthesis))+geom_histogram() #note that this is only a one variable plot (only x, not x and y).

This plot shows the number of observations (y-axis) that fall within each bin, or range of values, of Photosynthesis (x-axis). By default, geom_histogram() splits the data into 30 bins.
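geom_histogram() uses 30 bins unless told otherwise, and the shape of the plot can change quite a bit with that choice. The sketch below uses randomly generated values (an assumption for illustration, not the workshop data) to show the ‘bins=’ and ‘binwidth=’ arguments.

```r
library(ggplot2)

# Hypothetical toy data: 200 simulated values
set.seed(1)
toy <- data.frame(Photosynthesis = rnorm(200, mean = 3, sd = 2))

# Set the number of bins directly
p_bins <- ggplot(toy, aes(Photosynthesis)) + geom_histogram(bins = 15)

# Or set the width of each bin in the units of the x-axis
p_width <- ggplot(toy, aes(Photosynthesis)) + geom_histogram(binwidth = 0.5)

# layer_data() returns the values ggplot computed for the bars;
# the counts across all bins add up to the number of observations
counts <- layer_data(p_bins)$count
sum(counts)
```

Trying a few different values is usually the quickest way to find a bin size that shows the distribution clearly.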

We can also do the same thing for the other response variables.

ggplot(data,aes(Conductance))+geom_histogram()

Pretty much the same distribution as Photosynthesis, which makes sense because they are related, right?

Let's stick to plotting Photosynthesis for now.

Density Plots

There are other plots to look at the same data. For example, below we can look at a density plot.

ggplot(data,aes(Photosynthesis))+geom_density()

Group Comparisons

Now we can add some complexity to compare the two groups, still only looking at a single response variable.

Below we indicate that we want each Trial to have a different color using the ‘fill=’ argument.

Remember that we designated Trial as a factor. To split the histogram into groups, ‘fill=’ needs a discrete variable such as a factor; a numeric column is treated as continuous and will not split the data into groups.

ggplot(data,aes(Photosynthesis,fill=Trial))+geom_histogram()

Here it looks like Trial 1 had higher counts, but that is only because the histogram stacks groups by default, so the Trial 1 bars sit on top of the Trial 2 bars.

See what happens if you try a density plot instead.

ggplot(data,aes(Photosynthesis,fill=Trial))+geom_density()

To indicate that we want our groups to appear side by side, rather than on top of each other, we have to specify a position.

ggplot(data,aes(Photosynthesis,fill=Trial))+geom_histogram(position="dodge")
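Besides stacking (the default) and "dodge", another option is position = "identity", which draws the groups over each other; adding some transparency with ‘alpha=’ keeps both visible. This is a sketch on simulated data (an assumption for illustration, not the workshop data).

```r
library(ggplot2)

# Hypothetical toy data: two simulated trials with different means
set.seed(2)
toy <- data.frame(
  Photosynthesis = c(rnorm(100, mean = 2), rnorm(100, mean = 4)),
  Trial = factor(rep(c("1", "2"), each = 100))
)

# Default: bars of the two trials are stacked on top of each other
p_stack <- ggplot(toy, aes(Photosynthesis, fill = Trial)) +
  geom_histogram(bins = 20)

# Side by side within each bin
p_dodge <- ggplot(toy, aes(Photosynthesis, fill = Trial)) +
  geom_histogram(bins = 20, position = "dodge")

# Overlaid, with transparency so both groups remain visible
p_overlay <- ggplot(toy, aes(Photosynthesis, fill = Trial)) +
  geom_histogram(bins = 20, position = "identity", alpha = 0.5)

print(p_overlay)
```

Which position reads best depends on the data; the overlaid version tends to work well when there are only two or three groups.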

Discrete Data

Discrete data refers to data that are categorized into discrete groups. For example, our variables that are factors are discrete variables.

summary(data$Trial)

##   1   2 
## 925 787

summary(data$Treatment)

## Drought     Wet 
##     821     891

summary(data$Species)

##       Both Pathogens              Control     Exotic Pathogen  
##                   72                  563                  508 
## Indigenous Pathogen  
##                  569

We can visualize discrete data using geom_bar()

ggplot(data,aes(Species))+geom_bar()

This plot is telling us how many observations were made for each (discrete) group.

Remember that at least 10 measurements were made for each plant, so these observations do not represent each plant. We will learn to summarize (average) the observations per plant in Session 4.

Note that we did not take many measurements from plants that were infected with both pathogens. We ultimately did not have enough time to measure the physiological response for that group.

As a side note, I often prefer to flip the coordinates so that the labels read better.

We can flip the axes using + coord_flip().

ggplot(data,aes(Species))+geom_bar() +coord_flip()

Looks like we made fewer observations of plants infected with the exotic pathogen than of plants infected with the indigenous pathogen or the controls.

Let's include one more discrete variable by adding a fill argument and changing the position.

ggplot(data,aes(Species,fill=Treatment))+geom_bar(position="dodge") +coord_flip()
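The bar heights in these plots are simply counts of rows per group, so we can get the same numbers as a table with the base R function table(). The sketch below uses a small made-up data frame (an assumption for illustration, not the workshop data).

```r
# Hypothetical toy data, not the workshop data set
toy <- data.frame(
  Species = c("Control", "Control", "Exotic Pathogen",
              "Exotic Pathogen", "Exotic Pathogen", "Indigenous Pathogen"),
  Treatment = c("Drought", "Wet", "Wet", "Wet", "Drought", "Wet")
)

# One grouping variable: the numbers behind geom_bar()
species_counts <- table(toy$Species)
species_counts

# Two grouping variables: the numbers behind the dodged bars
counts <- table(toy$Species, toy$Treatment)
counts
```

Having the exact counts next to the plot makes it easier to judge small differences between groups.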

Looks like we made a few more observations in the Wet treatment than in the Drought treatment.

Two Variable Plots

Now we can look at a few plots with two variables where we define x and y.

Continuous x, Continuous y

We can see if there is a relationship between Photosynthesis and Conductance. Then we can see if that relationship changes if plants are infected with pathogens.

Point Plots

ggplot(data,aes(Photosynthesis,Conductance))+geom_point()

Pretty messy, but it looks like there is a linear trend as we would expect.

This time if we want to compare different treatments or other categories, we have to use the ‘color=’ command rather than the ‘fill=’.

ggplot(data,aes(Photosynthesis,Conductance,color=Trial))+geom_point()

Line Plots

The plot below connects the observations with a line, in order of their x values.

ggplot(data,aes(Photosynthesis,Conductance))+geom_line()

We can use the geom_smooth() command to fit a smoothed trend line through our data, drawn with a shaded confidence band.

We may want to check the help file for geom_smooth() to see exactly what it is doing.

?geom_smooth()

ggplot(data,aes(Photosynthesis,Conductance))+geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Not surprisingly, there is a roughly linear relationship between the response variables.

We may also want to fit a straight line by defining the method in the geom_smooth() command.

ggplot(data,aes(Photosynthesis,Conductance))+geom_smooth(method=lm)
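With method=lm, geom_smooth() draws the line from a linear model of y on x, so we can look at the intercept and slope ourselves with the base R function lm(). The sketch below uses simulated data with a known slope of 0.01 (an assumption for illustration, not the workshop data).

```r
# Hypothetical toy data: Conductance rises by 0.01 per unit of Photosynthesis
set.seed(42)
toy <- data.frame(Photosynthesis = runif(50, min = 0, max = 10))
toy$Conductance <- 0.01 * toy$Photosynthesis + rnorm(50, sd = 0.005)

# y ~ x: Conductance is on the y-axis, Photosynthesis on the x-axis
fit <- lm(Conductance ~ Photosynthesis, data = toy)

# Intercept and slope of the fitted straight line
coef(fit)
```

Running summary(fit) gives more detail, such as standard errors and p-values, but we will leave model interpretation for another session.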

Now let's see if they differ between treatments or species.

ggplot(data,aes(Photosynthesis,Conductance,linetype=Treatment))+geom_smooth(method=lm) #note that we define a linetype in this.

Interesting, the relationship has a steeper slope for plants in the Wet treatment.

ggplot(data,aes(Photosynthesis,Conductance,linetype=Treatment,color=Species))+geom_smooth(method=lm) #note that we add color and linetype commands

Ok, clearly the plot is getting quite complicated once we start adding multiple grouping variables. This is a good time to introduce ‘facet wrapping’.

Facet Wrapping

ggplot has a neat way to create plots side by side using a command called facet_wrap.

Here we can split the above plot into two separate plots, one for each treatment.

ggplot(data,aes(Photosynthesis,Conductance,color=Species))+geom_smooth(method=lm) +facet_wrap(~Treatment) #note that we add Treatment as a facet_wrap argument rather than as a linetype.

Well the plot is still pretty messy, but the differences between treatments are clearer.

We can also arrange facets differently

ggplot(data,aes(Photosynthesis,Conductance,color=Treatment))+geom_smooth(method=lm) +facet_wrap(~Species) #here we switched treatment and species

And we can define how many columns or rows we want to present the facets in.

ggplot(data,aes(Photosynthesis,Conductance,color=Treatment))+geom_smooth(method=lm) +facet_wrap(~Species,nrow=1) #here we define nrow=1

Discrete x, Continuous y

The final group of plots that we will cover in this session involves one (discrete) grouping variable and one continuous variable.

Boxplots

Boxplots are commonly used to visualize differences in distributions between groups.

ggplot(data,aes(Species,Photosynthesis))+geom_boxplot()

Note that boxplots show the median, not the mean. We will learn to calculate and visualize means and standard errors in Session 3.

ggplot(data,aes(Species,Photosynthesis))+geom_boxplot() +coord_flip()

It looks like the plants infected with pathogens had lower median Photosynthesis values overall.

ggplot(data,aes(Species,Photosynthesis,fill=Treatment))+geom_boxplot(position="dodge") +coord_flip()

Here we can see that drought substantially lowered the levels of Photosynthesis.

So there is one major limitation to the plots above. There is something that we are not accounting for. As plant pathologists, plant scientists, and mycologists, can you guess what it is?

Let's look at our columns again.

names(data)

##  [1] "Unique.Sample.Number"   "Days.after.inoculation"
##  [3] "Date"                   "Licor"                 
##  [5] "Trial"                  "Treatment"             
##  [7] "Species"                "Plant.Number"          
##  [9] "Isolate.Number"         "Unique.Sample.Number.1"
## [11] "Obs"                    "Photosynthesis"        
## [13] "Conductance"            "Ci"

Do you see any variables that might be important to include to see a difference between our Species groups?

str(data$Days.after.inoculation)

##  int [1:1712] 3 3 3 3 3 3 3 3 3 3 ...

Here R recognizes days after inoculation as an integer. Let's see what happens if we try to plot that on the x-axis.

ggplot(data,aes(Days.after.inoculation,Photosynthesis,fill=Species))+geom_boxplot(position="dodge")+facet_wrap(~Treatment)

## Warning: position_dodge requires non-overlapping x intervals

## Warning: position_dodge requires non-overlapping x intervals

Not quite what we are looking for. We want to see if the photosynthesis levels change over time since infection. Let's see what happens if we set Days.after.inoculation as a factor.

data$Days.after.inoculation <- as.factor(data$Days.after.inoculation)

ggplot(data,aes(Days.after.inoculation,Photosynthesis,fill=Species))+geom_boxplot(position="dodge")+facet_wrap(~Treatment,ncol=1)

Ok, things are getting very complex.
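Converting the column with as.factor() changes the data itself. As a possible alternative, we can leave the column numeric and do the grouping only inside the plot. The sketch below uses simulated data with a made-up ‘Days’ column (an assumption for illustration, not the workshop data).

```r
library(ggplot2)

# Hypothetical toy data: 10 simulated measurements on each of three days
set.seed(3)
toy <- data.frame(
  Days = rep(c(3L, 7L, 14L), each = 10),
  Photosynthesis = rnorm(30, mean = 3)
)

# Option 1: wrap the column in factor() inside aes();
# the days become evenly spaced categories on the x-axis
p_inline <- ggplot(toy, aes(factor(Days), Photosynthesis)) + geom_boxplot()

# Option 2: keep the numeric axis and tell ggplot how to group the boxes;
# the boxes are then placed at their true positions (3, 7 and 14)
p_group <- ggplot(toy, aes(Days, Photosynthesis, group = Days)) + geom_boxplot()

# The column in the data frame is still an integer
is.integer(toy$Days)
```

Either way there is one box per day; the choice mostly depends on whether the spacing between days should be shown on the axis.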