Welcome to the webpage for the first session of our R for Biology Data Science Workshop Series. The purpose of this webpage is to provide the content we will cover during this session.
We need to format our data so that R will read it correctly.
Our data should be set up as a rectangle with observations (samples) in rows and variables (treatment groups, results, etc.) in columns. Below is an example:
Sample | Trial | Treatment | Result |
---|---|---|---|
1 | A | Control | 0.4307779 |
2 | A | Control | 0.6755954 |
3 | A | Control | 0.8203907 |
4 | A | Treated | 0.6630276 |
5 | A | Treated | 0.3523970 |
6 | A | Treated | 0.7788465 |
7 | B | Control | 0.4163977 |
8 | B | Control | 0.9285789 |
9 | B | Control | 0.0550451 |
10 | B | Treated | 0.4989079 |
11 | B | Treated | 0.8311272 |
12 | B | Treated | 0.8836404 |
It is best to import your raw data rather than calculating averages or other summaries in Excel beforehand. We will use dplyr in Session 3 to summarize data and calculate means by group or other explanatory variables.
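As a rough preview of why the raw data is enough, the sketch below rebuilds the example table as a data frame and lets R compute the group means. It uses base R's aggregate() rather than the dplyr approach covered in Session 3, and the object names (example_data, group_means) are only illustrative.

```r
# Rebuild the example table above as a data frame (one row per sample)
example_data <- data.frame(
  Sample    = 1:12,
  Trial     = rep(c("A", "B"), each = 6),
  Treatment = rep(rep(c("Control", "Treated"), each = 3), times = 2),
  Result    = c(0.4307779, 0.6755954, 0.8203907,
                0.6630276, 0.3523970, 0.7788465,
                0.4163977, 0.9285789, 0.0550451,
                0.4989079, 0.8311272, 0.8836404)
)

# Mean Result for every Trial x Treatment combination
group_means <- aggregate(Result ~ Trial + Treatment,
                         data = example_data, FUN = mean)
group_means
```

Because each row is one observation, any grouping we want later can be computed from the same table without going back to the spreadsheet.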
Now that our data is ready for R, we need to install R and RStudio.
It is best if we have the latest versions of R (3.5.2) and RStudio (1.1.463).
RStudio has four panes for organizing and analyzing data.
You can rearrange the panes to your preference in the pane layout section of preferences.
RStudio >> Preferences >> Pane Layout
You can also change the appearance to various themes or dark modes in the appearance section of the preferences.
RStudio >> Preferences >> Appearance
I also like to enable ‘soft wrap R source files’ so that the lines of code wrap within your window.
RStudio >> Preferences >> Code
Next we can get into one of the best features of RStudio: projects. Projects provide a powerful way to keep your data organized. You can think of a project as a folder or directory.
I generally create a new project for each chapter or manuscript. Then I like to create these sub folders.
Defining a working directory is a critical step for making the most of projects and simplifying our workflow. A working directory essentially tells R where to look for and save your files. That way you can reference data without needing to type out the entire file path (e.g. C://Desktop/Files/R/LongDirectoryPath).
It is recommended to set your project directory as your working directory. To do so, click on the Files tab of the pane and click the ‘More’ button.
Files (tab) >> More >> Set as Working Directory
Alternatively we can use the following command.
setwd("<folder containing our dataset>")
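To see what setting a working directory changes, here is a minimal sketch that can be run anywhere. The folder name my_project is hypothetical and is created in a temporary location purely for the demonstration; in practice you would point setwd() at your own project folder.

```r
# Remember where we started so we can switch back afterwards
old_wd <- getwd()

# Hypothetical project folder, created in a temporary location for this demo
project_dir <- file.path(tempdir(), "my_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

setwd(project_dir)                  # make the project folder the working directory
getwd()                             # confirm where R is now looking

# Relative paths such as "./data" now resolve inside the project folder
in_project <- dir.exists("./data")
in_project

setwd(old_wd)                       # restore the original working directory
```

The short relative path ./data works only because the working directory points at the project folder, which is why the import code later in this session can stay so short.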
OK, now that we have a working directory, let's create our first script.
File (menu) >> New File >> R Script
We can save this script so that we can come back to our analysis later. Note that every time you reopen R you will have to rerun the code to restore the objects in the environment.
R packages are collections of functions and documentation that can be reproduced and used with multiple datasets.
One example of a popular package is poppr, which was created by Zhian Kamvar and the Grünwald Lab.
In general, we need to install packages before we can use them. In this workshop series, we will use the Tidyverse package. The Tidyverse is a collection of packages that we will use to keep our data tidy. It includes ggplot2 and dplyr, the two packages we will use in the next two sessions.
We can install the package by running the code below:
install.packages("tidyverse")
Note that you only need to install this package in RStudio one time. Once it is installed, you can simply load the package (see below) anytime you want to use it.
Also, if you check the packages tab of your pane, you will see it installed ggplot2 and dplyr, and a few others.
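If you are not sure whether a package is already installed, one possible check is sketched below. The helper name is_installed is made up for this example; requireNamespace() and install.packages() are standard R functions. The install line is left commented out so that nothing is downloaded by accident.

```r
# Returns TRUE when a package is already installed on this computer
is_installed <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# The stats package ships with R, so this is a safe way to try the helper
is_installed("stats")

# Install the tidyverse only if it is missing (uncomment to use)
# if (!is_installed("tidyverse")) install.packages("tidyverse")
```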
We can run code from the source file by highlighting it and clicking ‘Run’ in the top right. Alternatively, you can press Ctrl/Cmd + Enter to run the line that the cursor sits on.
Once a package is installed, we can load it into our R session as below:
library(tidyverse)
There are a couple of options to consider for importing data. For example, there are packages (e.g. readxl) that allow you to import .xls files. However, I recommend importing your data as .csv (R default) and keeping your original .xls untouched.
We can save our Excel file as a .csv by selecting ‘Save As’ in Excel and saving it into the ‘data’ folder in our working directory.
Let's download some data that we can use for this workshop series: Mock Data
Now move the data into the ‘data’ folder in your working directory and add it to our environment using the code below.
#name <- read.csv('./data/FILENAME.csv')
data <- read.csv('./data/Silver Tree Study.csv')
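If you want to try read.csv() before downloading the workshop data, the sketch below writes a tiny made-up file to a temporary folder and reads it back. The file name and its contents are invented for illustration only. It also shows why the column names in the workshop dataset contain dots: read.csv() replaces spaces in column headers with periods by default.

```r
# Write a tiny, made-up CSV file to a temporary location
csv_path <- file.path(tempdir(), "mock study.csv")
writeLines(c("Sample Number,Treatment,Result",
             "1,Control,0.43",
             "2,Treated,0.66"),
           csv_path)

# Read it back in; spaces in the file name are fine inside the quotes
mock <- read.csv(csv_path)

# Spaces in column headers are converted to periods
names(mock)
```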
R ignores anything on a line that follows a ‘#’. Therefore, you can add notes to your script by starting them with a hash (#).
#this is a note
This.is.a.variable <- "code"
Another nice feature in RStudio is the ability to create an outline. Click ‘show document outline’ in top right corner of the Script Pane.
For example, you can create a header by placing text between two sets of four hashes (####).
#### R will treat this as a header ####
####**Sometimes I indent headers using asterisks**####
You can assign variables by using <- or =. I prefer to use <- so that I do not confuse = with == (more on this later).
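A short sketch of the difference, using made-up variable names: both <- and = store a value, while == asks a question and returns TRUE or FALSE.

```r
x <- 5            # assignment with the arrow
y = 5             # assignment with the equals sign (works, but easier to misread)

same <- x == y    # comparison: "is x equal to y?"
same
```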
#We can assign data to a dataframe as well
df <- data.frame(Sample = 1:4,Treatment = rep(c('Control','Treated'),2),Result = runif(4))
df
## Sample Treatment Result
## 1 1 Control 0.8467715
## 2 2 Treated 0.6817997
## 3 3 Control 0.9803151
## 4 4 Treated 0.9547095
We can refer to a single column or variable using the $ symbol.
df$Sample
## [1] 1 2 3 4
summary(data$Treatment)
## Length Class Mode
## 1722 character character
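Beyond $, double square brackets give the same column and are handy when the column name is stored in a variable. The sketch below uses a small made-up data frame similar to the one above; the variable name col is only for illustration.

```r
df <- data.frame(Sample = 1:4,
                 Treatment = rep(c("Control", "Treated"), 2))

df$Treatment          # column by name with $
df[["Treatment"]]     # the same column with double brackets

col <- "Treatment"    # the column name can also live in a variable
df[[col]]

# A column can be filtered using another column
late_samples <- df$Treatment[df$Sample > 2]
late_samples
```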
We can open help files directly in RStudio by preceding code with the ? symbol. A single ? opens the help page for a function, while ?? searches the help files for a term.
??tidyverse
?str
We can look at the structure of a dataset and check how R reads variables in a dataset using the str() function
str(data)
## 'data.frame': 1722 obs. of 14 variables:
## $ Unique.Sample.Number : chr "1DroughtControl9" "1DroughtControl9" "1DroughtControl9" "1DroughtControl9" ...
## $ Days.after.inoculation: int 3 3 3 3 3 3 3 3 3 3 ...
## $ Date : chr "6/2/18" "6/2/18" "6/2/18" "6/2/18" ...
## $ Licor : int 6400 6400 6400 6400 6400 6400 6400 6400 6400 6400 ...
## $ Trial : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Treatment : chr "Drought" "Drought" "Drought" "Drought" ...
## $ Species : chr "Control" "Control" "Control" "Control" ...
## $ Plant.Number : int 9 9 9 9 9 9 9 9 9 9 ...
## $ Isolate.Number : int NA NA NA NA NA NA NA NA NA NA ...
## $ Unique.Sample.Number.1: int 1 1 1 1 1 1 1 1 1 1 ...
## $ Obs : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Photosynthesis : num -1.257 -0.456 -0.675 0.196 1.132 ...
## $ Conductance : num -0.008756 -0.001057 -0.000624 0.004572 0.008387 ...
## $ Ci : num 170 -286 -1315 328 182 ...
Here we can see that R interprets our Plant.Number as an integer. But really we want R to read it as a factor. We can change how R reads individual columns by assigning variables and referencing columns.
data$Plant.Number <- as.factor(data$Plant.Number)
Now we can check if the column is read as a factor with the str function
str(data$Plant.Number)
## Factor w/ 10 levels "1","2","3","4",..: 9 9 9 9 9 9 9 9 9 9 ...
This will be important later when we are trying to summarize data based on plants. It is also important if your samples are represented by numbers.
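One thing to watch for when numbers are stored as a factor: converting them straight back with as.numeric() returns the internal level codes, not the original values. The sketch below uses a made-up vector of plant numbers to show the difference.

```r
plant <- c(9, 9, 10, 2)          # made-up plant numbers
plant_f <- as.factor(plant)

levels(plant_f)                  # the distinct plant numbers, stored as text

codes <- as.numeric(plant_f)     # internal level codes, NOT the plant numbers
codes

# Go through as.character() to recover the original numbers
original <- as.numeric(as.character(plant_f))
original
```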
There are a few helpful utilities to remind us what the structure of our dataset looks like.
For example, the head() function is nice for reminding us what the data looks like; it displays the first few rows (six by default).
head(data)
## Unique.Sample.Number Days.after.inoculation Date Licor Trial Treatment
## 1 1DroughtControl9 3 6/2/18 6400 1 Drought
## 2 1DroughtControl9 3 6/2/18 6400 1 Drought
## 3 1DroughtControl9 3 6/2/18 6400 1 Drought
## 4 1DroughtControl9 3 6/2/18 6400 1 Drought
## 5 1DroughtControl9 3 6/2/18 6400 1 Drought
## 6 1DroughtControl9 3 6/2/18 6400 1 Drought
## Species Plant.Number Isolate.Number Unique.Sample.Number.1 Obs Photosynthesis
## 1 Control 9 NA 1 1 -1.2574071
## 2 Control 9 NA 1 2 -0.4559930
## 3 Control 9 NA 1 3 -0.6746335
## 4 Control 9 NA 1 4 0.1958272
## 5 Control 9 NA 1 5 1.1315593
## 6 Control 9 NA 1 6 1.4892968
## Conductance Ci
## 1 -0.008756114 169.6851
## 2 -0.001057429 -286.0282
## 3 -0.000623729 -1315.1230
## 4 0.004572455 328.0515
## 5 0.008387323 181.6788
## 6 0.010670317 173.9482
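A few related utilities are sketched below on a small made-up data frame: head() accepts the number of rows to show, tail() shows the last rows, and dim() reports the number of rows and columns.

```r
# Small made-up data frame with 20 rows
mock <- data.frame(Obs = 1:20, Value = seq(0.5, 10, by = 0.5))

head(mock, n = 3)    # first three rows instead of the default six
tail(mock, n = 3)    # last three rows
dim(mock)            # number of rows, then number of columns
```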
Similarly, I always struggle to remember the names/titles of columns, or I forget exactly how I spelled or capitalized them. For this reminder we can use the names() function.
names(data)
## [1] "Unique.Sample.Number" "Days.after.inoculation" "Date"
## [4] "Licor" "Trial" "Treatment"
## [7] "Species" "Plant.Number" "Isolate.Number"
## [10] "Unique.Sample.Number.1" "Obs" "Photosynthesis"
## [13] "Conductance" "Ci"
Another helpful function is the levels() function, which shows us the different levels (groups) in a factor column. It returns NULL here because Treatment was read as a character column rather than a factor.
levels(data$Treatment)
## NULL
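The NULL above comes from the column type, not from the data. The sketch below uses a made-up character vector (the treatment names are invented for illustration) to show two ways of listing the groups: unique() works directly on text, and levels() works once the text is converted to a factor.

```r
# Made-up treatment labels stored as plain text
treatment <- c("Drought", "Control", "Drought", "Flood")

levels(treatment)                    # NULL: character vectors have no levels

unique(treatment)                    # distinct values, in order of appearance

treatment_f <- factor(treatment)     # convert to a factor
levels(treatment_f)                  # levels, sorted alphabetically
```

For the workshop data, the same idea would be applied to data$Treatment, in the same way Plant.Number was converted with as.factor() earlier.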
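Finally, another powerful function is summary(). For a factor column it counts the number of observations in each level; for a character column such as Treatment it reports only the length, class, and mode, as shown here.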
summary(data$Treatment)
## Length Class Mode
## 1722 character character
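To get counts per group, the column needs to be a factor, or we can use table(), which counts text values directly. A minimal sketch with made-up treatment labels:

```r
# Made-up treatment labels stored as plain text
treatment <- c("Drought", "Control", "Drought", "Drought", "Control")

summary(treatment)                      # character: length, class, and mode only

counts <- summary(factor(treatment))    # factor: observations per level
counts

table(treatment)                        # table() also counts text values
```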
It is often easiest to correct misspelled observations and clean our data in Excel using the ‘Find and Replace’ function. However, it is safer to do it in R, because it is easy to make mistakes in Excel.
For example, if a column has extra levels because a name was spelled inconsistently, you can use the stringr package to replace them with the code below.
library(stringr)
data$Species <- str_replace_all(data$Species, "Indigenous Pathogen", "P. multivora")
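If stringr is not installed, a base R alternative (not the stringr call used above) is to overwrite the entries that match exactly. The sketch below uses a made-up species vector and checks the result with table() before and after the change.

```r
# Made-up species labels with one name recorded two different ways
species <- c("Control", "Indigenous Pathogen", "P. multivora", "Indigenous Pathogen")

table(species)                                   # before: three distinct labels

# Overwrite only the entries that match the old label exactly
species[species == "Indigenous Pathogen"] <- "P. multivora"

table(species)                                   # after: two distinct labels
```

Unlike str_replace_all(), which replaces matching text anywhere inside a value, this approach changes only values that equal the old label in full.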
Note that if you make changes to the Excel file, you will have to save it as a .csv again and re-import it into your R environment. You can do so by simply going back to the top and running the code again.
We will learn more about cleaning our data in Session 3 - Introduction to dplyr. There we will learn to manipulate, transform, and summarize our data.
## Package Cheat Sheets
RStudio has produced many ‘cheat sheets’ for various packages that are super helpful. Each package has many functions and it is impossible to remember them all. Therefore, I recommend downloading the cheat sheets for the packages listed below at least.
More cheatsheets are available at: https://www.rstudio.com/resources/cheatsheets/