Introduction

Welcome to the webpage for the first session of our R for Biology Data Science Workshop Series. The purpose of this webpage is to provide the content we will cover during this session.

  • Session content outline
    • Data organization
    • Install R and RStudio
    • RStudio interface
    • Projects
    • Set working directories
    • Packages
    • Create R script
    • R code basics
    • Package Cheat Sheets

Data organization

We need to format our data so that R will read it correctly.

Our data should be set up as a rectangle with observations (samples) in rows and variables (treatment groups, results, etc.) in columns. Below is an example:

Sample  Trial  Treatment  Result
1       A      Control    0.4307779
2       A      Control    0.6755954
3       A      Control    0.8203907
4       A      Treated    0.6630276
5       A      Treated    0.3523970
6       A      Treated    0.7788465
7       B      Control    0.4163977
8       B      Control    0.9285789
9       B      Control    0.0550451
10      B      Treated    0.4989079
11      B      Treated    0.8311272
12      B      Treated    0.8836404

It is best to import your raw data rather than calculating averages or other summaries in Excel beforehand. We will use dplyr in Session 3 to summarize data and calculate means by group or other explanatory variables.
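
As a rough illustration of why raw rows are enough, the sketch below rebuilds the first six rows of the example table as a data frame and computes a mean per treatment group. It uses base R's aggregate() as a stand-in for the dplyr approach covered in Session 3, and the object names (example_data, group_means) are made up for this example.

```r
# Rebuild the first six rows of the example table (Trial A only)
example_data <- data.frame(
  Sample = 1:6,
  Trial = "A",
  Treatment = rep(c("Control", "Treated"), each = 3),
  Result = c(0.4307779, 0.6755954, 0.8203907,
             0.6630276, 0.3523970, 0.7788465)
)

# Mean Result for each treatment group, calculated from the raw rows
group_means <- aggregate(Result ~ Treatment, data = example_data, FUN = mean)
group_means
```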

Now that our data is ready for R, we need to install R and RStudio.

Install R and RStudio

It is best if we have the latest versions of R (3.5.2) and RStudio (1.1.463).
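
If you are not sure which version of R you already have, a quick check from the console might look like the sketch below. The variable name meets_minimum is only for illustration, and the comparison uses the R version mentioned above.

```r
# Print the full version string of the running R installation
R.version.string

# Compare the running version against the version mentioned above
meets_minimum <- getRversion() >= "3.5.2"
meets_minimum

# Inside RStudio only, the IDE version can be checked with:
# RStudio.Version()$version
```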

RStudio interface

RStudio has four panes for organizing and analyzing data.

Layout

  • RStudio Panes
    • Source
      • Used for creating and saving R source files (scripts).
    • Console
      • Runs commands and displays history of commands run in session.
    • Environment, Plots, Help, Viewer Tabs
      • The Environment tab represents the data that has been read into R.
      • The Plots tab displays plots generated from source or console.
        • Note that you may need to resize or expand the plots pane for some plots.
      • The Help tab will show you the help documentation.
        • For example, run the command: ?stats
    • History, Files, Connections, Packages
      • The Files tab displays files in the project directory (see below).
        • Note that it is still sometimes easier to manage files from Explorer/Finder.
      • The Packages tab lists the packages that have been installed on your computer. A checkmark indicates the packages that are currently loaded.

Rearrange panes

You can rearrange the panes to your preference in the pane layout section of preferences.

RStudio >> Preferences >> Pane Layout

Appearance

You can also change the appearance to various themes or dark modes in the appearance section of the preferences.

RStudio >> Preferences >> Appearance

Soft wrap

I also like to enable ‘soft wrap R source files’ so that long lines of code wrap within the window.

RStudio >> Preferences >> Code

Projects

Next we can get into one of the best features of RStudio, projects. Projects provide a powerful means to keep your data organized. You can think of projects as a folder or directory.

I generally create a new project for each chapter or manuscript. Then I like to create these subfolders:

  • Project Directory
    • Data
    • Code
    • Figures
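
The folder layout above can also be created from R. This is a minimal sketch: project_dir is a hypothetical location (a temporary folder here so the example is safe to run), and you would replace it with the path to your own project.

```r
# Hypothetical project location; replace with the path to your own project folder
project_dir <- file.path(tempdir(), "MyProject")

# Create the project folder and the three subfolders listed above
for (folder in c("Data", "Code", "Figures")) {
  dir.create(file.path(project_dir, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# Confirm that the folders now exist
list.files(project_dir)
```

Folder names are case-sensitive on some systems, so use the same spelling in your file paths as in your folder names.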

Set working directories

Defining a working directory is a critical step for making the most of projects and simplifying our workflow. A working directory essentially tells R where to look for and save your files. That way you can reference data without needing to type out the entire file path (e.g., C:/Desktop/Files/R/LongDirectoryPath).

It is recommended to set your project directory as your working directory. To do so, click on the Files tab of the pane and click the ‘More’ button.

Files (tab) >> More >> Set as Working Directory

Alternatively we can use the following command.

setwd("<folder containing our dataset>")
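
To confirm where R is currently looking, you can ask it directly. The sketch below is only an illustration: data_path points at a placeholder file name, so file.exists() will return FALSE unless a file with that name is really there.

```r
# Print the current working directory
current_dir <- getwd()
current_dir

# Build a path relative to the working directory (placeholder file name)
data_path <- file.path("data", "FILENAME.csv")

# TRUE only if that file exists inside the working directory
file.exists(data_path)
```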

Create R script

OK, now that we have a working directory, let's create our first script.

File (menu) >> New File >> R Script

We can save this script so that we can come back to our analysis later. Note that every time you reopen R, you will have to rerun your code to restore the objects in the environment.

Packages

R packages are collections of functions and documentation that can be shared and reused with multiple datasets.

poppr is an example of a popular package that was created by Zhian Kamvar and the Grünwald Lab.

In general, we need to install packages before we can use them. In this workshop series, we will use the Tidyverse package. The Tidyverse is a collection of packages that we will use to keep our data tidy. It includes ggplot2 and dplyr, the two packages we will use in the next two sessions.

We can install the package by running the code below:

install.packages("tidyverse")

Note that you only need to install this package in RStudio one time. Once it is installed, you can simply load the package (see below) anytime you want to use it.

Also, if you check the packages tab of your pane, you will see it installed ggplot2 and dplyr, and a few others.
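
Because installation is only needed once, some people wrap it in a check so that rerunning a script does not reinstall anything. The helper below is a hypothetical sketch (install_if_missing is not part of R or the Tidyverse); it relies only on the base functions requireNamespace() and install.packages().

```r
# Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report whether the package can now be found
  requireNamespace(pkg, quietly = TRUE)
}

# "stats" ships with R, so this call returns TRUE without downloading anything
install_if_missing("stats")
```

For this workshop you would call install_if_missing("tidyverse") instead.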

R code basics

Run Code

We can run code from the source file by highlighting it and clicking Run in the top right. Alternatively, you can press Ctrl/Cmd + Enter to run the line that the cursor sits on.

Load Packages

Once a package is installed, we can load it into our R session as below:

library(tidyverse)

Importing data

There are a couple of options to consider for importing data. For example, there are packages (e.g., readxl) that allow you to import .xls files. However, I recommend importing your data as .csv (which R reads without extra packages) and keeping your original .xls untouched.

We can save our Excel file as a .csv by selecting ‘Save As’ in Excel and saving it into the ‘data’ folder in our working directory.

Let's download some data that we can use for this workshop series: Mock Data

Now move the data into the ‘data’ folder of your working directory and add it to our environment using the code below.

#name <- read.csv('./data/FILENAME.csv')
data <- read.csv('./data/Silver Tree Study.csv')
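
One thing to be aware of is that read.csv() tidies column names by default, for example turning spaces into dots, which is probably why the workshop dataset has names like Unique.Sample.Number. The sketch below demonstrates this with a tiny made-up file written to a temporary folder; the file name and values are invented for illustration.

```r
# Write a tiny made-up CSV file to a temporary location
csv_path <- file.path(tempdir(), "example.csv")
writeLines(c("Sample Number,Treatment,Result",
             "1,Control,0.43",
             "2,Treated,0.66"),
           csv_path)

# Read it back in; the space in "Sample Number" becomes a dot
example_csv <- read.csv(csv_path)
names(example_csv)
## [1] "Sample.Number" "Treatment"     "Result"
```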

Basics

Make notes

R ignores anything on a line after a ‘#’. Therefore, you can add notes to your script by starting them with a hash (#).

#this is a note
This.is.a.variable <- "code"

Outline

Another nice feature in RStudio is the ability to create an outline. Click ‘show document outline’ in the top right corner of the Script Pane.

For example, you can create a header by placing text between two sets of four hashes (####).

#### R will treat this as a header ####
####**Sometimes I indent headers using asterisks**####

Assign variables

You can assign variables by using <- or =. I prefer to use <- so that I do not confuse the = with == (more later)
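
To make the difference concrete, here is a small sketch; the names threshold and result are invented for this example.

```r
threshold <- 0.5   # assignment with <-
threshold = 0.5    # assignment with = does the same thing here

# == compares values; it returns TRUE or FALSE and assigns nothing
threshold == 0.5
## [1] TRUE

result <- c(0.2, 0.7, 0.5)
result == threshold
## [1] FALSE FALSE  TRUE
```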

#We can assign data to a dataframe as well
df <- data.frame(Sample = 1:4,Treatment = rep(c('Control','Treated'),2),Result = runif(4))
df
##   Sample Treatment    Result
## 1      1   Control 0.8467715
## 2      2   Treated 0.6817997
## 3      3   Control 0.9803151
## 4      4   Treated 0.9547095

Reference Data Columns

We can refer to a single column or variable using the $ symbol.

df$Sample
## [1] 1 2 3 4
summary(data$Treatment)
##    Length     Class      Mode 
##      1722 character character

Function and Package Help

We can reference help files directly in RStudio by preceding a name with ? (for a specific function or topic) or ?? (to search all help files).

??tidyverse
?str

Structure

We can look at the structure of a dataset and check how R reads variables in a dataset using the str() function

str(data)
## 'data.frame':    1722 obs. of  14 variables:
##  $ Unique.Sample.Number  : chr  "1DroughtControl9" "1DroughtControl9" "1DroughtControl9" "1DroughtControl9" ...
##  $ Days.after.inoculation: int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Date                  : chr  "6/2/18" "6/2/18" "6/2/18" "6/2/18" ...
##  $ Licor                 : int  6400 6400 6400 6400 6400 6400 6400 6400 6400 6400 ...
##  $ Trial                 : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Treatment             : chr  "Drought" "Drought" "Drought" "Drought" ...
##  $ Species               : chr  "Control" "Control" "Control" "Control" ...
##  $ Plant.Number          : int  9 9 9 9 9 9 9 9 9 9 ...
##  $ Isolate.Number        : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ Unique.Sample.Number.1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Obs                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Photosynthesis        : num  -1.257 -0.456 -0.675 0.196 1.132 ...
##  $ Conductance           : num  -0.008756 -0.001057 -0.000624 0.004572 0.008387 ...
##  $ Ci                    : num  170 -286 -1315 328 182 ...

Here we can see that R interprets our Plant.Number as an integer. But really we want R to read it as a factor. We can change how R reads individual columns by assigning variables and referencing columns.

data$Plant.Number <- as.factor(data$Plant.Number)

Now we can check if the column is read as a factor with the str function

str(data$Plant.Number)
##  Factor w/ 10 levels "1","2","3","4",..: 9 9 9 9 9 9 9 9 9 9 ...

This will be important later when we are trying to summarize data based on plants. It is also important if your samples are represented by numbers.
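
One caveat when converting numbers to factors is that a factor stores its values as labelled integer codes, so converting back needs an extra step. The sketch below uses a short made-up vector (plant_number) rather than the workshop data.

```r
# A short made-up vector of plant numbers
plant_number <- c(9, 9, 3, 10)
plant_factor <- as.factor(plant_number)

# The levels are the sorted unique values, stored as text
levels(plant_factor)
## [1] "3"  "9"  "10"

# as.integer() returns the internal codes, not the original numbers
as.integer(plant_factor)
## [1] 2 2 1 3

# Convert through character first to recover the original values
as.numeric(as.character(plant_factor))
## [1]  9  9  3 10
```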

Helpful utilities

There are a few helpful utilities to remind us what the structure of our dataset looks like.

For example, the head() function is nice for reminding us what the data looks like. It displays the top few rows.

head(data)
##   Unique.Sample.Number Days.after.inoculation   Date Licor Trial Treatment
## 1     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 2     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 3     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 4     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 5     1DroughtControl9                      3 6/2/18  6400     1   Drought
## 6     1DroughtControl9                      3 6/2/18  6400     1   Drought
##   Species Plant.Number Isolate.Number Unique.Sample.Number.1 Obs Photosynthesis
## 1 Control            9             NA                      1   1     -1.2574071
## 2 Control            9             NA                      1   2     -0.4559930
## 3 Control            9             NA                      1   3     -0.6746335
## 4 Control            9             NA                      1   4      0.1958272
## 5 Control            9             NA                      1   5      1.1315593
## 6 Control            9             NA                      1   6      1.4892968
##    Conductance         Ci
## 1 -0.008756114   169.6851
## 2 -0.001057429  -286.0282
## 3 -0.000623729 -1315.1230
## 4  0.004572455   328.0515
## 5  0.008387323   181.6788
## 6  0.010670317   173.9482
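
head() shows six rows by default, and you can ask for a different number with the n argument. Its counterpart tail() shows the last rows, and dim() reports the number of rows and columns. A minimal sketch with a made-up data frame (small_df):

```r
# A made-up data frame with 12 rows and 2 columns
small_df <- data.frame(
  Sample = 1:12,
  Treatment = rep(c("Control", "Treated"), each = 3, times = 2)
)

head(small_df, n = 3)   # first three rows
tail(small_df, n = 3)   # last three rows
dim(small_df)           # number of rows, then number of columns
## [1] 12  2
```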

Similarly, I always struggle to remember the names/titles of columns, or I forget exactly how I spelled/capitalized them. For this reminder we can use the names() function.

names(data)
##  [1] "Unique.Sample.Number"   "Days.after.inoculation" "Date"                  
##  [4] "Licor"                  "Trial"                  "Treatment"             
##  [7] "Species"                "Plant.Number"           "Isolate.Number"        
## [10] "Unique.Sample.Number.1" "Obs"                    "Photosynthesis"        
## [13] "Conductance"            "Ci"

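Another helpful function is the levels() function, which shows us the different levels (groups) in a factor column. Note that it returns NULL for a character column such as Treatment here; convert the column with as.factor() first to see its levels.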

levels(data$Treatment)
## NULL
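
The NULL above appears because levels() only works on factors. A minimal sketch of the difference, using a made-up vector (the value "Watered" is invented for illustration):

```r
# A made-up character vector of treatments
treatment <- c("Drought", "Watered", "Drought", "Watered")

levels(treatment)   # character vectors have no levels
## NULL

unique(treatment)   # unique() works on character vectors
## [1] "Drought" "Watered"

treatment_factor <- as.factor(treatment)
levels(treatment_factor)
## [1] "Drought" "Watered"
```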

Finally, another powerful function is the summary() function. For a factor column it counts the number of observations in each level; for a character column, as here, it only reports the length and type.

summary(data$Treatment)
##    Length     Class      Mode 
##      1722 character character
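
To get counts per group, one option is to convert the column to a factor before calling summary(), or to use table(), which works on character columns directly. A small sketch with made-up values (treatment_example and "Watered" are invented for illustration):

```r
# A made-up character vector of treatments
treatment_example <- c("Drought", "Watered", "Drought", "Drought")

# On a factor, summary() counts the observations in each level
factor_counts <- summary(as.factor(treatment_example))
factor_counts
## Drought Watered 
##       3       1

# table() gives the same counts without converting first
treatment_counts <- table(treatment_example)
treatment_counts
```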

Cleaning Data

In most cases, it is easiest to correct misspelled observations and clean our data in Excel using the ‘Find and Replace’ function. However, it is safer to do it in R, because it is easy to make mistakes in Excel.

For example, if a column has extra levels because a value was misspelled a few times, you can use the stringr package to correct them with the code below.

library(stringr)
data$Species <- str_replace_all(data$Species, "Indigenous Pathogen", "P. multivora")
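
A sketch of the same idea on a small made-up vector is below. The misspelling "P. mulitvora" is invented for illustration. The pattern is wrapped in fixed() so that it is matched literally, because str_replace_all() otherwise treats the pattern as a regular expression, in which a full stop matches any character.

```r
library(stringr)

# A made-up vector with one value misspelled twice
species <- c("P. multivora", "P. mulitvora", "Control", "P. mulitvora")

# Check which distinct spellings are present before cleaning
unique(species)
## [1] "P. multivora" "P. mulitvora" "Control"

# Replace the misspelling, matching the pattern literally
species <- str_replace_all(species, fixed("P. mulitvora"), "P. multivora")

unique(species)
## [1] "P. multivora" "Control"
```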

Note that if you make changes to the Excel file, you will have to re-import it into your R environment. You can do so by simply going back to the top and running the code again.

We will learn more about cleaning our data in Session 3 - Introduction to dplyr. There we will learn to manipulate, transform, and summarize our data.

Package Cheat Sheets

RStudio has produced many ‘cheat sheets’ for various packages that are super helpful. Each package has many functions and it is impossible to remember them all. Therefore, I recommend downloading the cheat sheets for the packages listed below at least.

More cheat sheets are available at: https://www.rstudio.com/resources/cheatsheets/