Redcedar Data Analyses Instructions
library(tidyverse)
#devtools::install_github("lhmrosso/XPolaris")
library(XPolaris)
library(randomForest)
library(caret)
library(rpart)
library(knitr)
library(corrplot)

Approach

We used the XPolaris package to download POLARIS soils tiles and extract data for 13 soils variables for all of our data points.

Instructions for using this package are available at https://github.com/lhmrosso/XPolaris

Then we completed a random forest analysis with the soils data to see what variables were the most important for classifying trees as healthy or unhealthy.

Data Wrangling

Import iNat Data - Empirical Tree Points (Response variables)

The steps for wrangling the data are described here.

#data <- read.csv('https://github.com/jmhulbert/open/raw/main/redcedar/data/data-modified.csv')

data <- read.csv("~/ServerFiles/open/redcedar/data/data-modified.csv") #offline  

Format and export for collecting Polaris soil data with xPolaris Package

Data were subset to include only gps information to use in collecting ancillary data.

gps.soils <- data[c(2,24,25)] #subset data to only include id and gps coordinates
gps.soils <- rename(gps.soils,lat = latitude) %>% `colnames<-`(c("ID","lat","long")) 
#write.csv(gps.soils,file="/Users/redcedar/ServerFiles/open/redcedar/data/gps.soils.2232.csv") #named 2232 because data was downloaded after having 2232 observations

Drop data with missing coordinates were filtered out - breaks code below otherwise

gps.soils <- gps.soils %>% filter (lat!="")
xplot(gps.soils)

We may need to filter out canada in later analysis, not sure if tiffs from BC will have data or be helpful. Everything above 49th parallel was NA. Filtering the Canada data beforehand might save some time in the data wrangling.

Tiles for soils data were downloaded using the ximages command in the package. 53 tiles for all 6 soil layer depths and all 13 variables (4134 tiles) were downloaded. This process took us close to 12 hours to complete.

#mean_soils_data <- ximages(locations = gps.soils,
#                     statistics = c('mean'),
#                     variables = c('ph','om','clay','sand','silt','bd','hb','n','ksat','theta_r','theta_s','lambda','alpha'),
#                      layersdepths = c('0_5','5_15','15_30','30_60','60_100','100_200'))
                    #variables = c('ph','om','clay','sand','silt','bd','hb','n','ksat','theta_r','theta_s'),
                      #layersdepths = c('0_5','5_15','15_30','30_60','60_100','100_200'))


# These data were saved in our tempdir() at  "/var/folders/nc/nbgmcl_s2_q4d2r3z08pnjz00000gn/T//RtmpPU3kIf"

Next, we extracted the soils data from the tiles using the xsoil command.

We received the below error from one raster file in the silt data, but we manually downloaded the tif file and replaced it in the data folder, then it worked fine.

Error in dplyr::mutate(): ℹ In argument: extracts = purrr::map2(...). Caused by error in purrr::map2(): ℹ In index: 27. Caused by error: ! [extract] cannot read values

The below xsoil command worked fine after fixing that silt file mentioned above.

#mean_all_vars_xsoil<-xsoil(ximages_output=mean_soils_data)

12882 observations in final dataset with 18 variables (2221 obs x 6 soil horizons = 1.3326^{4} - 500ish observations some reason.

Data were saved as xsoil_data_all_13_vars.csv

#write.csv(mean_all_vars_xsoil,file="/Users/redcedar/ServerFiles/open/redcedar/data/xsoil_data_all_13_vars.csv")

Random Forest Analysis

tidy soils data

Remove specific soil variables not useful as explanatory variables (e.g. latitutde)

#soils <- read.csv('https://github.com/jmhulbert/open/raw/main/redcedar/data/xsoil_data_all_13_vars.csv')

soils <- read.csv('~/ServerFiles/open/redcedar/data/xsoil_data_all_13_vars.csv') #offline
soils <- soils %>% rename_at('ID',~'id')
soils <-soils %>% select(-c(
"X"                                                                                   
,"lat"                                                                            
,"long"                                                                           
,"statistics"))
tidy.soils <-soils %>% pivot_longer(.,cols=!(c(id,layersdepths)),names_to="soilvar",values_to="value") %>% pivot_wider(names_from=c(soilvar,layersdepths),values_from=value)

tidy iNat data

Remove iNaturalist columns and explanatory variables not needed for random forest models

join data

full <- left_join(tidy.soils,random.forest.data,by="id",keep=FALSE)

remove NA’s (Canada data)

full <- full %>% filter(bd_0_5!="NA") %>% droplevels()

random forest models

five.cats.full <- full %>% select(-c("binary.tree.canopy.symptoms","id"))
five.cats.full$reclassified.tree.canopy.symptoms <- as.factor(five.cats.full$reclassified.tree.canopy.symptoms)
binary.full <- full %>% select(-c("reclassified.tree.canopy.symptoms","id"))
binary.full$binary.tree.canopy.symptoms <- as.factor(binary.full$binary.tree.canopy.symptoms)

Compare model errors

Five categorical response

set.seed(71)
rf.five.cats <- randomForest(reclassified.tree.canopy.symptoms ~ ., data=five.cats.full, ntree=2001, importance=TRUE, na.action=na.omit, proximity=TRUE)
rf.five.cats
## 
## Call:
##  randomForest(formula = reclassified.tree.canopy.symptoms ~ .,      data = five.cats.full, ntree = 2001, importance = TRUE, proximity = TRUE,      na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 45.76%
## Confusion matrix:
##                 Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top              46     131    10              18            6   0.7819905
## Healthy               53     880    32              58           11   0.1489362
## Other                 18     148    29              17            2   0.8644860
## Thinning Canopy       40     193    16              26            2   0.9061372
## Tree is Dead          17      49     3               7            4   0.9500000

Binary Normal Model

set.seed(71)
rf.binary <- randomForest(binary.tree.canopy.symptoms ~ ., data=binary.full, ntree=2001, importance=TRUE, na.action=na.omit, proximity=TRUE)
rf.binary
## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.full,      ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 2001
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 37.33%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy       747       287   0.2775629
## Unhealthy     391       391   0.5000000

Identify important variables

Binary Response

plot(rf.binary)

#Below code copied from: https://github.com/StatQuest/random_forest_demo/blob/master/random_forest_demo.R as described here: https://www.youtube.com/watch?v=6EXPYzbfLCE

## Start by converting the proximity matrix into a distance matrix. 

distance.matrix <- as.dist(1-rf.binary$proximity)

mds.stuff <- cmdscale(distance.matrix, eig=TRUE, x.ret=TRUE)

## calculate the percentage of variation that each MDS axis accounts for...
mds.var.per <- round(mds.stuff$eig/sum(mds.stuff$eig)*100, 1)

## now make a fancy looking plot that shows the MDS axes and the variation:
mds.values <- mds.stuff$points
mds.data <- data.frame(Sample=rownames(mds.values),
  X=mds.values[,1],
  Y=mds.values[,2],
  Status=binary.full$binary.tree.canopy.symptoms)

ggplot(data=mds.data, aes(x=X, y=Y)) + 
  geom_point(aes(color=Status),alpha=0.45) +  
  stat_ellipse(geom="polygon",aes(fill=Status,color=Status),alpha=0.45) +
  theme_bw() + 
  scale_color_discrete() +
  scale_fill_discrete() +
  xlab(paste("MDS1 - ", mds.var.per[1], "%", sep="")) +
  ylab(paste("MDS2 - ", mds.var.per[2], "%", sep="")) +
  ggtitle("MDS plot using (1 - Random Forest Proximities)")
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
## Too few points to calculate an ellipse

#By default, the stat_ellipse function draws a 95% confidence level for a multivariate t-distribution. You can modify this level with level argument.

#more info for ellipse https://r-charts.com/correlation/scatter-plot-ellipses-ggplot2/
varImpPlot(rf.binary)

importance <- varImp(rf.binary,scale=TRUE)
plot(importance)

corr.binary <- cor(binary.full[-c(79:83)])
corrplot(corr.binary,order='AOE')

Data Visualization

ggplot(binary.full,aes(clay_0_5,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="Clay 0-5 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

ggplot(binary.full,aes(clay_15_30,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="Clay 15-30 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

ggplot(binary.full,aes(clay_30_60,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="Clay 30-60 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

ggplot(binary.full,aes(clay_100_200,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="Clay 100-200 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

ggplot(binary.full,aes(sand_100_200,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="sand 100-200 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

ggplot(binary.full,aes(ksat_60_100,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="ksat 60-100 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf

ggplot(binary.full,aes(bd_60_100,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="bd 60-100 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf