Redcedar | Data | Analyses | Instructions |
library(tidyverse)
#devtools::install_github("lhmrosso/XPolaris")
library(XPolaris)
library(randomForest)
library(caret)
library(rpart)
library(knitr)
library(corrplot)
We used the XPolaris package to download POLARIS soils tiles and extract data for 13 soils variables for all of our data points.
Instructions for using this package are available at https://github.com/lhmrosso/XPolaris
Then we completed a random forest analysis with the soils data to see what variables were the most important for classifying trees as healthy or unhealthy.
The steps for wrangling the data are described here.
#data <- read.csv('https://github.com/jmhulbert/open/raw/main/redcedar/data/data-modified.csv')
data <- read.csv("~/ServerFiles/open/redcedar/data/data-modified.csv") #offline
Data were subset to include only gps information to use in collecting ancillary data.
gps.soils <- data[c(2,24,25)] #subset data to only include id and gps coordinates
gps.soils <- rename(gps.soils,lat = latitude) %>% `colnames<-`(c("ID","lat","long"))
#write.csv(gps.soils,file="/Users/redcedar/ServerFiles/open/redcedar/data/gps.soils.2232.csv") #named 2232 because data was downloaded after having 2232 observations
Drop data with missing coordinates were filtered out - breaks code below otherwise
gps.soils <- gps.soils %>% filter (lat!="")
xplot(gps.soils)
We may need to filter out canada in later analysis, not sure if tiffs from BC will have data or be helpful. Everything above 49th parallel was NA. Filtering the Canada data beforehand might save some time in the data wrangling.
Tiles for soils data were downloaded using the ximages command in the package. 53 tiles for all 6 soil layer depths and all 13 variables (4134 tiles) were downloaded. This process took us close to 12 hours to complete.
#mean_soils_data <- ximages(locations = gps.soils,
# statistics = c('mean'),
# variables = c('ph','om','clay','sand','silt','bd','hb','n','ksat','theta_r','theta_s','lambda','alpha'),
# layersdepths = c('0_5','5_15','15_30','30_60','60_100','100_200'))
#variables = c('ph','om','clay','sand','silt','bd','hb','n','ksat','theta_r','theta_s'),
#layersdepths = c('0_5','5_15','15_30','30_60','60_100','100_200'))
# These data were saved in our tempdir() at "/var/folders/nc/nbgmcl_s2_q4d2r3z08pnjz00000gn/T//RtmpPU3kIf"
Next, we extracted the soils data from the tiles using the xsoil command.
We received the below error from one raster file in the silt data, but we manually downloaded the tif file and replaced it in the data folder, then it worked fine.
Error in
dplyr::mutate()
: ℹ In argument:extracts = purrr::map2(...)
. Caused by error inpurrr::map2()
: ℹ In index: 27. Caused by error: ! [extract] cannot read values
The below xsoil command worked fine after fixing that silt file mentioned above.
#mean_all_vars_xsoil<-xsoil(ximages_output=mean_soils_data)
12882 observations in final dataset with 18 variables (2221 obs x 6 soil horizons = 1.3326^{4} - 500ish observations some reason.
Data were saved as xsoil_data_all_13_vars.csv
#write.csv(mean_all_vars_xsoil,file="/Users/redcedar/ServerFiles/open/redcedar/data/xsoil_data_all_13_vars.csv")
Remove specific soil variables not useful as explanatory variables (e.g. latitutde)
#soils <- read.csv('https://github.com/jmhulbert/open/raw/main/redcedar/data/xsoil_data_all_13_vars.csv')
soils <- read.csv('~/ServerFiles/open/redcedar/data/xsoil_data_all_13_vars.csv') #offline
soils <- soils %>% rename_at('ID',~'id')
soils <-soils %>% select(-c(
"X"
,"lat"
,"long"
,"statistics"))
tidy.soils <-soils %>% pivot_longer(.,cols=!(c(id,layersdepths)),names_to="soilvar",values_to="value") %>% pivot_wider(names_from=c(soilvar,layersdepths),values_from=value)
full <- left_join(tidy.soils,random.forest.data,by="id",keep=FALSE)
full <- full %>% filter(bd_0_5!="NA") %>% droplevels()
five.cats.full <- full %>% select(-c("binary.tree.canopy.symptoms","id"))
five.cats.full$reclassified.tree.canopy.symptoms <- as.factor(five.cats.full$reclassified.tree.canopy.symptoms)
binary.full <- full %>% select(-c("reclassified.tree.canopy.symptoms","id"))
binary.full$binary.tree.canopy.symptoms <- as.factor(binary.full$binary.tree.canopy.symptoms)
set.seed(71)
rf.five.cats <- randomForest(reclassified.tree.canopy.symptoms ~ ., data=five.cats.full, ntree=2001, importance=TRUE, na.action=na.omit, proximity=TRUE)
rf.five.cats
##
## Call:
## randomForest(formula = reclassified.tree.canopy.symptoms ~ ., data = five.cats.full, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 45.76%
## Confusion matrix:
## Dead Top Healthy Other Thinning Canopy Tree is Dead class.error
## Dead Top 46 131 10 18 6 0.7819905
## Healthy 53 880 32 58 11 0.1489362
## Other 18 148 29 17 2 0.8644860
## Thinning Canopy 40 193 16 26 2 0.9061372
## Tree is Dead 17 49 3 7 4 0.9500000
set.seed(71)
rf.binary <- randomForest(binary.tree.canopy.symptoms ~ ., data=binary.full, ntree=2001, importance=TRUE, na.action=na.omit, proximity=TRUE)
rf.binary
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.full, ntree = 2001, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 2001
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 37.33%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 747 287 0.2775629
## Unhealthy 391 391 0.5000000
plot(rf.binary)
#Below code copied from: https://github.com/StatQuest/random_forest_demo/blob/master/random_forest_demo.R as described here: https://www.youtube.com/watch?v=6EXPYzbfLCE
## Start by converting the proximity matrix into a distance matrix.
distance.matrix <- as.dist(1-rf.binary$proximity)
mds.stuff <- cmdscale(distance.matrix, eig=TRUE, x.ret=TRUE)
## calculate the percentage of variation that each MDS axis accounts for...
mds.var.per <- round(mds.stuff$eig/sum(mds.stuff$eig)*100, 1)
## now make a fancy looking plot that shows the MDS axes and the variation:
mds.values <- mds.stuff$points
mds.data <- data.frame(Sample=rownames(mds.values),
X=mds.values[,1],
Y=mds.values[,2],
Status=binary.full$binary.tree.canopy.symptoms)
ggplot(data=mds.data, aes(x=X, y=Y)) +
geom_point(aes(color=Status),alpha=0.45) +
stat_ellipse(geom="polygon",aes(fill=Status,color=Status),alpha=0.45) +
theme_bw() +
scale_color_discrete() +
scale_fill_discrete() +
xlab(paste("MDS1 - ", mds.var.per[1], "%", sep="")) +
ylab(paste("MDS2 - ", mds.var.per[2], "%", sep="")) +
ggtitle("MDS plot using (1 - Random Forest Proximities)")
## Warning in MASS::cov.trob(data[, vars]): Probable convergence failure
## Too few points to calculate an ellipse
#By default, the stat_ellipse function draws a 95% confidence level for a multivariate t-distribution. You can modify this level with level argument.
#more info for ellipse https://r-charts.com/correlation/scatter-plot-ellipses-ggplot2/
varImpPlot(rf.binary)
importance <- varImp(rf.binary,scale=TRUE)
plot(importance)
corr.binary <- cor(binary.full[-c(79:83)])
corrplot(corr.binary,order='AOE')
ggplot(binary.full,aes(clay_0_5,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="Clay 0-5 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
ggplot(binary.full,aes(clay_15_30,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="Clay 15-30 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
ggplot(binary.full,aes(clay_30_60,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="Clay 30-60 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
ggplot(binary.full,aes(clay_100_200,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="Clay 100-200 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
ggplot(binary.full,aes(sand_100_200,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="sand 100-200 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
ggplot(binary.full,aes(ksat_60_100,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="ksat 60-100 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf
ggplot(binary.full,aes(bd_60_100,fill=binary.tree.canopy.symptoms)) +geom_density(alpha=0.5) + scale_fill_manual(name="Tree Condition",values=c("#7fcdbb","#fe9929")) +theme_bw() +labs(x="bd 60-100 cm")
## Warning: Groups with fewer than two data points have been dropped.
## Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
## -Inf