Please note this analysis and R Markdown document are still in development :)
View a previous, archived version of this analysis here.
The overall approach is to model empirical data collected by community scientists with ancillary climate data to identify important predictors of western redcedar dieback.
The steps for wrangling the iNat data are described here.
Data were subset to include only GPS information for use in collecting ancillary data.
Climate data were then extracted for the iNat GPS locations using the ClimateNA tool (version 7.42), following the process below.
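ClimateNA reads a point file with a fixed column layout. A minimal sketch of building that file from the iNat points; the `inat` data frame and its column names are assumptions, not the report's code:

```r
# Build the point file ClimateNA expects: a CSV with the columns
# ID1, ID2, lat, long, el (elevation in metres).
# Toy rows stand in for the real iNat observations.
inat <- data.frame(id = c(101, 102, 103),
                   latitude = c(47.6, 48.4, 45.5),
                   longitude = c(-122.3, -123.4, -122.7),
                   elevation = c(50, 120, 30))

climatena.in <- data.frame(
  ID1  = inat$id,               # iNat observation id
  ID2  = seq_len(nrow(inat)),   # any second identifier
  lat  = inat$latitude,
  long = inat$longitude,
  el   = inat$elevation         # supply elevation in metres if available
)
write.csv(climatena.in, "inat_points.csv", row.names = FALSE)
```

The resulting CSV is then loaded into the ClimateNA interface to extract normals for each point.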
Variables
Note the analysis below uses the iNat data with 1510 observations. Amazing!
Remove specific climate variables that are not useful as explanatory variables (e.g., norm_Latitude)
For some reason there is one observation with an extremely negative CMI value (around -10000)
Normals data for 265 variables were downloaded for each point:
- Monthly: 180 variables representing data averaged by month over the 30-year period
- Seasonal: 60 variables representing data averaged over 3-month seasons (4 seasons) for the 30-year period
- Annual: 20 variables representing data averaged across all years of the 30-year period
Remove variables that have near-zero standard deviations (i.e., the entire column is the same value)
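A minimal sketch of that filter; the function and the toy `normals` table are illustrative, not the report's code:

```r
# Keep only columns whose standard deviation exceeds a tiny tolerance;
# a zero-SD column holds the same value in every row.
drop_near_zero_var <- function(df, tol = 1e-8) {
  num  <- vapply(df, is.numeric, logical(1))
  sds  <- vapply(df[num], sd, numeric(1), na.rm = TRUE)
  drop <- names(sds)[is.na(sds) | sds <= tol]
  df[, setdiff(names(df), drop), drop = FALSE]
}

# Example: a toy normals table with one constant (zero-SD) column.
normals <- data.frame(norm_MAT  = c(8.1, 9.0, 7.6),
                      norm_MAP  = c(880, 1210, 950),
                      norm_Flag = c(1, 1, 1))   # constant -> dropped
normals.nearzerovar <- drop_near_zero_var(normals)
names(normals.nearzerovar)   # "norm_MAT" "norm_MAP"
```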
Full
There were `length(normals) - length(normals.nearzerovar)` variables with zero standard deviation. Dropping columns with near-zero standard deviation removed `length(normals) - length(normals.nearzerovar)` climate variables.
Monthly
There were `length(normals.monthly) - length(normals.monthly.nearzerovar)` monthly variables with zero standard deviation. Dropping columns with near-zero standard deviation removed `length(normals.monthly) - length(normals.monthly.nearzerovar)` monthly climate variables.
Seasonal
There were `length(normals.seasonal) - length(normals.seasonal.nearzerovar)` seasonal variables with zero standard deviation.
Annual
There were `length(normals.annual) - length(normals.annual.nearzerovar)` annual variables with zero standard deviation.
Note we chose to remove dead trees after joining with the climate data, in case we change our mind about that choice.
Generally, removing dead trees may make the most sense biologically, because we are not sure about the cause of death for a dead tree. Later we could test whether there is a good climate variable for classifying trees as alive or dead.
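A sketch of that subset; `full.with.dead` and `dead.tree` follow the names in the model output later in this document, but the 0/1 coding and the toy rows are assumptions:

```r
# Toy joined table standing in for the real iNat + climate join.
full.with.dead <- data.frame(
  tree.canopy.symptoms = c("Healthy", "Unhealthy", "Dead"),
  dead.tree = c(0, 0, 1),       # assumed coding: 1 = dead, 0 = alive
  norm_CMI  = c(35.2, 12.8, 5.1)
)

# Drop dead trees, but keep full.with.dead around so the choice
# can be revisited later.
full <- subset(full.with.dead, dead.tree == 0)
nrow(full)   # 2
```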
Remove other explanatory variable categories (binary only)
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.full, ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 1200
## No. of variables tried at each split: 14
##
## OOB estimate of error rate: 31.3%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 1143 329 0.2235054
## Unhealthy 430 523 0.4512067
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.monthly, ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 1200
## No. of variables tried at each split: 12
##
## OOB estimate of error rate: 31.92%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 1132 340 0.2309783
## Unhealthy 434 519 0.4554040
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.seasonal, ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 1200
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 31.88%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 1134 338 0.2296196
## Unhealthy 435 518 0.4564533
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual, ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 1200
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 31.88%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 1138 334 0.2269022
## Unhealthy 439 514 0.4606506
Summary of model performance
| Response | Grouping | Num Variables | Vars tried at split | OOB Error (%) |
|----------|----------|---------------|---------------------|---------------|
| Binary   | Full     | 225           | 14                  | 31.3          |
| Binary   | Monthly  | 148           | 12                  | 31.92         |
| Binary   | Seasonal | 54            | 7                   | 31.88         |
| Binary   | Annual   | 25            | 4                   | 31.88         |
The error rate above may stabilize by 600-800 trees, so running 1200 trees may not be necessary.
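One way to check is to plot the cumulative OOB error that randomForest stores in the fitted object's `err.rate` matrix (one row per tree). Here `rf.full` is an assumed model name, mocked so the sketch runs stand-alone:

```r
# rf.full stands in for the fitted randomForest object; we mock its
# err.rate matrix (a decaying curve) so the sketch is self-contained.
rf.full <- list(err.rate = cbind(OOB = 0.45 * exp(-(1:1200) / 150) + 0.31))

# Plot OOB error against number of trees grown; if the curve is flat
# past ~600-800 trees, fewer than 1200 trees would suffice.
oob <- rf.full$err.rate[, "OOB"]
plot(oob, type = "l", xlab = "Number of trees", ylab = "OOB error rate")
abline(v = c(600, 800), lty = 2)   # candidate stopping points
```

With a real fit, replace the mock with the object returned by `randomForest()`.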
## num [1:2425, 1:2] -0.02559 0.00402 -0.01364 -0.01961 -0.01828 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:2425] "1" "2" "3" "4" ...
## ..$ : NULL
Clearly all of the climate variables are highly correlated.
Let's pick the top-performing metric from our random forest analyses, CMI, and then any less-correlated variables.
Below we can check the correlation of CMI, MAP, and DD_18.
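A sketch of that check with base `cor()`; the `annual` data frame and `norm_` column names follow the naming used elsewhere in this report, but the toy values are stand-ins:

```r
# Pairwise Pearson correlations among the three candidate predictors.
# Toy values stand in for the real annual ClimateNA normals.
annual <- data.frame(norm_CMI   = c(35, 13, 5, 28),
                     norm_MAP   = c(1800, 950, 600, 1500),
                     norm_DD_18 = c(3100, 3600, 3900, 3200))
round(cor(annual[, c("norm_CMI", "norm_MAP", "norm_DD_18")],
          use = "pairwise.complete.obs"), 2)
```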
Now we can check how the model performs with only these three climate variables
##
## Call:
## randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual, ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit)
## Type of random forest: classification
## Number of trees: 1200
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 31.88%
## Confusion matrix:
## Healthy Unhealthy class.error
## Healthy 1138 334 0.2269022
## Unhealthy 439 514 0.4606506
It’s hard to give up the seasonal data, but these variables are all highly correlated (data not shown). If we look at the importance plot above for the seasonal data, the winter variables (norm_CMI_wt, norm_DD_18_wt, and norm_PPT_wt) all had the highest MeanDecreaseAccuracy and MeanDecreaseGini. Therefore, even if we chose to build the model on seasonal data, we would likely want to use the winter values for each variable.
Probability of a tree classified as unhealthy
Response variable: binary category (healthy, unhealthy)
Note dead trees were removed
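The summaries below come from glmmTMB (the "Conditional model" block); with a single fixed effect and no random effects, base `glm()` fits the same binomial model. A stand-alone sketch with simulated data in place of the real normals:

```r
# Logistic regression of unhealthy (1) vs healthy (0) on annual CMI.
# Simulated data stand in for the real annual data frame.
set.seed(42)
annual <- data.frame(norm_CMI = rnorm(500, mean = 20, sd = 15))
annual$binary.tree.canopy.symptoms <-
  rbinom(500, 1, plogis(-0.5 + 0.001 * annual$norm_CMI))

fit <- glm(binary.tree.canopy.symptoms ~ norm_CMI,
           data = annual, family = binomial(link = "logit"))
summary(fit)$coefficients   # estimates on the logit scale
```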
## Family: binomial ( logit )
## Formula: binary.tree.canopy.symptoms ~ norm_CMI
## Data: annual
##
## AIC BIC logLik deviance df.resid
## 3252.8 3264.4 -1624.4 3248.8 2423
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4890830 0.0675426 -7.241 4.45e-13 ***
## norm_CMI 0.0008836 0.0008637 1.023 0.306
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning: Removed 12 rows containing non-finite values (`stat_density()`).
## Family: binomial ( logit )
## Formula: top.dieback ~ norm_CMI
## Data: annual
##
## AIC BIC logLik deviance df.resid
## 1808.4 1820.0 -902.2 1804.4 2423
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.095608 0.098672 -21.238 <2e-16 ***
## norm_CMI 0.002063 0.001182 1.746 0.0808 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning: Removed 12 rows containing non-finite values (`stat_density()`).
## Warning: Removed 12 rows containing non-finite values (`stat_density()`).
## Family: binomial ( logit )
## Formula: dead.tree ~ norm_CMI
## Data: full.with.dead
##
## AIC BIC logLik deviance df.resid
## 1019.3 1031.0 -507.7 1015.3 2551
##
##
## Conditional model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.011139 0.146278 -20.585 <2e-16 ***
## norm_CMI 0.001112 0.001798 0.619 0.536
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
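To read these logit-scale coefficients as probabilities, apply the inverse-logit (`plogis()`); for example, the dead-tree model's intercept implies a low baseline mortality probability:

```r
# Convert logit-scale coefficients from the dead.tree model above
# into a predicted probability of a dead tree at a given CMI.
b0 <- -3.011139   # intercept (from the summary above)
b1 <-  0.001112   # norm_CMI slope (not significant, p = 0.536)
p_dead <- function(cmi) plogis(b0 + b1 * cmi)
round(p_dead(0), 3)   # baseline probability, about 0.047
```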
Next steps:
- Explore whether monthly, seasonal, or annual data provide the best fit for a binomial GLMM.
- Identify which climate variable grouping fits best, then run random forests to determine which climate variable is best for predicting top dieback.
- Then run random forests to determine which climate variable is best for predicting thinning.
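Comparing those candidate fits can be done with `AIC()`. A sketch with mock single-predictor fits; the predictor names and simulated data are assumptions:

```r
# Compare candidate binomial fits by AIC; lower AIC indicates better
# fit for models of equal complexity. Simulated stand-in data.
set.seed(1)
d <- data.frame(y = rbinom(300, 1, 0.4),
                x_monthly  = rnorm(300),
                x_seasonal = rnorm(300),
                x_annual   = rnorm(300))
fits <- list(
  monthly  = glm(y ~ x_monthly,  data = d, family = binomial),
  seasonal = glm(y ~ x_seasonal, data = d, family = binomial),
  annual   = glm(y ~ x_annual,   data = d, family = binomial)
)
sort(sapply(fits, AIC))   # lowest AIC first
```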