Redcedar Data Analyses Instructions

Please note this analysis and R Markdown document are still in development :)

View a previous, archived version of this analysis here.

Approach

The overall approach is to model empirical data collected by community scientists with ancillary climate data to identify important predictors of western redcedar dieback.

Data Wrangling

Import iNat Data - Empirical Tree Points (Response variables)

The steps for wrangling the iNat data are described here.

Format and export for collecting climateNA data

Data were subset to include only the GPS information needed for collecting the ancillary climate data.

Remove iNaturalist columns and explanatory variables not needed for random forest models
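
A minimal sketch of this step, assuming a standard iNat CSV export and ClimateNA's five-column input convention (ID1, ID2, lat, long, el); the column and file names here are placeholders:

```r
library(dplyr)

# Keep only the fields ClimateNA needs; 'inat' is the imported iNaturalist
# export (column names assumed from a standard iNat CSV export).
gps <- inat %>%
  transmute(ID1  = id,         # observation id
            ID2  = ".",        # unused second id field
            lat  = latitude,
            long = longitude,
            el   = ".")        # "." if elevation is unknown (check ClimateNA docs)

write.csv(gps, "gps2566.csv", row.names = FALSE)
```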

Import Normals Data

Climate data were then extracted for the iNat GPS locations using the ClimateNA tool (version 7.42), following the process below; the resulting output file is then read back into R (see the sketch after the list).

  • Climate data extraction process with ClimateNA
    • Convert data into the format for ClimateNA use (see above)
    • In ClimateNA
      • Normal Data
        • Select input file (browse to gps2566 file)
        • Choose ‘More Normal Data’
          • Select ‘Normal_1991_2020.nrm’
        • Choose ‘All variables(265)’
        • Specify output file
  • Grouping explored
    • data averaged over 30 year normals (1991-2020)
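
Once ClimateNA finishes, the output CSV can be read back in. A sketch, with a hypothetical output file name and the norm_ prefix used for climate variables throughout this document:

```r
# Read the ClimateNA output (one row per GPS point, 265 normal variables).
# The file name is hypothetical -- use the output file specified above.
normals <- read.csv("gps2566_Normal_1991_2020MSY.csv")

# Prefix the columns with "norm_" to match the variable names used below.
names(normals) <- paste0("norm_", names(normals))
```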

Variables

Note that the analysis below uses the iNat dataset with 1510 observations. Amazing!

  • Response variables included in this analysis
    • Tree canopy symptoms (binary)
  • Explanatory variables included
    • Climate data
      • 30yr normals 1991-2020 (265 variables - annual, seasonal, monthly)

Remove specific climate variables not useful as explanatory variables (e.g. norm_Latitude)
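
For example (a sketch; the exact column names depend on the ClimateNA output):

```r
# Drop ClimateNA location/bookkeeping columns that aren't climate predictors.
normals <- dplyr::select(normals, -norm_Latitude, -norm_Longitude, -norm_Elevation)
```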

Remove Outliers

There is one observation with an anomalously negative CMI value (around -10000); it was removed as an outlier.
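
The value is consistent with a -9999-style no-data flag from ClimateNA (an assumption). A sketch of the filter:

```r
# Drop the single observation with the anomalously negative CMI value;
# the -9000 cutoff is an arbitrary threshold below any plausible CMI.
normals <- dplyr::filter(normals, norm_CMI > -9000)
```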

Separate climate variable groupings

Normals data for 265 variables were downloaded for each point:

  • Monthly - 180 variables represented data averaged over months for the 30 year period
  • Seasonal - 60 variables represented data averaged over 3 month seasons (4 seasons) for the 30 year period
  • Annual - 20 variables represented data averaged across all years of the 30 year period
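
A sketch of the split, assuming ClimateNA's usual suffix conventions (two-digit months for monthly variables, _wt/_sp/_sm/_at for seasons); verify the counts against the 180/60/20 split above:

```r
# Partition column names by ClimateNA's naming convention.
monthly.cols  <- grep("(0[1-9]|1[0-2])$", names(normals), value = TRUE)
seasonal.cols <- grep("_(wt|sp|sm|at)$",  names(normals), value = TRUE)
annual.cols   <- setdiff(names(normals), c(monthly.cols, seasonal.cols))

normals.monthly  <- normals[monthly.cols]
normals.seasonal <- normals[seasonal.cols]
normals.annual   <- normals[annual.cols]
```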

Remove variables that have near-zero standard deviations (entire column is the same value)
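
A base-R sketch (caret::nearZeroVar would also work); the object names mirror those referenced below:

```r
# Keep only numeric columns whose standard deviation exceeds a tiny
# tolerance; constant columns carry no information for the models.
drop.constant <- function(df, tol = 1e-8) {
  keep <- sapply(df, function(x) is.numeric(x) && sd(x, na.rm = TRUE) > tol)
  df[, keep, drop = FALSE]
}

normals.nearzerovar          <- drop.constant(normals)
normals.monthly.nearzerovar  <- drop.constant(normals.monthly)
normals.seasonal.nearzerovar <- drop.constant(normals.seasonal)
normals.annual.nearzerovar   <- drop.constant(normals.annual)
```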

Full

Dropping columns with near-zero standard deviation removed `length(normals) - length(normals.nearzerovar)` climate variables from the full set.

Monthly

Dropping columns with near-zero standard deviation removed `length(normals.monthly) - length(normals.monthly.nearzerovar)` monthly climate variables.

Seasonal

Dropping columns with near-zero standard deviation removed `length(normals.seasonal) - length(normals.seasonal.nearzerovar)` seasonal climate variables.

Annual

Dropping columns with near-zero standard deviation removed `length(normals.annual) - length(normals.annual.nearzerovar)` annual climate variables.

Join iNat and Climate Data
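
A sketch of the join, assuming both tables carry the observation id used for the ClimateNA export (the key names are assumptions):

```r
# Join climate normals back onto the tree observations by observation id;
# 'full.with.dead' is the joined dataset name used in the GLMM section below.
full.with.dead <- dplyr::left_join(inat, normals, by = c("id" = "norm_ID1"))
```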

Remove Dead Trees

Note we chose to remove dead trees after joining with the climate data, in case we revisit that decision later.

Generally, removing dead trees may make the most sense biologically, because we cannot be sure of the cause of death. Later we could test whether there is a good climate variable for classifying trees as alive or dead.
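
A sketch, assuming dead.tree (used in the GLMM section below) is a 0/1 indicator:

```r
# Keep only living trees for the dieback models.
full <- dplyr::filter(full.with.dead, dead.tree == 0)
```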

Prepare data for random forest models

Remove other explanatory variable categories (binary only)

Compare model errors

Binary Normal Model

Full Normal Model
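
The printed call below corresponds to a fit like the following (shown once here; the monthly, seasonal, and annual models swap in their own data frames):

```r
library(randomForest)

set.seed(42)  # seed value is a placeholder, for reproducibility
rf.full <- randomForest(binary.tree.canopy.symptoms ~ ., data = binary.full,
                        ntree = 1200, importance = TRUE, proximity = TRUE,
                        na.action = na.omit)
rf.full  # prints the OOB error and confusion matrix shown below
```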

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.full,      ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 1200
## No. of variables tried at each split: 14
## 
##         OOB estimate of  error rate: 31.3%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1143       329   0.2235054
## Unhealthy     430       523   0.4512067

Monthly Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.monthly,      ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 1200
## No. of variables tried at each split: 12
## 
##         OOB estimate of  error rate: 31.92%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1132       340   0.2309783
## Unhealthy     434       519   0.4554040

Seasonal Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.seasonal,      ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 1200
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 31.88%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1134       338   0.2296196
## Unhealthy     435       518   0.4564533

Annual Normal Model

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual,      ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 1200
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 31.88%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1138       334   0.2269022
## Unhealthy     439       514   0.4606506

Summary of model performance

| Response | Grouping | Num. Variables | Vars Tried per Split | OOB Error (%) |
|----------|----------|----------------|----------------------|---------------|
| Binary   | Full     | 225            | 14                   | 31.3          |
| Binary   | Monthly  | 148            | 12                   | 31.92         |
| Binary   | Seasonal | 54             | 7                    | 31.88         |
| Binary   | Annual   | 25             | 4                    | 31.88         |

Identify important variables

Binary Response, Annual Explanatory Variable

The error rate above may stabilize by 600-800 trees, so running 1200 trees may not be necessary; one way to check is sketched below.
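
A sketch (rf.annual is an assumed name for the fitted annual model):

```r
# plot.randomForest draws OOB and per-class error against trees grown,
# making it easy to see where the error curve flattens out.
plot(rf.annual, main = "OOB error vs. number of trees")
legend("topright", colnames(rf.annual$err.rate), lty = 1:3, col = 1:3)
```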


Binary Response, Seasonal Explanatory Variable

Clearly all of the climate variables are highly correlated.

Let's pick the top-performing metric in our random forest analyses, CMI, and then add any less-correlated variables (see the sketch below).
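
A sketch of how the importance ranking can be inspected (rf.seasonal is an assumed model name):

```r
# Both importance measures are reported because importance = TRUE was set.
varImpPlot(rf.seasonal)        # MeanDecreaseAccuracy / MeanDecreaseGini plot
head(importance(rf.seasonal))  # numeric importance table
```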

Below we can check the correlation of CMI, MAP, and DD_18.
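
For example (variable names assumed from the norm_ convention used above):

```r
# Pairwise Pearson correlations among the three candidate predictors.
cor(annual[, c("norm_CMI", "norm_MAP", "norm_DD_18")], use = "complete.obs")
```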

Now we can check how the model performs with only these three climate variables.
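
A sketch of the reduced fit (variable names assumed as above):

```r
rf.three <- randomForest(binary.tree.canopy.symptoms ~ norm_CMI + norm_MAP + norm_DD_18,
                         data = binary.annual, ntree = 1200,
                         importance = TRUE, proximity = TRUE, na.action = na.omit)
rf.three
```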

## 
## Call:
##  randomForest(formula = binary.tree.canopy.symptoms ~ ., data = binary.annual,      ntree = 1200, importance = TRUE, proximity = TRUE, na.action = na.omit) 
##                Type of random forest: classification
##                      Number of trees: 1200
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 31.88%
## Confusion matrix:
##           Healthy Unhealthy class.error
## Healthy      1138       334   0.2269022
## Unhealthy     439       514   0.4606506

It’s hard to give up the seasonality data, but the seasonal variables are all highly correlated (data not shown). In the importance plot above for the seasonal data, the winter variables (norm_CMI_wt, norm_DD_18_wt, and norm_PPT_wt) had the highest MeanDecreaseAccuracy and MeanDecreaseGini. Therefore, even if we chose to build the model on seasonal data, we would likely want to use the winter values for each variable.

GLMMs

Probability of a tree classified as unhealthy

Response variable: category (healthy, unhealthy)

Note that dead trees were removed.

Healthy / Unhealthy
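
The "Conditional model" block in the output below suggests glmmTMB; a sketch of the implied fit:

```r
library(glmmTMB)

m.healthy <- glmmTMB(binary.tree.canopy.symptoms ~ norm_CMI,
                     data = annual, family = binomial)
summary(m.healthy)
```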

##  Family: binomial  ( logit )
## Formula:          binary.tree.canopy.symptoms ~ norm_CMI
## Data: annual
## 
##      AIC      BIC   logLik deviance df.resid 
##   3252.8   3264.4  -1624.4   3248.8     2423 
## 
## 
## Conditional model:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.4890830  0.0675426  -7.241 4.45e-13 ***
## norm_CMI     0.0008836  0.0008637   1.023    0.306    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning: Removed 12 rows containing non-finite values (`stat_density()`).

Top Dieback / No Topdieback

##  Family: binomial  ( logit )
## Formula:          top.dieback ~ norm_CMI
## Data: annual
## 
##      AIC      BIC   logLik deviance df.resid 
##   1808.4   1820.0   -902.2   1804.4     2423 
## 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.095608   0.098672 -21.238   <2e-16 ***
## norm_CMI     0.002063   0.001182   1.746   0.0808 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning: Removed 12 rows containing non-finite values (`stat_density()`).

Thinning / no-thinning

## Warning: Removed 12 rows containing non-finite values (`stat_density()`).

Dead / Alive

##  Family: binomial  ( logit )
## Formula:          dead.tree ~ norm_CMI
## Data: full.with.dead
## 
##      AIC      BIC   logLik deviance df.resid 
##   1019.3   1031.0   -507.7   1015.3     2551 
## 
## 
## Conditional model:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.011139   0.146278 -20.585   <2e-16 ***
## norm_CMI     0.001112   0.001798   0.619    0.536    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Discussion

Explore whether monthly, seasonal, or annual data provide the best fit for a binomial GLMM.

Identify which climate variable grouping fits best, then run random forests to determine which climate variable best predicts top dieback.

Then run random forests to determine which climate variable best predicts thinning.