---
title: "R workflow for ABA prediction model calibration"
author: "Jean-Matthieu Monnet"
date: "`r Sys.Date()`"
output:
  html_document: default
  pdf_document: default
papersize: a4
bibliography: "../bib/bibliography.bib"
---

```{r setup, include=FALSE}
# erase all
cat("\014")
rm(list = ls())
# knit options
knitr::opts_chunk$set(echo = TRUE)
# Set so that long lines in R will be wrapped:
knitr::opts_chunk$set(tidy.opts = list(width.cutoff = 80), tidy = TRUE)
knitr::opts_chunk$set(fig.align = "center")
```

---
The code below presents a workflow to calibrate prediction models for the estimation of forest parameters from ALS-derived metrics, using the area-based approach (ABA). The workflow is based on functions from `R` packages [lidaRtRee](https://cran.r-project.org/package=lidaRtRee) (tested with version `r packageVersion("lidaRtRee")`) and [lidR](https://cran.r-project.org/package=lidR) (tested with version `r packageVersion("lidR")`).

Licence: GNU GPLv3 / [Source page](https://gitlab.irstea.fr/jean-matthieu.monnet/lidartree_tutorials/-/blob/master/R/area-based.2.model.calibration.Rmd)

Many thanks to Pascal Obstétar for checking the code and suggesting improvements.

# Load data

The "Quatre Montagnes" dataset from France, prepared as described in the [data preparation tutorial](https://gitlab.irstea.fr/jean-matthieu.monnet/lidartree_tutorials/-/blob/master/R/area-based.1.data.preparation.Rmd), is loaded from the `R` archive files located in the folder "data/aba.model/output".

## Field data

The file "plots.rda" contains the field data, organized as a data.frame named `plots`. For subsequent use in the workflow, the data.frame should contain at least two fields: `plot_id` (unique plot identifier) and a forest stand parameter. Each row in the data.frame corresponds to a field plot. A factor variable is required to calibrate stratified models. Plot coordinates are required for subsequent inference computations.

The provided data set includes one categorical variable, `stratum`, which corresponds to forest ownership, as well as XY coordinates and three forest stand parameters:

* basal area in m^2^/ha (`G_m2_ha`),
* stem density in /ha (`N_ha`),
* mean diameter at breast height in cm (`D_mean_cm`).

Scatterplots of stand parameters are presented below, colored by ownership (green for public forest, blue otherwise).

```{r loadFieldData, include=TRUE, message = FALSE, warning = FALSE}
# load plot-level data
load(file = "../data/aba.model/output/plots.rda")
summary(plots)
# display forest variables
plot(plots[, c("G_m2_ha", "N_ha", "D_mean_cm")],
  col = ifelse(plots$stratum == "public", "green", "blue")
)
```
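
Before going further, it can be useful to check that a field data.frame has the structure described above. The following lines are only a minimal sketch in base `R` on toy data: the helper `check_plots` and the toy values are hypothetical and are not part of `lidaRtRee`.

```r
# toy field data mimicking the expected structure (values are made up)
plots_toy <- data.frame(
  plot_id = c("p1", "p2", "p3"),
  X = c(1000, 1050, 1100),
  Y = c(2000, 2050, 2100),
  stratum = factor(c("public", "private", "public")),
  G_m2_ha = c(25.3, 41.0, 33.7),
  stringsAsFactors = FALSE
)

# hypothetical helper: stops with a message if a requirement is not met
check_plots <- function(x, variable, strat = NULL, coords = c("X", "Y")) {
  required <- c("plot_id", variable, strat, coords)
  missing_cols <- setdiff(required, names(x))
  if (length(missing_cols) > 0) {
    stop("missing columns: ", paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(x$plot_id) > 0) stop("plot_id values must be unique")
  if (!is.null(strat) && !is.factor(x[[strat]])) {
    stop("stratification variable must be a factor")
  }
  invisible(TRUE)
}

# passes silently on the toy data
check_plots(plots_toy, variable = "G_m2_ha", strat = "stratum")
```

Calling the helper with a variable that is absent from the data.frame (e.g. `"N_ha"` in the toy data) stops with an explicit message instead of failing later during model calibration.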

## ALS data

Normalized ALS point clouds extracted over each plot, as well as terrain statistics previously computed from the ALS ground points, can also be prepared according to the [data preparation tutorial](https://gitlab.irstea.fr/jean-matthieu.monnet/lidartree_tutorials/-/blob/master/R/area-based.1.data.preparation.Rmd). Point clouds corresponding to each field plot are organized in a list of LAS objects. Metadata of one LAS object are displayed below.

```{r loadALSdData, include=TRUE, message = FALSE, warning = FALSE}
# list of LAS objects: normalized point clouds inside plot extent
load("../data/aba.model/output/llas_height.rda")
# to display one point cloud in 3D: lidR::plot(llas_height[[1]])
# print metadata of the first point cloud
llas_height[[1]]
```

The first lines of the terrain statistics are displayed hereafter.

```{r loadTerrainStats, echo = FALSE}
# terrain statistics previously computed with (non-normalized) ground points inside each plot extent
load("../data/aba.model/output/metrics_terrain.rda")
head(metrics_terrain[, 1:3], n = 3)
```

The following lines ensure that the plots are ordered in the same way in the three data objects.

```{r orderData, include=TRUE}
llas_height <- llas_height[plots$plot_id]
metrics_terrain <- metrics_terrain[plots$plot_id, ]
```
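
The reordering above relies on indexing by name. The sketch below illustrates the idea on toy objects (all names and values are made up). Converting the identifiers with `as.character` is a precaution assumed here, in case `plot_id` is stored as a factor or a number, which would otherwise trigger positional indexing.

```r
# toy objects stored in a different order than the field data
plots_toy <- data.frame(plot_id = c("p1", "p2", "p3"), stringsAsFactors = FALSE)
clouds_toy <- list(p3 = "cloud 3", p1 = "cloud 1", p2 = "cloud 2")
terrain_toy <- data.frame(
  altitude = c(1300, 1100, 1200),
  row.names = c("p3", "p1", "p2")
)

# as.character() guards against positional indexing
ids <- as.character(plots_toy$plot_id)
# stop if some plots are missing from the other objects
stopifnot(all(ids %in% names(clouds_toy)), all(ids %in% row.names(terrain_toy)))
# reorder by name
clouds_toy <- clouds_toy[ids]
terrain_toy <- terrain_toy[ids, , drop = FALSE]

# the three objects now share the same order
identical(names(clouds_toy), ids)
identical(row.names(terrain_toy), ids)
```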

# ALS metrics computation

Two types of vegetation metrics can be computed.

* Point cloud metrics are computed directly from the point cloud, or from the derived surface model, over the whole plot extent. These are the metrics generally used in the area-based approach.
* Tree metrics are computed from the characteristics of trees detected in the point cloud (or in the derived surface model). They are more CPU-intensive to compute and require ALS data with higher density, but in some cases they slightly improve model prediction accuracy.

## Point cloud metrics

Point cloud metrics are computed with the function `lidaRtRee::clouds_metrics`, which applies the function `lidR::cloud_metrics` to all point clouds in a list. Default computed metrics are those proposed by the function [`lidR::stdmetrics`](https://github.com/Jean-Romain/lidR/wiki/stdmetrics). Additional metrics are available with the function `lidaRtRee::aba_metrics`. The buffer points, which are located outside of the plot extent inventoried in the field, should be removed before computing those metrics.

```{r computeMetrics, include=TRUE}
# define function for later use
aba_point_metrics_fun <- ~ lidaRtRee::aba_metrics(Z, Intensity, ReturnNumber, Classification, 2)
# create list of point clouds without buffer
llas_height_plot_extent <-
  lapply(llas_height, function(x) {
    lidR::filter_poi(x, buffer == FALSE)
  })
# apply function on each point cloud in list
metrics_points <- lidaRtRee::clouds_metrics(llas_height_plot_extent, aba_point_metrics_fun)
round(head(metrics_points[, 1:8], n = 3), 2)
```
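
To give an idea of what point cloud metrics are, the sketch below computes a few height statistics from vectors of normalized heights in base `R`. It is not the implementation used by `lidaRtRee` or `lidR`: the helper `height_metrics`, the metric names, the 2 m threshold and the simulated heights are assumptions made for the illustration.

```r
# hypothetical helper: a few height metrics from a vector of normalized heights (m)
height_metrics <- function(z, threshold = 2) {
  data.frame(
    zmean = mean(z),
    zsd = stats::sd(z),
    zq50 = unname(stats::quantile(z, 0.5)),
    zq95 = unname(stats::quantile(z, 0.95)),
    # percentage of points above the height threshold
    pz_above = 100 * mean(z > threshold)
  )
}

# toy "point clouds": one vector of heights per plot (simulated values)
set.seed(1)
heights_toy <- list(
  p1 = runif(200, min = 0, max = 20),
  p2 = runif(200, min = 0, max = 35)
)
# apply the function to each plot and bind the results, one row per plot
metrics_toy <- do.call(rbind, lapply(heights_toy, height_metrics))
round(metrics_toy, 2)
```

The result has one row per plot and one column per metric, which is the layout expected for the independent variables in the calibration step.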

## Tree metrics

Tree metrics rely on a preliminary detection of trees, which is performed with the `lidaRtRee::tree_segmentation` function. For more details, please refer to the [tree detection tutorial](https://gitlab.irstea.fr/jean-matthieu.monnet/lidartree_tutorials/-/blob/master/R/tree.detection.Rmd). Tree segmentation requires point clouds or canopy height models with an additional buffer in order to avoid border effects when computing tree characteristics. Once trees are detected, metrics are derived with the function `lidaRtRee::std_tree_metrics`. A user-defined function can be supplied to compute other metrics from the features of detected trees. The plot radius has to be specified: it is required to exclude trees detected outside of the plot and to compute the plot surface. Tree segmentation is not relevant when the point cloud density is too low, typically below five points per m^2^. The function first computes a canopy height model whose default resolution is 0.5 m; this should be set to 1 m with low point densities.

```{r computeTreeMetrics, include=TRUE, warning = FALSE}
# resolution of canopy height model (m)
aba_res_chm <- 0.5
# specify plot radius to exclude trees located outside plots
plot_radius <- 15
# compute tree metrics
metrics_trees <- lidaRtRee::clouds_tree_metrics(llas_height, plots[, c("X", "Y")],
  plot_radius,
  res = aba_res_chm,
  func = function(x) {
    lidaRtRee::std_tree_metrics(x,
      area_ha = pi * plot_radius^2 / 10000
    )
  }
)
round(head(metrics_trees[, 1:5], n = 3), 2)
```
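
The role of the plot radius can be illustrated with a short base `R` sketch: it defines both the plot surface and which detected trees are kept. The toy tree positions and heights below are made up, and the two metrics computed are only examples, not the output of `lidaRtRee::std_tree_metrics`.

```r
# plot radius (m)
plot_radius <- 15
# plot surface in ha: pi * r^2 gives m2, and 1 ha = 10000 m2
area_ha <- pi * plot_radius^2 / 10000
round(area_ha, 4)

# toy detected trees: coordinates relative to the plot center (m) and heights (m)
trees_toy <- data.frame(
  x = c(2, -5, 10, 14, -12),
  y = c(3, 8, -9, 9, -11),
  h = c(22, 18, 25, 30, 15)
)
# keep trees whose apex lies inside the plot radius
inside <- sqrt(trees_toy$x^2 + trees_toy$y^2) <= plot_radius
# example tree metrics: density (/ha) and mean height (m) of detected trees
tree_density_ha <- sum(inside) / area_ha
tree_mean_h <- mean(trees_toy$h[inside])
round(c(density_ha = tree_density_ha, mean_h = tree_mean_h), 1)
```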

## Other metrics

If terrain metrics have been computed from the cloud of ground points only, they can also be added as variables, as can other environmental variables which might be relevant in modeling.

```{r bindMetrics, include=TRUE}
metrics <- cbind(
  metrics_points[plots$plot_id, ],
  metrics_trees[plots$plot_id, ],
  metrics_terrain[plots$plot_id, 1:3]
)
```

# Model calibration

## Calibration for a single variable

Once a dependent variable (forest parameter of interest) has been chosen, the function `lidaRtRee::aba_build_model` is used to select the linear regression model that yields the highest adjusted-R^2^ with a defined number of independent variables (metrics), while checking linear model assumptions. The parameter `transform` allows applying either a Box-Cox transformation of the dependent variable, to normalize its distribution, or a log transformation of all variables. Model details and cross-validation statistics are available from the returned object.
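
For reference, the usual definition of the Box-Cox transformation of a positive variable $y$ with parameter $\lambda$ is recalled below. How `lidaRtRee::aba_build_model` estimates $\lambda$ and back-transforms predicted values is not detailed here; please refer to the package documentation.

$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\
\ln(y) & \text{if } \lambda = 0.
\end{cases}
$$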

```{r modelCalibration, include=TRUE, message = FALSE, warning = FALSE}
variable <- "G_m2_ha"
# no subsample in this case
subsample <- seq_len(nrow(plots))
# model calibration
model_aba <- lidaRtRee::aba_build_model(plots[subsample, variable], metrics[subsample, ],
  transform = "boxcox", nmax = 4, xy = plots[subsample, c("X", "Y")]
)
# rename output statistics with the variable name
row.names(model_aba$stats) <- variable
# display selected linear regression model
model_aba$model
# display calibration and validation statistics
model_aba$stats
```
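
The selection principle (keep the combination of metrics that yields the highest adjusted-R^2^) can be sketched in base `R` as an exhaustive search. This is an illustration on simulated data, not the code of `lidaRtRee::aba_build_model`: the helper `best_subset` is hypothetical, and it does not check linear model assumptions nor transform the data.

```r
# toy data (simulated): 4 candidate metrics, only m1 and m3 are related to y
set.seed(42)
n <- 60
x_toy <- data.frame(m1 = rnorm(n), m2 = rnorm(n), m3 = rnorm(n), m4 = rnorm(n))
y_toy <- 10 + 3 * x_toy$m1 - 2 * x_toy$m3 + rnorm(n, sd = 0.5)

# hypothetical helper: exhaustive search of the combination of metrics
# (at most nmax metrics) with the highest adjusted R2
best_subset <- function(y, x, nmax = 2) {
  best <- list(adj_r2 = -Inf, metrics = NULL, model = NULL)
  for (k in seq_len(min(nmax, ncol(x)))) {
    # all combinations of k metrics
    combos <- utils::combn(names(x), k, simplify = FALSE)
    for (vars in combos) {
      fit <- stats::lm(stats::reformulate(vars, response = "y"),
        data = data.frame(y = y, x[, vars, drop = FALSE])
      )
      adj_r2 <- summary(fit)$adj.r.squared
      if (adj_r2 > best$adj_r2) {
        best <- list(adj_r2 = adj_r2, metrics = vars, model = fit)
      }
    }
  }
  best
}

best_toy <- best_subset(y_toy, x_toy, nmax = 2)
# selected metrics and adjusted R2 of the corresponding model
best_toy$metrics
round(best_toy$adj_r2, 2)
```

The number of candidate models grows quickly with the number of metrics and with `nmax`, which is one reason to keep the maximum number of independent variables low.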

The function computes values predicted in leave-one-out cross-validation, by using the same combination of independent variables and fitting the regression coefficients with all observations except one. Predicted values can be plotted against field values with the function `lidaRtRee::aba_plot`. It is also informative to check the correlation of prediction errors with other forest or environmental variables.

The model seems to fail to predict large values, and the prediction errors are positively correlated with basal area.

```{r modelCorrelation, include=TRUE, fig.height = 4.5, fig.width = 8}
# check correlation between errors and other variables
round(cor(cbind(model_aba$values$residual, plots[subsample, c("G_m2_ha", "N_ha", "D_mean_cm")], metrics_terrain[subsample, 1:3])), 2)[1, ]
# significance of correlation value
cor.test(model_aba$values$residual, plots[subsample, variable])
```

Coloring points by ownership shows that plots located in private forests have the largest basal area values.

```{r modelPlot, include=TRUE, fig.height = 4.5, fig.width = 8}
par(mfrow = c(1, 2))
# plot predicted vs. field values
lidaRtRee::aba_plot(model_aba,
  main = variable,
  col = ifelse(plots$stratum == "public", "green", "blue")
)
legend("topleft", c("public", "private"), col = c("green", "blue"), pch = 1)
plot(plots[subsample, c("G_m2_ha")],
  model_aba$values$residual,
  ylab = "Prediction errors", xlab = "Field values",
  col = ifelse(plots$stratum == "public", "green", "blue")
)
abline(h = 0, lty = 2)
```
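
The leave-one-out procedure described in this section can be sketched in base `R` as follows. The data are simulated and the statistics are computed under assumptions: in particular, the coefficient of determination is taken here as the squared correlation between predicted and field values, which may differ from the definition used in `lidaRtRee`.

```r
# toy data (simulated): one forest parameter and two metrics
set.seed(7)
n <- 40
d_toy <- data.frame(m1 = runif(n, 5, 30), m2 = runif(n, 0, 100))
d_toy$y <- 2 + 1.2 * d_toy$m1 + 0.1 * d_toy$m2 + rnorm(n, sd = 2)

# leave-one-out: refit the same formula without plot i, then predict plot i
pred_loo <- vapply(seq_len(n), function(i) {
  fit_i <- stats::lm(y ~ m1 + m2, data = d_toy[-i, ])
  unname(stats::predict(fit_i, newdata = d_toy[i, ]))
}, numeric(1))

# root mean square error of prediction in cross-validation
rmse_cv <- sqrt(mean((pred_loo - d_toy$y)^2))
# RMSE relative to the mean of field values (%)
cvrmse_cv <- 100 * rmse_cv / mean(d_toy$y)
# squared correlation between predicted and field values (%)
r2_cv <- 100 * stats::cor(pred_loo, d_toy$y)^2
round(c(rmse = rmse_cv, cvrmse = cvrmse_cv, r2 = r2_cv), 1)
```

Only the regression coefficients are refitted in this sketch; the set of metrics is kept fixed, as stated in the description of the function above.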

## Calibration for several variables

The following code calibrates models for several forest parameters. In case different transformations have to be performed on the parameters, models have to be calibrated one by one.

```{r multipleModels, include=TRUE, warning = FALSE, message = FALSE}
models_aba <- list()
for (i in c("G_m2_ha", "D_mean_cm", "N_ha"))
{
  models_aba[[i]] <- lidaRtRee::aba_build_model(plots[, i], metrics, transform = "boxcox", nmax = 4, xy = plots[, c("X", "Y")])
}
# bind model stats in a data.frame
model_stats <- do.call(rbind, lapply(models_aba, function(x) {
  x[["stats"]]
}))
```

The obtained models are presented below. The table columns correspond to:

* `n` number of plots,
* `metrics` metrics selected in the model,
* `adj-R2.%` adjusted R-squared of the fitted model (%),
* `CV-R2.%` coefficient of determination of values predicted in cross-validation (CV) versus field values (%),
* `CV-RMSE.%` coefficient of variation of the Root Mean Square Error of prediction in CV (%),
* `CV-RMSE` Root Mean Square Error of prediction in CV.

The two largest (outlier) values of mean diameter are underestimated by the model, which greatly decreases the accuracy statistics. This might be explained by the fact that when trees reach maturity, diameter growth continues while height growth almost stops. As the ALS point cloud mostly contains height information, the signal saturates for high mean diameter values. The same might apply to high biomass values.

```{r multipleModelsTable, echo = FALSE, fig.width = 12, fig.height = 4.5}
# prepare output for report
table_output <- cbind(
  model_stats[, c("n", "formula")],
  round(model_stats[, c("adjR2", "looR2", "cvrmse")] * 100, 1),
  data.frame(rmse = round(model_stats[, "rmse"], 1))
)
names(table_output) <- c("n", "metrics", "adj-R2.%", "CV-R2.%", "CV-RMSE.%", "CV-RMSE")
knitr::kable(table_output)
#
par(mfrow = c(1, 3))
for (i in names(models_aba))
{
  lidaRtRee::aba_plot(models_aba[[i]], main = i)
}
rm(models_aba, model_stats)
```

# Stratified models

## Motivation

When calibrating a statistical relationship between ALS metrics and forest stand parameters, which are usually derived from diameter measurements, one relies on the hypothesis that the interaction of laser pulses with the structure of leaves and branches is constant over the whole area. However, differences can be expected due to variations in acquisition settings (flight parameters, scanner model), in forests (stand structure and composition) or in topography (slope). Better models might be obtained by calibrating stratum-specific relationships, provided each stratum is more homogeneous regarding the laser interaction with the vegetation. A trade-off has to be found between within-stratum homogeneity and the number of plots available for calibration in each stratum. About 50 plots is a minimum, while 100 would be recommended. In this example we hypothesize that ownership reflects both structure and composition differences in forest stands.

## Calibration of stratum-specific models

Stratum-specific models are computed and stored in a list during a `for` loop. The function `lidaRtRee::aba_combine_strata` then combines the list of models corresponding to each stratum to compute aggregated statistics for all plots, making it easier to compare stratified with non-stratified models.

```{r stratifiedmodelCalibration, include=TRUE, warning = FALSE}
# stratification variable
strat <- "stratum"
# create list of models
model_aba_stratified <- list()
# calibrate each stratum model
for (i in levels(plots[, strat]))
{
  subsample <- which(plots[, strat] == i)
  if (length(subsample) > 0) {
    model_aba_stratified[[i]] <- lidaRtRee::aba_build_model(plots[subsample, variable], metrics[subsample, ], transform = "boxcox", nmax = 4, xy = plots[subsample, c("X", "Y")])
  }
}
# backup list of models for later use
model_aba_stratified_boxcox <- model_aba_stratified
# combine list of models into single object
model_aba_stratified <- lidaRtRee::aba_combine_strata(model_aba_stratified, plots$plot_id)
# model_aba_stratified$stats
```

```{r stratifiedmodelTable, echo=FALSE, fig.height = 4.5, fig.width = 8}
# bind model stats in a data.frame for comparison
model_stats <- rbind(model_aba$stats, model_aba_stratified$stats)
row.names(model_stats)[1] <- "NOT.STRATIFIED"
# prepare output for report
table_output <- cbind(
  model_stats[, c("n", "formula")],
  round(model_stats[, c("adjR2", "looR2", "cvrmse")] * 100, 1),
  data.frame(rmse = round(model_stats[, "rmse"], 1))
)
names(table_output) <- c("n", "metrics", "adj-R2.%", "CV-R2.%", "CV-RMSE.%", "CV-RMSE")
knitr::kable(table_output)
par(mfrow = c(1, 2))
lidaRtRee::aba_plot(model_aba, main = paste0(variable, ", not stratified"))
lidaRtRee::aba_plot(model_aba_stratified, main = paste0(variable, ", stratified"))
```
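
The principle of combining stratum-specific models can be sketched in base `R`: one model is fitted per stratum, predictions are stored in the original plot order, and statistics are then computed on all plots. The data below are simulated, and fitted values are used instead of cross-validated predictions to keep the example short; this is not the code of `lidaRtRee::aba_combine_strata`.

```r
# toy data (simulated): the relationship between y and m1 differs between strata
set.seed(3)
n <- 80
d_toy <- data.frame(
  plot_id = paste0("p", seq_len(n)),
  stratum = factor(rep(c("private", "public"), each = n / 2)),
  m1 = runif(n, 5, 30)
)
slope <- ifelse(d_toy$stratum == "public", 1, 2)
d_toy$y <- 5 + slope * d_toy$m1 + rnorm(n, sd = 2)

# non-stratified model: fitted values for all plots
pred_all <- stats::fitted(stats::lm(y ~ m1, data = d_toy))

# stratified models: one fit per stratum, predictions stored in plot order
pred_strat <- rep(NA_real_, n)
for (s in levels(d_toy$stratum)) {
  rows <- which(d_toy$stratum == s)
  pred_strat[rows] <- stats::fitted(stats::lm(y ~ m1, data = d_toy[rows, ]))
}

# statistics aggregated over all plots make both approaches comparable
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))
round(c(
  not_stratified = rmse(pred_all, d_toy$y),
  stratified = rmse(pred_strat, d_toy$y)
), 2)
```

With fitted values the stratified error can only be lower or equal, because the stratified models have more parameters; cross-validated statistics, as reported in the table above, give a fairer comparison.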

## Stratified models with stratum-specific variable transformations

In case one wants to apply different variable transformations, or use different subsets of ALS metrics depending on the strata, the following example can be used. First, models using only the point cloud metrics are calibrated without transformation of the data. The statistics for all plots are then calculated by combining the following stratum-specific models:

* public ownership, all metrics, Box-Cox transformation of basal area values (calibrated in the previous section),
* private ownership, only point cloud metrics, no data transformation.

```{r stratifiedmodelCalibrationTransformation, include=TRUE, warning = FALSE}
# create list of models for no transformation
model_aba_stratified.none <- list()
# calibrate each stratum model
for (i in levels(plots[, strat]))
{
  subsample <- which(plots[, strat] == i)
  if (length(subsample) > 0) {
    model_aba_stratified.none[[i]] <- lidaRtRee::aba_build_model(plots[subsample, variable], metrics_points[subsample, ], transform = "none", xy = plots[subsample, c("X", "Y")])
  }
}
# combine list of models into single object
model_aba_stratified_mixed <- lidaRtRee::aba_combine_strata(list(private = model_aba_stratified.none[["private"]], public = model_aba_stratified_boxcox[["public"]]), plots$plot_id)
# bind model stats in a data.frame for comparison
model_stats <- rbind(model_aba$stats, model_aba_stratified_mixed$stats)
row.names(model_stats)[1] <- "NOT.STRATIFIED"
```

```{r stratifiedmodelCalibrationTransformationTable, echo = FALSE, fig.height = 4.5, fig.width = 8}
# prepare output for report
table_output <- cbind(
  model_stats[, c("n", "formula", "transform")],
  round(model_stats[, c("adjR2", "looR2", "cvrmse")] * 100, 1),
  data.frame(rmse = round(model_stats[, "rmse"], 1))
)
names(table_output) <- c("n", "metrics", "transform", "adj-R2.%", "CV-R2.%", "CV-RMSE.%", "CV-RMSE")
knitr::kable(table_output)
# graphics
par(mfrow = c(1, 2))
lidaRtRee::aba_plot(model_aba, main = paste0(variable, ", not stratified"))
lidaRtRee::aba_plot(model_aba_stratified_mixed, main = paste0(variable, ", stratified"))
```

# Save data before next tutorial

The following lines save the data required for the [area-based mapping step](https://gitlab.irstea.fr/jean-matthieu.monnet/lidartree_tutorials/-/blob/master/R/area-based.3.mapping.and.inference.Rmd).

```{r saveModels, eval=FALSE}
save(model_aba_stratified_mixed, model_aba, aba_point_metrics_fun, aba_res_chm,
  file = "../data/aba.model/output/models.rda"
)
```
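
As a reminder of how objects saved this way are restored in a later session, the sketch below performs a save and load round trip with toy objects and a temporary file (object names and values are hypothetical).

```r
# toy objects standing in for the models and settings (hypothetical values)
model_toy <- stats::lm(dist ~ speed, data = datasets::cars)
res_toy <- 0.5

# save to a temporary file, remove the objects, then restore them
file_toy <- tempfile(fileext = ".rda")
save(model_toy, res_toy, file = file_toy)
rm(model_toy, res_toy)
# load() invisibly returns the names of the restored objects
restored <- load(file_toy)
restored
```

Objects are restored under the names they had when saved, so it is worth keeping explicit names such as those used in this tutorial.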

```{r saveForlidaRtRee, include=FALSE, eval=FALSE}
# save data for lidaRtRee package
# quatre_montagnes <- cbind(plots, metrics)
# save(quatre_montagnes, file = "quatre_montagnes.rda")
```