Imputing missing values

library(bdffpRain)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(Amelia)
#> Loading required package: Rcpp
#> ## 
#> ## Amelia II: Multiple Imputation
#> ## (Version 1.8.2, built: 2024-04-10)
#> ## Copyright (C) 2005-2024 James Honaker, Gary King and Matthew Blackwell
#> ## Refer to http://gking.harvard.edu/amelia/ for more information
#> ##
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union
library(tsibble)
#> Registered S3 method overwritten by 'tsibble':
#>   method               from 
#>   as_tibble.grouped_df dplyr
#> 
#> Attaching package: 'tsibble'
#> The following object is masked from 'package:lubridate':
#> 
#>     interval
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, union
library(ggplot2)

The precipitation records in bdffp_rain have a high degree of missingness at all sites (some more than others). None of the sites were visited daily, and some were only in operation for part of this time series. In this vignette, I’ll show one option for imputing missing values in this precipitation time series using the Amelia package.

Inspect data

bdffp_wide <- 
  bdffp_rain %>%
  select(date, doy, site, precip) %>% 
  pivot_wider(names_from = site, values_from = precip) %>% 
  arrange(date)
missmap(bdffp_wide)
#> Warning: Unknown or uninitialised column: `arguments`.
#> Unknown or uninitialised column: `arguments`.
#> Warning: Unknown or uninitialised column: `imputations`.

Some camps like Florestal and Cabo Frio were rarely visited, while others like Km41 and Colosso were more commonly visited. Km 37 were only active later in the time series. There are also many multi-day accumulations in the data, which do not represent daily data. These should be removed before imputing missing values.

bdffp_wide <- 
  bdffp_rain %>%
  filter(!flag %in% c("A", "U")) %>% 
  select(date, doy, site, precip) %>% 
  pivot_wider(names_from = site, values_from = precip) %>% 
  arrange(date)

Imputation

One approach to filling in gaps in the data while retaining variation among sites is imputation. Not all of the sites can have missing values filled by imputation because the degree of missingness is too high. Here we’ll just select Km41, Colosso, Dimona, and Porto Alegre as examples. Below we show an example using the amelia package, but you should consult the documentation for amelia.

NOTE: Additional data sources should be added to improve imputation accuracy. For example, INMET weather station data available through the Infoclima portal—the Manaus and Rio Preta da Eva stations are geographically closest to BDFFP. See also the brclimate package. More data sources are better—overfitting is not an issue for imputation—but have not been included in this vignette for simplicity.

bdffp_wide2 <-
  bdffp_wide %>% 
  select(date, doy, km41, `colosso`, dimona, `porto alegre`)

Add columns that represent time in different ways to use seasonality to inform imputation

bdffp_wide3 <- 
  bdffp_wide2 %>% 
  mutate(year = year(date), .after = date)
bdffp_wide3
#> # A tibble: 8,494 × 7
#>    date        year   doy  km41 colosso dimona `porto alegre`
#>    <date>     <dbl> <dbl> <dbl>   <dbl>  <dbl>          <dbl>
#>  1 1987-09-01  1987   244    NA      NA    0               NA
#>  2 1987-09-02  1987   245    NA      NA    5               NA
#>  3 1987-09-03  1987   246    NA      NA    6               NA
#>  4 1987-09-04  1987   247    NA      NA    0               NA
#>  5 1987-09-05  1987   248    NA      NA    0               NA
#>  6 1987-09-06  1987   249    NA      NA    0.6             NA
#>  7 1987-09-07  1987   250    NA      NA    0               NA
#>  8 1987-09-08  1987   251     0      NA    0               NA
#>  9 1987-09-09  1987   252     0      NA    0               NA
#> 10 1987-09-10  1987   253     0      NA    6.6             NA
#> # ℹ 8,484 more rows

Identify any columns that should be transformed to improve normality.

hist(bdffp_wide3$km41)

hist(log(bdffp_wide3$km41))


log_cols <- bdffp_wide3 %>% select(-date, -year, -doy) %>% colnames()

Impute (this step may take a while depending on your computer). Multiple imputations are preformed by default, but here only a single imputation is done by setting m = 1

bdffp_imp <-
  amelia(
    as.data.frame(bdffp_wide3),
    m = 1,
    p2s = 1,
    ts = "doy",
    cs = "year",
    intercs = TRUE,
    polytime = 3,
    leads = log_cols,
    lags = log_cols,
    logs = log_cols,
    idvars = c("date"),
    empri = .01 * nrow(bdffp_wide3) #ridge penalty because of high degree of missingness
  )
#> -- Imputation 1 --
#> 
#>   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
#>  21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
#>  41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#>  61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
#>  81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
#>  101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
#>  121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
#>  141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
#>  161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
#>  181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
#>  201 202 203 204 205 206 207 208 209 210 211 212 213 214

Imputation diagnostics

A variety of diagnostic tools are provided in Amelia to select among possibly settings and data sources for imputation. See the vignette for Amelia for more information about how to use and interpret these.

plot(bdffp_imp, which.vars = "colosso")

Imputed values (red) have a slightly higher mean than observed values.

Accessing imputed data

Imputed datasets are stored in the object output by amelia(). If multiple imputation is used, bdffp_imp$imp is a list of multiple datasets. Calculations can be done on these in parallel with, for example, the purr package, then averaged to take into account uncertainty in the imputation step.

out <- bdffp_imp$imp[[1]]
head(out)
#>         date year doy       km41    colosso dimona porto alegre
#> 1 1987-09-01 1987 244  1.1231503  1.2063870    0.0    0.8627172
#> 2 1987-09-02 1987 245  0.8655257  1.8613534    5.0    3.1030818
#> 3 1987-09-03 1987 246 15.8843364 10.9737257    6.0   20.9976189
#> 4 1987-09-04 1987 247  1.5569731  4.7381599    0.0    2.7177896
#> 5 1987-09-05 1987 248  4.4413317 10.3439301    0.0    8.6083778
#> 6 1987-09-06 1987 249  1.4002843  0.4262919    0.6    0.7990145

Monthly rainfall

Now that we have a complete daily rainfall dataset, we can use it to create monthly rainfall data. The yearmonth class from the tsibble package will be helpful.

monthly_rain <-
  out %>% 
  pivot_longer(c(-date, -year, -doy), names_to = "site", values_to = "precip") %>% 
  mutate(yearmonth = yearmonth(date), .after = year) %>% 
  group_by(yearmonth, site) %>% 
  summarize(precip_mon = sum(precip))
#> `summarise()` has grouped output by 'yearmonth'. You can override using the
#> `.groups` argument.

Timeseries plot

We could then use these data to plot a timeseries of monthly precipitation at each site. (Remember, these data will be slightly different with a different random seed, because imputed values will change).

ggplot(monthly_rain, aes(x = yearmonth, y = precip_mon)) +
  geom_line() +
  facet_wrap(~site) +
  scale_x_yearmonth()

Seasonality

We could also the imputed data to calculate mean monthly precipitation at each site for the entire timeseries.

mean_rain <- 
  monthly_rain %>% 
  mutate(month = month(yearmonth, label = TRUE)) %>% 
  group_by(month, site) %>%
  summarize(precip_mean = mean(precip_mon))
#> `summarise()` has grouped output by 'month'. You can override using the
#> `.groups` argument.
ggplot(mean_rain, aes(x = month, y = precip_mean)) +
  geom_col() +
  facet_wrap(~site)

These plots may differ slightly with each run of amelia() since imputations are made with some uncertainty that depends on the random seed. Additionally, if amelia() is run on a dataset with more data sources (recommended), then the output will obviously be different.