library(bdffpRain)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
library(Amelia)
#> Loading required package: Rcpp
#> ##
#> ## Amelia II: Multiple Imputation
#> ## (Version 1.8.2, built: 2024-04-10)
#> ## Copyright (C) 2005-2024 James Honaker, Gary King and Matthew Blackwell
#> ## Refer to http://gking.harvard.edu/amelia/ for more information
#> ##
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#>
#> date, intersect, setdiff, union
library(tsibble)
#> Registered S3 method overwritten by 'tsibble':
#> method from
#> as_tibble.grouped_df dplyr
#>
#> Attaching package: 'tsibble'
#> The following object is masked from 'package:lubridate':
#>
#> interval
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, union
library(ggplot2)
The precipitation records in bdffp_rain
have a high
degree of missingness at all sites (some more than others). None of the
sites were visited daily, and some were only in operation for part of
this time series. In this vignette, I’ll show one option for imputing
missing values in this precipitation time series using the
Amelia
package.
bdffp_wide <-
bdffp_rain %>%
select(date, doy, site, precip) %>%
pivot_wider(names_from = site, values_from = precip) %>%
arrange(date)
missmap(bdffp_wide)
#> Warning: Unknown or uninitialised column: `arguments`.
#> Unknown or uninitialised column: `arguments`.
#> Warning: Unknown or uninitialised column: `imputations`.
Some camps like Florestal and Cabo Frio were rarely visited, while others like Km41 and Colosso were more commonly visited. Km 37 were only active later in the time series. There are also many multi-day accumulations in the data, which do not represent daily data. These should be removed before imputing missing values.
One approach to filling in gaps in the data while retaining variation
among sites is imputation. Not all of the sites can have missing values
filled by imputation because the degree of missingness is too high. Here
we’ll just select Km41, Colosso, Dimona, and Porto Alegre as examples.
Below we show an example using the amelia
package, but you
should consult the documentation for amelia
.
NOTE: Additional data sources should be added to improve
imputation accuracy. For example, INMET weather station data
available through the Infoclima portal—the Manaus
and Rio Preta da Eva stations are geographically closest to BDFFP. See
also the brclimate
package. More data sources are better—overfitting is not an issue
for imputation—but have not been included in this vignette for
simplicity.
Add columns that represent time in different ways to use seasonality to inform imputation
bdffp_wide3 <-
bdffp_wide2 %>%
mutate(year = year(date), .after = date)
bdffp_wide3
#> # A tibble: 8,494 × 7
#> date year doy km41 colosso dimona `porto alegre`
#> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1987-09-01 1987 244 NA NA 0 NA
#> 2 1987-09-02 1987 245 NA NA 5 NA
#> 3 1987-09-03 1987 246 NA NA 6 NA
#> 4 1987-09-04 1987 247 NA NA 0 NA
#> 5 1987-09-05 1987 248 NA NA 0 NA
#> 6 1987-09-06 1987 249 NA NA 0.6 NA
#> 7 1987-09-07 1987 250 NA NA 0 NA
#> 8 1987-09-08 1987 251 0 NA 0 NA
#> 9 1987-09-09 1987 252 0 NA 0 NA
#> 10 1987-09-10 1987 253 0 NA 6.6 NA
#> # ℹ 8,484 more rows
Identify any columns that should be transformed to improve normality.
Impute (this step may take a while depending on your computer).
Multiple imputations are preformed by default, but here only a single
imputation is done by setting m = 1
bdffp_imp <-
amelia(
as.data.frame(bdffp_wide3),
m = 1,
p2s = 1,
ts = "doy",
cs = "year",
intercs = TRUE,
polytime = 3,
leads = log_cols,
lags = log_cols,
logs = log_cols,
idvars = c("date"),
empri = .01 * nrow(bdffp_wide3) #ridge penalty because of high degree of missingness
)
#> -- Imputation 1 --
#>
#> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
#> 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
#> 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
#> 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
#> 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
#> 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
#> 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
#> 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160
#> 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180
#> 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200
#> 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220
#> 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
#> 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257
A variety of diagnostic tools are provided in Amelia
to
select among possibly settings and data sources for imputation. See the
vignette for Amelia
for more information about how to use
and interpret these.
Imputed values (red) have a slightly higher mean than observed values.
Imputed datasets are stored in the object output by
amelia()
. If multiple imputation is used,
bdffp_imp$imp
is a list of multiple datasets. Calculations
can be done on these in parallel with, for example, the
purr
package, then averaged to take into account
uncertainty in the imputation step.
out <- bdffp_imp$imp[[1]]
head(out)
#> date year doy km41 colosso dimona porto alegre
#> 1 1987-09-01 1987 244 9.6137664 26.7664013 0.0 24.4445423
#> 2 1987-09-02 1987 245 2.7333636 0.8497229 5.0 0.6561507
#> 3 1987-09-03 1987 246 4.0338526 4.2666153 6.0 7.4966190
#> 4 1987-09-04 1987 247 20.5607485 2.1134163 0.0 1.6844572
#> 5 1987-09-05 1987 248 0.3754247 0.9786680 0.0 0.3252066
#> 6 1987-09-06 1987 249 8.3369635 2.4073080 0.6 1.2036644
Now that we have a complete daily rainfall dataset, we can use it to
create monthly rainfall data. The yearmonth
class from the
tsibble
package will be helpful.
monthly_rain <-
out %>%
pivot_longer(c(-date, -year, -doy), names_to = "site", values_to = "precip") %>%
mutate(yearmonth = yearmonth(date), .after = year) %>%
group_by(yearmonth, site) %>%
summarize(precip_mon = sum(precip))
#> `summarise()` has grouped output by 'yearmonth'. You can override using the
#> `.groups` argument.
We could then use these data to plot a timeseries of monthly precipitation at each site. (Remember, these data will be slightly different with a different random seed, because imputed values will change).
We could also the imputed data to calculate mean monthly precipitation at each site for the entire timeseries.
mean_rain <-
monthly_rain %>%
mutate(month = month(yearmonth, label = TRUE)) %>%
group_by(month, site) %>%
summarize(precip_mean = mean(precip_mon))
#> `summarise()` has grouped output by 'month'. You can override using the
#> `.groups` argument.
These plots may differ slightly with each run of
amelia()
since imputations are made with some uncertainty
that depends on the random seed. Additionally, if amelia()
is run on a dataset with more data sources (recommended), then the
output will obviously be different.