Specifying cluster-level model in SUMMER for mortality estimation
Zehang Richard Li
2025-01-04
Source:vignettes/articles/web_only/cluster-model-vignette.Rmd
cluster-model-vignette.Rmd
In this vignette, we will describe different ways to specify the
cluster-level model for U5MR estimation. We first load the package and
an example dataset DemoData
, which contains model survey
data provided by DHS. Note that this data is simulated and does not
represent any real country’s data.
For this simulated dataset, the strata variable is coded as region crossed by urban/rural status. For our analysis with urban/rural stratified model, we first construct a new strata variable that contains only the urban/rural status, i.e., the additional stratification within each region.
for (i in 1:length(DemoData)) {
strata <- DemoData[[i]]$strata
DemoData[[i]]$strata[grep("urban", strata)] <- "urban"
DemoData[[i]]$strata[grep("rural", strata)] <- "rural"
}
counts.all <- NULL
for (i in 1:length(DemoData)) {
vars <- c("clustid", "strata", "region", "time", "age")
counts <- getCounts(DemoData[[i]][, c(vars, "died")], variables = "died", by = vars,
drop = TRUE)
counts <- counts %>%
mutate(cluster = clustid, years = time, Y = died)
counts$survey <- names(DemoData)[i]
counts.all <- rbind(counts.all, counts)
}
periods <- c("85-89", "90-94", "95-99", "00-04", "05-09", "10-14")
The DemoData
includes
surveys. We index surveys by
.
The sampling frame that was used for survey
,
will be denoted by
.
For example, if the first
surveys are based on the first sampling frame and the next
surveys are based on the second sampling frame, then
.
When only one survey is used, the
index can be dropped in all formulas below.
We assume a discrete hazards model described in Mercer et al. (2015). We use discrete time survival analysis to estimate age-specific monthly probabilities of dying in user-defined age groups. We assume constant hazards within the age bands. The default choice uses the monthly age bands for U5MR and they can be easily specified by the user. The U5MR for area and time can be calculated as, where and are the start and end of the -th age group, and is the probability of death in age group in area and time , with the estimate of this quantity. This calculation follows the synthetic cohort life table approach in which mortality probabilities for each age segments based on real cohort mortality experience are combined.
For cluster-level modeling, we consider a beta-binomial model for the probability (hazard) of death from month to in survey and at cluster in year . This model allows for overdispersion relative to the binomial model. Assuming constant hazards within age bands, we assume the number of deaths occurring within age band , in cluster , time , and survey follow the beta-binomial distribution, where is the monthly hazard at -th month of age, in cluster , time , and survey and is the overdispersion parameter.
In the likelihood model,
is a known constant. The default prior for
is
The mean and precision of the normal
prior can be modified by specifying the overdisp.mean
and
overdisp.prec
arguments.
Specifying the latent logistic model
For simplicity of notation, we will consider within-region
stratification by urban/rural status in this vignette. In SUMMER, more
than two levels of stratification is automatically allowed when
strata
variable in the input data frame contains more
unique values. Under the urban/rural stratification, the most general
form of the latent logistic model we use is,
We will break down the terms next.
Sharing information across age groups
In the temporal model, we usually would like to share some
information across different age groups. This requires specifying a
reduced group of age bands for some component. We use
here to differentiate from the age bands used in the hazard likelihood.
The default choice for U5MR in the package is
The
are specified in age.group
argument, and the size of each
age group is specified in age.n
argument. The reduced
mapping
is specified in age.time.group
argument.
Survey fixed effect
We include a survey fixed effect
with the constraint
for each sampling frame
,
so that the main temporal trends are identifiable for each sampling
frame. The
terms are not included in the prediction, i.e., they are set to zero. If
we ignore any survey-specific difference, we can remove the
term by setting survey.effect = FALSE
.
Unstructured temporal effect
The
are unstructured temporal effects that allow for perturbations over
time. It is a contextual choice whether they are used in predictions. By
default, when applying getSmoothed()
to a fitted
cluster-level model,
are not included in the predictions. They can be included by specifying
include.time.unstruct = TRUE
.
Structured temporal effect
The temporal main effects are specified for urban and rural stratum seperately with and . These effects are also specific to each sampling frame , as the definition of urban and rural usually change across different sampling frames.
We do not share any information across frames currently in SUMMER, thus all terms specific to different sampling frames are modeled as independent a priori.
The age-frame-stratum-specific temporal effects can be modelled by a first order random walk (RW1), second order frandom walk (RW2), or first order autoregressive process (AR1). In all cases, we apply a sum-to-zero constraint on the random walk or autoregressive process and include an explicit intercept for the full age groups (instead of . For example, where . For RW1 and AR1 model, we include an additional linear trend by default, so that We will discuss the modeling of the linear trend later. First, we start with the default RW2 model.
No urban/rural effect
In the simplest case, we can let
.
This can be achieved by setting strata
variable of the
input data frame to be all NA.
counts.all.no.strat <- counts.all
counts.all.no.strat$strata <- NA
fit1 <- smoothCluster(data = counts.all.no.strat, Amat = DemoMap$Amat, family = "betabinomial",
year.label = c(periods, "15-19"), time.model = "rw2", survey.effect = TRUE, strata.time.effect = FALSE)
rownames(fit1$fit$summary.fixed)
## [1] "age.intercept0" "age.intercept1-11" "age.intercept12-23"
## [4] "age.intercept24-35" "age.intercept36-47" "age.intercept48-59"
We can see the fixed effects include only the six elements .
Time-invariant urban/rural effect
We can also specify the random effects . This
This can be specified by setting
strata.time.effect = FALSE
.
fit2 <- smoothCluster(data = counts.all, Amat = DemoMap$Amat, family = "betabinomial",
year.label = c(periods, "15-19"), survey.effect = TRUE, strata.time.effect = FALSE)
rownames(fit2$fit$summary.fixed)
## [1] "age.intercept0:rural" "age.intercept1-11:rural"
## [3] "age.intercept12-23:rural" "age.intercept24-35:rural"
## [5] "age.intercept36-47:rural" "age.intercept48-59:rural"
## [7] "age.intercept0:urban" "age.intercept1-11:urban"
## [9] "age.intercept12-23:urban" "age.intercept24-35:urban"
## [11] "age.intercept36-47:urban" "age.intercept48-59:urban"
## [13] "strataurban"
Now the fixed effects include the six elements for the urban strata, six elements for the rural strata and a constant urban/rural effect for the random effect difference.
Time-varying urban/rural effect
A more flexible and usually reasonable model is to assume each stratum have indepent age-specific trends, i.e., and each follow its own trends.
fit3 <- smoothCluster(data = counts.all, Amat = DemoMap$Amat, family = "betabinomial",
year.label = c(periods, "15-19"), survey.effect = TRUE, strata.time.effect = TRUE)
rownames(fit3$fit$summary.fixed)
## [1] "age.intercept0:rural" "age.intercept1-11:rural"
## [3] "age.intercept12-23:rural" "age.intercept24-35:rural"
## [5] "age.intercept36-47:rural" "age.intercept48-59:rural"
## [7] "age.intercept0:urban" "age.intercept1-11:urban"
## [9] "age.intercept12-23:urban" "age.intercept24-35:urban"
## [11] "age.intercept36-47:urban" "age.intercept48-59:urban"
We can check now that there are in total structured temporal random effect terms, corresponding to urban time series and rural time series.
nrow(fit3$fit$summary.random$time.struct)
## [1] 42
Linear trend
As mentioned before, for RW1 and AR1 model, by default we include an additional linear trend by default, so that The linear trend is modelled as a fixed effect for a rescaled time index variable . That is, the coefficient should be interpreted as unit change on the logit scale comparing the last to the first time period.
fit4 <- smoothCluster(data = counts.all, Amat = DemoMap$Amat, family = "betabinomial",
year.label = c(periods, "15-19"), time.model = "ar1", survey.effect = TRUE, strata.time.effect = TRUE)
rownames(fit4$fit$summary.fixed)
## [1] "time.slope.group1" "time.slope.group2"
## [3] "time.slope.group3" "time.slope.group4"
## [5] "time.slope.group5" "time.slope.group6"
## [7] "age.intercept0:rural" "age.intercept1-11:rural"
## [9] "age.intercept12-23:rural" "age.intercept24-35:rural"
## [11] "age.intercept36-47:rural" "age.intercept48-59:rural"
## [13] "age.intercept0:urban" "age.intercept1-11:urban"
## [15] "age.intercept12-23:urban" "age.intercept24-35:urban"
## [17] "age.intercept36-47:urban" "age.intercept48-59:urban"
The
additional linear trends are included in the fixed effect. To see which
slope correspond to which age-stratum pairs, we can inspect the
slope.fixed.output
field in the returned object.
fit4$slope.fixed.output
## $time.slope.group1
## [1] "0:rural"
##
## $time.slope.group2
## [1] "1-11:rural"
##
## $time.slope.group3
## [1] "12-23:rural" "24-35:rural" "36-47:rural" "48-59:rural"
##
## $time.slope.group4
## [1] "0:urban"
##
## $time.slope.group5
## [1] "1-11:urban"
##
## $time.slope.group6
## [1] "12-23:urban" "24-35:urban" "36-47:urban" "48-59:urban"
The linear trend term can also be set to
by setting linear.trend = FALSE
fit5 <- smoothCluster(data = counts.all, Amat = DemoMap$Amat, family = "betabinomial",
year.label = c(periods, "15-19"), time.model = "ar1", survey.effect = TRUE, strata.time.effect = TRUE,
linear.trend = FALSE)
rownames(fit5$fit$summary.fixed)
## [1] "age.intercept0:rural" "age.intercept1-11:rural"
## [3] "age.intercept12-23:rural" "age.intercept24-35:rural"
## [5] "age.intercept36-47:rural" "age.intercept48-59:rural"
## [7] "age.intercept0:urban" "age.intercept1-11:urban"
## [9] "age.intercept12-23:urban" "age.intercept24-35:urban"
## [11] "age.intercept36-47:urban" "age.intercept48-59:urban"
Shared Linear trend
We can also enforce shared linear trends across all age groups, though this is usually too strong an assumption, i.e.,
fit6 <- smoothCluster(data = counts.all, Amat = DemoMap$Amat, family = "betabinomial",
year.label = c(periods, "15-19"), time.model = "ar1", survey.effect = TRUE, strata.time.effect = TRUE,
linear.trend = TRUE, common.trend = TRUE)
rownames(fit6$fit$summary.fixed)
## [1] "time.slope" "age.intercept0:rural"
## [3] "age.intercept1-11:rural" "age.intercept12-23:rural"
## [5] "age.intercept24-35:rural" "age.intercept36-47:rural"
## [7] "age.intercept48-59:rural" "age.intercept0:urban"
## [9] "age.intercept1-11:urban" "age.intercept12-23:urban"
## [11] "age.intercept24-35:urban" "age.intercept36-47:urban"
## [13] "age.intercept48-59:urban"
The
additional linear trends are included in the fixed effect. To see which
slope correspond to which age-stratum pairs, we can inspect the
slope.fixed.output
field in the returned object.
fit6$slope.fixed.output
## $time.slope.group1
## [1] "0:rural"
##
## $time.slope.group2
## [1] "1-11:rural"
##
## $time.slope.group3
## [1] "12-23:rural" "24-35:rural" "36-47:rural" "48-59:rural"
##
## $time.slope.group4
## [1] "0:urban"
##
## $time.slope.group5
## [1] "1-11:urban"
##
## $time.slope.group6
## [1] "12-23:urban" "24-35:urban" "36-47:urban" "48-59:urban"
Spatial effects
The spatial effects are assumed to be shared across all age groups and strata. The terms correspond to the sum of a structured spatial effect and an unstructured effect for each area. It is parameterized with a BYM2 model described in Riebler et al. (2016).
Space-time interaction effects
The space-time interaction term
are modelled with the type I to IV interactions of the chosen temporal
model and the ICAR or independent model in space. The four types of
models are described in Knorr-Held (2000)
. The specification of type I to IV interactions is through the argument
type.st
.
We can further include random slopes in the space-time interaction terms, i.e., where is the usual type I to IV interactions and the time covariate is rescaled to be -0.5 to 0.5, so that the random slope can be interpreted as the total deviation from the main trend from the first year to the last year to be projected, on the logit scale.
The random slopes are included when user specifies their hyperprior
pc.st.slope.alpha
andpc.st.slope.u
. For
example, the following model specifies a type IV interaction between AR1
in time and ICAR in space, with region-specific random slope in time,
with PC prior such that
The main temporal random effect is set
to be a second order random walk.
fit6 <- smoothCluster(data = counts.all, Amat = DemoMap$Amat, family = "betabinomial",
year.label = c(periods, "15-19"), survey.effect = TRUE, strata.time.effect = TRUE,
linear.trend = FALSE, time.model = "rw2", st.time.model = "ar1", pc.st.slope.u = 2,
pc.st.slope.alpha = 0.1)
The random slopes can be extracted from
fit6$fit$summary.random$st.slope.id[, c(1:3)]
## ID mean sd
## 1 1 0.051 0.30
## 2 2 0.113 0.30
## 3 3 0.120 0.31
## 4 4 -0.284 0.33
Hyperpriors
By default, all hyperpriors are assumed to be PC priors. The
hyperprior arguments start with pc.
. Their definition can
be easily found in the help file.