Experiments with MRP for small area estimation in GIScience
Author
Roger Beecham
Introduction
This document describes a workflow for generating small area outcome estimates using MRP. This work originally formed a GISRUK abstract (paper | talk). We are now developing it into a full paper, but particularly exciting are the wider ambitions to generate a set of place-based indicators as part of the Leeds HASP and FINDS data service.
Setup
Listed below are some useful packages for working with Bayesian models in a tidy way. brms is a wrapper to Stan; marginaleffects is a very useful package for generating interpretable outputs from model objects (conditional effects in this case); and tidybayes provides uncertainty-representation extensions to ggplot2.
Code
library(here)
library(tidyverse)
# Bayesian modelling tools
library(rstanarm)
library(cmdstanr)
library(brms)
library(marginaleffects) # For extracting effect sizes from Bayesian objects.
library(parameters)
library(tidybayes) # For uncertainty representation.
library(sf)
library(patchwork)
library(ggtext)
library(ggpubr)
# library(modelsummary)
# library(gt)
source(here("code", "theme_clean.R"))
source(here("code", "plot_helpers.R"))
Data
Survey data (microdata): For this example, from the Health Survey for England (HSE).
Census data: Population by usual residence in households.
Poststratification frame: Joint counts (and modelled joint counts) by MSOA.
MSOA lookup files, spatial geometry files etc.
Health Survey for England data
Code
# HEStoFMF.R > Converts to format to be used in FMF
# HES data (microdata)
hes_raw <- read_csv(here("data", "HES_for_R.csv"))

hes_for_model <- read_csv(here("data", "HESforFMF.csv")) |>
  select(id=Seriala, health=genhelf1, SexbyAge, sha=SHA2, region=GOR2, is_urban=urban14b_2, imd=qimd19_2) |>
  separate_wider_delim(col=SexbyAge, names=c("sex", "age"), delim=".") |>
  mutate(
    # Create binary outcome variable.
    is_good=health %in% c("good","verygood"),
    is_vbad=health == "verybad",
    # Cast 5-point as factor, for ordinal regression.
    health=factor(health, levels=c("verybad", "bad", "fair", "good", "verygood"), ordered=TRUE),
    is_urban=is_urban=="U",
    imd=case_when(
      imd=="Least.deprived"~"5",
      imd=="X4"~"2",
      imd=="X3"~"3",
      imd=="X2"~"4",
      imd=="Most.deprived"~"1",
      TRUE~imd
    ),
    across(c(sha, region), ~str_replace_all(.x, "\\.","_"))
  ) |>
  left_join(
    hes_raw |>
      mutate(educ=case_when(
        topqual3 == "NVQ4/NVQ5/Degree or equiv"~"level4",
        topqual3 %in% c("Higher ed below degree","NVQ3/GCE A Level equiv") ~"level3",
        topqual3 %in% c("NVQ1/CSE other grade equiv","NVQ2/GCE O Level equiv") ~"level1level2",
        topqual3 == "Foreign/other"~"other",
        topqual3 %in% c("No qualification","Refused") ~"level0",
        topqual3 == "Not applicable"~"na",
        TRUE~topqual3
      )) |>
      select(id=Seriala, educ)
  ) |>
  mutate(age2=case_when(
    age %in% c("age0to4", "age5to15") ~"age0to15",
    age %in% c("age65to74", "age75p") ~"age65p",
    TRUE~age
  ))
Census data and poststratification frame
Code
# Total households by MSOA
zone_totals <- read_csv(here("data", "TS001.csv")) |>
  rename(zone_id=`Zone ID`)

# Groundtruth: Census data
health <- read_csv(here("data", "GH.csv")) |>
  rename(zone_id=`Zone ID`)

geog <- readxl::read_excel(here("data", "MSOA21_GOR_SHA_UR_IMD.xlsx"), sheet="MSOA21") |>
  select(zone_id=MSOA21CD, region=GOR2, sha=SHA2, is_urban=urban14b_2, imd=IMD19_1eqMD) |>
  mutate(is_urban=is_urban=="U", across(c(sha, region), ~str_replace_all(.x, "\\.","_")))

# Poststratification frame
# ps1 : sex by age
ps1 <- read_csv(here("data", "SexbyAge.csv")) |>
  rename(zone_id=`Zone ID`) |>
  pivot_longer(-zone_id) |>
  separate_wider_delim(col=name, names=c("sex", "age"), delim=".") |>
  mutate(count=value) |>
  group_by(zone_id) |>
  mutate(zone_tot=sum(count)) |>
  ungroup() |>
  group_by(zone_id, sex, age) |>
  summarise(cell_counts=sum(count), prop=cell_counts/zone_tot) |>
  ungroup() |>
  left_join(geog)

# ps2 : sex by age by education level
ps2 <- readxl::read_excel(here("data", "SbyAbyQual.xlsx"), sheet="custom-filtered-2024-11-24T15_1")
names(ps2) <- c("zone_id", "zone_name", "age_code", "age", "sex_code", "sex", "educ_code", "educ", "count")

# Checking for different MSOA definitions.
missing_msoas <- ps2 |>
  separate_wider_delim(educ, delim=":", names=c("educ", "expl"), too_few="align_start") |>
  select(zone_id, age, sex, educ, count) |>
  left_join(geog |> select(zone_id, region)) |>
  filter(region!="Wales") |>
  group_by(zone_id) |>
  summarise(zone_tot=sum(count)) |>
  left_join(geog) |>
  filter(region!="Wales") |>
  select(zone_id, region) |>
  unique() |>
  left_join(ps1 |> select(zone_id, r2=region) |> unique()) |>
  filter(is.na(r2)) |>
  pull(zone_id)

# Create formatted ps dataset, matching HSE data.
ps2 <- ps2 |>
  separate_wider_delim(educ, delim=":", names=c("educ", "expl"), too_few="align_start") |>
  select(zone_id, age, sex, educ, count) |>
  left_join(geog |> select(zone_id, region)) |>
  filter(region!="Wales", !zone_id %in% missing_msoas) |>
  select(-region) |>
  mutate(
    educ=case_when(
      educ == "Level 4 qualifications or above"~"level4",
      educ == "Level 3 qualifications"~"level3",
      educ %in% c("Level 2 qualifications","Level 1 and entry level qualifications") ~"level1level2",
      educ == "No qualifications"~"level0",
      educ == "Does not apply"~"na",
      educ == "Other"~"other",
      TRUE~as.character(educ)
    ),
    sex=tolower(sex),
    age=case_when(
      age == "Aged 15 years and under"~"age0to15",
      age == "Aged 16 to 24 years"~"age16to24",
      age == "Aged 25 to 34 years"~"age25to34",
      age == "Aged 35 to 49 years"~"age35to49",
      age == "Aged 50 to 64 years"~"age50to64",
      age == "Aged 65 years and over"~"age65p",
      TRUE~age
    )
  ) |>
  group_by(zone_id) |>
  mutate(zone_tot=sum(count)) |>
  ungroup() |>
  group_by(zone_id, sex, age, educ) |>
  summarise(cell_counts=sum(count), prop=cell_counts/zone_tot) |>
  ungroup() |>
  unique() |>
  left_join(geog)

# Identify differences in MSOA codes between the HSE and MSOA lookup datasets.
# missing_msoas <- t |> left_join(geog) |> filter(region!="Wales") |> select(zone_id, region) |> unique() |> left_join(ps1 |> select(zone_id, r2=region) |> unique()) |> filter(is.na(r2)) |> pull(zone_id)

# MSOAs not in ps1 but present in ps2:
# # A tibble: 14 × 3
#    zone_id   region                   r2
#    <chr>     <chr>                    <chr>
#  1 E02002524 North_East               NA
#  2 E02002526 North_East               NA
#  3 E02002693 Yorkshire_and_the_Humber NA
#  4 E02003897 South_West               NA
#  5 E02003972 North_West               NA
#  6 E02003974 North_West               NA
#  7 E02004243 South_West               NA
#  8 E02004631 South_West               NA
#  9 E02005184 North_West               NA
# 10 E02005327 North_West               NA
# 11 E02005713 North_East               NA
# 12 E02005721 North_East               NA
# 13 E02005727 North_East               NA
# 14 E02005915 East_Midlands            NA

# Population size of those missing MSOAs: 88536
ps2 |>
  left_join(geog) |>
  filter(region!="Wales", !zone_id %in% missing_msoas) |>
  summarise(tot=sum(cell_counts))
# 55415266

ps1 |>
  summarise(tot=sum(cell_counts))
# 55415499
# Overall difference in zone totals, with hhd also of 233.

# Check levels exactly match between the survey and poststratification frame.
all.equal(
  ps2 |> select(region) |> unique() |> arrange(region) |> pull(),
  hes_for_model |> select(region) |> unique() |> arrange(region) |> pull()
)
Boundary data
Code
# MSOAs : Bring down from ONS Open Geography Portal and simplify.
# msoa_boundaries <- st_read("https://services1.arcgis.com/ESMARspQHYMw9BZ9/arcgis/rest/services/Middle_layer_Super_Output_Areas_December_2021_Boundaries_EW_BSC_V3/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson") |>
#   rmapshaper::ms_simplify(keep_shapes = TRUE)
# st_write(msoa_boundaries, here("data", "msoa_boundaries.geojson"))
msoa_boundaries <- st_read(here("data", "msoa_boundaries.geojson"))
Also, Andrew Heiss’s detailed posts on working with outputs from multilevel logistic regression models – here and here – are an excellent resource, especially as a key ambition of SMART-MRP is to predict over different combinations of demographic and geographic subsets and to communicate uncertainty ranges to decision-makers.
General practice in MRP is to include every variable as a random intercept (level) and so pool estimates of the outcome. That makes sense, but it makes it hard to interpret and justify decisions about which variables to include in the model. If we are interested in interrogating the underlying processes, and in making comparisons with, say, SPM-derived estimates, then it is useful to require parsimony in the variables used, so interpreting model outputs matters. What I do here is:
Generate posterior predictions for each level in the model and inspect how narrow the credible intervals are for those values, using marginaleffects.
Generate posterior conditional effects for each level, again using marginaleffects.
Priors
We parameterise our models using:
An informed prior on the intercept. We have some experience with stated-response surveys and health outcomes, and our expectation is that most respondents will identify as having overall good health – c. 75%. For the multilevel logistic regression the intercept prior is therefore a Gaussian set on the log-odds scale: we use \(Normal(mean=0.5, sd=1)\). We set a reasonably conservative variance around this prior (\(sd=1\)), as there are some systematic absolute errors between the HSE and the “true” population parameter, per the Census (see below).
The multilevel priors are set to \(exponential(1)\) as a default.
The adapt_delta setting is set to 0.95 – common when a hierarchical grouping variable has few levels.
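As a quick check on the intercept prior – a minimal sketch, not necessarily the code behind Figure 2 – we can simulate draws from \(Normal(0.5, 1)\) on the log-odds scale and transform them to the probability scale with plogis().
Code
# Simulate the intercept prior and transform to the probability scale,
# to inspect the implied prior for a "good health" response.
prior_draws <- tibble(
  log_odds = rnorm(1e4, mean = 0.5, sd = 1),
  prob = plogis(log_odds)
)

prior_draws |>
  ggplot(aes(x = prob)) +
  geom_density(fill = "grey85") +
  geom_vline(xintercept = .75, linetype = "dashed") + # Stated expectation of c. 75%.
  labs(x = "implied prior probability of good health", y = "density")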
Figure 2: A prior expectation for the likelihood of HSE respondents stating a good health outcome.
TODO: Chart here showing total abs error between HSE and Census on health outcome.
Example model specification
Code
set.seed(13)

# Time difference of 5.53914 mins
start <- Sys.time()
model7 <- brm(
  data = hes_for_model,
  family = bernoulli(),
  is_good ~ sex + (1 | age2) + (1 | educ) + (1 | imd) + (1 | region) + (1 | is_urban), # + (1 | educ:age2),
  # is_good ~ sex + (1 | age),
  prior = c(
    set_prior("normal(0.5, 1)", class = "Intercept"),
    set_prior("normal(0, 1)", class = "b"),
    set_prior("exponential(1)", class = "sd", group="age2"),
    set_prior("exponential(1)", class = "sd", group="educ"),
    set_prior("exponential(1)", class = "sd", group="imd"),
    set_prior("exponential(1)", class = "sd", group="region"),
    set_prior("exponential(1)", class = "sd", group="is_urban") # ,
    # set_prior("exponential(1)", class = "sd", group="educ:age2")
  ),
  iter = 2000, warmup = 1000, cores = 4, chains = 4,
  seed = 13,
  backend = "cmdstanr",
  control = list(adapt_delta = .95)
)
print(Sys.time() - start)

# saveRDS(model, file = here("models", "fit_hse_7.rds"))
# saveRDS(model, file = here("models", "fit_hse_7_nointeract.rds"))
# model <- readRDS(here("models", "fit_hse_7.rds"))
model7 <- readRDS(here("models", "fit_hse_7_nointeract.rds"))
Example analysis of model outputs
The benefit of modern packages for computational (and Bayesian) model-building is that we can take advantage of resampling procedures to derive interpretable quantities for the variables included in our model. For example, we can ask about the predicted probability that an individual with particular characteristics, or living in a particular place, reports a good health outcome. Or we can ask what the average difference in the likelihood of reporting good health is for a hypothetical individual of one age group versus another (conditional on other characteristics being consistent with the sample average).
The marginaleffects package provides a set of functions to do this – to look at posterior predicted values and comparisons (contrasts) for different, user-specified, individuals. There are a few key functions:
marginaleffects::predictions() extracts from a model the predicted probability – in this case of a survey respondent reporting a good health outcome. To this function we supply a newdata argument with a grid of person characteristics that we want to predict for, and a type argument setting the scale on which predictions are returned. By default marginaleffects::predictions() returns a prediction for each row (respondent) in the dataset, but we might want to return predicted values for (see the sketch after this list):
A specific demographic combination (as in a poststratification frame)
A hypothetical individual with each predictor held at the sample average (mean / mode).
marginaleffects::avg_predictions() takes the average of all predicted values over the dataset – imagine predicting for each observation (respondent) and averaging their predicted probabilities. This becomes more useful when we pass a by argument to average predictions over observed values of our dataset – e.g. for particular ages or other characteristics, what is the average probability of the outcome, given our model?
marginaleffects::comparisons() compares two sets of predictions made with different predictor values. The quantity of interest is the difference in predictions when, say, age takes on different values. This is done by parameterising the variables argument. Risk differences are conditional quantities, which vary with the values of all predictors in the model. By default we obtain the comparison (risk difference) for each unit of observation.
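A minimal sketch of these three functions applied to model7 follows. The level labels passed to datagrid() (e.g. "female", "age65p", "North_East") are assumptions about how the HSE categories are coded and would need checking against hes_for_model.
Code
# Predicted probability for a specific demographic combination
# (c.f. a poststratification cell). Level labels are assumed, not checked.
predictions(
  model7,
  newdata = datagrid(sex = "female", age2 = "age65p", educ = "level4",
                     imd = "1", region = "North_East", is_urban = TRUE),
  type = "response"
)

# Predictions for a hypothetical respondent in each observed age group, with
# unspecified predictors held at their sample mean / mode by datagrid().
predictions(
  model7,
  newdata = datagrid(age2 = unique(hes_for_model$age2)),
  type = "response"
)

# Average predicted probability over the observed sample, by age group.
avg_predictions(model7, by = "age2")

# Risk differences (contrasts) when age2 moves from its reference level,
# returned for each unit of observation by default.
comparisons(model7, variables = "age2")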
Below I inspect the predicted probability of good health by key demographics, comparing model1 (using only sex and age) with model7, which uses many more covariates.
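A sketch of the kind of comparison behind Figure 3 is below; model1 is assumed here to have been fit in the same way as model7 but with sex and the age2 grouping only, and the actual figure relies on the plotting helpers sourced above.
Code
# Average predicted probabilities by age group from each model, stacked for plotting.
preds_by_age <- bind_rows(
  avg_predictions(model1, by = "age2") |> as_tibble() |> mutate(model = "model1"),
  avg_predictions(model7, by = "age2") |> as_tibble() |> mutate(model = "model7")
)

preds_by_age |>
  ggplot(aes(x = estimate, y = age2, colour = model)) +
  geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
                  position = position_dodge(width = .5)) +
  labs(x = "predicted probability of good health", y = NULL)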
Figure 3: Predicted probabilities, given our model, of a good health outcome for the typical respondent in each demographic and areal-level grouping.
Here I use avg_comparisons() to analyse the average marginal component effect of moving between age ranges, education levels etc., holding other features constant. So with marginal effects we ask what happens to the outcome when an explanatory variable moves, in this case, between levels.
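A minimal sketch of this kind of call is below; the exact quantities and plotting behind Figure 4 may differ.
Code
# Average risk difference of moving from the reference level of each grouping
# variable (and sex), averaged over the observed sample.
avg_comparisons(model7, variables = c("sex", "age2", "educ", "imd", "region", "is_urban"))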
Figure 4: The estimated effect, or risk difference, on the good health outcome for the average respondent at each level of each demographic and area-level category, compared to the reference-level category (at 0). For the model that uses information on education and area-level demographics, age has an even more regular effect on the likelihood of reporting good health.
Example poststratification
At poststratification, we predict the outcome for each of the ‘strata’ defined by the poststratification frame (ps). I’ve wrapped this into a function so that we poststratify per model. Here I also calculate absolute errors, against the Census groundtruth, for each poststratified model prediction.
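The actual helper is defined elsewhere in the repo; below is a minimal sketch of the poststratification step. The function name poststratify_msoa is hypothetical, the poststratification frame’s column names are assumed to need aligning with model7’s predictors (e.g. renaming age to age2), and the Census groundtruth column name is a placeholder.
Code
# For each cell of the poststratification frame, draw posterior expected
# probabilities from the model, then take the cell-count-weighted average of
# those draws within each MSOA.
poststratify_msoa <- function(model, ps_frame, ndraws = 500) {
  ps_frame |>
    add_epred_draws(model, ndraws = ndraws, allow_new_levels = TRUE) |>
    group_by(zone_id, .draw) |>
    summarise(est = sum(.epred * cell_counts) / sum(cell_counts), .groups = "drop") |>
    group_by(zone_id) |>
    median_qi(est) # Posterior median and 95% interval per MSOA.
}

# Column names in ps2 are assumed to need renaming to match model7's predictors.
mrp_estimates <- poststratify_msoa(model7, ps2 |> rename(age2 = age))

# Absolute error against the Census groundtruth ('prop_good' is a placeholder
# for whatever the good-health share column in `health` is called).
# mrp_estimates |> left_join(health) |> mutate(abs_error = abs(est - prop_good))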
Results from the FMF (spatial microsimulation model) are stored in a separate .csv file. After loading it into the session, I refactor this data frame so that it can be stacked, with bind_rows(), onto the MRP results.
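A minimal sketch of that reshaping step, with a hypothetical file name and column names standing in for the real FMF output:
Code
# File name and column names here are placeholders for the real FMF output.
fmf_estimates <- read_csv(here("data", "fmf_results.csv")) |>
  transmute(zone_id, est, method = "FMF")

estimates <- bind_rows(
  mrp_estimates |> transmute(zone_id, est, method = "MRP"),
  fmf_estimates
)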
Another external table describes information entropy scores, per area, for the outcome estimates derived from the spatial microsimulation. These describe the extent to which the models need to draw on a diminishing set of survey respondents as we constrain on more variables.
For plots evaluating model performance, I generate several helper functions. These are quite specialised to the HSE case, so I will need to return to them and generalise when making comparisons on other model case studies.
Figure 5: Summary of model performance comparing MRP with spatial microsimulation (the default SAE technique in GIScience) for estimating good health at MSOA level.