The data preparation pipeline (housed within scripts/01_data_prep_real/) transforms raw, row-level patient data into the deterministic, aggregated matrices required by the Lexis simulation model.
The pipeline processes demographic variables (age, sex, and Index of Multiple Deprivation [IMD]) alongside clinical diagnoses to construct a fully populated baseline cohort.
3.1 Hospital episode statistics (HES) definition
The primary data source for acute service utilisation is the Admitted Patient Care (APC) dataset from English Hospital Episode Statistics (HES).
The raw dataset is filtered by 01_process_hes_data.R to isolate relevant clinical spells. The inclusion criteria retain adult patients (aged 16 and over) undergoing elective or non-elective inpatient admissions; day cases and regular day/night attenders without an overnight stay are excluded.
To construct the clinical baseline, the pipeline scans the first five diagnostic positions (ICD-10 codes) of the admission episode. This scan utilises strict inner joins against the central disease_mapping_dictionary.csv.
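The scan-and-join step can be sketched as follows. This is an illustrative sketch rather than the pipeline source: the diagnosis column names (DIAG_01 … DIAG_05), spell_id, and the dictionary columns are assumptions.

```r
library(data.table)

# Toy spell records with wide diagnosis columns (names are assumed)
hes <- data.table(spell_id = 1:3,
                  DIAG_01 = c("I209", "J440", "C349"),
                  DIAG_02 = c("J449", NA, "I209"))

# Stand-in for disease_mapping_dictionary.csv (columns are assumed)
disease_map <- data.table(icd10_code = c("I209", "J449", "J440"),
                          condition  = c("IHD", "COPD", "COPD"))

# Melt the diagnostic positions to long format, dropping empty slots
long <- melt(hes, id.vars = "spell_id",
             measure.vars = patterns("^DIAG_"),
             value.name = "icd10_code", na.rm = TRUE)

# Strict inner join: codes absent from the dictionary (here C349) drop out
mapped <- long[disease_map, on = "icd10_code", nomatch = NULL]
```

The inner join is what makes the mapping "strict": any ICD-10 code not present in the dictionary is silently excluded from the clinical baseline rather than carried forward as an unknown condition.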
3.2 Distinguishing admissions from unique individuals
A single individual with a chronic tobacco-related condition may experience multiple hospital admissions within a single year.
The pipeline explicitly calculates both the count of unique individuals and the total volume of admissions, and generates an admissions_per_patient multiplier for every demographic and disease stratum.
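The calculation can be sketched as below. This is an illustrative sketch, not the pipeline source; the column names (patient_id, condition) are assumptions.

```r
library(data.table)

# Toy spell-level data: patients 1 and 3 are readmitted within the year
spells <- data.table(patient_id = c(1, 1, 2, 3, 3, 3),
                     condition  = c("COPD", "COPD", "COPD",
                                    "IHD",  "IHD",  "IHD"))

# Count unique individuals and total admissions per stratum
stratum <- spells[, .(unique_patients  = uniqueN(patient_id),
                      total_admissions = .N),
                  by = condition]

# Multiplier linking person-level probabilities to event-level costs
stratum[, admissions_per_patient := total_admissions / unique_patients]
```

In this toy example the COPD stratum has 3 admissions across 2 unique patients, giving a multiplier of 1.5.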
This separation allows the downstream epidemiological engine to correctly apply quitting probabilities to individuals while applying costs to total admission events.
3.3 Defining the comorbidity matrix and resolving sparsity
Patients admitted for a primary focal condition (e.g., ischaemic heart disease) frequently possess secondary tobacco-related comorbidities (e.g., COPD). The pipeline generates a probabilistic comorbidity matrix to capture these relationships.
Because HES data stratified by age, sex, IMD, focal condition, and comorbid condition can result in sparse or empty data frames for rare combinations, the pipeline implements a cascading imputation framework. Rather than allowing missing values (NA) to silently crash the downstream economic loops, the pipeline systematically relaxes demographic constraints to find the nearest valid biological average:
```r
# Extract from 01_process_hes_data.R
impute_cascading_average <- function(dt, val_col, group_cols) {
  # 1. Average across IMD
  dt[, mean_imd := mean(get(val_col), na.rm = TRUE),
     by = c("REGION_NAME", group_cols, "AGE_GROUP_NAME", "SEX_NAME")]
  # 2. Average across age
  dt[, mean_age := mean(get(val_col), na.rm = TRUE),
     by = c("REGION_NAME", group_cols, "SEX_NAME")]
  # 3. Overall regional average
  dt[, mean_overall := mean(get(val_col), na.rm = TRUE),
     by = c("REGION_NAME", group_cols)]
  # 4. National overall average (fallback)
  dt[, mean_national := mean(get(val_col), na.rm = TRUE),
     by = group_cols]
  # Apply the cascade, replacing NAs at progressively coarser levels
  dt[is.na(get(val_col)), (val_col) := mean_imd]
  dt[is.na(get(val_col)), (val_col) := mean_age]
  dt[is.na(get(val_col)), (val_col) := mean_overall]
  dt[is.na(get(val_col)), (val_col) := mean_national]
  return(dt)
}
```
3.4 Deriving acute spell unit costs
To support the downstream Tier 1 economic evaluation and operational capacity reporting, the pipeline calculates both the financial cost and the physical duration of each historical HES spell.
The model merges the individual HES spell records against the official NHS Payment Scheme using Healthcare Resource Group (HRG) codes. Rather than applying a flat average, the engine calculates the precise cost of each spell by accounting for:
- Admission method: applying distinct base tariffs for elective versus non-elective admissions.
- Length of stay (LOS) and trim points: comparing the spell LOS against the HRG-specific long-stay trim point.
- Excess bed days: adding a specific per_day_cost for every day a patient remains in hospital beyond the designated trim point.
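The tariff arithmetic above can be sketched as follows. This is an illustrative sketch with made-up tariff figures, not the pipeline source; the argument names mirror the concepts described rather than actual NHS Payment Scheme fields.

```r
# Spell cost = method-specific base tariff + excess bed-day payments
# beyond the HRG trim point (all figures below are hypothetical)
spell_cost <- function(los, elective, base_elective, base_non_elective,
                       trim_point, per_day_cost) {
  base <- if (elective) base_elective else base_non_elective
  excess_days <- max(0, los - trim_point)
  base + excess_days * per_day_cost
}

# A 12-day non-elective stay against an 8-day trim point incurs
# 4 excess bed days: 2200 + 4 * 300 = 3400
spell_cost(los = 12, elective = FALSE, base_elective = 1500,
           base_non_elective = 2200, trim_point = 8, per_day_cost = 300)
```

Spells at or under the trim point pay only the base tariff, so the per-spell cost is sensitive to both admission method and LOS rather than being a flat HRG average.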
Alongside the financial cost, the pipeline records the absolute LOS for each spell. These granular metrics are then aggregated into median outcome_unit_costs and an average length of stay (ALOS) for every specific demographic and disease stratum. This baseline ALOS is crucial for allowing the downstream engine to accurately quantify the physical bed days freed.
3.5 Dynamic smoking prevalence calibration
Baseline smoking prevalences are initially estimated using a pooled sample of the Health Survey for England (HSE). Because general population surveys frequently contain missing demographic or clinical data, the data preparation pipeline (03_estimate_smoking_by_disease.R) applies Multiple Imputation by Chained Equations (MICE).
Missing clinical covariates (e.g., historical cancer or cardiovascular disease flags) and demographic variables are imputed using logistic and polynomial regression models. This ensures that a robust, fully populated “donor” dataset is available before regional sampling and calibration occurs.
Because the HSE is a general population survey, applying its raw prevalence rates directly to an admitted hospital cohort underestimates smoking rates, as smokers are disproportionately more likely to be hospitalised. To correct this, 03_estimate_smoking_by_disease.R applies a dynamic target calibration that combines the HSE estimates with the HES data.
The script computes the expected number of smokers using the base demographic prevalences. It then aligns this projection against an observed, real-world acute hospital audit target (the target_hospital_prevalence). A dynamic multiplier is derived and applied proportionally across the disease matrix, ensuring the modelled volume of smokers mathematically reconciles with local hospital audits:
```r
# Extract from 03_estimate_smoking_by_disease.R

# Calculate expected base smokers from general population data
grid_weighted[, expected_base_smokers := total_admissions * base_current]

# Compute demographic-weighted baseline prevalence
cohort_summary[, base_demographic_prev := expected_base_smokers / total_adms]

# Derive calibration multipliers against the audit target
cohort_summary[, dynamic_multiplier := target_hospital_prevalence / base_demographic_prev]

# Apply calibrated modifiers to the master grid
grid[, prevalence_current := base_current * dynamic_multiplier]
```
This methodological step ensures that the starting cohort entering the clinical pathway model reflects the true clinical burden observed in secondary care, rather than the healthier profile of the general public.
3.6 Population denominators and mortality rate derivation
To provide the epidemiological model with accurate baseline survival constraints, the data preparation pipeline must calculate granular central death rates (deaths per person-year).
The 02_process_population.R script ingests Office for National Statistics (ONS) Mid-Year Estimates alongside Index of Multiple Deprivation (IMD) distributions. It dynamically calculates joint population denominators for every specific age, sex, IMD, and geographic stratum defined in the user configuration.
The 04_process_mortality.R script subsequently merges these calculated denominators against raw ONS death counts, guaranteeing that the mortality rates applied in the Lexis simulation accurately reflect the underlying population scale.
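The merge-and-derive step can be sketched as below. This is an illustrative sketch with made-up figures, not the pipeline source; the column names (person_years, deaths) and strata are assumptions.

```r
library(data.table)

# Toy denominators, as produced by 02_process_population.R (names assumed)
population <- data.table(age_group = c("60-64", "65-69"),
                         sex = c("F", "F"),
                         person_years = c(50000, 42000))

# Toy ONS death counts for the same strata (figures are hypothetical)
deaths <- data.table(age_group = c("60-64", "65-69"),
                     sex = c("F", "F"),
                     deaths = c(250, 420))

# Merge counts onto denominators and derive central death rates
# (deaths per person-year) for each stratum
mortality <- deaths[population, on = c("age_group", "sex")]
mortality[, central_death_rate := deaths / person_years]
```

Deriving the rate against the jointly calculated denominator, rather than a published crude rate, keeps the mortality applied in the Lexis simulation consistent with the population scale of each stratum.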
3.7 Standardising and inflating economic parameters
Legacy economic benchmarks require rigorous standardisation before they can be applied to current NHS cohorts.
For instance, the foundational lifetime cost savings derived from Godfrey et al. (2011) are based on 2009 financial valuations. To ensure economic accuracy, the 05_construct_lifetime_costs.R script automatically compiles historical inflation multipliers from the PSSRU / NHSCII Hospital and Community Health Services (HCHS) index. It compounds these multipliers year on year to escalate the base 2009 valuations to current-year equivalents before extrapolating the parameters across the full demographic matrix.
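The compounding step can be sketched as follows. The index values below are hypothetical placeholders, not real HCHS/NHSCII figures, and the function name is illustrative.

```r
# Hypothetical year-on-year inflation multipliers (NOT real HCHS values)
hchs_multipliers <- c("2010" = 1.021, "2011" = 1.024, "2012" = 1.017)

# Escalate a 2009 valuation to the final year of the multiplier series
# by compounding each annual multiplier in turn
inflate_from_2009 <- function(value_2009, multipliers) {
  value_2009 * prod(multipliers)
}

inflate_from_2009(1000, hchs_multipliers)
```

Compounding (multiplying the annual factors) rather than summing annual percentage changes is what keeps long escalation chains accurate over a decade or more of index history.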