I am conducting a non-randomized study to assess differences in complication rates between two treatment groups. Before matching, the observed complication rates differ significantly between the two groups (22% vs. 50%).
To account for potential confounders and ensure the consistency of these results, I decided to perform propensity score matching (PSM) using the Matchthem
package. First, I identified potential confounders based on clinical considerations and constructed a Directed Acyclic Graph (DAG). I then applied various matching models to determine which performed the best (see Table for comparisons).
Model | Distance | Balanced Variables (<0.1) | Unbalanced Variables (>0.1) | Greatest Imbalance | Treated ESS | Control ESS | Unmatched Treated | Unmatched Control |
---|---|---|---|---|---|---|---|---|
Nearest Logit | 0.0278 | 11 | 4 | min_distance_gallblad_mm (0.1833) | 29.89 | 70.4 | 4 | 6 |
Nearest Logit (Ratio=1, Replace=False) | 0.1326 | 8 | 6 | min_distance_gallblad_mm (0.2641) | 42.6 | 42.6 | 10.4 | 34.4 |
Nearest Mahalanobis | NA | 7 | 5 | Segments_treated_per_session (0.4141) | 32.74 | 77 | 7 | 7 |
Optimal | 0.46 | 9 | 5 | Metastasis_size_at_treatment_mm (0.3666) | 53 | 53 | 0 | 24 |
Full Matching | 0.0496 | 8 | 7 | min_distance_gallblad_mm (0.3046) | 19.5 | 77 | NA | NA |
Genetic | 1.3338 | 5 | 9 | Metastasis_size_at_treatment_mm (0.5037) | 53 | 53 | 0 | 24 |
Below is the code for the first model, which performed the best:
# Perform matching in each imputed dataset
m.out_model1 <- matchthem(Treatment_type_tumor_near_gallbladder ~ age + sex + bmi + ASA + comorbidities + Systemic_treatment_prior_local_treatment + Segments_treated_per_session + Metastasis_size_at_treatment_mm + min_distance_gallblad_mm + stage_binary,
data = new_amcore_imputed,
method = "nearest",
distance = "logit",
replace = TRUE,
ratio = 3,
caliper = 0.2)
Output:
Balance summary across all imputations
Type Max.Diff.Adj
distance Distance 0.0278
age Contin. 0.0558
sex_Male Binary 0.1080
bmi Contin. 0.1040
ASA_1 Binary 0.0094
ASA_2 Binary 0.0423
ASA_3 Binary 0.0329
comorbidities_0 Binary 0.0381
comorbidities_1 Binary 0.0452
comorbidities_2 Binary 0.0329
Systemic_treatment_prior_local_treatment_Yes Binary 0.0440
Segments_treated_per_session Contin. 0.1328
Metastasis_size_at_treatment_mm Contin. 0.0931
min_distance_gallblad_mm Contin. 0.1833
stage_binary_III-IV Binary 0.0493
M.Threshold
distance Balanced, <0.1
age Balanced, <0.1
sex_Male Not Balanced, >0.1
bmi Not Balanced, >0.1
ASA_1 Balanced, <0.1
ASA_2 Balanced, <0.1
ASA_3 Balanced, <0.1
comorbidities_0 Balanced, <0.1
comorbidities_1 Balanced, <0.1
comorbidities_2 Balanced, <0.1
Systemic_treatment_prior_local_treatment_Yes Balanced, <0.1
Segments_treated_per_session Not Balanced, >0.1
Metastasis_size_at_treatment_mm Balanced, <0.1
min_distance_gallblad_mm Not Balanced, >0.1
stage_binary_III-IV Balanced, <0.1
Balance tally for mean differences
count
Balanced, <0.1 11
Not Balanced, >0.1 4
Variable with the greatest mean difference
Variable Max.Diff.Adj M.Threshold
min_distance_gallblad_mm 0.1833 Not Balanced, >0.1
Average sample sizes across imputations
0 1
All 53. 77
Matched (ESS) 31.17 71
Matched (Unweighted) 49. 71
Unmatched 4. 6
Question 1: Can I rely on the results even if not all variables are balanced? If not, what alternative approach is appropriate?
Afterward, I performed a multivariable logistic regression using the matched dataset. Following guidance from Noah Greifer in another post (thanks!), I computed the Average Treatment Effect on the Treated (ATT) using the marginaleffects::avg_comparisons
function.
# Perform a logistic regression on the matched data
results_all <- with(m.out_model1, svyglm(complications ~ Treatment_type_tumor_near_gallbladder + age + sex + bmi +
ASA + comorbidities + Systemic_treatment_prior_local_treatment + Segments_treated_per_session +
Metastasis_size_at_treatment_mm + min_distance_gallblad_mm + stage_binary,
family = quasibinomial()), cluster = FALSE)
# Compute the ATT on RISK RATIO SCALE
ATT_all <- lapply(results_all$analyses, function(fit) {
marginaleffects::avg_comparisons(fit, variables = c("Treatment_type_tumor_near_gallbladder", "age", "sex",
"bmi", "ASA", "comorbidities", "Systemic_treatment_prior_local_treatment", "Segments_treated_per_session",
"Metastasis_size_at_treatment_mm", "min_distance_gallblad_mm", "stage_binary"),
newdata = subset(Treatment_type_tumor_near_gallbladder == "MWA"),
comparison = "lnratioavg") # This returns the ATT on the risk ratio scale. To get the ATT on the odds ratio scale, change comparison to "lnoravg". To get the ATT on the risk difference scale, remove comparison and exponentiate.
})
# Pool the results
ATT_all |> mice::pool() |> summary(exponentiate = TRUE, conf.int = TRUE) |> mutate(across(where(is.numeric),
round, digits = 3))
Question 2: In the code above, I included all the variables used in the PSM to perform the multivariable logistic regression. Is this a correct approach, or should I have included only the unbalanced variables?
Thank you in advance.