PhD Dissertation Proposal Defense: Purity Mugambi, Leveraging Data Science and Machine Learning to Discover and Intervene on Treatment Disparities Captured in EHR Datasets
Abstract
Machine learning (ML) researchers have increasingly used electronic health record (EHR) datasets, especially those that are anonymized and publicly accessible, to train models that could be deployed in the real world. Simultaneously, clinical researchers have shown that systemic injustices and biases creep into health systems, leading to vast and pervasive disparities in the treatment of patients based on factors such as sex, gender, race/ethnicity, and socioeconomic status. It is vital to understand whether and to what extent these disparities manifest in EHR datasets, so that creators of ML models for healthcare know what to consider when training models and interpreting their findings.
This thesis seeks to understand the extent of health inequity captured in EHR data and to investigate how ML models can be redesigned to ensure they maintain high performance for patient groups that are negatively affected by those inequities. To that end, we build tools to: 1) quantify the disparities in treatment of patients across multiple datasets, 2) automate cohort extraction from large databases to reduce the time demand in multi-dataset analyses, and 3) optimize between personalization and generalization to train cohort-specific models that can improve performance for underrepresented patient groups. This thesis is structured in these three main parts.
First, to understand the prevalence of disparities in EHR datasets, we build a tool to run multiple hypothesis tests across multiple datasets to quantify differences in proportions of patients who received various treatments and in quantities of the treatment that they received. This tool also runs multiple regression analyses to compute the association between treatments and patient outcomes, which is vital in understanding the effect of treatment disparities.
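As a rough illustration of the kind of test described above, the sketch below compares the proportion of patients who received a treatment in two groups and then corrects for running several such tests. It is a minimal sketch under stated assumptions, not the tool built in this thesis: the pooled two-proportion z-test and the Bonferroni correction are choices made here for illustration, and the function names and counts are hypothetical.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for equality of two proportions (pooled variance).

    x1, x2: number of patients in each group who received the treatment.
    n1, n2: total number of patients in each group.
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

def bonferroni(p_values, alpha=0.05):
    """Flag which tests stay significant after a Bonferroni correction."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# Hypothetical counts: (treated, total) per group, one entry per treatment.
tests = {
    "treatment_a": ((80, 100), (50, 100)),
    "treatment_b": ((50, 100), (50, 100)),
}
p_values = [two_proportion_z_test(*g1, *g2)[1] for g1, g2 in tests.values()]
significant = dict(zip(tests, bonferroni(p_values)))
```

Bonferroni is the most conservative common correction; a less strict procedure could be substituted without changing the structure of the sketch.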
Second, we investigate ways to automate cohort extraction from EHR databases, a task that is critical to many observational and ML-for-health studies yet is currently manual and extremely time-consuming. We develop and evaluate a language-model-based method for automatically matching EHR database schemas. Once columns are matched across databases, researchers can quickly run cohort selection queries on multiple databases, saving the considerable time they currently spend understanding the schemas of databases they have not worked with before.
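To make the column-matching idea concrete, the sketch below pairs each column in one schema with its most similar column in another. It is only an assumption-laden stand-in: the method in this thesis is language-model-based, whereas this example swaps in plain string similarity from Python's standard library so that it runs without any model. The column names, the `threshold` value, and the function names are hypothetical.

```python
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase a column name and treat underscores as spaces."""
    return name.lower().replace("_", " ").strip()

def similarity(a, b):
    """String similarity in [0, 1]; a placeholder for a language-model score."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def match_columns(source_cols, target_cols, threshold=0.5):
    """Map each source column to its best-scoring target column.

    Columns whose best score falls below `threshold` are left unmatched.
    `target_cols` must be non-empty.
    """
    matches = {}
    for s in source_cols:
        best = max(target_cols, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        if score >= threshold:
            matches[s] = (best, score)
    return matches

# Hypothetical column names from two EHR databases.
source = ["subject_id", "admit_time", "zzz"]
target = ["SUBJECT_ID", "admittime", "dischtime"]
mapping = match_columns(source, target)
```

With a mapping like this in hand, a cohort selection query written against one schema could be rewritten for the other by substituting column names; any real pipeline would still need a human check of low-scoring matches.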
Third, considering the findings on existing treatment inequities, we develop methods for improving performance of predictive ML models on minoritized patient groups. We explore the tradeoff between personalization and generalization and develop hierarchical models that have higher predictive accuracy for smaller patient cohorts even in datasets with highly imbalanced labels.
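The tradeoff between personalization and generalization can be illustrated with a simple partial-pooling estimate, in which a cohort's own event rate is shrunk toward the rate of the full population. This is a generic textbook-style sketch and not the hierarchical models developed in this thesis; the function name, the `prior_strength` parameter, and the numbers are assumptions made for illustration.

```python
def partial_pooling_rate(events, n, global_rate, prior_strength=20.0):
    """Shrink a cohort's observed event rate toward the global rate.

    events, n: observed event count and size of the cohort.
    global_rate: event rate in the full population.
    prior_strength: how many pseudo-patients the global rate is worth;
        larger values favor generalization, smaller values personalization.
    """
    return (events + prior_strength * global_rate) / (n + prior_strength)

# Hypothetical numbers: a small cohort is pulled strongly toward the global
# rate, while a large cohort mostly keeps its own observed rate.
small_cohort = partial_pooling_rate(events=3, n=5, global_rate=0.1)
large_cohort = partial_pooling_rate(events=500, n=1000, global_rate=0.1)
```

The small cohort's raw rate of 0.6 is based on only five patients, so the estimate leans on the population; the large cohort has enough data to stay close to its own rate of 0.5.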
This thesis contributes to the understanding of how pervasive disparities in treatment are (especially in the treatment of acute myocardial infarction) and how they differ across multiple datasets and over time. Most importantly, through this work, we develop tools that allow clinical researchers to quickly search their private datasets for such disparities and compare against insights learned from other datasets. These tools are easy to use and can be applied to different disease use cases. The findings obtained from using these tools empower clinical stakeholders to make informed decisions about their systems of care with the goal of closing existing equity gaps. They also inform ML researchers about existing inequities, so that models can be designed to reduce the effects of disparities present in the data on downstream applications. Finally, this thesis shows one approach through which ML models can be redesigned to ensure they have high predictive performance for underrepresented patient subgroups, who are typically negatively affected by existing health inequities.