Accelerating Open Data Integration of Real-World Health Data Silos

(Marcel Parciak - ongoing since October 2021):

The complexity of multiple sclerosis (MS) necessitates interventions from various healthcare professionals: general practitioners, neurologists, pharmacists and physiotherapists, to name just a few. These professionals may work at different sites, each with its own IT systems supporting routine work. Nevertheless, detailed, accurate and up-to-date information flows between these professionals are crucial for appropriate healthcare. Consequently, these sites generate high volumes of data in heterogeneous formats that are continuously updated and modified. Health data engineers have to integrate such real-world datasets to unlock their full potential for healthcare professionals and health data scientists.

Despite the similarities between health data integration and open data integration, a field that is studied intensively in computer science, I find that approaches and tools developed in the open data integration domain are not applied to the health data domain. The sensitive nature of health data results in locked-down health data silos. As a result, the large corpora used to develop and test applications in the open data integration setting do not carry over to the health data domain. The tooling used for health data integration therefore remains the same as for enterprise integration, even though the volume, variety, velocity and veracity of health data render these tools ineffective. This ineffectiveness leaves health data scientists with tedious, complex and time-consuming data integration tasks.

This PhD project combines innovation in two main fields: computer science and medical informatics.

In the computer science domain, the focus is on the following three main objectives:

  1. Investigate possibilities to profile relational data. A functional dependency describes a strong relation between attributes, in which the values of one attribute (or set of attributes) determine the values of another. Such dependencies are primarily used to enforce domain-specific constraints in relational databases. In real-world data, where such relations are undocumented and errors are common, approximate functional dependency (AFD) measures detect functional dependencies that hold for most, but not all, records. With AFDs, data engineers can enforce functional dependencies on datasets to reduce errors.
  2. Apply recent open data integration strategies to real-world health data. In particular, some strategies propose to employ transformer-based Large Language Models (LLMs) for tasks such as schema mapping and data harmonisation. Because they are pre-trained on vast corpora of text data available on the web, LLMs are endowed with a form of "common knowledge". This knowledge is useful, for example, to automatically derive that a table column containing values like "CA" and "AK" is most likely about US states (California and Alaska), which can be exploited when integrating customer tables. Applications of LLM-based integration strategies in the health data domain are missing, probably because LLMs are not specific to the health data domain, even though they have the potential to speed up health data integration.
  3. Explore approaches to automate the discovery of datasets in health data lakes. Generating patient cohorts for research purposes is a common task in medical research. Traditionally, this is done on well-known, harmonised data schemas. Semi-automating this task for heterogeneous data schemas could speed up finding the right data for a research project.
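
To make the first objective more concrete, the sketch below computes one well-known AFD measure, the g3 error: the smallest fraction of rows that would have to be removed for a dependency to hold exactly. It is only an illustration under stated assumptions. The project description does not say which AFD measures are studied, and the function names, column names and toy records are hypothetical.

```python
from collections import Counter, defaultdict


def g3_error(rows, lhs, rhs):
    """Return the g3 error of the candidate dependency lhs -> rhs.

    g3 is the smallest fraction of rows that has to be removed so that
    the dependency holds exactly on the remaining rows.
    """
    if not rows:
        return 0.0
    # Count how often each rhs value occurs for every lhs value.
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Within each group, keep the most frequent rhs value and drop the rest.
    kept = sum(max(counts.values()) for counts in groups.values())
    return (len(rows) - kept) / len(rows)


def suspicious_rows(rows, lhs, rhs):
    """Return rows whose rhs value disagrees with the majority of their lhs group."""
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    majority = {key: counts.most_common(1)[0][0] for key, counts in groups.items()}
    return [row for row in rows if row[rhs] != majority[row[lhs]]]


# Hypothetical toy records, not taken from any real dataset.
records = [
    {"drug_code": "D1", "drug_name": "drug A"},
    {"drug_code": "D1", "drug_name": "drug A"},
    {"drug_code": "D1", "drug_name": "drug A"},
    {"drug_code": "D1", "drug_name": "drgu A"},  # likely a typo
    {"drug_code": "D2", "drug_name": "drug B"},
]

error = g3_error(records, "drug_code", "drug_name")
flagged = suspicious_rows(records, "drug_code", "drug_name")
print(error)    # fraction of rows violating drug_code -> drug_name
print(flagged)  # rows a data engineer might review or correct
```

A low but non-zero error suggests that the dependency is probably intended and that the flagged rows are candidates for review. Which threshold counts as "low" is a judgment call that depends on the dataset.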

In the medical informatics domain, this PhD project investigates the current state of the art in health data integration and develops a proof of concept that transfers solutions from the computer science domain to the medical informatics domain, based on the following use cases:

  1. The MSDataConnect (MSDC) Initiative by the University MS Center (UMSC), a strategic collaboration between UHasselt and the Noorderhart Hospital in Pelt. MSDC aims to unlock health data silos that contain data on people with multiple sclerosis (PwMS), allowing clinical researchers to analyse large patient cohorts to derive data-driven insights.
  2. The medEmotion project, where UHasselt, PXL and LRM collaborate with Noorderhart, the Jessa hospital in Hasselt and Ziekenhuis Oost-Limburg in Genk to develop and establish data processing environments. For both MSDC and medEmotion, we work together with professional companies that are tasked with implementing data integration pipelines, which gives detailed insight into the current state of the art in health data integration.
  3. The Belgian MS Registry (BELTRIMS), initiated by the Belgian Study Group for Multiple Sclerosis (BSGMS) in 2012, collaborates with healthdata.be to collect data from people with MS in Belgium for healthcare and research. The BELTRIMS use case provides an excellent example of integrating highly heterogeneous health data across multiple sites.

Marcel is involved in the following subprojects:

POC4 of the Flanders AI Research Project (2019-2023):

  • (Semi)-Automation of Health Data Integration

MSDA:

  • Educational program

BELTRIMS:

  • EHDEN - transforming the MS DataConnect dataset to OMOP
  • MS DataConnect (and medEmotion)