Representation learning to advance multi-institutional studies with electronic health record data from US and France

D Zhou, H Tong, L Wang, S Liu, X Xiong, Z Gan, R Griffier, B Hejblum, Y-C Liu, C Hong, C-L Bonzel, T Cai, K Pan, Y-L Ho, L Costa, VA Panickan, JM Gaziano, K Mandl, V Jouhet, R Thiebaut, Z Xia, K Cho, K Liao, T Cai

Abstract

The widespread adoption of electronic health records has created new opportunities for translational clinical research, yet this promise remains constrained by fragmented data across privacy-siloed institutions and substantial heterogeneity in local coding practices. While privacy-preserving collaborative learning allows institutions to work together without sharing patient-level data, it does not address inconsistencies in how clinical concepts are represented across sites. We introduce a graph-based framework that addresses this gap by treating data harmonization as a scalable representation learning problem. Rather than relying on fixed standards or manual mappings, the framework integrates institution-specific summary statistics from health records, curated biomedical knowledge graphs, and semantic information derived from large language models to learn a shared semantic space. This joint learning approach aligns diverse, site-specific vocabularies while preserving patient privacy. Evaluated across seven institutions and two languages, the framework provides a robust, data-centric foundation for training and deploying clinical models across heterogeneous healthcare systems.

Type

Journal article

Publication

Nature Communications, in press

Date

2026

Links

Preprint PDF