Open Source · MIT · Proof-of-Concept

Predicting alloimmunity with protein language models

CAPA replaces coarse HLA match / mismatch scores with continuous ESM-2 embeddings, then predicts graft-versus-host disease (GvHD), relapse, and transplant-related mortality (TRM) as competing risks via cross-attention and DeepHit.

Input · 5 loci · Donor × Recipient
A*02:01 · B*07:02 · C*07:02 · DRB1*15:01 · DQB1*06:02
ESM-2 · 650M · 1 280-dim
Frozen embedding per allele
Cross-attention → DeepHit · CIF
GvHD · Relapse · TRM

§ 01

How it works

From allele strings to risk curves

Three stages transform raw HLA typing into calibrated, interpretable competing-risk predictions.

01
allele → sequence

HLA Input

Donor and recipient alleles at five loci (A, B, C, DRB1, DQB1) are looked up in the IPD-IMGT/HLA database to retrieve their full protein sequences.
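As a rough illustration of this lookup step, the sketch below parses allele names of the kind shown on this page and resolves them against a sequence table. The regular expression, the `Allele` tuple, `parse_allele`, `lookup_sequence`, and the in-memory `table` are all hypothetical names introduced here; how CAPA actually queries IPD-IMGT/HLA is not described on this page and may differ.

```python
import re
from typing import NamedTuple

# The five loci named in the text.
LOCI = ("A", "B", "C", "DRB1", "DQB1")

# Optional "HLA-" prefix, a locus, "*", then two to four colon-separated fields.
ALLELE_RE = re.compile(r"^(?:HLA-)?([A-Z]+[0-9]*)\*(\d{2,3}(?::\d{2,3}){1,3})$")


class Allele(NamedTuple):
    locus: str
    fields: tuple


def parse_allele(name: str) -> Allele:
    """Split a name such as 'HLA-A*02:01' into its locus and numeric fields."""
    match = ALLELE_RE.match(name.strip())
    if match is None or match.group(1) not in LOCI:
        raise ValueError(f"unrecognised HLA allele: {name!r}")
    return Allele(match.group(1), tuple(match.group(2).split(":")))


def lookup_sequence(name: str, table: dict) -> str:
    """Resolve an allele to a protein sequence at two-field resolution.

    `table` is a hypothetical dict keyed like 'A*02:01'; in practice it would
    be built from an IPD-IMGT/HLA export.
    """
    allele = parse_allele(name)
    key = f"{allele.locus}*{':'.join(allele.fields[:2])}"
    return table[key]
```

Keying the table at two-field resolution is an assumption made here because every example allele on this page is written with two fields.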

02
sequence → 1 280-dim

ESM-2 Embedding

Each amino-acid sequence is encoded by frozen ESM-2 (650M parameters) into a 1 280-dim vector. Immunologically similar alleles cluster together.
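This page does not say how per-residue ESM-2 outputs are reduced to a single vector per allele, nor how similarity between alleles is measured. A minimal sketch, assuming mean pooling over residues and cosine similarity as the comparison (both are assumptions, and `mean_pool` and `cosine_similarity` are hypothetical helper names), could look like this:

```python
import math


def mean_pool(residue_vectors):
    """Average per-residue vectors (length L, each of dimension D) into one D-dim vector."""
    n = len(residue_vectors)
    if n == 0:
        raise ValueError("empty sequence")
    dim = len(residue_vectors[0])
    return [sum(vec[d] for vec in residue_vectors) / n for d in range(dim)]


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm
```

In this sketch the per-residue vectors would come from the frozen ESM-2 model, one 1 280-dim vector per amino acid; the model call itself is left out.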

03
interaction → CIF

Risk Prediction

A cross-attention network models donor–recipient allele interactions. DeepHit jointly outputs cumulative incidence curves for GvHD, relapse, and TRM.
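To make the interaction step concrete, here is a minimal single-head scaled dot-product attention in plain Python, with donor allele vectors as queries and recipient allele vectors as keys and values. It is a sketch of the general technique only: it has no learned projections, and the choice of donor as query side is an assumption, so it should not be read as CAPA's actual network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections.

    queries: n_q vectors of dimension d (here: donor alleles)
    keys:    n_k vectors of dimension d (here: recipient alleles)
    values:  n_k vectors of dimension d_v
    Returns (context, weights): context holds n_q vectors of dimension d_v,
    and weights[i][j] is how strongly query i attends to key j.
    """
    scale = math.sqrt(len(queries[0]))
    context, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        row = softmax(scores)
        weights.append(row)
        context.append([
            sum(w * v[d] for w, v in zip(row, values))
            for d in range(len(values[0]))
        ])
    return context, weights
```

The `weights` matrix is the kind of quantity an attention heatmap would display: one row per donor allele, one column per recipient allele, each row summing to 1.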

Input: HLA-A*02:01 · HLA-B*07:02 · HLA-DRB1*15:01
ESM-2 · 650M: 1 280-dim embedding per allele
Cross-Attention: Donor × Recipient interaction · 128-dim
DeepHit output: GvHD · CIF, Relapse · CIF, TRM · CIF

§ 02

Key results

Benchmarking against traditional baselines

0.84 · C-index, relapse · Fine–Gray (95% CI 0.69–1.00)
0.75 · C-index, relapse · Cox-PH (95% CI 0.53–1.00)
187 · Patients (UCI BMT) · Paediatric HSCT cohort
3 · Competing risks · GvHD, Relapse, TRM

Evaluated on the UCI Bone Marrow Transplant dataset (n = 187) using time-dependent C-index and Brier score.
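For readers unfamiliar with the metric, the sketch below shows one common way a cause-specific, time-dependent C-index can be computed from predicted cumulative incidence curves. It assumes discrete time bins, counts ties in predicted risk as half, and applies no censoring weights; these choices and the name `td_c_index` are assumptions made for illustration and may not match the evaluation code behind the numbers on this page.

```python
def td_c_index(times, events, cif, cause):
    """Cause-specific time-dependent concordance.

    times[i]  : observed time-bin index for subject i
    events[i] : 0 if censored, otherwise the id of the observed cause
    cif[i][t] : predicted cumulative incidence of `cause` for subject i at bin t
    A pair (i, j) is comparable when i had `cause` at times[i] and j was still
    under observation after that time. The pair is concordant when i's
    predicted risk at times[i] exceeds j's.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if events[i] != cause:
            continue
        t_i = times[i]
        for j in range(len(times)):
            if times[j] <= t_i:
                continue  # j was no longer at risk after t_i
            comparable += 1
            if cif[i][t_i] > cif[j][t_i]:
                concordant += 1.0
            elif cif[i][t_i] == cif[j][t_i]:
                concordant += 0.5  # assumption: ties count as half
    if comparable == 0:
        raise ValueError("no comparable pairs for this cause")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ranked correctly and 0.5 corresponds to random ranking. When a cause has very few events, few comparable pairs exist and the estimate becomes unstable.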

[FIG. 02 · Cumulative incidence, held-out test set. Curves for GvHD, Relapse, and TRM; x-axis in months (0 to 72), y-axis cumulative incidence (0 to 1.0).]

Illustrative competing-risk CIFs for a representative donor–recipient pair. Run the prediction tool for case-specific curves.
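Curves of this kind can be derived from a DeepHit-style output head. The sketch below assumes the network emits one logit per (cause, time bin) pair and that a single softmax over all pairs gives a joint probability mass, so each cumulative incidence curve is a running sum along the time axis. The function name `cif_from_logits` and the exact output layout are assumptions, not a description of CAPA's code.

```python
import math


def cif_from_logits(logits):
    """Turn joint (cause, time-bin) logits into cumulative incidence curves.

    logits[k][t] is the unnormalised score for cause k occurring in time bin t.
    One softmax is taken over every (k, t) pair, and the result for each cause
    is accumulated over time: cif[k][t] = P(T <= t, cause = k).
    """
    flat = [x for row in logits for x in row]
    peak = max(flat)  # subtract the max for numerical stability
    normaliser = sum(math.exp(x - peak) for x in flat)
    curves = []
    for row in logits:
        running, curve = 0.0, []
        for x in row:
            running += math.exp(x - peak) / normaliser
            curve.append(running)
        curves.append(curve)
    return curves
```

Because the probability mass is shared across causes, the final values of the three curves sum to 1 in this formulation, and each curve is non-decreasing by construction.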

Time-dependent C-index on UCI BMT (n = 29 test patients). GvHD not evaluable (2 events). Fine–Gray is the best baseline.

Model                      GvHD    Relapse    TRM
Cox-PH (cause-specific)    n/a     0.75       0.65
Fine–Gray (best)           n/a     0.84       0.66
DeepHit (tabular HLA)      n/a     0.67       0.41

About the project

A new lens on HLA compatibility.

Outcomes after haematopoietic stem cell transplantation (HSCT) depend critically on HLA compatibility. The standard approach reduces this to a count of per-locus matches and mismatches, discarding most of the immunological information.

CAPA was built to change that. By encoding every allele with ESM-2, a protein language model trained on 250 M sequences, and learning donor–recipient interaction through cross-attention, we obtain embeddings that reflect structural and functional similarity rather than mere categorical identity.

This is an open-source proof-of-concept, evaluated on 187 paediatric HSCT patients. We acknowledge the small-cohort limitation and encourage replication on larger datasets.

ESM-2 Embeddings

1 280-dim per allele, frozen 650M model.

Cross-Attention

Interpretable donor × recipient interaction.

DeepHit Survival

Joint competing-risks CIF output.

Open Source

MIT licensed, fully reproducible.

Try it on your own data

Enter donor and recipient HLA strings and receive competing-risk curves, attention heatmaps, and SHAP feature attribution in seconds.