Open Source · MIT · Proof-of-Concept

Predicting alloimmunity with protein language models

CAPA replaces coarse HLA match/mismatch scores with continuous ESM-2 embeddings and learns directional donor–recipient alloreactivity from signed difference embeddings, which carry information that a symmetric mismatch distance provably cannot. It then predicts GvHD, relapse, and TRM as competing risks via DeepHit.

sh4wn27/capa

[Pipeline diagram: Input · 5 loci, Donor × Recipient (A*02:01, B*07:02, C*07:02, DRB1*15:01, DQB1*06:02) → ESM-2 · 650M, frozen 1 280-dim embedding per allele → Signed-diff attention → DeepHit · CIF for GvHD (aspirational), Relapse, and TRM]

§ 01

How it works

From allele strings to risk curves

Three stages transform raw HLA typing into calibrated, interpretable competing-risk predictions.

allele → sequence

HLA Input

Donor and recipient alleles at five loci (A, B, C, DRB1, DQB1) are looked up in the IPD-IMGT/HLA database to retrieve their full protein sequences.
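The lookup step needs allele names in a consistent form before any sequence can be retrieved. A minimal sketch of that normalisation follows; the regex, the `Allele` type, and the `parse_allele` and `lookup_key` helpers are hypothetical names introduced here for illustration, not CAPA's actual code, and the choice of two-field resolution for the key is an assumption.

```python
import re
from typing import NamedTuple, Tuple

# The five loci named in the text. Everything else in this block is an
# illustrative assumption, not taken from the CAPA code base.
LOCI = ("A", "B", "C", "DRB1", "DQB1")

# Accepts "A*02:01", "HLA-A*02:01", and up to four colon-separated fields.
_ALLELE_RE = re.compile(r"^(?:HLA-)?([A-Z0-9]+)\*(\d{2,3}(?::\d{2,3}){1,3})$")


class Allele(NamedTuple):
    locus: str
    fields: Tuple[str, ...]


def parse_allele(name: str) -> Allele:
    """Split an allele string into its locus and numeric fields."""
    match = _ALLELE_RE.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognisable HLA allele name: {name!r}")
    locus = match.group(1)
    if locus not in LOCI:
        raise ValueError(f"locus {locus!r} is not one of {LOCI}")
    return Allele(locus, tuple(match.group(2).split(":")))


def lookup_key(allele: Allele) -> str:
    """Two-field key (assumed resolution) for a sequence-table lookup."""
    return f"{allele.locus}*{':'.join(allele.fields[:2])}"


if __name__ == "__main__":
    for raw in ("HLA-A*02:01", "B*07:02", "DRB1*15:01"):
        print(raw, "->", lookup_key(parse_allele(raw)))
```

The resulting key would then index a local copy of the IPD-IMGT/HLA protein sequences; how that table is stored is left open here.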

sequence → 1 280-dim

ESM-2 Embedding

Each amino-acid sequence is encoded by frozen ESM-2 (650M parameters) into a 1 280-dim vector. Immunologically similar alleles cluster together.

signed difference → CIF

Risk Prediction

Attention over the signed donor–recipient difference embeddings captures the direction of mismatch. DeepHit jointly outputs cumulative incidence curves for GvHD, relapse, and TRM.
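To see why direction matters, consider what happens when donor and recipient are swapped. The toy sketch below uses three-dimensional vectors in place of the 1 280-dim embeddings, and the vectors and the `direction` axis are made up for illustration. A symmetric distance gives the same value both ways, while the signed difference flips sign, so any outcome that depends on the sign of a projection is invisible to the distance.

```python
import math
from typing import List, Sequence


def signed_difference(donor: Sequence[float], recipient: Sequence[float]) -> List[float]:
    """Element-wise donor minus recipient; swapping the roles negates it."""
    return [d - r for d, r in zip(donor, recipient)]


def euclidean_distance(donor: Sequence[float], recipient: Sequence[float]) -> float:
    """Symmetric scalar distance; swapping the roles leaves it unchanged."""
    return math.sqrt(sum((d - r) ** 2 for d, r in zip(donor, recipient)))


def projection(vec: Sequence[float], direction: Sequence[float]) -> float:
    """Dot product of a difference vector with a fixed direction."""
    return sum(v * w for v, w in zip(vec, direction))


# Toy stand-ins for allele embeddings (made-up numbers).
donor = [0.5, -1.0, 2.0]
recipient = [1.5, 0.0, 2.0]
direction = [1.0, 0.0, 0.0]  # hypothetical axis along which risk varies

forward = signed_difference(donor, recipient)
reverse = signed_difference(recipient, donor)

if __name__ == "__main__":
    print("distance D->R:", euclidean_distance(donor, recipient))
    print("distance R->D:", euclidean_distance(recipient, donor))
    print("projection D->R:", projection(forward, direction))
    print("projection R->D:", projection(reverse, direction))
```

Any model fed only the distance receives identical input for the two transplant directions, which is the limitation the signed-difference input is meant to remove.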

[Pipeline figure: Input (HLA-A*02:01, HLA-B*07:02, HLA-DRB1*15:01) → ESM-2 · 650M (1 280-dim embedding per allele) → Signed-Diff Attention (eᴰ − eᴿ, directional, 128-dim) → DeepHit output (GvHD · CIF, Relapse · CIF, TRM · CIF)]

§ 02

Key results

What the learned representation adds

0.87

C-index, GvHD — CAPA (signed diff)

Directional simulation · vs 0.58 scalar distance · 6 seeds

+0.29

ΔC over direction-blind Cox

93% of oracle gap · non-overlapping CIs every seed

0.84

C-index, relapse — Fine–Gray

Best UCI BMT baseline · single split · CI wide

187

Patients (UCI BMT)

Baselines only · no per-allele typing

CAPA's genuine advantage is directional: a symmetric scalar mismatch distance is blind to the direction of donor–recipient mismatch, while CAPA — learning from signed difference embeddings — recovers it. On UCI BMT we report tabular baselines as reference points; CAPA's ESM-2 step cannot run end-to-end there because the cohort records aggregate mismatch counts, not allele-level typing.

FIG. 02 · Cumulative incidence (illustrative): GvHD, Relapse, TRM

Illustrative competing-risk CIFs for a representative donor–recipient pair. Run the prediction tool for case-specific curves.
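For readers unfamiliar with how such curves arise, DeepHit-style models emit a joint probability mass over (cause, time bin), and each cause's cumulative incidence is the running sum of its row. The sketch below shows that step only; the probabilities are made-up toy values, not CAPA predictions, and `cumulative_incidence` is a hypothetical helper name.

```python
from itertools import accumulate
from typing import Dict, List


def cumulative_incidence(pmf: Dict[str, List[float]]) -> Dict[str, List[float]]:
    """Turn per-cause event probabilities per time bin into CIF curves.

    pmf[cause][t] is P(first event is `cause` and occurs in bin t).
    The mass left over after summing every cause and bin is the
    probability of remaining event-free through the last bin.
    """
    total = sum(sum(row) for row in pmf.values())
    if total > 1.0 + 1e-9:
        raise ValueError("joint probabilities must sum to at most 1")
    return {cause: list(accumulate(row)) for cause, row in pmf.items()}


# Toy joint distribution over three causes and three time bins (made up).
pmf = {
    "GvHD": [0.10, 0.05, 0.05],
    "Relapse": [0.05, 0.10, 0.05],
    "TRM": [0.02, 0.03, 0.05],
}

cif = cumulative_incidence(pmf)
event_free = 1.0 - sum(curve[-1] for curve in cif.values())

if __name__ == "__main__":
    for cause, curve in cif.items():
        print(cause, [round(v, 2) for v in curve])
    print("event-free at last bin:", round(event_free, 2))
```

Because the causes share one probability budget, the curves cannot all rise towards 1: a higher predicted GvHD incidence necessarily leaves less mass for relapse and TRM.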

Time-dependent C-index — UCI BMT · n = 29 test patients · single split. Tabular baselines only: CAPA's ESM-2 step requires registry-scale data with per-allele HLA typing, which UCI BMT lacks. All CIs span most of [0, 1]; treat as exploratory. GvHD: 2 events in test fold — not evaluable.
Model	GvHD	Relapse	TRM
Cox-PH (cause-specific)	—	0.75	0.65
Fine–Gray (best baseline)	—	0.84	0.66
Random Survival Forest	—	0.48	0.65
DeepHit (tabular HLA)	—	0.65	0.57
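As a reading aid for the numbers above, here is a minimal sketch of a concordance index for one cause. It is the plain Harrell form, with other outcomes treated as censoring, which is simpler than the time-dependent variant reported in the table; the function name and the toy data are assumptions made for illustration.

```python
from typing import Sequence


def concordance_index(
    times: Sequence[float],
    events: Sequence[bool],
    risks: Sequence[float],
) -> float:
    """Harrell's C: share of comparable pairs ranked in the right order.

    A pair (i, j) is comparable when subject i had the event and did so
    strictly before subject j's observed time. The pair is concordant
    when i also received the higher risk score; ties count as half.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs; the index is undefined")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data: four subjects, the third one censored.
    times = [2.0, 4.0, 6.0, 8.0]
    events = [True, True, False, True]
    print(concordance_index(times, events, [0.9, 0.7, 0.5, 0.1]))
    print(concordance_index(times, events, [0.5, 0.5, 0.5, 0.5]))
```

Only subjects with an observed event can anchor a comparable pair, so a test fold with two GvHD events yields very few pairs. That is consistent with GvHD being marked not evaluable and with the wide intervals noted in the caption.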

Directional GvHD simulation (N = 10,000; mean ± SD over 6 seeds). GvHD hazard is driven by the *signed* projection of the donor–recipient difference, which a symmetric scalar distance cannot represent; TRM is a scalar-distance control. The oracle is given the true directional features.
Model	GvHD	Relapse	TRM
Cox (binary mismatch)	0.53	0.77	0.59
Cox (scalar distances)	0.58	0.78	0.67
Cox (oracle direction)	0.89	0.78	0.67
CAPA (signed diff)	0.87	0.77	0.53

CAPA recovers 93% of the distance-to-oracle gap on GvHD, with non-overlapping confidence intervals on every seed. On the TRM control — where the signal is genuine scalar magnitude — the distance-based Cox model is near-optimal and CAPA is weaker: the advantage is specific to directional structure, not a generic capacity effect. The simulation establishes a capability (the architecture can exploit direction when present), not a clinical result.
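The gap-recovery figure can be checked from the table with one line of arithmetic: the share of the distance-to-oracle gap closed is (model − baseline) / (oracle − baseline). The sketch below uses the rounded GvHD and TRM values from the simulation table, so it matches the reported 93% only up to rounding; `gap_recovered` is a hypothetical helper name.

```python
def gap_recovered(model: float, baseline: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap closed by the model."""
    if oracle == baseline:
        raise ValueError("oracle and baseline coincide; the gap is zero")
    return (model - baseline) / (oracle - baseline)


# Rounded C-index values from the directional simulation table (GvHD column).
CAPA_GVHD = 0.87
SCALAR_COX_GVHD = 0.58
ORACLE_GVHD = 0.89

gvhd_delta = CAPA_GVHD - SCALAR_COX_GVHD
gvhd_share = gap_recovered(CAPA_GVHD, SCALAR_COX_GVHD, ORACLE_GVHD)

# TRM control column: CAPA trails the scalar-distance Cox model.
trm_delta = 0.53 - 0.67

if __name__ == "__main__":
    print(f"GvHD delta C: {gvhd_delta:+.2f}")
    print(f"GvHD share of oracle gap: {gvhd_share:.3f}")
    print(f"TRM delta C: {trm_delta:+.2f}")
```

On TRM the oracle and the scalar-distance model score the same in the table, so no gap exists to recover there and only the raw difference is meaningful.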

§ 03

Why not just count mismatches?

How CAPA differs from existing tools

Clinical standard practice and current immunogenetics tools encode HLA compatibility as a count. CAPA treats it as a continuous, learned relationship in protein embedding space.

Methodological comparison, not a performance benchmark. HLAMatchmaker and PIRCHE-II are validated at scale; CAPA's directional advantage is so far shown only in simulation, with tabular baselines reported on n = 187.
Approach	Mismatch representation	Trained on outcomes?	Competing-risk output	Unseen alleles
Allele match counting (clinical standard; 8/8, 10/10)	Binary per locus	No (rule-based)	No	No
HLAMatchmaker / PIRCHE-II (eplet and peptidome tools)	Eplet or peptide counts	No (curated biology rules)	No	Partial
CAPA (this work; proof-of-concept; n = 187)	Continuous 1 280-dim protein embedding	Yes (end-to-end from outcomes)	Yes (GvHD · Relapse · TRM)	Yes (any allele with a sequence)

About the project

A new lens on HLA compatibility.

Haematopoietic stem cell transplantation outcome depends critically on HLA compatibility. The standard approach encodes this as a binary match / mismatch count — discarding most of the immunological information.

CAPA was built to change that. By encoding every allele with ESM-2, a protein language model trained on 250 M sequences, we obtain representations that reflect structural and functional similarity rather than mere categorical identity. By learning from the signed donor–recipient difference, CAPA also captures the direction of mismatch that a symmetric distance discards.

This is an open-source proof-of-concept. Tabular baselines are reported on 187 paediatric HSCT patients (UCI BMT); the directional advantage is established in controlled simulation. We acknowledge that the cohort is small and that real registry validation with per-allele typing is still pending, and we encourage replication.

ESM-2 Embeddings

1 280-dim per allele, frozen 650M model.

Directional

Signed-difference attention captures mismatch direction.

DeepHit Survival

Joint competing-risks CIF output.

Open Source

MIT licensed, fully reproducible.