The problem with match counting
Haematopoietic stem cell transplantation (HSCT) can cure otherwise fatal blood cancers, but its success hinges on human leukocyte antigen (HLA) compatibility between donor and recipient. The immune consequences of a mismatch — graft-versus-host disease, relapse, transplant-related mortality — are what determine whether a patient survives.
Yet clinical practice still reduces this rich biology to a count. A donor is described as a 9/10 or 10/10 match: the number of matching alleles out of ten, two at each of five loci. This treats every mismatch as equivalent and binary, when in reality two alleles can differ by a single, immunologically silent amino acid, or by a substitution that radically reshapes the peptide-binding groove.
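The counting rule itself fits in a few lines, which is part of the problem. The sketch below is an illustration rather than code from CAPA: the function name `match_count` and the typings are hypothetical, chosen only to show that two different single-allele mismatches collapse to the same score.

```python
# Illustrative sketch of allele-level match counting (not CAPA code).
LOCI = ("A", "B", "C", "DRB1", "DQB1")

def match_count(donor, recipient):
    """Count donor alleles that also appear in the recipient, locus by locus.

    Each locus contributes at most 2, so the maximum over five loci is 10.
    """
    total = 0
    for locus in LOCI:
        remaining = list(recipient[locus])
        for allele in donor[locus]:
            if allele in remaining:
                remaining.remove(allele)  # a recipient allele can be matched only once
                total += 1
    return total

# Hypothetical typings, for illustration only.
recipient = {
    "A": ("A*01:01", "A*02:01"),
    "B": ("B*07:02", "B*08:01"),
    "C": ("C*07:01", "C*07:02"),
    "DRB1": ("DRB1*03:01", "DRB1*15:01"),
    "DQB1": ("DQB1*02:01", "DQB1*06:02"),
}
donor_a = dict(recipient, B=("B*07:05", "B*08:01"))  # one B allele differs
donor_b = dict(recipient, B=("B*44:02", "B*08:01"))  # a different B allele differs

print(match_count(donor_a, recipient))  # 9
print(match_count(donor_b, recipient))  # 9
```

Both donors score 9/10, and the count carries no information about which of the two mismatches is the more consequential one.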
The core idea
HLA molecules are proteins. Protein language models — trained on hundreds of millions of sequences — learn representations that capture structural and functional similarity. CAPA's premise is simple: encode each allele with a protein language model, and let the geometry of that embedding space stand in for compatibility.
Two alleles that fold and present peptides similarly land near each other in embedding space, even if their names differ. Two that diverge functionally are pushed apart, even if they share a serological group. This is a continuous, learned notion of mismatch — the opposite of a binary count.
Continuous embeddings let a model reason about the degree and direction of mismatch, and generalise to allele pairs it has never seen in training, something a lookup table of match scores cannot do.
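To make "degree and direction" concrete, here is a minimal sketch using toy four-dimensional vectors in place of real allele embeddings. Cosine distance is an assumption made for illustration; the text above does not name a specific metric, and in CAPA the comparison is learned by a downstream network rather than fixed.

```python
import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity: 0 for identical directions, up to 2 for opposite ones."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for allele embeddings (made up, not model output).
reference = np.array([1.0, 0.2, 0.0, 0.5])
near = np.array([0.9, 0.25, 0.05, 0.5])   # small perturbation of the reference
far = np.array([-0.3, 1.0, 0.8, 0.0])     # points in a very different direction

degree_near = cosine_distance(reference, near)
degree_far = cosine_distance(reference, far)
direction_far = far - reference           # a vector, not just a yes/no flag

print(degree_near < degree_far)  # True
```

A binary count would label both `near` and `far` simply as "mismatched"; the distance and the difference vector keep them apart.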
How it works
The pipeline has three stages, each deliberately kept interpretable.
1 · Sequence retrieval
Donor and recipient alleles at five loci — A, B, C, DRB1, DQB1 — are resolved to full protein sequences via the IPD-IMGT/HLA database.
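As a rough sketch of this stage, the snippet below parses a FASTA export and resolves a two-field allele name to a sequence. Several things here are assumptions: the header layout (`>ACCESSION ALLELE_NAME LENGTH bp`) is my reading of the database's protein FASTA file and should be checked against the actual download, the accessions and sequences are made-up placeholders, and `parse_fasta` and `resolve` are hypothetical helper names.

```python
# Toy FASTA text; accessions and sequences are placeholders, not real HLA data.
TOY_FASTA = """\
>HLA:TOY0001 A*01:01:01:01 10 bp
ACDEFGHIKL
>HLA:TOY0002 A*02:01:01:01 10 bp
ACDEFGHIKM
"""

def parse_fasta(text):
    """Map allele name -> protein sequence, assuming the allele name is the
    second whitespace-separated token of each header line."""
    sequences = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[1]
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line  # sequences may span several lines
    return sequences

def resolve(allele, sequences):
    """Return the first sequence whose name equals the allele or extends it
    with further colon-separated fields (e.g. A*02:01 -> A*02:01:01:01)."""
    for name, seq in sequences.items():
        if name == allele or name.startswith(allele + ":"):
            return seq
    raise KeyError(allele)

sequences = parse_fasta(TOY_FASTA)
print(resolve("A*02:01", sequences))  # ACDEFGHIKM
```

Taking the first matching entry is a simplification; a real pipeline would need an explicit rule for choosing among higher-resolution alleles that share the same leading fields.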
2 · ESM-2 embedding
Each sequence is passed through a frozen ESM-2 model (650M parameters) to produce a 1280-dimensional vector. Freezing the language model keeps the representation general and the trainable model small.
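A protein language model emits one vector per residue, so some pooling step is needed to get a single vector per allele. The text does not say which one is used; the sketch below assumes masked mean pooling, a common choice, and uses random numbers in place of real model output so that it runs without downloading weights. In a real run the per-residue array would come from the model's final hidden layer, for example the `facebook/esm2_t33_650M_UR50D` checkpoint on Hugging Face.

```python
import numpy as np

EMBED_DIM = 1280  # hidden size of the 650M-parameter ESM-2 model

def mean_pool(token_embeddings, mask):
    """Average per-residue vectors, ignoring padded positions.

    token_embeddings: (batch, length, dim) array
    mask:             (batch, length) array, 1 for real residues, 0 for padding
    """
    mask = np.asarray(mask, dtype=float)[..., None]   # (batch, length, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1.0, None)     # avoid division by zero
    return summed / counts

# Random stand-in for model output: 2 sequences, padded to length 6.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 6, EMBED_DIM))
mask = np.array([[1, 1, 1, 1, 1, 1],
                 [1, 1, 1, 1, 0, 0]])  # second sequence has 4 residues

pooled = mean_pool(tokens, mask)
print(pooled.shape)  # (2, 1280)
```

Because the language model is frozen, these vectors can be computed once per allele and cached, which is one practical benefit of not fine-tuning it.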
3 · Cross-attention & DeepHit
A cross-attention network models the interaction between donor and recipient embeddings, and a DeepHit head jointly predicts the cumulative incidence of three competing risks. Competing-risks modelling matters because the events preclude one another: once a patient has relapsed, a later death is no longer counted as transplant-related mortality.
- GvHD: graft-versus-host disease, in which donor immune cells attack recipient tissue.
- Relapse: return of the underlying malignancy after transplant.
- TRM: transplant-related mortality from non-relapse causes.
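The two pieces of this stage can be sketched in NumPy. This is an illustration under stated assumptions, not CAPA's architecture: it uses a single attention head, random untrained weights, and a toy embedding width of 16 instead of 1280. The output layer follows the general DeepHit formulation, in which one softmax spans every (cause, time-bin) cell and the cumulative incidence for a cause is the running sum over its time bins.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(donor, recipient, w_q, w_k, w_v):
    """Single-head attention in which each donor locus queries all recipient loci."""
    q = donor @ w_q                                # (n_loci, d_head)
    k = recipient @ w_k                            # (n_loci, d_head)
    v = recipient @ w_v                            # (n_loci, d_head)
    scores = q @ k.T / np.sqrt(q.shape[-1])        # scaled dot-product scores
    weights = softmax(scores, axis=-1)             # each row sums to 1
    return weights @ v, weights

def cumulative_incidence(logits):
    """Turn (n_causes, n_bins) logits into cumulative incidence curves."""
    n_causes, n_bins = logits.shape
    pmf = softmax(logits.reshape(-1)).reshape(n_causes, n_bins)  # joint over cause and time
    return np.cumsum(pmf, axis=1)

rng = np.random.default_rng(0)
n_loci, d_model, d_head = 5, 16, 8                 # toy sizes, not the real ones
donor = rng.normal(size=(n_loci, d_model))
recipient = rng.normal(size=(n_loci, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

attended, weights = cross_attention(donor, recipient, w_q, w_k, w_v)
cif = cumulative_incidence(rng.normal(size=(3, 12)))  # 3 risks, 12 time bins

print(attended.shape, cif.shape)  # (5, 8) (3, 12)
```

Because all cells share one softmax, the three curves are non-decreasing and their final values sum to one, which is how the head encodes that the events compete for the same patient.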
Data & evaluation
CAPA was developed and evaluated on the public UCI Bone Marrow Transplant (children) dataset — 187 paediatric HSCT patients. Performance was measured with the time-dependent concordance index and Brier score against Cox proportional-hazards and Fine–Gray baselines.
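For readers unfamiliar with the two metrics, here is a simplified sketch on four made-up patients. It is not the project's evaluation code. The concordance follows the cause-specific, time-dependent definition commonly used with DeepHit: a pair is comparable when one patient has the event of interest before the other's observed time, and concordant when that patient had the higher predicted incidence at that moment. The Brier score shown here simply drops patients censored before the horizon instead of applying inverse-probability-of-censoring weights, which a careful evaluation would need.

```python
import numpy as np

def td_concordance(cif, time_grid, times, events, cause):
    """cif: (n_patients, n_bins) predicted cumulative incidence for `cause`.
    events: 0 = censored, otherwise the index of the observed cause."""
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if events[i] != cause:
            continue
        # last grid point at or before patient i's event time
        t_idx = max(np.searchsorted(time_grid, times[i], side="right") - 1, 0)
        for j in range(len(times)):
            if times[j] > times[i]:                 # j was still event-free at times[i]
                comparable += 1
                if cif[i, t_idx] > cif[j, t_idx]:
                    concordant += 1.0
                elif cif[i, t_idx] == cif[j, t_idx]:
                    concordant += 0.5               # ties count as half
    return concordant / comparable if comparable else float("nan")

def brier_score(cif_at_t, times, events, cause, horizon):
    """Squared error at one horizon, without censoring weights (a simplification)."""
    times, events = np.asarray(times), np.asarray(events)
    known = (times > horizon) | (events != 0)       # status at the horizon is known
    outcome = ((times <= horizon) & (events == cause)).astype(float)
    return float(np.mean((cif_at_t[known] - outcome[known]) ** 2))

# Made-up predictions for 4 patients on a 4-point time grid.
time_grid = np.array([1, 2, 3, 4])
times = np.array([1, 2, 3, 4])
events = np.array([1, 0, 1, 0])                     # 1 = event of interest, 0 = censored
cif = np.array([[0.60, 0.70, 0.80, 0.90],
                [0.10, 0.20, 0.30, 0.40],
                [0.20, 0.40, 0.60, 0.70],
                [0.05, 0.10, 0.15, 0.20]])

c_index = td_concordance(cif, time_grid, times, events, cause=1)
bs = brier_score(cif[:, 1], times, events, cause=1, horizon=2)
print(c_index)  # 1.0
```

The toy predictions rank every comparable pair correctly, hence a concordance of 1.0; the Brier score additionally penalises how far the predicted probabilities sit from the observed outcomes, so it measures calibration as well as ranking.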
On the held-out test set, the structure-aware model matched or exceeded classical baselines for relapse and TRM concordance, while producing calibrated, case-specific incidence curves rather than a single risk class.
All code, trained weights, and evaluation notebooks are open source under the MIT license. The dataset is publicly available from the UCI Machine Learning Repository.
Limitations & honest caveats
This is a proof of concept, and the constraints are real:
- Small cohort. 187 patients is far too few to make clinical claims; confidence intervals are wide and some events are too rare to evaluate.
- Single dataset. Results have not yet been replicated on an independent, larger transplant registry.
- Not for clinical use. CAPA is a research artefact intended to demonstrate a representational idea, not a decision-support tool.
The right next step is external validation on a large multi-centre cohort. Contributions and replications are welcome via the repository.