About — CAPA

The problem with match counting

Haematopoietic stem cell transplantation (HSCT) can cure otherwise fatal blood cancers, but its success hinges on human leukocyte antigen (HLA) compatibility between donor and recipient. The immune consequences of a mismatch — graft-versus-host disease, relapse, transplant-related mortality — are what determine whether a patient survives.

Yet clinical practice still reduces this rich biology to a count. A donor is described as a 9/10 or 10/10 match: the number of matching alleles across a handful of loci. This treats every mismatch as equivalent and binary, when in reality two alleles can differ by a single, immunologically silent amino acid — or by a substitution that radically reshapes the peptide-binding groove.

A match score throws away almost everything the protein is telling us.

The core idea

HLA molecules are proteins. Protein language models — trained on hundreds of millions of sequences — learn representations that capture structural and functional similarity. CAPA's premise is simple: encode each allele with a protein language model, and let the geometry of that embedding space stand in for compatibility.

Two alleles that fold and present peptides similarly land near each other in embedding space, even if their names differ. Two that diverge functionally are pushed apart, even if they share a serological group. This is a continuous, learned notion of mismatch — the opposite of a binary count.

Why it matters

Continuous embeddings let a model reason about the degree and direction of a mismatch, and generalise to allele pairs it has never seen in training, something a lookup table of match scores cannot do.

What already exists — and where it stops

Two tools dominate current practice beyond simple allele counting. HLAMatchmaker counts mismatched eplets: short amino-acid motifs on the HLA surface that trigger antibody responses. PIRCHE-II enumerates peptides derived from mismatched HLA residues that are predicted to be presented to T cells. Both improve on allele counting by capturing finer-grained differences. But they share three critical limits: the rules are curated by hand rather than learned from outcomes; they produce a mismatch count, not a survival prediction; and they can neither output competing-risk trajectories nor generalise to rare alleles absent from their reference tables.

How CAPA differs

CAPA learns a continuous embedding directly from protein sequence, not from curated rule tables, and learns the direction of donor–recipient mismatch from the signed difference embedding. That is information a symmetric distance or a mismatch count cannot represent, because both give the same value when donor and recipient are swapped. Because any allele with a sequence in IPD-IMGT/HLA can be embedded, the approach generalises to rare alleles that eplet reference sets do not cover. The trade-off: the directional advantage is so far established in controlled simulation and against tabular baselines (n = 187), while HLAMatchmaker and PIRCHE-II are supported by studies of thousands of patients. Real registry validation with per-allele typing is the necessary next step.
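A toy illustration of the direction argument, using invented 3-dimensional vectors in place of real allele embeddings. Swapping donor and recipient flips the sign of the difference vector but leaves the Euclidean distance unchanged, so a model fed only the distance sees the two transplants as identical. The function names and the donor-minus-recipient sign convention are assumptions made for this sketch.

```python
import math


def signed_difference(donor, recipient):
    """Element-wise donor minus recipient (sign convention assumed here)."""
    return [d - r for d, r in zip(donor, recipient)]


def symmetric_distance(donor, recipient):
    """Euclidean distance: the same whichever side is called the donor."""
    return math.sqrt(sum((d - r) ** 2 for d, r in zip(donor, recipient)))


# Toy embeddings, invented for illustration only.
allele_x = [0.5, -1.0, 2.0]
allele_y = [1.5, -1.0, 0.0]

forward = signed_difference(allele_x, allele_y)  # X is the donor
reverse = signed_difference(allele_y, allele_x)  # Y is the donor

print("forward:", forward)
print("reverse:", reverse)
print("distance X->Y:", symmetric_distance(allele_x, allele_y))
print("distance Y->X:", symmetric_distance(allele_y, allele_x))
```

The two difference vectors are exact negatives of each other, while the two distances are equal.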

How it works

The pipeline has three stages, each deliberately kept interpretable.

1 · Sequence retrieval

Donor and recipient alleles at five loci — A, B, C, DRB1, DQB1 — are resolved to full protein sequences via the IPD-IMGT/HLA database.
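A minimal sketch of the lookup step, assuming the protein sequences have been downloaded as a FASTA file whose header lines carry the allele name as the second whitespace-separated token (for example `>HLA:HLA00001 A*01:01:01:01 365 bp`). The helper names, the fallback from a two-field name to the first longer match, and the placeholder records below are illustrative assumptions, not CAPA's actual code or real HLA sequences.

```python
def parse_fasta(text):
    """Map allele name -> protein sequence from FASTA text.

    Assumes the allele name is the second token of each header line.
    """
    records, name, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records[name] = "".join(chunks)
            name, chunks = line.split()[1], []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


def resolve(allele, records):
    """Return the sequence for an exact name, else for the first record
    whose colon-separated fields start with the requested fields."""
    if allele in records:
        return records[allele]
    for name, seq in records.items():
        if name.startswith(allele + ":"):
            return seq
    raise KeyError(f"no sequence found for {allele}")


# Placeholder records: fake accessions and invented sequences.
TOY_FASTA = """\
>HLA:TOY00001 A*01:01:01:01 8 bp
MKTAYIAK
>HLA:TOY00002 A*02:01:01:01 8 bp
MKTAYLAK
"""

records = parse_fasta(TOY_FASTA)
print(resolve("A*02:01", records))
```

Appending `:` before the prefix test keeps `A*02:01` from accidentally matching a name such as `A*02:010`.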

2 · ESM-2 embedding

Each sequence is passed through frozen ESM-2 (650M parameters) to produce a 1,280-dimensional vector. Freezing the language model keeps the representation general and the trainable model small.
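The embedding step could look roughly like the sketch below, which uses the Hugging Face `transformers` port of the 650M-parameter ESM-2 checkpoint. The text does not say how per-residue vectors are reduced to one vector per allele; mean pooling over all unmasked tokens (which includes the tokenizer's special start and end tokens) is an assumption here, as are the function names.

```python
def mean_pool(token_vectors, mask):
    """Average the per-token vectors whose mask entry is 1."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def embed_alleles(sequences, model_name="facebook/esm2_t33_650M_UR50D"):
    """Embed protein sequences with frozen ESM-2.

    Requires `torch` and `transformers`; the checkpoint is downloaded
    on first use. Returns one pooled vector (a list of floats) per sequence.
    """
    import torch
    from transformers import AutoTokenizer, EsmModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = EsmModel.from_pretrained(model_name).eval()
    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    with torch.no_grad():  # frozen: no gradients reach the language model
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden)
    return [
        mean_pool(h.tolist(), m.tolist())
        for h, m in zip(hidden, batch["attention_mask"])
    ]


# Pooling demo on invented numbers; the third token is padding.
print(mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0]))
```

Running the model under `torch.no_grad()` and in eval mode is one simple way to keep it frozen; only the downstream model would then be trained.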

3 · Directional attention & DeepHit

CAPA attends over the signed donor–recipient difference embeddings, capturing the direction of mismatch (which antigens the recipient carries that the donor lacks), and a DeepHit head jointly predicts the cumulative incidence of three competing risks. Competing-risks modelling matters because the events preclude one another: TRM is death from non-relapse causes, so a patient who relapses can no longer be counted as a TRM event. (A higher-capacity bidirectional cross-attention variant exists but is under-determined at these cohort sizes; the compact signed-difference model is the one evaluated.)

GvHD: Graft-versus-host disease — donor immune cells attack recipient tissue. Note: GvHD prediction is aspirational in the current version. The UCI BMT cohort has too few GvHD events to evaluate; a larger dataset is needed.
Relapse: Return of the underlying malignancy after transplant.
TRM: Transplant-related mortality from non-relapse causes.
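To make the output of a DeepHit-style head concrete, the sketch below converts a grid of logits (one row per risk, one column per discrete time bin) into cumulative incidence curves. It assumes a single softmax over every (risk, time) cell followed by a running sum along time; CAPA's actual head may differ, and the logits and the number of time bins are invented.

```python
import math
from itertools import accumulate


def cumulative_incidence(logits):
    """Turn a risks x time-bins grid of logits into one CIF per risk.

    One softmax spans every (risk, time) cell, so probability mass is
    shared between risks; each CIF is the running sum along time.
    """
    flat = [x for row in logits for x in row]
    peak = max(flat)  # subtract the max for numerical stability
    total = sum(math.exp(x - peak) for x in flat)
    pmf = [[math.exp(x - peak) / total for x in row] for row in logits]
    return [list(accumulate(row)) for row in pmf]


RISKS = ["GvHD", "relapse", "TRM"]
# Invented logits: 3 risks x 4 time bins, relapse favoured early.
toy_logits = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.0],
]

cif = cumulative_incidence(toy_logits)
for name, curve in zip(RISKS, cif):
    print(name, [round(p, 3) for p in curve])
```

Because the softmax is shared, raising one risk's probability necessarily lowers the others, which is how the competing nature of the events enters the prediction. In this simplified form the final values of the three curves sum to 1; an implementation that reserves a slot for surviving past the last time bin would relax that.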

Data & evaluation

We benchmark tabular competing-risks baselines on the public UCI Bone Marrow Transplant (children) dataset of 187 paediatric HSCT patients, using the time-dependent concordance index and Brier score. The cohort records only aggregate mismatch counts, not per-allele HLA typing, so CAPA's ESM-2 step cannot be driven by real alleles here. We substitute frequency-imputed allele assignments, matched to each patient's recorded mismatch count, purely to validate that the end-to-end pipeline runs correctly. By construction, these imputed alleles carry no outcome-specific biological signal.
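For readers unfamiliar with the metric, here is a plain Harrell-style concordance index for a single event type. It is a simpler relative of the time-dependent index named above, not the same estimator: it scores one fixed risk value per patient and treats competing events as censoring. The cohort and risk scores are invented.

```python
def concordance_index(times, events, risks, cause):
    """Harrell-style C-index for one cause.

    A pair (i, j) is comparable when patient i had `cause` at a time
    strictly before patient j's recorded time. The pair is concordant
    when patient i was also given the higher risk; ties count as half.
    """
    concordant, comparable = 0.0, 0
    for i in range(len(times)):
        if events[i] != cause:
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs for this cause")
    return concordant / comparable


# Invented toy cohort: 0 = censored, 1 = relapse, 2 = TRM.
times = [5, 8, 12, 20, 30]
events = [1, 2, 1, 0, 0]
relapse_risk = [0.9, 0.2, 0.4, 0.5, 0.1]

c_relapse = concordance_index(times, events, relapse_risk, cause=1)
print(round(c_relapse, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking. With only a handful of events, a few pairs can move the score substantially.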

On a single held-out test split (n = 29), the best baseline (Fine–Gray) reaches a C-index of 0.84 for relapse and 0.66 for TRM (Cox 0.75 / 0.65; Random Survival Forest 0.48 / 0.65; DeepHit 0.65 / 0.57). These single-split numbers are optimistic; repeated 5×5 cross-validation corrects them to Cox relapse 0.60 ± 0.14 and TRM 0.56 ± 0.06. GvHD was not evaluable: only 2 events fell in the test fold.

CAPA's own advantage is demonstrated in a controlled directional simulation (N = 10,000, 6 seeds): where GvHD hazard is driven by the direction of mismatch, a Cox model on symmetric scalar distances collapses to near-chance (C = 0.58), while CAPA — learning from signed difference embeddings — reaches C = 0.87, recovering 93% of the gap to a direction-aware oracle, with non-overlapping confidence intervals on every seed. On a scalar-distance TRM control CAPA is deliberately weaker, confirming the advantage is specific to directional structure.
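One plausible reading of the "93% of the gap" figure, written out. The definition below and the subscript names are assumptions; the oracle's C-index is not quoted in the text, so the last value is only a back-calculation from the rounded numbers above.

```latex
\text{gap recovered}
  = \frac{C_{\text{CAPA}} - C_{\text{sym}}}{C_{\text{oracle}} - C_{\text{sym}}}
  = \frac{0.87 - 0.58}{C_{\text{oracle}} - 0.58}
  \approx 0.93
\quad\Longrightarrow\quad
C_{\text{oracle}} \approx 0.58 + \frac{0.29}{0.93} \approx 0.89
```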

As a separate, real-outcome control, we ran repeated 5×10 cross-validation (50 folds) on the imputed-allele UCI BMT cohort, comparing CAPA's signed-difference variant against a symmetric-difference variant and a fold-matched Cox model on scalar mismatch distance. As expected given the lack of real signal, the two CAPA variants were statistically indistinguishable (relapse 0.50 ± 0.18 signed vs. 0.52 ± 0.20 symmetric; TRM 0.52 ± 0.08 both), and neither beat the Cox baseline. This confirms that the architectural choice alone does not manufacture spurious lift. The control does not, and cannot, test the directional-advantage claim, which requires real allele typing.

Evaluation caveat

A single 29-patient test split is one draw from a noisy distribution. We therefore report repeated 5×5 stratified cross-validation as the primary estimate (Cox relapse 0.60 ± 0.14, TRM 0.56 ± 0.06), which corrects the optimism of the single split. Results should be treated as a methodology demonstration, not a benchmark claim.
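A sketch of how repeated stratified splits can be generated, written with the standard library so the mechanics are visible. Five repeats of five folds give 25 test folds, and each patient appears in exactly one test fold per repeat. The function name, the seed, and the event labels are invented, and the project's own splitting code may differ.

```python
import random


def repeated_stratified_folds(labels, n_splits=5, n_repeats=5, seed=0):
    """Yield (train_idx, test_idx) pairs, n_splits per repeat.

    Within each repeat, indices are shuffled per label and dealt
    round-robin so every fold holds a similar share of each event type.
    """
    rng = random.Random(seed)
    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        offset = 0  # carry the dealing position across labels
        for label in sorted(set(labels)):
            members = [i for i, y in enumerate(labels) if y == label]
            rng.shuffle(members)
            for pos, idx in enumerate(members):
                folds[(pos + offset) % n_splits].append(idx)
            offset += len(members)
        for k in range(n_splits):
            test = sorted(folds[k])
            train = sorted(i for f in folds[:k] + folds[k + 1:] for i in f)
            yield train, test


# Invented event labels: 0 = censored, 1 = relapse, 2 = TRM.
labels = [0] * 20 + [1] * 12 + [2] * 8
splits = list(repeated_stratified_folds(labels))

print(len(splits), "folds")
print(sorted({len(test) for _, test in splits}), "test-fold sizes")
```

Scoring the model on each test fold and reporting the mean and spread across all folds gives an estimate of the "mean ± spread" form used above, which is less sensitive to one lucky or unlucky split.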

Reproducibility

All code, trained weights, and evaluation notebooks are open source under the MIT license. The dataset is publicly available from the UCI Machine Learning Repository.

Limitations & honest caveats

This is a proof-of-concept, and the constraints are real:

Small cohort. 187 patients is far too few to make clinical claims; confidence intervals are wide and some events are too rare to evaluate.
Single test split. The test set contains 29 patients: one draw from a noisy distribution. The bootstrap CI on every metric extends to 1.0. We therefore report repeated 5×5 stratified cross-validation as the primary estimate.
GvHD not evaluable. Only 2 GvHD events in the test fold. The architecture outputs a GvHD curve, but no performance claim can be made until evaluated on a larger cohort with more events.
Single dataset. Results have not yet been replicated on an independent, larger transplant registry.
Directional result is simulation-based. The headline directional advantage is established in a controlled simulation that plants a directional signal. It demonstrates that the architecture can exploit direction when present (a capability), not that real GvHD signal lies in that direction at a useful magnitude. A repeated-CV control on the real (but allele-imputed) UCI BMT cohort found no difference between signed and symmetric CAPA variants, as expected for data with no real allele-outcome relationship, so it cannot substitute for a real test of directionality. Confirming the directional claim requires registry data with per-allele typing and event-specific endpoints.
Not for clinical use. CAPA is a research artifact intended to demonstrate a representational idea, not a decision-support tool.

The right next step is external validation on a large multi-centre cohort. Contributions and replications are welcome via the repository.

Author & license

CAPA is an independent open-source research project, released under the MIT license.

Read the paper → View on GitHub
