Outcome of haematopoietic stem cell transplantation depends critically on HLA compatibility, conventionally encoded as a binary match/mismatch count that discards most immunological information. We propose CAPA, a framework that represents each HLA allele with a frozen protein language model (ESM-2, 650M) and learns donor–recipient interaction via cross-attention, feeding a DeepHit head that jointly predicts cumulative incidence of GvHD, relapse, and transplant-related mortality (TRM) as competing risks. On the public UCI Bone Marrow Transplant cohort (n = 187), the structure-aware representation outperforms a DeepHit model trained on tabular HLA features and is competitive with Cox and Fine–Gray baselines on time-dependent concordance for relapse and TRM, while producing calibrated, case-specific incidence curves. We release all code and weights as an open, reproducible proof-of-concept and discuss the small-cohort limitations frankly.
HLA matching is the strongest modifiable predictor of HSCT outcome. The standard representation — an integer count of matched alleles across loci — assumes all mismatches are equal and discards the protein-level differences that actually drive alloreactivity.[1] A single amino-acid substitution in the peptide-binding groove can change immunogenicity dramatically, while many substitutions are functionally silent.
We ask whether continuous, learned representations of HLA sequences can recover this lost signal and improve outcome prediction without hand-engineered mismatch features.
For each allele we retrieve the full protein sequence from IPD-IMGT/HLA and embed it with frozen ESM-2 (esm2_t33_650M_UR50D), mean-pooling the final layer over residues to a 1280-dimensional vector e ∈ ℝ¹²⁸⁰.[2]
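The pooling step can be illustrated with a minimal, dependency-free sketch. It assumes the per-token hidden states of the final layer have already been extracted, and it assumes that special tokens (start/end markers, padding) are excluded from the average; the text does not state how these are handled. The names `mean_pool` and `keep_mask` are illustrative, and the toy array stands in for the real (sequence length × 1280) output.

```python
from typing import List, Sequence


def mean_pool(token_embeddings: Sequence[Sequence[float]],
              keep_mask: Sequence[bool]) -> List[float]:
    """Average the per-token vectors whose mask entry is True."""
    if len(token_embeddings) != len(keep_mask):
        raise ValueError("one mask entry is required per token")
    kept = [vec for vec, keep in zip(token_embeddings, keep_mask) if keep]
    if not kept:
        raise ValueError("mask removes every token")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy stand-in for final-layer hidden states: 5 tokens x 4 dims.
# (The real model would yield one 1280-dim vector per token.)
hidden = [
    [9.0, 9.0, 9.0, 9.0],  # start-of-sequence token (assumed excluded)
    [1.0, 2.0, 3.0, 4.0],  # residue 1
    [3.0, 2.0, 1.0, 0.0],  # residue 2
    [2.0, 2.0, 2.0, 2.0],  # residue 3
    [7.0, 7.0, 7.0, 7.0],  # end-of-sequence token (assumed excluded)
]
mask = [False, True, True, True, False]

allele_embedding = mean_pool(hidden, mask)
print(allele_embedding)  # [2.0, 2.0, 2.0, 2.0]
```

Because the language model is frozen, these vectors can be computed once per allele and cached before training.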
Donor and recipient embeddings across the five loci are projected and combined with multi-head cross-attention, yielding an interaction representation that the survival head consumes.
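As a rough illustration of the interaction step, the sketch below implements a single head of scaled dot-product cross-attention in plain Python. Several choices here are assumptions rather than details taken from the text: recipient loci act as queries and donor loci as keys and values, a single head is shown instead of several, the projection is folded into the query/key/value matrices, and the dimensions (8 and 4) are toy values standing in for the projected embedding sizes. The weights are random rather than learned.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys_values: Matrix,
                    w_q: Matrix, w_k: Matrix,
                    w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """One attention head: queries attend over keys_values."""
    q = matmul(queries, w_q)
    k = matmul(keys_values, w_k)
    v = matmul(keys_values, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = matmul(q, k_t)                        # (n_query x n_key)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


n_loci, d_model, d_head = 5, 8, 4   # toy sizes; five loci as in the text
recipient = rand_matrix(n_loci, d_model)
donor = rand_matrix(n_loci, d_model)
w_q, w_k, w_v = (rand_matrix(d_model, d_head) for _ in range(3))

interaction, attn = cross_attention(recipient, donor, w_q, w_k, w_v)
print(len(interaction), len(interaction[0]))  # 5 4
```

Each row of `attn` shows how strongly one recipient locus attends to each donor locus. A multi-head version would run several such heads in parallel and concatenate their outputs.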
We model the three causes jointly with DeepHit, optimising a log-likelihood term plus a ranking loss over event times.[3] The competing-risks formulation treats the three events as mutually exclusive first events: for each patient, only the earliest one is modelled as the outcome.
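The two loss terms can be sketched on a discrete time grid as follows. This is a simplified illustration of the DeepHit objective, not the released implementation: the function names, the toy probabilities, the three time bins, and the value of `sigma` are all assumptions made for the example. Each patient's prediction is a table `pmf[k][t]` giving the probability that cause `k` is the first event in bin `t`, with all entries summing to 1.

```python
import math
from typing import List, Optional, Sequence

Pmf = List[List[float]]  # pmf[cause][time_bin]


def cif(pmf: Pmf, cause: int, t: int) -> float:
    """Cumulative incidence of `cause` up to and including bin t."""
    return sum(pmf[cause][: t + 1])


def neg_log_likelihood(pmf: Pmf, time: int, cause: Optional[int]) -> float:
    """Likelihood term: event probability if observed, survival if censored."""
    eps = 1e-12
    if cause is not None:
        return -math.log(pmf[cause][time] + eps)
    survival = 1.0 - sum(cif(pmf, k, time) for k in range(len(pmf)))
    return -math.log(max(survival, eps))


def ranking_loss(pmfs: Sequence[Pmf], times: Sequence[int],
                 causes: Sequence[Optional[int]], cause: int,
                 sigma: float = 0.1) -> float:
    """Penalise pairs where the earlier-event patient has lower CIF."""
    total = 0.0
    for i, (t_i, c_i) in enumerate(zip(times, causes)):
        if c_i != cause:
            continue  # only patients who experienced this cause anchor a pair
        for j, t_j in enumerate(times):
            if t_i < t_j:  # patient j was still event-free at t_i
                diff = cif(pmfs[i], cause, t_i) - cif(pmfs[j], cause, t_i)
                total += math.exp(-diff / sigma)
    return total


# Two toy patients, three causes, three time bins (each table sums to 1).
pmfs = [
    [[0.50, 0.10, 0.05], [0.10, 0.05, 0.05], [0.05, 0.05, 0.05]],
    [[0.05, 0.10, 0.20], [0.10, 0.10, 0.15], [0.10, 0.10, 0.10]],
]
times = [0, 1]        # observed bin for each patient
causes = [0, None]    # patient 0: cause 0 observed; patient 1: censored

nll = [neg_log_likelihood(p, t, c) for p, t, c in zip(pmfs, times, causes)]
rank = ranking_loss(pmfs, times, causes, cause=0)
total_loss = sum(nll) + rank
print(nll, rank, total_loss)
```

The censored patient contributes only through the probability of remaining event-free, which is how the likelihood uses partially observed follow-up.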
Fig. 1 End-to-end architecture. The protein language model is frozen; only the cross-attention interaction module and DeepHit head are trained.
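On the held-out test split, the structure-aware model achieves the second-highest time-dependent concordance for relapse (0.81, behind Fine–Gray at 0.84) and is competitive on TRM (0.63 versus 0.65 and 0.66 for the two classical baselines). It improves markedly on DeepHit with tabular HLA features for both endpoints.[4] GvHD was not robustly evaluable owing to too few events in the test fold.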
| Model (time-dependent concordance, test split) | Relapse | TRM | GvHD |
|---|---|---|---|
| Cox-PH (cause-specific) | 0.75 | 0.65 | — |
| Fine–Gray | 0.84 | 0.66 | — |
| DeepHit (tabular HLA) | 0.67 | 0.41 | — |
| CAPA (ESM-2 + cross-attn) | 0.81 | 0.63 | — |
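For readers unfamiliar with the metric reported above, the following is a minimal sketch of cause-specific time-dependent concordance. It is deliberately simplified and is an assumption about the general form of the metric, not the evaluation code behind the table: it omits inverse-probability-of-censoring weights, counts tied risks as half-concordant, and uses made-up patients and risk values. `risk[i][t]` denotes the predicted cumulative incidence of the cause of interest for patient `i` at time bin `t`.

```python
import math
from typing import List, Optional, Sequence


def td_concordance(risk: Sequence[Sequence[float]],
                   times: Sequence[int],
                   causes: Sequence[Optional[str]],
                   cause: str) -> float:
    """Fraction of comparable pairs ranked correctly at the event time."""
    concordant = 0.0
    comparable = 0
    for i, (t_i, c_i) in enumerate(zip(times, causes)):
        if c_i != cause:
            continue  # pairs are anchored on patients with this event
        for j, t_j in enumerate(times):
            if t_j <= t_i:
                continue  # patient j must still be at risk at t_i
            comparable += 1
            if risk[i][t_i] > risk[j][t_i]:
                concordant += 1.0
            elif risk[i][t_i] == risk[j][t_i]:
                concordant += 0.5
    if comparable == 0:
        return math.nan  # no events of this cause: metric undefined
    return concordant / comparable


# Four toy patients, four time bins; values are illustrative only.
times = [1, 2, 3, 3]
causes: List[Optional[str]] = ["relapse", "TRM", "relapse", None]
relapse_risk = [
    [0.20, 0.50, 0.60, 0.70],
    [0.05, 0.10, 0.15, 0.20],
    [0.10, 0.30, 0.50, 0.60],
    [0.30, 0.60, 0.65, 0.70],
]

c_relapse = td_concordance(relapse_risk, times, causes, "relapse")
c_gvhd = td_concordance(relapse_risk, times, causes, "GvHD")
print(c_relapse, c_gvhd)
```

With no events of a given cause the function returns NaN, which corresponds to the dashes in the GvHD column.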
Fig. 2 Predicted cumulative incidence functions for a representative test patient across the three competing risks.
The results are consistent with the central hypothesis that continuous protein-language representations of HLA carry outcome-relevant signal beyond match counts: CAPA outperforms DeepHit on tabular HLA features for both relapse and TRM, although it does not surpass the Fine–Gray baseline. The cohort is small (n = 187) and drawn from a single source, and several event types are too rare to evaluate reliably. We make no clinical claims; CAPA is a methodological demonstration. External validation on large, multi-centre registries is the necessary next step.