MixTCRpred
Benchmarking CDR input encoding strategies for T-cell receptor–epitope interaction prediction — Computational Cancer Biology Lab, UNIL, 2024.
Context
T cells are central to the immune response against infected and cancerous cells. Each T cell expresses a unique T cell receptor (TCR) that recognises specific peptides, called epitopes, displayed on the surface of target cells. With up to 10¹¹ unique TCRs in a single individual, computationally predicting which TCRs bind which epitopes is a major biomedical challenge with direct applications in cancer immunotherapy: identifying TCRs that target tumour-specific neoantigens is a key step in designing personalised T-cell therapies.
This internship focused on MixTCRpred, a transformer-based deep learning model developed by Croce et al. (2024) in the Computational Cancer Biology Lab. MixTCRpred takes as input the amino acid sequences of the CDR1, CDR2, and CDR3 regions of the TCR alpha and beta chains, and returns a binding score for a given epitope. The CDR3 region is closest to the epitope and primarily drives specificity, while CDR1 and CDR2 primarily interact with the MHC molecule — yet are still included as input features.
The research question: how do modifications to the CDR1 and CDR2 input sequences affect MixTCRpred's prediction performance, and are the full amino acid sequences necessary or can simplified representations suffice?
Data
All experiments used the same dataset as the original MixTCRpred publication, consisting of 19 epitopes split into two groups based on training data availability. The 9 high-data epitopes each had 300 to 1,300 positive TCRs, yielding stable and statistically interpretable results. The 10 low-data epitopes had only 5 to 13 positive TCRs each, making results more variable. Negatives were computationally generated at a 10:1 ratio.
| Dataset component | Count | Details |
|---|---|---|
| Total epitopes | 19 | 9 high-data + 10 low-data |
| High-data epitopes | 9 | 300–1,300 positive TCRs each |
| Low-data epitopes | 10 | 5–13 positive TCRs each |
| Alpha CDR1/2 sequences | 113 | From IMGT gene database |
| Beta CDR1/2 sequences | 147 | From IMGT gene database |
| Negative:positive ratio | 10:1 | Computationally generated negatives |
| Cross-validation | 5-fold | Stratified train/test split |
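To make the cross-validation row concrete, here is a minimal sketch of a stratified 5-fold split on a toy epitope with a 10:1 negative-to-positive ratio. It is an illustration only: the function name `stratified_kfold`, the seed, and the toy label counts are assumptions, and the report does not state which library or splitting code was actually used.

```python
import random

def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs that preserve class ratios per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin so each fold gets its share.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

# Hypothetical toy epitope: 10 positive TCRs and 100 negatives (10:1 ratio).
labels = [1] * 10 + [0] * 100
splits = list(stratified_kfold(labels, k=5))

for train, test in splits:
    n_pos = sum(labels[i] for i in test)
    print(f"test size={len(test)}, positives={n_pos}, negatives={len(test) - n_pos}")
```

With these toy counts every test fold holds 2 positives and 20 negatives, so the 10:1 ratio is preserved. For a low-data epitope with only 5 positives, the same split leaves a single positive per test fold, which is one reason the per-fold AUC becomes unstable.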
Results
Experiments & Findings
Three encoding strategies were tested against the baseline (full CDR1/2 amino acid sequences), each designed to probe a specific aspect of the input representation.
- Baseline (full CDR1/2 sequences): mean AUC 0.88 on high-data epitopes. Results were consistent with those reported by Croce et al. (2024), confirming the experimental setup was correctly reproduced. Low-data epitopes showed high variability (mean AUC 0.58), making statistical interpretation difficult.
- CDR1/2 deletion: significant performance drop (Wilcoxon p = 0.006 on high-data epitopes). Replacing CDR1/2 sequences with a single "X" token caused a statistically significant decrease in AUC for the 9 high-data epitopes, confirming that CDR1/2 sequences carry information the model relies on despite being physically distant from the epitope binding interface. On low-data epitopes the result was not significant (p = 0.329), reflecting high inter-epitope variability.
- CDR1/2 categorisation: marginal decrease, not significant on high-data epitopes (p = 0.052). Replacing each unique CDR1/2 sequence with a 2-amino-acid category token (221 unique categories for 260 sequences) produced only a slight drop on high-data epitopes, suggesting the model can partially recover sequence information from categorical identity alone. A significant decrease was observed on low-data epitopes (p = 0.034).
- CDR1/2 sequence extension: no significant improvement in any condition. Extending CDR1/2 sequences by up to 4 amino acids on each side of the standard loop boundary produced no meaningful AUC change (mean range: 0.876–0.881, all p-values 0.7–0.9). This suggests that the standard CDR boundary definition in MixTCRpred is already well chosen; extending it does not capture additional signal.
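The three input modifications above can be sketched as simple string transformations. This is a hedged illustration, not the code used in the internship: the function names, the example sequences, the loop coordinates, and the rule for assigning category tokens (alphabetical order over unique sequences) are all assumptions. The only facts taken from the text are the single "X" token, the 2-amino-acid category tokens, and the extension by up to 4 residues on each side.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues, so 400 two-letter tokens

def delete_cdr(seq):
    """Deletion: replace the whole CDR1/2 sequence with a single 'X' token."""
    return "X"

def build_category_map(sequences, alphabet=AMINO_ACIDS):
    """Categorisation: give each unique sequence its own 2-letter token.

    The assignment order (sorted unique sequences) is an assumption.
    """
    unique = sorted(set(sequences))
    base = len(alphabet)
    if len(unique) > base ** 2:
        raise ValueError("too many unique sequences for 2-letter tokens")
    return {s: alphabet[i // base] + alphabet[i % base] for i, s in enumerate(unique)}

def extend_cdr(region, start, end, n):
    """Extension: widen the loop region[start:end] by up to n residues per side."""
    return region[max(0, start - n):min(len(region), end + n)]

# Made-up V-region fragment and loop coordinates, for illustration only.
region = "CAVRDSNYQLIW"
start, end = 4, 8
loop = region[start:end]                      # the "standard" loop in this toy case

cat_map = build_category_map(["DSNY", "TSGF", "DSNY"])

print(delete_cdr(loop))                       # X
print(cat_map[loop])                          # AA
print(extend_cdr(region, start, end, 2))      # VRDSNYQL
```

Two-letter tokens over the 20 standard amino acids allow 400 distinct categories, which is enough for the 221 categories mentioned above. The category token keeps the identity of each sequence but discards its residue-level content, which is what separates this condition from both the baseline and the deletion.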
Taken together, these results indicate that the full CDR1/2 sequences as currently defined in MixTCRpred are needed and that their boundaries are already well chosen. They carry real predictive signal despite their physical distance from the TCR-epitope interface, likely through their interaction with the MHC molecule, which indirectly constrains TCR specificity.
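The p-values quoted in this section come from Wilcoxon tests. As a rough sketch of how such a paired comparison works, the code below implements an exact two-sided Wilcoxon signed-rank test in plain Python. The AUC values are invented for illustration and are not the report's numbers, and treating one AUC per epitope as the paired unit is an assumption; the report may have paired folds, used a one-sided alternative, or relied on a library routine instead.

```python
from itertools import product

def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Assumes no zero differences and no tied absolute differences.
    """
    d = [round(a - b, 6) for a, b in zip(x, y)]  # rounding avoids float noise
    if any(v == 0 for v in d) or len({abs(v) for v in d}) != len(d):
        raise ValueError("zero or tied differences are not handled in this sketch")
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    w_minus = sum(r for r, v in zip(ranks, d) if v < 0)
    stat = min(w_plus, w_minus)
    # Under the null hypothesis each rank is positive or negative with
    # probability 1/2, so enumerate all 2^n sign assignments.
    count = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(range(1, n + 1), signs) if s) <= stat
    )
    return stat, min(1.0, 2 * count / 2 ** n)

# Invented AUCs for 9 epitopes: baseline vs. an ablated input (NOT real results).
baseline = [0.90, 0.88, 0.91, 0.85, 0.87, 0.92, 0.86, 0.89, 0.84]
ablated = [0.89, 0.86, 0.88, 0.81, 0.82, 0.86, 0.79, 0.81, 0.75]

stat, p = wilcoxon_exact(baseline, ablated)
print(stat, p)  # 0 0.00390625
```

In this toy case every epitope loses AUC, so the statistic is 0 and the p-value is 2/2⁹ ≈ 0.0039, the smallest two-sided value an exact test can return with nine pairs. With so few paired observations, only a near-uniform shift across epitopes can reach significance, which is why noisy low-data epitopes rarely produce small p-values.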
Pipeline
What I Learned
This internship was my first exposure to immunoinformatics and to working within an established research codebase rather than building from scratch. Navigating MixTCRpred's transformer architecture — modifying padding, embedding dimensions, and input features without touching the core model — required careful reading of someone else's code, which is a different skill from writing your own.
From a statistical standpoint, the stark contrast between high-data and low-data epitopes reinforced a practical lesson: sample size governs what you can claim. The same deletion experiment that gave p = 0.006 on the 9 high-data epitopes gave p = 0.329 on the 10 low-data ones, most likely not because the biology differed but because the small number of positive TCRs made any signal undetectable.
Working in the Computational Cancer Biology Lab also grounded my computational work in a clinical context — understanding how TCR prediction tools fit into neoantigen discovery pipelines and why improving prediction accuracy, even marginally, has downstream consequences for immunotherapy design.
Note: The full internship report is available on request. Results reported here are as published in the report submitted to the University of Lausanne, January–June 2024.