Benchmarking CDR input encoding strategies for T-cell receptor–epitope interaction prediction — Computational Cancer Biology Lab, UNIL, 2024.
T cells are central to the immune response against infected and cancerous cells. Each T cell expresses a unique T Cell Receptor (TCR) that recognises specific peptides — called epitopes — displayed on the surface of target cells. With up to 10^11 unique TCRs in a single individual, computationally predicting which TCRs bind which epitopes is a major biomedical challenge with direct applications in cancer immunotherapy: identifying TCRs that target tumour-specific neoantigens is a key step in designing personalised T-cell therapies.
This internship focused on MixTCRpred, a transformer-based deep learning model developed by Croce et al. (2024) in the Computational Cancer Biology Lab. MixTCRpred takes as input the amino acid sequences of the CDR1, CDR2, and CDR3 regions of the TCR alpha and beta chains, and returns a binding score for a given epitope. The CDR3 region lies closest to the epitope and primarily drives specificity, while CDR1 and CDR2 mainly contact the MHC molecule — yet they are still included as input features.
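To make the input representation concrete, here is a minimal sketch of how a TCR's six CDR loops could be flattened into a single fixed-length integer vector for a transformer. All names, per-loop maximum lengths, and example sequences are illustrative assumptions, not MixTCRpred's actual implementation.

```python
# Hypothetical encoding sketch: six CDR loops (alpha/beta x CDR1/2/3)
# mapped to integer tokens and right-padded to fixed per-loop lengths.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = 0  # reserved padding token
TOKEN = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode_cdr(seq: str, max_len: int) -> list[int]:
    """Map one CDR amino-acid string to integers, right-padded to max_len."""
    ids = [TOKEN[aa] for aa in seq]
    return ids + [PAD] * (max_len - len(ids))

def encode_tcr(cdrs: dict[str, str]) -> list[int]:
    """Concatenate the six CDR loops into one fixed-length vector."""
    # Illustrative per-loop maxima; CDR3 is the longest, most variable loop.
    max_lens = {"a1": 7, "a2": 8, "a3": 22, "b1": 6, "b2": 7, "b3": 23}
    out: list[int] = []
    for key in ("a1", "a2", "a3", "b1", "b2", "b3"):
        out.extend(encode_cdr(cdrs[key], max_lens[key]))
    return out

# Example TCR with made-up (but plausible-length) CDR sequences.
example = {"a1": "TSGFNG", "a2": "NVLDGL", "a3": "CAVRDSNYQLIW",
           "b1": "SGHAT", "b2": "FQNNGV", "b3": "CASSLAPGATNEKLFF"}
vec = encode_tcr(example)
print(len(vec))  # 7 + 8 + 22 + 6 + 7 + 23 = 73
```

A fixed layout like this is what makes it possible to ablate or simplify the CDR1/2 slots without touching the rest of the model's input pipeline.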
The research question: how do modifications to the CDR1 and CDR2 input sequences affect MixTCRpred's prediction performance, and are the full amino acid sequences necessary or can simplified representations suffice?
All experiments used the same dataset as the original MixTCRpred publication, consisting of 19 epitopes split into two groups based on training data availability. The 9 high-data epitopes each had 300 to 1,300 positive TCRs, yielding stable and statistically interpretable results. The 10 low-data epitopes had only 5 to 13 positive TCRs each, making results more variable. Negatives were computationally generated at a 10:1 ratio.
| Dataset component | Count | Details |
|---|---|---|
| Total epitopes | 19 | 9 high-data + 10 low-data |
| High-data epitopes | 9 | 300–1,300 positive TCRs each |
| Low-data epitopes | 10 | 5–13 positive TCRs each |
| Alpha CDR1/2 sequences | 113 | From IMGT gene database |
| Beta CDR1/2 sequences | 147 | From IMGT gene database |
| Negative:positive ratio | 10:1 | Computationally generated negatives |
| Cross-validation | 5-fold | Stratified train/test split |
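The 10:1 negative generation described above can be sketched as follows. This is an assumption-laden toy, not the lab's actual pipeline: it pairs each positive TCR of an epitope with negatives sampled from the binders of *other* epitopes (one common strategy for generating mismatched negatives); the function name and toy data are invented for illustration.

```python
import random

def build_dataset(positives_by_epitope: dict, ratio: int = 10, seed: int = 0):
    """Toy 10:1 negative:positive dataset builder (illustrative only)."""
    rng = random.Random(seed)
    dataset = []  # (epitope, tcr, label) triples
    for epitope, positives in positives_by_epitope.items():
        # Candidate negatives: TCRs known to bind a *different* epitope.
        pool = [t for e, tcrs in positives_by_epitope.items() if e != epitope
                for t in tcrs]
        for tcr in positives:
            dataset.append((epitope, tcr, 1))
        # Sample up to ratio negatives per positive, capped by pool size.
        for tcr in rng.sample(pool, min(ratio * len(positives), len(pool))):
            dataset.append((epitope, tcr, 0))
    return dataset

toy = {"EP1": ["TCR_A"], "EP2": [f"TCR_B{i}" for i in range(20)]}
data = build_dataset(toy)
print(sum(1 for *_, y in data if y == 0))  # EP1 gets 10 negatives, EP2 only 1
```

In the real setting each fold of the 5-fold cross-validation would then be stratified so that positives and negatives of every epitope are proportionally represented in train and test.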
Three encoding strategies were tested against the baseline (full CDR1/2 amino acid sequences), each designed to probe a specific aspect of the input representation.
Taken together, these results confirm that the full CDR1/2 sequences as currently defined in MixTCRpred are necessary and already optimal. They carry real predictive signal despite their physical distance from the TCR-epitope interface, likely through their interaction with the MHC molecule which constrains TCR specificity indirectly.
This internship was my first exposure to immunoinformatics and to working within an established research codebase rather than building from scratch. Navigating MixTCRpred's transformer architecture — modifying padding, embedding dimensions, and input features without touching the core model — required careful reading of someone else's code, which is a different skill from writing your own.
From a statistical standpoint, the stark contrast between high-data and low-data epitopes reinforced a practical lesson: sample size governs what you can claim. The same experiment that gave p = 0.006 on 9 epitopes gave p = 0.329 on 10 others — not because the biology differed, but because the data volume made the signal undetectable.
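The sample-size effect can be illustrated with an exact two-sided sign test (the report's actual test statistic is not specified here, so this is only a demonstration of the general principle, with made-up win counts): a consistent effect over 9 epitopes is already significant, while a noisier split over 10 epitopes is indistinguishable from chance.

```python
from math import comb

def sign_test_p(wins: int, n: int) -> float:
    """Exact two-sided sign test: P(result at least this extreme | p = 0.5)."""
    k = max(wins, n - wins)                       # more extreme tail
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return min(1.0, 2 * tail)                     # two-sided, capped at 1

# 9 high-data epitopes all favouring the same encoding: significant.
print(sign_test_p(9, 9))   # 2 * (1/512) ≈ 0.0039
# 10 noisy low-data epitopes, only 6 of 10 agreeing: not significant.
print(sign_test_p(6, 10))  # ≈ 0.754
```

The same underlying effect, observed through fewer and noisier data points, simply cannot clear the significance bar — which is exactly the high-data vs low-data contrast seen in the experiments.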
Working in the Computational Cancer Biology Lab also grounded my computational work in a clinical context — understanding how TCR prediction tools fit into neoantigen discovery pipelines and why improving prediction accuracy, even marginally, has downstream consequences for immunotherapy design.
Note: The full internship report is available on request. Results reported here are as published in the report submitted to the University of Lausanne, January–June 2024.