Research internship · immunoinformatics · cancer biology · deep learning

MixTCRpred

Benchmarking CDR input encoding strategies for T-cell receptor–epitope interaction prediction — Computational Cancer Biology Lab, UNIL, 2024.

Period: 01/2024 — 06/2024
Lab: Gfeller Group · UNIL
Supervisor: Dr. Giancarlo Croce
PI: Prof. David Gfeller
Stack: PyTorch · PyTorch Lightning · scikit-learn

Context

T cells are central to the immune response against infected and cancerous cells. Each T cell expresses a unique T Cell Receptor (TCR) that recognises specific peptides — called epitopes — displayed on the surface of target cells. With up to 10¹¹ unique TCRs in a single individual, computationally predicting which TCRs bind which epitopes is a major biomedical challenge with direct applications in cancer immunotherapy: identifying TCRs that target tumour-specific neoantigens is a key step in designing personalised T-cell therapies.

This internship focused on MixTCRpred, a transformer-based deep learning model developed by Croce et al. (2024) in the Computational Cancer Biology Lab. MixTCRpred takes as input the amino acid sequences of the CDR1, CDR2, and CDR3 regions of the TCR alpha and beta chains, and returns a binding score for a given epitope. The CDR3 region sits closest to the epitope and is the main driver of specificity, while CDR1 and CDR2 interact chiefly with the MHC molecule, yet they are still included as input features.
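MixTCRpred's exact input pipeline is not reproduced here, but the general idea of feeding CDR amino acid sequences to a transformer can be sketched as follows. The vocabulary, padding scheme, and function names are illustrative assumptions, not the model's actual code:

```python
# Illustrative sketch: tokenizing CDR amino acid sequences for a
# transformer-style TCR model. Vocabulary and padding are assumptions,
# not MixTCRpred's actual implementation.
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_IDX = 0  # index 0 reserved for padding
AA_TO_IDX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def encode_cdrs(cdr_seqs, max_len=25):
    """Map a list of CDR sequences (e.g. [CDR1a, CDR2a, CDR3a, ...])
    to a padded LongTensor of shape (n_regions, max_len)."""
    batch = torch.full((len(cdr_seqs), max_len), PAD_IDX, dtype=torch.long)
    for r, seq in enumerate(cdr_seqs):
        for p, aa in enumerate(seq[:max_len]):
            batch[r, p] = AA_TO_IDX[aa]
    return batch

# Example CDR1a / CDR2a / CDR3a sequences (illustrative)
tokens = encode_cdrs(["TSGFYG", "NALDGL", "CAVRDSNYQLIW"])
# tokens would feed an nn.Embedding(21, d_model, padding_idx=PAD_IDX) layer
```

Tokenized tensors like this are what an embedding layer plus positional encoding turn into the transformer's input.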

The research question: how do modifications to the CDR1 and CDR2 input sequences affect MixTCRpred's prediction performance, and are the full amino acid sequences necessary or can simplified representations suffice?

Data

All experiments used the same dataset as the original MixTCRpred publication, consisting of 19 epitopes split into two groups based on training data availability. The 9 high-data epitopes each had 300 to 1,300 positive TCRs, yielding stable and statistically interpretable results. The 10 low-data epitopes had only 5 to 13 positive TCRs each, making results more variable. Negatives were computationally generated at a 10:1 ratio.
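One common way to generate negatives at a fixed ratio, in TCR–epitope benchmarks, is to pair each epitope with TCRs known to bind other epitopes. Whether the MixTCRpred dataset used exactly this scheme is an assumption; the sketch below only illustrates the 10:1 ratio logic:

```python
# Hedged sketch: generating 10:1 negatives by sampling TCRs specific to
# *other* epitopes. The exact scheme used for the MixTCRpred dataset is
# an assumption here; this only illustrates the ratio logic.
import random

def make_negatives(positives_by_epitope, ratio=10, seed=0):
    """positives_by_epitope: dict mapping epitope -> list of positive TCRs.
    Returns dict mapping epitope -> list of negatives drawn from other epitopes."""
    rng = random.Random(seed)
    negatives = {}
    for epitope, tcrs in positives_by_epitope.items():
        pool = [t for e, ts in positives_by_epitope.items() if e != epitope for t in ts]
        n = min(ratio * len(tcrs), len(pool))  # cap at available pool size
        negatives[epitope] = rng.sample(pool, n)
    return negatives

# Toy data: epitope -> positive TCR identifiers
data = {"EP1": ["tcr1", "tcr2"], "EP2": ["tcr3"] * 25, "EP3": ["tcr4"] * 25}
negs = make_negatives(data)
```

The cap on the pool size matters for low-data epitopes, where few positives exist anywhere to sample from.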

Dataset component          Count    Details
Total epitopes             19       9 high-data + 10 low-data
High-data epitopes         9        300–1,300 positive TCRs each
Low-data epitopes          10       5–13 positive TCRs each
Alpha CDR1/2 sequences     113      From IMGT gene database
Beta CDR1/2 sequences      147      From IMGT gene database
Negative:positive ratio    10:1     Computationally generated negatives
Cross-validation           5-fold   Stratified train/test split

Results

0.88 Mean AUC — 9 high-data epitopes (baseline)
0.006 Wilcoxon p-value — CDR deletion test
19 Epitopes benchmarked across all tests
0.58 Mean AUC — 10 low-data epitopes (baseline)
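The headline statistics come from comparing paired per-epitope AUCs with a Wilcoxon signed-rank test. A minimal sketch of that comparison, with illustrative AUC values rather than the report's numbers:

```python
# Sketch of the per-epitope statistical comparison: paired AUCs
# (baseline vs. modified input) tested with a Wilcoxon signed-rank test.
# The AUC values are illustrative, not the report's actual results.
from scipy.stats import wilcoxon

# One AUC per high-data epitope, same epitope order in both lists
baseline_auc = [0.91, 0.88, 0.85, 0.90, 0.87, 0.89, 0.86, 0.92, 0.84]
modified_auc = [0.85, 0.83, 0.80, 0.86, 0.82, 0.84, 0.81, 0.88, 0.79]

stat, p = wilcoxon(baseline_auc, modified_auc)
# a small p indicates the AUC drop is consistent across epitopes
```

With only 9 or 10 paired observations per group, the test's power is limited, which is exactly why the low-data group yields non-significant p-values even when the mean AUC shifts.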

Experiments & Findings

Three encoding strategies (deletion, category encoding, and extension) were tested against the baseline of full CDR1/2 amino acid sequences, each designed to probe a specific aspect of the input representation.
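The report's exact modification code is not reproduced here; the sketch below is an illustrative guess at what each of the three probes named in the pipeline (deletion / category / extension) could look like on a single CDR sequence:

```python
# Illustrative sketch of the three CDR1/2 modification probes.
# The category grouping and the 'X' placeholder residue are assumptions,
# not the report's actual definitions.
CATEGORY = {}
for aa in "AVILMFWYC":
    CATEGORY[aa] = "H"   # hydrophobic
for aa in "STNQGP":
    CATEGORY[aa] = "P"   # polar / small
for aa in "DEKRH":
    CATEGORY[aa] = "C"   # charged

def delete_cdr(seq):
    """Deletion probe: remove the region entirely."""
    return ""

def category_cdr(seq):
    """Category probe: replace each residue with a coarse physicochemical class."""
    return "".join(CATEGORY[aa] for aa in seq)

def extend_cdr(seq, flank="X", n=2):
    """Extension probe: pad the region with placeholder residues on both sides."""
    return flank * n + seq + flank * n
```

Each probe keeps the rest of the training pipeline unchanged, so any AUC difference is attributable to the CDR1/2 representation alone.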

Taken together, these results confirm that the full CDR1/2 sequences as currently defined in MixTCRpred are necessary, and that simplified representations do not match them. They carry real predictive signal despite their physical distance from the TCR–epitope interface, likely because their interaction with the MHC molecule indirectly constrains TCR specificity.

Pipeline

01
IMGT Database
113α + 147β CDR1/2
02
Sequence Modification
deletion / category / extension
03
MixTCRpred Training
PyTorch · transformer
04
5-fold CV
stratified · AUC mean
05
Wilcoxon Test
vs baseline · per epitope group

What I Learned

This internship was my first exposure to immunoinformatics and to working within an established research codebase rather than building from scratch. Navigating MixTCRpred's transformer architecture — modifying padding, embedding dimensions, and input features without touching the core model — required careful reading of someone else's code, which is a different skill from writing your own.

From a statistical standpoint, the stark contrast between high-data and low-data epitopes reinforced a practical lesson: sample size governs what you can claim. The same experiment that gave p = 0.006 on 9 epitopes gave p = 0.329 on 10 others — not because the biology differed, but because the data volume made the signal undetectable.

Working in the Computational Cancer Biology Lab also grounded my computational work in a clinical context — understanding how TCR prediction tools fit into neoantigen discovery pipelines and why improving prediction accuracy, even marginally, has downstream consequences for immunotherapy design.

Note: The full internship report is available on request. Results reported here are as published in the report submitted to the University of Lausanne, January–June 2024.

Tech Stack

PyTorch PyTorch Lightning scikit-learn Python Immunoinformatics IMGT Database 5-fold CV Wilcoxon Test Transformer Architecture