MixTCRpred Internship — Jikaël Ntoko

Context

T cells are central to the immune response against infected and cancerous cells. Each T cell expresses a unique T-cell receptor (TCR) that recognises specific peptides, called epitopes, displayed on the surface of target cells. With up to 10¹¹ unique TCRs in a single individual, computationally predicting which TCRs bind which epitopes is a major biomedical challenge with direct applications in cancer immunotherapy: identifying TCRs that target tumour-specific neoantigens is a key step in designing personalised T-cell therapies.

This internship focused on MixTCRpred, a transformer-based deep learning model developed by Croce et al. (2024) in the Computational Cancer Biology Lab. MixTCRpred takes as input the amino acid sequences of the CDR1, CDR2, and CDR3 regions of the TCR alpha and beta chains, and returns a binding score for a given epitope. The CDR3 region is closest to the epitope and primarily drives specificity, while CDR1 and CDR2 primarily interact with the MHC molecule — yet are still included as input features.

The research question: how do modifications to the CDR1 and CDR2 input sequences affect MixTCRpred's prediction performance, and are the full amino acid sequences necessary or can simplified representations suffice?

Data

All experiments used the same dataset as the original MixTCRpred publication: 19 epitopes, split into two groups by the amount of available training data. The 9 high-data epitopes each had 300 to 1,300 positive TCRs, enough for stable and statistically interpretable results. The 10 low-data epitopes had only 5 to 13 positive TCRs each, which made their results far more variable. Negatives were computationally generated at a ratio of 10 negatives per positive.

Dataset component	Count	Details
Total epitopes	19	9 high-data + 10 low-data
High-data epitopes	9	300–1,300 positive TCRs each
Low-data epitopes	10	5–13 positive TCRs each
Alpha CDR1/2 sequences	113	From IMGT gene database
Beta CDR1/2 sequences	147	From IMGT gene database
Negative:positive ratio	10:1	Computationally generated negatives
Cross-validation	5-fold	Stratified train/test split
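
The stratified 5-fold split in the table can be illustrated with a minimal sketch. It is written from scratch in plain Python as an assumption about how such a split behaves; the report does not state which splitter was used (scikit-learn's StratifiedKFold is a common choice). The function name stratified_folds and the toy label counts are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Assign each sample index to one of n_splits test folds, keeping the
    positive/negative proportions roughly equal in every fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy epitope (invented): 10 positives and 100 negatives, i.e. a 10:1 ratio.
labels = [1] * 10 + [0] * 100
folds = stratified_folds(labels)

for k, test_idx in enumerate(folds):
    n_pos = sum(labels[i] for i in test_idx)
    print(f"fold {k}: {len(test_idx)} test samples, {n_pos} positives")
```

With these toy counts every fold holds 22 test samples, 2 of them positive. For a low-data epitope with only 5 positives, the same procedure would leave a single positive per test fold, which is one reason the per-fold AUC becomes so unstable.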

Results

0.88 Mean AUC — 9 high-data epitopes (baseline)

0.006 Wilcoxon p-value — CDR deletion test

19 Epitopes benchmarked across all tests

0.58 Mean AUC — 10 low-data epitopes (baseline)

Experiments & Findings

Three encoding strategies were tested against the baseline (full CDR1/2 amino acid sequences), each designed to probe a specific aspect of the input representation.

Baseline (full CDR1/2 sequences): mean AUC 0.88 on high-data epitopes. Results were consistent with those reported by Croce et al. (2024), indicating that the experimental setup was correctly reproduced. Low-data epitopes showed high variability (mean AUC 0.58), making statistical interpretation difficult.
CDR1/2 deletion: significant performance drop (Wilcoxon p = 0.006 on high-data epitopes). Replacing the CDR1/2 sequences with a single "X" token caused a statistically significant decrease in AUC for the 9 high-data epitopes, confirming that the CDR1/2 sequences carry information the model relies on even though they are physically distant from the epitope binding interface. On low-data epitopes the result was not significant (p = 0.329), reflecting high inter-epitope variability.
CDR1/2 categorisation: marginal decrease, not significant on high-data epitopes (p = 0.052). Replacing each unique CDR1/2 sequence with a 2-amino-acid category token (221 unique categories for 260 sequences) produced only a slight drop on high-data epitopes, suggesting that the model can recover part of the sequence information from categorical identity alone. On low-data epitopes the decrease was significant (p = 0.034).
CDR1/2 sequence extension: no significant change in any condition. Extending the CDR1/2 sequences by up to 4 amino acids to the left and right of the standard loop boundary produced no meaningful AUC change (mean range: 0.876–0.881, all p-values between 0.7 and 0.9). The standard CDR boundaries used in MixTCRpred therefore appear sufficient: extending them captures no additional signal.
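
The three input modifications can be sketched as simple string transformations. This is an illustration under stated assumptions, not the code used in the internship: the function names, the alphabetical token assignment, and the toy sequences are all hypothetical, and the report does not describe how category tokens were chosen or how loop coordinates were stored.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def delete_cdr(cdr_seq):
    """Deletion test: discard the sequence and keep a single placeholder."""
    return "X"


def build_category_map(cdr_seqs):
    """Categorisation test: give each distinct CDR sequence an arbitrary
    2-letter token, so identity is preserved but residue content is not.
    (Assumption: tokens are handed out in alphabetical order.)"""
    unique = sorted(set(cdr_seqs))
    tokens = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]
    if len(unique) > len(tokens):
        raise ValueError("more distinct sequences than 2-letter tokens")
    return dict(zip(unique, tokens))


def extend_cdr(v_region, start, end, n_extra):
    """Extension test: widen the loop window [start, end) by n_extra
    residues on each side, clipped to the ends of the V-region string."""
    lo = max(0, start - n_extra)
    hi = min(len(v_region), end + n_extra)
    return v_region[lo:hi]


# Toy data (invented, not real IMGT sequences).
v_region = "MKTAYSGNPWLQRV"
loop = v_region[4:8]                      # "YSGN"

print(delete_cdr(loop))                   # X
print(build_category_map(["YSGN", "TSGF", "YSGN"]))
print(extend_cdr(v_region, 4, 8, 2))      # TAYSGNPW
print(extend_cdr(v_region, 4, 8, 4))      # MKTAYSGNPWLQ
```

Two-letter tokens over 20 residues give 400 possible categories, which comfortably covers the 221 needed here while letting the model reuse its existing amino-acid vocabulary.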

Taken together, these results indicate that the full CDR1/2 sequences, as currently defined in MixTCRpred, are needed and that their boundaries are already adequate: removing them hurts performance, categorical tokens recover only part of the signal, and extending the loops adds nothing. The sequences carry real predictive signal despite their physical distance from the TCR-epitope interface, likely through their interaction with the MHC molecule, which constrains TCR specificity indirectly.
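
The paired comparison against the baseline can be sketched as an exact Wilcoxon signed-rank test. The report does not say which implementation produced its p-values (SciPy's scipy.stats.wilcoxon is a common choice); the function below is a from-scratch illustration, and the AUC values are invented for the example, not taken from the experiments.

```python
from itertools import product


def wilcoxon_signed_rank_exact(baseline, modified):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Simplifying assumption: no zero differences and no tied absolute
    differences, so ranks are simply 1..n.
    """
    diffs = [m - b for b, m in zip(baseline, modified)]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    observed = min(w_plus, w_minus)

    # Under the null hypothesis every sign pattern is equally likely,
    # so enumerate all 2^n patterns and count those at least as extreme.
    total = n * (n + 1) // 2
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            hits += 1
    return observed, hits / 2 ** n


# Invented per-epitope mean AUCs for 9 epitopes (illustration only).
baseline = [0.91, 0.86, 0.93, 0.88, 0.82, 0.95, 0.87, 0.89, 0.84]
modified = [0.90, 0.84, 0.90, 0.84, 0.77, 0.89, 0.80, 0.81, 0.75]

stat, p_value = wilcoxon_signed_rank_exact(baseline, modified)
print(stat, p_value)  # 0 0.00390625
```

In this toy case every epitope gets worse, so the statistic is 0 and the p-value is 2/512. That is also the smallest two-sided exact p-value attainable with nine pairs, which is worth keeping in mind when reading p-values computed over so few epitopes.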

Pipeline

01 IMGT Database: 113α + 147β CDR1/2
02 Sequence Modification: deletion / category / extension
03 MixTCRpred Training: PyTorch · transformer
04 5-fold CV: stratified · AUC mean
05 Wilcoxon Test: vs baseline · per epitope group
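
The "AUC mean" in step 04 can be illustrated with a minimal sketch that computes a rank-based AUC per fold and averages it. This is an assumption about the evaluation, written in plain Python; the actual pipeline may well have used scikit-learn's roc_auc_score. The function names and toy scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(fold_results):
    """Average per-fold AUCs; fold_results is a list of (labels, scores)."""
    aucs = [roc_auc(y, s) for y, s in fold_results]
    return sum(aucs) / len(aucs)


# Two toy test folds with invented binding scores.
fold_results = [
    ([1, 0, 0, 1], [0.9, 0.1, 0.8, 0.7]),  # 3 of 4 pairs ranked correctly
    ([1, 0], [0.6, 0.2]),                  # perfectly ranked
]
print(mean_auc(fold_results))  # 0.875
```

The pairwise formulation is quadratic in fold size, which is fine for a sketch but slower than a sort-based implementation on thousands of TCRs.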

What I Learned

This internship was my first exposure to immunoinformatics and to working within an established research codebase rather than building from scratch. Navigating MixTCRpred's transformer architecture — modifying padding, embedding dimensions, and input features without touching the core model — required careful reading of someone else's code, which is a different skill from writing your own.

From a statistical standpoint, the stark contrast between high-data and low-data epitopes reinforced a practical lesson: sample size governs what you can claim. The same deletion experiment that gave p = 0.006 on the 9 high-data epitopes gave p = 0.329 on the 10 low-data ones, most plausibly not because the biology differed but because, with only 5 to 13 positive TCRs per epitope, the signal was undetectable.

Working in the Computational Cancer Biology Lab also grounded my computational work in a clinical context — understanding how TCR prediction tools fit into neoantigen discovery pipelines and why improving prediction accuracy, even marginally, has downstream consequences for immunotherapy design.

Note: The full internship report is available on request. Results reported here are as published in the report submitted to the University of Lausanne, January–June 2024.

Tech Stack

PyTorch · PyTorch Lightning · scikit-learn · Python · Immunoinformatics · IMGT Database · 5-fold CV · Wilcoxon Test · Transformer Architecture