Michael Bikman

Using transformers to predict the effect of protein mutation

Michael Bikman, Rachel Kolodny, Margarita Osadchy
Department of Computer Science, University of Haifa

Our goal is to predict computationally the effect of a single-point mutation in a protein. Such a mutation can either have no effect or can lead to changes in the protein structure that affect function. Indeed, such mutations are at the root of numerous human diseases, from cancer to incurable developmental diseases. Predicting the impact of a mutation remains a major unsolved challenge, despite its wide-ranging importance for human health. We address this goal by train a transformer-based deep network.

Our model is trained in a supervised-setting on VAMP-seq (variant abundance by massively parallel sequencing) and MAVE (multiplexed assays of variant effects) measurements. The dataset holds almost 155.000 variants in 29 proteins. For this dataset we also have previously calculated predictions of protein stability, using Rosetta (a state-of-the-art predictor) and multiple-sequence-alignment (MSA)-based conservation scores. We also compute sequence embedding for each wild-type and mutated inputs using ESM-1b (evolutionary scale modeling [1]). The input to our new attention-based neural network are these calculated values, and the output is a loss-of-function prediction for each mutation.

Similar to previous evaluations of such predictors, we hold-out one protein, and train the network on all other proteins in our data. Then, we predict the scores for all mutations in the held-out protein. We repeat this experiment for each one of the proteins in the dataset and compare the accuracy of the predictions of our transformer model to those of a previously published state-of-the-art methods. Notably, these include the Random-Forest based method by Hoie et al [2] which pioneered the use of the stability predictions and the MSA based conservation scores. Furthermore, we show that if after training on all the proteins that are not the tested protein, we further fine-tune, using only a few scored samples from the tested protein, we dramatically improve the prediction accuracy.


[1] “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences” (Rives et al., 2019)

[2] “Predicting and interpreting large-scale mutagenesis data using analyses of protein stability and conservation” (Hoie et al. 2020)