Abstract
The melting point of a protein characterizes its thermal stability, a critical factor in determining its suitability for various applications. For example, Ideonella sakaiensis produces a plastic-degrading enzyme, polyethylene terephthalate hydrolase, capable of breaking down polyethylene terephthalate (PET). However, the enzyme's low thermal stability limits its potential for industrial use. To address this, we fine-tuned a pre-trained protein language model to predict thermal stability from protein sequence and used it to identify mutations expected to improve stability. Two mutant proteins were designed and evaluated using molecular dynamics simulations. The simulations revealed that the mutated proteins were less stable than the wild type, an outcome that highlights the difficulty of accurately predicting mutation effects on protein stability with sequence-based models. Our findings underscore the need for further refinement of computational tools and for more comprehensive datasets to enhance protein engineering for industrial applications.
Keywords – Protein Language Model, Protein Engineering, Ideonella sakaiensis PETase, Thermal stability
Introduction
Polyethylene terephthalate (PET) is a polymer produced by the polymerization of ethylene glycol and dimethyl terephthalate, and it is one of the most common single-use plastics due to its low cost and strength1. Because PET is not readily biodegradable and takes years to decompose2, its buildup in landfills and as litter poses a threat to the environment3: as PET accumulates, it leads to the release of greenhouse gases such as CO2 and other organic compounds, most commonly methane, which accelerate global warming4. Although PET can be recycled through heat reprocessing, this alters its desirable mechanical properties, and chemical recycling is not viable because of its high cost5.
Ideonella sakaiensis, a bacterium discovered in 2016 by Yoshida and collaborators6, can use PET as a carbon source. The enzyme responsible for PET degradation, PETase (IsPETase), has been found to be more efficient and stable than other PET-degrading enzymes6. However, IsPETase does not have high thermal stability: with a melting temperature (Tm) of 45°C, it loses most of its activity within 24 h at 37°C, which prevents its effective use for PET degradation at an industrial scale7. For this reason, many researchers have investigated rational protein engineering techniques to increase its thermal stability8.
Current research focuses on laboratory methods for improving either the efficiency or the stability of the enzyme. Most efforts to make IsPETase thermostable rely on protein engineering techniques that substitute amino acid residues to increase the number of contacts within the protein, for example via hydrogen bonding, and thereby raise the Tm9, or on post-translational modifications such as glycosylation to improve thermal stability10.
In this paper, we aim to predict point mutations that improve the thermal stability (TS) of I. sakaiensis PETase using a pre-trained machine learning (ML) model, and to assess the effects of these variants in silico with molecular dynamics (MD) simulations. The advantage of a computational approach is that MD lets us control the environmental conditions without the cost and labour of setting up mutagenesis and thermal stability assays in the laboratory. Although computational methods have been used to predict protein TS, the high structural complexity of enzymes like IsPETase means these methods struggle to predict melting temperatures accurately. For example, Yang et al. proposed ProTstab2 11, an ML model that relies on substantial calculations to collect thousands of protein features for training; this increases the complexity of the model and makes it prone to overfitting. Other methods, such as analysing protein length and hydrogen bonding12, perform worse when predicting the effect of single amino acid substitutions. This is largely a consequence of limited experimental data: many factors that influence protein stability, however subtle, are difficult to detect unless many samples have been analysed. The state-of-the-art structure-prediction model AlphaFold has also been applied to protein stability prediction with little success, mainly because AlphaFold is designed to predict protein structures rather than to model the subtle effects of mutations on stability, and because it was trained on a database of experimentally resolved structures that lacks diversity13. To overcome these challenges, we fine-tune a transformer protein language model developed by Facebook Research14 that was pre-trained on hundreds of millions of unlabelled protein sequences in an unsupervised manner, allowing it to learn fundamental properties of proteins on its own. We then run MD simulations on the wild-type and mutant proteins to test the ML model's predictions by checking whether the mutants show increased stability.
Methodology
MD Simulations
Molecular dynamics (MD) simulations were used to assess the TS of our enzyme. The system was parameterized using the Amber99SB-ILDN15 force field for proteins, solvated with the TIP3P water model16, and neutralised in 0.15 M NaCl. To compare protein stability, two simulation systems were run at 300 K and 400 K for 100 ns. All simulations were carried out with GROMACS 202117. The trajectories were used to calculate the Q-value, which measures how much of the protein's native structure is retained over the simulation time. The Q-value was computed with the soft-cut function described by Best et al.18 and implemented in MDAnalysis19:
$$Q(t) = \frac{1}{N}\sum_{(i,j)} \frac{1}{1 + \exp\left[\beta\left(r_{ij}(t) - \lambda\, r_{ij}^{0}\right)\right]} \qquad (1)$$

where r_ij(t) and r_ij^0 are the contact distances between atoms i and j at time t and at time 0 (the reference structure), respectively, N is the number of native contacts, β is the softness of the switching function, and λ is the reference distance tolerance.
The Q-values were calculated for all inter- and intra-group combinations of three groups: hydrophilic (D, E, Q, N, R, K, H), hydrophobic (F, Y, W, L, V, I, M, C, P), and small (G, A, S, T).
Evidence shows that the average Q-value over an MD simulation run at elevated temperature, specifically 400 K, correlates well with the protein Tm, and that this correlation is strongest when the Q-value is calculated for the hydrophilic residues versus all residues20.
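This type of Q-value analysis can be carried out with the contact analysis module of MDAnalysis. The snippet below is a minimal sketch for the hydrophilic-versus-all combination only: the file names, the restriction to heavy atoms, the use of the first trajectory frame as the reference structure, and the β and λ values (shown here as the MDAnalysis defaults) are illustrative assumptions rather than the exact settings used in this study.

```python
import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis import contacts

# Hypothetical file names for one of the GROMACS runs.
u = mda.Universe("petase.tpr", "petase_400K.xtc")

# Residue groups from the Methods (hydrophilic set, written as three-letter codes).
HYDROPHILIC = "resname ASP GLU GLN ASN ARG LYS HIS"
sel_a = f"({HYDROPHILIC}) and not name H*"   # hydrophilic heavy atoms
sel_b = "protein and not name H*"            # all protein heavy atoms

# Reference contacts taken from the first frame of the trajectory.
u.trajectory[0]
ref_a = u.select_atoms(sel_a)
ref_b = u.select_atoms(sel_b)

# Soft-cut native-contact analysis (Best et al.); beta and lambda_constant
# correspond to the softness and tolerance of Eq. 1 (MDAnalysis defaults shown,
# not necessarily the values used in this work).
q_analysis = contacts.Contacts(
    u,
    select=(sel_a, sel_b),
    refgroup=(ref_a, ref_b),
    radius=4.5,
    method="soft_cut",
    kwargs={"beta": 5.0, "lambda_constant": 1.8},
).run()

q_values = q_analysis.results.timeseries[:, 1]   # Q(t) for every frame
print(f"average Q (hydrophilic vs all): {np.mean(q_values):.2f}")
```

The same analysis would be repeated for each pair of residue groups listed above to fill a table of group-versus-group Q-values.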
Protein language model fine-tuning
We used ESM2, a pre-trained transformer protein language model (PLM) with 35 million parameters. The model is a modified BERT-like encoder transformer with 12 layers of 20 attention heads each21. We fine-tuned the model on the Meltome atlas database of protein melting temperatures22, using the Kaggle notebook platform with GPU acceleration (https://www.kaggle.com). Due to memory limitations, sequences longer than 512 amino acids were omitted, leaving a dataset of 22,971 proteins, which was split into train, validation, and test sets (80%, 10%, and 10% of the total database, respectively). The model was trained for 10 epochs, and the best model was selected at the end of training using the validation loss as a metric. Additionally, mean squared error (MSE), mean absolute error (MAE), and R2 were calculated to assess the accuracy of the regression and allow comparison with similar methods. MSE was calculated as the mean of the squared errors (Eq. 2) and MAE as the mean of the absolute errors (Eq. 3); R2 was calculated using the standard formula for the coefficient of determination (Eq. 4). The 95% confidence interval (CI) of each metric was estimated by resampling the test set using bootstrapping with 5000 iterations.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2 \qquad (2)$$

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \qquad (3)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2} \qquad (4)$$

where y_i is the experimental melting temperature of protein i, ŷ_i the predicted value, ȳ the mean of the experimental values, and n the number of proteins in the evaluation set.
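For readers wishing to reproduce the fine-tuning, the procedure can be sketched with the Hugging Face transformers implementation of ESM2. This is a minimal outline, not the exact training script: the checkpoint name facebook/esm2_t12_35M_UR50D corresponds to the 35M-parameter ESM2 model described above, while the batch size, optimiser defaults, and the tiny placeholder datasets (standing in for the filtered Meltome atlas splits) are assumptions for illustration.

```python
import numpy as np
from datasets import Dataset
from transformers import (AutoTokenizer, EsmForSequenceClassification,
                          Trainer, TrainingArguments)

MODEL_NAME = "facebook/esm2_t12_35M_UR50D"  # 35M-parameter ESM2 checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Single-output regression head on top of the pre-trained encoder.
model = EsmForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1, problem_type="regression")

def make_dataset(sequences, tm_values):
    """Tokenise sequences (truncated to 512 residues) and attach Tm labels."""
    enc = tokenizer(list(sequences), truncation=True, max_length=512)
    enc["labels"] = [float(t) for t in tm_values]
    return Dataset.from_dict(enc)

def compute_metrics(eval_pred):
    """MSE, MAE and R2 of Eqs. 2-4 on the evaluation split."""
    preds, labels = eval_pred
    preds, labels = np.asarray(preds).squeeze(), np.asarray(labels)
    err = preds - labels
    mse = float(np.mean(err ** 2))                                        # Eq. 2
    mae = float(np.mean(np.abs(err)))                                     # Eq. 3
    r2 = 1.0 - np.sum(err ** 2) / np.sum((labels - labels.mean()) ** 2)   # Eq. 4
    return {"mse": mse, "mae": mae, "r2": float(r2)}

# Tiny placeholder sequences standing in for the 80%/10% Meltome atlas splits.
train_ds = make_dataset(["MKTLLVAGLLAESSAA", "AGVKKHEWYQRLMPLV"], [52.3, 61.7])
val_ds = make_dataset(["MSTPAYLLGKKQWERT", "GGSEELKRAFDNVHPW"], [48.9, 55.4])

args = TrainingArguments(
    output_dir="esm2_meltome",
    num_train_epochs=10,
    per_device_train_batch_size=8,       # assumed; limited by GPU memory
    evaluation_strategy="epoch",         # eval_strategy in newer transformers versions
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",   # best checkpoint chosen on validation loss
)

trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=val_ds,
                  tokenizer=tokenizer, compute_metrics=compute_metrics)
trainer.train()
```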
Results & Discussion
Table 1: Comparison of metrics for the fine-tuned PLM and ProTstab2

| Metric | Fine-tuned PLM (95% CI) | ProTstab2 |
| --- | --- | --- |
| MSE | 47.59 (43.99-53.94) | 82.752 |
| MAE | 5.08 (4.82-5.22) | 6.934 |
| R2 | 0.63 (0.59-0.675) | 0.580 |
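The 95% confidence intervals in Table 1 were obtained by bootstrapping the held-out test set (5000 resamples, see Methods). Below is a minimal sketch of such a percentile bootstrap; the arrays y_test and y_pred are placeholders for the experimental and predicted melting temperatures of the test proteins.

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric_fn, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a regression metric."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample the test set with replacement
        stats[b] = metric_fn(y_true[idx], y_pred[idx])
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

mae = lambda yt, yp: float(np.mean(np.abs(yt - yp)))

# Placeholder arrays standing in for the test-set targets and predictions.
y_test = np.array([45.2, 58.1, 63.4, 50.7, 71.9])
y_pred = np.array([47.0, 55.3, 60.1, 52.2, 69.5])
print(bootstrap_ci(y_test, y_pred, mae))   # 95% CI for the MAE
```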
We fine-tuned the model for 10 epochs; after the third epoch, its accuracy did not improve any further. At the end of training, the best model was chosen based on the MAE.
We performed a deep mutational scan, predicting the Tm of every possible single-point mutant while excluding the residues of the binding site (Y87, W159, S160, M161, W185, C203, D206, I208, H237, S238, C239, N241, R280)10, and found that even the best mutants showed only a very slight increase over the wild type. For the best mutant, our model predicted a Tm less than 2°C higher than that predicted for the wild-type protein (54.49°C vs 52.8°C, respectively). We then calculated the Q-values of two multiple mutants using MD simulations. One mutant combined the 10 best single substitutions predicted by the model (N30W, E44R, P49R, T51R, T72R, N73R, G75R, G76R, G163Y, E274Y); the other was built by iteratively modifying the protein over 10 rounds (T73R, P86W, W97D, G99I, G255I, E275Y, S279C), in which the ML model was used to predict the most favorable single modification at each step and the mutated protein was fed back into the model to predict the next optimal modification (a minimal sketch of both procedures is given after Table 2). As shown in Table 2, the wild-type protein showed higher Q-values than the mutant proteins for every combination of amino acid groups. In both mutant proteins at 400 K, contacts were best preserved between the small amino acids, with Q-values of around 0.78 and 0.89 for the 10-best and iterative-10 mutants, respectively; however, the corresponding Q-value for the wild-type protein was 0.89, implying that neither mutant improved the preservation of native contacts over the wild type.
Table 2: Q-values for the wild-type and mutant proteins at 400 K, calculated between combinations of amino acid groups

| Contact groups | Wild-type | Mutant 10-best | Mutant iterative-10 |
| --- | --- | --- | --- |
| hydrophobic-hydrophobic | 0.86 ± 0.08 | 0.75 ± 0.13 | 0.85 ± 0.07 |
| hydrophobic-small | 0.87 ± 0.06 | 0.77 ± 0.10 | 0.86 ± 0.05 |
| hydrophilic-hydrophobic | 0.79 ± 0.07 | 0.68 ± 0.11 | 0.75 ± 0.06 |
| hydrophilic-hydrophilic | 0.82 ± 0.06 | 0.70 ± 0.11 | 0.79 ± 0.06 |
| hydrophilic-small | 0.80 ± 0.06 | 0.69 ± 0.11 | 0.76 ± 0.06 |
| small-small | 0.89 ± 0.06 | 0.78 ± 0.11 | 0.89 ± 0.06 |
| all-hydrophobic | 0.85 ± 0.06 | 0.76 ± 0.10 | 0.84 ± 0.05 |
| all-hydrophilic | 0.82 ± 0.05 | 0.73 ± 0.09 | 0.80 ± 0.05 |
| all-small | 0.85 ± 0.06 | 0.76 ± 0.09 | 0.84 ± 0.05 |
| all-all | 0.86 ± 0.07 | 0.74 ± 0.11 | 0.84 ± 0.07 |
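As referenced above, the deep mutational scan and the iterative (greedy) design can be outlined as follows. This is a simplified sketch assuming the fine-tuned model and tokenizer from the Methods section; helper names such as predict_tm are ours, sequence positions are 1-based, and practical details (batching the thousands of mutant sequences, GPU placement) are omitted.

```python
import torch

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Binding-site positions excluded from the scan (from the text).
BINDING_SITE = {87, 159, 160, 161, 185, 203, 206, 208, 237, 238, 239, 241, 280}

def predict_tm(model, tokenizer, sequence):
    """Predicted melting temperature for one sequence (regression head output)."""
    enc = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        return model(**enc).logits.item()

def deep_mutational_scan(model, tokenizer, sequence):
    """Predict Tm for every single-point mutant outside the binding site."""
    results = {}
    for pos, wt_aa in enumerate(sequence, start=1):
        if pos in BINDING_SITE:
            continue
        for aa in AMINO_ACIDS:
            if aa == wt_aa:
                continue
            mutant = sequence[:pos - 1] + aa + sequence[pos:]
            results[f"{wt_aa}{pos}{aa}"] = predict_tm(model, tokenizer, mutant)
    return results

def iterative_design(model, tokenizer, wild_type, n_rounds=10):
    """Greedy design: apply the single best predicted mutation, then rescan."""
    seq, applied = wild_type, []
    for _ in range(n_rounds):
        scan = deep_mutational_scan(model, tokenizer, seq)
        best = max(scan, key=scan.get)            # mutation with highest predicted Tm
        applied.append(best)
        pos, aa = int(best[1:-1]), best[-1]
        seq = seq[:pos - 1] + aa + seq[pos:]      # apply it before the next round
    return seq, applied

# Usage (with the fine-tuned model/tokenizer and the IsPETase sequence):
# scan = deep_mutational_scan(model, tokenizer, wild_type_sequence)
# best_10 = sorted(scan, key=scan.get, reverse=True)[:10]
# designed_seq, steps = iterative_design(model, tokenizer, wild_type_sequence)
```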
Pre-trained protein language models have been successfully applied to several tasks, such as structure prediction, prediction of variant effects on protein function, and prediction of binding residues, either with or without fine-tuning23,24,25,26,27,28, and they appear to generalise better to complex downstream tasks when some extra data is provided for fine-tuning.
Compared to a previously published thermal stability machine learning model, which was trained on the same dataset (ProTstab2)11, our model had a lower MSE and MAE, as well as a higher R2 value, indicating greater accuracy and better correlation than the existing model on the test dataset.
The predicted mutants obtained from the PLM did not show any improvement in TS over the wild-type protein. There are several potential reasons why the model failed to predict an improved protein. Models trained on wild-type protein libraries, such as the one used in this work, are designed to predict properties of natural sequences, which does not align well with the task of predicting the properties of an engineered protein. One example is recent work that used AlphaFold, a model trained to predict protein structures, to predict the effect of point mutations on protein stability, which proved quite inaccurate13.
Recently, Chu et al.29 took a similar approach, fine-tuning the ESM2 protein language model, but to predict general stability (folding ΔΔG) of small protein domains; their model generalised well to small proteins outside the training set but was less accurate for larger proteins such as the enzyme used in this study.
Additionally, even if a model can predict the effect of a single-point mutation, the predicted effects of several mutations are not necessarily additive. For example, if mutations X and Y each improve the Tm of a protein, their combination will not necessarily improve the Tm even further, because the mutated residues can interact in non-trivial ways that are not captured by the algorithm30.
Furthermore, when fine-tuning the model for 10 epochs, we observed that the validation loss stopped improving after the third epoch while the training loss kept decreasing, indicating that the model had shifted from generalising the problem of predicting sequence Tm to fitting the training data as closely as possible. This overfitting after only a few epochs suggests that the model is too complex, that the dataset is too small, or, more likely, a combination of the two, since more complex machine learning models need more data to learn properly31.
Conclusion
The machine learning model used in this work achieved better accuracy than older models; however, we were unable to predict a mutant with improved Q-values, and the predicted melting temperature of the best mutants increased by only a few degrees Celsius. This shows that, even though state-of-the-art protein language models can be used for several tasks on native proteins, they still struggle to predict the effects of mutations, owing to the limitations of their algorithms and the data they were trained on. This highlights the importance of diverse datasets that accurately represent the properties of both wild-type and mutant proteins for achieving better predictions in protein engineering tasks.
References
- K. Gopalakrishna, N. Reddy. Regulations on Recycling PET Bottles. Recycling of Polyethylene Terephthalate Bottles. 23-25 (2019).
- R. Muller, I. Kleeberg, W. D. Deckwer. Biodegradation of polyesters containing aromatic constituents. Journal of Biotechnology. 86, 87-95 (2001).
- S. Sharma, S. Chatterjee. Microplastic pollution, a threat to marine ecosystem and human health: a short review. Environmental Science and Pollution Research. 24, 21530-21547 (2017).
- J. Diao, Y. Hu, Y. Tian, R. Carr, T. Moon. Upcycling of poly(ethylene terephthalate) to produce high-value bio-products. Cell Reports. 42, (2023).
- K. Hiraga, I. Taniguchi, S. Yoshida, Y. Kimura, K. Oda. Biodegradation of waste PET. EMBO Reports. 20, (2019).
- S. Yoshida, K. Hiraga, T. Takehana, I. Taniguchi, H. Yamaji, Y. Maeda, et al. A bacterium that degrades and assimilates poly(ethylene terephthalate). Science. 351, (2016).
- S. Brott, L. Pfaff, J. Schuricht, J. N. Schwarz, D. Böttcher, C. P. S. Badenhorst, et al. Engineering and evaluation of thermostable IsPETase variants for PET degradation. Engineering in Life Sciences. 22, 192-203 (2022).
- H. T. H. Nguyen, P. Qi, M. Rostagno, A. Feteha, S. A. Miller. The quest for high glass transition temperature bioplastics. Journal of Materials Chemistry A. 6, 9298-9331 (2018).
- H. F. Son, I. J. Cho, S. Joo, H. Seo, H. Y. Sagong, S. Y. Choi, et al. Rational Protein Engineering of Thermo-Stable PETase from Ideonella sakaiensis for Highly Efficient PET Degradation. ACS Catalysis. (2019).
- X. Han, W. Liu, J. W. Huang, J. Ma, Y. Zheng, T. P. Ko, et al. Structural insight into catalytic mechanism of PET hydrolase. Frontiers in Bioengineering and Biotechnology. 11, (2023).
- Y. Yang, et al. ProTstab2 for Prediction of Protein Thermal Stabilities. International Journal of Molecular Sciences. 23, (2022).
- K. Ghosh, K. Dill. Computing protein stabilities from their chain lengths. Biophysics and Computational Biology. 106, (2009); M. Prevost, et al. Contribution of the hydrophobic effect to protein stability: analysis based on simulations of the Ile-96 → Ala mutation in barnase. Proceedings of the National Academy of Sciences of the United States of America. 88, (1991).
- M. Pak, et al. Using AlphaFold to predict the impact of single mutations on protein stability and function. PLoS ONE. 18, (2023).
- A. Rives, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences of the United States of America. 118, (2021); Z. Lin, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 379, (2023).
- K. Lindorff-Larsen, et al. Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins. 78, (2010).
- W. Jorgensen, J. Chandrasekhar, J. Madura, R. Impey. Comparison of simple potential functions for simulating liquid water. The Journal of Chemical Physics. 79, (1983).
- M. J. Abraham, et al. GROMACS: High performance molecular simulations through multi-level parallelism from laptops to supercomputers. SoftwareX. 1-2, 19-25 (2015).
- R. Best, et al. Native contacts determine protein folding mechanisms in atomistic simulations. Proceedings of the National Academy of Sciences of the United States of America. 110, (2013).
- R. J. Gowers, et al. MDAnalysis: A Python Package for the Rapid Analysis of Molecular Dynamics Simulations. http://conference.scipy.org.s3-website-us-east-1.amazonaws.com/proceedings/scipy2016/oliver_beckstein.html (2019).
- D. Kuroda, et al. Computer-aided antibody design. Protein Engineering, Design & Selection. 25, (2012).
- Z. Lin, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 379, (2023).
- A. Jarzab, et al. Meltome atlas: thermal proteome stability across the tree of life. Nature Methods. 17, (2020).
- C. Hsu, R. Verkuil, J. Liu, Z. Lin, B. Hie, T. Sercu, et al. Learning inverse folding from millions of predicted structures. bioRxiv. (2022).
- K. Weissenow, et al. Protein language-model embeddings for fast, accurate, and alignment-free protein structure prediction. Structure (London, England: 1993). 30, (2022).
- M. Heinzinger, et al. Contrastive learning on protein embeddings enlightens midnight zone. NAR Genomics and Bioinformatics. 4, (2022).
- C. Marquet, et al. Embeddings from protein language models predict conservation and variant effects. Human Genetics. 141, (2022).
- M. Littmann, et al. Protein embeddings and deep learning predict binding residues for various ligand classes. Scientific Reports. 11, (2021).
- H. Stärk, et al. Light attention predicts protein location from the language of life. Bioinformatics Advances. 1, (2021).
- S. Chu, et al. Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset. PLoS Computational Biology. (2024).
- F. Van Der Flier, D. Estell, S. Pricelius, L. Dankmeyer, S. Van, S. Thans, et al. What makes the effect of protein mutations difficult to predict? bioRxiv. (2023).
- N. Liu, S. Li, Y. Du, J. B. Tenenbaum, A. Torralba. Learning to Compose Visual Relations. arXiv. (2021).