Abstract
The grand challenge of biology for the last 50 years has been to find a method to reliably predict how proteins fold. Mapping the folded (or tertiary) structure of proteins from their primary structure allows us to understand their function, a critical task. While artificial intelligence tools that reliably predict tertiary structure exist, predicting the function of a protein from its primary structure is not as well explored. In this paper, we explore multiple artificial intelligence approaches to predicting protein function, including a standard deep learning approach and a Long Short-Term Memory (LSTM) approach. We use a CountVectorizer approach to process the sequence data for the deep learning model, and assign each amino acid a number when processing the data for the LSTM. Our deep learning model achieved a maximum accuracy of 0.6091 and the LSTM model a maximum accuracy of 0.7415. Our results suggest there is potential for stronger performance in predicting protein function, especially if other model types are explored further.
Key Words: Proteins, Protein folding, Machine learning, Artificial Intelligence, Deep learning, Recurrent neural networks, Long short term memory, Computational Biology, Bioinformatics
Introduction
For the last 50 years, accurate prediction of how amino acid chains fold into proteins has been a grand challenge of biology1. This is because modeling the folded, or tertiary, structure of proteins offers valuable insights into the function of proteins2. In particular, the tertiary structure of proteins is essential knowledge for the drug development industry, as the tertiary structure informs researchers of the function of proteins that are present in antibiotics3. Numerous traditional methods have therefore been developed for experimentally determining the tertiary structure, including X-Ray Crystallography, Nuclear Magnetic Resonance (NMR), Cryogenic Electron Microscopy, and Dual Polarisation Interferometry3. However, each of these methods has its drawbacks, primarily that they are expensive and time-intensive processes4.
In contrast, the primary structure, the chain of amino acids that makes up the protein, is much simpler to determine through experimental methods such as mass spectrometry. While the primary structure determines the tertiary structure5, accurately predicting how this chain of amino acids folds into a 3D structure remains a major problem in modern biology. As such, artificial intelligence has been consistently applied in attempts to solve this problem. In 2020, Google’s research subsidiary DeepMind introduced AlphaFold, a novel deep learning model that takes the primary structure of a protein as input and outputs the tertiary structure with high accuracy6. AlphaFold utilizes the attention-based architecture first introduced by Google in 2017 in the paper “Attention Is All You Need”7, which enables its high accuracy by analyzing long sequences of data with less loss of essential information.
AlphaFold2 (the second generation of the model) performed exceptionally well in the 14th Critical Assessment of protein Structure Prediction (CASP14), receiving a median score of 92.4 on the Global Distance Test (GDT)8. CASP is considered the gold standard competition for computational biology methods that predict the tertiary structure of proteins. John Moult, a co-founder of CASP, considered AlphaFold2’s GDT score to be comparable to experimental methods without the tedious and expensive lab analysis they require9. In 2022, DeepMind announced that AlphaFold had predicted the structures of nearly all proteins known to science: over 214 million structures10,11. DeepMind claims that its partners are already using AlphaFold to solve real-world problems, including breaking down single-use plastics and finding new drugs to treat liver cancer12.
Beyond reconstructing the tertiary structure of a protein, artificial intelligence has also been applied to predicting protein function from structure. DeepFRI, a Graph Convolutional Network, predicts protein function using the tertiary structure as input. By leveraging a pre-trained, task-agnostic language model for sequence features, DeepFRI is able to reliably predict function even when the input structure is computationally inferred. At the time of its release in 2021, DeepFRI outperformed many state-of-the-art models13.
However, the exciting advancements from combining methods such as AlphaFold and DeepFRI raise questions about how else AI can be applied to this grand challenge of biology. The ultimate goal of predicting the tertiary structure is to understand the function of the protein. This encourages us to ask whether we can skip the intermediate step entirely and reliably predict the function of a protein using only the primary structure as input. While function prediction from the tertiary structure will likely be more accurate, it is computationally more costly and less accessible. Thus, it is worthwhile to explore how well protein function can be predicted from primary structure alone.
Predicting protein function directly from primary structure, skipping the intermediate step of finding the tertiary structure, is not as well explored as predicting tertiary structure from primary structure. There has been some work using transformer/attention-based architectures to predict protein function from primary structure. For example, ReLSO, a transformer-based model, can predict “fitness” for a protein sequence, which represents its general functionality/foldability14. This model does not, however, classify proteins into specific types of functions. In this paper, we therefore examine different model architectures in an attempt to predict the function of the protein resulting from an amino acid chain without predicting the tertiary structure. Specifically, we study two approaches that consider the primary structure of a protein, the sequence of amino acids. The first approach, a standard neural network, treats the protein as an unordered collection of amino acids: its Bag of Words representation of the sequence is based on amino acid frequency, not order. The second approach, based on a Long Short-Term Memory model, does take the order into account. Our paper thus aims to answer the question: to what extent is the primary structure of a protein sufficient to determine its function? As our results show, taking the order of the amino acids into account is important for accurate protein function prediction.
Our paper is organized as follows. First, we introduce the data and the different machine learning models used. Next, we present the accuracy each of our methods achieved on the dataset. Lastly, we discuss the implications of these results and consider possible limitations of our work.
Methodology
In this section, we provide an overview of the two machine learning methods evaluated for predicting protein function from the primary structure alone, as well as the data used to evaluate the models.
Data
The data we use for this paper was taken from Kaggle. The dataset, named Structural Protein Sequences, contains the primary structures of numerous proteins together with a classification of each protein’s function15. The primary structures were taken from the Protein Data Bank, where accurate experimental methods such as X-ray crystallography, NMR techniques, and cryo-electron microscopy were used to determine the sequences16. There is no metadata in the dataset besides a protein ID, which is irrelevant for this paper. After filtering out null values, the dataset has 346,321 amino acid chain and classification pairs. There are 4,468 distinct classifications, but many of these contain only one or a few corresponding amino acid chains. We filter out any classification with fewer than 100 entries, as such classifications are suboptimal for model training: they would create a significant class imbalance, and functions with only a few corresponding proteins would be ignored by the model. Given the size of the dataset, 100 is a reasonable minimum number of proteins to require for a classification to be represented by the model. Additionally, we remove any classification that does not directly describe a type of protein function, keeping only classifications ending in the word ‘PROTEIN’, which in this dataset label a specific protein that carries out a unique function. No relevant classifications were ignored. This leaves us with 35 classifications: [‘CALCIUM-BINDING PROTEIN’, ‘VIRAL PROTEIN’, ‘GLYCOPROTEIN’, ‘RIBOSOMAL PROTEIN’, ‘LIPID BINDING PROTEIN’, ‘DNA BINDING PROTEIN’, ‘MEMBRANE PROTEIN’, ‘CONTRACTILE PROTEIN’, ‘STRUCTURAL PROTEIN’, ‘RNA BINDING PROTEIN’, ‘FLAVOPROTEIN’, ‘ANTIVIRAL PROTEIN’, ‘SIGNALING PROTEIN’, ‘BIOTIN-BINDING PROTEIN’, ‘PEPTIDE BINDING PROTEIN’, ‘SUGAR BINDING PROTEIN’, ‘METAL BINDING PROTEIN’, ‘TRANSPORT PROTEIN’, ‘PLANT PROTEIN’, ‘LUMINESCENT PROTEIN’, ‘MOTOR PROTEIN’, ‘DE NOVO PROTEIN’, ‘ANTIMICROBIAL PROTEIN’, ‘ANTITUMOR PROTEIN’, ‘VIRUS/VIRAL PROTEIN’, ‘LIGAND BINDING PROTEIN’, ‘FLUORESCENT PROTEIN’, ‘BIOSYNTHETIC PROTEIN’, ‘NUCLEAR PROTEIN’, ‘CIRCADIAN CLOCK PROTEIN’, ‘BIOTIN BINDING PROTEIN’, ‘IMMUNE SYSTEM/VIRAL PROTEIN’, ‘MEMBRANE PROTEIN, TRANSPORT PROTEIN’, ‘HYDROLASE/TRANSPORT PROTEIN’, ‘ACETYLCHOLINE-BINDING PROTEIN’]
In total, we have 56,396 records within these 35 classifications. The data is stored in multiple ways depending on the type of model; we primarily use PyTorch tensors to store and process it.
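As an illustration, this filtering can be expressed in a few lines of pandas. The following is a minimal sketch, assuming the Kaggle data has been loaded into a DataFrame with columns named sequence and classification (the column names are assumptions for illustration, not necessarily those in the original files):

```python
import pandas as pd

# Load the Kaggle data and drop rows with missing values.
# The column names "sequence" and "classification" are assumed for illustration.
df = pd.read_csv("protein_data.csv").dropna(subset=["sequence", "classification"])

# Keep only classifications ending in "PROTEIN", i.e. those that directly
# describe a type of protein function in this dataset.
df = df[df["classification"].str.upper().str.endswith("PROTEIN")]

# Keep only classifications with at least 100 examples to limit class imbalance.
counts = df["classification"].value_counts()
df = df[df["classification"].isin(counts[counts >= 100].index)]
```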
One limitation of the dataset is that we filter out everything but 35 distinct classifications, a potential source of bias. In practice, this means our model can only be applied to a primary structure if we already know it falls within one of these classifications. A more useful approach would be to use data covering many more types of proteins in order to build a more flexible model; this can be expanded on with access to broader data in the future.
Methods
We utilize two primary architectures to create our models. The first is a simple deep learning model with linear layers that processes the chain as a whole, using count-based vectors as input. The second is a Long Short-Term Memory (LSTM) model, a type of Recurrent Neural Network (RNN) that processes each acid in the chain sequentially. This way, we can examine the difference in accuracy between processing each acid individually and processing the chain as a whole. Both models were implemented using the PyTorch framework17 and were run on a T4 GPU in Google Colab. All code used for the experiments is available online.
Deep Neural Network
Our first model is a standard Deep Neural Network (DNN) with fully connected layers18,19,20. We process our input, which is initially a string with one character per acid in the chain, using sklearn’s CountVectorizer. This Bag of Words representation provides a count-based vector for each string. We feed this vector into a PyTorch model with a varying number of linear layers of varying sizes, which we discuss further below. We train with the Adam optimizer, using a learning rate that changes during training, and a DataLoader that processes the input data in batches of 64 entries.
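A minimal sketch of this preprocessing is shown below, assuming sequences is a list of amino acid strings and labels is a tensor of class indices (both names are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from torch.utils.data import DataLoader, TensorDataset
import torch

# Character-level Bag of Words: each amino acid letter becomes one vocabulary entry,
# and each sequence becomes a vector of per-acid counts. `sequences` (list of strings)
# and `labels` (LongTensor of class indices) are assumed to exist already.
vectorizer = CountVectorizer(analyzer="char", lowercase=False)
counts = vectorizer.fit_transform(sequences)                   # sparse (n_samples, n_distinct_acids)
features = torch.tensor(counts.toarray(), dtype=torch.float32)

# Batches of 64, as in the experiments described above.
loader = DataLoader(TensorDataset(features, labels), batch_size=64, shuffle=True)
```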
One of the main hyperparameters for deep neural networks is the number of layers and the number of nodes per layer. In the final iteration of our deep neural network, we have eight layers in total. Our input layer is of size 24, reflecting the dimensions of our input matrix. The next seven layers are of sizes 100, 200, 250, 150, 125, 75, and 50, in that order. Our final output layer is of size 35, matching the 35 different classifications our model can output. Each linear layer except the output layer is followed by a ReLU activation. Of the many variations we tried, the model with these hyperparameters showed the best results.
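A minimal PyTorch sketch of this layer stack (the exact module structure in our code may differ, but the layer sizes are those listed above):

```python
import torch.nn as nn

# Input of size 24, hidden layers 100-200-250-150-125-75-50, output of size 35,
# with a ReLU after every layer except the output.
model = nn.Sequential(
    nn.Linear(24, 100), nn.ReLU(),
    nn.Linear(100, 200), nn.ReLU(),
    nn.Linear(200, 250), nn.ReLU(),
    nn.Linear(250, 150), nn.ReLU(),
    nn.Linear(150, 125), nn.ReLU(),
    nn.Linear(125, 75), nn.ReLU(),
    nn.Linear(75, 50), nn.ReLU(),
    nn.Linear(50, 35),  # logits for the 35 classifications
)
```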
The model runs for 250 epochs, and we record the loss and accuracy on the training and test data after each epoch. For the first 50 epochs, the model trains at a higher learning rate of 0.0005; after 50 epochs, there is a scheduled decrease to 0.0001 for the remaining 200 epochs. This schedule allows the model to make larger optimization steps in the first 50 epochs while reducing the impact of overfitting over the next 200: by taking smaller steps when overfitting generally occurs, a scheduled learning rate decrease helps mitigate its impact (Li, Katherine. “How to Choose a Learning Rate Scheduler for Neural Networks.” neptune.ai, 9 Aug. 2023, neptune.ai/blog/how-to-choose-a-learning-rate-scheduler. Accessed 7 Sep. 2024). In practice, this strategy yielded higher accuracy and lower loss.
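A sketch of this schedule, assuming the model and loader defined above and a hypothetical per-epoch training helper train_one_epoch:

```python
import torch

# Adam starts at the higher learning rate; after epoch 50 the rate drops to 0.0001.
# `model`, `loader`, and `train_one_epoch` are assumed to be defined elsewhere.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0005)

for epoch in range(250):
    if epoch == 50:
        for group in optimizer.param_groups:
            group["lr"] = 0.0001
    train_one_epoch(model, loader, optimizer)
```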
Long Short-Term Memory Model
Our second model is a more complex Long Short-Term Memory (LSTM) approach21,22,23. To process our input, each chain is stored as a list, with each character occupying one index. We convert each acid into an integer from 1 to 26; encoding an entire chain as a single number is not feasible, since with the large length of amino acid chains the value would overflow the float datatype and be processed as “infinity”, so we keep one integer per position instead. Due to computing constraints, we only process chains with 32 acids or fewer. While there is a concern that this could lead to data loss, this wasn’t observed in practice. If a chain is shorter than 32 acids, we pad the end of its list with 0s until it reaches length 32 for smoother processing. Because the padding uses a different token from the amino acids, the model learns to ignore these parts of the input. We store all these lists in a tensor, and when feeding the data into our model, we use a data loader with a varying batch size.
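A sketch of this encoding and padding, assuming a simple letter-to-index mapping (the exact mapping we used is an implementation detail, and sequences is again an assumed list of amino acid strings):

```python
import torch

# 0 is reserved for padding; letters map to 1..26. The specific letter-to-integer
# mapping here is illustrative only.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
MAX_LEN = 32

def encode(sequence: str) -> torch.Tensor:
    """Map each residue to an integer in 1..26 and pad with 0s to length 32."""
    ids = [ALPHABET.index(ch) + 1 for ch in sequence.upper()]
    ids += [0] * (MAX_LEN - len(ids))
    return torch.tensor(ids, dtype=torch.long)

# Chains longer than 32 acids are excluded, as described above.
encoded = torch.stack([encode(s) for s in sequences if len(s) <= MAX_LEN])
```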
In the final iteration of our second model, we have an embedding dimension of 32, a hidden dimension of 128, and 7 stacked LSTM layers. Additionally, our data loader uses batches of 64, and the optimizer uses a learning rate of 0.0005 for the first 50 epochs and 0.0001 thereafter. We run the model for 2000 epochs.
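A minimal PyTorch sketch of an LSTM classifier with these hyperparameters; the exact module structure of our implementation may differ, but the dimensions match those listed above:

```python
import torch.nn as nn

class ProteinLSTM(nn.Module):
    """Sketch: embedding dim 32, hidden dim 128, 7 stacked LSTM layers, 35 classes."""

    def __init__(self, vocab_size=27, embed_dim=32, hidden_dim=128,
                 num_layers=7, num_classes=35):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):                 # x: (batch, 32) integer-encoded chains
        emb = self.embed(x)               # (batch, 32, embed_dim)
        _, (h_n, _) = self.lstm(emb)      # h_n: (num_layers, batch, hidden_dim)
        return self.fc(h_n[-1])           # logits from the final layer's last hidden state
```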
Results
In the following, we discuss the results of our experiments with the two deep learning architectures. For each model, we report the final accuracy and present the training curves.
Deep Neural Network
Using a deep neural network to predict the function of a protein purely from the set of amino acids it is made of, we find moderate results. As visualized in Fig. 2, after 250 epochs we find a high final accuracy on the training data and a considerably lower accuracy on the testing data. This gap indicates severe overfitting, a common issue for this type of architecture; in the Limitations and Future Work section we consider approaches for overcoming it. Interestingly, our testing accuracy actually decreases shortly after switching from the higher learning rate of 0.0005 to the lower rate of 0.0001. Regardless, the model performs reasonably well, reaching a maximum accuracy of 0.6091.
Long Short-Term Memory
When including the sequence information in addition to the set of amino acids, the second approach, a Long Short-Term Memory model, shows stronger results. After 2000 epochs, the model reaches a better maximum test accuracy of 0.7415. Additionally, while we once again see higher training than testing accuracy, indicating overfitting, the overfitting does not appear as significant as in the previous model. Using larger datasets and dropout techniques could mitigate the overfitting; these techniques are explained in more detail in the Limitations and Future Work section.
Conclusion
In this work, we have studied the difference in performance between two machine learning models for predicting protein function without taking the tertiary structure into account. Specifically, we have compared a standard fully connected Deep Neural Network with a Long Short-Term Memory model. These two approaches differ primarily in how they take the amino acid sequence into account.
Notably, we found that the Long Short-Term Memory model significantly outperforms the Deep Neural Network in terms of accuracy, with a test accuracy of 0.7415 (LSTM) compared to 0.6091 (Deep Neural Network). This result is in line with what we would expect, as Long Short-Term Memory models retain more information than Deep Neural Networks and have access to the order of the amino acid sequence. Since protein sequences are sequential data, LSTMs are well suited to this problem: they process sequences recursively and can carry useful information from earlier parts of the sequence into their predictions. As such, for future research into predicting protein function without taking the tertiary structure into account, we recommend the use of recurrent neural architectures such as the LSTM model.
Limitations and Future Work
Due to computational restrictions, a number of limitations were placed on the study that we believe would be important to lift in future work. Most notably, we looked at a smaller dataset and only utilized 35 classifications for functions. For more practical models, we would need a much larger dataset with many more classifications to truly understand the intricacies of different chains and how they impact function. Additionally, using only 35 classifications means numerous rarer types of proteins are disregarded in favor of more common ones. Moreover, training for more epochs with smaller learning rates might have provided stronger results.
Additionally, certain amino acids have inherent similarities24. This could skew our results and lead to inconsistencies when our models handle these similar acids. Examining how our models approach these cases could allow us to improve the models’ performance.
Overfitting was observed in our models, with training and test accuracy and loss eventually diverging, as seen in Figures 2 and 3. We can mitigate the impact of overfitting with access to larger datasets. Another potential method for future approaches is dropout, where random neurons in the neural network are deactivated during training to reduce dependence on specific neurons (Jain, Abhishek. “All About ANNs and Dropout – Abhishek Jain – Medium.” Medium, 10 Feb. 2024, medium.com/@abhishekjainindore24/all-about-anns-and-dropout-39099dbc9f7e. Accessed 7 Sep. 2024). Furthermore, we believe our results can be further improved by considering different configurations of the models studied here (e.g. more layers and nodes) as well as different architectures altogether. For example, the attention-based models mentioned in the introduction could provide an improvement over the LSTM architecture. A more thorough examination of other models could provide more insight into the feasibility of predicting protein function from primary structure.
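As a minimal sketch (not part of the models trained in this paper), dropout can be inserted between fully connected layers in PyTorch as follows, with the dropout probability as a tunable hyperparameter:

```python
import torch.nn as nn

# Each activation is zeroed with probability p during training only, which reduces
# co-dependence between neurons and thus overfitting; the layer sizes here are arbitrary.
classifier = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 35),
)
```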
References
- “AlphaFold: A Solution to a 50-year-old Grand Challenge in Biology.” Google DeepMind, 30 Nov. 2020, deepmind.google/discover/blog/alphafold-a-solution-to-a-50-year-old-grand-challenge-in-biology. Accessed 26 July. 2024
- Davis, Anne. “Why Is Protein Folding Important in Biology?” Health Tech Zone, 3 Dec. 2019, www.healthtechzone.com/topics/healthcare/articles/2019/12/03/443892-why-protein-folding-important-biology.htm. Accessed 26 July. 2024
- Tertiary Structure Analysis – Creative Biolabs. www.creative-biolabs.com/drug-discovery/therapeutics/tertiary-structure-analysis.htm. Accessed 26 July. 2024
- SITNFlash. “A Near Perfect Solution to a Decades-Old Biology Problem – Science in the News.” Science in the News, 9 Feb. 2021, sitn.hms.harvard.edu/flash/2021/a-near-perfect-solution-to-a-decades-old-biology-problem. Accessed 26 July. 2024
- Rehman, Ibraheem. “Biochemistry, Tertiary Protein Structure.” StatPearls – NCBI Bookshelf, 12 Sept. 2022, www.ncbi.nlm.nih.gov/books/NBK470269. Accessed 26 July. 2024
- Jumper, John, et al. “Highly Accurate Protein Structure Prediction With AlphaFold.” Nature, vol. 596, no. 7873, July 2021, pp. 583–89. https://doi.org/10.1038/s41586-021-03819-2. Accessed 26 July. 2024
- Vaswani, Ashish, et al. “Attention Is All You Need.” Advances in Neural Information Processing Systems, vol. 30, 2017, pp. 5998–6008. arxiv.org/pdf/1706.03762v5. Accessed 26 July. 2024
- AlphaFold Protein Structure Database. alphafold.ebi.ac.uk/faq. Accessed 26 July. 2024
- Al-Janabi, Aisha. “Has DeepMind’s AlphaFold Solved the Protein Folding Problem?” BioTechniques, vol. 72, no. 3, Mar. 2022, pp. 73–76. https://doi.org/10.2144/btn-2022-0007. Accessed 26 July. 2024
- Goldman, Sharon. “A Year Ago, DeepMind’s AlphaFold AI Changed the Shape of Science — but There Is More Work to Do.” VentureBeat, 1 Aug. 2023, venturebeat.com/ai/a-year-ago-deepminds-alphafold-ai-changed-the-shape-of-science-but-there-is-more-work-to-do. Accessed 26 July. 2024
- Varadi, Mihaly, et al. “AlphaFold Protein Structure Database in 2024: Providing Structure Coverage for Over 214 Million Protein Sequences.” Nucleic Acids Research, vol. 52, no. D1, Nov. 2023, pp. D368–75. https://doi.org/10.1093/nar/gkad1011. Accessed 7 Sep. 2024
- “AlphaFold.” Google DeepMind, 28 July 2022, deepmind.google/technologies/alphafold. Accessed 26 July. 2024
- Gligorijević, Vladimir, et al. “Structure-based Protein Function Prediction Using Graph Convolutional Networks.” Nature Communications, vol. 12, no. 1, May 2021, https://doi.org/10.1038/s41467-021-23303-9. Accessed 26 July. 2024
- Castro, Egbert, et al. “Transformer-based Protein Generation With Regularized Latent Space Optimization.” Nature Machine Intelligence, vol. 4, no. 10, Sept. 2022, pp. 840–51. https://doi.org/10.1038/s42256-022-00532-1. Accessed 7 Sep. 2024
- Structural Protein Sequences. Kaggle, 2018, www.kaggle.com/datasets/shahir/protein-data-set. Accessed 26 July. 2024
- Aoki-Kinoshita, Kiyoko F. “Glycan Bioinformatics: Informatics Methods for Understanding Glycan Function.” Elsevier eBooks, 2023, pp. 516–24. https://doi.org/10.1016/b978-0-12-821618-7.00002-x. Accessed 7 Sep. 2024
- Paszke, Adam, et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” arXiv.org, 3 Dec. 2019, arxiv.org/abs/1912.01703. Accessed 26 July. 2024
- https://colab.research.google.com/drive/1NVsegTF4lw2W7MgC0hJFP-UPoeN61dX4?usp=sharing
- Tam, Adrian. “Building Multilayer Perceptron Models in PyTorch.” MachineLearningMastery.com, 8 Apr. 2023, machinelearningmastery.com/building-multilayer-perceptron-models-in-pytorch. Accessed 26 July. 2024
- Danofer. “Predicting Protein Classification.” Kaggle, 22 Apr. 2018, www.kaggle.com/code/danofer/predicting-protein-classification. Accessed 26 July. 2024
- Danofer. “Predicting Protein Classification.” Kaggle, 22 Apr. 2018, www.kaggle.com/code/danofer/predicting-protein-classification. Accessed 26 July. 2024
- https://colab.research.google.com/drive/1ZnTH67IhQHfeiWMEvD6Eo5CPY4r4wlct?usp=sharing
- Tam, Adrian. “LSTM for Time Series Prediction in PyTorch.” MachineLearningMastery.com, 8 Apr. 2023, machinelearningmastery.com/lstm-for-time-series-prediction-in-pytorch. Accessed 26 July. 2024
- Stephenson, James, and Stephen J. Freeland. “Unearthing the Root of Amino Acid Similarity.” Journal of Molecular Evolution, vol. 77, no. 4, Oct. 2013, pp. 159–69. https://doi.org/10.1007/s00239-013-9565-0. Accessed 26 July. 2024