Tech News | Open Access

The path to solving the protein folding problem

    Published Online: https://doi.org/10.2144/btn-2023-0031

    Abstract

    With advances in imaging technologies and the development of artificial intelligence-based predictive software, has the protein folding problem finally been solved?

    As humans, we love to predict things. Predictions dominate conversations from the weather to sports results as we debate what is still to come. Science is no exception, and recent advances in technology have only enabled our predictive power to increase: artificial intelligence can predict a new therapeutic target for a repurposed drug [1]; blood biopsies can predict the development of Alzheimer's before it reaches the clinical stage [2]; and MRI can even predict our emotions [3]. However, until recently, one prediction remained out of reach.

    In 1972, upon receiving a Nobel Prize for his work on “the connection between the amino acid sequence and the biologically active conformation” [4], Christian B Anfinsen predicted that, in principle, it should be possible to determine a protein's 3D shape based solely on the composition of its 1D structure [5].

    Anfinsen's prediction became one of science's great unsolved mysteries. Known as the ‘protein folding problem’, the issue flummoxed researchers for over half a century – and it is no wonder. Per Levinthal's paradox, a thought experiment regarding the complexity of protein folding, a protein of 101 amino acids could adopt any of 3¹⁰⁰ = 5 × 10⁴⁷ possible configurations. Even if it were possible to test each potential configuration at a rate of 10¹³ per second – or 3 × 10²⁰ per year – it would take 10²⁷ years to test them all [6].
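
    The arithmetic behind those figures can be checked in a few lines. The sketch below assumes, as the thought experiment does, that a 101-residue chain has 100 bonds and that each bond can take one of three states; the variable names are illustrative and the year length is taken as 365.25 days.

```python
# Levinthal-style estimate using the figures quoted in the text.
BONDS = 100                  # a chain of 101 amino acids has 100 bonds (assumption of the thought experiment)
STATES_PER_BOND = 3          # assumed number of states available to each bond
RATE_PER_SECOND = 1e13       # configurations tested per second
SECONDS_PER_YEAR = 365.25 * 24 * 3600

configurations = STATES_PER_BOND ** BONDS          # 3^100
rate_per_year = RATE_PER_SECOND * SECONDS_PER_YEAR
years_needed = configurations / rate_per_year

print(f"configurations: {configurations:.1e}")
print(f"rate per year:  {rate_per_year:.1e}")
print(f"years needed:   {years_needed:.1e}")
```

    The three printed values land on the same orders of magnitude as the numbers quoted above, which is all the paradox needs: an exhaustive search is hopeless, so real proteins cannot be folding by trial and error.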

    Yet, not for want of trying, the problem remained. In the years since Anfinsen's prediction, rapid technological developments enabled researchers to uncover the structure of several key proteins of the human proteome; however, many of the techniques were expensive, laborious and time-consuming [7].

    From crystallization to cryogenics

    For decades, X-ray crystallography was the dominant technique for elucidating protein structures and was used to decode the structure of a number of key biomolecules, including the iconic double helix of DNA [8]. Despite being responsible for solving 90% of the known protein structures at the time, X-ray crystallography had major limitations. Researchers could spend years attempting to crystallize a protein into a form suitable for analysis, and certain molecules defied crystallization altogether [8].

    These limitations forced researchers to explore other avenues and, in doing so, led to the development of cryogenic electron microscopy (cryo-EM) in the mid-1970s. In cryo-EM, two-dimensional microscopic images are produced by shooting a beam of electrons at a flash-frozen protein solution. The scattered electrons that emerge are detected and used to create an image. These images are then fed into computational image processing software to reconstruct the 3D structure of the protein [8,9].

    Progress was slow at first, as electron microscope images had poor contrast and a low signal-to-noise ratio. However, the development and application of new direct electron detectors led to advances in both resolution and image processing, which greatly enhanced the capabilities of cryo-EM. In 2016, the 3D structure of glutamate dehydrogenase was reported at a resolution of 1.8 Å, a far cry from the 25 Å resolution seen in the mid-1990s [10]. By the close of 2019, cryo-EM had been used to determine the structure of almost 4000 proteins listed in the Protein Data Bank database [10].

    What started as a niche field soon supplanted X-ray crystallography as the primary method of protein structure determination, with researchers able to develop sharp 3D models of proteins in a fraction of the time [8]. What had previously taken years could now be done in a matter of months. The ribosome, whose structure had taken decades to solve and earned three X-ray crystallographers a Nobel Prize in 2009, proved an ideal target for cryo-EM: researchers quickly determined and published cryo-EM structures of ribosomes from multiple organisms, including the first high-resolution model of the human ribosome [8].

    In addition to speed and improved resolution, cryo-EM enabled researchers to deduce the mechanism of certain structures. While X-ray crystallography required crystallized proteins to be locked into a single static position, with cryo-EM, proteins could be flash-frozen in various conformations. This allowed researchers to visualize proteins in action and thus determine how they worked in the organism [8].

    Automation leads the way

    Despite these advances in cryo-EM, at the start of 2021, almost 50 years after the prediction in Anfinsen's Nobel lecture, the 3D structure was known for only approximately 180,000 proteins, and for only 17% of proteins in the human proteome [7,11]. This all changed in July 2021, when the artificial intelligence (AI) research company DeepMind Technologies (London, UK) made its protein prediction software AlphaFold, and the associated database, open source and freely available [12,13].

    In one fell swoop, the number of available protein structures skyrocketed to 350,000, including 98.5% of the human proteome and key protein structures from 20 other scientifically relevant species [7,11]. At the time of writing (April 2023), that number has grown even further, with the structures of 214,683,829 proteins listed in the AlphaFold database and complete proteomes available for 48 species, including Homo sapiens (Figure 1) [12,13].

    Figure 1. The AlphaFold structure prediction of DNA helicase.

    AlphaFold Data Copyright (2022) DeepMind Technologies Ltd [12,13,16].

    AlphaFold had first entered the protein prediction race 2.5 years earlier at the 2018 Critical Assessment of Structure Prediction (CASP), a biennial competition in which teams of scientists test the ability of their models to computationally predict the 3D structure of a protein from a given sequence of amino acids. In 2018, AlphaFold1 impressed and went on to win, yet it was at the 2020 iteration of CASP that the AlphaFold team blew away the competition with the next generation of their software [11,14]. On average, AlphaFold2 predicted the 3D structure of proteins to within the width of about one atom, leading the CASP organizers to declare the protein folding problem solved [11].
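
    To give a sense of what an error of about one atom's width means numerically, the sketch below computes a root-mean-square deviation (RMSD) between a predicted and an experimental set of atomic coordinates. RMSD is used here only as a familiar illustrative measure; this is not a reproduction of CASP's scoring, the function name is our own, and the coordinates are invented for the example.

```python
import math

def rmsd(predicted, experimental):
    """Root-mean-square deviation, in angstroms, between two equal-length
    lists of (x, y, z) coordinates. Assumes the two structures have already
    been superimposed; no alignment is performed here."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(predicted, experimental))
    return math.sqrt(squared / len(predicted))

# Hypothetical coordinates, invented for illustration only.
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 1.0, 0.0), (3.8, 0.0, 1.0), (7.6, -1.0, 0.0)]

print(f"RMSD: {rmsd(predicted, experimental):.2f} Å")  # prints "RMSD: 1.00 Å"
```

    Every atom in this toy example is displaced by exactly 1 Å, so the RMSD is 1 Å, roughly the scale the phrase "the width of about one atom" refers to.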

    It was a historic event, as stated by CASP co-founder John Moult (University of Maryland, College Park, MD, USA): “This is the first time a serious scientific problem has been solved by AI” [11].

    AlphaFold2 is an attention-based neural network architecture combined with a deep learning framework that relies exclusively on pattern recognition [15,16]. The network directly predicts the 3D coordinates of all the heavy atoms of a given protein structure from the input primary amino acid sequence and aligned sequences of homologs [16]. Similar to how you might approach a jigsaw, the attention-based algorithm builds small sections before compiling these sections together to complete the full 3D structure prediction [15].
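
    For readers unfamiliar with the term, the core operation of an attention mechanism can be sketched in plain Python. This is the generic scaled dot-product attention found in many neural networks, not AlphaFold2's actual architecture, which is far more elaborate; the function names and the toy feature vectors are assumptions made for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Generic scaled dot-product attention over lists of vectors.
    Each output row is a weighted average of the value vectors, with
    weights set by how strongly the query matches each key."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ])
    return outputs

# Toy 2D features for three sequence positions (hypothetical numbers).
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in attention(features, features, features):
    print([round(x, 3) for x in row])
```

    Each position ends up with a blend of information from every other position, weighted by similarity. That is the sense in which an attention-based model can relate distant parts of a sequence to one another before assembling a whole.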

    As with all AI software, AlphaFold2 was trained using pre-existing data. Much of its training came from the known 3D structures found in the Protein Data Bank, but the amino acid sequences of an additional 200 million proteins listed in the UniProt database and a further 350,000 sequences from UniClust, a database of annotated protein sequences and alignments, also aided in its development [11,15]. By incorporating the evolutionary, physical and geometric constraints of protein structure into its training, the AlphaFold team was able to greatly increase the accuracy of the software's structure prediction capabilities relative to previous computational prediction methods [16].

    While indeed game-changing, the AI technology of AlphaFold2 is not without its limitations. As it is trained on existing data, the software can struggle to predict unusual or de novo proteins not commonly found in nature [11]. In addition, the predicted structures gained from AlphaFold2 are not always as accurate as structures gleaned from more traditional techniques; proteins are dynamic, and while AlphaFold2 may be able to predict one conformation, proteins can change shape and contort into various conformations as they function and move throughout the body – an element that the current software cannot account for [11]. Despite initial excitement and ongoing hopes for the future, one cannot say that the protein folding problem has been well and truly solved – at least not yet.

    A multi-technique approach

    The impact of AI-based prediction software has been immense, with such software now benefiting structure-based drug design and structural bioinformatics analysis on a scale that would have otherwise been impossible [17]. However, automation has yet to completely replace traditional techniques, and integration of the old and the new appears to be the way forward.

    Deep learning, the machine learning method used by AlphaFold2, has been applied to almost all areas of cryo-EM data analysis, from the initial sample preparation to the final 3D structure determination [18]. Following the protein prediction breakthrough of AlphaFold2, a number of labs are now obtaining more accurate structural models by integrating the deep learning methods used to predict protein structures with experimentally derived cryo-EM density maps [18].

    In one technique, named DeepTracer-ID, researchers used the DeepTracer computational image processing software to determine a 3D structure from cryo-EM data, which was then searched against a database of AlphaFold2-predicted structures in order to identify molecules with a similar structure and thereby enhance the accuracy of their reconstructed structure [18,19].

    Taking technique collaboration a step further, a team led by researchers at the Karolinska Institutet (Solna, Sweden) combined the AI-powered technology of AlphaFold with the traditional techniques of both cryo-EM and X-ray crystallography [20]. Working with the proteins uromodulin and glycoprotein 2, the team collected experimental data using the aforementioned imaging techniques, but technical issues complicated the data interpretation. Molecular replacement with models generated by AlphaFold2 enabled the team to solve the protein structure of each molecule, with the structural insights gained enabling them to determine how the two molecules exert their anti-bacterial properties in the body [20].

    With AI technology developing rapidly, its integration with simultaneously advancing experimental techniques heralds a transformative time for structural biology, as well as the many other fields of life science that rely on protein structure analysis [17]. Whether Anfinsen's 1972 prediction has finally come to pass over 50 years later remains debatable, though it cannot be denied that the path to complete protein prediction is almost at an end.

    Author contributions

    Jenny Straiton carried out all work related to this manuscript, including researching, writing, and editing.

    Financial & competing interests disclosure

    Jenny Straiton is a Contributing Editor for BioTechniques. The author has no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.

    No writing assistance was utilized in the production of this manuscript.

    Open access

    This work is licensed under the Attribution-NonCommercial-NoDerivatives 4.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/

    References

    • 1. Paul D, Sanap G, Shenoy S, Kalyane D, Kalia K, Tekade RK. Artificial intelligence in drug discovery and development. Drug Discov. Today 26(1), 80–93 (2021).
    • 2. Straiton J. Predicting Alzheimer's disease. BioTechniques 67(4), 146–148 (2019).
    • 3. Kragel PA, LaBar KS. Multivariate neural biomarkers of emotional states are categorically distinct. Soc. Cogn. Affect. Neurosci. 10(11), 1437–1448 (2015).
    • 4. The Nobel Prize. The Nobel Prize in Chemistry 1972. www.nobelprize.org/prizes/chemistry/1972/summary
    • 5. The Nobel Prize. Christian Anfinsen: Nobel lecture. Studies on the principles that govern the folding of protein chains. www.nobelprize.org/uploads/2018/06/anfinsen-lecture.pdf
    • 6. Zwanzig R, Szabo A, Bagchi B. Levinthal's paradox. Proc. Natl Acad. Sci. USA 89(1), 20–22 (1992).
    • 7. Tunyasuvunakool K, Adler J, Wu Z et al. Highly accurate protein structure prediction for the human proteome. Nature 596(7873), 590–596 (2021).
    • 8. Callaway E. The revolution will not be crystallized. Nature 525(7568), 172–174 (2015).
    • 9. Callaway E. Revolutionary cryo-EM is taking over structural biology. Nature 578(7794), 201 (2020).
    • 10. Benjin X, Ling L. Developments, applications, and prospects of cryo-electron microscopy. Protein Sci. 29(4), 872–882 (2020).
    • 11. Forbes. AlphaFold is the most important achievement in AI—ever. www.forbes.com/sites/robtoews/2021/10/03/alphafold-is-the-most-important-achievement-in-ai-ever/?sh=38a2ba6b6e0a
    • 12. Varadi M, Anyango S, Deshpande M et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50(D1), D439–D444 (2021).
    • 13. EMBL-EBI. AlphaFold Protein Structure Database. https://alphafold.ebi.ac.uk
    • 14. Marx V. Method of the Year: protein structure prediction. Nat. Methods 19(1), 5–10 (2022).
    • 15. Al-Janabi A. Has DeepMind's AlphaFold solved the protein folding problem? BioTechniques 72(3), 73–76 (2022).
    • 16. Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021).
    • 17. Varadi M, Bordin N, Orengo C, Velankar S. The opportunities and challenges posed by the new generation of deep learning-based protein structure predictors. Curr. Opin. Struct. Biol. 79, 102543 (2023).
    • 18. Giri N, Roy RS, Cheng J. Deep learning for reconstructing protein structures from cryo-EM density maps: Recent advances and future directions. Curr. Opin. Struct. Biol. 79, 102536 (2023).
    • 19. Chang L, Wang F, Connolly K et al. DeepTracer-ID: De novo protein identification from cryo-EM maps. Biophys. J. 121(15), 2840–2848 (2022).
    • 20. Stsiapanava A, Xu C, Nishio S et al. Structure of the decoy module of human glycoprotein 2 and uromodulin and its interaction with bacterial adhesin FimH. Nat. Struct. Mol. Biol. 29(3), 190–193 (2022).