Reports | Open Access

PS4: a next-generation dataset for protein single-sequence secondary structure prediction

    Omar Peracha

    *Author for correspondence:

    E-mail Address: omar.peracha@conted.ox.ac.uk

    Department for Continuing Education, University of Oxford, Rewley House, 1 Wellington Square, Oxford, OX1 2JA, United Kingdom

Published Online: https://doi.org/10.2144/btn-2023-0024

    Abstract

Protein secondary structure prediction is a subproblem of protein folding. A light-weight algorithm capable of accurately predicting secondary structure from only the protein residue sequence could provide useful input for tertiary structure prediction, alleviating the reliance on multiple sequence alignments typically seen in today's best-performing models. Unfortunately, existing datasets for secondary structure prediction are small, creating a bottleneck. We present PS4, a dataset of 18,731 nonredundant protein chains and their respective secondary structure labels. Each chain is identified, and the dataset is nonredundant against other secondary structure datasets commonly seen in the literature. We perform ablation studies by training secondary structure prediction algorithms on the PS4 training set and obtain state-of-the-art accuracy on the CB513 test set in a zero-shot setting.

    Tweetable abstract

Better data are all you need for state-of-the-art protein secondary structure prediction. We present PS4, the largest open-source dataset for protein single-sequence secondary structure prediction. We achieve SotA q3 and q8 accuracy on CB513 in a zero-shot setting by training on PS4 alone, with no multiple sequence alignment input.

Recent years have seen great advances in automated protein structure prediction, with increasingly available open-source algorithms that can in many cases match the accuracy of traditional methods for determining the structure of a folded protein, such as x-ray crystallography and cryogenic electron microscopy. It remains common for the best-performing approaches to rely on multiple sequence alignment (MSA) data in order to provide strong results [1–3]. One drawback is that these algorithms perform poorly on orphan proteins. Another is that significant extra resources are required to run them, particularly disk space for storing the database of potential homologues and computation time to adequately search through several hundred gigabytes of these data. For example, at the time of writing, the lighter-weight version of AlphaFold 2 currently made available by DeepMind as an official release (available at [4]) requires 600 GB of disk space and comes with accuracy tradeoffs; the full version occupies terabytes.

    Improved performance on orphan proteins is particularly desirable, as it may open up wider avenues for exploration when it comes to the set of possible human-designed proteins that can benefit from reliable algorithmic structure prediction, in turn offering advantages such as faster drug development. Meanwhile, reducing the resource requirements for using protein structure prediction algorithms increases their accessibility, ultimately improving the rate of research advances and downstream industry adoption.

    More recently, accurate structure prediction models have been proposed that do not rely on MSA. Wang et al. propose trRosettaX-Single [5], a model designed to predict tertiary structure from single-sequence input. They instead leverage a large, transformer-based protein ‘language model’, so named because it is an algorithm trained to denoise or autoregressively predict residues in a sequence of protein amino acids in a self-supervised manner, much as language models are trained to denoise or autoregressively predict words in a sequence of text [6,7]. This technique has gained popularity because the embeddings generated by the trained model in response to a protein sequence input seem to encode information regarding genetic relationships, which a downstream neural network can take advantage of. Furthermore, these models can be trained on the residue data alone, without requiring further labels such as atomic coordinates.

    To form one component of trRosettaX-Single, Wang et al. also use knowledge distillation from a pretrained MSA-based network [8], training a smaller ‘student’ neural network to approximate the MSA-based ‘teacher’ network's output distribution when the former is fed only a single sequence input, as a way to further induce some understanding of the homology relationships in their model. While performance is close to AlphaFold 2 when evaluating structure prediction accuracy on a dataset of human-designed proteins, and exceeds it on a dataset of orphan proteins, the authors point out that accuracy on those orphan proteins is still far from satisfactory. However, the use of upstream neural networks, such as the protein language model, in place of searching large databases ultimately reduces the resource requirement compared with AlphaFold 2.

It is widely held that a protein's secondary structure has implications that affect the final fold – for example, through a correlation with fold rate in certain conditions [9,10]. The author therefore infers that accurate secondary structure prediction models can also serve as powerful upstream components for tertiary structure prediction algorithms. Secondary structure motifs, comprising just a handful of varieties, occur in most proteins with well-defined tertiary structures; indeed, the same classes of secondary structure can occur in proteins that are evolutionarily distant from each other. Furthermore, a significant proportion of all protein structure is in some form of secondary structure [11].

    Secondary structure largely concerns the formation of hydrogen bonds between amino acids in the protein. These can create helical motifs when they form at regular intervals between residues that are local to each other in the primary structure, or pleated shapes, known as beta sheets, when they form between residues that are initially more distant to each other in the primary structure (Figure 1). Secondary structure is influenced to a great degree by the local constitution of a residue chain, particularly in the case of helices and turns, rather than by the idiosyncrasies that begin to emerge over the length of an entire protein polypeptide chain and in turn contribute to the plethora of topologies observed among fully folded proteins. The implication is that it may be possible to infer the patterns in polypeptide sequences that correspond to the occurrence of the various classes of secondary structure without relying on homology data. However, attempts to prove this empirically have been hampered by a lack of large, high-quality datasets for single-sequence secondary structure prediction.

    Figure 1. Two folded proteins displaying different common secondary structure motifs, rendered using UCSF ChimeraX software [12] for molecular visualization.

    (Left) A synthetic triple-stranded protein by [13], featuring alpha helices. (Right) A two-chain synthetic structure by [14] predominantly featuring beta strands.

Among the most cited in the literature is the CB513 dataset [15], consisting of 513 protein sequences split into training and validation sets. A training set of a few hundred sequences is not sufficient to achieve high test set accuracy on CB513; therefore, a typical approach is to use extra training data [16,17]. However, the specific proteins included in the dataset are not identified, which can make it difficult to ensure there are no duplicate occurrences of samples from the test set in the augmented training set. Furthermore, the CB513 test set and training set contain some instances of different subsequences extracted from the same protein chain. Although local information is likely to play a strong role in determining where secondary structure motifs form, it cannot be said for certain that there is no information leakage between the training and test sets, suggesting evaluation on the CB513 test set is not ideal in cases where its training set was also seen by the model. Unfortunately, the lack of large datasets for protein secondary structure prediction means that omission of the CB513 from a larger training superset would be a significant sacrifice.

The majority of other datasets seen in the literature are of similar size to or smaller than CB513 [18,19]. Klausen et al. introduce a notably larger dataset, comprising almost 11,000 highly nonredundant protein sequences [20]. Elnaggar et al. were able to leverage this dataset, among other smaller ones, including the CB513 training set, to achieve a test set q3 accuracy (i.e., correct per-residue predictions of contribution to any helical structures, any beta structures or neither) of 86% and q8 accuracy (i.e., further breaking down helical classes into alpha helices, pi helices and 3₁₀ helices; beta structure classes into beta bridges and beta sheets; and all other classes into turns, bends and no secondary structure) of 74.5% on the CB513 [17]. Not only is this the highest accuracy previously reported, albeit by a narrow margin, but also the authors were able to avoid using extra inputs such as MSA by leveraging a protein language model. This is the first time accuracy in that range has been achieved without the use of MSA, to our knowledge.
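To make the q3/q8 distinction concrete, the sketch below shows the commonly used reduction from the eight DSSP states to the three coarse classes; this is one conventional mapping, offered for illustration, and individual benchmarks may differ in minor details.

```python
# Commonly used reduction from DSSP eight-state (q8) labels to three-state (q3) labels.
Q8_TO_Q3 = {
    "H": "H",  # alpha helix           -> helix
    "G": "H",  # 3-10 helix            -> helix
    "I": "H",  # pi helix              -> helix
    "E": "E",  # beta strand (sheet)   -> strand
    "B": "E",  # isolated beta bridge  -> strand
    "T": "C",  # hydrogen-bonded turn  -> coil/other
    "S": "C",  # bend                  -> coil/other
    "-": "C",  # no secondary structure -> coil/other
}

def q8_to_q3(q8_labels: str) -> str:
    """Map a string of per-residue q8 labels to q3 labels."""
    return "".join(Q8_TO_Q3[c] for c in q8_labels)

# Example: q8_to_q3("HHHGGTE-") -> "HHHHHCEC"
```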

We also note a relevant concurrent work by Singh et al. [21], who similarly prepare their own dataset and evaluate on single-sequence secondary structure prediction, obtaining accuracy levels comparable with ours. Their reported results are limited to q3 accuracy. Furthermore, they evaluate only on a novel set of datasets that are of similar size to CB513, and on various CASP datasets, which are too small to derive statistically meaningful conclusions. This makes it difficult to compare their method against other results reported in canonical literature, and readers must extrapolate based on their provided results and description of methodology for dataset collation. We use larger evaluation sets, nearly double in size in the case of PS4, to derive results that are more readily comparable with previous literature. We also include data to evaluate algorithms on eight secondary structure classes, rather than just three, and provide state-of-the-art results for both q3 and q8.

To push these boundaries further, the author proposes PS4, a dataset for protein secondary structure prediction comprising 18,731 sequences. A high-level comparison of this work with commonly cited datasets is shown in Table 1. Each sequence is from a distinct protein chain and consists of the entire resolved chain at the time of compilation, including chains featuring multiple domains. Samples are filtered at 40% similarity via a Levenshtein distance comparison to ensure a highly diverse dataset. Crucially, samples are also filtered by similarity to the entire CB513 dataset in the same manner, allowing for improved evaluation reliability when using the CB513 for performance benchmarking. All samples are identified by their Protein Data Bank (PDB) code and chain ID, and they are guaranteed to have a corresponding entry in the CATH database [22], to facilitate research on further hierarchical modeling tasks such as domain location prediction or domain classification.

    Table 1. Comparison of commonly cited datasets for secondary structure prediction in recent literature.
Dataset | Samples (n) | Train/test split | Has Protein Data Bank codes
TS115 | 115 | NA | No
NEW364 | 364 | NA | No
CB513 | 511 | Via masking | No
NetSurfP-2.0 | 10,792 | NA | No
PS4 | 18,731 | 17,799/932 | Yes

    Ours is by far the largest and the only one in which the proteins can be identified. Only ours and CB513 are fully self-contained for training and evaluation with specified training and test sets. The CB513 achieves this distinction in many cases by masking subsequences of training samples, only sometimes including an entire sequence as a whole in the test set.

    NA: Not applicable.

We perform ablation studies using the PS4 training set, evaluating on both the PS4 test set and the CB513 test set in a zero-shot manner and leaving out the CB513 training set from the process. We use the same protein language model as [17] to extract input embeddings and evaluate multiple neural network architectures for the end classifier, with no further inputs such as MSA. We obtain state-of-the-art results for q3 and q8 secondary structure prediction accuracy on the CB513, 87.0% and 76.5%, respectively, by training solely on PS4.

    We make the full dataset freely available along with code for evaluating our pretrained models, for training them from scratch to reproduce our results and for running predictions on new protein sequences. Finally, in the interests of obtaining a dataset of sufficient scale to truly maximize the potential of representation learning for secondary structure prediction, we provide a toolkit for any researchers to add new sequences to the dataset, ensuring the same criteria for nonredundancy. New additions will be released in labeled versions to ensure the possibility for consistent benchmarking in a future-proof manner (available at [23]).

    Materials & methods

    Dataset preparation

    The PS4 dataset consists of 18,731 protein sequences, split into 17,799 training samples and 932 validation samples, where each sequence is from a distinct protein chain. We first obtained the entire precomputed DSSP database [11,24], initiating the database download on 16 December 2021 (available by following the instructions at [25]). The full database at that time contained secondary structure information for 169,579 proteins, many of which are multimers, in DSSP format, with each identified by its respective PDB code.

We iterate through the precomputed DSSP files and create a separate record for each individual chain, noting its chain ID, its residue sequence as a string of one-letter amino acid codes and its respective secondary structure sequence, assigning one of nine possible secondary structure classes to each residue in the given chain. The ninth class, the polyproline helix, has not generally been taken into consideration by other secondary structure prediction algorithms, and we also ignore this class when performing our own algorithmic assessments; however, the information is retained in the raw dataset provided.

We also store the residue number of the first residue denoted in the DSSP file for the chain, which is quite often not number 1; being able to infer the residue number of any given residue in the chain could better facilitate the use of external data by future researchers. Starting from the first residue included for a chain, we omit chains that are missing any subsequent residues. We further omit any chains containing fewer than 16 residues. Finally, we perform filtration to greatly reduce redundancy, checking for similarity below 40% against the entire CB513 dataset and then for all remaining samples against each other.

We chose the Levenshtein distance to compute similarity. Recall that Levenshtein distance is an edit distance that measures the minimum number of single-character edits – additions, deletions and substitutions – required to change one string into another. We chose it due to its balance of effectiveness as a distance metric for biological sequences [26], its relative speed and its portability, with optimized implementations existing in several programming languages. This last property is of particular importance when factoring in our aim for the PS4 dataset to be extensible by the community, enabling a federated approach to maximally scale and leverage the capabilities of deep learning. The possibility of running similarity checks locally with a nonspecialized computing setup means that even a bioinformatics hobbyist can add new sequences to future versions of the dataset and guarantee a consistent level of nonredundancy, without relying on precomputed similarity clusters. This removes hurdles toward future growth and utility of the PS4 dataset while allowing light-weight similarity measurement against proteins that are not easily identifiable by a PDB or UniProt code, such as those in the CB513.
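As a concrete illustration, the sketch below applies a plain dynamic-programming Levenshtein distance, a minimum length of 16 residues and a 40% similarity cutoff to filter candidate chains. The exact similarity normalization (here 1 minus the distance divided by the length of the longer sequence) is an assumption, since the paper does not specify it, and the production implementation is in Rust rather than Python.

```python
from typing import Iterable, List

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - edit distance / length of the longer sequence."""
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def filter_nonredundant(candidates: Iterable[str], reference: List[str],
                        threshold: float = 0.4, min_len: int = 16) -> List[str]:
    """Keep sequences of at least min_len residues whose similarity stays below the
    threshold against the reference set (e.g. CB513) and against already kept samples."""
    kept: List[str] = []
    for seq in candidates:
        if len(seq) < min_len:
            continue
        if any(similarity(seq, other) >= threshold for other in reference + kept):
            continue
        kept.append(seq)
    return kept
```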

As a last step, we omit any chains that did not have entries in the CATH database as of 16 December 2021, ruling out roughly 1500 samples from the final dataset. We make the CATH data of all samples in the PS4 available alongside the dataset, in case future research is able to leverage those structural data for improved performance on secondary structure prediction or related tasks. We ultimately chose to focus the scope of this work purely on prediction from a single sequence and did not find great merit in early attempts to include domain classification within the secondary structure prediction pipeline; as such, this restriction will not be enforced for community additions to PS4.

    The main secondary structure data are made available as a CSV file, which is 8.2 MB in size. The supplemental CATH data are a 1.3 MB file in pickle format, mapping chain ID to a list of domain boundary residue indices and the corresponding four-integer CATH classification. Finally, a file in compressed NumPy format [27] maps chain IDs to the training or validation set, according to the split used in our experiments.
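For orientation, loading the three files might look like the following; the file names used here are hypothetical placeholders, and the actual paths and column layout are documented in the PS4 repository [23].

```python
import pickle
import numpy as np
import pandas as pd

# Hypothetical file names; see the PS4 repository [23] for the actual paths.
chains = pd.read_csv("ps4_secondary_structure.csv")             # per-chain residue and SS label strings
with open("ps4_cath_domains.pkl", "rb") as f:
    cath = pickle.load(f)                                        # chain ID -> domain boundaries + CATH codes
split = np.load("ps4_train_val_split.npz", allow_pickle=True)    # chain ID -> train/validation assignment
```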

    Experimental evaluation

We conduct experiments to validate the PS4 dataset's suitability for use in secondary structure prediction tasks. We train two models, each based on a different neural network algorithm, to predict eight-class secondary structure given single-sequence protein input. The models are trained only on the PS4 training set and then evaluated on both the PS4 test set and the CB513 test set. We avoid using any training data from the CB513 training set, meaning evaluation on its test set is conducted in a zero-shot setting. We also do not provide surrounding ground truth data at evaluation time for those samples in the CB513 test set that are masked subsequences of a training sample; rather, we predict the secondary structure for the whole sequence at once.

Both models make use of the pretrained, open-source, encoder-only version of the ProtT5-XL-UniRef50 model by [17] to generate input embeddings from the initial sequence. Our algorithms are composable such that any protein language model could be used in place of ProtT5-XL-UniRef50, opening up an avenue for potential future improvement; however, our choice was in part governed by a desire to maximize accessibility: we use the half-precision version of the model made publicly available by Elnaggar et al., which can fit in just 8 GB of GPU RAM. As such, our entire training and inference pipeline can fit on a single GPU.

    The protein language model generates an N × 1024 encoding matrix, where N is the number of residues in the given protein chain. Our two models differ in the neural network architecture used to form the classifier component of our overall algorithm, which generates secondary structure predictions from these encoding matrices (Figure 2). Our first model, which we call PS4-Mega, leverages 11 moving average equipped gated attention (Mega) encoder layers [28] to compute a final encoding, which is then passed to an output affine layer and Softmax to generate a probability distribution over the eight secondary structure classes.

    Figure 2. Overview of the neural network meta architecture used in our experiments.

The protein language model here refers to an encoder-only, half-precision version of ProtT5-XL-UniRef50, while the SS classifier is either a Mega-based or convolution-based network. We precompute the protein encodings generated by the pretrained language model, reducing the computations necessary during training to only the forward and backward passes of the classifier.

We chose Mega encoders due to their improved inductive bias when compared with a basic transformer [29], which promises a better balance between local and global dependencies when encoding the protein sequence. We use a hidden dimension of 1024 for our Mega encoder layers, a z_dim of 128 and an n_dim of 16. We apply dropout with probability 0.1 to the moving average gated attention layers and the normalized feedforward layers, and we use the simple variant of the relative positional bias. Normalization is computed via LayerNorm.
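A minimal sketch of the overall classifier shape is given below. The Mega encoder stack itself is assumed to come from the reference implementation [28] and is passed in by the caller, since its exact constructor API is not reproduced here; only the affine output layer and softmax follow the description above, and the stand-in encoder in the usage lines is purely illustrative.

```python
import torch
import torch.nn as nn

class SecondaryStructureHead(nn.Module):
    """Sketch of the PS4-Mega meta-architecture: a sequence encoder (assumed to be the
    11-layer Mega stack from [28], supplied by the caller) followed by an affine layer
    and softmax over the eight secondary structure classes."""
    def __init__(self, encoder: nn.Module, hidden_dim: int = 1024, n_classes: int = 8):
        super().__init__()
        self.encoder = encoder                 # e.g. 11 Mega encoder layers, hidden dim 1024
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, N, 1024) protein language model encodings
        h = self.encoder(embeddings)           # (batch, N, hidden_dim)
        logits = self.out(h)                   # (batch, N, n_classes); use raw logits with cross-entropy
        return torch.softmax(logits, dim=-1)   # per-residue class probabilities

# Stand-in usage (a real run would pass the Mega stack instead of this generic encoder):
stand_in = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True), num_layers=2)
head = SecondaryStructureHead(stand_in)
probs = head(torch.randn(1, 128, 1024))        # -> (1, 128, 8)
```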

Our second algorithm, which we call PS4-Conv, is derived from the secondary structure prediction model used by [17] and is entirely based on two-dimensional convolutional layers. We found that the exact model they used did not have sufficient capacity to fully fit our training set, likely because our training set comprises many more samples, and so our convolutional classifier is larger, using five layers of gradually reducing size rather than two. All layers use feature row-wise padding of 3 elements, a 7 × 1 kernel size and a stride of 1. All layers but the last are followed by a ReLU activation and a dropout layer, with probability 0.1. Both models are trained to minimize a multiclass cross-entropy objective.
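Below is a PyTorch sketch of a PS4-Conv-style classifier. The layer count, 7 × 1 kernels, row-wise padding of 3, stride of 1 and dropout of 0.1 follow the description above, while the intermediate channel widths are assumptions, since only "gradually reducing size" is stated.

```python
import torch
import torch.nn as nn

class PS4ConvSketch(nn.Module):
    """Sketch of a five-layer convolutional secondary structure classifier.
    Intermediate channel widths are illustrative assumptions."""
    def __init__(self, channels=(1024, 512, 256, 128, 64, 8), p_drop=0.1):
        super().__init__()
        layers = []
        for i in range(len(channels) - 1):
            layers.append(nn.Conv2d(channels[i], channels[i + 1],
                                    kernel_size=(7, 1), stride=1, padding=(3, 0)))
            if i < len(channels) - 2:                  # ReLU + dropout after all but the last layer
                layers.extend([nn.ReLU(), nn.Dropout(p_drop)])
        self.net = nn.Sequential(*layers)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, N, 1024) -> (batch, 1024, N, 1) so the 7x1 kernel slides along residues
        x = embeddings.transpose(1, 2).unsqueeze(-1)
        logits = self.net(x)                           # (batch, 8, N, 1)
        return logits.squeeze(-1).transpose(1, 2)      # (batch, N, 8) per-residue class logits
```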

    Implementation

All neural network training, evaluation and inference logic is implemented using PyTorch [30]. We train both models for 30 epochs, using the Adam optimizer [31] with 3 epochs of warmup and a batch size of 1, equating to 53,397 warmup steps. Both models increase the learning rate from 10⁻⁷ to a maximum value of 10⁻⁴, chosen by conducting a learning rate range test [32], during the warmup phase before gradually reducing back to 10⁻⁷ via cosine annealing.
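A sketch of this schedule using standard PyTorch schedulers is shown below; composing LinearLR and CosineAnnealingLR via SequentialLR is one plausible way to realize the warmup-then-cosine behaviour described, not necessarily the exact implementation in the PS4 repository, and the model here is a placeholder.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(1024, 8)           # placeholder for one of the classifiers described above

warmup_steps = 53_397                      # 3 epochs at batch size 1
total_steps = 30 * 17_799                  # 30 epochs over the PS4 training set
max_lr, min_lr = 1e-4, 1e-7

optimizer = Adam(model.parameters(), lr=max_lr)
warmup = LinearLR(optimizer, start_factor=min_lr / max_lr, end_factor=1.0,
                  total_iters=warmup_steps)
cosine = CosineAnnealingLR(optimizer, T_max=total_steps - warmup_steps, eta_min=min_lr)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[warmup_steps])
# scheduler.step() is then called once per optimization step (i.e., per sample at batch size 1).
```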

    The input embeddings from ProtT5-XL-UniRef50 are precomputed, requiring roughly 1 h to generate these for the entire PS4 dataset on GPU. Hence, only the weights of the classifier component are updated by gradient descent, while the encoder protein language model maintains its original weights from pretraining. For convenience and extensibility, we make a script available in our repository to generate these embeddings from any FASTA file, allowing for predictions on novel proteins.
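A minimal sketch of such precomputation is shown below, assuming the half-precision, encoder-only ProtT5-XL-UniRef50 checkpoint published on Hugging Face (Rostlab/prot_t5_xl_half_uniref50-enc); the preprocessing (space-separated residues, rare amino acids mapped to X) follows that checkpoint's documented usage and may differ in detail from the script provided in the PS4 repository.

```python
import re
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
name = "Rostlab/prot_t5_xl_half_uniref50-enc"
tokenizer = T5Tokenizer.from_pretrained(name, do_lower_case=False)
model = T5EncoderModel.from_pretrained(name, torch_dtype=torch.float16).to(device).eval()
if device == "cpu":
    model = model.float()                               # half precision is intended for GPU inference

def embed(sequence: str) -> torch.Tensor:
    """Return an (N, 1024) per-residue embedding matrix for a single protein chain."""
    seq = " ".join(re.sub(r"[UZOB]", "X", sequence))    # map rare residues to X, space-separate residues
    batch = tokenizer(seq, add_special_tokens=True, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model(**batch).last_hidden_state          # (1, N + 1, 1024), trailing special token included
    return out[0, : len(sequence)].float().cpu()        # drop the special token, keep N residue embeddings
```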

    Results & discussion

    Results

PS4-Mega has 83.8 million parameters and takes roughly 40 min per epoch when training on a single GPU. PS4-Conv is much more light-weight, with just 4.8 million parameters, and requires only 13.5 min per epoch on GPU. Both models are trained in full precision. For Q8 secondary structure prediction, PS4-Mega obtains training and test set accuracies of 99.4% and 78.2%, respectively, on the PS4 dataset, and 76.3% on CB513. PS4-Conv performs almost as well, obtaining 93.1% training and 77.9% test set accuracy on PS4 and 75.6% on CB513. Furthermore, both algorithms show an improvement over the state of the art for Q3 accuracy, as shown in Table 2.

    Table 2. Comparison of Q3 and Q8 performance on the CB513 test set by the leading algorithms for secondary structure prediction, all of which use the same protein language model and operate without multiple sequence alignment input.
Model | Q3 accuracy | Q8 accuracy | Trained on CB513
ProtT5-XL-UniRef50 | 86.0% | 74.5% | Yes
PS4-Conv | 86.3% | 75.6% | No
PS4-Mega | 86.8% | 76.3% | No

    The version of ProtT5-XL-UniRef50 shown here includes a convolution-based classifier network. Results for ProtT5-XL-UniRef50 are quoted directly from [17].

    Dataset extension

Our initial dataset preprocessing, from which the secondary structure CSV was obtained, was implemented in the Rust programming language. Filtering through over 160,000 proteins in Python was prohibitively slow. In particular, performing that many string comparisons to verify nonredundancy runs in O(n²) time, and given a large value of n as seen in our case, the speed advantages offered by Rust and its various features were necessary to complete preprocessing on a simple computing setup, running only on a quad-core Intel i5 CPU. We were able to leverage multithreading when iterating through over 169k DSSP files and could therefore increase the efficiency of the sequence comparisons via a parallelized divide-and-conquer approach, reducing time complexity to O(n log n).

Since all training, evaluation and model inference code is made available in a Python library, it would be convenient for all data processing code to also be Python-based. Therefore, we make the original Rust code callable via Python, such that anyone wishing to add new samples to a future version of PS4 can still leverage the original implementation for similarity measurement, ensuring the quality of the dataset is sustainable. Even with the optimizations made, preprocessing the original dataset on common commercial hardware still required close to 2 days. Fortunately, smaller future additions comprising far fewer than 169k sequences will run much faster, in seconds to minutes, because the superlinear time complexity means the cost shrinks disproportionately quickly as the number of sequences falls.

    Initial extensions will be made via pull request to the maintained code repository for the dataset. Sufficient added sequences will give rise to a new, versioned release of the dataset. Future improvements to the PS4 may seek to further simplify the process for community-led extension – for example, managed via a web graphic user interface, so as to maximize accessibility.

    Conclusion

We have presented the largest dataset for secondary structure prediction and made available a pipeline for further growth of the dataset by the bioinformatics community. The promise of learning-based algorithms to be a catalyst of progress on tasks related to protein folding is significant, particularly given recent advances in tertiary structure prediction. However, realizing this promise requires datasets of sufficient scale and quality.

The state of datasets for protein secondary structure prediction has been such that most recent advances in the literature have depended on an amalgamation of different sources of data for both the training and evaluation sets in order to maximize the number of samples. This instantly hampers progress by creating an obstacle to the acquisition of good-quality data by would-be researchers, as well as to well-attested benchmarks to measure against. Because protein sequences in preexisting datasets have typically been difficult to identify, the reliability of assessments may also be an issue, owing to the possibility of leakage between training and evaluation data.

    The most common method to mitigate this issue so far has been using a cutoff date threshold – for example, only evaluating an algorithm on proteins released after a date known to be after all samples in the training set were themselves released. This has the downside of either limiting future research to the same datasets, which are still too small to fully maximize the potential offered by deep learning algorithms, or in the case that new datasets are introduced in the future, immediately invalidating the evaluation data used in setting a previous benchmark.

    We show that by training on the PS4 dataset, we can achieve new state-of-the-art performance on the CB513 test set, validated on multiple classifier architectures. Our method is composable such that alternative protein language models can be used to generate embeddings, should this prove useful to future researchers. We also impose strict sequence similarity restrictions and run these directly against the CB513 dataset as a whole to greatly reduce the probability of data leakage into the test set, with respect to both the CB513 and the PS4's own validation set.

    Future perspective

    Despite the advances we claim, we acknowledge that the PS4 is still too small to truly support the development of a general learning-based solution to protein single-sequence secondary structure prediction. Therefore, we chose to leverage the scaling opportunities offered by open-source technology and provided a protocol for the community to continue augmenting the dataset with new samples. Given a file in PDB format, with full atomic coordinates, it is trivial to assign secondary structure to each residue using DSSP. As the PDB continues to grow, and indeed with the arrival of new protein structure databases with atomic coordinates resolved via automated methods of increasingly high quality [33], the task of single-sequence secondary structure prediction should be able to benefit from increased data availability over time and positively feed back into the cycle by supporting the improvement of tertiary structure prediction algorithms in turn.

We propose the PS4 dataset as a hub for protein secondary structure data for training learning algorithms: a common first port of call where researchers can reliably obtain a high-quality dataset and benchmark against other algorithms with confidence in the data's cleanliness. To achieve this, making it able to grow as new labeled data become available is a first step. Future developments could focus on user experience and quality-of-life improvements to better facilitate community contributions, thus maximizing overall effectiveness.

    Executive summary

    Background

    • Improving protein single-sequence secondary structure prediction methods can help predict the tertiary structure of orphan proteins.

    • Improved orphan protein structure prediction methods open new avenues for de novo protein design, such as for medical applications.

    • Reducing reliance on large multiple sequence alignment databases for running structure prediction algorithms vastly increases accessibility.

• Existing datasets for secondary structure prediction are small, do not identify the contained proteins and have probable issues with evaluation reliability.

    Materials & methods

    • We introduce PS4, by far the largest dataset for secondary structure prediction, featuring 18,731 samples.

• All proteins included in PS4 are identified by their Protein Data Bank code.

• PS4 is internally nonredundant and is also nonredundant against all samples in the CB513 dataset.

    Implementation

• We develop a machine learning algorithm and achieve state-of-the-art Q3 and Q8 accuracy on the CB513 evaluation set by training it on only the PS4 training set.

    Discussion

    • We provide a software toolkit for the wider research community to continue expanding PS4 with new proteins and maintain nonredundancy.

    Supplementary data

    To view the supplementary data that accompany this paper please visit the journal website at: www.future-science.com/doi/suppl/10.2144/btn-2023-0024

    Author contributions

    All conceptualization, experiments and reporting were conducted by the corresponding author.

    Financial disclosure

    The author has no financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending or royalties.

    Competing interests disclosure

    The author has no competing interests or relevant affiliations with any organization or entity with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending or royalties.

    Writing disclosure

    No writing assistance was utilized in the production of this manuscript.

    Open access

This work is licensed under the Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/

    Papers of special note have been highlighted as: • of interest; •• of considerable interest

    References

    • 1. Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with AlphaFold. Nature 596(7873), 583–589 (2021).
    • 2. Baek M, DiMaio F, Anishchenko I et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373(6557), 871–876 (2021).
• 3. Zheng W, Wuyun Q, Freddolino PL. D-I-TASSER: integrating deep learning with multi-MSAs and threading alignments for protein structure prediction. Proc. CASP15 (2022).
    • 4. GitHub. Google-deepmind/alphafold. https://github.com/deepmind/alphafold
    • 5. Wang W, Peng Z, Yang J. Single-sequence protein structure prediction using supervised transformer protein language models. Nat. Comput. Sci. 2(12), 804–814 (2022).
• 6. Alaparthi S, Mishra M. Bidirectional encoder representations from transformers (BERT): a sentiment analysis odyssey (2020). https://arxiv.org/abs/2007.01127
    • 7. Radford A, Wu J, Child R et al. Language models are unsupervised multitask learners (2019). https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
    • 8. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network (2015). https://arxiv.org/abs/1503.02531
    • 9. Ji YY, Li YQ. The role of secondary structure in protein structure selection. Eur. Phys. J. E. 32(1), 10–107 (2010).
    • 10. Huang JT, Wang T, Huang SR, Li X. Prediction of protein folding rates from simplified secondary structure alphabet. J. Theor. Biol. 383, 1–6 (2015).
    • 11. Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22(12), 2577–2637 (1983). •• DSSP is a great starting point for any researcher interested in information retrieval relating to protein secondary structure.
    • 12. Pettersen EF, Goddard TD, Huang CC et al. UCSF ChimeraX: structure visualization for researchers, educators, and developers. Protein Sci. 30(1), 70–82 (2021).
    • 13. Lovejoy B, Choe S, Cascio D et al. Crystal structure of a synthetic triple-stranded alpha-helical bundle. Science 259(5099), 1288–1293 (1993).
    • 14. Scherf T, Kasher R, Balass M et al. A beta-hairpin structure in a 13-mer peptide that binds alpha-bungarotoxin with high affinity and neutralizes its toxicity. Proc. Natl Acad. Sci. USA 98(12), 6629–6634 (2001).
    • 15. Cuff JA, Barton GJ. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 34(4), 508–519 (1999). • The CB513 has provided one of the best sources of comparison for secondary structure prediction algorithms to date, furthering the formation of a canonical literature.
    • 16. Torrisi M, Kaleel M, Pollastri G. Deeper profiles and cascaded recurrent and convolutional neural networks for state-of-the-art protein secondary structure prediction. Sci. Rep. 9(1), 12374 (2019). • An example of previous state-of-the-art machine learning approaches for secondary structure prediction, where multiple sequence alignments were required to obtain strong results.
• 17. Elnaggar A, Heinzinger M, Dallago C et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 44(10), 7112–7127 (2022). •• Providing open-source protein language models to create embeddings was a key step in enabling this work and many related works.
    • 18. Drozdetskiy A, Cole C, Procter J, Barton GJ. JPred4: a protein secondary structure prediction server. Nucleic Acids Res. 43(W1), W389–W394 (2015).
    • 19. Yang Y, Gao J, Wang J et al. Sixty-five years of the long march in protein secondary structure prediction: the final stretch? Brief. Bioinform. 19(3), 482–494 (2018).
• 20. Klausen MS, Jespersen MC, Nielsen H et al. NetSurfP-2.0: improved prediction of protein structural features by integrated deep learning. Proteins 87(6), 520–527 (2019).
    • 21. Singh J, Paliwal K, Litfin T et al. Reaching alignment-profile-based accuracy in predicting protein secondary and tertiary structural properties without alignment. Sci. Rep. 12, 7607 (2022). • A related concurrent work that offers a different approach to constructing datasets for single-sequence q3 secondary structure prediction.
    • 22. Knudsen M, Wiuf C. The CATH database. Hum. Genomics 4(3), 207–212 (2010).
    • 23. GitHub. Omarperacha/ps4-dataset. https://github.com/omarperacha/ps4-dataset
    • 24. Joosten RP, te Beek TAH, Krieger E et al. A series of PDB related databases for everyday needs. Nucleic Acids Res. 39(database issue), D411–D419 (2011).
    • 25. DSSP Program. https://swift.cmbi.umcn.nl/gv/dssp/DSSP_1.html
    • 26. Berger B, Waterman MS, Yu YW. Levenshtein distance, sequence comparison and biological database search. IEEE Trans. Inf. Theory 67(6), 3287–3294 (2021).
    • 27. Harris CR, Millman KJ, van der Walt SJ et al. Array programming with NumPy. Nature 585(7825), 357–362 (2020).
    • 28. Ma X, Zhou C, Kong X et al. Mega: moving average equipped gated attention. arXiv preprint arXiv:2209.10655 (2022). https://arxiv.org/abs/2209.10655
    • 29. Vaswani A, Shazeer N, Parmar N et al. Attention is all you need. Proc. 31st Int. Conf. Neural Inf. Process. Syst. 6000–6010 (2017).
• 30. Paszke A, Gross S, Massa F et al. PyTorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8024–8035 (2019).
    • 31. Kingma DP, Ba J. Adam: a method for stochastic optimization. CoRR abs/1412.6980 (2014).
    • 32. Smith LN, Topin N. Super-convergence: very fast training of neural networks using large learning rates (2017). https://arxiv.org/abs/1708.07120
    • 33. Varadi M, Anyango S, Deshpande M et al. AlphaFold protein structure database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50(D1), D439–D444 (2021).