Tech News | Open Access (CC BY-NC-ND)

Long-read sequencing for the metagenomic analysis of microbiomes

    Tristan Free

    *Author for correspondence:

    E-mail Address: t.free@future-science.com

    Future Science Group, Unitec House, 2 Albert Place, London, N3 1QB, UK

    Published Online: https://doi.org/10.2144/btn-2023-0028

    Abstract

    One technology, long-read sequencing, and one research field, microbiome studies, have risen to prominence over the last decade. But how can one be used in the other? What changes are being wrought? And what limitations remain?

    The rise of long-read sequencing over the last decade has been well documented, with the initial concerns surrounding accuracy being quickly addressed by the latest generation of technologies from Oxford Nanopore (Oxford, UK) and PacBio (CA, USA), which both produce sequences with a read accuracy greater than 99% (Figure 1) [1,2]. These developments have culminated in long-read sequencing being named Nature Methods' Method of the Year 2022 and have led to its spread into ever more complex applications and challenging research, perhaps best exemplified by its burgeoning use in the metagenomic analysis of microbiomes.

    Figure 1. Observed raw read accuracies measured through read-mapping.

    Nanopore R9.4.1 in green, Nanopore R10.4 in yellow. Figure adapted from the original created by Sereika et al. [1] and published under a Creative Commons Attribution 4.0 International License: http://creativecommons.org/licenses/by/4.0/.

    Falling short in metagenomic sequencing

    Metagenomic analyses aim to sequence and assemble the total genomes from all organisms present in a sample, an approach that proves nearly essential to the increasingly popular study of host and environmental microbiomes [3]. Traditionally, metagenomic sequencing has relied on short-read sequencing to provide the sequences with which to construct Metagenome Assembled Genomes (MAGs). However, short-read techniques have several limitations that are difficult to overcome.

    For instance, short-read sequencing often relies on PCR amplification of DNA in the sample, which struggles to amplify GC-rich regions due to the increased stability of these DNA sequences, introducing GC bias into the data created. Furthermore, the length of short reads – typically 75–400 bp – inhibits the sequencing of long repeat sequences and structural variants, limiting the ability to create MAGs with full sequence coverage [4].
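
    The read-length limitation can be made concrete with a small sketch. It assumes, as a simplification, that a read only resolves a repeat if it covers the whole repeat plus some unique flanking sequence on each side. The function names and the 50 bp anchor are illustrative choices, not values taken from the studies cited here.

```python
def can_ever_span(read_length, repeat_length, anchor=50):
    """True if a read of this length could cover the repeat plus
    `anchor` bp of unique flanking sequence on both sides.
    The 50 bp default is an assumed, illustrative value."""
    return read_length >= repeat_length + 2 * anchor


def spans_repeat(read_start, read_length, repeat_start, repeat_length, anchor=50):
    """True if this particular read (0-based start, half-open end)
    covers the whole repeat and both flanking anchors."""
    read_end = read_start + read_length
    repeat_end = repeat_start + repeat_length
    return (read_start <= repeat_start - anchor
            and read_end >= repeat_end + anchor)


# A hypothetical 5 kb repeat: a 250 bp short read can never bridge it,
# whereas a 15 kb long read can.
print(can_ever_span(250, 5_000))
print(can_ever_span(15_000, 5_000))
```

    Under this simplification, no amount of extra short-read depth helps: every short read falls inside the repeat or covers only one edge of it, so the assembler cannot tell how the flanking regions connect.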

    Identifying specific strains of a bacterial species is another challenge that arises when you need to conduct a high-resolution analysis of your sample. Titus Brown, a metagenomic sequence analysis expert who leads the Data Intensive Biology lab at the University of California, Davis (CA, USA), explains why this is so important in the study of microbiomes.

    Bacteria have dramatically more variation in their genomes than eukaryotes, which typically have small variations such as single-nucleotide polymorphisms. Bacteria, meanwhile, will share the same core genome that composes roughly 60% of the genome, but the remaining 40% will be a composite of different genes transferred from, and required to survive in, their specific environment.

    It is highly challenging to obtain the linkage information required to confidently assemble a strain-specific genome using short-read sequencing. While attempts have been made to resolve this issue, it remains a serious roadblock to the use of the technique for this application [5]. Instead, short-read assembled MAGs often represent a composite genome of the strains of a species present in a sample. This loses a lot of the contextual information within the strain-specific genome that is needed to interpret the functions and interactions occurring in a microbiome.

    Enter long-read sequencing

    While short reads present a serious challenge to assembling genomes from the same strain or cell, long reads are large enough to provide the linkage information to identify which specific microbial strain each read is from, enabling the assembly of strain-specific MAGs. Talking to BioTechniques for a recent Talking Techniques podcast episode, Jeremy Wilkinson of PacBio stated that, “…one HiFi read has an average of eight intact genes per read, which allows for greater functional and taxonomic profiling, with 90% of the reads being annotatable or classifiable.” Therefore, if you are designing a study that intends to obtain specific information about the different strains of a species present in a sample, long-read sequencing techniques provide an excellent tool to deliver this [6].
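
    To illustrate what linkage information means in practice, here is a minimal sketch on toy data. It assumes reads are already aligned to a shared coordinate system and are represented as (start, sequence) pairs; the strains, positions and helper names are hypothetical and are not drawn from any tool mentioned in this article.

```python
from collections import Counter


def make_strain(length, variants):
    """Build a toy genome of 'A's carrying the given {position: base} variants."""
    seq = ["A"] * length
    for pos, base in variants.items():
        seq[pos] = base
    return "".join(seq)


def linked_alleles(reads, site_a, site_b):
    """Count allele pairs seen on single reads that cover both sites.
    Each read is a (start, sequence) tuple in 0-based coordinates."""
    pairs = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if start <= site_a < end and start <= site_b < end:
            pairs[(seq[site_a - start], seq[site_b - start])] += 1
    return pairs


# Two toy strains that differ at positions 2 and 12.
strain_1 = make_strain(16, {2: "G", 12: "T"})
strain_2 = make_strain(16, {2: "C", 12: "C"})

# Short reads (5 bp) each see only one variant site.
short_reads = [(0, strain_1[0:5]), (10, strain_1[10:15]),
               (0, strain_2[0:5]), (10, strain_2[10:15])]
# Long reads cover both sites at once.
long_reads = [(0, strain_1), (1, strain_1[1:14]), (0, strain_2)]

print(linked_alleles(short_reads, 2, 12))  # no read links the two sites
print(linked_alleles(long_reads, 2, 12))   # G pairs with T, C pairs with C
```

    The short reads show that G and C both occur at the first site and T and C at the second, but not which combinations belong together. Only the long reads reveal that G travels with T and C with C, which is the information needed to separate the two strains.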

    For example, in a recent study led by the USDA Dairy Forage Research Center (WI, USA), Bickhart et al. used PacBio HiFi reads to generate 428 MAGs with over 90% completeness from sheep feces [7]. Developing their own software, MAGPhase, the team was able to identify 220 lineage-resolved MAGs, differentiating between related strains of the same species found within the fecal matter. With this additional contextual information, the team was able to identify 424 novel host-strain associations within the sheep fecal microbiome.

    Similar metagenomic studies of canine fecal microbiomes have yielded interesting results, highlighting canid-specific bacteria and resolving key aspects of the mobilome, the mobile components of the genome that make up the composite part of bacterial genomes highlighted previously by Brown [8].

    What's more, due to the exceptional size that long reads are now reaching, structural variants and long repeat sequences can now be captured within a single read, while the lack of reliance on PCR means that GC-rich sequences present a much less daunting challenge.

    Remaining challenges

    While long-read sequencing has resolved many issues in metagenomics, two closely linked challenges remain.

    Firstly, as exemplified by the papers noted above, the majority of success stories of long-read sequencing for metagenomics have come from studies of host-associated microbiomes as opposed to environmental microbiomes [9]. This is because long-read sequencing requires higher-quality DNA, and larger quantities of it, than short-read sequencing. The logic is clear: to produce long reads, you need long, intact strands of DNA to sequence. In a host-associated microbiome, there is less diversity, meaning that each species' genome is present at a comparatively higher concentration than in environmental samples. What's more, the DNA in these samples is often in much better condition.

    This is compounded by the second challenge: the output of long-read technologies lags behind that of short-read sequencing, a technique that has had decades of optimization to increase the amount of sequence data that can be obtained from each sample. As a result, generating enough sequences to produce MAGs for each of the numerous species present in a complex environmental sample remains a limitation of these techniques.
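
    A back-of-the-envelope calculation shows why output matters so much for diverse samples. The sketch below assumes, as a simplification, that a species' share of the sequenced bases equals its relative abundance in the sample. The 5 Mb genome size and 20× target depth are illustrative values only, not thresholds taken from the studies cited here.

```python
def expected_depth(total_bases, relative_abundance, genome_size):
    """Average depth of coverage for one species, assuming its share of
    the sequenced bases equals its relative abundance (a simplification)."""
    if not 0 < relative_abundance <= 1 or genome_size <= 0:
        raise ValueError("abundance must be in (0, 1] and genome_size > 0")
    return total_bases * relative_abundance / genome_size


def bases_needed(target_depth, relative_abundance, genome_size):
    """Total sequencing output required for one species to reach target_depth."""
    if not 0 < relative_abundance <= 1 or genome_size <= 0:
        raise ValueError("abundance must be in (0, 1] and genome_size > 0")
    return target_depth * genome_size / relative_abundance


genome_size = 5_000_000  # hypothetical 5 Mb bacterial genome
target_depth = 20        # assumed depth for assembly, for illustration only

for abundance in (0.10, 0.001):
    gigabases = bases_needed(target_depth, abundance, genome_size) / 1e9
    print(f"abundance {abundance:.1%}: about {gigabases:,.0f} Gb of output")
```

    With these assumed numbers, a species making up 10% of the sample needs roughly 1 Gb of total output, while one at 0.1% needs roughly 100 Gb. The required output scales inversely with abundance, which is why the many rare species of a complex environmental sample stretch the capacity of a sequencing run.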

    The ideal solution to this problem would be a methodology that combines long- and short-read sequencing, uniting short reads' ability to provide wide coverage of many low-abundance species with long reads' ability to resolve the genomes of those species down to individual strains.

    In the first steps towards this solution, studies utilizing long-read-assembled and hybrid-assembled MAGs to fill in the gaps left behind by short-read-based metagenomic studies have borne fruit. For instance, a recent analysis of 109 gut microbiomes across three ethnicities in Singapore successfully compiled 1708 hybrid-assembled MAGs that had been missed by short-read-only analyses. This facilitated the discovery of 70 novel gut microbes and over 3400 strains, providing far greater insight into the gut microbiome of Southeast Asian populations [10].

    As for the quantity and quality of DNA? Quantity poses a major challenge that is difficult to address. In an environmental sample, increasing the sample size to raise the amount of DNA that can be yielded from sample preparation only increases the diversity, stretching the output of these sequencing technologies. Ultimately, to resolve this issue the DNA input requirements for these technologies need to be reduced, a solution that is simple in principle but not in practice. Currently, Oxford Nanopore's technology has lower input requirements than PacBio's; however, this is balanced by slightly less accurate read calling.

    Quality, on the other hand, has more readily available solutions. Sample preservation solutions exist that can stabilize the microbes in a sample and protect the nucleic acids present, such as the OMNIgene gut collection kit, and even immobilize pathogens in the sample to make it safe, like Zymo Research's DNA/RNA Shield medium.

    Is it all about the analysis?

    For sample input requirements and sequencing output, time is the only solution, but it is one that Brown is confident will arrive. “I'm not worried about the sequencing processes and the extraction methods. There is plenty of motivation out there to improve these and they will with time.” What concerns him is what follows after collection: analysis.

    A growing trend in multiple fields of the life sciences is the acquisition of larger and more complex datasets, the assumption being that with more information, deeper and more meaningful insights will follow. Yet, in recent years, a counterargument has arisen: if the purpose of developments such as single-cell sequencing was to move away from bulk readings and averaged results, then why in other instances do we often see the generation of larger, deeper datasets as a tool to yield more perceptive insights? Whilst speaking at the 2022 iteration of the annual Society for Neuroscience conference (12–16 November, San Diego, CA, USA), Tim Harris (Howard Hughes Medical Institute, VA, USA) – the developer of Neuropixels, which enable the collection of signals from hundreds of individual neurons in different brain regions – posed the question of his own invention: “Have Neuropixels led to studies that bury findings in vast swathes of data or have they enabled researchers to collect enough information to discover the emphatic truth?” [11]

    The key to this problem is designing data analysis techniques that can provide better insights than simply averaging datasets and grouping findings. This is what mostly concerns Brown. “I think as a field we are mostly concerned with generating data, assuming that we will be able to do something useful with it at a later date. However, as it stands, I think there is a lot of room to improve data analysis.”

    Fortunately, Brown is not alone in this concern. Significant efforts are underway to improve microbiome analysis, and there are many software platforms, such as SmashCommunity, with which to conduct metagenomic data analysis in distinct contexts [12].

    While data analysis does prove perhaps the biggest challenge facing the field, the current techniques available and ongoing research efforts are continuing to produce vitally important results that would not have been possible with previous technologies.

    A fine example of this work is the recent development of target-enriched long-read sequencing (TELSeq) by a research collaboration between the University of Florida (FL, USA) and the University of Minnesota (MN, USA). This technique was used to investigate antimicrobial resistance genes in three different metagenomes: human fecal microbiome transplant material, bovine fecal material from animals that had not received antibiotics and soil from unused prairie surrounded by farmland. This study revealed numerous colocalizations between mobile genetic elements and antibiotic-resistance genes in the soil and human fecal metagenomes. By demonstrating its ability to deliver results in both host-associated and environmental microbiomes, this study provides a significant indication of the vast strides being made in long-read sequencing and the value that these techniques can add to metagenomic analyses of the microbiome [13].

    References

    • 1. Sereika M, Kirkegaard R, Karst S et al. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat. Methods 19(7), 823–826 (2022).
    • 2. Hon T, Mars K, Young G et al. Highly accurate long-read HiFi sequencing data for five complex genomes. Sci. Data 7(1), 399 (2020).
    • 3. National Human Genome Research Institute. Metagenomics. www.genome.gov/genetics-glossary/Metagenomics
    • 4. Adewale B. Will long-read sequencing technologies replace short-read sequencing technologies in the next 10 years? Afr. J. Lab. Med. 9(1), 1340 (2020).
    • 5. Beggel B, Neumann-Fraune M, Kaiser R, Verheyen J, Lengauer T. Inferring short-range linkage information from sequencing chromatograms. PLOS ONE 8(12), e81687 (2013).
    • 6. Reiter T, Brown T. MAGs achieve lineage resolution. Nat. Microbiol. 7(2), 193–194 (2022).
    • 7. Bickhart D, Kolmogorov M, Tseng K et al. Generating lineage-resolved, complete metagenome-assembled genomes from complex microbial communities. Nat. Biotechnol. 40(5), 711–719 (2022).
    • 8. Cuscó A, Perez D, Viñes J, Fàbregas N, Francino O. Long-read metagenomics retrieves complete single-contig bacterial genomes from canine feces. BMC Genom. 22(1), 330 (2021).
    • 9. Marx V. Method of the year: long-read sequencing. Nat. Methods 20(1), 6–11 (2023).
    • 10. Gounot JS, Chia M, Bertrand D et al. Genome-centric analysis of short and long read metagenomes reveals uncharacterized microbiome diversity in southeast Asians. Nat. Commun. 13(1), 6044 (2022).
    • 11. BioTechniques. Talking Techniques. Neuropixels: big data heaven or burying the lead in averages? www.biotechniques.com/podcasts/talking-techniques-neuropixels-big-data-heaven-or-burying-the-lead-in-averages/
    • 12. Navgire G, Goel N, Sawhney G et al. Analysis and interpretation of metagenomics data: an approach. Biol. Proced. Online 24(1), 18 (2022).
    • 13. Slizovskiy I, Oliva M, Settle J et al. Target-enriched long-read sequencing (TELSeq) contextualizes antimicrobial resistance genes in metagenomes. Microbiome 10(1), 185 (2022).