Mini-Review (Open Access)

Deep cap analysis gene expression (CAGE): genome-wide identification of promoters, quantification of their expression, and network inference

    Michiel de Hoon

    Genome Exploration Research Group, RIKEN Genomic Sciences Center, Yokohama Institute, Yokohama, Kanagawa, Japan

    &
    Yoshihide Hayashizaki

    *Address correspondence to Yoshihide Hayashizaki, Genome Exploration Research Group, RIKEN Genomic Sciences Center, Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan. E-mail: yosihide@gsc.riken.jp

    Genome Exploration Research Group, RIKEN Genomic Sciences Center, Yokohama Institute, Yokohama, Kanagawa, Japan

    Published Online: https://doi.org/10.2144/000112802

    Abstract

    In cap analysis gene expression (CAGE), short (∼20 nucleotide) sequence tags originating from the 5′ end of full-length mRNAs are sequenced to identify transcription events on a genome-wide scale. The rapid increase in the throughput of present-day sequencers enables much deeper CAGE tag sequencing, in which the CAGE tag for each mRNA is typically found multiple times in a given experiment. CAGE tag counts can then be used to reliably estimate the cellular concentration of the corresponding mRNA. In contrast to microarray and SAGE expression profiling, CAGE identifies the location of each transcription start site in addition to its expression level. This makes it possible to infer a genome-wide network of transcriptional regulation by searching the promoter region surrounding each CAGE-defined transcription start site for potential transcription factor binding sites. Hence, deep CAGE is a unique tool for the construction of a promoter-based network of transcriptional regulation. CAGE-based expression profiling also allows us to identify dynamic promoter usage in time-course experiments and the specific promoter regulated by a given transcription factor in disruption experiments. The sheer size of the short-tag datasets produced by next-generation sequencers creates a need for new software that can handle this volume of data. In addition, new visualization methods will be needed to represent a promoter-based transcriptional network.

    Introduction

    Cap analysis gene expression (CAGE) was introduced in 2003 as a method to determine transcription start sites on a genome-wide scale by isolating and sequencing short sequence tags originating from the 5′ end of RNA transcripts (1). Mapping these tags back to the reference genome identifies the transcription start sites from which the transcripts originated.

    CAGE relies on a cap-trapper system to capture full-length RNAs while avoiding rRNA and tRNA transcripts. First, an oligo-dT primer is used to reverse-transcribe poly-A terminated RNAs. Alternatively, a random primer can be used for RNAs without a poly-A tail, which may constitute almost half of the transcriptome (2). RNA/DNA double-stranded hybrids that contain a mature mRNA are selected by biotinylating their 5′ cap structure, allowing capture by streptavidin-coated magnetic beads. Ligation of a linker sequence containing an MmeI recognition site to the 5′ end of the full-length cDNA creates a restriction site about 20 nucleotides downstream, producing a short CAGE tag starting at the 5′ end of eukaryotic mRNAs (3). CAGE tags that map just upstream of known genes may be derived from the corresponding full-length mRNAs, whereas others may reflect the existence of currently unknown transcription start sites or genes. Due to their short size, sequencing CAGE tags is more efficient at detecting transcription start sites than sequencing full-length cDNAs.

    In early CAGE experiments, the throughput of sequencers limited the achievable sequencing depth such that many CAGE tags were found only once in a given experiment. More recently, a new generation of sequencers excelling at high-throughput sequencing of short tags has enabled deep CAGE tag sequencing, generating upward of a million tags from a single experimental condition. The tag counts found in such experiments are typically much larger than one, allowing an accurate estimate of the cellular concentration of the RNA molecule corresponding to each CAGE tag. Deep CAGE thus detects both the transcription start site and its expression level, making it a unique tool in the analysis of transcriptional regulatory networks.

    Characteristic Features of CAGE

    High-throughput gene expression experiments based on microarrays (4), massively parallel signature sequencing (MPSS) (5), serial analysis of gene expression (SAGE) (6), and CAGE give us a snapshot of the RNA concentrations in the cell at a particular time in a specific experimental condition. Quantitative real-time PCR (qRT-PCR) (7–10), while not a high-throughput method, can provide a valuable standard for validation because of its accuracy and wide dynamic range. Although these methods are complementary to each other, the characteristic features of CAGE expression profiling make it particularly suitable for investigating the transcriptional regulatory network that drives the expression of genes and noncoding RNAs.

    First, CAGE tag counts allow us to calculate the cellular amount of the corresponding RNA molecule in a digital form. As one mRNA is not preferentially detected over another, expression profiling based on tag sequencing is unbiased, allowing a direct comparison of the expression values of different genes measured in a single deep CAGE experiment. In contrast, microarray fluorescence levels are affected both by the mRNA concentration and by the probe-dependent mRNA affinity, precluding a direct comparison between genes. In addition, tag counts as a measure of mRNA concentrations have a dynamic range that is orders of magnitude larger than microarray expression levels. The accuracy and the dynamic range of CAGE- and SAGE-derived expression levels as well as the sensitivity of detecting lowly expressed transcripts can be improved further by deeper sequencing. Importantly, microarray and qRT-PCR expression profiling are restricted to those transcripts for which a probe or primer pair is available, whereas methodologies based on tag sequencing can also measure the expression of currently unknown transcripts.

    Deep CAGE expression profiling is unique among high-throughput expression profiling methods because the 5′ end of the CAGE tag identifies the corresponding transcription start site. Hence, deep CAGE allows us to determine the promoter driving the transcription of each transcript in addition to its expression level. In contrast, tags generated by SAGE or MPSS are located at the 3′ end of the transcript and do not identify the promoter, which may lie tens of kilobases or more upstream in the genome sequence. A variant of SAGE has been developed that uses oligo-capping to create tags at the 5′ end of the transcript (5′ SAGE) (11), but it is currently not in common use. Whereas expressed sequence tags and full-length mRNA sequencing do identify the promoter, the throughput of these techniques is insufficient to allow genome-wide expression profiling, and they may not be able to detect lowly expressed transcripts.

    Next-generation High-throughput Sequencers Revolutionize Transcriptomics

    In recent years, advances in sequencing chemistry have led to innovations such as pyrosequencing, Solexa sequencing by synthesis (Illumina, Inc., San Diego, CA, USA), SOLiD sequencing by oligonucleotide ligation and detection with 2-base encoding (Applied Biosystems, Foster City, CA, USA), and true single molecule sequencing by Helicos (Helicos BioSciences Corp., Cambridge, MA, USA). The number of bp that can be read in 1 day by high-throughput sequencers based on these techniques is orders of magnitude larger than that of conventional sequencers based on Sanger sequencing (see Table 1). These sequencers are particularly suitable for short sequence reads of the kind generated by CAGE, in contrast to genome sequencing, for which longer sequence reads are preferred. This makes transcriptome sequencing one of the primary targets of high-throughput sequencers.

    Table 1. Sequencing Capacities of Current and Near-future Next-generation Sequencers, and the Estimated Corresponding Deep CAGE Throughput and Sensitivity

    The increased throughput of sequencers enables CAGE sequencing at a much greater depth than previously possible. This makes it increasingly possible to detect RNA molecules present at very low copy numbers in the cell, which is critical in time-course experiments, for example, where transcripts coding for regulatory molecules may be present only transiently and at low concentrations. The probability of detecting an RNA molecule that is present in k copies per cell is approximately 1 – exp(−nk/m), where n is the number of CAGE tags extracted and m is the total number of RNA molecules in the cell. A 454 sequencer (454 Life Sciences, Branford, CT, USA) generating 400,000 reads per run of a length of about 250 bases can produce ∼1,000,000 mappable CAGE tags in one run. Assuming a total mRNA concentration of 200,000 molecules per cell, nk/m = 5 for an RNA molecule present in one copy per cell, so the probability of detecting it is 1 – exp(−5) ≈ 99.33%. The Solexa sequencer can generate about 5× more CAGE tags per run, making it possible to detect an RNA molecule present in one copy per five cells with the same probability. Deeper CAGE also increases the CAGE tag count for each particular mRNA, enabling more precise expression profiling even for lowly expressed transcripts, with the relative sampling error decreasing as the inverse square root of the number of mappable CAGE tags.
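
    The detection estimate above can be reproduced with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the relative-error helper assumes Poisson counting statistics, which the text implies but does not state.

```python
import math

def detection_probability(n_tags, copies_per_cell, total_rna_per_cell):
    """Approximate probability 1 - exp(-n*k/m) of sequencing at least one tag."""
    return 1.0 - math.exp(-n_tags * copies_per_cell / total_rna_per_cell)

def relative_sampling_error(expected_tag_count):
    """Relative error of a tag count, assuming Poisson statistics (an assumption)."""
    return 1.0 / math.sqrt(expected_tag_count)

# 454 run: about 1,000,000 mappable tags, one copy per cell, 200,000 mRNAs per cell.
p_454 = detection_probability(1_000_000, 1.0, 200_000)

# Solexa run: about 5x more tags, one copy per five cells (k = 0.2).
p_solexa = detection_probability(5_000_000, 0.2, 200_000)

print(f"{p_454:.4f}")     # 0.9933
print(f"{p_solexa:.4f}")  # 0.9933
```

    Under the Poisson assumption, a transcript expected to yield 100 tags has a relative sampling error of about 10%, and sequencing four times deeper halves that error.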

    Transcriptome and Regulation Probed by Deep CAGE Experiments

    Due to the inherent variability in transcription start sites, CAGE tags are typically found scattered over short genomic regions. In previous shallow CAGE experiments, promoters were constructed by clustering the 5′ ends of the individual CAGE tags based on the distance between them on the genome (12). In deep CAGE experiments, transcription start site clustering may also take into account the similarity in expression profiles. Deep CAGE thus allows us to define the individual promoters from which genes are transcribed based on the promoter activities.
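
    Distance-based clustering of tag 5′ ends can be sketched as follows. This is an illustration only, not the procedure used in (12): the function name and the 20-bp gap threshold are hypothetical, and the input is assumed to hold positions from a single chromosome and strand.

```python
def cluster_tss(positions, max_gap=20):
    """Group tag 5' end positions into clusters.

    A new cluster starts whenever the distance to the previous
    position exceeds max_gap (a hypothetical threshold).
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

# Toy 5' end positions; repeated values are tags sharing a start site.
tags = [1000, 1003, 1003, 1010, 1500, 1504, 9000]
clusters = cluster_tss(tags, max_gap=20)

# (start, end, tag count) for each candidate promoter
summary = [(c[0], c[-1], len(c)) for c in clusters]
print(summary)  # [(1000, 1010, 4), (1500, 1504, 2), (9000, 9000, 1)]
```

    With deep CAGE data, the tag count of each cluster across conditions could additionally be compared so that neighboring clusters with dissimilar expression profiles are kept separate.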

    Previous CAGE experiments indicate that transcription units in mammalian genomes are characterized by alternative transcription start sites and multiple splice forms that are active in different cellular contexts (12–16). In any given CAGE experiment, some of the CAGE tags correspond to previously identified promoters, whereas others are due to novel promoters and transcripts that may give rise to novel protein variants or contain alternative localization or degradation signals. CAGE has also led to the discovery of novel noncoding RNAs (12,17).

    CAGE expression profiling has produced a wide variety of new insights into the mammalian transcriptome and its regulation. An analysis of CAGE-defined transcription start sites in human and mouse illustrated that promoters characterized by a TATA-box tend to have a clear, single transcription start site, whereas promoters associated with CpG islands tend to have transcription start sites distributed over a broad area (14). CAGE expression profiling, with its ability to localize the promoter as well as determine its expression level, is ideally suited to study the regulation of bidirectional promoters (18,19), which may be coregulated by shared transcription factor binding sites. Antisense transcripts, which are transcribed in the opposite direction to the coding strand, may play a role in regulation via RNA interference as well as gene silencing at the chromatin level (20–24). A genome-wide analysis of CAGE transcriptome data revealed frequent concordant regulation of sense/antisense pairs (25).

    While genes can have multiple promoters in principle, their usage will likely vary depending on cellular conditions. Time-course CAGE experiments can be used to study the dynamic usage of promoters by comparing the distribution of expression levels over transcription start sites in the upstream region of a gene to discover which promoters are switched on and off between time points during the experiment. Similarly, we may find tissue-specific promoter usage, or promoters that are only used in specific cell lines.
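
    One simple way to compare promoter usage between two time points is to convert the tag counts of each promoter of a gene into fractions and look at how those fractions change. The sketch below uses hypothetical promoter labels and counts; a real analysis would also need a statistical test that accounts for sampling noise.

```python
def usage_fractions(counts):
    """Fraction of a gene's tags contributed by each promoter."""
    total = sum(counts.values())
    return {p: (c / total if total else 0.0) for p, c in counts.items()}

def usage_shift(counts_a, counts_b):
    """Change in usage fraction per promoter from condition a to condition b."""
    frac_a = usage_fractions(counts_a)
    frac_b = usage_fractions(counts_b)
    promoters = sorted(set(frac_a) | set(frac_b))
    return {p: frac_b.get(p, 0.0) - frac_a.get(p, 0.0) for p in promoters}

# Hypothetical tag counts for two promoters of one gene at two time points.
t0 = {"P1": 90, "P2": 10}
t1 = {"P1": 30, "P2": 70}

shift = usage_shift(t0, t1)
for promoter, delta in shift.items():
    print(promoter, f"{delta:+.2f}")
```

    In this toy example the gene switches from predominantly using P1 to predominantly using P2, even though the total tag count for the gene is unchanged.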

    A major goal of the analysis of deep CAGE sequence data is to infer the regulatory network that orchestrates transcription in a cell (26). With the target of each regulatory interaction being a promoter instead of a gene, such a network is qualitatively different from current gene-based networks inferred from microarray or SAGE expression profiling. Each of the promoters may be associated with one or more coding or noncoding transcripts.

    A promoter-based network can be constructed by first obtaining matrix models to describe the binding affinities of transcription factors, and then using these matrices to search for potential transcription factor binding sites in promoter regions. Both of these steps are facilitated by the availability of the promoter locations and expression levels measured by deep CAGE.

    Matrix models of transcription factors can be derived either from the literature or by aligning the upstream regions of coregulated genes. Such coregulated genes can be found by clustering the genome-wide expression profiles measured in microarray experiments. However, the expression profile measured by microarrays is an aggregate of the expression profiles of the individual promoters from which transcripts originate, hampering a clean clustering result. In addition, each of these promoters is typically regulated by a different set of transcription factors. Deep CAGE expression profiling enables us to separate the expression of a gene into the contributions of the individual promoters, which can then be clustered based on their expression profiles to generate tighter clusters of co-expression. Each of these clusters of promoters is likely regulated by a smaller set of transcription factors than conventional clusters of genes. In addition, the CAGE-defined promoter identifies the genome regions to be aligned to find overrepresented sequence motifs.

    Similarly, we can restrict the genome region to be searched for potential transcription factor binding sites to the promoter region identified by CAGE, significantly reducing the possibility of false positives. While comparative genomics may also be used to identify the promoter region, it does not pinpoint the exact transcription start site. In addition, many biologically functional sequences do not seem to be evolutionarily constrained across all mammals (16) and therefore cannot be identified by comparative approaches. Finally, deep CAGE profiling identifies which promoters are active in a particular biological context and therefore suggests which transcription factor binding sites may be biologically relevant.
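
    Scanning a CAGE-defined promoter region with a matrix model can be sketched as a log-odds score computed in a sliding window. The code below is a minimal illustration: the toy count matrix, the uniform background, the pseudocount, and the threshold are all assumptions, only the forward strand is scanned, and the input sequence is taken to be the promoter region already extracted around the transcription start site.

```python
import math

BASES = "ACGT"

def log_odds_matrix(count_columns, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in count_columns:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2((column.get(b, 0) + pseudocount) / total / background)
            for b in BASES
        })
    return matrix

def scan_promoter(promoter_seq, matrix, threshold):
    """Return (offset, score) for each window scoring at or above threshold."""
    width = len(matrix)
    hits = []
    for offset in range(len(promoter_seq) - width + 1):
        window = promoter_seq[offset:offset + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((offset, score))
    return hits

# Toy matrix built from ten hypothetical binding sites that all read TATA.
toy_counts = [{"T": 10}, {"A": 10}, {"T": 10}, {"A": 10}]
toy_matrix = log_odds_matrix(toy_counts)

hits = scan_promoter("GGCTATAAGG", toy_matrix, threshold=5.0)
print([offset for offset, _ in hits])  # [3]
```

    Restricting the scanned sequence to the promoter region shortens the search space, which is what reduces the number of chance matches above the threshold.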

    Software Needs for High-throughput Transcriptome Sequencing

    The sheer size of the datasets generated by high-throughput transcriptome sequencing places new requirements on the software tools used to analyze these data. During extraction of CAGE tags from the raw reads, care must be taken to correctly distinguish the tag from linker sequences. Mapping CAGE tags to the genome is complicated by the possibility of sequence mismatches as well as tags mapping to multiple locations on the genome. Whereas BLAST (27) can be used to place CAGE tags on the genome, this tool is likely too slow given the size of the high-throughput datasets produced by next-generation sequencers. In addition, BLAST is based on the assumption that the sequences to be compared are evolutionarily related, which is clearly not appropriate for CAGE tag mapping. For this purpose, software specializing in transcriptome tag sequencing such as SSAHA (Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK) (28) and Nexalign (RIKEN, Yokohama, Japan) (29) may be more appropriate. The latter is exceedingly fast for exact matches in particular, and guarantees to find any tag present in the genome.
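
    Exact-match tag mapping can be illustrated with a simple hash index of all genome substrings of the tag length. This toy sketch is not the algorithm of SSAHA or Nexalign; it indexes the forward strand only, ignores mismatches, and would need a far more memory-efficient index for a mammalian genome.

```python
from collections import defaultdict

def build_index(genome, tag_length):
    """Map every substring of length tag_length to its start positions."""
    index = defaultdict(list)
    for start in range(len(genome) - tag_length + 1):
        index[genome[start:start + tag_length]].append(start)
    return index

def map_tags(tags, index):
    """Look up each tag; an empty list means no exact match was found."""
    return {tag: list(index.get(tag, [])) for tag in tags}

# Toy genome and 4-nucleotide tags (real CAGE tags are about 20 nucleotides).
genome = "ACGTACGTTTGACCA"
index = build_index(genome, tag_length=4)
mapping = map_tags(["ACGT", "TTGA", "GGGG"], index)
print(mapping)  # {'ACGT': [0, 4], 'TTGA': [8], 'GGGG': []}
```

    The tag ACGT maps to two locations in this toy genome, which is the ambiguous case discussed in the next paragraph.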

    Special care must be taken for ambiguous tags that map to multiple genome locations with equal scores. In such cases, it may be possible to decide between the mapping locations by considering the number of singly-mapped CAGE tags in each genome neighborhood as a prior probability (30). Many of the ambiguous CAGE tags map to a large number of genome locations, suggesting that they originate from repeat regions of the genome. This is consistent with a considerable fraction of the human transcriptome originating from repetitive sequences, which may play a role in the transcriptional regulation of gene expression (31).
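
    The idea of using singly-mapped tags as a prior can be sketched as a fractional assignment of each ambiguous tag. This is a simplified illustration and not the exact procedure of (30): the 100-bp window and the uniform fallback for locations without any nearby support are assumptions, and positions are taken to lie on one chromosome.

```python
def rescue_weights(candidate_positions, unique_tag_positions, window=100):
    """Split one multimapping tag across its candidate locations.

    Each candidate is weighted by the number of singly-mapped tags
    within `window` bp (a hypothetical neighborhood size).
    """
    support = [
        sum(1 for u in unique_tag_positions if abs(u - pos) <= window)
        for pos in candidate_positions
    ]
    total = sum(support)
    if total == 0:
        # No nearby evidence at any location: fall back to an even split.
        return [1.0 / len(candidate_positions)] * len(candidate_positions)
    return [s / total for s in support]

# One ambiguous tag with two candidate locations, and four singly-mapped tags.
candidates = [1000, 50000]
unique_tags = [990, 1005, 1050, 49990]

weights = rescue_weights(candidates, unique_tags)
print(weights)  # [0.75, 0.25]
```

    Here three of the four singly-mapped tags lie near the first candidate, so it receives three quarters of the ambiguous tag's count.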

    The size of the datasets produced by next-generation sequencers also poses a challenge to data management systems. As terabytes of image data may be generated in a single run, saving all raw data produced in one experiment may no longer be an option. With billions of bases being generated in a single run, even saving only the sequence data will be a considerable enterprise.

    In addition to new software to analyze deep CAGE data, the paradigm shift from gene-based networks to promoter-based networks of transcriptional regulation requires new ways to visualize such networks. Visualization software packages such as Cytoscape (32) have previously been developed to represent biomolecular interaction networks and can be used to draw gene regulatory networks. In a promoter-based network, the targets of regulatory interactions are the individual promoters of a gene, which may be too numerous for graphical representation except in the most detailed drawings. The situation is further compounded by the multitude of transcripts that have been identified for each gene. A multiscale visualization approach in which users can choose the level of detail at which each gene is represented may be suitable for visualizing promoter-based regulatory networks.

    Acknowledgements

    This work was supported by a research grant from the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government to Y.H., a grant from the Genome Network Project, also from the Ministry of Education, Culture, Sports, Science and Technology, and a grant from the RIKEN Frontier Research System, Functional RNA Research Program.

    Competing Interests Statement

    The authors declare no competing interests.

    References

    • 1. Shiraki, T., S. Kondo, S. Katayama, K. Waki, T. Kasukawa, H. Kawaji, R. Kodzius, A. Watahiki, et al. 2003. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl. Acad. Sci. USA 100:15776–15781.
    • 2. Cheng, J., P. Kapranov, J. Drenkow, S. Dike, S. Brubaker, S. Patel, J. Long, D. Stern, et al. 2005. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308:1149–1154.
    • 3. Kodzius, R., M. Kojima, H. Nishiyori, M. Nakamura, S. Fukuda, M. Tagami, D. Sasaki, K. Imamura, et al. 2006. CAGE: cap analysis of gene expression. Nat. Methods 3:211–222.
    • 4. Schena, M., D. Shalon, R.W. Davis, and P.O. Brown. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470.
    • 5. Brenner, S., M. Johnson, J. Bridgham, G. Golda, D.H. Lloyd, D. Johnson, S. Luo, S. McCurdy, et al. 2000. Gene expression by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18:630–634.
    • 6. Velculescu, V.E., L. Zhang, B. Vogelstein, and K.W. Kinzler. 1995. Serial analysis of gene expression. Science 270:484–487.
    • 7. Higuchi, R., G. Dollinger, P.S. Walsh, and R. Griffith. 1992. Simultaneous amplification and detection of specific DNA sequences. Biotechnology (N. Y.) 10:413–417.
    • 8. Higuchi, R., C. Fockler, G. Dollinger, and R. Watson. 1993. Kinetic PCR: Real time monitoring of DNA amplification reactions. Biotechnology (N. Y.) 11:1026–1030.
    • 9. Heid, C.A., J. Stevens, K.J. Livak, and P.M. Williams. 1996. Real time quantitative PCR. Genome Res. 6:986–994.
    • 10. Wittwer, C.T., M.G. Herrmann, A.A. Moss, and R.P. Rasmussen. 1997. Continuous fluorescence monitoring of rapid cycle DNA amplification. BioTechniques 22:130–138.
    • 11. Hashimoto, S., Y. Suzuki, Y. Kasai, K. Morohoshi, T. Yamada, J. Sese, S. Morishita, S. Sugano, and K. Matsushima. 2004. 5′-end SAGE for the analysis of transcriptional start sites. Nat. Biotechnol. 22:1146–1149.
    • 12. Carninci, P., T. Kasukawa, S. Katayama, J. Gough, M.C. Frith, N. Maeda, R. Oyama, T. Ravasi, et al. 2005. The transcriptional landscape of the mammalian genome. Science 309:1559–1563.
    • 13. Zavolan, M., S. Kondo, C. Schönbach, J. Adachi, D.A. Hume, Y. Hayashizaki, T. Gaasterland, RIKEN GER Group, et al. 2003. Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res. 13:1290–1300.
    • 14. Carninci, P., A. Sandelin, B. Lenhard, S. Katayama, K. Shimokawa, J. Ponjavic, C.A.M. Semple, M.S. Taylor, et al. 2006. Genome-wide analysis of mammalian promoter architecture and evolution. Nat. Genet. 38:626–635.
    • 15. ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306:636–640.
    • 16. ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447:799–816.
    • 17. Ravasi, T., H. Suzuki, K.C. Pang, S. Katayama, M. Furuno, R. Okunishi, S. Fukuda, K. Ru, et al. 2006. Experimental validation of the regulated expression of large numbers of non-coding RNAs from the mouse genome. Genome Res. 16:11–19.
    • 18. Trinklein, N.D., S.F. Aldred, S.J. Hartman, D.I. Schroeder, R.P. Otillar, and R.M. Myers. 2004. An abundance of bidirectional promoters in the human genome. Genome Res. 14:62–66.
    • 19. Engström, P.G., H. Suzuki, N. Ninomiya, A. Akalin, L. Sessa, G. Lavorgna, A. Brozzi, L. Luzi, et al. 2006. Complex loci in human and mouse genomes. PLoS Genet. 2:e47.
    • 20. Ambros, V. 2004. The functions of animal microRNAs. Nature 431:350–355.
    • 21. Fukagawa, T., M. Nogami, M. Yoshikawa, M. Ikeno, T. Okazaki, Y. Takami, T. Nakayama, and M. Oshimura. 2004. Dicer is essential for formation of the heterochromatin structure in vertebrate cells. Nat. Cell Biol. 6:784–791.
    • 22. Imamura, T., S. Yamamoto, J. Ohgane, N. Hattori, S. Tanaka, and K. Shiota. 2004. Non-coding RNA directed DNA demethylation of Sphk1 CpG island. Biochem. Biophys. Res. Commun. 322:593–600.
    • 23. Murrell, A., S. Heeson, and W. Reik. 2004. Interaction between differentially methylated regions partitions the imprinted genes Igf2 and H19 into parent-specific chromatin loops. Nat. Genet. 36:889–893.
    • 24. Andersen, A.A. and B. Panning. 2003. Epigenetic gene regulation by noncoding RNAs. Curr. Opin. Cell Biol. 15:281–289.
    • 25. Katayama, S., Y. Tomaru, T. Kasukawa, K. Waki, M. Nakanishi, M. Nakamura, H. Nishida, C.C. Yap, et al. 2005. Antisense transcription in the mammalian transcriptome. Science 309:1564–1566.
    • 26. Nilsson, R., V.B. Bajic, H. Suzuki, D. di Bernardo, J. Björkegren, S. Katayama, J.F. Reid, M.J. Sweet, et al. 2006. Transcriptional network dynamics in macrophage activation. Genomics 88:133–142.
    • 27. Altschul, S.F., W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. 1990. Basic local alignment search tool. J. Mol. Biol. 215:403–410.
    • 28. Ning, Z., A.J. Cox, and J.C. Mullikin. 2001. SSAHA: a fast search method for large DNA databases. Genome Res. 11:1725–1729.
    • 29. Lassmann, T., E. Arner, and C.O. Daub. 2008. Manuscript in preparation.
    • 30. Faulkner, G.J., A.R.R. Forrest, A.M. Chalk, K. Schroder, Y. Hayashizaki, P. Carninci, D.A. Hume, and S.M. Grimmond. 2008. A rescue strategy for multimapping short sequence tags refines surveys of transcriptional activity by CAGE. Genomics 91:281–288.
    • 31. Imanishi, T., T. Itoh, Y. Suzuki, C. O'Donovan, S. Fukuchi, K.O. Koyanagi, R.A. Barrero, T. Tamura, et al. 2004. Integrative annotation of 21,037 human genes validated by full-length cDNA clones. PLoS Biol. 2:e162.
    • 32. Shannon, P., A. Markiel, O. Ozier, N.S. Baliga, J.T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. 2003. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13:2498–2504.