New DNA sequencing technologies: applications, promises and challenges for pharmacogenomics
Posted: 30 July 2009 | Daniel G. MacArthur, Visiting Fellow, Wellcome Trust Sanger Institute supported by an Overseas Biomedical Fellowship from the Australian National Health and Medical Research Council
We are currently on the cusp of a technology-driven revolution in the field of genomics. The rapid evolution of DNA sequencing technology is already providing researchers with the ability to generate data about genetic variation and patterns of gene expression on an unprecedented scale; within just a few years it is likely that these technologies will allow accurate sequencing of complete human genomes to become a routine tool for researchers and clinicians. This review covers the emerging field of new DNA sequencing technologies, and outlines the potential benefits – and the challenges – of these technologies for pharmacogenomics.
The importance of DNA sequencing
DNA sequencing provides a useful read-out for a variety of sources of biomedically significant information. In its most straightforward application DNA sequencing can be used directly to uncover patterns of genetic variation for exploring associations with clinical traits. However, other types of biological information can also be converted into DNA for analysis with DNA sequencing technologies: for instance, RNA molecules can be reverse-transcribed into DNA, and patterns of cytosine methylation (an important epigenetic marker) can be converted into a form accessible to DNA sequencing through the process of bisulfite conversion. Improvements in DNA sequencing technology will thus lead to advances in our understanding of many different biological processes.
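As a rough illustration of how bisulfite conversion makes methylation readable by a sequencer, the sketch below simulates the treatment in silico: unmethylated cytosines end up being read as thymine, while methylated cytosines are protected and are still read as cytosine. The function names and the representation of methylated sites as a set of 0-based indices are assumptions made for this example, and the sketch ignores strand effects and incomplete conversion.

```python
def bisulfite_convert(seq, methylated_positions):
    """Return the sequence as it would be read after bisulfite treatment.

    Unmethylated C is read as T; a C whose 0-based index is listed in
    methylated_positions is protected and stays C.
    """
    methylated = set(methylated_positions)
    converted = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, converted):
    """Infer methylated positions: a reference C that is still read as C."""
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference.upper(), converted))
        if ref_base == "C" and read_base == "C"
    ]


reference = "ACGTCCGA"
converted = bisulfite_convert(reference, methylated_positions={1})
print(converted)
print(call_methylation(reference, converted))
```

Comparing the converted read with the untreated reference recovers the methylated position, which is the essence of how methylation patterns become accessible to ordinary DNA sequencing.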
Second-generation sequencing platforms
Since the late 1970s DNA sequencing has been dominated by a single chemical approach, albeit in increasingly sophisticated formats: dideoxy terminator or “Sanger” sequencing [1]. In a modern and highly automated format, dideoxy sequencing was the workhorse behind both competing efforts to sequence the human genome [2,3] and remains the most widely used sequencing technology today. However, several new so-called “second-generation” sequencing technologies offering tremendously higher throughput are now rapidly displacing the dideoxy approach.
Three second-generation technologies are currently commercially available: the 454 platform offered by Roche (www.454.com), Illumina’s Genome Analyzer technology (http://illumina.com/pages.ilmn?ID=204), and the SOLiD platform from Applied Biosystems (http://solid.appliedbiosystems.com/). These differ from one another, and from traditional dideoxy sequencing, in their underlying chemistry and in many other parameters. A fourth second-generation short-read platform, developed by Complete Genomics (http://www.completegenomics.com/), is currently on the verge of launch; notably, however, the company does not plan to sell the platform itself, but rather intends to use the technology solely to generate complete human genome sequences as a service from within its own custom-built sequencing facilities.
All four second-generation platforms share a conceptually similar approach to sequencing: (1) random fragmentation of input DNA and ligation of adaptors to fragment ends; (2) immobilisation of the DNA fragments on a solid matrix (or bead), followed by amplification to produce discrete clusters of identical molecules; and (3) sequencing of one or both ends of the resulting fragments through alternating cycles of substrate addition and imaging.
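Steps (1) and (3) of this shared approach can be mimicked with a toy simulation. The sketch below samples fixed-length fragments at random positions from a made-up genome and then reads a set number of bases from one or both ends of each fragment. Everything here (function names, fragment length, read length, the random genome) is a hypothetical choice for illustration; adaptor ligation, amplification, imaging and sequencing errors are not modelled.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def random_fragments(genome, n_fragments, frag_len, rng):
    """Step 1 (simplified): sample fragments of fixed length at random positions."""
    starts = [rng.randrange(0, len(genome) - frag_len + 1) for _ in range(n_fragments)]
    return [(start, genome[start:start + frag_len]) for start in starts]


def paired_end_reads(fragment, read_len):
    """Step 3 (simplified): read read_len bases inward from each end of a fragment."""
    forward = fragment[:read_len]
    reverse = reverse_complement(fragment)[:read_len]
    return forward, reverse


rng = random.Random(0)  # fixed seed so the toy run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(1000))
for start, fragment in random_fragments(genome, 3, 200, rng):
    forward, reverse = paired_end_reads(fragment, 35)
    print(start, forward, reverse)
```

Because the middle of each fragment is never read, a pair of reads carries two short sequences plus an approximate distance between them, which is what downstream mapping software has to work with.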
The most notable advantage of the second-generation technologies over Sanger sequencing is throughput: all are capable of generating millions to billions of bases of sequence data in a single run, orders of magnitude more than could be generated by dideoxy sequencing. This massive increase in throughput permits DNA sequencing experiments to be performed on a scale far beyond that feasible with Sanger sequencing, at an ever-reducing cost. As an example, the sequencing of near-complete individual human genomes has now been performed with all three of the major second-generation technologies [4,5,6].
The declining cost of sequencing is best illustrated by the fact that while the first draft human genome sequence (completed in 2001) cost several billion US dollars, Illumina recently launched a retail whole-genome sequencing service for under US$50,000 and Complete Genomics is promising a commercial price beginning at US$5,000. At the current rate of decrease many industry observers expect to see retail human genome sequences offered for under US$1,000 within the next two years.
The massive throughput of new sequencing technologies does come at a cost, however: second-generation sequence data is generated as a series of fragmentary reads that are shorter and typically contain substantially more errors than those generated by Sanger sequencing. Both the read length and error rates of the new technologies are rapidly improving (for instance, read lengths for the Illumina platform have now increased from 35 bases to over 100 bases), but both still result in considerable informatics challenges during downstream analysis. While improvements to the existing second-generation platforms will certainly yield increasingly accurate data, it is likely that the largest leaps in quality will come from the development of entirely novel technological approaches.
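To give a feel for what the change in read length means in practice, the following back-of-the-envelope sketch computes how many reads are needed to reach a target average coverage, using the relation coverage = (number of reads × read length) / genome size. The genome size of 3 billion bases and the 30-fold target are round-number assumptions for illustration, not figures from any platform specification, and the uncovered-fraction estimate assumes idealised, uniformly random read placement.

```python
import math


def reads_needed(genome_size, read_length, coverage):
    """Number of reads required so that reads * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / read_length)


def expected_uncovered_fraction(coverage):
    """Fraction of bases with zero reads if read starts were uniformly random (Poisson)."""
    return math.exp(-coverage)


GENOME_SIZE = 3_000_000_000  # assumed round figure for a human genome, in bases
TARGET_COVERAGE = 30         # assumed target average depth

for read_length in (35, 100):
    n = reads_needed(GENOME_SIZE, read_length, TARGET_COVERAGE)
    print(f"{read_length}-base reads: {n:,} reads for {TARGET_COVERAGE}x coverage")

print(expected_uncovered_fraction(TARGET_COVERAGE))
```

Under these assumptions, moving from 35-base to 100-base reads cuts the number of reads required by nearly a factor of three, although real data are far from uniformly distributed and repetitive regions remain problematic regardless of depth.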
Third-generation sequencing platforms
Several sequencing platforms currently in development promise even greater advances in throughput and resolution. These are based on more diverse chemistries than second-generation platforms, but can be broadly characterised as offering two major advantages over currently commercially available platforms: substantially longer read lengths, and direct analysis of single DNA molecules.
A thorough comparison of third-generation sequencing technologies is difficult due to the rate of progress in the field and the limited information on operational performance currently available in the public domain. However, several emerging third-generation platforms are worth mentioning to highlight the diverse approaches being taken to the generation of DNA sequence: Pacific Biosciences (http://www.pacificbiosciences.com/), whose technology is based on real-time visualisation of the incorporation of fluorescently labelled bases into a single, immobilised DNA molecule; Oxford Nanopore Technologies (http://www.nanoporetech.com/), who rely on detecting the sequential passage of cleaved nucleotides from a DNA strand through a protein nanopore acting as an electrical sensor; and ZS Genetics (http://www.zsgenetics.com/), who plan to visualise DNA strands directly using electron microscopy.
It is currently unclear which of these approaches, if any, will ultimately become the default technology for large-scale DNA sequencing. However, it is clear that the development of any technology capable of generating very long independent reads from single molecules will substantially improve our ability to sequence human genomes: current short-read technologies are incapable of producing reliable sequence for the 10-15% of the human genome contained within highly repetitive regions, and also provide limited information about which of the two homologous chromosomes in an individual carries a particular variant (so-called haplotypic phase). Complete reconstruction of individual genomes thus awaits the development of long-read single-molecule approaches.
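The difficulty that repeats pose for short reads can be shown with a toy example. In the sketch below a short read drawn from inside a tandem repeat matches the reference at several positions, so its true origin cannot be determined, whereas a longer read that spans the repeat and reaches unique sequence on both sides matches only once. The sequences, lengths and the exact-match search are all invented for illustration; real aligners tolerate mismatches and use indexed search.

```python
def find_all(reference, read):
    """Return every 0-based start position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)  # allow overlapping matches
    return positions


repeat_unit = "ACGTTGCA"
# Unique flank + four copies of the repeat unit + unique flank (48 bases in total).
reference = "GGATCCTA" + repeat_unit * 4 + "TTCAGGCA"

short_read = reference[10:18]  # 8 bases lying entirely inside the repeat
long_read = reference[4:44]    # 40 bases spanning the repeat and both flanks

print(find_all(reference, short_read))
print(find_all(reference, long_read))
```

In this toy reference the short read is placed equally well at three positions, while the long read has a single placement because it is anchored by the unique flanking sequence.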
Applications of new sequencing technology in pharmacogenomics
Massively parallel sequencing technologies are already altering almost every field of genetics, and pharmacogenomics will be no exception.
The most obvious application of new sequencing technologies in pharmacogenomics is the discovery of novel genetic variants that may influence drug response. Recent genome-wide association studies of drug responses (e.g. statin-induced myopathy [7] and stable warfarin dose [8]) have revealed genetic variants explaining a surprisingly large proportion of the population variance in drug response. However, much residual variance remains to be explained in these and other pharmacologically relevant traits, and genome-wide association studies performed to date have only been well-powered to detect associations with common variants present at a population frequency of greater than 5% [8].
It is likely that some non-trivial fraction of the population variance in drug response is due to genetic variants at a frequency below 5%, which may individually have large effects on drug efficacy and toxicity. Discovering all such variants in the population will ultimately require deep resequencing studies in which large numbers of individuals with varying drug responses are analysed using DNA sequencing technologies – initially characterising targeted regions of the genome with a high prior probability of playing a role in drug response (e.g. cytochrome P450 genes), and eventually expanding to analysis of complete genome sequences as the cost of sequencing drops.
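A simple calculation helps explain why finding low-frequency variants demands large resequencing cohorts. The sketch below assumes Hardy-Weinberg proportions, random sampling of unrelated individuals and perfect variant detection, none of which holds exactly in practice; the function names and the example cohort sizes are hypothetical.

```python
def expected_carriers(n_individuals, allele_freq):
    """Expected number of individuals carrying at least one copy of the variant."""
    carrier_fraction = 1 - (1 - allele_freq) ** 2
    return n_individuals * carrier_fraction


def prob_variant_observed(n_individuals, allele_freq):
    """Probability that the variant appears at least once among 2n sampled chromosomes."""
    return 1 - (1 - allele_freq) ** (2 * n_individuals)


for freq in (0.05, 0.01, 0.001):
    for n in (100, 1000):
        carriers = expected_carriers(n, freq)
        seen = prob_variant_observed(n, freq)
        print(f"freq={freq}, n={n}: ~{carriers:.1f} carriers, P(observed)={seen:.3f}")
```

Merely observing a variant is a much weaker requirement than having enough carriers to test its association with drug response, so the cohort sizes suggested by this calculation should be read as a lower bound.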
Whole-genome sequencing also offers the potential of identifying a variety of other forms of genetic variation currently poorly captured by the chip-based platforms used for current genome-wide association studies: for instance, small insertions and deletions (“indels”), and larger rearrangements of DNA (so-called “structural variants”) involving the removal, duplication or inversion of thousands of bases of DNA. However, it should be noted that identifying both small indels and structural variants remains a non-trivial challenge with current short-read sequencing technologies, particularly in the highly repetitive regions of DNA where these variants are most common.
Moving beyond studies of genetic variation, advances in DNA sequencing technology will also permit fine-grained dissection of the dynamic processes involved in drug responses, such as gene expression and epigenetic modifications of DNA. Analysis of gene expression has already been transformed by the advent of whole-transcriptome sequencing, which allows relatively unbiased interrogation of the full range of RNA transcripts produced by a cell, as opposed to the subset of transcripts represented on microarray chips. It also provides direct information on alternative RNA splicing events that are difficult or impossible to capture using array-based methods [9]. Applying such high-resolution surveys to cells exposed to pharmaceutical agents or disease-causing agents raises the possibility of identifying novel, specific drug targets.
Whole-genome analysis of epigenetic modifications (including cytosine methylation and the placement of specific DNA-binding proteins such as histones) can also be performed using high-throughput sequencing technologies [10], raising the possibility of identifying all of the important epigenetic changes resulting from drug exposure or disease state. As our ability to specifically modify epigenetic states improves, such high-resolution maps will provide a framework for targeted interventions to reduce the effects of disease or counteract side-effects of existing medications.
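As a sketch of how read counts from transcriptome sequencing might be turned into comparable expression values, the example below applies one widely used normalisation, reads per kilobase of transcript per million mapped reads (RPKM), and compares a drug-treated sample with an untreated one. The choice of RPKM, the gene labels and all counts are assumptions made for illustration and do not come from any study discussed here.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical counts: gene -> (transcript length in bases, reads untreated, reads treated)
counts = {
    "geneA": (2000, 500, 1500),
    "geneB": (1000, 800, 790),
}
total_untreated = 10_000_000  # hypothetical total mapped reads per sample
total_treated = 12_000_000

for gene, (length, untreated, treated) in counts.items():
    before = rpkm(untreated, length, total_untreated)
    after = rpkm(treated, length, total_treated)
    print(f"{gene}: {before:.1f} -> {after:.1f} RPKM (fold change {after / before:.2f})")
```

Normalising by transcript length and sequencing depth matters because a longer transcript or a more deeply sequenced sample yields more reads even when the underlying expression level is unchanged.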
Challenges of new sequencing technologies
The power of new sequencing technologies described above also brings with it considerable obstacles for new adopters. Currently, the establishment of sequencing facilities employing second-generation sequencing technologies requires heavy investment in purchasing sequencing equipment and associated infrastructure, recruitment of staff, and training. Once the equipment has been purchased the costs of maintaining and running a high-throughput sequencing facility are also extremely high; indeed, one of the problems with the new technologies is that the cost of each individual experiment can be very large, making trouble-shooting and methods development an expensive process.
However, perhaps the major challenges faced by any organisation seeking to employ these new sequencing technologies are informatic: the sheer scale of the data produced by current second-generation sequencing technologies is far greater than most research organisations are equipped to deal with, and developing the required infrastructure for data storage, processing and analysis represents a substantial fraction of the costs of these technologies. Even with substantial investment in informatics infrastructure, the volume of raw image data produced by the new technologies is often too large to archive and must instead be processed on-the-fly into digested formats. The routine discarding of raw data is one of the uncomfortable but unavoidable consequences of migrating into the new world of high-throughput sequencing.
Even with the appropriate hardware systems in place for coping with large-scale sequencing data, new users are faced with a bewildering array of both free and commercial packages for downstream analysis. While many of the packages that are currently most widely used for routine procedures such as mapping reads to the human genome [11] and for analysis of genetic variation or gene expression [12] are free, they can sometimes come with minimal documentation and assume substantial background knowledge. In addition, the rapidly evolving technology in the field means that analysis pipelines need to be constantly modified to deal with changing data formats and new algorithmic approaches, while ensuring that the ceaseless stream of new data rolling off sequencing platforms is not compromised. These challenges place a strain on even well-resourced informatics groups.
Finally, new users of high-throughput sequencing technologies need to be aware that these platforms bring with them brand new sources of bias and error, which must be carefully considered before drawing conclusions from the resulting data. Taking full advantage of quality control metrics and new tools for visualising data output will be crucial for any researcher seeking to use these technologies to gain insight into biology.
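One basic quality control metric is the per-base quality score. The sketch below assumes the common Phred convention, in which a score Q corresponds to an error probability of 10^(-Q/10), and assumes quality strings encoded with an ASCII offset of 33; platforms and software versions differ in the offset they use, so this should be checked for any real data set. The function names and the example quality string are hypothetical.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score to the probability that the base call is wrong."""
    return 10 ** (-q / 10)


def decode_quality(quality_string, offset=33):
    """Decode an ASCII-encoded quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def expected_errors(quality_string, offset=33):
    """Expected number of miscalled bases in a read, summed over its positions."""
    return sum(phred_to_error_prob(q) for q in decode_quality(quality_string, offset))


quality = "IIIIIIII5555++++"  # hypothetical read whose quality drops towards the 3' end
print(decode_quality(quality))
print(round(expected_errors(quality), 4))
```

Summaries of this kind, computed per read or per cycle across a run, give a quick indication of whether the later bases of each read are trustworthy enough to keep.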
Conclusions
Advances in DNA sequencing technology promise nothing less than a transformation of many diverse areas of biology, allowing analyses of genetic variation, gene expression, DNA modification and other biological processes at unprecedented scale and resolution. This transformative potential also applies to the area of pharmacogenomics; however, researchers seeking to take full advantage of these rapidly evolving technologies will need to be mindful of the challenges ahead, particularly in terms of the infrastructure and expertise required for effective management of the massive volume of data generated by the new sequencing platforms.
References
1. Sanger, F. et al. 1977. Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265:687-695.
2. Lander, E.S. et al. 2001. Initial sequencing and analysis of the human genome. Nature 409:860-921.
3. Venter, J.C. et al. 2001. The sequence of the human genome. Science 291:1304-1351.
4. Wheeler, D.A. et al. 2008. The complete genome of an individual by massively parallel DNA sequencing. Nature 452:872-876.
5. Bentley, D.R. et al. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53-59.
6. McKernan, K.J. et al. 2009. Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two base encoding. Genome Res advance online publication.
7. SEARCH Collaborative Group et al. 2008. SLCO1B1 variants and statin-induced myopathy – a genomewide study. N Engl J Med 359:789-799.
8. Takeuchi, F. et al. 2009. A genome-wide association study confirms VKORC1, CYP2C9, and CYP4F2 as principal genetic determinants of warfarin dose. PLoS Genet 5:e1000433.
9. Wilhelm, B.T. et al. 2008. Dynamic repertoire of a eukaryotic transcriptome surveyed at single-nucleotide resolution. Nature 453:1239-1243.
10. Lister, R. et al. 2008. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell 133:523-536.
11. Li, H., Ruan, J., and Durbin, R. 2008. Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 18:1851-1858.
12. Fejes, A.P. et al. 2008. FindPeaks 3.1: a tool for identifying areas of enrichment from massively parallel short-read sequencing technology. Bioinformatics 24:1729-1730.