Citation Information : North American Journal of Medicine and Science. VOLUME 10 , ISSUE 4 , ISSN (Online) 2156-2342, DOI: 10.7156/najms.2017.1004176, December 2017 © 2017.
License : (Transfer of Copyright)
Received Date : 25-August-2017 / Accepted: 10-October-2017 / Published Online: 18-December-2017
Large cohorts with rich genetic and phenotypic data are key to the success of the Precision Medicine Initiative. Because whole genome sequencing (WGS) is still expensive, many large studies have taken advantage of cost-effective arrays to assay single-nucleotide polymorphisms (SNPs). The UK Biobank project1 in the United Kingdom, the VA Million Veteran Program2 in the United States, and the China Kadoorie Study3 in China and the United Kingdom are a few of the largest ones, which have already obtained genome-wide SNP data for over 100,000 samples. Genotyping refers to assays specifically designed to target a particular genomic position that is polymorphic in the population. A SNP is a variation in a single nucleotide that occurs at a specific position in the genome, where each variant is present at a detectable frequency within a population (e.g., > 1%). For example, at a specific base position in the human genome, the base C may appear in 90% of individuals, while in the other 10% the position is occupied by base A. There is therefore a SNP at this base position, and the two possible nucleotide variants (C vs. A) are called the alleles for the position. Most SNPs are bi-allelic, but there are also tri-allelic SNPs. Most SNPs are interrogated with one or two probe sets: one derived from the forward strand sequence and/or one derived from the reverse strand sequence. Fan et al. published an excellent review of highly parallel genomic assays,4 and Figure 1 of that paper clearly illustrates the key steps in genotyping, which still underlie today’s technologies.
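The C/A example above can be made concrete with a short Python sketch. It is only an illustration: the cohort is hypothetical, the function names are ours, and the example counts alleles per chromosome rather than per individual, which is how allele frequency is usually computed.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return allele -> frequency for diploid genotypes written like 'CA'."""
    counts = Counter(allele for gt in genotypes for allele in gt)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def is_snp(freqs, threshold=0.01):
    """Treat a site as a SNP if the second most common allele exceeds the threshold."""
    ordered = sorted(freqs.values(), reverse=True)
    return len(ordered) > 1 and ordered[1] > threshold

# Hypothetical cohort of 10 individuals (20 chromosomes): 18 C alleles and 2 A alleles
genotypes = ["CC"] * 8 + ["CA"] * 2
freqs = allele_frequencies(genotypes)
print(freqs)          # {'C': 0.9, 'A': 0.1}
print(is_snp(freqs))  # True
```

The 1% threshold is the example cut-off quoted in the text; studies may choose a different one.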
Genotyping is commonly contrasted with sequencing, which reads every base in a stretch of sequence. As an analogy, genotyping is like reading certain key words in a book, while sequencing is reading the entire book. However, new technologies like “genotyping-by-sequencing” are reducing the differences between the two approaches. Sequencing the first whole human genome cost around $3 billion and concluded in 2003 after 13 years (http://www.genome.gov/11006943). Since then, the cost of sequencing a genome has been decreasing at a speed exceeding Moore’s law.5 The actual cost of sequencing varies depending on which components are counted: logistics, sequencing instruments and other large equipment, indirect costs, quality assessment/control, and data interpretation. This has led some to state that the real cost of genome sequencing is higher than commonly thought.6 However, the combination of technological advancements and competition will undoubtedly continue to drive down costs.
The two basic types of arrays used in genomic analysis are ordered arrays and random arrays. Most Illumina arrays, including the Global Screening Array (GSA), are random arrays, while Affymetrix arrays are manufactured using a photolithographic process, which produces ordered arrays. With ordered arrays, the arrays manufactured today, next week, or ten years from now for the same array design are exactly identical. Random arrays, on the other hand, are manufactured by sampling a bead pool, which results in random selection of the probe sequences used. Each manufactured lot is therefore slightly different from the other lots, which can cause differences in the final content; the exact lot-to-lot reproducibility of ordered arrays is not possible with bead array technology. Set against these advantages, a possible disadvantage of ordered arrays is that they require specialized manufacturing equipment, unlike that used to produce Illumina’s random arrays.
Illumina and Affymetrix have dominated high-throughput genotyping for the past 10+ years. In 2016, Illumina released the GSA, which it claims combines a highly optimized, universal genome-wide backbone, hand-curated clinical research variants, and sample tracking content to produce a highly economical array for population-scale genomics and screening. It uses the 24-sample Infinium HTS format, enabling high content flexibility, throughput capacity, and genotyping accuracy. The latest array from Affymetrix is the Axiom™ Precision Medicine Research Array (PMRA), which is claimed to provide the most up-to-date content, broadest coverage, and highest accuracy for disease-association studies across populations.7 In general, both the GSA and PMRA arrays include the following SNPs: (1) a genome-wide imputation grid; (2) global population-specific variants; (3) variants from the GWAS Catalog and common cancer variants; (4) rare functional variants from ClinVar and the ExAC consortium; (5) variants with pharmacogenomic effects, including those from the PharmGKB database; (6) HLA region and CNV variants; (7) fingerprinting variants.
Although basic array mechanisms have not changed dramatically, the technologies do evolve over time. Taking the Affymetrix Axiom array as an example, there are quite a few differences between Axiom and the earlier Affymetrix 6.0 array, as shown in Table 1.
A few challenges remain to be resolved by even the most up-to-date genotyping technologies. They include but are not limited to direct assays of haplotype, copy number variations (CNVs), and human leukocyte antigen (HLA) region.
Haplotype information is very important but is usually not directly captured by genotyping technologies. Take the well-known APOE gene as an example. The ε4 haplotype is defined by two variants: rs429358-C (build 37 position 45,411,941) and rs7412-C (build 37 position 45,412,079). It has been implicated in a variety of conditions, including atherosclerosis,8 Alzheimer’s disease (AD),9 impaired cognitive function,10 reduced hippocampal volume,11 HIV,12 faster disease progression in multiple sclerosis,13 unfavourable outcome after traumatic brain injury,14 sleep apnea,15 and accelerated telomere shortening.16 The current Axiom arrays have probes that directly assay these two key APOE variants (rs7412 and rs429358), and they are claimed to be the only arrays on the market that can reliably do so. This is primarily because the Axiom assay uses both hybridization and ligation, rather than hybridization alone, to interrogate these two SNPs, whose flanking regions have high GC content. Even so, the Axiom arrays can only assay the two variants separately and cannot directly assay the haplotype they form. Instead, statistical phasing software is used to determine the haplotype of these two SNPs.17
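To illustrate why two separately assayed genotypes do not always determine the APOE haplotype pair, the following Python sketch enumerates every diplotype consistent with a pair of unphased genotypes. The ε4 definition comes from the text; the ε1, ε2, and ε3 definitions are the commonly used ones and are not stated in this article, and the function name is ours.

```python
# Alleles carried on one chromosome at (rs429358, rs7412).
# e4 is defined in the text; e1, e2 and e3 follow the commonly used definitions.
APOE_HAPLOTYPES = {
    ("T", "T"): "e2",
    ("T", "C"): "e3",
    ("C", "C"): "e4",
    ("C", "T"): "e1",  # very rare
}

def possible_apoe_diplotypes(rs429358, rs7412):
    """List every diplotype consistent with two unphased genotypes such as 'TC' and 'CC'."""
    first_a, second_a = rs429358
    results = set()
    # Try both ways of pairing the rs7412 alleles with the rs429358 alleles.
    for first_b, second_b in (tuple(rs7412), tuple(reversed(rs7412))):
        hap1 = APOE_HAPLOTYPES[(first_a, first_b)]
        hap2 = APOE_HAPLOTYPES[(second_a, second_b)]
        results.add("/".join(sorted((hap1, hap2))))
    return sorted(results)

print(possible_apoe_diplotypes("CC", "CC"))  # ['e4/e4']
print(possible_apoe_diplotypes("TC", "CC"))  # ['e3/e4']
print(possible_apoe_diplotypes("TC", "CT"))  # ['e1/e3', 'e2/e4']
```

Only the double heterozygote is ambiguous. Because ε1 is very rare, that case is usually resolved as ε2/ε4, but the assignment is an inference rather than a direct measurement.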
Haplotypes composed of SNPs that are very close to each other (e.g., within 15-20 bp) can be detected directly using current genotyping technologies. If the two APOE SNPs were only 10 bp apart (rather than 138 bp, as is the case), we could design four probes, one matching each haplotype (or two probes, taking advantage of the two-color system). This would essentially be multiallelic genotyping of a four-allele variant. However, variants more than about 15 bp apart cannot successfully be combined this way, because under hybridization the 30-mer Axiom probes become increasingly less specific as the distance grows beyond that point. Recently, fluorescence in situ hybridization (FISH) has been used as a powerful single-cell technique for directly assaying haplotypes. Beliveau et al. introduced a robust and reliable system that harnesses SNPs to visually distinguish between the maternal and paternal homologous chromosomes in both mammalian and insect systems.18 The method makes use of Oligopaints, which are highly efficient, renewable, strand-specific FISH probes derived from complex single-stranded DNA (ssDNA) libraries in which each oligo carries a short stretch of homology to the genome. An open-jaw molecular inversion probe19,20 could also be a promising approach for directly assaying haplotypes. Assuming the homology arms could be designed and a ∼120 bp (i.e., 138 bp minus 15-20 bp) gap-fill would work, we could then design four probes, one for each haplotype.
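The distance argument can be checked with a few lines of Python. This is a sketch only: the 15 bp limit is the approximate figure quoted in the text, the positions are the build 37 coordinates given earlier, and the helper names are ours rather than part of any probe-design tool.

```python
from itertools import product

MAX_SPAN_BP = 15  # approximate limit quoted in the text for combining two SNPs in one probe

def snp_distance(pos_a, pos_b):
    """Distance in base pairs between two positions on the same chromosome."""
    return abs(pos_b - pos_a)

def haplotype_probe_targets(alleles_a, alleles_b):
    """Enumerate the allele combinations that haplotype-specific probes would need to match."""
    return [a + b for a, b in product(alleles_a, alleles_b)]

# rs429358 and rs7412, build 37 coordinates
distance = snp_distance(45_411_941, 45_412_079)
print(distance)                             # 138
print(distance <= MAX_SPAN_BP)              # False
print(haplotype_probe_targets("TC", "CT"))  # ['TC', 'TT', 'CC', 'CT']
```

The four combinations correspond to the four probes mentioned above, one per haplotype.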
Currently, CNV detection does not work very well with sequencing, but it does work with SNP arrays. This is because sequencing requires impractically high depth to obtain an accurate CNV signal, whereas genotyping arrays capture the CNV signal naturally. A double deletion (zero copies) is easy to distinguish from two copies, but the ability to call one or three copies requires a good dynamic range of response in signal or reads. The problem becomes even more complex for mosaic samples, e.g., a tumor sample in which only a fraction of the cells carry an aberration. A microarray has thousands (or tens of thousands) of probes in each feature, which inherently provides a practically continuous response and the possibility of high dynamic range and good signal-to-noise. Achieving the same accuracy with sequencing requires many more reads than are necessary for genotyping, making genome-wide CNV detection very expensive.
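The dynamic-range point can be illustrated with an idealized model in which signal is proportional to the average copy number across cells. The Python sketch below is our simplification, not the algorithm used by any array vendor or CNV caller, and the function name and parameters are assumptions for illustration.

```python
import math

def expected_log2_ratio(copy_number, aberrant_fraction=1.0, normal_copies=2):
    """Idealized log2(sample / reference) signal when a fraction of cells carries copy_number copies."""
    mean_copies = aberrant_fraction * copy_number + (1 - aberrant_fraction) * normal_copies
    if mean_copies == 0:
        return float("-inf")  # complete loss: no signal at all
    return math.log2(mean_copies / normal_copies)

for copies in (0, 1, 2, 3):
    print(copies, round(expected_log2_ratio(copies), 3))

# Mosaic sample: a one-copy loss present in only 20% of cells
print(round(expected_log2_ratio(1, aberrant_fraction=0.2), 3))
```

Under this model a one-copy loss shifts the signal by one log2 unit and a one-copy gain by about 0.58, while the 20% mosaic loss shifts it by only about 0.15, which is why a low-noise, near-continuous response matters for calling one or three copies and for mosaic samples.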
The human leukocyte antigen (HLA) complex is the human version of the major histocompatibility complex (MHC) region in chromosome 6, which includes genes responsible for immune function. Variations in these genes affect immune response, including those responsible for transplant rejection as well as disease susceptibility. The naming of HLA variants is quite complex. All alleles start with “HLA”, and the next portion (HLA-A or HLA-B) identifies the gene of which the allele is a modification. The next two numbers (HLA-A*02) signify what antigen type that particular allele is, typically the serological antigen present. In other words, HLAs with the same antigen type (e.g., HLA-A*02:101 and HLA-A*02:102) will not react with each other in serological tests. The next set of digits (HLA-A*02:101) indicates what protein the allele codes for; these are numbered sequentially based on the order in which they were discovered. The third set of numbers (HLA-A*02:101:01) indicates an allele variant that has a different DNA sequence but produces the same protein as the normal gene. The final set of numbers (HLA-A*02:101:01:01) designates whether there is a single or multiple nucleotide polymorphism in a non-coding region of the gene. The final aspect of HLA naming is one of six letters (for example, HLA-A*02:101:01:01L). The letter L in this example means lower-than-normal cell surface expression.
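The naming scheme described above maps naturally onto a small parser. The Python sketch below is illustrative only: the field names are ours, and the six suffix letters (N, L, S, C, A, Q) are taken from the standard HLA nomenclature rather than from this article.

```python
import re

# One optional colon-separated field per level of the name, then an optional expression suffix.
HLA_PATTERN = re.compile(
    r"^HLA-(?P<gene>[A-Z0-9]+)\*"
    r"(?P<allele_group>\d+)"           # antigen type, e.g. 02
    r"(?::(?P<protein>\d+))?"          # specific protein, e.g. 101
    r"(?::(?P<synonymous>\d+))?"       # same protein, different coding DNA sequence
    r"(?::(?P<noncoding>\d+))?"        # differences in non-coding regions
    r"(?P<suffix>[NLSCAQ])?$"          # expression status, e.g. L for low expression
)

def parse_hla(name):
    """Split an HLA allele name into its fields; fields that are absent are returned as None."""
    match = HLA_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised HLA allele name: {name!r}")
    return match.groupdict()

print(parse_hla("HLA-A*02:101:01:01L"))
print(parse_hla("HLA-A*02"))
```

Truncating a parsed name to its first one or two fields gives the lower-resolution types that imputation-based HLA typing usually reports.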
The highly polymorphic nature of the HLA region and the prevalence of pseudogenes create challenges for traditional genotyping methods. Combining direct genotyping with advanced imputation methods over the extended MHC region allows accurate HLA typing from SNP genotype data. For HLA-specific markers, Affymetrix provides a tool that uses directly assayed genotypes from the Axiom array to impute and generate two- and sometimes four-digit HLA resolution. In contrast, Illumina claims that its TruSight HLA Sequencing Panel delivers unprecedented accuracy, efficiency, and certainty in HLA typing, all in one assay. It is also worth mentioning SNP2HLA, developed by the Broad Institute (http://software.broadinstitute.org/mpg/snp2hla/). It imputes not only the classical HLA alleles but also the amino acid sequences of those classical alleles, so that individual amino acid sites can be directly tested for association. This allows for facile amino acid-focused downstream analysis.21
Before we discuss the cost and effectiveness of genotyping vs. sequencing, it is important to keep in mind that only sequencing can detect novel variants. However, sequencing the whole genome for $100 is not yet a reality and probably will not be accessible to anyone outside of the largest sequencing labs for at least a few years. Also, accurate detection of rare content requires deep sequencing, which generates enormous amounts of data and requires weeks to months of analysis to generate usable results. We think that genotyping technologies will remain the platform of choice for many years, for at least the following reasons: 1) they are very affordable; 2) they take relatively little time to quality-control, filter, and generate genotypes (~1.5 hours for a plate of 96 samples); 3) they can be easily customized to meet virtually any need; and 4) they can generate data on hundreds of samples per week. The possible uses for genotyping in the era of precision medicine include, but are not limited to, comprehensive assays of blood types, diseases in newborns, and variants recommended by the American College of Medical Genetics and Genomics (ACMG), as well as fast genotyping for detecting pathogens and in point-of-care settings.
Red blood cells (RBCs) carrying a particular antigen may elicit an immune response if introduced into the blood circulation of a patient who lacks this antigen. It is the antibody produced during the immune response that is problematic and leads to donor/patient transfusion incompatibility, maternal-fetal incompatibility, and autoimmune hemolytic anemia. This immune response can be immediate or delayed and may in some cases be lethal. Knowing one's blood type is important for both scientific and medical purposes. People with non-O blood types have increased mortality, particularly from cardiovascular diseases. This is partially due to the effect of blood group alleles on blood biochemistry, including von Willebrand factor and factor VIII levels.22 As genotyping becomes cost-effective and more easily automated and multiplexed than phenotyping, there is a desire to derive human blood type from genetic data. In addition, blood typing through genetics does not actually require a blood sample. As of today, there are 346 serologically distinct RBC blood group antigen phenotypes recognized by the International Society of Blood Transfusion (ISBT),23 defined by over 1,100 alleles across 45 genes (http://www.isbtweb.org/). There are 33 serologically distinct human platelet (PLT) antigen (HPA) phenotypes (http://www.ebi.ac.uk/ipd/hpa/), defined by 33 alleles within six genes. Centralized efforts have been made to catalog these genetic variants, including the ISBT website, the BGMUT website,24 the RHD RhesusBase (http://www.rhesusbase.info/), and the Immuno Polymorphism Database-HPA website.25
However, in reality, even for the well-studied ABO blood group, genetic data are rarely used to determine blood type. One obstacle is that ABO polymorphic sites associated with antigen expression are documented according to nucleotide positions in cDNA, not genomic coordinates. The other obstacle is the complex link between the genetic variations and the resulting blood types. Take the ABO blood type as an example. It is one of the RBC carbohydrate antigens (together with Lea/b, P1, Pk) synthesized by enzymes, and it requires gene sequencing to properly predict the enzymatic and sugar specificity across several genes. The ABO gene has seven coding exons, with the majority of the coding sequence lying in exons 6 and 7. Four common missense variants in exon 7 that differentiate between the A and B haplotypes result in amino acid substitutions in the active/binding site of the ABO glycosyltransferase: rs7853989 (p. R176G), rs8176743 (p. G235S), rs8176746 (p. L266M), and rs8176747 (p. G286A).26,27 An exon-6 deletion (rs8176719) leads to the classic O genotype and phenotype, while another common deletion located at the end of exon 7 (rs56392308) results in the A2 subtype.28 For decades, the reference method for testing blood group antigens has been the hemagglutination technique. This is a simple and well-established technique usable for all major blood groups, with specificity, sensitivity, and security appropriate for the clinical diagnostic environment. However, this gold-standard method has certain limitations, including the availability and specificity of immunologic reagents, when it comes to determining the minor or rare blood group antigens that are critical for a perfect match between patient and donor.29 A more comprehensive evaluation of the performance of genetically predicted blood types would contribute to transfusion medicine and therefore to precision medicine.
While we recognize the great potential of using genetic data to predict blood group, we do not yet recommend that it replace conventional serological methods, because missing even one inactivating mutation affecting the ABO blood type would pose an unacceptable risk for transfusion.
Newborn screening tests provide an early opportunity to detect certain disorders before symptoms appear. At about 48 hours after birth, or just before a baby is discharged from the hospital, a small blood sample is taken and tested for a variety of conditions/disorders. At the state government level, usually an advisory board made up of doctors, nurses, scientists, ethicists, and parents advises which disorders to include. For a disorder to be included in the list, the following must be true: 1) the disorder is treatable, 2) there is a good test, and 3) early medical intervention would benefit the infant. For example, 32 disorders are included in routine screening mandated by the Massachusetts Department of Public Health (http://nensp.umassmed.edu/screening-programs/massachusetts/routine-disorders).
In 2013, the American College of Medical Genetics and Genomics (ACMG) released a guideline30 recommending that clinical diagnostic laboratories performing exome or genome sequencing report known pathogenic or expected pathogenic variants within 56 genes, even when unrelated to the primary medical reason for testing. Subsequently, the ACMG revised the terminology from “incidental findings” to “secondary findings” because these genes are intentionally being analyzed, as opposed to genetic variants found incidentally or accidentally. The shift in terminology also maintained consistency with a recommendation by the Presidential Commission on Bioethical Issues.31 An additional modification to the original policy offered individuals undergoing clinical genomic sequencing the option to opt out of receiving secondary findings. The updated list includes 59 medically actionable genes for which returning genetic results is recommended for patients who participated in clinical genomic sequencing.32
In April 2017, CRISPR pioneer Dr. Feng Zhang and colleagues at the Broad Institute reported a CRISPR-based diagnostic tool that can detect pathogens, identify cancerous mutations, and genotype human DNA.33 The tool is called SHERLOCK, for Specific High Sensitivity Enzymatic Reporter UnLOCKing. It is built around the RNA-targeting CRISPR enzyme Cas13a and also includes a reporter RNA strand that fluoresces when cleaved. When Cas13a detects the targeted RNA sequence, its indiscriminate RNase activity slices the reporter sequence, releasing a detectable fluorescent signal. The tool incorporates isothermal RNA amplification that was previously used to create a paper-based Zika test, and it is now capable of detecting single RNA and DNA molecules at attomolar concentrations. Other benefits include a quick turnaround time (less than 1 hour), portability, and low cost (less than $1 per sample). All of these features are key to building a genotyping tool that can reach far beyond research labs and make a true difference for both public health and precision medicine.