Complex genomes, including the human genome, contain ‘dark’ regions that standard short-read sequencing technologies do not adequately resolve, so many variants that may be relevant to disease are overlooked. We systematically characterized these regions in genomes with high short-read coverage, identifying more than 6000 gene bodies that are at least partially dark and more than 100 protein-coding genes that are 100% ‘camouflaged’. Many known disease-relevant genes are also camouflaged, including CR1, a top Alzheimer’s disease gene. Other affected disease-relevant genes include NEB, SMN1 and SMN2, and ARX. Long-read technologies resolve major portions of these regions. We specifically compared 10x Genomics, PacBio’s Sequel, and Oxford Nanopore Technologies’ PromethION, demonstrating potential long-term benefits of using long-read technologies. We are also using optical DNA mapping from Bionano Genomics’ Saphyr System to construct full individual haplotypes across challenging genomic regions that are strongly implicated in disease, including the C9orf72 ‘GGGGCC’ repeat expansion, CR1, and the major histocompatibility locus, which harbors the human leukocyte antigen (HLA) genes.
1. Identify types of genetic variants that standard short-read sequencing approaches overlook, including variants in ‘dark’ and ‘camouflaged’ genes
2. Identify specific genes that are ‘dark’ or ‘camouflaged’, including 76 disease-associated genes
3. Demonstrate how long-range technologies resolve these regions