Efficient and Accurate Clustering for Large-Scale Genetic Mapping

Report ID

2014-03

Report Authors

Veronika Strnadova, Aydin Buluc, Jarrod Chapman, John R. Gilbert, Joseph Gonzalez, Stefanie Jegelka, Daniel Rokhsar, Leonid Oliker

Report Date

2014-05-01

Abstract

High-throughput next-generation genome sequencing technologies have outpaced Moore's law, producing a flood of inexpensive genetic information that is invaluable to research ranging from the development of new and improved crops to understanding the genetic variation that underlies cancer. However, this flood of new information presents a fundamental new challenge to genetic mapping, the process of assembling genetic data, which is a core operation in genomics research. The current generation of genetic mapping tools was designed for the small-data setting and is now limited by the prohibitively slow clustering algorithms employed in the genetic marker-clustering stage of automatic genetic map construction. In this work, we present a new approach to genetic mapping based on a fast new clustering algorithm. Our theoretical and empirical analysis shows that the algorithm can correctly recover linkage groups. Using real-world and synthetic data, we demonstrate that our approach is able to quickly process orders of magnitude more genetic markers than existing tools and that, by exploiting domain knowledge, it is able to outperform more generic approaches based on spectral clustering. Finally, we demonstrate that, by scaling to the available sequence data, we are able to improve the quality of genetic marker clusters, leading to a higher-quality ultra-high-density genetic map that can be used to improve genome assemblies and map quantitative traits.

Document

2014-03_0.pdf (1.47 MB)