Structural variants (SVs) play a causal role in numerous diseases but are difficult to detect and accurately genotype (determine zygosity) in whole-genome short-read sequencing (SRS) data. SV genotypers that assume the aligned sequencing data uniformly reflect the underlying SV, or that use existing SV call sets as training data, can only partially account for variant- and sample-specific biases. We are developing novel genotypers that combine machine learning methods and SRS simulation to improve SV genotyping accuracy.
The observed SRS data reflect the genome induced by a putative SV, but also depend on interactions among the genomic region, the sequencing process, and the analysis software. Our key idea is to use SRS simulation to generate empirical distributions of the expected SV evidence instead of trying to build an explicit model of those complex and interconnected effects. Our current tool, NPSV-deep, builds on its predecessor NPSV’s use of simulated training data, but recasts SV genotyping as an image similarity problem that can exploit deep learning approaches. We model SV genotyping as a facial recognition task, with many (potentially unbounded) outputs and few examples per output, instead of image classification, with its fixed number of outputs and many examples per output. NPSV-deep improves genotyping accuracy for SVs as originally described and, by leveraging the learned similarity model, enables automatic correction of imprecise or incorrect SV descriptions (which otherwise negatively impact genotyping accuracy).
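To make the similarity framing concrete, here is a minimal sketch in Python, assuming read evidence rendered as fixed-size single-channel images; the architecture, names and shapes are illustrative stand-ins rather than NPSV-deep’s actual model:

import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Map a pileup-style image of read evidence to an embedding vector."""
    def __init__(self, embedding_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(32, embedding_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

def genotype(net, observed, simulated_by_genotype):
    """Pick the genotype whose simulated replicates are most similar on average.

    observed: (1, 1, H, W) image of the real read evidence.
    simulated_by_genotype: dict mapping "0/0", "0/1", "1/1" to
        (N, 1, H, W) batches of simulated replicate images.
    """
    with torch.no_grad():
        obs = net(observed)                                # (1, D) embedding
        dists = {gt: torch.cdist(obs, net(sims)).mean().item()
                 for gt, sims in simulated_by_genotype.items()}
    return min(dists, key=dists.get)                       # closest genotype

Because new SV descriptions keep arriving and each contributes only a handful of simulated examples, a learned distance generalizes in a way a fixed-output classifier cannot.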
These efforts are described in the following publications:
The rapid growth in genomic data generation motivates the adoption of the scalable distributed computing systems already in wide use in other “big data” domains. ADAM is a genome analysis platform built on Apache Avro, Apache Spark and Apache Parquet. We have been experimenting with porting existing genomics tools to ADAM. Our Distributed Exome CNV Analyzer, or DECA, is a distributed re-implementation of the XHMM exome CNV caller using ADAM and Apache Spark.
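As a rough illustration of the parallelization pattern (not DECA’s actual API), a per-sample step of an XHMM-style workflow maps naturally onto a Spark RDD; call_cnvs_for_sample and its z-score rule below are hypothetical stand-ins for XHMM’s per-sample calling over normalized read depth:

import numpy as np
from pyspark.sql import SparkSession

def call_cnvs_for_sample(sample):
    sample_id, depths = sample
    z = (depths - depths.mean()) / depths.std()
    # Placeholder rule: flag targets more than 3 SDs from the sample mean.
    return [(sample_id, i, "DEL" if z[i] < 0 else "DUP")
            for i in np.flatnonzero(np.abs(z) > 3.0)]

spark = SparkSession.builder.appName("deca-sketch").getOrCreate()
# depth_rows: assumed list of (sample_id, numpy array of normalized
# per-target read depths), one entry per sample.
calls = (spark.sparkContext.parallelize(depth_rows)
         .flatMap(call_cnvs_for_sample)
         .collect())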
These efforts are described in the following publications:
PeopleSeq is a longitudinal study of healthy adults who plan to receive, or have already received, their own genomic sequence information. The PeopleSeq study is the collaborative effort of a growing consortium of commercial and research organizations. The PeopleSeq study will collect valuable empirical data on the medical, behavioral and economic impact of elective personal genome sequencing.
PeopleSeq is described in the following publications:
There is an acute need to develop more effective genomics education to train the next generation of genomics professionals. With a multi-disciplinary group of colleagues at the Icahn School of Medicine at Mount Sinai, I co-developed and directed a laboratory-style medical genomics course (first taught in 2012) in which the students have the opportunity to sequence and analyze their own whole genome. The results of our companion research study evaluating student attitudes towards, and the outcomes of, incorporating personal genome sequencing into graduate genomics education are described in the following publications:
To support hands-on genomics education, I developed the MySeq web application. MySeq is a web-based, privacy-protecting personal genome analysis tool inspired by GENOtation (previously the Interpretome) and DNA.LAND Compass. MySeq can load and analyze Tabix-indexed VCF files stored locally on the user’s computer or available remotely. Queries and other analyses load only the necessary blocks of the compressed VCF file, enabling efficient analysis of whole-genome-scale VCF files. MySeq has been used in the CSCI1007 “Practical Analysis of a Personal Genome” winter-term course at Middlebury College.
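The access pattern behind those efficient queries can be sketched with pysam for illustration (MySeq itself implements the same idea in the browser); the file name and region are hypothetical:

import pysam

# The .tbi index maps genomic regions to compressed blocks, so fetch()
# reads only the blocks overlapping the query, not the whole file.
vcf = pysam.VariantFile("genome.vcf.gz")  # expects genome.vcf.gz.tbi alongside
for record in vcf.fetch("chr1", 100000, 200000):
    print(record.chrom, record.pos, record.ref, record.alts)

MySeq is described in the following publication: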
Accurately assessing genomics knowledge was an ongoing challenge in this and other projects. In response, we developed and validated the GKnowM genomics knowledge scale. GKnowM incorporates modern genomics technologies and is informative for individuals with diverse backgrounds, including those with clinical or life-sciences training. We performed a rigorous psychometric evaluation of GKnowM with a large, educationally and ethnically diverse cohort drawn from the general public, students, and genomics professionals. GKnowM is described in the following publications:
HealthSeq is a longitudinal cohort study at Mount Sinai in which unselected ostensibly healthy participants received a variety of health and non-health-related genetic results from whole genome sequencing. The primary aim of HealthSeq is to improve our understanding of participants’ motivations, expectations, concerns and preferences, and the impacts of receiving personal genome sequencing in a pre-dispositional setting.
HealthSeq results are described in the following publications:
I developed the Genome Analysis Pipeline (GAP) used by many groups at the Icahn Institute for Genomics and Multiscale Biology at Mount Sinai. The GAP was validated for clinical use in New York State and has been successfully applied to identify causal mutations in multiple patients. The GAP is used in multiple small and large research projects, totaling many thousands of genomes, exomes and targeted panels.
The GAP is described in the following publications:
Weighted Gene Co-expression Network Analysis (WGCNA) is a methodology for describing the correlation patterns among genes across microarray samples. Analysis of tens of thousands of probes, however, can take hours and require hundreds of gigabytes of memory, putting this method out of reach for all but a few organizations and applications. Coexpp substantially reduces both the execution time and the memory footprint, enabling researchers to apply WGCNA in many new contexts.
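A minimal sketch of the core computation that dominates WGCNA’s runtime and memory, assuming the expression matrix fits in memory (the soft-threshold power beta=6 is just a common default):

import numpy as np

def wgcna_adjacency(expr, beta=6):
    """expr: (samples, genes) expression matrix; beta: soft-threshold power."""
    corr = np.corrcoef(expr, rowvar=False)  # (genes, genes) correlation matrix
    return np.abs(corr) ** beta             # weighted adjacency matrix

expr = np.random.rand(100, 5000)            # toy data: 100 samples x 5,000 genes
adj = wgcna_adjacency(expr)

At 20,000+ probes the dense correlation matrix alone exceeds 3 GB in double precision, which is why the memory footprint, not just the runtime, limits who can run the analysis.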
Recent advances in flow cytometry enable simultaneous single-cell measurement of 30+ surface and intracellular proteins. In a single experiment we can now measure enough markers to identify and compare functional immune activities across nearly all cell types in the human hematopoietic lineage. However, practical approaches to analyze and visualize data at this scale are only now becoming available. SPADE, described in Qiu et al., Nature Biotechnology 2011 and first used in Bendall et al., Science 2011, is a novel algorithm that organizes cells into hierarchies of related phenotypes, or “trees”, that facilitate the visualization of developmental lineages, identification of rare cell types, and comparison of functional markers across stimuli.
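A simplified sketch of SPADE’s tree-building step, with the density-dependent downsampling elided and k-means standing in for SPADE’s agglomerative clustering: once cells are grouped into phenotypically similar clusters, the cluster medians are connected by a minimum spanning tree so that related phenotypes sit adjacent in the tree.

import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

cells = np.random.rand(10000, 30)           # toy data: 10,000 cells x 30 markers
np.random.seed(0)                           # reproducibility of the toy demo
_, labels = kmeans2(cells, 200, minit="++")
medians = np.array([np.median(cells[labels == k], axis=0)
                    for k in np.unique(labels)])
tree = minimum_spanning_tree(squareform(pdist(medians)))  # SPADE tree edges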
CytoSPADE is a robust, modular and performant implementation of Qiu et al.’s SPADE algorithm, including a rich GUI implemented as a plugin for the Cytoscape network visualization platform. CytoSPADE is 12-19-fold faster than the SPADE prototype, ensuring that users can run complex analyses on their laptops in just seconds or minutes. More information is available on the software page.
CytoSPADE is described in the following publications:
Aberrant intracellular signaling plays an important role in many diseases. The causal structure of signal transduction networks can be modeled as Bayesian networks (BNs) and computationally learned from experimental data. However, learning the structure of BNs is an NP-hard problem that, even with fast heuristics, is too time-consuming for large, clinically important networks (20-50 nodes). I developed a novel graphics processing unit (GPU)-accelerated implementation of a Markov chain Monte Carlo-based algorithm for learning BNs that is up to 7.5-fold faster than already heavily optimized general-purpose processor (GPP)-based implementations.
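For illustration, a toy sketch of the kind of MCMC structure-search move the GPU accelerates: propose toggling one directed edge, reject proposals that introduce a cycle, and accept with the Metropolis ratio of network scores. The score function is a hypothetical stand-in for a decomposable score such as BDe.

import numpy as np

def has_cycle(adj):
    """Detect a directed cycle via depth-first search over the adjacency matrix."""
    state = [0] * len(adj)                  # 0=unvisited, 1=in progress, 2=done
    def visit(u):
        state[u] = 1
        for v in np.flatnonzero(adj[u]):
            if state[v] == 1 or (state[v] == 0 and visit(v)):
                return True
        state[u] = 2
        return False
    return any(state[u] == 0 and visit(u) for u in range(len(adj)))

def mcmc_step(adj, score, rng):
    """One Metropolis move over DAG structures; score returns a log probability."""
    i, j = rng.choice(len(adj), size=2, replace=False)
    proposal = adj.copy()
    proposal[i, j] ^= 1                     # toggle the edge i -> j
    if has_cycle(proposal):
        return adj                          # reject cyclic proposals outright
    log_ratio = score(proposal) - score(adj)
    return proposal if np.log(rng.random()) < log_ratio else adj

rng = np.random.default_rng(0)
adj = np.zeros((20, 20), dtype=np.uint8)    # start from an empty 20-node graph
# adj = mcmc_step(adj, score, rng)          # one move, given a score function

Re-scoring each proposal is where the log-space accumulation discussed below dominates the runtime.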
The GPU-based implementation is just one of several variants within the larger application, each optimized for a different input or machine configuration. I concurrently enhanced the Merge framework to enable efficient integration, testing and intelligent selection among the different potential implementations targeting multicore GPPs, GPUs and distributed compute clusters.
The GPU-accelerated Bayesian network learning implementation is described in the following publication:
The performance of the BN learning application is sensitive to the performance of an accumulation of log-space probabilities in the innermost loop:
acc += log(1 + exp(x))
For x far from the origin, this computation can be approximated as 0 or the identity function, respectively. The choice of those boundaries creates a performance-precision trade-off, sketched below.
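A minimal sketch of the thresholded approximation; the boundary values here are illustrative, not the verified bounds used in the application:

import numpy as np

def softplus_approx(x, lo=-13.0, hi=13.0):
    """Approximate log(1 + exp(x)) outside [lo, hi] to avoid exp/log calls."""
    if x < lo:
        return 0.0                  # exp(x) is negligible, so log(1 + exp(x)) ~ 0
    if x > hi:
        return x                    # exp(x) dominates, so log(1 + exp(x)) ~ x
    return np.log1p(np.exp(x))      # exact (up to rounding) in the middle range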
I concurrently developed a tool, Gappa++, for analyzing the numerical behavior of this and other computations. Using Gappa++ I was able to improve performance by an additional 10-15%. Gappa++ is described in the following publication and on the software page:
Computer systems are undergoing significant change: to improve performance and efficiency, architects are exposing more microarchitectural details directly to programmers. Software that exploits specialized accelerators, such as GPUs, and specialized processor features, such as software-controlled memory, exposes limitations in existing compiler and OS infrastructure.
Merge is a programming model for building applications that will tolerate changing hardware. Merge allows programmers to leverage different processor-specific or domain-specific toolchains to create software modules specialized for different hardware configurations, and it provides language mechanisms to enable the automatic mapping of the application to these processor-specific modules. I showed that this approach can be used to manage computing resources in complex heterogeneous processors and to enable aggressive compiler optimizations.
Merge was used extensively in the complex and computationally intensive Bayesian structure learning application described above. Using Merge we were able to deploy a single application binary that could deliver the best possible performance across a range of problem sizes and hardware configurations (including multicore processors, GPUs and clusters with MPI). For any given problem size and hardware configuration, the Merge-enabled application automatically and dynamically selects the appropriate implementation, based on predicates supplied by the original implementors.
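The dispatch idea can be illustrated with a toy sketch in Python (Merge itself is not Python; the names and predicates below are invented for illustration): each variant registers a predicate over the problem and machine configuration, and the dispatcher picks the first applicable variant, with variants ordered most- to least-specialized.

variants = []

def variant(predicate):
    """Register an implementation guarded by an applicability predicate."""
    def register(fn):
        variants.append((predicate, fn))
        return fn
    return register

@variant(lambda cfg: cfg.get("gpu") and cfg["n"] >= 10_000)
def learn_gpu(cfg): ...

@variant(lambda cfg: cfg["cores"] > 1)
def learn_multicore(cfg): ...

@variant(lambda cfg: True)          # generic fallback, always applicable
def learn_serial(cfg): ...

def dispatch(cfg):
    # Variants were registered most- to least-specialized; take the first match.
    return next(fn for pred, fn in variants if pred(cfg))

impl = dispatch({"gpu": True, "cores": 8, "n": 50_000})   # selects learn_gpu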
Merge is described in the following publications: