NPSV-deep is a Python-based tool for stand-alone genotyping of previously detected/reported deletion (DEL) and insertion (INS) structural variants (SVs) in short-read genome sequencing (SRS) data. NPSV-deep is the successor to the NPSV SV genotyper. NPSV-deep recasts SV genotyping as an image similarity problem amenable to deep metric learning techniques. Instead of trying to predict the SV genotype from the SRS data in isolation, NPSV-deep predicts the genotype from the similarity between the actual SRS data and simulations of the possible genotypes. NPSV-deep is implemented with TensorFlow and distributed with pre-trained models. By default, NPSV-deep will automatically download and cache pre-trained models from a repository on Hugging Face. Middlebury undergraduate CS students Jacob Wallace, Alderik van der Heyde, Eliza Wieman, Daniel Brey, Yiran Shi and Peter Hansen contributed to NPSV-deep.
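The similarity-based idea can be illustrated with a minimal sketch. It assumes the real data and each simulated genotype have already been reduced to embedding vectors; the vectors, the Euclidean distance, the mean-distance rule and the function names below are all hypothetical and are not taken from NPSV-deep itself.

```python
import math
from statistics import mean

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict_genotype(real_embedding, simulated_embeddings):
    """Pick the genotype whose simulated replicates are, on average,
    closest to the embedding of the real data.

    simulated_embeddings maps a genotype label to a list of vectors,
    one per simulated replicate (hypothetical data layout).
    """
    avg_distance = {
        genotype: mean(euclidean(real_embedding, sim) for sim in sims)
        for genotype, sims in simulated_embeddings.items()
    }
    best = min(avg_distance, key=avg_distance.get)
    return best, avg_distance

# Toy 2-D embeddings standing in for the output of a learned model.
simulated = {
    "0/0": [[0.0, 0.1], [0.1, 0.0]],
    "0/1": [[0.5, 0.5], [0.6, 0.4]],
    "1/1": [[1.0, 0.9], [0.9, 1.0]],
}
genotype, distances = predict_genotype([0.55, 0.45], simulated)
print(genotype)
```

In this toy example the real embedding lies nearest the heterozygous simulations, so the heterozygous label is selected.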
NPSV, the Non-Parametric Structural Variant genotyper, is a Python-based tool for stand-alone genotyping of previously detected/reported deletion and insertion structural variants (SVs) in short-read whole genome sequencing (WGS) data. NPSV implements a machine learning-based approach for SV genotyping that employs NGS simulation to model the combined effects of the genomic region, sequencer and alignment pipeline. By treating potential biases as a “black box” that can be simulated, NPSV provides a framework for accurately genotyping a broad range of SVs in both targeted and genome-scale applications. NPSV consistently achieves or exceeds state-of-the-art genotyping accuracy across SV call sets, samples, and variant types. NPSV can specifically identify putative de novo SVs in a trio context and is robust to offset SV breakpoints. Middlebury undergraduate CS students Crystal Paudyal, Musab Shakeel and William Kelley contributed to NPSV.
NPSV is described in the following publication:
MySeq is a web-based, privacy-protecting personal genome analysis tool inspired by GENOtation (previously the Interpretome) and DNA.LAND Compass. MySeq can load and analyze Tabix-indexed VCF files stored locally on the user’s computer or available remotely. Queries and other analyses load only the necessary blocks of the compressed VCF file, enabling efficient analysis of whole-genome-scale VCF files. MySeq was designed for educational use and has been used in the CSCI1007 “Practical Analysis of a Personal Genome” winter-term course at Middlebury College. Middlebury undergraduate CS students Leo McElroy and Laura Chang have contributed to MySeq.
MySeq is described in the following publication:
DECA (Distributed Exome CNV Analyzer) is a distributed re-implementation of the XHMM exome CNV caller using ADAM and Apache Spark. DECA began as Davin Chia’s Middlebury College Computer Science senior thesis project and has been further developed by Middlebury undergraduate CS student Forrest Wallace in collaboration with Frank Nothaft at the UC Berkeley AMPLab. DECA parallelizes XHMM on both multi-core shared-memory computers and large shared-nothing Spark clusters. Using DECA we can perform CNV discovery on the n=2535 1000 Genomes exome cohort in less than 10 minutes (over 30× faster than the XHMM reference implementation). DECA is described in the following publication:
Coexpp is an R package implementing Weighted Gene Co-expression Network Analysis (WGCNA), a method for describing the correlation patterns among genes across microarray/RNASeq samples. Analysis of tens of thousands of probes, however, can take hours and may require the data to be partitioned. Coexpp substantially improves the runtime and memory footprint of WGCNA to enable the unpartitioned analysis of much larger datasets.
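For readers unfamiliar with WGCNA, its core step builds a weighted network by raising the absolute pairwise correlation between genes to a soft-thresholding power. The sketch below shows that step for an unsigned network in plain Python; the gene names, expression values and `beta=6` are illustrative assumptions, and the code does not reflect how Coexpp itself is implemented or optimized.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def soft_threshold_adjacency(expression, beta=6):
    """Unsigned WGCNA-style adjacency: a_ij = |cor(x_i, x_j)| ** beta.

    expression maps a gene name to its values across samples.
    """
    genes = list(expression)
    return {
        (g, h): abs(pearson(expression[g], expression[h])) ** beta
        for g in genes
        for h in genes
    }

# Hypothetical expression values for three genes across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
adjacency = soft_threshold_adjacency(expression)
print(adjacency[("geneA", "geneB")], adjacency[("geneA", "geneC")])
```

The all-pairs loop is quadratic in the number of genes, which is why analyses with tens of thousands of probes become expensive in both time and memory.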
Rclusterpp is an R package providing standard geometrical hierarchical clustering routines such as average link, but optimized for multicore processors and large datasets where it is impractical or inefficient to compute the distance matrix. One example is flow cytometry data, where it is not uncommon to cluster more than 100,000 observations. Rclusterpp is being used both in that domain, and to cluster large gene expression datasets. Rclusterpp is available via CRAN.
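To illustrate what clustering without a stored distance matrix means, here is a deliberately naive average-link sketch in Python that recomputes point-to-point distances on demand instead of keeping an n × n matrix in memory. It only demonstrates the idea; the function names are hypothetical, and Rclusterpp's actual algorithms and optimizations are not described in the text above.

```python
import math
from itertools import combinations

def average_link(cluster_a, cluster_b):
    """Mean pairwise Euclidean distance between two clusters,
    computed on the fly from the raw points."""
    total = sum(math.dist(p, q) for p in cluster_a for q in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters (average link) until
    n_clusters remain. No distance matrix is ever stored."""
    clusters = [[tuple(p)] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_link(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
result = agglomerate(points, n_clusters=2)
print(sorted(sorted(c) for c in result))
```

This version trades memory for repeated computation and is far too slow for 100,000 observations; it is meant only to show that average-link merging needs the points themselves rather than a precomputed matrix.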
CytoSPADE is an implementation of the Spanning-tree Progression of Density-normalized Events (SPADE) algorithm for visualizing high-dimensional flow cytometry data. CytoSPADE has two components:
CytoSPADE enables researchers to interactively explore the trees that SPADE produces in the context of the underlying flow cytometry data. Users can effectively use the tree nodes as gates. CytoSPADE is 12- to 19-fold faster than the SPADE prototype, ensuring that users can run complex analyses on their laptops in just seconds or minutes. Additional documentation is available via the CytoSPADE R package and Cytoscape plugin GitHub repositories.
CytoSPADE is described in the following publications:
Gappa++ is a tool to help determine and verify numerical behavior, particularly rounding error, in computations with floating- or fixed-point operations. Gappa++ is an extension of Gappa version 0.11.3, developed by Guillaume Melquiond. Gappa++ was used to obtain a 10-15% performance improvement in a computationally demanding Bayesian Network learning application by optimizing the numerical behavior of transcendental operations. Gappa++ is described in the following publication: