You can access the lab's GitHub repository by clicking the link below.

Selected Lab Projects

1. MAGIC (Markov Affinity-based Graph Imputation of Cells): an algorithm for denoising and transcript recovery in single cells, applied to single-cell RNA sequencing data from the epithelial-to-mesenchymal transition in breast cancer.

2. PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding): a visualization technique that offers an alternative to tSNE by emphasizing progressions and branching structures rather than cluster separations, demonstrated on several datasets including a newly generated embryoid body differentiation dataset.

3. SAUCIE (Sparse Autoencoder for Unsupervised Clustering, Imputation, and Embedding): a novel autoencoder architecture that performs denoising, batch correction, clustering and visualization simultaneously for massive single-cell datasets from multi-patient cohorts, demonstrated on mass cytometry data from dengue patients.

4. DREMI (Conditional Density-based Analysis of T cell Signaling in Single Cell Data): computational methods, based on established statistical concepts, that characterize signaling network relationships by quantifying the strengths of network edges and deriving signaling response functions.

PHATE: Visualizing Transitions and Structure for High Dimensional Data Exploration

Kevin R. Moon, David van Dijk, Zheng Wang, Daniel Burkhardt, William Chen, Antonia van den Elzen, Matthew J Hirn, Ronald R Coifman, Natalia B Ivanova, Guy Wolf, Smita Krishnaswamy*

You can access PHATE's GitHub repository and bioRxiv page by clicking the links below.


In the era of 'Big Data' there is a pressing need for tools that provide human-interpretable visualizations of emergent patterns in high-throughput, high-dimensional data. Further, to enable insightful data exploration, such visualizations should faithfully capture and emphasize emergent structures and patterns without enforcing prior assumptions on the shape or form of the data. In this paper, we present PHATE (Potential of Heat-diffusion for Affinity-based Transition Embedding), an unsupervised low-dimensional embedding for visualization of data that is aimed at solving these issues. Unlike previous methods that are commonly used for visualization, such as PCA and tSNE, PHATE is able to capture and highlight both local and global structure in the data.


MAGIC: A diffusion-based imputation method reveals gene-gene interactions in single-cell RNA-sequencing data

David van Dijk, Juozas Nainys, Roshan Sharma, Pooja Kathail, Ambrose J Carr, Kevin R Moon, Linas Mazutis, Guy Wolf, Smita Krishnaswamy, Dana Pe'er *

You can access MAGIC's GitHub repository and bioRxiv page by clicking the links below.


Single-cell RNA-sequencing is fast becoming a major technology that is revolutionizing biological discovery in fields such as development, immunology and cancer. The ability to simultaneously measure thousands of genes at single cell resolution allows, among other prospects, for the possibility of learning gene regulatory networks at large scales. However, scRNA-seq technologies suffer from many sources of significant technical noise, the most prominent of which is dropout due to inefficient mRNA capture. This results in data that has a high degree of sparsity, with typically only 10% non-zero values. To address this, we developed MAGIC (Markov Affinity-based Graph Imputation of Cells), a method for imputing missing values, and restoring the structure of the data. After MAGIC, we find that two- and three-dimensional gene interactions are restored and that MAGIC is able to impute complex and non-linear shapes of interactions. MAGIC also retains cluster structure, enhances cluster-specific gene interactions and restores trajectories, as demonstrated in mouse retinal bipolar cells, hematopoiesis, and our newly generated epithelial-to-mesenchymal transition dataset.
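The idea named in the acronym (a Markov matrix built from cell-to-cell affinities, used to share values between similar cells) can be illustrated with a minimal sketch. This is not the published implementation: the function name magic_like_impute, the fixed-bandwidth Gaussian kernel, and the parameters sigma and t are assumptions made for illustration only.

```python
import math


def magic_like_impute(data, sigma=1.0, t=3):
    """Toy diffusion-based imputation (illustrative, not the MAGIC package).

    data: list of cells, each a list of gene values.
    sigma: assumed Gaussian kernel bandwidth.
    t: assumed number of diffusion steps.
    """
    n_cells = len(data)
    n_genes = len(data[0])

    # Gaussian affinity between every pair of cells.
    affinity = [
        [
            math.exp(
                -sum((a - b) ** 2 for a, b in zip(data[i], data[j]))
                / (2 * sigma ** 2)
            )
            for j in range(n_cells)
        ]
        for i in range(n_cells)
    ]

    # Row-normalize the affinities into a Markov transition matrix.
    markov = [[v / sum(row) for v in row] for row in affinity]

    # Apply the Markov matrix t times; each pass replaces a cell's values
    # with an affinity-weighted average over all cells.
    imputed = [row[:] for row in data]
    for _ in range(t):
        imputed = [
            [
                sum(markov[i][k] * imputed[k][g] for k in range(n_cells))
                for g in range(n_genes)
            ]
            for i in range(n_cells)
        ]
    return imputed


# Hypothetical example: the second cell has a dropout (0.0) in gene 2.
cells = [[1.0, 2.0], [1.1, 0.0], [0.9, 2.1]]
smoothed = magic_like_impute(cells, sigma=1.0, t=1)
```

In this toy example the zero entry is pulled toward the values of the neighbouring cells, which is the behaviour the abstract describes as restoring structure lost to dropout.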



SAUCIE: Exploring Single-Cell Data with Multitasking Deep Neural Networks

Matthew Amodio, Krishnan Srinivasan, David van Dijk, Hussein Mohsen, Kristina Yim, Rebecca Muhle, Kevin R Moon, Susan Kaech, Ryan Sowell, Ruth Montgomery, James Noonan, Guy Wolf, Smita Krishnaswamy

You can access SAUCIE's GitHub repository and bioRxiv page by clicking the links below.


Handling the vast amounts of single-cell RNA-sequencing and CyTOF data, which are now being generated in patient cohorts, presents a computational challenge due to the noise, complexity, sparsity and batch effects present. Here, we propose a unified deep neural network-based approach to automatically process and extract structure from these massive datasets. Our unsupervised architecture, called SAUCIE (Sparse Autoencoder for Unsupervised Clustering, Imputation, and Embedding), simultaneously performs several key tasks for single-cell data analysis including 1) clustering, 2) batch correction, 3) visualization, and 4) denoising/imputation. SAUCIE is trained to recreate its own input after reducing its dimensionality in a 2-D embedding layer which can be used to visualize the data. Additionally, it uses two novel regularizations: (1) an information dimension regularization to penalize entropy as computed on normalized activation values of the layer, and thereby encourage binary-like encodings that are amenable to clustering and (2) a Maximal Mean Discrepancy penalty to correct batch effects. Thus SAUCIE has a single architecture that denoises, batch-corrects, visualizes and clusters data using a unified representation. We show results on artificial data where ground truth is known, as well as mass cytometry data from dengue patients, and single-cell RNA-sequencing data from embryonic mouse brain.
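The two regularizations mentioned above can be sketched outside of any neural network. The code below is a simplified illustration, not SAUCIE's implementation: the function names, the Gaussian kernel with bandwidth sigma, and the assumption of non-negative activations are all choices made here for the example.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian kernel between two points (sigma is an assumed bandwidth)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * sigma ** 2))


def mmd_squared(batch_a, batch_b, sigma=1.0):
    """Biased estimate of the squared mean discrepancy between two batches.

    A small value means the two batches look alike in the embedding,
    so penalizing it pushes batches to overlap.
    """
    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(batch_a, batch_a)
        + mean_kernel(batch_b, batch_b)
        - 2 * mean_kernel(batch_a, batch_b)
    )


def activation_entropy(activations):
    """Entropy of activations after normalizing them to sum to one.

    Assumes non-negative activations (for example, after a ReLU).
    Low entropy corresponds to a few dominant units, i.e. a more
    binary-like code.
    """
    total = sum(activations)
    probs = [a / total for a in activations if a > 0]
    return -sum(p * math.log(p) for p in probs)
```

In a training loop, both quantities would be added to the reconstruction loss with their own weights; the weights and the layers they are applied to are not specified here.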



DREMI: Conditional Density-based Analysis of T cell Signaling in Single Cell Data

Smita Krishnaswamy, Matthew H. Spitzer, Michael Mingueneau, Sean C Bendall, Oren Litvin, Erica Stone, Dana Pe’er* and Garry P Nolan

You can access DREMI's GitHub repository and bioRxiv page by clicking the links below.


Cellular circuits sense the environment, process signals, and compute decisions using networks of interacting proteins. To model such a system, the abundance of each activated protein species can be described as a stochastic function of the abundance of other proteins. High-dimensional single-cell technologies, like mass cytometry, offer an opportunity to characterize signaling circuit-wide. However, the challenge of developing and applying computational approaches to interpret such complex data remains. Here, we developed computational methods, based on established statistical concepts, to characterize signaling network relationships by quantifying the strengths of network edges and deriving signaling response functions. In comparing signaling between naïve and antigen-exposed CD4+ T-lymphocytes, we find that although these two cell subtypes had similarly-wired networks, naïve cells transmitted more information along a key signaling cascade than did antigen-exposed cells. We validated our characterization on mice lacking the extracellular-regulated MAP kinase (ERK2), which showed stronger influence of pERK on pS6 (phosphorylated-ribosomal protein S6), in naïve cells compared to antigen-exposed cells, as predicted. We demonstrate that by using cell-to-cell variation inherent in single cell data, we can algorithmically derive response functions underlying molecular circuits and drive the understanding of how cells process signals.
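One way to picture "quantifying the strength of a network edge" from conditional densities is the sketch below. It is a simplification under stated assumptions: it uses plain histograms rather than the density estimation and resampling of the published method, and the function name conditional_density_mi and the bins parameter are hypothetical. The key step is that every occupied bin of the upstream protein X gets equal weight, so the score depends on how Y responds to X rather than on how often each X level was sampled.

```python
import math


def conditional_density_mi(xs, ys, bins=4):
    """Mutual information (in bits) between X and Y using P(Y | X)
    with a uniform weight on each occupied X bin (illustrative only)."""
    lo_x, hi_x = min(xs), max(xs)
    lo_y, hi_y = min(ys), max(ys)

    def bin_index(value, lo, hi):
        if hi == lo:
            return 0
        return min(int((value - lo) / (hi - lo) * bins), bins - 1)

    # Joint histogram: rows are X bins, columns are Y bins.
    counts = [[0] * bins for _ in range(bins)]
    for x, y in zip(xs, ys):
        counts[bin_index(x, lo_x, hi_x)][bin_index(y, lo_y, hi_y)] += 1

    # Conditional density P(y | x) for every occupied X bin.
    cond = [[c / sum(row) for c in row] for row in counts if sum(row) > 0]

    # Treat occupied X bins as equally likely, then marginalize to get P(y).
    p_x = 1.0 / len(cond)
    p_y = [sum(row[j] for row in cond) * p_x for j in range(bins)]

    mi = 0.0
    for row in cond:
        for j, p in enumerate(row):
            if p > 0:
                mi += p_x * p * math.log2(p / p_y[j])
    return mi
```

With this weighting, a relationship that is sampled mostly at low X values scores the same as one sampled evenly, as long as the response of Y to X is the same.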



Geometry-Based Data Generation

Ofir Lindenbaum, Jay S. Stanley III, Guy Wolf, Smita Krishnaswamy

You can access the bioRxiv page by clicking the link below.


Many generative models attempt to replicate the density of their input data. However, this approach is often undesirable, since data density is highly affected by sampling biases, noise, and artifacts. We propose a method called SUGAR (Synthesis Using Geometrically Aligned Random-walks) that uses a diffusion process to learn a manifold geometry from the data. Then, it generates new points evenly along the manifold by pulling randomly generated points into its intrinsic structure using a diffusion kernel. SUGAR equalizes the density along the manifold by selectively generating points in sparse areas of the manifold. We demonstrate how the approach corrects sampling biases and artifacts, while also revealing intrinsic patterns (e.g. progression) and relations in the data. The method is applicable for correcting missing data, finding hypothetical data points, and learning relationships between data features.
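The generate-then-pull idea described above can be sketched as follows. This is a rough illustration and not the SUGAR algorithm itself: the function name sugar_like_generate, the inverse-density sampling rule, the isotropic Gaussian noise, and the single kernel-averaging step are all assumptions made for the example.

```python
import math
import random


def sugar_like_generate(points, sigma=1.0, n_new=20, seed=0):
    """Generate points preferentially in sparse regions (illustrative only).

    points: list of data points, each a list of coordinates.
    sigma: assumed kernel bandwidth and noise scale.
    n_new: number of points to generate.
    """
    rng = random.Random(seed)

    def kernel(p, q):
        sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
        return math.exp(-sq_dist / (2 * sigma ** 2))

    # Kernel degree of each point serves as a crude density estimate.
    density = [sum(kernel(p, q) for q in points) for p in points]

    # Sparse points get larger sampling weights.
    weights = [1.0 / d for d in density]

    generated = []
    for _ in range(n_new):
        # Pick a data point, favouring sparse regions, and perturb it.
        origin = rng.choices(points, weights=weights, k=1)[0]
        noisy = [c + rng.gauss(0.0, sigma) for c in origin]

        # Pull the noisy point back toward the data with one
        # kernel-weighted averaging step over the original points.
        k = [kernel(noisy, q) for q in points]
        total = sum(k)
        pulled = [
            sum(w * q[d] for w, q in zip(k, points)) / total
            for d in range(len(noisy))
        ]
        generated.append(pulled)
    return generated
```

Because each generated point is a weighted average of the original data, it stays inside the region spanned by the data while the sampling weights shift new points toward under-sampled areas.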


Modeling Dynamics with Deep Transition-Learning Networks

David van Dijk, Scott Gigante, Alexander Strzalkowski, Guy Wolf, Smita Krishnaswamy

You can access the bioRxiv page by clicking the link below.


Markov processes, both classical and higher order, are often used to model dynamic processes, such as stock prices, molecular dynamics, and Monte Carlo methods. Previous works have shown that an autoencoder can be formulated as a specific type of Markov chain. Here, we propose a generative neural network known as a transition encoder, or transcoder, which learns such continuous-state dynamic processes. We show that the transcoder is able to learn both deterministic and stochastic dynamic processes on several systems. We explore a number of applications of the transcoder including generating unseen trajectories and examining the propensity for chaos in a dynamic system. Further, we show that the transcoder can speed up Markov Chain Monte Carlo (MCMC) sampling to a convergent distribution by training it to make several steps at a time. Finally, we show that the hidden layers of a transcoder are useful for visualization and salient feature extraction of the transition process itself.
