Posts by Collection

portfolio

Deep Learning for Characterizing Sequence Reads

In this research project, we show that a deep-learning-based DNA language model can classify sequence reads with performance approaching that of standard mapping algorithms. The model came closest to mapping performance when the input sequences were mutated at a higher rate (0.1 SNPs per bp).

publications

[Preprint] A Recovery Algorithm and Pooling Designs for One-Stage Noisy Group Testing Under the Probabilistic Framework

Published in medRxiv, 2021

The main contributions of this paper include a practical one-stage group testing protocol guided by maximizing pool entropy and a maximum-likelihood recovery algorithm under the probabilistic framework.

Recommended citation: Liu, Y., Kadyan, S., Pe’er, I. (2021). A Recovery Algorithm and Pooling Designs for One-Stage Noisy Group Testing Under the Probabilistic Framework. medRxiv 2021.03.09.21253193; doi: https://doi.org/10.1101/2021.03.09.21253193 https://www.medrxiv.org/content/10.1101/2021.03.09.21253193v1

A Recovery Algorithm and Pooling Designs for One-Stage Noisy Group Testing Under the Probabilistic Framework

Published in International Conference on Algorithms for Computational Biology, 2021

The main contributions of this paper include a practical one-stage group testing protocol guided by maximizing pool entropy and a maximum-likelihood recovery algorithm under the probabilistic framework.

Recommended citation: Liu, Y., Kadyan, S., Pe’er, I. (2021). A Recovery Algorithm and Pooling Designs for One-Stage Noisy Group Testing Under the Probabilistic Framework. In: Martín-Vide, C., Vega-Rodríguez, M.A., Wheeler, T. (eds) Algorithms for Computational Biology. AlCoB 2021. Lecture Notes in Computer Science, vol 12715. Springer, Cham. https://link.springer.com/chapter/10.1007/978-3-030-74432-8_4

[Preprint] OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization

Published in bioRxiv, 2022

Here we report OpenFold, a fast, memory-efficient, and trainable implementation of AlphaFold2, and OpenProteinSet, the largest public database of protein multiple sequence alignments. We use OpenProteinSet to train OpenFold from scratch, fully matching the accuracy of AlphaFold2. Having established parity, we assess OpenFold's capacity to generalize across fold space by retraining it using carefully designed datasets.

Recommended citation: OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Gustaf Ahdritz, Nazim Bouatta, Sachin Kadyan, Qinghui Xia, William Gerecke, Timothy J O’Donnell, Daniel Berenberg, Ian Fisk, Niccolò Zanichelli, Bo Zhang, Arkadiusz Nowaczynski, Bei Wang, Marta M Stepniewska-Dziubinska, Shang Zhang, Adegoke Ojewole, Murat Efe Guney, Stella Biderman, Andrew M Watkins, Stephen Ra, Pablo Ribalta Lorenzo, Lucas Nivon, Brian Weitzner, Yih-En Andrew Ban, Peter K Sorger, Emad Mostaque, Zhao Zhang, Richard Bonneau, Mohammed AlQuraishi; bioRxiv 2022.11.20.517210; doi: https://doi.org/10.1101/2022.11.20.517210 https://www.biorxiv.org/content/10.1101/2022.11.20.517210v2

talks

Machine Learning on AWS for Life Sciences

As part of the Machine Learning on AWS for Life Sciences talk at AWS re:Invent 2022, I discussed the growing importance of computational power and techniques in advancing biology.

teaching

Natural Language Processing

Graduate Course, Department of Computer Science, Columbia University, 2021

I was a Teaching Assistant and Instructor for the Natural Language Processing course taught by Prof. Yassine Benajiba during the Fall 2021 semester.