Deep Learning for Characterizing Sequence Reads

Published:

Motivation: Next-generation sequencing (NGS) generates tremendous amounts of data. A crucial step in characterizing this data is mapping the reads to a reference genome. Depending on the sample, large portions of the data may go uncharacterized if reference genomes are unavailable. This data may contain significant signals that are lost for lack of methods that can characterize reads without reference assemblies. Another limitation of current methods is that they require large amounts of memory to load the references used to characterize the reads.

Results: With minimal fine-tuning of a deep-learning-based DNA language model, we achieved sequence classification performance approaching that of standard mapping algorithms. Our model came closest to the performance of mapping when the input sequences were mutated at a higher rate (0.1 SNPs/bp).
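
To make the mutation rate concrete, the following is a minimal sketch of how reads could be perturbed at a fixed per-base substitution rate such as 0.1 SNPs/bp. It is an illustration only, not the procedure used in the paper: the function name `mutate_read`, the uniform choice among the three alternative bases, and the 100 bp example read are all assumptions.

```python
import random

BASES = "ACGT"

def mutate_read(read, snp_rate, rng):
    """Return a copy of `read` in which each base is substituted
    independently with probability `snp_rate` (hypothetical helper)."""
    mutated = []
    for base in read:
        if rng.random() < snp_rate:
            # Choose a different base so every event is a real substitution.
            mutated.append(rng.choice([b for b in BASES if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)

rng = random.Random(0)          # fixed seed so the sketch is reproducible
read = "ACGT" * 25              # assumed 100 bp read, for illustration
mutated = mutate_read(read, snp_rate=0.1, rng=rng)

# Count positions that differ between the original and mutated read.
n_snps = sum(a != b for a, b in zip(read, mutated))
print(f"{n_snps} substitutions in {len(read)} bp")
```

At a rate of 0.1 SNPs/bp, a 100 bp read is expected to carry about ten substitutions, which is the regime in which the abstract reports the model being closest to mapping.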

Availability: The code is freely available at https://github.com/abhishake07/unmapped_reads.git
Authors: Sachin Kadyan, Abhishek Iyer, and William Kindschuh
Paper: Paper PDF
Supplementary Information: Supplementary PDF

The paper is also embedded below.