Section: New Software and Platforms
HTS data processing
GATB: Genome Assembly & Analysis Tool Box
The GATB software toolbox aims to simplify the design of NGS algorithms. It offers a set of optimized, high-level building blocks that speed up the development of NGS tools for genome assembly and/or genome analysis. The underlying data structure is the de Bruijn graph, and the general parallelism model is multithreading. The GATB library targets standard computing resources, such as current multicore processors (laptop computer, small server) with a few GB of memory. Using the high-level API, developers of NGS tools can rapidly build their own software on top of state-of-the-art algorithms and data structures from the domain. The GATB library is written in C++.
Contact: Dominique Lavenier
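To make the underlying data structure concrete, the following Python sketch builds a small de Bruijn graph from a set of reads. It is an illustration only and does not use the GATB C++ API: the function names `build_kmer_set` and `successors` are hypothetical, the graph is held in a plain in-memory set, and reverse complements are ignored for brevity.

```python
def build_kmer_set(reads, k):
    """Collect every k-mer seen in the reads; each k-mer is a graph node."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def successors(kmer, kmers):
    """Return the nodes reachable by one edge from `kmer`.

    An edge exists when the (k-1)-suffix of `kmer`, extended by one base,
    is itself a node of the graph.
    """
    suffix = kmer[1:]
    return [suffix + base for base in "ACGT" if suffix + base in kmers]


# Two overlapping toy reads (illustrative data, not from a real dataset).
reads = ["ACGTAC", "CGTACG"]
kmers = build_kmer_set(reads, 4)
next_nodes = successors("ACGT", kmers)
```

Edges are never stored explicitly in this sketch: they are recomputed on demand by querying the node set, so only the set of k-mers has to be kept in memory.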
LEON: Genomic Data Compression
Leon is a lossless compression tool for the DNA sequences of high-throughput sequencing data that does not require a reference genome. Its techniques are derived from assembly principles that better exploit the redundancy of NGS data. A reference is built de novo from the set of reads as a probabilistic de Bruijn graph stored in a Bloom filter. Each read is encoded as a path in this graph, storing only an anchoring k-mer and a list of bifurcations indicating which path to follow in the graph. With this method, a compressed read file contains its underlying de Bruijn graph and is thus directly reusable by the many tools that rely on this structure. Leon encoded a C. elegans read set at 0.7 bits per base, outperforming state-of-the-art reference-free methods.
Contact: Claire Lemaitre
URL: https://gatb.inria.fr/software/leon/
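The encoding principle described for Leon (an anchoring k-mer plus a list of bifurcations in a Bloom-filter-backed de Bruijn graph) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Leon's implementation: the `BloomFilter` class, `encode` and `decode` are hypothetical names, the anchor is simply the first k-mer of the read, every k-mer of every read is inserted into the filter so that each read is a valid path, and no entropy coding is applied to the output.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: may report false positives, never false negatives."""

    def __init__(self, n_bits=1 << 16, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item):
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def next_bases(kmer, graph):
    """Bases that extend `kmer` to another k-mer present in the graph."""
    return [b for b in "ACGT" if kmer[1:] + b in graph]


def encode(read, k, graph):
    """Encode a read as (anchor k-mer, read length, bifurcation list)."""
    anchor, kmer, bifurcations = read[:k], read[:k], []
    for base in read[k:]:
        # A base is stored only where the graph offers several continuations.
        if len(next_bases(kmer, graph)) > 1:
            bifurcations.append(base)
        kmer = kmer[1:] + base
    return anchor, len(read), bifurcations


def decode(anchor, length, bifurcations, graph):
    """Rebuild the read by walking the graph from the anchor."""
    seq, kmer, choices = anchor, anchor, iter(bifurcations)
    while len(seq) < length:
        options = next_bases(kmer, graph)
        base = options[0] if len(options) == 1 else next(choices)
        seq += base
        kmer = kmer[1:] + base
    return seq


# Toy reads sharing a prefix, so the graph branches once (illustrative data).
K = 4
reads = ["ACGTACGGA", "ACGTACGTT"]
graph = BloomFilter()
for r in reads:
    for i in range(len(r) - K + 1):
        graph.add(r[i:i + K])

decoded = [decode(*encode(r, K, graph), graph) for r in reads]
```

Because the encoder and the decoder query the same filter, a false positive only adds an extra stored bifurcation; it cannot make the decoder follow a different path, so the round trip stays lossless in this sketch.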
BLOOCOO: Genomic Data Correction
Bloocoo is a k-mer spectrum-based read error corrector, designed to correct large datasets with a very low memory footprint. It uses the disk-streaming k-mer counting algorithm included in the GATB library and inserts solid k-mers into a Bloom filter. The correction procedure is similar to the Musket multistage approach. Bloocoo yields results similar to Musket's while requiring far less memory: for example, it can correct whole human genome re-sequencing reads at 70x coverage with less than 4 GB of memory.
Contact: Claire Lemaitre
URL: https://gatb.inria.fr/bloocoo-read-corrector/
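The idea behind k-mer spectrum correction can be illustrated with a short Python sketch. It is a deliberately simplified, single-substitution step and not Bloocoo's actual multistage procedure: k-mers are counted in memory with a `Counter` rather than streamed from disk, solid k-mers (here assumed to be those seen at least `min_abundance` times) are kept in a plain set rather than a Bloom filter, and the names `solid_kmers`, `covering_kmers` and `correct_read` are hypothetical.

```python
from collections import Counter


def solid_kmers(reads, k, min_abundance):
    """Return the k-mers seen at least `min_abundance` times in the reads."""
    counts = Counter(r[i:i + k] for r in reads
                     for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_abundance}


def covering_kmers(read, pos, k):
    """All k-mers of `read` that overlap position `pos`."""
    start = max(0, pos - k + 1)
    end = min(pos, len(read) - k)
    return [read[i:i + k] for i in range(start, end + 1)]


def correct_read(read, k, solid):
    """Substitute a base when exactly one change makes its k-mers solid."""
    for pos in range(len(read)):
        if all(km in solid for km in covering_kmers(read, pos, k)):
            continue  # position already supported by solid k-mers
        fixes = []
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if all(km in solid for km in covering_kmers(candidate, pos, k)):
                fixes.append(candidate)
        if len(fixes) == 1:  # accept only unambiguous corrections
            read = fixes[0]
    return read


# Three error-free copies and one read with a C->G error (illustrative data).
truth = "ACGTTGCAAGCT"
noisy = "ACGTTGGAAGCT"
solid = solid_kmers([truth, truth, truth, noisy], k=4, min_abundance=2)
corrected = correct_read(noisy, 4, solid)
```

Replacing the set by a Bloom filter, as the description above indicates Bloocoo does, would trade a small false-positive rate on solid k-mer lookups for a much smaller memory footprint.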
DiscoSnp++: DISCOvering Single Nucleotide Polymorphism
DiscoSnp++ is designed for discovering Single Nucleotide Polymorphisms (SNPs) and insertions/deletions (indels) from one or more raw sets of reads obtained with Next Generation Sequencers (NGS). The number of input read sets is not constrained: it can be one, two, or more. No other data, such as a reference genome or annotations, are needed. The software is composed of three modules: (1) kissnp2, which detects SNPs and indels from read sets; (2) kissreads2, which enhances the kissnp2 results by providing, for each variant, a mean read coverage and a (phred) quality; (3) VCF_creator, which provides a file in the Variant Calling Format (VCF). The VCF file can be created with or without a reference genome.
Contact: Pierre Peterlongo
URL: http://colibread.inria.fr/software/discosnp/
MindTheGap: Detection of insertion
MindTheGap performs detection and assembly of DNA insertion variants in NGS read datasets with respect to a reference genome. It takes as input a set of reads and a reference genome. It outputs two sets of FASTA sequences: one is the set of breakpoints of detected insertion sites, the other is the set of assembled insertions for each breakpoint. For each breakpoint, MindTheGap returns either a single insertion sequence (when there is no assembly ambiguity), a set of candidate insertion sequences (when there are ambiguities), or nothing at all (when the insertion is too complex to be assembled). MindTheGap performs de novo assembly using the de Bruijn graph implementation of GATB. Hence, the computational resources required to run MindTheGap are significantly lower than those of other assemblers.
Contact: Claire Lemaitre
URL: http://mindthegap.genouest.org/
TakeABreak: Detection of inversion breakpoints
TakeABreak is a tool that detects inversion breakpoints directly from raw NGS reads, without the need for a reference genome and without de novo assembly of the genomes. Its implementation is based on the Genome Assembly & Analysis Tool Box (GATB) library. It has a very small memory footprint, which allows it to run on common desktop computers, and an acceptable runtime: Illumina reads simulated at 80x coverage from human chromosome 22 can be processed in less than two hours with less than 1 GB of memory.
Contact: Claire Lemaitre