PRBB-CRG Sessions Johannes Söding


21/09/2018, 12:00
MARIE CURIE

Johannes Söding
Max Planck Institute for Biophysical Chemistry, München, DE

"New algorithms and tools for large-scale sequence analysis of metagenomics data"

Host: Roderic Guigó (CRG)

Abstract:

Sequencing costs have dropped much faster than Moore's law in the past decade. The analysis of large metagenomic datasets, not their generation, is now the main time and cost bottleneck. We present three methods that together allow us to move from an experiment-by-experiment analysis to large-scale analyses of hundreds or thousands of metagenomic datasets.

MMseqs2 [1] is a protein sequence and profile search method that is slightly more sensitive than PSI-BLAST and 400 times faster. It can annotate 1.1 billion sequences in 8.3 hours on 28 cores, and offers great potential to increase the fraction of annotatable (meta)genomic sequences.

Linclust [2] is a sequence clustering method whose run time scales linearly with the input set size, rather than nearly quadratically as in conventional algorithms. It can cluster 1.6 billion metagenomic sequence fragments to 50% sequence identity in 10 hours on a single server, more than 1000 times faster than was previously possible.

PLASS (unpublished) is a metagenomic protein sequence assembler whose run time and memory scale linearly with dataset size. It assembles ten times more protein sequences from soil metagenomes than Megahit and other popular nucleotide-level assemblers, and does so faster.
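As a rough sanity check on the figures quoted above, the short Python sketch below converts them into throughput rates and compares the Linclust input size with the number of comparisons an exhaustive all-against-all approach would need. The function names are illustrative and are not part of MMseqs2 or Linclust. The all-pairs count is an assumed upper bound used for illustration only; the abstract says conventional methods scale "nearly" quadratically, not that they compare every pair.

```python
# Figures taken from the abstract; everything else here is illustrative.
MMSEQS2_SEQUENCES = 1_100_000_000   # sequences annotated
MMSEQS2_HOURS = 8.3
MMSEQS2_CORES = 28

LINCLUST_SEQUENCES = 1_600_000_000  # sequence fragments clustered
LINCLUST_HOURS = 10.0


def per_second(count, hours):
    """Average number of items processed per second of wall-clock time."""
    return count / (hours * 3600.0)


def per_core_second(count, hours, cores):
    """Average number of items processed per second on each core."""
    return per_second(count, hours) / cores


def all_pairs(n):
    """Number of unordered pairs among n items (exhaustive comparison)."""
    return n * (n - 1) // 2


mmseqs2_rate = per_core_second(MMSEQS2_SEQUENCES, MMSEQS2_HOURS, MMSEQS2_CORES)
linclust_rate = per_second(LINCLUST_SEQUENCES, LINCLUST_HOURS)
naive_pairs = all_pairs(LINCLUST_SEQUENCES)

print(f"MMseqs2: {mmseqs2_rate:.0f} sequences per core per second")
print(f"Linclust: {linclust_rate:.0f} sequences per second")
print(f"All-against-all pairs for the Linclust input: {naive_pairs:.3e}")
```

Under these figures, MMseqs2 averages about 1,300 sequences per core per second and Linclust about 44,000 sequences per second, while an exhaustive comparison of 1.6 billion sequences would involve roughly 1.28 × 10^18 pairs, which is why linear scaling matters at this size.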

[1] Steinegger M and Soeding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi: 10.1038/nbt.3988 (2017)
[2] Steinegger M and Soeding J. Clustering huge protein sequence sets in linear time. bioRxiv, doi: 10.1101/104034 (2018) (accepted at Nature Communications)