When analysing data, especially when analysing complicated genomics data, one quickly learns to appreciate the benefits of well-written workflows.

In the past, I have developed my own Bash, Snakemake and Nextflow pipelines. But since then, the bioinformatics community has put enormous effort into creating general, standardized pipelines that anyone can use. For Snakemake this effort is called the workflow catalogue, and for Nextflow it is called nf-core.

Last week I was browsing this catalogue and came across a workflow whose structure is very familiar to me: The MAG workflow.

So of course I have to see if I can get it to run and if the results are as expected.

Getting some data

To make a test possible, I first need some data. Of course, I won’t assemble a full metagenomic dataset on the first try, so it’s probably better to either simulate data or use an example dataset. For this purpose, I decided to try analysing a single sample of the CAMI II Toy Mouse Gut dataset.

As I need FASTQ files for the pipeline but the test data only contains BAM files, I first need to convert from BAM to FASTQ:

%%bash
mkdir -p ./data/202505_MAG_fastq/
cd ./data/202505_MAG_fastq/
wget -c -o /dev/null https://frl.publisso.de/data/frl:6421672/dataset/2017.12.29_11.37.26_sample_0_bam.tar
tar -xvf 2017.12.29_11.37.26_sample_0_bam.tar > /dev/null
cd 2017.12.29_11.37.26_sample_0/bam

# Convert each BAM file to FASTQ and store them in temporary files
for bam_file in *.bam; do
    # Extract the base name without extension
    base_name=$(basename "$bam_file" .bam)
    
    # Convert BAM to paired FASTQ files
    samtools fastq -1 "${base_name}_1.fastq" -2 "${base_name}_2.fastq" "$bam_file"  2> /dev/null
done
gzip *fastq

# Merge the individual FASTQ files into one combined paired-end dataset in the parent directory
cat 1107303.0_2.fastq.gz 180944.0_2.fastq.gz 178480.0_2.fastq.gz > ../combined_reads_2.fastq.gz
cat 1107303.0_1.fastq.gz 180944.0_1.fastq.gz 178480.0_1.fastq.gz > ../combined_reads_1.fastq.gz

Now I have a FASTQ dataset at ./data/202505_MAG_fastq/2017.12.29_11.37.26_sample_0/combined_reads_{1,2}.fastq.gz.
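Before feeding these files into the pipeline, a quick sanity check that both mate files still contain the same number of records doesn’t hurt; a minimal sketch using zcat and wc:

cd ./data/202505_MAG_fastq/2017.12.29_11.37.26_sample_0
# Both counts should be identical and divisible by 4 (one FASTQ record = 4 lines)
zcat combined_reads_1.fastq.gz | wc -l
zcat combined_reads_2.fastq.gz | wc -l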

As I already have Docker and Nextflow installed, I should now be able to run the pipeline by creating a sample sheet and then launching it. The sample sheet looks like this:

sample,group,short_reads_1,short_reads_2,long_reads
sample1,0,2017.12.29_11.37.26_sample_0/combined_reads_1.fastq.gz,2017.12.29_11.37.26_sample_0/combined_reads_2.fastq.gz,
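For reproducibility, I can write this file with a small heredoc instead of an editor; a minimal sketch, placing samplesheet.csv next to the data so that the relative read paths resolve:

cd ./data/202505_MAG_fastq
cat > samplesheet.csv << 'EOF'
sample,group,short_reads_1,short_reads_2,long_reads
sample1,0,2017.12.29_11.37.26_sample_0/combined_reads_1.fastq.gz,2017.12.29_11.37.26_sample_0/combined_reads_2.fastq.gz,
EOF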

And I execute the command like so:

cd ./data/202505_MAG_fastq
nextflow run nf-core/mag \
    -r 3.4.0 \
    -profile docker \
    --input 'samplesheet.csv' \
    --outdir './output'

But that will fail, because the pipeline expects a lot of memory to be available on the host. Usually one would use an HPC for this, which allows distributing jobs according to their resource requirements. But as I am testing this pipeline locally, I need to ensure that all jobs can be executed on my machine.

Just for completeness’ sake, this is the error message that is shown when I try to execute this locally without enough memory:

Process requirement exceeds available memory -- req: 36 GB; avail: 27.3 GB

To sidestep this issue I can apparently provide a config file that specifies the resource limits of my machine, which are then applied to all jobs:

process {
    resourceLimits = [
        cpus: 16,
        memory: 20.GB,
        time: 24.h
    ]
}
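To pick values that actually fit the machine, I can quickly check what is available beforehand; a minimal sketch using standard Linux tools (on other platforms the commands differ):

# Number of CPU cores and total/available memory of the host
nproc
free -h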

And then restart the workflow:

nextflow run nf-core/mag \
    -r 3.4.0 \
    -profile docker \
    --input 'samplesheet.csv' \
    --outdir './output' \
    -c config.config

But I quickly noticed that this is not yet the end of the needed configuration. The workflow will run the assembly and binning steps without problems, but QC checks and taxonomic profiling need reference databases. As these can be pretty large, downloading them within the workflow might time out; this seems to be the case especially for GTDB.

So I first get the GTDB database by running:

mkdir -p ~/databases/gtdbtk/
wget -P  ~/databases/gtdbtk/ https://data.ace.uq.edu.au/public/gtdb/data/releases/release220/220.0/auxillary_files/gtdbtk_package/full_package/gtdbtk_r220_data.tar.gz
cd ~/databases/gtdbtk/
md5sum gtdbtk_r220_data.tar.gz # 5aafa1b9c27ceda003d75adf238ed9e0
tar xzvf gtdbtk_r220_data.tar.gz
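Instead of eyeballing the checksum, a small sketch that compares it against the value noted above and complains on a mismatch:

expected="5aafa1b9c27ceda003d75adf238ed9e0"
actual=$(md5sum gtdbtk_r220_data.tar.gz | cut -d ' ' -f 1)
if [ "$actual" != "$expected" ]; then
    echo "Checksum mismatch - please re-download the archive" >&2
fi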

When using BUSCO, it can also be helpful to download the data before running the pipeline, as the download is rather large.

mkdir -p ~/databases/busco/
cd ~/databases/busco/
wget -c --mirror https://busco-data.ezlab.org/v5/data/lineages/
cd ~/databases/busco/busco-data.ezlab.org/v5/data/lineages
ls *.tar.gz | parallel 'tar -xzf {}'
cd ~/databases/busco/busco-data.ezlab.org/v5/data/placement_files
ls *.tar.gz | parallel 'tar -xzf {}'

The same can be done for CheckM2 or CheckM, but it is not as necessary. The pipeline’s own downloads can also be kept with flags such as --save_checkm2_data, which I imagine then allows re-using them on later runs by passing the saved location as the database path.

mkdir -p ~/databases/checkm/checkm2
cd ~/databases/checkm/checkm2
wget -O checkm2_database.tar.gz 'https://zenodo.org/records/5571251/files/checkm2_database.tar.gz?download=1'
tar xzf checkm2_database.tar.gz

Running the Pipeline

When running the pipeline I recommend passing -resume, as Nextflow, in contrast to Snakemake, does not resume from previous runs by default.

The new command to execute is consequently:

GTDBPATH="$HOME/databases/gtdbtk/release220"
nextflow run nf-core/mag \
    -r 3.4.0 \
    -profile docker \
    --input 'samplesheet.csv' \
    --gtdb_db ${GTDBPATH} \
    --host_genome hg38 \
    --outdir './output' \
    --binqc_tool checkm2 \
    -c config.config \
    -resume

This nicely runs all kinds of tools. To summarize a bit: it assembles the read files with both MEGAHIT and metaSPAdes, then bins each assembly with three binners (CONCOCT, MetaBAT2, MaxBin2), which results in six sets of bins. Bin refinement can also be enabled, which would result in even more bin combinations. As is, this leads to six GTDB-Tk and CheckM2 runs.

Pipeline Outputs

One of the key outputs created by the pipeline seems to be output/GenomeBinning/bin_summary.tsv, which contains completeness and contamination values and much more for all evaluated assembler/binner combinations.

Let’s have a look at it.
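A quick way to see which columns the file provides, before loading it properly, is to print the header (using the output path of this run):

head -n 1 ./data/202505_MAG_fastq/output/GenomeBinning/bin_summary.tsv | tr '\t' '\n'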

The table looks very complete; clearly it is a well-curated combination of the CheckM2 results, the GTDB-Tk results, and the QUAST results (N50 etc.). It even has a Depth column; I am not sure how it was computed, but it seems to come from the MetaBAT2 depth computation. One would have to trace the pipeline to figure it out, as the documentation is not very specific on that point.

import p9customtheme
import plotnine as p9
import polars as pl

table_path = "./data/202505_MAG_fastq/output/GenomeBinning/bin_summary.tsv"

df = (
    pl.read_csv(table_path, separator="\t")
    .with_columns(
        pl.col("Name")
        .str.splitn("-", 3)
        .struct.rename_fields(["Assembler", "Binner", "_remainder"])
        .alias("data")
    )
    .unnest("data")
    .drop("_remainder")
    .filter(pl.col("Binner").is_null().not_())
)

(
    p9.ggplot(df)
    + p9.aes("Completeness", "Contamination", fill="Total length")
    + p9.geom_point(size=3)
    + p9.facet_grid("Binner~Assembler")
    + p9.labs(
        title="Bin Completeness - Contamination",
        subtitle="Colored by bin size [bp]",
    )
)

[Plot: Bin Completeness - Contamination, faceted by Binner and Assembler, colored by bin size in bp]

The plot shows the bins we obtained from the pipeline, separated by assembler and binner. It is clear to see that MaxBin2 created bins that were less pure than those from CONCOCT and MetaBAT2.

All in all, this nf-core pipeline seems very usable and easy to set up. I have only scratched the surface of the output, but for now that is enough for me. Once I need to analyse a shotgun metagenome, I know that I can have this up and running within minutes. And that is nice.