3

I have written a pipeline that is composed of three processes. I seem to be misunderstanding how Nextflow passes the outputs from one process to the next or creates a queue of files. I am using a tumor and paired normal sample from SRA (SRX21185781) to prototype this pipeline.

My first process, trim_QC works fine and outputs the expected number of trimmed fastqs. However, the next process, bwa_align only creates a single bam from one of the two output from the preceding process. My terminal looks like this:

N E X T F L O W  ~  version 23.10.1
Launching `../main.nf` [sick_church] DSL2 - revision: beb03cc0c6

[6e/48d4f6] process > trim_QC (SRR25452858_normal) [100%] 2 of 2, cached: 2 ✔ [39/1b8cb2] process > bwa_align (SRR25452858_normal) [100%] 1 of 1, cached: 1 ✔ [5f/e598c1] process > mapping_stats (SRR25452858_normal_sorted.bam) [100%] 1 of 1, cached: 1 ✔ ...

and when running with -with-trace it is even more obvious only one bam file is being processed by the downstream processes:

task_id hash    native_id   name    status  exit    submit  duration    realtime    %cpu    peak_rss    peak_vmem   rchar   wchar
2   d3/0bd651   71127   trim_QC (SRR25452863_tumor) CACHED  0   2024-02-07 16:28:37.449 1h 21m 10s  1h 21m 10s  -   -   -   -   -
1   6e/48d4f6   71128   trim_QC (SRR25452858_normal)    CACHED  0   2024-02-07 16:28:37.459 1h 16m 32s  1h 16m 32s  -   -   -   -   -
3   39/1b8cb2   2754    bwa_align (SRR25452858_normal)  CACHED  0   2024-02-08 10:08:46.033 4h 18m 10s  4h 18m 10s  -   -   -   -   -
4   5f/e598c1   9282    mapping_stats (SRR25452858_normal_sorted.bam)   CACHED  0   2024-02-08 14:26:55.823 53.9s   53.8s   -   -   -   -   -

I am expecting for the bwa-align and mapping_stats processes to also be n of 2.

I have tried explicitly designating the trim_QC.out as a channel (Channel.of(trim_QC.out)) or using an operator to return a queue channel object (i.e., .collect, .collectFile(), etc.) but this either returns errors or fails.

If I try to visualize the outputs from the trim_QC process by adding this to my workflow: trim_QC.out.trimmed_reads.view{it}, it does show both file pairs in my terminal, but only the first one is processed by bwa_align and subsequent steps. Like so:

...
[6e/48d4f6] process > trim_QC (SRR25452858_normal)            [100%] 2 of 2, cached: 2 ✔
[39/1b8cb2] process > bwa_align (SRR25452858_normal)          [100%] 1 of 1, cached: 1 ✔
[5f/e598c1] process > mapping_stats (SRR25452858_normal_so... [100%] 1 of 1, cached: 1 ✔
hello

[SRR25452858_normal, /Users/me/Desktop/fastq2/work/6e/48d4f61cae83ff903abceaadf715e5/SRR25452858_normal_1_val_1.fq.gz, /Users/me/Desktop/fastq2/work/6e/48d4f61cae83ff903abceaadf715e5/SRR25452858_normal_2_val_2.fq.gz] Analysis complete for SRR25452858_normal_1_val_1.fq.gz Analysis complete for SRR25452858_normal_2_val_2.fq.gz

[SRR25452863_tumor, /Users/me/Desktop/fastq2/work/d3/0bd6515fcedfc060596b75dd90c28a/SRR25452863_tumor_1_val_1.fq.gz, /Users/me/Desktop/fastq2/work/d3/0bd6515fcedfc060596b75dd90c28a/SRR25452863_tumor_2_val_2.fq.gz] Analysis complete for SRR25452863_tumor_1_val_1.fq.gz Analysis complete for SRR25452863_tumor_2_val_2.fq.gz

I am unsure if I am misunderstanding what Nextflow is capable of or how channels work. What I am trying to accomplish is trim adapters and QC result, align trimmed reads followed by indexing, and finally calculate mappings statistics. Any advice is appreciated.

Here is my entire script:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.fastq = "$projectDir/fastq2/_{1,2}" params.index_dir = "$projectDir/RefGenome/BWAIndex" params.ref = "hg38.fa" params.out_dir = "$projectDir/Results"

workflow {

fastq_ch = Channel.fromFilePairs(params.fastq)
index_ch = Channel.fromPath(params.index_dir)
ref_ch = Channel.of(params.ref)

trim_QC(fastq_ch)
bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)
mapping_stats(bwa_align.out.sorted_bam)

}

process trim_QC { debug true tag "$sample_id"

publishDir = [
                [
                    path: { "${params.out_dir}/trimmed/" },
                    mode: 'copy',
                    overwrite: false
                ],

                [
                    path: { "${params.out_dir}/stats/" },
                    mode: 'copy',
                    pattern: "*fastqc*",
                    overwrite: false
                ]
            ]

input:
tuple val(sample_id), path(fastq)

output:
tuple val("${sample_id}"), path("*_val_1.fq.gz"), path("*_val_2.fq.gz"), emit: trimmed_reads
path("*fastqc*")

script:
def (read1, read2) = fastq
"""
trim_galore --paired --fastqc <span class="math-container">${read1} $</span>{read2}
&quot;&quot;&quot;

}

process bwa_align { debug true tag "$sample_id"

publishDir &quot;${params.out_dir}/bwa_aligned/&quot;, mode: 'copy', overwrite: false

input:
path index_dir
val ref
tuple val(sample_id), path(read1), path(read2)

output:
path(&quot;<span class="math-container">${sample_id}_sorted.bam"), emit: sorted_bam

path("${sample_id}_sorted.bam.bai"), emit: bam_index

script:
&quot;&quot;&quot;
bwa mem -t 2 <span class="math-container">${index_dir}/$</span>{ref} <span class="math-container">${read1} $</span>{read2} | samtools sort -@2 -o &quot;<span class="math-container">${sample_id}_sorted.bam" -

samtools index ${sample_id}_sorted.bam > "${sample_id}_sorted.bam.bai" """ }

process mapping_stats { debug true tag "$sorted_bam"

publishDir &quot;${params.out_dir}/stats/&quot;, mode: 'copy', overwrite: false

input:
path(sorted_bam)

output:
path(&quot;${sorted_bam}_mapping_stats.txt&quot;)

script:
&quot;&quot;&quot;
samtools flagstat <span class="math-container">${sorted_bam} &gt; "$</span>{sorted_bam}_mapping_stats.txt&quot;
&quot;&quot;&quot;

}

bhumm
  • 33
  • 6

1 Answers1

2

This is a result of nextflow channel types, queue- and value-channels. Nextflow will make fifo combinations of your channels to run bwa_align until either or both channel are empty. index_ch contains only one item, so once it is used there are no more valid combinations to start new bwa_align processes.

Either create your index_ch as value channel, or keep the index out of a channel entirely and refer to it as a parameter.

Pallie
  • 697
  • 5
  • 11
  • Thanks for your response! I did as you suggested and used Channel.value(params...) for both my index and ref channels. While this did work I still seem to be missing something. While my bwa_align process is now aligning both sets of files, the mapping_stats still is 1 of 1. I was under the impression that any output from a process is a value channel, so this should work. Even more strange is this process only takes a single input. Do you have any advice for how to rectify this or what I am misunderstanding? – bhumm Feb 14 '24 at 04:13
  • Afaik a process will only emit a value channel if ALL its input channels were value channels, otherwise it will be a queue chanel. Make a new question if you have a new question, it is hard to debug your new issue without the exact code. The output you show seems to suggest you are using the -resume option, maybe this is messing with the correct execution after you changed the flow. – Pallie Feb 14 '24 at 12:54
  • 1
    Apologies, I was too impatient and responded prematurely. The executor updated to 2 of 2 for the mapping_stats following the completion of the second bwa_align process. Looks like everything is working as expected. Thanks again for your help! – bhumm Feb 14 '24 at 14:09