Nextflow: Properly chaining process outputs

Question

I have written a pipeline that is composed of three processes. I seem to be misunderstanding how Nextflow passes the outputs from one process to the next or creates a queue of files. I am using a tumor and paired normal sample from SRA (SRX21185781) to prototype this pipeline.

My first process, trim_QC works fine and outputs the expected number of trimmed fastqs. However, the next process, bwa_align only creates a single bam from one of the two output from the preceding process. My terminal looks like this:

N E X T F L O W  ~  version 23.10.1
Launching `../main.nf` [sick_church] DSL2 - revision: beb03cc0c6
[6e/48d4f6] process > trim_QC (SRR25452858_normal)                  [100%] 2 of 2, cached: 2 ✔
[39/1b8cb2] process > bwa_align (SRR25452858_normal)                [100%] 1 of 1, cached: 1 ✔
[5f/e598c1] process > mapping_stats (SRR25452858_normal_sorted.bam) [100%] 1 of 1, cached: 1 ✔
...

and when running with -with-trace it is even more obvious only one bam file is being processed by the downstream processes:

task_id hash    native_id   name    status  exit    submit  duration    realtime    %cpu    peak_rss    peak_vmem   rchar   wchar
2   d3/0bd651   71127   trim_QC (SRR25452863_tumor) CACHED  0   2024-02-07 16:28:37.449 1h 21m 10s  1h 21m 10s  -   -   -   -   -
1   6e/48d4f6   71128   trim_QC (SRR25452858_normal)    CACHED  0   2024-02-07 16:28:37.459 1h 16m 32s  1h 16m 32s  -   -   -   -   -
3   39/1b8cb2   2754    bwa_align (SRR25452858_normal)  CACHED  0   2024-02-08 10:08:46.033 4h 18m 10s  4h 18m 10s  -   -   -   -   -
4   5f/e598c1   9282    mapping_stats (SRR25452858_normal_sorted.bam)   CACHED  0   2024-02-08 14:26:55.823 53.9s   53.8s   -   -   -   -   -

I am expecting for the bwa-align and mapping_stats processes to also be n of 2.

I have tried explicitly designating the trim_QC.out as a channel (Channel.of(trim_QC.out)) or using an operator to return a queue channel object (i.e., .collect, .collectFile(), etc.) but this either returns errors or fails.

If I try to visualize the outputs from the trim_QC process by adding this to my workflow: trim_QC.out.trimmed_reads.view{it}, it does show both file pairs in my terminal, but only the first one is processed by bwa_align and subsequent steps. Like so:

...
[6e/48d4f6] process > trim_QC (SRR25452858_normal)            [100%] 2 of 2, cached: 2 ✔
[39/1b8cb2] process > bwa_align (SRR25452858_normal)          [100%] 1 of 1, cached: 1 ✔
[5f/e598c1] process > mapping_stats (SRR25452858_normal_so... [100%] 1 of 1, cached: 1 ✔
hello
[SRR25452858_normal, /Users/me/Desktop/fastq2/work/6e/48d4f61cae83ff903abceaadf715e5/SRR25452858_normal_1_val_1.fq.gz, /Users/me/Desktop/fastq2/work/6e/48d4f61cae83ff903abceaadf715e5/SRR25452858_normal_2_val_2.fq.gz]
Analysis complete for SRR25452858_normal_1_val_1.fq.gz
Analysis complete for SRR25452858_normal_2_val_2.fq.gz
[SRR25452863_tumor, /Users/me/Desktop/fastq2/work/d3/0bd6515fcedfc060596b75dd90c28a/SRR25452863_tumor_1_val_1.fq.gz, /Users/me/Desktop/fastq2/work/d3/0bd6515fcedfc060596b75dd90c28a/SRR25452863_tumor_2_val_2.fq.gz]
Analysis complete for SRR25452863_tumor_1_val_1.fq.gz
Analysis complete for SRR25452863_tumor_2_val_2.fq.gz

I am unsure if I am misunderstanding what Nextflow is capable of or how channels work. What I am trying to accomplish is trim adapters and QC result, align trimmed reads followed by indexing, and finally calculate mappings statistics. Any advice is appreciated.

Here is my entire script:

#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.fastq = "$projectDir/fastq2/_{1,2}"
params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"
params.out_dir = "$projectDir/Results"
workflow {
fastq_ch = Channel.fromFilePairs(params.fastq)
index_ch = Channel.fromPath(params.index_dir)
ref_ch = Channel.of(params.ref)

trim_QC(fastq_ch)
bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)
mapping_stats(bwa_align.out.sorted_bam)

}
process trim_QC {
    debug true
    tag "$sample_id"
publishDir = [
                [
                    path: { &quot;${params.out_dir}/trimmed/&quot; },
                    mode: 'copy',
                    overwrite: false
                ],

                [
                    path: { &quot;${params.out_dir}/stats/&quot; },
                    mode: 'copy',
                    pattern: &quot;*fastqc*&quot;,
                    overwrite: false
                ]
            ]

input:
tuple val(sample_id), path(fastq)

output:
tuple val(&quot;${sample_id}&quot;), path(&quot;*_val_1.fq.gz&quot;), path(&quot;*_val_2.fq.gz&quot;), emit: trimmed_reads
path(&quot;*fastqc*&quot;)

script:
def (read1, read2) = fastq
&quot;&quot;&quot;
trim_galore --paired --fastqc <span class="math-container">${read1} $</span>{read2}
&quot;&quot;&quot;

}
process bwa_align {
    debug true
    tag "$sample_id"
publishDir &quot;${params.out_dir}/bwa_aligned/&quot;, mode: 'copy', overwrite: false

input:
path index_dir
val ref
tuple val(sample_id), path(read1), path(read2)

output:
path(&quot;<span class="math-container">${sample_id}_sorted.bam"), emit: sorted_bam

path("${sample_id}_sorted.bam.bai"), emit: bam_index
script:
&quot;&quot;&quot;
bwa mem -t 2 <span class="math-container">${index_dir}/$</span>{ref} <span class="math-container">${read1} $</span>{read2} | samtools sort -@2 -o &quot;<span class="math-container">${sample_id}_sorted.bam" -

samtools index ${sample_id}_sorted.bam > "${sample_id}_sorted.bam.bai"
    """
}
process mapping_stats {
    debug true
    tag "$sorted_bam"
publishDir &quot;${params.out_dir}/stats/&quot;, mode: 'copy', overwrite: false

input:
path(sorted_bam)

output:
path(&quot;${sorted_bam}_mapping_stats.txt&quot;)

script:
&quot;&quot;&quot;
samtools flagstat <span class="math-container">${sorted_bam} &gt; "$</span>{sorted_bam}_mapping_stats.txt&quot;
&quot;&quot;&quot;

}

score 2 · Accepted Answer · answered Feb 13 '24 at 08:56

2

This is a result of nextflow channel types, queue- and value-channels. Nextflow will make fifo combinations of your channels to run bwa_align until either or both channel are empty. index_ch contains only one item, so once it is used there are no more valid combinations to start new bwa_align processes.

Either create your index_ch as value channel, or keep the index out of a channel entirely and refer to it as a parameter.

answered Feb 13 '24 at 08:56

Pallie

697
5
11

Thanks for your response! I did as you suggested and used Channel.value(params...) for both my index and ref channels. While this did work I still seem to be missing something. While my bwa_align process is now aligning both sets of files, the mapping_stats still is 1 of 1. I was under the impression that any output from a process is a value channel, so this should work. Even more strange is this process only takes a single input. Do you have any advice for how to rectify this or what I am misunderstanding? – bhumm Feb 14 '24 at 04:13
Afaik a process will only emit a value channel if ALL its input channels were value channels, otherwise it will be a queue chanel. Make a new question if you have a new question, it is hard to debug your new issue without the exact code. The output you show seems to suggest you are using the -resume option, maybe this is messing with the correct execution after you changed the flow. – Pallie Feb 14 '24 at 12:54
1

Apologies, I was too impatient and responded prematurely. The executor updated to 2 of 2 for the mapping_stats following the completion of the second bwa_align process. Looks like everything is working as expected. Thanks again for your help! – bhumm Feb 14 '24 at 14:09

Nextflow: Properly chaining process outputs

1 Answers1