I have written a pipeline that is composed of three processes. I seem to be misunderstanding how Nextflow passes the outputs from one process to the next or creates a queue of files. I am using a tumor and paired normal sample from SRA (SRX21185781) to prototype this pipeline.
My first process, trim_QC works fine and outputs the expected number of trimmed fastqs. However, the next process, bwa_align only creates a single bam from one of the two output from the preceding process. My terminal looks like this:
N E X T F L O W ~ version 23.10.1
Launching `../main.nf` [sick_church] DSL2 - revision: beb03cc0c6
[6e/48d4f6] process > trim_QC (SRR25452858_normal) [100%] 2 of 2, cached: 2 ✔
[39/1b8cb2] process > bwa_align (SRR25452858_normal) [100%] 1 of 1, cached: 1 ✔
[5f/e598c1] process > mapping_stats (SRR25452858_normal_sorted.bam) [100%] 1 of 1, cached: 1 ✔
...
and when running with -with-trace it is even more obvious only one bam file is being processed by the downstream processes:
task_id hash native_id name status exit submit duration realtime %cpu peak_rss peak_vmem rchar wchar
2 d3/0bd651 71127 trim_QC (SRR25452863_tumor) CACHED 0 2024-02-07 16:28:37.449 1h 21m 10s 1h 21m 10s - - - - -
1 6e/48d4f6 71128 trim_QC (SRR25452858_normal) CACHED 0 2024-02-07 16:28:37.459 1h 16m 32s 1h 16m 32s - - - - -
3 39/1b8cb2 2754 bwa_align (SRR25452858_normal) CACHED 0 2024-02-08 10:08:46.033 4h 18m 10s 4h 18m 10s - - - - -
4 5f/e598c1 9282 mapping_stats (SRR25452858_normal_sorted.bam) CACHED 0 2024-02-08 14:26:55.823 53.9s 53.8s - - - - -
I am expecting for the bwa-align and mapping_stats processes to also be n of 2.
I have tried explicitly designating the trim_QC.out as a channel (Channel.of(trim_QC.out)) or using an operator to return a queue channel object (i.e., .collect, .collectFile(), etc.) but this either returns errors or fails.
If I try to visualize the outputs from the trim_QC process by adding this to my workflow: trim_QC.out.trimmed_reads.view{it}, it does show both file pairs in my terminal, but only the first one is processed by bwa_align and subsequent steps. Like so:
...
[6e/48d4f6] process > trim_QC (SRR25452858_normal) [100%] 2 of 2, cached: 2 ✔
[39/1b8cb2] process > bwa_align (SRR25452858_normal) [100%] 1 of 1, cached: 1 ✔
[5f/e598c1] process > mapping_stats (SRR25452858_normal_so... [100%] 1 of 1, cached: 1 ✔
hello
[SRR25452858_normal, /Users/me/Desktop/fastq2/work/6e/48d4f61cae83ff903abceaadf715e5/SRR25452858_normal_1_val_1.fq.gz, /Users/me/Desktop/fastq2/work/6e/48d4f61cae83ff903abceaadf715e5/SRR25452858_normal_2_val_2.fq.gz]
Analysis complete for SRR25452858_normal_1_val_1.fq.gz
Analysis complete for SRR25452858_normal_2_val_2.fq.gz
[SRR25452863_tumor, /Users/me/Desktop/fastq2/work/d3/0bd6515fcedfc060596b75dd90c28a/SRR25452863_tumor_1_val_1.fq.gz, /Users/me/Desktop/fastq2/work/d3/0bd6515fcedfc060596b75dd90c28a/SRR25452863_tumor_2_val_2.fq.gz]
Analysis complete for SRR25452863_tumor_1_val_1.fq.gz
Analysis complete for SRR25452863_tumor_2_val_2.fq.gz
I am unsure if I am misunderstanding what Nextflow is capable of or how channels work. What I am trying to accomplish is trim adapters and QC result, align trimmed reads followed by indexing, and finally calculate mappings statistics. Any advice is appreciated.
Here is my entire script:
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2
params.fastq = "$projectDir/fastq2/_{1,2}"
params.index_dir = "$projectDir/RefGenome/BWAIndex"
params.ref = "hg38.fa"
params.out_dir = "$projectDir/Results"
workflow {
fastq_ch = Channel.fromFilePairs(params.fastq)
index_ch = Channel.fromPath(params.index_dir)
ref_ch = Channel.of(params.ref)
trim_QC(fastq_ch)
bwa_align(index_ch, ref_ch, trim_QC.out.trimmed_reads)
mapping_stats(bwa_align.out.sorted_bam)
}
process trim_QC {
debug true
tag "$sample_id"
publishDir = [
[
path: { "${params.out_dir}/trimmed/" },
mode: 'copy',
overwrite: false
],
[
path: { "${params.out_dir}/stats/" },
mode: 'copy',
pattern: "*fastqc*",
overwrite: false
]
]
input:
tuple val(sample_id), path(fastq)
output:
tuple val("${sample_id}"), path("*_val_1.fq.gz"), path("*_val_2.fq.gz"), emit: trimmed_reads
path("*fastqc*")
script:
def (read1, read2) = fastq
"""
trim_galore --paired --fastqc <span class="math-container">${read1} $</span>{read2}
"""
}
process bwa_align {
debug true
tag "$sample_id"
publishDir "${params.out_dir}/bwa_aligned/", mode: 'copy', overwrite: false
input:
path index_dir
val ref
tuple val(sample_id), path(read1), path(read2)
output:
path("<span class="math-container">${sample_id}_sorted.bam"), emit: sorted_bam
path("${sample_id}_sorted.bam.bai"), emit: bam_index
script:
"""
bwa mem -t 2 <span class="math-container">${index_dir}/$</span>{ref} <span class="math-container">${read1} $</span>{read2} | samtools sort -@2 -o "<span class="math-container">${sample_id}_sorted.bam" -
samtools index ${sample_id}_sorted.bam > "${sample_id}_sorted.bam.bai"
"""
}
process mapping_stats {
debug true
tag "$sorted_bam"
publishDir "${params.out_dir}/stats/", mode: 'copy', overwrite: false
input:
path(sorted_bam)
output:
path("${sorted_bam}_mapping_stats.txt")
script:
"""
samtools flagstat <span class="math-container">${sorted_bam} > "$</span>{sorted_bam}_mapping_stats.txt"
"""
}
Channel.value(params...)for both my index and ref channels. While this did work I still seem to be missing something. While mybwa_alignprocess is now aligning both sets of files, themapping_statsstill is 1 of 1. I was under the impression that any output from a process is a value channel, so this should work. Even more strange is this process only takes a single input. Do you have any advice for how to rectify this or what I am misunderstanding? – bhumm Feb 14 '24 at 04:13-resumeoption, maybe this is messing with the correct execution after you changed the flow. – Pallie Feb 14 '24 at 12:54mapping_statsfollowing the completion of the secondbwa_alignprocess. Looks like everything is working as expected. Thanks again for your help! – bhumm Feb 14 '24 at 14:09