I'm running a long-running pipeline from bash, in the background:
find / -size +500M -name '*.txt' -mtime +90 |
xargs -n1 gzip -v9 &
The 2nd stage of the pipeline takes a long time to complete (hours) since there are several big+old files.
In contrast, the 1st stage of the pipeline (find) completes almost immediately: it finishes writing its output without ever filling the pipe, so it exits successfully.
The parent bash process seems to wait properly for child processes.
I can tell this because there's no find (pid 20851) running according to either:
ps alx | grep 20851
pgrep -l find
There's no zombie process, nor is there any process with process ID 20851 to be found anywhere on the system.
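Besides ps and pgrep, another way to check whether a pid still exists is kill -0, which delivers no signal and only reports whether the target is there. A minimal sketch, where pid_alive is just an illustrative helper name and a short sleep stands in for find (note that a zombie would still pass this check, since it keeps its process table entry):

```shell
# pid_alive PID (hypothetical helper): succeeds while a process table entry
# for PID exists and we are allowed to signal it. kill -0 delivers no signal.
pid_alive() {
    kill -0 "$1" 2>/dev/null
}

sleep 1 &                 # short-lived stand-in for find
short_pid=$!
wait "$short_pid"         # the parent shell reaps the child, so no zombie remains

if pid_alive "$short_pid"; then
    echo "$short_pid still exists"
else
    echo "$short_pid is gone"
fi
```

For a process owned by another user, kill -0 can fail with a permission error even though the process exists, so this is only a reliable test for one's own processes.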
The bash builtin jobs correctly shows the job as a single line, without any process ids:
[1]+ Running find / -size +500M -name '*.txt' -mtime +90 | xargs -n1 gzip -v9 &
OTOH: I stumbled by accident on a separate job control command (/bin/jobs) which shows:
[1]+ 20851 Running find / -size +500M -name '*.txt' -mtime +90
20852 Running | xargs -n1 gzip -v9 &
and which is (wrongly) showing the already exited 20851 find process as "Running".
This is on CentOS (edit: More accurately: Amazon Linux 2 AMI) Linux.
Turns out that /bin/jobs is a two line /bin/sh script:
#!/bin/sh
builtin jobs "$@"
This is surprising to me. How can a separate process, started from another program (sh), know the details of a process which is managed by another (bash) after that process has already completed and exited and is NOT a zombie?
Further:
how can it know details (including pid) about the already exited process, when other methods on the system (ps, pgrep) can't?
Edits:
(1) As Uncle Billy noted in the comments, on this system /bin/sh and /bin/bash are the same (/bin/sh is a symlink to /bin/bash), but /bin/jobs is a script with a shebang line, so it runs in a separate process.
(2) Also, thanks to Uncle Billy: an easier way to reproduce. /bin/jobs was a red herring. I mistakenly assumed it is the one producing the output. The surprising output apparently came from the bash builtin jobs when called with -l:
$ sleep 1 | sleep 3600 &
[1] 13616
$ jobs -l
[1]+ 13615 Running sleep 1
13616 Running | sleep 3600 &
$ ls /proc/13615
ls: cannot access /proc/13615: No such file or directory
So process 13615 doesn't exist, yet the bash builtin jobs still shows it as "Running", which looks like a bug in jobs -l.
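The reproduction can also be scripted so that the shell's listing and the kernel's view are printed side by side. This is only a sketch under some assumptions: it is meant for bash, the exact layout and status words of jobs -l may differ between versions, and the temporary file name (listing) is arbitrary:

```shell
# Assumes bash; shorter sleeps than in the example above to keep it quick.
sleep 1 | sleep 5 &
last_pid=$!                  # $! is the pid of the last stage of the pipeline
sleep 2                      # by now the first stage ("sleep 1") has exited

listing=$(mktemp)
jobs -l > "$listing"         # the shell's own bookkeeping for the job
cat "$listing"

# Take the first all-digit field of each line (the pid column) and ask the
# kernel about it; kill -0 sends no signal, it only tests for existence.
awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^[0-9]+$/) { print $i; break } }' "$listing" |
while read -r pid; do
    if kill -0 "$pid" 2>/dev/null; then
        echo "$pid: exists"
    else
        echo "$pid: no such process"
    fi
done
rm -f "$listing"
```

Whatever status jobs -l prints for the first stage, the kill -0 check reports independently whether that pid still exists.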
The presence of /bin/jobs, which misled me into thinking it was the culprit (it wasn't), seems questionable. I believe it should be removed from the system, as it is useless: it is a sh script running in a separate process, so it can't show the caller's jobs anyway.
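That a separately started shell cannot see the caller's jobs is easy to check. A small sketch, using sh -c 'jobs -l' as a stand-in for the /bin/jobs script (an assumption, made so that it also runs where sh is not bash and has no builtin command):

```shell
sleep 30 &                      # a background job owned by this shell
bg_pid=$!

echo "this shell's job table:"
jobs -l

# A newly exec'ed shell starts with an empty job table of its own,
# so it prints nothing here.
child_view=$(sh -c 'jobs -l')
echo "child shell's job table: '${child_view}'"

kill "$bg_pid"                  # clean up the background sleep
wait "$bg_pid" 2>/dev/null || true
```

The first listing shows the sleep job; the second is empty, which matches the expectation that a script run in its own process has nothing to report about the caller.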
jobs -l, even when a process has exited: sleep 1 | sleep 3600 & ... after 2 secs jobs -l will show both as running, though the first sleep has terminated. What version of bash is that? What does alias | grep jobs say? – Jan 30 '21 at 22:44
$ bash --version shows GNU bash, version 4.2.46(2)-release (x86_64-koji-linux-gnu) – arielf Jan 30 '21 at 23:07
type /bin/jobs say? Notice that on centos/rhel /bin/sh is still bash under another name. – Jan 30 '21 at 23:11
$ type /bin/jobs shows: /bin/jobs is /bin/jobs. I have no alias for jobs, and indeed /bin/sh is a symlink to /bin/bash on this system, so same program, different processes. Great questions! – arielf Jan 30 '21 at 23:13
execve(2), as running /bin/jobs should be -- unless there's some kind of trick. Another possibility would be that either jobs or builtin is an exported function ;-) – Jan 30 '21 at 23:28
/bin/jobs, its content is what you wrote; but it behaves like we all expect, not like in the question. So if things work for you as you described then it's not because of the Amazon Linux 2 itself (phew!). – Kamil Maciorowski Jan 31 '21 at 05:01
jobs is specified by POSIX and POSIX explicitly requires it as a standalone executable. In this matter Amazon Linux 2 is more POSIX-compliant than e.g. Debian. Note in AL2 there is /usr/bin/cd as well. – Kamil Maciorowski Feb 01 '21 at 19:15