I need to run `grep` on a couple of million files, so I tried to speed it up following the two approaches mentioned here: `xargs -P -n` and GNU `parallel`. I tried this on a subset of my files (9026 of them), and this was the result:
With `xargs -P 8 -n 1000`, very fast:

```
$ time find tex -maxdepth 1 -name "*.json" | \
    xargs -P 8 -n 1000 grep -ohP 'pattern' > /dev/null

real    0m0.085s
user    0m0.333s
sys     0m0.058s
```

With `parallel`, very slow:

```
$ time find tex -maxdepth 1 -name "*.json" | \
    parallel -j 8 grep -ohP 'pattern' > /dev/null

real    0m21.566s
user    0m22.021s
sys     0m18.505s
```

Even sequential `xargs` is faster than `parallel`:

```
$ time find tex -maxdepth 1 -name "*.json" | \
    xargs grep -ohP 'pattern' > /dev/null

real    0m0.242s
user    0m0.209s
sys     0m0.040s
```
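A quick way to see how differently the two tools batch the work is to print the commands instead of running them (a sketch; `head -3` just keeps the output short):

```
# parallel runs one grep per input file by default; --dry-run prints the
# commands it would execute without running them:
$ find tex -maxdepth 1 -name "*.json" | head -3 | \
    parallel --dry-run grep -ohP 'pattern'

# xargs packs many filenames into each grep invocation; prefixing the
# command with echo prints the single batched command instead:
$ find tex -maxdepth 1 -name "*.json" | head -3 | \
    xargs -n 1000 echo grep -ohP 'pattern'
```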
`xargs -P n` does not work for me because the output from all the processes gets interleaved, which does not happen with `parallel`. So I would like to use `parallel` without incurring this huge slowdown.
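For illustration, a minimal way to reproduce the interleaving with toy jobs (a sketch with made-up payloads, not my real data): each job prints two lines, and under `xargs -P` the pairs from concurrent jobs can come out mixed, while `parallel` buffers each job's output and prints it whole:

```
# With xargs -P, the two lines of one job may be separated by lines from
# another concurrently running job:
$ printf '%s\n' a b c d | \
    xargs -P 4 -n 1 sh -c 'echo "$1 line1"; echo "$1 line2"' _

# GNU parallel groups each job's output by default, so pairs stay together:
$ printf '%s\n' a b c d | parallel 'echo {} line1; echo {} line2'
```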
Any ideas?
UPDATE
Following the answer by Ole Tange, I tried `parallel -X`. The results are here, for completeness:

```
$ time find tex -maxdepth 1 -name "*.json" | \
    parallel -X -j 8 grep -ohP 'pattern' > /dev/null

real    0m0.563s
user    0m0.583s
sys     0m0.110s
```

Fastest solution: following the comment by @cas, I tried grep with the `-H` option (to force printing the filenames) and sorting. Results here:

```
$ time find tex -maxdepth 1 -name '*.json' -print0 | \
    xargs -0r -P 9 -n 500 grep --line-buffered -oHP 'pattern' | \
    sort -t: -k1 | cut -d: -f2- > /dev/null

real    0m0.144s
user    0m0.417s
sys     0m0.095s
```
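As suggested in the comments below, the sort can additionally be made stable and keyed on the filename field only, so that matches from within one file keep their original relative order; a variant I have not timed:

```
# -s makes the sort stable (no last-resort comparison); -k1,1 restricts
# the sort key to the filename field before the first colon:
$ find tex -maxdepth 1 -name '*.json' -print0 | \
    xargs -0r -P 9 -n 500 grep --line-buffered -oHP 'pattern' | \
    sort -s -t: -k1,1 | cut -d: -f2- > /dev/null
```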
Comments:

- Could it be that `parallel` is launching a new shell for each invocation? If yes, is there any way to avoid it? – nofrills Mar 30 '16 at 16:16
- Have you tried `xargs -P8` with the `-H` (`--with-filename`) instead of the `-h` (`--no-filename`) option to `grep`, then piping through `sort -t: -k1 | cut -d: -f2-` to sort by filename and then strip the filename? – cas Mar 31 '16 at 00:00
- Pass `-s` to `sort` to enable stable sort (disable last-resort comparison), else output from within a file could come out of order. – iruvar Mar 31 '16 at 00:09
- Also use `find ... -print0` and `xargs -0` so that your script works even with filenames containing annoying characters like spaces and newlines: `time find tex -maxdepth 1 -name '*.json' -print0 | xargs -0r -P 8 -n 1000 grep -oHP 'pattern' | sort -t: -k1 | cut -d: -f2- > /dev/null` – cas Mar 31 '16 at 00:09
- … the `-s` option (how does it differ from `sort -k1,1`?). BTW, I seem to have found a bug in GNU `grep` where `-H -z` doesn't act like `-l -z` even though the description for `-z` implies it should. – cas Mar 31 '16 at 00:15
- … what `--stable` does, but the man page only lists `--stable` with a minimal description. I hate the way GNU tools leave important info out of man pages and expect you to rely on their crappy .info format. – cas Mar 31 '16 at 00:23
- … `grep`. My mistake, I should be using `-Z`, not `-z`. But using that reveals that `sort` can't use NUL as a field delimiter (as opposed to line separator) anyway, so there's no point in using `-print0` in `find` or `-0` in `xargs`. – cas Mar 31 '16 at 00:30
- Thanks for the `grep -H` suggestion. However, I still get mangled output, this time within the same line, like so: `line1word1_line2word1_line1word2`. – nofrills Mar 31 '16 at 13:13
- … `grep --line-buffered` – nofrills Apr 03 '16 at 14:09
- With `xargs -P` you still risk getting mangled output, even with `grep --line-buffered`. See an example of mangling at https://www.gnu.org/software/parallel/parallel_alternatives.html#DIFFERENCES-BETWEEN-xargs-AND-GNU-Parallel – Ole Tange Mar 06 '18 at 07:28
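Given Ole Tange's last comment, a mangling-proof alternative would be to let GNU `parallel` do the batching itself with `-X` (its output grouping is on by default); a sketch I have not benchmarked at full scale:

```
# parallel -0 reads NUL-delimited names from find -print0; -X fills each
# grep command line with as many filenames as fit (like xargs), and
# parallel's default output grouping prevents interleaving between jobs:
$ find tex -maxdepth 1 -name '*.json' -print0 | \
    parallel -0 -X -j 8 grep -oHP 'pattern' | \
    sort -s -t: -k1,1 | cut -d: -f2- > /dev/null
```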