
I am dealing with post-processing of a large number of CSV data files located in distinct directories. Each CSV file has the following 3-column format:

ID, POP, dG
1, 24, -6.8100
2, 22, -6.7900
3, 11, -6.6800
4, 18, -6.1100
5, 5, -6.0700
6, 1, -6.0600
7, 11, -6.0300
8, 36, -6.0100

The following bash function incorporates awk code that computes the minimum of dG (the 3rd column, always a negative float) as well as the maximum of POP (the 2nd column, always positive) across ALL processed CSVs at once, and stores them in the new bash variables highestPOP and lowestDG, which are used by a second awk script (not considered here):

home="$PWD"
# folder with the outputs
rescore="${home}"/rescore 
# folder with the folders to analyse
storage="${home}"/results_bench
cd "${storage}"
# pattern of the csv file located inside each sub-directory of "${storage}"
str='*str1.csv'
rescore_data4 () {
str_name=$(basename "${str}" .csv)
mkdir -p "${rescore}"/"${str_name}"
# 1- calculate max POP and dGmin for ALL rescored CSVs at once
read highestPOP lowestDG < <(
    awk -F ', ' '
        FNR == 1 {
            next
            }
        NR == 2 || $2 > popMAX {popMAX = $2}
        NR == 2 || $3 < dGmin  {dGmin  = $3}
        END {printf "%d %.2f\n", popMAX, dGmin}
    ' "${storage}"/*_*_*/${str}
)
#
# 2- run rescoring routine using the min/max values
awk -F', *' -v OFS=', ' -v highest_POP="${highestPOP}" -v lowest_dG="${lowestDG}" '
   ... some awk code
'
}

In the first awk script, $str is the glob mask of the target CSV file located in the distinct directories (which match the glob pattern *_*_*). While this generally works, there is a bug in the first awk code (used to calculate the min/max values for ALL processed CSVs): sometimes the value of lowestDG cannot be computed when there is a large number of input CSVs or they contain many lines. The problem is always related to the calculation of the dG variable (which is always negative): the script reports dG=0.000, which is not correct.

To resolve the issue, I tried to modify the awk code, defining the two variables (for the min and max values) at the beginning and then comparing each value in the column against them:

   read highestPOP lowestDG < <(
    awk -F ', ' '
        FNR == 1 {
            dGmin = ""                              # initialize the min value
            POPmax = ""   
            next
            }
        NR == 2 || POPmax == "" || $2 > POPmax {POPmax = $2 }
        NR == 2 || dGmin == "" || $3 < dGmin  {dGmin  = $3 }
        END {printf "%d %.2f\n", POPmax, dGmin}
    ' "${storage}"/*_*_*/${str}
)

Now it technically works, but it seems that the second solution does not report the min and max values correctly. How can the awk script be fixed correctly?

Hot JAMS

1 Answer


If you want to use awk to calculate max/min across a range of files, simply provide those files on the command line as input to an awk script

awk -F, '
    ($2+0 > POP+0 || POP == "") && $1+0 > 0 { POP = $2 }
    ($3+0 < dG+0 || dG == "") && $1+0 > 0 { dG = $3 }
    END { print POP, dG }
' file1 file2 file3...

(This can also be written as a one-liner simply by concatenating all the lines, but it's way less readable.)
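For reference, here is the same script collapsed onto one line (nothing new, just the three pattern-action pairs above concatenated):

awk -F, '($2+0>POP+0||POP=="")&&$1+0>0{POP=$2} ($3+0<dG+0||dG=="")&&$1+0>0{dG=$3} END{print POP, dG}' file1 file2 file3...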

Let's break down one line. The pattern style is expression { action }, and either part is optional. The expression here is looking for a larger value for POP on any line where the ID is numeric and non-zero

($2+0 > POP+0 || POP == "") && $1+0 > 0 { POP = $2 }

$2+0 > POP+0    # Is the numeric value of $2 more than the numeric value of POP
||              # OR
POP == ""       # Is POP the empty string (possibly unset)

If at least one of these is true then we also require the next condition

$1+0 > 0        # Is the numeric value of $1 greater than zero ("skip the header")

Then...

{ POP = $2 }    # Assign the value of $2 to POP (kept as a string)

The loop is then repeated for every line in every file. At the end of the last file the END construct is executed, which prints out the resulting two values.

Notice that only during comparisons are the values in the awk loop converted to numbers. At all other times they are just strings, so there is no loss of precision.
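A small sketch to illustrate that point, using a made-up value with more digits than the question's data:

printf '1,5,-6.81001234567890123\n' |
awk -F, '
    ($3+0 < dG+0 || dG == "") && $1+0 > 0 { dG = $3 }
    END { print dG }
'
# prints -6.81001234567890123, i.e. the original text with every digit intact;
# an explicit numeric assignment (dG = $3 + 0) would instead print -6.81001 via awk's default OFMT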

With bash you can assign variables to these outputs easily enough, allowing the unwanted whitespace to be discarded as a side-effect

read pop dg < <(awk ...)
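For instance, run against the sample data from the question (saved here as sample.csv, a name used purely for illustration):

read pop dg < <(
    awk -F, '
        ($2+0 > POP+0 || POP == "") && $1+0 > 0 { POP = $2 }
        ($3+0 < dG+0 || dG == "") && $1+0 > 0 { dG = $3 }
        END { print POP, dG }
    ' sample.csv
)
echo "$pop $dg"    # 36 -6.8100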

For a very large number of files, such that the expanded glob exceeds the argument-list limit ("Argument list too long"), the standard find approach should suffice, feeding the contents of the files into awk as STDIN instead of listing them on the command line

find "${storage}" -type f -name 'target_file.csv' -exec cat {} + | awk '...'
Chris Davies
  • thank you very much for this as well as these explanations! Briefly, it works like the first version of the awk script shown in my question, but does the job correctly without any errors! I suppose the fix was achieved via adding zero to each term, wasn't it? Cheers – Hot JAMS May 18 '21 at 13:28
  • @HotJAMS your version reads each value as a number. Mine reads them as strings, so no loss of precision – Chris Davies May 18 '21 at 15:03
  • Right, thank you so much again! – Hot JAMS May 19 '21 at 07:50
  • I've just finished a second round of tests using 18000 input csv files :-) While it worked well with 3000 csvs, with a huge number of files there is still an error in the calculation of the min/max values for both columns... could you propose some fix? :-) – Hot JAMS May 25 '21 at 09:21
  • line 275: /usr/local/bin/awk: Argument list too long – Hot JAMS May 25 '21 at 09:24
  • so it seems that it does not accept all CSV files at once via awk -F, ' ' "${storage}"/*_*_*/target_file.csv, assuming that I am dealing with 18000 directories now (the same with 4000 directories worked OK!) – Hot JAMS May 25 '21 at 09:26
  • line 275 is directly related to your AWK code (integrated into my bash workflow), possibly the last part where I load all CSVs at once: "${storage}"/*_*_*/target_file.csv – Hot JAMS May 25 '21 at 09:28
  • Answer updated with a standard solution for you – Chris Davies May 25 '21 at 09:33
  • for 2000 it works, but for 18000 files it gives: line 275: /usr/bin/find: Argument list too long. :-) – Hot JAMS May 25 '21 at 09:48
  • for the test I reduced the number of CSVs to something around 4000 and your last solution works... can't understand why this happens... – Hot JAMS May 25 '21 at 10:07
  • The last solution would work with millions of files – Chris Davies May 25 '21 at 11:41
  • ah, I just fixed a small bug in my use of the find. Does it mean that find "${storage}" -type f -name keyword.csv -exec cat {} + | awk -F, ' ... ' will apply the awk code to any file located in any sub-directory of "${storage}" (this was actually the issue, since each csv is located within a sub-directory!) which matches the glob pattern keyword.csv? – Hot JAMS May 25 '21 at 12:35
  • There's no bug in my find. Quote your pattern so that the shell doesn't attempt to expand it to match files in your current directory. – Chris Davies May 25 '21 at 12:38
  • yes, sorry, it was me who introduced the bug in the last solution. BTW does it mean that with the properly quoted variables the command should be: find "${storage}" -type f -name "${str}" -exec cat {} + | awk -F, '...' where str='*keyword.csv'? – Hot JAMS May 25 '21 at 12:45
  • That would work. (You don't need the curly brackets for "$str") – Chris Davies May 25 '21 at 13:59