Using awk to split text file every 10,000 lines

Question

I have a large gzip'd text file. I'd like to do something like:

zcat BIGFILE.GZ | \
    awk (snag 10,000 lines and redirect to...) | \
    gzip -9 smallerPartFile.gz

For the awk part above, I basically want it to take 10,000 lines, send them to gzip, and repeat until all lines in the original input file are consumed. I found a script that claims to do this, but when I run it on my files, merge the split pieces, and diff the result against the original, lines are missing. So something is wrong with the awk part, and I'm not sure which part is broken.

The goal:

Read through the source file one time for the entire operation
Split the source into smaller parts, delimited by newline. Say, 10,000 lines per file
Compress the target files that are created as a result of the split action and do so without an extra step after this script processes.

Here's the code. Can someone tell me why this doesn't yield a file that can be split and merged and then diff'd to the original successfully?

# Generate files part0.dat.gz, part1.dat.gz, etc.
# restore with: zcat foo* | gzip -9 > restoredFoo.sql.gz (or something like that)
prefix="foo"
count=0
suffix=".sql"

lines=10000 # Split every 10000 line.

zcat /home/foo/foo.sql.gz |
while true; do
  partname=${prefix}${count}${suffix}

  # Use awk to read the required number of lines from the input stream.
  awk -v lines=${lines} 'NR <= lines {print} NR == lines {exit}' >${partname}

  if [[ -s ${partname} ]]; then
    # Compress this part file.
    gzip -9 ${partname}
    (( ++count ))
  else
    # Last file generated is empty, delete it.
    rm -f ${partname}
    break
  fi
done

score 5 · Accepted Answer · edited Jun 12 '20 at 13:48

I would suggest doing all the housekeeping inside awk; this works here with GNU awk:

BEGIN { file = "1" }
{ print | "gzip -9 > " file ".gz" }
NR % 10000 == 0 {
  close("gzip -9 > " file ".gz")
  file = file + 1
}

This will save 10000 lines to 1.gz, the next 10000 to 2.gz, etc. Use sprintf if you want more flexibility in filename generation.
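
As a sketch of the sprintf suggestion, the variant below zero-pads the part number so that a shell glob returns the parts in numeric order. The `part-%05d.gz` naming, the `lines` variable, the `cmd_for` helper, and the `seq` demo input are assumptions made for illustration; they are not part of the original answer.

```shell
# Hypothetical demo input: 25 numbered lines, split every 10 lines.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 25 > input.txt

awk -v lines=10 '
  # Build the gzip command for part n; zero-padding keeps glob order numeric.
  function cmd_for(n) { return sprintf("gzip -9 > part-%05d.gz", n) }
  BEGIN { n = 1; cmd = cmd_for(n) }
  { print | cmd }
  NR % lines == 0 {
    close(cmd)          # finish this gzip before starting the next one
    cmd = cmd_for(++n)
  }
' input.txt

# Merge the parts back (glob order) and compare with the original.
parts=$(ls part-*.gz | wc -l)
gzip -dc part-*.gz > restored.txt
if cmp -s input.txt restored.txt; then roundtrip=ok; else roundtrip=mismatch; fi
echo "$parts parts, round trip: $roundtrip"
```

Keeping the whole command string in one variable means the string passed to `close()` is always identical to the one used with `print |`, which is what awk needs in order to know which pipe to close.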

Updated with a test

Test data used are primes up to 300k, found here.

wc -lc primes; md5sum primes

Output:

25997 196958 primes
547d527ec50c2799fa6ce96dba3c26c0  primes

Now, if the awk program above was saved into split.awk and run like this (with GNU awk):

awk -f split.awk primes

Three files (1.gz, 2.gz and 3.gz) are produced. Testing these files:

for f in {1..3}; do gzip -dc $f.gz >> foo; done

Test:

diff primes foo

Output should be nothing if the files are the same.

And the same tests as above:

gzip -dc [1-3].gz | tee >(wc -lc) >(md5sum) > /dev/null

Output:

25997  196958
547d527ec50c2799fa6ce96dba3c26c0  -

This shows that the contents are the same and that the files are split as expected.
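
With more than a handful of parts, the `{1..3}` loop and the `[1-3].gz` glob no longer apply, and a plain `*.gz` glob sorts lexically (`10.gz` before `2.gz`). A minimal sketch of a numeric-order merge, assuming the parts are named `1.gz`, `2.gz`, and so on with no gaps, as split.awk produces them; the demo data and the output name `merged` are made up for illustration:

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical demo parts: 12 files of one line each, named 1.gz .. 12.gz.
i=1
while [ "$i" -le 12 ]; do
  echo "line $i" | gzip -9 > "$i.gz"
  i=$((i + 1))
done

# Concatenate in numeric order: keep going until the next number is missing.
i=1
: > merged
while [ -f "$i.gz" ]; do
  gzip -dc "$i.gz" >> merged
  i=$((i + 1))
done

merged_lines=$(wc -l < merged)
last_line=$(tail -n 1 merged)
echo "merged $((i - 1)) parts, $merged_lines lines, last: $last_line"
```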

edited Jun 12 '20 at 13:48 by Community · answered Oct 09 '12 at 21:04 by Thor

this seems to always trim off the last 34 characters of the last line for each split file. – Sneaky Wombat Oct 15 '12 at 19:33
Sounds odd. I've added an example to the answer, see if you get the same result. – Thor Oct 15 '12 at 22:57
Thanks Thor! I had to change the for loop because I had a lot of split files, but diff is telling me that the original and the split-then-merged file are the same. – Sneaky Wombat Oct 17 '12 at 20:14
@Thor: you are right in the unlikely case the limit is exceeded. But then you must close("gzip -9 > " file ".gz") not only close(file). Otherwise awk has no clue what to close. – sparkie Jan 17 '13 at 16:27
@sparkie: You're right close(file) was incorrect. The likelihood of this being a problem depends on file size, number of available file descriptors and how many lines go into each file; it's cleaner to close each file as we're done with it. – Thor Jan 17 '13 at 19:58

Scott - Слава Україні · Answer 2 · 2012-10-08T22:09:26.440

The short answer is that awk is reading its input (the pipe from zcat, in this case) a block at a time (where a block is 512 bytes, or a multiple thereof, depending on your OS). So, by the time it has the 10000th newline character (end-of-line marker) in memory, it also has the 10001st line, the 10002nd, and quite probably more (or possibly less) in memory, too. This is a problem because it means those characters have been read out of the pipe, and are no longer available for the next iteration of awk to read.
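
The effect can be reproduced on a small scale. In this sketch (the `seq` input and the variable name are only illustrative), the first awk is asked for three lines, and a second reader on the same pipe then counts whatever is left. With typical awk implementations the 100-line input fits in a single read, so little or nothing remains for the second reader; the exact amount lost on larger inputs depends on the awk implementation and the OS.

```shell
# First reader: awk prints 3 lines and exits, but it has already pulled
# a whole block (here, most likely all of the input) out of the pipe.
# Second reader: wc only sees what awk left unread.
leftover_lines=$(seq 1 100 | {
  awk 'NR <= 3 { print } NR == 3 { exit }' > /dev/null
  wc -l
})

echo "lines still in the pipe after awk exited: $leftover_lines"
echo "lines a line-exact reader would have left: 97"
```

This is the same situation as the `while true` loop in the question: each awk invocation takes more than its share, and the excess is gone by the time the next invocation starts.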

answered Oct 08 '12 at 21:47, edited Oct 08 '12 at 22:09 · Scott - Слава Україні

That makes sense. Hmm. Is there a way to capture those and buffer them and achieve the desired result? btw, this is Ubuntu 12.04 LTS. – Sneaky Wombat Oct 08 '12 at 22:02

score 3 · Answer 3 · answered Oct 08 '12 at 22:01

The shorter (and more useful) answer: have you looked at the Unix split command?

answered Oct 08 '12 at 22:01 · Scott - Слава Україні

yes I know about split and that isn't useful. A careful reader will notice that split generates uncompressed files. I need them compressed. – Sneaky Wombat Oct 08 '12 at 22:09
The problem with split is that compression can't start until split has finished writing every piece, so with a huge gzip'd file split may simply not be usable. – Valor Oct 08 '12 at 22:09
@Valor - exactly. I have 400 of these files, each is about 400GB uncompressed. :( – Sneaky Wombat Oct 08 '12 at 22:16
@Valor: Well, @Sneaky could run a shell script in parallel with the split, that waits until fooab is created, and then zips fooaa, then waits until fooac is created, and then zips fooab, and so on. But that’s a kluge, and not guaranteed to work. – Scott - Слава Україні Oct 08 '12 at 22:30
@Sneaky: I would quibble that the number of files isn’t a factor. Yes, the size obviously is. But if you have about 401GB free space, I don’t see why you couldn’t use split. … But wait — does that mean that you expect to break each file into millions of pieces (400G ÷ 80 = 5000000)? That might rule out split — I don’t know whether it can handle more than 676 (26²) output files. – Scott - Слава Україні Oct 08 '12 at 22:31
@SneakyWombat I think I've got it... give me a minute – Valor Oct 08 '12 at 22:41

score 3 · Answer 4 · answered Oct 08 '12 at 22:46

I thought about it and found a way, though it is not efficient at all: it needlessly decompresses the whole file to extract each piece, so splitting into 20 pieces means decompressing the big file 20 times. It never stores the whole uncompressed file, only the compressed pieces, so it is storage-efficient but CPU-inefficient.

Run the script with the big gzip file as the first argument and the number of lines per piece as the second.

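#!/bin/bash
# Usage: script BIG.gz LINES_PER_PIECE
GZIP_FILE=$1
SPLIT_LINES=$2
# One full decompression pass just to count the lines.
TOTAL_LINES=$(zcat "$GZIP_FILE" | wc -l)
# sed line numbers start at 1, and ranges must not overlap between pieces.
START=1
while [ "$START" -le "$TOTAL_LINES" ]; do
        END=$(( START + SPLIT_LINES - 1 ))
        echo .
        # Extract lines START..END and compress them into their own piece.
        zcat "$GZIP_FILE" | sed -n "${START},${END}p" | gzip -9 > "${GZIP_FILE}.lines-${START}-${END}.gz"
        START=$(( END + 1 ))
done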

This creates one file per piece in the same directory, named after the gzip file with ".lines-$startline-$endline.gz" appended.

Hope you are ok wasting CPU :)

you guys are hilarious. I think I'll try to script something through python or something. The idea was to read through the source file one time, splitting it as it's read. The TOTAL_LINES var will read through the entire file to get the count and then it goes through the business end of the work. Haha. I'll give you an up vote for the effort. – Sneaky Wombat Oct 09 '12 at 14:51

score 1 · Answer 5 · answered May 06 '13 at 13:37

You have an awk alternative. Here is how you could do it with GNU split or GNU parallel.

GNU split has a --filter option and something very close to what you are trying to do is described in the manual:

`--filter=COMMAND'
     With this option, rather than simply writing to each output file,
     write through a pipe to the specified shell COMMAND for each
     output file.  COMMAND should use the $FILE environment variable,
     which is set to a different output file name for each invocation
     of the command.  For example, imagine that you have a 1TiB
     compressed file that, if uncompressed, would be too large to
     reside on disk, yet you must split it into individually-compressed
     pieces of a more manageable size.  To do that, you might run this
     command:

          xz -dc BIG.xz | split -b200G --filter='xz > $FILE.xz' - big-

     Assuming a 10:1 compression ratio, that would create about fifty
     20GiB files with names `big-xaa.xz', `big-xab.xz', `big-xac.xz',
     etc.

So in your case you could do:

zcat bigfile.gz | split -l 10000 --filter='gzip -9 > $FILE.gz' - big-
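
A small-scale sketch of the same command, useful for checking the round trip before running it on real data. The `-d` and `-a 4` options (numeric suffixes, four digits wide, which allows far more than 676 output names) are an addition here, as are the `seq` demo input and the 10-line piece size; `--filter` requires GNU split.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical demo input: 25 lines, compressed like the real source file.
seq 1 25 | gzip -9 > bigfile.gz

# 10 lines per piece; each piece is piped through gzip as it is produced.
# -d -a 4 yields big-0000.gz, big-0001.gz, ... instead of big-aa.gz.
gzip -dc bigfile.gz | split -l 10 -d -a 4 --filter='gzip -9 > $FILE.gz' - big-

# Decompress the pieces in glob order and compare checksums with the original.
part_count=$(ls big-*.gz | wc -l)
original_sum=$(gzip -dc bigfile.gz | cksum)
restored_sum=$(gzip -dc big-*.gz | cksum)
echo "pieces: $part_count"
echo "original: $original_sum"
echo "restored: $restored_sum"
```

The single quotes around the filter matter: `$FILE` has to be expanded by the shell that split starts for each piece, not by the calling shell.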

A good alternative to split would be to use GNU parallel, this would allow you to parallelize the compression:

zcat bigfile.gz | parallel --pipe -N 10000 'gzip > {#}.gz'
