21

I need to write some data from a computation, that will be read later by Paraview (.vtu or vtk file).

When it comes to file size, should I go for the ASCII format or the binary format?

SAAD
  • 636
  • 2
  • 7
  • 20

3 Answers

24

If your only worry is file size, then you want binary files. For an illustrative example, let's assume you are writing one double-precision floating-point number to a file, and that the file system handles it perfectly: headers, padding, and all other overhead are zero.

In a binary file, that number takes exactly the space it occupies in RAM: 8 bytes.

In ASCII format, it would require:

  • 16 digits for the significand
  • 1 period for the decimal
  • 1 char to delimit the exponent
  • 1 char for the sign of the exponent
  • 2-3 char for the exponent

Assuming one byte per character, that is 22 bytes to hold the same number, and it doesn't count the characters required to delimit between numbers (usually at least one). The ASCII file will therefore be roughly three times larger.

You can trade file size against precision in the stored files (e.g. keep only 5-6 digits of the significand), but whether that is acceptable depends on what you are using the data for. The main advantage of ASCII is debugging and producing human-readable data.
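To make the size difference concrete, here is a minimal Python sketch of the comparison above (it assumes one-byte characters and the 16-significant-digit scientific format described in the list):

```python
import struct

values = [0.1 * i for i in range(1000)]  # 1000 double-precision numbers

# Binary: each IEEE 754 double occupies exactly 8 bytes
binary = struct.pack(f"{len(values)}d", *values)

# ASCII: 16 significant digits in scientific notation, one number per line
ascii_text = "\n".join(f"{v:.15e}" for v in values).encode("ascii")

print(len(binary))      # 8000 bytes
print(len(ascii_text))  # 21999 bytes, roughly 2.75x larger
```

With signs, delimiters, and XML markup (as in a .vtu file), the ASCII version gets closer to the 3x estimate.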

Godric Seer
  • 4,637
  • 2
  • 23
  • 38
  • 3
    Also important in the scientific arena is long-term archiving and reliable sharing, which is why, despite its inefficiencies, ASCII CSV is so prevalent and recommended (PDF). – horchler Sep 03 '13 at 21:46
  • That is true, and something I didn't consider, but since the question is asking about file size, I believe my answer still stands. – Godric Seer Sep 03 '13 at 21:48
  • 4
    Another useful point is that although ASCII CSV encoding isn't very efficient, using a file compression utility (like zip, gzip, etc.) on your ascii file will typically bring the file size down to something similar to the size of a binary file. – Brian Borchers Sep 04 '13 at 01:30
  • 3
    Be careful because some input/output libraries aren't careful enough to get bit for bit reproducibility as you output IEEE Double Precision numbers in ASCII and then read them back in. In my experience, using 17 or 18 decimal digits is sometimes necessary for safety. – Brian Borchers Sep 04 '13 at 02:01
  • 5
    Concerning horchler's comment: I'm sure well-used, standardised open binary formats such as HDF5 will be around for a long time. That's what I'd personally recommend. – AlexE Sep 05 '13 at 09:13
  • 1
  • I stick to binary whenever possible, for accuracy, compactness, peace of mind, and (especially) speed. Then if I need further compactness, I can zip it. If I need to be able to visually read the contents, I can write a little program for that. On the other hand, if it's more important to be visual, and easily passed around to random programs like Excel, R, etc. then CSV is the way to go.
  • – Mike Dunlavey Sep 09 '13 at 00:34
  • Why does anyone who recommends CSV not get howled down?! Use TSV. It's just as well supported, better defined, simpler, more compact, and equally readable. – podperson May 29 '20 at 17:28
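Brian Borchers' caveat about bit-for-bit reproducibility in the comments above is easy to demonstrate: 17 significant decimal digits are enough to round-trip any IEEE 754 double, while fewer digits can lose the last bit. A small Python check:

```python
x = 0.1 + 0.2  # actually 0.30000000000000004 in double precision

# 14 digits after the point = 15 significant digits:
# the text rounds to 0.3 and the round trip loses the last bit
assert float(f"{x:.14e}") != x

# 16 digits after the point = 17 significant digits:
# guaranteed to round-trip any IEEE 754 double exactly
assert float(f"{x:.16e}") == x
```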