4

There are several questions on this site about implementing fast CSV/TSV export for machine numbers without loss of precision. Recently I discovered that Export merely applies ToString[#, InputForm] & to every number in the table for conversion it into a string:

Trace[ExportString[{-7756.0224337393065}, "TSV"], ToString] // Flatten
{ToString[-7756.0224337393065, InputForm], "-7756.0224337393065"}

Unfortunately there seems to be no built-in alternative to InputForm in Mathematica for obtaining the correct digits of machine numbers, even RealDigits isn't precise:

x = 44802.395880518656;
ToExpression@ToString[x, InputForm] - x
FromDigits@RealDigits[x] - x    
0.

-7.27596*10^-12

InputForm works correctly but very slow. From the other hand, it shouldn't be too difficult to re-implement the algorithm behind this conversion as a compilable function in order to get huge speed-up.

Does anybody know what the algorithm is? Probably it isn't proprietary and can be found in the literature.

Alexey Popkov
  • 61,809
  • 7
  • 149
  • 368

1 Answers1

3

This section of my answer previously showed that Export[fname, data, "TSV"] is incredibly slow (~500 seconds for exporting 10^5 * 10 reals), but this is easily worked around by WriteString[fname, ExportString[data, "TSV"]].

All of this is less relevant for CSV export, as I demonstrated below and the foolishly missed while going on to ramble about TSV export. But even for CSV a small speedup is possible with this workaround.


I would assume that what you mean by "the algorithm behind InputForm" is defined by low-level code, something like

#include <stdio.h>
#include <float.h>

and

printf("%.*f", DBL_DIG, x);

Where the implementation of that is defined, I do not know, but I would expect that to follow the IEEE-754 standard (Wiki, Official paywall).

Any (finite) binary float can be represented by a terminating sequence of decimal digits, of approximately 52 decimal digits of precision, but (provably) 17 digits are always enough. However...

N @ MachinePrecision
(* 15.9546 *)

1 + $MachineEpsilon
(* 1. *)

InputForm @ %
(* 1.0000000000000002 *)

SetPrecision[%, 16]
(* 1.000000000000000 *)

The second paragraph of section 5.12 of 2008 IEEE-754 clearly states

Implementations shall provide conversions between each supported binary format and external decimal character sequences such that, under roundTiesToEven, conversion from the supported format to external decimal character sequence and back recovers the original floating-point representation, except that a signaling NaN might be converted to a quiet NaN. See 5.12.1 and 5.12.2 for details.

  • InputForm is compliant with this, though I would expect 44802.39588051865 to be represented as 44802.395880518648 and not to remain as ...865.

  • RealDigits is not a true conversion to an "external decimal representation", though one could expect it to behave as such. It discards the digit of least precision. Annoyingly, FromDigits converts back to a rational number.

  • Internal`DoubleToString is quite fast (but not orders of magnitude faster) and it suffers from the same problem as RealDigits -- it discards the ULP. I'd say, this is blatant non-compliance with the standard. I can no longer reproduce this on my machine. Perhaps DoubleToString is compliant after all.

  • ToString@SetPrecision[x, 17] should give always correct results and is not too bad in terms of speed. In fact, you don't even need ToString here.

But

Export["test.csv",
  SetPrecision[RandomReal[{0, 1000}, {10^5, 10}], 17], "CSV"] // AbsoluteTiming
(* {20.1076, "test.csv"} *)
Export["test.csv", RandomReal[{0, 1000}, {10^5, 10}], "CSV"] // AbsoluteTiming
(* {15.307, "test.csv"} *)
With[{arr = RandomReal[{0, 1000}, {10^5, 10}]}, 
  Export["test.csv", arr, "CSV"] // AbsoluteTiming]
(* {15.2228, "test.csv"} *)
With[{arr = SetPrecision[RandomReal[{0, 1000}, {10^5, 10}], 17]}, 
  Export["test.csv", arr, "CSV"] // AbsoluteTiming]
(* {19.3445, "test.csv"} *)
RandomReal[{0, 1000}, {10^5, 10}] // AbsoluteTiming // First
With[{arr = RandomReal[{0, 1000}, {10^5, 10}]}, 
  SetPrecision[arr, 17] // AbsoluteTiming // First]
(* 0.026734 *)
(* 0.840021 *)

Basically, generating 10^6 numbers is fast (its timing is negligible), converting them to 17 digits of precision is also very fast, and frankly, I don't think that the unpacked array causes the slowdown in the CSV export. What's more likely, is that there is simply more bytes to write.

Bottom line: Export is simply that slow.

Here's a not-very-thoroughly-though-out alternative which is already quite a bit faster:

With[{rand = RandomReal[{0, 1000}, {10^5, 10}]},
 Block[{arr = SetPrecision[rand, 17], str},
   str = Map[ToString, arr, {-1}];
   Export["test.dat", StringRiffle[str], "String"]] // AbsoluteTiming]
(* {7.75655, "test.dat"} *)

This can all be much faster if you're prepared to downgrade from 17 to 16 digits of precision - the SetPrecision and ToString can be replaced by a single Internal`DoubleToString.

You may likely be interested in studying Put, PutAppend (which, by the way, generate output in InputForm) as well as Write and related functions. They'll likely be somewhat complicated to get to work properly, but can offer large speedups.


LibraryLink solution

This is a work-in-progress which I'll fine-tune a bit over the next few days. In principle, Henrik Schumacher has already made a solution here, but OP apparently had some troubles getting it to work, so I'll show a bare-bones example to get started.

I'm assuming, that a C-compiler is set up and <<CCompilerDriver` has been executed. Start by creating a directory and creating a file, ExportTable.c in it with the following code:

 #include "WolframLibrary.h"
 #include <float.h>
 #include <stdio.h>

 DLLEXPORT int fastExportCSV(WolframLibraryData libData, mint Argc, MArgument *Args, MArgument Res) {

        double* data;
        char* fname = MArgument_getUTF8String(Args[0]);
        mint i = 0;
        mint j = 0;
//        mint len;  // correct handling of array dimensions still to be implemented

        MTensor in = MArgument_getMTensor(Args[1]);
//      len = libData->MTensor_getFlattenedLength(in);
        data = libData->MTensor_getRealData(in);

        FILE * ofile;
        ofile = fopen(fname, "w");

        for(i = 0; i<100000; i++) {
            for(j = 0; j<10; j++) {
                fprintf(ofile, "%.*f,", DBL_DIG, data[10*i + j]);
            }
            fprintf(ofile, "\n");
        }
        fclose(ofile);

        MArgument_setUTF8String(Res,fname);
        return LIBRARY_NO_ERROR;
}

Now go over to Mathematica to set everything up there:

lib = CreateLibrary[{"ExportTable.c"}, "ExportTable"]
fastExportCSV = LibraryFunctionLoad[
    lib, 
    "fastExportCSV",
    {"UTF8String", {Real, 2, "Constant"}}, 
    "UTF8String"
    ]

Because the dimensions of the array are hard-coded right now, you have to feed a 2-dimensional Real array with at least 10^6 elements and the program will write them as a 10^5*10 CSV table. Like so:

fastExportCSV["D:\\Larkin\\Dev\\LLTutorial\\test2.dat", 
  RandomReal[{-100, 100}, {10^5, 10}]] // AbsoluteTiming
(* {0.958084, "D:\\Larkin\\Dev\\LLTutorial\\test2.dat"} *)

Of course, this is duplication of the linked answer, but in simpler, purely C code.

LLlAMnYP
  • 11,486
  • 26
  • 65
  • In the linked question I've already found a faster alternative to Internal`DoubleToString: it is DecimalForm combined with SetPrecision. The problem is that it doesn't allow to recover the original decimal representation of the numbers exactly (as well as Internal`DoubleToString). – Alexey Popkov May 17 '18 at 08:02
  • @AlexeyPopkov That faster alternative is quite a fine example of what in Russia we call костыли. Anyway, your words "the original decimal representation" suggest some confusion about they way, numbers are handled, at least on an intuitive level. In the case of your linked question, the numbers cannot and will not be "exactly" represented (your complaint about the changing last digit). If you want an exact decimal representation, you should input numbers with arbitrary precision, e.g. -56705.907687414737`17. (cont'd) – LLlAMnYP May 17 '18 at 08:14
  • Note, then, that x = -56705.907687414737; y = -56705.907687414737`17; ExportString[{x,y},"TSV"] will return -56705.907687414736 -56705.907687414737. Besides that, you say that DecimalForm[-38741.90768741473, Infinity] changes the last digit to 2. I, unfortunately, don't have v11.3 to test with, but both InputForm and DoubleToString show me the exact input, regardless if the last digit is 2 or 3. This seems to be a bug of DecimalForm – LLlAMnYP May 17 '18 at 08:20
  • @AlexeyPopkov anyway, it would be helpful if you specify your problem more completely (including the background). Why exactly do you care specifically about the decimal representation of your numbers when they, likely have been obtained through calculations with binary double-precision floats? If that is not the case, then calculations ought to have been done in arbitrary precision, in which case the numbers should be stored as such. In this case, their InputForm and/or ToString, etc conserves their decimal representation to the arbitrary precision that was given. – LLlAMnYP May 17 '18 at 08:24
  • The numbers are generated by another application and then Imported by Mathematica. Export allows to generate exactly the same file, but very slow. My goal is to add two columns into that TSV file. – Alexey Popkov May 17 '18 at 08:27
  • I see. I'll get to work and have a look. Do you need to do manipulations with these numbers, or just append to a table? – LLlAMnYP May 17 '18 at 08:30
  • These two columns are generated from numbers in other columns, but original contents of the file shouldn't be changed. – Alexey Popkov May 17 '18 at 08:33
  • @AlexeyPopkov if Export suits your needs, I just found that str = ExportString[data, "TSV"]; then WriteString[filename, str] is many-many times faster than direct export. – LLlAMnYP May 17 '18 at 09:07
  • @AlexeyPopkov please see the new code at the top of my answer. – LLlAMnYP May 17 '18 at 09:39
  • Which Mathematica version do you use? On my machive with version 11.3 ExportString is 38% faster than Export for "Table" format and has no advantages for "CSV" format. – Alexey Popkov May 17 '18 at 10:08
  • @AlexeyPopkov v10.2 – LLlAMnYP May 17 '18 at 10:19
  • @AlexeyPopkov I have tried with v11.1.1 and it is showing the same inferior performance for Export and "TSV" as compared to ExportString. – LLlAMnYP May 17 '18 at 13:28
  • AFAIK the "CSV" export was improved in version 11.2. Probably it affects "TSV" too. – Alexey Popkov May 17 '18 at 13:31
  • @AlexeyPopkov so what are your timings like then? If ~10 seconds for a million Real numbers are too much for you, the next step would be to offload the work to some LibraryLink functionality; C code can read and write files very quickly. But it gets much harder when the records of the TSV file are not just rows of numbers (i.e. the output to the file isn't simply a packed array). – LLlAMnYP May 17 '18 at 13:44
  • A LibraryLink solution would be nice, but I think that even Compile is sufficient for obtaining good performance. My file contains strings only on the first row (see dat at the top of the linked question). – Alexey Popkov May 18 '18 at 08:45
  • @AlexeyPopkov I've just realized, that I've been rambling about TSV export (replaced that section with a tl;dr) while CSV export was from the start quite reasonable, ~20 seconds with Export, can be accelerated to ~8 with some hacks. But I don't think that compile will solve your problems. Reimplementing the IEEE standard even in a compile will almost certainly be not faster than an InputForm call. As it seems, MMA is quite slow at writing to files: 18MB string with WriteString still takes over 2 seconds. My idea is to pass the array of numbers to C code and have the export done (contd) – LLlAMnYP May 18 '18 at 09:27
  • entirely within the C function. – LLlAMnYP May 18 '18 at 09:27
  • @AlexeyPopkov And anyway, a Compile based solution is impossible, since Compile cannot return strings or even characters, so you'll be back to square one. – LLlAMnYP May 18 '18 at 09:31
  • @AlexeyPopkov the LibraryLink functionality has already been done by Henrik Schumacher in his answer, as it appears. Is there still some deficiency? – LLlAMnYP May 18 '18 at 09:56