
I am in the early stages of trying to replace an existing document generation system based on Microsoft Office apps (Word and Excel).

The system runs once a year to generate over 300,000 individual documents of three different types (the new system may instead generate individual documents on demand rather than batch processing them all at once). Two of the types are mostly standard text with value placeholders and some conditional logic for including or excluding certain elements (standard paragraphs and/or values). The third type also includes a bar chart based on per-document calculations. All of the values used for replacement and conditional logic are sourced from an ASCII file, although some of them are calculated from that data.

The current system is very slow and error-prone (at run time), and it requires a complex system of machines, threads and message queues to scale the processing resources to a level that will get the job done within two weeks or so. Basically, there are three Word document templates that contain the value placeholders, the conditional logic and the text. The templates are processed with the Office interop libraries to create an instance document. For one of the types, Excel is used to create a bar chart that is injected (OLE embedded) into the instance document. The Word instance documents are then converted to (saved as) PDF.
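To make the placeholder and conditional-inclusion requirement concrete, here is a minimal sketch of how one record could be turned into TeX source before compilation. It is only an illustration under stated assumptions: the field names (`name`, `amount`), the `THRESHOLD` value and the `render` helper are hypothetical, Python is used purely for brevity, and escaping of TeX special characters in the data is not handled.

```python
from string import Template

# Hypothetical TeX template; the placeholder names are assumptions.
TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
Dear ${name},

Your recorded amount is ${amount}.

${notice}
\end{document}
""")

# Hypothetical fixed amount used by the conditional paragraph.
THRESHOLD = 1000.0


def render(record: dict) -> str:
    """Return TeX source for one record.

    The notice paragraph is included only when the amount exceeds THRESHOLD.
    """
    amount = float(record["amount"])
    notice = "This amount exceeds the threshold." if amount > THRESHOLD else ""
    return TEMPLATE.substitute(
        name=record["name"],
        amount=f"{amount:.2f}",
        notice=notice,
    )


# Example record, as it might be parsed from one line of the ASCII file.
tex_source = render({"name": "A. Example", "amount": "1250.5"})
```

The resulting string would be written to a `.tex` file and compiled to PDF; the same substitution step would be equally easy to write in .NET or Java.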

I only know a little about TeX (I brushed up against it during my many years of Emacs use), but it seems that it might serve as a good basis for replacing the behemoth described above. The problem is that I need some guidance as to whether or not TeX would be a good route to pursue (performance being a key factor), and some pointers to resources that can accomplish the more obscure tasks I need (I know PDF generation is no issue).

The final system would execute on Windows machine(s), and programmatic processing would be done with .NET or Java most likely.

Werner
  • Review this similar question: http://tex.stackexchange.com/questions/3697/is-it-possible-to-connect-a-database-to-latex-to-produce-data-driven-documents – R. Schumacher Jul 31 '12 at 20:27
  • Is it a single very large ASCII file for all the data? How big? How many pages will your typical final text document (types 1 and 2 if I understand correctly) contain? What kind of computations are needed to produce the extra data? Lots of floating point computations? How complicated are the bar charts? Only one per type 3 document? Lots of data points to produce each chart? – Bruno Le Floch Jul 31 '12 at 21:08
  • There are 3 data files, one for each of the 3 types. – Edwin McConnel Jul 31 '12 at 21:16
  • Each file ranges from 12 to 100 MB and from 35k to more than 200k records, respectively. The computations are generally simple date threshold checks (e.g., is a date before or after another fixed date) and similar financial checks (is an amount greater or less than some fixed amount). The bar chart is fairly simple, with up to three composite bars (i.e., each bar totals up to three currency amounts, with each portion sectioned and colored differently). There is also a threshold line on the y-axis that is calculated per record. – Edwin McConnel Jul 31 '12 at 21:28
  • What is the definition of done, i.e., when would you accept that a TeX solution is better than your current system? Does "performance being a key factor" mean that you want it faster? 300000 docs / 14 days / 24 h / 60 min is about 15 docs/min, if I am not mistaken. Depending on the size of your documents (or the number of data points for bar plots), this might be a challenge for TeX as well, at least assuming that the potential for parallelism in a TeX-based solution is the same as for your current one(?). – Christian Feuersänger Jul 31 '12 at 21:28
  • Each document is no more than 4 pages in length (100 or 450 KB in size; the larger ones include the bar chart, of course). – Edwin McConnel Jul 31 '12 at 21:34
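
The throughput estimate from the comments (300,000 documents in a two-week window) can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in this thread; the variable names are illustrative, and it assumes the machines run around the clock for the full 14 days.

```python
# Figures quoted in the thread: 300,000 documents within roughly 14 days.
DOCS = 300_000
DAYS = 14

# Assumption: processing runs 24 hours a day for the whole window.
total_minutes = DAYS * 24 * 60

# Required sustained rate across all workers combined.
docs_per_minute = DOCS / total_minutes

# Equivalent time budget per document if only one worker were used.
seconds_per_doc = 60 / docs_per_minute

print(f"{docs_per_minute:.2f} docs/min, {seconds_per_doc:.2f} s per document")
```

With N documents processed in parallel, the per-document time budget scales to roughly N times `seconds_per_doc`.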