
I have a very large text file, nearly 100 GB, whose contents look like a CSV file, as follows:

0.00234567, 3.45e-2 0x6
0.00234789, 1.23e-2 0x6
0.00234967, 2.13e-3 0x7
    ...many lines...
0.00567323, 4.12e-1 0x6
    ...many lines...

I want to extract the lines that match some simple conditions and export them to another text or CSV file.

My problem is that the size of the input file far exceeds my computer's RAM (16 GB), so I cannot Import the whole file in a notebook.

Is there any way to circumvent this problem?

For example: read a portion of the input file from the beginning, apply the selection, export the selected lines, and repeat this sequence until all the lines of the input file have been read.

Any advice will be appreciated. Thank you in advance.

Taiki Bessho

1 Answer


As suggested in the comments, ReadLine or ReadList can be used to read the file line by line or in chunks of lines. The lines can then be processed one at a time and collected with Sow and Reap:

stream = OpenRead[filename];

list = Join @@ Reap[
     While[(line = ReadLine[stream]) =!= EndOfFile,
      Sow /@ (StringCases[line,
          (* the string pattern to match your data structure: *)
          a : NumberString ~~ ", " ~~ b : NumberString ~~ "e" ~~
            c : NumberString ~~ " 0x" ~~ d : NumberString :>
           {a, b*10^c, d}]
         (* convert strings to expressions (numbers) - skip if you do not need it: *)
         /. s_String :> ToExpression[s])
      ]
     ][[2]];

Close[stream];

Reap returns a list of the form {result, {sown values}}, so the Sown expressions are available afterwards via Join @@ Reap[...][[2]]. Note that the replacement rule must be applied inside the parentheses, before Sow, so that numbers rather than strings are collected. Remember to Close the stream when you are done.

This code reads one line at a time. For better performance it may be advisable to use ReadList to read chunks of, say, 1000 lines at a time.
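A chunked variant might look like the sketch below. It reads 1000 lines per ReadList call (ReadList returns {} at the end of the file) and writes the selected lines straight to an output file, so memory use stays bounded even if many lines match. The output filename and the selection condition (first column below 0.003) are placeholders for your actual criteria:

    stream = OpenRead[filename];
    out = OpenWrite["selected.csv"];

    While[(chunk = ReadList[stream, "String", 1000]) =!= {},
     (* keep lines whose first column is below a threshold - adapt to your conditions *)
     selected = Select[chunk, ToExpression[First@StringSplit[#, ","]] < 0.003 &];
     If[selected =!= {},
      WriteString[out, StringRiffle[selected, "\n"] <> "\n"]]
     ];

    Close[stream];
    Close[out];

Since the selected lines are written out incrementally instead of being accumulated with Sow/Reap, this version also avoids holding the full result in RAM, which matters if the matching subset is itself large.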

Theo Tiger