
I have a *.dat file in which the contents are basically one giant list:

{1,2,3,4} {5,6,7,8} {9,10,11,12}

and so on. I'd like to read in the first 500 members, { } ... { }, to do some computations. The problem is that the file is about 1.5 GB. How does one deal with large files of this type in Mathematica?

István Zachar
Luap Nalehw
  • I'm interested in this too. When I've been dealing with plain text files, I often use the split command in bash. i.e. split -l 500 data. Here's the man page. – canadian_scholar Nov 15 '12 at 14:59
  • Actually, this SO discussion "Way to deal with large data files in Wolfram Mathematica" looks pretty key for this. http://stackoverflow.com/questions/2370570/way-to-deal-with-large-data-files-in-wolfram-mathematica – canadian_scholar Nov 15 '12 at 14:59
  • split looks useful. But my first 500 sequences are not necessarily the first 500 lines. I have no idea what lines they fall on, so I would need a way of splitting off the first 500 { ... }'s, which doesn't look like an option – Luap Nalehw Nov 15 '12 at 15:03
  • cat temp.dat | tr " " "\n" | split -l 500 would be one of many Unix commands. sed would also be a good bet. – cormullion Nov 15 '12 at 15:23
  • thanks cormullion. I was really looking for a mathematica solution. – Luap Nalehw Nov 15 '12 at 15:35
  • Unfortunately, at the moment Mathematica is only good at handling data in memory, which is a problem for big files. I know that they are working to change that, but maybe only in Mathematica 10! So, for now I use bash too. – Murta Nov 15 '12 at 15:45
  • this answer may be relevant. – Silvia Nov 15 '12 at 16:50
  • I too think this is a duplicate of the question that silvia linked to – rm -rf Nov 15 '12 at 16:55
  • @rm-rf I think Leonid's answer to that question doesn't quite explain how to go from one gigantic file to his system. Perhaps all that is needed is that tutorial? – tkott Nov 15 '12 at 17:09
  • There was a talk about this in the 2011 conference: "BigData: Demystifying Large Datasets in Mathematica with Nick Lariviere" http://www.wolfram.com/events/technology-conference/2011/videos.html – Gustavo Delfino Nov 15 '12 at 17:16
  • @rm-rf , I think I agree with tkott. The conference presentation suggests they are working on this for future releases ... let's hope so. – Luap Nalehw Nov 15 '12 at 17:41
  • My framework has not yet been optimized to work with lists where every part is very small but the number of parts is huge. OTOH, it is relatively straightforward to do. The framework itself need not be modified, since this scenario can be addressed by writing some code on top of it. I will try to work out an example tomorrow. – Leonid Shifrin Nov 15 '12 at 18:55
  • For large files I find it nice to monitor the progress of the Read, see my answer at: http://mathematica.stackexchange.com/questions/4640/how-to-monitor-the-progress-of-read – s0rce Nov 15 '12 at 20:00

2 Answers


You can read in the first 500 elements like this:

data = ToExpression@ReadList["myfile.txt", Record, 500, RecordSeparators -> " "];

On Linux/OS X, items 500 to 1000 can be read in this way:

n = 500; m = 1000;

data = ReadList["!cat myfile.txt | tr ' ' '\n' | head -" <> 
   ToString@m <> " | tail -" <> ToString[m - (n - 1)]];

Assuming your record separator is a space, and that you want to ignore records which are not of the form {...}, you could use the following to read records n through m:

n = 1; m = 500;

data = ReadList["!cat myfile.txt | tr ' ' '\n' | grep '{.*}' | head -" <> 
  ToString@m <> " | tail -" <> ToString[m - (n - 1) ]];
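
To see what the string passed to ReadList does: a file name that starts with ! makes ReadList run the rest of the string as a shell command and read that command's output instead of a file. The pipeline can be tried on its own in a terminal. The following is only a sketch on a tiny hypothetical file, sample.dat, with n and m as shell variables and with head -n / tail -n written in their modern form:

```shell
# Build a small sample file in the same format as the question
# (sample.dat is a hypothetical name used only for this illustration).
printf '{1,2,3,4} {5,6,7,8} {9,10,11,12} {13,14,15,16}\n' > sample.dat

# We want records n through m, counting from 1.
n=2
m=3

# tr   : turn every space into a newline, so each record sits on its own line
# grep : keep only lines that look like {...}
# head : keep the first m of those lines
# tail : of those, keep the last m-(n-1) lines, i.e. records n..m
result=$(cat sample.dat | tr ' ' '\n' | grep '{.*}' | head -n "$m" | tail -n "$((m - (n - 1)))")

echo "$result"
```

This relies on the records themselves containing no spaces. For a different single-character separator, change the first argument of tr.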
image_doctor
  • It would be great if you explain what is the meaning of the argument inside ReadList in the second and third examples. I have another separator and I am looking to extract the last record. – Basheer Algohi Sep 15 '17 at 19:56

Depending on which kind of processing you need, you can try to digest the information as you read it, line by line. I wrote this to answer another question about sorting long files. It's based on OpenRead and Read. In this case it reads one line at a time and stores only one number from that line, together with the line's StreamPosition, in a table for indexing.

strR = OpenRead["UnSorted.txt"]

nList = Table[
   With[{sp = StreamPosition[strR], 
     line = Read[strR, {"Word", "Number", "String"}]}, {sp, line[[2]]}]
   , {i, 10^6}
   ];

Close[strR]
rhermans