Dealing with large files

Question

I have a *.dat file in which the contents are basically one giant list:

{1,2,3,4} {5,6,7,8} {9,10,11,12}

and so on. I'd like to read in the first 500 members ({ } ... { }) to do some computations. The problem is that the file is about 1.5 GB. How does one deal with large files of this type in Mathematica?

I'm interested in this too. When I've been dealing with plain text files, I often use the split command in bash. i.e. split -l 500 data. Here's the man page. — canadian_scholar, Nov 15 '12 at 14:59
Actually, this SO discussion "Way to deal with large data files in Wolfram Mathematica" looks pretty key for this. http://stackoverflow.com/questions/2370570/way-to-deal-with-large-data-files-in-wolfram-mathematica — canadian_scholar, Nov 15 '12 at 14:59
split looks useful. But my first 500 sequences are not necessarily the first 500 lines. I have no idea which lines they fall on, so I would need a way of splitting off the first 500 { ... }'s, which doesn't look like an option — Luap Nalehw, Nov 15 '12 at 15:03
cat temp.dat | tr " " "\n" | split -l 500 would be one of many Unix commands. sed would also be a good bet. — cormullion, Nov 15 '12 at 15:23
thanks cormullion. I was really looking for a mathematica solution. — Luap Nalehw, Nov 15 '12 at 15:35
Unfortunately, at the moment Mathematica is only good at handling data in memory, which is a problem for big files. I know that they are working to change that, but maybe only in Mathematica 10! So, for now I use bash too. — Murta, Nov 15 '12 at 15:45
I too think this is a duplicate of the question that silvia linked to — rm -rf, Nov 15 '12 at 16:55
@rm-rf I think Leonid's answer to that question doesn't quite explain how to go from one gigantic file to his system. Perhaps all that is needed is that tutorial? — tkott, Nov 15 '12 at 17:09
There was a talk about this in the 2011 conference: "BigData: Demystifying Large Datasets in Mathematica with Nick Lariviere" http://www.wolfram.com/events/technology-conference/2011/videos.html — Gustavo Delfino, Nov 15 '12 at 17:16
@rm-rf , I think I agree with tkott. The conference presentation suggests they are working on this for future releases ... let's hope so. — Luap Nalehw, Nov 15 '12 at 17:41
My framework has not been yet optimized to work with lists where every part is very small but the number of parts is just huge. OTOH, it is relatively straightforward to do. The framework itself need not be modified, since this scenario can be addressed by writing some code on top of it. I will try to work out some example tomorrow. — Leonid Shifrin, Nov 15 '12 at 18:55
For large files I find it nice to monitor the progress of the Read, see my answer at: http://mathematica.stackexchange.com/questions/4640/how-to-monitor-the-progress-of-read — s0rce, Nov 15 '12 at 20:00

image_doctor · Answer 1 · 2012-11-15T21:49:45.047

13

You can read in the first 500 elements like this:

data = ToExpression@ReadList["myfile.txt", Record, 500, RecordSeparators -> " "];

On Linux/OS X, items 500 to 1000 can be read in this way:

n = 500; m = 1000;

data = ReadList["!cat myfile.txt | tr ' ' '\n' | head -" <> 
   ToString@m <> " | tail -" <> ToString[m - (n - 1)]];
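
For systems without these Unix tools, the same range can probably be read with Mathematica's own stream functions. The following is only a sketch, assuming (as in the question) that records are separated by whitespace and contain no spaces themselves, so that each {...} counts as one Word; the file name is the same placeholder as above.

```mathematica
n = 500; m = 1000;

stream = OpenRead["myfile.txt"];

(* Skip the first n - 1 whitespace-delimited records without keeping them *)
Skip[stream, Word, n - 1];

(* Read records n through m as strings, then convert them to lists *)
data = ToExpression@ReadList[stream, Word, m - n + 1];

Close[stream];
```

Only the requested records are held in memory, although the skipped part of the file still has to be scanned.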

Assuming your record separator is a space, and that you might want to ignore records that are not of the form {...}, you could use the following to find records n through m:

n = 1; m = 500;

data = ReadList["!cat myfile.txt | tr ' ' '\n' | grep '{.*}' | head -" <> 
  ToString@m <> " | tail -" <> ToString[m - (n - 1) ]];
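
To make the argument of ReadList easier to follow, here is the same command assembled piece by piece. A string that starts with "!" is run as an external command and its output is read as if it were a file; with no type given, ReadList reads each item as an expression. This is a sketch for Linux/OS X only, and the variable name cmd is just for illustration.

```mathematica
n = 1; m = 500;

cmd = StringJoin[
   "!cat myfile.txt",                  (* "!" means: run this shell command and read its output *)
   " | tr ' ' '\\n'",                  (* replace each space with a newline; "\\n" passes the two characters \n to tr *)
   " | grep '{.*}'",                   (* keep only lines that look like {...} *)
   " | head -", ToString[m],           (* keep the first m of those lines *)
   " | tail -", ToString[m - (n - 1)]  (* keep the last m - n + 1 of them, i.e. records n..m *)
   ];

data = ReadList[cmd];
```

For a different separator, the first argument of tr is the part to change.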

edited Nov 15 '12 at 21:49

answered Nov 15 '12 at 19:18


It would be great if you could explain the meaning of the argument inside ReadList in the second and third examples. I have a different separator and I am looking to extract the last record. – Basheer Algohi Sep 15 '17 at 19:56

score 6 · Answer 2 · edited Apr 13 '17 at 12:55

Depending on which kind of processing you need, you can try to digest the information as you read, line by line. I wrote this to answer another question about sorting long files. It's based on OpenRead and Read. In this case it reads one line at a time and stores only one number from that line, together with the StreamPosition where the line starts, in a table for indexing.

strR = OpenRead["UnSorted.txt"]

nList = Table[
   With[{sp = StreamPosition[strR], 
     line = Read[strR, {"Word", "Number", "String"}]}, {sp, line[[2]]}]
   , {i, 10^6}
   ];

Close[strR]
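
One way such an index might then be used, as a sketch: sort the small {position, number} table in memory and jump back into the file with SetStreamPosition to fetch only the lines that are needed. The names sorted and firstTen are illustrative, and nList is assumed to come from the code above.

```mathematica
(* Order the index entries by the number stored for each line *)
sorted = SortBy[nList, Last];

strR = OpenRead["UnSorted.txt"];

(* Fetch the full text of the ten lines with the smallest numbers *)
firstTen = Table[
   SetStreamPosition[strR, pos];  (* jump to the start of that line *)
   Read[strR, String],            (* read the whole line as a string *)
   {pos, sorted[[;; 10, 1]]}
   ];

Close[strR];
```

Only the index and the requested lines are kept in memory, never the whole file.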
