I have a large CSV file (~1 GB) with mixed data, basically strings and integers. Importing the whole dataset with Import[] takes very long, so I would like to use ReadList. Each line consists of {int, string, string, string, int, int, string, int, string, string, int, int, int (or NaN), int (or NaN), string}. These entries are separated by commas. The tricky part is that some of the strings may themselves contain commas, in which case they are enclosed in double quotes. Is there a way to read this correctly with ReadList?
m_goldberg
Thijs
1 Answer
I would read the csv file as a stream. It's very quick.
First, I use the following function to look for specific lines.
ClearAll[readLine];
(* Return the lines of the stream that contain search, parsed as a table *)
readLine[stream_, search_?StringQ] :=
 With[{stro = FindList[stream, search]},
  ImportString[StringJoin[Riffle[stro, "\n"]], "Table"] /;
   stro =!= {}]
Then I open a stream:
file = "data.csv";
str = OpenRead[file];
read = readLine[str, "whatever"];
Close[str]; (*Close the stream *)
Then use the following to split each line into a list of strings:
ds = StringSplit[#, ","] & /@
read; (*Split the strings for each items in each line*)
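Note that splitting on "," also splits inside quoted fields. As a minimal sketch of one alternative, assuming the "CSV" format of ImportString treats double-quoted fields as single entries, a line can be parsed like this (the sample line is hypothetical):

```mathematica
(* Hypothetical sample line with a quoted field that contains a comma *)
line = "1,2,\"Item 3, with a comma\",4";

(* Naive split: the quoted field is cut in two at its inner comma *)
naive = StringSplit[line, ","];

(* "CSV" import: expected to keep the quoted field together as one entry *)
parsed = First@ImportString[line, "CSV"];

Length /@ {naive, parsed}
```

If the two lengths differ, the naive split has broken a quoted field. The "CSV" importer also converts numeric fields to numbers on its own, so a separate conversion step may not be needed for those columns.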
Then you can use the following to transform the strings to integers:
ds[[All, 1, 9]] = IntegerPart@ToExpression@ds[[All, 1, 9]];
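Since some integer columns may hold NaN, applying ToExpression to every entry can leave unevaluated symbols behind. A small sketch of a more defensive conversion follows; toInt is a hypothetical helper, and the choice of Missing["NotAvailable"] as the placeholder is an assumption, not part of the original approach:

```mathematica
(* toInt is a hypothetical helper: convert only strings that look like integers *)
ClearAll[toInt];
toInt[s_String] :=
 With[{t = StringTrim[s]},
  If[StringMatchQ[t, ("+" | "-" | "") ~~ (DigitCharacter ..)],
   ToExpression[t],            (* safe here: t contains only a sign and digits *)
   Missing["NotAvailable"]]]   (* placeholder for NaN or anything non-numeric *)

toInt /@ {"12", " -3 ", "NaN"}
```

Checking the string before calling ToExpression also avoids evaluating arbitrary text that happens to appear in the file.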
Hope that helps
Xavier
Does this also deal with the problem of CSV string entries that contain commas? E.g.:
1,2,"Item 3, with a comma",4 – Thijs Nov 15 '14 at 21:37
Probably not, but you haven't shared an example, so I gave a possible solution that you can build on. – Xavier Nov 16 '14 at 07:40
Import, but splits the data into chunks and might work for you. – Leonid Shifrin Nov 15 '14 at 12:39
Import (via ImportString), so you can give the same spec there. If Import works for you in principle (but is just slow), then that code should also work for you (but be faster / more memory-efficient). If Import can't handle your format, then it is another story. You might need to write some custom parser. – Leonid Shifrin Nov 15 '14 at 14:36
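The idea of reading in chunks and parsing each chunk with ImportString might be sketched roughly as follows. This is not the code the comment refers to; readCSVChunked and chunkSize are hypothetical names, and the sketch assumes that no quoted field contains a line break, since ReadList with String reads line by line.

```mathematica
(* Hypothetical helper: read a CSV file chunkSize lines at a time *)
ClearAll[readCSVChunked];
readCSVChunked[file_String, chunkSize_Integer?Positive] :=
 Module[{stream = OpenRead[file], lines, chunks},
  chunks = Reap[
     (* ReadList returns {} once the stream is exhausted *)
     While[(lines = ReadList[stream, String, chunkSize]) =!= {},
      (* Parse one chunk; "CSV" is assumed to handle quoted commas *)
      Sow[ImportString[StringJoin[Riffle[lines, "\n"]], "CSV"]]]][[2]];
  Close[stream];
  (* Reap yields {} if nothing was sown, otherwise {{chunk1, chunk2, ...}} *)
  If[chunks === {}, {}, Join @@ First[chunks]]]

(* Usage, with the file name from the answer above *)
data = readCSVChunked["data.csv", 10000];
```

The chunk size trades memory against per-call overhead; 10000 is an arbitrary starting value, not a measured optimum.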