
A testing file with the specified format can be downloaded from here

Let's read a few lines from this file

(recLines = ReadList[csvFileName, Record, 3]) // TableForm

SalesOrderID,ProductID,ProductName,OrderDate,ShipDate,Revenue
51217,779,Soft Cushion,5/7/2014,11/7/2014,"$2,400"
51217,788,Soft Cushion,5/7/2014,11/7/2014,"$2,400"

recLines[[2]] // FullForm
"51217,779,Soft Cushion,5/7/2014,11/7/2014,\"$2,400\""

Had I used Import, I would have gotten:

recData = Import[csvFileName];
recData[[1 ;; 3]] // TableForm

SalesOrderID    ProductID   ProductName OrderDate   ShipDate    Revenue
51217   779 Soft Cushion    5/7/2014    11/7/2014   2400
51217   788 Soft Cushion    5/7/2014    11/7/2014   2400

Head /@ recData[[3]]
{Integer, Integer, String, String, String, Integer}

This is exactly the output I would like to get, but instead of using Import I want to use ReadList.

I have read various posts related to ReadList, and this is the closest I have found to an answer that uses ReadList:

rl = ReadList[csvFileName, Word, WordSeparators->{","}, RecordSeparators->{"\n"}, RecordLists->True]

rl[[1 ;; 3]] // TableForm

SalesOrderID   ProductID   ProductName    OrderDate   ShipDate    Revenue
51217          779         Soft Cushion   5/7/2014    11/7/2014   "$2    400"
51217          788         Soft Cushion   5/7/2014    11/7/2014   "$2    400"

rl[[3]] // FullForm

List["51217","788","Soft Cushion","5/7/2014","11/7/2014","\"$2","400\""]

Head /@ rl[[3]]
{String, String, String, String, String, String, String}

Could you please continue this to a full answer, or perhaps share a better solution with ReadList? For example, is it possible to read and transform at the same time with ReadList, by specifying the header, i.e. the types of the record items, and get exactly the same result as Import?

I also have this challenge question for the experts of the language:

ReadList is a memory-efficient and fast method of reading and parsing CSV files. But it cannot automatically detect the data types of the record items, something that Import does AUTOMAGICALLY. Wouldn't it be possible to somehow implement a ReadList that can also detect the header format?

If not, suppose we do know the format of the records, and suppose it does not change. There are all these cryptic types and options of ReadList to assist you in parsing the data items correctly. Can somebody explain to me how, for the file I specified, I can use these types and options to read lists the same way the Import function does?
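For what it's worth, here is a minimal sketch of what such a type-detecting wrapper might look like. The names `parseField` and `readCSVTyped` are my own, not built-ins, and this sketch does NOT handle quoted fields containing commas (like "$2,400"); it only converts fields that look like plain numbers:

```mathematica
(* Hypothetical sketch: infer field types after a plain ReadList pass.    *)
(* parseField converts a string to a number when it matches NumberString, *)
(* otherwise leaves it as a string.                                       *)
parseField[s_String] := If[StringMatchQ[s, NumberString], ToExpression[s], s]

readCSVTyped[file_] :=
 Map[parseField,
  ReadList[file, Word, WordSeparators -> {","},
   RecordSeparators -> {"\n"}, RecordLists -> True],
  {2}]
```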

Athanassios
    Why exactly do you want to use ReadList instead of Import ? – SquareOne Feb 02 '16 at 12:29
  • ReadList is much faster for big CSV files than Import. I am using this file as an exercise to learn more on how to use ReadList. – Athanassios Feb 02 '16 at 12:39
  • In order to use ReadList with your CSV file in the most efficient way, the entries in your file should be formatted more friendly for Mathematica. Do you have a way to modify your csv file to add some formatting ? – SquareOne Feb 02 '16 at 12:58
  • Thank you, yes ReadList, I have just edited my comments – Athanassios Feb 02 '16 at 13:06
  • @SquareOne I can modify my CSV file, but I thought ReadLine could detect the format of each record and transform the fields, if you specify the correct types, something that Import does automagically ;-) – Athanassios Feb 02 '16 at 13:20
  • @Athanassios ReadList cannot detect the formats, you have to specify it. OK, I'll try an answer now. – SquareOne Feb 02 '16 at 13:57
  • @Andre this is not a duplicate. I am aware of this answer, they are not using ReadList ! The whole point of this question of mine is whether you can utilize better the ReadList function. This is clear in my question. – Athanassios Feb 02 '16 at 14:03
  • I understand that you want to use ReadList, and I have understood that that's probably meaningful for big files. Nevertheless, the OP of this question looks for a solution with ReadList for the same reasons as you. I'm afraid that the problem in fact is that there is no solution with ReadList (or maybe by doing some preprocessing of the file, as SquareOne suggests) – andre314 Feb 02 '16 at 14:11
  • @Andre, partly this may be true, nevertheless it depends heavily on the specific format of the CSV file. Moreover on the post you are referring they are missing the ReadList answer with the Word and Word Separator that I have placed in my question above. – Athanassios Feb 02 '16 at 14:21
  • @Andre, come on, I am pretty sure that they know this, although I got this from another post and it is not mentioned as an answer there. But I am not playing it smart here. I am trying to understand how this ReadLine works when you know the format of the header. So is it possible to read line by line and parse at the same time knowing what are the data types, format of the record items ? Can a guru help us with this ;-) For that purpose I have modified my question, see the end of it. – Athanassios Feb 02 '16 at 15:01
  • @Athanassios, do I understand, you wouldn't really be happy with a method that used ReadList as you have it, then applied a series of transformations to turn elements like {51217,779,Soft Cushion,5/7/2014,11/7/2014,"$2,400"} into elements like {51217, 779, "Soft Cushion", "5/7/2014", "11/7/2014", 2400}? You want it to be more flexible and do it automatically? – Jason B. Feb 02 '16 at 15:43
  • @Jason B It would be already very interesting to have a solution with ReadList that gives directly {51217,779,Soft Cushion,5/7/2014,11/7/2014,"$2,400"}, that is to say that has solved the problem of the comma. Do you really have something ? – andre314 Feb 02 '16 at 15:52

4 Answers


Here are some approaches using slightly modified versions of your input CSV file. If you are looking for an efficient way to read the data, the input should be as friendly as possible to MMA.

However, in the last example I will show one possible way to process your original data.

1. ReadList with formats

With this approach, the most problematic part of your data is the Revenue field, whose values contain a comma ($2,400). So let's say you don't have this problem (I changed it directly to a numeric value, but I could have left it as a string with just the comma removed).

Let's say your file is:

csv = "51217,779,Soft Cushion,5/7/2014,11/7/2014,2400\n51217,788,Soft Cushion,5/7/2014,11/7/2014,3200";

Print@csv

51217,779,Soft Cushion,5/7/2014,11/7/2014,2400
51217,788,Soft Cushion,5/7/2014,11/7/2014,3200

str = StringToStream@csv;
rl = ReadList[str, 
{Number, Character, Number, Character, Word, Word, Word, Character, Number},
RecordSeparators -> {"\n"}, WordSeparators -> {","}, RecordLists -> False]
Close@str;

{{51217, ",", 779, ",", "Soft Cushion", "5/7/2014", "11/7/2014", ",", 2400},

{51217, ",", 788, ",", "Soft Cushion", "5/7/2014", "11/7/2014", ",", 3200}}

and finally you extract the parts you want (getting rid of the commas):

rl[[All, {1, 3, 5, 6, 7, 9}]]

{{51217, 779, "Soft Cushion", "5/7/2014", "11/7/2014", 2400},

{51217, 788, "Soft Cushion", "5/7/2014", "11/7/2014", 3200}}

and you can check that :

Head /@ rl[[All, {1, 3, 5, 6, 7, 9}]][[2]]

{Integer, Integer, String, String, String, Integer}

2. ReadList combined with ToExpression

(I learned this powerful method from @Jakob Akkerboom in this post)

This approach suits data like the following (here it is better not to have commas inside the fields, so for the sake of simplicity I removed them):

csv = "51217,779,\"Soft Cushion\",\"5/7/2014\",\"11/7/2014\",2400\n\
51217,788,\"Soft Cushion\",\"5/7/2014\",\"11/7/2014\",3200";

Print@csv

51217,779,"Soft Cushion","5/7/2014","11/7/2014",2400
51217,788,"Soft Cushion","5/7/2014","11/7/2014",3200

then just

str = StringToStream@csv;
rl = ReadList[str, String, RecordSeparators -> {"\n"}] // 
Map[ToExpression["{" ~~ # ~~ "}"] &, #] &
Close@str;

{{51217, 779, "Soft Cushion", "5/7/2014", "11/7/2014", 2400},

{51217,788, "Soft Cushion", "5/7/2014", "11/7/2014", 3200}}

and you can verify:

Head /@ rl[[1]]

{Integer, Integer, String, String, String, Integer}

3. ReadList with your original dataset

The idea is to read each line of the CSV file as a string, then do the processing to suit your needs. For very big files, this approach could be adapted to read the file in chunks (using the lower-level Read function instead of ReadList).

r1 = ReadList[csvFileName, String];
r2 = Map[StringReplace[#, 
       "\"$" ~~ pat : (DigitCharacter .. ~~ "," ~~ DigitCharacter ..) ~~"\"" :> 
StringDelete[pat, ","]] &, r1] // Map[StringSplit[#, ","] &, #] &;
r2[[2 ;;, {1, 2, 6}]] = r2[[2 ;;, {1, 2, 6}]] // ToExpression;

r2 // TableForm

SalesOrderID   ProductID   ProductName    OrderDate   ShipDate    Revenue
51217          779         Soft Cushion   5/7/2014    11/7/2014   2400
51217          788         Soft Cushion   5/7/2014    11/7/2014   2400

and

Head /@ r2[[2]]

{Integer, Integer, String, String, String, Integer}
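A minimal sketch of the chunked variant mentioned above, using the lower-level stream functions. `processChunk` is a hypothetical placeholder for the StringReplace/StringSplit post-processing shown in this answer, and the chunk size of 10000 lines is an arbitrary choice:

```mathematica
(* Read the file in chunks of lines so that only one chunk is in memory *)
(* at a time. processChunk stands in for the per-chunk post-processing. *)
str = OpenRead[csvFileName];
header = StringSplit[Read[str, String], ","]; (* first line: column names *)
While[(chunk = ReadList[str, String, 10000]) =!= {},
 processChunk[chunk]];
Close[str];
```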

4. Lazy lists

But if you have to deal with very big files, maybe you should convert your data to @Leonid Shifrin's lazy lists.

SquareOne
  • great answer @SquareOne regarding 3. I thought it would be easier, perhaps faster, to use ReadLine, is that right? – Athanassios Feb 02 '16 at 18:49
  • @Athanassios You mean ReadList, not ReadLine – andre314 Feb 02 '16 at 18:54
  • @Athanassios Yes ReadList is fast, but if you need also to filter/parse the data after reading the file (like in 3.) then the whole computation is longer, but probably less than with Import. – SquareOne Feb 02 '16 at 20:45
  • This time @andre I really meant to use ReadLine, i.e. parse string lines, line by line. Wouldn't it be faster and more memory efficient this way? Please also see the end of my question about comparing it with the Import function. This part has NOT been answered. – Athanassios Feb 02 '16 at 21:11
  • @Athanassios Reading line by line is possible with the low-level function Read, which shares the same syntax as ReadList. (I have not yet used ReadLine, which seems aimed at interactivity with external processes.) With big files that cannot fit into memory, it would probably be faster to read a block of lines ("chunks" as I said in my answer), then parse, read again, and so on ... There are examples of how to do that in this forum ... – SquareOne Feb 02 '16 at 22:04
  • @Athanassios For example you might find interesting these posts: 1, 2, 3, 4, ... – SquareOne Feb 02 '16 at 22:08
  • And there is also another good one here – Athanassios Feb 03 '16 at 12:26

As long as you are using ReadList to read in the entire file and then acting on each line, you may as well just feed the result of ReadList into ImportString.

test = ImportString[
    StringJoin[
     Riffle[ReadList["shoes_revenue.csv", String], "\n"]
     ],
    "CSV"]; // AbsoluteTiming
(*{2.7918, Null}*)

which is marginally faster than this answer

conv = {ToExpression@#[[1]], 
    ToExpression@#[[2]], #[[3]], #[[4]], #[[5]], 
    ToExpression@StringTake[StringJoin@#[[6 ;;]], {3, -2}]} &;
(wrdlst = 
   conv /@ (ReadList["shoes_revenue.csv", Word, 
       WordSeparators -> {","}, RecordSeparators -> {"\n"}, 
       RecordLists -> True])[[2 ;;]];
  PrependTo[wrdlst, 
   Flatten@StringSplit[ReadList["shoes_revenue.csv", Record, 1], ","]];
  ) // AbsoluteTiming
(* {3.00567, Null} *)
Jason B.
  • This is so far the fastest method with ReadList, but I checked the timing of Import and found that it reads the file faster than your answer. Memory allocation is also about the same as that of Import. – Athanassios Feb 03 '16 at 12:07
  • That is definitely true. I think that you can really save some overhead by using Read or ReadList when the data is formatted very simply. The fact that you have to do postprocessing on each line is going to remove any benefit from these stream operations over Import. If you could preprocess these files to go so far as to change the "$2,400" to 2400 then you could benefit from ReadList. – Jason B. Feb 03 '16 at 12:58
  • I have found that simply Import["shoes_revenue.csv"] alone is 2 times faster than your best solution, and the result is the same. There is something I don't understand at all – andre314 Feb 03 '16 at 17:51
  • I think the Import and ImportString functions have been optimized to automatically detect the format of the data rows in a file, something that you have to do in a suboptimal way with ReadList. That is also why JasonB's solution is the fastest here. But generally speaking, for memory-efficient and fast in-memory processing you have to switch to a different kind of data structure/data model, i.e. single-instance, associative. See also this question and my comment there – Athanassios Feb 03 '16 at 21:29
  • @andre, In this case, ImportString is essentially writing the string to a temporary file and then calling Import on it, so it's not very efficient. You can see the temporary streams used with the following: DeleteDuplicates[Flatten[Trace[ImportString["a,b,c\n1,2,3\n4,5,6\n", "CSV"], Except[List | SameQ][___, (InputStream | OutputStream)[_String, ___], ___]]]]. – rhennigan Feb 06 '16 at 18:52

I am not sure exactly what you are looking for, but you could apply ToExpression to the columns that should be Integers instead of Strings:

 Head@ToExpression["1234"]
 (* Integer *)

Similarly, you can combine the last two columns, drop the dollar sign, and convert to an integer:

ToExpression@StringDrop["$2" <> "843", 1]
(*2843 *)
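Putting both steps together on one row of the Word-split output from the question (a sketch; the row literal is copied from the FullForm shown there, and StringDelete strips the stray quote characters and dollar sign in one pass):

```mathematica
(* One split row as produced by ReadList with Word and comma separators. *)
row = {"51217", "779", "Soft Cushion", "5/7/2014", "11/7/2014", "\"$2", "400\""};

(* Convert the integer columns, keep the string columns, and rebuild the *)
(* revenue from the last two fragments.                                  *)
{ToExpression@row[[1]], ToExpression@row[[2]], row[[3]], row[[4]], row[[5]],
 ToExpression@StringDelete[StringJoin@row[[6 ;;]], "\"" | "$"]}
(* {51217, 779, "Soft Cushion", "5/7/2014", "11/7/2014", 2400} *)
```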
Berg

ReadList with Word

It seems to me that the best you can do with ReadList in this case is to parse each line of the file and get the field values back as strings. The monetary value containing a comma comes back as two string values, and this is the major obstacle to tackle.

Note that this use of ReadList was already included in my question above, but nobody developed it into a full answer.

We know the format of the fields: two integers, followed by a string, followed by two dates, followed by the monetary value.

conv = {
        ToExpression@#[[1]], ToExpression@#[[2]],
        #[[3]], 
        #[[4]], #[[5]],
        ToExpression@StringTake[StringJoin@#[[6 ;;]], {3, -2}]} &;

Map the pure function above over the output of ReadList:

wrdlst = conv /@ (ReadList[csvFileName, Word, WordSeparators -> {","},
      RecordSeparators -> {"\n"}, RecordLists -> True])[[2 ;;]];

PrependTo[wrdlst, Flatten@StringSplit[ReadList[csvFileName, Record, 1], ","]];

Head /@ wrdlst[[3]]
{Integer, Integer, String, String, String, Integer}

On my machine it is about 1.7 sec faster than the answer provided by @SquareOne. It is also close to Jason's answer.

Athanassios