The new Interpreter functionality in version 10 looks like it has the potential to make parsing custom data formats very easy. I'm trying to create a CSV parser.


Requirements:

  1. Rows are delimited by newlines, columns are delimited by commas.
  2. Entries can be numbers or strings (anything that is quoted or can't be interpreted as a number is a string).
  3. Commas in quoted strings must be ignored.
  4. Empty elements are allowed, e.g. this row contains three empty elements: ,,. They're delimited by two commas. These can be parsed to either Null or "".

My actual problem doesn't have requirement 3. I added more requirements in the hope of making the question more generally useful, and I meant to accept answers that satisfy only a subset of them. (Perhaps this was misguided.) In the meantime, Carlo's answer explains that requirement 3 can't be met.

Test data:

,"one","2"
"a",1,2
"b",3,"4c"
"c",5,x
"d",6,"seven, eight"

Or ready to paste Mathematica string:

csv = ",\"one\",\"2\"\n\"a\",1,2\n\"b\",3,\"4c\"\n\"c\",5,x\n\"d\",6,\"seven,eight\""

Parsed result should MatchQ this pattern:

{{Null | "" | Missing[___], "one", "2"}, 
 {"a", 1, 2}, 
 {"b", 3, "4c"}, 
 {"c", 5, "x"}, 
 {"d", 6, "seven, eight"}}

How close can we get to this result, using Interpreter?


Here's a first try:

int = Interpreter[
  DelimitedSequence[
   DelimitedSequence[
    Restricted["String", "\"" ~~ ___ ~~ "\""] | "Number" | "String",
    ","
    ],
   "\n"
   ]
  ]

int[csv]

What it gets wrong:

  • fails on point 4. (this is actually important for me)
  • fails on point 3.
  • doesn't unquote strings

It may not be possible to implement all the features I request using Interpreter, but how close can we get? How much time and effort can Interpreter save when attacking this problem? Preferably, it should be possible to offload most of the processing to Interpreter and reach the desired result with minimal pre- and/or post-processing.

Szabolcs
  • How close can we get to this result using Interpreter but not Interpreter["CSV"]@str :) – C. E. Aug 07 '14 at 17:46
  • @Pickett Actually my motivation for this is that the built-in CSV import has too many problems and I'm required to create my own. I'll tweak the test case to break the built-in CSV import even more. – Szabolcs Aug 07 '14 at 17:49
  • Interesting, I didn't know about such problems. That makes the question even more valid, though I already gave it a +1. – C. E. Aug 07 '14 at 17:49
  • I don't think this is a good use case for Interpreter. I don't dispute that we can do better, but I suggest you look at SemanticImport instead. – Taliesin Beynon Aug 07 '14 at 17:52
  • @Pickett Some problems with Import[..., "CSV"]: 4c is interpreted as currency and converted to the number 4 instead of being read as "4c". I can fix this with "CurrencyTokens" -> None. Then "2" is again read as the number 2, not a string, which is again a problem. I thought it would be better to explicitly restrict what data types to interpret and how. – Szabolcs Aug 07 '14 at 17:53
  • @TaliesinBeynon I wasn't sure if Interpreter was actually meant for this. If it isn't, then I'd better not push it with this application and should remove this question. However, if there's a good solution using SemanticImport and if SemanticImport doesn't require an internet connection (does it?), then it would be better to give an answer using SemanticImport, mention that Interpreter is not for this, and keep the question. – Szabolcs Aug 07 '14 at 17:57
  • You can unquote strings by doing something like this: Interpreter[DelimitedSequence[StringReplace[#, "\"" ~~ a___ ~~ "\"" :> a] & | "Number" | "String", ","]]["\"a\",1,2,c"] – Carlo Aug 07 '14 at 18:05
  • @Carlo Why does this fail? --> Interpreter[StringReplace[#, "\"" ~~ a___ ~~ "\"" :> a] & | "String"][""]. Is it a bug or am I misunderstanding how it should work? Interpreter["String"][""] doesn't fail, nor does Interpreter[StringReplace[#, "\"" ~~ a___ ~~ "\"" :> a] &][""]. – Szabolcs Aug 07 '14 at 18:42
  • Here's the idea: the first returns Missing["NoInput"] so we try the second that is Missing["NoInput"] as well. Since we considered the first equivalent to the Failure, we can't treat the second as good. – Carlo Aug 07 '14 at 18:53
  • Can you explain how CSV is broken? I'm only seeing 3 differences. 1: The "2" in the top-right corner is expected to be a string but is in fact a number. I believe that's correct -- the expectation is wrong -- since quotes in CSV delimit the content, and are not part of the content. 2: The "4c" in the middle-right is becoming a numeric 4. See ref/format/CSV and stop "c" from being interpreted as "cents" by setting "CurrencyTokens"->None. 3: "seven,eight" is not having a space inserted after the comma. Why is that wrong? – Jeremy Michelson Aug 07 '14 at 19:21
  • @JeremyMichelson Can you come to chat? There are too many comments already. – Szabolcs Aug 07 '14 at 19:28
  • @Carlo Is Interpreter meant to be used as a general parser mechanism? Unfortunately Interpreter["Number"] is just too slow to be able to use it to implement parsers for different file formats. Please see here. Is there a chance that performance will be improved for 10.0.1 or 10.0.2? – Szabolcs Aug 09 '14 at 15:35
  • @Szabolcs No, not really. It's mainly been designed for use with FormFunction and APIFunction. Performance will definitely be improved, but I don't think the main design goal for Interpreter is general parsing; there is SemanticImport and other stuff yet to come that will cover that. – Carlo Aug 19 '14 at 14:54

2 Answers

I worked on Interpreter.

As the implementation stands now, the DelimitedSequence parser does not support quoting, so what you want can't be done. We'll try to add it in a future version.

Carlo
  • Hi Carlo, thanks for the response. This was really a theoretical question, and I didn't mean to point out any faults with Interpreter, I'm just trying to learn how to make the best use of it. You mean that requirement 3. can't be met with Interpreter, right? In my actual application I don't need point 3., but I do need 4. What I did was that I broke up the lines using StringSplit[#, ",", All]& (note All, which preserves empty entries), then applied Interpreter to each entry, and finally unquoted each quoted string. Could I have done better with the new functionality? – Szabolcs Aug 07 '14 at 18:09
  • (The reason why I put more requirements in the question than my actual application was to make the question broader and more generally useful. I meant to accept answers that satisfy only a subset of the requirements. I should spell this out.) – Szabolcs Aug 07 '14 at 18:14
  • I think a nullable string pattern might also be useful (and can probably be written). However I think that DelimitedSequence here is at fault. For example Interpreter["Boolean"][""] returns False, but Interpreter[DelimitedSequence["Boolean", ","]][",True"] returns only {True} and this is probably a bug. – Carlo Aug 07 '14 at 18:18
  • StringSplit also behaves like this without its third argument: StringSplit[",1", ","] ---> {"1"} and StringSplit[",1", ",", All] ---> {"", "1"}. Maybe a similar syntax could be used? The standard behaviour to ignore delimiters at the start/end makes sense if the delimiter is whitespace. – Szabolcs Aug 07 '14 at 18:25
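The workaround described in the comments above (split each line with StringSplit[#, ",", All], interpret each entry, then unquote quoted strings) could look roughly like the sketch below. The names parseEntry and parseCSV are hypothetical, chosen only for illustration, and mapping empty entries to Null is one of the two options the question allows. It makes no attempt at requirement 3, so a quoted entry containing a comma is still split in two.

```mathematica
(* Hypothetical helper: convert one raw CSV field to Null, a string, or a number *)
parseEntry[""] := Null

(* Quoted field: strip the surrounding quotes and keep it as a string *)
parseEntry[s_String] /; StringMatchQ[s, "\"" ~~ ___ ~~ "\""] :=
  StringTake[s, {2, -2}]

(* Unquoted field: try Interpreter["Number"]; fall back to the raw string on failure *)
parseEntry[s_String] :=
  With[{n = Interpreter["Number"][s]}, If[NumberQ[n], n, s]]

(* Hypothetical driver: split into rows, then into fields; All keeps empty fields *)
parseCSV[str_String] :=
  Map[parseEntry, StringSplit[#, ",", All] & /@ StringSplit[str, "\n"], {2}]
```

With the csv string from the question, parseCSV[csv] would be the call to try. Quoted fields never reach Interpreter in this sketch, so "2" stays a string.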

This is an ideal use case for SemanticImport, but unfortunately it has issues getting the commas right in version 10.0.

Luckily, version 10.0.1 has already fixed this bug.

Taliesin Beynon
  • Nice tease with v10.0.1 :) – RunnyKine Aug 07 '14 at 21:07
  • I was trying to use SemanticImport for this sort of data in 10.1.0, but it is literally unusably slow when specifying column types explicitly. This takes 27 seconds on my computer: SemanticImportString[ExportString[RandomReal[10, {5000, 5}], "Table"], {Real, Real, Real,Real, Real}]; Is there going to be a fix for this? – Szabolcs May 21 '15 at 11:41
  • Because of problems like this one (and this) I am actually getting quite a bit of real pressure to use R instead. It's very hard to explain to the decision-makers why I would need to use a very expensive piece of software like Mathematica when R is free and can import data quickly and accurately, especially when I had to correct my results the next day as I discovered that Mathematica imported some data incorrectly. I really hope WRI will give higher priority to practical user needs. – Szabolcs May 21 '15 at 12:02
  • @Szabolcs Real isn't a valid type for SemanticImport. It only happens to work because Interpreter[Real] means the same as Interpreter["Real"], so SI is delegating to Interpreter, which is much, much slower. If you pass "Real" to SemanticImport, it works fast. I get 1.4 seconds for your example. – Taliesin Beynon May 21 '15 at 15:10
  • @Taliesin Thank you!!! This is great! – Szabolcs May 21 '15 at 19:47
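For reference, the fix from the last comments amounts to giving the column types as the string "Real" rather than the symbol Real. A minimal sketch of that call follows, reusing the random test table from the comment above. The variable names data and result are illustrative, and timings will differ by machine and version.

```mathematica
(* Same kind of random test table as in the comment above *)
data = ExportString[RandomReal[10, {5000, 5}], "Table"];

(* Column types as strings; per the comment above, the symbol Real
   would be delegated to the much slower Interpreter path *)
AbsoluteTiming[
  result = SemanticImportString[data, ConstantArray["Real", 5]];
]
```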