
I'm trying to get familiar with the new Dataset functionality in Mathematica 10. As it will (hopefully) provide some benefit for me, I have started converting my app from lists and lists of rules to Associations and Datasets.

My starting point is an XML object, which is now converted to a Dataset. I was hoping to reduce the amount of code by, e.g., applying functions to multiple columns at once.

This is an example Dataset:

testset = Dataset[{
  <|"Element" -> "A", "a" -> "529251", "b" -> "520358"|>,
  <|"Element" -> "B", "a" -> "51"|>,
  <|"Element" -> "C", "a" -> "177"|>,
  <|"Element" -> "L", "a" -> "125"|>,
  <|"Element" -> "S", "a" -> "1343"|>}]

This gives:

(image: the rendered Dataset)

As the XML sometimes contains integers and reals in scientific form, I'm using Internal`StringToDouble to convert the strings:

fnStr2Real[x_,r_:0]:=If[Head[x]===Missing,r,Internal`StringToDouble[x]]
SetAttributes[fnStr2Real, {Listable}];
testset[fnStr2Real, {"a", "b"}]
(* Trace yields *)
(* Query[fnStr2Real, {"a", "b"}][testset] *)

The issue: there is some kind of rounding in the resulting Dataset:

(image: the resulting Dataset)

If I apply my function directly to my list or association, everything is fine:

Function[x, fnStr2Real[#[x] & /@ Normal[testset]]] /@ {"a", "b"}
(* {{529251., 51., 177., 125., 1343.}, {520358., 0, 0, 0, 0}} *)

I know that Internal`StringToDouble is an internal function, but I found several posts in this forum using it, and I really like it!

Is there an issue with M10, Dataset and Internal`StringToDouble?

Are there any other robust string conversion functions available that can handle integers, reals, and reals in scientific form inside a Dataset? Thanks!

32u-nd
  • Sorry, you need to set the Listable attribute: SetAttributes[fnStr2Real, {Listable}]; – 32u-nd Jul 17 '14 at 09:51
  • Why don't you use ToExpression for the conversion? You could check before conversion whether the string contains numeric data using StringMatchQ. – Sjoerd C. de Vries Jul 17 '14 at 10:41
  • Internal`StringToDouble is fine. Check Internal`StringToDouble /@ {"1234", "1234.", "1234.567", "1.234e+3", "1.23456e+3", "0.5664", "5.664e-1", "5.664e-4"} – 32u-nd Jul 17 '14 at 11:14
  • @SjoerdC.deVries ToExpression is very slow compared to Internal`StringToDouble. See this – Murta Jul 17 '14 at 13:19
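
Combining the two suggestions in the comments above (check the string first, then convert with the fast internal function) could look roughly like the sketch below. The names numericStringQ and safeStr2Real and the regular expression are assumptions made for illustration, not something from the posts, and Internal`StringToDouble remains an undocumented function whose behavior may change between versions.

```mathematica
(* Hypothetical helper: True only for strings that look like an integer,
   a decimal real, or a real in scientific form such as "1.234e+3". *)
numericStringQ[s_] := StringQ[s] && StringMatchQ[s,
   RegularExpression["[+-]?(\\d+\\.?\\d*|\\.\\d+)([eE][+-]?\\d+)?"]]

(* Hypothetical variant of fnStr2Real: anything that is not a numeric
   string (including Missing[...]) is replaced by the default r. *)
safeStr2Real[x_, r_: 0] :=
  If[numericStringQ[x], Internal`StringToDouble[x], r]
SetAttributes[safeStr2Real, Listable];

(* Example call on a mixed list of inputs *)
safeStr2Real[{"1234", "1.234e+3", "abc", Missing["KeyAbsent", "b"]}]
```

The guard only decides which inputs reach the converter; it does not change how Dataset displays the converted numbers.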

1 Answer

I don't yet know where the rounding is coming from but it appears to be occurring during formatting.

Consider:

testset[fnStr2Real, {"a", "b"}] // Normal // Normal
{{"a" -> 529251., "b" -> 520358.}, {"a" -> 51., "b" -> 0},
 {"a" -> 177., "b" -> 0}, {"a" -> 125., "b" -> 0}, {"a" -> 1343., "b" -> 0}}

Actually the rounding seems to be a standard part of the formatting of Dataset:

Dataset[<|a -> 123456.|>]

(image: the formatted Dataset output)
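
A quick way to check that only the display is affected is to strip the Dataset wrapper and look at the stored value. This is a minimal sketch; it uses a string key "a" instead of the symbol a from the example above, which is an assumption made to keep the lookup simple.

```mathematica
(* Sketch: compare what Dataset shows with what it stores. *)
ds = Dataset[<|"a" -> 123456.|>];

(* Normal removes the Dataset wrapper and returns the underlying association. *)
raw = Normal[ds];

(* InputForm prints the stored machine number without Dataset's display formatting. *)
InputForm[raw["a"]]

(* Comparing against the original number confirms that nothing was lost. *)
raw["a"] == 123456.
```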

Mr.Wizard
  • Yes, you're right; interesting. It also appears correctly in the FullForm. So I don't need to focus on this and can continue with my work. Everything is fine; it's just a matter of visualization. – 32u-nd Jul 17 '14 at 10:11
  • @akm I think that's correct, but it's still surprising. I haven't yet read about Datasets so this may be plainly documented. Perhaps there is an Option that controls it. I'll try to remember to update this post later as I learn more. – Mr.Wizard Jul 17 '14 at 10:12
  • Maybe the programmers remembered this: http://www.stat.washington.edu/pds/stat423/Documents/Ehrenberg.numeracy.pdf, which is actually good, but it should be documented. – 32u-nd Jul 17 '14 at 10:17
  • @Mr.Wizard It's neither documented nor controllable. The formatting of Dataset is quite limited at the moment. There are several ways I wish to expand it and make it more interactive, so making it controllable now is premature and would probably lock the design to something sub-optimal. – Taliesin Beynon Jul 17 '14 at 13:46
  • @TaliesinBeynon Thanks for the information. – Mr.Wizard Jul 17 '14 at 13:50
  • @TaliesinBeynon, why not give options for formatting, as there are for FailureAction in Query? Is it safe to Unprotect and implement custom formatters, e.g. with scrollbars for large data? – alancalvitti Sep 23 '14 at 21:53
  • @alancalvitti I would love to, I just can't devote 100% of my time to Dataset. There are many other projects vying for my attention, from better machine learning, to NLP, to an opt-in type system for the language, to incremental improvements to the language, to laziness, and on and on. Dataset formatting will have to wait. And yes, roll your own if you want to :). – Taliesin Beynon Sep 24 '14 at 05:20
  • @Mr.Wizard answered. But my busy schedule at this point involves going to sleep :-). P.S. is there a way of getting a regular email summary of questions asked with a given tag? Or getting an email whenever that tag is mentioned? I would keep tabs on things more closely if I didn't have to manually check SO when I happen to remember. – Taliesin Beynon Sep 24 '14 at 06:09
  • @Tali You can go to this page: http://stackexchange.com/filters/ then choose New Filter and fill in the details. There is a field for email and how often you wish to receive messages. – Mr.Wizard Sep 24 '14 at 06:16
  • @Tali other questions I would appreciate your eyes on, if or when you find time: (54599), (58430) – Mr.Wizard Sep 24 '14 at 13:47