5

I have a huge string database of the form {{position},{1,2,3,4,5,..}}, and I want to convert it to Integer data as quickly as possible.

In the following example, I created a string database (instead of my real simulation database).

StringData = {};
steps=5000; (* for testing, in real simulation it is large*)
Do[AppendTo[StringData, {"Position"<>ToString[ii],"0,1,1,1,22,1,2,14,5,2,2,1,5,"}], {ii, 1, steps}];

strtolist = ConstantArray[{}, Length[StringData]];
For[ii = 1, ii <= Length[StringData], ii++,
   strtolist[[ii]] = ToExpression[StringSplit[StringData[[ii]][[2]], ","]];
   ]; // AbsoluteTiming

strtolist = ConstantArray[{}, Length[StringData]];
For[ii = 1, ii <= Length[StringData], ii++,
   strtolist[[ii]] = IntegerPart/@Internal`StringToDouble/@StringSplit[StringData[[ii]][[2]], ","];
   ]; // AbsoluteTiming  

{0.248431, Null}

{0.100303, Null}

The second way is much faster. My real simulation produces a large database, and I wonder whether there is an even quicker way to do this conversion, for example without the outer For loop. Thank you very much!

One additional problem with Internal`StringToDouble:

When a number is very large, as in the following example,

test = {"0", "33837677493872221", "311462297063636041906"};
numstr1 = IntegerPart /@ Internal`StringToDouble /@ test
numstr2 = IntegerPart /@ Internal`StringToDouble /@ test[[3]]

the results:

{0, 33837677493872220, IntegerPart[$Failed["Bignum"]]}

311462297063636041906

Why does numstr2 work while numstr1 does not? It seems that Internal`StringToDouble works fine with a single string but not with a list of strings.

What if StringData contains numbers like "-1", "-2" and so on? I only care about integers (both negative and positive). Is there any other way to do the same work without using ToExpression?

Alexey Popkov
Xuemei
  • 1
    try Map[FromDigits, StringSplit[StringData[[All, 2]], ","], {-1}]? – kglr Aug 30 '19 at 20:42
  • Thank you very much! I will test your method. I also added one small question at the end. It seems Internal`StringToDouble does not work with a list containing large numbers but is fine with a single string. Do you know why? @kglr – Xuemei Aug 30 '19 at 21:00
  • Map[FromDigits, StringSplit[StringData[[All, 2]], ","], {-1}] works well. Can it also handle string lists with negative numbers such as "-1" and so on? @kglr – Xuemei Aug 30 '19 at 21:07
  • 2
    because Internal`StringToDouble does not have the Listable attribute. You can make listable version using istd = Internal`StringToDouble; SetAttributes[istd, Listable] – kglr Aug 30 '19 at 21:08
  • 1
    @kglr that's an interesting subtlety to Listable... I hadn't realized it would take precedence over evaluating to Internal`StringToDouble. I assumed the Head would evaluate, then any Attributes would apply but I suppose it's the other way around? – b3m2a1 Aug 30 '19 at 21:17
  • @b3m2a1, hadn't thought about it. Now that you mention it, I found this and this re order of definitions and setting attributes. – kglr Aug 30 '19 at 21:37
  • numstr2 is still a string because you mapped on a string not a list. – Chris Degnen Aug 30 '19 at 22:16
  • @ChrisDegnen, yes, you are right. I didn't realize that. Thank you for pointing it out. That's very interesting. – Xuemei Aug 30 '19 at 22:19
  • @kglr, when I set the Listable attribute as you suggested with SetAttributes[istd, Listable], the problem still exists. – Xuemei Aug 30 '19 at 22:21
  • @XuemeiGu, I misspoke; we need to define istd as: istd[x_] := Internal`StringToDouble[x]; SetAttributes[istd, Listable]; and use it as IntegerPart @ istd @ StringSplit[StringData[[All, 2]], ","] (but this is slower than mapping Internal`StringToDouble at level {-1} and using IntegerPart on the entire list). – kglr Aug 31 '19 at 00:57
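
The point raised in the comments about numstr2 can be checked in isolation. The following is a minimal sketch, where f is an undefined placeholder symbol (an assumption for illustration, not a function from the thread): mapping over a string, which is an atom, returns the string unchanged, so nothing is ever converted.

```mathematica
test = {"0", "33837677493872221", "311462297063636041906"};

(* f is a hypothetical, undefined placeholder symbol *)
onString = f /@ test[[3]];    (* Map over an atom: the string comes back untouched *)
onList   = f /@ test[[{3}]];  (* Map over a one-element list: f is actually applied *)

StringQ[onString]  (* True, so the apparent "integer" is still a string *)
onList             (* {f["311462297063636041906"]} *)
```

This is why the numstr2 result only looks like a correct integer: it is the original input string.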

2 Answers

6

ToExpression is very fast for correct Mathematica syntax input. So the key idea is to create an input string for ToExpression that delivers the expected result for the huge string database in one go:

StringData // 
   Extract[{All, 2}] // 
   StringRiffle[#, {"{{", "Nothing},{", "Nothing}}"}] & // 
   ToExpression

The odd-looking "Nothing}" in StringRiffle is required to make the parser accept the trailing comma in each input string (e.g., "0,1,1,1,22,1,2,14,5,2,2,1,5,"): Nothing fills the empty slot after the last comma and is then dropped from the resulting list.

ToExpression also handles negative and very large numbers correctly.
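
To make the claim about negative and very large numbers concrete, here is a small sketch on made-up data. The names data, input and result, and the sample values, are assumptions for illustration; the rows keep the trailing comma used in the question.

```mathematica
(* Hypothetical sample rows with a negative entry, big integers and a trailing comma *)
data = {
   {"Position1", "0,1,201111111111111111,1,-1,"},
   {"Position2", "-2,311462297063636041906,5,"}
   };

(* Build one input string for the whole database; Nothing fills the slot after each trailing comma *)
input = StringRiffle[data[[All, 2]], {"{{", "Nothing},{", "Nothing}}"}];

(* Parse everything in a single ToExpression call *)
result = ToExpression[input]
(* {{0, 1, 201111111111111111, 1, -1}, {-2, 311462297063636041906, 5}} *)
```

Note that this variant relies on the trailing comma being present in every row; without it, the last number and Nothing would be joined into a single token.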

sakra
4
strtolist2 = Map[FromDigits, StringSplit[StringData[[All, 2]], ","], {-1}]

strtolist3 = IntegerPart @ Map[Internal`StringToDouble, 
   StringSplit[StringData[[All, 2]], ","], {-1}];

strtolist3  == strtolist2 == strtolist

True

Both are about twice as fast as the For loop with IntegerPart/@Internal`StringToDouble/@...
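
For signed and arbitrarily large integers without ToExpression, one possible approach is a sign-aware wrapper around FromDigits. This is only a sketch: signedFromDigits is a hypothetical helper name, not something from the posts above, and its speed relative to the other methods has not been measured here.

```mathematica
(* Hypothetical helper: strip a leading minus sign, convert the digits, then negate *)
signedFromDigits[s_String] :=
  If[StringStartsQ[s, "-"], -FromDigits[StringDrop[s, 1]], FromDigits[s]];

(* Sample data with large and negative entries *)
str = {{"Position1", "0,1,201111111111111111,1,-1"},
   {"Position2", "0,1,201111111111111111,1,-1"}};

(* Split every row at once, then convert each string at the deepest level *)
converted = Map[signedFromDigits, StringSplit[str[[All, 2]], ","], {-1}]
(* {{0, 1, 201111111111111111, 1, -1}, {0, 1, 201111111111111111, 1, -1}} *)
```

Because FromDigits produces exact integers, large values are not rounded through a machine double.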

kglr
  • Yes, it is fast, but one problem is that it cannot handle negative numbers such as "-1" because of FromDigits. Is there a way to make it work with negative number strings as well? @kglr – Xuemei Aug 31 '19 at 00:27
  • @XuemeiGu, please see the update. The second alternative can handle signed numbers. – kglr Aug 31 '19 at 00:43
  • Thank you very much. Using istd[x_] := Internal`StringToDouble[x]; SetAttributes[istd, Listable]; strtolist6 = IntegerPart@istd@StringSplit[StringData[[All, 2]], ","]; still fails for large-number strings such as str = {{"Position1", "0,1,201111111111111111,1,-1"}, {"Position2", "0,1,201111111111111111,1,-1"}}. – Xuemei Aug 31 '19 at 18:31