9

I'm trying to split chemical formula strings into their element symbols and numbers.

For example:

"Fe3O4" should be split into {"Fe","3","O","4"}

I've tried using StringSplit with various _LowerCaseQ-type patterns, but it isn't working. I've also tried using StringSplit[#,""] to split everything into single characters, then finding the lowercase characters and joining them back onto the preceding letter, but I haven't got that to work either. Any solution would be greatly appreciated.

s0rce

5 Answers

15

I propose:

StringCases[
  {"Fe3O4", "CO", "MgO", "Uut14AuO6"},
  DigitCharacter .. | (_?UpperCaseQ ~~ ___?LowerCaseQ)
]
{{"Fe", "3", "O", "4"}, {"C", "O"}, {"Mg", "O"}, {"Uut", "14", "Au", "O", "6"}}

Or as a RegularExpression:

StringCases[
  {"Fe3O4", "CO", "MgO", "Uut14AuO6"},
  RegularExpression["\\d+|[A-Z][a-z]*"]
]
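If integer counts are wanted instead of digit strings, the digit pieces can be converted after splitting. This is only a minimal sketch built on the regex version above; the names parts and counts are illustrative and not part of the original answer.

```mathematica
(* split the formula into element symbols and digit runs, as above *)
parts = StringCases["Uut14AuO6", RegularExpression["\\d+|[A-Z][a-z]*"]];

(* turn all-digit strings into integers; element symbols stay as strings *)
counts = parts /. d_String /; StringMatchQ[d, DigitCharacter ..] :> FromDigits[d]
(* {"Uut", 14, "Au", "O", 6} *)
```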
Karsten7
Mr.Wizard
5

I prefer one of @Mr.Wizard's solutions based on StringCases, but here is a solution using StringSplit:

StringSplit["Fe3O4", RegularExpression["(?=[A-Z]|\\d)"]]
(* {"Fe", "3", "O", "4"} *)

It splits the string at any position that is followed by an uppercase letter or a digit. Since that also splits between consecutive digits, a look-behind is needed if multi-digit numbers are possible:

StringSplit["Fe23O42", RegularExpression["(?=[A-Z]|(?<!\\d)\\d)"]]
(* {"Fe", "23", "O", "42"} *)

This is the same pattern, except that a split before a digit now happens only when that digit is not preceded by another digit.
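To see what the look-behind changes, both patterns can be run on the same multi-digit input. This is a small sketch; the names naive and careful are just for illustration. The first pattern should break each number into single digits, while the second keeps the digit runs together.

```mathematica
(* look-ahead only: a split point precedes every digit *)
naive = StringSplit["Fe23O42", RegularExpression["(?=[A-Z]|\\d)"]];

(* look-ahead plus negative look-behind: split only before the first digit of a run *)
careful = StringSplit["Fe23O42", RegularExpression["(?=[A-Z]|(?<!\\d)\\d)"]];

{naive, careful}
```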

WReach
  • A powerful use of regular expressions. Is it possible to write this as a StringExpression? I can't think of a way at the moment. – Mr.Wizard Aug 29 '14 at 02:15
  • No, regex is the only way to express look-ahead or look-behind conditions. A shame, really, because regex patterns are ugly as sin... but it sure is nice to have the PCRE engine to draw upon for those tougher jobs. – WReach Aug 29 '14 at 05:01
3

You can use

chemSplit[s_String] := 
 Module[{pos = StringPosition[s, {_?UpperCaseQ, NumberString}, Overlaps -> False][[All, 1]]},
  StringSplit@StringInsert[s, " ", pos]
 ]

chemSplit["Fe3O4"]
{"Fe", "3", "O", "4"}
Karsten7
3
elements = SortBy[ElementData[#, "Abbreviation"] & /@ ElementData[], Minus@*StringLength];
StringCases["Fe3O2", DigitCharacter .. | elements]

{"Fe", "3", "O", "2"}

(Thanks to Mr.Wizard for syntax improvements.)
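The sort by descending length is presumably there so that two-letter symbols such as "Fe" are tried before one-letter symbols such as "F". A minimal sketch of that effect, using a hypothetical three-symbol list instead of the full ElementData list so that it runs without a data download:

```mathematica
(* hypothetical mini list, in an order where "F" comes before "Fe" *)
symbols = {"F", "Fe", "O"};

(* alternatives are tried in the order given, so "F" can win over "Fe" here *)
unsorted = StringCases["Fe3O4", DigitCharacter .. | Alternatives @@ symbols];

(* longest symbols first, as in the answer above *)
sorted = StringCases["Fe3O4",
   DigitCharacter .. | Alternatives @@ SortBy[symbols, -StringLength[#] &]];

{unsorted, sorted}
```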

C. E.
  • There is a precision about this that I like, and I can see the effort that you put into it. +1 One could replace the second part with: StringCases[string, DigitCharacter .. | elements] if desired. Also, these string functions are set up to work with lists of patterns so I believe you can drop the Apply[Alternatives, part. – Mr.Wizard Aug 28 '14 at 19:15
  • @Mr.Wizard Thanks, I like your version better so I incorporated it in the answer. – C. E. Aug 28 '14 at 19:29
  • +1 for @*, there is always something new to learn! – ybeltukov Aug 28 '14 at 20:29
2

Just to be different

f = Flatten[List @@@ WolframAlpha["formula " <> #, "Result"][[1, 1]]] &;
f /@ {"Fe2O3", "MgO"}

{{"Fe", 2, "O", 3}, {"Mg", "O"}}

This approach may seem silly, but it can easily be extended to other chemical data (e.g. molar mass).

ybeltukov
  • ybeltukov I am happy to see you posting again! Unfortunately you fell into the same trap I did; see the edit history of my answer. :^) – Mr.Wizard Aug 28 '14 at 18:41
  • @Mr.Wizard I'm also happy to join you again! When I tried to correct my answer, I arrived at exactly your answer, so I proposed another approach instead of deleting the post :) – ybeltukov Aug 28 '14 at 19:06