I'm trying to get a list of words from a string. Sounds like an easy task for Mathematica. I have the following code:
text = "Merçi d'avoir pris le temps.";
ToLowerCase[#] & /@ StringSplit[text, Except[WordCharacter] ..]
However, the output is
{"merçi", "d", "avoir", "pris", "le", "temps"}
and not
{"merçi", "d'avoir", "pris", "le", "temps"}
because ' is not a word character. I'd like the split to leave ' inside words instead of treating it as a separator, and the same goes for -. Any idea on how to do that?
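To see which characters cause the split, each one can be tested against WordCharacter directly. This is a minimal sketch; the test characters are simply the ones discussed in this question.

```mathematica
(* Gives one True/False per string: True where the character matches WordCharacter *)
StringMatchQ[{"ç", "'", "-"}, WordCharacter]
```

Since WordCharacter covers letters and digits only, ç should match while ' and - should not, which is why they act as separators.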
[...] é? WordCharacter does match it on my machine, as ToUpperCase/ToLowerCase work fine on it. – Szabolcs Mar 14 '14 at 20:49
[...] ' and -), as there are some pretty weird ones in these articles. I'd rather not have to specify them all manually. – Tim Vermeulen Mar 14 '14 at 21:06
[...] é is matched by WordCharacter, I must've had some other error in my code. The other problem persists: I don't want ' and - taken out. Is there any way I can change Except[WordCharacter].. to something similar to Except[{WordCharacter,Characters["'-"]}]..? – Tim Vermeulen Mar 14 '14 at 21:11
Except[WordCharacter | "'" | "-"]? – Szabolcs Mar 14 '14 at 21:19
In general, I have trouble figuring out when to use | and when to use {...,...}, or when to use .. or ..., etcetera. Anything you'd recommend me to read? The official reference pages aren't very n00b-friendly. – Tim Vermeulen Mar 14 '14 at 21:22
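The suggestion from the comments can be sketched as follows. This is a minimal sketch, assuming that ' and - are the only extra characters to keep inside words; the variable names keep and words are made up for illustration. Here | (Alternatives) combines several options into one pattern, and .. (Repeated) means one or more repetitions, whereas ... (RepeatedNull) means zero or more.

```mathematica
text = "Merçi d'avoir pris le temps.";

(* Characters that should stay inside a word: word characters plus ' and - *)
keep = WordCharacter | "'" | "-";

(* Split on runs (one or more, hence ..) of any character not in keep *)
words = ToLowerCase /@ StringSplit[text, Except[keep] ..]
```

For the sample sentence this should give the list asked for in the question, with d'avoir kept as a single word. Further characters can be appended to keep with additional | alternatives.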