Removing non-word characters from a string

Question

I'm trying to get a list of words from a string. Sounds like an easy task for Mathematica. I have the following code:

text = "Merçi d'avoir pris le temps.";
ToLowerCase[#] & /@ StringSplit[text, Except[WordCharacter] ..]

However, the output is

{"merçi", "d", "avoir", "pris", "le", "temps"}

and not

{"merçi", "d'avoir", "pris", "le", "temps"}

because the ' is not a word character. Hence, I'd like to ignore the ', just like the -. Any idea on how to do that?

Can you give a short but complete example of the problem you are seeing (i.e. code I can copy and run directly), and mention which OS and which version of Mathematica you are using? I don't quite understand the question: what is the problem with é? WordCharacter does match it on my machine, and ToUpperCase/ToLowerCase work fine on it. – Szabolcs, Mar 14 '14 at 20:49
@RunnyKine I'd rather have it omit all non-letter characters (except for ' and -), as there are some pretty weird ones in these articles. I'd rather not have to specify them all manually. — Tim Vermeulen, Mar 14 '14 at 21:06
@Szabolcs You're right, é is matched by WordCharacter, I must've had some other error in my code. The other problem persists: I don't want ' and - taken out. Is there any way I can change Except[WordCharacter].. to something similar to Except[{WordCharacter,Characters["'-"]}]..? — Tim Vermeulen, Mar 14 '14 at 21:11
@timvermeulen OK, sounds good. I thought those characters would not be matched on some other OSs. Could you simply use Except[WordCharacter | "'" | "-"]? — Szabolcs, Mar 14 '14 at 21:19
@Szabolcs Yes! I hadn't thought of that. In general, I have trouble figuring out when to use | and when to use {...,...}, or when to use .. or ..., etcetera. Anything you'd recommend me to read? The official reference pages aren't very n00b-friendly. – Tim Vermeulen, Mar 14 '14 at 21:22
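
As a rough illustration of the operators asked about in this comment: `|` (Alternatives) matches any one of several patterns, `..` (Repeated) matches one or more copies of a pattern, and `...` (RepeatedNull) matches zero or more. The sample strings below are made up for the illustration and do not come from the thread.

```wolfram
(* p .. needs at least one match; p ... also accepts none *)
StringMatchQ["ab",   "a" ~~ ("-" ..)  ~~ "b"]   (* False: no "-" present *)
StringMatchQ["ab",   "a" ~~ ("-" ...) ~~ "b"]   (* True: zero "-" is allowed *)
StringMatchQ["a--b", "a" ~~ ("-" ..)  ~~ "b"]   (* True *)

(* p | q matches either alternative; here: runs of letters or hyphens *)
StringCases["a-b c'd", (LetterCharacter | "-") ..]
(* {"a-b", "c", "d"} *)
```

The parentheses around `"-" ..` are not strictly required, but they make the grouping explicit.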

RunnyKine · Accepted Answer · 2014-03-14T21:23:17.190

text = "Merçi d'avoir pris le temps."; 

ToLowerCase[#] & /@ StringSplit[text, Except[(LetterCharacter|"'"|"-")]..]

Gives:

{"merçi", "d'avoir", "pris", "le", "temps"}

Or, if some of the words contain digits, replace LetterCharacter with WordCharacter above.
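
To make the difference concrete, here is a small sketch on a made-up sample string (the string and the names `sample`, `letters` and `words` are assumptions for illustration, not part of the original answer):

```wolfram
sample = "Vitamine B12, c'est bien.";   (* hypothetical sample text *)

(* LetterCharacter: digits count as separators, so "B12" loses its "12" *)
letters = StringSplit[sample, Except[LetterCharacter | "'" | "-"] ..]
(* {"Vitamine", "B", "c'est", "bien"} *)

(* WordCharacter: letters and digits are both kept *)
words = StringSplit[sample, Except[WordCharacter | "'" | "-"] ..]
(* {"Vitamine", "B12", "c'est", "bien"} *)
```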

edited Mar 14 '14 at 21:23

answered Mar 14 '14 at 21:16

RunnyKine


Mr.Wizard · Answer 2 · 2014-03-17T07:30:44.183

Since you are trying to extract words I would suggest using StringCases rather than StringSplit. Also, there is no need to write ToLowerCase[#] & -- ToLowerCase /@ list would do. In fact because ToLowerCase is Listable it can be directly applied to the output list. Therefore I would write:

ToLowerCase @ StringCases[text, (WordCharacter | "'" | "-") ..]

{"merçi", "d'avoir", "pris", "le", "temps"}

StringCases is also somewhat faster than StringSplit on a large input:

big = "" <> Table[text, {50000}];

StringSplit[big, Except[(LetterCharacter | "'" | "-")] ..] // Timing // First

StringCases[big, (WordCharacter | "'" | "-") ..]           // Timing // First

0.1342
0.1092
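
Note that the two timed expressions above use different character classes (LetterCharacter versus WordCharacter). A like-for-like comparison could be sketched as follows, assuming the same `text` as in the question. The names `pattern`, `tSplit` and `tCases` are introduced here for illustration only, and timings vary by machine and version, so none are shown.

```wolfram
text = "Merçi d'avoir pris le temps.";
big = "" <> Table[text, {50000}];

(* one character class shared by both approaches *)
pattern = WordCharacter | "'" | "-";

(* AbsoluteTiming returns {seconds, result}; the inner ; discards the result *)
tSplit = First @ AbsoluteTiming[StringSplit[big, Except[pattern] ..];];
tCases = First @ AbsoluteTiming[StringCases[big, pattern ..];];

{tSplit, tCases}

(* sanity check: both approaches should produce the same word list *)
StringSplit[big, Except[pattern] ..] === StringCases[big, pattern ..]
```

AbsoluteTiming measures wall-clock time, whereas Timing measures CPU time; either is adequate for a rough comparison like this one.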

score 1 · Answer 3 · answered Apr 14 '23 at 03:37

It may surprise some that "merçi" is not a French dictionary word; the correct spelling is "merci", without the cedilla:

DictionaryWordQ[#, Language -> "French"] & /@ {"merçi", "merci"}

{False, True}

Other than that, the following does the job:

w1 = TextCases[text, "Word"]

{"Merçi", "d'avoir", "pris", "le", "temps"}

Another example:

TextCases["Répétez s'il vous plaît. Quand voulez-vous voyager?", "Word"]

{"Répétez", "s'il", "vous", "plaît", "Quand", "voulez-vous", "voyager"}

score 0 · Answer 4 · answered Mar 17 '14 at 03:47

Unless I'm missing some subtlety, this works:

    text = "Merçi d'avoir pris le temps.";
    StringSplit[text]
    (* {Merçi, d'avoir, pris, le, temps.} *)

answered Mar 17 '14 at 03:47

murray


One still has to get rid of the terminal period. Ditto if the original text had commas, semicolons, dashes, etc. And then suppose one of the "words" were an abbreviation, e.g., "Dr."? – murray Mar 17 '14 at 03:48
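
One possible way to handle the trailing punctuation mentioned in this comment, offered as a sketch rather than a complete tokenizer: trim punctuation from both ends of each whitespace-separated token with StringTrim. It does not address the abbreviation case ("Dr." would become "Dr"), and the name `tokens` is introduced only for this example.

```wolfram
text = "Merçi d'avoir pris le temps.";

(* split on whitespace, then strip punctuation at either end of each token *)
tokens = StringTrim[#, PunctuationCharacter ..] & /@ StringSplit[text]
(* {"Merçi", "d'avoir", "pris", "le", "temps"} *)
```

Apostrophes and hyphens inside a word are left alone, because StringTrim only removes matches at the beginning and end of each string.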
