I'm trying to get a list of words from a string. Sounds like an easy task for Mathematica. I have the following code:

text = "Merçi d'avoir pris le temps.";
ToLowerCase[#] & /@ StringSplit[text, Except[WordCharacter] ..]

However, the output is

{"merçi", "d", "avoir", "pris", "le", "temps"}

and not

{"merçi", "d'avoir", "pris", "le", "temps"}

because ' is not a word character. I'd like ' to be kept inside words rather than treated as a separator, and likewise for -. Any idea how to do that?

Tim Vermeulen
  • Can you give a short but complete example of the problem/difficulty you are seeing (i.e. code I can copy and run directly), and mention what OS and what version of Mathematica you are using? I don't quite understand the question: what is the problem with é? WordCharacter does match it on my machine, and ToUpperCase/ToLowerCase work fine on it. – Szabolcs Mar 14 '14 at 20:49
  • @RunnyKine I'd rather have it omit all non-letter characters (except for ' and -), as there are some pretty weird ones in these articles. I'd rather not have to specify them all manually. – Tim Vermeulen Mar 14 '14 at 21:06
  • @Szabolcs You're right, é is matched by WordCharacter, I must've had some other error in my code. The other problem persists: I don't want ' and - taken out. Is there any way I can change Except[WordCharacter].. to something similar to Except[{WordCharacter,Characters["'-"]}]..? – Tim Vermeulen Mar 14 '14 at 21:11
  • @Szabolcs I've edited my question, hope it makes sense now. – Tim Vermeulen Mar 14 '14 at 21:14
  • @timvermeulen OK, sounds good. I thought those characters would not be matched on some other OSs. Could you simply use Except[WordCharacter | "'" | "-"]? – Szabolcs Mar 14 '14 at 21:19
  • @Szabolcs Yes! I hadn't thought of that.

    In general, I have trouble figuring out when to use | and when to use {...,...}, or when to use .. or ..., etcetera. Anything you'd recommend me to read? The official reference pages aren't very n00b-friendly.

    – Tim Vermeulen Mar 14 '14 at 21:22
  • @RunnyKine Wow, looks good. Thanks! – Tim Vermeulen Mar 14 '14 at 21:31
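
On the question in the comments above about when to use | and when to use .. or ..., a minimal sketch may help. It uses only built-in string-pattern constructs; the name wordChar is made up for illustration. | (Alternatives) picks between patterns at a single position, .. (Repeated) asks for one or more repetitions, and ... (RepeatedNull) asks for zero or more.

```mathematica
(* Hypothetical name: one character that may appear inside a word *)
wordChar = WordCharacter | "'" | "-";

(* | matches any one of the alternatives *)
StringMatchQ[{"a", "'", "-", "."}, wordChar]
(* {True, True, True, False} *)

(* .. needs at least one repetition; ... also accepts none *)
StringMatchQ["", wordChar ..]
(* False *)
StringMatchQ["", wordChar ...]
(* True *)

(* Without .. each separator character splits on its own,
   so the adjacent "," and " " leave an empty string between them *)
StringSplit["a, b", Except[wordChar]]
(* {"a", "", "b"} *)
StringSplit["a, b", Except[wordChar] ..]
(* {"a", "b"} *)
```

This is why the splitting patterns in this thread end in ..: a run of punctuation and spaces is then treated as a single separator.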

4 Answers

text = "Merçi d'avoir pris le temps."; 

ToLowerCase[#] & /@ StringSplit[text, Except[(LetterCharacter|"'"|"-")]..]

Gives:

{"merçi", "d'avoir", "pris", "le", "temps"}

Or, if some of the words contain digits, replace LetterCharacter with WordCharacter above.
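
To make the difference concrete, here is a small sketch with a sentence that contains a number. The sentence and the name sample are made up for illustration and are not from the question.

```mathematica
(* Hypothetical sample sentence containing a number *)
sample = "J'ai 2 chats.";

(* LetterCharacter: the digit counts as a separator and is dropped *)
StringSplit[sample, Except[LetterCharacter | "'" | "-"] ..]
(* {"J'ai", "chats"} *)

(* WordCharacter: digits are kept as words of their own *)
StringSplit[sample, Except[WordCharacter | "'" | "-"] ..]
(* {"J'ai", "2", "chats"} *)
```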

RunnyKine

Since you are trying to extract words I would suggest using StringCases rather than StringSplit. Also, there is no need to write ToLowerCase[#] & -- ToLowerCase /@ list would do. In fact because ToLowerCase is Listable it can be directly applied to the output list. Therefore I would write:

ToLowerCase @ StringCases[text, (WordCharacter | "'" | "-") ..]
{"merçi", "d'avoir", "pris", "le", "temps"}
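
The Listable point can be checked directly. The following is a small sketch; the list words is made up for illustration.

```mathematica
(* Listable functions thread over lists automatically *)
MemberQ[Attributes[ToLowerCase], Listable]
(* True *)

(* Hypothetical input list *)
words = {"Merçi", "D'AVOIR"};

(* Applying ToLowerCase to the list equals mapping it over the elements *)
ToLowerCase[words] === (ToLowerCase /@ words)
(* True *)
```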

StringCases is somewhat faster than the alternative:

big = "" <> Table[text, {50000}];

StringSplit[big, Except[(LetterCharacter | "'" | "-")] ..] // Timing // First

StringCases[big, (WordCharacter | "'" | "-") ..]           // Timing // First
0.1342

0.1092
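
Note that the two calls above use different character classes (LetterCharacter versus WordCharacter). A sketch of a like-for-like comparison could look as follows; it assumes a version of Mathematica that provides RepeatedTiming, and wordChar is a made-up name. Timings depend on machine and version, so none are shown.

```mathematica
text = "Merçi d'avoir pris le temps.";
big = "" <> Table[text, {50000}];

(* Same character class in both calls, so only the function differs *)
wordChar = WordCharacter | "'" | "-";

(* Sanity check that both approaches produce the same list of words *)
StringSplit[big, Except[wordChar] ..] === StringCases[big, wordChar ..]

(* RepeatedTiming returns {seconds, result}; First keeps the averaged time *)
First @ RepeatedTiming[StringSplit[big, Except[wordChar] ..];]
First @ RepeatedTiming[StringCases[big, wordChar ..];]
```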

Mr.Wizard

Some may be surprised to learn that "merçi" is not in the French dictionary, whereas "merci" is:

DictionaryWordQ[#, Language -> "French"] & /@ {"merçi", "merci"}

{False, True}

Other than that, the following does the job:

w1 = TextCases[text, "Word"]

{"Merçi", "d'avoir", "pris", "le", "temps"}


Another example:

TextCases["Répétez s'il vous plaît. Quand voulez-vous voyager?", \
"Word"]

{"Répétez", "s'il", "vous", "plaît", "Quand", "voulez-vous",
"voyager"}
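
Since the question asks for lowercase words, the result can be passed through ToLowerCase. This is a minimal sketch; it assumes TextCases returns the words shown above for this text.

```mathematica
text = "Merçi d'avoir pris le temps.";

(* ToLowerCase is Listable, so it applies to the whole list at once *)
ToLowerCase @ TextCases[text, "Word"]
(* {"merçi", "d'avoir", "pris", "le", "temps"}, given the TextCases output above *)
```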

Syed

Unless I'm missing some subtlety, this works:

    text = "Merçi d'avoir pris le temps.";
    StringSplit[text]
(* {Merçi, d'avoir, pris, le, temps.} *)
murray
  • One still has to get rid of the terminal period. Ditto if the original text had commas, semicolons, dashes, etc. And then suppose one of the "words" were an abbreviation, e.g., "Dr."? – murray Mar 17 '14 at 03:48
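
One possible way to handle the terminal period after a whitespace split is to trim punctuation from the ends of each token. This is only a sketch, not part of the answer above, and it does not solve the abbreviation problem: "Dr." would lose its period as well.

```mathematica
text = "Merçi d'avoir pris le temps.";

(* Trim punctuation from both ends of each token;
   ' and - in the interior of a word are left untouched *)
StringTrim[#, PunctuationCharacter ..] & /@ StringSplit[text]
(* {"Merçi", "d'avoir", "pris", "le", "temps"} *)
```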