I would like to ask how I can remove non-word characters from a string, but only in certain cases.

I have read this article, so I know how to get the words out of a string. My text is however a bit more complicated.

For example:

trialtext = ",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht";

From this text, I would like to get as output:

{"temp","sp.a","tiral","dump","NV-A","rambo","6833","16","rgcht"}

In other words, I want to split on spaces, commas, hyphens and dots, EXCEPT when a hyphen or a dot has a letter character both before and after it (the exception applies only to hyphens and dots, not to commas or other signs).

This has been my most successful attempt so far:

StringSplit[trialtext, 
 Except[WordCharacter, WordCharacter .. ~~ "." ~~ WordCharacter]]

{"temp sp.a tiral dump NV-A rambo.6833 16,rgcht"}

although I do not understand why, when I ask for ".", it also picks up "," and "-".

Therefore also the related question: can someone please explain to me why this

StringSplit[trialtext, Except[WordCharacter, ","]]

gives this output:

 {"temp sp.a tiral dump NV-A rambo.6833 16", "rgcht"}

while this:

StringSplit[trialtext, Except[WordCharacter, "."]]

produces this output:

{"temp", "sp", "a", "tiral", "dump", "NV", "A", "rambo", "6833", "16", "rgcht"}

Thanks a bunch!

Lena
  • It seems "." is interpreted as a regular expression inside Except, and in a regular expression "." matches every character except newline. – Kuba Oct 31 '14 at 11:59
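
If that reading is right, one way to sidestep the ambiguity is to avoid passing a bare "." as the second argument of Except. The following is only a sketch of two alternatives, not a statement about how Except is implemented: a plain string delimiter is matched literally, and Except[WordCharacter] .. spells out "any run of non-word characters" explicitly.

```mathematica
trialtext = ",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht";

(* A plain string delimiter is taken literally, so only dots split the text *)
StringSplit[trialtext, "."]
(* {",,temp sp", "a tiral - dump NV-A rambo", "6833", " 16,rgcht"} *)

(* Splitting on every run of non-word characters, stated explicitly *)
StringSplit[trialtext, Except[WordCharacter] ..]
(* {"temp", "sp", "a", "tiral", "dump", "NV", "A", "rambo", "6833", "16", "rgcht"} *)
```

Neither variant implements the "keep dots and hyphens between letters" rule; they only show how to state the delimiter unambiguously.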

3 Answers

Regular expressions are cryptic, but they offer look-ahead and look-behind capabilities that are unavailable to regular string patterns:

split[s_] :=
  StringSplit[s, RegularExpression["( |,|(?<![[:alpha:]])[-.]|[-.](?![[:alpha:]]))+"]]

split[",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht"]
(* {"temp", "sp.a", "tiral", "dump", "NV-A", "rambo", "6833", "16", "rgcht"} *)

This formulation respects the special rule that dots and dashes act as delimiters except when they have letters on both sides:

split["1.2.3.a.b.c   ---4-5-6-x-y-z---"]
(* {"1", "2", "3", "a.b.c", "4", "5", "6", "x-y-z"} *)

The key ingredient in this solution is the use of (?<![[:alpha:]])[-.] which can be interpreted as "a dot or dash that is not preceded by an alphabetic character". Similarly, [-.](?![[:alpha:]]) means "a dot or dash that is not followed by an alphabetic character". Look-ahead and look-behind patterns are particularly useful for this problem because they allow us to examine characters for matching purposes without considering them to be part of a delimiter itself.
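
To see the two look-around pieces in isolation, here is a small sketch that applies each one separately to short test strings (the test strings are made up for illustration). StringPosition reports which dots would count as delimiters, and StringCases shows that the neighbouring character is inspected but not consumed.

```mathematica
(* Look-ahead: a dot or dash NOT followed by a letter *)
StringPosition["a.b.1", RegularExpression["[-.](?![[:alpha:]])"]]
(* {{4, 4}}  only the dot before "1" qualifies *)

(* Look-behind: a dot or dash NOT preceded by a letter *)
StringPosition["1.a.b", RegularExpression["(?<![[:alpha:]])[-.]"]]
(* {{2, 2}}  only the dot after "1" qualifies *)

(* The match itself is just the delimiter; the "1" is not part of it *)
StringCases["x-1", RegularExpression["[-.](?![[:alpha:]])"]]
(* {"-"} *)
```

Because the look-arounds are zero-width, the characters next to a delimiter stay in the neighbouring pieces after the split instead of being swallowed by it.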

WReach
trialtext = ",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht";

StringTrim@StringSplit[trialtext, {"," | "-" | ".",
   x : PatternSequence[Except[WhitespaceCharacter] .. ~~ "." | "-" ~~LetterCharacter ..] :> x}]
(* {"temp", "sp.a", "tiral", "dump", "NV-A", "rambo", "6833", "16", "rgcht"} *)
kglr

As of version 10.1 there is TextWords, which gets most of the way there with very little effort, although as the output shows it leaves "rambo.6833" and "16,rgcht" unsplit:

TextWords[",,temp sp.a tiral - dump NV-A rambo.6833. 16,rgcht"]
(*{"temp", "sp.a", "tiral", "dump", "NV-A", "rambo.6833", "16,rgcht"}*)

Note that the implementation of the function is available to you with

??TextWords

It relies on a number of functions from the NaturalLanguageProcessing package, which rumour has it will be opened up further in Mathematica 11.

Charlotte Hadley