9

What is the best approach to a problem like this in Mathematica?

I can only manage this:

 func[stringlist_] := 
   With[{i = 
   Last@Flatten@
   Table[DictionaryLookup[StringTake[#[[2]], n]], {n, 14}]}, {i,
   StringDrop[#[[2]], StringLength@i]}] &@stringlist;
func1[string_] := 
   Rest@NestWhileList[func, {"", # <> CharacterRange["a", "z"]}, 
   Length@Characters@#[[2]] > 26 &][[All, 1]] &@string;

func1["tableapplechairtablecupboard"]

(* {"table", "apple", "chair", "table", "cupboard"} *)

with its obvious shortcomings.

Added

Here is a pre-made file (made from WordFrequencyData /@ DictionaryLookup[] - very slow!!) incase anyone wanted to make a weighted function:

wfd = ToExpression@
Import["https://raw.githubusercontent.com/martinq321/dictionary/master/values", "List"];

eg something like this (based on @kirma's answer below) can be used to faithfully recreate about 50% of test cases:

suggested@StringJoin@removespaces@   
TextSentences[WikipediaData["PageID" -> "5094570", "ArticlePlaintext"]][[1]]

(* the great wall of china is a series of fortifications made of stone brick tamped earth wood and other materials generally built along an east to west line across the historical northern borders of china to protect the chinese states and empires against the raids and invasions of the various nomadic groups of the eurasian steppe *)

though I am sure there are far quicker and more efficient methods.

martin
  • 8,678
  • 4
  • 23
  • 70

1 Answers1

14

Simple greedy matching can be done as follows: construct a string pattern with longest words first and find words:

StringCases["tableapplechairtablecupboard", 
 Alternatives @@ SortBy[DictionaryLookup[], Minus@*StringLength]]

(* {"table", "apple", "chair", "table", "cupboard"} *)

Intuitive assumption is that one might use Longest to find longest match every time, but it doesn't really seem to work in this case. Note that this variation may end up not matching the whole string as it always chooses the longest match without considering if the rest can match anything in the dictionary.

For all matches one can, for instance, find all word matches using StringPosition and then find all non-overlapping complete covers of the string using per-string-position Boolean logic and SatisfiabilityInstances:

Module[{string, positions},
  string = "tableapplecharitablecupboarding";
  positions = 
   StringPosition[string, Alternatives @@ DictionaryLookup[], 
    Overlaps -> All];
  StringTake[string, Pick[positions, #]] & /@ 
   SatisfiabilityInstances[
    And @@ (BooleanCountingFunction[{1}, Length@positions] @@@ 
       Transpose@
        MapIndexed[Function[{list, index}, # && c @@ index & /@ list],
          Table[#1 <= i <= #2, {i, StringLength@string}] & @@@ 
          positions]), Array[c, Length@positions], All]] // TableForm

$$ \begin{array}{cccccccc} \text{table} & \text{apple} & \text{char} & \text{it} & \text{able} & \text{cup} & \text{boar} & \text{ding} \\ \text{table} & \text{apple} & \text{char} & \text{it} & \text{able} & \text{cup} & \text{boarding} & \text{} \\ \text{table} & \text{apple} & \text{charitable} & \text{cup} & \text{boar} & \text{ding} & \text{} & \text{} \\ \text{table} & \text{apple} & \text{charitable} & \text{cup} & \text{boarding} & \text{} & \text{} & \text{} \\ \end{array} $$

Note that although StringPosition does find "cupboard" as a valid word substring, none of solutions contain it as remaining part of the string can't be split into set of word matches. Thus, "boar", "ding" and "boarding" are only valid endings for a match.

String matching functions do take an additional IgnoreCase -> True option, which allows for matching "I" on the above example. Sadly this option also exposes various capitalized non-words, such as "Le", whatever that might be. This problem can occur also with other naturally capitalized words, such as country names.

More procedural approaches to solution search are definitely possible, but in the lines of letting Mma avoid me constructing error-prone explicit algorithms, I've chosen the above approach which converts search to a satisfiability problem.

BTW: can anyone tell why WordList[] lacks such words as "is" and "or"? This is the reason for using DictionaryLookup[] instead.

kirma
  • 19,056
  • 1
  • 51
  • 93
  • 1
    many thanks - great answer :) – martin Jul 04 '16 at 17:34
  • 1
    Bounty added for your handling the more complicated problem through methods I do not recall seeing. – Mr.Wizard Jul 05 '16 at 06:47
  • Well this is awkward. I have temporarily pulled my bounty because it seems I cannot, even as a moderator, close a question with an active bounty. I wished to close this as a duplicate of (3443) but I also wish to recognize your unusual and intriguing solution to the problem. Any suggestion on how I do this? Sorry for the weirdness. :-/ – Mr.Wizard Jul 05 '16 at 06:55
  • @Mr.Wizard I'm open to your suggestions (especially as I'm not entirely certain what you would actually want to happen). :) – kirma Jul 05 '16 at 08:14
  • 1
    IncludeInflections -> True will include "is" but "or" – yode Jul 05 '16 at 18:30
  • 1
    "is" and "or" is "Stopwords" in Mathematica you should specify the type be "Stopwords".It will include those. – yode Jul 05 '16 at 18:36
  • @yode Intriguing thing about this is that "and" isn't such a word... – kirma Jul 05 '16 at 19:53
  • 1
    words=Catenate[WordList[#,Language->"English",IncludeInflections->True]&/@{"KnownWords","Stopwords"}]; will include all words.And if you run MemberQ[words, "and"],it will give True. – yode Jul 05 '16 at 20:05
  • @Mr. Wizard, why not award the bounty first before closing as a dupe? – J. M.'s missing motivation Jul 05 '16 at 20:06
  • @J.M. I tried to do just that, but it told me I had to wait 23 hours. Then I tried to close and it told me I couldn't because of a bounty. By the way do you agree that this is a duplicate? My f2 function is directly applicable here but perhaps I am biased. – Mr.Wizard Jul 05 '16 at 23:02
  • "it told me I had to wait 23 hours" - well, that's exactly what you do, @Mr. Wizard; that rule applies to everybody. Anyway, it looks like it can be bountied now; put in the bounty, wait a bit, and you can do the awarding and closure one after another. – J. M.'s missing motivation Jul 06 '16 at 01:55