
I have several hundred sentence-strings with joined words. They were extracted from a text (and are in sequence).

Here is an example of one:

"thequick brown foxjumped over the lazydog"

I would like to be able to split up joined "word pairs":

"the quick brown fox jumped over the lazy dog"

How would one approach implementing such an operation in Mathematica? Ideas welcome!

Conor
  • How would you want to split "dogsled"? "dogs led" or "dog sled"? Or "carseat"? Or... And do you or don't you split "everything"? Or... – David G. Stork Aug 31 '18 at 20:33
  • @DavidG.Stork Good point. This seemed tractable at first glance, haha. I guess a human would attempt to split it based on his/her understanding of the "context" (what the preceding sentences represent), which may require a model of the world. – Conor Aug 31 '18 at 20:45
  • Of course a HUMAN would split based on context and high-level understanding. But what about an ALGORITHM? Are you seriously suggesting that your question involves solving the hard-artificial-intelligence problem needed to solve this splitting task? Really? – David G. Stork Aug 31 '18 at 21:16
  • Somewhat related: https://mathematica.stackexchange.com/q/3443/3056 and https://mathematica.stackexchange.com/q/119901/3056 – kirma Sep 02 '18 at 06:29
  • @ConorCosnett Even without contextual understanding (which might be built on the basis of large corpora using methods such as Markov chains or, in highly inflected languages, neural networks), splitting space-omitting text can get quite complicated quite quickly beyond the case of two words. Consider my earlier answer: https://mathematica.stackexchange.com/a/119905/3056 – kirma Sep 02 '18 at 06:34
  • Continuing my neural network idea: I guess it would be plausible to take a sufficiently large corpus (say, the text content of thousands of Wikipedia articles), feed it in with some spaces omitted, and train a recurrent neural network to add those spaces back. This would form a sort of dedicated "proofreader neural network" for the task, applicable to new inputs. A popular blog post on RNNs: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ – kirma Sep 02 '18 at 06:47

2 Answers


This is not really a solution, but a quick demonstration of trying to solve this problem with neural networks.

My approach is to take small snippets of text (ending up as fixed-size input) from a corpus, mangle them by removing varying amounts of spaces from some (or no!) locations, and train the neural network to recognise whether a space is missing at a specific offset of the sample string. (In my case, 16-letter inputs, looking at offset 8 of the string.) This approach doesn't use recurrent neural networks or anything fancy, just a list of letters as input and a single output; here it's a "Real" for visualization purposes, but it could very easily be a "Boolean" with the help of NetDecoder.

The bulk of the code is in the snippet-mangling part. It may be a bit messy, but it does basically what I described above. I use Darwin's On the Origin of Species as my corpus (a toy approach!), since it's easily available in Mathematica, comes in a clean format, and is neither too archaic nor far too short for the purpose. The following creates the training set and trains a small network:

ClearAll[trainednet, extractlength, position];

extractlength = 16;
position = 8;
trainednet = 
  With[{corpus = ExampleData[{"Text", "OriginOfSpecies"}], 
    size = 2000000},
   With[{trainingset = 
      With[{omits = 
             Sort@RandomSample[StringPosition[#, " "], 
               UpTo@RandomVariate[
                 BinomialDistribution[extractlength, 
                  2/extractlength]]]}, 
           StringTake[StringReplacePart[#, "", omits], 
             extractlength] -> 
            Boole@MemberQ[omits[[All, 1]] - Range@Length@omits + 1, 
              position]] &@
         StringTake[corpus, {#, # + 2 extractlength}] & /@ 
       RandomInteger[{1, StringLength@corpus - 2 extractlength}, 
        size], net = 
      NetChain[{UnitVectorLayer[], LinearLayer[512], LinearLayer[128],
         LinearLayer[32], SoftmaxLayer[], LinearLayer[]}, "Input" ->
        NetEncoder[{"Characters", "TargetLength" -> extractlength, 
          IgnoreCase -> True}], "Output" -> "Real"]},
    NetTrain[net, trainingset, BatchSize -> 4096, 
     MaxTrainingRounds -> 60, ValidationSet -> Scaled[0.1]]]];

This takes about an hour on my CPUs. Now I can define a function to visualize the weight the network assigns to adding a space ahead of each letter of the input:

ClearAll@missingspacevisualize; 

missingspacevisualize[str_String] := 
 BarChart@Table[
     Labeled[trainednet[StringTake[#, {i, i + extractlength}]], 
      StringTake[#, {i, i} + position - 1]], {i, 
      StringLength@# - extractlength}] &@
  StringJoin[StringRepeat[" ", extractlength - position - 1], str, 
   StringRepeat[" ", extractlength - position + 1]];

Now, an example of my own, demonstrating that the network is not all bull!

missingspacevisualize[
 "this sentenceis unfortunately not verywell written, I know inthis case"]

[bar chart: the network's missing-space weight for each letter of the sample sentence]

Well, that wasn't so bad for a dumb pile of tensor algebra! How does it perform on the string from the question? With a sufficiently long training run, surprisingly well!

missingspacevisualize["thequick brown foxjumped over the lazydog"]

[bar chart: missing-space weights for the question's example string]

This is not a global matching algorithm like my earlier answer to a similar problem (https://mathematica.stackexchange.com/a/119905/3056). So, how does this implementation perform in comparison to that one?

missingspacevisualize["tableapplecharitablecupboarding"]

[bar chart: missing-space weights for "tableapplecharitablecupboarding"]

Well, surprisingly poorly. I guess that with longer and longer training, the network increasingly expects long words to just keep going; admittedly, a string like this is not a typical "some spaces omitted" input.
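For comparison, here is a minimal sketch (an addition of my own, not the code from the linked answer) of a global dictionary-based splitter: it recursively enumerates every segmentation of a string into dictionary words, memoizing suffixes so repeated work stays cheap. Whether a given candidate appears in the output depends entirely on what DictionaryWordQ accepts.

```mathematica
(* Sketch: enumerate all segmentations of a string into dictionary words.
   Memoized (splitAll[s] = ...) so each suffix is computed only once. *)
ClearAll[splitAll];
splitAll[""] = {{}};
splitAll[s_String] := splitAll[s] =
  Join @@ Table[
    With[{head = StringTake[s, i]},
      If[DictionaryWordQ[head],
        Prepend[#, head] & /@ splitAll[StringDrop[s, i]],
        {}]],
    {i, StringLength[s]}];

(* Each result is a list of words covering the whole input, e.g. *)
splitAll["tableapplecharitablecupboarding"]
```

Unlike the network above, this gives exact global matches, but it has no notion of which segmentation is plausible and no tolerance for words outside the dictionary.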

kirma

I didn't check for speed, but one possibility is:

splitWord[s_] := If[DictionaryWordQ[s],
    s<>" ",
    StringReplaceList[
        s,
        StartOfString~~a__~~b__~~EndOfString /; AllTrue[{a,b}, DictionaryWordQ] :> a<>" "<>b<>" "
    ] /. {a_} :> a
]

Then:

splitWord["thequick"]
splitWord["carseat"]

"the quick "

{"cars eat ", "car seat "}

For your example:

words = splitWord /@ StringSplit @ "thequick brown foxjumped over the lazydog"

{"the quick ", "brown ", "fox jumped ", "over ", "the ", "lazy dog "}

Since there were no ambiguous splits, you can just StringJoin the output:

StringJoin[words]

"the quick brown fox jumped over the lazy dog "

If there were ambiguous splits, then perhaps you can use a function like TextStructure to choose the words that make the most grammatical sense.
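Short of full grammatical analysis, one hedged possibility (a sketch of my own, not part of the answer above) is to rank ambiguous splits by corpus word frequency using WordFrequencyData, which requires a connection to the Wolfram servers: keep the candidate whose rarest word is still the most common.

```mathematica
(* Sketch: disambiguate candidate splits by word frequency.
   WordFrequencyData needs online access; this is a heuristic,
   not a grammatical analysis. *)
ClearAll[pickSplit];
pickSplit[splits : {__String}] :=
  First@MaximalBy[splits,
    Min[WordFrequencyData /@ StringSplit[#]] &];

(* e.g. pickSplit[{"cars eat ", "car seat "}] keeps the split whose
   least frequent word is more common in the corpus. *)
```

This ignores sentence context entirely, so for genuinely context-dependent cases like "dogsled" from the comments, something like TextStructure over the whole sentence would still be needed.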

Henrik Schumacher
Carl Woll