
I have several hundred sentence-strings with joined words. They were extracted from a text (and are in sequence).

Here is an example of one:

"thequick brown foxjumped over the lazydog"

I would like to be able to split up joined "word pairs":

"the quick brown fox jumped over the lazy dog"

How would one approach implementing such an operation in Mathematica? Ideas welcome!

Conor
  • How would you want to split "dogsled"? "dogs led" or "dog sled"? Or "carseat"? Or... And do you or don't you split "everything"? Or... – David G. Stork Aug 31 '18 at 20:33
  • @DavidG.Stork Good point. This seemed tractable at first glance, haha. I guess a human would attempt to split it based on his/her understanding of the "context" (what the preceding sentences represent), which may require a model of the world. – Conor Aug 31 '18 at 20:45
  • Of course a HUMAN would split based on context and high-level understanding. But what about an ALGORITHM? Are you seriously suggesting that your question involves solving the hard-artificial-intelligence problem needed to solve this splitting task? Really? – David G. Stork Aug 31 '18 at 21:16
  • Somewhat related: https://mathematica.stackexchange.com/q/3443/3056 and https://mathematica.stackexchange.com/q/119901/3056 – kirma Sep 02 '18 at 06:29
  • @ConorCosnett Even without contextual understanding (which might be built on the basis of large corpora using methods such as Markov chains or, in highly inflected languages, neural networks), splitting space-omitting text can get quite complicated quite quickly beyond the case of two words. Consider my earlier answer: https://mathematica.stackexchange.com/a/119905/3056 – kirma Sep 02 '18 at 06:34
  • Continuing my neural network idea: I guess it would be plausible to take a sufficiently large corpus (say, the text content of thousands of Wikipedia articles), feed it in with some spaces omitted, and train a recurrent neural network to add those spaces back. This would form a sort of dedicated "proofreader neural network" for the task, applicable to new inputs. A popular blog post on RNNs: http://karpathy.github.io/2015/05/21/rnn-effectiveness/ – kirma Sep 02 '18 at 06:47

2 Answers


This is not really a solution, but a quick demonstration of trying to solve this problem with neural networks.

My approach is to take small snippets of text (ending up as fixed-size input) from a corpus, mangle them by removing varying amounts of spaces from some (or no!) locations, and train the neural network to recognise whether a space is missing at a specific offset of the sample string. (In my case, 16-letter inputs, looking at offset 8 of the string.) This approach doesn't use recurrent neural networks or anything fancy, just a list of letters as input and a single output; here it's a "Real" for visualization purposes, but it could very easily be a "Boolean" with the help of NetDecoder.

The bulk of the code is in the snippet-mangling part. It may be a bit messy, but it does basically what I described above. I use Darwin's On the Origin of Species as my corpus (a toy approach!), since it's easily available in Mathematica, comes in a clean format, and is neither too archaic nor far too short for the purpose. The following creates the training set and trains a small network:

ClearAll[trainednet, extractlength, position];

extractlength = 16;
position = 8;
trainednet = 
  With[{corpus = ExampleData[{"Text", "OriginOfSpecies"}], 
    size = 2000000},
   With[{trainingset = 
      With[{omits = 
             Sort@RandomSample[StringPosition[#, " "], 
               UpTo@RandomVariate[
                 BinomialDistribution[extractlength, 
                  2/extractlength]]]}, 
           StringTake[StringReplacePart[#, "", omits], 
             extractlength] -> 
            Boole@MemberQ[omits[[All, 1]] - Range@Length@omits + 1, 
              position]] &@
         StringTake[corpus, {#, # + 2 extractlength}] & /@ 
       RandomInteger[{1, StringLength@corpus - 2 extractlength}, 
        size], net = 
      NetChain[{UnitVectorLayer[], LinearLayer[512], LinearLayer[128],
         LinearLayer[32], SoftmaxLayer[], LinearLayer[]}, "Input" ->
        NetEncoder[{"Characters", "TargetLength" -> extractlength, 
          IgnoreCase -> True}], "Output" -> "Real"]},
    NetTrain[net, trainingset, BatchSize -> 4096, 
     MaxTrainingRounds -> 60, ValidationSet -> Scaled[0.1]]]];

This takes about an hour on my CPUs. Now I can define a function to visualize the weight the network assigns to adding a space ahead of each letter of the input:

ClearAll@missingspacevisualize; 

missingspacevisualize[str_String] := 
 BarChart@Table[
     Labeled[trainednet[StringTake[#, {i, i + extractlength}]], 
      StringTake[#, {i, i} + position - 1]], {i, 
      StringLength@# - extractlength}] &@
  StringJoin[StringRepeat[" ", extractlength - position - 1], str, 
   StringRepeat[" ", extractlength - position + 1]];

Now, an example of my own, demonstrating that the network is not all bull!

missingspacevisualize[
 "this sentenceis unfortunately not verywell written, I know inthis case"]

[bar chart: the network's missing-space weight for each letter of the sample sentence]

Well, that wasn't so bad for a dumb pile of tensor algebra! How does it perform on the string from the question? With a sufficiently long training run, surprisingly well!

missingspacevisualize["thequick brown foxjumped over the lazydog"]

[bar chart: missing-space weights for the question's example string]

This is not a global matching algorithm like my earlier answer to a similar problem (https://mathematica.stackexchange.com/a/119905/3056). So, how does this implementation perform in comparison to that one?

missingspacevisualize["tableapplecharitablecupboarding"]

[bar chart: missing-space weights for "tableapplecharitablecupboarding"]

Well, surprisingly poorly. I guess that with longer and longer training, the network increasingly expects long words to just keep going; admittedly, a string like this is not a typical "some spaces omitted" input.
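For comparison, here is a minimal sketch (an addition of my own, not the code from the linked answer) of a global dictionary-based splitter: it recursively enumerates every segmentation of a string into dictionary words, memoizing suffixes so repeated work stays cheap. Whether a given candidate appears in the output depends entirely on what DictionaryWordQ accepts.

```mathematica
(* Sketch: enumerate all segmentations of a string into dictionary words.
   Memoized (splitAll[s] = ...) so each suffix is computed only once. *)
ClearAll[splitAll];
splitAll[""] = {{}};
splitAll[s_String] := splitAll[s] =
  Join @@ Table[
    With[{head = StringTake[s, i]},
      If[DictionaryWordQ[head],
        Prepend[#, head] & /@ splitAll[StringDrop[s, i]],
        {}]],
    {i, StringLength[s]}];

(* Each result is a list of words covering the whole input, e.g. *)
splitAll["tableapplecharitablecupboarding"]
```

Unlike the network above, this gives exact global matches, but it has no notion of which segmentation is plausible and no tolerance for words outside the dictionary.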

kirma

I didn't check for speed, but one possibility is:

splitWord[s_] := If[DictionaryWordQ[s],
    s<>" ",
    StringReplaceList[
        s,
        StartOfString~~a__~~b__~~EndOfString /; AllTrue[{a,b}, DictionaryWordQ] :> a<>" "<>b<>" "
    ] /. {a_} :> a
]

Then:

splitWord["thequick"]
splitWord["carseat"]

"the quick "

{"cars eat ", "car seat "}

For your example:

words = splitWord /@ StringSplit @ "thequick brown foxjumped over the lazydog"

{"the quick ", "brown ", "fox jumped ", "over ", "the ", "lazy dog "}

Since there were no ambiguous splits, you can just StringJoin the output:

StringJoin[words]

"the quick brown fox jumped over the lazy dog "

If there were ambiguous splits, then perhaps you can use a function like TextStructure to choose the words that make the most grammatical sense.
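Short of full grammatical analysis, one hedged possibility (a sketch of my own, not part of the answer above) is to rank ambiguous splits by corpus word frequency using WordFrequencyData, which requires a connection to the Wolfram servers: keep the candidate whose rarest word is still the most common.

```mathematica
(* Sketch: disambiguate candidate splits by word frequency.
   WordFrequencyData needs online access; this is a heuristic,
   not a grammatical analysis. *)
ClearAll[pickSplit];
pickSplit[splits : {__String}] :=
  First@MaximalBy[splits,
    Min[WordFrequencyData /@ StringSplit[#]] &];

(* e.g. pickSplit[{"cars eat ", "car seat "}] keeps the split whose
   least frequent word is more common in the corpus. *)
```

This ignores sentence context entirely, so for genuinely context-dependent cases like "dogsled" from the comments, something like TextStructure over the whole sentence would still be needed.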

Henrik Schumacher
Carl Woll