13

Given a list of word characters, such as this one, I'd like to build a tree, similar to this makeTree function, but with the tree in a different format. So, for an input such as

test = {{"h", "e", "l", "l", "o"}, {"h", "o", "l", "o"}, {"h", 
    "e", "a"}, {"h", "e", "l", "l", "o", "s"}, {"b", "r", "o"}};

I'd like the output to be

output = StartOfString[
  "h"["e"["a"[EndOfString], 
    "l"["l"["o"[EndOfString, "s"[EndOfString]]]]], 
   "o"["l"["o"[EndOfString]]]], "b"["r"["o"[EndOfString]]]]

So that

TreeForm@output

gives

Mathematica graphics

So far I haven't got a perfect solution, which is why I'm not posting one. I know I must be missing lots of good ways to do this. What I want is not so much one single good solution, or "a fix to what I tried", but to see several ways to tackle the problem, particularly, but not at all limited to, elegant rule-based solutions.

J. M.'s missing motivation
Rojo
  • You know I don't post homework or without trying so if you want to close it I'll stay here defending it. It better be 5 against 1 – Rojo Jul 09 '12 at 23:08
  • brb, while I close this question with the force of a thousand suns! :P – rm -rf Jul 09 '12 at 23:17
  • I think you're missing an "l" in "hollow" within the output and TreeForm. – Mr.Wizard Jul 11 '12 at 12:43
  • @Mr.Wizard let's say I had an extra "l" in the input so I don't have to reupload the image :) – Rojo Jul 11 '12 at 13:22
  • @Rojo, take a look at the timing study w/ recursive version - http://mathematica.stackexchange.com/questions/69942/10-0-2-breaks-a-recursive-trie-query/69946#69946 – alancalvitti Jan 01 '15 at 20:01
  • @alancalvitti thanks, and happy new year – Rojo Jan 02 '15 at 07:32

5 Answers

9

I favor tree transformations, so I would reuse the makeTree function you linked to (because it is reasonably efficient), as follows:

ClearAll[makeRojoTree];
makeRojoTree[words_List] :=
 StartOfString @@ 
  ReplaceRepeated[
    makeTree[words], {
       ({} -> {}) :> EndOfString, 
        Rule[x_, l_List] :> x @@ l
    }
  ]

The argument can be either a list of words or a list of lists of word characters (as in your test), since makeTree is already polymorphic. Applying it to your test, we get:

makeRojoTree[test]

(*

StartOfString[
  "h"["e"["l"["l"["o"[EndOfString, "s"[EndOfString]]]], 
  "a"[EndOfString]], "o"["l"["o"[EndOfString]]]], 
  "b"["r"["o"[EndOfString]]]
]

*)

which is slightly different in terms of ordering of the branches from what you have as a desired answer, but this can be fixed if you impose some specific ordering.
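
As one possible way to impose an ordering, the sketch below sorts the branches at every level alphabetically by their head character, with EndOfString placed first. The names sortBranches and branchKey are made up for this illustration and are not part of the code above.

```mathematica
ClearAll[sortBranches, branchKey];

(* sort key: EndOfString sorts before every letter, a branch sorts by its head *)
branchKey[EndOfString] = "";
branchKey[h_[___]] := h;

(* leaves stay as they are; every other node gets its children sorted recursively *)
sortBranches[EndOfString] = EndOfString;
sortBranches[h_[args___]] := h @@ SortBy[sortBranches /@ {args}, branchKey];

sortBranches[
 StartOfString["h"["o"[EndOfString], "e"[EndOfString]], "b"[EndOfString]]]

(* StartOfString["b"[EndOfString], "h"["e"[EndOfString], "o"[EndOfString]]] *)
```

Note that a strict alphabetical order puts "b" before "h" at the top level, whereas the hand-written output in the question lists "h" first, so the exact ordering rule is a choice that has to be made.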

Comparing the performance to makeTree itself, we see that it is only about 1.5 times slower:

allWords=DictionaryLookup["*"];

(allTree=makeTree[allWords]);//Timing

(* {5.297,Null} *)

(rTree = makeRojoTree[allWords]);//AbsoluteTiming

(* {8.4375000,Null} *)

EDIT

To make this self-contained, here is a slightly tuned-up version of the linked makeTree, with the slightly different behaviour that it keeps duplicates:

ClearAll[makeTree];
makeTree[wrds : {__String}] := makeTree[Characters[wrds]];
makeTree[{b___, {}, a___}] := Prepend[makeTree[{b, a}], {} -> {}];
makeTree[wrds_] := 
 Reap[Scan[Sow[Rest[#], First@#] &, 
    wrds], _, #1 -> makeTree[#2] &][[2]]

and this is a tweaked version of it that returns what the OP wants without resorting to the original makeTree:

ClearAll[makeTreeRojo];
Module[{makeTreeRojoAux},
 makeTreeRojo[wrds_] := DeleteCases[StartOfString @@ makeTreeRojoAux[wrds], List, Infinity, Heads->True];
 makeTreeRojoAux[{b___, {}, a___}] := 
  Prepend[makeTreeRojoAux[{b, a}], EndOfString];
 makeTreeRojoAux[wrds_] := 
  Reap[Scan[Sow[Rest[#], First@#] &, 
     wrds], _, #1 @ makeTreeRojoAux[#2] &][[2]];
 ]
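
Both definitions rely on the same Reap/Sow grouping step. A minimal illustration of that step on its own, using a toy input chosen only for this example:

```mathematica
(* toy input, chosen only for this illustration *)
words = {{"h", "e"}, {"h", "o"}, {"b", "r"}};

(* Sow the tail of each word, tagged with its first letter;
   Reap collects the tails per tag and applies #1 -> #2 & to the tag and its tails *)
Reap[Scan[Sow[Rest[#], First[#]] &, words], _, #1 -> #2 &][[2]]

(* {"h" -> {{"e"}, {"o"}}, "b" -> {{"r"}}} *)
```

The tree builders then call themselves on each group of tails; a word that has run out of characters shows up as an empty list and is turned into the {} -> {} or EndOfString marker.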
Rojo
Leonid Shifrin
  • Nice one, +1... 1.5 times slower but less than half the storage, which isn't so much in either case – Rojo Jul 10 '12 at 00:00
  • @Rojo Thanks. One could as well modify the original makeTree to squeeze some speed out, but I did not bother. – Leonid Shifrin Jul 10 '12 at 00:11
  • I'll see if I understand it now, squeeze some speed out, and offer to edit your answer, adding the code to make it self contained – Rojo Jul 10 '12 at 00:18
  • @Rojo Be my guest. I am off to bed in 5 minutes, but feel free to edit the post. – Leonid Shifrin Jul 10 '12 at 00:23
  • @LeonidShifrin Is there a way to remove the EndOfString from makeTreeRojo? I've been trying with no success whatsoever. – Pragabhava Oct 23 '15 at 20:29
  • @Pragabhava Sorry, can't switch contexts to recall what was happening here, right now. But why would you want to do that? It just serves to mark that there is an exact word with the contents matching the prefix tree path. – Leonid Shifrin Oct 23 '15 at 20:34
  • @LeonidShifrin I've been trying to use your code to see the structure of a set of urls in TreeForm view. There are a lot of entries and EndOfString is not only out of context, but it clogs up the tree a lot. By replacing it with Nothing, I removed the clogging problem, but I ended up with a parenthesis in the last leaves of the tree. I was hoping there was a simple way to do this in your code, because I didn't want to hijack the answer (which now I'm doing :/ ). – Pragabhava Oct 23 '15 at 21:14
  • @Pragabhava I suggest you ask this as a separate question with all the information and a minimal reproducible code example. Then you will have a much better chance to get an answer from somebody. You can link to this question and the code, in your question. – Leonid Shifrin Oct 23 '15 at 21:22
8

Here is a very concise way to convert the list of strings to your desired format:

StartOfString @@ (
    (Composition @@ #)[EndOfString] & /@ test //. h_[a___, x_[y__], b___, x_[z__], c___] :> h[x[y, z], a, b, c])

(* StartOfString["h"["e"["l"["l"["o"[EndOfString, "s"[EndOfString]]]], 
    "a"[EndOfString]], "o"["l"["o"[EndOfString]]]], "b"["r"["o"[EndOfString]]]] *)

This is a rather perverse use of Composition, but the fact that Composition[f, g][x] is f[g[x]] lends itself very nicely to the way in which you want your tree built.
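
To make the nesting step concrete, here it is on a single word, together with an equivalent fold for comparison. The fold is only an illustrative alternative, not part of the one-liner above.

```mathematica
(* Composition of string "functions" nests them around the argument *)
(Composition @@ {"h", "e", "a"})[EndOfString]
(* "h"["e"["a"[EndOfString]]] *)

(* the same nesting written as a fold over the reversed characters *)
Fold[#2[#1] &, EndOfString, Reverse[{"h", "e", "a"}]]
(* "h"["e"["a"[EndOfString]]] *)
```

Each word therefore becomes one fully nested chain, and the //. rule is what merges chains that share a head at the same level.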

rm -rf
  • Almost great! but check the TreeForm. The e in hello, hea, and hellos aren't groups – Rojo Jul 09 '12 at 23:34
  • Btw, loved the Composition to "unflatten" – Rojo Jul 09 '12 at 23:38
  • Thanks for helping fix my pattern! :) – rm -rf Jul 10 '12 at 00:18
  • Well deserved +1. Exactly the kind of answer I was (and still am) hoping to see appear – Rojo Jul 10 '12 at 00:19
  • This is very nice conceptually, but a performance disaster for large lists of words / word letters (which is a general feature of this sort of patterns, alas). Since I think that performance is generally important for this type of problems, I don't upvote this time. – Leonid Shifrin Jul 10 '12 at 00:27
  • @LeonidShifrin No worries. The performance disaster was rather guaranteed with the use of //. and multiple ___ :) Nevertheless, I love this for some reason, because I would've never thought I'd use Composition this way... – rm -rf Jul 10 '12 at 00:28
  • I would vote for this for its beauty, if I did not know all too well that performance is important in the types of problems where this sort of constructs (trees) are likely to be used. Most of the time, one should not consider performance as the main and only criteria, but in cases like this it is likely to be important. – Leonid Shifrin Jul 10 '12 at 00:33
  • @LeonidShifrin, can you take a look at the timing study here: http://mathematica.stackexchange.com/questions/69942/10-0-2-breaks-a-recursive-trie-query/69946#69946 – alancalvitti Jan 01 '15 at 20:05
2

For the latest version on the cloud:

Clear[ds];
test = {{"h", "e", "l", "l", "o"}, {"h", "o", "l", "o"}, {"h", "e", 
    "a"}, {"h", "e", "l", "l", "o", "s"}, {"b", "r", "o"}};
ds = CreateDataStructure["ByteTrie", "a", "z"];
str=StringJoin/@test;
ds["Insert", #] & /@ str;

ds["Visualization"]

(* same output as shown below *)


Original (ironically it works on v12.2)

Using ByteTrie data structure: Trie Wiki

Clear[ds]
test = {{"h", "e", "l", "l", "o"}, {"h", "o", "l", "o"}, {"h", "e", 
    "a"}, {"h", "e", "l", "l", "o", "s"}, {"b", "r", "o"}};

ds = CreateDataStructure["ByteTrie", "a", "z"];
ds["Insert", #] & /@ test; (* It can take in string inputs as well *)

ds["Visualization"]


enter image description here


Methods such as "Strings" (stored) and "MemberQ" are available with the data structure.

{ds["MemberQ", "hello"], ds["FreeQ", "hello"]}

{True, False}

Syed
  • Syed - I get the following: "DataStructure::err: ByteTrie encountered an error of type ExpressionConversion processing Insert." (The documentation says that DataStructure is experimental, in case it matters.) – Rabbit Mar 27 '24 at 17:42
  • I encounter the same error, Version 13.3 for Windows 11. – Glenn Welch Mar 27 '24 at 18:37
  • It seems that ByteTrie accepts strings and lists of character codes, but not lists of characters. Try Scan[ds["Insert", #]&, StringJoin/@test] or Scan[ds["Insert", #] &, ToCharacterCode[{"help", "hire"}]]. – david Mar 27 '24 at 18:49
  • Regrettably, I have updated the answer to accommodate the regression. – Syed Mar 27 '24 at 23:08
2

Using a recursive Query:

byPrefixTree = Query[{
     Query[Select[# != {} &] /* GroupBy[First], All, Rest], 
     Query[Select[# == {} &]]}] /* Merge[Join] /* 
   Query[All, First, byPrefixTree[#] &];

Can be used directly to reconstruct a directory tree from FileNames[...,Infinity].

  • Can it be optimized? ~1000 files nested up to 15 folders deep took ~15sec.

  • So far I haven't been successful in merging the 2 Select calls into a single GroupBy[#=={}&], as then the keys may be any subset of {True,False}. I wanted to /* with MapAt or similar.

  • Operator form is broken: it throws a recursion limit exception.
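
On the optimization question, one thing that could be tried is the same group-by-first-element recursion written with plain functions instead of Query. The sketch below is not the code above and has not been timed against it; trieAssoc and toExpr are hypothetical names, and each word is assumed to end in EndOfString so that complete words stay visible in the result.

```mathematica
ClearAll[trieAssoc, toExpr];

(* group the non-empty lists by first element, then recurse on the tails *)
trieAssoc[lists_List] := 
  Map[trieAssoc, GroupBy[DeleteCases[lists, {}], First -> Rest]];

(* convert nested associations to the question's format:
   a key with an empty subtree becomes a leaf *)
toExpr[assoc_Association] := 
  KeyValueMap[If[#2 === <||>, #1, #1 @@ toExpr[#2]] &, assoc];

test = {{"h", "e", "l", "l", "o"}, {"h", "o", "l", "o"}, {"h", "e", "a"}, 
   {"h", "e", "l", "l", "o", "s"}, {"b", "r", "o"}};

StartOfString @@ toExpr[trieAssoc[Append[#, EndOfString] & /@ test]]

(* StartOfString["h"["e"["l"["l"["o"[EndOfString, "s"[EndOfString]]]], 
    "a"[EndOfString]], "o"["l"["o"[EndOfString]]]], "b"["r"["o"[EndOfString]]]] *)
```

Dropping the empty lists before grouping avoids the need for two separate Select calls, at the cost of relying on the end marker to record where a word stops.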

On Rojo's data:

  testData = 
     test // AssociationMap[{SoS, Sequence @@ # , EoS} &] // 
       KeyMap[StringJoin] // Dataset 

enter image description here

testData [byPrefixTree] // Normal

(* <|SoS-><|h-><|e-><|l-><|l-><|o-><|s-><|EoS-><|hellos-><||>|>|>,EoS-><|hello-><||>|>|>|>|>,a-><|EoS-><|hea-><||>|>|>|>,o-><|l-><|o-><|EoS-><|holo-><||>|>|>|>|>|>,b-><|r-><|o-><|EoS-><|bro-><||>|>|>|>|>|>|> *) 

Desired form (though unsorted)

(testData[byPrefixTree][Map[Normal, #, All] &][First] // 
    Normal) //. {Rule[EoS, val_] :> EoS, 
   Rule[x_, l_] :> x @@ l} // TreeForm

enter image description here

alancalvitti
2

I may be off the mark by not making nested compositions. So, for what it's worth:

pref[list_] := (f[m_] := m[[1 ;; #]] & /@ Range[Length@m]; 
  g[t_] := Rule @@@ Partition[t, 2, 1]; 
  Module[{str = {StartOfString, ##, EndOfString} & @@@ (Characters /@ 
        list)}, TreePlot[Union[Flatten[g /@ (f /@ str)]], 
    Automatic, {StartOfString}, 
    VertexRenderingFunction -> ({LightYellow, EdgeForm[Black], 
        Rectangle[# - {0.4, 0.2}, # + {0.4, 0.2}], Black, 
        Text[Last@#2, #1]} &)]])
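
To see what the two helpers f and g inside pref compute, here they are applied by hand to a single short word. This is only an illustration with made-up variable names (word, prefixes, edges); Take is used in place of the Part span, which gives the same prefixes.

```mathematica
word = {StartOfString, "h", "e", EndOfString};

(* what f does: all prefixes of the character list, shortest first *)
prefixes = Table[Take[word, k], {k, Length[word]}]
(* {{StartOfString}, {StartOfString, "h"}, {StartOfString, "h", "e"}, 
    {StartOfString, "h", "e", EndOfString}} *)

(* what g does: an edge from each prefix to the next longer one *)
edges = Rule @@@ Partition[prefixes, 2, 1]
(* {{StartOfString} -> {StartOfString, "h"}, 
    {StartOfString, "h"} -> {StartOfString, "h", "e"}, 
    {StartOfString, "h", "e"} -> {StartOfString, "h", "e", EndOfString}} *)
```

Because each vertex is a whole prefix, words that share a prefix share vertices, Union removes the duplicated edges, and the rendering function labels each vertex with the last element of its prefix.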

Testing:

pref[{"hello", "holo", "hea", "hellos", "bro"}]

enter image description here

pref[{"bro", "hea", "holo", "hello", "help"}]

enter image description here

ubpdqn