36

What is the most space and time efficient way to implement a Trie in Mathematica?

Will it be practically faster than what is natively available in appropriate cases?

Szabolcs
  • 234,956
  • 30
  • 623
  • 1,263
Mr.Wizard
  • 271,378
  • 34
  • 587
  • 1,371
  • Did this arise because of this question? If so, then once again, thank you for providing the exact formal name of the problem! :) – István Zachar Mar 26 '12 at 08:06
  • @István no, look at the date on this. It is not clear to me how this structure would be used to solve that problem, though it does appear related. (back to lurking now...) – Mr.Wizard Mar 26 '12 at 11:32
  • @István well, it seems that it's clear to you. I'll read your answer in detail tomorrow. :-) – Mr.Wizard Mar 26 '12 at 11:33
  • Strange I don't remember this question at all... Admit that moderators above 10K are allowed to use SE time machines. Also, I wonder whether a tree or tree-graph tag would be useful, as it would covere some questions like this, would fit nicely between list-manipulation and graphs-and-networks (though being a subset of the latter) and (as my major reason) would have helped me to find this post earlier. – István Zachar Mar 26 '12 at 12:39
  • @István add tags as you see fit. – Mr.Wizard Mar 27 '12 at 03:43
  • Now I finally have a definite answer to the second part of your question: yes, there are non-trivial cases where trie implementation will work much faster than, say, DictionaryLookup, due to the nature of the problem. I added a link to one such in my edit. – Leonid Shifrin May 11 '12 at 18:33
  • @Mr.Wizard, related http://mathematica.stackexchange.com/questions/8111/building-a-tree. What do you think of the approach via Dataset in terms of performance? – alancalvitti Nov 23 '14 at 18:38
  • @alancalvitti I am taking a bit of a break from Mathematica and I haven't explored Dataset as an alternative to Leonid's code. I suggest you ask Leonid his opinion of this. – Mr.Wizard Nov 26 '14 at 17:03
  • @Mr.Wizard, will do. Have a good break - hope you'll be back. – alancalvitti Nov 26 '14 at 19:03
  • @LeonidShifrin, when you have the chance, could you look at the link above - btw, for some applications the query can be reduced to a single If- currently testing. Association and Dataset provide very compact syntax - so it's worth optimizing. – alancalvitti Nov 26 '14 at 19:11
  • 1
    @alancalvitti I have updated my post below. As to Dataset, I think it will be an overkill and actually not an appropriate thing to do to use it for a trie. It will also be much slower. – Leonid Shifrin Nov 27 '14 at 16:12
  • @LeonidShifrin, in an increasingly agile world, dev time is the bottleneck vs computing time - yet worth optimizing. Based on real data ~250m LeafCount (time series, etc) it often takes longer just to Import and thread keys to construct Associations than to execute the recursive tries, which I use routinely b/c they are extremely useful in indexing & analysis. – alancalvitti Nov 28 '14 at 14:43
  • @alancalvitti My point has been that it is important to distinguish different aspects. Trie by itself is a specialized data structure, and as such, should be "mean and lean": efficient in the first place, and having some simple well-defined API. How to make it syntactically more pleasant is a different question. One may use Dataset, or some custom wrappers, or whatever. But when things get mixed, the result is that it is much harder to understand performance. In the context of Mathematica, this is particularly important. – Leonid Shifrin Nov 28 '14 at 16:42
  • 1
    @LeonidShifrin, I understand the distinction but the reality in statistical analysis is that queries will vary w/ the specifics of the task and (often noSQL) data structures provided. Dataset, Query and even operator form have already reduced my codebase & dev time >5x over V9. It's not a "wrapper". The ideal solution is Dataset syntax w/ fast algorithms. For example, how easily can you modify your code to handle frequencies like here: http://mathematicaforprediction.wordpress.com/2013/12/06/tries-with-frequencies-for-data-mining/ – alancalvitti Nov 28 '14 at 17:32
  • @alancalvitti The reality of Mathematica is that every generalization potentially leads to a serious loss of performance. The language is so high-level, that we don't gain the performance, we can only not lose or lose minimally. Therefore, the utmost concern of the implementor of a generic functionality in Mathematica is performance. I speak from personal experience here. Now, wrapping data structures into wrappers like Dataset, which provide convenient syntax, is a separate step. I don't mean to say that Dataset is just a wrapper, because it is not (it generates powerful queries, ... – Leonid Shifrin Nov 28 '14 at 18:02
  • @alancalvitti ... type-checks data, etc.). But, I insist on the separation of data structures and syntactic layer. Association itself is a new efficient data structure, and it was implemented in C for speed. Even when we implement new data structures in Mathematica (top-level), they must be made as fast as possible. Only then, one adds syntactic layer on top, as a separate step. Both are important, but the core data structures should not IMO be concerned with syntactic convenience - they just need to be fast and have a clear well-defined API. – Leonid Shifrin Nov 28 '14 at 18:05
  • 1
    @alancalvitti I understand your perspective as being a user who wants compact syntax, but for any realistic problem this is not the only dimension. What I have described is a developer's view on how to structure the functionality so that it works well for both performance and convenience dimensions. From this viewpoint, Dataset certainly is a wrapper (in the sense I explained above). Once again: I am not saying that convenience isn't important, I am saying that it is a different aspect, from pure data structure implementation. – Leonid Shifrin Nov 28 '14 at 18:08
  • @LeonidShifrin, re your last comment above - my point was Dataset is a uniform wrapper - unlike ad hoc code below. Let's see how long it takes you to refactor your method to manage Tallys for instance. – alancalvitti Nov 28 '14 at 22:15
  • @alancalvitti You seem to not be getting my point, despite my numerous attempts to make it clear. I do not dispute the usefulness of uniform wrapper. And I don't claim my code (or the approach to data structures implementation and design I advocate) to provide generic wrappers. I was just saying that a question about a design of a particular data structure like a trie (the one asked in this page) is largely unrelated to the one of designing uniform wrappers to incorporate many data structures under some uniform syntactic umbrella. These are different layers of abstraction, and ... – Leonid Shifrin Nov 28 '14 at 23:51
  • @alancalvitti ... one should not be mixing them. Dataset is a kind of a database, and defines its own query language. Which data structures it uses under the hood is its own business, and the end user could not care less, as long as the speed and convenience is satisfactory. Trie and others are particular data structures. They are constructed to be used in very specific algorithms. They are not general by definition. Asking your database engine to be at the same time a collection of data structures / algorithms doesn't strike me as a good idea. This has been my last comment here. – Leonid Shifrin Nov 28 '14 at 23:56
  • @LeonidShifrin, re "...defines its own query language. Which data structures it uses under the hood is its own business" - seems an opportunity to insert an interface b/w the syntax and the engine. Knuth & assembly language are long gone. – alancalvitti Nov 29 '14 at 01:03
  • @LeonidShifrin, I don't expect a reply, or restart the argument but FYI, it took ~340sec to Import and thread Keys for some time series data, and only ~18 sec to make (small Depth ~10) Trie recursively - so in this case your approach doesn't necessarily speed up the overall analytic workflow - though I recognize this is outside OPs question. – alancalvitti Dec 01 '14 at 21:39
  • @alancalvitti This only means the Import is grossly inefficient. Also, list of Associations is by itself not the most efficient format. It has nothing to do with data structures, they must be implemented as efficiently as possible. – Leonid Shifrin Dec 02 '14 at 13:01
  • 2
    So, five years later, I will ask...what have you, err, tried? (I crack me up.) – Daniel Lichtblau Feb 01 '17 at 02:13
  • @LeonidShifrin, fyi > http://mathematica.stackexchange.com/a/137422/801 – alancalvitti Feb 09 '17 at 22:23
  • @alancalvitti Thanks for the info / link. – Leonid Shifrin Feb 10 '17 at 17:11

3 Answers3

36

UPDATE

Since version 10, we have Associations. Here is the modified code for trie building and querying, based on Associations. It is almost the same as the old code (which is below):

ClearAll[makeTreeAssoc];
makeTreeAssoc[wrds : {__String}] := Association@makeTreeAssoc[Characters[wrds]];
makeTreeAssoc[wrds_ /; MemberQ[wrds, {}]] := 
    Prepend[makeTreeAssoc[DeleteCases[wrds, {}]], {} -> {}];
makeTreeAssoc[wrds_] := 
   Reap[
     If[# =!= {}, Sow[Rest[#], First@#]] & /@ wrds, 
     _, 
     #1 -> Association@makeTreeAssoc[#2] &
   ][[2]]

You can see that the only difference is that Association is added to a couple of places, otherwise it's the same code. The lookup functions also are very similar:

ClearAll[getSubTreeAssoc];
getSubTreeAssoc[word_String, tree_] := Fold[Compose, tree, Characters[word]]

ClearAll[inTreeQAssoc];
inTreeQAssoc[word_String, tree_] := KeyExistsQ[getSubTreeAssoc[word, tree], {}]

The tests similar to the ones below (for entire dictionary) show that the lookup based on this trie (Associations - based) is about 3 times faster than the one based on rules, for a trie built from a dictionary. The new implementation of getWords is left as an exercise to the reader (in fact, that function could be optimized a lot, by storing entire words as leaves in the tree, so that one doesn't have to use StringJoin and combine the words).


A combination of rules and recursion is able to produce rather powerful solutions. Here is my take on it:

ClearAll[makeTree];
makeTree[wrds : {__String}] := makeTree[Characters[wrds]];
makeTree[wrds_ /; MemberQ[wrds, {}]] := 
     Prepend[makeTree[DeleteCases[wrds, {}]], {} -> {}];
makeTree[wrds_] := 
    Reap[If[# =!= {}, Sow[Rest[#], First@#]] & /@ 
       wrds, _, #1 -> makeTree[#2] &][[2]]

ClearAll[getSubTree];
getSubTree[word_String, tree_] := Fold[#2 /. #1 &, tree, Characters[word]]

ClearAll[inTreeQ];
inTreeQ[word_String, tree_] :=  MemberQ[getSubTree[word, tree], {} -> {}]

ClearAll[getWords];
getWords[start_String, tree_] :=
  Module[{wordStack = {}, charStack = {}, words},
    words[{} -> {}] :=
      wordStack = {wordStack, StringJoin[charStack]};
    words[sl_ -> ll_List] :=
      Module[{},
        charStack = {charStack, sl};
        words /@ ll;
        charStack = First@charStack;
      ];
    words[First@Fold[{#2 -> #1} &, getSubTree[start, tree], 
         Reverse@Characters[start]]
    ];
    ClearAll[words];
    Flatten@wordStack];

The last function serves to collect the words from a tree, by performing a depth-first tree traversal and maintaining the stack of accumulated characters and words.

Here is a short example:

In[40]:= words = DictionaryLookup["absc*"]
Out[40]= {abscess,abscessed,abscesses,abscessing,abscissa,abscissae,abscissas,
   abscission,abscond,absconded,absconder,absconders,absconding,absconds}

In[41]:= tree = makeTree[words]
Out[41]= {a->{b->{s->{c->{e->{s->{s->{{}->{},e->{d->{{}->{}},s->{{}->{}}},
      i->{n->{g->{{}->{}}}}}}},i->{s->{s->{a->{{}->{},e->{{}->{}},s->{{}->{}}},
        i->{o->{n->{{}->{}}}}}}},o->{n->{d->{{}->{},e->{d->{{}->{}},r->{{}->{},s->{{}->{}}}},
       i->{n->{g->{{}->{}}}},s->{{}->{}}}}}}}}}}

In[47]:= inTreeQ[#,tree]&/@words
Out[47]= {True,True,True,True,True,True,True,True,True,True,True,True,True,True}

In[48]:= inTreeQ["absd",tree] 
Out[48]= False

In[124]:= getWords["absce", tree]
Out[124]= {"abscess", "abscessed", "abscesses", "abscessing"}

I only constructed here a bare-bones tree, so you can only test whether or not the word is there, but not keep any other info. Here is a larger example:

In[125]:= allWords =  DictionaryLookup["*"];

In[126]:= (allTree = makeTree[allWords]);//Timing
Out[126]= {5.375,Null}

In[127]:= And@@Map[inTreeQ[#,allTree]&,allWords]//Timing
Out[127]= {1.735,True}

In[128]:= getWords["pro",allTree]//Short//Timing
Out[128]= {0.015,{pro,proactive,proactively,probabilist,
    <<741>>,proximate,proximately,proximity,proxy}}

In[129]:= DictionaryLookup["pro*"]//Short//Timing
Out[129]= {0.032,{pro,proactive,proactively,probabilist,<<741>>,
    proximate,proximately,proximity,proxy}}

I don't know which approach has been used for the built-in functionality, but the above implementation seems to be generally in the same calss for performance. The slowest part is due to the top-level tree-traversing code in getWords. It is slow because the top-level code is slow. One could speed it up considerably by hashing words to integers - then it can be Compiled. This is how I'd do that, if I were really concerned with speed.

EDIT

For a really nice application of a Trie data structure, where it allows us to achieve major speed-up (w.r.t. using DictionaryLookup, for example), see this post, where it was used it to implement an efficient Boggle solver.

Leonid Shifrin
  • 114,335
  • 15
  • 329
  • 420
  • Thank you Leonid. I was waiting for other answers to compare but you are too intimidating. :-) – Mr.Wizard Jan 28 '12 at 23:58
  • @Spartacus (Mr.Wizard - sorry, old habits) - I did not mean to. Thanks for the accept. – Leonid Shifrin Jan 29 '12 at 10:25
  • Thanks for the update, even it it wasn't specifically for me. – Mr.Wizard Nov 27 '14 at 16:21
  • @Mr.Wizard No problem :). This didn't take any time to check, in any case. I was actually a bit surprised with a speedup, given a small number of rules on every level. – Leonid Shifrin Nov 27 '14 at 16:28
  • Now we are talking, it needed to be 3 times faster to deserve my +1, so there you go (kidding, I was surprised to see I hadn't voted before) – Rojo Nov 30 '14 at 05:11
  • @Rojo Thanks for the vote, but the credit goes to Association and those desinged and who implemented it. I only had to stick it in two places. Besides, I could as well use Dispatch in my original code, likely with a similar effect. – Leonid Shifrin Nov 30 '14 at 11:18
  • Used (just the tree building part) here – Dr. belisarius Aug 27 '15 at 20:09
  • @belisarius Thanks for the info. I upvoted. Looks to be indeed comparatively one of the cleanest and shortest solutions. The {} -> {} part of my code is rather arbitrary, and partly was dictated by the need of this terminal element to be an (idle) rule - which was only necessary in the original code that used lists of rules. The code using Associations can use some more easy-to-understand expression for a terminal element. – Leonid Shifrin Aug 27 '15 at 21:08
  • @LeonidShifrin thanks to you. Removed the unneeded code. I'll try the Associations thing when my wallet allows v10 :) – Dr. belisarius Aug 27 '15 at 21:53
11

This might not give you the answer you expect, neither is this better than Leonid's solution, but. Since your fairly general question leaves a lot of room for answers and since I felt that it might be relevant, I gave it a go.

Assuming, that we have a list of prefix representations of a string (e.g. from here), it can be plotted easily with TreeForm:

decompTree = {"ar", {"c", {"h", {{"b", {"i", {"s", {{"h", {"o", {"p"}}},
      {"ho", {"p"}}}}}}, {"bi", {"s", {{"h", {"o", {"p"}}}, {"ho", {"p"}}}}}}}}};
TreeForm[decompTree, VertexRenderingFunction -> (Style[Text[#2, #1], 14, 
     Background -> White] &), ImageSize -> 400]

Mathematica graphics

Now let's convert it to a graph. First, assign a unique integer to each leaf:

decompList = Cases[decompTree //. {x__String, y__List} :>
    (Join[{x}, #] & /@ {y}), {__String}, \[Infinity]];
vertexRep = Thread[Range@Length@# -> #] &@ Cases[decompTree, _String, \[Infinity]];
counter = 1;
vertexTree = Replace[decompTree, _String :> counter++, \[Infinity]]
{1, {2, {3, {{4, {5, {6, {{7, {8, {9}}}, {10, {11}}}}}}, {12, {13, 
    {{14, {15, {16}}}, {17, {18}}}}}}}}}

And then building the edge list of the graph by traversing all possible routes with ReplaceRepeated in the (now integer-valued) decomposition tree:

edgeTree = vertexTree //. {{x_Integer, {y_Integer, z___}} :> {x -> y, {y, z}},
    {x_Integer, y : {__List}} :> {x -> First@# & /@ y, y}};
edgeList = Cases[edgeTree, _Rule, \[Infinity]];
TreePlot[edgeList, Left, VertexRenderingFunction -> (Style[Text[#2 /. vertexRep, #1], 14, 
     Background -> White] &), ImageSize -> 400]

Mathematica graphics

István Zachar
  • 47,032
  • 20
  • 143
  • 291
7

The Markdown document

(and this blog post) describe the installation and use in Mathematica of Tries with frequencies implemented in Java through a corresponding Mathematica package. Note that the set-up requires some efforts.

Basic examples

Basic examples can be found in this album of images that shows the "JavaTrie.*" commands with their effects:

"JavaTrieExample"

More detailed explanations can be found in the Markdown document.

Performance evaluation (finding dictionary infixes)

The following performance tests are done on a 2015 MacBook Pro. The query timings are almost the same as those of Leonid's implementation; the creation timings are 5-10 times smaller.

Get all words from a dictionary (~93,000):

allWords =  DictionaryLookup["*"];
allWords // Length

(* 92518 *)

Trie creation and shrinking:

AbsoluteTiming[
 jDTrie = JavaTrieCreateBySplit[allWords];
 jDShTrie = JavaTrieShrink[jDTrie];
]

(* {0.30508, Null} *)

JSON form extraction:

AbsoluteTiming[
 jsonRes = JavaTrieToJSON[jDShTrie];
]

(* {3.85955, Null} *)

(The data transfer from Java to Mathematica takes time...)

Here are the node statistics of the original and shrunk tries:

JavaTrieNodeCounts[jDTrie]
(* <|"total" -> 224937, "internal" -> 160090, "leaves" -> 64847|> *)

JavaTrieNodeCounts[jDShTrie]
(* <|"total" -> 115504, "internal" -> 50657, "leaves" -> 64847|> *)

Find the infixes that have more than three characters and appear more than 10 times:

Multicolumn[#, 4] &@
 Select[SortBy[
   Tally[Cases[
     jsonRes, ("key" -> v_) :> v, Infinity]], -#[[-1]] &], StringLength[#[[1]]] > 3 && #[[2]] > 10 &]

"Long-infixes-in-shrunk-dictionary-trie"

Anton Antonov
  • 37,787
  • 3
  • 100
  • 178