Split list of states by run lengths and discontinuous changes in state

Question

I was asked to look into writing some simulation code for a user, where the simulation output (x) would oscillate between three possible states:

states[x_, γ_] := Which[
x < -γ,  1 (*top*),
x < γ , -1 (*bottom*),
True  ,  0   (*middle*)]

The simulation essentially boiled down to the following, and what they were interested in is computing the lengths runs in each state - which could be easily computed with SplitBy:

γ = 0.9;
sampleSimulation = RandomChoice[{-1, 0, 1}, 1000000];
runLengths = Switch[Sign[#[[1]]], 
     1, {"top", Length[#]},
    -1, {"bottom", Length[#]}, 
     0, {"middle", Length[#]}
 ] & /@ SplitBy[sampleSimulation, states[#, γ] &]

However, if the simulation output moved between the "top" and "bottom" states (in either direction) without occupying the "middle" state they needed this information in runLengths via the string "unoccupied middle".

The following code achieves that and is surprisingly performant and the experimenter was happy. However it feels unnecessary to SplitBy and then Reap and Sow with a Do loop to find where the simulation moves seemlessly between "top" and "bottom". What would be a more natural, and less wasteful method to solve such a problem?

runLengthsFunc[list_, γ_] :=
 Block[{splitting},
  splitting = 
   Switch[Sign[#[[1]]], 
      1, {"top", Length[#]}, -1, {"bottom", Length[#]}, 
      0, {"middle", Length[#]}] & /@ 
    SplitBy[list, states[#, γ] &];
  Append[Flatten[
    Reap[Do[If[(splitting[[k, 1]] == "top" && 
           splitting[[k + 1, 1]] == "bottom") || (splitting[[k, 1]] ==
             "bottom" && splitting[[k + 1, 1]] == "top"), 
        Sow[{splitting[[k]], "unoccupied middle"}],
        Sow[{splitting[[k]]}]],
       {k, Length[splitting] - 1}
       ]][[2]], 2], splitting[[-1]]]]

Timing:

In[10]:=AbsoluteTiming[runLengthsFunc[sampleSimulation, γ];]
Out[10]:={7.526282, Null}

Solution Choosing

Thanks for the three very different solutions, as requested the focus was on different methodologies rather than sheer blistering speed.

I'm going to accept Mr Wizard's answer as it provided the most different approach (and best learning experience for me) as well as exceptional speed - under a second for an initial set of 1,000,000 states. However, there was an apparently superfluous process; Unitize where a list already contained only 0,1.

Kguler's method falls apart for more than ~3000 elements, but is a great example of pattern matching. And it's great to see in both this and Mr Wizard's solutions the convergent evolution of SE toward Composition.

Leonid's answer introduced me to linked lists which will be a useful tool for similar problems I have. Unfortunately, I can't benchmark Leonid's as he elected to focus on methodology rather than speed so did not include an implementation of runLengths - but the linking of lists is much more performant than my Reap[Do[Sow] approach.

Position[Abs[Differences@sampleSimulation], 2] will give you list of positions where next position is a jump up/down past 0. You can then use that to insert the unoccupied tag. — ciao, May 01 '14 at 00:16
@rasher I'd like to still see where you were going, would you undelete and post as a comment? — Charlotte Hadley, May 01 '14 at 16:48
Thanks for the Accept. :-) Good catch on Unitize; at the moment I agree, it appears superfluous. — Mr.Wizard, Jun 05 '14 at 07:47
+1 on the question for adding the "Solution Choosing" section - wish more would explain their process in that way, or when changing accepts the why (without which I assume the answer(s) were non-responsive). — ciao, Jun 05 '14 at 08:20
@rasher I'm in Saudi Arabia on consultancy where there is very little entertainment other than going through my list of "Things I forgot to reply to". The best questions I've read become discussions with the answerers, proof of a good community. — Charlotte Hadley, Jun 05 '14 at 09:47

score 4 · Answer 1 · answered May 01 '14 at 00:38

I think your core idea about post-processing the splittings is a good one. It can be expressed somewhat more elegantly (although elegance is a subjective matter) with the help of linked lists rather than Reap/Sow:

ClearAll[ll];
SetAttributes[ll,HoldAllComplete];
toLinkedList[l_List]:=
    Fold[ll[#2,#1]&,ll[],Reverse[l]]

and

ClearAll[process];
process[runlengths_]:=
    Block[{$IterationLimit=\[Infinity]},
        process[ll[],toLinkedList[runlengths]]
    ];
process[accum_,ll[t:{"top",_},tail:ll[{"bottom",_},_]]]:=
    process[ll[ll[accum,t],"unoccupied middle"],tail];

process[accum_,ll[t:{"bottom",_},tail:ll[{"top",_},_]]]:=
    process[ll[ll[accum,t],"unoccupied middle"],tail];

process[accum_,ll[head_,tail_]]:=
    process[ll[accum,head],tail];

process[accum_,ll[]]:=
    List@@Flatten[accum,\[Infinity],ll];

On your test, this code seems to be slightly faster too, on my machine:

(runLengths = ... ); // AbsoluteTiming

(* {6.667189, Null} *)

AbsoluteTiming[runLengthsFunc[sampleSimulation, \[Gamma]];]

(* {9.505267, Null} *)

process[runLengths]; // AbsoluteTiming

(* {1.383734, Null} *)

which means that this method is roughly twice faster than the one based on Reap/Sow. But my main point was not speed but a more clear and declarative way to express the algorithm.

Leonid, I hope to see you include your own code to generate runLenghts. Then we can see what this code can really do. :-) — Mr.Wizard, May 01 '14 at 05:05
@Mr.Wizard May be later :) You already squeezed nearly everything out of this one, it terms of speed. Besides, I wasn't really concerned with speed here. — Leonid Shifrin, May 01 '14 at 11:29
While elegance is subjective, seeing Fold in an example I understand is wonderful - thanks. Need to explore the answers when not on mobile device before I can accept one. — Charlotte Hadley, May 01 '14 at 17:05
@MartinJohnHadley Glad you liked it. Fold here serves a role of a helper function to construct the linked list. This is basically your own algorithm, just rewritten using linked lists. — Leonid Shifrin, May 01 '14 at 20:56

score 4 · Accepted Answer · edited Jun 16 '20 at 09:23

It looks like I'm late to the party but here is my approach. Common functions and data:

trinarize[a_List, γ_?NumericQ] := UnitStep[a - γ] + UnitStep[γ + a]
labels = {0 -> "bottom", 1 -> "middle", 2 -> "top"};
SeedRandom[1]
sampleSimulation = RandomChoice[{-1, 0, 1}, 20]

{0, -1, 0, 0, -1, -1, -1, 0, -1, -1, -1, -1, 1, -1, 0, 1, -1, -1, 0, 0}

Method #1

Split @ trinarize[sampleSimulation, 0.9]  (* 0.9 is γ *)
Split[%, Abs[#[[1]] - #2[[1]]] < 2 &]
Join @@ Riffle[%, {{"unoccupied middle"}}]
% /. p : {x_ ..} :> {x /. labels, Length@p}

{{1}, {0}, {1, 1}, {0, 0, 0}, {1}, {0, 0, 0, 0}, {2}, {0}, {1}, {2}, {0, 0}, {1, 1}}
{{{1}, {0}, {1, 1}, {0, 0, 0}, {1}, {0, 0, 0, 0}}, {{2}}, {{0}, {1}, {2}}, {{0, 0}, {1, 
   1}}}
{{1}, {0}, {1, 1}, {0, 0, 0}, {1}, {0, 0, 0, 0}, "unoccupied middle", {2},
  "unoccupied middle", {0}, {1}, {2}, "unoccupied middle", {0, 0}, {1, 1}}
{{"middle", 1}, {"bottom", 1}, {"middle", 2}, {"bottom", 3}, {"middle", 1}, {"bottom", 
  4}, "unoccupied middle", {"top", 1}, "unoccupied middle", {"bottom", 1}, {"middle", 
  1}, {"top", 1}, "unoccupied middle", {"bottom", 2}, {"middle", 2}}

Method #2

s1 = Split @ trinarize[sampleSimulation, 0.9]  (* 0.9 is γ *)
s2 = {First @ # /. labels, Length @ #} & /@ s1
n = Length @ s1;
jumps = SparseArray[Abs @ Differences @ s1[[All, 1]] - 1]["AdjacencyLists"]
Clip[Ordering @ Join[Range@n, jumps], {1, n}, {1, n + 1}];
Append[s2, "unoccupied middle"][[%]]

{{1}, {0}, {1, 1}, {0, 0, 0}, {1}, {0, 0, 0, 0}, {2}, {0}, {1}, {2}, {0, 0}, {1, 1}}
{{"middle", 1}, {"bottom", 1}, {"middle", 2}, {"bottom", 3}, {"middle", 1}, {"bottom", 
  4}, {"top", 1}, {"bottom", 1}, {"middle", 1}, {"top", 1}, {"bottom", 2}, {"middle", 2}}
{6, 7, 10}
{{"middle", 1}, {"bottom", 1}, {"middle", 2}, {"bottom", 3}, {"middle", 1}, {"bottom", 
  4}, "unoccupied middle", {"top", 1}, "unoccupied middle", {"bottom", 1}, {"middle", 
  1}, {"top", 1}, "unoccupied middle", {"bottom", 2}, {"middle", 2}}

@rasher Thanks! I see you deleted your answer. I found another way to make the insertions that I like well enough (it's cleaner, but not as fast). Why don't you resurrect your answer and I'll leave that ("method #2" to you? — Mr.Wizard, May 01 '14 at 03:06
Nah, I doofused the question by not paying attention, penance is deletion... — ciao, May 01 '14 at 03:59
Need to understand the SparseArray SubValues better - it's part of your standard toolbox, where'd you gain the knowledge? — Charlotte Hadley, May 01 '14 at 17:00
@Martin I first learned about SparseArray Properties from a post by Oliver Ruebenkoenig on MathGroup. Since them I've made frequent use of them, e.g. as linked here. You can get a list of the Properties with SparseArray[{1}]["Properties"]. Experimentation is perhaps the best way to understand them. (continued) — Mr.Wizard, May 01 '14 at 20:04
@Martin "NonzeroPositions" is actually non-background positions (a SparseArray has a default background of zero but it can be set to other things), in multidimensional form; "AdjacencyLists" is the same but for each row and without the first dimension. If you have any questions please let me know. — Mr.Wizard, May 01 '14 at 20:05

score 2 · Answer 3 · answered May 01 '14 at 07:49

SeedRandom[1];
sampleSimulation = RandomChoice[{-1, 0, 1}, 20]

rplcR = {(0) -> "middle", {beg___, mid : Alternatives @@ 
   PatternSequence @@@ {{"top", "bottom"}, {"bottom", "top"}},  end___} :>
     {beg, Sequence @@ Riffle[{mid}, "jump"], end}};
jtsrcF = Composition[Join @@ # &, Tally /@ # &, Split, # //. rplcR &, 
                   Clip[#1, #2 {-1, 1}, {"bottom", "top"}] &];
jtsrcF[sampleSimulation, .5]
(* {0, -1, 0, 0, -1, -1, -1, 0, -1, -1, -1, -1, 1, -1, 0, 1, -1, -1, 0,  0} *)
(* {{"middle", 1}, {"bottom", 1}, {"middle", 2}, {"bottom", 3}, {"middle", 1},
    {"bottom", 4}, {"jump", 1}, {"top", 1}, {"jump", 1}, {"bottom", 1},
    {"middle", 1}, {"top", 1}, {"jump", 1}, {"bottom", 2}, {"middle", 2}} *)

+1 for PatternSequences and patterns in general. I can't benchmark this at the moment, if you don't update I'll put timings into the question. — Charlotte Hadley, May 01 '14 at 16:50
@MartinJohnHadley, thanks for the vote. I don't think this stands a good chance in benchmarks:) — kglr, May 01 '14 at 17:46

Split list of states by run lengths and discontinuous changes in state

3 Answers3

Method #1

Method #2