77

Is there an efficient way to find the positions of the duplicates in a list?

I would like the positions grouped according to duplicated elements. For instance, given

list = RandomInteger[15, 20]
{3, 3, 6, 11, 13, 13, 11, 1, 2, 3, 12, 8, 9, 9, 4, 15, 5, 6, 9, 12}

the output should be

positionDuplicates[list]
{{{1}, {2}, {10}}, {{3}, {18}}, {{4}, {7}}, {{5}, {6}}, {{11}, {20}}, {{13}, {14}, {19}}}

Here's my first naive thought:

positionDuplicates1[expr_] :=
  Position[expr, #, 1] & /@ First /@ Select[Gather[expr], Length[#] > 1 &]

And my second:

positionDuplicates2[expr_] := Module[{seen},
  Last @ Reap @ MapIndexed[
    If[seen[#1] === True, Sow[#2, #1],
      If[Head[seen[#1]] === List,
       Sow[seen[#1], #1]; Sow[#2, #1]; seen[#1] = True,
       seen[#1] = #2]] &, expr]
  ]

The first works as desired but is horrible on long lists. In the second, Reap does not return positions in order, so if necessary, one can apply Sort. I feel the work done by Gather is about what it should take for this task; DeleteDuplicates is (and should be) faster.


Here is a summary of timings on a big list.

list = RandomInteger[10000, 5 10^4];
positionDuplicates1[list]; // AbsoluteTiming
positionDuplicates2[list] // Sort; // AbsoluteTiming
Sort[Map[{#[[1, 1]], Flatten[#[[All, 2]]]} &, Reap[MapIndexed[Sow[{#1, #2}, #1] &, list]][[2, All, All]]]]; // AbsoluteTiming (* Daniel Lichtblau *)
Select[Last@Reap[MapIndexed[Sow[#2, #1] &, list]], Length[#] > 1 &]; // AbsoluteTiming
positionOfDuplicates[list] // Sort; // AbsoluteTiming (* Leonid Shifrin *)
Module[{a, o, t}, Composition[o[[##]] &, Span] @@@ Pick[Transpose[{Most[ Prepend[a = Accumulate[(t = Tally[#[[o = Ordering[#]]]])[[All, 2]]], 0] + 1], a}], Unitize[t[[All, 2]] - 1], 1]] &[list]; // AbsoluteTiming (* rasher *)
GatherBy[Range@Length[list], list[[#]] &]; // AbsoluteTiming (* Szabolcs *)
GatherByList[Range@Length@list, list]; // AbsoluteTiming (* Carl Woll *)
Gather[list]; // AbsoluteTiming
DeleteDuplicates[list]; // AbsoluteTiming
{27.7134, Null} (* my #1 *)
{0.586742, Null} (* my #2 *)
{0.14921, Null} (* Daniel Lichtblau *)
{0.074334, Null} (* Szabolcs's suggested improvement of my #2 *)
{0.028313, Null} (* Leonid Shifrin *)
{0.020012, Null} (* rasher *)
{0.004821, Null} (* Szabolcs's answer *)
{0.003127, Null} (* Carl Woll *)
{0.002999, Null} (* Gather - for comparison purposes *)
{0.000181, Null} (* DeleteDuplicates *)
anderstood
Michael E2

9 Answers

105

You can use GatherBy for this. You can map List onto Range[...] first if you wish to have exactly the same output you showed.

positionDuplicates[list_] := GatherBy[Range@Length[list], list[[#]] &]

list = {3, 3, 6, 11, 13, 13, 11, 1, 2, 3, 12, 8, 9, 9, 4, 15, 5, 6, 9, 12}

positionDuplicates[list]

(* ==> {{1, 2, 10}, {3, 18}, {4, 7}, {5, 6}, {8}, {9}, 
        {11, 20}, {12}, {13, 14, 19}, {15}, {16}, {17}} *)

If you prefer a Sow/Reap solution, I think this is simpler than your version (but slower than GatherBy):

positionDuplicates[list_] := Last@Reap[MapIndexed[Sow[#2, #1] &, list]]

If you need to remove the positions of non-duplicates, I'd suggest doing that as a post processing step, e.g. Select[result, Length[#] > 1&]
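For the example list above, that post-processing step would look like this (reproducing the OP's desired groups, though without the extra level of braces):

result = GatherBy[Range@Length[list], list[[#]] &];
Select[result, Length[#] > 1 &]

(* {{1, 2, 10}, {3, 18}, {4, 7}, {5, 6}, {11, 20}, {13, 14, 19}} *)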

Szabolcs
  • Smart. I don't think this can be beaten. – Sjoerd C. de Vries Mar 15 '13 at 06:53
  • Your method is faster than the standard decorate method I've been using: GatherBy[{#, Range@Length@#}\[Transpose], First][[All, All, 2]] &. One to add to the toolbox. Thanks! – Mr.Wizard Mar 15 '13 at 11:59
  • One thing: why not get rid of Module? positionDuplicates[list_] := GatherBy[Range @ Length @ list, list[[#]] &] – Mr.Wizard Mar 15 '13 at 12:01
  • Thanks. I did look at GatherBy. Gathering the positions somehow seems natural, but I didn't think of it. – Michael E2 Mar 16 '13 at 00:02
  • Used, with credit, here: http://mathematica.stackexchange.com/a/21453/121 – Mr.Wizard Mar 16 '13 at 02:18
  • @Mr.Wizard no need to give credit for something so small, I feel it's a natural approach – Szabolcs Mar 16 '13 at 03:06
  • Actually, it's not. I've never seen it before, and it didn't occur to me to try it, because for some reason it seemed semantically more complex even if syntactically simpler (so I figured it would be slower). In many cases the best ideas are simple in appearance. The "injector pattern" is very simple yet also very powerful. The step function I worked long to figure out has, IMHO, extensive implications for how we may handle expressions and definitions and is perhaps my best contribution to this site so far, yet it is a couple of lines of code. I give credit where it's due. – Mr.Wizard Mar 16 '13 at 03:26
  • I realize that my comment above makes it sounds like I might have contemplated this specific form and dismissed it. That's not the case. What I mean is that I normally try to refactor code in the other direction, lumping data into one object that is then processed (as in decorate-and-sort) rather than dynamically accessing expressions on demand. Therefore I never even thought about this particular construct. I shall have to rethink my assumptions. – Mr.Wizard Mar 16 '13 at 04:15
  • @Mr.Wizard I second that. I knew about and used the "injector pattern" for a long time (albeit under a different name), but in this case I was so sure that the performance was going to be inferior, that I did not even think in this direction (which is strange in retrospect since I usually look for dualities like this, and even mentioned the element-position duality here (#8) - which is exactly the category where this particular technique belongs). So, what can I say - I wish I've thought of that :-) – Leonid Shifrin Mar 16 '13 at 14:32
  • BTW, DeleteCases[result, {_}] seems to be noticeably faster than Select[..]. And thanks again. :) – Michael E2 Aug 12 '17 at 14:43
24

Here is a version based on sorting, and using Mr. Wizard's dynP function:

dynP[l_, p_] := 
   MapThread[l[[# ;; #2]] &, {{0}~Join~Most@# + 1, #} &@Accumulate@p]

positionOfDuplicates[list_] :=
   With[{ord = Ordering[list]},
      SortBy[dynP[ord, Length /@ Split[list[[ord]]]], First]
   ]

so that

positionOfDuplicates[list]

(* {{1,2,10},{3,18},{4,7},{5,6},{8},{9},{11,20},{12},{13,14,19},{15},{16},{17}} *)

It is also fast enough, although not as fast as the one based on GatherBy.
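For reference, dynP takes a list and a list of lengths and partitions the former according to the latter, e.g.:

dynP[{a, b, c, d, e, f}, {2, 1, 3}]

(* {{a, b}, {c}, {d, e, f}} *)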

Leonid Shifrin
19

If you wanted to retain each value as well as its positions, this works.

Sort[
 Map[{#[[1, 1]], Flatten[#[[All, 2]]]} &, 
  Reap[MapIndexed[Sow[{#1, #2}, #1] &, list]][[2, All, All]]]]

(* Out[178]= {{0, {14}}, {1, {17, 19}}, {4, {4, 20}}, {5, {12}},
   {7, {10}}, {9, {13}}, {10, {2, 6}}, {11, {3}}, {12, {7, 15}},
   {13, {8, 9, 11}}, {14, {1, 16, 18}}, {15, {5}}} *)

It's maybe 20x slower than the GatherBy though.

Daniel Lichtblau
17

In version 10 there is a new function PositionIndex that could be the go-to method for this operation:

a = {3, 3, 6, 11, 13, 13, 11, 1, 2, 3, 12, 8, 9, 9, 4, 15, 5, 6, 9, 12};

Values @ PositionIndex @ a
{{1, 2, 10}, {3, 18}, {4, 7}, {5, 6}, {8}, {9}, {11, 20},
 {12}, {13, 14, 19}, {15}, {16}, {17}}

Sadly, as currently implemented its performance is very poor, so it is NOT the go-to method:

positionDuplicates[list_] := GatherBy[Range @ Length @ list, list[[#]] &]

test = RandomInteger[999, 5*^5];

positionDuplicates[test]     // Timing // First
0.015600

Values @ PositionIndex[test] // Timing // First
2.215214

Perhaps in a future release this function will live up to its potential.


Update: In 10.0.1 it is indeed far more useful but still not a match for positionDuplicates:

Values @ PositionIndex[test] // Timing // First
0.0524
Mr.Wizard
16

Prompted by a comments conversation with Mr. Wizard, a method I use often.

list = RandomInteger[1000, 100];

Module[{a, o, t}, 
   Composition[o[[##]] &, Span] @@@ 
    Pick[Transpose[{Most[Prepend[a = Accumulate[(t = Tally[#[[o = Ordering[#]]]])
      [[All, 2]]], 0] + 1], a}], Unitize[t[[All, 2]] - 1], 1]] &[list]

list[[#]] & /@ %

(*
   {{47, 53}, {72, 89}, {18, 58}, {20, 56}}

   {{699, 699}, {738, 738}, {829, 829}, {962, 962}}
*)

Searches are at the top-level of the list, and only duplicate positions are returned so no need for further parsing.

Smallish lists with mostly duplicates / dense duplicates see GatherBy with similar or somewhat faster performance, but as soon as the data tends toward distinctness and/or large lists (more typical than not for my work), it clobbers GatherBy by a factor of 5-10. In addition, it is much cheaper on memory than gatherhog, which at times is like watching Oprah at a buffet when it comes to eating RAM...

ciao
13

You can use my GatherByList function to gain a modest improvement in speed when compared to @Szabolcs' solution:

GatherByList[list_, representatives_] := Module[{func},
    func /: Map[func, _] := representatives;
    GatherBy[list, func]
]
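The trick here, as I understand it, is that GatherBy[list, func] internally evaluates Map[func, list]; the TagSetDelayed rule intercepts that Map and substitutes the precomputed representatives (this relies on undocumented internal behavior of GatherBy). A quick illustration of gathering one list by a parallel list:

GatherByList[{"a", "b", "c", "d"}, {1, 2, 1, 2}]

(* {{"a", "c"}, {"b", "d"}} *)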

Comparison:

r1 = GatherByList[
    Range@Length@list,
    list
]; //RepeatedTiming

r2 = GatherBy[
    Range@Length@list,
    list[[#]]&
]; //RepeatedTiming

r1===r2

{0.0047, Null}

{0.0062, Null}

True

Carl Woll
  • That's more than a modest improvement. It is now as fast as Gather. – anderstood Jan 04 '18 at 21:14
  • Very interesting code. I didn't know (or remember?) that interception of Map was possible at that point. This significantly changes the way I look at GatherBy. – Mr.Wizard Jan 04 '18 at 23:51
9

While reflecting on the method I used for How to get list of duplicates when using DeleteDuplicates? (second answer) it occurred to me that I had the elements for a solution to this question that might be faster than Szabolcs's magnificently clean solution. Indeed I found that in some cases I can beat his function, though in general the common bottleneck of partitioning a list ultimately holds this back.

My code as well as Szabolcs's function again for comparative testing:

diffPos[a_List] := SparseArray[Differences@a, Automatic, 1]["AdjacencyLists"]

posDupsRaw[a_List] := {#, diffPos @ Ordering @ Reverse @ a[[#]]} & @ Ordering @ a

posDups[a_List] :=
  posDupsRaw[a] /. {o_, p_} :>
    MapThread[Take[o, {##}] &, {Prepend[p + 1, 1], Append[p, -1]}]

positionDuplicates[a_] := GatherBy[Range @ Length @ a, a[[#]] &]

An example of the outputs:

a = {0, 3, 2, 2, 2, 3, 2, 4};
positionDuplicates[a]
posDups[a]
posDupsRaw[a]
{{1}, {2, 6}, {3, 4, 5, 7}, {8}}

{{1}, {3, 4, 5, 7}, {2, 6}, {8}}

{{1, 3, 4, 5, 7, 2, 6, 8}, {1, 5, 7}}

We can see that the output of posDups is not in order of first appearance, but otherwise the result is the same; a SortBy[First] would align them. The output of posDupsRaw contains the same information; the first sublist is to be partitioned according to the second, which is exactly what posDups actually does.
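As a side note on the building block: diffPos returns the positions at which consecutive elements of its argument do not differ by exactly 1 (the stored entries of a SparseArray whose background is set to 1), which is what marks the group boundaries above. A small illustration:

(* Differences@{1, 2, 3, 7, 8, 20} is {1, 1, 4, 1, 12}; with background 1,
   only positions 3 and 5 hold non-background values *)
diffPos[{1, 2, 3, 7, 8, 20}]

(* {3, 5} *)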

Now some timings on a basic integer vector:

SeedRandom[1]
big1 = RandomInteger[1*^6, 1*^6];

positionDuplicates[big1] // Length // RepeatedTiming
posDups[big1]            // Length // RepeatedTiming
posDupsRaw[big1] // Last // Length // RepeatedTiming
{0.706, 632355}

{0.778, 632355}

{0.169, 632354}

In this particular case posDups is reasonably competitive and posDupsRaw is much faster, clearly demonstrating that partitioning is the bottleneck here.

With the right balance of duplication posDups actually beats positionDuplicates:

SeedRandom[1]
big1 = RandomInteger[3*^5, 1*^6];

positionDuplicates[big1] // Length // RepeatedTiming
posDups[big1]            // Length // RepeatedTiming
posDupsRaw[big1] // Last // Length // RepeatedTiming
{0.480, 289165}

{0.445, 289165}

{0.1609, 289164}

My function does even better when the list elements are themselves lists:

SeedRandom[1]
big1 = RandomInteger[500, {1*^6, 2}];

positionDuplicates[big1] // Length // RepeatedTiming
posDups[big1]            // Length // RepeatedTiming
posDupsRaw[big1] // Last // Length // RepeatedTiming
{1.10, 246302}

{0.513, 246302}

{0.2699, 246301}

Unfortunately in a simple list with heavy duplication my method falls well behind:

SeedRandom[1]
big1 = RandomInteger[999, 1*^6];

positionDuplicates[big1] // Length // RepeatedTiming
posDups[big1]            // Length // RepeatedTiming
posDupsRaw[big1] // Last // Length // RepeatedTiming
{0.0524, 1000}

{0.1265, 1000}

{0.1241, 999}

Mr.Wizard
2

This is a very interesting post and I still remember the first time I read it. My response serves only as an update on the functionality of PositionIndex as of V13. PositionIndex has been mentioned previously, but I felt an update was warranted.

@Michael E2 and the other more experienced users, if you think that it would be better for site maintenance for this to be incorporated in the OP, please leave a comment and I will delete my answer. Many thanks in advance!

Firstly, I am sitting on

$Version

"13.0.0 for Mac OS X ARM (64-bit) (December 3, 2021)"

The first thing I want to address, for the reader's convenience, is that if we consider the toy-model list given in the OP

list = {3, 3, 6, 11, 13, 13, 11, 1, 2, 3, 12, 8, 9, 9, 4, 15, 5, 6, 9,
    12};

the following

Values@PositionIndex[list]

returns

{{1, 2, 10}, {3, 18}, {4, 7}, {5, 6}, {8}, {9}, {11, 20}, {12}, {13, 14, 19}, {15}, {16}, {17}}

which is precisely what one gets by using Szabolcs's positionDuplicates; for its definition please see the accepted answer.

And now we perform a test

Quit[]

with the list as given in the comparative studies

list = RandomInteger[10000, 5 10^4];

we execute

Values@PositionIndex[list]; // AbsoluteTiming

to obtain

{0.003899, Null}

which does not seem that bad.

Finally, I would like to address something that is not really an answer to the OP, but I think it's quite interesting and has not been mentioned previously.

If we want to get an answer as to whether or not a list contains duplicate elements at all, the following is also pretty fast (note that I am using the big list here):

DuplicateFreeQ[list] // AbsoluteTiming

returns

{0.000032, False}

bmf
  • Mr.Wizard used Values@PositionIndex@a in his answer. Is that what you meant by "it has been mentioned previously"? Is your main point the performance change from V10? – Michael E2 Mar 22 '22 at 05:23
  • @MichaelE2 yes, I just did not know how to refer precisely to his answer. As I tried to explain this is just a humble status update :) I just thought it might be useful. I happened to check these again today for some projects – bmf Mar 22 '22 at 05:25
  • @MichaelE2 if you -or any other user- feel I should delete this answer, I am happy to do so. Just stressing this again. – bmf Mar 22 '22 at 05:31
  • No need to delete, and thanks for the work. As the site grows and Mathematica evolves, it's hard to maintain the site. Personally I would've left a comment on Mr.Wizard's answer, so that the answer could be kept up-to-date. It seems better to me than duplicating an answer every time a function is updated. But that's my vision for the site. (Actually I might even have edited his answer and left another update at the bottom, since he hasn't been engaged in the site for a while.) – Michael E2 Mar 22 '22 at 15:25
  • @MichaelE2 many thanks for sharing your thoughts on the matter. I was unsure how many people would've read a comment to an old question, to be honest and quite frankly I was very skeptical with editing Mr.Wizard's answer. I did not know how the other users would respond to something like that. Again, many thanks for your thoughts. I'll take these into advisement for future posts :) – bmf Mar 22 '22 at 17:20
1

Most of the other answers focus on the positions of duplicates among numbers, but what about duplicates of other types, or within deeper arrays? Neither was originally specified, but it is interesting to consider these alternatives and the order-of-magnitude efficiency improvements achievable from some minimal pre-processing (with some authentic applications). Further, variability in input, not only in type and depth but also in duplicate distribution, can significantly impact efficiency. Taken together, this suggests, perhaps, the need for dedicated DuplicatePositions and DuplicatePositionsBy functions, the case for which is built below. First to the efficiency improvement when finding duplicates among lists of integers.

Needs["GeneralUtilities`"]
positionDuplicates[ls_] := GatherBy[Range@Length@ls, ls[[#]] &];
duplicatePositions[ls_] := positionDuplicates[FromDigits /@ ls];
SeedRandom@0;
vectors[n_] := IntegerDigits /@ RandomInteger[n, 5*n];
BenchmarkPlot[{positionDuplicates, duplicatePositions}, vectors[10^#] &, Range@3, "IncludeFits" -> True]

[benchmark plot: positionDuplicates vs. duplicatePositions, with fits]

At around 1K items the efficiency difference becomes meaningful (the benchmark function is defined in a related answer; a green tick indicates identical output across all timings).

fns = {duplicatePositions, positionDuplicates};
n = 10^3;
vectors5k = IntegerDigits /@ RandomInteger[n, 5*n];
benchmark[fns]@vectors5k

[benchmark timings for vectors5k]

By performing the pre-processing that converts vectors into integers, all of GatherBy's optimizations can be brought to bear (namely, reducing the number and cost of the pair-wise comparisons that implicitly occur in GatherBy's sorting). Hence efficiently finding duplicate positions ultimately depends on efficient sorting, which in turn (often) depends on the underlying objects possessing a natural order. For arbitrary objects this order may need to be user-imposed (thereby motivating a DuplicatePositionsBy function). Note that in the previous example, IntegerDigits produces different-sized vectors, for which, apparently, no order has been internally recognized. By fixing the vector size (with padding if necessary), this order apparently does become recognizable, and the efficiency advantage, while still discernible, is reduced.

vectors[n_] := IntegerDigits[#, 10, 5] & /@ RandomInteger[n, 5*n];

BenchmarkPlot[{positionDuplicates, duplicatePositions}, vectors[10^#] &, Range@4, "IncludeFits" -> True]

[benchmark plot for fixed-size vectors]

and individually

n = 10^4;
vectors50k = IntegerDigits[#, 10, 5] & /@ RandomInteger[n, 5*n];
benchmark[fns]@vectors50k

[benchmark timings for vectors50k]

A similar efficiency edge, possibly from more efficient pairwise comparisons, arises from using PositionIndex, for example with strings.

duplicatePositions[M_] := Values@PositionIndex@M;
n = 10^5;
strings100K = ToString /@ RandomInteger[n, 5*n];
benchmark[fns]@strings100K

[benchmark timings for strings100K]

The distribution of duplicates can also directly impact a method's efficiency. Note how in the OP's example, a random selection among n numbers was performed 5n times to ensure that a roughly normal distribution of duplicates is generated. The following shows how varying this factor of 5 affects the duplicate distribution.

Manipulate[Histogram[Length /@ duplicatePositions@RandomInteger[n, n*r // Round], PlotRange -> {{0, 100}, Automatic}], {{n, 1000}, 1, 101}, {{r, 30}, .1, 100, 1}]

[histogram of duplicate-group sizes]

Increasing $r$ increases the chance of duplicates, eventually leading to a uniform distribution. Decreasing $r$, on the other hand, decreases the chance of duplicates, eventually leading to a distribution of singleton sets. For the latter, it turns out there is another implementation that seems to perform unreasonably well for sparse vectors.

duplicatePositionsNew[ls_] := SplitBy[Ordering@ls, ls[[#]] &] // SortBy[First];
SeedRandom@0;
vectors[n_] := RandomInteger[1, {n, n}];
fns = {duplicatePositions, duplicatePositionsNew, positionDuplicates};
BenchmarkPlot[fns, vectors[10^#] &, Range@4, "IncludeFits" -> True]

[benchmark plot including duplicatePositionsNew]

If it is known in advance that such sparsity is almost guaranteed, then by sacrificing absolute certainty one gains the efficiency of simply returning singleton sets. If one suspects sparse input but still wants to maintain certainty for rare duplicates then this "ordering" implementation may be the desired function.
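As a quick sanity check of this ordering-based variant on a toy list: Ordering sorts positions by value, SplitBy cuts between unequal neighbors, and SortBy[First] restores order of first appearance.

duplicatePositionsNew[{3, 1, 3, 2}]

(* {{1, 3}, {2}, {4}} *)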

n = 10^4;
sparseVectors10k = RandomInteger[1, {n, n}];
benchmark[fns]@sparseVectors10k

[benchmark timings for sparseVectors10k]

All these case-based efficiencies suggest a "superfunction", DuplicatePositions, that maintains the demonstrated efficiency advantages either automatically, when efficiently detectable, or otherwise via a Method option and/or by defining an order through preprocessing in DuplicatePositionsBy. Note that this superfunction defaults to Szabolcs' efficient positionDuplicates for numbers and, for good measure, includes Carl Woll's eking out of extra speed via a "UseGatherByLocalMap" option setting. An initial implementation might look something like:

Options[DuplicatePositions] = {Method -> Automatic};

DuplicatePositions[ls_, OptionsPattern[]] := 
  With[{method = OptionValue[Method]},
   Switch[method,
    "UseGatherBy", GatherBy[Range@Length@ls, ls[[#]] &],
    "UsePositionIndex", Values@PositionIndex@ls,
    "UseOrdering", SplitBy[Ordering@ls, ls[[#]] &] // SortBy[First],
    "UseGatherByLocalMap", Module[{func}, func /: Map[func, _] := ls;
     GatherBy[Range@Length@ls, func]],
    Automatic, Which[
     ArrayQ[ls, 1, NumericQ], 
     DuplicatePositions[ls, "Method" -> "UseGatherBy" ],
     ArrayQ[ls, 2, NumericQ], DuplicatePositionsBy[ls, FromDigits],
     MatchQ[{{_?IntegerQ ..} ..}]@ls, 
     DuplicatePositionsBy[ls, FromDigits],
     True, DuplicatePositions[ls, Method -> "UsePositionIndex" ]
     ]]];

DuplicatePositionsBy[ls_, fn_, opts : OptionsPattern[]] := 
  DuplicatePositions[fn /@ ls, opts];
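To illustrate the intended behavior of the sketched DuplicatePositionsBy (hypothetical, since DuplicatePositions here is only a prototype): gathering digit vectors by the integers they encode,

(* rows 1 and 3 encode the same integer, 12, via FromDigits *)
DuplicatePositionsBy[{{1, 2}, {3, 4}, {1, 2}}, FromDigits]

(* {{1, 3}, {2}} *)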

Putting it all together

n = 10^3;
normal := RandomInteger[n, 5*n];
vectors5k := IntegerDigits[#, 10, 6] & /@ normal;
vectorsRagged5k := IntegerDigits /@ normal;
strings5k := ToString /@ normal;

benchmark[{
   DuplicatePositions, positionDuplicates}] /@ 
 Unevaluated@{vectors5k, vectorsRagged5k, strings5k}

n = 10^4;
normal50k := RandomInteger[n, 5*n];

benchmark[{
   DuplicatePositions, 
   DuplicatePositions[#, Method -> "UseGatherBy"] &, 
   DuplicatePositions[#, Method -> "UseGatherByLocalMap"] &, 
   positionDuplicates}] /@ Unevaluated@{normal50k}

n = 10^3;
sparseVectors1k := RandomInteger[1, {n, n}];
benchmark[{
   DuplicatePositions, 
   DuplicatePositions[#, Method -> "UseOrdering"] &, 
   positionDuplicates}] /@ Unevaluated@{sparseVectors1k}

[benchmark timings for the combined tests]

While it might seem that finding duplicate positions in arbitrary-depth arrays is a sparse corner case, this is only from the perspective of random generation and the challenge of searching within an "interesting" normal distribution. Duplicates are, however, frequently injected into high-dimensional structures by satisfying specified symmetries. In fact, such symmetrization defines much of WL's rich tensor framework. DuplicatePositions therefore possesses a natural applicability to symbolic tensors produced from SymmetrizedArray, SparseArray etc. (with itself potentially returning new, structured objects), with further relations to SymmetrizedDependentComponents and DeleteDuplicates. For this application one might imagine a 4-argument function along the lines of:

DuplicatePositions[expr, test, comp, levspec]

(Note that for arbitrary-depth arrays the very notion of a duplicate becomes level-dependent. In a list of binary vectors it was implicitly assumed that the duplicates of interest were at the top level [the positions of binary vectors] and not at the bottom level [the positions of 0's and 1's], which would seem to be of greater interest for deeper arrays.)

Ronald Monson
  • Why use something fragile and only applicable to Integer like FromDigits and not "integer-ize" with Hash? – b3m2a1 May 08 '19 at 23:29
  • Yeah, from memory I did consider Hash but the cost of conversions seemed to outweigh built-in numerics - perhaps to obtain this generality this should be added to one of DuplicatePositions default branches. – Ronald Monson May 08 '19 at 23:52
  • You can also always see if you're dealing with an efficiently convertable type with the PackedArray stuff. Like you could check for a packed Integer array or some integralizable type and in that case delegate to FromDigits. – b3m2a1 May 08 '19 at 23:54
  • Good point. In fact I'm sure there are other optimisations for other types which was one of the points of the post. I make no claims of completeness. Perhaps as these particular cases are identified someone could add these to the superfunction and put on the function repository ... -- I don't quite have the time at the moment. – Ronald Monson May 09 '19 at 00:02