
The Theil-Sen estimator fits a line to a set of points by taking the median slope and median intercept of the lines passing through all possible distinct point pairs. It is spectacularly good at fitting a line through data containing outliers.
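
Written out explicitly for data points $(x_i, y_i)$ (with distinct $x$ values), the estimates computed by the code below are the pairwise medians

$$\hat m = \operatorname{median}_{i<j}\frac{y_j - y_i}{x_j - x_i}, \qquad \hat b = \operatorname{median}_{i<j}\frac{x_i y_j - x_j y_i}{x_i - x_j}$$

(some formulations instead take $\hat b = \operatorname{median}_i\,(y_i - \hat m\,x_i)$; the code here uses the pairwise-intercept version).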

Here is a reasonably efficient way of calculating this for medium-sized sets of points:

slope[data:{{_,_}..}]:=Median[
 Join@@Table[
  (data[[;;-(n+1),2]] - data[[n+1;;,2]]) / (data[[;;-(n+1),1]] - data[[n+1;;,1]]),
  {n,Length[data]-1}
 ]
]
intercept[data:{{_,_}..}]:=Median[
 Join@@Table[
  (data[[;;-(n+1),1]] data[[n+1;;,2]] - data[[n+1;;,1]] data[[;;-(n+1),2]]) /
  (data[[;;-(n+1),1]] - data[[n+1;;,1]]),
  {n,Length[data]-1}
 ]
]

However, on a machine with 16 GB of RAM, this runs out of memory when processing somewhere between 20,000 points and 50,000 points. How can the code be made more memory-efficient to operate on bigger datasets?

Here is a way of generating an example value for data:

sampleData[nPoints_Integer?Positive] := Module[{datax},
 SeedRandom[0];
 (* x values spread over (0, 10] *)
 datax = 10 N[Normalize[Range[nPoints], Max]];
 Transpose[{
  datax,
  (* a line with intercept 1 and slope 0.001, plus small Gaussian noise,
     plus an extra 0.02 datax shift for roughly 1% of the points (the outliers) *)
  1 + 0.001 datax + RandomReal[NormalDistribution[0, 0.01], nPoints] +
   datax RandomChoice[{0.01, 1 - 0.01} -> {0.02, 0}, nPoints]
 }]
]
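
For example (the size here is arbitrary), the definitions above can be exercised like this, with MaxMemoryUsed[] giving a rough sense of the memory cost so far:

data = sampleData[5000];
{slope[data], intercept[data]}
MaxMemoryUsed[]  (* maximum number of bytes used by the kernel so far in this session *)
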
J. M.'s missing motivation
ArgentoSapiens
  • Maybe some useful information here. – b.gates.you.know.what Jan 31 '15 at 10:44
  • TAOCP offers a collection of extremely efficient methods for such tasks. Truth is: Basic principles of computer sciences cannot be outsmarted by any kind of "standard software". – Jinxed Jan 31 '15 at 23:04
  • Is there some reason not to use a sampling of point pairs, say n log(n) or so for n points? I think I know of an approach that uses maybe O(n) storage, but the time complexity would be n^2 log(n) and that's quite steep for the size range in question. And fishing through the details for an actual implementation would be not so easy. (A rough sketch of this pair-sampling idea appears after these comments.) – Daniel Lichtblau Feb 04 '15 at 16:55
  • @DanielLichtblau, is a random sample guaranteed to give the same answer? One of the main benefits of the technique is its insensitivity to outliers, even if a substantial fraction of the data is composed of outliers. I'd hate to get unlucky and randomly sample only pairs that include the outlier data. – ArgentoSapiens Feb 04 '15 at 18:26
  • No, unfortunately there is no such guarantee beyond what statistics might indicate (e.g. in terms of variance from "correct" median). – Daniel Lichtblau Feb 04 '15 at 19:36
  • @DanielLichtblau, well as the bounty is nonrefundable and there has been no explosion of interest in this question, I suppose an answer that neatly packages a random sampling method could still win. – ArgentoSapiens Feb 04 '15 at 19:39
  • With 20,000 to 50,000 points of real-world data, assuming that the functional relationship can be adequately described by a straight line would seem to be more of an issue than not getting the "exact" median slope from a random sample. Something like quantile regression, which also "ignores" extreme values, might be a better approach. – JimB Jul 04 '23 at 19:12
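
A minimal sketch of the pair-sampling idea discussed in these comments (the name sampledTheilSen and the choice of nPairs are mine, and the result only approximates the full Theil-Sen estimate; as noted above, there is no guarantee of matching it exactly):

(* sample nPairs random point pairs and take the median of their slopes and
   pairwise intercepts; assumes distinct x values, as in sampleData above *)
sampledTheilSen[data_?MatrixQ, nPairs_Integer?Positive] := Module[{pairs},
 (* random index pairs, drawn with replacement; drop the rare i == j draws *)
 pairs = Select[RandomInteger[{1, Length[data]}, {nPairs, 2}], UnsameQ @@ # &];
 Median[
  With[{p = data[[#[[1]]]], q = data[[#[[2]]]]},
   {q[[2]] - p[[2]], q[[1]] p[[2]] - p[[1]] q[[2]]}/(q[[1]] - p[[1]])
  ] & /@ pairs
 ]
]

With nPairs on the order of n Log[n] (a few hundred thousand pairs for n = 50,000), memory use stays modest, at the cost of the sampling error discussed above.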

2 Answers


As a starting point, here is a compact implementation of Theil-Sen:

theilSen[data_?MatrixQ] := Median[With[{df = First[Differences[#]]},
                             {df[[2]], -Det[#]}/df[[1]]] & /@ Subsets[data, {2}]]

It is a bit more efficient than the routine in the OP when I tested it on small sets of points.
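
For a rough side-by-side check on a small sample (the size is arbitrary, and timings will vary by machine), using sampleData, slope, and intercept as defined in the question:

small = sampleData[2000];
AbsoluteTiming[theilSen[small]]
AbsoluteTiming[{slope[small], intercept[small]}]

The two results should agree, since both take medians over the same set of pairwise slopes and intercepts.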


In fact, this formulation shows why it may have trouble "when processing somewhere between 20,000 points and 50,000 points." For data of length $n$, you will be generating $\binom{n}{2}$ subsets. $\binom{50\,000}{2}\approx1.25\times 10^9$, and holding that many point pairs in memory at once will indeed give your computer a hard time. Thus, to process that many points, one might consider resorting to lazy subset generation. But this presents another problem: updating the median as each slope and intercept is generated. There has been some work (e.g. this) on median updating, but I haven't gotten around to fully digesting the literature. I might edit this post later if I figure out a nice implementation of median updating.
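
One crude way to illustrate the lazy idea (the name approxMedianSlope, the fixed binning, and the requirement to supply a bracketing slope range {lo, hi} are all my own choices, not a polished median-update scheme): generate the pairs one index offset at a time, exactly as in the OP's slope function, and accumulate the slopes into fixed bins, from which an approximate median, accurate only to the bin width, is read off at the end. Peak memory is then one offset's worth of slopes plus the bin counts.

approxMedianSlope[data_?MatrixQ, {lo_, hi_, nBins_Integer}] :=
 Module[{counts = ConstantArray[0, nBins], slopes, idx, cum, half},
  Do[
   (* slopes of all pairs at index offset n, as in the OP's slope function *)
   slopes = (data[[;; -(n + 1), 2]] - data[[n + 1 ;;, 2]])/
     (data[[;; -(n + 1), 1]] - data[[n + 1 ;;, 1]]);
   (* bin each slope into one of nBins equal bins on [lo, hi] *)
   idx = Clip[Ceiling[nBins (slopes - lo)/(hi - lo)], {1, nBins}];
   counts += BinCounts[idx, {1, nBins + 1, 1}],
   {n, Length[data] - 1}];
  half = Total[counts]/2;
  cum = Accumulate[counts];
  (* midpoint of the first bin whose cumulative count reaches half of all pairs *)
  lo + (LengthWhile[cum, # < half &] + 0.5) (hi - lo)/nBins
 ]

The same trick applies to the intercepts, and the bracketing range {lo, hi} could come from a small random sample of pairs; a genuine incremental median update would of course be preferable.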

J. M.'s missing motivation

There is a ResourceFunction for TheilSenLine. See https://resources.wolframcloud.com/FunctionRepository/resources/TheilSenLine/. That may give you a bit of flexibility and is even shorter than the other answer here (which is brilliant, BTW), but it bogs down with large $n$. In fact, my kernel quit when trying $5\times10^5$ points, presumably because the memory allocation was exceeded (I set my limit low to avoid catastrophic crashes). Trying $10^3$ points took 1.24 s, while $10^4$ points took much longer, 93.8 s; see below.

dist = MultinormalDistribution[{0, 0}, {{1, 1/2}, {1/2, 1}}];
pointlist = RandomVariate[dist, 10^3];
AbsoluteTiming[line = ResourceFunction["TheilSenLine"][pointlist]]
Show[ListPlot[pointlist],
 Plot[line[[1]] x + line[[2]], {x, Min[pointlist[[All, 1]]], Max[pointlist[[All, 1]]]}]]

dist = MultinormalDistribution[{0, 0}, {{1, 1/2}, {1/2, 1}}];
pointlist = RandomVariate[dist, 10^4];
AbsoluteTiming[line = ResourceFunction["TheilSenLine"][pointlist]]
Show[ListPlot[pointlist],
 Plot[line[[1]] x + line[[2]], {x, Min[pointlist[[All, 1]]], Max[pointlist[[All, 1]]]}]]

[Plot: the 10^4 sampled points with the fitted Theil-Sen line overlaid]

EDIT: As to the question about an efficient implementation, there are several papers to that effect, a few of which are listed in a Cross Validated post. Unfortunately, I have not seen Mathematica code for those. Another question, not asked here, is how to determine confidence intervals; one way is to use the bootstrap (e.g., see this), but there are faster ways as well.
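
A rough sketch of such a percentile bootstrap (the name bootstrapTheilSenCI and the defaults are arbitrary; it reuses theilSen from the other answer, which returns {slope, intercept}, so for large point sets it will be slow):

(* percentile bootstrap confidence intervals for the Theil-Sen slope and intercept *)
bootstrapTheilSenCI[pts_?MatrixQ, nBoot_Integer : 200, level_ : 0.95] :=
 Module[{ests},
  ests = Table[
    theilSen[RandomChoice[pts, Length[pts]]],  (* resample points with replacement *)
    {nBoot}];
  (* percentile intervals, one for the slope and one for the intercept *)
  Map[Quantile[#, {(1 - level)/2, 1 - (1 - level)/2}] &, Transpose[ests]]
 ]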

Carl