
The Theil-Sen estimator fits a line to a set of points by taking the median slope and median intercept of the lines passing through all possible distinct point pairs. It is spectacularly good at fitting a line through data containing outliers.
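
Written out explicitly for data points $(x_i, y_i)$ (with distinct $x$ values), the estimates computed by the code below are the pairwise medians

$$\hat m = \operatorname{median}_{i<j}\frac{y_j - y_i}{x_j - x_i}, \qquad \hat b = \operatorname{median}_{i<j}\frac{x_i y_j - x_j y_i}{x_i - x_j}$$

(some formulations instead take $\hat b = \operatorname{median}_i\,(y_i - \hat m\,x_i)$; the code here uses the pairwise-intercept version).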

Here is a reasonably efficient way of calculating this for medium-sized sets of points:

slope[data:{{_,_}..}]:=Median[
 Join@@Table[
  (data[[;;-(n+1),2]] - data[[n+1;;,2]]) / (data[[;;-(n+1),1]] - data[[n+1;;,1]]),
  {n,Length[data]-1}
 ]
]
intercept[data:{{_,_}..}]:=Median[
 Join@@Table[
  (data[[;;-(n+1),1]] data[[n+1;;,2]] - data[[n+1;;,1]] data[[;;-(n+1),2]]) /
  (data[[;;-(n+1),1]] - data[[n+1;;,1]]),
  {n,Length[data]-1}
 ]
]

However, on a machine with 16 GB of RAM, this runs out of memory when processing somewhere between 20,000 points and 50,000 points. How can the code be made more memory-efficient to operate on bigger datasets?

Here is a way of generating an example value for data:

sampleData[nPoints_Integer?Positive] := Module[{datax},
 SeedRandom[0];
 (* x values spread over (0, 10] *)
 datax = 10 N[Normalize[Range[nPoints], Max]];
 Transpose[{
  datax,
  (* a line with intercept 1 and slope 0.001, plus small Gaussian noise,
     plus an extra 0.02 datax shift for roughly 1% of the points (the outliers) *)
  1 + 0.001 datax + RandomReal[NormalDistribution[0, 0.01], nPoints] +
   datax RandomChoice[{0.01, 1 - 0.01} -> {0.02, 0}, nPoints]
 }]
]
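
For example (the size here is arbitrary), the definitions above can be exercised like this, with MaxMemoryUsed[] giving a rough sense of the memory cost so far:

data = sampleData[5000];
{slope[data], intercept[data]}
MaxMemoryUsed[]  (* maximum number of bytes used by the kernel so far in this session *)
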
J. M.'s missing motivation
ArgentoSapiens
  • Maybe some useful information here. – b.gates.you.know.what Jan 31 '15 at 10:44
  • TAOCP offers a collection of extremely efficient methods for such tasks. Truth is: Basic principles of computer sciences cannot be outsmarted by any kind of "standard software". – Jinxed Jan 31 '15 at 23:04
  • Is there some reason not to use a sampling of point pairs, say n log(n) or so for n points? I think I know of an approach that uses maybe O(n) storage, but the time complexity would be n^2 log(n) and that's quite steep for the size range in question. And fishing through the details for an actual implementation would be not so easy. (A rough sketch of this pair-sampling idea appears after these comments.) – Daniel Lichtblau Feb 04 '15 at 16:55
  • @DanielLichtblau, is a random sample guaranteed to give the same answer? One of the main benefits of the technique is its insensitivity to outliers, even if a substantial fraction of the data is composed of outliers. I'd hate to get unlucky and randomly sample only pairs that include the outlier data. – ArgentoSapiens Feb 04 '15 at 18:26
  • No, unfortunately there is no such guarantee beyond what statistics might indicate (e.g. in terms of variance from "correct" median). – Daniel Lichtblau Feb 04 '15 at 19:36
  • @DanielLichtblau, well as the bounty is nonrefundable and there has been no explosion of interest in this question, I suppose an answer that neatly packages a random sampling method could still win. – ArgentoSapiens Feb 04 '15 at 19:39
  • With 20,000 to 50,000 points of real-world data, assuming that the functional relationship can be adequately described by a straight line would seem to be more of an issue than not getting the "exact" median slope from a random sample. Something like quantile regression, which also "ignores" extreme values, might be a better approach. – JimB Jul 04 '23 at 19:12
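
A minimal sketch of the pair-sampling idea discussed in these comments (the name sampledTheilSen and the choice of nPairs are mine, and the result only approximates the full Theil-Sen estimate; as noted above, there is no guarantee of matching it exactly):

(* sample nPairs random point pairs and take the median of their slopes and
   pairwise intercepts; assumes distinct x values, as in sampleData above *)
sampledTheilSen[data_?MatrixQ, nPairs_Integer?Positive] := Module[{pairs},
 (* random index pairs, drawn with replacement; drop the rare i == j draws *)
 pairs = Select[RandomInteger[{1, Length[data]}, {nPairs, 2}], UnsameQ @@ # &];
 Median[
  With[{p = data[[#[[1]]]], q = data[[#[[2]]]]},
   {q[[2]] - p[[2]], q[[1]] p[[2]] - p[[1]] q[[2]]}/(q[[1]] - p[[1]])
  ] & /@ pairs
 ]
]

With nPairs on the order of n Log[n] (a few hundred thousand pairs for n = 50,000), memory use stays modest, at the cost of the sampling error discussed above.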

2 Answers


As a starting point, here is a compact implementation of Theil-Sen:

theilSen[data_?MatrixQ] := Median[With[{df = First[Differences[#]]},
                             {df[[2]], -Det[#]}/df[[1]]] & /@ Subsets[data, {2}]]

It is a bit more efficient than the routine in the OP when I tested it on small sets of points.
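
For a rough side-by-side check on a small sample (the size is arbitrary, and timings will vary by machine), using sampleData, slope, and intercept as defined in the question:

small = sampleData[2000];
AbsoluteTiming[theilSen[small]]
AbsoluteTiming[{slope[small], intercept[small]}]

The two results should agree, since both take medians over the same set of pairwise slopes and intercepts.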


In fact, this formulation shows why it may have trouble "when processing somewhere between 20,000 points and 50,000 points." For data of length $n$, you will be generating $\binom{n}{2}$ subsets. $\binom{50\,000}{2}\approx1.25\times 10^9$, and holding that many point pairs in memory at once will indeed give your computer a hard time. Thus, to process that many points, one might consider resorting to lazy subset generation. But this presents another problem: updating the median as each slope and intercept is generated. There has been some work (e.g. this) on median updating, but I haven't gotten around to fully digesting the literature. I might edit this post later if I figure out a nice implementation of median updating.
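
One crude way to illustrate the lazy idea (the name approxMedianSlope, the fixed binning, and the requirement to supply a bracketing slope range {lo, hi} are all my own choices, not a polished median-update scheme): generate the pairs one index offset at a time, exactly as in the OP's slope function, and accumulate the slopes into fixed bins, from which an approximate median, accurate only to the bin width, is read off at the end. Peak memory is then one offset's worth of slopes plus the bin counts.

approxMedianSlope[data_?MatrixQ, {lo_, hi_, nBins_Integer}] :=
 Module[{counts = ConstantArray[0, nBins], slopes, idx, cum, half},
  Do[
   (* slopes of all pairs at index offset n, as in the OP's slope function *)
   slopes = (data[[;; -(n + 1), 2]] - data[[n + 1 ;;, 2]])/
     (data[[;; -(n + 1), 1]] - data[[n + 1 ;;, 1]]);
   (* bin each slope into one of nBins equal bins on [lo, hi] *)
   idx = Clip[Ceiling[nBins (slopes - lo)/(hi - lo)], {1, nBins}];
   counts += BinCounts[idx, {1, nBins + 1, 1}],
   {n, Length[data] - 1}];
  half = Total[counts]/2;
  cum = Accumulate[counts];
  (* midpoint of the first bin whose cumulative count reaches half of all pairs *)
  lo + (LengthWhile[cum, # < half &] + 0.5) (hi - lo)/nBins
 ]

The same trick applies to the intercepts, and the bracketing range {lo, hi} could come from a small random sample of pairs; a genuine incremental median update would of course be preferable.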

J. M.'s missing motivation

There is a ResourceFunction for TheilSenLine. See https://resources.wolframcloud.com/FunctionRepository/resources/TheilSenLine/. That may give you a bit of flexibility and is even shorter than the other answer here (which is brilliant, BTW), but it bogs down with large $n$. In fact, my kernel quit when trying $5\times10^5$ points, presumably because the memory allocation was exceeded (I set my limit low to avoid catastrophic crashes). Trying $10^3$ points took 1.24 s, while $10^4$ points took much longer, 93.8 s; see below.

dist = MultinormalDistribution[{0, 0}, {{1, 1/2}, {1/2, 1}}];
pointlist = RandomVariate[dist, 10^3];
AbsoluteTiming[line = ResourceFunction["TheilSenLine"][pointlist]]
Show[ListPlot[pointlist],
 Plot[line[[1]] x + line[[2]], {x, Min[pointlist[[All, 1]]], Max[pointlist[[All, 1]]]}]]

dist = MultinormalDistribution[{0, 0}, {{1, 1/2}, {1/2, 1}}];
pointlist = RandomVariate[dist, 10^4];
AbsoluteTiming[line = ResourceFunction["TheilSenLine"][pointlist]]
Show[ListPlot[pointlist],
 Plot[line[[1]] x + line[[2]], {x, Min[pointlist[[All, 1]]], Max[pointlist[[All, 1]]]}]]

[Plot: the 10^4 sampled points with the fitted Theil-Sen line overlaid]

EDIT: As to the question about an efficient implementation, there are several papers to that effect, a few of which are listed in a Cross Validated post. Unfortunately, I have not seen Mathematica code for those. Another question, not asked here, is how to determine confidence intervals; one way is to use the bootstrap (e.g., see this), but there are faster ways as well.
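
A rough sketch of such a percentile bootstrap (the name bootstrapTheilSenCI and the defaults are arbitrary; it reuses theilSen from the other answer, which returns {slope, intercept}, so for large point sets it will be slow):

(* percentile bootstrap confidence intervals for the Theil-Sen slope and intercept *)
bootstrapTheilSenCI[pts_?MatrixQ, nBoot_Integer : 200, level_ : 0.95] :=
 Module[{ests},
  ests = Table[
    theilSen[RandomChoice[pts, Length[pts]]],  (* resample points with replacement *)
    {nBoot}];
  (* percentile intervals, one for the slope and one for the intercept *)
  Map[Quantile[#, {(1 - level)/2, 1 - (1 - level)/2}] &, Transpose[ests]]
 ]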

Carl