
The usual simple algorithm for finding the median element in an array $A$ of $n$ numbers is:

  • Sample $n^{3/4}$ elements from $A$ with replacement into $B$
  • Sort $B$ and find the elements $l$ and $r$ of $B$ with ranks $|B|/2 - \sqrt{n}$ and $|B|/2 + \sqrt{n}$
  • Check that $l$ and $r$ are on opposite sides of the median of $A$ and that there are at most $Cn^{3/4}$ elements of $A$ between $l$ and $r$, for some appropriate constant $C > 0$. Fail if either check does not hold.
  • Otherwise, find the median by sorting the elements of $A$ between $l$ and $r$

It's not hard to see that this runs in linear time and that it succeeds with high probability. (All the bad events are large deviations away from the expectation of a binomial.)
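The steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name `sampled_median`, the constant `c`, the size threshold `c * n ** 0.75` (chosen because a window of $2\sqrt{n}$ ranks in a sample of size $n^{3/4}$ corresponds to about $2n^{3/4}$ elements of $A$), and the fallback to plain sorting for small inputs are all choices made here, not part of the original description.

```python
import math
import random

def sampled_median(a, c=4.0, rng=random):
    """Return the lower median of a, or None if the sampling step fails.

    Hypothetical sketch: c and the small-input cutoff are assumptions.
    """
    n = len(a)
    if n < 100:                          # small inputs: just sort
        return sorted(a)[(n - 1) // 2]
    m = int(n ** 0.75)                   # sample size
    b = sorted(rng.choice(a) for _ in range(m))   # sample with replacement
    d = int(math.sqrt(n))
    lo = b[max(m // 2 - d, 0)]           # element l
    hi = b[min(m // 2 + d, m - 1)]       # element r
    below = sum(1 for x in a if x < lo)  # elements strictly left of l
    middle = [x for x in a if lo <= x <= hi]
    k = (n - 1) // 2                     # 0-based rank of the lower median
    median_inside = below <= k < below + len(middle)
    if not median_inside or len(middle) > c * n ** 0.75:
        return None                      # fail, as in the third step
    middle.sort()
    return middle[k - below]
```

A caller that wants a guaranteed answer can simply retry on `None`; each attempt does a linear scan plus two sorts of $O(n^{3/4})$ elements.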

An alternate algorithm for the same problem, which is more natural to teach to students who have seen quicksort, is the one described here: Randomized Selection

It is also easy to see that this one has linear expected running time: say that a "round" is a sequence of recursive calls that ends when one gives a 1/4-3/4 split, and then observe that the expected length of a round is at most 2. (In the first draw of a round, the probability of getting a good split is 1/2, and it only increases afterwards, as the algorithm was described, so the round length is dominated by a geometric random variable.)
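For concreteness, here is a minimal sketch of randomized selection that also counts how many elements are compared against pivots, which is the quantity the round argument bounds. The name `randomized_select` and the iterative, list-copying style are assumptions made for readability; the linked description may differ in details.

```python
import random

def randomized_select(a, k, rng=random):
    """Return (k-th smallest element of a, 0-based; elements compared to pivots)."""
    items = list(a)
    comparisons = 0
    while True:
        pivot = rng.choice(items)        # uniformly random pivot
        comparisons += len(items)        # every element is compared to the pivot
        smaller = [x for x in items if x < pivot]
        equal_count = sum(1 for x in items if x == pivot)
        if k < len(smaller):
            items = smaller              # target is left of the pivot
        elif k < len(smaller) + equal_count:
            return pivot, comparisons    # target equals the pivot
        else:
            k -= len(smaller) + equal_count
            items = [x for x in items if x > pivot]   # target is right of the pivot
```

Calling `randomized_select(data, len(data) // 2)` returns a median together with the comparison count for that run.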

So now the question:

Is it possible to show that randomized selection runs in linear time with high probability?

We have $O(\log n)$ rounds, and each round has length at least $k$ with probability at most $2^{-k+1}$, so a union bound gives that the running time is $O(n\log\log n)$ with probability $1-O(1/\log n)$.
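To spell out that calculation, assume there are at most $R \le c\log_2 n$ rounds for some constant $c$, write $L_i$ for the length of round $i$, and take $k = 2\log_2\log_2 n$ (the symbols $R$, $c$ and $L_i$ are introduced here only for illustration). The union bound gives

$$\Pr[\exists\, i \le R : L_i \ge k] \;\le\; R \cdot 2^{-k+1} \;\le\; c\log_2 n \cdot \frac{2}{(\log_2 n)^2} \;=\; \frac{2c}{\log_2 n}.$$

If every round has length less than $k$, then since round $i$ starts with at most $(3/4)^i n$ elements and each call in it costs at most that much, the total work is at most

$$\sum_{i \ge 0} k\,(3/4)^i\, n \;=\; 4kn \;=\; O(n\log\log n).$$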

This is kind of unsatisfying, but is it actually the truth?

Louis
  • Please clarify which algorithm your questions refer to. – Raphael Apr 18 '12 at 20:32
  • Are you asking if you applied your union bound correctly, or if there's a better, more satisfying bound? – Joe Apr 18 '12 at 23:42
  • @Joe The latter. The point is that rounds are an artifact to get that the round length is dominated by a geometric. Then the analysis "forgets" whether the algorithm is ahead of or behind the one that always gets a 1/4-3/4 split on the nose, in order to make the geometrics independent. I'm asking whether this "cheating", as Yuval put it below, is still tight. – Louis Apr 19 '12 at 06:45

1 Answer


It's not true that the algorithm runs in linear time with high probability. Considering only the first round, the running time is at least $\Theta(n)$ times a $G(1/2)$ random variable. Let $p(n) \longrightarrow 0$ be the allowed failure probability. Since $\Pr[G(1/2) \geq \log_2 p(n)^{-1}] = p(n)$, the running time is at least $\Omega(n \log_2 p(n)^{-1}) = \omega(n)$.

(There is some cheating involved, since the length of the first round isn't really $G(1/2)$. More careful analysis might or might not validate this answer.)
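One way to probe this heuristic empirically is a small simulation. The sketch below assumes distinct keys, so that the pivot's rank is uniform on the current subarray and only subproblem sizes need to be tracked; the names `select_cost` and `tail_fraction` and the parameters `trials` and `seed` are hypothetical. It estimates, for a given $C$, how often selecting the median costs at least $Cn$ comparisons.

```python
import random

def select_cost(n, k, rng):
    """Elements compared to pivots when selecting rank k (0-based) among n distinct keys."""
    cost = 0
    while True:
        cost += n                  # partitioning a subarray of size n
        p = rng.randrange(n)       # rank of the random pivot in the subarray
        if p == k:
            return cost
        if p > k:
            n = p                  # keep the p elements below the pivot
        else:
            n, k = n - p - 1, k - p - 1   # keep the elements above the pivot

def tail_fraction(n, c, trials=2000, seed=0):
    """Fraction of simulated median selections whose cost is at least c * n."""
    rng = random.Random(seed)
    hits = sum(select_cost(n, n // 2, rng) >= c * n for _ in range(trials))
    return hits / trials
```

Comparing `tail_fraction(n, c)` for a fixed `c` across increasing `n` gives a rough sense of whether the tail probability stays bounded away from zero; a simulation can only suggest this, not prove it.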

Edit: Grübel and Rösler proved that the number of comparisons divided by $n$ tends (in some sense) to some limit distribution, which is unbounded. See for example Grübel's paper "Hoare's selection algorithm: a Markov chain approach", which references their original paper.

Yuval Filmus
  • Here is the thing that bothers me. Like I said in my comment above, rounds are just a way to analyze a "slowed down" version of the algorithm that waits until it gets a good enough pivot to proceed. What you're showing is that for any fixed $C>0$ the probability of the first round needing more than $C$ pivots is $>0$. But, in principle, a long first round could be offset by an empty 2nd round, in the sense that at the end, the "un-slowed" algorithm caught up to the one that always gets a 1/4-3/4 split. – Louis Apr 19 '12 at 07:01
  • That's not true: if the first round is long then the entire running time is long, since further rounds cannot decrease the running time. The point is that for any $C$, the first round takes time at least $Cn$ with some constant probability $p_C > 0$. – Yuval Filmus Apr 19 '12 at 18:47
  • I am happier now, since the round length is not much smaller than the geometric being used for the upper bound. I guess this is what G&R are making rigorous. Nice answer. – Louis Apr 19 '12 at 19:54