2

This is an instance of the "coupon collector's problem" in which not all the "coupons" have the same probability (more exactly, the probability of coupon $i$ follows Zipf distribution $p_i \sim Zipf(i)$.

There are $N$ distinct coupons. We keep collecting coupons till we have $N'$ distinct coupons. How many distinct rare coupons do we expect to collect? A rare coupon is defined as a coupon with probabilities less than $p_{N'}$. In other words, the top common $N'$ distinct coupons are not rare and the rest are rare.

Can someone point me in the right direction?

1a1a11a
  • 73
  • I take it you're using "we have" in two different sense here? By "we have $N$ coupons", you mean "there are $N$ coupons in total", and by "we have $N'$ coupons" you mean "we have collected $N'$ coupons", right? – joriki Mar 21 '23 at 00:56
  • Also, I assume that as usual in this context, by $N$ coupons and $N'$ coupons you mean distinct coupons (i.e. types of coupons). For the number of rare coupons whose expectation you want, do you also mean distinct rare coupons? – joriki Mar 21 '23 at 00:59
  • Yes, you are right! "We have N coupons" means that "there are N coupons". And I mean distinct coupons. I will update the questions. Thank you! – 1a1a11a Mar 21 '23 at 01:41
  • Oh, you set a bounty for this! I didn't notice that until after I posted my answer :-) – joriki Mar 23 '23 at 15:55

1 Answers1

1

I’ll do this for separate numbers $m$ of distinct coupons collected and $n$ of distinct coupons regarded as common because I don’t see any simplification allowed by the special case $m=n=N'$.

By linearity of expectation, the expected number of distinct rare coupons collected is the sum over $i\gt n$ of the probabilities $c_i$ of coupon $i$ being collected.

This is a generalization of Last coupon collected in the coupon collectors problem, which amounts to the special case $m=N-1$. We can solve the more general problem with the same Poissonization approach.

So let coupons be drawn in a continuous Poisson process with rate $1$ that is the sum of $N$ independent Poisson processes with rates $p_i$.

The probability density for the process for coupon $i$ to yield its first coupon at time $t$ is $p_i\mathrm e^{-p_it}$. At that time, the probability that $m$ distinct coupons have not yet been collected is

$$ \sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\prod_{j\in S}\mathrm e^{-p_jt}\;, $$

where $\ell=N-m$, $P_i$ is the set of sets of coupons other than $i$ with at least $\ell$ coupons, and $(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}$ is minus the Möbius function on $P_i\cup\{\emptyset\}$, ordered by inclusion, which you can check from

$$ \sum_{T\supset S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}=\sum_{\ell\le k\le|T|}(-1)^{k-\ell}\binom{k-1}{\ell-1}\binom{|T|}{k}=1\;. $$

Thus, the probability $c_i$ that when coupon $i$ is first collected $m$ distinct coupons have not yet been collected, and thus coupon $i$ gets collected, is

\begin{eqnarray} c_i &=& \int_0^\infty\mathrm dtp_i\mathrm e^{-p_it}\sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\prod_{j\in S}\mathrm e^{-p_jt} \\ &=& \sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\frac{p_i}{p_i+\sum_{j\in S}p_j}\;. \end{eqnarray}

For $m=N-1$ and thus $\ell=1$, we recover the probability for coupon $i$ not to be the last coupon collected,

$$ c_i=\sum_{S\subset P_i}(-1)^{|S|-1}\frac{p_i}{p_i+\sum_{j\in S}p_j}\;. $$

For $m=1$ and thus $\ell=N-1$, the sum has a single term, in which $S$ is the set of all $\ell$ coupons other than $i$, and we recover the trivial result $c_i=p_i$.

The expected number of rare coupons collected is

$$ \sum_{i\gt n}c_i=\sum_{i\gt n}\sum_{S\subset P_i}(-1)^{|S|-\ell}\binom{|S|-1}{\ell-1}\frac{p_i}{p_i+\sum_{j\in S}p_j}\;. $$

We can write this as a single sum by collecting the terms with the same denominator:

$$ \sum_{i\gt n}c_i=\sum_T(-1)^{|T|-\ell-1}\binom{|T|-2}{\ell-1}\frac{\sum_{n\lt j\in T}p_j}{\sum_{j\in T}p_j}\;, $$

where the sum now runs over all sets of at least $\ell+1$ coupons.

I don’t see how either of these representations could be simplified for the specific case of a Zipf distribution.

joriki
  • 238,052