- lower-bounds pr.probability st.statistics sample-complexity
- Updated Sat, 04 Jun 2022 23:30:04 GMT

Say we have black-box access to a distribution $\mathcal{D}$ with finite support $\{1,2,\dots,n\}$ and probability mass function $i \mapsto p_i$. **How many samples from $\mathcal{D}$ are needed to learn the mode of $\mathcal{D}$, i.e., the $j$ such that $p_j=\max_i\{p_i\}$, with error probability at most $\delta$?** Is there any known lower bound? Any idea, comment, or reference would be very helpful.

- I know that in the case $n=2$, $\Theta(\frac{1}{d_H^2} \log\frac{1}{\delta})$ samples are necessary and sufficient, where $d_H$ is the Hellinger distance between $\mathcal{D}=(p,1-p)$ and $\mathcal{D}'=(1-p,p)$.
- I proved an upper bound of $O(\frac{1}{\varepsilon^2} \log \frac{n}{\delta})$ via the standard Hoeffding bound, where $\varepsilon$ is the gap between the largest and the second-largest of the $p_i$'s.
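The Hoeffding-based upper bound corresponds to the plug-in estimator: draw $O(\frac{1}{\varepsilon^2}\log\frac{n}{\delta})$ samples and output the most frequent outcome. A minimal sketch (the function name and the constant $2$ in the sample count are my own choices, not from the post):

```python
import math
import random

def estimate_mode(sample_fn, n, eps, delta):
    """Plug-in mode estimator: draw O(eps^-2 log(n/delta)) samples
    from sample_fn (outcomes in 1..n) and return the most frequent one."""
    m = math.ceil(2 / eps**2 * math.log(2 * n / delta))
    counts = [0] * (n + 1)          # counts[i] = occurrences of outcome i
    for _ in range(m):
        counts[sample_fn()] += 1
    return max(range(1, n + 1), key=counts.__getitem__)

# Example: distribution on {1,2,3} with p = (0.5, 0.3, 0.2), so the gap
# between the two largest masses is eps = 0.2 and the true mode is 1.
random.seed(0)
p = [0.5, 0.3, 0.2]
draw = lambda: random.choices([1, 2, 3], weights=p)[0]
mode = estimate_mode(draw, n=3, eps=0.2, delta=0.01)
```

By Hoeffding plus a union bound over the $n$ outcomes, each empirical frequency is within $\varepsilon/2$ of its mean with probability $1-\delta$, so the empirical mode equals the true mode.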

Update: I proved a lower bound of $\Omega(\frac{1}{\gamma}\log\frac{1}{\delta})$ by reducing the problem to an $n=2$ instance, where $\gamma=\min_i\left(1-2\sqrt{\frac{pp_i}{(p+p_i)^2}}\right)$ with $p$ being the probability of the mode. But I still want to close the gap between the upper and lower bounds.
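For reference, here is a sketch of how such a $\gamma$ can arise (this is my reconstruction of the reduction, assuming one conditions on the sample landing on one of the two relevant outcomes): conditioned on seeing either the mode or outcome $i$, distinguishing $\mathcal{D}$ from the distribution with $p$ and $p_i$ swapped is a two-point (Bernoulli) problem, whose squared Hellinger distance is

$$
H^2\!\left(\left(\tfrac{p}{p+p_i},\tfrac{p_i}{p+p_i}\right),\left(\tfrac{p_i}{p+p_i},\tfrac{p}{p+p_i}\right)\right)
= 1 - \frac{2\sqrt{p\,p_i}}{p+p_i}
= 1 - 2\sqrt{\frac{p\,p_i}{(p+p_i)^2}},
$$

which is exactly the $i$-th term in the definition of $\gamma$.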

A followup question: Say we have black-box access to a distribution $\mathcal{D}$ with finite support $A\cup B:=\{a_1,...,a_n\}\cup\{b_1,...,b_m\}$ and probability mass function $a_i \mapsto p_i$ and $b_i \mapsto q_i$. **How many samples from $\mathcal{D}$ are required to ensure that the most frequent sample lies in $A$ (with error probability $\delta$)?** As a sanity condition, it is guaranteed that $A$ contains the mode of $\mathcal{D}$.

- Following Clement's answer, I could prove an upper bound of $O(\frac{1}{\varepsilon^2}\log\frac{1}{\delta})$, where $\varepsilon=\max_i\{p_i\}-\max_i\{q_i\}$. However, a matching lower bound has eluded me, mainly because reducing to a Bernoulli case does not seem to work here.
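A minimal sketch of the natural tester for this followup (the helper name `mode_in_A` and the constant in the sample count are hypothetical; it assumes the gap $\varepsilon$ is known a priori): draw $O(\frac{1}{\varepsilon^2}\log\frac{1}{\delta})$ samples and report whether the empirical mode lands in $A$.

```python
import math
import random

def mode_in_A(sample_fn, A, eps, delta):
    """Draw O(eps^-2 log(1/delta)) samples and report whether the
    most frequent sample lies in A; eps is the (assumed known) gap
    max_i p_i - max_j q_j between the best of A and the best of B."""
    m = math.ceil(2 / eps**2 * math.log(2 / delta))
    counts = {}
    for _ in range(m):
        x = sample_fn()
        counts[x] = counts.get(x, 0) + 1
    return max(counts, key=counts.get) in A

# Example with A = {a1, a2}, B = {b1}; the mode a1 is in A, and the
# gap is eps = max p_i - max q_j = 0.45 - 0.30 = 0.15.
random.seed(2)
weights = {"a1": 0.45, "a2": 0.25, "b1": 0.30}
draw = lambda: random.choices(list(weights), list(weights.values()))[0]
in_A = mode_in_A(draw, {"a1", "a2"}, eps=0.15, delta=0.01)
```

Note that only the two frequencies $\max_i p_i$ and $\max_j q_j$ need to be separated, which is why no $\log n$ factor appears in this bound.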

From the DKW inequality, it follows that one can learn an arbitrary distribution over $\mathbb{R}$ (and in particular over $\{1,2,\dots,n\}$) to Kolmogorov distance* $\varepsilon$ with probability at least $1-\delta$ using $$ O\!\left(\frac{1}{\varepsilon^2}\log\frac{1}{\delta}\right)$$ samples. This is optimal in view of the complexity of learning even a single Bernoulli (biased coin): embed a biased coin (with bias either $1/2+\varepsilon$ or $1/2-\varepsilon$) in the middle of the domain, with all other elements having mass $0$; figuring out which element is the mode then amounts to figuring out the bias.
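A small numerical illustration of the DKW guarantee (the specific distribution and parameters are my own example): with $m = \lceil \ln(2/\delta)/(2\varepsilon^2)\rceil$ samples, the empirical CDF is within Kolmogorov distance $\varepsilon$ of the true CDF with probability at least $1-\delta$.

```python
import math
import random

# DKW: m = ceil(ln(2/delta) / (2 eps^2)) samples suffice for
# sup_x |F_hat(x) - F(x)| <= eps with probability >= 1 - delta.
eps, delta = 0.05, 0.01
m = math.ceil(math.log(2 / delta) / (2 * eps ** 2))

random.seed(1)
p = [0.5, 0.3, 0.2]                           # true pmf on {1, 2, 3}
samples = random.choices([1, 2, 3], weights=p, k=m)

counts = [samples.count(i) for i in (1, 2, 3)]
F_hat = [sum(counts[: i + 1]) / m for i in range(3)]   # empirical CDF
F = [0.5, 0.8, 1.0]                            # true CDF
kolmogorov = max(abs(a - b) for a, b in zip(F_hat, F))
```

Since each point mass is a difference of two CDF values, Kolmogorov distance $\varepsilon$ bounds every $|\hat p_i - p_i|$ by $2\varepsilon$, which is how the mode (and the gap) can be read off the empirical distribution.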

This implies the same upper (and lower) bounds for your question, with $\varepsilon$ defined as the gap *(assuming you have a priori knowledge of that gap; this is not quite apparent from your post. If not, something adaptive and slightly more involved might be required).*

$*$ Kolmogorov distance is the $\ell_\infty$ distance between CDFs. This guarantee is stronger than (and implies) learning the pmf in $\ell_\infty$. For more, see, e.g., this short note of mine, specifically Section 4 (no new results, just exposition of "folklore" facts).

- +0 – isn't $\Omega\left(\frac{1}{\epsilon^2}\log\frac{1}{\delta}\right)$ a lower bound for any $\mathcal{D}$? — May 20, 2022 at 05:27
- +0 – @mathworker21 yes, you can start with any distribution and embed the corresponding biased coin problem (with a suitable $\varepsilon$) in there. Is that what you mean? — May 20, 2022 at 05:37
- +0 – By "what you mean", do you mean "what you had in mind for a proof"? I meant what I asked, namely if $\Omega(\epsilon^{-2}\log\delta^{-1})$ is always a lower bound; it seems you agree that is always a lower bound. — May 20, 2022 at 06:02
- +0 – @ClementC. Wow, thank you for your answer! And also for writing up folklore results; it really helps non-specialists like me. Actually, this question was already inspired by your note on distinguishing discrete distributions:) — May 20, 2022 at 12:47
- +2 – Erm, why the downvote? Is there something wrong with this answer? — May 22, 2022 at 01:11
