Theoretical Computer Science
randomness sorting
Updated Wed, 02 Nov 2022 23:30:19 GMT

# Expected number of random comparisons needed to sort a list

Consider the task of sorting a list x of size n by repeatedly querying an oracle. The oracle draws, without replacement, a random pair of indices (i, j), with i != j, and returns (i, j) if x[i] < x[j] and (j, i) if x[i] > x[j].

How many times, on average, do we need to query the oracle to sort the list? How is that average affected if the draw is made with replacement? Looking for a formula but an asymptotic estimate or an order would also be great.

## Solution

This answer gives exact formulas for the expected number of steps, with and without replacement. To be clear, we interpret OP's problem as detailed in OP's Python gist: each step of the process makes one random comparison, and the process stops as soon as the comparisons made are sufficient to determine the total order.

Lemma 1. (i) With replacement the expected number of comparisons is $${n \choose 2} H_{n-1},$$ where $$H_k = \sum_{i=1}^k \frac 1 i \approx 0.5+\ln k$$ is the $$k$$th harmonic number.

(ii) Without replacement, the expected number of comparisons is $$n^2/2 - n + 3/2 - 1/n.$$

Proof. Assume WLOG that the set of numbers in $$x$$ is $$[n]$$. Let random variable $$T$$ be the total number of steps.

As observed in @user3257842's answer, a necessary and sufficient condition for the full order to be known is that, for all $$i\in [n-1]$$, the positions holding elements $$i$$ and $$i+1$$ have been compared. Call such positions "neighbors".

Proof of Part (i). For $$m \in [n-1]$$, let random variable $$T_m$$ be the number of steps such that, at the start of the step, the number of neighbors that have not yet been compared is $$m$$. Then the total number of steps is $$T=T_{n-1} + T_{n-2} + \cdots + T_1$$. By linearity of expectation, $$E[T] = \sum_{m=1}^{n-1} E[T_m]$$. To finish, we calculate $$E[T_m]$$ for any $$m\in [n-1]$$.

Consider any $$m\in [n-1]$$. Consider any iteration that starts with $$m$$ neighbors not yet compared. Given this, $$m$$ of the $$n\choose 2$$ possible comparisons would compare one not-yet compared pair of neighbors, while the remaining comparisons would not (so would leave the number of uncompared neighbors unchanged). So the number of not-yet-compared neighbors either stays the same, or, with probability $$m/{n\choose 2}$$, reduces by 1. It follows that $$E[T_m]$$, the expected number of steps that start with $$m$$ uncompared neighbors, is $${n\choose 2}/m$$.

So $$E[T] = \sum_{m=1}^{n-1} {n\choose 2}/m = {n\choose 2} H_{n-1}$$, proving Part (i).

Proof of Part (ii). View the process as follows: a random permutation of the $$n\choose 2$$ pairs is chosen, and then the algorithm checks each pair (in the chosen order) just until it has compared all neighboring pairs.

That is, we want the expected position of the last neighboring pair in the random ordering of pairs. There are $$n-1$$ neighboring pairs and $${n\choose 2} - (n-1) = {n-1\choose 2}$$ non-neighboring pairs. By a standard calculation, the expected position of the last neighboring pair is $${n \choose 2} - {n-1\choose 2}/n = n^2/2 - n + 3/2 - 1/n$$. $$~~~\Box$$

Here is some intuition about the last "standard calculation". Suppose you choose a $$k$$-subset $$S$$ uniformly at random from $$[m+k]$$. What's the expectation of $$\max S$$? The $$k$$ chosen elements divide the $$m$$ unchosen elements into $$k+1$$ intervals of total size $$m$$. By a symmetry argument the expected number of elements in each of these intervals is the same, so each interval has size $$m/(k+1)$$ in expectation. In particular, the expected size of the last interval is $$m/(k+1)$$, so the expectation of the largest element is $$m+k - m/(k+1)$$.

In our case $$m+k = {n \choose 2}$$ and $$k=n-1$$.