I have been searching for the most efficient (streaming??) algorithm that tells me the 'k' most frequently occurring elements in a data stream at any point in time. This post: "Divide and conquer" data stream algorithms got me interested in it.
For example, suppose there are numbers: (4,3,5,1,6,2,4,3,3,8,9,1) and I query for the 3 most frequently occurring numbers (say), then I should get (3,4,1) as the answer.
I tried searching online, but couldn't find any place that gives an approach and says that that is the best. A trivial solution would be to use a heap or a balanced binary tree, but I think there is a better way and I wanted to know if it is documented somewhere.
Edit: I'm looking for an algorithm that always gives the correct answer as opposed to a appromixation algorithm (many of which pop up in search results) which rely on the distribution of data in some way or other
There's an extensive literature on this problem. Firstly, even for $k=1$, the problem is hard: as Alon, Matias and Szegedy show, you can't get a better than constant factor approximation to the frequency of the most popular element in $o(n)$ space, even if you're willing to be randomized.
However, if you're guaranteed that one element occurs strictly more than 50% of the time, then there's a simple trick that uses constant space and will find it (this is a popular google puzzle question). More generally, you can find elements that occur more than $n/k$ times using a generalization of this trick.
A definitive survey of what's known about the problem is the paper by Cormode and Hadjileftheriou from VLDB 2008. It covers the above methods and many of the latest methods. the general idea is that you can generate an approximate list of the top $k$ items (where approximate here means that you might get items whose counts are close to the counts of the top $k$ items).
External links referenced by this document:
Local articles referenced by this article: