Theoretical Computer Science
ds.algorithms online-algorithms data-streams
Updated Sat, 21 May 2022 18:19:16 GMT

Algorithm for 'k'' most frequently occurring numbers


I have been searching for the most efficient (streaming??) algorithm that tells me the 'k' most frequently occurring elements in a data stream at any point in time. This post: "Divide and conquer" data stream algorithms got me interested in it.

For example, suppose there are numbers: (4,3,5,1,6,2,4,3,3,8,9,1) and I query for the 3 most frequently occurring numbers (say), then I should get (3,4,1) as the answer.

I tried searching online, but couldn't find any place that gives an approach and says that that is the best. A trivial solution would be to use a heap or a balanced binary tree, but I think there is a better way and I wanted to know if it is documented somewhere.

Edit: I'm looking for an algorithm that always gives the correct answer as opposed to a appromixation algorithm (many of which pop up in search results) which rely on the distribution of data in some way or other




Solution

There's an extensive literature on this problem. Firstly, even for $k=1$, the problem is hard: as Alon, Matias and Szegedy show, you can't get a better than constant factor approximation to the frequency of the most popular element in $o(n)$ space, even if you're willing to be randomized.

However, if you're guaranteed that one element occurs strictly more than 50% of the time, then there's a simple trick that uses constant space and will find it (this is a popular google puzzle question). More generally, you can find elements that occur more than $n/k$ times using a generalization of this trick.

A definitive survey of what's known about the problem is the paper by Cormode and Hadjileftheriou from VLDB 2008. It covers the above methods and many of the latest methods. the general idea is that you can generate an approximate list of the top $k$ items (where approximate here means that you might get items whose counts are close to the counts of the top $k$ items).





Comments (5)

  • +1 – +1. I think the > 50% of the time algorithm is a well known one (majority element algorithm) as you mentioned — Sep 15, 2010 at 06:27  
  • +2 – Thanks!! The paper by Cormode and Hadjileftheriou that you mentioned cites this paper: citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.7889 which has the same technique which I was thinking about. It maintains 2 linked lists; one by frequency and within it another list of all elements with the same frequency. — Sep 15, 2010 at 06:29  
  • +0 – can you elaborate on the more than 50 percent algorithm ? and the google puzzle ? I cannot follow this sloppy reasoning as you have just touched on it and not fully expended upon "the well -known trick". Thanks. — Sep 16, 2010 at 07:06  
  • +0 – Here's a link: userweb.cs.utexas.edu/users/misra/scannedPdf.dir/… — Sep 16, 2010 at 07:50  
  • +0 – This is a comment (not enough reputation) on Suresh Venkat's link userweb.cs.utexas.edu/users/misra/scannedPdf.dir/…: It looks like the algorithm presented there requires a second pass through the data, which is not allowed here. In fact, I don't see how a one-pass algorithm with O(1) space requirements can exist. — Sep 16, 2010 at 09:45  


Linked Articles

Local articles referenced by this article: