Elements appearing with prescribed frequency in a list

Suppose you have a list A of size n and a number k which is at most n, and you want to find all elements of A which appear more than \(n/k\) times in A in \(O(n \log k)\) time. You can do this using algorithm (3) of Finding repeated elements by J. Misra and D. Gries in Science of Computer Programming Volume 2, Issue 2, November 1982, pp.143-152, which I’m going to describe here.

Imagine that we go through A forming as many k-tuples of distinct objects as possible and throwing them away. When we finish doing this, at most k-1 distinct elements of A can remain. If \(x\) is an element of A which appears more than \(n/k\) times in A then it must be one of these remaining elements, since there are at most \(n/k\) distinct k-tuples and each contains at most one x. So all we need to do is efficiently throw away distinct k-tuples, obtaining at most k-1 candidates for elements which appear more than \(n/k\) times, and then go through A counting how many times these elements really do appear.

The algorithm keeps a multiset M of size at most k. If we represent this with a sensible data structure, all basic operations (adding an element, decreasing the count of an element, checking if something is an element) will cost \(O(\log k)\).

The first part of the algorithm goes like this:

for e in A:
    if M.contains(e) or M has < k-1 distinct elements:
        M.add(e)
    else:
        decrease the count of every element of M by 1

(When a count is zero, the element is deleted). At the end of this loop any element appearing more than \(n/k\) times belongs to M, so we can form a new multiset containing the elements of M once each, go through A one element at a time, increase the count of that element in M if it belongs to M, and at the end deduce the true count of each element of M.

The suite of the else clause costs \(O(k \log k)\) and can happen at most \(n/k\) times, so the total cost of the whole algorithm is \(O(n \log k)\).