Elements appearing with prescribed frequency in a list
Suppose you have a list A
of size n
and a number k
which is at
most n
, and you want to find all elements of A
which appear more
than \(n/k\) times in A
in \(O(n \log k)\) time. You can do this
using algorithm (3) of Finding repeated elements by J. Misra and D.
Gries in Science of Computer Programming Volume 2, Issue 2, November
1982, pp.143-152, which I’m going to describe here.
Imagine that we go through A
forming as many k
-tuples of distinct
objects as possible and throwing them away. When we finish doing this,
at most k-1
distinct elements of A
can remain. If \(x\) is an element
of A
which appears more than \(n/k\) times in A
then it must be one
of these remaining elements, since there are at most \(n/k\) distinct
k
-tuples and each contains at most one x
. So all we need to do is
efficiently throw away distinct k
-tuples, obtaining at most k-1
candidates for elements which appear more than \(n/k\) times, and then
go through A
counting how many times these elements really do appear.
The algorithm keeps a multiset M
of size at most k
. If we represent
this with a sensible data structure, all basic operations (adding an
element, decreasing the count of an element, checking if something is an
element) will cost
\(O(\log k)\).
The first part of the algorithm goes like this:
for e in A:
if M.contains(e) or M has < k-1 distinct elements:
M.add(e)
else:
decrease the count of every element of M by 1
(When a count is zero, the element is deleted). At the end of this loop
any element appearing more than \(n/k\) times belongs to M
, so we can
form a new multiset containing the elements of M
once each, go through
A
one element at a time, increase the count of that element in M
if
it belongs to M
, and at the end deduce the true count of each element
of M
.
The suite of the else clause costs \(O(k \log k)\) and can happen at most \(n/k\) times, so the total cost of the whole algorithm is \(O(n \log k)\).