Some rough notes on basic data structures and the complexities of associated operations, with usage notes in Java and Python.

Arrays

An array is an implementation of a list in which we store objects in consecutive memory locations. This makes random access to list elements fast, but insertions and deletions slow, since inserting or deleting at position i forces every subsequent object to shift by one place.

Python “lists” are actually dynamic arrays. Java has fixed-size built-in arrays, but the resizable equivalent of a Python list is java.util.ArrayList:

import java.util.ArrayList;
ArrayList<Integer> l = new ArrayList<Integer>();

Here are the amortized runtimes.

Operation   Time complexity   Java ArrayList                           Python list
insert      O(n)              l.add(i, e)                              l.insert(i, e)
delete      O(n)              l.remove(i)                              del l[i], l.remove(e)
append      O(1)              l.add(e)                                 l.append(e)
min/max     O(n)              Collections.min(l), Collections.max(l)   min(l), max(l)
get         O(1)              l.get(i)                                 l[i]
set         O(1)              l.set(i, e)                              l[i] = e

The convention here is that n is the list length, e is a list element of the appropriate type and i is an int index.
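A minimal Python sketch of the operations in the table, with the cost of each noted inline:

```python
l = [10, 20, 30]

l.insert(1, 15)      # O(n): everything after index 1 shifts right
assert l == [10, 15, 20, 30]

del l[0]             # O(n): everything after index 0 shifts left
assert l == [15, 20, 30]

l.append(40)         # amortized O(1)
assert l == [15, 20, 30, 40]

assert min(l) == 15 and max(l) == 40   # O(n) scans

assert l[2] == 30    # O(1) random access
l[2] = 35            # O(1)
assert l == [15, 20, 35, 40]
```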

Linked lists

A linked list is another list implementation. This time each object in the list carries a pointer to the next (maybe also previous) element in the list. This means insertion and deletion at a known node are cheap, as we only have to alter the pointers, but getting or setting the ith element of the list is slow since we have to start at one end and walk i steps to find it.

In Python, collections.deque is a double-ended queue (in CPython, implemented as a doubly linked list of fixed-size blocks). In Java, LinkedList is a doubly linked, double-ended list. You can make one with

import java.util.LinkedList;
LinkedList<String> list = new LinkedList<>();

but frankly you should only type something as monstrous as “new linkedlist open pointy brackets closed pointy brackets open round brackets closed round brackets semicolon” if someone is paying you good money to do so.

Here are the amortized running times:

Operation   Time complexity   Java LinkedList                          Python deque
insert      O(1)*             l.add(i, e)                              d.insert(i, e)
delete      O(1)*             l.remove(i)                              d.remove(e)
append      O(1)              l.addLast(e)                             d.append(e)
prepend     O(1)              l.addFirst(e)                            d.appendleft(e)
get         O(n)              l.get(i)                                 d[i]
set         O(n)              l.set(i, e)                              d[i] = e
min/max     O(n)              Collections.min(l), Collections.max(l)   min(d), max(d)

*The O(1) refers to splicing a node in or out once you hold a reference to it; the index- and value-based methods shown still spend O(n) walking to the right position first.
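Here is the deque version of the earlier list sketch; the end operations are the ones it is good at:

```python
from collections import deque

d = deque(["b", "c"])
d.appendleft("a")            # O(1) at either end
d.append("d")                # O(1)
assert list(d) == ["a", "b", "c", "d"]

assert d.popleft() == "a"    # O(1)
assert d.pop() == "d"        # O(1)

# Indexing works, but costs O(n) away from the ends.
assert d[0] == "b"

assert min(d) == "b" and max(d) == "c"   # O(n) scans
```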

Maps/dictionaries

Mathematically, a map/dict is a function from a set of keys to a set of values. You can think of them as a generalisation of an array: instead of having a value for each index i between 0 and n-1, we have a value for each key belonging to some set of objects of an arbitrary (hashable) type.

In Python you declare a new empty dict with d = {}. In Java, one implementation is HashMap

import java.util.HashMap; 
HashMap<String, String> hm = new HashMap<String, String>();

Many other implementations exist; see the java.util.Map docs.

Average cases and operations:

Operation    Time complexity   Java HashMap       Python dict
insert       O(1)              h.put(k, v)        d[k] = v
delete       O(1)              h.remove(k)        d.pop(k), del d[k]
membership   O(1)              h.containsKey(k)   k in d
get          O(1)              h.get(k)           d[k]

Worst cases are different - if you’re unlucky with hashing, membership can be O(n) for example.
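The Python side of the table in one short sketch:

```python
d = {}
d["alice"] = 1               # insert, average O(1)
d["bob"] = 2
assert "alice" in d          # membership, average O(1)
assert d["bob"] == 2         # get, average O(1)

assert d.pop("alice") == 1   # delete, average O(1)
del d["bob"]                 # the other delete syntax
assert len(d) == 0
```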

Sets

This isn’t really a concrete type, but I’ll talk about the standard implementations. Java set docs are here. One implementation is HashSet:

import java.util.HashSet;
Set<String> hashset = new HashSet<>();

Guava’s ImmutableSet is an alternative with different methods.

You can convert from other data types by passing them to the constructor.

Operation   Time complexity   Java HashSet     Python set
in          O(1)              h.contains(e)    e in s
union       O(n+m)            h.addAll(t)      s.union(t)
meet        O(min(m,n))       h.retainAll(t)   s.intersection(t)
size        O(1)              h.size()         len(s)
add         O(1)              h.add(e)         s.add(e)

Here m, n are the sizes of the two sets involved. There are subtleties in set difference complexities, again see the link. e.g. in Python s-t is O(len(s)) but s.difference_update(t) is O(len(t)).
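The difference subtlety in code: both spellings leave you with the same elements, but one builds a new set by scanning s and the other mutates s by scanning t:

```python
s = {1, 2, 3, 4}
t = {3, 4, 5}

assert s.union(t) == {1, 2, 3, 4, 5}
assert s.intersection(t) == {3, 4}

# s - t scans s and keeps elements not in t: O(len(s)), s unchanged.
u = s - t
assert u == {1, 2} and s == {1, 2, 3, 4}

# difference_update scans t and discards each of its elements
# from s: O(len(t)), s mutated in place.
s.difference_update(t)
assert s == {1, 2}
```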

The C code for the Python set object (written by Raymond D. Hettinger) is here. It’s surprisingly readable even if (like me) you don’t know much C. Here’s how it implements size:

static Py_ssize_t
set_len(PyObject *so)
{
    return ((PySetObject *)so)->used;
}

(PySetObject *)so is a pointer to the structure which represents the set, and ->used gets the member of this structure called used. It’s clear from the code that set objects maintain their size as a structure member, so size really is O(1). You can see elsewhere that when something is successfully added to the set, used gets incremented.

Here’s the code used when adding a known-to-be-new object to a set (the comments are mine):

static void
set_insert_clean(setentry *table, size_t mask, PyObject *key, Py_hash_t hash)
{
    setentry *entry;
    size_t perturb = hash;
    size_t i = (size_t)hash & mask;
    size_t j;

    while (1) {
        entry = &table[i]; // & is the address operator, entry is now a
                           // pointer to the setentry table[i]
        if (entry->key == NULL) // as above, if p is a pointer to a
                                // structure then p->member-of-structure
                                // refers to that member
            goto found_null;
        if (i + LINEAR_PROBES <= mask) {
            for (j = 0; j < LINEAR_PROBES; j++) {
                entry++; // incrementing a pointer like this adds the
                // size of the pointer data type, so it now points to the
                // next place a setentry could be stored.
                if (entry->key == NULL)
                    goto found_null;
            }
        }
        perturb >>= PERTURB_SHIFT;
        i = (i * 5 + 1 + perturb) & mask;
    }
  found_null:
    entry->key = key;
    entry->hash = hash;
}

It jumps around in the table (an array of setentry slots, some of them unused) by moving the pointer entry until it finds an empty slot. Then it fills that slot in so it becomes a setentry struct holding the key being inserted and the hash of that key.
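To get a feel for the probe sequence, here is a toy Python rendering of the same logic. The table is modelled as a list with None for empty slots and (key, hash) pairs for occupied ones; the function name and the slightly simplified bounds handling in the linear-probe run are mine, not CPython’s.

```python
LINEAR_PROBES = 9    # same value as the C code
PERTURB_SHIFT = 5

def insert_clean(table, mask, key, hash_):
    """Place key (known not to be present) in an empty slot of table,
    following the probe sequence of CPython's set_insert_clean.
    mask is len(table) - 1, with len(table) a power of two."""
    perturb = hash_
    i = hash_ & mask                 # start at the hash, masked to size
    while True:
        if table[i] is None:         # empty slot at the probe point
            table[i] = (key, hash_)
            return i
        for j in range(1, LINEAR_PROBES + 1):   # short linear scan
            if i + j > mask:                    # (simplified bounds check)
                break
            if table[i + j] is None:
                table[i + j] = (key, hash_)
                return i + j
        perturb >>= PERTURB_SHIFT    # pseudo-random jump elsewhere
        i = (i * 5 + 1 + perturb) & mask

table = [None] * 8
mask = len(table) - 1
insert_clean(table, mask, "a", 3)
insert_clean(table, mask, "b", 3)    # same hash: linear-probes to slot 4
```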

Heaps

The point of a heap is to implement a priority queue - they must have fast pop-min or pop-max and add-element operations. You’d need one of these if you wanted to implement Dijkstra, for example, where you want to queue up the unexplored vertices and pop the vertex with smallest distance.

I’ll deal with min heaps only.

Operation      Time complexity   Java PriorityQueue   Python PriorityQueue
min            O(1)              pq.peek()            n/a
pop-min        O(log n)          pq.poll()            pq.get()
add            O(log n)          pq.add(e)            pq.put(e)
decrease key   O(log n)          n/a                  n/a
build-heap     O(n)              PriorityQueue(c)     multiple puts

(Multiple puts is really O(n log n); if you drop down to Python’s heapq module, heapq.heapify turns a list into a heap in O(n).)

Here’s a simple implementation following lecture 4 of the OCW 6.006 course. Think of a min heap as a complete (or as complete as possible) binary tree in which each object in the heap is a node, and the key of each node is greater than or equal to the key of its parent. This means the node with minimum key is at the root.

We represent this tree as a 1-indexed array, so that the root is at position 1, the parent of the node at i is at i//2, and the left and right children are at 2*i and 2*i+1. Since the node with minimum key is in position 1, finding it is O(1).

Suppose we want to pop the min and maintain the heap property. We can copy the last element of the array into position 1 and delete the last element in O(1), but then we need to restore the min heap property. This operation is called min-heapify. Since both subtrees beneath the root are still min heaps, repeatedly swapping the node currently at the root with the smaller of its children (whenever that child’s key is strictly smaller) will restore the min heap property in O(log n) steps, since the tree has depth log n. Thus pop-min is O(log n).

Adding a new element to the heap can be done in O(log n) similarly: append the new element to the array, then repeatedly swap it with its parent while the parent has a strictly larger key.

Finally consider building a min heap from an unsorted array. This can be done by min-heapifying at positions n//2, ..., 2, 1, in that order. Calling min-heapify at a position i whose descendant subtrees are already min heaps results in the subtree rooted at i being a min heap, so this sequence of min-heapifies will result in the whole array being a min heap. (There’s no need to min-heapify the other positions because they are leaves of the tree, so have no descendant subtrees.)

On the face of it this is O(n) calls to the O(log n) min-heapify method so O(n log n), i.e. no better than sorting the whole array. A more careful analysis shows it is actually O(n): the point is that the cost of min-heapify at i is log of the size of the subtree at i, not log of the whole heap size. (Be careful of a typo in the 6.006 video lecture at this point). Since the number of nodes of height \(h\) looks like \(n/2^h\) the total amount of work is (up to constants) \(\sum_h nh/2^h\), which is O(n) because \(\sum_h h/2^h\) is bounded by a constant independent of \(n\).
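The scheme above can be sketched directly in Python, using a 1-indexed array with a dummy slot at position 0 (the function names are mine):

```python
def min_heapify(a, i, n):
    """Restore the min-heap property at position i, assuming both
    subtrees below i are already min heaps. n is the heap size.
    O(log of the subtree size), since we only ever move downward."""
    while True:
        l, r = 2 * i, 2 * i + 1
        smallest = i
        if l <= n and a[l] < a[smallest]:
            smallest = l
        if r <= n and a[r] < a[smallest]:
            smallest = r
        if smallest == i:        # heap property holds here; done
            return
        a[i], a[smallest] = a[smallest], a[i]
        i = smallest             # continue down the affected subtree

def pop_min(a):
    """O(log n): save the root, move the last element up, heapify."""
    root = a[1]
    a[1] = a[-1]
    a.pop()
    if len(a) > 1:
        min_heapify(a, 1, len(a) - 1)
    return root

def add(a, e):
    """O(log n): append, then swap up while smaller than the parent."""
    a.append(e)
    i = len(a) - 1
    while i > 1 and a[i] < a[i // 2]:
        a[i], a[i // 2] = a[i // 2], a[i]
        i //= 2

def build_min_heap(items):
    """O(n): min-heapify positions n//2, ..., 1; the rest are leaves."""
    a = [None] + list(items)     # dummy slot makes the array 1-indexed
    n = len(a) - 1
    for i in range(n // 2, 0, -1):
        min_heapify(a, i, n)
    return a

heap = build_min_heap([5, 3, 8, 1, 9, 2])
assert [pop_min(heap) for _ in range(6)] == [1, 2, 3, 5, 8, 9]
```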

Fancier implementations

A Fibonacci heap makes decrease-key (in a min heap) and insert O(1) amortized; pop-min stays O(log n) amortized.

Python and Java

The Python queue module has PriorityQueue, which is implemented as a heap. Entries in the queue don’t have a separate priority; rather, when you pop the queue you get whichever entry comes first in the order produced by a Python sort. So if you want to assign numerical priorities to strings, for example, you could put tuples (priority, "string") into the queue (and you must then remember to extract the second element of the tuple popped off the queue when you want to work with the lowest-priority string). Warning! pq.put(key, priority) will (because put has optional block and timeout arguments) silently not do what you want. That’s the Python type system for you.
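The tuple trick in full:

```python
from queue import PriorityQueue

pq = PriorityQueue()
pq.put((2, "medium"))   # the tuple sorts by its first element,
pq.put((1, "urgent"))   # so the int acts as the priority
pq.put((3, "later"))

# get() pops whichever entry sorts first, i.e. the smallest tuple.
priority, item = pq.get()
assert (priority, item) == (1, "urgent")
assert pq.get()[1] == "medium"
```

Note that pq.put("urgent", 1) would enqueue the bare string and quietly pass 1 as the block argument, which is exactly the trap described above.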

In Java, java.util.PriorityQueue is a heap-based implementation. As in Python, you don’t specify an order for each queue member, rather the sorting is by natural order (or you can supply a comparator).

Binary search trees

A binary search tree is a binary tree with the property that for each node \(x\), everything in the left subtree below \(x\) has a key \(\leq\) that of \(x\) and everything in the right-subtree below \(x\) has a key \(\geq\) that of \(x\).

As the name suggests, the point of a BST is that you can search it quickly. In particular you can find a key, or the next larger or smaller key, or the min or max, in O(h) where h is the height of the tree, and you can insert and delete again in O(h).

Operation   Time complexity
max, min    O(h)
ins, del    O(h)
find        O(h)

The problem is then how to control \(h\). We want the tree to be as “balanced” as possible so that \(h \approx \log n\), but this requires care when doing insert and delete. The AVL tree achieves this.
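A minimal (unbalanced) BST sketch showing the O(h) walks; the names are mine, and there is no rebalancing here, so h can degenerate to n:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    """O(h): walk down comparing keys, attach a new leaf."""
    if root is None:
        return Node(key)
    if key <= root.key:
        root.left = insert(root.left, key)
    else:
        root.right = insert(root.right, key)
    return root

def find(root, key):
    """O(h): at each node, the BST property tells us which way to go."""
    while root is not None and root.key != key:
        root = root.left if key < root.key else root.right
    return root

def tree_min(root):
    """O(h): the minimum is the leftmost node."""
    while root.left is not None:
        root = root.left
    return root

root = None
for k in [5, 2, 8, 1, 3]:
    root = insert(root, k)
assert find(root, 3) is not None and find(root, 7) is None
assert tree_min(root).key == 1
```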

Bags/multisets

These can be represented as a map/dictionary with objects as keys and multiplicities as values. collections.Counter is a dict subclass for exactly this; collections.defaultdict is another possibility. For Java, Guava has a Multiset interface.
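Counter in action as a multiset:

```python
from collections import Counter

bag = Counter("mississippi")
assert bag["s"] == 4 and bag["m"] == 1

bag.update("miss")        # add elements, bumping multiplicities
assert bag["s"] == 6 and bag["m"] == 2

bag["i"] -= 1             # multiplicities are just dict values
assert bag["i"] == 4
```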