NEURAL DATA CLUSTERING
-----------------------
This data set illustrates a clustering problem common in experimental neuroscience, known as “spike sorting”. It contains 20,000 neuronal action potentials (or “spikes”), one per line; each line holds the 96 features characterizing the spatiotemporal electric field generated by that spike. The action potentials are generated by multiple neurons, each of which corresponds to a separate cluster in this high-dimensional space.
                                                                                                           
We have artificially added a set of spikes to this data set at known times. This provides a “ground truth”: the identity of the points belonging to one (but only one) of the clusters is known. A competition (see below) is based on this ground truth, which will be made available to participants at the meeting and later on the meeting's website.

Any unsupervised clustering algorithm should reveal several clusters, most of which correspond to putative neurons or to electrical artefacts in the recording. The quality of a clustering method can only be judged by how well the artificially added cluster is isolated from the rest. Noise and artefacts can be treated as forming clusters of their own, so not all clusters will correspond to neurons.

Competition rules: 
You can analyse the data set without taking part in the competition. To take part, please send an ASCII text file containing 20,000 natural numbers separated by blanks or line breaks (and nothing else) to
Christian Hennig (c.hennig@ucl.ac.uk)
by 18:00 on Tuesday 5 November.
The natural numbers will be interpreted as cluster membership indicators of the action potentials (observations) in the original order. The number of clusters is not prescribed, but the submission must be a partition (i.e., every observation belongs to exactly one cluster). Submissions will be evaluated by the Jaccard similarity between the (single) known true cluster and the cluster of your partition that matches it best (i.e., the one that maximises the Jaccard similarity to the true cluster out of all your clusters). The best submissions will be awarded a book prize donated by Chapman&Hall/CRC.
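
The evaluation described above can be sketched in a few lines of Python. This is only an illustration, not the organisers' scoring code: the function names are hypothetical, the toy data are invented, and labels are assumed to start at 1 (the rules only say "natural numbers").

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets of observation indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match_jaccard(labels, true_members):
    """Highest Jaccard similarity between the true cluster and any submitted cluster.

    labels: one cluster label per observation, in the original order.
    true_members: 0-based indices of the observations in the known cluster.
    """
    clusters = {}
    for index, label in enumerate(labels):
        clusters.setdefault(label, set()).add(index)
    truth = set(true_members)
    return max(jaccard(members, truth) for members in clusters.values())


def format_submission(labels):
    """Render labels as blank-separated natural numbers and nothing else."""
    labels = [int(x) for x in labels]
    if any(x < 1 for x in labels):  # assumption: labels are numbered from 1
        raise ValueError("labels must be positive integers")
    return " ".join(str(x) for x in labels)


# Toy example with 6 observations; the "true" cluster is {0, 1, 2}.
toy_labels = [1, 1, 2, 2, 3, 3]
toy_score = best_match_jaccard(toy_labels, [0, 1, 2])
toy_text = format_submission(toy_labels)
```

Because only the best-matching cluster counts, how the remaining observations are split among other clusters does not affect the score.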

Further information:
The features in the file were produced by an automatic spike detection algorithm, which takes the first 3 principal components of the filtered waveforms on each channel of the recording. Waveforms were recorded using a probe inserted into the brain, consisting of 32 channels (labelled 0-31), which gives 32 x 3 = 96 features per spike. The channels are arranged spatially according to the adjacency graph listed below. Each pair corresponds to channels which are nearest neighbours, e.g. channel 12 is next to channels 10, 11, 13, and 14. The probe therefore looks like a zig-zag (see pdf).

probes = {
    # Probe 1
    1:[
        (0, 1), (0, 2),
        (1, 2), (1, 3),
        (2, 3), (2, 4),
        (3, 4), (3, 5),
        (4, 5), (4, 6),
        (5, 6), (5, 7),
        (6, 7), (6, 8),
        (7, 8), (7, 9),
        (8, 9), (8, 10),
        (9, 10), (9, 11),
        (10, 11), (10, 12),
        (11, 12), (11, 13),
        (12, 13), (12, 14),
        (13, 14), (13, 15),
        (14, 15), (14, 16),
        (15, 16), (15, 17),
        (16, 17), (16, 18),
        (17, 18), (17, 19),
        (18, 19), (18, 20),
        (19, 20), (19, 21),
        (20, 21), (20, 22),
        (21, 22), (21, 23),
        (22, 23), (22, 24),
        (23, 24), (23, 25),
        (24, 25), (24, 26),
        (25, 26), (25, 27),
        (26, 27), (26, 28),
        (27, 28), (27, 29),
        (28, 29), (28, 30),
        (29, 30), (29, 31),
        (30, 31),
        ]
    }
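
As a rough sketch, the edge list can be turned into a per-channel neighbour lookup. To keep the example self-contained, the edges are regenerated from their pattern, (i, i+1) and (i, i+2), which is assumed to reproduce the `probes[1]` listing above; the names `edges`, `neighbours` and `channel_neighbours` are illustrative only.

```python
from collections import defaultdict

# Assumption: every channel i is linked to i+1 and i+2 where those exist,
# which yields the same 61 pairs as the probes[1] listing.
n_channels = 32
edges = [(i, i + step)
         for i in range(n_channels)
         for step in (1, 2)
         if i + step < n_channels]


def neighbours(edge_list):
    """Map each channel to the sorted list of channels adjacent to it."""
    adjacency = defaultdict(set)
    for a, b in edge_list:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return {channel: sorted(adjacent) for channel, adjacent in adjacency.items()}


channel_neighbours = neighbours(edges)
```

With the real listing loaded, `neighbours(probes[1])` would be the equivalent call.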


For each neuronal cluster, it is expected that only a subset of the variables will be informative.
The number of relevant dimensions depends on the neuron and on how many channels of the probe
picked up signals from it. For example, if a neuron's signal was only picked up on channels
10, 11, 12, 13, and 14, then the informative features are features 3k+1, 3k+2, 3k+3 with k = 10, 11, 12, 13 and 14 (counting features from 1). A neuron could potentially have as many as 36 relevant features or as few as 3. The rest of the features will correspond to background noise. At the same time, many neurons may share the same set of relevant channels.
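
The channel-to-feature mapping can be written out as a small helper. This is a sketch under the assumption that the 96 features appear in each line in the order 1 to 96 and that the data are held in an array whose columns are counted from 0; `feature_columns` is a hypothetical name.

```python
def feature_columns(channels, n_components=3):
    """0-based column indices of the features belonging to the given channels.

    Channel k (0-31) owns features 3k+1, 3k+2, 3k+3 in the 1-based numbering,
    i.e. columns 3k, 3k+1, 3k+2 when columns are counted from 0.
    """
    return [n_components * k + j
            for k in sorted(channels)
            for j in range(n_components)]


# Channels 10-14 give 15 columns, 30 to 44 (features 31 to 45 counted from 1).
cols = feature_columns([10, 11, 12, 13, 14])
```

Restricting a clustering step to such a column subset is one way to use the probe geometry, for instance by taking a channel together with its neighbours.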