Big Data

In 2008-2009 I was in involved in the Sequential Monte Carlo workshop at SAMSI, where I was part of the Big Data and Distributed Computing working group.

One of the challenges of Markov chain Monte Carlo in large datasets is the need to scan through the whole data at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data provides little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full data is used to draw a further set of data points which contains more information about rare events, and describe how inferences can be made efficiently by reducing the dimensionality of the problem. Finally, we extend our method to a Sequential Monte Carlo framework whereby the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the number of data points belonging to the low-probability region of interest.

Papers and Presentations