# Big Data

In 2008-2009 I was involved in the Sequential Monte Carlo workshop at SAMSI, where I was part of the Big Data and Distributed Computing working group.

One of the challenges of Markov chain Monte Carlo in large datasets is the need to scan through the whole data at each iteration of the sampler, which can be computationally prohibitive. Several approaches have been developed to address this, typically drawing computationally manageable subsamples of the data. Here we consider the specific case where most of the data provides little or no information about the parameters of interest, and we aim to select subsamples such that the information extracted is most relevant. The motivating application arises in flow cytometry, where several measurements from a vast number of cells are available. Interest lies in identifying specific rare cell subtypes and characterizing them according to their corresponding markers. We present a Markov chain Monte Carlo approach where an initial subsample of the full data is used to draw a further set of data points which contains more information about rare events, and describe how inferences can be made efficiently by reducing the dimensionality of the problem. Finally, we extend our method to a Sequential Monte Carlo framework whereby the targeted subsample is augmented sequentially as estimates improve, and introduce a stopping rule for determining the number of data points belonging to the low-probability region of interest.
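
The two-stage idea described above (a uniform pilot subsample followed by a targeted draw) can be illustrated with a rough sketch. This is not the implementation from the paper: the function name `targeted_subsample`, the parameters `centre_guess`, `k_nearest` and `inflate`, and the choice of a Gaussian kernel as the weight function are all assumptions made for illustration, and the synthetic data merely stands in for flow cytometry measurements.

```python
import numpy as np


def targeted_subsample(data, centre_guess, n_pilot, n_targeted, k_nearest, rng,
                       inflate=4.0):
    """Hypothetical two-stage subsample: uniform pilot, then a weighted draw."""
    n, d = data.shape

    # Stage 1: uniform random pilot subsample of the full data.
    pilot_idx = rng.choice(n, size=n_pilot, replace=False)
    pilot = data[pilot_idx]

    # Use the pilot points nearest the guess to roughly locate the rare region.
    dist = np.linalg.norm(pilot - centre_guess, axis=1)
    near = pilot[np.argsort(dist)[:k_nearest]]
    centre = near.mean(axis=0)
    # Inflate the covariance so the kernel over-covers the region (assumption).
    cov = inflate * np.cov(near, rowvar=False) + 1e-6 * np.eye(d)

    # Stage 2: weight each remaining point by an unnormalised Gaussian kernel.
    rest_idx = np.setdiff1d(np.arange(n), pilot_idx)
    diff = data[rest_idx] - centre
    maha = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(cov), diff)
    log_w = -0.5 * maha
    weights = np.exp(log_w - log_w.max())  # rescale so the largest weight is 1
    probs = weights / weights.sum()

    # Draw the targeted subsample without replacement, favouring high weights.
    targeted_idx = rng.choice(rest_idx, size=n_targeted, replace=False, p=probs)
    return pilot_idx, targeted_idx


# Synthetic example: a large bulk population plus a 2% rare cluster.
rng = np.random.default_rng(0)
bulk = rng.normal(0.0, 1.0, size=(19600, 2))
rare = rng.normal(6.0, 0.3, size=(400, 2))
data = np.vstack([bulk, rare])  # rows 19600 onwards are the rare cluster

pilot_idx, targeted_idx = targeted_subsample(
    data, centre_guess=np.array([6.0, 6.0]),
    n_pilot=2000, n_targeted=100, k_nearest=15, rng=rng,
)
# Share of the targeted draw that comes from the rare cluster.
rare_share = np.mean(targeted_idx >= 19600)
```

Because the second draw is deliberately non-uniform, any inference based on it has to account for the selection mechanism; the sketch only covers the sampling step, not that correction or the sequential augmentation and stopping rule.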

### Papers and Presentations

- I. Manolopoulou, C. Chan and M. West, 'Selection Sampling from Large Datasets for Targeted Inference in Mixture Modeling'. Bayesian Analysis (2010), with discussion.
- SAMSI Research Highlight (May 09), 'Needle in a Haystack: Rare Cell Subtypes in Flow Cytometry'.
- SAMSI Transition workshop (Nov 09), 'Targeted Sequential Resampling from Very Large Datasets in Mixture Modelling'.
- Greek Stochastics alpha: Monte Carlo methods (Aug 09), 'Targeted Sequential Resampling from Very Large Datasets in Mixture Modelling'.
- JSM 2009 (Aug 09), 'Targeted Sequential Resampling from Very Large Datasets in Mixture Modelling'.
- Adaptive Design, SMC and Computer Modeling workshop (Apr 09), 'Adaptive Bayesian Computation for Targeted Learning in Mixture Models'.
- Internal SAMSI SMC workshop (Feb 09), 'Targeted re-sampling from very large datasets in mixture modeling'.