XCOP: Mixture of multivariate copulas
-------------------------------------

* Ricardo Silva and Robert B. Gramacy
  University College London, University of Chicago
  Version 0.1

First version: June 30th 2010

ABOUT THIS SOFTWARE
-------------------

The current code provides an implementation for the infinite mixture
of tree-structured copulas introduced by

* R. Silva and R. Gramacy (2009). "MCMC methods for Bayesian mixtures of
  copulas". Proceedings of Twelfth International Conference on Artificial
  Intelligence and Statistics. Clearwater, FL.

It requires the last R enviroment for statistical computing.

* http://www.r-project.org/

Current limitations:

* The original paper describes three main proposals for tree structures.
  All three of them are provided for Gaussian copulas, but for the 
  remaining copulas supported by the current software, only the basic
  proposal (called "Simple" proposal in the paper) has been implemented.
* Although there is tentative support for the Clayton copula, it has not
  been finalized yet. The other four copula functions
* All data is assumed to be continuous.
* Some of the proposals for copula parameters are tentative and might
  change.

INSTALLATION
------------

This software is eventually going to evolve into a fully documented R
package. For now, installation should be done manually and documentation
boils down to this readme file.

If you are using Windows, there is no installation procedure: just copy
the "src" folder to any directory of your choice.

If you are using other operating system such as Linux, you will need to
recompile the C++ files in the directory "cpp". Change to that directory
and run the following in the Linux command line (assuming R is in your
path):

> R CMD SHLIB -o cop.so *.cc *.c

Also, you will need to change the reference to "cop.so.dll" in the file
"tree.mcmc.setup.R" to"cop.so". File "tree.mcmc.setup.R" is located in
the subdirectory src/R.

USAGE
-----

After starting R, initialize the XCOP library by calling

> source("tree.mcmc.setup.R")

assuming that "tree.mcmc.setup.R" is in the path.

The usage of XCOP is better illustrated through an example. Script
"runme.example1.R" provides a simple one. It loads a "training set"
(stored as data frame "train.x") and a "test set" (stored as
"test.x"). The usage of test sets is describe in section TEST SETS.
The data in this case consists of a random selection of log returns
from 10 different NYSE and AMEX stocks during a period from 1968 to
1998. Here, we assume trends were removed and the data approximately
i.i.d.

In this simple demo, we use Gaussian copulas and Student t marginals.
Details on available copulas and marginals, and how to set them up,
are provided in sections MARGINAL DOCUMENTATION and COPULA
DOCUMENTATION below.

The important lesson from this demo is to use it as a template for
general tasks. The general structure is to set all required
information for function "copula.tree.mcmc". Several log files are
generated, which can then be used for different analytical
purposes. See section FILE OUTPUT STRUCTURE for details.

In particular, the list "opt" contains different options that can be
used to set the sampler. Currently, the following are supported:

- burn.in: an integer, set initial samples to ignore when doing evaluation;

- mcmc.step: thin samples. That is, dump sample to log file only at every 
  "mcmc.step". Set this to > 1 to save disk space while discarding
  correlated samples.

- missing.pattern: specification of which entries should be considered 
  "missing". This should be a matrix with the same dimensions as "train.x".
  An entry at missing.pattern[i, j] == 1 indicates that train.x[i, j] is
  NOT missing, while missing.pattern[i, j] == 0 indicates that 
  train.x[i, j] IS missing. If "missing.pattern" is not set, it will be
  automatically filled according to the NA values in "train.x".

  NOTICE THAT NO MISSING VALUES IN "test.x" ARE ALLOWED. If needed, you
  can always impute missing values prior to sampling. We provide a very
  straightforward method, function "complete.data".  This function
  imputes missing values by using the average of the observed values for
  each variable.

- initial.K: good initialization can make a difference in MCMC sampling.
  We initialize our sampler by first partitioning the data into K
  clusters and fitting a different tree to each set of the partitioning.
  Option "initial.K" will allow the user to set the initial number of
  trees. The default value is 2.

- init.cluster.t: in order to generatel an initial partition the data, 
  we need a clustering strategy. Two possibilities: "euclidean" will do
  k-means by using the Euclidean distance between points; "mle" will
  use as similarity criterion between a cluster and a point the 
  log-likelihood of this point, under a tree fit using the points in the
  cluster. The "mle" method is quite expensive.

- tree.sampler: proposal type for sampling trees (0, 1, 2 or 3). Value
  0 corresponds to the "Treevariate" proposal in Silva and Gramacy (2009).
  Value of 1 corresponds to the "Simple" proposal, while a value of
  2 corresponds to the "Hybrid" proposal. Value of 3 corresponds to the
  "Exact" proposal, but this is not recommended. See Silva and Gramacy 
  (2009) for details. PLEASE NOTICE THAT SUPPORT FOR ALL FOUR PROPOSALS
  IS CURRENTLY GIVEN TO THE GAUSSIAN COPULA ONLY. For the other copulas,
  the sampler will automatically default to the "Simple" proposal.

- slice.sampler: use slice sampler? (TRUE or FALSE) For sampling parameters
  for the Gaussian copula only, use this to override the 
  Metropolis-Hastings sampler. Default is FALSE, and it will be ignored
  for the other copulas;

- fix.marginals: fix marginals after initializing them? (TRUE or FALSE)
  Sampling marginal parameters is perhaps the most costly step. A common
  trick in copula modeling is to fit marginals separately before
  estimating. Set this to TRUE is so desired. Notice that using
  "Empirical" marginal will set this to TRUE by default. Otherwise,
   the default is FALSE;

- save.missing: save any imputed missing variables? (TRUE or FALSE).
  This allows for a comparison of missing data imputation methods. 
  By having "train.x" will all observed values, but passing a 
  "missing.pattern" matrix indicates some values as missing, the sampler
  will treat such values as missing and sample them. A comparison can
  then be made if missing values are saved and compared against the
  true values. See file "runme.example3.R" for an example;

- save.corr: save any implied correlation matrix? (TRUE or FALSE)
  In the case of a Gaussian copula, it is easy to compute an implied
  multivariate parameter that accounts for the joint distribution within
  a tree: the implied correlation matrix. Setting "save.corr" to true
  tells the sampler to store the correlation matrices. Default is 
  FALSE. This option is ignored if non-Gaussian copulas are used. 
  Notice that each row will contain the copula correlation matrix
  averaged over all training points (since each training point is
  potentially associated with a different latent correlation matrix) 
  in order to save space while still allowing for an evaluation of 
  effective sample of the procedure;

Finally, the only other parameter required by "copula.tree.mcmc"
is the hyperprior for the Dirichlet process mixture parameter,
alpha. The hyperprior is gamma with parameters "mix.priors$a" and 
"mix.priors$b". The default is a gamma (0.1, 1) hyperprior.

After sampling is done, three graphs are generated. One of them
depicts the evolution of the tree acceptance rate over the MCMC
iterations.  Another shows the evolution of the predictive
log-likelihood on the test set (see section TEST SETS). A necessary
(but obviously not sufficient) condition for MCMC convergence is that
each curve shown here eventually plateaus.

The final plot shows the evolution of the marginal parameters for the
first variable. More complete plots can be obtained if you have the
package "coda" installed. See comments in the "runme.example1.R" file.

FILE OUTPUT STRUCTURE
---------------------

Up to nine different (text) log files can be generated while running
"copula.tree.mcmc":

- accept_tree.mcmc: one line for each MCMC iteration, containing
  the proportion of acceptances while sampling trees;

- copula.sel.dat: a "selection" of copula parameters. Again, one
  line per MCMC iteration, containing the copula samples for the
  most frequent tree given the current mixture component assigments.
  Notice that the majority of the copula parameters comes from the
  prior, since the corresponding edge might not be in the tree.
  Morever, the most frequent mixture component might change between  
  iterations, so abrupt changes from line to line might happen;

- k.mcmc: one line per MCMC iteration containing the number of
  mixture components with non-zero points assigned to them. The
  second number is the parameter alpha for the Dirichlet process
  mixture;

- marginal.mcmc: one line per MCMC iteration, it contains the
  marginal parameter samples (if marginals are not fixed).
  They are ordered first by variable (according to the column order
  of "train.x"), then by parameter. For instance, for the Student
  t distribution, the first three columns are the centrality, scale
  and degrees of freedom parameters of the t-distribution (see
  also MARGINAL DOCUMENTATION);

- param.impl.dat: if a Gaussian copula is used and the option
  "save.corr" is set to TRUE, this file will have the implied copula 
  correlation matrix averages over the training points;

- param.mcmc: a typically very large file containing all parameters
  for all marginals and copulas at each iteration. To be used with
  "z.mcmc" if desired. One line per MCMC iteration. Under the convention
  that mixture components are labeled by integer labels (0, 1, 2, etc.),
  and each mixture component had D parameters (ordered just as in the
  header of "copula.sel.dat"), at each line we place every single set of
  parameters for the copula of mixture component 0, followed by 1,
  followed by 2, etc. Since the number of mixture components changes,
  each line might have a different number of columns. Morever, from time
  to time we remove mixture components with zero points associated with
  them, so the number of columns can severely decrease periodically.
  The first columns in this file correspond to the marginal parameters,
  as in "marginal.mcmc". To know which edges exist in the tree (and
  so which parametes are actually used in the likelihood function for
  each mixture component), see "trees.mcmc"

- predictive.mcmc: one line for each MCMC iteration, containing
  the average predictive log-likelihood on the test set;

- trees.mcmc: an encoding of all trees corresponding to each mixture
  component, one line per MCMC iteration. Each line starts with the
  number of mixture components at that particular iteration (notice that
  periodically all components with zero points are discarded to avoid an
  explosion of mixture components, so this number decreases
  periodically). The following number is the number M of edges in the
  tree associated with mixture component #0 (so far, we've implemented
  models with connected trees only, so this number will always be the
  number of variables minus 1). The next 2 * M numbers are the column
  ids of the variables corresponding to edges (starting from zero). 
  For instance, for a dataset with 10 variables we might see the 
  following

  "2 9 0 5 0 7 1 2 1 3 1 4 1 5 3 9 4 6 6 8 9..."

  The first number indicates we have 2 trees, the first tree (the one
  corresponding to mixture component #0) has 9 edges, the first edge
  is between variable 0 and 5, the second edge is (0, 7), then (1, 2),
  (1, 3) (1, 4), (1, 5), (3, 9), (4, 6), (6, 8). The next 9 indicates
  that the numer of edges of the tree corresponding to mixture
  component #1, and so on;

- z.mcmc: individual component assignment, to be used with "param.mcmc"
  if desired. Each row contains as many columns as points in "train.x".
  Each number indicates which cluster the corresponding training point
  belongs to (starting from zero). Since periodically we delete mixture
  components with zero points, this numbering changes accordingly
  (while still matching the placement of parameters in "param.mcmc").

TEST SETS
---------

As typical in the machine learning literature, a out-of-sample
evaluation can be arranged by providing a separate hold-out set. If
"test.x" is defined (set it to NULL if desired), the predictive
log-likelihood of each point given the training data is computed. At
the end of the sampling procedure, an average over all test points is
provided. Notice that the predictive log-likelihood is computed
independently for each test point, not jointly.

MARGINAL DOCUMENTATION
----------------------

When creating a marginal (see "runme.example1.R") for an example, you
should provide its type, prior hyperparametes, and Metropolis-Hastings
(MH) parameters. Some marginals don't have prior hyperparameters (e.g., if
they use some improper prior).

Types are provided as strings ("norm", "t", etc.). Prior
hyperparameters and MH parameters are passed as lists with specific
names for the fields.

There are two ways of setting marginals. You can set all marginals to
be of same type with same priors and MH parameters by providing a
single marginal specification (as in "runme.example1.R"). If you want
marginals to differ, you have to provide vectors and lists of
lists. For instance, suppose you have a problem with four variables,
each one with a different type of marginal: the first and third
Gaussians, the second a exponential and the fourth a Student t. The
following can be done to set, respectively, the types ("marg.type"),
prior hyperparameters ("marg.priors"), and MH parameters ("marg.mh")

marg.type   <- c("norm", "exp", "norm", "t")
marg.priors <- list(list(), list(a.a = 2, a.b = 1), list(), list())
marg.mh     <- list(list(mu.s2q = 0.01, s2.prop = 0.9),
                    list(a.win = 0.95),
                    list(mu.s2q = 0.01, s2.prop = 0.9),
                    list(mu.s2q = 0.01, s2.prop = 0.9, nu.prop = 0.9))

Notice that the priors for "norm" and "t" have no hyperparameters, and
hence the respective lists are empty.

In what follows, we describe how to provide the required information
for each marginal.

* Gaussian ("norm"): prior is Jeffrey's prior (prior for mean
  proportional to a constant, prior for variance v proportional to 
  1 / v, so no parameters). Metropolis-Hastings proposal for
  mean is Gaussian with variance given by field "mu.s2q". 
  Metropolis-Hastings proposal for variance at
  current value "v" is uniform in [v * c, v / c] for c in (0, 1).
  The value of "c" is set by the field "s2.prop";
  
* Student t ("t"): prior is Geweke's prior (Geweke, J.,
  1993. "Bayesian treatment of the Student's-t linear
  model". J. Appl. Econom. 8) if exponential parameter for degrees of
  freedom set at 0.1 shifted by one unit (so it is always greater than
  1). So, again we currently don't allow for hyperparameters. MH
  parameters for "mean" and "variance" work as in the Gaussian
  case. Moreover, the specification of MH proposals should include the
  one for the degees of freedom parameter, "nu". At the current value
  of "nu", proposal is uniform in [1 + nu * c, 1 + nu / c] for c in
  (0, 1). The value of "c" is set by field "nu.prop";

* Gamma ("gamma"): hyperprior for shape parameter is gamma with
  hyperparameters (a.a, a.b). Hyperprior for scale parameter is
  gamma with hyperparameters (b.a, b.b). Metropolis-Hastings proposal
  for shape parameter at value "a" is uniform in  [a * c, a / c] for 
   c in (0, 1). The value of "c" is set by the field "a.win". 
  Analogously, the corresponding field for scale parameter is 
  called "b.win";

* Exponential ("exponential"): hyperprior for parameter is gamma with
  hyperparameters (a.a, a.b). Metropolis-Hastings proposal
  for parameter at value "a" is uniform in  [a * c, a / c] for c in 
  (0, 1). The value of "c" is set by the field "a.win".

* Empirical ("empirical"): empirical distribution (i.e., CDF is given
  by the empirical CDF). No priors or MH proposals: these marginals
  currently have to be fixed.

If marginals are set to be fixed (which happens by default using the
empirical marginals), then parameters are set to their respective
maximum likelihood estimates (obtained by maximizing independent
likelihoods for each marginal). The exception is the Student t
distribution, where the degrees of freedom are fixed at 5.

COPULA DOCUMENTATION
--------------------

As in the marginal case, information is provided via lists. Unlike the
marginal case, we assume all pairs of variables share the same type of
copula, hyperparameters and MH parameters.

Additional information for such copulas can also be found in the
following reference:

* Learning with tree-averaged densities and distributions, S. Kirshner,
  Advances in Neural Information Processing Systems, NIPS 2007, December
  2007 (preliminary version)
  http://www.stat.purdue.edu/~skirshne/pubs/nips2007_prelim.pdf

Information about type, hyperparameters and MH parameters are as
follows:

* Gaussian ("norm"): uniform [-1, 1] prior for parameter, so no 
  hyperparameters to set. MH proposal is Gaussian truncated at [-1, 1] 
  with variance parameter given by field "win";

* Farlie-Gumbel-Morgenstern ("fgm"): uniform [-1, 1] prior for
  parameter, so no hyperparameters to set. MH proposal is uniform 
  [v - c, v + c] truncated at [-1, 1] for c > 0, where "v" is the
  current value of the copula parameter. Set "c" via the field "win";

* Ali-Mikhail-Haq ("amh"): uniform [-1, 1] prior for
  parameter, so no hyperparameters to set. MH proposal is uniform 
  [v - c, v + c] truncated at [-1, 1] for c > 0, where "v" is the
  current value of the copula parameter. Set "c" via the field "win";

* Clayton ("clayton"): prior is truncated Gaussian in [-1, +Infinity]
  with "mean" and "variance" parameters given by fields "mu" and "s2",
  respectively. MH proposal is uniform  [v - c, v + c] truncated at 
  [-1, +Infinity] for c > 0, where "v" is the  current value of
  the copula parameter. Set "c" via the field "win";

* Gumbell-Barnett ("gb"): uniform [0, 1] prior for
  parameter, so no hyperparameters to set. MH proposal is uniform 
  [v - c, v + c] truncated at [0, 1] for c > 0, where "v" is the
  current value of the copula parameter. Set "c" via the field "win";

* Frank ("frank"): Gaussian prior. Set mean and variance using fields
  "mu" and "s2. MH proposal is uniform [v - c, v + c] for c > 0, where
  "v" is the current value of the copula parameter. Set "c" via the
  field "win";

FINAL NOTES
-----------

Example file "runme.example2.R" contains an example using lower level
access functions to run sampling. In particular, this file compares
the four different proposal strategies for trees using the Gaussian
copula.

Example file "runme.example3.R" is very similar to the first example
file. The main difference is the inclusion of a test of the data
imputation procedure of the sampler.

DISCLAIMER AND ACKNOWLEDGEMENTS
-------------------------------

This software is provided as is with no guarantees. However,
questions, the unavoidable bug reports, and suggestions for
improvement are always welcome. Contact the author at

> ricardo@stats.ucl.ac.uk

Thanks to Sergey Kirshner for discussions on the paper and some
implementation aspects.

