
----------------------------------------------------------------
GPSEM: efficient Gaussian process inference for structural
equation models with latent variables.

Version 1.0b

By Ricardo Silva
University College London
Date of this file: February 25th 2010
Last modification: June 21st 2010

The companion paper is filed under arxiv:
http://arxiv.org/abs/1002.4802
----------------------------------------------------------------



INSTALLATION PROCEDURE: DOWNLOADS
---------------------------------

This code is based partially on third-party C code and MATLAB
libraries.

You will need to download the following MATLAB libraries first:

- Lighspeed, some statistical tools by Tom Minka:
  http://research.microsoft.com/en-us/um/people/minka/software/lightspeed/

- GPML, a Gaussian process toolbox by Carl Rasmussen and Chris Williams
  http://www.gaussianprocess.org/gpml/

- Matthias Seeger's Cholesky update software
  http://people.mmci.uni-saarland.de/~mseeger/papers/cholupdate.html

You can create separate directories for these libraries, and update
the MATLAB path to point to them when necessary. There is a file in
the root directory of GPSEM with an example on how to set paths for
such libraries ('setpath_gpsem.m'). You could use it as a template for
your own configuration. I just run this script every time I start
a MATLAB session that uses GPSEM.

INSTALLATION PROCEDURE: COMPILATION
-----------------------------------

The GPSEM code has a subdirectory labeled 'c,' where some C++ code is
stored. This code has to be compiled with MEX before being
used. Please consult the MEX documentation of MATLAB in case you never
used it before. Most of the time, just typing 'mex' <ENTER> will be
enough for you to figure out how it works. 

Bear in mind that using MEX is really simple and the payoff is
big. This is the reason why I've avoided providing a pure MATLAB
alternative.

Once MEX is setup, using it is trivial: change to the 'c' directory
and type

> mex cholroll.cpp

Done. Now you need to tweak/install the three libraries, if you
haven't done it before.

Refer to the Lightspeed documentation in order to install it.

The GPML library has an option of compiling some of its functions.
This is optional, but please do it. It is trivial to compile, and will
make a major difference. If you are using Windows, though, you might
need to do some simple tweaks. Everything is explained in the README
file of that library, under the section 'COMPILING MEX FILES.'

Finally, mex-compile Seeger's Cholesky Update library. Instructions
are also provided with the library file. If you are using Windows, a
few modifications are necessary (for Linux, follow the original
documentation):

* first, compile the Fortran file 'dchex.f' using the 
  'Windows standard' for naming BLAS/LAPACK functions:

  g77 -fno-underscoring -c dchex.f

  I recommend g77, the GNU compiler. You can use it by installing e.g.
  Mingw (http://www.mingw.org/) or Cygwin (http://www.cygwin.com/);

* second, change the BLAS/LAPACK function calling standard to the 
  Window standard in the file 'mex_helper.h': locate the line 
  '#define BLASFUNC(NAME) NAME' and un-comment it, making sure you
  comment out the respective Linux line;

* third, compile the relevant files using the following three 
  commands (make sure you are in the right directory):

  > mex -O choluprk1.c <directory>/libmwlapack.lib
  > mex -O choldnrk1.c <directory>/libmwlapack.lib
  > mex -O cholupexch.c dchex.o <directory>/libmwlapack.lib

  where <directory> is the path for the LAPACK library in your
  MATLAB installation (e.g., c:/matlab6p5/extern/lib/win32/microsoft/
  msvc60/libmwlapack.lib).
  
  My Windows mex configuration uses gcc, which can be hacked into
  MATLAB using the 'gnumex' tools (http://gnumex.sourceforge.net/).
  I believe the above procedure should work with the standard 
  mex C/C++ compilers, but I haven't tried it. If it doesn't work,
  try gnumex.

This is it.

RUNNING THE EXAMPLES: SAMPLING AND LOGGING THE RESULTS
------------------------------------------------------

The best way of understanding how to use the code is to look at the
examples. There are four demonstration datasets you can load in order
to try an example (NOTICE: you might need further instructions on how
to obtain some of these datasets. Please look at the 'data'
directory). Scripts showing examples of GPSEM usage can be found in
the 'app' directory.

For instance, load a simple synthetic example using

> load_synth

This generates a synthetic dataset of 150 points and 6 observed
variables, and set the corresponding graphical structure. This data
is generated by a model with two latent variables

X1 -> X2

where X1 is a standard Gaussian and X2 = 4X1^2 + error.

Each latent variable has three observed children

Y1 <- X1 + error
Y2 <- X1 + error
Y3 <- X1 + error
Y4 <- 2X2 + error
Y5 <- 2X2 + error
Y6 <- 2X2 + error

All errors are independent standard Gaussian. Notice that variables
{Y4, Y5, Y6} are strongly correlated because X2 has high variance. See
the file for more details.

The script 'test_gpsem_mcmc_spgp' runs a cross-validation study using
the sparse GPSEM with M = 50 pseudopoints. This cross-validation
doesn't evaluate anything in particular in this script: it just generates
logfiles that can be used for general purposes.

The script 'test_gpsem_mcmc_spgp' assumes that several pieces of
information are loaded in MATLAB:

- raw_data: a N x D matrix, where N is the sample size and D is the
  number of observed variables;
- graph: a (D + L) x (D + L) binary matrix, where L is the number of
  latent variables. Each entry graph(i, j) is equal to 1 if and only
  if the j-th variable is a parent of the i-th variable.
- n_types: a (D + L) binary vector indicating with '1' which variables
  are observed. '0' indicates latent. IT IS ASSUMED THAT ALL LATENT
  VARIABLES ARE INDEXED BEFORE ALL OBSERVED VARIABLES. Yes, this is
  silly (passing the number of latent variables would be enough
  under this assumption), but this will be changed later so I prefer
  to require this vector for now;
- NUM_FOLDS: the number of folds the original full data 'raw_data'
  has been split, for cross-validation purposes;
- all_train_points: a cell structure with NUM_FOLDS entries. Each
  'all_train_points{i}' is a subset of 1:N indicating which points are
  for training in the partition;
- all_test_points: a cell structure with NUM_FOLDS entries. Each
  'all_test_points{i}' is a subset of 1:N indicating which points are
  for testing in the partition;
- num_x_mix: indicates how many mixture components for exogenous
  latent variables;
- anchors: a vector indicating observed variables that are used to
  scale latent variables. For instance, if the first observed variable Y1
  is given by 

  Y1 = X1 + error

  then we say that Y1 is the 'anchor' of X1. The anchors should have
  a single latent parent. Otherwise, you have to set manually which
  coefficients should be fixed for identifiability purposes. See the 
  inner workings of 'run_gpsem_mcmc_spgp' for details.

Please notice:

IT IS ASSUMED THAT LATENT VARIABLES HAVE NO OBSERVED VARIABLES AS
PARENTS.

IT IS ASSUMED THAT THE ORDER OF OBSERVED VARIABLES IN THE GRAPH CORRESPONDS
TO THE ORDER OF COLUMNS IN THE DATASET.

The code doesn't check any of those. I plan to get rid of these
assumptions in the future, since none of them has any significance.
Yes, it's just laziness.

Notice that you have the option of indicating which observable variables
in the test sets are 'missing' and should be filled in as part of
the sampling process. Give a look at how the vector 'test_vars' is
used.

There is also an option of initializing the procedure by first running
a simple polynomial model (quadratic regression is the default) to
obtain the initial sample for GPSEM. Set 'use_starting' to 1.

Priors can be modified by changing the function 'run_gpsem_mcmc_spgp'.
This function works as a template for setting default values when
calling the main function, 'mcmc_gpsem_spgp', which does the actual
sampling. It also fires up an initialization procedure if requested.
See the inner workings of 'mcmc_gpsem_spgp' to play with the proposal
distributions for the Metropolis-Hastings procedure. Notice that the
proposals chosen here had in mind that data was scaled to have a zero
empirical mean and unit variance. They might be quite bad in general.

There are 'full' (i.e., non-sparse) GP variations of all of these
functions: 'test_gpsem_mcmc_full.m', 'run_gpsem_mcmc_full.m' and
'mcmc_gpsem_full.m'. Use them for comparison purposes; or if you have
a couple hundred of datapoints and are really convinced full GP will
be better than a sparse model (which should be true if you have many
latent parents for a particular latent variable: but then you better
have more than a couple hundred points!). Be warned that this will be
outrageously slow. Crashes by lack of memory are likely. If you cannot
run	the sampler due to memory problems, try to set it to dump samples
to disk more frequently: sampling functions have a 'BATCH_SIZE' option
that determines how many samples are kept in main memory at each time.
Setting it to 1 might not be too bad afterall. Moreover, the function
that saves samples to disk is very far from optimized anyway (currently,
saving in batches is definitely not as effective as it could be). As
another and somewhat less sophisticated compromise, it is possible to
set an option where only a random subsample of the training set is used in
the sampler (the whole training set is still used to normalize the data).
Check the 'test_gpsem_mcmc_full.m' file for details.

Using 'load_synth' will generate all required information. In this
case, NUM_FOLDS = 1. Since this is synthetic data, we have the
'true' latent values. Function 'run_gpsem_mcmc_spgp' will terminate
by displaying the true values against the posterior expected values
for the first two latent variables in our domain ('synth' has only two
anyway). If you do not have latent data, this will display the
posterior expected values only.

RUNNING THE EXAMPLES: EVALUATING AND USING RESULTS
--------------------------------------------------

You might want to make use of the logfiles in many different ways. To
run a predictive loglikelihood evaluation of the chosen
cross-validation splits, do the following (assuming you have just ran
'test_gpsem_mcmc_spgp' - otherwise you had to provide the relevant
variables: NUM_FOLDS, raw_data, etc.):

> all_data = cell(NUM_FOLDS, 1); 
> for i = 1:NUM_FOLDS, all_data{i} = ...
    normalize_training(raw_data, all_train_points{i}, ...
                      zeros(size(raw_data))); end
> dir_name = '.'; covfunc = {'covSum', {'covSEiso','covNoise'}};
> [llik llik_prog y_fake x_fake] = pred_llik(graph, M, num_x_mix, ...
     num_mcmc_samples, covfunc, num_latents, all_test_points, ...
     all_data, dir_name, burn_in, NUM_FOLDS);

Function 'pred_llik' returns in 'llik' the MCMC estimate of the
predictive loglikelihood of each test point, for each fold (so
'llik' is a cell structure with NUM_FOLDS entries). It will use all
logfiles found in the directory 'dir_name' (the current directory, 
in this example). The name of the logfiles should be the same as those
generated by 'test_gpsem_mcmc_spgp'.

Notice that the synthetic study, generated by 'load_synth', won't 
produce a cross-validation result. Run some other example such as
'load_depress' or 'load_house'.

The code can be adapted to use other prediction criterion such as mean
squared error (MSE). This is not provided. For general, domain-free
evaluation, we prefer loglikelihoods instead. Although it is true that
loglikelihoods can get extreme values that compromise the
interpretability of average loglikelihoods, the solution to that is
relatively simple: don't compare average loglikelihoods, don't do
nonsensical t-tests. Use the whole of each 'llik' to do your
comparisons using tools such as plotting histograms of the whole
distribution of predictive loglikelihoods, and perform other tests
such as sign tests. (Of course, if you do have sensible domain-dependent
cost functions, it is not hard to write code that takes the generated
predictions and calculates the corresponding evaluation).

Finally, you might be interested in doing some analysis using the
posterior latent variable samples (e.g., for visualization,
dimensionality reduction and clustering in the latent space). Use the
function 'get_x_estimate' to generate posterior expected values from a
logfile. For instance,

> x_estimate = get_x_estimate('sample_x.1.log', num_latents,...
               length(all_train_points{1}), burn_in);

This will give you a matrix, with a row for each training point, and
the columns corresponding to each latent variable. A simple
visualization of the results for, e.g., the first two latent variables
can be easily done by

> scatter(x_estimate(:, 1), x_estimate(:, 2))

If you want to get the full MCMC trace of the samples, you can load
the whole logfile in memory (watch out for memory limitations - I should
implement a simple option for thinning the samples in a near future):

> x_trace = get_x('sample_x.1.log', num_latents,...
                  length(all_train_points{1}), num_mcmc_samples);

This generates a three dimensional matrix. The first entry indexes the
MCMC sample, the second entry index the training point, and the third
entry indexes the latent variable. For instance, to plot a MCMC trace
of the second latent variable in training sample number 88, do

> plot(1:num_mcmc_samples, x_trace(:, 88, 2))

Give a look at function 'print_to_file' in 'mcmc_gpsem_spgp' to
understand the format of the other parameters that are logged, so that
they can be analyzed by usual MCMC criteria (effective sample size,
autocorrelation, etc.) and allow you to refine the proposals.

DISCLAIMER AND ACKNOWLEDGEMENTS
-------------------------------

This software is provided as is with no guarantees. However,
questions, the unavoidable bug reports, and suggestions for
improvement are always welcome. Contact the author at

> ricardo@stats.ucl.ac.uk

Thanks to Tom, Carl, Chris and Matthias for the wonderful libraries
that were used in the implementation of GPSEM.

Thanks to Robert Gramacy for several useful comments about this code
and documentation.

HISTORY
-------

Version 1.0b (June 21st 2010): added complementary documentation on
installation and usage.

Version 1.0a (March 11th 2010): fixed a small bug in the sampling of
hyperparameters of 'full' GP models.
