The RankSearch algorithm: Java Implementation

The following files provide Java classes and source code for the RankSearch algorithm as described in our ICML 2006 paper. You will also need a JAR file containing the Tetrad library, available from the Tetrad project webpage. Go to the download directory and get the tetrad-4.3.6-2.jar file, or a more recent version (please note that the file tetrad-latest.jar is NOT the most up-to-date version).


To use the software, call it as follows (after setting the classpath to include the Tetrad JAR file):


The argument <filename> is the name of a file containing data. The data file is a text file whose first line lists the names of the variables, separated by spaces or tabs. Each subsequent row contains one numerical value for each variable. Missing data is not allowed. ALSO IMPORTANT: the same directory must contain both a filename.dat and a filename.tst. The software automatically evaluates the log-likelihood of the learned model on the hold-out set filename.tst. This is done by learning the maximum likelihood parameters of the model, using 10 different starting points.
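Since the format is strict, it can help to check both files before starting a long run. The following is a minimal sketch of such a check, not part of the RankSearch distribution: the class name DataFileCheck, its validate method, and the convention of passing the base name without extension are all assumptions made for illustration. It verifies that the first line lists the variable names, that every later row has one finite number per variable, and it tolerates blank lines (another assumption; the actual reader may not).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper for illustration; NOT part of the RankSearch code.
class DataFileCheck {

    /** Returns the number of data rows, or throws if the file breaks the format. */
    static int validate(Path file) throws IOException {
        try (BufferedReader in = Files.newBufferedReader(file)) {
            String header = in.readLine();
            if (header == null || header.trim().isEmpty()) {
                throw new IllegalArgumentException(file + ": missing header line");
            }
            // Variable names are separated by spaces or tabs.
            int numVars = header.trim().split("\\s+").length;
            int rows = 0;
            int lineNo = 1;
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                if (line.trim().isEmpty()) {
                    continue; // assumption: blank lines are harmless
                }
                String[] tokens = line.trim().split("\\s+");
                if (tokens.length != numVars) {
                    throw new IllegalArgumentException(file + ", line " + lineNo
                            + ": expected " + numVars + " values, found " + tokens.length);
                }
                for (String token : tokens) {
                    double value;
                    try {
                        value = Double.parseDouble(token);
                    } catch (NumberFormatException e) {
                        throw new IllegalArgumentException(file + ", line " + lineNo
                                + ": not a number: " + token);
                    }
                    // "NaN" and "Infinity" parse successfully, so reject them explicitly.
                    if (!Double.isFinite(value)) {
                        throw new IllegalArgumentException(file + ", line " + lineNo
                                + ": missing or non-finite value: " + token);
                    }
                }
                rows++;
            }
            return rows;
        }
    }

    public static void main(String[] args) throws IOException {
        String base = args[0]; // assumed: base name without extension
        for (String ext : new String[] {".dat", ".tst"}) {
            Path p = Paths.get(base + ext);
            System.out.println(p + ": " + validate(p) + " rows OK");
        }
    }
}
```

Running it as java DataFileCheck mydata would check mydata.dat and mydata.tst in the current directory and stop at the first offending line.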

The number of mixture components must be given as input; the software does not provide an automated way of choosing it. Since RankSearch is computationally expensive, we recommend using a simpler algorithm, such as a mixture of factor analyzers, to choose this number. Software for a variational Bayesian mixture of factor analyzers can be found on Matthew Beal's page.

Finally, the source code allows a few extra options to be tweaked. See the comments in the file RankSearch.java. In particular, the number of starting points used to score a model by optimizing a variational approximation can be set by modifying the value of the constant MAXIMIZATION_TRIALS.
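To make the role of such a constant concrete, here is a generic sketch of a multiple-restart optimization. It is NOT the code in RankSearch.java: the class RestartSketch, the toy objective, and the simple hill climber are all invented for illustration, and the constant TRIALS only plays the role that MAXIMIZATION_TRIALS is described as playing above. The idea is that a local optimizer is run from several random starting points and the best score is kept, so more trials cost proportionally more time but make a poor local optimum less likely.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

// Generic illustration only; this is NOT the code in RankSearch.java.
class RestartSketch {

    // Stand-in for a restart-count constant such as MAXIMIZATION_TRIALS.
    static final int TRIALS = 10;

    /** Hill climbing from one starting point; returns the best score reached. */
    static double climb(DoubleUnaryOperator f, double x, double step, int iters) {
        double best = f.applyAsDouble(x);
        for (int i = 0; i < iters; i++) {
            double left = f.applyAsDouble(x - step);
            double right = f.applyAsDouble(x + step);
            if (left > best && left >= right) {
                x -= step;
                best = left;
            } else if (right > best) {
                x += step;
                best = right;
            } else {
                step /= 2; // no improvement on either side: refine the step
            }
        }
        return best;
    }

    /** Runs the local search from several random starts and keeps the best score. */
    static double bestOfRestarts(DoubleUnaryOperator f, int trials, long seed) {
        Random rng = new Random(seed);
        double best = Double.NEGATIVE_INFINITY;
        for (int t = 0; t < trials; t++) {
            double start = -10 + 20 * rng.nextDouble(); // uniform in [-10, 10)
            best = Math.max(best, climb(f, start, 0.5, 200));
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy objective with many local maxima (an assumption for the demo).
        DoubleUnaryOperator f = x -> Math.sin(3 * x) - 0.05 * x * x;
        System.out.println("1 start:   " + bestOfRestarts(f, 1, 42L));
        System.out.println(TRIALS + " starts: " + bestOfRestarts(f, TRIALS, 42L));
    }
}
```

With a fixed seed, the first starting point is the same in both calls, so the score from several starts can never be worse than the score from a single start.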

I haven't done much testing on this code, so if you suspect a bug, please let me know. Bear in mind that this procedure can take a long time to converge if you have a reasonably large dataset or a large number of mixture components. To give you an idea, the largest experiments from the ICML paper (a few hundred data points, about 40 variables and 2 mixture components) took up to 6 hours on a 1.8 GHz Pentium IV.

Last modification: August 11th 2006