The RankSearch algorithm: Java Implementation
The following files provide Java classes and source code for the
RankSearch algorithm as described in our ICML 2006 paper. You will need
the latest JAR file containing the Tetrad library from the Tetrad
project webpage. Go to the download directory and get the
tetrad-4.3.6-2.jar file, or a more recent version (please note that the
file tetrad-latest.jar is NOT the most up-to-date version).
To use the software, call it as follows (after setting the classpath to
include the Tetrad jar file):
- java rbas.ranksearch.app.RankSearchApp <filename> <number of mixture components>
The argument <filename> is the name of a file containing data.
The data file must be a text file whose first line contains the names
of the variables, separated by spaces or tabs. Each subsequent row
should contain one numerical value for each variable. No missing data
is allowed. ALSO IMPORTANT: the same directory must contain both a
filename.dat and a filename.tst. The software will automatically
evaluate the log-likelihood of the learned model on the hold-out set
filename.tst. This is done by learning the maximum likelihood
parameters of the model, using 10 different starting points.
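Because a malformed data file can waste a long run, it may help to check
both filename.dat and filename.tst against the format described above
before starting. The following is only a minimal sketch and is not part
of the RankSearch distribution: the class name DataFileCheck and its
methods are hypothetical, and skipping blank lines is an assumption (the
parser used by RankSearch may be stricter).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper, not part of RankSearch: checks the data file layout. */
final class DataFileCheck {

    /**
     * Verifies that the first line holds variable names separated by spaces
     * or tabs and that every later row holds one finite number per variable.
     * Returns the number of data rows; throws IllegalArgumentException on
     * the first problem found.
     */
    static int check(Reader source) throws IOException {
        try (BufferedReader in = new BufferedReader(source)) {
            String header = in.readLine();
            if (header == null || header.trim().isEmpty()) {
                throw new IllegalArgumentException("missing header line with variable names");
            }
            int numVars = header.trim().split("\\s+").length;
            int rows = 0;
            String line;
            while ((line = in.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // assumption: blank lines are tolerated
                }
                String[] cells = line.trim().split("\\s+");
                if (cells.length != numVars) {
                    throw new IllegalArgumentException("data row " + (rows + 1) + " has "
                            + cells.length + " values, expected " + numVars);
                }
                for (String cell : cells) {
                    double value;
                    try {
                        value = Double.parseDouble(cell);
                    } catch (NumberFormatException e) {
                        throw new IllegalArgumentException("data row " + (rows + 1)
                                + " has a non-numeric value: " + cell);
                    }
                    // NaN or infinity would effectively be missing data.
                    if (Double.isNaN(value) || Double.isInfinite(value)) {
                        throw new IllegalArgumentException("data row " + (rows + 1)
                                + " has a non-finite value: " + cell);
                    }
                }
                rows++;
            }
            return rows;
        }
    }

    /** Checks every path given on the command line. */
    public static void main(String[] args) throws IOException {
        for (String path : args) {
            try (Reader reader = Files.newBufferedReader(Paths.get(path))) {
                System.out.println(path + ": " + check(reader) + " data rows look well-formed");
            }
        }
    }
}
```

It could be run, for example, as: java DataFileCheck filename.dat filename.tst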
The number of mixture components should be given as input, and the
software does not provide an automated way of choosing it. Since
RankSearch is computationally expensive, we recommend using a simpler
algorithm such as the mixture of factor analyzers to provide such a
number. Software for variational Bayesian mixture of factor analyzers
can be found on Matthew Beal's page.
Finally, the source code allows a few extra options to be tweaked. See
the comments in the file RankSearch.java.
In particular, the number of starting points that is used to
score a model by optimizing a variational approximation can be set by
modifying the value of the constant MAXIMIZATION_TRIALS.
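To illustrate why the number of starting points matters, here is a toy
sketch of multi-start optimization: a local search is run from several
random starting points and the best result is kept. This is not the code
in RankSearch.java. The class RestartSketch, the one-dimensional
objective and the simple hill climber are all made up for illustration,
and the constant here is only a local stand-in with an arbitrary value.

```java
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** Toy illustration of random restarts; not taken from RankSearch.java. */
final class RestartSketch {

    /** Local stand-in for the constant of the same name; the value is arbitrary. */
    static final int MAXIMIZATION_TRIALS = 5;

    /**
     * Simple hill climber: moves left or right while that improves f,
     * halving the step whenever neither neighbour is better.
     * Returns the objective value at the local optimum it reaches.
     */
    static double climb(DoubleUnaryOperator f, double start) {
        double x = start;
        double step = 0.5;
        while (step > 1e-6) {
            double here = f.applyAsDouble(x);
            if (f.applyAsDouble(x + step) > here) {
                x += step;
            } else if (f.applyAsDouble(x - step) > here) {
                x -= step;
            } else {
                step /= 2;
            }
        }
        return f.applyAsDouble(x);
    }

    /** Runs the climber from `trials` random starts in [-5, 5) and keeps the best value. */
    static double bestOfRestarts(DoubleUnaryOperator f, int trials, long seed) {
        Random rng = new Random(seed);
        double best = Double.NEGATIVE_INFINITY;
        for (int t = 0; t < trials; t++) {
            double start = -5.0 + 10.0 * rng.nextDouble();
            best = Math.max(best, climb(f, start));
        }
        return best;
    }

    public static void main(String[] args) {
        // A bounded objective with several local maxima.
        DoubleUnaryOperator f = x -> Math.sin(3 * x) - 0.1 * x * x;
        System.out.println("1 trial: " + bestOfRestarts(f, 1, 42L));
        System.out.println(MAXIMIZATION_TRIALS + " trials: "
                + bestOfRestarts(f, MAXIMIZATION_TRIALS, 42L));
    }
}
```

With the same seed, more trials can never give a worse result than fewer
trials, but each extra trial adds a full local optimization, which is the
trade-off between score quality and running time.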
I haven't done much testing on this code, so if you suspect a bug,
please tell me. Bear in mind that this procedure can take a long time
to converge if you have a reasonably large dataset or a large number of
mixture components. To give you an idea, the largest experiments from
the ICML paper (a few hundred data points, about 40 variables and 2
mixture components) would take up to 6 hours on a 1.8 GHz Pentium IV.
Last modification: August 11th 2006