
The Political Books dataset
---------------------------

I have not yet asked for permission to include the actual network
data in this file. Meanwhile, you can download it from

http://www-personal.umich.edu/~mejn/netdata/

In this file, you will find the feature data and labels for the books
in that network. There is a total of 105 books. I downloaded the
Amazon.com front page of each book and included them in the 'raw'
subdirectory (please note that there were a couple of books I could
not identify from the original titles; the corresponding raw text
files contain nothing but the title). I did not clean the data of
possible noise sources (such as the 'recent history' items on the
front page), since this is already an easy classification task.

After stemming, each word in the 105 files is mapped to a unique
integer. The file 'vocabulary' contains all stemmed words and their
respective ids. The file polbooks.dat contains the data in MATLAB
sparse matrix format. Each line contains three numbers: <book_id>,
<word_id>, <word count>. The file polbooks_labels.dat contains the
labels: 0 for conservative and 1 for liberal. Please note that in the
original dataset there were a few books labeled as 'neutral'. I gave
them a label of 1.
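
As an illustration of the triplet format described above, here is a
minimal Python sketch of a reader. It is not part of the original
distribution. The function names (parse_triplets, parse_labels) are
hypothetical, and the sketch assumes that fields are separated by
whitespace and/or commas, that numbers may be written as floats (as
MATLAB often does), and that the labels file holds one label per line.
Ids are kept exactly as they appear in the file.

```python
import re
from collections import defaultdict


def parse_triplets(lines):
    """Parse '<book_id> <word_id> <word count>' lines.

    Returns a nested dict {book_id: {word_id: count}}.
    Assumption: fields are separated by whitespace and/or commas.
    """
    counts = defaultdict(dict)
    for line in lines:
        tokens = re.split(r"[,\s]+", line.strip())
        if len(tokens) != 3:
            continue  # skip blank or malformed lines
        # int(float(...)) also accepts values written like 1.0000000e+00
        book_id, word_id, count = (int(float(t)) for t in tokens)
        counts[book_id][word_id] = count
    return dict(counts)


def parse_labels(lines):
    """Parse one label per line (assumed layout): 0 = conservative, 1 = liberal."""
    return [int(float(line)) for line in lines if line.strip()]


# Tiny made-up sample, NOT taken from polbooks.dat
sample_counts = parse_triplets(["1 3 2", "1 7 1", "2 3 4"])
sample_labels = parse_labels(["0", "1"])
```

With the real files, both functions can be given open file handles,
for example parse_triplets(open('polbooks.dat')).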

In my actual use of this data (Silva et al., 2007), I recoded each
document using tf-idf features (as in Chu et al., 2006). For the
"cross-validation" experiments, I used the selection indicated in the
file polbooksf.folds: each line in this file contains 105 digits. The
positions of the '1' digits indicate which books are used as training
data. This is not really cross-validation, since we sampled randomly
from the 105 books one hundred times; estimates of AUC variance in
the actual study should therefore be seen as underestimates.
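
The following Python sketch shows one way to read a line of the folds
file and to compute tf-idf weights from raw counts. It rests on
several assumptions: the function names (parse_fold, tfidf) are
hypothetical, books not marked with '1' are treated as held-out data,
and the weighting used is the common count * log(N / df) variant,
which is not necessarily the exact variant used in the cited papers.

```python
import math


def parse_fold(line):
    """Split one line of digits into (train, held_out) index lists.

    Indices are 0-based positions within the line. Treating every
    non-'1' position as held-out is an assumption of this sketch.
    """
    digits = "".join(line.split())  # tolerate spaces between digits
    train = [i for i, d in enumerate(digits) if d == "1"]
    held_out = [i for i, d in enumerate(digits) if d != "1"]
    return train, held_out


def tfidf(counts, n_docs):
    """Reweight {book_id: {word_id: count}} as count * log(n_docs / df).

    df is the number of books in which a word occurs. This is one
    common tf-idf variant, chosen here only for illustration.
    """
    df = {}
    for words in counts.values():
        for word_id in words:
            df[word_id] = df.get(word_id, 0) + 1
    return {
        book_id: {w: c * math.log(n_docs / df[w]) for w, c in words.items()}
        for book_id, words in counts.items()
    }


# Tiny made-up sample, NOT taken from the dataset
train_idx, held_out_idx = parse_fold("10110")
weights = tfidf({1: {3: 2, 7: 1}, 2: {3: 4}}, n_docs=2)
```

In this toy sample, word 3 occurs in both books and so receives
weight zero, while word 7 occurs in only one book and keeps a
positive weight.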

---

References:

W. Chu, V. Sindhwani, Z. Ghahramani and S. S. Keerthi (2006)
Relational learning with Gaussian processes, in Neural Information
Processing Systems (NIPS-19)

R. Silva, W. Chu and Z. Ghahramani (2007) Hidden common cause
relations in relational learning, in Neural Information Processing
Systems (NIPS-20)

---
Ricardo Silva
London, September 2007