PhD project
A new method for estimating the number of clusters

The task of cluster analysis is to find groupings in data. This has many
applications, for example in pattern recognition, biological species 
identification, psychology and social stratification. 
Most classical cluster analysis methods assume the number of clusters to
be known, which is almost never the case in practice. There are several
approaches to estimating the number of clusters. However, most of them are
tied to specific cluster analysis methods, and for many of them there is
little strong evidence of their quality. This project is about defining a
new method based on distances of observations to the closest cluster centre
and comparing it systematically with existing approaches. Most existing
methods are based on squared distances, which is theoretically attractive
but inflexible and often not very robust. The new method can then be used
together with k-medoids clustering (Kaufman and Rousseeuw, 1990).
Kaufman and Rousseeuw (1990) themselves suggest an alternative method, the
"average silhouette width", which lacks a theoretical basis and is not
robust against outliers (Hennig, 2008).
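
To make the contrast between distances and squared distances concrete, the
following is a minimal sketch in Python. It is not the method to be developed
in the project, which is not specified here. It only computes the two basic
quantities for a fixed set of cluster centres. The function names, the toy
data and the choice of medoids are assumptions made for illustration.

```python
import numpy as np


def distances_to_closest_centre(X, centres):
    """Euclidean distance from each row of X to its nearest row of centres."""
    X = np.asarray(X, dtype=float)
    centres = np.asarray(centres, dtype=float)
    # Pairwise distances, shape (n_observations, n_centres).
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return d.min(axis=1)


def dispersion(X, centres, squared=False):
    """Sum of (optionally squared) distances to the closest centre."""
    d = distances_to_closest_centre(X, centres)
    return float(np.sum(d ** 2 if squared else d))


# Hypothetical toy data: two tight groups on a line plus one outlier at 30.
X = np.array([[0.0], [1.0], [2.0], [10.0], [11.0], [12.0], [30.0]])
# As in k-medoids, the centres are observations (here chosen by hand).
medoids = X[[1, 4]]

plain = dispersion(X, medoids)
squared = dispersion(X, medoids, squared=True)

# Contribution of the outlier (last observation) to each criterion.
d_outlier = distances_to_closest_centre(X, medoids)[-1]
print("sum of distances:        ", plain)
print("sum of squared distances:", squared)
print("outlier share (plain):   ", d_outlier / plain)
print("outlier share (squared): ", d_outlier ** 2 / squared)
```

On this toy data the outlier contributes 19 of 23 to the sum of distances but
361 of 365 to the sum of squared distances, which illustrates why criteria
based on squared distances tend to be less robust.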
  
While methods based on squared distances can be motivated by their connection
with maximum likelihood estimators for the normal distribution, other
distributions need to be considered for k-medoids.
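
The connection mentioned above can be sketched as follows. The fixed-partition
model, the notation and the one-dimensional Laplace example are assumptions
made for illustration. They are not taken from the project description and do
not indicate which distributions the project will actually consider.

```latex
% Assumed model: x_1, ..., x_n in R^p, cluster labels c(i) in {1, ..., k},
% x_i ~ N_p(mu_{c(i)}, sigma^2 I_p) independently.
\[
\log L(\mu_1,\dots,\mu_k, c, \sigma^2)
  = -\frac{np}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2 .
\]
% For any fixed sigma^2, maximising over centres and labels is equivalent to
\[
\min_{\mu_1,\dots,\mu_k,\, c}\ \sum_{i=1}^{n} \lVert x_i - \mu_{c(i)} \rVert^2 .
\]
% One-dimensional analogue for unsquared distances, assuming the Laplace
% density f(x) = (1/(2s)) exp(-|x - m|/s) within each cluster:
\[
\log L(m_1,\dots,m_k, c, s)
  = -n\log(2s) - \frac{1}{s}\sum_{i=1}^{n} \lvert x_i - m_{c(i)} \rvert ,
\]
% so that maximum likelihood minimises the sum of unsquared distances.
```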

C. Hennig: Dissolution point and isolation robustness: robustness criteria for general cluster analysis methods. Journal of Multivariate Analysis 99 (2008), 1154-1176. 
L. Kaufman and P. J. Rousseeuw: Finding Groups in Data. Wiley (1990).