Redington, F. M. (1992). A statistical approach to syntax acquisition. Unpublished master's thesis, Department of Artificial Intelligence, University of Edinburgh.



Abstract

I investigated the extent to which simple distributional (statistical and/or connectionist) learning techniques can provide information about the syntactic categories of individual words. The particular technique was one independently discovered by Finch & Chater (1992), in which the bigram statistics of individual words are recorded, and these words are clustered according to the similarity of their bigram statistics. Words with the same predominant syntactic category are found to be clustered closer together than one would expect by chance. The quality of the clustering is such that groups of nouns, verbs, adjectives, etc. are immediately apparent to the naked eye. I investigated the properties of the method on a number of small-scale benchmark problems, and applied the technique to a corpus of transcribed adult speech taken from the CHILDES corpora (MacWhinney & Snow, 1985), which is a closer approximation to the language to which children are exposed than text corpora. Results from this analysis were comparable in quality (appropriateness of clustering) to those from text corpora. This work demonstrates that simple distributional approaches, using no a priori information, can provide strong constraints for determining initial syntactic categories. This and similar approaches provide a means of investigating the extent to which linguistic knowledge is innate, since where these methods fail, a priori knowledge may be required. The learning mechanisms which might embody such methods could themselves be considered "knowledge" which constitutes part of the language learner's initial state.
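
The idea of recording bigram statistics and comparing words by their similarity can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the toy corpus, the function names (`context_vectors`, `cosine`) and the choice of cosine similarity are all assumptions made for illustration, and the thesis may have used a different similarity measure. A clustering step (for example, hierarchical clustering over these similarities) would follow and is omitted here.

```python
from collections import Counter, defaultdict
from math import sqrt


def context_vectors(tokens):
    """Map each word to counts of its immediate left and right neighbours.

    Keys are ("L", word) for a preceding word and ("R", word) for a
    following word, so the two bigram positions are kept separate.
    """
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        if i > 0:
            vectors[word][("L", tokens[i - 1])] += 1
        if i + 1 < len(tokens):
            vectors[word][("R", tokens[i + 1])] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (assumed measure)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus; the thesis used much larger corpora.
tokens = "the dog sees the cat . the cat sees the dog . a dog chases a cat .".split()
vectors = context_vectors(tokens)

# Two nouns share contexts; a noun and a verb do not in this toy corpus.
sim_noun_noun = cosine(vectors["dog"], vectors["cat"])
sim_noun_verb = cosine(vectors["dog"], vectors["sees"])
print(f"dog~cat: {sim_noun_noun:.3f}  dog~sees: {sim_noun_verb:.3f}")
```

In this toy corpus the two nouns occur in overlapping contexts and so score as more similar to each other than a noun does to a verb, which is the kind of regularity the clustering exploits.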




Last modified: Jan 10, 1999