Overview

How exactly does human speech transmit multiple layers of communicative meanings through an articulation process? This is the central concern of my research. To address this issue, some fundamental questions need to be answered: What are the kinds of meanings transmitted by speech? What are the encoding mechanisms? What are the decoding mechanisms? Since it is impossible to answer these questions all at once, a realistic strategy is to divide and conquer: always prioritize the kinds of questions for which other things are relatively well established.

My research priority has been based on the following understanding of the state of the art in speech science:

With regard to encoding and decoding mechanisms, the static aspects of speech sounds, whether in terms of acoustic patterns or articulatory correlates, are relatively well established. What remain unclear, even to this day, are the basic dynamic mechanisms of speech production and their processing in perception.
With regard to meanings, lexical meanings are the most easily established; all other meanings are up for grabs.

My early work was therefore focused on Mandarin tones in continuous speech. The functional meaning of tone is clear: to distinguish morphemes that are otherwise identical in terms of CV structure. The canonical forms of Mandarin tones had also been well established. What my work further established are the basic patterns of contextual tonal variation (Xu, 1993, 1994, 1997, 1998, 2001a). This led to the Target Approximation (TA) model of tone production (Xu & Wang, 2001). The TA model was then applied to the intonation of both Mandarin and English (Xu, 1999; Xu & Xu, 2005). The success of these applications led to further expansion of the approach in a number of new directions.

The qTA model -- A computational realization of the TA model. Its parameters can be automatically extracted by a Praat script: qTAtrainer (Prom-on, Xu & Thipakorn, 2009).
The parallel encoding and target approximation (PENTA) model -- A model of speech prosody that allows parallel marking of multiple layers of communicative meanings with articulatory targets that can generate surface F0 contours (Xu, 2005). The intonational version of PENTA is realized through PENTAtrainer, a Praat script that can automatically extract function-loaded pitch targets (Xu & Prom-on, 2014).
The maximum rate of information hypothesis -- Speech is driven by the need to convey information at the fastest rate possible (Xu & Prom-on, 2019). As a result, speech articulation is executed near an overall performance ceiling in terms of articulatory effort (Xu & Sun, 2002; Xu & Wang, 2009; Cheng & Xu, 2013). This view differs from the widely accepted principle of economy of effort, especially in the form of the H&H theory (Lindblom, 1990).
The Time Structure model of the syllable -- A new conceptualization of the syllable as the basic temporal organization structure that assigns time intervals to both segmental and laryngeal units. It offers a drastically different view on issues such as the nature of coarticulation, coarticulation resistance, locus equations, time intervals of segments and temporal alignment of segmental and tonal events (Xu & Liu, 2006). It also makes it possible to fully integrate segmental and suprasegmental aspects of speech, treating them as following the same basic articulatory dynamics (Xu & Liu, 2012).
The Bio-informational Dimensions (BID) theory of vocal expression of emotions (Xu, Kelly & Smillie, 2013) -- Emotional and attitudinal meanings are vocally expressed by simultaneously manipulating a number of bio-informational dimensions: size projection, dynamicity, audibility and association. At least the first two dimensions have been found to be highly relevant for a number of emotions (Chuenwattanapranithi et al., 2008; Xu & Kelly, 2010; Xu, Kelly & Smillie, 2013).
The Nostratic Origin of PFC hypothesis -- The use of post-focus compression (PFC) as a prosodic marker of focus is likely to have a single historical origin, possibly the hypothetical proto-Nostratic language (Xu, 2011; Xu, Chen & Wang, 2012).
The perceptual learnability of the dynamic output of TA, as demonstrated by unsupervised-learning simulations of tone acquisition using self-organizing maps (SOMs) (Gauthier, Shi & Xu, 2007a, 2007b). The importance of these findings is that they suggest perceptual learning does not need to involve distinctive features.
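
To make the idea behind the qTA model listed above more concrete, here is a minimal sketch. It assumes the formulation commonly given for qTA: within each syllable, F0 approaches a linear pitch target x(t) = m*t + b as the response of a third-order critically damped system, and the final F0 state (value, velocity, acceleration) is carried over to the next syllable. The function name, parameter names and numerical values are hypothetical illustrations; this is not the code of qTAtrainer or PENTAtrainer.

```python
import math

def qta_contour(m, b, lam, state, duration, n_samples=50):
    """Sketch of target approximation within one syllable (assumed qTA form).

    m, b     : slope and height of the linear pitch target x(t) = m*t + b
    lam      : rate (strength) of target approximation, in 1/s
    state    : (f0, velocity, acceleration) at syllable onset
    duration : syllable duration in seconds
    Returns (times, f0_values, final_state).
    """
    y0, v0, a0 = state
    # Transient coefficients, fixed by the onset state of the syllable.
    c1 = y0 - b
    c2 = v0 + c1 * lam - m
    c3 = (a0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0

    def poly(t):
        return c1 + c2 * t + c3 * t * t

    def dpoly(t):
        return c2 + 2.0 * c3 * t

    def f0(t):
        # Linear target plus an exponentially decaying transient.
        return m * t + b + poly(t) * math.exp(-lam * t)

    def velocity(t):
        return m + (dpoly(t) - lam * poly(t)) * math.exp(-lam * t)

    def acceleration(t):
        return (2.0 * c3 - 2.0 * lam * dpoly(t)
                + lam ** 2 * poly(t)) * math.exp(-lam * t)

    times = [duration * i / (n_samples - 1) for i in range(n_samples)]
    values = [f0(t) for t in times]
    final_state = (f0(duration), velocity(duration), acceleration(duration))
    return times, values, final_state


# Two consecutive syllables: a static high target followed by a falling target.
# The second syllable starts from the final state of the first.
onset = (90.0, 0.0, 0.0)  # hypothetical onset F0 state (e.g. in semitones)
t1, f1, end1 = qta_contour(0.0, 100.0, 40.0, onset, 0.2)
t2, f2, end2 = qta_contour(-50.0, 95.0, 40.0, end1, 0.2)
```

Because the onset state of each syllable is inherited from the end of the preceding one, the carryover influence of a preceding tone arises in this sketch without any separate rule, which is the property the TA model is meant to capture.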

Empirical Findings

Post-focus Compression (PFC) is absent in Taiwanese, Taiwan Mandarin and Cantonese (Chen, Wang & Xu, 2009; Wu & Xu, 2010).
Syllable organization is done through direct adjustment of syllable duration without mediation of stress or prominence (Xu & Wang, 2009).
Segmental and tonal elements all start about 26-48 ms earlier than conventional segmentation (Xu & Liu, 2006, 2007).
The maximum speed of pitch change is often approached in speech, which may be the likely source of many observed F0 contour and alignment patterns (Xu & Sun, 2002).
Post-low bouncing -- F0 after a Low tone bounces back in the subsequent syllables, especially if they carry the neutral tone (Chen & Xu, 2006).
The neutral tone is not toneless. Rather, it is likely to have a [mid] pitch target accompanied by weak articulatory strength (Chen & Xu, 2006).
F0 peak delay is closely related to the interaction of tonal targets and articulatory constraints (Xu, 2001a).
Focus in Mandarin consists of both on-focus pitch range expansion and post-focus pitch range compression, and final focus is very similar to broad/neutral focus (Xu, 1999).
Tonal targets are synchronously implemented with the entire syllable rather than with only nucleus vowel or syllable rime (Xu, 1998).
Contextual tonal variations are robustly asymmetrical: Carryover effects are strong and assimilatory, whereas anticipatory effects are weak and largely dissimilatory (Xu, 1993, 1994, 1997, 1999).
Lexical tones in Mandarin that are distorted due to articulatory constraints are still perceptually identifiable. No categorical change is therefore likely to have taken place (Xu, 1993, 1994).
Perceptual compensation for contextual tonal variations is not complete (Xu, 1993, 1994).
Mandarin tone 3 (the Low tone) sandhi occurs during a short-term memory task, indicating the phonetic nature of the working memory (Xu, 1991).
In a Mandarin syllable with a final nasal, the duration of the nucleus vowel is inversely related to vowel height: the higher the vowel, the shorter the duration. This duration variation is compensated for by the duration of the nasal murmur: the shorter the vowel, the longer the nasal murmur. Thus in a syllable with a low vowel such as /bang/, there is often hardly any nasal murmur, whereas in a syllable with a high vowel such as /bing/, the nasal murmur can be longer than the vowel (Xu, 1986).
Mandarin final nasals are realized as nasalization on the preceding nucleus vowel with no nasal murmur if the following syllable begins with a vowel or a glide (Xu, 1986).
In a Mandarin disyllabic word or phrase, the initial consonant position in syllable 2 is much less "consonantal" than the initial consonant position in syllable 1. In that position, the consonant is shorter and, if voiceless, more likely to become voiced; a stop or affricate is more likely to lose its closure and become a fricative; and a fricative is more likely to lose its frication (Xu, 1986).

Research Philosophies
