During the holiday I spent much of my time on a local discussion forum, reading and discussing topics regarding the English language. One question that was raised again and again by local students was this: Why does the ‘p’ in spy sound somewhat different from the ‘p’ in pie, and in fact, for Chinese speakers, the same as ‘b’ in buy?

The answer is simple: because they are different.

In IPA, the three words buy, pie and spy are represented as [baɪ], [pʰaɪ] and [spaɪ] respectively. In other words, the plosives in these three words are three different sounds. This distinction also exists in the other plosive/stop triplets, namely d, t, (s)t (dear [diɚ], tear [tʰiɚ], steer [stiɚ]) and g, k/c/q, (s)k/c/q (gill [gɪl], kill [kʰɪl], skill [skɪl]). Put differently, p, t, k/c/q are normally pronounced as [pʰ], [tʰ], [kʰ], but when they are preceded by an ‘s’, they are pronounced as [p], [t], [k]. Here are some more examples:

beer [biɚ] dink [dɪŋk] gate [geɪt]
peer [pʰiɚ] tink [tʰɪŋk] Kate [kʰeɪt]
spear [spiɚ] stink [stɪŋk] skate [skeɪt]

Regarding these triplets, a few questions arise. For simplicity, we will only talk about the triplet b, p, sp below, but the principle extends to the other two triplets as well.

  1. How are they different?
  2. Why does a single letter ‘p’ represent both [p] and [pʰ]?
  3. Why do Chinese speakers think [p] and [b] are the same?

How are they different?

In technical jargon, [b] is called a voiced plosive, [pʰ] a voiceless aspirated plosive and [p] a voiceless unaspirated plosive. In layman's terms, these labels mean something not too hard to grasp.

When you produce a [b] sound, your vocal cords vibrate at the same time, so it is called voiced. When you produce a [pʰ] or [p], your vocal cords do not vibrate, so they are voiceless. In principle, you can put your fingers on your throat to feel the vibration as you produce a [b]. In practice, this does not work well: a [b] is too short to produce on its own, and if you add a vowel after it (e.g. [ba]), the vibration you feel comes mostly from the vowel, not the consonant [b]. You can, however, try the method with [z] and [s], which are also a voiced/voiceless pair but can easily be lengthened.

On the other hand, when you produce a [pʰ], there is a puff of air coming out, so it is aspirated; and when you produce a [p], there should be no puff of air, hence unaspirated. You can put your palm in front of your mouth to feel the puff of air; it should be pretty obvious.

The actual mechanism of the production of these three sounds is slightly more complicated. To understand this, we have to first understand how a plosive consonant is produced. To produce a plosive, there are three steps:

  1. Closure: the oral cavity is blocked completely at a certain place (e.g. for [b], the lips are closed to block the oral cavity; for [d], the tip of the tongue touches the part above the upper teeth (alveolar ridge) to create a blockage).
  2. Blockage: the oral cavity is held blocked, as air from the lungs continues to come into the cavity. Therefore the air pressure inside the cavity increases.
  3. Release: the blockage is released. Since the air pressure inside the cavity is now higher than the pressure outside, air rushes out and creates an “explosion” (hence the name plosive).

It should be noted that the proper or essential part of a plosive consonant is actually the blockage stage, as normally a plosive sits between two vowels, so the closure stage coincides with the production of the preceding vowel, and the release stage coincides with the production of the following vowel. Now, if the vocal cords start to vibrate in the blockage stage, before the release of the plosive, the consonant is voiced. If the vocal cords start to vibrate at about the same time as the consonant is released, it is voiceless unaspirated. If the vibration only starts significantly after the release, the plosive is voiceless aspirated. This is because the vibration of the vocal cords creates a constriction which disturbs the airflow coming out from the lungs; if voicing only starts significantly after the release, there is a period of time (~100 ms) when the airflow is undisturbed and can thus come out of the oral cavity, resulting in the puff of air in an aspirated plosive.
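The three-way distinction described above can be sketched as a small classifier over the time at which voicing starts, measured in milliseconds relative to the release (a negative value means voicing starts during the blockage). The function name `classify_plosive`, the 30 ms boundary and the sample values are assumptions chosen for illustration, not measurements.

```python
def classify_plosive(vot_ms: float, aspiration_threshold_ms: float = 30.0) -> str:
    """Classify a plosive by when voicing starts relative to the release.

    vot_ms < 0 means the vocal cords start vibrating before the release.
    The 30 ms boundary is an illustrative assumption, not a measured value.
    """
    if vot_ms < 0:
        return "voiced"
    if vot_ms < aspiration_threshold_ms:
        return "voiceless unaspirated"
    return "voiceless aspirated"


# Hypothetical timings for the three plosives discussed in the text.
for label, vot in [("[b]", -80.0), ("[p]", 5.0), ("[pʰ]", 100.0)]:
    print(label, "->", classify_plosive(vot))
```

Because the only input is a single number, the sketch also makes it visible that the three categories are regions on one scale rather than three unrelated sounds.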

[Figure: Voice Onset Time (VOT)]

Shown above is a graphical representation of the relative VOT of voiced, voiceless unaspirated and voiceless aspirated plosives. The waves represent voicing, or in other words the vibration of the vocal cords. The double-ended fork represents the relative positions of the articulators involved. This suggests that the voiced plosive [b], the voiceless unaspirated plosive [p] and the voiceless aspirated plosive [pʰ] are really on a continuum. In fact, the English voiced plosives [b], [d], [g] are only fully voiced (vibration starts immediately upon the blockage of the oral cavity) when they occur between two voiced segments. When they occur word-initially, voicing starts much later, so they are only partially voiced or even become voiceless unaspirated.

Why does a single letter ‘p’ represent both [p] and [pʰ]?

The English writing system is basically a phonetic system, but that does not mean that every phonetic detail is recorded in it. We need to recognize three levels of representation in the transcription of sounds.

The first level is called phonetic transcription. On this level, every distinguishable phonetic detail is recorded. By convention, phonetic transcriptions are given in square brackets [], which is what we have done so far.

The second level is called phonological transcription. In a given language, similar sounds can sometimes be grouped together and represented with a single symbol if their distribution is predictable. For example, in English, [p] appears only after [s], whereas [pʰ] appears in all other places but never after [s]. They are said to be in complementary distribution. Therefore, they are grouped together and represented as /p/ (phonological transcriptions are given in slashes //). Dictionaries normally show pronunciations on the level of phonological transcription.
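The grouping described above can be read as a rule that turns a phonological transcription into a phonetic one. Below is a minimal sketch of that rule, assuming the simplified statement in the text (unaspirated after /s/, aspirated elsewhere) and ignoring other positions such as the end of a word. The name `realize` and the list-of-symbols input format are hypothetical.

```python
VOICELESS_PLOSIVES = {"p", "t", "k"}


def realize(phonemes: list[str]) -> list[str]:
    """Map phonemes to phones for /p/, /t/, /k/.

    Simplified rule taken from the text: unaspirated directly after /s/,
    aspirated everywhere else. All other symbols pass through unchanged.
    """
    phones = []
    for i, ph in enumerate(phonemes):
        if ph in VOICELESS_PLOSIVES:
            after_s = i > 0 and phonemes[i - 1] == "s"
            phones.append(ph if after_s else ph + "ʰ")
        else:
            phones.append(ph)
    return phones


print("".join(realize(["p", "aɪ"])))       # pie -> pʰaɪ
print("".join(realize(["s", "p", "aɪ"])))  # spy -> spaɪ
```

Since the choice between the two phones is fully determined by the preceding symbol, a single symbol /p/ is enough at the phonological level.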

The third level is orthography. The (phonetic) orthography of a language is normally associated with the phonological transcription of that language. However, for various historical reasons, letters in an orthography often do not correspond to the actual sounds unambiguously. For instance, a single letter ‘a’ in English can represent /a/ in bar, /æ/ in bat, /eɪ/ in bake, /ɔ/ in ball and /ə/ in abound.

In the case of the letter ‘p’ in English, it actually corresponds to the phonological /p/, and so it represents both [p] and [pʰ]. The reason is partly economy, and partly that psychologically native speakers regard them as the same sound (unless they pay special attention to the phonetic difference).

Why do Chinese speakers think [p] and [b] are the same?

In Chinese (both Mandarin and Cantonese), there are only voiceless unaspirated and voiceless aspirated plosives, but no voiced plosives. Taking Cantonese as an example, there are [p], [t], [k] and [pʰ], [tʰ], [kʰ], but no [b], [d], [g]. Furthermore, [p] and [pʰ] are contrastive in Cantonese, meaning that they can occur in the same position of a word and result in different meanings. For example, the word /paɪ/ 拜 means “to worship”, whereas /pʰaɪ/ 派 means “to distribute” (in Jyutping, the romanized orthography, they are represented as baai and paai). Naturally, this also applies to the pairs [t]/[tʰ] and [k]/[kʰ].

Since voiced plosives do not exist in Chinese, most untrained Chinese speakers are unaware of this “voiced” property. When they hear the English [b], they mistake it for a [p] because that is the most similar consonant in Chinese.

Nonetheless, it should be made clear here that the similarity between [b] and [p] is not an objective fact. Italian and Spanish speakers, whose languages have [b] and [p] but not [pʰ], would actually say that [p] is closer to [pʰ].
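The idea that perceived similarity depends on the listener's own sound inventory can be illustrated with a toy model: each language has a few plosive categories placed on a single voicing-time scale, and a listener assigns an unfamiliar sound to the nearest category. The timing values, the names `INVENTORIES` and `perceive`, and the nearest-category rule itself are all assumptions made for illustration; real perception involves more than one cue.

```python
# Illustrative voicing-onset values in ms relative to the release
# (assumptions, not measurements). Negative = voicing before the release.
ENGLISH_PHONES = {"b": -80.0, "p": 5.0, "pʰ": 100.0}

# Plosive categories available to listeners of each language (toy values).
INVENTORIES = {
    "Cantonese": {"p": 5.0, "pʰ": 100.0},  # no voiced plosive
    "Spanish": {"b": -80.0, "p": 5.0},     # no aspirated plosive
}


def perceive(vot_ms: float, inventory: dict[str, float]) -> str:
    """Return the listener's category whose value is closest to vot_ms."""
    return min(inventory, key=lambda cat: abs(inventory[cat] - vot_ms))


# A Cantonese listener has no [b], so a voiced [b] lands on [p].
print(perceive(ENGLISH_PHONES["b"], INVENTORIES["Cantonese"]))  # p
# A Spanish listener has no [pʰ], so an aspirated [pʰ] also lands on [p].
print(perceive(ENGLISH_PHONES["pʰ"], INVENTORIES["Spanish"]))   # p
```

In this sketch the same [p] ends up grouped with [b] for one listener and with [pʰ] for the other, which is the sense in which the similarity is not an objective fact.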