Sound is a mechanical wave generated by a vibrating source and transmitted through an acoustic medium (e.g. air) as an oscillating pressure pattern whose frequencies lie within the range of hearing.
Voice is the sound produced by humans and other vertebrates using the lungs and the vocal folds in the larynx.
Voice is not always produced as speech.
Speech is decodable sound that humans use to express thoughts, feelings and ideas orally.
4. Speech generation
Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract. Fig1 shows the human vocal tract.
Fig1. Human Vocal Tract
The vocal tract introduces short-term correlations (of the order of 1 ms) into the speech signal, and can be thought of as a filter with broad resonances called formants. An important part of many speech codecs is modelling the vocal tract as a short-term filter whose transfer function needs to be updated only relatively infrequently (typically every 20 ms or so).
Speech sounds can be broken into three classes based on their mode of excitation. The excitation is the air forced into the vocal tract filter through the vocal cords.
- Voiced sounds
- Unvoiced sounds
- Plosive sounds
Although there are many possible speech sounds, the shape of the vocal tract and its mode of excitation change relatively slowly, and so speech can be considered to be quasi-stationary over short periods of time (of the order of 20 ms). Speech signals show a high degree of predictability, due sometimes to the quasi-periodic vibrations of the vocal cords and also to the resonances of the vocal tract. Speech codecs attempt to exploit this predictability in order to reduce the data rate necessary for good quality voice transmission.
Fig2 shows a speech generation model.
Fig2. Speech Generation Model
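A common way to realise a speech generation model of this kind is to drive an all-pole filter (standing in for the vocal tract) with an excitation signal: a pulse train for voiced sounds, or noise for unvoiced sounds. The sketch below is only an illustration under stated assumptions, not a codec. The 8 kHz sampling rate, the single 500 Hz resonance with 100 Hz bandwidth, the 80-sample pitch period and all function names are hypothetical choices made for the example.

```python
import math
import random

FS = 8000            # assumed sampling rate in Hz
FRAME = 160          # 20 ms at 8 kHz

def impulse_train(n_samples, period):
    """Voiced excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Unvoiced excitation: uniform noise in [-1, 1]."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def resonator_coeffs(freq_hz, bandwidth_hz, fs):
    """Two-pole resonance (one formant) as feedback coefficients a1, a2."""
    radius = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius < 1, so stable
    theta = 2.0 * math.pi * freq_hz / fs              # pole angle
    return [2.0 * radius * math.cos(theta), -radius * radius]

def all_pole_filter(excitation, coeffs):
    """y[n] = e[n] + sum_k coeffs[k-1] * y[n-k]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                acc += a * out[n - k]
        out.append(acc)
    return out

coeffs = resonator_coeffs(500.0, 100.0, FS)           # hypothetical formant
voiced = all_pole_filter(impulse_train(FRAME, 80), coeffs)
unvoiced = all_pole_filter(white_noise(FRAME), coeffs)
```

Both outputs share the same filter and differ only in the excitation, which is the separation that the excitation classes above rely on.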
5. Speech properties
- Formants are defined as ‘the spectral peaks of the sound spectrum’.
One major property of speech is its correlation, i.e. successive samples of a speech signal are similar. The short-term correlation of successive speech samples has consequences for the short-term spectral envelopes. These spectral envelopes have a few local maxima, the so-called ‘formants’, which correspond to resonance frequencies of the human vocal tract.
This (short-term) correlation can be used to estimate the current speech sample from past samples. The estimation is called prediction. Because the prediction is done by a linear combination of past speech samples, it is called linear prediction. Only the prediction error signal is conveyed to the receiver.
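Linear prediction as described above can be sketched as follows. The text does not say how the predictor coefficients are obtained; this example assumes the autocorrelation method solved with the Levinson-Durbin recursion, which is one widely used option. The function names, the predictor order of 2 and the 200 Hz test tone are assumptions for illustration only.

```python
import math

def autocorrelation(x, max_lag):
    """r[k] = sum_n x[n] * x[n-k] for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return a[1..order] such that x_hat[n] = sum_k a[k] * x[n-k]."""
    a = [0.0] * (order + 1)
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:          # silent frame: nothing to predict
            break
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err         # reflection coefficient
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:]

def prediction_error(x, coeffs):
    """Residual e[n] = x[n] - x_hat[n], for samples that have a full history."""
    order = len(coeffs)
    return [x[n] - sum(coeffs[k] * x[n - 1 - k] for k in range(order))
            for n in range(order, len(x))]

# Assumed test input: one 20 ms frame (160 samples at 8 kHz) of a 200 Hz tone.
frame = [math.sin(2.0 * math.pi * 200.0 * n / 8000.0) for n in range(160)]
lpc = levinson_durbin(autocorrelation(frame, 2), 2)
residual = prediction_error(frame, lpc)

signal_energy = sum(s * s for s in frame)
residual_energy = sum(e * e for e in residual)
```

For a strongly correlated input such as this tone, `residual_energy` is a small fraction of `signal_energy`, which is why conveying the prediction error instead of the samples themselves can save bits.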
- Pitch represents the perceived fundamental frequency of sound.
Pitch can be quantified as a frequency. However, it is not a purely objective physical property, but a subjective psycho-acoustical attribute of sound.
Voiced sounds, such as vowels, have a periodic structure, i.e. their signal form repeats itself after some milliseconds, the so-called pitch period TP. Its reciprocal value fP=1/TP is called the pitch frequency. So there is also correlation between distant samples in voiced sounds.
This long-term correlation is exploited for bit-rate reduction with a so-called long-term predictor (also called pitch predictor).
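One simple way to find the pitch period TP is to search for the lag at which a frame best matches a delayed copy of itself. The sketch below uses a normalized correlation for that search; it is an illustration, not the method of any particular codec. The 8 kHz sampling rate, the lag search range of 20 to 147 samples and the synthetic three-harmonic test signal are assumptions.

```python
import math

FS = 8000  # assumed sampling rate in Hz

def estimate_pitch_lag(x, min_lag, max_lag):
    """Return (lag, score) maximizing the normalized correlation of x[n] and x[n-lag]."""
    best_lag, best_score = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        current = x[lag:]
        past = x[:len(x) - lag]
        num = sum(c * p for c, p in zip(current, past))
        den = math.sqrt(sum(c * c for c in current) * sum(p * p for p in past))
        score = num / den if den > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score

# Synthetic "voiced" frame: three harmonics of a 100 Hz fundamental,
# i.e. a pitch period of 80 samples (10 ms) at 8 kHz.
PERIOD = 80
frame = [sum(0.5 ** h * math.sin(2.0 * math.pi * (h + 1) * n / PERIOD)
             for h in range(3))
         for n in range(320)]

lag, score = estimate_pitch_lag(frame, 20, 147)
pitch_period_s = lag / FS            # TP
pitch_frequency_hz = FS / lag        # fP = 1 / TP
```

A long-term predictor would then predict each sample from the sample one pitch lag earlier, scaled by a gain.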
6. Speech codecs taxonomy
- Waveform coding attempts to reproduce the time domain speech waveform as accurately as possible.
- Analysis-by-synthesis methods utilize the linear prediction model and a perceptual distortion measure to reproduce only those characteristics of the input speech that are determined to be most important.
- Sub-band approaches break the speech into several frequency sub-bands and code them separately.
- Transform coding applies a transform to the input signal and transmits the transform coefficients to the receiver.
7. Speech digitization
The analogue speech signal is sampled and quantized. According to the sampling theorem, the sampling rate must be at least 2*BW, where BW is the bandwidth (the highest frequency) of the original signal. Speech signal bandwidths are classified as follows:
- Narrow band: 300 Hz to 3.4 kHz. Used in the traditional telephony network. It is usually allocated a 4 kHz channel, which allows a sampling rate of 8 kHz.
- Wide band: 50 Hz to 7 kHz. Used in VoIP.
- Super wide band: upper frequency above 7 kHz. Used in video telephony.
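The 2*BW rule can be checked numerically. The sketch below computes the minimum sampling rate for the narrow-band channel and the wide-band upper frequency given above, then shows what goes wrong when the rule is violated: a 5 kHz cosine sampled at 8 kHz produces the same samples as a 3 kHz cosine, so the two cannot be told apart after sampling. The function names and the choice of test tones are assumptions for illustration.

```python
import math

def min_sampling_rate_hz(upper_frequency_hz):
    """Minimum sampling rate required by the sampling theorem."""
    return 2.0 * upper_frequency_hz

def sample_cosine(freq_hz, fs_hz, n_samples):
    """Samples of cos(2*pi*f*t) taken at t = n / fs."""
    return [math.cos(2.0 * math.pi * freq_hz * n / fs_hz)
            for n in range(n_samples)]

narrow_band_rate = min_sampling_rate_hz(4000.0)   # 4 kHz telephony channel
wide_band_rate = min_sampling_rate_hz(7000.0)     # 7 kHz upper frequency

# Aliasing: 5 kHz lies above half of 8 kHz, so it folds back onto 3 kHz.
tone_3k = sample_cosine(3000.0, 8000.0, 64)
tone_5k = sample_cosine(5000.0, 8000.0, 64)
max_difference = max(abs(a - b) for a, b in zip(tone_3k, tone_5k))
```

This is why the input is band-limited before sampling: anything above half the sampling rate would otherwise fold back into the speech band.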
PCM is the basic digital representation of an analog speech signal. Its two basic properties are the sampling rate and the bit depth, which together determine the bit rate of the uncompressed digital signal. With linear PCM at a sampling rate of 8 kHz and a bit depth of 16 bits, the bit rate is 128 kbps. With logarithmic PCM, the bit depth is 8 bits, and thus the bit rate is 64 kbps. This is the scheme used by G.711, the standard codec in the PSTN and ISDN.
A PCM stream is not compressed and is regarded as having toll quality. However, its bit rate is usually too high for transmission, so speech compression is needed.
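The bit-rate figures above follow directly from sampling rate times bit depth, as the sketch below shows. It also includes the continuous mu-law companding curve with mu = 255 as an illustration of what "logarithmic" means here. Note the assumption: this is the textbook continuous formula, not the bit-exact segmented encoding that G.711 specifies, and the function names are hypothetical.

```python
import math

def pcm_bit_rate_bps(sampling_rate_hz, bit_depth):
    """Bit rate of a single-channel PCM stream."""
    return sampling_rate_hz * bit_depth

linear_pcm_bps = pcm_bit_rate_bps(8000, 16)   # 16-bit linear PCM
log_pcm_bps = pcm_bit_rate_bps(8000, 8)       # 8-bit logarithmic PCM

MU = 255.0  # companding parameter of the continuous mu-law curve

def mu_law_compress(x):
    """Map x in [-1, 1] to [-1, 1], expanding small amplitudes."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)

quiet = mu_law_compress(0.01)                 # small input, much larger output
round_trip = mu_law_expand(mu_law_compress(0.1))
```

Because quiet samples occupy a larger share of the compressed range, an 8-bit quantizer applied after compression spends more of its levels on the low amplitudes that dominate speech.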
8. Speech coder attributes
- Bit rate is the rate of the encoder's output bit-stream. It should conform to the target network bit rate (network bandwidth or channel bandwidth).
- Delay usually consists of three major components: the algorithmic delay, the processing delay and the transmission delay. The latter two depend on the implementation and on the properties of the external channel, whereas the first is independent of the practical implementation. Usually, algorithmic delay = frame size (or frame length) + look-ahead. The sum of the first two components is the one-way codec delay, and the total of all three is the one-way system delay.
- Complexity is usually expressed in terms of the required MIPS, RAM size and ROM size.
- Quality has the most dimensions of all the attributes. There are subjective tests and objective approaches to evaluate codec quality.
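The delay relationships in the Delay item above can be written out as a small calculation. All numbers in this sketch (20 ms frame, 5 ms look-ahead, 10 ms processing, 50 ms transmission) are hypothetical values chosen only to make the arithmetic concrete, and the class name is an assumption.

```python
from dataclasses import dataclass

@dataclass
class DelayBudget:
    frame_ms: float          # frame size (frame length)
    look_ahead_ms: float     # extra samples the encoder needs from the next frame
    processing_ms: float     # implementation dependent
    transmission_ms: float   # channel dependent

    @property
    def algorithmic_ms(self):
        """Algorithmic delay = frame size + look-ahead."""
        return self.frame_ms + self.look_ahead_ms

    @property
    def codec_ms(self):
        """One-way codec delay = algorithmic delay + processing delay."""
        return self.algorithmic_ms + self.processing_ms

    @property
    def system_ms(self):
        """One-way system delay = codec delay + transmission delay."""
        return self.codec_ms + self.transmission_ms

# Hypothetical example values, in milliseconds.
budget = DelayBudget(frame_ms=20.0, look_ahead_ms=5.0,
                     processing_ms=10.0, transmission_ms=50.0)
```

Only `algorithmic_ms` is fixed by the codec design; the other two terms change with the hardware and the network.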