G.723.1, also called
TrueSpeech 6.3/5.3, is a member of the TrueSpeech family of speech
compression algorithms from DSP Group, Inc., and provides digital
voice compression ratios of 20:1 and 24:1 respectively (6.3 Kbps
and 5.3 Kbps). It is the highest quality compression algorithm
DSP Group offers. It has been adopted by the International Telecommunication
Union (ITU) as ITU-T Recommendation G.723.1, which specifies
a coded representation for compressing the speech or other
audio signal component of multimedia services at a very low bit
rate over public telephone (POTS) networks as part of the H.324
family. It is also used in the ITU H.323 audio and video standard
as the recommended low bit rate speech technology [22].
G.723 or G.723.1
People sometimes
refer to G.723.1 simply as G.723. An earlier coder designated
G.723 did exist, but it was later folded into G.726. The
ITU named the currently adopted coder G.723.1
in order to avoid confusion with that earlier recommendation. Thus, there is no real distinction
between G.723.1 and G.723 when referring to the currently adopted
G.723.1 standard [23].
Bit
rates supported
The G.723.1 coder
has two bit rates associated with it: 5.3 Kbps, using
the ACELP algorithm, which provides good quality and additional flexibility
to the system designer, and 6.3 Kbps, using the MP-MLQ algorithm, which provides better quality.
Both rates are mandatory parts of the encoder and decoder. This
codec enables voice communications over the Internet, and other
audio compression applications, with quality similar to that of a regular telephone call [23].
Encoder/Decoder
As the G.723.1
coder is designed to operate on a digital signal, the analogue
signal is first converted to digital samples suitable for the encoder,
as shown in the block diagram below.
The processes used
after decoding are similar and convert the digital samples back
to an analogue signal.

Figure 2-27 G.723.1
codec block diagram [19]
The codec uses
linear prediction analysis-by-synthesis
coding to minimise a perceptually weighted error signal.
As shown in figure
2-27, the key processes used in the codec algorithm are described
below.
Framer:
The samples are grouped into frames or blocks. Each frame contains 240
samples, which corresponds to 30 ms at the 8 kHz sampling rate.
High-pass
filter: Each frame is high-pass filtered to remove
the DC component.
Sub frames:
Each frame is then divided into four sub frames of 60
samples each.
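As an illustration of these first three steps, the short Python sketch below groups an 8 kHz sample stream into 240-sample frames, applies a simple first-order DC-removal high-pass filter, and splits each frame into four 60-sample sub frames. The pole value of the filter and the function names are assumptions made for the example; they are not taken from the G.723.1 reference implementation.

import numpy as np

FRAME_SIZE = 240      # 30 ms at the 8 kHz sampling rate
SUBFRAME_SIZE = 60    # four sub frames per frame

def frames(signal):
    # Yield consecutive, non-overlapping 240-sample frames.
    for start in range(0, len(signal) - FRAME_SIZE + 1, FRAME_SIZE):
        yield signal[start:start + FRAME_SIZE]

def remove_dc(frame, pole=0.99):
    # Illustrative first-order high-pass filter that strips the DC component.
    out = np.empty(FRAME_SIZE, dtype=float)
    prev_x = prev_y = 0.0
    for i, x in enumerate(frame):
        prev_y = x - prev_x + pole * prev_y
        prev_x = x
        out[i] = prev_y
    return out

def subframes(frame):
    # Split one 240-sample frame into four 60-sample sub frames.
    return [frame[i:i + SUBFRAME_SIZE] for i in range(0, FRAME_SIZE, SUBFRAME_SIZE)]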
LPC analysis:
Using the unprocessed input signal, a 10th-order Linear
Prediction Coding (LPC) filter is computed for every sub frame.
The LPC filter for the last sub frame is quantised using a Predictive
Split Vector Quantiser (PSVQ).
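A rough sketch of how such a 10th-order LPC analysis can be performed per sub frame is shown below, using the autocorrelation method with a Levinson-Durbin recursion. The Hamming analysis window is an assumption for the example; the windowing details of the actual G.723.1 routine and the PSVQ quantisation step are not reproduced here.

import numpy as np

LPC_ORDER = 10

def lpc(subframe, order=LPC_ORDER):
    # Return prediction coefficients a[0..order] (with a[0] == 1) for one sub frame.
    x = np.hamming(len(subframe)) * np.asarray(subframe, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):            # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] += k * a[i - 1::-1][:i]   # update coefficients; sets a[i] = k
        err *= 1.0 - k * k
    return a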
Formant
perceptual weighting: The unquantised LPC coefficients
are used to construct the short-term (formant) perceptual weighting filter,
which is used to filter the entire frame and obtain the perceptually
weighted speech signal [g7231].
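One common way to realise such a weighting filter is by bandwidth expansion of the LPC polynomial, i.e. filtering through A(z/g1)/A(z/g2), as sketched below. The weighting factors g1 = 0.9 and g2 = 0.5 are assumed values for the example rather than figures quoted from the standard.

import numpy as np
from scipy.signal import lfilter

def perceptual_weighting(frame, lpc_coeffs, g1=0.9, g2=0.5):
    # Filter the frame through W(z) = A(z/g1) / A(z/g2).
    k = np.arange(len(lpc_coeffs))
    num = lpc_coeffs * g1 ** k   # numerator A(z/g1): a_k -> a_k * g1^k
    den = lpc_coeffs * g2 ** k   # denominator A(z/g2)
    return lfilter(num, den, frame)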
Pitch
estimator: For every two
sub frames (120 samples), the open loop pitch period is computed
from the perceptually weighted speech signal, with the pitch period searched
in the range from 18 to 142 samples. From this point on, the speech
is processed on a 60-samples-per-subframe basis.
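The sketch below shows a simplified open loop pitch search over one 120-sample block of the weighted speech, restricted to lags between 18 and 142 samples. It uses a plain normalised-correlation criterion; the exact criterion and tie-breaking rules of the reference encoder are more involved.

import numpy as np

MIN_LAG, MAX_LAG = 18, 142

def open_loop_pitch(weighted, start, block=120):
    # 'start' must be at least MAX_LAG so that past samples are available.
    seg = weighted[start:start + block]
    best_lag, best_score = MIN_LAG, -np.inf
    for lag in range(MIN_LAG, MAX_LAG + 1):
        past = weighted[start - lag:start - lag + block]
        energy = np.dot(past, past)
        if energy <= 0.0:
            continue
        score = np.dot(seg, past) ** 2 / energy   # normalised correlation
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag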
Harmonic
noise-shaping: Using the estimated pitch period computed
previously, a harmonic noise-shaping filter is constructed.
Impulse
response calculator: The
combination of the LPC synthesis filter, the formant perceptual
weighting filter, and the harmonic noise-shaping filter is used
to create an impulse response, which is then used
for further computations.
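One straightforward way to obtain this combined impulse response is to pass a unit impulse through the cascade of the three filters, as sketched below. The filter coefficient arguments are placeholders standing in for whatever the preceding stages produced, and the harmonic noise-shaping filter is treated as a simple FIR filter here.

import numpy as np
from scipy.signal import lfilter

def combined_impulse_response(lpc_q, w_num, w_den, hns_num, length=60):
    # Unit impulse of one sub frame length.
    delta = np.zeros(length)
    delta[0] = 1.0
    h = lfilter([1.0], lpc_q, delta)   # LPC synthesis filter 1 / A(z)
    h = lfilter(w_num, w_den, h)       # formant perceptual weighting filter
    h = lfilter(hns_num, [1.0], h)     # harmonic noise-shaping filter (FIR)
    return h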
Pitch
predictor: Using the open loop
pitch estimate and the impulse response, a closed loop
pitch predictor is computed. A fifth order pitch predictor is
used, and the pitch period is computed as a small differential value
around the open loop pitch estimate. The contribution of the pitch predictor
is then subtracted from the initial target vector. Both the pitch
period and the differential values are transmitted to the decoder.
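The following heavily simplified sketch illustrates the idea of the closed loop search with a single-tap predictor: candidate lags in a small window around the open loop estimate are tried, the past excitation is filtered through the impulse response, and the lag and gain that best match the target vector are kept. The actual coder uses a fifth order (five-tap) predictor with quantised gain vectors, which is not reproduced here.

import numpy as np

def closed_loop_pitch(target, past_exc, h, open_loop_lag, window=2):
    # Return the (lag, gain) pair minimising the weighted error for one sub frame.
    n = len(target)
    best_lag, best_gain, best_score = open_loop_lag, 0.0, -np.inf
    for lag in range(open_loop_lag - window, open_loop_lag + window + 1):
        # Adaptive-codebook vector: repeat the last 'lag' excitation samples,
        # then filter it through the impulse response h.
        vec = np.tile(past_exc[-lag:], n // lag + 1)[:n]
        pred = np.convolve(vec, h)[:n]
        energy = np.dot(pred, pred)
        if energy <= 0.0:
            continue
        corr = np.dot(target, pred)
        score = corr * corr / energy       # error reduction at the optimal gain
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain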
Encoding:
Finally, the non-periodic component of the excitation is
approximated. Multi-pulse Maximum Likelihood Quantisation (MP-MLQ)
excitation is used for the 6.3 Kbps rate and Algebraic-Code-Excited
Linear-Prediction (ACELP) is used for the 5.3 Kbps rate [24].
These two rates can be switched dynamically [20].
Further
compression: Voice Activity Detection (VAD) and Comfort Noise
Generation (CNG) can optionally be used to compress speech by an additional
25-30% below the 6.3 Kbps and 5.3 Kbps rates for variable rate
operation. The VAD code, which compresses out the silent portions
between words, yields an effective bit rate as low as about
3.7 Kbps (a 35:1 compression ratio) [22].
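The sketch below conveys only the basic idea behind VAD: frames whose energy stays below a noise-related threshold are treated as silence and need not be coded as speech, with the decoder filling the gap using comfort noise. The threshold margin and noise-floor update factor are assumptions for the example; the actual VAD and CNG algorithms are considerably more elaborate.

import numpy as np

def is_active(frame, noise_floor, margin=2.0):
    # Crude energy-based speech/silence decision for one 240-sample frame.
    energy = np.mean(np.asarray(frame, dtype=float) ** 2)
    return energy > margin * noise_floor

def update_noise_floor(noise_floor, frame, alpha=0.95):
    # Slowly track the background noise level during silent frames.
    energy = np.mean(np.asarray(frame, dtype=float) ** 2)
    return alpha * noise_floor + (1.0 - alpha) * energy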
- LSP (Line Spectral Pair)
quantiser: The LPC coefficients are converted to
LSP coefficients. The LSP vector is then split into sub-vectors, which are
quantised using a predictive split vector quantiser. [48]
- LSP decoder: This is
the inverse quantisation of the LSP parameters. First, the sub-vectors
are decoded to form a tenth order vector. The predicted vector
and the DC vector are added to the decoded vector to form the decoded
LSP vector. A stability check is performed on the decoded LSP
vector to ensure that it is ordered. If
the stability condition is not met, the previous LSP vector
is used instead. [48]
- LSP interpolation: Linear
interpolation is performed between the decoded LSP vector and
the previous LSP vector for each subframe. The four interpolated
LSP vectors are converted to LPC vectors, and the resulting quantised LPC
synthesis filter is used to generate the decoded speech signal.
A brief sketch of the stability check and interpolation is given after this list. [48]
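The sketch below illustrates the stability check and the per-subframe interpolation just described. The minimum spacing used in the ordering check and the interpolation weights are assumptions for the example; the standard fixes its own values.

import numpy as np

MIN_GAP = 0.008   # assumed minimum spacing between adjacent LSP values

def stabilise(decoded_lsp, previous_lsp):
    # Accept the decoded vector only if its ten values are properly ordered;
    # otherwise fall back to the previous frame's LSP vector.
    if np.all(np.diff(decoded_lsp) > MIN_GAP):
        return decoded_lsp
    return previous_lsp

def interpolate(previous_lsp, current_lsp, n_subframes=4):
    # Linear interpolation, giving one LSP vector per sub frame.
    weights = [(i + 1) / n_subframes for i in range(n_subframes)]
    return [(1.0 - w) * previous_lsp + w * current_lsp for w in weights]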
Time
delay: With this codec,
the total algorithmic delay is 37.5 ms: 30 ms for the frame
size plus a 7.5 ms look-ahead. In addition
to this delay, there are processing delays in the implementation,
transmission delays in the communication link and buffering delays
of the multiplexing protocol [21].
Testing:
Extensive tests were performed before TrueSpeech 6.3/5.3
was selected by the ITU as the G.723.1 standard. The tests covered
a variety of conditions, such as background noise,
speakers of different languages, male and female talkers,
multiple people talking simultaneously, etc. Handling lost packets,
also known as frame erasure, was also tested to determine the
algorithm's robustness; this is especially important for
high quality voice communications in an Internet environment.
TrueSpeech 6.3/5.3's outstanding performance led to its
selection by the ITU as Recommendation G.723.1 [20].
G.723.1 is an international
standard currently in force for voice on the Internet, offering a
high compression ratio and very good audio quality. It is used for
POTS-based communication as defined in the H.324 protocol
stack, and for LAN and TCP/IP video conferencing as defined in H.323 [20].