G.723.1, also called
TrueSpeech 6.3/5.3, is a member of the TrueSpeech family of speech
compression algorithms from DSP Group, Inc., and provides digital
voice compression ratios of 20:1 and 24:1 respectively (6.3 Kbps
and 5.3 Kbps). It is the highest quality compression algorithm
DSP Group offers. It has been adopted by the International Telecommunication
Union (ITU) as ITU-T Recommendation G.723.1, which specifies
a coded representation for compressing the speech or other
audio signal component of multimedia services at a very low bit
rate over public telephone (POTS) networks as part of the H.324
family. It is also used in the ITU H.323 audio and video standard
as the recommended low bit rate speech technology [22].
G.723 or G.723.1
People sometimes
refer to G.723.1 simply as G.723. An earlier coder designated
G.723 did exist, but it was later folded into G.726. The
ITU named the currently adopted coder G.723.1
in order to avoid confusion with that earlier recommendation. Thus, there is no real distinction
between G.723.1 and G.723 when referring to the currently adopted
G.723.1 standard [23].
Bit
rates supported
The G.723.1 coder
has two bit rates associated with it: 5.3 Kbps, using
the ACELP algorithm, which provides good quality and additional flexibility
to the system designer, and 6.3 Kbps, using the MP-MLQ algorithm, which provides better quality.
Both rates are mandatory parts of the encoder and decoder. This
codec enables voice communications over the Internet, and other
audio compression applications, with quality similar to that of a regular telephone call [23].
Encoder/Decoder
As the G.723.1
coder is designed to operate on a digital signal, the analogue
signal is first converted to digital samples suitable for the encoder,
as shown in the block diagram below.
The processes used
after decoding are similar and convert the digital samples back
to an analogue signal.

Figure 2-27 G.723.1
codec block diagram [19]
The codec uses
linear prediction analysis-by-synthesis
coding to minimise a perceptually weighted error signal.
As shown in figure
2-27, the key processes used in the codec algorithm are described
below.
Framer:
The samples are grouped into frames or blocks. Each frame contains 240
samples, which corresponds to 30 ms at the 8 kHz sampling rate.
High-pass
filter: Each frame is high-pass filtered to remove
the DC component.
Sub frames:
Each frame is then divided into four sub frames of 60
samples each.
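As an illustration of these first three steps, the short Python sketch below groups an 8 kHz sample stream into 240-sample frames, applies a simple first-order DC-removal high-pass filter, and splits each frame into four 60-sample sub frames. The pole value of the filter and the function names are assumptions made for the example; they are not taken from the G.723.1 reference implementation.

import numpy as np

FRAME_SIZE = 240      # 30 ms at the 8 kHz sampling rate
SUBFRAME_SIZE = 60    # four sub frames per frame

def frames(signal):
    # Yield consecutive, non-overlapping 240-sample frames.
    for start in range(0, len(signal) - FRAME_SIZE + 1, FRAME_SIZE):
        yield signal[start:start + FRAME_SIZE]

def remove_dc(frame, pole=0.99):
    # Illustrative first-order high-pass filter that strips the DC component.
    out = np.empty(FRAME_SIZE, dtype=float)
    prev_x = prev_y = 0.0
    for i, x in enumerate(frame):
        prev_y = x - prev_x + pole * prev_y
        prev_x = x
        out[i] = prev_y
    return out

def subframes(frame):
    # Split one 240-sample frame into four 60-sample sub frames.
    return [frame[i:i + SUBFRAME_SIZE] for i in range(0, FRAME_SIZE, SUBFRAME_SIZE)]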
LPC analysis:
Using the unprocessed input signal, a 10th-order Linear
Prediction Coding (LPC) filter is computed for every sub frame.
The LPC filter for the last sub frame is quantised using a Predictive
Split Vector Quantiser (PSVQ).
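A rough sketch of how such a 10th-order LPC analysis can be performed per sub frame is shown below, using the autocorrelation method with a Levinson-Durbin recursion. The Hamming analysis window is an assumption for the example; the windowing details of the actual G.723.1 routine and the PSVQ quantisation step are not reproduced here.

import numpy as np

LPC_ORDER = 10

def lpc(subframe, order=LPC_ORDER):
    # Return prediction coefficients a[0..order] (with a[0] == 1) for one sub frame.
    x = np.hamming(len(subframe)) * np.asarray(subframe, dtype=float)
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0], err = 1.0, r[0]
    for i in range(1, order + 1):            # Levinson-Durbin recursion
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] += k * a[i - 1::-1][:i]   # update coefficients; sets a[i] = k
        err *= 1.0 - k * k
    return a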
Formant
perceptual weighting: The unquantised LPC coefficients
are used to construct the short-term (formant) perceptual weighting filter,
which is used to filter the entire frame and obtain the perceptually
weighted speech signal [g7231].
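One common way to realise such a weighting filter is by bandwidth expansion of the LPC polynomial, i.e. filtering through A(z/g1)/A(z/g2), as sketched below. The weighting factors g1 = 0.9 and g2 = 0.5 are assumed values for the example rather than figures quoted from the standard.

import numpy as np
from scipy.signal import lfilter

def perceptual_weighting(frame, lpc_coeffs, g1=0.9, g2=0.5):
    # Filter the frame through W(z) = A(z/g1) / A(z/g2).
    k = np.arange(len(lpc_coeffs))
    num = lpc_coeffs * g1 ** k   # numerator A(z/g1): a_k -> a_k * g1^k
    den = lpc_coeffs * g2 ** k   # denominator A(z/g2)
    return lfilter(num, den, frame)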
Pitch
estimator: For every two
sub frames (120 samples), the open loop pitch period is computed
from the perceptually weighted speech signal, with the pitch period searched
in the range from 18 to 142 samples. From this point on, the speech
is processed on a 60-samples-per-subframe basis.
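The sketch below shows a simplified open loop pitch search over one 120-sample block of the weighted speech, restricted to lags between 18 and 142 samples. It uses a plain normalised-correlation criterion; the exact criterion and tie-breaking rules of the reference encoder are more involved.

import numpy as np

MIN_LAG, MAX_LAG = 18, 142

def open_loop_pitch(weighted, start, block=120):
    # 'start' must be at least MAX_LAG so that past samples are available.
    seg = weighted[start:start + block]
    best_lag, best_score = MIN_LAG, -np.inf
    for lag in range(MIN_LAG, MAX_LAG + 1):
        past = weighted[start - lag:start - lag + block]
        energy = np.dot(past, past)
        if energy <= 0.0:
            continue
        score = np.dot(seg, past) ** 2 / energy   # normalised correlation
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag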
Harmonic
noise-shaping: Using the estimated pitch period computed
previously, a harmonic noise-shaping filter is constructed.
Impulse
response calculator: The
combination of the LPC synthesis filter, the formant perceptual
weighting filter, and the harmonic noise-shaping filter is used
to create an impulse response, which is then used
for further computations.
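One straightforward way to obtain this combined impulse response is to pass a unit impulse through the cascade of the three filters, as sketched below. The filter coefficient arguments are placeholders standing in for whatever the preceding stages produced, and the harmonic noise-shaping filter is treated as a simple FIR filter here.

import numpy as np
from scipy.signal import lfilter

def combined_impulse_response(lpc_q, w_num, w_den, hns_num, length=60):
    # Unit impulse of one sub frame length.
    delta = np.zeros(length)
    delta[0] = 1.0
    h = lfilter([1.0], lpc_q, delta)   # LPC synthesis filter 1 / A(z)
    h = lfilter(w_num, w_den, h)       # formant perceptual weighting filter
    h = lfilter(hns_num, [1.0], h)     # harmonic noise-shaping filter (FIR)
    return h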
Pitch
predictor: Using the open loop
pitch estimate and the impulse response, a closed loop
pitch predictor is computed. A fifth order pitch predictor is
used, and the pitch period is computed as a small differential value
around the open loop pitch estimate. The contribution of the pitch predictor
is then subtracted from the initial target vector. Both the pitch
period and the differential values are transmitted to the decoder.
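The following heavily simplified sketch illustrates the idea of the closed loop search with a single-tap predictor: candidate lags in a small window around the open loop estimate are tried, the past excitation is filtered through the impulse response, and the lag and gain that best match the target vector are kept. The actual coder uses a fifth order (five-tap) predictor with quantised gain vectors, which is not reproduced here.

import numpy as np

def closed_loop_pitch(target, past_exc, h, open_loop_lag, window=2):
    # Return the (lag, gain) pair minimising the weighted error for one sub frame.
    n = len(target)
    best_lag, best_gain, best_score = open_loop_lag, 0.0, -np.inf
    for lag in range(open_loop_lag - window, open_loop_lag + window + 1):
        # Adaptive-codebook vector: repeat the last 'lag' excitation samples,
        # then filter it through the impulse response h.
        vec = np.tile(past_exc[-lag:], n // lag + 1)[:n]
        pred = np.convolve(vec, h)[:n]
        energy = np.dot(pred, pred)
        if energy <= 0.0:
            continue
        corr = np.dot(target, pred)
        score = corr * corr / energy       # error reduction at the optimal gain
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / energy, score
    return best_lag, best_gain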
Encoding:
Finally, the non-periodic component of the excitation is
approximated. Multi-pulse Maximum Likelihood Quantisation (MP-MLQ)
excitation is used for the 6.3 Kbps rate and Algebraic-Code-Excited
Linear-Prediction (ACELP) is used for the 5.3 Kbps rate [24].
These two rates can be switched dynamically [20].
Further
compression: Voice Activity Detection (VAD) and Comfort Noise
Generation (CNG) can optionally be used to compress speech by an additional
25-30% below the 6.3 Kbps and 5.3 Kbps rates for variable rate
operation. The VAD code, which compresses out the silent portions
between words, yields an effective bit rate as low as about
3.7 Kbps (a 35:1 compression ratio) [22].
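The sketch below conveys only the basic idea behind VAD: frames whose energy stays below a noise-related threshold are treated as silence and need not be coded as speech, with the decoder filling the gap using comfort noise. The threshold margin and noise-floor update factor are assumptions for the example; the actual VAD and CNG algorithms are considerably more elaborate.

import numpy as np

def is_active(frame, noise_floor, margin=2.0):
    # Crude energy-based speech/silence decision for one 240-sample frame.
    energy = np.mean(np.asarray(frame, dtype=float) ** 2)
    return energy > margin * noise_floor

def update_noise_floor(noise_floor, frame, alpha=0.95):
    # Slowly track the background noise level during silent frames.
    energy = np.mean(np.asarray(frame, dtype=float) ** 2)
    return alpha * noise_floor + (1.0 - alpha) * energy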
- LSP (Line Spectral Pair)
quantiser: The LPC coefficients are converted to
LSP coefficients. The LSP vector is then split into sub-vectors, which are
quantised using a predictive split vector quantiser. [48]
- LSP decoder: This is
the inverse quantisation of the LSP parameters. First, the sub-vectors
are decoded to form a tenth order vector. The predicted vector
and the DC vector are added to the decoded vector to form the decoded
LSP vector. A stability check is performed on the decoded LSP
vector to ensure that it is ordered. If
the stability condition is not met, the previous LSP vector
is used instead. [48]
- LSP interpolation: Linear
interpolation is performed between the decoded LSP vector and
the previous LSP vector for each subframe. The four interpolated
LSP vectors are converted to LPC vectors, and the resulting quantised LPC
synthesis filter is used to generate the decoded speech signal.
A brief sketch of the stability check and interpolation is given after this list. [48]
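The sketch below illustrates the stability check and the per-subframe interpolation just described. The minimum spacing used in the ordering check and the interpolation weights are assumptions for the example; the standard fixes its own values.

import numpy as np

MIN_GAP = 0.008   # assumed minimum spacing between adjacent LSP values

def stabilise(decoded_lsp, previous_lsp):
    # Accept the decoded vector only if its ten values are properly ordered;
    # otherwise fall back to the previous frame's LSP vector.
    if np.all(np.diff(decoded_lsp) > MIN_GAP):
        return decoded_lsp
    return previous_lsp

def interpolate(previous_lsp, current_lsp, n_subframes=4):
    # Linear interpolation, giving one LSP vector per sub frame.
    weights = [(i + 1) / n_subframes for i in range(n_subframes)]
    return [(1.0 - w) * previous_lsp + w * current_lsp for w in weights]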
Time
delay: With this codec,
the total algorithmic delay is 37.5 ms: 30 ms for the frame
size plus a 7.5 ms look-ahead. In addition
to this delay, there are processing delays in the implementation,
transmission delays in the communication link and buffering delays
of the multiplexing protocol [21].
Testing:
Extensive tests were performed before TrueSpeech 6.3/5.3
was selected by the ITU as the G.723.1 standard. The tests covered
a variety of conditions, such as background noise,
speakers of different languages, male and female talkers,
multiple people talking simultaneously, etc. Handling lost packets,
also known as frame erasure, was also tested to determine the
algorithm's robustness; this is especially important for
high quality voice communications in an Internet environment.
TrueSpeech 6.3/5.3's outstanding performance led to its
selection by the ITU as Recommendation G.723.1 [20].
G.723.1 is an international
standard currently in force for voice on the Internet, offering a
high compression ratio and very good audio quality. It is used for
POTS-based communication as defined in the H.324 protocol
stack, and for LAN and TCP/IP video conferencing as defined in H.323 [20].