Yong Qiu Liu's Web Page--MPEG Audio

Introduction of multimedia codecs

4.6 MPEG Audio

About MPEG Audio

The MPEG format has several phases: MPEG-1, MPEG-2, MPEG-4 and MPEG-7. They are distinct, coexisting standards that handle different aspects of multimedia communication; the later phases do NOT replace the earlier ones but complement them. In addition, MPEG-1 and MPEG-2 each define three layers that represent a family of coding algorithms: Layer I, Layer II and Layer III. The term "version" is used in the context of MPEG-4. Version 1 provides a set of tools for audio coding. Version 2 adds new tools for additional functionality; it does not replace Version 1 but is fully backward compatible with it.^[81]

Functionality

MPEG-1 supports mono, stereo and dual mono sounds at 32, 44.1, and 48 kHz sampling rate. The predefined bit rate is from 32 to 448 Kbps for Layer I, from 32 to 384 Kbps for Layer II and from 32 to 320 Kbps for Layer III.

In particular, at bit rates from 64 kb/s up to 192 kb/s per channel, MPEG Audio Layer II can provide a sound quality that is competitive with any perceptual coding scheme at the same bit rate. It also provides a compatible multi-channel solution: the term "compatible" implies that MPEG-2 multi-channel streams can replace any MPEG-1 stereo stream in, e.g., DVD and DVB systems while preserving compatibility with existing stereo decoders. Existing MPEG-1 decoders are able to decode a stereo down-mix from the multi-channel stream they receive.^[82]

MPEG-2, or MPEG-2 BC (backwards compatible), is a backwards-compatible multi-channel extension to MPEG-1. It supports up to 5 main channels and a low frequency enhancement (LFE) channel, and its bit rate is extended up to 1 Mbps. It also extends MPEG-1 to the lower sampling rates of 16, 22.05, and 24 kHz, for bit rates from 32 to 256 Kbps (Layer I) and from 8 to 160 Kbps (Layer II and Layer III).

MPEG-2 AAC (Advanced Audio Coding) or MPEG NBC (Non-Backward Compatible audio) provides a very high-quality audio coding standard for 1 to 48 channels at sampling rates of 8 to 96 kHz, with multi-channel, multi-lingual, and multi-program capabilities. AAC works at bit rates from 8 Kbps for a monophonic speech signal up to in excess of 160 Kbps/channel for very-high-quality coding that permits multiple encode/decode cycles. Three profiles of AAC provide varying levels of complexity and scalability. ^[81]

MPEG-4 supports coding and composition of natural and synthetic audio objects; scalability of the bit rate of an audio bit stream; scalability of encoder or decoder complexity; Structured Audio, a universal language for score-driven sound synthesis; and TTSI, an interface for text-to-speech conversion systems.

MPEG-7 will provide standardised descriptions and description schemes of audio structures and sound content, together with a language to specify such descriptions and description schemes.^[81]

The compression factor of MPEG formats ranges from 2.7 to 24. With a compression ratio of 6:1 (16-bit stereo sampled at 48 kHz is reduced to a 256 Kbps data stream) and under optimal listening conditions, expert listeners could not distinguish between coded and original audio clips.^[83]
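The 6:1 figure quoted above can be checked with a little arithmetic. The following is a minimal sketch; the function names are chosen for this illustration only, and 1 Kbps is assumed to mean 1000 bits per second.

```python
def pcm_bit_rate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Bit rate of uncompressed PCM audio in Kbps (assuming 1 Kbps = 1000 bit/s)."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def compression_ratio(pcm_kbps, coded_kbps):
    """Ratio of the uncompressed bit rate to the coded bit rate."""
    return pcm_kbps / coded_kbps


# 16-bit stereo sampled at 48 kHz, coded into a 256 Kbps stream
pcm = pcm_bit_rate_kbps(48_000, 16, 2)
ratio = compression_ratio(pcm, 256)
print(pcm, ratio)  # 1536.0 6.0
```

The uncompressed stream carries 1536 Kbps, so a 256 Kbps coded stream corresponds to the 6:1 ratio stated in the text.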

Encoding

SBC encode: Sub-Band Coding (SBC) is a very popular and efficient audio coding method that can encode any audio signal from any source. It is used for music recordings, movie soundtracks, etc. MPEG Audio is an example of SBC.

When a lot of signal energy is present at one frequency, the human ear cannot hear a lower-energy signal at nearby frequencies. The louder frequency masks the softer frequencies and is called the masker. SBC uses this phenomenon to save signal bandwidth by discarding information about frequencies that are masked. This is lossy encoding, but if the computation is done correctly, the human ear cannot hear the difference.^[84]

The encoding procedure is as follows (Figure 2-28):^[83]^[84]

A time-frequency mapping (a filter bank, an FFT, or something else) divides the input 48 kHz PCM signal into 32 sub-bands that approximate the critical bands.

The psycho-acoustic model examines both the sub-bands and the original signal to determine masking thresholds using psycho-acoustic information.

Each sample of the subband is quantised and encoded using these masking thresholds so as to keep the quantisation noise below the masking threshold.

If the power in a subband is below the masking threshold, the subband is simply ignored. Otherwise, enough bits are allocated to the coefficient that the noise introduced by quantisation stays below the masking threshold.

Finally, all the quantised samples are assembled into frames to form the bitstream.

Figure 2-28 SBC encode^[84]
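The bit-allocation step of the procedure above can be sketched as follows. This is an illustration under stated assumptions, not the allocation loop of a real MPEG encoder: it assumes that each bit of a uniform quantiser buys roughly 6.02 dB of signal-to-noise ratio, it ignores the limit on the total number of bits available per frame, and the names `allocate_bits`, `signal_db` and `mask_db` are hypothetical.

```python
import math

# Assumption: a uniform quantiser gains about 6.02 dB of SNR per bit.
DB_PER_BIT = 6.02


def allocate_bits(signal_db, mask_db, max_bits=15):
    """Return the number of bits per sample for each subband.

    signal_db[i] is the signal level in subband i (dB),
    mask_db[i] is the masking threshold in subband i (dB).
    """
    bits = []
    for level, threshold in zip(signal_db, mask_db):
        smr = level - threshold  # signal-to-mask ratio in dB
        if smr <= 0:
            bits.append(0)  # fully masked: the subband is ignored
        else:
            # enough bits to push the quantisation noise below the threshold
            bits.append(min(max_bits, math.ceil(smr / DB_PER_BIT)))
    return bits


signal = [60.0, 35.0, 50.0, 10.0]
mask = [40.0, 40.0, 44.0, 12.0]
allocation = allocate_bits(signal, mask)
print(allocation)  # [4, 0, 1, 0]
```

Subbands whose level lies below the masking threshold receive no bits at all, which is where most of the saving comes from.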

For decoding, the frames are unpacked and subband samples are decoded. Then, a frequency-time mapping translates them back into a PCM signal (Figure 2-29).

Figure 2-29 SBC decode^[84]

MPEG Layers: MPEG defines three layers for audio. Their basic model is the same, but the codec complexity increases with each layer. Each one is a self-contained SBC coder with its own time-frequency mapping, psycho-acoustic model, and quantiser. Layer 1 is the simplest, but gives the poorest compression. Layer 3 is the most complicated and difficult to compute, but gives the best compression.^[84]

Figure 2-30 Grouping of subband samples for Layer 1, 2, and 3^[83]

All three layers share the same basic model, and their frames are built from groups of 384 samples (12 samples × 32 filtered sub-bands, Figure 2-30): a Layer 1 frame holds one such group, while Layer 2 and Layer 3 frames hold three (1152 samples). As already mentioned, the codec’s complexity increases with each layer.^[83]

Layer 1: Its time-frequency mapping is a DCT-type filter with one frame and equal frequency spread per band. The psycho-acoustic model uses only frequency masking. The quantiser/encoder quantises the maximum absolute value of the samples to 6 bits (the scale factor) and determines the bit allocation for each subband, then linearly quantises the samples to the bit allocation for that subband.
Layer 2: It uses three frames in the filter (previous, current and next, a total of 1152 samples), so the model uses a little temporal masking. The psycho-acoustic model is similar to the Layer 1 model, but it uses a 1024-point FFT for greater frequency resolution. The quantiser/encoder is similar to that used in Layer 1, but frames are three times as long as Layer 1’s.
Layer 3: It uses both poly-phase and discrete-cosine-transform filter banks and a polynomial prediction psycho-acoustic model. The sophisticated quantisation and encoding schemes allow variable-length frames. The frame packer includes temporal masking effects, takes stereo redundancy into account, and uses a Huffman coder. Thanks to this sophisticated encoding system, Layer 3 creates high-quality output at bit rates as low as 64 Kbps.
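The Layer 1 quantiser described above (one scale factor per block of subband samples, followed by linear quantisation to the allocated number of bits) can be sketched roughly as below. Several simplifications are assumed: the scale factor is kept as a plain float rather than being coded as a 6-bit value, the quantiser is a symmetric uniform one with 2^bits - 1 levels, and the function names are hypothetical.

```python
def quantise_block(samples, bits):
    """Quantise one block of subband samples with a shared scale factor.

    Returns (scale, codes). The codes are integers in [-half, +half],
    where half = (2**bits - 1) // 2.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits per sample")
    scale = max(abs(x) for x in samples)  # maximum absolute value in the block
    if scale == 0.0:
        return 0.0, [0] * len(samples)
    half = (2 ** bits - 1) // 2
    codes = [round(x / scale * half) for x in samples]
    return scale, codes


def dequantise_block(scale, codes, bits):
    """Map the integer codes back to sample values."""
    half = (2 ** bits - 1) // 2
    return [c / half * scale for c in codes]


block = [0.4, -1.0, 0.2, 0.0]
scale, codes = quantise_block(block, bits=3)
restored = dequantise_block(scale, codes, bits=3)
print(scale, codes)  # 1.0 [1, -3, 1, 0]
```

With 3 bits the quantiser has 7 levels, so the reconstruction error of each sample is at most scale / 6; more bits shrink that error, which is what the bit allocation controls.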

Effectiveness: Comparing the three layers, Layer 3 has the lowest bit rate and the highest compression ratio, Layer 2 has a higher bit rate and a lower compression ratio, and Layer 1 has the highest bit rate and the lowest compression ratio. That is why MP3 (MPEG Layer 3) music is so popular on the web.

Because encoder complexity increases with each layer, so does the theoretical delay: Layer 1 has the shortest delay, followed by Layer 2 and then Layer 3. For real-time communication, Layer 1 or Layer 2 is therefore probably more suitable.

Because this encoding is lossy, the quality of the sound is affected by the encoding bit rate. Experiments have shown that at 64 Kbps, Layer 3 provides good reproduction quality but Layer 2 has some annoying interference. At 128 Kbps, both Layer 3 and Layer 2 provide excellent results.^[83]

The effectiveness of each layer is listed in Table 2-6.

Table 2-6 Effectiveness of MPEG audio^[83]

Layer	Target bit rate	Ratio	Quality @ 64 Kbps*	Quality @ 128 Kbps*	Theoretical delay**
Layer 1	192 Kbps	4:1			19 ms
Layer 2	128 Kbps	6:1	2.1 to 2.6	4+	35 ms
Layer 3	64 Kbps	12:1	3.6 to 3.8	4+	59 ms
* 5 = perfect, 4 = just noticeable, 3 = slightly annoying, 2 = annoying, 1 = very annoying
** Real delay is about 3 times theoretical delay

MPEG2: MPEG-2 audio became an international standard in November 1994. It extends the MPEG-1 standard in the following ways:^[86]

Multi-channel audio support: The enhanced standard supports up to 5 high fidelity audio channels, plus a low frequency enhancement channel (5.1 channels), applicable for the compression of audio for HDTV (High Definition Television) or digital movies.

Multilingual audio support: It supports up to 7 additional commentary channels.

Lower bit rates: It supports additional lower bit rates down to 8 Kbps.

Lower sampling rates: Besides 32, 44.1, and 48 kHz, it also accommodates 16, 22.05, and 24 kHz sampling rates.

MPEG-2 audio is compatible with MPEG-1 audio: MPEG-2 audio decoders can decode MPEG-1 audio streams, and MPEG-1 decoders can decode the two main channels of an MPEG-2 audio bitstream. This backward compatibility is achieved by combining suitably weighted versions of each of the up to 5.1 channels into "down-mixed" left and right channels. These two channels fit into the audio data framework of an MPEG-1 audio bitstream.
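The down-mix idea can be illustrated with a short sketch. The text only says the channels are "suitably weighted"; the weight of 1/sqrt(2) for the centre and surround channels used here is an assumption made for the example, the LFE channel is left out for simplicity, and the function name is hypothetical.

```python
import math


def downmix_to_stereo(left, right, centre, left_sur, right_sur,
                      weight=1 / math.sqrt(2)):
    """Combine five main channels into a compatible stereo pair.

    Each argument is a list of samples of equal length. The centre and
    surround channels are scaled by `weight` (an assumed value) before
    being added to the left and right channels.
    """
    left_out = [l + weight * c + weight * ls
                for l, c, ls in zip(left, centre, left_sur)]
    right_out = [r + weight * c + weight * rs
                 for r, c, rs in zip(right, centre, right_sur)]
    return left_out, right_out


# One-sample example with a weight of 0.5 to keep the numbers easy to follow
lo, ro = downmix_to_stereo([1.0], [0.0], [0.5], [0.2], [0.4], weight=0.5)
print(lo, ro)  # approximately [1.35] [0.45]
```

An MPEG-1 decoder that reads only these two channels still plays a sensible stereo signal, while a multi-channel decoder can use the additional data in the stream to recover the individual channels.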

MPEG2 AAC (Advanced Audio Coding): MPEG-2 AAC, or NBC (Non-Backward Compatible audio), is the logical continuation of the coding method of MPEG Audio Layer 3. It offers high coding gain with great flexibility. It is designed to accommodate future developments in the audio sector, with sampling frequencies between 8 kHz and 96 kHz and any number of channels between 1 and 48. Compared to MPEG-2 Layer 2, it can halve the bit rate without loss of subjective quality.

MPEG-2 AAC also offers a better compression ratio than Layer 3. Formal MPEG listening tests have demonstrated that it provides slightly better audio quality at 96 kb/s than Layer 3 at 128 kb/s or Layer 2 at 192 kb/s. The crucial differences between MPEG-2 AAC and its predecessor ISO/MPEG-1 Layer 3 are as follows:^[85]

Filter bank: MPEG-2 AAC uses a plain Modified Discrete Cosine Transform (MDCT). Together with the increased window length (2048 lines per transformation), the MDCT outperforms the filter banks of previous coding methods.

Temporal Noise Shaping (TNS): It shapes the distribution of quantisation noise in time by prediction in the frequency domain. Voice signals in particular experience considerable improvement through TNS.

Prediction: It exploits the fact that certain types of audio signal are easy to predict.

Quantisation: The given bit rate can be used more efficiently by allowing finer control of quantisation resolution.

Bit-stream format: The information to be transmitted undergoes entropy coding in order to keep redundancy as low as possible.
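To make the filter bank item above more concrete, here is a small sketch of an MDCT with 50% overlapping blocks. It is only an illustration: it evaluates the transform directly (quadratic cost) instead of using a fast algorithm, it uses a tiny block size rather than the window length quoted above, and the sine window is an assumption of this sketch. All function names are hypothetical.

```python
import math


def sine_window(length):
    """Sine window; satisfies w[n]**2 + w[n + length // 2]**2 == 1."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]


def mdct(block):
    """Map a block of 2N samples to N coefficients (direct evaluation)."""
    n_half = len(block) // 2
    return [sum(block[n] * math.cos(math.pi / n_half
                                    * (n + 0.5 + n_half / 2) * (k + 0.5))
                for n in range(2 * n_half))
            for k in range(n_half)]


def imdct(coeffs):
    """Map N coefficients back to 2N time-aliased samples."""
    n_half = len(coeffs)
    return [2.0 / n_half * sum(coeffs[k] * math.cos(math.pi / n_half
                                                   * (n + 0.5 + n_half / 2) * (k + 0.5))
                               for k in range(n_half))
            for n in range(2 * n_half)]


def analyse_synthesise(signal, n_half):
    """Window, transform, inverse transform and overlap-add each block."""
    window = sine_window(2 * n_half)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n_half + 1, n_half):
        block = [signal[start + n] * window[n] for n in range(2 * n_half)]
        rec = imdct(mdct(block))
        for n in range(2 * n_half):
            out[start + n] += rec[n] * window[n]
    return out


signal = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(16)]
restored = analyse_synthesise(signal, n_half=4)
# Samples covered by two overlapping blocks (indices 4 to 11) are recovered
max_error = max(abs(restored[i] - signal[i]) for i in range(4, 12))
print(max_error < 1e-9)  # True
```

Each block of 2N samples yields only N coefficients, yet the overlap-add of neighbouring blocks cancels the aliasing and restores the signal, so the overlapping windows do not increase the amount of data to be coded.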

Last update April 9, 2002