
Sound codecs and file formats

There are many different ways of representing sound data. Some of these involve compressing the data, which may or may not lose information. Data may be stored in the file system or transmitted across the network, and this raises additional issues. This chapter considers the major sound codecs and container formats.

Overview

Audio and video data need to be represented in digital format to be used by a computer. They contain an enormous amount of information, so digital representations of this data can occupy huge amounts of space. Consequently, computer scientists have developed many different ways of representing this information, sometimes in ways that preserve all of the information (lossless) and sometimes in ways that lose information (lossy).

Each way of representing the information digitally is known as a codec. The simplest way, described in the next section, is to represent it as "raw" pulse-code modulated data (PCM). Hardware devices such as sound cards can deal with PCM data directly, but PCM data can use a lot of space.

Most codecs attempt to reduce the space requirements of PCM data by encoding it into another form, called encoded data. This can then be decoded back to PCM form when required. Depending on the codec algorithms, the regenerated PCM may have the same information content as the original PCM data (lossless) or may contain less information (lossy).
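The difference between lossless and lossy encoding can be seen in a small Python sketch. It is only an illustration under stated assumptions: the sample values are made up, zlib is a general-purpose compressor standing in for a lossless codec, and the "lossy codec" is a hypothetical one that simply drops the low 8 bits of each 16-bit sample.

```python
import struct
import zlib

# Twelve 16-bit signed PCM samples (made-up values for illustration).
samples = [0, 1200, 2400, 3000, 2400, 1200, 0, -1200, -2400, -3000, -2400, -1200]
pcm = struct.pack("<12h", *samples)           # 24 bytes of raw PCM

# Lossless: zlib (a general-purpose compressor, not an audio codec)
# always gives back exactly the bytes it was given.
restored = zlib.decompress(zlib.compress(pcm))
print(restored == pcm)                        # True

# Hypothetical lossy "codec": keep only the high byte of each sample.
def lossy_encode(values):
    return [v >> 8 for v in values]           # 16-bit -> 8-bit, low byte discarded

def lossy_decode(values):
    return [v << 8 for v in values]           # 8-bit -> 16-bit, low byte now zero

encoded = lossy_encode(samples)
lossy_bytes = struct.pack("<12b", *encoded)   # 12 bytes: half the size
approx = lossy_decode(encoded)
print(approx == samples)                      # False: information was lost
print(approx[:4])                             # [0, 1024, 2304, 2816]
```

Real lossy codecs such as MP3 and Vorbis discard information far more cleverly than this, but the principle is the same: the decoded PCM is close to, but not identical with, the original.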

Encoded audio data may or may not contain information about the properties of the data. This information may be about the original PCM data, such as the number of channels (mono, stereo), the sampling rate and the number of bits per sample. Or it may be information about the encoding process itself, such as the size of data frames. The encoded data, together with this additional information, may be stored in a file, transmitted across the network, and so on. If this is done, the encoded data plus the additional information are combined into a container.

It is important at times to know whether you are dealing with just the encoded data, or with a container that holds this data. For example, files on disk will normally be containers, holding additional information along with the encoded data. But audio data manipulation libraries will typically deal with the encoded data itself, after the additional data has been removed.

PCM

From Wikipedia

Pulse-code modulation (PCM) is a method used to digitally represent sampled analog signals. It is the standard form for digital audio in computers and various Blu-ray, DVD and Compact Disc formats, as well as other uses such as digital telephone systems. A PCM stream is a digital representation of an analog signal, in which the magnitude of the analog signal is sampled regularly at uniform intervals, with each sample being quantized to the nearest value within a range of digital steps.

PCM streams have two basic properties that determine their fidelity to the original analog signal: the sampling rate, which is the number of times per second that samples are taken; and the bit depth, which determines the number of possible digital values that each sample can take.
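As a rough illustration of sampling and quantization, the following Python sketch samples a sine wave at uniform intervals and rounds each sample to the nearest 16-bit signed integer step. The function name sine_pcm and the chosen parameters (a 440 Hz tone at 8000 samples per second for 10 ms) are hypothetical, picked only to keep the numbers small.

```python
import math
import struct

def sine_pcm(freq_hz, sample_rate, duration_s, bit_depth=16, amplitude=0.5):
    """Sample a sine wave at uniform intervals and quantize each sample
    to the nearest signed integer step for the given bit depth."""
    max_level = 2 ** (bit_depth - 1) - 1          # e.g. 32767 for 16 bits
    n_samples = round(sample_rate * duration_s)   # samples per second x seconds
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                       # time of the n-th sample
        value = amplitude * math.sin(2 * math.pi * freq_hz * t)
        samples.append(round(value * max_level))  # quantize to an integer step
    return samples

samples = sine_pcm(440, 8000, 0.01)
# Pack as little-endian signed 16-bit integers: headerless "raw" PCM bytes.
raw = struct.pack("<%dh" % len(samples), *samples)
print(len(samples), len(raw))                     # 80 160
```

Doubling the sampling rate doubles the number of samples per second; increasing the bit depth increases the number of distinct values each sample can take.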

PCM data can be stored in files as "raw" data. In this case there is no header information to say what the sampling rate and bit depth are. Many tools such as sox use the file extension to determine these properties. From man soxformat:

f32 and f64 indicate files encoded as 32 and 64-bit (IEEE single and double precision) floating point PCM respectively; s8, s16, s24, and s32 indicate 8, 16, 24, and 32-bit signed integer PCM respectively; u8, u16, u24, and u32 indicate 8, 16, 24, and 32-bit unsigned integer PCM respectively

It should be noted, though, that the file extension is only a hint about the PCM codec parameters and how the data is stored in the file: nothing in the raw data itself confirms them.
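Why the interpretation matters can be shown by reading the same four bytes of headerless data under different assumptions. This Python sketch uses the standard struct module; the byte values are arbitrary and chosen only to make the differences visible.

```python
import struct

# Four bytes of headerless PCM: nothing says how they should be read.
raw = bytes([0x00, 0x80, 0xFF, 0x7F])

as_s8 = list(struct.unpack("<4b", raw))      # signed 8-bit, four samples
as_u8 = list(struct.unpack("<4B", raw))      # unsigned 8-bit, four samples
as_s16le = list(struct.unpack("<2h", raw))   # signed 16-bit little-endian, two samples
as_s16be = list(struct.unpack(">2h", raw))   # signed 16-bit big-endian, two samples

print(as_s8)     # [0, -128, -1, 127]
print(as_u8)     # [0, 128, 255, 127]
print(as_s16le)  # [-32768, 32767]
print(as_s16be)  # [128, -129]
```

Nothing in the bytes says which reading is correct, which is why a tool has to be told (by the file extension or by explicit options) how to interpret raw PCM.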

WAV

WAV is a container file format that wraps audio data. The audio data is often PCM. The file format is based on RIFF (Resource Interchange File Format). While it is a Microsoft/IBM format, it does not seem to be encumbered by patents.

A good description of the format is given by Topherlee. The WAV file header contains information about the PCM codec and also about how the data is stored (e.g. little- or big-endian).
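To see what the header looks like, the following Python sketch writes a short silent WAV file into memory with the standard-library wave module and then decodes the canonical 44-byte header with struct. It assumes the common little-endian "RIFF" form with a plain 16-byte "fmt " chunk, which is what the wave module writes; files from other tools may carry extra chunks, and the big-endian variant uses "RIFX" instead.

```python
import io
import struct
import wave

# Write 100 frames of 16-bit stereo silence at 44.1 kHz into memory.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)                  # stereo
    w.setsampwidth(2)                  # 2 bytes = 16 bits per sample
    w.setframerate(44100)              # frames per second
    w.writeframes(b"\x00" * 4 * 100)   # 4 bytes per frame (2 channels x 2 bytes)
data = buf.getvalue()

# RIFF chunk descriptor: "RIFF", size of the rest of the file, "WAVE".
riff, riff_size, wave_id = struct.unpack("<4sI4s", data[0:12])
# "fmt " sub-chunk: PCM parameters (all fields little-endian).
(fmt_id, fmt_size, audio_format, channels,
 rate, byte_rate, block_align, bits) = struct.unpack("<4sIHHIIHH", data[12:36])
# "data" sub-chunk: its size, then the PCM samples themselves.
data_id, data_size = struct.unpack("<4sI", data[36:44])

print(riff, wave_id, fmt_id, data_id)       # b'RIFF' b'WAVE' b'fmt ' b'data'
print(audio_format, channels, rate, bits)   # 1 2 44100 16
print(byte_rate, block_align, data_size)    # 176400 4 400

# Reading it back through the library yields only the PCM bytes.
with wave.open(io.BytesIO(data), "rb") as r:
    pcm = r.readframes(r.getnframes())
print(len(pcm))                             # 400
```

The final readframes call also shows the point made earlier about containers: the library returns only the PCM bytes, with the header information already stripped off.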

Because they usually contain uncompressed audio data, WAV files are often huge: a 3-minute song at CD quality (44.1 kHz, 16-bit stereo) takes over 30 Mbytes.
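The size of uncompressed audio follows directly from the PCM parameters: bytes per second = sampling rate x bytes per sample x channels. A small Python helper (hypothetical, for illustration) makes the arithmetic explicit for a 3-minute song at CD quality (44.1 kHz, 16-bit, stereo).

```python
def pcm_size_bytes(sample_rate, bits_per_sample, channels, seconds):
    """Size of uncompressed PCM: rate x bytes per sample x channels x duration."""
    return sample_rate * (bits_per_sample // 8) * channels * seconds

size = pcm_size_bytes(44100, 16, 2, 3 * 60)   # CD quality, 3 minutes
print(size)                                   # 31752000
print(round(size / 1_000_000, 1))             # 31.8 (megabytes)
```

The WAV header adds only a few dozen bytes on top of this.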

MP3

The MP3 and related formats are covered by a patent. Actually, a whole lot of patents. To use an encoder or decoder, users are supposed to pay a license fee to an organisation such as the Fraunhofer Society. Most casual users neither do this nor are aware that they should, but Wikipedia reports that the Society earned €100,000,000 in revenue in 2005. The Society has so far chosen not to pursue royalties from free, open-source implementations of encoders and decoders.

The codec used by MP3 is the MPEG-1 Audio Layer III audio compression format. Each frame of compressed data begins with a header that gives the additional information about the data and the compression algorithm, so there is no need for a separate container format.
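The frame header is compact enough to decode by hand. The following Python sketch decodes the 4-byte header of an MPEG-1 Layer III frame: an 11-bit sync pattern, then fields for the version, layer, bit rate, sampling rate, padding and channel mode. It handles only MPEG-1 Layer III (other MPEG versions and layers use different tables), does not check CRCs, and assumes the caller has already located a frame; real files may have other data, such as metadata tags, before the first frame. The function name and the example header bytes are made up for illustration.

```python
# Lookup tables for MPEG-1 Layer III only.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES = [44100, 48000, 32000, None]
CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]

def parse_mp3_header(header):
    """Decode a 4-byte MPEG-1 Layer III frame header into a dict."""
    if len(header) < 4:
        raise ValueError("need 4 bytes")
    word = int.from_bytes(header[:4], "big")
    if (word >> 21) & 0x7FF != 0x7FF:          # 11 sync bits, all ones
        raise ValueError("no frame sync")
    version = (word >> 19) & 0b11              # 0b11 = MPEG-1
    layer = (word >> 17) & 0b11                # 0b01 = Layer III
    if version != 0b11 or layer != 0b01:
        raise ValueError("not MPEG-1 Layer III")
    bitrate = BITRATES_KBPS[(word >> 12) & 0xF]
    rate = SAMPLE_RATES[(word >> 10) & 0b11]
    if bitrate is None or rate is None:
        raise ValueError("free-format or invalid bitrate/sample-rate index")
    padding = (word >> 9) & 1
    return {
        "bitrate_kbps": bitrate,
        "sample_rate": rate,
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0b11],
        "crc_protected": ((word >> 16) & 1) == 0,   # 0 means a CRC follows
        # 1152 samples per frame -> 144 * bitrate / rate bytes, plus padding
        "frame_bytes": 144 * bitrate * 1000 // rate + padding,
    }

print(parse_mp3_header(bytes([0xFF, 0xFB, 0x90, 0x64])))
# {'bitrate_kbps': 128, 'sample_rate': 44100, 'channel_mode': 'joint stereo',
#  'crc_protected': False, 'frame_bytes': 417}
```

The frame length computed here includes the header, so it tells a reader where the next frame header should begin.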

Ogg Vorbis

Ogg Vorbis is one of the "good guys". From Vorbis.com: "Ogg Vorbis is a completely open, patent-free, professional audio encoding and streaming technology with all the benefits of Open Source."

The names are described as follows:

Ogg: Ogg is the name of Xiph.org's container format for audio, video, and metadata. This puts the stream data into frames (Ogg calls them pages), which are easier to manage in files and other settings.

Vorbis: Vorbis is the name of a specific audio compression scheme that's designed to be contained in Ogg. Note that other formats, such as FLAC and Speex, can also be embedded in Ogg.
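Ogg's frames are called pages, and each begins with a fixed 27-byte header (described in RFC 3533) followed by a segment table. The Python sketch below decodes that header with struct. It is illustrative only: the function name is hypothetical, it does not verify the CRC checksum, and the sample page is hand-built with a made-up body and a zero checksum, so a real decoder would reject it.

```python
import struct

# Fixed part of an Ogg page header, little-endian, no padding:
# capture pattern, version, header type flags, granule position,
# stream serial number, page sequence number, CRC, number of segments.
OGG_PAGE_HEADER = struct.Struct("<4sBBqIIIB")   # 27 bytes

def parse_ogg_page(data, offset=0):
    """Decode the Ogg page header at data[offset:] (CRC is not verified)."""
    (capture, version, flags, granule, serial,
     sequence, crc, n_segments) = OGG_PAGE_HEADER.unpack_from(data, offset)
    if capture != b"OggS":
        raise ValueError("missing 'OggS' capture pattern")
    table_start = offset + OGG_PAGE_HEADER.size
    lacing = list(data[table_start:table_start + n_segments])
    return {
        "version": version,
        "continued": bool(flags & 0x01),   # first packet continues from previous page
        "first_page": bool(flags & 0x02),  # beginning of a logical stream
        "last_page": bool(flags & 0x04),   # end of a logical stream
        "granule_position": granule,
        "serial": serial,
        "sequence": sequence,
        "crc": crc,
        "lacing": lacing,
        "body_offset": table_start + n_segments,
        "page_size": OGG_PAGE_HEADER.size + n_segments + sum(lacing),
    }

# A hand-built page: one 5-byte segment, CRC left as zero (not a valid checksum).
body = b"hello"
page = OGG_PAGE_HEADER.pack(b"OggS", 0, 0x02, 0, 1234, 0, 0, 1) + bytes([len(body)]) + body
info = parse_ogg_page(page)
print(info["first_page"], info["serial"], info["lacing"], info["page_size"])  # True 1234 [5] 33
```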

The extension .oga is preferred for Ogg audio files, although .ogg was previously used.

At times it is necessary to be clearly aware of the distinction between Ogg and Vorbis. For example, OpenMAX IL has a number of standard audio components, including one to decode various codecs. The LIM component with role "audio decoder ogg" can decode Vorbis streams. But even though its name includes "ogg", it cannot decode Ogg files: these are containers of Vorbis streams, and the component can only decode the Vorbis stream itself. Decoding an Ogg file requires a different component, an "audio decoder with framing".
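The framing that separates an Ogg file from a Vorbis stream can be made concrete. Each Ogg page carries a segment table of lacing values: a value of 255 means the packet continues into the next segment, and any value below 255 ends a packet. The Python sketch below (a hypothetical helper, not part of any OpenMAX API) splits a page body into packets using those lacing values, carrying an unfinished packet over to the next page. It ignores CRC checks and multiplexed streams with different serial numbers, which a real demultiplexer would have to handle.

```python
def split_packets(lacing, body, pending=b""):
    """Split one Ogg page body into packets using its lacing values.

    `pending` is an unfinished packet carried over from the previous page.
    Returns (complete_packets, leftover), where leftover is a packet that
    continues on the next page (empty if the page ends on a packet boundary).
    """
    packets = []
    current = bytearray(pending)
    pos = 0
    for value in lacing:
        current += body[pos:pos + value]
        pos += value
        if value < 255:                 # a lacing value below 255 ends a packet
            packets.append(bytes(current))
            current = bytearray()
    return packets, bytes(current)

# Page 1: a 265-byte packet (255 + 10), a 3-byte packet, then the start of a third.
body1 = b"a" * 255 + b"b" * 10 + b"c" * 3 + b"d" * 255
packets, leftover = split_packets([255, 10, 3, 255], body1)
print([len(p) for p in packets], len(leftover))   # [265, 3] 255

# Page 2: the third packet finishes with 2 more bytes.
packets, leftover = split_packets([2], b"dd", pending=leftover)
print([len(p) for p in packets], len(leftover))   # [257] 0
```

The resulting packets are what a bare Vorbis decoder expects; the page structure around them is the container's business.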

WMA

From the standpoint of Open Source, WMA files are evil. WMA files are based on two Microsoft proprietary formats. The first is the Advanced Systems Format (ASF) file format which describes the "container" for the music data. The second is the codec, Windows Media Audio 9.

ASF is the primary problem. Microsoft has published a specification, but its license is strongly antagonistic to anything open source. The license states that if you build an implementation based on that specification then you:

And what's more, you are not allowed to begin any new implementation after January 1, 2012; and (at the time of writing) it is already July 2012!

Just to make it a little worse, Microsoft holds US Patent 6041345, "Active stream format for holding multiple media streams", filed on March 7, 1997. The patent appears to cover the same ground as many other such formats that existed at the time, so the standing of this patent (were it to be challenged) is not clear. However, it has been used to block the GPL-licensed project VirtualDub from supporting ASF. The status of patenting a file format is a little suspect anyway, but it may become a little clearer now that Oracle has lost its claim to copyright the Java API.

The FFmpeg project has nevertheless done a clean-room implementation of ASF, reverse-engineering the file format and not using the ASF specification at all. It has also reverse-engineered the WMA codec. This allows players such as mplayer and VLC to play ASF/WMA files. FFmpeg itself can also convert from ASF/WMA to better formats such as Ogg Vorbis.
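FFmpeg is usually driven from the command line. As a sketch, the Python wrapper below builds and runs an ffmpeg command that converts an ASF/WMA file to Ogg Vorbis. It assumes an ffmpeg binary on the PATH that was built with the libvorbis encoder; the function names and file names are hypothetical, and the quality value is just a typical middle setting.

```python
import shutil
import subprocess

def wma_to_vorbis_cmd(src, dst, quality=5):
    """Build an ffmpeg argument list converting ASF/WMA audio to Ogg Vorbis."""
    return [
        "ffmpeg",
        "-i", src,              # input: ASF container holding WMA audio
        "-vn",                  # drop any video stream
        "-c:a", "libvorbis",    # encode the audio with libvorbis
        "-q:a", str(quality),   # Vorbis variable-bit-rate quality level
        dst,                    # an .ogg name makes ffmpeg choose the Ogg muxer
    ]

def convert(src, dst, quality=5):
    """Run the conversion; assumes ffmpeg (built with libvorbis) is on the PATH."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg not found on PATH")
    subprocess.run(wma_to_vorbis_cmd(src, dst, quality), check=True)

print(" ".join(wma_to_vorbis_cmd("song.wma", "song.ogg")))
# ffmpeg -i song.wma -vn -c:a libvorbis -q:a 5 song.ogg
```

For example, convert("song.wma", "song.ogg") would write an Ogg Vorbis file, assuming song.wma exists.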

There is no Java handler for WMA files, and given the license there is unlikely to be one unless it is a native-code one based on FFmpeg.

Matroska

From the Matroska web site

Matroska aims to become THE standard of multimedia container formats. It was derived from a project called MCF, but differentiates from it significantly because it is based on EBML (Extensible Binary Meta Language), a binary derivative of XML. It incorporates features you would expect from a modern container format, like:

I hadn't come across it until I started looking at subtitles, which can optionally be added to videos; in that area it seems to be one of the major formats.

The command-line tool mkvmerge, from the mkvtoolnix package in the Ubuntu repositories, creates and manages MKV files; the command mmg is a GUI version of it. mplayer and VLC will play MKV files happily. There is a list of recognised formats.
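EBML is what makes Matroska files self-describing: every element starts with an ID and a size, both stored as variable-length integers whose length is signalled by the number of leading zero bits in the first byte. The Python sketch below decodes such an element header. It is a simplified illustration with hypothetical function names that assumes well-formed input and does not handle the reserved "unknown size" value; the sample bytes use 0x1A45DFA3, the ID of the EBML header element that starts a Matroska file, with a made-up size.

```python
def read_vint(data, pos, keep_marker):
    """Read an EBML variable-length integer starting at data[pos].

    The count of leading zero bits in the first byte, plus one, is the total
    length in bytes (1 to 8). Element IDs keep the length-marker bit;
    element sizes have it masked off. Returns (value, next_pos).
    """
    first = data[pos]
    if first == 0:
        raise ValueError("VINT longer than 8 bytes")
    length, mask = 1, 0x80
    while not first & mask:            # scan for the marker bit
        length += 1
        mask >>= 1
    if pos + length > len(data):
        raise ValueError("truncated VINT")
    value = first if keep_marker else first & (mask - 1)
    for byte in data[pos + 1:pos + length]:
        value = (value << 8) | byte
    return value, pos + length

def read_element_header(data, pos=0):
    """Return (element_id, data_size, data_start) for the EBML element at pos."""
    element_id, pos = read_vint(data, pos, keep_marker=True)
    size, pos = read_vint(data, pos, keep_marker=False)
    return element_id, size, pos

# 0x1A45DFA3 is the EBML header ID; 0xA3 is a 1-byte size (marker bit + 35).
element_id, size, start = read_element_header(bytes([0x1A, 0x45, 0xDF, 0xA3, 0xA3]))
print(hex(element_id), size, start)   # 0x1a45dfa3 35 5
```

A reader walks a Matroska file by reading an element header, then either descending into the element's children or skipping size bytes to the next element.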

Conclusion

There are many formats for sound, and more are being devised all the time. Some are codecs, some are containers and some are both, and they come with a variety of features, some with encumbrances such as patents.



Copyright © Jan Newmarch, jan@newmarch.name
"Programming and Using Linux Sound - in depth" by Jan Newmarch is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License .
Based on a work at https://jan.newmarch.name/LinuxSound/ .
