Intro to Audio Programming, Part 2: Demystifying the WAV Format

The WAV format is arguably the most basic of sound formats out there. It was developed by Microsoft and IBM, and it is rather loosely defined. As a result, there are a lot of WAV files out there that theoretically should not work, but somehow do.

Even if you do not plan to work with WAV data directly, future entries in this series will use the DirectSound library which has direct analogs to elements of the wave file, so it’s important to understand what all of this means.

WAV files are usually stored uncompressed, which means that they can get quite large, but they cannot exceed 4 gigabytes, because the file size header field is a 32-bit unsigned integer (a 32-bit file length means a maximum of 4 gigs).

WAV files are stored in a binary format. They are made up of chunks, where each chunk tells you something about the data in the file.

Here are a couple of quick links that describe the WAV file format in detail. If you are doing work with the WAV format, you should bookmark these:

Let’s dig in a little deeper to how WAV works.

Chunks

A chunk is used to represent certain metadata (or actual data) regarding the file. Each chunk serves a specific purpose and is structured very specifically (order matters).

Note: There are a lot of available chunks that you can use to accomplish different things, but not all of them are required to be in a wave file. To limit the amount of stuff you have to absorb, we’ll only be looking at the header and the two chunks you need to create a functioning WAV file.

Also, every chunk (including the file header) starts with a four-character (four-byte) identifier called sGroupID that tells you what's coming next. We'll see more about this… now.

Header

While not really a chunk according to the WAV spec, the header is what comes first in every WAV file. Here is what the header looks like:

Field Name | Size (bytes) | C# Data Type | Value | Description
sGroupID | 4 | char[4] | “RIFF” | For WAV files, this value is always RIFF. RIFF stands for Resource Interchange File Format, and is not limited to WAV audio – RIFFs can also hold AVI video.
dwFileLength | 4 | uint | varies | The total file size, in bytes, minus 8 (the “RIFF” marker and this length field are not counted).
sRiffType | 4 | char[4] | “WAVE” | For WAV files, this value is always WAVE.
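
To make the layout concrete, here's a minimal sketch of writing just this 12-byte header with a .NET BinaryWriter (the file name is arbitrary, and dwFileLength is written as a zero placeholder that would be patched once the chunks are built):

    using System.IO;
    using System.Text;

    // Write the 12-byte RIFF/WAVE header. dwFileLength is a placeholder here;
    // a real writer goes back and overwrites it after the chunks are written.
    using (BinaryWriter writer = new BinaryWriter(File.Create("test.wav")))
    {
        writer.Write(Encoding.ASCII.GetBytes("RIFF"));  // sGroupID
        writer.Write((uint)0);                          // dwFileLength (placeholder)
        writer.Write(Encoding.ASCII.GetBytes("WAVE"));  // sRiffType
    }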

 

Format Chunk

The Format chunk is the metadata chunk. It specifies many of the things we talked about in part 1 of this series, such as the sample rate, bit depth, number of channels, and more.

Before we look at the format chunk structure, let’s run through some definitions in gory detail.

  • Sample – A single, scalar value representing the amplitude of the sound wave in one channel of audio data.
  • Channel – An independent waveform in the audio data. The number of channels is important: one channel is “Mono,” two channels is “Stereo” – there are different waves for the left and right speakers. 5.1 surround sound has six channels; the “.1” is a low-frequency effects channel that usually goes to a subwoofer. Again, each channel holds audio data that is independent of all the other channels, although all channels will be the same overall length.
  • Frame – A frame is like a sample, but in multichannel format – it is a snapshot of all the channels at a specific data point.
  • Sampling Rate / Sample Rate – The number of samples (or frames) that exist for each second of data. This value is expressed in Hz, or “per second.” For example, CD-quality audio has 44,100 samples per second. A higher sampling rate means higher fidelity audio.
  • Bit Depth / Bits per Sample – The number of bits available for one sample. Common bit depths are 8-bit, 16-bit and 32-bit. A sample is almost always represented by a native data type, such as byte, short, or int. A higher bit depth means each sample can be more precise, resulting in higher fidelity audio.
  • Block Align – This is the number of bytes in a frame. This is calculated by multiplying the number of channels by the number of bytes (not bits) in a sample. To get the number of bytes per sample, we divide the bit depth by 8 (assuming a byte is 8 bits). The resulting formula to calculate block align looks like blockAlign = nChannels * (bitsPerSample / 8) . For 16-bit stereo format, this gives you 2 channels * 2 bytes = 4 bytes.
  • Average Bytes per Second – Used mainly to allocate memory, this measurement is equal to sampling rate * block align (see the short calculation sketch just after this list).
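
Here's a quick sketch plugging the 16-bit stereo, CD-quality example into those two formulas (the variable names are mine, not part of the WAV format):

    ushort channels = 2;        // stereo
    ushort bitsPerSample = 16;  // 16-bit samples
    uint sampleRate = 44100;    // CD-quality sampling rate

    // Block align: number of bytes in one multichannel frame.
    ushort blockAlign = (ushort)(channels * (bitsPerSample / 8));  // 2 * 2 = 4 bytes

    // Average bytes per second: sampling rate * block align.
    uint avgBytesPerSec = sampleRate * blockAlign;                 // 44100 * 4 = 176400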

Now that we know what all these things mean (or at least, you can scroll up and read them when you need to), let’s dive into the format chunk’s structure.

Field Name | Size (bytes) | C# Data Type | Value | Description
sGroupID | 4 | char[4] | “fmt “ | Indicates that the format chunk is defined below. Note the single space at the end to fill out the 4 bytes required here.
dwChunkSize | 4 | uint | varies | The length of the rest of this chunk, in bytes (not including sGroupID or dwChunkSize). For plain PCM this is 16.
wFormatTag | 2 | ushort | 1 | 1 indicates uncompressed PCM, which is what this series uses. Other values indicate compressed or floating-point formats.
wChannels | 2 | ushort | varies | Indicates the number of channels in the audio. 1 for mono, 2 for stereo, etc.
dwSamplesPerSec | 4 | uint | varies | The sampling rate for the audio (e.g. 44100, 8000, 96000, depending on what you want).
dwAvgBytesPerSec | 4 | uint | sampleRate * blockAlign | The average number of bytes of audio data consumed per second. Used to estimate how much memory is needed to play the file.
wBlockAlign | 2 | ushort | wChannels * (dwBitsPerSample / 8) | The number of bytes in a multichannel audio frame.
dwBitsPerSample | 2 | ushort | varies | The bit depth (bits per sample) of the audio. Usually 8, 16, or 32.

As I mentioned, this chunk gives you pretty much everything you need to specify the wave format. When we look at DirectSound later, we will see the same fields to describe the wave format.
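
For illustration, here's a rough C# sketch of how you might model this chunk as a class. The field names mirror the table above, and the default values (16-bit stereo at 44.1 kHz) are just an example, not something the format requires; the next post will build its own version:

    public class WaveFormatChunk
    {
        public string sGroupID = "fmt ";       // four characters, including the trailing space
        public uint dwChunkSize = 16;          // 16 bytes of format fields follow (PCM)
        public ushort wFormatTag = 1;          // 1 = uncompressed PCM
        public ushort wChannels = 2;           // stereo
        public uint dwSamplesPerSec = 44100;   // CD-quality sampling rate
        public uint dwAvgBytesPerSec;          // dwSamplesPerSec * wBlockAlign
        public ushort wBlockAlign;             // wChannels * (dwBitsPerSample / 8)
        public ushort dwBitsPerSample = 16;    // bit depth

        public WaveFormatChunk()
        {
            // Derive the two calculated fields from the others.
            wBlockAlign = (ushort)(wChannels * (dwBitsPerSample / 8));
            dwAvgBytesPerSec = dwSamplesPerSec * wBlockAlign;
        }
    }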

Now, let’s look at the data chunk, my favorite chunk of them all!

Data Chunk

The data chunk is really simple. It’s got the sGroupID, the length of the data and the data itself. Depending on your chosen bit depth, the data type of the array will vary.

Field Name | Size | C# Data Type | Value | Description
sGroupID | 4 bytes | char[4] | “data” | Indicates the data chunk is coming next.
dwChunkSize | 4 bytes | uint | varies | The length of the sample data array below, in bytes.
sampleData | dwChunkSize bytes | byte[] (8-bit audio), short[] (16-bit audio), float[] (32-bit audio) | sample data | All sample data is stored here. The number of elements in the array is dwSamplesPerSec * wChannels * duration of the audio in seconds.
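
To make the sizing concrete, here's a small sketch for 16-bit stereo audio (the five-second duration and the variable names are just examples):

    uint sampleRate = 44100;   // dwSamplesPerSec
    ushort channels = 2;       // wChannels
    uint seconds = 5;          // duration of the audio

    // Number of elements in the sample data array.
    uint numSamples = sampleRate * channels * seconds;

    // 16-bit audio stores its samples in a short[].
    short[] sampleData = new short[numSamples];

    // dwChunkSize is the length of that array in bytes.
    uint dwChunkSize = numSamples * sizeof(short);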

For the sample data, it's important to know how each element is interpreted, because the range of each element depends on the bit depth you've chosen.

Since each sample represents amplitude, we have to consider the minimum and maximum amplitude for the data type we've chosen. 16-bit samples are signed two's-complement values – they can be negative, positive or zero – and range from -32768 to 32767. 8-bit audio is the odd one out: the WAV format stores 8-bit samples as unsigned bytes, so the range is 0 to 255 and the value 128 represents silence. 32-bit audio is usually stored as floating-point samples ranging from -1.0f to 1.0f (strictly speaking, floating-point data is indicated by a wFormatTag other than 1).

Here are the value ranges for each data type (a small conversion sketch follows the list):

  • 8-bit audio: 0 to 255 (unsigned bytes, with 128 as the silence point)
  • 16-bit audio: -32768 to 32767 (signed, two's complement)
  • 32-bit audio: -1.0f to 1.0f (floating point)
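
As a small illustration (the mapping here is my own helper math, not something the format dictates), here's how a normalized sample could be scaled into each of these ranges:

    float sample = 0.5f;   // a normalized sample somewhere between -1.0f and 1.0f

    byte  sample8  = (byte)((sample + 1.0f) * 127.5f);  // 8-bit:  0 to 255, 128 is silence
    short sample16 = (short)(sample * short.MaxValue);  // 16-bit: -32767 to 32767
    float sample32 = sample;                            // 32-bit: already -1.0f to 1.0f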

Making a Wave File

Writing the wave file is as easy as constructing the Header, Format Chunk and Data Chunk and writing them in binary fashion to a file in that order.

There are a couple of caveats. Firstly, determining the dwChunkSize for the format and data chunks can be weird because you have to sum up the byte count for each field and then use that result as your chunk size. You can do this by using the .NET Framework’s BitConverter class to convert each of the fields to a byte array and then retrieve its length, summing up the total byte count for all the counted elements in the chunk. If you get the chunk size wrong, the wave file won’t work because whatever is trying to play it has an incorrect number of bytes to read for that chunk.
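
For example, using the WaveFormatChunk sketch from earlier, the format chunk's size could be computed like this (BitConverter is part of the .NET Framework; the class itself is my own sketch, not the article's final code):

    using System;

    WaveFormatChunk format = new WaveFormatChunk();

    // Sum the byte counts of every field that comes after dwChunkSize in the
    // format chunk. For plain PCM this always works out to 16.
    format.dwChunkSize = (uint)(
        BitConverter.GetBytes(format.wFormatTag).Length +        // 2 bytes (ushort)
        BitConverter.GetBytes(format.wChannels).Length +         // 2 bytes (ushort)
        BitConverter.GetBytes(format.dwSamplesPerSec).Length +   // 4 bytes (uint)
        BitConverter.GetBytes(format.dwAvgBytesPerSec).Length +  // 4 bytes (uint)
        BitConverter.GetBytes(format.wBlockAlign).Length +       // 2 bytes (ushort)
        BitConverter.GetBytes(format.dwBitsPerSample).Length);   // 2 bytes (ushort), 16 total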

The other thing you have to calculate AFTER generating the chunks is the dwFileLength field in the header. This value is equal to the total size of the entire file, in bytes, minus 8. The 8 bytes we leave out are the “RIFF” marker and the dwFileLength field itself – everything after those two fields, including the “WAVE” marker and each chunk's sGroupID and dwChunkSize, gets counted. There are some other ways to do this, which we'll explore in the next article.

The best way to accomplish all these things effectively is to implement a structure or class for the header and each of the two chunks, with all the fields defined using the data types shown above. Put methods on the chunk classes that calculate the chunk's dwChunkSize as well as the total size of the chunk including sGroupID and dwChunkSize. Then, add up the total chunk sizes of the data and format chunks, and add 4 (for the “WAVE” marker, the only part of the header that gets counted). Assign that value to dwFileLength and you are golden.
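
Putting that arithmetic together, the file length calculation might look something like this (formatChunkSize and dataChunkSize stand for the dwChunkSize values each chunk already calculated for itself; the data chunk size is just an example figure):

    uint formatChunkSize = 16;      // dwChunkSize of the format chunk (PCM)
    uint dataChunkSize = 882000;    // e.g. 5 seconds of 16-bit stereo at 44.1 kHz

    // Total on-disk size of each chunk: 4 bytes of sGroupID + 4 bytes of
    // dwChunkSize + the chunk's own payload.
    uint formatChunkTotal = 8 + formatChunkSize;  // 24 bytes for PCM
    uint dataChunkTotal = 8 + dataChunkSize;

    // From the header, only the 4 bytes of "WAVE" are counted; the "RIFF"
    // marker and dwFileLength itself are the 8 bytes that are left out.
    uint dwFileLength = 4 + formatChunkTotal + dataChunkTotal;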

That’s It, For Now…

What, no example code? I know, I know. That’s what the next blog post is for, so stay tuned. In the next post, you’ll learn how to use all this knowledge to write a C# program that synthesizes audio, generates the samples, and writes the results to a working WAV file that you can listen to!

Currently Playing: Silversun Pickups – Swoon – The Royal We

Comments

  • Anonymous
    October 04, 2009
    There's an error in the format chunk. It says dwBitsPerSample is a 32-bit uint; that field is a ushort.

  • Anonymous
    February 07, 2010
    Is it possible to convert the sampling rate from 44100 to 8000?

  • Anonymous
    May 04, 2010
    For the WAV header: dwFileLength -> this is not 32 bytes, it is 4 bytes (32 bits) and it would make more sense if the 8 bytes that are ignored are actually the first 2 fields (sGroupID + dwFileLength) instead of the 1st one and the 3rd.

  • Anonymous
    May 04, 2010
    Well actually most of the sizes are in bits, and not in bytes :(

  • Anonymous
    May 06, 2010
    Agree with the above comments, a few of the details above are incorrect. http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html

  • Anonymous
    June 06, 2010
    Hi, thanks very much to the commenters above! I had some doubts when I read the content of this article, and your comments made it clear to me.

  • Anonymous
    February 02, 2011
    When I record and then read what I recorded as a WAV file, in order to decode it I need to find an amplitude threshold. Do you know the formula for that? What does the amplitude threshold depend on?


  • Anonymous
    March 01, 2011
    Did the test for a stereo file. I think it should be for (uint i = 0; i < numSamples - 1; i += 2) instead of for (uint i = 0; i < numSamples - 1; i++) in WaveGenerator.cs Otherwise you are rewriting the right channel in stereo...

  • Anonymous
    June 19, 2014
    Thanks for the article. I'm trying to get specifics about how the raw samples are stored, and you've answered that fantastically; it correlates with the one other source I found on the subject. I can code with confidence now!

  • Anonymous
    September 25, 2014
    Thanks for the effort. Now be honest... you never really READ the spec when writing this article, did you? Nick is right, by the way, on all accounts. I always wonder why people don't just READ the spec. Ok, it was hard (3 versions, 2 chapters each), especially to throw ALL their equations overboard (because they are wrong). It's probably also hard to then IGNORE the widespread errors that people have kept copying from each other for the last 23 YEARS...

    So, +1 for realizing the specified equation for AvgBytesPerSec is wrong. Yours is right BUT produces the wrong results because the counterpart (BlockAlign) is still WRONG (nice attempt though). Now go back to the spec, ignore the equations and READ. Don't they clearly talk about CEILing some numbers... you know, RIGHT ABOVE the equations? Now put your copy-hat down and put your thinking-hat ON. What are the only 2 ways that CEIL would actually do something? Ok, now realise both options result in DIFFERENT numbers. Then which of the two is correct?

    The spec clearly specifies that, for basic PCM, samples are stuffed into the smallest number of BYTES required to hold the conceptual int they represent. These are containers. If a sample doesn't completely fill a container, it is padded. If a block === frame, and a frame holds all samples' containers for all channels at ONE moment, then how many bytes is one frame of 2-channel, 12-bits-per-sample audio? Hmm... 12 bits don't fit in one byte, so we need another byte, making the sample's container 2 bytes (including 4 bits of padding). Well, then 2 channels must be 4 bytes (per frame), right? Now look at your 2 possible equations: which renders 3 bytes and which renders 4 bytes? Yup, you should take the one that renders 4 bytes. Let's try this exercise again: 44.1 kHz, 20 bits per sample, 2 channels... answer: 6 bytes. That is the frame width, or block width in MS terminology, and that is what is specified to be the correct value for wBlockAlign. By the way, the spec updates and v3 repeat (very clearly) that blockAlign === frame width.

    Why is this important? Well, you don't want to write illegal wav files (unless you specifically, FOR A KNOWN REASON, want to write an illegal wav, mostly for compatibility with a broken player). If you write a player, you'd need to know the correct way AND the common errors. Actually, for bare type 1 PCM, avgBytesPerSec and blockAlign are process data – redundant – you can (and should) calculate your own correct values if you're writing a player. Also, the required fields of a fmt chunk hold data according to the wave type. Different types give different meanings to those values. In other words, it's not the fmt definition that describes what the correct values and their meanings are for each field, but the wave type!

  • Anonymous
    December 28, 2014
    Thanks for the article and for getting me started on writing C# code that creates WAV streams. I also noticed that the C# data type char is two bytes, but the WAV spec requires one byte per character. The format chunk dwChunkSize would typically have a value of 16, which is the number of bytes remaining in the format chunk. The data chunk dwChunkSize is the number of bytes in the sampleData. The article linked provides some good info; I was able to get my code working from it. www.sonicspot.com/.../wavefiles.html