Calculating the Duration of an MP3 File

Introduction

I have a side-project for working with ID3 tags called scribbu. Having some time off, I decided to add a feature whereby the tool would print an .m3u playlist entry given a file; something like this:

┌┤ .../mgh/doc/hacking/mp3 (Wed Apr 07 10:34:51)
└──5007:mgh@Crickhollow[0,0,3]: scribbu print-m3u-entry foo/bar/splat.mp3
#EXTINF: 323, Artist - Title
foo/bar/splat.mp3
┌┤ .../mgh/doc/hacking/mp3 (Wed Apr 07 10:34:51)
└──5008:mgh@Crickhollow[0,0,3]:

In the output, "323" denotes the track duration in seconds, so I needed to figure out how, given an .mp3 file, I could compute that song's duration. That simple question has led me down a rabbit-hole over the past few days that involved poring over archaic documents, long-abandoned forum postings, and hex dumps of my .mp3 collection. The entire process was involved enough that I wanted to document it both for my own sanity, and in the hopes of sparing others the same labor by providing a complete accounting here in one place.

It also brought home to me something else: MP3 is dying out. The number of broken links I encountered while doing this research was sobering. So I'm also going to take this opportunity to start an archive of MP3 technical information & lore. I'll begin with links in the hopes that the Internet Archive will have them after they're gone, but I'll also host such documents as I can (attributed as best as possible).

If you're reading this, I'm going to assume you're broadly familiar with MP3; this post skips the background & jumps right into the technical details of decoding data in MP3 format & determining duration. When I can do so with confidence, I characterize the MPEG format generally, but I focus on audio layer three.

BLUF

MPEG audio data is organized into frames, each of which begins with a fixed-size header describing its contents. The bottom line up-front: the only reliable way to compute a song's duration is to walk the file frame-by-frame, parsing each header, computing the duration of the sound data therein, and summing them all up. Since this is inconvenient and computationally expensive (depending on your frame of reference), people have looked for easier ways. They are the topic of the second half of this post, but I'm going to begin by documenting the "brute force" approach.

The MPEG Format

The MPEG specification went through three revisions. MPEG 1 (ISO/IEC 11172-3) and MPEG 2 (ISO/IEC 13818-3) are ISO standards. MPEG 2.5 is an unofficial extension of MPEG 2 to support lower sampling rates. MPEG 2/2.5 is also known by the abbreviation LSF, which stands for Lower Sampling Frequencies.

An MPEG audio file is made up of frames. There are three "layers" to the spec; this post is concerned only with layer III, but in Layers I & II the frames are completely independent of one another (meaning that you could cut the file at any frame boundary & have the resulting parts play correctly). Layer III can make use of a technique known as the "bit reservoir" where unused space in each frame can be put to use to hold data for subsequent frames, meaning that frames may depend on one another. In the worst case, a decoder may need to read nine frames before being able to decode one [3].

Each frame is made up of a fixed-size header, possibly a two-byte CRC, a block of information referred to as the "side information" that decoders need in order to interpret the audio samples, and finally the audio samples themselves.

Schematically, MPEG audio data looks like this:

+--------------+--------------+------------------+---------------+--------------+----
| frame header | optional CRC | side information | audio samples | frame header | ...
+--------------+--------------+------------------+---------------+--------------+----

This post is concerned with computing duration, which can be computed strictly from the header.

The MPEG Audio Frame Header

In all layers & all versions of the spec, each frame begins with a 32-bit header. If the 32 bits are denoted AAAAAAAA AAABBCCD EEEEFFGH IIJJKLMM then their significance is described by this table (thanks to [1], [2] & [3]):

Table 1: MPEG Audio Frame Header Layout
letter  length  bits   description
A       11      31-21  frame sync: all bits 1
B       2       20-19  MPEG audio version ID (see Table 2 below)
C       2       18-17  MPEG audio layer (see Table 3 below)
D       1       16     protection bit: 0 indicates that this frame is protected by a 16-bit CRC, 1 indicates that it is not
E       4       15-12  bitrate index (see Table 4 below)
F       2       11-10  sampling frequency index (see Table 5 below)
G       1       9      padding bit: 1 indicates that this frame is padded with an extra slot (a slot is one byte in Layer III)
H       1       8      private bit: application-specific; informational only
I       2       7-6    channel mode (see Table 6 below)
J       2       5-4    mode extension: only used with joint stereo channel mode & not relevant here
K       1       3      copyright bit: 1 indicates that this audio is copyrighted
L       1       2      original bit: 1 indicates that this is original media
M       2       1-0    emphasis: not relevant here
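
The layout in Table 1 translates directly into shifts & masks. Here is a minimal sketch in C; the struct and function names are my own invention, not taken from any of the libraries mentioned in this post, and it assumes the four header bytes are supplied in file order.

```c
#include <stdint.h>

/* Hypothetical container for the fields of Table 1 (letters in comments). */
struct mpeg_header_fields {
    unsigned sync;            /* A: 11 bits, all ones in a valid header */
    unsigned version;         /* B: see Table 2 */
    unsigned layer;           /* C: see Table 3 */
    unsigned protection;      /* D: 0 means a CRC follows the header */
    unsigned bitrate_index;   /* E: see Table 4 */
    unsigned sampling_index;  /* F: see Table 5 */
    unsigned padding;         /* G */
    unsigned private_bit;     /* H */
    unsigned channel_mode;    /* I: see Table 6 */
    unsigned mode_extension;  /* J */
    unsigned copyright;       /* K */
    unsigned original;        /* L */
    unsigned emphasis;        /* M */
};

/* Unpack four header bytes; returns 1 if the frame sync is present, else 0. */
int unpack_header(const unsigned char b[4], struct mpeg_header_fields *f)
{
    /* Assemble the bytes into one 32-bit big-endian word. */
    uint32_t h = ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
                 ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];

    f->sync           = (h >> 21) & 0x7ff;
    f->version        = (h >> 19) & 0x3;
    f->layer          = (h >> 17) & 0x3;
    f->protection     = (h >> 16) & 0x1;
    f->bitrate_index  = (h >> 12) & 0xf;
    f->sampling_index = (h >> 10) & 0x3;
    f->padding        = (h >> 9)  & 0x1;
    f->private_bit    = (h >> 8)  & 0x1;
    f->channel_mode   = (h >> 6)  & 0x3;
    f->mode_extension = (h >> 4)  & 0x3;
    f->copyright      = (h >> 3)  & 0x1;
    f->original       = (h >> 2)  & 0x1;
    f->emphasis       =  h        & 0x3;

    return f->sync == 0x7ff;
}
```

For the header bytes ff fa 92 04, this yields version bits 11, layer bits 01, bitrate index 1001 & a set padding bit.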

In MPEG versions 1 & 2, 12 bits were used to signal the beginning of a frame, followed by a one-bit version ID. Version 2.5 repurposed the last of those sync bits as part of a two-bit version ID, with values chosen to be backward compatible:

Table 2: MPEG Audio Versions
bits interpretation
00 MPEG version 2.5 (an unofficial extension to MPEG-2)
01 reserved
10 MPEG version 2 (ISO/IEC 13818-3)
11 MPEG version 1 (ISO/IEC 11172-3)

The MPEG specification provides for three "layers"; three ways of encoding audio data that offer different trade-offs in terms of space & computational complexity. This post is only concerned with layer III (which offers the best compression at the cost of the greatest computational complexity); layer III is the "3" in "MP3".

Table 3: Layer ID
bits interpretation
00 reserved
01 Layer III
10 Layer II
11 Layer I

Table 4: Bitrates
bits  V1,L1  V1,L2  V1,L3  V2,L1  V2,L2 & L3
0000  free   free   free   free   free
0001  32     32     32     32     8
0010  64     48     40     48     16
0011  96     56     48     56     24
0100  128    64     56     64     32
0101  160    80     64     80     40
0110  192    96     80     96     48
0111  224    112    96     112    56
1000  256    128    112    128    64
1001  288    160    128    144    80
1010  320    192    160    160    96
1011  352    224    192    176    112
1100  384    256    224    192    128
1101  416    320    256    224    144
1110  448    384    320    256    160
1111  bad    bad    bad    bad    bad

All values are in kbps (kilo, as in x1000, not x1024)

  • V1 - MPEG Version 1
  • V2 - MPEG Version 2 and Version 2.5
  • L1 - Layer I
  • L2 - Layer II
  • L3 - Layer III

"free" indicates that the file is encoded with a constant bitrate, just not one of the predefined values of bitrate.

Table 5: Sampling Rates
bits MPEG1 MPEG2 MPEG2.5
00 44100 22050 11025
01 48000 24000 12000
10 32000 16000 8000
11 reserv reserv reserv

Table values are in Hertz (i.e. samples/second).

Table 6: Channel Mode
bits channel mode
00 stereo
01 joint stereo (stereo)
10 dual channel (stereo)
11 single channel (mono)

Finally, note that the number of audio samples is constant in each frame, and specified by the version & layer of the MPEG specification in use.

Table 7: Samples per Frame
           MPEG 1  MPEG 2 (LSF)  MPEG 2.5 (LSF)
Layer I    384     384           384
Layer II   1152    1152          1152
Layer III  1152    576           576
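
Tables 4, 5 & 7 are easy to turn into lookup functions. What follows is only a sketch: the function names are made up for this post, the arguments are the raw bit-field values from the frame header, and I have chosen to return zero for anything free, bad or reserved.

```c
/* Arguments are raw header bit fields:
   version per Table 2 (3 = MPEG 1, 2 = MPEG 2, 0 = MPEG 2.5),
   layer per Table 3 (3 = Layer I, 2 = Layer II, 1 = Layer III).
   Every function returns 0 for free, bad or reserved values. */

/* Table 4 in kbps; columns: V1,L1  V1,L2  V1,L3  V2,L1  V2,L2 & L3 */
static const int kBitrates[16][5] = {
    {  0,   0,   0,   0,   0}, { 32,  32,  32,  32,   8},
    { 64,  48,  40,  48,  16}, { 96,  56,  48,  56,  24},
    {128,  64,  56,  64,  32}, {160,  80,  64,  80,  40},
    {192,  96,  80,  96,  48}, {224, 112,  96, 112,  56},
    {256, 128, 112, 128,  64}, {288, 160, 128, 144,  80},
    {320, 192, 160, 160,  96}, {352, 224, 192, 176, 112},
    {384, 256, 224, 192, 128}, {416, 320, 256, 224, 144},
    {448, 384, 320, 256, 160}, {  0,   0,   0,   0,   0}
};

int bitrate_kbps(unsigned version, unsigned layer, unsigned index)
{
    int column;
    if (version == 1 || version > 3 || layer == 0 || layer > 3 || index > 15)
        return 0;
    if (version == 3)
        column = 3 - (int)layer;        /* Layer I, II, III -> 0, 1, 2 */
    else
        column = (layer == 3) ? 3 : 4;  /* MPEG 2 & 2.5 share columns */
    return kBitrates[index][column];
}

/* Table 5, in Hz. */
int sampling_rate_hz(unsigned version, unsigned index)
{
    static const int mpeg1[3] = {44100, 48000, 32000};
    if (version == 1 || version > 3 || index > 2)
        return 0;
    if (version == 3) return mpeg1[index];      /* MPEG 1 */
    if (version == 2) return mpeg1[index] / 2;  /* MPEG 2 */
    return mpeg1[index] / 4;                    /* MPEG 2.5 */
}

/* Table 7. */
int samples_per_frame(unsigned version, unsigned layer)
{
    if (version == 1 || version > 3) return 0;
    switch (layer) {
    case 3:  return 384;                          /* Layer I */
    case 2:  return 1152;                         /* Layer II */
    case 1:  return (version == 3) ? 1152 : 576;  /* Layer III */
    default: return 0;
    }
}
```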

Armed with this information, we can compute the size of this frame along with the duration of the audio contained therein.

Frame Size

A little dimensional analysis points the way to computing the frame size:

samples   bytes    seconds   bytes
------- * ------ * ------- = -----
 frame    second   sample    frame

We know samples per frame from the MPEG audio version & Table 7. We can compute bytes per second from the bitrate (cf. Table 4): if the bitrate is B, then we have B kilobits per second or B * 1000/8 bytes per second. Finally, seconds per sample is just the inverse of the sampling frequency. Explicitly, let S be samples per frame from Table 7, B the bitrate from Table 4, and H the sampling frequency from Table 5. Then the frame size will be:

           B * 1000
floor(S * ----------)
            H * 8

Note that if the padding bit is set, we'll need to add one to arrive at the offset to the next frame.
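
As a sanity check on the formula, here is a minimal sketch in C. The function name & signature are my own, and it assumes Layer III, where a padding slot is one byte.

```c
/* Frame size in bytes: floor(S * B * 1000 / (H * 8)), plus one if padded.
   S = samples per frame, B = bitrate in kbps, H = sampling rate in Hz.
   Integer division floors here because every operand is positive. */
long frame_size_bytes(long samples_per_frame, long bitrate_kbps,
                      long sampling_rate_hz, int padding)
{
    if (samples_per_frame <= 0 || bitrate_kbps <= 0 || sampling_rate_hz <= 0)
        return 0;  /* free, bad or reserved values are not handled */
    return samples_per_frame * bitrate_kbps * 1000 / (sampling_rate_hz * 8)
           + (padding ? 1 : 0);
}
```

With S = 1152, B = 128 & H = 44100, this returns 417 without padding and 418 with it.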

Here's a worked example:

000000 ff fa 92 04 41 41 00 00 02 dd 06 c2 83 03 18 90  >....AA..........<
000016 5c ab 08 50 48 22 6e 4a f9 51 25 35 21 00 21 78  >\..PH"nJ.Q%5!.!x<
000032 18 e4 56 b0 20 04 01 81 a3 b4 f8 d8 b7 70 e0 60  >..V. ........p.`<

The frame header is 0xff, 0xfa, 0x92, 0x04. Un-packing that gives:

ff        fa        92        04
1111 1111 1111 1010 1001 0010 0000 0100
AAAA AAAA AAAB BCCD EEEE FFGH IIJJ KLMM

Un-packing each field gives:

  • A, 11 bits, frame sync
  • B, 2 bits, 11, MPEG version 1
  • C, 2 bits, 01, layer 3
  • D, 1 bit, 0, protected by a 16-bit CRC
  • E, 4 bits, 1001, 128kbps bitrate
  • F, 2 bits, 00, 44100 Hz sampling rate
  • G, 1 bit, 1, frame is padded with one extra slot
  • H, 1 bit, 0, private bit unset
  • I, 2 bits, 00, stereo channel mode
  • J, 2 bits, 00, ignored for this example
  • K, 1 bit, 0, no copyright
  • L, 1 bit, 1, original media
  • M, 2 bits, 00, ignored for this example

So we have S := 1152, B := 128 & H := 44100

           B * 1000                    128000
floor(S * ----------) = floor(1152 * ---------) =~ floor(417.959183673) = 417
            H * 8                    44100 * 8

The padding bit is set, so this frame is 418 bytes in size.

Frame Duration

There are two ways of which I am aware to calculate the duration of the frame.

The simpler of the two is to again do a little dimensional analysis:

samples    seconds   seconds
------- * -------- = -------
 frame     sample    frame

Explicitly, let S be samples per frame from Table 7 and H the sampling frequency from Table 5. Then the duration of this frame, in seconds, will be S/H.

Continuing the example from the last section, S is 1152 & H is 44100. S/H = 0.0261224489796 seconds.

The second approach, if B is the bitrate & D the frame size in bytes, is to compute:

    8
-------- * D
B * 1000

The dimensional analysis is left as an exercise. In this case:

   8
------ * 418 = 0.026125
128000

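It irritates me that these two values only agree to three figures. I suspect that this is due to the fact that I don't really understand how the audio data is encoded (the value of 418 bytes, for example, includes the frame header & side information; it's not clear to me that that should be counted). Part of the explanation is surely rounding: a frame occupies a whole number of bytes, so D is either 417 or 418 rather than the unrounded 417.959..., and plugging the unrounded size into the second formula gives back S/H exactly. The padding bit lets the encoder alternate between the two sizes so that the average works out to the nominal bitrate.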
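
The two calculations can be put side by side in a few lines of C. This is just a sketch for experimenting with the numbers; the function names are mine, and the third function (the frame size before flooring) is something I've added for comparison.

```c
/* First approach: S / H, in seconds. */
double duration_from_samples(double samples_per_frame, double sampling_rate_hz)
{
    return samples_per_frame / sampling_rate_hz;
}

/* Second approach: 8 * D / (B * 1000), in seconds,
   where D is the frame size in bytes & B the bitrate in kbps. */
double duration_from_size(double frame_bytes, double bitrate_kbps)
{
    return 8.0 * frame_bytes / (bitrate_kbps * 1000.0);
}

/* The frame size formula without the floor & without padding. */
double unrounded_frame_size(double samples_per_frame, double bitrate_kbps,
                            double sampling_rate_hz)
{
    return samples_per_frame * bitrate_kbps * 1000.0 / (sampling_rate_hz * 8.0);
}
```

Feeding the unrounded size (about 417.96 bytes for the example frame) into the second approach reproduces S/H, which suggests that the gap between the two figures comes from rounding the frame to a whole number of bytes.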

libmad [4] takes the first approach, for what it's worth:

int mad_header_decode(struct mad_header *header, struct mad_stream *stream)
{
  ...
  /* calculate frame duration */
  mad_timer_set(&header->duration, 0,
                32 * MAD_NSBSAMPLES(header), header->samplerate);

Improvements

Assuming nothing, and with no outside information about a particular block of MPEG audio data, that's it: we have to walk every frame to learn about the duration of the audio contained therein. Two problems quickly became apparent in the early days of MP3:

  1. computing the duration is inconvenient; you have to walk every frame
  2. it is impossible to seek ahead in a track (by time or by percentage) without walking the file once & building a lookup table
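
For concreteness, the brute-force walk might be sketched like this in C. It is deliberately simplified, and the simplifications are mine: the function name is hypothetical, the buffer is assumed to begin at the first frame header (any leading tag already skipped), only Layer III is handled, and the walk stops at the first bytes that don't parse as a frame header. Free-format frames are not handled, and any frame with a valid header gets counted whether or not it holds audio.

```c
#include <stddef.h>

/* Layer III bitrates in kbps (Table 4): row 0 is MPEG 1, row 1 is MPEG 2/2.5. */
static const int kL3Bitrate[2][16] = {
    {0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0},
    {0,  8, 16, 24, 32, 40, 48, 56,  64,  80,  96, 112, 128, 144, 160, 0}
};

/* Sampling rates in Hz (Table 5), indexed by the two version bits. */
static const int kRate[4][4] = {
    {11025, 12000,  8000, 0},  /* 00: MPEG 2.5 */
    {    0,     0,     0, 0},  /* 01: reserved */
    {22050, 24000, 16000, 0},  /* 10: MPEG 2 */
    {44100, 48000, 32000, 0}   /* 11: MPEG 1 */
};

/* Sum the durations of consecutive Layer III frames in buf.
   Returns seconds; stores the number of frames seen in *nframes if non-NULL. */
double walk_frames(const unsigned char *buf, size_t len, size_t *nframes)
{
    double seconds = 0.0;
    size_t pos = 0, count = 0;

    while (pos + 4 <= len) {
        const unsigned char *h = buf + pos;
        unsigned version, layer, bitrate_index, rate_index, padding;
        int kbps, hz, samples;

        if (h[0] != 0xff || (h[1] & 0xe0) != 0xe0)
            break;                          /* no frame sync here */
        version       = (h[1] >> 3) & 0x3;
        layer         = (h[1] >> 1) & 0x3;
        bitrate_index = (h[2] >> 4) & 0xf;
        rate_index    = (h[2] >> 2) & 0x3;
        padding       = (h[2] >> 1) & 0x1;
        if (layer != 1)
            break;                          /* only Layer III is handled */

        kbps = kL3Bitrate[version == 3 ? 0 : 1][bitrate_index];
        hz   = kRate[version][rate_index];
        if (kbps == 0 || hz == 0)
            break;                          /* free, bad or reserved */

        samples  = (version == 3) ? 1152 : 576;            /* Table 7 */
        seconds += (double)samples / hz;                    /* S / H */
        /* Advance by the frame size, plus one byte if padded. */
        pos += (size_t)samples * (size_t)kbps * 1000 / ((size_t)hz * 8) + padding;
        ++count;
    }
    if (nframes)
        *nframes = count;
    return seconds;
}
```

Every frame costs a header parse & a jump, so the running time grows with the number of frames rather than being constant, which is exactly the inconvenience described above.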

Before exploring what was done about that, we need to touch on the subject of encoding bitrates.

Bitrates

The bitrate refers to the number of bits used to encode each second of audio: the more bits, the better the quality. Encoding can be done using either a constant bit rate (CBR) or by varying the bitrate with each frame (VBR).

CBR brings with it advantages on the decoding side. If we somehow know, a priori, that a given file was encoded using CBR, then a lot of the problems described in the previous section disappear. All frames are the same size (up to padding) & of the same duration (assuming a constant sampling rate, on which more below). If you know the total size of the audio data (a reasonable assumption), you can compute, after decoding just the first frame header, the number of frames, the total duration, and how to seek ahead anywhere in the file. This post would not exist if all .mp3 files used CBR.
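
Under the CBR assumption, both the duration estimate & the seek calculation reduce to simple arithmetic. A minimal sketch, with hypothetical function names, and assuming the size passed in covers only the audio data (no ID3 tags):

```c
/* Estimated duration, in seconds, of audio_bytes of CBR audio. */
double cbr_duration_seconds(unsigned long audio_bytes, int bitrate_kbps)
{
    if (bitrate_kbps <= 0)
        return 0.0;
    return (double)audio_bytes * 8.0 / (bitrate_kbps * 1000.0);
}

/* Approximate byte offset, from the start of the audio data, of a timestamp.
   The result is not aligned to a frame boundary; a player would still
   need to scan forward to the next frame sync. */
unsigned long cbr_seek_offset(double seconds, int bitrate_kbps)
{
    if (seconds <= 0.0 || bitrate_kbps <= 0)
        return 0;
    return (unsigned long)(seconds * bitrate_kbps * 1000.0 / 8.0);
}
```

For a hypothetical 5,168,000 bytes of audio at 128 kbps, the estimate comes to 323 seconds.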

CBR also brings along some notable drawbacks. It forces the unpalatable choice of either encoding every frame at a bitrate suitable for the most complex regions of the audio, resulting in a file that is larger than necessary, or of encoding every frame at the minimal bitrate, resulting in poor audio quality. For that reason, VBR was introduced. There is a third variant– Average Bit Rate, or ABR. We'll touch on that below, but for purposes of this discussion, ABR is just a particular form of VBR. VBR lets the encoder select high bit rates for complex portions of the audio being encoded, and low bitrates for the simpler sections (such as the silence at the beginning & end of a song). The cost is that you now have to parse every frame header to do things like seek to a given timestamp.

Finally, absent outside information, there is no way to determine that a given file was encoded using CBR other than… walking every frame & checking, which pretty much leaves us having to treat everything as VBR.

It was due to this regrettable situation that both Fraunhofer & Xing (two big names in MP3 technology), in the time-honored tradition, introduced different & incompatible unofficial extensions to the standard. The idea in both cases was to start the data with a frame that contained no audio data, and to use that frame to hold a tag containing additional information that would ease life on the decoder side. That way decoders that didn't support their tags would just skip the frame harmlessly. Fraunhofer introduced the VBRI tag & Xing the Xing tag. The LAME encoder later introduced its own tag that extended that of the Xing encoder. Each contained the total number of frames for duration calculations, as well as a lookup table for seeking purposes. Collectively, these tags are referred to as VBRI in the rest of this post (not to be confused with Fraunhofer's specific tag of the same name).

Duration in the Presence of VBRI

Before digging into the details of each tag, let's wrap-up the question with which this post began: how to compute the duration of an .mp3 file? We have one solution, outlined above, that is known to be good in all cases, albeit expensive.

Now suppose we have parsed the first audio frame header, so we know how many samples are present in each frame (that's constant, remember) and what the sampling rate is for this frame. This gives us the duration of this frame, in seconds. Suppose further we have discovered a VBRI tag in that first frame (on which more below), and that that tag contains the total number of frames in this file. We can now compute the duration of the file by simply multiplying this frame's duration by the number of frames.
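
In code, that calculation is a one-liner. A sketch in C, with a function name of my own choosing, assuming every frame shares the first frame's sampling rate:

```c
/* Duration in seconds: frame count times the duration of one frame (S / H). */
double duration_from_frame_count(unsigned long frame_count,
                                 int samples_per_frame, int sampling_rate_hz)
{
    if (sampling_rate_hz <= 0)
        return 0.0;
    return (double)frame_count * samples_per_frame / sampling_rate_hz;
}
```

For example, 12,365 MPEG 1 Layer III frames at 44100 Hz come to roughly 323 seconds.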

Well, it's very likely we can. That calculation is only accurate if the sampling frequency doesn't change. There's nothing that says it can't: every frame specifies it independently. However, I have never seen that happen in the wild. My sense is that every frame specifies it separately to avoid the need for a file header to specify it for all following frames. With the current arrangement, a decoder can connect to a stream of audio frames (think on-line radio), use the frame sync to "lock on" to the next frame it sees & just start playing. To confirm my hunch, I may concoct a pathological .mp3 file containing stretches of frames that use different sampling frequencies & see which players get the duration correct & which don't.

Again for what it's worth, this is the approach taken by MPD: parse the first frame header & estimate the song duration based on that frame's bitrate & the track's total size (i.e. assume CBR). However, if the Xing frame is present, it will abandon that estimate & just multiply the first frame's duration by the total number of frames given in the Xing header.

The Tags

The VBRI Tag

The VBRI tag is only found in files created with the Fraunhofer encoder per Windszus [1]. Perhaps because Fraunhofer has ended support for the MP3 standard, and seems to have removed all documentation from their site, this frame is largely undocumented. Also per Windszus [1], it is always located 32 bytes after the frame header, regardless of the side information size, which matches my experience with test data. Its layout is described here:

Table 8: VBRI Tag Layout
offset  length    description
0       4         VBRI header 'VBRI'
4       2         version: 16-bit big-endian
6       2         delay: 16-bit big-endian
8       2         quality indicator: 16-bit big-endian
10      4         file size: 32-bit big-endian
14      4         number of frames: 32-bit big-endian
18      2         number of entries in the TOC: 16-bit big-endian
20      2         TOC scale factor: 16-bit big-endian
22      2         size per table entry in bytes: 16-bit big-endian
24      2         frames per table entry: 16-bit big-endian
26      variable  TOC

As far as I can tell, each entry in the table of contents gives the size, in bytes, of the corresponding n-tuple of frames, where "n" is the "frames-per-table-entry". I would guess that the scale factor applies to the number of bytes, but I have yet to encounter a scale factor not equal to one.
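
Reading the fixed-size portion of the tag is a matter of a few big-endian loads at the offsets in Table 8. The following is a sketch only: the struct & function names are invented for this post, the caller is assumed to have already located the 'VBRI' identifier, bytes 8 & 9 are simply skipped, and the TOC itself is left unparsed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the fixed-size VBRI fields. */
struct vbri_info {
    uint16_t version;
    uint16_t delay;
    uint32_t file_size;
    uint32_t frames;
    uint16_t toc_entries;
    uint16_t toc_scale;
    uint16_t toc_entry_size;
    uint16_t toc_frames_per_entry;
};

static uint16_t be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* p points at the 'VBRI' identifier with len bytes available.
   Returns 1 on success, 0 if the identifier is missing or the data is short. */
int parse_vbri(const unsigned char *p, size_t len, struct vbri_info *out)
{
    if (len < 26 || memcmp(p, "VBRI", 4) != 0)
        return 0;
    out->version              = be16(p + 4);
    out->delay                = be16(p + 6);
    /* bytes 8-9 are not read in this sketch */
    out->file_size            = be32(p + 10);
    out->frames               = be32(p + 14);
    out->toc_entries          = be16(p + 18);
    out->toc_scale            = be16(p + 20);
    out->toc_entry_size       = be16(p + 22);
    out->toc_frames_per_entry = be16(p + 24);
    return 1;
}
```

The frames field is the one that matters for duration; the remaining fields only come into play when seeking.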

The Xing Tag

The Xing tag is, as far as I can tell, undocumented (or the documents are lost). Most of what I know I've learned by reverse-engineering code: both the LAME decoder and the Xing VBR SDK are still available.

Table 9: Xing Tag Layout
length  description
4       Xing header tag; four characters, either "Xing" (VBR) or "Info" (CBR)
4       flags indicating which fields are present in this tag (see below)
4       (optional) number of frames (32-bit, big-endian)
4       (optional) number of bytes (32-bit, big-endian)
100     (optional) TOC (one hundred bytes)
4       (optional) quality indicator from 0 to 100 (32-bit, big-endian)

The tag is slightly more complex to parse due to the fact that most of the fields are optional; their presence or absence is signalled by the only mandatory field beyond the four-byte tag, the flags field:

Table 10: Xing Flags
value meaning
0x00000001 if set, the frames field is present
0x00000002 if set, the bytes field is present
0x00000004 if set, the TOC is present
0x00000008 if set, the quality indicator field is present
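
Because the fields are optional, a parser has to keep a running offset & consult the flags before each read. Here is a sketch of that logic in C. The names are hypothetical, each four-byte field is read as a 32-bit big-endian integer, and the caller is assumed to have already located the start of the tag.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical holder for the Xing fields; absent fields are zero or NULL. */
struct xing_info {
    int is_info;               /* 1 if the tag reads "Info", 0 if "Xing" */
    uint32_t flags;
    uint32_t frames;
    uint32_t bytes;
    const unsigned char *toc;  /* points at the 100-byte TOC, if present */
    uint32_t quality;
};

static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* p points at "Xing" or "Info" with len bytes available.
   Returns 1 on success, 0 if the tag is missing or truncated. */
int parse_xing(const unsigned char *p, size_t len, struct xing_info *out)
{
    size_t pos = 8;  /* first optional field follows the tag & the flags */

    if (len < 8)
        return 0;
    if (memcmp(p, "Xing", 4) == 0)
        out->is_info = 0;
    else if (memcmp(p, "Info", 4) == 0)
        out->is_info = 1;
    else
        return 0;

    out->flags = be32(p + 4);
    out->frames = out->bytes = out->quality = 0;
    out->toc = NULL;

    if (out->flags & 0x1) {            /* frames field present */
        if (len < pos + 4) return 0;
        out->frames = be32(p + pos);
        pos += 4;
    }
    if (out->flags & 0x2) {            /* bytes field present */
        if (len < pos + 4) return 0;
        out->bytes = be32(p + pos);
        pos += 4;
    }
    if (out->flags & 0x4) {            /* TOC present */
        if (len < pos + 100) return 0;
        out->toc = p + pos;
        pos += 100;
    }
    if (out->flags & 0x8) {            /* quality indicator present */
        if (len < pos + 4) return 0;
        out->quality = be32(p + pos);
    }
    return 1;
}
```

A duration calculation should check flag 0x00000001 rather than assume that a frame count was written.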

The number of frames & number of bytes fields are self-explanatory, I trust. The "quality indicator" is, as far as I know, completely undocumented. The LAME encoder source code mentions that it is a value between zero & one hundred (zero being the best quality and one hundred the worst) & computes it in the function PutLameVBR as:

/*recall: cfg->vbr_q is for example set by the switch -V  */
/*   gfp->quality by -q, -h, -f, etc */

int     nQuality = (100 - 10 * gfp->VBR_q - gfp->quality);

As in the VBRI tag, the TOC is intended as a lookup table for seeking ahead in the track. In this case, to skip i% of the way through the track (where i is an integer between 0 & 99), look up toc[i]. It will contain a value between 0 & 255; this value is to 255 as the desired offset is to the total file size [5]. (Note that the code below scales by 256 rather than 255.)

The Xing VBR SDK gives what I take to be the reference implementation of this procedure:

int SeekPoint(unsigned char TOC[100], int file_bytes, float percent)
{
    // interpolate in TOC to get file seek point in bytes
    int a, seekpoint;
    float fa, fb, fx;

    if( percent < 0.0f )   percent = 0.0f;
    if( percent > 100.0f ) percent = 100.0f;

    a = (int)percent;
    if( a > 99 ) a = 99;
    fa = TOC[a];
    if( a < 99 ) {
        fb = TOC[a+1];
    }
    else {
        fb = 256.0f;
    }

    fx = fa + (fb-fa)*(percent-a);

    seekpoint = (int)((1.0f/256.0f)*fx*file_bytes);

    return seekpoint;
}

I was able to dig-up some source code comments from LAME & have reproduced them here:

/*
...
 * toc (table of contents) gives seek points
 * for random access
 * the ith entry determines the seek point for
 * i-percent duration
 * seek point in bytes = (toc[i]/256.0) * total_bitstream_bytes
 * e.g. half duration seek point = (toc[50]/256.0) * total_bitstream_bytes
...
 */

The LAME Tag

The LAME encoder writes its own VBRI tag, extending the Xing format. The document "Mp3 Info Tag Revision 1 Specifications" [7] is the generally cited documentation, but it is extremely difficult to read, and I found the LAME encoder source code to be an essential companion to working through it.

Table 11: LAME Tag Layout
offset length description
0 9 LAME version (see below)
9 1 top four bits are the tag version, the bottom four the VBR method (see below)
10 1 lowpass filter value; multiply by 100 to get Hz
11 8 replay gain
19 1 top four bits hold encoding flags, bottom hold ATH type
20 1 if ABR, given bitrate, else minimal bitrate (see below)
21 3 encoder delays
24 1 miscellaneous information
25 1 mp3 gain
26 2 preset & surround information
28 4 length in bytes of this track, measured from the first byte of this tag
32 2 CRC-16 of the music data, from the next MPEG frame to the end
34 2 CRC-16 of the first 190 bytes of this frame (i.e. up until this field)

The LAME tag begins with a version identifier for the LAME encoder that wrote it. Regrettably, a textual string was chosen, with all the problems that entails. Happily, the format is well documented [8].

The LAME version string occupies nine bytes. If the version string is ever fewer than nine bytes, it is padded out with blanks to be nine bytes. The general format is "LAME" + major version + "." + minor version + flag. When the minor version reached 100, that format was changed to "LAME" + major version + minor version + flag. Flag is one of the following:

  • "a" for alpha version
  • "b" for beta version
  • "r" for release versions whose patch version is > 0 (beginning with release 3.96.1)
  • " " all other versions

There was an interesting deviation from this approach: the 3.99.1 release changed the format to "L" + major version + "." + minor version + flag + patch version, but reverted when they realized this broke existing decoders.

The tag format version is contained in the top four bits of byte 9; the only two permissible values are 0 & 1. The bottom four bits of byte 9 encode the VBR method:

Table 12: LAME VBR method
value meaning
0 unknown
1 constant bitrate
2 restricted VBR targeting a given average bitrate (ABR)
3 full VBR method 1
4 full VBR method 2
5 full VBR method 3
6 full VBR method 4
8 constant bitrate two-pass
9 ABR two-pass
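
Splitting byte 9 into its two halves & naming the VBR method per Table 12 might look like the sketch below. The function names are my own, and the "unrecognized" string for values missing from the table is my choice rather than anything LAME defines.

```c
/* Split byte 9 of the LAME tag into tag revision (top four bits)
   & VBR method (bottom four bits). */
void split_lame_byte9(unsigned char byte9,
                      unsigned *tag_revision, unsigned *vbr_method)
{
    *tag_revision = (byte9 >> 4) & 0x0f;
    *vbr_method   =  byte9       & 0x0f;
}

/* Describe a VBR method value per Table 12. */
const char *lame_vbr_method_name(unsigned method)
{
    switch (method) {
    case 0:  return "unknown";
    case 1:  return "constant bitrate";
    case 2:  return "ABR";
    case 3:  return "full VBR method 1";
    case 4:  return "full VBR method 2";
    case 5:  return "full VBR method 3";
    case 6:  return "full VBR method 4";
    case 8:  return "constant bitrate two-pass";
    case 9:  return "ABR two-pass";
    default: return "unrecognized";  /* not listed in Table 12 */
    }
}
```

A byte value of 0x04, for instance, splits into tag revision 0 & VBR method 4.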

If the track was encoded using ABR, byte 20 will hold the target bitrate in kbps (zero denotes "unknown"). Otherwise, byte 20 holds the bitrate (if CBR was used) or the minimum bitrate used (if some other form of VBR was used). If that value is greater than 255, 255 will be stored. Note that, in the case of ABR, this gives another way to compute the duration: just divide the overall size by the average bitrate (adjusting for units).

Conclusion

There are a lot of details I've turned up in my research that I've glossed over here. I might update the post at some point in the future, but I've spent enough time on this as it is. I can't help ending with this rant coming down to us through the years: "This is just one of the many ways that MP3 files are often broken… Seeking to a location in a VBR file is next to impossible… I really wish everyone would instead use Ogg Vorbis." [9] Well, we didn't (or at least I didn't), and to be honest I've seen worse (the Windows Driver Model from that time comes to mind, for instance). I hope this accounting will save someone else the trouble of working it out themselves. Corrections, questions & additions are welcome.

References

  1. Windszus, Konrad, "MPEG Audio Frame Header", https://www.codeproject.com/Articles/8295/MPEG-Audio-Frame-Header (retrieved April 5, 2021).
  2. Supurovic, Predrag, "MPEG AUDIO FRAME HEADER", http://www.mpgedit.org/mpgedit/mpeg_format/mpeghdr.htm (retrieved April 5, 2021).
  3. Bouvigne, Gabriel, "MPEG Audio Layer I/II/III frame header", http://www.mp3-tech.org/programmer/frame_header.html (retrieved April 6, 2021).
  4. "libmad - MPEG audio decoder library", https://www.underbit.com/products/mad/ (retrieved April 7, 2021).
  5. unknown, "MP3 Inside", http://www.multiweb.cz/twoinches/MP3inside.htm#VBR (retrieved April 7, 2021).
  6. unknown, "VBR header and LAME tag", https://wiki.hydrogenaud.io/index.php?title=LAME#VBR_header_and_LAME_tag (retrieved April 7, 2021).
  7. unknown, "Mp3 Info Tag Revision 1 Specifications", http://gabriel.mp3-tech.org/mp3infotag.html (retrieved April 9, 2021).
  8. multiple, "LAME version string", https://wiki.hydrogenaud.io/index.php?title=LAME_version_string (retrieved April 9, 2021).
  9. O'Connor, Russell, "MP3 Sucks", http://r6.ca/blog/20030720T195900Z.html (retrieved April 11, 2021).

04/12/21 06:52