After careful reading of FFmpeg Bitstream Filters Documentation, I still do not understand what they are really for.
The document states that the filter:
performs bitstream level modifications without performing decoding
Could anyone further explain that to me? A use case would greatly clarify things. Also, there are clearly different filters. How do they differ?
Let me explain by example. FFmpeg video decoders typically work by converting one video frame per call to avcodec_decode_video2. So the input is expected to be "one image" worth of bitstream data. Let's consider this issue of going from a file (an array of bytes of disk) to images for a second.
For "raw" (annexb) H264 (.h264/.bin/.264 files), the individual nal unit data (sps/pps header bitstreams or cabac-encoded frame data) is concatenated in a sequence of nal units, with a start code (00 00 01 XX) in between, where XX is the nal unit type. (In order to prevent the nal data itself to have 00 00 01 data, it is RBSP escaped.) So a h264 frame parser can simply cut the file at start code markers. They search for successive packets that start with and including 00 00 01, until and excluding the next occurence of 00 00 01. Then they parse the nal unit type and slice header to find which frame each packet belongs to, and return a set of nal units making up one frame as input to the h264 decoder.
H264 data in .mp4 files is different, though. You can imagine that the 00 00 01 start code can be considered redundant if the muxing format already has length markers in it, as is the case for mp4. So, to save 3 bytes per frame, they remove the 00 00 01 prefix. They also put the PPS/SPS in the file header instead of prepending it before the first frame, and these also miss their 00 00 01 prefixes. So, if I were to input this into the h264 decoder, which expects the prefixes for all nal units, it wouldn't work. The h264_mp4toannexb bitstream filter fixes this, by identifying the pps/sps in the extracted parts of the file header (ffmpeg calls this "extradata"), prepending this and each nal from individual frame packets with the start code, and concatenating them back together before inputting them in the h264 decoder.
You might now feel that there's a very fine line distinction between a "parser" and a "bitstream filter". This is true. I think the official definition is that a parser takes a sequence of input data and splits it in frames without discarding any data or adding any data. The only thing a parser does is change packet boundaries. A bitstream filter, on the other hand, is allowed to actually modify the data. I'm not sure this definition is entirely true (see e.g. vp9 below), but it's the conceptual reason mp4toannexb is a BSF, not a parser (because it adds 00 00 01 prefixes).
Other cases where such "bitstream tweaks" help keep decoders simple and uniform, but allow us to support all files variants that happen to exist in the wild: