What's wrong with my use of timestamps/timebases for frame seeking/reading using libav (ffmpeg)?

mtree · Sep 16, 2013

So I want to grab a frame from a video at a specific time using libav, for use as a thumbnail.

What I'm using is the following code. It compiles and works fine (insofar as it retrieves a picture at all), yet I'm having a hard time getting it to retrieve the right picture.

I simply can't get my head around libav's anything-but-clear logic of using multiple time bases per video, and specifically which functions expect or return which kind of time base.

The docs were of basically no help whatsoever, unfortunately. SO to the rescue?

#include <stdio.h>
#include <stdlib.h>
#include <libavformat/avformat.h>
#include <libavcodec/avcodec.h>

#define ABORT(x) do {fprintf(stderr, x); exit(1);} while(0)

av_register_all();

AVFormatContext *format_context = ...;
AVCodec *codec = ...;
AVStream *stream = ...;
AVCodecContext *codec_context = ...;
int stream_index = ...;

// open codec_context, etc.

AVRational stream_time_base = stream->time_base;
AVRational codec_time_base = codec_context->time_base;

printf("stream_time_base: %d / %d = %.5f\n", stream_time_base.num, stream_time_base.den, av_q2d(stream_time_base));
printf("codec_time_base: %d / %d = %.5f\n\n", codec_time_base.num, codec_time_base.den, av_q2d(codec_time_base));

AVFrame *frame = avcodec_alloc_frame();

printf("duration: %lld @ %d/sec (%.2f sec)\n", format_context->duration, AV_TIME_BASE, (double)format_context->duration / AV_TIME_BASE);
printf("duration: %lld @ %d/sec (stream time base)\n\n", format_context->duration / AV_TIME_BASE * stream_time_base.den, stream_time_base.den);
printf("duration: %lld @ %d/sec (codec time base)\n", format_context->duration / AV_TIME_BASE * codec_time_base.den, codec_time_base.den);

double request_time = 10.0; // 10 seconds. Video's total duration is ~20sec
int64_t request_timestamp = request_time / av_q2d(stream_time_base);
printf("requested: %.2f (sec)\t-> %2lld (pts)\n", request_time, request_timestamp);

av_seek_frame(format_context, stream_index, request_timestamp, 0);

AVPacket packet;
int frame_finished;
do {
    if (av_read_frame(format_context, &packet) < 0) {
        break;
    } else if (packet.stream_index != stream_index) {
        av_free_packet(&packet);
        continue;
    }
    // Keep decoding until the decoder outputs a complete frame.
    avcodec_decode_video2(codec_context, frame, &frame_finished, &packet);
    av_free_packet(&packet);
} while (!frame_finished);

// do something with frame

int64_t received_timestamp = frame->pkt_pts;
double received_time = received_timestamp * av_q2d(stream_time_base);
printf("received:  %.2f (sec)\t-> %2lld (pts)\n\n", received_time, received_timestamp);

Running this with a test movie file I get this output:

    stream_time_base: 1 / 30000 = 0.00003
    codec_time_base: 50 / 2997 = 0.01668

    duration: 20062041 @ 1000000/sec (20.06 sec)
    duration: 600000 @ 30000/sec (stream time base)
    duration: 59940 @ 2997/sec (codec time base)

    requested: 10.00 (sec)  -> 300000 (pts)
    received:  0.07 (sec)   -> 2002 (pts)

The times don't match. What's going on here? What am I doing wrong?


While searching for clues I stumbled upon this statement from the libav-users mailing list…

[...] packet PTS/DTS are in units of the format context's time_base,
where the AVFrame->pts value is in units of the codec context's time_base.

In other words, the container can have (and usually does) a different time_base than the codec. Most libav players don't bother using the codec's time_base or pts since not all codecs have one, but most containers do. (This is why the dranger tutorial says to ignore AVFrame->pts)

…which confused me even more, given that I couldn't find any such mention in the official docs.
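As an aside, the blessed way to convert a timestamp from one timebase to another appears to be av_rescale_q() rather than multiplying av_q2d() values by hand. A minimal sketch, reusing received_timestamp and stream_time_base from the snippet above:

#include <libavutil/avutil.h>      // AV_TIME_BASE_Q
#include <libavutil/mathematics.h> // av_rescale_q

// Rescale a stream-timebase timestamp into microseconds (AV_TIME_BASE_Q),
// then into seconds, avoiding the rounding pitfalls of doing it by hand.
int64_t pts_us = av_rescale_q(received_timestamp, stream_time_base, AV_TIME_BASE_Q);
double seconds = pts_us / (double)AV_TIME_BASE;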

Anyway, I replaced…

double received_time = received_timestamp * av_q2d(stream_time_base);

…with…

double received_time = received_timestamp * av_q2d(codec_time_base);

…and the output changed to this…

...

requested: 10.00 (sec)  -> 300000 (pts)
received:  33.40 (sec)  -> 2002 (pts)

Still no match. What's wrong?

Answer

Anton Khirnov · Sep 17, 2013

It's mostly like this:

  • the stream timebase is what you are really interested in. It's what the packet timestamps are in, and also pkt_pts on the output frame (since it's just copied from the corresponding packet).

  • the codec timebase is (if set at all) just the inverse of the framerate that might be written in the codec-level headers. It can be useful in cases where there is no container timing information (e.g. when you're reading raw video), but otherwise can be safely ignored.

  • AVFrame.pkt_pts is the timestamp of the packet that got decoded into this frame. As already said, it's just a straight copy from the packet, so it's in the stream timebase. This is the field you want to use (if the container has timestamps).

  • AVFrame.pts is never set to anything useful when decoding; ignore it (it might replace pkt_pts in the future, to make the whole mess less confusing, but for now it's like this, mostly for historical reasons).

  • the format context's duration is in AV_TIME_BASE (i.e. microseconds). It cannot be in any stream timebase, since you can have three bazillion streams, each with its own timebase.

  • the problem you see with getting a different timestamp after seeking is simply that seeking is not accurate. In most cases you can only seek to the closest keyframe, so it's common to be a couple of seconds off. Decoding and discarding the frames you don't need must be done manually, as in the sketch below.
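
A minimal sketch of that decode-and-discard loop, using the same era of API as the question (avcodec_decode_video2, av_free_packet); the helper name seek_to_time is invented for illustration and error handling is kept to a bare minimum:

static int seek_to_time(AVFormatContext *format_context,
                        AVCodecContext *codec_context,
                        AVFrame *frame, int stream_index, double seconds)
{
    AVStream *stream = format_context->streams[stream_index];
    // av_seek_frame() expects the timestamp in the stream's timebase,
    // the same unit as packet pts/dts and AVFrame.pkt_pts.
    int64_t target = (int64_t)(seconds / av_q2d(stream->time_base));

    // Seek to a keyframe at or before the target, then decode forward.
    if (av_seek_frame(format_context, stream_index, target, AVSEEK_FLAG_BACKWARD) < 0)
        return -1;
    avcodec_flush_buffers(codec_context); // drop pre-seek decoder state

    AVPacket packet;
    int frame_finished = 0;
    while (av_read_frame(format_context, &packet) >= 0) {
        if (packet.stream_index == stream_index)
            avcodec_decode_video2(codec_context, frame, &frame_finished, &packet);
        av_free_packet(&packet);
        // Discard frames until we reach the requested timestamp
        // (assumes the container provides pts for this stream).
        if (frame_finished && frame->pkt_pts >= target)
            return 0; // frame now holds the first picture at/after `seconds`
    }
    return -1; // hit EOF before reaching the target
}

Seeking with AVSEEK_FLAG_BACKWARD lands at or before the target, so the loop only ever has to discard forward; flushing the decoder after the seek keeps pictures from before the seek point from leaking through.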