Synchronization has always fascinated me, or to be precise: why a .ts can be viewed in sync by media players, while the demuxed audio and video, once reassembled, end up out of sync.
So I'm trying to understand this, and what can be done to prevent it.
I've read the following: https://trac.handbrake.fr/wiki/LibHandBrakeSync and the source of sync.c (also available on the wiki)
BitStreamTools has also written a Theory 101 on the subject (but I can't link to it as I'm a new user, sorry).
While I thought my understanding of PCR/PTS was (conceptually) right, I'm having a hard time following HandBrake's excellent A/V sync paper.
My question is this: is there a somewhat intuitive explanation (brief or longer) of A/V synchronization? While I know that one can recalculate PTS from PCR if the audio or video PTS is corrupted (discontinuity?), HandBrake does not seem to rely on this, but on its own internal PTS: 0, += 1/fps (~= 5), 10, 15, ...
Would it be possible to recalculate the PTS offsets and correct the .ts (binary) by fixing all audio and video PTS values (and shifting all DTS values by the same offset, so the player doesn't "run out of frames", so to speak), and thus end up with a .ts which can be demuxed, with the isolated tracks then in sync (if put back together)?
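To make the "fix the PTS in the binary" idea concrete, here is a rough Python sketch of my own (not HandBrake code; actually patching a .ts would additionally require walking the TS packets to locate each PES header) of how the 33-bit PTS/DTS field in a PES header could be decoded, offset, and re-encoded:

```python
PTS_MAX = 1 << 33  # PTS/DTS are 33-bit counters at 90 kHz and wrap around

def decode_ts(b):
    """Decode the 5-byte PTS/DTS field found in a PES header."""
    return (((b[0] >> 1) & 0x07) << 30 |
            b[1] << 22 |
            ((b[2] >> 1) & 0x7F) << 15 |
            b[3] << 7 |
            (b[4] >> 1) & 0x7F)

def encode_ts(value, prefix=0b0010):
    """Re-encode a 33-bit value into the 5-byte field.
    prefix is the 4-bit field that marks PTS vs. PTS+DTS pairs."""
    value %= PTS_MAX
    return bytes([
        (prefix << 4) | (((value >> 30) & 0x07) << 1) | 1,  # marker bits set to 1
        (value >> 22) & 0xFF,
        (((value >> 15) & 0x7F) << 1) | 1,
        (value >> 7) & 0xFF,
        ((value & 0x7F) << 1) | 1,
    ])

def shift(field, offset, prefix=0b0010):
    """Apply the same offset to a PTS (or DTS) field, modulo the 33-bit wrap."""
    return encode_ts(decode_ts(field) + offset, prefix)
```

The modulo matters: since the counter wraps every ~26.5 hours, shifting PTS and DTS by the same offset must wrap consistently or the player will indeed "run out of frames" at the wrap point.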
EDIT: Or would it be possible to fix this by using the PCR to recalculate all PTS values in a given .ts? I understand that some frames/audio might be damaged in broadcast so they cannot be presented correctly, but I'll leave the handling of that (such as dropping a damaged video frame along with its corresponding audio, inserting x ms of silence if an audio packet is damaged, etc.) for later; for the sake of discussion I'll presume all frames are intact. (But then the PTS values would always be correct anyway, wouldn't they?)
Appendix: My take on the HandBrake A/V paper is this: at "expected" 100, the offset is calculated as video PTS (100) minus audio PTS (0) minus the internal PTS, to bring the audio up to the same presentation time, giving a PTS offset of 99. At 105 the offset would be 105 - 5 = 100, not 99, but we keep using 99 as the offset since there's no need to recalculate (100 - 99 = 1, and 1/fps < 100 ms). At 150, the PTS offset is calculated again because the video PTS is decreasing instead of increasing...
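Stated as code, the rule I think I'm describing looks like this (a sketch of my own reading, not HandBrake's actual code; the parameter names and the threshold are my assumptions):

```python
def next_offset(video_pts, prev_video_pts, internal_pts, offset, threshold=100):
    """Recompute the PTS offset only when the video PTS jumps backwards
    (a discontinuity) or the drift against the internal PTS exceeds the
    threshold; otherwise keep the current offset."""
    candidate = video_pts - internal_pts
    if video_pts < prev_video_pts or abs(candidate - offset) > threshold:
        return candidate  # resync
    return offset  # drift is tolerable, keep the old offset
```

With the numbers above: at video PTS 105 and internal PTS 5, the candidate offset is 100, but |100 - 99| = 1 is under the threshold, so 99 is kept; at 150, the backwards jump forces a recalculation.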
I'm almost positive I'm completely wrong about this, but can someone point me in the right direction, please?
The concept of audio/video synchronization runs much deeper. The first reading I would recommend is the following paper.
http://downloads.bbc.co.uk/rd/pubs/reports/1996-02.pdf
I won't repeat everything here, but essentially every encoder records timestamps and stamps them on the respective audio and video. Later, when the decoder plays the stream, it does two things: one, it ensures that the decoder's own clock is "enslaved" to the encoder's clock; and two, it ensures that every picture is presented on the screen, and every audio frame delivered to the speaker, exactly when its respective time occurs. This is the only reliable way for audio to remain in synchronization with video. These timestamps are the PTS/DTS values, which have the resolution of a 90 kHz clock.
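For concreteness, here is what those 90 kHz timestamps look like for, say, 25 fps video and 48 kHz audio (illustrative numbers of my own, not taken from any particular stream):

```python
PTS_CLOCK = 90_000  # PTS/DTS tick rate in Hz

video_fps = 25
audio_rate = 48_000
audio_samples_per_frame = 1152  # e.g. one MPEG-1 Layer II audio frame

video_step = PTS_CLOCK // video_fps  # 3600 ticks per picture
audio_step = PTS_CLOCK * audio_samples_per_frame // audio_rate  # 2160 ticks per audio frame

# Each picture and each audio frame gets its own timestamp on this shared timeline:
video_pts = [n * video_step for n in range(3)]  # [0, 3600, 7200]
audio_pts = [n * audio_step for n in range(3)]  # [0, 2160, 4320]
```

The point is that video and audio are stamped on the same 90 kHz timeline, so the decoder can present each unit at the matching wall-clock instant regardless of how the two streams are interleaved.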
Understand that clocks skew over time, but since only the exact timestamps are referenced, the decoder plays everything out in exactly the same time order.
The major remaining concern is that the decoder's clock must stay under the control of, and synchronized with, the encoder's clock. The first thing MPEG does is use a higher-precision 27 MHz clock (300 times the 90 kHz timestamp resolution). Further, this clock must remain consistent across any transmission path in the middle (this is called the clock recovery process).
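The reference for that recovery is the PCR, carried in the TS adaptation field as a 33-bit base at 90 kHz plus a 9-bit extension at 27 MHz. A minimal decoder of the 6-byte field, sketched from the MPEG-2 systems layout (function and variable names are mine):

```python
def decode_pcr(b):
    """Decode the 6-byte PCR field from a TS adaptation field:
    33-bit base (90 kHz), 6 reserved bits, 9-bit extension (27 MHz).
    Returns the clock value in 27 MHz ticks: base * 300 + extension."""
    base = (b[0] << 25) | (b[1] << 17) | (b[2] << 9) | (b[3] << 1) | (b[4] >> 7)
    ext = ((b[4] & 0x01) << 8) | b[5]
    return base * 300 + ext
```

A decoder compares successive PCR values against its own 27 MHz counter and slews its clock until the difference stops drifting; that is the essence of the recovery loop the papers below describe in detail.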
Below are another couple of good papers that explain how the clock recovery/synchronization process works.
https://www.soe.ucsc.edu/sites/default/files/technical-reports/UCSC-CRL-98-04.pdf
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.86.1016&rep=rep1&type=pdf
This final paper puts everything together nicely.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.50.975&rep=rep1&type=pdf
Remember: the PCR- and PTS/DTS-based audio/video synchronization is what makes digital TV broadcast so stringent, and it is far different from the streaming methods used on the Internet. It is crucial for 24x7 broadcasting to function.