VSE movie strips: time dilation with sawtooth curve - possibly source timestamp deviations?
VSE movie strips: time dilation with sawtooth curve at all possible frame rate settings and source materials (original title)
**System Information**
Operating system: Windows-10-10.0.18362-SP0 64 Bits
Graphics card: GeForce RTX 2060 SUPER/PCIe/SSE2 NVIDIA Corporation 4.5.0 NVIDIA 456.38
**Blender Version**
Broken: version: 2.90.0, branch: master, commit date: 2020-08-31 11:26, hash: `rB0330d1af29c0`
Worked: not known
**Short description of error**
Time dilation between the video and audio streams. It is non-linear, non-continuous and unreliable, but at least repeatable as long as the source and settings remain exactly the same.
The dilation follows curves, mostly a bipolar, curved sawtooth:
FOR EXAMPLE, video and audio keep sync for about an hour (to within about 1/60 s),
then over about 20 minutes, the video falls behind the audio by up to half a second,
then within about 5 seconds, video frames are massively skipped so that the video jumps ahead of the audio by about 4 seconds,
then over the next 2 hours, this difference accumulates further to a maximum of about 8 seconds, then falls slightly back to about 7.5 seconds.
But that's not where it ends, only where it starts.
Those curves seem to be highly dependent on the source material - its exact FPS (my graphics card's capture tools produce anything from 59.92 to 60.0 despite being configured for 60 FPS) and its exact length (I worked with lengths from 30 minutes to 3 hours).
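For scale, a quick sanity check of how much drift such an FPS mismatch alone would cause (a minimal sketch; the 59.92 figure is taken from the capture-tool observation above):

```python
# Drift caused by treating a 59.92 FPS source as exactly 60 FPS
# (59.92 is one of the capture rates observed above).
source_fps = 59.92
assumed_fps = 60.0
hour = 3600.0  # seconds of real time

frames_per_hour = hour * source_fps
timeline_seconds = frames_per_hour / assumed_fps
print(f"drift per hour: {hour - timeline_seconds:.2f} s")  # ~4.8 s
```

So a bare FPS mismatch would already account for several seconds per hour - the same order of magnitude as the offsets described above, though it cannot by itself explain the sudden jumps.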
Moreover, (A) the visualization in the VSE timeline, (B) the rendering in the VSE preview and (C) the rendering in the final output are three completely unrelated processes - each buggy in a different way, and not a single one of them correct.
Moreover, when different sections of the timeline are put to final rendering, the results are buggy in completely chaotic, differing ways. Thus, one and the same scene of the source material can result in indefinitely many variations of buggy time dilation between audio and video, including different sawtooth curves, depending on where within the rendered section that scene is located.
As far as my experiments go, I assume the audio stream is always rendered correctly - I never observed noticeable distortions in the audio stream, and the synchronization between multiple audio streams from different sources (including: (A) integrated in the source video, (B) extracted from the source video and then added in the VSE, (C) recorded with Audacity or GoldWave and then added in the VSE) is consistent, strictly linear and reliable. Only the video stream gets messed up, sometimes clearly visibly (as described in the example above with that sawtooth curve), and increasingly so the longer the video.
I tried to circumvent those glaring bugs by first separating the audio from the video source, then mixing the pure video and pure audio streams back together in the VSE. It didn't change a thing. The VSE always messes up the video stream. You may get away with relatively small errors as long as the video material is short. But as soon as you hit the 90-minute barrier, you get massive time dilation. And you cannot rely on what you see in the VSE preview, since the distortions in the final rendering seem completely independent of that preview.
I have now used Blender for a few months and successively raised my suspicions and control measures to detect and mitigate those time dilations, but now I am at a point where I give up. At first, I thought it was all my fault for having source material with deviating FPS. But even after stretching everything to the same FPS - integer 60 -, and even after separating video and audio streams prior to editing and painstakingly stretching everything into exact sync, to within a hundredth of a second at the sync marks, there is no prospect of combining those sources into something like 90 minutes of edited video without capers of distortion chaotically distributed in between.
**Exact steps for others to reproduce the error**
Take any video and audio source of > 60 minutes and try to produce an output video.
Check the resulting video! Do NOT check only at the very beginning and the very end - chances are the streams are in sync at exactly those endpoints while drifting apart in between. Check scenes along the whole video output!
My current measures for thorough checking consist of coupling sync tick sounds into the recording (A) video-visually, by having the tick sound open and playing in GoldWave on screen, (B) acoustically in the in-game audio stream, by recording the in-game audio from the mixer channel, (C) acoustically in the microphone recording, by putting the headphones around the microphone while those ticks play. I do that at least at the beginning and end of recording sessions, and for long recording sessions also in between. (A sketch for generating such a tick track follows after the list below.)
If you do add such tick sounds, you will notice at some points (which as far as I can tell are completely chaotic):
(A) that when you align the video and audio streams in the VSE in such a way that the ticks are perfectly aligned on the timeline, you will hear/see them desynced when you play the preview. AND when you put that timeline through output rendering, you get a resulting video with yet another desynchronization, which will be neither zero nor the same as the desynchronization you got in the preview.
(B) that when you, by successive correction, align the streams on the VSE timeline in such a way that you get perfect audible sync when playing the preview, you will see the streams visually desynchronized (the ticks in the audio stream will not coincide with the ticks passing by in the video stream). AND the resulting videos will still be desynced.
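For anyone reproducing this who wants a ready-made tick source: a minimal sketch (Python standard library only; the filename, tick length and pitch are arbitrary choices of mine) that writes a WAV file with a sharp tick at the start of every second:

```python
import math
import struct
import wave

# Generate a mono 16-bit WAV with a short 1 kHz tick at the start of
# every second; the rest is silence. All parameters are examples.
RATE = 48000      # samples per second
DURATION_S = 10   # total length of the file in seconds
TICK_S = 0.01     # each tick lasts 10 ms
TICK_HZ = 1000.0  # tick pitch

with wave.open("sync_ticks.wav", "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)  # 16-bit samples
    wav.setframerate(RATE)
    frames = bytearray()
    for n in range(RATE * DURATION_S):
        t = n / RATE
        if (t % 1.0) < TICK_S:  # inside this second's tick window
            sample = int(32767 * math.sin(2 * math.pi * TICK_HZ * t))
        else:
            sample = 0
        frames += struct.pack("<h", sample)
    wav.writeframes(bytes(frames))
```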
I can deliver .blend files of about 7xx000 bytes. The video material, of course, is very heavy at the lengths in question (> 60 min).
{F9377450}
This example project corresponds to the example in the "Short description of error" paragraph.
########################
2020-11-29: Additional thoughts on a POSSIBLE CAUSE...
Would it be possible that the source material has frames with individually different durations (or timestamps)? Is there some kind of time- or duration-stamping in the individual frames of a video stream (MP4 / H.264)? A quick search for that idea yielded no usable results. It would explain the fact that the dilations depend on the video source. For different video sources with equally skewed FPS values (something like 59.92, as above), I get different dilations, but those in themselves seem to be somewhat stable in relation to the output FPS setting in Blender.
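If MP4 / H.264 files do carry per-frame presentation timestamps (PTS) - which, as far as I could find out, container formats generally do - then the source material could be checked directly for irregular frame-to-frame gaps. A minimal sketch of such a check, using the third-party PyAV library; the filename, the nominal FPS and the 10% tolerance are assumptions of mine:

```python
import av  # third-party PyAV library (pip install av)

# Dump irregular gaps between the presentation timestamps (PTS) of the
# first video stream. Filename, nominal FPS and tolerance are examples.
NOMINAL_FPS = 60.0
TOLERANCE = 0.10  # flag deltas more than 10% off the nominal duration

nominal = 1.0 / NOMINAL_FPS
prev_time = None
with av.open("capture.mp4") as container:
    for frame in container.decode(video=0):
        if frame.pts is None:
            continue
        t = float(frame.pts * frame.time_base)  # PTS in seconds
        if prev_time is not None:
            delta = t - prev_time
            if abs(delta - nominal) > TOLERANCE * nominal:
                print(f"irregular gap at {t:.3f} s: delta = {delta * 1000:.2f} ms")
        prev_time = t
```

If this printed clusters of irregular gaps at the points where the sawtooth events occur, the theory below would gain a lot of weight.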
The theory that would tie all of this together is that Blender uses perfectly straight, uniform timing for its output, but receives source frames whose individual durations deviate from the overall FPS of the source.
IF it has something to do with individually different frame durations in the source, then there MUST be some data describing those durations or timestamps in the video stream, since a program like VLC or Windows Media Player CAN render the video source with correct timing (so that video and audio stay synchronized).
This would also explain the deviating FPS information I get from different tools for the same source video. For example, the video I took yesterday had an FPS value of "59.92" under detailed properties in the Windows file manager, but was consistently imported by Blender as "60.0" FPS (plain integer) - while USUALLY the FPS Blender sets by itself when importing a video source corresponds to the FPS shown in the file properties of that source. Something made the two "think" differently about the exact same video material.
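One possible reason for such disagreement: containers expose more than one frame-rate figure (ffprobe, for instance, reports both an r_frame_rate and an avg_frame_rate per stream), and different tools may read different ones. A minimal sketch to print both, assuming ffprobe is installed and using a placeholder filename:

```python
import json
import subprocess

# Print the two frame-rate figures ffprobe reports for the first video
# stream: r_frame_rate (nominal base rate) and avg_frame_rate (frame
# count divided by duration). Tools reading different fields disagree.
result = subprocess.run(
    ["ffprobe", "-v", "quiet", "-print_format", "json",
     "-show_streams", "-select_streams", "v:0", "capture.mp4"],
    capture_output=True, text=True, check=True)
stream = json.loads(result.stdout)["streams"][0]
print("r_frame_rate:  ", stream["r_frame_rate"])    # e.g. "60/1"
print("avg_frame_rate:", stream["avg_frame_rate"])  # e.g. "215712/3600"
```

If Blender reads one of these fields while the Windows shell derives the other, that alone would explain the 59.92-vs-60.0 discrepancy.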
And it would explain the fact that the different recordings get those chaotically distributed FPS values despite the recording FPS being configured to exactly 60. And it would explain those sawtooth events, where the video stream suddenly falls multiple seconds behind the audio stream for no other obvious reason: IF the recording system (GeForce Experience in my case) was slowed down by some other sudden activity on the system for a few seconds, but kept the timestamps of the frames in the produced video correct, and Blender neither knew of nor looked for those timestamps, the sawtooth would be perfectly consequential (the timestamp check sketched above should make such gaps visible).
Just an idea. Regrettably, I don't know anything about the internal structure of H.264 data so far...
Maybe this is going in the right direction:
https://gitlab.com/mbunkus/mkvtoolnix/-/issues/2085
https://grouper.ieee.org/groups/1722/contributions/2015/IEEE1722_H264_Timestamps.pdf