
Fix audaspace not reading ffmpeg files with start offset correctly
ClosedPublic

Authored by Sebastian Parborg (zeddb) on Jul 14 2021, 6:50 PM.

Details

Summary

The duration and start time for audio strips were not correctly read in audaspace.

Some codecs have a "lead in" section of audio that shouldn't be played back but only used to "warm up" the decoder.
Before this patch, we would play this lead in audio and thus the audio would not be in sync anymore.

Now the audio starts when the video starts and the duration should be correctly calculated with this in mind.

Diff Detail

Event Timeline

Sebastian Parborg (zeddb) requested review of this revision. Jul 14 2021, 6:50 PM
Sebastian Parborg (zeddb) created this revision.

Can you please send me a sample file where this issue happens, so that I can try/investigate it?

No investigation, only rubber stamp patches! ;)

Here is a simple file to clearly reproduce the issue:

Here is how it should look:

However without this patch the "lead in" audio will be used and the audio will be out of sync.

Note to see the audio properly you need this patch:
https://developer.blender.org/D11184

I'm not done with that patch, but if you want to give me feedback on the audaspace portion of that one as well, that would be great.

I do have other files that have audio desync as well, but I think that synthetic file showcases it the best.
I can of course give you additional files as well if you want, but those other files need more out-of-tree patches applied to play back correctly at the moment (non-audaspace related).

Checked file from T47135, and sound strip is one frame shorter. I can reproduce this also with my own video files. Perhaps use ceil instead of lround to calculate duration?

Right you are. I could have sworn that I tried T47135 before. But I guess I forgot to add a round in strip_add.

The reason for not using ceil is that usually the audio and video duration does not match.
This is because the audio sample lengths do not line up with the video frame duration.

This means that the audio track either undershoots or overshoots the video duration.
So doing ceil here will lead to the audio track being one frame too long in certain cases (even with the negative offset added previously)

Okay, so I took a more detailed look at the code and the sample file you sent.

You're basing the code on the start_time field of the two streams in the file. Now the code/header documentation of ffmpeg for this field states:

/**
 * Decoding: pts of the first frame of the stream in presentation order, in stream time base.
 * Only set this if you are absolutely 100% sure that the value you set
 * it to really is the pts of the first frame.
 * This may be undefined (AV_NOPTS_VALUE).
 * @note The ASF header does NOT contain a correct start_time the ASF
 * demuxer must NOT set this.
 */
int64_t start_time;

The values for the test file are 21 for the video stream and 0 for the audio stream which to me means that audio starts at 0 and video should start later. This made me wonder whether the issue here is indeed with audio or if it's actually the video stream that is problematic.

You wrote:

Some codecs have a "lead in" section of audio that shouldn't be played back but only used to "warm up" the decoder.

Where did you get this information? I wondered why ffmpeg wouldn't tell us that and indeed it could: AVPacket and AVFrame both have flags fields which can have either AV_PKT_FLAG_DISCARD or AV_FRAME_FLAG_DISCARD set. However, the first packets and frames in the audio stream do not have this flag set. This supports my suspicion that the audio is not the issue here.

As a final experiment I thought I would try to use ffmpeg (the command line tool) itself to convert the file to a wav, i.e. audio only file. Surely, if the first samples of audio were not to be played back, ffmpeg would remove them in this case. It didn't!

Thus I think this issue really needs to be solved on the video side.

Sorry, perhaps I wasn't clear and gave you too little information:

This is 100% not a video side issue. This is an issue of us not cropping audio at the start of the stream.

The example video and audio was created in blender and exported.
This issue was discovered because the imported result doesn't line up with the source data.
(The audio is offset by half a frame without this patch)

I also did the same thing with ffmpeg to confirm that the exported result matches (so there weren't any encoding errors when exporting in Blender).

Here is the source blend file:

AAC (and some other audio codecs) have a "lead in" period that is automatically added at the start of the audio track.
If you export to mkv, this "lead in" period is exposed and it is up to the video player to correctly line up the streams.

You can check this with ffmpeg by combining a wav file and a video file into a mkv file and encode the wav as aac.
You will notice when you then separate the audio track into wav again that there is padding added at the start of the audio file.

Even if there wasn't any lead in audio like this, we want to make sure that the audio lines up with the video if the start times are different.

We can't represent anything other than frames in the VSE, so we have to chop off audio or add even more artificial silence at the start of the audio to make them line up.
And if we did that, then the strip start positions would not line up in the VSE, which I don't think we should do just to display and play back garbage audio data.

My response was based on the assumption that ffmpeg provides us the necessary information, handles this correctly, and that we have found the necessary information in its data structures to handle it. That's apparently not the case.

So either we haven't found the proper fields yet on how to handle this file properly (though I have looked quite extensively in the ffmpeg provided data structures of the format, streams, etc) or there is a bug in ffmpeg that we should report upstream.

Certainly, at some point ffmpeg should either by itself discard those sound samples or at least tell us to do so.

I'm not saying that there are no bugs in the audio code, but there are also bugs in the video code. E.g. if the discard flags would be set for the audio data, there is currently no code that checks them and discards the audio.

This patch here is a workaround for the file provided, but it's unfortunately not the correct solution to the actual issue. Here are two files that Blender currently cannot handle properly (as I said, there are bugs), but which your patch makes even worse. Try playing them in VLC to see how they should look, and then add them in the VSE and play them back there.

These two files show us that chopping off audio at the beginning, if the starting timestamps for the two streams in the file differ, is the wrong solution. It works for your file, but not for these two.

I can reproduce issues with test2. It seems like you managed to violate my assumption that the audio stream always starts before the video.
I'm guessing that you did this by shifting the streams around with ffmpeg?

However test1 seems correct to me:

It seems like you simply shifted the audio forwards.
Without this patch we would desync with the video as the audio track would start at the same time as the video.

With this patch we are now in sync with the video, but don't playback the audio that started before the video.

While this doesn't line up with VLC or mpv, as those seem to play that audio in full and stall the video playback, we lack the ability to properly line up the video and audio strips in Blender.
This is because we can only align the strips to frames, so we can't have the audio strip visually start x pts before or after the video strip.

This is not an issue for video players, as they simply have to play back the data, not edit it.
And I guess that this lead in data is usually not noticeable.

Just to reiterate:

I'm proposing that to properly line up the audio, we should chop off the start of the audio stream if needed.
This seems to be fine from the video sample files Richard and I tested.
While we could lose some audio data, I don't think this is an issue in practice; I haven't run into any non-synthetic file where the audio data starts long before the video starts playing back.

Is this a flawless and ideal solution? No!
But I think this is a good solution to prevent audio desync.

I created this patch because the Blender studio discovered that the video edit of the movie they had in Blender had various levels of audio desync at the video source level.
This was because they had exported and re-imported strips during the creation of the movie, and because of this issue the audio had been shifted ever so slightly with each import and export.

Note that this patch is part of a bigger patch set, so I'm testing this in a branch where I have all of them applied, just to be clear:

D11916
D11917
D11918
D11919
D11920
D11921

I can reproduce issues with test2. It seems like you managed to violate my assumption that the audio stream always starts before the video.
I'm guessing that you did this by shifting the streams around with ffmpeg?

Correct, here is how I created those two files:

ffmpeg -i 0000-0191.mkv -i sound.wav -map 0:v:0 -map 1:a:0 -acodec aac -filter:v "setpts=PTS+2/TB" test1.mkv

ffmpeg -i 0000-0191.mkv -itsoffset 2 -i sound.wav -map 0:v:0 -map 1:a:0 -acodec aac test2.mkv

So in test1, video is offset by 2 seconds and in test2 audio is offset by two seconds.

However test1 seems correct to me:

It's not correct, there are two seconds of audio missing at the beginning!

The point is: one issue here is that we don't know how to exactly determine which parts of the audio stream need to be stripped away. Just cutting off any audio before the video stream starts is not the correct solution. We would have to discard the first few samples of the audio, but just using the video stream start time is the wrong approach, as this example shows, even if it's artificial. I think that this is actually a bug in ffmpeg that needs to be reported. If we are doing something wrong, we should at least be told so in the bug report.

It seems like you simply shifted the audio forwards.
Without this patch we would desync with the video as the audio track would start at the same time as the video.

As I said, these files don't work with and without the patch.

With this patch we are now in sync with the video, but don't playback the audio that started before the video.

While this doesn't line up with VLC or mpv, as those seem to play that audio in full and stall the video playback, we lack the ability to properly line up the video and audio strips in Blender.
This is because we can only align the strips to frames, so we can't have the audio strip visually start x pts before or after the video strip.

And this is where we can work on a better solution. I already wrote about the issue with Blender that frame based editing is simply not accurate enough for audio. A solution that's probably not ideal from a UI point of view, but simple to implement is the following: a new time offset for audio strips that allows editing of the audio start with subframe precision. All that would need to be done here is add the field to the audio strips (exposing it in the strip properties in the UI) on the Blender side and consider its value in the AUD_Sequence_add and AUD_SequenceEntry_move calls in sound.c.

With this it is then possible to properly set up the strips when loading a file with ffmpeg that has streams with certain offsets and possibly length differences, and correctly align them in the VSE. And if you want an interim solution for the ffmpeg bug (either on the ffmpeg or our side), you could also use this feature in the meantime to cut off the beginning of the audio. A major benefit here is that the user gets the possibility to edit this cutting off, if it's wrong.


To reiterate: there are two issues here. One is that some parts of an aac stream coming from ffmpeg should actually be discarded rather than played. And the second is that streams from files read with ffmpeg are wrongly placed.

Sebastian Parborg (zeddb) updated this revision to Diff 40271. (Edited) Aug 3 2021, 5:50 PM

I made it so that we can now query if we need to shift the audio strip forward to make it line up properly.

Any audio that starts before the video or on fractional frames when shifting the strip forward will be cut off:

Note: I've moved this commit to last in the stack, so you might need to apply the previous patches first for this one to apply properly.

I've created a branch for the fixes for easier testing: temp-VSE-fixes

Updated to fix an issue where I didn't check if the ffmpeg context start_time value was valid before using it.

@Joerg Mueller (nexyon) Just a heads up: if you don't have time to review this, we will merge this into master on Monday.
This is because it is a big showstopper for the Blender studio with regard to film editing, and my patch set was deemed very important and needs to get in ASAP.

If you don't have time, then we can discuss and refine this after the fact when you do.

Hi @Sebastian Parborg (zeddb),

Sorry for not answering earlier, I was (and still am) busy with life, since I just started a new job two weeks ago. I have limited time so I just quickly looked over the changes and have not tried the branch yet. I spent some time however looking at the ffmpeg bug tracker and possible solutions.

So let's start with the ffmpeg bug tracker, I created a ticket there: https://trac.ffmpeg.org/ticket/9375 but actually I'm not too confident that this will be solved any time soon since my search revealed similar bugs (if not the same!) that have been opened 8 years ago...

Looking at the blender source code to check out the video side of this, I figured that there actually is no place where the audio and video streams are both handled with ffmpeg; they are always separate... I still perceive this as an issue of the synchronization between the two streams, but there is no such place, so I understand that you're trying to fix this in the audio code. Of course this can also be solved on the video side. I know neither solution is pretty. I still ask you to please move this to the video side, for the reason that your changes are now touching many files and even change the API of audaspace (which is also published as a standalone library and not just used by Blender) because of this nuisance of ffmpeg.

Let me detail the changes I suggest; they give a solution that is not perfect, but should handle both issues (the aac priming issue and streams intentionally starting at different times) and is less intrusive with respect to the audaspace API:

  • The code that handles the stream offsets should move from FFMPEGReader.cpp to the startffmpeg function in anim_movie.c. This should get rid of the audaspace API change.
  • You already introduced offset_time for bSound (though I think this would be better put into Sequence) and correctly pass it to AUD_Sequence_add. This is what I meant to use for fixing the offset of audio not starting directly with a video frame.
  • You already have video_start_offset passing through SEQ_add_movie_strip and the start_offset of the anim struct, that passes the offset from basically startffmpeg to the add movie operator of the sequencer.
  • I now propose the following: in startffmpeg determine the actual offset between video and audio stream. This can be a fraction of a frame, a number of frames and of course a combination of that and it can be positive and negative. Return this value as seconds via the video_start_offset that you already introduced.
  • Actual strip placement: the offset is split into a frame count and the remaining fraction (basically div and mod FPS). The strips are then placed with the frame count offset. Depending on the sign of the offset, either the video or the audio stream will be first and the other one will be offset. Finally, the audio stream will get its offset_time set to the fractional part.

Doing it like this should ensure that any audio samples that do not fill a full frame are cut off at the beginning of the stream. This fixes any synchronization issues, the aac priming issue and both of the test files above should be loaded correctly.

Let me know your thoughts, especially if I'm getting anything wrong here!

Hi @Sebastian Parborg (zeddb),

Sorry for not answering earlier, I was (and still am) busy with life, since I just started a new job two weeks ago. I have limited time so I just quickly looked over the changes and have not tried the branch yet. I spent some time however looking at the ffmpeg bug tracker and possible solutions.

I figured as much, I simply just wanted to keep you in the loop so you are not surprised if we merged the changes on Monday.

So let's start with the ffmpeg bug tracker, I created a ticket there: https://trac.ffmpeg.org/ticket/9375 but actually I'm not too confident that this will be solved any time soon since my search revealed similar bugs (if not the same!) that have been opened 8 years ago...

Yeah, some paper-cut issues like these seem to have stuck around for a very long time. Let's hope this gets fixed eventually, at least.
But on the bright side of things, it did manage to shine a light on our poor stream synchronization in the VSE. :)

Looking at the blender source code to check out the video side of this, I figured that there actually is no place where the audio and video streams are both handled with ffmpeg; they are always separate... I still perceive this as an issue of the synchronization between the two streams, but there is no such place, so I understand that you're trying to fix this in the audio code. Of course this can also be solved on the video side. I know neither solution is pretty. I still ask you to please move this to the video side, for the reason that your changes are now touching many files and even change the API of audaspace (which is also published as a standalone library and not just used by Blender) because of this nuisance of ffmpeg.

My question is why this shouldn't be something you can query from audaspace?
Basically what you are suggesting is that we parse the audio stream data twice now. So once in audaspace and then again in blender just to get the start time synchronization correct.

To me, being able to provide this sort of data could be very useful if not vital from the audio side of things as well.
While it is not supported currently, I have had some thoughts about supporting containers with multiple audio streams in them in the future.
So we could for example import an mkv file with multiple audio tracks and have them added at once with the correct synchronization.
If we handle the synchronization purely on the blender side of things then I don't really see the point of using the ffmpeg loader in audaspace at all.

Now, I fully agree that my current approach is not at all the best as it pollutes the class structure with many "dummy" functions just to make this work.
But I'm thinking that even the other standalone programs that use audaspace would like to get the correct audio offsets when loading files with multiple audio streams.
Either by having audaspace automatically place them correctly in the internal playback timeline or providing information so the API user can themselves have all the data they need to do this.

Otherwise it seems a bit flawed to me.

If I were writing an audio program using audaspace (no video at all), then I would be quite annoyed if I had to write ffmpeg code and link the ffmpeg library into my program as well to be able to properly get synchronization between the audio streams.
To me, this would be like having to write my own ALSA/wasapi code to be able to properly initialize the audio output device in audaspace.
The user shouldn't have to be knowledgeable about ffmpeg or other audio libraries to be able to properly use audaspace.
They are probably using audaspace so they don't have to read up on and implement other audio libraries. I suspect that they expect a complete audio loading and playback package.

Don't you agree?

Something I should have probably asked earlier: why are you using aac in the first place? Vorbis and Opus are certainly equally good if not superior codecs, and as opposed to aac they are free. In case those video files are an intermediate product in the production pipeline, you may also want to consider a lossless codec, i.e. flac (rather than wav/pcm).

My question is why this shouldn't be something you can query from audaspace?
Basically what you are suggesting is that we parse the audio stream data twice now. So once in audaspace and then again in blender just to get the start time synchronization correct.

I can turn this argument around perfectly: Your patch suggests to parse the video stream data twice now. Once in audaspace and then again in blender just to get the start time synchronization correct.
Adding to that: Why should audaspace - an audio only library - care about the video streams in a file? Blender/the sequencer on the other hand cares about both. So this should be the place where synchronization happens.
Specifically for the original test file you provided, the audio stream starts at time 0, while the video stream has some delay. So without your patch here, audaspace does exactly what ffmpeg suggests.
This is the reason I keep asking you to handle this on the Blender side and not inside audaspace.

To me, being able to provide this sort of data could be very useful if not vital from the audio side of things as well.
While it is not supported currently, I have had some thoughts about supporting containers with multiple audio streams in them in the future.
So we could for example import an mkv file with multiple audio tracks and have them added at once with the correct synchronization.
If we handle the synchronization purely on the blender side of things then I don't really see the point of using the ffmpeg loader in audaspace at all.

Now, I fully agree that my current approach is not at all the best as it pollutes the class structure with many "dummy" functions just to make this work.
But I'm thinking that even the other standalone programs that use audaspace would like to get the correct audio offsets when loading files with multiple audio streams.
Either by having audaspace automatically place them correctly in the internal playback timeline or providing information so the API user can themselves have all the data they need to do this.

Otherwise it seems a bit flawed to me.

If I were writing an audio program using audaspace (no video at all), then I would be quite annoyed if I had to write ffmpeg code and link the ffmpeg library into my program as well to be able to properly get synchronization between the audio streams.
To me, this would be like having to write my own ALSA/wasapi code to be able to properly initialize the audio output device in audaspace.
The user shouldn't have to be knowledgeable about ffmpeg or other audio libraries to be able to properly use audaspace.
They are probably using audaspace so they don't have to read up on and implement other audio libraries. I suspect that they expect a complete audio loading and playback package.

Don't you agree?

I agree that there are missing features here. Currently audaspace only supports the first audio stream within a multimedia file. It would certainly be nice to have a way of accessing all audio streams within a file and consider their synchronization. This would require a different API though and the changes in this patch here are going in the wrong direction, adding functions to all possible classes in the library, while only audio files should be concerned. It makes sense to be able to consider the stream synchronization, but only once you can actually work with more than one stream. I'll gladly accept patches that cover this functionality. :)

Something I should have probably asked earlier: why are you using aac in the first place? Vorbis and Opus are certainly equally good if not superior codecs, and as opposed to aac they are free. In case those video files are an intermediate product in the production pipeline, you may also want to consider a lossless codec, i.e. flac (rather than wav/pcm).

Basically our animators like to share progress videos of their work and also to double check that the output in regards to lipsync is correct and looks natural in the exported video file.
Mostly because the exported video playback has smoother playback and no temporary hitching that can occur in the playback in blender.
When sharing videos online, a lot of video sites and chat clients only support mp4 video. The default codec for mp4 is AAC.

I can turn this argument around perfectly: Your patch suggests to parse the video stream data twice now. Once in audaspace and then again in blender just to get the start time synchronization correct.

That was what my first patch did. It doesn't anymore. I changed it after your suggestions to do so.

Now audaspace is ONLY working on the audio streams and nothing more.
The required video data is extracted when we read the video streams on the blender side of things.

So currently the audio and video stream data is only parsed once in their respective audio and video sections.
Audaspace doesn't know or care about the video stream in my current patch.
Therefore I think your counter argument is moot here.

I agree that there are missing features here. Currently audaspace only supports the first audio stream within a multimedia file. It would certainly be nice to have a way of accessing all audio streams within a file and consider their synchronization. This would require a different API though and the changes in this patch here are going in the wrong direction, adding functions to all possible classes in the library, while only audio files should be concerned. It makes sense to be able to consider the stream synchronization, but only once you can actually work with more than one stream. I'll gladly accept patches that cover this functionality. :)

Then do you have any suggestions on how this API would look like?
I don't think it is fair to block this functionality until multiple stream support is in as well.
It is something you would want sooner rather than later either way.
As stated before, this probably goes in on Monday if we don't find a better solution.

I could probably make the base reader class's getStartOffset function not pure virtual, and then only override it with an implementation that returns something non-zero in the ffmpegReader class?

Basically our animators like to share progress videos of their work and also to double check that the output in regards to lipsync is correct and looks natural in the exported video file.
Mostly because the exported video playback has smoother playback and no temporary hitching that can occur in the playback in blender.
When sharing videos online, a lot of video sites and chat clients only support mp4 video. The default codec for mp4 is AAC.

I see, so they are not actually using the exported video for editing, but just checking? Yeah, it's unfortunate that Vorbis/Opus are not as widely supported as AAC.

That was what my first patch did. It doesn't anymore. I changed it after your suggestions to do so.

Now audaspace is ONLY working on the audio streams and nothing more.
The required video data is extracted when we read the video streams on the blender side of things.

So currently the audio and video stream data is only parsed once in their respective audio and video sections.
Audaspace doesn't know or care about the video stream in my current patch.
Therefore I think your counter argument is moot here.

My mistake, I'm sorry! I didn't check the changes in the ffmpeg reader class thoroughly enough and still had your first version in mind. You are completely right. Now, if I'm right the changes currently in audaspace are not required to fix the aac issue in the original file. The changes only affect the second test file I created, where the audio stream starts later than the video. For the other files, the offset is 0.

Then do you have any suggestions on how this API would look like?
I don't think it is fair to block this functionality until multiple stream support is in as well.
It is something you would want sooner rather than later either way.
As stated before, this probably goes in on monday if we don't find a better solution.

I could probably make the base reader class's getStartOffset function not pure virtual, and then only override it with an implementation that returns something non-zero in the ffmpegReader class?

Side note: the "sooner rather than later" bit is funny, since I've introduced audaspace to Blender in 2009 (Blender 2.5) and until now I never saw anyone complaining about this feature missing. xD

Since this feature is not crucial for the issue you are trying to solve until Monday, I would ask you to please separate the changes to audaspace from the changes in Blender. The changes with the offset_time should fix the aac stream issue for now. You could set sound_info->start_offset to 0 for now.

We can solve this issue with the audio starting later together with the audio actually starting earlier, which is not supported by this patch yet either. This solution will be a combination of querying the start times of the audio and video streams and computing proper starting frames for the sequences. For that I suggest that I implement the necessary changes based on this patch in audaspace, to support both the start_time and the stream selection, so that you don't have to spend too much more time on this.

One question regarding the current computations in the patch: why do you subtract m_formatCtx->start_time from the start_offset? I saw the same computation in the video code on the Blender side. I have only one file where m_formatCtx->start_time is non-zero, and it's again the second of my test files where the audio starts later. The start_time in there is the start_time of the video stream (from which it apparently got copied). You could just not subtract it for both streams and the result would be fine, right? In any case we have to align the video stream to an actual frame number. I suggest dropping this subtraction, unless there is a reason I'm overlooking. The start_time values of the streams give us the correct PTS where the streams start, and the start_time of the context can just be used by video players if they don't check the values of the streams, but we (will) do that.

I see, so they are not actually using the exported video for editing, but just checking? Yeah, it's unfortunate that Vorbis/Opus are not as widely supported as AAC.

They imported the exported videos as well into blender to do some quick editing and mashing together of WIP videos to do some brainstorming.
However they basically gave up on doing that because of all the audio and video sync issues. (That my patch set aims to remedy)

My mistake, I'm sorry! I didn't check the changes in the ffmpeg reader class thoroughly enough and still had your first version in mind. You are completely right. Now, if I'm right the changes currently in audaspace are not required to fix the aac issue in the original file. The changes only affect the second test file I created, where the audio stream starts later than the video. For the other files, the offset is 0.

There are more fixes in this patch besides the offset issue. There are also seeking and duration related fixes. (Which is bad of me to bake in here, I know. But they are needed as well. -_-)

Side note: the "sooner rather than later" bit is funny, since I've introduced audaspace to Blender in 2009 (Blender 2.5) and until now I never saw anyone complaining about this feature missing. xD

Well, this is the issue: the VSE has been nearly unusable for any serious work for a very long time.
I started working on fixes like these because even our own studio employees couldn't use it to edit together simple videos.

Another great example is that the new PulseAudio backend is also completely busted, sometimes leading to no playback at all (playhead not advancing, stuck), and there are backend-related audio desync issues as well.

In most cases people simply give up and won't use Blender or the VSE as things are obviously not working correctly.
It was like pulling teeth to actually get people to try to use the VSE during our production and I can see why.
In many areas it was simply not usable, but hopefully by the end of this movie production it will at least be in a usable state where you can trust it to play back and export files correctly, so artists don't constantly have to second-guess themselves.

Since this feature is not crucial for the issue you are trying to solve until Monday, I would ask you to please separate the changes to audaspace from the changes in Blender. The changes with the offset_time should fix the AAC stream issue for now. You could set sound_info->start_offset to 0 for now.

We can solve this issue with the audio starting later together with the audio actually starting earlier, which is not supported by this patch yet either. This solution will be a combination of querying the start times of the audio and video streams and computing proper starting frames for the sequences. For that I suggest, that I'll implement the necessary changes based on this patch in audaspace, to support both the start_time and the stream selection, so that you don't have to spend too much more time on this.

If we agree that we should have a sound_info->start_offset, why can't we just keep the placeholder implementation that we have now?
This code is currently local to the Blender source tree either way, so if I clean up the new class function and make it not pure virtual, it should not really touch any of the other reader classes directly.

Then you can replace it when you have a working implementation of the future api.

But if you insist, then I could just remove the getStartOffset functions, merge the other changes, and leave the rest to you?
(Thanks for helping out! :) )

To give you full context:
We discussed the audio strip start offset internally in the studio, and the consensus was that, if there is any audio before the video stream starts, we should just cut off that leading audio and have the audio start at the same time as the video.
There are also talks about, by default, having the audio and video duration be the same even if the audio ends before the video.

This is because a lot of our users seem to get confused when the imported strips don't start and end at the same time, even though we show the actual data truthfully.

However, I guess this could be made into an import option toggle or something, but that has to be discussed further.

One question regarding the current computations in the patch: Why do you subtract m_formatCtx->start_time from the start_offset? I saw the same computation in the video code on the Blender side. I have only one file where m_formatCtx->start_time is non-zero, and it's again the second of my test files where the audio starts later. The start_time in there is the start_time of the video stream (from which it apparently got copied). You could just not subtract it for both streams and the result would be fine, right? In any case we have to align the video stream to an actual frame number. I suggest dropping this subtraction, unless there is a reason I'm overlooking. The start_time values of the streams give us the correct PTS where the streams start, and the start_time of the context can just be used by video players if they don't check the values of the streams, but we (will) do that.

We have to subtract the overall start time of the context to get the relative offset.
Otherwise, if you add a file that has a ctx_start_time of 30 minutes (yes, we have example test files with this from TV broadcasts), then we pass around meaningless numbers, as the actual content is not supposed to be played 30 minutes into the future.
If we don't do this then the only way to find that out is to look at all of the streams as a whole to figure out which one starts first.

Or for example if I just use audaspace to import a sound strip from a video file.
If the reported offset I get is 30 min then I have no way to tell where that sound strip is truly meant to start in relation to other audio channels in the file.
I have no other stream from the file to compare to as that part is abstracted away, so there is no way for me to tell at that point what that offset means.

You could of course load other channels as well to figure this out, but that seems just wasteful to me when the container neatly provides this information, which we can convert to a relative offset.
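To make the arithmetic concrete, here is a minimal sketch of the relative-offset idea described above (hypothetical helper names and values, not the audaspace or FFmpeg API; in FFmpeg the container's start_time is in AV_TIME_BASE units of microseconds, while each stream's start_time is in that stream's own time_base):

```python
# A simplified model of the relative-offset computation: subtract the
# container's start time from a stream's start time, so the earliest
# stream in the file starts at 0.

AV_TIME_BASE = 1_000_000  # FFmpeg container start_time is in microseconds


def relative_start_offset(ctx_start_time, stream_start_pts, tb_num, tb_den):
    """Offset of a stream relative to the container start, in seconds.

    ctx_start_time: container start_time in AV_TIME_BASE units.
    stream_start_pts: stream start_time in units of its time_base tb_num/tb_den.
    """
    return stream_start_pts * tb_num / tb_den - ctx_start_time / AV_TIME_BASE


# TV-broadcast-style file whose content begins 30 minutes in: the video
# stream starts at exactly 30:00, the audio half a second later, and the
# audio stream's time_base is 1/48000.
ctx_start = 30 * 60 * AV_TIME_BASE
audio_pts = (30 * 60) * 48000 + 24000  # 30:00.5 in 1/48000 ticks
print(relative_start_offset(ctx_start, audio_pts, 1, 48000))  # -> 0.5
```

With the subtraction in place the audio strip gets a meaningful 0.5 s offset relative to the video, instead of an absolute 30-minute timestamp.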

This revision was not accepted when it landed; it landed in state Needs Review.Aug 16 2021, 3:15 PM
This revision was automatically updated to reflect the committed changes.

Well, this is the issue: the VSE has been nearly unusable for any serious work for a very long time.
I started working on fixes like these because even our own studio employees couldn't use it to edit together simple videos.

Another great example is that the new PulseAudio backend is also completely busted, sometimes leading to no playback at all (playhead not advancing, stuck), and there are backend-related audio desync issues as well.

In most cases people simply give up and won't use Blender or the VSE as things are obviously not working correctly.
It was like pulling teeth to actually get people to try to use the VSE during our production and I can see why.
In many areas it was simply not usable, but hopefully by the end of this movie production it will at least be in a usable state where you can trust it to play back and export files correctly, so artists don't constantly have to second-guess themselves.

Rather than giving up, it would be better to report the bugs or comment on existing reports. Often I'm waiting for users to respond as well. It would be nice to get more feedback this way. The pulseaudio backend is new and thus I'd say some starting problems are expected. I know that there are some issues, but I'm having a hard time reproducing them, so more help is always welcome!

If we agree that we should have a sound_info->start_offset, why can't we just keep the placeholder implementation that we have now?
This code is currently local to the Blender source tree either way, so if I clean up the new class function and make it not pure virtual, it should not really touch any of the other reader classes directly.

Then you can replace it when you have a working implementation of the future api.

But if you insist, then I could just remove the getStartOffset functions, merge the other changes, and leave the rest to you?
(Thanks for helping out! :) )

The way you committed it is alright, it's at least not touching as many files!

To give you full context:
We discussed the audio strip start offset internally in the studio, and the consensus was that, if there is any audio before the video stream starts, we should just cut off that leading audio and have the audio start at the same time as the video.
There are also talks about, by default, having the audio and video duration be the same even if the audio ends before the video.

This is because a lot of our users seem to get confused when the imported strips don't start and end at the same time, even though we show the actual data truthfully.

However, I guess this could be made into an import option toggle or something, but that has to be discussed further.

That would be a sensible option, but in case the video stream is longer than audio, it would need to be cut as well.

We have to subtract the overall start time of the context to get the relative offset.
Otherwise, if you add a file that has a ctx_start_time of 30 minutes (yes, we have example test files with this from TV broadcasts), then we pass around meaningless numbers, as the actual content is not supposed to be played 30 minutes into the future.
If we don't do this then the only way to find that out is to look at all of the streams as a whole to figure out which one starts first.

Or for example if I just use audaspace to import a sound strip from a video file.
If the reported offset I get is 30 min then I have no way to tell where that sound strip is truly meant to start in relation to other audio channels in the file.
I have no other stream from the file to compare to as that part is abstracted away, so there is no way for me to tell at that point what that offset means.

You could of course load other channels as well to figure this out, but that seems just wasteful to me when the container neatly provides this information, which we can convert to a relative offset.

Is there a way I could get such a sample file? Your explanation doesn't make complete sense to me. For example, if you have a file where stream 0 starts at 30:00 and stream 1 starts at 10:00 and you're just importing stream 0, you would have it start at 20:00, since ctx_start_time would be 10:00 and this is subtracted from the 30:00 with the current implementation. Wouldn't it make more sense to subtract the minimum offset of those streams in the file that are actually loaded, instead of subtracting ctx_start_time? Also, maybe there are use cases where you would want the placement exactly as it is reported in the file? E.g. you have a broadcast that is separated into multiple video files with correct start times and you want to edit them together. So you add all video files in the sequencer and want them to consider the offsets. Maybe this should be another loading option (with the default set to remove the offset?)?
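The alternative suggested here, subtracting the minimum start time of only the streams that are actually loaded, could look roughly like this sketch (hypothetical helper, not part of audaspace):

```python
# Subtract the minimum start time of only the streams being imported, so
# an unloaded stream (e.g. a subtitle track with an earlier start time)
# cannot shift the offsets of the imported strips.

def relative_offsets(loaded_stream_starts):
    """Map absolute start times (in seconds) of the loaded streams to
    offsets relative to the earliest loaded stream."""
    base = min(loaded_stream_starts)
    return [start - base for start in loaded_stream_starts]


# Stream 0 (video) starts at 30:00, stream 1 (audio) at 10:00; only these
# two are loaded, so the audio lands at 0 and the video 20 minutes later.
print(relative_offsets([1800.0, 600.0]))  # -> [1200.0, 0.0]
```

The difference from subtracting ctx_start_time is only visible when a stream that is not imported (such as a subtitle track) happens to be the earliest one in the container.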

Rather than giving up, it would be better to report the bugs or comment on existing reports. Often I'm waiting for users to respond as well. It would be nice to get more feedback this way. The pulseaudio backend is new and thus I'd say some starting problems are expected. I know that there are some issues, but I'm having a hard time reproducing them, so more help is always welcome!

That is what I tell them as well, but the sad reality is that they are working with tight deadlines (which is another issue in itself), so when they encounter an issue the first thought is not "how can I reproduce and report this?" but "how can I work around this issue?". In a few cases the workaround is to avoid that area in Blender entirely.

Note that I'm not defending this behavior, I'm just stating what I have observed.
I'll try to find time to figure out what is going on with the pulseaudio issue so I can share this with you and figure out how to fix it.

That would be a sensible option, but in case the video stream is longer than audio, it would need to be cut as well.

Well the idea seemed to be to treat the video as the "ground truth" so if the audio is shorter, then the audio strip is simply padded with silence/no data.
Like here:

Note that this is inevitable even with video as we can't really tell in some cases when the strip data actually ends before decoding everything.
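A minimal sketch of that padding idea (assumed behavior as discussed here, not the committed implementation):

```python
# Treat the video stream as ground truth: if the audio strip is shorter,
# append enough silent frames that both strips end on the same frame.

def silence_padding(video_frames, audio_frames):
    """Number of silent frames to append to the audio strip so it matches
    the video strip's length (never negative)."""
    return max(0, video_frames - audio_frames)


print(silence_padding(250, 240))  # -> 10
print(silence_padding(240, 250))  # -> 0 (longer audio would need trimming instead)
```

As noted above, the case where the video is longer than the audio is the easy one; the reverse would need the audio strip to be cut to the video length as well.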

Is there a way I could get such a sample file?

It's not something that a Jedi would give you... ;)

Here is the example report with the example file: https://developer.blender.org/T87967

I do notice though that the example file now neatly showcases the problem that you pointed out! :D
There is a subtitle track in that file that makes my current approach position the audio out of sync with the video.

Your explanation doesn't make complete sense to me. For example, if you have a file where stream 0 starts at 30:00 and stream 1 starts at 10:00 and you're just importing stream 0, you would have it start at 20:00, since ctx_start_time would be 10:00 and this is subtracted from the 30:00 with the current implementation. Wouldn't it make more sense to subtract the minimum offset of those streams in the file that are actually loaded, instead of subtracting ctx_start_time?

I should clarify my approach a bit:

The context start time is the start time of the earliest stream in the file.
So for me it is logical to subtract this from all the start times: then we can easily tell which stream/strip is at the very start, as its start time will be zero.

However this doesn't always help with correctly lining up the strips as you pointed out and what the example file shows.
But I think it is a good practice regardless as then at least we have a relative start value that always starts at zero.
Note that we can still use this relative offset to line up streams correctly, we have just "trimmed off the fat" in regards to time quantities.

But as shown by the example file, more work needs to be done to get this right. :)
However we already have the relative start time from the video and audio strip, so we should be able to fix this with the current API now, no?

Also, maybe there are use cases where you would want the placement exactly as it is reported in the file? E.g. you have a broadcast that is separated into multiple video files with correct start times and you want to edit them together. So you add all video files in the sequencer and want them to consider the offsets. Maybe this should be another loading option (with the default set to remove the offset?)?

To me that sounds more like a batch operation, so you load multiple video files and they are appended after each other.
In that case it shouldn't really matter what internal start time they have, as they could simply be added one after another without this information, right?

That is what I tell them as well, but the sad reality is that they are working with tight deadlines (which is another issue in itself), so when they encounter an issue the first thought is not "how can I reproduce and report this?" but "how can I work around this issue?". In a few cases the workaround is to avoid that area in Blender entirely.

Note that I'm not defending this behavior, I'm just stating what I have observed.
I'll try to find time to figure out what is going on with the pulseaudio issue so I can share this with you and figure out how to fix it.

I just committed two fixes for pulseaudio, I'm not sure if they help, please check out a recent build of master (3.0.0 alpha), e.g.: https://builder.blender.org/download/experimental/blender-3.0.0-alpha+master.a217e043be2d-linux.x86_64-release.tar.xz

I should clarify my approach a bit:

The context start time is the start time of the earliest stream in the file.
So for me it is logical to subtract this from all the start times: then we can easily tell which stream/strip is at the very start, as its start time will be zero.

However this doesn't always help with correctly lining up the strips as you pointed out and what the example file shows.
But I think it is a good practice regardless as then at least we have a relative start value that always starts at zero.
Note that we can still use this relative offset to line up streams correctly, we have just "trimmed off the fat" in regards to time quantities.

But as shown by the example file, more work needs to be done to get this right. :)
However we already have the relative start time from the video and audio strip, so we should be able to fix this with the current API now, no?

Well, give me a week or two and I'll have a patch ready for you that shows what behavior I'm talking about all the time and let's see what issues that actually has or if it's a good solution. :)

To me that sounds more like a batch operation, so you load multiple video files and they are appended after each other.
In that case it shouldn't really matter what internal start time they have, as they could simply be added one after another without this information, right?

And how would you do that? With a script, where you know the list of files in order? If someone who doesn't code wants to do it and he can select multiple files in the file selector, how would Blender know which order the files should be added? Or does the user have to name them so that sorting them by name produces the correct order? Also what if you don't want to have them after each other, but there are for example some gaps in there, like advertising for example in a broadcast, that you don't want to edit, but you want the gaps where they were? All I'm saying is that I can still see reasons to use the start time and a small checkbox with which you could consider the start time can't hurt, right? Again, I'll have this in my patch, so as I said, give me a week or two! :)

I just committed two fixes for pulseaudio, I'm not sure if they help, please check out a recent build of master (3.0.0 alpha), e.g.: https://builder.blender.org/download/experimental/blender-3.0.0-alpha+master.a217e043be2d-linux.x86_64-release.tar.xz

Ah, thanks! :)

Sadly, we are still having issues where the internal playback location and the location in Blender seemingly desync with the PulseAudio backend.
When it does this, sound continues to play even after you have stopped playback for a while.
It seems to play around 1-2 seconds of audio after we have hit stop (and that is the amount it has drifted apart).

Sadly, that issue only seems to happen after 5-10 minutes of usage.

The "no playback at all" is also still happening and I can reproduce that on a few other computers still here at the studio.
I'll see if I can look into this further later.

Well, give me a week or two and I'll have a patch ready for you that shows what behavior I'm talking about all the time and let's see what issues that actually has or if it's a good solution. :)

Sounds good :)

And how would you do that? With a script, where you know the list of files in order? If someone who doesn't code wants to do it and he can select multiple files in the file selector, how would Blender know which order the files should be added? Or does the user have to name them so that sorting them by name produces the correct order? Also what if you don't want to have them after each other, but there are for example some gaps in there, like advertising for example in a broadcast, that you don't want to edit, but you want the gaps where they were? All I'm saying is that I can still see reasons to use the start time and a small checkbox with which you could consider the start time can't hurt, right? Again, I'll have this in my patch, so as I said, give me a week or two! :)

From what I've seen, the start times are not that reliable with streams, so you can't really count on them to line up.
I'm guessing that they simply automate this and reset the start time when it gets too large. So a TV program could start with a start time of 1h:20min, and then in the middle it would reset to 0 for that chunk.

Sadly, we are still having issues where the internal playback location and the location in Blender seemingly desync with the PulseAudio backend.
When it does this, sound continues to play even after you have stopped playback for a while.
It seems to play around 1-2 seconds of audio after we have hit stop (and that is the amount it has drifted apart).

Sadly, that issue only seems to happen after 5-10 minutes of usage.

I know what that is and this is on my list to be fixed, so don't bother investigating it.

The "no playback at all" is also still happening and I can reproduce that on a few other computers still here at the studio.
I'll see if I can look into this further later.

Still? :-/ Can you tell me which distribution those computers are running?

From what I've seen, the start times are not that reliable with streams, so you can't really count on them to line up.
I'm guessing that they simply automate this and reset the start time when it gets too large. So a TV program could start with a start time of 1h:20min, and then in the middle it would reset to 0 for that chunk.

Well, I'm just hypothesising here. I'm just an amateur video editor who only works with videos from GoPros, phones, screen captures and such stuff, so I haven't encountered any video with streams that have a start time other than basically 0. But I can still think that under some (rare?) circumstances someone would want to be able to keep those start times when importing into Blender, so why not give them the option?

I know what that is and this is on my list to be fixed, so don't bother investigating it.

Nice! Thanks for the heads up :)

The "no playback at all" is also still happening and I can reproduce that on a few other computers still here at the studio.
I'll see if I can look into this further later.

Still? :-/ Can you tell me which distribution those computers are running?

I've been able to reproduce this on two PopOS computers at the office and my own Gentoo Linux computer.

At first I thought it was a pipewire issue as that is what my computer is running.
However the other two PopOS computers at the office that had this issue did not run pipewire, so then that theory went out of the window.

It is very temperamental as sometimes playback works temporarily, but then it stops working again.

Well, I'm just hypothesising here. I'm just an amateur video editor who only works with videos from GoPros, phones, screen captures and such stuff, so I haven't encountered any video with streams that have a start time other than basically 0. But I can still think that under some (rare?) circumstances someone would want to be able to keep those start times when importing into Blender, so why not give them the option?

Basically to try to prevent bloat and option overload, we have a rule that we shouldn't add features where the motivation is "MAYBE someone will find it useful".
We have to present an actual real use case that at least someone will use. Of course you could manipulate files to create a use case, but I haven't really seen any files that would work with this in the wild.
Just as you pointed out, it is only in very rare cases like streaming rips that I've seen a start time that is not zero. And in those cases I've seen that the times are not always useful and can be quite random.

We also try to not duplicate functionality. So we try to create tools that you can create a pipeline workflow out of, we shouldn't try to create a tool that is a swiss army knife.
Basically the "UNIX philosophy" but for features and tools, kinda. Now of course you can point at a lot of things in Blender and say that this is not the case, but we strive to at least try to keep things dumb and simple.

Currently I feel that:

  1. We are theorizing about possible use cases, not actually having a real user that will use this.
  2. There are already tools that users can use to line up strips. So adding this functionality to the loading mechanism seems a bit out of place to me. Also because this breaks the expectation that strips are added at the timeline cursor position.

I've been able to reproduce this on two PopOS computers at the office and my own Gentoo Linux computer.

At first I thought it was a pipewire issue as that is what my computer is running.
However the other two PopOS computers at the office that had this issue did not run pipewire, so then that theory went out of the window.

It is very temperamental as sometimes playback works temporarily, but then it stops working again.

Weird; since it happens with both PulseAudio and PipeWire, it certainly has to be a bug in my code, but I cannot reproduce it at all. If you could look into it, that would be nice! Or it would already be great to find a way for me to reproduce it.

Basically to try to prevent bloat and option overload, we have a rule that we shouldn't add features where the motivation is "MAYBE someone will find it useful".
We have to present an actual real use case that at least someone will use. Of course you could manipulate files to create a use case, but I haven't really seen any files that would work with this in the wild.
Just as you pointed out, it is only in very rare cases like streaming rips that I've seen a start time that is not zero. And in those cases I've seen that the times are not always useful and can be quite random.

We also try to not duplicate functionality. So we try to create tools that you can create a pipeline workflow out of, we shouldn't try to create a tool that is a swiss army knife.
Basically the "UNIX philosophy" but for features and tools, kinda. Now of course you can point at a lot of things in Blender and say that this is not the case, but we strive to at least try to keep things dumb and simple.

Currently I feel that:

  1. We are theorizing about possible use cases, not actually having a real user that will use this.
  2. There are already tools that users can use to line up strips. So adding this functionality to the loading mechanism seems a bit out of place to me. Also because this breaks the expectation that strips are added at the timeline cursor position.

Well, it's a minor option that could be added with what I'm planning to do, no issue if that's not part of it. Unfortunately, I have to ask you to bear with me a little longer until I get to implement it, my time is very limited currently.