
Improve proxy building performance
AbandonedPublic

Authored by Richard Antalik (ISS) on Feb 11 2021, 6:31 AM.

Details

Summary
Principle of operation

The proxy rebuild job spawns 2 threads responsible for reading packets
from the source file and for writing transcoded packets to the output file.
This is done by the functions index_ffmpeg_read_packets() and
index_ffmpeg_write_frames(). These threads work rather quickly and
don't use much CPU.
Transcoding of the read packets is done in a thread pool by the function
index_ffmpeg_transcode_packets().

This scheme is used because transcoded packets must be read and written
in order, as if they were transcoded in one single loop. The transcoding
itself can happen relatively asynchronously (see next paragraph).

Because decoding must always start on an I-frame, each GOP is fed to a
transcoding thread as a whole. Some files may not have enough GOPs to
feed all threads; in such a case the performance gain won't be as great,
but this is a relatively rare case.
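
The GOP chunking described above can be sketched independently of FFmpeg. In this illustrative sketch (hypothetical helper, not the patch's actual code) keyframes are modeled as a boolean flag per packet; in the real code this information would come from the packet's keyframe flag:

```c
#include <stdbool.h>
#include <stddef.h>

/* Assign each packet a GOP chunk index: a new chunk starts at every
 * keyframe (I-frame), since decoding must begin on an I-frame.
 * Returns the total number of chunks, which bounds how many
 * transcoding threads can be kept busy. */
size_t assign_gop_chunks(const bool *is_keyframe, size_t num_packets, int *chunk_of_packet)
{
  int chunk = -1;
  for (size_t i = 0; i < num_packets; i++) {
    if (is_keyframe[i] || chunk == -1) {
      chunk++; /* Start a new GOP chunk on each I-frame. */
    }
    chunk_of_packet[i] = chunk;
  }
  return (size_t)(chunk + 1);
}
```

A file whose packet stream yields fewer chunks than threads is exactly the "not enough GOPs" case mentioned above.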

According to the FFmpeg docs, some packets may contain multiple frames,
but in such a case only the first frame is decoded. I am not sure if this
is a limitation of FFmpeg or if it is possible to decode these frames, but
the previous proxy building implementation didn't handle this case either.

Similarly to the above, there is an assumption that decoding any number
of packets in a GOP chunk produces the same number of output packets.
This must always be true, otherwise we couldn't map proxy frames to the
original perfectly. Therefore it should be possible to increment the
input and output packet containers independently, and one of them can be
manipulated "blindly". For example, decoded frames sometimes lag behind
packets by 1, 2, or more steps, and sometimes they are output
immediately; this depends on the codec. But the number of packets fed to
the decoder must match the number of frames received.
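
This packets-in/frames-out invariant with codec-dependent lag can be modeled with a tiny simulation (toy types and names, not the patch's code): a decoder with a fixed reordering delay still emits exactly as many frames as it was fed packets, once it is flushed at the end of a chunk.

```c
/* Toy decoder with a fixed delay: each packet pushed in comes out
 * `lag` steps later; flushing drains the remainder. Demonstrates why
 * input and output indices can be advanced independently: outputs lag
 * behind inputs, but the totals must match in the end. */
typedef struct ToyDecoder {
  int lag;        /* codec-dependent delay, in packets */
  int buffered;   /* packets currently inside the decoder */
  int frames_out; /* frames produced so far */
} ToyDecoder;

/* Feed one packet; returns 1 if a frame was produced immediately. */
int toy_send_packet(ToyDecoder *d)
{
  d->buffered++;
  if (d->buffered > d->lag) {
    d->buffered--;
    d->frames_out++;
    return 1;
  }
  return 0; /* Frame is lagging behind; it will come out later. */
}

/* Drain remaining frames at the end of a GOP chunk; returns the count. */
int toy_flush(ToyDecoder *d)
{
  int drained = d->buffered;
  d->frames_out += drained;
  d->buffered = 0;
  return drained;
}
```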

Transcoding contexts are allocated only when the building process starts,
because these contexts use a lot of RAM. avcodec_copy_context() is used
to allocate the input and output codec contexts, which have to be unique
for each thread. The SwsContext also needs to be unique, but it is not
copied, because it is needed only for transcoding.

Job coordination

If the output file cannot be written to disk fast enough, transcoded
packets accumulate in RAM, potentially filling it up completely. This
isn't much of a problem on an SSD, but on an HDD it can easily happen.
Therefore packets are read in sync with the packets written, with some
lookahead.
When building all 4 sizes for a 1080p movie, writing speed averages
80 MB/s.

During operation, packets are read in advance. The lookahead is the
number of GOPs to read ahead. This is needed because all transcoding
threads must have packets to decode, and each thread works on a whole
GOP chunk.
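
This back-pressure rule reduces to a simple predicate (sketch with assumed names): the reader may proceed only while it is no more than `lookahead` GOP chunks ahead of the writer, otherwise it suspends on its condition variable until the writer signals progress.

```c
#include <stdbool.h>

/* Reader gate: with e.g. lookahead = num_threads + margin, every
 * transcoding thread has a GOP chunk to work on, while RAM usage stays
 * bounded even when the output drive (HDD) is slow. */
bool reader_may_read_gop(int gops_read, int gops_written, int lookahead)
{
  return (gops_read - gops_written) < lookahead;
}
```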

Jobs are suspended when needed using thread conditions and wake signals.
Threads suspend on their own and are resumed in a ring scheme:

read_packets -> transcode -> write_packets
    ^                              |
    |______________________________|

In addition, when any of the threads above is done or cancelled, it will
resume the building job to free data and finish the building process.
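
The ring resume scheme can be sketched as a minimal standalone example (not the patch's actual code) using one mutex, one condition variable, and a "turn" token passed read → transcode → write → read:

```c
#include <pthread.h>

enum { STAGE_READ = 0, STAGE_TRANSCODE = 1, STAGE_WRITE = 2, NUM_STAGES = 3 };

static pthread_mutex_t ring_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ring_cond = PTHREAD_COND_INITIALIZER;
static int ring_turn = STAGE_READ;

/* Each stage suspends until it is its turn, does its work, then wakes
 * the next stage in the ring. A broadcast is used so every waiting
 * stage re-checks whose turn it is. */
void ring_run_stage(int stage, void (*work)(void))
{
  pthread_mutex_lock(&ring_mutex);
  while (ring_turn != stage) {
    pthread_cond_wait(&ring_cond, &ring_mutex);
  }
  if (work) {
    work();
  }
  ring_turn = (stage + 1) % NUM_STAGES;
  pthread_cond_broadcast(&ring_cond);
  pthread_mutex_unlock(&ring_mutex);
}
```

In the real job the stages loop and only yield the token when their queues are full or empty, but the wait/signal shape is the same.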

Performance

On my machine (16 cores) the building process is about 9x faster.
Before I introduced job coordination, transcoding was 14x faster, so
there is still some room for optimization; perhaps the wakeup frequency
is too high, or threads are put to sleep unnecessarily.


Code layout

I am using FFmpegIndexBuilderContext as the "root" context for storing contexts.
The transcode job is wrapped in TranscodeJob because I need to pass the thread number, which determines which GOP chunks the job will work on.
output_packet_wrap and source_packet_wrap wrap AVPacket with some additional information, like the GOP chunk number (currently i_frame_segment).
These 2 structs could be consolidated, which would simplify some auxiliary logic. This is a bit tricky, because sometimes output_packet_wrap must
lag one step behind source_packet_wrap, and this needs to be managed properly when jumping between GOP chunks.

Other than that, I am not super happy with the amount of code and sheer setup this patch adds, but it doesn't look like anything could be simplified significantly.

Problems / TODO

Codec contexts use a lot of RAM. On machines with SMT we are wasting a lot of space, because SMT doesn't boost performance by much, but 2x more jobs require 2x more RAM. 2K footage with 32 running jobs can use 8 GB of RAM. On the other hand, the ffmpeg executable uses the same amount of memory, and I don't think it is possible to detect whether a machine uses SMT or not. Currently it is assumed that it does, and the number of jobs is only half of the reported logical CPUs.
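
The thread-count heuristic described above amounts to the following (hypothetical helper name, illustrating the rule rather than the patch's exact code):

```c
/* Assume SMT is enabled, since it can't be detected reliably:
 * use only half of the reported logical CPUs to limit codec-context
 * RAM usage, but always run at least one job. */
int proxy_num_transcode_jobs(int num_logical_cpus)
{
  int jobs = num_logical_cpus / 2;
  return jobs > 0 ? jobs : 1;
}
```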

Diff Detail

Repository
rB Blender
Branch
faster-proxies (branched from master)
Build Status
Buildable 12899
Build 12899: arc lint + arc unit

Event Timeline

Richard Antalik (ISS) requested review of this revision. Feb 11 2021, 6:31 AM
  • Run tasks in parallel
Richard Antalik (ISS) edited the summary of this revision. (Show Details) Feb 12 2021, 6:33 AM
  • Fix counting issue. Since I was dropping frames, the performance increase is only 30%.
Richard Antalik (ISS) edited the summary of this revision. (Show Details) Feb 12 2021, 7:36 AM
Richard Antalik (ISS) edited the summary of this revision. (Show Details)
Richard Antalik (ISS) edited the summary of this revision. (Show Details) Feb 12 2021, 7:39 AM
  • Parallelize decoding of packets. Decoding is now significantly faster than processing + encoding, so processing should be parallelized as well. Not sure if encoding can be parallelized too, but I will obviously try. Current performance is 2.68x better than the original.
  • Moved scaling to the thread pool; performance is now 5x better than the original. Encoding is the bottleneck at this point. There are some 1-frame glitches and a few missing frames; I need to track these issues down.
  • Moved encoding to the thread pool. Performance is now 7x better. This works only because each frame is an I-frame, which may be an issue with codecs other than MJPEG. It can be resolved by changing how the thread pool is scheduled while still getting maximum possible performance. Currently only MJPEG is used, so I think this is fine.
  • Fix a significant memleak; there will still be smaller leaks. Found a performance limitation of BLI_findlink with large files.
  • Major cleanup - finalize code layout and data flow. There is still a big mess; I mainly need to check how the indexer worked, reinstate it, and free all memory that should be freed. Also flush encoder/decoder buffers so no frames are missing. Performance is now 14x better than the original on a 16-core machine; this is probably near the theoretical maximum.
  • Cleanup
  • Resolve issue with missing frames - add flushing packets. This could probably be resolved in a nicer way, but I'm not quite sure how.
  • Fix missing frames again. The previous solution wasn't good; source packets must be completely decoupled from output, and codec flushing must be done in an interleaved fashion.

I tried the patch on top of 17dddc941 (by completely replacing indexer.c, due to the patch not applying as-is even with a larger fuzz factor) on a couple of video files, and got a SIGSEGV upon the proxy build finishing:

Thread 1 "blender" received signal SIGSEGV, Segmentation fault.
0x0000000017486322 in av_opt_next ()
(gdb) bt full
#0  0x0000000017486322 in av_opt_next ()
No symbol table info available.
#1  0x000000001748684b in av_opt_free ()
No symbol table info available.
#2  0x0000000002d5e0c7 in avcodec_close ()
No symbol table info available.
#3  0x0000000005796d68 in index_ffmpeg_free_transcode_output_context (output_ctx=0x606002bef1a8) at /home/olivier/work/blender-git/blender/source/blender/imbuf/intern/indexer.c:751
No locals.
#4  0x0000000005799424 in index_ffmpeg_free_transcode_contexts (context=0x612000331e48) at /home/olivier/work/blender-git/blender/source/blender/imbuf/intern/indexer.c:898
        size = 0
        transcode_context = 0x6040001b70d8
        i = 0
#5  0x000000000579b582 in index_ffmpeg_free_context (context=0x612000331e48, stop=0) at /home/olivier/work/blender-git/blender/source/blender/imbuf/intern/indexer.c:974
        i = 4
#6  0x00000000057a546e in IMB_anim_index_rebuild_finish (context=0x612000331e48, stop=0) at /home/olivier/work/blender-git/blender/source/blender/imbuf/intern/indexer.c:1798
No locals.
#7  0x0000000007153122 in SEQ_proxy_rebuild_finish (context=0x60700042c158, stop=false) at /home/olivier/work/blender-git/blender/source/blender/sequencer/intern/proxy.c:554
        sanim = 0x0
#8  0x0000000008b82341 in proxy_endjob (pjv=0x606002bef9e8) at /home/olivier/work/blender-git/blender/source/blender/editors/space_sequencer/sequencer_proxy.c:99
        pj = 0x606002bef9e8
        ed = 0x61f000016c88
        link = 0x60300008a648
#9  0x00000000049a8e60 in wm_jobs_timer (wm=0x61b0002f0188, wt=0x60b0002a09b8) at /home/olivier/work/blender-git/blender/source/blender/windowmanager/intern/wm_jobs.c:646
        wm_job = 0x612000315948
        wm_job_iter_next = 0x0
#10 0x0000000004a20a7e in wm_window_timer (C=0x60d000093a38) at /home/olivier/work/blender-git/blender/source/blender/windowmanager/intern/wm_window.c:1548
        win = 0x6130001bc308
        wt = 0x60b0002a09b8
        wt_iter_next = 0x0
        bmain = 0x61c000078088
        wm = 0x61b0002f0188
        time = 1613636648.4336879
        retval = 0
#11 0x0000000004a2106e in wm_window_process_events (C=0x60d000093a38) at /home/olivier/work/blender-git/blender/source/blender/windowmanager/intern/wm_window.c:1584
        __func__ = "wm_window_process_events"
        hasevent = 0
#12 0x00000000049216c2 in WM_main (C=0x60d000093a38) at /home/olivier/work/blender-git/blender/source/blender/windowmanager/intern/wm.c:634
No locals.
#13 0x0000000002ebadd3 in main (argc=1, argv=0x7fffffffdc38) at /home/olivier/work/blender-git/blender/source/creator/creator.c:522
        C = 0x60d000093a38
        ba = 0x0
        app_init_data = <error reading variable app_init_data (Cannot access memory at address 0xffffffffffffffe0)>
  • Rebase
  • Found issue with some codecs decoding frames immediately
  • Another issue
  • Detect if the codec needs flushing. I think this could fix the crash in avcodec_close as well; forcing the codec to flush is not really a good idea.

I tried the patch on top of 17dddc941 (by completely replacing indexer.c, due to the patch not applying as-is even with a larger fuzz factor) on a couple of video files, and got a SIGSEGV upon the proxy build finishing

Thanks for testing, you can re-check. There are still issues: the file proxy_50_part.avi is not renamed to proxy_50.avi, and there is a memory leak somewhere. These issues may be linked.
It is possible that it will still crash on finishing (which may be the same issue as well). If it crashes, please provide the source media codec and container info, or a sample file, so I can replicate the issue and fix it.

Unfortunately I have no idea where the leak is yet.

Thanks for testing, you can re-check.

No more crash on my side with your latest revision :)

Thanks, that's good to know.

  • Detect if flushing is required in a nicer way
  • Use avcodec_copy_context() instead of creating the context from stream data. This simplifies the code a bit and resolves an issue where the proxy file wasn't renamed, because it seems it was still opened by the codec(s).
  • Fix big leak. There seem to be about 8 MB stuck somewhere in FFmpeg per job. Hard to tell exactly; after running the job repeatedly, memory usage seems to hover around the same number, so there may be no leak now.
  • Reinstate TC index builder
  • Use thread conditions for waiting. Limit the packet reading rate to the packet writing rate. This is needed because when writing to a slow drive with big input files, RAM usage would be too high. The margin is currently arbitrary, but it must be big enough.
  • Fix thread condition interlock bug.
  • Allocate transcode contexts on demand; otherwise this causes huge RAM usage with many strips.
  • Cleanup: consolidate structs
  • Make sure cancelling transcoding is handled without crashing.
  • Make the TC builder not as bad. It is not working correctly, but it doesn't prevent adding strips now.
Richard Antalik (ISS) edited the summary of this revision. (Show Details) Feb 22 2021, 12:18 PM
  • Use GOP chunks for packet reader lookahead
Richard Antalik (ISS) edited the summary of this revision. (Show Details) Feb 22 2021, 1:15 PM
  • Make the timecode index builder work correctly. The implementation is not nice at all.
  • Fix missing frame with some files. The issue was that 2 consecutive packets were not decoded; solved by managing the output packet index independently. After a GOP jump, just set the new index to the same position as the source packet. This must always work, otherwise it wouldn't have worked even in the previous implementation.
Richard Antalik (ISS) retitled this revision from [WIP] Improve proxy building performance to Improve proxy building performance. Feb 22 2021, 8:55 PM
Richard Antalik (ISS) edited the summary of this revision. (Show Details)
  • Fix GOP jumping implemented previously. Because output_packet_wrap can lag behind input, only increment its index when it exceeds the GOP size boundary.

I have one video which triggers a warning during proxy generation (with or without this patch):

/home/olivier/work/blender-git/blender/extern/audaspace/plugins/ffmpeg/FFMPEGReader.cpp:377:63: runtime error: 1.84467e+19 is outside the range of representable values of type 'int'
/home/olivier/work/blender-git/blender/extern/audaspace/plugins/ffmpeg/FFMPEGReader.cpp:385:16: runtime error: signed integer overflow: 77 - -2147483648 cannot be represented in type 'int'

With the current patch, the progression indicator stays stuck (at 12% in my case). Cancelling the task doesn't finish (it just shows "cancelling task...") and when trying to close Blender, it doesn't respond properly; I had to force kill it.
Without the current patch, the proxy progression goes up to 100% as expected.

The video is a montage of CC-0 and CC-BY footage that I can share if needed.

With the current patch, the progression indicator stays stuck (at 12% in my case). Cancelling the task doesn't finish (it just shows "cancelling task...") and when trying to close Blender, it doesn't respond properly; I had to force kill it.
Without the current patch, the proxy progression goes up to 100% as expected.

The video is a montage of CC-0 and CC-BY footage that I can share if needed.

Please do share the video; you can just provide a download link if it's too big.

I tried to reduce it while keeping the problem: http://dl.free.fr/mhqZGjeum

By removing the audio track, the warning disappeared, but with a long enough portion of the video, the problem of the proxy generation staying stuck remains.

I am concerned about the use of the deprecated FFmpeg functions here. At what point should we switch to the newer FFmpeg API? Should I not have used the newer API in my own patches?

2016-04-21 - 7fc329e - lavc 57.37.100 - avcodec.h

Add a new audio/video encoding and decoding API with decoupled input
and output -- avcodec_send_packet(), avcodec_receive_frame(),
avcodec_send_frame() and avcodec_receive_packet()

Moving to the new FFMPEG api could greatly affect this design.

You can see the changes required for the switch to the new API for indexer.c in my patch (D10394). The changes to indexer.c there are ONLY for the API change and can be split off to avoid conflicts with this patch.

I don't believe this can be assumed - but maybe there aren't enough cases to worry about? I'll see if I have any examples hanging around.

Similarly to the above, there is an assumption that decoding any number of packets
in a GOP chunk produces the same number of output packets. This must always be
true, otherwise we couldn't map proxy frames to the original perfectly.

I've gained an enormous amount of sympathy for anyone working with codecs...

Other than that, I am not super happy with the amount of code and sheer setup this patch adds, but it doesn't look like anything could be simplified significantly.

  • Use only half of the available threads to limit RAM usage, with little to no performance impact. This could probably be handled better; I think some systems may have HT disabled.
  • Fix an issue where, after jumping to the next GOP chunk, there were fewer frames than the lag of output packets. This caused output packets to do 2 jumps at once, resulting in a gap which caused jobs to get stuck. The solution is to store jumps in a large enough buffer. It's not the nicest solution, but it works.

I don't believe this can be assumed - but maybe there aren't enough cases to worry about? I'll see if I have any examples hanging around.

Similarly to the above, there is an assumption that decoding any number of packets
in a GOP chunk produces the same number of output packets. This must always be
true, otherwise we couldn't map proxy frames to the original perfectly.

Please provide an example file if you have one; I would like to test edge cases. I would compare this to the previous implementation and see what can be done.
Ultimately I could create "subframes" out of the predicted order and write them correctly; it's just another level of complexity I would rather avoid if possible.

I know that I am using the old API; I don't think this really affects the patch. I started this patch purely by refactoring and moving things around. I could have switched to the new API in the first place, but I didn't.
I can still switch, but first I would like some feedback on whether this approach is even acceptable. If the current transcoding performance were good, I would think twice. This is a risky and complex system, and much more complicated to debug.

Another patch moving to the new API is in D10338.

Richard Antalik (ISS) edited the summary of this revision. (Show Details)Feb 24 2021, 12:42 AM

It looks like the vast majority of packets with multiple decoded frames are audio packets.
This is the only thing that came close: http://samples.mplayerhq.hu/ffmpeg-bugs/720p60_DVCProHD_problem/
but it is two fields (not two frames?).. either way, other programs don't seem to handle it either.

The API change is a trickier topic. The FFmpeg API change decouples the "feeding and fleecing", so it should probably require a redesign of the control flow. It may actually make things simpler to multi-thread?
See "Separated Threads" here: https://blogs.gentoo.org/lu_zero/2016/03/29/new-avcodec-api/

Abandoning this patch in favor of D10731.