There are two separate problems involved.
The first is that sound handle creation is supposed to be delayed
until scene/VSE evaluation. This was violated by scene strip
duplication (code which is used for copy-on-write). Sound strips
already did the proper thing and delayed sound handle creation, so
Scene strips now do the same.
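The delayed handle creation described above can be sketched roughly as
follows. This is an illustration only, not the actual sequencer code:
`Strip`, `duplicate_strip` and `ensure_sound_handle` are hypothetical
names, and a plain `object()` stands in for a real sound handle.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass
class Strip:
    """Hypothetical stand-in for a sequencer strip."""
    name: str
    sound_handle: Optional[object] = None  # created lazily, never on copy


def duplicate_strip(strip: Strip) -> Strip:
    # Duplication (also the copy-on-write path) must not create or share
    # a handle: the copy starts without one.
    return replace(strip, sound_handle=None)


def ensure_sound_handle(strip: Strip) -> object:
    # Assumed to be called from scene/VSE evaluation only.
    if strip.sound_handle is None:
        strip.sound_handle = object()  # placeholder for the real handle
    return strip.sound_handle
```

With this shape a duplicated strip never owns a handle until it is
evaluated, which is the behavior Sound strips already had.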
The other issue is related to how prefetch localizes data, combined
with scene strip rendering. It seems that prefetch relies on the
dependency graph copy-on-write mechanism to localize data, which
works for "regular" strips. It was failing for scene strips because
rendering a scene strip creates a dependency graph which is used by
the render engine. That makes it a second level of copy-on-write,
which was not supported yet.
Now it is possible to create a dependency graph from the evaluated
state of another dependency graph. Such a dependency graph can never
become active, but other than that it "should just work".
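As a conceptual sketch of the two-level setup (not the real dependency
graph API: the `Depsgraph` class, `graph_from_evaluated` and the dict
used as data are all hypothetical), a graph built on top of another
graph's evaluated state gets its own copies and refuses to become
active:

```python
import copy


class Depsgraph:
    """Toy model: evaluation produces a private copy of the source data."""

    def __init__(self, source, parent=None):
        self.source = source      # original data, or a parent's evaluated data
        self.parent = parent      # set when built from another graph
        self.evaluated = None
        self.is_active = False

    def evaluate(self):
        # Copy-on-write stand-in: never modify the source in place.
        self.evaluated = copy.deepcopy(self.source)
        return self.evaluated

    def make_active(self):
        # A graph built from evaluated state can never become active.
        if self.parent is not None:
            raise RuntimeError("nested dependency graph cannot be active")
        self.is_active = True


def graph_from_evaluated(parent):
    # Second level: use the parent's evaluated state as the input data.
    if parent.evaluated is None:
        parent.evaluate()
    return Depsgraph(parent.evaluated, parent=parent)
```

Changes made through the nested graph stay in its own copy, so neither
the parent's evaluated state nor the original data is touched.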
There might be other ways of fixing the issue, like localizing part
of bmain, but then one would need to manually maintain required
dependencies.
A further improvement would be to avoid creating copy-on-write data
blocks in the dependency graph, because in-place modification is
possible.
Still need to verify that modifiers are applied correctly (e.g.
that subsurf is not applied twice). Before spending more time on
the topic, I wanted to get design feedback from developers who are
involved in these areas, to see whether the direction this patch is
moving in is good.