I have a reproducer. It's in the form of a benchmark for ease of setup, though it's not _exactly_ a benchmark. The code is at the bottom, and I've included a few screenshots of execution traces that it generates. (The benchmark tries to calculate some measures of how often the application code is able to run, but the main value is in looking at the execution traces.)

This issue is a combination of a few circumstances:
- The shape of the heap means there's not enough mark work or mark assist credit to go around, so the goroutines that allocate memory end up blocked in the assist queue until the end of the GC. These end up in the global run queue via gcWakeAllAssists.
- The code to restart the world only launches Ps that have goroutines in their local run queue, through the interaction of procresize (returning that list of Ps) and startTheWorldWithSema (launching them).
- If there are idle Ps and runnable goroutines, a spinning M can pick them up. (The runtime makes sure to trigger the start of this if there's a chance of finding work.) If the spinning M finds a P and some goroutines, it starts executing those after kicking off another spinning M. Pulling work from the global run queue in that way requires holding sched.lock.
- Moving into the sweep phase uses forEachP, which does the work itself for any P that is idle. It obtains sched.lock before it begins and does not release it until it's finished handling every idle P.
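
The last two points interact, and a toy model may make that easier to see. This is only a sketch, not runtime code: `mu` stands in for sched.lock, one goroutine plays the thread doing per-P work on behalf of idle Ps, and a second plays a spinning M that needs the same lock to pull from the global run queue. The name `wakerView`, the 100µs per-P cost, and the count of 8 Ps are all made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// wakerView models one thread handling n idle Ps (a stand-in for forEachP)
// while a second thread (a stand-in for a spinning M) needs the same lock.
// It returns how many Ps had been handled when the second thread got the lock.
func wakerView(n int, holdLockAcrossWork bool) int64 {
	var (
		mu        sync.Mutex // stand-in for sched.lock
		processed int64
		started   = make(chan struct{})
		seen      = make(chan int64, 1)
	)

	go func() { // the "spinning M"
		<-started
		mu.Lock()
		seen <- atomic.LoadInt64(&processed)
		mu.Unlock()
	}()

	handle := func() {
		time.Sleep(100 * time.Microsecond) // hypothetical per-P cost
		atomic.AddInt64(&processed, 1)
	}

	mu.Lock()
	close(started) // the other thread starts competing only once we hold the lock
	if holdLockAcrossWork {
		for i := 0; i < n; i++ {
			handle() // every P is handled with the lock held
		}
		mu.Unlock()
	} else {
		mu.Unlock() // release first, then do the per-P work without the lock
		for i := 0; i < n; i++ {
			handle()
		}
	}
	return <-seen
}

func main() {
	fmt.Println("lock held across work: saw", wakerView(8, true), "of 8 Ps handled")
	fmt.Println("lock released first:   saw", wakerView(8, false), "of 8 Ps handled")
}
```

With the lock held across the work, the second goroutine always observes all 8 Ps handled before it can proceed. With the lock released first, the number it observes depends on scheduling, but it is no longer forced to wait for the whole loop.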

Data from `perf record -g -a -e 'sched:sched_switch' -e 'sched:sched_wakeup'` combined with execution traces that show fast vs slow resumption after stop-the-worlds point to the "spinning M" mechanism being in play when the application is slow to resume.

The execution traces below (from the reproducer at the very bottom) show the difference in speed between procresize+startTheWorldWithSema resuming lots of Ps (fast) and relying instead on the mspinning mechanism to get back to work (slow). These traces are from a server with two Intel sockets and a total of 96 hyperthreads. I've zoomed them all so the major time divisions are at 5ms intervals.

---

Here's go1.16.3 with a fast resume during the start of mark. When there are a lot of active Ps right before the STW, the program is able to have a lot of active Ps very soon after the STW. This is good behavior -- the runtime is able to deliver.

```go1.16.3-startmark-good.png```

Here's another part of the go1.16.3 execution trace, showing a slow resume at the end of mark. Nearly all of the Gs are in the mark assist queue (the left half of the "Goroutines" ribbon is light green), and nearly all of the Ps are idle (very thin lavender "Threads" ribbon). Once the idle Ps start getting back to work (growing lavender, start of colored ribbons in the "Proc 0" etc rows), it takes about 4ms for the program to get back to work completely. This is the high-level bug behavior.

```go1.16.3-startsweep-bad.png```

Here's a third part of the go1.16.3 execution trace, with an interesting middle ground. After the STW, the lavender "Threads" ribbon grows to about 80% of its full height and stays there for a few hundred microseconds before growing, more slowly than before, to 100%. I interpret this as almost all of the Ps having had local work (so procresize returns them and they're able to start immediately), then forEachP holding sched.lock for a while and preventing further growth, and finally the mspinning mechanism finding the remaining Ps and putting them to work.

```go1.16.3-startsweep-ok.png```

Here's Go tip at the parent of PS 6, `go version devel go1.17-8e91458b19 Fri Apr 30 20:00:36 2021 +0000`. It shows similar behavior to the second go1.16.3 image (the slow resume at the end of mark). This is from a different phase in the benchmark's lifecycle, so the "Goroutines" ribbon looks a little different. The idle Ps take about 5ms to get to work.

```gotip-startsweep-bad.png```

Here's PS 6, which (1) assigns goroutines from the global run queue to idle Ps as part of starting the world and (2) does not hold sched.lock in the parts of forEachP that do substantial work. It's from the same part of the benchmark's lifecycle as the previous image, but it only takes about 300µs for the Ps to get to work.

```cl6-startsweep-good.png```
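
To illustrate point (1), here is a small counting model of which Ps have work the moment the world restarts. It is a sketch under assumptions, not the change itself: `startedPs` is a made-up function, the queue lengths are invented, and handing out exactly one goroutine per idle P is my simplification rather than a description of how PS 6 distributes work.

```go
package main

import "fmt"

// startedPs returns how many Ps have work immediately after a simulated
// start-the-world. local[i] is the length of P i's local run queue and global
// is the length of the global run queue. With handOut false, only Ps that
// already have local work start. With handOut true, each idle P is first
// given one goroutine from the global queue while any remain.
func startedPs(local []int, global int, handOut bool) int {
	started := 0
	for _, n := range local {
		switch {
		case n > 0:
			started++
		case handOut && global > 0:
			global--
			started++
		}
	}
	return started
}

func main() {
	// Hypothetical end-of-mark state: 96 Ps with empty local queues and 200
	// goroutines just moved from the assist queue to the global run queue.
	local := make([]int, 96)
	fmt.Println(startedPs(local, 200, false)) // prints 0
	fmt.Println(startedPs(local, 200, true))  // prints 96
}
```

In the first case every P has to be found later by the spinning-M mechanism, which is the slow path seen in the traces above.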

