This improves parallel rustc parallelism by avoiding the bottleneck after each individual `par_body_owners` (because it needs to wait for queries to finish, so if there is one long running one, a lot of cores will be idle while waiting for the single query).