Should I try to use as many queues as possible?

Maik Klein · Jun 1, 2016

On my machine I have two queue families, one that supports everything and one that only supports transfer.

The queue family that supports everything has a queueCount of 16.

Now the spec states:

"Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another."

Does that mean I should try to use all available queues for maximal performance?

Answer

krOoze · Jun 1, 2016

Yes, if you have a workload that is highly independent, use separate queues.

If the queues need a lot of synchronization between themselves, that may kill any potential benefit.

Basically, with multiple queues of the same queue family you are supplying the GPU with alternative work, and giving it the choice of what to use to fill stalls, bubbles and idle periods. There is also some potential to make better use of the CPU (e.g. one queue per thread instead of submitting everything from a single thread).

Using separate transfer queues (or another specialized family) even seems to be the recommended approach.
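To make "specialized family" concrete, here is a minimal sketch of how one might pick out a transfer-only family. It does not call Vulkan; `QueueFamily` is a hypothetical stand-in for `VkQueueFamilyProperties` (as filled in by `vkGetPhysicalDeviceQueueFamilyProperties`), and the function name is my own. Only the bit values are taken from `VkQueueFlagBits`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Capability bits; the values mirror Vulkan's VkQueueFlagBits. */
enum {
    QUEUE_GRAPHICS = 0x1,
    QUEUE_COMPUTE  = 0x2,
    QUEUE_TRANSFER = 0x4
};

/* Hypothetical stand-in for VkQueueFamilyProperties (queueFlags, queueCount). */
typedef struct {
    uint32_t flags;
    uint32_t queue_count;
} QueueFamily;

/* Returns the index of the first family that supports transfer but neither
 * graphics nor compute, or -1 if there is no such family. */
int find_dedicated_transfer_family(const QueueFamily *families, size_t count)
{
    assert(families != NULL || count == 0);

    for (size_t i = 0; i < count; ++i) {
        uint32_t f = families[i].flags;
        int transfer_only = (f & QUEUE_TRANSFER) &&
                            !(f & (QUEUE_GRAPHICS | QUEUE_COMPUTE));
        if (transfer_only && families[i].queue_count > 0)
            return (int)i;
    }
    return -1;
}
```

For a layout like the one in the question (a do-everything family followed by a transfer-only one) this would return the index of the second family. When it returns -1, the fallback is simply the general-purpose family.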

That is generally speaking. A more realistic, empirical, sceptical and practical view was already presented in the SW and NB answers. In reality one has to be a bit more cautious: those queues target the same resources, have the same limits, and share other common restrictions, which limits the potential benefit. Notably, if the driver does the wrong thing with multiple queues, it may be very bad for the cache.

AMD's Leveraging asynchronous queues for concurrent execution (2016) discusses a bit how this maps to their HW/driver. It shows the potential benefits of using separate queue families. It says that although they offer two queues in the compute family, they did not observe benefits from that in apps at that time. It also says they have only one graphics queue, and explains why.

NVIDIA seems to have a similar idea of "async compute", shown in Moving to Vulkan: Asynchronous compute.

To be safe, it seems we should still stick with only one graphics queue and one async compute queue on current HW. Using all 16 queues seems like a trap and a way to hurt yourself.
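A rough sketch of what "one graphics, one async compute" could look like as a selection policy, independent of how large `queueCount` is. The types and the function are hypothetical and do not call Vulkan. The fallback of taking a second queue from the graphics family when there is no separate compute family is my assumption, not something the vendor material above prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Capability bits; the values mirror Vulkan's VkQueueFlagBits. */
enum { CAP_GRAPHICS = 0x1, CAP_COMPUTE = 0x2, CAP_TRANSFER = 0x4 };

/* Hypothetical stand-in for VkQueueFamilyProperties. */
typedef struct {
    uint32_t caps;
    uint32_t queue_count;
} FamilyInfo;

/* Which family (and queue index within it) to use; -1 means "none". */
typedef struct {
    int      graphics_family;
    int      compute_family;
    uint32_t compute_queue_index;
} QueuePlan;

/* Picks exactly one graphics queue and at most one async compute queue. */
QueuePlan plan_queues(const FamilyInfo *fams, size_t n)
{
    QueuePlan plan = { -1, -1, 0 };
    assert(fams != NULL || n == 0);

    /* Graphics: queue 0 of the first family that supports graphics. */
    for (size_t i = 0; i < n; ++i) {
        if ((fams[i].caps & CAP_GRAPHICS) && fams[i].queue_count > 0) {
            plan.graphics_family = (int)i;
            break;
        }
    }

    /* Async compute: prefer a compute family without graphics support. */
    for (size_t i = 0; i < n; ++i) {
        if ((fams[i].caps & CAP_COMPUTE) && !(fams[i].caps & CAP_GRAPHICS) &&
            fams[i].queue_count > 0) {
            plan.compute_family = (int)i;
            return plan;
        }
    }

    /* Assumed fallback: a second queue of the graphics family, if it exists. */
    if (plan.graphics_family >= 0) {
        const FamilyInfo *g = &fams[plan.graphics_family];
        if ((g->caps & CAP_COMPUTE) && g->queue_count >= 2) {
            plan.compute_family = plan.graphics_family;
            plan.compute_queue_index = 1;
        }
    }
    return plan;
}
```

With a 16-queue do-everything family this still asks for only two queues from it, so the `queueCount` you would pass in `VkDeviceQueueCreateInfo` stays at 2 rather than 16.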

With transfer queues it is not as simple as it seems either. You should use the dedicated ones for host->device transfers, and the non-dedicated ones for device->device transfer ops.
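That rule of thumb is small enough to write down directly. The sketch below is hypothetical (the enum, the function name and the "-1 means no dedicated family" convention are all mine) and only encodes the routing decision, not the actual copy commands.

```c
#include <assert.h>

/* Hypothetical classification of a copy by where its data lives. */
typedef enum {
    COPY_HOST_TO_DEVICE,
    COPY_DEVICE_TO_DEVICE
} CopyKind;

/* Returns the queue family index to submit the copy to.
 * dedicated_transfer_family: index of a transfer-only family, or -1 if none.
 * general_family: index of the general-purpose (non-dedicated) family. */
int pick_copy_family(CopyKind kind, int dedicated_transfer_family,
                     int general_family)
{
    assert(general_family >= 0);

    /* Host->device uploads go to the dedicated transfer family if present. */
    if (kind == COPY_HOST_TO_DEVICE && dedicated_transfer_family >= 0)
        return dedicated_transfer_family;

    /* Device->device copies (and the no-dedicated-family case) stay on the
     * non-dedicated family. */
    return general_family;
}
```

Keep in mind that a resource used from two different queue families still needs the usual ownership transfer or concurrent sharing mode, so routing an upload to the dedicated family is not free.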