Can I run a Docker container with CUDA 10 when host has CUDA 9?

JLuxton · Jul 13, 2019

I'm deploying an application in a Docker container that requires CUDA 10. This is necessary to run some of the underlying PyTorch functionality that the application uses.

However, the host server is running Docker CE 17 and nvidia-docker v1.0 with CUDA version 9, and I will not be able to upgrade the host.

I'm under the impression that I'm handcuffed to the v1 nvidia-docker runtime and the CUDA version available on the host.

Is there a way to run CUDA 10 on the container so I can leverage the functionality of this toolkit?

Answer

Robert Crovella · Jul 14, 2019

In the general case, any specific CUDA version will require a minimum GPU driver version. That is covered in places like here and here (table 1). So to use CUDA 9.0 you would need at least a GPU driver version that supports CUDA 9.0, such as a R384 driver. To use CUDA 10.0 you would need at least a GPU driver version that supports CUDA 10.0, such as a R410 driver.
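If you're not sure what driver the base machine is actually running, nvidia-smi will report it. For example (assuming the NVIDIA driver and nvidia-smi are installed on the host):

# Print the installed NVIDIA driver version on the base machine
nvidia-smi --query-gpu=driver_version --format=csv,noheader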

The usage of containers doesn't fundamentally change this. If you want to use a container that has CUDA 10 code in it, your base machine needs a driver that supports CUDA 10.
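A quick way to test this is to run nvidia-smi inside a container based on the CUDA version you care about. A sketch, assuming the public nvidia/cuda:10.0-base image and the nvidia-docker v1 wrapper mentioned in the question (with nvidia-docker2 this would be docker run --runtime=nvidia ... instead):

# Launch a CUDA 10.0 base container and run nvidia-smi inside it;
# this only works if the host driver supports the image's CUDA version
nvidia-docker run --rm nvidia/cuda:10.0-base nvidia-smi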

NVIDIA has since begun publishing compatibility libraries that modify the above statements. These compatibility libraries are available but are not installed by default with a CUDA toolkit install. They only work in certain cases and have certain requirements to be usable. The compatibility libraries are documented here.

One of the specific requirements for use of these compatibility libraries is that the GPU(s) in use must be Tesla-brand GPUs. GeForce, Quadro, Jetson, and Titan family GPUs are not supported by these compatibility libraries.
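You can check which GPU(s) the base machine has, for example:

# Print the GPU product name(s) so you can tell whether they are Tesla-brand
nvidia-smi --query-gpu=name --format=csv,noheader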

Furthermore, the libraries only work with certain combinations of CUDA toolkit versions and GPU driver versions installed on the base machine. This "compatibility matrix" is documented here (Table 3). Only specific combinations of CUDA toolkit version and installed driver version are usable for compatibility. To pick one example: if you wish to use CUDA 10.0 and your base machine has a Tesla GPU with an R396 driver installed, there is no compatibility support. In the same setup, however, if you wish to use CUDA 10.1, there is compatibility support for that.

If you have satisfied the requirements for compatibility usage, then the remaining step would be to install the compatibility libraries (or build your container from a base container that has the compatibility libraries already installed).

For a package-manager CUDA install, installing the compatibility libraries is simple (example on Ubuntu, installing the CUDA 10.1 compatibility package to match a CUDA 10.1 toolkit install):

sudo apt-get install cuda-compat-10.1

Make sure to match the version to the CUDA toolkit version that you are using (that you installed with the package manager method, or that was already installed in your container).
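If you want to bake the compatibility package into your own image rather than installing it interactively, a minimal Dockerfile sketch could look like the following. This assumes the nvidia/cuda:10.1-base-ubuntu18.04 image (which has NVIDIA's CUDA apt repository configured); depending on the repository, the package may be named cuda-compat-10-1 rather than cuda-compat-10.1.

FROM nvidia/cuda:10.1-base-ubuntu18.04
# Install the CUDA 10.1 forward-compatibility package so that CUDA 10.1
# user-space libraries can run on an older, supported Tesla driver on the host
RUN apt-get update && \
    apt-get install -y --no-install-recommends cuda-compat-10.1 && \
    rm -rf /var/lib/apt/lists/*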

This compatibility "path" only began in the CUDA 9.0 timeframe. Systems that are equipped with drivers that predate CUDA 9.0 will not be usable in any way for this compatibility path. There are also various functional limitations and restrictions, which are covered in the documentation.

When this "compatibility path" is correctly installed and in use, the overall system configuration can "appear" to be violating the rules indicated at the top of this answer. For example a CUDA 10.1 application could possibly be running on a machine that had only a R396 driver installed.

For the specific question here, the OP eventually indicated that the base machine had a Quadro GPU, so this "compatibility path" does not apply, and the only way to run e.g. a CUDA 10.0 container is to install a CUDA 10.0-capable driver on the base machine, e.g. an R410 or later driver.