CUDA kernel failed : no kernel image is available for execution on the device, Error when running PyTorch model inside Google Compute VM

user3882675 · Mar 24, 2020

I have a Docker image of a PyTorch model that returns this error when run inside a Google Compute Engine VM running Debian with a Tesla P4 GPU and the Google Deep Learning image:

CUDA kernel failed : no kernel image is available for execution on the device

This occurs on the line where my model is called. The PyTorch model includes custom C++ extensions; I'm using this model: https://github.com/daveredrum/Pointnet2.ScanNet

My image installs these extensions when the image is built.

The image runs fine on my local system. Both the VM and my local system have these versions:

CUDA compilation tools, release 10.1, V10.1.243

torch 1.4.0

torchvision 0.5.0

The main difference, as far as I'm aware, is the GPU (the GeForce GTX 960M is compute capability 5.0, while the Tesla P4 is 6.1).

Local:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21       Driver Version: 435.21       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960M    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   36C    P8    N/A /  N/A |    361MiB /  2004MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

VM:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01    Driver Version: 418.87.01    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    23W /  75W |      0MiB /  7611MiB |      3%      Default |
+-------------------------------+----------------------+----------------------+

If I SSH into the VM, torch.cuda.is_available() returns True.
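For reference, this is a quick way to check which device the container actually sees, using standard torch.cuda calls (on the Tesla P4 the capability should come back as (6, 1)):

python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0)); print(torch.cuda.get_device_capability(0))"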

Therefore I suspect it has something to do with the compilation of the extensions.

This is the relevant part of my Dockerfile:

ENV CUDA_HOME "/usr/local/cuda-10.1"
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda-10.1/bin:${PATH}
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419"
ENV FORCE_CUDA=1

# CUDA 10.1-specific steps
RUN conda install -c open3d-admin open3d
RUN conda install -y -c pytorch \
    cudatoolkit=10.1 \
    "pytorch=1.4.0=py3.6_cuda10.1.243_cudnn7.6.3_0" \
    "torchvision=0.5.0=py36_cu101" \
 && conda clean -ya
RUN pip install -r requirements.txt
RUN pip install flask
RUN pip install plyfile
RUN pip install scipy


# Install OpenCV3 Python bindings
RUN sudo apt-get update && sudo apt-get install -y --no-install-recommends \
    libgtk2.0-0 \
    libcanberra-gtk-module \
    libgl1-mesa-glx \
 && sudo rm -rf /var/lib/apt/lists/*

RUN dir
RUN cd pointnet2 && python setup.py install
RUN cd ..

I have already tried re-running this line over SSH in the VM:

TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0" python setup.py install

I think this targets the build at the Tesla P4's compute capability?
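As a sanity check, CUDA's cuobjdump can list the architectures actually embedded in a compiled extension. The .so path below is a placeholder; adjust it to wherever setuptools put the build output:

cuobjdump --list-elf build/lib.linux-x86_64-3.6/_ext.so  # placeholder path; the output should list an sm_61 cubin if the build targeted the P4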

Is there some other setting or troubleshooting step I can try?

I didn't know anything about Docker/VMs/PyTorch extensions until a couple of days ago, so I'm somewhat shooting in the dark. Also, this is my first Stack Overflow post; apologies if I'm not following some etiquette, feel free to point it out.

Answer

user3882675 · Apr 3, 2020

I resolved this in the end by manually deleting all the folders except for "src" in the folder containing setup.py.

Then I rebuilt the Docker image.

Then, when building the image, I ran TORCH_CUDA_ARCH_LIST="6.1" python setup.py install to build the CUDA extensions targeting the correct compute capability for the GPU on the VM,

and it worked!
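In Dockerfile terms, that amounts to changing the install step from the question to something like this (a sketch; the exact layout depends on your Dockerfile):

# build the CUDA extension for the Tesla P4's compute capability (6.1)
RUN cd pointnet2 && TORCH_CUDA_ARCH_LIST="6.1" python setup.py install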

I guess just re-running setup.py without deleting the previously built folders doesn't fully rebuild the extension.
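Put together, the clean rebuild amounts to roughly this (assuming the leftover folders are the usual setuptools outputs; adjust to your checkout):

cd pointnet2
rm -rf build dist *.egg-info  # remove stale build artifacts, keeping src/
TORCH_CUDA_ARCH_LIST="6.1" python setup.py install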