I have a Docker image of a PyTorch model that returns this error when run inside a Google Compute Engine VM (Debian, Tesla P4 GPU, Google Deep Learning image):
CUDA kernel failed : no kernel image is available for execution on the device
This occurs on the line where my model is called. The PyTorch model includes custom C++ extensions; I'm using this model: https://github.com/daveredrum/Pointnet2.ScanNet
My image installs these at runtime.
The image runs fine on my local system. Both the VM and my local system have these versions:
CUDA compilation tools 10.1, V10.1.243
torch 1.4.0
torchvision 0.5.0
The main difference, as far as I'm aware, is the GPU.
Local:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 435.21 Driver Version: 435.21 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 960M Off | 00000000:01:00.0 Off | N/A |
| N/A 36C P8 N/A / N/A | 361MiB / 2004MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
VM:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.01 Driver Version: 418.87.01 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P4 Off | 00000000:00:04.0 Off | 0 |
| N/A 42C P0 23W / 75W | 0MiB / 7611MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
If I SSH into the VM, torch.cuda.is_available() returns True.
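For reference, this is roughly the check I ran over SSH; the get_device_capability call is an extra sanity check I added, and it should report (6, 1) for a Tesla P4:
# Confirms PyTorch can see the driver/runtime, and prints the device's compute capability.
# On the VM's Tesla P4 (Pascal) this should print True and (6, 1).
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_capability(0))"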
Therefore I suspect it has something to do with how the extensions are compiled.
This is the relevant part of my Dockerfile:
ENV CUDA_HOME "/usr/local/cuda-10.1"
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda-10.1/bin:${PATH}
ENV NVIDIA_VISIBLE_DEVICES all
ENV NVIDIA_DRIVER_CAPABILITIES compute,utility
ENV NVIDIA_REQUIRE_CUDA "cuda>=10.1 brand=tesla,driver>=384,driver<385 brand=tesla,driver>=396,driver<397 brand=tesla,driver>=410,driver<411 brand=tesla,driver>=418,driver<419"
ENV FORCE_CUDA=1
# CUDA 10.1-specific steps
RUN conda install -c open3d-admin open3d
RUN conda install -y -c pytorch \
cudatoolkit=10.1 \
"pytorch=1.4.0=py3.6_cuda10.1.243_cudnn7.6.3_0" \
"torchvision=0.5.0=py36_cu101" \
&& conda clean -ya
RUN pip install -r requirements.txt
RUN pip install flask
RUN pip install plyfile
RUN pip install scipy
# Install OpenCV3 Python bindings
RUN sudo apt-get update && sudo apt-get install -y --no-install-recommends \
libgtk2.0-0 \
libcanberra-gtk-module \
libgl1-mesa-glx \
&& sudo rm -rf /var/lib/apt/lists/*
RUN dir
RUN cd pointnet2 && python setup.py install
RUN cd ..
I have already tried re-running this line from an SSH session in the VM:
TORCH_CUDA_ARCH_LIST="6.0 6.1 7.0" python setup.py install
Which I think targets the build at the Tesla P4's compute capability (6.1)?
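As a further check, I understand you can inspect which SM architectures actually ended up in the compiled extension with cuobjdump; the .so path below is only a placeholder, since the actual name and location depend on where setup.py installs the extension:
# List the embedded cubins in the built extension; for a build that can run on a
# Tesla P4 there should be entries mentioning sm_61.
cuobjdump --list-elf path/to/the/built/extension.so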
Is there some other setting or troubleshooting step I can try?
I didn't know anything about Docker/VMs/PyTorch extensions until a couple of days ago, so I'm somewhat shooting in the dark. Also, this is my first Stack Overflow post; apologies if I'm not following some etiquette, feel free to point it out.
I resolved this in the end by manually deleting all the folders except "src" in the folder containing setup.py, then rebuilding the Docker image. During the image build I ran TORCH_CUDA_ARCH_LIST="6.1" python setup.py install, which installs the CUDA extensions targeting the correct compute capability for the GPU on the VM, and it worked!
I guess just running setup.py without deleting the previously built folders doesn't fully overwrite the extension.
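For anyone hitting the same thing, the fix boils down to something like the following in the build step. The exact folders to delete are whatever setup.py left behind from the previous build; build/, dist/ and the *.egg-info directory are the standard setuptools outputs, so I'm assuming those here:
# Clear stale build artifacts that were compiled for the wrong architecture,
# then rebuild the extension targeting the P4's compute capability (6.1).
cd pointnet2
rm -rf build/ dist/ *.egg-info
TORCH_CUDA_ARCH_LIST="6.1" python setup.py install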