I was trying to use TensorFlow with a GPU and got the following error:
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K20m, pci bus id: 0000:02:00.0)
E tensorflow/stream_executor/cuda/cuda_dnn.cc:347] Loaded runtime CuDNN library: 5005 (compatibility version 5000) but source was compiled with 5103 (compatibility version 5100). If using a binary install, upgrade your CuDNN library to match. If building from sources, make sure the library loaded at runtime matches a compatible version specified during compile configuration.
F tensorflow/core/kernels/conv_ops.cc:457] Check failed: stream->parent()->GetConvolveAlgorithms(&algorithms)
Of course I am trying to fix this error (the fix has already been asked about in "Loaded runtime CuDNN library: 5005 (compatibility version 5000) but source was compiled with 5103 (compatibility version 5100)"), but I'd like to understand the error itself. I always try to solve coding problems myself before asking for help, but I am having a hard time even getting started on this one because the error message seems cryptic to me and I can't find a good resource that explains what it means.
To understand the error I focused on the line that seems to be where the error starts:
Loaded runtime CuDNN library: 5005 (compatibility version 5000) but source was compiled with 5103 (compatibility version 5100).
After reading some GitHub pages that seemed relevant, I realized that reading the error as follows is actually more helpful:
Loaded runtime CuDNN library: 5005 but source was compiled with 5103.
Removing the parentheses makes the error a bit easier to read (though I'd still like to know what role the parentheses play in the message, to ease debugging). It seems that CuDNN library 5005 was loaded (at the UNIX/OS level), but TensorFlow (for Python) was compiled with what I would guess is version 5103. If the TensorFlow library is using the API of 5103 while the "real" API for talking to the CUDA deep learning library CuDNN is version 5005, it's clear that would be a problem. These are just guesses about what's going on, though.
My first confusion is that, as far as I can tell, there is no such thing as CuDNN 5005 or 5103. It would be awesome to understand for sure what that part of the error means so that I can start debugging this for real. When I run module list, it shows that I am using:
cudnn/5.0
My second confusion is the parentheses that I ignored and what they mean:
Loaded runtime CuDNN library: 5005 (compatibility version 5000)
but source was compiled with 5103 (compatibility version 5100)
I honestly have no idea what "compatibility version XXXX" means. Maybe it's a suggestion to install version 5000 (whatever that means) of CuDNN (which is still confusing, because there is no "five thousand" version of CuDNN) and to compile a version of TensorFlow (somehow) that uses CuDNN version 5100.
Does someone know more precisely what the error means (and can maybe provide a solution to the question I linked)?
This is an approximate description of what is going on.
cuDNN has major releases that are numbered e.g. 4.0, 5.0, 5.1, etc.
These major releases may incorporate API changes. Therefore a program that uses cuDNN v4 (i.e. 4.0) may need some modifications to work with or use new features in cuDNN v5 (i.e. 5.0).
The major release is encoded in the first two digits of the 4-digit version number. So a cuDNN 4-digit version number of 5103 means it belongs to the 5.1 major release and has a sub-version number of 03. For compatibility purposes, such a release should be API-compatible with any other cuDNN library version of 51xx because they all belong to the 5.1 major release (this is not guaranteed to be strictly true AFAIK, but it is the general idea). Therefore any of these libraries with release numbering 51xx would have a compatibility version of 5100, to indicate that they belong to (and should be compatible with) the 5.1 major release.
So when we are referring to a compatibility version (what major release is this library compatible with) we only need to specify the first two digits - 5000 indicates 5.0, 5100 indicates 5.1. But it is possible for a release to have a sub-release version number that is non-zero. There could be a variety of reasons for this, for example to allow for bug-fix releases and the like.
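To make the numbering concrete, here is a minimal sketch that decodes the two version numbers from the error message. It assumes the encoding major*1000 + minor*100 + sub-version, which is consistent with the numbers above; the helper names are hypothetical and are not part of cuDNN or tensorflow.

```python
def split_cudnn_version(version):
    """Split a 4-digit cuDNN version such as 5103 into (major, minor, sub-version).

    Assumes the encoding major*1000 + minor*100 + sub-version.
    """
    major, rest = divmod(version, 1000)
    minor, sub = divmod(rest, 100)
    return major, minor, sub


def compatibility_version(version):
    """Zero out the last two digits, e.g. 5103 -> 5100."""
    return version - version % 100


for v in (5005, 5103):
    print(v, split_cudnn_version(v), compatibility_version(v))
# 5005 (5, 0, 5) 5000
# 5103 (5, 1, 3) 5100
```

Read this way, 5005 is cuDNN 5.0 with sub-version 05, and 5103 is cuDNN 5.1 with sub-version 03.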
When a program (like tensorflow) is designed to use cuDNN, it will generally be coded to work with a particular version of cuDNN. In some cases, this can be handled at compile time, by "compiling against" a particular cuDNN version (and its associated API, i.e. the header files used when building tensorflow). Therefore, at compile time, a program like tensorflow can determine what version of the cuDNN API it was compiled against, and that is a 4-digit version (although generally speaking, only the compatibility version, i.e. the first two digits of the 4-digit version, should really matter).
At runtime, you have a particular version of the cuDNN library (e.g. .so on linux) loaded on your machine somewhere. The version of that library can be determined, queried, and reported. If that actual library version does not match (at least from a compatibility version perspective) the version of the cuDNN library that tensorflow was compiled against, then that's a good indication that things may not work, and so tensorflow points this out when it is running:
Loaded runtime CuDNN library: 5005 but source was compiled with 5103.
This is tensorflow telling you "hey, I was designed (compiled) to work with cuDNN v5.1 but you are only giving me cuDNN 5.0 to work with".
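The check described above can be sketched roughly as follows. This is only an illustration of the idea, not tensorflow's actual implementation; the function names are hypothetical, and the rule "only the first two digits have to match" is the assumption stated earlier.

```python
def compatibility_version(version):
    """Zero out the last two digits, e.g. 5103 -> 5100."""
    return version - version % 100


def describe_cudnn_mismatch(loaded, compiled):
    """Return a warning string if the compatibility versions differ, else None."""
    loaded_compat = compatibility_version(loaded)
    compiled_compat = compatibility_version(compiled)
    if loaded_compat == compiled_compat:
        return None
    return (
        "Loaded runtime CuDNN library: {} (compatibility version {}) "
        "but source was compiled with {} (compatibility version {})."
    ).format(loaded, loaded_compat, compiled, compiled_compat)


# The situation from the question: runtime 5.0, compiled against 5.1.
print(describe_cudnn_mismatch(5005, 5103))

# Same 5.1 release on both sides, different sub-version: no complaint.
print(describe_cudnn_mismatch(5107, 5103))
```

The first call reproduces the message from the question; the second prints None because both versions share the compatibility version 5100.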
Differences at the sub-version level should be less significant. If you know what you are doing, it may be ok to use cuDNN runtime version 5107 even if your tensorflow was compiled against version 5103. This is purely a hypothetical example, but such a pair would indicate a difference in the library that was not intended to change proper functionality, behavior, or the API. Version 5107 could be just a bug-fixed version of 5103, for example.
In the ideal case, you would build tensorflow against the version of cuDNN that you are using. If you have downloaded pre-built tensorflow packages, however, then you may witness this sort of message (since you presumably downloaded cuDNN separately). In that case, you should at least seek to match the cuDNN major version you are using against the compatibility version that tensorflow is expecting. In this particular example, you are not doing that.
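If you are not sure which cuDNN library your environment actually picks up, you can ask the library itself. The sketch below uses Python's ctypes to call cudnnGetVersion(), which is part of the cuDNN API; the helper name is hypothetical, and the lookup via ctypes.util.find_library is an assumption about how your system locates shared libraries, so it may not find a library that tensorflow itself would find.

```python
import ctypes
import ctypes.util


def loaded_cudnn_version(library_name="cudnn"):
    """Return the version reported by the cuDNN shared library, or None if it cannot be loaded."""
    path = ctypes.util.find_library(library_name)
    if path is None:
        return None  # no matching shared library found on this system
    try:
        lib = ctypes.CDLL(path)
    except OSError:
        return None  # found a candidate, but it could not be loaded
    # cudnnGetVersion takes no arguments and returns a size_t.
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    lib.cudnnGetVersion.argtypes = []
    return int(lib.cudnnGetVersion())


print(loaded_cudnn_version())
```

On a machine set up like the one in the question I would expect this to print a 50xx number; on a machine without cuDNN it prints None.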