I've googled for a while and the only useful infos are:
Unfortunately, the latest Arch Linux only provides CUDA 7.5 package, so the barnex's project may be not supported.
Arne Vansteenkiste recommends concurrency rather than pure Golang or Golang plus CUDA. What's more, there's someone says the same idea that "Wouldn't it be cool to start a goroutine on a GPU and communicate with it via channels?". I think both of these ideas are wonderful since I would like to change the existing code as little as possible instead of refactoring the whole program. Is the idea possible, or is there some documents introducing this topic in details?
Update
It seems that there's two bindings to HPC in Golang:
Both of them are less documented, currently what I got is only Macro13's answer, very helpful, but it's more about java . So please help me some detailed materials in Golang. Thanks!