I want to use 8 GPUs in parallel, not sequentially.
For example, when I execute this code:
import tensorflow as tf
with tf.device('/gpu:0'):
    for i in range(10):
        print(i)
with tf.device('/gpu:1'):
    for i in range(10, 20):
        print(i)
I tried setting CUDA_VISIBLE_DEVICES='0,1' from the command line, but the result is the same.
I want to see a result like "0 10 1 11 2 3 12 ... etc."
But the actual result is sequential: "0 1 2 3 4 5 ... 10 11 12 13 ..."
How can I get the desired result?
**I see there has been an edit to the question, so I am adding this to my answer:**
You need to pass your operations to a TensorFlow session; otherwise the code is interpreted as an ordinary sequential program (as in most programming languages) and the operations complete one after the other. Only operations that are built into the graph and run through a session get scheduled onto the devices you assign.
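As a minimal sketch (assuming TensorFlow 1.x and that at least two GPUs are visible; allow_soft_placement is enabled so it still runs if an op has no GPU kernel), the intent of the snippet in the question would look roughly like this once expressed as graph operations run through one session:

import tensorflow as tf

# tf.device only affects TensorFlow ops added to the graph,
# not plain Python statements such as print().
with tf.device('/gpu:0'):
    a = tf.range(0, 10)    # values 0..9, requested on GPU 0
with tf.device('/gpu:1'):
    b = tf.range(10, 20)   # values 10..19, requested on GPU 1

# A single session.run() hands both ops to TensorFlow's scheduler,
# so they can execute on their devices concurrently.
config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    print(sess.run([a, b]))

Note that the printed lists still appear in order; the parallelism is in how TensorFlow executes the ops, not in how Python prints the fetched results.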
For my previous understanding of the question, a discussion of training neural networks with multiple GPUs follows below:
The bad news is that there is no magic functionality that will simply do this for you.
The good news is that there are a few established methods.
The first one will be familiar to some CUDA and perhaps other GPU developers: replicate the model to multiple GPUs and synchronize through the CPU. One way to do this is to split your dataset into batches, called towers in this case, and feed each GPU a tower. If this were the MNIST dataset and you had two GPUs, you could explicitly initialize this data with the CPU as the device. Since each GPU now sees a smaller portion of the dataset, its relative batch size can be larger. Once you complete an epoch, you share the gradients and average them to train both networks. Of course, this easily scales to your case with 8 GPUs (a sketch of the gradient averaging follows the minimal example below).
A minimal example of distributing a task across GPUs and collecting the results on the CPU can be seen below:
import tensorflow as tf

# Creates a graph: one matmul per GPU, results collected in c.
c = []
for d in ['/gpu:2', '/gpu:3']:
    with tf.device(d):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
        c.append(tf.matmul(a, b))
# Sums the per-GPU results on the CPU.
with tf.device('/cpu:0'):
    sum = tf.add_n(c)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(sum))
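Building on that, here is a hedged sketch of the tower pattern with gradient averaging described above. It is only an illustration under assumed names: tower_loss, the toy linear model, and the random batches are all made up, so substitute your own model and input pipeline.

import tensorflow as tf

# Hypothetical two-tower setup: each GPU computes gradients for its
# own batch, the CPU averages them and applies a single update.
def tower_loss(x, y):
    # Toy linear model standing in for a real network.
    w = tf.get_variable('w', shape=[10, 1])
    pred = tf.matmul(x, w)
    return tf.reduce_mean(tf.square(pred - y))

opt = tf.train.GradientDescentOptimizer(0.01)
tower_grads = []
for i, gpu in enumerate(['/gpu:0', '/gpu:1']):
    with tf.device(gpu), tf.variable_scope('model', reuse=(i > 0)):
        x = tf.random_normal([32, 10])   # stand-in for one data tower
        y = tf.random_normal([32, 1])
        tower_grads.append(opt.compute_gradients(tower_loss(x, y)))

with tf.device('/cpu:0'):
    # Average the per-tower gradients variable by variable.
    averaged = []
    for grads_and_vars in zip(*tower_grads):
        grads = tf.stack([g for g, _ in grads_and_vars])
        averaged.append((tf.reduce_mean(grads, axis=0),
                         grads_and_vars[0][1]))
    train_op = opt.apply_gradients(averaged)

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op)   # one synchronized step across both towers

Extending the device list to eight entries gives the 8-GPU case.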
However, transferring data between many devices will prevent you from getting an exactly N-fold speedup from N GPUs. Therefore, you need to optimize the workload for each GPU to maximize performance and avoid inter-device communication as much as possible.
The second one is splitting your neural network itself across the number of devices you have, training the parts, and merging them; a rough illustration follows.
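As a rough illustration only (TensorFlow 1.x assumed; the layer sizes and placeholder shape are arbitrary), splitting a network across devices can look like this, with the first layer on one GPU and the second on another:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])

# Hypothetical model-parallel split: activations computed on GPU 0
# are consumed by the layer placed on GPU 1.
with tf.device('/gpu:0'):
    hidden = tf.layers.dense(x, 256, activation=tf.nn.relu)
with tf.device('/gpu:1'):
    logits = tf.layers.dense(hidden, 10)

config = tf.ConfigProto(allow_soft_placement=True,
                        log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(logits, feed_dict={x: np.zeros((1, 784), np.float32)})

The cost here is that every forward and backward pass moves activations between devices, which ties back to the communication overhead mentioned above.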
Running models explicitly on multiple GPUs requires you to structure your algorithm in that fashion. Check these out:
https://www.tensorflow.org/guide/using_gpu#using_multiple_gpus
https://gist.github.com/j-min/69aae99be6f6acfadf2073817c2f61b0