Sample code for model-parallel and pipelined training in TensorFlow




Naive model partitioning across several GPUs moves the workload from GPU to GPU during the forward and backward passes, so at any instant only one GPU is busy. Here's the naive version.


# Assumes the standard Keras imports and a Sequential model;
# input_shape and num_classes are defined elsewhere.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

model = Sequential()

with tf.device('/gpu:0'):
    model.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

with tf.device('/gpu:1'):
    model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))

with tf.device('/gpu:2'):
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(1500, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))



We need sample code (a template) that pipelines the work and keeps all GPUs busy by sending in waves of batches, and that coordinates the work on each GPU (forward pass, gradient calculation, parameter updates).
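
For illustration, the driver loop of such a pipeline might look like the following sketch (put_ops, train_op, num_stages, and num_steps are hypothetical names, built e.g. as in the StagingArea snippet further below):


# Warm-up: pre-fill the pipeline so every GPU has a staged batch.
for _ in range(num_stages):
    sess.run(put_ops)

# Steady state: each step consumes one staged wave and stages the next,
# so all GPUs are busy simultaneously.
for step in range(num_steps):
    sess.run([train_op, put_ops])

# Drain: finish the batches still in flight without staging new ones.
for _ in range(num_stages):
    sess.run(train_op)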



A hint is provided here via the use of data_flow_ops.StagingArea, but a concrete example would be helpful.
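
For illustration, here is a minimal single-GPU sketch of the staging idea (the placeholders, shapes, and batch size are assumptions, not taken from any existing code). put() copies the next batch onto the GPU while the current batch is being computed; get() hands the previously staged batch to the compute graph.


import tensorflow as tf
from tensorflow.python.ops import data_flow_ops

batch_size = 32  # assumed
images = tf.placeholder(tf.float32, [batch_size, 28, 28, 1])
labels = tf.placeholder(tf.int64, [batch_size])

with tf.device('/gpu:0'):
    # Staging buffer that lives in GPU memory.
    stage = data_flow_ops.StagingArea(
        dtypes=[tf.float32, tf.int64],
        shapes=[[batch_size, 28, 28, 1], [batch_size]])
    put_op = stage.put([images, labels])
    staged_images, staged_labels = stage.get()
    # ... build the forward pass, loss, and train_op from
    # staged_images and staged_labels ...

# Each run() consumes one staged batch and stages the next, overlapping
# the host-to-device copy for batch i+1 with the compute for batch i:
# sess.run([train_op, put_op], feed_dict={images: ..., labels: ...})


The same pattern generalizes to the model-parallel case: place a StagingArea on each downstream GPU and stage the activations produced by the upstream GPU, so that GPU k works on batch i while GPU k+1 works on batch i-1.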



I understand that data partitioning (data parallelism) is usually the way to go, but there are use-cases where the model needs to be partitioned across the CPU and GPUs.
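
For example (a hypothetical sketch, not code from the question), an embedding table too large for GPU memory can be pinned to the CPU while the dense layers run on the GPU; token_ids and num_classes are assumed to be defined elsewhere:


with tf.device('/cpu:0'):
    # Hypothetical embedding table, far too large for GPU memory (~25 GB).
    embeddings = tf.get_variable('embeddings', [50000000, 128])
    embedded = tf.nn.embedding_lookup(embeddings, token_ids)

with tf.device('/gpu:0'):
    hidden = tf.layers.dense(embedded, 256, activation=tf.nn.relu)
    logits = tf.layers.dense(hidden, num_classes)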



Grateful for any pointer or sample (pseudo)code.




1 Answer



Model parallelism is a technique used when a model's parameters don't fit into a single GPU's memory, which happens either when the model is quite complex (many layers) or when some of its layers are huge. It is usually something you should use only as a last resort, because it tends to be quite slow.



Your model looks quite simple, so I am not sure you really need model parallelism (was it just an example?). If you only want to use a single GPU at a time and can fit the whole model into it, I wouldn't recommend model parallelism.



If you are sure you need model parallelism, then refer to this example to do it using Apache MXNet.





Thanks for the link. To answer your question: the code above was just an example of the naive approach. But imagine having large 10K x 10K images to classify, for example.
– auro
Aug 7 at 20:09






Does a single image fit into GPU memory (10000 * 10000 * 3 / 1024 / 1024 ≈ 286 MB per image)? If yes, then data parallelism with a tiny mini-batch size (maybe even 1) is the answer. If not, I am not sure you can run the training at all, model parallelism or not, since you would still need to fit at least one sample into GPU memory.
– Sergei
Aug 7 at 20:20





I agree with your batch-size analysis, but where would the layer's output activations go? A question on MXNet: in the example you provided, is the processing pipelined? As I understood it, the embedding processing is on the CPU and the fully connected net is on the GPU(s), maybe more than one. Are both CPU and GPU busy at the same time?
– auro
Aug 7 at 21:36





I haven't run it myself, so I don't know for sure. But my guess would be that because one part of the model depends on another in the forward pass, and the dependency simply reverses in the backward pass, they would have to compute sequentially. That means they won't be "busy" at the same time (of course, the actual computation may still require some CPU cycles even when it is the GPU that does the heavy lifting).
– Sergei
Aug 8 at 21:29






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard