Unable to run Keras with GPU on AWS EC2
I'm trying to use a g2.2xlarge EC2 instance to train some simple ML models, but I'm not sure if GPU support is working. I'm afraid it isn't, since the training times are very similar to those on my crappy laptop.
I've installed TensorFlow GPU support following these official guidelines, and the following are the outputs of some commands.
Running nvidia-smi in the shell returns:
nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 On | 00000000:00:03.0 Off | N/A |
| N/A 29C P8 17W / 125W | 0MiB / 4037MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Running pip list (excerpt):
pip list
...
jupyterlab (0.31.5)
jupyterlab-launcher (0.10.2)
Keras (2.2.2)
Keras-Applications (1.0.4)
Keras-Preprocessing (1.0.2)
kiwisolver (1.0.1)
...
tensorboard (1.10.0)
tensorflow (1.10.0)
tensorflow-gpu (1.10.0)
...
I get a very similar output by running conda list:
conda list
The Python version is:
Python 3.6.4 |Anaconda
Some other, hopefully useful, outputs:
from keras import backend as K
K.tensorflow_backend._get_available_gpus()
2018-08-11 16:42:54.942052: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-11 16:42:54.943269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-08-11 16:42:54.943309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:42:54.943337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:42:54.943355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:42:54.943371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
2018-08-11 16:44:03.560954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:44:03.561015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:44:03.561035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:44:03.561052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality
incarnation: 15704826459248001252
]
import tensorflow as tf
tf.test.is_gpu_available()
2018-08-11 16:45:22.049670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:45:22.049748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:45:22.049782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:45:22.049814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
False
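For completeness, device placement can also be logged explicitly. Below is a minimal sketch of that check using TF 1.x's log_device_placement option (sketch only, I have not pasted its output here):
# Minimal sketch: ask TensorFlow 1.x to log which device each op is placed on.
import tensorflow as tf

config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name="a")
    b = tf.constant([4.0, 5.0, 6.0], name="b")
    # The placement log printed to stderr shows e.g. ".../device:CPU:0"
    # for each op when no usable GPU is found.
    print(sess.run(a + b))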
Can you confirm that Keras is not running on the GPU? Do you have any suggestions on how to solve this problem?
Thanks
EDIT:
I've tried using a p2.xlarge EC2 instance, but the issue does not seem to be solved. Here are a couple of outputs:
>>> from keras import backend as K
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
2018-08-11 21:54:24.238022: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-08-11 21:54:24.247402: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-08-11 21:54:24.247430: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (ip-172-31-2-145): /proc/driver/nvidia/version does not exist
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality
incarnation: 80385595218229545
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality
incarnation: 6898783310276970136
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality
incarnation: 4859092998934769352
physical_device_desc: "device: XLA_CPU device"
]
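Following up on the "/proc/driver/nvidia/version does not exist" message above, a trivial way to check whether the kernel driver interface is present on the host (a minimal sketch) is:
# Check for the NVIDIA kernel driver interface mentioned in the error above.
import os
print(os.path.exists("/proc/driver/nvidia/version"))  # expected to print False here, matching the error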
Thanks for the prompt reply. Do you have any pointers on how to update the CUDA version? I did a search on Google, but it seems rather cumbersome to update those drivers, so I'm not sure if I'm looking in the right direction. I will try to use a p2.xlarge instance and hope it will work.
– crash
Aug 11 at 18:06
hey @lmontigny I've updated my question with the tests I did on a more recent p2.xlarge EC2 instance. Would you be able to give me a feedback? Thanks again
– crash
Aug 11 at 22:11
Good that you solved the issue! Some info about the CUDA installation here
– supercheval
Aug 12 at 7:44
1 Answer
I solved the problem by doing the following:
conda env list
source activate tensorflow_p36
Activating the environment was probably the step I had never done in my previous tests.
After that, everything was working as expected:
>>> from keras import backend as K
/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
['/job:localhost/replica:0/task:0/device:GPU:0']
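For anyone hitting the same thing, a quick sanity check after activating the environment (a minimal sketch, assuming the tensorflow_p36 environment from conda env list) is:
# Quick sanity check after `source activate tensorflow_p36`: make sure the
# interpreter and the TensorFlow build come from the activated environment.
import sys
import tensorflow as tf

print(sys.executable)              # should point into .../envs/tensorflow_p36/...
print(tf.__version__)
print(tf.test.is_gpu_available())  # should now print True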
Also, running nvidia-smi showed GPU resources being used during model training, and the same was true with nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER:
nvidia-smi
nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER
In my sample case, training a single epoch went from 42s to 13s.
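For reference, the timing came from a small model of my own; a hypothetical benchmark along the same lines (random data and layer sizes are illustrative, not my actual project code) would look something like:
# Hypothetical one-epoch timing run on random data (illustrative only).
import time
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

x = np.random.rand(60000, 784).astype("float32")
y = np.random.randint(0, 10, size=(60000,))

model = Sequential([
    Dense(512, activation="relu", input_shape=(784,)),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

start = time.time()
model.fit(x, y, batch_size=128, epochs=1, verbose=0)
print("one epoch took %.1fs" % (time.time() - start))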
The installed CUDA version is too old. TensorFlow requires compute capability >= 3.5; you need to update the CUDA version on the node or use a newer GPU (the K520 is relatively old) on AWS.
– supercheval
Aug 11 at 17:28