Unable to run Keras with GPU on AWS EC2







I'm trying to use a g2.2xlarge EC2 instance to train some simple ML models, but I'm not sure whether GPU support is working. I'm afraid it isn't, since training times are very similar to those on my crappy laptop.



I've installed TensorFlow with GPU support following these official guidelines; below are the outputs of some commands.



Running nvidia-smi in the shell returns


nvidia-smi


+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37 Driver Version: 396.37 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GRID K520 On | 00000000:00:03.0 Off | N/A |
| N/A 29C P8 17W / 125W | 0MiB / 4037MiB | 0% Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+



Running pip list


pip list


...
jupyterlab (0.31.5)
jupyterlab-launcher (0.10.2)
Keras (2.2.2)
Keras-Applications (1.0.4)
Keras-Preprocessing (1.0.2)
kiwisolver (1.0.1)
...
tensorboard (1.10.0)
tensorflow (1.10.0)
tensorflow-gpu (1.10.0)
...



and I get a very similar output by running conda list.


conda list



The Python version is:


Python 3.6.4 |Anaconda



Some other, hopefully useful, outputs:


from keras import backend as K
K.tensorflow_backend._get_available_gpus()

2018-08-11 16:42:54.942052: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:897] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-08-11 16:42:54.943269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1405] Found device 0 with properties:
name: GRID K520 major: 3 minor: 0 memoryClockRate(GHz): 0.797
pciBusID: 0000:00:03.0
totalMemory: 3.94GiB freeMemory: 3.90GiB
2018-08-11 16:42:54.943309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:42:54.943337: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:42:54.943355: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:42:54.943371: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N





from tensorflow.python.client import device_lib
device_lib.list_local_devices()

2018-08-11 16:44:03.560954: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:44:03.561015: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:44:03.561035: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:44:03.561052: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality

incarnation: 15704826459248001252
]





import tensorflow as tf
tf.test.is_gpu_available()

2018-08-11 16:45:22.049670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1455] Ignoring visible gpu device (device: 0, name: GRID K520, pci bus id: 0000:00:03.0, compute capability: 3.0) with Cuda compute capability 3.0. The minimum required Cuda capability is 3.5.
2018-08-11 16:45:22.049748: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-08-11 16:45:22.049782: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0
2018-08-11 16:45:22.049814: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0: N
False
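
For reference, another way to see where operations actually end up is TensorFlow's device placement logging. Below is a minimal sketch using the TF 1.x session API (the session setup is just an illustration, not something taken from the guidelines; with no usable GPU, everything should be placed on /device:CPU:0):

import tensorflow as tf

# Log the device each op is assigned to; the placement messages are
# printed to stderr when the session runs.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    a = tf.constant([1.0, 2.0, 3.0], name="a")
    b = tf.constant([4.0, 5.0, 6.0], name="b")
    print(sess.run(a + b))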



Can you confirm that Keras is not running on the GPU? Do you have any suggestions on how to solve this problem?



Thanks



EDIT:



I've tried using a p2.xlarge EC2 instance, but the issue does not seem to be solved. Here are a couple of outputs:


>>> from keras import backend as K
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
2018-08-11 21:54:24.238022: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-08-11 21:54:24.247402: E tensorflow/stream_executor/cuda/cuda_driver.cc:406] failed call to cuInit: CUDA_ERROR_UNKNOWN
2018-08-11 21:54:24.247430: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:145] kernel driver does not appear to be running on this host (ip-172-31-2-145): /proc/driver/nvidia/version does not exist
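
The last line suggests the NVIDIA kernel driver is not loaded in the environment I'm running in. A trivial sketch to check for the file that the log refers to (assumes a Linux host):

import os

# TensorFlow's CUDA diagnostics look for this file; if it is missing,
# the NVIDIA kernel driver is not loaded on the host.
print(os.path.exists("/proc/driver/nvidia/version"))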





>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality

incarnation: 80385595218229545
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality

incarnation: 6898783310276970136
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality

incarnation: 4859092998934769352
physical_device_desc: "device: XLA_CPU device"
]





The GPU's CUDA compute capability is too old: TensorFlow requires compute capability >= 3.5, so you need to update the CUDA setup on the node or use a newer GPU (the K520 is relatively old) on AWS.
– supercheval
Aug 11 at 17:28






Thanks for the prompt reply. Do you have any pointers on how to update the CUDA version? I did a search but it seems rather cumbersome to update those drivers, so I'm not sure I'm looking in the right direction. I will try to use a p2.xlarge instance and hope it will work.
– crash
Aug 11 at 18:06





Hey @lmontigny, I've updated my question with the tests I did on a more recent p2.xlarge EC2 instance. Would you be able to give me feedback? Thanks again.
– crash
Aug 11 at 22:11





Good that you solved the issue! Some info about the CUDA installation here.
– supercheval
Aug 12 at 7:44





1 Answer



I solved the problem by doing the following:


conda env list


source activate tensorflow_p36



Activating this environment was probably the step I had never done in my previous tests.



After that, everything worked as expected:


>>> from keras import backend as K
/home/ubuntu/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
['/job:localhost/replica:0/task:0/device:GPU:0']
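
As an extra sanity check, I could also confirm that the interpreter and TensorFlow build come from the activated environment (a minimal sketch reusing the same TF 1.x calls shown in the question; output omitted):

import sys
import tensorflow as tf

# The interpreter path should point inside envs/tensorflow_p36, and
# is_gpu_available() should now return True.
print(sys.executable)
print(tf.__version__)
print(tf.test.is_gpu_available())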



Also, running nvidia-smi showed GPU resources being used during model training; the same was true for nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER.


nvidia-smi


nvidia-smi -i 0 -q -d MEMORY,UTILIZATION,POWER



In my sample case, training a single epoch went from 42s to 13s.
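
For context, here is a hypothetical minimal training snippet of the kind I was timing (the model, data, and sizes are illustrative stand-ins, not my original code):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Illustrative stand-in for a simple model trained on random data.
x = np.random.rand(10000, 100)
y = np.random.randint(0, 2, size=(10000, 1))

model = Sequential([
    Dense(256, activation="relu", input_shape=(100,)),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# With the tensorflow_p36 environment active this runs on the GPU; the
# per-epoch time reported by fit() can be compared against a CPU-only run.
model.fit(x, y, epochs=1, batch_size=128)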






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Popular posts from this blog

Firebase Auth - with Email and Password - Check user already registered

Dynamically update html content plain JS

How to determine optimal route across keyboard