Issue running an image matching service with FaceNet on GPU using Gunicorn and Flask







I'm trying to create an inference server for the FaceNet model for an image matching task, using Flask APIs. I'm using Gunicorn to scale the server, and the client sends images to the server via a POST request as an encoded string. The server takes that image, matches it against an image stored in a MongoDB database, and computes the distance between them.
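Roughly, the endpoint does something like the sketch below (the route name, the embed() helper and the Mongo fields are placeholders, not my exact code, and I'm assuming base64 encoding for the posted image string):

import base64
import io

import numpy as np
from flask import Flask, request, jsonify
from PIL import Image
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017")["faces"]


def embed(image):
    """Placeholder: run the image through the FaceNet graph and return its embedding."""
    raise NotImplementedError


@app.route("/match", methods=["POST"])
def match():
    # The client posts the image as a base64-encoded string (assumption).
    img_bytes = base64.b64decode(request.form["image"])
    img = np.array(Image.open(io.BytesIO(img_bytes)))

    query_emb = embed(img)

    # Look up the stored embedding for this person and return the L2 distance.
    ref = db.embeddings.find_one({"person_id": request.form["person_id"]})
    dist = float(np.linalg.norm(query_emb - np.array(ref["embedding"])))
    return jsonify({"distance": dist})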



TensorFlow loads the model when the app is started under Gunicorn, and Gunicorn creates worker processes that I can see with nvidia-smi pmon. However, when I send requests to the server from a client, only GPU 0 is being used, and even that one is used far less than when I run the same code without the server/client setup.
My Gunicorn call uses the gevent worker class and looks like this:




gunicorn --bind 0.0.0.0:5000 --timeout 1000000 -w 4 -k gevent wsgi:app



I have 4 GPUs, and while the server is running with the above command, the nvidia-smi pmon output is as follows:




# gpu        pid  type    sm   mem   enc   dec   command
# Idx          #   C/G     %     %     %     %   name
    0      93715     C     0     0     0     0   python
    0      93716     C     0     0     0     0   python
    0      93717     C     0     0     0     0   python
    0      93719     C     3     0     0     0   python
    1      93715     C     0     0     0     0   python
    1      93716     C     0     0     0     0   python
    1      93717     C     0     0     0     0   python
    1      93719     C     0     0     0     0   python
    2      93715     C     0     0     0     0   python
    2      93716     C     0     0     0     0   python
    2      93717     C     0     0     0     0   python
    2      93719     C     0     0     0     0   python
    3      93715     C     0     0     0     0   python
    3      93716     C     0     0     0     0   python
    3      93717     C     0     0     0     0   python
    3      93719     C     0     0     0     0   python
    0      93715     C     0     0     0     0   python
    0      93716     C     0     0     0     0   python
    0      93717     C     0     0     0     0   python
    0      93719     C     2     0     0     0   python
    1      93715     C     0     0     0     0   python
    1      93716     C     0     0     0     0   python
    1      93717     C     0     0     0     0   python
    1      93719     C     0     0     0     0   python
    2      93715     C     0     0     0     0   python
    2      93716     C     0     0     0     0   python
    2      93717     C     0     0     0     0   python
    2      93719     C     0     0     0     0   python
    3      93715     C     0     0     0     0   python
    3      93716     C     0     0     0     0   python
    3      93717     C     0     0     0     0   python
    3      93719     C     0     0     0     0   python
    0      93715     C     0     0     0     0   python
    0      93716     C     0     0     0     0   python
    0      93717     C     0     0     0     0   python
    0      93719     C     3     0     0     0   python



As can be seen above, only GPU 0 is receiving any work, and even there the utilisation is only around 3-5%. My test code, run directly without the server/client setup, reaches around 25% utilisation on each GPU.
Can someone explain what I'm doing wrong, or suggest anything else I should try?




1 Answer



The reason only the 1st GPU gets used is that, by default, even though TensorFlow can see all 4 GPUs, it only places operations on the 1st one.



Here, although Gunicorn spawns several workers and each of them loads its own TensorFlow session, they all see the 4 GPUs and all default to the 1st one.



I think one possible solution is to run 4 separate Flask/Gunicorn instances, each configured with the environment variable CUDA_VISIBLE_DEVICES set to 0, 1, 2 and 3 respectively. Then use Nginx to forward the API calls to these 4 servers.
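A minimal sketch of that idea (the INSTANCE_GPU variable and the wsgi_gpu module name are just illustrations; you could equally export CUDA_VISIBLE_DEVICES in the shell before each gunicorn command):

# wsgi_gpu.py -- illustrative shim: pin this server instance to one GPU.
# Launch 4 copies on different ports, e.g. with INSTANCE_GPU=0..3, and
# let Nginx distribute requests across them.
import os

# Must be set before TensorFlow is imported, otherwise TF grabs GPU 0.
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ.get("INSTANCE_GPU", "0")

from wsgi import app  # the existing Flask app from the question

Each instance could then be started as, for example, INSTANCE_GPU=1 gunicorn --bind 0.0.0.0:5001 --timeout 1000000 -w 1 -k gevent wsgi_gpu:app, with Nginx balancing across the four ports.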



I'm not sure about the reason for the low GPU usage, though.





I actually modified the visible-devices argument in config.gpu_options and was able to force each worker to pick its own GPU (roughly along the lines of the sketch below), but I still wasn't able to solve the low-usage issue.
– mankeyboy
Aug 15 at 10:00
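
For reference, per-worker pinning of that kind might look roughly like this (TF 1.x API; how each worker learns its index, worker_id here, is an assumption):

# Sketch of per-worker GPU selection via gpu_options (TF 1.x).
import tensorflow as tf

worker_id = 0  # assumption: derived per worker, e.g. from an environment variable

config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(worker_id)  # expose only this GPU to the session
config.gpu_options.allow_growth = True                   # don't pre-allocate all GPU memory
sess = tf.Session(config=config)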






