
I have to deploy a real-time AI model that is used sparsely: sometimes once a week, sometimes 500 times a day.

The solution currently runs in a local container and is essentially an API: it takes images as input and returns images as output.
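To make the setup concrete, the API looks roughly like this (a minimal sketch; FastAPI, the `/predict` route, and `run_model` are my illustration, not the actual code):

```python
# Hypothetical shape of the API (the real code isn't shown here):
# a FastAPI app that accepts an image upload and returns an image.
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import Response

app = FastAPI()

def run_model(image_bytes: bytes) -> bytes:
    """Placeholder for the GPU inference step (model not specified)."""
    raise NotImplementedError

@app.post("/predict")
async def predict(file: UploadFile = File(...)) -> Response:
    image_in = await file.read()
    image_out = run_model(image_in)  # GPU inference happens here
    return Response(content=image_out, media_type="image/png")
```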

The main constraint is a near-instant response: 5-10 s of latency overall at most. The model itself isn't the problem; the deployment setup is.

Also, an hourly-billed GPU instance would likely not be cost-effective, since it would spend most of its time running without doing any computation at all (for example, even an entry-level GPU instance at roughly $0.5/hour adds up to several hundred dollars a month while sitting idle).

I was thinking of using an EC2 instance with persistent storage (for instance EFS) to keep the Docker image around, so the time to build the environment would be reduced. Unfortunately, after researching online, I found that starting an on-demand instance takes a few minutes, which is far too slow.
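For reference, this is the start-and-wait flow I was considering (a minimal sketch assuming boto3 and a pre-configured, stopped instance; the instance ID is a placeholder). The waiter alone typically takes minutes, before boot and container startup, which is what rules this approach out:

```python
import time
import boto3

# Placeholder: a pre-configured, stopped GPU instance with the Docker
# image already baked into the AMI or cached on EFS.
INSTANCE_ID = "i-0123456789abcdef0"

ec2 = boto3.client("ec2")

start = time.monotonic()
ec2.start_instances(InstanceIds=[INSTANCE_ID])

# Block until the instance reports "running" -- in practice this wait,
# plus OS boot and container startup, takes minutes, not seconds.
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])

print(f"Instance running after {time.monotonic() - start:.0f}s")
```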

I also experimented with Hugging Face Inference Endpoints, an easy-to-use tool for deploying an AI container, but unfortunately it is, yet again, hourly billed.

To summarize the constraints:

  • Sparsely used: anywhere from once a week to 500 times a day
  • Needs NVIDIA GPU capability
  • 5-10 s response time maximum
  • As cheap as possible

Thanks in advance!
