Deploying ML models doesn’t always need GPUs or Kubernetes clusters. Sometimes a simple, single machine is plenty.
In the rush to ‘scale’, it is easy to overlook simple solutions. A single virtual machine (VM) is easy to build, deploy and maintain, and with a bit of code optimisation it can power a fast, simple model-serving system, writes Jacques Verré, product manager at Comet ML. Verré describes benchmarking and improving the performance of an API that served the BERT NLP model using FastAPI; with some straightforward modifications he improved throughput from 6 requests per second to 100. That is 8.6 million requests per day!
Some of the improvements he made were:
- Turning off gradient computation in PyTorch
- Tuning FastAPI by adding more Gunicorn workers and using synchronous (rather than async) endpoints
- Using model distillation to decrease model size
- Choosing the right cloud instances (30 vCPUs)
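The first optimisation above is a one-line change at inference time. A minimal sketch follows; the small model here is a hypothetical stand-in for the distilled BERT model the article describes:

```python
import torch
from torch import nn

# Hypothetical stand-in for a distilled transformer served by the API.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
model.eval()  # put layers like dropout into inference mode

x = torch.randn(1, 16)

# Without no_grad, PyTorch builds an autograd graph for every request,
# wasting memory and CPU time on serving workloads.
y_default = model(x)
print(y_default.requires_grad)  # True

# Wrapping inference in torch.no_grad() skips gradient bookkeeping,
# reducing per-request latency and memory use.
with torch.no_grad():
    y = model(x)
print(y.requires_grad)  # False
```

For the second optimisation, a typical (assumed, not from the article) invocation would run the app behind Gunicorn with several Uvicorn workers, e.g. `gunicorn -w 8 -k uvicorn.workers.UvicornWorker app:app`, where the worker count and module names depend on the deployment.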