Tao Ran • 6 minutes read
Today, businesses and developers are keen to use the cloud for deep learning, especially GPU cloud instances, where you pay as you go. That is usually much more cost-efficient than keeping an expensive bare-metal machine in the office.
But let's switch roles. Say you are the GPU cloud provider, and you offer a GPU environment for hosting other users' applications. The problem now becomes: how can you, as the platform provider, lower GPU costs to maximize profit?
This is not about finding the cheapest GPU vendors. In fact, it is the question we were facing at Jina AI when designing our GPU cloud platform.
The answer is time-slicing.
💡
Time-slicing allows oversubscription of GPUs. Under the hood, CUDA time-slicing lets workloads that land on an oversubscribed GPU interleave with one another. Each workload has access to the GPU memory and runs in the same fault domain as all the others.
In this article, we will use Karpenter, an elastic node-scaling tool for Kubernetes, together with NVIDIA’s k8s plugin to achieve time-slicing on GPUs. A GPU cloud with time-slicing lets users share GPUs between pods, which saves costs.
Karpenter itself provides node autoscaling: a GPU instance exists only when you need it, and nodes are provisioned based on the instance types you configured. This saves money and schedules nodes more effectively.
The purpose of combining GPUs with Karpenter is not only to save costs. More importantly, it gives us a flexible way to schedule GPU resources for our applications within the Kubernetes cluster. You may own tens of applications that need a GPU in different time slots, so scheduling them cost-effectively matters a great deal in the cloud.
Architecture
It’s pretty straightforward: the application selects a Karpenter provisioner with a node selector. The Karpenter provisioner then creates nodes based on the launch template configured in that provisioner.
Deployment
Building the architecture is simple; the problem we are left with is how to deploy it. There are some particulars we need to think about:
How do we deploy the NVIDIA k8s plugin only to nodes with GPUs?
How do we configure the shared GPU nodes to use time-slicing without affecting others?
How do we automatically update the node AMI in the launch template so the nodes can use the latest image?
How do we set up Karpenter provisioners?
Let’s go through them one by one.
First, install Karpenter and set up the provisioner with Terraform. You can also install Karpenter on EKS manually by following the official documentation. If you already have EKS with Karpenter, you can skip this step.
Set provisioner
The provisioners are set to use the corresponding launch templates to provision GPU nodes with labels and taints.
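As a rough illustration, a provisioner for shared GPU nodes could look like the sketch below. It assumes the older `karpenter.sh/v1alpha5` Provisioner API with an inline AWS `provider` block (newer Karpenter releases use different resource kinds). The provisioner name, the `example.com/...` labels, the taint key, the instance type, the launch template name, and the subnet tag are all hypothetical placeholders, not the values used in our cluster. The `nvidia.com/device-plugin.config` label is the one NVIDIA's device plugin reads to pick a named config for a node.

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-shared                        # hypothetical provisioner name
spec:
  labels:
    example.com/node-type: gpu-shared     # hypothetical label; applications select on it
    example.com/gpu-type: nvidia          # hypothetical label; the plugin DaemonSet selects on it
    nvidia.com/device-plugin.config: shared_gpu   # name of the plugin config to apply on these nodes
  taints:
    - key: nvidia.com/gpu-shared          # hypothetical taint key; keeps other pods off these nodes
      effect: NoSchedule
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["g4dn.xlarge"]             # assumed GPU instance type
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  provider:
    launchTemplate: gpu-shared-lt         # hypothetical launch template name
    subnetSelector:
      karpenter.sh/discovery: my-cluster  # hypothetical subnet tag
  ttlSecondsAfterEmpty: 30                # remove the node shortly after it becomes empty
```

A second provisioner for dedicated (non-shared) GPU nodes would follow the same shape, with its own labels and taint and without the time-slicing config label.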
Add time-slicing config
Second, we need to deploy the NVIDIA k8s plugin with both a time-slicing config and a default config, and set a node selector so that the DaemonSet runs only on the GPU instances.
Run the command below to install NVIDIA’s k8s plugin:
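As a sketch of the configuration such an install could consume, assuming the plugin is installed from NVIDIA's `nvidia-device-plugin` Helm chart, a values file might look like the following. The config names `default` and `shared_gpu`, the replica count of 10, the `example.com/gpu-type` label, and the `nvidia.com/gpu-shared` taint key are assumptions made for illustration.

```yaml
# values.yaml for the nvidia-device-plugin Helm chart (illustrative sketch)
config:
  map:
    # Applied to GPU nodes that do not ask for a specific config
    default: |-
      version: v1
      flags:
        migStrategy: none
    # Applied to nodes labeled nvidia.com/device-plugin.config=shared_gpu
    shared_gpu: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        timeSlicing:
          resources:
            - name: nvidia.com/gpu
              replicas: 10          # assumed: each physical GPU is advertised 10 times
  default: default                  # config used when a node carries no config label
nodeSelector:
  example.com/gpu-type: nvidia      # hypothetical label; run the DaemonSet on GPU nodes only
tolerations:
  - key: nvidia.com/gpu-shared      # hypothetical taint key on shared GPU nodes
    operator: Exists
    effect: NoSchedule
```

A file like this would be passed to `helm upgrade -i` with the `-f` flag. Because the time-slicing settings live in a separately named config, nodes without the `shared_gpu` label keep the default behavior and are not affected.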
Third, deploy the user application with a nodeSelector and a toleration.
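A minimal sketch of such an application is shown below. It assumes shared GPU nodes carry a hypothetical `example.com/node-type: gpu-shared` label and a hypothetical `nvidia.com/gpu-shared` taint; the Deployment name and the container image tag are placeholders as well.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-shared-demo                   # hypothetical name
spec:
  replicas: 3                             # more pods than physical GPUs on one node
  selector:
    matchLabels:
      app: gpu-shared-demo
  template:
    metadata:
      labels:
        app: gpu-shared-demo
    spec:
      nodeSelector:
        example.com/node-type: gpu-shared # hypothetical label set on shared GPU nodes
      tolerations:
        - key: nvidia.com/gpu-shared      # hypothetical taint key on shared GPU nodes
          operator: Exists
          effect: NoSchedule
      containers:
        - name: cuda
          image: nvidia/cuda:12.2.0-base-ubuntu22.04   # assumed image tag
          command: ["sh", "-c", "nvidia-smi && sleep 3600"]
          resources:
            limits:
              nvidia.com/gpu: 1           # one time-sliced share, not a whole physical GPU
```

With time-slicing enabled, each pod still requests `nvidia.com/gpu: 1`; the difference is that one physical GPU can satisfy several such requests at once.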
Validate the results
Now, if you deploy both YAML files, you will see two nodes provisioned in the AWS console, or you can list them with kubectl get nodes --show-labels. Once the nvidia-k8s-plugin is running on each node, you can test with your applications.
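One thing to look for when validating is the GPU count the shared node advertises. The excerpt below is illustrative rather than captured output: it assumes a node with a single physical GPU, a time-slicing config named `shared_gpu` with `replicas: 10`, and no resource renaming. Under those assumptions, the node object (for example from `kubectl get node <node-name> -o yaml`) should report ten schedulable GPUs.

```yaml
# Illustrative excerpt of a shared GPU node object, not real output
metadata:
  labels:
    nvidia.com/device-plugin.config: shared_gpu   # assumed config name
status:
  capacity:
    nvidia.com/gpu: "10"      # 1 physical GPU x 10 replicas
  allocatable:
    nvidia.com/gpu: "10"
```

A node that uses the default config should instead report its physical GPU count.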
If you like this article or want to learn more about the architecture behind Jina AI Cloud, make sure to follow us on social channels and subscribe to our blog.