Goal:
This article is a step-by-step guide on how to install a Kubernetes cluster with NVIDIA GPU support on AWS.
It covers spinning up an AWS EC2 instance, installing the NVIDIA driver and CUDA Toolkit, installing a Kubernetes cluster with GPU support, and finally running a Spark + RAPIDS job to test it.
Env:
AWS EC2 (G4dn)
Ubuntu 18.04
Solution:
1. Spin up an AWS EC2 instance with NVIDIA GPU
Here I choose "Ubuntu Server 18.04 LTS (HVM), SSD Volume Type" base image.
Choose "Instance Type": g4dn.2xlarge (8vCPU, 32G memory, 1x 225 SSD).
Note: EC2 G4dn instance has NVIDIA T4 GPU(s) attached.
Go to "Step 3: Configure Instance Details": Auto-assign Public IP=Enable.
Go to "Step 4: Add Storage": Increase the Root Volume from default 8G to 200G.
Go to "Step 6: Configure Security Group": Create a security group with ssh only allowed from your public IP address.
Eventually "Launch" and select an existing key pair or create a new key pair.
2. SSH to the EC2 instance
Please follow the AWS documentation on how to SSH to an EC2 instance.
ssh -i /path/my-key-pair.pem ubuntu@my-instance-public-dns-name
sudo su - root
3. Install the NVIDIA Driver and CUDA Toolkit
Please follow this blog on How to install CUDA Toolkit and NVIDIA Driver on Ubuntu (step by step).
Make sure "nvidia-smi" returns correct results.
Below is a lazy-man's script to install CUDA 11.0.3 with NVIDIA Driver 450.51.06 on Ubuntu x86-64, to be run as root after you log on to the EC2 machine:
(Note: please validate it carefully yourself!)
apt-get update
apt install -y gcc
apt-get install -y linux-headers-$(uname -r)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
dpkg --install cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
apt-key add /var/cuda-repo-ubuntu1804-11-0-local/7fa2af80.pub
apt-get update
apt-get install -y cuda
printf "export PATH=/usr/local/cuda/bin\${PATH:+:\${PATH}}\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64{LD_LIBRARY_PATH:+:\${LD_LIBRARY_PATH}}" >> ~/.bashrc
nvidia-smi
4. Install a Kubernetes Cluster with NVIDIA GPU
Please follow this NVIDIA Doc on how to install a Kubernetes Cluster with NVIDIA GPU attached.
Here I choose to use "Option 2" which is to use kubeadm.
4.1 Install Docker
curl https://get.docker.com | sh \
&& sudo systemctl --now enable docker
4.2 Install kubeadm
Please follow this K8s Doc on how to install kubeadm.
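For convenience, here is a minimal sketch of the kubeadm installation on Ubuntu 18.04, assuming the upstream apt repository that the K8s doc used at the time of writing (please cross-check the doc, since the repository location may change):
# Add the Kubernetes apt repository and install kubelet, kubeadm and kubectl
sudo apt-get update && sudo apt-get install -y apt-transport-https ca-certificates curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
# Pin the versions so an apt upgrade does not break the cluster
sudo apt-mark hold kubelet kubeadm kubectl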
4.3 Init a Kubernetes Cluster
kubeadm init --pod-network-cidr=192.168.0.0/16
Then follow the steps printed at the end to start using the cluster.
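Those printed steps are typically the standard kubeconfig setup shown below; prefer whatever your kubeadm output actually prints:
# Copy the admin kubeconfig so kubectl can talk to the new cluster
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config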
4.4 Configure network
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
kubectl taint nodes --all node-role.kubernetes.io/master-
4.5 Check that the node is in "Ready" status
# kubectl get nodes
NAME               STATUS   ROLES                  AGE   VERSION
ip-xxx-xxx-xx-xx   Ready    control-plane,master   11m   v1.20.5
4.6 Install NVIDIA Container Toolkit (nvidia-docker2)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
Then install the nvidia-docker2 package and its dependencies:
sudo apt-get update \
&& sudo apt-get install -y nvidia-docker2
Add "default-runtime" set to "nvidia" into /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart Docker daemon:
sudo systemctl restart docker
Test a base CUDA container:
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
4.7 Install NVIDIA Device Plugin
First, install Helm, which is the preferred deployment method:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
Add the nvidia-device-plugin helm repository:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
&& helm repo update
Deploy the device plugin:
helm install --generate-name nvdp/nvidia-device-plugin
Check the running pods to make sure the nvidia-device-plugin-xxx pod is running:
kubectl get pods -A
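As an optional sanity check (a convenience grep, not part of the original doc), confirm that the node now advertises the GPU resource:
# Both the Capacity and Allocatable sections should show "nvidia.com/gpu: 1"
kubectl describe node | grep -i "nvidia.com/gpu"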
4.8 Test CUDA job
Create gpu-pod.yaml with the content below:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
        nvidia.com/gpu: 1
Deploy this sample pod:
kubectl apply -f gpu-pod.yaml
After the pod completes successfully, check the logs to confirm:
# kubectl logs gpu-operator-test
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
5. Test a Spark + RAPIDS job on K8s
Please follow this Doc on Getting Started with RAPIDS and Kubernetes.
Please also refer to the Spark on K8s doc to get familiar with the basics.
For example, here we assume you know how to create a service account and assign the proper role to it.
5.1 Create a service account named "spark" to run Spark jobs
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
5.2 Capture the cluster-info
kubectl cluster-info
Take note of the "Kubernetes control plane" URL, which will be used as the master URL for the Spark jobs.
5.3 Run sample spark jobs
Follow all the steps in Getting Started with RAPIDS and Kubernetes to run the sample Spark jobs in cluster or client mode.
Here we use the "spark" service account to run the Spark jobs by adding the extra option below:
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
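For illustration only, here is a rough sketch of where that option fits in a cluster-mode spark-submit. The master URL is the "Kubernetes control plane" URL captured above; the container image and example jar path are placeholders to replace, and the complete set of RAPIDS options (plugin, GPU discovery script, etc.) is in the Getting Started doc:
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://<kubernetes-control-plane-host>:<port> \
  --deploy-mode cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<your-spark-rapids-image> \
  --conf spark.executor.instances=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples_<version>.jar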
References:
https://spark.apache.org/docs/latest/running-on-kubernetes.html
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html