Goal:
This article is a step-by-step guide on how to install a Kubernetes cluster with NVIDIA GPU support on AWS.
It covers spinning up an AWS EC2 instance, installing the NVIDIA driver and CUDA Toolkit, installing a Kubernetes cluster with GPU support, and finally running a Spark + RAPIDS job to test it.
Env:
AWS EC2 (G4dn)
Ubuntu 18.04
Solution:
1. Spin up an AWS EC2 instance with NVIDIA GPU
Here I choose "Ubuntu Server 18.04 LTS (HVM), SSD Volume Type" base image.
Choose "Instance Type": g4dn.2xlarge (8vCPU, 32G memory, 1x 225 SSD).
Note: EC2 G4dn instance has NVIDIA T4 GPU(s) attached.
Go to "Step 3: Configure Instance Details": Auto-assign Public IP=Enable.
Go to "Step 4: Add Storage": Increase the Root Volume from default 8G to 200G.
Go to "Step 6: Configure Security Group": Create a security group with ssh only allowed from your public IP address.
Eventually "Launch" and select an existing key pair or create a new key pair.
2. SSH to the EC2 instance
Please follow the AWS documentation on how to SSH to an EC2 instance.
ssh -i /path/my-key-pair.pem ubuntu@my-instance-public-dns-name
sudo su - root
3. Install the NVIDIA Driver and CUDA Toolkit
Please follow this blog on How to install CUDA Toolkit and NVIDIA Driver on Ubuntu (step by step).
Make sure "nvidia-smi" returns correct results.
Below is a lazy-man's script to install CUDA 11.0.3 with NVIDIA Driver 450.51.06 on Ubuntu x86-64, to be run as root after you log on to the EC2 machine:
(Note: please validate it carefully yourself!)
apt-get update
apt install -y gcc
apt-get install -y linux-headers-$(uname -r)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
mv cuda-ubuntu1804.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
dpkg --install cuda-repo-ubuntu1804-11-0-local_11.0.3-450.51.06-1_amd64.deb
apt-key add /var/cuda-repo-ubuntu1804-11-0-local/7fa2af80.pub
apt-get update
apt-get install -y cuda
printf "export PATH=/usr/local/cuda/bin\${PATH:+:\${PATH}}\nexport LD_LIBRARY_PATH=/usr/local/cuda/lib64{LD_LIBRARY_PATH:+:\${LD_LIBRARY_PATH}}" >> ~/.bashrc
nvidia-smi
4. Install a Kubernetes Cluster with NVIDIA GPU
Please follow this NVIDIA Doc on how to install a Kubernetes Cluster with NVIDIA GPU attached.
Here I choose to use "Option 2" which is to use kubeadm.
4.1 Install Docker
curl https://get.docker.com | sh \
&& sudo systemctl --now enable docker
4.2 Install kubeadm
Please follow this K8s Doc on how to install kubeadm.
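For convenience, here is a minimal sketch of the kubeadm installation on Ubuntu 18.04, assuming the upstream apt repository that the K8s doc used at the time of writing (please cross-check the doc, since the repository location may change):
# Add the Kubernetes apt repository and install kubelet, kubeadm and kubectl
sudo apt-get update && sudo apt-get install -y apt-transport-https ca-certificates curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list
sudo apt-get update
sudo apt-get install -y kubelet kubeadm kubectl
# Pin the versions so an apt upgrade does not break the cluster
sudo apt-mark hold kubelet kubeadm kubectl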
4.3 Init a Kubernetes Cluster
kubeadm init --pod-network-cidr=192.168.0.0/16
Then follow the steps printed at the end to start using the cluster.
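Those printed steps are typically the standard kubeconfig setup shown below; prefer whatever your kubeadm output actually prints:
# Copy the admin kubeconfig so kubectl can talk to the new cluster
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config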
4.4 Configure network
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml
kubectl taint nodes --all node-role.kubernetes.io/master-
4.5 Check that the node is in "Ready" status
# kubectl get nodes
NAME               STATUS   ROLES                  AGE   VERSION
ip-xxx-xxx-xx-xx   Ready    control-plane,master   11m   v1.20.5
4.6 Install NVIDIA Container Toolkit (nvidia-docker2)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - \
&& curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
Then install the nvidia-docker2 package and its dependencies:
sudo apt-get update \
&& sudo apt-get install -y nvidia-docker2
Add "default-runtime" set to "nvidia" into /etc/docker/daemon.json:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
Restart Docker daemon:
sudo systemctl restart docker
Test a base CUDA container:
sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
4.7 Install NVIDIA Device Plugin
First, install Helm, which is the preferred deployment method:
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \
&& chmod 700 get_helm.sh \
&& ./get_helm.sh
Add the nvidia-device-plugin helm repository:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin \
&& helm repo update
Deploy the device plugin:
helm install --generate-name nvdp/nvidia-device-plugin
Check the running pods to make sure the nvidia-device-plugin-xxx pod is running:
kubectl get pods -A
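As an optional sanity check (a convenience grep, not part of the original doc), confirm that the node now advertises the GPU resource:
# Both the Capacity and Allocatable sections should show "nvidia.com/gpu: 1"
kubectl describe node | grep -i "nvidia.com/gpu"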
4.8 Test CUDA job
Create gpu-pod.yaml with the content below:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-operator-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vector-add
    image: "nvidia/samples:vectoradd-cuda10.2"
    resources:
      limits:
        nvidia.com/gpu: 1
Deploy this sample pod:
kubectl apply -f gpu-pod.yaml
After the pod completes successfully, check the logs to confirm:
# kubectl logs gpu-operator-test
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
5. Test a Spark + RAPIDS job on K8s
Please follow this Doc on Getting Started with RAPIDS and Kubernetes.
Please also refer to the Spark on K8s doc to get familiar with the basics.
For example, here we assume you know how to create a service account and assign the proper role to it.
5.1 Create a service account named "spark" to run Spark jobs
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
5.2 Capture the cluster-info
kubectl cluster-info
Take note of the "Kubernetes control plane" URL, which will be used as the master URL for the Spark jobs.
5.3 Run sample spark jobs
Follow all the steps in Getting Started with RAPIDS and Kubernetes to run the sample Spark jobs in cluster or client mode.
Here we use the "spark" service account to run the Spark jobs by adding the extra option below:
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
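For illustration only, here is a rough sketch of where that option fits in a cluster-mode spark-submit. The master URL is the "Kubernetes control plane" URL captured above; the container image and example jar path are placeholders to replace, and the complete set of RAPIDS options (plugin, GPU discovery script, etc.) is in the Getting Started doc:
$SPARK_HOME/bin/spark-submit \
  --master k8s://https://<kubernetes-control-plane-host>:<port> \
  --deploy-mode cluster \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=<your-spark-rapids-image> \
  --conf spark.executor.instances=1 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --class org.apache.spark.examples.SparkPi \
  local:///opt/spark/examples/jars/spark-examples_<version>.jar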
References:
https://spark.apache.org/docs/latest/running-on-kubernetes.html
https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/
https://docs.nvidia.com/datacenter/cloud-native/kubernetes/install-k8s.html