Goal:
How to check if a Spark job runs out of quota in a CSpace.

Env:
MKE 1.0

Solution:
The example configuration file for a CSpace based on MKE 1.0 has the below 3 default pods:
- terminal
- hivemetastore
- sparkhs
This information is in the mapr-operators repository:

git clone https://github.com/mapr/mapr-operators
cd ./mapr-operators
git checkout mke-1.0.0.0
cat examples/cspaces/cr-cspace-full-gce.yaml

cspaceservices:
  terminal:
    count: 1
    image: cspaceterminal-6.1.0:201912180140
    sshPort: 7777
    requestcpu: "2000m"
    requestmemory: 8Gi
    logLevel: INFO
  hivemetastore:
    count: 1
    image: hivemeta-2.3:201912180140
    requestcpu: "2000m"
    requestmemory: 8Gi
    logLevel: INFO
  sparkhs:
    count: 1
    image: spark-hs-2.4.4:201912180140
    requestcpu: "2000m"
    requestmemory: 8Gi
    logLevel: INFO

So when we calculate how many resources are available for other ecosystem components such as Spark and Drill, we need to take these default pods' resources into consideration.
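The arithmetic used throughout this article can be sketched in a few lines of Python. This is a hypothetical helper, not part of MKE; `parse_cpu` and the pod list are assumptions based on the YAML above:

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity string (e.g. "2000m" or "2") to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)

# requestcpu values of the 3 default pods, copied from cr-cspace-full-gce.yaml
default_pod_requests = {
    "terminal": "2000m",
    "hivemetastore": "2000m",
    "sparkhs": "2000m",
}
reserved_cpus = sum(parse_cpu(q) for q in default_pod_requests.values())
print(reserved_cpus)  # 6.0 CPUs reserved before any Spark or Drill pod starts
```

So a CSpace with a 50-CPU quota effectively starts with 50 - 6 = 44 CPUs free.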
How to check if the Spark job is running out of quota in CSpace?
We need to get the Spark driver log. Taking the pi job as an example:

kubectl logs spark-pi-driver -n mycspace

or:

sparkctl log spark-pi -n mycspace
There are at least 3 scenarios:
1. No nodes in Kubernetes cluster have sufficient resources
For example, suppose the CSpace quota is 50 CPUs and no other pods are running besides the 3 default pods. Since the default pods request 6 CPUs in total, we still have 50 - 6 = 44 CPUs available for running one Spark job.
If the Spark driver only needs 1 CPU, then we still have 43 CPUs available for Spark executors.
Take the below definition in the Spark job YAML file:

executor:
  cores: 20
  instances: 2
  memory: "1024m"
  labels:
    version: 2.4.4

Here we ask for 2 Spark executors with 20 CPUs each.
Symptom:
The requirement (40 CPUs) is below the available quota (43 CPUs); however, it may still hit the below error in the Spark driver log:

WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

Troubleshooting:
The 2 Spark executor pods are pending forever:

$ kubectl get pods -n mycspace
NAME                            READY   STATUS    RESTARTS   AGE
spark-pi-1581449230742-exec-1   0/1     Pending   0          17m
spark-pi-1581449230742-exec-2   0/1     Pending   0          16m
spark-pi-driver                 1/1     Running   0          17m
...
Running "kubectl describe" on an executor pod tells the reason why it is pending:
$ kubectl describe pod spark-pi-1581449230742-exec-1 -n mycspace
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  2m24s (x29 over 17m)  default-scheduler  0/3 nodes are available: 3 Insufficient cpu.

Basically it means no node has sufficient CPU resources.
This can be confirmed by the below command:

$ kubectl describe node
...
Allocatable:
  attachable-volumes-csi-com.mapr.csi-kdf:  20
  attachable-volumes-gce-pd:                127
  cpu:                                      15890m
  ephemeral-storage:                        47093746742
  hugepages-2Mi:                            0
  memory:                                   56288592Ki
  pods:                                     110
...

Root Cause:
In this Kubernetes cluster, we have 3 nodes.
The emptiest node can allocate at most 15.89 CPUs, which is less than the 20 CPUs requested by each executor.
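The per-node check that the scheduler performs can be illustrated with a small sketch; the numbers come from the outputs above, and `parse_cpu` is a hypothetical helper, not a real Kubernetes API:

```python
def parse_cpu(quantity: str) -> float:
    """Convert a Kubernetes CPU quantity string (e.g. "15890m") to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000.0
    return float(quantity)

node_allocatable = parse_cpu("15890m")  # the emptiest node: 15.89 cores
executor_request = 20.0                 # executor.cores in the Spark job YAML

# The CSpace quota check passes (40 <= 43), but no single node can fit one
# 20-CPU executor, hence "0/3 nodes are available: 3 Insufficient cpu".
print(executor_request <= node_allocatable)  # False
```

This is why a job can be within quota yet still never get its executors scheduled: the quota is a namespace-wide sum, while scheduling is decided node by node.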
2. Spark executors run out of the CSpace quota
For example, suppose the CSpace quota is 10 CPUs and no other pods are running besides the 3 default pods. We still have 10 - 6 = 4 CPUs available for running one Spark job.
If the Spark driver only needs 1 CPU, then we still have 3 CPUs available for Spark executors.
Take the below definition in the Spark job YAML file:

driver:
  cores: 1
  coreLimit: "1000m"
  memory: "1024m"
  labels:
    version: 2.4.4
  serviceAccount: mapr-mycspace-cspace-sa
executor:
  cores: 2
  instances: 2
  memory: "1024m"
  labels:
    version: 2.4.4

Here we ask for 2 Spark executors with 2 CPUs each.
Symptom:
The requirement (4 CPUs) is above the available quota (3 CPUs), so the below error may show in the Spark driver log:

ERROR util.Utils: Uncaught exception in thread kubernetes-executor-snapshots-subscribers-1
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://kubernetes.default.svc/api/v1/namespaces/mycspace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-1581464839789-exec-3" is forbidden: exceeded quota: mycspacequota, requested: cpu=2, used: cpu=9, limited: cpu=10.

However, the job can still complete because Spark puts both tasks in the one executor that did start.
The Spark HistoryServer should show the below from the "Executors" tab:
If we reduce the CPU requirement for each Spark executor from 2 to 1, the Spark HistoryServer should show the below as a comparison:
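The ResourceQuota admission math behind that "exceeded quota" error can be sketched as follows. This is a hypothetical illustration; the numbers are taken from the error message above:

```python
quota_cpu = 10.0        # CSpace quota: "limited: cpu=10"
default_pods_cpu = 6.0  # terminal + hivemetastore + sparkhs, 2 CPUs each
driver_cpu = 1.0        # driver.cores in the job YAML
executor_cpu = 2.0      # executor.cores in the job YAML

# After the driver and the first executor are admitted:
used = default_pods_cpu + driver_cpu + executor_cpu  # "used: cpu=9"
# Admission of the second executor checks used + requested <= limit:
admitted = used + executor_cpu <= quota_cpu
print(admitted)  # False: "requested: cpu=2, used: cpu=9, limited: cpu=10"
```

Quota admission is per pod, which is why the first executor is admitted and only the second one is rejected.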
3. Spark driver runs out of the CSpace quota
For example, suppose the CSpace quota is 10 CPUs and no other pods are running besides the 3 default pods. We still have 10 - 6 = 4 CPUs available for running one Spark job.
If the Spark driver needs 5 CPUs, then it is already above the available quota.
Take the below definition in the Spark job YAML file:

driver:
  cores: 5
  coreLimit: "5000m"
  memory: "1024m"
  labels:
    version: 2.4.4
  serviceAccount: mapr-mycspace-cspace-sa

Here we ask for 1 Spark driver with 5 CPUs.
Symptom:
The Spark job fails, as shown by checking its status using sparkctl:

$ sparkctl list -n mycspace
+----------+--------+----------------+-----------------+
|   NAME   | STATE  | SUBMISSION AGE | TERMINATION AGE |
+----------+--------+----------------+-----------------+
| spark-pi | FAILED | 1m             | N.A.            |
+----------+--------+----------------+-----------------+

Troubleshooting:
No driver log has been generated yet:

$ kubectl logs spark-pi-driver -n mycspace -f | tee /tmp/sparkjob.txt
Error from server (NotFound): pods "spark-pi-driver" not found

This is because the Spark driver pod has not even started:

$ kubectl get pods -n mycspace
NAME                             READY   STATUS    RESTARTS   AGE
cspaceterminal-bcdcf7bbb-f68r9   1/1     Running   0          5h18m
hivemeta-f6d746f-jq5rj           1/1     Running   0          5h18m
sparkhs-667f46dcfd-24k86         1/1     Running   0          5h18m

Running "kubectl describe sparkapplication" should show the reason:
$ kubectl describe sparkapplication spark-pi -n mycspace
...
  Application State:
    Error Message:  failed to run spark-submit for SparkApplication mycspace/spark-pi: 20/02/11 23:59:59 ERROR deploy.SparkSubmit$$anon$2: Failure executing: POST at: https://10.0.32.1/api/v1/namespaces/mycspace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-driver" is forbidden: exceeded quota: mycspacequota, requested: cpu=5, used: cpu=6, limited: cpu=10.
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.0.32.1/api/v1/namespaces/mycspace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "spark-pi-driver" is forbidden: exceeded quota: mycspacequota, requested: cpu=5, used: cpu=6, limited: cpu=10.
...

Root Cause:
The Spark driver pod could not be started because the CSpace is already out of quota.
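The same quota math applied to the driver pod explains why no driver log ever appears; this is a minimal sketch with the numbers from the error message:

```python
quota_cpu = 10.0        # "limited: cpu=10"
default_pods_cpu = 6.0  # "used: cpu=6" (the 3 default pods, 2 CPUs each)
driver_cpu = 5.0        # "requested: cpu=5" (driver.cores in the job YAML)

# The driver pod itself fails quota admission, so it never starts and
# there is no driver log to collect.
admitted = default_pods_cpu + driver_cpu <= quota_cpu
print(admitted)  # False
```

So in this scenario the error surfaces only on the SparkApplication object, not in any pod log.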