Goal:
This article shows how to monitor NVIDIA GPU performance metrics while a job is running.
The most important metrics are GPU utilization (%), memory utilization (%), and inbound/outbound PCIe throughput.
Env:
Ubuntu 18.04
Quadro RTX 6000
Solution:
If we are running a Spark on GPU job, how do we monitor the NVIDIA GPU performance?
nvidia-smi has several options that can achieve that goal.
I ran the two commands below at the same time while a test job was running.
Both of them capture metrics every second for the GPU with index 0.
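Running the two monitors side by side can also be scripted. The following is only a minimal Python sketch, assuming nvidia-smi is on the PATH. The command lines are the ones listed as Option 1 and Option 2 in this article; the function names and log file names are hypothetical.

```python
import subprocess

GPU_INDEX = 0      # GPU to watch, same index as in this article
INTERVAL_SEC = 1   # sampling interval in seconds

QUERY_FIELDS = (
    "utilization.gpu,utilization.memory,"
    "memory.total,memory.free,memory.used"
)


def build_commands(gpu_index=GPU_INDEX, interval=INTERVAL_SEC):
    """Return the two nvidia-smi command lines (Option 1, Option 2) as lists."""
    query_cmd = [
        "nvidia-smi",
        f"--query-gpu={QUERY_FIELDS}",
        "--format=csv",
        "-l", str(interval),
        "-i", str(gpu_index),
    ]
    dmon_cmd = [
        "nvidia-smi", "dmon",
        "-i", str(gpu_index),
        "-s", "mutc",
        "-d", str(interval),
        "-o", "TD",
    ]
    return query_cmd, dmon_cmd


def start_monitors(query_log="gpu_query.csv", dmon_log="gpu_dmon.log"):
    """Start both monitors in the background, one log file each.

    The log file names are hypothetical placeholders.
    """
    procs = []
    for cmd, path in zip(build_commands(), (query_log, dmon_log)):
        log = open(path, "w")
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        procs.append((proc, log))
    return procs


def stop_monitors(procs):
    """Terminate the monitors and close their log files."""
    for proc, log in procs:
        proc.terminate()
        proc.wait()
        log.close()
```

A typical use would be to call start_monitors() just before submitting the job and stop_monitors() on the returned list once the job finishes, so both logs cover the same time window.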
1. Option 1
nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1 -i 0
Sample output:
utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.free [MiB], memory.used [MiB]
0 %, 0 %, 24220 MiB, 24209 MiB, 11 MiB
27 %, 0 %, 24220 MiB, 23953 MiB, 267 MiB
57 %, 0 %, 24220 MiB, 23989 MiB, 231 MiB
29 %, 1 %, 24220 MiB, 23941 MiB, 279 MiB
2. Option 2
nvidia-smi dmon -i 0 -s mutc -d 1 -o TD
Sample output:
#Date      Time      gpu     fb  bar1   sm  mem  enc  dec  rxpci  txpci  mclk  pclk
#YYYYMMDD  HH:MM:SS  Idx     MB    MB    %    %    %    %   MB/s   MB/s   MHz   MHz
20210306   22:58:19    0     11     4    0    0    0    0      0      0   405   300
20210306   22:58:20    0    271     9   30    0    0    0    632   1506  6500  1440
20210306   22:58:21    0    231     9   63    1    0    0  11184   1489  6500  2010
20210306   22:58:22    0    279     9   32    1    0    0   2721   2768  6500  2010
"fb" is the on-board frame buffer memory in use (in MB), that is, the device memory. It matches "memory.used" in option 1.
"mem" is the memory utilization, which matches "utilization.memory" in option 1.
"sm" is the Streaming Multiprocessor utilization, which matches "utilization.gpu" in option 1 (allowing for a small timing difference between the two commands).
"rxpci" and "txpci" are the PCIe Rx and Tx throughput in MB/s.
Please refer to "man nvidia-smi" for more options.
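The dmon output from option 2 can be parsed in a similar way to find, for example, the peak PCIe throughput during a job. The Python sketch below is an illustration only: the function name is hypothetical, and it assumes whitespace-separated columns with the column names on the first "#" line, as in the sample output above. Non-numeric values become None.

```python
def parse_dmon(text):
    """Parse `nvidia-smi dmon ... -o TD` output into a list of dicts.

    Column names come from the first '#' line; later '#' lines (the units
    line and any repeated headers) are skipped.
    """
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            if header is None:
                header = line.lstrip("#").split()
            continue
        fields = line.split()
        if header is None or len(fields) != len(header):
            continue  # not a data row we understand
        row = {}
        for name, value in zip(header, fields):
            if name in ("Date", "Time"):
                row[name] = value  # keep timestamps as strings
            else:
                row[name] = int(value) if value.isdigit() else None
        rows.append(row)
    return rows


# Sample lines copied from the option 2 output in this article.
DMON_SAMPLE = """\
#Date Time gpu fb bar1 sm mem enc dec rxpci txpci mclk pclk
#YYYYMMDD HH:MM:SS Idx MB MB % % % % MB/s MB/s MHz MHz
20210306 22:58:19 0 11 4 0 0 0 0 0 0 405 300
20210306 22:58:20 0 271 9 30 0 0 0 632 1506 6500 1440
20210306 22:58:21 0 231 9 63 1 0 0 11184 1489 6500 2010
20210306 22:58:22 0 279 9 32 1 0 0 2721 2768 6500 2010
"""

dmon_rows = parse_dmon(DMON_SAMPLE)
peak_rx = max(dmon_rows, key=lambda r: r["rxpci"])
print(f"peak rxpci: {peak_rx['rxpci']} MB/s at {peak_rx['Time']}")
```

For the sample rows, the peak inbound PCIe throughput is 11184 MB/s at 22:58:21.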