Goal:
This article explains how to use NVIDIA Nsight Systems to profile a Spark job running with the RAPIDS Accelerator.
Env:
Spark 3.1.1 (Standalone Cluster)
RAPIDS Accelerator for Apache Spark 0.5 snapshot
cuDF jar 0.19 snapshot
Solution:
1. Build the cuDF JARs with USE_NVTX option on.
Follow Doc: https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html
Note: Starting from cuDF 0.19, USE_NVTX (NVIDIA Tools Extension) is on by default as per this PR, so we no longer need to build the jar ourselves. This means that for future cuDF releases (>=0.19) we can skip this step.
So in this test, I just used the latest cuDF 0.19 snapshot jar together with a RAPIDS Accelerator 0.5 snapshot jar (built from source manually). Note: these 2 jars are not stable releases.
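For cuDF versions before 0.19, the build follows the doc linked above; the sketch below is an assumption-heavy outline only (repository layout, CMake option name, and Maven goals should all be verified against that doc for your cuDF version):

```shell
# Sketch only -- follow the linked nvtx_profiling doc for the authoritative
# steps; the directory layout and build commands here are assumptions.
git clone https://github.com/rapidsai/cudf.git
cd cudf/cpp && mkdir -p build && cd build
cmake .. -DUSE_NVTX=ON          # enable NVTX ranges in libcudf (default ON since 0.19)
make -j"$(nproc)"
cd ../../java
mvn clean package -DskipTests   # produce the cuDF jar with NVTX enabled
```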
2. Download Nsight Systems on your client machine
https://developer.nvidia.com/nsight-systems
Here I downloaded and installed it on a Mac, where I will view the metrics later.
3. Make sure the target machine has nsys installed and meets the requirements.
$ nsys status -e
Sampling Environment Check
Linux Kernel Paranoid Level = 3: Fail
Linux Distribution = Ubuntu
Linux Kernel Version = 5.4.0-70: OK
Linux perf_event_open syscall available: Fail
Sampling trigger event available: Fail
Intel(c) Last Branch Record support: Not Available
Sampling Environment: Fail
See the product documentation for more information.
$ cat /proc/sys/kernel/perf_event_paranoid
3
$ sudo sh -c 'echo 2 >/proc/sys/kernel/perf_event_paranoid'
$ cat /proc/sys/kernel/perf_event_paranoid
2
$ sudo sh -c 'echo kernel.perf_event_paranoid=2 > /etc/sysctl.d/local.conf'
$ nsys status -e
Sampling Environment Check
Linux Kernel Paranoid Level = 2: OK
Linux Distribution = Ubuntu
Linux Kernel Version = 5.4.0-70: OK
Linux perf_event_open syscall available: OK
Sampling trigger event available: OK
Intel(c) Last Branch Record support: Available
Sampling Environment: OK
4. Add extra Java options to both the driver and executor.
--conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true
--conf spark.executor.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true
You can put these into spark-defaults.conf, or specify them each time for spark-shell/spark-sql/etc.
If you already have other extraJavaOptions, do not forget to append these rather than replace them.
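For example, the corresponding spark-defaults.conf entries would look like this (shown here in isolation; merge with any options you already set):

```
spark.driver.extraJavaOptions    -Dai.rapids.cudf.nvtx.enabled=true
spark.executor.extraJavaOptions  -Dai.rapids.cudf.nvtx.enabled=true
```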
5. Start the Spark worker and spark-shell using "nsys profile"
nsys profile bash -c " \
CUDA_VISIBLE_DEVICES=0 ${SPARK_HOME}/sbin/start-slave.sh $master_url & \
$SPARK_HOME/bin/spark-shell; \
${SPARK_HOME}/sbin/stop-slave.sh"
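The defaults above work, but nsys profile also accepts explicit options; a variant sketch is below (flag names are from the Nsight Systems CLI documentation; verify them with `nsys profile --help` on your version):

```shell
# --trace=cuda,nvtx,osrt : collect CUDA API calls, NVTX ranges and OS runtime
# --output / --force-overwrite : name the report instead of the default reportN
nsys profile --trace=cuda,nvtx,osrt --output=spark-profile --force-overwrite=true \
  bash -c " \
  CUDA_VISIBLE_DEVICES=0 ${SPARK_HOME}/sbin/start-slave.sh $master_url & \
  $SPARK_HOME/bin/spark-shell; \
  ${SPARK_HOME}/sbin/stop-slave.sh"
```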
6. Run some query
When you quit spark-shell, nsys will generate a *.qdrep file in the current directory.
For example:
scala> :quit
:quit
stopping org.apache.spark.deploy.history.HistoryServer
stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdstrm" file to disk...
Creating final output files...
Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-58cb-6240-1a5f-e6f7.qdrep"
Report file moved to "/home/xxx/report1.qdrep"
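The actual query is not shown above; a hypothetical cross join of the same shape (table sizes invented for illustration) could be fed to spark-shell like this, which would exercise the libcudf cross_join and apply_boolean_mask ranges that appear in the stats in step 7:

```shell
# Hypothetical query -- the post does not show the real one.
$SPARK_HOME/bin/spark-shell <<'EOF'
val left  = spark.range(0, 100000L).toDF("a")
val right = spark.range(0, 1000L).toDF("b")
// crossJoin followed by a filter maps to libcudf cross_join +
// apply_boolean_mask when the RAPIDS Accelerator is enabled
left.crossJoin(right).filter($"a" > $"b").count()
EOF
```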
7. Use the "nsys stats" command on the target machine to check the report
$ nsys stats report8.qdrep
Using report8.sqlite file for stats and reports.
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/cudaapisum report8.sqlite] to console...
Time(%) Total Time (ns) Num Calls Average Minimum Maximum Name
------- --------------- --------- --------------- ------------- ------------- --------------------------
66.8 152,391,401,099 192,250 792,673.1 679 18,448,141 cudaStreamSynchronize_ptsz
31.2 71,169,590,822 114,830 619,782.2 195 9,667,534 cudaMemcpyAsync_ptsz
0.7 1,565,365,626 7 223,623,660.9 3,454 1,565,334,856 cudaFree
0.5 1,117,531,408 65,671 17,017.1 3,496 131,888 cudaLaunchKernel_ptsz
...
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpukernsum report8.sqlite] to console...
Time(%) Total Time (ns) Instances Average Minimum Maximum Name
------- --------------- --------- ------------ ---------- ---------- ----------------------------------------------------------------------------------------------------
37.5 83,645,234,788 14,576 5,738,558.9 5,554,755 6,897,949 void (anonymous namespace)::scatter_kernel<int, (anonymous namespace)::boolean_mask_filter<false>, …
28.2 62,805,133,776 7,288 8,617,608.9 8,459,988 8,955,404 void cudf::binops::jit::kernel_v_v<bool, int, int, cudf::binops::jit::Greater>(int, bool*, int*, in…
18.8 41,854,794,778 7,288 5,742,974.0 5,634,787 5,984,609 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
8.7 19,342,375,816 7,289 2,653,639.2 2,575,613 2,869,850 void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrus…
...
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemtimesum report8.sqlite] to console...
Time(%) Total Time (ns) Operations Average Minimum Maximum Operation
------- --------------- ---------- -------- ------- ------- ------------------
47.8 78,733,508 82,908 949.6 608 610,013 [CUDA memcpy DtoH]
35.7 58,761,119 80,174 732.9 640 13,792 [CUDA memset]
16.4 26,979,351 31,900 845.7 671 662,844 [CUDA memcpy HtoD]
0.1 136,064 8 17,008.0 1,632 32,640 [CUDA memcpy DtoD]
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/gpumemsizesum report8.sqlite] to console...
Total Operations Average Minimum Maximum Operation
---------- ---------- --------- ------- --------- ------------------
37,577.836 31,900 1.178 0.004 7,813.324 [CUDA memcpy HtoD]
32,226.750 8 4,028.344 244.187 7,812.500 [CUDA memcpy DtoD]
24,145.266 82,908 0.291 0.001 7,812.500 [CUDA memcpy DtoH]
16,326.898 80,174 0.204 0.001 7,812.500 [CUDA memset]
...
Exporting [/opt/nvidia/nsight-systems/2020.3.2/target-linux-x64/reports/nvtxppsum report8.sqlite] to console...
Time(%) Total Time (ns) Instances Average Minimum Maximum Range
------- --------------- --------- --------------- ------------- ------------- -------------------------------
41.0 209,116,856,965 10,002 20,907,504.2 117,938 23,476,086 libcudf:apply_boolean_mask
41.0 209,039,719,367 10,002 20,899,792.0 116,416 23,467,375 libcudf:copy_if
16.7 85,273,533,436 10,000 8,527,353.3 8,431,597 13,684,934 libcudf:cross_join
...
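nsys stats can also export a single summary instead of all of them. For example (the --report and --format flags are per the Nsight Systems CLI documentation; check `nsys stats --help` for the report names and formats available in your version):

```shell
# Export only the GPU kernel summary from the report
nsys stats --report gpukernsum report8.qdrep
# CSV output for further processing (verify --format support on your version)
nsys stats --report gpukernsum --format csv report8.qdrep
```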
8. Copy the *.qdrep file to the client machine where Nsight Systems is installed.
Open the *.qdrep file using Nsight Systems.
My query (run in step 6 above) is a cross join which takes around 6 minutes.
Normally I first check the "Analysis Summary" tab to get the PID of the Spark executor (24897), which will be my focus.
Then I move to the "Timeline View" tab and identify the Spark executor process:
As we can see, the CUDA HW (GPU) row shows busy (blue) most of the time. If we hover the mouse over it, it shows the CUDA kernel running % at that point in time:
Digging further into the threads of the Spark executor process, we can identify that the Executor task 1 thread keeps calling CUDA APIs during that time.
And most importantly, the "libcudf" and "NVTX(libcudf)" rows show up here.
Note: they will NOT show up if NVTX was not switched on when building the cuDF jar.
Here the "libcudf" row shows "cross_join", which matches our query type.
The "NVTX(libcudf)" row shows similar information under the "CUDA HW" section:
Tips:
1. One useful tip is to pin the related rows and compare them:
Once those rows are pinned, they always stay at the top or bottom as you scroll up/down, such as:
2. Change the time from "session time" to "global time"
After that, the timeline shows the machine's wall-clock time, which can help you match events to real-world time.
3. How to start/stop collection manually
a. Stop the running Spark worker:
${SPARK_HOME}/sbin/stop-slave.sh
b. Launch the worker under nsys:
nsys launch bash -c "CUDA_VISIBLE_DEVICES=0 $SPARK_HOME/sbin/start-slave.sh spark://$HOSTNAME:7077 &"
c. Open another terminal session, run "nsys start":
$ nsys start
d. Run a Spark job using either spark-shell or spark-submit or something else.
e. Check the running session and stop the collection:
$ nsys sessions list
ID TIME STATE LAUNCH NAME
1028142 00:51 Collecting 1 [default]
$ nsys stop
Processing events...
Capturing symbol files...
Saving temporary "/tmp/nsys-report-4026-c2c5-8a18-5372.qdstrm" file to disk...
Creating final output files...
Processing [==============================================================100%]
Saved report file to "/tmp/nsys-report-4026-c2c5-8a18-5372.qdrep"
Report file moved to "/home/xxx/report10.qdrep"
stop executed
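As an alternative to driving the collection manually, nsys profile can bound the capture window itself; a sketch is below (the --delay and --duration flag names are per the Nsight Systems CLI documentation; verify with `nsys profile --help`):

```shell
# --delay=60    : start collecting 60 s in, skipping Spark/JVM startup
# --duration=300: stop collecting automatically after 300 s
nsys profile --delay=60 --duration=300 \
  bash -c "CUDA_VISIBLE_DEVICES=0 ${SPARK_HOME}/sbin/start-slave.sh $master_url & \
  $SPARK_HOME/bin/spark-shell; \
  ${SPARK_HOME}/sbin/stop-slave.sh"
```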
References:
- https://nvidia.github.io/spark-rapids/docs/dev/nvtx_profiling.html
- https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html
- https://docs.nvidia.com/nsight-systems/UserGuide/index.html
- YouTube: Profiling GPU Applications with Nsight Systems