Goal:
How to run the pandas cudf_udf tests for the RAPIDS Accelerator for Apache Spark.
Env:
RAPIDS Accelerator for Apache Spark 0.4
Spark 3.1.1
Solution:
1. Compile RAPIDS Accelerator for Apache Spark
1.a Create a conda env for compiling
conda create -n cudftest -c conda-forge python=3.8 pytest pandas pyarrow sre_yield pytest-xdist findspark
Here I use one conda env "cudftest" for compiling and another conda env named "rapids-0.18" to run the cudf_udf tests in Spark.
Of course you can use a single conda env if you want, but it may end up containing too many Python packages.
I want to keep the conda env "rapids-0.18" as small as possible because eventually I need to distribute it to all executors in the Spark cluster.
1.b Compile from source code
cd ~/github/spark-rapids
# git checkout v0.4.0
mvn clean install -DskipTests
You can decide which version to compile. Here I am going to compile the 0.5.0-SNAPSHOT, which is the current main branch; the current GA release is 0.4 though.
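As a quick sanity check (my own addition; paths assume the default Maven layout and the home-directory checkout used below), the jars referenced in step 2.d should now exist:
ls ~/github/spark-rapids/dist/target/rapids-4-spark_2.12-*.jar
ls ~/github/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples_2.12-*.jar
ls ~/github/spark-rapids/tests/target/rapids-4-spark-tests_2.12-*.jar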
2. Run pandas cudf_udf Tests
Please follow this doc on how to enable the pandas cudf_udf tests.
Basically, the pandas cudf_udf tests live in "./integration_tests/runtests.py" and are enabled with the option "--cudf_udf".
The key is to make sure all the Python envs and the needed jar file paths are correct.
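For context, these tests exercise Spark's pandas UDF API on the GPU. A minimal sketch of such a UDF (my own illustration, not part of the test suite; the file and function names are made up) looks like this:
cat > /tmp/pandas_udf_smoke.py <<'EOF'
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-smoke").getOrCreate()

# A scalar pandas UDF: receives a pandas Series, returns a pandas Series
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(10).select(plus_one("id")).show()
spark.stop()
EOF
$SPARK_HOME/bin/spark-submit /tmp/pandas_udf_smoke.py
Running this once the env below is set up is a cheap way to confirm the Python side works before launching the whole suite.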
2.a Create a conda env for running cudf_udf tests
Please follow the steps mentioned in rapids.ai to create the conda env with cudf installed.
For example:
conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \
-c defaults cudf=0.18 python=3.7 cudatoolkit=11.0
2.b Install the Python packages needed by the cudf_udf tests
conda activate rapids-0.18
conda install pandas
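As an optional check (my own habit, not required by the tests), confirm both libraries import cleanly from this env:
python -c "import cudf, pandas; print(cudf.__version__, pandas.__version__)"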
2.c Package your conda env
You can refer to this blog on how to package your conda env for a Spark job.
cd /home/xxx/miniconda3/envs
zip -r rapids-0.18.zip rapids-0.18/
mv rapids-0.18.zip ~/
cd ~/ && mkdir MYGLOBALENV
cd MYGLOBALENV/ && ln -s /home/xxx/miniconda3/envs/rapids-0.18/ rapids-0.18
cd ..
export PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python
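The "#MYGLOBALENV" suffix used with --archives in step 2.d tells Spark to unpack rapids-0.18.zip into a directory named MYGLOBALENV in each executor's working directory; the symlink created above makes the same relative path resolve on the driver. You can verify the relative path locally (optional; assumes you are in your home directory):
cd ~/
./MYGLOBALENV/rapids-0.18/bin/python -c "import cudf"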
2.d Run the pandas cudf_udf tests
cd /home/xxx/github/spark-rapids/integration_tests
PYSPARK_PYTHON=/home/xxx/MYGLOBALENV/rapids-0.18/bin/python \
$SPARK_HOME/bin/spark-submit \
--jars "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/github/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/spark/rapids/cudf.jar,/home/xxx/github/spark-rapids/tests/target/rapids-4-spark-tests_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.rapids.memory.gpu.allocFraction=0.3 \
--conf spark.rapids.python.memory.gpu.allocFraction=0.3 \
--conf spark.rapids.python.concurrentPythonWorkers=2 \
--py-files "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.executorEnv.PYTHONPATH="/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.executorEnv.PYSPARK_PYTHON=/home/xxx/rapids-0.18/bin/python \
--archives /home/xxx/rapids-0.18.zip#MYGLOBALENV \
./runtests.py -m "cudf_udf" -v -rfExXs --cudf_udf
Note1: Make sure all jar paths are correct.
Note2: Here I am using a Spark standalone cluster, which is why I used spark.executorEnv.PYSPARK_PYTHON. For Spark on YARN, you need to use the corresponding parameters, such as spark.yarn.appMasterEnv.PYSPARK_PYTHON.
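For example, on YARN the Python-related options would look roughly like this (a sketch only, untested here; the jars, other confs, and the runtests.py invocation stay the same):
--master yarn \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python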
Note3: Make sure $SPARK_HOME is set and that the Spark cluster is working fine with the RAPIDS Accelerator for Spark enabled.
The expected result is: PASSED [100%].
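Tip (assuming runtests.py forwards its arguments to pytest, as it appears to): if a test fails, you can re-run just that test with pytest's standard -k filter; the pattern below is a placeholder for a substring of the failing test's name:
./runtests.py -m "cudf_udf" -k "<failing_test_substring>" -v -rfExXs --cudf_udf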
Reference:
http://alkaline-ml.com/2018-07-02-conda-spark/