Goal:
How to run the pandas cudf_udf tests for the RAPIDS Accelerator for Apache Spark.
Env:
RAPIDS Accelerator for Apache Spark 0.4
Spark 3.1.1
Solution:
1. Compile RAPIDS Accelerator for Apache Spark
1.a Create a conda env for compiling
conda create -n cudftest -c conda-forge python=3.8 pytest pandas pyarrow sre_yield pytest-xdist findspark
Here I use one conda env "cudftest" for compiling and another conda env named "rapids-0.18" to run the cudf_udf tests in Spark.
Of course you can use a single conda env if you want, but it may end up containing too many Python packages.
I want to keep the conda env "rapids-0.18" as small as possible because eventually I need to distribute it to all executors in the Spark cluster.
1.b Compile from source code
cd ~/github/spark-rapids
# git checkout v0.4.0
mvn clean install -DskipTests
You can decide which version to compile. Here I am going to compile the 0.5.0-SNAPSHOT, which is the current main branch; the current GA release is 0.4 though.
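As a quick sanity check (my own addition; paths assume the default Maven layout and the home-directory checkout used below), the jars referenced in step 2.d should now exist:
ls ~/github/spark-rapids/dist/target/rapids-4-spark_2.12-*.jar
ls ~/github/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples_2.12-*.jar
ls ~/github/spark-rapids/tests/target/rapids-4-spark-tests_2.12-*.jar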
2. Run pandas cudf_udf Tests
Please follow this doc on how to enable the pandas cudf_udf tests.
Basically, the pandas cudf_udf tests live in "./integration_tests/runtests.py" and are enabled with the option "--cudf_udf".
The key is to make sure all the Python envs and the needed jar file paths are correct.
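For context, these tests exercise Spark's pandas UDF API on the GPU. A minimal sketch of such a UDF (my own illustration, not part of the test suite; the file and function names are made up) looks like this:
cat > /tmp/pandas_udf_smoke.py <<'EOF'
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-smoke").getOrCreate()

# A scalar pandas UDF: receives a pandas Series, returns a pandas Series
@pandas_udf("long")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

spark.range(10).select(plus_one("id")).show()
spark.stop()
EOF
$SPARK_HOME/bin/spark-submit /tmp/pandas_udf_smoke.py
Running this once the env below is set up is a cheap way to confirm the Python side works before launching the whole suite.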
2.a Create a conda env for running cudf_udf tests
Please follow the steps mentioned in rapids.ai to create the conda env with cudf installed.
For example:
conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \
-c defaults cudf=0.18 python=3.7 cudatoolkit=11.0
2.b Install the Python packages needed by the cudf_udf tests
conda activate rapids-0.18
conda install pandas
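As an optional check (my own habit, not required by the tests), confirm both libraries import cleanly from this env:
python -c "import cudf, pandas; print(cudf.__version__, pandas.__version__)"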
2.c Package your conda env
You can refer to this blog on how to package your conda env for a Spark job.
cd /home/xxx/miniconda3/envs
zip -r rapids-0.18.zip rapids-0.18/
mv rapids-0.18.zip ~/
cd ~/ && mkdir MYGLOBALENV
cd MYGLOBALENV/ && ln -s /home/xxx/miniconda3/envs/rapids-0.18/ rapids-0.18
cd ..
export PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python
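The "#MYGLOBALENV" suffix used with --archives in step 2.d tells Spark to unpack rapids-0.18.zip into a directory named MYGLOBALENV in each executor's working directory; the symlink created above makes the same relative path resolve on the driver. You can verify the relative path locally (optional; assumes you are in your home directory):
cd ~/
./MYGLOBALENV/rapids-0.18/bin/python -c "import cudf"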
2.d Run the pandas cudf_udf tests
cd /home/xxx/github/spark-rapids/integration_tests
PYSPARK_PYTHON=/home/xxx/MYGLOBALENV/rapids-0.18/bin/python \
$SPARK_HOME/bin/spark-submit \
--jars "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/github/spark-rapids/udf-examples/target/rapids-4-spark-udf-examples_2.12-0.5.0-SNAPSHOT.jar,/home/xxx/spark/rapids/cudf.jar,/home/xxx/github/spark-rapids/tests/target/rapids-4-spark-tests_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.rapids.memory.gpu.allocFraction=0.3 \
--conf spark.rapids.python.memory.gpu.allocFraction=0.3 \
--conf spark.rapids.python.concurrentPythonWorkers=2 \
--py-files "/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.executorEnv.PYTHONPATH="/home/xxx/github/spark-rapids/dist/target/rapids-4-spark_2.12-0.5.0-SNAPSHOT.jar" \
--conf spark.executorEnv.PYSPARK_PYTHON=/home/xxx/rapids-0.18/bin/python \
--archives /home/xxx/rapids-0.18.zip#MYGLOBALENV \
./runtests.py -m "cudf_udf" -v -rfExXs --cudf_udf
Note1: Make sure all jar paths are correct.
Note2: Here I am using a Spark standalone cluster, which is why I used spark.executorEnv.PYSPARK_PYTHON. For Spark on YARN, you need to use the corresponding parameters, such as spark.yarn.appMasterEnv.PYSPARK_PYTHON.
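For example, on YARN the Python-related options would look roughly like this (a sketch only, untested here; the jars, other confs, and the runtests.py invocation stay the same):
--master yarn \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python \
--conf spark.executorEnv.PYSPARK_PYTHON=./MYGLOBALENV/rapids-0.18/bin/python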
Note3: Make sure $SPARK_HOME is set and that the Spark cluster is working fine with the RAPIDS Accelerator for Spark enabled.
The expected result is: PASSED [100%].
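Tip (assuming runtests.py forwards its arguments to pytest, as it appears to): if a test fails, you can re-run just that test with pytest's standard -k filter; the pattern below is a placeholder for a substring of the failing test's name:
./runtests.py -m "cudf_udf" -k "<failing_test_substring>" -v -rfExXs --cudf_udf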
Reference:
http://alkaline-ml.com/2018-07-02-conda-spark/