Comet Benchmarking on macOS

This guide explains how to run TPC-H benchmarks locally on macOS using the 100 GB (scale factor 100) dataset.

Note that macOS is not an ideal benchmarking environment: Spark and Comet cannot be forced to run on performance cores rather than efficiency cores, background processes share those cores, and power and thermal management may throttle the CPU.
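On Apple Silicon you can at least see how many cores of each type the machine has before choosing core counts below; a small sketch (the hw.perflevel* sysctl keys are macOS-specific, so this prints "n/a" elsewhere):

```shell
# Number of performance (perflevel0) and efficiency (perflevel1) logical
# cores; these sysctl keys exist only on macOS, hence the fallback.
sysctl -n hw.perflevel0.logicalcpu 2>/dev/null || echo "n/a"
sysctl -n hw.perflevel1.logicalcpu 2>/dev/null || echo "n/a"
```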

Prerequisites

Java and Rust must be installed locally.

Data Generation

cargo install tpchgen-cli
tpchgen-cli -s 100 --format=parquet
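After generation finishes, it can be worth sanity-checking table sizes. A minimal sketch of the expected TPC-H row counts per scale factor (base counts from the TPC-H specification; lineitem is approximate, and nation/region are fixed-size and do not scale):

```python
# Rough expected row counts for TPC-H tables at a given scale factor (SF).
BASE_ROWS = {
    "customer": 150_000,
    "orders": 1_500_000,
    "part": 200_000,
    "partsupp": 800_000,
    "supplier": 10_000,
    "lineitem": 6_000_000,  # approximate; lineitem cardinality varies slightly
    "nation": 25,  # fixed size, does not scale
    "region": 5,   # fixed size, does not scale
}

def expected_rows(table: str, sf: int) -> int:
    base = BASE_ROWS[table]
    return base if table in ("nation", "region") else base * sf

print(expected_rows("orders", 100))  # 150000000 at SF=100
```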

Clone the DataFusion Benchmarks Repository

git clone https://github.com/apache/datafusion-benchmarks.git

Install Spark

wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
tar xzf spark-3.5.4-bin-hadoop3.tgz
sudo mv spark-3.5.4-bin-hadoop3 /opt
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3/
mkdir -p /tmp/spark-events

Start Spark in standalone mode:

$SPARK_HOME/sbin/start-master.sh

Set SPARK_MASTER env var (host name will need to be edited):

export SPARK_MASTER=spark://Rustys-MacBook-Pro.local:7077
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
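To avoid editing the hostname by hand, the master URL can also be derived from the machine's hostname; a sketch (on macOS, hostname typically already includes the .local suffix, but verify the result matches the URL the master logs on startup):

```shell
# Build the master URL from this machine's hostname instead of hard-coding it.
SPARK_MASTER="spark://$(hostname):7077"
echo "$SPARK_MASTER"
```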

Run Spark Benchmarks

Run the following command (the --data parameter will need to be updated to point to your TPC-H data):

$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=true \
    /path/to/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --name spark \
    --benchmark tpch \
    --data /Users/rusty/Data/tpch/sf100 \
    --queries /path/to/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1
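As a rough sanity check on the settings above: with off-heap memory enabled, each executor needs its on-heap plus off-heap allocation, so make sure the machine has headroom for the executors and the driver combined. A small sketch of the arithmetic:

```python
# Approximate memory footprint implied by the spark-submit flags above.
driver_gb = 8            # spark.driver.memory
executor_heap_gb = 16    # spark.executor.memory
off_heap_gb = 16         # spark.memory.offHeap.size
executors = 1            # spark.executor.instances

per_executor_gb = executor_heap_gb + off_heap_gb
total_gb = driver_gb + executors * per_executor_gb
print(per_executor_gb, total_gb)  # 32 40
```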

Run Comet Benchmarks

Build Comet from source, with mimalloc enabled.

make release COMET_FEATURES=mimalloc

Set COMET_JAR to point to the location of the Comet jar file.

export COMET_JAR=`pwd`/spark/target/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar
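If the version in the jar name changes (the 0.8.0-SNAPSHOT above is release-specific), a glob lookup avoids hard-coding it. A sketch, assuming it runs from the Comet checkout; the find_comet_jar helper and the throwaway demo directory are illustrative, not part of Comet:

```shell
# find_comet_jar: pick the first jar matching the Comet naming pattern
# under a target directory (hypothetical helper, not part of Comet).
find_comet_jar() {
  find "$1" -name 'comet-spark-*.jar' 2>/dev/null | head -n 1
}

# Real usage would be: export COMET_JAR=$(find_comet_jar "$(pwd)/spark/target")
# Demonstrate with a throwaway directory so the lookup is visible:
demo=$(mktemp -d)
touch "$demo/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar"
find_comet_jar "$demo"
```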

Run the following command (the --data parameter will need to be updated to point to your TPC-H data):

$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=true \
    --jars $COMET_JAR \
    --driver-class-path $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.shuffle.enableFastEncoding=true \
    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
    --conf spark.comet.exec.replaceSortMergeJoin=true \
    --conf spark.comet.cast.allowIncompatible=true \
    /path/to/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --name comet \
    --benchmark tpch \
    --data /path/to/tpch-data/ \
    --queries /path/to/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1