Comet Benchmarking on macOS

This guide explains how to set up and run TPC-H benchmarks locally on macOS against the 100 GB (scale factor 100) dataset.

Note that macOS is not an ideal benchmarking environment: Spark and Comet cannot be forced onto performance cores rather than efficiency cores, background processes share those cores, and power and thermal management may throttle the CPU.
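To see how many cores of each type your machine has, the per-performance-level sysctl keys can be queried. This is a sketch: the hw.perflevel* keys are present on Apple-silicon Macs; on other systems the lookups fail and the counts fall back to "unknown".

```shell
# Print performance- vs. efficiency-core counts (Apple-silicon macOS).
# On systems without these sysctl keys the lookups fail silently and
# we report "unknown" instead.
p_cores=$(sysctl -n hw.perflevel0.physicalcpu 2>/dev/null || echo unknown)
e_cores=$(sysctl -n hw.perflevel1.physicalcpu 2>/dev/null || echo unknown)
echo "performance cores: $p_cores, efficiency cores: $e_cores"
```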

Prerequisites

Java and Rust must be installed locally.

Data Generation

cargo install tpchgen-cli
tpchgen-cli -s 100 --format=parquet
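Generation at scale factor 100 takes a while, so it is worth confirming that all eight TPC-H tables landed on disk before moving on. A minimal sketch (the data directory path is a placeholder; the -e test accepts either a single Parquet file or a directory of parts per table):

```shell
# Verify that tpchgen-cli produced all eight TPC-H tables.
check_tpch() {
  dir="$1"
  missing=0
  for table in customer lineitem nation orders part partsupp region supplier; do
    if [ ! -e "$dir/$table.parquet" ]; then
      echo "missing: $table.parquet"
      missing=$((missing + 1))
    fi
  done
  if [ "$missing" -eq 0 ]; then
    echo "all 8 TPC-H tables found in $dir"
  else
    echo "$missing table(s) missing under $dir"
  fi
}

check_tpch /path/to/tpch/sf100   # adjust to where tpchgen-cli wrote its output
```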

Clone the DataFusion Benchmarks Repository

git clone https://github.com/apache/datafusion-benchmarks.git

Install Spark

wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
tar xzf spark-3.5.4-bin-hadoop3.tgz
sudo mv spark-3.5.4-bin-hadoop3 /opt
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3/
mkdir -p /tmp/spark-events

Start Spark in standalone mode:

$SPARK_HOME/sbin/start-master.sh

Set the SPARK_MASTER environment variable (replace the host name with your own machine's host name):

export SPARK_MASTER=spark://Rustys-MacBook-Pro.local:7077
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
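Rather than editing the host name by hand, the master URL can be derived from hostname. This assumes hostname reports the same name the master binds to; on macOS that is typically the .local name shown above.

```shell
# Build the master URL from this machine's hostname instead of hardcoding it.
export SPARK_MASTER="spark://$(hostname):7077"
echo "$SPARK_MASTER"
```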

Start a local Apache Spark cluster using spark-class

Apache Spark distributions installed with Homebrew may not include a $SPARK_HOME/sbin folder. To start a local Apache Spark master on localhost:7077, run:

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.master.Master --host 127.0.0.1 --port 7077 --webui-port 8080

Once the master has started, start the worker in a separate console, pointing it at the master URI on localhost:7077:

$SPARK_HOME/bin/spark-class org.apache.spark.deploy.worker.Worker --cores 8 --memory 16G spark://localhost:7077
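Before submitting work, it can be useful to confirm that the master is actually accepting connections. A small sketch using bash's /dev/tcp redirection (a bash-ism; if netcat is installed, nc -z localhost 7077 is an equivalent check):

```shell
# Return success if something is listening on host:port (bash /dev/tcp).
port_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

if port_open localhost 7077; then
  echo "Spark master is listening on port 7077"
else
  echo "nothing listening on port 7077 yet"
fi
```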

Run Spark Benchmarks

Run the following command (the --data parameter will need to be updated to point to your TPC-H data):

$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=true \
    /path/to/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --benchmark tpch \
    --data /Users/rusty/Data/tpch/sf100 \
    --queries /path/to/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1
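Because spark.eventLog.enabled=true is set above, each run writes an event log to /tmp/spark-events, which is the Spark history server's default log directory. A sketch for browsing those logs after a run (skips cleanly when SPARK_HOME is not set):

```shell
# Launch the Spark history server to browse the event logs written to
# /tmp/spark-events (its default log directory). Does nothing if
# SPARK_HOME is unset or the script is missing.
start_history_server() {
  if [ -x "${SPARK_HOME:-}/sbin/start-history-server.sh" ]; then
    "$SPARK_HOME/sbin/start-history-server.sh"
    echo "history server UI at http://localhost:18080"
  else
    echo "SPARK_HOME not set; skipping history server"
  fi
}

start_history_server
```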

Run Comet Benchmarks

Build Comet from source, with mimalloc enabled:

make release COMET_FEATURES=mimalloc

Set COMET_JAR to point to the location of the Comet jar file.

export COMET_JAR=`pwd`/spark/target/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar
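The jar name above is version-specific and changes between releases, so before launching spark-submit it is worth verifying that the variable points at a real file; a bad class path fails in less obvious ways. A minimal sketch:

```shell
# Report whether COMET_JAR points at an existing jar file.
check_jar() {
  if [ -f "${1:-}" ]; then
    echo "found Comet jar: $1"
  else
    echo "missing Comet jar: '${1:-}'"
  fi
}

check_jar "${COMET_JAR:-}"
```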

Run the following command (the --data parameter will need to be updated to point to your TPC-H data):

$SPARK_HOME/bin/spark-submit \
    --master $SPARK_MASTER \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=true \
    --jars $COMET_JAR \
    --driver-class-path $COMET_JAR \
    --conf spark.driver.extraClassPath=$COMET_JAR \
    --conf spark.executor.extraClassPath=$COMET_JAR \
    --conf spark.plugins=org.apache.spark.CometPlugin \
    --conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
    --conf spark.comet.enabled=true \
    --conf spark.comet.exec.shuffle.enableFastEncoding=true \
    --conf spark.comet.exec.shuffle.fallbackToColumnar=true \
    --conf spark.comet.exec.replaceSortMergeJoin=true \
    --conf spark.comet.expression.allowIncompatible=true \
    /path/to/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
    --benchmark tpch \
    --data /path/to/tpch-data/ \
    --queries /path/to/datafusion-benchmarks/tpch/queries \
    --output . \
    --iterations 1