Comet Benchmarking on macOS
This guide explains how to set up and run TPC-H benchmarks locally on macOS using the 100 GB dataset.
Note that macOS is not an ideal platform for this benchmark: we cannot force Spark or Comet to use performance cores rather than efficiency cores, background processes share those cores, and power and thermal management may throttle the CPU.
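To see how many cores of each type a machine has, the counts can be queried with sysctl. This is a minimal sketch that assumes an Apple Silicon Mac, where the hw.perflevel0 and hw.perflevel1 keys report performance and efficiency cores respectively. On other hardware the keys may be absent, in which case the script prints "unknown".

```shell
# Assumption: Apple Silicon exposes per-core-type counts under hw.perflevelN,
# with perflevel0 = performance cores and perflevel1 = efficiency cores.
perf_cores=$(sysctl -n hw.perflevel0.physicalcpu 2>/dev/null || true)
eff_cores=$(sysctl -n hw.perflevel1.physicalcpu 2>/dev/null || true)

# Fall back to a placeholder when a key is missing on this machine.
perf_cores=${perf_cores:-unknown}
eff_cores=${eff_cores:-unknown}

echo "performance cores: ${perf_cores}"
echo "efficiency cores:  ${eff_cores}"
```

If the performance-core count is lower than the 8 cores requested in the spark-submit commands in this guide, some tasks will likely be scheduled on efficiency cores.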
Prerequisites
Java and Rust must be installed locally.
Data Generation
cargo install tpchgen-cli
tpchgen-cli -s 100 --format=parquet
Clone the DataFusion Benchmarks Repository
git clone https://github.com/apache/datafusion-benchmarks.git
Install Spark
Download and unpack Spark 3.5.4, set SPARK_HOME, and create the directory for Spark event logs:
wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
tar xzf spark-3.5.4-bin-hadoop3.tgz
sudo mv spark-3.5.4-bin-hadoop3 /opt
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3/
mkdir -p /tmp/spark-events
Start Spark in standalone mode:
$SPARK_HOME/sbin/start-master.sh
Set the SPARK_MASTER environment variable (edit the host name to match your machine), then start a worker:
export SPARK_MASTER=spark://Rustys-MacBook-Pro.local:7077
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
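Rather than editing the host name by hand, the master URL can be derived from the machine's host name. The following is a sketch only: spark_master_url is a hypothetical helper, not part of Spark, and it assumes the master bound to the name that the hostname command returns and to port 7077 as used above. The master web UI (http://localhost:8080 by default) shows the URL the master actually uses, so compare against it if the worker fails to connect.

```shell
# spark_master_url is a hypothetical helper, not part of Spark.
# It builds a standalone master URL from a host name (default: this machine's
# host name) and a port (default: 7077, the port used in this guide).
spark_master_url() {
  local host="${1:-$(hostname)}"
  local port="${2:-7077}"
  printf 'spark://%s:%s\n' "$host" "$port"
}

export SPARK_MASTER="$(spark_master_url)"
echo "SPARK_MASTER=${SPARK_MASTER}"
```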
Run Spark Benchmarks
Run the following command (update the --data parameter to point to your TPC-H data):
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.instances=1 \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
--conf spark.executor.memory=16g \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g \
--conf spark.eventLog.enabled=true \
/path/to/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
--name spark \
--benchmark tpch \
--data /Users/rusty/Data/tpch/sf100 \
--queries /path/to/datafusion-benchmarks/tpch/queries \
--output . \
--iterations 1
Run Comet Benchmarks
Build Comet from source with the mimalloc feature enabled:
make release COMET_FEATURES=mimalloc
Set COMET_JAR to point to the location of the Comet jar file:
export COMET_JAR=`pwd`/spark/target/comet-spark-spark3.5_2.12-0.8.0-SNAPSHOT.jar
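The version in the jar file name depends on the Comet checkout that was built, so the path above may not match your build. A quick existence check before submitting the job can catch this early. In this sketch, check_comet_jar is a hypothetical helper introduced only for illustration.

```shell
# check_comet_jar is a hypothetical helper: it reports whether the given
# path names an existing regular file and returns non-zero if it does not.
check_comet_jar() {
  if [ -f "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Report on the jar path set above. "|| true" keeps a missing jar from
# aborting a script that runs under "set -e".
check_comet_jar "${COMET_JAR:-}" || true
```

If the jar is reported missing, listing the spark/target directory of the Comet checkout shows the jar name that the build produced.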
Run the following command (update the --data parameter to point to your TPC-H data):
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=8G \
--conf spark.executor.instances=1 \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
--conf spark.executor.memory=16g \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g \
--conf spark.eventLog.enabled=true \
--jars $COMET_JAR \
--driver-class-path $COMET_JAR \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.comet.enabled=true \
--conf spark.comet.exec.shuffle.enableFastEncoding=true \
--conf spark.comet.exec.shuffle.fallbackToColumnar=true \
--conf spark.comet.exec.replaceSortMergeJoin=true \
--conf spark.comet.cast.allowIncompatible=true \
/path/to/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
--name comet \
--benchmark tpch \
--data /path/to/tpch-data/ \
--queries /path/to/datafusion-benchmarks/tpch/queries \
--output . \
--iterations 1
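The Spark and Comet commands above share the same master and resource settings, and the comparison is only fair if those stay identical. One way to keep them in sync is a small wrapper that supplies the shared options once. This is a sketch under assumptions: submit_bench and the SPARK_SUBMIT override variable are hypothetical names introduced here, and the option values are copied from the commands above.

```shell
# submit_bench is a hypothetical wrapper, not part of Spark or Comet.
# It calls spark-submit with the settings shared by both runs, followed by
# whatever extra arguments the caller passes.
# SPARK_SUBMIT is a hypothetical override for the spark-submit path.
submit_bench() {
  "${SPARK_SUBMIT:-$SPARK_HOME/bin/spark-submit}" \
    --master "$SPARK_MASTER" \
    --conf spark.driver.memory=8G \
    --conf spark.executor.instances=1 \
    --conf spark.executor.cores=8 \
    --conf spark.cores.max=8 \
    --conf spark.executor.memory=16g \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=16g \
    --conf spark.eventLog.enabled=true \
    "$@"
}

# Example baseline run (paths are placeholders, as in the commands above):
#   submit_bench /path/to/datafusion-benchmarks/runners/datafusion-comet/tpcbench.py \
#     --name spark --benchmark tpch --data /path/to/tpch-data/ \
#     --queries /path/to/datafusion-benchmarks/tpch/queries --output . --iterations 1
```

For the Comet run, the Comet-specific --jars, --driver-class-path, and --conf options from the command above would be passed as the leading extra arguments, before the tpcbench.py path.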