Apache DataFusion Comet: Benchmarks Derived From TPC-H#

The following benchmarks were performed on an EKS cluster (r6i.24xlarge instances with EBS storage) with data stored in S3.

Configuration#

Common:

spark.executor.instances=32
spark.executor.cores=16
spark.memory.fraction=0.6
spark.memory.storageFraction=0.2
# Kubernetes CPU constraints
spark.kubernetes.executor.request.cores=8
spark.kubernetes.executor.limit.cores=8

Spark:

spark.executor.memory=64G
spark.executor.memoryOverhead=10G

Comet:

spark.executor.memory=32G
spark.executor.memoryOverhead=10G
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=32G

Comet (Tuned):

spark.executor.memory=32G
spark.executor.memoryOverhead=10G
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=32G
spark.comet.exec.replaceSortMergeJoin=true
spark.comet.memoryPool.fraction=0.8

Benchmark Results#

The following chart shows benchmark results comparing Spark to Comet, both with Comet’s default settings, and with Hash Join enabled in Comet.

Comet’s Hash Join does not support spilling yet, so it isn’t suitable for all workloads.

Comet (with Hash Join enabled)#

Here is a breakdown showing relative performance of Spark and Comet for each query.

The following chart shows how much Comet currently accelerates each query from the benchmark in relative terms.

The following chart shows how much Comet currently accelerates each query from the benchmark in absolute terms.