Comet Benchmarking in AWS
This guide explains how to set up and run benchmarks on a single-node AWS EC2 instance, reading Parquet files from S3.
Data Generation
Create an EC2 instance with an EBS volume sized to approximately twice the dataset to be generated (200 GB for scale factor 100, 2 TB for scale factor 1000, and so on).
Create an S3 bucket to store the Parquet files (a CLI sketch of both steps follows below).
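A minimal sketch of these two steps with the AWS CLI, assuming Amazon Linux (where the root device is typically /dev/xvda); the AMI ID, key name, instance type, and bucket name are placeholders to replace with your own values:
# launch an instance with a 200 GB gp3 root EBS volume (scale factor 100)
aws ec2 run-instances \
  --image-id ami-xxxxxxxx \
  --instance-type m6i.4xlarge \
  --key-name my-key \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":200,"VolumeType":"gp3"}}]'
# create the bucket that will hold the Parquet files
aws s3 mb s3://your-bucket-name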
Install prerequisites:
sudo yum install -y docker git python3-pip
sudo systemctl start docker
sudo systemctl enable docker
sudo usermod -aG docker ec2-user
newgrp docker
docker pull ghcr.io/scalytics/tpch-docker:main
pip3 install datafusion
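Before generating data, it is worth a quick sanity check that Docker works without sudo and that the datafusion package installed:
docker run --rm hello-world
pip3 show datafusion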
Run the data generation script:
git clone https://github.com/apache/datafusion-benchmarks.git
cd datafusion-benchmarks/tpch
nohup python3 tpchgen.py generate --scale-factor 100 --partitions 16 &
Check on progress with the following commands:
docker ps
du -h -d 1 data
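Since nohup redirects the generator's output, you can also follow its log directly:
tail -f nohup.out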
Fix ownership of the generated files:
sudo chown -R ec2-user:docker data
Convert to Parquet:
nohup python3 tpchgen.py convert --scale-factor 100 --partitions 16 &
Delete the CSV files:
cd data
rm *.tbl.*
Copy the Parquet files to S3:
aws s3 cp . s3://your-bucket-name/top-level-folder/ --recursive
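Optionally verify the upload by listing the uploaded objects and comparing the total size against the local data directory:
aws s3 ls s3://your-bucket-name/top-level-folder/ --recursive --summarize --human-readable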
Install Spark
Install Java:
sudo yum install -y java-17-amazon-corretto-headless java-17-amazon-corretto-devel
Set JAVA_HOME:
export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto
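Exports set this way only last for the current shell session; to have them survive reconnects, you can append them to ~/.bashrc (the same applies to SPARK_HOME, SPARK_MASTER, and COMET_JAR used later in this guide):
echo 'export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto' >> ~/.bashrc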
Install Spark:
wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
tar xzf spark-3.5.4-bin-hadoop3.tgz
sudo mv spark-3.5.4-bin-hadoop3 /opt
export SPARK_HOME=/opt/spark-3.5.4-bin-hadoop3/
mkdir /tmp/spark-events
Set the SPARK_MASTER env var (edit the IP address to match your instance's private IP):
export SPARK_MASTER=spark://172.31.34.87:7077
Set SPARK_LOCAL_DIRS to point to the EBS volume:
sudo mkdir /mnt/tmp
sudo chmod 777 /mnt/tmp
mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
Add the following entry to spark-env.sh:
SPARK_LOCAL_DIRS=/mnt/tmp
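Equivalently, the entry can be appended from the shell:
echo 'SPARK_LOCAL_DIRS=/mnt/tmp' >> $SPARK_HOME/conf/spark-env.sh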
Start Spark in standalone mode:
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
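To confirm both daemons started, list the JVM processes (jps ships with the JDK installed above); the master web UI on port 8080 should also show one registered worker:
jps
# expected output includes one Master and one Worker process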
Install the Hadoop jar files needed for S3 access:
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -P $SPARK_HOME/jars
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar -P $SPARK_HOME/jars
Add credentials to ~/.aws/credentials:
[default]
aws_access_key_id=your-access-key
aws_secret_access_key=your-secret-key
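You can confirm that the credentials resolve correctly with:
aws sts get-caller-identity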
Run Spark Benchmarks
Run the following command (update the --data parameter to point to your S3 bucket):
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=4G \
--conf spark.executor.instances=1 \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
--conf spark.executor.memory=16g \
--conf spark.eventLog.enabled=false \
--conf spark.local.dir=/mnt/tmp \
--conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
--conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
tpcbench.py \
--benchmark tpch \
--data s3a://your-bucket-name/top-level-folder \
--queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
--output . \
--iterations 1
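While the benchmark is running, the standard Spark application UI on port 4040 shows per-query progress; its REST API can also be polled from the instance itself:
curl -s http://localhost:4040/api/v1/applications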
Run Comet Benchmarks
Install Comet JAR from Maven:
wget https://repo1.maven.org/maven2/org/apache/datafusion/comet-spark-spark3.5_2.12/0.7.0/comet-spark-spark3.5_2.12-0.7.0.jar -P $SPARK_HOME/jars
export COMET_JAR=$SPARK_HOME/jars/comet-spark-spark3.5_2.12-0.7.0.jar
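To check that the jar downloaded intact and that the variable points at it (the jar tool is provided by the corretto devel package installed earlier):
ls -lh $COMET_JAR
jar tf $COMET_JAR | head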
Run the following command (update the --data parameter to point to your S3 bucket):
$SPARK_HOME/bin/spark-submit \
--master $SPARK_MASTER \
--conf spark.driver.memory=4G \
--conf spark.executor.instances=1 \
--conf spark.executor.cores=8 \
--conf spark.cores.max=8 \
--conf spark.executor.memory=16g \
--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g \
--conf spark.eventLog.enabled=false \
--conf spark.local.dir=/mnt/tmp \
--conf spark.driver.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
--conf spark.executor.extraJavaOptions="-Djava.io.tmpdir=/mnt/tmp" \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
--jars $COMET_JAR \
--driver-class-path $COMET_JAR \
--conf spark.driver.extraClassPath=$COMET_JAR \
--conf spark.executor.extraClassPath=$COMET_JAR \
--conf spark.plugins=org.apache.spark.CometPlugin \
--conf spark.shuffle.manager=org.apache.spark.sql.comet.execution.shuffle.CometShuffleManager \
--conf spark.comet.enabled=true \
--conf spark.comet.cast.allowIncompatible=true \
--conf spark.comet.exec.replaceSortMergeJoin=true \
--conf spark.comet.exec.shuffle.enabled=true \
--conf spark.comet.exec.shuffle.fallbackToColumnar=true \
--conf spark.comet.exec.shuffle.compression.codec=lz4 \
--conf spark.comet.exec.shuffle.compression.level=1 \
tpcbench.py \
--benchmark tpch \
--data s3a://your-bucket-name/top-level-folder \
--queries /home/ec2-user/datafusion-benchmarks/tpch/queries \
--output . \
--iterations 1
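Comet falls back to vanilla Spark operators for plans it does not support, so it is worth confirming it was actually engaged. One rough check (the exact log contents are an assumption about Comet's logging, not a documented interface) is to search the executor work directories for mentions of Comet:
grep -ril comet $SPARK_HOME/work | head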