# Comet Benchmarking in EC2
This guide explains how to run benchmarks on a single-node AWS EC2 instance, with the Parquet files either located on an attached EBS volume or stored in S3.
## Create EC2 Instance
- Create an EC2 instance with an EBS volume sized for approximately 2x the size of the dataset to be generated (200 GB for scale factor 100, 2 TB for scale factor 1000, and so on).
- Optionally, create an S3 bucket to store the Parquet files.
Recommendation: use a `c7i.4xlarge` instance type with a provisioned IOPS (`io2`) EBS volume configured for 8000 IOPS.
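If you prefer to create the instance from the command line, a minimal sketch is shown below; the AMI ID and key pair are placeholders, and `/dev/xvda` assumes an Amazon Linux AMI.

```shell
# Hypothetical CLI equivalent of the recommendation above; adapt the
# placeholder AMI ID, key pair, and root volume size to your setup.
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type c7i.4xlarge \
  --key-name your-key-pair \
  --block-device-mappings '[{"DeviceName":"/dev/xvda","Ebs":{"VolumeSize":200,"VolumeType":"io2","Iops":8000}}]'
```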
## Install Prerequisites
```shell
sudo yum update -y
sudo yum install -y git protobuf-compiler java-17-amazon-corretto-headless java-17-amazon-corretto-devel
sudo yum groupinstall -y "Development Tools"
export JAVA_HOME=/usr/lib/jvm/java-17-amazon-corretto
```
## Generate Benchmark Data
```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Make cargo available in the current shell
. "$HOME/.cargo/env"
RUSTFLAGS='-C target-cpu=native' cargo install tpchgen-cli
tpchgen-cli -s 100 --format parquet --parts 32 --output-dir data
```
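Optionally, sanity-check that the generator wrote a directory per TPC-H table before proceeding:

```shell
# Optional check: expect one directory per TPC-H table, each holding
# 32 Parquet part files.
ls data
du -sh data
```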
Rename the generated directories so that they have a `.parquet` suffix. For example, rename `customer` to `customer.parquet`.
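One way to do this for all eight tables (a minimal sketch, assuming the standard TPC-H table names produced by `tpchgen-cli`):

```shell
# Append a .parquet suffix to each generated table directory.
cd data
for table in customer lineitem nation orders part partsupp region supplier; do
  mv "$table" "$table.parquet"
done
cd ..
```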
Set the `TPCH_DATA` environment variable. This will be referenced by the benchmarking scripts.

```shell
export TPCH_DATA=/home/ec2-user/data
```
## Install Apache Spark
```shell
export SPARK_VERSION=3.5.6
wget https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop3.tgz
tar xzf spark-$SPARK_VERSION-bin-hadoop3.tgz
sudo mv spark-$SPARK_VERSION-bin-hadoop3 /opt
export SPARK_HOME=/opt/spark-$SPARK_VERSION-bin-hadoop3/
mkdir /tmp/spark-events
```
Set the `SPARK_MASTER` environment variable (replace the IP address with your instance's private IP address):

```shell
export SPARK_MASTER=spark://172.31.34.87:7077
```
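The benchmark scripts assume a standalone Spark cluster is listening at this address. If you have not started one yet, a minimal sketch using Spark's standalone-mode scripts:

```shell
# Start a standalone master and a single worker on this node.
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER
```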
Set `SPARK_LOCAL_DIRS` to point to the EBS volume:

```shell
sudo mkdir /mnt/tmp
sudo chmod 777 /mnt/tmp
mv $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh
```

Add the following entry to `spark-env.sh`:

```
SPARK_LOCAL_DIRS=/mnt/tmp
```
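Equivalently, the entry can be appended from the command line:

```shell
# Append the local-dirs setting to the Spark environment file.
echo "SPARK_LOCAL_DIRS=/mnt/tmp" >> $SPARK_HOME/conf/spark-env.sh
```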
## Git Clone DataFusion Repositories
```shell
git clone https://github.com/apache/datafusion-benchmarks.git
git clone https://github.com/apache/datafusion-comet.git
```
## Build Comet
```shell
cd datafusion-comet
make release
```
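The name of the jar produced under `spark/target` depends on the Comet version and the Spark/Scala profile used for the build; locate it once the build completes:

```shell
# List the built Comet jar(s); use the resulting path in the next step.
ls spark/target/comet-spark-*.jar
```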
Set the `COMET_JAR` environment variable:

```shell
export COMET_JAR=/home/ec2-user/datafusion-comet/spark/target/comet-spark-spark3.5_2.12-0.11.0-SNAPSHOT.jar
```
## Run Benchmarks
Use the scripts in `dev/benchmarks` in the Comet repository.

```shell
cd dev/benchmarks
export TPCH_QUERIES=/home/ec2-user/datafusion-benchmarks/tpch/queries/
```
Run the Spark benchmark:

```shell
./spark-tpch.sh
```
Run the Comet benchmark:

```shell
./comet-tpch.sh
```
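Assuming the scripts write Spark event logs to `/tmp/spark-events` (the directory created during setup), one way to inspect and compare the runs is the Spark history server, which reads that directory by default:

```shell
# Serves the history UI on port 18080; /tmp/spark-events is the
# default event log directory.
$SPARK_HOME/sbin/start-history-server.sh
```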
## Running Benchmarks with S3
Copy the Parquet data to an S3 bucket:

```shell
aws s3 cp /home/ec2-user/data s3://your-bucket-name/ --recursive
```
Update the `TPCH_DATA` environment variable:

```shell
export TPCH_DATA=s3a://your-bucket-name
```
Install the Hadoop jar files:

```shell
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -P $SPARK_HOME/jars
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.11.1026/aws-java-sdk-bundle-1.11.1026.jar -P $SPARK_HOME/jars
```
Add credentials to `~/.aws/credentials`:

```
[default]
aws_access_key_id=your-access-key
aws_secret_access_key=your-secret-key
```
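Alternatively, the `DefaultAWSCredentialsProviderChain` configured below also picks up credentials from environment variables:

```shell
# Alternative to ~/.aws/credentials; these are read by the AWS SDK's
# default credentials provider chain.
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
```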
Modify the scripts to add the following configurations:

```
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain \
```
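Before running the full benchmarks, it may be worth confirming that Spark can read one table from the bucket; a sketch, assuming the bucket name and the renamed directories from above:

```shell
# Count rows in one table to verify S3A access end to end.
echo 'println(spark.read.parquet("s3a://your-bucket-name/customer.parquet").count())' | \
  $SPARK_HOME/bin/spark-shell \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.DefaultAWSCredentialsProviderChain
```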
Now run the `spark-tpch.sh` and `comet-tpch.sh` scripts.