Deploying Ballista with Kubernetes
Ballista can be deployed to any Kubernetes cluster using the following instructions. These instructions assume that you are already comfortable managing Kubernetes deployments.
The Ballista deployment consists of:
k8s deployment for one or more scheduler processes
k8s deployment for one or more executor processes
k8s service to route traffic to the schedulers
k8s persistent volume and persistent volume claims to make local data accessible to Ballista
(optional) a keda instance for autoscaling the number of executors
Testing Locally
MicroK8s is recommended for installing a local k8s cluster. Once MicroK8s is installed, enable DNS with the following command.
microk8s enable dns
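MicroK8s bundles its own client as `microk8s kubectl`, while the remaining steps in this guide call plain `kubectl`. The following is a minimal sketch of one way to bridge that gap in a shell session; `ensure_kubectl` is a hypothetical helper name, not part of MicroK8s or Ballista.

```shell
# Sketch only: ensure_kubectl is an illustrative helper.
# If no standalone kubectl is on the PATH, define kubectl as a wrapper
# around the client bundled with MicroK8s. Returns non-zero if neither
# command is available.
ensure_kubectl() {
  if command -v kubectl >/dev/null 2>&1; then
    return 0
  fi
  if command -v microk8s >/dev/null 2>&1; then
    kubectl() { microk8s kubectl "$@"; }
    return 0
  fi
  echo "neither kubectl nor microk8s found" >&2
  return 1
}
```

After calling `ensure_kubectl`, the `kubectl` commands below work unchanged in the same shell. On MicroK8s, running `microk8s status --wait-ready` first blocks until the local cluster is up.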
Build Docker Images
Run the following command to download the official Docker image:
docker pull ghcr.io/apache/datafusion-ballista-standalone:0.12.0-rc4
Alternatively, run the following commands to clone the source repository and build the Docker images from source:
git clone git@github.com:apache/datafusion-ballista.git -b 0.12.0
cd datafusion-ballista
./dev/build-ballista-docker.sh
This will create the following images:
apache/datafusion-ballista-benchmarks:0.12.0
apache/datafusion-ballista-cli:0.12.0
apache/datafusion-ballista-executor:0.12.0
apache/datafusion-ballista-scheduler:0.12.0
apache/datafusion-ballista-standalone:0.12.0
Publishing Docker Images
Once the images have been built, you can retag them and push them to your preferred Docker registry.
docker tag apache/datafusion-ballista-scheduler:0.12.0 <your-repo>/datafusion-ballista-scheduler:0.12.0
docker tag apache/datafusion-ballista-executor:0.12.0 <your-repo>/datafusion-ballista-executor:0.12.0
docker push <your-repo>/datafusion-ballista-scheduler:0.12.0
docker push <your-repo>/datafusion-ballista-executor:0.12.0
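The tag-and-push commands above can also be wrapped in a small script. This is a minimal sketch that assumes the images were built with the 0.12.0 tag; `target_image`, `publish_images`, and the `VERSION` variable are hypothetical names chosen for illustration.

```shell
# Sketch only: helper names and the VERSION default are assumptions.
VERSION="${VERSION:-0.12.0}"

# Print the registry-qualified name for a Ballista component image.
target_image() {
  repo="$1"; component="$2"
  echo "${repo}/datafusion-ballista-${component}:${VERSION}"
}

# Retag and push the scheduler and executor images to the given registry.
# Stops at the first docker command that fails.
publish_images() {
  repo="$1"
  for component in scheduler executor; do
    src="apache/datafusion-ballista-${component}:${VERSION}"
    dst="$(target_image "$repo" "$component")"
    docker tag "$src" "$dst" && docker push "$dst" || return 1
  done
}

# Usage (replace the argument with your registry prefix):
#   publish_images <your-repo>
```

Keeping the name construction in one function means the same image names can be reused when filling in the deployment manifests later.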
Create Persistent Volume and Persistent Volume Claim
Copy the following yaml to a pv.yaml file and apply it to the cluster to create a persistent volume and a persistent volume claim, so that the specified host directory is available to the containers. Any data that Ballista should query must be located in this directory.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi
To apply this yaml:
kubectl apply -f pv.yaml
You should see the following output:
persistentvolume/data-pv created
persistentvolumeclaim/data-pv-claim created
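Before deploying the cluster, it can help to confirm that the claim has actually bound to the volume. The following is a minimal sketch that polls the claim's phase; `wait_for_bound` is a hypothetical helper, and the 60-second default timeout is an arbitrary choice.

```shell
# Sketch only: wait_for_bound is an illustrative helper.
# Poll the claim once per second until it reports the Bound phase,
# giving up after the timeout (in seconds, default 60).
wait_for_bound() {
  claim="$1"; timeout="${2:-60}"; elapsed=0
  while [ "$elapsed" -lt "$timeout" ]; do
    phase="$(kubectl get pvc "$claim" -o jsonpath='{.status.phase}')"
    [ "$phase" = "Bound" ] && return 0
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "claim $claim not Bound after ${timeout}s (last phase: ${phase:-unknown})" >&2
  return 1
}

# Usage:
#   wait_for_bound data-pv-claim
```

The claim requests 3Gi and the volume offers 10Gi; a claim can bind to a volume whose capacity is at least the requested size, provided the storage class and access modes match.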
Deploying a Ballista Cluster
Copy the following yaml to a cluster.yaml file and replace <your-repo> with the name of the Docker registry that holds your Ballista images.
apiVersion: v1
kind: Service
metadata:
  name: ballista-scheduler
  labels:
    app: ballista-scheduler
spec:
  ports:
    - port: 50050
      name: scheduler
    - port: 80
      name: scheduler-ui
  selector:
    app: ballista-scheduler
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ballista-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ballista-scheduler
  template:
    metadata:
      labels:
        app: ballista-scheduler
        ballista-cluster: ballista
    spec:
      containers:
        - name: ballista-scheduler
          image: <your-repo>/datafusion-ballista-scheduler:0.12.0
          args: ["--bind-port=50050"]
          ports:
            - containerPort: 50050
              name: flight
          volumeMounts:
            - mountPath: /mnt
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-pv-claim
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ballista-executor
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ballista-executor
  template:
    metadata:
      labels:
        app: ballista-executor
        ballista-cluster: ballista
    spec:
      containers:
        - name: ballista-executor
          image: <your-repo>/datafusion-ballista-executor:0.12.0
          args:
            - "--bind-port=50051"
            - "--scheduler-host=ballista-scheduler"
            - "--scheduler-port=50050"
          ports:
            - containerPort: 50051
              name: flight
          volumeMounts:
            - mountPath: /mnt
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-pv-claim
To apply this yaml:
kubectl apply -f cluster.yaml
This should show the following output:
service/ballista-scheduler created
deployment.apps/ballista-scheduler created
deployment.apps/ballista-executor created
You can also check status by running kubectl get pods:
$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
ballista-executor-78cc5b6486-4rkn4   0/1     Pending   0          42s
ballista-executor-78cc5b6486-7crdm   0/1     Pending   0          42s
ballista-scheduler-879f874c5-rnbd6   0/1     Pending   0          42s
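The output above shows the pods still in the Pending state. Rather than re-running `kubectl get pods` by hand, a script can block until both deployments have rolled out. This is a minimal sketch; `wait_for_cluster` is a hypothetical helper, and the 120-second default timeout is an arbitrary choice.

```shell
# Sketch only: wait_for_cluster is an illustrative helper.
# Wait for the scheduler and executor deployments to finish rolling out.
# If a rollout does not complete in time, print the pod descriptions so
# that scheduling or image-pull events are visible, then fail.
wait_for_cluster() {
  timeout="${1:-120s}"
  for deploy in ballista-scheduler ballista-executor; do
    kubectl rollout status "deployment/${deploy}" --timeout="$timeout" || {
      kubectl describe pods -l "app=${deploy}"
      return 1
    }
  done
}

# Usage:
#   wait_for_cluster        # default timeout
#   wait_for_cluster 300s   # longer timeout for slow image pulls
```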
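The scheduler runs as a Deployment, so its pod name carries a generated suffix (as in the kubectl get pods output). You can view the scheduler logs by addressing the Deployment with kubectl logs deployment/ballista-scheduler:
$ kubectl logs deployment/ballista-scheduler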
[2021-02-19T00:24:01Z INFO scheduler] Ballista v0.7.0 Scheduler listening on 0.0.0.0:50050
[2021-02-19T00:24:16Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
[2021-02-19T00:24:17Z INFO ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
Port Forwarding
If you want to run applications outside of the cluster and have them connect to the scheduler, you need to set up port forwarding.
First, check that the ballista-scheduler service is running.
$ kubectl get services
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
kubernetes           ClusterIP   10.152.183.1    <none>        443/TCP     26h
ballista-scheduler   ClusterIP   10.152.183.21   <none>        50050/TCP   24m
Use the following command to set up port-forwarding.
kubectl port-forward service/ballista-scheduler 50050:50050
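The port-forward command runs in the foreground and occupies the terminal until it is interrupted. The following is a minimal sketch for running it in the background of a script instead; `forward_scheduler` is a hypothetical helper, not a Ballista or kubectl command.

```shell
# Sketch only: forward_scheduler is an illustrative helper.
# Forward a local port (default 50050) to port 50050 of the scheduler
# service in the background, and stop forwarding when this shell exits.
forward_scheduler() {
  local_port="${1:-50050}"
  kubectl port-forward service/ballista-scheduler "${local_port}:50050" &
  forward_pid=$!
  trap 'kill "$forward_pid" 2>/dev/null || true' EXIT
}

# Usage:
#   forward_scheduler        # localhost:50050 -> scheduler port 50050
#   forward_scheduler 6000   # localhost:6000  -> scheduler port 50050
```

While forwarding is active, a client on your machine can use localhost and the chosen local port as the scheduler address, for example `ballista-cli --host localhost --port 50050` (assuming a locally installed `ballista-cli` that accepts these flags).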
Deleting the Ballista Cluster
Run the following kubectl command to delete the cluster.
kubectl delete -f cluster.yaml
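Deleting cluster.yaml removes the service and the two deployments, but not the persistent volume or its claim. A minimal sketch for removing everything created in this guide follows; `teardown_ballista` is a hypothetical helper, and it assumes the manifest files are in the current directory under the names used here.

```shell
# Sketch only: teardown_ballista is an illustrative helper.
# Delete the resources from each manifest that exists in the current
# directory: the autoscaler first (if configured), then the cluster,
# then the persistent volume and claim.
teardown_ballista() {
  for manifest in scale.yaml cluster.yaml pv.yaml; do
    if [ -f "$manifest" ]; then
      kubectl delete -f "$manifest" --ignore-not-found
    fi
  done
}

# Usage:
#   teardown_ballista
```

Manually created persistent volumes default to the Retain reclaim policy, so deleting the volume object should leave the files in the host directory in place.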
Autoscaling Executors
Ballista supports autoscaling for executors through Keda. Keda scales a deployment based on custom metrics, which the Ballista scheduler exposes, and it can even scale the number of executors down to 0 if there is no activity in the cluster.
Keda can be installed in your Kubernetes cluster with a single command:
kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.7.1/keda-2.7.1.yaml
Once you have deployed Keda on your cluster, you can deploy a new Kubernetes object called ScaledObject, which tells Keda how to scale your executors. To do that, copy the following YAML into a scale.yaml file:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ballista-executor
spec:
  scaleTargetRef:
    name: ballista-executor
  minReplicaCount: 0
  maxReplicaCount: 5
  triggers:
    - type: external
      metadata:
        # Change this DNS if the scheduler isn't deployed in the "default" namespace
        scalerAddress: ballista-scheduler.default.svc.cluster.local:50050
And then deploy it into the cluster:
kubectl apply -f scale.yaml
If the cluster is inactive, Keda will now scale the number of executors down to 0, and will scale them back up when you launch a query. Note that Keda checks the scaler once every 30 seconds, so there can be a short delay before the executors scale.
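The 30-second scan corresponds to the ScaledObject's `pollingInterval` field, and the delay before scaling back to 0 is governed by `cooldownPeriod`. The following is a minimal sketch that regenerates scale.yaml with both fields set explicitly; the `POLLING_INTERVAL` and `COOLDOWN_PERIOD` shell variables and the value of 10 seconds are illustrative assumptions, not recommendations from the Ballista project. Note that running it overwrites any existing scale.yaml in the current directory.

```shell
# Sketch only: the variable names and defaults below are illustrative.
POLLING_INTERVAL="${POLLING_INTERVAL:-10}"
COOLDOWN_PERIOD="${COOLDOWN_PERIOD:-300}"

# Write a ScaledObject manifest with explicit polling and cooldown settings.
cat > scale.yaml <<EOF
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ballista-executor
spec:
  scaleTargetRef:
    name: ballista-executor
  pollingInterval: ${POLLING_INTERVAL}   # seconds between checks of the trigger
  cooldownPeriod: ${COOLDOWN_PERIOD}     # seconds of inactivity before scaling to 0
  minReplicaCount: 0
  maxReplicaCount: 5
  triggers:
    - type: external
      metadata:
        # Change this DNS if the scheduler isn't deployed in the "default" namespace
        scalerAddress: ballista-scheduler.default.svc.cluster.local:50050
EOF
```

Apply the regenerated file with `kubectl apply -f scale.yaml` as before. A shorter polling interval reacts to new queries sooner at the cost of more frequent requests to the scheduler.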
Please visit Keda’s documentation page for more information.