Deploying Ballista with Kubernetes

Ballista can be deployed to any Kubernetes cluster using the following instructions. These instructions assume that you are already comfortable managing Kubernetes deployments.

The Ballista deployment consists of:

  • k8s deployment for one or more scheduler processes

  • k8s deployment for one or more executor processes

  • k8s service to route traffic to the schedulers

  • k8s persistent volume and persistent volume claim to make local data accessible to Ballista

  • (optional) a Keda instance for autoscaling the number of executors

Testing Locally

MicroK8s is recommended for running a local k8s cluster. Once MicroK8s is installed, DNS must be enabled using the following command.

microk8s enable dns
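
If you want to confirm that the local cluster is healthy before deploying, the following commands (assuming a default MicroK8s installation) wait for the cluster to come up and list its nodes:

microk8s status --wait-ready
microk8s kubectl get nodes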

Build Docker Images

Run the following command to download the official Docker image:

docker pull ghcr.io/apache/datafusion-ballista-standalone:0.12.0-rc4
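
If you want to confirm that the pull succeeded, you can list the image locally:

docker image ls ghcr.io/apache/datafusion-ballista-standalone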

Alternatively, run the following commands to clone the source repository and build the Docker images from source:

git clone git@github.com:apache/datafusion-ballista.git -b 0.12.0
cd datafusion-ballista
./dev/build-ballista-docker.sh

This will create the following images:

  • apache/datafusion-ballista-benchmarks:0.12.0

  • apache/datafusion-ballista-cli:0.12.0

  • apache/datafusion-ballista-executor:0.12.0

  • apache/datafusion-ballista-scheduler:0.12.0

  • apache/datafusion-ballista-standalone:0.12.0

Publishing Docker Images

Once the images have been built, you can retag them and push them to your favourite Docker registry.

docker tag apache/datafusion-ballista-scheduler:0.12.0 <your-repo>/datafusion-ballista-scheduler:0.12.0
docker tag apache/datafusion-ballista-executor:0.12.0 <your-repo>/datafusion-ballista-executor:0.12.0
docker push <your-repo>/datafusion-ballista-scheduler:0.12.0
docker push <your-repo>/datafusion-ballista-executor:0.12.0
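
If you are testing with MicroK8s, one alternative to an external registry is its built-in registry add-on. The following is a sketch that assumes the add-on listens on its default port 32000; images pushed this way would then be referenced as localhost:32000/... in the deployment yaml below.

microk8s enable registry
docker tag apache/datafusion-ballista-scheduler:0.12.0 localhost:32000/datafusion-ballista-scheduler:0.12.0
docker tag apache/datafusion-ballista-executor:0.12.0 localhost:32000/datafusion-ballista-executor:0.12.0
docker push localhost:32000/datafusion-ballista-scheduler:0.12.0
docker push localhost:32000/datafusion-ballista-executor:0.12.0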

Create Persistent Volume and Persistent Volume Claim

Copy the following yaml to a pv.yaml file and apply it to the cluster to create a persistent volume and a persistent volume claim, making the specified host directory available to the containers. Any data that Ballista should be able to query must be placed in this directory.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
  labels:
    type: local
spec:
  storageClassName: manual
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pv-claim
spec:
  storageClassName: manual
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 3Gi

To apply this yaml:

kubectl apply -f pv.yaml

You should see the following output:

persistentvolume/data-pv created
persistentvolumeclaim/data-pv-claim created
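
If you want to confirm that the claim has bound to the volume, you can check both objects:

kubectl get pv,pvc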

Deploying a Ballista Cluster

Copy the following yaml to a cluster.yaml file and replace <your-repo> with the name of the Docker repository that you pushed the images to.

apiVersion: v1
kind: Service
metadata:
  name: ballista-scheduler
  labels:
    app: ballista-scheduler
spec:
  ports:
    - port: 50050
      name: scheduler
    - port: 80
      name: scheduler-ui
  selector:
    app: ballista-scheduler
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ballista-scheduler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ballista-scheduler
  template:
    metadata:
      labels:
        app: ballista-scheduler
        ballista-cluster: ballista
    spec:
      containers:
        - name: ballista-scheduler
          image: <your-repo>/datafusion-ballista-scheduler:0.12.0
          args: ["--bind-port=50050"]
          ports:
            - containerPort: 50050
              name: flight
          volumeMounts:
            - mountPath: /mnt
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-pv-claim
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ballista-executor
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ballista-executor
  template:
    metadata:
      labels:
        app: ballista-executor
        ballista-cluster: ballista
    spec:
      containers:
        - name: ballista-executor
          image: <your-repo>/datafusion-ballista-executor:0.12.0
          args:
            - "--bind-port=50051"
            - "--scheduler-host=ballista-scheduler"
            - "--scheduler-port=50050"
          ports:
            - containerPort: 50051
              name: flight
          volumeMounts:
            - mountPath: /mnt
              name: data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: data-pv-claim

Then apply the yaml to create the service and the deployments:

kubectl apply -f cluster.yaml

This should show the following output:

service/ballista-scheduler created
deployment.apps/ballista-scheduler created
deployment.apps/ballista-executor created

You can also check the status of the pods by running kubectl get pods:

$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS   AGE
ballista-executor-78cc5b6486-4rkn4   0/1     Pending   0          42s
ballista-executor-78cc5b6486-7crdm   0/1     Pending   0          42s
ballista-scheduler-879f874c5-rnbd6   0/1     Pending   0          42s
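
The pods will typically show Pending until they have been scheduled and the volume has been mounted. If you want to block until they are ready, you can wait on the labels defined in the yaml above (the timeout value here is an arbitrary choice):

kubectl wait --for=condition=Ready pod -l app=ballista-scheduler --timeout=120s
kubectl wait --for=condition=Ready pod -l app=ballista-executor --timeout=120s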

You can view the scheduler logs with kubectl logs, passing the name of the scheduler pod shown by kubectl get pods:

$ kubectl logs ballista-scheduler-0
[2021-02-19T00:24:01Z INFO  scheduler] Ballista v0.7.0 Scheduler listening on 0.0.0.0:50050
[2021-02-19T00:24:16Z INFO  ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "b5e81711-1c5c-46ec-8522-d8b359793188", host: "10.1.23.149", port: 50051 }
[2021-02-19T00:24:17Z INFO  ballista::scheduler] Received register_executor request for ExecutorMetadata { id: "816e4502-a876-4ed8-b33f-86d243dcf63f", host: "10.1.23.150", port: 50051 }
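
Executor logs can be viewed in the same way, or you can use a label selector to fetch the logs from all executor pods at once:

kubectl logs -l app=ballista-executor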

Port Forwarding

If you want to run applications outside of the cluster and have them connect to the scheduler, you will need to set up port forwarding.

First, check that the ballista-scheduler service is running.

$ kubectl get services
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)     AGE
kubernetes           ClusterIP   10.152.183.1    <none>        443/TCP     26h
ballista-scheduler   ClusterIP   10.152.183.21   <none>        50050/TCP   24m

Use the following command to set up port forwarding.

kubectl port-forward service/ballista-scheduler 50050:50050
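
With the port forward in place, one way to test the connection from outside the cluster is the Ballista CLI image built earlier. This is only a sketch: it assumes the image's entrypoint is the ballista-cli binary and uses Docker host networking (Linux) so that localhost:50050 reaches the forwarded port.

docker run --network host -it apache/datafusion-ballista-cli:0.12.0 --host localhost --port 50050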

Deleting the Ballista Cluster

Run the following kubectl command to delete the cluster.

kubectl delete -f cluster.yaml
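
If you also want to remove the persistent volume and claim created earlier:

kubectl delete -f pv.yaml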

Autoscaling Executors

Ballista supports autoscaling of executors through Keda. Keda scales a deployment based on custom metrics exposed by the Ballista scheduler, and it can even scale the number of executors down to 0 when there is no activity in the cluster.

Keda can be installed in your Kubernetes cluster with a single command:

kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.7.1/keda-2.7.1.yaml
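
You can verify that the Keda operator pods are running; the manifest above installs them into the keda namespace:

kubectl get pods -n keda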

Once Keda is deployed on your cluster, you can create a new Kubernetes object called a ScaledObject, which tells Keda how to scale your executors. To do that, copy the following yaml into a scale.yaml file:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ballista-executor
spec:
  scaleTargetRef:
    name: ballista-executor
  minReplicaCount: 0
  maxReplicaCount: 5
  triggers:
    - type: external
      metadata:
        # Change this DNS if the scheduler isn't deployed in the "default" namespace
        scalerAddress: ballista-scheduler.default.svc.cluster.local:50050

Then apply it to the cluster:

kubectl apply -f scale.yaml
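
If you want to confirm that Keda has picked up the configuration, check the ScaledObject and the horizontal pod autoscaler that Keda creates to drive the scaling:

kubectl get scaledobject ballista-executor
kubectl get hpa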

If the cluster is inactive, Keda will now scale the number of executors down to 0 and scale them back up when you launch a query. Note that Keda polls the scaler once every 30 seconds, so it may take a short while for the executors to scale.
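
You can watch the executor pods come and go as the cluster scales:

kubectl get pods -l app=ballista-executor -w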

Please visit Keda’s documentation page for more information.