Ballista Scheduler Metrics

Prometheus

This is optional scheduler feature which should be enabled with prometheus-metrics feature

Built with default features, the ballista scheduler will automatically collect and expose a standard set of prometheus metrics. The metrics currently collected automatically include:

  • job_exec_time_seconds - Histogram of successful job execution time in seconds

  • planning_time_ms - Histogram of job planning time in milliseconds

  • failed - Counter of failed jobs

  • job_failed_total - Counter of failed jobs

  • job_cancelled_total - Counter of cancelled jobs

  • job_completed_total - Counter of completed jobs

  • job_submitted_total - Counter of submitted jobs

  • pending_task_queue_size - Number of pending tasks

NOTE Currently the histogram buckets for the above metrics are set to reasonable defaults. If the defaults are not appropriate for a given use case, the only workaround is to implement a customer SchedulerMetricsCollector. In the future the buckets should be made configurable.

The metrics are then exported through the scheduler REST API at GET /api/metrics. It should be sufficient to ingest metrics into an existing metrics system by point your chosen prometheus exporter at that endpoint.