Argo Workflows failed workflows notification with Prometheus

In Argo Workflows there’s no built-in method of sending notifications. But what if you want to be notified of failed workflows? Well, you have several options:
- Use an `exitHandler`. You can configure a per-workflow or default `exitHandler` that will send notifications if the job fails (a minimal sketch is shown right after this list).
- Use workflow-generated Kubernetes Events. Every time Argo WF starts or stops a workflow it emits Kubernetes Events. You can then listen to those events with something like Argo Events and take actions in response.
- Use custom metrics in the workflow. You can then scrape those metrics with Prometheus and send alerts with Alertmanager. This is the most operationally efficient method because you’ll be using the same Alertmanager you already use for your Kubernetes cluster monitoring. It also removes the need to write custom alerting logic, which you would have to implement with options 1 and 2.
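To illustrate option 1, here’s a minimal sketch of a workflow with an exit handler that notifies only on failure. This is not from the original setup: the image, the `send-notification` template and the `SLACK_WEBHOOK_URL` placeholder are illustrative assumptions.

```yaml
# Sketch of option 1: an exit handler that notifies only when the workflow fails.
# The notification command, image and SLACK_WEBHOOK_URL are placeholders.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: notify-on-failure-
spec:
  entrypoint: main
  onExit: exit-handler                 # always runs after the workflow finishes
  templates:
    - name: main
      container:
        image: alpine:3.20
        command: [sh, -c]
        args: ["exit 1"]               # simulate a failing step
    - name: exit-handler
      steps:
        - - name: notify
            template: send-notification
            when: "{{workflow.status}} != Succeeded"   # only on Failed/Error
    - name: send-notification
      container:
        image: curlimages/curl:8.8.0
        command: [sh, -c]
        args:
          - >-
            curl -X POST -H 'Content-Type: application/json'
            -d '{"text": "Workflow {{workflow.name}} failed"}'
            "$SLACK_WEBHOOK_URL"
```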
So let’s see how we can implement failed-workflow notifications with Prometheus.
First of all, we need to configure Argo Workflows to emit a custom metric when any workflow fails. For that to work you need to enable Prometheus metrics in Argo Workflows.
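If metrics aren’t enabled yet, that’s done in the workflow-controller ConfigMap. A minimal sketch, assuming the default `workflow-controller-configmap` name and the `argo` namespace:

```yaml
# Sketch: enable the controller's Prometheus metrics endpoint.
# ConfigMap name and namespace assume a default Argo Workflows install.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  metricsConfig: |
    enabled: true
    path: /metrics
    port: 9090
```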
Then you need to add the following snippet to your default workflow spec:
```yaml
workflowDefaults:
  spec:
    metrics:
      prometheus:
        - name: workflow_failed_count
          counter:
            value: "1"
          help: Failed workflow counter
          labels:
            - key: workflow_name
              value: "{{ workflow.name }}"
            - key: workflow_namespace
              value: "{{ workflow.namespace }}"
            - key: workflow_duration_seconds
              value: "{{ workflow.duration }}"
            - key: workflow_created_at
              value: "{{ workflow.creationTimestamp }}"
            - key: cron_workflow
              value: "{{= sprig.default('', workflow.labels['workflows.argoproj.io/cron-workflow']) }}"
          when: "{{workflow.status}} == Failed || {{workflow.status}} == Error"
```
Because this is a default workflow spec, it will be added to every workflow, which is exactly what we want. Remember that the alert routing will be done on the Alertmanager side.
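For reference, the `workflowDefaults` block goes into the same workflow-controller ConfigMap, as a string value under `data` (a sketch, again assuming the default ConfigMap name and namespace):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  workflowDefaults: |
    spec:
      metrics:
        prometheus:
          - name: workflow_failed_count
            counter:
              value: "1"
            help: Failed workflow counter
            # ...labels and the "when" condition from the snippet above...
```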
Let’s take a closer look at the fields in the metrics spec:
- `name` defines the metric name as it will be seen by Prometheus.
- `counter` is the type of the Prometheus metric. In this case we’re increasing the counter each time a workflow fails; we will then create an alerting rule based on the counter value.
- `help` is not mandatory but serves as a hint describing the metric.
- `labels` defines the list of labels that will be added to the Prometheus metric. Here we’re setting up some useful labels like the workflow name, namespace, duration and creation timestamp. We’re also defining a `cron_workflow` label which is non-empty for any `Workflow` triggered by a `CronWorkflow`. If the workflow isn’t related to cron, the label will be set to an empty string.
- `when` makes sure the metric is emitted only for workflows that end in the `Failed` or `Error` state.
The spec above will result in the following Prometheus metric, assuming that Prometheus scrapes metrics from the Argo WF `workflow-controller` (note that Argo prefixes custom metrics with `argo_workflows_`):
```
argo_workflows_workflow_failed_count{cron_workflow="example-cron-failure",workflow_created_at="2024-11-06T12:40:00Z",workflow_duration_seconds="10.047688",workflow_name="example-cron-failure-1730896800",workflow_namespace="default"} 1
```
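If you also use prometheus-operator for scraping, a `ServiceMonitor` roughly like the following could point Prometheus at the controller’s metrics port. This is a sketch: the Service labels, port name and namespace are assumptions that depend on how Argo Workflows was installed.

```yaml
# Sketch: scrape the workflow-controller metrics endpoint with prometheus-operator.
# Selector labels, port name and namespace depend on your Argo Workflows install.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: argo-workflow-controller
  namespace: argo
spec:
  selector:
    matchLabels:
      app: workflow-controller
  endpoints:
    - port: metrics        # the Service port exposing :9090/metrics
      interval: 30s
```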
Finally, let’s create a `PrometheusRule` (requires `prometheus-operator`). Since every failed workflow produces its own time series (the `workflow_name` label is unique) whose value is 1, as in the sample above, the rule uses `max_over_time` over a short window to catch any such series seen recently:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: argo-workflows
spec:
  groups:
    - name: argo-workflows
      rules:
        - alert: ArgoWorkflowFailed
          annotations:
            summary: Workflow {{ $labels.workflow_namespace }}/{{ $labels.workflow_name }} failed
            link: https://ARGO_WORKFLOWS_HOSTNAME/workflows/{{ $labels.workflow_namespace }}/{{ $labels.workflow_name }}
          expr: max_over_time(argo_workflows_workflow_failed_count[2m]) > 0
          for: 0m
          labels:
            app: argo-workflows
            severity: warning
```
Optionally you can have routing rules in your Alertmanager config like this:
```yaml
route:
  routes:
    - matchers:
        - alertname = "ArgoWorkflowFailed"
        - workflow_namespace =~ ".*-prod"
      receiver: "prod-alerts"
      continue: true
    - matchers:
        - alertname = "ArgoWorkflowFailed"
        - workflow_namespace =~ ".*-qa"
      receiver: "qa-alerts"
```
This configuration sends `ArgoWorkflowFailed` alerts to different receivers based on the workflow namespace.
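For completeness, the matching receivers could look roughly like this. This is only a sketch: the Slack channels and webhook URL are placeholders, and any notification integration Alertmanager supports would work just as well.

```yaml
# Sketch: receivers referenced by the routes above. Channels and api_url are placeholders.
receivers:
  - name: "prod-alerts"
    slack_configs:
      - channel: "#prod-alerts"
        api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.link }}'
  - name: "qa-alerts"
    slack_configs:
      - channel: "#qa-alerts"
        api_url: https://hooks.slack.com/services/XXX/YYY/ZZZ
```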
That’s it! Thank you guys for reading!