Prometheus Alerts

Default alerts

A set of default alerts is already in place and will trigger on issues with your workloads. You can get an overview of all the default alerts in the Kubernetes Monitoring Runbook :book: When an alert related to one of your workloads is triggered, a Slack message is sent to the alert channel specified with slack_alert_channel in namespaces.<cluster>.<env>.tfvars in the terraform-aks repo.

Custom Prometheus metrics and recording rules

For better monitoring of your application, you can add your own Prometheus metrics to the application and then base alerts on those metrics. You can find examples in Quick intro to custom Spring Boot metrics on the GAP Community Blog and in How to use Micrometer with Spring Boot + Prometheus. If your alert expressions are computationally heavy, use recording rules (see the recording-rule sketch further down).
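
If you are using Spring Boot Actuator with the Micrometer Prometheus registry, the http_server_requests_seconds_count counter it exposes out of the box is one metric you can base an alert on. The following is a minimal sketch, assuming that metric and placeholder names (your-app-name, your-namespace); adjust the metric, labels, and thresholds to your own application.

custom-metric-alert.yaml (sketch)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "your-app-name-custom-metric-alerts"
  namespace: "your-namespace"
  labels:
    app: "your-app-name"
    role: "alert-rules"
spec:
  groups:
    - name: "your-namespace.your-app-name.custom-metrics"
      rules:
        - alert: "HighAmountOfHTTPServerErrorsFromApp" # Trigger if more than 1% of requests end in a 5** response, measured at the application
          annotations:
            description: "High amount of HTTP server errors reported by '{{ $labels.container }}' in namespace '{{ $labels.namespace }}'"
            summary: "High amount of HTTP server errors reported by the application"
          expr: "(100 * sum by (container) (rate(http_server_requests_seconds_count{namespace='your-namespace',status=~'5.+'}[3m])) / sum by (container) (rate(http_server_requests_seconds_count{namespace='your-namespace'}[3m]))) > 1"
          for: "3m"
          labels:
            severity: "warning"
            namespace: "your-namespace" # Important to route Slack messages to correct channel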

You can define alert rules in addition to the default alerts by using the PrometheusRule resource. Below is a set of recommended rules if you have an Ingress configured for your workload. If you are using Spring Boot with Spring Boot Actuator, we recommend using the alerts from this guide.

Required rule label

Always add the namespace label to your alert rules. It ensures that Slack messages are routed to the channel defined for your namespace in the terraform-aks repo.

prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "your-app-name"
  namespace: "your-namespace"
  labels:
    app: "your-app-name"
    role: "alert-rules"
spec:
  groups:
    - name: "your-namespace.your-app-name"
      rules:
        - alert: "HighAmountOfHTTPServerErrors" # Trigger if more than 1% of requests result in a 5** response
          annotations:
            description: "High amount of HTTP server errors for ingress '{{ $labels.ingress }}' in namespace '{{ $labels.namespace }}'"
            summary: "High amount of HTTP server errors"
          expr: "(100 * sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name',status=~'5.+'}[3m])) / sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name'}[3m]))) > 1"
          for: "3m"
          labels:
            severity: "warning"
            namespace: "your-namespace" # Important to route Slack messages to correct channel
        - alert: "HighAmountOfHTTPClientErrors" # Trigger if more than 10% of requests result in a 4** response
          annotations:
            description: "High amount of HTTP client errors for ingress '{{ $labels.ingress }}' in namespace '{{ $labels.namespace }}'"
            summary: "High amount of HTTP client errors"
          expr: "(100 * sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name',status=~'4.+'}[3m])) / sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name'}[3m]))) > 10"
          for: "3m"
          labels:
            severity: "warning"
            namespace: "your-namespace" # Important to route Slack messages to correct channel

Separate alert channels

You can also set up custom alert channels per deployment, for example if you don't want to flood the general team-alert-channel or need something more specific for certain applications. Set up your Slack channel, add the slack_alert_channel label to your workload, and propagate it to the scraped metrics as shown below.

If you are using a ServiceMonitor

service.yaml and servicemonitor.yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    slack_alert_channel: 'team-<name>-custom-service-alerts-test' # Slack channel that alerts for this service should go to
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
spec:
  targetLabels: # Copies the slack_alert_channel label from the Service onto the scraped metrics
    - slack_alert_channel

If you are using a PodMonitor

deployment.yaml and podmonitor.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  template:
    metadata:
      labels:
        # The label must be on the Pod template so that the PodMonitor can pick it up
        slack_alert_channel: 'team-<name>-custom-service-alerts-test'
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-deployment
spec:
  podTargetLabels: # PodMonitor uses podTargetLabels to copy Pod labels onto the scraped metrics
    - slack_alert_channel
Important if using a specific alert channel!

You will also need to ask #team-platform to set up the OpsGenie integration for your Slack channel.