Prometheus alerts
Default alerts
A set of default alerts is already in place and will trigger on issues with your workloads. You can get an overview of all default alerts in the Kubernetes Monitoring Runbook :book: When an alert related to one of your workloads is triggered, a Slack message is sent to the alert channel specified with slack_alert_channel in namespaces.<cluster>.<env>.tfvars in the terraform-aks repo.
Custom Prometheus metrics and recording rules
For better monitoring of your application, you can add your own Prometheus metrics to the application and then base alerts on those metrics. You can find some examples in Quick intro to custom Spring Boot metrics at GAP Community Blog and How to use Micrometer with Spring Boot + Prometheus. If your alert expressions are computationally heavy, use recording rules to precompute them, as sketched below.
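A recording rule lets Prometheus evaluate an expensive expression once per evaluation interval and store the result as a new series, which your alert can then query cheaply. The following is a minimal sketch only: it assumes the standard Spring Boot Actuator metric http_server_requests_seconds_count is being scraped, and the recorded series name, threshold, and resource name are placeholders to adapt to your own workload.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "your-app-name-recording-rules"
  namespace: "your-namespace"
  labels:
    app: "your-app-name"
    role: "alert-rules"
spec:
  groups:
    - name: "your-namespace.your-app-name.recording"
      rules:
        # Precompute the heavy aggregation once per evaluation interval
        - record: "namespace:http_server_requests:rate3m"
          expr: "sum by (namespace, uri, status) (rate(http_server_requests_seconds_count{namespace='your-namespace'}[3m]))"
        # The alert queries the cheap, precomputed series instead of repeating the aggregation
        - alert: "HighServerErrorRate"
          annotations:
            summary: "High rate of HTTP 5xx responses"
          expr: "sum(namespace:http_server_requests:rate3m{status=~'5..'}) > 1"
          for: "3m"
          labels:
            severity: "warning"
            namespace: "your-namespace" # Routes the Slack message to your channel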
Recommended additional Prometheus alerts
You can define alert rules in addition to the default alerts by using the PrometheusRule resource. Below is a set of recommended rules if you have an Ingress configured for your workload. If you are using Spring Boot with Spring Boot Actuator, we recommend using the alerts from this guide.
You should always add the namespace label to your alert rules. This ensures that Slack messages are routed to the channel defined for your namespace in the terraform-aks repo.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "your-app-name"
  namespace: "your-namespace"
  labels:
    app: "your-app-name"
    role: "alert-rules"
spec:
  groups:
    - name: "your-namespace.your-app-name"
      rules:
        - alert: "HighAmountOfHTTPServerErrors" # Trigger if 1% of requests result in a 5** response
          annotations:
            description: "High amount of HTTP server errors in '{{ $labels.container }}' in namespace '{{ $labels.namespace }}'"
            summary: "High amount of HTTP server errors"
          expr: "(100 * sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name',status=~'5.+'}[3m])) / sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name'}[3m]))) > 1"
          for: "3m"
          labels:
            severity: "warning"
            namespace: "your-namespace" # Important to route Slack messages to correct channel
        - alert: "HighAmountOfHTTPClientErrors" # Trigger if 10% of requests result in a 4** response
          annotations:
            description: "High amount of HTTP client errors in '{{ $labels.container }}' in namespace '{{ $labels.namespace }}'"
            summary: "High amount of HTTP client errors"
          expr: "(100 * sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name',status=~'4.+'}[3m])) / sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name'}[3m]))) > 10"
          for: "3m"
          labels:
            severity: "warning"
            namespace: "your-namespace" # Important to route Slack messages to correct channel
Separate alert channels
You can also set up a custom alert channel per deployment, for example if you don't want to flood the general team alert channel, or if certain applications need something more specific. Create your Slack channel and then add the slack_alert_channel label to your workload as shown below.
If you are using a ServiceMonitor
yaml:

apiVersion: v1
kind: Service
metadata:
  name: my-service
  labels:
    slack_alert_channel: 'team-<name>-custom-service-alerts-test'
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
spec:
  targetLabels:
    - slack_alert_channel

app-template-libsonnet:

k8s_service+:: {
  labels: {
    slack_alert_channel: 'team-shadow-custom-service-alerts-test',
  },
}
If you are using a PodMonitor
yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
  labels:
    slack_alert_channel: 'team-<name>-custom-service-alerts-test'
spec:
  template:
    metadata:
      labels:
        # podTargetLabels copies labels from the Pods, so the label must also be set on the pod template
        slack_alert_channel: 'team-<name>-custom-service-alerts-test'
---
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: my-deployment
spec:
  podTargetLabels:
    - slack_alert_channel

app-template-jsonnet:

k8s_deployment+:: {
  labels: {
    slack_alert_channel: 'team-shadow-custom-service-alerts-test',
  },
}
You will also need to ask #team-platform to set up the OpsGenie integration for your Slack channel.