Alerts

Ding, ding! You've got an alert! 🔔 A good set of alerts is crucial when running operations. Kubernetes clusters at Gjensidige are set up with Prometheus and Alertmanager, making it easy to create alerts for your workloads.

Default alerts

A set of default alerts is already present and will trigger on issues with your workloads. You can get an overview of all default alerts in the Kubernetes Monitoring Runbook 📖. When an alert related to one of your workloads is triggered, a Slack message is sent to the alert channel specified with slack_alert_channel in namespaces.<cluster>.<env>.tfvars in the terraform-aks repo.
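For reference, such an entry might look like the sketch below. This is a hypothetical illustration only; the namespace name and channel are placeholders, and the actual variable structure is defined in the terraform-aks repo.

namespaces = {
  "your-namespace" = {
    # Slack channel that receives alerts for workloads in this namespace
    slack_alert_channel = "#your-team-alerts"
  }
}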

You can define alert rules in addition to the default alerts by using the PrometheusRule resource. Below is a set of recommended rules if you have an Ingress configured for your workload. If you are using Spring Boot with Spring Boot Actuator, we recommend using the alerts from this guide.

Required rule label

Always add the namespace label to your alert rules. This ensures that Slack messages are routed to the channel defined for your namespace in the terraform-aks repo.

prometheusrule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "your-app-name"
  namespace: "your-namespace"
  labels:
    app: "your-app-name"
    role: "alert-rules"
spec:
  groups:
    - name: "your-namespace.your-app-name"
      rules:
        - alert: "HighAmountOfHTTPServerErrors" # Trigger if 1% of requests result in a 5** response
          annotations:
            description: "High amount of HTTP server errors for ingress '{{ $labels.ingress }}' in namespace '{{ $labels.namespace }}'"
            summary: "High amount of HTTP server errors"
          expr: "(100 * sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name',status=~'5.+'}[3m])) / sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name'}[3m]))) > 1"
          for: "3m"
          labels:
            severity: "warning"
            namespace: "your-namespace" # Important to route Slack messages to the correct channel
        - alert: "HighAmountOfHTTPClientErrors" # Trigger if 10% of requests result in a 4** response
          annotations:
            description: "High amount of HTTP client errors for ingress '{{ $labels.ingress }}' in namespace '{{ $labels.namespace }}'"
            summary: "High amount of HTTP client errors"
          expr: "(100 * sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name',status=~'4.+'}[3m])) / sum by (ingress) (rate(nginx_ingress_controller_request_duration_seconds_count{exported_namespace='your-namespace',ingress='your-ingress-name'}[3m]))) > 10"
          for: "3m"
          labels:
            severity: "warning"
            namespace: "your-namespace" # Important to route Slack messages to the correct channel
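Once the manifest is part of your deployment (or applied manually), you can check that the rule resource exists in the cluster. The commands below are a minimal sketch, assuming you have kubectl access to your namespace and use the placeholder names from the example above:

# Apply the rule manually (normally this is handled by your deployment pipeline)
kubectl apply -f prometheusrule.yaml

# List PrometheusRule resources in your namespace
kubectl get prometheusrules -n your-namespace

# Inspect the rule as stored in the cluster
kubectl get prometheusrule your-app-name -n your-namespace -o yaml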

Architecture overview

[Diagram: Prometheus with Alertmanager flow]