Loki alerts
You should always optimize your queries (alerting expressions) as much as possible. A query should normally specify cluster, namespace and application/pod, or a well-defined list of applications. Broader queries run over larger datasets, which takes more time and is much more demanding on resources. Well-defined queries are even more crucial in alerting than in regular log searches, because alerting rules are evaluated constantly.
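For illustration, a well-scoped alerting expression restricts the stream selector to a specific cluster, namespace and container; the values below are placeholders and the threshold is only an example:

sum(count_over_time({cluster_name="your-cluster", namespace="your-namespace", pod_container_name="your-application"} |= `error` [10m])) > 10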
Logs vs Metrics
As a rule of thumb, metrics should be used for monitoring applications and logs for debugging. Metrics are preferable for alerting because executing queries against a time-series database is more efficient, reliable and cost-effective than querying a log system. In most cases the preferred alerting solution is to add custom Prometheus metrics to your application and create Prometheus alerts based on those metrics, see Custom prometheus metrics.
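As a rough sketch of that approach, assuming your application exposed a hypothetical counter named my_app_known_errors_total, the alert would be an ordinary PrometheusRule evaluated by Prometheus instead of Loki:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "metrics-alerting-your-app-name"
  namespace: "your-namespace"
spec:
  groups:
    - name: app-name
      rules:
        - alert: KnownErrorInApp
          # my_app_known_errors_total is a hypothetical custom metric exposed by the application
          expr: 'sum(rate(my_app_known_errors_total[10m])) > 0'
          for: 10m
          labels:
            severity: warning
            namespace: "your-namespace" # routes the alert to your namespace's Slack channel
          annotations:
            summary: KnownError reported by application metrics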
Loki rules
A Loki alerting rule is created as a PrometheusRule Kubernetes resource, with one additional label: loki-rule: "true". Examples of Loki alerting rules are shown below.
You should always add the namespace label to your alert rules. This ensures that Slack messages are routed to the channel defined for your namespace in the terraform-aks repo.
- yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: "loki-alerting-your-app-name"
  namespace: "your-namespace"
  labels:
    loki-rule: "true" # this label is important for the rule to be sent to Loki
    app: "your-app-name"
    environment: "environment"
spec:
  groups:
    - name: app-name
      rules:
        - alert: KnownErrorInApp
          expr: 'sum by (error) (count_over_time({cluster_name="your-cluster", namespace="your-namespace", service_name="your-application", pod_container_name="your-application"} | json | level=`error` | pattern `<_> <well-defined-error> pattern-text <_>` [10m])) > 1'
          labels:
            severity: warning
            namespace: "your-app-namespace" # this label is important for sending alerts to correct Slack channels
            ruler: loki
          annotations:
            summary: KnownError appearing in logs
- app-template-jsonnet

k8s_lokirule:: {
  enabled: true,
  rules: [
    {
      alert: 'KnownErrorInApp',
      annotations: {
        summary: 'Specific errors appearing in logs',
      },
      expr: 'sum by (error) (count_over_time({cluster_name="your-cluster", namespace="your-namespace", service_name="your-application", pod_container_name="your-application"} | json | level=`error` | pattern `<_> <well-defined-error> pattern-text <_>` [10m])) > 1',
      'for': '10m',
      labels: {
        severity: 'warning',
        ruler: 'loki',
      },
    },
  ],
},