Observability Stack Overview
This page describes the architecture of Gjensidige's observability platform. The stack provides unified logs, metrics, and traces for all applications on GAP, enabling developers to monitor, debug, and alert on their services through a single Grafana interface.
The platform is built on the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) and the OpenTelemetry Collector, hosted on dedicated tools-management AKS clusters.
Architecture Diagram
Why this architecture
Single pane of glass
All three signal types — logs, traces, and metrics — are accessible through one Grafana instance. Instead of switching between separate tools for debugging (logs), performance analysis (traces), and alerting (metrics), developers get a unified view where signals are correlated and cross-linked. This dramatically reduces the time from "something is wrong" to "here's why".
OpenTelemetry as an open standard
The platform uses OpenTelemetry (OTEL) as its telemetry standard. This means:
- No vendor lock-in — OTEL is an open, CNCF-backed standard supported by all major observability vendors. If we change backends tomorrow, applications don't need to change.
- One instrumentation, many signals — the same OTEL SDK and Collector handle logs, traces, and metrics through a single pipeline (see the sketch after this list).
- Broad ecosystem support — libraries and frameworks increasingly ship with built-in OTEL support, reducing instrumentation effort over time.
- Portable skills — engineers learn one telemetry standard that works across companies and cloud providers.
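To make the single-pipeline point concrete, here is a minimal sketch of bootstrapping the OTEL SDK in a Node.js service. The endpoint URLs and service name are placeholders, not platform configuration; on GAP this wiring is handled for you by auto-instrumentation (see below).

```typescript
// Minimal OTEL SDK bootstrap: one pipeline for traces and metrics.
// The collector URLs below are placeholders. On GAP this configuration
// is injected automatically, so application code never hardcodes it.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  serviceName: 'order-service', // hypothetical service name
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4318/v1/traces', // placeholder endpoint
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://otel-collector:4318/v1/metrics', // placeholder endpoint
    }),
  }),
});

sdk.start(); // from here on, instrumented libraries emit spans and metrics
```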
Central control of telemetry data
All telemetry is routed through a central OpenTelemetry Collector before reaching the storage backends. This gives the platform a single point to filter noise, mask sensitive data, enrich with metadata, and enforce policies — before anything is written to long-term storage. It also means switching from the Grafana stack to another observability backend requires changes in one place, not in every application.
Signal types
The platform collects three types of telemetry signals, each answering a different question about your application:
- Logs — event messages your application writes. Stored in Grafana Loki (180-day retention)
- Traces — a timeline showing how a single request flows through your services. Stored in Grafana Tempo (30-day retention)
- Metrics — numeric measurements sampled over time (request rate, error rate, CPU, latency). Stored in Grafana Mimir (90-day retention)
All three signals are accessed through Grafana at grafana.gjensidige.io, and are correlated — meaning you can jump from a spike in a metric, to the traces that caused it, to the logs from those specific requests.
More about each signal type
Logs
Logs are the event messages your application writes — errors, warnings, request details, and anything you explicitly log. They're what most developers reach for first when something goes wrong.
On GAP, application logs are automatically collected and shipped to Loki. You don't need to configure log shipping — it's handled by the platform. Logs are queryable in Grafana using LogQL, which supports filtering by labels, full-text search, and aggregation.
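Because shipping is automatic, the main thing an application controls is what it writes. Structured JSON log lines, sketched below with hypothetical field names, are the easiest for LogQL to filter and aggregate:

```typescript
// Structured JSON logging: one event per line on stdout.
// The platform picks this up from the container filesystem and
// ships it to Loki, so no log-shipping code is needed in the app.
// Field names here are illustrative, not a required schema.
console.log(JSON.stringify({
  level: 'error',
  msg: 'payment declined',
  orderId: 'ord-4711',            // hypothetical business context
  durationMs: 203,
  timestamp: new Date().toISOString(),
}));
```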
When to use logs: Debugging specific errors, auditing events, understanding what happened at a specific point in time.
Traces
A trace represents the full journey of a single request as it travels through your services. Each step (called a "span") records which service handled it, how long it took, and whether it succeeded or failed.
For example, a trace might show: user request → API gateway (2ms) → order service (15ms) → database query (45ms) → payment service (200ms). If the payment service is slow, you see it immediately in the trace waterfall.
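Most spans on GAP come from auto-instrumentation (described below), but you can add your own spans for business-level steps using the standard OTEL API. A minimal sketch, with hypothetical names:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// 'order-service' is a hypothetical instrumentation scope name.
const tracer = trace.getTracer('order-service');

// Hypothetical downstream call, stubbed for illustration.
async function callPaymentService(orderId: string): Promise<void> { /* ... */ }

async function chargeCustomer(orderId: string): Promise<void> {
  // startActiveSpan makes this span the parent of anything started
  // inside the callback, so nested calls line up in the trace waterfall.
  await tracer.startActiveSpan('charge-customer', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      await callPaymentService(orderId);
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // duration (start to end) is what you see in Tempo
    }
  });
}
```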
When to use traces: Debugging latency issues, understanding service dependencies, finding bottlenecks in distributed request flows.
Metrics
Metrics are numeric measurements sampled at regular intervals — request rate, error percentage, CPU usage, memory consumption, response time percentiles. Unlike logs (which record individual events), metrics give you the aggregated picture: trends, patterns, and anomalies over time.
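As a concrete sketch, recording a request counter and a latency histogram with the standard OTEL metrics API looks roughly like this (meter and instrument names are hypothetical):

```typescript
import { metrics } from '@opentelemetry/api';

// 'order-service' is a hypothetical meter name.
const meter = metrics.getMeter('order-service');

// A counter aggregates into a request rate over time.
const requests = meter.createCounter('http_requests_total', {
  description: 'Total HTTP requests handled',
});

// A histogram captures the latency distribution (percentiles).
const latency = meter.createHistogram('http_request_duration_ms', {
  unit: 'ms',
  description: 'HTTP request duration',
});

// Per request: record one count and one duration sample.
requests.add(1, { route: '/orders', status: '200' });
latency.record(42, { route: '/orders' });
```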
Metrics are the foundation for dashboards and alerts. You typically set up alerts on metrics ("alert me if error rate exceeds 5%") and then use traces and logs to investigate when those alerts fire.
When to use metrics: Dashboards, alerting, capacity planning, SLO tracking, detecting trends.
How data gets in
Application clusters
Application clusters send telemetry directly to the tools cluster over HTTPS. All traffic enters through Traefik, which routes requests to the appropriate backend gateways:
- Metrics — Prometheus scrapes cluster metrics and remote-writes to Mimir
- Traces — the OpenTelemetry Collector exports traces to Tempo via OTLP
- Logs — the OpenTelemetry Collector exports logs to Loki via OTLP
Browser telemetry (Faro)
Frontend applications include the Grafana Faro SDK, which sends browser telemetry (errors, performance data, traces) to the internet-facing Faro endpoint. Traffic passes through Azure Application Gateway (WAF protection and TLS termination) before reaching Grafana Alloy inside the tools cluster.
Alloy processes browser payloads and forwards the resulting logs and traces to the OpenTelemetry Collector for storage in Loki and Tempo.
For Gjensidige frontends, @gjensidige/service-grafana-faro provides a wrapper around the Faro SDK with sensible defaults — automatic collector URL configuration, user context (session, partref), React integration, and web tracing out of the box.
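The wrapper does this setup for you; for reference, a hand-rolled equivalent with the plain Faro SDK looks roughly like the sketch below (the collector URL and app metadata are placeholders):

```typescript
import { initializeFaro, getWebInstrumentations } from '@grafana/faro-web-sdk';
import { TracingInstrumentation } from '@grafana/faro-web-tracing';

// The collector URL is a placeholder. @gjensidige/service-grafana-faro
// configures the real endpoint and user context automatically.
initializeFaro({
  url: 'https://faro.example.com/collect', // placeholder endpoint
  app: {
    name: 'my-frontend', // hypothetical app name
    version: '1.0.0',
  },
  instrumentations: [
    ...getWebInstrumentations(),  // errors, web vitals, console, session
    new TracingInstrumentation(), // browser traces via OTEL
  ],
});
```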
Azure PaaS and SaaS integrations
Azure platform services (databases, App Services, networking) stream diagnostic logs and metrics to Azure EventHub through native diagnostic settings. External SaaS integrations (such as GitHub Enterprise audit logs) also stream to EventHub.
The OpenTelemetry Collector runs as a StatefulSet that connects to EventHub and continuously pulls events for processing. This design ensures reliable delivery with checkpointing — if the collector restarts, it resumes from where it left off.
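The receiver itself is collector configuration rather than application code, but the checkpointing pattern is easy to illustrate. The sketch below uses the Azure Event Hubs SDK directly; all names and connection strings are placeholders, and this is not the platform's actual implementation:

```typescript
import { EventHubConsumerClient, earliestEventPosition } from '@azure/event-hubs';
import { ContainerClient } from '@azure/storage-blob';
import { BlobCheckpointStore } from '@azure/eventhubs-checkpointstore-blob';

// All connection details below are placeholders.
const checkpointStore = new BlobCheckpointStore(
  new ContainerClient('<storage-connection-string>', 'checkpoints'),
);

const consumer = new EventHubConsumerClient(
  '$Default',                       // consumer group
  '<eventhub-connection-string>',
  'platform-logs',                  // hypothetical hub name
  checkpointStore,
);

consumer.subscribe({
  async processEvents(events, context) {
    for (const event of events) {
      // ...forward to the telemetry pipeline...
    }
    if (events.length > 0) {
      // Persist progress: after a restart, consumption resumes here
      // instead of replaying or losing events.
      await context.updateCheckpoint(events[events.length - 1]);
    }
  },
  async processError(err, context) {
    console.error('eventhub error', err);
  },
}, { startPosition: earliestEventPosition });
```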
Virtual machines and external sources
Virtual machines and other external systems send telemetry using the OpenTelemetry protocol (OTLP) over HTTPS. These connections are authenticated with OIDC and enter the tools cluster through a dedicated external OTLP endpoint exposed via Traefik. See Installing the OpenTelemetry Collector on your VM for setup instructions.
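For a sense of what the authentication amounts to: an OTLP exporter pointed at the external endpoint attaches a bearer token to each request. A hedged sketch, with a placeholder endpoint and token handling (the linked page covers the supported collector-based setup, including token refresh):

```typescript
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

// Placeholder endpoint and token. The supported path is the VM-local
// collector described in the linked page, which handles OIDC for you,
// including refreshing tokens before they expire.
const token = process.env.OIDC_ACCESS_TOKEN ?? '<token>';

const exporter = new OTLPTraceExporter({
  url: 'https://otlp.example.com/v1/traces', // placeholder external endpoint
  headers: { Authorization: `Bearer ${token}` },
});
```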
Ingress and routing
All traffic into the tools cluster — whether from the internet, application clusters, or VMs — flows through a single ingress layer:
- Internet traffic — passes through Azure Application Gateway (WAF_v2) for DDoS protection and TLS termination, then reaches the Traefik gateway inside the cluster
- Corporate network traffic — application clusters and VMs connect directly to Traefik using internal DNS endpoints
- EventHub streaming — the exception to the pattern. The OTEL Collector initiates outbound connections to EventHub (pull-based), so no inbound routing is needed
OpenTelemetry Collector
The OpenTelemetry Collector is the central data pipeline. It receives telemetry from all sources, enriches it with Kubernetes metadata, applies filtering and batching, and exports to the appropriate backend.
The collector runs in multiple deployment modes to handle different workloads:
- Log collection — runs on every node to collect container logs from the filesystem
- EventHub ingestion — connects to Azure EventHub to pull platform logs and metrics from Azure PaaS services
- Cluster monitoring — collects Kubernetes events and runs HTTP health checks
- OTLP receivers — dedicated instances that accept telemetry pushed by applications (internal) and VMs (external with OIDC authentication)
Zero-code auto-instrumentation
Applications deployed with Gappynator automatically get distributed tracing and metrics — no code changes or SDK integration required. Gappynator knows the application's runtime (Java, .NET, Python, Node.js, or Go) and configures the OpenTelemetry Operator to inject instrumentation at startup. This is enabled by default for all applications on GAP.
Cross-signal correlation
Grafana is configured with datasource correlations that enable seamless navigation between signals:
- From a log line → jump to the associated trace in Tempo (via trace ID)
- From a trace span → see the related logs in Loki
- From a metric → follow exemplar links to the trace that produced it
- From a trace → view request rate and error rate metrics (service graph)
This means developers can start from any signal and follow the thread across logs, traces, and metrics without context-switching.
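The log-to-trace jump works because the trace ID is stamped onto the log line. Auto-instrumented applications get this for free; done by hand it looks roughly like this (field names are illustrative):

```typescript
import { trace } from '@opentelemetry/api';

// Stamp the active trace context onto a structured log line so
// Grafana can link this log to its trace in Tempo.
function logWithTrace(level: string, msg: string): void {
  const ctx = trace.getActiveSpan()?.spanContext();
  console.log(JSON.stringify({
    level,
    msg,
    trace_id: ctx?.traceId, // the key Grafana's correlation uses
    span_id: ctx?.spanId,
  }));
}

logWithTrace('error', 'payment declined');
```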
Alerting
The platform uses Alertmanager to handle alert notifications. Both Mimir and Loki continuously evaluate alerting rules against incoming metrics and logs. When a rule fires, the alert is sent to Alertmanager, which deduplicates, groups, and routes notifications to the appropriate channel:
- Slack — alert channels per team and environment
- OpsGenie — on-call escalation for critical infrastructure alerts
Grafana displays active and historical alerts in its UI by querying Alertmanager directly, giving developers a single place to view both observability data and alert state.