Observability Stack Overview

This page describes the architecture of Gjensidige's observability platform. The stack provides unified logs, metrics, and traces for all applications on GAP, enabling developers to monitor, debug, and alert on their services through a single Grafana interface.

The platform is built on the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) and the OpenTelemetry Collector, hosted on dedicated tools-management AKS clusters.

Architecture Diagram

```mermaid
flowchart LR
    %% Data Sources
    subgraph sources["Data Sources"]
        direction TB
        browsers["Frontend Apps (Faro SDK)"]
        app_clusters["App Clusters (Prometheus + OTEL Collector)"]
        vms["Virtual Machines (OTEL Collector)"]
        saas["SaaS Integrations (e.g. GitHub audit logs)"]
        azure_paas["Azure PaaS (Diagnostic Settings)"]
    end

    %% External Services
    agw["Application Gateway (WAF_v2)"]
    eventhub["Azure EventHub"]
    blob["Azure Blob Storage"]
    user["👤 User"]
    notifications["Slack / OpsGenie"]

    %% Tools Cluster
    subgraph tools["GAP Tools Kubernetes Cluster"]
        subgraph ingress["Ingress"]
            traefik["Traefik (Gateway API)"]
            alloy["Grafana Alloy (Faro Receiver)"]
        end
        subgraph collectors["Collectors"]
            otel["OTEL Collector"]
        end
        subgraph backends["Storage Backends"]
            loki["Grafana Loki"]
            tempo["Grafana Tempo"]
            mimir["Grafana Mimir"]
            %% Force vertical stacking
            loki ~~~ tempo
            tempo ~~~ mimir
        end
        subgraph alerting["Alerting and UI"]
            grafana["Grafana (grafana.gjensidige.io)"]
            alertmanager["Alertmanager"]
        end
    end

    postgres[("PostgreSQL")]

    %% Ingest: Sources to cluster
    browsers --> agw
    user --> agw
    agw --> traefik
    app_clusters --> traefik
    vms --> traefik
    saas --> eventhub
    azure_paas --> eventhub

    %% Ingress routing
    traefik --> alloy
    traefik --> otel
    traefik --> grafana

    %% Collector pulls from EventHub
    otel -.->|pulls| eventhub

    %% Processing pipeline
    alloy --> otel
    otel --> loki
    otel --> tempo
    otel --> mimir

    %% Long-term storage
    loki --> blob
    tempo --> blob
    mimir --> blob

    %% Grafana queries
    grafana -.-> loki
    grafana -.-> tempo
    grafana -.-> mimir
    grafana -.-> alertmanager
    grafana --> postgres

    %% Alerting
    mimir --> alertmanager
    loki --> alertmanager
    alertmanager --> notifications
```

Why this architecture

Single pane of glass

All three signal types — logs, traces, and metrics — are accessible through one Grafana instance. Instead of switching between separate tools for debugging (logs), performance analysis (traces), and alerting (metrics), developers get a unified view where signals are correlated and cross-linked. This dramatically reduces the time from "something is wrong" to "here's why".

OpenTelemetry as an open standard

The platform uses OpenTelemetry (OTEL) as its telemetry standard. This means:

  • No vendor lock-in — OTEL is an open, CNCF-backed standard supported by all major observability vendors. If we change backends tomorrow, applications don't need to change.
  • One instrumentation, many signals — the same OTEL SDK and Collector handle logs, traces, and metrics through a single pipeline.
  • Broad ecosystem support — libraries and frameworks increasingly ship with built-in OTEL support, reducing instrumentation effort over time.
  • Portable skills — engineers learn one telemetry standard that works across companies and cloud providers.

Central control of telemetry data

All telemetry is routed through a central OpenTelemetry Collector before reaching the storage backends. This gives the platform a single point to filter noise, mask sensitive data, enrich with metadata, and enforce policies — before anything is written to long-term storage. It also means switching from Grafana to another observability tool requires changes in one place, not in every application.
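As a sketch of what this central control point can do, the OpenTelemetry Collector supports processors for exactly these tasks. The processor names below (`filter`, `attributes`, `k8sattributes`, `batch`) are standard collector components, but the specific rules and attribute keys are illustrative, not the platform's actual configuration:

```yaml
processors:
  # Drop noisy health-check log records before they reach storage (OTTL match)
  filter/healthchecks:
    logs:
      log_record:
        - 'IsMatch(body, ".*GET /healthz.*")'
  # Remove a sensitive attribute before export (key name is illustrative)
  attributes/mask:
    actions:
      - key: user.email
        action: delete
  # Enrich telemetry with Kubernetes metadata (pod, namespace, labels)
  k8sattributes: {}
  # Batch records for efficient writes to the backends
  batch: {}
```

Because these rules live in one collector configuration, tightening a masking policy or dropping a new source of noise is a single change, applied to every application at once.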

Signal types

The platform collects three types of telemetry signals, each answering a different question about your application:

  • Logs — event messages your application writes. Stored in Grafana Loki (180-day retention)
  • Traces — a timeline showing how a single request flows through your services. Stored in Grafana Tempo (30-day retention)
  • Metrics — numeric measurements sampled over time (request rate, error rate, CPU, latency). Stored in Grafana Mimir (90-day retention)

All three signals are accessed through Grafana at grafana.gjensidige.io, and are correlated — meaning you can jump from a spike in a metric, to the traces that caused it, to the logs from those specific requests.

More about each signal type

Logs

Logs are the event messages your application writes — errors, warnings, request details, and anything you explicitly log. They're what most developers reach for first when something goes wrong.

On GAP, application logs are automatically collected and shipped to Loki. You don't need to configure log shipping — it's handled by the platform. Logs are queryable in Grafana using LogQL, which supports filtering by labels, full-text search, and aggregation.
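To make that concrete, here are a few LogQL queries of the kind you might run in Grafana's Explore view. The label names (`app`, `namespace`) and log content are illustrative; your actual labels depend on how the platform tags your workloads:

```logql
# Label filter + full-text search: error lines from one app
{app="order-service", namespace="orders"} |= "error"

# Parse JSON-formatted logs and filter on a parsed field
{app="order-service"} | json | level = "error"

# Aggregation: count error lines per minute
sum(count_over_time({app="order-service"} |= "error" [1m]))
```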

When to use logs: Debugging specific errors, auditing events, understanding what happened at a specific point in time.

Traces

A trace represents the full journey of a single request as it travels through your services. Each step (called a "span") records which service handled it, how long it took, and whether it succeeded or failed.

For example, a trace might show: user request → API gateway (2ms) → order service (15ms) → database query (45ms) → payment service (200ms). If the payment service is slow, you see it immediately in the trace waterfall.
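Tempo also supports searching across traces with TraceQL. A query for the slow-payment scenario above might look like this (the service name and threshold are illustrative):

```traceql
# Spans handled by the payment service that took longer than 150ms
{ resource.service.name = "payment-service" && duration > 150ms }
```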

When to use traces: Debugging latency issues, understanding service dependencies, finding bottlenecks in distributed request flows.

Metrics

Metrics are numeric measurements sampled at regular intervals — request rate, error percentage, CPU usage, memory consumption, response time percentiles. Unlike logs (which record individual events), metrics give you the aggregated picture: trends, patterns, and anomalies over time.

Metrics are the foundation for dashboards and alerts. You typically set up alerts on metrics ("alert me if error rate exceeds 5%") and then use traces and logs to investigate when those alerts fire.
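The "error rate exceeds 5%" example could be expressed as a Prometheus-style alerting rule evaluated by Mimir. The metric and label names below are illustrative; substitute the metrics your application actually exposes:

```yaml
groups:
  - name: order-service-alerts
    rules:
      - alert: HighErrorRate
        # Ratio of 5xx responses to all responses over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{app="order-service", status=~"5.."}[5m]))
            / sum(rate(http_requests_total{app="order-service"}[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "order-service error rate above 5% for 5 minutes"
```

The `for: 5m` clause keeps the alert pending until the condition has held continuously, which avoids paging on brief spikes.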

When to use metrics: Dashboards, alerting, capacity planning, SLO tracking, detecting trends.

How data gets in

Application clusters

Application clusters send telemetry directly to the tools cluster over HTTPS. All traffic enters through Traefik, which routes requests to the appropriate backend gateways:

  • Metrics — Prometheus scrapes cluster metrics and remote-writes to Mimir
  • Traces — the OpenTelemetry Collector exports traces to Tempo via OTLP
  • Logs — the OpenTelemetry Collector exports logs to Loki via OTLP
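For the metrics path, the shape of the Prometheus configuration is roughly the following. The endpoint URL and tenant header are placeholders; the real values are provided by the platform:

```yaml
# prometheus.yml fragment (endpoint and tenant are illustrative)
remote_write:
  - url: https://mimir.example.internal/api/v1/push
    headers:
      X-Scope-OrgID: my-team   # Mimir tenant header, if multi-tenancy is enabled
```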

Browser telemetry (Faro)

Frontend applications include the Grafana Faro SDK, which sends browser telemetry (errors, performance data, traces) to the internet-facing Faro endpoint. Traffic passes through Azure Application Gateway (WAF protection and TLS termination) before reaching Grafana Alloy inside the tools cluster.

Alloy processes browser payloads and forwards the resulting logs and traces to the OpenTelemetry Collector for storage in Loki and Tempo.

For Gjensidige frontends, @gjensidige/service-grafana-faro provides a wrapper around the Faro SDK with sensible defaults — automatic collector URL configuration, user context (session, partref), React integration, and web tracing out of the box.
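The wrapper handles initialization for you, but under the hood raw Faro SDK setup looks roughly like this. The collector URL and app metadata are illustrative, and this sketch omits the user-context and React pieces the wrapper adds:

```typescript
import { getWebInstrumentations, initializeFaro } from '@grafana/faro-web-sdk';
import { TracingInstrumentation } from '@grafana/faro-web-tracing';

// Endpoint and app metadata are placeholders for illustration
initializeFaro({
  url: 'https://faro.example.gjensidige.io/collect',
  app: { name: 'my-frontend', version: '1.0.0' },
  instrumentations: [
    ...getWebInstrumentations(),      // errors, web vitals, console, session
    new TracingInstrumentation(),     // browser-side distributed tracing
  ],
});
```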

Azure PaaS and SaaS integrations

Azure platform services (databases, App Services, networking) stream diagnostic logs and metrics to Azure EventHub through native diagnostic settings. External SaaS integrations (such as GitHub Enterprise audit logs) also stream to EventHub.

The OpenTelemetry Collector runs as a StatefulSet that connects to EventHub and continuously pulls events for processing. This design ensures reliable delivery with checkpointing — if the collector restarts, it resumes from where it left off.
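In collector terms, this pull is handled by a receiver rather than an inbound endpoint. A minimal sketch, assuming the contrib `azureeventhub` receiver (the connection string comes from the environment; the `azure` format tells the receiver to parse the Azure diagnostic-log envelope):

```yaml
receivers:
  azureeventhub:
    connection: ${env:EVENTHUB_CONNECTION_STRING}
    format: azure   # parse the Azure resource-log envelope
```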

Virtual machines and external sources

Virtual machines and other external systems send telemetry using the OpenTelemetry protocol (OTLP) over HTTPS. These connections are authenticated with OIDC and enter the tools cluster through a dedicated external OTLP endpoint exposed via Traefik. See Installing the OpenTelemetry Collector on your VM for setup instructions.
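On the VM side, the collector configuration pairs an OTLP/HTTP exporter with an OAuth2 client credentials extension. A minimal sketch, assuming the standard `oauth2client` extension and `otlphttp` exporter; the endpoint, token URL, and credential variables are illustrative:

```yaml
extensions:
  oauth2client:
    client_id: ${env:OTEL_CLIENT_ID}
    client_secret: ${env:OTEL_CLIENT_SECRET}
    token_url: https://login.example.com/oauth2/token   # identity provider, illustrative

receivers:
  hostmetrics:
    scrapers:
      cpu: {}
      memory: {}

exporters:
  otlphttp:
    endpoint: https://otlp.example.gjensidige.io   # external OTLP endpoint, illustrative
    auth:
      authenticator: oauth2client

service:
  extensions: [oauth2client]
  pipelines:
    metrics:
      receivers: [hostmetrics]
      exporters: [otlphttp]
```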

Ingress and routing

All traffic into the tools cluster — whether from the internet, application clusters, or VMs — flows through a single ingress layer:

  • Internet traffic — passes through Azure Application Gateway (WAF_v2) for DDoS protection and TLS termination, then reaches the Traefik gateway inside the cluster
  • Corporate network traffic — application clusters and VMs connect directly to Traefik using internal DNS endpoints
  • EventHub streaming — the exception to the pattern. The OTEL Collector initiates outbound connections to EventHub (pull-based), so no inbound routing is needed

OpenTelemetry Collector

The OpenTelemetry Collector is the central data pipeline. It receives telemetry from all sources, enriches it with Kubernetes metadata, applies filtering and batching, and exports to the appropriate backend.

The collector runs in multiple deployment modes to handle different workloads:

  • Log collection — runs on every node to collect container logs from the filesystem
  • EventHub ingestion — connects to Azure EventHub to pull platform logs and metrics from Azure PaaS services
  • Cluster monitoring — collects Kubernetes events and runs HTTP health checks
  • OTLP receivers — dedicated instances that accept telemetry pushed by applications (internal) and VMs (external with OIDC authentication)
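Regardless of deployment mode, each collector wires receivers, processors, and exporters into per-signal pipelines in its `service` section. A simplified sketch of that shape (component names are illustrative, not the platform's exact configuration):

```yaml
service:
  pipelines:
    logs:
      receivers: [otlp, filelog]                     # pushed + scraped from node filesystem
      processors: [k8sattributes, filter, batch]
      exporters: [otlphttp/loki]
    traces:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [k8sattributes, batch]
      exporters: [prometheusremotewrite]             # remote-write into Mimir
```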

Zero-code auto-instrumentation

Applications deployed with Gappynator automatically get distributed tracing and metrics — no code changes or SDK integration required. Gappynator knows the application's runtime (Java, .NET, Python, Node.js, or Go) and configures the OpenTelemetry Operator to inject instrumentation at startup. This is enabled by default for all applications on GAP.
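Under the hood, the OpenTelemetry Operator injects the instrumentation agent into pods based on an annotation on the pod template, for example (Java shown; other runtimes use `inject-dotnet`, `inject-python`, `inject-nodejs`, or `inject-go`):

```yaml
# Pod template annotation the OpenTelemetry Operator reacts to
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-java: "true"
```

Gappynator sets this up for you, so the annotation is shown only to explain the mechanism.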

Cross-signal correlation

Grafana is configured with datasource correlations that enable seamless navigation between signals:

  • From a log line → jump to the associated trace in Tempo (via trace ID)
  • From a trace span → see the related logs in Loki
  • From a metric → follow exemplar links to the trace that produced it
  • From a trace → view request rate and error rate metrics (service graph)

This means developers can start from any signal and follow the thread across logs, traces, and metrics without context-switching.
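The log-to-trace link, for instance, is configured as a derived field on the Loki datasource. A minimal provisioning sketch, assuming JSON logs with a `trace_id` field (the URL, regex, and datasource UID are illustrative):

```yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway   # illustrative internal address
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"trace_id":"(\w+)"'   # extract the trace ID from the log line
          url: '$${__value.raw}'               # internal link uses the raw matched value
          datasourceUid: tempo                 # uid of the Tempo datasource
```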

Alerting

The platform uses Alertmanager to handle alert notifications. Both Mimir and Loki continuously evaluate alerting rules against incoming metrics and logs. When a rule fires, the alert is sent to Alertmanager, which deduplicates, groups, and routes notifications to the appropriate channel:

  • Slack — alert channels per team and environment
  • OpsGenie — on-call escalation for critical infrastructure alerts
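The routing described above is expressed in Alertmanager's configuration as a tree of routes and receivers. A simplified sketch (channel names, matchers, and priorities are illustrative):

```yaml
route:
  receiver: slack-default
  group_by: [alertname, team]          # deduplicate and group related alerts
  routes:
    - matchers: ['severity="critical"']
      receiver: opsgenie-oncall        # escalate critical alerts to on-call

receivers:
  - name: slack-default
    slack_configs:
      - channel: '#team-alerts'        # illustrative team channel
  - name: opsgenie-oncall
    opsgenie_configs:
      - priority: P1
```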

Grafana displays active and historical alerts in its UI by querying Alertmanager directly, giving developers a single place to view both observability data and alert state.