Cluster observability that engineers actually use (not 400 dashboards)

Every cluster we inherit comes with a Grafana folder full of dashboards nobody opens. Someone installed a bundle, it generated four hundred panels, and during an actual incident the on-call engineer ignores all of them and runs kubectl. Coverage is not the goal. Answering questions fast is.

Start from the questions, not the metrics

We do not start by listing what we can measure. We start by listing the questions the on-call engineer needs answered at 3 a.m.: is the service up, is it slow, is it dropping requests, and is something starved for resources? Then we build the smallest set of signals that answers those. A dashboard that does not answer a question someone asks under pressure is noise.

The four golden signals per service: latency, traffic, errors, saturation
Cluster capacity: are we running out of CPU, memory or pods on the nodes
Pod health: restarts, OOMKills, and pending pods that cannot schedule
Recent change: what deployed in the last hour, because that is usually the cause
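
One way to keep that list honest is to write each question next to the query that answers it. The sketch below does this with PromQL strings held in a Python dict. Every metric and label name in it is an assumption: `http_requests_total` and `http_request_duration_seconds_bucket` are common conventions rather than anything your services necessarily export, and the `kube_pod_*` series assume kube-state-metrics is installed. The "recent change" question is left out because where that data lives depends on your deploy tooling.

```python
from string import Template

# Hypothetical mapping from on-call questions to PromQL queries.
# Metric and label names are assumptions; swap in what you actually export.
ONCALL_QUERIES = {
    "Is it dropping requests?": (
        'sum(rate(http_requests_total{service="$service",code=~"5.."}[5m]))'
        ' / sum(rate(http_requests_total{service="$service"}[5m]))'
    ),
    "Is it slow?": (
        "histogram_quantile(0.99, sum by (le) "
        '(rate(http_request_duration_seconds_bucket{service="$service"}[5m])))'
    ),
    "How much traffic is it taking?": (
        'sum(rate(http_requests_total{service="$service"}[5m]))'
    ),
    "Are pods restarting?": (
        "increase(kube_pod_container_status_restarts_total"
        '{namespace="$namespace"}[1h])'
    ),
    "Are pods stuck pending?": 'sum(kube_pod_status_phase{phase="Pending"})',
}


def render(question: str, **labels: str) -> str:
    """Fill the $placeholders in the query that answers one question."""
    return Template(ONCALL_QUERIES[question]).substitute(**labels)


if __name__ == "__main__":
    # "checkout" is a made-up service name used only for illustration.
    print(render("Is it slow?", service="checkout"))
```

If a panel on a dashboard cannot be traced back to an entry in a table like this, that is a reasonable hint it is a candidate for deletion.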

Alert on symptoms, not causes

The fastest way to make people ignore alerts is to page on causes. High CPU on one node is not worth waking someone. Users seeing errors is. We alert on symptoms that map to user pain, and we keep the page-worthy set small enough that every alert firing means someone should genuinely look. Everything else is a dashboard you consult, not a page that wakes you.

If an alert fires and the right response is to silence it, it was never an alert. It was a dashboard panel with your phone number attached.
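
The symptom versus cause split can be sketched as a routing rule: a signal over its threshold pages only if users feel it, and otherwise lands on a dashboard. This is a minimal illustration, not how any alerting tool is actually configured. The `Signal` type, the `user_facing` flag and the example thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    name: str
    value: float
    threshold: float
    user_facing: bool  # True for symptoms users feel, False for causes


def route(signal: Signal) -> str:
    """Return 'ok', 'page' or 'dashboard' for a single signal."""
    if signal.value <= signal.threshold:
        return "ok"
    # Over threshold: only symptoms are allowed to wake someone.
    return "page" if signal.user_facing else "dashboard"


if __name__ == "__main__":
    # Made-up example values for illustration only.
    signals = [
        Signal("checkout 5xx ratio", value=0.07, threshold=0.02, user_facing=True),
        Signal("node CPU utilisation", value=0.95, threshold=0.90, user_facing=False),
    ]
    for s in signals:
        print(f"{s.name}: {route(s)}")
```

The point of the flag is that it has to be decided per signal, by a person, when the alert is written. A cause such as node CPU still gets recorded; it just never reaches a phone.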

The stack we tend to land on

Prometheus for metrics, a small curated set of Grafana dashboards we maintain by hand rather than auto-generate, and either Loki or the client's existing log platform. We add tracing when there is a real distributed latency mystery to solve, not by reflex. The honest test of an observability setup is whether the newest engineer can find why a service is slow without asking anyone.

Delete the auto-generated dashboard sprawl and keep the handful you actually open
Curate alerts down to the set where every firing deserves a human
Add tracing when a latency problem crosses service boundaries, not before
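
Curating alerts is easier with a rough audit of past firings. The sketch below assumes you can export a history of which alert fired and whether anyone acted on it; the `Firing` record, the alert names and the 0.8 cutoff are all hypothetical, and the cutoff in particular is a judgment call rather than a recommended value.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class Firing(NamedTuple):
    alert: str
    acted_on: bool  # False if it was silenced or acknowledged with no action


def alerts_to_demote(history: Iterable[Firing], min_actionable: float = 0.8) -> list[str]:
    """List alerts whose share of acted-on firings is below min_actionable."""
    totals: dict[str, int] = defaultdict(int)
    acted: dict[str, int] = defaultdict(int)
    for firing in history:
        totals[firing.alert] += 1
        if firing.acted_on:
            acted[firing.alert] += 1
    return sorted(
        name for name, count in totals.items()
        if acted[name] / count < min_actionable
    )


if __name__ == "__main__":
    # Made-up history for illustration only.
    history = [
        Firing("HighNodeCPU", False),
        Firing("HighNodeCPU", False),
        Firing("HighNodeCPU", False),
        Firing("CheckoutErrorRatio", True),
        Firing("CheckoutErrorRatio", True),
    ]
    print(alerts_to_demote(history))
```

Anything the function returns is a candidate to become a dashboard panel instead of a page, subject to someone who knows the service agreeing.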

The measure that matters

Good observability is not about how much you collect. It is about how quickly someone who did not write the code can answer what broke and why. If your setup needs a guided tour, it has too much in it. We would rather ship ten dashboards people trust than four hundred they scroll past.

João Matos
