Every cluster we inherit comes with a Grafana folder full of dashboards nobody opens. Someone installed a bundle, it generated four hundred panels, and during an actual incident the on-call engineer ignores all of them and runs kubectl. Coverage is not the goal. Answering questions fast is.
Start from the questions, not the metrics
We do not start by listing what we can measure. We start by listing the questions whoever is on call needs answered at 3 a.m.: is the service up, is it slow, is it dropping requests, and is something starved for resources. Then we build the smallest set of signals that answers those questions. A dashboard that does not answer a question someone asks under pressure is noise.
- The four golden signals per service: latency, traffic, errors, saturation
- Cluster capacity: are the nodes running out of CPU, memory, or room for more pods
- Pod health: restarts, OOMKills, and pending pods that cannot schedule
- Recent change: what deployed in the last hour, because that is usually the cause
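To make the first bullet concrete, here is a minimal Python sketch of the four golden signals computed for one service over one time window. It is an illustration, not a replacement for Prometheus queries. The `Request` record, the `golden_signals` function, the choice of p99 for latency, counting only 5xx responses as errors, and using CPU against its limit as the saturation measure are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record shape for this sketch)."""
    duration_s: float  # time taken to serve the request, in seconds
    status: int        # HTTP status code returned


def golden_signals(requests, window_s, cpu_used_cores, cpu_limit_cores):
    """Summarise latency, traffic, errors and saturation for one window."""
    if window_s <= 0 or cpu_limit_cores <= 0:
        raise ValueError("window_s and cpu_limit_cores must be positive")

    total = len(requests)
    # Assumption: only server-side failures (5xx) count as errors.
    errors = sum(1 for r in requests if r.status >= 500)

    durations = sorted(r.duration_s for r in requests)
    if durations:
        # Nearest-rank 99th percentile, using integer maths for ceil(0.99 * n).
        rank = (99 * total + 99) // 100
        p99 = durations[rank - 1]
    else:
        p99 = None  # no traffic in the window, so latency is undefined

    return {
        "latency_p99_s": p99,
        "traffic_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "saturation": cpu_used_cores / cpu_limit_cores,
    }
```

Four numbers per service is the point: each one answers one of the 3 a.m. questions directly, and anything that does not feed one of them is a candidate for deletion.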
Alert on symptoms, not causes
The fastest way to make people ignore alerts is to page on causes. High CPU on one node is not worth waking someone for. Users seeing errors is. We alert on symptoms that map to user pain, and we keep the page-worthy set small enough that every alert that fires means someone should genuinely look. Everything else is a dashboard you consult, not a page that wakes you.
If an alert fires and the right response is to silence it, it was never an alert. It was a dashboard panel with your phone number attached.
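The symptom-versus-cause split can be sketched as a small routing rule. This is a simplified illustration in Python, not how any particular alerting tool is configured. The `Signal` type, the `user_facing` flag, the `route` function, and the example thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    """One reading of a monitored signal (hypothetical shape for this sketch)."""
    name: str
    value: float
    threshold: float
    user_facing: bool  # True for symptoms users feel, False for internal causes


def route(signal):
    """Return 'ok', 'page' or 'dashboard' for a single signal reading."""
    if signal.value <= signal.threshold:
        return "ok"
    # Only breaches that users can feel are allowed to wake a human.
    return "page" if signal.user_facing else "dashboard"


# Example readings; the names and numbers are made up for illustration.
error_ratio = Signal("checkout_error_ratio", value=0.05, threshold=0.01, user_facing=True)
node_cpu = Signal("node_cpu_utilisation", value=0.95, threshold=0.90, user_facing=False)
```

With these readings, `route(error_ratio)` returns `"page"` and `route(node_cpu)` returns `"dashboard"`: both breached their thresholds, but only one maps to user pain.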
The stack we tend to land on
Prometheus for metrics, a small curated set of Grafana dashboards we maintain by hand rather than auto-generate, and either Loki or the client's existing log platform. We add tracing when there is a real distributed latency mystery to solve, not by reflex. The honest test of an observability setup is whether the newest engineer can find why a service is slow without asking anyone.
- Delete the auto-generated dashboard sprawl and keep the handful you actually open
- Curate alerts down to the set where every firing deserves a human
- Add tracing when a latency problem crosses service boundaries, not before
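One way to approach the alert curation step is to audit the firing history and flag alerts that are mostly silenced rather than acted on. The Python sketch below assumes you can export that history as `(alert_name, outcome)` pairs with an outcome of `"acted"` or `"silenced"`. That export format, the `alerts_to_demote` function, and both default thresholds are assumptions for illustration, not features of Prometheus or Grafana.

```python
from collections import Counter


def alerts_to_demote(firings, max_silenced_ratio=0.5, min_firings=3):
    """Return alert names that were silenced more often than the ratio allows.

    firings: iterable of (alert_name, outcome) pairs, where outcome is
    'acted' or 'silenced' (a hypothetical export format).
    """
    total = Counter()
    silenced = Counter()
    for name, outcome in firings:
        total[name] += 1
        if outcome == "silenced":
            silenced[name] += 1

    # Ignore alerts with too little history to judge fairly.
    return sorted(
        name
        for name, count in total.items()
        if count >= min_firings and silenced[name] / count > max_silenced_ratio
    )
```

Anything this returns is a candidate to move from paging to a dashboard panel. The final decision still belongs to the people who carry the pager.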
The measure that matters
Good observability is not about how much you collect. It is about how quickly someone who did not write the code can answer what broke and why. If your setup needs a guided tour, it has too much in it. We would rather ship ten dashboards people trust than four hundred they scroll past.