Running incident response and postmortems that don't assign blame

The first postmortem I ran, I ran badly: I asked who pushed the change. The room went quiet, the engineer got defensive, and we spent an hour establishing innocence instead of understanding the system. We learned nothing useful, and the same class of failure hit us again four months later. Blame is not just unkind. It is operationally useless.

Separate the incident from the postmortem

During the incident there is one job: stop the bleeding. We name a single incident commander whose authority is total for the duration, and everyone else either does what they ask or gets out of the way. We do not analyze root cause while the site is down. We do not assign fault in the chat. We restore service, then we go home, then we meet to learn.

One incident commander with clear authority, rotated so it is not always the same person
A scribe capturing a timeline in real time, because memory rewrites itself within hours
No root-cause debate during the fire, only during the review
A hard rule that the person closest to the change is treated as the best source, never the suspect
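
For teams that want the scribe's timeline to be more than a chat scroll, here is a minimal sketch of what it could look like as data. Everything in it is an assumption for illustration: the names IncidentTimeline, TimelineEntry, record and render are hypothetical, not a tool we are describing. The point it shows is that each entry records when, who, and what, and nothing about fault.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class TimelineEntry:
    at: datetime  # when it happened or was observed (UTC)
    actor: str    # who acted or observed; a source of context, not a suspect
    note: str     # what happened, in plain words, with no root-cause guesses


@dataclass
class IncidentTimeline:
    commander: str  # the single incident commander for this incident
    entries: list[TimelineEntry] = field(default_factory=list)

    def record(self, actor: str, note: str, at: Optional[datetime] = None) -> TimelineEntry:
        """Append an entry, stamping it with the current UTC time if none is given."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), actor, note)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the timeline in chronological order, one entry per line."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(f"{e.at:%H:%M:%S} {e.actor}: {e.note}" for e in ordered)
```

In use, the scribe calls record() as things happen and pastes render() into the review document afterwards. Stamping entries at write time matters more than the format: it is the one part of the record that memory cannot rewrite later.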

Why blameless is selfish, not soft

Blameless is often sold as kindness. It is, but that is not why we do it. We do it because the engineer who made the change has the most context about what happened, and a frightened person hides context to protect themselves. You cannot fix a system whose operators are incentivized to obscure how it actually behaves. We want the messy truth, so we make telling it safe.

If your postmortem produces a name instead of a changed system, you ran a trial, not a review. The same incident is already scheduled to return.

Ask how the system allowed it

The phrasing matters. We do not ask "why did you deploy that?" We ask "how was it possible to deploy that without a check catching it?" The deploy was a person doing their job inside a system that permitted the outcome. The fix lives in the system: a missing guardrail, a confusing dashboard, an alert that fired too late. People are not bugs you patch.

Action items or it didn't happen

A postmortem with no owned, dated action items is therapy. We close every review with a small number of concrete changes, each with a name and a date, and we track them like any other work. And we publish the writeup widely, because a postmortem read only by the team that lived it teaches no one else, and the next team inherits the same trap we just walked out of.
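
The "name and a date" rule is simple enough to check mechanically. This is a rough sketch, assuming action items are kept as plain records. The ActionItem type and the helper names are hypothetical, and most teams would enforce the same thing through required fields in whatever tracker they already use.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    summary: str
    owner: Optional[str]  # a person, not a team alias
    due: Optional[date]
    done: bool = False


def unowned_or_undated(items: list[ActionItem]) -> list[ActionItem]:
    """Items missing an owner or a due date."""
    return [i for i in items if not i.owner or i.due is None]


def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
    """Open items whose due date has already passed."""
    return [i for i in items if not i.done and i.due is not None and i.due < today]


def review_can_close(items: list[ActionItem]) -> bool:
    """A review closes only with at least one item, each owned and dated."""
    return bool(items) and not unowned_or_undated(items)
```

A check like review_can_close() can gate publishing the writeup, and overdue() can feed the same weekly list as any other late work, which is what tracking them like any other work means in practice.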

João Matos

Other notes from the team.

Keeping on-call humane (and what a healthy rotation looks like)

Hiring senior engineers: what we screen for beyond coding

Growing juniors into engineers we trust (mentoring that scales)
