Good observability on operational metrics, as much automation as possible, clear processes and playbooks for escalations.
Most engineering orgs I’ve seen require Eng oncalls. The perspective above is a simple non expensive way to minimize time investigations eg 5% movement of a metric. Teams spend lots of time asking “is this a real world effect, if so, should we be worried?”
Good observability on operational metrics, as much automation as possible, clear processes and playbooks for escalations.
Most engineering orgs I’ve seen require Eng oncalls. The perspective above is a simple non expensive way to minimize time investigations eg 5% movement of a metric. Teams spend lots of time asking “is this a real world effect, if so, should we be worried?”