It was 2 AM when my phone started buzzing. Half-asleep, I grabbed it and saw an alert:
High error rate detected. Immediate action required.
I rushed to my laptop and opened the dashboard. Red. Everything was red. Our system, which had been running smoothly for months, was suddenly failing.
I scanned the logs. They made no sense… just a mess of errors and unhelpful messages. Users were flooding support with complaints.
After hours of debugging, we found the problem… One bug had taken down an entire service. We fixed it, deployed it, and everything was up and running again.
That night, I realised that we didn’t have good enough observability of our system.
If you ask most engineers if they have good monitoring, they’ll say yes.
They have dashboards. They have alerts. They have logs.
But then something breaks, and suddenly, they’re playing detective.
Digging through logs.
Restarting services.
Guessing.
Most teams don’t see problems until they explode. They don’t catch small failures before they snowball into disasters.
And that’s the real problem. Software is messy. Small failures happen all the time. But if you can’t see them, you can’t stop them.
If you enjoy posts like this, consider supporting my work and subscribing to this newsletter.
As a free subscriber, you get:
✉️ 1 post per week
🧑‍🎓 Access to the Engineering Manager Masterclass
As a paid subscriber, you get:
🔒 50 Engineering Manager templates and playbooks (worth $79)
🔒 A weekly "What would you do?" scenario & breakdown from real challenges EMs face
🔒 The complete archive
Observability-Driven Development
I learned that monitoring alone isn’t enough. You need something better.
Observability-Driven Development (ODD) means building your system in a way that tells you exactly what’s happening at all times.
Here’s what it looks like:
1. Logs that tell a story
Most logs dump information without context, making it impossible to piece together what happened.
Good logs act like a time machine. They show every step a request takes, from start to finish. With clear logs, you can replay an issue.
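Here’s a rough idea of what I mean, as a small Python sketch (the service and field names are made up): every log line carries a request ID, so one search pulls up the whole journey of a single request.

```python
import logging
import uuid

# Hypothetical example: attach a request_id to every log line so the
# full journey of one request can be reconstructed later.
logger = logging.getLogger("checkout")
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s request_id=%(request_id)s %(message)s",
)

def handle_checkout(order):
    # One ID per request; every log line for this request carries it.
    ctx = {"request_id": str(uuid.uuid4())}
    logger.info("checkout started, order=%s", order["id"], extra=ctx)
    logger.info("payment authorised, amount=%s", order["amount"], extra=ctx)
    logger.info("checkout finished", extra=ctx)

handle_checkout({"id": "A-1001", "amount": 42.50})
```

With that in place, grepping for one request_id replays that request step by step instead of leaving you to guess which lines belong together.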
2. Traces that follow every request
Errors don’t just appear out of nowhere. They start somewhere: maybe in a database query, maybe in an API call.
Without tracing, finding the root cause is a guessing game.
With tracing, you can follow a request from start to finish across services, databases, and APIs.
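If you’re curious what that looks like in practice, here’s a minimal sketch using OpenTelemetry’s Python SDK (the service and span names are placeholders, and a real setup would export to a tracing backend instead of the console). Each unit of work becomes a span, and child spans nest under the request’s parent span:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Minimal setup: print spans to the console instead of a real backend.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

def get_user(user_id):
    # Child span: shows up nested under the parent in the trace.
    with tracer.start_as_current_span("db.get_user") as span:
        span.set_attribute("user.id", user_id)
        return {"id": user_id}

def handle_request(user_id):
    # Parent span: one per incoming request.
    with tracer.start_as_current_span("http.handle_request"):
        return get_user(user_id)

handle_request("42")
```

The trace then shows which span was slow or failed, so you start at the broken step instead of at the top of a log file.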
3. Metrics that catch problems early
Big failures can start as small warnings. A slight increase in latency. A tiny bump in error rates. A slow database query.
If you set up alerts for small changes, you can catch problems before they hit users.
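One way to do this is to expose latency and error metrics and alert on small shifts in them. Here’s a sketch using the Python prometheus_client library; the metric names, endpoint, and failure rate are invented for illustration:

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metrics for one endpoint: latency and error count.
REQUEST_LATENCY = Histogram("checkout_latency_seconds", "Checkout request latency")
REQUEST_ERRORS = Counter("checkout_errors_total", "Failed checkout requests")

@REQUEST_LATENCY.time()
def handle_checkout():
    time.sleep(random.uniform(0.01, 0.05))   # stand-in for real work
    if random.random() < 0.01:               # a "tiny bump" in errors
        REQUEST_ERRORS.inc()
        raise RuntimeError("payment provider timeout")

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        try:
            handle_checkout()
        except RuntimeError:
            pass
```

An alert on, say, the 95th percentile of checkout_latency_seconds creeping up for ten minutes will usually fire long before users start writing to support.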
4. Breaking things on purpose
Chaos Engineering sounds scary, but it’s the best way to test if your system can survive failure.
Shut down a database and see what happens.
Kill a service and check if users notice.
If something surprises you, fix it before it surprises you in production.
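You don’t need a full chaos platform to start. Here’s a tiny, hypothetical fault-injection sketch in Python: in a staging environment, a small percentage of dependency calls fail on purpose, and you watch how the rest of the system copes.

```python
import os
import random

# Hypothetical fault injection: make a configurable fraction of
# dependency calls fail so we learn now whether callers retry,
# degrade gracefully, or fall over.
FAILURE_RATE = float(os.getenv("CHAOS_FAILURE_RATE", "0.0"))

def chaotic(call):
    def wrapper(*args, **kwargs):
        if random.random() < FAILURE_RATE:
            raise ConnectionError("chaos: injected dependency failure")
        return call(*args, **kwargs)
    return wrapper

@chaotic
def fetch_recommendations(user_id):
    # Stand-in for a real downstream call (database, API, cache...).
    return ["item-1", "item-2"]

# Run with CHAOS_FAILURE_RATE=0.2 in staging and watch what breaks.
```

Keep the failure rate at zero in production and turn it up only where a surprise is cheap.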
How ODD transformed my team
After that 2 AM disaster, we made some changes.
We started using OpenTelemetry to trace requests.
We cleaned up our logs so they actually made sense.
We built alerts based on small patterns, not just full-blown failures.
And?
Bugs that used to take hours to debug now took minutes.
Instead of reacting to failures, we started seeing them coming.
Our team slept better at night because we trusted our system.
Do you have good observability in your system?
Ask yourself:
✅ When something breaks, do you immediately know where and why?
✅ Can you trace a single user’s request across all your services?
✅ Do you catch small problems before they turn into big ones?
If you answered “no” to any of these, you don’t.
Observability-Driven Development isn’t a buzzword. It can be the difference between guessing and knowing. Between chaos and control.
Don’t wait for the next 2 AM call to get to this realisation.
I really like the term "Observability-Driven Development".
Monitoring is important, but observability is what saves your team from constant firefighting and helps avoid major issues.
Good observability of your system is key - but unfortunately its weakness usually isn’t detected until something goes wrong. Several years ago I witnessed a snapshot data capture that had been down for several weeks with no errors. There were no errors because the snapshot was technically working, but the underlying data was not refreshing. That discovery taught me not to rely on error messages alone for monitoring.