Design for failure with retries, timeouts, circuit breakers, and bulkheads
Every distributed system fails. The only question is how it fails: gracefully or catastrophically.
Ask: Is it better to return an error quickly (fail fast) or degrade gracefully? For example, a payment authorization should fail fast with a clear error, while a recommendations widget can silently fall back to a default list.
Reliability is a tradeoff. SRE teams use error budgets to decide how much instability is acceptable before slowing feature work.
If your SLO is 99.9%, you get roughly 43 minutes of downtime per month. Spend that budget deliberately.
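The arithmetic behind that number is worth making explicit. A quick sketch (function name and the 30-day month are illustrative choices):

```python
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed unavailability for a given availability SLO."""
    total_minutes = days * 24 * 60          # 43,200 minutes in a 30-day month
    return total_minutes * (1 - slo)        # the slice you're allowed to burn

print(error_budget_minutes(0.999))   # 43.2 minutes per 30-day month
print(error_budget_minutes(0.9999))  # 4.32 minutes -- an order of magnitude tighter
```

Note how each extra nine cuts the budget by 10x, which is why "more nines" is a cost decision, not a default.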
Q: Why are retries dangerous without backoff and jitter?
<details> <summary>💡 Reveal Answer</summary>Without backoff, many services retry simultaneously, creating a thundering herd that overwhelms the dependency. Jitter spreads retries over time, giving the system a chance to recover.
</details>

Audit one critical dependency: what is its timeout, its retry policy, its fallback, and its blast radius when it fails?
Write this as a short “resilience contract.”
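The backoff-with-jitter idea from the answer above can be sketched in a few lines of Python. This is a minimal illustration, not a production library; the function name and default delays are assumptions:

```python
import random
import time

def retry_with_backoff(call, max_attempts=5, base_delay=0.1, max_delay=10.0):
    """Retry `call` with exponential backoff and full jitter.

    Each attempt sleeps a random time in [0, min(max_delay, base * 2**attempt)],
    so many clients retrying at once spread out instead of stampeding the
    dependency (the "thundering herd" from the question above).
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # budget exhausted: surface the failure
            cap = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, cap))  # full jitter
```

Usage would look like `retry_with_backoff(lambda: fetch_profile(user_id))`, where `fetch_profile` is whatever flaky call you are wrapping.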
Scenario: Your API retries failed payments three times instantly. Stripe goes down and your app collapses.
How do you redesign the retry strategy?
Use exponential backoff with jitter, add a circuit breaker, and move payment retries to an async queue. This prevents user-facing collapse and lets the system recover without a retry storm.
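The circuit-breaker piece of that redesign can be sketched as a small state machine. This is an illustrative minimal version, assuming a consecutive-failure threshold and a timed half-open state (class name and parameters are not from the lesson):

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch.

    Closed: calls pass through. After `failure_threshold` consecutive
    failures the breaker opens and calls fail fast, protecting both the
    caller and the ailing dependency. After `reset_timeout` seconds one
    trial call is allowed through (half-open); success closes the
    breaker, failure re-opens it.
    """
    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            # reset_timeout elapsed: half-open, allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0       # any success resets the count
        self.opened_at = None   # and closes the breaker
        return result
```

The fail-fast path is the point: while the breaker is open, user requests get an immediate error (or a fallback) instead of queuing behind a dead dependency.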
| Idea | Remember This |
|---|---|
| Failure is normal | Architect for it, don’t hope it away |
| Resilience patterns | Timeouts, retries, breakers, bulkheads |
| Fail fast/soft | Decide based on user impact |
| Error budgets | Reliability has a cost |
Next: Security & Compliance as Architecture