🤔 What Would You Do?

Scenario: On-call is blind: no dashboards, only raw logs. A production outage starts.

What’s the first architectural improvement you make after recovery?

Observability & Operability | Amir Brooks

Software Architecture & Decision Patterns/Lesson 9

Observability & Operability

Design systems you can debug, monitor, and run without heroics

A system you can’t observe is a system you can’t operate. Observability turns unknown failures into explainable failures.

The Three Pillars

Logs: discrete events and errors
Metrics: trends and health
Traces: end‑to‑end request paths

All three matter. Logs tell you what happened, metrics tell you how often, and traces tell you where.
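To make the "where" concrete, here is a minimal tracing sketch using only the Python standard library. It is not a real tracing client: `Span`, `span`, `FINISHED_SPANS`, `slowest_child`, and the `checkout` call with its `inventory` and `payment` steps are all hypothetical names, and the sleeps stand in for real dependency calls. The idea it illustrates is that nested spans share one trace ID and record their parent, so you can see which step consumed the time.

```python
import time
import uuid
from contextlib import contextmanager
from contextvars import ContextVar
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent: Optional[str]
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    duration_ms: float = 0.0


# Hypothetical in-memory store; a real system would export spans to a tracing backend.
FINISHED_SPANS: List[Span] = []
_current_span: ContextVar[Optional[Span]] = ContextVar("current_span", default=None)


@contextmanager
def span(name: str):
    """Time a block of work and link it to the enclosing span, if any."""
    parent = _current_span.get()
    trace_id = parent.trace_id if parent else uuid.uuid4().hex
    current = Span(name, trace_id, parent.span_id if parent else None)
    token = _current_span.set(current)
    start = time.perf_counter()
    try:
        yield current
    finally:
        current.duration_ms = (time.perf_counter() - start) * 1000
        _current_span.reset(token)
        FINISHED_SPANS.append(current)


def slowest_child(root_name: str) -> Span:
    """Return the direct child of the named span that took the longest."""
    root = next(s for s in FINISHED_SPANS if s.name == root_name)
    children = [s for s in FINISHED_SPANS if s.parent == root.span_id]
    return max(children, key=lambda s: s.duration_ms)


def checkout() -> None:
    # The sleeps are placeholders for calls to other services.
    with span("checkout"):
        with span("inventory"):
            time.sleep(0.01)
        with span("payment"):
            time.sleep(0.05)
```

After calling `checkout()`, `slowest_child("checkout")` points at the `payment` span, which is the kind of answer that is hard to get from a flat log stream.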

Decision Pattern: “Design for the Pager”

Ask: “If this breaks at 3am, can we diagnose it quickly?”

If the answer is no, invest in:

Structured logging
Correlation IDs
Standard dashboards
Runbooks for critical paths
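The first two items can be sketched together with the standard `logging` and `contextvars` modules. This is one possible shape, not a prescribed one: `correlation_id`, `JsonFormatter`, `build_logger`, `handle_request`, the `checkout` logger name, and the `order_id` field are all illustrative assumptions. Each log line becomes one JSON object, and every line emitted while handling a request carries the same correlation ID.

```python
import contextvars
import json
import logging
import uuid
from typing import Optional

# Hypothetical context variable holding the ID of the request being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so logs can be filtered by field."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        # Extra fields arrive via logger.info(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "checkout", stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to the stream (stderr by default)."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, order_id: str,
                   incoming_id: Optional[str] = None) -> str:
    # Reuse the caller's ID if one arrived (e.g. in a request header), else mint one.
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("checkout started", extra={"fields": {"order_id": order_id}})
        logger.info("checkout completed", extra={"fields": {"order_id": order_id}})
    finally:
        correlation_id.reset(token)
    return cid
```

If the same ID is passed on to downstream services, one search for that ID returns the whole story of a request across services.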

SLOs as Architecture Constraints

Service Level Objectives (SLOs) define how much failure is acceptable. Your architecture should meet the SLO with margin, not just barely.

Example SLO:

Checkout success rate ≥ 99.9% per month

A target like this drives decisions about redundancy, retries, and fallbacks.
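A short worked example shows why the number constrains the design. The sketch below treats the 99.9% target as an error budget. The traffic figure of 1,000,000 checkouts per month is a made-up assumption, the function names are illustrative, and `serial_success` assumes dependencies fail independently, which real systems only approximate.

```python
from fractions import Fraction
from math import floor

# From the example SLO: checkout success rate >= 99.9% per month.
SLO = Fraction("0.999")


def allowed_failures(total_requests: int, slo: Fraction = SLO) -> int:
    """Number of requests that may fail in the period while still meeting the SLO."""
    return floor(total_requests * (1 - slo))


def budget_remaining(total_requests: int, failed_requests: int,
                     slo: Fraction = SLO) -> Fraction:
    """Fraction of the error budget still unspent (negative once the SLO is blown)."""
    budget = allowed_failures(total_requests, slo)
    if budget == 0:
        return Fraction(0)
    return Fraction(budget - failed_requests, budget)


def serial_success(*rates: Fraction) -> Fraction:
    """Success rate of a request that must pass through every dependency in turn.

    Assumes failures are independent.
    """
    result = Fraction(1)
    for rate in rates:
        result *= rate
    return result


# Hypothetical month: 1,000,000 checkouts leaves a budget of 1,000 failures.
print(allowed_failures(1_000_000))
# Two serial dependencies at 99.9% each land below the 99.9% target.
print(float(serial_success(SLO, SLO)))
```

Under these assumptions, a checkout that depends on two services in series, each at exactly 99.9%, succeeds about 99.8% of the time. That gap is what redundancy, retries, or a fallback path has to close.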

🧠 Knowledge Check

Q: Why are traces critical in distributed systems?

<details> <summary>💡 Reveal Answer</summary>

Because latency and failures often occur across multiple services. Traces show the full request path and identify where time is spent or errors occur, which is hard to reconstruct from logs or metrics alone.

</details>

✍️ Recall Cards

Front: Logs are for ______ events; metrics are for ______ trends. <details><summary>Back</summary>Discrete; aggregate</details>
Front: Traces reveal the full ______ path. <details><summary>Back</summary>Request</details>
Front: SLOs define ______ failure. <details><summary>Back</summary>Acceptable</details>

🔨 Try It Yourself (Hands‑On)

Create an observability checklist for one endpoint:

Define the success metric (e.g., 2xx rate)
Add 3 key logs (start, error, completion)
Add 2 key metrics (latency, error rate)
Identify trace spans for major dependencies

<details> <summary>Best approach</summary> </details>
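As one possible way to work through the checklist, here is a sketch for a hypothetical `get_order` endpoint. The handler name, the `fetch` dependency, and the in-memory `request_counts` and `latencies_ms` stores are assumptions made for illustration; a real service would send these values to its metrics system instead. Trace spans are not shown, but the `fetch` call marks where a dependency span would go.

```python
import logging
import time
from collections import Counter
from typing import Callable, List

logger = logging.getLogger("orders_endpoint")  # hypothetical endpoint name

# In-memory stand-ins for a metrics client.
request_counts: Counter = Counter()  # keyed by outcome: "success" or "error"
latencies_ms: List[float] = []


def get_order(order_id: str, fetch: Callable[[str], dict]) -> dict:
    """Handle one request; `fetch` is a hypothetical dependency call (e.g. a database lookup)."""
    start = time.perf_counter()
    logger.info("get_order start order_id=%s", order_id)            # log 1: start
    try:
        order = fetch(order_id)
    except Exception:
        request_counts["error"] += 1                                # metric: error count
        logger.exception("get_order error order_id=%s", order_id)   # log 2: error
        raise
    else:
        request_counts["success"] += 1                              # success metric
        logger.info("get_order completion order_id=%s", order_id)   # log 3: completion
        return order
    finally:
        # Metric: latency, recorded for successes and failures alike.
        latencies_ms.append((time.perf_counter() - start) * 1000)


def success_rate() -> float:
    """Share of requests that succeeded; 1.0 when no traffic has been seen."""
    total = sum(request_counts.values())
    return request_counts["success"] / total if total else 1.0
```

Recording latency in `finally` is a deliberate choice: failed requests are often the slow ones, and leaving them out would make the latency metric look healthier than it is.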

📋 Key Takeaways

Idea	Remember This
Observability	Logs + metrics + traces
Pager test	Design for 3am debugging
SLOs	Reliability targets that drive architecture
Operability	Runbooks and dashboards reduce MTTR
