
Observability Stack Designer

Design a complete observability platform with logging, metrics, distributed tracing, alerting, and SLO-based monitoring.

Updated Mar 11, 2026



Prompt

You are an observability engineer. Design a comprehensive observability strategy for my system.

System: [SYSTEM_DESCRIPTION]
Architecture: [ARCHITECTURE] (monolith, microservices, serverless, hybrid)
Current observability: [CURRENT_STATE]
Pain points: [PAIN_POINTS]
Team size: [TEAM_SIZE]
Budget: [BUDGET]
Cloud provider: [CLOUD_PROVIDER]

Design the observability stack:

**1. The Three Pillars**

**Logging:**
- Structured logging format (JSON with standard fields)
- Log levels strategy (when to use each level)
- Required context fields: request_id, trace_id, user_id, service, timestamp
- Log aggregation tool recommendation
- Log retention and rotation policy
- What NOT to log (PII, secrets, high-cardinality spam)
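
As an illustration of the logging items above, here is a minimal sketch using only Python's standard `logging` module. The `JsonFormatter` class, the `context` attribute, and the `REDACTED_KEYS` deny-list are hypothetical names chosen for this example, not part of any particular tool.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical deny-list: keys whose values must never reach the log stream.
REDACTED_KEYS = {"password", "token", "email"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with standard fields."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "message": record.getMessage(),
        }
        # Context (request_id, trace_id, user_id, ...) arrives via `extra`.
        for key, value in getattr(record, "context", {}).items():
            payload[key] = "[REDACTED]" if key in REDACTED_KEYS else value
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service="checkout"))
    logger = logging.getLogger("checkout")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.info(
        "order placed",
        extra={"context": {"request_id": "r-1", "trace_id": "t-1", "user_id": "u-1"}},
    )
```

Redacting by key name is only a first line of defense; it assumes sensitive values are never embedded in the message string itself.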

**Metrics:**
- RED metrics for services (Rate, Errors, Duration)
- USE metrics for resources (Utilization, Saturation, Errors)
- Business metrics (the ones stakeholders actually care about)
- Custom metrics design
- Metrics tool recommendation (Prometheus, CloudWatch, Datadog)
- Cardinality management (avoiding metric explosion)
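
To make the RED bullet concrete, the following sketch computes Rate, Errors, and Duration from raw request samples for one time window. The function name `red_summary`, the sample shape, and the choice of 5xx as "error" and nearest-rank p95 as "duration" are assumptions for illustration; a real deployment would normally take these from a metrics backend.

```python
from typing import Dict, List, Optional, Tuple


def red_summary(
    samples: List[Tuple[int, float]], window_seconds: float
) -> Dict[str, Optional[float]]:
    """Summarize (status_code, duration_ms) samples seen in one window."""
    total = len(samples)
    errors = sum(1 for status, _ in samples if status >= 500)
    durations = sorted(duration for _, duration in samples)
    p95 = None
    if total:
        # Nearest-rank percentile: rank = ceil(0.95 * n), in integer math.
        rank = (95 * total + 99) // 100
        p95 = durations[rank - 1]
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
    }


if __name__ == "__main__":
    window = [(200, 10.0)] * 18 + [(500, 100.0), (200, 200.0)]
    print(red_summary(window, window_seconds=10))
```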

**Distributed Tracing:**
- OpenTelemetry implementation plan
- Trace context propagation across services
- Sampling strategy (head-based vs. tail-based)
- Span naming conventions
- Trace tool recommendation (Jaeger, Zipkin, Tempo, X-Ray)
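
The propagation and sampling bullets can be sketched with the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`). This is a simplified, standard-library illustration, not the OpenTelemetry SDK's implementation; the helper names are hypothetical.

```python
import secrets


def new_traceparent(sampled: bool) -> str:
    """Start a trace: 32-hex trace id, 16-hex span id, 2-hex flags."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(parent_header: str) -> str:
    """Propagate to a downstream call: same trace id and flags, new span id."""
    version, trace_id, _parent_span_id, flags = parent_header.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head-based sampling decided from the trace id alone.

    Every service that sees the same trace id reaches the same decision,
    so a trace is kept or dropped as a whole.
    """
    return int(trace_id[-16:], 16) < ratio * 2**64


if __name__ == "__main__":
    root = new_traceparent(sampled=True)
    print(root)
    print(child_traceparent(root))
```

Tail-based sampling differs in that the keep-or-drop decision is made after the trace completes (for example, keeping all traces that contain an error), which requires buffering spans in a collector.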

**2. SLIs, SLOs, and Error Budgets**
- Define SLIs for each critical user journey:
  - Availability: Successful requests / total requests
  - Latency: % of requests served faster than a target latency (track p50, p95, p99)
  - Correctness: % of correct results
- Set SLOs for each SLI (realistic, not aspirational)
- Error budget calculation and tracking
- Error budget policy: What happens when budget is exhausted?
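
The error budget arithmetic above can be worked through in a few lines. The sketch assumes a request-based availability SLI; the function names and the 99.9% / 30-day figures in the usage section are illustrative choices, not recommendations.

```python
from typing import Dict


def error_budget(
    slo_target: float, total_requests: int, failed_requests: int
) -> Dict[str, float]:
    """Return the allowed failures and how much of the budget is spent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "budget_consumed": failed_requests / allowed if allowed else 0.0,
    }


def allowed_downtime_minutes(slo_target: float, window_days: int) -> float:
    """Time-based view of the same budget: minutes of full outage allowed."""
    return (1 - slo_target) * window_days * 24 * 60


if __name__ == "__main__":
    # 99.9% over 1,000,000 requests allows about 1,000 failures.
    print(error_budget(0.999, total_requests=1_000_000, failed_requests=250))
    # 99.9% over 30 days allows about 43.2 minutes of downtime.
    print(allowed_downtime_minutes(0.999, window_days=30))
```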

**3. Alerting Strategy**
- Alert on symptoms (user impact), not causes
- Multi-window, multi-burn-rate alerts (avoid flapping)
- Severity levels: Page (wake someone up) vs. Ticket vs. Log
- On-call rotation design
- Runbook requirement for every alert
- Alert fatigue prevention (review and prune monthly)
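
A multi-window, multi-burn-rate check can be sketched as follows. The window pairs and thresholds (14.4, 6, 1) are the values commonly cited from Google's SRE Workbook for a 30-day SLO; treat them as a starting assumption to tune. The `POLICIES` table and `evaluate` function are hypothetical names, and the per-window error ratios are assumed to come from your metrics backend.

```python
from typing import Dict, Optional

# (long window, short window, burn-rate threshold, severity)
POLICIES = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def evaluate(error_ratios: Dict[str, float], slo_target: float) -> Optional[str]:
    """Return the first matching severity, or None if nothing should fire.

    Both windows must exceed the threshold: the long window shows the burn
    is significant, the short window shows it is still happening, which
    keeps the alert from flapping after recovery.
    """
    for long_w, short_w, threshold, severity in POLICIES:
        long_burn = burn_rate(error_ratios[long_w], slo_target)
        short_burn = burn_rate(error_ratios[short_w], slo_target)
        if long_burn >= threshold and short_burn >= threshold:
            return severity
    return None


if __name__ == "__main__":
    ratios = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.02, "3d": 0.02}
    print(evaluate(ratios, slo_target=0.999))
```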

**4. Dashboards**
- Service overview dashboard (the first place you look)
- Per-service deep dive dashboards
- Business metrics dashboard
- Infrastructure dashboard
- Dashboard design principle: each dashboard should answer a question, not just show data

**5. Incident Response Integration**
- From alert to incident to post-mortem workflow
- Status page integration
- Communication templates

**6. Tool Stack Recommendation**
- Comparison table with columns: Need, Option A, Option B, Recommendation
- Build vs. buy analysis
- Estimated monthly cost

**7. Implementation Roadmap**
- Week 1-2: Structured logging + basic metrics
- Week 3-4: Distributed tracing
- Month 2: SLOs + alerting
- Month 3: Dashboards + refinement

Pro Tips

  • Observability is the difference between 'the system is broken' and 'I know exactly why and can fix it in 5 minutes.' SLO-based monitoring focuses alerts on what actually matters to users.

