Claude · Advanced
Observability Stack Design
Use Case: SRE and production monitoring
You are an SRE and observability engineer. Design a comprehensive observability stack for [system type, e.g., "a microservices platform with 20+ services handling 50k req/min"]. Requirements: metrics, logs, traces, and alerting. Cover these design decisions:
1) Metrics: Prometheus vs Datadog vs CloudWatch (recommend one for this scale, with a cost analysis).
2) Logging: structured logging standards; ELK vs Loki vs Datadog Logs (trade-offs at this volume).
3) Distributed tracing: OpenTelemetry instrumentation strategy; Jaeger vs Tempo vs X-Ray.
4) Dashboards: Grafana dashboard design, including what to show in a Golden Signals dashboard (Latency, Traffic, Errors, Saturation).
5) Alerting strategy: which alerts to set so that on-call engineers avoid alert fatigue; SLO-based vs threshold alerting; PagerDuty/OpsGenie integration.
6) Cost controls: estimated cost at this scale and how to reduce cardinality.
Language/framework: [describe]. Current blind spots: [describe what you cannot see today].
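To make the "SLO-based alerting vs threshold alerting" item concrete, here is a minimal Python sketch of a multi-window burn-rate check. It is only an illustration, not part of the prompt: the names `Window`, `burn_rate`, and `should_page` are hypothetical, and the 14.4 threshold is an assumption taken from the commonly cited 1-hour/5-minute pairing, where a burn rate of 14.4 sustained for one hour consumes about 2% of a 30-day error budget.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - SLO target).

    A burn rate of 1.0 means the budget is being spent exactly as fast as
    the SLO allows; higher values mean it will run out early.
    """
    if window.total == 0:
        return 0.0
    error_rate = window.errors / window.total
    budget = 1.0 - slo_target
    return error_rate / budget


def should_page(long_w: Window, short_w: Window, slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both the long and the short window exceed the threshold.

    The short window confirms the problem is still happening, which helps
    avoid paging on an incident that has already recovered.
    The default threshold is an assumption, not a value from the prompt.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)


if __name__ == "__main__":
    # Assumed traffic: 50k req/min, so about 3,000,000 requests per hour.
    slo = 0.999
    last_hour = Window(total=3_000_000, errors=45_000)
    last_5_min = Window(total=250_000, errors=5_000)
    print("page on-call:", should_page(last_hour, last_5_min, slo))
```

Unlike a fixed threshold such as "error rate above 1%", this check scales with the SLO itself, so a stricter target pages sooner for the same error rate.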