#SRE.

Discover 3 professional prompt templates tagged with #SRE. All templates are tested for 2026 reasoning models.

Claude · Advanced

SLO/SLA Design Framework

Use Case: SRE service level management

You are an SRE lead. Design a Service Level Objective (SLO) framework for [service/product name]. Service description: [what it does, who uses it, business criticality].
Framework design:
1) SLI Selection. For each user journey, define the right SLI: availability SLI (good requests/total), latency SLI (% under threshold), quality SLI (error-free responses). Justify why these are the right indicators.
2) SLO Targets. Propose starting SLO values with rationale (be conservative; over-promising is worse than under-promising).
3) Error Budget. Calculate and explain the error budget for each SLO (minutes/month of allowed downtime).
4) Error Budget Policy. What happens when 50%/75%/100% of the budget is burned: feature freeze triggers, deploy halts, team notifications.
5) SLA. Translate internal SLOs to customer-facing SLAs with appropriate buffer.
6) Measurement Implementation. Exact Prometheus queries or APM configuration to measure each SLI.
7) Dashboard Design. What the SLO burn rate dashboard should show.
Current uptime: [X nines]. Customer expectations: [describe].
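As a rough illustration of the error budget arithmetic in steps 3 and 4 of this prompt, the sketch below converts an availability SLO into minutes of allowed downtime and maps budget consumption onto the 50%/75%/100% thresholds. The function names, the 30-day window, and the assignment of a specific action to each threshold are assumptions made for the example; the prompt itself leaves the policy details to the model's answer.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over the window.

    slo_target is a fraction, e.g. 0.999 for "three nines".
    The 30-day default window is an assumption for this sketch.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def policy_stage(consumed_minutes: float, budget_minutes: float) -> str:
    """Map budget consumption to a policy stage.

    The 50%/75%/100% thresholds come from the prompt; which action
    fires at which threshold is a hypothetical choice for illustration.
    """
    burned = consumed_minutes / budget_minutes
    if burned >= 1.0:
        return "halt deploys"
    if burned >= 0.75:
        return "freeze features"
    if burned >= 0.5:
        return "notify team"
    return "ok"


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    print(policy_stage(consumed_minutes=33.0, budget_minutes=budget))
```

A calculation like this can be handy for sanity-checking the budget figures a model returns for the [X nines] you supply.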
Claude · Intermediate

SRE Incident Runbook Generator

Use Case: SRE incident response and reliability

You are a Site Reliability Engineer. Create a detailed incident runbook for: Service: [service name]. Common failure mode: [describe, e.g., "database connection pool exhaustion" or "memory leak causing OOM kills"].
Runbook sections:
1) Alert Context. What triggered this runbook, what the metric/log looks like, normal baseline.
2) Impact Assessment. What user-facing impact this causes, how to quantify severity.
3) Triage Steps. Step-by-step diagnostic commands (include exact commands with placeholders for env-specific values).
4) Mitigation Options. Ordered from fastest to most complete: a) immediate mitigation (restart/rollback/scale), b) root cause fix, c) permanent solution.
5) Escalation Path. When to escalate, who to page, and what information to have ready.
6) Verification. How to confirm the issue is resolved.
7) Prevention. What monitoring, alerting, or code changes would prevent recurrence.
Include: exact CLI commands, links to relevant dashboards, and a post-incident review checklist.
Claude · Advanced

Observability Stack Design

Use Case: SRE and production monitoring

You are an SRE and observability engineer. Design a comprehensive observability stack for [system type, e.g., "a microservices platform with 20+ services handling 50k req/min"]. Requirements: metrics, logs, traces, and alerting.
Design decisions to cover:
1) Metrics. Prometheus vs Datadog vs CloudWatch (recommend one for this scale with cost analysis).
2) Logging. Structured logging standards, ELK vs Loki vs Datadog Logs (trade-offs for this volume).
3) Distributed Tracing. OpenTelemetry instrumentation strategy, Jaeger vs Tempo vs X-Ray.
4) Dashboards. Grafana dashboard design: what to show in a Golden Signals dashboard (Latency, Traffic, Errors, Saturation).
5) Alerting Strategy. The RIGHT alerts to set (avoid alert fatigue): SLO-based alerting vs threshold alerting, PagerDuty/OpsGenie integration.
6) Cost Controls. Estimated cost at this scale and how to reduce cardinality.
Language/framework: [describe]. Current blind spots: [describe what you cannot see today].
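To make the "SLO-based alerting vs threshold alerting" distinction in this prompt more concrete, here is a minimal sketch of burn-rate alerting, one common form of SLO-based alerting. Burn rate is taken as the observed error ratio divided by the error ratio the SLO allows. The function names, the 720-hour (30-day) SLO window, and the long-window/short-window paging rule are assumptions for illustration, not something the prompt prescribes.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How fast the error budget is being spent.

    1.0 means the budget would be used up exactly at the end of the
    SLO window; higher values exhaust it sooner.
    """
    if total_events == 0:
        return 0.0  # no traffic, nothing burned
    error_ratio = bad_events / total_events
    allowed_ratio = 1.0 - slo_target
    return error_ratio / allowed_ratio


def burn_rate_threshold(budget_fraction: float,
                        alert_window_hours: float,
                        slo_window_hours: float = 720.0) -> float:
    """Burn rate that spends budget_fraction of the budget in the alert window.

    The 720-hour default (30 days) is an assumption for this sketch.
    """
    return budget_fraction * slo_window_hours / alert_window_hours


def should_page(long_window_rate: float,
                short_window_rate: float,
                threshold: float) -> bool:
    """Page only if both windows exceed the threshold.

    Requiring the short window as well helps avoid paging for an
    incident that has already stopped burning budget.
    """
    return long_window_rate >= threshold and short_window_rate >= threshold


if __name__ == "__main__":
    # Hypothetical numbers: 2% of the monthly budget spent within one hour.
    threshold = burn_rate_threshold(budget_fraction=0.02, alert_window_hours=1.0)
    long_rate = burn_rate(bad_events=20, total_events=1000, slo_target=0.999)
    short_rate = burn_rate(bad_events=3, total_events=100, slo_target=0.999)
    print(f"threshold={threshold:.1f} long={long_rate:.1f} short={short_rate:.1f}")
    print("page" if should_page(long_rate, short_rate, threshold) else "no page")
```

Unlike a fixed threshold on raw error counts, this kind of rule scales with traffic and ties each page to how much of the budget is actually at risk, which is one way to reduce alert fatigue.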