Claude · Advanced
LLM Evaluation Framework
Optimized for Claude, this prompt is specifically designed for AI product quality assurance and tested against 2026 model architectures.
The Prompt Template
You are an AI evaluation researcher. Design a rigorous evaluation framework for an LLM-powered product: [describe the product, e.g., "an AI customer support agent"]. Structure the framework as follows:
1) Evaluation Taxonomy - categorize what needs to be evaluated: Task Performance, Safety, Robustness, User Experience, Cost Efficiency.
2) Metrics per Category - for each category, give specific metrics, the measurement methodology (human eval vs. automated vs. hybrid), and a scoring rubric.
3) Golden Dataset Design - how to build a ground-truth evaluation set of [N] examples covering diverse scenarios, including adversarial cases.
4) Regression Testing Protocol - how to ensure new model versions don't break existing capabilities.
5) Latency and Cost SLAs - acceptable p50/p95/p99 latency and cost per call.
6) Red-Teaming Plan - the 10 most important adversarial prompts to test for this product.
7) Human Eval Interface Design - what annotators see and how to ensure inter-rater reliability.
Also recommend an open-source evaluation framework (Evals, RAGAS, LangSmith, etc.) suited to this use case.
#LLM-evaluation #AI-quality #MLOps #evals
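Worked Example: Golden-Set Regression Harness
To make sections 3-5 of the template concrete, here is a minimal sketch of a golden-dataset regression run with p50/p95/p99 latency tracking. `call_model`, the JSONL schema, and the `must_contain` pass criterion are hypothetical stand-ins, not part of the template; swap in your own client and graders (or an LLM judge).

```python
# Minimal golden-set regression harness (illustrative; not from the template).
import json
import statistics
import time

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your LLM client; replace with a real API call."""
    raise NotImplementedError

def run_regression(golden_path: str, p95_budget_s: float = 2.0) -> dict:
    # Golden set as JSONL, one {"prompt": ..., "must_contain": ...} per line.
    with open(golden_path) as f:
        golden = [json.loads(line) for line in f]

    latencies, failures = [], []
    for case in golden:
        start = time.perf_counter()
        output = call_model(case["prompt"])
        latencies.append(time.perf_counter() - start)
        # Cheapest possible automated grader; replace with rubric scoring
        # or an LLM judge for anything beyond smoke tests.
        if case["must_contain"].lower() not in output.lower():
            failures.append(case["prompt"])

    cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
    return {
        "pass_rate": 1 - len(failures) / len(golden),
        "p50_s": cuts[49],
        "p95_s": cuts[94],
        "p99_s": cuts[98],
        "sla_ok": cuts[94] <= p95_budget_s,
        "failing_prompts": failures,
    }
```

Gating deployments on pass_rate and sla_ok gives section 4's regression protocol teeth: a new model version can't ship if it regresses the golden set or blows the latency budget.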
Best Used For
AI product quality assurance. This template provides a structured foundation for data science and AI/ML workflows, ensuring Claude understands the specific constraints and persona required for high-quality output.
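Measuring Inter-Rater Reliability
One piece of the template's section 7 worth automating up front: inter-rater reliability is commonly quantified with a chance-corrected agreement statistic such as Cohen's kappa. The sketch below assumes two annotators labeling the same items; the label names and example data are illustrative.

```python
# Cohen's kappa for two annotators rating the same items (illustrative).
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    assert len(rater_a) == len(rater_b), "annotators must rate the same items"
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(freq_a) | set(freq_b)
    )
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators over four support transcripts.
print(cohens_kappa(["good", "bad", "good", "bad"],
                   ["good", "bad", "good", "good"]))  # -> 0.5
```

As a rough convention, kappa above ~0.6 reads as substantial agreement; if your annotators land below that, tighten the rubric or recalibrate them before trusting the human-eval scores.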
Pro Tip
Always replace bracketed text like [describe the product] and [N] with your specific details. Adding context about your target audience or brand tone will significantly improve the accuracy of the result.
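Filling the Placeholders Programmatically
If you fill the template in code rather than by hand, Python's standard string.Template is one lightweight option. The $product and $n slot names below are my own mapping of the bracketed placeholders, and the values are purely illustrative.

```python
# Illustrative only: mapping the bracketed slots onto string.Template fields.
from string import Template

template = Template(
    "You are an AI evaluation researcher. Design a rigorous evaluation "
    "framework for an LLM-powered product: $product. ... build a ground-truth "
    "evaluation set of $n examples ..."  # elided; paste the full prompt here
)
prompt = template.substitute(
    product="an AI customer support agent for a telecom, serving frustrated callers",
    n=300,
)
```

Keeping the fill values in one place makes it easy to rerun the same framework prompt across several products or dataset sizes.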