Evaluation is the core feedback loop of PromptOps. A weak eval gives you false confidence. A strong eval tells you immediately when a prompt change breaks something — and exactly what broke. This guide covers how to design test cases, choose the right assertions, write good judge rubrics, and build an eval practice that improves over time.
The eval maturity ladder
Start at Level 1 and move up as your prompt matures. Most teams get 80% of the value from deterministic checks alone — don’t skip ahead to AI grading before you have solid deterministic coverage.

| Level | What | When to use | Tools |
|---|---|---|---|
| Level 1: Deterministic checks | contains, is-json, regex, starts-with | Always — fast, free, runs on every change | Inline assertions, quick eval files |
| Level 2: AI-graded checks | llm-rubric, similar, factuality | When deterministic checks can’t capture quality (tone, coherence, reasoning) | Judge evaluators, llm-rubric assertions |
| Level 3: Baseline comparison | Compare scorecards against a known-good run | When you need regression detection across prompt changes | Baseline skill, regression policies |
| Level 4: Human review | Periodic spot-checks of model outputs | When you need to calibrate AI judges or validate subjective quality | Manual scorecard review |
Two evaluation modes
Apastra supports two modes. Use whichever fits your situation:

- Quick eval (single file): a single YAML file in promptops/evals/ that combines prompt, cases, and assertions. The agent reads this file and internally treats it as a prompt spec + dataset + inline assertions + suite. No other files needed. Best for smoke tests and rapid iteration.
- Suite mode (full pipeline): the full prompts/ + datasets/ + evaluators/ + suites/ layout described under “Week 2–3: Graduate to a full suite” below.
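To make the quick-eval format concrete, here is a minimal sketch of such a file; the exact field names (prompt, cases, inputs, assert) are assumptions for illustration, not the canonical Apastra schema:

```yaml
# promptops/evals/classify-email.quick.yaml (hypothetical; field names are assumptions)
prompt: |
  Classify the following email as "spam", "sales", or "support".
  Respond with JSON only, for example {"category": "spam"}.
cases:
  - case_id: obvious-spam
    inputs:
      email: "CONGRATULATIONS!!! You have won a free cruise. Click now."
    assert:
      - type: is-json
      - type: icontains
        value: spam
  - case_id: empty-input
    inputs:
      email: ""
    assert:
      - type: is-json
```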
Designing test cases
Start from real failures, not hypotheticals
The most valuable test cases come from actual bad outputs your prompt has produced. Every time a prompt fails in production or review, turn that case into a test. If you don’t have failures yet, try the prompt with adversarial inputs and edge cases to find them before your users do.

Break your prompt into features and scenarios
Decompose what your prompt does into discrete capabilities. Write separate test cases for each capability. For a “classify email” prompt, that means:

- Correctly classifies obvious spam
- Handles ambiguous emails (could be sales or support)
- Returns valid JSON
- Doesn’t expose internal IDs or metadata
- Handles empty input gracefully
Cover all five categories
| Category | Examples |
|---|---|
| Happy path | Normal inputs that should work correctly |
| Edge cases | Empty input, very long input, special characters, Unicode |
| Adversarial | Prompt injection, jailbreak attempts, off-topic requests |
| Format compliance | JSON output, length limits, required fields |
| Safety | Refusal of harmful requests, PII handling |
Prioritize volume over perfection
50 cases with automated grading are more valuable than 10 perfectly curated cases with careful human review. You can always improve case quality later — you can’t retroactively add coverage. Ask your IDE agent: “Generate 20 test cases for this prompt, including edge cases and adversarial inputs.” Review and curate the results — don’t blindly trust synthetic data.

Dataset format (JSONL)
Dataset files are .jsonl files — one JSON object per line. Each case has a stable case_id and an inputs object:
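For example, a classify-email dataset might look like the following; the email input field and the expected_outputs values are illustrative, not a fixed schema:

```jsonl
{"case_id": "obvious-spam", "inputs": {"email": "You have won a free cruise! Click now."}, "expected_outputs": {"category": "spam"}}
{"case_id": "ambiguous-sales-or-support", "inputs": {"email": "Can you tell me more about your enterprise plan pricing?"}}
{"case_id": "empty-input", "inputs": {"email": ""}}
```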
You can combine expected_outputs (used by suite evaluators) with inline assert blocks (applied per-case). Both apply when present.
Inline assertions on dataset cases
Put assertions directly on a case instead of a separate evaluator file:
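A sketch of a dataset line with an inline assert block; the assertion objects follow the types documented below, while the ticket-ID pattern is purely hypothetical:

```jsonl
{"case_id": "no-internal-ids", "inputs": {"email": "Please check ticket TKT-90215 for me."}, "assert": [{"type": "is-json"}, {"type": "not-regex", "value": "TKT-\\d+"}]}
```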
Choosing the right assertion type

Deterministic assertions
These checks are fast, free, and reproducible. Use them for every check where the output can be verified without a model:

| Type | What it checks | Example |
|---|---|---|
equals | Exact string match | {"type": "equals", "value": "Hello, World!"} |
contains | Substring (case-sensitive) | {"type": "contains", "value": "Bonjour"} |
icontains | Substring (case-insensitive) | {"type": "icontains", "value": "summary"} |
contains-any | At least one of several values | {"type": "contains-any", "value": ["yes", "correct"]} |
contains-all | Every value present | {"type": "contains-all", "value": ["name", "age"]} |
regex | Regular expression match | {"type": "regex", "value": "\\d{3}-\\d{4}"} |
starts-with | Output begins with value | {"type": "starts-with", "value": "Dear "} |
is-json | Output is valid JSON | {"type": "is-json"} |
contains-json | Output contains a JSON block | {"type": "contains-json"} |
is-valid-json-schema | Output matches a JSON Schema | {"type": "is-valid-json-schema", "value": {"type": "object", "required": ["category"]}} |
Negation
Any assertion type can be negated with the not- prefix: not-contains, not-regex, not-is-json, and so on.
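For example, to fail any case whose output contains a known bad phrase (the phrase here is illustrative):

```json
{"type": "not-contains", "value": "As an AI language model"}
```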
Model-assisted assertions
Use these when deterministic checks can’t capture the quality you care about:

| Type | What it checks | Example |
|---|---|---|
similar | Semantic similarity to a reference (threshold 0–1) | {"type": "similar", "value": "expected answer", "threshold": 0.8} |
llm-rubric | AI grades output using a rubric | {"type": "llm-rubric", "value": "Is the response helpful, accurate, and under 100 words?"} |
factuality | Output is factually consistent with reference | {"type": "factuality", "value": "reference facts text"} |
answer-relevance | Output is relevant to the input | {"type": "answer-relevance"} |
Performance assertions
| Type | What it checks | Example |
|---|---|---|
latency | Response time in milliseconds | {"type": "latency", "threshold": 500} |
cost | Token cost in dollars | {"type": "cost", "threshold": 0.01} |
Decision table: if you want X, use Y
| If you want to check… | Use this assertion |
|---|---|
| Output contains specific keywords | contains or icontains |
| Output is valid JSON | is-json |
| Output matches a specific structure | is-valid-json-schema |
| Output doesn’t leak internal data | not-regex |
| Output is semantically close to a reference | similar |
| Output quality requires judgment | llm-rubric |
| Output mentions at least one of several options | contains-any |
| Output always starts correctly | starts-with |
| Output avoids a specific bad pattern | not-contains |
| Free-text output is accurate | factuality |
Writing good judge rubrics
When using llm-rubric or judge evaluators, the quality of the rubric determines the quality of the grading.
Be specific, not vague
Vague: “Is the output good?”

Specific: “Does the output mention the company name in the first sentence? Does it use a professional tone? Is it under 100 words?”

The rubric should describe observable, checkable criteria — not impressions.
Use binary or numeric scales
Ask for “correct/incorrect” or a 1–5 scale, not open-ended qualitative feedback. Open-ended output is hard to aggregate into a score.
Ask the judge to reason first
Instruct the judge to think step by step before scoring: “Think step by step about whether this output meets the criteria, then give a score of 1–5.” Chain-of-thought grading consistently improves accuracy.
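Putting these guidelines together, an llm-rubric assertion might look like this; the rubric text is illustrative:

```json
{"type": "llm-rubric", "value": "Think step by step about whether the output (1) mentions the company name in the first sentence, (2) uses a professional tone, and (3) is under 100 words. Then give a score from 1 to 5, where 5 means all three criteria are met."}
```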
Version your rubrics
Changing the rubric text changes what the metric means. A rubric edit is a new evaluator version. Without versioning, historical comparisons become meaningless — the baseline was graded by a different rubric.
Common eval mistakes
| Mistake | Why it’s bad | Fix |
|---|---|---|
| Only testing happy paths | You miss the failures that matter most | Add edge cases and adversarial inputs |
| Using equals for free-text outputs | LLM output is non-deterministic — exact match almost always fails | Use contains, icontains, or similar instead |
| Thresholds set too high | Flaky evals erode trust — people start ignoring failures | Start with achievable thresholds (e.g., 0.6), tighten over time |
| No baseline comparison | You can’t tell if a prompt change made things worse | Establish a baseline after your first passing run |
| Ignoring flaky cases | Random noise masks real regressions | Increase trials, quarantine consistently flaky cases |
| Overfitting to test cases | Prompt works for tests but fails in production | Maintain a holdout set, add cases from real production failures |
The evolving eval cadence
Build your eval practice incrementally. Here is a realistic week-by-week plan:

Week 1: Start small
Write a quick eval file with 5 cases and deterministic assertions only. Run it locally. Get a passing baseline.
Week 2–3: Graduate to a full suite
Move to prompts/ + datasets/ + evaluators/ + suites/. Add 20+ cases. Cover edge cases and at least one adversarial case. Establish your first baseline with the apastra-baseline skill.

Month 2+: Add AI grading and regression policies
Add llm-rubric or judge evaluators for subjective quality dimensions (tone, helpfulness, completeness). Set up a regression policy. Connect to CI so every PR runs the suite.
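As a sketch of the CI wiring (assuming GitHub Actions; the apastra run command and workflow path are placeholders, not a documented CLI):

```yaml
# .github/workflows/evals.yml (hypothetical; the run command is a placeholder)
name: Run eval suite
on: pull_request
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Replace this command with however your team invokes the Apastra suite.
      - run: apastra run promptops/suites/regression.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```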
Tips for robust evals

What trials do and when to use them
The trials field on a suite runs each case multiple times and records variance. Use trials: 1 for smoke suites (fast, low cost). Use trials: 3 or more for regression suites where variance matters — especially when using AI-graded assertions.
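A minimal sketch of how trials might appear in a suite file; apart from trials, the field names and paths are assumptions:

```yaml
# promptops/suites/regression.yaml (illustrative only)
suite: classify-email-regression
dataset: datasets/classify-email.jsonl
evaluators:
  - evaluators/keyword_recall.yaml
trials: 3   # run each case three times and record variance
```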
How to handle flaky cases
A flaky case is one that passes on some runs and fails on others due to model non-determinism. Steps to handle them:
- Increase trials to get a stable average.
- Widen allowed_delta in your regression policy for that metric (see the sketch after this list).
- If the case is still unreliable, quarantine it and track its flake rate separately — don’t let it silently pass as “random noise”.
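A hedged sketch of what widening allowed_delta could look like; the policy layout and metric names here are assumptions, not the canonical schema:

```yaml
# Fragment of a hypothetical regression policy (field and metric names are assumptions)
metrics:
  keyword_recall:
    allowed_delta: 0.05   # tight tolerance for a stable deterministic metric
  llm_rubric_score:
    allowed_delta: 0.10   # wider tolerance for a noisier AI-graded metric
```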
Inline assertions vs evaluator files
Both can apply to the same case — they complement each other. Inline assertions are per-case and ideal for specific edge-case checks. Evaluator files are per-suite and ideal for consistent, reusable scoring rules (like keyword_recall) applied across all cases.
When to use a holdout set
A holdout set is a dataset your prompt has never been tuned against. It guards against benchmark gaming — where you optimize for the test cases you see instead of real-world inputs. Add a holdout dataset to your release-candidate suite so the final gate uses unseen data.