A regression is when a prompt change causes measurable quality to drop beyond what your policy allows. Apastra detects regressions by comparing a candidate scorecard against a saved baseline, evaluated by a policy that defines per-metric rules with floors, allowed deltas, and severity.
What a regression is and why it matters
When you change a prompt — even a small wording tweak — the model’s behavior can shift in ways that are hard to see from a single manual test. A regression detection system makes that shift visible and gates it: if the quality drop exceeds what your policy allows, the merge is blocked. Without regression detection:
- Prompt edits ship without evidence of quality impact.
- Regressions accumulate silently until a user reports a problem.
- There is no record of what changed, when, and what the quality was before.
With regression detection in place:
- Every PR that touches prompts or datasets runs a scorecard comparison.
- Blockers fail the required status check and prevent merge.
- Warnings flag tradeoffs (e.g., cost improved but recall dropped slightly) for human review.
The baseline → scorecard → regression report flow
Establish a baseline
After your first passing eval run, use the apastra-baseline skill to save the scorecard as the baseline. The baseline is saved to derived-index/baselines/<suite-id>.json. It represents the known-good quality state your future runs are compared against.
Run a candidate eval
When you change a prompt, dataset, or policy, run the suite again. The agent produces a candidate scorecard with the same metrics — keyword_recall, pass_rate, latency, and so on.
Apply the regression policy
The agent reads promptops/policies/regression.yaml and evaluates each rule against the candidate vs baseline metrics. Rules have floors (absolute minimums), allowed deltas (how much the metric can drop from baseline), directionality, and severity.
The regression policy format
The regression policy is a YAML file at promptops/policies/regression.yaml. It declares which baseline to compare against and a list of per-metric rules:
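A minimal sketch of what this file might contain, assuming the per-metric rules live under a rules key (the key name is an assumption; the individual fields are described below):

```yaml
# promptops/policies/regression.yaml (illustrative sketch, not a verbatim schema)
baseline: prod-current            # named reference; resolves to an immutable run ID or digest

rules:                            # key name assumed for illustration
  - metric: keyword_recall
    floor: 0.5                    # absolute minimum, regardless of baseline
    allowed_delta: 0.1            # candidate may drop at most 0.1 below baseline
    direction: higher_is_better
    severity: blocker

  - metric: latency
    allowed_delta: 0.2            # candidate may rise at most 0.2 above baseline
    direction: lower_is_better
    severity: warning
```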
Policy semantics
baseline — the named reference to compare against. The default is "prod-current", which resolves to the promotion record for the current production version. Baseline references resolve to immutable run IDs or digests, not “latest”.
metric — the metric name from the scorecard to evaluate. Must match a metric produced by one of the suite’s evaluators.
floor — an absolute minimum. The candidate metric must be at or above this value regardless of what the baseline was. A floor of 0.5 means keyword recall must never fall below 50%, even if the baseline was already low.
allowed_delta — how much the metric can drop from the baseline value before the rule fails. A delta of 0.1 means the candidate can be up to 0.1 below the baseline. For higher_is_better metrics, the candidate must be at least baseline minus allowed_delta; for lower_is_better metrics (e.g., latency, cost), the candidate must be at most baseline plus allowed_delta. A worked example follows these definitions.
direction — higher_is_better or lower_is_better. This determines whether the delta is applied as a minimum or maximum.
severity — blocker or warning.
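To make the arithmetic concrete, here is a worked sketch of how a single rule evaluates, assuming a baseline keyword_recall of 0.80 (the values are made up):

```yaml
# Illustrative rule; evaluation shown in comments
- metric: keyword_recall
  floor: 0.5
  allowed_delta: 0.1
  direction: higher_is_better
  severity: blocker
# Baseline: 0.80. The candidate must pass both checks:
#   delta check: candidate >= 0.80 - 0.1 = 0.70
#   floor check: candidate >= 0.50
# A candidate of 0.85 passes; 0.65 fails the delta check and, as a blocker, blocks the merge.
```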
Blockers vs warnings
| Severity | Effect |
|---|---|
| blocker | Fails the required status check. Merge is blocked until the issue is resolved. |
| warning | Surfaced in the regression report and PR summary, but does not block merge. Requires human review. |
The regression report output
The regression report summarizes the candidate vs baseline comparison for every metric. It is written to reports/regression_report.json and rendered as a step summary in GitHub Actions:
| Metric | Status | Candidate | Baseline | Delta | Message |
|---|---|---|---|---|---|
| keyword_recall | pass | 0.85 | 0.80 | +0.05 | Within allowed delta |
| pass_rate | fail | 0.55 | 0.80 | -0.25 | Below floor (0.5) and delta exceeded |
Handling a detected regression
When a regression blocks your PR, you have three options:
- Iterate on the prompt
- Widen the allowed delta
- Override with human signoff
Iterating on the prompt is the preferred path. Look at the per-case results in promptops/runs/<run-id>/cases.jsonl to understand which cases failed and why. Revise the prompt, re-run the suite, and check the new scorecard.
Suite tiers and when to use each
Apastra defines four suite tiers. Each tier has a different purpose, cost, and trigger:
| Tier | Purpose | Size/cost | Trigger | Output expectation |
|---|---|---|---|---|
| Smoke | Fast sanity checks | Small (5–20 cases) | Local + every PR | Deterministic, quick pass/fail |
| Regression | Protect known failure modes | Medium (20–100 cases) | PRs touching prompts/policies | Evidence-heavy regression report |
| Full | Broader coverage | Large (100+ cases) | Nightly or on-demand | Trend analysis, drift detection |
| Release candidate | Ship gate | Large + holdout set | Pre-release / promotion to prod | Highest rigor, human signoff |
A suite’s tier determines how strictly regressions gate promotion. Release-candidate suites should reference a holdout dataset that the prompt was never tuned against, to prevent benchmark gaming.
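As an illustration of how a suite might declare its tier and a holdout dataset, here is a hypothetical sketch (every field name except tier is an assumption, not the actual schema):

```yaml
# Hypothetical release-candidate suite, illustrative only
id: support-bot-rc                             # assumed identifier field
tier: release-candidate
dataset: datasets/holdout/support_rc.jsonl     # holdout set the prompt was never tuned against
```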
Variance, flakiness, and trials
Non-deterministic model outputs mean that a single run can pass or fail by chance. Use trials to run each case multiple times and record variance:
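A minimal sketch of where this might sit in a suite config (the surrounding fields are assumptions; only trials comes from the text below):

```yaml
# Illustrative only; the exact location of trials in your config may differ
id: support-bot-regression       # assumed identifier field
trials: 3                        # run each case three times; the scorecard records mean and variance
```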
With trials: 3, each case runs three times. The scorecard records the mean and variance. Your regression policy evaluates the mean, so a single unlucky output doesn’t trigger a blocker.
Quarantining flaky cases
If a specific case is consistently flaky (passes ~50% of the time regardless of prompt changes), quarantine it:
- Identify the flaky case from cases.jsonl — look for high variance across trials.
- Move it to a separate dataset used only in nightly runs (a sketch follows this list).
- Track its flake rate over time. If it stabilizes, promote it back.
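One way to picture the nightly-only dataset mentioned above, as a hypothetical sketch (paths and field names are assumptions):

```yaml
# Hypothetical nightly-only suite for quarantined cases, illustrative only
id: support-bot-quarantine
tier: full                                       # nightly/on-demand tier
dataset: datasets/quarantine/support_flaky.jsonl
trials: 5                                        # extra trials to track the flake rate over time
```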