A regression is when a prompt change causes measurable quality to drop beyond what your policy allows. Apastra detects regressions by comparing a candidate scorecard against a saved baseline, evaluated by a policy that defines per-metric rules with floors, allowed deltas, and severity.
What a regression is and why it matters
When you change a prompt — even a small wording tweak — the model’s behavior can shift in ways that are hard to see from a single manual test. A regression detection system makes that shift visible and gates it: if the quality drop exceeds what your policy allows, the merge is blocked. Without regression detection:
- Prompt edits ship without evidence of quality impact.
- Regressions accumulate silently until a user reports a problem.
- There is no record of what changed, when, and what the quality was before.
With regression detection in place:
- Every PR that touches prompts or datasets runs a scorecard comparison.
- Blockers fail the required status check and prevent merge.
- Warnings flag tradeoffs (e.g., cost improved but recall dropped slightly) for human review.
The baseline → scorecard → regression report flow
Establish a baseline
After your first passing eval run, use the apastra-baseline skill to save the scorecard as the baseline. The baseline is saved to derived-index/baselines/<suite-id>.json. It represents the known-good quality state your future runs are compared against.
Run a candidate eval
When you change a prompt, dataset, or policy, run the suite again. The agent produces a candidate scorecard with the same metrics — keyword_recall, pass_rate, latency, and so on.
Apply the regression policy
The agent reads promptops/policies/regression.yaml and evaluates each rule against the candidate vs baseline metrics. Rules have floors (absolute minimums), allowed deltas (how much the metric can drop from baseline), directionality, and severity.
The regression policy format
The regression policy is a YAML file at promptops/policies/regression.yaml. It declares which baseline to compare against and a list of per-metric rules:
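A minimal sketch of what this file might contain, assuming the per-metric rules live under a rules key (the key name is an assumption; the individual fields are described below):

```yaml
# promptops/policies/regression.yaml (illustrative sketch, not a verbatim schema)
baseline: prod-current            # named reference; resolves to an immutable run ID or digest

rules:                            # key name assumed for illustration
  - metric: keyword_recall
    floor: 0.5                    # absolute minimum, regardless of baseline
    allowed_delta: 0.1            # candidate may drop at most 0.1 below baseline
    direction: higher_is_better
    severity: blocker

  - metric: latency
    allowed_delta: 0.2            # candidate may rise at most 0.2 above baseline
    direction: lower_is_better
    severity: warning
```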
Policy semantics
baseline — the named reference to compare against. The default is "prod-current", which resolves to the promotion record for the current production version. Baseline references resolve to immutable run IDs or digests, not “latest”.
metric — the metric name from the scorecard to evaluate. Must match a metric produced by one of the suite’s evaluators.
floor — an absolute minimum. The candidate metric must be at or above this value regardless of what the baseline was. A floor of 0.5 means keyword recall must never fall below 50%, even if the baseline was already low.
allowed_delta — how much the metric can drop from the baseline value before the rule fails. A delta of 0.1 means the candidate can be up to 0.1 below the baseline. For higher_is_better metrics, the candidate must be at least baseline minus allowed_delta; for lower_is_better metrics (e.g., latency, cost), the candidate must be at most baseline plus allowed_delta. A worked example follows these definitions.
direction — higher_is_better or lower_is_better. This determines whether the delta is applied as a minimum or maximum.
severity — blocker or warning.
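To make the arithmetic concrete, here is a worked sketch of how a single rule evaluates, assuming a baseline keyword_recall of 0.80 (the values are made up):

```yaml
# Illustrative rule; evaluation shown in comments
- metric: keyword_recall
  floor: 0.5
  allowed_delta: 0.1
  direction: higher_is_better
  severity: blocker
# Baseline: 0.80. The candidate must pass both checks:
#   delta check: candidate >= 0.80 - 0.1 = 0.70
#   floor check: candidate >= 0.50
# A candidate of 0.85 passes; 0.65 fails the delta check and, as a blocker, blocks the merge.
```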
Blockers vs warnings
| Severity | Effect |
|---|---|
| blocker | Fails the required status check. Merge is blocked until the issue is resolved. |
| warning | Surfaced in the regression report and PR summary, but does not block merge. Requires human review. |
The regression report output
The regression report summarizes the candidate vs baseline comparison for every metric. It is written to reports/regression_report.json and rendered as a step summary in GitHub Actions:
| Metric | Status | Candidate | Baseline | Delta | Message |
|---|---|---|---|---|---|
| keyword_recall | pass | 0.85 | 0.80 | +0.05 | Within allowed delta |
| pass_rate | fail | 0.55 | 0.80 | -0.25 | Below floor (0.5) and delta exceeded |
Handling a detected regression
When a regression blocks your PR, you have three options:
- Iterate on the prompt
- Widen the allowed delta
- Override with human signoff
Iterating on the prompt is the preferred path. Look at the per-case results in promptops/runs/<run-id>/cases.jsonl to understand which cases failed and why. Revise the prompt, re-run the suite, and check the new scorecard.
Suite tiers and when to use each
Apastra defines four suite tiers. Each tier has a different purpose, cost, and trigger:
| Tier | Purpose | Size/cost | Trigger | Output expectation |
|---|---|---|---|---|
| Smoke | Fast sanity checks | Small (5–20 cases) | Local + every PR | Deterministic, quick pass/fail |
| Regression | Protect known failure modes | Medium (20–100 cases) | PRs touching prompts/policies | Evidence-heavy regression report |
| Full | Broader coverage | Large (100+ cases) | Nightly or on-demand | Trend analysis, drift detection |
| Release candidate | Ship gate | Large + holdout set | Pre-release / promotion to prod | Highest rigor, human signoff |
A suite’s tier determines how strictly regressions gate promotion. Release-candidate suites should reference a holdout dataset that the prompt was never tuned against, to prevent benchmark gaming.
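As an illustration of how a suite might declare its tier and a holdout dataset, here is a hypothetical sketch (every field name except tier is an assumption, not the actual schema):

```yaml
# Hypothetical release-candidate suite, illustrative only
id: support-bot-rc                             # assumed identifier field
tier: release-candidate
dataset: datasets/holdout/support_rc.jsonl     # holdout set the prompt was never tuned against
```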
Variance, flakiness, and trials
Non-deterministic model outputs mean that a single run can pass or fail by chance. Use trials to run each case multiple times and record variance:
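A minimal sketch of where this might sit in a suite config (the surrounding fields are assumptions; only trials comes from the text below):

```yaml
# Illustrative only; the exact location of trials in your config may differ
id: support-bot-regression       # assumed identifier field
trials: 3                        # run each case three times; the scorecard records mean and variance
```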
With trials: 3, each case runs three times. The scorecard records the mean and variance. Your regression policy evaluates the mean, so a single unlucky output doesn’t trigger a blocker.
Quarantining flaky cases
If a specific case is consistently flaky (passes ~50% of the time regardless of prompt changes), quarantine it:
- Identify the flaky case from cases.jsonl — look for high variance across trials.
- Move it to a separate dataset used only in nightly runs (a sketch follows this list).
- Track its flake rate over time. If it stabilizes, promote it back.
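One way to picture the nightly-only dataset mentioned above, as a hypothetical sketch (paths and field names are assumptions):

```yaml
# Hypothetical nightly-only suite for quarantined cases, illustrative only
id: support-bot-quarantine
tier: full                                       # nightly/on-demand tier
dataset: datasets/quarantine/support_flaky.jsonl
trials: 5                                        # extra trials to track the flake rate over time
```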