
A regression is when a prompt change causes measurable quality to drop beyond what your policy allows. Apastra detects regressions by comparing a candidate scorecard against a saved baseline, evaluated by a policy that defines per-metric rules with floors, allowed deltas, and severity.

What a regression is and why it matters

When you change a prompt — even a small wording tweak — the model’s behavior can shift in ways that are hard to see from a single manual test. A regression detection system makes that shift visible and gates it: if the quality drop exceeds what your policy allows, the merge is blocked. Without regression detection:
  • Prompt edits ship without evidence of quality impact.
  • Regressions accumulate silently until a user reports a problem.
  • There is no record of what changed, when, and what the quality was before.
With regression detection:
  • Every PR that touches prompts or datasets runs a scorecard comparison.
  • Blockers fail the required status check and prevent merge.
  • Warnings flag tradeoffs (e.g., cost improved but recall dropped slightly) for human review.

The baseline → scorecard → regression report flow

1. Establish a baseline

After your first passing eval run, use the apastra-baseline skill to save the scorecard as the baseline:
> "Use the apastra-baseline skill to set the current results as the baseline"
The baseline is saved to derived-index/baselines/<suite-id>.json. It represents the known-good quality state your future runs are compared against.
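For illustration, here is a minimal Python sketch of what saving a baseline could look like. The file layout, the metric names, and the save_baseline helper are assumptions for this example, not the format the apastra-baseline skill actually writes.

import json
from pathlib import Path

def save_baseline(suite_id: str, scorecard: dict, root: str = "derived-index/baselines") -> Path:
    # Persist a scorecard as the named baseline for a suite (illustrative layout, not the official schema).
    path = Path(root) / f"{suite_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(scorecard, indent=2))
    return path

# Hypothetical scorecard from a passing run of the summarize-smoke suite.
save_baseline("summarize-smoke", {"keyword_recall": 0.80, "pass_rate": 0.80, "latency_ms_p50": 1200})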
2. Run a candidate eval

When you change a prompt, dataset, or policy, run the suite again. The agent produces a candidate scorecard with the same metrics — keyword_recall, pass_rate, latency, and so on.
3. Apply the regression policy

The agent reads promptops/policies/regression.yaml and evaluates each rule against the candidate vs baseline metrics. Rules have floors (absolute minimums), allowed deltas (how much the metric can drop from baseline), directionality, and severity.
4. Produce the regression report

The agent emits a regression report with a pass or fail status, per-metric evidence, deltas, and messages for any failing rules. In CI, this report is saved to reports/regression_report.json on the promptops-artifacts branch and surfaced as a required status check.

The regression policy format

The regression policy is a YAML file at promptops/policies/regression.yaml. It declares which baseline to compare against and a list of per-metric rules:
baseline: "prod-current"
rules:
  - metric: keyword_recall
    floor: 0.5
    allowed_delta: 0.1
    direction: higher_is_better
    severity: blocker
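As a rough sketch of how a tool might consume this file, the snippet below loads it with PyYAML and iterates over the rules. The defaults shown (falling back to "prod-current", treating a missing severity as blocker) are assumptions for the example, not documented loader behavior.

import yaml  # requires the pyyaml package

with open("promptops/policies/regression.yaml") as f:
    policy = yaml.safe_load(f)

baseline_name = policy.get("baseline", "prod-current")  # assumed default
for rule in policy["rules"]:
    # Assumed defaults for optional fields; the real loader may differ.
    print(
        rule["metric"],
        rule.get("floor"),
        rule.get("allowed_delta", 0.0),
        rule.get("direction", "higher_is_better"),
        rule.get("severity", "blocker"),
    )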

Policy semantics

  • baseline — the named reference to compare against. The default is "prod-current", which resolves to the promotion record for the current production version. Baseline references resolve to immutable run IDs or digests, not “latest”.
  • metric — the metric name from the scorecard to evaluate. Must match a metric produced by one of the suite’s evaluators.
  • floor — an absolute minimum. The candidate metric must be at or above this value regardless of what the baseline was. A floor of 0.5 means keyword recall must never fall below 50%, even if the baseline was already low. For lower_is_better metrics, the floor acts as an absolute maximum instead.
  • allowed_delta — how much the metric can drop from the baseline value before the rule fails. A delta of 0.1 means the candidate can be up to 0.1 below the baseline.
    For higher_is_better metrics:
    pass if: candidate >= (baseline - allowed_delta) AND candidate >= floor
    For lower_is_better metrics (e.g., latency, cost):
    pass if: candidate <= (baseline + allowed_delta) AND candidate <= floor
  • direction — higher_is_better or lower_is_better. This determines whether the delta and floor are applied as minimums or maximums.
  • severity — blocker or warning.
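Expressed as a small Python helper, the two pass conditions look roughly like this. The field names follow the YAML above; the rule_passes function itself is illustrative, not Apastra’s implementation.

def rule_passes(candidate: float, baseline: float, rule: dict) -> bool:
    # Apply a single rule using the pass conditions described above.
    floor = rule.get("floor")
    delta = rule.get("allowed_delta", 0.0)
    if rule.get("direction", "higher_is_better") == "higher_is_better":
        ok = candidate >= baseline - delta
        if floor is not None:
            ok = ok and candidate >= floor
    else:  # lower_is_better: the floor acts as an absolute maximum
        ok = candidate <= baseline + delta
        if floor is not None:
            ok = ok and candidate <= floor
    return ok

rule = {"metric": "keyword_recall", "floor": 0.5, "allowed_delta": 0.1,
        "direction": "higher_is_better", "severity": "blocker"}
print(rule_passes(0.85, 0.80, rule))  # True: within the allowed delta and above the floor
print(rule_passes(0.55, 0.80, rule))  # False: 0.55 is more than 0.1 below the 0.80 baseline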

Blockers vs warnings

Severity | Effect
blocker  | Fails the required status check. Merge is blocked until the issue is resolved.
warning  | Surfaced in the regression report and PR summary, but does not block merge. Requires human review.
Use blockers for metrics that must never regress (correctness, safety, schema compliance). Use warnings for tradeoff metrics where a small drop might be acceptable in exchange for improved cost or latency.
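As a sketch of how severities might translate into gating behavior, assume each evaluated rule yields a passed flag plus its severity; the gate helper and result shape below are illustrative only.

def gate(results: list[dict]) -> int:
    # Fail the required status check only when a blocker-severity rule failed.
    blockers = [r for r in results if not r["passed"] and r["severity"] == "blocker"]
    warnings = [r for r in results if not r["passed"] and r["severity"] == "warning"]
    for w in warnings:
        print(f"warning: {w['metric']} regressed; review the tradeoff before merging")
    for b in blockers:
        print(f"blocker: {b['metric']} regressed; merge is blocked")
    return 1 if blockers else 0  # a non-zero exit fails the CI job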

The regression report output

The regression report summarizes the candidate vs baseline comparison for every metric:
Regression Report:
  Baseline: derived-index/baselines/summarize-smoke.json
  Status: PASS ✅ (or REGRESSION DETECTED ❌)

  keyword_recall: 0.85 (baseline: 0.80, delta: +0.05) ✅
In CI, the report is written to reports/regression_report.json and rendered as a step summary in GitHub Actions:
Metric         | Status | Candidate | Baseline | Delta | Message
keyword_recall | pass   | 0.85      | 0.80     | +0.05 | Within allowed delta
pass_rate      | fail   | 0.55      | 0.80     | -0.25 | Below floor (0.5) and delta exceeded
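To make the CI side concrete, here is a hedged sketch of writing the report file and a GitHub Actions step summary. The JSON shape is illustrative rather than Apastra’s exact schema; GITHUB_STEP_SUMMARY is the standard Actions mechanism for rendering Markdown on the run page.

import json
import os
from pathlib import Path

report = {
    "baseline": "derived-index/baselines/summarize-smoke.json",
    "status": "fail",
    "metrics": [
        {"metric": "keyword_recall", "status": "pass", "candidate": 0.85, "baseline": 0.80, "delta": 0.05},
        {"metric": "pass_rate", "status": "fail", "candidate": 0.55, "baseline": 0.80, "delta": -0.25},
    ],
}

Path("reports").mkdir(exist_ok=True)
Path("reports/regression_report.json").write_text(json.dumps(report, indent=2))

summary_path = os.environ.get("GITHUB_STEP_SUMMARY")
if summary_path:
    rows = [
        f"| {m['metric']} | {m['status']} | {m['candidate']} | {m['baseline']} | {m['delta']:+} |"
        for m in report["metrics"]
    ]
    with open(summary_path, "a") as f:
        f.write("| Metric | Status | Candidate | Baseline | Delta |\n")
        f.write("|---|---|---|---|---|\n")
        f.write("\n".join(rows) + "\n")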

Handling a detected regression

When a regression blocks your PR, you have three options. The preferred path is to fix the prompt: look at the per-case results in promptops/runs/<run-id>/cases.jsonl to understand which cases failed and why, revise the prompt, re-run the suite, and check the new scorecard.
> "Use the apastra-eval skill to run the summarize-smoke suite"

Suite tiers and when to use each

Apastra defines four suite tiers. Each tier has a different purpose, cost, and trigger:
Tier              | Purpose                     | Size/cost             | Trigger                          | Output expectation
Smoke             | Fast sanity checks          | Small (5–20 cases)    | Local + every PR                 | Deterministic, quick pass/fail
Regression        | Protect known failure modes | Medium (20–100 cases) | PRs touching prompts/policies    | Evidence-heavy regression report
Full              | Broader coverage            | Large (100+ cases)    | Nightly or on-demand             | Trend analysis, drift detection
Release candidate | Ship gate                   | Large + holdout set   | Pre-release / promotion to prod  | Highest rigor, human signoff
Declare the tier on your suite spec:
id: summarize-regression
name: Summarize Regression Suite
tier: regression
datasets: [summarize-smoke, summarize-edge-cases]
evaluators: [keyword-check, schema-check]
model_matrix: [default]
trials: 3
thresholds:
  keyword_recall: 0.6
A suite’s tier determines how strictly regressions gate promotion. Release-candidate suites should reference a holdout dataset that the prompt was never tuned against, to prevent benchmark gaming.

Variance, flakiness, and trials

Non-deterministic model outputs mean that a single run can pass or fail by chance. Use trials to run each case multiple times and record variance:
trials: 3
With trials: 3, each case runs three times. The scorecard records the mean and variance. Your regression policy evaluates the mean — so a single unlucky output doesn’t trigger a blocker.
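The aggregation itself is simple. Here is a sketch under the assumption that per-case trial scores are available as lists of floats; the trial_scores shape and the aggregate_trials helper are not Apastra’s internal format.

from statistics import mean, pvariance

def aggregate_trials(trial_scores: dict[str, list[float]]) -> dict[str, dict[str, float]]:
    # Collapse per-trial scores into the per-case mean and variance the policy evaluates.
    return {case_id: {"mean": mean(s), "variance": pvariance(s)} for case_id, s in trial_scores.items()}

# Hypothetical keyword_recall scores for two cases run with trials: 3.
print(aggregate_trials({"case-001": [1.0, 1.0, 1.0], "case-017": [1.0, 0.0, 1.0]}))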

Quarantining flaky cases

If a specific case is consistently flaky (passes ~50% of the time regardless of prompt changes), quarantine it:
  1. Identify the flaky case from cases.jsonl — look for high variance across trials.
  2. Move it to a separate dataset used only in nightly runs.
  3. Track its flake rate over time. If it stabilizes, promote it back.
Do not dismiss flaky cases as “random noise”. A failure mode that is intermittent in CI is a bug waiting to surface in production. Track it explicitly or remove it from the gate suite.
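One way to spot candidates for quarantine is to scan cases.jsonl for high variance across trials. The record fields used below (trial_scores, case_id) are assumptions about the file’s contents, so adjust them to the real schema.

import json
from statistics import pvariance

def find_flaky_cases(cases_path: str, variance_threshold: float = 0.2) -> list[str]:
    # Flag cases whose trial scores vary widely; field names are assumed, not Apastra's documented schema.
    flaky = []
    with open(cases_path) as f:
        for line in f:
            case = json.loads(line)
            scores = case.get("trial_scores", [])
            if len(scores) > 1 and pvariance(scores) >= variance_threshold:
                flaky.append(case["case_id"])
    return flaky

# e.g. find_flaky_cases("promptops/runs/<run-id>/cases.jsonl")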

Preventing overfitting

  • Holdout sets: Maintain a dataset that your prompt has never been tuned against. Reserve it for release-candidate runs. This guards against the pattern where a prompt passes all tests but fails on new real-world inputs.
  • Incident-driven regression suites: When a prompt fails in production, add the failing input as a new test case in a “never again” regression suite. Over time this suite becomes the most valuable part of your test coverage — it encodes real failures, not hypotheticals.
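A small sketch of the incident-driven workflow: append the failing production input to a dedicated dataset file. The dataset path and record fields here are hypothetical; use whatever format your existing datasets follow.

import json
from pathlib import Path

def add_incident_case(dataset_path: str, case_id: str, input_text: str, expected_keywords: list[str]) -> None:
    # Append one "never again" case as a JSONL record (illustrative fields).
    record = {"id": case_id, "input": input_text, "expected_keywords": expected_keywords}
    path = Path(dataset_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(record) + "\n")

add_incident_case(
    "promptops/datasets/never-again.jsonl",  # hypothetical dataset location
    "incident-empty-summary",
    "the exact production input that produced an empty summary",
    ["summary"],
)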