Installation

npx skills add BintzGavin/apastra/skills/baseline

What is a baseline?

A baseline is a snapshot of a scorecard from a known-good evaluation run. Once established, every future eval for that suite is compared against it. If a prompt change causes quality to drop beyond the allowed thresholds defined in your regression policy, the eval reports a regression.
Baselines are stored as JSON files in derived-index/baselines/. They are never deleted — when you update a baseline, the previous one is archived with a timestamp suffix.

How to invoke

Ask your agent:
“Use the apastra-baseline skill to set the current results as the baseline for [suite-name]”

Establishing a baseline

1. Locate the scorecard

Your agent finds the most recent run for the target suite in promptops/runs/. It looks for the latest directory matching <suite-id>-* and reads its scorecard.json. If no recent run exists, your agent will prompt you to run the apastra-eval skill first.
2. Create the baseline file

Your agent writes the baseline to derived-index/baselines/<suite-id>.json:
{
  "suite_id": "summarize-smoke",
  "established_at": "2026-03-11T12:00:00Z",
  "source_run": "summarize-smoke-2026-03-11-120000",
  "scorecard": {
    "normalized_metrics": {
      "keyword_recall": 0.85
    },
    "metric_definitions": {
      "keyword_recall": {
        "description": "Fraction of expected keywords found in output",
        "version": "1.0",
        "direction": "higher_is_better"
      }
    }
  }
}
3. Confirm

Your agent reports what was established:
Baseline established ✅

Suite: summarize-smoke
Source run: summarize-smoke-2026-03-11-120000
Metrics:
  keyword_recall: 0.85

Saved to: derived-index/baselines/summarize-smoke.json
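The two steps above can be sketched in Python. This is a minimal illustration of the establish flow, not the skill's actual implementation; the function name and signature are assumptions, while the directory layout and file names match those shown above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def establish_baseline(suite_id: str,
                       runs_dir: Path = Path("promptops/runs"),
                       baselines_dir: Path = Path("derived-index/baselines")) -> Path:
    """Snapshot the latest run's scorecard as the active baseline."""
    # Step 1: locate the most recent run directory matching <suite-id>-*.
    runs = sorted(runs_dir.glob(f"{suite_id}-*"))
    if not runs:
        raise FileNotFoundError(
            f"No runs found for {suite_id}; run the eval skill first.")
    latest = runs[-1]
    scorecard = json.loads((latest / "scorecard.json").read_text())

    # Step 2: write the baseline file with provenance fields.
    baseline = {
        "suite_id": suite_id,
        "established_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source_run": latest.name,
        "scorecard": scorecard,
    }
    baselines_dir.mkdir(parents=True, exist_ok=True)
    path = baselines_dir / f"{suite_id}.json"
    path.write_text(json.dumps(baseline, indent=2))
    return path
```

Sorting directory names lexicographically works here because the run-directory timestamp suffix (YYYY-MM-DD-HHMMSS) sorts chronologically.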

Updating a baseline

When you’ve verified that a prompt improvement is intentional and you want to raise the bar, you can update the baseline. Your agent follows an append-friendly model:
  1. Renames the existing baseline to <suite-id>-<timestamp>.json (for example, summarize-smoke-2026-03-10.json) as an archive
  2. Writes the new baseline to derived-index/baselines/<suite-id>.json
  3. Reports both the old and new metric values so the change is visible
Baseline updated ✅

Suite: summarize-smoke
Previous baseline: keyword_recall=0.80 (archived to summarize-smoke-2026-03-10.json)
New baseline: keyword_recall=0.85

The eval skill will now compare future runs against the new baseline.
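The archive-then-replace update can be sketched as follows. This is an illustrative sketch assuming the baseline structure shown earlier; the function name and return shape are hypothetical, but the archive naming follows the <suite-id>-<timestamp>.json convention above.

```python
import json
from datetime import date
from pathlib import Path

def update_baseline(suite_id: str, new_baseline: dict,
                    baselines_dir: Path = Path("derived-index/baselines")) -> dict:
    """Archive the current baseline, install the new one, report both."""
    active = baselines_dir / f"{suite_id}.json"
    previous = None
    if active.exists():
        previous = json.loads(active.read_text())
        # Step 1: archive with a timestamp suffix; nothing is deleted.
        archive = baselines_dir / f"{suite_id}-{date.today().isoformat()}.json"
        active.rename(archive)
    # Step 2: write the new active baseline.
    active.write_text(json.dumps(new_baseline, indent=2))
    # Step 3: return old and new metrics so the change stays visible.
    return {
        "previous": previous["scorecard"]["normalized_metrics"] if previous else None,
        "new": new_baseline["scorecard"]["normalized_metrics"],
    }
```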

Rolling back a baseline

If a regression surfaces and you need to undo a baseline update, ask your agent to restore a prior baseline:
“Use the apastra-baseline skill to roll back the summarize-smoke baseline”
Your agent copies the archived baseline file (for example, summarize-smoke-2026-03-10.json) back to derived-index/baselines/summarize-smoke.json. This promotes the prior scorecard as the active baseline without deleting any records.
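The rollback is a plain copy, which can be sketched as below. The function name is an assumption for illustration; the key point, matching the text above, is that the archive is copied rather than moved, so no record is lost.

```python
import shutil
from pathlib import Path

def rollback_baseline(suite_id: str, archive_name: str,
                      baselines_dir: Path = Path("derived-index/baselines")) -> Path:
    """Promote an archived baseline back to the active slot."""
    src = baselines_dir / archive_name
    dst = baselines_dir / f"{suite_id}.json"
    shutil.copyfile(src, dst)  # copy, not move: the archive file stays intact
    return dst
```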

Baseline file location

Active baselines are always at:
derived-index/baselines/<suite-id>.json
Archived (superseded) baselines are stored alongside them with a timestamp suffix:
derived-index/baselines/<suite-id>-<YYYY-MM-DD>.json

Relationship to regression policies

The baseline file contains the reference metrics. The regression policy (promptops/policies/regression.yaml) defines how much deviation is allowed before an eval is marked as a regression:
baseline: "prod-current"
rules:
  - metric: keyword_recall
    floor: 0.5
    allowed_delta: 0.1
    direction: higher_is_better
    severity: blocker
During an eval, your agent reads both files and compares:
  • For higher_is_better metrics: regression if candidate < (baseline − allowed_delta) or candidate < floor
  • For lower_is_better metrics: regression if candidate > (baseline + allowed_delta) or candidate > floor
Rules with severity: blocker fail the eval. Rules with severity: warning are reported but do not block.
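The two comparison bullets can be written as a single predicate. This is a sketch of the documented rule logic, assuming a rule dict shaped like one entry of regression.yaml; the function name is illustrative.

```python
def is_regression(candidate: float, baseline: float, rule: dict) -> bool:
    """Apply one regression-policy rule to a candidate metric value."""
    delta = rule["allowed_delta"]
    floor = rule["floor"]
    if rule["direction"] == "higher_is_better":
        # Regression if quality drops too far below baseline, or under the floor.
        return candidate < (baseline - delta) or candidate < floor
    # lower_is_better: regression if the value rises too far above baseline
    # or past the configured limit.
    return candidate > (baseline + delta) or candidate > floor
```

With the keyword_recall rule above (floor 0.5, allowed_delta 0.1) and a baseline of 0.85, a candidate of 0.72 is a regression (0.72 < 0.85 − 0.1), while 0.80 is within the allowed delta.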

Rules

Never establish a baseline from a failing run. A baseline represents a quality floor — if you baseline a failing scorecard, all future comparisons will be measured against a poor result.
  • Never delete a baseline — archive it with a timestamp suffix
  • Only baseline passing runs — the scorecard must have passed all suite thresholds
  • One active baseline per suite — the active baseline is always <suite-id>.json
  • Baselines are immutable once set — updating means archiving the old file and writing a new one