Installation

npx skills add BintzGavin/apastra/skills/baseline

What is a baseline?

A baseline is a snapshot of a scorecard from a known-good evaluation run. Once established, every future eval for that suite is compared against it. If a prompt change causes quality to drop beyond the allowed thresholds defined in your regression policy, the eval reports a regression.
Baselines are stored as JSON files in derived-index/baselines/. They are never deleted — when you update a baseline, the previous one is archived with a timestamp suffix.

How to invoke

Ask your agent:
“Use the apastra-baseline skill to set the current results as the baseline for [suite-name]”

Establishing a baseline

1. Locate the scorecard

Your agent finds the most recent run for the target suite in promptops/runs/. It looks for the latest directory matching <suite-id>-* and reads its scorecard.json. If no recent run exists, your agent will prompt you to run the apastra-eval skill first.
2. Create the baseline file

Your agent writes the baseline to derived-index/baselines/<suite-id>.json:
{
  "suite_id": "summarize-smoke",
  "established_at": "2026-03-11T12:00:00Z",
  "source_run": "summarize-smoke-2026-03-11-120000",
  "scorecard": {
    "normalized_metrics": {
      "keyword_recall": 0.85
    },
    "metric_definitions": {
      "keyword_recall": {
        "description": "Fraction of expected keywords found in output",
        "version": "1.0",
        "direction": "higher_is_better"
      }
    }
  }
}
3. Confirm

Your agent reports what was established:
Baseline established ✅

Suite: summarize-smoke
Source run: summarize-smoke-2026-03-11-120000
Metrics:
  keyword_recall: 0.85

Saved to: derived-index/baselines/summarize-smoke.json
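The two steps above can be sketched in Python. This is a minimal illustration of the establish flow, not the skill's actual implementation; the function name and signature are assumptions, while the directory layout and file names match those shown above.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def establish_baseline(suite_id: str,
                       runs_dir: Path = Path("promptops/runs"),
                       baselines_dir: Path = Path("derived-index/baselines")) -> Path:
    """Snapshot the latest run's scorecard as the active baseline."""
    # Step 1: locate the most recent run directory matching <suite-id>-*.
    runs = sorted(runs_dir.glob(f"{suite_id}-*"))
    if not runs:
        raise FileNotFoundError(
            f"No runs found for {suite_id}; run the eval skill first.")
    latest = runs[-1]
    scorecard = json.loads((latest / "scorecard.json").read_text())

    # Step 2: write the baseline file with provenance fields.
    baseline = {
        "suite_id": suite_id,
        "established_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "source_run": latest.name,
        "scorecard": scorecard,
    }
    baselines_dir.mkdir(parents=True, exist_ok=True)
    path = baselines_dir / f"{suite_id}.json"
    path.write_text(json.dumps(baseline, indent=2))
    return path
```

Sorting directory names lexicographically works here because the run-directory timestamp suffix (YYYY-MM-DD-HHMMSS) sorts chronologically.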

Updating a baseline

When you’ve verified that a prompt improvement is intentional and you want to raise the bar, you can update the baseline. Your agent follows an append-friendly model:
  1. Renames the existing baseline to <suite-id>-<timestamp>.json (for example, summarize-smoke-2026-03-10.json) as an archive
  2. Writes the new baseline to derived-index/baselines/<suite-id>.json
  3. Reports both the old and new metric values so the change is visible
Baseline updated ✅

Suite: summarize-smoke
Previous baseline: keyword_recall=0.80 (archived to summarize-smoke-2026-03-10.json)
New baseline: keyword_recall=0.85

The eval skill will now compare future runs against the new baseline.
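The archive-then-replace update can be sketched as follows. This is an illustrative sketch assuming the baseline structure shown earlier; the function name and return shape are hypothetical, but the archive naming follows the <suite-id>-<timestamp>.json convention above.

```python
import json
from datetime import date
from pathlib import Path

def update_baseline(suite_id: str, new_baseline: dict,
                    baselines_dir: Path = Path("derived-index/baselines")) -> dict:
    """Archive the current baseline, install the new one, report both."""
    active = baselines_dir / f"{suite_id}.json"
    previous = None
    if active.exists():
        previous = json.loads(active.read_text())
        # Step 1: archive with a timestamp suffix; nothing is deleted.
        archive = baselines_dir / f"{suite_id}-{date.today().isoformat()}.json"
        active.rename(archive)
    # Step 2: write the new active baseline.
    active.write_text(json.dumps(new_baseline, indent=2))
    # Step 3: return old and new metrics so the change stays visible.
    return {
        "previous": previous["scorecard"]["normalized_metrics"] if previous else None,
        "new": new_baseline["scorecard"]["normalized_metrics"],
    }
```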

Rolling back a baseline

If a regression surfaces and you need to undo a baseline update, ask your agent to restore a prior baseline:
“Use the apastra-baseline skill to roll back the summarize-smoke baseline”
Your agent copies the archived baseline file (for example, summarize-smoke-2026-03-10.json) back to derived-index/baselines/summarize-smoke.json. This promotes the prior scorecard as the active baseline without deleting any records.
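The rollback is a plain copy, which can be sketched as below. The function name is an assumption for illustration; the key point, matching the text above, is that the archive is copied rather than moved, so no record is lost.

```python
import shutil
from pathlib import Path

def rollback_baseline(suite_id: str, archive_name: str,
                      baselines_dir: Path = Path("derived-index/baselines")) -> Path:
    """Promote an archived baseline back to the active slot."""
    src = baselines_dir / archive_name
    dst = baselines_dir / f"{suite_id}.json"
    shutil.copyfile(src, dst)  # copy, not move: the archive file stays intact
    return dst
```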

Baseline file location

Active baselines are always at:
derived-index/baselines/<suite-id>.json
Archived (superseded) baselines are stored alongside them with a timestamp suffix:
derived-index/baselines/<suite-id>-<YYYY-MM-DD>.json

Relationship to regression policies

The baseline file contains the reference metrics. The regression policy (promptops/policies/regression.yaml) defines how much deviation is allowed before an eval is marked as a regression:
baseline: "prod-current"
rules:
  - metric: keyword_recall
    floor: 0.5
    allowed_delta: 0.1
    direction: higher_is_better
    severity: blocker
During an eval, your agent reads both files and compares:
  • For higher_is_better metrics: regression if candidate < (baseline − allowed_delta) or candidate < floor
  • For lower_is_better metrics: regression if candidate > (baseline + allowed_delta) or candidate > floor
Rules with severity: blocker fail the eval. Rules with severity: warning are reported but do not block.
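The two comparison bullets can be written as a single predicate. This is a sketch of the documented rule logic, assuming a rule dict shaped like one entry of regression.yaml; the function name is illustrative.

```python
def is_regression(candidate: float, baseline: float, rule: dict) -> bool:
    """Apply one regression-policy rule to a candidate metric value."""
    delta = rule["allowed_delta"]
    floor = rule["floor"]
    if rule["direction"] == "higher_is_better":
        # Regression if quality drops too far below baseline, or under the floor.
        return candidate < (baseline - delta) or candidate < floor
    # lower_is_better: regression if the value rises too far above baseline
    # or past the configured limit.
    return candidate > (baseline + delta) or candidate > floor
```

With the keyword_recall rule above (floor 0.5, allowed_delta 0.1) and a baseline of 0.85, a candidate of 0.72 is a regression (0.72 < 0.85 − 0.1), while 0.80 is within the allowed delta.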

Rules

Never establish a baseline from a failing run. A baseline represents a quality floor — if you baseline a failing scorecard, all future comparisons will be measured against a poor result.
  • Never delete a baseline — archive it with a timestamp suffix
  • Only baseline passing runs — the scorecard must have passed all suite thresholds
  • One active baseline per suite — the active baseline is always <suite-id>.json
  • Baselines are immutable once set — updating means archiving the old file and writing a new one