Installation

npx skills add BintzGavin/apastra/skills/eval

How to invoke

Ask your agent:
“Use the apastra-eval skill to run the [suite-name] suite”
For a quick eval file:
“Use the apastra-eval skill to run the summarize-quick eval”

Evaluation modes

Suite mode is the full spec/dataset/evaluator/suite pipeline. Use it for structured, reusable test suites with baseline tracking and regression detection. When you ask to run a suite (for example, “run the summarize-smoke suite”), your agent follows these steps:
1. Load the suite

Your agent reads the suite file from promptops/suites/<suite-id>.yaml and extracts:
  • datasets — list of dataset IDs to load
  • evaluators — list of evaluator IDs to apply
  • model_matrix — models to test against ("default" means the current agent’s model)
  • harness — (optional) identifier for the execution environment; auto-detected if omitted
  • trials — how many times to run each case (default: 1)
  • thresholds — minimum metric scores required to pass
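
Putting these fields together, a minimal suite file could look like the sketch below (the IDs, threshold value, and exact YAML layout here are illustrative assumptions, not taken from a real suite):

# promptops/suites/summarize-smoke.yaml (hypothetical sketch)
datasets:
  - summarize-basic          # dataset IDs to load
evaluators:
  - keyword-recall           # evaluator IDs to apply
model_matrix:
  - default                  # "default" = the current agent's model
harness: claude-code         # optional; auto-detected if omitted
trials: 1
thresholds:
  keyword_recall: 0.6        # minimum score required to pass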
2. Load dependencies

  • For each dataset ID, your agent reads promptops/datasets/<dataset-id>.jsonl (one JSON object per line).
  • For each evaluator ID, your agent reads promptops/evaluators/<evaluator-id>.yaml.
  • For the prompt being evaluated, your agent reads promptops/prompts/<prompt-id>.yaml.
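
As a sketch, a dataset line and a matching evaluator file might look like this (the exact field layout is an assumption inferred from the case format and evaluator types described below):

# promptops/datasets/summarize-basic.jsonl — one hypothetical case per line
{"case_id": "short-article", "inputs": {"article": "..."}, "expected_outputs": {"should_contain": ["keyword1", "keyword2"]}}

# promptops/evaluators/keyword-recall.yaml — hypothetical sketch
type: deterministic
metric: keyword_recall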
3. Run each case

For every case in the dataset, your agent:
  1. Renders the template — substitutes {{variable}} placeholders with values from the case’s inputs object
  2. Calls the model — sends the rendered prompt and captures the full response; if trials > 1, runs multiple times
  3. Scores the output — applies evaluators and any inline assertions on the case
Per-case results are recorded in this format:
{
  "case_id": "short-article",
  "inputs": {},
  "output": "<model response>",
  "evaluator_scores": {
    "keyword_recall": 1.0
  }
}
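
A rough sketch of this per-case loop in Python (render_template, call_model, and the evaluator mapping are hypothetical stand-ins, not part of the skill):

import re

def render_template(template: str, inputs: dict) -> str:
    # Substitute {{variable}} placeholders with values from the case's inputs
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(inputs.get(m.group(1), "")), template)

def run_case(case: dict, template: str, evaluators: dict, call_model) -> dict:
    # evaluators: assumed mapping of metric name -> scoring function
    prompt = render_template(template, case.get("inputs", {}))
    output = call_model(prompt)  # with trials > 1, repeat and record each trial
    return {
        "case_id": case["case_id"],
        "inputs": case.get("inputs", {}),
        "output": output,
        "evaluator_scores": {name: fn(output, case) for name, fn in evaluators.items()},
    }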
4. Aggregate the scorecard

Your agent averages each metric across all cases:
{
  "normalized_metrics": {
    "keyword_recall": 0.85
  },
  "metric_definitions": {
    "keyword_recall": {
      "description": "Fraction of expected keywords found in output",
      "version": "1.0",
      "direction": "higher_is_better"
    }
  },
  "variance": {}
}
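
A minimal aggregation sketch (a simple mean per metric; variance tracking across trials is omitted):

from statistics import mean

def aggregate(case_results: list[dict]) -> dict:
    # Collect every score for each metric, then average across all cases
    scores_by_metric: dict[str, list[float]] = {}
    for result in case_results:
        for name, score in result["evaluator_scores"].items():
            scores_by_metric.setdefault(name, []).append(score)
    return {"normalized_metrics": {name: mean(s) for name, s in scores_by_metric.items()}}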
5. Check thresholds

Your agent compares each metric against the suite’s thresholds. If any metric falls below its threshold, the suite fails. Results are reported like this:
Suite: summarize-smoke
Status: PASS ✅

Metrics:
  keyword_recall: 0.85 (threshold: 0.60) ✅

Per-case results:
  short-article: keyword_recall=1.00 ✅
  technical-paragraph: keyword_recall=1.00 ✅
  empty-edge-case: keyword_recall=1.00 ✅
  long-document: keyword_recall=1.00 ✅
  multi-topic: keyword_recall=0.50 ⚠️
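
The pass/fail decision itself reduces to a simple comparison, sketched here:

def check_thresholds(normalized_metrics: dict, thresholds: dict) -> bool:
    # The suite passes only if every thresholded metric meets its minimum
    return all(normalized_metrics.get(metric, 0.0) >= minimum
               for metric, minimum in thresholds.items())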
6. Compare against baseline (if one exists)

Your agent checks for a baseline at derived-index/baselines/<suite-id>.json. If a baseline exists, your agent reads the regression policy from promptops/policies/regression.yaml and compares each metric:
  • For higher_is_better metrics: fail if candidate < (baseline − allowed_delta) or candidate < floor
  • For lower_is_better metrics: fail if candidate > (baseline + allowed_delta) or candidate > floor
The regression comparison is reported like this:
Regression Report:
  Baseline: derived-index/baselines/summarize-smoke.json
  Status: PASS ✅

  keyword_recall: 0.85 (baseline: 0.80, delta: +0.05) ✅
If no baseline exists, your agent will note this and suggest running the apastra-baseline skill to establish one.
7. Save results

Your agent writes results to promptops/runs/<run-id>/:
  • scorecard.json — Aggregated metrics
  • cases.jsonl — Per-case results (one JSON object per line)
  • run_manifest.json — Metadata: timestamp, model, harness, suite ID, prompt digest
Run IDs follow the format <suite-id>-<YYYY-MM-DD-HHmmss>.
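For example, a run of summarize-smoke started at 2026-03-16 09:00:00 would be written to promptops/runs/summarize-smoke-2026-03-16-090000/.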

run_manifest.json format

Every run produces a run_manifest.json with metadata about how the eval was executed:
{
  "suite_id": "summarize-smoke",
  "timestamp": "2026-03-16T09:00:00Z",
  "model": "claude-sonnet-4-20250514",
  "harness": "claude-code",
  "prompt_digest": "sha256:abc123...",
  "status": "pass"
}

Harness identifiers

The harness field records which execution environment ran the evaluation. The same model can produce different results in different environments due to system prompts, tool availability, and context window handling.
  • claude-code — Claude Code CLI or IDE
  • antigravity — Antigravity by Google DeepMind
  • cursor — Cursor IDE agent
  • copilot — GitHub Copilot agent
  • api — Direct API call (no IDE agent)
  • github-actions — CI/CD pipeline
  • jules — Jules by Google

Evaluator types

When processing suite evaluators, your agent applies the following scoring logic:
  • deterministic with keyword_recall — Fraction of expected_outputs.should_contain keywords found in the response
  • deterministic with exact_match — 1 if output exactly matches expected, 0 otherwise
  • schema — 1 if output validates against the evaluator’s config.schema, 0 otherwise
  • judge — 0–1 score using the evaluator’s config.rubric as the grading criteria
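
For instance, keyword_recall scoring could be sketched as follows (case-insensitive substring matching is an assumption; the docs do not specify matching rules):

def keyword_recall(output: str, should_contain: list[str]) -> float:
    # Fraction of expected keywords found in the model output
    if not should_contain:
        return 1.0
    found = sum(1 for keyword in should_contain if keyword.lower() in output.lower())
    return found / len(should_contain)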

Assertion types reference

Use inline assertions on dataset cases or quick eval cases to apply per-case checks.

Deterministic assertions

  • equals — Output exactly matches value. Value: "expected string"
  • contains — Output contains substring (case-sensitive). Value: "substring"
  • icontains — Output contains substring (case-insensitive). Value: "substring"
  • contains-any — Output contains at least one value. Value: ["a", "b", "c"]
  • contains-all — Output contains every value. Value: ["x", "y", "z"]
  • regex — Output matches regex pattern. Value: "\\d{3}-\\d{4}"
  • starts-with — Output begins with value. Value: "Dear "
  • is-json — Output is valid JSON. No value needed.
  • contains-json — Output contains a JSON block. No value needed.
  • is-valid-json-schema — Output matches a JSON Schema. Value: {schema object}

Model-assisted assertions

  • similar — Semantic similarity to reference (threshold 0–1). Value: "reference text"
  • llm-rubric — AI grades output using rubric. Value: "rubric text"
  • factuality — Output is factually consistent with reference. Value: "reference facts"
  • answer-relevance — Output is relevant to the input. No value needed.

Performance assertions

  • latency — Response time in ms. Example threshold: 500
  • cost — Token cost in dollars. Example threshold: 0.01
Negate any assertion type by prepending not-. For example: not-contains, not-regex, not-is-json.
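
A sketch of how a few of the deterministic checks and the not- prefix could be evaluated (illustrative only; the skill’s actual dispatch may differ):

import json
import re

def check_assertion(kind: str, output: str, value=None) -> bool:
    # Negation: strip the not- prefix and invert the underlying check
    if kind.startswith("not-"):
        return not check_assertion(kind[4:], output, value)
    if kind == "equals":
        return output == value
    if kind == "contains":
        return value in output
    if kind == "icontains":
        return value.lower() in output.lower()
    if kind == "contains-any":
        return any(v in output for v in value)
    if kind == "contains-all":
        return all(v in output for v in value)
    if kind == "regex":
        return re.search(value, output) is not None
    if kind == "starts-with":
        return output.startswith(value)
    if kind == "is-json":
        try:
            json.loads(output)
            return True
        except ValueError:
            return False
    raise ValueError(f"unsupported assertion type: {kind}")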

Regression policy format

When a baseline exists, your agent reads promptops/policies/regression.yaml to determine allowed deltas:
baseline: "prod-current"
rules:
  - metric: keyword_recall
    floor: 0.5
    allowed_delta: 0.1
    direction: higher_is_better
    severity: blocker
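
Applied in code, a single rule from this policy reduces to the comparison described in step 6, for example:

def regression_ok(candidate: float, baseline: float, rule: dict) -> bool:
    delta, floor = rule["allowed_delta"], rule["floor"]
    if rule["direction"] == "higher_is_better":
        # Fail if the candidate drops below baseline - allowed_delta or under the floor
        return candidate >= baseline - delta and candidate >= floor
    # lower_is_better: fail if the candidate rises above baseline + allowed_delta or over the floor
    return candidate <= baseline + delta and candidate <= floor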

File reference

  • Suite: promptops/suites/<id>.yaml — Test configuration
  • Dataset: promptops/datasets/<id>.jsonl — Test cases (one JSON object per line)
  • Evaluator: promptops/evaluators/<id>.yaml — Scoring rules
  • Prompt spec: promptops/prompts/<id>.yaml — Prompt template and variables
  • Baseline: derived-index/baselines/<suite-id>.json — Known-good scorecard
  • Regression policy: promptops/policies/regression.yaml — Allowed deltas and severity rules
  • Run output: promptops/runs/<run-id>/ — Scorecard, cases, manifest
Use trials: 1 for smoke suites and trials: 3 or more for regression suites. More trials reduce variance and make regression detection more reliable.