Installation

npx skills add BintzGavin/apastra/skills/eval

How to invoke

Ask your agent:
“Use the apastra-eval skill to run the [suite-name] suite”
For a quick eval file:
“Use the apastra-eval skill to run the summarize-quick eval”

Evaluation modes

Suite mode is the full spec/dataset/evaluator/suite pipeline. Use it for structured, reusable test suites with baseline tracking and regression detection. When you ask to run a suite (for example, “run the summarize-smoke suite”), your agent follows these steps:
1. Load the suite

Your agent reads the suite file from promptops/suites/<suite-id>.yaml and extracts:
  • datasets — list of dataset IDs to load
  • evaluators — list of evaluator IDs to apply
  • model_matrix — models to test against ("default" means the current agent’s model)
  • harness — (optional) identifier for the execution environment; auto-detected if omitted
  • trials — how many times to run each case (default: 1)
  • thresholds — minimum metric scores required to pass
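
Putting these fields together, a minimal suite file could look like the sketch below (the IDs, threshold value, and exact YAML layout here are illustrative assumptions, not taken from a real suite):

# promptops/suites/summarize-smoke.yaml (hypothetical sketch)
datasets:
  - summarize-basic          # dataset IDs to load
evaluators:
  - keyword-recall           # evaluator IDs to apply
model_matrix:
  - default                  # "default" = the current agent's model
harness: claude-code         # optional; auto-detected if omitted
trials: 1
thresholds:
  keyword_recall: 0.6        # minimum score required to pass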
2. Load dependencies

  • For each dataset ID, your agent reads promptops/datasets/<dataset-id>.jsonl (one JSON object per line).
  • For each evaluator ID, your agent reads promptops/evaluators/<evaluator-id>.yaml.
  • For the prompt being evaluated, your agent reads promptops/prompts/<prompt-id>.yaml.
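
As a sketch, a dataset line and a matching evaluator file might look like this (the exact field layout is an assumption inferred from the case format and evaluator types described below):

# promptops/datasets/summarize-basic.jsonl — one hypothetical case per line
{"case_id": "short-article", "inputs": {"article": "..."}, "expected_outputs": {"should_contain": ["keyword1", "keyword2"]}}

# promptops/evaluators/keyword-recall.yaml — hypothetical sketch
type: deterministic
metric: keyword_recall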
3. Run each case

For every case in the dataset, your agent:
  1. Renders the template — substitutes {{variable}} placeholders with values from the case’s inputs object
  2. Calls the model — sends the rendered prompt and captures the full response; if trials > 1, runs multiple times
  3. Scores the output — applies evaluators and any inline assertions on the case
Per-case results are recorded in this format:
{
  "case_id": "short-article",
  "inputs": {},
  "output": "<model response>",
  "evaluator_scores": {
    "keyword_recall": 1.0
  }
}
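
A rough sketch of this per-case loop in Python (render_template, call_model, and the evaluator mapping are hypothetical stand-ins, not part of the skill):

import re

def render_template(template: str, inputs: dict) -> str:
    # Substitute {{variable}} placeholders with values from the case's inputs
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(inputs.get(m.group(1), "")), template)

def run_case(case: dict, template: str, evaluators: dict, call_model) -> dict:
    # evaluators: assumed mapping of metric name -> scoring function
    prompt = render_template(template, case.get("inputs", {}))
    output = call_model(prompt)  # with trials > 1, repeat and record each trial
    return {
        "case_id": case["case_id"],
        "inputs": case.get("inputs", {}),
        "output": output,
        "evaluator_scores": {name: fn(output, case) for name, fn in evaluators.items()},
    }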
4. Aggregate the scorecard

Your agent averages each metric across all cases:
{
  "normalized_metrics": {
    "keyword_recall": 0.85
  },
  "metric_definitions": {
    "keyword_recall": {
      "description": "Fraction of expected keywords found in output",
      "version": "1.0",
      "direction": "higher_is_better"
    }
  },
  "variance": {}
}
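
A minimal aggregation sketch (a simple mean per metric; variance tracking across trials is omitted):

from statistics import mean

def aggregate(case_results: list[dict]) -> dict:
    # Collect every score for each metric, then average across all cases
    scores_by_metric: dict[str, list[float]] = {}
    for result in case_results:
        for name, score in result["evaluator_scores"].items():
            scores_by_metric.setdefault(name, []).append(score)
    return {"normalized_metrics": {name: mean(s) for name, s in scores_by_metric.items()}}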
5. Check thresholds

Your agent compares each metric against the suite’s thresholds. If any metric falls below its threshold, the suite fails. Results are reported like this:
Suite: summarize-smoke
Status: PASS ✅

Metrics:
  keyword_recall: 0.85 (threshold: 0.60) ✅

Per-case results:
  short-article: keyword_recall=1.00 ✅
  technical-paragraph: keyword_recall=1.00 ✅
  empty-edge-case: keyword_recall=1.00 ✅
  long-document: keyword_recall=1.00 ✅
  multi-topic: keyword_recall=0.50 ⚠️
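
The pass/fail decision itself reduces to a simple comparison, sketched here:

def check_thresholds(normalized_metrics: dict, thresholds: dict) -> bool:
    # The suite passes only if every thresholded metric meets its minimum
    return all(normalized_metrics.get(metric, 0.0) >= minimum
               for metric, minimum in thresholds.items())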
6. Compare against baseline (if one exists)

Your agent checks for a baseline at derived-index/baselines/<suite-id>.json. If a baseline exists, your agent reads the regression policy from promptops/policies/regression.yaml and compares each metric:
  • For higher_is_better metrics: fail if candidate < (baseline − allowed_delta) or candidate < floor
  • For lower_is_better metrics: fail if candidate > (baseline + allowed_delta) or candidate > floor
The regression comparison is reported like this:
Regression Report:
  Baseline: derived-index/baselines/summarize-smoke.json
  Status: PASS ✅

  keyword_recall: 0.85 (baseline: 0.80, delta: +0.05) ✅
If no baseline exists, your agent will note this and suggest running the apastra-baseline skill to establish one.
7. Save results

Your agent writes results to promptops/runs/<run-id>/:
  • scorecard.json — Aggregated metrics
  • cases.jsonl — Per-case results (one JSON object per line)
  • run_manifest.json — Metadata: timestamp, model, harness, suite ID, prompt digest
Run IDs follow the format <suite-id>-<YYYY-MM-DD-HHmmss>.
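For example, a run of summarize-smoke started at 2026-03-16 09:00:00 would be written to promptops/runs/summarize-smoke-2026-03-16-090000/.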

run_manifest.json format

Every run produces a run_manifest.json with metadata about how the eval was executed:
{
  "suite_id": "summarize-smoke",
  "timestamp": "2026-03-16T09:00:00Z",
  "model": "claude-sonnet-4-20250514",
  "harness": "claude-code",
  "prompt_digest": "sha256:abc123...",
  "status": "pass"
}

Harness identifiers

The harness field records which execution environment ran the evaluation. The same model can produce different results in different environments due to system prompts, tool availability, and context window handling.
  • claude-code — Claude Code CLI or IDE
  • antigravity — Antigravity by Google DeepMind
  • cursor — Cursor IDE agent
  • copilot — GitHub Copilot agent
  • api — Direct API call (no IDE agent)
  • github-actions — CI/CD pipeline
  • jules — Jules by Google

Evaluator types

When processing suite evaluators, your agent applies the following scoring logic:
  • deterministic with keyword_recall — Fraction of expected_outputs.should_contain keywords found in the response
  • deterministic with exact_match — 1 if output exactly matches expected, 0 otherwise
  • schema — 1 if output validates against the evaluator’s config.schema, 0 otherwise
  • judge — 0–1 score using the evaluator’s config.rubric as the grading criteria
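
For instance, keyword_recall scoring could be sketched as follows (case-insensitive substring matching is an assumption; the docs do not specify matching rules):

def keyword_recall(output: str, should_contain: list[str]) -> float:
    # Fraction of expected keywords found in the model output
    if not should_contain:
        return 1.0
    found = sum(1 for keyword in should_contain if keyword.lower() in output.lower())
    return found / len(should_contain)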

Assertion types reference

Use inline assertions on dataset cases or quick eval cases to apply per-case checks.

Deterministic assertions

  • equals — Output exactly matches value. Value: "expected string"
  • contains — Output contains substring (case-sensitive). Value: "substring"
  • icontains — Output contains substring (case-insensitive). Value: "substring"
  • contains-any — Output contains at least one value. Value: ["a", "b", "c"]
  • contains-all — Output contains every value. Value: ["x", "y", "z"]
  • regex — Output matches regex pattern. Value: "\\d{3}-\\d{4}"
  • starts-with — Output begins with value. Value: "Dear "
  • is-json — Output is valid JSON. No value needed.
  • contains-json — Output contains a JSON block. No value needed.
  • is-valid-json-schema — Output matches a JSON Schema. Value: {schema object}

Model-assisted assertions

  • similar — Semantic similarity to reference (threshold 0–1). Value: "reference text"
  • llm-rubric — AI grades output using rubric. Value: "rubric text"
  • factuality — Output is factually consistent with reference. Value: "reference facts"
  • answer-relevance — Output is relevant to the input. No value needed.

Performance assertions

  • latency — Response time in ms. Example threshold: 500
  • cost — Token cost in dollars. Example threshold: 0.01
Negate any assertion type by prepending not-. For example: not-contains, not-regex, not-is-json.
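
A sketch of how a few of the deterministic checks and the not- prefix could be evaluated (illustrative only; the skill’s actual dispatch may differ):

import json
import re

def check_assertion(kind: str, output: str, value=None) -> bool:
    # Negation: strip the not- prefix and invert the underlying check
    if kind.startswith("not-"):
        return not check_assertion(kind[4:], output, value)
    if kind == "equals":
        return output == value
    if kind == "contains":
        return value in output
    if kind == "icontains":
        return value.lower() in output.lower()
    if kind == "contains-any":
        return any(v in output for v in value)
    if kind == "contains-all":
        return all(v in output for v in value)
    if kind == "regex":
        return re.search(value, output) is not None
    if kind == "starts-with":
        return output.startswith(value)
    if kind == "is-json":
        try:
            json.loads(output)
            return True
        except ValueError:
            return False
    raise ValueError(f"unsupported assertion type: {kind}")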

Regression policy format

When a baseline exists, your agent reads promptops/policies/regression.yaml to determine allowed deltas:
baseline: "prod-current"
rules:
  - metric: keyword_recall
    floor: 0.5
    allowed_delta: 0.1
    direction: higher_is_better
    severity: blocker
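
Applied in code, a single rule from this policy reduces to the comparison described in step 6, for example:

def regression_ok(candidate: float, baseline: float, rule: dict) -> bool:
    delta, floor = rule["allowed_delta"], rule["floor"]
    if rule["direction"] == "higher_is_better":
        # Fail if the candidate drops below baseline - allowed_delta or under the floor
        return candidate >= baseline - delta and candidate >= floor
    # lower_is_better: fail if the candidate rises above baseline + allowed_delta or over the floor
    return candidate <= baseline + delta and candidate <= floor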

File reference

  • Suite: promptops/suites/<id>.yaml — Test configuration
  • Dataset: promptops/datasets/<id>.jsonl — Test cases (one JSON object per line)
  • Evaluator: promptops/evaluators/<id>.yaml — Scoring rules
  • Prompt spec: promptops/prompts/<id>.yaml — Prompt template and variables
  • Baseline: derived-index/baselines/<suite-id>.json — Known-good scorecard
  • Regression policy: promptops/policies/regression.yaml — Allowed deltas and severity rules
  • Run output: promptops/runs/<run-id>/ — Scorecard, cases, manifest
Use trials: 1 for smoke suites and trials: 3 or more for regression suites. More trials reduce variance and make regression detection more reliable.