Documentation Index
Fetch the complete documentation index at: https://bintzgavin-apastra-14.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Installation
How to invoke
Ask your agent:

“Use the apastra-eval skill to run the [suite-name] suite”

For a quick eval file:
“Use the apastra-eval skill to run the summarize-quick eval”
Evaluation modes
- Suite mode
- Quick eval mode
Suite mode is the full spec/dataset/evaluator/suite pipeline. Use it for structured, reusable test suites with baseline tracking and regression detection.

When you ask to run a suite (for example, “run the summarize-smoke suite”), your agent follows these steps:
Load the suite
Your agent reads the suite file from `promptops/suites/<suite-id>.yaml` and extracts:

- `datasets` — list of dataset IDs to load
- `evaluators` — list of evaluator IDs to apply
- `model_matrix` — models to test against (`"default"` means the current agent’s model)
- `harness` — (optional) identifier for the execution environment; auto-detected if omitted
- `trials` — how many times to run each case (default: 1)
- `thresholds` — minimum metric scores required to pass
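Putting those fields together, a suite file might look like the sketch below. Only the six field names above are documented here; the specific dataset, evaluator, and harness IDs, and the threshold value, are illustrative assumptions.

```yaml
# promptops/suites/summarize-smoke.yaml (illustrative sketch, not a verified example)
datasets:
  - summarize-basic        # loaded from promptops/datasets/summarize-basic.jsonl
evaluators:
  - keyword-recall         # loaded from promptops/evaluators/keyword-recall.yaml
model_matrix:
  - default                # "default" means the current agent's model
harness: claude-code       # optional; auto-detected when omitted
trials: 1                  # how many times to run each case
thresholds:
  keyword_recall: 0.80     # minimum metric score required to pass
```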
Load dependencies
For each dataset ID, your agent reads `promptops/datasets/<dataset-id>.jsonl` (one JSON object per line). For each evaluator ID, your agent reads `promptops/evaluators/<evaluator-id>.yaml`. For the prompt being evaluated, your agent reads `promptops/prompts/<prompt-id>.yaml`.

Run each case
For every case in the dataset, your agent:
- Renders the template — substitutes `{{variable}}` placeholders with values from the case’s `inputs` object
- Calls the model — sends the rendered prompt and captures the full response; if `trials > 1`, runs multiple times
- Scores the output — applies evaluators and any inline assertions on the case
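For reference, a single dataset case might look like the sketch below (wrapped across lines for readability; in the `.jsonl` file each case sits on one line). The `inputs` keys feed the `{{variable}}` placeholders and `expected_outputs.should_contain` feeds the keyword_recall evaluator; the `id` field and the exact assertion key names (`type`, `value`, `threshold`) are assumptions about the case layout rather than a confirmed schema.

```json
{
  "id": "case-001",
  "inputs": {
    "article": "Full text of the article the prompt should summarize..."
  },
  "expected_outputs": {
    "should_contain": ["quarterly revenue", "guidance"]
  },
  "assertions": [
    { "type": "icontains", "value": "summary" },
    { "type": "latency", "threshold": 500 }
  ]
}
```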
Check thresholds
Your agent compares each metric against the suite’s `thresholds`. If any metric falls below its threshold, the suite fails. Results are reported in the run’s scorecard.

Compare against baseline (if one exists)

Your agent checks for a baseline at `derived-index/baselines/<suite-id>.json`. If no baseline exists, your agent will note this and suggest running the apastra-baseline skill to establish one.

If a baseline exists, your agent reads the regression policy from `promptops/policies/regression.yaml` and compares each metric:

- For `higher_is_better` metrics: fail if candidate < (baseline − allowed_delta) or candidate < floor
- For `lower_is_better` metrics: fail if candidate > (baseline + allowed_delta) or candidate > floor
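For example, suppose keyword_recall is a higher_is_better metric with an assumed allowed_delta of 0.05 and floor of 0.70, and the baseline score is 0.85: a candidate score of 0.78 fails (0.78 < 0.85 − 0.05 = 0.80), while 0.82 passes both the delta check and the floor.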
run_manifest.json format
Every run produces a `run_manifest.json` with metadata about how the eval was executed:
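A minimal sketch of such a manifest appears below. Only the `harness` field is documented in this section; the other keys (run id, suite, model, trials, timestamp) are plausible assumptions based on the suite settings described above, not a confirmed schema.

```json
{
  "run_id": "run-20250115-103200",
  "suite": "summarize-smoke",
  "harness": "claude-code",
  "model": "default",
  "trials": 1,
  "started_at": "2025-01-15T10:32:00Z"
}
```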
Harness identifiers
The `harness` field records which execution environment ran the evaluation. The same model can produce different results in different environments due to system prompts, tool availability, and context window handling.
| Value | Environment |
|---|---|
| claude-code | Claude Code CLI or IDE |
| antigravity | Antigravity by Google DeepMind |
| cursor | Cursor IDE agent |
| copilot | GitHub Copilot agent |
| api | Direct API call (no IDE agent) |
| github-actions | CI/CD pipeline |
| jules | Jules by Google |
Evaluator types
When processing suite evaluators, your agent applies the following scoring logic:

| Type | Scoring behavior |
|---|---|
| deterministic with keyword_recall | Fraction of expected_outputs.should_contain keywords found in the response |
| deterministic with exact_match | 1 if output exactly matches expected, 0 otherwise |
| schema | 1 if output validates against the evaluator’s config.schema, 0 otherwise |
| judge | 0–1 score using the evaluator’s config.rubric as the grading criteria |
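For illustration, a judge evaluator file could be sketched as follows; the `type` and `config.rubric` keys mirror the table above, while the `id` and `metric` keys are assumptions about the file layout.

```yaml
# promptops/evaluators/summary-quality.yaml (illustrative sketch; id/metric keys are assumed)
id: summary-quality
type: judge
metric: summary_quality    # name the score is reported under (assumed key)
config:
  rubric: |
    Score from 0 to 1: does the summary cover the key facts of the source
    without adding claims that are not supported by it?
```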
Assertion types reference
Use inline assertions on dataset cases or quick eval cases to apply per-case checks.

Deterministic assertions
| Type | What it checks | Value |
|---|---|---|
| equals | Output exactly matches value | "expected string" |
| contains | Output contains substring (case-sensitive) | "substring" |
| icontains | Output contains substring (case-insensitive) | "substring" |
| contains-any | Output contains at least one value | ["a", "b", "c"] |
| contains-all | Output contains every value | ["x", "y", "z"] |
| regex | Output matches regex pattern | "\\d{3}-\\d{4}" |
| starts-with | Output begins with value | "Dear " |
| is-json | Output is valid JSON | (no value needed) |
| contains-json | Output contains a JSON block | (no value needed) |
| is-valid-json-schema | Output matches a JSON Schema | {schema object} |
Model-assisted assertions
| Type | What it checks | Value |
|---|---|---|
| similar | Semantic similarity to reference (threshold 0–1) | "reference text" |
| llm-rubric | AI grades output using rubric | "rubric text" |
| factuality | Output is factually consistent with reference | "reference facts" |
| answer-relevance | Output is relevant to the input | (no value needed) |
Performance assertions
| Type | What it checks | Threshold |
|---|---|---|
| latency | Response time in ms | 500 |
| cost | Token cost in dollars | 0.01 |
Any assertion can be negated by prefixing its type with not-. For example: not-contains, not-regex, not-is-json.
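Pulling several of these together, an assertions list for a case might look like the following sketch (shown in YAML; the `type`, `value`, and `threshold` key names are assumptions consistent with the dataset case example earlier).

```yaml
assertions:
  - type: contains-all
    value: ["revenue", "guidance"]      # every listed string must appear
  - type: regex
    value: "\\d{4}-\\d{2}-\\d{2}"       # output must contain an ISO-style date
  - type: not-contains
    value: "as an AI"                   # negated form of contains
  - type: llm-rubric
    value: "The summary is neutral in tone and under 120 words."
  - type: latency
    threshold: 500                      # response time in ms
```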
Regression policy format
When a baseline exists, your agent reads `promptops/policies/regression.yaml` to determine allowed deltas:
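A sketch of what that policy might contain is shown below; `higher_is_better`, `allowed_delta`, and `floor` come from the comparison rules above, while the nesting and the `severity` key (listed under File reference) are assumptions about the layout.

```yaml
# promptops/policies/regression.yaml (illustrative sketch; nesting is assumed)
metrics:
  keyword_recall:
    direction: higher_is_better
    allowed_delta: 0.05    # candidate may drop at most this far below the baseline
    floor: 0.70            # fail if the candidate score is below this, regardless of baseline
severity: fail             # how a detected regression is reported (assumed key)
```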
File reference
| File | Location | Purpose |
|---|---|---|
| Suite | promptops/suites/<id>.yaml | Test configuration |
| Dataset | promptops/datasets/<id>.jsonl | Test cases (one JSON per line) |
| Evaluator | promptops/evaluators/<id>.yaml | Scoring rules |
| Prompt spec | promptops/prompts/<id>.yaml | Prompt template and variables |
| Baseline | derived-index/baselines/<suite-id>.json | Known-good scorecard |
| Regression policy | promptops/policies/regression.yaml | Allowed deltas and severity rules |
| Run output | promptops/runs/<run-id>/ | Scorecard, cases, manifest |
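Taken together, these files imply a repository layout along the following lines (paths from the table above; the specific file names are illustrative).

```text
promptops/
├── suites/summarize-smoke.yaml
├── datasets/summarize-basic.jsonl
├── evaluators/summary-quality.yaml
├── prompts/summarize.yaml
├── policies/regression.yaml
└── runs/<run-id>/          # scorecard, cases, manifest
derived-index/
└── baselines/summarize-smoke.json
```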