Apastra is built on a small set of file-based primitives. Once you understand what each file does, the full workflow — from writing a prompt to detecting regressions — follows naturally.
Prompt spec
A prompt spec is a YAML file that defines a prompt as a versioned software asset. It is the source of truth for what your prompt is, what inputs it accepts, and what shape of output it is expected to produce.

- `id` — A stable, unique identifier. Use kebab-case with a version suffix (e.g., `summarize-v1`). This ID is used throughout the rest of the system to reference the prompt.
- `variables` — A map of input variable names to their JSON Schema type definitions. These are the `{{placeholder}}` values in your template.
- `template` — The prompt text with `{{variable}}` placeholders. Can be a string for simple prompts or a message array for chat models.
- `output_contract` — A JSON Schema defining the expected output structure. Useful for prompts that should return JSON.
- `metadata` — Arbitrary key-value pairs for organization: author, intent, tags.
Prompt specs live in `promptops/prompts/<id>.yaml` and are validated against the `prompt-spec.schema.json` JSON Schema.
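For example, a spec for a hypothetical `summarize-v1` prompt might look like this (the field values are illustrative, and the exact nesting of keys is an assumption based on the field descriptions above):

```yaml
# promptops/prompts/summarize-v1.yaml (illustrative sketch, not a canonical example)
id: summarize-v1
metadata:
  author: docs-team                  # arbitrary key-value pairs
  intent: Summarize support tickets
  tags: [summarization]
variables:
  ticket_text:
    type: string
  max_sentences:
    type: integer
template: |
  Summarize the following support ticket in at most {{max_sentences}} sentences.

  Ticket:
  {{ticket_text}}
output_contract:
  type: object
  properties:
    summary:
      type: string
  required: [summary]
```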
Dataset
A dataset is a `.jsonl` file of test cases — one JSON object per line. Each case has a stable `case_id`, an `inputs` object that matches the prompt spec's variables, and optional `expected_outputs` used by evaluators for scoring.
Datasets live in `promptops/datasets/<id>.jsonl`.
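A two-case dataset for the `summarize-v1` spec sketched above might look like this (the values are illustrative; `should_contain` mirrors the `expected_outputs.should_contain` field used by the `keyword_recall` metric described below):

```jsonl
{"case_id": "ticket-refund-01", "inputs": {"ticket_text": "Customer was charged twice and wants a refund.", "max_sentences": 2}, "expected_outputs": {"should_contain": ["refund", "charged twice"]}}
{"case_id": "ticket-login-02", "inputs": {"ticket_text": "User cannot log in after resetting their password.", "max_sentences": 1}, "expected_outputs": {"should_contain": ["password"]}}
```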
The JSONL format (one object per line, not a JSON array) is intentional: it is append-friendly, diff-friendly, and easy to stream. When you add cases, you append lines. When you review changes, each line is a complete, self-contained object.
Test cases can also carry inline assertions directly on the case object. This skips the separate evaluator file for simple checks:
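For example (the `assertions` key and the assertion type names below are placeholders; see the assertion types reference for the real built-in types):

```jsonl
{"case_id": "ticket-refund-01", "inputs": {"ticket_text": "Customer was charged twice and wants a refund.", "max_sentences": 2}, "assertions": [{"type": "contains", "value": "refund"}, {"type": "max_length", "value": 400}]}
```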
Use stable `case_id` values. Case IDs appear in scorecard output, regression reports, and run artifacts. Changing a case ID is a breaking change for historical comparisons.

Evaluator
An evaluator is a scoring rule — a YAML file that defines how to grade model outputs for a suite. Evaluators live in `promptops/evaluators/<id>.yaml`.
There are three evaluator types:
- Deterministic
- Schema
- Judge
Deterministic evaluators are rule-based checks: the agent applies the scoring rule directly without calling a model. Good for keyword presence, format compliance, and other objective criteria.

The `keyword_recall` metric scores the fraction of `expected_outputs.should_contain` keywords found in the model response: (keywords found) / (total keywords).
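Put together, a deterministic evaluator file might look like the sketch below (the key names are assumptions, not Apastra's documented schema):

```yaml
# promptops/evaluators/keyword-recall.yaml (illustrative sketch; key names are assumptions)
id: keyword-recall
type: deterministic
metric: keyword_recall   # (keywords found) / (total keywords) against expected_outputs.should_contain
```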
Suite

A suite is the test configuration that ties everything together. It declares which datasets to use, which evaluators to apply, which models to test against, and what metric thresholds define a passing run. Suites live in `promptops/suites/<id>.yaml`.
There are four recommended suite tiers, each suited to a different point in the development lifecycle:
| Tier | Purpose | Cases | Trials | When to run |
|---|---|---|---|---|
| Smoke | Fast sanity check | 5–10 | 1 | Every prompt edit |
| Regression | Protect known failure modes | 20–50 | 3 | Before merging |
| Full | Broader coverage | 50+ | 5 | Nightly or on-demand |
| Release | Ship gate | 100+ | 5 | Before shipping |
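For example, a smoke-tier suite might look like the sketch below (the key names, model identifier, and dataset/evaluator IDs are placeholders, not documented values):

```yaml
# promptops/suites/summarize-smoke.yaml (illustrative sketch; key names are assumptions)
id: summarize-smoke
prompt: summarize-v1
datasets: [summarize-cases]
evaluators: [keyword-recall]
models:
  - example-model-small      # hypothetical model matrix entry
trials: 1                    # smoke tier: one trial per case
thresholds:
  keyword_recall: 0.8        # minimum aggregate score for a passing run
```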
Quick eval
For rapid iteration, Apastra supports a single-file format that combines prompt, cases, and assertions into one file. This is the fastest way to start testing a new prompt — no need to create four separate files. Quick eval files live in `promptops/evals/<id>.yaml`.
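A quick eval file for the `summarize-quick` example below might look like this (only the single-file idea is documented here; the key names are assumptions):

```yaml
# promptops/evals/summarize-quick.yaml (illustrative sketch; key names are assumptions)
id: summarize-quick
prompt: |
  Summarize the following support ticket in at most two sentences.

  {{ticket_text}}
cases:
  - case_id: refund-01
    inputs:
      ticket_text: "Customer was charged twice and wants a refund."
    assertions:
      - type: contains
        value: refund
```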
"Use the apastra-eval skill to run the summarize-quick eval".
Graduate to the full spec/dataset/evaluator/suite structure when your eval grows beyond a few cases, when you need reusable evaluators, or when you want baseline tracking across multiple runs.
Baseline
A baseline is a saved scorecard from a passing run. It is the reference point for regression detection — "this is what good looks like." Baselines live in `derived-index/baselines/<suite-id>.json`.
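The sketch below shows one plausible shape for a baseline file, assuming it stores the suite's aggregated metrics and the run verdict (the field names are assumptions, not the real schema):

```json
{
  "suite_id": "summarize-smoke",
  "recorded_at": "2026-03-10",
  "metrics": {
    "keyword_recall": 0.86
  },
  "result": "PASS"
}
```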
When a baseline is updated, the previous scorecard is kept under a dated filename (e.g., `summarize-smoke-2026-03-10.json`) and a new file is written. Nothing is deleted.
Regression
A regression is detected when a new eval run produces metrics that fall below the baseline thresholds defined in `promptops/policies/regression.yaml`.
The regression policy defines per-metric rules:
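For example (a sketch only; the key names and the metric identifiers are assumptions, not the documented policy schema):

```yaml
# promptops/policies/regression.yaml (illustrative sketch; key names are assumptions)
rules:
  - metric: keyword_recall
    max_drop: 0.05           # fail if the score falls more than 0.05 below baseline
    severity: blocker        # blocker: the suite fails
  - metric: format_compliance
    max_drop: 0.10
    severity: warning        # warning: flagged, but the run still passes
```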
`blocker` severity means the suite fails. `warning` severity flags the issue but allows the run to pass. Use blockers for quality drops that should never ship; use warnings for metrics you are watching but not yet enforcing.
Consumption manifest
When your application needs to consume prompts — especially from a separate prompt repo or with specific version pins — you declare that in a consumption manifest (promptops/manifests/consumption.yaml):
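For example (a sketch; the key layout and repository URL are assumptions, though `pin` and `override` are the fields described in this section and the next):

```yaml
# promptops/manifests/consumption.yaml (illustrative sketch; key layout is an assumption)
prompts:
  - id: summarize-v1
    source: https://github.com/example-org/prompt-repo   # hypothetical separate prompt repo
    pin: v1.3.0        # commit SHA, Git tag, or semver range
    # override: ../prompt-repo/promptops/prompts/summarize-v1.yaml   # local path for fast iteration
```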
The `pin` value can be a commit SHA, a Git tag, or a semver range. The resolver uses the manifest to determine which version of a prompt to load at runtime.
The resolution chain
When your application or agent needs to load a prompt, Apastra's resolver walks a four-step precedence chain. The first match wins:

1. Local override: If the consumption manifest has an `override` pointing to a local file path, the resolver loads the prompt from that path. Used for fast local iteration against a checked-out prompt repo without publishing anything.
2. Workspace path: If the prompt ID is found in `promptops/prompts/` in the current workspace (same-repo topology), the resolver loads it from there. This is the default for most teams starting out.
3. Git ref: If the manifest specifies a `pin` that is a commit SHA, tag, or branch name, the resolver fetches the prompt at that ref. Used for pinning a specific version of a prompt from a separate prompt repo.

The chain is implemented in `promptops/resolver/chain.py`.
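That file's contents are not reproduced here; the sketch below illustrates the precedence order using assumed function names, manifest fields, and fallback behavior, and covers only the three steps described above:

```python
# Illustrative sketch of the precedence chain; names, signatures, and the final
# fallback are assumptions, not the real promptops/resolver/chain.py.
from pathlib import Path


def fetch_from_git(source: str, pin: str, prompt_id: str) -> str:
    """Hypothetical helper: fetch promptops/prompts/<prompt_id>.yaml at the pinned ref."""
    raise NotImplementedError("git fetching is not sketched here")


def resolve(prompt_id: str, manifest: dict) -> str:
    """Return the raw prompt spec text; the first matching source wins."""
    entry = next((p for p in manifest.get("prompts", []) if p.get("id") == prompt_id), {})

    # 1. Local override: a local file path in the consumption manifest beats everything else.
    if "override" in entry:
        return Path(entry["override"]).read_text()

    # 2. Workspace path: same-repo topology, promptops/prompts/<id>.yaml.
    workspace = Path("promptops/prompts") / f"{prompt_id}.yaml"
    if workspace.exists():
        return workspace.read_text()

    # 3. Git ref: fetch the spec at the pinned commit SHA, tag, or branch.
    if "pin" in entry:
        return fetch_from_git(entry["source"], entry["pin"], prompt_id)

    raise LookupError(f"prompt {prompt_id!r} could not be resolved")  # assumed failure behavior
```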
The eval workflow
When you ask your agent to run an eval, it follows this sequence:

1. Read the suite spec: Load `promptops/suites/<suite-id>.yaml` and extract datasets, evaluators, model matrix, and thresholds.
2. Load dependencies: Read each dataset from `promptops/datasets/<id>.jsonl` and each evaluator from `promptops/evaluators/<id>.yaml`. Load the prompt spec from `promptops/prompts/<id>.yaml`.
3. Run each case: For every case in the dataset, render the prompt template with the case's inputs, call the model, score the output using evaluators and any inline assertions, and record the per-case result.
4. Aggregate the scorecard: Average each metric across all cases to produce normalized scores. Compare against the suite's thresholds to determine PASS or FAIL.
5. Compare against baseline: If a baseline exists at `derived-index/baselines/<suite-id>.json`, compare the candidate scorecard against it using the regression policy. Report any regressions.
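As a rough illustration of the aggregation step, the sketch below (not Apastra's implementation; the per-case result shape is an assumption) averages each metric across cases and applies the suite thresholds:

```python
# Illustrative sketch of scorecard aggregation; the result shape is an assumption.
from statistics import mean


def aggregate(case_results: list[dict], thresholds: dict[str, float]) -> dict:
    """Average each metric across all cases, then compare against suite thresholds."""
    metric_names = {name for result in case_results for name in result["scores"]}
    metrics = {
        name: mean(r["scores"][name] for r in case_results if name in r["scores"])
        for name in metric_names
    }
    passed = all(metrics.get(name, 0.0) >= minimum for name, minimum in thresholds.items())
    return {"metrics": metrics, "result": "PASS" if passed else "FAIL"}
```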
Next steps

- Quickstart: Walk through scaffolding your first prompt and running your first eval.
- Writing evals: Learn to write test cases and choose the right assertion types.
- Assertion types reference: Full reference for all built-in assertion types.
- File structure reference: Complete reference for every file in the promptops directory.