Apastra is built on a small set of file-based primitives. Once you understand what each file does, the full workflow — from writing a prompt to detecting regressions — follows naturally.
Prompt spec
A prompt spec is a YAML file that defines a prompt as a versioned software asset. It is the source of truth for what your prompt is, what inputs it accepts, and what shape of output it is expected to produce.

- `id` — A stable, unique identifier. Use kebab-case with a version suffix (e.g., `summarize-v1`). This ID is used throughout the rest of the system to reference the prompt.
- `variables` — A map of input variable names to their JSON Schema type definitions. These are the `{{placeholder}}` values in your template.
- `template` — The prompt text with `{{variable}}` placeholders. Can be a string for simple prompts or a message array for chat models.
- `output_contract` — A JSON Schema defining the expected output structure. Useful for prompts that should return JSON.
- `metadata` — Arbitrary key-value pairs for organization: author, intent, tags.
Prompt specs live in `promptops/prompts/<id>.yaml` and are validated against the `prompt-spec.schema.json` JSON Schema.
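For example, a spec for a hypothetical `summarize-v1` prompt might look like this (the field values are illustrative, and the exact nesting of keys is an assumption based on the field descriptions above):

```yaml
# promptops/prompts/summarize-v1.yaml (illustrative sketch, not a canonical example)
id: summarize-v1
metadata:
  author: docs-team                  # arbitrary key-value pairs
  intent: Summarize support tickets
  tags: [summarization]
variables:
  ticket_text:
    type: string
  max_sentences:
    type: integer
template: |
  Summarize the following support ticket in at most {{max_sentences}} sentences.

  Ticket:
  {{ticket_text}}
output_contract:
  type: object
  properties:
    summary:
      type: string
  required: [summary]
```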
Dataset
A dataset is a `.jsonl` file of test cases — one JSON object per line. Each case has a stable `case_id`, an `inputs` object that matches the prompt spec's variables, and optional `expected_outputs` used by evaluators for scoring.
Datasets live in `promptops/datasets/<id>.jsonl`.
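A two-case dataset for the `summarize-v1` spec sketched above might look like this (the values are illustrative; `should_contain` mirrors the `expected_outputs.should_contain` field used by the `keyword_recall` metric described below):

```jsonl
{"case_id": "ticket-refund-01", "inputs": {"ticket_text": "Customer was charged twice and wants a refund.", "max_sentences": 2}, "expected_outputs": {"should_contain": ["refund", "charged twice"]}}
{"case_id": "ticket-login-02", "inputs": {"ticket_text": "User cannot log in after resetting their password.", "max_sentences": 1}, "expected_outputs": {"should_contain": ["password"]}}
```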
The JSONL format (one object per line, not a JSON array) is intentional: it is append-friendly, diff-friendly, and easy to stream. When you add cases, you append lines. When you review changes, each line is a complete, self-contained object.
Test cases can also carry inline assertions directly on the case object. This skips the separate evaluator file for simple checks:
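For example (the `assertions` key and the assertion type names below are placeholders; see the assertion types reference for the real built-in types):

```jsonl
{"case_id": "ticket-refund-01", "inputs": {"ticket_text": "Customer was charged twice and wants a refund.", "max_sentences": 2}, "assertions": [{"type": "contains", "value": "refund"}, {"type": "max_length", "value": 400}]}
```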
Use stable `case_id` values. Case IDs appear in scorecard output, regression reports, and run artifacts. Changing a case ID is a breaking change for historical comparisons.

Evaluator
An evaluator is a scoring rule — a YAML file that defines how to grade model outputs for a suite. Evaluators live in `promptops/evaluators/<id>.yaml`.
There are three evaluator types:
- Deterministic
- Schema
- Judge
Deterministic evaluators are rule-based checks: the agent applies the scoring rule directly without calling a model. Good for keyword presence, format compliance, and other objective criteria.

The `keyword_recall` metric scores the fraction of `expected_outputs.should_contain` keywords found in the model response: (keywords found) / (total keywords).
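Put together, a deterministic evaluator file might look like the sketch below (the key names are assumptions, not Apastra's documented schema):

```yaml
# promptops/evaluators/keyword-recall.yaml (illustrative sketch; key names are assumptions)
id: keyword-recall
type: deterministic
metric: keyword_recall   # (keywords found) / (total keywords) against expected_outputs.should_contain
```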
Suite

A suite is the test configuration that ties everything together. It declares which datasets to use, which evaluators to apply, which models to test against, and what metric thresholds define a passing run. Suites live in `promptops/suites/<id>.yaml`.
There are four recommended suite tiers, each suited to a different point in the development lifecycle:
| Tier | Purpose | Cases | Trials | When to run |
|---|---|---|---|---|
| Smoke | Fast sanity check | 5–10 | 1 | Every prompt edit |
| Regression | Protect known failure modes | 20–50 | 3 | Before merging |
| Full | Broader coverage | 50+ | 5 | Nightly or on-demand |
| Release | Ship gate | 100+ | 5 | Before shipping |
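For example, a smoke-tier suite might look like the sketch below (the key names, model identifier, and dataset/evaluator IDs are placeholders, not documented values):

```yaml
# promptops/suites/summarize-smoke.yaml (illustrative sketch; key names are assumptions)
id: summarize-smoke
prompt: summarize-v1
datasets: [summarize-cases]
evaluators: [keyword-recall]
models:
  - example-model-small      # hypothetical model matrix entry
trials: 1                    # smoke tier: one trial per case
thresholds:
  keyword_recall: 0.8        # minimum aggregate score for a passing run
```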
Quick eval
For rapid iteration, Apastra supports a single-file format that combines prompt, cases, and assertions into one file. This is the fastest way to start testing a new prompt — no need to create four separate files. Quick eval files live in `promptops/evals/<id>.yaml`.
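A quick eval file for the `summarize-quick` example below might look like this (only the single-file idea is documented here; the key names are assumptions):

```yaml
# promptops/evals/summarize-quick.yaml (illustrative sketch; key names are assumptions)
id: summarize-quick
prompt: |
  Summarize the following support ticket in at most two sentences.

  {{ticket_text}}
cases:
  - case_id: refund-01
    inputs:
      ticket_text: "Customer was charged twice and wants a refund."
    assertions:
      - type: contains
        value: refund
```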
"Use the apastra-eval skill to run the summarize-quick eval".
Graduate to the full spec/dataset/evaluator/suite structure when your eval grows beyond a few cases, when you need reusable evaluators, or when you want baseline tracking across multiple runs.
Baseline
A baseline is a saved scorecard from a passing run. It is the reference point for regression detection — "this is what good looks like." Baselines live in `derived-index/baselines/<suite-id>.json`.
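The sketch below shows one plausible shape for a baseline file, assuming it stores the suite's aggregated metrics and the run verdict (the field names are assumptions, not the real schema):

```json
{
  "suite_id": "summarize-smoke",
  "recorded_at": "2026-03-10",
  "metrics": {
    "keyword_recall": 0.86
  },
  "result": "PASS"
}
```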
When a baseline is updated, the previous scorecard is kept under a dated filename (e.g., `summarize-smoke-2026-03-10.json`) and a new file is written. Nothing is deleted.
Regression
A regression is detected when a new eval run produces metrics that fall below the baseline thresholds defined in `promptops/policies/regression.yaml`.
The regression policy defines per-metric rules:
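For example (a sketch only; the key names and the metric identifiers are assumptions, not the documented policy schema):

```yaml
# promptops/policies/regression.yaml (illustrative sketch; key names are assumptions)
rules:
  - metric: keyword_recall
    max_drop: 0.05           # fail if the score falls more than 0.05 below baseline
    severity: blocker        # blocker: the suite fails
  - metric: format_compliance
    max_drop: 0.10
    severity: warning        # warning: flagged, but the run still passes
```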
`blocker` severity means the suite fails. `warning` severity flags the issue but allows the run to pass. Use blockers for quality drops that should never ship; use warnings for metrics you are watching but not yet enforcing.
Consumption manifest
When your application needs to consume prompts — especially from a separate prompt repo or with specific version pins — you declare that in a consumption manifest (promptops/manifests/consumption.yaml):
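For example (a sketch; the key layout and repository URL are assumptions, though `pin` and `override` are the fields described in this section and the next):

```yaml
# promptops/manifests/consumption.yaml (illustrative sketch; key layout is an assumption)
prompts:
  - id: summarize-v1
    source: https://github.com/example-org/prompt-repo   # hypothetical separate prompt repo
    pin: v1.3.0        # commit SHA, Git tag, or semver range
    # override: ../prompt-repo/promptops/prompts/summarize-v1.yaml   # local path for fast iteration
```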
The `pin` value can be a commit SHA, a Git tag, or a semver range. The resolver uses the manifest to determine which version of a prompt to load at runtime.
The resolution chain
When your application or agent needs to load a prompt, Apastra's resolver walks a four-step precedence chain. The first match wins:

1. Local override: If the consumption manifest has an `override` pointing to a local file path, the resolver loads the prompt from that path. Used for fast local iteration against a checked-out prompt repo without publishing anything.
2. Workspace path: If the prompt ID is found in `promptops/prompts/` in the current workspace (same-repo topology), the resolver loads it from there. This is the default for most teams starting out.
3. Git ref: If the manifest specifies a `pin` that is a commit SHA, tag, or branch name, the resolver fetches the prompt at that ref. Used for pinning a specific version of a prompt from a separate prompt repo.

The chain is implemented in `promptops/resolver/chain.py`.
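That file's contents are not reproduced here; the sketch below illustrates the precedence order using assumed function names, manifest fields, and fallback behavior, and covers only the three steps described above:

```python
# Illustrative sketch of the precedence chain; names, signatures, and the final
# fallback are assumptions, not the real promptops/resolver/chain.py.
from pathlib import Path


def fetch_from_git(source: str, pin: str, prompt_id: str) -> str:
    """Hypothetical helper: fetch promptops/prompts/<prompt_id>.yaml at the pinned ref."""
    raise NotImplementedError("git fetching is not sketched here")


def resolve(prompt_id: str, manifest: dict) -> str:
    """Return the raw prompt spec text; the first matching source wins."""
    entry = next((p for p in manifest.get("prompts", []) if p.get("id") == prompt_id), {})

    # 1. Local override: a local file path in the consumption manifest beats everything else.
    if "override" in entry:
        return Path(entry["override"]).read_text()

    # 2. Workspace path: same-repo topology, promptops/prompts/<id>.yaml.
    workspace = Path("promptops/prompts") / f"{prompt_id}.yaml"
    if workspace.exists():
        return workspace.read_text()

    # 3. Git ref: fetch the spec at the pinned commit SHA, tag, or branch.
    if "pin" in entry:
        return fetch_from_git(entry["source"], entry["pin"], prompt_id)

    raise LookupError(f"prompt {prompt_id!r} could not be resolved")  # assumed failure behavior
```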
The eval workflow
When you ask your agent to run an eval, it follows this sequence:

1. Read the suite spec: Load `promptops/suites/<suite-id>.yaml` and extract datasets, evaluators, model matrix, and thresholds.
2. Load dependencies: Read each dataset from `promptops/datasets/<id>.jsonl` and each evaluator from `promptops/evaluators/<id>.yaml`. Load the prompt spec from `promptops/prompts/<id>.yaml`.
3. Run each case: For every case in the dataset, render the prompt template with the case's inputs, call the model, score the output using evaluators and any inline assertions, and record the per-case result.
4. Aggregate the scorecard: Average each metric across all cases to produce normalized scores. Compare against the suite's thresholds to determine PASS or FAIL.
5. Compare against baseline: If a baseline exists at `derived-index/baselines/<suite-id>.json`, compare the candidate scorecard against it using the regression policy. Report any regressions.
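As a rough illustration of the aggregation step, the sketch below (not Apastra's implementation; the per-case result shape is an assumption) averages each metric across cases and applies the suite thresholds:

```python
# Illustrative sketch of scorecard aggregation; the result shape is an assumption.
from statistics import mean


def aggregate(case_results: list[dict], thresholds: dict[str, float]) -> dict:
    """Average each metric across all cases, then compare against suite thresholds."""
    metric_names = {name for result in case_results for name in result["scores"]}
    metrics = {
        name: mean(r["scores"][name] for r in case_results if name in r["scores"])
        for name in metric_names
    }
    passed = all(metrics.get(name, 0.0) >= minimum for name, minimum in thresholds.items())
    return {"metrics": metrics, "result": "PASS" if passed else "FAIL"}
```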
Next steps

- Quickstart: Walk through scaffolding your first prompt and running your first eval.
- Writing evals: Learn to write test cases and choose the right assertion types.
- Assertion types reference: Full reference for all built-in assertion types.
- File structure reference: Complete reference for every file in the promptops directory.