Documentation Index
Fetch the complete documentation index at: https://bintzgavin-apastra-14.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Installation
```bash
npx skills add BintzGavin/apastra/skills/scaffold
```
How to invoke
Ask your agent to create any combination of files:
“Use the apastra-scaffold skill to create a prompt spec, dataset, evaluator, and suite for summarizing text”
For a quick start without four separate files:
“Use the apastra-scaffold skill to create a quick eval for email classification”
What gets created
A full scaffold creates four files:
```text
promptops/
├── prompts/summarize-v1.yaml         # Prompt template + variables
├── datasets/summarize-smoke.jsonl    # Test cases (5 examples)
├── evaluators/contains-keywords.yaml # Scoring rule
└── suites/summarize-smoke.yaml       # Test configuration
```
You can also ask for any individual piece: just a prompt spec, just a dataset, just an evaluator, or just a suite.
Prompt spec template
Your agent creates `promptops/prompts/<id>.yaml`:
```yaml
id: <kebab-case-id>
variables:
  <var_name>:
    type: string
template: |
  <The actual prompt text with {{var_name}} placeholders>
output_contract:
  type: object
  properties:
    <output_field>:
      type: string
metadata:
  author: <user or team name>
  intent: <what this prompt does>
  tags:
    - <relevant-tags>
```
Rules for prompt specs:
- `id` is required and must be unique — use kebab-case with a version suffix (for example, `classify-email-v1`)
- `variables` is required — defines the input schema as a map of variable names to JSON Schema type objects
- `template` is required — the prompt text with `{{variable}}` placeholders
- `output_contract` is optional but recommended — defines the expected output structure
- Never rename an `id`; create a new version instead
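For a concrete picture, here is a filled-in spec for the email classifier used in the quick-eval example later on this page. The metadata values are illustrative, not scaffold output:
```yaml
id: classify-email-v1
variables:
  email:
    type: string
template: |
  Classify the following email into one of these categories: spam, support, sales, personal.
  Respond with JSON: {"category": "<category>", "confidence": <0-1>}
  Email: {{email}}
output_contract:
  type: object
  properties:
    category:
      type: string
    confidence:
      type: number
metadata:
  author: docs-example        # illustrative value
  intent: Route inbound email to the right queue.
  tags:
    - classification
```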
Dataset template
Your agent creates `promptops/datasets/<id>.jsonl` — one JSON object per line:
```jsonl
{"case_id": "<unique-case-id>", "inputs": {"<var>": "<value>"}, "expected_outputs": {"<field>": "<expected>"}, "metadata": {"tags": ["<tag>"]}}
```
Rules for datasets:
- Use `.jsonl` format (one JSON object per line, not a JSON array)
- `case_id` is required and must be unique within the dataset
- `inputs` is required — keys must match the prompt spec’s `variables`
- `expected_outputs` is optional — used by evaluators for checking
- Aim for 5–10 cases in a smoke dataset and 50+ in a regression dataset
- Include edge cases: empty inputs, very long inputs, adversarial inputs
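As an illustration, a smoke dataset for the `classify-email-v1` spec above might look like this; the cases are invented for the example, including one empty-input edge case with `expected_outputs` omitted:
```jsonl
{"case_id": "obvious-spam", "inputs": {"email": "CONGRATULATIONS! You've won $1,000,000! Click here NOW!"}, "expected_outputs": {"category": "spam"}, "metadata": {"tags": ["smoke"]}}
{"case_id": "support-request", "inputs": {"email": "Hi, I'm having trouble logging in. My password reset isn't working."}, "expected_outputs": {"category": "support"}, "metadata": {"tags": ["smoke"]}}
{"case_id": "empty-input", "inputs": {"email": ""}, "metadata": {"tags": ["edge-case"]}}
```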
Evaluator templates
Your agent creates `promptops/evaluators/<id>.yaml`. Three evaluator types are available: deterministic, schema, and judge.

Deterministic — rule-based checks, fastest to run, no model calls required:
```yaml
id: keyword-check
type: deterministic
metrics:
  - keyword_recall
description: Checks if output contains expected keywords.
config:
  match_field: should_contain
  case_sensitive: false
```

Schema — validates that the model output matches a specific JSON structure:
```yaml
id: json-output-valid
type: schema
metrics:
  - schema_valid
description: Validates that model output is valid JSON matching the output contract.
config:
  schema:
    type: object
    required: ["category", "confidence"]
    properties:
      category:
        type: string
      confidence:
        type: number
```

Judge — uses AI judgment to score output quality, useful when deterministic checks can’t capture what matters:
```yaml
id: quality-judge
type: judge
metrics:
  - coherence
  - relevance
description: Uses AI judgment to score output quality.
config:
  rubric: |
    Score the output on two dimensions (0-1 each):
    - coherence: Is the text well-structured and readable?
    - relevance: Does the output address the input query?
  model: default
```
Rules for evaluators:
- `id` is required and must be unique
- `type` is required — must be one of `deterministic`, `schema`, or `judge`
- `metrics` is required — array of metric names this evaluator produces (minimum 1)
- For `judge` evaluators: treat the rubric text as a versioned artifact — changing it changes what the metric means
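One way to honor that last rule is to borrow the version-suffix convention the prompt specs use and ship rubric edits under a new evaluator id. That convention is an assumption made here for illustration, not something the scaffold enforces:
```yaml
# quality-judge-v2.yaml: new id because the rubric changed
# (version-suffix convention borrowed from prompt specs; assumed, not enforced)
id: quality-judge-v2
type: judge
metrics:
  - coherence
  - relevance
description: Uses AI judgment to score output quality (tightened rubric).
config:
  rubric: |
    Score the output on two dimensions (0-1 each):
    - coherence: Is the text well-structured, readable, and internally consistent?
    - relevance: Does the output directly address the input query?
  model: default
```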
Suite template
Your agent creates `promptops/suites/<id>.yaml`:
```yaml
id: <suite-id>
name: <Human Readable Name>
description: <what this suite tests>
datasets:
  - <dataset-id>
evaluators:
  - <evaluator-id>
model_matrix:
  - default
trials: 1
thresholds:
  <metric>: <minimum-score>
```
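Filled in with the ids from the scaffold tree at the top of this page, a smoke suite might read as follows. Two assumptions for illustration: that `contains-keywords` emits the `keyword_recall` metric shown earlier, and the 0.8 threshold value itself:
```yaml
id: summarize-smoke
name: Summarize Smoke Suite
description: Fast checks for summarize-v1 on every prompt edit.
datasets:
  - summarize-smoke
evaluators:
  - contains-keywords
model_matrix:
  - default
trials: 1
thresholds:
  keyword_recall: 0.8   # illustrative minimum score; set your own bar
```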
Suite tiers — recommended usage:
| Tier | When to run | Cases | Trials |
|---|---|---|---|
| Smoke | Every prompt edit | 5–10 | 1 |
| Regression | Before merging | 20–50 | 3 |
| Full | Nightly or on-demand | 50+ | 5 |
| Release | Before shipping | 100+ | 5 |
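A regression-tier variant of the suite above mainly raises the trial count and points at a larger dataset, per the table; the `summarize-regression` dataset id and 0.9 threshold are invented for the example:
```yaml
id: summarize-regression
name: Summarize Regression Suite
description: Pre-merge checks for summarize-v1 against a larger dataset.
datasets:
  - summarize-regression   # assumed 20-50 case dataset, per the tier table
evaluators:
  - contains-keywords
model_matrix:
  - default
trials: 3                  # regression tier runs 3 trials
thresholds:
  keyword_recall: 0.9      # illustrative; tighter than the smoke bar
```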
Quick eval template
For rapid iteration, your agent can scaffold a single file instead of four:
Your agent creates `promptops/evals/<id>.yaml`:
```yaml
id: classify-email-quick
prompt: |
  Classify the following email into one of these categories: spam, support, sales, personal.
  Respond with JSON: {"category": "<category>", "confidence": <0-1>}
  Email: {{email}}
cases:
  - id: obvious-spam
    inputs:
      email: "CONGRATULATIONS! You've won $1,000,000! Click here NOW!"
    assert:
      - type: is-json
      - type: contains
        value: "spam"
  - id: support-request
    inputs:
      email: "Hi, I'm having trouble logging in. My password reset isn't working."
    assert:
      - type: is-json
      - type: contains-any
        value: ["support", "help"]
  - id: personal-email
    inputs:
      email: "Hey! Want to grab lunch on Friday?"
    assert:
      - type: is-json
      - type: contains
        value: "personal"
thresholds:
  pass_rate: 1.0
```
When to use quick eval vs. full suite:
| Quick eval | Full suite |
|---|---|
| 1–5 test cases | 10+ cases |
| Simple inline assertions | Reusable evaluator files |
| Rapid iteration on a new prompt | Baseline tracking and regression detection |
| No evaluator file needed | Multiple evaluator types |
Dataset with inline assertions
When you want per-case checks without a separate evaluator file, ask your agent to add `assert` arrays directly in the JSONL:
```jsonl
{"case_id": "case-1", "inputs": {"text": "Hello"}, "assert": [{"type": "contains", "value": "Bonjour"}, {"type": "not-contains", "value": "error"}]}
{"case_id": "case-2", "inputs": {"text": ""}, "assert": [{"type": "regex", "value": ".*"}]}
```
Inline assertions and evaluator files complement each other. Use inline assertions for per-case checks and evaluator files for suite-wide scoring rules.
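The two can also coexist on a single case: `expected_outputs` feeds the suite's evaluators while `assert` runs per-case checks. The translation case below is invented for illustration:
```jsonl
{"case_id": "greeting-fr", "inputs": {"text": "Hello"}, "expected_outputs": {"translation": "Bonjour"}, "assert": [{"type": "not-contains", "value": "error"}]}
```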
Available assertion types
Deterministic: `equals`, `contains`, `icontains`, `contains-any`, `contains-all`, `regex`, `starts-with`, `is-json`, `contains-json`, `is-valid-json-schema`
Model-assisted: `similar`, `llm-rubric`, `factuality`, `answer-relevance`
Performance: `latency`, `cost`
Negate any type with the `not-` prefix — for example, `not-contains`, `not-is-json`.
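As a sketch, a single `assert` array can mix these freely; the values below are made up for illustration, only deterministic types are shown, and the array-valued `contains-all` mirrors the `contains-any` usage above:
```yaml
assert:
  - type: is-json              # output must parse as JSON
  - type: contains-all         # every listed value must appear
    value: ["category", "confidence"]
  - type: regex
    value: '"confidence": 0\.[0-9]+'
  - type: not-contains         # negated form: fails if the value appears
    value: "error"
```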
After scaffolding, run the apastra-validate skill to catch any typos or formatting issues before your first eval.