Documentation Index
Fetch the complete documentation index at: https://bintzgavin-apastra-14.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
Installation
```bash
npx skills add BintzGavin/apastra/skills/scaffold
```
How to invoke
Ask your agent to create any combination of files:
“Use the apastra-scaffold skill to create a prompt spec, dataset, evaluator, and suite for summarizing text”
For a quick start without four separate files:
“Use the apastra-scaffold skill to create a quick eval for email classification”
What gets created
A full scaffold creates four files:
```text
promptops/
├── prompts/summarize-v1.yaml         # Prompt template + variables
├── datasets/summarize-smoke.jsonl    # Test cases (5 examples)
├── evaluators/contains-keywords.yaml # Scoring rule
└── suites/summarize-smoke.yaml       # Test configuration
```
You can also ask for any individual piece: just a prompt spec, just a dataset, just an evaluator, or just a suite.
Prompt spec template
Your agent creates `promptops/prompts/<id>.yaml`:
```yaml
id: <kebab-case-id>
variables:
  <var_name>:
    type: string
template: |
  <The actual prompt text with {{var_name}} placeholders>
output_contract:
  type: object
  properties:
    <output_field>:
      type: string
metadata:
  author: <user or team name>
  intent: <what this prompt does>
  tags:
    - <relevant-tags>
```
Rules for prompt specs:
- `id` is required and must be unique — use kebab-case with a version suffix (for example, `classify-email-v1`)
- `variables` is required — defines the input schema as a map of variable names to JSON Schema type objects
- `template` is required — the prompt text with `{{variable}}` placeholders
- `output_contract` is optional but recommended — defines the expected output structure
- Never rename an `id`; create a new version instead
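For a concrete picture, here is a filled-in spec for the email classifier used in the quick-eval example later on this page. The metadata values are illustrative, not scaffold output:
```yaml
id: classify-email-v1
variables:
  email:
    type: string
template: |
  Classify the following email into one of these categories: spam, support, sales, personal.
  Respond with JSON: {"category": "<category>", "confidence": <0-1>}
  Email: {{email}}
output_contract:
  type: object
  properties:
    category:
      type: string
    confidence:
      type: number
metadata:
  author: docs-example        # illustrative value
  intent: Route inbound email to the right queue.
  tags:
    - classification
```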
Dataset template
Your agent creates `promptops/datasets/<id>.jsonl` — one JSON object per line:
```jsonl
{"case_id": "<unique-case-id>", "inputs": {"<var>": "<value>"}, "expected_outputs": {"<field>": "<expected>"}, "metadata": {"tags": ["<tag>"]}}
```
Rules for datasets:
- Use `.jsonl` format (one JSON object per line, not a JSON array)
- `case_id` is required and must be unique within the dataset
- `inputs` is required — keys must match the prompt spec’s `variables`
- `expected_outputs` is optional — used by evaluators for checking
- Aim for 5–10 cases in a smoke dataset and 50+ in a regression dataset
- Include edge cases: empty inputs, very long inputs, adversarial inputs
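As an illustration, a smoke dataset for the `classify-email-v1` spec above might look like this; the cases are invented for the example, including one empty-input edge case with `expected_outputs` omitted:
```jsonl
{"case_id": "obvious-spam", "inputs": {"email": "CONGRATULATIONS! You've won $1,000,000! Click here NOW!"}, "expected_outputs": {"category": "spam"}, "metadata": {"tags": ["smoke"]}}
{"case_id": "support-request", "inputs": {"email": "Hi, I'm having trouble logging in. My password reset isn't working."}, "expected_outputs": {"category": "support"}, "metadata": {"tags": ["smoke"]}}
{"case_id": "empty-input", "inputs": {"email": ""}, "metadata": {"tags": ["edge-case"]}}
```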
Evaluator templates
Your agent creates `promptops/evaluators/<id>.yaml`. Three evaluator types are available: deterministic, schema, and judge.

Deterministic — rule-based checks, fastest to run, no model calls required:
```yaml
id: keyword-check
type: deterministic
metrics:
  - keyword_recall
description: Checks if output contains expected keywords.
config:
  match_field: should_contain
  case_sensitive: false
```

Schema — validates that the model output matches a specific JSON structure:
```yaml
id: json-output-valid
type: schema
metrics:
  - schema_valid
description: Validates that model output is valid JSON matching the output contract.
config:
  schema:
    type: object
    required: ["category", "confidence"]
    properties:
      category:
        type: string
      confidence:
        type: number
```

Judge — uses AI judgment to score output quality, useful when deterministic checks can’t capture what matters:
```yaml
id: quality-judge
type: judge
metrics:
  - coherence
  - relevance
description: Uses AI judgment to score output quality.
config:
  rubric: |
    Score the output on two dimensions (0-1 each):
    - coherence: Is the text well-structured and readable?
    - relevance: Does the output address the input query?
  model: default
```
Rules for evaluators:
- `id` is required and must be unique
- `type` is required — must be one of `deterministic`, `schema`, or `judge`
- `metrics` is required — array of metric names this evaluator produces (minimum 1)
- For `judge` evaluators: treat the rubric text as a versioned artifact — changing it changes what the metric means
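One way to honor that last rule is to borrow the version-suffix convention the prompt specs use and ship rubric edits under a new evaluator id. That convention is an assumption made here for illustration, not something the scaffold enforces:
```yaml
# quality-judge-v2.yaml: new id because the rubric changed
# (version-suffix convention borrowed from prompt specs; assumed, not enforced)
id: quality-judge-v2
type: judge
metrics:
  - coherence
  - relevance
description: Uses AI judgment to score output quality (tightened rubric).
config:
  rubric: |
    Score the output on two dimensions (0-1 each):
    - coherence: Is the text well-structured, readable, and internally consistent?
    - relevance: Does the output directly address the input query?
  model: default
```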
Suite template
Your agent creates `promptops/suites/<id>.yaml`:
```yaml
id: <suite-id>
name: <Human Readable Name>
description: <what this suite tests>
datasets:
  - <dataset-id>
evaluators:
  - <evaluator-id>
model_matrix:
  - default
trials: 1
thresholds:
  <metric>: <minimum-score>
```
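Filled in with the ids from the scaffold tree at the top of this page, a smoke suite might read as follows. Two assumptions for illustration: that `contains-keywords` emits the `keyword_recall` metric shown earlier, and the 0.8 threshold value itself:
```yaml
id: summarize-smoke
name: Summarize Smoke Suite
description: Fast checks for summarize-v1 on every prompt edit.
datasets:
  - summarize-smoke
evaluators:
  - contains-keywords
model_matrix:
  - default
trials: 1
thresholds:
  keyword_recall: 0.8   # illustrative minimum score; set your own bar
```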
Suite tiers — recommended usage:
| Tier | When to run | Cases | Trials |
|---|---|---|---|
| Smoke | Every prompt edit | 5–10 | 1 |
| Regression | Before merging | 20–50 | 3 |
| Full | Nightly or on-demand | 50+ | 5 |
| Release | Before shipping | 100+ | 5 |
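A regression-tier variant of the suite above mainly raises the trial count and points at a larger dataset, per the table; the `summarize-regression` dataset id and 0.9 threshold are invented for the example:
```yaml
id: summarize-regression
name: Summarize Regression Suite
description: Pre-merge checks for summarize-v1 against a larger dataset.
datasets:
  - summarize-regression   # assumed 20-50 case dataset, per the tier table
evaluators:
  - contains-keywords
model_matrix:
  - default
trials: 3                  # regression tier runs 3 trials
thresholds:
  keyword_recall: 0.9      # illustrative; tighter than the smoke bar
```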
Quick eval template
For rapid iteration, your agent can scaffold a single file instead of four:
Your agent creates `promptops/evals/<id>.yaml`:
```yaml
id: classify-email-quick
prompt: |
  Classify the following email into one of these categories: spam, support, sales, personal.
  Respond with JSON: {"category": "<category>", "confidence": <0-1>}
  Email: {{email}}
cases:
  - id: obvious-spam
    inputs:
      email: "CONGRATULATIONS! You've won $1,000,000! Click here NOW!"
    assert:
      - type: is-json
      - type: contains
        value: "spam"
  - id: support-request
    inputs:
      email: "Hi, I'm having trouble logging in. My password reset isn't working."
    assert:
      - type: is-json
      - type: contains-any
        value: ["support", "help"]
  - id: personal-email
    inputs:
      email: "Hey! Want to grab lunch on Friday?"
    assert:
      - type: is-json
      - type: contains
        value: "personal"
thresholds:
  pass_rate: 1.0
```
When to use quick eval vs. full suite:
| Quick eval | Full suite |
|---|---|
| 1–5 test cases | 10+ cases |
| Simple inline assertions | Reusable evaluator files |
| Rapid iteration on a new prompt | Baseline tracking and regression detection |
| No evaluator file needed | Multiple evaluator types |
Dataset with inline assertions
When you want per-case checks without a separate evaluator file, ask your agent to add `assert` arrays directly in the JSONL:
```jsonl
{"case_id": "case-1", "inputs": {"text": "Hello"}, "assert": [{"type": "contains", "value": "Bonjour"}, {"type": "not-contains", "value": "error"}]}
{"case_id": "case-2", "inputs": {"text": ""}, "assert": [{"type": "regex", "value": ".*"}]}
```
Inline assertions and evaluator files complement each other. Use inline assertions for per-case checks and evaluator files for suite-wide scoring rules.
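The two can also coexist on a single case: `expected_outputs` feeds the suite's evaluators while `assert` runs per-case checks. The translation case below is invented for illustration:
```jsonl
{"case_id": "greeting-fr", "inputs": {"text": "Hello"}, "expected_outputs": {"translation": "Bonjour"}, "assert": [{"type": "not-contains", "value": "error"}]}
```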
Available assertion types
Deterministic: `equals`, `contains`, `icontains`, `contains-any`, `contains-all`, `regex`, `starts-with`, `is-json`, `contains-json`, `is-valid-json-schema`
Model-assisted: `similar`, `llm-rubric`, `factuality`, `answer-relevance`
Performance: `latency`, `cost`
Negate any type with the `not-` prefix — for example, `not-contains`, `not-is-json`.
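As a sketch, a single `assert` array can mix these freely; the values below are made up for illustration, only deterministic types are shown, and the array-valued `contains-all` mirrors the `contains-any` usage above:
```yaml
assert:
  - type: is-json              # output must parse as JSON
  - type: contains-all         # every listed value must appear
    value: ["category", "confidence"]
  - type: regex
    value: '"confidence": 0\.[0-9]+'
  - type: not-contains         # negated form: fails if the value appears
    value: "error"
```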
After scaffolding, run the apastra-validate skill to catch any typos or formatting issues before your first eval.