
Apastra runs entirely through your IDE agent. There is no server to start, no API key to configure, and no CI required to get going. By the end of this guide, you will have a working prompt spec, a test dataset, and a passing eval with a baseline set.
Apastra works with any IDE agent that supports SKILL.md — including Claude Code, Cursor, Amp, Codex, and 37 more.
1. Install skills

Run this command in your project root to install all Apastra skills into your IDE agent:
npx skills add BintzGavin/apastra --all --full-depth -y
This installs five skills:
Skill                      What it does
apastra-getting-started   Project setup and onboarding walkthrough
apastra-scaffold           Generate prompt specs, datasets, evaluators, and suites
apastra-eval               Run evaluations and compare against baselines
apastra-baseline           Establish and manage known-good baselines
apastra-validate           Validate all files against JSON schemas
You can also install individual skills if you only need part of the workflow:
npx skills add BintzGavin/apastra/skills/eval
npx skills add BintzGavin/apastra/skills/baseline
2. Scaffold your first prompt

Ask your IDE agent:
“Use the apastra-scaffold skill to create a prompt spec, dataset, evaluator, and suite for summarizing text”
Your agent creates four files:
promptops/
├── prompts/summarize-v1.yaml        # Prompt template + variables
├── datasets/summarize-smoke.jsonl   # Test cases (5 examples)
├── evaluators/contains-keywords.yaml # Scoring rule
└── suites/summarize-smoke.yaml      # Test configuration
The prompt spec (prompts/summarize-v1.yaml) looks like this:
id: summarize-v1
variables:
  text: { type: string }
template: "Summarize: {{text}}"
The dataset (datasets/summarize-smoke.jsonl) has one JSON object per line:
{"case_id": "case-1", "inputs": {"text": "..."}, "expected_outputs": {"should_contain": ["key", "words"]}}
Not ready for the full four-file structure? Ask your agent to scaffold a quick eval instead — a single file that combines prompt, cases, and assertions. See core concepts for details.
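If you go that route, the real schema is documented in core concepts; a purely hypothetical sketch of a quick eval would fold the prompt, cases, and assertions into one file:

# quick-eval.yaml (hypothetical shape; see core concepts for the actual schema)
id: summarize-quick
template: "Summarize: {{text}}"
cases:
  - inputs: { text: "..." }
    should_contain: [key, words]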
3. Run your first eval

Ask your IDE agent:
“Use the apastra-eval skill to run the summarize-smoke suite”
Your agent reads the suite spec, loads the dataset and evaluator, renders the prompt template with each test case's inputs, calls the model, scores the outputs, and reports results:
Suite: summarize-smoke
Status: PASS ✅

Metrics:
  keyword_recall: 0.85 (threshold: 0.60) ✅

Per-case results:
  short-article: keyword_recall=1.00 ✅
  technical-paragraph: keyword_recall=1.00 ✅
  empty-edge-case: keyword_recall=1.00 ✅
  long-document: keyword_recall=1.00 ✅
  multi-topic: keyword_recall=0.50 ⚠️
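The keyword_recall metric presumably measures the fraction of a case's should_contain keywords that appear in the model's output, which is how a case can partially pass. An illustrative example with hypothetical keywords:

# Illustration only -- these keywords are not from the actual dataset
should_contain: [pricing, timeline, staffing, risks]   # 4 expected keywords
# the model's summary mentioned only "pricing" and "timeline"
# keyword_recall = 2 / 4 = 0.50   (as in the multi-topic case above)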
The agent also saves run artifacts to promptops/runs/<run-id>/ — a scorecard, per-case results, and a run manifest with timestamps and model metadata.
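The scorecard is the machine-readable form of that report. Its exact keys come from the Apastra JSON schemas; as a sketch, promptops/runs/<run-id>/scorecard.json might contain something like:

{
  "suite": "summarize-smoke",
  "status": "pass",
  "metrics": { "keyword_recall": 0.85 },
  "thresholds": { "keyword_recall": 0.60 }
}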
Suite mode uses the full four-file pipeline (prompt spec + dataset + evaluator + suite config) and is best for structured, reusable test suites.
4. Set a baseline

Ask your IDE agent:
“Use the apastra-baseline skill to set the current results as the baseline”
Your agent reads the most recent run’s scorecard and writes it to derived-index/baselines/summarize-smoke.json:
Baseline established ✅

Suite: summarize-smoke
Source run: summarize-smoke-2026-04-09-120000
Metrics:
  keyword_recall: 0.85

Saved to: derived-index/baselines/summarize-smoke.json
Now every future eval automatically compares against this baseline. If you change the prompt and quality drops, the agent tells you:
Regression Report:
  Baseline: derived-index/baselines/summarize-smoke.json
  Status: REGRESSION DETECTED ❌

  keyword_recall: 0.55 (baseline: 0.85, delta: -0.30) ❌
Only set a baseline from a passing run. The baseline represents your “known good” quality level — baselining a failing run means future comparisons start from a low bar.
That’s it. No CI, no cloud, no API keys to configure. Your agent is the harness.

What just happened

Here is the full file structure you now have:
promptops/
├── prompts/
│   └── summarize-v1.yaml          # Prompt template + variables
├── datasets/
│   └── summarize-smoke.jsonl      # Test cases
├── evaluators/
│   └── contains-keywords.yaml     # Scoring rule
├── suites/
│   └── summarize-smoke.yaml       # Test configuration
└── runs/
    └── summarize-smoke-<ts>/
        ├── scorecard.json          # Aggregated metrics
        ├── cases.jsonl             # Per-case results
        └── run_manifest.json       # Metadata: model, harness, timestamp

derived-index/
└── baselines/
    └── summarize-smoke.json        # Known-good scorecard
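As a reference point, the run manifest records which model and harness produced a run. The field names below are assumptions based on the metadata listed above:

{
  "run_id": "summarize-smoke-2026-04-09-120000",
  "suite": "summarize-smoke",
  "model": "<model your agent called>",
  "harness": "<your IDE agent>",
  "timestamp": "2026-04-09T12:00:00Z"
}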
Every file follows a JSON schema. Run apastra-validate any time to confirm all files are correctly formatted.
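For example, ask: "Use the apastra-validate skill to check every file under promptops/ and derived-index/"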

Next steps

Core concepts

Understand each building block — prompt specs, datasets, evaluators, suites, baselines, and the resolution chain

Writing evals

Learn to write test cases that catch real regressions — not just happy paths

Skills reference

Explore all available skills and what each one does

CI integration

Upgrade from local-first evaluation to automated GitHub Actions PR gating