
Apastra runs entirely through your IDE agent. There is no server to start, no API key to configure, and no CI required to get going. By the end of this guide, you will have a working prompt spec, a test dataset, and a passing eval with a baseline set.
Apastra works with any IDE agent that supports SKILL.md — including Claude Code, Cursor, Amp, Codex, and 37 more.
1. Install skills

Run this command in your project root to install all Apastra skills into your IDE agent:
npx skills add BintzGavin/apastra --all --full-depth -y
This installs five skills:
Skill                      What it does
apastra-getting-started   Project setup and onboarding walkthrough
apastra-scaffold           Generate prompt specs, datasets, evaluators, and suites
apastra-eval               Run evaluations and compare against baselines
apastra-baseline           Establish and manage known-good baselines
apastra-validate           Validate all files against JSON schemas
You can also install individual skills if you only need part of the workflow:
npx skills add BintzGavin/apastra/skills/eval
npx skills add BintzGavin/apastra/skills/baseline
2. Scaffold your first prompt

Ask your IDE agent:
“Use the apastra-scaffold skill to create a prompt spec, dataset, evaluator, and suite for summarizing text”
Your agent creates four files:
promptops/
├── prompts/summarize-v1.yaml        # Prompt template + variables
├── datasets/summarize-smoke.jsonl   # Test cases (5 examples)
├── evaluators/contains-keywords.yaml # Scoring rule
└── suites/summarize-smoke.yaml      # Test configuration
The prompt spec (prompts/summarize-v1.yaml) looks like this:
id: summarize-v1
variables:
  text: { type: string }
template: "Summarize: {{text}}"
The dataset (datasets/summarize-smoke.jsonl) has one JSON object per line:
{"case_id": "case-1", "inputs": {"text": "..."}, "expected_outputs": {"should_contain": ["key", "words"]}}
Not ready for the full four-file structure? Ask your agent to scaffold a quick eval instead — a single file that combines prompt, cases, and assertions. See core concepts for details.
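If you go that route, the real schema is documented in core concepts; a purely hypothetical sketch of a quick eval would fold the prompt, cases, and assertions into one file:

# quick-eval.yaml (hypothetical shape; see core concepts for the actual schema)
id: summarize-quick
template: "Summarize: {{text}}"
cases:
  - inputs: { text: "..." }
    should_contain: [key, words]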
3. Run your first eval

Ask your IDE agent:
“Use the apastra-eval skill to run the summarize-smoke suite”
Your agent reads the suite spec, loads the dataset and evaluator, renders the prompt template with each test case's inputs, calls the model, scores the outputs, and reports results:
Suite: summarize-smoke
Status: PASS ✅

Metrics:
  keyword_recall: 0.85 (threshold: 0.60) ✅

Per-case results:
  short-article: keyword_recall=1.00 ✅
  technical-paragraph: keyword_recall=1.00 ✅
  empty-edge-case: keyword_recall=1.00 ✅
  long-document: keyword_recall=1.00 ✅
  multi-topic: keyword_recall=0.50 ⚠️
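The keyword_recall metric presumably measures the fraction of a case's should_contain keywords that appear in the model's output, which is how a case can partially pass. An illustrative example with hypothetical keywords:

# Illustration only -- these keywords are not from the actual dataset
should_contain: [pricing, timeline, staffing, risks]   # 4 expected keywords
# the model's summary mentioned only "pricing" and "timeline"
# keyword_recall = 2 / 4 = 0.50   (as in the multi-topic case above)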
The agent also saves run artifacts to promptops/runs/<run-id>/ — a scorecard, per-case results, and a run manifest with timestamps and model metadata.
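The scorecard is the machine-readable form of that report. Its exact keys come from the Apastra JSON schemas; as a sketch, promptops/runs/<run-id>/scorecard.json might contain something like:

{
  "suite": "summarize-smoke",
  "status": "pass",
  "metrics": { "keyword_recall": 0.85 },
  "thresholds": { "keyword_recall": 0.60 }
}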
Suite mode uses the full four-file pipeline (prompt spec + dataset + evaluator + suite config) and is best for structured, reusable test suites.
4. Set a baseline

Ask your IDE agent:
“Use the apastra-baseline skill to set the current results as the baseline”
Your agent reads the most recent run’s scorecard and writes it to derived-index/baselines/summarize-smoke.json:
Baseline established ✅

Suite: summarize-smoke
Source run: summarize-smoke-2026-04-09-120000
Metrics:
  keyword_recall: 0.85

Saved to: derived-index/baselines/summarize-smoke.json
Now every future eval automatically compares against this baseline. If you change the prompt and quality drops, the agent tells you:
Regression Report:
  Baseline: derived-index/baselines/summarize-smoke.json
  Status: REGRESSION DETECTED ❌

  keyword_recall: 0.55 (baseline: 0.85, delta: -0.30) ❌
Only set a baseline from a passing run. The baseline represents your “known good” quality level — baselining a failing run means future comparisons start from a low bar.
That’s it. No CI, no cloud, no API keys to configure. Your agent is the harness.

What just happened

Here is the full file structure you now have:
promptops/
├── prompts/
│   └── summarize-v1.yaml          # Prompt template + variables
├── datasets/
│   └── summarize-smoke.jsonl      # Test cases
├── evaluators/
│   └── contains-keywords.yaml     # Scoring rule
├── suites/
│   └── summarize-smoke.yaml       # Test configuration
└── runs/
    └── summarize-smoke-<ts>/
        ├── scorecard.json          # Aggregated metrics
        ├── cases.jsonl             # Per-case results
        └── run_manifest.json       # Metadata: model, harness, timestamp

derived-index/
└── baselines/
    └── summarize-smoke.json        # Known-good scorecard
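As a reference point, the run manifest records which model and harness produced a run. The field names below are assumptions based on the metadata listed above:

{
  "run_id": "summarize-smoke-2026-04-09-120000",
  "suite": "summarize-smoke",
  "model": "<model your agent called>",
  "harness": "<your IDE agent>",
  "timestamp": "2026-04-09T12:00:00Z"
}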
Every file follows a JSON schema. Run apastra-validate any time to confirm all files are correctly formatted.
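For example, ask: "Use the apastra-validate skill to check every file under promptops/ and derived-index/"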

Next steps

Core concepts

Understand each building block — prompt specs, datasets, evaluators, suites, baselines, and the resolution chain

Writing evals

Learn to write test cases that catch real regressions — not just happy paths

Skills reference

Explore all available skills and what each one does

CI integration

Upgrade from local-first evaluation to automated GitHub Actions PR gating