AI teams face a problem that software engineering already solved decades ago: how do you ship changes with confidence? For code, the answer is version control, automated tests, and regression detection. For prompts, most teams are still using comments in a shared doc. Apastra brings software engineering discipline to AI prompts. It is a file-based PromptOps framework — prompts, test cases, scoring rules, and quality baselines are all files in your repo, versioned in Git, and tested automatically by your IDE agent.
Documentation Index
Fetch the complete documentation index at: https://bintzgavin-apastra-14.mintlify.app/llms.txt
Use this file to discover all available pages before exploring further.
The PromptOps problem
Prompts are not static strings. They are the core logic of AI features — and they break in ways that are easy to miss:
- A wording change improves one use case while quietly degrading another
- A model update from your provider changes output behavior without warning
- A well-intentioned edit removes a constraint that was preventing bad outputs
- No one knows which version of the prompt is actually running in production
Key principles
File-based. Prompts, datasets, evaluators, suites, baselines, and regression policies are all plain YAML and JSONL files. There is no hidden database, no required SaaS control plane, and no proprietary format. Files live in your repo, move with your code, and work with every tool in your existing workflow.
Agent-as-harness. Your IDE agent — Claude, Cursor, Amp, Codex, and many more — is the evaluation harness. When you ask it to run an eval, it reads the protocol files and executes the workflow: renders prompts, calls the model, scores outputs, and reports results. No external runtime. No API keys to configure.
Local-first. You can run full evaluations, set baselines, and catch regressions entirely on your machine — no CI required. When your team is ready for PR gating and automated regression detection, the apastra-setup-ci skill upgrades you to GitHub Actions without changing any file formats.
Git-native. Because everything is files, you get diffing, history, blame, pull request review, and rollback for free. Prompt changes go through the same review process as code changes. Baselines and run artifacts are append-only records — nothing is mutated in place.
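To make the file-based model concrete, here is a minimal sketch of what a prompt spec could look like. The file name, field names, and layout are illustrative assumptions rather than Apastra's actual schema; see the documentation index above for the real format.
```yaml
# prompts/support-triage.yaml: hypothetical prompt spec (field names are assumptions)
id: support-triage              # stable ID that datasets and suites reference
version: 3                      # bumped on reviewed changes; full history lives in Git
template: |
  You are a support triage assistant.
  Classify the ticket into exactly one category: billing, bug, or feature_request.
  Ticket: {{ticket_text}}
variables:
  ticket_text:
    type: string
    required: true
output_contract:
  type: enum
  values: [billing, bug, feature_request]
```
Because the spec is an ordinary file, a wording change to the template shows up as a normal Git diff and goes through the same pull request review as any code change.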
What you get
| Capability | How it works |
|---|---|
| Prompt versioning | YAML specs with stable IDs, variable schemas, and output contracts |
| Automated evals | Your IDE agent runs test suites and scores outputs |
| Regression detection | New results are compared against known-good baselines |
| Schema validation | JSON schemas ensure all files are correctly formatted |
| No infrastructure | No CI, no cloud, no hosted platform — just files and your agent |
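As a rough illustration of how a suite, a dataset, and a baseline might fit together for regression detection, consider the following sketch. File names, fields, and the threshold are assumptions for illustration, not Apastra's actual formats.
```yaml
# suites/triage-regression.yaml: hypothetical suite definition (illustrative only)
id: triage-regression
prompt: support-triage            # the prompt spec's stable ID
dataset: datasets/triage.jsonl    # one JSON object per line: inputs plus expected output
evaluators:
  - exact-match                   # scoring rule defined in its own evaluator file
baseline: baselines/triage-v3.yaml
policy:
  fail_if: "score < baseline.score - 0.02"   # tolerate at most a two-point drop
---
# baselines/triage-v3.yaml: hypothetical append-only baseline record
prompt: support-triage
prompt_version: 3
score: 0.94
run: runs/triage-run-0042.yaml    # the promoted run this baseline was taken from
```
The point is the shape of the workflow: the suite points at a dataset and a known-good baseline, and the policy turns a score drop into a failing check rather than a silent regression.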
Who it’s for
Solo builders who want prompt unit tests and pinned prompt versions without adopting a platform. Run evaluations locally, catch regressions before they ship, and keep everything in your existing repo.
Product engineers who need PR gating and regression detection as part of their normal development workflow. Apastra integrates with GitHub pull requests and required status checks — failing evals block merges (see the CI sketch at the end of this section).
Platform teams responsible for shared prompt infrastructure across multiple apps or teams. Apastra’s file-based protocol supports reusable workflows, CODEOWNERS-based review, and standardized artifact formats that work across repos.
Applied AI teams with rigorous evaluation requirements. Apastra supports dataset versioning, judge-based evaluation, multi-run variance tracking, and tiered suite structures (smoke → regression → release candidate).
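For the PR-gating workflow mentioned above, the CI integration might take roughly the following shape. The workflow name, trigger paths, and the eval step are assumptions for illustration; the apastra-setup-ci skill generates the actual workflow without changing any file formats.
```yaml
# .github/workflows/prompt-evals.yml: hypothetical PR-gating workflow (illustrative only)
name: prompt-evals
on:
  pull_request:
    paths:
      - "prompts/**"
      - "datasets/**"
      - "suites/**"
jobs:
  evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # The real eval step depends on how the apastra-setup-ci skill configures your harness;
      # this placeholder only marks where the suite runs and reports pass/fail.
      - name: Run regression suite
        run: echo "run the triage-regression suite here"
# Marking the evals job as a required status check in branch protection is what makes failing evals block merges.
```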
How Apastra compares
Most tools in this space make a tradeoff between power and portability. Apastra takes a different position.
promptfoo
promptfoo is a capable CI-centric eval runner with good PR feedback loop support. It was acquired by OpenAI in March 2026, making it no longer vendor-neutral for teams using other models. It also does not define a complete system of record for prompt assets — results can be ephemeral unless you build append-only artifacts and promotion semantics around it. Apastra is designed from the start as a complete protocol with promotion lineage, baselines, and delivery semantics built in.
Langfuse, PromptLayer, Humanloop
Platform prompt registries solve the “runtime hot swap” problem and make prompts accessible to non-engineers. The tradeoff is that the external platform becomes the source of truth — which weakens Git-based review, diff, and release lineage. Apastra keeps Git as the control plane. Platform observability tools can be integrated as optional sinks rather than replacing the workflow.
OpenAI Evals, DeepEval, Ragas
Eval frameworks as code libraries give you powerful custom metrics and programmatic control. The cost is coupling your team to a specific runtime and evaluation contract. Apastra defines a thin harness contract — any framework can be a harness adapter — so you are not locked into a single evaluation library.
Arize Phoenix, Weights & Biases Weave
Observability-first stacks excel at debugging traces and async execution. They solve “what happened” — but they do not inherently solve “pin what shipped.” Apastra handles the packaging, pinning, and promotion semantics that observability platforms leave to you. You can emit run artifacts to these platforms as an optional sink.
Next steps
- Quickstart: Install skills and run your first evaluation in 5 minutes.
- Core concepts: Understand prompt specs, datasets, evaluators, suites, and baselines.