Audit the prompts your app ships.
Close the prompt loop: scan, audit, eval, grade, fix. vibe-prompt is the prompt-audit and behavioral-testing layer for vibe-coded apps that ship LLM features — running 13 structural smell checks across 5 dimensions, evaluating against your production model with a calibrated LLM-judge, and routing confident fixes to ready-to-apply diffs.
Includes 9 commands · 13 skills
Nine commands across the prompt lifecycle.
From inventory to fix, every command in the loop is a direct call. Read-only by default — real model calls only on :eval, behind a cost gate.
Inventory every prompt site.
Reads the codebase and maps every place an LLM prompt is constructed or dispatched. Read-only — nothing is changed, nothing is written. The starting point for the rest of the loop.
Reach for it when you want to know what your app is actually sending to the model.
Run 13 structural smell checks.
Applies the F1–F13 taxonomy across five dimensions — schema tightness, persona consistency, instruction clarity, token efficiency, and injection resistance. No model calls, no cost.
Reach for it before an eval run, or any time you change a prompt.
Behavioral test against production.
Runs test cases against your real model with a calibrated LLM-judge. Cost-gated — you confirm spend before any call goes out. Surfaces behavioral drift that static checks can't catch.
Reach for it when you need ground truth on how the prompt behaves, not just how it reads.
Composite grade + regression check.
Rolls scan, audit, and eval signals into a composite grade and compares it against your monotonic baseline — so regressions show up as deltas, not surprises.
Reach for it on every PR that touches a prompt.
Confidence-routed fix diffs.
Routes high-confidence findings to ready-to-apply diffs and lower-confidence ones to annotated review files. Backs up every prompt before touching it and keeps rollback explicit.
Reach for it after audit surfaces issues you want to fix.
Propose new AI features.
Reads your product and audit history, then proposes concrete AI feature candidates grounded in what your current prompt surface already handles well.
Reach for it when you want to extend your LLM surface, not just maintain it.
Model-news digest.
Pulls a digest of model releases, capability changes, and pricing shifts relevant to your current stack. No eval calls — read-only.
Reach for it when a model update lands and you want to know if anything breaks.
L3 self-improvement.
Reads vibe-prompt's own session and friction logs, then proposes concrete edits to its own SKILLs and evaluation templates. Never auto-applies.
Reach for it when the audit keeps missing something you're catching manually.
State-aware router.
Reads your project state — last scan age, open audit findings, baseline freshness — and recommends the right next command. Asks before launching anything.
Reach for it when you're not sure where in the loop you are.
Scan, audit, eval, grade, fix — in that order.
The loop runs left to right: scan inventories every prompt site read-only, audit applies the F1–F13 taxonomy statically, eval runs behavioral tests against your production model behind a cost gate, grade rolls everything into a composite score compared against a monotonic baseline, and remediate routes confident fixes to diffs with backup and explicit rollback. Each step is a standalone call — run the whole loop or drop in at any stage.
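As a rough illustration of the grade step, the sketch below combines per-step scores into one number and compares it against a baseline that only ratchets upward. The weights, the 0-to-1 score scale, the `Baseline` class, and the reading of "monotonic" as "never decreases" are all assumptions made for this example, not vibe-prompt's documented formula.

```python
from dataclasses import dataclass

# Hypothetical weights; the real composite formula is not specified here.
WEIGHTS = {"scan": 0.2, "audit": 0.4, "eval": 0.4}


def composite_grade(signals: dict[str, float]) -> float:
    """Weighted average of per-step scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)


@dataclass
class Baseline:
    """Assumed reading of 'monotonic': the stored best grade never goes down."""
    best: float = 0.0

    def check(self, grade: float) -> float:
        """Return the delta against the baseline, then ratchet the baseline up."""
        delta = grade - self.best
        self.best = max(self.best, grade)
        return delta


baseline = Baseline()
first = baseline.check(composite_grade({"scan": 1.0, "audit": 0.5, "eval": 0.75}))
second = baseline.check(composite_grade({"scan": 1.0, "audit": 0.25, "eval": 0.75}))
print(f"first delta: {first:+.2f}, second delta: {second:+.2f}")
```

Under these assumptions the second run reports a negative delta because the audit score dropped, and the stored baseline stays at the earlier, higher grade.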
The F1–F13 taxonomy covers 13 structural smell classes across five scoring dimensions: schema tightness (are outputs constrained or freeform?), persona consistency (does the assistant's identity hold across turns?), instruction clarity (are directives specific enough to survive model updates?), token efficiency (is the prompt spending tokens on things that matter?), and injection resistance (is user input sanitized before it reaches the model?). F10–F12 are the injection-surface smells: when those trigger, vibe-prompt hands off directly to vibe-sec, which carries the injection findings through its own security posture loop. The two plugins compose, but neither depends on the other.
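The F10–F12 handoff described above amounts to partitioning findings by smell ID. A minimal sketch follows, assuming a finding is a plain dict with a `smell` key. That shape, the `split_findings` helper, and the file paths are hypothetical and do not reflect vibe-prompt's actual data model.

```python
# Smell IDs the text above identifies as injection-surface smells.
INJECTION_SMELLS = {"F10", "F11", "F12"}


def split_findings(findings):
    """Partition findings into (kept, handed_off) by smell ID."""
    kept = [f for f in findings if f["smell"] not in INJECTION_SMELLS]
    handed_off = [f for f in findings if f["smell"] in INJECTION_SMELLS]
    return kept, handed_off


# Hypothetical findings, for illustration only.
findings = [
    {"smell": "F2", "site": "src/chat.py"},
    {"smell": "F11", "site": "src/search.py"},
]
kept, handed_off = split_findings(findings)
print("stay in prompt audit:", [f["smell"] for f in kept])
print("hand off to security loop:", [f["smell"] for f in handed_off])
```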
The eval step uses a calibrated LLM-judge: a secondary model call that scores the primary output against a rubric derived from your audit findings. Calibration means the judge's pass threshold is tuned against known-good and known-bad examples from your own baseline, not a generic rubric — so the score means something specific to your app. Every eval run costs real tokens; the cost gate ensures you confirm spend before anything goes out.
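One way to picture the calibration is as a threshold search over labeled judge scores. The sketch below picks the pass threshold that misclassifies the fewest known-good and known-bad examples. The function name, the score values, and the error-count criterion are assumptions for illustration; the source does not say how vibe-prompt tunes its judge.

```python
def calibrate_threshold(good_scores, bad_scores):
    """Pick the pass threshold with the fewest misclassified labeled examples.

    A score passes when it is >= the threshold. Ties go to the lowest candidate.
    """
    candidates = sorted(set(good_scores) | set(bad_scores))

    def errors(threshold):
        false_fail = sum(1 for s in good_scores if s < threshold)
        false_pass = sum(1 for s in bad_scores if s >= threshold)
        return false_fail + false_pass

    return min(candidates, key=errors)


# Hypothetical judge scores for examples already labeled good or bad.
good = [0.82, 0.9, 0.77, 0.95]
bad = [0.4, 0.55, 0.8, 0.3]
threshold = calibrate_threshold(good, bad)
print(threshold)  # 0.77: every good example passes, one bad example slips through
```

A generic rubric would fix the threshold in advance. Fitting it to the app's own labeled examples is what ties the resulting score to that app.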
Two channels.
Stable marketplace
Tagged releases, promoted via the Vibe Plugins marketplace.
/plugin marketplace add estevanhernandez-stack-ed/vibe-plugins
/plugin install vibe-prompt@vibe-plugins
Canary bleeding edge
Latest main from this repo.
/plugin marketplace add estevanhernandez-stack-ed/Vibe-Prompt
/plugin install vibe-prompt
Read-only by default; real vendor calls only on :eval behind a cost gate. Composes with vibe-sec for injection handoff.
One plugin in a family.
Vibe Plugins are a coordinated family — installed independently, composed when present.