Ground-truth retrieval benchmark
The production pipeline, measured on a public repository anyone can inspect. Local models only — every number below was produced on one workstation, with zero egress.
Method
Corpus: express v5.1.0 (commit cd7d439), indexed exactly like a customer repository.
Questions: 30 authored questions with known answers, each declaring the file a correct answer must cite and the facts the answer text must contain. They're the questions a developer actually asks:
- "How does the trust proxy setting affect how the client IP address is determined?"
- "Where is the etag setting compiled into the function that generates ETags?"
- "What handles a request when no route matches it?"
Run: every question goes fresh through the full production pipeline — retrieval, context assembly, local model — no cache, no cherry-picking. Reports are machine-generated by the same eval harness that ships in the product.
Results — production pipeline, 2026-06-12
qwen3-coder:30b + nomic-embed-text, one workstation, fully local.
| What was measured | Result |
|---|---|
| Answers grounded in cited sources (clickable file-and-line citations) | 100% (30/30) |
| Answers containing the facts the ground truth demanded | 100% (30/30) |
| Median time to a cited answer | 20.4 s |
| Source code transmitted anywhere | 0 bytes |
That is the citation guarantee, measured: answers about your code, with file-and-line proof — and an answer engine that declines rather than guesses when retrieval comes up empty.
One stricter metric, for completeness: in 60% of questions the answer's citations included the exact file our ground truth named. The remainder answered correctly while citing related code — for example the place a function is used rather than the line it's defined on. We publish it because honest benchmarks publish their strictest number, not just their best one.
Against cloud assistants
Cursor, Copilot, and Cody can't be driven headlessly, so a scored head-to-head has to be produced by hand; when we publish one it will include both sides' full transcripts, on this same question set. What can be compared today is structural — properties that don't depend on who runs the benchmark:
| SourceVault | Cloud codebase chat | |
|---|---|---|
| Source code leaves your infrastructure | Never — architecturally | Chunks/embeddings upload |
| Git history answers ("why was this changed?") | Indexed and cited | Never sees your history |
| Uncommitted work and local branches | Indexed on your machine | Only what syncs |
| Retrieval quality measured on your codebase | Eval report per install | Not exposed |
| Privacy model | Verifiable (zero egress) | Contractual (policy) |
The second row is the one no cloud vendor can ever match by shipping a feature: answering "why was this changed?" requires your commit history, and their indexers never see it.
Verify it on your own code
The benchmark harness ships inside every SourceVault install — the same machinery produced this page. Every pilot ends with this measurement run against your repositories, delivered as a retrieval-quality report. If those answers don't cite your code with file-and-line proof, the pilot is free.