Watchdog finds it. Your agent fixes it. The next scan proves it.
Watchdog is the independent codebase-assurance surveyor — it measures, it never fixes. It's read-only by doctrine, so instead of touching your code it serves every finding to your own coding agent over a Model Context Protocol server: prioritised by impact, briefed with the rule that fired and the exact file and line, and verified by the next scan. The hand on your code is always yours.
Works with any MCP client — Claude Code, GitHub Copilot, Cursor, opencode. C#/.NET · a measurement, not an opinion.
Measure → fix → prove, on a loop.
Watchdog scans on your cadence and turns the findings into a ranked, briefed task list. Your agent works the highest-leverage item in your own repo and records a provisional fix. The next scan re-measures and decides: the finding is credited as fixed only when it genuinely stops firing. Nothing the agent asserts ever moves the score.
A ranked, briefed task list — not a wall of findings.
Point an agent at the repo and it starts with audit(): repo health, every lens score, and a remediation plan ranked by impact ÷ effort, which is what actually moves the grade, in order. next_task() hands back the single highest-leverage item, fully briefed in one object; get_task("D34") pulls the full packet for one dimension. No round-trips, no guessing.
- Dimension: D34 · Knowledge Freshness (Maturity lens)
- Score now 6.1 / 10 · Est. gain +1.8 · Effort Medium
- What to do: re-engage the orphaned billing modules by adding characterization tests and a named owner before the next change.
- Instances: a1b2c3 src/Billing/LegacyInvoicer.cs:212 · d4e5f6 src/Billing/TaxRules.cs:88
Illustrative — your real task list is generated from the latest scan.
Each task packet carries what the dimension measures, the current score, the estimated point-gain and effort, the remediation brief, and every open instance with its fingerprint and file:line. The agent reads one object and gets to work.
The plan ranks whole dimensions by impact ÷ effort — "lift D34 from 6 to 8 for +1.8" — so the agent works what moves the grade, instead of clearing a hundred Info-level findings in an area that's already strong.
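As a rough illustration of ranking by impact ÷ effort, the sketch below sorts a few dimension tasks by estimated gain divided by an effort weight. The numeric effort weights, the `DimensionTask` type, and the D10 and D4 figures are assumptions made for this example; the page does not state how Watchdog quantifies effort.

```python
from dataclasses import dataclass

# Hypothetical effort weights; the real rubric's effort scale is not given in the text.
EFFORT_WEIGHT = {"Low": 1.0, "Medium": 2.0, "High": 4.0}


@dataclass(frozen=True)
class DimensionTask:
    code: str        # dimension code, e.g. "D34"
    est_gain: float  # estimated point gain if the dimension is worked
    effort: str      # "Low", "Medium" or "High"


def rank_by_leverage(tasks):
    """Sort tasks by estimated gain divided by effort weight, highest first."""
    return sorted(
        tasks,
        key=lambda t: t.est_gain / EFFORT_WEIGHT[t.effort],
        reverse=True,
    )


tasks = [
    DimensionTask("D10", est_gain=0.4, effort="Low"),     # leverage 0.4 (made-up numbers)
    DimensionTask("D34", est_gain=1.8, effort="Medium"),  # leverage 0.9
    DimensionTask("D4", est_gain=2.0, effort="High"),     # leverage 0.5 (made-up numbers)
]
ranked = [t.code for t in rank_by_leverage(tasks)]
print(ranked)  # ['D34', 'D4', 'D10']
```

Under these assumed weights the Medium-effort D34 task outranks both the cheap but low-gain item and the high-gain but expensive one.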
Every finding is content-addressed — a stable fingerprint plus exact file and line. The agent never invents a location, and it can diff two scans (or two SARIF runs) and get the same adds and removes Watchdog reports.
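Because each finding carries a stable fingerprint, diffing two scans reduces to set arithmetic on those fingerprints. The sketch below assumes each scan is available as a mapping from fingerprint to location; the `diff_scans` helper, the third fingerprint, and the Ledger.cs path are hypothetical.

```python
def diff_scans(before, after):
    """Return (added, removed) fingerprints between two scans."""
    before_fps, after_fps = set(before), set(after)
    added = sorted(after_fps - before_fps)
    removed = sorted(before_fps - after_fps)
    return added, removed


# Fingerprint -> location. The line number may drift; the fingerprint does not.
scan_1 = {
    "a1b2c3": "src/Billing/LegacyInvoicer.cs:212",
    "d4e5f6": "src/Billing/TaxRules.cs:88",
}
scan_2 = {
    "d4e5f6": "src/Billing/TaxRules.cs:91",  # same finding, moved three lines
    "0f9e8d": "src/Billing/Ledger.cs:40",    # hypothetical new finding
}

added, removed = diff_scans(scan_1, scan_2)
print(added)    # ['0f9e8d']
print(removed)  # ['a1b2c3']
```

The finding that only changed line number appears in neither list, which is the behaviour a fingerprint-based diff is meant to give.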
When a dimension can't be measured — coverage on a suite that won't run, an npm scan on a Python repo — it returns not-measured with the reason, never a phantom 0. The agent never fixes a problem that isn't there.
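One way to see why not-measured matters: if an unmeasurable dimension were reported as 0, any aggregate would be dragged down by a problem that does not exist. The sketch below models not-measured as `None` and leaves it out of a simple average. The averaging itself is an assumption for illustration only; the page does not describe how Watchdog aggregates scores.

```python
from typing import Optional


def lens_average(scores: dict[str, Optional[float]]) -> Optional[float]:
    """Average only the measured dimensions; None means not-measured."""
    measured = [s for s in scores.values() if s is not None]
    if not measured:
        return None  # nothing measurable, so no score rather than 0
    return sum(measured) / len(measured)


# Hypothetical scores: D10 could not be measured because the test suite won't run.
honest = {"D10": None, "D34": 6.0, "D4": 8.0}
phantom = {"D10": 0.0, "D34": 6.0, "D4": 8.0}

print(lens_average(honest))             # 7.0
print(lens_average(phantom) < 7.0)      # True: the phantom 0 lowers the result
```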
Scores are pinned to a frozen rubric version, so an agent verifying its own fix is always comparing against the same ruler. Findings ship as SARIF too — helpUri to the rule's intent, partialFingerprints for stable cross-tool identity, security findings tagged with their CWE — so your code-scanning tools read them as well.
What we catch that line-scanners pass.
The failure mode of fast, high-volume code isn't bad syntax — a linter catches that. It's code that looks finished and does nothing: stubs under confident names, tests that assert nothing, errors quietly swallowed. It type-checks, it reads as done, it sails past a line-level scanner. Watchdog measures the hollowness itself — by shape, deterministically.
| Signature | What it looks like | Why a line-scanner passes it |
|---|---|---|
| Stubs that look implemented (IC1) | A method named CalculateTax that just returns 0; an async method that never awaits; skeleton types half-full of holes; dead if (false) branches. | It's valid C#, it type-checks, it reads as done. A scanner sees a method; we see a hollow one, and weight a lone stub differently from pervasive scaffolding. |
| Tests that assert nothing (D10) | Green tests with no assertions; skipped tests dressed up as coverage. | Coverage counts the test file; it never asks whether the test actually checks a result. |
| Errors made invisible (X3) | An empty catch {} that swallows the exception; a rethrow via throw ex that resets the stack trace. | The exception type is caught, so type-based checks pass. The silence isn't visible to them. |
| Untracked debt & dead code (D17) | TODO / FIXME / HACK, blanket suppressions, commented-out code, unreferenced symbols. | Counted raw at best: not weighed by whether the debt is tracked, and dead code in generated or test paths is missed. |
| Copy-paste, never parameterised (D4) | Near-duplicate blocks: templated generation that was never individualised. | Clone detection is expensive and often skipped; ours is type-aware and scored by density across the codebase. |
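To make "measured by shape, deterministically" concrete, here is a small analogy in Python rather than C#. It walks a syntax tree and flags three of the shapes from the table: a constant-return stub, a test with no assertion, and a swallowed exception. This is not Watchdog's detector; the `hollow_shapes` function, its rules, and the sample source are invented for illustration.

```python
import ast

SOURCE = '''
def calculate_tax(order):
    return 0

def test_invoice_total():
    total = 2 + 2

def load(path):
    try:
        return open(path).read()
    except Exception:
        pass
'''


def hollow_shapes(source):
    """Return (kind, location) pairs for a few hollow-code shapes."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            has_assert = any(isinstance(n, ast.Assert) for n in ast.walk(node))
            body = node.body
            if node.name.startswith("test_") and not has_assert:
                findings.append(("assertion-free test", node.name))
            elif (
                len(body) == 1
                and isinstance(body[0], ast.Return)
                and isinstance(body[0].value, ast.Constant)
            ):
                findings.append(("constant-return stub", node.name))
        elif isinstance(node, ast.ExceptHandler):
            # A handler whose whole body is `pass` hides the error entirely.
            if all(isinstance(stmt, ast.Pass) for stmt in node.body):
                findings.append(("swallowed exception", f"line {node.lineno}"))
    return findings


findings = hollow_shapes(SOURCE)
for kind, where in findings:
    print(kind, "->", where)
```

Every function in the sample parses and would pass a syntax-level check, yet each is flagged purely from its structure, and the same input always yields the same findings.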
The re-scan is a ratchet — the score is earned, not talked up.
Surfacing slop is only half the job; the other half is making sure a "fix" is real. Because the measurement is deterministic and the next scan re-runs it, a finding can't be dressed up or churned away — it either stops firing or it doesn't. This holds on every scheduled scan, with or without an agent.
A finding's identity is a hash of its dimension, title and file — never the line number. Reformat, rename a variable, move the block: the finding stays put. You can't churn your way out of it.
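A minimal sketch of a line-independent fingerprint, assuming SHA-256 over the three fields named above. The hash function, the separator, the 12-character truncation, and the example title are choices made for this example, not details the page confirms.

```python
import hashlib


def fingerprint(dimension: str, title: str, file_path: str) -> str:
    """Hash dimension, title and file; the line number is deliberately not an input."""
    key = "\x1f".join((dimension, title, file_path))  # unit separator avoids field collisions
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


# The same finding before and after the block moved from line 212 to line 240.
before = {"line": 212, "fp": fingerprint("D34", "Orphaned module", "src/Billing/LegacyInvoicer.cs")}
after = {"line": 240, "fp": fingerprint("D34", "Orphaned module", "src/Billing/LegacyInvoicer.cs")}

print(before["fp"] == after["fp"])  # True: moving the block does not change identity
```

Under this scheme, changing the file path or the dimension produces a different fingerprint, while reformatting within the file does not.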
An agent's "resolved" is a claim, not a verdict. The next scan re-measures and credits it only when the fingerprint is genuinely gone — and reopens it if the rule still fires. No self-reported fixes.
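The adjudication rule described above can be sketched as a lookup against the next scan: a claimed fix is credited only if its fingerprint no longer appears. The `adjudicate` function and the status strings are hypothetical names used for this example.

```python
def adjudicate(claimed_resolved, next_scan_fingerprints):
    """Map each claimed fingerprint to 'fixed' or 'reopened' based on the next scan."""
    still_firing = set(next_scan_fingerprints)
    return {
        fp: ("reopened" if fp in still_firing else "fixed")
        for fp in claimed_resolved
    }


# The agent claims both findings are resolved; the next scan still reports one of them.
claims = ["a1b2c3", "d4e5f6"]
next_scan = ["d4e5f6"]

verdicts = adjudicate(claims, next_scan)
print(verdicts)  # {'a1b2c3': 'fixed', 'd4e5f6': 'reopened'}
```

Note that the agent's claim is only an input; the verdict comes entirely from what the scan observed.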
Repeats of the same problem decay — each costs half the last — and every category has a floor, so you can't flood a file with cosmetic findings to game it, or hand-wave a score below the line. Same commit in, same score out.
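"Each costs half the last" is a geometric series, so the total cost of any number of repeats stays below twice the cost of the first. The sketch below shows that arithmetic plus one possible reading of the floor, as a lower clamp on a category score. The function names, the starting score of 10, and the clamp interpretation are assumptions for illustration.

```python
def repeat_cost(base_cost: float, occurrences: int) -> float:
    """Total cost of repeated findings when each costs half the previous one."""
    return sum(base_cost * 0.5 ** i for i in range(occurrences))


def category_score(start: float, total_cost: float, floor: float) -> float:
    """Subtract the cost, but never drop below the category floor (assumed clamp)."""
    return max(floor, start - total_cost)


print(repeat_cost(1.0, 1))  # 1.0
print(repeat_cost(1.0, 3))  # 1.75  (1 + 0.5 + 0.25)

# Many repeats converge instead of growing without bound.
print(repeat_cost(1.0, 100) <= 2.0)  # True

# A heavy penalty is clamped at the floor rather than driving the score lower.
print(category_score(10.0, repeat_cost(4.0, 3), floor=4.0))  # 4.0
```

Both functions are pure, which matches the "same commit in, same score out" property: identical findings always produce an identical result.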
Twelve tools, four jobs — and a way to push back.
Everything runs over one HTTPS endpoint (/mcp), Bearer-token-scoped to a single repo. There is no write-to-repo tool: no commit, no push, no open-PR. The agent reads, works in your environment, records provisional fixes, and, when it disagrees, pushes back through honest channels where a human, not the agent, adjudicates.
| Job | Tool | What it does |
|---|---|---|
| Read & plan | audit | Repo health, every lens score, and a remediation plan ranked by impact ÷ effort. Start here. |
| Read & plan | next_task | The single highest-leverage item, fully briefed: what it measures, the gain, the effort, what to do, every instance. |
| Read & plan | get_task | The full work packet for one dimension you choose (e.g. D34). |
| Read & plan | list_findings | The latest scan's findings: fingerprint, dimension, level, file:line. |
| Read & plan | get_finding | One finding by fingerprint. |
| Work the fix | claim_finding | Take a short lease (≈45 min) so two agents don't collide on the same finding. |
| Work the fix | release_finding | Hand it back to the open queue. |
| Work the fix | resolve_finding | Record a provisional fix; the next scan confirms it by fingerprint delta. |
| Push back | dispute_finding | Flag a scored finding as a false positive → routed to human triage (Fixed / Declined). |
| Push back | flag_advisory | Flag an advisory (LLM-judged) note as unhelpful; a separate channel that never touches the score. |
| Push back | report_detector_gap | Report something Watchdog missed → improves the detector, never your repo's score. |
| Verify | request_rescan | Ask for a verify re-scan (opt-in per repo; counts toward your scan budget). |
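For orientation, MCP tool invocations are JSON-RPC 2.0 requests with the method `tools/call`. The sketch below only builds the request body and headers for a get_task call and sends nothing. The argument name `dimension` is a guess, since the page does not list tool parameters, and a real MCP client also performs session initialisation that is omitted here.

```python
import json


def tool_call_request(request_id, tool, arguments):
    """Build a JSON-RPC 2.0 body for an MCP tools/call invocation."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }


def auth_headers(repo_token):
    """Headers carrying the per-repo bearer token."""
    return {
        "Authorization": f"Bearer {repo_token}",
        "Content-Type": "application/json",
    }


# "dimension" is a hypothetical argument name for get_task.
body = tool_call_request(1, "get_task", {"dimension": "D34"})
headers = auth_headers("<repo-token>")

print(json.dumps(body, indent=2))
```

In practice an MCP client such as the ones named on this page builds these messages for you; the sketch is only meant to show what crosses the wire.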
The hand on your code is always yours.
Watchdog runs against a throwaway clone and exposes no way to write to your repo. That's deliberate: a measurer that also rewrites the thing it grades can't stay neutral, and you'd lose chain-of-custody on every change. Tools that auto-refactor inside their own engine make the opposite trade. Watchdog stays the instrument; your agent — whichever you already run — stays the hand.
What Watchdog does: serves prioritised, briefed findings over MCP; records a provisional resolve; re-measures on the next scan and proves what genuinely moved. It hands the intelligence to your agent.
What Watchdog never does: edit, commit, push, or open a PR; move the score on an agent's say-so; touch your working tree. The change, the credit, and the chain of custody are always yours.
Point any MCP client at your repo.
Each repository has an MCP endpoint and a scoped bearer token. Add them to your agent's MCP configuration — Claude Code, GitHub Copilot, Cursor, opencode, or your own — and the task list is live. Verify-now re-scans are opt-in per repo (off by default) and count toward your scan budget; otherwise the next scheduled scan is the arbiter.
    {
      "mcpServers": {
        "watchdog": {
          "url": "https://<your-watchdog-host>/mcp",
          "headers": { "Authorization": "Bearer <repo-token>" }
        }
      }
    }
The exact endpoint URL and the per-repo token are shown for each repository once it's surveyed.
Stop triaging findings by hand. Hand them to your agent.
Sign in with GitHub · no card · C#/.NET · the first full report is €0.