Agents & MCP · the remediation loop

Watchdog finds it. Your agent fixes it. The next scan proves it.

Watchdog is the independent codebase-assurance surveyor — it measures, it never fixes. It's read-only by doctrine, so instead of touching your code it serves every finding to your own coding agent over a Model Context Protocol server: prioritised by impact, briefed with the rule that fired and the exact file and line, and verified by the next scan. The hand on your code is always yours.

Works with any MCP client — Claude Code, GitHub Copilot, Cursor, opencode. C#/.NET · a measurement, not an opinion.

The closed loop

Measure → fix → prove, on a loop.

Watchdog scans on your cadence and turns the findings into a ranked, briefed task list. Your agent works the highest-leverage item in your own repo and records a provisional fix. The next scan re-measures and decides: the finding is credited as fixed only when it genuinely stops firing. Nothing the agent asserts ever moves the score.

Read-only throughout: Watchdog measures and serves the tasks; your agent makes every change. The fingerprint delta on the next scan is what marks a finding fixed — never the agent's say-so.

Built for an agent to act on

A ranked, briefed task list — not a wall of findings.

Point an agent at the repo and it starts with audit() — repo health, every lens score, and a remediation plan ranked by impact ÷ effort: what actually moves the grade, in order. next_task() hands back the single highest-leverage item, fully briefed in one object; get_task("D34") pulls the full packet for one dimension. No round-trips, no guessing.

next_task() · illustrative response

Dimension — D34 · Knowledge Freshness (Maturity lens)
Score now 6.1 / 10 · Est. gain +1.8 · Effort Medium
What to do — re-engage the orphaned billing modules: add characterization tests and a named owner before the next change.
Instances — a1b2c3 src/Billing/LegacyInvoicer.cs:212 · d4e5f6 src/Billing/TaxRules.cs:88

Illustrative — your real task list is generated from the latest scan.

One call, everything needed

Each task packet carries what the dimension measures, the current score, the estimated point-gain and effort, the remediation brief, and every open instance with its fingerprint and file:line. The agent reads one object and gets to work.
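To make "one object" concrete, here is a minimal sketch of how an agent might model a task packet on its side. The class and field names are assumptions for illustration, not Watchdog's actual schema; the values are the illustrative ones from the next_task() response above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    fingerprint: str  # stable identity of the finding
    file: str
    line: int


@dataclass(frozen=True)
class TaskPacket:
    # Field names are hypothetical; the real packet shape may differ.
    dimension: str   # e.g. "D34"
    title: str       # what the dimension measures
    lens: str
    score: float     # current score out of 10
    est_gain: float  # estimated point gain
    effort: str
    brief: str       # what to do
    instances: tuple[Instance, ...] = ()


packet = TaskPacket(
    dimension="D34",
    title="Knowledge Freshness",
    lens="Maturity",
    score=6.1,
    est_gain=1.8,
    effort="Medium",
    brief="Re-engage the orphaned billing modules: add characterization "
          "tests and a named owner before the next change.",
    instances=(
        Instance("a1b2c3", "src/Billing/LegacyInvoicer.cs", 212),
        Instance("d4e5f6", "src/Billing/TaxRules.cs", 88),
    ),
)


def worklist(p: TaskPacket) -> list[str]:
    """Every open instance as a file:line string the agent can jump to."""
    return [f"{i.file}:{i.line}" for i in p.instances]
```

With the packet in hand, the agent has its locations, its brief and its expected gain without another call.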

Ranked by leverage, not noise

The plan ranks whole dimensions by impact ÷ effort — "lift D34 from 6 to 8 for +1.8" — so the agent works what moves the grade, instead of clearing a hundred Info-level findings in an area that's already strong.
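A rough sketch of what ranking by impact ÷ effort could look like. The numeric effort weights and the D10 and D4 gains below are assumptions made up for the example; Watchdog's real weighting is not published here.

```python
# Assumed mapping from effort label to relative cost (hypothetical values).
EFFORT_COST = {"Low": 1.0, "Medium": 2.0, "High": 4.0}


def leverage(est_gain: float, effort: str) -> float:
    """Impact divided by effort: higher means more grade moved per unit of work."""
    return est_gain / EFFORT_COST[effort]


def rank_plan(dimensions: list[dict]) -> list[dict]:
    """Sort whole dimensions so the highest-leverage one comes first."""
    return sorted(
        dimensions,
        key=lambda d: leverage(d["est_gain"], d["effort"]),
        reverse=True,
    )


plan = rank_plan([
    {"dimension": "D10", "est_gain": 0.4, "effort": "Low"},     # leverage 0.4
    {"dimension": "D4", "est_gain": 2.0, "effort": "High"},     # leverage 0.5
    {"dimension": "D34", "est_gain": 1.8, "effort": "Medium"},  # leverage 0.9
])
```

Under these assumed weights D34 comes out on top even though D4 has the larger raw gain, which is the point of dividing by effort.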

No hallucinated locations

Every finding carries a stable fingerprint plus its exact file and line. The agent never invents a location, and it can diff two scans (or two SARIF runs) and get the same adds and removes Watchdog reports.
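Because identity lives in the fingerprint, diffing two scans reduces to set arithmetic. A minimal sketch, assuming each scan has been loaded into a dict of fingerprint to location; the `0f9e8d` fingerprint and `Ledger.cs` path are invented for the example.

```python
def diff_scans(before: dict[str, str], after: dict[str, str]):
    """Return (added, removed) fingerprints between two scans.

    Only the keys (fingerprints) are compared, so a finding whose line
    number shifted is neither an add nor a remove.
    """
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    return added, removed


previous = {
    "a1b2c3": "src/Billing/LegacyInvoicer.cs:212",
    "d4e5f6": "src/Billing/TaxRules.cs:88",
}
latest = {
    "d4e5f6": "src/Billing/TaxRules.cs:91",  # same finding, line moved
    "0f9e8d": "src/Billing/Ledger.cs:40",    # hypothetical new finding
}

added, removed = diff_scans(previous, latest)
```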

Honest absence

When a dimension can't be measured — coverage on a suite that won't run, an npm scan on a Python repo — it returns not-measured with the reason, never a phantom 0. The agent never fixes a problem that isn't there.
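One way an agent could keep "not measured" distinct from "scored zero" on its side. This is a sketch with hypothetical names (`DimensionResult`, the `D99` id), not Watchdog's response format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DimensionResult:
    dimension: str
    score: Optional[float]        # None means not measured, never 0
    reason: Optional[str] = None  # why it could not be measured

    @property
    def measured(self) -> bool:
        return self.score is not None


def mean_score(results: list[DimensionResult]) -> Optional[float]:
    """Average only what was actually measured."""
    scores = [r.score for r in results if r.measured]
    return sum(scores) / len(scores) if scores else None


results = [
    DimensionResult("D34", 6.1),
    # Hypothetical dimension id for a coverage reading that could not run.
    DimensionResult("D99", None, reason="test suite did not run"),
]
```

Treating the missing reading as 0 would halve the average here and send the agent after a problem that does not exist.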

Scores are pinned to a frozen rubric version, so an agent verifying its own fix is always comparing against the same ruler. Findings ship as SARIF too — helpUri to the rule's intent, partialFingerprints for stable cross-tool identity, security findings tagged with their CWE — so your code-scanning tools read them as well.
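For orientation, a sketch of where those fields sit in a SARIF 2.1.0 log. The property names (`helpUri`, `partialFingerprints`, `properties.tags`) are standard SARIF; the fingerprint key `watchdog/v1`, the example.com URL, the fingerprint value and the CWE tag format are all assumptions for illustration.

```python
import json


def to_sarif(fingerprint, rule_id, message, file, line, tags=()):
    """Build a minimal single-result SARIF log as a plain dict."""
    return {
        "version": "2.1.0",
        "runs": [{
            "tool": {"driver": {
                "name": "Watchdog",
                "rules": [{
                    "id": rule_id,
                    # Placeholder URL; the real one points at the rule's intent.
                    "helpUri": f"https://example.com/rules/{rule_id}",
                    "properties": {"tags": list(tags)},
                }],
            }},
            "results": [{
                "ruleId": rule_id,
                "level": "warning",
                "message": {"text": message},
                "locations": [{"physicalLocation": {
                    "artifactLocation": {"uri": file},
                    "region": {"startLine": line},
                }}],
                # Key name is hypothetical; SARIF lets the tool choose it.
                "partialFingerprints": {"watchdog/v1": fingerprint},
            }],
        }],
    }


log = to_sarif(
    "7c1e09", "X3", "Empty catch block swallows the exception",
    "src/Billing/TaxRules.cs", 88,
    tags=["external/cwe/cwe-390"],  # assumed tag convention, illustrative only
)
text = json.dumps(log, indent=2)
```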

The hollow code that compiles

What we catch that line-scanners pass.

The failure mode of fast, high-volume code isn't bad syntax — a linter catches that. It's code that looks finished and does nothing: stubs under confident names, tests that assert nothing, errors quietly swallowed. It type-checks, it reads as done, it sails past a line-level scanner. Watchdog measures the hollowness itself — by shape, deterministically.

Signature	What it looks like	Why a line-scanner passes it
Stubs that look implemented IC1	A method named `CalculateTax` whose whole body is `return 0;`; an `async` method that never awaits; skeleton types half-full of holes; dead `if(false)` branches.	It's valid C#, it type-checks, it reads as done. A scanner sees a method; we see a hollow one, and we weight a lone stub differently from pervasive scaffolding.
Tests that assert nothing D10	Green tests with no assertions; skipped tests dressed up as coverage.	Coverage counts the test file; it never asks whether the test actually checks a result.
Errors made invisible X3	Empty `catch {}` that swallows the exception; `throw ex;` that resets the stack trace.	The exception type is caught, so type-based checks pass. The silence isn't visible to them.
Untracked debt & dead code D17	TODO / FIXME / HACK, blanket suppressions, commented-out code, unreferenced symbols.	Counted raw at best — not weighed by whether the debt is tracked, and dead code in generated or test paths is missed.
Copy-paste, never parameterised D4	Near-duplicate blocks — templated generation that was never individualised.	Clone detection is expensive and often skipped; ours is type-aware and scored by density across the codebase.

One signature, whoever wrote it. We don't guess whether a machine typed it — stylometric "AI detection" is a credibility trap, and we don't make the claim. We measure the hollowness by shape, so a rushed human and an eager model produce the same finding and the same fix. And it compounds: each signature feeds a per-file quality reading with diminishing returns and floors, so one stray stub barely registers while a file full of scaffolding falls below the line — the "one stub versus a hundred stubs" distinction a flat rule can't make.

Why it can't be papered over

The re-scan is a ratchet — the score is earned, not talked up.

Surfacing slop is only half the job; the other half is making sure a "fix" is real. Because the measurement is deterministic and the next scan re-runs it, a finding can't be dressed up or churned away — it either stops firing or it doesn't. This holds on every scheduled scan, with or without an agent.

Line-insensitive identity

A finding's identity is a hash of its dimension, title and file — never the line number. Reformat, rename a variable, move the block: the finding stays put. You can't churn your way out of it.
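A minimal sketch of a line-insensitive fingerprint built from those three inputs. The choice of SHA-256, the separator and the 12-character truncation are assumptions; the actual hash function is not stated here.

```python
import hashlib


def fingerprint(dimension: str, title: str, file: str) -> str:
    """Hash dimension, title and file. The line number is deliberately absent."""
    # Unit separator keeps ("ab", "c") distinct from ("a", "bc").
    payload = "\x1f".join((dimension, title, file)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


before_reformat = fingerprint("X3", "Empty catch block", "src/Billing/TaxRules.cs")
after_reformat = fingerprint("X3", "Empty catch block", "src/Billing/TaxRules.cs")
other_file = fingerprint("X3", "Empty catch block", "src/Billing/Ledger.cs")
```

Since no line number goes into the hash, shifting the block up or down the file leaves the fingerprint unchanged.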

Provisional until proven

An agent's "resolved" is a claim, not a verdict. The next scan re-measures and credits it only when the fingerprint is genuinely gone — and reopens it if the rule still fires. No self-reported fixes.
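The rule can be sketched in a few lines: a claim is settled only by what the next scan still reports. Function and status names are hypothetical.

```python
def reconcile(claimed_resolved: set[str], next_scan: set[str]) -> dict[str, str]:
    """Settle each provisional fix against the fingerprints in the next scan.

    A claim is credited only when its fingerprint is gone; if the rule
    still fires, the finding is reopened regardless of what was asserted.
    """
    return {
        fp: "reopened" if fp in next_scan else "fixed"
        for fp in claimed_resolved
    }


verdicts = reconcile(
    claimed_resolved={"a1b2c3", "d4e5f6"},
    next_scan={"d4e5f6"},  # this one is still firing
)
```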

Deterministic, floored scoring

Repeats of the same problem decay — each costs half the last — and every category has a floor, so you can't flood a file with cosmetic findings to game it, or hand-wave a score below the line. Same commit in, same score out.
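As a worked sketch of "each costs half the last" with a floor: the base costs, the starting score of 10 and the floor of 2 are assumed numbers chosen for the arithmetic, not Watchdog's rubric.

```python
def repeat_penalty(base_cost: float, count: int) -> float:
    """Total cost of `count` repeats: base, base/2, base/4, ..."""
    return sum(base_cost * 0.5 ** i for i in range(count))


def category_score(problems: dict[str, tuple[float, int]],
                   start: float = 10.0, floor: float = 2.0) -> float:
    """Subtract every problem's decayed penalty, never dropping below the floor.

    `problems` maps a problem name to (base_cost, repeat_count).
    """
    total = sum(repeat_penalty(cost, n) for cost, n in problems.values())
    return max(floor, start - total)


one_stub = category_score({"stub": (1.0, 1)})                 # 10 - 1
ten_stubs = category_score({"stub": (1.0, 10)})               # 10 - 1.998046875
pile_up = category_score({"a": (3.0, 5), "b": (3.0, 5)})      # clamped at floor
```

The geometric series bounds any single problem at twice its base cost, and the same inputs always produce the same number.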

This runs the same whether the code came from a coding assistant, a contractor, or a 2 a.m. hotfix. The deterministic re-scan is the ratchet under all of it: the number resists cosmetic gaming and can't be papered over with line churn — it can only be earned by code that genuinely clears the bar.

The MCP surface

Twelve tools, four jobs, including a way to push back.

Everything runs over one HTTPS endpoint (/mcp), Bearer-token-scoped to a single repo. There is no write-to-repo tool — no commit, no push, no open-PR. The agent reads, works in your environment, records provisional fixes, and — when it disagrees — pushes back through honest channels where a human, not the agent, adjudicates.

Job	Tool	What it does
Read & plan	`audit`	Repo health, every lens score, and a remediation plan ranked by impact ÷ effort. Start here.
	`next_task`	The single highest-leverage item, fully briefed: what it measures, the gain, the effort, what to do, every instance.
	`get_task`	The full work packet for one dimension you choose (e.g. `D34`).
	`list_findings`	The latest scan's findings — fingerprint, dimension, level, file:line.
	`get_finding`	One finding by fingerprint.
Work the fix	`claim_finding`	Take a short lease (≈45 min) so two agents don't collide on the same finding.
	`release_finding`	Hand it back to the open queue.
	`resolve_finding`	Record a provisional fix — the next scan confirms it by fingerprint delta.
Push back	`dispute_finding`	Flag a scored finding as a false positive → routed to human triage (Fixed / Declined).
	`flag_advisory`	Flag an advisory (LLM-judged) note as unhelpful — a separate channel that never touches the score.
	`report_detector_gap`	Report something Watchdog missed → improves the detector, never your repo's score.
Verify	`request_rescan`	Ask for a verify re-scan (opt-in per repo; counts toward your scan budget).

The agent proposes; the measurement disposes. No dispute, claim or resolve ever moves the score or erases a finding. A dispute goes to a human — and if it's declined, the finding carries a maintainer's note explaining why it stands, so the same one isn't re-litigated next scan. Disagreement is a first-class signal; it just isn't a back door to the number.
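The `claim_finding` lease in the table above can be pictured as a small expiring lock. This is a local sketch of the idea only, with a hypothetical `LeaseBoard` class and exactly 45 minutes standing in for the approximate lease; the server's real behaviour may differ.

```python
from datetime import datetime, timedelta, timezone

LEASE = timedelta(minutes=45)  # assumed exact value for the ~45 min lease


class LeaseBoard:
    """Tracks which agent holds which finding, and until when."""

    def __init__(self):
        self._leases = {}  # fingerprint -> (agent, expires_at)

    def claim(self, fingerprint: str, agent: str, now: datetime) -> bool:
        holder = self._leases.get(fingerprint)
        if holder and holder[0] != agent and holder[1] > now:
            return False  # someone else holds a live lease
        self._leases[fingerprint] = (agent, now + LEASE)
        return True

    def release(self, fingerprint: str, agent: str) -> None:
        """Hand the finding back, but only if this agent holds it."""
        holder = self._leases.get(fingerprint)
        if holder and holder[0] == agent:
            del self._leases[fingerprint]


board = LeaseBoard()
t0 = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
first = board.claim("a1b2c3", "agent-a", t0)
collide = board.claim("a1b2c3", "agent-b", t0 + timedelta(minutes=10))
expired = board.claim("a1b2c3", "agent-b", t0 + timedelta(minutes=46))
```

Note that holding a lease says nothing about the score: it only keeps two agents off the same finding.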

Open, not closed

The hand on your code is always yours.

Watchdog runs against a throwaway clone and exposes no way to write to your repo. That's deliberate: a measurer that also rewrites the thing it grades can't stay neutral, and you'd lose chain-of-custody on every change. Tools that auto-refactor inside their own engine make the opposite trade. Watchdog stays the instrument; your agent — whichever you already run — stays the hand.

What Watchdog does

Serves prioritised, briefed findings over MCP; records a provisional resolve; re-measures on the next scan and proves what genuinely moved. It hands the intelligence to your agent.

What it never does

Edit, commit, push, or open a PR; move the score on an agent's say-so; touch your working tree. The change — and the credit, and the chain of custody — is always yours.

Connect an agent

Point any MCP client at your repo.

Each repository has an MCP endpoint and a scoped bearer token. Add them to your agent's MCP configuration — Claude Code, GitHub Copilot, Cursor, opencode, or your own — and the task list is live. Verify-now re-scans are opt-in per repo (off by default) and count toward your scan budget; otherwise the next scheduled scan is the arbiter.

MCP client config · shape

{
  "mcpServers": {
    "watchdog": {
      "url": "https://<your-watchdog-host>/mcp",
      "headers": { "Authorization": "Bearer <repo-token>" }
    }
  }
}

The exact endpoint URL and the per-repo token are shown for each repository once it's surveyed.

Stop triaging findings by hand. Hand them to your agent.
