Overlap Detection — Precision

How Co-Vibe decides whether a duplicate-work match is worth escalating to the agent. This is the reference for the stage-1 precision signals; the agent-judge protocol (stage 2) lives in mcp-contract.md → Overlap Detection v2.

The one rule that governs everything

Recall is sacred. Precision is earned.

Detection runs in two stages. Stage 1 is the cheap, local, always-on gate; strong candidates escalate to the stage-2 agent-judge. Every precision signal below only ever suppresses an escalation — it never touches the warn-floor, so a warning is still recorded for the dashboard/audit even when the agent is not interrupted. A real duplicate must never stop escalating; that invariant is enforced by the eval harness at recall = 100% on every change.

The warn-floor informs; only escalation gates. Warn-floor warnings are disclosed on the check response (warning_id + data.matches with score and reason) and never block start/start_planned. This changed 2026-06-10. The old behavior let a sub-threshold warn-floor warning silently hard-block the subsequent start, which a live 30-scenario battery measured at a 7/7 false-block rate on non-duplicate traps (matches at 29–39%, some with actionDivergent: true), so every precision signal below was being undone at the start gate. Blocking now happens only on the escalated (requires_verdict) path and on agent-confirmed duplicates, so stage-1 precision carries through to the developer-facing behavior. The committed regression suite held recall 10/10 (100%) and precision 10/11 (90.9%) across the change.

One more escalation rule rides on the same principle: matches against done/cancelled work inform but never escalate. Finished work cannot be duplicated, only repeated, and covibe_team audits cover that.
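The start-gate rule can be sketched as below. This is an illustration only, not the shipped code: the `CheckResponse` shape, the `AgentVerdict` values, and the `shouldBlockStart` name are hypothetical, and the choice to let a "not_duplicate" verdict clear the block is an assumption. Only `warning_id`, `data.matches`, `score`, `reason`, and `requires_verdict` come from the text.

```typescript
// Hypothetical response shape; only warning_id, data.matches, score, reason
// and requires_verdict are named in the document.
interface WarnMatch {
  score: number;  // combined score in 0..1
  reason: string; // human-readable explanation of the match
}

interface CheckResponse {
  warning_id?: string;       // present when the warn-floor recorded a warning
  requires_verdict: boolean; // true only on the escalated path
  data: { matches: WarnMatch[] };
}

// Assumed verdict vocabulary for the stage-2 agent-judge.
type AgentVerdict = "duplicate" | "not_duplicate" | undefined;

// A bare warn-floor warning is informational and never blocks.
// Blocking happens on an agent-confirmed duplicate, or on an escalated
// match that has not received a verdict yet.
function shouldBlockStart(check: CheckResponse, verdict: AgentVerdict): boolean {
  if (verdict === "duplicate") return true;
  if (verdict === "not_duplicate") return false; // assumption: a verdict clears the gate
  return check.requires_verdict;
}

// A sub-threshold warning (34%) is disclosed but does not block the start.
const subThreshold: CheckResponse = {
  warning_id: "w-1",
  requires_verdict: false,
  data: { matches: [{ score: 0.34, reason: "shared entity" }] },
};
console.log(shouldBlockStart(subThreshold, undefined)); // false
```

Under this sketch the warning stays on the response for the dashboard/audit, while the start proceeds.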

The escalation gate

A candidate escalates only when all four hold — the single predicate isEscalatable(match, threshold) in src/server/overlap/scoring.ts:

score >= escalate_threshold      AND      // strong enough (default 0.50)
NOT actionDivergent              AND      // not opposed/different work
NOT lowInformation               AND      // not too vague to trust
status NOT IN (done, cancelled)           // open work only; finished work informs

findOverlap computes actionDivergent and lowInformation per candidate and hangs them on the Match. The same predicate is used by check-task, planned-task-tools, and the eval harness — they cannot drift apart.
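The four conditions above translate into a small predicate. The following is a minimal sketch, assuming a `Match` shape with exactly the fields the gate reads; the real `Match` type and the status names other than done/cancelled are not shown in this document and are assumptions here.

```typescript
// Status names other than "done" and "cancelled" are assumed for illustration.
type TaskStatus = "planned" | "in_progress" | "done" | "cancelled";

// Hypothetical subset of the Match fields that the gate needs.
interface Match {
  score: number;            // combined stage-1 score in 0..1
  actionDivergent: boolean; // opposed or different work
  lowInformation: boolean;  // too vague to trust
  status: TaskStatus;       // status of the matched task
}

const DEFAULT_ESCALATE_THRESHOLD = 0.5;

// Escalate only when all four conditions hold.
function isEscalatable(match: Match, threshold: number = DEFAULT_ESCALATE_THRESHOLD): boolean {
  const finished = match.status === "done" || match.status === "cancelled";
  return (
    match.score >= threshold &&
    !match.actionDivergent &&
    !match.lowInformation &&
    !finished
  );
}

// A strong match against finished work informs but does not escalate.
const finishedMatch: Match = { score: 0.8, actionDivergent: false, lowInformation: false, status: "done" };
console.log(isEscalatable(finishedMatch)); // false
```

Keeping the whole decision in one pure function is what lets several callers share it without drifting apart.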

1 · Base score — semantic ∥ lexical (OR-of-signals)

combinedScore (overlap/semantic.ts) takes the stronger of a local sentence-embedding cosine (Xenova/all-MiniLM-L6-v2, offline) and a lexical word-overlap score. Either signal firing is a match, so a reworded duplicate with no shared words is still caught. score = max(lexical, semantic).
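The OR-of-signals idea can be illustrated with a toy combiner. This sketch takes precomputed embedding vectors instead of loading the model, and it uses Jaccard word overlap as a stand-in for the lexical scorer; the actual lexical formula is not given here, so that choice and all function names are assumptions.

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Lowercased, de-duplicated word tokens.
function uniqueTokens(text: string): string[] {
  const raw: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return raw.filter((t, i) => raw.indexOf(t) === i);
}

// Assumed lexical signal: Jaccard overlap of the two token sets.
function lexicalOverlapSketch(a: string, b: string): number {
  const ta = uniqueTokens(a);
  const tb = uniqueTokens(b);
  if (ta.length === 0 || tb.length === 0) return 0;
  const shared = ta.filter((t) => tb.indexOf(t) !== -1).length;
  return shared / (ta.length + tb.length - shared);
}

// Either signal firing is enough: take the stronger of the two.
function combinedScoreSketch(textA: string, textB: string, embA: number[], embB: number[]): number {
  return Math.max(lexicalOverlapSketch(textA, textB), cosineSimilarity(embA, embB));
}

// Reworded duplicate: no shared words, but (made-up) embeddings are close.
const reworded = combinedScoreSketch("add redis cache", "introduce caching layer", [0.6, 0.8], [0.8, 0.6]);
console.log(reworded >= 0.5); // true
```

Taking the maximum rather than an average means a weak lexical score can never drag a strong semantic match below the bar, which is the recall-first choice.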

The escalate threshold is 0.50 (DEFAULT_ESCALATE_THRESHOLD, runtime-tunable via app_config.escalate_threshold). Treat it as locked: real duplicates bottom out at ~0.51, so raising the bar above that point starts dropping duplicates (recall loss). Precision above the threshold is bought with the divergence/guard signals below, not by moving the bar.

2 · Action divergence — opposed work (Approach A)

overlap/action.ts extracts a canonical action bucket (create/remove) from each item's title + scope and reports when the two are opposed (enable↔remove, add↔invalidate). Two items sharing an entity but doing opposite work — enable the flag vs remove the flag — are not the same task. Unknown actions are neutral (never opposed), so a duplicate phrased with an out-of-lexicon verb is never suppressed.
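A rough sketch of the opposed-action heuristic follows. The verb lists are a guess at the lexicon built from the verbs this page mentions (enable, add, build, create on one side; remove, invalidate on the other); the function names and the single-string input are hypothetical.

```typescript
type ActionBucket = "create" | "remove" | "unknown";

// Assumed lexicon; the shipped lists are not shown in this document.
const CREATE_VERBS = ["enable", "add", "create", "build"];
const REMOVE_VERBS = ["remove", "invalidate"];

// Map the first recognized verb in the text (title + scope) to a bucket.
function actionBucket(text: string): ActionBucket {
  const words: string[] = text.toLowerCase().match(/[a-z]+/g) || [];
  for (let i = 0; i < words.length; i++) {
    if (CREATE_VERBS.indexOf(words[i]) !== -1) return "create";
    if (REMOVE_VERBS.indexOf(words[i]) !== -1) return "remove";
  }
  return "unknown";
}

// Two items are opposed only when both buckets are known and differ.
// Unknown is neutral, so an out-of-lexicon verb never suppresses a match.
function isOpposed(textA: string, textB: string): boolean {
  const a = actionBucket(textA);
  const b = actionBucket(textB);
  if (a === "unknown" || b === "unknown") return false;
  return a !== b;
}

console.log(isOpposed("enable the flag", "remove the flag"));    // true
console.log(isOpposed("provision the flag", "remove the flag")); // false
```

The second call shows the recall-safety property: "provision" is not in the lexicon, so the pair stays eligible for escalation.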

3 · Structured action / component divergence (Approach B)

Agents may declare two optional fields on covibe_task (operation: "check", "plan", or "start"):

| Field | Type | Meaning |
| --- | --- | --- |
| `action` | `create \| modify \| remove \| fix \| audit` | what is being done |
| `component` | string (≤40, e.g. `server`, `client`, `db`, `ui`) | which layer |

They are persisted on the task (idempotent migration) and surfaced on every candidate, so the divergence check works on both sides. The layered rule (isDivergent): different declared component → divergent (catches server endpoint vs client, which text alone cannot); else opposed action (declared preferred, the Approach-A text heuristic as fallback). Everything is additive — a caller that sends neither field gets exactly the Approach-A behavior.
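The layered rule can be sketched as follows. Two points are assumptions rather than documented behavior: declared actions count as opposed only for the create/remove pair (other pairs are treated as neutral), and the text heuristic is used whenever either side omits a declared action. The function takes the Approach-A result as a boolean so the sketch stays self-contained.

```typescript
type DeclaredAction = "create" | "modify" | "remove" | "fix" | "audit";

// The two optional fields an agent may declare.
interface Declared {
  action?: DeclaredAction;
  component?: string;
}

// Assumption: only create vs remove is an opposed declared pair.
function declaredOpposed(a: DeclaredAction, b: DeclaredAction): boolean {
  return (a === "create" && b === "remove") || (a === "remove" && b === "create");
}

// Layer 1: different declared components are divergent.
// Layer 2: opposed action, declared values preferred, text heuristic as fallback.
function isDivergentSketch(a: Declared, b: Declared, textOpposed: boolean): boolean {
  if (a.component && b.component && a.component !== b.component) return true;
  if (a.action && b.action) return declaredOpposed(a.action, b.action);
  return textOpposed;
}

// Same wording, different layer: divergent even though the text heuristic is silent.
console.log(isDivergentSketch({ component: "server" }, { component: "client" }, false)); // true
// Neither field sent: the result is exactly the Approach-A answer.
console.log(isDivergentSketch({}, {}, false)); // false
```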

4 · Low-information guard

isLowInformation flags an item whose every content token is generic-work/filler vocabulary (fix the thing that is broken, address the reported issue) — it names no specific subject, so any match it produces (lexical or semantic) is unreliable. If either side is low-information, escalation is suppressed. Recall-safe because a real duplicate always names a concrete subject (redis, jwt, invoice).
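A minimal sketch of the guard, assuming a small hand-picked filler list. The real vocabulary is not shown in this document, so `FILLER_WORDS` and the function names are illustrative only.

```typescript
// Assumed generic-work and filler vocabulary, for illustration only.
const FILLER_WORDS = [
  "fix", "address", "handle", "update", "resolve",
  "issue", "thing", "problem", "bug", "broken", "reported",
  "the", "that", "is", "a", "an", "it", "this",
];

// An item is low-information when every token is filler,
// meaning it names no specific subject.
function isLowInformationSketch(text: string): boolean {
  const tokens: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return tokens.every((t) => FILLER_WORDS.indexOf(t) !== -1);
}

// Escalation is suppressed if either side is low-information.
function suppressForLowInformation(textA: string, textB: string): boolean {
  return isLowInformationSketch(textA) || isLowInformationSketch(textB);
}

console.log(isLowInformationSketch("fix the thing that is broken")); // true
console.log(isLowInformationSketch("fix the redis cache"));          // false
```

A single concrete token such as "redis" is enough to clear the guard, which is why a real duplicate is not expected to trip it.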

Results

Measured by the eval harness at the shipped escalate_threshold = 0.50:

| Stage | False positives | Precision | Recall |
| --- | --- | --- | --- |
| Baseline (semantic+lexical) | 8 | 76.5% | 100% |
| + Action divergence (A) | 5 | 83.9% | 100% |
| + Component/action (B) | 3 | 89.7% | 100% |
| + Low-information guard | 2 | 92.9% | 100% |

Recall has never dropped below 100% across the 66+ labeled scenarios (52 pairwise + a 24-task scale-stress + the structured/low-info edge sets). The committed regression suite also enforces a meaningful duplicate sample, 100% recall, and a >=90% precision floor over its stage-1 scenarios, so npm run check:accuracy fails before the customer handoff promise drifts.
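The recall and precision floors amount to a small check over labeled outcomes. The sketch below is not the committed suite; `ScenarioResult`, `accuracyCounts`, and `passesAccuracyGate` are hypothetical names, and the sample data simply reproduces the 10/10 recall and 10/11 precision figures quoted earlier on this page.

```typescript
// One labeled scenario and what stage 1 did with it (hypothetical shape).
interface ScenarioResult {
  isDuplicate: boolean; // ground-truth label
  escalated: boolean;   // stage-1 decision
}

function accuracyCounts(results: ScenarioResult[]) {
  let tp = 0; // duplicates that escalated
  let fp = 0; // non-duplicates that escalated
  let fn = 0; // duplicates that were missed
  for (const r of results) {
    if (r.escalated && r.isDuplicate) tp++;
    else if (r.escalated && !r.isDuplicate) fp++;
    else if (!r.escalated && r.isDuplicate) fn++;
  }
  const precision = tp + fp === 0 ? 1 : tp / (tp + fp);
  return { tp, fp, fn, precision };
}

// Recall must be exactly 100% (no missed duplicate); precision must meet the floor.
function passesAccuracyGate(results: ScenarioResult[], precisionFloor: number = 0.9): boolean {
  const { fn, precision } = accuracyCounts(results);
  return fn === 0 && precision >= precisionFloor;
}

// 10 caught duplicates and 1 false positive: precision 10/11, recall 10/10.
const suite: ScenarioResult[] = [];
for (let i = 0; i < 10; i++) suite.push({ isDuplicate: true, escalated: true });
suite.push({ isDuplicate: false, escalated: true });
console.log(passesAccuracyGate(suite)); // true
```

Checking `fn === 0` instead of comparing a floating-point recall to 1 keeps the recall assertion exact.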

Known residual false positives

Two remain, both the action-family residue — same entity/component, but a work-type difference the opposed-axis deliberately treats as neutral for recall safety:

build dark-mode toggle vs audit dark-mode colors (create vs audit, 0.51)
create invoice vs email invoice (create vs send, 0.52)

The next lever is a declared-first action-family expansion (treat clearly distinct declared actions as divergent). It carries genuine recall risk, so it stays eval-gated and unshipped until proven at 100% recall.

Where it lives

| Concern | Path |
| --- | --- |
| Base scorer / combiner | `src/server/overlap/semantic.ts`, `overlap/lexical.ts` |
| Action + component divergence | `src/server/overlap/action.ts` |
| Low-information guard | `src/server/overlap/lexical.ts` |
| Escalation gate + match assembly | `src/server/overlap/scoring.ts` (`findOverlap`, `isEscalatable`) |
| Thresholds (single source) | `src/server/overlap/constants.ts`, `repositories/config-repo.ts` |
| Stage-1 wiring | `src/server/mcp/check-task.ts`, `mcp/planned-task-tools.ts` |
| Committed regression suite | `tests/unit/overlap-scenarios.test.ts` + `tests/fixtures/overlap-scenarios.json` |
| Semantic calibration | `tests/unit/overlap-calibration.test.ts` + `tests/fixtures/overlap-eval.json` |
| Precision/recall eval (local, gitignored) | `.gstack/overlap-eval/` |

Changing the precision logic — the rule

Any change to a signal must keep the committed regression suite green and hold recall = 100% on the full eval. Add scenarios to tests/fixtures/overlap-scenarios.json when you add a signal — a structure guard in the suite fails if the fixture drifts back to text-only. Never lower the escalate threshold or relax a recall assertion to make a build pass.