
Overlap Detection: Precision

How Co-Vibe decides whether a duplicate-work match is worth escalating to the agent. This is the reference for the stage-1 precision signals; the agent-judge protocol (stage 2) lives in mcp-contract.md, under Overlap Detection v2.

The one rule that governs everything

Recall is sacred. Precision is earned.

Detection runs in two stages. Stage 1 is the cheap, local, always-on gate; strong candidates escalate to the stage-2 agent-judge. Every precision signal below only ever suppresses an escalation — it never touches the warn-floor, so a warning is still recorded for the dashboard/audit even when the agent is not interrupted. A real duplicate must never stop escalating; that invariant is enforced by the eval harness at recall = 100% on every change.

The warn-floor informs; only escalation gates. Warn-floor warnings are disclosed on the check response (warning_id + data.matches with score and reason) and never block start/start_planned.

This changed on 2026-06-10. The old behavior let a sub-threshold warn-floor warning silently hard-block the subsequent start. A live 30-scenario battery measured that at a 7/7 false-block rate on non-duplicate traps (matches at 29–39%, some with actionDivergent: true), so every precision signal described below was being undone at the start gate. Blocking now happens only on the escalated (requires_verdict) path and on agent-confirmed duplicates, so stage-1 precision carries through to the developer-facing behavior. The committed regression suite held recall 10/10 (100%) and precision 10/11 (90.9%) across the change.

One more escalation rule rides on the same principle: matches against done/cancelled work inform but never escalate. Finished work cannot be duplicated, only repeated, and covibe_team audits cover that.
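The split between informing and gating can be sketched as two small functions. This is a simplified illustration only: the type names, the `classify` and `blocksStart` helpers, and the `warnFloor` parameter are assumptions made for the example, not Co-Vibe's actual API, and the sketch ignores what happens after a stage-2 verdict clears a match.

```typescript
// All names here are illustrative assumptions, not Co-Vibe's real types.
type Disposition = "clear" | "warned" | "requires_verdict";

interface CheckOutcome {
  disposition: Disposition;
  agentConfirmedDuplicate: boolean; // set by a stage-2 verdict, if any
}

// Classify a stage-1 match: escalatable matches need a verdict;
// anything else at or above the warn-floor is only disclosed.
function classify(score: number, warnFloor: number, escalatable: boolean): Disposition {
  if (escalatable) return "requires_verdict";
  return score >= warnFloor ? "warned" : "clear";
}

// A warn-floor warning never blocks start. Only the escalated path
// or an agent-confirmed duplicate does.
function blocksStart(outcome: CheckOutcome): boolean {
  return outcome.disposition === "requires_verdict" || outcome.agentConfirmedDuplicate;
}
```

Under this sketch, a sub-threshold match that is not escalatable comes back as `"warned"` and `blocksStart` returns `false` for it, which is the post-change behavior described above.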

The escalation gate

A candidate escalates only when all four hold — the single predicate isEscalatable(match, threshold) in src/server/overlap/scoring.ts:

score >= escalate_threshold      AND      // strong enough (default 0.50)
NOT actionDivergent              AND      // not opposed/different work
NOT lowInformation               AND      // not too vague to trust
status NOT IN (done, cancelled)           // open work only; finished work informs

findOverlap computes actionDivergent and lowInformation per candidate and hangs them on the Match. The same predicate is used by check-task, planned-task-tools, and the eval harness — they cannot drift apart.
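A minimal TypeScript sketch of the four-condition predicate follows. The `isEscalatable(match, threshold)` signature and the `actionDivergent` / `lowInformation` flags come from the text above; the exact shape of `Match` and the open-status names (`planned`, `in_progress`) are assumptions for illustration.

```typescript
// Assumed status set: only "done" and "cancelled" are named in this document.
type Status = "planned" | "in_progress" | "done" | "cancelled";

// Assumed shape; the real Match in scoring.ts may carry more fields.
interface Match {
  score: number;            // combined stage-1 score in [0, 1]
  actionDivergent: boolean; // opposed or different work
  lowInformation: boolean;  // too vague to trust
  status: Status;           // status of the matched task
}

const DEFAULT_ESCALATE_THRESHOLD = 0.5;

// All four conditions must hold for a candidate to escalate.
function isEscalatable(match: Match, threshold: number = DEFAULT_ESCALATE_THRESHOLD): boolean {
  const finished = match.status === "done" || match.status === "cancelled";
  return (
    match.score >= threshold &&
    !match.actionDivergent &&
    !match.lowInformation &&
    !finished
  );
}
```

Keeping the gate as one pure function is what lets several callers share it: each caller passes its candidate and the configured threshold and gets the same answer.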

1 · Base score: semantic ∥ lexical (OR-of-signals)

combinedScore (overlap/semantic.ts) takes the stronger of a local sentence-embedding cosine (Xenova/all-MiniLM-L6-v2, offline) and a lexical word-overlap score. Either signal firing is a match, so a reworded duplicate with no shared words is still caught. score = max(lexical, semantic).
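The OR-of-signals combiner can be sketched as below. The `max` combination is stated in the text; the rest is an assumption for illustration: the lexical score is modeled as Jaccard overlap of word sets, which may differ from the real scorer, and the embeddings are passed in as plain number arrays rather than produced by the model.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Assumed lexical signal: Jaccard overlap of lowercase word sets.
function lexicalOverlap(a: string, b: string): number {
  const tokens = (s: string) => new Set(s.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const ta = tokens(a);
  const tb = tokens(b);
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  const union = ta.size + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Either signal firing is enough: take the stronger of the two.
function combinedScore(lexical: number, semantic: number): number {
  return Math.max(lexical, semantic);
}
```

With this shape, a reworded duplicate whose lexical overlap is 0 still scores whatever its embedding cosine is, so it is not lost just because no words are shared.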

The escalate threshold is 0.50 (DEFAULT_ESCALATE_THRESHOLD, runtime-tunable via app_config.escalate_threshold). It is locked: real duplicates bottom out at ~0.51, so any higher value starts dropping duplicates (recall loss). Precision above the threshold is bought with the divergence/guard signals below, not by moving the bar.

2 · Action divergence: opposed work (Approach A)

overlap/action.ts extracts a canonical action bucket (create/remove) from each item's title + scope and reports when the two are opposed (enable↔remove, add↔invalidate). Two items sharing an entity but doing opposite work — enable the flag vs remove the flag — are not the same task. Unknown actions are neutral (never opposed), so a duplicate phrased with an out-of-lexicon verb is never suppressed.
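A minimal sketch of the opposed-action check, assuming a tiny verb lexicon. The create/remove buckets, the opposed pairs, and the "unknown is neutral" rule come from the text above; the specific verb list and the helper names `actionBucket` and `isOpposed` are hypothetical, and the real lexicon in overlap/action.ts is presumably larger.

```typescript
type Bucket = "create" | "remove";

// Hypothetical mini-lexicon mapping verbs to a canonical bucket.
const VERB_BUCKET = new Map<string, Bucket>([
  ["add", "create"],
  ["create", "create"],
  ["enable", "create"],
  ["build", "create"],
  ["remove", "remove"],
  ["delete", "remove"],
  ["disable", "remove"],
  ["invalidate", "remove"],
]);

// Return the bucket of the first known verb, or undefined when no verb is known.
function actionBucket(text: string): Bucket | undefined {
  const words = text.toLowerCase().match(/[a-z]+/g) ?? [];
  for (const word of words) {
    const bucket = VERB_BUCKET.get(word);
    if (bucket !== undefined) return bucket;
  }
  return undefined;
}

// Opposed only when both sides have a known bucket and the buckets differ.
// An unknown action on either side is neutral, so it never suppresses.
function isOpposed(a: string, b: string): boolean {
  const bucketA = actionBucket(a);
  const bucketB = actionBucket(b);
  return bucketA !== undefined && bucketB !== undefined && bucketA !== bucketB;
}
```

In this sketch `isOpposed("enable the flag", "remove the flag")` is `true`, while a pair where one side uses a verb outside the lexicon returns `false`, which keeps the check recall-safe.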

3 · Structured action / component divergence (Approach B)

Agents may declare two optional fields on covibe_task (operation: "check", "plan", or "start"):

| Field | Type | Meaning |
| --- | --- | --- |
| action | create \| modify \| remove \| fix \| audit | what is being done |
| component | string (≤40, e.g. server, client, db, ui) | which layer |

They are persisted on the task (idempotent migration) and surfaced on every candidate, so the divergence check works on both sides. The layered rule (isDivergent): different declared component → divergent (catches server endpoint vs client, which text alone cannot); else opposed action (declared preferred, the Approach-A text heuristic as fallback). Everything is additive — a caller that sends neither field gets exactly the Approach-A behavior.
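The layered rule can be sketched as follows. The ordering (component first, then opposed action with the declared value preferred) is from the text; the `Item` shape, the `textAction` field standing in for the Approach-A result, and the choice to treat only create versus remove as opposed are assumptions for this example.

```typescript
type Action = "create" | "modify" | "remove" | "fix" | "audit";

// Assumed shape for one side of a comparison.
interface Item {
  action?: Action;                  // declared by the agent, optional
  component?: string;               // declared by the agent, optional
  textAction?: "create" | "remove"; // hypothetical: Approach-A text heuristic result
}

function isDivergent(a: Item, b: Item): boolean {
  // Layer 1: both sides declare a component and they differ.
  if (a.component !== undefined && b.component !== undefined && a.component !== b.component) {
    return true;
  }
  // Layer 2: opposed action, declared value preferred over the text heuristic.
  const actionA = a.action ?? a.textAction;
  const actionB = b.action ?? b.textAction;
  return (
    (actionA === "create" && actionB === "remove") ||
    (actionA === "remove" && actionB === "create")
  );
}
```

A caller that sends neither field falls straight through to `textAction`, so the sketch degrades to the text-only behavior. Non-opposed pairs such as create versus audit stay neutral here.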

4 · Low-information guard

isLowInformation flags an item whose every content token is generic-work/filler vocabulary (fix the thing that is broken, address the reported issue) — it names no specific subject, so any match it produces (lexical or semantic) is unreliable. If either side is low-information, escalation is suppressed. Recall-safe because a real duplicate always names a concrete subject (redis, jwt, invoice).
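A minimal sketch of the guard, assuming small hand-written word lists. The rule itself (every content token is generic or filler) is from the text; the `STOPWORDS` and `GENERIC` sets below are illustrative stand-ins, not the real vocabulary.

```typescript
// Hypothetical filler words that carry no content at all.
const STOPWORDS = new Set(["the", "a", "an", "that", "is", "to", "of", "in", "for", "and"]);

// Hypothetical generic-work vocabulary: words that name no specific subject.
const GENERIC = new Set([
  "fix", "thing", "broken", "address", "reported", "issue",
  "bug", "problem", "update", "improve", "stuff", "work",
]);

// True when no remaining content token names a concrete subject.
function isLowInformation(text: string): boolean {
  const contentTokens = (text.toLowerCase().match(/[a-z0-9]+/g) ?? []).filter(
    (token) => !STOPWORDS.has(token),
  );
  return contentTokens.every((token) => GENERIC.has(token));
}
```

In this sketch "fix the thing that is broken" is flagged, while "fix the redis timeout" is not, because `redis` and `timeout` fall outside the generic list. An empty string is also treated as low-information, since it has no concrete token.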

Results

Measured by the eval harness at the shipped escalate_threshold = 0.50:

| Stage | False positives | Precision | Recall |
| --- | --- | --- | --- |
| Baseline (semantic+lexical) | 8 | 76.5% | 100% |
| + Action divergence (A) | 5 | 83.9% | 100% |
| + Component/action (B) | 3 | 89.7% | 100% |
| + Low-information guard | 2 | 92.9% | 100% |
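As a quick arithmetic check, precision is true positives divided by all escalations (true plus false positives). The true-positive count is not stated in this document; the figures in the table are consistent with 26, so the sketch below uses that value as an inferred assumption.

```typescript
// Precision as a percentage string with one decimal place.
function precisionPct(truePositives: number, falsePositives: number): string {
  return ((100 * truePositives) / (truePositives + falsePositives)).toFixed(1) + "%";
}

// Inferred from the table's percentages, not stated in the source.
const TRUE_POSITIVES = 26;

for (const falsePositives of [8, 5, 3, 2]) {
  console.log(falsePositives, precisionPct(TRUE_POSITIVES, falsePositives));
}
// prints:
// 8 76.5%
// 5 83.9%
// 3 89.7%
// 2 92.9%
```

Since recall stays at 100% the true-positive count does not change between rows, so each stage's gain comes entirely from removing false positives.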

Recall has never dropped below 100% across the 66+ labeled scenarios (52 pairwise + a 24-task scale-stress + the structured/low-info edge sets). The committed regression suite also enforces a meaningful duplicate sample, 100% recall, and a >=90% precision floor over its stage-1 scenarios so npm run check:accuracy fails before the customer handoff promise drifts.
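The kind of gate the regression suite enforces can be sketched like this. It is an illustration under assumptions, not the suite's code: the `ScenarioResult` shape and the `evaluate` / `passesGate` helpers are hypothetical, and the "meaningful duplicate sample" check is left out.

```typescript
// Assumed per-scenario record: ground-truth label plus the stage-1 decision.
interface ScenarioResult {
  isDuplicate: boolean; // labeled as a real duplicate
  escalated: boolean;   // stage 1 chose to escalate
}

function evaluate(results: ScenarioResult[]): { recall: number; precision: number } {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  for (const result of results) {
    if (result.escalated && result.isDuplicate) truePositives++;
    else if (result.escalated) falsePositives++;
    else if (result.isDuplicate) falseNegatives++;
  }
  const duplicates = truePositives + falseNegatives;
  const escalations = truePositives + falsePositives;
  return {
    recall: duplicates === 0 ? 1 : truePositives / duplicates,
    precision: escalations === 0 ? 1 : truePositives / escalations,
  };
}

// Recall must be exactly 100%; precision must meet the floor.
function passesGate(results: ScenarioResult[], precisionFloor: number = 0.9): boolean {
  const { recall, precision } = evaluate(results);
  return recall === 1 && precision >= precisionFloor;
}
```

With 10 duplicates all escalated and one non-duplicate escalated, precision is 10/11 (about 90.9%) and the gate passes; a single missed duplicate fails it regardless of precision.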

Known residual false positives

Two remain, both the action-family residue — same entity/component, but a work-type difference the opposed-axis deliberately treats as neutral for recall safety:

  • build dark-mode toggle vs audit dark-mode colors (create vs audit, 0.51)
  • create invoice vs email invoice (create vs send, 0.52)

The next lever is a declared-first action-family expansion (treat clearly distinct declared actions as divergent). It carries genuine recall risk, so it stays eval-gated and unshipped until proven at 100% recall.

Where it lives

| Concern | Path |
| --- | --- |
| Base scorer / combiner | src/server/overlap/semantic.ts, overlap/lexical.ts |
| Action + component divergence | src/server/overlap/action.ts |
| Low-information guard | src/server/overlap/lexical.ts |
| Escalation gate + match assembly | src/server/overlap/scoring.ts (findOverlap, isEscalatable) |
| Thresholds (single source) | src/server/overlap/constants.ts, repositories/config-repo.ts |
| Stage-1 wiring | src/server/mcp/check-task.ts, mcp/planned-task-tools.ts |
| Committed regression suite | tests/unit/overlap-scenarios.test.ts + tests/fixtures/overlap-scenarios.json |
| Semantic calibration | tests/unit/overlap-calibration.test.ts + tests/fixtures/overlap-eval.json |
| Precision/recall eval (local, gitignored) | .gstack/overlap-eval/ |

Changing the precision logic: the rule

Any change to a signal must keep the committed regression suite green and hold recall = 100% on the full eval. Add scenarios to tests/fixtures/overlap-scenarios.json when you add a signal — a structure guard in the suite fails if the fixture drifts back to text-only. Never lower the escalate threshold or relax a recall assertion to make a build pass.
