# Overlap Detection — Precision
How Co-Vibe decides whether a duplicate-work match is worth escalating to the
agent. This is the reference for the stage-1 precision signals; the agent-judge
protocol (stage 2) lives in mcp-contract.md → Overlap
Detection v2.
## The one rule that governs everything
Recall is sacred. Precision is earned.
Detection runs in two stages. Stage 1 is the cheap, local, always-on gate; strong candidates escalate to the stage-2 agent-judge. Every precision signal below only ever suppresses an escalation — it never touches the warn-floor, so a warning is still recorded for the dashboard/audit even when the agent is not interrupted. A real duplicate must never stop escalating; that invariant is enforced by the eval harness at recall = 100% on every change.
The warn-floor informs; only escalation gates. Warn-floor warnings are
disclosed on the check response (warning_id + data.matches with score and
reason) and never block start/start_planned. This changed 2026-06-10: the
old behavior let a sub-threshold warn-floor warning silently hard-block the
subsequent start, which a live 30-scenario battery measured at a 7/7
false-block rate on non-duplicate traps (matches at 29–39%, some with
actionDivergent: true), so every precision signal below was being undone at the
start gate. Blocking now happens only on the escalated (requires_verdict)
path and on agent-confirmed duplicates, so stage-1 precision carries through to
the developer-facing behavior. The committed regression suite held recall
10/10 (100%) and precision 10/11 (90.9%) across the change. One more
escalation rule rides on the same principle: matches against done/cancelled
work inform but never escalate (finished work cannot be duplicated, only
repeated — covibe_team audits cover that).
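
The start-gate rule above can be sketched as a small decision function. This is a minimal illustration, not the shipped code: the `CheckOutcomeSketch` shape and its field names (`pendingVerdict`, `agentConfirmedDuplicate`) are hypothetical stand-ins for whatever the check response actually carries.

```typescript
// Hypothetical summary of a prior check; field names are assumptions for illustration.
interface CheckOutcomeSketch {
  warningId?: string;               // set when a warn-floor warning was recorded
  pendingVerdict: boolean;          // escalated path (requires_verdict) still awaiting the agent
  agentConfirmedDuplicate: boolean; // stage-2 agent judged the match a duplicate
}

// A warn-floor warning alone never blocks start/start_planned.
// Only the escalated path or an agent-confirmed duplicate does.
function blocksStartSketch(outcome: CheckOutcomeSketch): boolean {
  return outcome.pendingVerdict || outcome.agentConfirmedDuplicate;
}
```

Note that `warningId` is deliberately ignored by the gate: in this sketch a recorded warning is information for the dashboard, not a reason to block.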
## The escalation gate
A candidate escalates only when all four hold — the single predicate
isEscalatable(match, threshold) in src/server/overlap/scoring.ts:

    score >= escalate_threshold     AND  // strong enough (default 0.50)
    NOT actionDivergent             AND  // not opposed/different work
    NOT lowInformation              AND  // not too vague to trust
    status NOT IN (done, cancelled)      // open work only; finished work informs

findOverlap computes actionDivergent and lowInformation per candidate and
hangs them on the Match. The same predicate is used by check-task,
planned-task-tools, and the eval harness — they cannot drift apart.
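
As a rough TypeScript rendering of the four-part predicate, the sketch below may help. The `MatchSketch` shape and the open-status names (`planned`, `in_progress`) are assumptions; only `done` and `cancelled`, the two flags, and the 0.50 default come from this document.

```typescript
// Assumed status set; the source only names "done" and "cancelled".
type TaskStatusSketch = "planned" | "in_progress" | "done" | "cancelled";

// Hypothetical shape holding only the fields the gate reads.
interface MatchSketch {
  score: number;            // max(lexical, semantic), in [0, 1]
  actionDivergent: boolean; // opposed or different work
  lowInformation: boolean;  // either side too vague to trust
  status: TaskStatusSketch; // status of the matched (existing) task
}

const DEFAULT_ESCALATE_THRESHOLD = 0.5;

// All four conditions must hold for a candidate to escalate.
function isEscalatable(
  match: MatchSketch,
  threshold: number = DEFAULT_ESCALATE_THRESHOLD
): boolean {
  const isOpen = match.status !== "done" && match.status !== "cancelled";
  return (
    match.score >= threshold &&
    !match.actionDivergent &&
    !match.lowInformation &&
    isOpen
  );
}
```

Keeping the gate a pure function of the match and the threshold is what lets several callers share it without drifting.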
## 1 · Base score — semantic ∥ lexical (OR-of-signals)
combinedScore (overlap/semantic.ts) takes the stronger of a local
sentence-embedding cosine (Xenova/all-MiniLM-L6-v2, offline) and a lexical
word-overlap score. Either signal firing is a match, so a reworded duplicate with
no shared words is still caught. score = max(lexical, semantic).
The escalate threshold is 0.50 (DEFAULT_ESCALATE_THRESHOLD, runtime-tunable
via app_config.escalate_threshold). It is locked: real duplicates bottom out
at ~0.51, so any higher value starts dropping duplicates (recall loss). Precision
above the threshold is bought with the divergence/guard signals below, not by
moving the bar.
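
A minimal sketch of the OR-of-signals combiner follows. The `max` step is from this document; the Jaccard word overlap is an assumed stand-in for the real lexical scorer, and the embedding vectors are taken as plain number arrays of equal length rather than produced by the actual model.

```typescript
// Cosine similarity of two equal-length embedding vectors.
function cosineSketch(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Assumption: Jaccard overlap of lowercase word sets stands in for the lexical score.
function lexicalOverlapSketch(a: string, b: string): number {
  const tokens = (s: string): string[] =>
    Array.from(new Set(s.toLowerCase().match(/[a-z0-9]+/g) || []));
  const ta = tokens(a);
  const tb = new Set(tokens(b));
  const shared = ta.filter((t) => tb.has(t)).length;
  const union = ta.length + tb.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Either signal firing is enough: take the stronger of the two.
function combinedScoreSketch(lexical: number, semantic: number): number {
  return Math.max(lexical, semantic);
}
```

With this shape, a reworded duplicate that shares no words (lexical near 0) still scores high as long as the embedding cosine is high.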
## 2 · Action divergence — opposed work (Approach A)
overlap/action.ts extracts a canonical
action bucket (create/remove) from each item's title + scope and reports
when the two are opposed (enable↔remove, add↔invalidate). Two items sharing an
entity but doing opposite work — enable the flag vs remove the flag — are not
the same task. Unknown actions are neutral (never opposed), so a duplicate phrased
with an out-of-lexicon verb is never suppressed.
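
The text heuristic can be illustrated roughly as below. The lexicon is a small hypothetical sample, not the contents of overlap/action.ts, and picking the first recognized verb is an assumed simplification.

```typescript
type ActionBucketSketch = "create" | "remove" | "unknown";

// Hypothetical lexicon; the real verb list is not given in this document.
const ACTION_LEXICON_SKETCH = new Map<string, "create" | "remove">([
  ["add", "create"],
  ["create", "create"],
  ["enable", "create"],
  ["build", "create"],
  ["remove", "remove"],
  ["delete", "remove"],
  ["disable", "remove"],
  ["invalidate", "remove"],
]);

// Map the first recognized verb in title + scope text to a canonical bucket.
function actionBucketSketch(text: string): ActionBucketSketch {
  for (const word of text.toLowerCase().match(/[a-z]+/g) || []) {
    const bucket = ACTION_LEXICON_SKETCH.get(word);
    if (bucket) return bucket;
  }
  return "unknown";
}

// Unknown on either side is neutral, so an out-of-lexicon verb never suppresses.
function isActionOpposedSketch(a: string, b: string): boolean {
  const bucketA = actionBucketSketch(a);
  const bucketB = actionBucketSketch(b);
  if (bucketA === "unknown" || bucketB === "unknown") return false;
  return bucketA !== bucketB;
}
```

The neutral default is the recall-safety lever: the check can only fire when both sides use verbs it positively recognizes.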
## 3 · Structured action / component divergence (Approach B)
Agents may declare two optional fields on covibe_task (operation: "check", "plan", or "start"):
| Field | Type | Meaning |
|---|---|---|
| action | create \| modify \| remove \| fix \| audit | what is being done |
| component | string (≤40, e.g. server, client, db, ui) | which layer |
They are persisted on the task (idempotent migration) and surfaced on every
candidate, so the divergence check works on both sides. The layered rule
(isDivergent): different declared component → divergent (catches
server endpoint vs client, which text alone cannot); else opposed action
(declared preferred, the Approach-A text heuristic as fallback). Everything is
additive — a caller that sends neither field gets exactly the Approach-A behavior.
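
One way to read the layered rule is sketched below. The `DeclaredSketch` shape is hypothetical, and treating only create versus remove as opposed among declared actions is an assumption, chosen to match the note later in this document that other action differences are deliberately neutral.

```typescript
type DeclaredActionSketch = "create" | "modify" | "remove" | "fix" | "audit";

// Hypothetical view of the two optional declared fields.
interface DeclaredSketch {
  action?: DeclaredActionSketch;
  component?: string;
}

// Assumption: only create vs remove counts as opposed; other pairs stay neutral.
function declaredOpposedSketch(a: DeclaredActionSketch, b: DeclaredActionSketch): boolean {
  return (a === "create" && b === "remove") || (a === "remove" && b === "create");
}

function isDivergentSketch(
  a: DeclaredSketch,
  b: DeclaredSketch,
  textOpposed: boolean // result of the Approach-A text heuristic
): boolean {
  // Layer 1: both sides declare a component and they differ.
  if (a.component && b.component && a.component !== b.component) return true;
  // Layer 2: declared actions are preferred when both sides provide one.
  if (a.action && b.action) return declaredOpposedSketch(a.action, b.action);
  // Layer 3: fall back to the text heuristic.
  return textOpposed;
}
```

When neither side declares anything, the function simply returns the text heuristic's answer, which is the additive behavior described above.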
## 4 · Low-information guard
isLowInformation flags an item whose every
content token is generic-work/filler vocabulary (fix the thing that is broken,
address the reported issue) — it names no specific subject, so any match it
produces (lexical or semantic) is unreliable. If either side is
low-information, escalation is suppressed. Recall-safe because a real duplicate
always names a concrete subject (redis, jwt, invoice).
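
A rough sketch of the guard: an item is low-information when every token falls in a generic vocabulary. The word list here is a hypothetical sample built from the two example phrases, and it merges stop words with filler for brevity.

```typescript
// Hypothetical generic-work/filler vocabulary; the real list is not given here.
const GENERIC_VOCAB_SKETCH = new Set<string>([
  "fix", "the", "thing", "that", "is", "broken",
  "address", "reported", "issue", "a", "an", "it", "problem", "update",
]);

// True when no token names a concrete subject. Empty text also counts as low-information.
function isLowInformationSketch(text: string): boolean {
  const tokens: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return tokens.every((t) => GENERIC_VOCAB_SKETCH.has(t));
}

// Escalation is suppressed if either side is low-information.
function eitherLowInformationSketch(a: string, b: string): boolean {
  return isLowInformationSketch(a) || isLowInformationSketch(b);
}
```

A single concrete token such as `redis` is enough to clear the guard, which is why it should not cost recall on real duplicates.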
## Results
Measured by the eval harness at the shipped escalate_threshold = 0.50:
| Stage | False positives | Precision | Recall |
|---|---|---|---|
| Baseline (semantic+lexical) | 8 | 76.5% | 100% |
| + Action divergence (A) | 5 | 83.9% | 100% |
| + Component/action (B) | 3 | 89.7% | 100% |
| + Low-information guard | 2 | 92.9% | 100% |
Recall has never dropped below 100% across the 66+ labeled scenarios (52 pairwise
+ a 24-task scale-stress + the structured/low-info edge sets).
The committed regression suite also enforces a meaningful duplicate sample,
100% recall, and a >=90% precision floor over its stage-1 scenarios so
npm run check:accuracy fails before the customer handoff promise drifts.
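
The precision column in the Results table follows from precision = TP / (TP + FP). The true-positive count is not stated in this document; 26 is an inference, being the value that reproduces all four percentages at 100% recall.

```typescript
// precision = true positives / (true positives + false positives)
function precisionSketch(truePositives: number, falsePositives: number): number {
  return truePositives / (truePositives + falsePositives);
}

// Assumption: 26 true positives, inferred from the table rather than stated in it.
const INFERRED_TRUE_POSITIVES = 26;

const stagesSketch: Array<[string, number]> = [
  ["Baseline (semantic+lexical)", 8],
  ["+ Action divergence (A)", 5],
  ["+ Component/action (B)", 3],
  ["+ Low-information guard", 2],
];

for (const [stage, falsePositives] of stagesSketch) {
  const pct = (precisionSketch(INFERRED_TRUE_POSITIVES, falsePositives) * 100).toFixed(1);
  console.log(`${stage}: ${pct}%`);
}
// Baseline (semantic+lexical): 76.5%
// + Action divergence (A): 83.9%
// + Component/action (B): 89.7%
// + Low-information guard: 92.9%
```

Because recall stays at 100%, the true-positive count is constant across stages and each signal improves precision purely by removing false positives.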
## Known residual false positives
Two remain, both the action-family residue — same entity/component, but a work-type difference the opposed-axis deliberately treats as neutral for recall safety:
- build dark-mode toggle vs audit dark-mode colors (create vs audit, 0.51)
- create invoice vs email invoice (create vs send, 0.52)
The next lever is a declared-first action-family expansion (treat clearly distinct declared actions as divergent). It carries genuine recall risk, so it stays eval-gated and unshipped until proven at 100% recall.
## Where it lives
| Concern | Path |
|---|---|
| Base scorer / combiner | src/server/overlap/semantic.ts, overlap/lexical.ts |
| Action + component divergence | src/server/overlap/action.ts |
| Low-information guard | src/server/overlap/lexical.ts |
| Escalation gate + match assembly | src/server/overlap/scoring.ts (findOverlap, isEscalatable) |
| Thresholds (single source) | src/server/overlap/constants.ts, repositories/config-repo.ts |
| Stage-1 wiring | src/server/mcp/check-task.ts, mcp/planned-task-tools.ts |
| Committed regression suite | tests/unit/overlap-scenarios.test.ts + tests/fixtures/overlap-scenarios.json |
| Semantic calibration | tests/unit/overlap-calibration.test.ts + tests/fixtures/overlap-eval.json |
| Precision/recall eval (local, gitignored) | .gstack/overlap-eval/ |
## Changing the precision logic — the rule
Any change to a signal must keep the committed regression suite green and hold
recall = 100% on the full eval. Add scenarios to
tests/fixtures/overlap-scenarios.json when you add a signal — a structure guard
in the suite fails if the fixture drifts back to text-only. Never lower the
escalate threshold or relax a recall assertion to make a build pass.