# Overlap Detection — Precision

How Co-Vibe decides whether a duplicate-work match is worth **escalating** to the
agent. This is the reference for the stage-1 precision signals; the agent-judge
protocol (stage 2) lives in [`mcp-contract.md`](./mcp-contract.md) → *Overlap
Detection v2*.

## The one rule that governs everything

> **Recall is sacred. Precision is earned.**

Detection runs in two stages. Stage 1 is the cheap, local, always-on gate; strong
candidates escalate to the stage-2 agent-judge. Every precision signal below only
ever **suppresses an escalation** — it never touches the warn-floor, so a warning
is still recorded for the dashboard/audit even when the agent is not interrupted.
A real duplicate must never stop escalating; that invariant is enforced by the
eval harness at **recall = 100%** on every change.

**The warn-floor informs; only escalation gates.** Warn-floor warnings are
disclosed on the check response (`warning_id` + `data.matches` with score and
reason) and never block `start`/`start_planned`. This changed on 2026-06-10: the
old behavior let a sub-threshold warn-floor warning silently hard-block the
subsequent start, which a live 30-scenario battery measured at a **7/7
false-block rate** on non-duplicate traps (matches at 29–39%, some with
`actionDivergent: true`). Every precision signal described below was being undone
at the start gate. Blocking now happens only on the escalated (`requires_verdict`)
path and on agent-confirmed duplicates, so stage-1 precision carries through to
the developer-facing behavior. The committed regression suite held recall
10/10 (100%) and precision 10/11 (90.9%) across the change. One more
escalation rule rides on the same principle: matches against `done`/`cancelled`
work inform but never escalate (finished work cannot be duplicated, only
repeated — `covibe_team` audits cover that).
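
The gating rule in this section can be sketched as a small decision function. This is an illustrative model only: the `CheckOutcome` shape, its `kind` labels, and `blocksStart` are hypothetical names invented for the example, not the server's actual types.

```typescript
// Hypothetical outcome of a stage-1 check, modeled as a tagged union.
type CheckOutcome =
  | { kind: "clear" }                           // nothing matched
  | { kind: "warn_floor"; warningId: string }   // sub-threshold match, disclosed only
  | { kind: "requires_verdict" }                // escalated to the stage-2 agent-judge
  | { kind: "confirmed_duplicate" };            // agent confirmed the duplicate

// Only the escalated path and agent-confirmed duplicates block a start.
// Warn-floor warnings inform the caller but never block.
function blocksStart(outcome: CheckOutcome): boolean {
  return outcome.kind === "requires_verdict" || outcome.kind === "confirmed_duplicate";
}

const warned: CheckOutcome = { kind: "warn_floor", warningId: "w-123" };
const escalated: CheckOutcome = { kind: "requires_verdict" };
```

Under this model `blocksStart(warned)` is `false` and `blocksStart(escalated)` is `true`, which is the behavior the 2026-06-10 change describes.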

## The escalation gate

A candidate escalates only when **all four** hold — the single predicate
`isEscalatable(match, threshold)` in [`src/server/overlap/scoring.ts`](../../src/server/overlap/scoring.ts):

```
score >= escalate_threshold      AND      // strong enough (default 0.50)
NOT actionDivergent              AND      // not opposed/different work
NOT lowInformation               AND      // not too vague to trust
status NOT IN (done, cancelled)           // open work only; finished work informs
```

`findOverlap` computes `actionDivergent` and `lowInformation` per candidate and
hangs them on the `Match`. **The same predicate is used by `check-task`,
`planned-task-tools`, and the eval harness** — they cannot drift apart.
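
The four conditions can be sketched as a single TypeScript predicate. This is a minimal illustration, not the code in `scoring.ts`: the `Match` field layout and the non-terminal status names (`planned`, `in_progress`) are assumptions made for the example.

```typescript
// Assumed status values; only "done" and "cancelled" are named in this document.
type TaskStatus = "planned" | "in_progress" | "done" | "cancelled";

// Assumed shape of a stage-1 match; the real Match type may carry more fields.
interface Match {
  score: number;            // combined stage-1 score in [0, 1]
  actionDivergent: boolean; // opposed or different work
  lowInformation: boolean;  // either side too vague to trust
  status: TaskStatus;       // status of the matched (existing) task
}

const DEFAULT_ESCALATE_THRESHOLD = 0.5;

// All four conditions must hold for a candidate to escalate.
function isEscalatable(match: Match, threshold: number = DEFAULT_ESCALATE_THRESHOLD): boolean {
  const finished = match.status === "done" || match.status === "cancelled";
  return (
    match.score >= threshold &&
    !match.actionDivergent &&
    !match.lowInformation &&
    !finished
  );
}

const strong: Match = { score: 0.62, actionDivergent: false, lowInformation: false, status: "in_progress" };
const opposed: Match = { ...strong, actionDivergent: true };
const finishedWork: Match = { ...strong, status: "done" };
```

Here `strong` escalates, while `opposed` and `finishedWork` do not, even though all three carry the same score.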

### 1 · Base score — semantic ∥ lexical (OR-of-signals)

`combinedScore` (`overlap/semantic.ts`) takes the **stronger** of a local
sentence-embedding cosine (`Xenova/all-MiniLM-L6-v2`, offline) and a lexical
word-overlap score. Either signal firing is a match, so a reworded duplicate with
no shared words is still caught. `score = max(lexical, semantic)`.
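
A minimal sketch of the OR-of-signals combiner follows. It assumes the embeddings are already computed and passed in as plain number arrays, and it uses a simple token-set overlap (Jaccard) as a stand-in for the lexical scorer; the real `combinedScore` signature and the real lexical formula may differ.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  const n = Math.min(a.length, b.length);
  for (let i = 0; i < n; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Stand-in lexical score (assumption): Jaccard overlap of lowercase word sets.
function lexicalOverlap(a: string, b: string): number {
  const wordsA: string[] = a.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const wordsB: string[] = b.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const setA = new Set(wordsA);
  const setB = new Set(wordsB);
  if (setA.size === 0 || setB.size === 0) return 0;
  const shared = Array.from(setA).filter((w) => setB.has(w)).length;
  return shared / (setA.size + setB.size - shared);
}

// Either signal firing is a match: take the stronger of the two.
function combinedScore(lexical: number, semantic: number): number {
  return Math.max(lexical, semantic);
}

// A reworded duplicate: no shared words, but (toy) embeddings that point the same way.
const lexical = lexicalOverlap("cache sessions with redis", "store login state using memory");
const semantic = cosine([0.6, 0.8], [0.8, 0.6]);
const score = combinedScore(lexical, semantic);
```

With these toy inputs the lexical score is 0 and the semantic score is about 0.96, so the pair still scores as a strong match.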

The **escalate threshold is 0.50** (`DEFAULT_ESCALATE_THRESHOLD`, runtime-tunable
via `app_config.escalate_threshold`). Treat it as **locked** despite the config
knob: real duplicates bottom out at ~0.51, so raising the bar past that point starts
dropping duplicates (recall loss). Precision above the threshold is bought with the
divergence/guard signals below, not by moving the bar.

### 2 · Action divergence — opposed work (Approach A)

[`overlap/action.ts`](../../src/server/overlap/action.ts) extracts a canonical
**action bucket** (`create`/`remove`) from each item's title + scope and reports
when the two are **opposed** (enable↔remove, add↔invalidate). Two items sharing an
entity but doing opposite work — *enable the flag* vs *remove the flag* — are not
the same task. Unknown actions are neutral (never opposed), so a duplicate phrased
with an out-of-lexicon verb is never suppressed.
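
The opposed-action check can be sketched as follows. The verb lexicon here is a small hypothetical sample chosen for illustration, and `extractBucket` and `isOpposed` are assumed names; the real extractor in `action.ts` also reads scope and likely uses a larger lexicon.

```typescript
type ActionBucket = "create" | "remove";

// Hypothetical sample lexicon mapping verbs to a canonical bucket.
const VERB_BUCKETS = new Map<string, ActionBucket>([
  ["add", "create"],
  ["enable", "create"],
  ["create", "create"],
  ["remove", "remove"],
  ["delete", "remove"],
  ["invalidate", "remove"],
]);

// Returns the bucket of the first known verb, or undefined when no verb is recognized.
function extractBucket(text: string): ActionBucket | undefined {
  const words: string[] = text.toLowerCase().match(/[a-z]+/g) ?? [];
  for (const word of words) {
    const bucket = VERB_BUCKETS.get(word);
    if (bucket !== undefined) return bucket;
  }
  return undefined;
}

// Opposed only when BOTH sides have a known bucket and the buckets differ.
// An unknown action is neutral, so it can never suppress an escalation.
function isOpposed(a: string, b: string): boolean {
  const bucketA = extractBucket(a);
  const bucketB = extractBucket(b);
  return bucketA !== undefined && bucketB !== undefined && bucketA !== bucketB;
}
```

Under this sketch, `isOpposed("enable the flag", "remove the flag")` is `true`, while `isOpposed("enable the flag", "toggle on the flag")` is `false` because the second title contains no verb from the lexicon.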

### 3 · Structured action / component divergence (Approach B)

Agents may declare two optional fields on `covibe_task` (`operation: "check"`, `"plan"`, or `"start"`):

| Field | Type | Meaning |
|---|---|---|
| `action` | `create \| modify \| remove \| fix \| audit` | what is being done |
| `component` | string (≤40 characters, e.g. `server`, `client`, `db`, `ui`) | which layer |

They are **persisted on the task** (idempotent migration) and surfaced on every
candidate, so the divergence check works on both sides. The layered rule
(`isDivergent`): **different declared component → divergent** (catches
*server endpoint* vs *client*, which text alone cannot); else **opposed action**
(declared preferred, the Approach-A text heuristic as fallback). Everything is
additive — a caller that sends neither field gets exactly the Approach-A behavior.
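
A sketch of the layered rule, under stated assumptions: the `Item` shape, the `textOpposed` callback parameter, and the choice to treat only `create` vs `remove` as opposed among declared actions are all illustrative, not confirmed details of `isDivergent`.

```typescript
type DeclaredAction = "create" | "modify" | "remove" | "fix" | "audit";

// Assumed shape: both structured fields are optional.
interface Item {
  title: string;
  action?: DeclaredAction;
  component?: string;
}

// Assumption: among declared actions, only create vs remove counts as opposed.
function declaredOpposed(a: DeclaredAction, b: DeclaredAction): boolean {
  return (a === "create" && b === "remove") || (a === "remove" && b === "create");
}

// Layer 1: different declared components are divergent.
// Layer 2: declared actions decide when both sides declare one.
// Layer 3: otherwise fall back to the Approach-A text heuristic.
function isDivergent(
  a: Item,
  b: Item,
  textOpposed: (titleA: string, titleB: string) => boolean,
): boolean {
  if (a.component !== undefined && b.component !== undefined && a.component !== b.component) {
    return true;
  }
  if (a.action !== undefined && b.action !== undefined) {
    return declaredOpposed(a.action, b.action);
  }
  return textOpposed(a.title, b.title);
}

// Stub heuristic for the example; a real caller would pass the Approach-A check.
const neverOpposed = (_a: string, _b: string): boolean => false;

const serverSide: Item = { title: "add export endpoint", action: "create", component: "server" };
const clientSide: Item = { title: "add export button", action: "create", component: "client" };
const layersDiverge = isDivergent(serverSide, clientSide, neverOpposed);
```

Here `layersDiverge` is `true` purely because the declared components differ; the titles and actions alone would not have separated the two items.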

### 4 · Low-information guard

[`isLowInformation`](../../src/server/overlap/lexical.ts) flags an item whose every
content token is generic-work/filler vocabulary (*fix the thing that is broken*,
*address the reported issue*) — it names no specific subject, so any match it
produces (lexical or semantic) is unreliable. If **either** side is
low-information, escalation is suppressed. Recall-safe because a real duplicate
always names a concrete subject (redis, jwt, invoice).
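
The guard can be sketched as a vocabulary check. Both word lists below are small hypothetical samples, and the tokenization is an assumption; the real `isLowInformation` in `lexical.ts` may tokenize differently and use a larger vocabulary.

```typescript
// Hypothetical sample of function words that are not content tokens.
const STOPWORDS = new Set(["the", "a", "an", "that", "is", "to", "of", "in", "and", "it"]);

// Hypothetical sample of generic-work / filler vocabulary.
const GENERIC = new Set([
  "fix", "thing", "broken", "address", "reported", "issue",
  "update", "bug", "problem", "task", "work",
]);

// True when every content token is generic, i.e. the text names no specific subject.
// An empty title has no content tokens and is therefore also low-information.
function isLowInformation(text: string): boolean {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  const content = words.filter((w) => !STOPWORDS.has(w));
  return content.every((w) => GENERIC.has(w));
}

// Escalation is suppressed when EITHER side is low-information.
function suppressForLowInformation(a: string, b: string): boolean {
  return isLowInformation(a) || isLowInformation(b);
}
```

With these lists, *fix the thing that is broken* is low-information, while *fix the redis cache eviction bug* is not, because `redis`, `cache`, and `eviction` fall outside the generic vocabulary.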

## Results

Measured by the eval harness at the shipped `escalate_threshold = 0.50`:

| Stage | False positives | Precision | Recall |
|---|---|---|---|
| Baseline (semantic+lexical) | 8 | 76.5% | 100% |
| + Action divergence (A) | 5 | 83.9% | 100% |
| + Component/action (B) | 3 | 89.7% | 100% |
| + Low-information guard | **2** | **92.9%** | **100%** |

Recall has never dropped below 100% across the 66+ labeled scenarios (52 pairwise
+ a 24-task scale-stress + the structured/low-info edge sets).
The committed regression suite also enforces a meaningful duplicate sample,
100% recall, and a >=90% precision floor over its stage-1 scenarios, so
`npm run check:accuracy` fails as soon as a change would break the accuracy
promised at customer handoff.
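
The percentages follow from precision = TP / (TP + FP). The sketch below reproduces them; the figure of 26 true positives is inferred from the table's arithmetic (it is the only count that yields 76.5% at 8 false positives), not a number reported by the harness.

```typescript
// Precision: the share of escalations that were real duplicates.
function precision(truePositives: number, falsePositives: number): number {
  return truePositives / (truePositives + falsePositives);
}

// Format a ratio as a percentage with one decimal place.
function pct(ratio: number): string {
  return (ratio * 100).toFixed(1) + "%";
}

// Inferred from the table, not stated by the eval harness.
const INFERRED_TRUE_POSITIVES = 26;

// False-positive counts per stage, taken from the results table.
const stages: Array<[string, number]> = [
  ["Baseline (semantic+lexical)", 8],
  ["+ Action divergence (A)", 5],
  ["+ Component/action (B)", 3],
  ["+ Low-information guard", 2],
];

const rows = stages.map(([name, fp]) => `${name}: ${pct(precision(INFERRED_TRUE_POSITIVES, fp))}`);

// The committed regression suite figure quoted earlier: 10 of 11 escalations correct.
const regressionPrecision = pct(precision(10, 1)); // "90.9%"
```

Each entry in `rows` matches the corresponding table row (76.5%, 83.9%, 89.7%, 92.9%), and the 10/11 regression figure sits just above the 90% floor.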

### Known residual false positives

Two remain, both **action-family residue**: the items share an entity and
component but differ in work *type*, a difference the opposed-action axis
deliberately treats as neutral for recall safety:

- *build dark-mode toggle* vs *audit dark-mode colors* (create vs audit, 0.51)
- *create invoice* vs *email invoice* (create vs send, 0.52)

The next lever is a **declared-first action-family expansion** (treat clearly
distinct declared actions as divergent). It carries genuine recall risk, so it
stays eval-gated and unshipped until proven at 100% recall.

## Where it lives

| Concern | Path |
|---|---|
| Base scorer / combiner | `src/server/overlap/semantic.ts`, `overlap/lexical.ts` |
| Action + component divergence | `src/server/overlap/action.ts` |
| Low-information guard | `src/server/overlap/lexical.ts` |
| Escalation gate + match assembly | `src/server/overlap/scoring.ts` (`findOverlap`, `isEscalatable`) |
| Thresholds (single source) | `src/server/overlap/constants.ts`, `repositories/config-repo.ts` |
| Stage-1 wiring | `src/server/mcp/check-task.ts`, `mcp/planned-task-tools.ts` |
| **Committed regression suite** | `tests/unit/overlap-scenarios.test.ts` + `tests/fixtures/overlap-scenarios.json` |
| Semantic calibration | `tests/unit/overlap-calibration.test.ts` + `tests/fixtures/overlap-eval.json` |
| Precision/recall eval (local, gitignored) | `.gstack/overlap-eval/` |

## Changing the precision logic — the rule

Any change to a signal **must** keep the committed regression suite green and hold
**recall = 100%** on the full eval. Add scenarios to
`tests/fixtures/overlap-scenarios.json` when you add a signal — a structure guard
in the suite fails if the fixture drifts back to text-only. Never lower the
escalate threshold or relax a recall assertion to make a build pass.