Build Loop Tests

These tests must exist before agents claim the Build Test Fix loop is done.

Agent Logging Test

Prove agents really log their work.

Simulate or run an agent calling covibe_task with operation: "plan".
Verify Workstreams displays the planned task.
Simulate or run an agent calling covibe_task with operation: "check".
Simulate or run the same agent calling covibe_task with operation: "start" or operation: "start_planned".
Verify the task exists in structured storage.
Verify a task.started work event exists.
Verify a usage event exists for each MCP call.
Verify Workstreams displays the task.
Verify Activity displays the event.
Complete the task and verify Activity displays the result summary.
Cancel a planned task and verify it leaves Workstreams while Activity shows the reason.
Start an overlapping planned task with confirmation and verify Activity shows warning.confirmed.
Log a blocker and a decision, and verify Coordination displays both.
Verify covibe_team operation: "state" returns the planned work, blocker, and decision context.
Verify covibe_team operation: "state" returns recent completions with result summaries.
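
The core of the steps above (plan, check, start, then assert on stored records) can be sketched with an in-memory fake. This is only an illustration: `FakeCovibe` and its fields are hypothetical stand-ins, not the real Co-Vibe backend or MCP client. Only the `covibe_task` operation names and the `task.started` event type come from the checklist.

```typescript
// Hypothetical in-memory stand-in for the Co-Vibe backend (not the real API).
type TaskStatus = "planned" | "active";

interface WorkEvent {
  type: string;
  taskId: string;
}

interface UsageEvent {
  tool: string;
  operation: string;
}

class FakeCovibe {
  tasks: Record<string, TaskStatus> = {};
  workEvents: WorkEvent[] = [];
  usageEvents: UsageEvent[] = [];

  // Mimics a covibe_task call: every call records exactly one usage event.
  covibeTask(operation: "plan" | "check" | "start_planned", taskId: string): void {
    this.usageEvents.push({ tool: "covibe_task", operation });
    if (operation === "plan") {
      this.tasks[taskId] = "planned";
    } else if (operation === "start_planned") {
      if (this.tasks[taskId] !== "planned") {
        throw new Error("task " + taskId + " is not planned");
      }
      this.tasks[taskId] = "active";
      this.workEvents.push({ type: "task.started", taskId });
    }
    // "check" only records usage in this sketch.
  }
}

const fake = new FakeCovibe();
fake.covibeTask("plan", "t1");
fake.covibeTask("check", "t1");
fake.covibeTask("start_planned", "t1");
```

A real test would make the same three assertions (task stored, one `task.started` event, one usage event per call) against structured storage rather than the fake's arrays.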

Session Heartbeat Test

Prove long-running agents do not look active forever after going quiet.

Start an agent session.
Call covibe_session with operation: "heartbeat".
Verify last_seen_at updates.
Age a session beyond the stale threshold.
Verify the dashboard/session view reports stale.
Verify covibe_team operation: "state" returns stale_sessions.
Verify stale sessions create warning feedback for agents.
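
The "age a session beyond the stale threshold" step is easiest to test when the clock is injected rather than read from the system. A minimal sketch, assuming a five-minute threshold (the real value is not stated here) and camelCase field names in place of `last_seen_at`:

```typescript
interface AgentSession {
  sessionId: string;
  lastSeenAt: number; // epoch milliseconds
}

// Assumed threshold for illustration only; the real value may differ.
const STALE_AFTER_MS = 5 * 60 * 1000;

// A heartbeat simply moves lastSeenAt forward to the supplied time.
function heartbeat(session: AgentSession, now: number): AgentSession {
  return { sessionId: session.sessionId, lastSeenAt: now };
}

// A session is stale once it has been quiet for longer than the threshold.
function staleSessions(
  sessions: AgentSession[],
  now: number,
  thresholdMs: number = STALE_AFTER_MS
): AgentSession[] {
  return sessions.filter((s) => now - s.lastSeenAt > thresholdMs);
}

const started: AgentSession = { sessionId: "s1", lastSeenAt: 0 };
const beaten = heartbeat(started, 1000);
const justInTime = staleSessions([beaten], 1000 + STALE_AFTER_MS);
const tooQuiet = staleSessions([beaten], 1000 + STALE_AFTER_MS + 1);
```

Passing `now` explicitly lets the test age a session without sleeping or mocking timers.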

Session Ownership Test

Prove one developer token cannot close or revive another developer's run.

Start an agent session with Developer A.
Call covibe_session operation: "heartbeat" with Developer B's token and A's session_id.
Verify the response is an error.
Call covibe_session operation: "end" with Developer B's token and A's session_id.
Verify the response is an error.
Verify A's session remains active.
Verify no session.ended work event was created by B.
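
The ownership rule being tested can be sketched as a guard that runs before any state change or event write. The function name, error codes, and session shape below are hypothetical. Only the `session.ended` event type comes from the checklist.

```typescript
interface OwnedSession {
  sessionId: string;
  developerId: string;
  active: boolean;
}

type SessionResult = { ok: true } | { ok: false; error: string };

// Ends a session only when the caller owns it; rejected calls leave
// both the session and the event log untouched.
function endSession(
  sessions: Record<string, OwnedSession>,
  events: string[],
  callerDeveloperId: string,
  sessionId: string
): SessionResult {
  const session = sessions[sessionId];
  if (!session) {
    return { ok: false, error: "session_not_found" }; // hypothetical error code
  }
  if (session.developerId !== callerDeveloperId) {
    return { ok: false, error: "not_session_owner" }; // hypothetical error code
  }
  session.active = false;
  events.push("session.ended");
  return { ok: true };
}

const ownedSessions: Record<string, OwnedSession> = {
  "sess-a": { sessionId: "sess-a", developerId: "dev-a", active: true },
};
const sessionEvents: string[] = [];
const byOther = endSession(ownedSessions, sessionEvents, "dev-b", "sess-a");
```

The same guard would apply to the heartbeat operation, so Developer B can neither close nor revive A's session.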

Parallel Work Audit Test

Prove duplicate work can be caught after work has already started.

Developer A starts a long-running workstream.
Developer B starts an overlapping task with a confirmation reason.
Call covibe_team with operation: "audit_parallel_work".
Verify the response returns a structured warning.
Verify the conflict includes both pieces of work.
Verify recently completed matching work is also reported.
Verify a feedback event and parallel_work.audit work event were saved.
Verify the dashboard Audit parallel work button shows the same conflict.
Verify the dashboard audit does not mint an MCP token.
Refresh and verify the latest audit still appears.
Generate a weekly summary and verify Parallel audits includes the conflict.
Verify covibe_team operation: "state" returns latest_parallel_audits and warns the agent for active conflicts.
Complete both sides and verify covibe_team operation: "state" no longer warns on that historical audit.
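
The last two steps distinguish an active conflict from a historical audit. One possible reading, sketched below, is that an audit keeps warning while at least one side is still active and becomes historical once both sides are complete. That rule is an assumption, as are the type and function names.

```typescript
interface AuditedWork {
  taskId: string;
  owner: string;
  status: "active" | "completed";
}

interface ParallelAudit {
  auditId: string;
  conflict: [AuditedWork, AuditedWork];
}

// Assumption: an audit warns while any side of the conflict is still active.
function activeConflictWarnings(audits: ParallelAudit[]): ParallelAudit[] {
  return audits.filter((a) => a.conflict.some((w) => w.status === "active"));
}

const liveAudit: ParallelAudit = {
  auditId: "audit_1",
  conflict: [
    { taskId: "t1", owner: "dev-a", status: "active" },
    { taskId: "t2", owner: "dev-b", status: "active" },
  ],
};
const historicalAudit: ParallelAudit = {
  auditId: "audit_2",
  conflict: [
    { taskId: "t3", owner: "dev-a", status: "completed" },
    { taskId: "t4", owner: "dev-b", status: "completed" },
  ],
};
const auditWarnings = activeConflictWarnings([liveAudit, historicalAudit]);
```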

Token Inventory Test

Prove developers can manage agent tokens after refresh.

Create an MCP token from the UI.
Verify the one-time modal shows the raw token and the Settings → Agent connection stdio MCP config embeds no raw token (the stdio bridge resolves it from ~/.covibe/credentials.json via COVIBE_AGENT + COVIBE_BASE_URL).
Create a second token and verify default labels are distinguishable.
Refresh the page.
Verify the token metadata still appears.
Revoke the active token from Settings → Agent connection.
Verify the UI reports revocation.
Verify the revoked raw token can no longer call /api/mcp.
Verify Activity shows token creation and revocation events without raw token values.
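
The final step (no raw token values in Activity) can be checked by serializing each event and searching for the raw token. A minimal sketch: the event shape, the event type strings, and the sample token are hypothetical, and the check assumes tokens contain no characters that JSON would escape.

```typescript
interface ActivityEvent {
  type: string;
  summary: string;
  metadata: Record<string, string>;
}

// Returns every event whose serialized form contains the raw token.
function eventsLeakingToken(events: ActivityEvent[], rawToken: string): ActivityEvent[] {
  return events.filter((e) => JSON.stringify(e).indexOf(rawToken) !== -1);
}

const sampleRawToken = "cv_example_raw_token_123"; // hypothetical token value
const cleanEvents: ActivityEvent[] = [
  { type: "token.created", summary: "Token created", metadata: { label: "Agent token 1" } },
  { type: "token.revoked", summary: "Token revoked", metadata: { label: "Agent token 1" } },
];
const leakyEvents: ActivityEvent[] = cleanEvents.concat([
  { type: "token.created", summary: "Created " + sampleRawToken, metadata: {} },
]);
```

Searching the serialized event covers nested metadata as well as top-level fields, so a leak in any field fails the test.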

Customer Readiness Tests

The local companion setup and hosted deployment canary checklists live in customer-readiness-tests.md. They cover the hosted /downloads/co-vibe.tgz install path, npm exec -- covibe-local setup --base-url <origin>, manual snapshot --base-url <origin>, watch --base-url <origin> --once, and npm exec -- covibe-mcp.

Accuracy Gate

Run npm run check:accuracy before customer handoff. It executes the committed overlap scenarios, semantic calibration, and Performance quality read-model tests so the duplicate-work and designed Performance functionality stay above the 90% product-quality bar. The overlap scenario suite explicitly requires 100% recall for real duplicates and at least 90% precision over committed stage-1 scenarios. npm run readiness includes this gate.
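
The recall and precision thresholds quoted above can be made concrete with a small calculation. This is a sketch of the arithmetic only, not the actual `check:accuracy` implementation. The `ScenarioResult` shape and function name are assumptions.

```typescript
interface ScenarioResult {
  isRealDuplicate: boolean; // ground truth for the scenario
  flagged: boolean; // whether overlap detection flagged it
}

interface GateReport {
  recall: number;
  precision: number;
  passes: boolean;
}

function overlapGate(results: ScenarioResult[]): GateReport {
  let truePositives = 0;
  let falsePositives = 0;
  let falseNegatives = 0;
  for (let i = 0; i < results.length; i++) {
    const r = results[i];
    if (r.flagged && r.isRealDuplicate) truePositives++;
    else if (r.flagged && !r.isRealDuplicate) falsePositives++;
    else if (!r.flagged && r.isRealDuplicate) falseNegatives++;
  }
  const flaggedTotal = truePositives + falsePositives;
  const duplicateTotal = truePositives + falseNegatives;
  return {
    recall: duplicateTotal === 0 ? 1 : truePositives / duplicateTotal,
    precision: flaggedTotal === 0 ? 1 : truePositives / flaggedTotal,
    // 100% recall means no missed duplicates; the precision check uses
    // integer arithmetic to avoid floating-point comparison at exactly 90%.
    passes: falseNegatives === 0 && truePositives * 10 >= flaggedTotal * 9,
  };
}

// Nine real duplicates caught plus one false alarm: precision is 9/10.
const gateResults: ScenarioResult[] = [];
for (let i = 0; i < 9; i++) gateResults.push({ isRealDuplicate: true, flagged: true });
gateResults.push({ isRealDuplicate: false, flagged: true });
const gateReport = overlapGate(gateResults);
```

With these inputs the gate sits exactly on the precision boundary and still passes, while a single missed real duplicate fails it regardless of precision.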

Local Identity Test

Prove local dogfood mode does not hardcode one developer.

Open the UI without selecting a developer.
Create an MCP token and verify it is created for hakan.
Select dev2 through the local-dev /api/dev-login test path (the sign-in form only creates new tenants through /onboarding).
Create an MCP token.
Verify the token is created for dev2.
Use that token to start a task through MCP.
Verify Workstreams shows the task under @dev2.
Complete the task through MCP so the e2e run does not leave active test work behind.

Block And Warn Test

Prove warnings and blocks are returned to the agent.

Create an existing active task.
Have another agent check similar work.
Verify the MCP response returns a structured warning.
Have the agent try to start without confirmation.
Verify the MCP response returns a structured block.
Verify the block tells the agent what to do next.
Verify the warning and block are saved.
Verify covibe_team operation: "feedback" returns the saved warning or block.
Verify the UI shows the warning.

Example blocked response:

```json
{
  "status": "blocked",
  "reason": "Possible duplicate work found.",
  "required_action": "Ask the developer for confirmation and a reason before starting.",
  "warning_id": "warn_123",
  "matches": [
    {
      "type": "task",
      "title": "Research Claude/Codex orchestration for overnight builds",
      "owner": "hakan"
    }
  ]
}
```
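
A test asserting that a block is "structured" and "tells the agent what to do next" could validate the response shape with a type guard. The field names below mirror the example response. The guard itself is a sketch and treats every field shown as required, which is an assumption.

```typescript
interface BlockedMatch {
  type: string;
  title: string;
  owner: string;
}

interface BlockedResponse {
  status: "blocked";
  reason: string;
  required_action: string;
  warning_id: string;
  matches: BlockedMatch[];
}

function isBlockedMatch(value: unknown): value is BlockedMatch {
  if (typeof value !== "object" || value === null) return false;
  const m = value as Record<string, unknown>;
  return (
    typeof m["type"] === "string" &&
    typeof m["title"] === "string" &&
    typeof m["owner"] === "string"
  );
}

function isBlockedResponse(value: unknown): value is BlockedResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const action = v["required_action"];
  const matches = v["matches"];
  return (
    v["status"] === "blocked" &&
    typeof v["reason"] === "string" &&
    // A non-empty required_action is what tells the agent what to do next.
    typeof action === "string" &&
    action.length > 0 &&
    typeof v["warning_id"] === "string" &&
    Array.isArray(matches) &&
    matches.every((m: unknown) => isBlockedMatch(m))
  );
}

const parsedBlock: unknown = JSON.parse(
  '{"status":"blocked","reason":"Possible duplicate work found.",' +
    '"required_action":"Ask the developer for confirmation and a reason before starting.",' +
    '"warning_id":"warn_123","matches":[{"type":"task",' +
    '"title":"Research Claude/Codex orchestration for overnight builds","owner":"hakan"}]}'
);
```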

Duplicate Work Scenario

The loop must include this test scenario.

Developer A starts: "Research Claude/Codex orchestration for overnight builds."
Developer B checks: "Compare Gas Town and Mission Control for overnight AI builds."
Co-Vibe warns that the work may overlap.
Developer B tries to start without confirmation.
Co-Vibe rejects it.
Developer B starts again with a confirmation reason.
Co-Vibe accepts it and logs the override.
The UI shows the warning and confirmation reason.
Activity shows the warning, block, confirmation, and accepted start.
Usage events show the tool calls that created the flow.
Metrics show tool calls, warnings, and confirmed overlaps from the same records.
Settings → Agent connection does not retain demo-scenario tokens.
Browser and stdio e2e tests revoke tokens they create.
Dashboard and weekly summary reads stay bounded after repeated dogfood runs.
Weekly summary structured data includes plannedTasks.

This is the most important test.
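
The reject-then-confirm flow in this scenario can be sketched as a small state machine. The keyword-overlap check is a toy stand-in for Co-Vibe's real matching, chosen only because the two scenario titles share "overnight" and "builds". The function names and the `block` activity label are hypothetical, while `warning.confirmed` and `task.started` appear elsewhere in this checklist.

```typescript
interface ActiveTask {
  title: string;
  owner: string;
}

type StartResult =
  | { status: "started"; confirmed: boolean }
  | { status: "blocked"; required_action: string };

function significantWords(title: string): string[] {
  return title
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((w) => w.length > 3);
}

// Toy overlap rule (assumption): two or more shared words longer than three letters.
function mayOverlap(a: string, b: string): boolean {
  const wordsA = significantWords(a);
  const shared = significantWords(b).filter((w) => wordsA.indexOf(w) !== -1);
  return shared.length >= 2;
}

function startTask(
  active: ActiveTask[],
  activity: string[],
  owner: string,
  title: string,
  confirmationReason?: string
): StartResult {
  const matches = active.filter((t) => t.owner !== owner && mayOverlap(t.title, title));
  if (matches.length > 0 && !confirmationReason) {
    activity.push("block"); // hypothetical activity label
    return {
      status: "blocked",
      required_action: "Ask the developer for confirmation and a reason before starting.",
    };
  }
  if (matches.length > 0) activity.push("warning.confirmed");
  active.push({ title, owner });
  activity.push("task.started");
  return { status: "started", confirmed: matches.length > 0 };
}

const activeTasks: ActiveTask[] = [];
const activityLog: string[] = [];
const titleA = "Research Claude/Codex orchestration for overnight builds.";
const titleB = "Compare Gas Town and Mission Control for overnight AI builds.";
const startA = startTask(activeTasks, activityLog, "dev-a", titleA);
const rejectedB = startTask(activeTasks, activityLog, "dev-b", titleB);
const confirmedB = startTask(activeTasks, activityLog, "dev-b", titleB, "Different tools in scope");
```

An end-to-end test would drive the same three calls through MCP and assert on Activity, usage events, and Metrics rather than on local arrays.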

Required Test Areas And First Human Test

The broad coverage checklist and first human test live in build-loop-test-areas.md. The Build Test Fix loop must still include a browser test for the main user journey when the app has a browser UI.