Integration Testing

ACP Kit's internal tests are not enough to prove a downstream integration is truthful.

Every real integration should have a small black-box ACP test set that validates the adapter at the session boundary.

Testing Goal

Prove that:

ACP methods work end to end
visible session state matches real runtime state
approvals, plans, and host backends behave coherently
session replay and reload do not drift from the original run

Minimum Scenario Set

1. Session Lifecycle

Test:

new_session
prompt once
reload the same session
confirm transcript and session state still exist

This catches broken session storage, replay drift, and missing persistence.

2. Approval Roundtrip

Test:

a tool requires approval
approval is requested
approval is granted or denied
the next visible behavior matches the approval result

If persistent choices are enabled, also verify that remembered approval policy is reused intentionally.

3. Host-backed File Flow

Test:

read a file through the ACP client-backed filesystem
write a file through the ACP client-backed filesystem
verify the file result and projection are both truthful

4. Host-backed Terminal Flow

Test:

execute a command through the ACP terminal backend
wait for completion
inspect output and exit status
verify projected output matches the real command result

5. Mode And Model Switching

Test:

switch mode
verify behavior actually changes
switch model
verify session-local model state changes cleanly

Do not stop at "the API returned success". Confirm runtime behavior changed.

6. Plan Roundtrip

Test:

create or update plan state
persist it
reload the session
verify the same plan is still visible

7. MCP Metadata Visibility

Test:

configured tool metadata appears as intended
host-backed and local tool families are distinguishable
client-facing labels and categories are not ambiguous

Recommended Test Style

Prefer:

black-box ACP method calls
visible session updates
persisted session assertions
end-to-end approval and host-backend flows

Reusable Harness

pydantic-acp now ships a small reusable test surface for downstream integrations:

import asyncio

from pydantic_acp import AdapterConfig, BlackBoxHarness, FileSessionStore

harness = BlackBoxHarness.create(
    agent=my_agent,
    config=AdapterConfig(session_store=FileSessionStore(tmp_path / 'sessions')),
)

session = asyncio.run(harness.new_session(cwd=str(tmp_path)))
response = asyncio.run(harness.prompt_text('Inspect the workspace.', session_id=session.session_id))

assert response.stop_reason == 'end_turn'
assert harness.agent_messages(session_id=session.session_id)

What The Harness Actually Solves

Most downstream integrations need the same proof chain:

start a session
send a prompt
observe visible ACP updates
resolve approval when required
verify host-backed file or terminal behavior
reload the session
confirm replayed state still matches the original run

Without a shared harness, each integration rebuilds:

a recording ACP client
a permission response queue
message reconstruction logic
update filtering helpers
session lifecycle wrappers

That duplication is exactly where proof quality drifts. Some integrations stop at "the prompt returned success" and never verify the ACP-visible transcript, approval behavior, or replay path. BlackBoxHarness removes that repeated setup cost.

Use this guide after the integration already has a real adapter construction seam.

If the integration still does not know who owns models, modes, approvals, or host access rules, fix that architecture first. The harness is for proof, not for discovering the design by trial and error.

What `BlackBoxHarness` Provides

BlackBoxHarness.create(...) to build an ACP adapter plus a recording ACP client in one place
initialize() for protocol negotiation tests
new_session(...) and load_session(...) helpers that keep track of the last active session id
prompt_text(...) so black-box tests stay focused on behavior instead of ACP block assembly
set_mode(...) and set_model(...) helpers for runtime control tests
queue_permission_selected(...) and queue_permission_cancelled() for approval roundtrip tests
updates(...) and updates_of_type(...) for visible ACP update inspection
agent_messages(...) for reconstructed final assistant messages across streamed chunks
tool_updates(...) for projected tool activity checks

Typical Usage Pattern

Use it in this order:

Create the harness with the real adapter construction seam you expect downstream users to rely on.
Start a session with a real cwd.
Queue approval outcomes before prompts that should defer.
Run one prompt through the ACP boundary.
Assert on visible updates, not internal helper calls.
Reload the same session and verify replayed state.

Checklist before using the harness:

the integration already has one real adapter construction seam
session ownership is already decided
approvals and host tools are already mapped truthfully
the test is trying to prove ACP-visible behavior, not helper internals

Example:

import asyncio

from pydantic_acp import AdapterConfig, BlackBoxHarness, FileSessionStore

harness = BlackBoxHarness.create(
    agent_factory=build_agent,
    config=AdapterConfig(session_store=FileSessionStore(tmp_path / 'sessions')),
)

session = asyncio.run(harness.new_session(cwd=str(tmp_path)))
harness.queue_permission_selected('allow_once')
response = asyncio.run(harness.prompt_text('Write the workspace note.'))

assert response.stop_reason == 'end_turn'
assert harness.tool_updates(session_id=session.session_id)
assert harness.agent_messages(session_id=session.session_id)

harness.clear_updates()
asyncio.run(harness.load_session(cwd=str(tmp_path), session_id=session.session_id))
assert harness.agent_messages(session_id=session.session_id)

Smaller verified example from the harness test:

session = asyncio.run(harness.new_session(cwd=str(tmp_path)))
harness.queue_permission_selected('allow_once')
response = asyncio.run(harness.prompt_text('Write the workspace note.'))

assert response.stop_reason == 'end_turn'
assert harness.agent_messages(session_id=session.session_id) == ['done']

What It Is Not

BlackBoxHarness is not:

a full replacement for product-level end-to-end tests
a mock ACP transport stack
a white-box adapter inspection tool

It is intentionally narrow. Its job is to prove the ACP boundary, not to simulate the entire product around it.

Recommended Assertions

Prefer assertions on:

returned ACP stop reason
emitted ToolCallStart / ToolCallProgress updates
approval request behavior
reconstructed agent messages
persisted session replay
host-backed filesystem or terminal side effects

Avoid assertions on:

private helper call order
internal runtime helper names
intermediate objects that ACP clients never see

Use the harness when you want to verify session-bound ACP behavior without rebuilding:

a recording ACP client
permission response queueing
update filtering
agent message reconstruction
reload and replay assertions

Recommended Scenario Ladder

If you are adding ACP support to a real project, start with these harness-backed tests:

Session create -> prompt -> reload
Approval required -> allow once -> visible completion
Approval required -> deny once -> visible denial behavior
Host file read/write flow
Host terminal execution flow
Mode switch changes tool behavior
Model switch changes session-local model state
Plan survives reload if the integration exposes plan state

Avoid:

deep mocks of private adapter helpers
tests that only assert helper call order
white-box assertions that do not prove ACP behavior

Practical Definition Of Done

An integration test set is good enough when it proves:

session state is replayable
approvals are truthful
file and terminal host backends work end to end
model and mode changes are real
plan state survives reload

If those are not covered, the integration still has a proof gap.

Integration Testing

Testing Goal

Minimum Scenario Set

1. Session Lifecycle

2. Approval Roundtrip

3. Host-backed File Flow

4. Host-backed Terminal Flow

5. Mode And Model Switching

6. Plan Roundtrip

7. MCP Metadata Visibility

Recommended Test Style

Reusable Harness

What The Harness Actually Solves

What BlackBoxHarness Provides

Typical Usage Pattern

What It Is Not

Recommended Assertions

Recommended Scenario Ladder

Practical Definition Of Done

Related Guides

What `BlackBoxHarness` Provides