Framework Guide

Benchmark OpenAI-powered agents with trace-backed durability scores.

Use Crucible as the pre-deployment stress test for autonomous agents built on the Responses API, with report export and scenario-based replay built in.

How this integration works

Wrap an OpenAI Responses client and map its output into Crucible actions.

Charge retrieval, tool use, and think time against the agent ledger.

Generate replay and certification artifacts from the same run output.

Starter example

from openai import OpenAI

from crucible import evaluate, wrap_openai_responses_agent

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# output_parser maps raw Responses API output into Crucible actions;
# supply a parser matching your agent's output format.
wrapped = wrap_openai_responses_agent(
    client,
    model="gpt-4.1-mini",
    output_parser=parse_action,
)

result = evaluate(
    wrapped,
    name="ResponsesAgent-v2",
    scenario="compliance-auditor",
    seed=7,
)