Test an agent end-to-end

Iterating on an AI agent means iterating on its prompt, its tools, and the way it handles edge cases. The test-calls API runs real calls (bot-to-bot or SIP loopback) against an agent using a scenario prompt you supply. Every run produces a real call log with transcript, grading, and billing, so you see exactly how the agent behaves and what it costs. Use it for:

Pre-deploy smoke tests after every prompt edit
Regression suites wired into CI (subscribe to the test-call.completed webhook and fail the build if the score drops)
Stress-testing concurrency limits

One-shot: single run

curl -X POST https://api.thunderphone.com/v1/test-calls \
  -H "Authorization: Bearer sk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "target_type":     "agent",
    "target_id":       12,
    "direction":       "outbound",
    "scenario_prompt": "You are a polite caller asking about refund policy for order 12345.",
    "consent_to_charge": true
  }'

Fields:

Field	Type	Required	Description
`target_type`	string	yes	`agent` or `phone_number`
`target_id`	integer	yes	The agent id (or phone number id)
`direction`	string	yes	`outbound` (bot places) or `inbound` (bot answers)
`scenario_prompt`	string	no	Drives what the test bot says
`mode`	string	no	`bot` (bot-to-bot, default) or `sip` (SIP loopback)
`consent_to_charge`	boolean	yes	Must be `true`. Test calls cost 2× normal rate
`target_number`	string	no	Override for the bot’s caller id (E.164)
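As a rough illustration of the field table, the sketch below builds and validates a request body in Python before it is sent. The helper name build_test_call_payload is hypothetical and not part of any ThunderPhone SDK; the allowed values come straight from the table above.

```python
from typing import Any, Dict, Optional


def build_test_call_payload(
    target_type: str,
    target_id: int,
    direction: str,
    consent_to_charge: bool,
    scenario_prompt: Optional[str] = None,
    mode: Optional[str] = None,
    target_number: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: check arguments against the field table."""
    if target_type not in {"agent", "phone_number"}:
        raise ValueError("target_type must be 'agent' or 'phone_number'")
    if direction not in {"outbound", "inbound"}:
        raise ValueError("direction must be 'outbound' or 'inbound'")
    if mode is not None and mode not in {"bot", "sip"}:
        raise ValueError("mode must be 'bot' or 'sip'")
    if consent_to_charge is not True:
        # The table says this field must be true; test calls are billed.
        raise ValueError("consent_to_charge must be True")

    payload: Dict[str, Any] = {
        "target_type": target_type,
        "target_id": target_id,
        "direction": direction,
        "consent_to_charge": True,
    }
    # Optional fields are only included when the caller supplies them.
    if scenario_prompt is not None:
        payload["scenario_prompt"] = scenario_prompt
    if mode is not None:
        payload["mode"] = mode
    if target_number is not None:
        payload["target_number"] = target_number
    return payload
```

Validating locally like this catches a missing consent flag or a mistyped direction before a billable request is made.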

The response is a Test call run object in status="queued". Poll until status becomes completed or failed; once call_id is set, load the transcript via GET /v1/calls/{call_id}/transcript.
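The polling step can be sketched as follows. This is a minimal example, not an official client: the endpoint for fetching a single run is not shown in this guide, so the sketch takes a fetch_run callable that you supply, and the names wait_for_run and transcript_path are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional

# Terminal statuses named in the text above.
TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_run(
    fetch_run: Callable[[], Dict[str, Any]],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_run() until the run is completed or failed."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        run = fetch_run()
        if run.get("status") in TERMINAL_STATUSES:
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError("test call run did not finish in time")
        sleep(interval_seconds)


def transcript_path(run: Dict[str, Any]) -> Optional[str]:
    """Return the transcript path once call_id is set, else None."""
    call_id = run.get("call_id")
    if call_id is None:
        return None
    return f"/v1/calls/{call_id}/transcript"
```

Passing the fetch function in keeps the loop independent of any HTTP library and makes it easy to exercise with canned responses.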

Batches: parallel scenarios

Run N scenarios concurrently. This is useful for regression suites that hit every known edge case in parallel:

curl -X POST https://api.thunderphone.com/v1/test-call-batches \
  -H "Authorization: Bearer sk_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "target_type":     "agent",
    "target_id":       12,
    "direction":       "outbound",
    "run_count":       5,
    "stagger_seconds": 2,
    "scenario_prompts": [
      "Ask about refund policy.",
      "Ask for hours of operation.",
      "Complain about a delayed shipment.",
      "Ask to speak with a human.",
      "Ask an unrelated trivia question."
    ],
    "consent_to_charge": true
  }'

The response carries a run_ids list of child run ids. To fetch the batch status:

curl https://api.thunderphone.com/v1/test-call-batches/{batch_id} \
  -H "Authorization: Bearer sk_live_YOUR_API_KEY"

run_count is capped at 20. stagger_seconds (0–60 s) spaces out the start of each run so the batch does not hammer the agent all at once.
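The limits above can be enforced client-side before a batch is submitted. The sketch below is illustrative only: build_batch_payload is a hypothetical helper, and it assumes run_count should equal the number of scenario prompts, as in the curl example.

```python
from typing import Any, Dict, List

MAX_RUN_COUNT = 20        # cap stated in this guide
MAX_STAGGER_SECONDS = 60  # upper bound stated in this guide


def build_batch_payload(
    target_id: int,
    scenario_prompts: List[str],
    consent_to_charge: bool,
    stagger_seconds: int = 0,
    target_type: str = "agent",
    direction: str = "outbound",
) -> Dict[str, Any]:
    """Hypothetical helper: build a batch body within the documented limits."""
    # Assumption: one run per prompt, mirroring the example request.
    run_count = len(scenario_prompts)
    if not 1 <= run_count <= MAX_RUN_COUNT:
        raise ValueError(f"run_count must be between 1 and {MAX_RUN_COUNT}")
    if not 0 <= stagger_seconds <= MAX_STAGGER_SECONDS:
        raise ValueError(f"stagger_seconds must be 0 to {MAX_STAGGER_SECONDS}")
    if consent_to_charge is not True:
        raise ValueError("consent_to_charge must be True")
    return {
        "target_type": target_type,
        "target_id": target_id,
        "direction": direction,
        "run_count": run_count,
        "stagger_seconds": stagger_seconds,
        "scenario_prompts": list(scenario_prompts),
        "consent_to_charge": True,
    }
```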

Hook it into CI

The test-call.completed webhook fires once per run. Subscribe to it and fail your CI job if any run returns status: "failed" or scores below your threshold on the companion call.graded event:

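# Sketch of a CI integration, assuming FastAPI.
# verify() and trigger_ci_failure() are placeholders for your own
# signature check and CI hook; they are not defined here.
import json

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

@app.post("/thunderphone-hook")
async def hook(request: Request):
    body = await request.body()
    if not verify(body):
        raise HTTPException(status_code=401)
    event = json.loads(body)
    if event["type"] == "test-call.completed":
        run = event["data"]["test_call_run"]
        if run["status"] != "completed":
            trigger_ci_failure(run)
    elif event["type"] == "call.graded" and event["data"]["grade"]["score"] < 0.8:
        trigger_ci_failure(event["data"])
    return {"ok": True}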

Patterns

Per-prompt regression corpus

Maintain a JSON file of {name, scenario_prompt, expected_outcome} tuples. On every prompt change, run the full set as a batch; diff the transcripts and grades against the previous run.
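A minimal sketch of this pattern follows. It assumes the corpus is a JSON array of objects with the three keys named above, splits it into batches that respect the 20-run cap, and flags scenarios whose score dropped since the previous run. The function names and the tolerance value are illustrative, and the per-scenario score maps are assumed to be collected by you from the grading results.

```python
import json
from typing import Any, Dict, Iterator, List

REQUIRED_KEYS = {"name", "scenario_prompt", "expected_outcome"}


def load_corpus(text: str) -> List[Dict[str, Any]]:
    """Parse the corpus JSON and check each entry has the expected keys."""
    entries = json.loads(text)
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"corpus entry missing keys: {sorted(missing)}")
    return entries


def chunk(entries: List[Dict[str, Any]], size: int = 20) -> Iterator[List[Dict[str, Any]]]:
    """Yield slices of at most `size` entries, one slice per batch."""
    for start in range(0, len(entries), size):
        yield entries[start:start + size]


def regressions(
    previous: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,
) -> Dict[str, float]:
    """Map scenario name to score drop for drops larger than tolerance."""
    return {
        name: round(previous[name] - score, 4)
        for name, score in current.items()
        if name in previous and previous[name] - score > tolerance
    }
```

Scenarios that are new in the current run have no baseline, so they are skipped rather than reported as regressions.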

Per-release smoke test

Run a single batch of five happy-path scenarios after every deploy. The check is latency-sensitive, so keep stagger_seconds at 0.

Latency benchmarking

Run identical scenarios against different product tiers (spark, bolt, storm-base). Compare the call.graded scores and the duration_seconds from each resulting call log.
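One way to make that comparison is to collect one row per run and aggregate by tier. The sketch below assumes you have already flattened each result into a dict with tier, score, and duration_seconds keys; that row shape and the summarize_by_tier name are assumptions for illustration, not an API response format.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Any, Dict, List


def summarize_by_tier(results: List[Dict[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Group rows by tier and report run count, mean score, median duration."""
    grouped: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for row in results:
        grouped[row["tier"]].append(row)
    return {
        tier: {
            "runs": len(rows),
            "mean_score": mean([r["score"] for r in rows]),
            "median_duration_seconds": median([r["duration_seconds"] for r in rows]),
        }
        for tier, rows in grouped.items()
    }
```

Median duration is used here because a single unusually long call would skew a mean.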

Next steps

Test calls reference

Every query parameter, status code, and batch shape.

AI grading

Auto-score every test run to track quality over time.

Issue reports

Flag specific tests for human review.

test-call.completed webhook

Stream results into your CI / Slack / PagerDuty.
