Simulations

Simulations let you replay a customer situation against your agent and grade the outcome automatically. Run the same set of tests against every new revision to catch regressions before they reach customers — the core loop for improving an agent with confidence.

A simulation is built from two pieces:

  • A test customer — the data the agent sees when it looks things up (orders, subscriptions, shipments, or any API response).
  • A test case — the situation to replay and what a correct agent should do.

Test customers (test data)

Real conversations depend on real backends. Test customers let a simulation look up realistic data without touching production systems. Two kinds of data are supported, and a single test customer can mix both:

  • Connectors — Shopify, ShipBob, and Loop data is modeled as entities (orders, subscriptions, shipments). These support read-after-write: if the agent cancels an order mid-conversation, a later lookup reflects the change, just like production.

  • HTTP endpoints — for any other API a flow calls, add a canned HTTP response keyed by request method and URL (a substring or regex match) plus the status code and JSON body to return. Use this for loyalty systems, custom backends, or any integration that isn’t a built-in connector.

    On a test customer, add an HTTP endpoint resource. Each fixture looks like:

    { "method": "get", "url_contains": "/loyalty/points", "status": 200, "body": { "points": 1240, "tier": "gold" } }
    • method — the HTTP method to match (get, post, …). Omit it (or use any) to match any method.
    • url_contains — match when the request URL contains this substring. Use url_regex instead for a regular-expression match.
    • status — the status code to return (defaults to 200).
    • body — the exact JSON the agent receives back.

    Read-after-write. If a flow writes through an HTTP node (for example cancelling via POST /subscription/cancel) and a later turn should see the change, give the write fixture a mutations list — each entry patches another fixture’s body, so the next read reflects it:

    { "method": "post", "url_contains": "/subscription/cancel", "status": 200, "body": { "ok": true }, "mutations": [ { "target_ref": "subscription_lookup", "patch": { "status": "canceled" } } ] }

    Here target_ref is the key of the read fixture to update and patch is deep-merged into its body. After this call, a follow-up GET of the subscription returns status: "canceled" for the rest of the run.
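
The mutation behavior described above can be sketched in Python. This is a minimal illustration of deep-merge semantics under stated assumptions, not the product's implementation: the `fixtures` store keyed by fixture key, the `deep_merge` and `apply_mutations` helpers, the `subscription_cancel` key, the `/subscription` read URL, and the `plan` data are all hypothetical.

```python
import copy


def deep_merge(base: dict, patch: dict) -> dict:
    """Return a copy of base with patch merged in recursively."""
    merged = copy.deepcopy(base)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge them key by key.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys overwrite the old value.
            merged[key] = copy.deepcopy(value)
    return merged


def apply_mutations(fixtures: dict, write_fixture: dict) -> None:
    """Patch the body of every fixture named in the write fixture's mutations."""
    for mutation in write_fixture.get("mutations", []):
        target = fixtures[mutation["target_ref"]]
        target["body"] = deep_merge(target["body"], mutation["patch"])


# Hypothetical fixture store, keyed by fixture key (the layout is an assumption).
fixtures = {
    "subscription_lookup": {
        "method": "get",
        "url_contains": "/subscription",
        "status": 200,
        "body": {"status": "active", "plan": {"name": "monthly", "price": 20}},
    },
    "subscription_cancel": {
        "method": "post",
        "url_contains": "/subscription/cancel",
        "status": 200,
        "body": {"ok": True},
        "mutations": [
            {"target_ref": "subscription_lookup", "patch": {"status": "canceled"}}
        ],
    },
}

# Simulate the agent calling the cancel endpoint, then reading the subscription.
apply_mutations(fixtures, fixtures["subscription_cancel"])
print(fixtures["subscription_lookup"]["body"])
# {'status': 'canceled', 'plan': {'name': 'monthly', 'price': 20}}
```

Because the patch is merged rather than substituted, fields it does not mention (here, `plan`) survive the write, which is why a patch only needs to list the fields that change.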

Different shops have different integrations — model only what your flows actually call. If a flow makes an HTTP request with no matching fixture, the run fails loudly and tells you exactly which request to add.
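
To make the matching rules and the loud failure concrete, here is a rough Python sketch of how a request could be resolved against a list of fixtures. It follows the field descriptions above, but several details are assumptions rather than documented behavior: the first matching fixture wins, `url_regex` is checked with a search (not a full match) and takes precedence over `url_contains`, and the `UnmatchedRequestError`, `matches`, and `respond` names and the example URLs are hypothetical.

```python
import re


class UnmatchedRequestError(Exception):
    """Raised when no fixture matches a request (hypothetical name)."""


def matches(fixture: dict, method: str, url: str) -> bool:
    """Check one fixture against a request's method and URL."""
    wanted = fixture.get("method", "any").lower()
    if wanted != "any" and wanted != method.lower():
        return False
    if "url_regex" in fixture:
        return re.search(fixture["url_regex"], url) is not None
    if "url_contains" in fixture:
        return fixture["url_contains"] in url
    return False  # a fixture with no URL rule never matches in this sketch


def respond(fixtures: list, method: str, url: str) -> tuple:
    """Return (status, body) from the first matching fixture, or fail loudly."""
    for fixture in fixtures:
        if matches(fixture, method, url):
            return fixture.get("status", 200), fixture["body"]
    raise UnmatchedRequestError(
        f"No fixture matches {method.upper()} {url}; add one for this request"
    )


fixtures = [
    # No status given, so the response defaults to 200.
    {"method": "get", "url_contains": "/loyalty/points",
     "body": {"points": 1240, "tier": "gold"}},
    # No method given, so any method matches.
    {"url_regex": r"/orders/\d+$", "status": 404, "body": {"error": "not found"}},
]

print(respond(fixtures, "GET", "https://api.example.com/loyalty/points?customer=42"))
print(respond(fixtures, "DELETE", "https://api.example.com/orders/17"))
try:
    respond(fixtures, "GET", "https://api.example.com/refunds")
except UnmatchedRequestError as exc:
    print(exc)
```

Raising on an unmatched request, instead of returning an empty response, keeps a missing fixture from being mistaken for agent misbehavior.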

Creating a test case

Every test case has a scenario (the situation to replay) and an expectation (the pass criteria). There are two ways to create one:

From scratch

  1. Pick a test customer. You’ll see its data inline so you know what the agent can look up.
  2. Describe what to test — the customer’s situation and goal. An AI plays this customer across multiple turns: it sends the opening message, reacts to the agent’s replies, and keeps going until the goal is met or it gives up.
  3. Describe what should happen — the specific, behavioral pass criteria. An AI judge grades the conversation against it.

From a real conversation

The fastest way to build a regression suite is to turn conversations that already happened into tests:

  1. In the create dialog, paste a conversation ID under Start from a real conversation and click Draft.
  2. The scenario, the pass criteria, and (when the conversation was itself a simulation) the test customer are filled in automatically.
  3. Review and save.

This turns a production issue into a permanent regression test in one step.

Running and comparing revisions

Run a single test, only the failing tests, or the whole set against the live revision or any past revision. Runs execute in parallel in the background, so large suites finish quickly and keep running even if you close the tab.

To improve the agent step by step without breaking what already works, compare two revisions:

  1. Run your tests against the new revision.
  2. Pick a baseline revision in Compare to….
  3. The toolbar shows regressions (passed on the baseline, now failing) and fixes (failing on the baseline, now passing). Click either to filter.

A change is safe to ship when it has fixes and zero regressions.
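
The comparison logic can be expressed as a small Python sketch. It assumes each run is available as a mapping from test name to pass or fail, and it ignores tests that appear in only one of the two runs; the `compare_runs` helper, the result format, and the test names are hypothetical and only illustrate the definitions of regressions and fixes given above.

```python
def compare_runs(baseline: dict, candidate: dict) -> dict:
    """Compare two runs; each maps test name -> True (passed) or False (failed)."""
    shared = baseline.keys() & candidate.keys()
    # Regression: passed on the baseline, now failing.
    regressions = sorted(t for t in shared if baseline[t] and not candidate[t])
    # Fix: failing on the baseline, now passing.
    fixes = sorted(t for t in shared if not baseline[t] and candidate[t])
    return {
        "regressions": regressions,
        "fixes": fixes,
        "safe_to_ship": bool(fixes) and not regressions,
    }


# Hypothetical results for a baseline revision and a new revision.
baseline = {"cancel_order": True, "refund_status": False, "loyalty_points": True}
candidate = {"cancel_order": True, "refund_status": True, "loyalty_points": True}

print(compare_runs(baseline, candidate))
# {'regressions': [], 'fixes': ['refund_status'], 'safe_to_ship': True}
```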

Inspecting a run

Open any result to see the full conversation, every assertion with its pass or fail rationale, and the provider calls the agent made. This is where you diagnose why a test failed before adjusting Responses or Flows.

Quick workflow

  1. Create a few test customers covering your common backends.
  2. Draft test cases from real conversations — especially ones that went wrong.
  3. Run the set against your live revision to establish a green baseline.
  4. Make a change, run again against the new revision, and compare.
  5. Ship when you have fixes and no regressions.
