Simulations

Simulations let you replay a customer situation against your agent and grade the outcome automatically. Run the same set of tests against every new revision to catch regressions before they reach customers — the core loop for improving an agent with confidence.

A simulation is built from three pieces:

A test customer — the data the agent sees when it looks things up (orders, subscriptions, shipments, or any API response).
A scenario — the customer situation to replay.
An expected procedure — the procedure the agent should select. The procedure’s guidance, examples, and never checks define what correct behavior means, so you do not have to maintain a second scoring rubric.

Test customers (test data)

Real conversations depend on real backends. Test customers let a simulation look up realistic data without touching production systems. Two kinds of data are supported, and a single test customer can mix both:

Connectors — Shopify, ShipBob, and Loop data is modeled as entities (orders, subscriptions, shipments). These support read-after-write: if the agent cancels an order mid-conversation, a later lookup reflects the change, just like production.
HTTP endpoints — for any other API a flow calls, add a canned HTTP response keyed by request method and URL (a substring or regex match) plus the status code and JSON body to return. Use this for loyalty systems, custom backends, or any integration that isn’t a built-in connector.

To add one, create an HTTP endpoint resource on the test customer. Each canned response (a fixture) looks like:
```
{
  "method": "get",
  "url_contains": "/loyalty/points",
  "status": 200,
  "body": { "points": 1240, "tier": "gold" }
}
```
- method — the HTTP method to match (get, post, …). Omit it (or use any) to match any method.
- url_contains — match when the request URL contains this substring. Use url_regex instead for a regular-expression match.
- status — the status code to return (defaults to 200).
- body — the exact JSON the agent receives back.
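
The matching rules above can be modeled in a few lines of Python. This is only an illustrative sketch, not the product's implementation: the function names, the example URL, the case-insensitive method comparison, the unanchored regex search, and the first-match-wins ordering are all assumptions made for the example.

```python
import re
from typing import Optional


def fixture_matches(fixture: dict, method: str, url: str) -> bool:
    """Return True if a canned HTTP fixture applies to the given request."""
    # A missing method, or the value "any", matches every HTTP method.
    wanted = fixture.get("method", "any").lower()
    if wanted != "any" and wanted != method.lower():
        return False
    # url_contains is a plain substring test.
    if "url_contains" in fixture:
        return fixture["url_contains"] in url
    # url_regex is a regular-expression test (assumed unanchored here).
    if "url_regex" in fixture:
        return re.search(fixture["url_regex"], url) is not None
    return False


def respond(fixtures: list[dict], method: str, url: str) -> Optional[tuple[int, dict]]:
    """Return (status, body) from the first matching fixture, or None."""
    for fixture in fixtures:
        if fixture_matches(fixture, method, url):
            # status defaults to 200 when the fixture omits it.
            return fixture.get("status", 200), fixture["body"]
    return None


loyalty = {
    "method": "get",
    "url_contains": "/loyalty/points",
    "body": {"points": 1240, "tier": "gold"},
}
result = respond([loyalty], "GET", "https://api.example.com/loyalty/points?customer=42")
print(result)  # (200, {'points': 1240, 'tier': 'gold'})
```

In this model, a POST to the same URL returns None because the fixture only matches get.
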

Read-after-write. If a flow writes through an HTTP node (for example canceling via POST /subscription/cancel) and a later turn should see the change, give the write fixture a mutations list. Each entry patches another fixture’s body, so the next read reflects the change:
```
{
  "method": "post",
  "url_contains": "/subscription/cancel",
  "status": 200,
  "body": { "ok": true },
  "mutations": [
    { "target_ref": "subscription_lookup", "patch": { "status": "canceled" } }
  ]
}
```
Here target_ref is the key of the read fixture to update and patch is deep-merged into its body. After this call, a follow-up GET of the subscription returns status: "canceled" for the rest of the run.
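
To make "deep-merged" concrete, here is a small Python sketch of how a patch could be applied to a read fixture. It is a model under assumptions, not the product's code: the helper names, storing fixtures in a dict keyed by their ref, and the contents of the subscription_lookup body are all hypothetical.

```python
import copy


def deep_merge(body: dict, patch: dict) -> dict:
    """Return a copy of body with patch merged in, recursing into nested dicts."""
    merged = copy.deepcopy(body)
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are objects: merge key by key instead of replacing.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and new keys overwrite the old value.
            merged[key] = copy.deepcopy(value)
    return merged


def apply_mutations(fixtures: dict[str, dict], write_fixture: dict) -> None:
    """Patch every fixture named by the write fixture's mutations list."""
    for mutation in write_fixture.get("mutations", []):
        target = fixtures[mutation["target_ref"]]
        target["body"] = deep_merge(target["body"], mutation["patch"])


# Hypothetical read fixture, stored under the key that target_ref points at.
fixtures = {
    "subscription_lookup": {
        "method": "get",
        "url_contains": "/subscription/status",
        "body": {"status": "active", "plan": {"name": "monthly", "price": 20}},
    }
}
cancel = {
    "method": "post",
    "url_contains": "/subscription/cancel",
    "body": {"ok": True},
    "mutations": [
        {"target_ref": "subscription_lookup", "patch": {"status": "canceled"}}
    ],
}

apply_mutations(fixtures, cancel)
print(fixtures["subscription_lookup"]["body"])
# {'status': 'canceled', 'plan': {'name': 'monthly', 'price': 20}}
```

Only status changes; the nested plan object is left intact, which is the difference between a deep merge and replacing the whole body.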

Different shops have different integrations — model only what your flows actually call. If a flow makes an HTTP request with no matching fixture, the run fails loudly and tells you exactly which request to add.

Creating a test case

Every new test case has a scenario (the situation to replay) and one or more expected procedures. There are two ways to create one:

From scratch

Pick a test customer. You’ll see its data inline so you know what the agent can look up.
Describe what to test — the customer’s situation and goal. An AI plays this customer across multiple turns: it sends the opening message, reacts to the agent’s replies, and keeps going until the goal is met or it gives up.
Select the procedure the agent should use. The reviewer checks only whether the agent selected the right procedure and followed it correctly.

From a real conversation

The fastest way to build a regression suite is to turn conversations that already happened into tests:

In the create dialog, paste a conversation ID under Start from a real conversation and click Draft.
The scenario, expected procedure, and (when the conversation was itself a simulation) the test customer are filled in automatically.
Review and save.

This turns a production issue into a permanent regression test in one step.

Existing tests keep working. Saved simulations that already have custom pass criteria continue to use those criteria. You can leave them as-is and move them to procedure-based review when you are ready.

Running and comparing revisions

Run a single test, only the failing tests, or the whole set against the live revision or any past revision. Runs execute in parallel in the background, so large suites finish quickly and keep going even if you close the tab.

To hill-climb safely, compare two revisions:

Run your tests against the new revision.
Pick a baseline revision in Compare to….
The toolbar shows regressions (passed on the baseline, now failing) and fixes (failing on the baseline, now passing). Click either to filter.

A change is safe to ship when it has fixes and zero regressions.
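
The regression and fix definitions above reduce to a simple comparison of pass/fail results. The sketch below is a hypothetical model of that rule, assuming each run is summarized as a mapping from test name to pass (True) or fail (False); the function and test names are made up for illustration.

```python
def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Return (regressions, fixes) for tests present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    # Regression: passed on the baseline, now failing.
    regressions = [t for t in shared if baseline[t] and not candidate[t]]
    # Fix: failing on the baseline, now passing.
    fixes = [t for t in shared if not baseline[t] and candidate[t]]
    return regressions, fixes


baseline = {"cancel_subscription": True, "refund_damaged_item": False, "track_shipment": True}
candidate = {"cancel_subscription": True, "refund_damaged_item": True, "track_shipment": True}

regressions, fixes = compare_runs(baseline, candidate)
# Safe to ship: at least one fix and no regressions.
safe_to_ship = bool(fixes) and not regressions
print(regressions, fixes, safe_to_ship)  # [] ['refund_damaged_item'] True
```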

Inspecting a run

Open any result to see the full conversation, provider calls, and two review decisions:

Right procedure — the agent selected exactly the expected procedure or procedures.
Followed correctly — the agent followed that procedure’s guidance, examples, and never checks.

When a reviewer finds a reusable gap, Applied can draft a focused procedure improvement and a matching regression simulation. A teammate reviews the draft before it changes the procedure; suggestions never edit live behavior automatically.

Quick workflow

Create a few test customers covering your common backends.
Link each test to the procedure it should exercise.
Draft additional test cases from real conversations — especially ones that went wrong.
Run the set against your live revision to establish a green baseline.
Make a change, run again against the new revision, and compare.
Ship when you have fixes and no regressions.