Reliability And Concurrency Reference

Load this reference before creating or editing fanout, concurrency, waits,
retries, idempotency, large artifacts, checkpoints, or failure behavior.

Reliability Planning

Before push:

define every side effect and what must happen exactly once
name the idempotency or duplicate-protection key for each side-effectful path
classify every external call as retryable or fail-fast
set timeout expectations for each external boundary
choose concurrency intentionally and state why
define payload strategy: inline small data, persist large artifacts, pass refs for large blobs
define cursor/checkpoint behavior for partial failure and replay
define exact runtime proof: result fields, counters, child runs, or resources that prove success
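
One way to picture the idempotency-key item above is the following minimal sketch. It is plain Python, not Breyta API: `idempotency_key`, `SideEffectLedger`, and the flow, entity, and action names are hypothetical, and the in-memory dict stands in for whatever durable store the real flow would use.

```python
import hashlib
import json


def idempotency_key(flow_name, entity_id, action):
    """Derive a stable key from the fields that identify one logical side effect."""
    raw = json.dumps([flow_name, entity_id, action], separators=(",", ":"))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class SideEffectLedger:
    """Hypothetical in-memory stand-in for a durable duplicate-protection store."""

    def __init__(self):
        self._done = {}

    def run_once(self, key, effect):
        # A key seen before means the side effect already happened: skip it.
        if key in self._done:
            return {"status": "duplicate", "result": self._done[key]}
        result = effect()
        self._done[key] = result
        return {"status": "performed", "result": result}


sent = []


def send_email():
    # Assumed side effect; a real flow would call an external system here.
    sent.append("reminder")
    return "sent"


ledger = SideEffectLedger()
key = idempotency_key("invoice-reminder", "customer-42", "send-email")
first = ledger.run_once(key, send_email)
second = ledger.run_once(key, send_email)  # simulated rerun of the same work
```

The rerun reports a duplicate and the side effect happens only once, which is the kind of evidence a replay check should produce.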

Concurrency Defaults

Use sequential behavior when order matters, when mutating shared state, when moving large artifacts, or when external systems are slow or fragile.
Use fanout only for independent, bounded, side-effect-safe items with explicit timeout awareness.
Use keyed concurrency when work must serialize per entity while allowing cross-entity parallelism.
If concurrency is not clearly beneficial, default to sequential.
For public or paid installed apps with a manual run surface, choose concurrency based on the intended end-user UX, not only author iteration speed. :singleton :supersede is reasonable for hot iteration, webhook refresh jobs, and "latest run wins" automation. It is the wrong default when the installed app page is supposed to let users start another run while an earlier one is still visible or still working.
flows run returns concurrencyDecision and concurrencyKey; use those fields plus run timing to prove whether a run started, coexisted, queued, blocked, or was rejected.

Answer these before build:

What can run in parallel safely?
What must never overlap?
What shared state or side effect could be corrupted?
What is the per-item timeout?
What happens when one item stalls?
How are partial successes summarized without losing failed items?
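
The keyed-concurrency default described above (serialize per entity, allow cross-entity parallelism) can be sketched in plain Python. This only illustrates the idea and says nothing about how Breyta implements it: `KeyedSerializer`, the account names, and the thread pool size are all assumptions.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


class KeyedSerializer:
    """Runs work for the same key one at a time; different keys may overlap."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the lock table itself
        self._locks = defaultdict(threading.Lock)

    def run(self, key, work):
        with self._guard:
            lock = self._locks[key]
        with lock:  # only callers sharing this key wait on each other
            return work()


balances = {"acct-1": 0, "acct-2": 0}
serializer = KeyedSerializer()


def deposit(account):
    def work():
        # Read-modify-write on shared state: unsafe without per-key serialization.
        current = balances[account]
        balances[account] = current + 1

    serializer.run(account, work)


with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(deposit, ["acct-1", "acct-2"] * 50))
```

Each account receives exactly 50 deposits because updates to one account never overlap, while the two accounts can still make progress in parallel.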

Installed Manual-Run UX

When debugging a public or paid app that appears to "lose" earlier runs from
the installed app page, check the flow's :concurrency before assuming a UI or
billing regression.

:singleton :on-new-version :supersede can terminate or replace an earlier active run when a new run starts. In manual QA this looks like "the second run trumped the first".
:singleton :on-new-version :coexist is usually the right policy when the installed app page should allow another run to start without replacing the earlier run card/state.
flows run / flows interfaces call plus concurrencyDecision, run timing, and the run list are the quickest proof of whether the backend intentionally superseded the earlier run.

For reusable paid/public smoke fixtures, prefer :coexist unless the product
semantics explicitly require replacement.

Keyed Concurrency And Raw Entrypoints

For webhook, event, and interface-started flows, :concurrency {:type :keyed ...} is
evaluated before the flow body can normalize input. The :key-field must exist
in the raw entrypoint payload that starts the run, not only in a derived
request_key or normalized function output.

Use a root field when the provider gives one. For a GitHub pull request webhook,
:number is a safer key than a later normalized :request-key. If the key is
nested, use the raw nested path, such as [:pull_request :id], and prove it
with a real webhook-shaped payload before release.
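
A pre-release check along these lines can confirm that a candidate key path exists in the raw payload. This is a minimal Python sketch, not Breyta tooling: `get_in` and `check_key_path` are hypothetical helpers, and the sample payload is trimmed to the few fields needed here, with made-up values.

```python
def get_in(payload, path):
    """Walk a nested dict along path; return None if any segment is missing."""
    current = payload
    for segment in path:
        if not isinstance(current, dict) or segment not in current:
            return None
        current = current[segment]
    return current


def check_key_path(raw_payload, path):
    """Fail loudly when the concurrency key is absent from the raw payload."""
    value = get_in(raw_payload, path)
    if value is None:
        raise KeyError(f"key path {path!r} is missing from the raw payload")
    return value


# Webhook-shaped sample with illustrative values only.
raw_webhook = {
    "action": "opened",
    "number": 17,
    "pull_request": {"id": 9001, "title": "Fix flaky test"},
}

root_key = check_key_path(raw_webhook, ["number"])
nested_key = check_key_path(raw_webhook, ["pull_request", "id"])
```

Checking a derived field such as `["request_key"]` against the same payload raises `KeyError`, which is the failure this section warns about.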

Fanout

Use fanout for bounded independent work. Avoid fanout for large file transfer,
cursor advancement, or shared-state mutation unless docs and tests show it is
safe.

For fanout flows:

keep item payloads small
persist large artifacts and pass refs
include a per-item result shape with success/failure fields
use one async item mode per fanout: all :call-flow items or all named :agent items
summarize counts and failures in final output
prove the chosen mode with evidence: counts, failures, child runs, and no skipped/reprocessed items
when fanout starts child workflows, group the parent and child flows with shared flow metadata so users can see the relationship; grouping is only display metadata, not runtime orchestration
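
The per-item result shape and final summary from the list above might look like the following Python sketch. It uses a thread pool only to stand in for fanout; `process_item`, the field names `ok` / `result` / `error`, and the sample items are assumptions, and per-item timeouts are left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor


def process_item(item):
    """Hypothetical per-item work; rejects negative values to show a failure path."""
    if item["value"] < 0:
        raise ValueError("negative value")
    return item["value"] * 2


def run_item(item):
    # Every item returns the same shape, so failures are data, not lost exceptions.
    try:
        return {"id": item["id"], "ok": True, "result": process_item(item), "error": None}
    except Exception as exc:
        return {"id": item["id"], "ok": False, "result": None, "error": str(exc)}


def fan_out(items, max_workers=4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_item, items))  # map preserves input order
    failures = [r for r in results if not r["ok"]]
    return {
        "total": len(items),
        "succeeded": len(results) - len(failures),
        "failed": len(failures),
        "failures": failures,
        "results": results,
    }


summary = fan_out([
    {"id": "a", "value": 1},
    {"id": "b", "value": -1},
    {"id": "c", "value": 3},
])
```

The summary accounts for every input item: counts add up to the total, and the failed item is listed with its error instead of being dropped.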

Paging And Loops

Breyta can handle paging and looping, but loops must be explicit, bounded, and
checkpoint-aware. Do not assume a single provider call is enough for large
datasets, and do not build unbounded "keep fetching until done" flows.

Preferred paging shapes:

one packaged :steps wrapper or :function-backed step that pages a provider API up to max-pages / max-items, persists each page or final rows, and returns counts plus resource refs
table resource paging with :table {:op :query ... :page {:mode :cursor ...}} and explicit :sort
multiple small runs or a child flow when each page is heavy or side-effectful
cursor/checkpoint state that advances only after the page's durable writes or side effects have succeeded

Use flow/poll for external async job completion, not generic data paging.
If a loop includes repeated flow/step calls, keep the bound small and consider
rate-limit pauses with :sleep. For unknown/unbounded data, persist page
results and pass resource refs instead of carrying accumulated bodies inline.
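
As a language-neutral illustration of a bounded, checkpoint-aware loop, the Python sketch below pages a fake provider. Nothing here is Breyta API: `page_with_checkpoint`, `fetch_page`, the checkpoint dict, and the offset-style cursor are assumptions chosen to keep the example self-contained.

```python
def page_with_checkpoint(fetch_page, persist_page, checkpoint, max_pages=5):
    """Fetch at most max_pages pages, advancing the checkpoint only after persistence."""
    pages = rows = 0
    while pages < max_pages and not checkpoint.get("done"):
        page = fetch_page(checkpoint.get("cursor"))
        persist_page(page["rows"])  # durable write first; an exception stops here
        if page["next_cursor"] is None:
            checkpoint["done"] = True
        else:
            checkpoint["cursor"] = page["next_cursor"]
        pages += 1
        rows += len(page["rows"])
    return {"pages": pages, "rows": rows, "done": bool(checkpoint.get("done"))}


DATA = list(range(5))  # stand-in for provider data


def fetch_page(cursor):
    # Fake provider: the cursor is an offset and the page size is 2.
    start = cursor or 0
    end = start + 2
    return {"rows": DATA[start:end], "next_cursor": end if end < len(DATA) else None}


stored = []
checkpoint = {}
first_run = page_with_checkpoint(fetch_page, stored.extend, checkpoint, max_pages=2)
second_run = page_with_checkpoint(fetch_page, stored.extend, checkpoint, max_pages=2)
```

The first run stops at its page bound with work remaining, and the second run resumes from the saved cursor and finishes without refetching earlier rows. If `persist_page` raises, the checkpoint stays where it was, so a replay retries the same page.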

Fixed Delays

Use :sleep for timer delays, deployment-lag buffers, polling gaps, or
rate-limit spacing. Use :wait only when the run should pause for an external
signal, webhook, CLI action, or human action.

Example:

(flow/step :sleep :deployment-lag {:duration "10m"})

Retries And Checkpoints

Retries should be bounded and reserved for transient failures.
Never advance cursors/checkpoints past failed work.
Resume/replay behavior must be explicit for partial success paths.
Rerun once when feasible to verify idempotency and duplicate protection.
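
The first rule above (bounded retries, transient failures only) could be sketched like this in Python. `TransientError`, `call_with_retry`, and the backoff values are hypothetical; the point is that only errors classified as transient are retried, and everything else fails fast.

```python
import time


class TransientError(Exception):
    """Marks failures that are assumed safe to retry, such as a brief outage."""


def call_with_retry(operation, max_attempts=3, sleep=time.sleep, base_delay=0.5):
    """Retry transient failures up to max_attempts; let other errors propagate."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"ok": True, "value": operation(), "attempts": attempt}
        except TransientError as exc:
            if attempt == max_attempts:
                return {"ok": False, "error": str(exc), "attempts": attempt}
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff


calls = {"n": 0}


def flaky():
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("temporary outage")
    return "done"


delays = []
outcome = call_with_retry(flaky, max_attempts=3, sleep=delays.append)
```

Passing `delays.append` as the `sleep` function records the backoff instead of actually waiting, which keeps the retry path cheap to verify.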

Large Artifacts

Persist large artifacts and pass resource refs.
Prefer child flows to isolate heavyweight artifact creation and handoff.
Avoid moving large bodies through many steps.
Use breyta resources get, breyta resources read, and breyta resources url to prove artifact availability.
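
The persist-and-pass-refs rule can be illustrated with a small size-based switch. This Python sketch is not Breyta's resource API: `to_payload`, the 1024-byte limit, the `res-` ref format, and the dict used as a store are all assumptions.

```python
import hashlib
import json

INLINE_LIMIT_BYTES = 1024  # hypothetical threshold; choose one that fits the flow


def to_payload(data, store, limit=INLINE_LIMIT_BYTES):
    """Return small data inline; persist large data and return only a ref."""
    body = json.dumps(data, sort_keys=True).encode("utf-8")
    if len(body) <= limit:
        return {"kind": "inline", "data": data}
    ref = "res-" + hashlib.sha256(body).hexdigest()[:12]
    store[ref] = body  # stand-in for a durable resource write
    return {"kind": "ref", "ref": ref, "bytes": len(body)}


store = {}
small = to_payload({"status": "ok"}, store)
large = to_payload({"rows": list(range(1000))}, store)
```

Downstream steps then carry only the short ref for the large artifact and read the body from the store when they actually need it.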

Verification

Before release, verify:

happy path
no-item or no-op path when applicable
partial failure or retry path when feasible
replay/rerun behavior for duplicate protection
output/resource evidence for the user-facing result

After release, capture live smoke proof when side effects are safe.