How Arc compares
Arc governs the action.
Everything else watches, caps, or hopes.
Observability records what your agent did, after. Spend gateways cap the dollars at the model call. DIY checks work in the demo and race in production. Arc is the layer in front of the action itself — and most teams run it alongside the others, not instead of them.
Helicone, Langfuse & Datadog — the autopsy, not the seatbelt
LLM observability tools record what your AI agents did after the request happens. They are the flight recorder, and they are excellent at it. But a trace cannot un-spend $40,000 of tokens, and a dashboard cannot un-delete a production table. By the time the span shows up, the action already fired.
Arc runs before the action. Every high-risk move passes an allow / ask / block policy, gets a human approval when it's risky, and executes only through a request your app cryptographically verifies (ES256) — then Arc writes a redacted, hash-chained audit record. Use observability to understand your agents; use Arc to stop the ones you can't afford.
A logging proxy and a tracing SDK assume the dangerous thing already happened and your job is to explain it later. That model was fine when an LLM call returned text. It breaks the moment an agent holds a token that can move money, send to customers, or drop a table — because the cost is irreversible and the observability tool, by design, is downstream of it. Tracing, evals, and alerts are all posterior to the event.
| Capability | Arc | Helicone | Langfuse | Datadog LLM Obs |
|---|---|---|---|---|
| Logs / traces agent activity | ● audit log | ● | ● full trees | ● |
| Sits in front of the action (can block it) | ● | ○ logs after | ○ observes after | ○ observes after |
| Pre-action allow / ask / block policy | ● | ○ | ○ | ○ |
| Human approval on risky actions | ● | ○ | ○ | ○ |
| Signed (ES256), app-verified execution | ● | ○ | ○ | ○ |
| Cumulative spend / budget caps that enforce | ● | ◐ tracking, alerts | ◐ tracking | ◐ tracking, alerts |
| Tamper-evident hash-chained audit | ● | ○ | ○ | ○ |
| Evals / LLM-as-judge / quality scoring | ○ | ◐ | ● | ◐ |
| Latency / token-cost dashboards | ◐ basic | ● | ● | ● best-in-class |
| Guards app actions vs model calls | actions | model calls | model calls | model calls |
Read it as a heatmap: ● where the tool does the job, ◐ where it does part of it, ○ where it cannot intervene. The bottom row is the real divide: observability watches model calls; Arc guards business actions.
You don't replace your observability stack with Arc, and you shouldn't. Run them in series — Arc is the gate, observability is the camera behind it.
```
agent → Arc (allow / ask / block → approval → ES256-signed exec) → your app
          │
          └── every decision + outcome → Langfuse / Helicone / Datadog
```
Observability alone is enough when your agents only generate text or make read-only calls, nothing moves money or mutates production, and your real problem is debugging prompt quality and latency. You don't need a seatbelt for a parked car.
Add Arc when an autonomous agent holds production credentials and a single bad action (a runaway loop, a wrong refund, a destructive delete) costs real money or can't be undone. A trace of that event is a receipt, not a brake.
LiteLLM, Portkey & Bifrost cap the dollars — not the consequence
AI gateways sit between your code and the model providers and do one category of job extremely well: cap the dollars. Virtual keys, per-team budget windows, rate limits, a 429 when the budget runs out. If your only fear is the bill, a gateway is a genuinely good answer — and Arc does not try to replace its routing or caching.
But a spend cap stops spend. It does not stop the action. A gateway happily lets your agent issue a wrong refund, email the wrong customer, or delete a production record — as long as the tokens are under budget. Use a gateway to control cost; use Arc to control consequences.
Notice what a dollar cap actually constrains: aggregate token spend. It is blind to which action the agent is about to take. Two actions can cost the same fraction of a cent in tokens, draft_reply (harmless) and delete_customer (irreversible), and a spend gateway treats them identically: both pass if the budget has headroom. Arc treats them as what they are: one is allow, the other is block. The dimension Arc adds isn't cost; it's authority over the consequence.
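The contrast can be sketched in a few lines. This is a minimal illustration, not Arc's policy format: the POLICY table, the decide and gatewayAllows functions, the issue_refund action, and the token costs are all hypothetical names and numbers chosen for the example.

```javascript
// Hypothetical per-action policy. The shape is illustrative only.
const POLICY = new Map([
  ["draft_reply", "allow"],     // harmless: runs without review
  ["issue_refund", "ask"],      // risky: waits for a human approval
  ["delete_customer", "block"], // irreversible: never runs
]);

// Fail closed: an action the policy does not name is blocked.
function decide(action) {
  return POLICY.get(action) ?? "block";
}

// What a spend cap sees: only the token cost (assumed numbers).
const TOKEN_COST_USD = { draft_reply: 0.002, delete_customer: 0.002 };

function gatewayAllows(action, budgetRemainingUsd) {
  return TOKEN_COST_USD[action] <= budgetRemainingUsd;
}
```

With budget headroom, gatewayAllows returns true for both actions, while decide separates them.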
There's a second gap. Even when a gateway blocks a call, it blocks it at the gateway — a 429 to your client. Arc instead delivers approved work as a signed request your application verifies before it runs business logic: the app checks the JWS signature, timestamp, nonce, and a hash of the exact body. Tampering between “approved” and “executed” is detectable, and your app refuses anything Arc didn't sign. A gateway has no equivalent — it trusts whatever code holds the virtual key.
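The timestamp, nonce, and body-hash checks described above might look roughly like this on the application side. It is a sketch under stated assumptions: the claim names (iat, nonce, bodyHash), the five-minute window, and the in-memory nonce set are invented for illustration, and the claims are assumed to come from a signature that has already been verified.

```javascript
const { createHash } = require("node:crypto");

const MAX_SKEW_MS = 5 * 60 * 1000; // assumed freshness window
const seenNonces = new Set();      // a real deployment needs a shared store with expiry

function sha256Hex(text) {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// claims: { iat: epoch milliseconds, nonce: string, bodyHash: hex string }
// rawBody: the exact bytes of the request body, as a string
function checkRequest(claims, rawBody, nowMs) {
  if (Math.abs(nowMs - claims.iat) > MAX_SKEW_MS) return "stale timestamp";
  if (seenNonces.has(claims.nonce)) return "replayed nonce";
  if (sha256Hex(rawBody) !== claims.bodyHash) return "body hash mismatch";
  seenNonces.add(claims.nonce); // record the nonce only after every check passes
  return "ok";
}
```

Hashing the raw body rather than a re-serialized object matters here: two JSON serializations of the same data can differ byte for byte.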
| Capability | Arc | LiteLLM | Portkey | Bifrost |
|---|---|---|---|---|
| Multi-provider model routing / fallback | ○ not a gateway | ● | ● | ● |
| Virtual keys | ○ | ● | ● | ● |
| Budget / spend caps (cap the $) | ● cumulative | ● | ● | ● |
| Semantic / response caching | ○ | ◐ | ● | ● |
| Content / PII / topic guardrails | ○ out of scope | ◐ | ● | ● |
| Per-action allow / ask / block (not per-key) | ● | ○ | ○ | ○ |
| Human-in-the-loop approval on a specific action | ● | ○ | ○ | ○ |
| Signed (ES256) execution your app verifies | ● | ○ | ○ | ○ |
| Body-hash + nonce + timestamp anti-replay | ● | ○ | ○ | ○ |
| Tamper-evident hash-chained audit of decisions | ● | ◐ request logs | ◐ request logs | ◐ audit logs |
| Governs business actions vs model calls | actions | model calls | model calls | model calls |
Clean split. Top block — routing, keys, budgets, caching, content filters — is gateway territory, and they're good at it. The bottom block — per-action approval, signed app-verified execution, hash-chained audit — is where the gateways go blank. Arc isn't a better gateway; it's a different layer.
Don't gateways already have guardrails?
They do, and it's worth being precise, because the word is doing a lot of work. In gateway-land, “guardrails” means content guardrails: PII detection, topic blocking, output-format checks, tool-permission rules that block or rewrite a tool call by pattern. Those run on the model request/response. None of them is a human approving a specific high-stakes action, and none produces a signed execution your app cryptographically verifies. Arc's ask is a real person clicking approve in a console, with the action then delivered as a signed request your app checks before it runs. That's the trust envelope a gateway doesn't model.
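To make the difference from a content filter concrete, here is a rough sketch of what an approval with a lifecycle involves. It is not Arc's implementation: the in-memory map, the function names, and the approver list are hypothetical, and a real queue needs durable storage.

```javascript
// Hypothetical in-memory approval queue, for illustration only.
const approvals = new Map();

function requestApproval(id, action, nowMs, ttlMs) {
  approvals.set(id, { action, status: "pending", expiresAt: nowMs + ttlMs });
}

function approve(id, approver, allowedApprovers, nowMs) {
  const entry = approvals.get(id);
  if (!entry || entry.status !== "pending") return false;
  if (nowMs >= entry.expiresAt) {
    entry.status = "expired"; // nobody clicked in time
    return false;
  }
  if (!allowedApprovers.includes(approver)) return false; // per-user authorization
  entry.status = "approved";
  entry.approvedBy = approver;
  return true;
}

// Fail closed: unknown, pending, and expired requests never execute.
function mayExecute(id) {
  const entry = approvals.get(id);
  return entry !== undefined && entry.status === "approved";
}
```

The design choice that matters is the default: silence and expiry resolve to "do not run", never to "run anyway".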
Keep your gateway for what it's best at — provider routing, dollar caps, caching. Put Arc in front of the actions that move money or mutate production.
```
agent
  → LiteLLM / Portkey / Bifrost (route, cap $, cache, filter)
  → Arc (allow/ask/block → approval → signed exec)
  → your app (verifies signature, runs business logic)
```
The gateway answers “can we afford this token spend?” — Arc answers “is this specific action allowed, approved, and signed?”
A gateway alone is enough when your problem is purely cost and routing: one API across providers, per-team dollar budgets, caching, and a 429 when the budget's gone. For controlling the bill, a gateway is the right and sufficient tool.
Add Arc when your agents take consequential actions (refunds, sends, cancellations, deletes, infra changes) where the danger isn't the token cost but the action itself, and you need a human in the loop plus cryptographic proof your app only executed what was approved. A budget cap won't stop a correctly-budgeted catastrophe.
Timeouts, try/catch & budget checks — the guardrail you keep meaning to harden
Every team running autonomous agents builds some version of this: a budget check before the expensive call, a try/catch around the dangerous one, a timeout so the loop can't run forever, maybe a Slack ping for “important” actions, and a console.log you promise to turn into a real audit log later. It works in the demo. It is also the exact stack the runaway-cost stories were built on.
The problem isn't that DIY is wrong; it's that doing it correctly is its own product. Fail-closed policy evaluation, a real approval queue with expiry, replay-safe signed delivery, idempotency, and a tamper-evident audit chain are weeks of un-fun infra work that competes with your roadmap. The alternative is npm i @geostack/arc instead of a quarter of platform work you'll under-resource.
The honest lifecycle of the homegrown version:
```javascript
// week 1: looks fine
if (estimatedCost > budgetRemaining) throw new Error("over budget");
try {
  const result = await doRiskyAction(input); // refund, delete, send…
  console.log("did action", { action, input }); // "we'll make this a real log later"
  return result;
} catch (e) {
  // swallow? retry? alert? …we'll decide later
}
```

- the budget check races. Two agent loops read budgetRemaining at the same time; both pass; you're over budget. Correct enforcement needs a locked, atomic counter, not a read-then-act.
- “important actions need approval” has no home. Where does a pending approval live? Who can see it? What happens when nobody clicks for an hour: fail open or closed? Build that and you've built an approval lifecycle with expiry and per-user authorization.
- try/catch ≠ replay safety. The action succeeded but the network hiccupped on the response; your retry runs the refund twice. Now you need idempotency keyed on the action, and to record “unknown outcome” instead of blindly retrying.
- console.log is not an audit log. The day an auditor asks “who approved this and was it tampered with?”, a log line won't answer it. You need redaction, canonicalization, and a hash chain so edits are detectable.
- nothing proves the action was authorized. Any code path that can call doRiskyAction() can do it unguarded. There's no signature binding “this exact action was approved” to “this is what executed.”
- it rots. Every new action re-implements the pattern slightly differently. Six months later the policy lives in twelve if statements and no one can answer “what can this agent do?” in one place.
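As an illustration of the first item in the list, a read-then-act check can be replaced by a reservation that checks and debits in one step. This is a sketch of the general technique inside a single Node.js process, not Arc's implementation; BudgetLedger and its methods are hypothetical names. Across several processes the same idea needs an atomic operation in a shared store, such as a conditional UPDATE in a database.

```javascript
// Hypothetical in-process budget ledger, tracked in integer cents.
class BudgetLedger {
  constructor(limitCents) {
    this.remainingCents = limitCents;
  }

  // Check and debit happen in one synchronous step with no await between
  // them, so no other agent loop can slip in after the check.
  tryReserve(costCents) {
    if (costCents > this.remainingCents) return false;
    this.remainingCents -= costCents;
    return true;
  }

  // Hand the reservation back if the action never ran.
  release(costCents) {
    this.remainingCents += costCents;
  }
}
```

Two loops that each try to reserve 600 cents against a 1,000-cent budget get one success and one refusal, whichever order they arrive in.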
| Concern | DIY (timeouts + try/catch + budget checks) | Arc |
|---|---|---|
| Time to first guardrail | hours (and it shows) | minutes (npm i @geostack/arc) |
| Policy model | scattered if statements per action | one declarative allow / ask / block model |
| “What can this agent do?” answerable in one place | ○ | ● |
| Budget enforcement under concurrency | race-prone read-then-act | atomic, locked cumulative caps |
| Human approval queue (expiry + per-user scope) | you build it | built in (console + lifecycle) |
| Fail-closed by default on the risky path | usually fails open | deterministic, fail-closed rules |
| Replay / double-execution safety | try/catch won't save you | idempotency by invocation, safe-retry only |
| Proof the executed action was the approved one | none | ES256-signed (sig + body hash + nonce + timestamp) |
| Audit you can hand an auditor | console.log | redacted, hash-chained event log |
| Who owns it at 2am | you | a reviewable, documented layer |
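The "idempotency by invocation, safe-retry only" row can be sketched as follows. This is one possible reading, not Arc's code: the outcomes map, runOnce, and the err.definite flag are hypothetical, and the action is synchronous only to keep the example short.

```javascript
// Hypothetical record of invocation outcomes, keyed by invocation id.
// Values: "in_flight" | "succeeded" | "failed" | "unknown"
const outcomes = new Map();

function runOnce(invocationId, action) {
  const prior = outcomes.get(invocationId);
  // Only a definite failure may be retried. "unknown" means the action
  // might have landed, so it is never re-run automatically.
  if (prior !== undefined && prior !== "failed") {
    return { ran: false, status: prior };
  }
  outcomes.set(invocationId, "in_flight");
  try {
    action();
    outcomes.set(invocationId, "succeeded");
  } catch (err) {
    // err.definite is an assumed flag meaning "the action certainly did not happen".
    outcomes.set(invocationId, err.definite === true ? "failed" : "unknown");
  }
  return { ran: true, status: outcomes.get(invocationId) };
}
```

A refund that succeeds but whose response is lost ends up as "unknown", and a blind retry with the same invocation id does not run it a second time.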
DIY isn't free; it's deferred. The sticker price is “we already have a budget check.” The real price is the quarter of platform-engineering time to make approval, signing, idempotency, and audit actually correct — plus the carrying cost of maintaining it forever, plus the tail risk of the one un-hardened path that fails open on the day it matters. Arc collapses that into an SDK install and a policy file — so “build vs buy” becomes “adopt a hardened, documented layer vs reinvent it.”
DIY is fine when you have one or two low-stakes actions, no irreversible operations, no compliance/audit requirement, and no concurrency, and you're genuinely fine fixing it by hand. A timeout and a try/catch are a reasonable v0 for a toy.
Adopt Arc when you have a growing set of consequential actions, more than one agent or process, any need to prove who approved what, or any action you cannot undo. That's the point where “we'll harden it later” becomes the risk itself, and Arc is the hardened version, today, with signed execution and an audit you can re-verify.
Four jobs that all get called “guardrails”
“AI agent guardrails” is a crowded term covering at least four different jobs, and most listicles blur them. A runaway bill and a destructive action are different failure modes: the first is solved by a dollar cap, the second only by something that can refuse the action and get a human to approve it. Most teams need two or three of these, not one.
| Tool class | Controls… | Acts… | Example tools |
|---|---|---|---|
| Spend gateway | dollars / tokens | before the model call | LiteLLM, Portkey, Bifrost |
| Observability | knowledge of what happened | after the event | Helicone, Langfuse, Datadog |
| Content guardrails | the model's text (PII, topics, format) | on the model I/O | NeMo Guardrails, Guardrails AI |
| Action control plane | the action (refund, delete, send) | before the action runs | Arc |
If you remember one thing: spend gateways and content guardrails stop a call or a string. Only an action control plane like Arc evaluates a specific business action, can require human approval, and delivers a signed execution your app verifies — stopping the action, not just the spend or the text.
The questions buyers actually ask
Is Arc an observability tool, an AI gateway, or a guardrail?
It is an agent action control plane. Observability tools (Helicone, Langfuse, Datadog) record what an agent did after the request. Gateways (LiteLLM, Portkey, Bifrost) cap dollars at the model call. Arc governs the action itself — allow / ask / block, human approval on the risky ones, signed execution your app verifies, and a hash-chained audit. Most teams run Arc alongside one of the others, not instead of it.
Doesn't cost tracking in observability already protect me from a runaway bill?
It alerts you to one; it does not stop one. Cost tracking is posterior to the spend. Arc enforces cumulative spend and budget caps and can block or require approval before the next costly action executes — the difference between a smoke detector and a sprinkler.
LiteLLM already has budget caps. Why add Arc?
Because a budget cap constrains aggregate dollars, not which action runs. An agent can stay under budget and still issue a wrong refund or delete a record. Arc adds per-action allow / ask / block, human approval, and a signed execution your app verifies — and it still enforces cumulative spend caps, so you keep the dollar guard and gain the action guard.
What does “signed, app-verified execution” actually mean?
Arc's worker signs each approved action as an ES256 JWS and POSTs it to your app's execute endpoint. The @geostack/arc SDK verifies the signature, timestamp, nonce, and a hash of the exact request body before your business logic runs. Your app refuses anything Arc didn't sign, and tampering after approval is detectable. A virtual key offers no such guarantee — it trusts whoever holds it.
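For readers who want to see the signature step in isolation, here is a minimal sketch of ES256 compact JWS signing and verification using only Node's built-in crypto module. It does not show the @geostack/arc SDK, whose API is not documented here; signJws, verifyJws, and the locally generated key pair are hypothetical, and the timestamp, nonce, and body-hash checks would follow a successful verification.

```javascript
const { generateKeyPairSync, sign, verify } = require("node:crypto");

// Demo key pair on P-256, the curve ES256 requires. In a real setup the app
// would hold only the signer's public key.
const { privateKey, publicKey } = generateKeyPairSync("ec", { namedCurve: "P-256" });

const b64url = (input) => Buffer.from(input).toString("base64url");

function signJws(payload, key) {
  const signingInput = `${b64url(JSON.stringify({ alg: "ES256" }))}.${b64url(JSON.stringify(payload))}`;
  // JWS expects the raw r||s signature form, which Node calls "ieee-p1363".
  const sig = sign("sha256", Buffer.from(signingInput), { key, dsaEncoding: "ieee-p1363" });
  return `${signingInput}.${sig.toString("base64url")}`;
}

// Returns the payload if the signature checks out, otherwise null.
function verifyJws(token, key) {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, body, sig] = parts;
  try {
    const parsedHeader = JSON.parse(Buffer.from(header, "base64url").toString("utf8"));
    if (parsedHeader.alg !== "ES256") return null; // never trust another algorithm
    const ok = verify(
      "sha256",
      Buffer.from(`${header}.${body}`),
      { key, dsaEncoding: "ieee-p1363" },
      Buffer.from(sig, "base64url"),
    );
    return ok ? JSON.parse(Buffer.from(body, "base64url").toString("utf8")) : null;
  } catch {
    return null; // malformed input fails closed
  }
}
```

Pinning the accepted algorithm to ES256 in the verifier, rather than reading it from the token and obeying it, is the detail that keeps a forged header from downgrading the check.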
Why not just add a budget check and a try/catch myself?
For a demo, that is enough. In production it races under concurrency (two loops both pass the check), try/catch retries can double-execute an irreversible action, and a console.log won't satisfy an auditor. Arc handles atomic caps, replay-safe signed delivery keyed on the invocation, and a tamper-evident audit so you don't discover these gaps during an incident.
Where does Arc store the audit log?
In a redacted, hash-chained event log computed as sha256(prev_hash + canonical_json), which makes post-hoc tampering detectable. External immutable export to object storage is on the roadmap for stronger tamper-evidence; the V1 chain is tamper-evident, not by itself immutable against privileged database access.
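The chaining rule sha256(prev_hash + canonical_json) can be sketched in a few lines. This is an illustration of the formula, not Arc's storage code: the key-sorting canonicalization, the all-zero genesis value, and the entry shape are assumptions, and events are assumed to contain only JSON-compatible values.

```javascript
const { createHash } = require("node:crypto");

// Assumed canonical form: object keys sorted recursively, no whitespace.
function canonicalJson(value) {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.keys(value)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(value[key])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

const GENESIS = "0".repeat(64); // assumed starting value for the first entry

function linkHash(prevHash, event) {
  return createHash("sha256").update(prevHash + canonicalJson(event)).digest("hex");
}

function appendEvent(chain, event) {
  const prevHash = chain.length > 0 ? chain[chain.length - 1].hash : GENESIS;
  chain.push({ event, prevHash, hash: linkHash(prevHash, event) });
}

// Recompute every link; any edited event breaks the chain from that point on.
function verifyChain(chain) {
  let prevHash = GENESIS;
  for (const entry of chain) {
    if (entry.prevHash !== prevHash || entry.hash !== linkHash(prevHash, entry.event)) return false;
    prevHash = entry.hash;
  }
  return true;
}
```

Canonicalization is what makes re-verification possible: without a fixed key order, the same event could hash differently on two machines.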
How is Arc delivered and priced?
Arc is a hosted control plane: sign up for a free workspace, no credit card, and put your first agent behind it. You integrate with the @geostack/arc SDK (npm i @geostack/arc) and verify signed execution in your own app. Pricing is metered on protected agents and guarded actions, not seats. Every decision is signed and lands in a hash-chained audit you can re-verify, so you can evaluate the trust envelope before you rely on it.
Put one risky action behind Arc — and keep the rest of your stack.
Arc isn't a gateway, a tracer, or a content filter. It's the layer none of those provide: approval, signed execution, and a hash-chained audit for the action itself. Free to start — sign up for a hosted workspace, no credit card, metered on guarded actions, not seats.