GEOstack

How Arc compares

Arc governs the action.
Everything else watches, caps, or hopes.

Observability records what your agent did, after. Spend gateways cap the dollars at the model call. DIY checks work in the demo and race in production. Arc is the layer in front of the action itself — and most teams run it alongside the others, not instead of them.

  1. vs Observability (Helicone · Langfuse · Datadog): The autopsy vs the seatbelt.
  2. vs AI Spend Gateways (LiteLLM · Portkey · Bifrost): They cap the dollars. Arc governs the consequence.
  3. vs Building It Yourself (timeouts + try/catch + budget checks): The guardrail you keep meaning to harden.
01 · vs Observability

Helicone, Langfuse & Datadog — the autopsy, not the seatbelt

LLM observability tools record what your AI agents did, after the request has already happened. They are the flight recorder, and they are excellent at it. But a trace cannot un-spend $40,000 of tokens, and a dashboard cannot un-delete a production table. By the time the span shows up, the action has already fired.

Arc runs before the action. Every high-risk move passes an allow / ask / block policy, gets a human approval when it's risky, and executes only through a request your app cryptographically verifies (ES256) — then Arc writes a redacted, hash-chained audit record. Use observability to understand your agents; use Arc to stop the ones you can't afford.

A logging proxy and a tracing SDK assume the dangerous thing already happened and your job is to explain it later. That model was fine when an LLM call returned text. It breaks the moment an agent holds a token that can move money, send to customers, or drop a table — because the cost is irreversible and the observability tool, by design, is downstream of it. Tracing, evals, and alerts are all posterior to the event.

~$500M
Reportedly spent on Claude in a single month by one enterprise client that set no usage caps on employee licenses — an AI consultant's account reported by Axios. The company is unnamed and no company has confirmed the figure. A perfectly instrumented observability stack would have produced a beautiful, very expensive graph of that month — after the money was gone.
Arc vs LLM observability — capability comparison
Capability Arc Helicone Langfuse Datadog LLM Obs
Logs / traces agent activity audit log full trees
Sits in front of the action (can block it) logs after observes after observes after
Pre-action allow / ask / block policy
Human approval on risky actions
Signed (ES256), app-verified execution
Cumulative spend / budget caps that enforce tracking, alerts tracking tracking, alerts
Tamper-evident hash-chained audit
Evals / LLM-as-judge / quality scoring
Latency / token-cost dashboards basic best-in-class
Guards app actions vs model calls actions model calls model calls model calls

Read it as a heatmap — green where the tool's job is to know, red where it cannot intervene. The bottom row is the real divide: observability watches model calls; Arc guards business actions.

You don't replace your observability stack with Arc, and you shouldn't. Run them in series — Arc is the gate, observability is the camera behind it.

agent → Arc (allow / ask / block → approval → ES256-signed exec) → your app
          │
          └── every decision + outcome → Langfuse / Helicone / Datadog
Pick observability, not Arc, if

your agents only generate text or make read-only calls, nothing moves money or mutates production, and your real problem is debugging prompt quality and latency. You don't need a seatbelt for a parked car.

You need Arc (alongside it) if

an autonomous agent holds production credentials and a single bad action — a runaway loop, a wrong refund, a destructive delete — costs real money or can't be undone. A trace of that event is a receipt, not a brake.

02 · vs AI Spend Gateways

LiteLLM, Portkey & Bifrost cap the dollars — not the consequence

AI gateways sit between your code and the model providers and do one category of job extremely well: cap the dollars. Virtual keys, per-team budget windows, rate limits, a 429 when the budget runs out. If your only fear is the bill, a gateway is a genuinely good answer — and Arc does not try to replace its routing or caching.

But a spend cap stops spend. It does not stop the action. A gateway happily lets your agent issue a wrong refund, email the wrong customer, or delete a production record — as long as the tokens are under budget. Use a gateway to control cost; use Arc to control consequences.

Notice what a dollar cap actually constrains: aggregate token spend. It is blind to which action the agent is about to take. Two actions can cost the same fraction of a cent in tokens — draft_reply (harmless) and delete_customer (irreversible) — and a spend gateway treats them identically: both pass if the budget has headroom. Arc treats them as what they are: one is allow, the other is block. The dimension Arc adds isn't cost — it's authority over the consequence.
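
To make the distinction concrete, here is a minimal sketch of a per-action policy lookup. The `PolicyRule` shape, the `evaluate` function, and the `issue_refund` action are hypothetical names chosen for illustration; this page does not show Arc's actual policy format. The one design choice worth noting is that an action with no matching rule falls through to block, so a gap in the policy fails closed.

```typescript
// Hypothetical types and names; Arc's real policy format is not shown in this document.
type Decision = "allow" | "ask" | "block";

interface PolicyRule {
  action: string; // exact action name, e.g. "delete_customer"
  decision: Decision;
}

// Look up the decision for an action by name.
// Unknown actions return "block", so a missing rule fails closed rather than open.
function evaluate(rules: PolicyRule[], action: string): Decision {
  const rule = rules.find((r) => r.action === action);
  return rule ? rule.decision : "block";
}

// Two actions with near-identical token cost get different decisions.
const rules: PolicyRule[] = [
  { action: "draft_reply", decision: "allow" },
  { action: "issue_refund", decision: "ask" }, // hypothetical action needing human approval
  { action: "delete_customer", decision: "block" },
];
```

Nothing in the lookup depends on cost: the decision is keyed on what the action is, which is exactly the dimension a dollar cap cannot see.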

There's a second gap. Even when a gateway blocks a call, it blocks it at the gateway — a 429 to your client. Arc instead delivers approved work as a signed request your application verifies before it runs business logic: the app checks the JWS signature, timestamp, nonce, and a hash of the exact body. Tampering between “approved” and “executed” is detectable, and your app refuses anything Arc didn't sign. A gateway has no equivalent — it trusts whatever code holds the virtual key.
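
The non-signature checks in that list (timestamp, nonce, body hash) can be sketched as follows. This is an illustration under stated assumptions, not the SDK's implementation: the `SignedEnvelope` fields, the five-minute window, and the in-memory nonce set are all hypothetical, and the sketch assumes the envelope's signature has already been verified.

```typescript
import { createHash } from "node:crypto";

// Hypothetical envelope shape; the real fields Arc uses are not documented on this page.
interface SignedEnvelope {
  timestamp: number; // Unix time in milliseconds when the request was signed
  nonce: string; // unique per request
  bodyHash: string; // hex SHA-256 of the exact request body
}

const MAX_SKEW_MS = 5 * 60 * 1000; // assumed freshness window
const seenNonces = new Set<string>(); // in-memory only; a real service needs shared, expiring storage

function sha256Hex(data: string): string {
  return createHash("sha256").update(data, "utf8").digest("hex");
}

// Returns null if the request passes, or a reason string if it must be refused.
// Assumes the envelope's signature was verified before this function is called.
function checkEnvelope(env: SignedEnvelope, rawBody: string, now: number): string | null {
  if (Math.abs(now - env.timestamp) > MAX_SKEW_MS) return "stale timestamp";
  if (sha256Hex(rawBody) !== env.bodyHash) return "body hash mismatch";
  if (seenNonces.has(env.nonce)) return "nonce replayed";
  seenNonces.add(env.nonce); // record the nonce only after every other check passed
  return null;
}
```

Hashing the raw body, rather than a re-serialized object, matters here: any change between approval and execution, even whitespace, produces a different hash and the request is refused.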

$1,500/engineer/mo
The per-employee cap Uber instituted after burning through its entire 2026 AI coding budget by April (CTO Praveen Neppalli Naga, via The Information / TechCrunch). Microsoft is moving most of its Experiences + Devices engineers off Claude Code by June 30; internal usage of the tool reportedly ran $500–$2,000 per engineer per month. The lesson everyone took was “set the cap.” The lesson under it: the cap is necessary but not sufficient. It bounds the bill, not the blast radius of a single destructive action.
Arc vs AI spend gateways — capability comparison
Capability Arc LiteLLM Portkey Bifrost
Multi-provider model routing / fallback not a gateway
Virtual keys
Budget / spend caps (cap the $) cumulative
Semantic / response caching
Content / PII / topic guardrails out of scope
Per-action allow / ask / block (not per-key)
Human-in-the-loop approval on a specific action
Signed (ES256) execution your app verifies
Body-hash + nonce + timestamp anti-replay
Tamper-evident hash-chained audit of decisions request logs request logs audit logs
Governs business actions vs model calls actions model calls model calls model calls

Clean split. Top block — routing, keys, budgets, caching, content filters — is gateway territory, and they're good at it. The bottom block — per-action approval, signed app-verified execution, hash-chained audit — is where the gateways go blank. Arc isn't a better gateway; it's a different layer.

“But Portkey / LiteLLM have guardrails”

They do — and it's worth being precise, because the word is doing a lot of work. In gateway-land, “guardrails” means content guardrails: PII detection, topic blocking, output-format checks, tool-permission rules that block or rewrite a tool call by pattern. Those run on the model request/response. None of them is a human approving a specific high-stakes action, and none produces a signed execution your app cryptographically verifies. Arc's ask is a real person clicking approve in a console — the action then delivered as a signed request your app checks before it runs. That's the trust envelope a gateway doesn't model.

Keep your gateway for what it's best at — provider routing, dollar caps, caching. Put Arc in front of the actions that move money or mutate production.

agent
  ↓
LiteLLM / Portkey / Bifrost  (route, cap $, cache, filter)
  ↓
Arc                          (allow/ask/block → approval → signed exec)
  ↓
your app                     (verifies signature, runs business logic)

The gateway answers “can we afford this token spend?” — Arc answers “is this specific action allowed, approved, and signed?”

Pick a gateway, not Arc, if

your problem is purely cost and routing — one API across providers, per-team dollar budgets, caching, and a 429 when the budget's gone. For controlling the bill, a gateway is the right and sufficient tool.

You need Arc (in addition) if

your agents take consequential actions — refunds, sends, cancellations, deletes, infra changes — where the danger isn't the token cost but the action itself, and you need a human in the loop plus cryptographic proof your app only executed what was approved. A budget cap won't stop a correctly-budgeted catastrophe.

03 · vs Building It Yourself

Timeouts, try/catch & budget checks — the guardrail you keep meaning to harden

Every team running autonomous agents builds some version of this: a budget check before the expensive call, a try/catch around the dangerous one, a timeout so the loop can't run forever, maybe a Slack ping for “important” actions, and a console.log you promise to turn into a real audit log later. It works in the demo. It is also the exact stack the runaway-cost stories were built on.

The problem isn't that DIY is wrong; it's that doing it correctly is its own product. Fail-closed policy evaluation, a real approval queue with expiry, replay-safe signed delivery, idempotency, and a tamper-evident audit chain add up to weeks of un-fun infra work that competes with your roadmap. The alternative is npm i @geostack/arc instead of a quarter of platform work you'll under-resource.

The honest lifecycle of the homegrown version:

guardrail.ts — week 1, looks fine
// week 1 — looks fine
if (estimatedCost > budgetRemaining) throw new Error("over budget");
try {
  const result = await doRiskyAction(input);   // refund, delete, send…
  console.log("did action", { action, input }); // "we'll make this a real log later"
  return result;
} catch (e) {
  // swallow? retry? alert? …we'll decide later
}
  • the budget check races. Two agent loops read budgetRemaining at the same time; both pass; you're over budget. Correct enforcement needs a locked, atomic counter — not a read-then-act.
  • “important actions need approval” has no home. Where does a pending approval live? Who can see it? What happens when nobody clicks for an hour — fail open or closed? Build that and you've built an approval lifecycle with expiry and per-user authorization.
  • try/catch ≠ replay safety. The action succeeded but the network hiccupped on the response; your retry runs the refund twice. Now you need idempotency keyed on the action, and to record “unknown outcome” instead of blindly retrying.
  • console.log is not an audit log. The day an auditor asks “who approved this and was it tampered with?”, a log line won't answer it. You need redaction, canonicalization, and a hash chain so edits are detectable.
  • nothing proves the action was authorized. Any code path that can call doRiskyAction() can do it unguarded. There's no signature binding “this exact action was approved” to “this is what executed.”
  • it rots. Every new action re-implements the pattern slightly differently. Six months later the policy lives in twelve if statements and no one can answer “what can this agent do?” in one place.
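
The first bullet's fix, replacing read-then-act with a single check-and-deduct step, might look roughly like this. `BudgetCounter` is a hypothetical class for illustration and only covers one Node.js process; across processes the same idea needs an atomic operation in a shared store.

```typescript
// Hypothetical in-process budget counter. The check and the deduction happen in one
// synchronous step, so no other task in the same process can interleave between them.
// Amounts are integers (for example, cents) to avoid floating-point drift.
class BudgetCounter {
  private remaining: number;

  constructor(limit: number) {
    this.remaining = limit;
  }

  // Reserve before acting: returns true and deducts, or returns false and changes nothing.
  tryReserve(cost: number): boolean {
    if (cost < 0 || cost > this.remaining) return false;
    this.remaining -= cost;
    return true;
  }

  // Give back a reservation when the action never ran.
  release(cost: number): void {
    this.remaining += cost;
  }

  get available(): number {
    return this.remaining;
  }
}
```

Two loops that each call `tryReserve` can no longer both pass on the same headroom: the second caller sees the balance the first one left behind.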
0
The number of usage caps reportedly in place at the enterprise client that, per an AI consultant's account reported by Axios, ran up ~$500M on Claude in a single month. (Company unnamed; no company has confirmed the figure.) The cap wasn't missing because no one could build it — it was missing because “we'll add the guardrail later” is the default state of every DIY control plane. Arc is the cap, turned on by default.
DIY vs Arc — line by line
Concern DIY (timeouts + try/catch + budget checks) Arc
Time to first guardrail hours (and it shows) minutes npm i @geostack/arc
Policy model scattered if statements per action one declarative allow / ask / block model
“What can this agent do?” answerable in one place
Budget enforcement under concurrency race-prone read-then-act atomic, locked cumulative caps
Human approval queue (expiry + per-user scope) you build it built in (console + lifecycle)
Fail-closed by default on the risky path usually fails open deterministic, fail-closed rules
Replay / double-execution safety try/catch won't save you idempotency by invocation, safe-retry only
Proof the executed action was the approved one none ES256-signed (sig + body hash + nonce + timestamp)
Audit you can hand an auditor console.log redacted, hash-chained event log
Who owns it at 2am you a reviewable, documented layer
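
The approval-queue and fail-closed rows are the ones teams most often underestimate, so here is a rough sketch of the minimum lifecycle involved. The `Approval` record and both functions are hypothetical and not Arc's schema; the point is only that expiry and per-user scope have to be checked both when someone approves and again when the action is about to run.

```typescript
// Hypothetical approval record; field names are illustrative, not Arc's schema.
type ApprovalStatus = "pending" | "approved" | "denied";

interface Approval {
  action: string;
  status: ApprovalStatus;
  expiresAt: number; // Unix ms after which the request may no longer run
  allowedApprovers: string[];
}

// A user can approve only while the request is pending, unexpired, and within their scope.
function approve(a: Approval, user: string, now: number): Approval {
  if (a.status !== "pending") return a; // already decided
  if (now >= a.expiresAt) return { ...a, status: "denied" }; // nobody clicked in time
  if (!a.allowedApprovers.includes(user)) return a; // out of scope: no change
  return { ...a, status: "approved" };
}

// Fail closed: only an explicit, unexpired approval lets the action run.
function mayExecute(a: Approval, now: number): boolean {
  return a.status === "approved" && now < a.expiresAt;
}
```

A pending request that nobody touches never becomes executable, which is the fail-closed behavior the table contrasts with the usual DIY default.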
The honest cost comparison

DIY isn't free; it's deferred. The sticker price is “we already have a budget check.” The real price is the quarter of platform-engineering time to make approval, signing, idempotency, and audit actually correct — plus the carrying cost of maintaining it forever, plus the tail risk of the one un-hardened path that fails open on the day it matters. Arc collapses that into an SDK install and a policy file — so “build vs buy” becomes “adopt a hardened, documented layer vs reinvent it.”

Keep DIY if

you have one or two low-stakes actions, no irreversible operations, no compliance/audit requirement, and no concurrency — and you're genuinely fine fixing it by hand. A timeout and a try/catch are a reasonable v0 for a toy.

Adopt Arc if

you have a growing set of consequential actions, more than one agent or process, any need to prove who approved what, or any action you cannot undo. That's the point where “we'll harden it later” becomes the risk itself — and Arc is the hardened version, today, with signed execution and an audit you can re-verify.

The whole map

Four jobs that all get called “guardrails”

“AI agent guardrails” is a crowded term covering at least four different jobs, and most listicles blur them. A runaway bill and a destructive action are different failure modes: the first is solved by a dollar cap, the second only by something that can refuse the action and get a human to approve it. Most teams need two or three of these, not one.

What each class of tool actually controls
Tool class | Controls… | Acts… | Example tools
Spend gateway | dollars / tokens | before the model call | LiteLLM, Portkey, Bifrost
Observability | knowledge of what happened | after the event | Helicone, Langfuse, Datadog
Content guardrails | the model's text (PII, topics, format) | on the model I/O | NeMo Guardrails, Guardrails AI
Action control plane | the action (refund, delete, send) | before the action runs | Arc

If you remember one thing: spend gateways and content guardrails stop a call or a string. Only an action control plane like Arc evaluates a specific business action, can require human approval, and delivers a signed execution your app verifies — stopping the action, not just the spend or the text.

FAQ

The questions buyers actually ask

Is Arc an observability tool, an AI gateway, or a guardrail?

It is an agent action control plane. Observability tools (Helicone, Langfuse, Datadog) record what an agent did after the request. Gateways (LiteLLM, Portkey, Bifrost) cap dollars at the model call. Arc governs the action itself — allow / ask / block, human approval on the risky ones, signed execution your app verifies, and a hash-chained audit. Most teams run Arc alongside one of the others, not instead of it.

Doesn't cost tracking in observability already protect me from a runaway bill?

It alerts you to one; it does not stop one. Cost tracking is posterior to the spend. Arc enforces cumulative spend and budget caps and can block or require approval before the next costly action executes — the difference between a smoke detector and a sprinkler.

LiteLLM already has budget caps. Why add Arc?

Because a budget cap constrains aggregate dollars, not which action runs. An agent can stay under budget and still issue a wrong refund or delete a record. Arc adds per-action allow / ask / block, human approval, and a signed execution your app verifies — and it still enforces cumulative spend caps, so you keep the dollar guard and gain the action guard.

What does “signed, app-verified execution” actually mean?

Arc's worker signs each approved action as an ES256 JWS and POSTs it to your app's execute endpoint. The @geostack/arc SDK verifies the signature, timestamp, nonce, and a hash of the exact request body before your business logic runs. Your app refuses anything Arc didn't sign, and tampering after approval is detectable. A virtual key offers no such guarantee — it trusts whoever holds it.
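
For readers who want to see what an ES256 JWS involves, the sketch below signs and verifies a compact JWS using only Node's built-in crypto module. It is an illustration of the general mechanism, not the @geostack/arc SDK: `signJws`, `verifyJws`, and the throwaway demo key pair are assumptions made for this example, and the timestamp, nonce, and body-hash checks are left out.

```typescript
import { generateKeyPairSync, sign, verify, KeyObject } from "node:crypto";

const b64url = (buf: Buffer): string => buf.toString("base64url");

// Build a compact JWS (header.payload.signature) with ES256:
// ECDSA over P-256 with SHA-256, signature encoded as raw r||s (IEEE P1363).
function signJws(payload: object, privateKey: KeyObject): string {
  const header = b64url(Buffer.from(JSON.stringify({ alg: "ES256" }), "utf8"));
  const body = b64url(Buffer.from(JSON.stringify(payload), "utf8"));
  const signingInput = `${header}.${body}`;
  const sig = sign("sha256", Buffer.from(signingInput, "utf8"), {
    key: privateKey,
    dsaEncoding: "ieee-p1363",
  });
  return `${signingInput}.${b64url(sig)}`;
}

// Returns the decoded payload if the signature is valid, otherwise null.
function verifyJws(token: string, publicKey: KeyObject): Record<string, unknown> | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [header, body, sig] = parts;
  try {
    const alg = JSON.parse(Buffer.from(header, "base64url").toString("utf8")).alg;
    if (alg !== "ES256") return null; // never accept a different algorithm named by the token
    const ok = verify(
      "sha256",
      Buffer.from(`${header}.${body}`, "utf8"),
      { key: publicKey, dsaEncoding: "ieee-p1363" },
      Buffer.from(sig, "base64url"),
    );
    return ok ? JSON.parse(Buffer.from(body, "base64url").toString("utf8")) : null;
  } catch {
    return null; // malformed header or payload: refuse
  }
}

// Throwaway key pair for the demo; in practice the app would hold only the public key.
const { publicKey, privateKey } = generateKeyPairSync("ec", { namedCurve: "P-256" });
const token = signJws({ action: "issue_refund" }, privateKey);
```

Swapping the payload segment of `token` for a different one, or verifying with a different public key, makes `verifyJws` return null, which is the property the answer above relies on.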

Why not just add a budget check and a try/catch myself?

For a demo, that is enough. In production it races under concurrency (two loops both pass the check), try/catch retries can double-execute an irreversible action, and a console.log won't satisfy an auditor. Arc handles atomic caps, replay-safe signed delivery keyed on the invocation, and a tamper-evident audit so you don't discover these gaps during an incident.
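
The double-execution gap is the easiest of these to show in code. Below is a minimal sketch of idempotency keyed on an invocation ID, using a hypothetical in-memory `ledger` and `executeOnce` helper (neither is part of Arc's API). It is synchronous for brevity; a real action would be async and the ledger would live in durable storage.

```typescript
// Hypothetical in-memory idempotency ledger keyed on an invocation ID.
type Outcome =
  | { state: "succeeded"; result: string }
  | { state: "unknown" }; // the action may or may not have taken effect

const ledger = new Map<string, Outcome>();

// Runs `action` at most once per invocationId. If an earlier attempt ended with an
// unknown outcome (for example, the response was lost), it is not retried automatically.
function executeOnce(invocationId: string, action: () => string): Outcome {
  const prior = ledger.get(invocationId);
  if (prior) return prior; // replay: return the recorded outcome, do not re-run

  ledger.set(invocationId, { state: "unknown" }); // claim the ID before acting
  try {
    const outcome: Outcome = { state: "succeeded", result: action() };
    ledger.set(invocationId, outcome);
    return outcome;
  } catch {
    // The side effect may already have happened, so the record stays "unknown".
    return { state: "unknown" };
  }
}
```

Claiming the ID before running the action is the important ordering: a crash or lost response leaves an "unknown" record for a human to resolve, instead of a blind retry that refunds twice.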

Where does Arc store the audit log?

In a redacted, hash-chained event log. Each entry's hash is computed as sha256(prev_hash + canonical_json), which makes post-hoc tampering detectable. External immutable export to object storage is on the roadmap for stronger tamper-evidence; the V1 chain is tamper-evident but not, by itself, immutable against privileged database access.
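
A small sketch shows how a chain built from sha256(prev_hash + canonical_json) can be appended to and re-verified. The canonicalization used here (recursively sorted keys, no whitespace), the all-zero genesis hash, and the `AuditEntry` shape are assumptions for illustration; this page does not specify Arc's exact rules.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  prevHash: string;
  event: Record<string, unknown>;
  hash: string;
}

// Assumed canonical form: object keys sorted recursively, no extra whitespace.
// Events are expected to contain JSON-safe values only.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const fields = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`);
    return `{${fields.join(",")}}`;
  }
  return JSON.stringify(value);
}

const GENESIS = "0".repeat(64); // assumed starting value for the first entry

function entryHash(prevHash: string, event: Record<string, unknown>): string {
  return createHash("sha256").update(prevHash + canonicalJson(event), "utf8").digest("hex");
}

// Returns a new log with the event appended and linked to the previous entry.
function append(log: AuditEntry[], event: Record<string, unknown>): AuditEntry[] {
  const prevHash = log.length ? log[log.length - 1].hash : GENESIS;
  return [...log, { prevHash, event, hash: entryHash(prevHash, event) }];
}

// Recomputes every hash from the start; any edited or reordered entry breaks the chain.
function verifyChain(log: AuditEntry[]): boolean {
  let prev = GENESIS;
  for (const entry of log) {
    if (entry.prevHash !== prev || entry.hash !== entryHash(entry.prevHash, entry.event)) return false;
    prev = entry.hash;
  }
  return true;
}
```

Editing an earlier event changes its hash, which no longer matches the `prevHash` stored in the next entry, so `verifyChain` fails. As the answer above notes, this detects tampering; it does not prevent someone with database access from rewriting the whole chain.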

How is Arc delivered and priced?

Arc is a hosted control plane: sign up for a free workspace, no credit card, and put your first agent behind it. You integrate with the @geostack/arc SDK (npm i @geostack/arc) and verify signed execution in your own app. Usage is metered on protected agents and guarded actions, not seats. Every decision is signed and lands in a hash-chained audit you can re-verify, so you can evaluate the trust envelope before you rely on it.

Put one risky action behind Arc — and keep the rest of your stack.

Arc isn't a gateway, a tracer, or a content filter. It's the layer none of those provide: approval, signed execution, and a hash-chained audit for the action itself. Free to start — sign up for a hosted workspace, no credit card, metered on guarded actions, not seats.