Why your agent works on Tuesday: notes from a team building LLM features into a multi-tenant SaaS

We are not going to pretend we have figured this out.

We have been shipping LLM-backed features into GFoundry for a while now – a B2B SaaS for talent management with multi-tenant clients in retail, banking, pharma, logistics. Some clients have a few hundred users. Some have tens of thousands. They share infrastructure. They share rate limits with our model providers. They mostly do not share the same definition of “working.”

What I want to write about is the gap between “the demo worked” and “the agent works on Tuesday morning when payroll is running, two enterprise tenants are doing onboarding cohorts, and the model provider is throttling every third call.” That gap is where a lot of small teams are about to lose six to twelve months. We are losing some of them too. We are also paying close attention to the teams that are several years and several orders of magnitude ahead of us, because they are publishing the kind of postmortems that, if you read them honestly, save you from repeating the same mistakes.

This is not a guide. It is a set of notes from inside the gap.

The gap is not the model

The first thing worth saying is that the LLM is rarely the part that breaks. Datadog’s State of AI Engineering report this year had a number that we kept coming back to: in March 2026, around 2% of all LLM spans across their customer base returned an error, and rate limit errors accounted for almost a third of them. That is roughly 8.4 million rate limit errors in a single month. Not hallucinations. Not bad outputs. Capacity errors from the providers themselves.

Their phrasing is direct: “the dominant production failure mode of LLM applications is capacity.” We think they are probably right.

This matches what we see in our own logs, on a much smaller scale. The model does what it is asked to do most of the time. The problems show up around it – in capacity, in tool schemas, in tenant boundaries, in caching behaviour we did not understand the first time we relied on it. The same kinds of problems that have been the bread and butter of distributed systems engineering for thirty years. They look new because the dependency is new. They are not actually new.

LangChain’s State of Agent Engineering survey of 1,300+ teams puts numbers on it from a different angle: 57% of respondents have agents in production, 32% cite quality as the top barrier to keeping them there, and observability adoption (89%) is now ahead of evaluations (52%). Read that pair of numbers carefully. It says that more teams are watching agents than testing them. That is the same shape as a lot of early production-monitoring stories. People instrument before they understand what to assert against.

Capacity is the noisy neighbour we did not expect

In a traditional multi-tenant SaaS, the noisy neighbour problem is well understood. One tenant runs a heavy report at 2pm, another tenant’s queries slow down. You solve it with quotas, connection pool isolation, query budgets, tenant-tagged metrics. There is a reasonable body of writing on it now.

What we did not anticipate is that the LLM provider is, from the perspective of our application, a single shared dependency that all tenants pull on. When tenant A decides to ingest 4,000 PDFs through the metadata extraction pipeline at 9am, and tenant B is in the middle of an onboarding flow at the same time, they are both queued behind the same provider rate limit. The shared dependency is not our database. It is OpenAI, or Anthropic, or whichever provider is behind the feature.

We did not engineer for this on the first pass. The first pass had per-tenant queues for our own background jobs, which is the obvious thing. What it did not have was a budgeting layer between those queues and the provider, so a single tenant’s batch job could – and on at least one occasion did – eat enough provider capacity that interactive features for other tenants started timing out.

The fix is not exciting. We added per-tenant token budgets, a backpressure mechanism, and a priority lane for interactive (user-facing) calls versus batch (background) calls. It looks a lot like the queueing patterns we already had for our own database, applied one layer up. Datadog’s report calls this “capacity engineering” and treats it as a first-class discipline. We think they are right to.

What we still do not have a clean answer for: what to do when the provider’s overall capacity is the ceiling. You can add backpressure inside your system all you want; if the provider is saturated, your interactive lane is also slow. The only real answer there is multi-provider failover, which adds its own set of problems (different tool calling formats, different cache semantics, drift in output behaviour between providers). We are watching what bigger teams do here. Anthropic’s own guidance on building effective agents suggests using “an API gateway or load balancer to abstract specific models, making A/B testing and model swapping seamless.” Easy to write. Less easy to do without breaking the prompt cache.

Schema drift is the new dependency hell

In February 2026, n8n shipped versions 2.4.7 through 2.6.3 of its workflow engine. Users who upgraded found that the Vector Store Question Answer Tool started generating invalid JSON schemas for function calling. OpenAI rejected the calls with Invalid schema for function: schema must be a JSON Schema of 'type: "object"'. Anthropic rejected them with tools.0.custom.input_schema.type: Field required. Production workflows stopped working entirely. The fix was to roll back the version.

The same failure pattern showed up at the same time in FlowiseAI, in Zed IDE, and in the OpenAI Agents SDK itself. Different teams, different codebases, same root cause: a tool schema generator changed how it serialised types, and the new output was rejected by the providers’ validators.

This one we have not been bitten by yet. Reading the postmortems, we think it is partly luck. We do not generate tool schemas dynamically from a typed runtime; we author them by hand in JSON, the same way we author API contracts. That is slower to develop and probably will not scale forever, but it has the property that schema changes are visible in PR review and run through the same approval flow as anything else. When we eventually move to generation – we probably will, because hand-authored does not scale to a real catalogue of tools – we want a contract test that round-trips schemas through every provider’s validator on CI. Anthropic’s agent SDK post talks about treating tools as “a new kind of software which reflects a contract between deterministic systems and non-deterministic agents.” That framing is useful. Tools are contracts. Contracts get tested.

The honest part: we have not built that contract test yet. We are aware we should. It is on the list. The reason it is not done yet is that the list is long, and so far the hand-authoring has held up. That is not a great reason.

Caching is everything, and we still do not fully trust it

The Anthropic engineering team published a piece in late April called Lessons from building Claude Code: Prompt caching is everything. The opening line is worth quoting because it is unusual to see a model provider say it this plainly: long-running agentic products like Claude Code are made feasible by prompt caching. The economics of running an agent without a high cache hit rate do not work. They run alerts on their cache hit rate and declare incidents when it drops.

A week before that piece, the same team published a separate postmortem explaining that on March 26 they shipped a caching optimisation that was meant to clear stale reasoning from idle sessions once. Instead, due to a bug, it cleared it on every turn for the rest of the session. Users reported Claude becoming forgetful and repetitive across the second half of March and into April. They fixed it on April 10. The postmortem is detailed, owns the mistake, and explains the cascade: the broken flag caused continuous cache misses, which also caused users’ rate limits to drain faster than expected, which generated a separate set of complaints that took weeks to correlate with the original change.

We read this postmortem carefully because we are doing a much smaller version of the same thing. Our prompts include tenant-specific context (organisation name, role taxonomy, recent activity) and most of that is stable across a session. The caching pattern is to put the stable part of the prefix at the start, the volatile part at the end, and resist the urge to “just” inject a small dynamic value (a timestamp, a random ID, anything) into the cached prefix. Anthropic’s post is direct about this: “any change anywhere in the prefix invalidates everything after it. Design your entire system around this constraint.”

We have already shipped at least one change that quietly invalidated our cache for a couple of days because we added a small dynamic field to the system prompt without realising what it would do to the cache hit rate. We caught it because the bill for that feature came back higher than expected. Without that bill, we might not have caught it for longer. We do not currently alert on cache hit rate the way Anthropic does internally. We probably should.

Tenant isolation in the context window

This one we are still thinking about, and we have less to say with confidence.

In a traditional multi-tenant database, isolation is a well-defined property. Tenants are isolated if no query for tenant A can read or write data belonging to tenant B. The mechanisms (row-level security, schema-per-tenant, database-per-tenant) are well understood, the failure modes (a missing WHERE tenant_id = ?) are familiar, and the test patterns are mature.

In an LLM-backed feature, isolation is not just about which rows the retrieval layer pulls. It is also about what ends up in the context window, what ends up in observability traces, what ends up in cached prefixes, and what ends up in feedback loops. The public guides on this tend to mention “scope your retrieval to the tenant” and move on. That is a true sentence and a small fraction of the actual problem.

A few of the questions we are working through, mostly without clean answers:

If we cache a prefix that includes a tenant-specific summary, what is the lifetime of that cache and who can it be served to? Anthropic’s docs say cache entries are isolated between organisations, which is a property of their API. What about within our organisation, where the “tenants” are our customers? The KV cache lives in their infrastructure, but the prefix we send is built by us. If we get the prefix construction wrong – if a stale tenant context bleeds into a new request – we have invented a new way to break isolation.
Observability tools log prompts and completions. PII redaction is a real engineering project, not a checkbox. What we have learned: redact at the source, not in the dashboard. Anything that flows into a third-party trace store has to be assumed to be visible to whoever can read traces.
Evaluations are still a research problem at our scale. We have evals for individual prompts. We do not yet have a clean way to evaluate end-to-end behaviour across a representative slice of tenants, because tenants have very different data shapes and a regression in one tenant’s behaviour can be invisible in aggregate metrics. The Anthropic multi-agent research post recommends “end-state evaluation” rather than turn-by-turn analysis for stateful agents. We agree in principle. We have not built it.

What we are doing, in plain terms

To put this in one paragraph: we treat the LLM provider as an external dependency that is unstable, capacity-limited, and occasionally lies. We put queueing and budgeting between our application and that dependency. We hand-author tool schemas until we trust generation more than we trust ourselves. We design prompts so that the cache survives normal product changes. We treat tenant isolation as a property that has to hold not just at the database layer but at the prompt, cache, and observability layers too. We read postmortems from teams that are years ahead of us, because they are publishing the maps of the territory we are walking into.

None of this is novel. The work that maps cleanly onto traditional reliability engineering – quotas, backpressure, cache discipline, contract testing, observability with the right scopes – is the work that holds up. The work that is genuinely new, mostly around evaluation of stateful behaviour across tenants, is the work where we are still finding our footing. We expect to be wrong about some of this in twelve months.

What we are not pretending

We are a small team. We have never run anything at the scale of Anthropic, or Datadog, or LangChain, or Slack. The closest we have to that scale is +twelve years of multi-tenant SaaS in GFoundry, which gives us a lot of experience with the second-order problems that LLM features create – because they are mostly the same second-order problems that any new dependency creates – but very little direct experience with the specifics of operating LLMs at production scale. We are catching up by reading carefully.

The teams that are publishing detailed postmortems right now are doing the rest of us a real service. The Anthropic April 23 piece is the kind of writing that, if you read it as a small team about to ship a similar feature, can save you a quarter of work and a real incident. Datadog’s data on rate limit errors lets us argue for capacity engineering in our own roadmap with numbers we did not have to collect ourselves. The postmortems in the agent failure literature – schema drift, MCP startup tax, context poisoning – are warning shots we get to fire from someone else’s gun.

We owe it back. That is partly why this post exists. We are writing about what we are doing – including what we are not doing well – because the next small team trying to ship an LLM feature into a serious B2B product is going to hit the same gap we are in, and the more of us who write down what we found there, the smaller the gap gets.

If you are working on the same thing and have figured out something we have not – particularly around end-to-end evaluation across tenants, or around multi-provider failover without breaking the cache – we would genuinely like to hear from you. We are still in the part of the curve where the people building this are mostly figuring it out as they go, and the honest conversations are happening in postmortems rather than in marketing.

The agent works on Tuesday. We are trying to make sure it also works on the Tuesday after the model provider has a bad afternoon, the SDK ships a schema change, and one of our tenants decides to stress-test the metadata extraction pipeline. Most of that work is not glamorous. None of it is unique to us. Nearly all of it is the same engineering as before, applied to a dependency that behaves differently.

That, we think, is actually the good news.

Why your agent works on Tuesday: notes from a team building LLM features into a multi-tenant SaaS

The gap is not the model

Capacity is the noisy neighbour we did not expect

Schema drift is the new dependency hell

Caching is everything, and we still do not fully trust it

Tenant isolation in the context window

What we are doing, in plain terms

What we are not pretending

More posts

Managing HR from Your LLM of Choice – GFoundry’s MCP in Practice

The platform isn’t dying. It’s changing who it talks to.

ServiceNow Published 64 Pages on People Intelligence. One Line Buried Inside Changes Everything.

Why your agent works on Tuesday: notes from a team building LLM features into a multi-tenant SaaS