Amish Kushwaha - May 25, 2026

LLM Routing Is Infrastructure, Not Application Logic

The Day-Two Production Problem

When you build your first LLM-powered feature, the code looks deceptively clean. You call an API, parse the response, and move on. The model is fast, the endpoint is stable, and the entire integration takes maybe fifty lines of code.

Then production happens.

OpenAI returns a 503 at 2 AM. Anthropic rate-limits your account during a morning traffic spike. A model you have tested extensively suddenly starts returning subtly malformed JSON that does not break your parser but quietly corrupts downstream state. Meanwhile, your largest customer connects from a region where one provider’s latency is 4x worse than the other’s. Then a competing provider cuts its pricing in half. You want to switch, but doing so means touching seventeen different service files.

None of these are edge cases. They are the standard operating conditions of production LLM systems.

The real issue is where the code handling these scenarios actually lives. In most current architectures, it lives everywhere and nowhere. It is scattered across application services, duplicated across teams, implemented inconsistently, and generally treated as an afterthought.

This operational mess is exactly why we’re building llmrouter.


The Hidden Complexity of LLM Integrations

Every engineering team building on top of LLM APIs hits the same wall. The initial setup is fast, but the real engineering begins when you try to make it resilient.

A production-grade LLM integration requires handling several non-trivial operational concerns:

  • Provider Management: Relying on a single provider is a single point of failure. Pricing changes, localized outages, and regional performance variances force you into multi-provider setups. This means juggling multiple SDKs, authentication mechanisms, and API quirks.
  • Granular Retry Logic: LLM APIs frequently throw transient errors. A 429 (rate limited) or a 503 (service unavailable) means you should back off and try again. However, a 400 (bad request) or 401 (unauthorized) means your request or credentials are fundamentally broken. Misclassifying these errors either burns tokens on useless retries or drops recoverable requests.
  • Fallback Strategies: When a primary provider degrades, traffic must shift elsewhere immediately. Your system needs to know which alternative models are structurally equivalent enough to substitute, without leaking that operational logic into your core business services.
  • Timeouts and Circuit Breaking: A hanging LLM request that takes 90 seconds is often worse than an immediate failure because it holds resources open and degrades user experience. You need circuit breakers to prevent a failing provider from continuously swallowing traffic.
  • Structured Output Stability: Models regress on JSON formatting far more often than providers admit. Parsing, validating, and recovering from malformed output is its own layer of defensive engineering.
  • Observability: Debugging becomes pure guesswork if you cannot instantly see which provider handled a specific request, the latency breakdown, whether a retry occurred, or if a fallback was triggered.

None of this is domain logic. None of it belongs in your application’s business layer, yet that is exactly where it lands when there is no dedicated infrastructure to handle it.

Why Abstraction Alone Is Not Enough

The immediate reaction to this complexity is to build an abstraction layer. Developers wrap provider SDKs behind a unified interface to standardize the request and response schemas.

While necessary, this approach is incomplete.

A unified interface solves the syntactic problem (writing the same code regardless of the underlying vendor) but ignores the operational problem. It does not determine which provider to call in a given moment, how to recover when that provider fails, or how to enforce global execution policies. An interface simply states that all providers look the same to the caller; it says nothing about runtime state, latency, or degradation.

You do not just need an interface abstraction. You need a runtime environment: a dedicated layer that accepts an intent, applies execution policies, selects the optimal provider, handles failures gracefully, and returns the result. The application code should only express what it wants, leaving the how to the infrastructure.

This is the core difference between a passive facade and an active orchestration layer. The facade simplifies the API, but the orchestration layer owns the execution.


From Model Calls to Intent Specifications

To see the difference, look at how tightly coupled application code becomes without this separation of concerns:

// Without llmrouter: provider coupling baked into application logic
func classifyDocument(ctx context.Context, doc string) (string, error) {
    client := openai.NewClient(os.Getenv("OPENAI_API_KEY"))
 
    var lastErr error
    for attempt := 0; attempt < 3; attempt++ {
        if attempt > 0 {
            time.Sleep(time.Duration(attempt) * 2 * time.Second)
        }
 
        resp, err := client.Chat.Completions.New(ctx, openai.ChatCompletionNewParams{
            Model: "gpt-4o-mini",
            Messages: []openai.ChatCompletionMessageParamUnion{
                openai.UserMessage("Classify this document: " + doc),
            },
        })
        if err != nil {
            var apiErr *openai.Error
            if errors.As(err, &apiErr) && apiErr.StatusCode == 429 {
                lastErr = err
                continue // rate limited, retry
            }
            return "", err // non-retryable, bail
        }
 
        return resp.Choices[0].Message.Content, nil
    }
    return "", fmt.Errorf("max retries exceeded: %w", lastErr)
}

This single function is forced to manage provider selection, authentication, retry counting, error classification, backoff timing, and response extraction. If you need to switch to an alternative model or adjust a timeout, you have to modify this application file directly.

Here is that exact same intent handled via llmrouter:

// With llmrouter: intent at the call site, policy at the infrastructure layer
func classifyDocument(ctx context.Context, doc string) (string, error) {
    resp, err := router.Complete(ctx, &llmrouter.Request{
        Model: "gpt-4o-mini",
        Messages: []llmrouter.Message{
            {Role: llmrouter.RoleUser, Content: "Classify this document: " + doc},
        },
    })
    if err != nil {
        return "", err
    }
    return resp.Choices[0].Message.Content, nil
}

The call site now focuses purely on the business requirement: obtaining a completion from a model capable of handling this workload. The underlying execution policies are configured once at application startup, using cleanly decoupled middleware definitions:

cb := middleware.NewCircuitBreaker(5, 30*time.Second)
router := llmrouter.New(
    llmrouter.WithProvider("openai", openai.NewFromEnv("openai", "OPENAI_API_KEY")),
    llmrouter.WithProvider("anthropic", anthropic.NewFromEnv()),
    llmrouter.WithModelMapping("gpt-4o-mini", "openai"),
    llmrouter.WithFallback("anthropic"),
    llmrouter.WithMiddleware(
        middleware.Retry(3, time.Second),
        cb.Wrap,
        middleware.Timeout(30*time.Second),
    ),
)

Modifying retry thresholds, introducing a fallback provider, or remapping a model tier now happens globally in a single initialization file.


What is llmrouter

llmrouter is a lightweight Go library designed to sit cleanly between your application logic and your provider SDKs. It isolates the operational mechanics so your core services do not have to deal with them.

The library handles three core functions:

  1. Deterministic Model Resolution: You register your providers at startup. When a request specifies a model name, the router resolves the destination using a strict three-step precedence: explicit model mapping, direct provider name matching, and finally an ordered scan of provider capability lists. This resolution order is completely deterministic and predictable, removing any magical or surprising runtime behavior.
  2. Composable Middleware Pipelines: Cross-cutting concerns like retries, timeouts, and circuit breaking are implemented as standard middleware wrappers around providers. The middleware chain is constructed at request time, making your execution policies highly testable and entirely decoupled from vendor implementations.
  3. Native Fallbacks: If a primary vendor fails, the router steps through your configured fallback providers sequentially. This resilience logic is entirely encapsulated within the router infrastructure.

Out of the box, llmrouter supports OpenAI, Anthropic, and Google Gemini natively. For OpenAI-compatible endpoints (including DeepSeek, Groq, Together AI, Ollama, and Sarvam), a unified provider implementation handles them via a simple Presets map. Adding a new compatible service requires just a single line of configuration.
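To make the three-step precedence from item 1 concrete, here is a minimal sketch of how such a resolver could work. It is not llmrouter's actual implementation: the registry, provider, and resolve names are hypothetical, "shared-model" is a made-up model name, and reading "direct provider name matching" as "the model string equals a registered provider name" is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// provider is a hypothetical stand-in for a registered provider.
type provider struct {
	name   string
	models []string // models the provider advertises
}

// registry keeps providers in registration order so the final scan is stable.
type registry struct {
	providers []provider
	mappings  map[string]string // explicit model name -> provider name
}

var errNoProvider = errors.New("no provider registered for model")

// byName returns the provider registered under name, if any.
func (r *registry) byName(name string) (provider, bool) {
	for _, p := range r.providers {
		if p.name == name {
			return p, true
		}
	}
	return provider{}, false
}

// resolve applies the three steps in order and returns the provider name.
func (r *registry) resolve(model string) (string, error) {
	// Step 1: an explicit mapping wins (skipped if its target is not registered).
	if target, ok := r.mappings[model]; ok {
		if p, found := r.byName(target); found {
			return p.name, nil
		}
	}
	// Step 2: the model string is itself a provider name.
	if p, found := r.byName(model); found {
		return p.name, nil
	}
	// Step 3: first provider, in registration order, that lists the model.
	for _, p := range r.providers {
		for _, m := range p.models {
			if m == model {
				return p.name, nil
			}
		}
	}
	return "", fmt.Errorf("%w: %q", errNoProvider, model)
}

func main() {
	r := &registry{
		providers: []provider{
			{name: "openai", models: []string{"gpt-4o-mini", "shared-model"}},
			{name: "anthropic", models: []string{"shared-model"}},
		},
		mappings: map[string]string{"gpt-4o-mini": "anthropic"},
	}
	for _, model := range []string{"gpt-4o-mini", "anthropic", "shared-model", "unknown"} {
		name, err := r.resolve(model)
		fmt.Printf("%-12s -> %q (err: %v)\n", model, name, err)
	}
}
```

Because providers live in an ordered slice rather than a map, the step-three scan gives the same answer on every run, which is the property the "deterministic" claim depends on.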

The library is purposefully minimal. It does not manage your prompts, generate code, or try to abstract away unique provider features. It simply provides a stable, resilient execution layer and gets out of your way.


Architectural Principles

The design of llmrouter adheres to a few core software engineering principles:

  • Middleware as a Native Function Primitive: In llmrouter, middleware is a plain function type, MiddlewareFunc func(Provider) Provider, rather than a rigid interface. Writing custom middleware means writing a plain function. Composability comes naturally from Go’s type system, leveraging a pattern immediately familiar to anyone who has written HTTP middleware in Go.
  • Explicit Resolution Over Magic: When a request asks for a specific model, you should be able to look at the config and know exactly where it will land. llmrouter completely avoids dynamic scoring or implicit priority weighting.
  • Thread Safety Is Non-Negotiable: Because the router manages provider and middleware state across concurrent goroutines, thread safety is treated as a core correctness requirement. The library uses a sync.RWMutex throughout to allow concurrent reads while keeping writes exclusive.
  • Clean Interface Contracts: The core Provider contract exposes four essential methods: Name, Models, Complete, and Stream. Everything else is treated as an implementation detail. The interface is small enough that middleware can wrap it without structural friction, yet clear enough that new providers do not require bloated boilerplate.
  • Type Assertions for Optional Capabilities: Not every model or provider supports native tool calling. Instead of polluting the primary Provider interface with methods like SupportsTools() bool, llmrouter defines an optional, separate ToolsProvider interface checked via Go’s native type assertions. This keeps the core footprint small and avoids forcing basic providers to implement methods they cannot support.
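The function-typed middleware and the optional-capability check can be illustrated together. What follows is a sketch under stated assumptions, not llmrouter's source: only the MiddlewareFunc signature and the ToolsProvider name come from the text above. The trimmed Provider interface, the CompleteWithTools method, Chain, Label, and the echo provider are all hypothetical, and treating the first middleware listed as the outermost layer is a choice made here for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// Provider is a trimmed, hypothetical version of the contract (Models and Stream omitted).
type Provider interface {
	Name() string
	Complete(ctx context.Context, prompt string) (string, error)
}

// MiddlewareFunc is the signature quoted above: take a Provider, return a Provider.
type MiddlewareFunc func(Provider) Provider

// ToolsProvider is an optional capability, discovered with a type assertion.
type ToolsProvider interface {
	Provider
	CompleteWithTools(ctx context.Context, prompt string, tools []string) (string, error)
}

// wrapper overrides Complete while reusing the inner provider's Name.
type wrapper struct {
	Provider
	complete func(ctx context.Context, prompt string) (string, error)
}

func (w wrapper) Complete(ctx context.Context, prompt string) (string, error) {
	return w.complete(ctx, prompt)
}

// Chain applies middleware so that the first one listed is the outermost layer.
func Chain(p Provider, mws ...MiddlewareFunc) Provider {
	for i := len(mws) - 1; i >= 0; i-- {
		p = mws[i](p)
	}
	return p
}

// Label is a toy middleware that marks the result so the call order is visible.
func Label(tag string) MiddlewareFunc {
	return func(next Provider) Provider {
		return wrapper{Provider: next, complete: func(ctx context.Context, prompt string) (string, error) {
			out, err := next.Complete(ctx, prompt)
			if err != nil {
				return "", err
			}
			return tag + "(" + out + ")", nil
		}}
	}
}

// echo is a fake provider that also implements the optional tools capability.
type echo struct{}

func (echo) Name() string { return "echo" }
func (echo) Complete(_ context.Context, prompt string) (string, error) {
	return prompt, nil
}
func (echo) CompleteWithTools(_ context.Context, prompt string, _ []string) (string, error) {
	return prompt, nil
}

func main() {
	var base Provider = echo{}
	p := Chain(base, Label("retry"), Label("timeout"))

	out, _ := p.Complete(context.Background(), "hi")
	fmt.Println(out) // retry(timeout(hi))

	_, ok := base.(ToolsProvider)
	fmt.Println("base supports tools:", ok) // true
	_, ok = p.(ToolsProvider)
	fmt.Println("wrapped supports tools:", ok) // false
}
```

The last line shows a trade-off of this pattern: a wrapper that embeds only the core interface hides optional methods, so middleware has to forward the assertion deliberately if the capability should survive wrapping.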

Designing for Real-World Failures

The error classification mechanism within llmrouter is designed to fix a major flaw found in most home-grown retry loops: blind retrying.

Running a retry loop on a 400 Bad Request or a 403 Forbidden is a waste of execution time and resources. llmrouter.IsRetryable(err) explicitly filters for transient failure codes: 429 (rate limits), 500, 502, 503, and 504. The retry middleware evaluates this classification before deciding whether to back off or immediately surface the error to the application layer.
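A classification-aware retry loop along these lines might look like the sketch below. The status codes match the list above, but StatusError, isRetryable, and retry are hypothetical names written for this example, not llmrouter's API. Treating errors that carry no status code as permanent is a policy choice made here, not something the library is known to do.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"time"
)

// StatusError is a hypothetical error type carrying the provider's HTTP status.
type StatusError struct{ Code int }

func (e *StatusError) Error() string {
	return fmt.Sprintf("provider returned HTTP %d", e.Code)
}

// isRetryable reports whether err wraps one of the transient status codes.
// Errors without a status are treated as permanent in this sketch.
func isRetryable(err error) bool {
	var se *StatusError
	if !errors.As(err, &se) {
		return false
	}
	switch se.Code {
	case http.StatusTooManyRequests, // 429
		http.StatusInternalServerError, // 500
		http.StatusBadGateway,          // 502
		http.StatusServiceUnavailable,  // 503
		http.StatusGatewayTimeout:      // 504
		return true
	}
	return false
}

// retry runs fn up to attempts times with exponential backoff, returning
// at once on success or on a non-retryable error.
func retry(ctx context.Context, attempts int, base time.Duration, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !isRetryable(err) {
			return err
		}
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(base << i): // base, 2*base, 4*base, ...
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("max retries exceeded: %w", err)
}

func main() {
	for _, code := range []int{400, 503} {
		calls := 0
		err := retry(context.Background(), 3, 10*time.Millisecond, func() error {
			calls++
			return &StatusError{Code: code}
		})
		fmt.Printf("HTTP %d: %d call(s), err: %v\n", code, calls, err)
	}
}
```

With these inputs the 400 is surfaced after a single call, while the 503 is attempted three times before the loop gives up. The backoff also waits on ctx.Done(), so a cancelled request does not sit in a sleep.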

The circuit breaker adds an essential defensive boundary. If a provider fails repeatedly, the breaker trips open after a configured threshold of consecutive errors (set to 5 in our initialization example). Subsequent requests fail fast with ErrCircuitOpen rather than hanging and consuming backend resources, giving the failing provider a quiet recovery window.
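The consecutive-failure behavior described here can be sketched in a few dozen lines. This is an illustrative breaker, not llmrouter's middleware: only the ErrCircuitOpen name and the threshold-plus-cooldown shape come from the text. The breaker type, its Do method, and the rule that the first call after the cooldown acts as a trial are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// ErrCircuitOpen mirrors the sentinel named above; everything else is hypothetical.
var ErrCircuitOpen = errors.New("circuit breaker is open")

var errProviderDown = errors.New("provider down")

// breaker opens after threshold consecutive failures and stays open for cooldown.
type breaker struct {
	mu        sync.Mutex
	threshold int
	cooldown  time.Duration
	failures  int       // consecutive failures so far
	openedAt  time.Time // when the breaker last tripped
}

func newBreaker(threshold int, cooldown time.Duration) *breaker {
	return &breaker{threshold: threshold, cooldown: cooldown}
}

// Do runs fn unless the breaker is open, in which case it fails fast.
func (b *breaker) Do(fn func() error) error {
	b.mu.Lock()
	open := b.failures >= b.threshold && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return ErrCircuitOpen
	}

	err := fn() // call the provider without holding the lock

	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0 // any success closes the breaker
		return nil
	}
	b.failures++
	if b.failures >= b.threshold {
		b.openedAt = time.Now() // trip, or re-trip after a failed trial call
	}
	return err
}

func main() {
	b := newBreaker(2, 50*time.Millisecond)
	fail := func() error { return errProviderDown }

	fmt.Println(b.Do(fail)) // provider down
	fmt.Println(b.Do(fail)) // provider down (the breaker trips here)
	fmt.Println(b.Do(fail)) // circuit breaker is open

	time.Sleep(60 * time.Millisecond)              // wait out the cooldown
	fmt.Println(b.Do(func() error { return nil })) // <nil>
}
```

One simplification to be aware of: once the cooldown elapses, every concurrent caller is let through until one of them fails again. A stricter half-open state would admit a single trial request at a time.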

When OpenAI experiences an infrastructure blip, your backend won’t lock up or drop traffic. The circuit opens, the fallback triggers, and your users seamlessly receive responses generated by Anthropic while your team monitors the primary incident.
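The fallback step in that scenario amounts to trying providers in a fixed order and recording which one answered. The sketch below assumes a trimmed provider contract and hypothetical names (completer, stub, completeWithFallback). It falls back on any error, which may be broader than what the library does, and it relies on errors.Join, which requires Go 1.20 or later.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// completer is a hypothetical, trimmed provider contract.
type completer interface {
	Name() string
	Complete(ctx context.Context, prompt string) (string, error)
}

// stub is a fake provider that either fails with err or echoes the prompt.
type stub struct {
	name string
	err  error
}

func (s stub) Name() string { return s.name }

func (s stub) Complete(_ context.Context, prompt string) (string, error) {
	if s.err != nil {
		return "", s.err
	}
	return "[" + s.name + "] " + prompt, nil
}

var errUnavailable = errors.New("service unavailable")

// completeWithFallback tries each provider in order and returns the first
// successful response together with the name of the provider that served it.
// If every provider fails, the individual errors are joined into one.
func completeWithFallback(ctx context.Context, prompt string, providers ...completer) (out, servedBy string, err error) {
	var errs []error
	for _, p := range providers {
		out, err = p.Complete(ctx, prompt)
		if err == nil {
			return out, p.Name(), nil
		}
		errs = append(errs, fmt.Errorf("%s: %w", p.Name(), err))
	}
	if len(errs) == 0 {
		return "", "", errors.New("no providers configured")
	}
	return "", "", errors.Join(errs...)
}

func main() {
	primary := stub{name: "openai", err: errUnavailable}
	secondary := stub{name: "anthropic"}

	out, servedBy, err := completeWithFallback(context.Background(), "hello", primary, secondary)
	fmt.Println(out)                    // [anthropic] hello
	fmt.Println("served by:", servedBy) // served by: anthropic
	fmt.Println("error:", err)          // error: <nil>
}
```

Returning the serving provider's name alongside the response is what makes the observability questions raised earlier answerable: a log line or metric can record that a fallback occurred and which provider handled the request.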


Decentralizing the Orchestration Layer

The AI engineering ecosystem is still formalizing its core infrastructure patterns. The current default strategy of allowing application logic to dictate provider routing and custom retry loops creates long-term maintenance debt. Teams copy-paste slightly different variations of retry logic across downstream services, fallback mechanisms are written reactively during an active outage, and observability remains fragmented.

The cleaner alternative is to treat LLM routing exactly how we treat database connection pooling, HTTP clients, or service meshes. These are infrastructure concerns that should live at the infrastructure layer, managed through centralized configuration, completely transparent to the application code.

llmrouter provides this architecture as a clean, compiled library rather than an external network service. There are no secondary network hops, no extra infrastructure components to deploy, and no SaaS subscriptions. It compiles directly into your Go binary.


Final Thoughts

Deciding which model should fulfill a request sounds straightforward on paper. In production, that choice is tightly bound to real-time vendor availability, pricing tiers, localized latencies, compliance rules, and operational risk.

Your application code shouldn’t have to evaluate these variables on every request. It should simply declare its intent and hand execution over to a layer built to fulfill it reliably.

Decoupling the application from the underlying provider makes your code cleaner, ensures your reliability policies remain consistent, and allows your entire stack to adapt smoothly as the model landscape evolves.

llmrouter is open source, written in Go, and hosted at github.com/bluefunda/llmrouter. Issues, contributions, and architectural feedback are always welcome.

