Multi-Tier LLM Routing: 68% Cost Cut, Sub-Second Latency, Zero Schema Failures

Early on, we sent every LLM request to GPT-4. It felt safe — GPT-4 is smart, handles ambiguity well, rarely fails. But at production scale with hundreds of concurrent sessions, the bill was unsustainable and latency was creeping past three seconds. We needed a smarter approach.

This is the routing layer we built to fix that.

The Problem with Uniform Model Selection

When you're prototyping, defaulting to the most capable model is fine. In production with real cost and latency constraints, it's a mistake. Not every request needs GPT-4's reasoning power.

Consider the spread of tasks in a typical AI agent pipeline:

Task	Complexity	Needs GPT-4?
Classify intent from a short message	Low	No
Confirm a customer greeting	Low	No
Extract structured order details from natural language	Medium	Sometimes
Reconcile ambiguous product names against a catalog	High	Yes
Generate a personalized campaign with LLM flyers	High	Yes
Format a product list into a WhatsApp message	Low	No

Running everything through GPT-4 meant paying GPT-4 prices for tasks a smaller model handles perfectly.

The Routing Architecture

We built a ModelRouter class that scores each request and selects a model:

import type { ZodSchema } from "zod";

type RoutingTier = "fast" | "balanced" | "powerful";
 
interface RoutingRequest {
  task: string;
  input: string;
  requiresStructuredOutput?: boolean;
  schema?: ZodSchema;
  maxTokens?: number;
}
 
class ModelRouter {
  // Public (but readonly) so callers can map a selected tier to its model name.
  readonly tiers: Record<RoutingTier, string> = {
    fast: "gpt-3.5-turbo",
    balanced: "gpt-4o-mini",
    powerful: "gpt-4o",
  };
 
  selectTier(req: RoutingRequest): RoutingTier {
    // Heuristic scoring
    let score = 0;
 
    // Complex tasks get bumped up
    const complexKeywords = [
      "reconcile", "disambiguate", "campaign", "segment", "analyze",
    ];
    if (complexKeywords.some((k) => req.task.toLowerCase().includes(k))) {
      score += 2;
    }
 
    // Long inputs signal complexity
    if (req.input.length > 500) score += 1;
 
    // Structured output with tight schemas needs reliability
    if (req.requiresStructuredOutput && req.schema) score += 1;
 
    if (score === 0) return "fast";
    if (score <= 2) return "balanced";
    return "powerful";
  }
}

This keeps the routing logic explicit and tunable. When we noticed a certain task class was failing on balanced, we bumped its weight—no rewrite needed.
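To make that kind of tuning concrete, the weights and per-task bumps can be pulled out into a config object. The following is a minimal sketch under assumptions, not the production router: `ScoringConfig`, `taskOverrides`, `scoreRequest`, and `tierForScore` are hypothetical names, and the default numbers simply mirror the heuristic above.

```typescript
type RoutingTier = "fast" | "balanced" | "powerful";

// Hypothetical config shape; the defaults mirror the heuristic shown above.
interface ScoringConfig {
  keywordWeight: number;
  longInputWeight: number;
  longInputChars: number;
  structuredWeight: number;
  // Extra weight per task, keyed by a substring of the task name.
  taskOverrides: Record<string, number>;
}

const defaultConfig: ScoringConfig = {
  keywordWeight: 2,
  longInputWeight: 1,
  longInputChars: 500,
  structuredWeight: 1,
  taskOverrides: {},
};

const complexKeywords = [
  "reconcile", "disambiguate", "campaign", "segment", "analyze",
];

function scoreRequest(
  task: string,
  input: string,
  structured: boolean,
  cfg: ScoringConfig = defaultConfig
): number {
  const t = task.toLowerCase();
  let score = 0;
  if (complexKeywords.some((k) => t.includes(k))) score += cfg.keywordWeight;
  if (input.length > cfg.longInputChars) score += cfg.longInputWeight;
  if (structured) score += cfg.structuredWeight;
  // Apply any per-task bump configured for this task class.
  for (const [name, extra] of Object.entries(cfg.taskOverrides)) {
    if (t.includes(name)) score += extra;
  }
  return score;
}

function tierForScore(score: number): RoutingTier {
  if (score <= 0) return "fast";
  if (score <= 2) return "balanced";
  return "powerful";
}
```

With this shape, bumping a struggling task class is a one-line config change, for example `taskOverrides: { "extract order": 2 }`, and the scoring function itself stays untouched.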

Structured Output Validation with Zod

The most failure-prone part of any LLM pipeline is parsing the output. Models hallucinate keys, change field names, return arrays when you expect objects. We enforced structured output with Zod schemas and a retry-with-feedback loop:

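import OpenAI from "openai";
import { z } from "zod";

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

const OrderExtractionSchema = z.object({
  product_name: z.string(),
  quantity: z.number().int().positive(),
  variant: z.string().optional(),
  delivery_address: z.string().optional(),
});

type OrderExtraction = z.infer<typeof OrderExtractionSchema>;

// Tiers in escalation order, cheapest first.
const TIER_ORDER: RoutingTier[] = ["fast", "balanced", "powerful"];

// Field description for the prompt. Serializing the schema's internal
// `_def` is neither stable nor readable, so the fields are spelled out.
const ORDER_FIELDS = `{
  "product_name": string,
  "quantity": positive integer,
  "variant": string (optional),
  "delivery_address": string (optional)
}`;

async function extractOrderDetails(
  message: string
): Promise<OrderExtraction> {
  const router = new ModelRouter();
  let tier = router.selectTier({
    task: "extract order details",
    input: message,
    requiresStructuredOutput: true,
    schema: OrderExtractionSchema,
  });

  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    {
      role: "system",
      content: `Extract order details as JSON with these fields:\n${ORDER_FIELDS}`,
    },
    { role: "user", content: message },
  ];

  let lastError: unknown;

  for (let attempt = 0; attempt < 3; attempt++) {
    const response = await openai.chat.completions.create({
      model: router.tiers[tier],
      messages,
      response_format: { type: "json_object" },
      temperature: 0,
    });

    const raw = response.choices[0].message.content ?? "{}";

    try {
      // JSON.parse throws on malformed JSON; parse() throws ZodError on a schema mismatch.
      return OrderExtractionSchema.parse(JSON.parse(raw));
    } catch (err) {
      lastError = err;
      const reason = err instanceof Error ? err.message : String(err);
      // Feed the failure back so the next attempt can correct it.
      messages.push(
        { role: "assistant", content: raw },
        { role: "user", content: `That output was invalid: ${reason}. Return corrected JSON only.` }
      );
      // Escalate one tier for the next attempt, if a higher tier exists.
      // (Recursing into extractOrderDetails would reselect the same tier.)
      const next = TIER_ORDER[TIER_ORDER.indexOf(tier) + 1];
      if (next) tier = next;
    }
  }

  throw new Error(
    `Failed to extract order details after 3 attempts: ${String(lastError)}`
  );
}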

The escalation pattern was key: if structured output validation fails, we retry with the next tier up. In practice this escalation happens on fewer than 2% of requests.
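The escalation step can also be factored out so other tasks reuse it. The sketch below is one possible shape, assuming the model call and the validator are injected as functions; `nextTier`, `withEscalation`, and `EscalationResult` are hypothetical names that do not appear in the original code. Note that this version escalates on any thrown error, including failures of the call itself, which may or may not be what you want.

```typescript
type RoutingTier = "fast" | "balanced" | "powerful";

// Tiers in escalation order, cheapest first.
const TIER_ORDER: readonly RoutingTier[] = ["fast", "balanced", "powerful"];

// Returns the next tier up, or null when already at the top.
function nextTier(tier: RoutingTier): RoutingTier | null {
  const i = TIER_ORDER.indexOf(tier);
  return TIER_ORDER[i + 1] ?? null;
}

interface EscalationResult<T> {
  value: T;
  tier: RoutingTier;   // tier that produced the valid output
  escalated: boolean;  // true if a higher tier than the starting one was needed
}

// `call` performs the model request for a tier and returns raw text.
// `validate` parses that text and throws if it does not match the schema.
async function withEscalation<T>(
  startTier: RoutingTier,
  call: (tier: RoutingTier) => Promise<string>,
  validate: (raw: string) => T
): Promise<EscalationResult<T>> {
  let tier: RoutingTier | null = startTier;
  let lastError: unknown = new Error("no attempts made");

  while (tier !== null) {
    try {
      const raw = await call(tier);
      return { value: validate(raw), tier, escalated: tier !== startTier };
    } catch (err) {
      lastError = err;
      tier = nextTier(tier); // null once the top tier has failed
    }
  }
  throw lastError;
}
```

The `escalated` flag in the result is the kind of signal worth logging per request, since the escalation rate is what tells you a task class is mis-tiered.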

Caching Repeated Inputs

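A surprising amount of traffic was identical. Customers in the same retailer's audience send the same "what's in stock?" query hundreds of times per day. We added a cache layer using Redis, keyed on a hash of the tier and the prompt. This is an exact-match cache rather than a semantic one: two requests share an entry only when their prompt text is identical.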

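import crypto from "crypto";
import Redis from "ioredis";

// Assumes the ioredis client, whose API exposes get() and setex().
// Connects to 127.0.0.1:6379 by default; configure for your environment.
const redis = new Redis();

// callModel(tier, prompt) is the application's own wrapper around the
// chat completion call and is defined elsewhere.

async function cachedCompletion(
  prompt: string,
  tier: RoutingTier,
  ttl = 300
): Promise<string> {
  // A TTL of 0 means "do not cache". SETEX rejects non-positive expiry
  // times, so skip Redis entirely in that case.
  if (ttl <= 0) return callModel(tier, prompt);

  const key = `llm:cache:${crypto
    .createHash("sha256")
    .update(`${tier}:${prompt}`)
    .digest("hex")}`;

  const cached = await redis.get(key);
  // Compare against null so a cached empty string still counts as a hit.
  if (cached !== null) return cached;

  const result = await callModel(tier, prompt);
  await redis.setex(key, ttl, result);
  return result;
}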

Cache TTL is 5 minutes for catalog queries (products don't change that fast) and 0, meaning the cache is bypassed entirely, for anything personalized or transactional.
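The TTL policy can be written down as data so it isn't scattered across call sites. The sketch below assumes three cache classes; `CacheClass`, `ttlFor`, and `normalizePrompt` are hypothetical names. The normalization step is an addition that the original code does not perform: it raises the hit rate of an exact-match cache by collapsing whitespace and case, and it only makes sense where those differences never change the meaning of the prompt.

```typescript
type CacheClass = "catalog" | "personalized" | "transactional";

// TTLs in seconds, following the policy described above. 0 means "never cache".
const TTL_SECONDS: Record<CacheClass, number> = {
  catalog: 300,
  personalized: 0,
  transactional: 0,
};

function ttlFor(cls: CacheClass): number {
  return TTL_SECONDS[cls];
}

// Assumption: for catalog-style queries, surrounding whitespace, repeated
// spaces, and letter case carry no meaning, so they can be normalized away
// before the prompt is hashed into a cache key.
function normalizePrompt(prompt: string): string {
  return prompt.trim().replace(/\s+/g, " ").toLowerCase();
}
```

A caller would then hash `normalizePrompt(prompt)` instead of the raw prompt and pass `ttlFor(cls)` as the TTL.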

Latency Numbers

After the routing layer went live, here's what we observed across a week of production traffic:

Route	Before (all GPT-4)	After
Intent classification	1.1s	0.28s
Product search formatting	0.9s	0.31s
Order extraction	1.4s	0.6s (balanced)
Campaign generation	2.2s	2.1s (powerful, unchanged)
p95 overall	3.4s	0.87s

Dropping p95 latency below one second changed the user experience noticeably. WhatsApp shows a "typing..." indicator while a reply is pending, and once responses arrived this quickly, users started perceiving the agent as genuinely instant.
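For readers reproducing this kind of measurement, a p95 figure can be computed from raw latency samples with the nearest-rank method. This is an assumed method shown for illustration; the post does not say how its percentiles were calculated, and `percentile` is a hypothetical helper.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: readonly number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");

  // Copy before sorting so the caller's array is left untouched.
  const sorted = samples.slice().sort((a, b) => a - b);

  // 1-based rank; multiply before dividing to avoid floating-point drift.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1]!;
}
```

Calling `percentile(latenciesMs, 95)` on a week of `latencyMs` values gives the number to compare before and after a routing change.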

Cost Impact

We haven't published exact numbers, but the routing change reduced our OpenAI spend by around 68% at equivalent request volume. Almost all of that came from moving intent classification and catalog formatting off GPT-4 entirely.

Observability: Know Which Tier Is Running What

Routing logic only compounds in complexity over time. Without visibility into which tier handled which request — and whether it succeeded — you're flying blind when something degrades.

We logged every routing decision alongside the outcome:

type RoutingLog = {
  requestId: string;
  task: string;
  selectedTier: RoutingTier;
  model: string;
  promptTokens: number;
  completionTokens: number;
  latencyMs: number;
  validationPassed: boolean;
  escalated: boolean;
};
 
async function logRoutingDecision(log: RoutingLog) {
  await db.routingLogs.insert({
    ...log,
    costUsd: estimateCost(log.model, log.promptTokens, log.completionTokens),
    timestamp: new Date(),
  });
}
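The `estimateCost` helper is not shown in the post. A minimal sketch, assuming a per-model price table expressed in USD per million tokens, might look like the following. The model names and prices are placeholders chosen for easy arithmetic, not real OpenAI pricing; substitute the current rates for the models you actually route to.

```typescript
interface ModelPrice {
  inputPerMTok: number;  // USD per 1M prompt tokens
  outputPerMTok: number; // USD per 1M completion tokens
}

// Placeholder models and prices for illustration only.
const PRICES: Record<string, ModelPrice> = {
  "example-small": { inputPerMTok: 1, outputPerMTok: 2 },
  "example-large": { inputPerMTok: 10, outputPerMTok: 30 },
};

function estimateCost(
  model: string,
  promptTokens: number,
  completionTokens: number
): number {
  const price = PRICES[model];
  // Fail loudly rather than logging a cost of 0 for an unpriced model.
  if (!price) throw new Error(`No price configured for model: ${model}`);
  return (
    (promptTokens * price.inputPerMTok +
      completionTokens * price.outputPerMTok) /
    1e6
  );
}
```

Throwing on an unknown model is a deliberate choice here: a silently missing price would make the daily cost dashboard under-report without any visible error.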

This gave us a dashboard where we could see, per task type: average tier used, escalation rate, p95 latency, and daily cost. When a new task class started escalating more than 10% of the time, we knew to either tune the scoring heuristic or permanently bump that task to balanced. Without this data, tuning the router would have been guesswork.
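The 10% escalation check reduces to a small aggregation over the routing logs. The sketch below runs in memory over already-fetched rows, which is an assumption; in practice this would more likely be a query against the log store. `LogRow`, `escalationRates`, and `tasksOverThreshold` are hypothetical names.

```typescript
// The subset of RoutingLog fields this aggregation needs.
interface LogRow {
  task: string;
  escalated: boolean;
}

// Fraction of requests per task that needed a higher tier than first selected.
function escalationRates(rows: readonly LogRow[]): Map<string, number> {
  const counts = new Map<string, { total: number; escalated: number }>();
  for (const row of rows) {
    const c = counts.get(row.task) ?? { total: 0, escalated: 0 };
    c.total += 1;
    if (row.escalated) c.escalated += 1;
    counts.set(row.task, c);
  }

  const rates = new Map<string, number>();
  counts.forEach((c, task) => rates.set(task, c.escalated / c.total));
  return rates;
}

// Tasks whose escalation rate is strictly above the threshold (default 10%).
function tasksOverThreshold(
  rows: readonly LogRow[],
  threshold = 0.1
): string[] {
  const flagged: string[] = [];
  escalationRates(rows).forEach((rate, task) => {
    if (rate > threshold) flagged.push(task);
  });
  return flagged;
}
```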

It also caught a silent failure mode early: one task type was hitting the powerful tier on 80% of requests because a scoring keyword matched unintentionally. A five-minute heuristic fix dropped that to 4%.
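A small fixture table is one way to catch this kind of unintended keyword match before it reaches production. The sketch below is illustrative only: `RouteFixture`, `findMisroutes`, and the example task names are hypothetical, and `heuristic` is a simplified copy of the scoring rules shown earlier, not the production router. The post does not say which keyword or task was involved.

```typescript
type RoutingTier = "fast" | "balanced" | "powerful";

interface RouteFixture {
  task: string;
  inputLength: number;
  structured: boolean;
  expected: RoutingTier; // the tier a human reviewer thinks is right
}

interface Misroute {
  task: string;
  expected: RoutingTier;
  actual: RoutingTier;
}

type Selector = (f: RouteFixture) => RoutingTier;

const keywords = ["reconcile", "disambiguate", "campaign", "segment", "analyze"];

// Simplified copy of the substring-based heuristic from earlier in the post.
const heuristic: Selector = (f) => {
  let score = 0;
  if (keywords.some((k) => f.task.toLowerCase().includes(k))) score += 2;
  if (f.inputLength > 500) score += 1;
  if (f.structured) score += 1;
  if (score === 0) return "fast";
  return score <= 2 ? "balanced" : "powerful";
};

// Returns every fixture whose selected tier differs from the expected one.
function findMisroutes(
  fixtures: readonly RouteFixture[],
  select: Selector = heuristic
): Misroute[] {
  const misroutes: Misroute[] = [];
  for (const f of fixtures) {
    const actual = select(f);
    if (actual !== f.expected) {
      misroutes.push({ task: f.task, expected: f.expected, actual });
    }
  }
  return misroutes;
}
```

As a hypothetical example, a fixture for a simple task named "list campaign names" with structured output and an expected tier of "fast" would be reported as routed to "powerful", because the substring "campaign" adds 2 and the structured-output rule adds 1.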

Key Takeaways

Not all tasks are created equal. A two-sentence intent classification does not belong on the same model as campaign generation.
Structured output validation is mandatory in production. LLMs will occasionally hallucinate schema fields. Zod + retry at higher tier catches this gracefully.
Cache aggressively for repeated non-personalized prompts. Catalog queries, product formatting templates, and FAQ responses are perfect cache candidates.
Make routing logic explicit and tunable. A score-based heuristic beats a black-box classifier here — you can trace exactly why a request hit a particular tier.
Log every routing decision. Without per-task-type observability, cost spikes and silent escalation loops are invisible until they become expensive.