AI / Retail
May 8, 2025

Multi-Tier LLM Routing: GPT-4 vs GPT-3.5, Structured Output Validation, Sub-Second Latency

How we built a model routing layer at Afto that sends each request to GPT-4 or GPT-3.5 based on task complexity, validates structured outputs with Zod, and keeps p95 latency under one second while cutting OpenAI spend by roughly 68%.

Tags: LLM Routing, GPT-4, GPT-3.5, Structured Output, Zod, Cost Optimization, AI

Early in Afto's development, we sent every LLM request to GPT-4. It felt safe—GPT-4 is smart, handles ambiguity well, rarely fails. But at production scale with hundreds of concurrent WhatsApp sessions per retailer, the bill was unsustainable and latency was creeping past three seconds. We needed a smarter approach.

This is the routing layer we built to fix that.

The Problem with Uniform Model Selection

When you're prototyping, defaulting to the most capable model is fine. In production with real cost and latency constraints, it's a mistake. Not every request needs GPT-4's reasoning power.

Consider the spread of tasks in Afto's pipeline:

| Task | Complexity | Needs GPT-4? |
| --- | --- | --- |
| Classify intent from a short message | Low | No |
| Confirm a customer greeting | Low | No |
| Extract structured order details from natural language | Medium | Sometimes |
| Reconcile ambiguous product names against a catalog | High | Yes |
| Generate a personalized campaign with LLM flyers | High | Yes |
| Format a product list into a WhatsApp message | Low | No |

Running everything through GPT-4 meant paying GPT-4 prices for tasks a smaller model handles perfectly.

The Routing Architecture

We built a ModelRouter class that scores each request and selects a model:

import type { ZodSchema } from "zod";

type RoutingTier = "fast" | "balanced" | "powerful";
 
interface RoutingRequest {
  task: string;
  input: string;
  requiresStructuredOutput?: boolean;
  schema?: ZodSchema;
  maxTokens?: number;
}
 
class ModelRouter {
  // Public (not private) so callers can map a selected tier to its model name.
  readonly tiers: Record<RoutingTier, string> = {
    fast: "gpt-3.5-turbo",
    balanced: "gpt-4o-mini",
    powerful: "gpt-4o",
  };
 
  selectTier(req: RoutingRequest): RoutingTier {
    // Heuristic scoring
    let score = 0;
 
    // Complex tasks get bumped up
    const complexKeywords = [
      "reconcile", "disambiguate", "campaign", "segment", "analyze",
    ];
    if (complexKeywords.some((k) => req.task.toLowerCase().includes(k))) {
      score += 2;
    }
 
    // Long inputs signal complexity
    if (req.input.length > 500) score += 1;
 
    // Structured output with tight schemas needs reliability
    if (req.requiresStructuredOutput && req.schema) score += 1;
 
    if (score === 0) return "fast";
    if (score <= 2) return "balanced";
    return "powerful";
  }
}

This keeps the routing logic explicit and tunable. When we noticed a task class failing on the balanced tier, we raised its weight; no rewrite needed.
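To make "tunable" concrete, here is a minimal sketch of the same heuristic expressed as a list of weighted rules, so that changing a weight is a one-line edit and every decision carries its reasons. The `ScoringRule` and `explainTier` names are hypothetical and not part of Afto's codebase; the thresholds are the ones from the router above.

```typescript
type RoutingTier = "fast" | "balanced" | "powerful";

interface Req {
  task: string;
  input: string;
  structured?: boolean;
}

// Hypothetical rule shape: a label, a weight, and a predicate.
interface ScoringRule {
  name: string;
  weight: number;
  matches: (req: Req) => boolean;
}

const COMPLEX_KEYWORDS = ["reconcile", "disambiguate", "campaign", "segment", "analyze"];

const RULES: ScoringRule[] = [
  {
    name: "complex-task keyword",
    weight: 2,
    matches: (req) => COMPLEX_KEYWORDS.some((k) => req.task.toLowerCase().includes(k)),
  },
  { name: "long input", weight: 1, matches: (req) => req.input.length > 500 },
  { name: "structured output", weight: 1, matches: (req) => req.structured === true },
];

// Returns the tier together with the score and the rules that fired.
function explainTier(
  req: Req,
  rules: ScoringRule[] = RULES
): { tier: RoutingTier; score: number; reasons: string[] } {
  const fired = rules.filter((rule) => rule.matches(req));
  const score = fired.reduce((sum, rule) => sum + rule.weight, 0);
  const tier: RoutingTier = score === 0 ? "fast" : score <= 2 ? "balanced" : "powerful";
  return { tier, score, reasons: fired.map((rule) => `${rule.name} (+${rule.weight})`) };
}
```

Logging `reasons` next to each request is one way to see why a request landed on a given tier before adjusting a weight.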

Structured Output Validation with Zod

The most failure-prone part of any LLM pipeline is parsing the output. Models hallucinate keys, change field names, return arrays when you expect objects. We enforced structured output with Zod schemas and a retry-with-feedback loop:

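import OpenAI from "openai";
import { z } from "zod";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const OrderExtractionSchema = z.object({
  product_name: z.string(),
  quantity: z.number().int().positive(),
  variant: z.string().optional(),
  delivery_address: z.string().optional(),
});

type OrderExtraction = z.infer<typeof OrderExtractionSchema>;

// Tiers in escalation order, cheapest first.
const TIER_ORDER: RoutingTier[] = ["fast", "balanced", "powerful"];

async function extractOrderDetails(
  message: string
): Promise<OrderExtraction> {
  const router = new ModelRouter();
  let tier = router.selectTier({
    task: "extract order details",
    input: message,
    requiresStructuredOutput: true,
    schema: OrderExtractionSchema,
  });
  let feedback: string | undefined; // validation error from the previous attempt

  for (let attempt = 0; attempt < 3; attempt++) {
    const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
      {
        role: "system",
        // Describe the fields explicitly instead of serializing Zod's internal _def.
        content:
          "Extract order details as a JSON object with these fields: " +
          "product_name (string), quantity (positive integer), " +
          "variant (optional string), delivery_address (optional string).",
      },
      { role: "user", content: message },
    ];
    if (feedback) {
      // Tell the model what was wrong with its previous reply.
      messages.push({
        role: "user",
        content: `The previous reply failed validation: ${feedback}. Return corrected JSON only.`,
      });
    }

    const response = await openai.chat.completions.create({
      model: router.tiers[tier],
      messages,
      response_format: { type: "json_object" },
      temperature: 0,
    });

    const raw = response.choices[0].message.content ?? "{}";

    try {
      // JSON.parse throws on malformed JSON; parse() throws ZodError on a schema mismatch.
      return OrderExtractionSchema.parse(JSON.parse(raw));
    } catch (err) {
      if (attempt === 2) throw err;
      feedback = err instanceof Error ? err.message : String(err);
      // Escalate one tier for the next attempt; stay put if already at the top.
      const next = TIER_ORDER.indexOf(tier) + 1;
      if (next < TIER_ORDER.length) tier = TIER_ORDER[next];
    }
  }

  throw new Error("Failed to extract order details after 3 attempts");
}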

The escalation pattern was key: if structured output validation fails, we retry with the next tier up. In practice this escalation happens on fewer than 2% of requests.
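The escalation pattern is not specific to order extraction. As a rough sketch, it can be factored into a helper that takes any model call and any validator; `withEscalation` and `nextTier` are hypothetical names introduced here for illustration, and the validator could be a Zod `parse` or anything else that throws on bad output.

```typescript
type RoutingTier = "fast" | "balanced" | "powerful";

// Tiers in escalation order, cheapest first.
const TIER_ORDER: readonly RoutingTier[] = ["fast", "balanced", "powerful"];

// Returns the next tier up, or the same tier if already at the top.
function nextTier(tier: RoutingTier): RoutingTier {
  const index = TIER_ORDER.indexOf(tier);
  return TIER_ORDER[Math.min(index + 1, TIER_ORDER.length - 1)];
}

// Calls the model, validates the raw reply, and on failure retries one tier up,
// passing the validation error back so the caller can include it in the prompt.
async function withEscalation<T>(
  startTier: RoutingTier,
  call: (tier: RoutingTier, feedback?: string) => Promise<string>,
  validate: (raw: string) => T,
  maxAttempts = 3
): Promise<{ value: T; tier: RoutingTier; attempts: number }> {
  let tier = startTier;
  let feedback: string | undefined;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await call(tier, feedback);
    try {
      return { value: validate(raw), tier, attempts: attempt };
    } catch (err) {
      if (attempt === maxAttempts) throw err;
      feedback = err instanceof Error ? err.message : String(err);
      tier = nextTier(tier);
    }
  }

  throw new Error("maxAttempts must be at least 1");
}
```

Returning the final tier and attempt count makes it straightforward to measure how often escalation actually happens.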

Caching Repeated Inputs

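A surprising amount of traffic was effectively identical. Customers in the same retailer's audience send the same "what's in stock?" query hundreds of times per day. We added a cache layer in Redis keyed on a SHA-256 hash of the tier and prompt. Because the key is a hash of the exact prompt string, this is an exact-match cache rather than a semantic one: two prompts that differ by even one character will not share an entry.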

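import { createHash } from "node:crypto";

// Assumed to exist elsewhere in the app: `redis` (an ioredis-style client
// exposing get/setex) and `callModel` (sends the prompt to the tier's model).
async function cachedCompletion(
  prompt: string,
  tier: RoutingTier,
  ttl = 300 // seconds; pass 0 to bypass the cache
): Promise<string> {
  // A TTL of 0 means "never cache"; SETEX also rejects a zero expiry.
  if (ttl <= 0) return callModel(tier, prompt);

  const key = `llm:cache:${createHash("sha256")
    .update(`${tier}:${prompt}`)
    .digest("hex")}`;

  const cached = await redis.get(key);
  if (cached !== null) return cached; // an empty string is still a valid hit

  const result = await callModel(tier, prompt);
  await redis.setex(key, ttl, result);
  return result;
}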

Cache TTL is 5 minutes for catalog queries (products don't change that fast) and 0 for anything personalized or transactional.
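The TTL policy can be sketched independently of Redis. The example below is an illustration only: it uses an in-memory map in place of Redis, and the `CacheCategory` names, `TtlCache` class, and `normalizePrompt` helper are assumptions rather than Afto's code. Normalizing the prompt before building the cache key is one optional way to let trivially different phrasings share an entry; the original prompt would still be what gets sent to the model.

```typescript
type CacheCategory = "catalog" | "personalized" | "transactional";

// TTLs in seconds, following the policy described above: 5 minutes for
// catalog queries, no caching for personalized or transactional requests.
const TTL_SECONDS: Record<CacheCategory, number> = {
  catalog: 300,
  personalized: 0,
  transactional: 0,
};

// Collapses case and whitespace so near-identical prompts map to one key.
function normalizePrompt(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

// In-memory stand-in for Redis with per-entry expiry.
class TtlCache {
  private readonly store = new Map<string, { value: string; expiresAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now; // injectable clock, in milliseconds
  }

  get(key: string): string | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: string, category: CacheCategory): void {
    const ttl = TTL_SECONDS[category];
    if (ttl <= 0) return; // a TTL of 0 means the entry is never stored
    this.store.set(key, { value, expiresAt: this.now() + ttl * 1000 });
  }
}
```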

Latency Numbers

After the routing layer went live, here's what we observed across a week of production traffic:

| Route | Before (all GPT-4) | After |
| --- | --- | --- |
| Intent classification | 1.1s | 0.28s |
| Product search formatting | 0.9s | 0.31s |
| Order extraction | 1.4s | 0.6s (balanced) |
| Campaign generation | 2.2s | 2.1s (powerful, unchanged) |
| p95 overall | 3.4s | 0.87s |

The p95 latency dropping to under one second changed the user experience noticeably. WhatsApp has an implicit "typing..." indicator and users started perceiving the agent as genuinely instant.
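For readers who want to track the same metric, here is a small sketch of a p95 calculation using the nearest-rank method. The post does not say which percentile method produced the numbers above, so treat the method as an assumption; `percentile` is a hypothetical helper.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");

  const sorted = [...samples].sort((a, b) => a - b); // numeric sort on a copy
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[rank - 1];
}
```

Calling `percentile(latenciesMs, 95)` over a window of request latencies gives the value that 95% of requests stayed at or under.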

Cost Impact

We haven't published exact figures, but the routing change reduced our OpenAI spend by around 68% at equivalent request volume. Almost all of that came from moving intent classification and catalog formatting off GPT-4 entirely.
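The arithmetic behind a figure like this is a weighted average over routes. The sketch below shows the calculation only; the `RouteCost` shape is hypothetical, and any numbers fed into it would be your own traffic shares and per-request costs, not Afto's.

```typescript
// One entry per route. `share` is the fraction of total requests on that route;
// costs are per request in any consistent unit.
interface RouteCost {
  share: number;
  costBefore: number;
  costAfter: number;
}

// Fraction of spend saved at equal request volume: 1 - (cost after / cost before).
function savingsFraction(routes: RouteCost[]): number {
  const before = routes.reduce((sum, r) => sum + r.share * r.costBefore, 0);
  const after = routes.reduce((sum, r) => sum + r.share * r.costAfter, 0);
  if (before <= 0) throw new Error("baseline cost must be positive");
  return 1 - after / before;
}
```

Because the result is weighted by share, moving a high-volume, low-complexity route to a cheaper tier moves the total far more than optimizing a rare, expensive one.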

Key Takeaways

  • Not all tasks are created equal. A two-sentence intent classification does not belong on the same model as campaign generation.
  • Structured output validation is mandatory in production. LLMs will occasionally hallucinate schema fields. Zod + retry at higher tier catches this gracefully.
  • Cache aggressively for repeated non-personalized prompts. Catalog queries, product formatting templates, and FAQ responses are perfect cache candidates.
  • Make routing logic explicit and tunable. A score-based heuristic beats a black-box classifier here—you can trace exactly why a request hit a particular tier.