Early in Afto's development, we sent every LLM request to GPT-4. It felt safe—GPT-4 is smart, handles ambiguity well, rarely fails. But at production scale with hundreds of concurrent WhatsApp sessions per retailer, the bill was unsustainable and latency was creeping past three seconds. We needed a smarter approach.
This is the routing layer we built to fix that.
The Problem with Uniform Model Selection
When you're prototyping, defaulting to the most capable model is fine. In production with real cost and latency constraints, it's a mistake. Not every request needs GPT-4's reasoning power.
Consider the spread of tasks in Afto's pipeline:
| Task | Complexity | Needs GPT-4? |
|---|---|---|
| Classify intent from a short message | Low | No |
| Confirm a customer greeting | Low | No |
| Extract structured order details from natural language | Medium | Sometimes |
| Reconcile ambiguous product names against a catalog | High | Yes |
| Generate a personalized campaign with LLM-generated flyers | High | Yes |
| Format a product list into a WhatsApp message | Low | No |
Running everything through GPT-4 meant paying GPT-4 prices for tasks a smaller model handles perfectly.
The Routing Architecture
We built a ModelRouter class that scores each request and selects a model:

    import type { ZodSchema } from "zod";

    type RoutingTier = "fast" | "balanced" | "powerful";

    interface RoutingRequest {
      task: string;
      input: string;
      requiresStructuredOutput?: boolean;
      schema?: ZodSchema;
      maxTokens?: number;
    }

    class ModelRouter {
      // Public (not private) so callers can resolve a tier to its model name
      readonly tiers: Record<RoutingTier, string> = {
        fast: "gpt-3.5-turbo",
        balanced: "gpt-4o-mini",
        powerful: "gpt-4o",
      };

      selectTier(req: RoutingRequest): RoutingTier {
        // Heuristic scoring: a higher score selects a more capable tier
        let score = 0;

        // Complex tasks get bumped up
        const complexKeywords = [
          "reconcile", "disambiguate", "campaign", "segment", "analyze",
        ];
        if (complexKeywords.some((k) => req.task.toLowerCase().includes(k))) {
          score += 2;
        }

        // Long inputs signal complexity
        if (req.input.length > 500) score += 1;

        // Structured output with tight schemas needs reliability
        if (req.requiresStructuredOutput && req.schema) score += 1;

        if (score === 0) return "fast";
        if (score <= 2) return "balanced";
        return "powerful";
      }
    }

This keeps the routing logic explicit and tunable. When we noticed a certain task class was failing on balanced, we bumped its weight. No rewrite needed.
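To make "bumping a weight" concrete, here is a minimal sketch in which the weights live in a config object and each decision records why it was made. The names `RoutingSignals`, `RoutingWeights`, `decideTier`, and `hasSchema` are hypothetical and not part of the router above; the default numbers simply mirror the scoring shown earlier.

```typescript
type RoutingTier = "fast" | "balanced" | "powerful";

// Hypothetical, simplified request shape for this sketch.
interface RoutingSignals {
  task: string;
  input: string;
  hasSchema: boolean;
}

// Hypothetical weight table: every number here is a tunable knob.
interface RoutingWeights {
  complexKeyword: number;
  longInput: number;
  structuredOutput: number;
  longInputThreshold: number;
}

const defaultWeights: RoutingWeights = {
  complexKeyword: 2,
  longInput: 1,
  structuredOutput: 1,
  longInputThreshold: 500,
};

const complexKeywords = ["reconcile", "disambiguate", "campaign", "segment", "analyze"];

interface RoutingDecision {
  tier: RoutingTier;
  score: number;
  reasons: string[];
}

// Scores a request and records which signals contributed to the score.
function decideTier(
  req: RoutingSignals,
  weights: RoutingWeights = defaultWeights
): RoutingDecision {
  let score = 0;
  const reasons: string[] = [];

  const keyword = complexKeywords.find((k) => req.task.toLowerCase().includes(k));
  if (keyword !== undefined) {
    score += weights.complexKeyword;
    reasons.push(`keyword "${keyword}" (+${weights.complexKeyword})`);
  }
  if (req.input.length > weights.longInputThreshold) {
    score += weights.longInput;
    reasons.push(`long input (+${weights.longInput})`);
  }
  if (req.hasSchema) {
    score += weights.structuredOutput;
    reasons.push(`structured output (+${weights.structuredOutput})`);
  }

  const tier: RoutingTier = score === 0 ? "fast" : score <= 2 ? "balanced" : "powerful";
  return { tier, score, reasons };
}

// Usage: the same extraction request before and after raising one weight.
const order = { task: "extract order details", input: "2 bags of rice", hasSchema: true };
const greeting = decideTier({ task: "classify intent", input: "hi", hasSchema: false });
const extraction = decideTier(order);
const bumped = decideTier(order, { ...defaultWeights, structuredOutput: 3 });
console.log(greeting.tier, extraction.tier, bumped.tier, bumped.reasons);
```

Logging `reasons` next to the chosen tier is one way to see why a request landed where it did when a task class starts failing.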
Structured Output Validation with Zod
The most failure-prone part of any LLM pipeline is parsing the output. Models hallucinate keys, change field names, return arrays when you expect objects. We enforced structured output with Zod schemas and a retry-with-feedback loop:

    import OpenAI from "openai";
    import { z } from "zod";

    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    const OrderExtractionSchema = z.object({
      product_name: z.string(),
      quantity: z.number().int().positive(),
      variant: z.string().optional(),
      delivery_address: z.string().optional(),
    });

    type OrderExtraction = z.infer<typeof OrderExtractionSchema>;

    // The schema's `_def` is a Zod internal and not a stable, serializable
    // description of the fields, so the shape is spelled out by hand here.
    const ORDER_FIELDS = `{
      "product_name": string,
      "quantity": positive integer,
      "variant": string (optional),
      "delivery_address": string (optional)
    }`;

    const ESCALATION_ORDER: RoutingTier[] = ["fast", "balanced", "powerful"];

    async function extractOrderDetails(
      message: string
    ): Promise<OrderExtraction> {
      const router = new ModelRouter();
      let tier = router.selectTier({
        task: "extract order details",
        input: message,
        requiresStructuredOutput: true,
        schema: OrderExtractionSchema,
      });
      let feedback = "";

      for (let attempt = 0; attempt < 3; attempt++) {
        const response = await openai.chat.completions.create({
          model: router.tiers[tier],
          messages: [
            {
              role: "system",
              content: `Extract order details as JSON matching this shape:\n${ORDER_FIELDS}${feedback}`,
            },
            { role: "user", content: message },
          ],
          response_format: { type: "json_object" },
          temperature: 0,
        });

        const raw = response.choices[0].message.content ?? "{}";
        try {
          // JSON.parse throws SyntaxError; parse() throws ZodError if invalid
          return OrderExtractionSchema.parse(JSON.parse(raw));
        } catch (err) {
          if (attempt === 2) throw err;
          // Feed the error back and move up one tier (stays at "powerful").
          // Recursing here would re-select the same tier and never escalate.
          feedback = `\nYour previous reply was rejected: ${String(err)}`;
          const next = ESCALATION_ORDER.indexOf(tier) + 1;
          tier = ESCALATION_ORDER[Math.min(next, ESCALATION_ORDER.length - 1)];
        }
      }
      throw new Error("Failed to extract order details after 3 attempts");
    }

The escalation pattern was key: if structured output validation fails, we retry with the next tier up, passing the validation error back to the model. In practice this escalation happens on fewer than 2% of requests.
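The escalation idea can also be separated from the OpenAI call and tested against a fake model. The sketch below is an illustration under assumptions, not the production code: `withEscalation`, `ModelCall`, `Validator`, and `nextTier` are hypothetical names, and the validator is hand-written so the example has no dependencies.

```typescript
type RoutingTier = "fast" | "balanced" | "powerful";

// Returns the next tier up, or "powerful" if already at the top.
function nextTier(tier: RoutingTier): RoutingTier {
  if (tier === "fast") return "balanced";
  return "powerful";
}

// Hypothetical signatures: `call` wraps the model request and receives the
// previous validation error (if any); `validate` throws on bad output.
type ModelCall = (tier: RoutingTier, feedback: string | null) => Promise<string>;
type Validator<T> = (raw: string) => T;

async function withEscalation<T>(
  startTier: RoutingTier,
  call: ModelCall,
  validate: Validator<T>,
  maxAttempts = 3
): Promise<{ value: T; tier: RoutingTier; attempts: number }> {
  let tier = startTier;
  let feedback: string | null = null;
  let lastError: unknown = new Error("no attempts made");

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await call(tier, feedback);
    try {
      return { value: validate(raw), tier, attempts: attempt };
    } catch (err) {
      // Remember the failure, pass it to the next attempt, and move up a tier
      lastError = err;
      feedback = err instanceof Error ? err.message : String(err);
      tier = nextTier(tier);
    }
  }
  throw lastError;
}

// Fake model for illustration: only the "powerful" tier returns a numeric quantity.
const fakeModel: ModelCall = async (tier) =>
  tier === "powerful" ? '{"quantity": 2}' : '{"quantity": "two"}';

const validateQuantity: Validator<{ quantity: number }> = (raw) => {
  const parsed = JSON.parse(raw) as { quantity?: unknown };
  if (typeof parsed.quantity !== "number") throw new Error("quantity must be a number");
  return { quantity: parsed.quantity };
};

// Starts at "fast", fails twice, and succeeds on the third attempt at "powerful".
const escalated = withEscalation("fast", fakeModel, validateQuantity);
escalated.then((r) => console.log(r.tier, r.attempts, r.value));
```

Keeping the loop iterative makes the attempt count and the tier progression explicit, which a recursive retry tends to hide.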
Caching Repeated Inputs
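A surprising amount of traffic was repeated. Customers in the same retailer's audience send the same "what's in stock?" query hundreds of times per day. We added a cache layer using Redis, keyed on a SHA-256 hash of the tier and prompt. Because the key is a hash, it only matches prompts that are byte-for-byte identical, so it pays off most when prompts are built from templates: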
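
    import crypto from "crypto";
    import Redis from "ioredis"; // assumption: the `setex` call matches the ioredis client

    const redis = new Redis();

    // `callModel` is the application's own model wrapper (not shown here)
    declare function callModel(tier: RoutingTier, prompt: string): Promise<string>;

    async function cachedCompletion(
      prompt: string,
      tier: RoutingTier,
      ttl = 300 // seconds; pass 0 to skip the cache
    ): Promise<string> {
      // A TTL of 0 means "never cache"; Redis SETEX rejects a zero expiry
      if (ttl <= 0) return callModel(tier, prompt);

      const key = `llm:cache:${crypto
        .createHash("sha256")
        .update(`${tier}:${prompt}`)
        .digest("hex")}`;

      const cached = await redis.get(key);
      if (cached !== null) return cached;

      const result = await callModel(tier, prompt);
      await redis.setex(key, ttl, result);
      return result;
    }

Cache TTL is 5 minutes for catalog queries (products don't change that fast) and 0 for anything personalized or transactional.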
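The TTL policy and the exact-match limitation can be sketched without Redis. In the example below, `CacheClass`, `ttlSeconds`, `normalizePrompt`, and `TtlCache` are hypothetical names, the in-memory map is only a stand-in for Redis, and the normalization step is an assumption about how near-identical prompts could be made to share a key, not something the pipeline above is known to do.

```typescript
// Hypothetical task categories; the real pipeline may slice these differently.
type CacheClass = "catalog" | "personalized" | "transactional";

// TTLs in seconds, mirroring the policy described above.
const ttlSeconds: Record<CacheClass, number> = {
  catalog: 300,
  personalized: 0,
  transactional: 0,
};

// Collapse trivial differences so near-identical prompts share a cache key.
function normalizePrompt(prompt: string): string {
  return prompt.trim().toLowerCase().replace(/\s+/g, " ");
}

// In-memory stand-in for Redis, with an injectable clock for testing.
class TtlCache {
  private readonly entries = new Map<string, { value: string; expiresAt: number }>();
  private readonly now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  get(key: string): string | null {
    const entry = this.entries.get(key);
    if (entry === undefined) return null;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: drop it and report a miss
      return null;
    }
    return entry.value;
  }

  set(key: string, value: string, ttl: number): void {
    if (ttl <= 0) return; // TTL 0 means "do not cache"
    this.entries.set(key, { value, expiresAt: this.now() + ttl * 1000 });
  }
}

// Usage: a fake clock (milliseconds) makes expiry easy to observe.
let clock = 0;
const cache = new TtlCache(() => clock);
const key = normalizePrompt("  What's in   STOCK? ");
cache.set(key, "rice, beans", ttlSeconds.catalog);
console.log(key, cache.get(key));
```

Normalizing before hashing trades a little precision for a higher hit rate; it is only safe for prompts where case and spacing carry no meaning.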
Latency Numbers
After the routing layer went live, here's what we observed across a week of production traffic:
| Route | Before (all GPT-4) | After |
|---|---|---|
| Intent classification | 1.1s | 0.28s |
| Product search formatting | 0.9s | 0.31s |
| Order extraction | 1.4s | 0.6s (balanced) |
| Campaign generation | 2.2s | 2.1s (powerful, unchanged) |
| p95 overall | 3.4s | 0.87s |
The p95 latency dropping to under one second changed the user experience noticeably. WhatsApp has an implicit "typing..." indicator and users started perceiving the agent as genuinely instant.
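For readers less familiar with tail latency, p95 is the value that 95% of requests come in at or under. A minimal sketch of one common way to compute it (nearest-rank) follows; the method used for the table above is not stated, so treat this as an assumption, and note that the sample latencies are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// all samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Made-up latencies in seconds, for illustration only.
const latencies = [0.25, 0.3, 0.28, 0.6, 0.31, 2.1, 0.27, 0.29, 0.33, 0.26];
const p50 = percentile(latencies, 50);
const p95 = percentile(latencies, 95);
console.log(p50, p95);
```

A single slow request dominates p95 in a small sample like this one, which is why the tail figure moves so much once the slow routes are taken off the common path.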
Cost Impact
We haven't published exact numbers, but the routing change reduced our OpenAI spend by around 68% at equivalent request volume. Almost all of that came from moving intent classification and catalog formatting off GPT-4 entirely.
Key Takeaways
- Not all tasks are created equal. A two-sentence intent classification does not belong on the same model as campaign generation.
- Structured output validation is mandatory in production. LLMs will occasionally hallucinate schema fields. Zod + retry at higher tier catches this gracefully.
- Cache aggressively for repeated non-personalized prompts. Catalog queries, product formatting templates, and FAQ responses are perfect cache candidates.
- Make routing logic explicit and tunable. A score-based heuristic beats a black-box classifier here—you can trace exactly why a request hit a particular tier.