When we started building Afto, one of the hardest requirements was letting SMB retailers' customers shop directly over WhatsApp. The idea sounds deceptively simple: a customer sends a message, gets product info, adds to cart, places an order. In practice, building something reliable at production scale took months of iteration.
Here's the full architecture we landed on, and the lessons we learned along the way.
The System Architecture
The agent has three distinct layers:
- Webhook receiver — Receives messages from Meta's WhatsApp Business API, verifies signatures, and queues them.
- Intent detection + session manager — Classifies what the customer wants and maintains conversation state across turns.
- LLM function-calling layer — Maps the intent to the right tool: browse catalog, add to cart, place order, check status.
Customer → Meta Cloud API → POST /webhook/whatsapp
↓
Signature Verification
↓
Intent Classifier (GPT-3.5)
↓
Session Manager (Redis, 24h TTL)
↓
Tool-Calling Agent (GPT-4o-mini)
↓
Order Service / Product Catalog (PostgreSQL)
We deliberately split intent classification from the tool-calling step. The classifier is a cheap, fast GPT-3.5-turbo call. We only spin up the heavier function-calling pipeline when the intent needs it.
Webhook Setup and Signature Verification
Meta's Cloud API delivers messages as POST requests to your webhook. The first thing to do on every request is verify the X-Hub-Signature-256 header:
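import crypto from "crypto";

export function verifyWhatsAppSignature(
  rawBody: string,
  signature: string
): boolean {
  // HMAC-SHA256 over the raw (unparsed) request body, keyed with the app secret
  const expected = crypto
    .createHmac("sha256", process.env.WHATSAPP_APP_SECRET!)
    .update(rawBody)
    .digest("hex");
  const expectedBuf = Buffer.from(`sha256=${expected}`);
  const receivedBuf = Buffer.from(signature);
  // Constant-time comparison; timingSafeEqual throws on unequal lengths, so check first
  return (
    expectedBuf.length === receivedBuf.length &&
    crypto.timingSafeEqual(expectedBuf, receivedBuf)
  );
}
We reject anything that fails verification before it touches our queue. Early on we skipped this in staging and got flooded by Meta's verification challenges. Don't make that mistake.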
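Separately from per-message signatures, Meta checks the webhook endpoint itself with a GET request when it is registered, passing hub.mode, hub.verify_token, and hub.challenge as query parameters. The sketch below shows one way to answer that handshake. The function name, the VerifyResult shape, and the way the verify token is passed in are assumptions for illustration, not Afto's actual handler.

```typescript
// Result the HTTP layer should send back; this shape is an assumption for the sketch.
interface VerifyResult {
  status: number;
  body: string;
}

// `query` holds the parsed query-string parameters of the GET request.
// `verifyToken` is the token you configured when registering the webhook.
function handleWebhookVerification(
  query: Record<string, string | undefined>,
  verifyToken: string
): VerifyResult {
  const mode = query["hub.mode"];
  const token = query["hub.verify_token"];
  const challenge = query["hub.challenge"];

  if (mode === "subscribe" && token === verifyToken && challenge !== undefined) {
    // Echo the challenge back as plain text to confirm ownership of the endpoint.
    return { status: 200, body: challenge };
  }
  // Wrong mode, wrong token, or missing challenge: refuse.
  return { status: 403, body: "Forbidden" };
}
```

Keeping this as a pure function makes it easy to wire into whichever HTTP framework serves /webhook/whatsapp, with the POST path still going through signature verification.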
Intent Classification
Not every message is an order action. Customers say "hi", complain about deliveries, ask random product questions. We categorize each message into six buckets before doing anything expensive:
- BROWSE_CATALOG: "show me your products", "what do you have?"
- ADD_TO_CART: "I want 2 of the blue shirt"
- PLACE_ORDER: "confirm order", "checkout"
- TRACK_ORDER: "where is my order?", "delivery status"
- CUSTOMER_SUPPORT: complaints, returns, generic questions
- CHITCHAT: greetings, off-topic
The classifier prompt is minimal and the response is just the label:
// Built per message: a template literal is evaluated where it is written,
// so the prompt has to be a function of the incoming text.
const CLASSIFY_PROMPT = (message: string) => `
Classify this WhatsApp customer message into exactly one of:
BROWSE_CATALOG, ADD_TO_CART, PLACE_ORDER, TRACK_ORDER, CUSTOMER_SUPPORT, CHITCHAT
Message: "${message}"
Reply with only the label, nothing else.
`;
Each call costs a fraction of a cent and completes in under 300 ms. We only call GPT-4o-mini for BROWSE_CATALOG, ADD_TO_CART, and PLACE_ORDER.
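Because the classifier replies with free text, it helps to normalize the label before routing on it. The following is a minimal sketch of that step plus the "only three intents reach the tool-calling model" rule. The helper names and the CUSTOMER_SUPPORT fallback for unrecognized output are assumptions, not the original implementation.

```typescript
const INTENTS = [
  "BROWSE_CATALOG",
  "ADD_TO_CART",
  "PLACE_ORDER",
  "TRACK_ORDER",
  "CUSTOMER_SUPPORT",
  "CHITCHAT",
] as const;
type Intent = (typeof INTENTS)[number];

// Intents that justify the heavier tool-calling model, per the text above.
const TOOL_AGENT_INTENTS: ReadonlySet<Intent> = new Set<Intent>([
  "BROWSE_CATALOG",
  "ADD_TO_CART",
  "PLACE_ORDER",
]);

// Normalize the raw completion text and fall back when the model returns
// anything other than a known label. The fallback choice is an assumption.
function parseIntent(
  raw: string | null | undefined,
  fallback: Intent = "CUSTOMER_SUPPORT"
): Intent {
  // Uppercase and drop everything except letters and underscores
  // (whitespace, quotes, trailing punctuation).
  const label = (raw ?? "").trim().toUpperCase().replace(/[^A-Z_]/g, "");
  return (INTENTS as readonly string[]).includes(label) ? (label as Intent) : fallback;
}

function needsToolAgent(intent: Intent): boolean {
  return TOOL_AGENT_INTENTS.has(intent);
}
```

Falling back to a support-style intent means a malformed classifier reply never triggers a cart or order action by accident.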
LLM Tool Definitions
For the three commerce intents, we pass the LLM a set of tools and let it decide what to call:
const tools = [
  {
    type: "function",
    function: {
      name: "search_products",
      description: "Search the retailer's catalog by name, category, or keyword",
      parameters: {
        type: "object",
        properties: {
          query: { type: "string" },
          category: { type: "string" },
          limit: { type: "number", default: 5 },
        },
        required: ["query"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "add_to_cart",
      description: "Add a product to the customer's active cart",
      parameters: {
        type: "object",
        properties: {
          product_id: { type: "string" },
          quantity: { type: "number" },
        },
        required: ["product_id", "quantity"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "place_order",
      description: "Confirm and place the order from cart contents",
      parameters: {
        type: "object",
        properties: {
          delivery_address: { type: "string" },
          payment_method: {
            type: "string",
            enum: ["COD", "CARD", "UPI"],
          },
        },
        required: ["delivery_address", "payment_method"],
      },
    },
  },
  {
    type: "function",
    function: {
      name: "get_order_status",
      description: "Get the current delivery status of a specific order",
      parameters: {
        type: "object",
        properties: {
          order_id: { type: "string" },
        },
        required: ["order_id"],
      },
    },
  },
];
When the LLM calls search_products, we hit our PostgreSQL catalog with full-text search. The result goes back to the LLM, which formats it into a WhatsApp-friendly reply. We cap responses at 3 products per message to avoid wall-of-text replies.
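Tool-call arguments arrive from the model as a JSON string, which may be malformed or missing required fields, so they are worth validating before they reach the catalog query. Here is a rough sketch for search_products. The function name, the SearchArgs type, and the choice to enforce the 3-product cap by clamping limit at this point are assumptions for illustration.

```typescript
interface SearchArgs {
  query: string;
  category?: string;
  limit: number;
}

// Cap described in the text: at most 3 products per WhatsApp reply.
const MAX_PRODUCTS_PER_REPLY = 3;

// `rawArguments` is the JSON string from tool_call.function.arguments.
// Returns null when the arguments cannot be used.
function parseSearchArgs(rawArguments: string): SearchArgs | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(rawArguments);
  } catch {
    return null; // the model produced invalid JSON
  }
  if (typeof parsed !== "object" || parsed === null) return null;

  const args = parsed as Record<string, unknown>;
  if (typeof args.query !== "string" || args.query.trim() === "") return null;

  // Use the requested limit when it is a finite number, then clamp to [1, 3].
  const requested =
    typeof args.limit === "number" && Number.isFinite(args.limit)
      ? Math.floor(args.limit)
      : MAX_PRODUCTS_PER_REPLY;
  const limit = Math.min(Math.max(requested, 1), MAX_PRODUCTS_PER_REPLY);

  return {
    query: args.query.trim(),
    category: typeof args.category === "string" ? args.category : undefined,
    limit,
  };
}
```

A null result can be reported back to the model as a tool error, which also feeds naturally into any retry or handoff logic.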
Session Management with Redis
Each customer session stores conversation history, cart state, and context:
// Assumption: an ioredis-style client, matching the get/setex calls below.
// REDIS_URL is a placeholder environment variable name.
import Redis from "ioredis";
const redis = new Redis(process.env.REDIS_URL!);

type SessionData = {
  history: Array<{ role: "user" | "assistant"; content: string }>;
  cart: Array<{ product_id: string; quantity: number; name: string; price: number }>;
  active_order_id?: string;
  language: string;
};

const SESSION_KEY = (phone: string) => `wa:session:${phone}`;
const TTL = 86400; // 24 hours, in seconds

async function loadSession(phone: string): Promise<SessionData> {
  const raw = await redis.get(SESSION_KEY(phone));
  return raw ? JSON.parse(raw) : { history: [], cart: [], language: "en" };
}

async function saveSession(phone: string, data: SessionData) {
  // keep only the last 10 turns (20 messages: one user message plus one reply per turn)
  // to stay within context limits
  data.history = data.history.slice(-20);
  await redis.setex(SESSION_KEY(phone), TTL, JSON.stringify(data));
}
The 10-turn limit (20 stored messages) was important. Early in development, we had no truncation, and context windows blew up after long conversations, which was both expensive and slow.
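The turn-based truncation can also be written as a small pure function that appends a complete turn and trims in one step, which avoids mutating the session object in place. This is only a sketch: appendTurn and MAX_TURNS are hypothetical names, not part of the original code.

```typescript
type Message = { role: "user" | "assistant"; content: string };

const MAX_TURNS = 10; // one turn = one user message + one assistant reply

// Returns a new history with the latest turn appended and only the most
// recent `maxTurns` turns kept. The input array is not modified.
function appendTurn(
  history: readonly Message[],
  userText: string,
  assistantText: string,
  maxTurns: number = MAX_TURNS
): Message[] {
  if (maxTurns <= 0) return []; // slice(-0) would keep everything, so guard explicitly
  const next: Message[] = [
    ...history,
    { role: "user", content: userText },
    { role: "assistant", content: assistantText },
  ];
  return next.slice(-maxTurns * 2);
}
```

Appending both messages together keeps the stored history aligned on turn boundaries, so truncation never leaves an assistant reply without the user message that prompted it.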
Multi-Turn Conversation Handling
The trickiest edge cases were multi-turn flows where customers don't give all information at once:
"I want to order something"
(Agent asks: what product?)
"That red t-shirt"
(Agent asks: which size? how many?)
"Medium, 2 please"
The LLM handles these naturally from history. Our job was to:
- Persist history correctly — always append both the user message and the LLM's reply before saving.
- Detect abandoned flows — if someone starts checkout and disappears, we clear the pending state on next contact.
- Handle language switching — some customers start in English and switch to Hindi or Tamil mid-conversation. We added language detection and stored the preference in session.
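Abandoned-flow detection from the list above could look roughly like the following. The pendingCheckout field and the 30-minute threshold are hypothetical: the SessionData type shown earlier has no such marker, and the actual timeout is not stated here.

```typescript
// Hypothetical marker set when a customer enters checkout.
interface PendingCheckout {
  startedAt: number; // epoch milliseconds
}

interface SessionLike {
  pendingCheckout?: PendingCheckout;
}

// Assumed threshold for treating a checkout as abandoned.
const ABANDON_AFTER_MS = 30 * 60 * 1000;

// Call on every incoming message, before routing. Returns the session with
// the stale checkout marker cleared, or the same session when nothing changed.
function clearIfAbandoned<T extends SessionLike>(
  session: T,
  now: number = Date.now()
): T {
  const pending = session.pendingCheckout;
  if (pending !== undefined && now - pending.startedAt > ABANDON_AFTER_MS) {
    // Cart and history are kept; only the in-progress checkout state is dropped.
    return { ...session, pendingCheckout: undefined };
  }
  return session;
}
```

Running the check on the next contact, rather than on a timer, matches the "clear on next contact" behavior described above and needs no background job.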
Human Handoff
Not everything belongs with the LLM. We route to a human support queue for:
- Payment failures (card declined, UPI timeouts)
- Complaints about delivered orders
- Any tool call that errors three times in a row
- When the LLM returns a low-confidence response (we score with a follow-up classification call)
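The "three errors in a row" rule amounts to a counter that resets on any success. A minimal sketch follows, with a hypothetical ToolErrorTracker class. In a real deployment the count would more likely live in the Redis session so that it survives across webhook requests.

```typescript
const MAX_CONSECUTIVE_TOOL_ERRORS = 3;

class ToolErrorTracker {
  private consecutiveErrors = 0;

  // Record the outcome of one tool call. Returns true when the conversation
  // should be routed to a human because the error streak hit the limit.
  record(succeeded: boolean): boolean {
    this.consecutiveErrors = succeeded ? 0 : this.consecutiveErrors + 1;
    return this.consecutiveErrors >= MAX_CONSECUTIVE_TOOL_ERRORS;
  }
}
```

Resetting on success means intermittent failures do not accumulate toward a handoff; only an unbroken streak does.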
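import OpenAI from "openai";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Low-confidence check: a cheap follow-up classification call on the drafted reply
async function shouldHandoff(response: string): Promise<boolean> {
  const check = await openai.chat.completions.create({
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "user",
        content: `Does this AI response require human review? Reply YES or NO.\n\n"${response}"`,
      },
    ],
    temperature: 0,
  });
  // Tolerate case and trailing punctuation in the verdict ("Yes.", "YES\n")
  const verdict = check.choices[0]?.message.content?.trim().toUpperCase() ?? "";
  return verdict.startsWith("YES");
}
Results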
After two months in production with pilot retailers, over 40% of orders were placed through the WhatsApp channel. Median response time was 1.8 seconds for catalog queries and 3.4 seconds for order placements. Human handoff rate sat around 6%, which was lower than expected.
The biggest surprise was how often customers asked the agent questions we hadn't anticipated—and how well the LLM handled them anyway.
Key Takeaways
- Two-stage pipeline (cheap classifier → expensive tool-caller) cuts cost dramatically without hurting quality.
- Redis session state is non-negotiable — multi-turn commerce over WhatsApp requires server memory; the model alone isn't sufficient.
- Always ship a human handoff path — for payments, complaints, and low-confidence situations. Customers notice when a bot goes silent.
- Test regional language inputs early — customers type in their native language, abbreviate product names, and send voice notes you'll need to transcribe.