Table of contents
When a retailer approves a campaign on Afto, that single approval triggers messages going out to potentially thousands of customers across WhatsApp, SMS, and email—simultaneously. Early on, we handled this synchronously inside the campaign service itself. That worked fine for tens of recipients. It fell apart fast at hundreds.
The solution was to decouple campaign execution from notification delivery entirely, using a Pub/Sub model.
The Problem with Synchronous Delivery
The first version of campaign execution looked like this:
async function executeCampaign(campaign: Campaign) {
const customers = await getTargetCustomers(campaign);
for (const customer of customers) {
await sendWhatsAppMessage(customer.phone, campaign.message);
await sendSmsMessage(customer.phone, campaign.sms_fallback);
}
await markCampaignComplete(campaign.id);
}This had three fatal flaws:
- No parallelism — a campaign to 2,000 customers was strictly sequential.
- A single WhatsApp API timeout failed the whole batch — no partial success.
- No retry logic — failed deliveries were lost forever.
The rewrite moved delivery to a proper Pub/Sub architecture.
Architecture Overview
Campaign Service
│
│ publishes CampaignExecutionEvent
▼
Kafka Topic: campaign.execute
│
│
┌───┴───────────────────┐
│ Notification Worker │ (multiple instances)
└─────────┬─────────────┘
│
┌───────┼────────┐
▼ ▼ ▼
WhatsApp Twilio SendGrid
│ │ │
│ webhooks back │
└───────┼────────┘
▼
Analytics Collector
│
▼
PostgreSQL (events table)
Campaign execution publishes a single event. The notification worker consumes from Kafka, fans out to channels in parallel, and each delivery channel's webhook calls back with delivery status.
Campaign Execution Event
The publisher is deliberately thin—it just emits a well-typed event and moves on:
type CampaignExecutionEvent = {
campaign_id: string;
retailer_id: string;
segment: CustomerSegment;
message: string;
sms_fallback: string;
customer_ids: string[];
channels: Array<"whatsapp" | "sms" | "email">;
scheduled_at: string;
};
await kafkaProducer.send({
topic: "campaign.execute",
messages: [
{
key: campaignId,
value: JSON.stringify(event),
},
],
});Using the campaign_id as the Kafka message key ensures all events for the same campaign go to the same partition—useful for ordered processing if we ever need it.
Notification Worker
The worker consumes the event, resolves customer contact info, and fires all channels in parallel:
kafkaConsumer.on("message", async (msg) => {
const event: CampaignExecutionEvent = JSON.parse(msg.value.toString());
const customers = await resolveContacts(event.customer_ids);
const deliveryBatches = chunk(customers, 100); // respect API rate limits
for (const batch of deliveryBatches) {
await Promise.allSettled(
batch.flatMap((customer) =>
event.channels.map((channel) =>
deliver(channel, customer, event)
)
)
);
}
await updateCampaignStats(event.campaign_id, { status: "sent" });
});
async function deliver(
channel: "whatsapp" | "sms" | "email",
customer: ResolvedCustomer,
event: CampaignExecutionEvent
): Promise<void> {
const messageId = generateMessageId();
// Pre-log the attempt for webhook correlation
await db.query(
`INSERT INTO campaign_deliveries
(campaign_id, customer_id, channel, message_id, status)
VALUES ($1, $2, $3, $4, 'pending')`,
[event.campaign_id, customer.id, channel, messageId]
);
switch (channel) {
case "whatsapp":
await whatsappClient.sendMessage(customer.phone, event.message, {
context: { message_id: messageId },
});
break;
case "sms":
await twilioClient.messages.create({
to: customer.phone,
from: process.env.TWILIO_FROM_NUMBER,
body: event.sms_fallback,
statusCallback: `${process.env.BASE_URL}/webhooks/sms/${messageId}`,
});
break;
case "email":
await sendgridClient.send({
to: customer.email,
subject: event.email_subject,
html: event.email_html,
customArgs: { message_id: messageId },
});
break;
}
}The pre-logging of the message_id before the send call is important—it means our webhook handler can always look up the delivery record, even if the webhook arrives before our database write in a race condition (which happens more than you'd think under load).
Webhook Analytics Collection
Each channel fires delivery webhooks back to us with status updates. These are our analytics events:
| Event | Trigger |
|---|---|
delivered | WhatsApp confirmed delivery to device |
read | WhatsApp blue double-tick |
failed | Delivery failure (unreachable, opted out) |
sms_delivered | Twilio delivery confirmation |
email_opened | SendGrid pixel tracking |
email_clicked | SendGrid link tracking |
Our webhook handler is a simple event logger:
app.post("/webhooks/:channel/:messageId", async (req, res) => {
const { channel, messageId } = req.params;
const status = parseWebhookStatus(channel, req.body);
await db.query(
`UPDATE campaign_deliveries
SET status = $1, event_ts = NOW(), raw_payload = $2
WHERE message_id = $3`,
[status, JSON.stringify(req.body), messageId]
);
res.sendStatus(200); // always ACK quickly
});We always acknowledge webhooks immediately and process asynchronously. Meta's WhatsApp API will retry webhook delivery aggressively if it doesn't get a 200 within 5 seconds—causing duplicate processing if you're slow.
Retry Logic and Dead Letter Queue
Failed deliveries go to a retry queue with exponential backoff:
async function handleDeliveryFailure(delivery: CampaignDelivery, error: Error) {
const retryCount = delivery.retry_count ?? 0;
if (retryCount >= 3) {
await kafkaProducer.send({
topic: "campaign.delivery.dead_letter",
messages: [{ key: delivery.message_id, value: JSON.stringify({ delivery, error: error.message }) }],
});
await updateDeliveryStatus(delivery.message_id, "permanently_failed");
return;
}
const delayMs = Math.pow(2, retryCount) * 60_000; // 1min, 2min, 4min
await scheduleRetry(delivery, delayMs);
}We alert on dead letter queue depth. If it spikes, it usually means a channel is down or a rate limit was hit hard.
Key Takeaways
- Pub/Sub decoupling means a WhatsApp API hiccup doesn't freeze the campaign pipeline—the worker retries independently.
- Pre-log the message ID before sending, not after—you avoid race conditions between your DB write and the inbound webhook.
- Always ACK webhooks immediately; process asynchronously. WhatsApp, Twilio, and SendGrid will all retry on slow responses.
- Dead letter queues with alerting are essential—silent delivery failures are worse than noisy ones.