Back to blogs
Fintech / Payments
9 min read
Mar 20, 2025

High-Volume Transaction Processing: Pub/Sub, Kafka, MongoDB Rollbacks, and Surviving Failures

How Trulipay handles thousands of concurrent payment transactions using Kafka for async processing, MongoDB multi-document transactions with rollback logic, and a webhook delivery system that survives partial failures—resulting in a 32% throughput gain.

KafkaMongoDBTransactionsPub/SubWebhooksFintechNode.jsHigh Availability
Table of contents

When Trulipay first went to production, we could handle maybe 50 concurrent payment transactions comfortably. Beyond that, timeouts started cascading. A slow database write would block the HTTP handler, which would block the connection pool, which would start rejecting new payment requests. We had to redesign the pipeline from scratch.

This is the architecture we moved to—and how it handled the 32% throughput improvement we measured after the switch.

The Original (Broken) Architecture

The original flow was synchronous and coupled:

POST /payment → validate → charge gateway → write to MongoDB → send webhook → respond 200

Every step blocked the request handler. Gateway calls took 1-3 seconds. MongoDB writes contended under load. Webhook deliveries could timeout. Everything ran in series in a single async handler. At peak, we were seeing p99 response times of 12 seconds.

The New Architecture

We broke the pipeline into three independent stages:

POST /payment
    │
    ▼
Validation + Auth (fast, synchronous)
    │
    ▼  202 Accepted
Kafka: payments.initiated
    │
    ▼
Payment Processor Worker (async)
    ├── Charge gateway
    ├── Write to MongoDB (with retry)
    └── Publish: payments.completed / payments.failed
                    │
                    ▼
            Webhook Dispatcher
                    ├── Deliver to merchant webhook
                    └── Retry with backoff on failure

The HTTP handler now does the minimum: validate the request, check for duplicates, publish to Kafka, and return 202 Accepted. The actual payment processing happens asynchronously.

Idempotency from Day One

Before the Kafka topic even sees the payment, we check for duplicates using an idempotency key:

async function initiatePayment(req: PaymentRequest): Promise<{ payment_id: string }> {
  const idempotencyKey = req.idempotency_key ?? generateKey(req);
 
  // Check for existing payment with this key
  const existing = await db.collection("payments").findOne({
    idempotency_key: idempotencyKey,
    merchant_id: req.merchant_id,
  });
 
  if (existing) {
    return { payment_id: existing.payment_id }; // return existing result
  }
 
  const paymentId = generatePaymentId();
 
  // Create payment in PENDING state atomically with the idempotency record
  await db.collection("payments").insertOne({
    payment_id: paymentId,
    idempotency_key: idempotencyKey,
    merchant_id: req.merchant_id,
    amount: req.amount,
    currency: req.currency,
    status: "pending",
    created_at: new Date(),
    metadata: req.metadata,
  });
 
  await kafka.producer.send({
    topic: "payments.initiated",
    messages: [{ key: paymentId, value: JSON.stringify({ paymentId, ...req }) }],
  });
 
  return { payment_id: paymentId };
}

The idempotency_key has a unique index on (merchant_id, idempotency_key). If a merchant retries a request (network error, timeout), the duplicate is detected immediately without re-charging the gateway.

MongoDB Multi-Document Transactions

Payment processing involves multiple writes that must succeed or fail together: updating the payment record, creating a ledger entry, and updating the merchant's balance. We use MongoDB sessions for multi-document ACID transactions:

async function processPayment(paymentId: string, gatewayResult: GatewayResult) {
  const session = client.startSession();
 
  try {
    await session.withTransaction(async () => {
      // 1. Update payment status
      await db.collection("payments").updateOne(
        { payment_id: paymentId },
        {
          $set: {
            status: gatewayResult.success ? "completed" : "failed",
            gateway_transaction_id: gatewayResult.transaction_id,
            processed_at: new Date(),
            failure_reason: gatewayResult.failure_reason,
          },
        },
        { session }
      );
 
      if (!gatewayResult.success) return; // no ledger entry for failed payments
 
      // 2. Create ledger entry
      await db.collection("ledger_entries").insertOne(
        {
          payment_id: paymentId,
          merchant_id: gatewayResult.merchant_id,
          amount: gatewayResult.net_amount,
          fee: gatewayResult.fee,
          type: "payment",
          created_at: new Date(),
        },
        { session }
      );
 
      // 3. Update merchant balance
      await db.collection("merchant_balances").updateOne(
        { merchant_id: gatewayResult.merchant_id },
        { $inc: { available_balance: gatewayResult.net_amount } },
        { upsert: true, session }
      );
    });
  } finally {
    await session.endSession();
  }
}

The withTransaction helper handles retries on transient write conflicts automatically. If any operation fails, MongoDB rolls back all three writes—the payment never ends up in a half-written state.

Rollback Scenarios

There are situations where the gateway charge succeeds but our database write fails. This is the worst case—money moved, but our records don't reflect it. We handle this with a reconciliation worker:

async function reconcileUnrecordedCharges() {
  // Payments stuck in 'pending' for more than 5 minutes
  const stuckPayments = await db.collection("payments").find({
    status: "pending",
    created_at: { $lt: new Date(Date.now() - 5 * 60_000) },
  }).toArray();
 
  for (const payment of stuckPayments) {
    // Query the gateway directly for the charge status
    const gatewayStatus = await gateway.getChargeStatus(payment.payment_id);
 
    if (gatewayStatus.charged && !gatewayStatus.captured) {
      // Gateway charged but we didn't record it—void the charge
      await gateway.voidCharge(payment.payment_id);
      await updatePaymentStatus(payment.payment_id, "voided");
      await alertTeam(`Voided unrecorded charge: ${payment.payment_id}`);
    } else if (gatewayStatus.charged && gatewayStatus.captured) {
      // Gateway fully processed—recover the record
      await processPayment(payment.payment_id, gatewayStatus);
    } else {
      // Gateway never charged—mark as failed
      await updatePaymentStatus(payment.payment_id, "failed");
    }
  }
}

This reconciliation job runs every 5 minutes. In our first three months of production, it recovered 23 unrecorded successful charges that would otherwise have been invisible in our system.

Webhook Delivery

Merchants need to know when payments complete. We deliver webhooks from a dedicated Kafka consumer:

kafkaConsumer.on("payments.completed", async (payment) => {
  const merchant = await getMerchant(payment.merchant_id);
  if (!merchant.webhook_url) return;
 
  await deliverWebhook(merchant.webhook_url, payment, {
    maxRetries: 5,
    baseDelay: 1000,
    maxDelay: 60_000,
  });
});
 
async function deliverWebhook(
  url: string,
  payload: unknown,
  options: RetryOptions
) {
  const signature = signPayload(payload, process.env.WEBHOOK_SECRET!);
 
  for (let attempt = 0; attempt <= options.maxRetries; attempt++) {
    try {
      const res = await fetch(url, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          "X-Trulipay-Signature": signature,
          "X-Trulipay-Attempt": String(attempt + 1),
        },
        body: JSON.stringify(payload),
        signal: AbortSignal.timeout(5000),
      });
 
      if (res.ok) return;
      if (res.status >= 400 && res.status < 500) return; // merchant error, don't retry
    } catch {
      if (attempt === options.maxRetries) {
        await alertMerchant(payload, "webhook_delivery_failed");
        return;
      }
      const delay = Math.min(options.baseDelay * Math.pow(2, attempt), options.maxDelay);
      await sleep(delay);
    }
  }
}

We skip retries on 4xx responses—a 400 from the merchant's endpoint means they rejected the payload intentionally, not a transient failure.

Results

After the rewrite, we measured:

MetricBeforeAfter
p50 response time1.2s82ms
p99 response time12s1.8s
Max concurrent payments~50600+
Throughput (req/min)180238
Webhook delivery success rate91%99.2%

The 32% throughput gain came from decoupling the HTTP handler from the gateway call. The HTTP handler now finishes in under 100ms, freeing connections for new requests while the gateway call happens in the background.

Key Takeaways

  • Decouple the HTTP handler from slow operations—gateway calls, DB writes, and webhooks don't belong in a synchronous request-response path at volume.
  • Idempotency keys on every payment—network retries are guaranteed; without them, you'll double-charge customers.
  • MongoDB transactions are ACID within a session—use them for multi-collection operations; don't rely on application-level compensating transactions.
  • Reconciliation workers for the gap case—when the gateway succeeds but your DB write fails, you need a recovery path that doesn't leave money in limbo.
  • Skip webhook retries on 4xx—the merchant rejected the payload; retrying won't help and will exhaust your retry budget.