
Stripe webhooks in production: retries, idempotency, and why 200 OK isn't enough

Most teams think the webhook problem is solved when they stop getting signature errors.

It is not.

The hard problems start after signature verification:

  • retries and duplicate deliveries,
  • partial failures after 200 OK,
  • state drift between Stripe and your app,
  • no safe replay when incidents happen.

If you process subscription revenue, these are not edge cases. They are normal operating conditions.

Who this is for

This article is for:

  • backend/platform engineers running Stripe-based SaaS billing,
  • technical founders between first revenue and scale,
  • teams that already "handle webhooks" but still get billing-state incidents.

The goal is simple: ship a webhook pipeline that is correct under failure, not just on the happy path.

The core misconception: 200 OK means success

Stripe delivers webhooks at least once, so your system must tolerate duplicates and retries.

A successful HTTP response only means your endpoint acknowledged receipt. It does not guarantee your business logic completed safely.

Common failure timeline:

  1. Stripe sends checkout.session.completed.
  2. Your API verifies signature.
  3. Your handler starts DB writes.
  4. A timeout/deploy/network issue interrupts work.
  5. You already returned 200, so Stripe stops retrying.
  6. Payment succeeded in Stripe, but entitlement was never provisioned.

From the customer perspective: "I paid and your product did nothing."

Production architecture that actually works

Use this pattern:

  1. Verify signature on raw body.
  2. Persist event durably (including raw body + headers).
  3. Return 2xx fast.
  4. Process asynchronously in a worker.
  5. Enforce idempotency in storage and business handlers.
  6. Retry with backoff and dead-letter visibility.

This decouples delivery reliability from business processing latency.

1) Durable ingest before acknowledgment

Create an ingestion table and write to it before any heavy logic:

CREATE TABLE stripe_ingest_events (
  id TEXT PRIMARY KEY,
  type TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  headers JSONB NOT NULL,
  raw_body TEXT NOT NULL,
  processed_at TIMESTAMPTZ,
  processing_status TEXT NOT NULL DEFAULT 'PENDING',
  last_error TEXT
);

Key point: id is Stripe event.id and must be unique.

If the insert conflicts, you already saw this event. Return 2xx and move on.
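The dedupe semantics can be sketched in memory; in Postgres, the same behavior is a single `INSERT ... ON CONFLICT (id) DO NOTHING` against stripe_ingest_events, checking whether a row was actually written. The function and store names here are illustrative, not Stripe APIs:

```typescript
// In-memory sketch of ingest dedupe. In production this is
// INSERT ... ON CONFLICT (id) DO NOTHING; the function returns true
// only the first time an event id is seen.
type StoredEvent = { id: string; type: string; rawBody: string };

const ingestStore = new Map<string, StoredEvent>();

function saveEventIfNew(event: StoredEvent): boolean {
  if (ingestStore.has(event.id)) return false; // duplicate delivery: ack and skip
  ingestStore.set(event.id, event);
  return true; // first delivery: enqueue for processing
}
```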

2) Idempotency at the domain level (not just transport)

A unique event.id row prevents duplicate queueing, but you also need idempotent business updates.

Example:

  • setting subscription status to active is idempotent,
  • incrementing "credits" blindly is not idempotent.

Use guard conditions:

// Guarded update: only non-active rows transition, so re-running is a no-op.
await db.subscription.updateMany({
  where: {
    accountId,
    status: { not: 'active' },
  },
  data: {
    status: 'active',
    activatedAt: new Date(),
  },
});

Treat every handler as if it can run multiple times.
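For counters like credits, one way to get idempotency is a ledger keyed by the Stripe event id: a retried handler re-inserts the same ledger row instead of incrementing twice, and the balance is derived from the ledger. A minimal in-memory sketch (the ledger and function names are hypothetical):

```typescript
// Idempotent credit grant: the ledger key is the Stripe event id, so a
// retried handler cannot apply the same grant twice.
const creditLedger = new Map<string, { accountId: string; amount: number }>();

function grantCredits(eventId: string, accountId: string, amount: number): void {
  if (creditLedger.has(eventId)) return; // already applied by a previous attempt
  creditLedger.set(eventId, { accountId, amount });
}

// Balance is a fold over the ledger, never a blind increment.
function creditBalance(accountId: string): number {
  let total = 0;
  for (const entry of creditLedger.values()) {
    if (entry.accountId === accountId) total += entry.amount;
  }
  return total;
}
```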

3) Queue processing with deterministic dedupe

Use a queue (BullMQ, SQS, etc.) and dedupe by job id:

await queue.add(
  'stripe-webhook',
  { eventId: stripeEvent.id },
  {
    jobId: stripeEvent.id, // duplicate adds with the same jobId are ignored
    removeOnComplete: 1000,
    removeOnFail: 1000,
  }
);

This gives you:

  • safe retry scheduling,
  • clear failure visibility,
  • controlled concurrency,
  • operational backpressure handling.

4) Retry policy: bounded, observable, and business-aware

Not all failures are equal.

Good baseline:

  • retry transient failures (network, 5xx, lock timeout),
  • do not retry deterministic failures forever (validation bugs),
  • cap max attempts,
  • expose "stuck" and "failed permanently" metrics.

Example backoff schedule:

  • 30s, 2m, 10m, 30m, 2h, 6h, 24h

Tie retries to revenue risk. A failed renewal event deserves higher alert priority than low-impact telemetry.
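The example schedule above translates to a small bounded-backoff function; exhausting the schedule means dead-lettering, not retrying forever. A sketch:

```typescript
// Delays matching the example schedule: 30s, 2m, 10m, 30m, 2h, 6h, 24h.
const BACKOFF_MS = [30_000, 120_000, 600_000, 1_800_000, 7_200_000, 21_600_000, 86_400_000];

// Returns the delay before the given retry attempt (1-based), or null when
// attempts are exhausted and the job should go to the dead-letter queue.
function backoffDelayMs(attempt: number): number | null {
  if (attempt < 1) return null;
  return BACKOFF_MS[attempt - 1] ?? null;
}
```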

5) Observability you need in production

Minimum fields per delivery/processing attempt:

  • event.id, event.type, account id
  • first seen time
  • attempt number
  • processing duration
  • result status (SUCCESS, RETRYING, FAILED)
  • normalized error class

Minimum dashboards:

  • success rate by event type,
  • pending queue age percentile,
  • failed events in last 1h/24h,
  • recovered vs unresolved impact.

Without this, incidents become Slack archaeology.
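One way to keep those fields queryable is a single structured record per attempt plus a coarse error classifier so dashboards group by failure class rather than message text. Field names and classes here are illustrative:

```typescript
// One structured record per processing attempt: the minimum shape the
// dashboards above can be built from. Field names are illustrative.
interface AttemptRecord {
  eventId: string;
  eventType: string;
  accountId: string | null;
  firstSeenAt: string;       // ISO timestamp
  attempt: number;
  durationMs: number;
  result: 'SUCCESS' | 'RETRYING' | 'FAILED';
  errorClass: string | null; // normalized class, not a raw stack trace
}

// Coarse normalization; the buckets drive retry-vs-dead-letter decisions.
function normalizeErrorClass(err: Error): string {
  const msg = err.message.toLowerCase();
  if (msg.includes('timeout') || msg.includes('econnreset')) return 'TRANSIENT_NETWORK';
  if (msg.includes('validation')) return 'DETERMINISTIC_VALIDATION';
  return 'UNKNOWN';
}
```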

6) Replay and reconciliation close the loop

Retries are reactive. Reconciliation is preventive.

Run periodic checks that compare:

  • Stripe truth (events + current payment/subscription state),
  • app truth (entitlements, billing flags, access state).

When divergence appears:

  1. classify severity,
  2. estimate impact,
  3. replay safely if applicable,
  4. track resolved vs unresolved impact.

This is where reliability work turns into direct revenue protection.
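The divergence check itself can be a pure function over the two sides of the comparison, which keeps it easy to test and to run in a periodic job. The severity labels are illustrative; the subscription statuses match Stripe's:

```typescript
type Severity = 'NONE' | 'REVENUE_AT_RISK' | 'OVER_ENTITLED';

// Compares Stripe truth (subscription status) with app truth (entitlement
// flag) for one account and classifies the drift. Severity labels are ours.
function classifyDrift(
  stripeStatus: 'active' | 'past_due' | 'canceled',
  appHasAccess: boolean
): Severity {
  if (stripeStatus === 'active' && !appHasAccess) return 'REVENUE_AT_RISK'; // paid, never provisioned
  if (stripeStatus === 'canceled' && appHasAccess) return 'OVER_ENTITLED';  // access never revoked
  return 'NONE';
}
```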

A practical Fastify + worker flow

// API route. Fastify parses JSON bodies by default, which breaks signature
// verification; register a parser that preserves the raw payload, e.g.:
// fastify.addContentTypeParser('application/json', { parseAs: 'buffer' },
//   (req, body, done) => done(null, body));
fastify.post('/stripe/webhook', async (req, reply) => {
  const rawBody = req.body as Buffer; // raw payload from the parser above
  const signature = req.headers['stripe-signature'] as string;

  // throws if the signature is invalid
  const event = stripe.webhooks.constructEvent(
    rawBody,
    signature,
    process.env.STRIPE_WEBHOOK_SECRET!
  );

  const inserted = await saveEventIfNew(event, req.headers, rawBody.toString()); // unique by event.id

  if (inserted) {
    await queue.add('stripe-webhook', { eventId: event.id }, { jobId: event.id });
  }

  return reply.status(204).send();
});
// Worker (BullMQ: the processor is passed to the Worker constructor)
const worker = new Worker('stripe-webhook', async (job) => {
  const event = await loadStoredEvent(job.data.eventId);
  if (!event) return;

  await processBusinessLogicIdempotently(event);
  await markProcessed(event.id);
}, { connection });

No blocking business logic in the request path. No duplicate side effects in the worker.

Production checklist

  • [ ] Signature verification uses raw body.
  • [ ] event.id uniqueness enforced at DB level.
  • [ ] Event persisted before returning 2xx.
  • [ ] Queue dedupe uses deterministic jobId.
  • [ ] Business handlers are idempotent.
  • [ ] Retry policy is bounded and observable.
  • [ ] Reconciliation detects Stripe-vs-app drift.
  • [ ] Replay workflow tested in staging.

If any item is missing, you still have a silent failure path.

Why this matters to growth (not just engineering)

Webhook correctness affects:

  • activation (paid users not provisioned),
  • churn (cancellations/refunds misapplied),
  • support load (billing tickets),
  • trust ("can we rely on this product?").

For most SaaS teams, webhook reliability is a revenue system.

Where this fits with our previous posts

  • Post 1 covered raw-body signature verification.
  • Post 2 covered finding lost Stripe webhooks.
  • Post 3 covered hidden revenue leakage patterns.
  • This post gives the implementation architecture that ties all three into a production pipeline.

If you want this operating model without building all internal tooling first, Revenue Recovery Autopilot gives you scanner + monitoring + recovery workflows on top of your Stripe integration.

Start with a free scan: https://katsuralabs.com

Revenue Recovery Autopilot detects broken webhooks that cost you money.

Join the early access waitlist →