
Stripe webhooks in production: retries, idempotency, and why 200 OK isn't enough

Most teams think the webhook problem is solved when they stop getting signature errors.

It is not.

The hard problems start after signature verification:

  • retries and duplicate deliveries,
  • partial failures after 200 OK,
  • state drift between Stripe and your app,
  • no safe replay when incidents happen.

If you process subscription revenue, these are not edge cases. They are normal operating conditions.

Who this is for

This article is for:

  • backend/platform engineers running Stripe-based SaaS billing,
  • technical founders between first revenue and scale,
  • teams that already "handle webhooks" but still get billing-state incidents.

The goal is simple: ship a webhook pipeline that is correct under failure, not just on the happy path.

The core misconception: 200 OK means success

Stripe delivers webhooks at least once, so your system must tolerate duplicates and retries.

A successful HTTP response only means your endpoint acknowledged receipt. It does not guarantee your business logic completed safely.

Common failure timeline:

  1. Stripe sends checkout.session.completed.
  2. Your API verifies signature.
  3. Your handler starts DB writes.
  4. A timeout/deploy/network issue interrupts work.
  5. You already returned 200, so Stripe stops retrying.
  6. Payment succeeded in Stripe, but entitlement was never provisioned.

From the customer perspective: "I paid and your product did nothing."

Production architecture that actually works

Use this pattern:

  1. Verify signature on raw body.
  2. Persist event durably (including raw body + headers).
  3. Return 2xx fast.
  4. Process asynchronously in a worker.
  5. Enforce idempotency in storage and business handlers.
  6. Retry with backoff and dead-letter visibility.

This decouples delivery reliability from business processing latency.

1) Durable ingest before acknowledgment

Create an ingestion table and write to it before any heavy logic:

CREATE TABLE stripe_ingest_events (
  id TEXT PRIMARY KEY,
  type TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  headers JSONB NOT NULL,
  raw_body TEXT NOT NULL,
  processed_at TIMESTAMPTZ,
  processing_status TEXT NOT NULL DEFAULT 'PENDING',
  last_error TEXT
);

Key point: id is Stripe event.id and must be unique.

If the insert conflicts, you already saw this event. Return 2xx and move on.
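The dedupe semantics can be sketched in memory; in Postgres, the same behavior is a single `INSERT ... ON CONFLICT (id) DO NOTHING` against stripe_ingest_events, checking whether a row was actually written. The function and store names here are illustrative, not Stripe APIs:

```typescript
// In-memory sketch of ingest dedupe. In production this is
// INSERT ... ON CONFLICT (id) DO NOTHING; the function returns true
// only the first time an event id is seen.
type StoredEvent = { id: string; type: string; rawBody: string };

const ingestStore = new Map<string, StoredEvent>();

function saveEventIfNew(event: StoredEvent): boolean {
  if (ingestStore.has(event.id)) return false; // duplicate delivery: ack and skip
  ingestStore.set(event.id, event);
  return true; // first delivery: enqueue for processing
}
```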

2) Idempotency at the domain level (not just transport)

A unique event.id row prevents duplicate queueing, but you also need idempotent business updates.

Example:

  • setting subscription status to active is idempotent,
  • incrementing "credits" blindly is not idempotent.

Use guard conditions:

// Guarded update: only non-active rows transition, so re-running is a no-op.
await db.subscription.updateMany({
  where: {
    accountId,
    status: { not: 'active' },
  },
  data: {
    status: 'active',
    activatedAt: new Date(),
  },
});

Treat every handler as if it can run multiple times.
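For counters like credits, one way to get idempotency is a ledger keyed by the Stripe event id: a retried handler re-inserts the same ledger row instead of incrementing twice, and the balance is derived from the ledger. A minimal in-memory sketch (the ledger and function names are hypothetical):

```typescript
// Idempotent credit grant: the ledger key is the Stripe event id, so a
// retried handler cannot apply the same grant twice.
const creditLedger = new Map<string, { accountId: string; amount: number }>();

function grantCredits(eventId: string, accountId: string, amount: number): void {
  if (creditLedger.has(eventId)) return; // already applied by a previous attempt
  creditLedger.set(eventId, { accountId, amount });
}

// Balance is a fold over the ledger, never a blind increment.
function creditBalance(accountId: string): number {
  let total = 0;
  for (const entry of creditLedger.values()) {
    if (entry.accountId === accountId) total += entry.amount;
  }
  return total;
}
```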

3) Queue processing with deterministic dedupe

Use a queue (BullMQ, SQS, etc.) and dedupe by job id:

await queue.add(
  'stripe-webhook',
  { eventId: stripeEvent.id },
  {
    jobId: stripeEvent.id, // duplicate adds with the same jobId are ignored
    removeOnComplete: 1000,
    removeOnFail: 1000,
  }
);

This gives you:

  • safe retry scheduling,
  • clear failure visibility,
  • controlled concurrency,
  • operational backpressure handling.

4) Retry policy: bounded, observable, and business-aware

Not all failures are equal.

Good baseline:

  • retry transient failures (network, 5xx, lock timeout),
  • do not retry deterministic failures forever (validation bugs),
  • cap max attempts,
  • expose "stuck" and "failed permanently" metrics.

Example backoff schedule:

  • 30s, 2m, 10m, 30m, 2h, 6h, 24h

Tie retries to revenue risk. A failed renewal event deserves higher alert priority than low-impact telemetry.
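The example schedule above translates to a small bounded-backoff function; exhausting the schedule means dead-lettering, not retrying forever. A sketch:

```typescript
// Delays matching the example schedule: 30s, 2m, 10m, 30m, 2h, 6h, 24h.
const BACKOFF_MS = [30_000, 120_000, 600_000, 1_800_000, 7_200_000, 21_600_000, 86_400_000];

// Returns the delay before the given retry attempt (1-based), or null when
// attempts are exhausted and the job should go to the dead-letter queue.
function backoffDelayMs(attempt: number): number | null {
  if (attempt < 1) return null;
  return BACKOFF_MS[attempt - 1] ?? null;
}
```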

5) Observability you need in production

Minimum fields per delivery/processing attempt:

  • event.id, event.type, account id
  • first seen time
  • attempt number
  • processing duration
  • result status (SUCCESS, RETRYING, FAILED)
  • normalized error class

Minimum dashboards:

  • success rate by event type,
  • pending queue age percentile,
  • failed events in last 1h/24h,
  • recovered vs unresolved impact.

Without this, incidents become Slack archaeology.
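One way to keep those fields queryable is a single structured record per attempt plus a coarse error classifier so dashboards group by failure class rather than message text. Field names and classes here are illustrative:

```typescript
// One structured record per processing attempt: the minimum shape the
// dashboards above can be built from. Field names are illustrative.
interface AttemptRecord {
  eventId: string;
  eventType: string;
  accountId: string | null;
  firstSeenAt: string;       // ISO timestamp
  attempt: number;
  durationMs: number;
  result: 'SUCCESS' | 'RETRYING' | 'FAILED';
  errorClass: string | null; // normalized class, not a raw stack trace
}

// Coarse normalization; the buckets drive retry-vs-dead-letter decisions.
function normalizeErrorClass(err: Error): string {
  const msg = err.message.toLowerCase();
  if (msg.includes('timeout') || msg.includes('econnreset')) return 'TRANSIENT_NETWORK';
  if (msg.includes('validation')) return 'DETERMINISTIC_VALIDATION';
  return 'UNKNOWN';
}
```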

6) Replay and reconciliation close the loop

Retries are reactive. Reconciliation is preventive.

Run periodic checks that compare:

  • Stripe truth (events + current payment/subscription state),
  • app truth (entitlements, billing flags, access state).

When divergence appears:

  1. classify severity,
  2. estimate impact,
  3. replay safely if applicable,
  4. track resolved vs unresolved impact.

This is where reliability work turns into direct revenue protection.
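The divergence check itself can be a pure function over the two sides of the comparison, which keeps it easy to test and to run in a periodic job. The severity labels are illustrative; the subscription statuses match Stripe's:

```typescript
type Severity = 'NONE' | 'REVENUE_AT_RISK' | 'OVER_ENTITLED';

// Compares Stripe truth (subscription status) with app truth (entitlement
// flag) for one account and classifies the drift. Severity labels are ours.
function classifyDrift(
  stripeStatus: 'active' | 'past_due' | 'canceled',
  appHasAccess: boolean
): Severity {
  if (stripeStatus === 'active' && !appHasAccess) return 'REVENUE_AT_RISK'; // paid, never provisioned
  if (stripeStatus === 'canceled' && appHasAccess) return 'OVER_ENTITLED';  // access never revoked
  return 'NONE';
}
```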

A practical Fastify + worker flow

// API route. Fastify parses JSON bodies by default, which breaks signature
// verification; register a parser that preserves the raw payload, e.g.:
// fastify.addContentTypeParser('application/json', { parseAs: 'buffer' },
//   (req, body, done) => done(null, body));
fastify.post('/stripe/webhook', async (req, reply) => {
  const rawBody = req.body as Buffer; // raw payload from the parser above
  const signature = req.headers['stripe-signature'] as string;

  // throws if the signature is invalid
  const event = stripe.webhooks.constructEvent(
    rawBody,
    signature,
    process.env.STRIPE_WEBHOOK_SECRET!
  );

  const inserted = await saveEventIfNew(event, req.headers, rawBody.toString()); // unique by event.id

  if (inserted) {
    await queue.add('stripe-webhook', { eventId: event.id }, { jobId: event.id });
  }

  return reply.status(204).send();
});
// Worker (BullMQ: the processor is passed to the Worker constructor)
const worker = new Worker('stripe-webhook', async (job) => {
  const event = await loadStoredEvent(job.data.eventId);
  if (!event) return;

  await processBusinessLogicIdempotently(event);
  await markProcessed(event.id);
}, { connection });

No blocking business logic in the request path. No duplicate side effects in the worker.

Production checklist

  • [ ] Signature verification uses raw body.
  • [ ] event.id uniqueness enforced at DB level.
  • [ ] Event persisted before returning 2xx.
  • [ ] Queue dedupe uses deterministic jobId.
  • [ ] Business handlers are idempotent.
  • [ ] Retry policy is bounded and observable.
  • [ ] Reconciliation detects Stripe-vs-app drift.
  • [ ] Replay workflow tested in staging.

If any item is missing, you still have a silent failure path.

Why this matters to growth (not just engineering)

Webhook correctness affects:

  • activation (paid users not provisioned),
  • churn (cancellations/refunds misapplied),
  • support load (billing tickets),
  • trust ("can we rely on this product?").

For most SaaS teams, webhook reliability is a revenue system.

Where this fits with our previous posts

  • Post 1 covered raw-body signature verification.
  • Post 2 covered finding lost Stripe webhooks.
  • Post 3 covered hidden revenue leakage patterns.
  • This post gives the implementation architecture that ties all three into a production pipeline.

If you want this operating model without building all internal tooling first, Revenue Recovery Autopilot gives you scanner + monitoring + recovery workflows on top of your Stripe integration.

Start with a free scan: https://katsuralabs.com

Revenue Recovery Autopilot detects broken webhooks that cost you money.

Join the early access waitlist →