Stripe webhooks in production: retries, idempotency, and why 200 OK isn't enough
Most teams think the webhook problem is solved when they stop getting signature errors.
It is not.
The hard problems start after signature verification:
- retries and duplicate deliveries,
- partial failures after 200 OK,
- state drift between Stripe and your app,
- no safe replay when incidents happen.
If you process subscription revenue, these are not edge cases. They are normal operating conditions.
Who this is for
This article is for:
- backend/platform engineers running Stripe-based SaaS billing,
- technical founders between first revenue and scale,
- teams that already "handle webhooks" but still get billing-state incidents.
The goal is simple: ship a webhook pipeline that is correct under failure, not only on the happy path.
The core misconception: 200 OK means success
Stripe webhook delivery is at least once. Your system must tolerate duplicates and retries.
A successful HTTP response only means your endpoint acknowledged receipt. It does not guarantee your business logic completed safely.
Common failure timeline:
- Stripe sends checkout.session.completed.
- Your API verifies signature.
- Your handler starts DB writes.
- A timeout/deploy/network issue interrupts work.
- You already returned 200, so Stripe stops retrying.
- Payment succeeded in Stripe, but entitlement was never provisioned.
From the customer perspective: "I paid and your product did nothing."
Production architecture that actually works
Use this pattern:
- Verify signature on raw body.
- Persist event durably (including raw body + headers).
- Return 2xx fast.
- Process asynchronously in a worker.
- Enforce idempotency in storage and business handlers.
- Retry with backoff and dead-letter visibility.
This decouples delivery reliability from business processing latency.
1) Durable ingest before acknowledgment
Create an ingestion table and write to it before any heavy logic:
CREATE TABLE stripe_ingest_events (
  id TEXT PRIMARY KEY,
  type TEXT NOT NULL,
  received_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
  headers JSONB NOT NULL,
  raw_body TEXT NOT NULL,
  processed_at TIMESTAMPTZ,
  processing_status TEXT NOT NULL DEFAULT 'PENDING',
  last_error TEXT
);
Key point: id is Stripe event.id and must be unique.
If the insert conflicts, you already saw this event. Return 2xx and move on.
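The insert-or-skip step can be sketched as follows. This is a minimal sketch, not a definitive implementation: the `Queryable` interface is a hypothetical abstraction modeled on the `query(text, values)` call and `rowCount` result of node-postgres, and `saveEventIfNew` here takes the database handle as an extra first argument for testability. The SQL relies on PostgreSQL's `ON CONFLICT ... DO NOTHING` against the table above.

```typescript
// Hypothetical minimal SQL client shape, modeled on node-postgres' query()/rowCount.
interface Queryable {
  query(text: string, values: unknown[]): Promise<{ rowCount: number | null }>;
}

// Only the fields this step needs from a verified Stripe event.
interface IncomingStripeEvent {
  id: string;   // Stripe event.id
  type: string; // e.g. "checkout.session.completed"
}

const INSERT_EVENT_SQL = `
  INSERT INTO stripe_ingest_events (id, type, headers, raw_body)
  VALUES ($1, $2, $3::jsonb, $4)
  ON CONFLICT (id) DO NOTHING`;

// Returns true when the event was stored for the first time,
// false when a row with the same id already existed (duplicate delivery).
async function saveEventIfNew(
  db: Queryable,
  event: IncomingStripeEvent,
  headers: Record<string, unknown>,
  rawBody: string,
): Promise<boolean> {
  const result = await db.query(INSERT_EVENT_SQL, [
    event.id,
    event.type,
    JSON.stringify(headers),
    rawBody,
  ]);
  // ON CONFLICT DO NOTHING affects zero rows when the id is already present.
  return result.rowCount === 1;
}
```

Letting the primary key decide what counts as new avoids a separate "select, then insert" check, which would race when two deliveries of the same event arrive at once.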
2) Idempotency at the domain level (not just transport)
A unique event.id row prevents duplicate queueing, but you also need idempotent business updates.
Example:
- setting subscription status to active is idempotent,
- incrementing "credits" blindly is not idempotent.
Use guard conditions:
await db.subscription.updateMany({
  where: {
    accountId,
    status: { not: 'active' },
  },
  data: {
    status: 'active',
    activatedAt: new Date(),
  },
});
Treat every handler as if it can run multiple times.
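For the non-idempotent case (incrementing credits), one common approach is to record which event ids have already been applied and skip the increment on a repeat. The sketch below illustrates the idea with an in-memory class; `CreditLedger` and its method names are hypothetical, and in a real database the "mark applied" write and the balance update would need to share one transaction.

```typescript
// In-memory stand-in for two tables: applied event ids and account balances.
// Assumption: a real implementation performs both writes in a single transaction.
class CreditLedger {
  private readonly appliedEventIds = new Set<string>();
  private readonly balances = new Map<string, number>();

  // Applies the grant at most once per Stripe event id.
  // Returns true if credits were added, false if the event was already applied.
  grantCredits(eventId: string, accountId: string, credits: number): boolean {
    if (this.appliedEventIds.has(eventId)) {
      return false; // duplicate delivery or worker retry: no second increment
    }
    this.appliedEventIds.add(eventId);
    this.balances.set(accountId, this.balanceOf(accountId) + credits);
    return true;
  }

  // Current balance, or 0 for an account that has never received credits.
  balanceOf(accountId: string): number {
    return this.balances.get(accountId) ?? 0;
  }
}
```

Calling `grantCredits("evt_1", "acct_1", 100)` twice leaves the balance at 100, because the second call sees the event id and returns early.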
3) Queue processing with deterministic dedupe
Use a queue (BullMQ, SQS, etc.) and dedupe by job id:
await queue.add(
  'stripe-webhook',
  { eventId: stripeEvent.id },
  {
    jobId: stripeEvent.id,
    removeOnComplete: 1000,
    removeOnFail: 1000,
  }
);
This gives you:
- safe retry scheduling,
- clear failure visibility,
- controlled concurrency,
- operational backpressure handling.
4) Retry policy: bounded, observable, and business-aware
Not all failures are equal.
Good baseline:
- retry transient failures (network, 5xx, lock timeout),
- do not retry deterministic failures forever (validation bugs),
- cap max attempts,
- expose "stuck" and "failed permanently" metrics.
Example backoff schedule:
- 30s, 2m, 10m, 30m, 2h, 6h, 24h
Tie retries to revenue risk. A failed renewal event deserves higher alert priority than low-impact telemetry.
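The baseline above can be expressed as a small decision function. This is a sketch under stated assumptions: the error class names and the `decideRetry` helper are hypothetical, and the delays simply mirror the example schedule. Mapping real driver or HTTP errors onto these classes is left to the caller.

```typescript
const SECOND = 1000;
const MINUTE = 60 * SECOND;
const HOUR = 60 * MINUTE;

// Mirrors the example schedule: 30s, 2m, 10m, 30m, 2h, 6h, 24h.
const BACKOFF_SCHEDULE_MS: readonly number[] = [
  30 * SECOND, 2 * MINUTE, 10 * MINUTE, 30 * MINUTE, 2 * HOUR, 6 * HOUR, 24 * HOUR,
];

// Hypothetical normalized error classes.
type ErrorClass = "NETWORK" | "UPSTREAM_5XX" | "LOCK_TIMEOUT" | "VALIDATION";

const TRANSIENT_CLASSES: readonly ErrorClass[] = ["NETWORK", "UPSTREAM_5XX", "LOCK_TIMEOUT"];

type RetryDecision =
  | { action: "RETRY"; delayMs: number }
  | { action: "FAIL_PERMANENTLY"; reason: string };

// failedAttempts is the number of attempts that have failed so far (1 after the first failure).
function decideRetry(errorClass: ErrorClass, failedAttempts: number): RetryDecision {
  if (TRANSIENT_CLASSES.indexOf(errorClass) === -1) {
    // Deterministic failures will fail the same way again, so do not retry them.
    return { action: "FAIL_PERMANENTLY", reason: "non-retryable error class: " + errorClass };
  }
  const index = Math.max(failedAttempts, 1) - 1;
  const delayMs = BACKOFF_SCHEDULE_MS[index];
  if (delayMs === undefined) {
    // The schedule is exhausted: cap attempts instead of retrying forever.
    return { action: "FAIL_PERMANENTLY", reason: "max attempts exhausted" };
  }
  return { action: "RETRY", delayMs };
}
```

A `FAIL_PERMANENTLY` result is the natural place to increment the "failed permanently" metric and route the event to a dead-letter view.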
5) Observability you need in production
Minimum fields per delivery/processing attempt:
- event.id, event.type, account id
- first seen time
- attempt number
- processing duration
- result status (SUCCESS, RETRYING, FAILED)
- normalized error class
Minimum dashboards:
- success rate by event type,
- pending queue age percentile,
- failed events in last 1h/24h,
- recovered vs unresolved impact.
Without this, incidents become Slack archaeology.
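As an illustration, the per-attempt fields listed above could be captured in one record type, and the "success rate by event type" metric derived from those records. The field names and the `successRateByType` helper are assumptions made for this sketch; the rate is defined here as the share of attempts that ended in SUCCESS.

```typescript
type AttemptStatus = "SUCCESS" | "RETRYING" | "FAILED";

// One record per processing attempt (hypothetical field names).
interface AttemptRecord {
  eventId: string;
  eventType: string;
  accountId: string | null;  // null when the event cannot be mapped to an account
  firstSeenAt: string;       // ISO 8601 timestamp of the first delivery
  attempt: number;           // 1-based attempt number
  durationMs: number;
  status: AttemptStatus;
  errorClass: string | null; // normalized error class; null on success
}

// Share of attempts per event type that ended in SUCCESS, as a value between 0 and 1.
function successRateByType(records: readonly AttemptRecord[]): Map<string, number> {
  const totals = new Map<string, { ok: number; all: number }>();
  for (const record of records) {
    const counts = totals.get(record.eventType) ?? { ok: 0, all: 0 };
    counts.all += 1;
    if (record.status === "SUCCESS") counts.ok += 1;
    totals.set(record.eventType, counts);
  }
  const rates = new Map<string, number>();
  totals.forEach((counts, eventType) => rates.set(eventType, counts.ok / counts.all));
  return rates;
}
```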
6) Replay and reconciliation close the loop
Retries only cover failures the pipeline noticed. Reconciliation catches the drift that no retry ever saw.
Run periodic checks that compare:
- Stripe truth (events + current payment/subscription state),
- app truth (entitlements, billing flags, access state).
When divergence appears:
- classify severity,
- estimate impact,
- replay safely if applicable,
- track resolved vs unresolved impact.
This is where reliability work turns into direct revenue protection.
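A reconciliation pass can start as a pure comparison between two snapshots. The sketch below is illustrative only: the snapshot shapes, the divergence kinds, and the severity mapping are assumptions, not Stripe API types, and it handles only a subset of subscription statuses. Fetching the Stripe side and the app side is out of scope here.

```typescript
// Simplified, hypothetical snapshots of each side's view of an account.
interface StripeSubscriptionSnapshot {
  accountId: string;
  status: "active" | "past_due" | "canceled";
}

interface AppEntitlementSnapshot {
  accountId: string;
  hasAccess: boolean;
}

interface Divergence {
  accountId: string;
  kind: "PAID_WITHOUT_ACCESS" | "ACCESS_WITHOUT_PAYMENT";
  severity: "HIGH" | "MEDIUM";
}

// Compares Stripe truth with app truth and lists accounts where they disagree.
function findDivergences(
  subscriptions: readonly StripeSubscriptionSnapshot[],
  entitlements: readonly AppEntitlementSnapshot[],
): Divergence[] {
  const accessByAccount = new Map<string, boolean>();
  for (const entitlement of entitlements) {
    accessByAccount.set(entitlement.accountId, entitlement.hasAccess);
  }

  const divergences: Divergence[] = [];
  for (const subscription of subscriptions) {
    const hasAccess = accessByAccount.get(subscription.accountId) ?? false;
    if (subscription.status === "active" && !hasAccess) {
      // The customer is paying but locked out: treated as the most urgent case.
      divergences.push({ accountId: subscription.accountId, kind: "PAID_WITHOUT_ACCESS", severity: "HIGH" });
    } else if (subscription.status === "canceled" && hasAccess) {
      divergences.push({ accountId: subscription.accountId, kind: "ACCESS_WITHOUT_PAYMENT", severity: "MEDIUM" });
    }
    // past_due is skipped on purpose: whether access continues is a policy decision.
  }
  return divergences;
}
```

Each divergence can then feed the classify, estimate, replay, and track steps above, with replay going through the same idempotent handlers as live events.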
A practical Fastify + worker flow
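import type Stripe from 'stripe';
import { Worker } from 'bullmq';

// API route
// Assumes `fastify`, `stripe`, `queue`, and the helper functions are initialized elsewhere.
// Stripe signs the exact bytes it sent, so keep the JSON body as an unparsed Buffer.
// Register this parser in the plugin scope that contains only the webhook route.
fastify.addContentTypeParser(
  'application/json',
  { parseAs: 'buffer' },
  (_req, body, done) => done(null, body)
);

fastify.post('/stripe/webhook', async (req, reply) => {
  const rawBody = req.body as Buffer;
  const signature = req.headers['stripe-signature'] as string;
  let event: Stripe.Event;
  try {
    event = stripe.webhooks.constructEvent(
      rawBody,
      signature,
      process.env.STRIPE_WEBHOOK_SECRET!
    );
  } catch {
    // Invalid or missing signature: reject without storing anything.
    return reply.status(400).send();
  }
  const inserted = await saveEventIfNew(event, req.headers, rawBody.toString('utf8')); // unique by event.id
  if (inserted) {
    // If this enqueue fails after the insert, the row stays PENDING until something re-enqueues it.
    await queue.add('stripe-webhook', { eventId: event.id }, { jobId: event.id });
  }
  return reply.status(204).send();
});

// Worker (BullMQ style; the queue name 'stripe-webhooks' and `connection` are assumptions)
const worker = new Worker('stripe-webhooks', async (job) => {
  const event = await loadStoredEvent(job.data.eventId);
  if (!event) return;
  await processBusinessLogicIdempotently(event);
  await markProcessed(event.id);
}, { connection });
No blocking business logic in the request path. No duplicate side effects in the worker.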
Production checklist
- [ ] Signature verification uses raw body.
- [ ] event.id uniqueness enforced at DB level.
- [ ] Event persisted before returning 2xx.
- [ ] Queue dedupe uses deterministic jobId.
- [ ] Business handlers are idempotent.
- [ ] Retry policy is bounded and observable.
- [ ] Reconciliation detects Stripe-vs-app drift.
- [ ] Replay workflow tested in staging.
If any item is missing, you still have a silent failure path.
Why this matters to growth (not just engineering)
Webhook correctness affects:
- activation (paid users not provisioned),
- churn (cancellations/refunds misapplied),
- support load (billing tickets),
- trust ("can we rely on this product?").
For most SaaS teams, webhook reliability is a revenue system.
Where this fits with our previous posts
- Post 1 covered raw-body signature verification.
- Post 2 covered finding lost Stripe webhooks.
- Post 3 covered hidden revenue leakage patterns.
- This post gives the implementation architecture that ties all three into a production pipeline.
If you want this operating model without building all internal tooling first, Revenue Recovery Autopilot gives you scanner + monitoring + recovery workflows on top of your Stripe integration.
Start with a free scan: https://katsuralabs.com