URL: https://thenewstack.io/3-hour-cloudflare-outage-knocks-out-ai-chatbots-shopify/

3-Hour Cloudflare Outage Knocks Out AI Chatbots, Shopify - The New Stack


3-Hour Cloudflare Outage Knocks Out AI Chatbots, Shopify

A simple database permissions blunder at Cloudflare triggered a massive, hours-long outage, crippling sites like Shopify and services like ChatGPT.
Nov 20th, 2025 7:00am by Steven J. Vaughan-Nichols

On Nov. 18, 2025, Cloudflare experienced a major outage lasting several hours that disrupted access to numerous popular websites and online services worldwide. This was only the latest in a wave of major Internet service providers going down. Others have included Amazon Web Services and Azure, both in October. It’s becoming painfully clear that we rely all too much on a handful of cloud and network services companies.

However, there’s no single flaw here. In AWS’s case, it was ultimately (yes, you know this story) a Domain Name System (DNS) foul-up, while Azure’s failure was due to a mistaken configuration change. With Cloudflare, the root cause was a database permissions blunder. As a result, popular sites and services such as Shopify, Amazon, and Roblox failed, and essentially all AI chatbots, such as ChatGPT, Perplexity, and Anthropic’s Claude, were knocked out.

Root Cause: A Database Permissions Blunder

The outage was triggered not by a cyberattack, but by a software bug in Cloudflare’s Bot Management system. Specifically, a recent change to the permissions for a database query caused it to return many duplicate entries, which produced an overlarge “feature file” for the Bot Management module.

This file is usually a fixed size and regenerated every few minutes, but the bug caused the file to exceed expected limits, thereby crashing the Bot Management module repeatedly. Since this module is integral to Cloudflare’s core proxy pipeline, any traffic relying on it was affected, resulting in widespread 5xx errors.
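To make the failure mode concrete, here is a minimal Python sketch of a loader that treats an oversized feature file as fatal. It is an illustration only, not Cloudflare's code: the limit of 200, the feature names, and the function names are all assumptions, since the article gives no figures.

```python
MAX_FEATURES = 200  # hypothetical hard limit; the article does not give a number


class FeatureLimitExceeded(RuntimeError):
    """Raised when a feature file has more entries than the module allows."""


def load_features(rows, limit=MAX_FEATURES):
    """Load feature names the strict way: any oversized file is fatal."""
    if len(rows) > limit:
        raise FeatureLimitExceeded(f"{len(rows)} features exceeds limit {limit}")
    return list(rows)


# A healthy file: 120 distinct, made-up feature names.
good_rows = [f"feature_{i}" for i in range(120)]

# The same query after the permissions change, returning every row twice.
bad_rows = good_rows + good_rows

print("loaded", len(load_features(good_rows)), "features")

try:
    load_features(bad_rows)
except FeatureLimitExceeded as err:
    # In a proxy where this error is unhandled, every request that needs
    # the module fails, which is what shows up to users as 5xx errors.
    print("module crashed:", err)
```

The point of the sketch is that nothing about the duplicated file is malicious or even malformed. It is simply bigger than the consumer was built to accept.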

Outage Timeline and Resolution

The issues began around 11:20 UTC, with symptoms including elevated latency, access authentication failures, and error codes surfacing throughout Cloudflare’s core network. Initial confusion led some teams to suspect a large-scale DDoS attack, but this was ruled out once the root cause was identified as the corrupted feature file.

In the meantime, people all over the net, at work and at play, noticed trouble. Cisco ThousandEyes reported that while network paths to Cloudflare’s frontend infrastructure appeared clear of any elevated latency or packet loss, it observed a number of timeouts and HTTP 5XX server errors, which are indicative of a backend services issue. Ironically, even websites that monitor web outages, such as Downdetector, went down due to the Cloudflare failure.

Behind the scenes, Cloudflare explained, the feature file was being regenerated every five minutes by a query running on a ClickHouse database cluster, which was being gradually updated to improve permissions management. So, “every five minutes, there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.”
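The good-file, bad-file flapping described above can be modeled with a small simulation. This is a sketch under stated assumptions, not a reconstruction of Cloudflare's cluster: the node count, the upgrade pace, and the idea that each run lands on a random node are all made up for illustration.

```python
import random


def simulate_rollout(num_nodes=8, cycles=30, cycles_per_upgrade=3, seed=0):
    """Return 'good' or 'bad' for each five-minute regeneration cycle.

    Assumptions (all hypothetical): one more node gets the permissions
    change every `cycles_per_upgrade` cycles, and each regeneration query
    runs on a randomly chosen node. An upgraded node yields a bad file.
    """
    rng = random.Random(seed)
    history = []
    for cycle in range(cycles):
        upgraded = min(num_nodes, cycle // cycles_per_upgrade)
        node = rng.randrange(num_nodes)  # nodes 0..upgraded-1 are upgraded
        history.append("bad" if node < upgraded else "good")
    return history


history = simulate_rollout()
# One letter per cycle: G for a good file, B for a bad one.
print("".join("B" if state == "bad" else "G" for state in history))
```

Early cycles are always good, the middle of the rollout is a coin flip that gets steadily worse, and once every node is upgraded every cycle is bad. That intermittent middle phase is one reason the incident could initially look like an attack rather than a deterministic bug.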

“Eventually,” Cloudflare continued, “every ClickHouse node was generating the bad configuration file and the fluctuation stabilized in the failing state.” The fix was to stop “the generation and propagation of the bad feature file and manually insert a known good file into the feature file distribution queue. And then forcing a restart of our core proxy.”

Fortunately, Cloudflare’s engineers halted the generation and propagation of the bad files relatively quickly. By 14:24 UTC, Cloudflare had rolled back to a previously stable version. Core traffic largely normalized by 14:30 UTC, with full system restoration completed by 17:06 UTC.

Cascading Effects on Ancillary Systems

As is always the case with such things, one problem cascaded into another. Ancillary Cloudflare systems were also affected. These included Workers KV storage and Cloudflare Access, which depend on the core proxy and suffered increased error rates and login disruptions. The Cloudflare Dashboard login was severely affected because Turnstile, Cloudflare’s CAPTCHA service, failed to load correctly. It also didn’t help that CPU usage surged as internal debugging systems worked overtime to diagnose uncaught errors, which slowed the content delivery network (CDN) down.

Altogether, the main outage lasted about three hours, followed by a period of recovery and then final stabilization after full remediation. Some clients experienced longer disruptions due to backlogs and retry storms as services returned to life.

Cloudflare’s Commitment to Preventing Future Outages

Looking ahead, Cloudflare has committed to several measures to prevent recurrence. These include:

  • Harden ingestion of configuration files with validation similar to that applied to user inputs.
  • Implement global kill switches for problematic features to rapidly isolate issues.
  • Eliminate scenarios where error reports or core dumps could overwhelm resources.
  • Conduct thorough reviews of failure modes across all core proxy modules.
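As a rough illustration of the first two measures, the following Python sketch validates a generated feature file the way one would validate user input, keeps the last known good file when validation fails, and exposes a kill switch for the module. Every name, limit, and check here is hypothetical. Cloudflare has not published its implementation in this article.

```python
from dataclasses import dataclass, field

MAX_FEATURES = 200  # hypothetical limit; the article does not give a number


def validate(rows, limit=MAX_FEATURES):
    """Return a list of problems found in a generated feature file."""
    problems = []
    if not rows:
        problems.append("file is empty")
    if len(rows) > limit:
        problems.append(f"{len(rows)} entries exceeds limit {limit}")
    if len(set(rows)) != len(rows):
        problems.append("duplicate entries")
    return problems


@dataclass
class BotModule:
    enabled: bool = True                          # global kill switch
    features: list = field(default_factory=list)  # last known good file

    def ingest(self, rows):
        """Accept a new file only if it validates; otherwise keep the old one."""
        problems = validate(rows)
        if not problems:
            self.features = list(rows)
        return problems

    def handle(self, request):
        """Placeholder for scoring; bypassed entirely when switched off."""
        if not self.enabled:
            return "bot management bypassed"
        return f"scored with {len(self.features)} features"


module = BotModule()
good = [f"feature_{i}" for i in range(120)]
module.ingest(good)

rejected = module.ingest(good * 2)   # duplicated rows, as in the outage
print("rejected because:", rejected)
print(module.handle("GET /"))        # still serving with the old file

module.enabled = False               # operator flips the kill switch
print(module.handle("GET /"))
```

The design choice worth noting is that a bad file degrades nothing in this sketch: the module keeps running on stale but valid data, and operators retain a way to take it out of the request path altogether.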

That’s all well and good, but this failure, when considered alongside other recent Internet outages, has underscored just how fragile today’s Internet is. True, external attacks, such as terabyte-sized Distributed Denial of Service (DDoS) attacks, which can cascade into global service outages for millions of users, are also a real problem. But even without such attacks, these system failures are raising important questions about just how safe critical cloud infrastructure really is.

Steven J. Vaughan-Nichols, aka sjvn, has been writing about technology and the business of technology since CP/M-80 was the cutting-edge PC operating system, 300bps was a fast internet connection, WordStar was the state-of-the-art word processor, and we liked it.
TNS owner Insight Partners is an investor in: Root, ClickHouse, Anthropic.