URL: https://phabricator.wikimedia.org/T422130

Database servers in cluster(number) are overloaded
Open, Medium · Public · Bug Report

Description

Error

Trying to undelete https://commons.wikimedia.org/w/index.php?title=File:Tajiks_of_Uzbekistan.PNG
and I repeatedly get

Sorry! This site is experiencing technical difficulties.
Try waiting a few minutes and refreshing.

(Cannot access the database: Cannot access the database: Database servers in cluster28 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds.)

Impact

File cannot be undeleted.
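
For readers unfamiliar with the "circuit breaking" mentioned in the error message, the general pattern can be sketched as below. This is only an illustration of the generic technique, not MediaWiki's or Wikimedia's actual implementation; the class name, the thresholds, and the `clock` parameter are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching the backend."""


class CircuitBreaker:
    """Generic circuit breaker sketch (hypothetical names and thresholds)."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling more load on the backend.
                raise CircuitOpenError("backend overloaded, try again shortly")
            # Cool-down elapsed: allow a trial call; one failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

While the circuit is open, callers get an immediate error rather than waiting on an overloaded server, which matches the "please try again" behaviour users saw here.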

Event Timeline

Comment Actions

For https://commons.wikimedia.org/w/index.php?title=File:DepEd_Undersecretary_Michael_T._Poa.jpg
I get

(Cannot access the database: Cannot access the database: Database servers in cluster30 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds.)
MBH renamed this task from "Database servers in cluster28 are overloaded" to "Database servers in cluster(number) are overloaded". Thu, Apr 2, 10:30 AM
Comment Actions

Many clusters are affected, e.g. 26 and 31, even when just opening pages to read.

Comment Actions

I've been experiencing these errors intermittently on English Wikipedia today, but only on trying to save edits. Each time trying again has resulted in the save being successful.

Aklapper changed the subtype of this task from "Production Error" to "Bug Report". Thu, Apr 2, 10:42 AM
Aklapper updated the task description.
Wellverywell triaged this task as Unbreak Now! priority. Thu, Apr 2, 10:48 AM
Wellverywell subscribed.
Comment Actions

I experienced such errors when diffing and saving edits.

Comment Actions

Should I expect the coming backport window to be cancelled or delayed due to this incident?

Comment Actions

> Should I expect the coming backport window to be cancelled or delayed due to this incident?

Very likely yes. A deployment won't take place unless incident responders are comfortable it won't affect or distract from the incident.

Comment Actions

>> Should I expect the coming backport window to be cancelled or delayed due to this incident?

> Very likely yes. A deployment won't take place unless incident responders are comfortable it won't affect or distract from the incident.

Thanks for the info, I've rescheduled my backports.

Comment Actions

I've just encountered what I presume is the same error, this time when trying to use the reply tool:
[6a4d47bf-961e-4513-9b1f-c6970e11f156] Caught exception of type Wikimedia\Rdbms\DBConnectionError
I know the user-unfriendliness of that error message is a separate issue, but I'm not sure where to document that.

Comment Actions

We are hopeful the situation has improved now that codfw has been repooled, which adds capacity. The root cause of the circuit breaking is still being investigated.

MoritzMuehlenhoff lowered the priority of this task from Unbreak Now! to Medium. Thu, Apr 2, 2:37 PM
MoritzMuehlenhoff subscribed.
Comment Actions

The immediate impact has been mitigated, so I'm reducing the priority. The task might still be used to collect follow-ups.

Comment Actions

FWIW, I'm still currently encountering this error on frwiki, and it prevents my local custom JS/CSS files from loading.

Unexpectedly not loaded:

  • , , , …

Not impacted — loading as expected:

  • , , , …
  • ,
Comment Actions

> FWIW, I'm still currently encountering this error on frwiki, and it prevents my local custom JS/CSS files from loading.
>
> Unexpectedly not loaded:
>
>   • , , , …
>
> Not impacted — loading as expected:
>
>   • , , , …
>   • ,

Please let us know if you are still experiencing this issue

Comment Actions

Right now, I’m still seeing the JS error in the console.

Interestingly, and are still not loading, but is now loading (and I don’t have a page).

Comment Actions

I'm still seeing the issue. The UUID is always the same, so I'm posting it here in case it helps:

[e0e9c2f5-9aa0-47a2-92a1-6f9e523708fe] 2026-04-02 11:37:52: Fatal exception of type "Wikimedia\Rdbms\DBConnectionError"

The failing network request is consistently this URL:
https://fr.wikipedia.org/w/load.php?lang=fr&modules=user&skin=vector&user=Od1n&version=8ea0b

The error only occurs with this specific value; if I change or remove the version parameter, the request succeeds.
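
The comparison described above (same URL, with the `version` parameter changed or removed) can be reproduced with a small helper like the one below. It is a minimal sketch that only builds the URL variants and makes no requests. The function names and the replacement value `test` are hypothetical, and that the cached entry is keyed on the full query string is an assumption, not something confirmed in this task.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def without_param(url, name):
    """Return url with every occurrence of query parameter `name` removed."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != name]
    return urlunsplit(parts._replace(query=urlencode(query)))


def with_param(url, name, value):
    """Return url with query parameter `name` set to `value` (appended last)."""
    parts = urlsplit(without_param(url, name))
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


failing = ("https://fr.wikipedia.org/w/load.php"
           "?lang=fr&modules=user&skin=vector&user=Od1n&version=8ea0b")

# Variants to request by hand and compare against the failing URL.
print(without_param(failing, "version"))
print(with_param(failing, "version", "test"))  # "test" is an arbitrary value
```

If only the exact original URL fails while the variants succeed, that points at a stored response for that specific URL rather than at a live database error.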

As an additional note, yesterday while editing I was repeatedly asked to re‑authenticate — not every time, but often when opening or submitting an edit page. I'm not sure whether this is related, but mentioning it in case it’s useful.

Comment Actions

DB circuit breaking has not been active for two days now (based on logstash logs), so your issue seems to be completely different. One suggestion: clear your caches (Ctrl+Shift+R). It could be that the error page somehow got cached in your browser (which it really shouldn't, as this is a 500 response, but you never know with browsers). The fact that I can load that URL with correct content and no problem suggests it's not related to this task. If that doesn't fix it, I'd say file a new bug so we can investigate it separately.

Comment Actions

I've cleared my browser cache and restarted Chrome.

  • I still encounter the exact same error (same UUID and timestamp), even when requesting the asset in a Chrome Incognito window or from a different Chrome profile.
  • But it works when I request it using another browser (Firefox).

This really feels like a reverse‑proxy issue — some stale or polluted cache that hasn’t been invalidated and is still being served based on IP, user‑agent, or similar, while a different browser triggers a cache miss.

Hopefully it will sort itself out within a few days at most.

Comment Actions

If you're logged in, requests should bypass all CDN caches, since logged-in responses can pollute the cache (e.g. if you set your interface language to something else, we don't want to serve that to logged-out users :D). Images are an exception, but that's not relevant here. That doesn't mean there can't be bugs here and there, though. I'll notify the traffic team to investigate. Something is caching this on some layer, since UUIDs by nature shouldn't repeat themselves.
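
The reasoning that a repeated request ID implies a cached response can be checked mechanically when collecting error reports. The sketch below is only an illustration. The function name is hypothetical, and the bracketed-UUID format is taken from the two error messages quoted in this task.

```python
import re
from collections import Counter

# Matches a bracketed UUID such as "[e0e9c2f5-9aa0-47a2-92a1-6f9e523708fe]".
REQUEST_ID = re.compile(
    r"\[([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})\]"
)


def repeated_request_ids(error_texts):
    """Return the sorted request IDs that appear in more than one message."""
    counts = Counter()
    for text in error_texts:
        match = REQUEST_ID.search(text)
        if match:
            counts[match.group(1)] += 1
    return sorted(rid for rid, seen in counts.items() if seen > 1)
```

Each fresh server-side failure should carry a new ID, so any ID returned by this function suggests the same stored error response was served more than once.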

Comment Actions

I was still encountering the issue, and I’ve just resolved it by making an edit to MediaWiki:Group-sysop.js, which I noticed was included in the bundle, in order to trigger a server‑side cache refresh.

Comment Actions

Is this good to be closed?

Judging by T422130#11782760, I guess it's just waiting for followups to be filed (if any)
