Maniphest T422130

Database servers in cluster(number) are overloaded
Open, Medium · Public · Bug Report

Assigned To: None

Authored By: Yann, Thu, Apr 2, 10:25 AM

Description

Error

Trying to undelete https://commons.wikimedia.org/w/index.php?title=File:Tajiks_of_Uzbekistan.PNG
and I repeatedly get

Sorry! This site is experiencing technical difficulties.
Try waiting a few minutes and refreshing.

(Cannot access the database: Cannot access the database: Database servers in cluster28 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds.)
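For readers unfamiliar with the term in the error message, here is a minimal sketch of the general circuit-breaker pattern. It is not MediaWiki's implementation: the class name, thresholds, and method names are all hypothetical and chosen only for illustration.

```python
import time


class CircuitBreaker:
    """Generic circuit breaker sketch; NOT MediaWiki's actual code."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures  # hypothetical threshold
        self.reset_after = reset_after    # hypothetical cool-down, in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Circuit is open: fail fast instead of hitting the backend.
                raise RuntimeError("circuit open: failing fast")
            # Cool-down elapsed: allow one trial call (half-open state).
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

While the circuit is open, callers get an immediate error rather than adding load to the struggling backend, which matches the "try again in a few seconds" wording of the message.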

Impact

File cannot be undeleted.

Related Objects

Mentioned In: T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad)
T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw
T422111: es1042 not starting after powercycle
T422140: Fatal exception of type "Wikibase\DataModel\Services\Lookup\EntityLookupException"

Duplicates Merged Here: T422152: Database servers are overloaded
T422147: Intermittent DB connection failure errors
T422140: Fatal exception of type "Wikibase\DataModel\Services\Lookup\EntityLookupException"
T422127: Database servers in cluster30 are overloaded

Event Timeline

Yann created this task. Thu, Apr 2, 10:25 AM

Restricted Application added a subscriber: Aklapper. Thu, Apr 2, 10:25 AM

Yann added a comment. Thu, Apr 2, 10:29 AM

For https://commons.wikimedia.org/w/index.php?title=File:DepEd_Undersecretary_Michael_T._Poa.jpg
I get

(Cannot access the database: Cannot access the database: Database servers in cluster30 are overloaded. In order to protect application servers, the circuit breaking to databases of this section have been activated. Please try again a few seconds.)

MBH renamed this task from Database servers in cluster28 are overloaded to Database servers in cluster(number) are overloaded. Thu, Apr 2, 10:30 AM

Peachey88 edited projects, added DBA, SRE; removed Commons, Wikimedia-production-error. Thu, Apr 2, 10:30 AM

MBH subscribed. Thu, Apr 2, 10:31 AM

The same error appears for several other clusters, e.g. 26 and 31, even when just opening pages to read.

Peachey88 merged a task: T422127: Database servers in cluster30 are overloaded. Thu, Apr 2, 10:31 AM

Peachey88 added a subscriber: AlexisJazz.

Thryduulf subscribed. Thu, Apr 2, 10:34 AM

I've been experiencing these errors intermittently on English Wikipedia today, but only on trying to save edits. Each time trying again has resulted in the save being successful.
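The "try again and it succeeds" behaviour described above can be sketched as a client-side retry with exponential backoff. This is only an illustration for scripts or bots: `TransientDBError` and `retry_with_backoff` are hypothetical names, and the attempt count and delays are assumptions, not recommended values.

```python
import random
import time


class TransientDBError(Exception):
    """Hypothetical stand-in for a 'database servers are overloaded' response."""


def retry_with_backoff(operation, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call operation(), retrying on TransientDBError with growing delays."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientDBError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Exponential backoff with jitter, so clients do not retry in lockstep.
            delay = base_delay * (2 ** attempt) * random.uniform(0.5, 1.0)
            sleep(delay)
```

Spacing out the retries matters here: immediate, repeated retries from many clients would add load to servers that are already overloaded.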

Aklapper changed the subtype of this task from "Production Error" to "Bug Report". Thu, Apr 2, 10:42 AM

Aklapper updated the task description.

AlexisJazz mentioned this in T422140: Fatal exception of type "Wikibase\DataModel\Services\Lookup\EntityLookupException". Thu, Apr 2, 10:46 AM

Wellverywell triaged this task as Unbreak Now! priority. Thu, Apr 2, 10:48 AM

Wellverywell subscribed.

A_smart_kitten subscribed. Thu, Apr 2, 10:49 AM

RhinosF1 added a project: Wikimedia-Incident. Thu, Apr 2, 10:52 AM

Restricted Application added a subscriber: RhinosF1. Thu, Apr 2, 10:52 AM

1F616EMO subscribed. Thu, Apr 2, 10:57 AM

I experienced such errors when diffing and saving edits.

Ladsgroup subscribed. Thu, Apr 2, 10:58 AM

We are on it.

1F616EMO added a comment. Thu, Apr 2, 10:59 AM

Should I expect the coming backport window to be cancelled or delayed due to this incident?

RhinosF1 added a comment. Thu, Apr 2, 11:01 AM

In T422130#11781793, @1F616EMO wrote:

Should I expect the coming backport window to be cancelled or delayed due to this incident?

Very likely yes. A deployment won't take place unless incident responders are comfortable it won't affect or distract from the incident.

1F616EMO added a comment. Thu, Apr 2, 11:07 AM

In T422130#11781814, @RhinosF1 wrote:

In T422130#11781793, @1F616EMO wrote:

Should I expect the coming backport window to be cancelled or delayed due to this incident?

Very likely yes. A deployment won't take place unless incident responders are comfortable it won't affect or distract from the incident.

Thanks for the info, I've rescheduled my backports.

Lucas_Werkmeister_WMDE merged a task: T422140: Fatal exception of type "Wikibase\DataModel\Services\Lookup\EntityLookupException". Thu, Apr 2, 11:20 AM

Lucas_Werkmeister_WMDE subscribed.

Nemoralis subscribed. Thu, Apr 2, 11:36 AM

Thryduulf added a comment. Thu, Apr 2, 11:48 AM

I've just encountered what I presume is the same error, this time when trying to use the reply tool
[6a4d47bf-961e-4513-9b1f-c6970e11f156] Caught exception of type Wikimedia\Rdbms\DBConnectionError
I know the user-unfriendliness of that error message is a different issue but I'm not sure where to document that?

taavi merged a task: T422147: Intermittent DB connection failure errors. Thu, Apr 2, 12:22 PM

taavi added a subscriber: Redmin.

Johannnes89 subscribed. Thu, Apr 2, 12:23 PM

FCeratto-WMF mentioned this in T422111: es1042 not starting after powercycle. Thu, Apr 2, 12:30 PM

Aklapper merged a task: T422152: Database servers are overloaded. Thu, Apr 2, 1:01 PM

Aklapper added a subscriber: Don-vip.

Don-vip awarded a token. Thu, Apr 2, 1:03 PM

Daimona subscribed. Thu, Apr 2, 1:09 PM

Sarsenet subscribed. Thu, Apr 2, 1:13 PM

JustRandomThai subscribed. Thu, Apr 2, 1:33 PM

cmooney subscribed. Thu, Apr 2, 1:56 PM

We are hopeful the situation has improved now that codfw has been repooled, which added capacity. The root cause of the circuit breaking is still being investigated.

SomeRandomDeveloper subscribed. Thu, Apr 2, 2:02 PM

Lucas_Werkmeister_WMDE mentioned this in T422166: scap can’t deploy (blob upload unknown) after apus.discovery.wmnet is repooled in codfw. Thu, Apr 2, 2:16 PM

MoritzMuehlenhoff lowered the priority of this task from Unbreak Now! to Medium. Thu, Apr 2, 2:37 PM

MoritzMuehlenhoff subscribed.

The immediate impact has been mitigated, so I'm reducing the priority. The task might still be used to collect followups.

Alien333 subscribed. Thu, Apr 2, 3:51 PM

GPSLeo subscribed. Thu, Apr 2, 4:24 PM

Od1n subscribed. Thu, Apr 2, 9:25 PM

FWIW, I'm still currently encountering this error on frwiki, and it prevents my local custom JS/CSS files from loading.

Unexpectedly not loaded: [page links not preserved]

Not impacted, loading as expected: [page links not preserved]

jasmine_ mentioned this in T413974: Northward Datacenter Switchover (March 2026; codfw to eqiad). Thu, Apr 2, 11:19 PM

jijiki subscribed. Fri, Apr 3, 9:56 AM

In T422130#11784439, @Od1n wrote:

FWIW, I'm still currently encountering this error on frwiki, and it prevents my local custom JS/CSS files from loading.

Unexpectedly not loaded: [page links not preserved]

Not impacted, loading as expected: [page links not preserved]

Please let us know if you are still experiencing this issue.

Od1n added a comment. Fri, Apr 3, 10:04 AM

Right now, I’m still seeing the JS error in the console.

Interestingly, […] and […] are still not loading, but […] is now loading (and I don’t have a […] page).

Od1n added a comment. Sat, Apr 4, 9:35 PM

I'm still seeing the issue. The UUID is always the same, so I'm posting it here in case it helps:

[e0e9c2f5-9aa0-47a2-92a1-6f9e523708fe] 2026-04-02 11:37:52: Fatal exception of type "Wikimedia\Rdbms\DBConnectionError"

The failing network request is consistently this URL:
https://fr.wikipedia.org/w/load.php?lang=fr&modules=user&skin=vector&user=Od1n&version=8ea0b

The error only occurs with this specific version value; if I change or remove the version parameter, the request succeeds.
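The comparison described above (same URL with the version parameter changed or removed) can be reproduced with a small helper. This is a sketch using only Python's standard library; the function name `with_version` and the replacement value `"00000"` are hypothetical, and the script only builds the URLs, it does not fetch them.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# URL reported as consistently failing in the comment above.
FAILING_URL = (
    "https://fr.wikipedia.org/w/load.php"
    "?lang=fr&modules=user&skin=vector&user=Od1n&version=8ea0b"
)


def with_version(url, version=None):
    """Return url with the version parameter replaced, or removed if None."""
    parts = urlsplit(url)
    params = parse_qs(parts.query, keep_blank_values=True)
    params.pop("version", None)
    if version is not None:
        params["version"] = [version]
    return urlunsplit(parts._replace(query=urlencode(params, doseq=True)))


# Variant without the version parameter.
print(with_version(FAILING_URL))
# Variant with an arbitrary (hypothetical) version value.
print(with_version(FAILING_URL, "00000"))
```

Requesting each variant and comparing status codes and response headers would show whether only the exact original URL is served from a stale cache entry.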

As an additional note, yesterday while editing I was repeatedly asked to re‑authenticate — not every time, but often when opening or submitting an edit page. I'm not sure whether this is related, but mentioning it in case it’s useful.

Ladsgroup added a comment. Sat, Apr 4, 9:49 PM

DB circuit breaking hasn't been active for two days now (based on logstash logs), so your issue seems to be completely different. One suggestion: clear your caches (ctrl+shift+r). It could be that the error page somehow got cached in your browser (which it really shouldn't, as this is a 500 response, but you never know with browsers). The fact that I can load that URL with correct content and no problem says it's not related to this. If that doesn't fix it, I'd say file a new bug so we can investigate it separately.

Od1n added a comment. Sun, Apr 5, 6:46 AM

I've cleared my browser cache and restarted Chrome.

I still encounter the exact same error (same UUID and timestamp), even when requesting the asset in a Chrome Incognito window or from a different Chrome profile.
But it works when I request it using another browser (Firefox).

This really feels like a reverse‑proxy issue — some stale or polluted cache that hasn’t been invalidated and is still being served based on IP, user‑agent, or similar, while a different browser triggers a cache miss.

Hopefully it will sort itself out within a few days at most.

Ladsgroup added a comment. Sun, Apr 5, 1:23 PM

If you're logged in, requests should bypass all CDN caches, since caching them could pollute the cache (e.g. if you set your interface language to something else, we don't want to serve that to logged-out users :D). There is one exception, images, but that's not relevant here. That doesn't mean there can't be bugs here and there, though. I'll notify the traffic team to investigate. Something is caching this on some layer, since UUIDs by nature shouldn't repeat.

Od1n added a comment. Mon, Apr 6, 8:36 AM

I was still encountering the issue, and I’ve just resolved it by making an edit to MediaWiki:Group-sysop.js, which I noticed was included in the bundle, in order to trigger a server‑side cache refresh.

Marostegui subscribed. Mon, Apr 6, 8:36 AM

Is this good to be closed?

Marostegui moved this task from Triage to Done on the DBA board. Mon, Apr 6, 8:38 AM

A_smart_kitten added a comment. Mon, Apr 6, 8:54 AM

In T422130#11789154, @Marostegui wrote:

Is this good to be closed?

Judging by T422130#11782760, I guess it's just waiting for followups to be filed (if any)

Content licensed under Creative Commons Attribution-ShareAlike (CC BY-SA) 4.0 unless otherwise noted; code licensed under GNU General Public License (GPL) 2.0 or later and other open source licenses.

URL: https://phabricator.wikimedia.org/T422130
