VOOZH about

URL: https://phabricator.wikimedia.org/T204729

⇱ ⚓ T204729 Potential enwiki DOS due to slow WatchedItemStore::countVisitingWatchersMultiple


Maniphest T204729

Potential enwiki DOS due to slow WatchedItemStore::countVisitingWatchersMultiple
Closed, ResolvedPublic

Description

CVE: CVE-2019-12473

If this start going above a tipping point we will have a propoer DOS, for now the query killer is holding on the 2 dbs they are happening (so far recentchanges DBs):

Details

Related Changes in Gerrit:
SubjectAuthorRepoBranchLines +/-
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for noneReedymediawiki/coreREL1_31+13 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for noneReedymediawiki/coreREL1_30+13 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for noneReedymediawiki/coreREL1_27+13 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for noneAnomiemediawiki/coremaster+5 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for noneAddshoremediawiki/corewmf/1.32.0-wmf.20+5 -0
WatchedItemStore::countVisitingWatchersMultiple() shouldn't query all titles when asked for noneAddshoremediawiki/corewmf/1.32.0-wmf.22+5 -0
Customize query in gerrit

Related Objects

Event Timeline

jcrespo triaged this task as Unbreak Now! priority.Sep 18 2018, 3:16 PM
Comment Actions

@daniel @Addshore any idea of what could be causing this on enwiki hosts (so far). I am pinging you guys because I saw you are part of T129478 and we are trying to look for people who could maybe help to troubleshoot this a bit further?

Comment Actions

Query and query plan:

SELECT /* WatchedItemStore::countVisitingWatchersMultiple */ wl_namespace,wl_title,COUNT(*) AS `watchers` FROM `watchlist` GROUP BY wl_namespace,wl_title HAVING COUNT(*) >= 30


root@db2088.codfw.wmnet[enwiki]> show explain for 207583907;
+------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+
| 1 | SIMPLE | watchlist | index | NULL | namespace_title | 261 | NULL | 170180999 | Using index; Using filesort |
+------+-------------+-----------+-------+---------------+-----------------+---------+------+-----------+-----------------------------+
1 row in set, 1 warning (0.03 sec)
Comment Actions

I have setup a query killer killling all WatchedItemStore::countVisitingWatchersMultiple queries longer than 1 second, that should contain the issue until someone understand what is going on. BUT IT IS CAUSING AN OUTAGE TO WHATEVER THE SERVICE is, so not undoing the unbreak now.

Comment Actions

I don't see anything relevant (from the SAL) around that time, and it's a pretty sharp/immediate increase. The train wasn't deployed until 15:00 UTC today. And there was no train last week (DC switchover).

Comment Actions

It is only hitting recentchanges slaves at the moment

Comment Actions

Could it be someone on the API?

Yes, that would be my working theory, a bad performance on an uncommonly run query, which should be limited now that it is run by someone once per second.

We are now killing over 1 query per second on rc changes, which is not an ok solution:

https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2085&var-port=13311&panelId=10&fullscreen&from=now-1h&to=now

Comment Actions

For now it should be easily fixable by moving the load to vslow, similar to this: https://github.com/wikimedia/mediawiki/blob/master/includes/changetags/ChangeTags.php#L1573 until there is a proper fix.

Comment Actions

by "fix" I mean removing the potential DoS vector point. Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?

Comment Actions

Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?

I checked the watchlist table and it is the same in eqiad an codfw for the recentchanges hosts.
Also we'd have seen that probably as soon as we got codfw active I guess :-(

Comment Actions
SELECT /* WatchedItemStore::countVisitingWatchersMultiple */ wl_namespace,wl_title,COUNT(*) AS `watchers` FROM `watchlist` GROUP BY wl_namespace,wl_title HAVING COUNT(*) >= 30

No WHERE clause? That's certainly not good.

It looks like if an empty array is passed as to , it'll happily query all titles in the database instead of querying no titles as it would seem to be intended. That should be easy enough to fix.

Looking at the errors in Kibana, they all seem to be going via that code path. The few URLs I spot-checked are passing invalid titles, which would give an empty .

Comment Actions

by "fix" I mean removing the potential DoS vector point. Also forgive my ignorance but have we considered chance of missing index on codfw hosts biting us again?

There is no index that can optimize a SELECT without a WHERE, I think that query shouldn't be able to run on a synchronous webrequest (it could run on a cached cron like some other slow queries).

Comment Actions

Looking at the code path this will only be hit with action=query prop=info and with the visitingwatchers option.

My crappy command shows a single UA calling this:

addshore@stat1005:~$ kafkacat -b kafka-jumbo1002.eqiad.wmnet -t webrequest_text -q | grep "en.wikipedia" | grep "api.php" | grep "action=query" | grep "prop=info" | grep "visitingwatchers"

Sample request:

{P7564}

Comment Actions

Oh, this got filed as a security bug, because of the DOS potential. See https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/461153 for a fix.

Comment Actions

For roots: and (non root to request the query killer to be disabled)- pt-kill is running locally on the rc hosts as root on enwiki codfw, make sure to drop the screens if/when that is fixed.

Comment Actions

I'v +2ed the change posted by @Anomie and will put it on the deployed branches shortly (just waiting on jenkins)

Comment Actions

I'v +2ed the change posted by @Anomie and will put it on the deployed branches shortly (just waiting on jenkins)

Cool - let me know here or on IRC when merged, so I can kill the query killers and see if we are good
Thanks a lot guys

Comment Actions

Patches now deployed to both branches:

2018-09-18
16:45 addshore@deploy1001: Synchronized php-1.32.0-wmf.22/includes/watcheditem/WatchedItemStore.php: WatchedItemStore::countVisitingWatchersMultiple() fix T204729 (duration: 00m 57s)
16:43 addshore@deploy1001: Synchronized php-1.32.0-wmf.20/includes/watcheditem/WatchedItemStore.php: WatchedItemStore::countVisitingWatchersMultiple() fix T204729 (duration: 00m 59s)
Comment Actions

I have killed query killers on db2088 and db2085. Now monitoring the graphs to see if this is fixed.

Comment Actions

This looks solved so far.

db2085: https://grafana.wikimedia.org/render/dashboard-solo/db/mysql?panelId=3&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2085&var-port=13311&from=1537278729923&to=1537289529923&width=1000&height=500&tz=UTC%2B02%3A00

db2088: https://grafana.wikimedia.org/render/dashboard-solo/db/mysql?panelId=3&orgId=1&var-dc=codfw%20prometheus%2Fops&var-server=db2088&var-port=13311&from=1537278754322&to=1537289554322&width=1000&height=500&tz=UTC%2B02%3A00

And for the record, these were the query killers used:

while true; do pt-kill --victims=all --kill --print --busy-time 1 --match-info='WatchedItemStore::countVisitingWatchersMultiple' --match-user=wikiuser S=/run/mysqld/mysqld.s1.sock; sleep 100; done

I am going to give it some more more hours before closing the ticket as resolved just in case, but everything looks good so far now.
Thanks a lot @Addshore and @Anomie for getting this quickly tackled

Marostegui assigned this task to Anomie.
Comment Actions

In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?

Comment Actions

Yeah, this may need followup for software security backporting- being filled by DBAs and fixed by technology, the focus was mostly on wikimedia DOS vector and not on the software bug. This probably need further consideration for the product side of things.

Comment Actions

In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?

It was introduced in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/277488/
Based on githubs info https://github.com/wikimedia/mediawiki/commit/7d4e225cb90ac275c6077e830d8b63678c3ea9dc
It is included in the following tags: master 1.31.0 1.31.0-rc.2 1.31.0-rc.1 1.31.0-rc.0 1.30.0 1.30.0-rc.0 1.29.2 1.29.1 1.29.0 1.29.0-rc.1 1.29.0-rc.0 1.28.3 1.28.2 1.28.1 1.28.0 1.28.0-rc.1 1.28.0-rc.0 1.27.4 1.27.3 1.27.2 1.27.1 1.27.0 1.27.0-rc.1 1.27.0-rc.0
I guess we should drop it on REL 131, 130, 129, 128, 127?

Comment Actions

In what commit was this issue introduced? Do we need to backport it to stable MediaWiki releases?

It was introduced in https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/277488/
Based on githubs info https://github.com/wikimedia/mediawiki/commit/7d4e225cb90ac275c6077e830d8b63678c3ea9dc
It is included in the following tags: master 1.31.0 1.31.0-rc.2 1.31.0-rc.1 1.31.0-rc.0 1.30.0 1.30.0-rc.0 1.29.2 1.29.1 1.29.0 1.29.0-rc.1 1.29.0-rc.0 1.28.3 1.28.2 1.28.1 1.28.0 1.28.0-rc.1 1.28.0-rc.0 1.27.4 1.27.3 1.27.2 1.27.1 1.27.0 1.27.0-rc.1 1.27.0-rc.0
I guess we should drop it on REL 131, 130, 129, 128, 127?

1.28 is EOL already, so we don't need to worry about that. I don't really want to squeeze this into this week's security release since the prep work for that is basically already done, at which point 1.29 is EOL as well. So we could cherry-pick it next week for 1.31, 1.30, and 1.27, after the release? cc @Reedy

Reedy changed the visibility from "Custom Policy" to "Public (No Login Required)".Jun 6 2019, 4:00 PM
Content licensed under Creative Commons Attribution-ShareAlike (CC BY-SA) 4.0 unless otherwise noted; code licensed under GNU General Public License (GPL) 2.0 or later and other open source licenses. By using this site, you agree to the Terms of Use, Privacy Policy, and Code of Conduct. · Wikimedia Foundation · Privacy Policy · Code of Conduct · Terms of Use · Disclaimer · CC-BY-SA · GPL · Credits