URL: https://phabricator.wikimedia.org/T414338

FY25-26 WE5.4.12: Identify the provenance of image requests
Open, Needs Triage, Public

Description

Problem

As part of the work under WE5.4 to protect our infrastructure from abusive scraping, we want to be able to understand the provenance of image requests. This means being able to distinguish when and where a URL to an image was generated.

This will allow us to use this information as a signal in request filtering at the CDN, by helping to determine whether a request comes from a browser session visiting the website, from an API query, from dumps, or from hotlinking.

This intervention was originally proposed in Dec 2025 in Urgent needs for de-risking WE4.3, WE5.4, and our infrastructure (WMF-restricted)

Approach

Generate signed URLs for image requests, by adding query parameters that contain the provenance information and a signature that can be trivially validated at the CDN. The signature should be an HMAC that includes the URL, source (web, api, dumps), timestamp and a secret.

  1. Acceptance criteria
    • Generated image URLs include provenance query parameters
    • Generated image URLs include an HMAC signature
    • Signature contents and HMAC algorithm agreed with SRE
    • SRE can configure the CDN based on the source that generated an image URL
    • SRE can configure the CDN based on the freshness of an image URL
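
The approach and acceptance criteria above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and not something agreed in this task: the parameter names `src`, `ts` and `sig`, the choice of SHA-256, the `|`-joined message layout, and the one-hour freshness window are all hypothetical.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical shared secret; a real deployment would load this from config.
SECRET = b"example-shared-secret"


def _signature(base_url, source, timestamp, secret):
    # The HMAC covers the URL, the source and the timestamp, keyed by the secret.
    message = f"{base_url}|{source}|{timestamp}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def sign_url(base_url, source, timestamp, secret=SECRET):
    """Append hypothetical provenance parameters and an HMAC signature."""
    sig = _signature(base_url, source, timestamp, secret)
    query = urlencode({"src": source, "ts": timestamp, "sig": sig})
    return f"{base_url}?{query}"


def verify_url(signed_url, now, max_age=3600, secret=SECRET):
    """Return the source if the signature is valid and fresh, else None."""
    parts = urlsplit(signed_url)
    params = parse_qs(parts.query)
    try:
        source = params["src"][0]
        timestamp = int(params["ts"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError):
        return None
    base_url = f"{parts.scheme}://{parts.netloc}{parts.path}"
    expected = _signature(base_url, source, timestamp, secret)
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(expected.encode("utf-8"), sig.encode("utf-8")):
        return None
    if now - timestamp > max_age:
        return None  # signature is valid but the URL is stale
    return source
```

In this sketch the validator needs only the secret and the request URL, which is what would let a CDN act on both the source and the freshness of a URL without calling back to the application.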

Status updates

Details

Related Changes in Gerrit:

Related Objects

Mentioned In
T425580: [Spike] [BUG] POTD Gallery doesn't load, crashes upon share
T426373: Mediaviews Analysis returns API not found error
T424082: MediaViewer preview sometimes lacks provenance parameters
T422586: MediaViewer downloads high-res image twice if original is a medium-size JPEG
T418957: Add client-side logging for non-MediaWiki action API errors (HTTP 429)
T419921: TypeError: MediaWiki\Extension\OAuth\ResourceServer::getUser(): Return value must be of type MediaWiki\User\User, false returned
T417278: Choosing client credentials grant for OAuth 2 results in an access token (JWT) with the 'sub' field empty
T419135: Gadget-Stockphoto.js on Commons uses non-common thumbnail sizes, leading to a HTTP 429
T419458: Media dialog in VisualEditor shows odd UTM param strings where file type should be
T246054: Consider dropping the '1.5x' size logos from srcsets
T414805: FY 25/26 WE 5.4.10 Standard Thumbnail Sizes Only
T417309: mw.util.parseImageUrl() returns invalid thumb URLs for images where original size is under requested width
T414337: Identify requests for media files from logged-in users
Mentioned Here
T425580: [Spike] [BUG] POTD Gallery doesn't load, crashes upon share
T424082: MediaViewer preview sometimes lacks provenance parameters
rMEXT1269442c61b4: Updated mediawiki/extensions Project: mediawiki/extensions/WikibaseQuery…
T426217: MediaViewer downloads high-res image twice if thumb URL is re-used
T419135: Gadget-Stockphoto.js on Commons uses non-common thumbnail sizes, leading to a HTTP 429
T422586: MediaViewer downloads high-res image twice if original is a medium-size JPEG
T419458: Media dialog in VisualEditor shows odd UTM param strings where file type should be
T417278: Choosing client credentials grant for OAuth 2 results in an access token (JWT) with the 'sub' field empty
T418957: Add client-side logging for non-MediaWiki action API errors (HTTP 429)
T419921: TypeError: MediaWiki\Extension\OAuth\ResourceServer::getUser(): Return value must be of type MediaWiki\User\User, false returned
T402792: Consider rate limiting non-standard thumbnail sizes
T414337: Identify requests for media files from logged-in users

Event Timeline

Comment Actions

Change #1239464 merged by jenkins-bot:

[mediawiki/core@master] Media: Add provenance parameters to thumbnail and media file URLs

https://gerrit.wikimedia.org/r/1239464

Comment Actions

@Joe @CDanis I heard you're the people to talk to about the desired data and format of these query parameters.

Currently, the proposed patch https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1239464 includes the following data:

  • Site which is requesting the image, e.g. 'www.mediawiki.org'
  • Generator (the software component involved), e.g. 'parser' or 'imageinfo'. Entry point is used as fallback if not specified, e.g. 'index', 'api', 'rest'
  • Format of the requested image, 'original', 'thumbnail' or 'thumbnail_unscaled'

The format is UTM parameters (respectively utm_source, utm_campaign and utm_content, in this order), on the assumption that they'll be stripped by search engines etc.

Example: https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg?utm_source=mediawiki.localhost&utm_campaign=parser&utm_content=thumbnail
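
As a rough illustration of the format described above, the following sketch builds such a URL and reads the three fields back. The helper names `add_provenance` and `read_provenance` are hypothetical and are not taken from the MediaWiki patch; only the three UTM parameter names and their order come from the comment.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def add_provenance(file_url, site, generator, content):
    """Append utm_source, utm_campaign and utm_content, in that order."""
    query = urlencode(
        {"utm_source": site, "utm_campaign": generator, "utm_content": content}
    )
    # Use "&" if the URL already carries a query string.
    separator = "&" if urlsplit(file_url).query else "?"
    return f"{file_url}{separator}{query}"


def read_provenance(url):
    """Return the provenance fields of a URL; missing fields map to None."""
    params = parse_qs(urlsplit(url).query)
    return {
        "site": params.get("utm_source", [None])[0],
        "generator": params.get("utm_campaign", [None])[0],
        "format": params.get("utm_content", [None])[0],
    }


url = add_provenance(
    "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg",
    "mediawiki.localhost",
    "parser",
    "thumbnail",
)
print(url)
print(read_provenance(url))
```

With these inputs, the built URL is the same string as the example given above.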

Your thoughts on that would be appreciated. I also have two questions:

Sorry it's been a few weeks of intense work on other stuff. The proposed format is good as far as I'm concerned, as a first step.

I think adding a signature is useful. It would be enough to have a simple signature like a simple SHA1 of the other parameters as follows: which we can add in (again abusing the term). I would go with a simple sha1 instead of using hmac because the risk of compromise is pretty low.
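
A plain digest of the kind suggested in this comment could look like the sketch below. The comment's exact input format and parameter name are not preserved here, so the `&`-joined input and the function name `sha1_of_params` are assumptions for illustration only.

```python
import hashlib


def sha1_of_params(site, generator, content):
    """Unkeyed SHA-1 over the three provenance values (hypothetical layout)."""
    joined = "&".join([site, generator, content])
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()


digest = sha1_of_params("www.mediawiki.org", "parser", "thumbnail")
print(len(digest))  # a SHA-1 hex digest is always 40 characters long
```

Unlike the HMAC in the task description, this digest uses no secret, so anyone who knows the layout can recompute it. It acts as a consistency check on the parameters, which is the trade-off the comment accepts.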

Comment Actions

Change #1253625 had a related patch set uploaded (by Bartosz Dziewoński; author: Bartosz Dziewoński):

[operations/mediawiki-config@master] Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster

https://gerrit.wikimedia.org/r/1253625

Comment Actions

Change #1253625 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster

https://gerrit.wikimedia.org/r/1253625

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-03-16T20:57:26Z] <catrope@deploy2002> Started scap sync-world: Backport for [[gerrit:1253623|Fix client credentials access tokens (T417278 T419921)]], [[gerrit:1253625|Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster (T414338)]], [[gerrit:1253626|Configure $wgApiClientErrorSampleRate (T418957)]]

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-03-16T20:59:17Z] <catrope@deploy2002> matmarex, catrope: Backport for [[gerrit:1253623|Fix client credentials access tokens (T417278 T419921)]], [[gerrit:1253625|Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster (T414338)]], [[gerrit:1253626|Configure $wgApiClientErrorSampleRate (T418957)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-03-16T21:05:37Z] <catrope@deploy2002> Finished scap sync-world: Backport for [[gerrit:1253623|Fix client credentials access tokens (T417278 T419921)]], [[gerrit:1253625|Enable $wgTrackMediaRequestProvenance on testwikis and beta cluster (T414338)]], [[gerrit:1253626|Configure $wgApiClientErrorSampleRate (T418957)]] (duration: 08m 06s)

Comment Actions

Change #1260029 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable $wgTrackMediaRequestProvenance on group0 wikis

https://gerrit.wikimedia.org/r/1260029

Comment Actions

Change #1260029 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable $wgTrackMediaRequestProvenance on group0 wikis

https://gerrit.wikimedia.org/r/1260029

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-03-31T23:10:45Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1260029|Enable $wgTrackMediaRequestProvenance on group0 wikis (T414338)]]

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-03-31T23:12:45Z] <krinkle@deploy1003> krinkle: Backport for [[gerrit:1260029|Enable $wgTrackMediaRequestProvenance on group0 wikis (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-03-31T23:51:06Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1260029|Enable $wgTrackMediaRequestProvenance on group0 wikis (T414338)]] (duration: 40m 21s)

Comment Actions

Change #1267437 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on most group1 wikis

https://gerrit.wikimedia.org/r/1267437

Comment Actions

Change #1267437 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on most group1 wikis

https://gerrit.wikimedia.org/r/1267437

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-04-08T07:36:29Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]]

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-04-08T07:38:18Z] <krinkle@deploy1003> krinkle: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-04-08T07:46:04Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1267437|Enable wgTrackMediaRequestProvenance on most group1 wikis (T414338)]] (duration: 09m 34s)

Comment Actions

Change #1269440 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on wikidata.org

https://gerrit.wikimedia.org/r/1269440

Comment Actions

Change #1269441 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on Commons

https://gerrit.wikimedia.org/r/1269441

Comment Actions

Change #1269442 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on remaining Wikipedias

https://gerrit.wikimedia.org/r/1269442

Comment Actions

Progress update (2-6 Mar, 9-13 Mar; copied here from Asana for transparency):

  • Investigate and fix broken thumbnails on officewiki (Timo investigated and found missing thumbnail steps on private wikis, Amir enabled this).
  • Test and merge trial implementation of media provenance URLs in MediaWiki core behind a feature flag (developed by Bartosz and Timo). T414338
    • Refactor logic in FileRepo and Media classes in MediaWiki core to reduce duplication and make adding provenance URLs simpler and more reliable. T414338
    • Find and fix VisualEditor would-be-bug where media type breaks due to accidental reliance on URLs having no query string. T419458
  • Enable media provenance feature in Beta Cluster and on testwikis in production. T414338
Comment Actions

Progress update (9 Apr 2026):

  • Enable media provenance on 573 additional wikis (including all Wiktionary and Wikivoyage wikis, and 18 Wikipedias). We are now live on 720/1068 wikis. T414338
  • Found regression in MediaViewer causing double downloads. T422586
  • Prepare Stockphoto gadget on Commons ahead of rollout to prevent regression. T419135

Next steps:

  • Deploy media provenance feature to Wikidata, Commons, and 346 remaining Wikipedias.
Comment Actions

Change #1276086 had a related patch set uploaded (by Krinkle; author: Krinkle):

[mediawiki/extensions/MultimediaViewer@master] mmv.bootstrap: Avoid double download when thumb is unscaled original

https://gerrit.wikimedia.org/r/1276086

Comment Actions

Change #1276086 merged by jenkins-bot:

[mediawiki/extensions/MultimediaViewer@master] mmv.bootstrap: Avoid double download when thumb is unscaled original

https://gerrit.wikimedia.org/r/1276086

Comment Actions

Change #1269440 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on wikidata.org

https://gerrit.wikimedia.org/r/1269440

Comment Actions

Change #1269441 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on Commons

https://gerrit.wikimedia.org/r/1269441

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-05-01T19:51:14Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1269440|Enable wgTrackMediaRequestProvenance on wikidata.org (T414338)]], [[gerrit:1269441|Enable wgTrackMediaRequestProvenance on Commons (T414338)]]

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-05-01T19:52:57Z] <krinkle@deploy1003> krinkle: Backport for [[gerrit:1269440|Enable wgTrackMediaRequestProvenance on wikidata.org (T414338)]], [[gerrit:1269441|Enable wgTrackMediaRequestProvenance on Commons (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-05-01T20:06:42Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1269440|Enable wgTrackMediaRequestProvenance on wikidata.org (T414338)]], [[gerrit:1269441|Enable wgTrackMediaRequestProvenance on Commons (T414338)]] (duration: 15m 27s)

Comment Actions

I think these changes may be the cause behind https://commons.wikimedia.org/wiki/MediaWiki_talk:Gadget-GoogleImagesTineye.js#c-Masur-20251229182900-Reverse_Image_Search_-_Google_and_TinEye_failing_to_retrieve_source_images_from

the gadget's logic is simple. it takes the url of the image and gives it to the search engines in the form of https://lens.google.com/uploadbyurl?url=https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2Ff%2Ffa%2FStatue_of_Taras_Shevchenko_in_Shevchenkove%252C_Shevchenkove_Raion_2019_by_Venzz_04.jpg%3Futm_source%3Dcommons.wikimedia.org%26utm_campaign%3Dindex%26utm_content%3Doriginal

but as i tested manually, search engines cannot get the image, no matter the link comes with or without the new trackers

utm_source=commons.wikimedia.org&utm_campaign=index&utm_content=original

please explain how to get the gadget working again, i.e. how to get a link of a file that can be read by other websites.
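
The gadget logic described in this comment can be sketched as follows, together with a helper that drops the provenance parameters before the URL is handed over. This is an illustration, not the gadget's actual code: the function names are hypothetical, and only the `uploadbyurl?url=` form is taken from the comment.

```python
from urllib.parse import parse_qsl, quote, urlencode, urlsplit, urlunsplit


def strip_utm(url):
    """Remove utm_* query parameters and keep everything else unchanged."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.startswith("utm_")
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


def lens_url(image_url):
    """Build a reverse-search URL by percent-encoding the whole image URL."""
    return "https://lens.google.com/uploadbyurl?url=" + quote(image_url, safe="")


original = (
    "https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg"
    "?utm_source=commons.wikimedia.org&utm_campaign=index&utm_content=original"
)
print(lens_url(strip_utm(original)))
```

Since the comment reports that the search engines fail with or without the parameters, stripping them may not restore the gadget on its own; the sketch only shows how a clean URL could be produced.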

Comment Actions

@RoyZuo A more robust way would be to make the gadget download the image (or a thumbnail), then upload it to the search engine, instead of asking the search engine to fetch it from us, which may be blocked if they don't respect our user-agent policy.

In the meantime, it looks like using a thumbnail URL instead of the original file URL works, at least for now.

Comment Actions

I've tested everything I wanted to test on Commons and Wikidata.

I expected Wikidata to perhaps not get the provenance params or not work with MMV, but it all looks good. I did find a bug, T426217: MediaViewer downloads high-res image twice if thumb URL is re-used, but that's pre-existing and not caused or made more common by provenance params, and so does not need to block roll-out.

Change #1269442 had a related patch set uploaded (by Krinkle; author: Krinkle):

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on remaining Wikipedias

https://gerrit.wikimedia.org/r/1269442

I've scheduled this for tomorrow afternoon, 13:00 UTC.

Comment Actions

Change #1269442 merged by jenkins-bot:

[operations/mediawiki-config@master] Enable wgTrackMediaRequestProvenance on remaining Wikipedias

https://gerrit.wikimedia.org/r/1269442

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-05-14T13:42:53Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]]

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-05-14T13:44:41Z] <krinkle@deploy1003> krinkle: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-05-14T13:49:57Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1269442|Enable wgTrackMediaRequestProvenance on remaining Wikipedias (T414338)]] (duration: 07m 03s)

Comment Actions

Progress update (15 May 2026):

  • SRE now includes media provenance as a signal in calculating the X-Is-Browser score on the edge.
  • Fixed regression in MediaViewer causing high-res double downloads (blocking rollout). T422586
  • Enable media provenance on Wikidata and Wikimedia Commons (720 -> 722/1068 wikis).
  • Did broad manual testing across Wikimedia Commons and Wikidata post-rollout.
  • Found bug in MediaViewer causing lack of provenance params in some cases (pre-existing, not blocking rollout). T424082
  • Found bug in MediaViewer causing low-res double downloads (pre-existing, not blocking rollout). T426217
  • Enable media provenance on remaining 346 Wikipedias, including English Wikipedia. Now live on all 1068 wikis.
Comment Actions

As explained by @RoyZuo above, we have at Wikimedia Commons a serious problem if the gadget that supports image reverse search on Google Lens, TinEye and Yandex doesn't work. Right now, TinEye and Yandex work but Google Lens fails as it is unable to access the images. I am not responsible for the gadget but as one of the admins at Commons I can tell you that this gadget is absolutely essential to fight against copyright violations. We delete about 2000 copyvios every day and we cannot do this efficiently if Google Lens cannot be conveniently queried. Hence, some solution is required such that this gadget can pass URLs that are subsequently not blocked when the respective services download them.

Comment Actions

Change #1288925 had a related patch set uploaded (by Krinkle; author: Seddon):

[operations/mediawiki-config@master] Revert "Enable wgTrackMediaRequestProvenance on Commons"

https://gerrit.wikimedia.org/r/1288925

Comment Actions

Change #1288925 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "Enable wgTrackMediaRequestProvenance on Commons"

https://gerrit.wikimedia.org/r/1288925

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-05-18T21:31:09Z] <krinkle@deploy1003> Started scap sync-world: Backport for [[gerrit:1288925|Revert "Enable wgTrackMediaRequestProvenance on Commons" (T414338 T425580)]]

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-05-18T21:32:56Z] <krinkle@deploy1003> seddon, krinkle: Backport for [[gerrit:1288925|Revert "Enable wgTrackMediaRequestProvenance on Commons" (T414338 T425580)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.

Comment Actions

Mentioned in SAL (#wikimedia-operations) [2026-05-18T21:42:39Z] <krinkle@deploy1003> Finished scap sync-world: Backport for [[gerrit:1288925|Revert "Enable wgTrackMediaRequestProvenance on Commons" (T414338 T425580)]] (duration: 11m 29s)

Comment Actions

I think these changes may be the cause behind https://commons.wikimedia.org/wiki/MediaWiki_talk:Gadget-GoogleImagesTineye.js#c-Masur-20251229182900-Reverse_Image_Search_-_Google_and_TinEye_failing_to_retrieve_source_images_from

but as i tested manually, search engines cannot get the image, no matter the link comes with or without the new trackers

@RoyZuo and @AFBorchert this and the note on VP, which implies this has been going on for several weeks, suggest that this is very likely NOT caused by this ticket. More likely, the anti-scraping measures that the Foundation implemented earlier have caught these systems as well.

Matmarex has already given advice on how to change the gadget in a way that might make it work more reliably. This can be done right now. Or you can open a separate ticket to investigate why these websites are blocked from accessing us, but it might be that it is not actually possible to distinguish these systems from illegitimate scrapers. It's hard to say.

Comment Actions

@AFBorchert use browsers like opera, which have "search image with google lens" when you right click on it. probably some extensions for other browsers also do this.
basically the same method described by matmarex: searching the copied image.
but i'm not gonna put my time into making that a gadget on commons.
who broke it should fix it. or who wants to.

Comment Actions

@TheDJ I am not familiar with the architecture and the algorithms of the protection system against unwanted scraping. To me it appears quite likely that the amount of traffic from a particular site can play a role, causing the tool to work or to fail for some sites. But it appears to me very likely that the gadget failures are linked to the protection system. Regarding the gadget: I am not the author of the gadget or in any way involved in its development. However, downloading the image and uploading it to the various reverse searches, as suggested by @matmarex, does not appear to be the straightforward solution. I think it would be better to be able to generate, within the gadget, image URLs that are subsequently accepted by the protection system. My point is that Wikimedia Commons and its defense against copyright violations is a critical part of the infrastructure. This perspective should IMHO be taken into account when designing and updating the protection system.

Comment Actions

Change #1295921 had a related patch set uploaded (by Slyngshede; author: Slyngshede):

[operations/puppet@production] P:cache:haproxy add image generator information

https://gerrit.wikimedia.org/r/1295921

Comment Actions

Change #1295921 merged by BCornwall:

[operations/puppet@production] P:cache:haproxy add image generator information

https://gerrit.wikimedia.org/r/1295921

Comment Actions

@SLyngshede-WMF I'm wondering: Instead of a new header () would it make sense to use the existing header? For example,
