dancy (Ahmon Dancy), Staff Software Engineer, Release Engineering, Administrator
Projects (19)
User Details
- User Since
- Jun 27 2020, 12:14 AM (313 w, 2 d)
- Roles
- Administrator
- Availability
- Available
- IRC Nick
- dancy
- LDAP User
- Ahmon Dancy
- MediaWiki User
- ADancy (WMF) [ Global Accounts ]
Recent Activity
Fri, Jun 26
Thu, Jun 25
And:
for n in $(seq 12 14); do host=deployment-cirrussearch$n.deployment-prep.eqiad1.wikimedia.cloud; echo $host; sudo puppetserver ca clean --certname $host; done
I ran
sudo puppetserver ca clean --certname deployment-dancy2.deployment-prep.eqiad1.wikimedia.cloud
sudo puppetserver ca clean --certname deployment-dancy3.deployment-prep.eqiad1.wikimedia.cloud
to clean up after some test instances.
Wed, Jun 24
I chose to resolve this by deleting the two running Docker containers and making puppet recreate them:
The biggest consumer is
-rw-r----- 1 root root 9.0G Jun 24 18:42 /var/lib/docker/containers/c5a95725142c1168d1dca1c5a6bd3bf4ec5df287619997124dceebf1084baa56/c5a95725142c1168d1dca1c5a6bd3bf4ec5df287619997124dceebf1084baa56-json.log
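A file like this can be located by listing container log files by size. The following is a minimal sketch, assuming GNU find (for -printf) and Docker's default json-file log layout under /var/lib/docker/containers. The function name largest_json_logs is hypothetical, and reading that directory normally requires root.

```shell
#!/bin/sh
# largest_json_logs DIR [N]: print the N largest *-json.log files under DIR
# as "<size-in-bytes> <path>", largest first. (Hypothetical helper.)
largest_json_logs() {
    dir=$1
    n=${2:-5}
    # -printf is a GNU find extension; %s is the size in bytes, %p the path.
    find "$dir" -type f -name '*-json.log' -printf '%s %p\n' \
        | sort -rn \
        | head -n "$n"
}
```

For example, calling largest_json_logs /var/lib/docker/containers 3 as root would list the three largest container logs.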
dancy@deployment-changeprop-1:~$ df -t ext4 -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        20G   20G     0 100% /
The tail end of output:
Error: Failed to apply catalog: No space left on device @ dir_s_mkdir - /var/lib/puppet/state/state.yaml20260624-2180970-o513qq.lock
Error: Could not save last run local report: No space left on device @ dir_s_mkdir - /var/cache/puppet/public/last_run_summary.yaml20260624-2180970-1uvgjro.lock
Error: Could not send report: No space left on device @ dir_s_mkdir - /var/lib/puppet/state/last_run_report.yaml20260624-2180970-1899ley.lock
In T429978#12048826, @BLiviero-WMF wrote: Hi! with T429542 being resolved, can you confirm whether this problem is also fixed? thank you!
Tue, Jun 23
In T428069#12042082, @Arnoldokoth wrote: I terminated this host so should not be an issue now. I'll mark this ticket as resolved.
Thu, Jun 18
Thanks @Scott_French. Your suggested layout and sample config make sense to me and look like a good place to start experimenting with implementation, which I can do next week.
Wed, Jun 17
The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!
Tue, Jun 16
@Andrew Do you anticipate any issues if we set to (or something) in the project? was mentioned in T421244#11847986.
Done.
Mon, Jun 15
All servers referenced by operations/mediawiki-config are running MariaDB 10.11 now.
New nodes:
deployment-db15.deployment-prep.eqiad1.wikimedia.cloud
deployment-db16.deployment-prep.eqiad1.wikimedia.cloud
@fgiunchedi Let us know how things work now if you remove your workaround.
In T428930#12012467, @Zabe wrote: In T428930#12011975, @dancy wrote: @Zabe I saw that you handled T329577 a few years ago and I'm wondering if you can help me bring deployment-db15 online to take over for deployment-db14 (xref T428910#12010836).
Sure. Please tell me if I can do something. :)
Editing is working again.
I found the script stalled on . Doing some debugging using I found it blocked on a query to . I logged into deployment-db14 and ran there and I see:
Query caused different errors on master and slave. Error on master: message (format)='Cannot load from %s.%s. The table is probably corrupted' error code=1728 ; Error on slave: actual message='no error', error code=0. Default database: 'repltest'. Query: 'drop database repltest'
I'm investigating.
In T418778#12016062, @Krinkle wrote: And again. Can we disable this test until a solution is found? Two months seems long enough as a grace period to "just" fix it directly.
Thu, Jun 11
I copied from to and I'm running
zcat /srv/db11-seed.sql.gz | sudo mysql
root@deployment-db11:~# time /opt/wmf-mariadb106/bin/mariadb-dump --all-databases --single-transaction --gtid --triggers | gzip > /srv/db11-seed.sql.gz
Initialize db stuff:
sudo -u mysql /opt/wmf-mariadb1011/scripts/mariadb-install-db \
    --basedir=/opt/wmf-mariadb1011 \
    --datadir=/srv/sqldata
I have a 120 GiB volume mounted on .
@Zabe I saw that you handled T329577 a few years ago and I'm wondering if you can help me bring deployment-db15 online to take over for deployment-db14 (xref T428910#12010836).
Current output:
Notes:
From operations/mediawiki-config/wmf-config/db-labs.php:
'hostsByName' => [
    // deployment-db11.deployment-prep.eqiad1.wikimedia.cloud, master
    'deployment-db11' => '172.16.5.150:3306',
    // deployment-db14.deployment-prep.eqiad1.wikimedia.cloud
    'deployment-db14' => '172.16.5.170:3306',
],
@Jdforrester-WMF How do you feel about reverting https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1300267 until beta cluster is prepared to handle it?
Fri, Jun 5
Thu, Jun 4
In T256168#11960221, @bd808 wrote: In T256168#11959969, @hashar wrote: Can we claim victory on this one did you have following steps in mind? The ones I think of are removing the agent in Jenkins and deleting the jobs (I can take care of that).
On my side we are still missing any notification if the sync jobs break. I found a good blog post on using stanzas with systemd units over the weekend that actually seems like a promising direction. I would like email and irc yelling so we don't miss things getting messed up.
Wed, Jun 3
Since it's been a while since this was originally reported, here's a fresh hit from today:
@brennen Do you know anything about this node?
Tue, Jun 2
In T423914#11978634, @dancy wrote: Train is blocked at testwikis due to T427935.
Given the described scope of the problem (wikidata.org, which is in group1), I will roll the train to group0 now.
Train is blocked at testwikis due to T427935.
Changing priority to UBN! since this task was added as a train blocker in T423914. I'm currently holding the train.
Buildkit v0.30.0 deployed to all places.
Looks good now.
Here's a fresh report from a batch of these errors that I saw today:
May 29 2026
I prepared https://gitlab.wikimedia.org/repos/releng/gitlab-cloud-runner/-/merge_requests/610 which should help with this. It is live in the staging cluster now.
May 28 2026
It looks like the stream of container log records being returned from Kubernetes is getting mangled, possibly due to weird Unicode characters.
now exists with helm v4.2.0 installed.
May 27 2026
I deployed a new version of Reggie which handles errors in the upload and manifest cleaners.
Noting that it doesn't always stop at the same message. For example, I ran again recently and it stopped at an earlier timestamp of
Today I created https://gitlab.wikimedia.org/repos/releng/reggie/-/merge_requests/110 with a footer but no comment was added to that ticket.
I deleted the pod. It restarted and the cleaners are running properly again. Space usage is down from 123GB to 32GB at the moment.
Reggie's filesystem usage seems to be only increasing. I'm not seeing regular hits for the word "clean" in the log like I expect. I'll look into that.
May 20 2026
How do we feel about a GitLab repo for this purpose? Alternatively we can put something in https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/refs/heads/master/dockerfiles/, in which case we would receive the benefit of the image being updated when the base image is updated.
Dropping another variant here for searchability:
Error
Deployed in scap 4.266.0
May 18 2026
@Don-vip, I've made a configuration change which might help with your job. Please retry and let me know how it goes.
In T387886#11931806, @Don-vip wrote: It didn't help, sadly.
@Don-vip, please try adding the following to your file:
May 15 2026
May 14 2026
Thanks @bd808!
May 13 2026
Thanks @MGChecker and @cscott!
May 12 2026
I deployed https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1286464 but it did not have an effect on the logging rate.
From a recent deployment
17:56:20 Waiting 20 seconds for production traffic...
17:56:40 Logstash checker counted 107 error(s) in the last 20 seconds. OK.
The threshold is 150.
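The comparison between the counted errors and the threshold can be sketched as follows. This is an illustration only, not scap's actual implementation: the function name within_threshold is hypothetical, and treating a count equal to the threshold as passing is an assumption.

```shell
#!/bin/sh
# within_threshold COUNT THRESHOLD: succeed (exit 0) when COUNT does not
# exceed THRESHOLD. Hypothetical helper, not part of scap.
within_threshold() {
    [ "$1" -le "$2" ]
}

# Values taken from the deployment output: 107 errors, threshold 150.
if within_threshold 107 150; then
    echo "OK: $((150 - 107)) errors of headroom"
else
    echo "FAIL: error count exceeds threshold"
fi
```

With these values the headroom is 43 errors per 20-second window, which is why the warning volume matters.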
The volume of warnings has moved us dangerously close to the point where scap deployments will start complaining about it. This is not a place we want to be, so I increased the priority of this ticket to Unbreak Now.
May 11 2026
Probably caused by https://gerrit.wikimedia.org/r/c/operations/puppet/+/1283025 (T325394)
$ sudo run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find class role::mail::mx for deployment-mx04.deployment-prep.eqiad1.wikimedia.cloud on node deployment-mx04.deployment-prep.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run
