URL: https://phabricator.wikimedia.org/p/dancy/

dancy (Ahmon Dancy)
Staff Software Engineer, Release Engineering
Administrator

User Details

User Since
Jun 27 2020, 12:14 AM (313 w, 2 d)
Roles
Administrator
Availability
Available
IRC Nick
dancy
LDAP User
Ahmon Dancy
MediaWiki User
ADancy (WMF) [ Global Accounts ]

Recent Activity

Fri, Jun 26

Thu, Jun 25

And:

for n in $(seq 12 14); do host=deployment-cirrussearch$n.deployment-prep.eqiad1.wikimedia.cloud; echo $host; sudo puppetserver ca clean --certname $host; done
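
A loop like the one above is destructive, so it can help to preview the generated host names first. The following is a minimal sketch, not part of the original comment; the function name `cirrussearch_hosts` is hypothetical, and it only prints names without calling `puppetserver`.

```shell
# Hypothetical helper: print the host names that the cleanup loop would visit.
# Arguments: first and last instance number (inclusive).
cirrussearch_hosts() {
    for n in $(seq "$1" "$2"); do
        echo "deployment-cirrussearch$n.deployment-prep.eqiad1.wikimedia.cloud"
    done
}

# Preview the same range used in the loop above.
cirrussearch_hosts 12 14
```

Once the list looks right, each printed name can be passed to `sudo puppetserver ca clean --certname` as in the original loop.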

I ran

sudo puppetserver ca clean --certname deployment-dancy2.deployment-prep.eqiad1.wikimedia.cloud
sudo puppetserver ca clean --certname deployment-dancy3.deployment-prep.eqiad1.wikimedia.cloud

to clean up after some test instances.

Wed, Jun 24

I chose to resolve this by deleting the two running Docker containers and making puppet recreate them:

The biggest consumer is

-rw-r----- 1 root root 9.0G Jun 24 18:42 /var/lib/docker/containers/c5a95725142c1168d1dca1c5a6bd3bf4ec5df287619997124dceebf1084baa56/c5a95725142c1168d1dca1c5a6bd3bf4ec5df287619997124dceebf1084baa56-json.log
dancy@deployment-changeprop-1:~$ df -t ext4 -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 20G 20G 0 100% /
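
One way a file like that could be located on a full disk is sketched below. This is an illustration under assumptions, not the command that was actually run: the helper name `largest_files` is hypothetical, and it relies on GNU `find` (for `-printf`).

```shell
# Hypothetical helper: list the N largest regular files under a directory.
# Arguments: directory, optional count (default 5).
largest_files() {
    dir="$1"
    count="${2:-5}"
    # -xdev keeps the search on one filesystem.
    # %s is the size in bytes and %p the path (GNU find).
    find "$dir" -xdev -type f -printf '%s\t%p\n' 2>/dev/null \
        | sort -rn \
        | head -n "$count"
}
```

Run from a root shell, `largest_files /var/lib/docker/containers 3` would print the three largest files there with their sizes in bytes, largest first.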

The tail end of output:

Error: Failed to apply catalog: No space left on device @ dir_s_mkdir - /var/lib/puppet/state/state.yaml20260624-2180970-o513qq.lock
Error: Could not save last run local report: No space left on device @ dir_s_mkdir - /var/cache/puppet/public/last_run_summary.yaml20260624-2180970-1uvgjro.lock
Error: Could not send report: No space left on device @ dir_s_mkdir - /var/lib/puppet/state/last_run_report.yaml20260624-2180970-1899ley.lock
dancy renamed T430075: Properly handle mediawiki code/config updates on deployment-prep jobrunners from Enable opcache revalidation on deployment-prep jobrunners to Properly handle mediawiki code/config updates on deployment-prep jobrunners.

Hi! With T429542 resolved, can you confirm whether this problem is also fixed? Thank you!

Tue, Jun 23

I terminated this host, so it should not be an issue now. I'll mark this ticket as resolved.

Thu, Jun 18

Thanks @Scott_French. Your suggested layout and sample config make sense to me and look like a good place to start experimenting with implementation, which I can do next week.

Wed, Jun 17

dancy renamed T429542: debian-12.0-bookworm and debian-13.0-trixie image still reference mirrors.wikimedia.org from debian-13.0-trixie image still references mirrors.wikimedia.org to debian-12.0-bookworm and debian-13.0-trixie image still reference mirrors.wikimedia.org.

The Cloud-Services project tag is not intended to have any tasks. Please check the list on https://phabricator.wikimedia.org/project/profile/832/ and replace it with a more specific project tag to this task. Thanks!

Tue, Jun 16

Mon, Jun 15

All servers referenced by operations/mediawiki-config are running MariaDB 10.11 now.
New nodes:

deployment-db15.deployment-prep.eqiad1.wikimedia.cloud
deployment-db16.deployment-prep.eqiad1.wikimedia.cloud

@Zabe I saw that you handled T329577 a few years ago and I'm wondering if you can help me bring deployment-db15 online to take over for deployment-db14 (xref T428910#12010836).

Sure. Please tell me if I can do something. :)

I found the script stalled on . Doing some debugging using I found it blocked on a query to . I logged into deployment-db14 and ran there and I see:

Query caused different errors on master and slave. Error on master: message (format)='Cannot load from %s.%s. The table is probably corrupted' error code=1728 ; Error on slave: actual message='no error', error code=0. Default database: 'repltest'. Query: 'drop database repltest'

Thu, Jun 11

I copied from to and I'm running

zcat /srv/db11-seed.sql.gz | sudo mysql
root@deployment-db11:~# time /opt/wmf-mariadb106/bin/mariadb-dump --all-databases --single-transaction --gtid --triggers | gzip > /srv/db11-seed.sql.gz

Initialize db stuff:

sudo -u mysql /opt/wmf-mariadb1011/scripts/mariadb-install-db \
 --basedir=/opt/wmf-mariadb1011 \
 --datadir=/srv/sqldata

Notes:
From operations/mediawiki-config/wmf-config/db-labs.php:

'hostsByName' => [
 // deployment-db11.deployment-prep.eqiad1.wikimedia.cloud, master
 'deployment-db11' => '172.16.5.150:3306',
 // deployment-db14.deployment-prep.eqiad1.wikimedia.cloud
 'deployment-db14' => '172.16.5.170:3306',
],

Fri, Jun 5

Thu, Jun 4

Can we claim victory on this one, or did you have follow-up steps in mind? The ones I can think of are removing the agent in Jenkins and deleting the jobs (I can take care of that).

On my side we are still missing any notification if the sync jobs break. I found a good blog post on using stanzas with systemd units over the weekend that actually seems like a promising direction. I would like email and irc yelling so we don't miss things getting messed up.

Wed, Jun 3

Since it's been a while since this was originally reported, here's a fresh hit from today:

Tue, Jun 2

Given the described scope of the problem (wikidata.org, which is in group1), I will roll the train to group0 now.

Changing priority to UBN! since this task was added as a train blocker in T423914. I'm currently holding the train.

Restricted Application changed the subtype of T366857: InvalidArgumentException from line 80 of ServerInfo.php: No server with index '0' (in a maintenance script) from "Task" to "Production Error".

Here's a fresh report from a batch of these errors that I saw today:

May 29 2026

May 28 2026

It looks like the stream of container log records being returned from Kubernetes is getting mangled, possibly due to weird Unicode characters?

May 27 2026

I deployed a new version of Reggie which handles errors in the upload and manifest cleaners.

Noting that it doesn't always stop at the same message. For example, I ran again recently and it stopped at an earlier timestamp of

dancy renamed T427315: Increase CI job timeout for helm-chart job (deployment-charts CI) from Increase CI job timeout for deployment-charts CI to Increase CI job timeout for helm-chart job (deployment-charts CI).

I deleted the pod. It restarted and the cleaners are running properly again. Space usage is down from 123GB to 32GB at the moment.

Reggie's filesystem usage seems to be only increasing. I'm not seeing regular hits for the word "clean" in the log like I expect. I'll look into that.

May 20 2026

How do we feel about a GitLab repo for this purpose? Alternatively we can put something in https://gerrit.wikimedia.org/r/plugins/gitiles/integration/config/+/refs/heads/master/dockerfiles/, in which case we would receive the benefit of the image being updated when the base image is updated.

May 18 2026

@Don-vip, I've made a configuration change which might help with your job. Please retry and let me know how it goes.

May 15 2026

May 14 2026

May 13 2026

May 12 2026

From a recent deployment

17:56:20 Waiting 20 seconds for production traffic...
17:56:40 Logstash checker counted 107 error(s) in the last 20 seconds. OK.

The threshold is 150.
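
The arithmetic behind those numbers can be sketched as follows. This is only an illustration: the function names are hypothetical, and treating a count strictly below the threshold as acceptable is an assumption, since the checker's exact comparison is not shown here.

```shell
# Hypothetical helpers; not taken from scap.

# Succeeds when the error count is below the threshold.
# Assumption: the real checker may use <= rather than <.
within_threshold() {
    [ "$1" -lt "$2" ]
}

# Prints how many more errors would fit before reaching the threshold.
headroom() {
    echo $(( $2 - $1 ))
}

if within_threshold 107 150; then
    echo "OK, headroom: $(headroom 107 150)"
fi
```

With 107 errors counted against a threshold of 150, only 43 additional errors in the same 20-second window would reach the limit.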

The volume of warnings has moved us dangerously close to the point where scap deployments will start complaining about it. This is not a place we want to be, so I increased the priority of this ticket to Unbreak Now.

May 11 2026

$ sudo run-puppet-agent
Info: Using environment 'production'
Info: Retrieving pluginfacts
Info: Retrieving plugin
Info: Loading facts
Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Could not find class role::mail::mx for deployment-mx04.deployment-prep.eqiad1.wikimedia.cloud on node deployment-mx04.deployment-prep.eqiad1.wikimedia.cloud
Warning: Not using cache on failed catalog
Error: Could not retrieve catalog; skipping run