
URL: https://phabricator.wikimedia.org/T425585

T425585: Write lightweight OCI-image-based Puppet plans for beta cluster
Closed, Resolved (Public)

Description

Beta cluster and Relforge are pre-production CirrusSearch environments. They don't require the exact production Puppet code, merely an OpenSearch instance with the same plugins as production. Using the same Puppet code in deployment-prep as in CirrusSearch production increases DPE's support burden because:

  • Changes to production Puppet code can potentially break deployment-prep
  • Puppet failures in deployment-prep break SSH login and set off alerts and phab tasks
  • Puppet hieradata is managed in a different way that can't be directly changed with git/gerrit.

Since we have Docker images for OpenSearch now, I'm going to experiment with a lighter-weight plan that uses to host the Docker images. This should enable us to change production without inadvertently breaking deployment-prep.

Scope update: Relforge is no longer in scope, only beta cluster remains in scope.

Creating this ticket to:

  • Create Puppet plan for that uses our OpenSearch images.
  • Create Puppet role for /beta cluster that applies the above.

Details

Related Changes in Gerrit:
Subject | Author | Repo | Branch | Lines +/-
cirrussearch: fix opensearch environment var name | Bking | operations/puppet | production | +1 -1
cirrussearch: Improve beta-cluster deploy | Bking | operations/puppet | production | +7 -4
deployment-prep: Update cirrussearch (OpenSearch) config | Bking | operations/mediawiki-config | master | +3 -9
cirrussearch: Add minimal opensearch config for deployment-prep | Bking | operations/puppet | production | +35 -4
cirrussearch: Fix bind mount path for deployment-prep | Bking | operations/puppet | production | +1 -1
cirrussearch: hard-code bind mount | Bking | operations/puppet | production | +1 -2
cirrussearch: Flesh out deployment-prep plan | Bking | operations/puppet | production | +12 -0
deployment-prep: activate new cirrussearch profile | Bking | operations/puppet | production | +1 -1
cirrussearch: create docker-based role for deployment-prep | Bking | operations/puppet | production | +29 -0
relforge: Switch to an OCI-image based profile | Bking | operations/puppet | production | +38 -6

Event Timeline

Comment Actions

The existing module might be usable for you rather than inventing a new system for running containers. gets quite a bit of use in Beta Cluster already for containerized workloads.

Marsam2489 renamed this task from Write lightweight quadlet-based Puppet plans for beta cluster/relforge to 1. Write lightweight quadlet-based Puppet plans for beta cluster/relforge. May 7 2026, 2:51 PM
Marsam2489 closed this task as Resolved.
Marsam2489 set Due Date to May 7 2026, 12:00 AM.
Marsam2489 set the point value for this task to 1.
Marsam2489 updated Other Assignee, added: Underscorre.
bking updated Other Assignee, removed: Underscorre.
Comment Actions

Thanks @bd808, I'll take a look for sure.

bking renamed this task from 1. Write lightweight quadlet-based Puppet plans for beta cluster/relforge to Write lightweight quadlet-based Puppet plans for beta cluster/relforge. May 7 2026, 10:04 PM
bking removed the point value 1 for this task.
Aklapper removed Due Date which was set to May 7 2026, 12:00 AM. May 11 2026, 10:59 AM
bking renamed this task from Write lightweight quadlet-based Puppet plans for beta cluster/relforge to Write lightweight OCI-image-based Puppet plans for beta cluster/relforge. May 15 2026, 2:18 PM
bking changed the task status from Open to In Progress.
bking triaged this task as Low priority.
bking updated the task description.
Comment Actions

Change #1287889 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] WIP: relforge: Switch to an OCI-image based profile

https://gerrit.wikimedia.org/r/1287889

bking renamed this task from Write lightweight OCI-image-based Puppet plans for beta cluster/relforge to Write lightweight OCI-image-based Puppet plans for beta cluster. May 26 2026, 2:01 PM
bking updated the task description.
Comment Actions

Change #1287889 abandoned by Bking:

[operations/puppet@production] relforge: Switch to an OCI-image based profile

Reason:

Going back to the standard Relforge setup. However, a subsequent patch will apply these concepts to beta cluster.

https://gerrit.wikimedia.org/r/1287889

Comment Actions

Change #1300232 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrussearch: create docker-based role for deployment-prep

https://gerrit.wikimedia.org/r/1300232

Comment Actions

Change #1300232 merged by Bking:

[operations/puppet@production] cirrussearch: create docker-based role for deployment-prep

https://gerrit.wikimedia.org/r/1300232

Comment Actions

Change #1300242 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] deployment-prep: activate new cirrussearch profile

https://gerrit.wikimedia.org/r/1300242

Comment Actions

Change #1300242 merged by Bking:

[operations/puppet@production] deployment-prep: activate new cirrussearch profile

https://gerrit.wikimedia.org/r/1300242

Comment Actions

After merging the above changes, I also had to update the "prefix puppet" (hieradata for a specific role or profile) for in Horizon.

I added the following:

profile::base::overlayfs: true
profile::base::production::role_description: cirrussearch
profile::opensearch::cirrus::oci_image::name: opensearch
profile::opensearch::cirrus::oci_image::ns: repos/data-engineering
profile::opensearch::cirrus::oci_image::version: 2026-05-13-143357-2b2e022d3e2756b0dbd31d66e341a587f5d3b8d7-production2

After removing Puppet SSL files, re-running Puppet, and manually signing the certificate on , is able to run Puppet without errors. However, the other VMs are now having Puppet issues. This is to be expected while we work through the kinks of the new role. From a practical standpoint, these hosts are still up and able to be used for testing.

The next step is to look at the docker service manifest again, figure out which arguments need to be applied (such as entrypoint, volume, and bind mounts), and add them to Horizon hieradata.
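
For readers following along, here is a minimal Python sketch of how the three oci_image hiera values above could combine into a full image reference. The registry host is taken from the docker run command quoted later in this task. The function name and the exact joining rule are assumptions for illustration, not the Puppet profile's actual logic.

```python
# Compose a full image reference from the three oci_image hiera values.
# REGISTRY comes from the docker run command quoted later in this task;
# image_reference() is a hypothetical helper, not part of the Puppet code.
REGISTRY = "docker-registry.wikimedia.org"


def image_reference(ns: str, name: str, version: str, registry: str = REGISTRY) -> str:
    """Return registry/ns/name:version, rejecting empty components."""
    parts = (("registry", registry), ("ns", ns), ("name", name), ("version", version))
    for label, value in parts:
        if not value:
            raise ValueError(f"{label} must not be empty")
    return f"{registry}/{ns.strip('/')}/{name}:{version}"


ref = image_reference(
    ns="repos/data-engineering",
    name="opensearch",
    version="2026-05-13-143357-2b2e022d3e2756b0dbd31d66e341a587f5d3b8d7-production2",
)
print(ref)
```

Pinning the full version string rather than a floating tag means the image only changes when the hieradata does.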

Comment Actions

Change #1300927 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] WIP: cirrussearch: Flesh out deployment-prep plan

https://gerrit.wikimedia.org/r/1300927

Comment Actions

Change #1300927 merged by Bking:

[operations/puppet@production] cirrussearch: Flesh out deployment-prep plan

https://gerrit.wikimedia.org/r/1300927

Comment Actions

Change #1301396 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrussearch: hard-code bind mount

https://gerrit.wikimedia.org/r/1301396

Comment Actions

Change #1301396 merged by Bking:

[operations/puppet@production] cirrussearch: hard-code bind mount

https://gerrit.wikimedia.org/r/1301396

Comment Actions

Change #1301408 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrussearch: Fix bind mount path for deployment-prep

https://gerrit.wikimedia.org/r/1301408

Comment Actions

Change #1301408 merged by Bking:

[operations/puppet@production] cirrussearch: Fix bind mount path for deployment-prep

https://gerrit.wikimedia.org/r/1301408

Comment Actions

Change #1302280 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrussearch: Add minimal opensearch config for deployment-prep

https://gerrit.wikimedia.org/r/1302280

Comment Actions

The change is working, I one-offed
in with the following, which roughly approximates the Puppet change above:

root@deployment-cirrussearch15:~# cat /etc/systemd/system/cirrussearch.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/docker run --rm=true \
--network host \
-v cirrussearch:/etc/cirrussearch \
-v /srv/opensearch:/usr/share/opensearch/data \
-v /etc/opensearch/opensearch.yml:/usr/share/opensearch/config/opensearch.yml \
--name %n docker-registry.wikimedia.org/repos/data-engineering/opensearch:2026-05-13-143357-2b2e022d3e2756b0dbd31d66e341a587f5d3b8d7-production2
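
Two details of the override above are worth spelling out. The empty ExecStart= line clears the command inherited from the base unit before the new one is set. In Docker's -v syntax, a source without a leading slash (cirrussearch) is a named volume, while absolute paths are bind mounts. The following Python sketch rebuilds the same argument list from a table of mounts. The helper name is hypothetical, and this is only an illustration of the structure, not how the Puppet code generates the unit.

```python
import shlex


def docker_run_argv(image, container_name, mounts, network="host"):
    """Build a docker run argument list mirroring the override.conf above.

    mounts is a list of (source, target) pairs; a source without a leading
    slash is interpreted by Docker as a named volume, not a host path.
    """
    argv = ["/usr/bin/docker", "run", "--rm=true", "--network", network]
    for source, target in mounts:
        argv += ["-v", f"{source}:{target}"]
    argv += ["--name", container_name, image]
    return argv


mounts = [
    ("cirrussearch", "/etc/cirrussearch"),              # named volume
    ("/srv/opensearch", "/usr/share/opensearch/data"),  # bind mount: index data
    ("/etc/opensearch/opensearch.yml",
     "/usr/share/opensearch/config/opensearch.yml"),    # bind mount: config file
]
image = (
    "docker-registry.wikimedia.org/repos/data-engineering/opensearch:"
    "2026-05-13-143357-2b2e022d3e2756b0dbd31d66e341a587f5d3b8d7-production2"
)
# %n is the systemd specifier for the full unit name.
argv = docker_run_argv(image, "%n", mounts)
print(shlex.join(argv))
```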
Comment Actions

Change #1302956 had a related patch set uploaded (by Bking; author: Bking):

[operations/mediawiki-config@master] deployment-prep: Update cirrussearch (OpenSearch) config

https://gerrit.wikimedia.org/r/1302956

Comment Actions

Change #1302280 merged by Bking:

[operations/puppet@production] cirrussearch: Add minimal opensearch config for deployment-prep

https://gerrit.wikimedia.org/r/1302280

Comment Actions

Per conversation with @dcausse, on-wiki search is currently down on the beta cluster. To fix this, we'll need to:

  1. Merge and deploy this patch
  2. Run something like in a tmux window on .
  3. Verify the indices are created from :
  4. Keep an eye out for other unexpected errors.
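
As a rough illustration of step 3, the sketch below checks a _cat/indices listing for expected index names. The base URL (localhost on OpenSearch's default port 9200), the helper names, and the sample index names are all assumptions. The actual host and command used in this task are not shown here.

```python
import json
import urllib.request


def fetch_indices(base_url="http://localhost:9200"):
    """Return the parsed _cat/indices listing (needs a reachable cluster)."""
    url = f"{base_url}/_cat/indices?format=json"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def missing_indices(listing, expected):
    """Return the expected index names absent from a _cat/indices listing."""
    present = {row["index"] for row in listing}
    return sorted(set(expected) - present)


# Hypothetical sample shaped like a _cat/indices?format=json response.
sample = [
    {"index": "examplewiki_content", "health": "green", "status": "open"},
    {"index": "examplewiki_general", "health": "yellow", "status": "open"},
]
expected = ["examplewiki_content", "examplewiki_general", "examplewiki_titlesuggest"]
print(missing_indices(sample, expected))
```

Against a live cluster, `missing_indices(fetch_indices(), expected)` would run the same check on real data.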
Comment Actions

Change #1302956 merged by jenkins-bot:

[operations/mediawiki-config@master] deployment-prep: Update cirrussearch (OpenSearch) config

https://gerrit.wikimedia.org/r/1302956

Comment Actions

@bking I believe that search is back on the beta cluster, but sadly we are running a bit low on disk (82% used).
Couple questions/comments:

  • Logs seem only available via docker logs (I could not find a link to the host fs)
  • Are there hiera values we could change to override some settings (Xmx, opensearch.yaml entries)?

It'd be great to clean up deprecated hiera config from https://horizon.wikimedia.org/project/prefixpuppet/?tab=prefix_puppet__puppet-deployment-cirrussearch to avoid possible confusion in the future.

I suspect we could also delete old instances now?

Thanks!

Comment Actions

Mentioned in SAL (#wikimedia-cloud) [2026-06-25T14:25:28Z] <inflatador> delete unused servers deployment-cirrussearch1[2-4] T425585

Comment Actions

@dcausse thanks for the questions, let me address those:

I believe that search is back on the beta cluster, but sadly we are running a bit low on disk (82% used).

I should be able to add a Cinder volume and tell OpenSearch to use that for storage, I'll get back to you on that.
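
A small Python sketch of how the disk figure could be re-checked before and after moving data to the new volume. The 80% threshold and the helper name are hypothetical. Note that df computes its percentage from used and available blocks, so its number can differ slightly from this one.

```python
import shutil


def percent_used(path):
    """Return the used share of the filesystem containing path, in percent."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


THRESHOLD = 80  # hypothetical warning level, not a value from this task

# "/" is used so the sketch runs anywhere; on the host, the path of
# interest would be the OpenSearch data directory (/srv/opensearch).
used = percent_used("/")
status = "low on space" if used >= THRESHOLD else "ok"
print(f"{used:.0f}% used ({status})")
```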

Logs seem only available via docker logs (I could not find a link to the host fs)

ACK, it sounds like we need another bind mount for logs. I can set that up.

Are there hiera values we could change to override some settings (Xmx, opensearch.yaml entries)?

We have the following options:

  • Go back to using the template from Puppet in beta cluster
  • Update the current hard-coded /create a for beta cluster in Puppet
  • Configure opensearch/jvm options via environment variables as described here

I think we could implement the final option via service::docker's environment arg. I'll get started on that, but if you prefer a different approach feel free to reply here.

I suspect we could also delete old instances now?

Done, thank you for the reminder.

Comment Actions

Mentioned in SAL (#wikimedia-cloud) [2026-06-25T16:50:20Z] <inflatador> add 60GB cinder vol to deployment-cirrussearch15 T425585

Comment Actions

Change #1305731 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrussearch: Improve beta-cluster deploy

https://gerrit.wikimedia.org/r/1305731

Comment Actions

Change #1305731 merged by Bking:

[operations/puppet@production] cirrussearch: Improve beta-cluster deploy

https://gerrit.wikimedia.org/r/1305731

Comment Actions

Change #1305760 had a related patch set uploaded (by Bking; author: Bking):

[operations/puppet@production] cirrussearch: fix opensearch environment var name

https://gerrit.wikimedia.org/r/1305760

Comment Actions

Change #1305760 merged by Bking:

[operations/puppet@production] cirrussearch: fix opensearch environment var name

https://gerrit.wikimedia.org/r/1305760

Comment Actions

OK, I have addressed the above concerns:

  • Attached/formatted/mounted a 60 GB Cinder volume and moved the OpenSearch data onto it
  • Created a bind mount to so logs are accessible from the host
  • Create/apply env var. This value currently sets the JVM heap size to 4 GB (1/2 of the VM's vRAM).
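
The heap arithmetic in the last bullet can be sketched as follows, assuming the VM has 8 GB of RAM (implied by 4 GB being half). OPENSEARCH_JAVA_OPTS is the variable documented for the upstream OpenSearch Docker image. Whether the merged patch uses exactly that name, and whether it sets both -Xms and -Xmx, is an assumption here.

```python
def heap_opts(vm_ram_gb: int) -> str:
    """Return JVM heap flags sized to half the VM's RAM (at least 1 GB).

    Setting -Xms and -Xmx to the same value is a common convention and an
    assumption in this sketch, not something stated in the task.
    """
    heap_gb = max(1, vm_ram_gb // 2)
    return f"-Xms{heap_gb}g -Xmx{heap_gb}g"


# Assumed variable name (from the upstream OpenSearch Docker image docs).
env = {"OPENSEARCH_JAVA_OPTS": heap_opts(8)}

# docker run accepts environment variables as repeated "-e KEY=VALUE" pairs.
docker_flags = [flag for key, value in env.items() for flag in ("-e", f"{key}={value}")]
print(docker_flags)
```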

As such, I'm moving this ticket to "Needs Review" so Search Platform team can verify that we're finished (or add more requirements if not).
