Description
Beta cluster and are pre-production cirrussearch environments. They don't require the exact production Puppet code, merely an OpenSearch instance with the same plugins as production. Using the same Puppet code in deployment-prep as in Cirrussearch prod increases DPE's support burden, as:
- Changes to production Puppet code can potentially break deployment-prep
- Puppet failures in deployment-prep break SSH login and set off alerts and phab tasks
- Puppet hieradata is managed in a different way that can't be directly changed with git/gerrit.
Since we have Docker images for OpenSearch now, I'm going to experiment with a lighter-weight plan that uses to host the Docker images. This should enable us to change production without inadvertently breaking deployment-prep.
Scope update: Relforge is no longer in scope, only beta cluster remains in scope.
Creating this ticket to:
- Create Puppet plan for that uses our OpenSearch images.
- Create Puppet role for /beta cluster that applies the above.
Details
Related Objects
| Status | Subtype | Assigned | Task |
|---|---|---|---|
| Open | | None | T421757 ☂️ Migrate production OpenSearch clusters from 1.x-2.x ☂️ |
| Duplicate | | bking | T421763 Migrate beta cluster to OpenSearch 2.x |
| Resolved | | bking | T425585 Write lightweight OCI-image-based Puppet plans for beta cluster |
Mentioned In
- T429296: Project deployment-prep instance deployment-cirrussearch14 is down
- T427196: [beta-cluster] Fetching task suggestions failed: cirrussearch-backend-error
- T428822: No Puppet resources found on instance deployment-cirrussearch13 on project deployment-prep
- T427306: Migrate Relforge clusters from OpenSearch 1.x->2x
Mentioned Here
- T427196: [beta-cluster] Fetching task suggestions failed: cirrussearch-backend-error
Duplicates Merged Here
- T421763: Migrate beta cluster to OpenSearch 2.x
Event Timeline
The existing module might be usable for you rather than inventing a new system for running containers. gets quite a bit of use in Beta Cluster already for containerized workloads.
Thanks @bd808, I'll take a look for sure.
Change #1287889 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] WIP: relforge: Switch to an OCI-image based profile
Change #1287889 abandoned by Bking:
[operations/puppet@production] relforge: Switch to an OCI-image based profile
Reason:
Going back to the standard Relforge setup. However, a subsequent patch will apply these concepts to beta cluster.
Change #1300232 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cirrussearch: create docker-based role for deployment-prep
Change #1300232 merged by Bking:
[operations/puppet@production] cirrussearch: create docker-based role for deployment-prep
Change #1300242 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] deployment-prep: activate new cirrussearch profile
Change #1300242 merged by Bking:
[operations/puppet@production] deployment-prep: activate new cirrussearch profile
After merging the above changes, I also had to update the "prefix puppet" (hieradata for a specific role or profile) for in Horizon.
I added the following:
profile::base::overlayfs: true
profile::base::production::role_description: cirrussearch
profile::opensearch::cirrus::oci_image::name: opensearch
profile::opensearch::cirrus::oci_image::ns: repos/data-engineering
profile::opensearch::cirrus::oci_image::version: 2026-05-13-143357-2b2e022d3e2756b0dbd31d66e341a587f5d3b8d7-production2
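For anyone checking these values by hand, here is a minimal Python sketch of how the three `oci_image` keys could combine into a full image reference. The registry host is taken from the `docker run` line later in this ticket; the function name and the way the pieces are joined are my assumptions for illustration, not necessarily what the Puppet profile does internally.

```python
# Hypothetical helper: joins the hiera oci_image values into an image reference.
# The registry host matches the docker run command quoted later in this ticket.
REGISTRY = "docker-registry.wikimedia.org"


def image_reference(ns: str, name: str, version: str, registry: str = REGISTRY) -> str:
    """Return '<registry>/<ns>/<name>:<version>', rejecting empty or padded values."""
    for label, value in (("ns", ns), ("name", name), ("version", version)):
        if not value or value != value.strip():
            raise ValueError(f"oci_image::{label} is empty or has stray whitespace")
    return f"{registry}/{ns.strip('/')}/{name}:{version}"


# Values copied from the hieradata above.
hiera = {
    "name": "opensearch",
    "ns": "repos/data-engineering",
    "version": "2026-05-13-143357-2b2e022d3e2756b0dbd31d66e341a587f5d3b8d7-production2",
}
print(image_reference(hiera["ns"], hiera["name"], hiera["version"]))
```

Comparing the printed reference against what `docker ps` reports on the host is a quick way to confirm the hieradata took effect.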
After removing Puppet SSL files, re-running Puppet, and manually signing the certificate on , is able to run Puppet without errors. However, the other VMs are now having Puppet issues. This is to be expected while we work through the kinks of the new role. From a practical standpoint, these hosts are still up and able to be used for testing.
The next step is to look at the docker service manifest again, figure out which arguments need to be applied (such as entrypoint, volume, bind mounts), and add them to Horizon hieradata.
Change #1300927 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] WIP: cirrussearch: Flesh out deployment-prep plan
Change #1300927 merged by Bking:
[operations/puppet@production] cirrussearch: Flesh out deployment-prep plan
Change #1301396 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cirrussearch: hard-code bind mount
Change #1301396 merged by Bking:
[operations/puppet@production] cirrussearch: hard-code bind mount
Change #1301408 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cirrussearch: Fix bind mount path for deployment-prep
Change #1301408 merged by Bking:
[operations/puppet@production] cirrussearch: Fix bind mount path for deployment-prep
Change #1302280 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cirrussearch: Add minimal opensearch config for deployment-prep
The change is working, I one-offed
in with the following, which roughly approximates the Puppet change above:
root@deployment-cirrussearch15:~# cat /etc/systemd/system/cirrussearch.service.d/override.conf
[Service]
ExecStart=
ExecStart=/usr/bin/docker run --rm=true \
  --network host \
  -v cirrussearch:/etc/cirrussearch \
  -v /srv/opensearch:/usr/share/opensearch/data \
  -v /etc/opensearch/opensearch.yml:/usr/share/opensearch/config/opensearch.yml \
  --name %n docker-registry.wikimedia.org/repos/data-engineering/opensearch:2026-05-13-143357-2b2e022d3e2756b0dbd31d66e341a587f5d3b8d7-production2
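To make the shape of that override easier to reason about, here is a rough Python sketch that rebuilds the same `docker run` argument list from a list of mounts. The function names are hypothetical and this is not how the Puppet code generates the unit; it only mirrors the flags shown above. It also flags which `-v` sources are host paths and which are named volumes, since Docker treats a source that is not an absolute path as a volume name.

```python
import shlex
from typing import List, Tuple

IMAGE = (
    "docker-registry.wikimedia.org/repos/data-engineering/opensearch:"
    "2026-05-13-143357-2b2e022d3e2756b0dbd31d66e341a587f5d3b8d7-production2"
)

# (source, target) pairs copied from the override above.
MOUNTS = [
    ("cirrussearch", "/etc/cirrussearch"),
    ("/srv/opensearch", "/usr/share/opensearch/data"),
    ("/etc/opensearch/opensearch.yml", "/usr/share/opensearch/config/opensearch.yml"),
]


def is_bind_mount(source: str) -> bool:
    """A -v source starting with '/' is a host path; anything else is a named volume."""
    return source.startswith("/")


def docker_run_command(image: str, name: str, mounts: List[Tuple[str, str]]) -> List[str]:
    """Assemble the docker run argv in the same order as the override file."""
    cmd = ["/usr/bin/docker", "run", "--rm=true", "--network", "host"]
    for source, target in mounts:
        cmd += ["-v", f"{source}:{target}"]
    cmd += ["--name", name, image]
    return cmd


cmd = docker_run_command(IMAGE, "%n", MOUNTS)
print(shlex.join(cmd))
for source, _ in MOUNTS:
    print(source, "-> bind mount" if is_bind_mount(source) else "-> named volume")
```

Note that the first mount (`cirrussearch`) comes out as a named volume rather than a host path, which may or may not be intended.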
Change #1302956 had a related patch set uploaded (by Bking; author: Bking):
[operations/mediawiki-config@master] deployment-prep: Update cirrussearch (OpenSearch) config
Change #1302280 merged by Bking:
[operations/puppet@production] cirrussearch: Add minimal opensearch config for deployment-prep
Per conversation with @dcausse, on-wiki search is currently down on the beta cluster. To fix this, we'll need to:
- Merge and deploy this patch
- Run something like in a tmux window on .
- Verify the indices are created from :
- Keep an eye out for other unexpected errors.
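For the index verification step, a small Python sketch along these lines could help. It queries the `_cat/indices` API with `format=json` and groups index names by health. The base URL (`http://localhost:9200`) and the function names are assumptions on my part; point it at whichever host and port the OpenSearch container actually listens on.

```python
import json
from collections import defaultdict
from typing import Dict, List
from urllib.request import urlopen


def fetch_indices(base_url: str = "http://localhost:9200") -> List[dict]:
    """Fetch the index listing as JSON. The default URL is an assumption."""
    with urlopen(f"{base_url}/_cat/indices?format=json", timeout=10) as resp:
        return json.load(resp)


def group_by_health(indices: List[dict]) -> Dict[str, List[str]]:
    """Map each health value (green/yellow/red) to a sorted list of index names."""
    grouped = defaultdict(list)
    for row in indices:
        grouped[row.get("health", "unknown")].append(row["index"])
    return {health: sorted(names) for health, names in grouped.items()}


# Usage (run on a host that can reach OpenSearch):
#   report = group_by_health(fetch_indices())
#   print(report.get("red", []))   # anything listed here needs attention
```

On a single-node cluster, indices with replicas configured will show as yellow rather than green, so red is the state to watch for.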
Change #1302956 merged by jenkins-bot:
[operations/mediawiki-config@master] deployment-prep: Update cirrussearch (OpenSearch) config
Mentioned in SAL (#wikimedia-cloud) [2026-06-18T07:23:03Z] <dcausse> reindexing all wikis to opensearch2 (T425585, T427196)
@bking I believe that search is back on the beta cluster but sadly we are running a bit low on disk (82% used).
Couple questions/comments:
- Logs seem only available via docker logs (I could not find a link to the host fs)
- Are there hiera values we could change to override some settings (Xmx, opensearch.yaml entries)?
It'd be great to clean up deprecated hiera config from https://horizon.wikimedia.org/project/prefixpuppet/?tab=prefix_puppet__puppet-deployment-cirrussearch to avoid possible confusion in the future.
I suspect we could also delete old instances now?
Thanks!
Mentioned in SAL (#wikimedia-cloud) [2026-06-25T14:25:28Z] <inflatador> delete unused servers deployment-cirrussearch1[2-4] T425585
@dcausse thanks for the questions, let me address those:
> I believe that search is back on the beta cluster but sadly we are running a bit low on disk (82% used).
I should be able to add a Cinder volume and tell OpenSearch to use that for storage, I'll get back to you on that.
> Logs seem only available via docker logs (I could not find a link to the host fs)
ACK, it sounds like we need another bind mount for logs. I can set that up.
> Are there hiera values we could change to override some settings (Xmx, opensearch.yaml entries)?
We have the following options:
- Go back to using the template from Puppet in beta cluster
- Update the current hard-coded /create a for beta cluster in Puppet
- Configure opensearch/jvm options via environment variables as described here
I think we could implement the final option via service::docker's environment arg. I'll get started on that, but if you prefer a different approach feel free to reply here.
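As a rough illustration of the environment-variable approach, here is a Python sketch that derives heap flags from the VM's memory and renders them as `docker run -e` arguments. The variable name `OPENSEARCH_JAVA_OPTS` is the one the upstream OpenSearch Docker image documents for JVM options; I'm assuming it applies to our image too. The half-of-RAM sizing, the 8 GB example value, and the function names are also assumptions, and how service::docker renders its environment arg may differ from this.

```python
from typing import Dict, List


def heap_opts(total_ram_gb: int) -> str:
    """Set min and max heap to half of RAM (assumed rule of thumb), at least 1 GB."""
    heap = max(1, total_ram_gb // 2)
    return f"-Xms{heap}g -Xmx{heap}g"


def docker_env_args(env: Dict[str, str]) -> List[str]:
    """Render an environment mapping as repeated '-e KEY=VALUE' docker arguments."""
    args: List[str] = []
    for key, value in sorted(env.items()):
        args += ["-e", f"{key}={value}"]
    return args


# Example: an 8 GB VM (hypothetical value for illustration).
env = {"OPENSEARCH_JAVA_OPTS": heap_opts(8)}
print(docker_env_args(env))
```

Keeping the heap setting in one environment variable means it can be changed from hieradata without touching the bind-mounted config file.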
> I suspect we could also delete old instances now?
Done, thank you for the reminder.
Mentioned in SAL (#wikimedia-cloud) [2026-06-25T16:50:20Z] <inflatador> add 60GB cinder vol to deployment-cirrussearch15 T425585
Change #1305731 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cirrussearch: Improve beta-cluster deploy
Change #1305731 merged by Bking:
[operations/puppet@production] cirrussearch: Improve beta-cluster deploy
Change #1305760 had a related patch set uploaded (by Bking; author: Bking):
[operations/puppet@production] cirrussearch: fix opensearch environment var name
Change #1305760 merged by Bking:
[operations/puppet@production] cirrussearch: fix opensearch environment var name
OK, I have addressed the above concerns:
- Attached, formatted, and mounted a 60 GB Cinder volume, and moved the OpenSearch data onto it
- Created a bind mount to so logs are accessible from the host
- Create/apply env var. This value currently sets the JVM heap size to 4 GB (1/2 of the VM's vRAM).
As such, I'm moving this ticket to "Needs Review" so Search Platform team can verify that we're finished (or add more requirements if not).
