Description
As per title, all hosts running Graphite should be on Bullseye
root@cumin1001:~# cumin 'P{C:graphite} and not P{F:lsbdistcodename = buster}'
2 hosts will be targeted:
graphite2003.codfw.wmnet,graphite1004.eqiad.wmnet
DRY-RUN mode enabled, aborting
- Get the role to work in Pontoon on Bullseye
Action plan for codfw:
- Make sure is available on Bullseye and working as expected, as per https://wikitech.wikimedia.org/wiki/Graphite#Merge_and_sync_metrics
- Reimage graphite2003 with Bullseye. graphite1004 will start dropping metrics directed to graphite2003 as expected
- Ensure Puppet is running as expected, all services are up, and metrics are received from graphite1004, then wait 24h for some data to accumulate
- Transfer and merge data from graphite1004 to graphite2003 with carbonate, following https://wikitech.wikimedia.org/wiki/Graphite#Merge_and_sync_metrics
- Validate that historical metric data is present and new data is flowing
The plan for eqiad is similar, with the addition of a failover to codfw as per https://wikitech.wikimedia.org/wiki/Graphite#Failover and fail back once things are working in eqiad.
- Failover out of graphite1004 and to graphite2003
- Reimage graphite1004
- Transfer and merge data from graphite2003 to graphite1004 with carbonate, following https://wikitech.wikimedia.org/wiki/Graphite#Merge_and_sync_metrics
- Validate that historical metric data is present and new data is flowing
- Failover back to graphite1004
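The "validate" step in both plans could be sketched roughly as below. This is a minimal illustration, not the procedure actually used for this task: it assumes the JSON has already been fetched from Graphite's render API (e.g. `/render?target=...&format=json&from=-30d`), and the function names, metric name, and thresholds are all hypothetical.

```python
import time
from typing import Any, Optional


def summarize_series(series: dict[str, Any], now: float,
                     history_secs: int, fresh_secs: int) -> dict[str, bool]:
    """Check one render-API series for both old and recent datapoints."""
    # Datapoints are [value, timestamp] pairs; a null value means "no data".
    stamps = [ts for value, ts in series["datapoints"] if value is not None]
    return {
        # At least one non-null point older than the history threshold.
        "has_history": any(ts <= now - history_secs for ts in stamps),
        # At least one non-null point within the freshness window.
        "is_fresh": any(ts >= now - fresh_secs for ts in stamps),
    }


def validate(render_json: list[dict[str, Any]],
             now: Optional[float] = None,
             history_secs: int = 7 * 86400,   # assumed threshold: one week
             fresh_secs: int = 600) -> dict[str, dict[str, bool]]:
    """Map each target in a render-API JSON response to its check results."""
    now = time.time() if now is None else now
    return {
        s["target"]: summarize_series(s, now, history_secs, fresh_secs)
        for s in render_json
    }


if __name__ == "__main__":
    # Hypothetical payload shaped like a format=json render response.
    sample = [{
        "target": "servers.example.cpu",
        "datapoints": [[1.0, 1000], [None, 2000], [2.0, 100000]],
    }]
    print(validate(sample, now=100300, history_secs=86400, fresh_secs=600))
```

A target that reports `has_history` as false after the carbonate merge, or `is_fresh` as false after the failover, would be worth a closer look before moving to the next step.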
Details
Related Objects
| Status | Assigned | Task |
|---|---|---|
| Resolved | dancy | T302086 Set scap minimum python version to 3.7 |
| Resolved | None | T247045 Migrate all of production metal and VMs to Buster or later |
| Resolved | fgiunchedi | T247963 Migrate role::graphite::production to Bullseye |
| Resolved | fgiunchedi | T294220 Graphite query timeout from ExtensionDistributor |
Event Timeline
Change 726612 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] graphite: add Bullseye support
Change 726613 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] graphite: add Bullseye version of graphite auth/index
Change 726614 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] graphite: stop using LVM for /srv in labs
Change 726614 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: stop using LVM for /srv in labs
Change 726612 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: add Bullseye support
Change 726613 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: add Bullseye version of graphite auth/index
Change 726750 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] pontoon: use graphite-04 in o11y stack
Change 726750 merged by Filippo Giunchedi:
[operations/puppet@production] pontoon: use graphite-04 in o11y stack
A few roadblocks and bugs but overall progress, so far:
- isn't in stable, I've imported the testing version to
- has a CPU-hogging bug in stable, I've imported the testing version to
- needed an update (upstream version and python3). It is a local package and a new version lives in
Change 727293 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] statsite: switch to python3 on Bullseye
Change 727294 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] graphite: set settings_module from uwsgi
Change 727295 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] statsite: log instance identifier
Change 727293 merged by Filippo Giunchedi:
[operations/puppet@production] statsite: switch to python3 on Bullseye
Change 727294 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: set settings_module from uwsgi
Change 727295 merged by Filippo Giunchedi:
[operations/puppet@production] statsite: log instance identifier
Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite2003.codfw.wmnet
Change 729934 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] install_server: use standard recipe for graphite2003
Change 729934 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use standard recipe for graphite2003
Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite2003.codfw.wmnet completed:
- graphite2003 (PASS)
- Downtimed on Icinga
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110913_filippo_4002_graphite2003.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite2003.codfw.wmnet
Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite2003.codfw.wmnet completed:
- graphite2003 (PASS)
- Downtimed on Icinga
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110950_filippo_29925_graphite2003.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Change 729968 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] graphite: disable tags support
Change 729975 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] graphite: move production to /srv/carbon as storage directory
Change 729975 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: move production to /srv/carbon as storage directory
Change 730427 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] graphite: expire metric files not updated for 3y
Change 729968 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: disable tags support
Change 730427 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: expire metric files not updated for 3y
Mentioned in SAL (#wikimedia-operations) [2021-10-18T09:38:04Z] <godog> sync metrics from graphite1004 to graphite2003 - T247963
Change 731433 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] statsd: failover writes to graphite2003
Change 731434 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] monitoring: check graphite2003 metrics
Change 731435 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/dns@master] discovery: move read traffic to graphite2003
Change 731436 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/dns@master] wmnet: move writes to graphite2003
Change 731435 merged by Filippo Giunchedi:
[operations/dns@master] discovery: move read traffic to graphite2003
Mentioned in SAL (#wikimedia-operations) [2021-10-19T08:50:22Z] <godog> point graphite.discovery.wmnet to graphite2003 - T247963
Change 731434 merged by Filippo Giunchedi:
[operations/puppet@production] monitoring: check graphite2003 metrics
Mentioned in SAL (#wikimedia-operations) [2021-10-19T09:37:11Z] <godog> move graphite/statsd writes to graphite2003 - T247963
Change 731433 merged by Filippo Giunchedi:
[operations/puppet@production] statsd: failover writes to graphite2003
Change 731436 merged by Filippo Giunchedi:
[operations/dns@master] wmnet: move writes to graphite2003
Change 731917 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/deployment-charts@master] mwdebug: add graphite2003 to network policies
Change 731918 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd
Change 731918 merged by jenkins-bot:
[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd
Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:21:26Z] <oblivian@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:731918|ProductionServices: use graphite2003 for statsd (T247963)]] (duration: 00m 54s)
Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:22:05Z] <godog> flip mw statsd traffic with https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/731918 - T247963
Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:22:38Z] <oblivian@deploy1002> Synchronized tests/WmfConfigServicesTest.php: Config: [[gerrit:731918|ProductionServices: use graphite2003 for statsd (T247963)]] (duration: 00m 54s)
Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:45:26Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963
Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:45:37Z] <godog> bounce navtiming on webperf1001 to pick up statsd changes - T247963
Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:50:05Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963
Change 732273 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] install_server: use standard recipe for all graphite hosts
Change 732273 merged by Filippo Giunchedi:
[operations/puppet@production] install_server: use standard recipe for all graphite hosts
Change 731917 merged by jenkins-bot:
[operations/deployment-charts@master] mwdebug: fix statsd network policy
Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite1004.eqiad.wmnet with OS bullseye
Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite1004.eqiad.wmnet with OS bullseye completed:
- graphite1004 (PASS)
- Downtimed on Icinga
- Disabled Puppet
- Removed from Puppet and PuppetDB if present
- Deleted any existing Puppet certificate
- Removed from Debmonitor if present
- Forced PXE for next reboot
- Host rebooted via IPMI
- Host up (Debian installer)
- Host up (new fresh bullseye OS)
- Generated Puppet certificate
- Signed new Puppet certificate
- Run Puppet in NOOP mode to populate exported resources in PuppetDB
- Found Nagios_host resource for this host in PuppetDB
- Downtimed the new host on Icinga
- First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110210755_filippo_29322_graphite1004.out
- Checked BIOS boot parameters are back to normal
- Rebooted
- Automatic Puppet run was successful
- Forced a re-check of all Icinga services for the host
- Icinga status is optimal
- Icinga downtime removed
- Updated Netbox data from PuppetDB
Change 734224 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] graphite: bump fetch_timeout
Change 734225 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] graphite: set CLUSTER_SERVERS empty with no remote servers
Change 734277 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/dns@master] Revert "discovery: move read traffic to graphite2003"
Change 734278 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] Revert "statsd: failover writes to graphite2003"
Change 734279 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/puppet@production] Revert "monitoring: check graphite2003 metrics"
Change 734280 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/dns@master] Revert "wmnet: move writes to graphite2003"
Change 734281 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):
[operations/mediawiki-config@master] Revert "ProductionServices: use graphite2003 for statsd"
Change 734225 merged by Filippo Giunchedi:
[operations/puppet@production] graphite: set CLUSTER_SERVERS empty with no remote servers
Change 734277 merged by Filippo Giunchedi:
[operations/dns@master] Revert "discovery: move read traffic to graphite2003"
Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:27:13Z] <godog> move read traffic back to graphite1004 - T247963
Change 734279 merged by Filippo Giunchedi:
[operations/puppet@production] Revert "monitoring: check graphite2003 metrics"
Change 734278 merged by Filippo Giunchedi:
[operations/puppet@production] Revert "statsd: failover writes to graphite2003"
Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:40:19Z] <godog> flip back write traffic to graphite1004 (all but mediawiki) - T247963
Change 734280 merged by Filippo Giunchedi:
[operations/dns@master] Revert "wmnet: move writes to graphite2003"
Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:47:13Z] <godog> bounce navtiming on webperf1001 to pick up statsd changes - T247963
Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:49:16Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963
Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:49:24Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963
Change 734281 merged by jenkins-bot:
[operations/mediawiki-config@master] Revert "ProductionServices: use graphite2003 for statsd"
This is complete! Both graphite2003 and graphite1004 now run Bullseye, and the failover documentation is up to date. Thanks @Joe for the assistance with mw config deploys.
Change 734224 abandoned by Filippo Giunchedi:
[operations/puppet@production] graphite: bump fetch_timeout
Reason:
Not needed
Mentioned in SAL (#wikimedia-operations) [2021-12-09T03:37:10Z] <cwhite> bounce superset on an-tool1010 and 1005 to pick up statsd changes T247963
Mentioned in SAL (#wikimedia-operations) [2022-11-30T09:30:06Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963
Mentioned in SAL (#wikimedia-operations) [2022-11-30T09:32:48Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963
