URL: https://phabricator.wikimedia.org/T247963

Migrate role::graphite::production to Bullseye
Closed, Resolved · Public

Description

As per the title, all hosts running Graphite should be on Bullseye.

root@cumin1001:~# cumin 'P{C:graphite} and not P{F:lsbdistcodename = buster}'
2 hosts will be targeted:
graphite2003.codfw.wmnet,graphite1004.eqiad.wmnet
DRY-RUN mode enabled, aborting
  • Get the role to work in Pontoon on Bullseye

Action plan for codfw:

The plan for eqiad is similar, with the addition of a failover to codfw as per https://wikitech.wikimedia.org/wiki/Graphite#Failover and fail back once things are working in eqiad.
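
The failover and fail-back can be summarized as an ordered list of changes. The sketch below is only an illustration: `Step`, `revert_subject` and `failback_steps` are hypothetical names, and the ordering is taken from the merge order of the Gerrit changes recorded on this task, not from the wikitech failover page.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    repo: str
    subject: str


# Failover changes, in the order they were merged on this task.
FAILOVER_STEPS = [
    Step("operations/dns", "discovery: move read traffic to graphite2003"),
    Step("operations/puppet", "monitoring: check graphite2003 metrics"),
    Step("operations/puppet", "statsd: failover writes to graphite2003"),
    Step("operations/dns", "wmnet: move writes to graphite2003"),
    Step("operations/mediawiki-config",
         "ProductionServices: use graphite2003 for statsd"),
]


def revert_subject(step: Step) -> str:
    """Build the subject Gerrit gives to a revert of the change."""
    return f'Revert "{step.subject}"'


def failback_steps(steps: List[Step]) -> List[Step]:
    """Fail-back here reverted each change in the same order it was merged."""
    return [Step(s.repo, revert_subject(s)) for s in steps]


if __name__ == "__main__":
    for s in failback_steps(FAILOVER_STEPS):
        print(f"[{s.repo}] {s.subject}")
```

Keeping the steps as data makes it easy to check that every failover change has a matching revert before closing the task.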

Details

Related Changes in Gerrit:
Subject | Author | Repo | Branch | Lines +/-
graphite: bump fetch_timeout | Filippo Giunchedi | operations/puppet | production | +10 -0
Revert "ProductionServices: use graphite2003 for statsd" | Filippo Giunchedi | operations/mediawiki-config | master | +2 -2
Revert "wmnet: move writes to graphite2003" | Filippo Giunchedi | operations/dns | master | +3 -3
Revert "statsd: failover writes to graphite2003" | Filippo Giunchedi | operations/puppet | production | +1 -1
Revert "monitoring: check graphite2003 metrics" | Filippo Giunchedi | operations/puppet | production | +3 -3
Revert "discovery: move read traffic to graphite2003" | Filippo Giunchedi | operations/dns | master | +2 -2
graphite: set CLUSTER_SERVERS empty with no remote servers | Filippo Giunchedi | operations/puppet | production | +4 -6
mwdebug: fix statsd network policy | Filippo Giunchedi | operations/deployment-charts | master | +6 -1
install_server: use standard recipe for all graphite hosts | Filippo Giunchedi | operations/puppet | production | +1 -67
ProductionServices: use graphite2003 for statsd | Filippo Giunchedi | operations/mediawiki-config | master | +2 -2
wmnet: move writes to graphite2003 | Filippo Giunchedi | operations/dns | master | +3 -3
statsd: failover writes to graphite2003 | Filippo Giunchedi | operations/puppet | production | +1 -1
monitoring: check graphite2003 metrics | Filippo Giunchedi | operations/puppet | production | +3 -3
discovery: move read traffic to graphite2003 | Filippo Giunchedi | operations/dns | master | +2 -2
graphite: expire metric files not updated for 3y | Filippo Giunchedi | operations/puppet | production | +6 -0
graphite: disable tags support | Filippo Giunchedi | operations/puppet | production | +1 -0
graphite: move production to /srv/carbon as storage directory | Filippo Giunchedi | operations/puppet | production | +16 -4
install_server: use standard recipe for graphite2003 | Filippo Giunchedi | operations/puppet | production | +1 -0
statsite: log instance identifier | Filippo Giunchedi | operations/puppet | production | +1 -0
graphite: set settings_module from uwsgi | Filippo Giunchedi | operations/puppet | production | +2 -0
statsite: switch to python3 on Bullseye | Filippo Giunchedi | operations/puppet | production | +7 -1
pontoon: use graphite-04 in o11y stack | Filippo Giunchedi | operations/puppet | production | +1 -1
graphite: add Bullseye version of graphite auth/index | Filippo Giunchedi | operations/puppet | production | +103 -0
graphite: add Bullseye support | Filippo Giunchedi | operations/puppet | production | +21 -11
graphite: stop using LVM for /srv in labs | Filippo Giunchedi | operations/puppet | production | +0 -5
Fix installation of graphite-web on Buster | Muehlenhoff | operations/puppet | production | +7 -4
graphite: django 2.2 compat | Filippo Giunchedi | operations/puppet | production | +5 -0

Event Timeline

lmata renamed this task from Migrate role::graphite::production to Buster to Migrate role::graphite::production to Bullseye. Aug 31 2021, 4:19 PM
lmata triaged this task as Medium priority. Sep 30 2021, 9:48 PM
lmata moved this task from Inbox to Up next on the SRE Observability (FY2021/2022-Q2) board.

Change 726612 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: add Bullseye support

https://gerrit.wikimedia.org/r/726612

Change 726613 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: add Bullseye version of graphite auth/index

https://gerrit.wikimedia.org/r/726613

Change 726614 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: stop using LVM for /srv in labs

https://gerrit.wikimedia.org/r/726614

Change 726614 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: stop using LVM for /srv in labs

https://gerrit.wikimedia.org/r/726614

Change 726612 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: add Bullseye support

https://gerrit.wikimedia.org/r/726612

Change 726613 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: add Bullseye version of graphite auth/index

https://gerrit.wikimedia.org/r/726613

Change 726750 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] pontoon: use graphite-04 in o11y stack

https://gerrit.wikimedia.org/r/726750

Change 726750 merged by Filippo Giunchedi:

[operations/puppet@production] pontoon: use graphite-04 in o11y stack

https://gerrit.wikimedia.org/r/726750

A few roadblocks and bugs, but overall progress. So far:

  • isn't in stable, I've imported the testing version to
  • has a CPU-hogging bug in stable, I've imported the testing version to
  • needed an update (upstream version and python3). It is a local package and a new version lives in

Change 727293 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsite: switch to python3 on Bullseye

https://gerrit.wikimedia.org/r/727293

Change 727294 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: set settings_module from uwsgi

https://gerrit.wikimedia.org/r/727294

Change 727295 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsite: log instance identifier

https://gerrit.wikimedia.org/r/727295

Change 727293 merged by Filippo Giunchedi:

[operations/puppet@production] statsite: switch to python3 on Bullseye

https://gerrit.wikimedia.org/r/727293

Change 727294 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: set settings_module from uwsgi

https://gerrit.wikimedia.org/r/727294

Change 727295 merged by Filippo Giunchedi:

[operations/puppet@production] statsite: log instance identifier

https://gerrit.wikimedia.org/r/727295

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite2003.codfw.wmnet

Change 729934 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] install_server: use standard recipe for graphite2003

https://gerrit.wikimedia.org/r/729934

Change 729934 merged by Filippo Giunchedi:

[operations/puppet@production] install_server: use standard recipe for graphite2003

https://gerrit.wikimedia.org/r/729934

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite2003.codfw.wmnet completed:

  • graphite2003 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110913_filippo_4002_graphite2003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB
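
The reimage log path reported above appears to encode when the run started, who ran it, and the host. The sketch below splits such a path into fields. The field meanings are assumptions read off the file names in this task (in particular, the third field is only assumed to be some per-run number such as a process ID), and `ReimageLog` and `parse_reimage_log` are hypothetical names.

```python
from datetime import datetime
from pathlib import PurePosixPath
from typing import NamedTuple


class ReimageLog(NamedTuple):
    started: datetime  # assumed format: YYYYMMDDHHMM
    user: str
    run_id: int        # assumption: per-run number, possibly a PID
    host: str


def parse_reimage_log(path: str) -> ReimageLog:
    """Split '<timestamp>_<user>_<number>_<host>.out' into its fields."""
    stem = PurePosixPath(path).stem  # drop the directory and '.out'
    stamp, user, run_id, host = stem.split("_", 3)
    return ReimageLog(
        started=datetime.strptime(stamp, "%Y%m%d%H%M"),
        user=user,
        run_id=int(run_id),
        host=host,
    )


if __name__ == "__main__":
    print(parse_reimage_log(
        "/var/log/spicerack/sre/hosts/reimage/"
        "202110110913_filippo_4002_graphite2003.out"
    ))
```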

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite2003.codfw.wmnet

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite2003.codfw.wmnet completed:

  • graphite2003 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110110950_filippo_29925_graphite2003.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 729968 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: disable tags support

https://gerrit.wikimedia.org/r/729968

Change 729975 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: move production to /srv/carbon as storage directory

https://gerrit.wikimedia.org/r/729975

Change 729975 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: move production to /srv/carbon as storage directory

https://gerrit.wikimedia.org/r/729975

Change 730427 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: expire metric files not updated for 3y

https://gerrit.wikimedia.org/r/730427

Change 729968 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: disable tags support

https://gerrit.wikimedia.org/r/729968

Change 730427 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: expire metric files not updated for 3y

https://gerrit.wikimedia.org/r/730427
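
As an illustration of what "expire metric files not updated for 3y" could mean in practice, the sketch below lists Whisper files whose modification time is older than three years. It is not the Puppet change itself: `stale_whisper_files` is a hypothetical helper, the three-year window is approximated as 3 × 365 days, and the function only reports candidates without deleting anything.

```python
import os
import time

# Assumption: "3y" approximated as 3 * 365 days.
THREE_YEARS_SECONDS = 3 * 365 * 24 * 60 * 60


def stale_whisper_files(root, max_age_seconds=THREE_YEARS_SECONDS, now=None):
    """Return sorted paths of .wsp files under root not modified recently."""
    now = time.time() if now is None else now
    cutoff = now - max_age_seconds
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".wsp"):
                continue  # only Whisper database files are considered
            path = os.path.join(dirpath, name)
            if os.stat(path).st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)


if __name__ == "__main__":
    # /srv/carbon is the storage directory named in this task; the exact
    # subdirectory holding the Whisper files is not stated there.
    for path in stale_whisper_files("/srv/carbon"):
        print(path)
```

Reporting first and deleting in a separate step leaves room to review the list before any data is removed.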

Mentioned in SAL (#wikimedia-operations) [2021-10-18T09:38:04Z] <godog> sync metrics from graphite1004 to graphite2003 - T247963

Change 731433 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] statsd: failover writes to graphite2003

https://gerrit.wikimedia.org/r/731433

Change 731434 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] monitoring: check graphite2003 metrics

https://gerrit.wikimedia.org/r/731434

Change 731435 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] discovery: move read traffic to graphite2003

https://gerrit.wikimedia.org/r/731435

Change 731436 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] wmnet: move writes to graphite2003

https://gerrit.wikimedia.org/r/731436

Change 731435 merged by Filippo Giunchedi:

[operations/dns@master] discovery: move read traffic to graphite2003

https://gerrit.wikimedia.org/r/731435

Mentioned in SAL (#wikimedia-operations) [2021-10-19T08:50:22Z] <godog> point graphite.discovery.wmnet to graphite2003 - T247963

Change 731434 merged by Filippo Giunchedi:

[operations/puppet@production] monitoring: check graphite2003 metrics

https://gerrit.wikimedia.org/r/731434

Mentioned in SAL (#wikimedia-operations) [2021-10-19T09:37:11Z] <godog> move graphite/statsd writes to graphite2003 - T247963

Change 731433 merged by Filippo Giunchedi:

[operations/puppet@production] statsd: failover writes to graphite2003

https://gerrit.wikimedia.org/r/731433

Change 731436 merged by Filippo Giunchedi:

[operations/dns@master] wmnet: move writes to graphite2003

https://gerrit.wikimedia.org/r/731436

Change 731917 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/deployment-charts@master] mwdebug: add graphite2003 to network policies

https://gerrit.wikimedia.org/r/731917

Change 731918 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd

https://gerrit.wikimedia.org/r/731918

Change 731918 merged by jenkins-bot:

[operations/mediawiki-config@master] ProductionServices: use graphite2003 for statsd

https://gerrit.wikimedia.org/r/731918

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:21:26Z] <oblivian@deploy1002> Synchronized wmf-config/ProductionServices.php: Config: [[gerrit:731918|ProductionServices: use graphite2003 for statsd (T247963)]] (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:22:38Z] <oblivian@deploy1002> Synchronized tests/WmfConfigServicesTest.php: Config: [[gerrit:731918|ProductionServices: use graphite2003 for statsd (T247963)]] (duration: 00m 54s)

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:45:26Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:45:37Z] <godog> bounce navtiming on webperf1001 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-19T10:50:05Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963
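
The restarts logged above suggest that some long-running services kept sending metrics to the address they resolved at startup. That is an inference from the log messages, not something the task states. For comparison, a minimal sketch of a statsd sender that resolves the host name on every send, so a DNS change is picked up without a restart (subject to any resolver caching on the host). `send_counter` is a hypothetical helper and the host name in the usage example is a placeholder.

```python
import socket


def send_counter(host: str, port: int, name: str, value: int = 1) -> bytes:
    """Send one statsd counter ('<name>:<value>|c') over UDP.

    The host name is resolved on each call rather than once at startup.
    """
    payload = f"{name}:{value}|c".encode("ascii")
    family, socktype, proto, _canon, addr = socket.getaddrinfo(
        host, port, type=socket.SOCK_DGRAM
    )[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.sendto(payload, addr)
    return payload


if __name__ == "__main__":
    # Placeholder target; 8125 is the conventional statsd UDP port.
    send_counter("localhost", 8125, "example.requests")
```

Resolving per send costs a lookup on every metric, so a real client might instead re-resolve on a timer; this version only shows the idea.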

Change 732273 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] install_server: use standard recipe for all graphite hosts

https://gerrit.wikimedia.org/r/732273

Change 732273 merged by Filippo Giunchedi:

[operations/puppet@production] install_server: use standard recipe for all graphite hosts

https://gerrit.wikimedia.org/r/732273

Change 731917 merged by jenkins-bot:

[operations/deployment-charts@master] mwdebug: fix statsd network policy

https://gerrit.wikimedia.org/r/731917

Cookbook cookbooks.sre.hosts.reimage was started by filippo@cumin1001 for host graphite1004.eqiad.wmnet with OS bullseye

Cookbook cookbooks.sre.hosts.reimage started by filippo@cumin1001 for host graphite1004.eqiad.wmnet with OS bullseye completed:

  • graphite1004 (PASS)
    • Downtimed on Icinga
    • Disabled Puppet
    • Removed from Puppet and PuppetDB if present
    • Deleted any existing Puppet certificate
    • Removed from Debmonitor if present
    • Forced PXE for next reboot
    • Host rebooted via IPMI
    • Host up (Debian installer)
    • Host up (new fresh bullseye OS)
    • Generated Puppet certificate
    • Signed new Puppet certificate
    • Run Puppet in NOOP mode to populate exported resources in PuppetDB
    • Found Nagios_host resource for this host in PuppetDB
    • Downtimed the new host on Icinga
    • First Puppet run completed and logged in /var/log/spicerack/sre/hosts/reimage/202110210755_filippo_29322_graphite1004.out
    • Checked BIOS boot parameters are back to normal
    • Rebooted
    • Automatic Puppet run was successful
    • Forced a re-check of all Icinga services for the host
    • Icinga status is optimal
    • Icinga downtime removed
    • Updated Netbox data from PuppetDB

Change 734224 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: bump fetch_timeout

https://gerrit.wikimedia.org/r/734224

Change 734225 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] graphite: set CLUSTER_SERVERS empty with no remote servers

https://gerrit.wikimedia.org/r/734225

Change 734277 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] Revert "discovery: move read traffic to graphite2003"

https://gerrit.wikimedia.org/r/734277

Change 734278 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Revert "statsd: failover writes to graphite2003"

https://gerrit.wikimedia.org/r/734278

Change 734279 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/puppet@production] Revert "monitoring: check graphite2003 metrics"

https://gerrit.wikimedia.org/r/734279

Change 734280 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/dns@master] Revert "wmnet: move writes to graphite2003"

https://gerrit.wikimedia.org/r/734280

Change 734281 had a related patch set uploaded (by Filippo Giunchedi; author: Filippo Giunchedi):

[operations/mediawiki-config@master] Revert "ProductionServices: use graphite2003 for statsd"

https://gerrit.wikimedia.org/r/734281

Change 734225 merged by Filippo Giunchedi:

[operations/puppet@production] graphite: set CLUSTER_SERVERS empty with no remote servers

https://gerrit.wikimedia.org/r/734225
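
In graphite-web, `CLUSTER_SERVERS` is a list of `host:port` strings naming the other webapps to query. The sketch below shows one way such a list could be derived so that it comes out empty when there are no remote servers. It is an assumption about the intent of the change above, not a copy of the Puppet template; `cluster_servers` and the default port are hypothetical.

```python
from typing import Iterable, List


def cluster_servers(all_hosts: Iterable[str], local_host: str,
                    port: int = 80) -> List[str]:
    """Return 'host:port' entries for every host except the local one.

    With only the local host configured the result is an empty list.
    """
    return [f"{host}:{port}" for host in all_hosts if host != local_host]


if __name__ == "__main__":
    # A single-host setup yields an empty CLUSTER_SERVERS.
    CLUSTER_SERVERS = cluster_servers(
        ["graphite1004.eqiad.wmnet"], "graphite1004.eqiad.wmnet"
    )
    print(CLUSTER_SERVERS)
```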

Change 734277 merged by Filippo Giunchedi:

[operations/dns@master] Revert "discovery: move read traffic to graphite2003"

https://gerrit.wikimedia.org/r/734277

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:27:13Z] <godog> move read traffic back to graphite1004 - T247963
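
The SAL entries in this timeline carry UTC timestamps, which makes it possible to measure how long read traffic was served from codfw. A minimal sketch, assuming the entry layout seen on this page (`SalEntry` and `parse_sal` are hypothetical names):

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

SAL_RE = re.compile(
    r"^Mentioned in SAL \((?P<channel>#[^)]+)\) "
    r"\[(?P<ts>[^\]]+)\] <(?P<user>[^>]+)> (?P<msg>.*)$"
)


class SalEntry(NamedTuple):
    channel: str
    when: datetime
    user: str
    message: str


def parse_sal(line: str) -> SalEntry:
    """Parse one 'Mentioned in SAL' line as it appears in this timeline."""
    m = SAL_RE.match(line)
    if m is None:
        raise ValueError(f"not a SAL entry: {line!r}")
    when = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%SZ")
    return SalEntry(m["channel"], when.replace(tzinfo=timezone.utc),
                    m["user"], m["msg"])


READS_TO_CODFW = ("Mentioned in SAL (#wikimedia-operations) "
                  "[2021-10-19T08:50:22Z] <godog> point "
                  "graphite.discovery.wmnet to graphite2003 - T247963")
READS_BACK = ("Mentioned in SAL (#wikimedia-operations) "
              "[2021-10-26T09:27:13Z] <godog> move read traffic back "
              "to graphite1004 - T247963")

if __name__ == "__main__":
    start, end = parse_sal(READS_TO_CODFW), parse_sal(READS_BACK)
    print(end.when - start.when)  # 7 days, 0:36:51
```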

Change 734279 merged by Filippo Giunchedi:

[operations/puppet@production] Revert "monitoring: check graphite2003 metrics"

https://gerrit.wikimedia.org/r/734279

Change 734278 merged by Filippo Giunchedi:

[operations/puppet@production] Revert "statsd: failover writes to graphite2003"

https://gerrit.wikimedia.org/r/734278

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:40:19Z] <godog> flip back write traffic to graphite1004 (all but mediawiki) - T247963

Change 734280 merged by Filippo Giunchedi:

[operations/dns@master] Revert "wmnet: move writes to graphite2003"

https://gerrit.wikimedia.org/r/734280

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:47:13Z] <godog> bounce navtiming on webperf1001 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:49:16Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2021-10-26T09:49:24Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963

Change 734281 merged by jenkins-bot:

[operations/mediawiki-config@master] Revert "ProductionServices: use graphite2003 for statsd"

https://gerrit.wikimedia.org/r/734281

fgiunchedi claimed this task.
fgiunchedi added a subscriber: Joe.

This is complete! Both graphite2003 and graphite1004 now run Bullseye, and the failover documentation is up to date. Thanks @Joe for the assistance with mw config deploys.

Change 734224 abandoned by Filippo Giunchedi:

[operations/puppet@production] graphite: bump fetch_timeout

Reason:

Not needed

https://gerrit.wikimedia.org/r/734224

Mentioned in SAL (#wikimedia-operations) [2021-12-09T03:37:10Z] <cwhite> bounce superset on an-tool1010 and 1005 to pick up statsd changes T247963

Mentioned in SAL (#wikimedia-operations) [2022-11-30T09:30:06Z] <godog> bounce superset on an-tool1010 to pick up statsd changes - T247963

Mentioned in SAL (#wikimedia-operations) [2022-11-30T09:32:48Z] <godog> bounce superset on an-tool1005 to pick up statsd changes - T247963
