Service Fabric cluster upgrade failure when converting certificate from thumbprint to common name
Hi everyone,
I am working on migrating an Azure Service Fabric cluster from a thumbprint-based certificate configuration to a common name (CN)-based configuration.
Current setup:
- Azure Service Fabric cluster (Windows, VMSS-based)
- originally using a self-signed certificate (thumbprint-based)
- cert stored in Azure Key Vault
- cluster successfully updated to use a new Let’s Encrypt certificate (via Acmebot) using thumbprint
- cert is correctly installed on all nodes (LocalMachine\My)
What I implemented:
- I deployed Acmebot and configured DNS (Azure DNS) for ACME challenge
- successfully issued a Let’s Encrypt cert and stored it in Key Vault
- updated cluster to use the new certificate via thumbprint (primary) — cluster is healthy
- Attempted to migrate to CN-based configuration (following https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-change-cert-thumbprint-to-cn ) by replacing certificate/certificateSecondary with certificateCommonNames (including issuer thumbprint)
Issue: When applying the CN-based config my cluster upgrade fails with:
Cluster upgrade failed. Reason Code: 'UpgradeServiceDown'
I would be grateful for any tips to solve it ;-)
Thanks,
Kornelia
2 answers
-
Hello
Thank you for the detailed information and for outlining the steps already completed. Based on your description, the cluster remains healthy with the thumbprint-based configuration and fails only when transitioning to the Common Name (CN)-based setup with the error “UpgradeServiceDown.”
This error generally indicates that Service Fabric system services were unable to start during the upgrade, most commonly due to certificate resolution or validation failure at runtime, even when the certificate is present on all nodes.
In this scenario, we recommend validating the following areas:
Please review the CN configuration in your cluster template. If a certificateIssuerThumbprint is specified, it can cause failures because Let’s Encrypt frequently rotates intermediate certificates, and the issuer thumbprint may not match what is installed on the node. In such cases, it is recommended to configure only the certificateCommonName and avoid enforcing the issuer thumbprint.
Ensure the full certificate chain is trusted on all nodes. Let’s Encrypt certificates depend on intermediate and root certificates. These must be present in the LocalMachine\Root and LocalMachine\CA stores on every node. If the chain is incomplete, Service Fabric will fail certificate validation even if the leaf certificate is installed.
Ensure only one valid certificate matches the configured CN on each node. If multiple certificates with the same Common Name exist in the LocalMachine\My store, Service Fabric may not consistently select the correct certificate during upgrade. Please verify and remove any expired or unused certificates that share the same CN.
Ensure that the certificate configuration is updated consistently in both the Service Fabric cluster resource and the Virtual Machine Scale Set (FabricNode extension). Any mismatch between these configurations can lead to certificate resolution failures during upgrade.
Ensure the cluster is in a valid starting state before switching to CN-based configuration. This may require temporarily configuring both the old and new certificates (primary and secondary), completing an upgrade, and then removing the old certificate so that only the target certificate remains before introducing the CN configuration.
We also recommend performing the migration in a staged manner. Introduce the CN-based configuration alongside the existing thumbprint configuration, confirm the cluster remains healthy, and then promote CN to primary in a subsequent upgrade. This approach reduces the risk of upgrade failures.
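The consistency check between the cluster resource and the VMSS extension can be done offline on the exported template before deploying. Below is a minimal Python sketch, not an official tool. It assumes the JSON layout used in the Microsoft conversion guide: certificateCommonNames.commonNames[].certificateCommonName on the cluster resource, and settings.certificate.commonNames on the node extension. The function names and the sample template are hypothetical.

```python
def cluster_common_names(resource):
    """Return the set of CNs declared on a Microsoft.ServiceFabric/clusters resource."""
    block = resource.get("properties", {}).get("certificateCommonNames", {})
    return {entry["certificateCommonName"] for entry in block.get("commonNames", [])}


def vmss_common_names(resource):
    """Return the set of CNs declared in the Service Fabric node extension of a VMSS."""
    profile = resource.get("properties", {}).get("virtualMachineProfile", {})
    extensions = profile.get("extensionProfile", {}).get("extensions", [])
    names = set()
    for ext in extensions:
        props = ext.get("properties", {})
        # Assumption: the node extension type starts with "ServiceFabric"
        # (for example "ServiceFabricNode" on Windows).
        if not str(props.get("type", "")).startswith("ServiceFabric"):
            continue
        cert = props.get("settings", {}).get("certificate", {})
        names.update(cert.get("commonNames", []))
    return names


def find_mismatches(template):
    """Return names of scale sets whose CN declaration differs from the cluster's."""
    resources = template.get("resources", [])
    cluster_cns = set()
    for res in resources:
        if res.get("type") == "Microsoft.ServiceFabric/clusters":
            cluster_cns |= cluster_common_names(res)
    return [
        res.get("name")
        for res in resources
        if res.get("type") == "Microsoft.Compute/virtualMachineScaleSets"
        and vmss_common_names(res) != cluster_cns
    ]


def _vmss(name, certificate):
    """Build a hypothetical, stripped-down VMSS resource for the example."""
    extension = {"properties": {"type": "ServiceFabricNode",
                                "settings": {"certificate": certificate}}}
    return {"type": "Microsoft.Compute/virtualMachineScaleSets", "name": name,
            "properties": {"virtualMachineProfile": {
                "extensionProfile": {"extensions": [extension]}}}}


# Hypothetical template: nt1 was converted to CN, nt2 still declares a thumbprint.
sample_template = {"resources": [
    {"type": "Microsoft.ServiceFabric/clusters", "name": "mycluster",
     "properties": {"certificateCommonNames": {
         "commonNames": [{"certificateCommonName": "sf.example.com",
                          "certificateIssuerThumbprint": ""}],
         "x509StoreName": "My"}}},
    _vmss("nt1", {"commonNames": ["sf.example.com"], "x509StoreName": "My"}),
    _vmss("nt2", {"thumbprint": "PLACEHOLDER", "x509StoreName": "My"}),
]}

print(find_mismatches(sample_template))  # ['nt2']
```

The comparison is on raw strings, so a template that uses expressions such as "[parameters('certificateCommonName')]" only matches if both resources use the same expression. Load a real template with json.load and pass the resulting dict to find_mismatches.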
If the issue persists, please review the cluster upgrade and infrastructure logs (FabricDeployer, ClusterManager, FabricHost, and FabricUpgradeService) located under:
C:\ProgramData\SF\Log\FabricUpgradeService
These logs typically provide details on whether the failure is due to certificate not found, chain validation issues, or certificate resolution errors.
Additionally, if the cluster is running on Service Fabric version 8.2 CU2, please be aware of a known issue related to ImageStoreService that can impact certificate-related operations and cluster upgrades. Applying the documented mitigation or upgrading to a later cumulative update is recommended.
Known issue reference:
Similar Issue: https://github.com/microsoft/service-fabric-issues/issues/1435
Hope this helps! Please let me know if you have any queries in comments.
-
AI answer
The failure is caused by nodes being unable to mutually authenticate during the upgrade when switching from thumbprint-based to CN-based declarations. During a rolling upgrade, upgraded and not-yet-upgraded nodes must both be able to validate each other’s certificates at all times. If the configuration jumps directly from thumbprint-only to CN-only (or CN with issuer pinning) without following the supported transition pattern, the UpgradeService and other system services can no longer establish secure connections, which leads to UpgradeServiceDown.
To resolve this, follow the supported two-phase conversion pattern and ensure a valid starting state:
- Ensure a valid starting state using thumbprints only
The cluster must already be using the goal certificate (the Let’s Encrypt certificate) by thumbprint before converting to CN. Valid starting states include:
  - Thumbprint: GoalCert, ThumbprintSecondary: None, or
  - Thumbprint: GoalCert, ThumbprintSecondary: OldCert1, or
  - Thumbprint: OldCert1, ThumbprintSecondary: GoalCert (with GoalCert having a later NotBefore date than OldCert1).
If the cluster is not in one of these states, adjust using one or two thumbprint-only upgrades as described in the “Achieve a valid starting state” table. For example, if currently Thumbprint: OldCert1, ThumbprintSecondary: None and OldCert1 has a later NotBefore than GoalCert, first move to Thumbprint: GoalCert, ThumbprintSecondary: OldCert1, then to Thumbprint: GoalCert, ThumbprintSecondary: None.
- Verify certificates are installed on all nodes
Before changing declarations, ensure both the current and target cluster certificates are provisioned to every VM in each VM scale set (LocalMachine\My). In Azure, this is typically done via Key Vault integration with VMSS. The documentation explicitly requires that all relevant certificates be installed on all nodes before changing the cluster’s certificate declarations.
- Choose the CN validation scheme correctly
When switching to CN-based declarations, decide whether to:
  - Rely on the trusted root CA (leave certificateIssuerThumbprintList empty), or
  - Pin the direct issuer(s) via certificateIssuerThumbprintList (recommended).
If issuer thumbprints are specified, the certificate is only accepted if its direct issuer’s thumbprint matches one of the values in the list. If the Let’s Encrypt certificate is issued by a different intermediate than expected, or if not all relevant issuer thumbprints are listed, the certificate will be rejected and mutual authentication will fail, causing system services (including UpgradeService) to go down. Ensure that all expected issuer thumbprints for that CN are included.
- Perform the conversion from thumbprint to CN in two phases
The safe pattern is:
  - Phase 1: Cluster uses the goal certificate by thumbprint only, and validation rules are extended to accept both thumbprint and CN. Conceptually:
    - Initial state: all nodes present and validate by thumbprint A only.
    - After first upgrade domain: upgraded nodes still present A, but accept A (thumbprint) and B (CN) for validation; other nodes still present and accept A only.
  - Phase 2: Change the declaration so that the same certificate is declared by CN (with optional issuer thumbprints) and remove the thumbprint-based presentation. Because validation already accepts CN, nodes can mutually authenticate during the rolling upgrade.
  - First upgrade: keep the thumbprint-based certificate as-is (GoalCert as primary), but extend validation to include the CN (and issuer thumbprints) according to the rules in the certificate configuration documentation.
  - Second upgrade: update the cluster resource to use certificateCommonNames with commonNames (and certificateIssuerThumbprintList if pinning issuers), and remove the certificate property. Also update the Service Fabric node extension in each VMSS to match the CN-based configuration. Then redeploy the updated template.
- Update and redeploy the ARM template correctly
  - Export the current template from the resource group if the original is not available.
  - Update each VM scale set’s Service Fabric node extension definition and the Microsoft.ServiceFabric/clusters resource: add certificateCommonNames with the correct CN and issuer thumbprints, and remove the old certificate block only in the second phase.
  - Redeploy using New-AzResourceGroupDeployment with the updated template and parameters.
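The issuer-pinning rule from the validation-scheme step can be illustrated with a small Python sketch. It models only the rule as described in this answer: an empty list means the issuer is not pinned, and a non-empty list means the direct issuer's thumbprint must be in it. It is not Service Fabric's actual implementation, it does not perform chain validation, and the thumbprint values and function names are made-up placeholders.

```python
def normalize(thumbprint):
    """Compare thumbprints without regard to spaces or letter case (an assumption)."""
    return thumbprint.replace(" ", "").upper()


def issuer_accepted(issuer_thumbprint, pinned_issuers):
    """Apply the pinning rule to the thumbprint of a certificate's direct issuer."""
    if not pinned_issuers:
        # No pinning configured: this check passes, and trust is decided
        # by normal chain validation up to a trusted root instead.
        return True
    return normalize(issuer_thumbprint) in {normalize(t) for t in pinned_issuers}


# Hypothetical thumbprints of two intermediates that might issue the certificate.
INTERMEDIATE_A = "AA11 BB22 CC33"
INTERMEDIATE_B = "dd44ee55ff66"

print(issuer_accepted(INTERMEDIATE_B, [INTERMEDIATE_A]))                  # False
print(issuer_accepted(INTERMEDIATE_B, [INTERMEDIATE_A, INTERMEDIATE_B]))  # True
print(issuer_accepted(INTERMEDIATE_B, []))                                # True
```

The first call shows the failure mode discussed above: the certificate was issued by an intermediate that is not in the pinned list, so it is rejected even though it is otherwise valid.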
If the upgrade is failing now, roll back to the last healthy thumbprint-based configuration, verify that the cluster is in a valid starting state with GoalCert as the active thumbprint, confirm that all nodes have the certificate installed, then reattempt the conversion using the two-phase pattern and correct issuer thumbprints.
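The valid-starting-state check from the first step can also be written down as a short Python sketch. It encodes only the three states listed in this answer. The Cert class, the function name, and the dates are hypothetical, and the real NotBefore values have to be read from the certificates themselves.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Cert:
    name: str
    not_before: datetime


def is_valid_starting_state(primary: Cert, secondary: Optional[Cert], goal: Cert) -> bool:
    """Return True if (primary, secondary) is one of the listed starting states."""
    if primary == goal:
        # Thumbprint: GoalCert, ThumbprintSecondary: None or OldCert1
        return True
    if secondary == goal and goal.not_before > primary.not_before:
        # Thumbprint: OldCert1, ThumbprintSecondary: GoalCert, with GoalCert newer
        return True
    return False


old = Cert("OldCert1", datetime(2023, 1, 1))
goal = Cert("GoalCert", datetime(2024, 1, 1))

print(is_valid_starting_state(goal, None, goal))  # True
print(is_valid_starting_state(goal, old, goal))   # True
print(is_valid_starting_state(old, goal, goal))   # True
print(is_valid_starting_state(old, None, goal))   # False
```

If the check returns False, the thumbprint-only upgrades described in the first step are needed before the CN conversion is attempted.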
