Service Fabric cluster upgrade failure when converting certificate from thumbprint to common name

Buchczyk, Kornelia 70 Reputation points

Hi everyone,

I am working on migrating an Azure Service Fabric cluster from a thumbprint-based certificate configuration to a common name (CN)-based configuration.

Current setup:

  • Azure Service Fabric cluster (Windows, VMSS-based)
  • originally using a self-signed certificate (thumbprint-based)
  • cert stored in Azure Key Vault
  • cluster successfully updated to use a new Let’s Encrypt certificate (via Acmebot) using thumbprint
  • cert is correctly installed on all nodes (LocalMachine\My)

What I implemented:

  1. I deployed Acmebot and configured DNS (Azure DNS) for ACME challenge
  2. successfully issued a Let’s Encrypt cert and stored it in Key Vault
  3. updated cluster to use the new certificate via thumbprint (primary) — cluster is healthy
  4. Attempted to migrate to a CN-based configuration, following https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-change-cert-thumbprint-to-cn , by replacing certificate / certificateSecondary with certificateCommonNames (including the issuer thumbprint)

Issue: When applying the CN-based config, my cluster upgrade fails with:

Cluster upgrade failed. Reason Code: 'UpgradeServiceDown'

I would be grateful for any tips to solve it ;-)
Thanks,
Kornelia

  1. Hemalatha 14,525 Reputation points Microsoft External Staff Moderator

    Hello

    Just checking whether the response above was helpful. Please let me know if you have any questions.



2 answers

  1. Hemalatha 14,525 Reputation points Microsoft External Staff Moderator

    Hello

    Thank you for the detailed information and for outlining the steps already completed. Based on your description, the cluster remains healthy with the thumbprint-based configuration and fails only when transitioning to the Common Name (CN)-based setup with the error “UpgradeServiceDown.”

    This error generally indicates that Service Fabric system services were unable to start during the upgrade, most commonly due to certificate resolution or validation failure at runtime, even when the certificate is present on all nodes.

    In this scenario, we recommend validating the following areas:

    Please review the CN configuration in your cluster template. If a certificateIssuerThumbprint is specified, it can cause failures because Let’s Encrypt frequently rotates intermediate certificates, and the issuer thumbprint may not match what is installed on the node. In such cases, it is recommended to configure only the certificateCommonName and avoid enforcing the issuer thumbprint.

    Reference: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-change-cert-thumbprint-to-cn
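To make the two options concrete, here is a minimal Python sketch that builds the CN-based blocks as they appear in the linked article's ARM template: certificateCommonNames on the cluster resource and a commonNames list in the VMSS node extension. The helper names and the example CN are hypothetical, and the comma-separated format for multiple issuer thumbprints is an assumption to verify against the current documentation. Leaving the issuer list empty corresponds to the "no issuer pinning" recommendation above.

```python
import json


def cluster_cn_block(common_name, issuer_thumbprints=None, store="My"):
    """Build the certificateCommonNames block for the cluster resource.

    issuer_thumbprints: optional list of issuer thumbprints to pin.
    An empty value means no pinning (assumption: trust then relies on
    the chain to a trusted root on each node).
    """
    return {
        "commonNames": [
            {
                "certificateCommonName": common_name,
                # Assumed format: comma-separated SHA-1 thumbprints.
                "certificateIssuerThumbprint": ",".join(issuer_thumbprints or []),
            }
        ],
        "x509StoreName": store,
    }


def vmss_extension_cn_block(common_name, store="My"):
    """Build the 'certificate' setting for the Service Fabric node extension."""
    return {"commonNames": [common_name], "x509StoreName": store}


if __name__ == "__main__":
    # Hypothetical CN; substitute the subject CN of the Let's Encrypt cert.
    print(json.dumps(cluster_cn_block("sfcluster.example.test"), indent=2))
    print(json.dumps(vmss_extension_cn_block("sfcluster.example.test"), indent=2))
```

The printed JSON can be compared against the template before deployment; it is not a replacement for the full template.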

    Ensure the full certificate chain is trusted on all nodes. Let’s Encrypt certificates depend on intermediate and root certificates. These must be present in the LocalMachine\Root and LocalMachine\CA stores on every node. If the chain is incomplete, Service Fabric will fail certificate validation even if the leaf certificate is installed.

    Reference: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-security-update-certs-azure

    Ensure only one valid certificate matches the configured CN on each node. If multiple certificates with the same Common Name exist in the LocalMachine\My store, Service Fabric may not consistently select the correct certificate during upgrade. Please verify and remove any expired or unused certificates that share the same CN.

    Reference: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-security#certificate-management
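As a rough illustration of the duplicate-CN check, the Python sketch below groups a certificate inventory by common name and lists expired entries. It assumes the inventory has already been exported from each node's LocalMachine\My store by some other means; the record keys (cn, thumbprint, not_after) are hypothetical and not a Service Fabric or Windows format.

```python
from collections import defaultdict
from datetime import datetime, timezone


def find_cn_conflicts(certs, now=None):
    """Return (duplicates, expired) for a list of certificate records.

    certs: list of dicts with hypothetical keys 'cn', 'thumbprint',
           and 'not_after' (timezone-aware datetime).
    duplicates: {lowercased cn: [thumbprints]} for CNs with more than one cert.
    expired: thumbprints whose not_after is earlier than 'now'.
    """
    now = now or datetime.now(timezone.utc)
    by_cn = defaultdict(list)
    for cert in certs:
        # Compare CNs case-insensitively.
        by_cn[cert["cn"].lower()].append(cert["thumbprint"])
    duplicates = {cn: tps for cn, tps in by_cn.items() if len(tps) > 1}
    expired = [c["thumbprint"] for c in certs if c["not_after"] < now]
    return duplicates, expired
```

Any CN that appears in the duplicates result is a candidate for cleanup, starting with the thumbprints that are also in the expired list.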

    Ensure that the certificate configuration is updated consistently in both the Service Fabric cluster resource and the Virtual Machine Scale Set (FabricNode extension). Any mismatch between these configurations can lead to certificate resolution failures during upgrade.
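One way to catch such a mismatch before deploying is to compare the two places in the exported template. The Python sketch below assumes the template has been loaded as a dict, that the node extension has type ServiceFabricNode (Windows), and that the property paths follow the linked conversion article; values are compared as plain strings, so ARM expressions such as "[parameters('...')]" match only if they are written identically in both places.

```python
def cluster_common_names(template):
    """Collect CNs declared on Microsoft.ServiceFabric/clusters resources."""
    names = set()
    for res in template.get("resources", []):
        if res.get("type") == "Microsoft.ServiceFabric/clusters":
            block = res.get("properties", {}).get("certificateCommonNames", {})
            for entry in block.get("commonNames", []):
                names.add(entry["certificateCommonName"])
    return names


def vmss_common_names(template):
    """Map each scale set name to the CNs in its Service Fabric node extension."""
    result = {}
    for res in template.get("resources", []):
        if res.get("type") != "Microsoft.Compute/virtualMachineScaleSets":
            continue
        extensions = (
            res.get("properties", {})
            .get("virtualMachineProfile", {})
            .get("extensionProfile", {})
            .get("extensions", [])
        )
        for ext in extensions:
            props = ext.get("properties", {})
            if props.get("type") == "ServiceFabricNode":
                cert = props.get("settings", {}).get("certificate", {})
                # A thumbprint-only setting has no commonNames, giving an empty set.
                result[res.get("name")] = set(cert.get("commonNames", []))
    return result


def mismatched_scale_sets(template):
    """Return names of scale sets whose CNs differ from the cluster resource."""
    expected = cluster_common_names(template)
    return sorted(
        name for name, names in vmss_common_names(template).items()
        if names != expected
    )
```

An empty result from mismatched_scale_sets suggests the two declarations agree; a scale set that still carries a thumbprint-only setting is reported as a mismatch.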

    Ensure the cluster is in a valid starting state before switching to CN-based configuration. This may require temporarily configuring both the old and new certificates (primary and secondary), completing an upgrade, and then removing the old certificate so that only the target certificate remains before introducing the CN configuration.

    We also recommend performing the migration in a staged manner. Introduce the CN-based configuration alongside the existing thumbprint configuration, confirm the cluster remains healthy, and then promote CN to primary in a subsequent upgrade. This approach reduces the risk of upgrade failures.

    Reference: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-change-cert-thumbprint-to-cn

    If the issue persists, please review the cluster upgrade and infrastructure logs (FabricDeployer, ClusterManager, FabricHost, and FabricUpgradeService) located under C:\ProgramData\SF\Log\FabricUpgradeService. These logs typically show whether the failure is due to a certificate not being found, chain validation issues, or certificate resolution errors.

    Reference: https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-diagnostics-event-generation-infra

    Additionally, if the cluster is running on Service Fabric version 8.2 CU2, please be aware of a known issue related to ImageStoreService that can impact certificate-related operations and cluster upgrades. Applying the documented mitigation or upgrading to a later cumulative update is recommended.

    Known issue reference:

    https://github.com/Azure/Service-Fabric-Troubleshooting-Guides/blob/master/Known_Issues/Service%20Fabric%208.2%20Upgrade%20or%20Certificate%20Rotation%20Failure%20due%20to%20ImageStoreService%20Error.md

    Similar Issue: https://github.com/microsoft/service-fabric-issues/issues/1435

    Hope this helps! If you have any questions, please let me know in the comments.

    1. Hemalatha 14,525 Reputation points Microsoft External Staff Moderator

      Hello Buchczyk,

      If the information provided above was helpful, could you please accept the answer and upvote it? Thanks.

      Please let me know in the comments if you have any questions.


  2. AI answer

    The failure is caused by nodes being unable to mutually authenticate during the upgrade when switching from thumbprint-based to CN-based declarations. During a rolling upgrade, upgraded and not-yet-upgraded nodes must both be able to validate each other’s certificates at all times. If the configuration jumps directly from thumbprint-only to CN-only (or CN with issuer pinning) without following the supported transition pattern, the UpgradeService and other system services can no longer establish secure connections, which leads to UpgradeServiceDown.

    To resolve this, follow the supported two-phase conversion pattern and ensure a valid starting state:

    1. Ensure a valid starting state using thumbprints only
      The cluster must already be using the goal certificate (the Let’s Encrypt certificate) by thumbprint before converting to CN. Valid starting states include:
      • Thumbprint: GoalCert, ThumbprintSecondary: None, or
      • Thumbprint: GoalCert, ThumbprintSecondary: OldCert1, or
      • Thumbprint: OldCert1, ThumbprintSecondary: GoalCert (with GoalCert having a later NotBefore date than OldCert1).
        If the cluster is not in one of these states, adjust using one or two thumbprint-only upgrades as described in the “Achieve a valid starting state” table. For example, if currently Thumbprint: OldCert1, ThumbprintSecondary: None and OldCert1 has a later NotBefore than GoalCert, first move to Thumbprint: GoalCert, ThumbprintSecondary: OldCert1, then to Thumbprint: GoalCert, ThumbprintSecondary: None.
    2. Verify certificates are installed on all nodes
      Before changing declarations, ensure both the current and target cluster certificates are provisioned to every VM in each VM scale set (LocalMachine\My). In Azure, this is typically done via Key Vault integration with VMSS. The documentation explicitly requires that all relevant certificates be installed on all nodes before changing the cluster’s certificate declarations.
    3. Choose the CN validation scheme correctly
      When switching to CN-based declarations, decide whether to:
      • Rely on the trusted root CA (leave certificateIssuerThumbprintList empty), or
      • Pin the direct issuer(s) via certificateIssuerThumbprintList (recommended).
        If issuer thumbprints are specified, the certificate is only accepted if its direct issuer’s thumbprint matches one of the values in the list. If the Let’s Encrypt certificate is issued by a different intermediate than expected, or if not all relevant issuer thumbprints are listed, the certificate will be rejected and mutual authentication will fail, causing system services (including UpgradeService) to go down. Ensure that all expected issuer thumbprints for that CN are included.
    4. Perform the conversion from thumbprint to CN in two phases
      The safe pattern is:
      • Phase 1: Cluster uses the goal certificate by thumbprint only, and validation rules are extended to accept both thumbprint and CN. Conceptually:
        • Initial state: all nodes present and validate by thumbprint A only.
        • After first upgrade domain: upgraded nodes still present A, but accept A (thumbprint) and B (CN) for validation; other nodes still present and accept A only.
      • Phase 2: Change the declaration so that the same certificate is declared by CN (with optional issuer thumbprints) and remove the thumbprint-based presentation. Because validation already accepts CN, nodes can mutually authenticate during the rolling upgrade.
      In ARM terms, this means:
      • First upgrade: keep the thumbprint-based certificate as-is (GoalCert as primary), but extend validation to include the CN (and issuer thumbprints) according to the rules in the certificate configuration documentation.
      • Second upgrade: update the cluster resource to use certificateCommonNames with commonNames (and certificateIssuerThumbprintList if pinning issuers), and remove the certificate property. Also update the Service Fabric node extension in each VMSS to match the CN-based configuration. Then redeploy the updated template.
    5. Update and redeploy the ARM template correctly
      • Export the current template from the resource group if the original is not available.
      • Update each VM scale set’s Service Fabric node extension definition and the Microsoft.ServiceFabric/clusters resource: add certificateCommonNames with the correct CN and issuer thumbprints, and remove the old certificate block only in the second phase.
      • Redeploy using New-AzResourceGroupDeployment with the updated template and parameters.
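The valid starting states listed in step 1 can be written down as a small check. This is only a sketch of the three cases named there; the function name and the not_before mapping are hypothetical, and it does not query the cluster.

```python
from datetime import datetime


def is_valid_starting_state(primary, secondary, goal, not_before):
    """Check the thumbprint declaration against the states listed in step 1.

    primary, secondary: certificate identifiers (secondary may be None).
    goal: identifier of the goal certificate.
    not_before: dict mapping identifier -> NotBefore datetime.
    """
    if primary == goal:
        # Goal as primary: secondary may be empty or an old certificate.
        return True
    if primary is not None and secondary == goal:
        # Old cert as primary is acceptable only if the goal cert
        # has the later NotBefore date.
        return not_before[goal] > not_before[primary]
    return False


if __name__ == "__main__":
    dates = {"GoalCert": datetime(2025, 2, 1), "OldCert1": datetime(2024, 2, 1)}
    print(is_valid_starting_state("OldCert1", "GoalCert", "GoalCert", dates))
```

If the check returns False, the thumbprint-only upgrades described in step 1 would be needed before attempting the CN conversion.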

    If the upgrade is failing now, roll back to the last healthy thumbprint-based configuration, verify that the cluster is in a valid starting state with GoalCert as the active thumbprint, confirm that all nodes have the certificate installed, then reattempt the conversion using the two-phase pattern and correct issuer thumbprints.
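To confirm that the issuer thumbprints are correct before reattempting, the pinned values can be compared with the thumbprint of the intermediate certificate that actually issued the leaf. The Python sketch below relies on the fact that a Windows certificate thumbprint is the SHA-1 hash of the DER-encoded certificate; obtaining the issuer's DER bytes is left out, the function names are hypothetical, and treating an empty list as "no pinning" is an assumption that mirrors step 3.

```python
import hashlib


def thumbprint(der_bytes):
    """SHA-1 thumbprint of a DER-encoded certificate, as uppercase hex."""
    return hashlib.sha1(der_bytes).hexdigest().upper()


def normalize(value):
    """Strip whitespace and uppercase a thumbprint string."""
    return "".join(value.split()).upper()


def issuer_is_pinned(issuer_der, pinned_list):
    """Check the issuer's thumbprint against a comma-separated pin list.

    An empty pin list is treated as 'no pinning' and returns True
    (assumption: validation then relies on the trusted root chain).
    """
    pinned = {normalize(t) for t in pinned_list.split(",") if t.strip()}
    if not pinned:
        return True
    return thumbprint(issuer_der) in pinned
```

A False result for the current Let’s Encrypt intermediate would be consistent with the rejection scenario described in step 3.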


    AI-generated content may be incorrect. Read our transparency notes for more information.