S2D Fault Domain Behavior

Cem Iscen 40 Reputation points

Does Storage Spaces Direct treat a node in Suspend/Maintenance mode differently from a hard power off when calculating fault domain loss and write quorum?

Answer accepted by question author

Marcin Policht 92,630 Reputation points MVP Volunteer Moderator

Yep - it does. A node placed into Suspend or Maintenance mode is a coordinated, cluster-aware state, whereas a hard power-off is treated as an unexpected failure. Storage Spaces Direct adjusts its behavior accordingly.

When you put a node into maintenance, the cluster intentionally drains roles and pauses I/O on that node. Storage Spaces Direct also tries to ensure data safety by redirecting writes and, depending on policy and health, may repair or rebalance data so that active copies remain on other nodes. The node is still part of cluster membership (paused), so the system has full knowledge of its state and can make controlled decisions about data placement and quorum. Because of that coordination, it avoids counting the node as a sudden fault domain loss in the same way as a crash, and it will block or warn if entering maintenance would violate resiliency guarantees.
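
The coordinated path described above can be sketched in PowerShell. This is only a rough outline, not a complete runbook: the node name `Node01` is hypothetical, and the health check is a simple precaution added for illustration rather than something the answer prescribes. It assumes the FailoverClusters and Storage modules are available on the machine running it.

```powershell
# Hypothetical node name; replace with a real cluster node.
$node = "Node01"

# 1. Confirm every volume is healthy before taking a fault domain out of service.
$unhealthy = Get-VirtualDisk | Where-Object { $_.HealthStatus -ne "Healthy" }
if ($unhealthy) {
    $unhealthy | Format-Table FriendlyName, HealthStatus, OperationalStatus
    throw "Volumes are not healthy; do not pause $node yet."
}

# 2. Pause the node and drain its roles (the planned, cluster-aware path).
Suspend-ClusterNode -Name $node -Drain -Wait

# 3. The node should now report Paused rather than Down.
Get-ClusterNode -Name $node | Format-Table Name, State

# ... perform the maintenance work here ...

# 4. Bring the node back and fail roles back to it.
Resume-ClusterNode -Name $node -Failback Immediate

# 5. Check for resync jobs; wait until none remain before pausing another node.
Get-StorageJob
```

Checking `Get-VirtualDisk` first matters because pausing a node while a volume is already degraded removes one more copy from a volume that has less redundancy than usual.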

With a hard power-off, the node disappears abruptly. The cluster immediately treats it as a fault domain loss. Storage Spaces Direct must assume any data on that node is unavailable and falls back to remaining replicas or parity. This directly impacts write quorum and resiliency calculations because the system now has fewer available copies, and it may restrict writes if the remaining redundancy is insufficient to meet the volume’s resiliency policy.
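
To see which of the two situations the cluster thinks it is in, a few read-only queries are usually enough. The sketch below assumes it is run from a surviving cluster node. The status strings mentioned in the comments are what these cmdlets typically report, so treat them as a guide rather than a guarantee.

```powershell
# Node state: a planned suspend shows Paused, an abrupt loss shows Down.
Get-ClusterNode | Format-Table Name, State

# Drive state: drives behind a paused node typically show "In Maintenance Mode",
# while drives behind a powered-off node typically show "Lost Communication".
Get-PhysicalDisk | Format-Table FriendlyName, SerialNumber, OperationalStatus, HealthStatus

# Volume state: look for anything other than Healthy / OK.
Get-VirtualDisk | Format-Table FriendlyName, ResiliencySettingName, HealthStatus, OperationalStatus
```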

In practice, maintenance mode is designed to preserve write availability by ensuring enough healthy copies remain before and during the operation, while a power-off can transiently reduce the number of accessible copies and therefore more aggressively affect write quorum decisions.


hth

Marcin

1 additional answer

  1. kholienchieh 0 Reputation points MVP

    In some scenarios, especially after evicting a corrupted node, drives can be left showing "Transient error", and none of the PowerShell commands restore their status.

    The following script from GitHub helps in that case.

    https://github.com/MicrosoftDocs/windowsserverdocs/blob/main/WindowsServerDocs/storage/storage-spaces/media/troubleshooting/Clear-PhysicalDiskHealthData.txt
