Issue with Service Fabric queue

Abhale, Sankalp 0 Reputation points

Hello,
We are using a stateful service with queues. The queues are polled for new entries in a while loop that checks a cancellation token. Recently we have been continuously getting an exception at CommitAsync(), and we have not been able to identify the root cause.

{"Type":"FabricNotPrimaryException","Message":"","StackTrace":" at Microsoft.ServiceFabric.Replicator.Transaction.ThrowReplicationException(Exception e, Boolean abortNeeded) at Microsoft.ServiceFabric.Replicator.Transaction.PrivateCommitAsync() at Microsoft.ServiceFabric.Replicator.Transaction.<>c.<CommitAsync>b__17_0(Task t) at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)--- End of stack trace from previous location --- at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state) at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)--- End of stack trace from previous location --- 
  1. Manish Deshpande 7,010 Reputation points Microsoft External Staff Moderator

    Thanks for reaching out to Microsoft Q&A. We are looking into the issue and will get back to you shortly with an update.

  2. Himanshu Shekhar 6,710 Reputation points Microsoft External Staff Moderator

    FabricNotPrimaryException is thrown when an operation is executed on a replica that is no longer the Primary replica.

    In Service Fabric: https://learn.microsoft.com/en-us/dotnet/api/system.fabric.fabricnotprimaryexception?view=azure-dotnet

    1. Only the Primary replica is allowed to perform write operations (including CommitAsync).
    2. If a replica loses the Primary role during execution, any in-flight transaction (such as one awaiting CommitAsync) will fail with this exception.
  3. Abhale, Sankalp 0 Reputation points

    @Himanshu Shekhar The problem is it is happening 2-3 times a day.



1 answer

  1. Jerald Felix 13,500 Reputation points Volunteer Moderator

    Hello Abhale, Sankalp,

    Greetings! Thanks for raising this question in Q&A forum.

    The FabricNotPrimaryException at CommitAsync() is not a bug in your code; it is expected behavior in Azure Service Fabric. Service Fabric initiates reconfigurations in response to different events in the system, such as when the primary fails or when Service Fabric needs to move a running primary to another node for load balancing or an upgrade. When this happens, any in-flight transaction on the old primary, including your CommitAsync() call, throws FabricNotPrimaryException because write operations are only allowed on the current primary replica.

    The fact that this is happening 2-3 times a day suggests your cluster is going through periodic reconfigurations — this could be due to upgrades, node rebalancing, or brief infrastructure events. The key is to handle this exception correctly in your code rather than treating it as an unrecoverable error.

    Here are the steps to resolve this properly:

    1. Catch FabricNotPrimaryException in your RunAsync loop and exit gracefully.
    If the service sees FabricNotPrimaryException in RunAsync, it should catch the exception, complete all tasks, and return from RunAsync. The CancellationToken passed to RunAsync will be signalled, and all background tasks should finish once that happens. Update your while loop to follow this pattern:
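    // Namespaces used: System, System.Fabric, System.Threading, System.Threading.Tasks,
    // Microsoft.ServiceFabric.Data.Collections
    protected override async Task RunAsync(CancellationToken cancellationToken)
    {
        var myQueue = await StateManager.GetOrAddAsync<IReliableQueue<MyItem>>("myQueue");

        while (!cancellationToken.IsCancellationRequested)
        {
            try
            {
                using (var tx = StateManager.CreateTransaction())
                {
                    // TryDequeueAsync accepts (tx) or (tx, timeout, cancellationToken);
                    // the timeout value here is only an example.
                    var result = await myQueue.TryDequeueAsync(tx, TimeSpan.FromSeconds(4), cancellationToken);
                    if (result.HasValue)
                    {
                        // process result.Value
                        await tx.CommitAsync();
                    }
                }
            }
            catch (FabricNotPrimaryException)
            {
                // Replica is no longer primary: exit RunAsync gracefully
                return;
            }
            catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
            {
                // Normal shutdown: exit gracefully
                return;
            }
            catch (Exception ex)
            {
                // Log ex and handle other transient exceptions
            }

            // If cancellation is signalled during the delay, the resulting
            // OperationCanceledException propagates out of RunAsync and ends the loop.
            await Task.Delay(TimeSpan.FromMilliseconds(100), cancellationToken);
        }
    }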
    

    2. Do not retry on FabricNotPrimaryException; it is not a transient error.
    Unlike TimeoutException (which can be safely retried with a delay), FabricNotPrimaryException means your replica has lost the primary role. Retrying the same transaction on the same replica will keep failing. The Service Fabric runtime calls RunAsync on the newly promoted primary automatically, so you only need to exit cleanly.
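
    The retry-versus-exit distinction above can be sketched as a small classification helper. This is only an illustration under assumptions: CommitFailurePolicy and FailureAction are hypothetical names that do not come from this thread, and treating FabricTransientException as retryable goes beyond the advice above, which names only TimeoutException.

```csharp
using System;
using System.Fabric;
using System.Threading;

// Hypothetical enum: what the RunAsync loop should do after a failed queue operation.
public enum FailureAction { RetryAfterDelay, ExitRunAsync, Rethrow }

// Hypothetical helper, not part of the Service Fabric SDK.
public static class CommitFailurePolicy
{
    public static FailureAction Classify(Exception ex, CancellationToken cancellationToken)
    {
        switch (ex)
        {
            // The replica lost the primary role: retrying here cannot succeed.
            case FabricNotPrimaryException _:
                return FailureAction.ExitRunAsync;

            // Cancellation requested by the runtime: normal shutdown.
            case OperationCanceledException _ when cancellationToken.IsCancellationRequested:
                return FailureAction.ExitRunAsync;

            // Transient conditions: back off and try again.
            // (FabricTransientException is an assumption added for this sketch.)
            case TimeoutException _:
            case FabricTransientException _:
                return FailureAction.RetryAfterDelay;

            // Anything else is unexpected and should surface.
            default:
                return FailureAction.Rethrow;
        }
    }
}
```

    Inside a catch block in RunAsync, the loop could call CommitFailurePolicy.Classify(ex, cancellationToken) and then return, delay and continue, or rethrow according to the result.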

    3. Ensure RunAsync honours the cancellation token throughout.
    Make sure the cancellationToken passed to RunAsync is honoured and that, once it has been signalled, RunAsync exits gracefully as soon as possible. Pass the cancellationToken to all async calls inside your loop, including TryDequeueAsync, Task.Delay, and any other awaitable calls, so the loop terminates promptly when Service Fabric requests it.

    4. Investigate why reconfigurations are happening 2-3 times a day.
    Open Service Fabric Explorer for your cluster and check the health events on the affected partition. Look for System.FM events that show SwapPrimary or Failover reconfiguration types. Common causes include node-level upgrades, VM restarts, or load-balancing moves. If the frequency seems abnormal, check the cluster upgrade settings and node health in the portal.
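
    As a complement to Service Fabric Explorer, the partition's health events can also be read programmatically. The following is a rough sketch, assuming the code runs on a cluster node (so the parameterless FabricClient can connect locally) and that you already know the partition ID; PartitionHealthDump and IsFromFailoverManager are hypothetical names. Health events reflect what is currently reported and may not retain a full history of past reconfigurations.

```csharp
using System;
using System.Fabric;
using System.Fabric.Health;
using System.Threading.Tasks;

// Hypothetical helper, not part of the Service Fabric SDK.
public static class PartitionHealthDump
{
    // True when a health event was reported by the Failover Manager (source "System.FM").
    public static bool IsFromFailoverManager(string sourceId) =>
        string.Equals(sourceId, "System.FM", StringComparison.OrdinalIgnoreCase);

    // Prints every health event currently reported on the given partition.
    public static async Task PrintAsync(Guid partitionId)
    {
        using (var client = new FabricClient())
        {
            PartitionHealth health = await client.HealthManager.GetPartitionHealthAsync(partitionId);
            foreach (HealthEvent e in health.HealthEvents)
            {
                HealthInformation info = e.HealthInformation;
                string marker = IsFromFailoverManager(info.SourceId) ? "[FM] " : "";
                Console.WriteLine(
                    $"{marker}{e.SourceUtcTimestamp:o} {info.SourceId} {info.Property} {info.HealthState}: {info.Description}");
            }
        }
    }
}
```

    Running this periodically and comparing the timestamps with the times of the exceptions may help correlate the failures with cluster activity.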

    5. Review your cluster's replica set and health settings.
    If your stateful service has a low MinReplicaSetSize or short health check timeouts, it may trigger reconfigurations more aggressively. Review the TargetReplicaSetSize and MinReplicaSetSize values, which are set in the application manifest's default services or when the service is created, to ensure they are appropriately configured for your environment.

    If this answer helps, kindly accept it so that it can help others who have similar questions.

    Best Regards,

    Jerald Felix.
