Issue with Service Fabric queue
Hello,
We are using a stateful service with queues. The queues are polled for new entries in a while loop controlled by a cancellation token. Recently we have been continuously getting an exception at CommitAsync(), and we are not able to identify the root cause.
{"Type":"FabricNotPrimaryException","Message":"","StackTrace":" at Microsoft.ServiceFabric.Replicator.Transaction.ThrowReplicationException(Exception e, Boolean abortNeeded) at Microsoft.ServiceFabric.Replicator.Transaction.PrivateCommitAsync() at Microsoft.ServiceFabric.Replicator.Transaction.<>c.<CommitAsync>b__17_0(Task t) at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)--- End of stack trace from previous location --- at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state) at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot, Thread threadPoolThread)--- End of stack trace from previous location ---
-
Manish Deshpande 7,010 Reputation points • Microsoft External Staff • Moderator
Thanks for reaching out to Microsoft Q&A. We are looking into the issue and will get back to you shortly with an update.
-
Himanshu Shekhar 6,710 Reputation points • Microsoft External Staff • Moderator
FabricNotPrimaryException is thrown when an operation is executed on a replica that is no longer the Primary replica.
In Service Fabric: https://learn.microsoft.com/en-us/dotnet/api/system.fabric.fabricnotprimaryexception?view=azure-dotnet
- Only the Primary replica is allowed to perform write operations (including CommitAsync)
- If a replica loses primary role during execution, any in-flight transaction (like CommitAsync) will fail with this exception
-
Abhale, Sankalp 0 Reputation points
@Himanshu Shekhar The problem is it is happening 2-3 times a day.
1 answer
-
Jerald Felix 13,500 Reputation points • Volunteer Moderator
Hello Abhale, Sankalp,
Greetings! Thanks for raising this question in the Q&A forum.
The FabricNotPrimaryException at CommitAsync() is not a bug in your code; it is expected behavior in Azure Service Fabric. Service Fabric initiates reconfigurations in response to different events in the system, such as when the primary fails or when Service Fabric needs to move a running primary to another node in response to load balancing or an upgrade. When this happens, any in-flight transaction on the old primary, including your CommitAsync() call, will throw FabricNotPrimaryException because write operations are only allowed on the current primary replica.
The fact that this is happening 2-3 times a day suggests your cluster is going through periodic reconfigurations. This could be due to upgrades, node rebalancing, or brief infrastructure events. The key is to handle this exception correctly in your code rather than treating it as an unrecoverable error.
Here are the steps to resolve this properly:
- Catch FabricNotPrimaryException in your RunAsync loop and exit gracefully. If the service sees FabricNotPrimaryException in RunAsync, it should catch the exception, complete all tasks, and return from RunAsync. The CancellationToken passed to RunAsync will be signalled, and all background tasks should complete execution when this cancellation is signalled. Update your while loop to handle this pattern:

```csharp
using System;
using System.Fabric;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.ServiceFabric.Data.Collections;

// Place this method inside your StatefulService-derived class
// (StatefulService lives in Microsoft.ServiceFabric.Services.Runtime).
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    var myQueue = await StateManager.GetOrAddAsync<IReliableQueue<MyItem>>("myQueue");

    while (!cancellationToken.IsCancellationRequested)
    {
        try
        {
            using (var tx = StateManager.CreateTransaction())
            {
                // TryDequeueAsync takes either (tx) or (tx, timeout, cancellationToken);
                // the three-argument overload lets the token interrupt the wait.
                var result = await myQueue.TryDequeueAsync(tx, TimeSpan.FromSeconds(4), cancellationToken);
                if (result.HasValue)
                {
                    // process result.Value
                    await tx.CommitAsync();
                }
            }
        }
        catch (FabricNotPrimaryException)
        {
            // Replica is no longer primary: exit RunAsync gracefully
            return;
        }
        catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
        {
            // Normal shutdown: exit gracefully
            return;
        }
        catch (Exception ex)
        {
            // Log ex and handle other transient exceptions
        }

        // If the token is cancelled during this delay, the resulting
        // OperationCanceledException ends RunAsync, which is an accepted way to stop.
        await Task.Delay(TimeSpan.FromMilliseconds(100), cancellationToken);
    }
}
```

- Do not retry on FabricNotPrimaryException; it is not a transient error. Unlike TimeoutException (which can be safely retried with a delay), FabricNotPrimaryException means your replica has lost the primary role. Retrying the same transaction on the same replica will keep failing. The Service Fabric runtime will call RunAsync on the newly promoted primary automatically, so you just need to exit cleanly.
- Ensure your RunAsync correctly honours the cancellation token throughout. Make sure the cancellationToken passed to RunAsync is honoured and that, once it has been signalled, RunAsync exits gracefully as soon as possible. Pass the cancellationToken to all async calls inside your loop that accept one, including TryDequeueAsync, Task.Delay, and any other awaitable calls, so the loop terminates promptly when Service Fabric requests it.
- Investigate why reconfigurations are happening 2-3 times a day. Open Service Fabric Explorer for your cluster and check the health events on the affected partition. Look for System.FM events that show SwapPrimary or Failover reconfiguration types. Common causes include node-level upgrades, VM restarts, or load-balancing moves. If the frequency seems abnormal, check the cluster upgrade settings and node health in the portal.
- Review your cluster's replica set and health settings. If your stateful service has a low MinReplicaSetSize or short health check timeouts, it may trigger reconfigurations more aggressively. Review the TargetReplicaSetSize and MinReplicaSetSize values for the service (these are typically set in the application manifest's default services section or when the service is created) to ensure they are appropriately configured for your environment.
If this answer helps you, kindly accept the answer, which will help others who have similar questions.
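To make the difference between retryable and non-retryable failures concrete, here is a minimal sketch of a retry wrapper. It assumes you want to retry only TimeoutException and let everything else, including FabricNotPrimaryException, propagate so that RunAsync can exit. The names TransientRetry and ExecuteAsync are hypothetical and are not part of the Service Fabric SDK; the attempt count and delay are placeholder values.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only; not a Service Fabric API.
public static class TransientRetry
{
    // Runs the action and retries it only when it throws TimeoutException.
    // Any other exception (for example FabricNotPrimaryException) is not
    // caught here and reaches the caller on the first failure.
    // Returns the number of attempts that were needed.
    public static async Task<int> ExecuteAsync(
        Func<Task> action,
        int maxAttempts,
        TimeSpan delay,
        CancellationToken cancellationToken)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                await action();
                return attempt;
            }
            catch (TimeoutException) when (attempt < maxAttempts)
            {
                // Transient failure: wait, then invoke the action again.
                // On the last attempt the filter is false, so the
                // TimeoutException propagates instead of being retried.
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}
```

In a RunAsync loop, the delegate passed to ExecuteAsync would typically create its own transaction on every attempt, since a transaction whose commit failed should not be reused. The existing catch (FabricNotPrimaryException) block around the call would still be what ends RunAsync.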
Best Regards,
Jerald Felix.
