I remember a call with a customer from my first few months on the Azure Functions team. The customer's function triggered from a queue message and ran a stored procedure in a SQL database. The solution had been working fantastically for months at high scale until one fateful day the SQL database crashed. What resulted was a serverless version of kicking someone while they're down. The function executions started to fail, and as they failed, each message would be re-queued and retried (which is great! Retrying is generally a best practice, especially for transient failures). But new messages kept landing in the queue on top of a growing number of retries. Within minutes they had gone from hundreds of failed executions to thousands. What's worse, once the SQL database failure was resolved, Azure Functions was ready to hammer that poor SQL database back into a failed state with the mountain of queue messages and retries it had been holding. Eventually, things got back to a steady state, but the team was left with thousands of messages in a dead-letter queue and some battle scars. It's a powerful reminder that sometimes serverless scale can make a bad situation worse.
What the customer wanted was simple enough to understand.
"Can we make it so our function stops triggering if lots of failures start happening?"
It's a simple question that many a developer has asked. I want to show how you can use the new Durable Entities feature of Azure Functions to create a stateful circuit breaker to achieve exactly that.
Traditional circuit breakers and serverless scale
The pattern isn't a new one. Generally, it's known as the circuit breaker pattern. You can think of it as an electric circuit with a gate. Imagine electricity is flowing through a circuit to a destination. A gate is there bridging the connection. If you ever need to stop the flow of electricity, the gate "opens" creating a gap in the circuit, and stopping electricity. When electricity should resume, you "close" the gate connecting the circuit.
Many libraries exist that can implement this pattern in traditional apps. Polly for .NET is my favorite, but there are plenty to choose from. You define a threshold number of exceptions within a time window - say, more than 30 exceptions in 1 minute - and when it is crossed the library "opens" the gate to stop processing. While this works for many applications, serverless presents new challenges where this model doesn't fit.
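To make the idea concrete, here is a minimal sketch of an in-memory breaker that opens after more than a threshold number of exceptions inside a time window. It is a hand-rolled illustration, not Polly's API; the class name `InMemoryCircuitBreaker` and its members are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical hand-rolled breaker for illustration only; this is not Polly's API.
public class InMemoryCircuitBreaker
{
    private readonly int _threshold;
    private readonly TimeSpan _window;
    private readonly Queue<DateTime> _failures = new Queue<DateTime>();

    // Open means the gate has a gap and processing should stop.
    public bool IsOpen { get; private set; }

    public InMemoryCircuitBreaker(int threshold, TimeSpan window)
    {
        _threshold = threshold;
        _window = window;
    }

    // Record one exception observed at the given time.
    public void RecordFailure(DateTime when)
    {
        _failures.Enqueue(when);

        // Drop failures that have aged out of the window.
        while (_failures.Count > 0 && _failures.Peek() < when - _window)
        {
            _failures.Dequeue();
        }

        // "More than 30 exceptions in 1 minute" is a strictly-greater check.
        if (_failures.Count > _threshold)
        {
            IsOpen = true;
        }
    }

    // Close the gate again once the downstream system is healthy.
    public void Close()
    {
        _failures.Clear();
        IsOpen = false;
    }
}
```

Note that all of the state lives in one process's memory, which is exactly the assumption that breaks down once the app scales out.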
Single server vs distributed apps
Assume I'm processing all of my queue messages on a single server. Here, libraries like Polly have worked great. Polly keeps an in-memory count of exceptions as they occur, and it can easily keep track of the threshold because every exception will only occur on that single instance.
Distributed applications like serverless functions are much different. Within seconds under load, an Azure Function can scale to dozens of instances. Imagine now that you have 50 active function instances triggering on events. You could have a circuit breaker library running in your app - but you'd have 50 instances of that circuit breaker library. Now imagine that you get 50 exceptions all at once, but each exception spreads evenly across your instances. Each library, like Polly, will think everything is healthy. "Hey I'm good, I only saw one exception, and my threshold is 30, so let's keep the circuit closed." But the global story is much different. Globally the threshold was crossed, but because each instance was processing in isolation, there was nothing that knew the state and health of the entire circuit. Not to mention, even if one instance "halted" processing, the other 49 may keep chugging along.
The scaling and abstraction of all of these instances make traditional circuit breaker libraries unpredictable. Sometimes 30 exceptions could open the circuit breaker if they all happen on a single instance, while other times it could take many, many more exceptions before any gate opens.
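The arithmetic behind that unpredictability can be sketched in a few lines. This is a simplified model under stated assumptions: failures are spread round-robin across isolated per-instance counters, the time window is ignored, and a breaker opens only when its own count exceeds the threshold. The names `DistributedBreakerDemo`, `InstancesThatOpen`, and `SharedCounterOpens` are hypothetical.

```csharp
using System;
using System.Linq;

// Hypothetical model for illustration: isolated per-instance counters vs. one shared counter.
public static class DistributedBreakerDemo
{
    // Spread `failures` round-robin across `instances` isolated counters and
    // return how many of those counters exceed `threshold`.
    public static int InstancesThatOpen(int failures, int instances, int threshold)
    {
        var perInstance = new int[instances];
        for (int i = 0; i < failures; i++)
        {
            perInstance[i % instances]++;
        }
        return perInstance.Count(count => count > threshold);
    }

    // A single shared counter sees every failure, wherever it happened.
    public static bool SharedCounterOpens(int failures, int threshold)
        => failures > threshold;
}
```

With 50 instances and a threshold of 30, `InstancesThatOpen(50, 50, 30)` is 0 while `SharedCounterOpens(50, 30)` is true. In this model as many as 1,500 evenly spread failures (50 instances times 30 each) pass without a single instance opening its gate, whereas 31 failures on one instance are enough to open it.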
State management and distributed circuits
The missing piece here for a distributed circuit breaker is some external state that can monitor the health and status of the entire circuit. There's no single answer on how that state could manifest. The state could be Azure Alerts and metrics that aggregate exceptions and take some action. The state could be an Azure Logic App like an example I created a few years back before we had Durable Entities. My current weapon of choice though is a Durable Entity.
Let's step away from this circuit breaker problem for a beat and introduce what Durable Entities are.
Durable Entities and stateful functions
Serverless functions are no longer the stateless ephemeral snippets of code you've likely heard of. With new capabilities like Azure Durable Functions you can now write Azure Functions that maintain state for an indefinite amount of time. This is extremely useful when you need to orchestrate or coordinate work as it moves through a system.
Recently we released a new flavor of Durable Functions called Durable Entities. You can think of a Durable Entity as a function that can have infinite instances, each with its own unique state and ID. The basic example is something like a counter. Let's say you've now become the lead architect for Fitbit and need to build a solution that receives step counts for each of the thousands of users and keeps them stored in state. Each Fitbit device will only ever send one signal: "Add Step." You could write a stateless function that interacts with a database and does something like "look up this user, get the current step count, add one, and save it back to the database." But it's a bit cumbersome to write, and things get messy when I think about how to prevent two signals from executing at the same time for the same user, so that two unique step signals end up incrementing the database by only one.
Durable Entities lets me describe that entity - an instance of a counter - while maintaining state and single-threaded operations on each instance. The code ends up looking like this:
public class Counter
{
    [JsonProperty("steps")]
    public int CurrentValue { get; set; }

    public void AddStep() => this.CurrentValue++;

    public void Reset() => this.CurrentValue = 0;

    [FunctionName(nameof(Counter))]
    public static Task Run([EntityTrigger] IDurableEntityContext ctx)
        => ctx.DispatchAsync<Counter>();
}
That's it - that's an entire Azure Function app. Once I publish this, I can use the Durable SDK or HTTP API to do things like "AddStep for Jeff" or "AddStep for Chris." I could have infinite instances of these counters, each one storing the state for that user. And these Durable Entities are just as serverless as Azure Functions. If I send 1,000 steps for Jeff, my function will scale and run and process those 1,000 events (guaranteeing that after all 1,000 are processed my count is "1000"). I pay only for those 1,000 steps. If I never call that function again, the state still lives (in Azure Storage by default), but I never pay for the function compute until I use it again.
Durable Entities make managing state extremely intuitive. We've seen tremendous interest in scenarios like IoT, where each IoT device could persist and expose state operations as a durable entity. My personal favorite use of entities though is to solve our circuit breaker dilemma.
Durable Entities and Circuit State
Let's pull these threads back together. If you recall from the beginning, we have a problem: we need to be able to manage the state and rate of exceptions external to each function. We also have a pretty slick new tool of Durable Entities. Now I want to show how you can combine these for a serverless circuit breaker.
First, let's break down the flow of how things will work. There are two main components: the function app that triggers on and processes messages from something like an Azure Event Hub or queue, and a durable entity that monitors and manages the state of the circuit.
Azure Function
- Trigger and try to process the message
- If there is an exception, send a signal to the durable entity to let it know it hit an exception
Azure Function Code
[FunctionName("MyFunction")]
public async Task Run([QueueTrigger("myqueue")] Message message, ILogger log)
{
    try
    {
        // try to process the message
    }
    catch (Exception)
    {
        // hit an exception
        // send a signal to the Durable Entity
        // (_client is an HttpClient field and entityUrl is the entity's endpoint, both defined elsewhere)
        await _client.PostAsJsonAsync(entityUrl, new FailureRequest
        {
            FailureTime = DateTime.UtcNow
        });
        // Rethrow so the retries will kick in; a bare `throw` preserves the original stack trace
        throw;
    }
}
Durable Entity
- Keep track of how many exceptions have been reported across all scaled out instances of a function
- If several exceptions within a certain period are reported, break the circuit
- When breaking the circuit, use the Azure API to stop the Azure Function
Durable Entity Code
[JsonObject(MemberSerialization.OptIn)]
public class Circuit
{
    // Assumed for this excerpt: the client and logger are handed in by DispatchAsync below, and the
    // window and threshold values mirror the 30-failures-in-30-seconds scenario later in this post.
    private readonly IDurableClient _durableClient;
    private readonly ILogger _log;
    private static readonly TimeSpan windowSize = TimeSpan.FromSeconds(30);
    private const int failureThreshold = 30;

    public Circuit(IDurableClient client, ILogger log)
    {
        _durableClient = client;
        _log = log;
    }

    [JsonProperty]
    [JsonConverter(typeof(StringEnumConverter))]
    public CircuitState state = CircuitState.Closed;

    // Current rolling window of failures reported for this circuit
    [JsonProperty]
    public IDictionary<string, FailureRequest> FailureWindow = new Dictionary<string, FailureRequest>();

    public void CloseCircuit() => state = CircuitState.Closed;
    public void OpenCircuit() => state = CircuitState.Open;

    public async Task AddFailure(FailureRequest failure)
    {
        // Check to make sure the circuit isn't already opened
        if (state == CircuitState.Open)
        {
            _log.LogInformation($"Tried to add additional failure to {Entity.Current.EntityKey} that is already opened. Close the circuit to resume processing");
            return;
        }

        // Add this failure to the stateful aggregate
        FailureWindow.Add(failure.RequestId, failure);

        // Calculate the time window we should evaluate exceptions for
        var thresholdCutoff = failure.FailureTime.Subtract(windowSize);

        // Filter the window only to exceptions within the cutoff timespan
        FailureWindow = FailureWindow.Where(p => p.Value.FailureTime >= thresholdCutoff).ToDictionary(p => p.Key, p => p.Value);

        if (FailureWindow.Count >= failureThreshold)
        {
            _log.LogCritical($"Break this circuit for entity {Entity.Current.EntityKey}!");

            // Kick off a call to disable the Azure Function App
            await _durableClient.StartNewAsync(nameof(OpenCircuitOrchestrator.OpenCircuit), failure.ResourceId);

            // Mark the circuit as open
            state = CircuitState.Open;
        }
        else
        {
            _log.LogInformation($"The circuit {Entity.Current.EntityKey} currently has {FailureWindow.Count} exceptions in the window of {windowSize.ToString()}");
        }
    }

    [FunctionName(nameof(Circuit))]
    public static Task Run(
        [EntityTrigger] IDurableEntityContext ctx,
        [DurableClient] IDurableClient client,
        ILogger log) => ctx.DispatchAsync<Circuit>(client, log);
}
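If you want to experiment with the rolling-window logic without deploying anything, the core of AddFailure can be pulled out into plain C#. This is a sketch under assumptions, not the sample's code: `FailureRecord` and `RollingFailureWindow` are hypothetical stand-ins, the Durable Functions pieces (state persistence, logging, the orchestrator call) are left out, and I use the dictionary indexer rather than Add so a repeated request ID overwrites instead of throwing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the FailureRequest type used by the entity.
public class FailureRecord
{
    public string RequestId { get; set; }
    public DateTime FailureTime { get; set; }
}

// Hypothetical, host-free version of the entity's rolling-window check.
public class RollingFailureWindow
{
    private readonly TimeSpan _windowSize;
    private readonly int _failureThreshold;
    private Dictionary<string, FailureRecord> _window = new Dictionary<string, FailureRecord>();

    public bool IsOpen { get; private set; }
    public int Count => _window.Count;

    public RollingFailureWindow(TimeSpan windowSize, int failureThreshold)
    {
        _windowSize = windowSize;
        _failureThreshold = failureThreshold;
    }

    // Returns true if the circuit is open after recording this failure.
    public bool AddFailure(FailureRecord failure)
    {
        // Nothing to do if the circuit is already open.
        if (IsOpen) return true;

        // Indexer (not Add) so a repeated request ID does not throw.
        _window[failure.RequestId] = failure;

        // Keep only failures inside the window that ends at this failure's time.
        var cutoff = failure.FailureTime.Subtract(_windowSize);
        _window = _window
            .Where(p => p.Value.FailureTime >= cutoff)
            .ToDictionary(p => p.Key, p => p.Value);

        if (_window.Count >= _failureThreshold)
        {
            IsOpen = true;
        }
        return IsOpen;
    }
}
```

Because the window is trimmed relative to the newest failure's timestamp, a slow trickle of old failures ages out and never trips the breaker, while a burst inside the window does.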
There are other ways you can chain these together - like adding some in-memory retries to the function or even having the function explicitly check the state of the circuit before processing the message, but this flow is my favorite and optimizes for high throughput and low cost. I can deploy the durable entity and function app, and once the durable entity detects the number of failures is too high it will automatically disable the Azure Function app so it stops processing. It is worth noting you can use the same durable entity to monitor and manage the circuit for many different function apps in your subscription at the same time.
The full code for my sample is on GitHub.
Sample circuit breaker scenario
To help make this pattern clear, let's walk through an example. Imagine I have the Azure Function above and the Durable Entity deployed in my subscription. We'll use the same scenario as we started this post with - the function triggers on the queue, interacts with a SQL server, and completes processing.
My function could be running for months without issue. All during this time, my durable entity is sitting idle (and free) in my subscription. Assume the function is triggering on queue messages across 100 active instances.
Suddenly, the SQL server starts having some issues. A few of the messages start throwing exceptions. An instance hits an exception and lets the durable entity know. The durable entity records that 1 exception has happened in the last 30 seconds. Moments later, 20 other instances hit an exception. Each sends a signal to the durable entity, which now knows that 21 exceptions have happened in the span of a few seconds. Finally, the failures start to mount, and quickly the durable entity detects the threshold has been crossed - it holds state for over 30 exceptions in a 30 second window.
That triggers the logic to break (open) the circuit. It makes a call to the Azure APIs and disables the function app so it stops triggering. No messages are lost - the queue messages stay safely in the Service Bus queue. Rather than creating cascading failures, I've gracefully broken the circuit until the health of the system can be confirmed, the circuit closed, and processing resumed.
With durable entities and Azure Functions we've very efficiently solved the problem we started with:
"Can we make it so our function stops triggering if lots of failures start happening?"
More than just functions
You can use Durable Entities to manage the state of any distributed app. Polly recently announced support for Durable Entity-powered circuit breakers that can manage the state of any application. I'd encourage you to check them out and give durable entities a spin!
