Azure Deployment Slots: how not to make deployment worse (2023)

According to Microsoft, deployment slots can be used to achieve zero-downtime deployments, prewarming, and easy fallbacks. Surprisingly, a straightforward implementation without proper testing can give you more downtime, longer deployments, and no prewarming at all.

You should read this article if you are using slots in Azure Functions without WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG set to 1.

Docs:

Traffic redirection is seamless; no requests are dropped because of a swap.

It is hard to get this sentence wrong, but talking to people who use deployment slots suggests that they either did not read the documentation at all or, for some reason, believe otherwise: in their experience, slots do not guarantee seamless deployment.

The most common feedback is: we still see HTTP 503 errors for a short period during deployment.

How is it possible that this feedback contradicts the very first feature in the documentation? Some people say that a short burst of 503 errors is fine: their applications can mitigate the burst by retrying, so it is not a big problem anyway. Some people say that Microsoft just does not tell the whole truth and that all of this is pure marketing.

I agree that Microsoft has a long history of writing obscure documentation with a strange set of limitations (most of which come from backward compatibility taken to extreme levels).

I also agree that there could be combinations of legacy software that just do not play well with deployment slots at all.

What I cannot agree with is that the simplest possible HTTP service built with .NET Core Azure Functions is among those incompatible combinations.

I am going to use .NET Core 3.1, Azure Functions ~3 on Windows, ARM templates, classic Azure Release Pipelines, and two S1 App Service Plan instances. I am not going to include the Azure Pipeline "code" here because it is pretty simple: check prod, deploy ARM, deploy code to staging, health-check staging, swap, health-check prod.

Function

using System;
using System.Net.Http;
using System.Text;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;

public static class Functions
{
    public static readonly string Id = Guid.NewGuid().ToString();

    [FunctionName("health")]
    public static HttpResponseMessage Health(
        [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = null)] HttpRequestMessage req,
        ILogger log)
    {
        return new HttpResponseMessage(System.Net.HttpStatusCode.OK)
        {
            Content = new StringContent(JsonConvert.SerializeObject(new
            {
                UtcNow = DateTime.UtcNow.ToString("o"),
                ReleaseId = Environment.GetEnvironmentVariable("ReleaseId"),
                StaticId = Id,
                HostName = Environment.GetEnvironmentVariable("WEBSITE_HOSTNAME")
            }), Encoding.UTF8, "application/json")
        };
    }
}

StaticId is used as an independent way to tell which instance is reporting. It can be "changed" by restarting the service, by starting a new app domain, or by using AssemblyLoadContext in .NET Core. Only the first matters here.

So, if it did not change — it is the same instance.

ReleaseId is an app-setting that changes with every deployment.

HostName shows that you cannot rely on WEBSITE_HOSTNAME with slots.

ARM Template

Staging Slot

{
  "apiVersion": "2016-08-01",
  "name": "[concat(variables('appServiceName'), '/', variables('stagingSlotName'))]",
  "type": "Microsoft.Web/sites/slots",
  "location": "[resourceGroup().location]",
  "kind": "functionapp",
  "properties": {
    "serverFarmId": ...
  },
  "dependsOn": [
    ...
  ],
  "resources": [
    {
      "condition": "[equals(parameters('deployType'), 'enable-production')]",
      "apiVersion": "2016-08-01",
      "name": "web",
      "type": "config",
      "location": "[resourceGroup().location]",
      "dependsOn": [
        "[variables('stagingSlotName')]"
      ],
      "properties": "[variables('siteProperties')]"
    },
    {
      "apiVersion": "2019-08-01",
      "name": "appsettings",
      "type": "config",
      "dependsOn": [
        "[variables('stagingSlotName')]"
      ],
      "properties": {
        "FUNCTIONS_EXTENSION_VERSION": "~3",
        "AzureWebJobsDashboard": "... listKeys stuff",
        "AzureWebJobsStorage": "... listKeys stuff",
        "ReleaseId": "[parameters('ReleaseId')]",
        "WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG": "1"
      }
    }
  ]
}

Production Slot / App

{
  "condition": "[equals(parameters('deployType'), 'enable-production')]",
  "apiVersion": "2016-08-01",
  "name": "[variables('appServiceName')]",
  "type": "Microsoft.Web/sites",
  "location": "[resourceGroup().location]",
  "kind": "[variables('kind')]",
  "properties": {
    "name": "[variables('appServiceName')]",
    "serverFarmId": ...
  },
  "dependsOn": [
    ...
  ],
  "resources": [
    {
      "condition": "[equals(parameters('deployType'), 'enable-production')]",
      "apiVersion": "2016-08-01",
      "name": "web",
      "type": "config",
      "location": "[resourceGroup().location]",
      "dependsOn": [
        "[variables('stagingSlotName')]"
      ],
      "properties": "[variables('siteProperties')]"
    },
    {
      "condition": "[equals(parameters('deployType'), 'enable-production')]",
      "apiVersion": "2019-08-01",
      "name": "appsettings",
      "type": "config",
      "dependsOn": [
        "[variables('stagingSlotName')]"
      ],
      "properties": {
        "FUNCTIONS_EXTENSION_VERSION": "~3",
        "AzureWebJobsDashboard": "... listKeys stuff",
        "AzureWebJobsStorage": "... listKeys stuff",
        "ReleaseId": "[parameters('ReleaseId')]",
        "WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG": "1"
      }
    }
  ]
}

The Production Slot (App) and Staging Slot deployments are almost the same. The only difference is that production has a "condition" that prevents it from being deployed. The deployType parameter is set to "enable-production" only if the deployment pipeline detects that the production instance does not exist or is unhealthy (calling the HTTP health check is good enough; using Azure PowerShell is another way).
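For illustration, a minimal version of that probe could look like the sketch below. This is not the actual pipeline task; the health-check URL and the "skip-production" value are placeholders, and the ARM template only cares whether the value equals "enable-production".

using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class ProductionProbe
{
    // Decides the deployType parameter passed to the ARM deployment.
    // "enable-production" lets the conditioned production resources deploy;
    // any other value (here "skip-production", a placeholder) skips them.
    public static async Task<string> GetDeployTypeAsync(string productionHealthUrl)
    {
        using var client = new HttpClient { Timeout = TimeSpan.FromSeconds(10) };
        try
        {
            using var response = await client.GetAsync(productionHealthUrl);
            if (response.IsSuccessStatusCode)
            {
                // Production exists and is healthy: do not touch it.
                return "skip-production";
            }
        }
        catch (Exception)
        {
            // Treat any failure (connection error, DNS failure, timeout) as "unhealthy".
        }

        // Missing or unhealthy production: allow the conditioned resources to deploy.
        return "enable-production";
    }
}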

It is important to skip ALL deployments to the production slot, because otherwise IIS will detect configuration changes and recycle the production instance while you are deploying to staging or doing a swap. That is the first way to get downtime with slots. It is not just code deployment that adds downtime; infrastructure deployment (especially app settings) does it as well.

My first idea was not to deploy anything to the production slot at all and just rely on the swap to populate everything when needed. Of course, that did not go well, mostly because the default Azure Functions version is ~1. Swapping between ~3 on staging and ~1 on production is an excellent way to have a busy evening dealing with misleading errors in the Azure Portal.

Magic

Real-world solutions sometimes require a bit of magic to work, especially when dealing with Microsoft's paranoid approach to backward compatibility.

WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG = 1

You can see this magic setting in the appsettings deployment for both production and staging slots. It is documented at the very end of the second documentation source: https://docs.microsoft.com/en-us/azure/app-service/deploy-staging-slots#troubleshoot-swaps.

After slot swaps, the app may experience unexpected restarts. This is because after a swap, the hostname binding configuration goes out of sync, which by itself doesn’t cause restarts. However, certain underlying storage events (such as storage volume failovers) may detect these discrepancies and force all worker processes to restart. To minimize these types of restarts, set the WEBSITE_ADD_SITENAME_BINDINGS_IN_APPHOST_CONFIG=1 app setting on all slots. However, this app setting does not work with Windows Communication Foundation (WCF) apps.

In fact, "the app may experience unexpected restarts" should be read as "the app will experience unexpected restarts". Every test deployment I made without the setting had downtime: a short burst of 503 errors. That is the second way to get downtime when using slots.

It surprises me (not really) that the most important piece of information for preventing downtime, when using the very feature that is supposed to remove downtime, is not enabled by default and is described at the end of a huge documentation page. If anybody can explain this, please let me know.

I made a simple C# console application that calls each slot's health endpoint independently, with a 100 ms delay between calls, over a persistent HTTP connection (one per slot); a simplified sketch follows after the list below. Removing the delay only changes the numbers.

The production slot output has four sections:

  • HTTP: groups HTTP responses by status code into 30-second intervals
  • ReleaseId: shows when the ReleaseId app setting changes
  • StaticId: shows when the StaticId (the C# static readonly field) changes (again, this can happen after a restart, in a new app domain, or in a new AssemblyLoadContext; only the first matters for us)
  • HostName: shows when the WEBSITE_HOSTNAME environment variable changes

The staging slot output has only the StaticId section, for simplicity.
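A simplified sketch of that monitor for a single slot is shown below; the real console also counts HTTP status codes in 30-second buckets and tracks ReleaseId and HostName, which is omitted here, and the endpoint URL is whatever the slot's health route resolves to.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json.Linq;

public static class SlotMonitor
{
    public static async Task RunAsync(string healthUrl)
    {
        // One long-lived HttpClient per slot keeps the HTTP connection persistent.
        using var client = new HttpClient();
        string lastStaticId = null;

        while (true)
        {
            try
            {
                var json = JObject.Parse(await client.GetStringAsync(healthUrl));
                var staticId = (string)json["StaticId"];
                if (staticId != lastStaticId)
                {
                    Console.WriteLine($"{DateTime.UtcNow:o} StaticId changed: {lastStaticId ?? "<none>"} -> {staticId}");
                    lastStaticId = staticId;
                }
            }
            catch (HttpRequestException ex)
            {
                // The short bursts of 503 responses during a bad swap surface here.
                Console.WriteLine($"{DateTime.UtcNow:o} request failed: {ex.Message}");
            }

            await Task.Delay(100); // the 100 ms delay between calls
        }
    }
}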

Case #1

  • I remove all conditions from the ARM template
  • I do not use the setting
(Screenshot: console output for Case #1.)

There are two downtimes here: the first comes with the app-settings deployment and the second comes from the after-swap IIS recycle.

The two instances marked with red lines (4cf2 and 4eb5) came from the staging slot (after the swap had happened).

The two instances marked with yellow lines (0779 and aa4e) came out of nowhere: a pure restart on production, meaning downtime and lost warmup.

Case #2

  • I use conditions: nothing is deployed to the production slot
  • I do not use the setting
(Screenshot: console output for Case #2.)

There is only one downtime here: the after-swap recycle.

The ReleaseId changed at 11:03:03, but the recycle happened about 40 seconds later, at 11:03:45. It then took Azure 20 seconds to bootstrap the first instance and 30 seconds to bootstrap the second one.

The warmup is lost.

Case #3

  • I use both the conditions and the setting
(Screenshot: console output for Case #3.)

There are no downtimes, but there is a bit of slowness: the console made roughly 230 calls in the 30 seconds before the swap and roughly 130 calls in the 30 seconds during the swap (still room for improvement!).

There are no phantom instances on the production slot anymore, and the old production instances are properly moved to the staging slot. They are replaced by two new ones a bit later (probably because Azure applies the sticky staging-slot settings to them).
