Azure revisited

You see, I think of myself as someone who is able to admit when he has made a mistake. As such, I am not afraid to say that time has proven my initial, optimistic evaluation of Azure to be more than wrong.

In my first post about this topic I toured the online portal of Azure, in which I configured a few basic infrastructure building blocks and compared them with their AWS equivalents. But this approach has a glaring issue: this is not how Azure would be used in production.

In fact, in the best-case scenario, I would never have visited the Azure portal again, because I would have been able to work exclusively via an infrastructure-as-code language. This is how I use AWS at least and this is, in my opinion, the only way a cloud infrastructure provider can be used in a professional setting. Without it, one can neither share nor reproduce a cloud-based application.

IaC Options with Azure

After a few test runs of our first Azure-based application in an environment that was created by hand, we wanted to prepare the software for production use. The first thing we wanted to do was to consolidate the pieces of infrastructure I had manually created into an IaC document. To do so, we first had to decide which IaC language to use. At the time of our evaluation we had three options:

ARM templates: A terribly named, verbose, JSON flavour
Bicep templates: A DSL which transpiles into ARM templates
Terraform: A meta IaC language by HashiCorp

ARM templates (example) were already kinda deprecated when we started, as we were greeted with this info-box on the ARM start page:

A screenshot of the Azure ARM deprecation warning

This left us with two candidates: Terraform and Bicep. After some discussion we settled on Bicep for two reasons. Firstly, we felt that we didn't need the additional abstraction layer that Terraform provided, and that we would probably have to break out into ARM or Bicep anyway if an edge case we needed to address could not be handled with Terraform. Secondly, we would not be able to benefit from the advertised upsides of Terraform, like its multi-cloud capability, as we'd be stuck on Azure. We had also had many positive experiences with AWS CloudFormation/SAM, which ultimately led us to believe that Microsoft's solution could not be that much worse, could it?

Documentation

To illustrate my points about the state of IaC at Azure we'll take a closer look at a simple template from the official collection of quickstart templates by Microsoft. To be precise, we'll create a plain function app (which is the official equivalent of an AWS Lambda function).

As you can see, our cloud primitive "serverless code runner" actually consists of three resources: A Microsoft.Storage/storageAccounts, which provides a filesystem to host our function code, a Microsoft.Web/serverfarms, which dictates how our code is run: serverless or on a provisioned machine, how many nines we can expect when it comes to the uptime and so on, and finally a Microsoft.Web/sites which contains the settings for our code runner. The example also adds a Microsoft.Insights/components resource to add additional logging, but out of goodwill I'll ignore this one as it is not strictly necessary.
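Trimmed down to the essentials, the structure of such a template looks roughly like the sketch below. This is not a verbatim copy of the quickstart: the API versions of the storage account and the site, the parameter names, the storage SKU and the `kind` values are assumptions on my part, and the additional app settings the quickstart sets are left out.

```bicep
// Sketch only: API versions, names and the storage SKU are assumptions,
// not copied verbatim from the quickstart template.
param location string = resourceGroup().location
param appName string = 'fn${uniqueString(resourceGroup().id)}'

// Storage account names must be 3-24 lowercase alphanumeric characters.
var storageAccountName = 'st${uniqueString(resourceGroup().id)}'

// 1. Storage account: hosts the function code
resource storageAccount 'Microsoft.Storage/storageAccounts@2021-04-01' = {
  name: storageAccountName
  location: location
  kind: 'StorageV2'
  sku: {
    name: 'Standard_LRS'
  }
}

// 2. Hosting plan: decides how the code is run (here: the serverless tier)
resource hostingPlan 'Microsoft.Web/serverfarms@2021-03-01' = {
  name: appName
  location: location
  sku: {
    name: 'Y1'
    tier: 'Dynamic'
  }
}

// 3. The function app itself, wired to the plan and the storage account
resource functionApp 'Microsoft.Web/sites@2021-03-01' = {
  name: appName
  location: location
  kind: 'functionapp'
  properties: {
    serverFarmId: hostingPlan.id
    siteConfig: {
      appSettings: [
        {
          name: 'AzureWebJobsStorage'
          value: 'DefaultEndpointsProtocol=https;AccountName=${storageAccountName};EndpointSuffix=${environment().suffixes.storage};AccountKey=${storageAccount.listKeys().keys[0].value}'
        }
        // further settings omitted
      ]
    }
  }
}
```

The point to take away is the dependency chain: the site references the plan via `serverFarmId` and the storage account via a connection string in its app settings.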

Now, let's say you have provisioned your first function app by copying this template, and you'd now like to modify it. Maybe you are like me and would like to know what the other possible serverfarm tiers are, other than the Dynamic tier which is used in hostingPlan.sku.tier. So you head on over to the reference page for the serverfarm resource and check the structure of a SkuDescription:

A screenshot of the SkuDescription of a Microsoft.Web/serverfarms resource

We have now learned that tier must be a string. That's it. As you have probably already guessed, it cannot be ANY string, as only a few from the infinite pool of possible strings are allowed. It would be cool if they were listed somewhere on this page.
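One way to at least document such values inside your own templates is Bicep's `@allowed` decorator. The following is a minimal sketch of that idea, not an official list: only 'Dynamic' comes from the quickstart template, the other entries are assumptions that should be verified against a real deployment before you rely on them, and `planTier` is a hypothetical parameter name.

```bicep
// Assumption: this is NOT an official or complete list of tiers.
// Only 'Dynamic' is taken from the quickstart; verify the rest yourself.
@description('Tier of the App Service plan (self-maintained, non-exhaustive list).')
@allowed([
  'Dynamic' // serverless tier, as used in the quickstart
  'ElasticPremium'
  'Basic'
  'Standard'
])
param planTier string = 'Dynamic'

// planTier can then be used for sku.tier of the Microsoft.Web/serverfarms
// resource; the matching sku.name still has to be chosen by hand.
```

This does not fix the reference page, but a deployment with an unlisted value is then rejected during template validation instead of failing somewhere inside Azure.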

If you check the rest of the attributes you'll see that their descriptions are equally useless. Just to name a few additional examples: The description of AppServicePlanProperties.spotExpirationTime reads "The time when the server farm expires. Valid only if it is a spot server farm." and it must be a string. So we'll probably have to format this time value somehow when we need to specify it, but how? Can we simply drop in an epoch timestamp? Does it take an ISO-8601 string? Can we use timezones or does it have to be in UTC?

There is also the AppServicePlanProperties.targetWorkerCount, which dictates the "Scaling worker count." and must be of type int. Can I specify any integer value, without any restrictions? Are 1073741824 workers possible? Is the worker count dependent on the function tier? How does it relate to the maximumElasticWorkerCount? What is the default value?

This level of documentation is the norm with all Azure resources. Just to make sure that this is no fluke, let us head over to the documentation of the Microsoft.Web/sites resource, which, if you remember, is the equivalent of a Lambda function and should thus be one of the most used resource types.

I am specifically interested in the siteConfig.appSettings options, which are really obscure even in the example quickstart template:

appSettings: [
    {
      name: 'AzureWebJobsStorage'
      value: 'DefaultEndpointsProtocol=https;AccountName=${storageAccountName};EndpointSuffix=${environment().suffixes.storage};AccountKey=${storageAccount.listKeys().keys[0].value}'
    }
    {
      name: 'WEBSITE_CONTENTAZUREFILECONNECTIONSTRING'
      value: 'DefaultEndpointsProtocol=https;AccountName=${storageAccountName};EndpointSuffix=${environment().suffixes.storage};AccountKey=${storageAccount.listKeys().keys[0].value}'
    }
//...

How does this WEBSITE_CONTENTAZUREFILECONNECTIONSTRING work? What are the other possible appSettings? Which are required and what are the default values for those that are not? Let's check the documentation!

Deployment

Now, Azure wouldn't be the first, nor the last tool where the documentation is unusable. There is always the try-and-fail method. Let's just throw together a simple deployment, run it, and see what the errors are.

There are two basic problems with this approach when working with Azure. To begin with, most deployments take ages. You want to provision a new serverless MSSQL database? Come back in 40 minutes. You need an API Management instance? That'll also set you back like 40 minutes. But don't worry, we'll send you an email when it's ready.

Now, let's assume the deployment did not work and after around half an hour you receive an error message. After all this time you've spent waiting, it surely is detailed and allows you to fix the problems with your template, right? Well, I've compiled some examples for you:

Degradation

For the last point I'd like to make I can only offer anecdotal evidence, as this incident only happened once, but this was enough to get me worried.

In a resource stack we had deployed on our production system we had a resource which looked like this:

resource appservicePlan 'Microsoft.Web/serverfarms@2021-03-01' = {
  name: appservice_plan_name
  location: location
  tags: tag_values
  sku: {
    name: 'Y1'
    tier: 'Dynamic'
  }
}

This stack was deployed and updated countless times between our production and staging/test environments and we never encountered a problem with it. During a lull in development (it was the holiday season) changes to the system slowed down, and we did not run a deployment for a few weeks.

When work resumed in earnest, the resource group could not be deployed anymore. Azure helpfully reported "Object reference not set to an instance of an object." and that was it. We had to try and fail our way through the deployment, removing resources until it worked, until we determined that the resource shown above was the culprit. This confused us to no small degree, as this resource had been in the stack for months and had never caused any issues.

In the end we found out that we had to add the attribute kind: 'linux' to the appservicePlan. This got the deployment working again, but whether this really was the root cause of the issue, or whether it can occur again some other time, we don't know. It does not really look like the attribute is required, as there are official examples out there (like the one I linked above) which seem to work without it.
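For reference, this is the resource from above with the attribute added. The parameter declarations at the top are assumptions I've added only to make the snippet self-contained; in the real stack these names were defined elsewhere.

```bicep
// Assumed parameter declarations, added to make the snippet self-contained.
param appservice_plan_name string
param location string = resourceGroup().location
param tag_values object = {}

resource appservicePlan 'Microsoft.Web/serverfarms@2021-03-01' = {
  name: appservice_plan_name
  location: location
  tags: tag_values
  kind: 'linux' // the only change compared to the failing version
  sku: {
    name: 'Y1'
    tier: 'Dynamic'
  }
}
```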

Conclusion

After you've read about all the trouble we had with Azure and Bicep/ARM, you probably understand why I had to come back to the topic and correct my initial conclusions about the platform.

To be clear: After all these experiences I'll probably try to avoid Azure for some time. In my opinion, a cloud platform without a reliable, well documented IaC language really is not usable for any production use-case.

You may now argue that we should have used Terraform or something else from the beginning and that the platform Azure cannot be judged by the performance of its IaC language, but I really cannot let that one count. When the official, supported and advertised way of doing infrastructure-as-code on a platform does not work, then they should not advertise it, and it becomes a flaw of the platform. We can argue about runtime and maybe even documentation, but if you advertise a way of doing things it should AT LEAST work reliably.
