Azure revisited
Sebastian Staffa
Published on Nov 21, 2022, 1:13 PM
You see, I think of myself as someone who is able to admit when he has made a mistake. As such, I am not afraid to say that time has proven my initial, optimistic evaluation of Azure to be more than wrong.
In my first post about this topic I toured the online portal of Azure, in which I configured a few basic infrastructure building blocks and compared them with their AWS equivalents. But this approach has a glaring issue: this is not how Azure would be used in production.
In fact, in the best-case scenario, I would never have visited the Azure portal again, because I would have been able to work exclusively via an infrastructure-as-code language. This is how I use AWS at least, and this is, in my opinion, the only way a cloud infrastructure provider can be used in a professional setting. Without it, one can neither share nor reproduce a cloud-based application.
IaC Options with Azure
After a few test runs of our first Azure-based application in an environment that was created by hand, we wanted to prepare the software for production use. The first step was to consolidate the pieces of infrastructure I had manually created into an IaC document. To do so, we first had to decide which IaC language to use. At the time of our evaluation we had three options:
- ARM templates: A terribly named, verbose JSON flavour
- Bicep templates: A DSL which transpiles into ARM templates
- Terraform: A meta IaC language by HashiCorp
ARM templates (example) were already kinda deprecated when we started, as we were greeted with this info-box on the ARM start page:
This left us with two candidates: Terraform and Bicep. After some discussion we settled on Bicep for two reasons. Firstly, we felt that we didn't need the additional abstraction layer that Terraform provided, and that we would probably have to break out into ARM or Bicep anyway if a certain edge case we needed to address could not be handled with Terraform. Secondly, we would not be able to benefit from the advertised upsides of Terraform, like its multi-cloud capability, as we'd be stuck on Azure. We had also had many positive experiences with AWS CloudFormation/SAM, which ultimately led us to believe that Microsoft's solution could not be that much worse, could it?
Documentation
To illustrate my points about the state of IaC at Azure we'll take a closer look at a simple template from the official collection of quickstart templates by Microsoft. To be precise, we'll create a plain function app (which is the official equivalent of an AWS Lambda function).
As you can see, our cloud primitive "serverless code runner" actually consists of three resources: a Microsoft.Storage/storageAccounts, which provides a filesystem to host our function code; a Microsoft.Web/serverfarms, which dictates how our code is run (serverless or on a provisioned machine, how many nines we can expect when it comes to uptime, and so on); and finally a Microsoft.Web/sites, which contains the settings for our code runner. The example also adds a Microsoft.Insights/components resource to add additional logging, but out of goodwill I'll ignore this one as it is not strictly necessary.
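The three resources described above could be sketched in Bicep roughly as follows. This is a trimmed illustration rather than a tested template: the parameter names, the generated resource names, the storage SKU and the storage and sites API versions are assumptions, and the quickstart template sets more app settings than shown here.

```bicep
// Minimal sketch of the three resources behind a single function app.
// Names, SKUs and API versions are assumptions made for illustration.
param location string = resourceGroup().location
param appName string = 'fnapp${uniqueString(resourceGroup().id)}'

// 1. Storage account: hosts the function code.
resource storageAccount 'Microsoft.Storage/storageAccounts@2021-08-01' = {
  name: 'st${uniqueString(resourceGroup().id)}'
  location: location
  sku: {
    name: 'Standard_LRS'
  }
  kind: 'Storage'
}

// 2. Hosting plan: decides how the code is run (here: the Dynamic tier).
resource hostingPlan 'Microsoft.Web/serverfarms@2021-03-01' = {
  name: appName
  location: location
  sku: {
    name: 'Y1'
    tier: 'Dynamic'
  }
}

// 3. The function app itself, bound to the plan and the storage account.
resource functionApp 'Microsoft.Web/sites@2021-03-01' = {
  name: appName
  location: location
  kind: 'functionapp'
  properties: {
    serverFarmId: hostingPlan.id
    siteConfig: {
      appSettings: [
        {
          name: 'AzureWebJobsStorage'
          value: 'DefaultEndpointsProtocol=https;AccountName=${storageAccount.name};EndpointSuffix=${environment().suffixes.storage};AccountKey=${storageAccount.listKeys().keys[0].value}'
        }
        // ...further settings omitted
      ]
    }
  }
}
```

Even in this reduced form, the function app only works because of the references from Microsoft.Web/sites to the other two resources (serverFarmId and the storage connection string).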
Now, let's say you have provisioned your first function app by copying this template, and you'd now like to modify it. Maybe you are like me and would like to know what the other possible serverfarm tiers are, other than the Dynamic tier which is used in hostingPlan.sku.tier. So you head on over to the reference page for the serverfarm resource and check the structure of a SkuDescription:
We have now learned that tier must be a string. That's it. As you have probably already guessed, it cannot be ANY string, as only a few from the infinite pool of possible strings are allowed. It would be cool if they were listed somewhere on this page.
If you check the rest of the attributes you'll see that their descriptions are equally useless. Just to name a few additional examples: The description of AppServicePlanProperties.spotExpirationTime reads "The time when the server farm expires. Valid only if it is a spot server farm." and it must be a string. So we'll probably have to format this time value somehow when we need to specify it, but how? Can we simply drop in an epoch timestamp? Does it take an ISO-8601 string? Can we use timezones or does it have to be in UTC?
There is also the AppServicePlanProperties.targetWorkerCount, which dictates the "Scaling worker count." and must be of type int. Can I specify any integer value, without any restrictions? Are 1073741824 workers possible? Is the worker count dependent on the function tier? How does it relate to the maximumElasticWorkerCount? What is the default value?
This level of documentation is the norm for all Azure resources. Just to make sure that this is no fluke, let us head over to the documentation of the Microsoft.Web/sites resource, which, if you remember, is the equivalent of a Lambda function and should thus be one of the most used resource types.
I am specifically interested in the siteConfig.appSettings options, which are really obscure even in the example quickstart template:
appSettings: [
  {
    name: 'AzureWebJobsStorage'
    value: 'DefaultEndpointsProtocol=https;AccountName=${storageAccountName};EndpointSuffix=${environment().suffixes.storage};AccountKey=${storageAccount.listKeys().keys[0].value}'
  }
  {
    name: 'WEBSITE_CONTENTAZUREFILECONNECTIONSTRING'
    value: 'DefaultEndpointsProtocol=https;AccountName=${storageAccountName};EndpointSuffix=${environment().suffixes.storage};AccountKey=${storageAccount.listKeys().keys[0].value}'
  }
  //...
How does this WEBSITE_CONTENTAZUREFILECONNECTIONSTRING work? What are the other possible appSettings? Which are required and what are the default values for those that are not? Let's check the documentation!
Deployment
Now, Azure wouldn't be the first, nor the last, tool where the documentation is unusable. There is always the trial-and-error method. Let's just throw together a simple deployment, run it, and see what the errors are.
There are two basic problems with this approach when working with Azure. To begin with, most deployments take ages. You want to provision a new serverless mssql database? Come back in 40 minutes. You need an API Management instance? That'll also set you back around 40 minutes. But don't worry, we'll send you an email when it's ready.
Now, let's assume the deployment did not work and after around half an hour you receive an error message. After all this time you've spent waiting, it surely is detailed and allows you to fix the problems with your template, right? Well, I've compiled some examples for you:
Degradation
For the last point I'd like to make I can only offer anecdotal evidence, as this incident only happened once, but this was enough to get me worried.
In a resource stack we had deployed on our production system, we had a resource which looked like this:
resource appservicePlan 'Microsoft.Web/serverfarms@2021-03-01' = {
  name: appservice_plan_name
  location: location
  tags: tag_values
  sku: {
    name: 'Y1'
    tier: 'Dynamic'
  }
}
This stack was deployed and updated countless times across our production and staging/test environments, and we never encountered a problem with it. During a lull in development (it was the holiday season), changes to the system slowed down, and we did not run a deployment for a few weeks.
When work resumed in earnest, the resource group could not be deployed anymore. Azure helpfully reported "Object reference not set to an instance of an object." and that was it. We had to try and fail our way through the deployment, removing resources until it worked, until we determined that the resource shown above was the culprit. This confused us to no small degree, as this resource had been in the stack for months and had never caused any issues.
In the end we found out that we had to add the attribute kind: 'linux' to the appservicePlan. This got the deployment working again, but whether this really was the root cause of the issue, or whether it can occur again some other time, we don't know. It does not really look like the attribute is required, as there are official examples (like the one I linked above) which seem to work without it.
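For reference, the plan definition with the added attribute would look roughly like this. The parameter declarations are assumptions added only to make the snippet self-contained; in the original stack these values were defined elsewhere.

```bicep
// Sketch of the workaround: the same plan as before, with `kind` added.
// The three param declarations are assumptions for illustration only.
param appservice_plan_name string
param location string = resourceGroup().location
param tag_values object = {}

resource appservicePlan 'Microsoft.Web/serverfarms@2021-03-01' = {
  name: appservice_plan_name
  location: location
  tags: tag_values
  kind: 'linux' // the one change that got the deployment working again
  sku: {
    name: 'Y1'
    tier: 'Dynamic'
  }
}
```

Nothing else in the resource changed, which is exactly why the failure was so hard to trace back to this definition.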
Conclusion
After you've read about all the trouble we had with Azure and Bicep/ARM, you probably understand why I had to come back to the topic and correct my initial conclusions about the platform.
To be clear: After all these experiences I'll probably try to avoid Azure for some time. In my opinion, a cloud platform without a reliable, well documented IaC language really is not usable for any production use-case.
You may now argue that we should have used Terraform or something else from the beginning and that the platform Azure cannot be judged by the performance of its IaC language, but I really cannot let that one count. When the official, supported and advertised way of doing infrastructure-as-code on a platform does not work, then they should not advertise it, and it becomes a flaw of the platform. We can argue about runtime and maybe even documentation, but if you advertise a way of doing things it should AT LEAST work reliably.