IPAM Invoker Add failed with error: Failed to get IP address from CNS with error: %w: AllocateIPConfig failed: no IPs available, waiting on Azure CNS to allocate more

Edit 1: still working with MS. I can reproduce the issue following their guide https://learn.microsoft.com/en-us/azure/aks/configure-azure-cni-dynamic-ip-allocation

Edit 2: Microsoft confirmed my suspicions:

We identified a regression with a version of DNC that was pushed out.
We are currently rolling back now, and you should see your CNS pods functionality returning shortly if you have not seen it already.

I finished rolling out a new AKS cluster this past Friday and picked it back up this morning. I added a new node pool and started getting errors on 4 specific pods:

I ran a kubectl describe pod to see what was going on, and it displayed the IPAM error above:

I did change my outbound NAT IP and thought it had somehow screwed something up. I wiped the cluster, reverted the NAT change, and redeployed, but no dice. I am going bonkers trying to figure out what happened. I reviewed both the node and pod subnets to ensure I had enough capacity, which I did. I then went to another subscription and tried creating the same AKS cluster from my IaC, but the same thing happened.

Another user pinged me that their cluster was having issues, but it resolved itself. I looked at their logs and it seems Microsoft must have done maintenance on the backend, which moved some pods around. All their kube-system pods came back up healthy. So, what is different between their cluster and mine? I am using dynamic IP allocation and the other user is not. Alright, so let me try a cluster without dynamic IP allocation. Guess what, it worked. This was all working fine last week, so there is something going on in Azure Government and I am waiting to hear back from Microsoft Support.

AKS Double Encryption

I have been living in a world of compliance these past few weeks, specifically NIST 800-171. Azure provides a Policy initiative for NIST, and one of the checks is to make sure your disks are encrypted with both a platform-managed and a customer-managed key. I recently ran into a scenario where the application is a StatefulSet in Azure Kubernetes Service. Let’s talk a bit more about this and NIST.

The Azure Policy flagged my disks as non-compliant because they were only encrypted with a platform-managed key. Researching the AKS docs, I found an article on using a customer-managed key, but that still is not what I need, since double encryption is required to meet compliance. After some digging in the Kubernetes SIGs repo, I found the Azure Disk CSI driver doc, and check it out:

It looks like this document was modified back in May to add support for this feature, so it is fairly new. Upgrade the driver to 1.18 or above and double encryption support should be there.

To implement it, create a new storage class that references your disk encryption set ID and sets the double encryption type.

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: byok-double-encryption
provisioner: disk.csi.azure.com
reclaimPolicy: Retain
allowVolumeExpansion: true
parameters:
  skuName: Premium_LRS
  kind: managed
  # Double encryption: platform-managed key plus the customer-managed key
  # from the disk encryption set referenced below
  diskEncryptionType: EncryptionAtRestWithPlatformAndCustomerKeys
  diskEncryptionSetID: "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/diskEncryptionSets/<dek-name>"

Apply the snippet above and reference the storage class from your workload’s persistent volume claims to get double encryption. This will tick that NIST compliance checkbox for AKS disks.
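Since my scenario was a StatefulSet, here is a minimal sketch of what referencing that storage class could look like through volumeClaimTemplates. The workload name, image, mount path, and sizes are placeholders I made up for illustration; only the storageClassName ties back to the class defined above.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-app                # hypothetical workload name
spec:
  serviceName: my-app
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: mcr.microsoft.com/azuredocs/aks-helloworld:v1   # placeholder image
        volumeMounts:
        - name: data
          mountPath: /data
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      # Disks provisioned from this claim use the double-encryption storage class above
      storageClassName: byok-double-encryption
      resources:
        requests:
          storage: 10Gi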

Azure Kubernetes Service and Network Security Groups

One of the most common mistakes I see is people modifying the NSG rules for AKS manually instead of letting AKS manage them. AKS is a managed service, so it will manage the rules. If the NSG rules are manually modified, AKS may reset them, which could leave your service in a broken state or exposed to threats.

If you look at the annotations for services of type LoadBalancer (https://docs.microsoft.com/en-us/azure/aks/load-balancer-standard#additional-customizations-via-kubernetes-annotations), you can see an annotation called service.beta.kubernetes.io/azure-allowed-service-tags. Typically, we would have some kind of WAF sitting in front, such as Azure Front Door. We can set the service tag AzureFrontDoor.Backend, which lets AKS manage an inbound rule that only allows Azure Front Door’s IPs to communicate with this public IP.

Let’s walk through a quick example by deploying this YAML, which has the service type set to LoadBalancer and will provision a public IP for us.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aks-helloworld-one  
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aks-helloworld-one
  template:
    metadata:
      labels:
        app: aks-helloworld-one
    spec:
      containers:
      - name: aks-helloworld-one
        image: mcr.microsoft.com/azuredocs/aks-helloworld:v1
        ports:
        - containerPort: 80
        env:
        - name: TITLE
          value: "Welcome to Azure Kubernetes Service (AKS)"
---
apiVersion: v1
kind: Service
metadata:
  name: aks-helloworld-one  
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: aks-helloworld-one

Let’s do a kubectl apply and view the svc.

You can see a public IP has been associated with the svc. Let’s take a look at the inbound NSG. The public IP is open to the internet, but I want this svc to be protected by my WAF on Azure Front Door.

To apply the tag correctly, let’s modify the YAML to set the annotation. Below, I am setting the tag AzureFrontDoor.Backend, which AKS will ensure is always present and managed automatically.
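As a sketch, the updated Service manifest would look roughly like this. The annotation name and the AzureFrontDoor.Backend service tag come from the AKS doc linked above; the rest simply mirrors the earlier example.

apiVersion: v1
kind: Service
metadata:
  name: aks-helloworld-one
  annotations:
    # AKS manages the inbound NSG rule and restricts it to the Azure Front Door backend service tag
    service.beta.kubernetes.io/azure-allowed-service-tags: AzureFrontDoor.Backend
spec:
  type: LoadBalancer
  ports:
  - port: 80
  selector:
    app: aks-helloworld-one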

Save the YAML and apply it to update the service.

Viewing the inbound NSG for AKS, we can see it automatically updated the service tag.

Remember, AKS is a managed service. Let it manage the NSGs for you!